Statistical Methods III: Week-2

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%


🧠 Variance Inequality: Toward a General CRLB

You’re discussing a variance inequality in estimation theory — a generalized form of the Cramér–Rao bound.

📦 1. Setup:

Key assumptions:

The second condition is a regularity condition — it lets you differentiate under the integral sign, essential for proving CR-type results.

🔎 2. The Inequality Itself:

Then, the variance of \( W(\mathbf{X}) \) satisfies the inequality:

\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{ \left( \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] \right)^2 }{ \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right] } \]

This is a generalized Cramér–Rao Lower Bound.

✅ Any statistic \( W(\mathbf{X}) \) that satisfies these conditions, biased or unbiased, has variance at least as large as this bound.

ℹ️ Fisher Information

The denominator is the Fisher Information:

\[ \mathcal{I}(\theta) := \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right] \]

It quantifies how much information the sample \( \mathbf{X} \) provides about \( \theta \).

✂️ 3. Simplification When Unbiased

If \( W(\mathbf{X}) \) is unbiased, i.e.

\[ \mathbb{E}_\theta[W(\mathbf{X})] = \theta \quad \text{for all } \theta, \]

then:

\[ \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] = 1 \]

This simplifies the variance bound to:

\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{1}{\mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right]} \]

✅ This is the standard Cramér–Rao Lower Bound for unbiased estimators.

🔢 4. Concrete Example

Suppose:

Then clearly \( \mathbb{E}(W(\mathbf{X})) = \theta \), so this is an unbiased estimator.

✅ The simplified CRLB applies.

📌 Summary So Far:

🔍 Simplification (2): Cramér–Rao Bound for IID Case

🧩 1. Factorization of Likelihood for IID Samples

If \( X_1, \dots, X_n \) are i.i.d., then:

\[ f(\mathbf{x}|\theta) = \prod_{i=1}^{n} f(x_i|\theta) \quad \Rightarrow \quad \log f(\mathbf{x}|\theta) = \sum_{i=1}^{n} \log f(x_i|\theta) \]

This simplifies the log-likelihood and makes computing the score function much easier.

📘 2. Example: \( X_i \sim \mathcal{N}(\mu, 1) \)

The density is:

\[ f(x_i|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2}\right) \]

Then the log-likelihood becomes:

\[ \log f(x_i|\mu) = -\frac{1}{2} \log(2\pi) - \frac{1}{2}(x_i - \mu)^2 \]

Differentiating with respect to \( \mu \):

\[ \frac{d}{d\mu} \log f(x_i|\mu) = x_i - \mu \]

So, the score function for the full sample is:

\[ \frac{d}{d\mu} \log f(\mathbf{x}|\mu) = \sum_{i=1}^n (x_i - \mu) \]

📐 3. Fisher Information

To compute the Fisher Information \( \mathcal{I}(\mu) \), we evaluate:

\[ \mathbb{E}_\mu \left[ \left( \sum_{i=1}^n (X_i - \mu) \right)^2 \right] \]

Because the \( X_i \) are independent and each has zero mean after centering:

\[ = \sum_{i=1}^n \mathbb{E}_\mu[(X_i - \mu)^2] \]

Each term \( \mathbb{E}_\mu[(X_i - \mu)^2] = 1 \), so total Fisher Information is:

\[ \mathcal{I}(\mu) = n \]

✅ Conclusion: CRLB Achieved by Sample Mean

✅ The sample mean achieves the bound, hence it is efficient.
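A quick Monte Carlo check can make this concrete. The sketch below assumes the \( \mathcal{N}(\mu, 1) \) model above; the values of `mu`, `n`, `reps` and the seed are arbitrary illustrative choices, not part of the lecture.

```python
import random
import statistics

def sample_mean_variance(mu, n, reps, seed=0):
    """Monte Carlo estimate of Var(X-bar) for n i.i.d. N(mu, 1) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, 1.0) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.pvariance(means)

n = 10
crlb = 1.0 / n   # 1 / I(mu), with I(mu) = n for the N(mu, 1) model
estimate = sample_mean_variance(mu=2.0, n=n, reps=20000)
print(f"CRLB = {crlb:.4f}, simulated Var(X-bar) = {estimate:.4f}")
```

The simulated variance should sit close to \( 1/n \), up to Monte Carlo noise, which is what attaining the bound means here.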

✅ Putting It All Together

You've now made two key simplifications:

Simplification (1)

If \(\mathbb{E}_\mu[W(\mathbf{X})] = \mu\), then:

\[ \operatorname{Var}_\mu(W(\mathbf{X})) \geq \frac{1}{\mathbb{E}_\mu\left[ \left( \frac{d}{d\mu} \log f(\mathbf{X}|\mu) \right)^2 \right]} = \frac{1}{\text{Fisher Information}} \]

Simplification (2)

For the i.i.d. normal case:

❓ Immediate Question: Does the sample mean attain this lower bound?

Let’s answer it clearly.

✅ So the sample mean:

✅ Final Answer

Yes, the sample mean attains the Cramér–Rao lower bound in this case. Hence, it is an efficient estimator of \(\mu\) for the normal model with known variance.

🔁 Cramér–Rao Inequality in the Discrete Case

You're now pointing toward the discrete version of the Cramér–Rao bound, where we deal with PMFs instead of PDFs.

Let’s go through this step-by-step.

📦 Now in the discrete case:

Instead of a pdf \( f(x|\theta) \), you have a pmf \( p(x|\theta) \), where \( x \in \mathcal{X} \), a countable set.

🧠 Key Result:

Suppose:

Then, the Cramér–Rao-type inequality holds:

\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{\left( \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] \right)^2}{ \mathbb{E}_\theta\left[ \left( \frac{d}{d\theta} \log p(\mathbf{X}|\theta) \right)^2 \right]} \]

This is structurally identical to the continuous case — but all integrals are now sums over discrete outcomes.

✅ Example: Bernoulli Case

Let’s make this real.

Then CRLB is:

\[ \operatorname{Var}(\hat{p}) \geq \frac{1}{\mathcal{I}(p)} = \frac{p(1 - p)}{n} \]

✅ The sample mean attains this → it is efficient.
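The bound can be checked exactly for a small case. This sketch assumes \( X_1, \dots, X_n \) are i.i.d. Bernoulli(\(p\)) and that \( \hat{p} \) is the sample proportion; the function name and the values of `p` and `n` are illustrative assumptions.

```python
def bernoulli_fisher_information(p):
    """Per-observation Fisher information: sum over x in {0, 1} of score(x)^2 * pmf(x)."""
    info = 0.0
    for x in (0, 1):
        pmf = p if x == 1 else 1.0 - p
        score = x / p - (1 - x) / (1.0 - p)   # d/dp log p(x | p)
        info += score ** 2 * pmf
    return info

p, n = 0.3, 25
info_n = n * bernoulli_fisher_information(p)   # information adds over i.i.d. draws
crlb = 1.0 / info_n
var_sample_mean = p * (1.0 - p) / n            # exact variance of the sample proportion
print(crlb, var_sample_mean)
```

Because the expectation is a finite sum, the two numbers agree up to floating-point rounding, with no simulation needed.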

📌 Summary

📘 Fisher Information Identity

We now state a very important identity in statistical inference — the Fisher Information identity, which links the variance of the score with the expected second derivative of the log-likelihood.

Let’s unpack it clearly.

📘 Statement of the Result:

If the regularity condition holds:

\[ \frac{d}{d\theta}\mathbb{E}_{\theta}\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = \int \left[\frac{\partial}{\partial\theta}\left[\left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right) f(x|\theta)\right]\right] dx, \]

then we can derive:

\[ \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right] \]

This is known as the Fisher Information equality, and both sides equal the Fisher Information \( \mathcal{I}(\theta) \).

📌 Why This Is True: Key Insight

Let \( \ell(\theta; X) = \log f(X|\theta) \), the log-likelihood. Then:

Differentiate under the integral sign:

\[ \frac{d}{d\theta} \mathbb{E}_\theta\left[\frac{\partial}{\partial\theta} \ell(\theta; X)\right] = 0 = \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial\theta^2} \ell(\theta; X) + \left( \frac{\partial}{\partial\theta} \ell(\theta; X) \right)^2 \right] \]

Rearranged:

\[ \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right] \]

🔁 Interpretation

Both quantify how much information the data contains about \( \theta \).

✅ Summary

You’re stating the Fisher Information identity, which tells us:

Under regularity conditions, the variance of the score function equals the negative expected second derivative of the log-likelihood.

\[ \mathcal{I}(\theta) = \operatorname{Var}_\theta\left( \frac{d}{d\theta} \log f(X|\theta) \right) = -\mathbb{E}_\theta\left[ \frac{d^2}{d\theta^2} \log f(X|\theta) \right] \]

This is foundational in deriving the Cramér–Rao Lower Bound, and in asymptotic theory (MLE consistency and normality).
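The identity can be verified numerically for a specific family. The sketch below uses a single Poisson(\(\lambda\)) observation, chosen here only as an illustration; the sum over the support is truncated at `x_max`, which is an assumption that the tail beyond it is negligible.

```python
import math

def poisson_information_two_ways(lam, x_max=200):
    """Return (E[score^2], -E[second derivative]) for one Poisson(lam) draw."""
    second_moment_of_score = 0.0
    neg_expected_curvature = 0.0
    pmf = math.exp(-lam)                  # P(X = 0)
    for x in range(x_max + 1):
        score = x / lam - 1.0             # d/d lam of (x log lam - lam - log x!)
        curvature = -x / lam ** 2         # second derivative in lam
        second_moment_of_score += score ** 2 * pmf
        neg_expected_curvature += -curvature * pmf
        pmf *= lam / (x + 1)              # recurrence giving P(X = x + 1)
    return second_moment_of_score, neg_expected_curvature

lhs, rhs = poisson_information_two_ways(lam=3.5)
print(lhs, rhs, 1.0 / 3.5)
```

Both sides should agree with the closed form \( \mathcal{I}(\lambda) = 1/\lambda \) for this model.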

🧠 Cramér–Rao Bound for Vector Parameters

Excellent — you’re now stepping into the multivariate (vector parameter) version of the Cramér–Rao Lower Bound (CRLB), which generalizes the scalar case using matrix calculus and Fisher information matrices.

Let’s break it down.


Let:

  • \(\boldsymbol{\theta} \in \mathbb{R}^k\): vector parameter
  • \(\mathbf{X} \sim f(\mathbf{x}|\boldsymbol{\theta})\)
  • \(\mathbf{W}(\mathbf{X}) \in \mathbb{R}^m\): vector-valued statistic
  • \(\mathbf{W}(\mathbf{X})\) has finite variance and differentiable expectation w.r.t. \(\boldsymbol{\theta}\)

📘 Multivariate CRLB (Matrix Form):

\[ \operatorname{Var}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left( \frac{d}{d\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right) \cdot \mathcal{I}(\boldsymbol{\theta})^{-1} \cdot \left( \frac{d}{d\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right)^\top \]

📌 Notation Breakdown:

  • \(\operatorname{Var}_{\boldsymbol{\theta}}(\mathbf{W}) \in \mathbb{R}^{m \times m}\): Covariance matrix of the estimator
  • \(\frac{d}{d\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}] \in \mathbb{R}^{m \times k}\): Jacobian matrix of expectation
  • \(\mathcal{I}(\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}\): Fisher Information matrix
  • \(\succeq\): Matrix inequality meaning the difference is positive semi-definite

⚙️ Meaning:

This inequality tells you that the covariance matrix of any estimator of a vector parameter is bounded from below by a matrix formed from:

  • The sensitivity of the estimator (Jacobian of expectation)
  • The inverse Fisher Information matrix

✅ Special Case: Unbiased Estimator

If \(\mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] = \boldsymbol{\theta}\), then:

  • Jacobian = Identity matrix \(I_k\)

So, the bound becomes:

\[ \operatorname{Var}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1} \]

Just like in the scalar case, but extended to full matrix structure.

📘 Fisher Information Matrix \(\mathcal{I}(\boldsymbol{\theta})\)

Let’s now define the Fisher Information Matrix \(\mathcal{I}(\boldsymbol{\theta})\) in the vector parameter case.


Let:

  • \(\boldsymbol{\theta} \in \mathbb{R}^k\): vector parameter
  • \(\mathbf{X} \sim f(\mathbf{x} | \boldsymbol{\theta})\)
  • Assume \(f(\mathbf{x} | \boldsymbol{\theta})\) satisfies suitable regularity conditions (e.g., to allow differentiation under the integral)

✅ The Fisher Information Matrix is defined as:

\[ \mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \]

🔍 Where:

  • \(\nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \in \mathbb{R}^k\): the score vector \[ \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial}{\partial\theta_1} \log f(\mathbf{X}|\boldsymbol{\theta}) \\ \vdots \\ \frac{\partial}{\partial\theta_k} \log f(\mathbf{X}|\boldsymbol{\theta}) \end{bmatrix} \]
  • The outer product produces a \(k \times k\) matrix
  • So, the Fisher Information Matrix is a positive semi-definite matrix of size \(k \times k\)

🧠 Alternate Form (if second derivatives are easier to compute):

Under stronger regularity conditions, we can also write:

\[ \mathcal{I}(\boldsymbol{\theta}) = -\mathbb{E}_{\boldsymbol{\theta}} \left[ \nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X}|\boldsymbol{\theta}) \right] \]

Here, \(\nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X}|\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}\) is the Hessian matrix of second-order partial derivatives.
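To see the outer-product definition in action, here is a rough numerical sketch for one observation from \( \mathcal{N}(\mu, \sigma^2) \) with \( \boldsymbol{\theta} = (\mu, \sigma^2) \). The model, the trapezoid-rule integration and the names `half_width` and `steps` are choices made for this illustration.

```python
import math

def normal_fisher_matrix(mu, var, half_width=12.0, steps=20000):
    """Approximate E[score score^T] for one N(mu, var) draw, theta = (mu, var)."""
    sd = math.sqrt(var)
    lo = mu - half_width * sd
    h = 2.0 * half_width * sd / steps
    info = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(steps + 1):
        x = lo + i * h
        weight = h * (0.5 if i in (0, steps) else 1.0)   # trapezoid rule
        density = math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        score = (
            (x - mu) / var,                                  # d/d mu of log f
            -0.5 / var + (x - mu) ** 2 / (2.0 * var ** 2),   # d/d var of log f
        )
        for a in range(2):
            for b in range(2):
                info[a][b] += score[a] * score[b] * density * weight
    return info

info = normal_fisher_matrix(mu=1.0, var=2.0)
for row in info:
    print(row)
```

For this model the result should be close to the known closed form \( \operatorname{diag}\left(1/\sigma^2,\; 1/(2\sigma^4)\right) \), with off-diagonal entries near zero.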

🧠 Cramér–Rao Lower Bound: Vector Parameter Case

Here's the full form of the Cramér–Rao Lower Bound (CRLB) in the vector parameter case, complete with all definitions, assumptions, and mathematical structure.


Let:

  • \(\boldsymbol{\theta} \in \mathbb{R}^k\): vector-valued parameter
  • \(\mathbf{X} \in \mathbb{R}^n\): random sample with joint pdf (or pmf) \(f(\mathbf{x} | \boldsymbol{\theta})\)
  • \(\mathbf{W}(\mathbf{X}) \in \mathbb{R}^m\): statistic estimating a function of \(\boldsymbol{\theta}\)
  • All relevant expectations, gradients, and Fisher Information are well-defined and finite

📘 General Matrix Form of the CRLB:

\[ \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left( \frac{\partial}{\partial \boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right) \cdot \mathcal{I}(\boldsymbol{\theta})^{-1} \cdot \left( \frac{\partial}{\partial \boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right)^\top \]

🧩 Term Definitions:

  • \(\operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}) \in \mathbb{R}^{m \times m}\): Covariance matrix of the estimator
  • \(\frac{\partial}{\partial \boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}] \in \mathbb{R}^{m \times k}\): Jacobian matrix of expected estimator wrt \(\boldsymbol{\theta}\)
  • \(\mathcal{I}(\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}\): Fisher Information Matrix

🎯 Fisher Information Matrix Definition:

Let the score vector be:

\[ \mathbf{S}(\mathbf{X}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X} | \boldsymbol{\theta}) \in \mathbb{R}^k \]

Then:

\[ \mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\left[ \mathbf{S}(\mathbf{X}; \boldsymbol{\theta}) \mathbf{S}(\mathbf{X}; \boldsymbol{\theta})^\top \right] \]

Or, under regularity conditions:

\[ \mathcal{I}(\boldsymbol{\theta}) = - \mathbb{E}_{\boldsymbol{\theta}}\left[ \nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X} | \boldsymbol{\theta}) \right] \]

✅ Special Case: Unbiased Estimator

If \(\mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] = \boldsymbol{\theta}\), then the Jacobian is identity, and:

\[ \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1} \]

This matches the scalar case:

\[ \operatorname{Var}_{\theta}(W) \geq \frac{1}{\mathcal{I}(\theta)} \]
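As a worked illustration (not taken from the lecture itself), assume \( X_1, \dots, X_n \) are i.i.d. \( \mathcal{N}(\mu, \sigma^2) \) with \( \boldsymbol{\theta} = (\mu, \sigma^2) \), \( n \geq 2 \), and take the unbiased estimator \( \mathbf{W} = (\bar{X}, S^2) \), where \( S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \). Standard results for the normal model give:

\[ \mathcal{I}(\boldsymbol{\theta})^{-1} = \begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{bmatrix}, \qquad \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}) = \begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/(n-1) \end{bmatrix} \]

so that:

\[ \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}) - \mathcal{I}(\boldsymbol{\theta})^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & \dfrac{2\sigma^4}{n(n-1)} \end{bmatrix} \succeq 0 \]

The difference is positive semi-definite, as the inequality requires. The \( \bar{X} \) component attains its bound, while \( S^2 \) exceeds its bound by \( 2\sigma^4 / (n(n-1)) \).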

🧠 Setup: Vector CRLB with Fisher Info Substituted

  • \(\boldsymbol{\theta} \in \mathbb{R}^p\): vector parameter
  • \(\mathbf{X}\): data with PDF/PMF \(f(\mathbf{x}|\boldsymbol{\theta})\)
  • \(\mathbf{W}(\mathbf{X}) \in \mathbb{R}^k\): unbiased estimator of \(\boldsymbol{\psi}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})]\)

✅ CRLB (Multivariate Version):

\[ \text{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left[ \frac{ \partial \boldsymbol{\psi}(\boldsymbol{\theta}) }{ \partial \boldsymbol{\theta} } \right] \left[ \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \right]^{-1} \left[ \frac{ \partial \boldsymbol{\psi}(\boldsymbol{\theta}) }{ \partial \boldsymbol{\theta} } \right]^\top \]

🔍 Explanation of Components:

  • \(\frac{ \partial \boldsymbol{\psi}(\boldsymbol{\theta}) }{ \partial \boldsymbol{\theta} } \in \mathbb{R}^{k \times p}\): Jacobian of \(\boldsymbol{\psi}(\boldsymbol{\theta})\)
  • Middle matrix is the Fisher Information Matrix: \[ \mathbf{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \]

📌 Summary:

The multivariate CRLB states:

\[ \boxed{ \text{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left[ \frac{ \partial \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] }{ \partial \boldsymbol{\theta} } \right] \left[ \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \right]^{-1} \left[ \frac{ \partial \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] }{ \partial \boldsymbol{\theta} } \right]^\top } \]

This is a matrix inequality: the LHS is the covariance matrix of the estimator, and the RHS is the lower bound (in the positive semi-definite sense).

✨ Information Equality and Fisher Information Identity

You're diving into the information equality result, which is foundational for the Cramér–Rao lower bound (CRLB). It links the variance of the score function with the expected curvature of the log-likelihood.


Let us go through each step in full clarity, with notation and interpretation.

🔶 The Setup

  • \( f(x \mid \theta) \) : parametric family of densities
  • Regularity conditions assumed (interchanging integration and differentiation allowed)
  • \( x \) is a random variable drawn from \( f(x \mid \theta) \)
  • \( \theta \in \mathbb{R} \): scalar parameter

🧰 Step 1: Expectation of the Score Function is Zero

\[ \mathbb{E}_\theta\left[ \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right] = 0 \]

This is a standard identity under regularity conditions:

\[ \begin{align*} \mathbb{E}_\theta\left[ \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right] &= \int \frac{\partial}{\partial \theta} \log f(x \mid \theta) \cdot f(x \mid \theta)\, dx \\ &= \int \frac{1}{f(x \mid \theta)} \cdot \frac{\partial f(x \mid \theta)}{\partial \theta} \cdot f(x \mid \theta)\, dx \\ &= \int \frac{\partial f(x \mid \theta)}{\partial \theta} \, dx \\ &= \frac{\partial}{\partial \theta} \int f(x \mid \theta) \, dx = \frac{\partial}{\partial \theta}(1) = 0 \end{align*} \]

Conclusion: the expected score is zero.

🔹 Step 2: Variance of Score Function = Negative Expectation of Second Derivative

The claim is:

\[ \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] = - \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] \]

📌 LHS: Variance of the Score Function

\[ \begin{align*} \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] &= \int \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 f(x \mid \theta)\, dx \\ &= \int \left( \frac{ \partial f(x \mid \theta)/\partial \theta }{ f(x \mid \theta) } \right)^2 f(x \mid \theta)\, dx \\ &= \int \left( \frac{ \partial f(x \mid \theta) }{ \partial \theta } \right)^2 \cdot \frac{1}{f(x \mid \theta)}\, dx \end{align*} \]

📌 RHS: Negative Expectation of Second Derivative

\[ \begin{align*} \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] &= \int \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \cdot f(x \mid \theta)\, dx \\ &= \int \left[ \frac{ \partial^2 f }{ \partial \theta^2 } - \frac{ (\partial f/\partial \theta)^2 }{ f } \right] dx \end{align*} \]

Now, under regularity:

\[ \int \frac{\partial^2 f(x \mid \theta)}{\partial \theta^2} dx = \frac{\partial^2}{\partial \theta^2} 1 = 0 \]

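So the expected second derivative reduces to:

\[ \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] = - \int \frac{ (\partial f / \partial \theta)^2 }{ f } dx \]

which is exactly the negative of the LHS integral computed above.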

Conclusion:

\[ \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] = - \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] \]
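Both steps can be checked numerically for a continuous density. The sketch below assumes the exponential family \( f(x \mid \theta) = \theta e^{-\theta x} \), \( x \geq 0 \), picked only as an example; the integrals are approximated with a trapezoid rule on a truncated range, and `upper` and `steps` are illustrative tuning values.

```python
import math

def exponential_score_moments(rate, upper=60.0, steps=200000):
    """Trapezoid-rule values of E[score], E[score^2], E[second derivative]
    for f(x | rate) = rate * exp(-rate * x), x >= 0."""
    h = upper / rate / steps              # integrate over [0, upper / rate]
    mean_score = second_moment = mean_curvature = 0.0
    for i in range(steps + 1):
        x = i * h
        weight = h * (0.5 if i in (0, steps) else 1.0)
        density = rate * math.exp(-rate * x)
        score = 1.0 / rate - x            # d/d rate of (log rate - rate * x)
        curvature = -1.0 / rate ** 2      # second derivative of the same
        mean_score += score * density * weight
        second_moment += score ** 2 * density * weight
        mean_curvature += curvature * density * weight
    return mean_score, second_moment, mean_curvature

mean_score, second_moment, mean_curvature = exponential_score_moments(rate=2.0)
print(mean_score, second_moment, -mean_curvature)
```

The first value should be near zero (Step 1), and the last two should both be near \( 1/\theta^2 \) (Step 2).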

📊 Visual Interpretation

Let \( \ell(\theta) = \log f(x \mid \theta) \). Then:

  • \( \ell'(\theta) \): score function — slope of log-likelihood
  • \( \ell''(\theta) \): curvature; measures peakedness/flatness
  • Taking expectation, the curvature = variance of slope (negated)

ℹ Fisher Information

\[ \mathcal{I}(\theta) = \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] = -\mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] \]

This quantifies how much information one observation gives about \(\theta\).

📆 Summary

  • This is the Fisher Information identity
  • Backbone of the Cramér–Rao Inequality
  • Relies on regularity conditions (to switch integral and derivative)
  • Captures the connection between score and curvature

🔷 Step-by-Step Breakdown: The Score Function

Let’s walk through it clearly, with intuition, derivation, and a small example.


📘 Log-Likelihood Expression

You wrote:

\[ L(\theta \mid x_1, x_2, \dots, x_n) = \sum_{i=1}^n \log f(x_i \mid \theta) \]

This is the log-likelihood function for \( n \) i.i.d. observations from the model \( f(x \mid \theta) \).

🧠 Why Sum of Logs?

Because:

\[ \begin{aligned} \text{Likelihood: } \mathcal{L}(\theta \mid x_1, \dots, x_n) &= \prod_{i=1}^n f(x_i \mid \theta) \\ \Rightarrow \log \mathcal{L}(\theta \mid x_1, \dots, x_n) &= \sum_{i=1}^n \log f(x_i \mid \theta) \end{aligned} \]

Taking the log simplifies multiplication into summation.

🔧 Score Function (Gradient of Log-Likelihood)

Now differentiate the log-likelihood with respect to \( \theta \):

\[ \frac{\partial}{\partial \theta} L(\theta \mid x_1, \dots, x_n) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(x_i \mid \theta) \]

Using the chain rule:

\[ = \sum_{i=1}^n \frac{1}{f(x_i \mid \theta)} \cdot \frac{\partial f(x_i \mid \theta)}{\partial \theta} \]

This is the score function, often denoted:

\[ \nabla_\theta L(\theta \mid x_1, \dots, x_n) \]

✅ Your symbolic expression is correct — it's the score function for the full sample.

✅ Summary

\[ \nabla_\theta L(\theta) = \sum_{i=1}^n \frac{ \frac{\partial}{\partial \theta} f(x_i \mid \theta) }{ f(x_i \mid \theta) } = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(x_i \mid \theta) \]

This is crucial for:

  • Maximum Likelihood Estimation (MLE): Set score = 0 and solve for \( \theta \)
  • Fisher Information for \( n \) samples: \( \mathcal{I}_n(\theta) = n \mathcal{I}(\theta) \)
  • Asymptotic theory of MLEs (e.g., consistency, normality)

📊 Small Example: Normal Model with Known Variance

Suppose:

  • \( X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2) \)
  • \( \sigma^2 \) known; estimate \( \mu \)

Then:

\[ f(x_i \mid \mu) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \] \[ \log f(x_i \mid \mu) = -\frac{1}{2} \log(2\pi \sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \]

Summing over \( i \):

\[ L(\mu) = \sum_{i=1}^n \log f(x_i \mid \mu) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]

Differentiating:

\[ \frac{d}{d\mu} L(\mu) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \]

Set derivative = 0 for MLE:

\[ \sum_{i=1}^n (x_i - \mu) = 0 \quad \Rightarrow \quad \mu = \bar{x} \]

✅ So the MLE for \( \mu \) is the sample mean \( \bar{x} \).
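A small sketch of the score for this model, evaluated on made-up data (the values in `xs` and the choice `sigma2=2.0` are purely illustrative):

```python
def normal_score(mu, xs, sigma2):
    """Score of the N(mu, sigma2) log-likelihood in mu: (1 / sigma2) * sum(x_i - mu)."""
    return sum(x - mu for x in xs) / sigma2

xs = [4.1, 5.3, 3.8, 6.0, 4.9]      # made-up data
x_bar = sum(xs) / len(xs)

score_at_mle = normal_score(x_bar, xs, sigma2=2.0)          # essentially zero
score_left = normal_score(x_bar - 1.0, xs, sigma2=2.0)      # positive: L rises to the right
score_right = normal_score(x_bar + 1.0, xs, sigma2=2.0)     # negative: L falls to the right
print(score_at_mle, score_left, score_right)
```

The sign change of the score around \( \bar{x} \) is what identifies the sample mean as the maximizer rather than just a stationary point.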

🔷 Newton–Raphson for MLE: Likelihood Maximization

You're now discussing numerical optimization of the likelihood using the Newton–Raphson method, one of the core techniques to compute Maximum Likelihood Estimators (MLEs).

We'll break this into four parts for clarity:


🔹 1. Understanding the Graph of \( L(\theta) \) vs \( \theta \)

The graph of the log-likelihood function \( L(\theta) \) often looks like a smooth, concave curve. The peak of this curve corresponds to the MLE \( \hat{\theta} \).

        ^
     L(θ)
        |          /\
        |         /  \
        |        /    \
        |_______/______\________>
                    θ̂
  

At the maximum, the slope \( \frac{d}{d\theta} L(\theta) \) is zero — this is the first-order optimality condition.

🔧 2. Gradient Ascent (Steepest Ascent)

The update:

\[ \hat{\theta}_{r+1} = \hat{\theta}_r + \eta_r \cdot \nabla_\theta L(\theta_r \mid x_1, \dots, x_n) \]

This is a gradient ascent step, where:

  • \( \eta_r \) is a step size (learning rate)
  • \( \nabla_\theta L(\theta) \) is the gradient (score function)
  • You’re moving in the direction of increasing likelihood

This is not Newton–Raphson yet, just basic gradient ascent.
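A minimal sketch of this update, assuming the normal model with known variance and a constant step size; the data, `step`, and `iters` are illustrative choices rather than recommended settings.

```python
def gradient_ascent_normal_mean(xs, sigma2, mu0, step, iters):
    """Iterate mu <- mu + step * score(mu) for the N(mu, sigma2) model."""
    n = len(xs)
    x_bar = sum(xs) / n
    mu = mu0
    for _ in range(iters):
        score = n * (x_bar - mu) / sigma2   # gradient of the log-likelihood
        mu = mu + step * score
    return mu

xs = [4.1, 5.3, 3.8, 6.0, 4.9]              # made-up data
x_bar = sum(xs) / len(xs)
mu_hat = gradient_ascent_normal_mean(xs, sigma2=2.0, mu0=0.0, step=0.1, iters=100)
print(mu_hat, x_bar)
```

In this model each step multiplies the error \( \mu_r - \bar{x} \) by \( 1 - \eta n / \sigma^2 \), so the iteration converges only when \( 0 < \eta n / \sigma^2 < 2 \); with the values above that factor is 0.75.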

🧠 3. Newton–Raphson Update

Newton–Raphson improves on gradient ascent by incorporating the second derivative:

  • Gradient: \( \nabla_\theta L(\theta) \)
  • Hessian: \( \nabla^2_\theta L(\theta) \)

The update rule becomes:

\[ \hat{\theta}_{r+1} = \hat{\theta}_r - \left[ \nabla^2_\theta L(\theta_r) \right]^{-1} \cdot \nabla_\theta L(\theta_r) \]

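✅ The minus sign is there because the Hessian of a concave log-likelihood is negative (negative definite in the vector case), so \( -[\nabla^2_\theta L]^{-1} \nabla_\theta L \) points uphill, with the step length scaled by the inverse curvature.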

📌 What's the Intuition?

  • The first derivative tells you the direction to move
  • The second derivative tells you how much to move (curvature)
   ^
L(θ)|         *
     |        ^
     |       /|\
     |      / | \    ← Newton step lands at the peak
     |_____/__|__\________>
               θ
  

💡 Newton vs Gradient Ascent

| Method | Uses | Pros | Cons |
|---|---|---|---|
| Gradient Ascent | Gradient only | Simple | May take many steps |
| Newton–Raphson | Gradient + Hessian | Quadratic convergence | Requires second derivative |

For well-behaved likelihoods (e.g., Normal), Newton–Raphson can converge in just a few steps.

🔢 Example: Normal Model with Known Variance

Suppose:

  • \( X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2) \)
  • \( \sigma^2 \) is known; find MLE for \( \mu \)

Log-likelihood:

\[ L(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 + \text{const} \]

Then:

  • \( \nabla_\mu L = \frac{1}{\sigma^2} \sum (x_i - \mu) = \frac{n}{\sigma^2} (\bar{x} - \mu) \)
  • \( \nabla^2_\mu L = -\frac{n}{\sigma^2} \)

Newton–Raphson update:

\[ \mu_{r+1} = \mu_r - \left( -\frac{\sigma^2}{n} \right) \cdot \frac{n}{\sigma^2} (\bar{x} - \mu_r) = \mu_r + (\bar{x} - \mu_r) = \bar{x} \]

✅ It converges in one step — because the log-likelihood is quadratic!
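The one-step behaviour can be contrasted with a likelihood that is not quadratic. The sketch below is a generic scalar Newton–Raphson loop; the exponential-rate model, the data, and the starting points are assumptions added for illustration.

```python
def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=50):
    """Scalar Newton-Raphson: theta <- theta - score(theta) / hessian(theta).
    Returns the final iterate and the number of updates performed."""
    theta = theta0
    for iteration in range(1, max_iter + 1):
        new_theta = theta - score(theta) / hessian(theta)
        if abs(new_theta - theta) < tol:
            return new_theta, iteration
        theta = new_theta
    return theta, max_iter

xs = [4.1, 5.3, 3.8, 6.0, 4.9]      # made-up positive data
n, sigma2 = len(xs), 2.0
x_bar = sum(xs) / n

# Normal mean with known variance: quadratic log-likelihood.
mu_hat, normal_updates = newton_raphson(
    score=lambda mu: n * (x_bar - mu) / sigma2,
    hessian=lambda mu: -n / sigma2,
    theta0=0.0,
)

# Exponential rate: L(r) = n log r - r * sum(x), not quadratic in r.
rate_hat, exp_updates = newton_raphson(
    score=lambda r: n / r - sum(xs),
    hessian=lambda r: -n / r ** 2,
    theta0=0.1,
)
print(mu_hat, normal_updates, rate_hat, exp_updates)
```

For the normal model the first update already lands on \( \bar{x} \); the loop performs one more update only to confirm that nothing changes. The exponential model needs several updates before reaching its MLE \( 1/\bar{x} \).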

✅ Summary

  • You’re using gradient ascent to maximize \( L(\theta) \)
  • Newton–Raphson is more efficient by using curvature (Hessian)
  • It’s standard in MLE, logistic regression, Poisson models, etc.

On the Direction of Steepest Ascent:

🔷 1. Why the Direction of Steepest Ascent?

Imagine the log-likelihood function \( L(\theta) \) as a landscape — a smooth surface over the parameter space.
Your goal: reach the top (maximum likelihood estimate \( \hat{\theta} \)).

At any point \( \theta_r \), the gradient \( \nabla_\theta L(\theta_r) \) is a vector that:

  • Points in the direction where \( L(\theta) \) increases most rapidly.
  • Has a magnitude equal to the rate of increase in that direction.

So, to climb the hill toward the MLE, you step in the direction of the gradient:

\[ \hat{\theta}_{r+1} = \hat{\theta}_r + \eta_r \cdot \nabla_\theta L(\theta_r) \]

Here, \( \eta_r > 0 \) is your step size — how far you move in that direction.

🔶 2. What Makes Gradient the Steepest Direction?

Let’s be precise:

Suppose you want to increase \( L(\theta) \) by moving a small distance \( \delta \theta \). The first-order change is:

\[ \Delta L(\theta) \approx \nabla_\theta L(\theta)^\top \delta \theta \]

We want to choose \( \delta \theta \) of unit length (i.e., \( \| \delta \theta \| = 1 \)) that gives maximum increase:

\[ \max_{\| \delta \theta \| = 1} \nabla_\theta L(\theta)^\top \delta \theta \]

The solution to this maximization is:

\[ \delta \theta = \frac{ \nabla_\theta L(\theta) }{ \| \nabla_\theta L(\theta) \| } \]

i.e., move in the direction of the gradient. ✅
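One way to see why the normalized gradient solves this problem is the Cauchy–Schwarz inequality, sketched here for \( \nabla_\theta L(\theta) \neq 0 \):

\[ \nabla_\theta L(\theta)^\top \delta\theta \leq \| \nabla_\theta L(\theta) \| \, \| \delta\theta \| = \| \nabla_\theta L(\theta) \| \quad \text{whenever } \| \delta\theta \| = 1, \]

with equality exactly when \( \delta\theta \) is a positive multiple of \( \nabla_\theta L(\theta) \). Substituting the normalized gradient confirms that the bound is attained:

\[ \nabla_\theta L(\theta)^\top \frac{ \nabla_\theta L(\theta) }{ \| \nabla_\theta L(\theta) \| } = \| \nabla_\theta L(\theta) \| \]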

📊 Visual Intuition (2D Case)

Think of the log-likelihood surface like a hill in 3D.

  • The gradient vector at your current point points uphill in the steepest direction.
  • If you move orthogonally to the gradient, you stay at the same level (iso-likelihood curve).
  • Moving against the gradient leads you downhill — which you want to avoid during maximization.
      ^
     /|\\         ← Gradient points this way
    / |
   /  |             Log-Likelihood surface
  *---→
Current θ_r
  

🔁 Comparison with Newton–Raphson

  • Steepest ascent: always follows the direction of greatest immediate gain.
  • Newton–Raphson: modifies the direction based on the curvature, sometimes avoiding overshooting or zigzagging.

So, in practice:

  • Gradient ascent is more robust, especially when the curvature is poorly behaved or Hessians are hard to compute.
  • Newton–Raphson is faster when curvature information is reliable and the function is well-behaved (e.g., Gaussian likelihood).

✅ Summary

  • We prefer the direction of steepest ascent because it guarantees the fastest local increase in the log-likelihood.
  • The gradient gives this direction.
  • The step size \( \eta_r \) controls how aggressively we move — too small = slow; too large = unstable.
  • This approach is the foundation for many modern learning algorithms, including stochastic gradient ascent in MLE for big data or machine learning.

🔷 Connecting Fisher Information with the Hessian Matrix

We now connect the Fisher Information Matrix with the Hessian matrix of the log-likelihood function.

Let’s unpack this carefully and precisely.


🔷 1. Fisher Information Matrix (Multivariate Version)

Suppose \( \theta \in \mathbb{R}^k \) is a vector parameter and \( f(x \mid \theta) \) is a probability density function satisfying standard regularity conditions.

Then, the Fisher Information Matrix is defined as:

\[ \mathcal{I}(\theta) = \mathbb{E}_\theta \left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right) \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^\top \right] \]
  • This is a \( k \times k \) positive semi-definite matrix.
  • It measures how much information the data carries about each component of \( \theta \), and how the components interact.

🔶 2. Alternative Definition Using the Hessian (Second Derivatives)

Under regularity conditions, the Fisher Information Matrix can also be written as the negative expected Hessian of the log-likelihood function:

\[ \mathcal{I}(\theta) = - \mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta \partial \theta^\top} \log f(x \mid \theta) \right] \]

✅ This is what your note says:

\( \mathbb{E}\left( -\frac{ \partial^2 }{ \partial \theta \partial \theta^\top } \log f(x \mid \theta) \right) = \text{Fisher Information Matrix} \)

📌 3. What Is the Hessian Matrix?

The Hessian of a scalar-valued function \( L(\theta) \), where \( \theta \in \mathbb{R}^k \), is the matrix of second partial derivatives:

\[ H(\theta) = \left[ \frac{ \partial^2 L(\theta) }{ \partial \theta_i \partial \theta_j } \right]_{i,j=1}^k \]

So for the log-likelihood function \( \log f(x \mid \theta) \), the Hessian is:

\[ \frac{ \partial^2 }{ \partial \theta \partial \theta^\top } \log f(x \mid \theta) \]

It’s a \( k \times k \) symmetric matrix, assuming smoothness.


🔷 4. The (i,j)-th Element of Fisher Info = Expected Negative Hessian Entry

You correctly stated:

(i, j)-th element of the Fisher Information matrix is the (i, j)-th element of the expected negative Hessian of \( \log f(x \mid \theta) \)

In symbols:

\[ [\mathcal{I}(\theta)]_{i,j} = - \mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f(x \mid \theta) \right] \]

This captures the curvature of the likelihood in the \( (\theta_i, \theta_j) \) direction.


🧠 Why Two Equivalent Definitions?

Because of the Fisher identity:

\[ \mathcal{I}(\theta) = \mathbb{E}_\theta \left[ \nabla_\theta \log f(x \mid \theta) \cdot \nabla_\theta \log f(x \mid \theta)^\top \right] = - \mathbb{E}_\theta \left[ \nabla^2_\theta \log f(x \mid \theta) \right] \]

Both are equal only under regularity conditions, where we’re allowed to exchange integration and differentiation.


✅ Summary

| Quantity | Meaning |
|---|---|
| \( \nabla_\theta \log f(x \mid \theta) \) | Score function (gradient of log-likelihood) |
| \( \nabla^2_\theta \log f(x \mid \theta) \) | Hessian matrix (second derivatives) |
| \( \mathcal{I}(\theta) \) | Fisher Information Matrix |
| \( [\mathcal{I}(\theta)]_{i,j} \) | Expected negative second derivative of log-likelihood w.r.t. \( \theta_i \) and \( \theta_j \) |

🧭 Overview

You are essentially learning the geometry and calculus behind MLE:


1. ✅ Stopping Criteria in Iterative Estimation

🔁 In iterative methods (like Newton–Raphson, gradient ascent), we update:

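\[ \hat{\theta}_{r+1} = \hat{\theta}_r + \eta_r \cdot \nabla_\theta L(\hat{\theta}_r) \]

We stop when the estimate becomes stable:

\[ \| \hat{\theta}_{r+1} - \hat{\theta}_r \|_2 < \varepsilon \]

✅ Example: If \( \theta = (\mu, \sigma)^\top \), then:

\[ \| \hat{\theta}_{r+1} - \hat{\theta}_r \|_2^2 = (\hat{\mu}_{r+1} - \hat{\mu}_r)^2 + (\hat{\sigma}_{r+1} - \hat{\sigma}_r)^2 \]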
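In code, the stopping rule is a single distance check. This is a minimal sketch; the function name `has_converged`, the tolerance, and the example iterates are hypothetical.

```python
import math

def has_converged(theta_new, theta_old, eps=1e-6):
    """True when the Euclidean distance between successive iterates is below eps."""
    return math.dist(theta_new, theta_old) < eps

old = (1.20, 0.80)                        # hypothetical (mu, sigma) iterate
new = (1.20 + 3e-7, 0.80 - 4e-7)          # next iterate, about 5e-7 away
print(has_converged(new, old))
print(has_converged((1.30, 0.80), old))
```

The first pair is closer than the tolerance and the second is not. In practice this check is usually combined with a cap on the number of iterations, so that a non-converging run still terminates.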

2. 📘 Fisher Information Matrix ℐ(θ)

\[ \mathcal{I}(\theta) = \mathbb{E}\left[ -\frac{\partial^2}{\partial \theta \, \partial \theta^\top} \log f(x \mid \theta) \right] \]

🔎 Intuition:


       log L(θ)
           ^
           |
   High    |     ____
   info    |   /    \
           |  /      \
           | /        \
   Low     |/          \___
           +-------------------> θ

3. 🧮 Hessian Matrix H

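\[ H(\theta) = \left[ \frac{\partial^2 L}{\partial \theta_i \, \partial \theta_j} \right] \]

🔁 In Newton–Raphson:

\[ \hat{\theta}_{r+1} = \hat{\theta}_r - H^{-1} \nabla_\theta L \]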

4. 🔗 Link Between Hessian and Fisher Information

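\[ \mathcal{I}(\theta) = \mathbb{E}_\theta[-H(\theta)] = \mathbb{E}_\theta\left[ -\frac{\partial^2}{\partial \theta \, \partial \theta^\top} \log f(x \mid \theta) \right] \]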

✅ Requires regularity conditions

For i.i.d. Data:

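\[ X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} f(x \mid \theta) \]

\[ L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta) \quad \Rightarrow \quad \mathcal{I}(\theta) = n \cdot \mathbb{E}\left[ -\frac{\partial^2}{\partial \theta \, \partial \theta^\top} \log f(x_1 \mid \theta) \right] \]

So more data ⇒ higher curvature ⇒ better certainty

\[ \mathbb{E}\left[ -\frac{\partial^2}{\partial \theta \, \partial \theta^\top} \log f(x_1 \mid \theta) \right] = \text{per-sample Fisher information} \]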

5. 🎯 Cramér–Rao Lower Bound (CRLB)

If θ̂ is unbiased:

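\[ \operatorname{Var}(\hat{\theta}) \succeq \mathcal{I}^{-1}(\theta) \]

In scalar case:

\[ \operatorname{Var}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)} \]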

6. 🧾 Observed Information Matrix

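\[ \mathcal{J}(\theta) = -H(\theta) = -\sum_{i=1}^n \frac{\partial^2}{\partial \theta \, \partial \theta^\top} \log f(x_i \mid \theta) \]

Used in:

\[ \widehat{\operatorname{Var}}(\hat{\theta}) \approx [\mathcal{J}(\hat{\theta})]^{-1} \]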
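As a small illustration of this plug-in variance estimate, the sketch below assumes Bernoulli data; the 0/1 values in `xs` are made up and the function name is hypothetical.

```python
def bernoulli_observed_information(p, xs):
    """J(p): negative second derivative of the Bernoulli log-likelihood, summed over xs."""
    return sum(x / p ** 2 + (1 - x) / (1.0 - p) ** 2 for x in xs)

xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]     # made-up 0/1 data
p_hat = sum(xs) / len(xs)               # MLE of p
var_hat = 1.0 / bernoulli_observed_information(p_hat, xs)
std_err = var_hat ** 0.5
print(p_hat, var_hat, std_err)
```

For the Bernoulli model the observed information at the MLE equals \( n / (\hat{p}(1 - \hat{p})) \), so `var_hat` coincides with the familiar \( \hat{p}(1 - \hat{p}) / n \).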

✅ Summary Table

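| Concept | Formula | Interpretation |
|---|---|---|
| Gradient (Score) | \( \nabla_\theta L(\theta) \) | Direction of steepest ascent |
| Hessian | \( H(\theta) = \partial^2 L / \partial \theta \, \partial \theta^\top \) | Curvature of log-likelihood |
| Fisher Info | \( \mathcal{I}(\theta) = \mathbb{E}[-H] \) | Expected curvature |
| Observed Info | \( \mathcal{J}(\theta) = -H \) | Actual curvature from data |
| CRLB | \( \operatorname{Var}(\hat{\theta}) \succeq \mathcal{I}^{-1}(\theta) \) | Lower bound on estimator variance |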

🔷 1. Background: Newton–Raphson in MLE

Recall that for Maximum Likelihood Estimation (MLE), we want to maximize the log-likelihood function:

$$L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$$

The Newton–Raphson update rule for finding the root of $\nabla_\theta L(\theta) = 0$ is:

$$\theta^{(r+1)} = \theta^{(r)} - \left[ H(\theta^{(r)}) \right]^{-1} \nabla_\theta L(\theta^{(r)})$$

  • $\nabla_\theta L(\theta)$ is the score function
  • $H(\theta) = \nabla_\theta^2 L(\theta)$ is the Hessian matrix

🔷 2. Problem with Newton–Raphson

In practice, $H(\theta)$ may:

  • Be difficult to compute or invert
  • Be non-invertible at some points
  • Lead to unstable convergence

🔷 3. Fisher’s Scoring: Motivation

To address these issues, we replace the negative Hessian \( -H(\theta) \) (the observed information) with its expectation:

$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[-\nabla_\theta^2 L(\theta)\right]$$

This is the Fisher Information Matrix. It's often easier to compute and is positive semi-definite.

🔷 4. Fisher’s Scoring Update Rule

The update rule becomes:

$$\theta^{(r+1)} = \theta^{(r)} + \left[\mathcal{I}(\theta^{(r)})\right]^{-1} \nabla_\theta L(\theta^{(r)})$$

🔁 Notice:

  • Update direction uses the gradient (score)
  • Scaled by inverse Fisher Information, not Hessian

🔷 5. Visualization: Geometric Insight

Imagine you’re on a hill (log-likelihood surface):

  • Gradient Ascent: step in steepest direction
  • Newton–Raphson: adjusts for curvature (Hessian)
  • Fisher Scoring: uses average curvature — smoother steps

🔷 6. Example: Bernoulli Distribution

Let \( X_1, \dots, X_n \sim \text{Bern}(p)\) where \( 0 < p < 1 \)

Log-likelihood:

$$L(p) = \sum_{i=1}^n \left[ x_i \log p + (1 - x_i)\log(1 - p) \right]$$

Score function:

$$\frac{dL}{dp} = \frac{n\bar{X}}{p} - \frac{n(1 - \bar{X})}{1 - p}$$

Hessian (the observed information is \( -H(p) \)):

$$H(p) = -\frac{n\bar{X}}{p^2} - \frac{n(1 - \bar{X})}{(1 - p)^2}$$

Fisher Information:

$$\mathcal{I}(p) = \frac{n}{p(1 - p)}$$

Fisher Scoring update:

$$p^{(r+1)} = p^{(r)} + \left[ \frac{n}{p^{(r)}(1 - p^{(r)})} \right]^{-1} \left( \frac{n\bar{X}}{p^{(r)}} - \frac{n(1 - \bar{X})}{1 - p^{(r)}} \right)$$

Which simplifies to:

$$p^{(r+1)} = \bar{X}$$

🎯 That’s the MLE! Fisher scoring converges to it in one step for Bernoulli.
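The difference between the two updates shows up after a single step. This sketch uses the Bernoulli formulas above; the starting value, `x_bar`, and `n` are illustrative numbers.

```python
def bernoulli_score(p, x_bar, n):
    """dL/dp for n Bernoulli observations with sample mean x_bar."""
    return n * x_bar / p - n * (1.0 - x_bar) / (1.0 - p)

def fisher_scoring_step(p, x_bar, n):
    """One Fisher scoring update: p + I(p)^{-1} * score(p)."""
    info = n / (p * (1.0 - p))
    return p + bernoulli_score(p, x_bar, n) / info

def newton_step(p, x_bar, n):
    """One Newton-Raphson update: p - H(p)^{-1} * score(p)."""
    hessian = -n * x_bar / p ** 2 - n * (1.0 - x_bar) / (1.0 - p) ** 2
    return p - bernoulli_score(p, x_bar, n) / hessian

p0, x_bar, n = 0.2, 0.6, 50
p_fisher = fisher_scoring_step(p0, x_bar, n)
p_newton = newton_step(p0, x_bar, n)
print(p_fisher, p_newton)
```

Starting from the same point, Fisher scoring reaches \( \bar{X} \) immediately, whereas the Newton–Raphson step only moves part of the way and needs further iterations.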

🔷 7. Comparison Table

| Method | Uses | Stability | Convergence Speed |
|---|---|---|---|
| Gradient Ascent | Gradient only | May overshoot | Slower |
| Newton–Raphson | Gradient + Hessian | Can be unstable | Fast if Hessian good |
| Fisher Scoring | Gradient + Expected Hessian | More stable | Fast, smoother |

🔷 8. When to Use Fisher Scoring?

✅ Use it when:

  • Likelihood function is well-behaved but observed Hessian is messy
  • Working with GLMs — Fisher scoring is standard for IRLS
  • You need stable convergence: the Fisher information is positive semi-definite, so the step is an ascent direction whenever it is invertible