AY 2025–26
Instructor: Debasis Sengupta
Office / Department: ASU
Email: sdebasis@isical.ac.in
Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%
You’re discussing a variance inequality in estimation theory — a generalized form of the Cramér–Rao bound.
Key assumptions:
1. \( W(\mathbf{X}) \) is an estimator with finite variance, and \( \mathbb{E}_\theta[W(\mathbf{X})] \) is differentiable in \( \theta \).
2. Differentiation and integration can be interchanged: \( \frac{d}{d\theta} \int W(\mathbf{x}) f(\mathbf{x}|\theta)\, d\mathbf{x} = \int W(\mathbf{x}) \frac{\partial}{\partial\theta} f(\mathbf{x}|\theta)\, d\mathbf{x} \).
The second condition is a regularity condition — it lets you differentiate under the integral sign, essential for proving CR-type results.
Then, the variance of \( W(\mathbf{X}) \) satisfies the inequality:
\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{ \left( \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] \right)^2 }{ \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right] } \]This is a generalized Cramér–Rao Lower Bound.
✅ No matter what estimator \( W(\mathbf{X}) \) you use, unbiased or not, its variance can't fall below this bound.
The denominator is the Fisher Information:
\[ \mathcal{I}(\theta) := \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right] \]It quantifies how much information the sample \( \mathbf{X} \) provides about \( \theta \).
If \( W(\mathbf{X}) \) is unbiased, i.e.
\[ \mathbb{E}_\theta[W(\mathbf{X})] = \theta \quad \text{for all } \theta, \]then:
\[ \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] = 1 \]This simplifies the variance bound to:
\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{1}{\mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right]} \]✅ This is the standard Cramér–Rao Lower Bound for unbiased estimators.
Suppose \( W(\mathbf{X}) = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \), where \( X_1, \dots, X_n \) are i.i.d. with mean \( \theta \).
Then clearly \( \mathbb{E}(W(\mathbf{X})) = \theta \), so this is an unbiased estimator.
✅ The simplified CRLB applies.
If \( X_1, \dots, X_n \) are i.i.d., then:
\[ f(\mathbf{x}|\theta) = \prod_{i=1}^{n} f(x_i|\theta) \quad \Rightarrow \quad \log f(\mathbf{x}|\theta) = \sum_{i=1}^{n} \log f(x_i|\theta) \]This simplifies the log-likelihood and makes computing the score function much easier.
Example: let \( X_1, \dots, X_n \sim \text{i.i.d. } N(\mu, 1) \). The density is:
\[ f(x_i|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2}\right) \]Then the log-likelihood becomes:
\[ \log f(x_i|\mu) = -\frac{1}{2} \log(2\pi) - \frac{1}{2}(x_i - \mu)^2 \]Differentiating with respect to \( \mu \):
\[ \frac{d}{d\mu} \log f(x_i|\mu) = x_i - \mu \]So, the score function for the full sample is:
\[ \frac{d}{d\mu} \log f(\mathbf{x}|\mu) = \sum_{i=1}^n (x_i - \mu) \]To compute the Fisher Information \( \mathcal{I}(\mu) \), we evaluate:
\[ \mathbb{E}_\mu \left[ \left( \sum_{i=1}^n (X_i - \mu) \right)^2 \right] \]Because the \( X_i \) are independent and each has zero mean after centering:
\[ = \sum_{i=1}^n \mathbb{E}_\mu[(X_i - \mu)^2] \]Each term \( \mathbb{E}_\mu[(X_i - \mu)^2] = 1 \), so total Fisher Information is:
\[ \mathcal{I}(\mu) = n \]So the CRLB is \( 1/n \). Since \( \operatorname{Var}_\mu(\bar{X}) = 1/n \), ✅ the sample mean achieves the bound, hence it is efficient.
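To see the bound in action, here is a small Monte Carlo sketch (my own illustration, not part of the derivation): it simulates \( \bar{X} \) for \( N(\mu, 1) \) samples and compares its variance with the CRLB \( 1/n \).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 2.0, 50, 20_000

# Simulate `reps` samples of size n from N(mu, 1); compute each sample mean.
means = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)

crlb = 1.0 / n                     # CRLB = 1 / I(mu), with I(mu) = n
print(f"Var(sample mean) ≈ {means.var():.5f}")
print(f"CRLB = 1/n       = {crlb:.5f}")  # the two values should nearly coincide
```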
You've now made two key simplifications: unbiasedness (the numerator of the bound becomes 1) and the i.i.d. structure (the Fisher Information becomes a sum of identical per-observation terms).
If \(\mathbb{E}_\mu[W(\mathbf{X})] = \mu\), then:
\[ \operatorname{Var}_\mu(W(\mathbf{X})) \geq \frac{1}{\mathbb{E}_\mu\left[ \left( \frac{d}{d\mu} \log f(\mathbf{X}|\mu) \right)^2 \right]} = \frac{1}{\text{Fisher Information}} \]
For the i.i.d. normal case, \( \mathcal{I}(\mu) = n \) and \( \operatorname{Var}_\mu(\bar{X}) = 1/n \).
Does the sample mean attain the bound? Let’s answer it clearly.
✅ Yes, the sample mean attains the Cramér–Rao lower bound in this case. Hence, it is an efficient estimator of \(\mu\) for the normal model with known variance.
You're now pointing toward the discrete version of the Cramér–Rao bound, where we deal with PMFs instead of PDFs.
Let’s go through this step-by-step.
Instead of a pdf \( f(x|\theta) \), you have a pmf \( p(x|\theta) \), where \( x \in \mathcal{X} \), a countable set.
Suppose \( W(\mathbf{X}) \) is an estimator with finite variance, and differentiation and summation can be interchanged:
\[ \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] = \sum_{\mathbf{x}} W(\mathbf{x}) \, \frac{\partial}{\partial\theta} p(\mathbf{x}|\theta) \]
Then, the Cramér–Rao-type inequality holds:
\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{\left( \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] \right)^2}{ \mathbb{E}_\theta\left[ \left( \frac{d}{d\theta} \log p(\mathbf{X}|\theta) \right)^2 \right]} \]
This is structurally identical to the continuous case — but all integrals are now sums over discrete outcomes.
Let’s make this real. Suppose \( X_1, \dots, X_n \) are i.i.d. \( \text{Bernoulli}(p) \) and \( \hat{p} = \bar{X} \). The Fisher Information is \( \mathcal{I}(p) = \frac{n}{p(1-p)} \), so the CRLB is:
\[ \operatorname{Var}(\hat{p}) \geq \frac{1}{\mathcal{I}(p)} = \frac{p(1 - p)}{n} \]
✅ Since \( \operatorname{Var}(\bar{X}) = \frac{p(1-p)}{n} \), the sample mean attains this → it is efficient.
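A quick simulation sketch (my own check, under the same Bernoulli setup) comparing \( \operatorname{Var}(\hat{p}) \) with the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 100, 20_000

# Each p-hat is the mean of n Bernoulli(p) draws; binomial(n, p)/n is equivalent.
p_hat = rng.binomial(n, p, size=reps) / n

print(f"Var(p_hat)    ≈ {p_hat.var():.6f}")
print(f"p(1-p)/n CRLB = {p * (1 - p) / n:.6f}")  # bound is attained
```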
We now state a very important identity in statistical inference — the Fisher Information identity, which links the variance of the score with the expected second derivative of the log-likelihood.
Let’s unpack it clearly.
If the regularity condition holds:
\[ \frac{d}{d\theta}\mathbb{E}_{\theta}\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = \int \left[\frac{\partial}{\partial\theta}\left[\left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right) f(x|\theta)\right]\right] dx, \]
then we can derive:
\[ \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right] \]
This is known as the Fisher Information equality, and both sides equal the Fisher Information \( \mathcal{I}(\theta) \).
Let \( \ell(\theta; X) = \log f(X|\theta) \), the log-likelihood. The expected score \( \mathbb{E}_\theta[\partial \ell / \partial \theta] \) is zero for every \( \theta \), so its derivative in \( \theta \) is zero too.
Differentiate under the integral sign, using the product rule and \( \frac{\partial f}{\partial \theta} = \frac{\partial \ell}{\partial \theta} f \):
\[ \frac{d}{d\theta} \mathbb{E}_\theta\left[\frac{\partial}{\partial\theta} \ell(\theta; X)\right] = 0 = \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial\theta^2} \ell(\theta; X) + \left( \frac{\partial}{\partial\theta} \ell(\theta; X) \right)^2 \right] \]
Rearranged:
\[ \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right] \]
Both quantify how much information the data contains about \( \theta \).
You’re stating the Fisher Information identity, which tells us:
Under regularity conditions, the variance of the score function equals the negative expected second derivative of the log-likelihood.
\[ \mathcal{I}(\theta) = \operatorname{Var}_\theta\left( \frac{d}{d\theta} \log f(X|\theta) \right) = -\mathbb{E}_\theta\left[ \frac{d^2}{d\theta^2} \log f(X|\theta) \right] \]
This is foundational in deriving the Cramér–Rao Lower Bound, and in asymptotic theory (MLE consistency and normality).
Excellent — you’re now stepping into the multivariate (vector parameter) version of the Cramér–Rao Lower Bound (CRLB), which generalizes the scalar case using matrix calculus and Fisher information matrices.
Let’s break it down.
Let \( \boldsymbol{\theta} \in \mathbb{R}^k \) be a vector parameter and \( \mathbf{W}(\mathbf{X}) \in \mathbb{R}^k \) an estimator whose mean \( \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \) is differentiable in \( \boldsymbol{\theta} \).
This inequality tells you that the covariance matrix of any estimator of a vector parameter is bounded from below, in the positive semi-definite sense, by a matrix formed from the Jacobian of the estimator's mean and the inverse Fisher Information Matrix (the full statement appears below).
If \(\mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] = \boldsymbol{\theta}\), then the Jacobian is the identity matrix. So, the bound becomes:
\[ \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1} \]
Just like in the scalar case, but extended to full matrix structure.
Let’s now define the Fisher Information Matrix \(\mathcal{I}(\boldsymbol{\theta})\) in the vector parameter case.
Let the score vector be \( \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \in \mathbb{R}^k \). Then \( \mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\left[ \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \, \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta})^\top \right] \).
Under stronger regularity conditions, we can also write:
\[ \mathcal{I}(\boldsymbol{\theta}) = -\mathbb{E}_{\boldsymbol{\theta}} \left[ \nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X}|\boldsymbol{\theta}) \right] \]Here, \(\nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X}|\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}\) is the Hessian matrix of second-order partial derivatives.
Here's the full form of the Cramér–Rao Lower Bound (CRLB) in the vector parameter case, complete with all definitions, assumptions, and mathematical structure.
Let \( \mathbf{X} \sim f(\mathbf{x}|\boldsymbol{\theta}) \) with \( \boldsymbol{\theta} \in \mathbb{R}^k \), and let \( \mathbf{W}(\mathbf{X}) \in \mathbb{R}^k \) be an estimator of \( \boldsymbol{\theta} \) with finite covariance.
Let the score vector be:
\[ \mathbf{S}(\mathbf{X}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X} | \boldsymbol{\theta}) \in \mathbb{R}^k \]Then:
\[ \mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\left[ \mathbf{S}(\mathbf{X}; \boldsymbol{\theta}) \mathbf{S}(\mathbf{X}; \boldsymbol{\theta})^\top \right] \]Or, under regularity conditions:
\[ \mathcal{I}(\boldsymbol{\theta}) = - \mathbb{E}_{\boldsymbol{\theta}}\left[ \nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X} | \boldsymbol{\theta}) \right] \]If \(\mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] = \boldsymbol{\theta}\), then the Jacobian is identity, and:
\[ \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1} \]This matches the scalar case:
\[ \operatorname{Var}_{\theta}(W) \geq \frac{1}{\mathcal{I}(\theta)} \]The multivariate CRLB states:
\[ \boxed{ \text{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left[ \frac{ \partial \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] }{ \partial \boldsymbol{\theta} } \right]^\top \left[ \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \right]^{-1} \left[ \frac{ \partial \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] }{ \partial \boldsymbol{\theta} } \right] } \]This is a matrix inequality: the LHS is the covariance matrix of the estimator, and the RHS is the lower bound (in the positive semi-definite sense).
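To make the matrix bound concrete, here is a small simulation sketch (my own illustration, assuming \( X_i \sim N(\mu, \sigma^2) \) with \( \boldsymbol{\theta} = (\mu, \sigma^2) \)). The Fisher Information Matrix is diagonal, \( \mathcal{I}(\boldsymbol{\theta}) = \operatorname{diag}(n/\sigma^2,\, n/(2\sigma^4)) \), so the diagonal of the bound can be checked directly for the unbiased pair \( (\bar{X}, S^2) \):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 1.0, 4.0, 30, 50_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)          # unbiased sample variance S^2

# Diagonal of the inverse Fisher Information: (sigma^2/n, 2*sigma^4/n).
print(f"Var(xbar) ≈ {xbar.var():.4f}   bound = {sigma2 / n:.4f}")       # attained
print(f"Var(S^2)  ≈ {s2.var():.4f}   bound = {2 * sigma2**2 / n:.4f}")  # strictly above
```

Note that \( \bar{X} \) attains its bound, while \( S^2 \) sits strictly above it: \( \operatorname{Var}(S^2) = 2\sigma^4/(n-1) > 2\sigma^4/n \).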
You're diving into the information equality result, which is foundational for the Cramér–Rao lower bound (CRLB). It links the variance of the score function with the expected curvature of the log-likelihood.
Let us go through each step in full clarity, with notation and interpretation.
This is a standard identity under regularity conditions:
\[ \begin{align*} \mathbb{E}_\theta\left[ \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right] &= \int \frac{\partial}{\partial \theta} \log f(x \mid \theta) \cdot f(x \mid \theta)\, dx \\ &= \int \frac{1}{f(x \mid \theta)} \cdot \frac{\partial f(x \mid \theta)}{\partial \theta} \cdot f(x \mid \theta)\, dx \\ &= \int \frac{\partial f(x \mid \theta)}{\partial \theta} \, dx \\ &= \frac{\partial}{\partial \theta} \int f(x \mid \theta) \, dx = \frac{\partial}{\partial \theta}(1) = 0 \end{align*} \]Conclusion: the expected score is zero.
The claim is:
\[ \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] = - \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] \]Expanding the second derivative of the log gives:
\[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) = \frac{\partial^2 f / \partial \theta^2}{f(x \mid \theta)} - \left( \frac{\partial f / \partial \theta}{f(x \mid \theta)} \right)^2 \]Now, under regularity:
\[ \int \frac{\partial^2 f(x \mid \theta)}{\partial \theta^2} dx = \frac{\partial^2}{\partial \theta^2} 1 = 0 \]So, taking expectations, the first term vanishes and the result becomes:
\[ \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] = - \int \frac{ (\partial f / \partial \theta)^2 }{ f } dx = - \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] \]Conclusion:
\[ \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] = - \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] \]Let \( \ell(\theta) = \log f(x \mid \theta) \). Then both sides equal the Fisher Information: \( \mathcal{I}(\theta) = \mathbb{E}_\theta[\ell'(\theta)^2] = -\mathbb{E}_\theta[\ell''(\theta)] \).
This quantifies how much information one observation gives about \(\theta\).
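As a quick numerical sanity check (my own sketch, using a Poisson(\( \lambda \)) observation rather than a model from the notes): both Monte Carlo estimates should come out to \( 1/\lambda \).

```python
import numpy as np

rng = np.random.default_rng(3)
lam, reps = 5.0, 1_000_000

x = rng.poisson(lam, size=reps)

score = x / lam - 1.0        # d/dλ log f(x|λ) for Poisson: x/λ - 1
second = -x / lam**2         # d²/dλ² log f(x|λ): -x/λ²

print(f"E[score²]  ≈ {np.mean(score**2):.5f}")
print(f"-E[second] ≈ {-np.mean(second):.5f}")
print(f"1/λ        = {1 / lam:.5f}")  # both estimates should match this
```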
Let’s walk through it clearly, with intuition, derivation, and a small example.
You wrote:
\[ L(\theta \mid x_1, x_2, \dots, x_n) = \sum_{i=1}^n \log f(x_i \mid \theta) \]This is the log-likelihood function for \( n \) i.i.d. observations from the model \( f(x \mid \theta) \).
Because:
\[ \begin{aligned} \text{Likelihood: } \mathcal{L}(\theta \mid x_1, \dots, x_n) &= \prod_{i=1}^n f(x_i \mid \theta) \\ \Rightarrow \log \mathcal{L}(\theta \mid x_1, \dots, x_n) &= \sum_{i=1}^n \log f(x_i \mid \theta) \end{aligned} \]Taking the log simplifies multiplication into summation.
Now differentiate the log-likelihood with respect to \( \theta \):
\[ \frac{\partial}{\partial \theta} L(\theta \mid x_1, \dots, x_n) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(x_i \mid \theta) \]Using the chain rule:
\[ = \sum_{i=1}^n \frac{1}{f(x_i \mid \theta)} \cdot \frac{\partial f(x_i \mid \theta)}{\partial \theta} \]This is the score function, often denoted:
\[ \nabla_\theta L(\theta \mid x_1, \dots, x_n) \]✅ Your symbolic expression is correct — it's the score function for the full sample.
This is crucial for: finding the MLE (setting the score to zero), computing the Fisher Information, and running gradient-based optimization.
Suppose \( X_1, \dots, X_n \sim \text{i.i.d. } N(\mu, \sigma^2) \) with \( \sigma^2 \) known. Then:
\[ f(x_i \mid \mu) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \] \[ \log f(x_i \mid \mu) = -\frac{1}{2} \log(2\pi \sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \]Summing over \( i \):
\[ L(\mu) = \sum_{i=1}^n \log f(x_i \mid \mu) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]Differentiating:
\[ \frac{d}{d\mu} L(\mu) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \]Set derivative = 0 for MLE:
\[ \sum_{i=1}^n (x_i - \mu) = 0 \quad \Rightarrow \quad \mu = \bar{x} \]✅ So the MLE for \( \mu \) is the sample mean \( \bar{x} \).
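As a symbolic check (a small sketch using sympy; the toy data are mine), differentiating the log-likelihood and solving the score equation recovers the sample mean:

```python
import sympy as sp

mu = sp.symbols("mu")
data = [2.0, 3.5, 1.5, 4.0]  # toy observations

# Log-likelihood of N(mu, 1) data, dropping the additive constant.
L = sum(-(xi - mu) ** 2 / 2 for xi in data)

score = sp.diff(L, mu)                  # dL/dmu = sum(x_i - mu)
mle = sp.solve(sp.Eq(score, 0), mu)[0]  # solve the score equation
print(mle, sum(data) / len(data))       # both print 2.75, the sample mean
```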
You're now discussing numerical optimization of the likelihood using the Newton–Raphson method, one of the core techniques to compute Maximum Likelihood Estimators (MLEs).
We'll break this into four parts for clarity:
The graph of the log-likelihood function \( L(\theta) \) often looks like a smooth, concave curve. The peak of this curve corresponds to the MLE \( \hat{\theta} \).
 L(θ)
  ^
  |         /\
  |        /  \
  |       /    \
  |______/______\________>
               θ̂         θ
At the maximum, the slope \( \frac{d}{d\theta} L(\theta) \) is zero — this is the first-order optimality condition.
The update:
\[ \hat{\theta}_{r+1} = \hat{\theta}_r + \eta_r \cdot \nabla_\theta L(\theta_r \mid x_1, \dots, x_n) \]This is a gradient ascent step, where:
This is not Newton–Raphson yet — just basic gradient ascent.
Newton–Raphson improves on gradient ascent by incorporating the second derivative:
The update rule becomes:
\[ \hat{\theta}_{r+1} = \hat{\theta}_r - \left[ \nabla^2_\theta L(\theta_r) \right]^{-1} \cdot \nabla_\theta L(\theta_r) \]✅ Near a maximum the Hessian \( \nabla^2_\theta L \) is negative definite, so \( -[\nabla^2_\theta L]^{-1} \nabla_\theta L \) still points uphill; the minus sign is what makes this an ascent step.
 L(θ)
  ^         *
  |        /|\
  |       / | \   ← Newton step lands at the peak
  |______/__|__\________>
            θ̂            θ
| Method | Uses | Pros | Cons |
|---|---|---|---|
| Gradient Ascent | Gradient only | Simple | May take many steps |
| Newton–Raphson | Gradient + Hessian | Quadratic convergence | Requires second derivative |
For well-behaved likelihoods (e.g., Normal), Newton–Raphson can converge in just a few steps.
Suppose \( X_1, \dots, X_n \sim \text{i.i.d. } N(\mu, \sigma^2) \) with \( \sigma^2 \) known, and we run Newton–Raphson on the log-likelihood for \( \mu \).
Log-likelihood:
\[ L(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 + \text{const} \]Then the gradient is \( \nabla_\mu L(\mu) = \frac{n}{\sigma^2}(\bar{x} - \mu) \) and the Hessian is \( \nabla^2_\mu L(\mu) = -\frac{n}{\sigma^2} \).
Newton–Raphson update:
\[ \mu_{r+1} = \mu_r - \left( -\frac{\sigma^2}{n} \right) \cdot \frac{n}{\sigma^2} (\bar{x} - \mu_r) = \mu_r + (\bar{x} - \mu_r) = \bar{x} \]✅ It converges in one step — because the log-likelihood is quadratic!
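When the log-likelihood is not quadratic, Newton–Raphson genuinely iterates. Here is a minimal sketch (my own example, assuming i.i.d. Poisson(\( \lambda \)) data, for which the MLE is \( \hat{\lambda} = \bar{x} \)):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.poisson(3.0, size=200)
n, s = len(x), x.sum()

lam = 1.0                        # deliberately poor starting value
for r in range(10):
    grad = s / lam - n           # dL/dλ for the Poisson log-likelihood
    hess = -s / lam**2           # d²L/dλ²
    step = -grad / hess          # Newton step: -H⁻¹ ∇L
    lam += step
    print(f"iter {r}: λ = {lam:.6f}")
    if abs(step) < 1e-10:
        break

print(f"MLE (sample mean) = {x.mean():.6f}")  # Newton converges here
```

In a few iterations \( \lambda \) reaches \( \bar{x} \), illustrating the quadratic convergence the table above describes.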
Imagine the log-likelihood function \( L(\theta) \) as a landscape — a smooth surface over the parameter space.
Your goal: reach the top (maximum likelihood estimate \( \hat{\theta} \)).
At any point \( \theta_r \), the gradient \( \nabla_\theta L(\theta_r) \) is a vector that points in the direction of steepest ascent, with magnitude equal to the local rate of increase.
So, to climb the hill toward the MLE, you step in the direction of the gradient:
\[ \hat{\theta}_{r+1} = \hat{\theta}_r + \eta_r \cdot \nabla_\theta L(\theta_r) \]Here, \( \eta_r > 0 \) is your step size — how far you move in that direction.
Let’s be precise:
Suppose you want to increase \( L(\theta) \) by moving a small distance \( \delta \theta \). The first-order change is:
\[ \Delta L(\theta) \approx \nabla_\theta L(\theta)^\top \delta \theta \]We want to choose \( \delta \theta \) of unit length (i.e., \( \| \delta \theta \| = 1 \)) that gives maximum increase:
\[ \max_{\| \delta \theta \| = 1} \nabla_\theta L(\theta)^\top \delta \theta \]By the Cauchy–Schwarz inequality, the solution to this maximization is:
\[ \delta \theta = \frac{ \nabla_\theta L(\theta) }{ \| \nabla_\theta L(\theta) \| } \]i.e., move in the direction of the gradient. ✅
Think of the log-likelihood surface like a hill in 3D.

        ^
       /|\    ← gradient points this way (uphill)
        |
        |     log-likelihood surface
   *----+---→
   current θ_r
So, in practice: compute the gradient at the current \( \theta_r \), take a step of size \( \eta_r \) along it, and repeat until the updates stabilize. A bare-bones version of this loop is sketched below.
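Here is that loop in code (a sketch, assuming \( N(\mu, 1) \) data and a fixed step size \( \eta \)):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, size=100)

mu, eta, eps = 0.0, 0.005, 1e-8   # start, step size, stopping tolerance
while True:
    grad = (x - mu).sum()          # ∇_μ L(μ) = Σ (x_i - μ) when σ = 1
    mu_next = mu + eta * grad      # climb in the gradient direction
    if abs(mu_next - mu) < eps:    # stop when the update stabilizes
        break
    mu = mu_next

print(f"gradient ascent: μ̂ ≈ {mu:.6f}, sample mean = {x.mean():.6f}")
```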
We now connect Fisher’s Information Matrix with the Hessian matrix of the log-likelihood function.
Let’s unpack this carefully and precisely.
Suppose \( \theta \in \mathbb{R}^k \) is a vector parameter and \( f(x \mid \theta) \) is a probability density function satisfying standard regularity conditions.
Then, the Fisher Information Matrix is defined as:
\[ \mathcal{I}(\theta) = \mathbb{E}_\theta \left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right) \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^\top \right] \]Under regularity conditions, the Fisher Information Matrix can also be written as the negative expected Hessian of the log-likelihood function:
\[ \mathcal{I}(\theta) = - \mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta \partial \theta^\top} \log f(x \mid \theta) \right] \]✅ This is what your note says:
\( \mathbb{E}\left( -\frac{ \partial^2 }{ \partial \theta \partial \theta^\top } \log f(x \mid \theta) \right) = \text{Fisher Information Matrix} \)
The Hessian of a scalar-valued function \( L(\theta) \), where \( \theta \in \mathbb{R}^k \), is the matrix of second partial derivatives:
\[ H(\theta) = \left[ \frac{ \partial^2 L(\theta) }{ \partial \theta_i \partial \theta_j } \right]_{i,j=1}^k \]So for the log-likelihood function \( \log f(x \mid \theta) \), the Hessian is:
\[ \frac{ \partial^2 }{ \partial \theta \partial \theta^\top } \log f(x \mid \theta) \]It’s a \( k \times k \) symmetric matrix, assuming smoothness.
You correctly stated:
(i, j)-th element of the Fisher Information matrix is the (i, j)-th element of the expected negative Hessian of \( \log f(x \mid \theta) \)
In symbols:
\[ [\mathcal{I}(\theta)]_{i,j} = - \mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f(x \mid \theta) \right] \]This captures the curvature of the likelihood in the \( (\theta_i, \theta_j) \) direction.
Because of the Fisher identity:
\[ \mathcal{I}(\theta) = \mathbb{E}_\theta \left[ \nabla_\theta \log f(x \mid \theta) \cdot \nabla_\theta \log f(x \mid \theta)^\top \right] = - \mathbb{E}_\theta \left[ \nabla^2_\theta \log f(x \mid \theta) \right] \]Both are equal only under regularity conditions, where we’re allowed to exchange integration and differentiation.
| Quantity | Meaning |
|---|---|
| \( \nabla_\theta \log f(x \mid \theta) \) | Score function (gradient of log-likelihood) |
| \( \nabla^2_\theta \log f(x \mid \theta) \) | Hessian matrix (second derivatives) |
| \( \mathcal{I}(\theta) \) | Fisher Information Matrix |
| \( [\mathcal{I}(\theta)]_{i,j} \) | Expected negative second derivative of log-likelihood w.r.t \( \theta_i \) and \( \theta_j \) |
You are essentially learning the geometry and calculus behind MLE:
🔁 In iterative methods (like Newton–Raphson, gradient ascent), we update:
θ̂_{r+1} = θ̂_r + η_r ⋅ ∇θ L(θ̂_r)
We stop when the estimate becomes stable:
‖θ̂_{r+1} − θ̂_r‖₂ < ε
✅ Example: If θ = (μ, σ)ᵗ, then:
‖θ̂_{r+1} − θ̂_r‖₂² = (μ̂_{r+1} − μ̂_r)² + (σ̂_{r+1} − σ̂_r)²
ℐ(θ) = E[ −∂²/∂θ ∂θᵗ log f(x | θ) ]
🔎 Intuition:
      log L(θ)
        ^
 High   |        ____
 info   |       /    \
        |      /      \
        |     /        \
 Low    |    /          \___
 info   +-------------------> θ
H(θ) = [ ∂²L / ∂θᵢ∂θⱼ ]
🔁 In Newton–Raphson:
θ̂_{r+1} = θ̂_r − H⁻¹ ∇θ L
ℐ(θ) = E_θ[−H(θ)] = E_θ[−∂²/∂θ∂θᵗ log f(x | θ)]
✅ Requires regularity conditions
For i.i.d. data:
X₁,…,Xₙ ~ i.i.d. f(x | θ)
L(θ) = ∑ log f(xᵢ | θ) ⇒
ℐ(θ) = n ⋅ E[−∂²/∂θ∂θᵗ log f(x₁ | θ)]
So more data ⇒ higher curvature ⇒ better certainty.
E[−∂²/∂θ∂θᵗ log f(x₁ | θ)] = per-sample Fisher Information
If θ̂ is unbiased:
Cov(θ̂) ⪰ ℐ⁻¹(θ)  (a matrix inequality, in the positive semi-definite sense)
In the scalar case:
Var(θ̂) ≥ 1 / ℐ(θ)
𝒥(θ) = −H = −∑ ∂²/∂θ∂θᵗ log f(xᵢ | θ)
Used in: estimating standard errors of the MLE, via
Var̂(θ̂) ≈ [𝒥(θ̂)]⁻¹
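For instance, a short sketch (my illustration, reusing the Bernoulli model from earlier): plugging the MLE into the observed information reproduces the familiar standard error √(p̂(1−p̂)/n).

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # toy Bernoulli data
n, p_hat = len(x), x.mean()

# Observed information at the MLE: J(p̂) = n·x̄/p̂² + n(1-x̄)/(1-p̂)².
J = n * p_hat / p_hat**2 + n * (1 - p_hat) / (1 - p_hat) ** 2
se = 1 / np.sqrt(J)               # Var̂(p̂) ≈ J⁻¹, so SE ≈ J^(-1/2)

print(f"p̂ = {p_hat:.3f}, SE ≈ {se:.4f}")
print(f"check: sqrt(p̂(1-p̂)/n) = {np.sqrt(p_hat * (1 - p_hat) / n):.4f}")
```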
| Concept | Formula | Interpretation |
|---|---|---|
| Gradient (Score) | ∇θ L(θ) | Direction of steepest ascent |
| Hessian | H(θ) = ∂²L / ∂θ ∂θᵗ | Curvature of log-likelihood |
| Fisher Info | ℐ(θ) = E[−H] | Expected curvature |
| Observed Info | 𝒥(θ) = −H | Actual curvature from data |
| CRLB | Cov(θ̂) ⪰ ℐ⁻¹(θ) | Lower bound on estimator covariance |
Recall that for Maximum Likelihood Estimation (MLE), we want to maximize the log-likelihood function:
$$L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$$
The Newton–Raphson update rule for finding the root of $\nabla_\theta L(\theta) = 0$ is:
$$\theta^{(r+1)} = \theta^{(r)} - \left[ H(\theta^{(r)}) \right]^{-1} \nabla_\theta L(\theta^{(r)})$$
In practice, $H(\theta)$ may be expensive to compute, numerically unstable, or fail to be negative definite away from the optimum.
To solve these issues, we replace the observed Hessian \( H(\theta) \) with its expectation:
$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[-\nabla_\theta^2 L(\theta)\right]$$
This is the Fisher Information Matrix. It's often easier to compute and is positive semi-definite.
The update rule becomes:
$$\theta^{(r+1)} = \theta^{(r)} + \left[\mathcal{I}(\theta^{(r)})\right]^{-1} \nabla_\theta L(\theta^{(r)})$$
🔁 Notice: the sign flips from $-$ to $+$ because $\mathcal{I}(\theta) = \mathbb{E}[-H(\theta)]$; replacing the observed Hessian $H$ by $-\mathcal{I}(\theta)$ turns the Newton step $-H^{-1}\nabla_\theta L$ into $+\mathcal{I}^{-1}\nabla_\theta L$.
Imagine you’re on a hill (log-likelihood surface): Newton–Raphson steers by the exact local curvature at your feet, which can mislead on bumpy terrain, while Fisher scoring steers by the average curvature of the terrain, giving smoother and more stable steps.
Let \( X_1, \dots, X_n \sim \text{Bern}(p)\) where \( 0 < p < 1 \)
Log-likelihood:
$$L(p) = \sum_{i=1}^n \left[ x_i \log p + (1 - x_i)\log(1 - p) \right]$$
Score function:
$$\frac{dL}{dp} = \frac{n\bar{X}}{p} - \frac{n(1 - \bar{X})}{1 - p}$$
Observed Information:
$$H(p) = -\frac{n\bar{X}}{p^2} - \frac{n(1 - \bar{X})}{(1 - p)^2}$$
Fisher Information:
$$\mathcal{I}(p) = \frac{n}{p(1 - p)}$$
Fisher Scoring update:
$$p^{(r+1)} = p^{(r)} + \left[ \frac{n}{p^{(r)}(1 - p^{(r)})} \right]^{-1} \left( \frac{n\bar{X}}{p^{(r)}} - \frac{n(1 - \bar{X})}{1 - p^{(r)}} \right)$$
Which simplifies to:
$$p^{(r+1)} = \bar{X}$$
🎯 That’s the MLE! Fisher scoring converges to it in one step for Bernoulli.
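A minimal sketch verifying this one-step convergence numerically (my own illustration; the starting value $p^{(0)}$ is arbitrary in $(0, 1)$):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.4, size=500)
n, xbar = len(x), x.mean()

p = 0.9                                          # arbitrary start in (0, 1)
score = n * xbar / p - n * (1 - xbar) / (1 - p)  # dL/dp at the current p
info = n / (p * (1 - p))                         # expected (Fisher) information
p_new = p + score / info                         # one Fisher scoring step

print(f"after one step: p = {p_new:.6f}")
print(f"sample mean x̄   = {xbar:.6f}")           # identical: one-step convergence
```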
| Method | Uses | Stability | Convergence Speed |
|---|---|---|---|
| Gradient Ascent | Gradient only | May overshoot | Slower |
| Newton–Raphson | Gradient + Hessian | Can be unstable | Fast if Hessian is well-behaved |
| Fisher Scoring | Gradient + Expected Hessian | More stable | Fast, smoother |
✅ Use it when: the expected information $\mathcal{I}(\theta)$ has a closed form, the observed Hessian is noisy or not negative definite, or stability matters more than raw speed.