Week-2 | Shaan

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Variance Inequality: Toward a General CRLB
Cramér–Rao Bound for IID Case
Summarising the simplifications
Cramér–Rao Inequality in the Discrete Case
Fisher Information Identity
Cramér–Rao Bound for Vector Parameters
Fisher Information Matrix
Cramér–Rao Lower Bound: Vector Parameter Case
Information Equality and Fisher Information Identity
Step-by-Step Breakdown: The Score Function
Newton–Raphson for MLE: Likelihood Maximization
On the Direction of Steepest Ascent
Connecting Fisher Information with the Hessian Matrix
Geometry and Calculus Behind MLE
Fisher’s Scoring Method

🧠 Variance Inequality: Toward a General CRLB

You’re discussing a variance inequality in estimation theory — a generalized form of the Cramér–Rao bound.

📦 1. Setup:

Let $ X_1, X_2, \dots, X_n $ be i.i.d. with density $ f(x|\theta) $.
Let $ \mathbf{X} = (X_1, \dots, X_n) $ and $ W(\mathbf{X}) $ be any statistic.
You aim to bound $ \operatorname{Var}_\theta(W(\mathbf{X})) $.

Key assumptions:

$ \operatorname{Var}_\theta(W(\mathbf{X})) < \infty $
$ \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] = \int \frac{\partial}{\partial\theta} \left( W(\mathbf{x}) f(\mathbf{x}|\theta) \right) d\mathbf{x} $

The second condition is a regularity condition — it lets you differentiate under the integral sign, essential for proving CR-type results.

🔎 2. The Inequality Itself:

Then, the variance of $ W(\mathbf{X}) $ satisfies the inequality:

\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{ \left( \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] \right)^2 }{ \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right] } \]

This is a generalized Cramér–Rao Lower Bound.

✅ Any statistic $ W(\mathbf{X}) $ that satisfies the two assumptions above, whether biased or unbiased, has variance at least as large as this bound.

ℹ️ Fisher Information

The denominator is the Fisher Information:

\[ \mathcal{I}(\theta) := \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right] \]

It quantifies how much information the sample $ \mathbf{X} $ provides about $ \theta $.

✂️ 3. Simplification When Unbiased

If $ W(\mathbf{X}) $ is unbiased, i.e.

\[ \mathbb{E}_\theta[W(\mathbf{X})] = \theta \quad \text{for all } \theta, \]

then:

\[ \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] = 1 \]

This simplifies the variance bound to:

\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{1}{\mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f(\mathbf{X}|\theta) \right)^2 \right]} \]

✅ This is the standard Cramér–Rao Lower Bound for unbiased estimators.

🔢 4. Concrete Example

Suppose:

$ \mathbb{E}(X_i) = \theta $ for all $ i $
Let $ W(\mathbf{X}) $ be the sample mean: $ \bar{X} = \frac{1}{n} \sum X_i $

Then clearly $ \mathbb{E}(W(\mathbf{X})) = \theta $, so this is an unbiased estimator.

✅ The simplified CRLB applies.

📌 Summary So Far:

A general inequality relates the variance of an estimator to the sensitivity of its expectation and the Fisher Information.
The inequality holds whenever $ W(\mathbf{X}) $ has finite variance and differentiation under the integral sign is justified.
If the estimator is unbiased, the bound takes a cleaner form.
Example: Sample mean of i.i.d. data with mean $ \theta $ satisfies the CRLB conditions.

🔍 Simplification (2): Cramér–Rao Bound for IID Case

🧩 1. Factorization of Likelihood for IID Samples

If $ X_1, \dots, X_n $ are i.i.d., then:

\[ f(\mathbf{x}|\theta) = \prod_{i=1}^{n} f(x_i|\theta) \quad \Rightarrow \quad \log f(\mathbf{x}|\theta) = \sum_{i=1}^{n} \log f(x_i|\theta) \]

This simplifies the log-likelihood and makes computing the score function much easier.

📘 2. Example: $ X_i \sim \mathcal{N}(\mu, 1) $

The density is:

\[ f(x_i|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2}\right) \]

Then the log-likelihood becomes:

\[ \log f(x_i|\mu) = -\frac{1}{2} \log(2\pi) - \frac{1}{2}(x_i - \mu)^2 \]

Differentiating with respect to $ \mu $:

\[ \frac{d}{d\mu} \log f(x_i|\mu) = x_i - \mu \]

So, the score function for the full sample is:

\[ \frac{d}{d\mu} \log f(\mathbf{x}|\mu) = \sum_{i=1}^n (x_i - \mu) \]

📐 3. Fisher Information

To compute the Fisher Information $ \mathcal{I}(\mu) $, we evaluate:

\[ \mathbb{E}_\mu \left[ \left( \sum_{i=1}^n (X_i - \mu) \right)^2 \right] \]

Because the $ X_i $ are independent and each $ X_i - \mu $ has mean zero, the cross terms $ \mathbb{E}_\mu[(X_i - \mu)(X_j - \mu)] $ with $ i \neq j $ vanish, leaving:

\[ = \sum_{i=1}^n \mathbb{E}_\mu[(X_i - \mu)^2] \]

Each term $ \mathbb{E}_\mu[(X_i - \mu)^2] = 1 $, so total Fisher Information is:

\[ \mathcal{I}(\mu) = n \]

✅ Conclusion: CRLB Achieved by Sample Mean

Estimator: Sample mean $ \bar{X} $
Unbiased: Yes
Variance of $ \bar{X} $: $ \frac{1}{n} $
Fisher Information: $ n $
CRLB: $ \frac{1}{n} $

✅ The sample mean achieves the bound, hence it is efficient.
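The efficiency claim can also be checked by simulation. The sketch below is only an illustration, not part of the lecture: the function name, the values $\mu = 2$ and $n = 10$, the number of replications, and the seed are all assumptions chosen for the demo. It estimates $\operatorname{Var}(\bar{X})$ by Monte Carlo and compares it with the bound $1/n$.

```python
import random
import statistics

def simulate_sample_mean_variance(mu, n, n_reps, seed=0):
    """Estimate Var(X-bar) for n i.i.d. N(mu, 1) draws by simulation."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_reps):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
    # Population variance of the simulated sample means.
    return statistics.pvariance(means)

n = 10
crlb = 1.0 / n  # 1 / Fisher information, since I(mu) = n
var_hat = simulate_sample_mean_variance(mu=2.0, n=n, n_reps=20000)
print(f"CRLB = {crlb:.4f}, simulated Var(X-bar) = {var_hat:.4f}")
```

The simulated variance should land close to $1/n$, up to Monte Carlo error; the agreement is approximate, not exact.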

✅ Putting It All Together

You've now made two key simplifications:

Simplification (1)

If $\mathbb{E}_\mu[W(\mathbf{X})] = \mu$, then:

$W(\mathbf{X})$ is an unbiased estimator of $\mu$
So, numerator in CRLB becomes 1

\[ \operatorname{Var}_\mu(W(\mathbf{X})) \geq \frac{1}{\mathbb{E}_\mu\left[ \left( \frac{d}{d\mu} \log f(\mathbf{X}|\mu) \right)^2 \right]} = \frac{1}{\text{Fisher Information}} \]

Simplification (2)

For the i.i.d. normal case:

The Fisher Information = $n$
So CRLB = $\frac{1}{n}$

❓ Immediate Question: Does the sample mean attain this lower bound?

Let’s answer it clearly.

The sample mean is: \[ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \]
Since $X_i \sim \mathcal{N}(\mu, 1)$, by standard properties: \[ \mathbb{E}[\bar{X}] = \mu, \quad \operatorname{Var}(\bar{X}) = \frac{1}{n} \]

✅ So the sample mean:

Is unbiased: satisfies Simplification (1)
Has variance $\frac{1}{n}$: exactly equals the CRLB from Simplification (2)

✅ Final Answer

Yes, the sample mean attains the Cramér–Rao lower bound in this case. Hence, it is an efficient estimator of $\mu$ for the normal model with known variance.
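A short worked example may help show that the general bound (with the derivative of the expectation in the numerator) is not limited to unbiased estimators. The scaled estimator $W_c(\mathbf{X}) = c\bar{X}$, with $c$ a fixed constant, is an illustrative assumption and does not appear in the lecture. For $X_i \sim \mathcal{N}(\mu, 1)$:

\[ \mathbb{E}_\mu[W_c] = c\mu \quad \Rightarrow \quad \frac{d}{d\mu} \mathbb{E}_\mu[W_c] = c, \qquad \operatorname{Var}_\mu(W_c) = \frac{c^2}{n} \]

\[ \frac{\left( \frac{d}{d\mu} \mathbb{E}_\mu[W_c] \right)^2}{\mathcal{I}(\mu)} = \frac{c^2}{n} = \operatorname{Var}_\mu(W_c) \]

So $W_c$ meets the general bound with equality for every $c$, although it is unbiased only when $c = 1$.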

🔁 Cramér–Rao Inequality in the Discrete Case

You're now pointing toward the discrete version of the Cramér–Rao bound, where we deal with PMFs instead of PDFs.

Let’s go through this step-by-step.

📦 Now in the discrete case:

Instead of a pdf $ f(x|\theta) $, you have a pmf $ p(x|\theta) $, where $ x \in \mathcal{X} $, a countable set.

🧠 Key Result:

Suppose:

$ X_1, \dots, X_n $ are i.i.d. with pmf $ p(x|\theta) $
$ W(\mathbf{X}) $ is a statistic satisfying:
- $ \mathbb{E}_\theta[W(\mathbf{X})] $ is differentiable
- The derivative of expectation can be interchanged with the sum: \[ \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] = \sum_{\mathbf{x}} W(\mathbf{x}) \frac{d}{d\theta} p(\mathbf{x}|\theta) \]

Then, the Cramér–Rao-type inequality holds:

\[ \operatorname{Var}_\theta(W(\mathbf{X})) \geq \frac{\left( \frac{d}{d\theta} \mathbb{E}_\theta[W(\mathbf{X})] \right)^2}{ \mathbb{E}_\theta\left[ \left( \frac{d}{d\theta} \log p(\mathbf{X}|\theta) \right)^2 \right]} \]

This is structurally identical to the continuous case — but all integrals are now sums over discrete outcomes.

✅ Example: Bernoulli Case

Let’s make this real.

$ X_1, \dots, X_n \sim \text{Bernoulli}(p) $
$ p \in (0, 1) $, the parameter to estimate
$ \mathbf{X} = (X_1, \dots, X_n) $

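PMF: $ p(x_i | p) = p^{x_i}(1-p)^{1 - x_i} $ for $ x_i \in \{0, 1\} $ (here $ p(\cdot \mid p) $ denotes the pmf, while the conditioning $ p $ is the parameter)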
Log-likelihood: \[ \log p(\mathbf{X}|p) = \sum_{i=1}^n \left[ x_i \log p + (1 - x_i) \log(1 - p) \right] \]
Score: \[ \frac{d}{dp} \log p(\mathbf{X}|p) = \sum_{i=1}^n \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right) \]
Fisher Information: \[ \mathcal{I}(p) = \mathbb{E}_p\left[ \left( \sum_{i=1}^n \left( \frac{X_i}{p} - \frac{1 - X_i}{1 - p} \right) \right)^2 \right] = \frac{n}{p(1 - p)} \]
Sample mean $ \bar{X} = \frac{1}{n} \sum X_i $ is unbiased
Its variance: $ \frac{p(1 - p)}{n} $

Then CRLB is:

\[ \operatorname{Var}(\hat{p}) \geq \frac{1}{\mathcal{I}(p)} = \frac{p(1 - p)}{n} \]

✅ The sample mean attains this → it is efficient.
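The Bernoulli calculation above can be reproduced directly from the definition of Fisher Information. The following is a minimal sketch under stated assumptions: the function name and the values $p = 0.3$, $n = 50$ are hypothetical, and the single-trial information is multiplied by $n$ using independence.

```python
def bernoulli_fisher_information(p, n=1):
    """Fisher information about p in n i.i.d. Bernoulli(p) trials.

    Computed from the definition E[(score)^2] by summing over x in {0, 1};
    independence makes the n-trial information n times the single-trial value.
    """
    total = 0.0
    for x in (0, 1):
        pmf = p if x == 1 else 1.0 - p
        score = x / p - (1 - x) / (1.0 - p)
        total += pmf * score ** 2
    return n * total

p, n = 0.3, 50
info = bernoulli_fisher_information(p, n)
crlb = 1.0 / info
var_sample_mean = p * (1.0 - p) / n
print(info, crlb, var_sample_mean)
```

The last two printed numbers should agree up to floating-point rounding, which is the efficiency statement for the sample mean.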

📌 Summary

The CRLB works analogously for discrete pmfs
Replace integrals with sums
Use $ \log p(x|\theta) $ instead of $ \log f(x|\theta) $
Example: Bernoulli trials → sample mean of i.i.d. Bernoulli ⇒ efficient estimator of $ p $

📘 Fisher Information Identity

We now state a very important identity in statistical inference — the Fisher Information identity, which links the variance of the score with the expected second derivative of the log-likelihood.

Let’s unpack it clearly.

📘 Statement of the Result:

If the regularity condition holds:

\[ \frac{d}{d\theta}\mathbb{E}_{\theta}\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = \int \left[\frac{\partial}{\partial\theta}\left[\left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right) f(x|\theta)\right]\right] dx, \]

then we can derive:

\[ \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right] \]

This is known as the Fisher Information equality, and both sides equal the Fisher Information $ \mathcal{I}(\theta) $.

📌 Why This Is True: Key Insight

Let $ \ell(\theta; X) = \log f(X|\theta) $, the log-likelihood. Then:

Score: $ \frac{\partial}{\partial\theta} \ell(\theta; X) $
Expectation of the score: \[ \mathbb{E}_\theta\left[\frac{\partial}{\partial\theta} \ell(\theta; X)\right] = 0 \] (as long as the pdf is regular enough and integrates to 1)

Differentiate under the integral sign:

\[ \frac{d}{d\theta} \mathbb{E}_\theta\left[\frac{\partial}{\partial\theta} \ell(\theta; X)\right] = 0 = \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial\theta^2} \ell(\theta; X) + \left( \frac{\partial}{\partial\theta} \ell(\theta; X) \right)^2 \right] \]

Rearranged:

\[ \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X|\theta)\right] \]

🔁 Interpretation

LHS = variance of the score (the expected squared score equals its variance because the score has mean zero)
RHS = negative expectation of the second derivative (curvature of log-likelihood)

Both quantify how much information the data contains about $ \theta $.

✅ Summary

You’re stating the Fisher Information identity, which tells us:

Under regularity conditions, the variance of the score function equals the negative expected second derivative of the log-likelihood.

\[ \mathcal{I}(\theta) = \operatorname{Var}_\theta\left( \frac{d}{d\theta} \log f(X|\theta) \right) = -\mathbb{E}_\theta\left[ \frac{d^2}{d\theta^2} \log f(X|\theta) \right] \]

This is foundational in deriving the Cramér–Rao Lower Bound, and in asymptotic theory (MLE consistency and normality).
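To see the identity at work on a concrete pmf, the sketch below evaluates both sides for a single Poisson($\lambda$) observation. The Poisson model, the function name, and the truncation point `k_max` are assumptions made for this illustration; they are not taken from the lecture.

```python
import math

def poisson_information_both_ways(lam, k_max=200):
    """Return (E[score^2], -E[second derivative]) for one Poisson(lam) draw.

    log p(k | lam) = k*log(lam) - lam - log(k!), so
    score = k/lam - 1 and second derivative = -k/lam^2.
    The infinite sum over k is truncated at k_max (an assumption that is
    harmless when lam is small relative to k_max).
    """
    lhs = 0.0
    rhs = 0.0
    for k in range(k_max + 1):
        log_pmf = k * math.log(lam) - lam - math.lgamma(k + 1)
        pmf = math.exp(log_pmf)
        score = k / lam - 1.0
        second = -k / lam ** 2
        lhs += pmf * score ** 2
        rhs += -pmf * second
    return lhs, rhs

lhs, rhs = poisson_information_both_ways(lam=4.0)
print(lhs, rhs, 1.0 / 4.0)
```

Both sums should match $1/\lambda$, the Fisher Information of the Poisson model, up to rounding and the negligible truncation error.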

🧠 Cramér–Rao Bound for Vector Parameters

This section covers the multivariate (vector parameter) version of the Cramér–Rao Lower Bound (CRLB), which generalizes the scalar case using matrix calculus and the Fisher information matrix.

Let’s break it down.

Let:

$\boldsymbol{\theta} \in \mathbb{R}^k$: vector parameter
$\mathbf{X} \sim f(\mathbf{x}|\boldsymbol{\theta})$
$\mathbf{W}(\mathbf{X}) \in \mathbb{R}^m$: vector-valued statistic
$\mathbf{W}(\mathbf{X})$ has finite variance and differentiable expectation w.r.t. $\boldsymbol{\theta}$

📘 Multivariate CRLB (Matrix Form):

\[ \operatorname{Var}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left( \frac{d}{d\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right) \cdot \mathcal{I}(\boldsymbol{\theta})^{-1} \cdot \left( \frac{d}{d\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right)^\top \]

📌 Notation Breakdown:

$\operatorname{Var}_{\boldsymbol{\theta}}(\mathbf{W}) \in \mathbb{R}^{m \times m}$: Covariance matrix of the estimator
$\frac{d}{d\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}] \in \mathbb{R}^{m \times k}$: Jacobian matrix of expectation
$\mathcal{I}(\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}$: Fisher Information matrix
$\succeq$: Matrix inequality meaning the difference is positive semi-definite

⚙️ Meaning:

This inequality tells you that the covariance matrix of any estimator of a vector parameter is bounded from below by a matrix formed from:

The sensitivity of the estimator (Jacobian of expectation)
The inverse Fisher Information matrix

✅ Special Case: Unbiased Estimator

If $\mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] = \boldsymbol{\theta}$, then:

Jacobian = Identity matrix $I_k$

So, the bound becomes:

\[ \operatorname{Var}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1} \]

Just like in the scalar case, but extended to full matrix structure.

📘 Fisher Information Matrix $\mathcal{I}(\boldsymbol{\theta})$

Let’s now define the Fisher Information Matrix $\mathcal{I}(\boldsymbol{\theta})$ in the vector parameter case.

Let:

$\boldsymbol{\theta} \in \mathbb{R}^k$: vector parameter
$\mathbf{X} \sim f(\mathbf{x} | \boldsymbol{\theta})$
Assume $f(\mathbf{x} | \boldsymbol{\theta})$ satisfies suitable regularity conditions (e.g., to allow differentiation under the integral)

✅ The Fisher Information Matrix is defined as:

\[ \mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \]

🔍 Where:

$\nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) \in \mathbb{R}^k$: the score vector \[ \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X}|\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial}{\partial\theta_1} \log f(\mathbf{X}|\boldsymbol{\theta}) \\ \vdots \\ \frac{\partial}{\partial\theta_k} \log f(\mathbf{X}|\boldsymbol{\theta}) \end{bmatrix} \]
The outer product produces a $k \times k$ matrix
So, the Fisher Information Matrix is a positive semi-definite matrix of size $k \times k$

🧠 Alternate Form (if second derivatives are easier to compute):

Under stronger regularity conditions, we can also write:

\[ \mathcal{I}(\boldsymbol{\theta}) = -\mathbb{E}_{\boldsymbol{\theta}} \left[ \nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X}|\boldsymbol{\theta}) \right] \]

Here, $\nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X}|\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}$ is the Hessian matrix of second-order partial derivatives.
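Both forms of the Fisher Information Matrix can be compared numerically on a two-parameter model. The sketch below is an illustration under assumptions: it uses a single $\mathcal{N}(\mu, \sigma^2)$ observation with $\boldsymbol{\theta} = (\mu, \sigma^2)$, hand-derived score and Hessian expressions, and a trapezoidal rule on a wide grid; the function name and grid settings are hypothetical.

```python
import math

def normal_fisher_matrices(mu, var, half_width=12.0, steps=20001):
    """Numerically integrate both forms of the Fisher matrix for N(mu, var).

    The parameter vector is theta = (mu, var). Returns two 2x2 nested lists:
    E[score score^T] and -E[Hessian], each approximated with the trapezoidal
    rule on [mu - half_width*sd, mu + half_width*sd].
    """
    sd = math.sqrt(var)
    lo = mu - half_width * sd
    h = 2.0 * half_width * sd / (steps - 1)
    outer = [[0.0, 0.0], [0.0, 0.0]]
    neg_hess = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(steps):
        x = lo + i * h
        d = x - mu
        weight = h * (0.5 if i in (0, steps - 1) else 1.0)
        dens = math.exp(-d * d / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        # Score: derivatives of log f with respect to mu and var.
        score = [d / var, -0.5 / var + d * d / (2.0 * var ** 2)]
        # Hessian: second derivatives of log f with respect to (mu, var).
        hess = [[-1.0 / var, -d / var ** 2],
                [-d / var ** 2, 0.5 / var ** 2 - d * d / var ** 3]]
        for a in range(2):
            for b in range(2):
                outer[a][b] += weight * dens * score[a] * score[b]
                neg_hess[a][b] -= weight * dens * hess[a][b]
    return outer, neg_hess

outer, neg_hess = normal_fisher_matrices(mu=1.0, var=2.0)
expected = [[1.0 / 2.0, 0.0], [0.0, 1.0 / (2.0 * 2.0 ** 2)]]
print(outer)
print(neg_hess)
print(expected)
```

For this model the two numerical matrices should agree with each other and with the closed form $\operatorname{diag}\left(1/\sigma^2,\ 1/(2\sigma^4)\right)$, up to integration error.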

🧠 Cramér–Rao Lower Bound: Vector Parameter Case

Here's the full form of the Cramér–Rao Lower Bound (CRLB) in the vector parameter case, complete with all definitions, assumptions, and mathematical structure.

Let:

$\boldsymbol{\theta} \in \mathbb{R}^k$: vector-valued parameter
$\mathbf{X} \in \mathbb{R}^n$: random sample with joint pdf (or pmf) $f(\mathbf{x} | \boldsymbol{\theta})$
$\mathbf{W}(\mathbf{X}) \in \mathbb{R}^m$: statistic estimating a function of $\boldsymbol{\theta}$
All relevant expectations, gradients, and Fisher Information are well-defined and finite

📘 General Matrix Form of the CRLB:

\[ \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left( \frac{\partial}{\partial \boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right) \cdot \mathcal{I}(\boldsymbol{\theta})^{-1} \cdot \left( \frac{\partial}{\partial \boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] \right)^\top \]

🧩 Term Definitions:

$\operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}) \in \mathbb{R}^{m \times m}$: Covariance matrix of the estimator
$\frac{\partial}{\partial \boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}] \in \mathbb{R}^{m \times k}$: Jacobian matrix of expected estimator wrt $\boldsymbol{\theta}$
$\mathcal{I}(\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}$: Fisher Information Matrix

🎯 Fisher Information Matrix Definition:

Let the score vector be:

\[ \mathbf{S}(\mathbf{X}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \log f(\mathbf{X} | \boldsymbol{\theta}) \in \mathbb{R}^k \]

Then:

\[ \mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\left[ \mathbf{S}(\mathbf{X}; \boldsymbol{\theta}) \mathbf{S}(\mathbf{X}; \boldsymbol{\theta})^\top \right] \]

Or, under regularity conditions:

\[ \mathcal{I}(\boldsymbol{\theta}) = - \mathbb{E}_{\boldsymbol{\theta}}\left[ \nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{X} | \boldsymbol{\theta}) \right] \]

✅ Special Case: Unbiased Estimator

If $\mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] = \boldsymbol{\theta}$, then the Jacobian is identity, and:

\[ \operatorname{Cov}_{\boldsymbol{\theta}}(\mathbf{W}) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1} \]

This matches the scalar case:

\[ \operatorname{Var}_{\theta}(W) \geq \frac{1}{\mathcal{I}(\theta)} \]
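The matrix bound $J \, \mathcal{I}^{-1} J^\top$ can be evaluated explicitly in a small case. The sketch below assumes $n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ observations with $\boldsymbol{\theta} = (\mu, \sigma^2)$ and uses two standard results that are not derived in these notes: $\mathcal{I}_n(\boldsymbol{\theta}) = n \cdot \operatorname{diag}\left(1/\sigma^2,\ 1/(2\sigma^4)\right)$, and $\operatorname{Var}(S^2) = 2\sigma^4/(n-1)$ for the unbiased sample variance $S^2$. The helper functions and numeric values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as nested lists."""
    return [list(row) for row in zip(*a)]

def inverse_2x2(m):
    """Invert a 2x2 matrix with nonzero determinant."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def crlb_matrix(jacobian, fisher):
    """Lower bound J I^{-1} J^T for a 2-parameter model."""
    return matmul(matmul(jacobian, inverse_2x2(fisher)), transpose(jacobian))

n, var = 20, 3.0
# Fisher information of n i.i.d. N(mu, var) draws for theta = (mu, var).
fisher_n = [[n / var, 0.0], [0.0, n / (2.0 * var ** 2)]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # Jacobian for an unbiased estimator of theta
bound = crlb_matrix(identity, fisher_n)

# Covariance of (sample mean, unbiased sample variance) under normality.
cov_estimator = [[var / n, 0.0], [0.0, 2.0 * var ** 2 / (n - 1)]]
gap = [[cov_estimator[i][j] - bound[i][j] for j in range(2)] for i in range(2)]
print(bound)
print(gap)
```

Under these assumptions the gap matrix is diagonal with non-negative entries, so it is positive semi-definite: the sample mean attains its part of the bound, while $S^2$ sits slightly above it.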

🧠 Setup: Vector CRLB with Fisher Info Substituted

$\boldsymbol{\theta} \in \mathbb{R}^p$: vector parameter
$\mathbf{X}$: data with PDF/PMF $f(\mathbf{x}|\boldsymbol{\theta})$
$\mathbf{W}(\mathbf{X}) \in \mathbb{R}^k$: unbiased estimator of $\boldsymbol{\psi}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})]$

✅ CRLB (Multivariate Version):

\[ \text{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left[ \frac{ \partial \boldsymbol{\psi}(\boldsymbol{\theta}) }{ \partial \boldsymbol{\theta} } \right] \left[ \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \right]^{-1} \left[ \frac{ \partial \boldsymbol{\psi}(\boldsymbol{\theta}) }{ \partial \boldsymbol{\theta} } \right]^\top \]

🔍 Explanation of Components:

$\frac{ \partial \boldsymbol{\psi}(\boldsymbol{\theta}) }{ \partial \boldsymbol{\theta} } \in \mathbb{R}^{k \times p}$: Jacobian of $\boldsymbol{\psi}(\boldsymbol{\theta})$
Middle matrix is the Fisher Information Matrix: \[ \mathbf{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \]

📌 Summary:

The multivariate CRLB states:

\[ \boxed{ \text{Cov}_{\boldsymbol{\theta}}(\mathbf{W}(\mathbf{X})) \succeq \left[ \frac{ \partial \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] }{ \partial \boldsymbol{\theta} } \right] \left[ \mathbb{E}_{\boldsymbol{\theta}} \left[ \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right) \left( \frac{ \partial }{ \partial \boldsymbol{\theta} } \log f(\mathbf{X}|\boldsymbol{\theta}) \right)^\top \right] \right]^{-1} \left[ \frac{ \partial \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{W}(\mathbf{X})] }{ \partial \boldsymbol{\theta} } \right]^\top } \]

This is a matrix inequality: the LHS is the covariance matrix of the estimator, and the RHS is the lower bound (in the positive semi-definite sense).

✨ Information Equality and Fisher Information Identity

You're diving into the information equality result, which is foundational for the Cramér–Rao lower bound (CRLB). It links the variance of the score function with the expected curvature of the log-likelihood.

Let us go through each step in full, with notation and an interpretation at the end.

🔶 The Setup

$ f(x \mid \theta) $ : parametric family of densities
Regularity conditions assumed (interchanging integration and differentiation allowed)
$ x $ is a random variable drawn from $f(x \mid \theta)$
$ \theta \in \mathbb{R} $: scalar parameter

🧰 Step 1: Expectation of the Score Function is Zero

\[ \mathbb{E}_\theta\left[ \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right] = 0 \]

This is a standard identity under regularity conditions:

\[ \begin{align*} \mathbb{E}_\theta\left[ \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right] &= \int \frac{\partial}{\partial \theta} \log f(x \mid \theta) \cdot f(x \mid \theta)\, dx \\ &= \int \frac{1}{f(x \mid \theta)} \cdot \frac{\partial f(x \mid \theta)}{\partial \theta} \cdot f(x \mid \theta)\, dx \\ &= \int \frac{\partial f(x \mid \theta)}{\partial \theta} \, dx \\ &= \frac{\partial}{\partial \theta} \int f(x \mid \theta) \, dx = \frac{\partial}{\partial \theta}(1) = 0 \end{align*} \]

Conclusion: the expected score is zero.

🔹 Step 2: Variance of Score Function = Negative Expectation of Second Derivative

The claim is:

\[ \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] = - \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] \]

📌 LHS: Variance of the Score Function

\[ \begin{align*} \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] &= \int \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 f(x \mid \theta)\, dx \\ &= \int \left( \frac{ \partial f(x \mid \theta)/\partial \theta }{ f(x \mid \theta) } \right)^2 f(x \mid \theta)\, dx \\ &= \int \left( \frac{ \partial f(x \mid \theta) }{ \partial \theta } \right)^2 \cdot \frac{1}{f(x \mid \theta)}\, dx \end{align*} \]

📌 RHS: Negative Expectation of Second Derivative

\[ \begin{align*} \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] &= \int \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \cdot f(x \mid \theta)\, dx \\ &= \int \left[ \frac{ \partial^2 f }{ \partial \theta^2 } - \frac{ (\partial f/\partial \theta)^2 }{ f } \right] dx \end{align*} \]

Now, under regularity:

\[ \int \frac{\partial^2 f(x \mid \theta)}{\partial \theta^2} dx = \frac{\partial^2}{\partial \theta^2} 1 = 0 \]

So the result becomes:

\[ - \int \frac{ (\partial f / \partial \theta)^2 }{ f } dx \]

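Conclusion: comparing with the LHS computed above,

\[ \mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] = - \int \frac{ (\partial f / \partial \theta)^2 }{ f } dx = - \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right], \]

which is the claimed identity.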

📊 Visual Interpretation

Let $ \ell(\theta) = \log f(x \mid \theta) $. Then:

$ \ell'(\theta) $: score function — slope of log-likelihood
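$ \ell''(\theta) $: curvature, which measures how peaked or flat the log-likelihood is
In expectation, the negative curvature equals the variance of the slope: $ -\mathbb{E}_\theta[\ell''(\theta)] = \operatorname{Var}_\theta(\ell'(\theta)) $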

ℹ Fisher Information

\[ \mathcal{I}(\theta) = \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 \right] = -\mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) \right] \]

This quantifies how much information one observation gives about $\theta$.

📆 Summary

This is the Fisher Information identity
Backbone of the Cramér–Rao Inequality
Relies on regularity conditions (to switch integral and derivative)
Captures the connection between score and curvature
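As a quick sanity check, the identity can be worked by hand for a model not used elsewhere in these notes. The exponential density $f(x \mid \theta) = \theta e^{-\theta x}$, $x > 0$, is chosen here purely for illustration:

\[ \log f(x \mid \theta) = \log\theta - \theta x, \qquad \frac{\partial}{\partial \theta} \log f(x \mid \theta) = \frac{1}{\theta} - x, \qquad \frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta) = -\frac{1}{\theta^2} \]

Using $\mathbb{E}_\theta[x] = 1/\theta$ and $\operatorname{Var}_\theta(x) = 1/\theta^2$:

\[ \mathbb{E}_\theta\left[ \frac{1}{\theta} - x \right] = 0, \qquad \mathbb{E}_\theta\left[ \left( \frac{1}{\theta} - x \right)^2 \right] = \operatorname{Var}_\theta(x) = \frac{1}{\theta^2} = -\mathbb{E}_\theta\left[ -\frac{1}{\theta^2} \right] \]

Both sides give $\mathcal{I}(\theta) = 1/\theta^2$, in line with Steps 1 and 2.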

🔷 Step-by-Step Breakdown: The Score Function

Let’s walk through it clearly, with intuition, derivation, and a small example.

📘 Log-Likelihood Expression

You wrote:

\[ L(\theta \mid x_1, x_2, \dots, x_n) = \sum_{i=1}^n \log f(x_i \mid \theta) \]

This is the log-likelihood function for $ n $ i.i.d. observations from the model $ f(x \mid \theta) $.

🧠 Why Sum of Logs?

Because:

\[ \begin{aligned} \text{Likelihood: } \mathcal{L}(\theta \mid x_1, \dots, x_n) &= \prod_{i=1}^n f(x_i \mid \theta) \\ \Rightarrow \log \mathcal{L}(\theta \mid x_1, \dots, x_n) &= \sum_{i=1}^n \log f(x_i \mid \theta) \end{aligned} \]

Taking the log simplifies multiplication into summation.

🔧 Score Function (Gradient of Log-Likelihood)

Now differentiate the log-likelihood with respect to $ \theta $:

\[ \frac{\partial}{\partial \theta} L(\theta \mid x_1, \dots, x_n) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(x_i \mid \theta) \]

Using the chain rule:

\[ = \sum_{i=1}^n \frac{1}{f(x_i \mid \theta)} \cdot \frac{\partial f(x_i \mid \theta)}{\partial \theta} \]

This is the score function, often denoted:

\[ \nabla_\theta L(\theta \mid x_1, \dots, x_n) \]

✅ Your symbolic expression is correct — it's the score function for the full sample.

✅ Summary

\[ \nabla_\theta L(\theta) = \sum_{i=1}^n \frac{ \frac{\partial}{\partial \theta} f(x_i \mid \theta) }{ f(x_i \mid \theta) } = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(x_i \mid \theta) \]

This is crucial for:

Maximum Likelihood Estimation (MLE): Set score = 0 and solve for $ \theta $
Fisher Information for $ n $ samples: $ \mathcal{I}_n(\theta) = n \mathcal{I}(\theta) $
Asymptotic theory of MLEs (e.g., consistency, normality)

📊 Small Example: Normal Model with Known Variance

Suppose:

$ X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2) $
$ \sigma^2 $ known; estimate $ \mu $

Then:

\[ f(x_i \mid \mu) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \] \[ \log f(x_i \mid \mu) = -\frac{1}{2} \log(2\pi \sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \]

Summing over $ i $:

\[ L(\mu) = \sum_{i=1}^n \log f(x_i \mid \mu) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]

Differentiating:

\[ \frac{d}{d\mu} L(\mu) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \]

Set derivative = 0 for MLE:

\[ \sum_{i=1}^n (x_i - \mu) = 0 \quad \Rightarrow \quad \mu = \bar{x} \]

✅ So the MLE for $ \mu $ is the sample mean $ \bar{x} $.
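The derivation above can be mirrored in a few lines of code. This is a minimal sketch, assuming a small made-up data set and a known variance; the function names and numbers are hypothetical. It checks the analytic score against a central finite difference of the log-likelihood and confirms that the score vanishes at the sample mean.

```python
def log_likelihood(mu, xs, sigma2):
    """Normal log-likelihood in mu, dropping the constant -n/2 * log(2*pi*sigma2)."""
    return -sum((x - mu) ** 2 for x in xs) / (2.0 * sigma2)

def score(mu, xs, sigma2):
    """Derivative of the log-likelihood with respect to mu."""
    return sum(x - mu for x in xs) / sigma2

xs = [2.1, 1.7, 3.4, 2.9, 2.4]   # hypothetical data
sigma2 = 1.5                     # assumed known variance
x_bar = sum(xs) / len(xs)

# Central finite difference as an independent check on the analytic score.
eps = 1e-6
mu0 = 1.0
numeric = (log_likelihood(mu0 + eps, xs, sigma2)
           - log_likelihood(mu0 - eps, xs, sigma2)) / (2.0 * eps)

print(score(mu0, xs, sigma2), numeric)
print(score(x_bar, xs, sigma2))
```

The first line should show two nearly equal numbers, and the second a value that is zero up to rounding.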

🔷 Newton–Raphson for MLE: Likelihood Maximization

You're now discussing numerical optimization of the likelihood using the Newton–Raphson method, one of the core techniques to compute Maximum Likelihood Estimators (MLEs).

We'll break this into four parts for clarity:

🔹 1. Understanding the Graph of $ L(\theta) $ vs $ \theta $

The graph of the log-likelihood function $ L(\theta) $ often looks like a smooth, concave curve. The peak of this curve corresponds to the MLE $ \hat{\theta} $.

        ^
     L(θ)
        |          /\
        |         /  \
        |        /    \
        |_______/______\________>
                    θ̂

At the maximum, the slope $ \frac{d}{d\theta} L(\theta) $ is zero — this is the first-order optimality condition.

🔧 2. Gradient Ascent (Steepest Ascent)

The update:

\[ \hat{\theta}_{r+1} = \hat{\theta}_r + \eta_r \cdot \nabla_\theta L(\hat{\theta}_r \mid x_1, \dots, x_n) \]

This is a gradient ascent step, where:

$ \eta_r $ is a step size (learning rate)
$ \nabla_\theta L(\theta) $ is the gradient (score function)
You’re moving in the direction of increasing likelihood

This is not Newton–Raphson yet — just basic gradient ascent.
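A gradient ascent loop for the simplest case might look as follows. This is a sketch under assumptions, not the lecture's own procedure: it uses the normal model with known variance, a fixed step size instead of a schedule $\eta_r$, and made-up data; the function name is hypothetical.

```python
def gradient_ascent_normal_mean(xs, sigma2, mu0, step, n_iter):
    """Fixed-step gradient ascent on the N(mu, sigma2) log-likelihood in mu."""
    mu = mu0
    path = [mu]
    for _ in range(n_iter):
        grad = sum(x - mu for x in xs) / sigma2   # score at the current mu
        mu = mu + step * grad                     # move uphill
        path.append(mu)
    return path

xs = [2.1, 1.7, 3.4, 2.9, 2.4]   # hypothetical data
sigma2 = 1.5                     # assumed known variance
# Hypothetical step size; for this model it must stay below 2*sigma2/n to converge.
step = 0.1
path = gradient_ascent_normal_mean(xs, sigma2, mu0=0.0, step=step, n_iter=50)
print(path[:4], path[-1])
```

With these settings each iteration shrinks the distance to the sample mean by a constant factor, so the iterates approach $\bar{x}$ gradually rather than in one jump.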

🧠 3. Newton–Raphson Update

Newton–Raphson improves on gradient ascent by incorporating the second derivative:

Gradient: $ \nabla_\theta L(\theta) $
Hessian: $ \nabla^2_\theta L(\theta) $

The update rule becomes:

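\[ \hat{\theta}_{r+1} = \hat{\theta}_r - \left[ \nabla^2_\theta L(\hat{\theta}_r) \right]^{-1} \cdot \nabla_\theta L(\hat{\theta}_r) \]

✅ Why the minus sign: for a concave log-likelihood the Hessian is negative definite, so $ -\left[ \nabla^2_\theta L \right]^{-1} $ is positive definite and the step still moves uphill, with the gradient rescaled by the inverse curvature.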

📌 What's the Intuition?

The first derivative tells you the direction to move
The second derivative tells you how much to move (curvature)

     ^
L(θ) |         *
     |         ^
     |        /|\
     |       / | \    ← Newton step lands at the peak
     |______/__|__\________>
               θ

💡 Newton vs Gradient Ascent

Method	Uses	Pros	Cons
Gradient Ascent	Gradient only	Simple	May take many steps
Newton–Raphson	Gradient + Hessian	Quadratic convergence	Requires second derivative

For well-behaved likelihoods (e.g., Normal), Newton–Raphson can converge in just a few steps.

🔢 Example: Normal Model with Known Variance

Suppose:

$ X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2) $
$ \sigma^2 $ is known; find MLE for $ \mu $

Log-likelihood:

\[ L(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 + \text{const} \]

Then:

$ \nabla_\mu L = \frac{1}{\sigma^2} \sum (x_i - \mu) = \frac{n}{\sigma^2} (\bar{x} - \mu) $
$ \nabla^2_\mu L = -\frac{n}{\sigma^2} $

Newton–Raphson update:

\[ \mu_{r+1} = \mu_r - \left( -\frac{\sigma^2}{n} \right) \cdot \frac{n}{\sigma^2} (\bar{x} - \mu_r) = \mu_r + (\bar{x} - \mu_r) = \bar{x} \]

✅ It converges in one step — because the log-likelihood is quadratic!
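When the log-likelihood is not quadratic, Newton–Raphson needs more than one step. The sketch below illustrates this on a model chosen for the demo rather than taken from the lecture: a Bernoulli sample parameterized by the log-odds $\theta = \log\frac{p}{1-p}$, with $s$ successes in $n$ trials and $0 < s < n$ assumed. The function name, starting value, and tolerance are hypothetical.

```python
import math

def newton_raphson_logit(successes, n, theta0=0.0, tol=1e-12, max_iter=50):
    """Newton-Raphson for theta = log(p / (1 - p)) in a Bernoulli model.

    Log-likelihood: L(theta) = s*theta - n*log(1 + exp(theta)), with s successes.
    Assumes 0 < s < n so that a finite maximizer exists.
    """
    theta = theta0
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + math.exp(-theta))
        gradient = successes - n * p            # score
        hessian = -n * p * (1.0 - p)            # always negative: L is concave
        new_theta = theta - gradient / hessian  # Newton-Raphson update
        if abs(new_theta - theta) < tol:
            return new_theta, iteration
        theta = new_theta
    return theta, max_iter

theta_hat, n_steps = newton_raphson_logit(successes=7, n=20)
p_hat = 1.0 / (1.0 + math.exp(-theta_hat))
print(theta_hat, p_hat, n_steps)
```

Mapping the result back through the logistic function should recover the sample proportion $s/n$, and the iteration count stays small because the error roughly squares at each step near the maximum.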

✅ Summary

You’re using gradient ascent to maximize $ L(\theta) $
Newton–Raphson is more efficient by using curvature (Hessian)
It’s standard in MLE, logistic regression, Poisson models, etc.

On the Direction of Steepest Ascent:

🔷 1. Why the Direction of Steepest Ascent?

Imagine the log-likelihood function $ L(\theta) $ as a landscape — a smooth surface over the parameter space.
Your goal: reach the top (maximum likelihood estimate $ \hat{\theta} $).

At any point $ \theta_r $, the gradient $ \nabla_\theta L(\theta_r) $ is a vector that:

Points in the direction where $ L(\theta) $ increases most rapidly.
Has a magnitude equal to the rate of increase in that direction.

So, to climb the hill toward the MLE, you step in the direction of the gradient:

\[ \hat{\theta}_{r+1} = \hat{\theta}_r + \eta_r \cdot \nabla_\theta L(\hat{\theta}_r) \]

Here, $ \eta_r > 0 $ is your step size — how far you move in that direction.

🔶 2. What Makes Gradient the Steepest Direction?

Let’s be precise:

Suppose you want to increase $ L(\theta) $ by moving a small distance $ \delta \theta $. The first-order change is:

\[ \Delta L(\theta) \approx \nabla_\theta L(\theta)^\top \delta \theta \]

We want to choose $ \delta \theta $ of unit length (i.e., $ \| \delta \theta \| = 1 $) that gives maximum increase:

\[ \max_{\| \delta \theta \| = 1} \nabla_\theta L(\theta)^\top \delta \theta \]

The solution to this maximization is:

\[ \delta \theta = \frac{ \nabla_\theta L(\theta) }{ \| \nabla_\theta L(\theta) \| } \]

i.e., move in the direction of the gradient. ✅
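The maximization over unit directions can be checked by brute force in two dimensions. The sketch below assumes a made-up gradient vector $(3, 4)$ and a grid of angles; the function name and grid size are hypothetical. It scans unit vectors and reports the one with the largest first-order gain.

```python
import math

def best_unit_direction(gradient, n_angles=3600):
    """Scan unit vectors in 2D and return the one maximizing gradient . direction."""
    best_gain, best_dir = -math.inf, None
    for k in range(n_angles):
        angle = 2.0 * math.pi * k / n_angles
        direction = (math.cos(angle), math.sin(angle))
        gain = gradient[0] * direction[0] + gradient[1] * direction[1]
        if gain > best_gain:
            best_gain, best_dir = gain, direction
    return best_dir, best_gain

gradient = (3.0, 4.0)   # hypothetical gradient of L at the current point
norm = math.hypot(*gradient)
normalized = (gradient[0] / norm, gradient[1] / norm)
best_dir, best_gain = best_unit_direction(gradient)
print(best_dir, normalized)
print(best_gain, norm)
```

Up to the grid resolution, the best direction coincides with the normalized gradient and the best gain with the gradient norm, as the Cauchy–Schwarz argument predicts.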

📊 Visual Intuition (2D Case)

Think of the log-likelihood surface like a hill in 3D.

The gradient vector at your current point points uphill in the steepest direction.
If you move orthogonally to the gradient, you stay at the same level (iso-likelihood curve).
Moving against the gradient leads you downhill — which you want to avoid during maximization.

      ^
     /|\\         ← Gradient points this way
    / |
   /  |             Log-Likelihood surface
  *---→
Current θ_r

🔁 Comparison with Newton–Raphson

Steepest ascent: always follows the direction of greatest immediate gain.
Newton–Raphson: modifies the direction based on the curvature, sometimes avoiding overshooting or zigzagging.

So, in practice:

Gradient ascent is more robust, especially when the curvature is poorly behaved or Hessians are hard to compute.
Newton–Raphson is faster when curvature information is reliable and the function is well-behaved (e.g., Gaussian likelihood).

✅ Summary

We prefer the direction of steepest ascent because, to first order, it gives the largest increase in the log-likelihood per unit step length.
The gradient gives this direction.
The step size $ \eta_r $ controls how aggressively we move — too small = slow; too large = unstable.
This approach is the foundation for many modern learning algorithms, including stochastic gradient ascent in MLE for big data or machine learning.

🔷 Connecting Fisher Information with the Hessian Matrix

We now connect the Fisher Information Matrix with the Hessian matrix of the log-likelihood function.

Let’s unpack this carefully and precisely.

🔷 1. Fisher Information Matrix (Multivariate Version)

Suppose $ \theta \in \mathbb{R}^k $ is a vector parameter and $ f(x \mid \theta) $ is a probability density function satisfying standard regularity conditions.

Then, the Fisher Information Matrix is defined as:

\[ \mathcal{I}(\theta) = \mathbb{E}_\theta \left[ \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right) \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^\top \right] \]

This is a $ k \times k $ positive semi-definite matrix.
It measures how much information the data carries about each component of $ \theta $, and how the components interact.

🔶 2. Alternative Definition Using the Hessian (Second Derivatives)

Under regularity conditions, the Fisher Information Matrix can also be written as the negative expected Hessian of the log-likelihood function:

\[ \mathcal{I}(\theta) = - \mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta \partial \theta^\top} \log f(x \mid \theta) \right] \]

✅ This is what your note says:

$ \mathbb{E}\left( -\frac{ \partial^2 }{ \partial \theta \partial \theta^\top } \log f(x \mid \theta) \right) = \text{Fisher Information Matrix} $

📌 3. What Is the Hessian Matrix?

The Hessian of a scalar-valued function $ L(\theta) $, where $ \theta \in \mathbb{R}^k $, is the matrix of second partial derivatives:

\[ H(\theta) = \left[ \frac{ \partial^2 L(\theta) }{ \partial \theta_i \partial \theta_j } \right]_{i,j=1}^k \]

So for the log-likelihood function $ \log f(x \mid \theta) $, the Hessian is:

\[ \frac{ \partial^2 }{ \partial \theta \partial \theta^\top } \log f(x \mid \theta) \]

It’s a $ k \times k $ symmetric matrix, assuming smoothness.

🔷 4. The (i,j)-th Element of Fisher Info = Expected Negative Hessian Entry

You correctly stated:

(i, j)-th element of the Fisher Information matrix is the (i, j)-th element of the expected negative Hessian of $ \log f(x \mid \theta) $

In symbols:

\[ [\mathcal{I}(\theta)]_{i,j} = - \mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f(x \mid \theta) \right] \]

This captures the curvature of the likelihood in the $ (\theta_i, \theta_j) $ direction.
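
As a worked illustration (the model is chosen for the example, not taken from the lecture), let $ X \sim \mathcal{N}(\mu, \sigma^2) $ with $ \theta = (\mu, \sigma^2)^\top $, so that $ \log f(x \mid \theta) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2} $. The second partial derivatives are:

\[ \frac{\partial^2 \log f}{\partial \mu^2} = -\frac{1}{\sigma^2}, \qquad \frac{\partial^2 \log f}{\partial \mu \, \partial \sigma^2} = -\frac{x-\mu}{\sigma^4}, \qquad \frac{\partial^2 \log f}{\partial (\sigma^2)^2} = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6} \]

Taking negative expectations entry by entry, with $ \mathbb{E}[X-\mu] = 0 $ and $ \mathbb{E}[(X-\mu)^2] = \sigma^2 $:

\[ \mathcal{I}(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{1}{2\sigma^4} \end{pmatrix} \]

The zero off-diagonal entry says that, in this model, the data carry no shared information between $ \mu $ and $ \sigma^2 $.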

🧠 Why Two Equivalent Definitions?

Because of the Fisher identity:

\[ \mathcal{I}(\theta) = \mathbb{E}_\theta \left[ \nabla_\theta \log f(x \mid \theta) \cdot \nabla_\theta \log f(x \mid \theta)^\top \right] = - \mathbb{E}_\theta \left[ \nabla^2_\theta \log f(x \mid \theta) \right] \]

The two expressions are equal only under regularity conditions that allow us to exchange integration and differentiation.
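
The identity can be checked numerically in the scalar case. The sketch below assumes a single Bernoulli($p$) observation, so every expectation is an exact two-term sum over $ x \in \{0, 1\} $; the helper names are illustrative.

```python
def score(x, p):
    """First derivative of log f(x | p) for one Bernoulli observation."""
    return x / p - (1 - x) / (1 - p)


def second_derivative(x, p):
    """Second derivative of log f(x | p) for one Bernoulli observation."""
    return -x / p**2 - (1 - x) / (1 - p) ** 2


def expectation(g, p):
    """E[g(X, p)] for X ~ Bernoulli(p), summing over x in {0, 1}."""
    return (1 - p) * g(0, p) + p * g(1, p)


p = 0.3
outer_product_form = expectation(lambda x, q: score(x, q) ** 2, p)  # E[score^2]
hessian_form = -expectation(second_derivative, p)                   # -E[second derivative]
closed_form = 1 / (p * (1 - p))                                     # known per-sample value
```

All three quantities agree up to floating-point error, which is the scalar version of the identity above.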

✅ Summary

Quantity	Meaning
$ \nabla_\theta \log f(x \mid \theta) $	Score function (gradient of log-likelihood)
$ \nabla^2_\theta \log f(x \mid \theta) $	Hessian matrix (second derivatives)
$ \mathcal{I}(\theta) $	Fisher Information Matrix
$ [\mathcal{I}(\theta)]_{i,j} $	Expected negative second derivative of log-likelihood w.r.t $ \theta_i $ and $ \theta_j $

🧭 Overview

You are essentially learning the geometry and calculus behind MLE:

When do we stop iterating for an estimate?
How do we measure “information” in the data?
What are the limits of accuracy for unbiased estimators?

1. ✅ Stopping Criteria in Iterative Estimation

🔁 Iterative methods (like Newton–Raphson or gradient ascent) update the estimate at every step. For gradient ascent the update is:

θ̂_{r+1} = θ̂_r + η_r ⋅ ∇_θ L(θ̂_r)

We stop when the estimate becomes stable:

‖θ̂_{r+1} − θ̂_r‖₂ < ε

✅ Example: If θ = (μ, σ)^T, then:

‖θ̂_{r+1} − θ̂_r‖₂² = (μ̂_{r+1} − μ̂_r)² + (σ̂_{r+1} − σ̂_r)²
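
A minimal sketch of this stopping rule, assuming iterates are stored as tuples such as (μ̂, σ̂); the function name `has_converged` and the default tolerance are illustrative choices.

```python
import math


def has_converged(theta_new, theta_old, eps=1e-6):
    """True when the Euclidean distance between successive iterates is below eps."""
    return math.dist(theta_new, theta_old) < eps


# theta = (mu, sigma): compare two successive iterates
still_moving = has_converged((1.30, 0.80), (1.30, 0.90))           # distance 0.1
settled = has_converged((1.0 + 3e-7, 2.0 + 4e-7), (1.0, 2.0))      # distance about 5e-7
```

In practice ε is chosen relative to the scale of θ; a fixed value such as 1e-6 is only a placeholder here.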

2. 📘 Fisher Information Matrix ℐ(θ)

ℐ(θ) = E[ −∂²/∂θ ∂θᵗ log f(x | θ) ]

🔎 Intuition:

Very curved log-likelihood ⇒ precise estimate (low variance)
Flat log-likelihood ⇒ uncertain estimate (high variance)


       log L(θ)
           ^
           |
   High    |     ____
   info    |   /    \
           |  /      \
           | /        \
   Low     |/          \___
           +-------------------> θ

3. 🧮 Hessian Matrix H

H(θ) = [ ∂²L / ∂θᵢ∂θⱼ ]

Measures curvature of L(θ)
Scalar case: just d²L/dθ²
Vector θ ⇒ k×k symmetric matrix

🔁 In Newton–Raphson:

θ̂_{r+1} = θ̂_r − H⁻¹(θ̂_r) ∇_θ L(θ̂_r)

4. 🔗 Link Between Hessian and Fisher Information

ℐ(θ) = E_θ[−H(θ)] = E_θ[−∂²/∂θ∂θᵗ log f(x | θ)]

✅ Requires regularity conditions

For i.i.d. Data:

X₁,…,Xₙ ~ i.i.d. f(x | θ)


L(θ) = ∑ log f(xᵢ | θ) ⇒
ℐ(θ) = n ⋅ E[−∂²/∂θ∂θᵗ log f(x₁ | θ)]

So more data ⇒ higher curvature ⇒ better certainty

Here E[−∂²/∂θ∂θᵗ log f(x₁ | θ)] is the per-sample Fisher Info.
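
As a short worked example of this scaling (the Poisson model is an assumption made for illustration), take $ X_1, \dots, X_n \sim \text{Poisson}(\lambda) $. For one observation, $ \log f(x \mid \lambda) = x \log \lambda - \lambda - \log x! $, so:

\[ \frac{\partial^2}{\partial \lambda^2} \log f(x \mid \lambda) = -\frac{x}{\lambda^2} \quad \Rightarrow \quad \mathbb{E}_\lambda\left[ \frac{X_1}{\lambda^2} \right] = \frac{1}{\lambda} \]

The per-sample information is $ 1/\lambda $, and for the full sample:

\[ \mathcal{I}(\lambda) = n \cdot \frac{1}{\lambda} = \frac{n}{\lambda} \]

Doubling $ n $ doubles the information, which halves the variance bound $ \lambda / n $.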

5. 🎯 Cramér–Rao Lower Bound (CRLB)

If θ̂ is unbiased:

Var(θ̂) ≥ ℐ⁻¹(θ)   (matrix sense: Var(θ̂) − ℐ⁻¹(θ) is positive semi-definite)

In scalar case:

Var(θ̂) ≥ 1 / ℐ(θ)

The CRLB is the theoretical lower bound on the variance of unbiased estimators
An unbiased estimator that attains it is called "efficient"
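
A small simulation can illustrate the bound. The sketch assumes Bernoulli($p$) data, for which the full-sample information is $ n / (p(1-p)) $, and estimates the variance of the sample mean by Monte Carlo; the seed, sample size, and replication count are arbitrary illustrative choices.

```python
import random


def sample_mean_variance(p, n, replications, seed=0):
    """Monte Carlo estimate of Var(X-bar) for n Bernoulli(p) draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(replications):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        means.append(successes / n)
    grand_mean = sum(means) / replications
    # unbiased sample variance of the simulated means
    return sum((m - grand_mean) ** 2 for m in means) / (replications - 1)


p, n = 0.3, 50
crlb = p * (1 - p) / n  # 1 / I(p), with I(p) = n / (p (1 - p))
simulated = sample_mean_variance(p, n, replications=20000)
```

The simulated variance should sit close to `crlb`, consistent with the sample mean being an efficient estimator of $p$ in this model.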

6. 🧾 Observed Information Matrix

𝒥(θ) = −H = −∑ ∂²/∂θ∂θᵗ log f(xᵢ | θ)

Used in:

Newton-Raphson updates (actual curvature)
MLE variance estimation:

Var̂(θ̂) ≈ [𝒥(θ̂)]⁻¹
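
As a sketch of how this variance estimate is used, assume Bernoulli data, where minus the second derivative of the log-likelihood is (number of successes)/p² + (number of failures)/(1 − p)². The data values and function name below are made up for illustration.

```python
import math


def observed_information(data, p):
    """J(p) = -H(p) for Bernoulli data: minus the second derivative of the log-likelihood."""
    successes = sum(data)
    failures = len(data) - successes
    return successes / p**2 + failures / (1 - p) ** 2


data = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]  # 4 successes in 10 trials (hypothetical)
p_hat = sum(data) / len(data)          # MLE of p
variance_estimate = 1 / observed_information(data, p_hat)
standard_error = math.sqrt(variance_estimate)
```

For this model the estimate reduces to p̂(1 − p̂)/n, the familiar plug-in variance of a sample proportion.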

✅ Summary Table

Concept	Formula	Interpretation
Gradient (Score)	∇_θ L(θ)	Direction of steepest ascent
Hessian	H(θ) = ∂²L / ∂θ ∂θᵗ	Curvature of log-likelihood
Fisher Info	ℐ(θ) = E[−H]	Expected curvature
Observed Info	𝒥(θ) = −H	Actual curvature from data
CRLB	Var(θ̂) ≥ ℐ⁻¹(θ)	Lower bound on estimator variance

🔷 1. Background: Newton–Raphson in MLE

Recall that for Maximum Likelihood Estimation (MLE), we want to maximize the log-likelihood function:

$$L(\theta) = \sum_{i=1}^n \log f(x_i \mid \theta)$$

The Newton–Raphson update rule for finding the root of $\nabla_\theta L(\theta) = 0$ is:

$$\theta^{(r+1)} = \theta^{(r)} - \left[ H(\theta^{(r)}) \right]^{-1} \nabla_\theta L(\theta^{(r)})$$

$\nabla_\theta L(\theta)$ is the score function
$H(\theta) = \nabla_\theta^2 L(\theta)$ is the Hessian matrix
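
A minimal sketch of this update in one dimension, assuming an Exponential($\lambda$) sample with density $ \lambda e^{-\lambda x} $, so that $ L(\lambda) = n \log \lambda - \lambda \sum x_i $. The model, data, and function name are illustrative assumptions.

```python
def newton_raphson_exponential(data, start, tol=1e-12, max_iter=100):
    """Newton-Raphson for the rate of an Exponential(lam) sample."""
    n, total = len(data), sum(data)
    lam = start
    for iteration in range(1, max_iter + 1):
        gradient = n / lam - total   # score: dL/dlam
        hessian = -n / lam**2        # second derivative of L
        new_lam = lam - gradient / hessian
        if abs(new_lam - lam) < tol:
            return new_lam, iteration
        lam = new_lam
    return lam, max_iter


data = [0.5, 1.5, 1.0, 2.0, 1.0]  # hypothetical sample, mean 1.2
lam_hat, iterations = newton_raphson_exponential(data, start=0.5)
mle = len(data) / sum(data)       # closed-form MLE: 1 / sample mean
```

Here the update simplifies to $ \lambda \leftarrow 2\lambda - \bar{x}\lambda^2 $, which converges in a handful of steps from a start in $ (0, 2/\bar{x}) $. A start of 2.0 would give $ 4 - 1.2 \cdot 4 = -0.8 $, outside the parameter space, which is one way the instability mentioned in these notes can show up.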

🔷 2. Problem with Newton–Raphson

In practice, $H(\theta)$ may:

Be difficult to compute or invert
Be non-invertible at some points
Lead to unstable convergence

🔷 3. Fisher’s Scoring: Motivation

To address these issues, we replace the negative Hessian $ -H(\theta) $ (the observed information) with its expectation:

$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[-\nabla_\theta^2 L(\theta)\right]$$

This is the Fisher Information Matrix. It is often easier to compute and is always positive semi-definite, so whenever it is invertible the resulting step points uphill.

🔷 4. Fisher’s Scoring Update Rule

The update rule becomes:

$$\theta^{(r+1)} = \theta^{(r)} + \left[\mathcal{I}(\theta^{(r)})\right]^{-1} \nabla_\theta L(\theta^{(r)})$$

🔁 Notice:

Update direction uses the gradient (score)
Scaled by inverse Fisher Information, not Hessian
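
A case where the difference from Newton–Raphson matters is the Cauchy location model, where the Hessian can change sign but the Fisher information of a sample of size $n$ (unit scale) is the constant $ n/2 $. The sketch below assumes that model with simulated data; the seed, sample size, and function names are illustrative.

```python
import math
import random


def cauchy_score(data, theta):
    """Gradient of the Cauchy(theta, scale=1) log-likelihood."""
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in data)


def fisher_scoring_cauchy(data, tol=1e-10, max_iter=500):
    """Fisher scoring for the Cauchy location parameter."""
    n = len(data)
    information = n / 2            # expected information, does not depend on theta
    theta = sorted(data)[n // 2]   # start near the sample median
    for _ in range(max_iter):
        step = cauchy_score(data, theta) / information
        theta += step
        if abs(step) < tol:
            break
    return theta


rng = random.Random(1)
true_location = 3.0
# inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy
data = [true_location + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(200)]
theta_hat = fisher_scoring_cauchy(data)
```

Because the scaling factor is a fixed positive number, every step moves in the direction of the score, and no second derivatives are evaluated at all.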

🔷 5. Visualization: Geometric Insight

Imagine you’re on a hill (log-likelihood surface):

Gradient Ascent: step in steepest direction
Newton–Raphson: adjusts for curvature (Hessian)
Fisher Scoring: uses average curvature — smoother steps

🔷 6. Example: Bernoulli Distribution

Let $ X_1, \dots, X_n \sim \text{Bern}(p)$ where $ 0 < p < 1 $

Log-likelihood:

$$L(p) = \sum_{i=1}^n \left[ x_i \log p + (1 - x_i)\log(1 - p) \right]$$

Score function:

$$\frac{dL}{dp} = \frac{n\bar{X}}{p} - \frac{n(1 - \bar{X})}{1 - p}$$

Hessian (the observed information is $ -H(p) $):

$$H(p) = -\frac{n\bar{X}}{p^2} - \frac{n(1 - \bar{X})}{(1 - p)^2}$$

Fisher Information:

$$\mathcal{I}(p) = \frac{n}{p(1 - p)}$$

Fisher Scoring update:

$$p^{(r+1)} = p^{(r)} + \left[ \frac{n}{p^{(r)}(1 - p^{(r)})} \right]^{-1} \left( \frac{n\bar{X}}{p^{(r)}} - \frac{n(1 - \bar{X})}{1 - p^{(r)}} \right)$$

Which simplifies to:

$$p^{(r+1)} = \bar{X}$$

🎯 That’s the MLE! Fisher scoring converges to it in one step for Bernoulli.
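
The one-step property is easy to confirm numerically. This sketch codes the update above directly; the values of $n$, $\bar{X}$, and the starting points are arbitrary illustrative choices.

```python
def fisher_scoring_step(p, n, x_bar):
    """One Fisher scoring update for Bernoulli data."""
    score = n * x_bar / p - n * (1 - x_bar) / (1 - p)
    information = n / (p * (1 - p))
    return p + score / information


n, x_bar = 20, 0.35
# a single update from three different starting values
updates = [fisher_scoring_step(start, n, x_bar) for start in (0.1, 0.5, 0.9)]
```

Every entry of `updates` equals $\bar{X}$ up to rounding, whatever the starting value in $ (0, 1) $.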

🔷 7. Comparison Table

Method	Uses	Stability	Convergence Speed
Gradient Ascent	Gradient only	May overshoot	Slower
Newton–Raphson	Gradient + Hessian	Can be unstable	Fast when the Hessian is well-behaved
Fisher Scoring	Gradient + Expected Hessian	More stable	Fast, smoother

🔷 8. When to Use Fisher Scoring?

✅ Use it when:

Likelihood function is well-behaved but observed Hessian is messy
Working with GLMs — Fisher scoring is standard for IRLS
You need stable convergence with positive curvature

