Statistical Methods III: Week-3

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Question

Likelihood-based estimation for the Infant Mortality Rate (IMR), modeled with Poisson distributions across 3 decades. Work through the problem step by step, with intuition and visual thinking.

  • Assume \(Y_i \sim \text{Poisson}(\mu_i)\) with \(\mu_i=\alpha\,\gamma^{\,i-1}\) for \(i=1,2,3\).
  • Form the joint likelihood and log-likelihood.
  • Derive the score equations w.r.t. \(\alpha\) and \(\gamma\).
  • Compute the Hessian and discuss when mixed partials are equal.
  • State the gradient vector and provide interpretation and a visualization idea.

Solution

Step 1: Model Setup

  • \(Y_i \sim \text{Poisson}(\mu_i)\), \(i=1,2,3\).
  • \(\mu_i=\alpha\,\gamma^{\,i-1}\).
  • Interpretations:
    • \(\alpha\): baseline mortality rate (first decade).
    • \(\gamma\): multiplicative growth/decay factor across decades.
      • \(\gamma>1\): rate increases each decade.
      • \(\gamma<1\): rate decreases (improves) each decade.
      • \(\gamma=1\): constant rate across decades.
    • Mortality rate follows a multiplicative trend across decades.

Intuition: think of \(\alpha\) as the starting height and \(\gamma\) as the common ratio of a geometric progression driving the decade-to-decade change.

Step 2: Likelihood Function

  • Because of independence, \[ f(\vec y\mid \alpha,\gamma)=\prod_{i=1}^{3}\frac{e^{-\alpha \gamma^{\,i-1}}\big(\alpha \gamma^{\,i-1}\big)^{y_i}}{y_i!}. \]
  • Log-likelihood: \[ L(\alpha,\gamma\mid \vec y)= -\sum_{i=1}^{3}\alpha \gamma^{\,i-1} +\sum_{i=1}^{3} y_i \log\!\big(\alpha \gamma^{\,i-1}\big) -\sum_{i=1}^{3}\log(y_i!). \]
  • Use \(\log(\alpha \gamma^{\,i-1})=\log\alpha+(i-1)\log\gamma\) to simplify derivatives.
  • The term \(-\sum \log(y_i!)\) is constant in \((\alpha,\gamma)\) for optimization.
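
The last two bullets are easy to check in code. Below is a minimal Python sketch of the log-likelihood (dropping the constant \(-\sum\log y_i!\) term); the function name `log_lik` and the example counts are hypothetical, chosen only for illustration.

```python
import numpy as np

def log_lik(alpha, gamma, y):
    """Poisson log-likelihood for mu_i = alpha * gamma**(i-1), i = 1..len(y),
    omitting the additive constant -sum(log(y_i!))."""
    i = np.arange(1, len(y) + 1)
    mu = alpha * gamma ** (i - 1)
    return np.sum(-mu + y * (np.log(alpha) + (i - 1) * np.log(gamma)))

# Hypothetical decade counts, for illustration only.
y = np.array([30.0, 24.0, 20.0])
print(log_lik(50.0, 0.8, y))
```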

Step 3: Score Equations (First Derivatives)

  • With respect to \(\alpha\): \[ \frac{\partial L}{\partial \alpha} = -\sum_{i=1}^{3}\gamma^{\,i-1} + \frac{1}{\alpha}\sum_{i=1}^{3} y_i. \]
    • First term: the pull from the expected counts, \(\sum_i \gamma^{\,i-1}=\sum_i \mu_i/\alpha\).
    • Second term: the pull from the observed counts, scaled by \(1/\alpha\).
  • With respect to \(\gamma\): \[ \frac{\partial L}{\partial \gamma} = -\sum_{i=1}^{3} (i-1)\alpha \gamma^{\,i-2} + \frac{1}{\gamma}\sum_{i=1}^{3}(i-1) y_i. \]
    • First term: effect of \(\gamma\) on expected mortality (weighted by \(\alpha\)).
    • Second term: the pull from the observed counts, weighted by decade index \((i-1)\) and scaled by \(1/\gamma\).
  • Set both equal to zero to obtain the MLEs of \((\alpha,\gamma)\).
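
A sketch of the two score components, coded directly from the bullets above (same conventions as the log-likelihood sketch; the helper name `score` is ours):

```python
def score(alpha, gamma, y):
    """Score vector (dL/d_alpha, dL/d_gamma) for the IMR model."""
    i = np.arange(1, len(y) + 1)
    d_alpha = -np.sum(gamma ** (i - 1)) + np.sum(y) / alpha
    d_gamma = -np.sum((i - 1) * alpha * gamma ** (i - 2)) + np.sum((i - 1) * y) / gamma
    return np.array([d_alpha, d_gamma])
```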

Step 4: Hessian (Second Derivatives)

  • \[ \frac{\partial^{2} L}{\partial \alpha^{2}} = -\frac{1}{\alpha^{2}}\sum_{i=1}^{3} y_i \quad \text{(always negative ⇒ concave in \(\alpha\)).} \]
  • \[ \frac{\partial^{2} L}{\partial \alpha\,\partial \gamma} = -\sum_{i=1}^{3}(i-1)\gamma^{\,i-2}. \]
  • \[ \frac{\partial^{2} L}{\partial \gamma^{2}} = -\sum_{i=1}^{3}(i-1)(i-2)\alpha \gamma^{\,i-3} - \frac{1}{\gamma^{2}}\sum_{i=1}^{3}(i-1)y_i. \]
  • Hessian matrix: \[ H= \begin{pmatrix} \dfrac{\partial^{2} L}{\partial \alpha^{2}} & \dfrac{\partial^{2} L}{\partial \alpha\,\partial \gamma}\\[6pt] \dfrac{\partial^{2} L}{\partial \gamma\,\partial \alpha} & \dfrac{\partial^{2} L}{\partial \gamma^{2}} \end{pmatrix}. \]
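
The Hessian entries translate line by line into code; again a sketch, with a hypothetical `hessian` helper matching the matrix above:

```python
def hessian(alpha, gamma, y):
    """Hessian of the log-likelihood with respect to (alpha, gamma)."""
    i = np.arange(1, len(y) + 1)
    h_aa = -np.sum(y) / alpha ** 2
    h_ag = -np.sum((i - 1) * gamma ** (i - 2))
    h_gg = (-np.sum((i - 1) * (i - 2) * alpha * gamma ** (i - 3))
            - np.sum((i - 1) * y) / gamma ** 2)
    return np.array([[h_aa, h_ag], [h_ag, h_gg]])
```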

Step 5: Symmetry of Mixed Partials

  • By Clairaut’s theorem (twice continuously differentiable \(L\)), mixed partials are equal: \[ \frac{\partial^{2} L}{\partial \alpha\,\partial \gamma} = \frac{\partial^{2} L}{\partial \gamma\,\partial \alpha}. \]
  • If differing expressions appear, they arise from algebraic slips, not a failure of symmetry here.

Step 6: Gradient Vector (Summary)

  • \[ \nabla L(\alpha,\gamma)= \begin{pmatrix} -\sum_{i=1}^{3}\gamma^{\,i-1} + \dfrac{1}{\alpha}\sum_{i=1}^{3} y_i\\[6pt] -\sum_{i=1}^{3}(i-1)\alpha \gamma^{\,i-2} + \dfrac{1}{\gamma}\sum_{i=1}^{3}(i-1) y_i \end{pmatrix}. \]
  • Solve \(\nabla L=0\) for the MLEs of \((\alpha,\gamma)\).
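
Setting \(\nabla L=0\) can also be handed to a generic root-finder. A minimal sketch using SciPy's `fsolve`, reusing the `score` helper sketched in Step 3; the counts and starting values are hypothetical.

```python
import numpy as np
from scipy.optimize import fsolve

# Hypothetical decade counts, for illustration only.
y = np.array([30.0, 24.0, 20.0])

# Root-find the score equations; start from (y_1, y_2 / y_1) as a rough guess.
alpha_hat, gamma_hat = fsolve(lambda t: score(t[0], t[1], y), x0=[y[0], y[1] / y[0]])
print(alpha_hat, gamma_hat)
```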

Viz

  • Bar chart of IMR across decades. Example with \(\alpha=50\):
    • \(\gamma=0.8 \Rightarrow [50,\,40,\,32]\) (declining trend).
    • \(\gamma=1.2 \Rightarrow [50,\,60,\,72]\) (growing trend).
  • Interpretation: the slope of bars tracks \(\gamma\); the first bar anchors at \(\alpha\).
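
A minimal matplotlib sketch of the bar-chart idea, using the \(\alpha=50\) example above:

```python
import numpy as np
import matplotlib.pyplot as plt

alpha = 50.0
decades = np.arange(1, 4)
for gamma, label in [(0.8, "declining"), (1.2, "growing")]:
    mu = alpha * gamma ** (decades - 1)   # [50, 40, 32] and [50, 60, 72]
    plt.bar(decades + (0.2 if gamma > 1 else -0.2), mu, width=0.4,
            label=f"gamma = {gamma} ({label})")
plt.xlabel("Decade")
plt.ylabel("Expected IMR")
plt.legend()
plt.show()
```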

Summary / Insight

This is a two-parameter Poisson trend model where \(\alpha\) sets the initial level and \(\gamma\) controls geometric change across decades. The log-likelihood is concave in \(\alpha\) and has well-behaved mixed partials; the score equations cleanly separate expected vs. observed contributions, making interpretation straightforward.

Numerical Optimization Algorithms

Newton–Raphson for MLE

  • Purpose: Find parameter values where the score function (gradient of log-likelihood) is zero.
  • Update rule: \[ \theta^{(n+1)} = \theta^{(n)} - H^{-1} \nabla L \] where:
    • \(\theta = (\alpha, \gamma)\) in the IMR case.
    • \(\nabla L\) = score vector.
    • \(H\) = Hessian (matrix of 2nd derivatives).
  • Geometric picture: Standing on a mountain surface (log-likelihood). Newton–Raphson uses local curvature to jump toward the peak.
  • Problem: If the likelihood surface is bumpy or flat, the Hessian may be unstable → divergence can occur.
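
A minimal Newton–Raphson sketch for the IMR model, assuming the `score` and `hessian` helpers from the earlier sketches are in scope; the starting values and counts are hypothetical.

```python
import numpy as np

def newton_raphson(y, alpha0, gamma0, tol=1e-8, max_iter=100):
    """Newton-Raphson on the (alpha, gamma) log-likelihood: theta <- theta - H^{-1} grad."""
    theta = np.array([alpha0, gamma0], dtype=float)
    for _ in range(max_iter):
        g = score(theta[0], theta[1], y)
        H = hessian(theta[0], theta[1], y)
        step = np.linalg.solve(H, g)      # H^{-1} grad without forming the inverse
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Hypothetical decade counts, illustration only.
y = np.array([30.0, 24.0, 20.0])
print(newton_raphson(y, alpha0=y[0], gamma0=y[1] / y[0]))
```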

Fisher’s Scoring

  • Modification: Replace the observed information (\(-H\)) with its expectation, the Fisher Information \(I=\mathbb E[-H]\).
  • Update rule: \[ \theta^{(n+1)} = \theta^{(n)} + I^{-1} \nabla L \]
  • Why?
    • Expected curvature often simpler to compute.
    • Provides more stable convergence.
    • Fisher Information is positive definite (under regularity) → every step is an ascent direction.
  • Visualization: Newton uses the actual local curvature at your current point (can be lumpy). Fisher’s Scoring uses the curvature averaged over possible data (smoother) → more reliable convergence.
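
Because every second derivative in Step 4 is linear in the counts \(y_i\), the expected Hessian for the IMR model can be obtained by substituting \(\mathbb E[Y_i]=\alpha\gamma^{\,i-1}\) for \(y_i\). A Fisher-scoring sketch built on that observation, again assuming the earlier `score` and `hessian` helpers:

```python
import numpy as np

def fisher_scoring(y, alpha0, gamma0, tol=1e-8, max_iter=100):
    """Fisher scoring: theta <- theta + I^{-1} grad, with I = -E[Hessian].
    E[Hessian] is the observed Hessian with mu_i in place of y_i, which is
    valid here because the Hessian is linear in the counts."""
    theta = np.array([alpha0, gamma0], dtype=float)
    i = np.arange(1, len(y) + 1)
    for _ in range(max_iter):
        alpha, gamma = theta
        mu = alpha * gamma ** (i - 1)
        g = score(alpha, gamma, y)
        info = -hessian(alpha, gamma, mu)   # Fisher information I(theta)
        step = np.linalg.solve(info, g)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```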

Worked Poisson Example: Accidents in Two Cities

Setup

  • City A counts: \(X_i \sim \text{Pois}(T_i)\).
  • City B counts: \(Y_i \sim \text{Pois}(\beta T_i)\).
  • Parameters:
    • \(T_i\): baseline expected counts (differs by year).
    • \(\beta\): constant relative risk multiplier (same across years).
  • Interpretation: City B is always \(\beta\) times as risky as City A → multiplicative risk model.

Step 1: Log-Likelihood

  • Ignoring constants: \[ L = \sum_{i=1}^n \Big[-(\beta T_i) + y_i \ln(\beta T_i) - T_i + x_i \ln(T_i)\Big] \]

Step 2: First-Order Conditions

  • Derivative wrt \(\beta\): \[ \frac{\partial L}{\partial \beta} = -\sum T_i + \frac{1}{\beta}\sum y_i = 0 \]
  • Derivative wrt \(T_i\): \[ \frac{\partial L}{\partial T_i} = -\beta + \frac{y_i}{T_i} - 1 + \frac{x_i}{T_i} = 0 \]

Step 3: Solve Equations

  • From derivative wrt \(T_i\): \[ T_i = \frac{x_i+y_i}{\beta+1} \]
  • Plug into derivative wrt \(\beta\): \[ \frac{1}{\beta}\sum y_i = \sum T_i = \sum \frac{x_i+y_i}{\beta+1} \]
  • Simplify: cross-multiplying gives \((\beta+1)\sum y_i=\beta\sum(x_i+y_i)\), i.e. \(\sum y_i=\beta\sum x_i\), so \[ \hat{\beta} = \frac{\sum y_i}{\sum x_i}. \]

Step 4: Final MLEs

  • Risk ratio: \[ \hat{\beta} = \frac{\bar{Y}}{\bar{X}} \quad \text{(ratio of averages or totals)} \]
  • Baseline yearly rates: \[ \hat{T}_i = \frac{x_i+y_i}{\hat{\beta}+1} \]
  • Interpretation:
    • \(\hat{\beta}\) = relative risk of City B vs City A.
    • \(\hat{T}_i\) = baseline expected accidents for year \(i\), adjusted so A and B scale correctly.
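
The closed-form MLEs amount to two lines of code. A minimal sketch with hypothetical yearly counts (the function name and data are ours, for illustration only):

```python
import numpy as np

def poisson_mle(x, y):
    """Closed-form MLEs for X_i ~ Pois(T_i), Y_i ~ Pois(beta * T_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta_hat = y.sum() / x.sum()          # relative risk of City B vs City A
    T_hat = (x + y) / (beta_hat + 1.0)    # baseline yearly rates
    return beta_hat, T_hat

# Hypothetical yearly accident counts, for illustration only.
x = [10, 14, 9]   # City A
y = [8, 11, 7]    # City B
print(poisson_mle(x, y))
```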

Connecting Back to Newton–Raphson / Fisher Scoring

  • In the IMR example (\(\alpha, \gamma\)): analytic MLEs are messy → iterative methods (Newton or Fisher) are used.
  • In the Poisson accidents example: algebra is nice → closed-form MLEs possible.
  • Takeaway:
    • Sometimes MLEs = neat closed forms (like \(\hat{\beta}=\sum Y/\sum X\)).
    • Often MLEs = require iterative numerical optimization (Newton/Fisher).

1) Model Recap (Complete-Data)

  • For each year \(i=1,\dots,n\):
    • \(X_i \sim \text{Pois}(T_i)\) (City A)
    • \(Y_i \sim \text{Pois}(\beta T_i)\) (City B)
  • Parameters: \(\beta>0\) and \(T_1,\dots,T_n>0\).

Ignoring constants, the complete-data log-likelihood is:

\[ L(\beta,\mathbf T;\mathbf x,\mathbf y) =\sum_{i=1}^n \Big[-(\beta+1)T_i + y_i(\log\beta+\log T_i) + x_i\log T_i\Big]. \]

The complete-data sufficient statistics are:

\[ S_Y=\sum_i y_i \;\;(\text{multiplying }\log\beta) \qquad\text{and}\qquad x_i+y_i \;\;(\text{multiplying }\log T_i,\; i=1,\dots,n); \]

in particular, \(S_X=\sum_i x_i\) enters only through these sums.

When nothing is missing, the MLEs satisfy:

\[ \hat{\beta}=\frac{\sum_i y_i}{\sum_i x_i},\qquad \hat T_i=\frac{x_i+y_i}{\hat\beta+1}. \]

2) Missing Data: One \(x_s\) is Missing

E-step (Expectation)

Construct:

\[ Q(\beta,\mathbf T\mid \beta^{(n)},\mathbf T^{(n)}) = \mathbb E\!\left[L(\beta,\mathbf T;\mathbf X,\mathbf Y)\,\middle|\,\text{observed},\,\beta^{(n)},\mathbf T^{(n)}\right]. \]

Only terms involving the missing \(X_s\) need an expectation. Since:

\[ X_s \mid \beta^{(n)},\mathbf T^{(n)} \sim \text{Pois}\big(T_s^{(n)}\big),\quad \mathbb E[X_s\mid\cdot]=T_s^{(n)}, \]

and \(\log T_s\) is not random, we have:

\[ \mathbb E[X_s \log T_s \mid \cdot]=T_s^{(n)}\log T_s. \]

So, in \(Q\) replace missing \(x_s\) by \(T_s^{(n)}\). Concretely, use:

\[ \tilde S_X^{(n)}=\sum_{i\neq s} x_i + T_s^{(n)}. \]

M-step (Maximization)

Maximize \(Q\) as if \(x_s\) were replaced. Updates:

  • Update for \(\beta\):

    \[ \beta^{(n+1)}=\frac{\sum_i y_i}{\tilde S_X^{(n)}}=\frac{\sum_i y_i}{\sum_{i\neq s} x_i + T_s^{(n)}}. \]

  • Updates for \(T_i\):

    \[ T_i^{(n+1)}=\frac{x_i+y_i}{\beta^{(n+1)}+1}\quad(i\neq s),\qquad T_s^{(n+1)}=\frac{T_s^{(n)}+y_s}{\beta^{(n+1)}+1}. \]

Repeat E→M until convergence.

3) Why This Works

  • EM maximizes observed-data likelihood via expected complete-data log-likelihood.
  • In exponential families, EM works with expected sufficient statistics.
  • Here, only the statistics involving \(x_s\) (namely \(S_X\) and \(x_s+y_s\)) are affected; replace the missing \(x_s\) by its conditional mean \(T_s^{(n)}\).
  • Each EM step increases or preserves the observed log-likelihood → convergence to stationary point (MLE under mild conditions).

4) Pseudocode (Notes-Ready)

Input: observed {x_i (i ≠ s), y_i (all i)}; choose initial β^(0) > 0 and T_i^(0) > 0
repeat (for n = 0, 1, 2, …)
  # E-step
  S̃x ← (∑_{i≠s} x_i) + T_s^(n)

  # M-step
  β^(n+1) ← (∑_i y_i) / S̃x
  for i ≠ s:  T_i^(n+1) ← (x_i + y_i) / (β^(n+1) + 1)
  for i = s:  T_s^(n+1) ← (T_s^(n) + y_s) / (β^(n+1) + 1)
until convergence (e.g., relative change in β and all T_i below tolerance)
Output: β̂, T̂_1, …, T̂_n

Stopping rule:

\[ \max\left\{\frac{|\beta^{(n+1)}-\beta^{(n)}|}{\beta^{(n)}},\;\max_i \frac{|T_i^{(n+1)}-T_i^{(n)}|}{T_i^{(n)}}\right\}<10^{-6}\;(\text{tight}) \;\text{or } 10^{-4}\;(\text{looser}). \]
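
A runnable version of the pseudocode and stopping rule above, as a minimal Python sketch; the function name `em_missing_x` and the `np.nan` encoding of the missing entry are our conventions, not part of the notes.

```python
import numpy as np

def em_missing_x(x, y, s, beta0, T0, tol=1e-6, max_iter=500):
    """EM for X_i ~ Pois(T_i), Y_i ~ Pois(beta*T_i) with x[s] missing.
    x[s] may hold any placeholder (e.g. np.nan); its value never affects the result."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta, T = float(beta0), np.asarray(T0, float).copy()
    obs = np.ones(len(x), bool); obs[s] = False          # mask of observed x_i
    for _ in range(max_iter):
        # E-step: replace the missing x_s by its conditional mean T_s^(n)
        Sx_tilde = x[obs].sum() + T[s]
        # M-step: closed-form updates
        beta_new = y.sum() / Sx_tilde
        T_new = np.where(obs, x + y, T[s] + y) / (beta_new + 1.0)
        # stopping rule: max relative change in beta and all T_i
        rel = max(abs(beta_new - beta) / beta, np.max(np.abs(T_new - T) / T))
        beta, T = beta_new, T_new
        if rel < tol:
            break
    return beta, T
```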

5) Tiny Numeric Walk-Through

Suppose \(n=3\). Observed:

\[ (x_1, x_2, x_3)=(12,\,?,\,9),\qquad (y_1,y_2,y_3)=(8,\,10,\,7). \]

Init: \(\beta^{(0)}=1\); initialize the missing \(x_2\) by \(y_2=10\), so \(T_i^{(0)}=(x_i+y_i)/(\beta^{(0)}+1)\):

\[ T_1^{(0)}=10,\;\;T_2^{(0)}=10,\;\;T_3^{(0)}=8. \]

E: \(\tilde S_X^{(0)}=31.\)

M: \(\beta^{(1)}=\tfrac{25}{31}\approx0.8065.\)

Then:

\[ T_1^{(1)}\approx11.07,\quad T_2^{(1)}\approx11.07,\quad T_3^{(1)}\approx8.86. \] (Here \(T_1^{(1)}=T_2^{(1)}\) because \(x_1+y_1=T_2^{(0)}+y_2=20\).)

Next iteration uses \(T_2^{(1)}\), and so on, until convergence.
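
Reusing the `em_missing_x` sketch from Section 4 on these numbers reproduces the first iteration above (\(\beta^{(1)}=25/31\), \(T^{(1)}\approx(11.07,\,11.07,\,8.86)\)) and then iterates to convergence:

```python
import numpy as np

x = [12, np.nan, 9]          # x_2 missing
y = [8, 10, 7]
beta_hat, T_hat = em_missing_x(x, y, s=1, beta0=1.0, T0=[10, 10, 8])
print(beta_hat, T_hat)       # first pass gives beta^(1) = 25/31, then EM continues to the MLE
```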

6) Visual Intuition

  • Plate diagram: \(T_i \rightarrow X_i\); \(T_i,\beta \rightarrow Y_i\). Dashed circle around missing \(X_s\).
  • EM loop schematic: E-step (impute) → M-step (closed form update) → repeat.
  • Convergence plot: \(\beta^{(n)}\) vs iterations, flattening to stable value.

7) Extensions & Cautions

  • Multiple missing \(x_i\): replace each by \(T_i^{(n)}\).
  • Missing \(y_i\): replace by \(\beta^{(n)}T_i^{(n)}\); re-derive updates.
  • Ignorable missingness assumption: MCAR/MAR; otherwise need missingness model.
  • Initialization: good starts speed convergence (see the sketch after this list).
    • \(\beta^{(0)}=\tfrac{\sum y_i}{\sum_{i:\,x_i\text{ obs}} x_i + \sum_{i:\,x_i\text{ miss}} y_i}\)
    • \(T_i^{(0)}=\tfrac{x_i+y_i}{\beta^{(0)}+1}\), using \(x_i:=y_i\) where missing.
  • Monotonicity: EM always increases observed log-likelihood (Newton/Fisher may not).
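
A tiny sketch of the suggested initialization from the list above; missing \(x_i\) are encoded as `np.nan`, matching the EM sketch in Section 4, and the helper name is ours.

```python
import numpy as np

def em_init(x, y):
    """Initial values per the bullets above: fill each missing x_i with y_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_fill = np.where(np.isnan(x), y, x)        # x_i := y_i where missing
    beta0 = y.sum() / x_fill.sum()
    T0 = (x_fill + y) / (beta0 + 1.0)
    return beta0, T0

print(em_init([12, np.nan, 9], [8, 10, 7]))     # initial (beta0, T0) for the Section-5 data
```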