Statistical Methods III: Week-3

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Question

Likelihood-based estimation for the Infant Mortality Rate (IMR), modeled with Poisson distributions across 3 decades. Work through the problem step by step, with intuition and visual thinking.

  • Assume \(Y_i \sim \text{Poisson}(\mu_i)\) with \(\mu_i=\alpha\,\gamma^{\,i-1}\) for \(i=1,2,3\).
  • Form the joint likelihood and log-likelihood.
  • Derive the score equations w.r.t. \(\alpha\) and \(\gamma\).
  • Compute the Hessian and discuss when mixed partials are equal.
  • State the gradient vector and provide interpretation and a visualization idea.

Solution

Step 1: Model Setup

  • \(Y_i \sim \text{Poisson}(\mu_i)\), \(i=1,2,3\).
  • \(\mu_i=\alpha\,\gamma^{\,i-1}\).
  • Interpretations:
    • \(\alpha\): baseline mortality rate (first decade).
    • \(\gamma\): multiplicative growth/decay factor across decades.
      • \(\gamma>1\): rate increases each decade.
      • \(\gamma<1\): rate decreases (improves) each decade.
      • \(\gamma=1\): constant rate across decades.
    • Mortality rate follows a multiplicative trend across decades.

Intuition: think of \(\alpha\) as the starting height and \(\gamma\) as the common ratio of a geometric progression driving the decade-to-decade change.

Step 2: Likelihood Function

  • Because of independence, \[ f(\vec y\mid \alpha,\gamma)=\prod_{i=1}^{3}\frac{e^{-\alpha \gamma^{\,i-1}}\big(\alpha \gamma^{\,i-1}\big)^{y_i}}{y_i!}. \]
  • Log-likelihood: \[ L(\alpha,\gamma\mid \vec y)= -\sum_{i=1}^{3}\alpha \gamma^{\,i-1} +\sum_{i=1}^{3} y_i \log\!\big(\alpha \gamma^{\,i-1}\big) -\sum_{i=1}^{3}\log(y_i!). \]
  • Use \(\log(\alpha \gamma^{\,i-1})=\log\alpha+(i-1)\log\gamma\) to simplify derivatives.
  • The term \(-\sum \log(y_i!)\) is constant in \((\alpha,\gamma)\) for optimization.
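
The last two bullets are easy to check in code. Below is a minimal Python sketch of the log-likelihood (dropping the constant \(-\sum\log y_i!\) term); the function name `log_lik` and the example counts are hypothetical, chosen only for illustration.

```python
import numpy as np

def log_lik(alpha, gamma, y):
    """Poisson log-likelihood for mu_i = alpha * gamma**(i-1), i = 1..len(y),
    omitting the additive constant -sum(log(y_i!))."""
    i = np.arange(1, len(y) + 1)
    mu = alpha * gamma ** (i - 1)
    return np.sum(-mu + y * (np.log(alpha) + (i - 1) * np.log(gamma)))

# Hypothetical decade counts, for illustration only.
y = np.array([30.0, 24.0, 20.0])
print(log_lik(50.0, 0.8, y))
```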

Step 3: Score Equations (First Derivatives)

  • With respect to \(\alpha\): \[ \frac{\partial L}{\partial \alpha} = -\sum_{i=1}^{3}\gamma^{\,i-1} + \frac{1}{\alpha}\sum_{i=1}^{3} y_i. \]
    • First term: the pull from the expected counts, \(\sum_i \gamma^{\,i-1}=\sum_i \mu_i/\alpha\).
    • Second term: the pull from the observed counts, scaled by \(1/\alpha\).
  • With respect to \(\gamma\): \[ \frac{\partial L}{\partial \gamma} = -\sum_{i=1}^{3} (i-1)\alpha \gamma^{\,i-2} + \frac{1}{\gamma}\sum_{i=1}^{3}(i-1) y_i. \]
    • First term: effect of \(\gamma\) on expected mortality (weighted by \(\alpha\)).
    • Second term: the pull from the observed counts, weighted by decade index \((i-1)\) and scaled by \(1/\gamma\).
  • Set both equal to zero to obtain the MLEs of \((\alpha,\gamma)\).
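
A sketch of the two score components, coded directly from the bullets above (same conventions as the log-likelihood sketch; the helper name `score` is ours):

```python
def score(alpha, gamma, y):
    """Score vector (dL/d_alpha, dL/d_gamma) for the IMR model."""
    i = np.arange(1, len(y) + 1)
    d_alpha = -np.sum(gamma ** (i - 1)) + np.sum(y) / alpha
    d_gamma = -np.sum((i - 1) * alpha * gamma ** (i - 2)) + np.sum((i - 1) * y) / gamma
    return np.array([d_alpha, d_gamma])
```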

Step 4: Hessian (Second Derivatives)

  • \[ \frac{\partial^{2} L}{\partial \alpha^{2}} = -\frac{1}{\alpha^{2}}\sum_{i=1}^{3} y_i \quad \text{(always negative ⇒ concave in \(\alpha\)).} \]
  • \[ \frac{\partial^{2} L}{\partial \alpha\,\partial \gamma} = -\sum_{i=1}^{3}(i-1)\gamma^{\,i-2}. \]
  • \[ \frac{\partial^{2} L}{\partial \gamma^{2}} = -\sum_{i=1}^{3}(i-1)(i-2)\alpha \gamma^{\,i-3} - \frac{1}{\gamma^{2}}\sum_{i=1}^{3}(i-1)y_i. \]
  • Hessian matrix: \[ H= \begin{pmatrix} \dfrac{\partial^{2} L}{\partial \alpha^{2}} & \dfrac{\partial^{2} L}{\partial \alpha\,\partial \gamma}\\[6pt] \dfrac{\partial^{2} L}{\partial \gamma\,\partial \alpha} & \dfrac{\partial^{2} L}{\partial \gamma^{2}} \end{pmatrix}. \]
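
The Hessian entries translate line by line into code; again a sketch, with a hypothetical `hessian` helper matching the matrix above:

```python
def hessian(alpha, gamma, y):
    """Hessian of the log-likelihood with respect to (alpha, gamma)."""
    i = np.arange(1, len(y) + 1)
    h_aa = -np.sum(y) / alpha ** 2
    h_ag = -np.sum((i - 1) * gamma ** (i - 2))
    h_gg = (-np.sum((i - 1) * (i - 2) * alpha * gamma ** (i - 3))
            - np.sum((i - 1) * y) / gamma ** 2)
    return np.array([[h_aa, h_ag], [h_ag, h_gg]])
```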

Step 5: Symmetry of Mixed Partials

  • By Clairaut’s theorem (twice continuously differentiable \(L\)), mixed partials are equal: \[ \frac{\partial^{2} L}{\partial \alpha\,\partial \gamma} = \frac{\partial^{2} L}{\partial \gamma\,\partial \alpha}. \]
  • If differing expressions appear, they arise from algebraic slips, not a failure of symmetry here.

Step 6: Gradient Vector (Summary)

  • \[ \nabla L(\alpha,\gamma)= \begin{pmatrix} -\sum_{i=1}^{3}\gamma^{\,i-1} + \dfrac{1}{\alpha}\sum_{i=1}^{3} y_i\\[6pt] -\sum_{i=1}^{3}(i-1)\alpha \gamma^{\,i-2} + \dfrac{1}{\gamma}\sum_{i=1}^{3}(i-1) y_i \end{pmatrix}. \]
  • Solve \(\nabla L=0\) for the MLEs of \((\alpha,\gamma)\).
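
Setting \(\nabla L=0\) can also be handed to a generic root-finder. A minimal sketch using SciPy's `fsolve`, reusing the `score` helper sketched in Step 3; the counts and starting values are hypothetical.

```python
import numpy as np
from scipy.optimize import fsolve

# Hypothetical decade counts, for illustration only.
y = np.array([30.0, 24.0, 20.0])

# Root-find the score equations; start from (y_1, y_2 / y_1) as a rough guess.
alpha_hat, gamma_hat = fsolve(lambda t: score(t[0], t[1], y), x0=[y[0], y[1] / y[0]])
print(alpha_hat, gamma_hat)
```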

Viz

  • Bar chart of IMR across decades. Example with \(\alpha=50\):
    • \(\gamma=0.8 \Rightarrow [50,\,40,\,32]\) (declining trend).
    • \(\gamma=1.2 \Rightarrow [50,\,60,\,72]\) (growing trend).
  • Interpretation: the slope of bars tracks \(\gamma\); the first bar anchors at \(\alpha\).
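
A minimal matplotlib sketch of the bar-chart idea, using the \(\alpha=50\) example above:

```python
import numpy as np
import matplotlib.pyplot as plt

alpha = 50.0
decades = np.arange(1, 4)
for gamma, label in [(0.8, "declining"), (1.2, "growing")]:
    mu = alpha * gamma ** (decades - 1)   # [50, 40, 32] and [50, 60, 72]
    plt.bar(decades + (0.2 if gamma > 1 else -0.2), mu, width=0.4,
            label=f"gamma = {gamma} ({label})")
plt.xlabel("Decade")
plt.ylabel("Expected IMR")
plt.legend()
plt.show()
```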

Summary / Insight

This is a two-parameter Poisson trend model where \(\alpha\) sets the initial level and \(\gamma\) controls geometric change across decades. The log-likelihood is concave in \(\alpha\) and has well-behaved mixed partials; the score equations cleanly separate expected vs. observed contributions, making interpretation straightforward.

Numerical Optimization Algorithms

Newton–Raphson for MLE

  • Purpose: Find parameter values where the score function (gradient of log-likelihood) is zero.
  • Update rule: \[ \theta^{(n+1)} = \theta^{(n)} - H^{-1} \nabla L \] where:
    • \(\theta = (\alpha, \gamma)\) in the IMR case.
    • \(\nabla L\) = score vector.
    • \(H\) = Hessian (matrix of 2nd derivatives).
  • Geometric picture: Standing on a mountain surface (log-likelihood). Newton–Raphson uses local curvature to jump toward the peak.
  • Problem: If the likelihood surface is bumpy or flat, the Hessian may be unstable → divergence can occur.
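
A minimal Newton–Raphson sketch for the IMR model, assuming the `score` and `hessian` helpers from the earlier sketches are in scope; the starting values and counts are hypothetical.

```python
import numpy as np

def newton_raphson(y, alpha0, gamma0, tol=1e-8, max_iter=100):
    """Newton-Raphson on the (alpha, gamma) log-likelihood: theta <- theta - H^{-1} grad."""
    theta = np.array([alpha0, gamma0], dtype=float)
    for _ in range(max_iter):
        g = score(theta[0], theta[1], y)
        H = hessian(theta[0], theta[1], y)
        step = np.linalg.solve(H, g)      # H^{-1} grad without forming the inverse
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Hypothetical decade counts, illustration only.
y = np.array([30.0, 24.0, 20.0])
print(newton_raphson(y, alpha0=y[0], gamma0=y[1] / y[0]))
```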

Fisher’s Scoring

  • Modification: Replace the observed information (\(-H\)) with its expectation, the Fisher Information \(I=\mathbb E[-H]\).
  • Update rule: \[ \theta^{(n+1)} = \theta^{(n)} + I^{-1} \nabla L \]
  • Why?
    • Expected curvature often simpler to compute.
    • Provides more stable convergence.
    • Fisher Information is positive definite (under regularity) → every step is an ascent direction.
  • Visualization: Newton uses the actual local curvature at your current point (can be lumpy). Fisher’s Scoring uses the curvature averaged over possible data (smoother) → more reliable convergence.
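
Because every second derivative in Step 4 is linear in the counts \(y_i\), the expected Hessian for the IMR model can be obtained by substituting \(\mathbb E[Y_i]=\alpha\gamma^{\,i-1}\) for \(y_i\). A Fisher-scoring sketch built on that observation, again assuming the earlier `score` and `hessian` helpers:

```python
import numpy as np

def fisher_scoring(y, alpha0, gamma0, tol=1e-8, max_iter=100):
    """Fisher scoring: theta <- theta + I^{-1} grad, with I = -E[Hessian].
    E[Hessian] is the observed Hessian with mu_i in place of y_i, which is
    valid here because the Hessian is linear in the counts."""
    theta = np.array([alpha0, gamma0], dtype=float)
    i = np.arange(1, len(y) + 1)
    for _ in range(max_iter):
        alpha, gamma = theta
        mu = alpha * gamma ** (i - 1)
        g = score(alpha, gamma, y)
        info = -hessian(alpha, gamma, mu)   # Fisher information I(theta)
        step = np.linalg.solve(info, g)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```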

Worked Poisson Example: Accidents in Two Cities

Setup

  • City A counts: \(X_i \sim \text{Pois}(T_i)\).
  • City B counts: \(Y_i \sim \text{Pois}(\beta T_i)\).
  • Parameters:
    • \(T_i\): baseline expected counts (differs by year).
    • \(\beta\): constant relative risk multiplier (same across years).
  • Interpretation: City B is always \(\beta\) times as risky as City A → multiplicative risk model.

Step 1: Log-Likelihood

  • Ignoring constants: \[ L = \sum_{i=1}^n \Big[-(\beta T_i) + y_i \ln(\beta T_i) - T_i + x_i \ln(T_i)\Big] \]

Step 2: First-Order Conditions

  • Derivative wrt \(\beta\): \[ \frac{\partial L}{\partial \beta} = -\sum T_i + \frac{1}{\beta}\sum y_i = 0 \]
  • Derivative wrt \(T_i\): \[ \frac{\partial L}{\partial T_i} = -\beta + \frac{y_i}{T_i} - 1 + \frac{x_i}{T_i} = 0 \]

Step 3: Solve Equations

  • From derivative wrt \(T_i\): \[ T_i = \frac{x_i+y_i}{\beta+1} \]
  • Plug into derivative wrt \(\beta\): \[ \frac{1}{\beta}\sum y_i = \sum T_i = \sum \frac{x_i+y_i}{\beta+1} \]
  • Simplify: cross-multiplying gives \((\beta+1)\sum y_i=\beta\sum(x_i+y_i)\), i.e. \(\sum y_i=\beta\sum x_i\), so \[ \hat{\beta} = \frac{\sum y_i}{\sum x_i}. \]

Step 4: Final MLEs

  • Risk ratio: \[ \hat{\beta} = \frac{\bar{Y}}{\bar{X}} \quad \text{(ratio of averages or totals)} \]
  • Baseline yearly rates: \[ \hat{T}_i = \frac{x_i+y_i}{\hat{\beta}+1} \]
  • Interpretation:
    • \(\hat{\beta}\) = relative risk of City B vs City A.
    • \(\hat{T}_i\) = baseline expected accidents for year \(i\), adjusted so A and B scale correctly.
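
The closed-form MLEs amount to two lines of code. A minimal sketch with hypothetical yearly counts (the function name and data are ours, for illustration only):

```python
import numpy as np

def poisson_mle(x, y):
    """Closed-form MLEs for X_i ~ Pois(T_i), Y_i ~ Pois(beta * T_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta_hat = y.sum() / x.sum()          # relative risk of City B vs City A
    T_hat = (x + y) / (beta_hat + 1.0)    # baseline yearly rates
    return beta_hat, T_hat

# Hypothetical yearly accident counts, for illustration only.
x = [10, 14, 9]   # City A
y = [8, 11, 7]    # City B
print(poisson_mle(x, y))
```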

Connecting Back to Newton–Raphson / Fisher Scoring

  • In the IMR example (\(\alpha, \gamma\)): analytic MLEs are messy → iterative methods (Newton or Fisher) are used.
  • In the Poisson accidents example: algebra is nice → closed-form MLEs possible.
  • Takeaway:
    • Sometimes MLEs = neat closed forms (like \(\hat{\beta}=\sum Y/\sum X\)).
    • Often MLEs = require iterative numerical optimization (Newton/Fisher).

1) Model Recap (Complete-Data)

  • For each year \(i=1,\dots,n\):
    • \(X_i \sim \text{Pois}(T_i)\) (City A)
    • \(Y_i \sim \text{Pois}(\beta T_i)\) (City B)
  • Parameters: \(\beta>0\) and \(T_1,\dots,T_n>0\).

Ignoring constants, the complete-data log-likelihood is:

\[ L(\beta,\mathbf T;\mathbf x,\mathbf y) =\sum_{i=1}^n \Big[-(\beta+1)T_i + y_i(\log\beta+\log T_i) + x_i\log T_i\Big]. \]

The complete-data sufficient statistics are:

\[ S_Y=\sum_i y_i \;\;(\text{multiplying }\log\beta) \qquad\text{and}\qquad x_i+y_i \;\;(\text{multiplying }\log T_i,\; i=1,\dots,n); \]

in particular, \(S_X=\sum_i x_i\) enters only through these sums.

When nothing is missing, the MLEs satisfy:

\[ \hat{\beta}=\frac{\sum_i y_i}{\sum_i x_i},\qquad \hat T_i=\frac{x_i+y_i}{\hat\beta+1}. \]

2) Missing Data: One \(x_s\) is Missing

E-step (Expectation)

Construct:

\[ Q(\beta,\mathbf T\mid \beta^{(n)},\mathbf T^{(n)}) = \mathbb E\!\left[L(\beta,\mathbf T;\mathbf X,\mathbf Y)\,\middle|\,\text{observed},\,\beta^{(n)},\mathbf T^{(n)}\right]. \]

Only terms involving the missing \(X_s\) need an expectation. Since:

\[ X_s \mid \beta^{(n)},\mathbf T^{(n)} \sim \text{Pois}\big(T_s^{(n)}\big),\quad \mathbb E[X_s\mid\cdot]=T_s^{(n)}, \]

and \(\log T_s\) is not random, we have:

\[ \mathbb E[X_s \log T_s \mid \cdot]=T_s^{(n)}\log T_s. \]

So, in \(Q\) replace missing \(x_s\) by \(T_s^{(n)}\). Concretely, use:

\[ \tilde S_X^{(n)}=\sum_{i\neq s} x_i + T_s^{(n)}. \]

M-step (Maximization)

Maximize \(Q\) as if \(x_s\) were replaced. Updates:

  • Update for \(\beta\):

    \[ \beta^{(n+1)}=\frac{\sum_i y_i}{\tilde S_X^{(n)}}=\frac{\sum_i y_i}{\sum_{i\neq s} x_i + T_s^{(n)}}. \]

  • Updates for \(T_i\):

    \[ T_i^{(n+1)}=\frac{x_i+y_i}{\beta^{(n+1)}+1}\quad(i\neq s),\qquad T_s^{(n+1)}=\frac{T_s^{(n)}+y_s}{\beta^{(n+1)}+1}. \]

Repeat E→M until convergence.

3) Why This Works

  • EM maximizes observed-data likelihood via expected complete-data log-likelihood.
  • In exponential families, EM works with expected sufficient statistics.
  • Here, only the statistics involving \(x_s\) (namely \(S_X\) and \(x_s+y_s\)) are affected; replace the missing \(x_s\) by its conditional mean \(T_s^{(n)}\).
  • Each EM step increases or preserves the observed log-likelihood → convergence to stationary point (MLE under mild conditions).

4) Pseudocode (Notes-Ready)

Input: observed {x_i (i ≠ s), y_i (all i)}; choose initial β^(0) > 0 and T_i^(0) > 0
repeat (for n = 0, 1, 2, …)
  # E-step
  S̃x ← (∑_{i≠s} x_i) + T_s^(n)

  # M-step
  β^(n+1) ← (∑_i y_i) / S̃x
  for i ≠ s:  T_i^(n+1) ← (x_i + y_i) / (β^(n+1) + 1)
  for i = s:  T_s^(n+1) ← (T_s^(n) + y_s) / (β^(n+1) + 1)
until convergence (e.g., relative change in β and all T_i below tolerance)
Output: β̂, T̂_1, …, T̂_n

Stopping rule:

\[ \max\left\{\frac{|\beta^{(n+1)}-\beta^{(n)}|}{\beta^{(n)}},\;\max_i \frac{|T_i^{(n+1)}-T_i^{(n)}|}{T_i^{(n)}}\right\}<10^{-6}\;(\text{tight}) \;\text{or } 10^{-4}\;(\text{looser}). \]
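
A runnable version of the pseudocode and stopping rule above, as a minimal Python sketch; the function name `em_missing_x` and the `np.nan` encoding of the missing entry are our conventions, not part of the notes.

```python
import numpy as np

def em_missing_x(x, y, s, beta0, T0, tol=1e-6, max_iter=500):
    """EM for X_i ~ Pois(T_i), Y_i ~ Pois(beta*T_i) with x[s] missing.
    x[s] may hold any placeholder (e.g. np.nan); its value never affects the result."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta, T = float(beta0), np.asarray(T0, float).copy()
    obs = np.ones(len(x), bool); obs[s] = False          # mask of observed x_i
    for _ in range(max_iter):
        # E-step: replace the missing x_s by its conditional mean T_s^(n)
        Sx_tilde = x[obs].sum() + T[s]
        # M-step: closed-form updates
        beta_new = y.sum() / Sx_tilde
        T_new = np.where(obs, x + y, T[s] + y) / (beta_new + 1.0)
        # stopping rule: max relative change in beta and all T_i
        rel = max(abs(beta_new - beta) / beta, np.max(np.abs(T_new - T) / T))
        beta, T = beta_new, T_new
        if rel < tol:
            break
    return beta, T
```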

5) Tiny Numeric Walk-Through

Suppose \(n=3\). Observed:

\[ (x_1, x_2, x_3)=(12,\,?,\,9),\qquad (y_1,y_2,y_3)=(8,\,10,\,7). \]

Init: \(\beta^{(0)}=1\); initialize the missing \(x_2\) by \(y_2=10\), so \(T_i^{(0)}=(x_i+y_i)/(\beta^{(0)}+1)\):

\[ T_1^{(0)}=10,\;\;T_2^{(0)}=10,\;\;T_3^{(0)}=8. \]

E: \(\tilde S_X^{(0)}=31.\)

M: \(\beta^{(1)}=\tfrac{25}{31}\approx0.8065.\)

Then:

\[ T_1^{(1)}\approx11.07,\quad T_2^{(1)}\approx11.07,\quad T_3^{(1)}\approx8.86. \] (Here \(T_1^{(1)}=T_2^{(1)}\) because \(x_1+y_1=T_2^{(0)}+y_2=20\).)

Next iteration uses \(T_2^{(1)}\), and so on, until convergence.
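
Reusing the `em_missing_x` sketch from Section 4 on these numbers reproduces the first iteration above (\(\beta^{(1)}=25/31\), \(T^{(1)}\approx(11.07,\,11.07,\,8.86)\)) and then iterates to convergence:

```python
import numpy as np

x = [12, np.nan, 9]          # x_2 missing
y = [8, 10, 7]
beta_hat, T_hat = em_missing_x(x, y, s=1, beta0=1.0, T0=[10, 10, 8])
print(beta_hat, T_hat)       # first pass gives beta^(1) = 25/31, then EM continues to the MLE
```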

6) Visual Intuition

  • Plate diagram: \(T_i \rightarrow X_i\); \(T_i,\beta \rightarrow Y_i\). Dashed circle around missing \(X_s\).
  • EM loop schematic: E-step (impute) → M-step (closed form update) → repeat.
  • Convergence plot: \(\beta^{(n)}\) vs iterations, flattening to stable value.

7) Extensions & Cautions

  • Multiple missing \(x_i\): replace each by \(T_i^{(n)}\).
  • Missing \(y_i\): replace by \(\beta^{(n)}T_i^{(n)}\); re-derive updates.
  • Ignorable missingness assumption: MCAR/MAR; otherwise need missingness model.
  • Initialization: good starts speed convergence (see the sketch after this list).
    • \(\beta^{(0)}=\tfrac{\sum y_i}{\sum_{i:\,x_i\text{ obs}} x_i + \sum_{i:\,x_i\text{ miss}} y_i}\)
    • \(T_i^{(0)}=\tfrac{x_i+y_i}{\beta^{(0)}+1}\), using \(x_i:=y_i\) where missing.
  • Monotonicity: EM always increases observed log-likelihood (Newton/Fisher may not).
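
A tiny sketch of the suggested initialization from the list above; missing \(x_i\) are encoded as `np.nan`, matching the EM sketch in Section 4, and the helper name is ours.

```python
import numpy as np

def em_init(x, y):
    """Initial values per the bullets above: fill each missing x_i with y_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_fill = np.where(np.isnan(x), y, x)        # x_i := y_i where missing
    beta0 = y.sum() / x_fill.sum()
    T0 = (x_fill + y) / (beta0 + 1.0)
    return beta0, T0

print(em_init([12, np.nan, 9], [8, 10, 7]))     # initial (beta0, T0) for the Section-5 data
```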