Statistical Methods III: Week-4

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Contents

General EM Framework

  • Introduce a “fantasy complete dataset” \(\vec{x}\) so optimization is easier if we had it.
  • Observed data: \(\vec{y}\).
  • Complete data: \((\vec{y},\vec{x})\).
  • Parameters: \(\vec{\theta}\).
  • Target: maximize the observed log-likelihood \(L(\vec{\theta}\mid\vec{y})=\ln g(\vec{y}\mid\vec{\theta})\), which is hard to do directly (see the display after this list).
  • Strategy:
    • Express the problem via the complete-data log-likelihood.
    • Take expectation over missing parts given current parameter guess.
    • Alternate between estimating missing parts (E-step) and maximizing w.r.t. parameters (M-step).
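Why direct maximization is hard: the observed-data density is a marginal of the complete-data density, so the log sits outside an integral (or sum). Writing \(f(\vec y,\vec x\mid\vec\theta)\) for the complete-data density and \(h(\vec x\mid\vec y,\vec\theta)\) for the conditional density of the missing part (the symbol \(f\) is notation assumed here; \(g\) and \(h\) appear in the proof below),
\[
g(\vec y\mid\vec\theta)=\int f(\vec y,\vec x\mid\vec\theta)\,d\vec x,
\qquad
h(\vec x\mid\vec y,\vec\theta)=\frac{f(\vec y,\vec x\mid\vec\theta)}{g(\vec y\mid\vec\theta)},
\]
so \(L(\vec\theta\mid\vec y)=\ln g(\vec y\mid\vec\theta)\) involves the log of an integral, whereas \(\ln f\) is typically simple.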

E-step and M-step (Formal)

  • E-step: compute the expected complete-data log-likelihood under the current estimate \(\hat\theta^{(n)}\): \[ Q(\theta\mid\hat\theta^{(n)})= \mathbb{E}_{\vec{x}\mid\vec{y},\hat\theta^{(n)}}\big[L(\theta\mid\vec{y},\vec{x})\big]. \]
  • M-step: update by maximizing that expectation: \[ \hat\theta^{(n+1)}=\arg\max_{\theta} Q(\theta\mid\hat\theta^{(n)}). \]
  • Interpretation: E-step fills missing information probabilistically; M-step treats that fill-in as data for optimization.
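A minimal sketch of the generic loop in code, assuming user-supplied e_step, m_step, and log_lik functions (the names and structure below are illustrative, not part of the notes):

```python
from typing import Any, Callable

def run_em(theta0: Any,
           e_step: Callable[[Any], Any],     # computes expectations given the current theta
           m_step: Callable[[Any], Any],     # maximizes the expected complete-data log-likelihood
           log_lik: Callable[[Any], float],  # observed-data log-likelihood, used only for monitoring
           tol: float = 1e-8,
           max_iter: int = 1000) -> Any:
    """Alternate E- and M-steps until the observed log-likelihood stops increasing appreciably."""
    theta = theta0
    ll_old = log_lik(theta)
    for _ in range(max_iter):
        expected = e_step(theta)   # E-step
        theta = m_step(expected)   # M-step
        ll_new = log_lik(theta)    # by the ascent property, ll_new >= ll_old (up to numerical error)
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new
    return theta
```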

Proof of Monotone Convergence (Sketch)

  • Claim: \(L(\hat\theta^{(n+1)}\mid\vec y)\ge L(\hat\theta^{(n)}\mid\vec y)\) — EM never decreases the observed log-likelihood.
  • Step 1 — Decomposition:
    • For any \(\theta\): \[ L(\theta\mid\vec y)=\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\big[L(\theta\mid\vec y,\vec x)\big] -\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\big[\ln h(\vec x\mid\vec y,\theta)\big]. \]
    • First term = expected complete-data log-likelihood; second term = penalty involving the conditional distribution of missing data.
  • Step 2 — M-step increases the first term:
    • \(\hat\theta^{(n+1)}\) is chosen to maximize the expectation, hence \[ \mathbb{E}\big[L(\hat\theta^{(n+1)}\mid\vec y,\vec x)\big]\ge \mathbb{E}\big[L(\hat\theta^{(n)}\mid\vec y,\vec x)\big]. \]
  • Step 3 — Penalty term does not increase:
    • Need \[ \mathbb{E}\big[\ln h(\vec x\mid\vec y,\hat\theta^{(n+1)})\big]\le \mathbb{E}\big[\ln h(\vec x\mid\vec y,\hat\theta^{(n)})\big]. \]
    • Follows from Jensen's inequality, equivalently the non-negativity of the KL divergence: the penalty term \(\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\big[\ln h(\vec x\mid\vec y,\theta)\big]\) is maximized over \(\theta\) at \(\theta=\hat\theta^{(n)}\), so it cannot increase when \(\theta\) moves to \(\hat\theta^{(n+1)}\) (see the display after this list).
  • Step 4 — Combine:
    • Subtract penalty terms from expected complete-data logs to get \[ L(\hat\theta^{(n+1)}\mid\vec y)-L(\hat\theta^{(n)}\mid\vec y)\ge0, \] proving ascent.
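The inequality needed in Step 3 can be made explicit. By Jensen's inequality, for any \(\theta\),
\[
\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\!\left[\ln\frac{h(\vec x\mid\vec y,\theta)}{h(\vec x\mid\vec y,\hat\theta^{(n)})}\right]
\le \ln \mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\!\left[\frac{h(\vec x\mid\vec y,\theta)}{h(\vec x\mid\vec y,\hat\theta^{(n)})}\right]
= \ln 1 = 0,
\]
i.e. the left-hand side equals \(-\mathrm{KL}\big(h(\cdot\mid\vec y,\hat\theta^{(n)})\,\big\|\,h(\cdot\mid\vec y,\theta)\big)\le 0\). Taking \(\theta=\hat\theta^{(n+1)}\) gives exactly the inequality used in Step 3.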

Why This Works — Intuition

  • EM turns a hard marginal optimization into repeated easier optimizations of an expected complete-data objective.
  • In exponential families, the E-step reduces to replacing missing sufficient statistics by their conditional expectations.
  • Each M-step optimizes a surrogate that lower-bounds (or tangentially touches) the observed log-likelihood at the current iterate, guaranteeing non-decrease.

Visual Intuition

  • Likelihood surface = fogged mountain; complete-data likelihood = clear map; observed-data likelihood = fogged projection.
  • E-step: clear the fog locally by estimating missing pieces; M-step: climb uphill on the clearer map.
  • Iteration trace: monotone ascent of \(L(\hat\theta^{(n)}\mid\vec y)\) vs iteration \(n\), flattening near convergence.
  • Suggested sketch: EM schematic (E → M → E …) + plot of likelihood vs iterations showing steady ascent and plateau; optional image: EM intuition diagram (illustrative).

Caveats & Practical Notes

  • Local maxima: EM may converge to a local peak; initialization matters.
  • Slow near optimum: EM typically converges only linearly, with rate governed by the fraction of missing information; Newton-type methods (Newton–Raphson, Fisher scoring) can converge much faster near the optimum but are less stable.
  • Missingness mechanism: EM assumes ignorable missingness (MCAR/MAR); otherwise model the missingness process explicitly.
  • Diagnostics: monitor observed log-likelihood, relative parameter changes, and run multiple starts to check for global optimum.
  • Extensions: stochastic EM, ECM, AECM and other variants can speed up or stabilize convergence in complex models.

Summary / Insight

EM alternates expectation (probabilistic imputation of missing data) and maximization (optimization of a tractable surrogate). The ascent property follows from decomposing the observed log-likelihood into expected complete-data terms minus a penalty; M-step raises the expected term while Jensen/KL ensures the penalty cannot overturn the gain. EM is robust and stable, but can be slow and sensitive to starting values—so use sensible initialization and diagnostics.

Genetics Example of EM (Dempster–Laird–Rubin, 1977)

We start with observed data counts. The EM trick is to imagine hidden complete data, so the optimization becomes much simpler.

Observed Data Setup

We have 197 animals split into 4 categories:

| Group | Count (yi) | Probability |
|-------|------------|-------------|
| y1    | 125        | 1/2 + π/4   |
| y2    | 18         | (1-π)/4     |
| y3    | 20         | (1-π)/4     |
| y4    | 34         | π/4         |

So the multinomial probabilities depend on a single parameter π. But the probability of the first category (the one with count y1) is a sum of two terms, 1/2 and π/4. That’s the clue for EM.

Complete Data Idea

  • Pretend that y1 was really two hidden groups:
    • x1 animals came from prob 1/2
    • x2 animals came from prob π/4
  • So: x1+x2=y1, x3=y2, x4=y3, x5=y4.

Complete Data Probabilities

| Cell | Probability |
|------|-------------|
| x1   | 1/2         |
| x2   | π/4         |
| x3   | (1-π)/4     |
| x4   | (1-π)/4     |
| x5   | π/4         |

Complete-Data Log-Likelihood

L(π) = x1 ln(1/2) + x2 ln(π/4) + (x3+x4) ln((1-π)/4) + x5 ln(π/4)

This would be simple to maximize if we knew x2. But we don’t. That’s where the E-step comes in.

E-Step

  • We compute the expected value of x2, given y1 and current π(n).
  • x2 | y1 ~ Binomial(y1, (π/4)/(1/2 + π/4))
  • So expectation:

    x2(n) = y1 · (π(n)/4) / (1/2 + π(n)/4)

M-Step

  • Maximize expected log-likelihood. Differentiate, set to zero, solve for π:

π(n+1) = (x2(n) + x5) / (x2(n) + x3 + x4 + x5)
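A brief derivation of this update: only the terms involving π matter, so set the derivative of the expected complete-data log-likelihood to zero,
\[
\frac{d}{d\pi}\Big[x_2^{(n)}\ln\tfrac{\pi}{4}+(x_3+x_4)\ln\tfrac{1-\pi}{4}+x_5\ln\tfrac{\pi}{4}\Big]
=\frac{x_2^{(n)}+x_5}{\pi}-\frac{x_3+x_4}{1-\pi}=0
\quad\Longrightarrow\quad
\pi=\frac{x_2^{(n)}+x_5}{x_2^{(n)}+x_3+x_4+x_5}.
\]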

Substitute observed counts:

π(n+1) = (x2(n) + y4) / (x2(n) + y2 + y3 + y4)

And iterate until convergence.
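A minimal runnable sketch of the whole iteration (the function name and starting value are illustrative; the counts are the observed ones above):

```python
def em_genetics(y=(125, 18, 20, 34), pi=0.5, tol=1e-10, max_iter=500):
    """EM iteration for the Dempster-Laird-Rubin genetics example."""
    y1, y2, y3, y4 = y
    for _ in range(max_iter):
        # E-step: expected hidden count x2, since x2 | y1 ~ Binomial(y1, (pi/4)/(1/2 + pi/4))
        x2 = y1 * (pi / 4) / (0.5 + pi / 4)
        # M-step: closed-form maximizer of the expected complete-data log-likelihood
        pi_new = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(pi_new - pi) < tol:
            return pi_new
        pi = pi_new
    return pi

print(em_genetics())  # for these counts the iteration settles near 0.627
```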

Side Concepts

Conditional Multinomial Property

If two multinomial cells are lumped together and you then condition on their combined count, the count falling in either of the original cells is Binomial, with success probability proportional to that cell’s probability. This is why x2 | (x1+x2) became Binomial.
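Concretely, for the split of y1 in this example,
\[
P\big(x_2=k \,\big|\, x_1+x_2=y_1\big)=\binom{y_1}{k}\,p^{k}(1-p)^{\,y_1-k},
\qquad
p=\frac{\pi/4}{1/2+\pi/4},
\]
which is exactly the Binomial used in the E-step.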

Hierarchical Model / Linear Regression

  • xi ~ g(x|θ)
  • Yi | Xi ~ N(α+βXi, σ²)
Joint log-likelihood: L(α,β,σ²,θ) = Σ ln( (1/σ) φ((yi-(α+βxi))/σ) g(xi|θ) )
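As an illustration only, a sketch of this log-likelihood in code; the choice g(x|θ) = Normal(μ, τ²) is an assumption made purely for concreteness, since the notes leave g generic:

```python
import numpy as np
from scipy.stats import norm

def joint_log_lik(alpha, beta, sigma, mu, tau, x, y):
    """Sum over i of ln[(1/sigma) phi((y_i - (alpha + beta*x_i))/sigma) * g(x_i|theta)],
    with g(x|theta) taken to be a Normal(mu, tau^2) density for illustration."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ll_y_given_x = norm.logpdf(y, loc=alpha + beta * x, scale=sigma)  # regression term
    ll_x = norm.logpdf(x, loc=mu, scale=tau)                          # assumed marginal g(x|theta)
    return float(np.sum(ll_y_given_x + ll_x))
```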

Laplace Distribution (Robust Regression)

If Yi|Xi has the Laplace density (λ/2) exp(-λ |yi-(α+βxi)|), maximizing the likelihood over (α, β) leads to Least Absolute Deviations (LAD) regression (a numerical sketch follows the list below).

  • Normal errors → minimize squared error (sensitive to outliers).
  • Laplace errors → minimize absolute deviations (robust to outliers).
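A small numerical sketch of the contrast (simulated data, illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.laplace(scale=1.0, size=50)  # heavy-tailed errors
y[:3] += 30.0                                        # a few gross outliers

def sse(p):   # Normal-error objective: sum of squared residuals
    return np.sum((y - (p[0] + p[1] * x)) ** 2)

def sad(p):   # Laplace-error objective: sum of absolute deviations (LAD)
    return np.sum(np.abs(y - (p[0] + p[1] * x)))

ols = minimize(sse, x0=[0.0, 0.0]).x
lad = minimize(sad, x0=[0.0, 0.0], method="Nelder-Mead").x
print("least squares:", ols)  # pulled toward the outliers
print("LAD:", lad)            # should stay closer to the true (1, 2)
```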

Visual Summary

  • Observed categories: y1…y4
  • Hidden split: y1=x1+x2
  • E-step: estimate x2
  • M-step: update π
  • Iterate until stable

Summary Insight

This genetics example is a template: split complicated observed categories into hidden simple ones, use conditional distributions to impute missing parts, optimize on the completed dataset, and repeat. EM alternates estimation and maximization, guaranteeing monotone ascent in likelihood.

Report on the EM Algorithm in Genetics, Genomics, and Public Health

These notes are based on the report by Nan M. Laird. The full document can be accessed here: