Statistical Methods III: Week-4

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Contents

General EM Framework

  • Introduce a “fantasy complete dataset” \(\vec{x}\) so optimization is easier if we had it.
  • Observed data: \(\vec{y}\).
  • Complete data: \((\vec{y},\vec{x})\).
  • Parameters: \(\vec{\theta}\).
  • Target: maximize the observed log-likelihood \(L(\vec{\theta}\mid\vec{y})=\ln g(\vec{y}\mid\vec{\theta})\), which is hard to do directly (see the display after this list).
  • Strategy:
    • Express the problem via the complete-data log-likelihood.
    • Take expectation over missing parts given current parameter guess.
    • Alternate between estimating missing parts (E-step) and maximizing w.r.t. parameters (M-step).
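Why direct maximization is hard: the observed-data density is a marginal of the complete-data density, so the log sits outside an integral (or sum). Writing \(f(\vec y,\vec x\mid\vec\theta)\) for the complete-data density and \(h(\vec x\mid\vec y,\vec\theta)\) for the conditional density of the missing part (the symbol \(f\) is notation assumed here; \(g\) and \(h\) appear in the proof below),
\[
g(\vec y\mid\vec\theta)=\int f(\vec y,\vec x\mid\vec\theta)\,d\vec x,
\qquad
h(\vec x\mid\vec y,\vec\theta)=\frac{f(\vec y,\vec x\mid\vec\theta)}{g(\vec y\mid\vec\theta)},
\]
so \(L(\vec\theta\mid\vec y)=\ln g(\vec y\mid\vec\theta)\) involves the log of an integral, whereas \(\ln f\) is typically simple.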

E-step and M-step (Formal)

  • E-step: compute the expected complete-data log-likelihood under the current estimate \(\hat\theta^{(n)}\): \[ Q(\theta\mid\hat\theta^{(n)})= \mathbb{E}_{\vec{x}\mid\vec{y},\hat\theta^{(n)}}\big[L(\theta\mid\vec{y},\vec{x})\big]. \]
  • M-step: update by maximizing that expectation: \[ \hat\theta^{(n+1)}=\arg\max_{\theta} Q(\theta\mid\hat\theta^{(n)}). \]
  • Interpretation: E-step fills missing information probabilistically; M-step treats that fill-in as data for optimization.
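A minimal sketch of the generic loop in code, assuming user-supplied e_step, m_step, and log_lik functions (the names and structure below are illustrative, not part of the notes):

```python
from typing import Any, Callable

def run_em(theta0: Any,
           e_step: Callable[[Any], Any],     # computes expectations given the current theta
           m_step: Callable[[Any], Any],     # maximizes the expected complete-data log-likelihood
           log_lik: Callable[[Any], float],  # observed-data log-likelihood, used only for monitoring
           tol: float = 1e-8,
           max_iter: int = 1000) -> Any:
    """Alternate E- and M-steps until the observed log-likelihood stops increasing appreciably."""
    theta = theta0
    ll_old = log_lik(theta)
    for _ in range(max_iter):
        expected = e_step(theta)   # E-step
        theta = m_step(expected)   # M-step
        ll_new = log_lik(theta)    # by the ascent property, ll_new >= ll_old (up to numerical error)
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new
    return theta
```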

Proof of Monotone Convergence (Sketch)

  • Claim: \(L(\hat\theta^{(n+1)}\mid\vec y)\ge L(\hat\theta^{(n)}\mid\vec y)\) — EM never decreases the observed log-likelihood.
  • Step 1 — Decomposition:
    • For any \(\theta\): \[ L(\theta\mid\vec y)=\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\big[L(\theta\mid\vec y,\vec x)\big] -\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\big[\ln h(\vec x\mid\vec y,\theta)\big]. \]
    • First term = expected complete-data log-likelihood; second term = penalty involving the conditional distribution of missing data.
  • Step 2 — M-step increases the first term:
    • \(\hat\theta^{(n+1)}\) is chosen to maximize the expectation, hence \[ \mathbb{E}\big[L(\hat\theta^{(n+1)}\mid\vec y,\vec x)\big]\ge \mathbb{E}\big[L(\hat\theta^{(n)}\mid\vec y,\vec x)\big]. \]
  • Step 3 — Penalty term does not increase:
    • Need \[ \mathbb{E}\big[\ln h(\vec x\mid\vec y,\hat\theta^{(n+1)})\big]\le \mathbb{E}\big[\ln h(\vec x\mid\vec y,\hat\theta^{(n)})\big]. \]
    • Follows from Jensen's inequality, equivalently the non-negativity of the KL divergence: the penalty term \(\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\big[\ln h(\vec x\mid\vec y,\theta)\big]\) is maximized over \(\theta\) at \(\theta=\hat\theta^{(n)}\), so it cannot increase when \(\theta\) moves to \(\hat\theta^{(n+1)}\) (see the display after this list).
  • Step 4 — Combine:
    • Subtract penalty terms from expected complete-data logs to get \[ L(\hat\theta^{(n+1)}\mid\vec y)-L(\hat\theta^{(n)}\mid\vec y)\ge0, \] proving ascent.
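The inequality needed in Step 3 can be made explicit. By Jensen's inequality, for any \(\theta\),
\[
\mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\!\left[\ln\frac{h(\vec x\mid\vec y,\theta)}{h(\vec x\mid\vec y,\hat\theta^{(n)})}\right]
\le \ln \mathbb{E}_{\vec x\mid\vec y,\hat\theta^{(n)}}\!\left[\frac{h(\vec x\mid\vec y,\theta)}{h(\vec x\mid\vec y,\hat\theta^{(n)})}\right]
= \ln 1 = 0,
\]
i.e. the left-hand side equals \(-\mathrm{KL}\big(h(\cdot\mid\vec y,\hat\theta^{(n)})\,\big\|\,h(\cdot\mid\vec y,\theta)\big)\le 0\). Taking \(\theta=\hat\theta^{(n+1)}\) gives exactly the inequality used in Step 3.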

Why This Works — Intuition

  • EM turns a hard marginal optimization into repeated easier optimizations of an expected complete-data objective.
  • In exponential families, the E-step reduces to replacing missing sufficient statistics by their conditional expectations.
  • Each M-step optimizes a surrogate that lower-bounds (or tangentially touches) the observed log-likelihood at the current iterate, guaranteeing non-decrease.

Visual Intuition

  • Likelihood surface = fogged mountain; complete-data likelihood = clear map; observed-data likelihood = fogged projection.
  • E-step: clear the fog locally by estimating missing pieces; M-step: climb uphill on the clearer map.
  • Iteration trace: monotone ascent of \(L(\hat\theta^{(n)}\mid\vec y)\) vs iteration \(n\), flattening near convergence.
  • Suggested sketch: EM schematic (E → M → E …) + plot of likelihood vs iterations showing steady ascent and plateau; optional image: EM intuition diagram (illustrative).

Caveats & Practical Notes

  • Local maxima: EM may converge to a local peak; initialization matters.
  • Slow near optimum: EM typically converges only linearly, with rate governed by the fraction of missing information; Newton-type methods (Newton–Raphson, Fisher scoring) can converge much faster near the optimum but are less stable.
  • Missingness mechanism: EM assumes ignorable missingness (MCAR/MAR); otherwise model the missingness process explicitly.
  • Diagnostics: monitor observed log-likelihood, relative parameter changes, and run multiple starts to check for global optimum.
  • Extensions: stochastic EM, ECM, AECM and other variants can speed up or stabilize convergence in complex models.

Summary / Insight

EM alternates expectation (probabilistic imputation of missing data) and maximization (optimization of a tractable surrogate). The ascent property follows from decomposing the observed log-likelihood into expected complete-data terms minus a penalty; M-step raises the expected term while Jensen/KL ensures the penalty cannot overturn the gain. EM is robust and stable, but can be slow and sensitive to starting values—so use sensible initialization and diagnostics.

Genetics Example of EM (Dempster–Laird–Rubin, 1977)

We start with observed data counts. The EM trick is to imagine hidden complete data, so the optimization becomes much simpler.

Observed Data Setup

We have 197 animals split into 4 categories:

| Group | Count (yi) | Probability |
|-------|------------|-------------|
| y1    | 125        | 1/2 + π/4   |
| y2    | 18         | (1-π)/4     |
| y3    | 20         | (1-π)/4     |
| y4    | 34         | π/4         |

So the multinomial probabilities depend on a single parameter π. But the probability of the first category (the one with count y1) is a sum of two terms, 1/2 and π/4. That’s the clue for EM.

Complete Data Idea

  • Pretend that y1 was really two hidden groups:
    • x1 animals came from prob 1/2
    • x2 animals came from prob π/4
  • So: x1+x2=y1, x3=y2, x4=y3, x5=y4.

Complete Data Probabilities

| Cell | Probability |
|------|-------------|
| x1   | 1/2         |
| x2   | π/4         |
| x3   | (1-π)/4     |
| x4   | (1-π)/4     |
| x5   | π/4         |

Complete-Data Log-Likelihood

L(π) = x1 ln(1/2) + x2 ln(π/4) + (x3+x4) ln((1-π)/4) + x5 ln(π/4)

This would be simple to maximize if we knew x2. But we don’t. That’s where the E-step comes in.

E-Step

  • We compute the expected value of x2, given y1 and current π(n).
  • x2 | y1 ~ Binomial(y1, (π/4)/(1/2 + π/4))
  • So expectation:

    x2(n) = y1 · (π(n)/4) / (1/2 + π(n)/4)

M-Step

  • Maximize expected log-likelihood. Differentiate, set to zero, solve for π:

π(n+1) = (x2(n) + x5) / (x2(n) + x3 + x4 + x5)
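A brief derivation of this update: only the terms involving π matter, so set the derivative of the expected complete-data log-likelihood to zero,
\[
\frac{d}{d\pi}\Big[x_2^{(n)}\ln\tfrac{\pi}{4}+(x_3+x_4)\ln\tfrac{1-\pi}{4}+x_5\ln\tfrac{\pi}{4}\Big]
=\frac{x_2^{(n)}+x_5}{\pi}-\frac{x_3+x_4}{1-\pi}=0
\quad\Longrightarrow\quad
\pi=\frac{x_2^{(n)}+x_5}{x_2^{(n)}+x_3+x_4+x_5}.
\]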

Substitute observed counts:

π(n+1) = (x2(n) + y4) / (x2(n) + y2 + y3 + y4)

And iterate until convergence.
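A minimal runnable sketch of the whole iteration (the function name and starting value are illustrative; the counts are the observed ones above):

```python
def em_genetics(y=(125, 18, 20, 34), pi=0.5, tol=1e-10, max_iter=500):
    """EM iteration for the Dempster-Laird-Rubin genetics example."""
    y1, y2, y3, y4 = y
    for _ in range(max_iter):
        # E-step: expected hidden count x2, since x2 | y1 ~ Binomial(y1, (pi/4)/(1/2 + pi/4))
        x2 = y1 * (pi / 4) / (0.5 + pi / 4)
        # M-step: closed-form maximizer of the expected complete-data log-likelihood
        pi_new = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(pi_new - pi) < tol:
            return pi_new
        pi = pi_new
    return pi

print(em_genetics())  # for these counts the iteration settles near 0.627
```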

Side Concepts

Conditional Multinomial Property

If two multinomial cells are lumped together and you then condition on their combined count, the count falling in either of the original cells is Binomial, with success probability proportional to that cell’s probability. This is why x2 | (x1+x2) became Binomial.
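Concretely, for the split of y1 in this example,
\[
P\big(x_2=k \,\big|\, x_1+x_2=y_1\big)=\binom{y_1}{k}\,p^{k}(1-p)^{\,y_1-k},
\qquad
p=\frac{\pi/4}{1/2+\pi/4},
\]
which is exactly the Binomial used in the E-step.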

Hierarchical Model / Linear Regression

  • xi ~ g(x|θ)
  • Yi | Xi ~ N(α+βXi, σ²)
Joint log-likelihood: L(α,β,σ²,θ) = Σ ln( (1/σ) φ((yi-(α+βxi))/σ) g(xi|θ) )
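As an illustration only, a sketch of this log-likelihood in code; the choice g(x|θ) = Normal(μ, τ²) is an assumption made purely for concreteness, since the notes leave g generic:

```python
import numpy as np
from scipy.stats import norm

def joint_log_lik(alpha, beta, sigma, mu, tau, x, y):
    """Sum over i of ln[(1/sigma) phi((y_i - (alpha + beta*x_i))/sigma) * g(x_i|theta)],
    with g(x|theta) taken to be a Normal(mu, tau^2) density for illustration."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ll_y_given_x = norm.logpdf(y, loc=alpha + beta * x, scale=sigma)  # regression term
    ll_x = norm.logpdf(x, loc=mu, scale=tau)                          # assumed marginal g(x|theta)
    return float(np.sum(ll_y_given_x + ll_x))
```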

Laplace Distribution (Robust Regression)

If Yi|Xi has the Laplace density (λ/2) exp(-λ |yi-(α+βxi)|), maximizing the likelihood over (α, β) leads to Least Absolute Deviations (LAD) regression (a numerical sketch follows the list below).

  • Normal errors → minimize squared error (sensitive to outliers).
  • Laplace errors → minimize absolute deviations (robust to outliers).
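A small numerical sketch of the contrast (simulated data, illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.laplace(scale=1.0, size=50)  # heavy-tailed errors
y[:3] += 30.0                                        # a few gross outliers

def sse(p):   # Normal-error objective: sum of squared residuals
    return np.sum((y - (p[0] + p[1] * x)) ** 2)

def sad(p):   # Laplace-error objective: sum of absolute deviations (LAD)
    return np.sum(np.abs(y - (p[0] + p[1] * x)))

ols = minimize(sse, x0=[0.0, 0.0]).x
lad = minimize(sad, x0=[0.0, 0.0], method="Nelder-Mead").x
print("least squares:", ols)  # pulled toward the outliers
print("LAD:", lad)            # should stay closer to the true (1, 2)
```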

Visual Summary

  • Observed categories: y1…y4
  • Hidden split: y1=x1+x2
  • E-step: estimate x2
  • M-step: update π
  • Iterate until stable

Summary Insight

This genetics example is a template: split complicated observed categories into hidden simple ones, use conditional distributions to impute missing parts, optimize on the completed dataset, and repeat. EM alternates estimation and maximization, guaranteeing monotone ascent in likelihood.

Report on the EM Algorithm in Genetics, Genomics, and Public Health

These notes are based on the report by Nan M. Laird. The full document can be accessed here: