Instructor: Debasis Sengupta
Office / Department: ASU
Email: sdebasis@isical.ac.in
Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%
Question
Q1.
Let \(X_1,X_2,\dots,X_n\) be the numbers of annual accidents occurring in \(n\) stretches of a highway in year 2010. These are independent with pmf \(\mathrm{Poi}(\tau_i),\; i=1,\dots,n.\)
Let \(Y_1,Y_2,\dots,Y_n\) be the numbers of annual accidents on the same stretches in year 2020; they are independent with pmf \(\mathrm{Poi}(\beta\tau_i),\; i=1,\dots,n.\)
For some reason observations \(X_1,\dots,X_k\) are missing (only \(X_{k+1},\dots,X_n\) are observed). Derive explicit expressions for the MLE of \(\beta\) and \(\tau_1,\dots,\tau_n.\)
Solution
Step 0 — notation.
- Define the indicator \(\delta_i\) for observed \(X_i\):
\[
\delta_i=\begin{cases}1,& i=k+1,\dots,n\ (\text{observed }X_i),\\[4pt]
0,& i=1,\dots,k\ (\text{missing }X_i).\end{cases}
\]
- Observed-data totals:
\[
S_X=\sum_{i:\,\delta_i=1} X_i=\sum_{i=k+1}^n X_i,\qquad
S_Y=\sum_{i=1}^n Y_i,\qquad
S_Y^{(O)}=\sum_{i=k+1}^n Y_i,\qquad
S_Y^{(M)}=\sum_{i=1}^k Y_i.
\]
Step 1 — observed-data likelihood.
- The observed-data likelihood (up to constants not involving parameters) is
\[
L(\beta,\tau_1,\dots,\tau_n)
\propto
\prod_{i:\,\delta_i=1} e^{-\tau_i}\tau_i^{X_i}\times
\prod_{i=1}^n e^{-\beta\tau_i}(\beta\tau_i)^{Y_i}.
\]
- Log-likelihood \(\ell(\beta,\tau)\):
\[
\ell(\beta,\tau)
= \sum_{i:\,\delta_i=1}\big[-\tau_i+X_i\log\tau_i\big]
+ \sum_{i=1}^n\big[-\beta\tau_i + Y_i\log\beta + Y_i\log\tau_i\big]
+ \text{const.}
\]
Step 2 — score equations for each \(\tau_i\).
- If \(X_i\) is observed (\(\delta_i=1\)):
\[
\frac{\partial\ell}{\partial\tau_i}
= -1 + \frac{X_i}{\tau_i} -\beta + \frac{Y_i}{\tau_i}
= -(1+\beta)+\frac{X_i+Y_i}{\tau_i}.
\]
Setting to \(0\) gives
\[
\tau_i=\frac{X_i+Y_i}{1+\beta},\qquad i=k+1,\dots,n.
\]
- If \(X_i\) is missing (\(\delta_i=0\)), only the \(Y_i\)-term of the likelihood involves \(\tau_i\):
\[
\frac{\partial\ell}{\partial\tau_i}
= -\beta+\frac{Y_i}{\tau_i}=0
\quad\Rightarrow\quad
\tau_i=\frac{Y_i}{\beta},\qquad i=1,\dots,k.
\]
- Compact form (solution of the \(\tau_i\)-score equations as a function of \(\beta\)):
\[
\boxed{\;\hat\tau_i(\beta)=\frac{\delta_i X_i + Y_i}{\delta_i+\beta}\;,\qquad i=1,\dots,n.\;}
\]
Step 3 — score equation for \(\beta\).
- Differentiate \(\ell\) w.r.t. \(\beta\):
\[
\frac{\partial\ell}{\partial\beta}
= \sum_{i=1}^n\left(-\tau_i + \frac{Y_i}{\beta}\right)
= -\sum_{i=1}^n \tau_i + \frac{S_Y}{\beta}.
\]
- Substitute \(\tau_i=\hat\tau_i(\beta)=\dfrac{\delta_i X_i+Y_i}{\delta_i+\beta}\). Since
\[
\sum_{i=1}^n \hat\tau_i(\beta)=\frac{S_X+S_Y^{(O)}}{1+\beta}+\frac{S_Y^{(M)}}{\beta},
\]
the score equation becomes
\[
-\frac{S_X+S_Y^{(O)}}{1+\beta}-\frac{S_Y^{(M)}}{\beta}+\frac{S_Y}{\beta}=0
\quad\Longleftrightarrow\quad
-\frac{S_X+S_Y^{(O)}}{1+\beta}+\frac{S_Y^{(O)}}{\beta}=0,
\]
using \(S_Y-S_Y^{(M)}=S_Y^{(O)}\). (The stretches with missing \(X_i\) drop out: on their own, their \(Y_i\) carry no information about \(\beta\).)
- Solve for \(\beta\):
\[
\frac{S_Y^{(O)}}{\beta}=\frac{S_X+S_Y^{(O)}}{1+\beta}
\quad\Longrightarrow\quad
S_Y^{(O)}(1+\beta)=\beta\,(S_X+S_Y^{(O)})
\quad\Longrightarrow\quad
\beta\,S_X=S_Y^{(O)}.
\]
Hence, provided \(S_X>0\),
\[
\boxed{\;\hat\beta=\frac{S_Y^{(O)}}{S_X}=\frac{\sum_{i=k+1}^n Y_i}{\sum_{i=k+1}^n X_i}\;.\;}
\]
Step 4 — plug back to get \(\hat\tau_i\).
- Using \(\hat\beta=S_Y^{(O)}/S_X\) in \(\hat\tau_i(\beta)\) yields
\[
\boxed{\;
\hat\tau_i=\begin{cases}
\dfrac{X_i+Y_i}{1+\hat\beta}=\dfrac{S_X\,(X_i+Y_i)}{S_X+S_Y^{(O)}}, & i=k+1,\dots,n\ (\text{observed }X_i),\\[10pt]
\dfrac{Y_i}{\hat\beta}=\dfrac{S_X\,Y_i}{S_Y^{(O)}}, & i=1,\dots,k\ (\text{missing }X_i).
\end{cases}\;}
\]
- Note that for the missing stretches \(\hat\beta\,\hat\tau_i=Y_i\): the fitted year-2020 mean reproduces the observed count exactly, as it must, since each such stretch contributes a free Poisson mean.
Step 5 — existence / edge cases.
- If \(S_X=\sum_{i=k+1}^n X_i=0\) while \(S_Y^{(O)}>0\), the equation \(\beta S_X=S_Y^{(O)}\) cannot be satisfied by any finite \(\beta\); the likelihood increases as \(\beta\to\infty\) and no finite MLE exists.
- If \(S_X=0\) and \(S_Y^{(O)}=0\), \(\beta\) is not identifiable (any \(\beta\) gives the same observed-data likelihood); for the missing stretches only the products \(\beta\tau_i\) are determined by the \(Y_i\). Assume \(S_X>0\) (and \(S_Y^{(O)}>0\)) for the usual case.
Step 6 — justification of maximum.
- For fixed \(\beta\), the log-likelihood is concave in each \(\tau_i\): the second derivative w.r.t. \(\tau_i\) is \(-(\delta_i X_i+Y_i)/\tau_i^2<0\) whenever \(\delta_i X_i+Y_i>0\). Profiling out the \(\tau_i\) leaves a function of \(\beta\) with a single finite critical point when \(S_X>0\) and \(S_Y^{(O)}>0\), so the stationary point found above is the global maximum in regular cases.
Related concepts
- Poisson-regression / GLM viewpoint: the model is a multiplicative scale change; if the \(\tau_i\) were modelled with a log link, \(\log\beta\) would enter as the coefficient of a year-2020 indicator.
- Profile likelihood: we profiled out the \(\tau_i\) to obtain a closed-form MLE for \(\beta\).
- Method-of-moments equivalence: \(\hat\beta=S_Y^{(O)}/S_X\) also follows from equating totals over the fully observed pairs, since \(E\big[S_Y^{(O)}\big]=\beta\,E[S_X]\).
- Missingness assumption: we implicitly assumed the \(X\)-missingness is non-informative (MCAR). If missingness depends on \(\tau_i\) or \(Y_i\), estimates may be biased.
Remarks / Summary
The MLEs are simple and intuitive:
\[
\boxed{\;\hat\beta=\frac{S_Y^{(O)}}{S_X}\;,\qquad
\hat\tau_i=\frac{\delta_i X_i + Y_i}{\delta_i+\hat\beta}
=\begin{cases}
\dfrac{S_X\,(X_i+Y_i)}{S_X+S_Y^{(O)}}, & X_i\ \text{observed},\\[8pt]
\dfrac{S_X\,Y_i}{S_Y^{(O)}}, & X_i\ \text{missing},
\end{cases}\;}
\]
(with the usual caveat that \(S_X>0\) and \(S_Y^{(O)}>0\) are needed for a finite, positive \(\hat\beta\)). The ratio \(\hat\beta\) is determined by the fully observed pairs alone; for stretches with missing \(X_i\), the rate \(\tau_i\) is estimated from \(Y_i\) rescaled by this global factor.
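As an illustration, here is a minimal R sketch that simulates such data and evaluates these closed-form MLEs; the sample size, parameter values and variable names are assumptions made for the example, not part of the problem.
### ---- Problem 1: closed-form MLEs on simulated data (illustrative) ----
set.seed(1)
n <- 50; k <- 10                       # stretches 1..k have X_i missing
beta_true <- 1.5
tau_true <- runif(n, 0.5, 3)
X <- rpois(n, tau_true)                # year-2010 counts
Y <- rpois(n, beta_true * tau_true)    # year-2020 counts
obs <- (k + 1):n                       # indices with X_i observed
S_X <- sum(X[obs])                     # S_X
S_Y_O <- sum(Y[obs])                   # S_Y^(O)
beta_hat <- S_Y_O / S_X                # hat(beta) = S_Y^(O) / S_X
tau_hat <- numeric(n)
tau_hat[obs] <- (X[obs] + Y[obs]) / (1 + beta_hat)   # observed stretches
tau_hat[-obs] <- Y[-obs] / beta_hat                  # stretches with missing X_i
round(c(beta_hat = beta_hat, tau_hat_1 = tau_hat[1], tau_hat_n = tau_hat[n]), 3)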
Q2.
In Problem 1, assume a “complete data model” that includes the values of \(X_1,\dots,X_k\) (so the unobserved \(X_1,\dots,X_k\) are treated as latent). Write down the E-step and the M-step of the EM algorithm to obtain the MLEs of \(\beta\) and \(\tau_1,\dots,\tau_n\).
Solution
Model recap / notation
- For \(i=1,\dots,n\):
- \(X_i \sim \operatorname{Poi}(\tau_i)\)
- \(Y_i \sim \operatorname{Poi}(\beta\tau_i)\)
Independent across \(i\) and between years conditional on the \(\tau_i\)’s.
- Observed: all \(Y_i\); \(X_{k+1},\dots,X_n\) observed; \(X_1,\dots,X_k\) missing.
- Sets: \(M=\{1,\dots,k\}\) (missing), \(O=\{k+1,\dots,n\}\) (observed).
- Notation:
\[
S_X^{(O)}=\sum_{i\in O} X_i,\quad
S_Y=\sum_{i=1}^n Y_i,\quad
S_Y^{(O)}=\sum_{i\in O} Y_i,\quad
S_Y^{(M)}=\sum_{i\in M} Y_i
\]
Complete-data log-likelihood
If we had the full data \((X_1,\dots,X_n,Y_1,\dots,Y_n)\), the log-likelihood (up to constant) is:
\[
\ell_c(\beta,\tau)
= \sum_{i=1}^n\left[-(1+\beta)\tau_i + (X_i+Y_i)\log\tau_i\right] + S_Y\log\beta
\]
E-step
- For each missing \(i\in M\):
\[
\mathbb{E}[X_i\mid Y_i,\theta^{(t)}] = \tau_i^{(t)}
\]
(since \(X_i\) and \(Y_i\) are independent given \(\tau_i\)).
- Define:
\[
\tilde X_i^{(t)} =
\begin{cases}
X_i, & i\in O,\\
\tau_i^{(t)}, & i\in M.
\end{cases}
\]
- Expected complete-data log-likelihood:
\[
Q(\beta,\tau\mid\theta^{(t)}) = \sum_{i=1}^n\left[-(1+\beta)\tau_i + (\tilde X_i^{(t)}+Y_i)\log\tau_i\right] + S_Y\log\beta
\]
M-step
- Setting \(\partial Q/\partial\tau_i=0\) gives, for each \(i\),
\[
\tau_i^{(t+1)} = \frac{\tilde X_i^{(t)} + Y_i}{1+\beta^{(t+1)}}.
\]
- Setting \(\partial Q/\partial\beta=0\) gives
\[
\beta^{(t+1)} = \frac{S_Y}{\sum_{i=1}^n \tau_i^{(t+1)}}.
\]
- Solving the two conditions simultaneously (substitute the first into the second) yields the closed-form update
\[
\beta^{(t+1)} = \frac{S_Y}{\sum_{i=1}^n \tilde X_i^{(t)}}
= \frac{S_Y}{S_X^{(O)} + \sum_{i\in M} \tau_i^{(t)}}.
\]
One EM iteration
- E-step:
- For \(i\in M\): \(\tilde X_i^{(t)} = \tau_i^{(t)}\)
- For \(i\in O\): \(\tilde X_i^{(t)} = X_i\)
- M-step:
- \(\displaystyle \beta^{(t+1)} = \frac{S_Y}{S_X^{(O)} + \sum_{i\in M} \tau_i^{(t)}}\)
- \(\displaystyle \tau_i^{(t+1)} = \frac{\tilde X_i^{(t)} + Y_i}{1+\beta^{(t+1)}}\)
Initialization & termination
- Initialize \(\tau_i^{(0)}\) and \(\beta^{(0)}\) with reasonable values.
- Iterate until convergence based on parameter change or log-likelihood increment.
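A compact R sketch of these E- and M-updates is given below; the function name and starting values are illustrative, and it can be applied to the simulated X, Y, obs from the Problem 1 sketch above.
### ---- Problem 2: EM iterations for the Poisson missing-data model (illustrative) ----
em_poisson <- function(X, Y, obs, beta0 = 1, max_iter = 500, tol = 1e-8) {
  n <- length(Y)
  mis <- setdiff(seq_len(n), obs)      # indices with X_i missing
  beta <- beta0
  tau <- rep(mean(Y), n)               # simple starting values tau_i^(0)
  S_Y <- sum(Y)
  for (it in 1:max_iter) {
    ## E-step: replace each missing X_i by its conditional expectation tau_i^(t)
    Xtilde <- X
    Xtilde[mis] <- tau[mis]
    ## M-step: joint update of beta and all tau_i
    beta_new <- S_Y / sum(Xtilde)
    tau_new <- (Xtilde + Y) / (1 + beta_new)
    done <- max(abs(c(beta_new - beta, tau_new - tau))) < tol
    beta <- beta_new; tau <- tau_new
    if (done) break
  }
  list(beta = beta, tau = tau, iterations = it)
}
## e.g. fit <- em_poisson(X, Y, obs); c(fit$beta, sum(Y[obs]) / sum(X[obs]))
## at convergence beta agrees with the closed-form MLE S_Y^(O) / S_X^(O).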
Fixed point / connection to MLE
- At convergence:
\[
\tau_i^{*} = \frac{\tilde X_i^{*}+Y_i}{1+\beta^{*}}, \quad
\beta^{*} = \frac{S_Y}{\sum_i \tilde X_i^{*}}
\]
- For \(i\in O\): \(\tau_i^{*} = \frac{X_i+Y_i}{1+\beta^{*}}\)
- For \(i\in M\): \(\tau_i^{*} = \frac{Y_i}{\beta^{*}}\)
- Final closed-form MLE:
\[
\beta^{*} = \frac{S_Y^{(O)}}{S_X^{(O)}}
\]
Edge cases
- If \(S_X^{(O)}=0\) but \(S_Y^{(O)}>0\), no finite MLE of \(\beta\) exists and the iterates for \(\beta\) diverge; if \(S_X^{(O)}=S_Y^{(O)}=0\), \(\beta\) is not identifiable. In either case, regularize or use a Bayesian approach.
- If \(S_Y^{(O)}=0\) and \(S_X^{(O)}>0\), \(\beta^{*}=0\).
Q4.
If the recursions in Problem 3 converge, what is the value of τ₁ reached at convergence?
Detailed solution
Let the EM iterates converge to \((\beta^*,\tau_1^*,\tau_2^*,\dots,\tau_n^*)\). At convergence the EM fixed-point (stationarity) conditions must hold (these are simply the M-step formulae evaluated at the limit, together with the definition of the \(\tilde X_i\)).
Recall the M-step relations used in the EM:
\[
\beta = \frac{S_Y}{\sum_{i=1}^n\tilde X_i},\qquad
\tau_i = \frac{\tilde X_i + Y_i}{1+\beta},
\]
where \(\tilde X_i = X_i\) if \(X_i\) observed and \(\tilde X_i=\tau_i\) if \(X_i\) missing, and \(S_Y=\sum_{i=1}^n Y_i\).
Case A — \(X_1\) observed
Then \(\tilde X_1^*=X_1\) and the fixed-point equation for \(\tau_1\) is:
\[
\tau_1^* = \frac{X_1 + Y_1}{1+\beta^*}.
\]
So the limit value is:
\[
\boxed{\tau_1^* = \frac{X_1 + Y_1}{1+\beta^*}}
\]
Equivalently, using \(\beta^* = S_Y^{(O)}/S_X^{(O)}\) (the fixed-point value from Problem 2), this equals the closed-form MLE value.
Case B — \(X_1\) missing
Then \(\tilde X_1^*=\tau_1^*\). The fixed-point equation gives:
\[
\tau_1^* = \frac{\tau_1^* + Y_1}{1+\beta^*}.
\]
Rearranging:
\[
(1+\beta^*)\tau_1^* = \tau_1^* + Y_1
\quad\Longrightarrow\quad
\beta^* \tau_1^* = Y_1
\quad\Longrightarrow\quad
\boxed{\tau_1^* = \frac{Y_1}{\beta^*}}
\]
Summary
- If \(X_1\) observed: \(\tau_1^*=\dfrac{X_1+Y_1}{1+\beta^*}\)
- If \(X_1\) missing: \(\tau_1^*=\dfrac{Y_1}{\beta^*}\)
These match the MLE relations from direct maximization.
How to prove these equalities (short rigorous argument)
- EM fixed-point property: A limit point \(\theta^*=(\beta^*,\tau^*)\) satisfies the M-step maximization with the E-step expectations computed at \(\theta^*\), implying the M-step equations hold at \(\theta^*\).
- Algebraic rearrangement: For the missing case, \(\tau_1^*=(\tau_1^*+Y_1)/(1+\beta^*) \Rightarrow \beta^*\tau_1^*=Y_1 \Rightarrow \tau_1^*=Y_1/\beta^*\). Similarly for the observed case.
- Consistency with MLE: Solving the observed-data score equations directly yields the same relations.
How to prove EM recursions converge (sketch)
- Monotonicity + compactness: If parameters lie in a compact set and the observed log-likelihood has a unique stationary point, EM converges to it.
- One-dimensional contraction: For only \(\tau_1\) missing, analyze \(f(\tau)=\frac{(S_X^{(O)}+\tau)(\tau+Y_1)}{S_X^{(O)}+\tau+S_Y}\) and check \(|f'(\tau^*)|<1\) for local convergence.
- Vector case: Check local contraction (spectral radius of Jacobian < 1) or rely on MLE uniqueness + monotonicity.
Important caveats / edge cases
- Lack of identifiability or an unbounded likelihood can cause divergence (e.g. \(S_X^{(O)}=0\) while \(S_Y^{(O)}>0\), so that \(\hat\beta\) is infinite).
- Small counts or many missing entries can cause slow convergence; acceleration methods may help.
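As a numerical illustration of the one-dimensional map \(f(\tau)\) from the convergence sketch above, the following R lines iterate it to its fixed point; the totals used are made-up values, not data from the problem.
### ---- Problem 4: fixed point of the one-dimensional EM map (X_1 missing only) ----
S_X_O <- 40; S_Y_O <- 55; Y1 <- 6                  # assumed observed-pair totals and Y_1
S_Y <- S_Y_O + Y1
f <- function(tau) (S_X_O + tau) * (tau + Y1) / (S_X_O + tau + S_Y)
tau <- 1                                           # arbitrary starting value
for (t in 1:200) tau <- f(tau)                     # iterate the EM map
beta_star <- S_Y_O / S_X_O                         # limiting beta (Problem 2)
c(tau_limit = tau, Y1_over_beta_star = Y1 / beta_star)   # agree: tau_1^* = Y_1 / beta^*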
Question
Q6.
Let \( Y_1, \ldots, Y_n \) be i.i.d. from the two–component normal mixture
\[
f(y \mid p, \mu_1, \mu_2, \sigma^2)
= p \cdot \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_1}{\sigma}\right)
+ (1-p) \cdot \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_2}{\sigma}\right),
\quad -\infty < y < \infty,
\]
Write down an iterative step of the steepest-ascent method for the MLEs of \(p,\mu_1,\mu_2,\sigma^2\).
Solution
Log-likelihood and responsibilities
- Parameters: \(\theta=(p,\mu_1,\mu_2,\sigma^2)\), with \(\sigma>0\).
- Log-likelihood:
\[
\ell(\theta)=\sum_{i=1}^n
\log\!\left[p\,\phi_{\mu_1,\sigma}(Y_i) + (1-p)\,\phi_{\mu_2,\sigma}(Y_i)\right],
\]
where \(\phi_{\mu,\sigma}(y)=\frac{1}{\sigma}\phi\!\big((y-\mu)/\sigma\big)\).
- Responsibilities (posterior weights under current \(\theta\)):
\[
r_i(\theta)=
\frac{p\,\phi_{\mu_1,\sigma}(Y_i)}
{p\,\phi_{\mu_1,\sigma}(Y_i)+(1-p)\,\phi_{\mu_2,\sigma}(Y_i)}\in(0,1).
\]
Score (gradient) components
-
\[
\frac{\partial \ell}{\partial p}
= \sum_{i=1}^n\!\left(\frac{r_i}{p}-\frac{1-r_i}{1-p}\right)
\]
-
\[
\frac{\partial \ell}{\partial \mu_1}
= \frac{1}{\sigma^2}\sum_{i=1}^n r_i\,(Y_i-\mu_1),\qquad
\frac{\partial \ell}{\partial \mu_2}
= \frac{1}{\sigma^2}\sum_{i=1}^n (1-r_i)\,(Y_i-\mu_2)
\]
-
\[
\frac{\partial \ell}{\partial \sigma^2}
= \sum_{i=1}^n\!\left[
-\frac{1}{2\sigma^2}
+ \frac{ r_i\,(Y_i-\mu_1)^2 + (1-r_i)\,(Y_i-\mu_2)^2 }{2\sigma^4}
\right]
\]
Steepest-ascent iterative step
- Given iterate \(\theta^{(t)}=(p^{(t)},\mu_1^{(t)},\mu_2^{(t)},(\sigma^2)^{(t)})\) and stepsize \(\eta_t>0\), compute responsibilities
\[
r_i^{(t)}=
\frac{p^{(t)}\,\phi_{\mu_1^{(t)},\,\sigma^{(t)}}(Y_i)}
{p^{(t)}\,\phi_{\mu_1^{(t)},\,\sigma^{(t)}}(Y_i)+\big(1-p^{(t)}\big)\,\phi_{\mu_2^{(t)},\,\sigma^{(t)}}(Y_i)}.
\]
- Update:
\[
\begin{aligned}
p^{(t+1)} \,&=\, p^{(t)} + \eta_t \sum_{i=1}^n\!\left(\frac{r_i^{(t)}}{p^{(t)}}-\frac{1-r_i^{(t)}}{1-p^{(t)}}\right),\\[6pt]
\mu_1^{(t+1)} \,&=\, \mu_1^{(t)} + \eta_t \frac{1}{(\sigma^{2})^{(t)}}\sum_{i=1}^n r_i^{(t)}\big(Y_i-\mu_1^{(t)}\big),\\[6pt]
\mu_2^{(t+1)} \,&=\, \mu_2^{(t)} + \eta_t \frac{1}{(\sigma^{2})^{(t)}}\sum_{i=1}^n (1-r_i^{(t)})\big(Y_i-\mu_2^{(t)}\big),\\[8pt]
(\sigma^{2})^{(t+1)} \,&=\, (\sigma^{2})^{(t)} + \eta_t
\sum_{i=1}^n\!\left[
-\frac{1}{2(\sigma^{2})^{(t)}}
+ \frac{ r_i^{(t)}(Y_i-\mu_1^{(t)})^2 + (1-r_i^{(t)})(Y_i-\mu_2^{(t)})^2}
{2\big((\sigma^{2})^{(t)}\big)^2}
\right].
\end{aligned}
\]
Constraints & practicalities
- Maintain constraints: \(0<p<1\) and \(\sigma^2>0\).
- Projection: after the step, clip \(p\in[\epsilon,1-\epsilon]\), \(\sigma^2\ge \epsilon\).
- Reparameterize (cleaner): update \(\omega=\log\!\frac{p}{1-p}\), \(\lambda=\log\sigma^2\) unconstrained via gradient ascent, then map back \(p=\mathrm{logit}^{-1}(\omega)\), \(\sigma^2=e^\lambda\).
- Use a line-search or diminishing \(\eta_t\) to ensure ascent of \(\ell(\theta)\).
- Impose an identifiability convention (e.g., \(\mu_1\le \mu_2\)) to avoid label switching.
Remarks
The responsibilities \(r_i^{(t)}\) are the same weights used in EM. Steepest ascent follows the score direction; EM or second-order (Newton / Fisher scoring, L-BFGS) steps usually converge faster for mixtures, but the updates above are the canonical steepest-ascent iteration.
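The following R sketch implements one such gradient step with a fixed stepsize and the simple projection variant described above; the function name, stepsize and simulated data are assumptions made for the example.
### ---- Problem 6: one steepest-ascent step (fixed stepsize + projection) ----
grad_step <- function(Y, theta, eta = 1e-3, eps = 1e-6) {
  p <- theta[1]; mu1 <- theta[2]; mu2 <- theta[3]; s2 <- theta[4]
  d1 <- dnorm(Y, mu1, sqrt(s2)); d2 <- dnorm(Y, mu2, sqrt(s2))
  r <- p * d1 / (p * d1 + (1 - p) * d2)             # responsibilities r_i
  g_p <- sum(r / p - (1 - r) / (1 - p))             # score for p
  g_mu1 <- sum(r * (Y - mu1)) / s2                  # score for mu1
  g_mu2 <- sum((1 - r) * (Y - mu2)) / s2            # score for mu2
  g_s2 <- sum(-1 / (2 * s2) +
              (r * (Y - mu1)^2 + (1 - r) * (Y - mu2)^2) / (2 * s2^2))   # score for sigma^2
  theta_new <- theta + eta * c(g_p, g_mu1, g_mu2, g_s2)
  theta_new[1] <- min(max(theta_new[1], eps), 1 - eps)   # keep 0 < p < 1
  theta_new[4] <- max(theta_new[4], eps)                 # keep sigma^2 > 0
  theta_new
}
## Example: a few hundred steps on simulated data
set.seed(2)
Y <- c(rnorm(40, -1, 1), rnorm(60, 2, 1))
theta <- c(p = 0.5, mu1 = -0.5, mu2 = 1.5, s2 = 1)
for (t in 1:300) theta <- grad_step(Y, theta)
theta
In practice a line search, or the logit / log reparameterization mentioned earlier, is safer than the fixed stepsize used here.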
Question
Q8.
Suppose \((X_1,Y_1),\dots,(X_n,Y_n)\) are i.i.d. where \(X_i\sim\operatorname{Bernoulli}(p)\) and the conditional pdf of \(Y_i\) given \(X_i=x\) is
\[
f(y \mid x, p, \mu_1, \mu_2, \sigma^2) =
x \; \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_1}{\sigma}\right)
+ (1 - x) \; \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_2}{\sigma}\right),
\quad -\infty < y < \infty
\]
with \( 0 < p < 1 \).
Obtain explicit expressions for the MLEs of the four parameters.
Solution
Interpretation & notation
- The label \(X_i\) is observed, so this is a labeled mixture: if \(X_i=1\) then \(Y_i\sim N(\mu_1,\sigma^2)\); if \(X_i=0\) then \(Y_i\sim N(\mu_2,\sigma^2)\).
- Define
\[
n_1=\sum_{i=1}^n X_i,\qquad n_0=n-n_1.
\]
- Index sets and sums:
\[
I_1=\{i: X_i=1\},\quad I_0=\{i:X_i=0\},
\]
\[
S_1=\sum_{i\in I_1} Y_i,\quad S_0=\sum_{i\in I_0} Y_i.
\]
- Sample means (when defined):
\[
\bar Y_1=\frac{S_1}{n_1}\ (\text{if }n_1>0),\qquad
\bar Y_0=\frac{S_0}{n_0}\ (\text{if }n_0>0).
\]
- Sums of squared deviations:
\[
Q_1=\sum_{i\in I_1}(Y_i-\bar Y_1)^2,\quad
Q_0=\sum_{i\in I_0}(Y_i-\bar Y_0)^2.
\]
Log-likelihood (brief)
Because \((X_i,Y_i)\) are i.i.d. and \(X_i\) observed,
\[
\ell(p,\mu_1,\mu_2,\sigma^2)
= \sum_{i=1}^n \big[X_i\log p +(1-X_i)\log(1-p)\big]
+ \sum_{i\in I_1}\log\phi_{\mu_1,\sigma}(Y_i)
+ \sum_{i\in I_0}\log\phi_{\mu_2,\sigma}(Y_i),
\]
where \(\log\phi_{\mu,\sigma}(y)=-\tfrac12\log(2\pi\sigma^2)-\dfrac{(y-\mu)^2}{2\sigma^2}\).
MLE for \(p\)
- Maximizing the Bernoulli part yields the familiar estimator:
\[
\boxed{\;\hat p = \frac{n_1}{n} = \frac{1}{n}\sum_{i=1}^n X_i\;.}
\]
MLEs for \(\mu_1,\mu_2\)
- Maximizing the normal log-terms w.r.t. \(\mu_1\) (holding \(\sigma^2\) fixed) gives the sample mean within group 1:
\[
\boxed{\;\hat\mu_1 = \bar Y_1 = \frac{1}{n_1}\sum_{i: X_i=1} Y_i \quad(\text{if }n_1>0)\;.}
\]
- Similarly for group 0:
\[
\boxed{\;\hat\mu_2 = \bar Y_0 = \frac{1}{n_0}\sum_{i: X_i=0} Y_i \quad(\text{if }n_0>0)\;.}
\]
MLE for \(\sigma^2\)
- Collect the normal contributions and differentiate w.r.t. \(\sigma^2\). The MLE is
\[
\boxed{\;
\widehat{\sigma^2}
= \frac{1}{n}\Bigg(\sum_{i\in I_1}(Y_i-\hat\mu_1)^2 + \sum_{i\in I_0}(Y_i-\hat\mu_2)^2\Bigg)
\;=\; \frac{Q_1+Q_0}{n}\;.
}
\]
(Note: this is the MLE divisor \(n\), not the unbiased divisor \(n-2\).)
Edge cases & existence
- If \(n_1=0\) then \(\hat p=0\) and \(\hat\mu_1\) is undefined from the data; similarly if \(n_0=0\) then \(\hat p=1\) and \(\hat\mu_2\) is undefined. In those degenerate cases extra assumptions or priors are required.
- If both groups have at least one observation, the MLEs above are well-defined and unique.
Summary (boxed answers)
- \(\displaystyle \boxed{\;\hat p = \dfrac{n_1}{n}\;}\).
- \(\displaystyle \boxed{\;\hat\mu_1 = \bar Y_1 = \dfrac{1}{n_1}\sum_{i:X_i=1} Y_i\;}\) (if \(n_1>0\)).
- \(\displaystyle \boxed{\;\hat\mu_2 = \bar Y_0 = \dfrac{1}{n_0}\sum_{i:X_i=0} Y_i\;}\) (if \(n_0>0\)).
- \(\displaystyle \boxed{\;\widehat{\sigma^2} = \dfrac{1}{n}\Big(\sum_{i\in I_1}(Y_i-\hat\mu_1)^2 + \sum_{i\in I_0}(Y_i-\hat\mu_2)^2\Big)\;}\).
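A short R sketch checking these closed-form expressions on simulated labelled data (the simulation settings are illustrative):
### ---- Problem 8: closed-form MLEs with observed labels ----
set.seed(3)
n <- 200; p_true <- 0.4; mu1_true <- -1; mu2_true <- 2; sigma2_true <- 1
X <- rbinom(n, 1, p_true)                                    # observed labels
Y <- rnorm(n, mean = ifelse(X == 1, mu1_true, mu2_true), sd = sqrt(sigma2_true))
n1 <- sum(X); n0 <- n - n1
p_hat <- n1 / n                                              # hat(p)
mu1_hat <- mean(Y[X == 1])                                   # group-1 sample mean
mu2_hat <- mean(Y[X == 0])                                   # group-0 sample mean
sigma2_hat <- (sum((Y[X == 1] - mu1_hat)^2) +
               sum((Y[X == 0] - mu2_hat)^2)) / n             # pooled MLE (divisor n)
c(p_hat = p_hat, mu1_hat = mu1_hat, mu2_hat = mu2_hat, sigma2_hat = sigma2_hat)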
Question
Q9.
For the data and the model described in Problem 6, let the “complete data model” be as in Problem 8. Derive the E-step and the M-step of the EM algorithm for finding the MLEs of the four parameters \((p,\mu_1,\mu_2,\sigma^2)\).
Solution
Model / notation (recap)
- Observed: \(Y_1,\dots,Y_n\). Latent labels: \(X_i\in\{0,1\}\) for \(i=1,\dots,n\).
- Complete-data model:
\[
P(X_i=1)=p,\qquad
Y_i\mid X_i=1\sim N(\mu_1,\sigma^2),\quad
Y_i\mid X_i=0\sim N(\mu_2,\sigma^2).
\]
- Current iterate: \(\theta^{(t)}=(p^{(t)},\mu_1^{(t)},\mu_2^{(t)},(\sigma^2)^{(t)})\).
E-step
- Compute the conditional expectation of the complete-data indicator \(X_i\) given \(Y_i\) under \(\theta^{(t)}\). These are the responsibilities:
\[
\boxed{\;
r_i^{(t)} \;=\; \mathbb{E}[X_i\mid Y_i,\theta^{(t)}]
\;=\; \Pr\{X_i=1\mid Y_i,\theta^{(t)}\}
\;=\; \frac{p^{(t)}\,\phi\!\big(\dfrac{Y_i-\mu_1^{(t)}}{\sigma^{(t)}}\big)}
{\,p^{(t)}\,\phi\!\big(\dfrac{Y_i-\mu_1^{(t)}}{\sigma^{(t)}}\big)
+(1-p^{(t)})\,\phi\!\big(\dfrac{Y_i-\mu_2^{(t)}}{\sigma^{(t)}}\big)\,}\;}
\]
where \(\sigma^{(t)}=\sqrt{(\sigma^2)^{(t)}}\) and \(\phi\) is the standard normal pdf.
- Also note \(\mathbb{E}[1-X_i\mid Y_i,\theta^{(t)}]=1-r_i^{(t)}\).
- For numerical stability compute the ratio using log-sum-exp form (work in log-domain).
M-step
- Set \(R^{(t)}=\sum_{i=1}^n r_i^{(t)}\) (the expected count in component 1) and \(n-R^{(t)}=\sum_{i=1}^n(1-r_i^{(t)})\).
- Maximize the expected complete-data log-likelihood (replace \(X_i\) by \(r_i^{(t)}\)). Closed-form updates are:
-
\[
\boxed{\; p^{(t+1)} \;=\; \frac{1}{n}\sum_{i=1}^n r_i^{(t)} \;=\; \frac{R^{(t)}}{n} \;}
\]
-
\[
\boxed{\; \mu_1^{(t+1)} \;=\; \frac{\displaystyle\sum_{i=1}^n r_i^{(t)} Y_i}{\displaystyle\sum_{i=1}^n r_i^{(t)}} \;=\; \frac{1}{R^{(t)}}\sum_{i=1}^n r_i^{(t)}Y_i \;}
\]
-
\[
\boxed{\; \mu_2^{(t+1)} \;=\; \frac{\displaystyle\sum_{i=1}^n (1-r_i^{(t)}) Y_i}{\displaystyle\sum_{i=1}^n (1-r_i^{(t)})} \;=\; \frac{1}{n-R^{(t)}}\sum_{i=1}^n (1-r_i^{(t)})Y_i \;}
\]
-
\[
\boxed{\; (\sigma^2)^{(t+1)} \;=\; \frac{1}{n}\sum_{i=1}^n\Big[\, r_i^{(t)}\big(Y_i-\mu_1^{(t+1)}\big)^2 + (1-r_i^{(t)})\big(Y_i-\mu_2^{(t+1)}\big)^2 \Big]\; }
\]
(This follows by maximizing the expected complete-data normal log-likelihood; the MLE divisor is \(n\).)
One EM iteration (concise summary)
- Given \(\theta^{(t)}\):
- E-step: compute \(r_i^{(t)}\) for \(i=1,\dots,n\) using the boxed formula above.
- M-step: compute \(R^{(t)}=\sum r_i^{(t)}\) and update
\[
p^{(t+1)}=\frac{R^{(t)}}{n},\quad
\mu_1^{(t+1)}=\frac{\sum r_i^{(t)}Y_i}{R^{(t)}},\quad
\mu_2^{(t+1)}=\frac{\sum (1-r_i^{(t)})Y_i}{n-R^{(t)}},
\]
and
\[
(\sigma^2)^{(t+1)}=\frac{1}{n}\sum \big[ r_i^{(t)}(Y_i-\mu_1^{(t+1)})^2 + (1-r_i^{(t)})(Y_i-\mu_2^{(t+1)})^2\big].
\]
Convergence & practical notes
- Iterate E and M steps until parameter change or log-likelihood increment falls below tolerance.
- Initialization: choose sensible starts (k-means, random, or multiple restarts) to avoid poor local maxima and label switching.
- For numerical stability use log-domain computations for responsibilities; guard against division by zero (ensure \(R^{(t)}\) and \(n-R^{(t)}\) are not too close to 0; regularize if necessary).
Remark
These EM updates are the classical closed-form E- and M-steps for the Gaussian two-component mixture with common variance. They increase the observed-data log-likelihood monotonically and (under regularity) converge to a stationary point of the likelihood.
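Since the notes above recommend computing the responsibilities in the log domain, here is a small R helper that does so via the log-sum-exp trick; the function name is an assumption.
### ---- Problem 9: responsibilities via log-sum-exp (numerically stable E-step) ----
responsibilities <- function(Y, p, mu1, mu2, sigma2) {
  la <- log(p) + dnorm(Y, mu1, sqrt(sigma2), log = TRUE)      # log( p * phi_1(Y_i) )
  lb <- log(1 - p) + dnorm(Y, mu2, sqrt(sigma2), log = TRUE)  # log( (1-p) * phi_2(Y_i) )
  m <- pmax(la, lb)                                           # log-sum-exp trick
  log_denom <- m + log(exp(la - m) + exp(lb - m))
  exp(la - log_denom)                                         # r_i = P(X_i = 1 | Y_i)
}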
Question
Q10.
Simulate a sample of size \(n = 100\) from the model presumed in Problem 6 for reasonable values of the four parameters. Write R code to obtain the MLEs through the EM algorithm, starting from initial values \(p = 0\), \(\mu_1 = 0\), \(\mu_2\) equal to the sample mean and \(\sigma^2\) equal to the sample variance.
Solution
set.seed(123) # For reproducibility
### ---- Parameters for simulation ----
n <- 100
p_true <- 0.4
mu1_true <- -1
mu2_true <- 2
sigma2_true <- 1
### ---- Simulate data from mixture model ----
z <- rbinom(n, 1, p_true) # Latent component indicators
x <- numeric(n)
x[z == 1] <- rnorm(sum(z == 1), mean = mu1_true, sd = sqrt(sigma2_true))
x[z == 0] <- rnorm(sum(z == 0), mean = mu2_true, sd = sqrt(sigma2_true))
### ---- Initial values ----
p <- 0.001   # the prescribed p = 0 is a degenerate EM fixed point (gamma stays 0, mu1 update is 0/0), so start just above 0
mu1 <- 0
mu2 <- mean(x)
sigma2 <- var(x)
### ---- EM Algorithm ----
max_iter <- 1000
tol <- 1e-6
loglik <- numeric(max_iter)
eps <- 1e-10
for (iter in 1:max_iter) {
  ## E-step: Posterior probabilities for component 1
  tau1 <- p * dnorm(x, mean = mu1, sd = sqrt(sigma2))
  tau2 <- (1 - p) * dnorm(x, mean = mu2, sd = sqrt(sigma2))
  denom <- tau1 + tau2 + eps   # avoid division by zero
  gamma <- tau1 / denom
  ## Log-likelihood
  loglik[iter] <- sum(log(denom))
  ## M-step: Update parameters
  p_new <- mean(gamma)
  mu1_new <- sum(gamma * x) / sum(gamma)
  mu2_new <- sum((1 - gamma) * x) / sum(1 - gamma)
  sigma2_new <- sum(gamma * (x - mu1_new)^2 + (1 - gamma) * (x - mu2_new)^2) / n
  ## Check convergence
  if (max(abs(c(p_new - p, mu1_new - mu1, mu2_new - mu2, sigma2_new - sigma2)), na.rm = TRUE) < tol) {
    p <- p_new; mu1 <- mu1_new; mu2 <- mu2_new; sigma2 <- sigma2_new
    loglik <- loglik[1:iter]
    cat("Converged in", iter, "iterations\n")
    break
  }
  ## Update for next iteration
  p <- p_new
  mu1 <- mu1_new
  mu2 <- mu2_new
  sigma2 <- sigma2_new
}
### ---- Final estimates ----
cat("Estimated p:", p, "\n")
cat("Estimated mu1:", mu1, "\n")
cat("Estimated mu2:", mu2, "\n")
cat("Estimated sigma^2:", sigma2, "\n")
### ---- Plot log-likelihood ----
plot(loglik, type = "o", main = "Log-likelihood over iterations",
     xlab = "Iteration", ylab = "Log-likelihood")