Statistical Methods III: Assignment Week-2

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Question

Q1.
Let \(X_1,X_2,\dots,X_n\) be the numbers of annual accidents occurring in \(n\) stretches of a highway in the year 2010. These are independent, with \(X_i\sim\mathrm{Poi}(\tau_i),\; i=1,\dots,n.\) Let \(Y_1,Y_2,\dots,Y_n\) be the numbers of annual accidents on the same stretches in the year 2020; these are independent of one another and of the \(X_i\)'s, with \(Y_i\sim\mathrm{Poi}(\beta\tau_i),\; i=1,\dots,n.\) For some reason the observations \(X_1,\dots,X_k\) are missing (only \(X_{k+1},\dots,X_n\) are observed). Derive explicit expressions for the MLEs of \(\beta\) and \(\tau_1,\dots,\tau_n.\)

Solution

  • Step 0 — notation.
    • Define the indicator \(\delta_i\) for observed \(X_i\): \[ \delta_i=\begin{cases}1,& i=k+1,\dots,n\ (\text{observed }X_i),\\[4pt] 0,& i=1,\dots,k\ (\text{missing }X_i).\end{cases} \]
    • Index sets and observed-data totals: write \(O=\{k+1,\dots,n\}\) (\(X_i\) observed) and \(M=\{1,\dots,k\}\) (\(X_i\) missing), and \[ S_X=\sum_{i\in O} X_i,\qquad S_Y^{(O)}=\sum_{i\in O} Y_i,\qquad S_Y^{(M)}=\sum_{i\in M} Y_i,\qquad S_Y=\sum_{i=1}^n Y_i=S_Y^{(O)}+S_Y^{(M)}. \]
  • Step 1 — observed-data likelihood.
    • The observed-data likelihood (up to constants not involving parameters) is \[ L(\beta,\tau_1,\dots,\tau_n) \propto \prod_{i:\,\delta_i=1} e^{-\tau_i}\tau_i^{X_i}\times \prod_{i=1}^n e^{-\beta\tau_i}(\beta\tau_i)^{Y_i}. \]
    • Log-likelihood \(\ell(\beta,\tau)\): \[ \ell(\beta,\tau) = \sum_{i:\,\delta_i=1}\big[-\tau_i+X_i\log\tau_i\big] + \sum_{i=1}^n\big[-\beta\tau_i + Y_i\log\beta + Y_i\log\tau_i\big] + \text{const.} \]
  • Step 2 — score equations for each \(\tau_i\).
    • If \(X_i\) is observed (\(\delta_i=1\)): \[ \frac{\partial\ell}{\partial\tau_i} = -1 + \frac{X_i}{\tau_i} -\beta + \frac{Y_i}{\tau_i} = -(1+\beta)+\frac{X_i+Y_i}{\tau_i}. \] Setting to \(0\) gives \[ \tau_i=\frac{X_i+Y_i}{1+\beta},\qquad i=k+1,\dots,n. \]
    • If \(X_i\) is missing (\(\delta_i=0\)), the term \(-\tau_i+X_i\log\tau_i\) is absent from \(\ell\), so \[ \frac{\partial\ell}{\partial\tau_i} = -\beta+\frac{Y_i}{\tau_i}=0 \quad\Rightarrow\quad \tau_i=\frac{Y_i}{\beta},\qquad i=1,\dots,k. \]
    • Compact form (solution of the \(\tau_i\)-score equations as a function of \(\beta\)): \[ \boxed{\;\hat\tau_i(\beta)=\frac{\delta_i X_i + Y_i}{\delta_i+\beta}\;,\qquad i=1,\dots,n.\;} \]
  • Step 3 — score equation for \(\beta\).
    • Differentiate \(\ell\) w.r.t. \(\beta\): \[ \frac{\partial\ell}{\partial\beta} = \sum_{i=1}^n\left(-\tau_i + \frac{Y_i}{\beta}\right) = -\sum_{i=1}^n \tau_i + \frac{S_Y}{\beta}. \]
    • Substitute \(\tau_i=\hat\tau_i(\beta)\) to get \[ -\sum_{i\in O}\frac{X_i+Y_i}{1+\beta} \;-\; \sum_{i\in M}\frac{Y_i}{\beta} \;+\; \frac{S_Y}{\beta} \;=\; 0. \] Since \(S_Y=S_Y^{(O)}+S_Y^{(M)}\), the \(S_Y^{(M)}/\beta\) terms cancel, leaving \[ -\frac{S_X+S_Y^{(O)}}{1+\beta} + \frac{S_Y^{(O)}}{\beta}=0. \]
    • Solve for \(\beta\): \[ \frac{S_Y^{(O)}}{\beta}=\frac{S_X+S_Y^{(O)}}{1+\beta} \quad\Longrightarrow\quad \beta\big(S_X+S_Y^{(O)}\big)=S_Y^{(O)}(1+\beta) \quad\Longrightarrow\quad \beta S_X=S_Y^{(O)}. \] Hence, provided \(S_X>0\), \[ \boxed{\;\hat\beta=\frac{S_Y^{(O)}}{S_X}=\frac{\sum_{i=k+1}^n Y_i}{\sum_{i=k+1}^n X_i}\;. \;} \] Only the stretches with observed \(X_i\) contribute: for a stretch with missing \(X_i\), the profiled \(\tau_i=Y_i/\beta\) absorbs \(\beta\), so such stretches carry no information about \(\beta\).
  • Step 4 — plug back to get \(\hat\tau_i\).
    • Using \(\hat\beta=S_Y^{(O)}/S_X\) in \(\hat\tau_i(\beta)\) yields \[ \boxed{\;\hat\tau_i = \frac{\delta_i X_i + Y_i}{\delta_i+\hat\beta}\;,\qquad i=1,\dots,n. \;} \]
    • Equivalently, \[ \hat\tau_i=\begin{cases} \dfrac{X_i+Y_i}{1+\hat\beta}=\dfrac{S_X\,(X_i+Y_i)}{S_X+S_Y^{(O)}}, & i=k+1,\dots,n\ (\text{observed }X_i),\\[8pt] \dfrac{Y_i}{\hat\beta}=\dfrac{S_X\,Y_i}{S_Y^{(O)}}, & i=1,\dots,k\ (\text{missing }X_i). \end{cases} \]
  • Step 5 — existence / edge cases.
    • If \(S_X=\sum_{i=k+1}^n X_i=0\) while \(S_Y^{(O)}>0\), the equation \(\beta S_X=S_Y^{(O)}\) cannot be satisfied by any finite \(\beta\); the profile likelihood increases as \(\beta\to\infty\) and no finite MLE exists.
    • If \(S_X=0\) and \(S_Y^{(O)}=0\), \(\beta\) is not identifiable (the profile likelihood is constant in \(\beta\)); if \(S_X>0\) but \(S_Y^{(O)}=0\), the profile likelihood is decreasing in \(\beta\) and the supremum lies on the boundary \(\beta=0\). Assume \(S_X>0\) and \(S_Y^{(O)}>0\) for the usual case.
  • Step 6 — justification of maximum.
    • For fixed \(\beta\) the log-likelihood is concave in each \(\tau_i\) (the second derivative w.r.t. \(\tau_i\) is \(-(\delta_i X_i+Y_i)/\tau_i^2<0\) whenever \(\delta_i X_i+Y_i>0\)), and the profile log-likelihood \(\ell_p(\beta)=\text{const}-(S_X+S_Y^{(O)})\log(1+\beta)+S_Y^{(O)}\log\beta\) has a unique critical point when \(S_X>0\) and \(S_Y^{(O)}>0\), which is therefore the global maximum in regular cases.

Related concepts

  • Poisson-regression / GLM viewpoint: the model is a multiplicative scale change between the two years; in a log-linear Poisson model with stretch effects \(\log\tau_i\), \(\log\beta\) is the coefficient of a year-2020 indicator.
  • Profile likelihood: we profiled out the \(\tau_i\) to obtain a closed-form MLE for \(\beta\).
  • Method-of-moments equivalence: \(\hat\beta=S_Y^{(O)}/S_X\) also follows from equating totals over the stretches with observed \(X_i\), since \(E\big[\sum_{i\in O}Y_i\big]=\beta\,E\big[\sum_{i\in O}X_i\big]\).
  • Missingness assumption: we implicitly assumed the \(X\)-missingness is ignorable (e.g., MCAR, or MAR given the observed data). If missingness depends on the unobserved \(X_i\) themselves, the estimates may be biased.

Remarks / Summary

The MLEs are simple and intuitive: \[ \boxed{\;\hat\beta=\frac{S_Y^{(O)}}{S_X}\;,\qquad \hat\tau_i=\frac{\delta_i X_i + Y_i}{\delta_i+\hat\beta}\;,\qquad i=1,\dots,n,\;} \] with the usual caveat that \(S_X>0\) and \(S_Y^{(O)}>0\) are required for a finite, nonzero \(\hat\beta\). The estimators borrow strength across stretches and years: for a stretch with missing \(X_i\) the rate is estimated from \(Y_i\) alone through the global scale \(\hat\beta\), while \(\hat\beta\) itself uses only the stretches observed in both years.
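
As a quick numerical illustration, here is a minimal R sketch that simulates data from the model and evaluates the boxed formulas; the sample sizes (\(n=10\), \(k=3\)) and parameter values are assumptions made only for this example.

### ---- Closed-form MLEs of Problem 1 (illustrative sketch) ----
set.seed(1)
n <- 10; k <- 3                        # assumed sizes; stretches 1..k have X missing
tau_true <- runif(n, 2, 6)             # assumed true accident rates
beta_true <- 1.5                       # assumed true scale between years
X <- rpois(n, tau_true)
Y <- rpois(n, beta_true * tau_true)
X[1:k] <- NA                           # X_1, ..., X_k are missing

obs <- !is.na(X)                       # delta_i
S_X   <- sum(X[obs])                   # sum of observed X_i
S_Y_O <- sum(Y[obs])                   # sum of Y_i over stretches with observed X_i

beta_hat <- S_Y_O / S_X                                   # hat(beta) = S_Y^(O) / S_X
tau_hat  <- ifelse(obs, (X + Y) / (1 + beta_hat),         # observed-X stretches
                         Y / beta_hat)                    # missing-X stretches
round(c(beta_hat = beta_hat, tau_hat), 3)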

Q2.

In Problem 1, assume a “complete data model” that includes the values of \(X_1,\dots,X_k\), so that the missing \(X_1,\dots,X_k\) are treated as latent variables. Write down the E-step and the M-step of the EM algorithm for obtaining the MLEs of \(\beta\) and \(\tau_1,\dots,\tau_n\).

Solution

Model recap / notation

As in Problem 1, \(X_i\sim\mathrm{Poi}(\tau_i)\) and \(Y_i\sim\mathrm{Poi}(\beta\tau_i)\), all mutually independent; \(X_i\) is observed for \(i\in O=\{k+1,\dots,n\}\) and latent (missing) for \(i\in M=\{1,\dots,k\}\). Write \(S_Y=\sum_{i=1}^n Y_i\), and let \(\theta^{(t)}=(\beta^{(t)},\tau_1^{(t)},\dots,\tau_n^{(t)})\) denote the current iterate.

Complete-data log-likelihood

If we had the full data \((X_1,\dots,X_n,Y_1,\dots,Y_n)\), the log-likelihood (up to constant) is:

\[ \ell_c(\beta,\tau) = \sum_{i=1}^n\left[-(1+\beta)\tau_i + (X_i+Y_i)\log\tau_i\right] + S_Y\log\beta \]

E-step

Given the current iterate \(\theta^{(t)}\), compute \(Q(\beta,\tau\mid\theta^{(t)})=\mathbb{E}\big[\ell_c(\beta,\tau)\mid \text{observed data},\,\theta^{(t)}\big]\). The complete-data log-likelihood is linear in the missing \(X_i\), and each \(X_i\) is independent of the \(Y\)'s, so the E-step simply replaces every missing \(X_i\) by its expectation under \(\theta^{(t)}\): \[ \tilde X_i^{(t)}=\begin{cases} X_i, & i\in O,\\[4pt] \mathbb{E}\big[X_i\mid\theta^{(t)}\big]=\tau_i^{(t)}, & i\in M, \end{cases} \] giving \[ Q(\beta,\tau\mid\theta^{(t)}) = \sum_{i=1}^n\Big[-(1+\beta)\tau_i + \big(\tilde X_i^{(t)}+Y_i\big)\log\tau_i\Big] + S_Y\log\beta + \text{const.} \]

M-step

Maximize \(Q(\beta,\tau\mid\theta^{(t)})\) jointly over \((\beta,\tau)\). The stationarity conditions \(\partial Q/\partial\tau_i=-(1+\beta)+(\tilde X_i^{(t)}+Y_i)/\tau_i=0\) and \(\partial Q/\partial\beta=-\sum_{i}\tau_i+S_Y/\beta=0\) have the closed-form solution \[ \beta^{(t+1)}=\frac{S_Y}{\sum_{i=1}^n \tilde X_i^{(t)}},\qquad \tau_i^{(t+1)}=\frac{\tilde X_i^{(t)}+Y_i}{1+\beta^{(t+1)}},\qquad i=1,\dots,n. \]

One EM iteration

Starting from \((\beta^{(t)},\tau^{(t)})\): (i) E-step, set \(\tilde X_i^{(t)}=X_i\) for \(i\in O\) and \(\tilde X_i^{(t)}=\tau_i^{(t)}\) for \(i\in M\); (ii) M-step, set \(\beta^{(t+1)}=S_Y/\sum_i\tilde X_i^{(t)}\) and \(\tau_i^{(t+1)}=(\tilde X_i^{(t)}+Y_i)/(1+\beta^{(t+1)})\).

Initialization & termination

A natural start is, for example, \(\beta^{(0)}=1\) and \(\tau_i^{(0)}=Y_i\) (or \((\delta_i X_i+Y_i)/2\)); iterate until the change in the parameters, or in the observed-data log-likelihood, falls below a chosen tolerance.

Fixed point / connection to MLE

At a fixed point the M-step relations hold with \(\tilde X_i=\tau_i\) for \(i\in M\); eliminating the \(\tau_i\) recovers the observed-data score equations of Problem 1, so the limit is the MLE \(\hat\beta=S_Y^{(O)}/S_X\), \(\hat\tau_i=(\delta_iX_i+Y_i)/(\delta_i+\hat\beta)\) (Problems 4 and 5 verify this explicitly).

Edge cases

As in Problem 1, a finite limit requires \(S_X=\sum_{i\in O}X_i>0\) and \(S_Y^{(O)}=\sum_{i\in O}Y_i>0\); otherwise the iterates drift towards the boundary (\(\beta\to\infty\) or \(\beta\to 0\)). A small R sketch of the iteration is given below.
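
The following self-contained R sketch implements this EM iteration; the simulated data, sample sizes, and starting values are assumptions made only for the illustration. At convergence, the EM value of \(\beta\) should agree with the closed form \(S_Y^{(O)}/S_X\) of Problems 1 and 5.

### ---- EM iteration for Problem 2 (illustrative sketch) ----
set.seed(2)
n <- 10; k <- 3                               # assumed sizes; stretches 1..k have X missing
tau_true <- runif(n, 2, 6); beta_true <- 1.5  # assumed true parameter values
X <- rpois(n, tau_true); Y <- rpois(n, beta_true * tau_true)
X[1:k] <- NA

obs <- !is.na(X); S_Y <- sum(Y)
beta <- 1; tau <- Y + 0.5                     # assumed starting values
for (t in 1:500) {
  X_tilde  <- ifelse(obs, X, tau)             # E-step: replace missing X_i by tau_i
  beta_new <- S_Y / sum(X_tilde)              # M-step: beta update
  tau_new  <- (X_tilde + Y) / (1 + beta_new)  # M-step: tau updates
  if (max(abs(c(beta_new - beta, tau_new - tau))) < 1e-10) {
    beta <- beta_new; tau <- tau_new; break
  }
  beta <- beta_new; tau <- tau_new
}
c(em_beta = beta, closed_form = sum(Y[obs]) / sum(X[obs]))  # should agree (Problem 5)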

Q3.

In Problem 2, write down the recursive updating equation for \(\tau_1\) alone, covering both cases: \(X_1\) missing and \(X_1\) observed.

Solution

EM-step structure (recap)

From Problem 2, one EM iteration is \[ \beta^{(t+1)}=\frac{S_Y}{\sum_{i=1}^n\tilde X_i^{(t)}},\qquad \tau_i^{(t+1)}=\frac{\tilde X_i^{(t)}+Y_i}{1+\beta^{(t+1)}},\qquad \tilde X_i^{(t)}=\begin{cases}X_i,&i\in O,\\[4pt] \tau_i^{(t)},&i\in M.\end{cases} \]

Case A — \(X_1\) missing (\(1 \in M\))

Here \(\tilde X_1^{(t)}=\tau_1^{(t)}\), so the recursion for \(\tau_1\) is \[ \tau_1^{(t+1)}=\frac{\tau_1^{(t)}+Y_1}{1+\beta^{(t+1)}}, \qquad \beta^{(t+1)}=\frac{S_Y}{\displaystyle\tau_1^{(t)}+\sum_{i\in M,\,i\neq 1}\tau_i^{(t)}+\sum_{i\in O}X_i}. \]

Special subcase — \(X_1\) the only missing observation (\(k=1\))

When \(M=\{1\}\), the recursion for \(\tau_1\) is a scalar fixed-point iteration: \[ \beta^{(t+1)}=\frac{S_Y}{\tau_1^{(t)}+\sum_{i=2}^n X_i},\qquad \tau_1^{(t+1)}=\frac{\tau_1^{(t)}+Y_1}{1+\beta^{(t+1)}}. \]

Case B — \(X_1\) observed (\(1 \in O\))

Here \(\tilde X_1^{(t)}=X_1\), so \[ \tau_1^{(t+1)}=\frac{X_1+Y_1}{1+\beta^{(t+1)}}, \] where \(\beta^{(t+1)}=S_Y/\sum_i\tilde X_i^{(t)}\) changes across iterations only through the \(\tau_j^{(t)}\) of the stretches with missing \(X_j\).

Q4.

If the recursions in Problem 3 converge, what is the value of \(\tau_1\) reached at convergence?

Detailed solution

Let the EM iterates converge to \((\beta^*,\tau_1^*,\tau_2^*,\dots,\tau_n^*)\). At convergence the EM fixed-point (stationarity) conditions must hold (these are simply the M-step formulae evaluated at the limit, together with the definition of the \(\tilde X_i\)).

Recall the M-step relations used in the EM:

\[ \beta = \frac{S_Y}{\sum_{i=1}^n\tilde X_i},\qquad \tau_i = \frac{\tilde X_i + Y_i}{1+\beta}, \]

where \(\tilde X_i = X_i\) if \(X_i\) observed and \(\tilde X_i=\tau_i\) if \(X_i\) missing, and \(S_Y=\sum_{i=1}^n Y_i\).

Case A — \(X_1\) observed

Then \(\tilde X_1^*=X_1\) and the fixed-point equation for \(\tau_1\) is:

\[ \tau_1^* = \frac{X_1 + Y_1}{1+\beta^*}. \]

So the limit value is:

\[ \boxed{\tau_1^* = \frac{X_1 + Y_1}{1+\beta^*}} \]

Equivalently, using \(\beta^* = S_Y^{(O)}/S_X^{(O)}\) from Problem 5, this equals the closed-form MLE value of Problem 1.

Case B — \(X_1\) missing

Then \(\tilde X_1^*=\tau_1^*\). The fixed-point equation gives:

\[ \tau_1^* = \frac{\tau_1^* + Y_1}{1+\beta^*}. \]

Rearranging:

\[ (1+\beta^*)\tau_1^* = \tau_1^* + Y_1 \quad\Longrightarrow\quad \beta^* \tau_1^* = Y_1 \quad\Longrightarrow\quad \boxed{\tau_1^* = \frac{Y_1}{\beta^*}} \]

Summary

\[ \tau_1^* = \begin{cases} \dfrac{X_1+Y_1}{1+\beta^*}, & X_1\ \text{observed},\\[8pt] \dfrac{Y_1}{\beta^*}, & X_1\ \text{missing}. \end{cases} \]

These match the MLE relations obtained by direct maximization in Problem 1.

How to prove these equalities (short rigorous argument)

  1. EM fixed-point property: A limit point \(\theta^*=(\beta^*,\tau^*)\) satisfies the M-step maximization with the E-step expectations computed at \(\theta^*\), implying the M-step equations hold at \(\theta^*\).
  2. Algebraic rearrangement: For the missing case, \(\tau_1^*=(\tau_1^*+Y_1)/(1+\beta^*) \Rightarrow \beta^*\tau_1^*=Y_1 \Rightarrow \tau_1^*=Y_1/\beta^*\). Similarly for the observed case.
  3. Consistency with MLE: Solving the observed-data score equations directly yields the same relations.

How to prove EM recursions converge (sketch)

Each EM iteration cannot decrease the observed-data log-likelihood (the standard ascent property, obtained from Jensen's inequality applied to \(Q\)). When \(S_X^{(O)}>0\) and \(S_Y^{(O)}>0\), the observed-data log-likelihood is bounded above and has a unique stationary point (Problem 1), so the monotone sequence of likelihood values converges, and any limit point of the iterates satisfies the fixed-point equations above; hence the recursions converge to the MLE.

Important caveats / edge cases

If \(S_X^{(O)}=0\) or \(S_Y^{(O)}=0\) there is no finite interior stationary point, and the recursions drift to the boundary (\(\beta\to\infty\) or \(\beta\to 0\)), mirroring the non-existence of the MLE noted in Problem 1. Also, if \(Y_1=0\) and \(X_1\) is missing, the limit is the boundary value \(\tau_1^*=0\).

Question

Q5.
If the recursions in Problem 2 converge, what is the value of \(\beta\) reached at convergence?

Solution

Setup / notation

  • Index sets:
    • \(O=\{k+1,\dots,n\}\) (indices where \(X_i\) is observed)
    • \(M=\{1,\dots,k\}\) (indices where \(X_i\) is missing)
  • Totals: \[ S_X^{(O)}=\sum_{i\in O} X_i,\quad S_Y^{(O)}=\sum_{i\in O} Y_i,\quad S_Y^{(M)}=\sum_{i\in M} Y_i,\quad S_Y=\sum_{i=1}^n Y_i = S_Y^{(O)}+S_Y^{(M)}. \]

Fixed-point relations at EM limit

  • At a limit point \((\beta^*,\tau_1^*,\dots,\tau_n^*)\) the M-step coordinate formulas hold:
    • For \(i\in O\) (observed): \[ \tau_i^* = \frac{X_i+Y_i}{1+\beta^*}. \]
    • For \(i\in M\) (missing): the M-step relation \(\tau_i^* = \dfrac{\tau_i^*+Y_i}{1+\beta^*}\) rearranges, exactly as in Problem 4, to \[ \tau_i^* = \frac{Y_i}{\beta^*}. \]
  • The \(\beta\)-limit satisfies the M-step relation: \[ \beta^* \;=\; \frac{S_Y}{\displaystyle\sum_{i=1}^n \tilde X_i^*}, \qquad\text{where }\tilde X_i^*=\begin{cases}X_i,&i\in O,\\[4pt]\tau_i^*,&i\in M.\end{cases} \]

Algebraic derivation

  • Compute \(\sum_{i=1}^n \tilde X_i^*\): \[ \sum_{i=1}^n \tilde X_i^* = \sum_{i\in O} X_i + \sum_{i\in M} \tau_i^* = S_X^{(O)} + \sum_{i\in M} \frac{Y_i}{\beta^*} = S_X^{(O)} + \frac{S_Y^{(M)}}{\beta^*}. \]
  • Substitute into the \(\beta\)-equation: \[ \beta^* = \frac{S_Y}{S_X^{(O)} + \dfrac{S_Y^{(M)}}{\beta^*}}. \]
  • Multiply both sides by the denominator: \[ \beta^*\Big(S_X^{(O)} + \frac{S_Y^{(M)}}{\beta^*}\Big) = S_Y \quad\Longrightarrow\quad \beta^* S_X^{(O)} + S_Y^{(M)} = S_Y. \]
  • Simplify to obtain: \[ \beta^* S_X^{(O)} = S_Y - S_Y^{(M)} = S_Y^{(O)}. \]

Final result

Therefore the limit (and under regularity the MLE) is

\[ \boxed{\displaystyle \beta^* \;=\; \frac{S_Y^{(O)}}{S_X^{(O)}} \;=\; \frac{\sum_{i\in O} Y_i}{\sum_{i\in O} X_i}.} \]

Intuition: only stretches with observed \(X_i\) provide the direct link identifying the scale \(\beta\). Observed \(Y\) for missing \(X\) inform the \(\tau_i\) but cannot by themselves separate \(\beta\) from \(\tau_i\).

Edge cases & interpretation

  • If all \(X_i\) observed (\(M=\varnothing\)), this reduces to \(\beta^*=S_Y/S_X\) with \(S_X=\sum_i X_i\).
  • If no \(X_i\) observed (\(O=\varnothing\)), then \(S_X^{(O)}=0\) and \(\beta\) is not identifiable from the data (denominator zero); additional assumptions or priors are required.
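
As a sanity check of the fixed-point results of Problems 3-5, the short R sketch below runs the recursion in the special subcase where only \(X_1\) is missing and compares the limits with \(\beta^*=S_Y^{(O)}/S_X^{(O)}\) and \(\tau_1^*=Y_1/\beta^*\); the simulated data and starting values are illustrative assumptions.

### ---- Fixed-point check for Problems 3-5 (illustrative sketch) ----
set.seed(3)
n <- 8
tau_true <- runif(n, 2, 6); beta_true <- 2    # assumed true parameter values
X <- rpois(n, tau_true); Y <- rpois(n, beta_true * tau_true)
X[1] <- NA                                    # only X_1 is missing (k = 1)

S_Y <- sum(Y); S_X_obs <- sum(X[-1]); S_Y_obs <- sum(Y[-1])
tau1 <- Y[1] + 0.5; beta <- 1                 # assumed starting values
for (t in 1:500) {
  beta <- S_Y / (tau1 + S_X_obs)              # beta update (only tau_1 is imputed)
  tau1 <- (tau1 + Y[1]) / (1 + beta)          # tau_1 update when X_1 is missing
}
round(c(beta_limit = beta, beta_closed = S_Y_obs / S_X_obs,
        tau1_limit = tau1, tau1_closed = Y[1] / (S_Y_obs / S_X_obs)), 6)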

Question

Q6.
Let \( Y_1, \ldots, Y_n \) be i.i.d. from the two-component normal mixture \[ f(y \mid p, \mu_1, \mu_2, \sigma^2) = p \cdot \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_1}{\sigma}\right) + (1-p) \cdot \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_2}{\sigma}\right), \quad -\infty < y < \infty, \] with \(0 < p < 1\) and \(\mu_1,\mu_2,\sigma^2\) unknown.

Write down an iterative step of the steepest-ascent method for finding the MLEs of \(p,\mu_1,\mu_2,\sigma^2\).

Solution

Log-likelihood and responsibilities

  • Parameters: \(\theta=(p,\mu_1,\mu_2,\sigma^2)\), with \(\sigma>0\).
  • Log-likelihood: \[ \ell(\theta)=\sum_{i=1}^n \log\!\left[p\,\phi_{\mu_1,\sigma}(Y_i) + (1-p)\,\phi_{\mu_2,\sigma}(Y_i)\right], \] where \(\phi_{\mu,\sigma}(y)=\frac{1}{\sigma}\phi\!\big((y-\mu)/\sigma\big)\).
  • Responsibilities (posterior weights under current \(\theta\)): \[ r_i(\theta)= \frac{p\,\phi_{\mu_1,\sigma}(Y_i)} {p\,\phi_{\mu_1,\sigma}(Y_i)+(1-p)\,\phi_{\mu_2,\sigma}(Y_i)}\in(0,1). \]

Score (gradient) components

  • \[ \frac{\partial \ell}{\partial p} = \sum_{i=1}^n\!\left(\frac{r_i}{p}-\frac{1-r_i}{1-p}\right) \]
  • \[ \frac{\partial \ell}{\partial \mu_1} = \frac{1}{\sigma^2}\sum_{i=1}^n r_i\,(Y_i-\mu_1),\qquad \frac{\partial \ell}{\partial \mu_2} = \frac{1}{\sigma^2}\sum_{i=1}^n (1-r_i)\,(Y_i-\mu_2) \]
  • \[ \frac{\partial \ell}{\partial \sigma^2} = \sum_{i=1}^n\!\left[ -\frac{1}{2\sigma^2} + \frac{ r_i\,(Y_i-\mu_1)^2 + (1-r_i)\,(Y_i-\mu_2)^2 }{2\sigma^4} \right] \]

Steepest-ascent iterative step

  • Given iterate \(\theta^{(t)}=(p^{(t)},\mu_1^{(t)},\mu_2^{(t)},(\sigma^2)^{(t)})\) and stepsize \(\eta_t>0\), compute responsibilities \[ r_i^{(t)}= \frac{p^{(t)}\,\phi_{\mu_1^{(t)},\,\sigma^{(t)}}(Y_i)} {p^{(t)}\,\phi_{\mu_1^{(t)},\,\sigma^{(t)}}(Y_i)+\big(1-p^{(t)}\big)\,\phi_{\mu_2^{(t)},\,\sigma^{(t)}}(Y_i)}. \]
  • Update: \[ \begin{aligned} p^{(t+1)} \,&=\, p^{(t)} + \eta_t \sum_{i=1}^n\!\left(\frac{r_i^{(t)}}{p^{(t)}}-\frac{1-r_i^{(t)}}{1-p^{(t)}}\right),\\[6pt] \mu_1^{(t+1)} \,&=\, \mu_1^{(t)} + \eta_t \frac{1}{(\sigma^{2})^{(t)}}\sum_{i=1}^n r_i^{(t)}\big(Y_i-\mu_1^{(t)}\big),\\[6pt] \mu_2^{(t+1)} \,&=\, \mu_2^{(t)} + \eta_t \frac{1}{(\sigma^{2})^{(t)}}\sum_{i=1}^n (1-r_i^{(t)})\big(Y_i-\mu_2^{(t)}\big),\\[8pt] (\sigma^{2})^{(t+1)} \,&=\, (\sigma^{2})^{(t)} + \eta_t \sum_{i=1}^n\!\left[ -\frac{1}{2(\sigma^{2})^{(t)}} + \frac{ r_i^{(t)}(Y_i-\mu_1^{(t)})^2 + (1-r_i^{(t)})(Y_i-\mu_2^{(t)})^2} {2\big((\sigma^{2})^{(t)}\big)^2} \right]. \end{aligned} \]

Constraints & practicalities

  • Maintain constraints: \(0<p<1\) and \(\sigma^2>0\). Two standard ways to do this:
    • Projection: after the step, clip \(p\in[\epsilon,1-\epsilon]\), \(\sigma^2\ge \epsilon\).
    • Reparameterize (cleaner): update \(\omega=\log\!\frac{p}{1-p}\), \(\lambda=\log\sigma^2\) unconstrained via gradient ascent, then map back \(p=\mathrm{logit}^{-1}(\omega)\), \(\sigma^2=e^\lambda\).
  • Use a line-search or diminishing \(\eta_t\) to ensure ascent of \(\ell(\theta)\).
  • Impose an identifiability convention (e.g., \(\mu_1\le \mu_2\)) to avoid label switching.

Remarks

The responsibilities \(r_i^{(t)}\) are the same weights used in EM. Steepest ascent follows the score direction; EM or second-order (Newton / Fisher scoring, L-BFGS) steps usually converge faster for mixtures, but the updates above are the canonical steepest-ascent iteration.
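
The following is a minimal R sketch of this iteration: it evaluates the score exactly as given above and takes a gradient step, halving the step size whenever the step would violate the constraints or decrease the log-likelihood. The simulated data, starting values, and step-size scheme are assumptions made only for the illustration.

### ---- Steepest ascent for the two-component mixture (illustrative sketch) ----
set.seed(4)
n <- 200
y <- c(rnorm(0.4 * n, -1, 1), rnorm(0.6 * n, 2, 1))  # assumed true parameters

loglik <- function(th, y) {                  # th = (p, mu1, mu2, sigma2)
  sum(log(th[1] * dnorm(y, th[2], sqrt(th[4])) +
          (1 - th[1]) * dnorm(y, th[3], sqrt(th[4]))))
}
score <- function(th, y) {                   # gradient (score) of the log-likelihood
  p <- th[1]; mu1 <- th[2]; mu2 <- th[3]; s2 <- th[4]
  f1 <- dnorm(y, mu1, sqrt(s2)); f2 <- dnorm(y, mu2, sqrt(s2))
  r  <- p * f1 / (p * f1 + (1 - p) * f2)     # responsibilities
  c(sum(r / p - (1 - r) / (1 - p)),
    sum(r * (y - mu1)) / s2,
    sum((1 - r) * (y - mu2)) / s2,
    sum(-1 / (2 * s2) + (r * (y - mu1)^2 + (1 - r) * (y - mu2)^2) / (2 * s2^2)))
}
ascent_step <- function(th, y, eta = 1e-3) { # one steepest-ascent step with halving
  g <- score(th, y)
  repeat {
    th_new <- th + eta * g
    ok <- th_new[1] > 0 && th_new[1] < 1 && th_new[4] > 0
    if (ok && loglik(th_new, y) >= loglik(th, y)) return(th_new)
    eta <- eta / 2
    if (eta < 1e-12) return(th)              # effectively at a stationary point
  }
}

th <- c(p = 0.5, mu1 = min(y), mu2 = max(y), sigma2 = var(y))  # assumed start
for (t in 1:200) th <- ascent_step(th, y)
round(th, 3)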

Question

Q7.
Are the parameters of the model of Problem 6 uniquely identifiable from the data?

Solution

Short answer

  • Yes — generically. The parameters \((p,\mu_1,\mu_2,\sigma^2)\) of the two-normal mixture with a common variance are identifiable up to label switching. That is, the mapping \[ (p,\mu_1,\mu_2,\sigma^2)\mapsto f(y\mid p,\mu_1,\mu_2,\sigma^2) \] is one-to-one except for the trivial symmetry that swaps component labels: \((p,\mu_1,\mu_2,\sigma^2)\) and \((1-p,\mu_2,\mu_1,\sigma^2)\) define the same distribution.

  • Degenerate failures of identifiability occur only in obvious special cases (components coincide or one weight is 0 or 1).

Constructive identifiability sketch (moment method)

  • Let \(m=\mathbb{E}[Y]=p\mu_1+(1-p)\mu_2\) and \(\Delta=\mu_1-\mu_2\). Write \(c_r=\mathbb{E}[(Y-m)^r]\) for the central moments (reserving \(\mu_1,\mu_2\) for the component means). For the model \[ Y\sim p\,\mathcal N(\mu_1,\sigma^2)+(1-p)\,\mathcal N(\mu_2,\sigma^2), \] one obtains the moment relations \[ \begin{aligned} c_2 &= \sigma^2 + p(1-p)\Delta^2,\\[4pt] c_3 &= p(1-p)(1-2p)\Delta^3,\\[4pt] c_4 &= 3\sigma^4 + 6\sigma^2 p(1-p)\Delta^2 + p(1-p)\big[(1-p)^3+p^3\big]\Delta^4. \end{aligned} \] (A small numerical check of these relations appears after this list.)

  • From \((c_2,c_3,c_4)\) one can (generically) solve for \((p,\Delta^2,\sigma^2)\). Then \(\mu_1,\mu_2\) follow from \(m\) and \(\Delta\) (the sign of \(\Delta\) is the label choice).

  • Special subcases:

    • If \(p\neq\tfrac12\) and \(\Delta\neq0\), then \(c_3\neq0\), and its sign and magnitude supply the information needed to recover \(p\) and \(\Delta\).
    • If \(p=\tfrac12\) then \(c_3=0\), but \(c_2,c_4\) still determine \(\Delta^2\) and \(\sigma^2\) (so the parameters are recoverable up to a label swap).
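
Below is a quick Monte Carlo check of the three moment relations; the parameter values and simulation size are arbitrary illustrative choices.

### ---- Monte Carlo check of the mixture moment relations (illustrative) ----
set.seed(5)
p <- 0.3; mu1 <- -1; mu2 <- 2; sigma2 <- 1.5   # assumed illustrative values
N <- 1e6
z <- rbinom(N, 1, p)
y <- rnorm(N, mean = ifelse(z == 1, mu1, mu2), sd = sqrt(sigma2))

m <- p * mu1 + (1 - p) * mu2; Delta <- mu1 - mu2
emp  <- sapply(2:4, function(r) mean((y - m)^r))           # empirical c_2, c_3, c_4
theo <- c(sigma2 + p * (1 - p) * Delta^2,
          p * (1 - p) * (1 - 2 * p) * Delta^3,
          3 * sigma2^2 + 6 * sigma2 * p * (1 - p) * Delta^2 +
            p * (1 - p) * ((1 - p)^3 + p^3) * Delta^4)
round(rbind(empirical = emp, theoretical = theo), 3)        # should be close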

Non-identifiable / degenerate cases

  • \(\mu_1=\mu_2\): the mixture reduces to a single \(\mathcal N(\mu_1,\sigma^2)\); \(p\) and the two means cannot be separated.
  • \(p\in\{0,1\}\): only a single component is present; the other component's parameters are irrelevant.
  • Label-switching symmetry: \((p,\mu_1,\mu_2,\sigma^2)\) and \((1-p,\mu_2,\mu_1,\sigma^2)\) represent the same model — this is the only nonidentifiability in the generic (nondegenerate) case.

Related theoretical facts

  • Finite mixtures of distinct normal components are identifiable (up to permutations of labels); this is a classical result in mixture theory.
  • Having a common unknown variance \(\sigma^2\) does not destroy identifiability — higher moments or characteristic-function arguments separate \(\Delta\), \(p\), and \(\sigma^2\).

Practical implications & recommendations

  • In estimation, enforce an identifiability convention (e.g. impose \(\mu_1\le\mu_2\) or order by component means) or post-process MCMC/EM output to resolve label switching.
  • Be alert to near-degeneracy: when \(\mu_1\approx\mu_2\) or \(p\) near 0 or 1 the likelihood is flat in directions and numerical methods may be unstable; use regularization, informative priors, or constrained optimization.
  • Use multiple random starts and check for label permutations and convergence to the same fitted mixture (modulo label swaps).

Visualization check

  • Plot the fitted mixture density and the two component densities. If components overlap strongly or one weight is near 0, identifiability is practically weak even if theoretically present.

Conclusion

The model is identifiable except for the unavoidable label-switching symmetry and trivial degenerate cases (coincident components or trivial weight). In plain terms: you can recover the parameters from the distribution, but you must either accept or remove the label ambiguity and be careful near degeneracies.

Question

Q8.
Suppose \((X_1,Y_1),\dots,(X_n,Y_n)\) are i.i.d. where \(X_i\sim\operatorname{Bernoulli}(p)\) and the conditional pdf of \(Y_i\) given \(X_i=x\) is

\[ f(y \mid x, p, \mu_1, \mu_2, \sigma^2) = x \; \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_1}{\sigma}\right) + (1 - x) \; \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_2}{\sigma}\right), \quad -\infty < y < \infty \] with \( 0 < p < 1 \). Obtain explicit expressions for the MLEs of the four parameters.

Solution

Interpretation & notation

  • The label \(X_i\) is observed, so this is a labeled mixture: if \(X_i=1\) then \(Y_i\sim N(\mu_1,\sigma^2)\); if \(X_i=0\) then \(Y_i\sim N(\mu_2,\sigma^2)\).
  • Define \[ n_1=\sum_{i=1}^n X_i,\qquad n_0=n-n_1. \]
  • Index sets and sums: \[ I_1=\{i: X_i=1\},\quad I_0=\{i:X_i=0\}, \] \[ S_1=\sum_{i\in I_1} Y_i,\quad S_0=\sum_{i\in I_0} Y_i. \]
  • Sample means (when defined): \[ \bar Y_1=\frac{S_1}{n_1}\ (\text{if }n_1>0),\qquad \bar Y_0=\frac{S_0}{n_0}\ (\text{if }n_0>0). \]
  • Sums of squared deviations: \[ Q_1=\sum_{i\in I_1}(Y_i-\bar Y_1)^2,\quad Q_0=\sum_{i\in I_0}(Y_i-\bar Y_0)^2. \]

Log-likelihood (brief)

Because \((X_i,Y_i)\) are i.i.d. and \(X_i\) observed, \[ \ell(p,\mu_1,\mu_2,\sigma^2) = \sum_{i=1}^n \big[X_i\log p +(1-X_i)\log(1-p)\big] + \sum_{i\in I_1}\log\phi_{\mu_1,\sigma}(Y_i) + \sum_{i\in I_0}\log\phi_{\mu_2,\sigma}(Y_i), \] where \(\log\phi_{\mu,\sigma}(y)=-\tfrac12\log(2\pi\sigma^2)-\dfrac{(y-\mu)^2}{2\sigma^2}\).

MLE for \(p\)

  • Maximizing the Bernoulli part yields the familiar estimator: \[ \boxed{\;\hat p = \frac{n_1}{n} = \frac{1}{n}\sum_{i=1}^n X_i\;.} \]

MLEs for \(\mu_1,\mu_2\)

  • Maximizing the normal log-terms w.r.t. \(\mu_1\) (holding \(\sigma^2\) fixed) gives the sample mean within group 1: \[ \boxed{\;\hat\mu_1 = \bar Y_1 = \frac{1}{n_1}\sum_{i: X_i=1} Y_i \quad(\text{if }n_1>0)\;.} \]
  • Similarly for group 0: \[ \boxed{\;\hat\mu_2 = \bar Y_0 = \frac{1}{n_0}\sum_{i: X_i=0} Y_i \quad(\text{if }n_0>0)\;.} \]

MLE for \(\sigma^2\)

  • Collect the normal contributions and differentiate w.r.t. \(\sigma^2\). The MLE is \[ \boxed{\; \widehat{\sigma^2} = \frac{1}{n}\Bigg(\sum_{i\in I_1}(Y_i-\hat\mu_1)^2 + \sum_{i\in I_0}(Y_i-\hat\mu_2)^2\Bigg) \;=\; \frac{Q_1+Q_0}{n}\;. } \] (Note: this is the MLE divisor \(n\), not the unbiased divisor \(n-2\).)

Edge cases & existence

  • If \(n_1=0\) then \(\hat p=0\) and \(\hat\mu_1\) is undefined from the data; similarly if \(n_0=0\) then \(\hat p=1\) and \(\hat\mu_2\) is undefined. In those degenerate cases extra assumptions or priors are required.
  • If both groups have at least one observation, the MLEs above are well-defined and unique.

Summary (boxed answers)

  • \(\displaystyle \boxed{\;\hat p = \dfrac{n_1}{n}\;}\).
  • \(\displaystyle \boxed{\;\hat\mu_1 = \bar Y_1 = \dfrac{1}{n_1}\sum_{i:X_i=1} Y_i\;}\) (if \(n_1>0\)).
  • \(\displaystyle \boxed{\;\hat\mu_2 = \bar Y_0 = \dfrac{1}{n_0}\sum_{i:X_i=0} Y_i\;}\) (if \(n_0>0\)).
  • \(\displaystyle \boxed{\;\widehat{\sigma^2} = \dfrac{1}{n}\Big(\sum_{i\in I_1}(Y_i-\hat\mu_1)^2 + \sum_{i\in I_0}(Y_i-\hat\mu_2)^2\Big)\;}\).
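
For a numerical illustration of the boxed formulas, here is a short R sketch with simulated labeled data; the parameter values are assumptions made only for the example.

### ---- MLEs for the labeled mixture of Problem 8 (illustrative sketch) ----
set.seed(6)
n <- 200; p <- 0.4; mu1 <- -1; mu2 <- 2; sigma2 <- 1   # assumed true values
X <- rbinom(n, 1, p)
Y <- rnorm(n, mean = ifelse(X == 1, mu1, mu2), sd = sqrt(sigma2))

p_hat   <- mean(X)                              # n_1 / n
mu1_hat <- mean(Y[X == 1])                      # group-1 sample mean
mu2_hat <- mean(Y[X == 0])                      # group-0 sample mean
s2_hat  <- (sum((Y[X == 1] - mu1_hat)^2) +
            sum((Y[X == 0] - mu2_hat)^2)) / n   # pooled, MLE divisor n
round(c(p_hat = p_hat, mu1_hat = mu1_hat, mu2_hat = mu2_hat, sigma2_hat = s2_hat), 3)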

Question

Q9.
For the data and the model described in Problem 6, let the “complete data model” be as in Problem 8. Derive the E-step and the M-step of the EM algorithm for finding the MLEs of the four parameters \((p,\mu_1,\mu_2,\sigma^2)\).

Solution

Model / notation (recap)

  • Observed: \(Y_1,\dots,Y_n\). Latent labels: \(X_i\in\{0,1\}\) for \(i=1,\dots,n\).
  • Complete-data model: \[ P(X_i=1)=p,\qquad Y_i\mid X_i=1\sim N(\mu_1,\sigma^2),\quad Y_i\mid X_i=0\sim N(\mu_2,\sigma^2). \]
  • Current iterate: \(\theta^{(t)}=(p^{(t)},\mu_1^{(t)},\mu_2^{(t)},(\sigma^2)^{(t)})\).

E-step

  • Compute the conditional expectation of the complete-data indicator \(X_i\) given \(Y_i\) under \(\theta^{(t)}\). These are the responsibilities: \[ \boxed{\; r_i^{(t)} \;=\; \mathbb{E}[X_i\mid Y_i,\theta^{(t)}] \;=\; \Pr\{X_i=1\mid Y_i,\theta^{(t)}\} \;=\; \frac{p^{(t)}\,\phi\!\big(\dfrac{Y_i-\mu_1^{(t)}}{\sigma^{(t)}}\big)} {\,p^{(t)}\,\phi\!\big(\dfrac{Y_i-\mu_1^{(t)}}{\sigma^{(t)}}\big) +(1-p^{(t)})\,\phi\!\big(\dfrac{Y_i-\mu_2^{(t)}}{\sigma^{(t)}}\big)\,}\;} \] where \(\sigma^{(t)}=\sqrt{(\sigma^2)^{(t)}}\) and \(\phi\) is the standard normal pdf.
  • Also note \(\mathbb{E}[1-X_i\mid Y_i,\theta^{(t)}]=1-r_i^{(t)}\).
  • For numerical stability compute the ratio using log-sum-exp form (work in log-domain).

M-step

  • Set \(R^{(t)}=\sum_{i=1}^n r_i^{(t)}\) (the expected count in component 1) and \(n-R^{(t)}=\sum_{i=1}^n(1-r_i^{(t)})\).
  • Maximize the expected complete-data log-likelihood (replace \(X_i\) by \(r_i^{(t)}\)). Closed-form updates are:
  • \[ \boxed{\; p^{(t+1)} \;=\; \frac{1}{n}\sum_{i=1}^n r_i^{(t)} \;=\; \frac{R^{(t)}}{n} \;} \]
  • \[ \boxed{\; \mu_1^{(t+1)} \;=\; \frac{\displaystyle\sum_{i=1}^n r_i^{(t)} Y_i}{\displaystyle\sum_{i=1}^n r_i^{(t)}} \;=\; \frac{1}{R^{(t)}}\sum_{i=1}^n r_i^{(t)}Y_i \;} \]
  • \[ \boxed{\; \mu_2^{(t+1)} \;=\; \frac{\displaystyle\sum_{i=1}^n (1-r_i^{(t)}) Y_i}{\displaystyle\sum_{i=1}^n (1-r_i^{(t)})} \;=\; \frac{1}{n-R^{(t)}}\sum_{i=1}^n (1-r_i^{(t)})Y_i \;} \]
  • \[ \boxed{\; (\sigma^2)^{(t+1)} \;=\; \frac{1}{n}\sum_{i=1}^n\Big[\, r_i^{(t)}\big(Y_i-\mu_1^{(t+1)}\big)^2 + (1-r_i^{(t)})\big(Y_i-\mu_2^{(t+1)}\big)^2 \Big]\; } \] (This follows by maximizing the expected complete-data normal log-likelihood; the MLE divisor is \(n\).)

One EM iteration (concise summary)

  • Given \(\theta^{(t)}\):
  • E-step: compute \(r_i^{(t)}\) for \(i=1,\dots,n\) using the boxed formula above.
  • M-step: compute \(R^{(t)}=\sum r_i^{(t)}\) and update \[ p^{(t+1)}=\frac{R^{(t)}}{n},\quad \mu_1^{(t+1)}=\frac{\sum r_i^{(t)}Y_i}{R^{(t)}},\quad \mu_2^{(t+1)}=\frac{\sum (1-r_i^{(t)})Y_i}{n-R^{(t)}}, \] and \[ (\sigma^2)^{(t+1)}=\frac{1}{n}\sum \big[ r_i^{(t)}(Y_i-\mu_1^{(t+1)})^2 + (1-r_i^{(t)})(Y_i-\mu_2^{(t+1)})^2\big]. \]

Convergence & practical notes

  • Iterate E and M steps until parameter change or log-likelihood increment falls below tolerance.
  • Initialization: choose sensible starts (k-means, random, or multiple restarts) to avoid poor local maxima and label switching.
  • For numerical stability use log-domain computations for responsibilities; guard against division by zero (ensure \(R^{(t)}\) and \(n-R^{(t)}\) are not too close to 0; regularize if necessary).

Remark

These EM updates are the classical closed-form E- and M-steps for the Gaussian two-component mixture with common variance. They increase the observed-data log-likelihood monotonically and (under regularity) converge to a stationary point of the likelihood.

Question

Q10.
Simulate a sample of size \(n = 100\) from the model assumed in Problem 6 for reasonable values of the four parameters. Write R code to obtain the MLEs through the EM algorithm, starting from the initial values \(p = 0\), \(\mu_1 = 0\), \(\mu_2\) equal to the sample mean, and \(\sigma^2\) equal to the sample variance.

Solution

set.seed(123)  # For reproducibility

### ---- Parameters for simulation ----
n <- 100
p_true <- 0.4
mu1_true <- -1
mu2_true <- 2
sigma2_true <- 1

### ---- Simulate data from mixture model ----
z <- rbinom(n, 1, p_true)   # Latent component indicators
x <- numeric(n)
x[z == 1] <- rnorm(sum(z == 1), mean = mu1_true, sd = sqrt(sigma2_true))
x[z == 0] <- rnorm(sum(z == 0), mean = mu2_true, sd = sqrt(sigma2_true))

### ---- Initial values ----
p <- 0.001              # problem asks for p = 0, but p = 0 makes every responsibility 0 (and the mu1 update 0/0), so start from a small positive value
mu1 <- 0
mu2 <- mean(x)
sigma2 <- var(x)

### ---- EM Algorithm ----
max_iter <- 1000
tol <- 1e-6
loglik <- numeric(max_iter)
eps <- 1e-10

for (iter in 1:max_iter) {
  
  ## E-step: Posterior probabilities for component 1
  tau1 <- p * dnorm(x, mean = mu1, sd = sqrt(sigma2))
  tau2 <- (1 - p) * dnorm(x, mean = mu2, sd = sqrt(sigma2))
  
  denom <- tau1 + tau2 + eps  # avoid division by zero
  gamma <- tau1 / denom
  
  ## Log-likelihood
  loglik[iter] <- sum(log(denom))
  
  ## M-step: Update parameters
  p_new <- mean(gamma)
  mu1_new <- sum(gamma * x) / sum(gamma)
  mu2_new <- sum((1 - gamma) * x) / sum(1 - gamma)
  sigma2_new <- sum(gamma * (x - mu1_new)^2 + (1 - gamma) * (x - mu2_new)^2) / n
  
  ## Check convergence
  if (max(abs(c(p_new - p, mu1_new - mu1, mu2_new - mu2, sigma2_new - sigma2)), na.rm = TRUE) < tol) {
    p <- p_new; mu1 <- mu1_new; mu2 <- mu2_new; sigma2 <- sigma2_new
    loglik <- loglik[1:iter]
    cat("Converged in", iter, "iterations\n")
    break
  }
  
  ## Update for next iteration
  p <- p_new
  mu1 <- mu1_new
  mu2 <- mu2_new
  sigma2 <- sigma2_new
}

### ---- Final estimates ----
cat("Estimated p:", p, "\n")
cat("Estimated mu1:", mu1, "\n")
cat("Estimated mu2:", mu2, "\n")
cat("Estimated sigma^2:", sigma2, "\n")

### ---- Plot log-likelihood ----
plot(loglik, type = "o", main = "Log-likelihood over iterations",
     xlab = "Iteration", ylab = "Log-likelihood")