Statistical Methods III: Assignment Week-2

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Question

Q1.
Let \(X_1,X_2,\dots,X_n\) be the numbers of annual accidents occurring in \(n\) stretches of a highway in the year 2010. These are independent, with \(X_i\sim\mathrm{Poi}(\tau_i),\; i=1,\dots,n.\) Let \(Y_1,Y_2,\dots,Y_n\) be the numbers of annual accidents on the same stretches in the year 2020; these are independent of one another and of the \(X_i\)'s, with \(Y_i\sim\mathrm{Poi}(\beta\tau_i),\; i=1,\dots,n.\) For some reason the observations \(X_1,\dots,X_k\) are missing (only \(X_{k+1},\dots,X_n\) are observed). Derive explicit expressions for the MLEs of \(\beta\) and \(\tau_1,\dots,\tau_n.\)

Solution

  • Step 0 — notation.
    • Define the indicator \(\delta_i\) for observed \(X_i\): \[ \delta_i=\begin{cases}1,& i=k+1,\dots,n\ (\text{observed }X_i),\\[4pt] 0,& i=1,\dots,k\ (\text{missing }X_i).\end{cases} \]
    • Index sets and observed-data totals: write \(O=\{k+1,\dots,n\}\) (\(X_i\) observed) and \(M=\{1,\dots,k\}\) (\(X_i\) missing), and \[ S_X=\sum_{i\in O} X_i,\qquad S_Y^{(O)}=\sum_{i\in O} Y_i,\qquad S_Y^{(M)}=\sum_{i\in M} Y_i,\qquad S_Y=\sum_{i=1}^n Y_i=S_Y^{(O)}+S_Y^{(M)}. \]
  • Step 1 — observed-data likelihood.
    • The observed-data likelihood (up to constants not involving parameters) is \[ L(\beta,\tau_1,\dots,\tau_n) \propto \prod_{i:\,\delta_i=1} e^{-\tau_i}\tau_i^{X_i}\times \prod_{i=1}^n e^{-\beta\tau_i}(\beta\tau_i)^{Y_i}. \]
    • Log-likelihood \(\ell(\beta,\tau)\): \[ \ell(\beta,\tau) = \sum_{i:\,\delta_i=1}\big[-\tau_i+X_i\log\tau_i\big] + \sum_{i=1}^n\big[-\beta\tau_i + Y_i\log\beta + Y_i\log\tau_i\big] + \text{const.} \]
  • Step 2 — score equations for each \(\tau_i\).
    • If \(X_i\) is observed (\(\delta_i=1\)): \[ \frac{\partial\ell}{\partial\tau_i} = -1 + \frac{X_i}{\tau_i} -\beta + \frac{Y_i}{\tau_i} = -(1+\beta)+\frac{X_i+Y_i}{\tau_i}. \] Setting to \(0\) gives \[ \tau_i=\frac{X_i+Y_i}{1+\beta},\qquad i=k+1,\dots,n. \]
    • If \(X_i\) is missing (\(\delta_i=0\)), the term \(-\tau_i+X_i\log\tau_i\) is absent from \(\ell\), so \[ \frac{\partial\ell}{\partial\tau_i} = -\beta+\frac{Y_i}{\tau_i}=0 \quad\Rightarrow\quad \tau_i=\frac{Y_i}{\beta},\qquad i=1,\dots,k. \]
    • Compact form (solution of the \(\tau_i\)-score equations as a function of \(\beta\)): \[ \boxed{\;\hat\tau_i(\beta)=\frac{\delta_i X_i + Y_i}{\delta_i+\beta}\;,\qquad i=1,\dots,n.\;} \]
  • Step 3 — score equation for \(\beta\).
    • Differentiate \(\ell\) w.r.t. \(\beta\): \[ \frac{\partial\ell}{\partial\beta} = \sum_{i=1}^n\left(-\tau_i + \frac{Y_i}{\beta}\right) = -\sum_{i=1}^n \tau_i + \frac{S_Y}{\beta}. \]
    • Substitute \(\tau_i=\hat\tau_i(\beta)\) to get \[ -\sum_{i\in O}\frac{X_i+Y_i}{1+\beta} \;-\; \sum_{i\in M}\frac{Y_i}{\beta} \;+\; \frac{S_Y}{\beta} \;=\; 0. \] Since \(S_Y=S_Y^{(O)}+S_Y^{(M)}\), the \(S_Y^{(M)}/\beta\) terms cancel, leaving \[ -\frac{S_X+S_Y^{(O)}}{1+\beta} + \frac{S_Y^{(O)}}{\beta}=0. \]
    • Solve for \(\beta\): \[ \frac{S_Y^{(O)}}{\beta}=\frac{S_X+S_Y^{(O)}}{1+\beta} \quad\Longrightarrow\quad \beta\big(S_X+S_Y^{(O)}\big)=S_Y^{(O)}(1+\beta) \quad\Longrightarrow\quad \beta S_X=S_Y^{(O)}. \] Hence, provided \(S_X>0\), \[ \boxed{\;\hat\beta=\frac{S_Y^{(O)}}{S_X}=\frac{\sum_{i=k+1}^n Y_i}{\sum_{i=k+1}^n X_i}\;. \;} \] Only the stretches with observed \(X_i\) contribute: for a stretch with missing \(X_i\), the profiled \(\tau_i=Y_i/\beta\) absorbs \(\beta\), so such stretches carry no information about \(\beta\).
  • Step 4 — plug back to get \(\hat\tau_i\).
    • Using \(\hat\beta=S_Y^{(O)}/S_X\) in \(\hat\tau_i(\beta)\) yields \[ \boxed{\;\hat\tau_i = \frac{\delta_i X_i + Y_i}{\delta_i+\hat\beta}\;,\qquad i=1,\dots,n. \;} \]
    • Equivalently, \[ \hat\tau_i=\begin{cases} \dfrac{X_i+Y_i}{1+\hat\beta}=\dfrac{S_X\,(X_i+Y_i)}{S_X+S_Y^{(O)}}, & i=k+1,\dots,n\ (\text{observed }X_i),\\[8pt] \dfrac{Y_i}{\hat\beta}=\dfrac{S_X\,Y_i}{S_Y^{(O)}}, & i=1,\dots,k\ (\text{missing }X_i). \end{cases} \]
  • Step 5 — existence / edge cases.
    • If \(S_X=\sum_{i=k+1}^n X_i=0\) while \(S_Y^{(O)}>0\), the equation \(\beta S_X=S_Y^{(O)}\) cannot be satisfied by any finite \(\beta\); the profile likelihood increases as \(\beta\to\infty\) and no finite MLE exists.
    • If \(S_X=0\) and \(S_Y^{(O)}=0\), \(\beta\) is not identifiable (the profile likelihood is constant in \(\beta\)); if \(S_X>0\) but \(S_Y^{(O)}=0\), the profile likelihood is decreasing in \(\beta\) and the supremum lies on the boundary \(\beta=0\). Assume \(S_X>0\) and \(S_Y^{(O)}>0\) for the usual case.
  • Step 6 — justification of maximum.
    • For fixed \(\beta\) the log-likelihood is concave in each \(\tau_i\) (the second derivative w.r.t. \(\tau_i\) is \(-(\delta_i X_i+Y_i)/\tau_i^2<0\) whenever \(\delta_i X_i+Y_i>0\)), and the profile log-likelihood \(\ell_p(\beta)=\text{const}-(S_X+S_Y^{(O)})\log(1+\beta)+S_Y^{(O)}\log\beta\) has a unique critical point when \(S_X>0\) and \(S_Y^{(O)}>0\), which is therefore the global maximum in regular cases.

Related concepts

  • Poisson-regression / GLM viewpoint: the model is a multiplicative scale change between the two years; in a log-linear Poisson model with stretch effects \(\log\tau_i\), \(\log\beta\) is the coefficient of a year-2020 indicator.
  • Profile likelihood: we profiled out the \(\tau_i\) to obtain a closed-form MLE for \(\beta\).
  • Method-of-moments equivalence: \(\hat\beta=S_Y^{(O)}/S_X\) also follows from equating totals over the stretches with observed \(X_i\), since \(E\big[\sum_{i\in O}Y_i\big]=\beta\,E\big[\sum_{i\in O}X_i\big]\).
  • Missingness assumption: we implicitly assumed the \(X\)-missingness is ignorable (e.g., MCAR, or MAR given the observed data). If missingness depends on the unobserved \(X_i\) themselves, the estimates may be biased.

Remarks / Summary

The MLEs are simple and intuitive: \[ \boxed{\;\hat\beta=\frac{S_Y^{(O)}}{S_X}\;,\qquad \hat\tau_i=\frac{\delta_i X_i + Y_i}{\delta_i+\hat\beta}\;,\qquad i=1,\dots,n,\;} \] with the usual caveat that \(S_X>0\) and \(S_Y^{(O)}>0\) are required for a finite, nonzero \(\hat\beta\). The estimators borrow strength across stretches and years: for a stretch with missing \(X_i\) the rate is estimated from \(Y_i\) alone through the global scale \(\hat\beta\), while \(\hat\beta\) itself uses only the stretches observed in both years.
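
As a quick numerical illustration, here is a minimal R sketch that simulates data from the model and evaluates the boxed formulas; the sample sizes (\(n=10\), \(k=3\)) and parameter values are assumptions made only for this example.

### ---- Closed-form MLEs of Problem 1 (illustrative sketch) ----
set.seed(1)
n <- 10; k <- 3                        # assumed sizes; stretches 1..k have X missing
tau_true <- runif(n, 2, 6)             # assumed true accident rates
beta_true <- 1.5                       # assumed true scale between years
X <- rpois(n, tau_true)
Y <- rpois(n, beta_true * tau_true)
X[1:k] <- NA                           # X_1, ..., X_k are missing

obs <- !is.na(X)                       # delta_i
S_X   <- sum(X[obs])                   # sum of observed X_i
S_Y_O <- sum(Y[obs])                   # sum of Y_i over stretches with observed X_i

beta_hat <- S_Y_O / S_X                                   # hat(beta) = S_Y^(O) / S_X
tau_hat  <- ifelse(obs, (X + Y) / (1 + beta_hat),         # observed-X stretches
                         Y / beta_hat)                    # missing-X stretches
round(c(beta_hat = beta_hat, tau_hat), 3)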

Q2.

In Problem 1, assume a “complete data model” that includes the values of \(X_1,\dots,X_k\), so that the missing \(X_1,\dots,X_k\) are treated as latent variables. Write down the E-step and the M-step of the EM algorithm for obtaining the MLEs of \(\beta\) and \(\tau_1,\dots,\tau_n\).

Solution

Model recap / notation

As in Problem 1, \(X_i\sim\mathrm{Poi}(\tau_i)\) and \(Y_i\sim\mathrm{Poi}(\beta\tau_i)\), all mutually independent; \(X_i\) is observed for \(i\in O=\{k+1,\dots,n\}\) and latent (missing) for \(i\in M=\{1,\dots,k\}\). Write \(S_Y=\sum_{i=1}^n Y_i\), and let \(\theta^{(t)}=(\beta^{(t)},\tau_1^{(t)},\dots,\tau_n^{(t)})\) denote the current iterate.

Complete-data log-likelihood

If we had the full data \((X_1,\dots,X_n,Y_1,\dots,Y_n)\), the log-likelihood (up to constant) is:

\[ \ell_c(\beta,\tau) = \sum_{i=1}^n\left[-(1+\beta)\tau_i + (X_i+Y_i)\log\tau_i\right] + S_Y\log\beta \]

E-step

Given the current iterate \(\theta^{(t)}\), compute \(Q(\beta,\tau\mid\theta^{(t)})=\mathbb{E}\big[\ell_c(\beta,\tau)\mid \text{observed data},\,\theta^{(t)}\big]\). The complete-data log-likelihood is linear in the missing \(X_i\), and each \(X_i\) is independent of the \(Y\)'s, so the E-step simply replaces every missing \(X_i\) by its expectation under \(\theta^{(t)}\): \[ \tilde X_i^{(t)}=\begin{cases} X_i, & i\in O,\\[4pt] \mathbb{E}\big[X_i\mid\theta^{(t)}\big]=\tau_i^{(t)}, & i\in M, \end{cases} \] giving \[ Q(\beta,\tau\mid\theta^{(t)}) = \sum_{i=1}^n\Big[-(1+\beta)\tau_i + \big(\tilde X_i^{(t)}+Y_i\big)\log\tau_i\Big] + S_Y\log\beta + \text{const.} \]

M-step

Maximize \(Q(\beta,\tau\mid\theta^{(t)})\) jointly over \((\beta,\tau)\). The stationarity conditions \(\partial Q/\partial\tau_i=-(1+\beta)+(\tilde X_i^{(t)}+Y_i)/\tau_i=0\) and \(\partial Q/\partial\beta=-\sum_{i}\tau_i+S_Y/\beta=0\) have the closed-form solution \[ \beta^{(t+1)}=\frac{S_Y}{\sum_{i=1}^n \tilde X_i^{(t)}},\qquad \tau_i^{(t+1)}=\frac{\tilde X_i^{(t)}+Y_i}{1+\beta^{(t+1)}},\qquad i=1,\dots,n. \]

One EM iteration

Starting from \((\beta^{(t)},\tau^{(t)})\): (i) E-step, set \(\tilde X_i^{(t)}=X_i\) for \(i\in O\) and \(\tilde X_i^{(t)}=\tau_i^{(t)}\) for \(i\in M\); (ii) M-step, set \(\beta^{(t+1)}=S_Y/\sum_i\tilde X_i^{(t)}\) and \(\tau_i^{(t+1)}=(\tilde X_i^{(t)}+Y_i)/(1+\beta^{(t+1)})\).

Initialization & termination

A natural start is, for example, \(\beta^{(0)}=1\) and \(\tau_i^{(0)}=Y_i\) (or \((\delta_i X_i+Y_i)/2\)); iterate until the change in the parameters, or in the observed-data log-likelihood, falls below a chosen tolerance.

Fixed point / connection to MLE

At a fixed point the M-step relations hold with \(\tilde X_i=\tau_i\) for \(i\in M\); eliminating the \(\tau_i\) recovers the observed-data score equations of Problem 1, so the limit is the MLE \(\hat\beta=S_Y^{(O)}/S_X\), \(\hat\tau_i=(\delta_iX_i+Y_i)/(\delta_i+\hat\beta)\) (Problems 4 and 5 verify this explicitly).

Edge cases

As in Problem 1, a finite limit requires \(S_X=\sum_{i\in O}X_i>0\) and \(S_Y^{(O)}=\sum_{i\in O}Y_i>0\); otherwise the iterates drift towards the boundary (\(\beta\to\infty\) or \(\beta\to 0\)). A small R sketch of the iteration is given below.
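
The following self-contained R sketch implements this EM iteration; the simulated data, sample sizes, and starting values are assumptions made only for the illustration. At convergence, the EM value of \(\beta\) should agree with the closed form \(S_Y^{(O)}/S_X\) of Problems 1 and 5.

### ---- EM iteration for Problem 2 (illustrative sketch) ----
set.seed(2)
n <- 10; k <- 3                               # assumed sizes; stretches 1..k have X missing
tau_true <- runif(n, 2, 6); beta_true <- 1.5  # assumed true parameter values
X <- rpois(n, tau_true); Y <- rpois(n, beta_true * tau_true)
X[1:k] <- NA

obs <- !is.na(X); S_Y <- sum(Y)
beta <- 1; tau <- Y + 0.5                     # assumed starting values
for (t in 1:500) {
  X_tilde  <- ifelse(obs, X, tau)             # E-step: replace missing X_i by tau_i
  beta_new <- S_Y / sum(X_tilde)              # M-step: beta update
  tau_new  <- (X_tilde + Y) / (1 + beta_new)  # M-step: tau updates
  if (max(abs(c(beta_new - beta, tau_new - tau))) < 1e-10) {
    beta <- beta_new; tau <- tau_new; break
  }
  beta <- beta_new; tau <- tau_new
}
c(em_beta = beta, closed_form = sum(Y[obs]) / sum(X[obs]))  # should agree (Problem 5)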

Q3.

In Problem 2, write down the recursive updating equation for \(\tau_1\) alone, covering both cases: \(X_1\) missing and \(X_1\) observed.

Solution

EM-step structure (recap)

From Problem 2, one EM iteration is \[ \beta^{(t+1)}=\frac{S_Y}{\sum_{i=1}^n\tilde X_i^{(t)}},\qquad \tau_i^{(t+1)}=\frac{\tilde X_i^{(t)}+Y_i}{1+\beta^{(t+1)}},\qquad \tilde X_i^{(t)}=\begin{cases}X_i,&i\in O,\\[4pt] \tau_i^{(t)},&i\in M.\end{cases} \]

Case A — \(X_1\) missing (\(1 \in M\))

Here \(\tilde X_1^{(t)}=\tau_1^{(t)}\), so the recursion for \(\tau_1\) is \[ \tau_1^{(t+1)}=\frac{\tau_1^{(t)}+Y_1}{1+\beta^{(t+1)}}, \qquad \beta^{(t+1)}=\frac{S_Y}{\displaystyle\tau_1^{(t)}+\sum_{i\in M,\,i\neq 1}\tau_i^{(t)}+\sum_{i\in O}X_i}. \]

Special subcase — \(X_1\) the only missing observation (\(k=1\))

When \(M=\{1\}\), the recursion for \(\tau_1\) is a scalar fixed-point iteration: \[ \beta^{(t+1)}=\frac{S_Y}{\tau_1^{(t)}+\sum_{i=2}^n X_i},\qquad \tau_1^{(t+1)}=\frac{\tau_1^{(t)}+Y_1}{1+\beta^{(t+1)}}. \]

Case B — \(X_1\) observed (\(1 \in O\))

Here \(\tilde X_1^{(t)}=X_1\), so \[ \tau_1^{(t+1)}=\frac{X_1+Y_1}{1+\beta^{(t+1)}}, \] where \(\beta^{(t+1)}=S_Y/\sum_i\tilde X_i^{(t)}\) changes across iterations only through the \(\tau_j^{(t)}\) of the stretches with missing \(X_j\).

Q4.

If the recursions in Problem 3 converge, what is the value of \(\tau_1\) reached at convergence?

Detailed solution

Let the EM iterates converge to \((\beta^*,\tau_1^*,\tau_2^*,\dots,\tau_n^*)\). At convergence the EM fixed-point (stationarity) conditions must hold (these are simply the M-step formulae evaluated at the limit, together with the definition of the \(\tilde X_i\)).

Recall the M-step relations used in the EM:

\[ \beta = \frac{S_Y}{\sum_{i=1}^n\tilde X_i},\qquad \tau_i = \frac{\tilde X_i + Y_i}{1+\beta}, \]

where \(\tilde X_i = X_i\) if \(X_i\) observed and \(\tilde X_i=\tau_i\) if \(X_i\) missing, and \(S_Y=\sum_{i=1}^n Y_i\).

Case A — \(X_1\) observed

Then \(\tilde X_1^*=X_1\) and the fixed-point equation for \(\tau_1\) is:

\[ \tau_1^* = \frac{X_1 + Y_1}{1+\beta^*}. \]

So the limit value is:

\[ \boxed{\tau_1^* = \frac{X_1 + Y_1}{1+\beta^*}} \]

Equivalently, using \(\beta^* = S_Y^{(O)}/S_X^{(O)}\) from Problem 5, this equals the closed-form MLE value of Problem 1.

Case B — \(X_1\) missing

Then \(\tilde X_1^*=\tau_1^*\). The fixed-point equation gives:

\[ \tau_1^* = \frac{\tau_1^* + Y_1}{1+\beta^*}. \]

Rearranging:

\[ (1+\beta^*)\tau_1^* = \tau_1^* + Y_1 \quad\Longrightarrow\quad \beta^* \tau_1^* = Y_1 \quad\Longrightarrow\quad \boxed{\tau_1^* = \frac{Y_1}{\beta^*}} \]

Summary

\[ \tau_1^* = \begin{cases} \dfrac{X_1+Y_1}{1+\beta^*}, & X_1\ \text{observed},\\[8pt] \dfrac{Y_1}{\beta^*}, & X_1\ \text{missing}. \end{cases} \]

These match the MLE relations obtained by direct maximization in Problem 1.

How to prove these equalities (short rigorous argument)

  1. EM fixed-point property: A limit point \(\theta^*=(\beta^*,\tau^*)\) satisfies the M-step maximization with the E-step expectations computed at \(\theta^*\), implying the M-step equations hold at \(\theta^*\).
  2. Algebraic rearrangement: For the missing case, \(\tau_1^*=(\tau_1^*+Y_1)/(1+\beta^*) \Rightarrow \beta^*\tau_1^*=Y_1 \Rightarrow \tau_1^*=Y_1/\beta^*\). Similarly for the observed case.
  3. Consistency with MLE: Solving the observed-data score equations directly yields the same relations.

How to prove EM recursions converge (sketch)

Each EM iteration cannot decrease the observed-data log-likelihood (the standard ascent property, obtained from Jensen's inequality applied to \(Q\)). When \(S_X^{(O)}>0\) and \(S_Y^{(O)}>0\), the observed-data log-likelihood is bounded above and has a unique stationary point (Problem 1), so the monotone sequence of likelihood values converges, and any limit point of the iterates satisfies the fixed-point equations above; hence the recursions converge to the MLE.

Important caveats / edge cases

If \(S_X^{(O)}=0\) or \(S_Y^{(O)}=0\) there is no finite interior stationary point, and the recursions drift to the boundary (\(\beta\to\infty\) or \(\beta\to 0\)), mirroring the non-existence of the MLE noted in Problem 1. Also, if \(Y_1=0\) and \(X_1\) is missing, the limit is the boundary value \(\tau_1^*=0\).

Question

Q5.
If the recursions in Problem 2 converge, what is the value of \(\beta\) reached at convergence?

Solution

Setup / notation

  • Index sets:
    • \(O=\{k+1,\dots,n\}\) (indices where \(X_i\) is observed)
    • \(M=\{1,\dots,k\}\) (indices where \(X_i\) is missing)
  • Totals: \[ S_X^{(O)}=\sum_{i\in O} X_i,\quad S_Y^{(O)}=\sum_{i\in O} Y_i,\quad S_Y^{(M)}=\sum_{i\in M} Y_i,\quad S_Y=\sum_{i=1}^n Y_i = S_Y^{(O)}+S_Y^{(M)}. \]

Fixed-point relations at EM limit

  • At a limit point \((\beta^*,\tau_1^*,\dots,\tau_n^*)\) the M-step coordinate formulas hold:
    • For \(i\in O\) (observed): \[ \tau_i^* = \frac{X_i+Y_i}{1+\beta^*}. \]
    • For \(i\in M\) (missing): the M-step relation \(\tau_i^* = \dfrac{\tau_i^*+Y_i}{1+\beta^*}\) rearranges, exactly as in Problem 4, to \[ \tau_i^* = \frac{Y_i}{\beta^*}. \]
  • The \(\beta\)-limit satisfies the M-step relation: \[ \beta^* \;=\; \frac{S_Y}{\displaystyle\sum_{i=1}^n \tilde X_i^*}, \qquad\text{where }\tilde X_i^*=\begin{cases}X_i,&i\in O,\\[4pt]\tau_i^*,&i\in M.\end{cases} \]

Algebraic derivation

  • Compute \(\sum_{i=1}^n \tilde X_i^*\): \[ \sum_{i=1}^n \tilde X_i^* = \sum_{i\in O} X_i + \sum_{i\in M} \tau_i^* = S_X^{(O)} + \sum_{i\in M} \frac{Y_i}{\beta^*} = S_X^{(O)} + \frac{S_Y^{(M)}}{\beta^*}. \]
  • Substitute into the \(\beta\)-equation: \[ \beta^* = \frac{S_Y}{S_X^{(O)} + \dfrac{S_Y^{(M)}}{\beta^*}}. \]
  • Multiply both sides by the denominator: \[ \beta^*\Big(S_X^{(O)} + \frac{S_Y^{(M)}}{\beta^*}\Big) = S_Y \quad\Longrightarrow\quad \beta^* S_X^{(O)} + S_Y^{(M)} = S_Y. \]
  • Simplify to obtain: \[ \beta^* S_X^{(O)} = S_Y - S_Y^{(M)} = S_Y^{(O)}. \]

Final result

Therefore the limit (and under regularity the MLE) is

\[ \boxed{\displaystyle \beta^* \;=\; \frac{S_Y^{(O)}}{S_X^{(O)}} \;=\; \frac{\sum_{i\in O} Y_i}{\sum_{i\in O} X_i}.} \]

Intuition: only stretches with observed \(X_i\) provide the direct link identifying the scale \(\beta\). Observed \(Y\) for missing \(X\) inform the \(\tau_i\) but cannot by themselves separate \(\beta\) from \(\tau_i\).

Edge cases & interpretation

  • If all \(X_i\) observed (\(M=\varnothing\)), this reduces to \(\beta^*=S_Y/S_X\) with \(S_X=\sum_i X_i\).
  • If no \(X_i\) observed (\(O=\varnothing\)), then \(S_X^{(O)}=0\) and \(\beta\) is not identifiable from the data (denominator zero); additional assumptions or priors are required.
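
As a sanity check of the fixed-point results of Problems 3-5, the short R sketch below runs the recursion in the special subcase where only \(X_1\) is missing and compares the limits with \(\beta^*=S_Y^{(O)}/S_X^{(O)}\) and \(\tau_1^*=Y_1/\beta^*\); the simulated data and starting values are illustrative assumptions.

### ---- Fixed-point check for Problems 3-5 (illustrative sketch) ----
set.seed(3)
n <- 8
tau_true <- runif(n, 2, 6); beta_true <- 2    # assumed true parameter values
X <- rpois(n, tau_true); Y <- rpois(n, beta_true * tau_true)
X[1] <- NA                                    # only X_1 is missing (k = 1)

S_Y <- sum(Y); S_X_obs <- sum(X[-1]); S_Y_obs <- sum(Y[-1])
tau1 <- Y[1] + 0.5; beta <- 1                 # assumed starting values
for (t in 1:500) {
  beta <- S_Y / (tau1 + S_X_obs)              # beta update (only tau_1 is imputed)
  tau1 <- (tau1 + Y[1]) / (1 + beta)          # tau_1 update when X_1 is missing
}
round(c(beta_limit = beta, beta_closed = S_Y_obs / S_X_obs,
        tau1_limit = tau1, tau1_closed = Y[1] / (S_Y_obs / S_X_obs)), 6)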

Question

Q6.
Let \( Y_1, \ldots, Y_n \) be i.i.d. from the two-component normal mixture \[ f(y \mid p, \mu_1, \mu_2, \sigma^2) = p \cdot \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_1}{\sigma}\right) + (1-p) \cdot \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_2}{\sigma}\right), \quad -\infty < y < \infty, \] with \(0 < p < 1\) and \(\mu_1,\mu_2,\sigma^2\) unknown.

Write down an iterative step of the steepest-ascent method for finding the MLEs of \(p,\mu_1,\mu_2,\sigma^2\).

Solution

Log-likelihood and responsibilities

  • Parameters: \(\theta=(p,\mu_1,\mu_2,\sigma^2)\), with \(\sigma>0\).
  • Log-likelihood: \[ \ell(\theta)=\sum_{i=1}^n \log\!\left[p\,\phi_{\mu_1,\sigma}(Y_i) + (1-p)\,\phi_{\mu_2,\sigma}(Y_i)\right], \] where \(\phi_{\mu,\sigma}(y)=\frac{1}{\sigma}\phi\!\big((y-\mu)/\sigma\big)\).
  • Responsibilities (posterior weights under current \(\theta\)): \[ r_i(\theta)= \frac{p\,\phi_{\mu_1,\sigma}(Y_i)} {p\,\phi_{\mu_1,\sigma}(Y_i)+(1-p)\,\phi_{\mu_2,\sigma}(Y_i)}\in(0,1). \]

Score (gradient) components

  • \[ \frac{\partial \ell}{\partial p} = \sum_{i=1}^n\!\left(\frac{r_i}{p}-\frac{1-r_i}{1-p}\right) \]
  • \[ \frac{\partial \ell}{\partial \mu_1} = \frac{1}{\sigma^2}\sum_{i=1}^n r_i\,(Y_i-\mu_1),\qquad \frac{\partial \ell}{\partial \mu_2} = \frac{1}{\sigma^2}\sum_{i=1}^n (1-r_i)\,(Y_i-\mu_2) \]
  • \[ \frac{\partial \ell}{\partial \sigma^2} = \sum_{i=1}^n\!\left[ -\frac{1}{2\sigma^2} + \frac{ r_i\,(Y_i-\mu_1)^2 + (1-r_i)\,(Y_i-\mu_2)^2 }{2\sigma^4} \right] \]

Steepest-ascent iterative step

  • Given iterate \(\theta^{(t)}=(p^{(t)},\mu_1^{(t)},\mu_2^{(t)},(\sigma^2)^{(t)})\) and stepsize \(\eta_t>0\), compute responsibilities \[ r_i^{(t)}= \frac{p^{(t)}\,\phi_{\mu_1^{(t)},\,\sigma^{(t)}}(Y_i)} {p^{(t)}\,\phi_{\mu_1^{(t)},\,\sigma^{(t)}}(Y_i)+\big(1-p^{(t)}\big)\,\phi_{\mu_2^{(t)},\,\sigma^{(t)}}(Y_i)}. \]
  • Update: \[ \begin{aligned} p^{(t+1)} \,&=\, p^{(t)} + \eta_t \sum_{i=1}^n\!\left(\frac{r_i^{(t)}}{p^{(t)}}-\frac{1-r_i^{(t)}}{1-p^{(t)}}\right),\\[6pt] \mu_1^{(t+1)} \,&=\, \mu_1^{(t)} + \eta_t \frac{1}{(\sigma^{2})^{(t)}}\sum_{i=1}^n r_i^{(t)}\big(Y_i-\mu_1^{(t)}\big),\\[6pt] \mu_2^{(t+1)} \,&=\, \mu_2^{(t)} + \eta_t \frac{1}{(\sigma^{2})^{(t)}}\sum_{i=1}^n (1-r_i^{(t)})\big(Y_i-\mu_2^{(t)}\big),\\[8pt] (\sigma^{2})^{(t+1)} \,&=\, (\sigma^{2})^{(t)} + \eta_t \sum_{i=1}^n\!\left[ -\frac{1}{2(\sigma^{2})^{(t)}} + \frac{ r_i^{(t)}(Y_i-\mu_1^{(t)})^2 + (1-r_i^{(t)})(Y_i-\mu_2^{(t)})^2} {2\big((\sigma^{2})^{(t)}\big)^2} \right]. \end{aligned} \]

Constraints & practicalities

  • Maintain constraints: \(0<p<1\) and \(\sigma^2>0\). Two standard ways to do this:
    • Projection: after the step, clip \(p\in[\epsilon,1-\epsilon]\), \(\sigma^2\ge \epsilon\).
    • Reparameterize (cleaner): update \(\omega=\log\!\frac{p}{1-p}\), \(\lambda=\log\sigma^2\) unconstrained via gradient ascent, then map back \(p=\mathrm{logit}^{-1}(\omega)\), \(\sigma^2=e^\lambda\).
  • Use a line-search or diminishing \(\eta_t\) to ensure ascent of \(\ell(\theta)\).
  • Impose an identifiability convention (e.g., \(\mu_1\le \mu_2\)) to avoid label switching.

Remarks

The responsibilities \(r_i^{(t)}\) are the same weights used in EM. Steepest ascent follows the score direction; EM or second-order (Newton / Fisher scoring, L-BFGS) steps usually converge faster for mixtures, but the updates above are the canonical steepest-ascent iteration.
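
The following is a minimal R sketch of this iteration: it evaluates the score exactly as given above and takes a gradient step, halving the step size whenever the step would violate the constraints or decrease the log-likelihood. The simulated data, starting values, and step-size scheme are assumptions made only for the illustration.

### ---- Steepest ascent for the two-component mixture (illustrative sketch) ----
set.seed(4)
n <- 200
y <- c(rnorm(0.4 * n, -1, 1), rnorm(0.6 * n, 2, 1))  # assumed true parameters

loglik <- function(th, y) {                  # th = (p, mu1, mu2, sigma2)
  sum(log(th[1] * dnorm(y, th[2], sqrt(th[4])) +
          (1 - th[1]) * dnorm(y, th[3], sqrt(th[4]))))
}
score <- function(th, y) {                   # gradient (score) of the log-likelihood
  p <- th[1]; mu1 <- th[2]; mu2 <- th[3]; s2 <- th[4]
  f1 <- dnorm(y, mu1, sqrt(s2)); f2 <- dnorm(y, mu2, sqrt(s2))
  r  <- p * f1 / (p * f1 + (1 - p) * f2)     # responsibilities
  c(sum(r / p - (1 - r) / (1 - p)),
    sum(r * (y - mu1)) / s2,
    sum((1 - r) * (y - mu2)) / s2,
    sum(-1 / (2 * s2) + (r * (y - mu1)^2 + (1 - r) * (y - mu2)^2) / (2 * s2^2)))
}
ascent_step <- function(th, y, eta = 1e-3) { # one steepest-ascent step with halving
  g <- score(th, y)
  repeat {
    th_new <- th + eta * g
    ok <- th_new[1] > 0 && th_new[1] < 1 && th_new[4] > 0
    if (ok && loglik(th_new, y) >= loglik(th, y)) return(th_new)
    eta <- eta / 2
    if (eta < 1e-12) return(th)              # effectively at a stationary point
  }
}

th <- c(p = 0.5, mu1 = min(y), mu2 = max(y), sigma2 = var(y))  # assumed start
for (t in 1:200) th <- ascent_step(th, y)
round(th, 3)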

Question

Q7.
Are the parameters of the model of Problem 6 uniquely identifiable from the data?

Solution

Short answer

  • Yes — generically. The parameters \((p,\mu_1,\mu_2,\sigma^2)\) of the two-normal mixture with a common variance are identifiable up to label switching. That is, the mapping \[ (p,\mu_1,\mu_2,\sigma^2)\mapsto f(y\mid p,\mu_1,\mu_2,\sigma^2) \] is one-to-one except for the trivial symmetry that swaps component labels: \((p,\mu_1,\mu_2,\sigma^2)\) and \((1-p,\mu_2,\mu_1,\sigma^2)\) define the same distribution.

  • Degenerate failures of identifiability occur only in obvious special cases (components coincide or one weight is 0 or 1).

Constructive identifiability sketch (moment method)

  • Let \(m=\mathbb{E}[Y]=p\mu_1+(1-p)\mu_2\) and \(\Delta=\mu_1-\mu_2\). Write \(c_r=\mathbb{E}[(Y-m)^r]\) for the central moments (reserving \(\mu_1,\mu_2\) for the component means). For the model \[ Y\sim p\,\mathcal N(\mu_1,\sigma^2)+(1-p)\,\mathcal N(\mu_2,\sigma^2), \] one obtains the moment relations \[ \begin{aligned} c_2 &= \sigma^2 + p(1-p)\Delta^2,\\[4pt] c_3 &= p(1-p)(1-2p)\Delta^3,\\[4pt] c_4 &= 3\sigma^4 + 6\sigma^2 p(1-p)\Delta^2 + p(1-p)\big[(1-p)^3+p^3\big]\Delta^4. \end{aligned} \] (A small numerical check of these relations appears after this list.)

  • From \((c_2,c_3,c_4)\) one can (generically) solve for \((p,\Delta^2,\sigma^2)\). Then \(\mu_1,\mu_2\) follow from \(m\) and \(\Delta\) (the sign of \(\Delta\) is the label choice).

  • Special subcases:

    • If \(p\neq\tfrac12\) and \(\Delta\neq0\), then \(c_3\neq0\), and its sign and magnitude supply the information needed to recover \(p\) and \(\Delta\).
    • If \(p=\tfrac12\) then \(c_3=0\), but \(c_2,c_4\) still determine \(\Delta^2\) and \(\sigma^2\) (so the parameters are recoverable up to a label swap).
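
Below is a quick Monte Carlo check of the three moment relations; the parameter values and simulation size are arbitrary illustrative choices.

### ---- Monte Carlo check of the mixture moment relations (illustrative) ----
set.seed(5)
p <- 0.3; mu1 <- -1; mu2 <- 2; sigma2 <- 1.5   # assumed illustrative values
N <- 1e6
z <- rbinom(N, 1, p)
y <- rnorm(N, mean = ifelse(z == 1, mu1, mu2), sd = sqrt(sigma2))

m <- p * mu1 + (1 - p) * mu2; Delta <- mu1 - mu2
emp  <- sapply(2:4, function(r) mean((y - m)^r))           # empirical c_2, c_3, c_4
theo <- c(sigma2 + p * (1 - p) * Delta^2,
          p * (1 - p) * (1 - 2 * p) * Delta^3,
          3 * sigma2^2 + 6 * sigma2 * p * (1 - p) * Delta^2 +
            p * (1 - p) * ((1 - p)^3 + p^3) * Delta^4)
round(rbind(empirical = emp, theoretical = theo), 3)        # should be close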

Non-identifiable / degenerate cases

  • \(\mu_1=\mu_2\): the mixture reduces to a single \(\mathcal N(\mu_1,\sigma^2)\); \(p\) and the two means cannot be separated.
  • \(p\in\{0,1\}\): only a single component is present; the other component's parameters are irrelevant.
  • Label-switching symmetry: \((p,\mu_1,\mu_2,\sigma^2)\) and \((1-p,\mu_2,\mu_1,\sigma^2)\) represent the same model — this is the only nonidentifiability in the generic (nondegenerate) case.

Related theoretical facts

  • Finite mixtures of distinct normal components are identifiable (up to permutations of labels); this is a classical result in mixture theory.
  • Having a common unknown variance \(\sigma^2\) does not destroy identifiability — higher moments or characteristic-function arguments separate \(\Delta\), \(p\), and \(\sigma^2\).

Practical implications & recommendations

  • In estimation, enforce an identifiability convention (e.g. impose \(\mu_1\le\mu_2\) or order by component means) or post-process MCMC/EM output to resolve label switching.
  • Be alert to near-degeneracy: when \(\mu_1\approx\mu_2\) or \(p\) near 0 or 1 the likelihood is flat in directions and numerical methods may be unstable; use regularization, informative priors, or constrained optimization.
  • Use multiple random starts and check for label permutations and convergence to the same fitted mixture (modulo label swaps).

Visualization check

  • Plot the fitted mixture density and the two component densities. If components overlap strongly or one weight is near 0, identifiability is practically weak even if theoretically present.

Conclusion

The model is identifiable except for the unavoidable label-switching symmetry and trivial degenerate cases (coincident components or trivial weight). In plain terms: you can recover the parameters from the distribution, but you must either accept or remove the label ambiguity and be careful near degeneracies.

Question

Q8.
Suppose \((X_1,Y_1),\dots,(X_n,Y_n)\) are i.i.d. where \(X_i\sim\operatorname{Bernoulli}(p)\) and the conditional pdf of \(Y_i\) given \(X_i=x\) is

\[ f(y \mid x, p, \mu_1, \mu_2, \sigma^2) = x \; \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_1}{\sigma}\right) + (1 - x) \; \frac{1}{\sigma} \, \phi\!\left(\frac{y - \mu_2}{\sigma}\right), \quad -\infty < y < \infty \] with \( 0 < p < 1 \). Obtain explicit expressions for the MLEs of the four parameters.

Solution

Interpretation & notation

  • The label \(X_i\) is observed, so this is a labeled mixture: if \(X_i=1\) then \(Y_i\sim N(\mu_1,\sigma^2)\); if \(X_i=0\) then \(Y_i\sim N(\mu_2,\sigma^2)\).
  • Define \[ n_1=\sum_{i=1}^n X_i,\qquad n_0=n-n_1. \]
  • Index sets and sums: \[ I_1=\{i: X_i=1\},\quad I_0=\{i:X_i=0\}, \] \[ S_1=\sum_{i\in I_1} Y_i,\quad S_0=\sum_{i\in I_0} Y_i. \]
  • Sample means (when defined): \[ \bar Y_1=\frac{S_1}{n_1}\ (\text{if }n_1>0),\qquad \bar Y_0=\frac{S_0}{n_0}\ (\text{if }n_0>0). \]
  • Sums of squared deviations: \[ Q_1=\sum_{i\in I_1}(Y_i-\bar Y_1)^2,\quad Q_0=\sum_{i\in I_0}(Y_i-\bar Y_0)^2. \]

Log-likelihood (brief)

Because \((X_i,Y_i)\) are i.i.d. and \(X_i\) observed, \[ \ell(p,\mu_1,\mu_2,\sigma^2) = \sum_{i=1}^n \big[X_i\log p +(1-X_i)\log(1-p)\big] + \sum_{i\in I_1}\log\phi_{\mu_1,\sigma}(Y_i) + \sum_{i\in I_0}\log\phi_{\mu_2,\sigma}(Y_i), \] where \(\log\phi_{\mu,\sigma}(y)=-\tfrac12\log(2\pi\sigma^2)-\dfrac{(y-\mu)^2}{2\sigma^2}\).

MLE for \(p\)

  • Maximizing the Bernoulli part yields the familiar estimator: \[ \boxed{\;\hat p = \frac{n_1}{n} = \frac{1}{n}\sum_{i=1}^n X_i\;.} \]

MLEs for \(\mu_1,\mu_2\)

  • Maximizing the normal log-terms w.r.t. \(\mu_1\) (holding \(\sigma^2\) fixed) gives the sample mean within group 1: \[ \boxed{\;\hat\mu_1 = \bar Y_1 = \frac{1}{n_1}\sum_{i: X_i=1} Y_i \quad(\text{if }n_1>0)\;.} \]
  • Similarly for group 0: \[ \boxed{\;\hat\mu_2 = \bar Y_0 = \frac{1}{n_0}\sum_{i: X_i=0} Y_i \quad(\text{if }n_0>0)\;.} \]

MLE for \(\sigma^2\)

  • Collect the normal contributions and differentiate w.r.t. \(\sigma^2\). The MLE is \[ \boxed{\; \widehat{\sigma^2} = \frac{1}{n}\Bigg(\sum_{i\in I_1}(Y_i-\hat\mu_1)^2 + \sum_{i\in I_0}(Y_i-\hat\mu_2)^2\Bigg) \;=\; \frac{Q_1+Q_0}{n}\;. } \] (Note: this is the MLE divisor \(n\), not the unbiased divisor \(n-2\).)

Edge cases & existence

  • If \(n_1=0\) then \(\hat p=0\) and \(\hat\mu_1\) is undefined from the data; similarly if \(n_0=0\) then \(\hat p=1\) and \(\hat\mu_2\) is undefined. In those degenerate cases extra assumptions or priors are required.
  • If both groups have at least one observation, the MLEs above are well-defined and unique.

Summary (boxed answers)

  • \(\displaystyle \boxed{\;\hat p = \dfrac{n_1}{n}\;}\).
  • \(\displaystyle \boxed{\;\hat\mu_1 = \bar Y_1 = \dfrac{1}{n_1}\sum_{i:X_i=1} Y_i\;}\) (if \(n_1>0\)).
  • \(\displaystyle \boxed{\;\hat\mu_2 = \bar Y_0 = \dfrac{1}{n_0}\sum_{i:X_i=0} Y_i\;}\) (if \(n_0>0\)).
  • \(\displaystyle \boxed{\;\widehat{\sigma^2} = \dfrac{1}{n}\Big(\sum_{i\in I_1}(Y_i-\hat\mu_1)^2 + \sum_{i\in I_0}(Y_i-\hat\mu_2)^2\Big)\;}\).
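
For a numerical illustration of the boxed formulas, here is a short R sketch with simulated labeled data; the parameter values are assumptions made only for the example.

### ---- MLEs for the labeled mixture of Problem 8 (illustrative sketch) ----
set.seed(6)
n <- 200; p <- 0.4; mu1 <- -1; mu2 <- 2; sigma2 <- 1   # assumed true values
X <- rbinom(n, 1, p)
Y <- rnorm(n, mean = ifelse(X == 1, mu1, mu2), sd = sqrt(sigma2))

p_hat   <- mean(X)                              # n_1 / n
mu1_hat <- mean(Y[X == 1])                      # group-1 sample mean
mu2_hat <- mean(Y[X == 0])                      # group-0 sample mean
s2_hat  <- (sum((Y[X == 1] - mu1_hat)^2) +
            sum((Y[X == 0] - mu2_hat)^2)) / n   # pooled, MLE divisor n
round(c(p_hat = p_hat, mu1_hat = mu1_hat, mu2_hat = mu2_hat, sigma2_hat = s2_hat), 3)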

Question

Q9.
For the data and the model described in Problem 6, let the “complete data model” be as in Problem 8. Derive the E-step and the M-step of the EM algorithm for finding the MLEs of the four parameters \((p,\mu_1,\mu_2,\sigma^2)\).

Solution

Model / notation (recap)

  • Observed: \(Y_1,\dots,Y_n\). Latent labels: \(X_i\in\{0,1\}\) for \(i=1,\dots,n\).
  • Complete-data model: \[ P(X_i=1)=p,\qquad Y_i\mid X_i=1\sim N(\mu_1,\sigma^2),\quad Y_i\mid X_i=0\sim N(\mu_2,\sigma^2). \]
  • Current iterate: \(\theta^{(t)}=(p^{(t)},\mu_1^{(t)},\mu_2^{(t)},(\sigma^2)^{(t)})\).

E-step

  • Compute the conditional expectation of the complete-data indicator \(X_i\) given \(Y_i\) under \(\theta^{(t)}\). These are the responsibilities: \[ \boxed{\; r_i^{(t)} \;=\; \mathbb{E}[X_i\mid Y_i,\theta^{(t)}] \;=\; \Pr\{X_i=1\mid Y_i,\theta^{(t)}\} \;=\; \frac{p^{(t)}\,\phi\!\big(\dfrac{Y_i-\mu_1^{(t)}}{\sigma^{(t)}}\big)} {\,p^{(t)}\,\phi\!\big(\dfrac{Y_i-\mu_1^{(t)}}{\sigma^{(t)}}\big) +(1-p^{(t)})\,\phi\!\big(\dfrac{Y_i-\mu_2^{(t)}}{\sigma^{(t)}}\big)\,}\;} \] where \(\sigma^{(t)}=\sqrt{(\sigma^2)^{(t)}}\) and \(\phi\) is the standard normal pdf.
  • Also note \(\mathbb{E}[1-X_i\mid Y_i,\theta^{(t)}]=1-r_i^{(t)}\).
  • For numerical stability compute the ratio using log-sum-exp form (work in log-domain).

M-step

  • Set \(R^{(t)}=\sum_{i=1}^n r_i^{(t)}\) (the expected count in component 1) and \(n-R^{(t)}=\sum_{i=1}^n(1-r_i^{(t)})\).
  • Maximize the expected complete-data log-likelihood (replace \(X_i\) by \(r_i^{(t)}\)). Closed-form updates are:
  • \[ \boxed{\; p^{(t+1)} \;=\; \frac{1}{n}\sum_{i=1}^n r_i^{(t)} \;=\; \frac{R^{(t)}}{n} \;} \]
  • \[ \boxed{\; \mu_1^{(t+1)} \;=\; \frac{\displaystyle\sum_{i=1}^n r_i^{(t)} Y_i}{\displaystyle\sum_{i=1}^n r_i^{(t)}} \;=\; \frac{1}{R^{(t)}}\sum_{i=1}^n r_i^{(t)}Y_i \;} \]
  • \[ \boxed{\; \mu_2^{(t+1)} \;=\; \frac{\displaystyle\sum_{i=1}^n (1-r_i^{(t)}) Y_i}{\displaystyle\sum_{i=1}^n (1-r_i^{(t)})} \;=\; \frac{1}{n-R^{(t)}}\sum_{i=1}^n (1-r_i^{(t)})Y_i \;} \]
  • \[ \boxed{\; (\sigma^2)^{(t+1)} \;=\; \frac{1}{n}\sum_{i=1}^n\Big[\, r_i^{(t)}\big(Y_i-\mu_1^{(t+1)}\big)^2 + (1-r_i^{(t)})\big(Y_i-\mu_2^{(t+1)}\big)^2 \Big]\; } \] (This follows by maximizing the expected complete-data normal log-likelihood; the MLE divisor is \(n\).)

One EM iteration (concise summary)

  • Given \(\theta^{(t)}\):
  • E-step: compute \(r_i^{(t)}\) for \(i=1,\dots,n\) using the boxed formula above.
  • M-step: compute \(R^{(t)}=\sum r_i^{(t)}\) and update \[ p^{(t+1)}=\frac{R^{(t)}}{n},\quad \mu_1^{(t+1)}=\frac{\sum r_i^{(t)}Y_i}{R^{(t)}},\quad \mu_2^{(t+1)}=\frac{\sum (1-r_i^{(t)})Y_i}{n-R^{(t)}}, \] and \[ (\sigma^2)^{(t+1)}=\frac{1}{n}\sum \big[ r_i^{(t)}(Y_i-\mu_1^{(t+1)})^2 + (1-r_i^{(t)})(Y_i-\mu_2^{(t+1)})^2\big]. \]

Convergence & practical notes

  • Iterate E and M steps until parameter change or log-likelihood increment falls below tolerance.
  • Initialization: choose sensible starts (k-means, random, or multiple restarts) to avoid poor local maxima and label switching.
  • For numerical stability use log-domain computations for responsibilities; guard against division by zero (ensure \(R^{(t)}\) and \(n-R^{(t)}\) are not too close to 0; regularize if necessary).

Remark

These EM updates are the classical closed-form E- and M-steps for the Gaussian two-component mixture with common variance. They increase the observed-data log-likelihood monotonically and (under regularity) converge to a stationary point of the likelihood.

Question

Q10.
Simulate a sample of size \(n = 100\) from the model assumed in Problem 6 for reasonable values of the four parameters. Write R code to obtain the MLEs through the EM algorithm, starting from the initial values \(p = 0\), \(\mu_1 = 0\), \(\mu_2\) equal to the sample mean, and \(\sigma^2\) equal to the sample variance.

Solution

set.seed(123)  # For reproducibility

### ---- Parameters for simulation ----
n <- 100
p_true <- 0.4
mu1_true <- -1
mu2_true <- 2
sigma2_true <- 1

### ---- Simulate data from mixture model ----
z <- rbinom(n, 1, p_true)   # Latent component indicators
x <- numeric(n)
x[z == 1] <- rnorm(sum(z == 1), mean = mu1_true, sd = sqrt(sigma2_true))
x[z == 0] <- rnorm(sum(z == 0), mean = mu2_true, sd = sqrt(sigma2_true))

### ---- Initial values ----
p <- 0.001              # problem asks for p = 0, but p = 0 makes every responsibility 0 (and the mu1 update 0/0), so start from a small positive value
mu1 <- 0
mu2 <- mean(x)
sigma2 <- var(x)

### ---- EM Algorithm ----
max_iter <- 1000
tol <- 1e-6
loglik <- numeric(max_iter)
eps <- 1e-10

for (iter in 1:max_iter) {
  
  ## E-step: Posterior probabilities for component 1
  tau1 <- p * dnorm(x, mean = mu1, sd = sqrt(sigma2))
  tau2 <- (1 - p) * dnorm(x, mean = mu2, sd = sqrt(sigma2))
  
  denom <- tau1 + tau2 + eps  # avoid division by zero
  gamma <- tau1 / denom
  
  ## Log-likelihood
  loglik[iter] <- sum(log(denom))
  
  ## M-step: Update parameters
  p_new <- mean(gamma)
  mu1_new <- sum(gamma * x) / sum(gamma)
  mu2_new <- sum((1 - gamma) * x) / sum(1 - gamma)
  sigma2_new <- sum(gamma * (x - mu1_new)^2 + (1 - gamma) * (x - mu2_new)^2) / n
  
  ## Check convergence
  if (max(abs(c(p_new - p, mu1_new - mu1, mu2_new - mu2, sigma2_new - sigma2)), na.rm = TRUE) < tol) {
    p <- p_new; mu1 <- mu1_new; mu2 <- mu2_new; sigma2 <- sigma2_new
    loglik <- loglik[1:iter]
    cat("Converged in", iter, "iterations\n")
    break
  }
  
  ## Update for next iteration
  p <- p_new
  mu1 <- mu1_new
  mu2 <- mu2_new
  sigma2 <- sigma2_new
}

### ---- Final estimates ----
cat("Estimated p:", p, "\n")
cat("Estimated mu1:", mu1, "\n")
cat("Estimated mu2:", mu2, "\n")
cat("Estimated sigma^2:", sigma2, "\n")

### ---- Plot log-likelihood ----
plot(loglik, type = "o", main = "Log-likelihood over iterations",
     xlab = "Iteration", ylab = "Log-likelihood")