Statistical Methods III: Assignment Week-3

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%


Question 1

We have counts \(Y_1,\dots,Y_4\) (multinomial given total \(n=\sum_i Y_i\)) with cell probabilities

\[ \pi_1=\tfrac12+\tfrac{\pi}{4},\qquad \pi_2=\pi_3=\tfrac{1-\pi}{4},\qquad \pi_4=\tfrac{\pi}{4}, \] where \(\pi\in(0,1)\) is unknown. Reparametrise \(\pi=(1-\theta)^2\) and derive the Fisher–scoring iterative update for the MLE of \(\pi\) (i.e. the update for \(\theta\)).

Solution — Detailed Answer

Step 0 — substitute the reparametrisation

Put \(\pi=(1-\theta)^2\) and define

\(a:=\dfrac{1-\theta}{2}\quad(\Rightarrow a^2=\dfrac{(1-\theta)^2}{4}).\)

Then the cell probabilities as functions of \(\theta\) become (useful algebraic forms)

  • \(p_1(\theta)=\pi_1=\tfrac12+\tfrac{(1-\theta)^2}{4}=\tfrac12 + a^2,\)
  • \(p_2(\theta)=p_3(\theta)=\tfrac{1-(1-\theta)^2}{4}=\tfrac14 - a^2,\)
  • \(p_4(\theta)=\tfrac{(1-\theta)^2}{4}=a^2.\)

Step 1 — log-likelihood

For total \(n=\sum_i Y_i\),

\(\ell(\theta)=\sum_{i=1}^4 Y_i\log p_i(\theta)+\text{const}.\)

Step 2 — first derivative (score)

Differentiate \(p_i\) w.r.t. \(\theta\). Using \(a=(1-\theta)/2\) and \(da/d\theta=-1/2\),

  • \(\dfrac{dp_1}{d\theta}=\dfrac{dp_4}{d\theta}=-a,\)
  • \(\dfrac{dp_2}{d\theta}=\dfrac{dp_3}{d\theta}=+a.\)

Hence the score \(S(\theta)=\dfrac{d\ell}{d\theta}\) is

\[ \begin{aligned} S(\theta) &=\sum_{i=1}^4 Y_i\frac{p_i'(\theta)}{p_i(\theta)} = -a\frac{Y_1}{p_1}+a\frac{Y_2}{p_2}+a\frac{Y_3}{p_3}-a\frac{Y_4}{p_4}\\ &= a\Big(\frac{Y_2+Y_3}{p_2}-\frac{Y_1}{p_1}-\frac{Y_4}{p_4}\Big). \end{aligned} \]

Step 3 — Fisher (expected) information

For a multinomial with probabilities \(p_i(\theta)\),

\(I(\theta)=n\sum_{i=1}^4\frac{(p_i'(\theta))^2}{p_i(\theta)}.\)

Because \((p_i')^2=a^2\) for every \(i\),

\(I(\theta)=n\,a^2\Big(\frac{1}{p_1}+\frac{2}{p_2}+\frac{1}{p_4}\Big).\)

Step 4 — Fisher scoring update (final iterative step)

Fisher scoring updates by

\(\displaystyle \theta^{(t+1)}=\theta^{(t)}+\frac{S(\theta^{(t)})}{I(\theta^{(t)})}.\)

Substitute \(S\) and \(I\). Cancelling common factors gives a convenient expression in terms of proportions \(r_i=Y_i/n\):

\[ \boxed{\,\theta^{(t+1)}=\theta^{(t)}+\frac{1}{a}\cdot \frac{\dfrac{r_2+r_3}{p_2}-\dfrac{r_1}{p_1}-\dfrac{r_4}{p_4}} {\dfrac{1}{p_1}+\dfrac{2}{p_2}+\dfrac{1}{p_4}}\,,} \]

where all \(p_j\) and \(a\) are evaluated at \(\theta^{(t)}\):

  • \(a=\dfrac{1-\theta^{(t)}}{2},\)
  • \(p_1=\tfrac12+a^2,\quad p_2=\tfrac14-a^2,\quad p_4=a^2,\)
  • \(r_i=Y_i/n\). (Equivalently you may write the update using counts \(Y_i\); the \(n\) cancels.)

This boxed formula is the Fisher–scoring iterative step for \(\theta\). After convergence \(\hat\theta\) gives \(\hat\pi=(1-\hat\theta)^2\).
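
To make the boxed update concrete, here is a minimal Python sketch of the iteration, assuming the four counts are supplied as a vector \(y=(Y_1,Y_2,Y_3,Y_4)\); the function name, default starting value, tolerance and the hypothetical counts in the usage comment are illustrative choices, not part of the assignment. No safeguards (damping, projection back into the admissible range) are included; see the remarks below.

```python
# Fisher scoring for theta, where pi = (1 - theta)^2 and the update is
# theta <- theta + S(theta) / I(theta), as derived above.
import numpy as np

def fisher_scoring_theta(y, theta0=0.5, tol=1e-8, max_iter=100):
    y = np.asarray(y, dtype=float)
    n = y.sum()
    r = y / n                                   # observed proportions r_i = Y_i / n
    theta = theta0                              # interior start: theta in (0, 2), theta != 1
    for _ in range(max_iter):
        a = (1.0 - theta) / 2.0
        p1, p2, p4 = 0.5 + a**2, 0.25 - a**2, a**2                  # cell probabilities (p2 = p3)
        score_over_na = (r[1] + r[2]) / p2 - r[0] / p1 - r[3] / p4  # S(theta) / (n a)
        info_over_na2 = 1.0 / p1 + 2.0 / p2 + 1.0 / p4              # I(theta) / (n a^2)
        theta_new = theta + score_over_na / (a * info_over_na2)     # theta + S / I
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Usage with hypothetical counts:
# theta_hat = fisher_scoring_theta([125, 18, 20, 34])
# pi_hat = (1 - theta_hat) ** 2
```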

Related Concepts / Remarks

Short remarks and important connections.

  • Why Fisher scoring? Fisher scoring replaces the observed negative second derivative by its expectation (the Fisher information). For generalized multinomial models, it often gives a stable and efficient algorithm; it is Newton–Raphson using expected information.
  • Score / information simplification: The algebra simplified nicely because all \((p_i')^2\) equal the same value \(a^2\). That yields compact expressions for \(S\) and \(I\).
  • Parameter domain & numerical caution: For valid probabilities we need \(p_2=\tfrac14-a^2>0\), i.e. \(a^2<1/4\) ⇔ \(|1-\theta|<1\) ⇔ \(\theta\in(0,2)\); we also need \(\theta\neq 1\) so that \(p_4=a^2>0\). Practically, choose an initial \(\theta^{(0)}\) inside \((0,2)\setminus\{1\}\) to keep all \(p_i>0\) (and avoid division by zero). If an iteration proposes a \(\theta\) outside the admissible range, project it back (or use a damped update).
  • Initial guess (practical): A reasonable start is to match \(p_4\) with the observed proportion \(r_4=Y_4/n\): \(\pi\approx 4r_4\), so \(\theta^{(0)}=1-\sqrt{\max(0,\min(1,4r_4))}\). If \(4r_4\ge 1\) this clamps to \(\theta^{(0)}=0\), which makes \(p_2=0\); use a slightly interior starting value instead (e.g. \(\theta^{(0)}=0.05\)).
  • Stopping rule: Use relative change \(|\theta^{(t+1)}-\theta^{(t)}|<\varepsilon\) (e.g. \(10^{-8}\)) or change in log-likelihood, or max iterations.
  • Observed vs expected information: If you prefer Newton–Raphson (observed information), compute the second derivative \(d^2\ell/d\theta^2\) and use \(\theta^{(t+1)}=\theta^{(t)}- \big(\ell'(\theta^{(t)})/\ell''(\theta^{(t)})\big)\). Fisher scoring often yields more stable steps in multinomial problems.
  • Asymptotic variance: At convergence, the asymptotic variance of \(\hat\theta\) is approximately \(I(\hat\theta)^{-1}\), with \(I(\hat\theta)=n a^2\big(1/p_1+2/p_2+1/p_4\big)\) evaluated at \(\hat\theta\). The delta method then gives the variance of \(\hat\pi=(1-\hat\theta)^2\); a worked expression follows this list.
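
For concreteness, the delta-method step mentioned in the last remark can be written out explicitly (all quantities evaluated at \(\hat\theta\), and using \(1-\hat\theta=2a\)):

\[ \frac{d\pi}{d\theta}=-2(1-\theta) \quad\Longrightarrow\quad \operatorname{Var}(\hat\pi)\;\approx\;\big(2(1-\hat\theta)\big)^2\,I(\hat\theta)^{-1} \;=\;\frac{16\,a^2}{n\,a^2\big(1/p_1+2/p_2+1/p_4\big)} \;=\;\frac{16}{n\big(1/p_1+2/p_2+1/p_4\big)}. \]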

Viz (Visualizations)

Practical plots to inspect and debug the algorithm numerically.

  • Plot the log-likelihood \(\ell(\theta)\) over a grid in the admissible \(\theta\)-interval (say \([0.01,1.99]\)) to see where the peak lies and whether it's sharp or flat.
  • Plot the iteration path \(\theta^{(t)}\) vs iteration \(t\) to check monotone convergence and whether damping is needed.
  • Optionally plot the score \(S(\theta)\) and expected information \(I(\theta)\) over the grid to visualise the step sizes implied by Fisher scoring (a minimal plotting sketch follows this list).
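
A minimal plotting sketch for these diagnostics, assuming numpy and matplotlib are available; the counts and grid limits are illustrative. The grid stays inside \((0,1)\), which already covers every value of \(\pi\) because \(\pi=(1-\theta)^2\) is symmetric about \(\theta=1\) (where \(p_4=0\)).

```python
# Plot the log-likelihood ell(theta) over an interior grid of theta values.
import numpy as np
import matplotlib.pyplot as plt

y = np.array([125.0, 18.0, 20.0, 34.0])        # hypothetical counts
theta = np.linspace(0.01, 0.99, 400)           # interior grid: keeps all p_i > 0
a = (1 - theta) / 2
p1, p2, p4 = 0.5 + a**2, 0.25 - a**2, a**2
loglik = y[0] * np.log(p1) + (y[1] + y[2]) * np.log(p2) + y[3] * np.log(p4)

plt.plot(theta, loglik)
plt.xlabel(r"$\theta$")
plt.ylabel("log-likelihood (up to a constant)")
plt.show()
```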

Other Relevant Points

Additional practical advice, edge-cases, and numerical tips.

  • If any observed cell count is zero, avoid dividing by zero; start from interior values and/or add a tiny pseudo-count (e.g. +0.5) if needed for numerical stability.
  • Fisher scoring will usually converge fast for this one-parameter problem; if oscillation occurs, use step-halving/damping (e.g. \(\theta^{(t+1)}\leftarrow \theta^{(t)}+\gamma\cdot\)update with \(\gamma\in(0,1)\)).
  • After obtaining \(\hat\theta\), transform back to \(\hat\pi=(1-\hat\theta)^2\) and report standard errors via the observed Fisher information or the delta method.
  • Implementation note: compute \(p_2=\tfrac14-a^2\) carefully to avoid catastrophic cancellation when \(a^2\) is close to \(\tfrac14\).

Question 2

A complete-data formulation for the problem in Q1 was given in class so that the EM algorithm can be used. Starting from that, obtain a single iterative step (merge the E-step and M-step) which updates the estimate of \(\pi\).

Solution — Detailed Answer

Setup / notation

Write the cell probabilities as linear functions of \(\pi\):

\(p_i(\pi)=a_i+b_i\pi,\qquad i=1,\dots,4,\)

  • \(a_1=\tfrac12,\; b_1=\tfrac14\)
  • \(a_2=\tfrac14,\; b_2=-\tfrac14\)
  • \(a_3=\tfrac14,\; b_3=-\tfrac14\)
  • \(a_4=0,\; b_4=\tfrac14\)

Let \(\widetilde Y_i^{(t)}\) denote the expected complete counts computed at the E-step under \(\pi^{(t)}\). Write \(\tilde r_i^{(t)}=\widetilde Y_i^{(t)}/n\) if you prefer proportions.

Step 1 — Q-function (expected complete log-likelihood)

The EM M-step maximises

\(Q(\pi\mid\pi^{(t)})=\sum_{i=1}^4 \widetilde Y_i^{(t)}\log\big(a_i+b_i\pi\big) + \text{const (w.r.t. }\pi).\)

Step 2 — derivative of Q (score for Q)

Differentiate to get

\(\dfrac{\partial Q}{\partial\pi}(\pi\mid\pi^{(t)})=\sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i}{a_i+b_i\pi}.\)

Setting \(\partial Q/\partial\pi=0\) gives the exact M-step, but the resulting equation is implicit in \(\pi\). A practical and standard merged one-line update is obtained by taking a single Newton step for this equation using the current \(\widetilde Y_i^{(t)}\).

Step 3 — Newton (one-step) update on Q (single merged iteration)

The second derivative is

\(\dfrac{\partial^2 Q}{\partial\pi^2}(\pi\mid\pi^{(t)})=-\sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i^2}{(a_i+b_i\pi)^2}.\)

Therefore a Newton update applied to \(Q\) gives the single-step merged iteration:

\[ \boxed{\; \pi^{(t+1)} \;=\; \pi^{(t)} \;+\; \frac{\displaystyle \sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i}{a_i+b_i\pi^{(t)}}} {\displaystyle \sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i^2}{(a_i+b_i\pi^{(t)})^2}} \; }. \]

Step 4 — explicit specialised form (using \(b_i=\pm\tfrac14\))

Plugging in the values \(a_i,b_i\) and pulling out the common constants (the \(b_i=\pm\tfrac14\) contribute a factor \(\tfrac14\) in the numerator and \(\tfrac1{16}\) in the denominator, leaving an overall factor of 4) yields the equivalent explicit form

\[ \boxed{\; \pi^{(t+1)}=\pi^{(t)} + 4\,\dfrac{\displaystyle \frac{\widetilde Y_1^{(t)}}{p_1(\pi^{(t)})}-\frac{\widetilde Y_2^{(t)}}{p_2(\pi^{(t)})}-\frac{\widetilde Y_3^{(t)}}{p_3(\pi^{(t)})}+\frac{\widetilde Y_4^{(t)}}{p_4(\pi^{(t)})} } {\displaystyle \frac{\widetilde Y_1^{(t)}}{p_1(\pi^{(t)})^2}+\frac{\widetilde Y_2^{(t)}}{p_2(\pi^{(t)})^2} +\frac{\widetilde Y_3^{(t)}}{p_3(\pi^{(t)})^2}+\frac{\widetilde Y_4^{(t)}}{p_4(\pi^{(t)})^2} }\; }, \]

where \(p_i(\pi^{(t)})=a_i+b_i\pi^{(t)}\). If data are complete then \(\widetilde Y_i^{(t)}=Y_i\).

How this merges E- and M-steps (practical summary)

  • E-step: compute \(\widetilde Y_i^{(t)}=\mathbb{E}[Y_i^{\text{(complete)}}\mid\text{observed},\pi^{(t)}]\).
  • M-step (merged): perform the single Newton update for \(\pi\) given above (one explicit line).
  • If you insist on a pure EM M-step you would need to solve \(\sum_i \widetilde Y_i^{(t)}\dfrac{b_i}{a_i+b_i\pi^{(t+1)}}=0\) exactly at each iteration, which is implicit and inconvenient; the one-step Newton update is the common practical remedy (see the sketch after this list).
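
A minimal sketch of the merged iteration, assuming the E-step quantities \(\widetilde Y_i^{(t)}\) are supplied (with the fully observed four-cell data they are simply the observed \(Y_i\)); the clamping into \((0,1)\) and all names are illustrative choices, not part of the classroom formulation.

```python
# One Newton step on Q per iteration: pi <- pi + Q'(pi) / (-Q''(pi)), as in the boxed update.
import numpy as np

A = np.array([0.5, 0.25, 0.25, 0.0])        # a_i
B = np.array([0.25, -0.25, -0.25, 0.25])    # b_i

def merged_em_newton(ytilde, pi0=0.5, tol=1e-10, max_iter=200):
    ytilde = np.asarray(ytilde, dtype=float)
    pi = pi0
    for _ in range(max_iter):
        p = A + B * pi                            # cell probabilities p_i(pi) = a_i + b_i * pi
        num = np.sum(ytilde * B / p)              # dQ/dpi
        den = np.sum(ytilde * B**2 / p**2)        # -d^2Q/dpi^2 (positive)
        pi_new = pi + num / den                   # single merged Newton step
        pi_new = min(max(pi_new, 1e-8), 1 - 1e-8) # keep the iterate inside (0, 1)
        if abs(pi_new - pi) < tol:
            return pi_new
        pi = pi_new
    return pi
```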

Related Concepts

  • Newton-within-EM: using one Newton (or Fisher) step in the M-step when a closed-form maximiser is unavailable; often accelerates convergence while keeping the EM structure.
  • Fisher-scoring variant: replace the Hessian in the denominator by its expectation (with \(\widetilde Y_i^{(t)}\) weights) to obtain a Fisher-scoring merged iteration.
  • Feasible region: ensure \(\pi^{(t+1)}\in(0,1)\) and \(p_i(\pi^{(t+1)})>0\). If the update leaves the feasible set, apply damping (step-halving) or project back into \((0,1)\).

Viz (Diagnostics)

  • Plot \(Q(\pi\mid\pi^{(t)})\) vs \(\pi\) for several \(\pi^{(t)}\) to check whether the Newton step moves uphill.
  • Trace \(\pi^{(t)}\) vs iteration \(t\) and optionally plot the observed log-likelihood to verify monotone behaviour (or detect the need for damping).
  • Plot the evolution of \(\widetilde Y_i^{(t)}\) if there is missingness — these typically stabilise quickly and reveal which components drive updates.

Other Relevant Points / Implementation tips

  • Use interior starting values for \(\pi^{(0)}\) (e.g. method-of-moments or \(\pi^{(0)}=\min(0.99,\max(0.01,4r_4))\)).
  • If any \(\widetilde Y_i^{(t)}\) or observed \(Y_i\) is zero, avoid division-by-zero by adding a tiny pseudo-count (e.g. +0.5) or using safe-guards in code.
  • Control step length (damping) when necessary: set \(\pi^{(t+1)}=\pi^{(t)}+\gamma\Delta\pi\) with \(\gamma\in(0,1]\) chosen to keep \(Q\) (or the observed log-likelihood) non-decreasing.
  • Check convergence by \(|\pi^{(t+1)}-\pi^{(t)}|<\varepsilon\) or small change in observed log-likelihood; report standard errors using the (observed or expected) information at the final iterate and the delta method if transforming parameters.

Question 3

Let \(Y_1,Y_2,Y_3,Y_4\) be counts in four genetic categories. Conditionally on the total \(N=Y_1+Y_2+Y_3+Y_4\) the vector has a multinomial distribution with cell probabilities

\[ p_1(\pi)=\tfrac12+\tfrac{\pi}{4},\qquad p_2(\pi)=p_3(\pi)=\tfrac{1-\pi}{4},\qquad p_4(\pi)=\tfrac{\pi}{4}, \]

where the unknown parameter \(\pi\in(0,1)\). Viewing this as an incomplete-data / latent-allocation problem, the EM algorithm applies. The question: obtain a closed form expression for the value of \(\pi\) attained at the convergence of the EM algorithm.

Detailed solution — overview

  • Introduce notation \(n_1=y_1,\; n_2=y_2+y_3,\; n_4=y_4\) and \(N=n_1+n_2+n_4\).
  • Write the (observed) log-likelihood \(\ell(\pi)\), differentiate (score), clear denominators to obtain a quadratic, and solve it.
  • Choose the root lying in \((0,1)\) — that is the EM fixed-point (the stationary solution of the observed-data likelihood).

Step 1 — write the (observed) log-likelihood

  • Up to an additive constant, the log-likelihood is

    \[ \ell(\pi)=y_1\log\!\Big(\tfrac12+\tfrac{\pi}{4}\Big)+(y_2+y_3)\log\!\Big(\tfrac{1-\pi}{4}\Big)+y_4\log\!\Big(\tfrac{\pi}{4}\Big). \]

  • With \(n_1=y_1,\; n_2=y_2+y_3,\; n_4=y_4\) and \(N=n_1+n_2+n_4\), substitute to simplify notation.

Step 2 — stationary equation (score = 0)

  • Differentiate \(\ell(\pi)\) w.r.t. \(\pi\). Useful derivatives:
    • \(\dfrac{d}{d\pi}\log\!\Big(\tfrac12+\tfrac{\pi}{4}\Big)=\dfrac{1}{2+\pi}\).
    • \(\dfrac{d}{d\pi}\log\!\Big(\tfrac{1-\pi}{4}\Big)=-\dfrac{1}{1-\pi}\).
    • \(\dfrac{d}{d\pi}\log\!\Big(\tfrac{\pi}{4}\Big)=\dfrac{1}{\pi}\).
  • Thus the score equation is

    \[ \frac{n_1}{2+\pi}-\frac{n_2}{1-\pi}+\frac{n_4}{\pi}=0. \]

Step 3 — clear denominators to obtain a polynomial

  • Multiply the score equation by \(\pi(1-\pi)(2+\pi)\) to remove denominators:

    \[ n_1\pi(1-\pi)+n_4(1-\pi)(2+\pi)-n_2\pi(2+\pi)=0. \]

  • Expand and collect like powers:
    • \(n_1\pi(1-\pi)=n_1(\pi-\pi^2)\).
    • \((1-\pi)(2+\pi)=2-\pi-\pi^2\), so \(n_4(1-\pi)(2+\pi)=n_4(2-\pi-\pi^2)\).
    • \(n_2\pi(2+\pi)=n_2(2\pi+\pi^2)\).
  • Collecting coefficients gives the quadratic:

    \[ - N \pi^2 + (n_1 - n_4 - 2 n_2)\pi + 2 n_4 = 0, \]

    and multiplying by \(-1\) yields the standard form

    \[ N\pi^2 + (-n_1 + n_4 + 2 n_2)\pi - 2 n_4 = 0. \]

Step 4 — solve the quadratic

  • Apply the quadratic formula. The solutions are

    \[ \boxed{ \displaystyle \pi \;=\; \frac{\,n_1 - n_4 - 2 n_2 \;\pm\; \sqrt{( -n_1 + n_4 + 2 n_2)^2 + 8 N n_4}\,}{2N} }. \]

    (The discriminant simplifies as \(( -n_1 + n_4 + 2 n_2)^2 - 4\cdot N\cdot(-2 n_4) = ( -n_1 + n_4 + 2 n_2)^2 + 8 N n_4\).) A symbolic check of this algebra is sketched below.
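
If sympy is available, the denominator-clearing algebra can be checked symbolically; the short script below is only a verification sketch (variable names are illustrative).

```python
# Verify that clearing denominators in the score equation gives the stated quadratic.
import sympy as sp

pi, n1, n2, n4 = sp.symbols("pi n1 n2 n4", positive=True)
cleared = n1*pi*(1 - pi) + n4*(1 - pi)*(2 + pi) - n2*pi*(2 + pi)
N = n1 + n2 + n4
target = -N*pi**2 + (n1 - n4 - 2*n2)*pi + 2*n4
print(sp.simplify(sp.expand(cleared) - target))   # prints 0, confirming the quadratic
```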

Step 5 — choose the correct root (EM convergence)

  • The EM algorithm converges to a stationary point of the observed-data likelihood; the valid parameter must lie in \((0,1)\).
  • Thus \(\pi_{\text{EM}}\) is the root above that lies in \((0,1)\). When \(n_4>0\) the constant term \(-2n_4\) of the quadratic is negative, so the two roots have opposite signs and the “\(+\)” branch gives the unique positive root; if in addition \(n_2>0\), that root lies in \((0,1)\) because the quadratic is negative at \(\pi=0\) and equals \(3n_2>0\) at \(\pi=1\). In short, choose the sign that gives \(0<\pi<1\).

Conclusion / boxed result (explicit):

\[ \pi_{\star} \;=\; \frac{n_1 - n_4 - 2 n_2 \;\pm\; \sqrt{( -n_1 + n_4 + 2 n_2)^2 + 8 N n_4}}{2N}, \]

with the sign chosen so that \(0<\pi_\star<1\). This closed form is the stationary point of the observed-data log-likelihood that any EM fixed point must satisfy, and hence the value to which the EM iterations converge.
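
A minimal numeric sketch of this closed-form result, with illustrative names and hypothetical counts; it assumes \(n_4>0\) and \(n_2>0\) so that a root in \((0,1)\) exists.

```python
# Closed-form EM limit: solve N*pi^2 + (-n1 + n4 + 2*n2)*pi - 2*n4 = 0, keep the root in (0, 1).
import numpy as np

def pi_closed_form(y1, y2, y3, y4):
    n1, n2, n4 = y1, y2 + y3, y4
    N = n1 + n2 + n4
    b = -n1 + n4 + 2 * n2                      # coefficient of pi
    disc = b**2 + 8 * N * n4                   # discriminant
    roots = ((-b + np.sqrt(disc)) / (2 * N),
             (-b - np.sqrt(disc)) / (2 * N))
    return [r for r in roots if 0 < r < 1][0]  # the admissible ("+") root

# Usage with hypothetical counts:
# pi_star = pi_closed_form(125, 18, 20, 34)
```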

Related concepts

  • EM and stationary equations: An EM fixed-point satisfies the observed-data score equations whenever the M-step produces a full maximizer; thus the EM limit is a stationary point of the observed log-likelihood.
  • Score equation / MLE: Here the score reduces to a quadratic in \(\pi\), so the MLE (and EM limit) can be obtained in closed form.
  • Root selection / boundaries: If no root lies in \((0,1)\) check boundary MLEs at \(\pi=0\) or \(\pi=1\) and inspect the log-likelihood for maxima vs minima.
  • Numerical stability: Reparametrizations (e.g., \(\pi=(1-\theta)^2\)) can help numerics, though algebra gave an explicit solution here.

Viz

  • Plot the observed-data log-likelihood \(\ell(\pi)\) on \(\pi\in[0,1]\) to visualize the stationary point and to confirm the chosen root is a global maximum.
  • Plot the quadratic polynomial \(f(\pi)=N\pi^2 +(-n_1 + n_4 + 2 n_2)\pi -2 n_4\) to see the two roots and where they lie relative to \([0,1]\).

Other points worth mentioning

  • If you run EM from different starting values and it always converges to the same root in \((0,1)\), that gives empirical confirmation the root is the global maximizer (for regular data).
  • When \(n_4=0\) the formula simplifies (the discriminant reduces) and one must check boundaries carefully — often the MLE may sit at \(\pi=0\).
  • When applying the formula to data, substitute \(n_1=y_1,\; n_2=y_2+y_3,\; n_4=y_4\); the closed-form expression then gives the EM limit directly.

Question 4

Consider paired data \((X_i,Y_i),\ i=1,\dots,n\), where the pairs are independent, the marginal distribution of the \(X_i\) is unspecified, and the conditional distribution of \(Y_i\) given \(X_i=x_i\) is

\[ Y_i\mid X_i=x_i \sim N(\alpha+\beta x_i,\ \sigma^2). \]

Show that the maximum likelihood estimators (MLEs) of \(\alpha\) and \(\beta\) coincide with the ordinary least squares estimators for the linear regression model \(Y=\alpha+\beta X+\varepsilon\).

Detailed solution

  • Idea. Because only the conditional distribution of \(Y_i\) given \(X_i\) is specified, the likelihood for \((\alpha,\beta,\sigma^2)\) is the conditional likelihood

    \[ L(\alpha,\beta,\sigma^2)=\prod_{i=1}^n f_{Y\mid X}(y_i\mid x_i) =\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(-\frac{(y_i-(\alpha+\beta x_i))^2}{2\sigma^2}\Big). \]

    Maximizing this over \(\alpha,\beta\) (for fixed \(\sigma^2\)) is equivalent to minimizing the sum of squared residuals

    \[ Q(\alpha,\beta)=\sum_{i=1}^n\big(y_i-(\alpha+\beta x_i)\big)^2. \]

    Hence the MLEs of \(\alpha,\beta\) are the least-squares estimators.

  • Step 1 — log-likelihood and equivalence to least squares

    • The log-likelihood (conditional on the observed \(x_i\)) is

      \[ \ell(\alpha,\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n\big(y_i-(\alpha+\beta x_i)\big)^2. \]

    • For fixed \(\sigma^2\), maximizing \(\ell\) w.r.t. \(\alpha,\beta\) is identical to minimizing \(Q(\alpha,\beta)\).
  • Step 2 — normal equations

    • Set partial derivatives of \(Q\) to zero:

      \[ \frac{\partial Q}{\partial \alpha} = -2\sum_{i=1}^n\big(y_i-(\alpha+\beta x_i)\big)=0, \]

      \[ \frac{\partial Q}{\partial \beta} = -2\sum_{i=1}^n x_i\big(y_i-(\alpha+\beta x_i)\big)=0. \]

    • Divide by \(-2\) and rewrite to obtain the normal equations:

      \[ n\alpha + \beta\sum_{i=1}^n x_i = \sum_{i=1}^n y_i, \]

      \[ \alpha\sum_{i=1}^n x_i + \beta\sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i. \]

  • Step 3 — solve for \(\hat\alpha,\hat\beta\)

    • Let \(\bar x=\dfrac{1}{n}\sum_i x_i\), \(\bar y=\dfrac{1}{n}\sum_i y_i\). Define

      \[ S_{xx}=\sum_{i=1}^n (x_i-\bar x)^2=\sum_i x_i^2 - n\bar x^2,\qquad S_{xy}=\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)=\sum_i x_i y_i - n\bar x\bar y. \]

    • Solving the normal equations yields the OLS formulas (and hence the MLEs):

      \[ \boxed{\displaystyle \hat\beta \;=\; \frac{S_{xy}}{S_{xx}}, \qquad \hat\alpha \;=\; \bar y - \hat\beta\,\bar x.} \]

  • Step 4 — MLE for \(\sigma^2\)

    • Plugging \(\hat\alpha,\hat\beta\) back into the log-likelihood, the MLE for \(\sigma^2\) is

      \[ \boxed{\displaystyle \hat\sigma^2_{\text{MLE}} \;=\; \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat\alpha - \hat\beta x_i\big)^2.} \]

      Note: this ML estimator uses division by \(n\); the unbiased estimator uses division by \(n-2\).

  • Step 5 — second-derivative check (concavity)

    • The conditional log-likelihood is quadratic (negative-definite in \(\alpha,\beta\) for fixed \(\sigma^2\)), so the solution above is a global maximizer for \(\alpha,\beta\).

Thus the MLEs of \(\alpha\) and \(\beta\) coincide exactly with the least-squares estimators.
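
A minimal numpy sketch of the closed-form estimates derived above; the function name and return convention are illustrative.

```python
# OLS = MLE for (alpha, beta); variance estimates RSS/n (ML) and RSS/(n-2) (unbiased).
import numpy as np

def ols_mle(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    Sxy = np.sum((x - xbar) * (y - ybar))
    beta_hat = Sxy / Sxx                              # slope
    alpha_hat = ybar - beta_hat * xbar                # intercept
    resid = y - alpha_hat - beta_hat * x
    sigma2_mle = np.sum(resid ** 2) / n               # ML estimate: divide by n
    sigma2_unbiased = np.sum(resid ** 2) / (n - 2)    # unbiased: divide by n - 2
    return alpha_hat, beta_hat, sigma2_mle, sigma2_unbiased
```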

Related concepts

  • Conditional likelihood vs joint likelihood. When the marginal distribution of \(X\) is unspecified, inference about regression parameters is done via the conditional likelihood \(f(y\mid x)\); this is equivalent to treating \(x_i\) as fixed regressors for likelihood purposes.
  • Gauss–Markov theorem. Under mean-zero, homoscedastic, uncorrelated errors, OLS estimators are the Best Linear Unbiased Estimators (BLUE). With normality they are also MLEs (hence efficient).
  • MLE vs unbiased estimation of \(\sigma^2\). ML gives \(\hat\sigma^2 = \mathrm{RSS}/n\); the unbiased estimator is \(\mathrm{RSS}/(n-2)\).
  • Fisher information / asymptotic variance. For known \(\sigma^2\), the exact variances are

    \[ \operatorname{Var}(\hat\beta)=\frac{\sigma^2}{S_{xx}},\qquad \operatorname{Var}(\hat\alpha)=\sigma^2\Big(\frac{1}{n}+\frac{\bar x^2}{S_{xx}}\Big). \]

  • Interpretation when \(X\) random. If \(X_i\) are random but independent of the error and their distribution does not depend on \(\alpha,\beta\), the conditional-likelihood approach still yields the same estimators as the joint likelihood (which factors as marginal of \(X\) times conditional of \(Y\mid X\)).

Viz

  • Scatter plot \((x_i,y_i)\) with fitted regression line \(y=\hat\alpha+\hat\beta x\) and residuals drawn down to the line — checks fit visually.
  • Residual diagnostics: plot residuals vs fitted values and a Q–Q plot of residuals to check homoscedasticity and normality assumptions.
  • Confidence bands: show pointwise confidence intervals for the regression line using the variance formulas above.
  • Profile of log-likelihood: plot \(\ell(\alpha,\beta)\) (or profile \(\ell(\beta)\) fixing \(\alpha=\bar y-\beta\bar x\)) to visualize the quadratic shape and the unique maximum.

Question 5

In the regression model

\[ Y_i \mid X_i = x_i \;\sim\; N(\alpha + \beta x_i,\ \sigma^2), \quad i=1,\dots,n, \]

we already showed that the MLEs of \(\alpha\) and \(\beta\) are the least-squares estimators. Now: What is the MLE of \(\sigma^2\)? Is this MLE unbiased?

Detailed solution

  • Step 1 — log-likelihood (conditional on \(x_i\))

    • The conditional likelihood is

      \[ L(\alpha,\beta,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i-(\alpha+\beta x_i))^2}{2\sigma^2}\right). \]

    • The log-likelihood is

      \[ \ell(\alpha,\beta,\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \]

  • Step 2 — profile likelihood for \(\sigma^2\)

    • Let \(\displaystyle \mathrm{RSS}=\sum_{i=1}^n (y_i-\hat\alpha-\hat\beta x_i)^2\) denote the residual sum of squares at the MLEs \(\hat\alpha,\hat\beta\).
    • Plugging \(\hat\alpha,\hat\beta\) into \(\ell\) and differentiating w.r.t. \(\sigma^2\) gives

      \[ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\mathrm{RSS}}{2\sigma^4}. \]

    • Setting this to zero and solving yields

      \[ -\frac{n}{2\sigma^2} + \frac{\mathrm{RSS}}{2\sigma^4} = 0 \ \Longrightarrow\ -n\sigma^2 + \mathrm{RSS}=0, \]

      hence

      \[ \boxed{\displaystyle \hat\sigma^2_{\mathrm{MLE}} = \frac{\mathrm{RSS}}{n}.} \]

  • Step 3 — unbiasedness check

    • From standard linear-regression distributional results, \(\dfrac{\mathrm{RSS}}{\sigma^2}\sim\chi^2_{\,n-2}\), so \(E[\mathrm{RSS}]=(n-2)\sigma^2\).
    • Therefore

      \[ E[\hat\sigma^2_{\mathrm{MLE}}] = E\!\left[\frac{\mathrm{RSS}}{n}\right] = \frac{n-2}{n}\,\sigma^2. \]

    • Conclusion: \(\hat\sigma^2_{\mathrm{MLE}}=\mathrm{RSS}/n\) is biased downward: its expectation is \(\tfrac{n-2}{n}\,\sigma^2\), so it underestimates \(\sigma^2\) on average.
  • Step 4 — unbiased alternative

    • The usual unbiased estimator is

      \[ \boxed{\displaystyle \tilde\sigma^2 = \frac{\mathrm{RSS}}{n-2},} \]

      which satisfies \(E[\tilde\sigma^2]=\sigma^2\).
    • Note: for a general linear model with \(p\) parameters (including intercept), the unbiased estimator is \(\mathrm{RSS}/(n-p)\) while the MLE is \(\mathrm{RSS}/n\). A small simulation check of the bias factor appears after this solution.
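
A small simulation sketch of the \((n-2)/n\) bias factor, assuming numpy is available; the sample size, true parameters and number of replications are arbitrary illustrative choices.

```python
# Monte Carlo check: E[RSS/n] is about ((n-2)/n) * sigma^2, while E[RSS/(n-2)] is about sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, sigma2 = 10, 1.0, 2.0, 4.0
x = np.linspace(0.0, 1.0, n)
mle, unbiased = [], []
for _ in range(20000):
    y = alpha + beta * x + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    rss = np.sum((y - a - b * x) ** 2)
    mle.append(rss / n)
    unbiased.append(rss / (n - 2))
print(np.mean(mle), (n - 2) / n * sigma2)   # these two numbers should be close
print(np.mean(unbiased), sigma2)            # and so should these two
```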

Related concepts

  • Bias of variance MLE. In normal models with unknown mean parameters, the ML estimator of variance is downward-biased because degrees of freedom are used to estimate the mean(s).
  • Degrees of freedom. In simple linear regression with intercept and slope, \(p=2\) parameters are estimated, leaving \(n-2\) residual degrees of freedom.
  • Distributional result. \(\dfrac{\mathrm{RSS}}{\sigma^2}\sim\chi^2_{n-2}\), which yields both expectation and confidence-interval constructions for \(\sigma^2\).
  • Asymptotics. \(\hat\sigma^2_{\mathrm{MLE}}\) is consistent and asymptotically unbiased; the finite-sample bias vanishes as \(n\to\infty\).

Viz

  • Plot \(\hat\sigma^2_{\mathrm{MLE}}=\mathrm{RSS}/n\) vs \(\tilde\sigma^2=\mathrm{RSS}/(n-2)\) as \(n\) varies: for small \(n\) the ML estimator is noticeably smaller; for large \(n\) they converge.
  • Plot the sampling distribution of \(\dfrac{n\hat\sigma^2}{\sigma^2}=\dfrac{\mathrm{RSS}}{\sigma^2}\) which follows \(\chi^2_{n-2}\); compare to the scaled unbiased version.

Other worthwhile points

  • Statistical software typically reports the unbiased estimator \(\mathrm{RSS}/(n-p)\) for inference (t-tests, CI), because those procedures rely on the exact finite-sample \(\chi^2/\)Student-\(t\) calibrations.
  • For likelihood-based comparisons (likelihood-ratio tests, AIC), the ML estimator \(\mathrm{RSS}/n\) often appears naturally because it maximizes the likelihood.
  • Both estimators are consistent; the difference matters mainly in small samples and when an unbiased variance estimate is required for exact inference.

Question 6

Consider paired data \((X_i,Y_i),\ i=1,\dots,n\), where the pairs are independent, the marginal distribution of the \(X_i\) is unspecified, and the conditional density of \(Y_i\) given \(X_i\) is

\[ f(y\mid X_i,\lambda,\alpha,\beta)=\frac{\lambda}{2}\exp\!\big(-\lambda|y-\alpha-\beta X_i|\big),\qquad -\infty<y<\infty,\ \lambda>0, \]

(i.e. a Laplace / double-exponential conditional distribution with scale parameter \(\lambda^{-1}\)). Show that the maximum likelihood estimators (MLEs) of \(\alpha\) and \(\beta\) minimize

\[ \sum_{i=1}^n |Y_i-\alpha-\beta X_i|. \]

Detailed solution

  • Step 1 — conditional likelihood (treat \(x_i\) as given)

    The conditional likelihood for \((\lambda,\alpha,\beta)\) is

    \[ L(\lambda,\alpha,\beta) =\prod_{i=1}^n \frac{\lambda}{2}\exp\!\big(-\lambda|y_i-\alpha-\beta x_i|\big) =\Big(\frac{\lambda}{2}\Big)^n \exp\!\Big(-\lambda\sum_{i=1}^n|y_i-\alpha-\beta x_i|\Big). \]

  • Step 2 — log-likelihood

    The log-likelihood is

    \[ \ell(\lambda,\alpha,\beta) = n\log\lambda - n\log 2 \;-\; \lambda\sum_{i=1}^n|y_i-\alpha-\beta x_i|. \]

  • Step 3 — profile out \(\lambda\), then maximize w.r.t. \(\alpha,\beta\)

    • For fixed \((\alpha,\beta)\) the log-likelihood as a function of \(\lambda\) is

      \[ \ell(\lambda\mid\alpha,\beta)=n\log\lambda - \lambda S(\alpha,\beta) + \text{constant}, \qquad S(\alpha,\beta)=\sum_{i=1}^n|y_i-\alpha-\beta x_i|. \]

    • Differentiating w.r.t. \(\lambda\) and equating to zero gives

      \[ \frac{n}{\hat\lambda}-S(\alpha,\beta)=0\quad\Rightarrow\quad \hat\lambda(\alpha,\beta)=\frac{n}{S(\alpha,\beta)}. \]

    • Plugging \(\hat\lambda\) back yields the profile log-likelihood (up to constants)

      \[ \ell_{\text{prof}}(\alpha,\beta) = n\log\!\Big(\frac{n}{S(\alpha,\beta)}\Big) - n\log 2 - n = -\,n\log S(\alpha,\beta) + \text{constant}. \]

    • Since \(-n\log S(\alpha,\beta)\) is a strictly decreasing function of \(S(\alpha,\beta)>0\), maximizing \(\ell_{\text{prof}}\) over \((\alpha,\beta)\) is equivalent to minimizing

      \[ S(\alpha,\beta)=\sum_{i=1}^n |y_i-\alpha-\beta x_i|. \]

  • Step 4 — conclusion

    Therefore the MLEs \((\hat\alpha,\hat\beta)\) are exactly the minimizers of the sum of absolute residuals:

    \[ (\hat\alpha,\hat\beta) = \arg\min_{(\alpha,\beta)}\sum_{i=1}^n |Y_i-\alpha-\beta X_i|. \]

    These are the least absolute deviations (LAD) or L1 regression estimators (equivalently median regression for the conditional median).

  • Optional — subgradient (first-order) conditions

    Let \(r_i(\alpha,\beta)=y_i-\alpha-\beta x_i\) and \(\operatorname{sgn}(u)\) be the sign function with \(\operatorname{sgn}(0)\in[-1,1]\). A minimizer \((\hat\alpha,\hat\beta)\) satisfies the subgradient equations

    \[ \sum_{i=1}^n \operatorname{sgn}\big(r_i(\hat\alpha,\hat\beta)\big)=0,\qquad \sum_{i=1}^n x_i\,\operatorname{sgn}\big(r_i(\hat\alpha,\hat\beta)\big)=0, \]

    where the equalities are interpreted in the subgradient sense if some residuals equal zero. These are the LAD analogues of the normal equations; a linear-programming sketch for computing the LAD fit follows this solution.

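
A minimal sketch of the LAD fit via the linear-programming route (write each residual as the difference of two nonnegative parts), assuming scipy is available; the function name and interface are illustrative, and this is only one of the computational options mentioned below, not a method prescribed in class.

```python
# LAD regression as a linear programme: minimise sum(u_i + v_i) subject to
# alpha + beta * x_i + u_i - v_i = y_i,  u_i >= 0,  v_i >= 0.
import numpy as np
from scipy.optimize import linprog

def lad_fit(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    # decision vector: [alpha, beta, u_1..u_n, v_1..v_n]
    c = np.concatenate([[0.0, 0.0], np.ones(n), np.ones(n)])
    A_eq = np.hstack([np.ones((n, 1)), x[:, None], np.eye(n), -np.eye(n)])
    b_eq = y
    bounds = [(None, None), (None, None)] + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    alpha_hat, beta_hat = res.x[0], res.x[1]
    lam_hat = n / np.sum(np.abs(y - alpha_hat - beta_hat * x))   # MLE of lambda at the fit
    return alpha_hat, beta_hat, lam_hat
```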

Related concepts

  • Least absolute deviations (LAD) / L1 regression: Minimizes \(\sum|r_i|\); robust to outliers in \(Y\).
  • Median / quantile regression: LAD is the special case of quantile regression at the 0.5 quantile (conditional median).
  • MLE equivalence: The Laplace conditional density makes the negative log-likelihood proportional to the L1 loss, hence MLE = LAD.
  • Subgradient vs differentiability: Because \(|\cdot|\) is nondifferentiable at 0, optimality uses subgradients rather than ordinary gradients.
  • Uniqueness: The LAD minimizer may not be unique in degenerate data configurations; with \(x_i\) in general position uniqueness typically holds.

Viz

  • Scatter plot \((x_i,y_i)\) with both OLS and LAD fit lines to demonstrate robustness: LAD resists vertical outliers.
  • Surface plot of \(S(\alpha,\beta)\) over a grid to visualize kinks where residual signs change; minima occur at intersections of these kinked regions.
  • Sign-balance plot at the LAD fit: the residual signs should balance so the sum of signs (and the weighted sum by \(x_i\)) is approximately zero.
  • Contamination experiment: simulate outliers and compare OLS vs LAD fits to show trade-offs between efficiency and robustness.

Other worthwhile points

  • Computation: LAD can be computed by linear programming (introduce positive/negative parts for residuals), by specialized algorithms (simplex, interior-point), or by IRLS approximations. Many packages implement quantile regression (e.g. rq in R) that computes LAD.
  • Asymptotics: LAD estimators are consistent and asymptotically normal under mild conditions; asymptotic variance depends on the error density at zero (for Laplace errors one gets closed-form expressions).
  • MLE for \(\lambda\): Given \((\hat\alpha,\hat\beta)\), the MLE for \(\lambda\) is

    \[ \hat\lambda=\frac{n}{\sum_{i=1}^n|y_i-\hat\alpha-\hat\beta x_i|}. \]

  • Robustness trade-off: LAD has higher breakdown point and bounded influence compared with OLS, but is typically less efficient under Gaussian errors. Choose method with the error distribution and outlier risk in mind.