Statistical Methods III: Assignment Week-3

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%


Question 1

We have counts \(Y_1,\dots,Y_4\) (multinomial given total \(n=\sum_i Y_i\)) with cell probabilities

\[ \pi_1=\tfrac12+\tfrac{\pi}{4},\qquad \pi_2=\pi_3=\tfrac{1-\pi}{4},\qquad \pi_4=\tfrac{\pi}{4}, \] where \(\pi\in(0,1)\) is unknown. Reparametrise \(\pi=(1-\theta)^2\) and derive the Fisher–scoring iterative update for the MLE of \(\pi\) (i.e. the update for \(\theta\)).

Solution — Detailed Answer

Step 0 — substitute the reparametrisation

Put \(\pi=(1-\theta)^2\) and define

\(a:=\dfrac{1-\theta}{2}\quad(\Rightarrow a^2=\dfrac{(1-\theta)^2}{4}).\)

Then the cell probabilities as functions of \(\theta\) become (useful algebraic forms)

  • \(p_1(\theta)=\pi_1=\tfrac12+\tfrac{(1-\theta)^2}{4}=\tfrac12 + a^2,\)
  • \(p_2(\theta)=p_3(\theta)=\tfrac{1-(1-\theta)^2}{4}=\tfrac14 - a^2,\)
  • \(p_4(\theta)=\tfrac{(1-\theta)^2}{4}=a^2.\)

Step 1 — log-likelihood

For total \(n=\sum_i Y_i\),

\(\ell(\theta)=\sum_{i=1}^4 Y_i\log p_i(\theta)+\text{const}.\)

Step 2 — first derivative (score)

Differentiate \(p_i\) w.r.t. \(\theta\). Using \(a=(1-\theta)/2\) and \(da/d\theta=-1/2\),

  • \(\dfrac{dp_1}{d\theta}=\dfrac{dp_4}{d\theta}=-a,\)
  • \(\dfrac{dp_2}{d\theta}=\dfrac{dp_3}{d\theta}=+a.\)

Hence the score \(S(\theta)=\dfrac{d\ell}{d\theta}\) is

\[ \begin{aligned} S(\theta) &=\sum_{i=1}^4 Y_i\frac{p_i'(\theta)}{p_i(\theta)} = -a\frac{Y_1}{p_1}+a\frac{Y_2}{p_2}+a\frac{Y_3}{p_3}-a\frac{Y_4}{p_4}\\ &= a\Big(\frac{Y_2+Y_3}{p_2}-\frac{Y_1}{p_1}-\frac{Y_4}{p_4}\Big). \end{aligned} \]

Step 3 — Fisher (expected) information

For a multinomial with probabilities \(p_i(\theta)\),

\(I(\theta)=n\sum_{i=1}^4\frac{(p_i'(\theta))^2}{p_i(\theta)}.\)

Because \((p_i')^2=a^2\) for every \(i\),

\(I(\theta)=n\,a^2\Big(\frac{1}{p_1}+\frac{2}{p_2}+\frac{1}{p_4}\Big).\)

Step 4 — Fisher scoring update (final iterative step)

Fisher scoring updates by

\(\displaystyle \theta^{(t+1)}=\theta^{(t)}+\frac{S(\theta^{(t)})}{I(\theta^{(t)})}.\)

Substitute \(S\) and \(I\). Cancelling common factors gives a convenient expression in terms of proportions \(r_i=Y_i/n\):

\[ \boxed{\,\theta^{(t+1)}=\theta^{(t)}+\frac{1}{a}\cdot \frac{\dfrac{r_2+r_3}{p_2}-\dfrac{r_1}{p_1}-\dfrac{r_4}{p_4}} {\dfrac{1}{p_1}+\dfrac{2}{p_2}+\dfrac{1}{p_4}}\,,} \]

where all \(p_j\) and \(a\) are evaluated at \(\theta^{(t)}\):

  • \(a=\dfrac{1-\theta^{(t)}}{2},\)
  • \(p_1=\tfrac12+a^2,\quad p_2=\tfrac14-a^2,\quad p_4=a^2,\)
  • \(r_i=Y_i/n\). (Equivalently you may write the update using counts \(Y_i\); the \(n\) cancels.)

This boxed formula is the Fisher–scoring iterative step for \(\theta\). After convergence \(\hat\theta\) gives \(\hat\pi=(1-\hat\theta)^2\).
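
To make the boxed update concrete, here is a minimal Python sketch of the iteration, assuming the four counts are supplied as a vector \(y=(Y_1,Y_2,Y_3,Y_4)\); the function name, default starting value, tolerance and the hypothetical counts in the usage comment are illustrative choices, not part of the assignment. No safeguards (damping, projection back into the admissible range) are included; see the remarks below.

```python
# Fisher scoring for theta, where pi = (1 - theta)^2 and the update is
# theta <- theta + S(theta) / I(theta), as derived above.
import numpy as np

def fisher_scoring_theta(y, theta0=0.5, tol=1e-8, max_iter=100):
    y = np.asarray(y, dtype=float)
    n = y.sum()
    r = y / n                                   # observed proportions r_i = Y_i / n
    theta = theta0                              # interior start: theta in (0, 2), theta != 1
    for _ in range(max_iter):
        a = (1.0 - theta) / 2.0
        p1, p2, p4 = 0.5 + a**2, 0.25 - a**2, a**2                  # cell probabilities (p2 = p3)
        score_over_na = (r[1] + r[2]) / p2 - r[0] / p1 - r[3] / p4  # S(theta) / (n a)
        info_over_na2 = 1.0 / p1 + 2.0 / p2 + 1.0 / p4              # I(theta) / (n a^2)
        theta_new = theta + score_over_na / (a * info_over_na2)     # theta + S / I
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Usage with hypothetical counts:
# theta_hat = fisher_scoring_theta([125, 18, 20, 34])
# pi_hat = (1 - theta_hat) ** 2
```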

Related Concepts / Remarks

Short remarks and important connections.

  • Why Fisher scoring? Fisher scoring replaces the observed negative second derivative by its expectation (the Fisher information). For generalized multinomial models, it often gives a stable and efficient algorithm; it is Newton–Raphson using expected information.
  • Score / information simplification: The algebra simplified nicely because all \((p_i')^2\) equal the same value \(a^2\). That yields compact expressions for \(S\) and \(I\).
  • Parameter domain & numerical caution: For valid probabilities we need \(p_2=\tfrac14-a^2>0\), i.e. \(a^2<1/4\) ⇔ \(|1-\theta|<1\) ⇔ \(\theta\in(0,2)\); we also need \(\theta\neq 1\) so that \(p_4=a^2>0\). Practically, choose an initial \(\theta^{(0)}\) inside \((0,2)\setminus\{1\}\) to keep all \(p_i>0\) (and avoid division by zero). If an iteration proposes a \(\theta\) outside the admissible range, project it back (or use a damped update).
  • Initial guess (practical): A reasonable start is to match \(p_4\) with the observed proportion \(r_4=Y_4/n\): \(\pi\approx 4r_4\), so \(\theta^{(0)}=1-\sqrt{\max(0,\min(1,4r_4))}\). If \(4r_4\ge 1\) this clamps to \(\theta^{(0)}=0\), which makes \(p_2=0\); use a slightly interior starting value instead (e.g. \(\theta^{(0)}=0.05\)).
  • Stopping rule: Use relative change \(|\theta^{(t+1)}-\theta^{(t)}|<\varepsilon\) (e.g. \(10^{-8}\)) or change in log-likelihood, or max iterations.
  • Observed vs expected information: If you prefer Newton–Raphson (observed information), compute the second derivative \(d^2\ell/d\theta^2\) and use \(\theta^{(t+1)}=\theta^{(t)}- \big(\ell'(\theta^{(t)})/\ell''(\theta^{(t)})\big)\). Fisher scoring often yields more stable steps in multinomial problems.
  • Asymptotic variance: At convergence, the asymptotic variance of \(\hat\theta\) is approximately \(I(\hat\theta)^{-1}\), with \(I(\hat\theta)=n a^2\big(1/p_1+2/p_2+1/p_4\big)\) evaluated at \(\hat\theta\). The delta method then gives the variance of \(\hat\pi=(1-\hat\theta)^2\); a worked expression follows this list.
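
For concreteness, the delta-method step mentioned in the last remark can be written out explicitly (all quantities evaluated at \(\hat\theta\), and using \(1-\hat\theta=2a\)):

\[ \frac{d\pi}{d\theta}=-2(1-\theta) \quad\Longrightarrow\quad \operatorname{Var}(\hat\pi)\;\approx\;\big(2(1-\hat\theta)\big)^2\,I(\hat\theta)^{-1} \;=\;\frac{16\,a^2}{n\,a^2\big(1/p_1+2/p_2+1/p_4\big)} \;=\;\frac{16}{n\big(1/p_1+2/p_2+1/p_4\big)}. \]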

Viz (Visualizations)

Practical plots to inspect and debug the algorithm numerically.

  • Plot the log-likelihood \(\ell(\theta)\) over a grid in the admissible \(\theta\)-interval (say \([0.01,1.99]\)) to see where the peak lies and whether it's sharp or flat.
  • Plot the iteration path \(\theta^{(t)}\) vs iteration \(t\) to check monotone convergence and whether damping is needed.
  • Optionally plot the score \(S(\theta)\) and expected information \(I(\theta)\) over the grid to visualise the step sizes implied by Fisher scoring (a minimal plotting sketch follows this list).
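
A minimal plotting sketch for these diagnostics, assuming numpy and matplotlib are available; the counts and grid limits are illustrative. The grid stays inside \((0,1)\), which already covers every value of \(\pi\) because \(\pi=(1-\theta)^2\) is symmetric about \(\theta=1\) (where \(p_4=0\)).

```python
# Plot the log-likelihood ell(theta) over an interior grid of theta values.
import numpy as np
import matplotlib.pyplot as plt

y = np.array([125.0, 18.0, 20.0, 34.0])        # hypothetical counts
theta = np.linspace(0.01, 0.99, 400)           # interior grid: keeps all p_i > 0
a = (1 - theta) / 2
p1, p2, p4 = 0.5 + a**2, 0.25 - a**2, a**2
loglik = y[0] * np.log(p1) + (y[1] + y[2]) * np.log(p2) + y[3] * np.log(p4)

plt.plot(theta, loglik)
plt.xlabel(r"$\theta$")
plt.ylabel("log-likelihood (up to a constant)")
plt.show()
```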

Other Relevant Points

Additional practical advice, edge-cases, and numerical tips.

  • If any observed cell count is zero, avoid dividing by zero; start from interior values and/or add a tiny pseudo-count (e.g. +0.5) if needed for numerical stability.
  • Fisher scoring will usually converge fast for this one-parameter problem; if oscillation occurs, use step-halving/damping (e.g. \(\theta^{(t+1)}\leftarrow \theta^{(t)}+\gamma\cdot\)update with \(\gamma\in(0,1)\)).
  • After obtaining \(\hat\theta\), transform back to \(\hat\pi=(1-\hat\theta)^2\) and report standard errors via the observed Fisher information or the delta method.
  • Implementation note: compute \(p_2=\tfrac14-a^2\) carefully to avoid catastrophic cancellation when \(a^2\) is close to \(\tfrac14\).

Question 2

A complete-data formulation for the problem in Q1 was given in class so that the EM algorithm can be used. Starting from that, obtain a single iterative step (merge the E-step and M-step) which updates the estimate of \(\pi\).

Solution — Detailed Answer

Setup / notation

Write the cell probabilities as linear functions of \(\pi\):

\(p_i(\pi)=a_i+b_i\pi,\qquad i=1,\dots,4,\)

  • \(a_1=\tfrac12,\; b_1=\tfrac14\)
  • \(a_2=\tfrac14,\; b_2=-\tfrac14\)
  • \(a_3=\tfrac14,\; b_3=-\tfrac14\)
  • \(a_4=0,\; b_4=\tfrac14\)

Let \(\widetilde Y_i^{(t)}\) denote the expected complete counts computed at the E-step under \(\pi^{(t)}\). Write \(\tilde r_i^{(t)}=\widetilde Y_i^{(t)}/n\) if you prefer proportions.

Step 1 — Q-function (expected complete log-likelihood)

The EM M-step maximises

\(Q(\pi\mid\pi^{(t)})=\sum_{i=1}^4 \widetilde Y_i^{(t)}\log\big(a_i+b_i\pi\big) + \text{const (w.r.t. }\pi).\)

Step 2 — derivative of Q (score for Q)

Differentiate to get

\(\dfrac{\partial Q}{\partial\pi}(\pi\mid\pi^{(t)})=\sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i}{a_i+b_i\pi}.\)

Setting \(\partial Q/\partial\pi=0\) gives the exact M-step, but the resulting equation is implicit in \(\pi\). A practical and standard merged one-line update is obtained by taking a single Newton step for this equation using the current \(\widetilde Y_i^{(t)}\).

Step 3 — Newton (one-step) update on Q (single merged iteration)

The second derivative is

\(\dfrac{\partial^2 Q}{\partial\pi^2}(\pi\mid\pi^{(t)})=-\sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i^2}{(a_i+b_i\pi)^2}.\)

Therefore a Newton update applied to \(Q\) gives the single-step merged iteration:

\[ \boxed{\; \pi^{(t+1)} \;=\; \pi^{(t)} \;+\; \frac{\displaystyle \sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i}{a_i+b_i\pi^{(t)}}} {\displaystyle \sum_{i=1}^4 \widetilde Y_i^{(t)}\dfrac{b_i^2}{(a_i+b_i\pi^{(t)})^2}} \; }. \]

Step 4 — explicit specialised form (using \(b_i=\pm\tfrac14\))

Plugging in the values \(a_i,b_i\) and pulling out the common constants (the \(b_i=\pm\tfrac14\) contribute a factor \(\tfrac14\) in the numerator and \(\tfrac1{16}\) in the denominator, leaving an overall factor of 4) yields the equivalent explicit form

\[ \boxed{\; \pi^{(t+1)}=\pi^{(t)} + 4\,\dfrac{\displaystyle \frac{\widetilde Y_1^{(t)}}{p_1(\pi^{(t)})}-\frac{\widetilde Y_2^{(t)}}{p_2(\pi^{(t)})}-\frac{\widetilde Y_3^{(t)}}{p_3(\pi^{(t)})}+\frac{\widetilde Y_4^{(t)}}{p_4(\pi^{(t)})} } {\displaystyle \frac{\widetilde Y_1^{(t)}}{p_1(\pi^{(t)})^2}+\frac{\widetilde Y_2^{(t)}}{p_2(\pi^{(t)})^2} +\frac{\widetilde Y_3^{(t)}}{p_3(\pi^{(t)})^2}+\frac{\widetilde Y_4^{(t)}}{p_4(\pi^{(t)})^2} }\; }, \]

where \(p_i(\pi^{(t)})=a_i+b_i\pi^{(t)}\). If data are complete then \(\widetilde Y_i^{(t)}=Y_i\).

How this merges E- and M-steps (practical summary)

  • E-step: compute \(\widetilde Y_i^{(t)}=\mathbb{E}[Y_i^{\text{(complete)}}\mid\text{observed},\pi^{(t)}]\).
  • M-step (merged): perform the single Newton update for \(\pi\) given above (one explicit line).
  • If you insist on a pure EM M-step you would need to solve \(\sum_i \widetilde Y_i^{(t)}\dfrac{b_i}{a_i+b_i\pi^{(t+1)}}=0\) exactly at each iteration, which is implicit and inconvenient; the one-step Newton update is the common practical remedy (see the sketch after this list).
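
A minimal sketch of the merged iteration, assuming the E-step quantities \(\widetilde Y_i^{(t)}\) are supplied (with the fully observed four-cell data they are simply the observed \(Y_i\)); the clamping into \((0,1)\) and all names are illustrative choices, not part of the classroom formulation.

```python
# One Newton step on Q per iteration: pi <- pi + Q'(pi) / (-Q''(pi)), as in the boxed update.
import numpy as np

A = np.array([0.5, 0.25, 0.25, 0.0])        # a_i
B = np.array([0.25, -0.25, -0.25, 0.25])    # b_i

def merged_em_newton(ytilde, pi0=0.5, tol=1e-10, max_iter=200):
    ytilde = np.asarray(ytilde, dtype=float)
    pi = pi0
    for _ in range(max_iter):
        p = A + B * pi                            # cell probabilities p_i(pi) = a_i + b_i * pi
        num = np.sum(ytilde * B / p)              # dQ/dpi
        den = np.sum(ytilde * B**2 / p**2)        # -d^2Q/dpi^2 (positive)
        pi_new = pi + num / den                   # single merged Newton step
        pi_new = min(max(pi_new, 1e-8), 1 - 1e-8) # keep the iterate inside (0, 1)
        if abs(pi_new - pi) < tol:
            return pi_new
        pi = pi_new
    return pi
```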

Related Concepts

  • Newton-within-EM: using one Newton (or Fisher) step in the M-step when a closed-form maximiser is unavailable; often accelerates convergence while keeping the EM structure.
  • Fisher-scoring variant: replace the Hessian in the denominator by its expectation (with \(\widetilde Y_i^{(t)}\) weights) to obtain a Fisher-scoring merged iteration.
  • Feasible region: ensure \(\pi^{(t+1)}\in(0,1)\) and \(p_i(\pi^{(t+1)})>0\). If the update leaves the feasible set, apply damping (step-halving) or project back into \((0,1)\).

Viz (Diagnostics)

  • Plot \(Q(\pi\mid\pi^{(t)})\) vs \(\pi\) for several \(\pi^{(t)}\) to check whether the Newton step moves uphill.
  • Trace \(\pi^{(t)}\) vs iteration \(t\) and optionally plot the observed log-likelihood to verify monotone behaviour (or detect the need for damping).
  • Plot the evolution of \(\widetilde Y_i^{(t)}\) if there is missingness — these typically stabilise quickly and reveal which components drive updates.

Other Relevant Points / Implementation tips

  • Use interior starting values for \(\pi^{(0)}\) (e.g. method-of-moments or \(\pi^{(0)}=\min(0.99,\max(0.01,4r_4))\)).
  • If any \(\widetilde Y_i^{(t)}\) or observed \(Y_i\) is zero, avoid division-by-zero by adding a tiny pseudo-count (e.g. +0.5) or using safe-guards in code.
  • Control step length (damping) when necessary: set \(\pi^{(t+1)}=\pi^{(t)}+\gamma\Delta\pi\) with \(\gamma\in(0,1]\) chosen to keep \(Q\) (or the observed log-likelihood) non-decreasing.
  • Check convergence by \(|\pi^{(t+1)}-\pi^{(t)}|<\varepsilon\) or small change in observed log-likelihood; report standard errors using the (observed or expected) information at the final iterate and the delta method if transforming parameters.

Question 3

Let \(Y_1,Y_2,Y_3,Y_4\) be counts in four genetic categories. Conditionally on the total \(N=Y_1+Y_2+Y_3+Y_4\) the vector has a multinomial distribution with cell probabilities

\[ p_1(\pi)=\tfrac12+\tfrac{\pi}{4},\qquad p_2(\pi)=p_3(\pi)=\tfrac{1-\pi}{4},\qquad p_4(\pi)=\tfrac{\pi}{4}, \]

where the unknown parameter \(\pi\in(0,1)\). Viewing this as an incomplete-data / latent-allocation problem, the EM algorithm applies. The question: obtain a closed form expression for the value of \(\pi\) attained at the convergence of the EM algorithm.

Detailed solution — overview

  • Introduce notation \(n_1=y_1,\; n_2=y_2+y_3,\; n_4=y_4\) and \(N=n_1+n_2+n_4\).
  • Write the (observed) log-likelihood \(\ell(\pi)\), differentiate (score), clear denominators to obtain a quadratic, and solve it.
  • Choose the root lying in \((0,1)\) — that is the EM fixed-point (the stationary solution of the observed-data likelihood).

Step 1 — write the (observed) log-likelihood

  • Up to an additive constant, the log-likelihood is

    \[ \ell(\pi)=y_1\log\!\Big(\tfrac12+\tfrac{\pi}{4}\Big)+(y_2+y_3)\log\!\Big(\tfrac{1-\pi}{4}\Big)+y_4\log\!\Big(\tfrac{\pi}{4}\Big). \]

  • With \(n_1=y_1,\; n_2=y_2+y_3,\; n_4=y_4\) and \(N=n_1+n_2+n_4\), substitute to simplify notation.

Step 2 — stationary equation (score = 0)

  • Differentiate \(\ell(\pi)\) w.r.t. \(\pi\). Useful derivatives:
    • \(\dfrac{d}{d\pi}\log\!\Big(\tfrac12+\tfrac{\pi}{4}\Big)=\dfrac{1}{2+\pi}\).
    • \(\dfrac{d}{d\pi}\log\!\Big(\tfrac{1-\pi}{4}\Big)=-\dfrac{1}{1-\pi}\).
    • \(\dfrac{d}{d\pi}\log\!\Big(\tfrac{\pi}{4}\Big)=\dfrac{1}{\pi}\).
  • Thus the score equation is

    \[ \frac{n_1}{2+\pi}-\frac{n_2}{1-\pi}+\frac{n_4}{\pi}=0. \]

Step 3 — clear denominators to obtain a polynomial

  • Multiply the score equation by \(\pi(1-\pi)(2+\pi)\) to remove denominators:

    \[ n_1\pi(1-\pi)+n_4(1-\pi)(2+\pi)-n_2\pi(2+\pi)=0. \]

  • Expand and collect like powers:
    • \(n_1\pi(1-\pi)=n_1(\pi-\pi^2)\).
    • \((1-\pi)(2+\pi)=2-\pi-\pi^2\), so \(n_4(1-\pi)(2+\pi)=n_4(2-\pi-\pi^2)\).
    • \(n_2\pi(2+\pi)=n_2(2\pi+\pi^2)\).
  • Collecting coefficients gives the quadratic:

    \[ - N \pi^2 + (n_1 - n_4 - 2 n_2)\pi + 2 n_4 = 0, \]

    and multiplying by \(-1\) yields the standard form

    \[ N\pi^2 + (-n_1 + n_4 + 2 n_2)\pi - 2 n_4 = 0. \]

Step 4 — solve the quadratic

  • Apply the quadratic formula. The solutions are

    \[ \boxed{ \displaystyle \pi \;=\; \frac{\,n_1 - n_4 - 2 n_2 \;\pm\; \sqrt{( -n_1 + n_4 + 2 n_2)^2 + 8 N n_4}\,}{2N} }. \]

    (The discriminant simplifies as \(( -n_1 + n_4 + 2 n_2)^2 - 4\cdot N\cdot(-2 n_4) = ( -n_1 + n_4 + 2 n_2)^2 + 8 N n_4\).) A symbolic check of this algebra is sketched below.
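
If sympy is available, the denominator-clearing algebra can be checked symbolically; the short script below is only a verification sketch (variable names are illustrative).

```python
# Verify that clearing denominators in the score equation gives the stated quadratic.
import sympy as sp

pi, n1, n2, n4 = sp.symbols("pi n1 n2 n4", positive=True)
cleared = n1*pi*(1 - pi) + n4*(1 - pi)*(2 + pi) - n2*pi*(2 + pi)
N = n1 + n2 + n4
target = -N*pi**2 + (n1 - n4 - 2*n2)*pi + 2*n4
print(sp.simplify(sp.expand(cleared) - target))   # prints 0, confirming the quadratic
```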

Step 5 — choose the correct root (EM convergence)

  • The EM algorithm converges to a stationary point of the observed-data likelihood; the valid parameter must lie in \((0,1)\).
  • Thus \(\pi_{\text{EM}}\) is the root above that lies in \((0,1)\). When \(n_4>0\) the constant term \(-2n_4\) of the quadratic is negative, so the two roots have opposite signs and the “\(+\)” branch gives the unique positive root; if in addition \(n_2>0\), that root lies in \((0,1)\) because the quadratic is negative at \(\pi=0\) and equals \(3n_2>0\) at \(\pi=1\). In short, choose the sign that gives \(0<\pi<1\).

Conclusion / boxed result (explicit):

\[ \pi_{\star} \;=\; \frac{n_1 - n_4 - 2 n_2 \;\pm\; \sqrt{( -n_1 + n_4 + 2 n_2)^2 + 8 N n_4}}{2N}, \]

with the sign chosen so that \(0<\pi_\star<1\). This closed form is the stationary point of the observed-data log-likelihood that any EM fixed point must satisfy, and hence the value to which the EM iterations converge.
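
A minimal numeric sketch of this closed-form result, with illustrative names and hypothetical counts; it assumes \(n_4>0\) and \(n_2>0\) so that a root in \((0,1)\) exists.

```python
# Closed-form EM limit: solve N*pi^2 + (-n1 + n4 + 2*n2)*pi - 2*n4 = 0, keep the root in (0, 1).
import numpy as np

def pi_closed_form(y1, y2, y3, y4):
    n1, n2, n4 = y1, y2 + y3, y4
    N = n1 + n2 + n4
    b = -n1 + n4 + 2 * n2                      # coefficient of pi
    disc = b**2 + 8 * N * n4                   # discriminant
    roots = ((-b + np.sqrt(disc)) / (2 * N),
             (-b - np.sqrt(disc)) / (2 * N))
    return [r for r in roots if 0 < r < 1][0]  # the admissible ("+") root

# Usage with hypothetical counts:
# pi_star = pi_closed_form(125, 18, 20, 34)
```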

Related concepts

  • EM and stationary equations: An EM fixed-point satisfies the observed-data score equations whenever the M-step produces a full maximizer; thus the EM limit is a stationary point of the observed log-likelihood.
  • Score equation / MLE: Here the score reduces to a quadratic in \(\pi\), so the MLE (and EM limit) can be obtained in closed form.
  • Root selection / boundaries: If no root lies in \((0,1)\) check boundary MLEs at \(\pi=0\) or \(\pi=1\) and inspect the log-likelihood for maxima vs minima.
  • Numerical stability: Reparametrizations (e.g., \(\pi=(1-\theta)^2\)) can help numerics, though algebra gave an explicit solution here.

Viz

  • Plot the observed-data log-likelihood \(\ell(\pi)\) on \(\pi\in[0,1]\) to visualize the stationary point and to confirm the chosen root is a global maximum.
  • Plot the quadratic polynomial \(f(\pi)=N\pi^2 +(-n_1 + n_4 + 2 n_2)\pi -2 n_4\) to see the two roots and where they lie relative to \([0,1]\).

Other points worth mentioning

  • If you run EM from different starting values and it always converges to the same root in \((0,1)\), that gives empirical confirmation the root is the global maximizer (for regular data).
  • When \(n_4=0\) the formula simplifies (the discriminant reduces) and one must check boundaries carefully — often the MLE may sit at \(\pi=0\).
  • When applying the formula to data, substitute \(n_1=y_1,\; n_2=y_2+y_3,\; n_4=y_4\); the closed-form expression then gives the EM limit directly.

Question 4

Consider paired data \((X_i,Y_i),\ i=1,\dots,n\), where the pairs are independent, the marginal distribution of the \(X_i\) is unspecified, and the conditional distribution of \(Y_i\) given \(X_i=x_i\) is

\[ Y_i\mid X_i=x_i \sim N(\alpha+\beta x_i,\ \sigma^2). \]

Show that the maximum likelihood estimators (MLEs) of \(\alpha\) and \(\beta\) coincide with the ordinary least squares estimators for the linear regression model \(Y=\alpha+\beta X+\varepsilon\).

Detailed solution

  • Idea. Because only the conditional distribution of \(Y_i\) given \(X_i\) is specified, the likelihood for \((\alpha,\beta,\sigma^2)\) is the conditional likelihood

    \[ L(\alpha,\beta,\sigma^2)=\prod_{i=1}^n f_{Y\mid X}(y_i\mid x_i) =\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(-\frac{(y_i-(\alpha+\beta x_i))^2}{2\sigma^2}\Big). \]

    Maximizing this over \(\alpha,\beta\) (for fixed \(\sigma^2\)) is equivalent to minimizing the sum of squared residuals

    \[ Q(\alpha,\beta)=\sum_{i=1}^n\big(y_i-(\alpha+\beta x_i)\big)^2. \]

    Hence the MLEs of \(\alpha,\beta\) are the least-squares estimators.

  • Step 1 — log-likelihood and equivalence to least squares

    • The log-likelihood (conditional on the observed \(x_i\)) is

      \[ \ell(\alpha,\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n\big(y_i-(\alpha+\beta x_i)\big)^2. \]

    • For fixed \(\sigma^2\), maximizing \(\ell\) w.r.t. \(\alpha,\beta\) is identical to minimizing \(Q(\alpha,\beta)\).
  • Step 2 — normal equations

    • Set partial derivatives of \(Q\) to zero:

      \[ \frac{\partial Q}{\partial \alpha} = -2\sum_{i=1}^n\big(y_i-(\alpha+\beta x_i)\big)=0, \]

      \[ \frac{\partial Q}{\partial \beta} = -2\sum_{i=1}^n x_i\big(y_i-(\alpha+\beta x_i)\big)=0. \]

    • Divide by \(-2\) and rewrite to obtain the normal equations:

      \[ n\alpha + \beta\sum_{i=1}^n x_i = \sum_{i=1}^n y_i, \]

      \[ \alpha\sum_{i=1}^n x_i + \beta\sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i. \]

  • Step 3 — solve for \(\hat\alpha,\hat\beta\)

    • Let \(\bar x=\dfrac{1}{n}\sum_i x_i\), \(\bar y=\dfrac{1}{n}\sum_i y_i\). Define

      \[ S_{xx}=\sum_{i=1}^n (x_i-\bar x)^2=\sum_i x_i^2 - n\bar x^2,\qquad S_{xy}=\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)=\sum_i x_i y_i - n\bar x\bar y. \]

    • Solving the normal equations yields the OLS formulas (and hence the MLEs):

      \[ \boxed{\displaystyle \hat\beta \;=\; \frac{S_{xy}}{S_{xx}}, \qquad \hat\alpha \;=\; \bar y - \hat\beta\,\bar x.} \]

  • Step 4 — MLE for \(\sigma^2\)

    • Plugging \(\hat\alpha,\hat\beta\) back into the log-likelihood, the MLE for \(\sigma^2\) is

      \[ \boxed{\displaystyle \hat\sigma^2_{\text{MLE}} \;=\; \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat\alpha - \hat\beta x_i\big)^2.} \]

      Note: this ML estimator uses division by \(n\); the unbiased estimator uses division by \(n-2\).

  • Step 5 — second-derivative check (concavity)

    • The conditional log-likelihood is quadratic (negative-definite in \(\alpha,\beta\) for fixed \(\sigma^2\)), so the solution above is a global maximizer for \(\alpha,\beta\).

Thus the MLEs of \(\alpha\) and \(\beta\) coincide exactly with the least-squares estimators.
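
A minimal numpy sketch of the closed-form estimates derived above; the function name and return convention are illustrative.

```python
# OLS = MLE for (alpha, beta); variance estimates RSS/n (ML) and RSS/(n-2) (unbiased).
import numpy as np

def ols_mle(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    Sxy = np.sum((x - xbar) * (y - ybar))
    beta_hat = Sxy / Sxx                              # slope
    alpha_hat = ybar - beta_hat * xbar                # intercept
    resid = y - alpha_hat - beta_hat * x
    sigma2_mle = np.sum(resid ** 2) / n               # ML estimate: divide by n
    sigma2_unbiased = np.sum(resid ** 2) / (n - 2)    # unbiased: divide by n - 2
    return alpha_hat, beta_hat, sigma2_mle, sigma2_unbiased
```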

Related concepts

  • Conditional likelihood vs joint likelihood. When the marginal distribution of \(X\) is unspecified, inference about regression parameters is done via the conditional likelihood \(f(y\mid x)\); this is equivalent to treating \(x_i\) as fixed regressors for likelihood purposes.
  • Gauss–Markov theorem. Under mean-zero, homoscedastic, uncorrelated errors, OLS estimators are the Best Linear Unbiased Estimators (BLUE). With normality they are also MLEs (hence efficient).
  • MLE vs unbiased estimation of \(\sigma^2\). ML gives \(\hat\sigma^2 = \mathrm{RSS}/n\); the unbiased estimator is \(\mathrm{RSS}/(n-2)\).
  • Fisher information / asymptotic variance. For known \(\sigma^2\), the exact variances are

    \[ \operatorname{Var}(\hat\beta)=\frac{\sigma^2}{S_{xx}},\qquad \operatorname{Var}(\hat\alpha)=\sigma^2\Big(\frac{1}{n}+\frac{\bar x^2}{S_{xx}}\Big). \]

  • Interpretation when \(X\) random. If \(X_i\) are random but independent of the error and their distribution does not depend on \(\alpha,\beta\), the conditional-likelihood approach still yields the same estimators as the joint likelihood (which factors as marginal of \(X\) times conditional of \(Y\mid X\)).

Viz

  • Scatter plot \((x_i,y_i)\) with fitted regression line \(y=\hat\alpha+\hat\beta x\) and residuals drawn down to the line — checks fit visually.
  • Residual diagnostics: plot residuals vs fitted values and a Q–Q plot of residuals to check homoscedasticity and normality assumptions.
  • Confidence bands: show pointwise confidence intervals for the regression line using the variance formulas above.
  • Profile of log-likelihood: plot \(\ell(\alpha,\beta)\) (or profile \(\ell(\beta)\) fixing \(\alpha=\bar y-\beta\bar x\)) to visualize the quadratic shape and the unique maximum.

Question 5

In the regression model

\[ Y_i \mid X_i = x_i \;\sim\; N(\alpha + \beta x_i,\ \sigma^2), \quad i=1,\dots,n, \]

we already showed that the MLEs of \(\alpha\) and \(\beta\) are the least-squares estimators. Now: What is the MLE of \(\sigma^2\)? Is this MLE unbiased?

Detailed solution

  • Step 1 — log-likelihood (conditional on \(x_i\))

    • The conditional likelihood is

      \[ L(\alpha,\beta,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i-(\alpha+\beta x_i))^2}{2\sigma^2}\right). \]

    • The log-likelihood is

      \[ \ell(\alpha,\beta,\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \]

  • Step 2 — profile likelihood for \(\sigma^2\)

    • Let \(\displaystyle \mathrm{RSS}=\sum_{i=1}^n (y_i-\hat\alpha-\hat\beta x_i)^2\) denote the residual sum of squares at the MLEs \(\hat\alpha,\hat\beta\).
    • Plugging \(\hat\alpha,\hat\beta\) into \(\ell\) and differentiating w.r.t. \(\sigma^2\) gives

      \[ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\mathrm{RSS}}{2\sigma^4}. \]

    • Setting this to zero and solving yields

      \[ -\frac{n}{2\sigma^2} + \frac{\mathrm{RSS}}{2\sigma^4} = 0 \ \Longrightarrow\ -n\sigma^2 + \mathrm{RSS}=0, \]

      hence

      \[ \boxed{\displaystyle \hat\sigma^2_{\mathrm{MLE}} = \frac{\mathrm{RSS}}{n}.} \]

  • Step 3 — unbiasedness check

    • From standard linear-regression distributional results, \(\dfrac{\mathrm{RSS}}{\sigma^2}\sim\chi^2_{\,n-2}\), so \(E[\mathrm{RSS}]=(n-2)\sigma^2\).
    • Therefore

      \[ E[\hat\sigma^2_{\mathrm{MLE}}] = E\!\left[\frac{\mathrm{RSS}}{n}\right] = \frac{n-2}{n}\,\sigma^2. \]

    • Conclusion: \(\hat\sigma^2_{\mathrm{MLE}}=\mathrm{RSS}/n\) is biased downward: its expectation is \(\tfrac{n-2}{n}\,\sigma^2\), so it underestimates \(\sigma^2\) on average.
  • Step 4 — unbiased alternative

    • The usual unbiased estimator is

      \[ \boxed{\displaystyle \tilde\sigma^2 = \frac{\mathrm{RSS}}{n-2},} \]

      which satisfies \(E[\tilde\sigma^2]=\sigma^2\).
    • Note: for a general linear model with \(p\) parameters (including intercept), the unbiased estimator is \(\mathrm{RSS}/(n-p)\) while the MLE is \(\mathrm{RSS}/n\). A small simulation check of the bias factor appears after this solution.
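
A small simulation sketch of the \((n-2)/n\) bias factor, assuming numpy is available; the sample size, true parameters and number of replications are arbitrary illustrative choices.

```python
# Monte Carlo check: E[RSS/n] is about ((n-2)/n) * sigma^2, while E[RSS/(n-2)] is about sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, sigma2 = 10, 1.0, 2.0, 4.0
x = np.linspace(0.0, 1.0, n)
mle, unbiased = [], []
for _ in range(20000):
    y = alpha + beta * x + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    rss = np.sum((y - a - b * x) ** 2)
    mle.append(rss / n)
    unbiased.append(rss / (n - 2))
print(np.mean(mle), (n - 2) / n * sigma2)   # these two numbers should be close
print(np.mean(unbiased), sigma2)            # and so should these two
```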

Related concepts

  • Bias of variance MLE. In normal models with unknown mean parameters, the ML estimator of variance is downward-biased because degrees of freedom are used to estimate the mean(s).
  • Degrees of freedom. In simple linear regression with intercept and slope, \(p=2\) parameters are estimated, leaving \(n-2\) residual degrees of freedom.
  • Distributional result. \(\dfrac{\mathrm{RSS}}{\sigma^2}\sim\chi^2_{n-2}\), which yields both expectation and confidence-interval constructions for \(\sigma^2\).
  • Asymptotics. \(\hat\sigma^2_{\mathrm{MLE}}\) is consistent and asymptotically unbiased; the finite-sample bias vanishes as \(n\to\infty\).

Viz

  • Plot \(\hat\sigma^2_{\mathrm{MLE}}=\mathrm{RSS}/n\) vs \(\tilde\sigma^2=\mathrm{RSS}/(n-2)\) as \(n\) varies: for small \(n\) the ML estimator is noticeably smaller; for large \(n\) they converge.
  • Plot the sampling distribution of \(\dfrac{n\hat\sigma^2}{\sigma^2}=\dfrac{\mathrm{RSS}}{\sigma^2}\) which follows \(\chi^2_{n-2}\); compare to the scaled unbiased version.

Other worthwhile points

  • Statistical software typically reports the unbiased estimator \(\mathrm{RSS}/(n-p)\) for inference (t-tests, CI), because those procedures rely on the exact finite-sample \(\chi^2/\)Student-\(t\) calibrations.
  • For likelihood-based comparisons (likelihood-ratio tests, AIC), the ML estimator \(\mathrm{RSS}/n\) often appears naturally because it maximizes the likelihood.
  • Both estimators are consistent; the difference matters mainly in small samples and when an unbiased variance estimate is required for exact inference.

Question 6

Consider paired data \((X_i,Y_i),\ i=1,\dots,n\), where the pairs are independent, the marginal distribution of the \(X_i\) is unspecified, and the conditional density of \(Y_i\) given \(X_i\) is

\[ f(y\mid X_i,\lambda,\alpha,\beta)=\frac{\lambda}{2}\exp\!\big(-\lambda|y-\alpha-\beta X_i|\big),\qquad -\infty<y<\infty,\ \lambda>0, \]

(i.e. a Laplace / double-exponential conditional distribution with scale parameter \(\lambda^{-1}\)). Show that the maximum likelihood estimators (MLEs) of \(\alpha\) and \(\beta\) minimize

\[ \sum_{i=1}^n |Y_i-\alpha-\beta X_i|. \]

Detailed solution

  • Step 1 — conditional likelihood (treat \(x_i\) as given)

    The conditional likelihood for \((\lambda,\alpha,\beta)\) is

    \[ L(\lambda,\alpha,\beta) =\prod_{i=1}^n \frac{\lambda}{2}\exp\!\big(-\lambda|y_i-\alpha-\beta x_i|\big) =\Big(\frac{\lambda}{2}\Big)^n \exp\!\Big(-\lambda\sum_{i=1}^n|y_i-\alpha-\beta x_i|\Big). \]

  • Step 2 — log-likelihood

    The log-likelihood is

    \[ \ell(\lambda,\alpha,\beta) = n\log\lambda - n\log 2 \;-\; \lambda\sum_{i=1}^n|y_i-\alpha-\beta x_i|. \]

  • Step 3 — profile out \(\lambda\), then maximize w.r.t. \(\alpha,\beta\)

    • For fixed \((\alpha,\beta)\) the log-likelihood as a function of \(\lambda\) is

      \[ \ell(\lambda\mid\alpha,\beta)=n\log\lambda - \lambda S(\alpha,\beta) + \text{constant}, \qquad S(\alpha,\beta)=\sum_{i=1}^n|y_i-\alpha-\beta x_i|. \]

    • Differentiating w.r.t. \(\lambda\) and equating to zero gives

      \[ \frac{n}{\hat\lambda}-S(\alpha,\beta)=0\quad\Rightarrow\quad \hat\lambda(\alpha,\beta)=\frac{n}{S(\alpha,\beta)}. \]

    • Plugging \(\hat\lambda\) back yields the profile log-likelihood (up to constants)

      \[ \ell_{\text{prof}}(\alpha,\beta) = n\log\!\Big(\frac{n}{S(\alpha,\beta)}\Big) - n\log 2 - n = -\,n\log S(\alpha,\beta) + \text{constant}. \]

    • Since \(-n\log S(\alpha,\beta)\) is a strictly decreasing function of \(S(\alpha,\beta)>0\), maximizing \(\ell_{\text{prof}}\) over \((\alpha,\beta)\) is equivalent to minimizing

      \[ S(\alpha,\beta)=\sum_{i=1}^n |y_i-\alpha-\beta x_i|. \]

  • Step 4 — conclusion

    Therefore the MLEs \((\hat\alpha,\hat\beta)\) are exactly the minimizers of the sum of absolute residuals:

    \[ (\hat\alpha,\hat\beta) = \arg\min_{(\alpha,\beta)}\sum_{i=1}^n |Y_i-\alpha-\beta X_i|. \]

    These are the least absolute deviations (LAD) or L1 regression estimators (equivalently median regression for the conditional median).

  • Optional — subgradient (first-order) conditions

    Let \(r_i(\alpha,\beta)=y_i-\alpha-\beta x_i\) and \(\operatorname{sgn}(u)\) be the sign function with \(\operatorname{sgn}(0)\in[-1,1]\). A minimizer \((\hat\alpha,\hat\beta)\) satisfies the subgradient equations

    \[ \sum_{i=1}^n \operatorname{sgn}\big(r_i(\hat\alpha,\hat\beta)\big)=0,\qquad \sum_{i=1}^n x_i\,\operatorname{sgn}\big(r_i(\hat\alpha,\hat\beta)\big)=0, \]

    where the equalities are interpreted in the subgradient sense if some residuals equal zero. These are the LAD analogues of the normal equations; a linear-programming sketch for computing the LAD fit follows this solution.

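
A minimal sketch of the LAD fit via the linear-programming route (write each residual as the difference of two nonnegative parts), assuming scipy is available; the function name and interface are illustrative, and this is only one of the computational options mentioned below, not a method prescribed in class.

```python
# LAD regression as a linear programme: minimise sum(u_i + v_i) subject to
# alpha + beta * x_i + u_i - v_i = y_i,  u_i >= 0,  v_i >= 0.
import numpy as np
from scipy.optimize import linprog

def lad_fit(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    # decision vector: [alpha, beta, u_1..u_n, v_1..v_n]
    c = np.concatenate([[0.0, 0.0], np.ones(n), np.ones(n)])
    A_eq = np.hstack([np.ones((n, 1)), x[:, None], np.eye(n), -np.eye(n)])
    b_eq = y
    bounds = [(None, None), (None, None)] + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    alpha_hat, beta_hat = res.x[0], res.x[1]
    lam_hat = n / np.sum(np.abs(y - alpha_hat - beta_hat * x))   # MLE of lambda at the fit
    return alpha_hat, beta_hat, lam_hat
```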

Related concepts

  • Least absolute deviations (LAD) / L1 regression: Minimizes \(\sum|r_i|\); robust to outliers in \(Y\).
  • Median / quantile regression: LAD is the special case of quantile regression at the 0.5 quantile (conditional median).
  • MLE equivalence: The Laplace conditional density makes the negative log-likelihood proportional to the L1 loss, hence MLE = LAD.
  • Subgradient vs differentiability: Because \(|\cdot|\) is nondifferentiable at 0, optimality uses subgradients rather than ordinary gradients.
  • Uniqueness: The LAD minimizer may not be unique in degenerate data configurations; with \(x_i\) in general position uniqueness typically holds.

Viz

  • Scatter plot \((x_i,y_i)\) with both OLS and LAD fit lines to demonstrate robustness: LAD resists vertical outliers.
  • Surface plot of \(S(\alpha,\beta)\) over a grid to visualize kinks where residual signs change; minima occur at intersections of these kinked regions.
  • Sign-balance plot at the LAD fit: the residual signs should balance so the sum of signs (and the weighted sum by \(x_i\)) is approximately zero.
  • Contamination experiment: simulate outliers and compare OLS vs LAD fits to show trade-offs between efficiency and robustness.

Other worthwhile points

  • Computation: LAD can be computed by linear programming (introduce positive/negative parts for residuals), by specialized algorithms (simplex, interior-point), or by IRLS approximations. Many packages implement quantile regression (e.g. rq in R) that computes LAD.
  • Asymptotics: LAD estimators are consistent and asymptotically normal under mild conditions; asymptotic variance depends on the error density at zero (for Laplace errors one gets closed-form expressions).
  • MLE for \(\lambda\): Given \((\hat\alpha,\hat\beta)\), the MLE for \(\lambda\) is

    \[ \hat\lambda=\frac{n}{\sum_{i=1}^n|y_i-\hat\alpha-\hat\beta x_i|}. \]

  • Robustness trade-off: LAD has higher breakdown point and bounded influence compared with OLS, but is typically less efficient under Gaussian errors. Choose method with the error distribution and outlier risk in mind.