Statistical Methods III: Week-5

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%

Contents

1. Motivation: Why look beyond OLS?

  • OLS (Ordinary Least Squares) is the standard regression method. It minimizes the sum of squared residuals.
  • But squaring makes large errors (outliers) extremely influential. A single extreme data point can drag the fitted line towards itself.

Visualization: Imagine a scatter plot with one point far above the main cluster. The OLS line bends upward to reduce the squared error of that outlier.

2. LAD (Least Absolute Deviation) regression

  • LAD minimizes the sum of absolute residuals, instead of squared residuals.
  • The formula becomes: “Choose \( \beta_0, \beta_1 \) to minimize \(\sum |y_i - \beta_0 - \beta_1 x_i|\).”
  • Geometrically, this means we are fitting a line that makes the absolute vertical distances as small as possible.

Visualization: On the same scatter plot, the LAD line passes closer to the cluster and ignores the influence of the outlier.

3. Comparing OLS and LAD

  • OLS loss function is quadratic (\(x^2\)), a smooth parabola.
  • LAD loss function is linear (\(|x|\)), a sharp V-shape.
  • This small difference changes everything:
    • OLS → sensitive to outliers, mean-based.
    • LAD → robust to outliers, median-based.

Visualization: Plot two curves against \(x\):

  • Parabola \(x^2\), showing steep growth for large deviations.
  • V-shape \(|x|\), showing slower, linear growth.
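
A quick numerical check of this picture, as a minimal sketch in Python (the residual values below are arbitrary illustrations):

```python
import numpy as np

# Compare the two loss functions at a few residual sizes. The quadratic loss
# grows much faster than the absolute loss for large residuals, which is why
# a single extreme point can dominate the OLS objective.
residuals = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
squared_loss = residuals ** 2       # OLS penalty
absolute_loss = np.abs(residuals)   # LAD penalty

for r, s, a in zip(residuals, squared_loss, absolute_loss):
    print(f"residual = {r:5.1f}   squared = {s:7.2f}   absolute = {a:5.1f}")
# A residual of 10 costs 100 under the squared loss but only 10 under the absolute loss.
```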

4. Practical meaning

  • In OLS, a huge error from one observation dominates the objective function.
  • In LAD, all errors contribute more equally — so the largest deviation doesn’t overwhelm the model.

This robustness is why LAD is often called a “robust regression method.”

5. Estimation procedure

  • The general regression problem is written as: “Minimize \(\sum \rho(y_i - \beta_0 - \beta_1 x_i)\), where \(\rho\) is a chosen loss function.”
  • Special cases:
    • \(\rho(x) = x^2\) → OLS.
    • \(\rho(x) = |x|\) → LAD.

So LAD is just one member of a family of “M-estimators,” where you choose different functions \(\rho\) to control how much you penalize large errors.
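
As a concrete illustration of this family, here is a minimal sketch (not code from the lecture) that minimizes the summed loss numerically with `scipy.optimize.minimize`, plugging in either the squared or the absolute loss; the simulated data, starting values, and optimizer choices are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data with one artificial outlier.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 50)
y[0] += 30.0

def m_objective(beta, rho):
    """Sum of rho(residuals) for a candidate line beta = (beta0, beta1)."""
    resid = y - beta[0] - beta[1] * x
    return np.sum(rho(resid))

# rho(u) = u^2 gives an OLS-type fit; rho(u) = |u| gives a LAD-type fit.
# Nelder-Mead is used for the absolute loss because it is not differentiable at 0.
ols_fit = minimize(m_objective, x0=[0.0, 0.0], args=(np.square,))
lad_fit = minimize(m_objective, x0=[0.0, 0.0], args=(np.abs,), method="Nelder-Mead")

print("OLS-type estimates:", ols_fit.x)  # typically pulled toward the outlier
print("LAD-type estimates:", lad_fit.x)  # typically much closer to the true (1, 2)
```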

6. Computation: Why LAD is harder

  • For OLS: we can take derivatives and get closed-form solutions.
  • For LAD: the absolute value function is not differentiable at zero, so we can’t use simple calculus.
  • Instead, LAD is solved using:
    1. Linear programming methods (see the sketch after this list).
    2. Iteratively Reweighted Least Squares (IRLS).
    3. Geometric median ideas (median instead of mean).
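
The linear-programming route in item 1 can be sketched as follows, assuming `scipy.optimize.linprog` is available; the data and variable layout are illustrative choices, not the lecture's prescribed formulation. Each absolute residual \(|y_i - \beta_0 - \beta_1 x_i|\) is replaced by an auxiliary variable \(t_i\) constrained to lie above both the residual and its negative:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 40)
n = len(x)

# Decision variables: [b0, b1, t_1, ..., t_n]; minimize the sum of the t_i.
c = np.concatenate([[0.0, 0.0], np.ones(n)])

# t_i >= y_i - b0 - b1*x_i    ->  -b0 - b1*x_i - t_i <= -y_i
# t_i >= -(y_i - b0 - b1*x_i) ->   b0 + b1*x_i - t_i <=  y_i
A_ub = np.vstack([
    np.hstack([-np.ones((n, 1)), -x[:, None], -np.eye(n)]),
    np.hstack([ np.ones((n, 1)),  x[:, None], -np.eye(n)]),
])
b_ub = np.concatenate([-y, y])

bounds = [(None, None), (None, None)] + [(0, None)] * n  # b0, b1 free; t_i >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("LAD estimates (b0, b1):", res.x[:2])
```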

7. Related concepts

  • Median vs Mean: OLS → mean, LAD → median. With only an intercept in the model, OLS fits the sample mean of \(y\) while LAD fits the sample median; this is the one-dimensional analogue of the two methods.
  • Robustness: LAD is less sensitive to extreme values.
  • Quantile regression: LAD regression is a special case (median regression). More generally, quantile regression fits other quantiles (like the 25th or 75th percentile).
  • M-estimators: General framework of estimators defined by minimizing different \(\rho\)-functions.

Quick descriptive summary

  • OLS and LAD differ only in how they measure error.
  • Squared error (OLS) exaggerates the importance of outliers.
  • Absolute error (LAD) keeps all points on a more equal footing.
  • The geometry of their loss functions explains this difference.
  • LAD is computationally harder but conceptually simple: it finds a regression line that “balances” residuals in a median-like fashion.
  • This makes LAD a robust alternative to OLS and a gateway to broader robust regression methods.

1. The Big Question

Huber asked: when we fit a regression line, should we prioritize fitting the majority of the data well, or should we also give weight to the few outliers?

  • OLS (least squares) gives a lot of importance to outliers because squaring residuals makes big errors dominate.
  • LAD (least absolute deviation) ignores outliers more, but may sometimes be too blunt.
  • Huber’s idea: design a compromise — treat small errors like OLS (quadratic penalty, smooth), but treat large errors more gently (linear penalty).

Think of it like having a shock absorber: normal bumps are treated smoothly (like OLS), but when the road gets too rough (big outliers), the absorber prevents the car (our model) from jumping violently.

2. General M-estimation setup

We write the objective function as:

\[ C(\beta_0, \beta_1) = \sum_{i=1}^n \rho(y_i - \beta_0 - \beta_1 x_i) \]

  • Here \(\rho\) is a loss function that measures how bad a residual is.
  • Choice of \(\rho\):
    • \(\rho(u) = u^2\) → OLS.
    • \(\rho(u) = |u|\) → LAD.
    • \(\rho(u) =\) Huber’s function (quadratic for small \(u\), linear for large \(u\)) → Robust regression.

So the framework is: pick a \(\rho\), minimize the sum, get your regression line.
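
A minimal sketch of Huber's loss, written with the same convention as above (\(\rho(u) = u^2\) for the quadratic part); the threshold `k` is a tuning constant chosen by the analyst, and the value 1.345 below is an illustrative default rather than a prescription from these notes:

```python
import numpy as np

def huber_rho(u, k=1.345):
    """Quadratic for |u| <= k, linear beyond k, joined continuously."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= k, u**2, 2.0 * k * np.abs(u) - k**2)

print(huber_rho([0.5, 1.0, 3.0, 10.0]))  # grows quadratically, then only linearly
```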

3. First-order conditions (derivatives)

To minimize, take derivatives with respect to the coefficients:

  • For \(\beta_0\):

    \[ \frac{\partial C}{\partial \beta_0} = -\sum_{i=1}^n \rho'(y_i - \beta_0 - \beta_1 x_i) \]

  • For \(\beta_1\):

    \[ \frac{\partial C}{\partial \beta_1} = -\sum_{i=1}^n \rho'(y_i - \beta_0 - \beta_1 x_i) \cdot x_i \]

Here \(\rho'(u)\) is called the influence function or ψ-function. It tells us how strongly each residual \(u\) “pulls” on the fit.

  • For OLS: \(\rho(u) = u^2\), so \(\rho'(u) = 2u\). Large residuals exert a very strong pull.
  • For LAD: \(\rho(u) = |u|\), so \(\rho'(u) = \text{sign}(u)\). Every point, no matter how large, pulls with the same unit force.
  • For Huber: \(\rho'(u)\) grows linearly for small \(u\), but is capped for large \(u\). Outliers can’t pull infinitely.

Analogy: Imagine each data point attached to the regression line with a spring. In OLS, the spring force grows stronger the more it’s stretched (quadratic). In LAD, every spring pulls with the same constant tension. In Huber’s, springs behave normally at first, but once stretched beyond a threshold, they lock at a maximum pull.
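
The three ψ-functions can be written down directly, as a minimal sketch using the same convention as above (\(\rho(u) = u^2\), hence \(\rho'(u) = 2u\) for OLS); the threshold 1.345 is again an illustrative choice:

```python
import numpy as np

def psi_ols(u):
    return 2.0 * np.asarray(u, dtype=float)        # pull grows without bound

def psi_lad(u):
    return np.sign(u)                               # constant unit pull

def psi_huber(u, k=1.345):
    u = np.asarray(u, dtype=float)
    return np.clip(2.0 * u, -2.0 * k, 2.0 * k)      # linear pull, capped at the threshold

u = np.array([-10.0, -1.0, 0.3, 5.0])
print(psi_ols(u))
print(psi_lad(u))
print(psi_huber(u))
```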

4. Defining weights

We introduce weights:

\[ w_i = \frac{\rho'(y_i - \beta_0 - \beta_1 x_i)}{y_i - \beta_0 - \beta_1 x_i} \]

This formula is clever:

  • The numerator is the “influence” of a point.
  • The denominator is the residual itself.
  • So \(w_i\) tells us how much importance the point gets relative to its size.

Check special cases:

  • OLS: \(\rho'(u) = 2u\). So \(w_i = \frac{2u}{u} = 2\). Constant weight → all points are equally trusted.
  • LAD: \(\rho'(u) = \text{sign}(u)\). So \(w_i = \frac{\text{sign}(u)}{u} = \frac{1}{|u|}\). Big residuals → tiny weights.
  • Huber: \(w_i\) depends on whether \(u\) is small (weight ≈ 2) or big (weight shrinks).

So OLS = equal trust, LAD = diminishing trust, Huber = adaptive trust.
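
A minimal sketch of these weights in code; residuals very close to zero get a small floor so the division is safe (that guard is a practical convention, not part of the theory), and the Huber threshold 1.345 is again illustrative:

```python
import numpy as np

def weights(u, psi, eps=1e-8):
    """IRLS weights w_i = psi(u_i) / u_i, with a floor to avoid dividing by zero."""
    u = np.asarray(u, dtype=float)
    safe_u = np.where(np.abs(u) < eps, eps, u)
    return psi(safe_u) / safe_u

k = 1.345
u = np.array([0.05, 1.0, 5.0, 20.0])
print("OLS  :", weights(u, lambda r: 2.0 * r))                      # constant 2
print("LAD  :", weights(u, np.sign))                                # 1/|u|: shrinks fast
print("Huber:", weights(u, lambda r: np.clip(2.0 * r, -2*k, 2*k)))  # 2 for small u, then shrinks
```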

5. Why the equations become nonlinear

  • For OLS, weights are constant, so the equations reduce to neat linear formulas.
  • For Huber/LAD, weights depend on the residuals, which themselves depend on the unknown parameters \(\beta_0, \beta_1\). That makes the system nonlinear and hard to solve directly.

So we cheat: we linearize the problem by temporarily fixing the weights at some guess of \(\beta\), then updating \(\beta\), then recalculating weights, and so on.

6. Iteratively Reweighted Least Squares (IRLS)

The procedure is:

  1. Start with an initial guess of \(\beta_0, \beta_1\) (often OLS estimates).
  2. Compute residuals.
  3. Compute weights \(w_i\) based on those residuals.
  4. Solve a weighted least squares problem:

    \[ \hat{\beta}^{(k+1)} = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^n w_i^{(k)} (y_i - \beta_0 - \beta_1 x_i)^2 \]

    where the superscript \(k\) counts iterations (so it is not confused with the sample size \(n\)).

    This is just like OLS, but each point gets a different importance.

  5. Repeat until the estimates stop changing.

This is why it’s called Iteratively Reweighted Least Squares: each step is OLS with weights, but the weights are updated iteratively.
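
A minimal end-to-end sketch of the five steps, using a Huber-type weight; the simulated data, tolerance, and threshold `k` are illustrative choices rather than values fixed by the notes:

```python
import numpy as np

def irls(x, y, k=1.345, tol=1e-8, max_iter=100):
    X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # step 1: OLS starting values
    for _ in range(max_iter):
        resid = y - X @ beta                         # step 2: residuals
        safe = np.where(np.abs(resid) < 1e-8, 1e-8, resid)
        w = np.clip(2.0 * safe, -2*k, 2*k) / safe    # step 3: Huber weights psi(u)/u
        W = np.diag(w)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # step 4: weighted LS
        if np.max(np.abs(beta_new - beta)) < tol:    # step 5: stop when estimates stabilize
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 60)
y[:3] += 25.0                                        # a few artificial outliers
print("IRLS (Huber) estimates:", irls(x, y))         # typically close to the true (1, 2)
```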

7. Normal equations in IRLS

If we differentiate the weighted sum with respect to the coefficients, we get:

  • For \(\beta_0\):

    \[ -2 \sum w_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]

  • For \(\beta_1\):

    \[ -2 \sum w_i (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0 \]

These are just like OLS normal equations, but each term is multiplied by a weight.
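
For fixed weights, these two equations are a small linear system that can be solved directly; here is a minimal sketch (the data are made up, and with all weights equal to 1 it reduces to ordinary OLS as a sanity check):

```python
import numpy as np

def weighted_ls(x, y, w):
    """Solve the two weighted normal equations for (beta0, beta1)."""
    Sw, Swx, Swy = np.sum(w), np.sum(w * x), np.sum(w * y)
    Swxx, Swxy = np.sum(w * x * x), np.sum(w * x * y)
    # Sw*b0  + Swx*b1  = Swy
    # Swx*b0 + Swxx*b1 = Swxy
    A = np.array([[Sw, Swx], [Swx, Swxx]])
    b = np.array([Swy, Swxy])
    return np.linalg.solve(A, b)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(weighted_ls(x, y, np.ones_like(x)))  # approximately (0.14, 1.96): plain OLS
```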

8. Related Concepts

  • Huber’s ψ-function: the derivative of the loss. Defines influence of residuals.
  • Breakdown point: the proportion of contaminated data an estimator can tolerate before it breaks down. The mean (and hence OLS) has breakdown point 0%: one bad observation can ruin it. The median, the one-dimensional analogue of LAD, has breakdown point 50%, and Huber-type estimators sit in between. (LAD regression itself resists outliers in \(y\), though extreme \(x\) values can still exert leverage.)
  • Robust statistics: field focused on methods that resist outliers and model misspecification.
  • M-estimation: generalization of maximum likelihood, where we minimize a sum of some function of residuals. Huber pioneered this.

Summary

Huber’s work showed that we don’t have to choose between being “too harsh on outliers” (OLS) and “ignoring them completely” (LAD). His M-estimator is a middle path: quadratic penalty for small residuals (like OLS), linear for big ones (like LAD). This leads naturally to IRLS, where we keep updating weights until the regression line stabilizes.

1. Logistic Regression: The Trial Model

You start with:

\[ E(y_i | x=x_i) = \beta_0 + \beta_1 x_i \]

  • Here, \(y_i\) can only be 0 or 1.
  • But \(\beta_0 + \beta_1 x_i\) can take any real value \((-\infty, \infty)\).

Problem: This means for some \(x\), the model might predict probabilities <0 or >1, which makes no sense.

Visual intuition: Imagine you have data points where \(y_i\) is only at the levels 0 or 1. If you try to fit a straight line through them, some predictions fall outside the [0,1] band. That’s the reason linear regression doesn’t work well for binary outcomes.

2. The Fix — A Transformation

We want a mapping:

\[ g: \mathbb{R} \to [0,1] \]

So instead of saying:

\[ E(y_i|x=x_i) = \beta_0 + \beta_1 x_i, \]

we say:

\[ E(y_i|x=x_i) = g(\beta_0 + \beta_1 x_i). \]

This ensures the predicted mean is always between 0 and 1 — a valid probability.

3. The Logistic Function

A popular choice of \(g\) is the logistic function:

\[ g(z) = \frac{e^z}{1+e^z}, \quad z \in \mathbb{R}. \]

  • Always between 0 and 1.
  • S-shaped (sigmoid curve).
  • As \(z \to -\infty\), \(g(z) \to 0\).
  • As \(z \to +\infty\), \(g(z) \to 1\).

Why a CDF? Because the logistic function is exactly the CDF of the standard logistic distribution: smooth, monotone, and bounded between 0 and 1, which are just the properties we need for \(g\).

So now:

\[ P(y_i=1|x_i=x) = \frac{e^{\beta_0 + \beta_1 x}}{1+e^{\beta_0 + \beta_1 x}}, \]

\[ P(y_i=0|x_i=x) = \frac{1}{1+e^{\beta_0 + \beta_1 x}}. \]
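
A minimal sketch of these two formulas in code; the coefficient values are illustrative, not estimates from any data set:

```python
import numpy as np

def logistic(z):
    """Logistic function g(z) = e^z / (1 + e^z)."""
    return np.exp(z) / (1.0 + np.exp(z))

beta0, beta1 = -2.0, 0.8
x = np.array([-2.0, 0.0, 2.5, 5.0])
p1 = logistic(beta0 + beta1 * x)      # P(y = 1 | x)
print("P(y=1 | x):", p1)              # always strictly between 0 and 1
print("P(y=0 | x):", 1.0 - p1)
```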

4. The Odds and Log-Odds

The odds of success (y=1) are:

\[ \frac{P(y_i=1|x_i=x)}{P(y_i=0|x_i=x)} = e^{\beta_0 + \beta_1 x}. \]

Taking log:

\[ \ln \left(\frac{P(y_i=1|x_i=x)}{P(y_i=0|x_i=x)}\right) = \beta_0 + \beta_1 x. \]

This is called the logit transform.

  • It maps probabilities (0–1) into the whole real line.
  • And importantly, it’s linear in parameters (\(\beta_0, \beta_1\)).

5. Interpretation of Parameters

  • \(\beta_0\): log-odds when \(x=0\).
    • If \(\beta_0 = 0\), then odds = 1, meaning probability = 0.5.
  • \(\beta_1\): change in log-odds for a one-unit increase in \(x\).
    • If \(\beta_1 > 0\), higher \(x\) increases the probability of success.
    • If \(\beta_1 < 0\), higher \(x\) decreases it.
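
A quick numerical check of this interpretation, as a sketch with made-up coefficients: raising \(x\) by one unit multiplies the odds by \(e^{\beta_1}\), whatever the starting value of \(x\):

```python
import numpy as np

beta0, beta1 = -1.0, 0.5   # illustrative values

def odds(x):
    """Odds of success at covariate value x."""
    return np.exp(beta0 + beta1 * x)

print(odds(3.0) / odds(2.0))   # odds ratio for a one-unit increase in x
print(np.exp(beta1))           # the same number, exp(beta1) ~= 1.6487
```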

6. Visual Picture

  • Imagine your \(x\)-axis is a continuous variable (say, test score).
  • Your \(y\)-values are only 0 (fail) or 1 (pass).
  • A straight line through them doesn’t make sense, because it might predict values like -0.3 or 1.4.
  • The logistic curve bends to stay between 0 and 1, capturing the S-shaped relationship: at low \(x\), probability near 0; at high \(x\), probability near 1.

7. Related Concepts

  • Probit regression: uses the normal CDF instead of logistic CDF. Similar S-shape but slightly different tails.
  • Linear Probability Model (LPM): the initial trial (\(\beta_0 + \beta_1 x\)) — easy to compute but flawed.
  • Odds Ratio: \(e^{\beta_1}\) is the multiplicative change in odds for a unit increase in \(x\).

1. The Idea of the Probit Model

The logistic regression used the logistic CDF as the link function. The probit model instead uses the standard normal CDF:

\[ g(z) = \Phi(z), \quad \text{where } \Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt. \]

So in probit regression:

\[ P(y_i=1|x_i=x) = \Phi(\beta_0 + \beta_1 x). \]

2. Latent Variable Motivation

Instead of modeling probabilities directly, the probit model assumes there is a hidden (“latent”) continuous variable \(Z_i\) that drives the binary outcome.

  • Assume:

    \[ Z_i | x_i \sim N(\beta_0 + \beta_1 x_i, 1). \]

  • Then define observed outcome:

    \[ y_i = \begin{cases} 1 & \text{if } Z_i > 0, \\ 0 & \text{if } Z_i \leq 0. \end{cases} \]

Interpretation: there is some unobserved "propensity" \(Z_i\) (say, propensity to buy a product). You only see the binary decision (buy = 1, not buy = 0), but in reality there’s a continuous tendency behind it.

3. Probability Derivation

We want:

\[ P(y_i = 1 | x_i=x) = P(Z_i > 0 | x_i=x). \]

Since \(Z_i \sim N(\beta_0 + \beta_1 x, 1)\):

\[ P(Z_i > 0 | x_i=x) = 1 - \Phi\left(\frac{0 - (\beta_0 + \beta_1 x)}{1}\right). \]

Simplify:

\[ = 1 - \Phi(-(\beta_0 + \beta_1 x)). \]

Using symmetry of the normal CDF (\(\Phi(-z) = 1 - \Phi(z)\)):

\[ = \Phi(\beta_0 + \beta_1 x). \]

So:

\[ P(y_i = 1 | x_i=x) = \Phi(\beta_0 + \beta_1 x). \]
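
This derivation can be checked by simulation, as a minimal sketch (the coefficients, covariate value, and sample size are illustrative): draw the latent \(Z_i\) from its normal distribution and compare the fraction of positive draws with \(\Phi(\beta_0 + \beta_1 x)\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
beta0, beta1, x = -0.5, 0.4, 2.0

z = rng.normal(beta0 + beta1 * x, 1.0, size=200_000)   # latent propensity Z | x
print("Simulated P(y=1 | x):", np.mean(z > 0))
print("Phi(beta0 + beta1*x) :", norm.cdf(beta0 + beta1 * x))   # ~0.618 for these values
```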

4. Interpretation

  • \(\beta_0\): shift in the threshold — the baseline tendency when \(x=0\).
  • \(\beta_1\): slope — how strongly \(x\) affects the underlying latent variable \(Z_i\), and hence the probability of success.

5. Visualization (mental picture)

  • Imagine a bell-shaped normal curve centered at \(\beta_0 + \beta_1 x\).
  • You’re asking: what is the probability mass to the right of 0?
  • If \(\beta_0 + \beta_1 x\) is very negative, the normal curve is mostly left of 0 → probability near 0.
  • If \(\beta_0 + \beta_1 x\) is very positive, most of the curve is right of 0 → probability near 1.
  • As \(x\) increases, the curve shifts right, smoothly increasing the probability.

This gives you the familiar S-shaped curve, but now shaped by the normal distribution instead of the logistic.

6. Logistic vs. Probit

  • Both map real numbers to probabilities in [0,1].
  • Logistic CDF has slightly heavier tails → probabilities change a bit more gradually for extreme \(x\).
  • Probit CDF is tied directly to the normal distribution → convenient if you believe the latent mechanism is Gaussian.
  • In practice, they give very similar results (coefficients differ by about a scaling factor of 1.6).

So the Probit model is essentially a latent-variable threshold model: the binary outcome is determined by whether an unobserved continuous Gaussian variable exceeds zero.
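
The near-equivalence mentioned above can be eyeballed numerically; this minimal sketch compares \(\Phi(z)\) with the logistic function evaluated at roughly \(1.6z\) (the 1.6 is an approximate scaling, not an exact constant):

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-3, 3, 7)
logistic_scaled = np.exp(1.6 * z) / (1.0 + np.exp(1.6 * z))
probit = norm.cdf(z)
for zi, l, p in zip(z, logistic_scaled, probit):
    print(f"z = {zi:+.1f}   logistic(1.6 z) = {l:.3f}   Phi(z) = {p:.3f}")
# The two columns agree to within a few hundredths over this range.
```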

Report on the EM Algorithm in Genetics, Genomics, and Public Health

These notes are based on the report by Nan M. Laird. The full document can be accessed here: