AY 2025–26
Instructor: Debasis Sengupta
Office / Department: ASU
Email: sdebasis@isical.ac.in
Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%
Visualization: Imagine a scatter plot with one point far above the main cluster. The OLS line is pulled up toward that outlier, because squaring its large residual makes it dominate the sum of squared errors.
Visualization: On the same scatter plot, the LAD line stays close to the main cluster and is far less influenced by the outlier.
Visualization: Plot the two curves \(x^2\) and \(|x|\) against \(x\): the squared loss explodes for large \(x\), while the absolute loss grows only linearly, so a single extreme residual contributes far less to the LAD objective than to the OLS one.
This robustness is why LAD is often called a “robust regression method.”
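For concreteness, here is a small Python sketch of this picture; the simulated data, the single injected outlier, and the use of a direct search (scipy.optimize.minimize with Nelder-Mead) for the LAD fit are my own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)
y[-1] += 15.0                                   # one point far above the main cluster

# OLS: minimizes the sum of squared residuals (closed form via polyfit)
b1_ols, b0_ols = np.polyfit(x, y, 1)

# LAD: minimizes the sum of absolute residuals (no closed form; direct search)
def lad_loss(beta):
    b0, b1 = beta
    return np.sum(np.abs(y - b0 - b1 * x))

b0_lad, b1_lad = minimize(lad_loss, x0=[0.0, 0.0], method="Nelder-Mead").x

print("OLS slope:", round(b1_ols, 3))           # noticeably pulled up by the outlier
print("LAD slope:", round(b1_lad, 3))           # stays close to the true value 0.5
```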
So LAD is just one member of a family of “M-estimators,” where you choose different functions \(\rho\) to control how much you penalize large errors.
Huber asked: when we fit a regression line, should we prioritize fitting the majority of the data well, or should we also give weight to the few outliers?
Think of it like having a shock absorber: normal bumps are treated smoothly (like OLS), but when the road gets too rough (big outliers), the absorber prevents the car (our model) from jumping violently.
We write the objective function as:
\[ C(\beta_0, \beta_1) = \sum_{i=1}^n \rho(y_i - \beta_0 - \beta_1 x_i) \]
So the framework is: pick a \(\rho\), minimize the sum, get your regression line.
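A minimal sketch of this recipe: pick Huber's \(\rho\) (quadratic up to a threshold \(k\), linear beyond it) and hand the sum to a generic numerical minimizer. The simulated data, the conventional tuning constant \(k = 1.345\), and the use of scipy.optimize.minimize are illustrative choices, not the algorithm developed below.

```python
import numpy as np
from scipy.optimize import minimize

def huber_rho(u, k=1.345):
    """rho(u): quadratic for |u| <= k, linear beyond k."""
    return np.where(np.abs(u) <= k,
                    0.5 * u**2,
                    k * np.abs(u) - 0.5 * k**2)

def cost(beta, x, y, rho=huber_rho):
    """C(beta0, beta1) = sum_i rho(y_i - beta0 - beta1 * x_i)."""
    b0, b1 = beta
    return np.sum(rho(y - b0 - b1 * x))

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[0] += 10.0                                    # one gross outlier

beta_hat = minimize(cost, x0=[0.0, 0.0], args=(x, y)).x
print(beta_hat.round(3))                        # close to (1, 2) despite the outlier
```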
To minimize, take derivatives with respect to the coefficients; setting them to zero gives the estimating equations:
\[ \frac{\partial C}{\partial \beta_0} = -\sum_{i=1}^n \rho'(y_i - \beta_0 - \beta_1 x_i) \]
\[ \frac{\partial C}{\partial \beta_1} = -\sum_{i=1}^n \rho'(y_i - \beta_0 - \beta_1 x_i) \cdot x_i \]
Here \(\rho'(u)\) is called the influence function or ψ-function. It tells us how strongly each residual \(u\) “pulls” on the fit.
Analogy: Imagine each data point attached to the regression line with a spring. In OLS, the spring force grows stronger the more it’s stretched (quadratic). In LAD, every spring pulls with the same constant tension. In Huber’s, springs behave normally at first, but once stretched beyond a threshold, they lock at a maximum pull.
We introduce weights:
\[ w_i = \frac{\rho'(y_i - \beta_0 - \beta_1 x_i)}{y_i - \beta_0 - \beta_1 x_i} \]
This formula is clever: it lets us rewrite each term \(\rho'(u_i)\) as \(w_i u_i\), where \(u_i\) is the residual, so the estimating equations take the form of weighted least squares normal equations.
Check the special cases:
OLS: \(\rho(u) = u^2/2\), so \(\rho'(u) = u\) and \(w_i = 1\): every point gets the same weight.
LAD: \(\rho(u) = |u|\), so \(\rho'(u) = \operatorname{sign}(u)\) and \(w_i = 1/|u_i|\): the larger the residual, the smaller the weight.
Huber: \(\rho\) is quadratic up to a threshold \(k\) and linear beyond it, so \(w_i = 1\) for small residuals and \(w_i = k/|u_i|\) for large ones.
So OLS = equal trust, LAD = diminishing trust, Huber = adaptive trust.
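These three weighting rules translate directly into code. A sketch, assuming the same cutoff \(k = 1.345\) as above; the small constant guarding against division by zero is my own numerical safeguard.

```python
import numpy as np

def weights(u, kind="huber", k=1.345):
    """IRLS weights w = rho'(u) / u for a few common choices of rho."""
    u = np.asarray(u, dtype=float)
    au = np.maximum(np.abs(u), 1e-8)     # avoid dividing by a zero residual
    if kind == "ols":                    # rho(u) = u^2 / 2  ->  w = 1
        return np.ones_like(u)
    if kind == "lad":                    # rho(u) = |u|      ->  w = 1 / |u|
        return 1.0 / au
    if kind == "huber":                  # quadratic core, linear tails
        return np.where(au <= k, 1.0, k / au)
    raise ValueError(kind)

residuals = np.array([-0.2, 0.5, 1.0, 4.0, 10.0])
for kind in ("ols", "lad", "huber"):
    print(kind, weights(residuals, kind).round(3))
```

The printout makes the "trust" story visible: OLS keeps every weight at 1, LAD's weights fall off as \(1/|u|\), and Huber leaves small residuals at weight 1 but caps the pull of the large ones.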
So we cheat: we linearize the problem by temporarily fixing the weights at some guess of \(\beta\), then updating \(\beta\), then recalculating weights, and so on.
The procedure is: compute the weights \(w_i^{(n)}\) from the residuals of the current estimate \(\hat{\beta}^{(n)}\), then solve
\[ \hat{\beta}^{(n+1)} = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^n w_i^{(n)} (y_i - \beta_0 - \beta_1 x_i)^2 \]
This is just like OLS, but each point gets a different importance.
This is why it’s called Iteratively Reweighted Least Squares: each step is OLS with weights, but the weights are updated iteratively.
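Here is a minimal sketch of the IRLS loop for the simple-regression case with Huber weights; the OLS starting value, the tuning constant \(k = 1.345\), and the convergence tolerance are my own choices.

```python
import numpy as np

def irls_line(x, y, k=1.345, n_iter=50, tol=1e-8):
    """Fit y ~ b0 + b1*x by iteratively reweighted least squares (Huber weights)."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # start from the ordinary OLS fit
    for _ in range(n_iter):
        u = y - X @ beta                           # residuals at the current estimate
        au = np.maximum(np.abs(u), 1e-8)
        w = np.where(au <= k, 1.0, k / au)         # Huber weights rho'(u)/u
        sw = np.sqrt(w)
        # weighted least squares = OLS on rows rescaled by sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:  # weights (and line) have stabilized
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[3] -= 12.0                                       # a gross outlier
print(irls_line(x, y).round(3))                    # roughly (1, 2)
```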
If we differentiate the weighted sum with respect to the coefficients and set the derivatives to zero, we get:
\[ -2 \sum w_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]
\[ -2 \sum w_i (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0 \]
These are just like OLS normal equations, but each term is multiplied by a weight.
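Solving these two equations gives the usual weighted least squares formulas. Writing \(\bar{x}_w = \sum_i w_i x_i \big/ \sum_i w_i\) and \(\bar{y}_w = \sum_i w_i y_i \big/ \sum_i w_i\) for the weighted means,
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_{i=1}^n w_i (x_i - \bar{x}_w)^2}, \qquad \hat{\beta}_0 = \bar{y}_w - \hat{\beta}_1 \bar{x}_w. \]
With all \(w_i\) equal, these reduce to the familiar OLS formulas.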
Huber’s work showed that we don’t have to choose between being “too harsh on outliers” (OLS) and “ignoring them completely” (LAD). His M-estimator is a middle path: quadratic penalty for small residuals (like OLS), linear for big ones (like LAD). This leads naturally to IRLS, where we keep updating weights until the regression line stabilizes.
You start with:
\[ E(y_i | x=x_i) = \beta_0 + \beta_1 x_i \]
Problem: This means for some \(x\), the model might predict probabilities <0 or >1, which makes no sense.
Visual intuition: Imagine you have data points where \(y_i\) is only at the levels 0 or 1. If you try to fit a straight line through them, some predictions fall outside the [0,1] band. That’s the reason linear regression doesn’t work well for binary outcomes.
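A tiny numerical illustration (the 0/1 data below are made up): an OLS line fit to binary responses already produces fitted values outside \([0,1]\).

```python
import numpy as np

# Toy binary data: y is 0 for small x and 1 for large x
x = np.arange(1, 11, dtype=float)
y = (x > 5).astype(float)

b1, b0 = np.polyfit(x, y, 1)              # straight-line (OLS) fit
preds = b0 + b1 * x

print(preds.round(2))                      # fitted values at the ends of the x range...
print(preds.min() < 0, preds.max() > 1)    # ...fall below 0 and above 1 (True True)
```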
We want a mapping:
\[ g: \mathbb{R} \to [0,1] \]
So instead of saying:
\[ E(y_i|x=x_i) = \beta_0 + \beta_1 x_i, \]
we say:
\[ E(y_i|x=x_i) = g(\beta_0 + \beta_1 x_i). \]
This ensures the predicted mean is always between 0 and 1 — a valid probability.
A popular choice of \(g\) is the logistic function:
\[ g(z) = \frac{e^z}{1+e^z}, \quad z \in \mathbb{R}. \]
Why a CDF? Because a CDF is exactly the kind of function we need: monotone, smooth, and bounded between 0 and 1. This particular \(g\) is the CDF of the standard logistic distribution.
So now:
\[ P(y_i=1|x_i=x) = \frac{e^{\beta_0 + \beta_1 x}}{1+e^{\beta_0 + \beta_1 x}}, \]
\[ P(y_i=0|x_i=x) = \frac{1}{1+e^{\beta_0 + \beta_1 x}}. \]
The odds of success (y=1) are:
\[ \frac{P(y_i=1|x_i=x)}{P(y_i=0|x_i=x)} = e^{\beta_0 + \beta_1 x}. \]
Taking log:
\[ \ln \left(\frac{P(y_i=1|x_i=x)}{P(y_i=0|x_i=x)}\right) = \beta_0 + \beta_1 x. \]
This is called the logit transform.
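To see these formulas in action, here is a small sketch; the coefficient values and the \(x\) grid are arbitrary choices for illustration.

```python
import numpy as np

def logistic(z):
    """g(z) = e^z / (1 + e^z)"""
    return np.exp(z) / (1.0 + np.exp(z))

b0, b1 = -2.0, 0.8                        # arbitrary illustrative coefficients
x = np.array([0.0, 2.5, 5.0])

p1 = logistic(b0 + b1 * x)                # P(y = 1 | x)
odds = p1 / (1.0 - p1)                    # P(y = 1) / P(y = 0) = e^(b0 + b1*x)
logit = np.log(odds)                      # the logit transform

print(np.allclose(logit, b0 + b1 * x))    # True: the logit recovers the linear predictor
```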
Logistic regression used the logistic CDF for \(g\). The probit model instead uses the standard normal CDF:
\[ g(z) = \Phi(z), \quad \text{where } \Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt. \]
So in probit regression:
\[ P(y_i=1|x_i=x) = \Phi(\beta_0 + \beta_1 x). \]
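As a quick sketch (coefficients again arbitrary), the probit probabilities are just evaluations of the normal CDF, available as scipy.stats.norm.cdf:

```python
import numpy as np
from scipy.stats import norm

b0, b1 = -1.0, 0.5                 # arbitrary illustrative coefficients
x = np.array([-2.0, 0.0, 2.0, 4.0])

p_probit = norm.cdf(b0 + b1 * x)   # P(y = 1 | x) = Phi(b0 + b1*x)
print(p_probit.round(3))
```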
Instead of modeling probabilities directly, the probit model assumes there is a hidden (“latent”) continuous variable \(Z_i\) that drives the binary outcome.
\[ Z_i | x_i \sim N(\beta_0 + \beta_1 x_i, 1). \]
\[ y_i = \begin{cases} 1 & \text{if } Z_i > 0, \\ 0 & \text{if } Z_i \leq 0. \end{cases} \]
Interpretation: there is some unobserved "propensity" \(Z_i\) (say, propensity to buy a product). You only see the binary decision (buy = 1, not buy = 0), but in reality there’s a continuous tendency behind it.
We want:
\[ P(y_i = 1 | x_i=x) = P(Z_i > 0 | x_i=x). \]
Since \(Z_i \sim N(\beta_0 + \beta_1 x, 1)\):
\[ P(Z_i > 0 | x_i=x) = 1 - \Phi\left(\frac{0 - (\beta_0 + \beta_1 x)}{1}\right). \]
Simplify:
\[ = 1 - \Phi(-(\beta_0 + \beta_1 x)). \]
Using symmetry of the normal CDF (\(\Phi(-z) = 1 - \Phi(z)\)):
\[ = \Phi(\beta_0 + \beta_1 x). \]
So:
\[ P(y_i = 1 | x_i=x) = \Phi(\beta_0 + \beta_1 x). \]
This gives you the familiar S-shaped curve, but now shaped by the normal distribution instead of the logistic.
So the Probit model is essentially a latent-variable threshold model: the binary outcome is determined by whether an unobserved continuous Gaussian variable exceeds zero.
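A quick Monte Carlo check of this latent-variable story (the coefficients, covariate value, and sample size below are arbitrary choices): simulate \(Z_i\), threshold at zero, and compare the empirical frequency of \(y_i = 1\) with \(\Phi(\beta_0 + \beta_1 x)\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
b0, b1, x = -1.0, 0.5, 3.0          # arbitrary illustrative values
n = 200_000

# Latent propensity Z ~ N(b0 + b1*x, 1); we only observe y = 1 if Z > 0
Z = rng.normal(loc=b0 + b1 * x, scale=1.0, size=n)
y = (Z > 0).astype(int)

print(y.mean())                     # Monte Carlo estimate of P(y = 1 | x)
print(norm.cdf(b0 + b1 * x))        # closed form Phi(b0 + b1*x); the two agree
```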
These notes are based on the report by Nan M. Laird. The full document can be accessed here: