AY 2025–26
Instructor: Debasis Sengupta
Office / Department: ASU
Email: sdebasis@isical.ac.in
Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%
A point estimate is simply a single number used to guess the value of an unknown population parameter.
Imagine you're a weather scientist trying to estimate tomorrow’s temperature in Kolkata: you take a sample of readings and report a single number as your best guess.
Let’s say you observe a sample:
\[ X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \sim F(\theta) \]where \( \theta \) is the unknown parameter, and \( F(\theta) \) is the distribution.
Let \( T(X) \) be a function — for example, the sample mean:
\[ T(X) = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \]
Then \( T(X) \) is called a point estimator of \( \theta \), and the number it produces on the observed data is the point estimate.
Since estimators depend on the sample \( X_1, \dots, X_n \), and those are random, the estimator itself is also random.
This means we can talk about its mean, variance, MSE, bias, etc.
Input: \( X_1, X_2, \dots, X_n \) → Estimator \( T(X) \) → Output: estimate (a number)
Different random samples yield different estimates — that's why studying the distribution of the estimator is crucial.
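A minimal simulation sketch of this point (assuming NumPy; the normal model, seed, and sample size are arbitrary illustrative choices): each repetition draws a fresh sample and reports a different value of \( T(X) = \bar{X} \).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

theta = 5.0        # "true" mean of the data-generating distribution (illustrative)
n = 30             # sample size
repeats = 5        # number of independent samples to draw

for r in range(repeats):
    x = rng.normal(loc=theta, scale=2.0, size=n)     # one random sample X_1, ..., X_n
    print(f"sample {r + 1}: T(X) = {x.mean():.3f}")  # the estimator: sample mean

# Each repetition prints a different number: the estimator T(X) is itself random,
# which is why its mean, variance, and MSE are meaningful quantities.
```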
Let \( \hat{\theta} \) be an estimator of the parameter \( \theta \).
Then, the bias of \( \hat{\theta} \) is defined as:
\[ B(\theta) := \mathbb{E}[\hat{\theta}] - \theta \]
Bias tells you, on average, how far your estimator is from the true parameter. It's a measure of systematic error.
Even though we don’t know the true \( \theta \), the estimator is a random variable, so we can study its expected value (mean), variance, etc. Its distribution is well defined because we assume the data come from the family \( F(\theta) \), which is known up to the value of \( \theta \).
Let’s say:
\[ X_1, X_2, \dots, X_n \overset{iid}{\sim} \text{Unif}(0, \theta) \]Then the maximum of the sample:
\[ \hat{\theta}_{MLE} = \max(X_i) \]is the Maximum Likelihood Estimator for \( \theta \). But it is biased. Specifically:
\[ \max(X_i) < \theta \quad \text{(with probability 1)} \]So:
\[ \mathbb{E}[\hat{\theta}_{MLE}] < \theta \quad \Rightarrow \quad B(\theta) = \mathbb{E}[\hat{\theta}_{MLE}] - \theta < 0 \]✅ Negative bias: the estimator underestimates the true parameter.
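In fact, the bias can be computed exactly: since \( P(\max_i X_i \le x) = (x/\theta)^n \) for \( 0 \le x \le \theta \), the density of the maximum is \( n x^{n-1}/\theta^n \), and
\[
\mathbb{E}[\hat{\theta}_{MLE}] = \int_0^\theta x \cdot \frac{n x^{n-1}}{\theta^n}\, dx = \frac{n}{n+1}\,\theta
\quad\Rightarrow\quad
B(\theta) = -\frac{\theta}{n+1} < 0.
\]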
Suppose the true \( \theta = 10 \), and you draw a sample of \( n = 5 \) values from \( \text{Unif}(0,10) \).
💡 You rarely get a value equal to 10 — the true upper limit is almost never attained in the sample. Hence, systematic underestimation.
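A small Monte Carlo sketch of this example (assuming NumPy; the seed and number of replications are arbitrary). With \( \theta = 10 \) and \( n = 5 \), the average of \( \max(X_i) \) should come out near \( \frac{n}{n+1}\theta \approx 8.33 \), well below 10:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 10.0      # true upper limit of Unif(0, theta)
n = 5             # sample size from the example above
reps = 100_000    # number of Monte Carlo replications (arbitrary)

samples = rng.uniform(0.0, theta, size=(reps, n))
mle = samples.max(axis=1)     # hat(theta)_MLE = max(X_i), once per replication

print("average of max(X_i):", round(mle.mean(), 3))          # close to 8.333
print("estimated bias     :", round(mle.mean() - theta, 3))  # negative
```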
Let:
\[ X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2) \]
Define two estimators:
\[ T(X) = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2, \qquad S(X) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \]
The naive estimator \( T(X) \) underestimates the true variance \( \sigma^2 \):
\[ \mathbb{E}[T(X)] = \frac{n-1}{n}\, \sigma^2 < \sigma^2 \quad \Rightarrow \quad B(T) < 0 \]
The corrected estimator \( S(X) \) is the familiar sample variance formula from statistics:
\[ \mathbb{E}[S(X)] = \sigma^2 \Rightarrow B(S) = 0 \]
Why the correction? Because \( \bar{X} \) itself depends on the data and uses up one degree of freedom (you are estimating the mean from the same data), the average of squared deviations about \( \bar{X} \) comes out systematically too small. Dividing by \( n - 1 \) instead of \( n \) compensates for this and ensures unbiasedness.
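The same point in a short simulation sketch (assuming NumPy; `ddof=0` divides by \( n \), `ddof=1` by \( n-1 \)). Averaged over many samples, the divisor-\( n \) estimator falls below \( \sigma^2 \), while the divisor-\( (n-1) \) estimator centres on it:

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma2 = 0.0, 4.0     # true mean and variance (illustrative)
n, reps = 10, 100_000     # small sample size, many replications

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

t = x.var(axis=1, ddof=0)   # T(X): divide by n   (naive)
s = x.var(axis=1, ddof=1)   # S(X): divide by n-1 (corrected)

print("E[T(X)] ≈", round(t.mean(), 3))   # about (n-1)/n * sigma2 = 3.6  -> biased low
print("E[S(X)] ≈", round(s.mean(), 3))   # about sigma2 = 4.0            -> unbiased
```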
| Estimator | Formula | Bias |
|---|---|---|
| Max of sample (Uniform case) | \( \hat{\theta}_{MLE} = \max(X_i) \) | Negative |
| Naive variance | \( T(X) = \frac{1}{n} \sum (X_i - \bar{X})^2 \) | Negative |
| Corrected variance | \( S(X) = \frac{1}{n-1} \sum (X_i - \bar{X})^2 \) | 0 (Unbiased) |
Pictorially: a biased estimator's values cluster around \( \mathbb{E}[\hat{\theta}] \neq \theta \), systematically off from the truth, while an unbiased estimator's values cluster around \( \mathbb{E}[\hat{\theta}] = \theta \).
If \( \hat{\theta} \) is an estimator for the parameter \( \theta \), then:
\[ \text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] \]This is the expected squared distance between the estimator and the true value. It measures overall accuracy — how close your guesses are to the truth, on average.
We can rewrite the MSE as:
\[ \text{MSE}(\hat{\theta}) = \underbrace{\left( \mathbb{E}[\hat{\theta}] - \theta \right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{\theta})}_{\text{Variance}} \]✅ This is called the bias–variance decomposition.
Use the identity:
\[ \mathbb{E}[(X - a)^2] = (\mathbb{E}[X] - a)^2 + \text{Var}(X) \]Set \( X = \hat{\theta} \) and \( a = \theta \).
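For completeness, the identity follows by adding and subtracting \( \mathbb{E}[X] \) inside the square; the cross term vanishes because \( \mathbb{E}[X - \mathbb{E}[X]] = 0 \):
\[
\mathbb{E}[(X - a)^2]
= \mathbb{E}\big[(X - \mathbb{E}[X] + \mathbb{E}[X] - a)^2\big]
= \underbrace{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}_{\text{Var}(X)}
+ 2(\mathbb{E}[X] - a)\underbrace{\mathbb{E}\big[X - \mathbb{E}[X]\big]}_{=0}
+ (\mathbb{E}[X] - a)^2.
\]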
Even an unbiased estimator (\( B(\theta) = 0 \)) can have large MSE if its variance is high.
MSE penalizes both error sources: systematic offset (bias) and random scatter (variance).
Suppose you’re using an archery robot to hit a bullseye (the true \( \theta \)): the bias is how far the average hit lands from the bullseye, and the variance is how widely the hits scatter around their average.
→ MSE is like asking: “How far are you typically from the bullseye?”
You're estimating your friend’s height, known to be \( \theta = 170 \) cm.
Estimator is unbiased:
\[ \mathbb{E}[\hat{\theta}] = 170 \]But in 5 trials, the outputs are:
| Trial | Estimate (θ̂) |
|---|---|
| 1 | 130 cm |
| 2 | 240 cm |
| 3 | 80 cm |
| 4 | 200 cm |
| 5 | 170 cm |
✅ The average = 170
❌ But the individual guesses are all over the place — high variance
True θ = 170; estimates: [130, 240, 80, 200, 170] → MSE = E[(θ̂ − 170)²] is high!
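Working out the empirical average squared deviation for these five trials:
\[
\frac{(130-170)^2 + (240-170)^2 + (80-170)^2 + (200-170)^2 + (170-170)^2}{5}
= \frac{1600 + 4900 + 8100 + 900 + 0}{5} = 3100 \ \text{cm}^2,
\]
a root-mean-square error of roughly 56 cm, even though the average estimate is exactly 170 cm.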
You don't just want your estimator to be right on average — you want it to be reliably close to the true value most of the time.
That’s what MSE captures.
| Estimator Type | Bias | Variance | MSE = Bias² + Var |
|---|---|---|---|
| Good Estimator | Low | Low | Low |
| Unbiased but high variance | 0 | High | High |
| Slight bias, low variance | Small | Small | Possibly Low! |
| Biased and high variance | High | High | High |
Sometimes a biased estimator can have lower MSE than an unbiased one!
So in practice, we often prefer an estimator with slight bias and low variance over one that’s unbiased but very noisy.
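A concrete sketch of this trade-off (assuming NumPy; the \( n+1 \) divisor below is a standard illustrative choice, not something introduced earlier in these notes): for normal data, dividing the sum of squared deviations by \( n+1 \) gives a downward-biased variance estimator whose MSE is nevertheless smaller than that of the unbiased \( S^2 \).

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2 = 4.0             # true variance (illustrative)
n, reps = 10, 200_000    # sample size and number of replications

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # sum of squared deviations

unbiased = ss / (n - 1)   # S^2: unbiased, but noisier
shrunk   = ss / (n + 1)   # biased downward, but less variable

def mse(est):
    return ((est - sigma2) ** 2).mean()

print("MSE of unbiased S^2         :", round(mse(unbiased), 3))  # ≈ 2*sigma2^2/(n-1) ≈ 3.56
print("MSE of biased /(n+1) version:", round(mse(shrunk), 3))    # ≈ 2*sigma2^2/(n+1) ≈ 2.91
```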
Among all unbiased estimators of a parameter \( \theta \), is there one with minimum possible variance?
This leads us to...
Let \( X \sim f_\theta(x) \) and let \( \hat{\theta}(X) \) be an unbiased estimator of \( \theta \).
Then \( \hat{\theta}_{\text{UMVU}} \) is defined as:
\[ \hat{\theta}_{\text{UMVU}} = \arg\min_{\hat{\theta} \text{ unbiased}} \ \text{Var}(\hat{\theta}) \]It is the best (i.e., least variable) among all unbiased estimators — for every value of \( \theta \) (this uniformity is important).
Among unbiased estimators, we want the one with smallest fluctuation (i.e., most stable).
The UMVU estimator is the one with lowest MSE among all unbiased estimators, since for unbiased estimators:
\[ \text{MSE} = \text{Bias}^2 + \text{Var} = 0 + \text{Var} \Rightarrow \text{Minimizing MSE} = \text{Minimizing Variance} \]So for unbiased estimators, MSE = variance.
You might think: “Why not just minimize variance directly, even allowing bias?”
But here's the trap: variance alone can be driven all the way to zero by an estimator that ignores the data entirely.
For example:
\[ \hat{\theta}(X) = 0 \quad \text{(always returns 0)} \Rightarrow \text{Var} = 0,\quad \text{Bias} = \theta \Rightarrow \text{MSE} = \theta^2 \]This is utter garbage as an estimator — it just ignores the data.
✅ So, we restrict ourselves to the class of unbiased estimators, and minimize variance within that class. That gives us the UMVU.
Minimize MSE(\( \hat{\theta} \)) subject to the constraint that \( \mathbb{E}[\hat{\theta}] = \theta \)
This is a constrained optimization problem: minimize variance within the class of unbiased estimators.
Classical results such as the Rao–Blackwell and Lehmann–Scheffé theorems give a method to actually construct the UMVU estimator, using sufficient and complete statistics.
But you don’t need them just yet to understand the intuition.
| Concept | Description |
|---|---|
| Bias | Systematic error: how far off your estimator is on average |
| Variance | How much your estimator fluctuates across samples |
| MSE | Total error: combines bias and variance |
| UMVU | Among all unbiased estimators, the one with smallest variance |
| Degenerate Estimator | Zero variance but useless (doesn’t depend on data) |
Unbiased estimators:
↑
| (*) High variance
| (*)
| (*) Lower variance
| (*) ← UMVU (min var)
+--------------------------→ Estimators
You're seeking the least wiggly estimator among the unbiased ones.
We want an estimator that is unbiased, i.e., correct on average.
But we saw earlier that not all unbiased estimators are equally good — some have higher variance than others.
❓What is the best we can do, variance-wise, among all unbiased estimators?
➡️ The Cramér–Rao Bound gives us a theoretical lower limit on this variance.
Let \( X \sim f_\theta(x) \), and let \( W(X) \) be an unbiased estimator of \( \theta \).
Under some regularity conditions (we’ll get to them), the variance of \( W(X) \) satisfies:
\[ \boxed{ \text{Var}(W(X)) \geq \frac{1}{\mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right]} } \]✅ This is the Cramér–Rao Lower Bound (CRLB).
We often take logarithms of likelihoods because they make derivatives easier and preserve the maximum.
\[ \frac{d}{d\theta} \log f_\theta(X) \]This is called the score function — it measures how sensitive the likelihood is to changes in \( \theta \).
Define the Fisher Information:
\[ \mathcal{I}(\theta) := \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right] \]It tells us how much information the data \( X \) carries about \( \theta \).
✅ The more concentrated the likelihood is around the true \( \theta \), the higher the Fisher Information.
In terms of the Fisher Information, the CRLB for an unbiased \( W(X) \) becomes:
\[ \boxed{ \text{Var}(W(X)) \geq \frac{1}{\mathcal{I}(\theta)} } \]At first glance, it may seem odd that the RHS involves expectations of functions of \( X \), the random variable we’re estimating from. So how can this be a clean lower bound?
Here’s the trick: the expectation on the RHS is taken over \( X \) for a fixed value of \( \theta \), so it is a deterministic function of \( \theta \), not of the data. And since \( W(X) \) is unbiased, \( \mathbb{E}_\theta[W(X)] = \theta \), so \( \frac{d}{d\theta} \mathbb{E}_\theta[W(X)] = 1 \).
So the general inequality:
\[ \text{Var}(W(X)) \geq \frac{ \left( \frac{d}{d\theta} \mathbb{E}_\theta[W(X)] \right)^2 }{ \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right] } \]becomes:
\[ \text{Var}(W(X)) \geq \frac{1^2}{\mathcal{I}(\theta)} = \frac{1}{\mathcal{I}(\theta)} \]So the bound does not depend on the estimator, only on the model \( f_\theta(x) \).
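A standard worked example shows the bound in action: for \( X_1, \dots, X_n \overset{iid}{\sim} N(\theta, \sigma^2) \) with \( \sigma^2 \) known, the score and the Fisher Information of the whole sample are
\[
\frac{d}{d\theta} \log f_\theta(X_1, \dots, X_n) = \sum_{i=1}^n \frac{X_i - \theta}{\sigma^2},
\qquad
\mathcal{I}_n(\theta) = \frac{n}{\sigma^2},
\]
so the CRLB is \( \sigma^2 / n \), which is exactly \( \text{Var}(\bar{X}) \): the sample mean attains the bound in this model.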
| Quantity | Meaning |
|---|---|
| \( W(X) \) | Your estimator (e.g., sample mean, sample max) |
| \( \mathbb{E}[W(X)] = \theta \) | Unbiasedness condition |
| \( \text{Var}(W(X)) \) | How much your estimator varies |
| \( \mathcal{I}(\theta) \) | How much your data “tells” you about \( \theta \) |
CRLB tells you: 🔻 You can’t beat this variance limit for any unbiased estimator.
You have many runners (estimators). Each is unbiased, but runs with a different "wiggle" (variance).
The Cramér–Rao bound is a solid track limit — no one can run below it.
Variance
│
│ (*) Sample Variance Estimator
│  |
│  |
│  |    (*) Another Estimator
│  |     |
│  |_____|___________________
│         CRLB (floor)
└─────────────────────────────► Estimators
If someone reaches the floor — they are efficient.
If an estimator achieves the CRLB (i.e., its variance equals the bound), it is called an efficient estimator.
In large samples, MLEs typically achieve the CRLB asymptotically.
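A quick numerical sketch of the same fact (assuming NumPy; normal model with known variance, where the CRLB for the mean is \( \sigma^2/n \) and \( \bar{X} \) attains it):

```python
import numpy as np

rng = np.random.default_rng(4)

theta, sigma2 = 3.0, 2.0   # true mean and (known) variance, illustrative
n, reps = 25, 200_000      # sample size and number of replications

x = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)      # unbiased estimator of theta, once per replication

print("simulated Var(xbar):", round(xbar.var(), 5))   # empirical variance of the estimator
print("CRLB = sigma2 / n  :", sigma2 / n)             # the two should nearly coincide
```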