AY 2025–26
Instructor: Debasis Sengupta
Office / Department: ASU
Email: sdebasis@isical.ac.in
Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%
A point estimate is simply a single number used to guess the value of an unknown population parameter.
Imagine you're a weather scientist trying to estimate tomorrow’s temperature in Kolkata: you take a sample of readings and report a single number as your best guess.
Let’s say you observe a sample:
\[ X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \sim F(\theta) \]where \( \theta \) is the unknown parameter, and \( F(\theta) \) is the distribution.
Let \( T(X) \) be a function — for example, the sample mean:
\[ T(X) = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \]
Then \( T(X) \) is called a point estimator of \( \theta \), and the number it produces on the observed data is the point estimate.
Since estimators depend on the sample \( X_1, \dots, X_n \), and those are random, the estimator itself is also random.
This means we can talk about its mean, variance, MSE, bias, etc.
Input: \( X_1, X_2, \dots, X_n \) → Estimator \( T(X) \) → Output: estimate (a number)
Different random samples yield different estimates — that's why studying the distribution of the estimator is crucial.
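A minimal simulation sketch of this point (assuming NumPy; the normal model, seed, and sample size are arbitrary illustrative choices): each repetition draws a fresh sample and reports a different value of \( T(X) = \bar{X} \).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

theta = 5.0        # "true" mean of the data-generating distribution (illustrative)
n = 30             # sample size
repeats = 5        # number of independent samples to draw

for r in range(repeats):
    x = rng.normal(loc=theta, scale=2.0, size=n)     # one random sample X_1, ..., X_n
    print(f"sample {r + 1}: T(X) = {x.mean():.3f}")  # the estimator: sample mean

# Each repetition prints a different number: the estimator T(X) is itself random,
# which is why its mean, variance, and MSE are meaningful quantities.
```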
Let \( \hat{\theta} \) be an estimator of the parameter \( \theta \).
Then, the bias of \( \hat{\theta} \) is defined as:
\[ B(\theta) := \mathbb{E}[\hat{\theta}] - \theta \]
Bias tells you, on average, how far your estimator is from the true parameter. It's a measure of systematic error.
Even though we don’t know the true \( \theta \), the estimator is a random variable, so we can study its expected value (mean), variance, etc. Its distribution is well defined because we assume the data come from the family \( F(\theta) \), which is known up to the value of \( \theta \).
Let’s say:
\[ X_1, X_2, \dots, X_n \overset{iid}{\sim} \text{Unif}(0, \theta) \]Then the maximum of the sample:
\[ \hat{\theta}_{MLE} = \max(X_i) \]is the Maximum Likelihood Estimator for \( \theta \). But it is biased. Specifically:
\[ \max(X_i) < \theta \quad \text{(with probability 1)} \]So:
\[ \mathbb{E}[\hat{\theta}_{MLE}] < \theta \quad \Rightarrow \quad B(\theta) = \mathbb{E}[\hat{\theta}_{MLE}] - \theta < 0 \]✅ Negative bias: the estimator underestimates the true parameter.
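In fact, the bias can be computed exactly: since \( P(\max_i X_i \le x) = (x/\theta)^n \) for \( 0 \le x \le \theta \), the density of the maximum is \( n x^{n-1}/\theta^n \), and
\[
\mathbb{E}[\hat{\theta}_{MLE}] = \int_0^\theta x \cdot \frac{n x^{n-1}}{\theta^n}\, dx = \frac{n}{n+1}\,\theta
\quad\Rightarrow\quad
B(\theta) = -\frac{\theta}{n+1} < 0.
\]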
Suppose the true \( \theta = 10 \), and you draw a sample of \( n = 5 \) values from \( \text{Unif}(0,10) \).
💡 You rarely get a value equal to 10 — the true upper limit is almost never attained in the sample. Hence, systematic underestimation.
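A small Monte Carlo sketch of this example (assuming NumPy; the seed and number of replications are arbitrary). With \( \theta = 10 \) and \( n = 5 \), the average of \( \max(X_i) \) should come out near \( \frac{n}{n+1}\theta \approx 8.33 \), well below 10:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 10.0      # true upper limit of Unif(0, theta)
n = 5             # sample size from the example above
reps = 100_000    # number of Monte Carlo replications (arbitrary)

samples = rng.uniform(0.0, theta, size=(reps, n))
mle = samples.max(axis=1)     # hat(theta)_MLE = max(X_i), once per replication

print("average of max(X_i):", round(mle.mean(), 3))          # close to 8.333
print("estimated bias     :", round(mle.mean() - theta, 3))  # negative
```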
Let:
\[ X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2) \]
Define two estimators:
\[ T(X) = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2, \qquad S(X) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \]
The naive estimator \( T(X) \) underestimates the true variance \( \sigma^2 \):
\[ \mathbb{E}[T(X)] = \frac{n-1}{n}\, \sigma^2 < \sigma^2 \quad \Rightarrow \quad B(T) < 0 \]
The corrected estimator \( S(X) \) is the familiar sample variance formula from statistics:
\[ \mathbb{E}[S(X)] = \sigma^2 \Rightarrow B(S) = 0 \]
Why the correction? Because \( \bar{X} \) itself depends on the data and uses up one degree of freedom (you are estimating the mean from the same data), the average of squared deviations about \( \bar{X} \) comes out systematically too small. Dividing by \( n - 1 \) instead of \( n \) compensates for this and ensures unbiasedness.
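The same point in a short simulation sketch (assuming NumPy; `ddof=0` divides by \( n \), `ddof=1` by \( n-1 \)). Averaged over many samples, the divisor-\( n \) estimator falls below \( \sigma^2 \), while the divisor-\( (n-1) \) estimator centres on it:

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma2 = 0.0, 4.0     # true mean and variance (illustrative)
n, reps = 10, 100_000     # small sample size, many replications

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

t = x.var(axis=1, ddof=0)   # T(X): divide by n   (naive)
s = x.var(axis=1, ddof=1)   # S(X): divide by n-1 (corrected)

print("E[T(X)] ≈", round(t.mean(), 3))   # about (n-1)/n * sigma2 = 3.6  -> biased low
print("E[S(X)] ≈", round(s.mean(), 3))   # about sigma2 = 4.0            -> unbiased
```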
| Estimator | Formula | Bias |
|---|---|---|
| Max of sample (Uniform case) | \( \hat{\theta}_{MLE} = \max(X_i) \) | Negative |
| Naive variance | \( T(X) = \frac{1}{n} \sum (X_i - \bar{X})^2 \) | Negative |
| Corrected variance | \( S(X) = \frac{1}{n-1} \sum (X_i - \bar{X})^2 \) | 0 (Unbiased) |
Pictorially: a biased estimator's values cluster around \( \mathbb{E}[\hat{\theta}] \neq \theta \), systematically off from the truth, while an unbiased estimator's values cluster around \( \mathbb{E}[\hat{\theta}] = \theta \).
If \( \hat{\theta} \) is an estimator for the parameter \( \theta \), then:
\[ \text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] \]This is the expected squared distance between the estimator and the true value. It measures overall accuracy — how close your guesses are to the truth, on average.
We can rewrite the MSE as:
\[ \text{MSE}(\hat{\theta}) = \underbrace{\left( \mathbb{E}[\hat{\theta}] - \theta \right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{\theta})}_{\text{Variance}} \]✅ This is called the bias–variance decomposition.
Use the identity:
\[ \mathbb{E}[(X - a)^2] = (\mathbb{E}[X] - a)^2 + \text{Var}(X) \]Set \( X = \hat{\theta} \) and \( a = \theta \).
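For completeness, the identity follows by adding and subtracting \( \mathbb{E}[X] \) inside the square; the cross term vanishes because \( \mathbb{E}[X - \mathbb{E}[X]] = 0 \):
\[
\mathbb{E}[(X - a)^2]
= \mathbb{E}\big[(X - \mathbb{E}[X] + \mathbb{E}[X] - a)^2\big]
= \underbrace{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}_{\text{Var}(X)}
+ 2(\mathbb{E}[X] - a)\underbrace{\mathbb{E}\big[X - \mathbb{E}[X]\big]}_{=0}
+ (\mathbb{E}[X] - a)^2.
\]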
Even an unbiased estimator (\( B(\theta) = 0 \)) can have large MSE if its variance is high.
MSE penalizes both error sources: systematic offset (bias) and random scatter (variance).
Suppose you’re using an archery robot to hit a bullseye (the true \( \theta \)): the bias is how far the average hit lands from the bullseye, and the variance is how widely the hits scatter around their average.
→ MSE is like asking: “How far are you typically from the bullseye?”
You're estimating your friend’s height, known to be \( \theta = 170 \) cm.
Estimator is unbiased:
\[ \mathbb{E}[\hat{\theta}] = 170 \]But in 5 trials, the outputs are:
| Trial | Estimate (θ̂) |
|---|---|
| 1 | 130 cm |
| 2 | 240 cm |
| 3 | 80 cm |
| 4 | 200 cm |
| 5 | 170 cm |
✅ The average = 170
❌ But the individual guesses are all over the place — high variance
True θ = 170; estimates: [130, 240, 80, 200, 170] → MSE = E[(θ̂ − 170)²] is high!
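Working out the empirical average squared deviation for these five trials:
\[
\frac{(130-170)^2 + (240-170)^2 + (80-170)^2 + (200-170)^2 + (170-170)^2}{5}
= \frac{1600 + 4900 + 8100 + 900 + 0}{5} = 3100 \ \text{cm}^2,
\]
a root-mean-square error of roughly 56 cm, even though the average estimate is exactly 170 cm.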
You don't just want your estimator to be right on average — you want it to be reliably close to the true value most of the time.
That’s what MSE captures.
| Estimator Type | Bias | Variance | MSE = Bias² + Var |
|---|---|---|---|
| Good Estimator | Low | Low | Low |
| Unbiased but high variance | 0 | High | High |
| Slight bias, low variance | Small | Small | Possibly Low! |
| Biased and high variance | High | High | High |
Sometimes a biased estimator can have lower MSE than an unbiased one!
So in practice, we often prefer an estimator with slight bias and low variance over one that’s unbiased but very noisy.
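A concrete sketch of this trade-off (assuming NumPy; the \( n+1 \) divisor below is a standard illustrative choice, not something introduced earlier in these notes): for normal data, dividing the sum of squared deviations by \( n+1 \) gives a downward-biased variance estimator whose MSE is nevertheless smaller than that of the unbiased \( S^2 \).

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2 = 4.0             # true variance (illustrative)
n, reps = 10, 200_000    # sample size and number of replications

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # sum of squared deviations

unbiased = ss / (n - 1)   # S^2: unbiased, but noisier
shrunk   = ss / (n + 1)   # biased downward, but less variable

def mse(est):
    return ((est - sigma2) ** 2).mean()

print("MSE of unbiased S^2         :", round(mse(unbiased), 3))  # ≈ 2*sigma2^2/(n-1) ≈ 3.56
print("MSE of biased /(n+1) version:", round(mse(shrunk), 3))    # ≈ 2*sigma2^2/(n+1) ≈ 2.91
```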
Among all unbiased estimators of a parameter \( \theta \), is there one with minimum possible variance?
This leads us to...
Let \( X \sim f_\theta(x) \) and let \( \hat{\theta}(X) \) be an unbiased estimator of \( \theta \).
Then \( \hat{\theta}_{\text{UMVU}} \) is defined as:
\[ \hat{\theta}_{\text{UMVU}} = \arg\min_{\hat{\theta} \text{ unbiased}} \ \text{Var}(\hat{\theta}) \]It is the best (i.e., least variable) among all unbiased estimators — for every value of \( \theta \) (this uniformity is important).
Among unbiased estimators, we want the one with smallest fluctuation (i.e., most stable).
The UMVU estimator is the one with lowest MSE among all unbiased estimators, since for unbiased estimators:
\[ \text{MSE} = \text{Bias}^2 + \text{Var} = 0 + \text{Var} \Rightarrow \text{Minimizing MSE} = \text{Minimizing Variance} \]So for unbiased estimators, MSE = variance.
You might think: “Why not just minimize variance directly, even allowing bias?”
But here's the trap: variance alone can be driven all the way to zero by an estimator that ignores the data entirely.
For example:
\[ \hat{\theta}(X) = 0 \quad \text{(always returns 0)} \Rightarrow \text{Var} = 0,\quad \text{Bias} = \theta \Rightarrow \text{MSE} = \theta^2 \]This is utter garbage as an estimator — it just ignores the data.
✅ So, we restrict ourselves to the class of unbiased estimators, and minimize variance within that class. That gives us the UMVU.
Minimize MSE(\( \hat{\theta} \)) subject to the constraint that \( \mathbb{E}[\hat{\theta}] = \theta \)
This is a constrained optimization problem: minimize variance within the class of unbiased estimators.
Classical results such as the Rao–Blackwell and Lehmann–Scheffé theorems give a method to actually construct the UMVU estimator, using sufficient and complete statistics.
But you don’t need them just yet to understand the intuition.
| Concept | Description |
|---|---|
| Bias | Systematic error: how far off your estimator is on average |
| Variance | How much your estimator fluctuates across samples |
| MSE | Total error: combines bias and variance |
| UMVU | Among all unbiased estimators, the one with smallest variance |
| Degenerate Estimator | Zero variance but useless (doesn’t depend on data) |
Unbiased estimators:
↑
| (*) High variance
| (*)
| (*) Lower variance
| (*) ← UMVU (min var)
+--------------------------→ Estimators
You're seeking the least wiggly estimator among the unbiased ones.
We want an estimator that is unbiased, i.e., correct on average.
But we saw earlier that not all unbiased estimators are equally good — some have higher variance than others.
❓What is the best we can do, variance-wise, among all unbiased estimators?
➡️ The Cramér–Rao Bound gives us a theoretical lower limit on this variance.
Let \( X \sim f_\theta(x) \), and let \( W(X) \) be an unbiased estimator of \( \theta \).
Under some regularity conditions (we’ll get to them), the variance of \( W(X) \) satisfies:
\[ \boxed{ \text{Var}(W(X)) \geq \frac{1}{\mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right]} } \]✅ This is the Cramér–Rao Lower Bound (CRLB).
We often take logarithms of likelihoods because they make derivatives easier and preserve the maximum.
\[ \frac{d}{d\theta} \log f_\theta(X) \]This is called the score function — it measures how sensitive the likelihood is to changes in \( \theta \).
Define the Fisher Information:
\[ \mathcal{I}(\theta) := \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right] \]It tells us how much information the data \( X \) carries about \( \theta \).
✅ The more concentrated the likelihood is around the true \( \theta \), the higher the Fisher Information.
In terms of the Fisher Information, the CRLB for an unbiased \( W(X) \) becomes:
\[ \boxed{ \text{Var}(W(X)) \geq \frac{1}{\mathcal{I}(\theta)} } \]At first glance, it may seem odd that the RHS involves expectations of functions of \( X \), the random variable we’re estimating from. So how can this be a clean lower bound?
Here’s the trick: the expectation on the RHS is taken over \( X \) for a fixed value of \( \theta \), so it is a deterministic function of \( \theta \), not of the data. And since \( W(X) \) is unbiased, \( \mathbb{E}_\theta[W(X)] = \theta \), so \( \frac{d}{d\theta} \mathbb{E}_\theta[W(X)] = 1 \).
So the general inequality:
\[ \text{Var}(W(X)) \geq \frac{ \left( \frac{d}{d\theta} \mathbb{E}_\theta[W(X)] \right)^2 }{ \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right] } \]becomes:
\[ \text{Var}(W(X)) \geq \frac{1^2}{\mathcal{I}(\theta)} = \frac{1}{\mathcal{I}(\theta)} \]So the bound does not depend on the estimator, only on the model \( f_\theta(x) \).
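A standard worked example shows the bound in action: for \( X_1, \dots, X_n \overset{iid}{\sim} N(\theta, \sigma^2) \) with \( \sigma^2 \) known, the score and the Fisher Information of the whole sample are
\[
\frac{d}{d\theta} \log f_\theta(X_1, \dots, X_n) = \sum_{i=1}^n \frac{X_i - \theta}{\sigma^2},
\qquad
\mathcal{I}_n(\theta) = \frac{n}{\sigma^2},
\]
so the CRLB is \( \sigma^2 / n \), which is exactly \( \text{Var}(\bar{X}) \): the sample mean attains the bound in this model.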
| Quantity | Meaning |
|---|---|
| \( W(X) \) | Your estimator (e.g., sample mean, sample max) |
| \( \mathbb{E}[W(X)] = \theta \) | Unbiasedness condition |
| \( \text{Var}(W(X)) \) | How much your estimator varies |
| \( \mathcal{I}(\theta) \) | How much your data “tells” you about \( \theta \) |
CRLB tells you: 🔻 You can’t beat this variance limit for any unbiased estimator.
You have many runners (estimators). Each is unbiased, but runs with a different "wiggle" (variance).
The Cramér–Rao bound is a solid track limit — no one can run below it.
Variance
│
│ (*) Sample Variance Estimator
│  |
│  |
│  |    (*) Another Estimator
│  |     |
│  |_____|___________________
│         CRLB (floor)
└─────────────────────────────► Estimators
If someone reaches the floor — they are efficient.
If an estimator achieves the CRLB (i.e., its variance equals the bound), it is called an efficient estimator.
In large samples, MLEs typically achieve the CRLB asymptotically.
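A quick numerical sketch of the same fact (assuming NumPy; normal model with known variance, where the CRLB for the mean is \( \sigma^2/n \) and \( \bar{X} \) attains it):

```python
import numpy as np

rng = np.random.default_rng(4)

theta, sigma2 = 3.0, 2.0   # true mean and (known) variance, illustrative
n, reps = 25, 200_000      # sample size and number of replications

x = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)      # unbiased estimator of theta, once per replication

print("simulated Var(xbar):", round(xbar.var(), 5))   # empirical variance of the estimator
print("CRLB = sigma2 / n  :", sigma2 / n)             # the two should nearly coincide
```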