Statistical Methods III: Week-1

AY 2025–26

Instructor: Debasis Sengupta

Office / Department: ASU

Email: sdebasis@isical.ac.in

Marking Scheme:
Assignments: 20% | Midterm Test: 30% | End Semester: 50%


🎯 Definition 1.1 — Point Estimation

A point estimate is simply a single number used to guess the value of an unknown population parameter.

📌 Key Terms

  • Parameter: A fixed, unknown number that describes a feature of a population (like \( \mu, \sigma^2 \)).
  • Sample: A subset of data from the population (e.g., \( X_1, X_2, \dots, X_n \)).
  • Statistic: Any function of the sample. E.g., \( \bar{X} = \frac{1}{n} \sum X_i \).
  • Estimator: A statistic used to estimate a parameter — it's a function or rule.
  • Estimate: The numerical value you get when you plug the sample into the estimator.

🧠 Analogy

Imagine you're a weather scientist trying to estimate tomorrow’s temperature in Kolkata:

  • The true average temperature of tomorrow is unknown — this is the parameter.
  • You gather sample data (e.g., past 10 days’ temperatures).
  • You decide to take their average — that’s your estimator: a method.
  • After plugging in the 10 values, the result (e.g., 32.5°C) is the estimate.

🧮 Mathematical Framework

Let’s say you observe a sample:

\[ X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \sim F(\theta) \]

where \( \theta \) is the unknown parameter, and \( F(\theta) \) is the distribution.

Let \( T(X) \) be a function — for example, the sample mean:

\[ T(X) = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \]

Then:

  • \( T(X) \) is a statistic
  • If we use it to estimate \( \theta \), it becomes the estimator
  • When we plug in real values for \( X_i \), we get an estimate of \( \theta \)

🔁 Estimators are Random Variables

Since estimators depend on the sample \( X_1, \dots, X_n \), and those are random, the estimator itself is also random.

This means we can talk about its mean, variance, MSE, bias, etc.

📈 Mental Visualization

Input: \( X_1, X_2, \dots, X_n \) → Estimator \( T(X) \) → Output: estimate (a number)

Different random samples yield different estimates — that's why studying the distribution of the estimator is crucial.
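This is easy to see in a simulation. Below is a minimal Python sketch (the population, its parameters, and the sample size are chosen here purely for illustration): the same rule \( T(X) = \bar{X} \) is applied to several independent samples, and each sample yields a different estimate.

```python
import random

random.seed(0)

def estimator(sample):
    """The estimator T(X): here, the sample mean."""
    return sum(sample) / len(sample)

# Draw several independent samples from the same population
# (chosen here as Normal with true mean mu = 5, sd = 2) and
# apply the same rule to each.
mu, sigma, n = 5.0, 2.0, 30
estimates = [estimator([random.gauss(mu, sigma) for _ in range(n)])
             for _ in range(5)]

print(estimates)  # five different numbers, all near mu = 5
```

Each run of the inner list is one realized sample, and each printed number is one estimate; the spread among them is exactly the randomness of the estimator.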

✅ Summary

  • Estimator = rule (statistic) that processes sample data
  • Estimate = number you get when you apply the rule to a particular sample
  • Estimators are random variables → can study their properties
  • Estimating a single number like \( \theta \) is called point estimation

🎯 Definition 1.2 — Bias of an Estimator

Let \( \hat{\theta} \) be an estimator of the parameter \( \theta \).

Then, the bias of \( \hat{\theta} \) is defined as:

\[ B(\theta) := \mathbb{E}[\hat{\theta}] - \theta \]

🔍 What does this mean?

Bias tells you how far, on average, your estimator is from the true parameter. It's a measure of systematic error.

🧠 Remark 1.3 — Why we can talk about \( \mathbb{E}[\hat{\theta}] \)

Even though we don’t know the true \( \theta \), the estimator is a random variable, so we can study its expected value (mean), variance, etc. Its distribution is well defined because we assume the data come from a known family \( F(\theta) \); only the parameter \( \theta \) is unknown.

🧪 Example 1.4 — Biased Estimator (Uniform Distribution)

Let’s say:

\[ X_1, X_2, \dots, X_n \overset{iid}{\sim} \text{Unif}(0, \theta) \]

Then the maximum of the sample:

\[ \hat{\theta}_{MLE} = \max(X_i) \]

is the Maximum Likelihood Estimator for \( \theta \). But it is biased. Specifically:

\[ \max(X_i) < \theta \quad \text{(with probability 1)} \]

So:

\[ \mathbb{E}[\hat{\theta}_{MLE}] < \theta \quad \Rightarrow \quad B(\theta) = \mathbb{E}[\hat{\theta}_{MLE}] - \theta < 0 \]

Negative bias: the estimator underestimates the true parameter.

📊 Intuitive Visualization

Suppose the true \( \theta = 10 \), and you draw a sample of 5 values from \( \text{Unif}(0, 10) \).

💡 You rarely get a value equal to 10 — the true upper limit is almost never attained in the sample. Hence, systematic underestimation.
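A quick Monte Carlo check makes the underestimation visible. This Python sketch (with \( \theta \), \( n \), and the number of repetitions chosen here for illustration) averages \( \max(X_i) \) over many samples and compares it with the known theoretical value \( \mathbb{E}[\max(X_i)] = \frac{n}{n+1}\theta \):

```python
import random

random.seed(1)

theta, n, reps = 10.0, 5, 100_000

# Monte Carlo: average of max(X_1, ..., X_n) over many samples
total = 0.0
for _ in range(reps):
    total += max(random.uniform(0, theta) for _ in range(n))
avg_max = total / reps

# Theory: E[max] = n * theta / (n + 1) = 8.33..., strictly below theta
print(round(avg_max, 2))  # ≈ 8.33, confirming the negative bias
```

The simulated mean sits well below \( \theta = 10 \), matching \( B(\theta) = -\theta/(n+1) \approx -1.67 \).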

🧪 Example 1.5 — Biased vs. Unbiased Estimator (Variance Estimation)

Let:

\[ X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2) \]

Define two estimators:

❌ Biased Estimator:

\[ T(X) = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 \]

This underestimates the true variance \( \sigma^2 \).

So, \( \mathbb{E}[T(X)] < \sigma^2 \), hence \( B(T) < 0 \).

✅ Unbiased Estimator:

\[ S(X) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \]

This is the familiar sample variance formula from statistics:

\[ \mathbb{E}[S(X)] = \sigma^2 \Rightarrow B(S) = 0 \]

📌 Why does dividing by \( n - 1 \) fix the bias?

Because \( \bar{X} \) is itself computed from the same data, the deviations \( X_i - \bar{X} \) are on average smaller than the deviations from the true mean would be: one degree of freedom is "used up" in estimating the mean.

Dividing by \( n - 1 \) instead of \( n \) compensates exactly for this shrinkage, which ensures unbiasedness.
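A simulation confirms the correction. This minimal Python sketch (parameters chosen here for illustration) averages both estimators over many samples from \( N(0, 4) \):

```python
import random

random.seed(2)

mu, sigma2, n, reps = 0.0, 4.0, 10, 50_000
sum_biased = sum_unbiased = 0.0

for _ in range(reps):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    sum_biased += ss / n          # T(X): divide by n
    sum_unbiased += ss / (n - 1)  # S(X): divide by n - 1

biased_avg = sum_biased / reps
unbiased_avg = sum_unbiased / reps
print(round(biased_avg, 2))    # ≈ (n-1)/n * sigma2 = 3.6
print(round(unbiased_avg, 2))  # ≈ sigma2 = 4.0
```

The \( 1/n \) version settles near \( \frac{n-1}{n}\sigma^2 \), while the \( 1/(n-1) \) version settles near \( \sigma^2 \) itself.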

🔁 Summary Table

| Estimator | Formula | Bias |
|---|---|---|
| Max of sample (Uniform case) | \( \hat{\theta}_{MLE} = \max(X_i) \) | Negative |
| Naive variance | \( T(X) = \frac{1}{n} \sum (X_i - \bar{X})^2 \) | Negative |
| Corrected variance | \( S(X) = \frac{1}{n-1} \sum (X_i - \bar{X})^2 \) | 0 (Unbiased) |

🧠 Visual Intuition: Bias

   ↓ Biased Estimator
        E[θ̂]         θ (true)
     -----|-----------|
     (systematically off)

   ↓ Unbiased Estimator
              θ (true)
                 |
              E[θ̂] = θ
  

📐 Definition 1.6 — Mean Squared Error

If \( \hat{\theta} \) is an estimator for the parameter \( \theta \), then:

\[ \text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] \]

This is the expected squared distance between the estimator and the true value. It measures overall accuracy — how close your guesses are to the truth, on average.

🔬 Decomposition: Bias–Variance Tradeoff

We can rewrite the MSE as:

\[ \text{MSE}(\hat{\theta}) = \underbrace{\left( \mathbb{E}[\hat{\theta}] - \theta \right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{\theta})}_{\text{Variance}} \]

✅ This is called the bias–variance decomposition.

🔁 Why this identity holds:

Use the identity:

\[ \mathbb{E}[(X - a)^2] = (\mathbb{E}[X] - a)^2 + \text{Var}(X) \]

Set \( X = \hat{\theta} \) and \( a = \theta \).
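The identity can be verified numerically. This Python sketch (reusing the \( \text{Unif}(0, \theta) \) maximum from Example 1.4, with parameters chosen for illustration) computes the MSE directly and via the decomposition:

```python
import random

random.seed(3)

theta, n, reps = 10.0, 5, 200_000
# Many realizations of the estimator max(X_1, ..., X_n)
vals = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(reps)]

mean_hat = sum(vals) / reps
var_hat = sum((v - mean_hat) ** 2 for v in vals) / reps
bias = mean_hat - theta
mse_direct = sum((v - theta) ** 2 for v in vals) / reps

# The two sides of the decomposition agree (up to floating point):
print(round(mse_direct, 3))           # direct E[(theta_hat - theta)^2]
print(round(bias ** 2 + var_hat, 3))  # Bias^2 + Variance
```

Both computations give the same number, as the algebraic identity guarantees.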

⚠️ Why this matters

Even an unbiased estimator (\( B(\theta) = 0 \)) can have large MSE if its variance is high.

MSE penalizes both error sources: squared bias (systematic error) and variance (random scatter).

🧠 Real-Life Analogy

Suppose you’re using an archery robot to hit a bullseye (the true \( \theta \)):

  • Bias: the robot’s aim is systematically off-center, so its shots miss the bullseye on average.
  • Variance: the shots are widely scattered around wherever it aims.

MSE is like asking: “How far are you typically from the bullseye?”

📉 Example 1.7 — Unbiased Estimator with High Variance

Situation:

You're estimating your friend’s height, known to be \( \theta = 170 \) cm.

Estimator is unbiased:

\[ \mathbb{E}[\hat{\theta}] = 170 \]

But in 5 trials, the outputs are:

| Trial | Estimate (θ̂) |
|---|---|
| 1 | 130 cm |
| 2 | 240 cm |
| 3 | 80 cm |
| 4 | 200 cm |
| 5 | 170 cm |

✅ In expectation the estimator is right: \( \mathbb{E}[\hat{\theta}] = 170 \) (these five particular guesses average 164, close to it)
❌ But the individual guesses are all over the place — high variance

📊 Visualization

True θ = 170
Estimates: [130, 240, 80, 200, 170]

→ MSE = E[(θ̂ − 170)²] = High!
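Spelling out that arithmetic in a few lines of Python makes "High!" concrete: the empirical MSE of these five guesses is the average squared miss.

```python
theta = 170.0
estimates = [130.0, 240.0, 80.0, 200.0, 170.0]

# Empirical MSE: average squared deviation from the true value
mse = sum((e - theta) ** 2 for e in estimates) / len(estimates)
print(mse)  # 3100.0 -- huge, even though the guesses center near 170
```

A typical miss is therefore on the order of \( \sqrt{3100} \approx 56 \) cm, despite unbiasedness.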
  

🔁 Conclusion: Why MSE is Important

You don't just want your estimator to be right on average — you want it to be reliably close to the true value most of the time.

That’s what MSE captures.

| Estimator Type | Bias | Variance | MSE = Bias² + Var |
|---|---|---|---|
| Good estimator | Low | Low | Low |
| Unbiased but high variance | 0 | High | High |
| Slight bias, low variance | Small | Small | Possibly low! |
| Biased and high variance | High | High | High |

🧠 Insight: Biased Estimators Can Be Better!

Sometimes a biased estimator can have lower MSE than an unbiased one!

So in practice, we often prefer an estimator with slight bias and low variance over one that’s unbiased but very noisy.
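A classical instance of this tradeoff is variance estimation itself. For normal data, dividing the sum of squares by \( n+1 \) (biased) actually gives a smaller MSE than the unbiased \( n-1 \) divisor; the following Python sketch (parameters chosen for illustration) compares three divisors by Monte Carlo:

```python
import random

random.seed(4)

sigma2, n, reps = 1.0, 10, 100_000
mse = {n - 1: 0.0, n: 0.0, n + 1: 0.0}   # candidate divisors

for _ in range(reps):
    x = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    for d in mse:
        mse[d] += (ss / d - sigma2) ** 2  # squared error this round

mse = {d: m / reps for d, m in mse.items()}
print({d: round(m, 3) for d, m in mse.items()})
# The unbiased divisor n-1 has the LARGEST MSE of the three;
# the biased divisor n+1 has the smallest.
```

The slight downward bias of the larger divisors buys a bigger reduction in variance, so total error goes down.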

🎯 Key Question:

Among all unbiased estimators of a parameter \( \theta \), is there one with minimum possible variance?

This leads us to...

🔶 Definition 1.8 — Uniformly Minimum Variance Unbiased Estimator (UMVU)

Let \( X \sim f_\theta(x) \) and let \( \hat{\theta}(X) \) be an unbiased estimator of \( \theta \).

Then \( \hat{\theta}_{\text{UMVU}} \) is defined as:

\[ \hat{\theta}_{\text{UMVU}} = \arg\min_{\hat{\theta} \text{ unbiased}} \ \text{Var}(\hat{\theta}) \]

It is the best (i.e., least variable) among all unbiased estimators — for every value of \( \theta \) (this uniformity is important).

📌 Why do we care?

Among unbiased estimators, we want the one with smallest fluctuation (i.e., most stable).

The UMVU estimator is the one with lowest MSE among all unbiased estimators, since for unbiased estimators:

\[ \text{MSE} = \text{Bias}^2 + \text{Var} = 0 + \text{Var} \Rightarrow \text{Minimizing MSE} = \text{Minimizing Variance} \]

So for unbiased estimators, MSE = variance.

🧠 Deeper Insight: Why Not Just Minimize Variance?

You might think: “Why not just minimize variance directly, even allowing bias?”

But here's the trap: you can always drive the variance to zero with a rule that ignores the data entirely.

⚠️ Such a rule is called a degenerate estimator.

For example:

\[ \hat{\theta}(X) = 0 \quad \text{(always returns 0)} \Rightarrow \text{Var} = 0,\quad \text{Bias} = \theta \Rightarrow \text{MSE} = \theta^2 \]

This is utter garbage as an estimator — it just ignores the data.

✅ So, we restrict ourselves to the class of unbiased estimators, and minimize variance within that class. That gives us the UMVU.

⚙️ Optimization View:

Minimize \( \text{MSE}(\hat{\theta}) \) subject to the constraint \( \mathbb{E}[\hat{\theta}] = \theta \) for every \( \theta \).

This is a constrained optimization problem: the unbiasedness constraint kills the bias term in the MSE, so within the constraint set we are simply minimizing variance.

🔧 What’s next: Theorem 1.9 (e.g., Rao–Blackwell or Lehmann–Scheffé)

These theorems give a method to actually construct the UMVU estimator, using sufficient statistics (Rao–Blackwell) and complete sufficient statistics (Lehmann–Scheffé).

But you don’t need them just yet to understand the intuition.

🔁 Summary Table

| Concept | Description |
|---|---|
| Bias | Systematic error: how far off your estimator is on average |
| Variance | How much your estimator fluctuates across samples |
| MSE | Total error: combines bias and variance |
| UMVU | Among all unbiased estimators, the one with smallest variance |
| Degenerate estimator | Zero variance but useless (doesn’t depend on the data) |

📊 Visual Intuition:

Unbiased estimators:
        ↑
        |     (*) High variance
        |   (*)
        |      (*) Lower variance
        |         (*) ← UMVU (min var)
        +--------------------------→ Estimators
  

You're seeking the least wiggly, but unbiased estimator.

⚙️ The Core Problem:

We want an estimator that:

  • is unbiased: \( \mathbb{E}[\hat{\theta}] = \theta \) for every \( \theta \), and
  • has as small a variance as possible.

But we saw earlier that not all unbiased estimators are equally good — some have higher variance than others.

What is the best we can do, variance-wise, among all unbiased estimators?
➡️ The Cramér–Rao Bound gives us a theoretical lower limit on this variance.

🎯 Theorem 1.9 — Cramér–Rao Lower Bound (CRLB)

Let \( X \sim f_\theta(x) \), and let \( W(X) \) be an unbiased estimator of \( \theta \), i.e., \( \mathbb{E}_\theta[W(X)] = \theta \).

Under some regularity conditions (we’ll get to them), the variance of \( W(X) \) satisfies:

\[ \boxed{ \text{Var}(W(X)) \geq \frac{1}{\mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right]} } \]

✅ This is the Cramér–Rao Lower Bound (CRLB).

✍️ Let’s Understand Each Piece:

🔹 \( \log f_\theta(X) \) — The Log-Likelihood

We often take logarithms of likelihoods because they make derivatives easier and preserve the maximum.

\[ \frac{d}{d\theta} \log f_\theta(X) \]

This is called the score function — it measures how sensitive the likelihood is to changes in \( \theta \).

🔹 Fisher Information:

Define the Fisher Information:

\[ \mathcal{I}(\theta) := \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right] \]

It tells us how much information the data \( X \) carries about \( \theta \).

✅ The more concentrated the likelihood is around the true \( \theta \), the higher the Fisher Information.
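Fisher Information can be checked by Monte Carlo for a simple model. This Python sketch (using a Bernoulli(\( p \)) model with \( p = 0.3 \), chosen here for illustration) estimates \( \mathbb{E}[(\frac{d}{dp}\log f_p(X))^2] \) and compares it with the known closed form \( \mathcal{I}(p) = \frac{1}{p(1-p)} \):

```python
import random

random.seed(5)

p, reps = 0.3, 200_000

def score(x, p):
    # d/dp log f_p(x) for Bernoulli: log f = x log p + (1-x) log(1-p)
    return x / p - (1 - x) / (1 - p)

# Fisher Information = E[score^2], estimated by Monte Carlo
fisher_mc = sum(score(1.0 if random.random() < p else 0.0, p) ** 2
                for _ in range(reps)) / reps

print(round(fisher_mc, 2))  # ≈ 1 / (p(1-p)) = 4.76
```

Note how \( \mathcal{I}(p) \) blows up as \( p \to 0 \) or \( p \to 1 \): near-deterministic outcomes pin \( p \) down sharply.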

🔹 Final form of CRLB for Unbiased Estimators:

When \( W(X) \) is unbiased, the CRLB becomes:

\[ \boxed{ \text{Var}(W(X)) \geq \frac{1}{\mathcal{I}(\theta)} } \]

🧠 Key Insight: Why is this really a lower bound?

At first glance, it may seem odd that the RHS involves expectations of functions of \( X \), the random variable we’re estimating from. So how can this be a clean lower bound?

Here’s the trick:

\[ \frac{d}{d\theta} \mathbb{E}_\theta[W(X)] = \frac{d}{d\theta} \theta = 1 \]

So the general inequality:

\[ \text{Var}(W(X)) \geq \frac{ \left( \frac{d}{d\theta} \mathbb{E}_\theta[W(X)] \right)^2 }{ \mathbb{E}_\theta \left[ \left( \frac{d}{d\theta} \log f_\theta(X) \right)^2 \right] } \]

becomes:

\[ \text{Var}(W(X)) \geq \frac{1^2}{\mathcal{I}(\theta)} = \frac{1}{\mathcal{I}(\theta)} \]

So the bound does not depend on the estimator, only on the model \( f_\theta(x) \).
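For a concrete check that the bound can be attained, consider \( X_1, \dots, X_n \overset{iid}{\sim} N(\theta, 1) \): the per-observation Fisher Information is 1, so \( \mathcal{I}(\theta) = n \) for the whole sample and the CRLB is \( 1/n \). This Python sketch (parameters chosen for illustration) shows the sample mean hitting that floor:

```python
import random

random.seed(6)

theta, n, reps = 2.0, 20, 100_000

# Variance of the sample mean of n iid N(theta, 1) draws
means = [sum(random.gauss(theta, 1.0) for _ in range(n)) / n
         for _ in range(reps)]
m = sum(means) / reps
var_mean = sum((x - m) ** 2 for x in means) / reps

# CRLB here: 1 / I(theta) = 1 / n = 0.05
print(round(var_mean, 3))  # ≈ 0.05: the sample mean attains the bound
```

The simulated variance matches \( 1/n \), so no unbiased estimator of \( \theta \) in this model can do better than \( \bar{X} \).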

📊 Interpretation

| Quantity | Meaning |
|---|---|
| \( W(X) \) | Your estimator (e.g., sample mean, sample max) |
| \( \mathbb{E}_\theta[W(X)] = \theta \) | Unbiasedness condition |
| \( \text{Var}(W(X)) \) | How much your estimator varies |
| \( \mathcal{I}(\theta) \) | How much your data “tells” you about \( \theta \) |

CRLB tells you: 🔻 You can’t beat this variance limit for any unbiased estimator.

📉 Visual Metaphor: The Estimator Race

You have many runners (estimators). Each is unbiased, but runs with a different "wiggle" (variance).

The Cramér–Rao bound is a solid track limit — no one can run below it.

Variance
│
│
│      (*)  Sample Variance Estimator
│      |
│      |
│      |         (*)  Another Estimator
│      |
│      |_________________________
│             CRLB (floor)
└─────────────────────────────► Estimators
  

If someone reaches the floor — they are efficient.

✨ When is the CRLB Attained?

If an estimator achieves the CRLB (i.e., its variance equals the bound), it is called an efficient estimator.

MLEs typically achieve the CRLB asymptotically, i.e., in large samples.

🔁 Recap: Why It Matters