What Are the Error Term (u) and Omitted Variable Bias in Regression?

Omitted Variable Bias?

What is the error term (u)?

  • Imagine you want to predict students’ grades (Y) using hours studied (X).
  • But grades are not decided only by study hours.
    • Some students are naturally smarter.
    • Some had better teachers.
    • Some were sick on exam day.

All of these extra things (not included in your formula) are captured in the error term (u).
So u = "all the other factors we didn’t include in the equation."
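In equation form, this idea is usually written as a simple linear regression. The notation below is the standard textbook one and is an assumption here, since this post does not write the model out: β0 is the intercept, β1 is the effect of one extra hour of study on the grade, and u_i collects everything else that affects student i's grade.

```latex
Y_i = \beta_0 + \beta_1 X_i + u_i
```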

Why are there always omitted variables?

In real life, it's impossible to include every single factor that affects Y.

Example: For grades, you can’t measure “motivation”, “sleep quality”, or “stress” perfectly.
So, some variables will always be omitted.

When does this become a concern?

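  • If the omitted factors are unrelated to X (study hours), it’s not a big deal. OLS (ordinary least squares, our estimation method) still gives an unbiased estimate of the effect of X.
  • BUT if the omitted factors are related to X, then OLS gets “biased”: its estimate of the effect of X is systematically off, no matter how much data you collect.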

Example:

  • Students who study more might also have better teachers.
  • If “teacher quality” is omitted, the effect of studying will look bigger than it really is, because OLS is mixing the effect of studying and teacher quality.

This is called omitted variable bias.
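A small simulation can make this concrete. The sketch below uses only Python's standard library and made-up numbers: the true effect of one study hour (2.0), the effect of teacher quality (3.0), and the `link` parameter that ties teacher quality to study hours are all assumptions chosen for illustration, not estimates from real data.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate(link, n=20000, seed=0):
    """Return the slope of grades on hours when teacher quality is left out.

    link controls how strongly teacher quality (Z) moves with hours studied (X).
    """
    rng = random.Random(seed)
    true_effect_hours = 2.0   # assumed true effect of one study hour
    effect_teacher = 3.0      # assumed effect of teacher quality
    hours, grades = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                  # teacher quality (omitted later)
        x = 5 + link * z + rng.gauss(0, 1)   # hours studied
        noise = rng.gauss(0, 1)              # everything else
        y = 50 + true_effect_hours * x + effect_teacher * z + noise
        hours.append(x)
        grades.append(y)
    return ols_slope(hours, grades)          # regression ignores z


slope_unrelated = simulate(link=0.0)  # Z unrelated to X
slope_related = simulate(link=1.0)    # Z moves together with X
print(round(slope_unrelated, 2), round(slope_related, 2))
```

With these assumed numbers, the first slope should come out close to the true value of 2.0, while the second should come out close to 3.5: studying gets credit for part of what teacher quality is doing.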

Omitted variable bias can lead to incorrect conclusions, because the estimated effect of X also picks up the effect of an important factor that was left out of the regression.

For this to happen, the omitted factor (let’s call it Z) must satisfy two conditions:

Condition 1: Z affects Y (the outcome).

  • Example: You want to know how hours studied (X) affect exam score (Y).
  • But “intelligence” (Z) also affects exam scores. Smarter students usually get higher grades.
  • That means Z → Y.
  • So Z ends up in the error term (u) if we don’t include it.

Condition 2: Z is related to X (the thing you included).

  • Smarter students (Z) are also more likely to study more (X).
  • So “intelligence” (Z) is related to “hours studied” (X).
  • With both conditions met, if we don’t include Z, the effect of studying will look bigger than it really is, because OLS credits study hours with part of the effect of intelligence.
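The two conditions can be read off the standard textbook formula for omitted variable bias. As a sketch, assume the true model is Y = β0 + β1 X + β2 Z + e, with e unrelated to X, but we regress Y on X alone. The symbols β2 and e are introduced here for illustration and do not appear elsewhere in this post. In large samples, the OLS slope then settles at:

```latex
\hat{\beta}_1 \;\longrightarrow\; \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)}
```

The second term is the bias. It is zero if β2 = 0 (Z does not affect Y, so Condition 1 fails) or if Cov(X, Z) = 0 (Z is unrelated to X, so Condition 2 fails). Only when both conditions hold is the estimate pushed away from β1.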

Setup for a second example (class size and test scores):

  • Y = test scores (outcome we care about).
  • X = STR (student–teacher ratio) = number of students per teacher in a class
  • Z = English language ability (whether English is the student’s first language)

Step 1: Does Z affect Y?

Yes.

  • If English is not the student’s first language, test scores will generally be lower.
  • Therefore, Z has an effect on Y, indicating that condition 1 is satisfied.

Step 2: Is Z related to X?

Yes.

  • Immigrant communities (with more English-as-second-language students) tend to be poorer.
  • Poorer communities usually have less money for schools, which means larger class sizes (higher STR).
  • Therefore, Z is related to X, so condition 2 is satisfied.

Step 3: So what happens?

When both conditions are met, omitted variable bias exists.

Now the question: What direction is the bias?

Step 4: Use common sense:

  • Bigger class sizes (higher STR) look like they reduce test scores.
  • But part of this negative effect might actually be coming from the fact that these schools also have more immigrant students (lower English ability).
  • Since we did NOT include Z (English ability), OLS incorrectly blames STR for all the lower test scores.

👉 That means OLS exaggerates (makes too negative) the effect of STR.
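The same reasoning can be written as a sign rule: the bias has the sign of (the effect of Z on Y) multiplied by (the relationship between Z and X). The helper below is a hypothetical illustration, not a function from any library, and it takes only the signs as inputs rather than real estimates.

```python
def bias_direction(effect_of_z_on_y, relation_of_z_with_x):
    """Return the direction of omitted variable bias in the estimated effect of X.

    Both arguments are signs or rough magnitudes: positive, negative, or zero.
    Hypothetical helper for illustration only.
    """
    product = effect_of_z_on_y * relation_of_z_with_x
    if product > 0:
        return "upward"      # estimate is too large (too positive)
    if product < 0:
        return "downward"    # estimate is too small (too negative)
    return "none"            # at least one of the two conditions fails


# Class-size example: better English ability raises scores (+1),
# and schools with higher STR tend to have lower English ability (-1).
str_bias = bias_direction(+1, -1)
print(str_bias)  # downward

# Study-hours example: intelligence raises grades (+1),
# and smarter students tend to study more (+1).
study_bias = bias_direction(+1, +1)
print(study_bias)  # upward
```

A downward bias on an effect that is already negative is what "too negative" means for STR, while the upward bias in the study-hours example matches the earlier point that studying looks more effective than it really is.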
