The Least Squares Assumptions for Multiple Regression
*Figure: Assumptions of Multiple Regression*
These are the conditions under which the OLS estimator is valid and has the nice statistical properties we rely on (like unbiasedness and consistency).
Assumption 1: Zero Conditional Mean
E(u | X1 = x1, …, Xk = xk) = 0
Meaning:
- On average, the omitted factors u are unrelated to the included regressors X.
- Put differently: once you control for the regressors, there is no leftover systematic relationship between u and X.
Why it matters:
- If this fails, your regression suffers from omitted variable bias.
Example: If PctEL (percent English learners) belongs in the model but you leave it out, and it’s correlated with STR, then the STR coefficient gets biased (see the simulation sketch below).
Solution:
- Include the omitted variable (if you can measure it).
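To see the mechanics, here is a minimal simulation sketch (NumPy), using a made-up data-generating process loosely inspired by the STR/PctEL example; all coefficients and numbers are hypothetical, not the real California schools data. When PctEL is omitted, its effect loads onto the correlated regressor STR:

```python
# Hypothetical simulation of omitted variable bias; the data-generating
# process and all numbers are made up, not the real California schools data.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

str_ = rng.normal(20, 2, n)                   # student-teacher ratio (STR)
pct_el = 0.5 * str_ + rng.normal(0, 2, n)     # PctEL, positively correlated with STR
u = rng.normal(0, 5, n)                       # other unobserved factors
score = 700 - 1.0 * str_ - 0.65 * pct_el + u  # "true" model: STR effect is -1.0

# Long regression: PctEL is included, so the zero conditional mean assumption holds.
X_long = np.column_stack([np.ones(n), str_, pct_el])
beta_long = np.linalg.lstsq(X_long, score, rcond=None)[0]

# Short regression: PctEL is omitted and ends up in the error term, which is
# now correlated with STR, so the STR coefficient picks up part of PctEL's effect.
X_short = np.column_stack([np.ones(n), str_])
beta_short = np.linalg.lstsq(X_short, score, rcond=None)[0]

print("STR coefficient, PctEL included:", round(beta_long[1], 2))   # close to -1.0
print("STR coefficient, PctEL omitted: ", round(beta_short[1], 2))  # roughly -1.3 (biased)
```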
Assumption 2: i.i.d. Sampling
(X1i, …, Xki, Yi), i = 1, …, n, are i.i.d.
Meaning:
- Each observation comes from the same population and is independent of the others.
- This is satisfied if your data is collected using simple random sampling.
Why it matters:
- Ensures OLS results generalise to the population and that the usual standard error formulas work properly (a small sampling sketch follows below).
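As a rough illustration, the sketch below draws a simple random sample from a hypothetical population frame using pandas; the frame, variable names, and numbers are invented for the example:

```python
# Minimal sketch of simple random sampling from a hypothetical population frame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
population = pd.DataFrame({
    "district_id": np.arange(10_000),
    "str": rng.normal(20, 2, 10_000),
    "test_score": rng.normal(654, 19, 10_000),
})

# Drawing units uniformly at random, without replacement, from one sampling
# frame gives observations that are (to a very good approximation) i.i.d.
sample = population.sample(n=420, replace=False, random_state=1)
print(sample.shape)   # (420, 3)

# By contrast, surveying every school in a handful of districts, or following
# the same units over time (panel data), typically breaks independence.
```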
Assumption 3: Large Outliers are Rare
Meaning:
- OLS is sensitive to extreme outliers because it minimises squared errors.
- Outliers can pull the regression line away from where most of the data lies.
Why it matters:
- Outliers can make estimates unreliable.
- That’s why you should always check scatterplots or summary stats for unusual values (typos, coding errors, or genuine but extreme values); the sketch below shows how a single extreme point can move the fitted line.
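The following sketch, on made-up data, shows how a single extreme, high-leverage observation can drag the OLS slope away from the value that fits the bulk of the data:

```python
# Minimal sketch with made-up data: one extreme, high-leverage point can move
# the OLS line substantially, because OLS minimises squared errors.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

def ols_slope(x, y):
    """Slope from a regression of y on x with an intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print("slope, clean data:      ", round(ols_slope(x, y), 2))    # about 0.5

# Append a single bad observation (e.g. y recorded as 500 instead of 5.00).
x_bad = np.append(x, 10.0)
y_bad = np.append(y, 500.0)
print("slope, with one outlier:", round(ols_slope(x_bad, y_bad), 2))  # far from 0.5
```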
Assumption 4: No Perfect Multicollinearity
Meaning:
- None of your regressors is an exact linear combination of the others.
- Perfect multicollinearity happens when one regressor is exactly determined by (an exact linear function of) another regressor or a combination of regressors; when regressors are merely strongly related, the problem is imperfect multicollinearity.
- Either way, it becomes hard (or, under perfect multicollinearity, impossible) for the regression to separate their effects on Y (see the sketch below).
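As a quick illustration, the sketch below (made-up data) builds a regressor that is an exact linear function of another and shows that the design matrix loses rank, which is exactly why OLS cannot separate the two effects:

```python
# Minimal sketch with made-up data: a regressor that is an exact linear
# function of another makes the design matrix rank-deficient, so OLS cannot
# attribute the effect to one regressor or the other.
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + 1.0                      # exact linear function of x1 (and the constant)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Three columns but only rank 2: the columns are linearly dependent.
print("rank of X:", np.linalg.matrix_rank(X), "of", X.shape[1], "columns")

# X'X is singular, so the usual OLS formula (X'X)^(-1) X'y is not defined.
print("det(X'X):", np.linalg.det(X.T @ X))   # essentially zero

# Regression software typically errors out, drops a column, or reports NA for
# one coefficient; the fix is to drop the redundant regressor.
```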