The Least Squares Assumptions for Multiple Regression
These are the conditions under which the OLS estimator is valid and has the nice statistical properties we rely on (like unbiasedness and consistency).
Assumption 1: Zero Conditional Mean
E(u | X1 = x1, …, Xk = xk) = 0
Meaning:
- On average, the omitted factors u are unrelated to the included regressors X1, …, Xk.
- Put differently: once you control for the regressors, there's no leftover systematic relationship between u and the X's.
 
Why it matters:
- If this fails, your regression suffers from omitted variable bias.
Example: If PctEL (percent English learners) belongs in the model but you leave it out, and it’s correlated with STR, then the STR coefficient gets biased. 
Solution:
- Include the omitted variable (if you can measure it).
 
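To see the bias concretely, here is a minimal simulation of the STR/PctEL story above (the data-generating process and all numbers below are made up for illustration): when a variable that belongs in the model is left out and is correlated with STR, the STR coefficient in the short regression moves away from its true value.

```python
# A minimal sketch of omitted variable bias (illustrative numbers only;
# the coefficients and distributions below are assumptions for this sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# PctEL is correlated with STR: districts with larger classes tend to have
# more English learners in this made-up data-generating process.
str_ = rng.normal(20, 2, n)                      # student-teacher ratio
pct_el = 5 + 1.5 * str_ + rng.normal(0, 5, n)    # percent English learners
score = 700 - 1.0 * str_ - 0.5 * pct_el + rng.normal(0, 10, n)

def ols(y, *regressors):
    """OLS with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Short regression (PctEL omitted): the STR slope absorbs part of PctEL's
# effect, because E(u | STR) != 0 once PctEL is pushed into the error term.
print(ols(score, str_))            # STR slope biased away from -1.0
# Long regression (PctEL included): Assumption 1 holds by construction.
print(ols(score, str_, pct_el))    # STR slope close to -1.0
```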
Assumption 2: i.i.d. Sampling
(X1i, …, Xki, Yi), i = 1, …, n, are i.i.d.
Meaning:
- Each observation comes from the same population and is independent of the others.
- This is satisfied if your data is collected using simple random sampling.
 
Why it matters:
- Ensures OLS results can be generalised and that standard error formulas work properly.
 
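As a tiny illustration (the sampling frame below is hypothetical), a simple random sample gives every unit the same chance of selection; drawing without replacement from a population much larger than the sample is, for practical purposes, close enough to i.i.d.

```python
# A small sketch of simple random sampling from a hypothetical frame of
# district IDs; the frame size and sample size are made-up numbers.
import numpy as np

rng = np.random.default_rng(1)
population_ids = np.arange(10_000)     # hypothetical sampling frame
sample_ids = rng.choice(population_ids, size=420, replace=False)
print(sample_ids[:5])                  # each unit had the same selection probability
```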
Assumption 3: Large Outliers are Rare
Meaning:
- OLS is sensitive to extreme outliers because it minimises squared errors.
- Outliers can pull the regression line away from where most of the data lies.
 
Why it matters:
- Outliers can make estimates unreliable.
 - That’s why you should always check scatterplots or summary stats for unusual values (typos, coding errors, or genuine but extreme values).
 
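Here is a short sketch (with made-up data) of why that check matters: a single corrupted observation, such as a data-entry error, can noticeably move the OLS slope, and a glance at the summary statistics would have flagged it.

```python
# A quick sketch of how one extreme point moves the OLS slope
# (all numbers are made up for illustration).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 100)
y = 2.0 * x + rng.normal(0, 1, 100)

def slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(slope(x, y))                  # close to the true slope of 2.0

# Corrupt one high-leverage observation, e.g. 1000 typed instead of 1.0.
i = np.argmax(np.abs(x))            # pick a point far from the centre of x
y_bad = y.copy()
y_bad[i] = 1000.0
print(slope(x, y_bad))              # the single outlier drags the slope away

# Simple screening: look at summary statistics before trusting the fit.
print(y_bad.min(), y_bad.max())     # the maximum immediately flags the outlier
```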
Assumption 4: No Perfect Multicollinearity
Meaning:
- None of your regressors is an exact linear combination of the others.
- Perfect multicollinearity occurs when one regressor is exactly determined by another regressor (or by a combination of the others); regressors that are merely strongly related give imperfect multicollinearity, which OLS can handle, though with less precise estimates.
- Perfect multicollinearity makes it impossible for the regression to separate the regressors' individual effects on Y.
 

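A minimal sketch of what perfect multicollinearity looks like in practice (hypothetical data): including the same information twice, e.g. a fraction and the corresponding percentage, makes the design matrix rank-deficient, so OLS has no unique solution.

```python
# A small sketch of perfect multicollinearity with made-up data: one column
# of the design matrix is an exact linear combination of another.
import numpy as np

rng = np.random.default_rng(3)
n = 200
str_ = rng.normal(20, 2, n)
frac_el = rng.uniform(0, 0.3, n)     # fraction of English learners
pct_el = 100.0 * frac_el             # the same variable on a different scale

X = np.column_stack([np.ones(n), str_, frac_el, pct_el])
print(X.shape[1], np.linalg.matrix_rank(X))   # 4 columns, but rank only 3

# Because X is rank-deficient, X'X is singular and the coefficients are not
# uniquely determined; most software drops one offending column or reports an error.
```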