library(ggplot2)
library(utils)
library(tibble)
library(itewrpkg)
library(AER)
library(MASS)
library(scales)
library(mvtnorm)
library(stargazer)
library(e1071)
Simple regression has a major flaw: it ignores other determinants of the dependent variable (test scores) that are correlated with the regressor (class size). This can bias the estimate, in the sense that the mean of the sampling distribution of the OLS estimator no longer equals the true value. This problem is known as omitted variable bias (OVB).
Omitted variable bias arises when two conditions are both satisfied:
1. \(X\) is correlated with an omitted variable.
2. The omitted variable is a determinant of the dependent variable \(Y\).
Conditions 1 and 2 together imply a violation of the OLS assumption \(E(u|X)=0\). Formally, the resulting bias can be expressed as \(\hat{\beta_1}\xrightarrow{p}\beta_1+\rho_{Xu}\frac{\sigma_u}{\sigma_X}\).
Omitted variable bias in the estimation of \(\beta_1\) cannot be remedied by increasing the number of observations: OVB prevents \(\hat{\beta_1}\) from converging in probability to the true value, so the estimator is inconsistent. The strength and direction of the bias are determined by \(\rho_{Xu}\), the correlation between the error term and the regressor.
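The following short simulation is a minimal sketch of both points (it is not part of the CASchools example; all variable names and parameter values are purely illustrative): the regression that omits a relevant variable stays biased no matter how large the sample is, while including that variable recovers the true coefficient.
# illustrative simulation of omitted variable bias: X is correlated with an
# omitted variable W that also affects Y, so regressing Y on X alone is biased
set.seed(123)
n <- 10000   # a large sample does not remove the bias
# draw X and W from a bivariate normal distribution with correlation 0.5
XW <- rmvnorm(n, mean = c(0, 0), sigma = cbind(c(1, 0.5), c(0.5, 1)))
X <- XW[, 1]
W <- XW[, 2]
# true model: Y depends on both X and W
Y <- 1 + 2 * X - 3 * W + rnorm(n)
# short regression omitting W: slope converges to 2 + (-3) * 0.5 = 0.5, not 2
coef(lm(Y ~ X))[2]
# long regression including W: slope is close to the true value 2
coef(lm(Y ~ X + W))[2]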
In the example of test scores and class size, it is not hard to think of omitted variables that could cause such a bias. A highly relevant candidate is the percentage of English learners in the school district: the ability to speak, read and write English is likely to be an important determinant of learning success.
Consider the bias that may arise from omitting the percentage of English learning students (PctEL). Although the estimated regression model does not include PctEL, the true data-generating process is:
\[TestScore=\beta_0+\beta_1\times STR+\beta_2\times PctEL\]
In fact, \(STR\) and \(PctEL\) are correlated: \(\rho_{STR,PctEL}\neq0\).
Below we use R to compute the correlations between \(STR\) and \(PctEL\) and between \(STR\) and \(TestScore\).
# load the AER package
# library(AER)
# load the data set
data(CASchools)
# define variables
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
# compute correlations
cor(CASchools$STR, CASchools$score)
## [1] -0.2263627
cor(CASchools$STR, CASchools$english)
## [1] 0.1876424
The fact that \(\hat\rho_{STR,TestScore}=-0.2264\) is cause for concern that omitting \(PctEL\) leads to a negatively biased estimate of \(\hat{\beta_1}\), since \(\rho_{Xu}<0\).
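To spell out the sign of the bias (a short reasoning step added for clarity): omitting \(PctEL\) means it is absorbed into the error term of the short regression,
\[TestScore=\beta_0+\beta_1\times STR+v,\qquad v=\beta_2\times PctEL+u,\]
so, assuming \(u\) itself is uncorrelated with \(STR\), \(Cov(STR,v)=\beta_2\,Cov(STR,PctEL)\). With \(\hat\rho_{STR,PctEL}=0.19>0\) and \(\beta_2<0\) (English learners presumably achieve lower scores), the error term of the short regression is negatively correlated with \(STR\), so \(\hat{\beta_1}\) is biased downward.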
When \(PctEL\) is added to the regression model:
\[TestScore=\beta_0+\beta_1\times STR+\beta_2\times PctEL+u\]
we expect a negative estimate \(\hat{\beta_1}\) that is somewhat larger (closer to zero) than before, together with a negative estimate \(\hat{\beta_2}\).
# estimate both regression models
mod <- lm(score ~ STR, data = CASchools)
mult.mod <- lm(score ~ STR + english, data = CASchools)
# print the results to the console
mod
##
## Call:
## lm(formula = score ~ STR, data = CASchools)
##
## Coefficients:
## (Intercept) STR
## 698.93 -2.28
mult.mod
##
## Call:
## lm(formula = score ~ STR + english, data = CASchools)
##
## Coefficients:
## (Intercept) STR english
## 686.0322 -1.1013 -0.6498
# the results are consistent with our expectations
summary(mult.mod)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.0322445 7.41131160 92.565565 3.871327e-280
## STR -1.1012956 0.38027827 -2.896026 3.978059e-03
## english -0.6497768 0.03934254 -16.515882 1.657448e-47
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
## STR -1.10130 0.38028 -2.896 0.00398 **
## english -0.64978 0.03934 -16.516 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
# define the fraction of English learners
CASchools$FracEL <- CASchools$english / 100
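# FracEL is a perfect linear function of english (FracEL = english/100): including
# both regressors creates perfect multicollinearity, so lm() drops FracEL and
# reports its coefficient as NA ("1 not defined because of singularities")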
# estimate the model
mult.mod <- lm(score ~ STR + english + FracEL, data = CASchools)
# obtain a summary of the model
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english + FracEL, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
## STR -1.10130 0.38028 -2.896 0.00398 **
## english -0.64978 0.03934 -16.516 < 2e-16 ***
## FracEL NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
# NS = 0 if STR < 12, NS = 1 otherwise
CASchools$NS <- ifelse(CASchools$STR < 12, 0, 1)
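# no district in CASchools has STR < 12 (see table(CASchools$NS) further below:
# all 420 observations equal 1), so NS is constant and perfectly collinear with
# the intercept; lm() therefore drops NS and reports NA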
# estimate the model
mult.mod <- lm(score ~ computer + english + NS, data = CASchools)
# obtain a model summary
summary(mult.mod)
##
## Call:
## lm(formula = score ~ computer + english + NS, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.492 -9.976 -0.778 8.761 43.798
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.704837 0.984259 674.319 < 2e-16 ***
## computer 0.005374 0.001670 3.218 0.00139 **
## english -0.708947 0.040303 -17.591 < 2e-16 ***
## NS NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.43 on 417 degrees of freedom
## Multiple R-squared: 0.4291, Adjusted R-squared: 0.4263
## F-statistic: 156.7 on 2 and 417 DF, p-value: < 2.2e-16
table(CASchools$NS)
##
## 1
## 420
# set seed for reproducibility
set.seed(1)
# generate artificial data on location
CASchools$direction <- sample( c("West", "North", "South", "East"),
420,
replace = T )
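# direction is a factor with four levels; R uses "East" (the first level in
# alphabetical order) as the base category and includes only three dummies,
# which avoids the dummy variable trap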
# estimate the model
mult.mod <- lm(score ~ STR + english + direction, data = CASchools)
# obtain a model summary
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english + direction, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.603 -10.175 -0.484 9.524 42.830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 684.80477 7.54130 90.807 < 2e-16 ***
## STR -1.08873 0.38153 -2.854 0.00454 **
## english -0.65597 0.04018 -16.325 < 2e-16 ***
## directionNorth 1.66314 2.05870 0.808 0.41964
## directionSouth 0.71619 2.06321 0.347 0.72867
## directionWest 1.79351 1.98174 0.905 0.36598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 414 degrees of freedom
## Multiple R-squared: 0.4279, Adjusted R-squared: 0.421
## F-statistic: 61.92 on 5 and 414 DF, p-value: < 2.2e-16
# percentage of English speakers
CASchools$PctES <- 100 - CASchools$english
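# PctES and english always sum to 100, so PctES is perfectly collinear with
# english and the intercept; lm() drops PctES and reports NA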
# estimate the model
mult.mod <- lm(score ~ STR + english + PctES, data = CASchools)
# obtain a model summary
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english + PctES, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
## STR -1.10130 0.38028 -2.896 0.00398 **
## english -0.64978 0.03934 -16.516 < 2e-16 ***
## PctES NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
# load packages
# library(MASS)
# library(mvtnorm)
# set sample size
n <- 50
# initialize a matrix to store the coefficient estimates
coefs <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
# set seed for reproducibility
set.seed(1)
# loop sampling and estimation
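# in each replication the regressors are drawn from a bivariate normal
# distribution with variances 10 and covariance 2.5 (correlation 0.25), i.e.
# they are imperfectly collinear; Y follows the true model with coefficients
# beta_1 = 2.5 and beta_2 = 3, and the OLS slope estimates are stored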
for (i in 1:10000)
{
X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
u <- rnorm(n, sd = 5)
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs[i,] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
}
# compute density estimate
kde <- kde2d(coefs[, 1], coefs[, 2])
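# kde2d() from MASS computes a two-dimensional kernel density estimate of the
# joint distribution of the 10000 estimate pairs (hat_beta_1, hat_beta_2)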
# plot density estimate
persp( kde,
theta = 310,
phi = 30,
xlab = "beta_1",
ylab = "beta_2",
zlab = "Est. Density" )