library(ggplot2)
library(utils)
library(tibble)
library(itewrpkg)
library(AER)
library(MASS)
library(scales)
library(mvtnorm)
library(stargazer)
library(e1071)
Simple regression has a major flaw: it ignores other determinants of the dependent variable (test scores) that are correlated with the regressor (class size). This can bias the estimate, in the sense that the mean of the sampling distribution of the OLS estimator no longer equals the true value. This problem is known as omitted variable bias (OVB).
Omitted variable bias arises when two conditions are both satisfied:
1. \(X\) is correlated with an omitted variable.
2. The omitted variable is a determinant of the dependent variable \(Y\).
Conditions 1 and 2 together imply a violation of the OLS assumption \(E(u|X)=0\). Formally, the resulting bias can be expressed as \(\hat{\beta_1}\xrightarrow{p}\beta_1+\rho_{Xu}\frac{\sigma_u}{\sigma_X}\).
Omitted variable bias in the estimation of \(\beta_1\) cannot be remedied by increasing the number of observations: OVB prevents \(\hat{\beta_1}\) from converging in probability to the true value, so the estimator is inconsistent. The strength and direction of the bias are determined by \(\rho_{Xu}\), the correlation between the error term and the regressor.
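The following short simulation is a minimal sketch of both points (it is not part of the CASchools example; all variable names and parameter values are purely illustrative): the regression that omits a relevant variable stays biased no matter how large the sample is, while including that variable recovers the true coefficient.
# illustrative simulation of omitted variable bias: X is correlated with an
# omitted variable W that also affects Y, so regressing Y on X alone is biased
set.seed(123)
n <- 10000   # a large sample does not remove the bias
# draw X and W from a bivariate normal distribution with correlation 0.5
XW <- rmvnorm(n, mean = c(0, 0), sigma = cbind(c(1, 0.5), c(0.5, 1)))
X <- XW[, 1]
W <- XW[, 2]
# true model: Y depends on both X and W
Y <- 1 + 2 * X - 3 * W + rnorm(n)
# short regression omitting W: slope converges to 2 + (-3) * 0.5 = 0.5, not 2
coef(lm(Y ~ X))[2]
# long regression including W: slope is close to the true value 2
coef(lm(Y ~ X + W))[2]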
In the example of test scores and class size, it is not hard to think of omitted variables that could cause such a bias. A highly relevant candidate is the percentage of English learners in the school district: the ability to speak, read and write English is likely to be an important determinant of learning success.
Consider the bias that may arise from omitting the percentage of English learning students (PctEL). Although the estimated regression model does not include PctEL, the true data-generating process is:
\[TestScore=\beta_0+\beta_1\times STR+\beta_2\times PctEL\]
In fact, \(STR\) and \(PctEL\) are correlated: \(\rho_{STR,PctEL}\neq0\).
Below we use R to compute the correlations between \(STR\) and \(PctEL\) and between \(STR\) and \(TestScore\).
# load the AER package
# library(AER)
# load the data set
data(CASchools)
# define variables
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
# compute correlations
cor(CASchools$STR, CASchools$score)
## [1] -0.2263627
cor(CASchools$STR, CASchools$english)
## [1] 0.1876424
The fact that \(\hat\rho_{STR,TestScore}=-0.2264\) is cause for concern that omitting \(PctEL\) leads to a negatively biased estimate of \(\hat{\beta_1}\), since \(\rho_{Xu}<0\).
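To spell out the sign of the bias (a short reasoning step added for clarity): omitting \(PctEL\) means it is absorbed into the error term of the short regression,
\[TestScore=\beta_0+\beta_1\times STR+v,\qquad v=\beta_2\times PctEL+u,\]
so, assuming \(u\) itself is uncorrelated with \(STR\), \(Cov(STR,v)=\beta_2\,Cov(STR,PctEL)\). With \(\hat\rho_{STR,PctEL}=0.19>0\) and \(\beta_2<0\) (English learners presumably achieve lower scores), the error term of the short regression is negatively correlated with \(STR\), so \(\hat{\beta_1}\) is biased downward.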
When \(PctEL\) is added to the regression model:
\[TestScore=\beta_0+\beta_1\times STR+\beta_2\times PctEL+u\]
we expect a negative estimate \(\hat{\beta_1}\) that is somewhat larger (closer to zero) than before, together with a negative estimate \(\hat{\beta_2}\).
# estimate both regression models
mod <- lm(score ~ STR, data = CASchools)
mult.mod <- lm(score ~ STR + english, data = CASchools)
# print the results to the console
mod
##
## Call:
## lm(formula = score ~ STR, data = CASchools)
##
## Coefficients:
## (Intercept) STR
## 698.93 -2.28
mult.mod
##
## Call:
## lm(formula = score ~ STR + english, data = CASchools)
##
## Coefficients:
## (Intercept) STR english
## 686.0322 -1.1013 -0.6498
# the results are consistent with our expectations
summary(mult.mod)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.0322445 7.41131160 92.565565 3.871327e-280
## STR -1.1012956 0.38027827 -2.896026 3.978059e-03
## english -0.6497768 0.03934254 -16.515882 1.657448e-47
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
## STR -1.10130 0.38028 -2.896 0.00398 **
## english -0.64978 0.03934 -16.516 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
# define the fraction of English learners
CASchools$FracEL <- CASchools$english / 100
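# FracEL is a perfect linear function of english (FracEL = english/100): including
# both regressors creates perfect multicollinearity, so lm() drops FracEL and
# reports its coefficient as NA ("1 not defined because of singularities")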
# estimate the model
mult.mod <- lm(score ~ STR + english + FracEL, data = CASchools)
# obtain a summary of the model
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english + FracEL, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
## STR -1.10130 0.38028 -2.896 0.00398 **
## english -0.64978 0.03934 -16.516 < 2e-16 ***
## FracEL NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
# NS = 0 if STR < 12, NS = 1 otherwise
CASchools$NS <- ifelse(CASchools$STR < 12, 0, 1)
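# no district in CASchools has STR < 12 (see table(CASchools$NS) further below:
# all 420 observations equal 1), so NS is constant and perfectly collinear with
# the intercept; lm() therefore drops NS and reports NA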
# estimate the model
mult.mod <- lm(score ~ computer + english + NS, data = CASchools)
# obtain a model summary
summary(mult.mod)
##
## Call:
## lm(formula = score ~ computer + english + NS, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.492 -9.976 -0.778 8.761 43.798
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.704837 0.984259 674.319 < 2e-16 ***
## computer 0.005374 0.001670 3.218 0.00139 **
## english -0.708947 0.040303 -17.591 < 2e-16 ***
## NS NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.43 on 417 degrees of freedom
## Multiple R-squared: 0.4291, Adjusted R-squared: 0.4263
## F-statistic: 156.7 on 2 and 417 DF, p-value: < 2.2e-16
table(CASchools$NS)
##
## 1
## 420
# set seed for reproducibility
set.seed(1)
# generate artificial data on location
CASchools$direction <- sample( c("West", "North", "South", "East"),
420,
replace = T )
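# direction is a factor with four levels; R uses "East" (the first level in
# alphabetical order) as the base category and includes only three dummies,
# which avoids the dummy variable trap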
# estimate the model
mult.mod <- lm(score ~ STR + english + direction, data = CASchools)
# obtain a model summary
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english + direction, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.603 -10.175 -0.484 9.524 42.830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 684.80477 7.54130 90.807 < 2e-16 ***
## STR -1.08873 0.38153 -2.854 0.00454 **
## english -0.65597 0.04018 -16.325 < 2e-16 ***
## directionNorth 1.66314 2.05870 0.808 0.41964
## directionSouth 0.71619 2.06321 0.347 0.72867
## directionWest 1.79351 1.98174 0.905 0.36598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 414 degrees of freedom
## Multiple R-squared: 0.4279, Adjusted R-squared: 0.421
## F-statistic: 61.92 on 5 and 414 DF, p-value: < 2.2e-16
# percentage of English speakers
CASchools$PctES <- 100 - CASchools$english
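# PctES and english always sum to 100, so PctES is perfectly collinear with
# english and the intercept; lm() drops PctES and reports NA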
# estimate the model
mult.mod <- lm(score ~ STR + english + PctES, data = CASchools)
# obtain a model summary
summary(mult.mod)
##
## Call:
## lm(formula = score ~ STR + english + PctES, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
## STR -1.10130 0.38028 -2.896 0.00398 **
## english -0.64978 0.03934 -16.516 < 2e-16 ***
## PctES NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
# load packages
# library(MASS)
# library(mvtnorm)
# set sample size
n <- 50
# initialize a matrix to store the coefficient estimates
coefs <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
# set seed for reproducibility
set.seed(1)
# loop sampling and estimation
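# in each replication the regressors are drawn from a bivariate normal
# distribution with variances 10 and covariance 2.5 (correlation 0.25), i.e.
# they are imperfectly collinear; Y follows the true model with coefficients
# beta_1 = 2.5 and beta_2 = 3, and the OLS slope estimates are stored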
for (i in 1:10000)
{
X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
u <- rnorm(n, sd = 5)
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs[i,] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
}
# compute density estimate
kde <- kde2d(coefs[, 1], coefs[, 2])
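# kde2d() from MASS computes a two-dimensional kernel density estimate of the
# joint distribution of the 10000 estimate pairs (hat_beta_1, hat_beta_2)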
# plot density estimate
persp( kde,
theta = 310,
phi = 30,
xlab = "beta_1",
ylab = "beta_2",
zlab = "Est. Density" )