Project 2 of Statistics.md

摘要

本次大作业按章节依次展示了岭估计、主成分估计、逐步回归和带示性变量的回归分析。

主要使用 dplyr 和 readxl 包导入数据并获得数据框视图；利用 MASS 和 car 包作岭迹图、主成分估计和逐步回归分析； dummies 包提供了将示性变量转换为特殊矩阵的可能；bootstrap 包帮助我们实现了 $k$ 重交叉验证。

所有文献和数据集均可在文末的参考文献链接中下载获得。

library(dplyr)
library(readxl)
library(MASS)
library(car)
library(dummies)
library(bootstrap)

岭估计

对《线性模型与回归分析》(王松桂)书习题3.19数据进行回归分析，十次试验得到的观测数据如下

$y$	16.3	16.8	19.2	18.0	19.5	20.9	21.1	20.9	20.3	22.0
$X_1$	1.1	1.4	1.7	1.7	1.8	1.8	1.9	2.0	2.3	2.4
$X_2$	1.1	1.5	1.8	1.7	1.9	1.8	1.8	2.1	2.4	2.5

首先导入数据并获得预览：

1
2
3

test <- tbl_df(read_excel("/path/to/test.xlsx"))

str(test)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    10 obs. of  3 variables:
 $ y : num  16.3 16.8 19.2 18 19.5 20.9 21.1 20.9 20.3 22
 $ x1: num  1.1 1.4 1.7 1.7 1.8 1.8 1.9 2 2.3 2.4
 $ x2: num  1.1 1.5 1.8 1.7 1.9 1.8 1.8 2.1 2.4 2.5

下面假设数据满足线性模型:

$$
y=\beta_0 + \beta_1 x_1+ \beta_2 x_2 + e
$$

其中误差 $e$ 满足 Gauss-Markov 假设，最小二乘法估计参数 $\beta = (\beta_0, \beta_1, \beta_2)$.

md1 <- lm(y ~ x1 + x2, test)

summary(md1)

vif(md1)

Call:
lm(formula = y ~ x1 + x2, data = test)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3107 -0.3727  0.1168  0.5054  1.1176

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   11.292      1.463   7.719 0.000114 ***
x1            11.307      4.719   2.396 0.047740 *
x2            -6.591      4.436  -1.486 0.180941
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9072 on 7 degrees of freedom
Multiple R-squared:  0.8267,    Adjusted R-squared:  0.7772
F-statistic:  16.7 on 2 and 7 DF,  p-value: 0.002167

x1: 35.9628643396901
x2: 35.9628643396901

由上可知预测变量 $x$ 的方差膨胀因子均大于30, 数据存在较严重的共线性, 于是考虑使用岭估计确定线形模型中的系数.

下面规定岭参数在 $(0,1)$ 中, 以 $0.1$ 为步长作出岭迹图：

1
2
3

md2 <- lm.ridge(y ~ x1 + x2, test, lambda = seq(0, 1, 0.1))

plot(md2)

从图中发现当岭参数 $k=4$ 时, $\hat{\beta}(k)$的图像趋于平稳, 故选择

$$
\hat{\beta(4)}=(11.86099 ,\ 4.480425, \ -0.25298675)
$$

主成分估计

对《线性模型与回归分析》(王松桂)习题3.20数据进行回归分析。导入数据集 commodity:

1
2
3

commodity <- tbl_df(read_excel("/path/to/commodity.xlsx"))

str(commodity)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    10 obs. of  5 variables:
 $ y : num  8.4 9.6 10.4 11.4 12.2 14.2 15.8 17.9 19.6 20.8
 $ x1: num  82.9 88 99.9 105.3 117.7 ...
 $ x2: num  92 93 96 94 100 101 105 112 112 112
 $ x3: num  17.1 21.3 25.1 29 34 40 44 49 51 53
 $ x4: num  94 96 97 97 100 101 104 109 111 111

假设数据满足线性模型:

$$
y=\beta_0 + \beta_1 x_1+ \beta_2 x_2 +x_3 + x_4+ e
$$

误差 $e$ 满足Gauss - Markov 假设.

md1 <- lm(y ~ x1 + x2 + x3 + x4, commodity)

summary(md1)

vif(md1)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = commodity)

Residuals:
        1         2         3         4         5         6         7         8
 0.024803  0.079476  0.012381 -0.007025 -0.288345  0.216090 -0.142085  0.158360
        9        10
-0.135964  0.082310

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.66768    5.94360  -2.973  0.03107 *
x1            0.09006    0.02095   4.298  0.00773 **
x2           -0.23132    0.07132  -3.243  0.02287 *
x3            0.01806    0.03907   0.462  0.66328
x4            0.42075    0.11847   3.552  0.01636 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2037 on 5 degrees of freedom
Multiple R-squared:  0.9988,    Adjusted R-squared:  0.9978
F-statistic:  1021 on 4 and 5 DF,  p-value: 1.827e-07

x1: 126.280159677426
x2: 72.827434063754
x3: 55.4395006902401
x4: 125.122528952804

使用最小二乘法估计参数 $\beta$ , 发现回归系数 $\beta_3$ 不显著不为0，并且预测变量 $x$ 的方差膨胀因子均大于30, 数据存在较严重的共线性,考虑使用主成分回归估计参数.

使用样本的相关矩阵进行主成分分析：

pca <- princomp(~ x1 + x2 + x3 + x4, commodity, cor=T)

summary(pca, loading=TRUE)

pre <- predict(pca)

pre

Importance of components:
                          Comp.1      Comp.2     Comp.3       Comp.4
Standard deviation     1.9859037 0.199906992 0.11218966 0.0603085506
Proportion of Variance 0.9859534 0.009990701 0.00314663 0.0009092803
Cumulative Proportion  0.9859534 0.995944090 0.99909072 1.0000000000

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4
x1 -0.502 -0.237  0.579 -0.598
x2 -0.500  0.493 -0.610 -0.367
x3 -0.498 -0.707 -0.368  0.342
x4 -0.501  0.449  0.396  0.626

	Comp.1	Comp.2	Comp.3	Comp.4
1	2.74300221	0.21717213	0.04664610	-0.092431543
2	2.26912529	0.15180493	0.05701818	0.094266513
3	1.66536864	0.11683915	-0.03013748	-0.045923725
4	1.55844739	-0.27262184	0.10173123	0.064513216
5	0.53961155	-0.04080947	-0.12051901	0.011795871
6	-0.04403012	-0.33989164	-0.09182299	0.003917754
7	-0.96230840	-0.21126360	-0.04524573	-0.064368428
8	-2.22802489	0.22379838	-0.19645929	0.020173666
9	-2.65380615	0.17110117	0.08143164	0.067009629
10	-2.88738552	-0.01612921	0.19735735	-0.058952953

由上可得第一主成分的方差贡献率达到 $98.59534\%$，且第一主成分为：

$$
Z_1 = -0.502x_1 -0.500x_2 - 0.498x_3 - 0.501x_4
$$

下面使用第一主成分进行回归：

1
2
3

commodity$z1 <- pre[, 1]
md2 <- lm(y ~ z1, commodity)
summary(md2)

Call:
lm(formula = y ~ z1, data = commodity)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72237 -0.20946  0.05154  0.21032  0.81856

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.03000    0.16615   84.44 4.32e-13 ***
z1          -2.06119    0.08367  -24.64 7.87e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5254 on 8 degrees of freedom
Multiple R-squared:  0.987,    Adjusted R-squared:  0.9854
F-statistic: 606.9 on 1 and 8 DF,  p-value: 7.873e-09

得到：

$$
y = 14.0300 - 2.06119 * Z_1
$$

最后我们仍需将变量 $Z_1$ 还原：

Phi <- loadings(pca)
xm <- pca$center
xs <- pca$scale
tem <- md2$coefficients
beta <- tem[2] * Phi[, 1] / xs
cons <- tem[1] - beta %*% xm
beta <- c(cons, beta)
beta

1: -23.7777186112137
x1: 0.0299264271458343
x2: 0.133651577042884
x3: 0.0836115587386088
x4: 0.169651874372033

得到回归方程：

$$
y = -23.7777 + 0.0299 x_1 + 0.1337 x_2 + 0.08361 x_3 + 0.1697 x_4
$$

逐步回归

对《线性模型与回归分析》(王松桂)习题5.7数据做逐步回归分析.

1
2
3

data <- tbl_df(read_excel("/path/to/data.xlsx"))

str(data)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    29 obs. of  6 variables:
 $ y : num  272 264 239 231 252 ...
 $ x1: num  783 748 684 828 860 ...
 $ x2: num  33.5 36.5 34.7 33.1 35.8 ...
 $ x3: num  40.5 36.2 37.3 35.5 33.7 ...
 $ x4: num  16.7 16.5 17.7 17.5 16.4 ...
 $ x5: num  13.2 14.1 15.7 10.5 11 ...

先对数据集做向后逐步回归：

1
2
3

md1 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data=data)

stepAIC(md1, direction="backward")

Start:  AIC=176
y ~ x1 + x2 + x3 + x4 + x5

       Df Sum of Sq     RSS    AIC
- x3    1      35.1  8323.5 174.13
- x2    1     114.0  8402.4 174.40
- x5    1     411.8  8700.1 175.41
<none>               8288.4 176.00
- x1    1     723.8  9012.2 176.43
- x4    1    5074.5 13362.9 187.85

Step:  AIC=174.13
y ~ x1 + x2 + x4 + x5

       Df Sum of Sq     RSS    AIC
- x2    1     217.2  8540.7 172.87
<none>               8323.5 174.13
- x1    1    1009.8  9333.3 175.45
- x5    1    1267.3  9590.9 176.24
- x4    1    5136.6 13460.1 186.06

Step:  AIC=172.87
y ~ x1 + x4 + x5

       Df Sum of Sq     RSS    AIC
<none>               8540.7 172.87
- x5    1    1365.8  9906.5 175.18
- x1    1    1605.5 10146.2 175.87
- x4    1    4929.7 13470.4 184.09




Call:
lm(formula = y ~ x1 + x4 + x5, data = data)

Coefficients:
(Intercept)           x1           x4           x5
   447.3340       0.1273     -21.8362       5.1049

再对数据集做向前逐步回归：

1
2
3

md2 <- lm(y ~ 1, data=data)

stepAIC(md2, direction="forward", scope="~ x1 + x2 + x3 + x4 + x5")

Start:  AIC=193.65
y ~ 1

       Df Sum of Sq   RSS    AIC
+ x4    1   10612.7 10892 175.93
+ x1    1    8030.1 13475 182.10
+ x5    1    2563.9 18941 191.97
<none>              21505 193.65
+ x2    1     245.0 21260 195.32
+ x3    1       1.8 21503 195.65

Step:  AIC=175.93
y ~ x4

       Df Sum of Sq     RSS    AIC
+ x3    1   1656.22  9235.8 173.14
+ x1    1    985.56  9906.5 175.18
+ x2    1    779.98 10112.1 175.77
+ x5    1    745.86 10146.2 175.87
<none>              10892.0 175.93

Step:  AIC=173.14
y ~ x4 + x3

       Df Sum of Sq    RSS    AIC
<none>              9235.8 173.14
+ x1    1    505.14 8730.7 173.51
+ x2    1    143.80 9092.0 174.69
+ x5    1     14.91 9220.9 175.10




Call:
lm(formula = y ~ x4 + x3, data = data)

Coefficients:
(Intercept)           x4           x3
    491.523      -24.663        4.675

发现两种方向的结果并不一致。

下面对数据进行划分: 前15个样本作为训练数据, 后14个样本作为测试数据，根据AIC准则和RSS准则综合判断如何选择经逐步回归后得到的线性方程.

datatrain <- data[1:15, ]
datatest <- data[16:29, ]

md1 <- lm(y ~ 1, datatrain)
md2 <- step(md1, direction = "both", scope = "~ x1 + x2 + x3 + x4 + x5")

summary(md2)

Start:  AIC=79.88
y ~ 1

       Df Sum of Sq    RSS    AIC
+ x4    1   1107.48 1589.4 73.946
+ x3    1    666.73 2030.1 77.617
+ x5    1    397.59 2299.3 79.485
+ x1    1    385.77 2311.1 79.561
<none>              2696.9 79.877
+ x2    1    104.83 2592.0 81.282

Step:  AIC=73.95
y ~ x4

       Df Sum of Sq     RSS    AIC
+ x3    1   1182.99  406.40 55.489
+ x5    1    835.24  754.14 64.763
+ x2    1    295.31 1294.07 72.862
<none>              1589.39 73.946
+ x1    1      1.30 1588.09 75.934
- x4    1   1107.48 2696.87 79.877

Step:  AIC=55.49
y ~ x4 + x3

       Df Sum of Sq     RSS    AIC
+ x1    1     70.01  336.39 54.653
<none>               406.40 55.489
+ x5    1     42.58  363.82 55.829
+ x2    1      2.93  403.47 57.381
- x3    1   1182.99 1589.39 73.946
- x4    1   1623.74 2030.14 77.617

Step:  AIC=54.65
y ~ x4 + x3 + x1

       Df Sum of Sq     RSS    AIC
+ x5    1    116.20  220.18 50.296
+ x2    1     42.17  294.21 54.644
<none>               336.39 54.653
- x1    1     70.01  406.40 55.489
- x4    1    881.47 1217.86 71.952
- x3    1   1251.70 1588.09 75.934

Step:  AIC=50.3
y ~ x4 + x3 + x1 + x5

       Df Sum of Sq     RSS    AIC
<none>               220.18 50.296
+ x2    1     13.40  206.78 51.354
- x5    1    116.20  336.39 54.653
- x1    1    143.64  363.82 55.829
- x3    1    317.12  537.30 61.678
- x4    1    790.37 1010.56 71.153




Call:
lm(formula = y ~ x4 + x3 + x1 + x5, data = datatrain)

Residuals:
    Min      1Q  Median      3Q     Max
-6.1147 -2.7790 -0.4392  3.2326  6.8494

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 324.64019   67.10578   4.838 0.000684 ***
x4          -17.84038    2.97769  -5.991 0.000134 ***
x3            3.62457    0.95507   3.795 0.003514 **
x1            0.07047    0.02759   2.554 0.028659 *
x5            2.99391    1.30323   2.297 0.044459 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.692 on 10 degrees of freedom
Multiple R-squared:  0.9184,    Adjusted R-squared:  0.8857
F-statistic: 28.12 on 4 and 10 DF,  p-value: 2.028e-05

由上可以发现：

首先, 引入 $x_4$ 后,方程的AIC将降低, 并且引入后的RSS最小,故考虑引入 $x_4$；

接下来, 引入 $x_3$ 会继续降低方程的AIC, 且引入 $x_3$ 对应的RSS最小；

继续地，以此判断准则分别引入 $x_1$ 和 $x_5$, 依据AIC准则,方程达到最优。

$$
y = 324.64019 + 0.07047 x1 + 3.62457 x3 - 17.84038 x4 + 2.99391 x5
$$

R <- summary(md2)$r.square
MSE <- sum(residuals(md2)^2)/7

yp <- predict(md2, datatest)
yf <- datatest$y

Rp <- 1 - sum((yf-yp)^2)/sum((yf-mean(yf))^2)
MSEp <- sum((yf-yp)^2)/7

cat("Original R-Square =", R, "\n")
cat("cv Rp-Square =", Rp, "\n")

cat("Original MSE =", MSE, "\n")
cat("cv MSEp =", MSEp, "\n")

Original R-Square = 0.9183561
cv Rp-Square = 0.4209326
Original MSE = 31.4547
cv MSEp = 1362.036

对比原方程的 $R^2 = 91.84\%$, $MSE=27.52287$ 使用测试数据进行交叉验证得到的 $R^2{pre} = 42.09\%$, $MSE{pre} = 1362.036$.

$R^2$ 下降很多, 说明若用该方程进行预测, 预测结果不可靠.

带示性的线性回归模型

使用讲义上的2014年全国教育数据进行回归分析:

1
2
3

edu <- tbl_df(read_excel("/path/to/education.xlsx"))

head(edu, n=8)

y	x1	x2	jbh
5082.47	44488.57	14119	1
4170.25	28832.29	13544	1
1470.97	16647.40	18461	3
1928.79	16538.32	18415	3
2552.41	20559.34	15207	2
1981.45	22820.15	14277	2
1945.20	17520.39	13930	2
1638.09	17404.39	12597	2

edu$jbh <- factor(edu$jbh)

md <- lm(y ~ x1 + x2 + jbh, data=edu)

influencePlot(md, id.method="identity", main="Influence Plot", sub="Circle size is proportional to Cook's")

	StudRes	Hat	CookD
9	-0.7881532	0.4759338	0.09546883
26	4.8251309	0.1874467	0.47330170

由上图，我们需要剔除异常点西藏省对应的数据.

1	edu <- edu[-26,]

原数据中对人均可支配收入 jbh 进行了分组, 我们针对分组做适当简化: 将原数据中第一组作为高收入组 h; 第二组和第三组作为中等收入组 m, 第四组作为低收入组 l.

简化后的分组有四个水平, 故引入三个示性变量:

$$
x_3 = \begin{cases}
1 \quad jbh=1 \
0 \quad \text{else} \
\end{cases} \qquad x_4 = \begin{cases}
1 \quad jbh=2 \
0 \quad \text{else} \
\end{cases} \qquad x_5 = \begin{cases}
1 \quad jbh=3 \
0 \quad \text{else} \
\end{cases}
$$

据此我们可将数据框 edu 做如下变换

jbh <- dummy(edu$jbh)  # convert the dummies into a sparse matrix
edu.new <- data.frame(edu$y, edu$x1, edu$x2, jbh)  # update the dateframe
names(edu.new) <- c("y", "x1", "x2", "x3", "x4", "x5", "x6") # rename the columns

head(edu.new, n=8)

y	x1	x2	x3	x4	x5
5082.47	44488.57	14119	1	0	0
4170.25	28832.29	13544	1	0	0
1470.97	16647.40	18461	0	0	1
1928.79	16538.32	18415	0	0	1
2552.41	20559.34	15207	0	1	0
1981.45	22820.15	14277	0	1	0
1945.20	17520.39	13930	0	1	0
1638.09	17404.39	12597	0	1	0
4077.58	45965.83	11719	1	0	0
2613.18	27172.77	16619	1	0	0

新的 edu.new 中已将哑变量 x3 转化为相应矩阵，可以假设此数据满足线性模型:

$$
y=\beta_0 + \beta_1\, x_1+ \beta_2\, x_2 + \beta_3\, x_3 + \beta_4\, x_4 + \beta_5\, x_5 + e
$$

误差 $e$ 满足 Gauss-Markov 假设. 使用最小二乘法估计参数 $\beta$.

下面对模型进行交叉验证，将样本分为 $k$ 个子样本，轮流将 $k-1$ 个子样本组合作为训练集，另外 $1$ 个子样本作为保留集。这样获得的 $k$ 个预测方程，记录了 $k$ 个保留样本的预测表现结果，然后求其均值。

利用 bootstrap 包中的 crossval() 函数实现如下交叉验证函数：

# construct the k fold cross validation function
shrinkage <- function(fit, k=3) {

  theta.fit <- function(x, y) {lsfit(x, y)}
  theta.predict <- function(fit, x) {cbind(1, x) %*% fit$coef}

  x <- fit$model[, 2:ncol(fit$model)]
  y <- fit$model[, 1]

  results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
  r2 <- cor(y, fit$fitted.values)^2
  r2cv <- cor(y, results$cv.fit)^2
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("R-square Change =", r2 - r2cv, "\n")
}

对 new.data 所有预测变量进行回归，然后再用 shrinkage() 函数做 $3$ 重交叉验证：

1
2
3

md.new <- lm(y ~ x1 + x2 + x3 + x4 + x5, data=edu.new)

shrinkage(md.new)

Original R-square = 0.7147451
3 Fold Cross-Validated R-square = 0.4983459
R-square Change = 0.2163992

可以看到，基于初始样本的 $R$ 平方为 0.7147451，对新数据更好的方差解释率估计是交叉验证后的 $R$ 平方 0.4686895，$R$ 平方减少量 0.2163992 较小，说明预测变量模型泛化能力较好，预测较为精准。

1	summary(md.new)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = edu.new)

Residuals:
   Min     1Q Median     3Q    Max
-520.1 -271.8 -198.1  259.4 1176.5

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  858.93181 1012.19824   0.849 0.404499
x1             0.09523    0.02400   3.967 0.000572 ***
x2             0.01520    0.03799   0.400 0.692716
x3          -816.78300  498.04009  -1.640 0.114050
x4          -873.96293  312.85124  -2.794 0.010079 *
x5          -776.99506  280.13906  -2.774 0.010556 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 490.3 on 24 degrees of freedom
Multiple R-squared:  0.7147,    Adjusted R-squared:  0.6553
F-statistic: 12.03 on 5 and 24 DF,  p-value: 6.658e-06

故可得带哑变量的线性模型回归方程：

$$
y = 858.93181 + 0.09523 x1 + 0.01520 x2 - 816.78300 x3 - 873.96293 x4 - 776.99506 * x5
$$

由上可知，人均可支配收入最高省份相比人均可支配收入最低省份在人均教育费用上会减少约 816.78 元的投入。 人均收入稍高的省份相比减少 873.96 元人均教育经费投入，再稍低的相比减少 776.99 元的人均教育经费投入。最低收入的作为参考指标，视为不减少。

参考文献

R in action Data Analysis and Graphics with R (Second Edition)
线性模型引论（王松桂、史建红等）
线性统计模型（王松桂、陈敏等）
数据集地址：http://oye4atjxc.bkt.clouddn.com/statistic/project2/data.zip