Project 1 of Statistics.md

一、准备阶段

本次实验主要考虑影响大倪体重各体型指标的因素。我们的数据收集了112组大鲵的体重(weight)、全长(tl)、体长(bl)、头长(hl)、体高(h)、体宽(w)、尾柄长(cl)和肠长(sl)。

利用统计软件R来实现，其中用到的包有readxl, dplyr, ggplot2, car。

library(readxl)
library(dplyr)
library(ggplot2)
library(car)

首先，利用read_excel语句导入数据为新的数据框，并利用str()函数展示一下大致的数据结构。

1 2	df = tbl_df(read_excel("/path/to/data.xls")) str(df)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    112 obs. of  8 variables:
 $ weight: num  2.3 2.4 2.5 2.7 2.8 3 3.2 3.4 3.5 3.6 ...
 $ tl    : num  5.7 5.73 5.9 6.12 6.2 ...
 $ bl    : num  4.9 4.53 4.65 4.85 5.1 5 5.1 6.25 5.25 5.2 ...
 $ hl    : num  1.6 1.63 1.85 1.7 1.75 1.73 1.7 2.08 1.9 1.8 ...
 $ h     : num  0.9 0.93 1.05 1.05 1 1.1 1 1.28 1.05 1.1 ...
 $ w     : num  1 1.1 1 1.1 1.1 1.13 1.1 1.43 1.2 1.2 ...
 $ cl    : num  1.2 1.12 1.2 1.1 1.3 1.27 1.1 1.32 1.35 1.4 ...
 $ sl    : num  4.7 4.3 4.7 4.6 4.2 5 5 6.3 5.75 3.9 ...

下面对影响体重的体型指标做多元线性回归。

二、多元线性回归

1. 检测二变量关系

首先，检查一下变量间的相关性。可以利用car包中scatterplotMatrix()函数生成散点图矩阵

1	scatterplotMatrix(df, spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix")

从图中可以看到，大鲵体重随着体长、身长、头长、高度、宽度，尾长和肠长的增加而增加，可能有线性关系。

2. 多元线性回归及显著性检验

下面使用lm()函数拟合多元线性回归模型：

1 2	mod1 <- lm(weight ~ tl + bl + hl + h + w + cl + sl, data=df) summary(mod1)

Call:
lm(formula = weight ~ tl + bl + hl + h + w + cl + sl, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-10.2719  -3.2688  -0.1753   2.2389  27.0821

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.0098     2.8396 -10.920  < 2e-16 ***
tl            5.0452     1.6941   2.978 0.003609 **
bl           -8.0840     2.2201  -3.641 0.000425 ***
hl            3.1704     3.1763   0.998 0.320529
h             3.4869     2.0005   1.743 0.084289 .
w            26.1356     4.0578   6.441 3.75e-09 ***
cl            2.4080     3.2128   0.750 0.455246
sl           -0.8519     0.3111  -2.739 0.007262 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.867 on 104 degrees of freedom
Multiple R-squared:  0.8862,    Adjusted R-squared:  0.8785
F-statistic: 115.7 on 7 and 104 DF,  p-value: < 2.2e-16

上表的最后一行表示自由度为 $7, 104$ 的F统计量取值为 $115.7$，在给定水平 $\alpha = 0.05$ 下，算出的F统计量大于 $F{7,104}(0.05)$ 的概率 $P(F{7,104}(0.05)>F)$ 小于 $2.2\times 10^{-16}$，所以拒绝原假设，认为回归方程是显著的。

同时，注意到最后一列头长hl和尾柄长cl的回归系数对应的p值大于 $0.025$，认为头长和尾柄长对体重的影响不显著，因而可以将其从回归方程中剔除。

1
2
3

vars <- names(df) %in% c("cl", "hl")
newdf <- df[!vars]
newdf

weight	tl	bl	h	w	sl
2.3	5.700	4.90	0.90	1.00	4.70
2.4	5.730	4.53	0.93	1.10	4.30
2.5	5.900	4.65	1.05	1.00	4.70
2.7	6.125	4.85	1.05	1.10	4.60
2.8	6.200	5.10	1.00	1.10	4.20
⋮	⋮	⋮	⋮	⋮	⋮
49.6	14.90	12.70	2.50	3.40	11.80
55.8	15.80	13.10	3.01	3.30	9.70
64.2	15.90	12.30	2.60	3.50	15.50
64.9	15.10	12.50	2.30	3.60	13.90
101.2	18.40	15.20	3.30	4.30	14.20

将体重weight对剩余的回归自变量重新做回归，然后再做显著性检验。

1 2	mod2 <- lm(weight ~ tl + bl + w + sl, data=newdf) summary(mod2)

Call:
lm(formula = weight ~ tl + bl + w + sl, data = newdf)

Residuals:
     Min       1Q   Median       3Q      Max
-10.1155  -3.2318  -0.5772   2.5155  27.6941

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.8822     2.4665 -11.710  < 2e-16 ***
tl            5.8369     1.6683   3.499 0.000683 ***
bl           -7.6708     2.1310  -3.600 0.000484 ***
w            28.8123     3.8875   7.411 3.07e-11 ***
sl           -0.8667     0.3059  -2.833 0.005514 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.908 on 107 degrees of freedom
Multiple R-squared:  0.8809,    Adjusted R-squared:  0.8764
F-statistic: 197.8 on 4 and 107 DF,  p-value: < 2.2e-16

观测新表对应的 $p$ 值可知，回归方程是显著的，剩余的回归系数均显著不为0。其中回归系数含义为：一个预测变量增加一个单位，其他预测变量保持不变时，因变量将要增加的数量。如宽度w的回归系数为 $28.8123$，表示体长、身长等其他因素不变时，宽度每上升 $1\%$，体重将会上升 $28.8123\%$

总体来看，所有的预测变量解释了体重影响因素88.09%的方差,体重w总变动的 $88.09\%$ 可以由该经验回归方程解释。

3. 回归诊断

3.1 Gauss Markov 假设

首先考虑是否满足 $E(e)=0$，我们通过绘制学生化残差图来检验：

sresid <- rstudent(mod2)
fweight <- fitted(mod2)
myplot <- ggplot(newdf, aes(fweight, sresid)) + geom_point()
myplot <- myplot + ggtitle("Studentized residual vs. fitted weight")
myplot <- myplot + labs(x = "fitted weight") + labs(y = "studentized residual")
myplot

从上图可以看出，$(\hat w_i, r_i)$ 大致落在宽度为6的水平带 $|r_i|\leq 3$ 区域内，学生化残差 $r_i$ 是否服从标准正态分布可以通过绘制学生化残差的频率分布直方图和 Q-Q 图来进一步分析:

par(mfrow=c(1,2))

hist(sresid, freq=FALSE, xlab="Studentized Residual", main="Distribution of Errors")
curve(dnorm(x, mean=mean(sresid), sd=sd(sresid)), add=TRUE, col="blue", lwd=2)
legend("topright", legend="Normal Curve", lty=1, col="blue", cex=.7)

qqPlot(mod2, labels=row.names(newdf), id.method="identify", simulate=TRUE, main="Q-Q Plot")

可以看出，除一个明显的离群点(112组数据)，Q-Q图近似于一条截距为0，斜率为1的直线，因此误差很好地服从了正态分布。

下面进行误差的独立性检验，使用car包提供的Durbin-Watson检验函数，能够检测误差的序列相关性。

1	durbinWatsonTest(mod2)

lag Autocorrelation D-W Statistic p-value
  1       0.4479978     0.7615764       0
Alternative hypothesis: rho != 0

$p<0.05$ 显著，说明误差之间不独立，有相关性。

下面考虑同方差性，利用car包提供的ncvTest()函数生成一个检验，零假设为误差方差不变，备择假设为误差方差随拟合值水平变化而变化。若检验显著，则说明存在异方差性。

1	ncvTest(mod2)

Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 117.0427    Df = 1     p = 2.809471e-27

$p<0.05$显著，拒绝同方差假设。我们还可以通过分布水平图(非水平趋势)验证这一点：

1	spreadLevelPlot(mod2)

综上，该模型不满足Gauss Markov假设。

3.2 异常点检验

由上面的残差图可以看出，最后一组数据的残差较其他数据的残差大很多，因此该数据为异常点。由于异常点的检验是一个复杂的问题，我们不能确定异常点的个数，因此不能进一步处理。

高杠杆值的观测点可以通过帽子统计量判断，对于此处的数据集，帽子均值为$p/n=5/112$，其中 $p$是模型估计的参数数目(包含截距项)，$n$ 是样本量。一般来说，观测点的帽子值大于帽子均值的2或3倍，就可以认定为高杠杆值点。下面画出帽子值的分布：

hat.plot <- function (mod2) {
    p <- length(coefficients(mod2))
    n <- length(fitted(mod2))
    plot(hatvalues(mod2), main = "Index Plot of Hat Values")
    abline(h=c(2,3)*p/n, col="red", lty=2)
    identify(1:n, hatvalues(mod2), names(hatvalues(mod2)))
}
hat.plot(mod2)

此图中可以看出有四个高杠杆值点。

下面根据Cook距离检测强影响点，考虑Cook’s D值大于 $4/(n-k-1)$，其中 $n$ 为样本量大小，$k$ 是预测变量数目。可通过如下代码绘制Cook’s D图形：

1
2
3

cutoff <- 4/(nrow(newdf)-length(mod2$coefficients)-2)
plot(mod2, which=4, cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")

通过图形可以判断第 $24, 110$ 组数据是强影响点（第112组为异常点），若删除强影响点，将会导致回归模型的截距项和斜率发生显著变化。

其实，利用car包中的influencePlot()函数，可以将离群点，高杠杆值点和强影响点的信息整合到一副图形中：

1	influencePlot(mod2, id.method="identity", main="Influence Plot", sub="Circle size is proportional to Cook's")

	StudRes	Hat	CookD
24	-1.635591	0.3859345	0.3310792
112	8.156196	0.2284171	2.4427929

三、结论

经验线性回归模型

$$
weight = -28.8822 + 5.8369tl + -7.6708bl + 28.8123w + -0.8667sl
$$

虽然可以解释体重总变动的 $88.09\%$，但是不满足Gauss Markov假设，仍待改善。

四、参考文献

R in action Data Analysis and Graphics with R (Second Edition)
线性模型引论（王松桂、史建红等）
线性统计模型 (王松桂、陈敏等编
数据集地址：http://leanote.com/api/file/getAttach?fileId=59fbe27cab644137db000a2b

Homework 3

3.16

Box-Cox变换与加权最小二乘

(1)

1
2
3

tree = tbl_df(read.csv("~/tree.csv"))
mod3 <- lm(Y ~ X1 + X2, data=tree)
summary(mod3)

Call:
lm(formula = Y ~ X1 + X2, data = tree)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
X1            4.7082     0.2643  17.816  < 2e-16 ***
X2            0.3393     0.1302   2.607   0.0145 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948,    Adjusted R-squared:  0.9442
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

由上表，$X2$ 的回归系数不显著，故剔除后重新回归：

1 2	mod4 <- lm(Y ~ X1, data=tree) summary(mod4)

Call:
lm(formula = Y ~ X1, data = tree)

Residuals:
   Min     1Q Median     3Q    Max
-8.065 -3.107  0.152  3.495  9.587

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
X1            5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

得到经验线性回归方程

$$
Y = -36.9435 + 5.0659*X1
$$

下面作出残差图：

sresid <- rstudent(mod4)
fweight <- fitted(mod4)
plot3_16 <- ggplot(tree, aes(Y, sresid)) + geom_point()
plot3_16 <- plot3_16 + ggtitle("Studentized residual vs. fitted Y")
plot3_16 <- plot3_16 + labs(x = "fitted Y") + labs(y = "studentized residual")
plot3_16

残差和拟合值的关系大致呈现喇叭形，说明拟合值与残差不相互独立，不满足 Gauss Markov 假设，故进行 Box-Cox 变换：

1 2	library(MASS) boxcox(mod4)

可以由上图看出最优变换参数 $\lambda$ 在 $0$ 附近，进一步做计算找出 $\lambda$ 的最优值：

1	summary(powerTransform(mod4))

bcPower Transformation to Normality
   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Y1    0.3791   0.1216           0.1408           0.6175

Likelihood ratio tests about transformation parameters
                            LRT df         pval
LR test, lambda = (0)  8.362665  1 3.830084e-03
LR test, lambda = (1) 21.004754  1 4.581452e-06

得到最优变换参数为0.3791

(2)

newY <- (tree$Y^(0.3791)-1)/0.3791
newtree <- cbind(tree, newY)
mod5 <- lm(newY ~ X1 + X2, data=newtree)
summary(mod5)

Call:
lm(formula = newY ~ X1 + X2, data = newtree)

Residuals:
     Min       1Q   Median       3Q      Max
-0.57136 -0.17434 -0.02026  0.24693  0.45782

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.17273    0.64775  -6.442 5.62e-07 ***
X1           0.53242    0.01982  26.868  < 2e-16 ***
X2           0.04975    0.00976   5.098 2.12e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2911 on 28 degrees of freedom
Multiple R-squared:  0.9775,    Adjusted R-squared:  0.9759
F-statistic: 609.4 on 2 and 28 DF,  p-value: < 2.2e-16

经过Box-Cox变换后的经验回归方程仍显著，且能解释Y的变化的 $97.75\%$。

下面作残差图，检验Gauss Markov假设：

sresid <- rstudent(mod5)
fittedY <- fitted(mod5)
newplot3_16 <- ggplot(newtree, aes(fittedY, sresid)) + geom_point()
newplot3_16 <- newplot3_16 + ggtitle("Studentized residual vs. fitted Y")
newplot3_16 <- newplot3_16 + labs(x = "fitted Y") + labs(y = "studentized residual")
newplot3_16

进一步作正态性检验,作如下学生化残差分布图和Q-Q图：

par(mfrow=c(1,2))

hist(sresid, freq=FALSE, xlab="Studentized Residual", main="Distribution of Errors")
curve(dnorm(x, mean=mean(sresid), sd=sd(sresid)), add=TRUE, col="blue", lwd=2)
legend("topleft", legend="Normal Curve", lty=1, col="blue", cex=.7)

qqPlot(mod5, labels=row.names(newtree), id.method="identify", simulate=TRUE, main="Q-Q Plot")

可以看出经过Box-Cox变换后，经验回归模型相比原模型有明显改善。

(3)

下面我们继续对数据集tree做加权最小二乘回归，编写如下wls函数：

wls <- function(linear_mod, iterations = 1){

  X <- model.matrix(linear_mod)   # design matrix
  H <- solve(t(X) %*% X) %*% t(X) # hat matrix
  Y <- linear_mod$model[,1]       # Y observed
  size <- nrow(linear_mod$model)  # block dimensions
  beta_ols <- H %*% Y             # The estimated coefficients using ols
  resid <- residuals(linear_mod)
  sigma <- diag(size) * as.vector(abs(resid)^2)

  beta_wls <- solve(t(X) %*% solve(sigma) %*% X) %*% t(X) %*% solve(sigma) %*% Y
  # the estimated coefficients with weight least squares method (1st iteration)

  beta_list <- list(beta_ols, beta_wls)

  i <- 3
  while(i <= iterations){

    Y_hat <- X %*% beta_wls

    resid <- Y - Y_hat

    W_hat <- diag(size) * as.vector(abs(resid)^2)

    if(abs(min(eigen(W_hat)$values)) < 1e-15){break}
    # stops when the estimated covariance matrix (W.hat) becomes singular

    beta_wls <- solve(t(X) %*% solve(W_hat) %*% X) %*% t(X) %*% solve(W_hat) %*% Y

    beta_list[[i]] <- beta_wls
    i <- i+1
  }
  return(beta_list)
}

直接调用函数wls(mod3)可以给出如下结果：

(Intercept) -57.9876589

X1 4.7081605

X2 0.3392512

(Intercept) -58.6005437

X1 4.7738483

X2 0.3364717

(Intercept)	-57.9876589
X1	4.7081605
X2	0.3392512

(Intercept)	-58.6005437
X1	4.7738483
X2	0.3364717

从而，加权最小二乘模型为

$$
Y = -58.600543 + 4.7738483X1 + 0.3364717X2
$$