本文利用GLM包进行简单线性回归分析,同时对模型检验以及置信区间作图。
1  | using GLM,DataFrames  | 
先随机生成一组数据用于回归分析。
1  | # 设置截距  | 
| x | y | |
|---|---|---|
| 1 | -0.2945844765867794 | 95.47303779965895 | 
| 2 | -1.3440842917796316 | 100.32164951469537 | 
| 3 | 1.3037018863265355 | 109.57629236830104 | 
| 4 | 4.012507916775679 | 106.46933242509654 | 
| 5 | -1.7321051747524616 | 89.55844065579969 | 
| 6 | -1.4253325394154235 | 92.72629790575935 | 
| 7 | -0.9042631229720477 | 87.93522916305417 | 
| 8 | 3.4865202058421714 | 127.92951823278281 | 
| 9 | -2.134433785764415 | 94.20802706899649 | 
| 10 | 3.322020816866826 | 118.04216195389577 | 
| 11 | 2.203086430462482 | 105.55320230669415 | 
| 12 | -5.281962503754427 | 82.8888784146054 | 
| 13 | 2.001897730730764 | 108.02467262087006 | 
| 14 | -2.7543838829327565 | 88.76348948674855 | 
| 15 | 6.101614682211077 | 128.6239736717972 | 
| 16 | 0.6888033886810481 | 111.17457991815601 | 
| 17 | 0.7010819245192449 | 107.39681268065465 | 
| 18 | 3.1116548996809783 | 113.4549225848041 | 
| 19 | -1.8511960189357837 | 83.79388935398916 | 
| 20 | -0.962664244723129 | 91.41543924802386 | 
| 21 | 0.5608348467513868 | 90.8605217368299 | 
| 22 | 2.1522594774515786 | 93.8105103626811 | 
| 23 | -0.31022032825520746 | 101.53069848453823 | 
| 24 | 1.6256993827235733 | 104.12765249398367 | 
| 25 | 3.3214081526457813 | 118.46644563889394 | 
| 26 | 0.43501963025212675 | 95.810320259296 | 
| 27 | 2.59876092609342 | 121.9741328977443 | 
| 28 | -1.0290269848902651 | 113.72423797098067 | 
| 29 | -0.21404769863190354 | 92.43405156194733 | 
| 30 | 3.5537714568100016 | 99.06513391626848 | 
| ⋮ | ⋮ | ⋮ | 
使用lm函数进行线性回归分析:
1  | mod = lm((y ~ x), dataForRegression)  | 
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + x
Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   99.8365   1.42361 70.1292   <1e-49
x             5.19724  0.457037 11.3716   <1e-14
另外一种等价的写法是:
1  | mod = fit(LinearModel, (y ~ x), dataForRegression)  | 
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + x
Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   99.8365   1.42361 70.1292   <1e-49
x             5.19724  0.457037 11.3716   <1e-14
故知回归系数为5.19724,截距为99.8365,所以最优拟合直线为 $y = 99.8365 + 5.19724x$。同时给出了标准差、t检验和p值等结果。
接下来对假设回归系数是 $0$ 进行FF检验。
原假设 $H_0: \beta = 0$,备选假设 $H_1: \beta\neq 0$
构造回归系数为 $0$ 的模型:
1  | nullmod = lm((y ~ 1), dataForRegression)  | 
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ +1
Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   104.526   2.59199 40.3264   <1e-38
假设回归系数为 $0$,则截距为 $y$ 的平均数。
1  | isapprox(coef(nullmod)[1], mean(dataForRegression[:y]))  | 
true
GLM的ftest函数提供了回归模型的FF检验。
1  | ftest(mod.model, nullmod.model)  | 
        Res. DOF DOF ΔDOF        SSR        ΔSSR      R²    ΔR²       F*  p(>F)
Model 1       48   3       4455.8890              0.7293
Model 2       49   2   -1 16460.1689 -12004.2799 -0.0000 0.7293 129.3132 <1e-14
在此例中,$p<0.05$,差异显著,拒绝 $\beta = 0$ 的假设,回归系数显著不为0,说明了回归方程有较高的有效性。而 $R^2=0.7293$ 也说明了随机变量 $x$ 与 $y$ 之间有相对高的关联程度。
1  | using RCall  | 
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/kay/.julia/v0.6/NullableArrays/src/operators.jl:99.
RCall.RObject{RCall.VecSxp}
Call:
lm(formula = `#JL`$`(dataForRegression[:x])` ~ `#JL`$`(dataForRegression[:y])`)
Residuals:
    Min      1Q  Median      3Q     Max
-3.2220 -0.9915 -0.1898  0.7734  3.6579
Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     -13.76511    1.30911  -10.52 4.77e-14 ***
`#JL`$`(dataForRegression[:y])`   0.14032    0.01234   11.37 3.19e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.583 on 48 degrees of freedom
Multiple R-squared:  0.7293,    Adjusted R-squared:  0.7237
F-statistic: 129.3 on 1 and 48 DF,  p-value: 3.195e-15
以上是R的输出结果,除了小数点精度外,结果完全相同。
下面对回归结果做可视化。
1  | # 基本数据作图  | 
函数predict返回的是回归值。残差定义为回归值与观测值之差。
1  | # 残差点图  | 
1  | # 残差线段图  | 
1  | # 通过Q-Q图的线性程度来检验残差的正态性。  | 
1  | # 绘制置信区间  | 


