本文利用GLM包进行简单线性回归分析,同时对模型检验以及置信区间作图。
1 | using GLM,DataFrames |
先随机生成一组数据用于回归分析。
1 | # 设置截距 |
x | y | |
---|---|---|
1 | -0.2945844765867794 | 95.47303779965895 |
2 | -1.3440842917796316 | 100.32164951469537 |
3 | 1.3037018863265355 | 109.57629236830104 |
4 | 4.012507916775679 | 106.46933242509654 |
5 | -1.7321051747524616 | 89.55844065579969 |
6 | -1.4253325394154235 | 92.72629790575935 |
7 | -0.9042631229720477 | 87.93522916305417 |
8 | 3.4865202058421714 | 127.92951823278281 |
9 | -2.134433785764415 | 94.20802706899649 |
10 | 3.322020816866826 | 118.04216195389577 |
11 | 2.203086430462482 | 105.55320230669415 |
12 | -5.281962503754427 | 82.8888784146054 |
13 | 2.001897730730764 | 108.02467262087006 |
14 | -2.7543838829327565 | 88.76348948674855 |
15 | 6.101614682211077 | 128.6239736717972 |
16 | 0.6888033886810481 | 111.17457991815601 |
17 | 0.7010819245192449 | 107.39681268065465 |
18 | 3.1116548996809783 | 113.4549225848041 |
19 | -1.8511960189357837 | 83.79388935398916 |
20 | -0.962664244723129 | 91.41543924802386 |
21 | 0.5608348467513868 | 90.8605217368299 |
22 | 2.1522594774515786 | 93.8105103626811 |
23 | -0.31022032825520746 | 101.53069848453823 |
24 | 1.6256993827235733 | 104.12765249398367 |
25 | 3.3214081526457813 | 118.46644563889394 |
26 | 0.43501963025212675 | 95.810320259296 |
27 | 2.59876092609342 | 121.9741328977443 |
28 | -1.0290269848902651 | 113.72423797098067 |
29 | -0.21404769863190354 | 92.43405156194733 |
30 | 3.5537714568100016 | 99.06513391626848 |
⋮ | ⋮ | ⋮ |
使用lm
函数进行线性回归分析:
1 | mod = lm( (y ~ x), dataForRegression) |
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + x
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 99.8365 1.42361 70.1292 <1e-49
x 5.19724 0.457037 11.3716 <1e-14
另外一种等价的写法是:
1 | mod = fit(LinearModel, (y ~ x), dataForRegression) |
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + x
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 99.8365 1.42361 70.1292 <1e-49
x 5.19724 0.457037 11.3716 <1e-14
故知回归系数为5.19724
,截距为99.8365
,所以最优拟合直线为 $y = 99.8365 + 5.19724x$。同时给出了标准差、t检验和p值等结果。
接下来对假设回归系数是 $0$ 进行FF检验。
原假设 $H_0: \beta = 0$,备选假设 $H_1: \beta\neq 0$
构造回归系数为 $0$ 的模型:
1 | nullmod = lm(1), dataForRegression) (y ~ |
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ +1
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 104.526 2.59199 40.3264 <1e-38
假设回归系数为 $0$,则截距为 $y$ 的平均数。
1 | isapprox(coef(nullmod)[1], mean(dataForRegression[:y])) |
true
GLM的ftest
函数提供了回归模型的FF检验。
1 | ftest(mod.model, nullmod.model) |
Res. DOF DOF ΔDOF SSR ΔSSR R² ΔR² F* p(>F)
Model 1 48 3 4455.8890 0.7293
Model 2 49 2 -1 16460.1689 -12004.2799 -0.0000 0.7293 129.3132 <1e-14
在此例中,$p<0.05$,差异显著,拒绝 $\beta = 0$ 的假设,回归系数显著不为0,说明了回归方程有较高的有效性。而 $R^2=0.7293$ 也说明了随机变量 $x$ 与 $y$ 之间有相对高的关联程度。
1 | using RCall |
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/kay/.julia/v0.6/NullableArrays/src/operators.jl:99.
RCall.RObject{RCall.VecSxp}
Call:
lm(formula = `#JL`$`(dataForRegression[:x])` ~ `#JL`$`(dataForRegression[:y])`)
Residuals:
Min 1Q Median 3Q Max
-3.2220 -0.9915 -0.1898 0.7734 3.6579
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.76511 1.30911 -10.52 4.77e-14 ***
`#JL`$`(dataForRegression[:y])` 0.14032 0.01234 11.37 3.19e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.583 on 48 degrees of freedom
Multiple R-squared: 0.7293, Adjusted R-squared: 0.7237
F-statistic: 129.3 on 1 and 48 DF, p-value: 3.195e-15
以上是R的输出结果,除了小数点精度外,结果完全相同。
下面对回归结果做可视化。
1 | # 基本数据作图 |
函数predict
返回的是回归值。残差定义为回归值与观测值之差。
1 | # 残差点图 |
1 | # 残差线段图 |
1 | # 通过Q-Q图的线性程度来检验残差的正态性。 |
1 | # 绘制置信区间 |