Linear Regression in Julia with the GLM Package

This post uses the GLM package for a simple linear regression analysis, along with model diagnostics and a plot of the confidence interval.

```julia
using GLM, DataFrames
using Distributions  # for Normal() below; some GLM versions already re-export it
```

First, randomly generate a data set for the regression analysis.

```julia
# set the intercept
α = 100
# set the slope
β = 5
# set the sample size
n = 50
# set the error term
ε = rand(Normal(), n) * 11
# generate the data
x = randn(n) * 3
y = α + β * x + ε
dataForRegression = DataFrame(x = x, y = y)
```
The first rows of the generated 50-row DataFrame (display truncated):

```
      x                     y
 1    -0.2945844765867794    95.47303779965895
 2    -1.3440842917796316   100.32164951469537
 3     1.3037018863265355   109.57629236830104
 4     4.012507916775679    106.46933242509654
 5    -1.7321051747524616    89.55844065579969
 6    -1.4253325394154235    92.72629790575935
 ⋮             ⋮                     ⋮
```
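
Because the data are random, every run produces different numbers. To reproduce a particular run, seed the RNG before generating the data; a one-line sketch (srand on Julia 0.6, which this post uses):

```julia
srand(1234)  # on Julia ≥ 0.7: using Random; Random.seed!(1234)
```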

Fit the linear regression with the lm function:

```julia
mod = lm(@formula(y ~ x), dataForRegression)
```

```
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   99.8365   1.42361 70.1292   <1e-49
x             5.19724  0.457037 11.3716   <1e-14
```

An equivalent way to write this is:

```julia
mod = fit(LinearModel, @formula(y ~ x), dataForRegression)
```

```
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   99.8365   1.42361 70.1292   <1e-49
x             5.19724  0.457037 11.3716   <1e-14
```

The estimated slope is therefore 5.19724 and the intercept 99.8365, both close to the true values $\beta = 5$ and $\alpha = 100$, so the fitted line is $y = 99.8365 + 5.19724x$. The output also reports the standard errors, t statistics, and p-values.
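
These quantities can also be extracted programmatically. A minimal sketch, assuming the accessors exported by GLM/StatsBase (stderror is the current name; older releases called it stderr):

```julia
coef(mod)      # [99.8365, 5.19724]: intercept and slope
stderror(mod)  # standard errors of the estimates
confint(mod)   # 95% confidence intervals for the coefficients
r2(mod)        # coefficient of determination
```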

Next, we run an F-test of the hypothesis that the slope is $0$.

Null hypothesis $H_0: \beta = 0$; alternative hypothesis $H_1: \beta \neq 0$.

Construct the model with the slope constrained to $0$ (intercept only):

```julia
nullmod = lm(@formula(y ~ 1), dataForRegression)
```

```
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ +1

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   104.526   2.59199 40.3264   <1e-38
```

If the slope is $0$, the fitted intercept is the mean of $y$: minimizing $\sum_i (y_i - \alpha)^2$ over $\alpha$ gives $\hat{\alpha} = \bar{y}$.

```julia
isapprox(coef(nullmod)[1], mean(dataForRegression[:y]))
```

```
true
```

GLM's ftest function performs this F-test between nested regression models.

```julia
ftest(mod.model, nullmod.model)
```

```
        Res. DOF DOF ΔDOF        SSR        ΔSSR      R²    ΔR²       F*  p(>F)
Model 1       48   3       4455.8890              0.7293
Model 2       49   2   -1 16460.1689 -12004.2799 -0.0000 0.7293 129.3132 <1e-14
```

In this example $p < 0.05$, so the difference is significant: we reject the hypothesis $\beta = 0$, i.e. the slope is significantly nonzero, which supports the validity of the regression equation. The value $R^2 = 0.7293$ also indicates a fairly strong association between $x$ and $y$.
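
To see where the reported statistic comes from, F can be recomputed by hand from the two models' residual sums of squares. A short sketch, assuming the deviance and dof_residual accessors from StatsBase (for a linear model, deviance is the SSR):

```julia
ssr_full = deviance(mod.model)      # 4455.8890
ssr_null = deviance(nullmod.model)  # 16460.1689
# F = (ΔSSR / Δdf) / (SSR_full / residual df)
F = ((ssr_null - ssr_full) / 1) / (ssr_full / dof_residual(mod.model))
# ≈ 129.31, matching the ftest output
```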

As a cross-check, we can run the regression in R through the RCall package. Note that this call regresses x on y, i.e. the variables are swapped relative to the Julia model:

```julia
using RCall
R"summary(lm($(dataForRegression[:x]) ~ $(dataForRegression[:y])))"
```
```
RCall.RObject{RCall.VecSxp}

Call:
lm(formula = `#JL`$`(dataForRegression[:x])` ~ `#JL`$`(dataForRegression[:y])`)

Residuals:
    Min      1Q  Median      3Q     Max
-3.2220 -0.9915 -0.1898  0.7734  3.6579

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     -13.76511    1.30911  -10.52 4.77e-14 ***
`#JL`$`(dataForRegression[:y])`   0.14032    0.01234   11.37 3.19e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.583 on 48 degrees of freedom
Multiple R-squared:  0.7293,    Adjusted R-squared:  0.7237
F-statistic: 129.3 on 1 and 48 DF,  p-value: 3.195e-15
```

The R output above comes from the reverse regression (x on y), so the coefficient estimates differ from Julia's. The t statistic (11.37), $R^2$ (0.7293), F statistic (129.3), and p-value nevertheless agree with the Julia results up to printed precision, because these quantities are symmetric in $x$ and $y$ for simple linear regression.
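
To compare the coefficient estimates directly, run the same model as in Julia by swapping the variables back (a sketch; the output is omitted here):

```julia
R"summary(lm($(dataForRegression[:y]) ~ $(dataForRegression[:x])))"
```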

Next, we visualize the regression results.

```julia
# scatter plot of the data with a fitted line
using Gadfly
plot(layer(dataForRegression,
           x = :x,
           y = :y,
           Geom.point),
     layer(dataForRegression,
           x = :x,
           y = :y,
           Geom.smooth(method = :lm),
           color = [colorant"purple"]))
```


The predict function returns the fitted values of the regression. The residuals are the observed values minus the fitted values.

```julia
# residual scatter plot
resid = dataForRegression[:y] - predict(mod)
plot(x = predict(mod),
     y = resid,
     Geom.point,
     Guide.xlabel("predict(mod)"),
     Guide.ylabel("resid"))
```
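
GLM also computes the residuals itself; a quick sanity check, assuming the residuals accessor from StatsBase:

```julia
# should match the hand-computed resid vector above
isapprox(residuals(mod.model), resid)
```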


```julia
# residual segment plot: a vertical segment from each observation to its fitted value
plot(dataForRegression, x = "x",
     y = "y",
     xend = "x",
     yend = predict(mod),
     Geom.segment)
```


```julia
# check the normality of the residuals via the linearity of the Q-Q plot
plot(x = Normal(), y = resid, Stat.qq, Geom.point)
```
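
The visual check can be backed by a formal normality test; a sketch using the Jarque-Bera test from the HypothesisTests package (an addition, not part of the original analysis):

```julia
using HypothesisTests
# a large p-value means no evidence against normality of the residuals
JarqueBeraTest(resid)
```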


```julia
# plot the confidence interval
# predict(..., :confint) returns a matrix whose columns are
# the fitted value, the lower bound, and the upper bound
mod_predict_confidence = predict(mod,
    hcat(ones(nrow(dataForRegression)), collect(dataForRegression[:, 1])),
    :confint)

plot(layer(dataForRegression,
           x = "x",
           y = "y",
           Geom.point,
           intercept = [coef(mod)[1]],
           slope = [coef(mod)[2]],
           Geom.abline),
     layer(x = dataForRegression[:x],
           y = mod_predict_confidence[:, 2],
           Geom.point, Theme(default_color = "purple")),
     layer(x = dataForRegression[:x],
           y = mod_predict_confidence[:, 3],
           Geom.point, Theme(default_color = "purple")))
```
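
In recent GLM versions the same intervals can be requested from a DataFrame directly, without building the design matrix by hand. A minimal sketch, assuming the keyword form of predict (check your GLM version):

```julia
# returns prediction, lower, and upper columns
ci = predict(mod, dataForRegression; interval = :confidence)
```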


Author: Monad Kai
Link: onlookerliu.github.io/2017/11/29/julia基于GLM包的线性回归/
License: Unless otherwise stated, all posts on this blog are released under the CC BY-NC-SA 4.0 license. Please credit Code@浮生记 when reposting.