Linear Regression in Julia with the GLM Package

This post uses the GLM package for a simple linear regression analysis, along with model diagnostics and a plot of the confidence interval.

```julia
using GLM, DataFrames
using Distributions  # for Normal() below; some GLM versions already re-export it
```

First, randomly generate a data set for the regression analysis.

```julia
# set the intercept
α = 100
# set the slope
β = 5
# set the sample size
n = 50
# set the error term
ε = rand(Normal(), n) * 11
# generate the data
x = randn(n) * 3
y = α + β * x + ε
dataForRegression = DataFrame(x = x, y = y)
```
The first rows of the generated 50-row DataFrame (display truncated):

```
      x                     y
 1    -0.2945844765867794    95.47303779965895
 2    -1.3440842917796316   100.32164951469537
 3     1.3037018863265355   109.57629236830104
 4     4.012507916775679    106.46933242509654
 5    -1.7321051747524616    89.55844065579969
 6    -1.4253325394154235    92.72629790575935
 ⋮             ⋮                     ⋮
```
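
Because the data are random, every run produces different numbers. To reproduce a particular run, seed the RNG before generating the data; a one-line sketch (srand on Julia 0.6, which this post uses):

```julia
srand(1234)  # on Julia ≥ 0.7: using Random; Random.seed!(1234)
```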

Fit the linear regression with the lm function:

```julia
mod = lm(@formula(y ~ x), dataForRegression)
```

```
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   99.8365   1.42361 70.1292   <1e-49
x             5.19724  0.457037 11.3716   <1e-14
```

An equivalent way to write this is:

```julia
mod = fit(LinearModel, @formula(y ~ x), dataForRegression)
```

```
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   99.8365   1.42361 70.1292   <1e-49
x             5.19724  0.457037 11.3716   <1e-14
```

The estimated slope is therefore 5.19724 and the intercept 99.8365, both close to the true values $\beta = 5$ and $\alpha = 100$, so the fitted line is $y = 99.8365 + 5.19724x$. The output also reports the standard errors, t statistics, and p-values.
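
These quantities can also be extracted programmatically. A minimal sketch, assuming the accessors exported by GLM/StatsBase (stderror is the current name; older releases called it stderr):

```julia
coef(mod)      # [99.8365, 5.19724]: intercept and slope
stderror(mod)  # standard errors of the estimates
confint(mod)   # 95% confidence intervals for the coefficients
r2(mod)        # coefficient of determination
```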

Next, we run an F-test of the hypothesis that the slope is $0$.

Null hypothesis $H_0: \beta = 0$; alternative hypothesis $H_1: \beta \neq 0$.

Construct the model with the slope constrained to $0$ (intercept only):

```julia
nullmod = lm(@formula(y ~ 1), dataForRegression)
```

```
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ +1

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   104.526   2.59199 40.3264   <1e-38
```

If the slope is $0$, the fitted intercept is the mean of $y$: minimizing $\sum_i (y_i - \alpha)^2$ over $\alpha$ gives $\hat{\alpha} = \bar{y}$.

```julia
isapprox(coef(nullmod)[1], mean(dataForRegression[:y]))
```

```
true
```

GLM's ftest function performs this F-test between nested regression models.

```julia
ftest(mod.model, nullmod.model)
```

```
        Res. DOF DOF ΔDOF        SSR        ΔSSR      R²    ΔR²       F*  p(>F)
Model 1       48   3       4455.8890              0.7293
Model 2       49   2   -1 16460.1689 -12004.2799 -0.0000 0.7293 129.3132 <1e-14
```

In this example $p < 0.05$, so the difference is significant: we reject the hypothesis $\beta = 0$, i.e. the slope is significantly nonzero, which supports the validity of the regression equation. The value $R^2 = 0.7293$ also indicates a fairly strong association between $x$ and $y$.
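
To see where the reported statistic comes from, F can be recomputed by hand from the two models' residual sums of squares. A short sketch, assuming the deviance and dof_residual accessors from StatsBase (for a linear model, deviance is the SSR):

```julia
ssr_full = deviance(mod.model)      # 4455.8890
ssr_null = deviance(nullmod.model)  # 16460.1689
# F = (ΔSSR / Δdf) / (SSR_full / residual df)
F = ((ssr_null - ssr_full) / 1) / (ssr_full / dof_residual(mod.model))
# ≈ 129.31, matching the ftest output
```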

As a cross-check, we can run the regression in R through the RCall package. Note that this call regresses x on y, i.e. the variables are swapped relative to the Julia model:

```julia
using RCall
R"summary(lm($(dataForRegression[:x]) ~ $(dataForRegression[:y])))"
```
```
RCall.RObject{RCall.VecSxp}

Call:
lm(formula = `#JL`$`(dataForRegression[:x])` ~ `#JL`$`(dataForRegression[:y])`)

Residuals:
    Min      1Q  Median      3Q     Max
-3.2220 -0.9915 -0.1898  0.7734  3.6579

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     -13.76511    1.30911  -10.52 4.77e-14 ***
`#JL`$`(dataForRegression[:y])`   0.14032    0.01234   11.37 3.19e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.583 on 48 degrees of freedom
Multiple R-squared:  0.7293,    Adjusted R-squared:  0.7237
F-statistic: 129.3 on 1 and 48 DF,  p-value: 3.195e-15
```

The R output above comes from the reverse regression (x on y), so the coefficient estimates differ from Julia's. The t statistic (11.37), $R^2$ (0.7293), F statistic (129.3), and p-value nevertheless agree with the Julia results up to printed precision, because these quantities are symmetric in $x$ and $y$ for simple linear regression.
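
To compare the coefficient estimates directly, run the same model as in Julia by swapping the variables back (a sketch; the output is omitted here):

```julia
R"summary(lm($(dataForRegression[:y]) ~ $(dataForRegression[:x])))"
```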

Next, we visualize the regression results.

```julia
# scatter plot of the data with a fitted line
using Gadfly
plot(layer(dataForRegression,
           x = :x,
           y = :y,
           Geom.point),
     layer(dataForRegression,
           x = :x,
           y = :y,
           Geom.smooth(method = :lm),
           color = [colorant"purple"]))
```


The predict function returns the fitted values of the regression. The residuals are the observed values minus the fitted values.

```julia
# residual scatter plot
resid = dataForRegression[:y] - predict(mod)
plot(x = predict(mod),
     y = resid,
     Geom.point,
     Guide.xlabel("predict(mod)"),
     Guide.ylabel("resid"))
```
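
GLM also computes the residuals itself; a quick sanity check, assuming the residuals accessor from StatsBase:

```julia
# should match the hand-computed resid vector above
isapprox(residuals(mod.model), resid)
```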


```julia
# residual segment plot: a vertical segment from each observation to its fitted value
plot(dataForRegression, x = "x",
     y = "y",
     xend = "x",
     yend = predict(mod),
     Geom.segment)
```


```julia
# check the normality of the residuals via the linearity of the Q-Q plot
plot(x = Normal(), y = resid, Stat.qq, Geom.point)
```
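
The visual check can be backed by a formal normality test; a sketch using the Jarque-Bera test from the HypothesisTests package (an addition, not part of the original analysis):

```julia
using HypothesisTests
# a large p-value means no evidence against normality of the residuals
JarqueBeraTest(resid)
```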


```julia
# plot the confidence interval
# predict(..., :confint) returns a matrix whose columns are
# the fitted value, the lower bound, and the upper bound
mod_predict_confidence = predict(mod,
    hcat(ones(nrow(dataForRegression)), collect(dataForRegression[:, 1])),
    :confint)

plot(layer(dataForRegression,
           x = "x",
           y = "y",
           Geom.point,
           intercept = [coef(mod)[1]],
           slope = [coef(mod)[2]],
           Geom.abline),
     layer(x = dataForRegression[:x],
           y = mod_predict_confidence[:, 2],
           Geom.point, Theme(default_color = "purple")),
     layer(x = dataForRegression[:x],
           y = mod_predict_confidence[:, 3],
           Geom.point, Theme(default_color = "purple")))
```
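
In recent GLM versions the same intervals can be requested from a DataFrame directly, without building the design matrix by hand. A minimal sketch, assuming the keyword form of predict (check your GLM version):

```julia
# returns prediction, lower, and upper columns
ci = predict(mod, dataForRegression; interval = :confidence)
```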


Author: Monad Kai
Link: onlookerliu.github.io/2017/11/29/julia基于GLM包的线性回归/
License: Unless otherwise stated, all posts on this blog are released under the CC BY-NC-SA 4.0 license. Please credit Code@浮生记 when reposting.