choose.k(mgcv)
choose.k() belongs to the R package mgcv
Basis dimension choice for smooths
Translator: 生物统计家园网 (biostatistic.net) robot LoveR
----------Description----------
Choosing the basis dimension, and checking the choice, when using penalized regression smoothers.
Penalized regression smoothers gain computational efficiency by virtue of being defined using a basis of relatively modest size, k. When setting up models in the mgcv package, using s or te terms in a model formula, k must be chosen: the defaults are essentially arbitrary.
In practice k-1 (or k) sets the upper limit on the degrees of freedom associated with an s smooth (1 degree of freedom is usually lost to the identifiability constraint on the smooth). For te smooths the upper limit of the degrees of freedom is given by the product of the k values provided for each marginal smooth, less one for the constraint. However, the actual effective degrees of freedom are controlled by the degree of penalization selected during fitting, by GCV, AIC, REML or whatever is specified. The exception to this is if a smooth is specified using the fx=TRUE option, in which case it is unpenalized.
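The limits described above can be inspected on a fitted model. The following is a minimal sketch, assuming data simulated with gamSim as in the examples for this page; the object names b, n_coef and edf_by_term are illustrative only. It counts the coefficients belonging to each smooth (the upper limit on its degrees of freedom) using the first.para and last.para elements of the smooth objects, and compares them with the effective degrees of freedom of each term.

```r
library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 400, scale = 2)

## one penalized s() term, one te() term and one unpenalized (fx=TRUE) term
b <- gam(y ~ s(x0, k = 10) + te(x1, x2, k = 5) + s(x3, k = 8, fx = TRUE),
         data = dat)

## number of coefficients per smooth = upper limit on its degrees of freedom
n_coef <- sapply(b$smooth, function(sm) sm$last.para - sm$first.para + 1)
names(n_coef) <- sapply(b$smooth, function(sm) sm$label)

## effective degrees of freedom per smooth: sum of the per-coefficient EDFs
edf_by_term <- sapply(b$smooth,
                      function(sm) sum(b$edf[sm$first.para:sm$last.para]))

print(rbind(upper_limit = n_coef, edf = edf_by_term))
```

Under these assumptions the upper limits should come out as 10-1 for s(x0), 5*5-1 for te(x1,x2) and 8-1 for s(x3), and the EDF of the fx=TRUE term should sit at its upper limit because no penalty is applied to it.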
So, exact choice of k is not generally critical: it should be chosen to be large enough that you are reasonably sure of having enough degrees of freedom to represent the underlying "truth" reasonably well, but small enough to maintain reasonable computational efficiency. Clearly "large" and "small" are dependent on the particular problem being addressed.
As with all model assumptions, it is useful to be able to check the choice of k informally. If the effective degrees of freedom for a model term are estimated to be much less than k-1 then this is unlikely to be very worthwhile, but as the EDF approaches k-1, checking can be important. A useful general purpose approach goes as follows: (i) fit your model and extract the deviance residuals; (ii) for each smooth term in your model, fit an equivalent, single, smooth to the residuals, using a substantially increased k to see if there is a pattern in the residuals that could potentially be explained by increasing k. Examples are provided below.
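Steps (i) and (ii) can be wrapped in a small loop. The following is only a sketch: resid_k_check is a hypothetical helper written for illustration, not an mgcv function, and the choices k_big = 40, bs="cs" and gamma=1.4 simply mirror the examples for this page. It returns the EDF of the residual smooth for each covariate.

```r
library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 400, scale = 2)
b <- gam(y ~ s(x0, k = 6) + s(x1, k = 6) + s(x2, k = 6) + s(x3, k = 6),
         data = dat)

## hypothetical helper: smooth the deviance residuals on each covariate in turn
resid_k_check <- function(model, data, vars, k_big = 40) {
  data$rsd <- residuals(model)  # deviance residuals are the default for gam fits
  sapply(vars, function(v) {
    f <- as.formula(sprintf('rsd ~ s(%s, k = %d, bs = "cs")', v, k_big))
    fit <- gam(f, gamma = 1.4, data = data)
    sum(summary(fit)$edf)       # EDF of the residual smooth
  })
}

rsd_edf <- resid_k_check(b, dat, c("x0", "x1", "x2", "x3"))
print(round(rsd_edf, 2))
```

Because "cs" is a shrinkage smoother, a residual smooth with no real pattern tends to be shrunk towards an EDF near zero, while a clearly larger EDF for one covariate suggests that the corresponding k in the original model may be too low.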
The obvious, but more costly, alternative is simply to increase the suspect k and refit the original model. If there are no statistically important changes as a result of doing this, then k was large enough. (Change in the smoothness selection criterion, and/or the effective degrees of freedom, when k is increased, provide the obvious numerical measures for whether the fit has changed substantially.)
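The numerical measures mentioned above can be collected in a small table. The following is a sketch under stated assumptions: fit_one and k_path are illustrative names, the model and k values mirror the examples for this page, and the smoothness selection score is read from the gcv.ubre element of the fitted object (the REML score when method="REML").

```r
library(mgcv)
set.seed(0)
dat <- gamSim(1, n = 400, scale = 2)

## refit with increasing k for s(x2) only, recording the REML score and the EDF
fit_one <- function(k2) {
  f <- as.formula(sprintf(
    "y ~ s(x0, k = 6) + s(x1, k = 6) + s(x2, k = %d) + s(x3, k = 6)", k2))
  fit <- gam(f, data = dat, method = "REML")
  data.frame(k      = k2,
             reml   = as.numeric(fit$gcv.ubre),         # minimized REML score
             edf_x2 = as.numeric(summary(fit)$edf[3]))  # EDF of the third smooth
}

k_path <- do.call(rbind, lapply(c(6, 12, 24, 40), fit_one))
print(k_path)
```

Once both columns stop changing appreciably from one row to the next, further increases in k are unlikely to alter the fit in any important way.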
gam.check runs a simple simulation-based check on the basis dimensions, which can help to flag up terms for which k is too low. A grossly too small k will also be visible from the partial residuals available with plot.gam.
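The basis dimension check can also be obtained as a table, without the diagnostic plots. The following is a minimal sketch, assuming the installed mgcv version provides k.check() (the function that produces the table printed by gam.check); the 0.05 threshold and the name flagged are illustrative choices, not part of the package.

```r
library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 400, scale = 2)
b <- gam(y ~ s(x0, k = 6) + s(x1, k = 6) + s(x2, k = 6) + s(x3, k = 6),
         data = dat)

## matrix with one row per smooth: k', edf, k-index and p-value
kc <- k.check(b)
print(kc)

## terms with a low p-value (illustrative 0.05 cut-off) are candidates
## for a larger k, especially when their edf is close to k'
flagged <- rownames(kc)[which(kc[, "p-value"] < 0.05)]
print(flagged)
```

The p-values are computed by simulation, so they can vary a little between runs; a flagged term is a prompt to look more closely, not proof that k is too low.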
One scenario that can cause confusion is this: a model is fitted with k=10 for a smooth term, and the EDF for the term is estimated as 7.6, some way below the maximum of 9. The model is then refitted with k=20 and the EDF increases to 8.7. What is happening, and why was the EDF not 8.7 the first time around? The explanation is that the function space with k=20 contains a larger subspace of functions with EDF 8.7 than did the function space with k=10: one of the functions in this larger subspace fits the data a little better than did any function in the smaller subspace. These subtleties seldom have much impact on the statistical conclusions to be drawn from a model fit, however.
----------Author(s)----------
Simon N. Wood simon.wood@r-project.org
----------Examples----------
## Simulate some data ....
library(mgcv)
set.seed(1)
dat <- gamSim(1,n=400,scale=2)
## fit a GAM with quite low `k'
b<-gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=6)+s(x3,k=6),data=dat)
plot(b,pages=1,residuals=TRUE) ## hint of a problem in s(x2)
## the following suggests a problem with s(x2)
gam.check(b)
## Another approach (see below for more obvious method)....
## check for residual pattern, removable by increasing `k'
## typically `k', below, should be substantially larger than
## the original `k', but certainly less than n/2.
## Note use of cheap "cs" shrinkage smoothers, and gamma=1.4
## to reduce chance of overfitting...
rsd <- residuals(b)
gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x1,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x2,k=40,bs="cs"),gamma=1.4,data=dat) ## `k' too low
gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
## refit...
b <- gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=20)+s(x3,k=6),data=dat)
gam.check(b) ## better
## similar example with multi-dimensional smooth
b1 <- gam(y~s(x0)+s(x1,x2,k=15)+s(x3),data=dat)
rsd <- residuals(b1)
gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x1,x2,k=100,bs="ts"),gamma=1.4,data=dat) ## `k' too low
gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam.check(b1) ## shows same problem
## and a `te' example
b2 <- gam(y~s(x0)+te(x1,x2,k=4)+s(x3),data=dat)
rsd <- residuals(b2)
gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~te(x1,x2,k=10,bs="cs"),gamma=1.4,data=dat) ## `k' too low
gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam.check(b2) ## shows same problem
## same approach works with other families in the original model
dat <- gamSim(1,n=400,scale=.25,dist="poisson")
bp<-gam(y~s(x0,k=5)+s(x1,k=5)+s(x2,k=5)+s(x3,k=5),
family=poisson,data=dat,method="ML")
gam.check(bp)
rsd <- residuals(bp)
gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x1,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
gam(rsd~s(x2,k=40,bs="cs"),gamma=1.4,data=dat) ## `k' too low
gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
rm(dat)
## More obvious, but more expensive tactic... Just increase
## suspicious k until fit is stable.
set.seed(0)
dat <- gamSim(1,n=400,scale=2)
## fit a GAM with quite low `k'
b <- gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=6)+s(x3,k=6),
data=dat,method="REML")
b
## edf for 3rd smooth is highest as proportion of k -- increase k
b <- gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=12)+s(x3,k=6),
data=dat,method="REML")
b
## edf substantially up, -ve REML substantially down
b <- gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=24)+s(x3,k=6),
data=dat,method="REML")
b
## slight edf increase and -ve REML change
b <- gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=40)+s(x3,k=6),
data=dat,method="REML")
b
## definitely stabilized (but really k around 20 would have been fine)