gam.selection(mgcv)
Generalized Additive Model Selection
Description
This page is intended to provide some more information on how to select GAMs. In particular, it gives a brief overview of smoothness selection, and then discusses how this can be extended to select inclusion/exclusion of terms. Hypothesis testing approaches to the latter problem are also discussed.
Smoothness selection criteria
Given a model structure specified by a gam model formula, gam() attempts to find the appropriate smoothness for each applicable model term using prediction error criteria or likelihood based methods. The prediction error criteria used are Generalized (Approximate) Cross Validation (GCV or GACV) when the scale parameter is unknown, or an Un-Biased Risk Estimator (UBRE) when it is known. UBRE is essentially scaled AIC (generalized case) or Mallows' Cp (additive model case). GCV and UBRE are covered in Craven and Wahba (1979) and Wahba (1990). Alternatively, REML or maximum likelihood (ML) may be used for smoothness selection, by viewing the smooth components as random effects (in this case the variance component for each smooth random effect will be given by the scale parameter divided by the smoothing parameter; for smooths with multiple penalties, there will be multiple variance components). The method argument to gam selects the smoothness selection criterion.
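As a minimal sketch of switching criteria via the method argument (data simulated with gamSim; the object names here are illustrative):

## fit the same model under two smoothness selection criteria
library(mgcv)
set.seed(1); n <- 200
dat0 <- gamSim(1, n = n, scale = 2)  ## simulate Gaussian test data
b.gcv  <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat0, method = "GCV.Cp")
b.reml <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat0, method = "REML")
b.gcv$sp; b.reml$sp                  ## compare the selected smoothing parameters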
Automatic smoothness selection is unlikely to be successful with few data, particularly when multiple terms are to be selected. In addition, GCV and UBRE/AIC scores can occasionally display local minima that can trap the minimisation algorithms. GCV/UBRE/AIC scores become constant with changing smoothing parameters at very low or very high smoothing parameters, and on occasion these "flat" regions can be separated from regions of lower score by a small "lip". This seems to be the most common form of local minimum, but it is usually avoidable by avoiding extreme smoothing parameters as starting values in optimization, and by avoiding big jumps in smoothing parameters while optimizing. Nevertheless, if you are suspicious of smoothing parameter estimates, try changing the fit method (see the gam arguments method and optimizer) and see if the estimates change, or try changing some or all of the smoothing parameters "manually" (argument sp of gam, or the sp arguments to s or te).
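For example (a sketch continuing with dat0 from above; the sp values are arbitrary illustrations, not recommendations):

## refit with a different optimizer, or fix smoothing parameters "manually"
b1 <- gam(y ~ s(x0) + s(x1), data = dat0, method = "REML",
          optimizer = c("outer", "newton"))
b2 <- gam(y ~ s(x0) + s(x1), data = dat0, sp = c(0.01, 0.01)) ## all sp supplied
b3 <- gam(y ~ s(x0, sp = 0.01) + s(x1), data = dat0)          ## sp for one term only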
REML and ML are less prone to local minima than the other criteria, and may therefore be preferable.
Automatic term selection
Unmodified smoothness selection by GCV, AIC, REML etc. will not usually remove a smooth from a model. This is because most smoothing penalties view some space of (non-zero) functions as "completely smooth" and once a term is penalized heavily enough that it is in this space, further penalization does not change it.
However, it is straightforward to modify smooths so that under heavy penalization they are penalized to the zero function and thereby "selected out" of the model. There are two approaches.
The first approach is to modify the smoothing penalty with an additional shrinkage term. The smooth classes cs.smooth and tprs.smooth (specified by "cs" and "ts" respectively) have smoothness penalties which include a small shrinkage component, so that for large enough smoothing parameters the smooth becomes identically zero. This allows automatic smoothing parameter selection methods to effectively remove the term from the model altogether. The shrinkage component of the penalty is set at a level that usually makes a negligible contribution to the penalization of the model, only becoming effective when the term is effectively "completely smooth" according to the conventional penalty.
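A sketch of the shrinkage bases in use (continuing with dat0 from above):

## "ts" and "cs" request the shrinkage versions of the thin plate and
## cubic regression spline bases; heavily penalized terms shrink to zero
b.shrink <- gam(y ~ s(x0, bs = "ts") + s(x1, bs = "ts") +
                    s(x2, bs = "cs") + s(x3, bs = "cs"),
                data = dat0, method = "REML")
summary(b.shrink)  ## terms with EDF near zero have effectively been selected out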
The second approach leaves the original smoothing penalty unchanged, but constructs an additional penalty for each smooth, which penalizes only functions in the null space of the original penalty (the "completely smooth" functions). Hence, if all the smoothing parameters for a term tend to infinity, the term will be selected out of the model. This latter approach is more expensive computationally, but has the advantage that it can be applied automatically to any smooth term. The select argument to gam turns on this method.
In fact, as implemented, both approaches operate by eigen-decomposing the original penalty matrix. A new penalty is created on the null space: it is the matrix with the same eigenvectors as the original penalty, but with the originally positive eigenvalues set to zero, and the originally zero eigenvalues set to something positive. The first approach simply adds a multiple of this penalty to the original penalty, where the multiple is chosen so that the new penalty cannot dominate the original. The second approach treats the new penalty as an extra penalty, with its own smoothing parameter.
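A toy illustration of this construction, using a second-order difference penalty in place of a real smoothing penalty (the tolerance and the replacement eigenvalue of 1 are illustrative choices, not mgcv's internal code):

S  <- crossprod(diff(diag(5), differences = 2)) ## toy penalty matrix, rank 3
es <- eigen(S, symmetric = TRUE)
U  <- es$vectors; d <- es$values
tol    <- max(d) * 1e-8                ## tolerance for "zero" eigenvalues
d.new  <- ifelse(d > tol, 0, 1)        ## swap: positive -> zero, zero -> positive
S.null <- U %*% diag(d.new) %*% t(U)   ## penalizes only the null space of S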
Of course, as with all model selection methods, some care must be taken to ensure that the automatic selection is sensible, and a decision has to be made about the effective degrees of freedom at which to declare a term "negligible".
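For example, the per-term effective degrees of freedom reported by summary.gam can be inspected (using b.shrink as fitted above; any cutoff for "negligible" is a user judgement, not an mgcv rule):

round(summary(b.shrink)$edf, 3)  ## per-term EDF; values near zero suggest removal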
Interactive term selection
In general the most logically consistent method to use for deciding which terms to include in the model is to compare GCV/UBRE/ML scores for models with and without the term (REML scores should not be used to compare models with different fixed effects structures). When UBRE is the smoothness selection method this will give the same result as comparing by AIC (the AIC in this case uses the model EDF in place of the usual model DF). Similarly, comparison via GCV score and via AIC seldom yields different answers. Note that the negative binomial with estimated theta parameter is a special case: the GCV score is not informative, because of the theta estimation scheme used. More generally the score for the model with a smooth term can be compared to the score for the model with the smooth term replaced by appropriate parametric terms. Candidates for replacement by parametric terms are smooth terms with estimated degrees of freedom close to their minimum possible.
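For instance, a sketch comparing scores with and without a term (continuing with dat0; ML is used so that models with different fixed effects structures can be compared):

b.full <- gam(y ~ s(x0) + s(x1) + s(x2), data = dat0, method = "ML")
b.red  <- gam(y ~ s(x0) + s(x1),         data = dat0, method = "ML")
b.full$gcv.ubre; b.red$gcv.ubre  ## minimized ML scores (smaller is better)
AIC(b.full, b.red)               ## AIC comparison, based on model EDF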
Candidates for removal can also be identified by reference to the approximate p-values provided by summary.gam, and by looking at the extent to which the confidence band for an estimated term includes the zero function. It is perfectly possible to perform backwards selection using p-values in the usual way: that is, by sequentially dropping the single term with the highest non-significant p-value from the model and re-fitting, until all terms are significant. This suffers from the same problems as stepwise procedures for any GLM/LM, with the additional caveat that the p-values are only approximate. If adopting this approach, it is probably best to use ML smoothness selection.
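A sketch of one step of such backwards selection (continuing with b.full from above):

summary(b.full)                         ## approximate p-values for each smooth
## drop the single least significant term and refit:
b.drop <- update(b.full, . ~ . - s(x2))
summary(b.drop)                         ## repeat until all terms are significant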
Note that GCV and UBRE are not appropriate for comparing models using different families: in that case AIC should be used.
Caveats/platitudes
Formal model selection methods are only appropriate for selecting between reasonable models. If formal model selection is attempted starting from a model that simply doesn't fit the data, then it is unlikely to provide meaningful results.
The more thought is given to appropriate model structure up front, the more successful model selection is likely to be. Simply starting with a hugely flexible model with "everything in" and hoping that automatic selection will find the right structure is not often successful.
Author(s)
Simon N. Wood simon.wood@r-project.org
References
Craven, P. and Wahba, G. (1979) Smoothing noisy data with spline functions. Numerische Mathematik 31:377-403.

Wahba, G. (1990) Spline Models for Observational Data. SIAM.

Wood, S.N. (2008) Fast stable direct fitting and smoothness selection for generalized additive models. J. R. Statist. Soc. B 70(3):495-518.

Wood, S.N. (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B) 73(1):3-36.
See Also
gam, step.gam
Examples
## an example of automatic model selection via null space penalization
library(mgcv)
set.seed(3);n<-200
dat <- gamSim(1,n=n,scale=.15,dist="poisson") ## simulate data
dat$x4 <- runif(n, 0, 1);dat$x5 <- runif(n, 0, 1) ## spurious
b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5),data=dat,
family=poisson,select=TRUE,method="REML")
summary(b)
plot(b,pages=1)