bam(mgcv)
bam() belongs to the R package mgcv
Generalized additive models for very large datasets
Translator: 生物统计家园网 (biostatistic.net) robot LoveR
----------Description----------
Fits a generalized additive model (GAM) to a very large data set, the term "GAM" being taken to include any quadratically penalized GLM. The degree of smoothness of model terms is estimated as part of fitting. In use the function is much like gam, except that the numerical methods are designed for datasets containing upwards of several tens of thousands of data. The advantage of bam is a much lower memory footprint than gam, but it can also be much faster for large datasets. bam can also compute on a cluster set up by the parallel package.
----------Usage----------
bam(formula,family=gaussian(),data=list(),weights=NULL,subset=NULL,
na.action=na.omit, offset=NULL,method="REML",control=list(),
scale=0,gamma=1,knots=NULL,sp=NULL,min.sp=NULL,paraPen=NULL,
chunk.size=10000,rho=0,sparse=FALSE,cluster=NULL,gc.level=1,...)
----------Arguments----------
Argument: formula
A GAM formula (see formula.gam and also gam.models). This is exactly like the formula for a GLM except that smooth terms, s and te, can be added to the right hand side to specify that the linear predictor depends on smooth functions of predictors (or linear functionals of these).
Argument: family
This is a family object specifying the distribution and link to use in fitting etc. See glm and family for more details. A negative binomial family is provided (see negbin), but only the known theta case is supported by bam.
Argument: data
A data frame or list containing the model response variable and covariates required by the formula. By default the variables are taken from environment(formula): typically the environment from which gam is called.
Argument: weights
Prior weights on the data.
Argument: subset
An optional vector specifying a subset of observations to be used in the fitting process.
Argument: na.action
A function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The "factory-fresh" default is na.omit.
Argument: offset
Can be used to supply a model offset for use in fitting. Note that this offset will always be completely ignored when predicting, unlike an offset included in formula: this conforms to the behaviour of lm and glm.
Argument: method
The smoothing parameter estimation method. "GCV.Cp" uses GCV for unknown scale parameter and Mallows' Cp/UBRE/AIC for known scale. "GACV.Cp" is equivalent, but uses GACV in place of GCV. "REML" gives REML estimation, including of unknown scale; "P-REML" gives REML estimation, but using a Pearson estimate of the scale. "ML" and "P-ML" are similar, but use maximum likelihood in place of REML.
Argument: control
A list of fit control parameters to replace defaults returned by gam.control. Any control parameters not supplied stay at their default values.
Argument: scale
If this is positive then it is taken as the known scale parameter. Negative signals that the scale parameter is unknown. 0 signals that the scale parameter is 1 for Poisson and binomial and unknown otherwise. Note that (RE)ML methods can only work with scale parameter 1 for the Poisson and binomial cases.
Argument: gamma
It is sometimes useful to inflate the model degrees of freedom in the GCV or UBRE/AIC score by a constant multiplier. This allows such a multiplier to be supplied.
Argument: knots
This is an optional list containing user specified knot values to be used for basis construction. For most bases the user simply supplies the knots to be used, which must match up with the k value supplied (note that the number of knots is not always just k). See tprs for what happens in the "tp"/"ts" case. Different terms can use different numbers of knots, unless they share a covariate.
Argument: sp
A vector of smoothing parameters can be provided here. Smoothing parameters must be supplied in the order that the smooth terms appear in the model formula. Negative elements indicate that the parameter should be estimated, and hence a mixture of fixed and estimated parameters is possible. If smooths share smoothing parameters then length(sp) must correspond to the number of underlying smoothing parameters.
Argument: min.sp
Lower bounds can be supplied for the smoothing parameters. Note that if this option is used then the smoothing parameters full.sp, in the returned object, will need to be added to what is supplied here to get the smoothing parameters actually multiplying the penalties. length(min.sp) should always be the same as the total number of penalties (so it may be longer than sp, if smooths share smoothing parameters).
Argument: paraPen
Optional list specifying any penalties to be applied to parametric model terms. gam.models explains more.
Argument: chunk.size
The model matrix is created in chunks of this size, rather than ever being formed whole.
Argument: rho
An AR1 error model can be used for the residuals (based on dataframe order) of Gaussian-identity link models. This is the AR1 correlation parameter.
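As a rough illustration of the rho argument, the sketch below simulates a smooth signal with AR1 errors and fits it with and without an AR1 error model. The simulated data, the variable names, and the choice rho = 0.6 are assumptions made for this example only; in practice rho would have to be chosen or estimated by the user.

```r
library(mgcv)

set.seed(3)
n <- 2000
x <- seq(0, 1, length.out = n)
## AR1 errors with correlation 0.6, in data frame (row) order
e <- as.numeric(arima.sim(list(ar = 0.6), n = n))
dat <- data.frame(x = x, y = sin(2 * pi * x) + 0.5 * e)

## Gaussian-identity model assuming independent residuals
b0 <- bam(y ~ s(x, bs = "cr", k = 20), data = dat)
## same model with an AR1 residual model, correlation parameter rho
b1 <- bam(y ~ s(x, bs = "cr", k = 20), data = dat, rho = 0.6)

summary(b1)
```

Because the AR1 model is based on row order, the data frame should be sorted in the order in which the errors are assumed to be correlated before calling bam.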
Argument: sparse
If all smooths are P-splines and all tensor products are of the form te(...,bs="ps",np=FALSE) then in principle computation could be made faster using sparse matrix methods, and you could set this to TRUE. In practice the speed up is disappointing, and the computation is less well conditioned than the default. See details.
Argument: cluster
bam can compute the computationally dominant QR decomposition in parallel using parLapply from the parallel package, if it is supplied with a cluster on which to do this (a cluster here can be some cores of a single machine). See details and example code.
Argument: gc.level
To keep the memory footprint down, it helps to call the garbage collector often, but this takes a substantial amount of time. Setting this to zero means that garbage collection only happens when R decides it should. Setting to 2 gives frequent garbage collection. 1 is in between.
Argument: ...
Further arguments for passing on e.g. to gam.fit (such as mustart).
----------Details----------
bam operates by first setting up the basis characteristics for the smooths, using a representative subsample of the data. Then the model matrix is constructed in blocks using predict.gam. For each block the factor R, from the QR decomposition of the whole model matrix, is updated, along with Q'y and the sum of squares of y. At the end of block processing, fitting takes place, without the need to ever form the whole model matrix.
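The block-wise update described above can be sketched in base R for an ordinary least squares problem. This is a minimal illustration under stated assumptions, not bam's internal code: the simulated X and y, the variable names, and the use of qr(), qr.R() and qr.qty() are choices made for the example, and no penalties or smoothness selection are involved.

```r
set.seed(1)
n <- 1000; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # stand-in model matrix
y <- drop(X %*% rnorm(p)) + rnorm(n)

chunk.size <- 100
R <- NULL   # running R factor of the rows processed so far
f <- NULL   # running first p elements of Q'y
yy <- 0     # running sum of squares of y

for (start in seq(1, n, by = chunk.size)) {
  idx <- start:min(start + chunk.size - 1, n)
  ## stack the current R factor on top of the new block of rows
  A <- rbind(R, X[idx, , drop = FALSE])
  z <- c(f, y[idx])
  qrA <- qr(A)
  R <- qr.R(qrA)               # updated p by p triangular factor
  f <- qr.qty(qrA, z)[1:p]     # updated Q'y (first p elements only)
  yy <- yy + sum(y[idx]^2)
}

## fitting after block processing: solve R beta = f
beta.chunk <- backsolve(R, f)
rss.chunk <- yy - sum(f^2)     # residual sum of squares

## comparison with a fit that forms the whole model matrix
full <- lm.fit(X, y)
beta.full <- coef(full)
```

Only a (p + chunk.size) by p matrix is held at any time, which is the source of the small memory footprint. The coefficients and residual sum of squares agree with the whole-matrix fit up to rounding error.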
In the generalized case, the same trick is used with the weighted model matrix and weighted pseudodata, at each step of the PIRLS. Smoothness selection is performed on the working model at each stage (performance oriented iteration), to maintain the small memory footprint. This is trivial to justify in the case of GCV or Cp/UBRE/AIC based model selection, and for REML/ML is justified via the asymptotic multivariate normality of Q'z where z is the IRLS pseudodata.
Note that performance oriented iteration (POI) is not as stable as the default nested iteration used with gam, but that for very large, information rich datasets, this is unlikely to matter much.
Note also that it is possible to spend most of the computational time on basis evaluation, if an expensive basis is used. In practice this means that the default "tp" basis should be avoided: almost any other basis (e.g. "cr" or "ps") can be used in the 1D case, and tensor product smooths (te) are typically much less costly in the multi-dimensional case.
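The advice on bases can be sketched as follows. The data come from gamSim, and the particular basis dimensions (k = 20 for the 1D smooths, k = 8 per margin for the tensor product) are illustrative assumptions rather than recommendations.

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 5000, dist = "normal", scale = 2, verbose = FALSE)

## 1D smooths use the cheap "cr" basis instead of the default "tp";
## the 2D term uses a tensor product smooth, te, rather than s(x1, x2)
b.cr <- bam(y ~ s(x0, bs = "cr", k = 20) + s(x3, bs = "cr", k = 20) +
              te(x1, x2, bs = "cr", k = 8),
            data = dat)

summary(b.cr)
```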
If cluster is provided as a cluster set up using makeCluster (or makeForkCluster) from the parallel package, then the rate limiting QR decomposition of the model matrix is performed in parallel using this cluster. Note that the speed ups are often not that great. On a multi-core machine it is usually best to set the cluster size to the number of physical cores, which is often less than what is reported by detectCores. Using more than the number of physical cores can result in no speed up at all (or even a slow down). Note that a highly parallel BLAS may negate all advantage from using a cluster of cores. Computing in parallel of course requires more memory than computing in series. See examples.
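One way to size the cluster by physical rather than logical cores is sketched below. It assumes that detectCores(logical = FALSE) from the parallel package is informative on the platform in use; that argument is not honoured everywhere and can return NA, so the code falls back to serial fitting. The cap of 2 workers and the simulated data are choices made for the example only.

```r
library(mgcv)
library(parallel)

## try to count physical cores; fall back to 1 if this is not available
nc <- detectCores(logical = FALSE)
if (is.na(nc)) nc <- 1
nc <- min(nc, 2)   # small cap, for example portability

## only set up a cluster if there is more than one core to use
cl <- if (nc > 1) tryCatch(makeCluster(nc), error = function(e) NULL) else NULL

set.seed(5)
dat <- gamSim(1, n = 5000, dist = "poisson", scale = 0.1, verbose = FALSE)

## cluster = NULL gives the usual serial computation
b.par <- bam(y ~ s(x0, bs = "cr") + s(x1, bs = "cr") +
               s(x2, bs = "cr") + s(x3, bs = "cr"),
             data = dat, family = poisson(), cluster = cl)

if (!is.null(cl)) stopCluster(cl)
```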
If the argument sparse=TRUE then QR updating is replaced by an alternative scheme, in which the model matrix is stored whole as a sparse matrix. This only makes sense if all smooths are P-splines and all tensor products are of the form te(...,bs="ps",np=FALSE), but no check is made. The computations are then based on the Choleski decomposition of the crossproduct of the sparse model matrix. Although this crossproduct is nearly dense, sparsity should make its formation efficient, which is useful as it is the leading order term in the operations count. However there is no benefit in using sparse methods to form the Choleski decomposition, given that the crossproduct is dense. In practice the sparse matrix handling overheads mean that modest or no speed ups are produced by this approach, while the computation is less stable than the default, and the memory footprint often higher (but please let the author know if you find an example where the speedup is really worthwhile).
----------Value----------
An object of class "gam" as described in gamObject.
----------WARNINGS----------
The routine will be slow if the default "tp" basis is used.
You must have more unique combinations of covariates than the model has total parameters. (The total number of parameters is the sum of the basis dimensions plus the number of non-spline terms, less the number of spline terms.)
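The parameter count in this warning can be worked through by hand. The sketch below assumes a model with four smooths of basis dimension 20 and an intercept as the only non-spline term; the covariate names and the simulated data frame are hypothetical.

```r
## basis dimension (k) of each smooth term in the assumed model
k <- c(x0 = 20, x1 = 20, x2 = 20, x3 = 20)
n.parametric <- 1   # non-spline terms: here just the intercept

## sum of basis dimensions + non-spline terms - number of spline terms
total.par <- sum(k) + n.parametric - length(k)
total.par           # 77

## hypothetical data: count the unique covariate combinations
set.seed(4)
dat <- data.frame(x0 = runif(500), x1 = runif(500),
                  x2 = runif(500), x3 = runif(500))
n.unique <- nrow(unique(dat[, names(k)]))

## the requirement in the warning
n.unique > total.par
```

One parameter is subtracted per spline term because each smooth is subject to an identifiability constraint, so a term with k = 20 contributes 19 coefficients.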
This routine is less stable than gam for the same dataset.
The negbin family is only supported for the *known theta* case.
----------Author(s)----------
Simon N. Wood simon.wood@r-project.org
----------See Also----------
mgcv-package, gamObject, gam.models, smooth.terms, linear.functional.terms, s, te, predict.gam, plot.gam, summary.gam, gam.side, gam.selection, mgcv, gam.control, gam.check, negbin, magic, vis.gam
----------Examples----------
library(mgcv)
## following is not *very* large, for obvious reasons...
dat <- gamSim(1,n=15000,dist="normal",scale=20)
bs <- "ps";k <- 20
b <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
         s(x3,bs=bs,k=k),data=dat,method="REML")
summary(b)
plot(b,pages=1,rug=FALSE) ## plot smooths, but not rug
plot(b,pages=1,rug=FALSE,seWithMean=TRUE) ## `with intercept' CIs
ba <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
          s(x3,bs=bs,k=k),data=dat,method="GCV.Cp") ## use GCV
summary(ba)
## A Poisson example...
dat <- gamSim(1,n=15000,dist="poisson",scale=.1)
system.time(b1 <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
            s(x3,bs=bs,k=k),data=dat,method="ML",family=poisson()))
b1
## repeat on a cluster
require(parallel)
nc <- 2 ## cluster size, set for example portability
if (detectCores()>1) { ## no point otherwise
  cl <- makeCluster(nc)
  ## could also use makeForkCluster, but read warnings first!
} else cl <- NULL
system.time(b2 <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
            s(x3,bs=bs,k=k),data=dat,method="ML",family=poisson(),cluster=cl))
## ... first call has startup overheads, repeat shows speed up...
system.time(b2 <- bam(y ~ s(x0,bs=bs,k=k)+s(x1,bs=bs,k=k)+s(x2,bs=bs,k=k)+
            s(x3,bs=bs,k=k),data=dat,method="ML",family=poisson(),cluster=cl))
if (!is.null(cl)) stopCluster(cl)
b2
## Sparse smoothers example...
b3 <- bam(y ~ te(x0,x1,bs="ps",k=10,np=FALSE)+s(x2,bs="ps",k=30)+
          s(x3,bs="ps",k=30),data=dat,method="ML",
          family=poisson(),sparse=TRUE)
b3
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).