boot(boot)
boot() belongs to R package: boot
Bootstrap Resampling
Translator: 生物统计家园网, robot LoveR
----------Description----------
Generate R bootstrap replicates of a statistic applied to data. Both parametric and nonparametric resampling are possible. For the nonparametric bootstrap, possible resampling methods are the ordinary bootstrap, the balanced bootstrap, antithetic resampling, and permutation. For nonparametric multi-sample problems stratified resampling is used: this is specified by including a vector of strata in the call to boot. Importance resampling weights may be specified.
----------Usage----------
boot(data, statistic, R, sim = "ordinary", stype = c("i", "f", "w"),
     strata = rep(1,n), L = NULL, m = 0, weights = NULL,
     ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,
     parallel = c("no", "multicore", "snow"),
     ncpus = getOption("boot.ncpus", 1L), cl = NULL)
----------Arguments----------
Argument: data
The data as a vector, matrix or data frame. If it is a matrix or data frame then each row is considered as one multivariate observation.
Argument: statistic
A function which, when applied to data, returns a vector containing the statistic(s) of interest. When sim = "parametric", the first argument to statistic must be the data. For each replicate a simulated dataset returned by ran.gen will be passed. In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample. Further, if predictions are required, then a third argument is required which would be a vector of the random indices used to generate the bootstrap predictions. Any further arguments can be passed to statistic through the ... argument.
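As a minimal sketch of the two-argument form described above, the statistic below takes the original data and a vector of indices (the default stype = "i"). The name ratio.i and the seed are illustrative choices, not part of the package; city is a dataset shipped with boot.

```r
library(boot)

# First argument: the original data. Second argument: a vector of row
# indices that defines one bootstrap sample.
ratio.i <- function(d, i) {
  d.star <- d[i, ]                  # rows selected for this resample
  sum(d.star$x) / sum(d.star$u)     # ratio of totals in the resample
}

set.seed(1)                         # arbitrary seed, for repeatability only
city.boot <- boot(city, ratio.i, R = 199)

city.boot$t0                        # statistic evaluated on the original data
head(city.boot$t)                   # first few bootstrap replicates
```

The observed value t0 is obtained by calling the statistic with the indices 1, ..., n, so it equals the ratio computed directly from city.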
Argument: R
The number of bootstrap replicates. Usually this will be a single positive integer. For importance resampling, some resamples may use one set of weights and others use a different set of weights. In this case R would be a vector of integers where each component gives the number of resamples from each of the rows of weights.
Argument: sim
A character string indicating the type of simulation required. Possible values are "ordinary" (the default), "parametric", "balanced", "permutation", or "antithetic". Importance resampling is specified by including importance weights; the type of importance resampling must still be specified but may only be "ordinary" or "balanced" in this case.
Argument: stype
A character string indicating what the second argument of statistic represents. Possible values of stype are "i" (indices - the default), "f" (frequencies), or "w" (weights). Not used for sim = "parametric".
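The three values of stype can be illustrated with the same statistic written three ways. This is only a sketch: the function names mean.i, mean.f and mean.w and the seed are assumptions made for the example, and the sample mean is used because it is easy to express in all three forms.

```r
library(boot)

y <- aircondit$hours                             # a plain numeric vector

mean.i <- function(d, i) mean(d[i])              # second argument: indices
mean.f <- function(d, f) sum(d * f) / sum(f)     # second argument: frequencies
mean.w <- function(d, w) sum(d * w) / sum(w)     # second argument: weights

set.seed(2)                                      # arbitrary seed
b.i <- boot(y, mean.i, R = 99, stype = "i")
b.f <- boot(y, mean.f, R = 99, stype = "f")
b.w <- boot(y, mean.w, R = 99, stype = "w")

# All three forms describe the same statistic, so the observed values agree.
c(b.i$t0, b.f$t0, b.w$t0)
```

The weights form is the one to prefer when influence values are needed later, since empinf can then use the infinitesimal jackknife.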
Argument: strata
An integer vector or factor specifying the strata for multi-sample problems. This may be specified for any simulation, but is ignored when sim = "parametric". When strata is supplied for a nonparametric bootstrap, the simulations are done within the specified strata.
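A short sketch of stratified resampling, assuming the gravity data from the boot package with the measurement in column 1 and the series factor in column 2 (the layout used in the Examples section). The helper name diff.i and the seed are illustrative.

```r
library(boot)

# Keep only the final two series (levels 7 and 8 of the series factor).
grav1 <- gravity[as.numeric(gravity[, 2]) >= 7, ]

# Difference of the two series means, written for stype = "i".
diff.i <- function(d, i) {
  d.star <- d[i, ]
  g <- d.star[, 1]                          # measurements
  s <- as.numeric(d.star[, 2])              # series codes (7 or 8)
  mean(g[s == 7]) - mean(g[s == 8])
}

set.seed(3)                                 # arbitrary seed
grav.boot <- boot(grav1, diff.i, R = 199, strata = grav1[, 2])
grav.boot$t0
```

Because resampling is done within each stratum, every bootstrap sample contains the original number of observations from each series, so neither group mean is ever computed on an empty set.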
Argument: L
Vector of influence values evaluated at the observations. This is used only when sim is "antithetic". If not supplied, they are calculated through a call to empinf. This will use the infinitesimal jackknife provided that stype is "w", otherwise the usual jackknife is used.
Argument: m
The number of predictions which are to be made at each bootstrap replicate. This is most useful for (generalized) linear models. This can only be used when sim is "ordinary". m will usually be a single integer but, if there are strata, it may be a vector with length equal to the number of strata, specifying how many of the errors for prediction should come from each stratum. The actual predictions should be returned as the final part of the output of statistic, which should also take an argument giving the vector of indices of the errors to be used for the predictions.
Argument: weights
Vector or matrix of importance weights. If a vector then it should have as many elements as there are observations in data. When simulation from more than one set of weights is required, weights should be a matrix where each row of the matrix is one set of importance weights. If weights is a matrix then R must be a vector of length nrow(weights). This parameter is ignored if sim is not "ordinary" or "balanced".
Argument: ran.gen
This function is used only when sim = "parametric", in which case it describes how random values are to be generated. It should be a function of two arguments. The first argument should be the observed data and the second argument consists of any other information needed (e.g. parameter estimates). The second argument may be a list, allowing any number of items to be passed to ran.gen. The returned value should be a simulated data set of the same form as the observed data which will be passed to statistic to get a bootstrap replicate. It is important that the returned value be of the same shape and type as the original dataset. If ran.gen is not specified, the default is a function which returns the original data, in which case all simulation should be included as part of statistic. Use of sim = "parametric" with a suitable ran.gen allows the user to implement any types of nonparametric resampling which are not supported directly.
Argument: mle
The second argument to be passed to ran.gen. Typically these will be maximum likelihood estimates of the parameters. For efficiency, mle is often a list containing all of the objects needed by ran.gen which can be calculated using the original data set only.
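The interplay of ran.gen and mle might look like the sketch below. The log-normal model, the names lnorm.rg, lnorm.mle and med.fun, and the seed are all assumptions made purely for illustration; they are not taken from the package documentation.

```r
library(boot)

y <- aircondit$hours

# mle is a list, so several quantities computed once from the original
# data can be handed to ran.gen together.
lnorm.mle <- list(meanlog = mean(log(y)), sdlog = sd(log(y)), n = length(y))

# ran.gen(data, mle): returns a simulated dataset of the same shape and
# type as the observed data (here a numeric vector of the same length).
lnorm.rg <- function(d, mle) rlnorm(mle$n, mle$meanlog, mle$sdlog)

# With sim = "parametric" the statistic takes the data as its only
# required argument.
med.fun <- function(d) median(d)

set.seed(4)                                  # arbitrary seed
med.boot <- boot(y, med.fun, R = 199, sim = "parametric",
                 ran.gen = lnorm.rg, mle = lnorm.mle)
med.boot$t0
```

Precomputing the estimates in mle avoids refitting the model inside ran.gen on every replicate.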
Argument: simple
logical, only allowed to be TRUE for sim = "ordinary", stype = "i", m = 0 (otherwise ignored with a warning). By default an n by R index array is created: this can be large and if simple = TRUE this is avoided by sampling separately for each replication, which is slower but uses less memory.
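A small sketch of the memory-saving option, under the conditions listed above (ordinary resampling, index-type statistic, no predictions). The helper name mean.i and the seed are illustrative.

```r
library(boot)

mean.i <- function(d, i) mean(d[i])

set.seed(5)                                   # arbitrary seed
# Indices are drawn one replicate at a time instead of building the full
# index array up front.
b.simple <- boot(aircondit$hours, mean.i, R = 199, simple = TRUE)

dim(b.simple$t)                               # one row per replicate
```

For a dataset this small the saving is negligible; the option matters when n times R is large enough for the index array to strain memory.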
Argument: ...
Other named arguments for statistic which are passed unchanged each time it is called. Any such arguments to statistic should follow the arguments which statistic is required to have for the simulation. Beware of partial matching to arguments of boot listed above, and that arguments named X and FUN cause conflicts in some versions of boot (but not this one).
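Passing an extra argument through ... could be sketched as follows. The statistic tmean.i and its argument trim are hypothetical; the name trim was picked because it is not a prefix of any boot argument, which sidesteps the partial-matching problem mentioned above.

```r
library(boot)

# The extra argument 'trim' comes after the two arguments that boot
# itself supplies (data and indices).
tmean.i <- function(d, i, trim) mean(d[i], trim = trim)

set.seed(6)                                   # arbitrary seed
# 'trim = 0.1' is not an argument of boot, so it is forwarded unchanged
# to every call of tmean.i.
b.trim <- boot(aircondit$hours, tmean.i, R = 199, trim = 0.1)
b.trim$t0
```

An extra argument named, say, s would be risky, because it could partially match sim, stype, strata or simple.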
Argument: parallel
The type of parallel operation to be used (if any). If missing, the default is taken from the option "boot.parallel" (and if that is not set, "no").
Argument: ncpus
integer: number of processes to be used in parallel operation: typically one would choose this to be the number of available CPUs.
Argument: cl
An optional parallel or snow cluster for use if parallel = "snow". If not supplied, a cluster on the local machine is created for the duration of the boot call.
----------Details----------
The statistic to be bootstrapped can be as simple or complicated as desired as long as its arguments correspond to the dataset and (for a nonparametric bootstrap) a vector of indices, frequencies or weights. statistic is treated as a black box by the boot function and is not checked to ensure that these conditions are met.
The first order balanced bootstrap is described in Davison, Hinkley and Schechtman (1986). The antithetic bootstrap is described by Hall (1989) and is experimental, particularly when used with strata. The other non-parametric simulation types are the ordinary bootstrap (possibly with unequal probabilities), and permutation which returns random permutations of cases. All of these methods work independently within strata if that argument is supplied.
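The defining properties of the balanced and permutation schemes can be checked with a small diagnostic. This is a sketch under one assumption worth stating: the helper freq.id is hypothetical and simply returns the frequency vector it receives (stype = "f"), so that the replicate matrix t holds the resampling frequencies themselves.

```r
library(boot)

y <- aircondit$hours

# Diagnostic statistic: return the frequency vector unchanged, so each
# row of $t shows how often every observation was drawn in one resample.
freq.id <- function(d, f) f

set.seed(7)                                   # arbitrary seed
b.bal <- boot(y, freq.id, R = 99, stype = "f", sim = "balanced")
# First-order balance: over all R resamples, each observation is used
# exactly R times.
all(colSums(b.bal$t) == b.bal$R)

b.perm <- boot(y, freq.id, R = 99, stype = "f", sim = "permutation")
# A permutation uses every observation exactly once in each resample.
all(b.perm$t == 1)
```

In real use the statistic would of course be the quantity of interest; the frequency-returning helper only serves to make the resampling design visible.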
For the parametric bootstrap it is necessary for the user to specify how the resampling is to be conducted. The best way of accomplishing this is to specify the function ran.gen which will return a simulated data set from the observed data set and a set of parameter estimates specified in mle.
----------Value----------
The returned value is an object of class "boot", containing the following components:
Component: t0
The observed value of statistic applied to data.
Component: t
A matrix with sum(R) rows, each of which is a bootstrap replicate of the result of calling statistic.
Component: R
The value of R as passed to boot.
Component: data
The data as passed to boot.
Component: seed
The value of .Random.seed when boot was called.
Component: statistic
The function statistic as passed to boot.
Component: sim
Simulation type used.
Component: stype
Statistic type as passed to boot.
Component: call
The original call to boot.
Component: strata
The strata used. This is the vector passed to boot if it was supplied, or a vector of ones if there were no strata. It is not returned if sim is "parametric".
Component: weights
The importance sampling weights as passed to boot or the empirical distribution function weights if no importance sampling weights were specified. It is omitted if sim is not one of "ordinary" or "balanced".
Component: pred.i
If predictions are required (m > 0) this is the matrix of indices at which predictions were calculated as they were passed to statistic. Omitted if m is 0 or sim is not "ordinary".
Component: L
The influence values used when sim is "antithetic". If no such values were specified and stype is not "w" then L is returned as consecutive integers corresponding to the assumption that data is ordered by influence values. This component is omitted when sim is not "antithetic".
Component: ran.gen
The random generator function used if sim is "parametric". This component is omitted for any other value of sim.
Component: mle
The parameter estimates passed to boot when sim is "parametric". It is omitted for all other values of sim.
There are c, plot and print methods for this class.
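The components listed above can be read directly off the returned object. The sketch below reuses the weighted ratio statistic from the Examples section; the seed and the two summary lines at the end (bootstrap bias and standard error computed by hand) are illustrative additions.

```r
library(boot)

ratio <- function(d, w) sum(d$x * w) / sum(d$u * w)

set.seed(8)                                   # arbitrary seed
city.boot <- boot(city, ratio, R = 199, stype = "w")

city.boot$t0                                  # observed statistic
dim(city.boot$t)                              # sum(R) rows, one column per statistic
city.boot$R                                   # number of replicates requested
city.boot$sim                                 # simulation type
city.boot$stype                               # statistic type

# Simple summaries built from t0 and t:
mean(city.boot$t[, 1]) - city.boot$t0         # bootstrap estimate of bias
sd(city.boot$t[, 1])                          # bootstrap standard error
```

The print method reports the same bias and standard error, so computing them by hand is mainly useful when they feed into further calculations.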
----------Parallel operation----------
When parallel = "multicore" is used (not available on Windows), each worker process inherits the environment of the current session, including the workspace and the loaded namespaces and attached packages (but not the random number seed: see below).
More work is needed when parallel = "snow" is used: the worker processes are newly created R processes, and statistic needs to arrange to set up the environment it needs: often a good way to do that is to make use of lexical scoping since when statistic is sent to the worker processes its enclosing environment is also sent. (E.g. see the example for jack.after.boot where ancillary functions are nested inside the statistic function.) parallel = "snow" is primarily intended to be used on multi-core Windows machines where parallel = "multicore" is not available.
For most of the boot methods the resampling is done in the master process, but not if simple = TRUE or sim = "parametric". In those cases (or where statistic itself uses random numbers), more care is needed if the results need to be reproducible. Resampling is done in the worker processes by censboot(sim = "weird") and by most of the schemes in tsboot (the exceptions being sim == "fixed" and sim == "geom" with the default ran.gen).
Where random-number generation is done in the worker processes, the default behaviour is that each worker chooses a separate seed, non-reproducibly. However, with parallel = "multicore" or parallel = "snow" using the default cluster, a second approach is used if RNGkind("L'Ecuyer-CMRG") has been selected. In that approach each worker gets a different subsequence of the RNG stream based on the seed at the time the worker is spawned and so the results will be reproducible if ncpus is unchanged, and for parallel = "multicore" if parallel::mc.reset.stream() is called: see the examples for mclapply.
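The statement that ordinary resampling happens in the master process suggests a simple check: with a deterministic statistic and the same seed, a serial run and a parallel run should produce the same replicates. The sketch below assumes a machine where forking is available and falls back to serial execution on Windows; the names mean.i and par.type and the seed are illustrative.

```r
library(boot)

y <- aircondit$hours
mean.i <- function(d, i) mean(d[i])           # deterministic statistic

# "multicore" is not available on Windows, so fall back to "no" there.
par.type <- if (.Platform$OS.type == "windows") "no" else "multicore"

set.seed(9)                                   # arbitrary seed
b.serial <- boot(y, mean.i, R = 199)

set.seed(9)                                   # same seed as the serial run
b.par <- boot(y, mean.i, R = 199, parallel = par.type, ncpus = 2L)

# The index array is generated before the work is split up, so the two
# runs are expected to agree.
all.equal(b.serial$t, b.par$t)
```

This shortcut does not apply with simple = TRUE, sim = "parametric", or a statistic that draws random numbers itself; those cases need the L'Ecuyer-CMRG approach described above.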
----------References----------
Among them are :
for the bootstrap. Annals of Statistics, 21, 286–298.
Bootstrap Methods and Their Application. Cambridge University Press.
simulation. Biometrika, 73, 555–566.
Chapman & Hall.
American Statistician, 42, 263–266.
73, 713–724.
Journal of the Royal Statistical Society, B, 50, 312–337, 355–370.
Biometrika, 76, 435–446.
Journal of the American Statistical Association, 83, 709–714.
John Wiley & Sons.
----------See Also----------
boot.array, boot.ci, censboot, empinf, jack.after.boot, tilt.boot, tsboot.
----------Examples----------
library(boot)

# Usual bootstrap of the ratio of means using the city data
ratio <- function(d, w) sum(d$x * w)/sum(d$u * w)
boot(city, ratio, R = 999, stype = "w")

# Stratified resampling for the difference of means.  In this
# example we will look at the difference of means between the final
# two series in the gravity data.
diff.means <- function(d, f)
{
    n <- nrow(d)
    gp1 <- 1:table(as.numeric(d$series))[1]
    m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
    m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
    ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 * m1 * sum(f[gp1]))
    ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 * m2 * sum(f[-gp1]))
    c(m1 - m2, (ss1 + ss2)/(sum(f) - 2))
}
grav1 <- gravity[as.numeric(gravity[,2]) >= 7,]
boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[,2])

# In this example we show the use of boot in a prediction from
# regression based on the nuclear data.  This example is taken
# from Example 6.8 of Davison and Hinkley (1997).  Notice also
# that two extra arguments to 'statistic' are passed through boot.
nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
nuke.lm <- glm(log(cost) ~ date+log(cap)+ne+ct+log(cum.n)+pt, data = nuke)
nuke.diag <- glm.diag(nuke.lm)
nuke.res <- nuke.diag$res * nuke.diag$sd
nuke.res <- nuke.res - mean(nuke.res)

# We set up a new data frame with the data, the standardized
# residuals and the fitted values for use in the bootstrap.
nuke.data <- data.frame(nuke, resid = nuke.res, fit = fitted(nuke.lm))

# Now we want a prediction of plant number 32 but at date 73.00
new.data <- data.frame(cost = 1, date = 73.00, cap = 886, ne = 0,
                       ct = 0, cum.n = 11, pt = 1)
new.fit <- predict(nuke.lm, new.data)

nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)
{
    lm.b <- glm(fit+resid[inds] ~ date+log(cap)+ne+ct+log(cum.n)+pt,
                data = dat)
    pred.b <- predict(lm.b, x.pred)
    c(coef(lm.b), pred.b - (fit.pred + dat$resid[i.pred]))
}

nuke.boot <- boot(nuke.data, nuke.fun, R = 999, m = 1,
                  fit.pred = new.fit, x.pred = new.data)
# The bootstrap prediction squared error would then be found by
mean(nuke.boot$t[, 8]^2)
# Basic bootstrap prediction limits would be
new.fit - sort(nuke.boot$t[, 8])[c(975, 25)]

# Finally a parametric bootstrap.  For this example we shall look
# at the air-conditioning data.  In this example our aim is to test
# the hypothesis that the true value of the index is 1 (i.e. that
# the data come from an exponential distribution) against the
# alternative that the data come from a gamma distribution with
# index not equal to 1.
air.fun <- function(data) {
    ybar <- mean(data$hours)
    para <- c(log(ybar), mean(log(data$hours)))
    ll <- function(k) {
        if (k <= 0) 1e200 else lgamma(k)-k*(log(k)-1-para[1]+para[2])
    }
    khat <- nlm(ll, ybar^2/var(data$hours))$estimate
    c(ybar, khat)
}

air.rg <- function(data, mle) {
    # Function to generate random exponential variates.
    # mle will contain the mean of the original data
    out <- data
    out$hours <- rexp(nrow(out), 1/mle)
    out
}

air.boot <- boot(aircondit, air.fun, R = 999, sim = "parametric",
                 ran.gen = air.rg, mle = mean(aircondit$hours))

# The bootstrap p-value can then be approximated by
sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).