clusterSetup(simFrame)
clusterSetup() belongs to the R package simFrame
Set up multiple samples on a cluster
Description
Generic function for setting up multiple samples on a cluster.
Usage
clusterSetup(cl, x, control, ...)
## S4 method for signature 'ANY,data.frame,SampleControl'
clusterSetup(cl, x, control)
Arguments
Argument: cl
a cluster as generated by makeCluster.
Argument: x
the data.frame to sample from.
Argument: control
a control object inheriting from the virtual class "VirtualSampleControl" or a character string specifying such a control class (the default being "SampleControl").
Argument: ...
if control is a character string or missing, the slots of the control object may be supplied as additional arguments. See "SampleControl" for details on the slots.
Details
A fundamental design principle of the framework in the case of design-based simulation studies is that the sampling procedure is separated from the simulation procedure. Two main advantages arise from setting up all samples in advance.
First, the repeated sampling reduces overall computation time dramatically in certain situations, since computer-intensive tasks like stratification need to be performed only once. This is particularly relevant for large population data. In close-to-reality simulation studies carried out in research projects in survey statistics, often up to 10000 samples are drawn from a population of millions of individuals with stratified sampling designs. For such large data sets, stratification takes a considerable amount of time and is a very memory-intensive task. If the samples are taken on-the-fly, i.e., in every simulation run one sample is drawn, the function to take the stratified sample would typically split the population into the different strata in each of the 10000 simulation runs. If all samples are drawn in advance, on the other hand, the population data need to be split only once and all 10000 samples can be taken from the respective strata together.
Second, the samples can be stored permanently, which simplifies the reproduction of simulation results and may help to maximize comparability of results obtained by different partners in a research project. In particular, this is useful for large population data, when complex sampling techniques may be very time-consuming. In research projects involving different partners, usually different groups investigate different kinds of estimators. If these groups use not only the same population data, but also the same previously set up samples, their results are highly comparable.
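The first advantage can be illustrated with a minimal base R sketch. It is not how simFrame implements stratified sampling; the toy population pop, the stratum sizes and the helper drawOne() are hypothetical names introduced only for this illustration. The population is split into strata once, and all k samples are then drawn from that single split.

```r
set.seed(1)

# Hypothetical toy population: 1000 units in 4 strata
pop <- data.frame(
  id = 1:1000,
  stratum = sample(c("A", "B", "C", "D"), 1000, replace = TRUE)
)
sizes <- c(A = 2, B = 3, C = 3, D = 2)  # sample size per stratum
k <- 5                                   # number of samples to set up

# Expensive step, performed only once: row indices grouped by stratum
strata <- split(seq_len(nrow(pop)), pop$stratum)

# Draw one stratified sample (without replacement) from an existing split
drawOne <- function(strata, sizes) {
  unlist(lapply(names(strata), function(s) {
    idx <- strata[[s]]
    idx[sample.int(length(idx), sizes[[s]])]
  }), use.names = FALSE)
}

# All k samples reuse the same split
samples <- replicate(k, drawOne(strata, sizes), simplify = FALSE)
```

Drawing on-the-fly would instead repeat the split() call in each of the k runs, which is the cost that setting up the samples in advance avoids.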
The computational performance of setting up multiple samples can be increased by parallel computing. Since version 0.5.0, parallel computing in simFrame is implemented using the package parallel, which is part of the R base distribution since version 2.14.0 and builds upon work done for the contributed packages multicore and snow. Note that all objects and packages required for the computations (including simFrame) need to be made available on every worker process unless the worker processes are created by forking (see makeCluster).
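As a rough sketch of the idea, the following uses only the package parallel to draw several simple random samples on worker processes. It does not reproduce the internals of clusterSetup(); the helper drawSample() and the values for N, size and the number of samples are assumptions made for the illustration.

```r
library(parallel)

# Hypothetical helper: draw one simple random sample of `size` indices from 1..N
# (the first argument is the sample number supplied by parLapply and is unused)
drawSample <- function(i, N, size) sample.int(N, size)

cl <- makeCluster(2, type = "PSOCK")
clusterSetRNGStream(cl, iseed = 12345)

# drawSample is shipped to the workers together with the extra arguments N and size
indices <- parLapply(cl, seq_len(4), drawSample, N = 1000, size = 20)

stopCluster(cl)

lengths(indices)
```

Because the function and its arguments are passed explicitly, nothing else has to exist on the workers here. Code that relies on other objects or packages would need them made available first, for example with clusterExport() or clusterEvalQ().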
In order to prevent problems with random numbers and to ensure reproducibility, random number streams should be used. With parallel, random number streams can be created via the function clusterSetRNGStream().
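The effect on reproducibility can be checked with a small sketch that uses only parallel. The helpers drawIndices() and runOnce() are hypothetical names for this illustration: two separately started clusters that receive the same iseed should return identical draws.

```r
library(parallel)

# Hypothetical worker task: draw 5 indices from 1..1000
drawIndices <- function() sample.int(1000, size = 5)

# Start a fresh cluster, seed its random number streams, and draw on every worker
runOnce <- function(seed) {
  cl <- makeCluster(2, type = "PSOCK")
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = seed)
  clusterCall(cl, drawIndices)
}

first  <- runOnce(12345)
second <- runOnce(12345)

# Same seed and same number of workers: the draws should coincide
identical(first, second)
```

Each worker receives its own stream, so the workers do not simply repeat one another's random numbers.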
The control class "SampleControl" is highly flexible and allows stratified sampling as well as sampling of whole groups rather than individuals with a specified sampling method. Hence it is often sufficient to implement the desired sampling method for the simple non-stratified case to extend the existing framework. See "SampleControl" for some restrictions on the argument names of such a function, which should return a vector containing the indices of the sampled observations.
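A sampling function for the simple non-stratified case might look like the following sketch of systematic sampling. The function name systematic() is hypothetical, and the argument names N (population size) and size (sample size) are an assumption; "SampleControl" documents the actual restrictions on argument names and how such a function is supplied.

```r
# Hypothetical systematic sampling for the non-stratified case:
# a random start followed by equally spaced units.
# Returns a vector of indices in 1..N, as described above.
systematic <- function(N, size) {
  step <- N / size
  start <- runif(1, min = 0, max = step)
  floor(start + step * (seq_len(size) - 1)) + 1
}

set.seed(42)
idx <- systematic(N = 100, size = 10)
idx
```

The function only knows about the population size, so stratification or sampling of whole groups would be left to the surrounding framework.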
Nevertheless, for very complex sampling procedures, it is possible to define a control class "MySampleControl" extending "VirtualSampleControl", and the corresponding method clusterSetup(cl, x, control) with signature 'ANY, data.frame, MySampleControl'. In order to optimize computational performance, it is necessary to efficiently set up multiple samples. Thereby the slot k of "VirtualSampleControl" needs to be used to control the number of samples, and the resulting object must be of class "SampleSetup".
Value
An object of class "SampleSetup".
Methods
cl = "ANY", x = "data.frame", control = "character": set up multiple samples on a cluster using a control class specified by the character string control. The slots of the control object may be supplied as additional arguments.
cl = "ANY", x = "data.frame", control = "missing": set up multiple samples on a cluster using a control object of class "SampleControl". Its slots may be supplied as additional arguments.
cl = "ANY", x = "data.frame", control = "SampleControl": set up multiple samples on a cluster as defined by the control object control.
Author(s)
Andreas Alfons
See Also
makeCluster, clusterSetRNGStream, setup, draw, "SampleControl", "TwoStageControl", "VirtualSampleControl", "SampleSetup"
Examples
## Not run:
# these examples require at least a dual core processor
library(parallel)
library(simFrame)

# load data
data(eusilcP)

# start cluster
cl <- makeCluster(2, type = "PSOCK")

# load package and data on workers
clusterEvalQ(cl, {
    library(simFrame)
    data(eusilcP)
})

# set up random number stream (iseed is documented as an integer)
clusterSetRNGStream(cl, iseed = 12345)

# simple random sampling
srss <- clusterSetup(cl, eusilcP, size = 20, k = 4)
summary(srss)
draw(eusilcP[, c("id", "eqIncome")], srss, i = 1)

# group sampling
gss <- clusterSetup(cl, eusilcP, grouping = "hid", size = 10, k = 4)
summary(gss)
draw(eusilcP[, c("hid", "id", "eqIncome")], gss, i = 2)

# stratified simple random sampling
ssrss <- clusterSetup(cl, eusilcP, design = "region",
    size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(ssrss)
draw(eusilcP[, c("id", "region", "eqIncome")], ssrss, i = 3)

# stratified group sampling
sgss <- clusterSetup(cl, eusilcP, design = "region",
    grouping = "hid", size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
summary(sgss)
draw(eusilcP[, c("hid", "id", "region", "eqIncome")], sgss, i = 4)

# stop cluster
stopCluster(cl)
## End(Not run)