R language: clara() function help documentation

loveR · posted on 2012-2-16 20:48:11

clara(cluster)
clara() belongs to the R package: cluster

Clustering Large Applications

Translator: LoveR, the bot of 生物统计家园网 (biostatistic.net)

----------Description----------

Computes a "clara" object, a list representing a clustering of the data into k clusters.

----------Usage----------

clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
      sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE,
      keep.data = medoids.x, rngR = FALSE, pamLike = FALSE)

----------Arguments----------

Argument: x
Data matrix or data frame; each row corresponds to an observation and each column to a variable. All variables must be numeric. Missing values (NAs) are allowed.

Argument: k
Integer, the number of clusters. It is required that 0 < k < n, where n is the number of observations (i.e., n = nrow(x)).

Argument: metric
Character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are "euclidean" and "manhattan". Euclidean distances are the root sum-of-squares of differences, and Manhattan distances are the sum of absolute differences.

Argument: stand
Logical, indicating whether the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column) by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.

Argument: samples
Integer, the number of samples to be drawn from the dataset. The default, 5, is rather small for historical (and now backward-compatibility) reasons; we recommend setting samples an order of magnitude larger.

Argument: sampsize
Integer, the number of observations in each sample. sampsize should be higher than the number of clusters (k) and at most the number of observations (n = nrow(x)).

Argument: trace
Integer indicating a trace level for diagnostic output during the algorithm.

Argument: medoids.x
Logical indicating whether the medoids should be returned, identical to some rows of the input data x. If FALSE, keep.data must be FALSE as well; the medoid indices, i.e., the row numbers of the medoids, will still be returned (i.med component), and the algorithm saves space by needing one copy less of x.

Argument: keep.data
Logical indicating whether the data (scaled if stand is true) should be kept in the result. Setting this to FALSE saves memory (and hence time), but disables clusplot()ing of the result. Use medoids.x = FALSE to save even more memory.

Argument: rngR
Logical indicating whether R's random number generator should be used instead of the primitive one built into clara(). If true, this also means that each call to clara() returns a different result, though only slightly different in good situations.

Argument: pamLike
Logical indicating whether the “swap” phase (see pam, in C code) should use the same algorithm as pam(). According to Kaufman and Rousseeuw's description this should always have been true, but because the original Fortran code and the subsequent port to C have always contained a small one-letter change (a typo according to Martin Maechler) with respect to PAM, the default pamLike = FALSE has been chosen to remain backward compatible rather than “PAM compatible”.
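
The metric and stand arguments described above can be illustrated in base R. This is a minimal sketch on made-up toy data, not the package's internal code: the matrix toy and the helper standardize() are hypothetical names introduced here, and missing values are not handled.

```r
# Toy data (an assumption for illustration): 4 observations, 2 variables.
toy <- cbind(height = c(150, 160, 170, 180),
             weight = c(50, 60, 80, 90))

# metric = "euclidean": root sum-of-squares of differences between two rows.
euclid <- sqrt(sum((toy[1, ] - toy[2, ])^2))

# metric = "manhattan": sum of absolute differences between two rows.
manhat <- sum(abs(toy[1, ] - toy[2, ]))

# stand = TRUE: per column, subtract the mean and divide by the
# mean absolute deviation (hypothetical helper, no NA handling).
standardize <- function(m) {
  apply(m, 2, function(v) {
    centered <- v - mean(v)
    centered / mean(abs(centered))
  })
}

z <- standardize(toy)
z
```

After standardization every column has mean 0 and mean absolute deviation 1, so variables measured on different scales contribute comparably to the dissimilarities.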

----------Details----------

clara is fully described in chapter 3 of Kaufman and Rousseeuw (1990). Compared to other partitioning methods such as pam, it can deal with much larger datasets. Internally, this is achieved by considering sub-datasets of fixed size (sampsize), so that the time and storage requirements become linear in n rather than quadratic.

Each sub-dataset is partitioned into k clusters using the same algorithm as in pam. Once k representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid.

The mean (equivalent to the sum) of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The sub-dataset for which the mean (or sum) is minimal is retained. A further analysis is carried out on the final partition.

Each sub-dataset is forced to contain the medoids obtained from the best sub-dataset found so far. Randomly drawn observations are added to this set until sampsize has been reached.
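
The procedure described in this section can be sketched in a few lines of R. This is a simplified illustration under stated assumptions, not the package's actual implementation: clara_sketch() is a hypothetical helper, it supports only Euclidean distances, ignores missing values, standardization and the pamLike detail, and uses R's own random number generator. It relies on pam() and its id.med component from the cluster package.

```r
library(cluster)  # provides pam()

# Hypothetical helper illustrating the sampling idea behind clara().
clara_sketch <- function(x, k, samples = 5,
                         sampsize = min(nrow(x), 40 + 2 * k)) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(objective = Inf, i.med = integer(0))
  for (s in seq_len(samples)) {
    # Force the best medoids so far into the sub-dataset, then fill it
    # up with randomly drawn rows until sampsize is reached.
    pool <- setdiff(seq_len(n), best$i.med)
    extra <- pool[sample.int(length(pool), sampsize - length(best$i.med))]
    idx <- c(best$i.med, extra)
    # Partition the sub-dataset into k clusters with pam().
    fit <- pam(x[idx, , drop = FALSE], k)
    i.med <- idx[fit$id.med]
    # Assign every observation of the full dataset to its nearest medoid.
    d <- vapply(i.med,
                function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)),
                numeric(n))
    clustering <- max.col(-d, ties.method = "first")
    # Quality measure: mean dissimilarity to the closest medoid.
    objective <- mean(d[cbind(seq_len(n), clustering)])
    if (objective < best$objective)
      best <- list(objective = objective, i.med = i.med,
                   clustering = clustering)
  }
  best
}

set.seed(1)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
res <- clara_sketch(x, 2)
table(res$clustering)
```

Only the sub-dataset is ever passed to pam(), which is why the dissimilarity storage depends on sampsize rather than on the number of rows of x.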

----------Value----------

An object of class "clara" representing the clustering. See clara.object for details.

----------Note----------

By default, the random sampling is implemented with a very simple scheme (with period 2^16 = 65536) inside the Fortran code, independently of R's random number generation and, as a matter of fact, deterministically. Alternatively, we recommend setting rngR = TRUE, which uses R's random number generators. clara() results are then typically made reproducible by calling set.seed() before calling clara().
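
A short sketch of this recommendation, using the xclara data set that ships with the cluster package; the seed value is arbitrary.

```r
library(cluster)
data(xclara)

# With rngR = TRUE the sub-datasets are drawn with R's generator,
# so fixing the seed before each call fixes the samples.
set.seed(421)
a <- clara(xclara, 3, rngR = TRUE)
set.seed(421)
b <- clara(xclara, 3, rngR = TRUE)

identical(a$clustering, b$clustering)
```

Without the second set.seed() call, the two runs would draw different samples and could assign a few observations differently.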

The storage requirement of the clara computation (for small k) is about O(n * p) + O(j^2), where j = sampsize and (n, p) = dim(x). The CPU computing time (again assuming small k) is about O(n * p * j^2 * N), where N = samples.
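
To give a feel for the O(j^2) term, the following arithmetic sketch counts pairwise dissimilarities for the dimensions of the xclara example (n = 3000, k = 3) with the default sampsize. It is a rough illustration only; the variable names are introduced here and actual memory use also depends on other factors.

```r
n <- 3000                        # observations, as in xclara
k <- 3                           # clusters
j <- min(n, 40 + 2 * k)          # default sampsize: 46

pairs_full   <- n * (n - 1) / 2  # pairs among all rows: 4498500
pairs_sample <- j * (j - 1) / 2  # pairs within one sub-dataset: 1035

pairs_full / pairs_sample        # how many times fewer pairs per sample
```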

For “small” datasets, the function pam can be used directly. What can be considered small is really a function of available computing power, both memory (RAM) and speed. Originally (1990), “small” meant fewer than 100 observations; in 1997, the authors said “small (say with fewer than 200 observations)”; as of 2006, you can use pam with several thousand observations.

----------Author(s)----------

Kaufman and Rousseeuw (see agnes), originally.
All arguments from trace on, most of the R documentation, and all tests by Martin Maechler.

----------See Also----------

agnes for background and references; clara.object, pam, partition.object, plot.partition.

----------Examples----------

library(cluster)

## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2, samples=50)
clarax
clarax$clusinfo
## using pamLike=TRUE gives the same (apart from the 'call'):
all.equal(clarax[-8],
          clara(x, 2, samples=50, pamLike = TRUE)[-8])
plot(clarax)

## 'xclara' is an artificial data set with 3 clusters of 1000 bivariate
## objects each.
data(xclara)
(clx3 <- clara(xclara, 3))
## "better" number of samples
cl.3 <- clara(xclara, 3, samples=100)
## but that did not change the result here:
stopifnot(cl.3$clustering == clx3$clustering)
## Plot similar to Figure 5 in Struyf et al (1996)
## Not run: plot(clx3, ask = TRUE)

## Try 100 times *different* random samples -- for reliability:
nSim <- 100
nCl <- 3 # = no.classes
set.seed(421) # (reproducibility)
cl <- matrix(NA, nrow(xclara), nSim)
for(i in 1:nSim)
  cl[,i] <- clara(xclara, nCl, medoids.x = FALSE, rngR = TRUE)$clustering
tcl <- apply(cl, 1, tabulate, nbins = nCl)
## those that are not always in same cluster (5 out of 3000 for this seed):
(iDoubt <- which(apply(tcl, 2, function(n) all(n < nSim))))
if(length(iDoubt)) { # (not for all seeds)
  tabD <- tcl[, iDoubt, drop=FALSE]
  dimnames(tabD) <- list(cluster = paste(1:nCl), obs = format(iDoubt))
  t(tabD) # how many times in which clusters
}

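When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).

Notes:
Note 1: To make learning easier, this document was translated by LoveR, the bot of 生物统计家园网. It is intended only as a personal reference for learning R; 生物统计家园 reserves the copyright.
Note 2: Because the translation was produced automatically by a bot, inaccuracies are unavoidable; carefully comparing the Chinese and English text while reading can help with learning R.
Note 3: If you find an inaccuracy, please reply in this thread and we will revise the text over time.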
