pam(cluster)
pam() is from R package: cluster
Partitioning Around Medoids
Description
Partitioning (clustering) of the data into k clusters "around medoids", a more robust version of K-means.
Usage
pam(x, k, diss = inherits(x, "dist"), metric = "euclidean",
    medoids = NULL, stand = FALSE, cluster.only = FALSE,
    do.swap = TRUE,
    keep.diss = !diss && !cluster.only && n < 100,
    keep.data = !diss && !cluster.only, trace.lev = 0)
Arguments
x:
data matrix or data frame, or dissimilarity matrix or object, depending on the value of the diss argument. In the case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed, as long as every pair of observations has at least one case not missing. In the case of a dissimilarity matrix, x is typically the output of daisy or dist. A vector of length n*(n-1)/2 is also allowed (where n is the number of observations), and will be interpreted in the same way as the output of the above-mentioned functions. Missing values (NAs) are not allowed.
k:
positive integer specifying the number of clusters, less than the number of observations.
diss:
logical flag: if TRUE (default for dist or dissimilarity objects), then x will be considered as a dissimilarity matrix. If FALSE, then x will be considered as a matrix of observations by variables.
metric:
character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. If x is already a dissimilarity matrix, then this argument will be ignored.
medoids:
NULL (default) or a length-k vector of integer indices (in 1:n) specifying initial medoids instead of using the "build" algorithm.
stand:
logical; if true, the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If x is already a dissimilarity matrix, then this argument will be ignored. (An illustrative sketch of the metric, stand and diss arguments follows this argument list.)
cluster.only:
logical; if true, only the clustering will be computed and returned; see Details.
do.swap:
logical indicating if the swap phase should happen. The default, TRUE, corresponds to the original algorithm. On the other hand, the swap phase is much more computer intensive than the build phase for large n, so it can be skipped by do.swap = FALSE.
keep.diss, keep.data:
logicals indicating if the dissimilarities and/or input data x should be kept in the result. Setting these to FALSE can give much smaller results and hence even save memory allocation time.
trace.lev:
integer specifying a trace level for printing diagnostics during the build and swap phase of the algorithm. Default 0 does not print anything; higher values print increasingly more.
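
Below is a small illustrative sketch, not part of the original help page, of how the metric, stand and diss arguments interact; the toy data x and the object names p1, p2 and d are assumptions made for this example.

library(cluster)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)

## (1) let pam() standardize and compute Manhattan dissimilarities itself
p1 <- pam(x, k = 2, metric = "manhattan", stand = TRUE)

## (2) the same route by hand: daisy(..., stand = TRUE) uses the same
##     mean-absolute-deviation standardization, and its "dissimilarity"
##     result makes diss = TRUE the default in pam()
d  <- daisy(x, metric = "manhattan", stand = TRUE)
p2 <- pam(d, k = 2)

## the two clusterings should agree (up to relabelling of the clusters)
table(p1$clustering, p2$clustering)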
Details
pam is fully described in chapter 2 of Kaufman and Rousseeuw (1990). Compared to the k-means approach in kmeans, the function pam has the following features: (a) it also accepts a dissimilarity matrix; (b) it is more robust because it minimizes a sum of dissimilarities instead of a sum of squared euclidean distances; (c) it provides a novel graphical display, the silhouette plot (see plot.partition); (d) it allows selecting the number of clusters using mean(silhouette(pr)[, "sil_width"]) on the result pr <- pam(..), or directly its component pr$silinfo$avg.width; see also pam.object.
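
As a hedged illustration of point (d), the sketch below (not from the original page; the toy data and object names are assumed) compares candidate values of k by their average silhouette width pr$silinfo$avg.width.

library(cluster)

set.seed(2)
x <- rbind(matrix(rnorm(40, 0, 0.5), ncol = 2),
           matrix(rnorm(40, 4, 0.5), ncol = 2))

## average silhouette width for k = 2, ..., 6; larger is better
avg.width <- sapply(2:6, function(k) pam(x, k)$silinfo$avg.width)
names(avg.width) <- 2:6
avg.width
which.max(avg.width)   # for this toy data the maximum is typically at k = 2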
When cluster.only is true, the result is simply a (possibly named) integer vector specifying the clustering, i.e., pam(x,k, cluster.only=TRUE) is the same as pam(x,k)$clustering but computed more efficiently.
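
A minimal sketch of this equivalence, with assumed toy data, might look as follows.

library(cluster)

set.seed(3)
x <- matrix(rnorm(60), ncol = 2)

cl1 <- pam(x, 3, cluster.only = TRUE)   # bare integer clustering vector
cl2 <- pam(x, 3)$clustering             # same labels from the full object
identical(cl1, cl2)                     # expected to be TRUE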
The pam-algorithm is based on the search for k representative objects or medoids among the observations of the dataset. These observations should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.

By default, when medoids are not specified, the algorithm first looks for a good initial set of medoids (this is called the build phase). Then it finds a local minimum for the objective function, that is, a solution such that there is no single switch of an observation with a medoid that will decrease the objective (this is called the swap phase).
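
The following sketch (assumed toy data, not from the original page) shows how trace.lev exposes the two phases and how do.swap = FALSE skips the swap phase.

library(cluster)

set.seed(4)
x <- matrix(rnorm(200), ncol = 2)

full  <- pam(x, 4, trace.lev = 1)    # prints build- and swap-phase diagnostics
build <- pam(x, 4, do.swap = FALSE)  # build phase only; cheaper for large n
full$objective                       # objective after the build and swap steps
build$objective                      # swap skipped, so no further improvement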
When the medoids are specified, their order does not matter; in general, the algorithms have been designed to not depend on the order of the observations.
Value
an object of class "pam" representing the clustering. See ?pam.object for details.
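
A short, illustrative look at the main components of such an object (the toy data and the object name pr are assumed; see ?pam.object for the authoritative list):

library(cluster)

set.seed(5)
pr <- pam(matrix(rnorm(50), ncol = 2), k = 2)
pr$medoids      # the k medoids (as rows of x, or their labels)
pr$id.med       # integer indices of the medoids in x
pr$clustering   # cluster membership of each observation
pr$objective    # objective function after the build and swap steps
pr$silinfo      # silhouette information (available when 1 < k < n)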
Note
For large datasets, pam may need too much memory or too much computation time since both are O(n^2). In that case, clara() is preferable; see its documentation.
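
A minimal sketch with assumed toy data showing the switch to clara() for larger n:

library(cluster)

set.seed(6)
big <- matrix(rnorm(2 * 5000), ncol = 2)   # 5000 observations

cl <- clara(big, k = 3, samples = 50)      # pam()-like clustering on subsamples
table(cl$clustering)
cl$medoids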
See Also
agnes for background and references; pam.object, clara, daisy, partition.object, plot.partition, dist.
Examples
## generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
           cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
pamx <- pam(x, 2)
pamx
summary(pamx)
plot(pamx)
## use obs. 1 & 16 as starting medoids -- same result (typically)
(p2m <- pam(x, 2, medoids = c(1,16)))
p3m <- pam(x, 3, trace = 2)
## rather stupid initial medoids:
(p3m. <- pam(x, 3, medoids = 3:1, trace = 1))
pam(daisy(x, metric = "manhattan"), 2, diss = TRUE)
data(ruspini)
## Plot similar to Figure 4 in Struyf et al (1996)
## Not run: plot(pam(ruspini, 4), ask = TRUE)