skmeans(skmeans)
skmeans() belongs to R package: skmeans
Compute Spherical k-Means Partitions
Description
Partition given vectors x_b by minimizing the spherical k-means criterion ∑_{b,j} w_b u_{bj}^m d(x_b, p_j) over memberships and prototypes, where the w_b are case weights, u_{bj} is the membership of x_b to class j, p_j is the prototype of class j (thus minimizing ∑_b w_b u_{bj}^m d(x_b, p) over p), and d is the cosine dissimilarity d(x, p) = 1 - \cos(x, p).
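As a rough illustration of the criterion above, the following base R sketch evaluates ∑_{b,j} w_b u_{bj}^m d(x_b, p_j) on a tiny made-up data set. The helper names cosine_dissimilarity and skmeans_criterion and the toy matrices are assumptions for illustration only; they are not functions of package skmeans.

```r
## Cosine dissimilarity between the rows of x and the rows of p:
## d(x_b, p_j) = 1 - cos(x_b, p_j).
cosine_dissimilarity <- function(x, p) {
    x <- x / sqrt(rowSums(x^2))   # scale rows to unit length
    p <- p / sqrt(rowSums(p^2))
    1 - tcrossprod(x, p)          # n x k matrix of dissimilarities
}

## Criterion sum_{b,j} w_b u_{bj}^m d(x_b, p_j) for a membership matrix u.
skmeans_criterion <- function(x, p, u, w = rep(1, nrow(x)), m = 1) {
    sum(w * u^m * cosine_dissimilarity(x, p))
}

## Toy data (made up): three objects, two prototypes, hard memberships.
x <- rbind(c(1, 0), c(2, 0), c(0, 3))
p <- rbind(c(1, 0), c(0, 1))
u <- rbind(c(1, 0), c(1, 0), c(0, 1))
skmeans_criterion(x, p, u)
```

Because the dissimilarity depends only on directions, the rows c(1, 0) and c(2, 0) are treated identically here.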
Usage
skmeans(x, k, method = NULL, m = 1, weights = 1, control = list())
Arguments
Argument: x
A numeric data matrix, with rows corresponding to the objects to be partitioned (such that row b contains x_b). Can be a dense matrix, a simple triplet matrix (package slam), or a dgTMatrix (package Matrix). Zero rows are not allowed.
Argument: k
an integer giving the number of classes to be used in the partition.
Argument: method
a character string specifying one of the built-in methods for computing spherical k-means partitions, or a function to be taken as a user-defined method, or NULL (default value). If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods and defaults.
Argument: m
a number not less than 1 controlling the softness of the partition (as the “fuzzification parameter” of the fuzzy c-means algorithm). The default value of 1 corresponds to hard partitions; values greater than one give partitions of increasing softness obtained from a generalized soft spherical k-means problem.
Argument: weights
a numeric vector of non-negative case weights. Recycled to the number of objects given by x if necessary.
Argument: control
a list of control parameters. See Details.
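To illustrate how the argument m controls softness: for fixed prototypes and strictly positive dissimilarities, minimizing ∑_j u_{bj}^m d(x_b, p_j) subject to ∑_j u_{bj} = 1 gives memberships proportional to d(x_b, p_j)^{-1/(m-1)}, as in fuzzy c-means. The base R sketch below shows this update; the function soft_memberships and the dissimilarity matrix d are made up for illustration and are not part of the package.

```r
## Memberships minimizing sum_j u_bj^m d_bj for each row of d, assuming
## m > 1 and strictly positive dissimilarities.
soft_memberships <- function(d, m) {
    stopifnot(m > 1, all(d > 0))
    u <- d^(-1 / (m - 1))
    u / rowSums(u)                 # rows sum to one
}

## Toy dissimilarities of two objects to two prototypes (made up).
d <- rbind(c(0.1, 0.4), c(0.3, 0.3))
u_soft <- soft_memberships(d, m = 2)      # visibly soft memberships
u_sharp <- soft_memberships(d, m = 1.1)   # row 1 is nearly a hard assignment
u_soft
u_sharp
```

As m approaches 1 the memberships concentrate on the closest prototype, which matches the statement that m = 1 corresponds to hard partitions.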
Details
The “standard” spherical k-means problem where all case weights are one and m = 1 is equivalent to maximizing the criterion ∑_j ∑_{b \in C_j} \cos(x_b, p_j), where C_j is the j-th class of the partition. This is the formulation used in Dhillon & Modha (2001) and related references, and when optimized over the prototypes yields the criterion function \mathcal{I}_2 in the CLUTO documentation.
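The equivalence stated above can be spelled out in a few lines. The following is a short derivation sketch under the stated assumptions (unit case weights, m = 1, hard memberships, n objects); the symbol s_j is introduced here for convenience and does not appear in the original text.

```latex
% Minimizing the dissimilarity criterion equals maximizing the cosine sum:
\[
  \sum_{b,j} u_{bj}\, d(x_b, p_j)
  = \sum_j \sum_{b \in C_j} \bigl(1 - \cos(x_b, p_j)\bigr)
  = n - \sum_j \sum_{b \in C_j} \cos(x_b, p_j).
\]
% Optimal prototype of a fixed class C_j, with
% s_j = \sum_{b \in C_j} x_b / \|x_b\| (assumed nonzero):
\[
  \sum_{b \in C_j} \cos(x_b, p) = \frac{s_j^\top p}{\|p\|} \le \|s_j\|,
  \qquad \text{with equality for } p \propto s_j .
\]
% Hence the criterion optimized over the prototypes is
\[
  \sum_j \Bigl\| \sum_{b \in C_j} \frac{x_b}{\|x_b\|} \Bigr\| .
\]
```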
Obtaining optimal spherical k-means partitions is a computationally hard problem, and several methods are available which attempt to obtain optimal partitions. The built-in methods are as follows.
"genetic" a genetic algorithm patterned after the
"pclust" a Lloyd-Forgy style fixed-point algorithm which iterates between determining optimal memberships for fixed prototypes, and computing optimal prototypes for fixed memberships. For hard partitions, this can optionally attempt further local improvements via Kernighan-Lin chains of first variation single object moves as suggested by Dhillon, Guan and
"CLUTO" an interface to the vcluster partitional clustering program from CLUTO, the CLUstering TOolkit by George
"gmeans" an interface to the gmeans partitional
"kmndirs" an interface to the C code for the k-mean-directions algorithm of Ranjan Maitra and Ivan
Method "pclust" is the only method available for soft spherical k-means problems. Method "genetic" can handle case weights. By default, the genetic algorithm is used for obtaining hard partitions, and the fixed-point algorithm otherwise.
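The fixed-point idea behind method "pclust" (alternate between optimal memberships for fixed prototypes and optimal prototypes for fixed memberships) can be sketched in base R as follows. This is a minimal illustration for hard partitions with unit case weights, not the package's implementation; the function skmeans_fixed_point, its arguments, and the toy data are assumptions, and the sketch presumes that no class sum is the zero vector.

```r
## Hard spherical k-means by fixed-point iteration (illustrative sketch,
## unit case weights, m = 1).
skmeans_fixed_point <- function(x, k, maxiter = 100, reltol = 1e-8) {
    x <- x / sqrt(rowSums(x^2))                      # unit-length rows
    p <- x[sample.int(nrow(x), k), , drop = FALSE]   # random objects as prototypes
    old_value <- Inf
    for (iter in seq_len(maxiter)) {
        ## Optimal memberships for fixed prototypes: most similar prototype.
        ids <- max.col(tcrossprod(x, p), ties.method = "first")
        ## Optimal prototypes for fixed memberships: normalized class sums.
        for (j in unique(ids)) {
            s <- colSums(x[ids == j, , drop = FALSE])
            p[j, ] <- s / sqrt(sum(s^2))
        }
        ## Criterion value for the current memberships and prototypes.
        value <- sum(1 - rowSums(x * p[ids, , drop = FALSE]))
        ## Stop once the relative improvement is negligible.
        if (old_value - value < reltol * (abs(value) + reltol)) break
        old_value <- value
    }
    list(prototypes = p, cluster = ids, value = value)
}

## Toy data (made up): two groups of directions in the plane.
x <- rbind(c(1, 0.1), c(2, 0), c(0.9, 0.2),
           c(0, 1), c(0.1, 3), c(0.2, 2))
set.seed(1)
fit <- skmeans_fixed_point(x, 2)
fit$cluster
fit$value
```

Like any such local scheme, the result depends on the random starting prototypes, which is why several runs or better starting values are commonly used.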
Common control parameters for methods "genetic" and "pclust" are as follows.
start a specification of the starting values to be employed. Can either be a character vector with elements "p" (randomly pick objects as prototypes), "i" (randomly pick ids for the objects), "S" (take p minimizing ∑_b w_b d(x_b, p) as the first prototype, and successively pick objects farthest away from the already picked prototypes), or "s" (like "S", but with the first prototype a randomly picked object). Can also be a list of skmeans objects (obtained by previous runs), a list of prototype matrices, or a list of class ids. For the genetic algorithm, the given starting values are used as the initial population; the fixed-point algorithm is applied individually to each starting value, and the best solution found is returned. Defaults to randomly picking objects as prototypes.
reltol The minimum relative improvement per iteration. If improvement is less, the algorithm will stop under the assumption that no further significant improvement can be
verbose a logical indicating whether to provide some output on minimization progress.
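The "S" starting value described above can be pictured with a small base R sketch. It is only an interpretation: "farthest away from the already picked prototypes" is read here as the object whose smallest dissimilarity to the picked prototypes is largest, which is an assumption and may differ from what the package does. The function init_farthest_first and the toy data are hypothetical.

```r
init_farthest_first <- function(x, k, w = rep(1, nrow(x))) {
    x <- x / sqrt(rowSums(x^2))              # unit-length rows
    ## First prototype: the direction minimizing sum_b w_b d(x_b, p),
    ## i.e. the normalized weighted sum of the unit-length rows.
    s <- colSums(w * x)
    p <- matrix(s / sqrt(sum(s^2)), nrow = 1)
    ## d[b]: smallest dissimilarity from object b to the prototypes so far.
    d <- drop(1 - x %*% p[1, ])
    while (nrow(p) < k) {
        b <- which.max(d)                    # farthest object (assumed max-min rule)
        p <- rbind(p, x[b, ])
        d <- pmin(d, drop(1 - x %*% x[b, ]))
    }
    p
}

## Toy data (made up).
x <- rbind(c(1, 0), c(1, 0.2), c(0.1, 1), c(0, 2), c(1, 1))
p0 <- init_farthest_first(x, 3)
p0
```

Replacing the first prototype by a randomly picked row would give the analogue of the "s" variant.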
Additional control parameters for method "genetic" are as follows.
maxiter an integer giving the maximum number of
popsize an integer giving the population size for the genetic algorithm. Default: 6.
mutations a number between 0 and 1 giving the
Additional control parameters for method "pclust" are as follows.
maxiter an integer giving the maximal number of
nruns an integer giving the number of fixed-point runs to be performed. Default: 1.
maxchains an integer giving the maximal length of the Kernighan-Lin chains. Default: 0 (no first variation improvements
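As a usage sketch for the "pclust" control parameters listed above, the call below requests several fixed-point runs and enables Kernighan-Lin chains. It assumes that packages skmeans and slam are installed and uses the 're0' data shipped with skmeans; the values 10 and 5 are arbitrary choices for illustration, not recommendations.

```r
if (requireNamespace("skmeans", quietly = TRUE) &&
    requireNamespace("slam", quietly = TRUE)) {
    x <- slam::read_stm_CLUTO(system.file("cluto", "re0.mat",
                                          package = "skmeans"))
    set.seed(1234)
    ## 10 fixed-point runs, with first variation chains of length at
    ## most 5 enabled (both values chosen arbitrarily).
    party <- skmeans::skmeans(x, 5, method = "pclust",
                              control = list(nruns = 10, maxchains = 5))
    print(party$value)
}
```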
Control parameters for method "CLUTO" are as follows.
vcluster the path to the CLUTO vcluster
colmodel a specification of the CLUTO column model.
verbose as for the genetic algorithm.
control a character string specifying arguments passed
Control parameters for method "gmeans" are as follows.
gmeans the path to the gmeans executable.
verbose as for the genetic algorithm.
control a character string specifying arguments passed
Control parameters for method "kmndirs" are as follows.
nstart an integer giving the number of starting points to compute the starting value for the iteration stage.
maxiter an integer giving the maximum number of iterations.
Method "CLUTO" requires that the CLUTO vcluster executable is available. CLUTO binaries for the Linux, SunOS, Mac OS X, and MS Windows platforms can be downloaded from http://www-users.cs.umn.edu/~karypis/cluto/. If the executable cannot be found in the system path via Sys.which("vcluster") (i.e., named differently or not made available in the system path), its (full) path must be specified in control option vcluster.
Method "gmeans" requires that the gmeans executable is available. Sources for compilation with ANSI C++ compliant compilers are available from http://www.dataminingresearch.com/index.php/2010/06/gmeans-clustering-software-compatible-with-gcc-4; original sources can be obtained from http://userweb.cs.utexas.edu/users/dml/Software/gmeans.html. If the executable cannot be found in the system path via Sys.which("gmeans") (i.e., named differently or not made available in the system path), its (full) path must be specified in control option gmeans.
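The lookup of the external executables described above can be checked before calling skmeans. The base R sketch below mirrors the Sys.which() test from the text; the helper find_executable and the fallback paths "/path/to/vcluster" and "/path/to/gmeans" are placeholders introduced for illustration and must be replaced by real install locations.

```r
## Return the full path of an executable found on the system path, or NA.
find_executable <- function(name) {
    path <- unname(Sys.which(name))
    if (!is.na(path) && nzchar(path)) path else NA_character_
}

vcluster_path <- find_executable("vcluster")
gmeans_path <- find_executable("gmeans")

## Only pass an explicit path when the lookup fails (placeholder paths).
cluto_control <- if (is.na(vcluster_path)) {
    list(vcluster = "/path/to/vcluster")
} else list()
gmeans_control <- if (is.na(gmeans_path)) {
    list(gmeans = "/path/to/gmeans")
} else list()
```

The resulting lists could then be supplied as the control argument for methods "CLUTO" and "gmeans", respectively.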
Method "kmndirs" requires package kmndirs (available from http://R-Forge.R-project.org/projects/kmndirs), which provides an R interface to a suitable modification of the C code for the k-mean-directions algorithm made available as supplementary material to Maitra & Ramler (2010) at http://pubs.amstat.org/doi/suppl/10.1198/jcgs.2009.08155.
User-defined methods must have formals x, k and control, and optionally may have formals weights or m if providing support for case weights or soft spherical k-means partitions, respectively.
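A function with the required formals x, k and control might look like the sketch below. The text above only specifies the formals, so everything else is an assumption: the return value (a plain list with prototypes, cluster and value), the hypothetical control entry start_rows, and the single assignment step used as the "algorithm". Whether skmeans accepts exactly this return structure is not confirmed here.

```r
## Hypothetical user-defined method: assign every object to the most
## similar of k starting objects (one assignment step only).
my_method <- function(x, k, control) {
    x <- as.matrix(x)
    x <- x / sqrt(rowSums(x^2))              # unit-length rows
    start <- control$start_rows              # hypothetical control entry
    if (is.null(start)) start <- sample.int(nrow(x), k)
    p <- x[start, , drop = FALSE]
    similarity <- tcrossprod(x, p)
    ids <- max.col(similarity, ties.method = "first")
    list(prototypes = p, cluster = ids,
         value = sum(1 - similarity[cbind(seq_len(nrow(x)), ids)]))
}

## Direct call on toy data (made up), outside of skmeans().
toy <- rbind(c(1, 0), c(2, 0), c(0, 3))
toy_fit <- my_method(toy, 2, list(start_rows = c(1, 3)))
toy_fit$cluster
```

Support for case weights or soft partitions would additionally require formals weights or m, as stated above.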
Value
An object inheriting from classes skmeans and pclust (see the information on pclust objects in package clue for further details) representing the obtained spherical k-means partition, which is a list with components including the following:
Component: prototypes
a dense matrix with k rows giving the prototypes.
Component: membership
cluster membership as a matrix with k columns (only provided if m > 1).
Component: cluster
the class ids of the closest hard partition (the partition itself if m = 1).
Component: value
the value of the criterion.
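To make the relation between the membership and cluster components concrete, the base R sketch below derives class ids from a small made-up membership matrix by taking the (first) maximal membership in each row, and also computes the difference between the two largest memberships as a simple measure of sureness. The matrix and variable names are illustrative assumptions.

```r
## Made-up memberships of three objects in three classes (rows sum to one).
membership <- rbind(c(0.7, 0.2, 0.1),
                    c(0.1, 0.3, 0.6),
                    c(0.4, 0.5, 0.1))

## Class ids of the closest hard partition: position of the row maximum.
cluster <- max.col(membership, ties.method = "first")

## Margin: largest minus second largest membership in each row.
sorted <- t(apply(membership, 1, sort, decreasing = TRUE))
margin <- sorted[, 1] - sorted[, 2]
cluster
margin
```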
Objects representing spherical k-means partitions have special methods for print, cl_validity (providing the “dissimilarity accounted for”) from package clue, and silhouette from package cluster (the latter two take advantage of the special structure of the cosine distance to avoid computing full object-by-object distance matrices, and hence also perform well for large data sets).
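The special structure mentioned above can be illustrated as follows: for unit-length rows, the average cosine dissimilarity from an object to all members of a class equals one minus the inner product with the class mean, so class sums suffice and no object-by-object matrix is needed. The base R sketch below checks this on made-up data; it only illustrates the identity and is not the package's silhouette method (a real silhouette computation would, for instance, also exclude each object from its own class average).

```r
## Toy data (made up) with a given hard partition.
x <- rbind(c(1, 0), c(2, 1), c(0, 3), c(1, 4))
ids <- c(1, 1, 2, 2)
x <- x / sqrt(rowSums(x^2))                  # unit-length rows

## Fast version: only k class sums are needed.
sums <- rowsum(x, ids)                       # k x d matrix of class sums
sizes <- as.vector(table(ids))               # class sizes, same order
avg_fast <- 1 - tcrossprod(x, sums / sizes)  # n x k average dissimilarities

## Reference version via the full n x n dissimilarity matrix.
full <- 1 - tcrossprod(x)
avg_full <- sapply(sort(unique(ids)), function(j)
    rowMeans(full[, ids == j, drop = FALSE]))

all.equal(avg_fast, avg_full, check.attributes = FALSE)
```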
Package clue provides additional methods for objects inheriting from class pclust, see the examples.
Author(s)
Kurt Hornik (Kurt.Hornik@wu.ac.at),
Ingo Feinerer (feinerer@logic.at),
Martin Kober (martin.kober@wu.ac.at).
References
Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
Iterative clustering of high dimensional text data augmented by local search. In Proceedings of the Second IEEE International Conference on Data Mining, pages 131–138. http://www.cs.utexas.edu/users/inderjit/public_papers/iterative_icdm02.pdf.
Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29/3, 433–439. http://eprints.iisc.ernet.in/2937/1/genetic-k.pdf.
CLUTO: A Clustering Toolkit. Technical Report #02-017, Department of Computer Science, University of Minnesota. http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf.
A k-mean-directions algorithm for fast clustering of data on the sphere. Journal of Computational and Graphical Statistics, 19/2, 377–396.
Examples
library("skmeans")
set.seed(1234)
## Use CLUTO dataset 're0' and the reader for CLUTO sparse matrix
## format in package 'slam'. (In text clustering applications, x will
## often be a DocumentTermMatrix object obtained from package 'tm'.)
x <- slam::read_stm_CLUTO(system.file("cluto", "re0.mat",
                                      package = "skmeans"))
## Which is not really small:
dim(x)
## Hard partition into 5 clusters.
hparty <- skmeans(x, 5, control = list(verbose = TRUE))
## Criterion value obtained:
hparty$value
## Compare with "true" classifications:
class_ids <- attr(x, "rclass")
table(class_ids, hparty$cluster)
## (Note that there are actually 10 "true" classes.)
## Plot the silhouette information for the obtained partition.
require("cluster")
plot(silhouette(hparty))
## Clearly, cluster 3 is "best", and cluster 5 needs splitting.
## Soft partition into 5 clusters.
sparty <- skmeans(x, 5, m = 1.1,
                  control = list(nruns = 5, verbose = TRUE))
## Criterion value obtained:
sparty$value
## (This should be a lower bound for the criterion value of the hard
## partition.)
## Compare the soft and hard partitions:
table(hparty$cluster, sparty$cluster)
## Or equivalently using the high-level accessors from package 'clue':
require("clue")
table(cl_class_ids(hparty), cl_class_ids(sparty))
## Which can also be used for computing agreement/dissimilarity measures
## between the obtained partitions.
cl_agreement(hparty, sparty, "Rand")
## How fuzzy is the obtained soft partition?
cl_fuzziness(sparty)
## And in fact, looking at the membership margins we see that the
## "sureness" of classification is rather high:
summary(cl_margin(sparty))