ewkm(weightedKmeans)
ewkm() is from the R package: weightedKmeans
Entropy Weighted K-Means
----------Description----------
Perform entropy weighted subspace k-means clustering.
----------Usage----------
ewkm(x, k, lambda=1, maxiter=100, delta=0.00001, maxrestart=10)
----------Arguments----------
x: numeric matrix of observations and variables.
k: target number of clusters.
lambda: parameter controlling the distribution of the variable weights.
maxiter: maximum number of iterations.
delta: maximum change allowed between iterations for convergence.
maxrestart: maximum number of restarts. The default is 10, giving a good chance of obtaining a full set of k clusters. Any empty clusters that arise are removed from the result, so if no restarts are allowed (maxrestart=0) we may obtain fewer than k clusters. If maxrestart < 0 there is no limit on the number of restarts, and we are much more likely to obtain a full set of k clusters.
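For illustration, a call spelling out every argument might look as follows. This is only a sketch: the data are rescaled first (see Details below), and maxrestart=-1 relies on the "< 0 means no limit on restarts" behaviour described above.
x <- scale(iris[1:4])   # rescale so that no variable dominates the distance
fit <- ewkm(x, k=3, lambda=1, maxiter=100, delta=0.00001, maxrestart=-1)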
----------Details----------
The entropy weighted k-means clustering algorithm is a subspace clusterer, ideal for high dimensional data. Along with each cluster we also obtain variable weights that provide a relative measure of the importance of each variable to that cluster.
The algorithm is based on the k-means approach to clustering. An initial set of k means is identified as the starting centroids. Observations are clustered to the nearest centroid according to a distance measure. This defines the initial clustering. New centroids are then identified based on these clusters.
Weights are then calculated for each variable within each cluster, based on the current clustering. The weights are a measure of the relative importance of each variable with regard to the membership of the observations to that cluster. These weights are then incorporated into the distance function, typically reducing the distance for the more important variables.
New centroids are then calculated, and using the weighted distance measure each observation is once again clustered to its nearest centroid.
The process continues until convergence (using a measure of dispersion and stopping when the change becomes less than delta) or until a specified number of iterations has been reached (maxiter).
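As an illustration of the weight update described above, the following sketch follows the standard entropy weighting formulation from the paper listed under References; it is not the package's internal code, and the function and argument names are hypothetical. For each cluster the dispersion of each variable about the centroid is computed, and the weights are obtained by normalising exp(-dispersion/lambda) so that they sum to one within the cluster.
# Illustrative sketch only (hypothetical helper, not part of weightedKmeans):
# x is a numeric matrix, cluster a vector of cluster assignments,
# centers a k x p matrix of centroids.
ewkm_weights <- function(x, cluster, centers, lambda) {
  k <- nrow(centers)
  w <- matrix(0, nrow=k, ncol=ncol(x))
  for (l in seq_len(k)) {
    xl <- x[cluster == l, , drop=FALSE]
    # within-cluster dispersion of each variable about the centroid
    D <- colSums((xl - matrix(centers[l, ], nrow(xl), ncol(xl), byrow=TRUE))^2)
    e <- exp(-D / lambda)
    w[l, ] <- e / sum(e)   # entropy weights: sum to 1 within the cluster
  }
  w
}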
A large lambda (e.g., > 3) leads to a relatively even distribution of weights across the variables. A small lambda (e.g., < 1) leads to a more uneven distribution of weights, giving more discrimination between features. Recommended values are between 1 and 3.
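A sketch of this effect (the exact weights depend on the random starting centroids):
x <- scale(iris[1:4])
w_small <- ewkm(x, k=3, lambda=0.5)$weights   # weight concentrates on fewer variables
w_large <- ewkm(x, k=3, lambda=4)$weights     # weight spreads more evenly
round(w_small, 2)
round(w_large, 2)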
Always check the number of iterations, the number of restarts, and the total number of iterations as they give a good indication of whether the algorithm converged.
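For example, these indicators can be read directly from the returned object (components described under Value below):
fit <- ewkm(scale(iris[1:4]), k=3)
fit$iterations        # did the run stop before maxiter was reached?
fit$restarts          # greater than 0 means clusters kept disappearing
fit$total.iterations  # iterations summed over all restarts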
As with any distance based algorithm, be sure to rescale your numeric data so that large values do not bias the clustering. A quick rescaling method to use is scale.
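For instance:
x <- scale(iris[1:4])   # centre each variable and divide by its standard deviation
apply(x, 2, sd)         # every variable now has unit standard deviation
fit <- ewkm(x, k=3)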
----------Value----------
Returns an object of class "kmeans" and "ewkm", compatible with other functions that work with kmeans objects, such as the print method. The object is a list with the following components in addition to the components of a kmeans object:
weights: A matrix of weights recording the relative importance of each variable for each cluster.
iterations: The number of iterations before termination. Check this to see whether maxiter was reached. If it was, the algorithm may not have converged, and the resulting clustering may not be particularly good.
restarts: The number of times the clustering was restarted because a cluster disappeared, i.e. one or more of the k means had no observations associated with it. A number greater than 0 indicates that the algorithm is not converging on a clustering for the given k; it is recommended that k be reduced.
total.iterations: The total number of iterations over all restarts.
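A quick way to inspect these components (the orientation of the weights matrix and the sum-to-one property are assumptions based on the usual entropy weighting formulation, not something stated on this page):
fit <- ewkm(scale(iris[1:4]), k=3)
dim(fit$weights)       # assumed: one row per cluster, one column per variable
rowSums(fit$weights)   # assumed to be approximately 1 for each cluster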
----------Author(s)----------
Qiang Wang, Xiaojun Chen, Graham J Williams, Joshua Z Huang
----------References----------
Liping Jing, Michael K. Ng and Joshua Zhexue Huang (2007). An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on Knowledge and Data Engineering, 19(8), pp. 1026-1041.
----------See Also----------
plot.ewkm.
----------Examples----------
# Entropy weighted subspace clustering of the iris measurements,
# plotted with observations coloured by cluster
myewkm <- ewkm(iris[1:4], k=3, lambda=0.5, maxiter=100)
plot(iris[1:4], col=myewkm$cluster)

# For comparative testing: standard k-means on the same data
mykm <- kmeans(iris[1:4], 3)
plot(iris[1:4], col=mykm$cluster)
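# A possible follow-up (sketch): inspect the variable weights, and, since
# See Also lists plot.ewkm, plot() on the "ewkm" object is expected to
# dispatch to that method (an assumption about S3 dispatch, not stated here).
round(myewkm$weights, 2)   # relative importance of each variable per cluster
plot(myewkm)               # expected to dispatch to plot.ewkm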