intVal(validator)
intVal() belongs to the R package validator
Internal Validation Indices
Translator: 生物统计家园网, bot LoveR
----------Description----------
This function calculates the values of selected internal validation indices.
----------Usage----------
intVal(y, x, index = "all")
----------Arguments----------
Argument: y
Object of class kcca returned by clustering methods from the package flexclust
Argument: x
Data matrix containing the observations that were clustered, or the data matrix of a test data set
Argument: index
The internal validation indices to calculate: "calinski", "db", "hartigan", "ratkowsky", "scott", "marriot", "ball", "trcovw", "tracew", "friedman", "rubin", "xuindex", "dunn", "connectivity", "silhouette", or "all" to calculate every index.
----------Details----------
The internal validation indices are all based on the references below. The indices are defined as:
calinski: \frac{SSB/(k-1)}{SSW/(n-k)}, where SSB is the between-cluster sum of squares, SSW is the within-cluster sum of squares, n is the number of data points, and k is the number of clusters.
db: \frac{1}{k} \sum_{i=1}^k R_i, where R_i is the maximum over j \neq i of R_{ij}=\frac{s_i + s_j}{d_{ij}}, s_i is the similarity (scatter) within cluster i, and d_{ij} is the dissimilarity between clusters i and j.
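The db definition above can be sketched in base R. This is a minimal illustration, not the code used by the package: the helper name db_index is hypothetical, and it assumes that s_i is the mean Euclidean distance of the points in cluster i to their centroid and that d_{ij} is the Euclidean distance between centroids. It also assumes at least two clusters.

```r
# Hypothetical helper, not part of the validator package.
# Assumptions: s_i = mean Euclidean distance to the cluster centroid,
#              d_ij = Euclidean distance between centroids i and j.
db_index <- function(x, cluster) {
  x <- as.matrix(x)
  ids <- sort(unique(cluster))
  # One centroid per row (rbind keeps this a matrix even for 1-D data)
  centers <- do.call(rbind, lapply(ids, function(i) {
    colMeans(x[cluster == i, , drop = FALSE])
  }))
  # Within-cluster scatter s_i
  s <- sapply(seq_along(ids), function(i) {
    pts <- x[cluster == ids[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centers[i, ])^2)))
  })
  d <- as.matrix(dist(centers))   # d_ij between centroids
  R <- outer(s, s, "+") / d       # R_ij = (s_i + s_j) / d_ij
  diag(R) <- -Inf                 # exclude j == i from the maximum
  mean(apply(R, 1, max))          # average of R_i over all clusters
}

# Two tight, well-separated clusters give a small value
x <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
db_index(x, c(1, 1, 2, 2))   # 0.2
```

Smaller values indicate compact clusters that lie far apart.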
hartigan: \log \frac{SSB}{SSW}
ratkowsky: \tilde{c}/\sqrt{k}, where \tilde{c} = mean \sqrt{varSSB / varSST}; the prefix var indicates that the quantity is computed for each variable separately, and SST is the total sum of squares.
ball: \frac{SSW}{k}
xuindex: d\log(\sqrt{SSW/(dn^2)}) + \log(k), where d is the dimension of the data points.
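The sum-of-squares indices above (calinski, hartigan, ratkowsky, ball, xuindex) can be computed directly from a data matrix and a vector of cluster labels. The following base-R sketch follows the formulas as written here; the helper name ss_indices is hypothetical, and the package's own code may differ in detail.

```r
# Hypothetical helper, not part of the validator package.
ss_indices <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  ids <- sort(unique(cluster))
  k <- length(ids)
  # Per-variable total sum of squares around the grand mean
  sst_var <- colSums(sweep(x, 2, colMeans(x))^2)
  # Per-variable within-cluster sum of squares around each cluster centroid
  ssw_var <- Reduce(`+`, lapply(ids, function(i) {
    pts <- x[cluster == i, , drop = FALSE]
    colSums(sweep(pts, 2, colMeans(pts))^2)
  }))
  ssb_var <- sst_var - ssw_var   # per-variable between-cluster sum of squares
  SSW <- sum(ssw_var)
  SSB <- sum(ssb_var)
  c(calinski  = (SSB / (k - 1)) / (SSW / (n - k)),
    hartigan  = log(SSB / SSW),
    ratkowsky = mean(sqrt(ssb_var / sst_var)) / sqrt(k),
    ball      = SSW / k,
    xuindex   = d * log(sqrt(SSW / (d * n^2))) + log(k))
}

x <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
ss_indices(x, c(1, 1, 2, 2))
```

For this toy data SSB = 100 and SSW = 4, so calinski is 50 and ball is 2.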
scott: n \log \frac{\det(T)}{\det(W)}, where T is the total scatter matrix and W is the pooled within-groups scatter matrix.
marriot: k^2 \det(W)
trcovw: trace(cov W)
tracew: trace(W)
friedman: trace(W^{-1}B), where B is the between groups scatter matrix.
rubin: \det(T) / \det(W)
The R code for the indices above is taken from the package cclust with small changes, e.g. SSW is computed with respect to the cluster centers.
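To make the scatter-matrix indices (scott, marriot, trcovw, tracew, friedman, rubin) concrete, here is an independent base-R sketch. It is not the cclust or validator code: the helper name scatter_indices is hypothetical, T is assumed to be the scatter of the data around the grand mean, B is obtained as T - W, and trace(cov W) is read literally as the trace of the covariance matrix of the columns of W.

```r
# Hypothetical helper, not part of the validator or cclust packages.
scatter_indices <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  ids <- sort(unique(cluster))
  k <- length(ids)
  # Total scatter matrix T: cross-products of deviations from the grand mean
  Tmat <- crossprod(sweep(x, 2, colMeans(x)))
  # Pooled within-groups scatter matrix W: summed per-cluster scatter
  W <- Reduce(`+`, lapply(ids, function(i) {
    pts <- x[cluster == i, , drop = FALSE]
    crossprod(sweep(pts, 2, colMeans(pts)))
  }))
  B <- Tmat - W   # between-groups scatter matrix
  c(scott    = n * log(det(Tmat) / det(W)),
    marriot  = k^2 * det(W),
    trcovw   = sum(diag(cov(W))),   # assumption: literal trace(cov W)
    tracew   = sum(diag(W)),
    friedman = sum(diag(solve(W) %*% B)),
    rubin    = det(Tmat) / det(W))
}

# W must be non-singular, so each cluster needs points that vary in every variable
x <- rbind(c(0, 0), c(1, 0), c(0, 1), c(10, 10), c(11, 10), c(10, 11))
scatter_indices(x, c(1, 1, 1, 2, 2, 2))
```

The indices that invert W or take its determinant are undefined when W is singular, for example when a variable is constant within every cluster.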
dunn: D(\mathcal{C}) = \frac{\min\limits_{C_k, C_l \in \mathcal{C}, C_k \neq C_l} \Big( \min\limits_{x_i \in C_k, x_j \in C_l} dist(x_i, x_j) \Big)} {\max\limits_{C_m \in \mathcal{C}} diam (C_m)}, where diam(C_m) is the maximum distance between any two data items in cluster C_m, and dist(x_i, x_j) is the distance between the data points x_i and x_j.
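The dunn formula reduces to two lookups in the pairwise distance matrix. A minimal base-R sketch, assuming Euclidean distances; the helper name dunn_index is hypothetical and not part of the package:

```r
# Hypothetical helper, not part of the validator package.
# Assumption: dist() and diam() both use Euclidean distance.
dunn_index <- function(x, cluster) {
  d <- as.matrix(dist(x))                 # pairwise distances
  same <- outer(cluster, cluster, "==")   # TRUE where both points share a cluster
  min_between <- min(d[!same])            # smallest distance across clusters
  max_diam <- max(d[same])                # largest distance inside any cluster
  min_between / max_diam
}

x <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
dunn_index(x, c(1, 1, 2, 2))   # 10 / 2 = 5
```

Larger values indicate clusters that are far apart relative to their diameter.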
silhouette: S(x_i) = \frac{b_i - a_i}{\max(a_i, b_i)}, where a_i is the average distance from the data point x_i to all other observations in the same cluster and b_i is the average distance from x_i to all points in the closest neighbouring cluster. The reported index is the average of all silhouette widths.
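As an illustration of the silhouette definition, the average silhouette width can be sketched in base R as follows. The helper name mean_silhouette is hypothetical, Euclidean distances and a plain (non-factor) label vector are assumed, and singleton clusters are given a width of 0, which is a common convention rather than something stated above.

```r
# Hypothetical helper, not part of the validator package.
# Assumption: a point that is alone in its cluster gets silhouette width 0.
mean_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  widths <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)
    # a_i: mean distance to the other members of the same cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b_i: smallest mean distance to the members of any other cluster
    b <- min(sapply(setdiff(ids, cluster[i]), function(g) {
      mean(d[i, cluster == g])
    }))
    (b - a) / max(a, b)
  })
  mean(widths)
}

x <- matrix(c(0, 1, 10, 11), ncol = 1)
mean_silhouette(x, c(1, 1, 2, 2))
```

Values close to 1 indicate that points are much closer to their own cluster than to the nearest other cluster.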
connectivity: Conn(\mathcal{C}) = \sum_{i=1}^{N} \sum_{j=1}^{L} x_{i, nn_{i(j)}}, where nn_{i(j)} is the j-th nearest neighbour of the data point x_i. The term x_{i, nn_{i(j)}} is zero if x_i and nn_{i(j)} are in the same cluster and \frac{1}{j} otherwise; L is the number of neighbours that contribute to the connectivity measure.
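The connectivity sum can be sketched with a loop over the data points. This is a minimal base-R illustration under assumptions: the helper name connectivity and the default L = 10 are chosen here for the example, and neighbours are ranked by Euclidean distance.

```r
# Hypothetical helper, not part of the validator package.
# Assumptions: Euclidean nearest neighbours; L = 10 is an illustrative default.
connectivity <- function(x, cluster, L = 10) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  L <- min(L, n - 1)   # a point has at most n - 1 neighbours
  total <- 0
  for (i in seq_len(n)) {
    # Indices of the L nearest neighbours of point i, excluding i itself
    nn <- order(d[i, ])
    nn <- nn[nn != i][seq_len(L)]
    # Add 1/j for every j-th neighbour that lies in a different cluster
    total <- total + sum((cluster[nn] != cluster[i]) / seq_len(L))
  }
  total
}

x <- matrix(c(0, 1, 10, 11), ncol = 1)
connectivity(x, c(1, 1, 2, 2), L = 1)   # 0: every nearest neighbour shares the cluster
connectivity(x, c(1, 1, 2, 2), L = 2)   # 2: each second neighbour adds 1/2
```

Smaller values are better: 0 means that every point shares its cluster with all of its L nearest neighbours.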
----------Value----------
This function returns a vector containing the requested internal validation indices.
----------Author(s)----------
Marcus Scherl
----------References----------
Glenn W. Milligan and Martha C. Cooper, An examination of procedures for determining the number of clusters in a dataset.
Julia Handl, Joshua Knowles, and Douglas B. Kell, Computational cluster validation in post-genomic data analysis, http://dbkgroup.org/handl/clustervalidation/.
Andreas Weingessel, Evgenia Dimitriadou, and Sara Dolnicar, An examination of indexes for determining the number of clusters in binary data sets.
Guy Brock, Vasyl Pihur, Susmita Datta, and Somnath Datta, clValid: An R package for cluster validation, http://www.jstatsoft.org/v25/i04.
----------Examples----------
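library(mlbench)    # provides mlbench.2dnormals()
library(flexclust)  # provides kcca() and as.kcca()
library(validator)  # provides intVal()

# Cluster with kcca() and compute all internal validation indices
x <- mlbench.2dnormals(500, 3)
cl <- kcca(x$x, 3)
intVal(cl, x$x)

# Cluster with kmeans(), convert the result to a kcca object, then validate
x <- mlbench.2dnormals(500, 3)
cl <- kmeans(x$x, 3)
cl <- as.kcca(cl, x$x)
intVal(cl, x$x)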
Please credit the source when reposting: 生物统计家园网 (http://www.biostatistic.net).