daisy(cluster)
daisy() belongs to the R package: cluster
Dissimilarity Matrix Calculation
Translator: 生物统计家园网, robot LoveR
Description
Compute all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. In that case, or whenever metric = "gower" is set, a generalization of Gower's formula is used, see "Details" below.
Usage
daisy(x, metric = c("euclidean", "manhattan", "gower"),
      stand = FALSE, type = list(), weights = rep.int(1, p))
Arguments
Argument: x
numeric matrix or data frame, of dimension n x p, say. Dissimilarities will be computed between the rows of x. Columns of mode numeric (i.e. all columns when x is a matrix) will be recognized as interval scaled variables, columns of class factor will be recognized as nominal variables, and columns of class ordered will be recognized as ordinal variables. Other variable types should be specified with the type argument. Missing values (NAs) are allowed.
Argument: metric
character string specifying the metric to be used. The currently available options are "euclidean" (the default), "manhattan" and "gower". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. “Gower's distance” is chosen by metric "gower", or automatically if some columns of x are not numeric. It is also known as Gower's coefficient (1971), expressed as a dissimilarity. It implies that a particular standardisation is applied to each variable, and that the “distance” between two units combines all the variable-specific distances as a weighted mean; see the Details section.
Argument: stand
logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If not all columns of x are numeric, stand will be ignored and Gower's standardization (based on the range) will be applied in any case, see argument metric, above, and the details section.
Argument: type
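As a minimal sketch of the standardization that stand = TRUE describes, the following compares daisy() with a manual computation. The toy matrix and the helper mad_scale are made up for illustration and are not part of cluster; the sketch assumes that all columns are numeric and that metric = "gower" is not requested.

```r
library(cluster)

# Toy data (hypothetical): two interval-scaled variables on different scales.
x <- cbind(a = c(1, 2, 4, 7), b = c(10, 30, 20, 60))

d_stand <- daisy(x, metric = "euclidean", stand = TRUE)

# Hypothetical helper: centre on the mean, divide by the mean absolute deviation.
mad_scale <- function(v) (v - mean(v)) / mean(abs(v - mean(v)))

# Standardize each column by hand, then take ordinary Euclidean distances.
x_manual <- apply(x, 2, mad_scale)
d_manual <- dist(x_manual)

# Compare the two sets of pairwise dissimilarities.
all.equal(as.vector(d_stand), as.vector(d_manual))
```

Note that the divisor is the mean absolute deviation, not the standard deviation used by scale().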
list for specifying some (or all) of the types of the variables (columns) in x. The list may contain the following components: "ordratio" (ratio scaled variables to be treated as ordinal variables), "logratio" (ratio scaled variables that must be logarithmically transformed), "asymm" (asymmetric binary) and "symm" (symmetric binary variables). Each component's value is a vector, containing the names or the numbers of the corresponding columns of x. Variables not mentioned in the type list are interpreted as usual (see argument x).
Argument: weights
an optional numeric vector of length p (= ncol(x)), used in “case 2” (mixed variables, or metric = "gower"). It specifies a weight for each variable (x[,k]), replacing the weight of 1 in Gower's original formula.
Details
The original version of daisy is fully described in chapter 1 of Kaufman and Rousseeuw (1990). Compared to dist whose input must be numeric variables, the main feature of daisy is its ability to handle other variable types as well (e.g. nominal, ordinal, (a)symmetric binary) even when different types occur in the same data set.
The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (1971). If x contains any columns of these data-types, both arguments metric and stand will be ignored and Gower's coefficient will be used as the metric. This can also be activated for purely numeric data by metric = "gower". With that, each variable (column) is first standardized by dividing each entry by the range of the corresponding variable, after subtracting the minimum value; consequently the rescaled variable has range [0,1], exactly.
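For purely numeric data, the range standardization described above can be checked by hand. This is a sketch under stated assumptions: the toy data frame and the helper rescale01 are hypothetical, there are no missing values, and all weights are left at their default of 1, so that the Gower dissimilarity reduces to the mean over variables of the rescaled absolute differences.

```r
library(cluster)

# Toy data (hypothetical): two numeric variables, no missing values.
x <- data.frame(a = c(1, 2, 4, 7), b = c(10, 30, 20, 60))

d_gower <- daisy(x, metric = "gower")

# Hypothetical helper: subtract the minimum, divide by the range -> values in [0, 1].
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
x01 <- as.data.frame(lapply(x, rescale01))

# Mean absolute difference over the p variables = Manhattan distance / p.
d_manual <- dist(x01, method = "manhattan") / ncol(x)

all.equal(as.vector(d_gower), as.vector(d_manual))
```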
Note that setting the type to symm (symmetric binary) gives the same dissimilarities as using nominal (which is chosen for non-ordered factors) only when no missing values are present, and it does so more efficiently.
Note that daisy now gives a warning when 2-valued numerical variables do not have an explicit type specified, because the reference authors recommend considering "asymm" for such variables.
In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row. There are two main cases:
If all variables are interval scaled (and metric is not "gower"), the metric is "euclidean", and n_g is the number of columns in which neither row i nor row j has an NA, then the dissimilarity d(i,j) returned is sqrt(p/n_g) (where p = ncol(x)) times the Euclidean distance between the two vectors of length n_g shortened to exclude NAs. The rule is similar for the "manhattan" metric, except that the coefficient is p/n_g. If n_g = 0, the dissimilarity is NA.
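The rescaling for missing values in the interval-scaled case can be reproduced by hand. The sketch below uses a made-up matrix and a hypothetical helper scaled_dist; neither is part of cluster.

```r
library(cluster)

# Toy matrix (hypothetical) with a few missing values.
x <- rbind(r1 = c(1, 2, NA, 4),
           r2 = c(2, 0, 3, 1),
           r3 = c(0, 1, 5, NA),
           r4 = c(3, 4, 6, 2))
p <- ncol(x)

d_euc <- as.matrix(daisy(x, metric = "euclidean"))
d_man <- as.matrix(daisy(x, metric = "manhattan"))

# Hypothetical helper: distance over the n_g columns observed in both rows,
# scaled up by sqrt(p / n_g) (Euclidean) or p / n_g (Manhattan).
scaled_dist <- function(i, j, metric = c("euclidean", "manhattan")) {
  metric <- match.arg(metric)
  ok  <- !is.na(x[i, ]) & !is.na(x[j, ])
  n_g <- sum(ok)
  if (n_g == 0) return(NA_real_)
  diffs <- x[i, ok] - x[j, ok]
  if (metric == "euclidean") sqrt(p / n_g) * sqrt(sum(diffs^2))
  else (p / n_g) * sum(abs(diffs))
}

# Rows r1 and r3 share only two complete columns, so n_g = 2 here.
all.equal(d_euc["r1", "r3"], scaled_dist("r1", "r3", "euclidean"))
all.equal(d_man["r1", "r3"], scaled_dist("r1", "r3", "manhattan"))
```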
When some variables have a type other than interval scaled, or if metric = "gower" is specified, the dissimilarity between two rows is the weighted mean of the contributions of each variable. Specifically,
d_ij = d(i,j) = sum(k=1:p; w_k delta(ij;k) d(ij,k)) / sum(k=1:p; w_k delta(ij;k)).
In other words, d_ij is a weighted mean of d(ij,k) with weights w_k delta(ij;k), where w_k = weights[k], delta(ij;k) is 0 or 1, and d(ij,k), the k-th variable's contribution to the total distance, is a distance between x[i,k] and x[j,k], see below.
The 0-1 weight delta(ij;k) becomes zero when the variable x[,k] is missing in either or both rows (i and j), or when the variable is asymmetric binary and both values are zero. In all other situations it is 1.
The contribution d(ij,k) of a nominal or binary variable to the total dissimilarity is 0 if both values are equal, 1 otherwise. The contribution of other variables is the absolute difference of both values, divided by the total range of that variable. Note that “standard scoring” is applied to ordinal variables, i.e., they are replaced by their integer codes 1:K. Note that this is not the same as using their ranks (since there typically are ties).
As the individual contributions d(ij,k) are in [0,1], the dissimilarity d_ij will remain in this range. If all weights w_k delta(ij;k) are zero, the dissimilarity is set to NA.
Value
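To make the mixed-variable rules above concrete, here is a sketch that recomputes single entries of the dissimilarity matrix by hand. The data frame, the weights and the helper gower_pair are all hypothetical and written for this toy data only; the sketch has no missing values, so delta(ij;k) can only become zero through the asymmetric binary rule.

```r
library(cluster)

# Toy data (hypothetical): one variable of each kind discussed above.
df <- data.frame(
  height = c(10, 20, 40, 25),                               # interval scaled
  colour = factor(c("red", "blue", "red", "green")),        # nominal
  grade  = factor(c("low", "high", "mid", "mid"),
                  levels = c("low", "mid", "high"),
                  ordered = TRUE),                          # ordinal
  spots  = c(0, 0, 1, 1)                                    # declared asymmetric binary
)
w <- c(2, 1, 1, 1)  # one weight per column

d_daisy <- as.matrix(daisy(df, type = list(asymm = "spots"), weights = w))

# Hypothetical helper: weighted mean of per-variable contributions for rows i, j.
gower_pair <- function(i, j, w) {
  grade <- as.integer(df$grade)  # "standard scoring": integer codes 1:K
  contrib <- c(
    abs(df$height[i] - df$height[j]) / diff(range(df$height)),
    as.numeric(df$colour[i] != df$colour[j]),
    abs(grade[i] - grade[j]) / diff(range(grade)),
    as.numeric(df$spots[i] != df$spots[j])
  )
  # delta is 0 for the asymmetric binary variable when both values are zero.
  delta <- c(1, 1, 1, as.numeric(df$spots[i] != 0 || df$spots[j] != 0))
  sum(w * delta * contrib) / sum(w * delta)
}

all.equal(d_daisy[1, 2], gower_pair(1, 2, w))  # spots is 0 in both rows: dropped
all.equal(d_daisy[3, 4], gower_pair(3, 4, w))  # spots is 1 in both rows: kept
```

In this sketch, rows 1 and 2 are compared on three variables only, because the shared zero in spots carries no weight, whereas rows 3 and 4 are compared on all four.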
an object of class "dissimilarity" containing the dissimilarities among the rows of x. This is typically the input for the functions pam, fanny, agnes or diana. For more details, see dissimilarity.object.
Background
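As a brief illustration of how the returned object is typically used, the following sketch passes a daisy() result to pam and agnes. The choice of k = 2 and of average linkage is arbitrary and only for demonstration.

```r
library(cluster)

data(agriculture)
d <- daisy(agriculture, metric = "euclidean")

# The result carries class "dissimilarity" and can be used like a "dist" object.
inherits(d, "dissimilarity")

# Partitioning around medoids, working directly on the dissimilarities.
fit_pam <- pam(d, k = 2)
fit_pam$clustering

# Agglomerative hierarchical clustering on the same object.
fit_agnes <- agnes(d, method = "average")
fit_agnes$ac  # agglomerative coefficient
```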
Dissimilarities are used as inputs to cluster analysis and multidimensional scaling. The choice of metric may have a large impact.
Author(s)
Anja Struyf, Mia Hubert, and Peter Rousseeuw, for the original version.
Martin Maechler improved the NA handling and type specification checking, and extended functionality to metric = "gower" and the optional weights argument.
References
Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874.
Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Struyf, A., Hubert, M. and Rousseeuw, P. J. (1997) Integrating Robust Clustering Techniques in S-PLUS, Computational Statistics and Data Analysis 26, 17–37.
See Also
dissimilarity.object, dist, pam, fanny, clara, agnes, diana.
Examples
library(cluster)

data(agriculture)
## Example 1 in ref:
## Dissimilarities using Euclidean metric and without standardization
d.agr <- daisy(agriculture, metric = "euclidean", stand = FALSE)
d.agr
as.matrix(d.agr)[, "DK"]  # via as.matrix.dist(.)
## compare with Gower's coefficient
as.matrix(daisy(agriculture, metric = "gower"))

data(flower)
## Example 2 in ref:
summary(dfl1 <- daisy(flower, type = list(asymm = 3)))
summary(dfl2 <- daisy(flower, type = list(asymm = c(1, 3), ordratio = 7)))
## this failed earlier:
summary(dfl3 <- daisy(flower,
                      type = list(asymm = c("V1", "V3"), symm = 2,
                                  ordratio = 7, logratio = 8)))