daisy(cluster)
daisy() belongs to R package: cluster
Dissimilarity Matrix Calculation
Translator: 生物统计家园网, robot LoveR
----------Description----------
Compute all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. In that case, or whenever metric = "gower" is set, a generalization of Gower's formula is used, see "Details" below.
----------Usage----------
daisy(x, metric = c("euclidean", "manhattan", "gower"),
stand = FALSE, type = list(), weights = rep.int(1, p))
----------Arguments----------
Argument: x
numeric matrix or data frame, of dimension n x p, say. Dissimilarities will be computed between the rows of x. Columns of mode numeric (i.e. all columns when x is a matrix) will be recognized as interval scaled variables, columns of class factor will be recognized as nominal variables, and columns of class ordered will be recognized as ordinal variables. Other variable types should be specified with the type argument. Missing values (NAs) are allowed.
Argument: metric
character string specifying the metric to be used. The currently available options are "euclidean" (the default), "manhattan" and "gower". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. “Gower's distance” is chosen by metric "gower" or automatically if some columns of x are not numeric. Also known as Gower's coefficient (1971), expressed as a dissimilarity, this implies that a particular standardisation will be applied to each variable, and the “distance” between two units is the sum of all the variable-specific distances, see the details section.
Argument: stand
logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If not all columns of x are numeric, stand will be ignored and Gower's standardization (based on the range) will be applied in any case, see argument metric, above, and the details section.
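The standardization described for stand = TRUE can be sketched in base R as follows. This is only an illustration on a made-up column of values, not the code that daisy itself runs; note that the divisor is the mean absolute deviation, not the standard deviation.

```r
# Toy numeric column (values chosen only for illustration)
x <- c(2, 4, 4, 6, 9)

m <- mean(x)            # the variable's mean value
s <- mean(abs(x - m))   # the variable's mean absolute deviation
z <- (x - m) / s        # standardized measurements

z
```

After this rescaling the column has mean 0 and mean absolute deviation 1, so variables measured in different units contribute on a comparable scale.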
Argument: type
list for specifying some (or all) of the types of the variables (columns) in x. The list may contain the following components: "ordratio" (ratio scaled variables to be treated as ordinal variables), "logratio" (ratio scaled variables that must be logarithmically transformed), "asymm" (asymmetric binary) and "symm" (symmetric binary variables). Each component's value is a vector, containing the names or the numbers of the corresponding columns of x. Variables not mentioned in the type list are interpreted as usual (see argument x).
Argument: weights
an optional numeric vector of length p(=ncol(x)); to be used in “case 2” (mixed variables, or metric = "gower"), specifying a weight for each variable (x[,k]) instead of 1 in Gower's original formula.
----------Details----------
The original version of daisy is fully described in chapter 1 of Kaufman and Rousseeuw (1990). Compared to dist whose input must be numeric variables, the main feature of daisy is its ability to handle other variable types as well (e.g. nominal, ordinal, (a)symmetric binary) even when different types occur in the same data set.
The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (1971). If x contains any columns of these data-types, both arguments metric and stand will be ignored and Gower's coefficient will be used as the metric. This can also be activated for purely numeric data by metric = "gower". With that, each variable (column) is first standardized by dividing each entry by the range of the corresponding variable, after subtracting the minimum value; consequently the rescaled variable has range [0,1], exactly.
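For purely numeric data, the range standardization described above can be sketched in base R. The matrix x below is made up for illustration, and the final comparison with daisy is guarded so that the sketch also runs where the cluster package is not installed; it is expected to agree, but treat that as an assumption to check rather than a guarantee.

```r
# Toy numeric data: 3 observations, 2 variables (illustrative values)
x <- cbind(a = c(1, 3, 5), b = c(10, 20, 40))

mins <- apply(x, 2, min)
rng  <- apply(x, 2, max) - mins          # range of each variable

# Subtract the minimum, then divide by the range: each column ends up in [0, 1]
scaled <- sweep(sweep(x, 2, mins, "-"), 2, rng, "/")

# With unit weights and no NAs, the dissimilarity is the mean over variables
# of the absolute differences of the rescaled values
d_gower <- dist(scaled, method = "manhattan") / ncol(x)
d_gower

# Optional cross-check against daisy
if (requireNamespace("cluster", quietly = TRUE)) {
  print(all.equal(as.numeric(cluster::daisy(x, metric = "gower")),
                  as.numeric(d_gower)))
}
```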
Note that setting the type to symm (symmetric binary) gives the same dissimilarities as using nominal (which is chosen for non-ordered factors) only when no missing values are present, and it does so more efficiently.
Note that daisy now gives a warning when 2-valued numerical variables do not have an explicit type specified, because the reference authors recommend considering "asymm" for such variables.
In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row. There are two main cases:
If all variables are interval scaled (and metric is not "gower"), the metric is "euclidean", and n_g is the number of columns in which neither row i nor row j has an NA, then the dissimilarity d(i,j) returned is sqrt(p/n_g) (p=ncol(x)) times the Euclidean distance between the two vectors of length n_g shortened to exclude NAs. The rule is similar for the "manhattan" metric, except that the coefficient is p/n_g. If n_g = 0, the dissimilarity is NA.
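The NA rule for the interval-scaled case can be worked through by hand. The two rows below are made up for illustration; only the columns where both rows are observed enter the distance, which is then scaled up by sqrt(p/n_g) (Euclidean) or p/n_g (Manhattan).

```r
# Two toy rows with missing values (illustrative values)
xi <- c(1, NA, 3, 4)
xj <- c(2, 5, NA, 8)

p   <- length(xi)                    # number of columns, here 4
ok  <- !is.na(xi) & !is.na(xj)       # columns observed in both rows
n_g <- sum(ok)                       # here 2 (columns 1 and 4)

# Euclidean: sqrt(p / n_g) times the distance over the shared columns
d_euc <- sqrt(p / n_g) * sqrt(sum((xi[ok] - xj[ok])^2))

# Manhattan: p / n_g times the sum of absolute differences
d_man <- (p / n_g) * sum(abs(xi[ok] - xj[ok]))

c(euclidean = d_euc, manhattan = d_man)
```

Here the shared columns give squared differences 1 and 16, so the Euclidean result is sqrt(2) * sqrt(17) = sqrt(34), and the Manhattan result is 2 * 5 = 10.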
When some variables have a type other than interval scaled, or if metric = "gower" is specified, the dissimilarity between two rows is the weighted mean of the contributions of each variable. Specifically,
d_ij = d(i,j) = sum(k=1:p; w_k delta(ij;k) d(ij,k)) / sum(k=1:p; w_k delta(ij;k)).
In other words, d_ij is a weighted mean of d(ij,k) with weights w_k delta(ij;k), where w_k = weights[k], delta(ij;k) is 0 or 1, and d(ij,k), the contribution of the k-th variable to the total distance, is a distance between x[i,k] and x[j,k], see below.
The 0-1 weight delta(ij;k) becomes zero when the variable x[,k] is missing in either or both rows (i and j), or when the variable is asymmetric binary and both values are zero. In all other situations it is 1.
The contribution d(ij,k) of a nominal or binary variable to the total dissimilarity is 0 if both values are equal, 1 otherwise. The contribution of other variables is the absolute difference of both values, divided by the total range of that variable. Note that “standard scoring” is applied to ordinal variables, i.e., they are replaced by their integer codes 1:K. Note that this is not the same as using their ranks (since there typically are ties).
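The per-variable contributions described above can be sketched in base R for one ordinal and one nominal variable. The factors size and colour are hypothetical toy data, and the code only mirrors the description (integer codes divided by the range, and a 0/1 mismatch indicator); it is not daisy's own implementation.

```r
# Hypothetical ordinal variable with K = 3 ordered levels
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"), ordered = TRUE)

codes <- as.integer(size)    # "standard scoring": integer codes 1:K
# Absolute difference of the codes, divided by the range of the variable
d_ord <- abs(outer(codes, codes, "-")) / diff(range(codes))

# Hypothetical nominal variable: contribution 0 if equal, 1 otherwise
colour <- as.character(factor(c("red", "blue", "red", "green")))
d_nom  <- outer(colour, colour, "!=") * 1

d_ord
d_nom

# Ranks are not the same as the codes when there are ties
rank(codes)
```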
As the individual contributions d(ij,k) are in [0,1], the dissimilarity d_ij will remain in this range. If all weights w_k delta(ij;k) are zero, the dissimilarity is set to NA.
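The weighted mean for a single pair of rows can be sketched as a small helper. The function name gower_pair and its arguments are hypothetical, introduced only for this illustration; it takes the per-variable contributions d(ij,k), the 0-1 indicators delta(ij;k), and the weights w_k, and returns NA when every weight w_k delta(ij;k) is zero.

```r
# Hypothetical helper: weighted mean of per-variable contributions for one pair
gower_pair <- function(d_k, delta_k, w_k = rep(1, length(d_k))) {
  d_k[delta_k == 0] <- 0               # excluded variables contribute nothing
  denom <- sum(w_k * delta_k)
  if (denom == 0) return(NA_real_)     # all weights zero: dissimilarity is NA
  sum(w_k * delta_k * d_k) / denom
}

# Four variables for one pair (i, j), illustrative values:
#   1: numeric, rescaled absolute difference 0.5
#   2: nominal, values differ                 -> contribution 1
#   3: asymmetric binary, both values zero    -> delta = 0
#   4: missing in one of the rows             -> delta = 0
d_k     <- c(0.5, 1, 0, NA)
delta_k <- c(1, 1, 0, 0)

gower_pair(d_k, delta_k)                          # (0.5 + 1) / 2
gower_pair(d_k, delta_k, w_k = c(3, 1, 1, 1))     # (3 * 0.5 + 1) / (3 + 1)
gower_pair(d_k, delta_k = c(0, 0, 0, 0))          # NA
```

Because each contribution lies in [0,1] and the weights are non-negative, the result stays in [0,1] whenever it is defined.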
----------Value----------
an object of class "dissimilarity" containing the dissimilarities among the rows of x. This is typically the input for the functions pam, fanny, agnes or diana. For more details, see dissimilarity.object.
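A minimal sketch of passing the returned object on to a clustering function, assuming the cluster package is installed. The choice of k = 2 is arbitrary and only for illustration.

```r
library(cluster)

data(agriculture)
d <- daisy(agriculture)     # default metric on two numeric columns

class(d)                    # includes "dissimilarity"
attr(d, "Size")             # number of observations

# pam() accepts the dissimilarity object directly
fit <- pam(d, k = 2)        # k = 2 is an arbitrary illustrative choice
fit$clustering
```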
----------Background----------
Dissimilarities are used as inputs to cluster analysis and multidimensional scaling. The choice of metric may have a large impact.
----------Author(s)----------
Anja Struyf, Mia Hubert, and Peter Rousseeuw, for the original version.
Martin Maechler improved the NA handling and type specification checking, and extended functionality to metric = "gower" and the optional weights argument.
----------References----------
Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874.
Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997) Integrating Robust Clustering Techniques in S-PLUS, Computational Statistics and Data Analysis 26, 17–37.
----------See Also----------
dissimilarity.object, dist, pam, fanny, clara, agnes, diana.
----------Examples----------
library(cluster)
data(agriculture)
## Example 1 in ref:
## Dissimilarities using Euclidean metric and without standardization
d.agr <- daisy(agriculture, metric = "euclidean", stand = FALSE)
d.agr
as.matrix(d.agr)[,"DK"] # via as.matrix.dist(.)
## compare with
as.matrix(daisy(agriculture, metric = "gower"))
data(flower)
## Example 2 in ref
summary(dfl1 <- daisy(flower, type = list(asymm = 3)))
summary(dfl2 <- daisy(flower, type = list(asymm = c(1, 3), ordratio = 7)))
## this failed earlier:
summary(dfl3 <- daisy(flower,
                      type = list(asymm = c("V1", "V3"), symm= 2,
                                  ordratio= 7, logratio= 8)))
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).