R hopach package: help documentation for the hopach() function

loveR · posted 2012-2-25 21:46:23

hopach(hopach)
hopach() belongs to the R package: hopach

function to perform HOPACH hierarchical clustering

Translator: LoveR, the robot of 生物统计家园网 (Biostatistic.net)

----------Description----------

The Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH) clustering algorithm builds a hierarchical tree by recursively partitioning a data set (e.g., gene expression measurements) with the PAM algorithm, while ordering and possibly collapsing clusters at each level. The algorithm uses the Mean/Median Split Silhouette (MSS) criterion to identify the level of the tree with maximally homogeneous clusters. It also runs the tree down to produce a final ordered list of the elements.

----------Usage----------

hopach(data, dmat = NULL, d = "cosangle", clusters = "best", K = 15,
kmax = 9, khigh = 9, coll = "seq", newmed = "medsil", mss = "med",
impr = 0, initord = "co", ord = "own", verbose=FALSE)

----------Arguments----------

Argument: data
data matrix, data frame or exprSet of gene expression measurements. Typically, each column corresponds to an array, and each row corresponds to a gene. To cluster arrays instead, put the arrays in the rows and the genes in the columns. All values must be numeric. Missing values are ignored.

Argument: dmat
matrix or hdist object of pairwise distances between all genes (arrays). All values must be numeric, and missing values are not allowed. If NULL, this matrix is computed using the metric specified by d. If a matrix is provided, the user is responsible for ensuring that the metric used agrees with d.

Argument: d
character string specifying the metric to be used for calculating dissimilarities between variables. The currently available options are "cosangle" (cosine angle or uncentered correlation distance), "abscosangle" (absolute cosine angle or absolute uncentered correlation distance), "euclid" (Euclidean distance), "abseuclid" (absolute Euclidean distance), "cor" (correlation distance), and "abscor" (absolute correlation distance). Advanced users can write their own distance functions and add these to the functions distancematrix() and distancevector().
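As a rough illustration of what the cosine-angle and correlation options measure, the base-R sketch below computes them for a pair of rows. It assumes the usual "one minus similarity" definitions; the helper names are hypothetical and are not part of hopach, and distancematrix() remains the authoritative implementation.

```r
# Hypothetical helpers (not part of hopach): dissimilarities between two
# numeric vectors, assuming the usual "1 - similarity" definitions.
cosangle_dist <- function(x, y) {
  1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}
abscosangle_dist <- function(x, y) {
  1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))
}
cor_dist <- function(x, y) 1 - cor(x, y)

x <- c(1, 2, 3)
y <- c(2, 4, 6)     # same direction as x
z <- c(-1, -2, -3)  # opposite direction to x

cosangle_dist(x, y)     # parallel profiles: dissimilarity 0
cosangle_dist(x, z)     # opposite profiles: dissimilarity 2
abscosangle_dist(x, z)  # the sign is ignored: dissimilarity 0
cor_dist(x, y)          # perfectly correlated: dissimilarity 0
```

The absolute variants treat strongly anti-correlated profiles as close, which may or may not be what is wanted for a given data set.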

Argument: clusters
character string specifying if clusters are to be identified as the level of the tree with the minimum mean/median split silhouette (MSS) ("best"), the first level of the tree below which MSS increases ("greedy"), or not at all ("none").

Argument: K
positive integer specifying the maximum number of levels in the tree. Must be 15 or less, due to computational limitations (overflow).
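The limit on K is easier to see with a small check. Labels use one decimal digit per level, so a deep tree needs a long number. Assuming the labels are held as ordinary R doubles (an assumption, not something stated here), integers are exact only up to 2^53, which is a 16-digit number:

```r
# R doubles represent integers exactly only up to 2^53.
limit <- 2^53
sprintf("%.0f", limit)  # "9007199254740992", a 16-digit number

# Beyond that point neighbouring integers become indistinguishable,
# so labels with too many digits could no longer be told apart.
(limit + 1) == limit  # TRUE

# A 15-digit label (one digit per level, K = 15) is still exact.
label15 <- 123456789123456
(label15 + 1) == label15  # FALSE
```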

Argument: kmax
integer between 1 and 9 specifying the maximum number of children at each node in the tree.

Argument: khigh
integer between 1 and 9 specifying the maximum number of children at each node in the tree when computing MSS. Can be different from kmax, though typically these are the same value.

Argument: coll
character string specifying how collapsing steps are performed at each level. The options are "seq" (begin with the closest pair of clusters and collapse pairs sequentially as long as MSS decreases) and "all" (consider all pairs of clusters and collapse any that decrease MSS).

Argument: newmed
character string specifying how to choose a medoid for the new cluster after collapsing a pair of clusters. The options are "medsil" (maximizer of medoid based silhouette, i.e. (a-b)/max(a,b), where a is distance to medoid and b is distance to next closest medoid), "nn" (nearest neighbor of mean of two collapsed cluster medoids weighted by cluster size), "uwnn" (unweighted version of nearest neighbor, i.e. each cluster - rather than each element - gets equal weight), and "center" (minimizer of average distance to the medoid).

Argument: mss
character vector specifying which criterion function to use. The options are "med" (median split silhouette) or "mean" (mean split silhouette). See Details for the definition of split silhouettes. The MSS criterion is used to determine the number of children at each node, to decide what collapsing should be performed at each level, and to determine the main clusters.

Argument: impr
number between 0 and 1 specifying the margin of improvement in MSS needed to accept a collapse step. If (MSS.before - MSS.after)/MSS.before is less than impr, then the collapse is not performed.
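The acceptance rule for a collapse step can be written out directly. The helper below is a hypothetical sketch of the rule as stated for impr; it is not a function from hopach, and the MSS values are made up.

```r
# Hypothetical helper (not part of hopach): decide whether a collapse step
# is accepted, following the rule stated for the impr argument.
accept_collapse <- function(mss_before, mss_after, impr = 0) {
  improvement <- (mss_before - mss_after) / mss_before
  improvement >= impr
}

accept_collapse(0.40, 0.38, impr = 0)     # TRUE: MSS decreased
accept_collapse(0.40, 0.38, impr = 0.10)  # FALSE: a 5% gain is below the 10% margin
accept_collapse(0.40, 0.42, impr = 0)     # FALSE: MSS increased
```

Raising impr makes collapsing more conservative, so more clusters survive at each level.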

Argument: initord
character string specifying how to order the clusters in the initial level of the tree. The options are "co" (maximize correlation ordering, i.e. the empirical correlation between distance apart in the ordering and distance between the cluster medoids) or "clust" (apply hopach with binary splits to the cluster medoids and use the final level of that tree as the ordering). In subsequent levels, the clusters are ordered relative to the previous level, so this initial ordering determines the overall structure of the tree.
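To make the "co" criterion concrete, the sketch below scores a candidate ordering by the correlation between how far apart two medoids sit in the ordering and how dissimilar they are. The helper name and the toy medoids are hypothetical, and this only evaluates a given ordering; it does not reproduce how hopach searches for the best one.

```r
# Hypothetical helper (not part of hopach): correlation between positional
# gap in an ordering and pairwise dissimilarity.
ordering_correlation <- function(dmat, ord) {
  d <- as.matrix(dmat)[ord, ord]
  pos_gap <- abs(outer(seq_along(ord), seq_along(ord), "-"))
  keep <- upper.tri(d)  # count each pair once
  cor(pos_gap[keep], d[keep])
}

# Four toy medoids on a line at 1, 2, 4 and 8.
medoids <- c(1, 2, 4, 8)
dmat <- dist(medoids)

natural   <- ordering_correlation(dmat, c(1, 2, 3, 4))
scrambled <- ordering_correlation(dmat, c(1, 4, 2, 3))
natural > scrambled  # TRUE: the natural order places similar medoids together
```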

Argument: ord
character string specifying how to order the elements within clusters. This method is used to create an ordering of all elements at the level of the tree corresponding to the main clusters. The options are "own" (order based on distance from cluster medoid with medoid first, i.e. leftmost), "neighbor" (order based on distance to the medoid of the next cluster to the right), or "co" (maximize correlation ordering - can be slow for large clusters!).
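The "own" option amounts to sorting a cluster's members by their distance to the cluster medoid. A minimal base-R sketch, with a hypothetical helper and toy data that are not part of hopach:

```r
# Hypothetical helper (not part of hopach): order the members of one
# cluster by distance from the cluster medoid, medoid first.
order_own <- function(dmat, members, medoid) {
  d <- as.matrix(dmat)
  members[order(d[members, medoid])]
}

x <- c(a = 5.0, b = 5.4, c = 4.9, d = 6.0)
dmat <- dist(x)
own_order <- order_own(dmat, members = 1:4, medoid = 1)
own_order  # 1 3 2 4: the medoid (element 1) comes first
```

The "neighbor" option could be sketched the same way by passing the medoid of the next cluster to the right instead of the cluster's own medoid.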

Argument: verbose
If TRUE then verbose output is printed.

----------Details----------

The HOPACH hierarchical clustering algorithm is a hybrid between an agglomerative (bottom up) and a divisive (top down) algorithm. The HOPACH tree is built from the root node (all elements) down to the leaf nodes, but at each level collapsing steps are used to unite similar clusters. In addition, the clusters in each level are ordered with a deterministic algorithm based on the same distance metric that is used in the clustering. In this way, the ordering produced in the final level of the tree does not depend on the order of the data in the original data set (as can be the case with algorithms that have a random component in their ordering methods). Unlike other hierarchical clustering methods, HOPACH builds a tree of clusters in which the nodes need not be binary, i.e. there can be more than two children at each split. The divisive steps of the HOPACH algorithm are performed using the PAM algorithm described in chapter 2 of Kaufman and Rousseeuw (1990) and the R package 'cluster'.
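A single divisive step can be sketched with pam() from the 'cluster' package. Note the substitution: HOPACH chooses the number of children by minimizing MSS, whereas this sketch uses the average silhouette width as a simpler stand-in, so it illustrates the idea of a non-binary PAM split rather than the actual hopach procedure. The toy data are made up.

```r
library(cluster)  # provides pam()

# Two well-separated groups of one-dimensional points (toy data).
x <- c(seq(-0.2, 0.2, length.out = 10), 5 + seq(-0.2, 0.2, length.out = 15))
d <- dist(x)

# Try k = 2, ..., 9 children and keep the k with the largest average
# silhouette width. NOTE: HOPACH itself minimizes MSS at this point; the
# average silhouette is only a stand-in criterion.
ks <- 2:9
avg_sil <- sapply(ks, function(k) pam(d, k, diss = TRUE)$silinfo$avg.width)
best_k <- ks[which.max(avg_sil)]

node_split <- pam(d, best_k, diss = TRUE)
best_k                        # number of children chosen for this node
table(node_split$clustering)  # sizes of the children
node_split$id.med             # indices of the child medoids
```

In the full algorithm each child would then be split again in the same way, with ordering and collapsing applied at every level.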

The Median (or Mean) Split Silhouette (MSS) criterion is used by HOPACH to (i) determine the optimal number of children at each node, (ii) decide which pairs of clusters to collapse at each level, and (iii) identify the first level of the tree with maximally homogeneous clusters. In each case, the goal is to minimize MSS, which is a measure of cluster heterogeneity described in http://www.bepress.com/ucbbiostat/paper107/.

In hopach versions <2.0.0, the distance functions returned the square root of the usual distance for d="cosangle", d="abscosangle", d="cor", and d="abscor". Typically, this transformation makes the dissimilarity correspond more closely with the norm. In order to agree with the dist function, the square root is no longer used in versions >=2.0.0. See ?distancematrix.
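One way to read the remark about the norm: for vectors rescaled to unit length, the squared Euclidean distance equals 2 * (1 - cosine), so the square root of the cosine-angle dissimilarity is proportional to an ordinary Euclidean distance. The check below assumes cosangle is defined as one minus the cosine of the angle; the vectors are arbitrary.

```r
x <- c(1, 2, 3)
y <- c(3, 1, 2)

# Cosine-angle dissimilarity, assuming the "1 - cosine" definition.
cosangle <- 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Rescale both vectors to unit length and take the Euclidean distance.
ux <- x / sqrt(sum(x^2))
uy <- y / sqrt(sum(y^2))
unit_euclid <- sqrt(sum((ux - uy)^2))

# The square root of the cosine-angle dissimilarity is proportional to it.
all.equal(sqrt(2 * cosangle), unit_euclid)  # TRUE
```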

----------Value----------

A list with the following components:

Component: clustering
the partitioning or 'main clusters', with the following components:
'k' is an integer specifying the number of clusters identified by minimizing MSS.
'medoids' is a vector indicating the rows of data that are the 'k' cluster medoids, i.e. profiles (or centroids) for each cluster.
'sizes' is a vector containing the 'k' cluster sizes.
'labels' is a vector containing the main cluster labels for every variable. Each label consists of one digit per level of the tree (up to the level identified as the main clusters). The digit (1-9) indicates which child cluster the variable was in at that level. For example, '124' means the first (leftmost in the tree) cluster in level 1, the second child of cluster '1' in level 2, and the fourth child of cluster '12' in level 3. These can be mapped to the numbers 1:k for simplicity, though the tree structure and relationship amongst the clusters is then lost, e.g. 1211 is closer to 1212 than to 1221.
'order' is a vector containing the ordering of variables within the main clusters. The clusters are ordered deterministically as the tree is built. The elements within each of the main clusters are ordered with the method determined by the value of ord: "own" (relative to own medoid), "neighbor" (relative to next medoid to the right), or "co" (maximize correlation ordering).
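The label scheme can be explored with base R alone. The sketch below uses made-up labels to show the mapping to 1:k and why that mapping discards the tree structure; shared_levels is a hypothetical helper, not a hopach function.

```r
# Toy main-cluster labels, one digit per tree level (made-up values).
labels <- c(1211, 1212, 1221, 1211, 2111, 1212)

# Map the labels to 1:k; the tree structure is lost in the process.
simple <- match(labels, sort(unique(labels)))
simple  # 1 2 3 1 4 2

# Hypothetical helper: number of leading digits two labels share, i.e.
# the deepest tree level at which they are still in the same cluster.
shared_levels <- function(a, b) {
  da <- strsplit(as.character(a), "")[[1]]
  db <- strsplit(as.character(b), "")[[1]]
  n <- min(length(da), length(db))
  same <- da[seq_len(n)] == db[seq_len(n)]
  if (all(same)) n else which(!same)[1] - 1
}

shared_levels(1211, 1212)  # 3: siblings at the last level
shared_levels(1211, 1221)  # 2: they separate one level earlier
```

After mapping to 1:k, clusters 1, 2 and 3 look equally unrelated, although the original labels show that the first two share a parent.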

Component: final
the final level of the hierarchical tree, with the following components:
'labels' is a vector containing the final labels for every variable. Each label consists of one digit per level of the tree (up to the final level), and the format for the labels is the same as for the clustering labels. The final labels contain the entire history of the tree. In fact, internal level 'n' can be reproduced by truncating the final labels to 'n' digits. Ordering the final labels produces the final ordering (final level of the tree), while ordering internal level labels produces an ordering of the clusters at that level.
'order' is a vector containing the ordering of variables at the final level of the tree. Essentially, this is the numeric ordering of the final labels. Due to the limit on the largest possible integer (overflow), the final labels can have at most 16 digits, i.e. the tree can have at most 16 levels. For large data sets, this may not be enough partitioning steps to result in final nodes (leaves) with only one variable each. Furthermore, PAM cannot partition a node of size 3 or less, so that leaves may contain 2 or 3 variables regardless of the number of levels in the tree. Hence, the final ordering of variables is completed by ordering the variables in any leaf of size 2 or larger with the method determined by the value of ord: "own" (relative to own medoid), "neighbor" (relative to next medoid to the right), or "co" (maximize correlation ordering).
'medoids' is a matrix containing the labels and corresponding medoids for each internal node and leaf of the tree. The number of digits in the label indicates the level for that node. The medoid refers to a row of data.
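Truncating and ordering the final labels needs nothing beyond integer arithmetic. The sketch below uses made-up four-level labels; truncate_labels is a hypothetical helper that assumes all labels have the same number of digits.

```r
# Toy final labels with four levels (made-up values).
final_labels <- c(1211, 2113, 1122, 1212, 2111, 1121)

# Hypothetical helper: keep the first n digits of each label, which
# reproduces the labels of internal level n.
truncate_labels <- function(labels, n) {
  digits <- nchar(as.character(labels[1]))  # assumes equal-length labels
  labels %/% 10^(digits - n)
}

truncate_labels(final_labels, 2)  # 12 21 11 12 21 11
truncate_labels(final_labels, 1)  # 1 2 1 1 2 1

# Sorting the final labels numerically gives the final ordering.
order(final_labels)  # 6 3 1 4 5 2
```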

Component: call
the matched 'call' generating the HOPACH output

Component: metric
the distance metric

----------Note----------

Thank you to Karen Vranizan <vranizan@uclink.berkeley.edu> for her input.

----------Author(s)----------

Katherine S. Pollard <kpollard@gladstone.ucsf.edu> and Mark J. van der Laan <laan@stat.berkeley.edu>, with Greg Wall

----------See Also----------

distancematrix, labelstomss, boothopach, pam, makeoutput

----------Examples----------

library(hopach)

# 25 variables from two groups, with 3 observations per variable
mydata <- rbind(
  cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
  cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5), rnorm(15, 5, 0.5))
)
dimnames(mydata) <- list(paste("Var", 1:25, sep = ""), paste("Exp", 1:3, sep = ""))
mydist <- distancematrix(mydata, d = "cosangle")  # compute the distance matrix

# clusters and final tree
clustresult <- hopach(mydata, dmat = mydist)
clustresult$clustering$k                             # number of clusters
dimnames(mydata)[[1]][clustresult$clustering$medoids]  # medoids of clusters
table(clustresult$clustering$labels)                 # equal to clustresult$clustering$sizes

# faster, sometimes fewer clusters
greedyresult <- hopach(mydata, clusters = "greedy", dmat = mydist)

# only get the final ordering (no partitioning into clusters)
orderonly <- hopach(mydata, clusters = "none", dmat = mydist)

# cluster the columns (rather than rows)
colresult <- hopach(t(mydata), dmat = distancematrix(t(mydata), d = "euclid"))

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).
