seqrep(TraMineR)
seqrep() belongs to the R package TraMineR
Extracting sets of representative sequences
Translator: 生物统计家园网 robot LoveR
----------Description----------
The function attempts to find an optimal set of representative sequences (as small as possible for a required coverage) that exhibits the key features of the whole sequence data set, the goal being to obtain a sound interpretation of the set of sequences easily.
----------Usage----------
seqrep(seqdata, criterion="density", score=NULL,
decreasing=TRUE, trep=0.25, nrep=NULL,
tsim=0.1, dmax=NULL, dist.matrix=NULL, weighted=TRUE, ...)
----------Arguments----------
Argument: seqdata
a state sequence object as defined by the seqdef function.
Argument: criterion
the representativeness criterion for sorting the candidate list. One of "freq" (sequence frequency), "density" (neighborhood density), "mscore" (mean state frequency), "dist" (centrality) and "prob" (sequence likelihood). See details.
Argument: score
an optional vector containing the representativeness scores used to sort the sequences in the candidate list. The length of the vector must be equal to the number of sequences in the sequence object.
Argument: decreasing
if a score vector is provided, indicates whether the objects in the candidate list must be sorted in ascending or descending order of this score. Default is TRUE, i.e. descending. The first object in the candidate list is then supposed to be the most representative.
Argument: trep
coverage threshold, i.e. minimum proportion of sequences that should have a representative in their neighborhood (neighborhood diameter is defined by tsim).
Argument: nrep
number of representative sequences. If NULL (default), the size of the representative set is controlled by trep.
Argument: tsim
threshold for setting the redundancy and neighborhood radius. Defined as a percentage of the maximum (theoretical) distance. Defaults to 0.1 (10%). Sequence y is considered as redundant to/in the neighborhood of sequence x if the distance from y to x is less than tsim*dmax. The neighborhood diameter is thus twice this threshold.
Argument: dmax
maximum theoretical distance. The neighborhood diameter is defined as a proportion of this maximum theoretical distance. If NULL, it is derived from the distance matrix.
Argument: dist.matrix
a matrix containing the pairwise distances between sequences in seqdata. If NULL, the matrix is computed by calling the seqdist function. In that case, optional arguments to be passed to the seqdist function (see ... hereafter) should also be provided.
Argument: weighted
logical: Should weights assigned to the state sequence object be accounted for? (See seqdef.) Set as FALSE to ignore the weights.
Argument: ...
optional arguments to be passed to the seqdist function, mainly dist.method specifying the metric for computing the distance matrix, norm for normalizing the distances, indel and sm for indel and substitution costs when Optimal Matching metric is chosen. See seqdist manual page for details.
----------Details----------
The representative set is obtained by a heuristic that first builds a sorted list of candidates using a representativeness score and then eliminates redundancy. The available criteria for sorting the candidate list are: sequence frequency, neighborhood density, mean state frequency, centrality and sequence likelihood.
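The two steps of the heuristic (sort the candidates, then drop redundant ones until the coverage threshold is reached) can be sketched in base R. This is only an illustration under stated assumptions, not the TraMineR implementation: `select_representatives` is a hypothetical helper, the distance matrix and scores are made up, and weights are ignored.

```r
# Hypothetical helper (not part of TraMineR): greedy selection of representatives.
# d      : symmetric matrix of pairwise distances
# score  : representativeness score of each sequence
# radius : redundancy/neighborhood radius, i.e. tsim * dmax
# trep   : coverage threshold
select_representatives <- function(d, score, radius, trep = 0.25,
                                   decreasing = TRUE) {
  candidates <- order(score, decreasing = decreasing)  # sorted candidate list
  reps <- integer(0)
  for (cand in candidates) {
    # Skip a candidate that is redundant with an already selected representative.
    if (length(reps) > 0 && any(d[cand, reps] < radius)) next
    reps <- c(reps, cand)
    # Coverage: share of sequences with a representative in their neighborhood.
    covered <- rowSums(d[, reps, drop = FALSE] < radius) > 0
    if (mean(covered) >= trep) break
  }
  reps
}

# Made-up distance matrix for 5 sequences and made-up, distinct scores.
d <- matrix(c(0, 0, 1, 2, 1,
              0, 0, 1, 2, 1,
              1, 1, 0, 3, 2,
              2, 2, 3, 0, 1,
              1, 1, 2, 1, 0), nrow = 5, byrow = TRUE)
score <- c(5, 4, 3, 1, 2)

reps <- select_representatives(d, score, radius = 1.2, trep = 0.9)
print(reps)
```

With the default trep of 0.25 the loop in this sketch would already stop after the first candidate, since that candidate alone covers four of the five toy sequences.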
The sequence frequency criterion uses the sequence frequencies as representativeness scores. The more frequent a sequence, the more representative it is supposed to be. Hence, sequences are sorted in decreasing order of frequency.
The neighborhood density criterion uses the number (density) of sequences in the neighborhood of each candidate sequence. This requires setting the neighborhood size through tsim. We suggest setting it as a given proportion of the maximal theoretical distance between two sequences. Sequences are sorted in decreasing order of density.
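As a rough illustration of the density score, the base R sketch below counts, for each candidate, the sequences lying closer than tsim*dmax. The toy sequences are made up, the Hamming distance is only a stand-in for the metrics that seqdist provides, and counting the candidate itself in its own neighborhood is an assumption of this sketch.

```r
# Toy data: 5 made-up sequences of length 4 over the states "A" and "B".
seqs <- rbind(
  c("A", "A", "B", "B"),
  c("A", "A", "B", "B"),
  c("A", "A", "A", "B"),
  c("B", "B", "B", "B"),
  c("A", "B", "B", "B")
)

# Hamming distance (number of mismatching positions) between all pairs.
n <- nrow(seqs)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(seqs[i, ] != seqs[j, ])
  }
}

tsim <- 0.3             # larger than the 0.1 default so the toy example is not trivial
dmax <- ncol(seqs)      # maximum Hamming distance between sequences of length 4
radius <- tsim * dmax   # neighborhood radius

# Density: number of sequences closer than the radius (the candidate included).
density <- rowSums(d < radius)
print(density)
```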
The mean state frequency criterion is the mean value of the transversal frequencies of the successive states. Let s=(s_1, s_2, ..., s_l) be a sequence of length l and f(s_1), f(s_2), ..., f(s_l) the frequencies of the states at (time-)positions t_1, t_2, ..., t_l. The mean state frequency is the sum of the state frequencies divided by the sequence length:
MSF(s) = \frac{1}{l} \sum_{i=1}^{l} f(s_i)
The lower and upper boundaries of MSF are 0 and 1. MSF is equal to 1 when all the sequences in the set are the same, i.e. when there is a single distinct sequence. The most representative sequence is the one with the highest score.
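The mean state frequency can be worked through on a small example. The following base R sketch uses made-up sequences and ignores weights; it is meant only to show the arithmetic, not how TraMineR computes the score internally.

```r
# Toy data: 4 made-up sequences of length 4 over the states "A" and "B".
seqs <- rbind(
  c("A", "A", "B", "B"),
  c("A", "A", "B", "B"),
  c("A", "A", "A", "B"),
  c("B", "B", "B", "B")
)
n <- nrow(seqs)
l <- ncol(seqs)

# f[i, t]: proportion of sequences sharing the state of sequence i at position t.
f <- sapply(seq_len(l), function(t) {
  freq <- table(seqs[, t]) / n
  as.numeric(freq[seqs[, t]])
})

# Mean state frequency: average of the state frequencies along each sequence.
msf <- rowMeans(f)
print(msf)
```

In this toy set the two identical sequences obtain the highest score, and every score lies between 0 and 1.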
The centrality criterion uses the sum of distances to all other sequences as a representativeness criterion. The smaller the sum, the more representative the sequence.
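For the centrality criterion, a minimal base R sketch with a made-up distance matrix is given below. Since a smaller sum means a more representative sequence, the candidates are sorted in ascending order here; weights are ignored.

```r
# Made-up symmetric distance matrix for 5 sequences (zero diagonal).
d <- matrix(c(0, 0, 1, 2, 1,
              0, 0, 1, 2, 1,
              1, 1, 0, 3, 2,
              2, 2, 3, 0, 1,
              1, 1, 2, 1, 0), nrow = 5, byrow = TRUE)

# Sum of distances to all other sequences; the zero diagonal adds nothing.
centrality <- rowSums(d)

# Candidate list sorted from the most central (smallest sum) to the least.
candidates <- order(centrality)
print(centrality)
```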
The sequence likelihood P(s) is defined as the product of the probabilities with which each of its observed successive states is supposed to occur at its position. Let s=(s_1, s_2, ..., s_l) be a sequence of length l. Then
P(s) = P(s_1,1) \cdot P(s_2,2) \cdots P(s_l,l)
with P(s_t,t) the probability of observing state s_t at position t.
The question is how to determine the state probabilities P(s_t,t). One commonly used method for computing them is to postulate a Markov model, which can be of various orders. The implemented criterion considers the probabilities derived from the first-order Markov model, that is, each P(s_t,t), t>1, is set to the transition rate p(s_t|s_{t-1}) estimated across sequences from the observations at positions t and t-1. For t=1, we set P(s_1,1) to the observed frequency of the state s_1 at position 1.
The likelihood P(s) being generally very small, we use -\log P(s) as the sorting criterion. The latter quantity is minimal when P(s) is equal to 1, so sequences are sorted in ascending order of their score.
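The likelihood score can be illustrated with a short base R sketch. It assumes that the transition rates are pooled over all positions, which is one possible reading of the description and not necessarily what TraMineR does; the sequences are made up and weights are ignored.

```r
# Toy data: 4 made-up sequences of length 4 over the states "A" and "B".
seqs <- rbind(
  c("A", "A", "B", "B"),
  c("A", "A", "B", "B"),
  c("A", "A", "A", "B"),
  c("B", "B", "B", "B")
)
states <- c("A", "B")
l <- ncol(seqs)

# P(s_1, 1): observed frequency of each state at position 1.
p1 <- sapply(states, function(a) mean(seqs[, 1] == a))

# Transition rates p(s_t | s_{t-1}), pooled over all positions (assumption).
from <- factor(as.vector(seqs[, -l]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)
trate <- unclass(prop.table(table(from, to), margin = 1))

# -log P(s) under the first-order Markov model.
neg_log_lik <- apply(seqs, 1, function(s) {
  p <- p1[[s[1]]] * prod(trate[cbind(s[-l], s[-1])])
  -log(p)
})
print(neg_log_lik)
```

Sorting neg_log_lik in ascending order puts the most likely sequence first, which matches the ascending sort described above.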
For more details, see Gabadinho et al., 2009.
----------Value----------
An object of class stslist.rep. This is actually a state sequence object (containing a list of state sequences) with the following additional attributes:
Attribute: Scores
a vector with the representativeness score of each sequence in the original set given the chosen criterion.
Attribute: Distances
a matrix with the distance of each sequence to its nearest representative.
Attribute: Statistics
contains several quality measures for each representative sequence in the set: number of sequences attributed to the representative, number of sequences in the representative's neighborhood, mean distance to the representative.
Attribute: Quality
overall quality measure.
Print, plot and summary methods are available. More elaborate plots are produced by the seqplot function using the type="r" argument.
----------References----------
----------See Also----------
seqplot, plot.stslist.rep
----------Examples----------
library(TraMineR)
## Defining a sequence object with the data in columns 10 to 25
## (family status from age 15 to 30) in the biofam data set
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, dist.matrix=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
plot(biofam.rep)