lowlevel-matching(Biostrings)
lowlevel-matching() is part of the R package Biostrings
Low-level matching functions
Translated by: LoveR bot, 生物统计家园网
----------Description----------
This man page defines precisely, and illustrates, what a "match" of a pattern P in a subject S is in the context of the Biostrings package. This definition is central to most pattern matching functions available in the package: unless specified otherwise, they adhere to the definition provided here.
hasLetterAt checks whether a sequence or set of sequences has the specified letters at the specified positions.
neditAt, isMatchingAt and which.isMatchingAt are low-level matching functions that only look for matches at the specified positions in the subject.
----------Usage----------
hasLetterAt(x, letter, at, fixed=TRUE)

## neditAt() and related utils:
neditAt(pattern, subject, at=1,
    with.indels=FALSE, fixed=TRUE)
neditStartingAt(pattern, subject, starting.at=1,
    with.indels=FALSE, fixed=TRUE)
neditEndingAt(pattern, subject, ending.at=1,
    with.indels=FALSE, fixed=TRUE)

## isMatchingAt() and related utils:
isMatchingAt(pattern, subject, at=1,
    max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingStartingAt(pattern, subject, starting.at=1,
    max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingEndingAt(pattern, subject, ending.at=1,
    max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)

## which.isMatchingAt() and related utils:
which.isMatchingAt(pattern, subject, at=1,
    max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
    follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingStartingAt(pattern, subject, starting.at=1,
    max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
    follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingEndingAt(pattern, subject, ending.at=1,
    max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
    follow.index=FALSE, auto.reduce.pattern=FALSE)
----------Arguments----------
Argument: x
A character vector, or an XString or XStringSet object.
Argument: letter
A character string or an XString object containing the letters to check.
Argument: at, starting.at, ending.at
An integer vector specifying the starting (for starting.at and at) or ending (for ending.at) positions of the pattern relative to the subject. With auto.reduce.pattern (below), either a single integer or a constant vector of length nchar(pattern) (below), to which the former is immediately converted. For the hasLetterAt function, letter and at must have the same length.
Argument: pattern
The pattern string (but see auto.reduce.pattern, below).
Argument: subject
A character vector, or an XString or XStringSet object containing the subject sequence(s).
Argument: max.mismatch, min.mismatch
Integer vectors of length >= 1, recycled to the length of the at (or starting.at, or ending.at) argument. More details below.
Argument: with.indels
See details below.
Argument: fixed
A fixed value other than the default (TRUE) can be used only with a DNAString or RNAString-based subject. If TRUE (the default), an IUPAC ambiguity code in the pattern can only match the same code in the subject, and vice versa. If FALSE, an IUPAC ambiguity code in the pattern can match any letter in the subject that is associated with the code, and vice versa. See IUPAC_CODE_MAP for more information about the IUPAC Extended Genetic Alphabet. fixed can also be a character vector, a subset of c("pattern", "subject"). fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default). An empty vector is equivalent to fixed=FALSE. With fixed="subject", only ambiguities in the pattern are interpreted as wildcards. With fixed="pattern", only ambiguities in the subject are interpreted as wildcards.
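The wildcard behaviour described for fixed can be illustrated with a small base-R sketch. It does not use Biostrings: iupac_subset and count_mismatches are hypothetical names introduced here, the map covers only a few IUPAC codes, and only pattern-side ambiguities are treated as wildcards (the subject is assumed to contain plain A, C, G, T).

```r
# Hypothetical helper data, not part of Biostrings: a tiny subset of the
# IUPAC code map (R = A or G, Y = C or T, N = any base).
iupac_subset <- c(A = "A", C = "C", G = "G", T = "T",
                  R = "AG", Y = "CT", N = "ACGT")

# Count mismatching letters between `pattern` and the same-length window of
# `subject` starting at `at`. With wildcards = TRUE, an ambiguity code in the
# pattern matches any subject letter it stands for.
count_mismatches <- function(pattern, subject, at, wildcards = FALSE) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(substr(subject, at, at + length(p) - 1), "")[[1]]
  stopifnot(length(s) == length(p))  # sketch: the window must fit in the subject
  if (wildcards) {
    ok <- mapply(function(pl, sl) grepl(sl, iupac_subset[[pl]], fixed = TRUE),
                 p, s)
  } else {
    ok <- p == s
  }
  sum(!ok)
}

strict  <- sapply(1:4, function(i) count_mismatches("NT", "GTATA", i))
relaxed <- sapply(1:4, function(i) count_mismatches("NT", "GTATA", i,
                                                    wildcards = TRUE))
print(strict)   # 1 2 1 2
print(relaxed)  # 0 1 0 1
```

In this sketch the pattern "NT" has two zero-distance positions in "GTATA" once N is read as a wildcard, which agrees with the "2 matches" comment in the Examples section.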
Argument: follow.index
Whether the single integer returned by which.isMatchingAt (and related utils) should be the first *value* in at for which a match occurred, or its *index* in at (the default).
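The difference between the *index* and the *value* can be shown with plain vectors. This is only a sketch: the positions and distances below are made-up inputs, and it assumes that a position counts as a match when its distance lies between min.mismatch and max.mismatch.

```r
# Hypothetical inputs: candidate positions and the distance found at each one.
at        <- c(10, 7, 3, 8)
distances <- c(3, 1, 0, 1)
min_mismatch <- 0
max_mismatch <- 1

# Assumption: a position matches when its distance is within the bounds.
is_matching <- distances >= min_mismatch & distances <= max_mismatch

first_index <- which(is_matching)[1]  # index within `at` (follow.index=FALSE)
first_value <- at[first_index]        # entry of `at` itself (follow.index=TRUE)
print(first_index)  # 2
print(first_value)  # 7
```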
Argument: auto.reduce.pattern
Whether pattern should be effectively shortened by 1 letter, from its beginning for which.isMatchingStartingAt and from its end for which.isMatchingEndingAt, for each successive (at, max.mismatch) "pair".
----------Details----------
A "match" of pattern P in subject S is a substring S' of S that is considered similar enough to P according to some distance (or metric) specified by the user. Two distances are supported by most pattern matching functions in the Biostrings package. The first (and simplest) one is the "number of mismatching letters". It is defined only when the two strings to compare have the same length, so when this distance is used, only matches that have the same number of letters as P are considered. The second one is the "edit distance" (aka Levenshtein distance): it is the minimum number of operations needed to transform P into S', where an operation is an insertion, deletion, or substitution of a single letter. When this metric is used, matches can have a different number of letters than P.
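The two distances can be compared on ordinary character strings. The sketch below uses base R only; n_mismatches and edit_distance are hypothetical helper names, and the edit distance is computed with utils::adist() rather than with any Biostrings function.

```r
# Distance 1: number of mismatching letters (defined for equal lengths only).
n_mismatches <- function(p, s) {
  stopifnot(nchar(p) == nchar(s))
  sum(strsplit(p, "")[[1]] != strsplit(s, "")[[1]])
}

# Distance 2: edit (Levenshtein) distance, via utils::adist() with its
# default unit costs for insertion, deletion and substitution.
edit_distance <- function(p, s) as.integer(utils::adist(p, s))

print(n_mismatches("ABCDEF", "ABBCDE"))   # 4
print(edit_distance("ABCDEF", "ABBCDE"))  # 2 (insert a B, delete the F)
print(edit_distance("ABCDEF", "CDEF"))    # 2 (delete the A and the B)
```

The last call shows why only the edit distance can compare strings of different lengths.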
The neditAt function implements these two distances. If with.indels is FALSE (the default), then the first distance is used, i.e. neditAt returns the "number of mismatching letters" between the pattern P and the substring S' of S starting at the positions specified in at (note that neditAt is vectorized, so a long vector of integers can be passed through the at argument). If with.indels is TRUE, then the "edit distance" is used: for each position specified in at, P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. Note that this distance is guaranteed to be reached for a substring of length < 2*length(P) so, in practice, P only needs to be compared to a small number of substrings for every starting position.
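The procedure described above can be sketched in base R. This is an illustration of the definition, not the Biostrings implementation: nedit_at_sketch is a hypothetical name, it works on plain character strings, handles a single position that is assumed to lie inside the subject, and uses utils::adist() for the edit distance.

```r
nedit_at_sketch <- function(pattern, subject, at, with_indels = FALSE) {
  m <- nchar(pattern)
  if (!with_indels) {
    # Number of mismatching letters against the same-length window.
    window <- substr(subject, at, at + m - 1)
    stopifnot(nchar(window) == m)  # sketch: the window must fit in the subject
    return(sum(strsplit(pattern, "")[[1]] != strsplit(window, "")[[1]]))
  }
  # Every candidate substring S' starts at `at`; lengths below 2 * m suffice.
  ends <- seq(at, min(nchar(subject), at + 2 * m - 2))
  candidates <- substring(subject, at, ends)
  as.integer(min(utils::adist(pattern, candidates)))
}

subject_chr <- "ABCDEFxxxCDEFxxxABBCDE"
print(nedit_at_sketch("ABCDEF", subject_chr, 9))                       # 6
print(nedit_at_sketch("ABCDEF", subject_chr, 9, with_indels = TRUE))   # 2
print(nedit_at_sketch("ABCDEF", subject_chr, 17))                      # 4
print(nedit_at_sketch("ABCDEF", subject_chr, 17, with_indels = TRUE))  # 2
```

The subject string and positions are taken from the "WITH INDELS" part of the Examples section; the printed values are those of this sketch.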
----------Value----------
hasLetterAt: A logical matrix with one row per element in x and one column per letter/position to check. When a specified position is invalid with respect to an element in x, the corresponding matrix element is set to NA.
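The shape of this return value can be mimicked on a plain character vector. The sketch below is base R only; has_letter_at_sketch is a hypothetical name, and it treats any position outside an element as invalid.

```r
has_letter_at_sketch <- function(x, letter, at) {
  letters_to_check <- strsplit(letter, "")[[1]]
  stopifnot(length(letters_to_check) == length(at))
  cols <- sapply(seq_along(at), function(j) {
    found <- substr(x, at[j], at[j])
    # NA where the position falls outside the element, else TRUE/FALSE.
    ifelse(at[j] < 1 | at[j] > nchar(x), NA, found == letters_to_check[j])
  })
  # One row per element of x, one column per letter/position pair.
  matrix(cols, nrow = length(x))
}

x_chr <- c("AAACGT", "AACGT", "ACGT", "TAGGA")
res <- has_letter_at_sketch(x_chr, "AAAAAA", 1:6)
print(res)
print(which(rowSums(has_letter_at_sketch(x_chr, "AG", c(2, 4))) == 2))  # 4
```

Only "TAGGA" has an A at position 2 and a G at position 4, so the last line prints 4.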
neditAt: If subject is an XString object, returns an integer vector of the same length as at. If subject is an XStringSet object, returns an integer matrix with length(at) rows and length(subject) columns, where each column holds the neditAt result for the corresponding element of subject.
----------See Also----------
nucleotideFrequencyAt, matchPattern, matchPDict, matchLRPatterns, trimLRPatterns, IUPAC_CODE_MAP, XString-class, align-utils
----------Examples----------
## ---------------------------------------------------------------------
## hasLetterAt()
## ---------------------------------------------------------------------
x <- DNAStringSet(c("AAACGT", "AACGT", "ACGT", "TAGGA"))
hasLetterAt(x, "AAAAAA", 1:6)
## hasLetterAt() can be used to answer questions like: "which elements
## in 'x' have an A at position 2 and a G at position 4?"
q1 <- hasLetterAt(x, "AG", c(2, 4))
which(rowSums(q1) == 2)
## or "how many probes in the drosophila2 chip have T, G, T, A at
## position 2, 4, 13 and 20, respectively?"
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
q2 <- hasLetterAt(probes, "TGTA", c(2, 4, 13, 20))
sum(rowSums(q2) == 4)
## or "what's the probability to have an A at position 25 if there is
## one at position 13?"
q3 <- hasLetterAt(probes, "AACGT", c(13, 25, 25, 25, 25))
sum(q3[ , 1] & q3[ , 2]) / sum(q3[ , 1])
## Probabilities to have other bases at position 25 if there is an A
## at position 13:
sum(q3[ , 1] & q3[ , 3]) / sum(q3[ , 1])  # C
sum(q3[ , 1] & q3[ , 4]) / sum(q3[ , 1])  # G
sum(q3[ , 1] & q3[ , 5]) / sum(q3[ , 1])  # T
## See ?nucleotideFrequencyAt for another way to get those results.
## ---------------------------------------------------------------------
## neditAt() / isMatchingAt() / which.isMatchingAt()
## ---------------------------------------------------------------------
subject <- DNAString("GTATA")
## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
neditAt("AT", subject, at=3)
isMatchingAt("AT", subject, at=3)
## ... but not at position 1
neditAt("AT", subject)
isMatchingAt("AT", subject)
## ... unless we allow 1 mismatching letter (inexact match)
isMatchingAt("AT", subject, max.mismatch=1)
## Here we look at 6 different starting positions and find 3 matches if
## we allow 1 mismatching letter
isMatchingAt("AT", subject, at=0:5, max.mismatch=1)
## No match
neditAt("NT", subject, at=1:4)
isMatchingAt("NT", subject, at=1:4)
## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
neditAt("NT", subject, at=1:4, fixed=FALSE)
isMatchingAt("NT", subject, at=1:4, fixed=FALSE)
## max.mismatch != 0 and fixed=FALSE can be used together
neditAt("NCA", subject, at=0:5, fixed=FALSE)
isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)
some_starts <- c(10:-10, NA, 6)
subject <- DNAString("ACGTGCA")
is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
some_starts[is_matching]
which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1,
    follow.index=TRUE)
## ---------------------------------------------------------------------
## WITH INDELS
## ---------------------------------------------------------------------
subject <- BString("ABCDEFxxxCDEFxxxABBCDE")
neditAt("ABCDEF", subject, at=9)
neditAt("ABCDEF", subject, at=9, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=1, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=2, with.indels=TRUE)
neditAt("ABCDEF", subject, at=17)
neditAt("ABCDEF", subject, at=17, with.indels=TRUE)
neditEndingAt("ABCDEF", subject, ending.at=22)
neditEndingAt("ABCDEF", subject, ending.at=22, with.indels=TRUE)
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).