Low-level matching functions
In this man page we define precisely and illustrate what a "match" of a pattern P in a subject S is in the context of the Biostrings package. This definition of a "match" is central to most pattern matching functions available in this package: unless specified otherwise, most of them will adhere to the definition provided here.
hasLetterAt checks whether a sequence or set of sequences has the specified letters at the specified positions.
neditAt, isMatchingAt and which.isMatchingAt are low-level matching functions that only look for matches at the specified positions in the subject.
hasLetterAt(x, letter, at, fixed=TRUE)
## neditAt() and related utils:
neditAt(pattern, subject, at=1,
        with.indels=FALSE, fixed=TRUE)
neditStartingAt(pattern, subject, starting.at=1,
        with.indels=FALSE, fixed=TRUE)
neditEndingAt(pattern, subject, ending.at=1,
        with.indels=FALSE, fixed=TRUE)
## isMatchingAt() and related utils:
isMatchingAt(pattern, subject, at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingStartingAt(pattern, subject, starting.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingEndingAt(pattern, subject, ending.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
## which.isMatchingAt() and related utils:
which.isMatchingAt(pattern, subject, at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
        follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingStartingAt(pattern, subject, starting.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
        follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingEndingAt(pattern, subject, ending.at=1,
        max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
        follow.index=FALSE, auto.reduce.pattern=FALSE)
Argument: x
A character vector, or an XString or XStringSet object.
Argument: letter
A character string or an XString object containing the letters to check.
Arguments: at, starting.at, ending.at
An integer vector specifying the starting (for starting.at and at) or ending (for ending.at) positions of the pattern relative to the subject. When auto.reduce.pattern is used (see below), this must be either a single integer or a constant vector of length nchar(pattern); a single integer is immediately converted to such a vector. For the hasLetterAt function, letter and at must have the same length (one position per letter).
Argument: pattern
The pattern string (but see auto.reduce.pattern, below).
Argument: subject
A character vector, or an XString or XStringSet object containing the subject sequence(s).
Arguments: max.mismatch, min.mismatch
Integer vectors of length >= 1 recycled to the length of the at (or starting.at, or ending.at) argument. More details below.
Argument: with.indels
See details below.
Argument: fixed
A fixed value other than the default (TRUE) can only be used with a DNAString- or RNAString-based subject. If TRUE (the default), an IUPAC ambiguity code in the pattern can only match the same code in the subject, and vice versa. If FALSE, an IUPAC ambiguity code in the pattern can match any letter in the subject that is associated with the code, and vice versa. See IUPAC_CODE_MAP for more information about the IUPAC Extended Genetic Alphabet. fixed can also be a character vector, a subset of c("pattern", "subject"). fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default). An empty vector is equivalent to fixed=FALSE. With fixed="subject", only ambiguities in the pattern are interpreted as wildcards. With fixed="pattern", only ambiguities in the subject are interpreted as wildcards.
Argument: follow.index
Whether the single integer returned by which.isMatchingAt (and related utils) should be the first *value* in at for which a match occurred (follow.index=TRUE), or its *index* in at (follow.index=FALSE, the default).
Argument: auto.reduce.pattern
Whether pattern should be effectively shortened by 1 letter, from its beginning for which.isMatchingStartingAt and from its end for which.isMatchingEndingAt, for each successive (at, max.mismatch) "pair".
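As a rough illustration of the fixed argument, the following base-R sketch counts mismatching letters with and without treating pattern ambiguity codes as wildcards. It does not call Biostrings: count_mismatches and iupac_excerpt are hypothetical names introduced here, the map is only an excerpt of the IUPAC codes, and the sketch assumes a subject made of A/C/G/T with all positions within its limits.

```r
# Excerpt of the IUPAC nucleotide ambiguity codes; each value lists the
# bases the code stands for (see IUPAC_CODE_MAP for the full map).
iupac_excerpt <- c(A = "A", C = "C", G = "G", T = "T",
                   R = "AG", Y = "CT", N = "ACGT")

# Hypothetical helper, not a Biostrings function: number of mismatching
# letters between 'pattern' and the substring of 'subject' starting at 'at'.
# Assumes an A/C/G/T-only subject and positions inside the subject.
count_mismatches <- function(pattern, subject, at, fixed = TRUE) {
  p <- strsplit(pattern, "", fixed = TRUE)[[1]]
  vapply(at, function(start) {
    s <- strsplit(substr(subject, start, start + length(p) - 1L), "",
                  fixed = TRUE)[[1]]
    if (fixed) {
      # Ambiguity codes are compared literally.
      sum(p != s)
    } else {
      # A pattern code matches any subject base it stands for.
      matched <- mapply(function(code, base) {
        grepl(base, iupac_excerpt[[code]], fixed = TRUE)
      }, p, s)
      sum(!matched)
    }
  }, integer(1))
}

count_mismatches("NT", "GTATA", at = 1:4)                 # 1 2 1 2
count_mismatches("NT", "GTATA", at = 1:4, fixed = FALSE)  # 0 1 0 1
```

With fixed=TRUE no position reaches zero mismatches, whereas with fixed=FALSE positions 1 and 3 do.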
A "match" of pattern P in subject S is a substring S' of S that is considered similar enough to P according to some distance (or metric) specified by the user. Two distances are supported by most pattern matching functions in the Biostrings package. The first (and simplest) one is the "number of mismatching letters". It is defined only when the two strings to compare have the same length, so when this distance is used, only matches that have the same number of letters as P are considered. The second one is the "edit distance" (aka Levenshtein distance): it is the minimum number of operations needed to transform P into S', where an operation is an insertion, deletion, or substitution of a single letter. When this metric is used, matches can have a different number of letters than P.
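The two distances can be compared on plain character strings with a small base-R sketch. The helper names n_mismatches and edit_distance are hypothetical (they are not Biostrings functions); the edit distance relies on adist() from the utils package, which computes the generalized Levenshtein distance with unit costs by default.

```r
# Distance 1: number of mismatching letters (equal-length strings only).
n_mismatches <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Distance 2: edit (Levenshtein) distance; insertions, deletions and
# substitutions each cost 1.
edit_distance <- function(a, b) as.integer(utils::adist(a, b))

n_mismatches("ABCDEF", "ABBCDE")   # 4
edit_distance("ABCDEF", "ABBCDE")  # 2
```

The same pair of strings is far apart letter by letter but only two operations apart (insert a B, delete the F), which is why allowing indels can turn a non-match into a match.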
The neditAt function implements both distances. If with.indels is FALSE (the default), then the first distance is used, i.e. neditAt returns the "number of mismatching letters" between the pattern P and the substring S' of S starting at the positions specified in at (note that neditAt is vectorized, so a long vector of integers can be passed through the at argument). If with.indels is TRUE, then the "edit distance" is used: for each position specified in at, P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. Note that this distance is guaranteed to be reached for a substring of length < 2*length(P), so in practice P only needs to be compared to a small number of substrings for each starting position.
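The with.indels=TRUE rule (smallest edit distance over all substrings starting at a given position) can be sketched in base R as follows. This is only an illustration on plain strings: min_edit_starting_at is a hypothetical helper, it assumes 1 <= at <= nchar(subject), and it makes no claim about how Biostrings computes the value internally.

```r
# Hypothetical helper, not a Biostrings function: smallest edit distance
# between 'pattern' and any substring of 'subject' starting at 'at'.
# Assumes 1 <= at <= nchar(subject).
min_edit_starting_at <- function(pattern, subject, at) {
  # Candidate end positions; at - 1 stands for the empty substring.
  # Candidates are capped at 2 * nchar(pattern) letters, which covers the
  # length bound stated in the text.
  last_end <- min(nchar(subject), at + 2L * nchar(pattern) - 1L)
  ends <- seq(at - 1L, last_end)
  candidates <- substring(subject, at, ends)
  # adist() returns one distance per candidate substring.
  min(utils::adist(pattern, candidates))
}

subject <- "ABCDEFxxxCDEFxxxABBCDE"
min_edit_starting_at("ABCDEF", subject, at = 1)   # 0
min_edit_starting_at("ABCDEF", subject, at = 9)   # 2
min_edit_starting_at("ABCDEF", subject, at = 17)  # 2
```

At position 9 the best candidate is "xCDEF" (delete A, substitute B), so the distance is 2 even though all 6 letters mismatch when indels are not allowed.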
hasLetterAt: A logical matrix with one row per element in x and one column per letter/position to check. When a specified position is invalid with respect to an element in x then the corresponding matrix element is set to NA.
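The shape of the hasLetterAt return value can be mimicked on plain character vectors with a short base-R sketch. has_letter_at is a hypothetical helper, not the Biostrings function, and it only covers the fixed=TRUE case.

```r
# Hypothetical helper, not a Biostrings function: logical matrix with one
# row per element of 'x' and one column per letter/position pair.
has_letter_at <- function(x, letter, at) {
  letters_to_check <- strsplit(letter, "", fixed = TRUE)[[1]]
  stopifnot(length(letters_to_check) == length(at))
  result <- vapply(seq_along(at), function(j) {
    found <- substr(x, at[j], at[j])
    # substr() yields "" for a position outside an element: report NA there.
    ifelse(nzchar(found), found == letters_to_check[j], NA)
  }, logical(length(x)))
  matrix(result, nrow = length(x), ncol = length(at))
}

x <- c("AAACGT", "AACGT", "ACGT", "TAGGA")
q1 <- has_letter_at(x, "AG", c(2, 4))
which(rowSums(q1) == 2)            # 2 4
has_letter_at(x, "A", 6)[ , 1]     # TRUE for "AAACGT"? No: FALSE, then NA NA NA
```

Only the first element has a 6th letter (a T), so the last call gives FALSE followed by three NAs.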
neditAt: If subject is an XString object, an integer vector of the same length as at. If subject is an XStringSet object, an integer matrix with length(at) rows and length(subject) columns, where each column holds the result for the corresponding element of subject.
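The matrix layout (one row per position in at, one column per subject element) can be sketched in base R by applying a single-subject function to each element. nedit_at_one is a hypothetical stand-in for neditAt on one plain string (fixed=TRUE, with.indels=FALSE); counting letters that fall outside the subject as mismatches is an assumption of this sketch.

```r
# Hypothetical stand-in for neditAt() on a single plain string.
# Assumption: pattern letters falling outside the subject count as mismatches.
nedit_at_one <- function(pattern, subject, at) {
  p <- strsplit(pattern, "", fixed = TRUE)[[1]]
  vapply(at, function(start) {
    pos <- start + seq_along(p) - 1L
    s <- substring(subject, pos, pos)  # "" for positions outside the subject
    sum(p != s)
  }, integer(1))
}

subjects <- c("GTATA", "ATATA", "CCC")
at <- 1:4
# One column per subject element, one row per position in 'at'.
m <- sapply(unname(subjects), function(x) nedit_at_one("AT", x, at),
            USE.NAMES = FALSE)
dim(m)    # 4 3
m[ , 1]   # 1 2 0 2
```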
See Also
nucleotideFrequencyAt, matchPattern, matchPDict, matchLRPatterns, trimLRPatterns, IUPAC_CODE_MAP, XString-class, align-utils
## ---------------------------------------------------------------------
## hasLetterAt()
## ---------------------------------------------------------------------
x <- DNAStringSet(c("AAACGT", "AACGT", "ACGT", "TAGGA"))
hasLetterAt(x, "AAAAAA", 1:6)
## hasLetterAt() can be used to answer questions like: "which elements
## in 'x' have an A at position 2 and a G at position 4?"
q1 <- hasLetterAt(x, "AG", c(2, 4))
which(rowSums(q1) == 2)
## or "how many probes in the drosophila2 chip have T, G, T, A at
## position 2, 4, 13 and 20, respectively?"
library(drosophila2probe)  # data package providing 'drosophila2probe'
probes <- DNAStringSet(drosophila2probe)
q2 <- hasLetterAt(probes, "TGTA", c(2, 4, 13, 20))
sum(rowSums(q2) == 4)
## or "what's the probability to have an A at position 25 if there is
## one at position 13?"
q3 <- hasLetterAt(probes, "AACGT", c(13, 25, 25, 25, 25))
sum(q3[ , 1] & q3[ , 2]) / sum(q3[ , 1])
## Probabilities to have other bases at position 25 if there is an A
## at position 13:
sum(q3[ , 1] & q3[ , 3]) / sum(q3[ , 1])  # C
sum(q3[ , 1] & q3[ , 4]) / sum(q3[ , 1])  # G
sum(q3[ , 1] & q3[ , 5]) / sum(q3[ , 1])  # T
## See ?nucleotideFrequencyAt for another way to get those results.
## ---------------------------------------------------------------------
## neditAt() / isMatchingAt() / which.isMatchingAt()
## ---------------------------------------------------------------------
subject <- DNAString("GTATA")
## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
neditAt("AT", subject, at=3)
isMatchingAt("AT", subject, at=3)
## ... but not at position 1
neditAt("AT", subject)
isMatchingAt("AT", subject)
## ... unless we allow 1 mismatching letter (inexact match)
isMatchingAt("AT", subject, max.mismatch=1)
## Here we look at 6 different starting positions and find 3 matches if
## we allow 1 mismatching letter
isMatchingAt("AT", subject, at=0:5, max.mismatch=1)
## No match
neditAt("NT", subject, at=1:4)
isMatchingAt("NT", subject, at=1:4)
## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
neditAt("NT", subject, at=1:4, fixed=FALSE)
isMatchingAt("NT", subject, at=1:4, fixed=FALSE)
## max.mismatch != 0 and fixed=FALSE can be used together
neditAt("NCA", subject, at=0:5, fixed=FALSE)
isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)
some_starts <- c(10:-10, NA, 6)
subject <- DNAString("ACGTGCA")
is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
some_starts[is_matching]
## By default, the index in 'at' of the first match is returned
which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
## With follow.index=TRUE, the value in 'at' is returned instead
which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1,
                   follow.index=TRUE)
## ---------------------------------------------------------------------
## WITH INDELS
## ---------------------------------------------------------------------
subject <- BString("ABCDEFxxxCDEFxxxABBCDE")
neditAt("ABCDEF", subject, at=9)
neditAt("ABCDEF", subject, at=9, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=1, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=2, with.indels=TRUE)
neditAt("ABCDEF", subject, at=17)
neditAt("ABCDEF", subject, at=17, with.indels=TRUE)
neditEndingAt("ABCDEF", subject, ending.at=22)
neditEndingAt("ABCDEF", subject, ending.at=22, with.indels=TRUE)