matchPattern(Biostrings)
matchPattern() is part of the R package Biostrings
String searching functions
Translator: 生物统计家园网 robot LoveR
Description
A set of functions for finding all the occurrences (aka "matches" or "hits") of a given pattern (typically short) in a (typically long) reference sequence or set of reference sequences (aka the subject).
Usage
matchPattern(pattern, subject,
             max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
             algorithm="auto")
countPattern(pattern, subject,
             max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
             algorithm="auto")
vmatchPattern(pattern, subject,
              max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
              algorithm="auto", ...)
vcountPattern(pattern, subject,
              max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
              algorithm="auto", ...)
Arguments
Argument: pattern
The pattern string.
Argument: subject
An XString, XStringViews or MaskedXString object for matchPattern and countPattern. An XStringSet or XStringViews object for vmatchPattern and vcountPattern.
Argument: max.mismatch, min.mismatch
The maximum and minimum number of mismatching letters allowed (see ?`lowlevel-matching` for the details). If non-zero, an algorithm that supports inexact matching is used.
Argument: with.indels
If TRUE then indels are allowed. In that case, min.mismatch must be 0 and max.mismatch is interpreted as the maximum "edit distance" allowed between the pattern and a match. Note that in order to avoid pollution by redundant matches, only the "best local matches" are returned. Roughly speaking, a "best local match" is a match that is locally both the closest (to the pattern P) and the shortest. More precisely, writing nedit(P, S') for the edit distance between P and S', a substring S' of the subject S is a "best local match" iff:
  (a) nedit(P, S') <= max.mismatch
  (b) for every substring S1 of S': nedit(P, S1) > nedit(P, S')
  (c) for every substring S2 of S that contains S': nedit(P, S2) >= nedit(P, S')
One nice property of "best local matches" is that their first and last letters are guaranteed to match the letters in P that they align with.
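The three conditions can be checked by brute force on a short subject. The sketch below uses base R only and is not how Biostrings computes its matches: best_local_matches() is a hypothetical helper, adist() from the utils package stands in for nedit, and condition (b) is read here as ranging over the proper substrings of S'.

```r
# Hypothetical brute-force helper (base R only, not part of Biostrings).
# adist() computes the Levenshtein edit distance and plays the role of nedit.
best_local_matches <- function(P, S, max.mismatch) {
  n <- nchar(S)
  # d[i, j] = edit distance between P and the substring S[i..j] (i <= j)
  d <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in i:n) d[i, j] <- adist(P, substr(S, i, j))[1, 1]
  }
  hits <- data.frame(start = integer(0), end = integer(0), nedit = numeric(0))
  for (i in seq_len(n)) {
    for (j in i:n) {
      if (d[i, j] > max.mismatch) next              # condition (a)
      inner <- d[i:j, i:j, drop = FALSE]            # substrings of S'
      inner[1, ncol(inner)] <- NA                   # leave out S' itself
      containing <- d[seq_len(i), j:n, drop = FALSE] # substrings containing S'
      ok_b <- all(inner > d[i, j], na.rm = TRUE)    # condition (b)
      ok_c <- all(containing >= d[i, j])            # condition (c)
      if (ok_b && ok_c) {
        hits <- rbind(hits, data.frame(start = i, end = j, nedit = d[i, j]))
      }
    }
  }
  hits
}

hits <- best_local_matches("ABCDEF", "ACDEFxxxCDEFxxxABCE", max.mismatch = 2)
hits
```

Under these assumptions the helper keeps three substrings of this subject: "ACDEF" (1 edit), "CDEF" at positions 9 to 12 (2 edits) and "ABCE" (2 edits). Longer candidates such as "xCDEF" are dropped by condition (b), and shorter ones such as the "CDEF" inside "ACDEF" are dropped by condition (c).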
Argument: fixed
If TRUE (the default), an IUPAC ambiguity code in the pattern can only match the same code in the subject, and vice versa. If FALSE, an IUPAC ambiguity code in the pattern can match any letter in the subject that is associated with the code, and vice versa. See ?`lowlevel-matching` for more information.
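To make the meaning of max.mismatch, min.mismatch and fixed concrete, here is a minimal base R sketch of a naive sliding-window search. It is an illustration only, not the Biostrings implementation: naive_match_starts() and the iupac lookup table are hypothetical names, the pattern is assumed to contain only IUPAC nucleotide codes, and for fixed=FALSE the subject is assumed to contain only A, C, G and T.

```r
# Hypothetical helper (base R only, not part of Biostrings).
# IUPAC nucleotide codes mapped to the bases they stand for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
           V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")

naive_match_starts <- function(pattern, subject,
                               max.mismatch = 0, min.mismatch = 0,
                               fixed = TRUE) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_windows <- length(s) - length(p) + 1
  starts <- integer(0)
  if (n_windows < 1) return(starts)
  for (i in seq_len(n_windows)) {
    window <- s[i:(i + length(p) - 1)]
    if (fixed) {
      # Letters must be literally identical, so "N" only matches "N".
      agree <- p == window
    } else {
      # A pattern code agrees with any base in the set it stands for.
      agree <- mapply(function(code, base) {
        grepl(base, iupac[[code]], fixed = TRUE)
      }, p, window)
    }
    n_mis <- sum(!agree)
    if (n_mis >= min.mismatch && n_mis <= max.mismatch) starts <- c(starts, i)
  }
  starts
}

naive_match_starts("GCNNNAT", "AAGCGCGATATG")                 # integer(0)
naive_match_starts("GCNNNAT", "AAGCGCGATATG", fixed = FALSE)  # 3 5
```

With fixed=TRUE the three N's are treated as literal letters, so no window qualifies. With fixed=FALSE they act as wildcards and the windows starting at positions 3 and 5 match.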
Argument: algorithm
One of the following: "auto", "naive-exact", "naive-inexact", "boyer-moore", "shift-or" or "indels".
Argument: ...
Additional arguments for methods.
Details
Available algorithms are: “naive exact”, “naive inexact”, “Boyer-Moore-like”, “shift-or” and “indels”. Not all of them can be used in all situations: restrictions apply depending on the "search criteria" i.e. on the values of the pattern, subject, max.mismatch, min.mismatch, with.indels and fixed arguments.
It is important to note that the algorithm argument is not part of the search criteria. This is because the supported algorithms are interchangeable: if two different algorithms are compatible with a given set of search criteria, then choosing one or the other will not affect the result (but will most likely affect performance). So, strictly speaking, there is no "wrong choice" of algorithm.
Using algorithm="auto" (the default) is recommended because then the best suited algorithm will automatically be selected among the set of algorithms that are valid for the given search criteria.
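The interchangeability of algorithms can be checked on a small exact search. This is a minimal sketch that assumes the Biostrings package is installed; the subject and pattern are made up for illustration, and only algorithm names listed under the algorithm argument are used.

```r
library(Biostrings)

subject <- DNAString("AAGCGCGATATGCGCGAT")
pattern <- "GCGAT"

# Exact search (max.mismatch=0, fixed=TRUE), so several algorithms qualify.
hits_auto  <- matchPattern(pattern, subject)
hits_naive <- matchPattern(pattern, subject, algorithm = "naive-exact")
hits_bm    <- matchPattern(pattern, subject, algorithm = "boyer-moore")

start(hits_auto)                                  # 5 14
identical(start(hits_auto), start(hits_naive))    # TRUE
identical(start(hits_auto), start(hits_bm))       # TRUE
```

Only the running time is expected to differ between the three calls, which is why leaving algorithm at "auto" is usually the simplest choice.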
Value
An XStringViews object for matchPattern.
A single integer for countPattern.
An MIndex object for vmatchPattern.
An integer vector for vcountPattern, where each element is the number of matches in the corresponding element of subject.
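The four return types can be compared side by side on toy data. This is a small sketch that assumes Biostrings is installed; the sequences are invented for illustration.

```r
library(Biostrings)

x   <- DNAString("AAGCGCGATATG")
set <- DNAStringSet(c("AAGCGCGATATG", "TTTTTTTT", "GCAAAAT"))

views  <- matchPattern("GCNNNAT", x, fixed = FALSE)     # XStringViews
n      <- countPattern("GCNNNAT", x, fixed = FALSE)     # single integer
mindex <- vmatchPattern("GCNNNAT", set, fixed = FALSE)  # MIndex
counts <- vcountPattern("GCNNNAT", set, fixed = FALSE)  # integer vector

is(views, "XStringViews")   # TRUE
start(views)                # 3 5
n                           # 2
is(mindex, "MIndex")        # TRUE
counts                      # 2 0 1
```

countPattern and vcountPattern report only how many matches there are, so they are the lighter option when the match positions themselves are not needed.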
Note
Use matchPDict if you need to match a (big) set of patterns against a reference sequence.
Use pairwiseAlignment if you need to solve a (Needleman-Wunsch) global alignment, a (Smith-Waterman) local alignment, or an (ends-free) overlap alignment problem.
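As a rough illustration of the first point, a set of patterns can be preprocessed once and then matched in a single call. The sketch assumes Biostrings is installed and uses its PDict() constructor and countPDict(); the three patterns are invented and deliberately share the same width.

```r
library(Biostrings)

# Three same-width patterns, preprocessed once into a dictionary.
patterns <- DNAStringSet(c("GCGAT", "ATATG", "TTTTT"))
pdict    <- PDict(patterns)

subject <- DNAString("AAGCGCGATATG")

# One count per pattern, in the order of 'patterns'.
hits_per_pattern <- countPDict(pdict, subject)
hits_per_pattern   # 1 1 0
```

matchPDict(pdict, subject) can be used in the same way when the match positions are needed rather than the counts.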
See Also
lowlevel-matching, matchPDict, pairwiseAlignment, mismatch, matchLRPatterns, matchProbePair, maskMotif, alphabetFrequency, XStringViews-class, MIndex-class
Examples
## ---------------------------------------------------------------------
## A. matchPattern()/countPattern()
## ---------------------------------------------------------------------
library(Biostrings)

## A simple inexact matching example with a short subject:
x <- DNAString("AAGCGCGATATG")
m1 <- matchPattern("GCNNNAT", x)
m1
m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)
m2
as.matrix(m2)

## With DNA sequence of yeast chromosome number 1:
data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)
PpiI <- "GAACNNNNNCTC"  # a restriction enzyme pattern
match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
match2.PpiI <- matchPattern(PpiI, yeast1, max.mismatch=1, fixed=FALSE)

## With a genome containing isolated Ns:
library(BSgenome.Celegans.UCSC.ce2)
chrII <- Celegans[["chrII"]]
alphabetFrequency(chrII)
matchPattern("N", chrII)
matchPattern("TGGGTGTCTTT", chrII)  # no match
matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE)  # 1 match

## Using wildcards ("N") in the pattern on a genome containing N-blocks:
library(BSgenome.Dmelanogaster.UCSC.dm3)
chrX <- maskMotif(Dmelanogaster$chrX, "N")
as(chrX, "XStringViews")  # 4 non-masked regions
matchPattern("TTTATGNTTGGTA", chrX, fixed=FALSE)
## Can also be achieved with no mask:
masks(chrX) <- NULL
matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")

## ---------------------------------------------------------------------
## B. vmatchPattern()/vcountPattern()
## ---------------------------------------------------------------------
Ebox <- DNAString("CANNTG")
subject <- Celegans$upstream5000
mindex <- vmatchPattern(Ebox, subject, fixed=FALSE)
## Get the number of matches per subject element:
count_index <- countIndex(mindex)
sum(count_index)  # Total number of matches.
table(count_index)
i0 <- which(count_index == max(count_index))
subject[i0]  # The subject element with most matches.
## The matches in 'subject[i0]' as an IRanges object:
mindex[[i0]]
## The matches in 'subject[i0]' as an XStringViews object:
Views(subject[[i0]], mindex[[i0]])

## ---------------------------------------------------------------------
## C. WITH INDELS
## ---------------------------------------------------------------------
library(BSgenome.Celegans.UCSC.ce2)
pattern <- DNAString("ACGGACCTAATGTTATC")
subject <- Celegans$chrI

## Allowing up to 2 mismatching letters doesn't give any match:
matchPattern(pattern, subject, max.mismatch=2)

## But allowing up to 2 edit operations gives 3 matches:
system.time(m <- matchPattern(pattern, subject, max.mismatch=2, with.indels=TRUE))
m

## pairwiseAlignment() returns the (first) best match only:
if (interactive()) {
  mat <- nucleotideSubstitutionMatrix(match=1, mismatch=0, baseOnly=TRUE)
  ## Note that this call to pairwiseAlignment() will need to
  ## allocate 733.5 Mb of memory (i.e. length(pattern) * length(subject)
  ## * 3 bytes).
  system.time(pwa <- pairwiseAlignment(pattern, subject, type="local",
                                       substitutionMatrix=mat,
                                       gapOpening=0, gapExtension=1))
  pwa
}

## Only "best local matches" are reported:
##   - with deletions in the subject
subject <- BString("ACDEFxxxCDEFxxxABCE")
matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
matchPattern("ABCDEF", subject, max.mismatch=2)
##   - with insertions in the subject
subject <- BString("AiBCDiEFxxxABCDiiFxxxAiBCDEFxxxABCiDEF")
matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
matchPattern("ABCDEF", subject, max.mismatch=2)
##   - with substitutions (note that the "best local matches" can introduce
##     indels and therefore be shorter than 6)
subject <- BString("AsCDEFxxxABDCEFxxxBACDEFxxxABCEDF")
matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
matchPattern("ABCDEF", subject, max.mismatch=2)
Please credit the source when reposting: 生物统计家园网 (http://www.biostatistic.net).