R Biostrings package: matchPDict-inexact() help documentation

loveR · posted 2012-2-25 13:47:02

matchPDict-inexact(Biostrings)
matchPDict-inexact() belongs to the R package Biostrings

Inexact matching with matchPDict()/countPDict()/whichPDict()

Translator: LoveR, the robot of 生物统计家园 (biostatistic.net)

----------Description----------

The matchPDict, countPDict and whichPDict functions efficiently find the occurrences in a text (the subject) of all patterns stored in a preprocessed dictionary.

This man page shows how to use these functions for inexact (or fuzzy) matching or when the original dictionary has a variable width.

See ?matchPDict for how to use these functions for exact matching of a constant width dictionary i.e. a dictionary where all the patterns have the same length (same number of nucleotides).

----------Details----------

In this man page, we assume that you know how to preprocess a dictionary of DNA patterns that can then be used with matchPDict, countPDict or whichPDict. Please see ?PDict if you don't.

matchPDict and family support different kinds of inexact matching but with some restrictions. Inexact matching is controlled via the definition of a Trusted Band during the preprocessing step and/or via the max.mismatch, min.mismatch and fixed arguments. Defining a Trusted Band is also required when the original dictionary is not rectangular (variable width), even for exact matching. See ?PDict for how to define a Trusted Band.

Here is how matchPDict and family handle the Trusted Band defined on pdict:

(1) Find all the exact matches of all the elements in the Trusted Band.

(2) For each element in the Trusted Band that has at least one exact match, compare the head and the tail of this element with the flanking sequences of the matches found in (1).
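
The two steps above can be sketched in base R. This is a toy illustration only, not the Biostrings implementation: the function name match_with_trusted_band and its arguments are hypothetical, the Trusted Band is assumed to start at position 1 of the pattern (so there is no head, only a tail), and candidate matches that would run past the end of the subject are simply dropped.

```r
# Toy sketch of Trusted Band matching (hypothetical helper, base R only).
match_with_trusted_band <- function(pattern, subject, tb_width, max_mismatch = 0) {
  band <- substr(pattern, 1, tb_width)
  tail_chars <- strsplit(substr(pattern, tb_width + 1, nchar(pattern)), "")[[1]]
  n <- nchar(subject)

  # Step 1: exact matches of the Trusted Band at every possible start.
  starts <- seq_len(max(n - tb_width + 1, 0))
  hits <- starts[substring(subject, starts, starts + tb_width - 1) == band]

  # Step 2: compare the tail with the sequence flanking each band hit.
  keep <- vapply(hits, function(s) {
    from <- s + tb_width
    to <- s + nchar(pattern) - 1
    if (to > n) return(FALSE)  # simplification: drop hits past the subject end
    flank <- strsplit(substr(subject, from, to), "")[[1]]
    sum(flank != tail_chars) <= max_mismatch
  }, logical(1))

  hits[keep]
}

# Pattern, band width and subject borrowed from section C of the Examples.
match_with_trusted_band("AGTAG", "TAGTACCAGTTTCGGG", tb_width = 3, max_mismatch = 1)
```

Only the tail is allowed to mismatch here; the band itself must always match exactly, which is why a wider band leaves fewer candidates for step 2.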

Note that the number of exact matches found in (1) will decrease exponentially with the width of the Trusted Band. Here is a simple guideline in order to get reasonably good performance: if TBW is the width of the Trusted Band (TBW <- tb.width(pdict)) and L the number of letters in the subject (L <- nchar(subject)), then L / (4^TBW) should be kept as small as possible, typically < 10 or 20.
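
As a rough check of the guideline, the ratio L / (4^TBW) can be computed directly. The sketch below uses base R only; the helper names tb_ratio and min_tb_width and the subject length are hypothetical values chosen for illustration, not part of Biostrings.

```r
# Ratio from the guideline: subject length divided by 4^TBW.
tb_ratio <- function(subject_length, tb_width) {
  subject_length / 4^tb_width
}

# Smallest Trusted Band width whose ratio falls below `threshold`.
min_tb_width <- function(subject_length, threshold = 10) {
  width <- 1
  while (tb_ratio(subject_length, width) >= threshold) {
    width <- width + 1
  }
  width
}

L <- 2.8e7            # hypothetical subject length (number of letters)
tb_ratio(L, 9)        # about 107, well above the suggested range
min_tb_width(L, 10)   # 11
```

With a real PDict object, TBW and L would come from tb.width(pdict) and nchar(subject), as described above.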

In addition, when a Trusted Band has been defined during preprocessing, then matchPDict and family can be called with fixed=FALSE. In this case, IUPAC ambiguity codes in the head or the tail of the PDict object are treated as ambiguities.

Finally, fixed="pattern" can be used to indicate that IUPAC ambiguity codes in the subject should be treated as ambiguities. It only works if the density of codes is not too high. It works whether or not a Trusted Band has been defined on pdict.
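
To make the fixed="pattern" case concrete, here is a small base R sketch of what treating subject letters as ambiguities means. It is an assumption-laden illustration, not the Biostrings algorithm: the lookup table iupac and the function find_with_ambiguous_subject are hypothetical names, patterns are assumed to contain only the base letters A, C, G and T, and every start position is scanned naively.

```r
# IUPAC nucleotide codes and the bases each one stands for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
           V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")

# Start positions where every pattern base is allowed by the subject code.
find_with_ambiguous_subject <- function(pattern, subject) {
  stopifnot(nchar(pattern) > 0)
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  starts <- seq_len(max(length(s) - length(p) + 1, 0))
  ok <- vapply(starts, function(i) {
    window <- s[i:(i + length(p) - 1)]
    all(mapply(function(base, code) grepl(base, iupac[[code]], fixed = TRUE),
               p, window))
  }, logical(1))
  starts[ok]
}

# Patterns and subjects borrowed from section D of the Examples.
find_with_ambiguous_subject("ACAC", "ACNCCGT")
find_with_ambiguous_subject("TCCG", "ACRCCGT")
```

The more ambiguity codes the subject contains, the more start positions pass this test, which is consistent with the remark that the approach only works when the density of codes is not too high.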

----------Author(s)----------

H. Pagès

----------References----------

Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.

----------See Also----------

PDict-class, MIndex-class, matchPDict

----------Examples----------

  ## ---------------------------------------------------------------------
  ## A. USING AN EXPLICIT TRUSTED BAND
  ## ---------------------------------------------------------------------

  library(Biostrings)
  library(drosophila2probe)
  dict0 <- DNAStringSet(drosophila2probe)
  dict0  # the original dictionary

  ## Preprocess the original dictionary by defining a Trusted Band that
  ## spans nucleotides 1 to 9 of each pattern.
  pdict9 <- PDict(dict0, tb.end=9)
  pdict9
  tail(pdict9)
  sum(duplicated(pdict9))
  table(patternFrequency(pdict9))

  library(BSgenome.Dmelanogaster.UCSC.dm3)
  chr3R <- Dmelanogaster$chr3R
  chr3R
  table(countPDict(pdict9, chr3R, max.mismatch=1))
  table(countPDict(pdict9, chr3R, max.mismatch=3))
  table(countPDict(pdict9, chr3R, max.mismatch=5))

  ## ---------------------------------------------------------------------
  ## B. COMPARISON WITH EXACT MATCHING
  ## ---------------------------------------------------------------------

  ## When the original dictionary is of constant width, exact matching
  ## (i.e. 'max.mismatch=0' and 'fixed=TRUE') will be more efficient with
  ## a full-width Trusted Band (i.e. a Trusted Band that covers the entire
  ## dictionary) than with a Trusted Band of width < width(dict0).
  pdict0 <- PDict(dict0)
  count0 <- countPDict(pdict0, chr3R)
  count0b <- countPDict(pdict9, chr3R, max.mismatch=0)
  identical(count0b, count0)  # TRUE

  ## ---------------------------------------------------------------------
  ## C. USING AN EXPLICIT TRUSTED BAND ON A VARIABLE WIDTH DICTIONARY
  ## ---------------------------------------------------------------------

  ## Here is a small variable width dictionary that contains IUPAC
  ## ambiguities (pattern 1 and 3 contain an N):
  dict0 <- DNAStringSet(c("TACCNG", "TAGT", "CGGNT", "AGTAG", "TAGT"))
  ## (Note that pattern 2 and 5 are identical.)

  ## If we only want to do exact matching, then it is recommended to use
  ## the widest possible Trusted Band i.e. to set its width to
  ## 'min(width(dict0))' because this is what will give the best
  ## performance. However, when 'dict0' contains IUPAC ambiguities (like
  ## in our case), it could be that one of them is falling into the
  ## Trusted Band so we get an error (only base letters can go in the
  ## Trusted Band for now):
  ## Not run:
  PDict(dict0, tb.end=min(width(dict0)))  # Error!
  ## End(Not run)

  ## In our case, the Trusted Band cannot be wider than 3:
  pdict <- PDict(dict0, tb.end=3)
  tail(pdict)

  subject <- DNAString("TAGTACCAGTTTCGGG")

  m <- matchPDict(pdict, subject)
  countIndex(m)  # pattern 2 and 5 have 1 exact match
  m[[2]]

  ## We can take advantage of the fact that our Trusted Band doesn't cover
  ## the entire dictionary to allow inexact matching on the uncovered parts
  ## (the tail in our case):

  m <- matchPDict(pdict, subject, fixed=FALSE)
  countIndex(m)  # now pattern 1 has 1 match too
  m[[1]]

  m <- matchPDict(pdict, subject, max.mismatch=1)
  countIndex(m)  # now pattern 4 has 1 match too
  m[[4]]

  m <- matchPDict(pdict, subject, max.mismatch=1, fixed=FALSE)
  countIndex(m)  # now pattern 3 has 1 match too
  m[[3]]  # note that this match is "out of limit"
  Views(subject, m[[3]])

  m <- matchPDict(pdict, subject, max.mismatch=2)
  countIndex(m)  # pattern 4 gets 1 additional match
  m[[4]]

  ## Unlist all matches:
  unlist(m)

  ## ---------------------------------------------------------------------
  ## D. WITH IUPAC AMBIGUITY CODES IN THE SUBJECT
  ## ---------------------------------------------------------------------
  pdict <- PDict(c("ACAC", "TCCG"))
  as.list(matchPDict(pdict, DNAString("ACNCCGT")))
  as.list(matchPDict(pdict, DNAString("ACNCCGT"), fixed="pattern"))
  as.list(matchPDict(pdict, DNAString("ACWCCGT"), fixed="pattern"))
  as.list(matchPDict(pdict, DNAString("ACRCCGT"), fixed="pattern"))
  as.list(matchPDict(pdict, DNAString("ACKCCGT"), fixed="pattern"))

  dict <- DNAStringSet(c("TTC", "CTT"))
  pdict <- PDict(dict)
  subject <- DNAString("CYTCACTTC")
  mi1 <- matchPDict(pdict, subject, fixed="pattern")
  mi2 <- matchPDict(dict, subject, fixed="pattern")
  stopifnot(identical(as.list(mi1), as.list(mi2)))

