matchPDict(Biostrings)
matchPDict() belongs to R package: Biostrings
Matching a dictionary of patterns against a reference
Translator: 生物统计家园网 robot LoveR
----------Description----------
A set of functions for finding all the occurrences (aka "matches" or "hits") of a set of patterns (aka the dictionary) in a reference sequence or set of reference sequences (aka the subject).
The following functions differ in what they return: matchPDict returns the "where" information i.e. the positions in the subject of all the occurrences of every pattern; countPDict returns the "how many times" information i.e. the number of occurrences for each pattern; and whichPDict returns the "who" information i.e. which patterns in the input dictionary have at least one match.
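As a rough illustration of the three return types, the sketch below runs matchPDict, countPDict and whichPDict on a tiny hand-made dictionary. The dictionary and the subject are toy values assumed for illustration only; they are not part of the original examples.

```r
library(Biostrings)

## Toy dictionary of 3 patterns, all of width 4 (hypothetical data).
dict <- DNAStringSet(c("ACGT", "GGCA", "TTTT"))
pdict <- PDict(dict)                      # preprocess the dictionary
subject <- DNAString("ACGTACGTGGCA")      # toy reference sequence

## "Where": the match positions, one entry per pattern.
mi <- matchPDict(pdict, subject)
starts_acgt <- startIndex(mi)[[1]]        # starts of the 1st pattern

## "How many times": one count per pattern.
n_hits <- countPDict(pdict, subject)

## "Who": indices of the patterns with at least one match.
who <- whichPDict(pdict, subject)

## The three results describe the same set of matches.
stopifnot(all(n_hits == lengths(startIndex(mi))))
stopifnot(all(who == which(n_hits != 0L)))
```

When only the counts or the set of matching patterns is needed, countPDict or whichPDict avoids building the full MIndex object.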
vcountPDict and vwhichPDict are vectorized versions of countPDict and whichPDict, respectively, that is, they work on a set of reference sequences in a vectorized fashion.
This man page shows how to use these functions (aka the *PDict functions) for exact matching of a constant width dictionary i.e. a dictionary where all the patterns have the same length (same number of nucleotides).
See ?`matchPDict-inexact` for how to use these functions for inexact matching or when the original dictionary has a variable width.
----------Usage----------
matchPDict(pdict, subject,
           max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
           algorithm="auto", verbose=FALSE)
countPDict(pdict, subject,
           max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
           algorithm="auto", verbose=FALSE)
whichPDict(pdict, subject,
           max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
           algorithm="auto", verbose=FALSE)
vcountPDict(pdict, subject,
            max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
            algorithm="auto", collapse=FALSE, weight=1L,
            verbose=FALSE, ...)
vwhichPDict(pdict, subject,
            max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE,
            algorithm="auto", verbose=FALSE)
----------Arguments----------
Argument: pdict
A PDict object containing the preprocessed dictionary. All these functions also work with a dictionary that has not been preprocessed (in other words, the pdict argument can receive an XStringSet object). Of course, it won't be as fast as with a preprocessed dictionary, but it will generally be slightly faster than using matchPattern/countPattern or vmatchPattern/vcountPattern in a "lapply/sapply loop", because, here, looping is done at the C-level. However, by using a non-preprocessed dictionary, many of the restrictions that apply to preprocessed dictionaries don't apply anymore. For example, the dictionary doesn't need to be rectangular or to be a DNAStringSet object: it can be any type of XStringSet object and have a variable width.
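A minimal sketch of the non-preprocessed case, assuming a toy variable-width dictionary (the sequences below are made up for illustration). It passes a plain DNAStringSet as pdict and checks the result against an explicit R-level loop over countPattern.

```r
library(Biostrings)

## Toy variable-width dictionary (hypothetical data). It is not
## preprocessed, so it is passed as a plain DNAStringSet.
dict_var <- DNAStringSet(c("AC", "CGT", "ACGTA"))
subject <- DNAString("ACGTACGT")

## The looping over patterns happens inside countPDict():
counts_direct <- countPDict(dict_var, subject)

## Same counts with an explicit R-level loop over countPattern():
counts_loop <- sapply(seq_along(dict_var),
                      function(i) countPattern(dict_var[[i]], subject))

stopifnot(all(counts_direct == counts_loop))
```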
Argument: subject
An XString or MaskedXString object containing the subject sequence for matchPDict, countPDict and whichPDict. An XStringSet object containing the subject sequences for vcountPDict and vwhichPDict. For now, only subjects of base class DNAString are supported.
Argument: max.mismatch, min.mismatch
The maximum and minimum number of mismatching letters allowed (see ?isMatchingAt for the details). This man page focuses on exact matching of a constant width dictionary so max.mismatch=0 in the examples below. See ?`matchPDict-inexact` for inexact matching.
Argument: with.indels
Only supported by countPDict, whichPDict, vcountPDict and vwhichPDict at the moment, and only when the input dictionary is non-preprocessed (i.e. XStringSet). If TRUE then indels are allowed. In that case, min.mismatch must be 0 and max.mismatch is interpreted as the maximum "edit distance" allowed between any pattern and any of its matches. See ?`matchPattern` for more information.
Argument: fixed
Whether IUPAC ambiguity codes should be interpreted literally or not (see ?isMatchingAt for more information). This man page focuses on exact matching of a constant width dictionary so fixed=TRUE in the examples below. See ?`matchPDict-inexact` for inexact matching.
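To make the effect of fixed more concrete, here is a small sketch using a non-preprocessed dictionary that contains the ambiguity code N. The sequences are toy values assumed for illustration.

```r
library(Biostrings)

## Non-preprocessed toy dictionary; the 1st pattern contains
## the IUPAC ambiguity code N (hypothetical data).
dict_n <- DNAStringSet(c("ACN", "ACG"))
subject <- DNAString("ACGACT")

## fixed=TRUE (the default): N is taken literally, so "ACN" can
## only match a literal "ACN" in the subject.
counts_literal <- countPDict(dict_n, subject, fixed=TRUE)

## fixed=FALSE: N stands for any base, so "ACN" can match
## both "ACG" and "ACT".
counts_iupac <- countPDict(dict_n, subject, fixed=FALSE)

counts_literal
counts_iupac
```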
Argument: algorithm
Ignored if pdict is a preprocessed dictionary (i.e. a PDict object). Otherwise, can be one of the following: "auto", "naive-exact", "naive-inexact", "boyer-moore" or "shift-or". See ?matchPattern for more information. Note that "indels" is not supported for now.
Argument: verbose
TRUE or FALSE.
Argument: collapse, weight
collapse must be FALSE, 1, or 2. If collapse=FALSE (the default), then weight is ignored and vcountPDict returns the full matrix of counts (M0). If collapse=1, then M0 is collapsed "horizontally", i.e. it is turned into a vector with length equal to length(pdict). If weight=1L (the default), then this vector is defined by rowSums(M0). If collapse=2, then M0 is collapsed "vertically", i.e. it is turned into a vector with length equal to length(subject). If weight=1L (the default), then this vector is defined by colSums(M0). If collapse=1 or collapse=2, then the elements in subject (collapse=1) or in pdict (collapse=2) can be weighted through the weight argument. In that case, the returned vector is defined by M0 %*% rep(weight, length.out=length(subject)) and rep(weight, length.out=length(pdict)) %*% M0, respectively.
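The relationship between collapse, weight and the full count matrix M0 can be checked on a toy case. The sketch below assumes a small made-up dictionary and subject set, and a hypothetical weight vector w with one weight per subject sequence.

```r
library(Biostrings)

## Toy data (hypothetical): 3 patterns of width 3, 3 subject sequences.
pdict <- PDict(DNAStringSet(c("ACG", "CGT", "TTT")))
subject <- DNAStringSet(c("ACGT", "TTTT", "ACGACG"))

M0 <- vcountPDict(pdict, subject)                       # full count matrix
per_pattern <- vcountPDict(pdict, subject, collapse=1)  # one value per pattern
per_subject <- vcountPDict(pdict, subject, collapse=2)  # one value per subject
stopifnot(all(per_pattern == rowSums(M0)))
stopifnot(all(per_subject == colSums(M0)))

## Weighting the subject sequences when collapsing "horizontally":
w <- c(1, 0.5, 2)                                       # one weight per subject
weighted <- vcountPDict(pdict, subject, collapse=1, weight=w)
stopifnot(isTRUE(all.equal(as.vector(weighted), as.vector(M0 %*% w))))
```

Collapsing inside vcountPDict avoids allocating M0 at all, which matters when length(pdict) * length(subject) is too large for a matrix.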
Argument: ...
Additional arguments for methods.
----------Details----------
In this man page, we assume that you know how to preprocess a dictionary of DNA patterns that can then be used with any of the *PDict functions described here. Please see ?PDict if you don't.
When using the *PDict functions for exact matching of a constant width dictionary, the standard way to preprocess the original dictionary is by calling the PDict constructor on it with no extra arguments. This returns the preprocessed dictionary in a PDict object that can be used with any of the *PDict functions.
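A minimal sketch of this standard preprocessing step, assuming a toy constant-width dictionary (the sequences are made up for illustration):

```r
library(Biostrings)

## Toy constant-width dictionary (hypothetical data).
dict <- DNAStringSet(c("ACGTAC", "GGCATT", "TTAGGC"))
stopifnot(length(unique(width(dict))) == 1L)   # all patterns have width 6

## Standard preprocessing: call PDict() with no extra arguments.
pdict <- PDict(dict)
length(pdict)                                  # number of patterns

## The same PDict object can then be reused with any *PDict function.
subject <- DNAString("TTACGTACGGCATT")
n_hits <- countPDict(pdict, subject)
who <- whichPDict(pdict, subject)
n_hits
who
```

Preprocessing is done once; the resulting object can be matched against as many subjects as needed.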
----------Value----------
If M denotes the number of patterns in the pdict argument (M <- length(pdict)), then matchPDict returns an MIndex object of length M, and countPDict an integer vector of length M.
whichPDict returns an integer vector made of the indices of the patterns in the pdict argument that have at least one match.
If N denotes the number of sequences in the subject argument (N <- length(subject)), then vcountPDict returns an integer matrix with M rows and N columns, unless the collapse argument is used. In that case, depending on the type of weight, an integer or numeric vector is returned (see above for the details).
vwhichPDict returns a list of N integer vectors.
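The shapes described above can be checked with a short sketch. The dictionary and subjects are toy values assumed for illustration; M and N are as defined in the text.

```r
library(Biostrings)

## Toy data (hypothetical): M = 3 patterns, N = 2 subject sequences.
pdict <- PDict(DNAStringSet(c("ACG", "CGT", "TTT")))
one_subject <- DNAString("ACGTTT")
many_subjects <- DNAStringSet(c("ACGT", "TTTT"))
M <- length(pdict)
N <- length(many_subjects)

mi <- matchPDict(pdict, one_subject)        # MIndex object of length M
stopifnot(is(mi, "MIndex"), length(mi) == M)

cnt <- countPDict(pdict, one_subject)       # integer vector of length M
stopifnot(is.integer(cnt), length(cnt) == M)

mat <- vcountPDict(pdict, many_subjects)    # M x N integer matrix
stopifnot(is.matrix(mat), all(dim(mat) == c(M, N)))

vw <- vwhichPDict(pdict, many_subjects)     # list of N integer vectors
stopifnot(is.list(vw), length(vw) == N)
```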
----------Author(s)----------
H. Pagès
----------References----------
Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.
----------See Also----------
PDict-class, MIndex-class, matchPDict-inexact, isMatchingAt, coverage,MIndex-method, matchPattern, alphabetFrequency, DNAStringSet-class, XStringViews-class, MaskedDNAString-class
----------Examples----------
## ---------------------------------------------------------------------
## A. A SIMPLE EXAMPLE OF EXACT MATCHING
## ---------------------------------------------------------------------

## Creating the pattern dictionary:
library(drosophila2probe)
dict0 <- DNAStringSet(drosophila2probe)
dict0  # The original dictionary.
length(dict0)  # Hundreds of thousands of patterns.
pdict0 <- PDict(dict0)  # Store the original dictionary in
                        # a PDict object (preprocessing).

## Using the pattern dictionary on chromosome 3R:
library(BSgenome.Dmelanogaster.UCSC.dm3)
chr3R <- Dmelanogaster$chr3R  # Load chromosome 3R
chr3R
mi0 <- matchPDict(pdict0, chr3R)  # Search...

## Looking at the matches:
start_index <- startIndex(mi0)  # Get the start index.
length(start_index)  # Same as the original dictionary.
start_index[[8220]]  # Starts of the 8220th pattern.
end_index <- endIndex(mi0)  # Get the end index.
end_index[[8220]]  # Ends of the 8220th pattern.
count_index <- countIndex(mi0)  # Get the number of matches per pattern.
count_index[[8220]]
mi0[[8220]]  # Get the matches for the 8220th pattern.
start(mi0[[8220]])  # Equivalent to startIndex(mi0)[[8220]].
sum(count_index)  # Total number of matches.
table(count_index)
i0 <- which(count_index == max(count_index))
pdict0[[i0]]  # The pattern with most occurrences.
mi0[[i0]]  # Its matches as an IRanges object.
Views(chr3R, mi0[[i0]])  # And as an XStringViews object.

## Get the coverage of the original subject:
cov3R <- as.integer(coverage(mi0, width=length(chr3R)))
max(cov3R)
mean(cov3R)
sum(cov3R != 0) / length(cov3R)  # Only 2.44% of chr3R is covered.

if (interactive()) {
    plotCoverage <- function(cx, start, end)
    {
        plot.new()
        plot.window(c(start, end), c(0, 20))
        axis(1)
        axis(2)
        axis(4)
        lines(start:end, cx[start:end], type="l")
    }
    plotCoverage(cov3R, 27600000, 27900000)
}

## ---------------------------------------------------------------------
## B. NAMING THE PATTERNS
## ---------------------------------------------------------------------

## The names of the original patterns, if any, are propagated to the
## PDict and MIndex objects:
names(dict0) <- mkAllStrings(letters, 4)[seq_len(length(dict0))]
dict0
dict0[["abcd"]]
pdict0n <- PDict(dict0)
names(pdict0n)[1:30]
pdict0n[["abcd"]]
mi0n <- matchPDict(pdict0n, chr3R)
names(mi0n)[1:30]
mi0n[["abcd"]]

## This is particularly useful when unlisting an MIndex object:
unlist(mi0)[1:10]
unlist(mi0n)[1:10]  # keep track of where the matches are coming from

## ---------------------------------------------------------------------
## C. PERFORMANCE
## ---------------------------------------------------------------------

## If only the number of matches matters (and not their positions),
## then countPDict() will be faster, especially when there is a high
## number of matches:
count_index0 <- countPDict(pdict0, chr3R)
stopifnot(identical(count_index0, count_index))

if (interactive()) {
    ## What's the impact of the dictionary width on performance?
    ## Below is some code that can be used to figure it out (it will
    ## take a long time to run). For different widths of the original
    ## dictionary, we look at:
    ##   o pptime: preprocessing time (in sec.) i.e. time needed for
    ##             building the PDict object from the truncated input
    ##             sequences;
    ##   o nnodes: nb of nodes in the resulting Aho-Corasick tree;
    ##   o nupatt: nb of unique truncated input sequences;
    ##   o matchtime: time (in sec.) needed to find all the matches;
    ##   o totalcount: total number of matches.
    getPDictStats <- function(dict, subject)
    {
        ans_width <- width(dict[1])
        ans_pptime <- system.time(pdict <- PDict(dict))[["elapsed"]]
        pptb <- pdict@threeparts@pptb
        ans_nnodes <- nnodes(pptb)
        ans_nupatt <- sum(!duplicated(pdict))
        ans_matchtime <- system.time(
            mi0 <- matchPDict(pdict, subject)
        )[["elapsed"]]
        ans_totalcount <- sum(countIndex(mi0))
        list(
            width=ans_width,
            pptime=ans_pptime,
            nnodes=ans_nnodes,
            nupatt=ans_nupatt,
            matchtime=ans_matchtime,
            totalcount=ans_totalcount
        )
    }
    stats <- lapply(8:25,
                    function(width)
                        getPDictStats(DNAStringSet(dict0, end=width), chr3R))
    stats <- data.frame(do.call(rbind, stats))
    stats
}

## ---------------------------------------------------------------------
## D. USING A NON-PREPROCESSED DICTIONARY
## ---------------------------------------------------------------------

dict3 <- DNAStringSet(mkAllStrings(DNA_BASES, 3))  # all trinucleotides
dict3
pdict3 <- PDict(dict3)

## The 3 following calls are equivalent (from faster to slower):
res3a <- countPDict(pdict3, chr3R)
res3b <- countPDict(dict3, chr3R)
res3c <- sapply(dict3,
                function(pattern) countPattern(pattern, chr3R))
stopifnot(identical(res3a, res3b))
stopifnot(identical(res3a, res3c))

## One reason for using a non-preprocessed dictionary is to get rid of
## all the constraints associated with preprocessing, e.g., when
## preprocessing with PDict(), the input dictionary must
## be DNA and a Trusted Band must be defined (explicitly or implicitly).
## See ?PDict for more information about these constraints.
## In particular, using a non-preprocessed dictionary can be
## useful for the kind of inexact matching that can't be achieved
## with a PDict object (if performance is not an issue).
## See ?`matchPDict-inexact` for more information about
## inexact matching.
dictD <- xscat(dict3, "N", reverseComplement(dict3))

## The 2 following calls are equivalent (from faster to slower):
resDa <- matchPDict(dictD, chr3R, fixed=FALSE)
resDb <- sapply(dictD,
                function(pattern)
                    matchPattern(pattern, chr3R, fixed=FALSE))
stopifnot(all(sapply(seq_len(length(dictD)),
                     function(i)
                         identical(resDa[[i]], as(resDb[[i]], "IRanges")))))

## ---------------------------------------------------------------------
## E. vcountPDict()
## ---------------------------------------------------------------------

subject <- Dmelanogaster$upstream1000[1:100]
subject
mat1 <- vcountPDict(pdict0, subject)
dim(mat1)  # length(pdict0) x length(subject)
nhit_per_probe <- rowSums(mat1)
table(nhit_per_probe)

## Without vcountPDict(), 'mat1' could have been computed with:
mat2 <- sapply(unname(subject), function(x) countPDict(pdict0, x))
stopifnot(identical(mat1, mat2))
## but using vcountPDict() is faster (10x or more, depending on the
## average length of the sequences in 'subject').

if (interactive()) {
    ## This will fail (with message "allocMatrix: too many elements
    ## specified") because, on most platforms, vectors and matrices in R
    ## are limited to 2^31 elements:
    subject <- Dmelanogaster$upstream1000
    vcountPDict(pdict0, subject)
    length(pdict0) * length(Dmelanogaster$upstream1000)
    1 * length(pdict0) * length(Dmelanogaster$upstream1000)  # > 2^31

    ## But this will work:
    nhit_per_seq <- vcountPDict(pdict0, subject, collapse=2)
    sum(nhit_per_seq >= 1)  # nb of subject sequences with at least 1 hit
    table(nhit_per_seq)
    which(nhit_per_seq == 37)  # 603
    sum(countPDict(pdict0, subject[[603]]))  # 37
}

## ---------------------------------------------------------------------
## F. RELATIONSHIP BETWEEN vcountPDict(), countPDict() AND
## vcountPattern()
## ---------------------------------------------------------------------

pdict3 <- PDict(dict3)
subject <- Dmelanogaster$upstream1000
subject

## The 4 following calls are equivalent (from faster to slower):
mat3a <- vcountPDict(pdict3, subject)
mat3b <- vcountPDict(dict3, subject)
mat3c <- sapply(dict3,
                function(pattern) vcountPattern(pattern, subject))
mat3d <- sapply(unname(subject),
                function(x) countPDict(pdict3, x))
stopifnot(identical(mat3a, mat3b))
stopifnot(identical(mat3a, t(mat3c)))
stopifnot(identical(mat3a, mat3d))

## The 3 following calls are equivalent (from faster to slower):
nhitpp3a <- vcountPDict(pdict3, subject, collapse=1)  # rowSums(mat3a)
nhitpp3b <- vcountPDict(dict3, subject, collapse=1)
nhitpp3c <- sapply(dict3,
                   function(pattern) sum(vcountPattern(pattern, subject)))
stopifnot(identical(nhitpp3a, nhitpp3b))
stopifnot(identical(nhitpp3a, nhitpp3c))

## The 3 following calls are equivalent (from faster to slower):
nhitps3a <- vcountPDict(pdict3, subject, collapse=2)  # colSums(mat3a)
nhitps3b <- vcountPDict(dict3, subject, collapse=2)
nhitps3c <- sapply(unname(subject),
                   function(x) sum(countPDict(pdict3, x)))
stopifnot(identical(nhitps3a, nhitps3b))
stopifnot(identical(nhitps3a, nhitps3c))

## ---------------------------------------------------------------------
## G. vwhichPDict()
## ---------------------------------------------------------------------

## The 4 following calls are equivalent (from faster to slower):
vwp3a <- vwhichPDict(pdict3, subject)
vwp3b <- vwhichPDict(dict3, subject)
vwp3c <- lapply(seq_len(ncol(mat3a)), function(j) which(mat3a[ , j] != 0L))
vwp3d <- lapply(unname(subject), function(x) whichPDict(pdict3, x))
stopifnot(identical(vwp3a, vwp3b))
stopifnot(identical(vwp3a, vwp3c))
stopifnot(identical(vwp3a, vwp3d))
table(sapply(vwp3a, length))
which.min(sapply(vwp3a, length))

## Get the trinucleotides not represented in reference sequence 9181:
dict3[-vwp3a[[9181]]]  # 21 trinucleotides

## ---------------------------------------------------------------------
## H. MAPPING PROBE SET IDS BETWEEN CHIPS WITH vwhichPDict()
## ---------------------------------------------------------------------

## Here we show a simple (and very naive) algorithm for mapping probe
## set IDs between the hgu95av2 and hgu133a chips (Affymetrix).
## 2 probe set IDs are considered mapped iff they share at least one
## probe.
## WARNING: This example takes about 25 minutes to run.
if (interactive()) {
    library(hgu95av2probe)
    library(hgu133aprobe)
    probes1 <- DNAStringSet(hgu95av2probe)
    probes2 <- DNAStringSet(hgu133aprobe)
    pdict2 <- PDict(probes2)

    ## Get the mapping from probes1 to probes2 (based on exact matching):
    map1to2 <- vwhichPDict(pdict2, probes1)  # takes about 10 minutes

    ## The following helper function uses the probe level mapping to induce
    ## the mapping at the probe set IDs level (from hgu95av2 to hgu133a).
    ## To keep things simple, 2 probe set IDs are considered mapped iff
    ## each of them contains at least one probe mapped to one probe of
    ## the other:
    mapProbeSetIDs1to2 <- function(psID)
        unique(hgu133aprobe$Probe.Set.Name[unlist(
            map1to2[hgu95av2probe$Probe.Set.Name == psID]
        )])

    ## Use the helper function to build the complete mapping:
    psIDs1 <- unique(hgu95av2probe$Probe.Set.Name)
    mapPSIDs1to2 <- lapply(psIDs1, mapProbeSetIDs1to2)  # about 3 min.
    names(mapPSIDs1to2) <- psIDs1

    ## Do some basic stats:
    table(sapply(mapPSIDs1to2, length))

    ## [ADVANCED USERS ONLY]
    ## An alternative that is slightly faster is to put all the probes
    ## (hgu95av2 + hgu133a) in a single PDict object and then query its
    ## 'dups0' slot directly. This slot is a Dups object containing the
    ## mapping between duplicated patterns.
    ## Note that we can do this only because all the probes have the
    ## same length (25) and because we are doing exact matching:
    probes12 <- DNAStringSet(c(hgu95av2probe$sequence, hgu133aprobe$sequence))
    pdict12 <- PDict(probes12)
    dups0 <- pdict12@dups0
    mapProbeSetIDs1to2alt <- function(psID)
    {
        ii1 <- unique(togroup(dups0, which(hgu95av2probe$Probe.Set.Name == psID)))
        ii2 <- members(dups0, ii1) - length(probes1)
        ii2 <- ii2[ii2 >= 1L]
        unique(hgu133aprobe$Probe.Set.Name[ii2])
    }
    mapPSIDs1to2alt <- lapply(psIDs1, mapProbeSetIDs1to2alt)  # about 10 min.
    names(mapPSIDs1to2alt) <- psIDs1

    ## 'mapPSIDs1to2alt' and 'mapPSIDs1to2' contain the same mapping:
    stopifnot(identical(lapply(mapPSIDs1to2alt, sort),
                        lapply(mapPSIDs1to2, sort)))
}