getSeq-methods(BSgenome)
getSeq-methods() belongs to R package: BSgenome
getSeq method for BSgenome objects
Translator: 生物统计家园网, robot LoveR
----------Description----------
A getSeq method for extracting a set of sequences (or subsequences) from a BSgenome object.
----------Usage----------
## S4 method for signature 'BSgenome'
getSeq(x, names, start=NA, end=NA, width=NA,
       strand="+", as.character=FALSE)
----------Arguments----------
Argument: x
A BSgenome object. See the available.genomes function for how to install a genome.
Argument: names
A character vector containing the names of the sequences in x from which to extract the subsequences, or a GRanges object, a RangedData object, a named RangesList object, or a named Ranges object. A RangesList or Ranges object must be named after the sequences in x from which to extract the subsequences. If names is missing, then seqnames(x) is used. See ?`BSgenome-class` for details on how to get the lists of single sequences and multiple sequences (respectively) contained in a BSgenome object.
Argument: start, end, width
Integer vectors (possibly containing NAs) specifying the locations of the subsequences to extract. They are not needed (and it is therefore an error to supply them) when names is a GRanges, RangedData, RangesList or Ranges object.
Argument: strand
A vector containing "+"s and/or "-"s. It is not needed (and it is therefore an error to supply it) when names is a GRanges object or a RangedData object with a strand column.
Argument: as.character
TRUE or FALSE. Should the extracted sequences be returned in a standard character vector?
Argument: ...
Additional arguments. (Currently ignored.)
----------Details----------
L, the number of sequences to extract, is determined as follows:
If names is a GRanges or Ranges object then L = length(names).
If names is a RangedData object then L = nrow(names).
If names is a RangesList object then L = length(unlist(names)).
Otherwise, L is the length of the longest of names, start, end and width, and all these arguments are recycled to this length. NAs and negative values in start, end and width are resolved according to the rules of the SEW (Start/End/Width) interface (see ?solveUserSEW for the details).
If names is neither a GRanges object nor a RangedData object with a strand column, then the strand argument is also recycled to length L.
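The recycling and Start/End/Width rules described above can be sketched in base R. This is only an illustration: resolve_sew() is a hypothetical helper, not a BSgenome or IRanges function, and it assumes the convention suggested by the examples in this page, where a negative coordinate counts back from the end of the sequence (-1 being the last position). It does no validity checking, unlike the real solveUserSEW().

```r
# Hypothetical helper (not part of BSgenome or IRanges): a simplified
# illustration of argument recycling plus Start/End/Width resolution.
resolve_sew <- function(seqlen, start = NA, end = NA, width = NA) {
  # L is the length of the longest argument; everything is recycled to it.
  L <- max(length(seqlen), length(start), length(end), length(width))
  seqlen <- rep_len(as.integer(seqlen), L)
  start  <- rep_len(as.integer(start), L)
  end    <- rep_len(as.integer(end), L)
  width  <- rep_len(as.integer(width), L)

  # Negative coordinates count back from the end: -1 is the last position.
  neg <- !is.na(start) & start < 0L
  start[neg] <- seqlen[neg] + start[neg] + 1L
  neg <- !is.na(end) & end < 0L
  end[neg] <- seqlen[neg] + end[neg] + 1L

  # Fill in whatever is still missing, using width when it is supplied.
  fill <- is.na(start) & !is.na(end) & !is.na(width)
  start[fill] <- end[fill] - width[fill] + 1L
  start[is.na(start)] <- 1L
  fill <- is.na(end) & !is.na(width)
  end[fill] <- start[fill] + width[fill] - 1L
  fill <- is.na(end)
  end[fill] <- seqlen[fill]

  data.frame(start = start, end = end, width = end - start + 1L)
}

# First 20 and last 20 positions of two sequences of (made-up) length 100:
resolve_sew(c(100, 100), start = c(NA, -20), end = c(20, NA))
```

Here the two rows resolve to positions 1-20 and 81-100, which mirrors what getSeq(Celegans, end=20) and getSeq(Celegans, start=-20) request for each chromosome.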
Here is how the lookup between the names passed to the names argument and the sequences in x is performed. For each name in names:
(1): If x contains a single sequence with that name then this sequence is used for extraction;
(2): Otherwise the names of all the elements in all the multiple sequences are searched. If the names argument is a character vector then name is treated as a regular expression and grep is used for this search, otherwise (i.e. when the names are supplied via a higher level object like GRanges) name must match exactly the name of the sequence. If exactly one sequence is found, then it is used for extraction, otherwise an error is raised.
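For the character vector case, the two lookup steps above can be sketched as follows. lookup_sequence() is a hypothetical helper written for illustration only, and the name vectors are toy data, not the actual content of any genome package. The case where names are supplied via a higher level object is deliberately left out.

```r
# Hypothetical helper, not a BSgenome function: mimics the two-step
# lookup for a name supplied in a character vector.
lookup_sequence <- function(name, single_names, multi_names) {
  # Step (1): a single sequence with exactly that name is used directly.
  if (name %in% single_names) {
    return(name)
  }
  # Step (2): otherwise the name is used as a regular expression against
  # the names of all the elements of the multiple sequences.
  hits <- grep(name, multi_names, value = TRUE)
  if (length(hits) != 1L) {
    stop("sequence ", name, ": expected exactly 1 match, found ",
         length(hits))
  }
  hits
}

# Toy data (assumed for the sake of the example):
singles <- c("chrI", "chrV", "chrM")
multis  <- c("NM_058280_up_1000", "NM_058280_up_2000", "NM_058280_up_5000")

lookup_sequence("chrV", singles, multis)               # step (1)
lookup_sequence("NM_058280_up_1000", singles, multis)  # step (2), one hit
# lookup_sequence("NM_058280", singles, multis)    # error: 3 matches
# lookup_sequence("^NM_058280$", singles, multis)  # error: 0 matches
```

The two commented-out calls correspond to the failing getSeq() calls shown in the Examples section: a pattern that matches several element names, or none, cannot be resolved to a single sequence.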
----------Value----------
A character vector of length L when as.character=TRUE.
A DNAString or DNAStringSet object when as.character=FALSE (the default). More precisely the returned value is a DNAString object if L = 1 and names is not a GRanges, RangedData, RangesList or Ranges object. Otherwise it's a DNAStringSet object.
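The rule for the class of the returned value can be restated as a small decision function. expected_class() is a hypothetical helper that only encodes the two paragraphs above; it does not call getSeq() and its argument names are assumptions made for this sketch.

```r
# Hypothetical helper (not part of BSgenome): restates the rule above.
# 'names_is_ranges' stands for "names is a GRanges, RangedData,
# RangesList or Ranges object".
expected_class <- function(L, names_is_ranges, as.character = FALSE) {
  if (as.character) {
    return("character")
  }
  if (L == 1L && !names_is_ranges) "DNAString" else "DNAStringSet"
}

expected_class(1L, names_is_ranges = FALSE)  # e.g. getSeq(Celegans, "chrV")
expected_class(1L, names_is_ranges = TRUE)   # e.g. a GRanges of length 1
expected_class(8L, names_is_ranges = FALSE, as.character = TRUE)
```

The practical consequence is that code receiving the result of getSeq() with a single name in a character vector should be prepared for a DNAString rather than a DNAStringSet of length 1.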
----------Note----------
Be aware that using as.character=TRUE can be very inefficient when extracting a large amount of DNA sequence (e.g. millions of short sequences or a small number of very long sequences).
Note that the masks in x, if any, are always ignored. In other words, masked regions in the genome are extracted in the same way as unmasked regions (this is achieved by dropping the masks before extraction). See ?`MaskedDNAString-class` for more information about masked DNA sequences.
----------Author(s)----------
H. Pages; improvements suggested by Matt Settles and others
----------See Also----------
getSeq, available.genomes, BSgenome-class, DNAString-class, DNAStringSet-class, MaskedDNAString-class, GRanges-class, RangedData-class, RangesList-class, Ranges-class, grep
----------Examples----------
## ---------------------------------------------------------------------
## A. SIMPLE EXAMPLES
## ---------------------------------------------------------------------
## Load the Caenorhabditis elegans genome (UCSC Release ce2):
library(BSgenome.Celegans.UCSC.ce2)
## Look at the index of sequences:
Celegans
## Get chromosome V as a DNAString object:
getSeq(Celegans, "chrV")
## which is in fact the same as doing:
Celegans$chrV
## Not run:
## Never try this:
getSeq(Celegans, "chrV", as.character=TRUE)
## or this (even worse):
getSeq(Celegans, as.character=TRUE)
## End(Not run)
## Get the first 20 bases of each chromosome:
getSeq(Celegans, end=20)
## Get the last 20 bases of each chromosome:
getSeq(Celegans, start=-20)
## Get the "NM_058280_up_1000" sequence (belongs to the upstream1000
## multiple sequence) as a DNAString object:
s1 <- getSeq(Celegans, "NM_058280_up_1000")
stopifnot(identical(getSeq(Celegans, "NM_058280_up_5000", start=-1000), s1))
## Not run:
## Fails because there is more than one sequence across
## Celegans$upstream1000, Celegans$upstream2000 and Celegans$upstream5000
## with "NM_058280" in its name:
getSeq(Celegans, "NM_058280")
## Fails because there is no sequence named exactly "NM_058280":
getSeq(Celegans, "^NM_058280$")
## End(Not run)
## ---------------------------------------------------------------------
## B. EXTRACTING SMALL SEQUENCES FROM DIFFERENT CHROMOSOMES
## ---------------------------------------------------------------------
myseqs <- data.frame(
chr=c("chrI", "chrX", "chrM", "chrM", "chrX", "chrI", "chrM", "chrI"),
start=c(NA, -40, 8510, 301, 30001, 9220500, -2804, -30),
end=c(50, NA, 8522, 324, 30011, 9220555, -2801, -11),
strand=c("+", "-", "+", "+", "-", "-", "+", "-")
)
getSeq(Celegans, myseqs$chr,
start=myseqs$start, end=myseqs$end)
getSeq(Celegans, myseqs$chr,
start=myseqs$start, end=myseqs$end, strand=myseqs$strand)
## ---------------------------------------------------------------------
## C. USING A GRanges OBJECT
## ---------------------------------------------------------------------
gr1 <- GRanges(seqnames=c("chrI", "chrI", "chrM"),
ranges=IRanges(start=101:103, width=9))
gr1  # all strand values are "*"
getSeq(Celegans, gr1)  # treats strand values as if they were "+"
strand(gr1)[] <- "-"
getSeq(Celegans, gr1)
strand(gr1)[1] <- "+"
getSeq(Celegans, gr1)
strand(gr1)[2] <- "*"
if (interactive())
  getSeq(Celegans, gr1)  # Error: cannot mix "*" with other strand values
gr2 <- GRanges(seqnames=c("chrM", "NM_058280_up_1000"),
ranges=IRanges(start=103:102, width=9))
gr2
if (interactive()) {
  ## Because the sequence names are supplied via a GRanges object, they
  ## are not treated as regular expressions:
  getSeq(Celegans, gr2)  # Error: sequence NM_058280_up_1000 not found
}
## ---------------------------------------------------------------------
## D. EXTRACTING A HIGH NUMBER OF RANDOM 40-MERS FROM A GENOME
## ---------------------------------------------------------------------
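extractRandomReads <- function(x, density, readlength)
{
    if (!is.integer(readlength))
        readlength <- as.integer(readlength)
    ## For each sequence, draw random read start positions (with
    ## replacement) such that every read fits within the sequence.
    start <- lapply(seqnames(x),
                    function(name)
                    {
                        seqlength <- seqlengths(x)[name]
                        sample(seqlength - readlength + 1L,
                               seqlength * density,
                               replace=TRUE)
                    })
    ## lengths() is used here in place of elementLengths(), which is no
    ## longer available in current Bioconductor releases.
    names <- rep.int(seqnames(x), lengths(start))
    ranges <- IRanges(start=unlist(start), width=readlength)
    ## Assign each read to a randomly chosen strand.
    strand <- strand(sample(c("+", "-"), length(names), replace=TRUE))
    gr <- GRanges(seqnames=names, ranges=ranges, strand=strand)
    getSeq(x, gr)
}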
## With a density of 1 read every 100 genome bases, the total number of
## extracted 40-mers is about 1 million:
rndreads <- extractRandomReads(Celegans, 0.01, 40)
## Notes:
## - The short sequences in 'rndreads' can be seen as the result of a
##   simulated high-throughput sequencing experiment. A non-realistic
##   one though because:
##     (a) It assumes that the underlying technology is perfect (the
##         generated reads have no technology induced errors).
##     (b) It assumes that the sequenced genome is exactly the same as
##         the reference genome.
##     (c) The simulated reads can contain IUPAC ambiguity letters only
##         because the reference genome contains them. In a real
##         high-throughput sequencing experiment, the sequenced genome
##         of course doesn't contain those letters, but the sequencer
##         can introduce them in the generated reads to indicate
##         ambiguous base-calling.
## - Those reads are coming from the plus and minus strands of the
##   chromosomes.
## - With a density of 0.01 and the reads being only 40 bases long, the
##   average coverage of the genome is only 0.4, which is low. The total
##   number of reads is about 1 million and it takes less than 10 sec.
##   to generate them.
## - A higher coverage can be achieved by using a higher density and/or
##   longer reads. For example, with a density of 0.1 and 100-base reads
##   the average coverage is 10. The total number of reads is about 10
##   million and it takes less than 1 minute to generate them.
## - Those reads could easily be mapped back to the reference by using
##   an efficient matching tool like matchPDict() for performing exact
##   matching (see ?matchPDict for more information). Typically, a
##   small percentage of the reads (4 to 5% in our case) will hit the
##   reference at multiple locations. This is especially true for such
##   short reads, and, in a lower proportion, is still true for longer
##   reads, even for reads as long as 300 bases.
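The read counts and coverage figures quoted in the notes above follow from simple arithmetic: the number of reads is roughly the genome size times the density, and the average coverage is the density times the read length. A minimal sketch, assuming a round genome size of 100 million bases (an approximation chosen for illustration, not a value read from the package) and a hypothetical helper read_stats():

```r
# Assumed round figure for the genome size, in bases (illustration only).
genome_size <- 1e8

# Hypothetical helper: expected number of reads and average coverage.
read_stats <- function(density, readlength, genome_size) {
  n_reads <- genome_size * density
  c(n_reads = n_reads, coverage = n_reads * readlength / genome_size)
}

read_stats(0.01, 40, genome_size)   # about 1 million reads, coverage 0.4
read_stats(0.1, 100, genome_size)   # about 10 million reads, coverage 10
```

Because the genome size cancels out in the coverage, the coverage figures do not depend on the assumed size; only the read counts do.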
Please credit the source when reposting: 生物统计家园网 (http://www.biostatistic.net).