BSgenome-class(BSgenome)
BSgenome-class() belongs to R package: BSgenome
BSgenome objects
----------Description----------
The BSgenome class is a container for the complete genome sequence of a given organism.
----------Accessor methods----------
In the code snippets below, x is a BSgenome object. Note that, because the BSgenome class contains the GenomeDescription class, all the accessor methods for GenomeDescription objects can also be used on x.
sourceUrl(x): Returns the source URL, i.e., the permanent URL of the location where the FASTA files used to produce the sequences contained in x can be found (and downloaded).
seqinfo(x): Gets the names, lengths, and circularity flags of the single sequences contained in x. All this information is returned in a Seqinfo object. Each part of this information can be retrieved separately with seqnames(x), seqlengths(x), and isCircular(x), respectively, as described below.
seqnames(x): Returns the names of the single sequences contained in x. Each single sequence is stored in a DNAString or MaskedDNAString object and typically comes from a source file (FASTA) with a single record. The names returned by seqnames(x) usually reflect the names of those source files, but a common prefix or suffix may have been removed in order to keep them as short as possible.
seqlengths(x): Returns the lengths of the single sequences contained in x.
See ?`length,XVector-method` and ?`length,MaskedXString-method` for the definition of the length of a DNAString or MaskedDNAString object. Note that the length of a masked sequence (MaskedXString object) is not affected by the current set of active masks but the nchar method for MaskedXString objects is.
names(seqlengths(x)) is guaranteed to be identical to seqnames(x).
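The difference between length and nchar on a masked sequence can be seen on a small example. The sketch below assumes the Biostrings package is installed; the toy sequence and the mask positions are made up for illustration and do not come from any BSgenome data package.

```r
library(Biostrings)

## A toy 29-letter sequence (illustrative only).
dna <- DNAString("ACACAACTAGATAGNACTNNGAGAGACGC")

## A mask covering 3 non-overlapping ranges: 6 + 8 + 5 = 19 positions.
mask0 <- Mask(mask.width=29, start=c(3, 10, 25), width=c(6, 8, 5))

## Putting the mask on the sequence turns it into a MaskedDNAString.
masked <- dna
masks(masked) <- mask0

length(masked)  # full length, masked positions included
nchar(masked)   # number of unmasked letters only
```

Here length(masked) stays at the full sequence length, while nchar(masked) drops by the number of masked positions.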
isCircular(x): Returns the circularity flags of the single sequences contained in x.
names(isCircular(x)) is guaranteed to be identical to seqnames(x).
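As a minimal sketch of how these accessors relate to each other, the snippet below assumes the BSgenome.Celegans.UCSC.ce2 data package (the one used in the Examples section) is installed. The variable names x and si are only illustrative.

```r
library(BSgenome.Celegans.UCSC.ce2)
x <- Celegans

## Names, lengths and circularity flags bundled in one Seqinfo object:
si <- seqinfo(x)
si

## The same information, one piece at a time:
seqnames(x)
seqlengths(x)
isCircular(x)

## The named vectors are aligned with seqnames(x):
stopifnot(identical(names(seqlengths(x)), seqnames(x)))
stopifnot(identical(names(isCircular(x)), seqnames(x)))
```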
mseqnames(x): Returns the index of the multiple sequences contained in x. Each multiple sequence is stored in a DNAStringSet object and typically comes from a source file (FASTA) with multiple records. The names returned by mseqnames(x) usually reflect the names of those source files, but a common prefix or suffix may have been removed in order to keep them as short as possible.
names(x): Returns the index of all sequences contained in x. This is the same as c(seqnames(x), mseqnames(x)).
length(x): Returns the length of x, i.e., the number of all sequences that it contains. This is the same as length(names(x)).
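The two identities stated above can be checked directly. This is a short sketch that again assumes the BSgenome.Celegans.UCSC.ce2 data package is installed; all_names is just an illustrative variable name.

```r
library(BSgenome.Celegans.UCSC.ce2)
x <- Celegans

## The full index is the single sequences followed by the multiple ones:
all_names <- names(x)
stopifnot(identical(all_names, c(seqnames(x), mseqnames(x))))

## The length of the object is the number of entries in that index:
stopifnot(length(x) == length(all_names))
```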
x[[name]]: Returns the sequence (single or multiple) in x named name (name must be a single string). No sequence is actually loaded into memory until this is explicitly requested with a call to x[[name]] or x$name. When loaded, a sequence is kept in a cache. It will be automatically removed from the cache at garbage collection if it's not in use anymore, i.e., if there are no references to it (other than the reference stored in the cache). With options(verbose=TRUE), a message is printed each time a sequence is removed from the cache.
x$name: Same as x[[name]] but name is not evaluated and therefore must be a literal character string or a name (possibly backtick quoted).
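The practical difference between the two forms shows up when the sequence name is held in a variable. The sketch below assumes the BSgenome.Celegans.UCSC.ce2 data package is installed; nm, a, b and b2 are illustrative variable names.

```r
library(BSgenome.Celegans.UCSC.ce2)

nm <- "chrI"            # sequence name held in a variable
a  <- Celegans[[nm]]    # nm is evaluated, so this fetches "chrI"
b  <- Celegans$chrI     # the name after $ is taken literally
b2 <- Celegans$`chrI`   # backtick-quoted form of the same name

## All three refer to the same chromosome:
stopifnot(length(a) == length(b), length(b) == length(b2))
```

Because the name after $ is not evaluated, Celegans$nm would look for a sequence literally called nm rather than chrI, so the [[ form is the one to use in loops and functions.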
masknames(x): The names of the built-in masks that are defined for all the single sequences. There can be up to 4 built-in masks per sequence. These will always be (in this order):
(1) the mask of assembly gaps, aka "the AGAPS mask";
(2) the mask of intra-contig ambiguities, aka "the AMB mask";
(3) the mask of repeat regions that were determined by the RepeatMasker software, aka "the RM mask";
(4) the mask of repeat regions that were determined by the Tandem Repeats Finder software (where only repeats with period less than or equal to 12 were kept), aka "the TRF mask".
All the single sequences in a given package are guaranteed to have the same collection of built-in masks (same number of masks and in the same order).
masknames(x) gives the names of the masks in this collection. Therefore the value returned by masknames(x) is a character vector made of the first N elements of c("AGAPS", "AMB", "RM", "TRF"), where N depends only on the BSgenome data package being looked at (0 <= N <= 4). The man page for most BSgenome data packages should provide the exact list and permanent URLs of the source data files that were used to extract the built-in masks. For example, if you've installed the BSgenome.Hsapiens.UCSC.hg19 package, load it and see the Note section in ?`BSgenome.Hsapiens.UCSC.hg19`.
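The "first N elements" rule can be spelled out in a few lines of base R. This is only a sketch: is_mask_prefix and builtin_masks are hypothetical names introduced here for illustration and are not part of the BSgenome package.

```r
## The fixed order of the built-in masks, as listed above.
builtin_masks <- c("AGAPS", "AMB", "RM", "TRF")

## Hypothetical helper (not part of BSgenome): TRUE if 'mn' is made of
## the first N built-in mask names, for some 0 <= N <= 4.
is_mask_prefix <- function(mn) {
    mn <- as.character(mn)
    n <- length(mn)
    n <= length(builtin_masks) && identical(mn, head(builtin_masks, n))
}

is_mask_prefix(c("AGAPS", "AMB"))   # TRUE  (N = 2)
is_mask_prefix(character(0))        # TRUE  (N = 0)
is_mask_prefix(c("AMB", "RM"))      # FALSE (skips AGAPS)
```

For a BSgenome object x, is_mask_prefix(masknames(x)) is expected to hold according to the description above.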
----------Author(s)----------
H. Pages
----------See Also----------
available.genomes, GenomeDescription-class, BSgenome-utils, Seqinfo-class, DNAString-class, DNAStringSet-class, MaskedDNAString-class, getSeq,BSgenome-method, injectSNPs, subseq,XVector-method, rm, gc
----------Examples----------
## Loading a BSgenome data package doesn't load its sequences
## into memory:
library(BSgenome.Celegans.UCSC.ce2)
## Number of sequences in this genome:
length(Celegans)
## Display a summary of the sequences:
Celegans
## Index of single sequences:
seqnames(Celegans)
## Lengths (i.e. number of nucleotides) of the sequences:
seqlengths(Celegans)
## Load chromosome I from disk to memory (hence takes some time)
## and keep a reference to it:
chrI <- Celegans[["chrI"]] # equivalent to Celegans$chrI
chrI
class(chrI) # a DNAString instance
length(chrI) # with 15080483 nucleotides
## Multiple sequences:
mseqnames(Celegans)
upstream1000 <- Celegans$upstream1000
upstream1000
class(upstream1000) # a DNAStringSet instance
## Character vector containing the description lines of the first
## 4 sequences in the original FASTA file:
names(upstream1000)[1:4]
## ---------------------------------------------------------------------
## PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE
## ---------------------------------------------------------------------
## We want a message to be printed each time a sequence is removed
## from the cache:
options(verbose=TRUE)
gc() # nothing seems to be removed from the cache
rm(chrI, upstream1000)
gc() # chrI and upstream1000 are removed from the cache (they are
     # not in use anymore)
options(verbose=FALSE)
## Get the current amount of data in memory (in Mb):
mem0 <- gc()["Vcells", "(Mb)"]
system.time(chrV <- Celegans[["chrV"]]) # read from disk
gc()["Vcells", "(Mb)"] - mem0 # chrV occupies 20Mb in memory
system.time(tmp <- Celegans[["chrV"]]) # much faster! (sequence
                                       # is in the cache)
gc()["Vcells", "(Mb)"] - mem0 # we're still using 20Mb (sequences
                              # have a pass-by-address semantic
                              # i.e. the sequence data are not
                              # duplicated)
## subseq() doesn't copy the sequence data either, hence it is very
## fast and memory efficient (but the returned object will hold a
## reference to chrV):
y <- subseq(chrV, 10, 8000000)
gc()["Vcells", "(Mb)"] - mem0
## We must remove all references to chrV before it can be removed from
## the cache (so the 20Mb of memory used by this sequence are freed).
options(verbose=TRUE)
rm(chrV, tmp)
gc()
## Remember that 'y' holds a reference to chrV too:
rm(y)
gc()
options(verbose=FALSE)
gc()["Vcells", "(Mb)"] - mem0
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).