R GenomicRanges package: help documentation for the cigar-utils() functions

loveR · posted 2012-2-25 19:28:21

cigar-utils(GenomicRanges)
cigar-utils() belongs to R package: GenomicRanges

                                       CIGAR utility functions

                                       Translator: LoveR, the biostatistic.net translation robot

----------Description----------

Utility functions for low-level CIGAR manipulation.

----------Usage----------

cigarOpTable(cigar)

cigarToQWidth(cigar, before.hard.clipping=FALSE)
cigarToWidth(cigar)

cigarQNarrow(cigar, start=NA, end=NA, width=NA)
cigarNarrow(cigar, start=NA, end=NA, width=NA)

cigarToIRanges(cigar, drop.D.ranges=FALSE, merge.ranges=TRUE)
cigarToIRangesListByAlignment(cigar, pos, flag=NULL, drop.D.ranges=FALSE)
cigarToIRangesListByRName(cigar, rname, pos, flag=NULL, drop.D.ranges=FALSE,
                        merge.ranges=TRUE)

queryLoc2refLoc(qloc, cigar, pos=1)
queryLocs2refLocs(qlocs, cigar, pos, flag=NULL)

splitCigar(cigar)
cigarToRleList(cigar)
cigarToCigarTable(cigar)
summarizeCigarTable(x)

----------Arguments----------

Argument: cigar
A character vector/factor containing the extended CIGAR string for each read. For cigarToIRanges and queryLoc2refLoc, this must be a single string (i.e. a character vector/factor of length 1).

Argument: before.hard.clipping
Should the returned widths be the lengths of the reads before or after "hard clipping"? Hard clipping of a read is encoded with an H in the CIGAR. If FALSE (the default), the returned widths are the lengths of the query sequences stored in the SAM/BAM file. If TRUE, the returned widths are the lengths of the original reads.

Argument: start, end, width
Vectors of integers. NAs and negative values are accepted and "solved" according to the rules of the SEW (Start/End/Width) interface (see ?solveUserSEW for the details).
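To make the SEW rules more concrete, here is a minimal base-R sketch for a single sequence. The helper name solve_sew is hypothetical and is not part of IRanges or GenomicRanges. It only handles scalar inputs, does no validation, and assumes that negative start/end values count back from the end of the sequence (-1 being the last position); ?solveUserSEW remains the authoritative description.

```r
# Hypothetical helper: resolve start/end/width against a sequence of
# length 'refwidth'. Simplified sketch, not the IRanges implementation.
solve_sew <- function(refwidth, start = NA, end = NA, width = NA) {
  # Assumption: negative positions are relative to the end (-1 = last)
  from_end <- function(x) {
    if (!is.na(x) && x < 0) refwidth + x + 1 else x
  }
  start <- from_end(start)
  end <- from_end(end)
  if (!is.na(width)) {
    if (!is.na(start) && is.na(end)) {
      end <- start + width - 1
    } else if (is.na(start) && !is.na(end)) {
      start <- end - width + 1
    } else {
      stop("this sketch needs exactly one of start/end together with width")
    }
  }
  # Missing start/end default to the full extent of the sequence
  if (is.na(start)) start <- 1
  if (is.na(end)) end <- refwidth
  c(start = start, end = end, width = end - start + 1)
}

# A query of width 38, narrowed with start=4 and end=-3
solve_sew(38, start = 4, end = -3)
```

Under these assumptions the call keeps query positions 4 to 36, i.e. a width of 33.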

Argument: drop.D.ranges
Should the ranges corresponding to a deletion from the reference (encoded with a D in the CIGAR) be dropped? By default they are kept, to be consistent with the pileup tool from SAMtools. Note that when drop.D.ranges is TRUE, Ds and Ns in the CIGAR are equivalent.

Argument: merge.ranges
Should adjacent ranges coming from the same cigar be merged? Using TRUE (the default) can significantly reduce the size of the returned object.

Argument: pos
An integer vector containing the 1-based leftmost position/coordinate for each (possibly clipped) read sequence.

Argument: flag
NULL or an integer vector containing the SAM flag for each read. According to the SAM specs, flag bits 0x004 and 0x400 have the following meaning: when bit 0x004 is ON then "the query sequence itself is unmapped" and when bit 0x400 is ON then "the read is either a PCR duplicate or an optical duplicate". When flag is provided, cigarToIRangesListByAlignment and cigarToIRangesListByRName ignore the reads for which either of these bits is ON.
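The two flag bits can be tested with plain bitwise arithmetic. The following base-R sketch uses made-up flag values purely for illustration; it shows the bit test itself, not how the package implements the filtering internally.

```r
# SAM flags for five hypothetical reads (values made up for illustration)
flag <- c(0L, 4L, 1024L, 1028L, 99L)

# Bit 0x004: the query sequence itself is unmapped
is_unmapped <- bitwAnd(flag, 0x004L) != 0L
# Bit 0x400: the read is a PCR or optical duplicate
is_duplicate <- bitwAnd(flag, 0x400L) != 0L

# Reads that would be kept if both kinds of reads are ignored
keep <- !(is_unmapped | is_duplicate)
keep
```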

Argument: rname
A character vector/factor containing the name of the reference sequence associated with each read (i.e. the name of the sequence the read has been aligned to).

Argument: qloc
An integer vector containing "query-based locations", i.e. 1-based locations relative to the query sequence stored in the SAM/BAM file.

Argument: qlocs
A list of the same length as cigar where each element is an integer vector containing "query-based locations", i.e. 1-based locations relative to the corresponding query sequence stored in the SAM/BAM file.

Argument: x
A DataFrame produced by cigarToCigarTable.

----------Value----------

For cigarOpTable: An integer matrix with as many rows as the length of cigar and seven columns, one for each extended CIGAR operation.
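As a rough illustration of that shape, the sketch below builds a comparable matrix in base R. The function name cigar_op_table is hypothetical, the column order M, I, D, N, S, H, P is an assumption, and so is the choice to store the total length of each operation in each cell; check the output of the real cigarOpTable before relying on any of these details.

```r
# Hypothetical base-R sketch: one row per CIGAR, one column per operation.
# Assumption: each cell holds the summed length of that operation.
cigar_op_table <- function(cigar) {
  ops <- c("M", "I", "D", "N", "S", "H", "P")
  tab <- vapply(as.character(cigar), function(cg) {
    # Split e.g. "2S10M" into the tokens "2S" and "10M"
    m <- regmatches(cg, gregexpr("[0-9]+[MIDNSHP]", cg))[[1]]
    op <- sub("[0-9]+", "", m)
    len <- as.integer(sub("[MIDNSHP]", "", m))
    vapply(ops, function(o) sum(len[op == o]), integer(1))
  }, integer(7), USE.NAMES = FALSE)
  tab <- t(tab)
  colnames(tab) <- ops
  tab
}

cigar2 <- c("40M", "3H15M55N4M2I6M2D5M6S", "2S10M2000N15M", "3H25M5H")
cigar_op_table(cigar2)
```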

For cigarToQWidth: An integer vector of the same length as cigar where each element is the width of the query (i.e. the length of the query sequence) as inferred from the corresponding element in cigar (NAs in cigar will produce NAs in the returned vector).

For cigarQNarrow and cigarNarrow: A character vector of the same length as cigar containing the narrowed cigars. In addition, the vector has an "rshift" attribute, which is an integer vector of the same length as cigar. It contains the values that would need to be added to the POS field of a SAM/BAM file as a consequence of this cigar narrowing.

For cigarToWidth: An integer vector of the same length as cigar where each element is the width of the alignment (i.e. its total length on the reference, gaps included) as inferred from the corresponding element in cigar (NAs in cigar will produce NAs in the returned vector).
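The difference between the query width and the alignment width comes down to which CIGAR operations are counted. The base-R sketch below is only an illustration with hypothetical helper names (cigar_ops, query_width, ref_width); it assumes the usual SAM convention that M, I and S consume query bases, that M, D and N consume reference bases, and that H counts only toward the original read.

```r
# Hypothetical parser: operations and lengths of a single CIGAR string
cigar_ops <- function(cigar) {
  m <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP]", cigar))[[1]]
  data.frame(op = sub("[0-9]+", "", m),
             len = as.integer(sub("[MIDNSHP]", "", m)),
             stringsAsFactors = FALSE)
}

# Length of the query sequence; optionally add hard-clipped bases back
query_width <- function(cigar, before.hard.clipping = FALSE) {
  x <- cigar_ops(cigar)
  ops <- c("M", "I", "S")
  if (before.hard.clipping) ops <- c(ops, "H")
  sum(x$len[x$op %in% ops])
}

# Length of the alignment on the reference, gaps (D and N) included
ref_width <- function(cigar) {
  x <- cigar_ops(cigar)
  sum(x$len[x$op %in% c("M", "D", "N")])
}

cigar1 <- "3H15M55N4M2I6M2D5M6S"
query_width(cigar1)                               # 15+4+2+6+5+6 = 38
query_width(cigar1, before.hard.clipping = TRUE)  # 38 + 3 = 41
ref_width(cigar1)                                 # 15+55+4+6+2+5 = 87
```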

For cigarToIRanges: An IRanges object describing where the bases in the read align with respect to an imaginary reference sequence, assuming that the leftmost aligned base is at position 1 in the reference (i.e. at the first position).
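The coordinates involved can be worked out by accumulating the reference-consuming operations. The sketch below is a base-R approximation with a hypothetical function name (cigar_ranges); it returns a plain data frame rather than an IRanges object and does not merge adjacent ranges, so it corresponds most closely to the unmerged case.

```r
# Hypothetical sketch: reference ranges covered by a single CIGAR,
# with the leftmost aligned base placed at position 1. No merging.
cigar_ranges <- function(cigar, drop.D.ranges = FALSE) {
  m <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP]", cigar))[[1]]
  op <- sub("[0-9]+", "", m)
  len <- as.integer(sub("[MIDNSHP]", "", m))
  # Only M, D and N advance the position on the reference
  ref_len <- ifelse(op %in% c("M", "D", "N"), len, 0L)
  end <- cumsum(ref_len)
  start <- end - ref_len + 1L
  # N is always a gap; D is kept unless drop.D.ranges is TRUE
  keep <- op == "M" | (op == "D" & !drop.D.ranges)
  data.frame(start = start[keep], end = end[keep])
}

cigar1 <- "3H15M55N4M2I6M2D5M6S"
cigar_ranges(cigar1)
cigar_ranges(cigar1, drop.D.ranges = TRUE)
```

For cigar1 the first call yields the ranges 1-15, 71-74, 75-80, 81-82 and 83-87; the 55N operation leaves the gap between 15 and 71.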

For cigarToIRangesListByAlignment: A CompressedNormalIRangesList object of the same length as cigar.

For cigarToIRangesListByRName: A named IRangesList object with one element (IRanges) per unique reference sequence.

For queryLoc2refLoc: An integer vector of the same length as qloc containing the "reference-based locations" (i.e. the 1-based locations relative to the reference sequence) corresponding to the "query-based locations" passed in qloc.
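The mapping from query to reference coordinates can be sketched in base R as follows. The function name query_to_ref is hypothetical, and returning NA for locations that fall in a soft clip or an insertion is a choice made for this illustration only; it is not taken from the documentation of queryLoc2refLoc.

```r
# Hypothetical sketch: map query-based locations to reference-based ones
# for a single CIGAR whose first aligned base sits at 'pos'.
query_to_ref <- function(qloc, cigar, pos = 1L) {
  m <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP]", cigar))[[1]]
  op <- sub("[0-9]+", "", m)
  len <- as.integer(sub("[MIDNSHP]", "", m))
  q_len <- ifelse(op %in% c("M", "I", "S"), len, 0L)  # consumes query
  r_len <- ifelse(op %in% c("M", "D", "N"), len, 0L)  # consumes reference
  q_end <- cumsum(q_len)
  q_start <- q_end - q_len + 1L
  r_start <- pos + cumsum(r_len) - r_len
  vapply(qloc, function(q) {
    # Only locations inside an M block have a reference counterpart here
    i <- which(op == "M" & q >= q_start & q <= q_end)
    if (length(i) == 0L) NA_integer_
    else as.integer(r_start[i] + q - q_start[i])
  }, integer(1))
}

# Query positions 1-2 are soft-clipped; 3-12 and 13-27 are aligned blocks
query_to_ref(c(1, 3, 12, 13), "2S10M2000N15M", pos = 1)
```

With these inputs the sketch returns NA, 1, 10 and 2011: the 2000N gap shifts the second aligned block by 2000 positions on the reference.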

For queryLocs2refLocs: A list of the same length as qlocs where each element is an integer vector containing the "reference-based locations" corresponding to the "query-based locations" passed in the corresponding element of qlocs.

For splitCigar: A list of the same length as cigar where each element is itself a list with 2 elements of the same length, the 1st one being a raw vector containing the CIGAR operations and the 2nd one being an integer vector containing the lengths of the CIGAR operations.
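The structure described above can be imitated in base R. The function name split_cigar is hypothetical and the sketch is meant only to show what "a raw vector of operations plus an integer vector of lengths" looks like; it is not the package's implementation.

```r
# Hypothetical sketch: split each CIGAR into operations and lengths
split_cigar <- function(cigar) {
  lapply(as.character(cigar), function(cg) {
    m <- regmatches(cg, gregexpr("[0-9]+[MIDNSHP]", cg))[[1]]
    list(
      # Operations stored as raw bytes, one byte per operation letter
      charToRaw(paste(sub("[0-9]+", "", m), collapse = "")),
      # Lengths of the operations, in the same order
      as.integer(sub("[MIDNSHP]", "", m))
    )
  })
}

x <- split_cigar(c("40M", "2S10M2000N15M"))
rawToChar(x[[2]][[1]])  # the operations of the 2nd CIGAR as a string
x[[2]][[2]]             # their lengths
```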

For cigarToRleList: A CompressedRleList object.

For cigarToCigarTable: A frequency table of the CIGARs in the form of a DataFrame with two columns: cigar (a CompressedRleList) and count (an integer).

For summarizeCigarTable: A list with two elements: AlignedCharacters (integer) and Indels (matrix).

----------Author(s)----------

H. Pages and P. Aboyoun

----------References----------

----------See Also----------

IRanges-class, IRangesList-class, coverage, RleList-class

----------Examples----------

## ---------------------------------------------------------------------
## A. SIMPLE EXAMPLES
## ---------------------------------------------------------------------

## With a cigar vector of length 1:
cigar1 <- "3H15M55N4M2I6M2D5M6S"

## cigarToQWidth()/cigarToWidth():
cigarToQWidth(cigar1)
cigarToQWidth(cigar1, before.hard.clipping=TRUE)
cigarToWidth(cigar1)

## cigarQNarrow():
cigarQNarrow(cigar1, start=4, end=-3)
cigarQNarrow(cigar1, start=10)
cigarQNarrow(cigar1, start=19)
cigarQNarrow(cigar1, start=24)

## cigarNarrow():
cigarNarrow(cigar1)  # only drops the soft/hard clipping
cigarNarrow(cigar1, start=10)
cigarNarrow(cigar1, start=15)
cigarNarrow(cigar1, start=15, width=57)
cigarNarrow(cigar1, start=16)
#cigarNarrow(cigar1, start=16, width=55)  # ERROR! (empty cigar)
cigarNarrow(cigar1, start=71)
cigarNarrow(cigar1, start=72)
cigarNarrow(cigar1, start=75)

## cigarToIRanges():
cigarToIRanges(cigar1)
cigarToIRanges(cigar1, merge.ranges=FALSE)
cigarToIRanges(cigar1, drop.D.ranges=TRUE)

## With a cigar vector of length 4:
cigar2 <- c("40M", cigar1, "2S10M2000N15M", "3H25M5H")
pos <- c(1, 1001, 1,  351)
cigarToIRangesListByAlignment(cigar2, pos)
rname <- c("chr6", "chr6", "chr2", "chr6")
cigarToIRangesListByRName(cigar2, rname, pos)

cigarOpTable(cigar2)

splitCigar(cigar2)
cigarToRleList(cigar2)

cigarToCigarTable(cigar2)
cigarToCigarTable(cigar2)[,"cigar"]
cigarToCigarTable(cigar2)[,"count"]

summarizeCigarTable(cigarToCigarTable(cigar2))

## ---------------------------------------------------------------------
## B. PERFORMANCE
## ---------------------------------------------------------------------

if (interactive()) {
  ## We simulate 20 million aligned reads, all 40-mers. 95% of them
  ## align with no indels. 5% align with a big deletion in the
  ## reference. In the context of an RNAseq experiment, those 5% would
  ## be suspected to be "junction reads".
  set.seed(123)
  nreads <- 20000000L
  njunctionreads <- nreads * 5L / 100L
  cigar3 <- character(nreads)
  cigar3[] <- "40M"
  junctioncigars <- paste(
   paste(10:30, "M", sep=""),
   paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""),
   paste(30:10, "M", sep=""), sep="")
  cigar3[sample(nreads, njunctionreads)] <- junctioncigars
  some_fake_rnames <- paste("chr", c(1:6, "X"), sep="")
  rname <- sample(some_fake_rnames, nreads, replace=TRUE)
  pos <- sample(80000000L, nreads, replace=TRUE)

  ## The following takes < 5 sec. to complete:
  system.time(rglist <- cigarToIRangesListByAlignment(cigar3, pos))

  ## The following takes < 10 sec. to complete:
  system.time(irl <- cigarToIRangesListByRName(cigar3, rname, pos))

  ## Internally, cigarToIRangesListByRName() turns 'rname' into a factor
  ## before starting the calculation. Hence it will run slightly
  ## faster if 'rname' is already a factor.
  rname2 <- as.factor(rname)
  system.time(irl2 <- cigarToIRangesListByRName(cigar3, rname2, pos))

  ## The sizes of the resulting objects are about 240M and 160M,
  ## respectively:
  object.size(rglist)
  object.size(irl)
}

## ---------------------------------------------------------------------
## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE
## ---------------------------------------------------------------------
## The information stored in a BAM file can be used to compute the
## "coverage" of the mapped reads i.e. the number of reads that hit any
## given position in the reference genome.
## The following function takes the path to a BAM file and returns an
## object representing the coverage of the mapped reads that are stored
## in the file. The returned object is an RleList object named with the
## names of the reference sequences that actually receive some coverage.

extractCoverageFromBAM <- function(file)
{
  ## This ScanBamParam object allows us to load only the necessary
  ## information from the file.
  param <- ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE,
                                       isDuplicate=FALSE),
                     what=c("rname", "pos", "cigar"))
  bam <- scanBam(file, param=param)[[1]]
  ## Note that unmapped reads and reads that are PCR/optical duplicates
  ## have already been filtered out by using the ScanBamParam object above.
  irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos)
  irl <- irl[elementLengths(irl) != 0]  # drop empty elements
  coverage(irl)
}

library(Rsamtools)
f1 <- system.file("extdata", "ex1.bam", package="Rsamtools")
extractCoverageFromBAM(f1)

When reposting, please credit the source: biostatistic.net (http://www.biostatistic.net).

