detectBreakpoints(R453Plus1Toolbox)
detectBreakpoints() belongs to the R package: R453Plus1Toolbox
Clustering and consensus breakpoint detection for chimeric reads
Translator: 生物统计家园网, robot LoveR
----------Description----------
Given a set of chimeric reads, this method computes all putative breakpoints. First, chimeric reads are clustered such that all reads spanning the same breakpoint form a cluster. Then, a consensus breakpoint sequence and a breakpoint position are computed for each cluster.
----------Usage----------
detectBreakpoints(chimericReads, bpDist=100, minClusterSize=4, removeSoftClips=TRUE, bsGenome)
----------Arguments----------
Argument: chimericReads
A list storing chimeric reads as returned by filterChimericReads. The list must have the format defined by the scanBam method.
Argument: bpDist
The maximum distance in base pairs between the breakpoints of two chimeric reads at which the reads are still merged into one cluster.
Argument: minClusterSize
Clusters whose size is below minClusterSize are excluded from breakpoint detection.
Argument: removeSoftClips
If TRUE, soft-clipped bases at the beginning or the end of a sequence are removed (see Details below).
Argument: bsGenome
A BSgenome instance providing the reference sequences. If missing, the package BSgenome.Hsapiens.UCSC.hg19 is used by default.
----------Details----------
This method is usually invoked after calling filterChimericReads and before calling mergeBreakpoints. It first forms clusters of chimeric reads (reads with exactly two local alignments) that span the same breakpoint and then computes a consensus breakpoint sequence for each cluster.
To carry out a hierarchical clustering, a measure for the distance between two chimeric reads must be defined. If reads span different chromosomes, their distance is set to infinity. The strand information of the local alignments may also indicate that two chimeric reads do not span the same breakpoint even if they span the same chromosomes. For example, the first read has two local alignments on the positive strand, whereas the second read has one local alignment on the positive strand and the other on the negative strand. In this case, the distance is set to infinity, too. Finally, the distance measure distinguishes between the two breakpoints (sometimes called the pathogenic and the reciprocal breakpoint) that originate from the same structural variant. The distance between a read from the pathogenic breakpoint and a read from the reciprocal breakpoint is infinity, so that two different clusters will emerge. These two related breakpoints can be merged later using the mergeBreakpoints method. We observed that the breakpoints of these two cases often differ by a few tens or even a few hundred base pairs.
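The rules for infinite distances can be illustrated with a small sketch in base R. This is not the package's internal code: the read representation (a list with `chr`, `strand`, `side` and `bp`), the `side` label, and the function name `readDistance` are all assumptions made for illustration. The sketch also assumes that each read has already been brought into a common orientation, so that reads sequenced from opposite strands of the same case carry identical `strand` entries. For compatible reads it falls back to a plain Euclidean distance on the two breakpoint coordinates.

```r
# Hypothetical read representation: chromosomes, strands and breakpoint
# coordinates of the two local alignments, plus an assumed 'side' label
# ("pathogenic" or "reciprocal"). Not the package's data structure.
readDistance <- function(x, y) {
  if (!identical(x$chr, y$chr)) return(Inf)        # different chromosomes
  if (!identical(x$strand, y$strand)) return(Inf)  # incoherent strand pattern
  if (!identical(x$side, y$side)) return(Inf)      # pathogenic vs. reciprocal
  sqrt(sum((x$bp - y$bp)^2))                       # compatible reads
}

r1 <- list(chr = c("II", "III"), strand = c("+", "+"),
           side = "pathogenic", bp = c(100000, 200000))
r2 <- list(chr = c("II", "III"), strand = c("+", "+"),
           side = "pathogenic", bp = c(100003, 200004))
r3 <- list(chr = c("II", "IV"), strand = c("+", "+"),
           side = "pathogenic", bp = c(100000, 200000))
r4 <- list(chr = c("II", "III"), strand = c("+", "-"),
           side = "pathogenic", bp = c(100000, 200000))
r5 <- list(chr = c("II", "III"), strand = c("+", "+"),
           side = "reciprocal", bp = c(100050, 199950))

d12 <- readDistance(r1, r2)  # finite: same chromosomes, strands and side
d13 <- readDistance(r1, r3)  # Inf: chromosomes differ
d14 <- readDistance(r1, r4)  # Inf: strand pattern differs
d15 <- readDistance(r1, r5)  # Inf: reciprocal breakpoint
```

Returning `Inf` for incompatible pairs guarantees that such reads can never end up in the same cluster, whatever value is chosen for bpDist.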
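If the chromosome and strand information of two reads x and y is coherent, the Euclidean distance over the breakpoint coordinates on the two chromosomes involved (written chrA and chrB here) is used:
d(x, y) = sqrt((bp(x, chrA) - bp(y, chrA))^2 + (bp(x, chrB) - bp(y, chrB))^2)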
where bp gives the coordinates of the breakpoint for the given read and chromosome. Hierarchical clustering is applied with complete linkage, and the dendrogram is cut at a height of bpDist to obtain the final clusters. The bpDist argument usually does not influence the result, because we observed that reads spanning the same breakpoint show very little variation (only a few base pairs) in their local alignments. This variation is due to sequencing errors or to ambiguity caused by identical or similar sequences of both chromosomes near the breakpoint.
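The clustering step can be sketched with base R's `dist`, `hclust` and `cutree`. The coordinates below are made-up toy values, and the sketch covers only reads that are already known to be compatible (same chromosomes, coherent strands), which is equivalent to handling each group of mutually finite distances separately. The value `minClusterSize <- 3` is chosen to fit the toy data and is not the function's default.

```r
# Toy breakpoint coordinates (hypothetical) for five compatible chimeric reads
reads <- data.frame(
  read = paste0("r", 1:5),
  bpA  = c(100000, 100002, 99999, 100450, 100452),   # breakpoint on chrA
  bpB  = c(200000, 200001, 200003, 200400, 200398)   # breakpoint on chrB
)
bpDist <- 100
minClusterSize <- 3

# Euclidean distances, complete linkage, dendrogram cut at height bpDist
hc <- hclust(dist(reads[, c("bpA", "bpB")], method = "euclidean"),
             method = "complete")
reads$cluster <- as.integer(cutree(hc, h = bpDist))

# Drop clusters with fewer than minClusterSize reads
clusterSizes <- table(reads$cluster)
keep <- as.integer(names(clusterSizes)[clusterSizes >= minClusterSize])
kept <- reads[reads$cluster %in% keep, ]
```

With complete linkage, the height of a merge equals the largest pairwise distance inside the merged cluster, so cutting at bpDist ensures that no two reads in a cluster have breakpoints further apart than bpDist.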
Although the reads of a cluster may belong to the same chimeric DNA, their individual breakpoints may differ by a few base pairs. Furthermore, a single read may have more than one possible breakpoint if a (small) part of the read was aligned to both parts.
The following step determines a consensus breakpoint for each cluster. It uses the supplied bsGenome to construct a chimeric reference sequence for every possible breakpoint over all reads within the cluster. After the reads have been realigned to the chimeric reference sequences, the reference that yields the highest alignment score is taken to best represent the chimeric DNA and its breakpoints.
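The idea of scoring candidate chimeric references can be sketched in base R. This is a simplified stand-in, not the package's procedure: the negated edit distance from `adist` replaces a real alignment score, the two reference strings and the reads are invented, and `flank`, `chimericRef` and `candidates` are hypothetical names.

```r
# Two invented reference sequences standing in for the two chromosomes
refA <- "GATTACAGGCTTAACCGTAGCATCGGATCC"
refB <- "TTGACCGTAAGCTAGGCATCGATTACGCAA"
flank <- 8  # bases taken from each side of the junction (assumption)

# Chimeric reference: last 'flank' bases of refA up to posA,
# followed by the first 'flank' bases of refB starting at posB
chimericRef <- function(posA, posB) {
  paste0(substr(refA, posA - flank + 1, posA),
         substr(refB, posB, posB + flank - 1))
}

# Three toy reads spanning the junction; the third has one mismatch
reads <- c("GGCTTAACGCTAGGCA", "GGCTTAACGCTAGGCA", "GGTTTAACGCTAGGCA")

# All candidate breakpoint pairs suggested by the reads of the cluster
candidates <- expand.grid(posA = 14:16, posB = 10:12)
candidates$ref <- mapply(chimericRef, candidates$posA, candidates$posB)

# Score = negated total edit distance of all reads to the candidate reference
candidates$score <- -vapply(candidates$ref,
                            function(ref) sum(adist(reads, ref)),
                            numeric(1))
best <- candidates[which.max(candidates$score), ]
```

Summing the score over all reads of the cluster lets a single read with a sequencing error be outvoted by the remaining reads.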
As a preprocessing step, detectBreakpoints offers to remove soft clips occurring after the alignment:
Some reads may contain soft-clipped bases (e.g. linker sequences) at the beginning of the first part of the read or at the end of the second part. By default, detectBreakpoints removes these unaligned subsequences and adjusts the CIGAR string, the sequence, the sequence width (qwidth) and the local start/end coordinates.
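What removing a soft clip means for the CIGAR string, the sequence and qwidth can be shown with a small base R helper. The function `trimSoftClip` is hypothetical and is not part of R453Plus1Toolbox; it handles a single alignment and leaves out the package-specific local start/end coordinates.

```r
# Hypothetical helper: remove one soft clip from the start or the end of an
# alignment and adjust CIGAR string, sequence and query width accordingly
trimSoftClip <- function(cigar, seq, side = c("start", "end")) {
  side <- match.arg(side)
  pattern <- if (side == "start") "^([0-9]+)S" else "([0-9]+)S$"
  m <- regmatches(cigar, regexec(pattern, cigar))[[1]]
  if (length(m) == 0) {
    # No soft clip on this side: return the input unchanged
    return(list(cigar = cigar, seq = seq, qwidth = nchar(seq)))
  }
  n <- as.integer(m[2])  # number of soft-clipped bases
  seq <- if (side == "start") {
    substr(seq, n + 1, nchar(seq))
  } else {
    substr(seq, 1, nchar(seq) - n)
  }
  list(cigar = sub(pattern, "", cigar), seq = seq, qwidth = nchar(seq))
}

# Toy alignment: 3 clipped bases, 5 matches, 4 clipped bases
first  <- trimSoftClip("3S5M4S", "AAACCCCCGGGG", side = "start")
second <- trimSoftClip("3S5M4S", "AAACCCCCGGGG", side = "end")
```

In the SAM convention, the alignment position refers to the first aligned base, so trimming a leading soft clip in this way leaves `pos` untouched.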
----------Value----------
detectBreakpoints returns an object of class breakpoints, a list of breakpoint clusters that gives access to all alignments and consensus breakpoints:
Component: seqs
This IRanges DataFrame is mainly a rearranged version of the alignment input in chimericReads. In addition, it shows the corresponding breakpoints and local alignment coordinates.
Component: commonBps
A data frame listing the breakpoints for both parts of the chimeric reference, the associated chromosome, the strand and the reference sequence itself, including the positions "localStart"/"localEnd" that indicate which part of the reference belongs to which breakpoint.
Component: commonAlign
An object of class PairwiseAlignedFixedSubject of the Biostrings package that contains the alignment to the (best) consensus reference sequence.
Component: alignedReads
On the basis of commonAlign and commonBps, alignedReads is an instance of class AlignedRead containing all aligned reads, including their associated chromosomes, strands, and positions. Since the reference is a chimeric sequence, each read has two chromosome and two strand entries.
----------Author(s)----------
Hans-Ulrich Klein, Christoph Bartenhagen
----------See Also----------
filterChimericReads, mergeBreakpoints, plotChimericReads
----------Examples----------
# Construct a small example with three chimeric reads
# (= 6 local alignments) in BAM format as given by
# aligners such as BWA-SW.
# The first two reads originate from the same case but
# from different strands. The third read originates from
# the reciprocal breakpoint.
library("BSgenome.Scerevisiae.UCSC.sacCer2")
bamReads = list()
bamReads[[1]] = list(
    qname = c("seq1", "seq1", "seq2", "seq2", "seq3", "seq3"),
    flag = as.integer(c(0, 0, 16, 16, 0, 0)),
    rname = factor(c("II", "III", "III", "II", "III", "II")),
    strand = factor(c("+", "+", "-", "-", "+", "+")),
    pos = as.integer(c(99951, 200000, 200000, 99951, 199950, 100001)),
    qwidth = as.integer(c(100, 100, 100, 100, 100, 100)),
    cigar = c("50M50S", "50S50M", "50S50M", "50M50S", "50M50S", "50S50M"),
    seq = DNAStringSet(c(
        paste(substr(Scerevisiae$chrII, start=99951, stop=100000),
              substr(Scerevisiae$chrIII, start=200000, stop=200049),
              sep=""),
        paste(substr(Scerevisiae$chrII, start=99951, stop=100000),
              substr(Scerevisiae$chrIII, start=200000, stop=200049),
              sep=""),
        paste(substr(Scerevisiae$chrIII, start=200000, stop=200049),
              substr(Scerevisiae$chrII, start=99951, stop=100000),
              sep=""),
        paste(substr(Scerevisiae$chrIII, start=200000, stop=200049),
              substr(Scerevisiae$chrII, start=99951, stop=100000),
              sep=""),
        paste(substr(Scerevisiae$chrIII, start=199950, stop=199999),
              substr(Scerevisiae$chrII, start=100001, stop=100050),
              sep=""),
        paste(substr(Scerevisiae$chrIII, start=199950, stop=199999),
              substr(Scerevisiae$chrII, start=100001, stop=100050),
              sep="")))
)
bps = detectBreakpoints(bamReads, minClusterSize=1, bsGenome=Scerevisiae)
summary(bps)
table(bps)
mergeBreakpoints(bps)
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).