1000genomes: An Overview of the 1000 Genomes Project

tomorrow · posted 2011-3-17 17:29:59

Original post from 生物统计家园. If you repost it, please credit the source: 生物统计家园 (http://www.biostatistic.net)

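In recent years, advances in sequencing technology, second-generation sequencing in particular, have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to use large-scale sequencing to build a resource for understanding genetic variation in the human genome.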

Data from the 1000 Genomes Project are freely available to scientists.

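The goal of the 1000 Genomes Project is to find as many as possible of the variants whose frequency is above 1%. The means is sequencing. A person's genome is randomly fragmented into many short pieces of a few tens of bp, the pieces are sequenced, and the resulting reads are aligned against the reference sequence from the Human Genome Project and joined together. Because the reads are short and the genome contains many repeats, a single round of fragmenting and aligning against the reference runs into two problems: (1) in some places many similar reads pile up, and (2) as a consequence, other places receive no reads at all. A single fragment-sequence-assemble round therefore covers a relatively small part of the genome, and high coverage requires many rounds. How many are enough? Studies indicate that covering the human genome completely takes the equivalent of sequencing it 28 times (called 28X), that is, an average sequencing depth of 28. The deeper the sequencing, the more accurate the result, which helps in detecting structural variants and in ruling out assembly errors, but the cost rises accordingly.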
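To get a feel for why 1X leaves large gaps while deeper sequencing closes them, here is a minimal sketch. It assumes an idealized model that is not stated in the original post: the number of reads covering any position follows a Poisson distribution whose mean is the sequencing depth, and reads fall evenly and independently on the two chromosome copies. The function names are made up for illustration. The model ignores repeats and alignment problems, so real coverage will be lower than these numbers.

```python
import math


def fraction_covered(mean_depth: float) -> float:
    """Expected fraction of positions covered by at least one read,
    assuming per-position read counts are Poisson(mean_depth)."""
    return 1.0 - math.exp(-mean_depth)


def fraction_both_copies(mean_depth: float) -> float:
    """Expected fraction of positions where each of the two chromosome
    copies is covered at least once, assuming every copy independently
    receives Poisson(mean_depth / 2) reads."""
    per_copy = 1.0 - math.exp(-mean_depth / 2.0)
    return per_copy ** 2


if __name__ == "__main__":
    for depth in (1, 4, 28):
        print(f"{depth:>2}X  at least one read: {fraction_covered(depth):.4f}  "
              f"both copies seen: {fraction_both_copies(depth):.4f}")
```

Under these assumptions about 63% of positions get at least one read at 1X and about 98% at 4X, but at 4X only about three quarters of positions have both chromosome copies sampled, which is why a single lightly sequenced genome cannot give complete genotypes.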

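Although the cost of sequencing has come down, sequencing several thousand people at a few tens of X each is still very expensive, so the 1000 Genomes Project adopts a special strategy. The number of haplotypes in any particular region of the genome is limited (because of linkage disequilibrium, neighbouring alleles are correlated). The project therefore sequences each sample to only about 4X (28X is too expensive). Because 2500 people are sequenced, even though each person's coverage is low, the variants in the parts of a person's sequence that were not read can be inferred from variants already observed in other samples (this is called imputation). In this way a large share of variants can be found; according to the project, most variants with frequencies above 1% should be detected.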
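The arithmetic behind pooling many lightly sequenced samples can be sketched as follows. This is only an illustration under assumptions that are not in the original post: read counts are Poisson, each chromosome copy receives half of a sample's mean depth, and the variant is present on exactly 2 x samples x frequency chromosome copies. The function names are hypothetical. The sketch covers how often a variant is observed at all, not the imputation step itself.

```python
import math


def expected_variant_reads(n_samples: int, allele_freq: float,
                           mean_depth: float) -> float:
    """Expected number of reads, pooled over all samples, that carry the
    variant allele at one site (diploid samples, depth split per copy)."""
    variant_copies = 2 * n_samples * allele_freq  # chromosome copies with the variant
    return variant_copies * (mean_depth / 2.0)


def prob_at_least(k: int, mean_reads: float) -> float:
    """P(X >= k) for X ~ Poisson(mean_reads), summed term by term."""
    term = math.exp(-mean_reads)  # P(X = 0)
    below = 0.0
    for i in range(k):
        below += term
        term *= mean_reads / (i + 1)  # turn P(X = i) into P(X = i + 1)
    return 1.0 - below


if __name__ == "__main__":
    # One heterozygous carrier at 4X: the variant copy gets 2 reads on average.
    single = prob_at_least(2, 4 / 2.0)
    # 2500 samples at 4X, variant frequency 1%.
    pooled_mean = expected_variant_reads(2500, 0.01, 4)
    pooled = prob_at_least(2, pooled_mean)
    print(f"single carrier, >= 2 variant reads: {single:.3f}")
    print(f"pooled mean variant reads: {pooled_mean:.0f}, >= 2 reads: {pooled:.6f}")
```

In this toy model a 1% variant is carried by about 50 of the 5000 chromosome copies and is expected to appear on about 100 reads across the whole sample set, whereas in any one carrier it is seen on two or more reads only about 59% of the time. That gap is what makes combining data across samples worthwhile.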

The following is part of the "About" section from the 1000 Genomes Project home page:
Project Overview
Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.
As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. (See Data use statement.)
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
Sequencing is still too expensive to deeply sequence the many samples being studied for this project. However, any particular region of the genome generally contains a limited number of haplotypes. Data can be combined across many samples to allow efficient detection of most of the variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
