Category: Genomics

[Genomics] Human Genome Project gossip: a war of words in PNAS

Originally published: 2012-10-30

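Lately I have been following two courses on Coursera: Introduction to Genetics and Evolution, and Experimental Genome Science. If you don't know Coursera, take a look at https://www.coursera.org/ . You can take courses from all sorts of famous universities online, and every course is prepared specifically for Coursera rather than being a plain classroom recording. If you complete the weekly assignments as required and pass the midterm and final exams (the requirements vary by course), you can also get a certificate of completion for that course. The courses are entirely in English, with optional English subtitles; for courses in my own field I find I can follow along without subtitles with no trouble at all. It is best to pick courses according to how much energy you have, because if you pick too many you will certainly fall behind. Two courses already feel a bit tiring to me. The instructor of Introduction to Genetics and Evolution is a lot of fun, and the content is relatively simple, roughly third-year undergraduate genetics. For me it is listening practice plus a genetics refresher; I know all the concepts, but hearing many of the details now, I understand them at a completely different level. The instructor of the other course, Experimental Genome Science, is a real letdown: the slides are dull, just piles of text, and the lectures are tiring to sit through.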

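Yesterday, to finish the homework for Experimental Genome Science (the homework for this course is nothing but reading papers and writing summaries, orz...), I read two commentary articles on the Human Genome Project published in PNAS in 2002. I had heard about them before, but this was the first time I actually read articles in a professional journal where the two sides go after each other.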

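A quick word on the Human Genome Project. Everyone has probably heard the name; in terms of significance it is a project on a par with the Moon landing. It was formally launched in 1990, led by the United States, with a budget of 3 billion dollars (the human genome has roughly 3 billion bases, so that is 1 dollar per base...), and many countries took part. China, under the leadership of Yang Huanming, completed 1% (30 million bases on the short arm of chromosome 3), which later gave rise to BGI (华大基因), but that is another story. In 1998 an unexpected challenger burst onto the scene: Celera Genomics, a new company set up by PE Biosystems, then the world's largest manufacturer of sequencing instruments. It put up 300 of the most advanced capillary automated sequencers of the day (ABI3700) and 300 million dollars, and announced that within 3 years, with the company's 300 staff and the "world's third-largest" supercomputer, it would finish sequencing the whole human genome by whole-genome shotgun (WGS). This was an open challenge to the Human Genome Project (HGP).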

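Bear in mind that the sequencing strategy chosen by HGP was hierarchical shotgun (HS): put simply, first break the genome into large fragments, clone them into separate BAC vectors, and then sequence the genome segment by segment. Celera's WGS approach shatters the whole genome at once and sequences the fragments directly, which saves time and money but makes assembly very difficult.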
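To make the "fold coverage" figures that come up in this debate (2x, 5.1x, 7.5x) a little more concrete, here is a minimal Python sketch. It is not taken from either paper. It assumes the idealized Lander-Waterman (Poisson) model, in which reads land uniformly at random on the genome, and the function names are my own hypothetical helpers. The model says nothing about repeats, which are exactly what make real WGS assembly hard, so treat the numbers as a rough lower bound on the difficulty.

```python
import math


def fold_coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced (hypothetical helper)."""
    return num_reads * read_len / genome_len


def expected_uncovered_fraction(coverage: float) -> float:
    """Fraction of bases expected to be hit by no read at all.

    Assumes reads are placed uniformly at random (Poisson model),
    which ignores repeats and cloning bias.
    """
    return math.exp(-coverage)


if __name__ == "__main__":
    # Coverage values mentioned in this post; the model is an assumption.
    for c in (2.0, 5.1, 7.5):
        frac = expected_uncovered_fraction(c)
        print(f"{c:.1f}x coverage: about {frac:.4%} of bases expected uncovered")
```

Even under this optimistic model, 2x coverage leaves more than a tenth of the bases unsequenced, which gives a feel for why nobody expected 2x data alone to yield a finished genome.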

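The outcome of this contest was that in 2001 HGP and Celera published papers in the same month, in Nature and Science respectively, each announcing a completed draft of the human genome.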

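That was a lot of background; now for the main story. In 2002 Robert Waterston, one of the leaders of HGP, opened fire on Celera first, in an article published in PNAS (Waterston, R. H. (2002). On the sequencing of the human genome. Proceedings of the National Academy of Sciences, 99(6), 3712–3716. doi:10.1073/pnas.042692499. http://www.pnas.org/content/99/6/3712.long). It argued that Celera could not have assembled the human genome independently with the WGS method, and that the Science paper never analyzed Celera's own data on its own but instead folded in HGP's data in a rather odd way before analysis. Waterston and colleagues re-analyzed Celera's data and drew several conclusions: (1) during assembly Celera used HGP's 2x-coverage BAC clone data, so what the WGS assembly stitched together was "faux" data; (2) WGS data alone cannot fully assemble the human genome, and adding the WGS sequence set to HGP's sequence set changes the result only very slightly. In short, the message was: you cheated! You only managed to assemble the human genome because you took our data.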

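A month later Celera published its reply, also in PNAS (Myers, E. W. (2002). On the sequencing and assembly of the human genome. Proceedings of the National Academy of Sciences, 99(7), 4145–4146. doi:10.1073/pnas.092136699. http://www.pnas.org/content/99/7/4145.long), answering the doubts raised by Waterston and colleagues from several angles and declaring them incorrect and flawed. They said they had not used HGP's assembled genome sequence, only the 2x BAC data, to fill a small number of gaps. They also responded to Waterston's chromosome 22 example: a single chromosome proves nothing, because the data for the whole genome are far more complex. Finally, they said the WGS genome data are not only different from HGP's but better than yours, and if you don't believe it, look at the figure: less redundant and repeated sequence, and more unique sequence than you have!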

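That is more or less where the dispute ends; whether there was a sequel I honestly don't know yet. As I see it, Celera did play a few clever tricks by making use of the data HGP shared freely on the Internet. Still, whatever else one says, they did contribute to applying the WGS assembly approach to complex eukaryotic genomes.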

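The tragic part: after I finished both English summaries and went to hand in the homework, I found that... the deadline... had already... passed. Because the homework is peer-reviewed, late submissions are blocked. And the second assignment is a Cell paper on cancer whose abstract I read twice without understanding any of it.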

So I will just post the first assignment here as a keepsake...

Summary 1, On the sequencing of the human genome (Waterston et al., 2002)

In 2001, draft sequences of the human genome were completed and published simultaneously by two independent groups: the International Human Genome Project (HGP) and the company Celera Genomics. The two groups chose different sequencing strategies, hierarchical shotgun (HS) and whole-genome shotgun (WGS) respectively. The authors of this paper doubted the conclusions reached by Celera, so they re-analyzed Celera's data and argued that it is impossible to assemble the full human genome using Celera's own data alone.

In the introduction, the authors briefly review how the two groups sequenced the human genome and explain the two sequencing strategies in detail. The HS method is laborious and more costly, since BACs and mapping markers are indispensable; however, assembly is easier with this method because the contigs can be assembled independently. The WGS method, on the other hand, breaks the genome directly, sequences all the fragments and assembles them as a whole. Although assembly is difficult, it saves the time and money needed to build a BAC library and find chromosome landmarks. Considering these factors, HGP chose the HS strategy, and Celera adopted WGS years later.

In the analysis and results section, the authors analyze the WGS data step by step. First, they give the total sequence coverage of HS and WGS, 7.5-fold and 5.1-fold respectively. They then point out that, in order to fully assemble the WGS data, Celera made use of HGP's perfect 2x-coverage assembled contigs, so these WGS reads are "faux". Moreover, it is impossible to assemble the full genome sequence using WGS data alone. In particular, even when the WGS data are added to HGP's data there is very little difference, which means the WGS data produced by Celera are entirely dispensable.

In the end, the authors summarize their results and emphasize that their analysis does not imply that the WGS approach is useless.

For me, this paper changed my impression of Celera's human genome project and showed me the weakness of the WGS assembly approach. Before reading it, I thought Celera's work was more valuable because their approach was faster, easier and more affordable than HGP's. Even today, with the revolutionary development of next-generation sequencing technologies and more powerful computing platforms, we still need some laborious BAC library construction work in order to fill the countless gaps in a genome.

Summary 2, On the sequencing and assembly of the human genome (Myers et al., 2002)

Waterston, Lander and Sulston (WLS) cast doubt on applying WGS technology to human genome sequencing and suspected Celera Genomics of using HGP's data in an unusual way to assemble its WGS reads. Celera Genomics responded promptly in the same journal, arguing that WLS's assumptions were incorrect and flawed.

First of all, Celera denied that their full genome assembly was produced from HGP's data. The 2x-coverage BAC data used in their assembly process were unordered and not equivalent to an assembled genome; it is impossible to assemble the whole genome using only these 677,708 bactigs produced by HGP. On the contrary, the whole genome was determined by their own 27 million mate-paired reads. Celera also disputed WLS's simulated assembly, which used chromosome 22 alone: they argued that chromosome 22 makes up only 1% of the whole genome, so assembling it is very different from assembling the whole genome. Celera then repeated WLS's simulation and found that it is impossible to fully assemble the 2x BAC data at 94% identity for chromosome 22.

Additionally, Celera stated that they were under no illusion that WGS methods alone could cover the full genome; their strategy was neither CSA nor WGA but a combined strategy. What's more, although the two published human genome sequences were both 2.6 Gbp and look identical in size, they were quite different. As can be seen in Figure 1, compared with HGP's data, Celera's data have less redundant sequence (140 Mbp versus 50 Mbp), fewer repeated sequences and more unique sequences.

In the end, Celera pointed out that their work was the first to use large-scale WGS data, and that this technology will surely play an important role in future genome sequencing.

Having read these two papers, which argue fiercely over WGS technology, my opinion is that Celera's argument was not strong enough to support their results. The most important point is that without HGP's data they could hardly have assembled the human genome independently using the WGS approach. Looking forward, however, the WGS strategy will be a powerful tool for genome sequencing at lower time and cost, especially with the rapid development of next-generation and even third-generation sequencing technologies.