May 2 2013

De novo assembly of highly heterozygous genomes

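I recently read the paper
Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads.
It looks like the assembly of large, highly heterozygous genomes has made some encouraging progress. The Japanese group's work is solid, and there is a fair amount in it worth borrowing.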

Abstract

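As sequencing platforms improve, throughput is no longer the bottleneck and prices keep falling. For non-model organisms and wild species, having a genome sequence is increasingly valuable for research. In most cases, though, these species are highly heterozygous or polyploid, which is a serious problem for de novo projects that mainly assemble short reads. There is still no mature, practical solution (well-funded labs excepted).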

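Broadly, there are two workable approaches to heterozygous genomes, and both are widely acknowledged to cost a lot of time, labor, money, and brainpower:

Fosmid-based (or BAC-based) hierarchical sequencing;

Inbred lines (doubled-monoploid clones).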

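Fosmid- or BAC-based hierarchical assembly requires building a large number of long-insert libraries, and both the bench work and the assembly are painstaking. Classic examples such as oyster (Zhang et al. 2012), diamondback moth (You et al. 2013), and Norway spruce (Nystedt et al. 2013) are all worth reading.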

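The other route is a doubled monoploid, which is something you get lucky with rather than plan on. Highly heterozygous plants such as potato and sweet orange were sequenced and assembled from doubled monoploids obtained by anther culture.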

Back to genome assembly

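What is assembly? Put simply, genome sequences are very long and sequencing reads are very short, hence the need to stitch reads together. Take the human genome: about 3 Gb, so the 23 chromosomes of a haploid set average 130,434,783 bp (3,000,000,000 / 23). Even BAC-end sequencing only gives you the two end sequences of a fragment of roughly 100 kb; first-generation Sanger reads are about 1,000 bp; today's short reads are around 100 bp. The reads are far too short, yet we have to use them to reconstruct much longer sequences. It is the relationship between bricks and a house. Perhaps some day sequencing will read an entire chromosome in one go; longer reads will inevitably improve assembly.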

Back to the point: genome assembly also has two mainstream approaches:

De Bruijn Graph based algorithms (Gnerre et al. 2011; Pevzner et al. 2001; Vinson et al. 2005; Zerbino and Birney 2008);

Overlap Layout Consensus algorithms (Kurtz et al. 2004).

I have left two loose ends above; I will fill them in when I get the chance.

[Figure: Genome assembly]

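A quick aside. The DBG approach is very effective for short reads. The idea is to chop reads into even shorter k-mers (fragments of a fixed length, e.g. K = 17); the sequence shared between k-mers is the overlap information. One formulation uses k-mers as vertices and connects vertices that overlap, so the edges carry the overlap. The other does the reverse: the (K-1)-mer overlaps are the vertices and the k-mers that extend them are the edges. Either way, the reads become a graph of linked k-mers, and each read is a path through several k-mers. If you can find an Eulerian path (k-mers as edges) or a Hamiltonian path (k-mers as vertices), that traversal is the final genome.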
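To make the (K-1)-mer formulation concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any real assembler is implemented: the function names `build_de_bruijn` and `walk_unambiguous` are made up for illustration, reverse complements and sequencing errors are ignored, and the walk only checks out-degree.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend from `start` while the current node has exactly one successor.

    Simplification: only out-degree is checked, and the walk stops if it
    would revisit a node.
    """
    path = [start]
    seen = {start}
    node = start
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    # Spell the sequence: first node, then the last base of each later node.
    return path[0] + "".join(n[-1] for n in path[1:])


# Three overlapping toy reads drawn from the sequence ATGGCGTGCAA.
reads = ["ATGGCGTGCA", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

With no errors, repeats, or heterozygosity, the graph is a single chain and the walk spells the original sequence. The three problems discussed next are exactly the cases where a node gets more than one successor.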

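Obviously this graph is huge and complicated, so there are many algorithms and studies on how to build and resolve it. Three things need attention:

sequencing errors;

heterozygosity;

repeats.

Sequencing errors create false connections, but these have low depth, so they are tractable.
Heterozygosity forms bubbles whose two paths have similar coverage. Regions of low heterozygosity are tractable; highly heterozygous ones are hard.
Repeats are more complicated: sometimes they look like heterozygosity, and sometimes they reflect complex structure. In general, resolving a repeat takes reads longer than the repeat itself so that they can span it.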
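A small sketch shows why a single heterozygous site turns into a bubble. The two haplotype strings and the helper `kmers` are invented for illustration; the point is only that one SNP changes every k-mer that covers it, so each haplotype contributes its own path of k k-mers between shared flanks.

```python
def kmers(seq, k):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


hap_a = "ACGTTGCATGCCAGT"
hap_b = "ACGTTGCTTGCCAGT"  # same sequence with one SNP (A -> T at index 7)
k = 5

set_a = set(kmers(hap_a, k))
set_b = set(kmers(hap_b, k))

shared = set_a & set_b   # flanking k-mers, covered by reads from both haplotypes
only_a = set_a - set_b   # one branch of the bubble
only_b = set_b - set_a   # the other branch

print(len(shared), len(only_a), len(only_b))
```

Each branch is supported only by reads from one haplotype, while the shared flanks collect reads from both. Several nearby SNPs or an indel make the branches longer and less alike, which is where the hard cases begin.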

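OLC algorithms work well with long reads: chopping long reads into k-mers throws information away, whereas the overlap graph between long reads is fairly clean. The downsides are algorithmic complexity, the large number of overlap alignments to compute and validate, and the substantial compute needed to build and resolve the graph.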
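The overlap step of OLC can be sketched as follows. This is a deliberately naive version, assuming exact suffix-prefix matches and an all-against-all comparison; the names `suffix_prefix_overlap` and `overlap_graph` are hypothetical, and real tools use indexing and error-tolerant alignment instead.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if < min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_len):
    """Directed edges (i, j) -> overlap length, for every ordered pair of reads."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = suffix_prefix_overlap(a, b, min_len)
                if olen:
                    edges[(i, j)] = olen
    return edges


reads = ["ATGGCGTGCA", "GCGTGCAATC", "GCAATCGGTA"]
edges = overlap_graph(reads, min_len=4)
print(edges)
```

The nested loop compares every ordered pair of reads, which is the cost mentioned above: the number of comparisons grows quadratically with the number of reads unless an index is used.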

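How do they do it?

So, how do they actually handle the assembly of a highly heterozygous genome?

Because they mainly rely on short reads from second-generation high-throughput sequencing, they build on the strengths and ideas of DBG, with some important optimizations and improvements.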

“Similarly to other de Bruijn-graph-based assemblers, Platanus first constructs contigs from a de Bruijn graph and then builds up scaffolds from the contigs using paired-end or mate-pair libraries.”

“various improvements (e.g.,k-mer auto-extension) have been implemented to allow Platanus to efficiently handle giga-order and relatively repetitive genomes. In addition, Platanus efficiently captures heterozygous regions containing structural variations, repeats, and/or low-coverage sites; it can merge haplotypes during not only the contig assembly step but also the scaffolding step to overcome the challenge of heterozygosity.”

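The overall flow is still DBG: first build contigs from the overlaps between reads, then build scaffolds from the paired-end relationships between reads. They made many improvements, though, such as k-mer auto-extension, which helps with giga-order and relatively repetitive genomes. Platanus also captures heterozygous regions (those containing structural variations, repeats, and/or low-coverage sites) and can merge heterozygous haplotypes in both the contig step and the scaffolding step.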

Algorithm overview

It comes down to three steps:

Contig assembly;

Scaffolding;

Gap-close.

Look familiar? Just like SOAPdenovo (Li et al. 2010) and Velvet (Zerbino and Birney 2008), right? Come on, let the figures do the talking!

STEP 1: Contig assembly

[Figure: contig assembly]

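The main points of this step:

Split reads into k-mers and count k-mer frequencies; low-frequency k-mers are treated as sequencing errors and dropped;

Build the de Bruijn graph from the k-mers and remove short, low-coverage branches;

Mark junction nodes;

K-mer extension;

Remove bubbles;

Output junction-free contigs.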
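The first item in the list above, counting k-mers and dropping the rare ones, can be sketched like this. It is a toy under stated assumptions: the functions `count_kmers` and `solid_kmers` and the fixed `min_count` cutoff are illustrative choices, reverse complements are ignored, and this is not how Platanus chooses its threshold.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times; rarer ones are treated as errors."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Three clean copies of a read plus one copy with a single substitution error.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
solid = solid_kmers(counts, min_count=2)
print(sorted(solid))
```

The erroneous read contributes three k-mers that each occur once, so they fall below the cutoff. In a heterozygous genome the cutoff has to be chosen with care, because true haplotype-specific k-mers also sit at roughly half the depth of homozygous ones.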

The fourth and fifth steps are the key ones: they deal with short repeats and with heterozygous bubbles. The paper puts it this way:

“the k-mers are extracted from both the contigs of the kpre-mer graph and reads containing marked kpre-mers. In this way, repeats shorter than k can typically be resolved, and Platanus effectively excludes junctions caused by heterozygosity, short repeats, and errors.”

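Junction nodes are simply k-mer nodes with conflicts or branches. Errors, repeats, and heterozygosity can all cause them. Errors form low-coverage tips that are very easy to remove, and misjoins are also easy to spot. For short repeats and small heterozygous differences, they use the original reads that contain junction nodes to help resolve the graph. It is a very nice idea, similar to colored DBGs and to how transcriptome assemblers handle alternative splicing.

Then there is bubble removal. First, what causes bubbles: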

“Bubble structures are caused by both the heterozygosity of the diploid samples and errors.”

Next, what a bubble looks like:

“A bubble is defined as a set of two straight nodes and two junction nodes at which the straight nodes are connected to the same junction in both directions.”

Then comes how to remove them, which is fairly intuitive. The conditions are:

a high identity between the two straight nodes and

a low coverage depth of k-mers in the two straight nodes.

They also keep the removed bubbles for use in the later scaffolding step:

“The second condition is helpful to distinguish heterozygous regions from repetitive regions. The removed bubble structures are saved and utilized in the Scaffolding step.”
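The two removal conditions can be written down as a small predicate. This is a sketch only: the function names, the position-wise identity measure (which assumes equal-length branches, so substitutions only), and the thresholds `min_identity` and `max_cov_ratio` are all assumptions made for illustration, not values taken from the paper.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length branches."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_heterozygous_bubble(branch_a, branch_b, cov_a, cov_b, homozygous_cov,
                           min_identity=0.9, max_cov_ratio=0.75):
    """True when both conditions hold: high identity and low depth on both branches.

    min_identity and max_cov_ratio are illustrative thresholds.
    """
    similar = identity(branch_a, branch_b) >= min_identity
    limit = max_cov_ratio * homozygous_cov
    low_depth = cov_a <= limit and cov_b <= limit
    return similar and low_depth


def merge_bubble(branch_a, branch_b, cov_a, cov_b):
    """Keep the better-covered branch; return the other so it can be saved for later."""
    if cov_a >= cov_b:
        return branch_a, branch_b
    return branch_b, branch_a


a, b = "ACGTTGCATGCC", "ACGTTGCTTGCC"  # one mismatch in 12 bases
het = is_heterozygous_bubble(a, b, cov_a=20, cov_b=22, homozygous_cov=40)
rep = is_heterozygous_bubble(a, b, cov_a=40, cov_b=42, homozygous_cov=40)
kept, saved = merge_bubble(a, b, cov_a=20, cov_b=22)
print(het, rep, kept, saved)
```

The second call shows the role of the depth condition: two similar branches at full depth look like copies of a repeat rather than two alleles, so they are not merged.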

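STEP 2: Scaffolding

That gives us contig sequences. The paired-end relationships between reads can then be used to estimate the order of the contigs and the gaps between them, joining contigs into longer scaffolds. This requires mapping the original reads back onto the assembled contigs.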
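As a rough illustration of how paired reads give a gap estimate, consider pairs whose first read maps forward on contig A and whose second read maps in reverse on contig B. The sketch below assumes a known, fixed insert size and this simplified orientation model; the function `estimate_gap` and its coordinate convention are hypothetical, and it does not reproduce Platanus's actual estimator.

```python
from statistics import median


def estimate_gap(links, len_a, insert_size):
    """Median gap implied by read pairs linking contig A to contig B.

    links: (start_on_a, end_on_b) per pair, where start_on_a is the 0-based
    leftmost position of read 1 on A and end_on_b is the exclusive rightmost
    position of read 2 on B. For each pair:
        insert_size = (len_a - start_on_a) + gap + end_on_b
    """
    gaps = [insert_size - (len_a - start) - end for start, end in links]
    return median(gaps)


# Three pairs linking a 1000 bp contig A to contig B, library insert size 500 bp.
links = [(800, 150), (850, 210), (900, 240)]
gap = estimate_gap(links, len_a=1000, insert_size=500)
print(gap)
```

Using the median over many pairs dampens the effect of insert-size variation and of the occasional mismapped pair. A negative estimate would suggest that the two contigs overlap.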

This part does the following:

Re-map bubbles removed in contig-assembly

Map PE/MP reads

Construct scaffold graph

Remove bubbles and branches

Output scaffold sequences

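The first step places the removed bubbles back onto the contigs, which makes it easier to identify heterozygous contigs. The rest is mapping, estimating insert-size distances, building the contig graph, and resolving it. Because heterozygous contigs exist and can be traced, there is one extra step compared with ordinary scaffolding: removing bubbles and branches.
Clearly these bubbles and branches come from highly heterozygous regions (i.e., regions with high SNV densities and/or structural variations). They are removed based on three pieces of information: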

(1) coverage depth,
(2) identity with other contigs, and
(3) bubble structures constructed in Contig-assembly.

The first two are similar to how bubbles are handled in the contig step; the last rests on the assumption of a diploid genome:

“The third condition means that Platanus assumes that the target genome is diploid and therefore does not allow for triple or higher-ordered heterozygote alleles.”

[Figure: scaffolding]