Bayesian Genome Assembly and MCMC Assessment

Introduction

They first build an assembly graph starting from a de Bruijn graph of the reads. Then they remove all tips and merge all unambiguous paths into single nodes that are annotated by the sequence of merged K-mers.

The resulting unresolved assembly graph (no longer de Bruijn) is a directed graph that consists only of bubbles and is a minimal representation of the variants that can be inferred from the sequenced data. Concatenating the sequences across the nodes in a particular path through this graph gives a possible assembly sequence.

Key ideas

A particular assembly hypothesis is represented as a boolean vector, whose values indicate which of the enumerated edges in the assembly graph are in an active path (Figure 1).

Figure 1

Contiguous paths through the assembly graph correspond to proposed assemblies, and are represented by a boolean vector indicating which edges in the graph are active.

Then, turning on an edge in an alternate branch turns on the entire alternate branch and turns off the existing path through the bubble (Figure 2).

Figure 2

(a) Starting from an existing (blue) assembly, the proposal mechanism randomly chooses a new (green) edge. (b) If the edge is not already active, it is extended with a random walk until it meets the existing assembly. (c) This defines a (red) branch in the existing assembly, (d) which is then removed to yield a new (blue) assembly.

Disccusion

It’s very interesting. But by the way,

Our prototype assembler cannot yet consider repeats, as each edge can only be visited once.

There’s still a lot of work to do.

Reference

Howison M, Zapata F, Edwards EJ, Dunn CW (2014) Bayesian Genome Assembly and Assessment by Markov Chain Monte Carlo Sampling. PLoS ONE 9(6): e99497. doi:10.1371/journal.pone.0099497.

Comments