Wednesday, February 12, 2014

A Brief Introduction to Sequence Assembly

Assembly Background
Sequence assembly is one of the overarching challenges in bioinformatics.  To understand the assembly problem it helps to understand some basics of DNA sequencing.  Consider a bacterium having a genome comprised of a single 5 megabase (5 million base pairs) chromosome.  Ideally, sequencing machines would start at the beginning of the chromosome and read each of the 5 million base pairs until arriving at the end.  Unfortunately, the current technology is limited to reading sequences between 30 and ~10,000+ bases.  The assembly problem is to take these short segments of DNA called reads and overlap them in such a way to recreate the original 5Mb chromosome.

To illustrate this consider the set of character strings below that come from a quote by Theodore Roosevelt (spaces have been replaced with "_" for clarity).  Can you put the pieces together to find out what it says?

You should end up with something that looks like this:

This is more or less what assembly programs attempt to do with DNA.  Some things to notice in the above example:
  • Repeats can be problematic during assembly.  Notice that the word "you" is used twice in this sentence.  Looking at the two character strings "Believe_yo" and "you're_ha" you may have incorrectly merged them to form the character string "Believe_you're_ha" which is an incorrect assembly.  DNA repeats are common in genomes and can fragment assemblies or cause assembly mistakes.
  • Longer reads can help with the repeat problem.  For example, given only two long character strings "Believe_you_can_and_you're" and "can_and_you're_halfway_there" it is much easier to unambiguously assembly the quotation despite the fact that the word "you" is used twice.
  • Sequencing errors complicate assembly.  For example, if the character string "halfway" was sequenced as "calfway" there would be no way to finish the assembly correctly because "you're_ha" does not overlap with "calfway."  
  • Coverage (i.e. the number of times a character is represented in the assembly) helps distinguish sequencing errors.  For example, if we sample from the quote (or in the genome in the case of DNA) many times and see the character string "halfway" 100 times and the character string "calfway" only once we can assume that "calfway" was incorrectly sequenced.  

De Novo vs Mapping Assembly
De novo is a latin expression meaning "from the beginning" (Wikipedia).  De novo sequence assemblies are build with no external information beyond the raw sequencing reads.  First pass de novo assembles are called "draft" assemblies because the genome remains fragmented (i.e. discontinuous) and may contain assembly errors.  Extensive resequencing and curation is generally required to "complete" or "finish" a genome assembly.

Alternatively, mapping assembly uses a reference sequence as an anchor to orient sequenced reads.  After reads are ordered based on their location in the reference sequence a consensus sequence is generated from all the mapped reads.  The consensus sequence can differ from the reference sequence but differences are generally single base differences scattered throughout the genome.  Mapping assembly is most useful when the reference sequence and sequenced organism are closely related.

In some cases (e.g. metagenomics), a combination of de novo and mapping assemblies may be advantageous.  However, hybrid assembly algorithms and protocols are not well explored.

De Novo Assembly Algorithm Classes
There are two main classes of assembly algorithms used in de novo assembly:  overlap-layout-consensus (OLC) and de bruijn graph (DBG).  Similar to the above example, OLC first finds reads with overlapping ends, builds a layout graph based on these overlaps, and lastly generates a consensus sequence as the graph is traversed.  OLC was the first assembly method developed and works well with long-read, low-coverage sequencing technologies like Sanger (and possibly PacBio).

DBG based assemblers convert the set of reads into a set of k-mers (i.e. short DNA sequences of length k).  These k-mers are then used to build a de bruijn graph from which the genomic sequence is inferred.  DBG assemblers work well with high-coverage sequencing methods like Illumina and Ion Torrent.  

Useful Links

1 comment:

  1. Thank you very much! The example is quite clear for me, though I am not familiar with bioinformatics.