Friday, August 22, 2014

How to Order Contigs Using a Reference Genome

Background
In genomics, complete (i.e. finished) genomes provide an excellent resource for future sequencing projects.  However, generating a finished genome is an expensive and laborious endeavor.  In some cases, the returns of finishing a genome are not worth the cost, particularly when a draft genome can provide enough information for robust hypothesis testing.  Draft genomes contain a large number of unordered sequences of various lengths called contigs.  Contigs can be joined into longer sequences called scaffolds using paired-end reads.  While a set of contigs/scaffolds is useful for a wide variety of projects (gene annotation, gene expression profiling, etc.), contig order provides vital information for comparative genomics and for analyses of specific genomic regions.  Typically, a rough ordering of contigs can be obtained by mapping the contigs to a closely related reference genome.
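To make the idea of ordering by reference position concrete, here is a minimal Python sketch.  It assumes you already have alignment coordinates for each contig (for example, parsed from an aligner's output); the Hit record, the order_contigs function, and the example coordinates are all hypothetical names and values made up for illustration, not part of any particular tool.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment of a contig to the reference (hypothetical record)."""
    contig: str
    ref_start: int  # start coordinate on the reference
    ref_end: int    # end coordinate on the reference
    strand: str     # "+" or "-"


def order_contigs(hits):
    """Keep the longest hit per contig, then sort contigs by reference start."""
    best = {}
    for hit in hits:
        span = hit.ref_end - hit.ref_start
        current = best.get(hit.contig)
        if current is None or span > current.ref_end - current.ref_start:
            best[hit.contig] = hit
    return sorted(best.values(), key=lambda h: h.ref_start)


# Made-up example coordinates; contig_1 has a short secondary hit that is ignored.
hits = [
    Hit("contig_3", 9000, 12000, "+"),
    Hit("contig_1", 100, 4000, "-"),
    Hit("contig_2", 4500, 8000, "+"),
    Hit("contig_1", 20000, 20300, "+"),
]

ordered = order_contigs(hits)
print([h.contig for h in ordered])  # ['contig_1', 'contig_2', 'contig_3']
```

Picking the single longest hit per contig is a simplification; contigs that span a rearrangement or a repeat would need more careful handling.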

ABACAS
Recently, I discovered a nice software package called ABACAS, which not only orders contigs but also creates a single contiguous sequence representing the connected contigs.  Gaps between contigs are represented by N's and overlapping regions are resolved (although I could not find information on exactly how).  Because ABACAS requires MUMmer, its outputs integrate well with other MUMmer scripts such as those for visualizing alignments (e.g. mummerplot; Figure 1).  ABACAS is hosted on SourceForge and is well documented.  It was published in Bioinformatics in 2009.
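The idea of joining placed contigs into one sequence with N's in the gaps can be sketched in a few lines of Python.  This is only an illustration of the concept and not how ABACAS itself works: the join_contigs function, the min_gap parameter, and the toy sequences are my own assumptions, and overlapping contigs are simply separated by a fixed spacer of N's rather than being resolved.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def join_contigs(placements, min_gap=10):
    """Join contigs into one sequence, padding gaps with N's.

    placements: list of (sequence, ref_start, ref_end, strand) tuples,
    already sorted by ref_start.  min_gap is a hypothetical spacer length
    used when contigs abut or overlap on the reference (an assumption,
    not ABACAS behavior).
    """
    pieces = []
    prev_end = None
    for seq, ref_start, ref_end, strand in placements:
        if strand == "-":
            seq = reverse_complement(seq)  # orient to match the reference
        if prev_end is not None:
            gap = ref_start - prev_end - 1  # bases missing between contigs
            pieces.append("N" * max(gap, min_gap))
        pieces.append(seq)
        prev_end = ref_end
    return "".join(pieces)


# Toy example: the second contig maps to the reverse strand.
placements = [("ACGT", 1, 4, "+"), ("AAAC", 10, 13, "-")]
print(join_contigs(placements, min_gap=2))  # ACGTNNNNNGTTT
```

Gap sizes estimated this way are only as good as the assumption that the draft and reference genomes are collinear in that region.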



Notes
Of course, the more closely related the draft and reference genomes are, the more accurate the ordering will be.  However, in cases where the target genome is VERY closely related to the reference genome, it may be better to map raw reads to the reference genome instead of building a de novo assembly.  Deviations from the reference genome can then be discovered using SNP detection software.  Read mapping experiments are generally cheaper because they require fewer reads than de novo assembly, yet they can still detect biologically significant differences from the reference genome.