Tuesday, September 30, 2014

When is my genome finished?

This is a common question for anyone doing genome sequencing and assembly.

The short answer:  never.

The slightly longer answer:  it depends.

The long answer:  Finishing a genome means the order of all nucleotide bases has been correctly resolved.  Even for simple genomes this is extremely difficult.  Billions of dollars have been spent to sequence the human genome, and it is estimated that 92% of the human genome has greater than 99.99% accuracy (Schmutz et al., 2004).  In total, 99% of the genome has been assembled, and the remaining 1% is likely to consist of highly repetitive regions with little or no gene content (remember that 1% of 3 billion total bases means there are about 30 million unresolved bases).  Using the current technology, it is impossible to reach 100% accuracy across 100% of the human genome.  For simpler genomes, such as some viral and bacterial genomes, it is possible to completely resolve the entire sequence.  However, because these genomes are much more dynamic (i.e. they change at a faster rate), generating a completely finished genome may not be worth the cost.

Finishing a genome requires a substantial amount of time, work, and money.  However, getting a genome to draft status (i.e. incomplete but usable) can be done with minimal cost and resources.  So perhaps the most important question is:  when is my genome usable?  The answer depends on the research questions being considered.  For example, when studying the evolutionary history of genomes with a high propensity for genomic rearrangements, generating long contigs/scaffolds is important for determining the orientation and lineage of those rearrangements.  Alternatively, some research questions are more concerned with the completeness of the gene content, for making functional comparisons between genomes.  For such questions, generating long, continuous contigs/scaffolds is less important.

Here are some possible ways to estimate completeness (with what I think are the best methods near the top):
  • Compare the gene content of highly conserved genes
    • Eukaryote:  CEGMA
    • Bacterial:  CheckM
    • Archaeal:  A table of 53 conserved COGs
  • Check for possible errors using REAPR
  • Calculate assembly metrics like contig number, N50, coverage, and assembly size.  
  • Compare the size of your assembled genome to a related (or set of related) genome(s).  
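
As a rough illustration of the last two items, here is a minimal Python sketch that computes contig number, assembly size, and N50 from a FASTA file, and optionally compares the assembly size to a related genome.  The function names, the `size_ratio` key, and the example lengths are my own assumptions for illustration; they do not come from any particular assembly tool.  N50 is taken here as the length of the contig at which the running total, counting from the longest contig down, first reaches half of the assembly size.

```python
def contig_lengths_from_fasta(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_metrics(contig_lengths, reference_size=None):
    """Compute contig number, assembly size, and N50.

    reference_size is the (hypothetical) size of a related genome, in bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)

    # Walk from the longest contig down until half the assembly is covered.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    metrics = {"contigs": len(lengths), "assembly_size": total, "n50": n50}
    if reference_size:
        # A ratio far from 1.0 hints at missing or duplicated sequence.
        metrics["size_ratio"] = total / reference_size
    return metrics


if __name__ == "__main__":
    # Toy contig lengths; for a real assembly, pass an open FASTA file
    # to contig_lengths_from_fasta() instead.
    print(assembly_metrics([100, 80, 60, 40, 20], reference_size=400))
```

With the toy lengths above, the assembly size is 300 bases, the N50 is 80, and the size ratio against the assumed 400-base reference is 0.75.  Keep in mind that these metrics describe contiguity and size only; they say nothing about whether the assembly is correct.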
See these papers for interesting discussion on evaluating assemblies:  


  1. Hi Scott,
    big fan of your blog. I also find it useful to check the completeness of genome assemblies using Phylosift (http://phylosift.wordpress.com/). It can also tell you if there are contaminants.

    1. Thanks, Livio! I appreciate the useful comment. Yes, Phylosift looks like a useful software package.