The in silico lens: When is my genome finished?

This is a common question for anyone doing genome sequencing and assembly.

The short answer: never.

The slightly longer answer: it depends.

The long answer: Finishing a genome means that the order of all nucleotide bases has been correctly resolved. Even for simple genomes this is extremely difficult. Billions of dollars have been spent to sequence the human genome, and it is currently estimated that 92% of the human genome has greater than 99.99% accuracy (Schmutz et al., 2004). In total, 99% of the genome has been assembled, and the remaining 1% is likely to consist of highly repetitive regions with little or no gene content (remember that 1% of 3 billion total bases means there are about 30 million unresolved bases). Using current technology, it is impossible to reach 100% accuracy across 100% of the human genome. For simpler genomes, such as some viral and bacterial genomes, it is possible to completely resolve the entire sequence. However, because these genomes are much more dynamic (i.e., they change at a faster rate), generating a completely finished genome may not be worth the cost.
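To make the scale of those percentages concrete, here is a small back-of-the-envelope sketch. It only uses the rounded figures quoted above (3 billion bases, 99% assembled, 92% at better than 99.99% accuracy); the variable names are my own, and the error count is an upper bound implied by those figures rather than a measured value.

```python
# Back-of-the-envelope numbers from the paragraph above.
# All inputs are the rounded figures quoted in the text, not measurements.
GENOME_SIZE = 3_000_000_000       # approximate human genome size in bases
ASSEMBLED_PERCENT = 99            # share of the genome that has been assembled
HIGH_ACCURACY_PERCENT = 92        # share with greater than 99.99% accuracy
BASES_PER_ERROR = 10_000          # 99.99% accuracy = at most 1 error per 10,000 bases

# Integer arithmetic avoids floating-point rounding surprises.
unresolved_bases = GENOME_SIZE * (100 - ASSEMBLED_PERCENT) // 100
high_accuracy_bases = GENOME_SIZE * HIGH_ACCURACY_PERCENT // 100
max_errors_in_high_accuracy = high_accuracy_bases // BASES_PER_ERROR

print(f"Unresolved bases: {unresolved_bases:,}")
print(f"High-accuracy bases: {high_accuracy_bases:,}")
print(f"Upper bound on errors in the high-accuracy portion: {max_errors_in_high_accuracy:,}")
```

Even in the best-resolved 92% of the genome, an accuracy of 99.99% still leaves room for a few hundred thousand incorrect bases, which is part of why the short answer is "never."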

Finishing a genome requires a substantial amount of time, work, and money. However, getting a genome to draft status (i.e., incomplete but usable) can be done with minimal cost and resources. So perhaps the most important question is: when is my genome usable? The answer depends on the research questions being considered. For example, when studying the evolutionary history of genomes with a high propensity for genomic rearrangements, generating long contigs/scaffolds is important for determining the orientation and lineage of those rearrangements. Alternatively, some research questions are more concerned with the completeness of the gene content, for making functional comparisons between genomes. For such questions, generating long, continuous contigs/scaffolds is less important.

Here are some possible ways to estimate completeness (with what I think are the best methods near the top):

Compare the gene content of highly conserved genes:
    Eukaryote: CEGMA
    Bacterial: CheckM
    Archaeal: A table of 53 conserved COGs
Check for possible errors using REAPR.
Calculate assembly metrics like contig number, N50, coverage, and assembly size.
Compare the size of your assembled genome to a related (or set of related) genome(s).
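The last two items in the list are easy to compute yourself. Below is a minimal sketch in plain Python, assuming your assembly is an ordinary FASTA file. The function names (contig_lengths_from_fasta, assembly_metrics, size_ratio) are hypothetical helpers I made up for illustration, not part of any of the tools above. N50 here uses the common definition: the length of the shortest contig in the smallest set of largest contigs that together cover at least half of the assembly. Coverage is left out because it needs the read data, not just the contigs.

```python
def contig_lengths_from_fasta(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_metrics(contig_lengths):
    """Compute contig number, assembly size, largest contig, and N50."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        # First contig at which the running sum reaches half the assembly size.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "assembly_size": total,
        "largest_contig": lengths[0] if lengths else 0,
        "n50": n50,
    }


def size_ratio(assembly_size, reference_size):
    """Assembled size as a fraction of a related genome's size."""
    return assembly_size / reference_size


# Toy example with made-up sequences; in practice, pass an open file handle:
#     with open("assembly.fasta") as handle:
#         lengths = contig_lengths_from_fasta(handle)
toy_fasta = [">contig1", "ACGTACGT", "ACGT", ">contig2", "GGCC", ">contig3", "TTAATT"]
toy_lengths = contig_lengths_from_fasta(toy_fasta)
toy_metrics = assembly_metrics(toy_lengths)
print(toy_metrics)
print(size_ratio(toy_metrics["assembly_size"], 44))  # 44 is a made-up reference size
```

Keep in mind that these numbers describe contiguity and size only. A high N50 says nothing about whether the contigs are correct, which is why the gene content and error checks sit higher in the list.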

See these papers for interesting discussion on evaluating assemblies:

Tuesday, September 30, 2014