Tuesday, February 25, 2014

The Pan/Core/Accessory Genome

Introduction
The term "pan genome" was coined in 2005 by Tettelin et al. in a paper describing the genomes of eight pathogenic Streptococcus strains.  The pan genome is the set of all unique genes from a set of genomes (i.e. the gene union).  The core genome is the set of genes found in every genome (i.e. the gene intersection).  The accessory genome is the set of genes missing from at least one genome; at the extreme these are genes unique to a particular genome (i.e. strain-specific genes).
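
Since these definitions are just set operations, here is a minimal Python sketch of them.  The strain names and gene identifiers are made up for illustration, and it assumes that genes have already been grouped into families so that the same identifier means "the same gene" in every genome.

```python
# Hypothetical gene-family content for three genomes (identifiers are made up).
genomes = {
    "strainA": {"dnaA", "gyrB", "recA", "toxX"},
    "strainB": {"dnaA", "gyrB", "recA", "capY"},
    "strainC": {"dnaA", "gyrB", "recA", "capY", "phgZ"},
}


def pan_genome(genomes):
    """Union of all gene sets."""
    return set().union(*genomes.values())


def core_genome(genomes):
    """Intersection of all gene sets."""
    return set.intersection(*genomes.values())


def strain_specific(genomes, name):
    """Genes found in one genome and in no other."""
    others = set().union(*(g for n, g in genomes.items() if n != name))
    return genomes[name] - others


pan = pan_genome(genomes)
core = core_genome(genomes)
accessory = pan - core  # everything that is missing from at least one genome

print("pan:", sorted(pan))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
for name in genomes:
    print(name, "specific:", sorted(strain_specific(genomes, name)))
```

Note that strainB ends up with no strain-specific genes here because capY is shared with strainC, even though capY is not part of the core.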

Previous Studies
Read et al. (2012) discuss the pan and core genome of phytoplankton.  They estimate the size of the E. huxleyi pan genome to be large because several thousand genes in the reference are missing from all three of the well-sequenced isolates.  This is definitely more of a core genome paper (an analysis that is easy when you have a reference).  Perhaps a better way of showing the diversity of the pan genome would be rarefaction curves based on the number of homologous genes, or perhaps on k-mer content.  The lead investigator, Igor Grigoriev, is the fungal genomics lead investigator at the JGI.

Pan and Core Genome Dynamics
In general, as more genomes are added to the analysis set, the core genome shrinks and the pan genome grows.  Collins (2012) describes these dynamics in a Molecular Biology and Evolution publication using an infinitely many genes (IMG) model.  In summary, the Collins IMG model is based on three gene classes:  core, shell, and cloud genes.  Core genes are those found in all genomes, shell genes are gained and lost from genomes at a relatively slow rate, and cloud genes are rapidly gained and lost from genomes.  Empirical data from a set of Bacillaceae genomes support the Collins IMG model.  The Collins IMG model can be used to predict the size of core and pan genomes.  In the Dangl lab this has become a question of substantial interest for deciding which clades need more isolate genomes to support more robust functional genomics analyses.
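
To see the shrinking core and growing pan genome empirically, one can add genomes one at a time in random order and average the sizes over many orderings.  The sketch below does only that; it is not the Collins IMG model and makes no predictions beyond the genomes it is given.  The function name `pan_core_curves` and the toy gene sets are my own inventions for illustration.

```python
import random


def pan_core_curves(genomes, n_permutations=100, seed=0):
    """Mean pan and core genome sizes after adding 1..N genomes in random order."""
    rng = random.Random(seed)
    names = list(genomes)
    n = len(names)
    pan_totals = [0] * n
    core_totals = [0] * n
    for _ in range(n_permutations):
        rng.shuffle(names)
        pan, core = set(), None
        for i, name in enumerate(names):
            genes = genomes[name]
            pan |= genes                                  # union grows
            core = set(genes) if core is None else core & genes  # intersection shrinks
            pan_totals[i] += len(pan)
            core_totals[i] += len(core)
    pan_mean = [total / n_permutations for total in pan_totals]
    core_mean = [total / n_permutations for total in core_totals]
    return pan_mean, core_mean


# Toy data: c* genes are shared by all, s* by some, x* by a single genome.
toy_genomes = {
    "g1": {"c1", "c2", "c3", "s1", "x1"},
    "g2": {"c1", "c2", "c3", "s1", "x2"},
    "g3": {"c1", "c2", "c3", "s2", "x3"},
    "g4": {"c1", "c2", "c3", "s1", "s2"},
}

pan_mean, core_mean = pan_core_curves(toy_genomes)
for i, (p, c) in enumerate(zip(pan_mean, core_mean), start=1):
    print(f"{i} genomes: pan ~ {p:.2f}, core ~ {c:.2f}")
```

Plotting the two lists against the number of genomes gives the usual pair of curves.  A pan curve that is still climbing at the last genome suggests the clade would benefit from more isolates.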

Core vs Accessory
Given a set of genes from various genomes, which gene set is most interesting?  Of course the answer depends heavily on previous knowledge of the genomes in question.  In the Dangl lab we are interested in microbes that inhabit the endosphere (inner plant root).  Because the set of microbes living inside these roots exhibits a different profile than the surrounding soil, one could hypothesize that a single gene or a small set of genes is responsible for a microbe's ability to inhabit the inner root.  Under this hypothesis the core genome would be of particular interest because it should contain these genes.  However, in practice the core genome is primarily composed of common cellular functions ubiquitous to all bacteria.

Because the core genome is likely to reveal nothing of specific interest, the accessory genome seems most interesting.  The genes in this set allude to what makes a particular genome functionally distinct and interesting.  They get at questions like:  "What functions does a particular bacterium provide to the community?"

Software for pan-genome analysis
The primary aim in any pan-genome analysis is grouping orthologous genes from different genomes.  To do this, many pan-genome pipelines use algorithms and databases such as GO, COG, KEGG, eggNOG, Pfam, etc.

Here are some notes on various pipelines developed for pan-genome analyses.
  • GET_HOMOLOGUES (my recommendation)
    • This is my program of choice for pan/core/accessory genome analyses.  
    • Fantastic documentation
    • Options for bidirectional best hit (BDBH), COGtriangles, and/or OrthoMCL algorithms for building clusters of orthologous groups.
    • Builds clear figures
    • Several options for powerful downstream analyses. 
    • Parallelization options
  • Panseq (Laing, 2010)
    • Nice web-based interface.  
    • Seems to work (as opposed to some of the following programs)
    • Output formats are not as user friendly or as concise as those of GET_HOMOLOGUES
    • Job completion email never comes in.  Be sure to save the link somewhere.
    • More suited for small, quick analyses
  • PGAT (Brittnacher, 2011)
  • PGAP (Zhao, 2011) 
    • Did not install because it requires the old version of blast (blastall)
  • PanFunPro (Lukjancenko, 2013)
    • Still in the development stage.   I had a quick look at the source code and there were some things that didn't make sense.  I'll give the developers a little more time to work out the kinks.  
    • The installation can take some time because of some large dependencies (e.g. InterProScan).  Furthermore, the installation of the PanFunPro Perl scripts could be streamlined using a tool like Module::Build.  However, the installation instructions for its dependencies are well written, making installation remarkably easy. 
  • PanOCT (Fouts, 2012)
    • Primarily an algorithm for determining orthology between sets of genes from 2 or more closely related prokaryote genomes.  
    • Considers conservation of neighboring genomic regions for determining orthology.  The basic idea is that two genes that are truly orthologous (as opposed to paralogous) will be situated in the same genomic neighborhood.
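
As a rough illustration of the bidirectional best hit (BDBH) idea listed under GET_HOMOLOGUES above, here is a minimal Python sketch.  It is not how GET_HOMOLOGUES implements BDBH; it only shows the core rule that two genes are paired when each is the other's top-scoring hit.  The gene names and scores are made up, and the input format (a dict of pairwise scores, e.g. parsed from BLAST tabular output) is an assumption.

```python
def best_hits(hits):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in hits.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Made-up bit scores: genome A genes searched against genome B, and vice versa.
a_vs_b = {("A1", "B1"): 250, ("A1", "B2"): 90, ("A2", "B2"): 180, ("A3", "B1"): 120}
b_vs_a = {("B1", "A1"): 245, ("B1", "A3"): 118, ("B2", "A2"): 175, ("B2", "A1"): 88}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

A3 hits B1, but B1's best hit is A1, so A3 is left unpaired.  That is the kind of case (a likely paralog) where a neighborhood-aware method such as PanOCT brings in extra evidence.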

Wednesday, February 12, 2014

A Brief Introduction to Sequence Assembly

Assembly Background
Sequence assembly is one of the overarching challenges in bioinformatics.  To understand the assembly problem it helps to understand some basics of DNA sequencing.  Consider a bacterium having a genome comprised of a single 5 megabase (5 million base pairs) chromosome.  Ideally, sequencing machines would start at the beginning of the chromosome and read each of the 5 million base pairs until arriving at the end.  Unfortunately, current technology is limited to reading stretches of roughly 30 to 10,000+ bases at a time.  The assembly problem is to take these short segments of DNA, called reads, and overlap them in such a way as to recreate the original 5Mb chromosome.

To illustrate this consider the set of character strings below that come from a quote by Theodore Roosevelt (spaces have been replaced with "_" for clarity).  Can you put the pieces together to find out what it says?


You should end up with something that looks like this:

Believe_you_can_and_you're_halfway_there


This is more or less what assembly programs attempt to do with DNA.  Some things to notice in the above example:
  • Repeats can be problematic during assembly.  Notice that the word "you" is used twice in this sentence.  Looking at the two character strings "Believe_yo" and "you're_ha" you may have incorrectly merged them to form the character string "Believe_you're_ha" which is an incorrect assembly.  DNA repeats are common in genomes and can fragment assemblies or cause assembly mistakes.
  • Longer reads can help with the repeat problem.  For example, given only two long character strings "Believe_you_can_and_you're" and "can_and_you're_halfway_there" it is much easier to unambiguously assemble the quotation despite the fact that the word "you" is used twice.
  • Sequencing errors complicate assembly.  For example, if the character string "halfway" was sequenced as "calfway" there would be no way to finish the assembly correctly because "you're_ha" does not overlap with "calfway."  
  • Coverage (i.e. the number of reads in which a given character is represented) helps distinguish sequencing errors.  For example, if we sample from the quote (or from the genome in the case of DNA) many times and see the character string "halfway" 100 times and the character string "calfway" only once, we can assume that "calfway" was incorrectly sequenced.  
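
The coverage point can be sketched in a few lines of Python by counting every substring of length k (a k-mer) across the reads and flagging the rare ones.  The reads, the value of k, and the threshold below are all made up for illustration; real error correction tools are considerably more careful than a single cutoff.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_kmers(reads, k, min_count):
    """k-mers seen fewer than min_count times, i.e. likely sequencing errors."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n < min_count}


# 100 correct copies of a read plus one copy with an "h" misread as "c".
reads = ["you're_halfway_there"] * 100 + ["you're_calfway_there"]

flagged = suspicious_kmers(reads, k=7, min_count=3)
print(sorted(flagged))
```

Every flagged 7-mer contains the erroneous "c", while "halfway" and the other well-supported 7-mers are left alone.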

De Novo vs Mapping Assembly
De novo is a Latin expression meaning "from the beginning" (Wikipedia).  De novo sequence assemblies are built with no external information beyond the raw sequencing reads.  First-pass de novo assemblies are called "draft" assemblies because the genome remains fragmented (i.e. discontinuous) and may contain assembly errors.  Extensive resequencing and curation are generally required to "complete" or "finish" a genome assembly.

Alternatively, mapping assembly uses a reference sequence as an anchor to orient sequenced reads.  After reads are ordered based on their location in the reference sequence a consensus sequence is generated from all the mapped reads.  The consensus sequence can differ from the reference sequence but differences are generally single base differences scattered throughout the genome.  Mapping assembly is most useful when the reference sequence and sequenced organism are closely related.
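
Here is a minimal sketch of the consensus step in Python.  It assumes the hard part (aligning each read to the reference) has already been done by a read mapper, so every read arrives with a 0-based start position, and it ignores insertions and deletions.  The reference, reads, and positions are made up for illustration.

```python
from collections import Counter


def consensus(reference_length, mapped_reads, uncovered="N"):
    """Majority-vote base at each reference position; `uncovered` where no read maps."""
    columns = [Counter() for _ in range(reference_length)]
    for start, sequence in mapped_reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return "".join(
        column.most_common(1)[0][0] if column else uncovered
        for column in columns
    )


reference = "ACGTACGTTAGC"

# (start position on the reference, read sequence); positions are assumed known.
mapped_reads = [
    (0, "ACGTAGG"),
    (2, "GTAGGTT"),
    (3, "TAGGTTA"),
    (5, "GGTTAGC"),
]

sample = consensus(len(reference), mapped_reads)
differences = [
    (i, ref_base, new_base)
    for i, (ref_base, new_base) in enumerate(zip(reference, sample))
    if ref_base != new_base
]
print(sample)
print(differences)
```

The consensus differs from the reference at a single position, which is the typical picture when the sequenced organism and the reference are closely related.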

In some cases (e.g. metagenomics), a combination of de novo and mapping assemblies may be advantageous.  However, hybrid assembly algorithms and protocols are not well explored.

De Novo Assembly Algorithm Classes
There are two main classes of assembly algorithms used in de novo assembly:  overlap-layout-consensus (OLC) and de Bruijn graph (DBG).  Similar to the above example, OLC first finds reads with overlapping ends, builds a layout graph based on these overlaps, and lastly generates a consensus sequence as the graph is traversed.  OLC was the first assembly method developed and works well with long-read, low-coverage sequencing technologies like Sanger (and possibly PacBio).
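
To make the overlap idea concrete, here is a toy greedy assembler in Python that repeatedly merges the two strings with the longest suffix/prefix overlap.  This is a simplification rather than a full OLC implementation: there is no explicit layout graph, no consensus step, and no tolerance for sequencing errors.  The fragments are ones I cut from the Roosevelt quote for illustration.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair until no overlaps of min_length remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # nothing left to merge; the assembly stays fragmented
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


fragments = [
    "re_halfway_there",
    "Believe_you_can_a",
    "and_you're_halfway",
    "you_can_and_you're",
]
print(greedy_assemble(fragments))

# Short reads plus a permissive overlap reproduce the repeat problem.
print(greedy_assemble(["Believe_yo", "you're_ha"], min_length=2))
```

The first call recovers the full quote.  The second merges on the two-character overlap "yo" and produces the incorrect "Believe_you're_ha", which is why real assemblers require long, well-supported overlaps.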

DBG-based assemblers convert the set of reads into a set of k-mers (i.e. short DNA sequences of length k).  These k-mers are then used to build a de Bruijn graph from which the genomic sequence is inferred.  DBG assemblers work well with high-coverage sequencing methods like Illumina and Ion Torrent.  
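
A small Python sketch of the graph construction follows.  Each distinct k-mer becomes an edge from its first k-1 bases to its last k-1 bases, and the sequence is spelled out by walking the graph.  The walk below only handles the easy case of a single unbranched path; real assemblers also have to deal with repeats, sequencing errors, and reverse complements.  The reads and the choice of k=4 are made up for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Edges are distinct k-mers; nodes are their (k-1)-mer prefixes and suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph):
    """Spell a sequence by following edges for as long as the path is unbranched."""
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    sequence = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in visited:
            break  # a cycle (repeat); stop rather than loop forever
        visited.add(node)
        sequence += node[-1]
    return sequence


# Overlapping reads sampled from the toy genome ATGCGTACCTGA.
reads = ["ATGCGTAC", "CGTACCTG", "TACCTGA"]

graph = de_bruijn_graph(reads, k=4)
print(walk_unambiguous_path(graph))
```

Notice that the reads are never compared to each other directly.  All of the overlap information is implicit in the shared k-mers, which is what makes this approach practical for very large numbers of short reads.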


Useful Links