Introduction
The term "pan genome" was coined in 2005 by Tettelin et al. in a paper describing the genomes of eight pathogenic Streptococcus strains. The pan genome is the set of all unique genes from a set of genomes (i.e. the gene union). The core genome is the set of genes found in every genome (i.e. the gene intersection). The accessory genome is the set of genes missing from at least one genome; its most distinctive members are the genes unique to a particular genome (i.e. strain-specific genes).
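The set definitions above can be sketched in a few lines of Python. This is a toy illustration only: the strain names and gene labels are made up, and it assumes genes have already been grouped into ortholog clusters so that the same label means the same gene across genomes.

```python
# Toy gene content for three hypothetical genomes (labels are illustrative only).
genomes = {
    "strain_A": {"dnaA", "gyrB", "recA", "capA"},
    "strain_B": {"dnaA", "gyrB", "recA", "capA", "tetM"},
    "strain_C": {"dnaA", "gyrB", "recA", "sipX"},
}


def pan_genome(genomes):
    """Union of all gene sets: every gene seen in at least one genome."""
    return set().union(*genomes.values())


def core_genome(genomes):
    """Intersection of all gene sets: genes present in every genome."""
    return set.intersection(*genomes.values())


def strain_specific(genomes):
    """For each genome, the genes found in no other genome."""
    result = {}
    for name, genes in genomes.items():
        others = set().union(*(g for n, g in genomes.items() if n != name))
        result[name] = genes - others
    return result


pan = pan_genome(genomes)
core = core_genome(genomes)
unique = strain_specific(genomes)

print("pan:", sorted(pan))
print("core:", sorted(core))
print("strain-specific:", {k: sorted(v) for k, v in unique.items()})
```

In real data the hard part is the clustering step that produces the shared labels, not the set arithmetic.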
Previous Studies
Read et al. (2012) discuss the pan and core genome of the phytoplankton Emiliania huxleyi. They estimate that the E. huxleyi pan genome is large because several thousand genes in the reference are missing from all three well-sequenced isolates. This is definitely more of a core genome paper (an analysis that is easy when you have a reference). Perhaps a better way of showing the diversity of the pan genome would be rarefaction curves on the number of homologous genes, or perhaps on k-mer content. The lead investigator, Igor Grigoriev, heads fungal genomics at the JGI.
Pan and Core Genome Dynamics
In general, as more genomes are added to the analysis set, the core genome shrinks and the pan genome grows. Collins and Higgs (2012) describe these dynamics in their Molecular Biology and Evolution publication using an infinitely many genes (IMG) model. In summary, the Collins IMG model is based on three classes of genes: core, shell, and cloud. Core genes are found in all genomes, shell genes are gained and lost at a relatively slow rate, and cloud genes are gained and lost rapidly. Empirical data from a set of Bacillaceae genomes support the Collins IMG model. The model can be used to predict the size of core and pan genomes. In the Dangl lab this has become a question of substantial interest for determining which clades need more isolate genomes to support more robust functional genomics analyses.
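The shrinking core and growing pan genome can be seen empirically by adding genomes one at a time in random order and averaging over many orderings. The sketch below does only that: it is a simple resampling curve, not an implementation of the Collins IMG model. The function name, the toy gene sets, and the default of 100 permutations are all assumptions made for illustration.

```python
import random

# Toy gene content for three hypothetical genomes (labels are illustrative only).
genomes = {
    "strain_A": {"dnaA", "gyrB", "recA", "capA"},
    "strain_B": {"dnaA", "gyrB", "recA", "capA", "tetM"},
    "strain_C": {"dnaA", "gyrB", "recA", "sipX"},
}


def accumulation_curves(genomes, n_permutations=100, seed=0):
    """Mean pan and core genome sizes after adding 1..N genomes in random order."""
    rng = random.Random(seed)  # fixed seed so the curves are reproducible
    names = list(genomes)
    n = len(names)
    pan_totals = [0] * n
    core_totals = [0] * n
    for _ in range(n_permutations):
        rng.shuffle(names)
        pan, core = set(), None
        for i, name in enumerate(names):
            genes = genomes[name]
            pan |= genes  # the union can only grow
            core = set(genes) if core is None else core & genes  # can only shrink
            pan_totals[i] += len(pan)
            core_totals[i] += len(core)
    pan_mean = [t / n_permutations for t in pan_totals]
    core_mean = [t / n_permutations for t in core_totals]
    return pan_mean, core_mean


pan_mean, core_mean = accumulation_curves(genomes)
for k, (p, c) in enumerate(zip(pan_mean, core_mean), start=1):
    print(f"{k} genomes: mean pan size = {p:.2f}, mean core size = {c:.2f}")
```

If the pan curve is still climbing steeply at the last point, the clade is probably undersampled, which is the practical question raised above.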
Core vs Accessory
Given a set of genes from various genomes, which gene set is most interesting? The answer depends heavily on what is already known about the genomes in question. In the Dangl lab we are interested in microbes that inhabit the endosphere (inner plant root). Because the set of microbes living inside these roots exhibits a different profile than the surrounding soil, one could hypothesize that a single gene or a small set of genes is responsible for a microbe's ability to inhabit the inner root. Under this hypothesis the core genome would be of particular interest because it should contain these genes. In practice, however, the core genome is composed primarily of common cellular functions ubiquitous to all bacteria.
Because the core genome is unlikely to reveal anything of specific interest, the accessory genome seems most interesting. The genes in this set point to what makes a particular genome functionally distinct. They get at questions like: "What functions does a particular bacterium provide to the community?"
Software for pan-genome analysis
The primary aim in any pan-genome analysis is grouping orthologous genes from different genomes. To do this, many pan-genome pipelines draw on algorithms and databases such as GO, COG, KEGG, eggNOG, and Pfam.
Here are some notes on various pipelines developed for pan-genome analyses.
- GET_HOMOLOGUES (my recommendation)
  - This is my program of choice for pan/core/accessory genome analyses.
  - Fantastic documentation
  - Options for bidirectional best hit (BDBH), COGtriangle, and/or OrthoMCL algorithms for building clusters of orthologous groups.
  - Builds clear figures
  - Several options for powerful downstream analyses.
  - Parallelization options
- Panseq (Laing, 2010)
  - Nice web-based interface.
  - Seems to work (as opposed to some of the following programs)
  - Output formats are not as user friendly or as concise as those of GET_HOMOLOGUES
  - The job-completion email never arrives. Be sure to save the link somewhere.
  - More suited for small, quick analyses
- PGAT (Brittnacher, 2011)
  - Nice web-based interface and documentation
  - Limited to comparisons among the small number of genomes already stored in their database.
- PGAP (Zhao, 2011)
  - Did not install because it requires legacy BLAST (blastall)
- PanFunPro (Lukjancenko, 2013)
  - Still in the development stage. I had a quick look at the source code and there were some things that didn't make sense. I'll give the developers a little more time to work out the kinks.
  - The installation can take some time because of some large dependencies (e.g. InterProScan). The installation of the PanFunPro Perl scripts could also be streamlined with a tool like Module::Build. However, the installation instructions for its dependencies are well written, which makes installation remarkably easy.
- PanOCT (Fouts, 2012)
  - Primarily an algorithm for determining orthology between sets of genes from two or more prokaryote genomes.
  - Considers conservation of neighboring genomic regions when determining orthology. The basic idea is that two genes that are truly orthologous (as opposed to paralogous) will sit in the same genomic neighborhood.
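
The bidirectional best hit (BDBH) idea listed among the GET_HOMOLOGUES options can be sketched as follows. This is a minimal illustration of the general reciprocal-best-hit technique, not the implementation used by GET_HOMOLOGUES or any other pipeline above. It assumes hits have already been parsed into (query, subject, bitscore) tuples; the gene IDs and scores are invented, and real pipelines typically apply further filters that are left out here.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    On a tied score the first hit seen is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hit tables for genome A searched against genome B and vice versa.
a_vs_b = [
    ("A1", "B1", 250.0),
    ("A1", "B2", 90.0),
    ("A2", "B2", 180.0),
    ("A3", "B1", 120.0),
]
b_vs_a = [
    ("B1", "A1", 248.0),
    ("B2", "A2", 175.0),
    ("B2", "A1", 88.0),
    ("B3", "A3", 60.0),
]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here A3's best hit is B1, but B1's best hit is A1, so A3 is left unpaired. That is the kind of one-directional match that a neighborhood-aware method like PanOCT tries to resolve using gene context instead.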