The in silico lens: Metagenomics


Monday, January 26, 2015

The iChip: A tool for high-throughput microbial culturing

It is commonly accepted that only ~1% of naturally occurring microbes are culturable using standard culturing techniques. Until recently, culturing microbes was the only way to investigate their presence and effects in various environments (e.g. human gut, soil, plant roots, oceans). Metagenomics, the sequencing of DNA from a pool of microbes, has broadened our understanding of microbial communities by circumventing the need for culturing. However, culturing remains an important part of investigating microbial communities, particularly for environments where the microbial diversity is so great that generating complete genome sequences via metagenomics is impractical (e.g. soil).

In 2010, a study from the Epstein and Lewis groups (Nichols et al., 2010) described a new technology that can increase culturing success from ~1% to nearly 50%! This technology, the isolation chip or iChip, contains 384 chambers. A single microbial cell is deposited in each chamber. A semi-permeable membrane is then used to cover all the chambers, and the iChip is placed into the environment from which the microbes originate. The membrane gives the cells access to nutrients and growth factors found in their natural environment.

The iChip has been successfully used to study novel secondary metabolites (Lewis et al., 2010) and discover novel antibiotics (Ling et al., 2015; Lewis, 2013). However, a possible reason why high-throughput culturing techniques have not gained much traction is the emphasis on metagenomic applications and studies which do not suffer from culturing bias. High-throughput culturing and metagenomic sequencing each have a unique set of strengths and weaknesses. Perhaps more thought should be put into designing methods using a combination of high-throughput culturing and metagenomic sequencing to leverage both sets of strengths and mitigate their weaknesses.

Friday, November 21, 2014

Literature Review: Selection on soil microbiomes reveals reproducible impacts on plant function

I recently found an intriguing microbiome paper, and I would like to post my thoughts about it.

Citation
Panke-Buisse et al., Selection on soil microbiomes reveals reproducible impacts on plant function, ISME Journal 2014, doi: 10.1038/ismej.2014.196.

Review
This paper tests whether phenotypes from a soil microbiome associated with a single plant genotype can be recapitulated in other plant genotypes. First, early- and late-flowering time associated soil microbiomes were generated. Ten generations of Arabidopsis thaliana Col-0 were grown, and at each generation soils from the four earliest and four latest flowering pots were collected as the inoculum for the subsequent generation. Using soils from generation 10, final early- and late-flowering time associated inocula were derived. These inocula were then administered to four Arabidopsis thaliana ecotypes (Rld, Ler, Be, and Col-0) and a close relative, Brassica rapa. Flowering time, plant biomass, and nitrogen mineralization enzyme activity were statistically different between early- and late-flowering associated inoculum treatments for each of the genotypes except Ler (Fig 4). Microbes specifically associated with early- and late-flowering time treatments were found in low abundance, suggesting that low-abundance microbes can contribute significantly to interesting phenotypes.

Positives
In general, I found this paper to be very insightful for the following reasons:

Soil microbial phenotypes can persist in novel hosts. This is a really cool result and could have direct implications for building field-useful microbiome treatments. However, flowering time is a well conserved phenotype, so this result may not apply to other specific phenotypes of interest such as crop yield or disease resistance, or to hosts with greater genetic divergence (e.g. corn flowering time vs Arabidopsis).
Low abundance members of a community can drive an important phenotype. A common assumption in metagenomics is that the most abundant members of a community are the most important. Perhaps more focus should be geared toward linking microbes to phenotypes regardless of abundance.
This artificial selection design doesn't require detailed knowledge of genetic mechanisms in order to produce a desired phenotype. It is similar to the old dogma of plant breeding where two good plants were crossed to produce an even better plant with no underlying knowledge of the genetic mechanisms. While understanding the mechanisms is important for maximizing the phenotypic benefits, it is not required to generate a positive effect.

Critiques and Questions

The methods section seems incomplete.

Why are there so few OTUs in the heatmap and ternary plot? Were OTUs filtered out? I would expect to see hundreds or even thousands of OTUs from soil-derived microbiota.
Were the wild soils combined at equal ratios?
What was their measure of flowering time? I think there are well defined methods in the literature for observing flowering time.

Why are the EF associated samples in the 10 generation experiment actually flowering at a later time in later generations (supp fig 1)?
Why are the control samples not included in supp fig 1?
Even though control samples were included as a covariate to generate figure 4, I would have liked to have seen them graphed there as well.
Perhaps a better way of generating the control samples would have been to randomly pick pots to generate a control inoculum. This is what was done in Swenson et al., 2000.

Future Work
This experimental design could be used to address the following intriguing hypotheses:

Selection on soil microbiomes changes the root endophyte microbial community.

Redo the experiment with the same design but also 16S profile endophyte microbes.
It would also be interesting to see if the microbes specific to each treatment are also found inside the root. This would suggest a direct microbial effect on the plant phenotype.

Selection for soil microbiome associated phenotypes is driven by multiple mechanisms.

Redo the experiment but don't combine replicates such that the end inocula include four independent lines of selection for both early- and late-flowering time soil microbiomes. Compare microbes among treatment groups looking for different microbial profiles that express similar phenotypes.

Monday, March 24, 2014

2014 JGI Users Meeting Notes

Here are some notes from a few of the speakers at the JGI Users meeting in California. In general the speakers were fantastic. Some general themes of the conference included single-cell genomics, synthetic biology, fungal metagenomics, and metabolomics. A personal take-home message for me was the need for creative biological solutions to common issues that the human race currently faces or will face in the near future.

Mark Ackermann (opening keynote) – A Single Cell Perspective on Bacterial Interactions

- Focused on phenotypic heterogeneity, when identical cells have different functional profiles.

- Most genes don’t have clonal variation, but for the ones that do, how is that heterogeneity important for the community?

- Salmonella is an example of phenotypic heterogeneity. One cell type causes inflammation and one uses the inflammation response to reproduce and cause full infection.

- Different cell types survive better in different environmental conditions.

- Another example of phenotypic heterogeneity is in alpine lakes where there are generally large amounts of ammonium that bacteria can use as a nitrogen source. However, there are some cells that fix their own nitrogen in the event that ammonium runs out.

- Preliminary data show that neighboring cells are more likely to be of the same cell type.

Mary Berbee – Pectinases link Early Fungal Evolution to the Land Plant Lineage

- Sequenced early divergent fungal groups.

- The relationship between the early branching groups is still poorly resolved.

- Showed some cool trees where she had overlaid two trees to highlight the differences between them. I would like to know what software she used to do this.

- Her trees were based on whole genomes but I’m not sure how she built them.

Rytas Vilgalys – Understanding the Forest Microbiome: A Fungal Perspective

- Oak and pine share many fungi while populus has more different fungi.

- Soils from the same region are likely to share the same fungi.

- Populus of different genotypes do not assemble different fungi. At least not nearly as different as fungi from different regions.

- They have isolated ~1,800 fungal isolates. These isolates represent only ~15% of the isolates that are likely populus endophytes.

- Many fungal isolates stimulate plant growth.

- They are re-inoculating these isolates to confirm they are endophytic.

- Mortierella elongata is an isolate that stimulates plant growth in populus and Arabidopsis thaliana.

- M. elongata also harbors bacterial symbionts (Glomeribacter, a sister to Burkholderia, which are known to affect lipid fermentation). These bacteria cannot be cultured, possibly because they rely so heavily on the host for nutrients.

- M. elongata migrate to the roots.

- Different genes are expressed in M. elongata grown in culture than those sampled from the rhizosphere.

- Different genes are expressed in M. elongata inoculated on different hosts.

Eddy Rubin

- Bacterial genes are typically ~900bp.

- In a couple of sequenced genomes they saw average bacterial gene lengths as low as 200bp. However, when they adjusted the codon table so that one of the stop codons codes for glycine, the predicted genes had an average length of 900bp! Some bacteria use different codon translations!

- Natalia Ivanova is a gene annotation specialist they consulted for help in this analysis.

- They found evidence of recoding in lots of other bacteria by looking at sequenced isolates.

- Didn’t find evidence of recoding in archaea.

- They show that phages which use different codon profiles can circumvent host cell machinery to match their codon profile!

- CRISPR regions in bacterial cells often contain phage elements that correspond to different codon profiles. This is further evidence that phages with different codon profiles can infect cells with canonical codon profiles.
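
The gene-length observation in these notes can be illustrated with a small sketch. This is not the analysis presented in the talk; it is a toy example that assumes a single reading frame, and the choice of TGA as the reassigned codon is a hypothetical picked for illustration. It only shows why treating a sense codon as a stop fragments predicted genes into short pieces.

```python
# Toy illustration: how the assumed set of stop codons changes the length of
# stop-free stretches (a crude proxy for predicted gene length).
# Assumption: which stop codon is reassigned (TGA here) is hypothetical.

STANDARD_STOPS = {"TAA", "TAG", "TGA"}


def orf_lengths(seq, stops):
    """Return lengths (in codons) of stop-free stretches in reading frame 0."""
    lengths = []
    current = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in stops:
            if current:
                lengths.append(current)
            current = 0
        else:
            current += 1
    if current:
        lengths.append(current)
    return lengths


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Toy sequence in which TGA occurs in-frame twice before a final TAA.
toy = "ATGGCC" + "TGA" + "GCTGCA" + "TGA" + "AAACCC" + "TAA"

standard = orf_lengths(toy, STANDARD_STOPS)
recoded = orf_lengths(toy, STANDARD_STOPS - {"TGA"})

print("standard table:", standard, "mean", mean(standard))
print("TGA read as sense:", recoded, "mean", mean(recoded))
```

With the standard table the toy sequence breaks into several short stretches, while reading TGA as a sense codon yields one long stretch, which mirrors the jump in average gene length described above.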

Nicole Dubilier – Metagenomics and Metaproteomic Analyses of Symbioses between Bacteria and Gutless Marine Worms

- Bacteria can use hydrogen to produce more energy than methane. Nature 2011

- They discovered key genes able to metabolize hydrogen.

- The second half of the talk was about gutless worms living in shallow water. They are completely dependent on bacterial symbionts for feeding and waste excretion.

- There are species specific symbionts.

- Her proteomics data yield more obvious features than comparative genomics. As an example she showed how one isolate contains a protein that performs the function of 3 different proteins in the canonical Calvin Cycle. DNA sequencing confirmed this observation, but it would have been a “needle-in-a-haystack” for a comparative genomics project. This work was published in PNAS.

Erin Nuccio – Mapping Soil Carbon from Cradle to Grave: Using Omics and Isotope Analyses to Identify the Microbial Blueprint for Root-enhanced Decomposition of Organic Matter.

- The general question is how do microbes transform and stabilize root carbon in soil.

- Carbon can affect nitrogen rates.

- Plants fix carbon for microbes in the soil.

- Looking at the rhizosphere over time it gradually deviates from bulk soil in carbon levels at time points of 3, 6, 9, and 12 weeks.

- Some preliminary data show that bacteria prefer carbon excreted by the plant as an energy source over nitrogen litter material (i.e. material artificially added to the system).

Michael Fischbach – A Gene-to-Molecule Approach to the Discovery and Characterization of Natural Products

- Discovers natural gene products. By gene products I think he means functional protein units.

- Undiscovered gene products are often coded by clusters of genes.

- Has some type of algorithm to computationally discover these clusters that may produce unknown gene products.

- Lots of his most interesting clusters were found on human associated microbes.

- Discovered several oligosaccharide clusters. These bacteria were very difficult to work with but these clusters and the functions they provide to the human host are of high interest.

- The general observation of this study was that microbes in our gut are making products for which we have no idea what they are or how they function. It’s like taking several prescription drugs for your entire life! We need to figure out what is going on in there.

Kelly Matzen – Genetic Control of Mosquitoes

- In the 1950s DDT was used to control mosquito populations and, in turn, mosquito-borne diseases such as dengue. However, DDT is known to be detrimental to the environment in several ways and therefore is being used much less. We are starting to see diseases like dengue make a comeback in places like Florida and of course in places like Central and South America.

- Right now the most effective control is pesticides.

- They are releasing massive numbers of sterile male mosquitoes to control (i.e. reduce) mosquito populations. This technique was successfully used many years ago in the United States to control populations of other insects.

- This technique seems to be working in the small field studies they have been conducting.

- There is some pushback from legislators, but in general it seems like a good solution.

Cameron Coates – Characterization of Cyanobacterial Hydrocarbon composition and Distribution of Biosynthetic Pathways

- Cyanobacteria produce over 30% of the earth’s oxygen.

- They are very diverse and live in all sorts of habitats on earth.

- They can produce hydrocarbons, which are relevant for use as biofuels. However, they don’t produce large amounts of hydrocarbons.

- They looked at the evolution of cyanobacteria hydrocarbon pathways. There are two main pathways. Several clades have both pathways suggesting a large amount of horizontal gene transfer.

- This work was published in PLOS ONE.

June Medford – Making Better Plants: Synthetic Approaches in Plant Engineering

- They created a biological input/output system. This allows for some external factor to cause a reaction that can be observed in the plant.

- They use a periplasmic binding protein as the input signal because it can quickly diffuse through the cell wall and is then translocated to the nucleus to transcriptionally regulate some response.

- They can theoretically use this system as a flag for pollutants or other dangers that we currently use very expensive technology to detect.

- They are currently developing a system to detect TNT where the response signal of the plant is to turn white. This system can detect traces of TNT 10x smaller than a dog can! There are still some kinks to work through, like response time, but it looks like a very promising system. This idea has countless unexplored applications!

Kankshita Swaminathan – Genome Biology of Miscanthus

- Miscanthus is in the same clade as sugar cane, corn, and sorghum. These plants have been amenable to breeding.

- The genomic sequence of sorghum is very close to Miscanthus except that Miscanthus has had a whole genome duplication event.

- In the winter all the nutrients migrate to the rhizome leaving only the stalk above ground. The stalk is the most important element for biofuels and can be harvested without significantly depleting soil nutrients.

Annalee Newitz (closing keynote) - How Humans Will Survive a Mass Extinction

- Humans have a very good chance of surviving a mass extinction because we are very adaptable. However, our focus should be how we can preserve the diversity of the earth as it is now.

- A mass extinction is when greater than 70% of the earth's species are killed.

- Five mass extinctions have occurred in the history of the earth. Perhaps the largest was caused by cyanobacteria because they released large amounts of oxygen into the atmosphere. Close to 90% of species became extinct as a result.

- Climate change is inevitable regardless of whether or not humans are the cause.

- The questions we should be asking are: how can we respond to these changing climates and what can we do to preserve the world as we know it.

- Space travel seems like an important step in human survival.

Tuesday, October 22, 2013

Read simulators review with an emphasis on metagenomics

Why Read Simulation?

Simulations are an important aspect of bioinformatics that can be used for testing and benchmarking algorithms, optimizing parameters, and generating optimal study design. The following are examples of how read simulations have been successfully utilized in each of these instances.

Testing Algorithms

After writing a new algorithm, how do we ensure that it works? The most effective way is to run the algorithm on a set of data where the answers are known. This is a perfect application for simulated data. Katharina Hoff used DNA sequence simulations to test the effects of sequencing error from several sequencing technologies on specialized metagenomic gene prediction algorithms. She concludes that gene prediction accuracy is poor in the presence of sequencing errors using these algorithms. Furthermore, gene prediction algorithms not specialized for metagenomic data perform as well as or better than their specialized counterparts. This suggests that metagenomic gene predictors could be improved by being more robust to sequence error.

Benchmarking Algorithms

Benchmarking is used to compare accuracy, precision, technical requirements, and other attributes of algorithms. Simulated reads provide a common benchmark for comparing assembly algorithms in the Assemblathon competition [Earl, et al.], and have also been used to benchmark read mapping algorithms such as Bowtie, Soap, and Pass [Horner].

Optimizing Parameters

Researchers often ignore the detailed parameters of algorithms and programs. This can frequently cause problems by violating assumptions made in the software development process. Furthermore, the effects of parameters on results are largely unknown, yet they can significantly affect results and conclusions. To better understand the effects of parameters used by the algorithm FLASH, Magoc et al. use simulated reads to build ROC curves illustrating the trade-offs between correctly and incorrectly merged paired-end reads using different values for the mismatch and minimum overlap parameters (Figures 5 and 6).
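
To make the two parameters concrete, here is a minimal sketch of overlap-based merging of a read pair. It is a simplified stand-in and not FLASH's actual scoring: it simply accepts the longest overlap whose mismatch ratio is within the limit. The function names, parameter names, and default values are assumptions made for illustration.

```python
# Simplified paired-end merging, NOT FLASH's algorithm. It illustrates how a
# minimum-overlap parameter and a mismatch-ratio parameter decide whether a
# pair is merged. All names and defaults here are hypothetical.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge read1 with the reverse complement of read2.

    Tries overlaps from longest to shortest and returns the merged fragment
    for the first overlap whose mismatch ratio is acceptable, or None if no
    overlap of at least min_overlap bases qualifies.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches / overlap <= max_mismatch_ratio:
            return read1 + mate[overlap:]
    return None
```

Running a function like this over simulated pairs, where the true fragment is known, while sweeping `min_overlap` and `max_mismatch_ratio` gives the counts of correct and incorrect merges needed for a ROC-style comparison.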

Optimizing Study Design

Prior to sequencing, it is common to ask questions like, "how much coverage do we need," "what read length should we sequence at," "what sequencing platform is best for our project," "should we use paired-end (PE) or single-end (SE) reads," etc. These and other similar questions can be answered at least in part by sequence simulation. For example, the 1000 Genomes Project Consortium used ART to test the effects of read length and PE insert size on a read's ability to map to the human genome. They conclude that longer reads substantially increase mappability, especially for SE reads. Furthermore, increasing insert size also marginally improves mappability.
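
Before running a full simulation, a back-of-the-envelope coverage estimate is a useful starting point for the "how much coverage" question. The sketch below uses the standard relation coverage = (number of reads × read length) / genome size; the function names are hypothetical, and it assumes uniform coverage, which real data (and good simulators) do not have.

```python
import math

# Rough coverage arithmetic, assuming reads are spread uniformly over the
# genome. Function names are hypothetical helpers for illustration.


def expected_coverage(num_reads, read_length, genome_size, paired=False):
    """Mean depth of coverage; for paired-end data num_reads counts pairs."""
    bases = num_reads * read_length * (2 if paired else 1)
    return bases / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of single-end reads that reaches the target depth."""
    return math.ceil(target_coverage * genome_size / read_length)
```

For example, `expected_coverage(1_000_000, 100, 5_000_000)` describes one million 100bp single-end reads on a 5Mbp genome. A simulation can then be used to check how far real mappability or assembly quality falls from this idealized number.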

Mende et al. also use sequence simulations to test several aspects of study design for whole metagenome sequencing. They conclude that quality control measures such as quality filtering and quality trimming have a substantial impact on assembly by improving accuracy and extending contig lengths. They also evaluate assembly quality on Sanger, pyrosequencing, and Illumina platforms with communities of low (10 genomes), medium (100 genomes), and high (400 genomes) complexity. For the low complexity community all platforms were more or less equal in terms of assembly and accurately represented the functional aspects of the community. For the medium complexity community Illumina produced the best assembly and most accurately represented the functional elements. With the high complexity community none of the platforms were particularly good. However, because of its longer reads, Sanger was still able to represent much of the functional composition.

Things to look for in a simulator?

Good/appropriate error model
Models read lengths
Models coverage bias
Includes quality values
Single-end and Paired-end reads
Multiple sequencing platform capabilities (e.g. Illumina, 454, etc)
Easy to install
Easy to use
Good documentation

What are the simulators out there for...

Genomic DNA sequences

wgsim relies on a simple uniform substitution error model. The uniform error model does not reflect error patterns as accurately as the models and algorithms described below. Another major weakness of wgsim is its inability to generate INDEL errors. Wgsim seems like the most basic read simulator.
simNGS models the processes in an Illumina sequencing run such as cluster generation and cluster intensity. The model must be input as the "runfile" where each line contains parameters for each cycle. The number of cycles contained in the "runfile" corresponds to the number of bases simNGS can model.
MAQ is primarily an assembly algorithm (hence the large number of citations). It also includes a uniform reference sequence mutation algorithm. To simulate reads it uses a first-order Markov chain to model quality at each cycle. Using that model it generates quality values and then bases based on those quality values. Documentation is lacking, so I'm not entirely sure all the values in the table are correct.
GemSIM takes as input a SAM file and FASTQ file for empirical error model generation. This is advantageous because it can easily be extended to new sequencing technologies or upgrades as they are released. It can also simulate metagenome reads based on given abundance proportions.
ART uses quality score data from real Illumina reads to model substitution rates. Quality scores are generally very informative for Illumina reads, but may not be perfect. They map real Illumina reads to a reference to model INDEL rates. I would prefer they do that with substitutions as well. Currently the longest Illumina read length model can generate 75bp reads. This is becoming less useful as read lengths continue to grow. I am less familiar with 454 and Sanger reads so I won't comment on those models.
Mason was built using models empirically derived from fly and yeast reads. According to the technical report Mason can model 100bp Illumina reads from the GAII; however, it appears that it may be possible to build a more up-to-date model. Mason uses previously published models for both 454 and Sanger reads. Mason was developed in C++ and is easily extendable, making it useful for incorporating into your personal code.
FlowSim is specifically for 454 pyrosequences. FlowSim builds an empirical model derived using quality trimmed E. coli K-12 and D. labrax reads. Because the 454 platform interprets reads first as flowgrams, FlowSim models these flowgrams, specifically the homopolymer distributions. During the simulation process, flowgrams are generated and are subsequently interpreted into base calls and corresponding quality scores. The number of cycles indicates roughly the number of bases in each read.
SimSeq has limited documentation! One of the useful attributes of SimSeq is its model for mate-pair chimeras.
pIRS builds an error model based on SOAP2 mapping results or a given SAM file. It uses a combination of two matrices to generate bases and quality values. These matrices are based on empirical training data and a first-order Markov chain (similar to MAQ) of quality scores. This seems like a very good model for both bases and quality values. One of the defining features of pIRS is its simulation of coverage bias across a genome based on %GC content. It also includes an option for mutating the given genome to fabricate reference sequence heterogeneity.
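
Several of the simulators above describe a first-order Markov chain over quality scores, with bases then generated from those qualities. The sketch below illustrates that general idea only; it is not the implementation used by MAQ or pIRS. The transition table is made up for illustration (real tools train it from empirical reads), and errors are limited to substitutions.

```python
import random

# Illustrative first-order Markov chain over quality scores. The transition
# probabilities below are invented for this example, not taken from any tool.
TOY_TRANSITIONS = {
    40: [(40, 0.9), (30, 0.1)],
    30: [(30, 0.8), (20, 0.2)],
    20: [(20, 1.0)],
}


def phred_to_error_prob(quality):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-quality / 10)


def simulate_qualities(length, transitions, start_quality, rng):
    """Draw one quality per cycle; each depends only on the previous cycle."""
    qualities = [start_quality]
    for _ in range(length - 1):
        next_values, weights = zip(*transitions[qualities[-1]])
        qualities.append(rng.choices(next_values, weights=weights, k=1)[0])
    return qualities


def apply_errors(template, qualities, rng):
    """Substitute each base with the probability implied by its quality."""
    read = []
    for base, quality in zip(template, qualities):
        if rng.random() < phred_to_error_prob(quality):
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            read.append(base)
    return "".join(read)


rng = random.Random(0)
template = "ACGTACGTACGTACGTACGT"
qualities = simulate_qualities(len(template), TOY_TRANSITIONS, 40, rng)
read = apply_errors(template, qualities, rng)
```

Because the toy table only allows qualities to stay the same or drop, simulated reads degrade toward the 3' end, which is the qualitative pattern a uniform error model cannot reproduce.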

Metagenomic Sequences

NeSSM can simulate coverage bias but relies on the mapping distributions from real metagenome sequences to the reference database. This is a less useful approach because it cannot be applied to model coverage bias of previously unsequenced metagenomes. A very positive feature of NeSSM is its GPU capabilities. GPU processing is much faster, which facilitates simulating large scale sequencing runs from platforms like the Illumina HiSeq. The download link in the paper was not working, so I was unable to download and test this software.
Grinder -- Can also simulate PCR-dependent amplicon sequences for any amplicon based on given primers or a database of reference amplicons. For amplicon simulations it additionally produces chimeric reads and gene copy number biases. Grinder allows users to input abundance profiles and to model alpha and beta diversity. Quality scores in Grinder are assigned a single "high" quality value for all correct bases and a single "low" quality value for bases designated as errors. Grinder includes a graphical user interface (GUI) and command line interface (CLI). Grinder can also be run on the web through the Galaxy interface.
MetaSim -- Stores user defined genomes in a database. MetaSim can simulate reads from any of the genomes in the database using various empirically defined models or a user defined custom model. Abundance measures for each genome allow users to simulate communities of variable abundance. To simulate heterogeneity in the community, MetaSim can mutate the selected reference genomes based on a phylogenetic tree input describing the degree of mutation. MetaSim also includes a GUI, however more advanced options can be accessed via the CLI. One of the short-comings of MetaSim is the lack of simulated quality values. As an increasing number of algorithms rely on these values this highlights an essential feature that is missing.
Bear is still under development.
GemSIM see above.
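
The common core of the metagenome simulators above is drawing reads from several reference genomes in proportion to a given abundance profile. The sketch below shows that step in isolation. It is not how any of the listed tools is implemented, and it assumes the abundances are proportions of reads (not of cells, which would require weighting by genome size); the function name is hypothetical.

```python
import random

# Minimal abundance-weighted read sampling. Assumption: each abundance is the
# fraction of READS drawn from that genome. Error models are omitted.


def simulate_metagenome_reads(genomes, abundances, num_reads, read_length, rng):
    """Return a list of (genome_name, read_sequence) tuples.

    genomes: dict mapping name -> sequence (each at least read_length long)
    abundances: dict mapping name -> non-negative weight
    """
    names = list(genomes)
    weights = [abundances[name] for name in names]
    reads = []
    for _ in range(num_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        sequence = genomes[name]
        start = rng.randrange(len(sequence) - read_length + 1)
        reads.append((name, sequence[start:start + read_length]))
    return reads


toy_genomes = {
    "genome_a": "ACGT" * 50,
    "genome_b": "TTGACA" * 40,
}
toy_abundances = {"genome_a": 0.9, "genome_b": 0.1}
toy_reads = simulate_metagenome_reads(
    toy_genomes, toy_abundances, num_reads=1000, read_length=25,
    rng=random.Random(1),
)
```

Keeping the source genome name with each read is what makes the simulated data useful as a benchmark, since binning or taxonomic assignment results can be compared against the known origin.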

As a disclaimer I should say that I have only tried a few of these simulators. There may be special features that I have missed.

Discussion

Simulations are an important and underutilized method for testing algorithms and planning experiments. It can be tempting to haphazardly start sequencing or push your data through a collaborator's favorite algorithm using the default parameters. These temptations can lead to failed experiments, unusual results, or a misunderstanding of data. A more systematic approach would be to build informative simulations to anticipate and address problematic algorithm assumptions and experimental design flaws. This approach requires a large up-front cost; however, it can save time and money in the long run.

As a cautionary note, simulations are not biology. Sequence simulations are an imperfect image of a very complicated world. Conclusions drawn from simulation experiments must be taken with a grain of salt. In some cases it is vital to design biological experiments to validate such conclusions. The utility of sequence simulations is to perform a cheap, first-pass test of algorithms and experimental designs.

While sequence simulators are not expected to perfectly represent real sequences, the closer they can get to that point the better. Most simulators carefully model quality of bases, but there are only a couple that model biases imposed by sequencing platforms and sample preparation protocols, such as GC content bias. This is one area where sequence simulators have room to improve. However, building such models is challenging due to the large number of variables that contribute to such biases. For example, there are hundreds of available sample prep protocols for hundreds of biological techniques. Building a universal model to account for biases from each of these protocols is an enormous task, especially in a world where protocols can have a life expectancy of only a few months. Furthermore, unique biases can be introduced by the set of DNA itself. Sequencing human DNA will produce different biases with different levels of effect than sequencing a single bacterial strain.
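
As a rough illustration of what modeling GC content bias can look like, the sketch below picks read start positions by rejection sampling, accepting a candidate window with a probability that depends on its GC fraction. The bell-shaped acceptance curve and its `optimum` and `width` parameters are assumptions chosen for illustration; they are not taken from any simulator or protocol, and a real model would be fit to empirical coverage data.

```python
import math
import random

# Toy GC-bias model: windows near an assumed optimal GC fraction are sampled
# more often. The curve shape and parameters are hypothetical.


def gc_fraction(sequence):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def acceptance(gc, optimum=0.5, width=0.2):
    """Probability of keeping a read whose window has the given GC fraction."""
    return math.exp(-((gc - optimum) ** 2) / (2 * width ** 2))


def sample_biased_starts(genome, read_length, num_reads, rng):
    """Rejection-sample read start positions under the GC acceptance curve."""
    starts = []
    last_start = len(genome) - read_length
    while len(starts) < num_reads:
        start = rng.randint(0, last_start)
        window = genome[start:start + read_length]
        if rng.random() < acceptance(gc_fraction(window)):
            starts.append(start)
    return starts


# Toy genome: an AT-only half followed by a balanced-GC half.
toy_genome = "AT" * 50 + "ACGT" * 25
toy_starts = sample_biased_starts(toy_genome, 20, 200, random.Random(3))
```

In this toy genome most sampled reads start in the balanced-GC half, so the AT-only half ends up under-covered. Swapping in a different acceptance curve per protocol is one way to see how sensitive a downstream analysis is to this kind of bias.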

In conclusion, sequence simulators can be extremely useful for testing and benchmarking algorithms, optimizing parameters, and designing experiments. There are a number of simulators available, each with its own strengths and weaknesses. Careful consideration should be used when picking a simulator and interpreting simulation experiments.
