Tuesday, October 22, 2013

Read simulators review with an emphasis on metagenomics

Why Read Simulation?

Simulations are an important aspect of bioinformatics that can be used for testing and benchmarking algorithms, optimizing parameters, and generating optimal study design.  The following are examples of how read simulations have been successfully utilized in each of these instances.

Testing Algorithms

After writing a new algorithm how do we ensure that it works?  The most effective way is to run the algorithm on a set of data where the answers are known.  This is a perfect application for simulated data.  Katharina Hoff use DNA sequence simulations to test the effects of sequencing error from several sequencing technologies on specialized metagenomic gene prediction algorithms.  She concludes that gene prediction accuracy is poor in the presence of sequencing errors using these algorithms.  Furthermore, gene prediction algorithms not specialized for metagenomic data perform as well or better than their specialize counterparts.  This suggests that metagenomic gene predictors could be improved by being more robust to sequence error.

Benchmarking Algorithms

Benchmarking is used to compare accuracy, precision, technical requirements, and other attributes of algorithms.  Simulated reads provide a common benchmark for comparing assembly algorithms in the Assemblathon competition [Earl, et al.], and have also been used to benchmark read mapping algorithms such as Bowtie, Soap, and Pass [Horner].

Optimizing Parameters

Researchers often ignore the detailed parameters of algorithms and programs.  This can frequently cause problems by violating assumptions made in the software development process.  Furthermore, the effects of parameters on results are largely unknown yet can significantly effect results and conclusions.  To better understand the effects of parameters used by the algorithm FLASH, Magoc et al. use simulated reads to build ROC curves illustrating the trade-offs between correctly and incorrectly merged paired-end reads using different values for the mismatch and minimum overlap parameters (Figures 5 and 6).

Optimizing Study Design

Prior to sequencing, it is common to ask questions like, "how much coverage do we need," "what read length should we sequence at," "what sequencing platform is best for our project," "should we use paired-end (PE) or single-end (SE) reads," etc.  These and other similar questions can be answered at least in part by sequence simulation.  For example, the 1000 Genomes Project Consortium used ART to test the effects of read length and PE insert size on a reads ability to map to the human genome.  They conclude that longer reads substantial increase mappability especially for SE reads.  Furthermore, increasing insert size also marginally improves mappability.

Furthermore, Mende et al. use sequence simulations to test several aspects of study design for whole metagenome sequencing.  They conclude that quality control measures such as quality filtering and quality trimming have a substantial impact on assembly by improving accuracy and extending contig lengths.  They also evaluate assembly quality on Sanger, pyrosequencing, and Illumina platforms with communities of low (10 genomes), medium (100 genomes), and high (400 genomes) complexity.  For the low complexity community all platforms were more or less equal in terms of assembly and accurately represented the functional aspects of the community.  For the medium complexity community Illumina produced the best assembly and most accurately represented the functional elements.  With the high complexity community none of the platforms were particularly good, however, because of the longer reads, Sanger was still able to represent much of the functional composition.  

Things to look for in a simulator?

  • Good/appropriate error model
  • Models read lengths
  • Models coverage bias
  • Includes quality values
  • Single-end and Paired-end reads
  • Multiple sequencing platform capabilities (e.g. Illumna, 454, etc)
  • Easy to install
  • Easy to use
  • Good documentation

What are the simulators our there for...

Genomic DNA sequences

  • wgsim relies on a simple uniform substitution error model.  The uniform error model does not reflect error patterns as accurately as the models and algorithms described below.  Another major weakness of wgsim its inability to generate INDEL errors.  Wgsim seems like the most basic read simulator.
  • simNGS models the processes in an Illumina sequencing run such as cluster generation and cluster intensity.  The model must be input as the "runfile" where each line contains parameters for each cycle.  The number of cycles contained in the "runfile" corresponds to the number of bases simNGS can model.
  • MAQ is primarily an assembly algorithm (hence the large number of citations).  It also includes a uniform reference sequence mutation algorithm.  To simulate reads it uses a first-order markov chain to model quality at each cycle.  Using that model it generates quality values and then bases based on those quality values.  Documentation is lacking, so I'm not entirely sure all the values in the table are correct.
  • GemSIM takes as input a SAM file and FASTQ file for empirical error model generation.  This is advantageous because it can easily be extended to new sequencing technologies or upgrades as they are released.  It can also simulate metagenome reads based on given abundance proportions.  
  • ART uses quality score data from real Illumina reads to model substitution rates.  Quality scores are generally very informative for Illumina reads, but may not be perfect.  They map real Illumina reads to a reference to model INDEL rates.  I would prefer they do that with substitutions as well.  Currently the longest Illumina read length model can generate 75bp reads.  This is becoming less useful as read lengths continue to grow.  I am less familiar with 454 and Sanger reads so I won't comment on those models.  
  • Mason was build using models empirically derived from fly and yeast reads.  According to the technical report Mason can model 100bp Illumina reads from the GAII, however it appears that it may be possible to build a more up-to-date model.  Mason uses previously published models for both 454 and Sanger reads.  Mason was developed in C++ and is easily extendable making it useful for incorporating into your personal code.
  • FlowSim is specifically for 454 Pyrosequences.  FlowSim builds an empirical model derived using quality trimmed E. coli K-12 and D. Labrax reads.  Because the 454 platform interprets reads first as flow grams, FlowSim models these flow grams, specifically the homopolymer distributions.  During the simulation process, flow grams are generated and are subsequently interpreted into base calls and corresponding quality scores.  The number of cycles indicates roughly the number of bases in each read.
  • SimSeq has limited documentation!  One of the useful attributes of SimSeq is its model for mate-pair chimeras.
  • pIRS builds an error model based on SOAP2 mapping results or a given SAM file.  It uses a combination of two matrices to generate bases and quality values.  These matrices are based of empirical training data and a first-order markov chain (similar to MAQ) of quality scores.  This seems like a very good model for both bases and quality values.  One of the defining features of pIRS is its simulation of coverage bias across a genome based on %GC content.  It also includes an option for mutating the given genome to fabricate reference sequence heterogeneity.

Metagenomic Sequences

  • NeSSM can simulate coverage bias but relies on the mapping distributions from real metagenome sequences to the reference database.  This is less useful approach because it cannot be applied to model coverage bias of previously unsequenced metagenomes.  A very positive feature of NeSSM is its GPU capabilities.  GPU processing is much faster which facilitates simulating large scale sequencing runs from platforms like the Illumina HiSeq.  The download link in the paper was not working, so I was unable to download and test this software.
  • Grinder -- Can also simulate PCR dependent amplicon sequences for any amplicon based on given primers or a database of reference amplicons suca!  For amplicon simulations it additionally produces chimeric reads and gene copy number biases.  Grinder allows users to input abundance profiles, model alpha and beta diversity.  Quality scores in Grinder are assigned a single "high" quality value for all correct bases and a single "low" quality value for bases designated as errors.  Grinder includes a graphical user interface (GUI) and command line interface (CLI).  Grinder can also be run on the web through the Galaxy interface.
  • MetaSim -- Stores user defined genomes in a database.  MetaSim can simulate reads from any of the genomes in the database using various empirically defined models or a user defined custom model.  Abundance measures for each genome allow users to simulate communities of variable abundance.  To simulate heterogeneity in the community, MetaSim can mutate the selected reference genomes based on a phylogenetic tree input describing the degree of mutation.  MetaSim also includes a GUI, however more advanced options can be accessed via the CLI.  One of the short-comings of MetaSim is the lack of simulated quality values.  As an increasing number of algorithms rely on these values this highlights an essential feature that is missing.
  • Bear is still under development.  
  • GemSIM see above.
As a disclaimer I should say that I have only tried a few of these simulators.  There may be special features that I have missed.


Simulations are an important and under utilized method for testing algorithms and planning experiments.  It can be tempting to haphazardly start sequencing or push your data through a collaborators favorite algorithm using the default parameters.  These temptations can lead to failed experiments, unusual results, or a misunderstanding of data.  A more systematic approach would be to build informative simulations to anticipate and address problematic algorithm assumptions and experimental design flaws.  This approach requires a large up-front cost, however it can save time and money in the long run.  

As a cautionary note, simulations are not biology.  Sequence simulations are an imperfect image of a very complicated world.  Conclusions drawn from simulation experiments must be taken with a grain of salt.  In some cases it is vital to design biological experiments to validate such conclusions.  The utility of sequence simulations is to perform a cheap, first-pass test of algorithms and experimental designs.  

While sequence simulators are not expected to perfectly represent real sequences, the closer they can get to that point the better.  Most simulators carefully model quality of bases, but there are only a couple that model biases imposed by sequencing platforms and sample preparation protocols such as GC content bias.  This is one area where sequence simulators have room to improve.  However, building such models is challenging due to the large number of variables that contribute to such biases.  For example, there are hundreds of available sample prep protocols for hundreds of biological techniques.  Building a universal model to account for biases from each of these protocols is an enormous task especially in a world where protocols can have a life expectance rate of only a few months.  Furthermore, unique biases can be introduced by the set of DNA itself.  Sequencing human DNA will produce different biases with different levels of effect than sequencing a single bacterial strain.  

In conclusion, sequence simulators can be extremely useful for testing and benchmarking algorithms, optimizing parameters, and designing experiments.  There are a number of simulators available each with its own strengths and weaknesses.  Carfull consideration should be used when picking a simulator and interpreting simulation experiments.

Other interesting reads

  • Lessons learned from simulated metagenomics datasets
  • Zagordi, et al. Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction.  2012.
  • Pignatelli, et al. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. 2011.
  • Morgan, et al. Metagenomic Sequencing of an In Vitro-Simulated Microbial Community. 2010.
  • Mavromatis, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. 2007.

UPDATE - Oct 8, 2014
  • The new BEAR simulator seems like a good option as well.