Tuesday, October 22, 2013

Read simulators review with an emphasis on metagenomics

Why Read Simulation?

Read simulations are an important tool in bioinformatics that can be used for testing and benchmarking algorithms, optimizing parameters, and optimizing study design.  The following are examples of how read simulations have been successfully used in each of these areas.

Testing Algorithms

After writing a new algorithm, how do we ensure that it works?  The most effective way is to run the algorithm on a set of data where the answers are known.  This is a perfect application for simulated data.  Katharina Hoff used DNA sequence simulations to test the effects of sequencing error from several sequencing technologies on specialized metagenomic gene prediction algorithms.  She concluded that gene prediction accuracy is poor in the presence of sequencing errors for these algorithms.  Furthermore, gene prediction algorithms not specialized for metagenomic data perform as well as or better than their specialized counterparts.  This suggests that metagenomic gene predictors could be improved by making them more robust to sequencing error.

Benchmarking Algorithms

Benchmarking is used to compare accuracy, precision, technical requirements, and other attributes of algorithms.  Simulated reads provide a common benchmark for comparing assembly algorithms in the Assemblathon competition [Earl, et al.], and have also been used to benchmark read-mapping algorithms such as Bowtie, SOAP, and PASS [Horner].

Optimizing Parameters

Researchers often ignore the detailed parameters of algorithms and programs.  This can frequently cause problems by violating assumptions made in the software development process.  Furthermore, the effects of parameters on results are largely unknown, yet they can significantly affect results and conclusions.  To better understand the effects of the parameters used by the algorithm FLASH, Magoc et al. used simulated reads to build ROC curves illustrating the trade-offs between correctly and incorrectly merged paired-end reads for different values of the mismatch and minimum-overlap parameters (Figures 5 and 6).
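To give a rough sketch of what such a parameter sweep can look like in practice (this is my own illustration, not necessarily how the authors ran their analysis), the shell loop below runs FLASH on a pair of simulated read files at several mismatch-density settings and counts the merged pairs.  The file names and the path to the flash binary are placeholders; -x and -o are FLASH's mismatch-density and output-prefix options, and the merged reads land in the <prefix>.extendedFrags.fastq file that FLASH writes.

# Sweep FLASH's mismatch-density parameter over simulated paired-end reads
# (sim_1.fastq / sim_2.fastq are placeholder names) and count merged pairs.
for x in 0.05 0.10 0.25 0.50; do
    ./flash -x "$x" -o "sweep_x${x}" sim_1.fastq sim_2.fastq
    merged=$(( $(wc -l < "sweep_x${x}.extendedFrags.fastq") / 4 ))
    echo "mismatch density ${x}: ${merged} merged pairs"
done

Because the reads are simulated, the true fragment behind each pair is known, so the merged output can also be checked against the truth to separate correct from incorrect merges.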

Optimizing Study Design

Prior to sequencing, it is common to ask questions like, "how much coverage do we need," "what read length should we sequence at," "what sequencing platform is best for our project," "should we use paired-end (PE) or single-end (SE) reads," etc.  These and other similar questions can be answered, at least in part, by sequence simulation.  For example, the 1000 Genomes Project Consortium used ART to test the effects of read length and PE insert size on a read's ability to map to the human genome.  They concluded that longer reads substantially increase mappability, especially for SE reads, and that increasing insert size also marginally improves mappability.

Similarly, Mende et al. used sequence simulations to test several aspects of study design for whole-metagenome sequencing.  They concluded that quality control measures such as quality filtering and quality trimming have a substantial impact on assembly by improving accuracy and extending contig lengths.  They also evaluated assembly quality on Sanger, pyrosequencing, and Illumina platforms with communities of low (10 genomes), medium (100 genomes), and high (400 genomes) complexity.  For the low-complexity community all platforms were more or less equal in terms of assembly and accurately represented the functional aspects of the community.  For the medium-complexity community Illumina produced the best assembly and most accurately represented the functional elements.  With the high-complexity community none of the platforms were particularly good; however, because of its longer reads, Sanger was still able to represent much of the functional composition.


What to look for in a simulator?

  • Good/appropriate error model
  • Models read lengths
  • Models coverage bias
  • Includes quality values
  • Single-end and Paired-end reads
  • Multiple sequencing platform capabilities (e.g. Illumina, 454, etc.)
  • Easy to install
  • Easy to use
  • Good documentation

What are the simulators out there for...

Genomic DNA sequences



  • wgsim relies on a simple uniform substitution error model.  The uniform error model does not reflect real error patterns as accurately as the models and algorithms described below.  Another major weakness of wgsim is its inability to generate INDEL errors.  Wgsim is essentially the most basic read simulator; a minimal example command is sketched after this list.
  • simNGS models the processes in an Illumina sequencing run such as cluster generation and cluster intensity.  The model must be input as the "runfile" where each line contains parameters for each cycle.  The number of cycles contained in the "runfile" corresponds to the number of bases simNGS can model.
  • MAQ is primarily an assembly algorithm (hence the large number of citations).  It also includes a uniform reference-sequence mutation algorithm.  To simulate reads it uses a first-order Markov chain to model quality at each cycle.  Using that model it generates quality values and then bases based on those quality values.  Documentation is lacking, so I'm not entirely sure all the values in the table are correct.
  • GemSIM takes as input a SAM file and FASTQ file for empirical error model generation.  This is advantageous because it can easily be extended to new sequencing technologies or upgrades as they are released.  It can also simulate metagenome reads based on given abundance proportions.  
  • ART uses quality score data from real Illumina reads to model substitution rates.  Quality scores are generally very informative for Illumina reads, but they are not perfect.  The authors map real Illumina reads to a reference to model INDEL rates; I would prefer they did the same for substitutions.  Currently the longest Illumina read-length model generates 75 bp reads, which is becoming less useful as read lengths continue to grow.  I am less familiar with 454 and Sanger reads, so I won't comment on those models.
  • Mason was built using models empirically derived from fly and yeast reads.  According to the technical report, Mason can model 100 bp Illumina reads from the GAII; however, it appears that it may be possible to build a more up-to-date model.  Mason uses previously published models for both 454 and Sanger reads.  Mason is developed in C++ and is easily extendable, making it useful for incorporating into your own code.
  • FlowSim is specifically for 454 pyrosequencing reads.  FlowSim builds an empirical model derived from quality-trimmed E. coli K-12 and D. labrax reads.  Because the 454 platform interprets reads first as flowgrams, FlowSim models these flowgrams, specifically the homopolymer distributions.  During the simulation process, flowgrams are generated and then interpreted into base calls and corresponding quality scores.  The number of cycles roughly determines the number of bases in each read.
  • SimSeq has limited documentation!  One of the useful attributes of SimSeq is its model for mate-pair chimeras.
  • pIRS builds an error model based on SOAP2 mapping results or a given SAM file.  It uses a combination of two matrices to generate bases and quality values.  These matrices are based on empirical training data and a first-order Markov chain (similar to MAQ) over quality scores.  This seems like a very good model for both bases and quality values.  One of the defining features of pIRS is its simulation of coverage bias across a genome based on GC content.  It also includes an option for mutating the given genome to introduce reference sequence heterogeneity.
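To give a flavor of how simple these tools are to run, here is a minimal wgsim invocation; the file names and parameter values are arbitrary examples, not recommendations:

# Simulate 100,000 pairs of 100 bp reads from reference.fa with a 2% per-base
# error rate, writing the pairs to reads_1.fastq and reads_2.fastq.
wgsim -e 0.02 -N 100000 -1 100 -2 100 reference.fa reads_1.fastq reads_2.fastq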

Metagenomic Sequences



  • NeSSM can simulate coverage bias but relies on the mapping distributions of real metagenome sequences against the reference database.  This is a less useful approach because it cannot be applied to model coverage bias for previously unsequenced metagenomes.  A very positive feature of NeSSM is its GPU capability.  GPU processing is much faster, which facilitates simulating large-scale sequencing runs from platforms like the Illumina HiSeq.  The download link in the paper was not working, so I was unable to download and test this software.
  • Grinder -- Can also simulate PCR-dependent amplicon sequences for any amplicon based on given primers or a database of reference amplicons.  For amplicon simulations it additionally produces chimeric reads and gene copy-number biases.  Grinder allows users to input abundance profiles and to model alpha and beta diversity.  Quality scores in Grinder are assigned a single "high" quality value for all correct bases and a single "low" quality value for bases designated as errors.  Grinder includes a graphical user interface (GUI) and a command line interface (CLI), and it can also be run on the web through the Galaxy interface.  An example command line is sketched after this list.
  • MetaSim -- Stores user-defined genomes in a database.  MetaSim can simulate reads from any of the genomes in the database using various empirically defined models or a user-defined custom model.  Abundance measures for each genome allow users to simulate communities of variable abundance.  To simulate heterogeneity in the community, MetaSim can mutate the selected reference genomes based on an input phylogenetic tree describing the degree of mutation.  MetaSim also includes a GUI; more advanced options can be accessed via the CLI.  One of the shortcomings of MetaSim is the lack of simulated quality values.  As an increasing number of algorithms rely on these values, this is an essential missing feature.
  • BEAR is still under development.
  • GemSIM see above.
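For illustration, a basic Grinder shotgun-metagenome run might look roughly like the command below.  The flag names reflect my reading of the Grinder documentation and the file names are placeholders, so verify everything against grinder --help before relying on it:

# Simulate 100,000 shotgun reads from the genomes in genomes.fna, with the
# community structure given in abundances.txt (genome ID and relative
# abundance per line), writing FASTQ with simple two-level quality scores.
grinder -reference_file genomes.fna -abundance_file abundances.txt -total_reads 100000 -fastq_output 1 -qual_levels 30 10 -output_dir grinder_sim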
As a disclaimer I should say that I have only tried a few of these simulators.  There may be special features that I have missed.

Discussion

Simulations are an important and underutilized method for testing algorithms and planning experiments.  It can be tempting to haphazardly start sequencing or to push your data through a collaborator's favorite algorithm using the default parameters.  These temptations can lead to failed experiments, unusual results, or a misunderstanding of the data.  A more systematic approach is to build informative simulations to anticipate and address problematic algorithm assumptions and experimental design flaws.  This approach requires a larger up-front cost; however, it can save time and money in the long run.

As a cautionary note, simulations are not biology.  Sequence simulations are an imperfect image of a very complicated world.  Conclusions drawn from simulation experiments must be taken with a grain of salt.  In some cases it is vital to design biological experiments to validate such conclusions.  The utility of sequence simulations is to perform a cheap, first-pass test of algorithms and experimental designs.  

While sequence simulators are not expected to perfectly represent real sequences, the closer they can get to that point the better.  Most simulators carefully model base quality, but only a couple model the biases imposed by sequencing platforms and sample preparation protocols, such as GC-content bias.  This is one area where sequence simulators have room to improve.  However, building such models is challenging due to the large number of variables that contribute to these biases.  For example, there are hundreds of available sample prep protocols for hundreds of biological techniques.  Building a universal model to account for the biases from each of these protocols is an enormous task, especially in a world where protocols can have a life expectancy of only a few months.  Furthermore, unique biases can be introduced by the DNA itself: sequencing human DNA will produce different biases, with different levels of effect, than sequencing a single bacterial strain.

In conclusion, sequence simulators can be extremely useful for testing and benchmarking algorithms, optimizing parameters, and designing experiments.  There are a number of simulators available, each with its own strengths and weaknesses.  Careful consideration should go into picking a simulator and interpreting simulation experiments.

Other interesting reads

  • Lessons learned from simulated metagenomics datasets
  • Zagordi, et al. Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction.  2012.
  • Pignatelli, et al. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. 2011.
  • Morgan, et al. Metagenomic Sequencing of an In Vitro-Simulated Microbial Community. 2010.
  • Mavromatis, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. 2007.

UPDATE - Oct 8, 2014
  • The new BEAR simulator seems like a good option as well.

Wednesday, October 9, 2013

Merging Overlapping Paired-End Reads

Paired-End Sequencing Overview

Amplicon sequencing is a technique in which PCR primers are used to amplify a small portion of a genome.  For example, if a gene of interest is preceded by a unique DNA sequence, S1, and followed by another unique DNA sequence, S2, PCR amplification using S1 and S2 as primers will exponentially copy the DNA segment flanked by S1 and S2.  The resulting DNA can be sequenced using a paired-end (PE) sequencing protocol in which DNA molecules in a sample are sequenced from both ends.  In some cases amplicons (i.e. the portion of DNA flanked by the two primers) are short enough that the forward read overlaps the reverse read.  It can be advantageous to merge these reads into a single sequence for the following reason:

  • Sequenced reads generally have a higher density of errors near the end of the read.  The extra information gained by overlapping the tail end of the forward read with the tail end of the reverse read allows for correction of most of these errors.

Merging PE Reads with FLASH

While there are several bioinformatics tools and algorithms that would be appropriate for merging overlapping PE reads, I prefer to use the read-merging tool FLASH.  FLASH is freely available under the GPLv3 license and can be downloaded via SourceForge.
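Installation from the SourceForge tarball is typically just a matter of unpacking the source and running make; something like the following, where the version number is only an example (use whatever release you downloaded):

# Unpack the FLASH source, build it, and confirm the binary runs.
tar -xzf FLASH-1.2.11.tar.gz
cd FLASH-1.2.11
make
./flash -h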

Reasons to use FLASH include:
  • Easy to download and install
  • Fast!
  • Uses quality scores to correct possible sequencing errors in the overlapping region.  This is by far the most important reason to use FLASH.  Correcting potential sequencing errors can lead to much less noise and more definitive results.  
Reasons you might want to find another tool:
  • FLASH is not able to look for gapped alignments between read pairs.  Ungapped alignments are appropriate when using Illumina reads; however, they can be problematic for other sequencing platforms such as 454.
FLASH is easy to use if you are familiar with the Unix command line interface.  First, navigate to where you installed flash and type the command:

./flash -h

This will ensure that FLASH has been installed correctly and will print the FLASH help page to your terminal screen.  To run FLASH you will need two FASTQ-formatted files representing the forward and reverse reads to merge.  These files should have the same number of reads and should be ordered such that the first read in the forward file is paired with the first read in the reverse file.
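A quick sanity check of that requirement, assuming plain (uncompressed) FASTQ with four lines per record and placeholder file names, is to compare the read counts of the two files:

# Each FASTQ record spans four lines, so reads = lines / 4; the two numbers
# printed below should match.
echo $(( $(wc -l < forward.fastq) / 4 ))
echo $(( $(wc -l < reverse.fastq) / 4 ))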

Similar to other bioinformatics applications, FLASH has several parameters that can be adjusted to maximize the algorithm's effectiveness.  The FLASH help page (./flash -h) gives a full explanation of each parameter.  To run FLASH using all default parameters type:

./flash <forward reads file path> <reverse reads file path>

The following is a list of useful parameters that I often adjust; a full example command is given after the parameter descriptions.

-m, --min-overlap=NUM:  The minimum required overlap between the fwd and rev read.  Beware: if this number is too small, spurious overlaps will be formed.  I recommend an overlap of at least 10 bp.  See the FLASH paper for more details.

-M, --max-overlap:  The expected overlap length of the fwd and rev reads.  This parameter can also be calculated using the -r, -f, and -s parameters.

-x, --max-mismatch-density=NUM:  The maximum allowed ratio between the number of mismatched bases and the overlap length.  This parameter controls the stringency with which pairs are considered mergeable.  A high ratio will merge more pairs, but some of these may be false merges.  Alternatively, a low ratio will merge fewer pairs but helps ensure that false merges are excluded.

-r, --read-len=LEN:  The average length of the raw reads.  This number is used in calculations for the -M parameter (--max-overlap).

-f, --fragment-len=LEN:  The expected amplicon length.

-s, --fragment-len-stddev=LEN:  The expected standard deviation of amplicon lengths.

-d, --output-directory=DIR:  The output directory.

-o, --output-prefix=PREFIX:  The output prefix.  For example if "-o my_output" is used all output files will have the prefix "my_output."  This parameter can be useful for organizing different FLASH runs.

Refer to the FLASH help page for explanations about other parameters.
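Putting these together, a run for overlapping 2x250 bp amplicon reads might look something like the command below.  The values are illustrative only, and forward.fastq/reverse.fastq, flash_out, and v4_run are placeholder names; choose numbers that match your own reads and amplicon.

# Require at least 10 bp of overlap, allow a mismatch density of 0.25,
# describe the library (250 bp reads, ~253 bp fragments, 10 bp std dev),
# and write output files prefixed "v4_run" into the directory flash_out.
./flash -m 10 -x 0.25 -r 250 -f 253 -s 10 -d flash_out -o v4_run forward.fastq reverse.fastq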


Quality of Merged PE Reads

During the merging process of overlapping PE reads, discrepancies between bases in the overlapping region can be corrected by choosing the base with the highest quality score as the consensus base.  To illustrate the quality improvements from merging PE reads, I use a subset of recently published data in which we cloned a known 16S gene into a plasmid and sequenced it using our typical 16S metagenome profiling protocol for 2x250 bp reads on the Illumina MiSeq.  The primed V4 amplicon is roughly 253 bp.  After removing extraneous bases added by our protocol (e.g. molecule tags, primers), the PE reads are expected to have an overlapping region of ~183 bp.  Because the overlapping region is large, the lower-quality tails of both the forward and reverse reads overlap high-quality regions of the corresponding read.
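To make the arithmetic behind that estimate explicit: the expected overlap is roughly (forward read length) + (reverse read length) - (amplicon length).  Starting from 2x250 bp reads, if roughly 32 bp of molecule tags and primer bases are trimmed from each read (a back-of-envelope figure on my part; the exact number depends on the protocol), each read contributes about 218 bp, giving 218 + 218 - 253 = 183 bp of expected overlap.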



The distribution of reads based on the errors per base (EPB) in each read is as follows.


The green line (merged reads) has shifted toward the y-axis, indicating that there are more reads with lower EPB compared to forward reads and especially reverse reads.  Notice that the y-axis is on a log scale, so while the differences between groups seem small, they are actually quite substantial.  The mean EPB for all merged reads is 0.001903164, while the forward and reverse reads are at 0.003959693 and 0.00590493, respectively.

One caveat to this investigation is the quality filtering done prior to analysis.  Naturally, low-quality reads are less likely to merge, causing merged reads to have artificially higher quality.  To compensate for this effect we rigorously filtered the raw reads before calculating EPB.  Consequently, the divergence between merged reads and unfiltered fwd/rev raw reads would be even more pronounced than is reflected in the figure, strengthening the conclusion that merged reads have higher quality than fwd/rev raw reads.

Yet another caveat to consider when designing an overlapping PE sequencing experiment is the expected overlap length.  In this study we use a small amplicon, which allows the error-prone tails to overlap regions of higher quality in the corresponding read.  With some amplicons this will not be possible; however, even when low-quality regions overlap each other, corrections can still be made, though at a lower confidence.  Also, the V4 amplicon has very little variability in terms of length.  Again, this is not the case with all amplicons and can affect merging success.  For example, the ITS region is another amplicon used in metagenome profiling, but its length variability is substantially higher than that of the 16S V4 amplicon.  Therefore, reads that fail to overlap in an ITS experiment cannot be assumed to be low-quality sequence and should be incorporated in downstream analysis as non-overlapping PE reads.

In conclusion, overlapping PE reads can easily be merged using tools like FLASH.  Furthermore, merging these reads yields higher quality merged sequences because sequencing errors in the overlapped portion can be corrected based on read quality scores.  Higher quality sequences reduce noise introduced by sequencing error and consequently lead to more robust downstream analysis and conclusions.