Wednesday, October 9, 2013

Merging Overlapping Paired-End Reads

Paired-End Sequencing Overview

Amplicon sequencing is a technique in which PCR primers are used to amplify a small portion of a genome.  For example, if a gene of interest is preceded by a unique DNA sequence, S1, and followed by another unique DNA sequence, S2, PCR amplification using S1 and S2 as primers will exponentially copy the DNA segment flanked by S1 and S2 (see this video).  The resulting DNA can be sequenced using a paired-end (PE) sequencing protocol where DNA molecules in a sample are sequenced from both ends.  In some cases amplicons (i.e. the portion of DNA flanked by the two primers) are short enough such that the forward read overlaps the reverse read.  It can be advantageous to merge these reads into a single sequence for the following reason:

  • Sequenced reads generally have a higher density of errors near the end of the read.  The extra information gained by overlapping the tail end of the forward read with the tail end of the reverse read all for correction of most of these errors.

Merging PE Reads with FLASH

While there are several bioinformatics tools and algorithms that would be appropriate for merging overlapping PE reads, I prefer to use the assembly tool FLASH.  FLASH is freely available under the GPLv3 license and can be downloaded via SourceForge.

Reasons to use FLASH include:
  • Easy to download and install
  • Fast!
  • Uses quality scores to correct possible sequencing errors in the overlapping region.  This is by far the most important reason to use FLASH.  Correcting potential sequencing errors can lead to much less noise and more definitive results.  
Reasons you might want to find another tool:
  • FLASH is not able to look for gapped alignments between read pairs.  Ungapped alignments are appropriate when using Illumina reads, however they can be problematic for other sequencing platforms such as 454.
FLASH is easy to use if you are familiar with the Unix command line interface.  First, navigate to where you installed flash and type the command:

./flash -h

This will ensure that FLASH has been installed correctly and prints the FLASH help page to your terminal screen.  To run FLASH you will need two FASTQ formatted files representing the forward and reverse reads to merge.  These files should have the same number of reads and should be ordered such that the first read in the forward file is paired with the first read in the reverse file.

Similar to other bioinformatics applications, FLASH has several parameters that can be adjusted to maximize the algorithm's effectiveness.  The FLASH help page (./flash -h) give a full explanation of each parameter.  To run FLASH using all default parameters type:

./flash <forward reads file path> <reverse reads file path>

The following are a list of useful parameters that I often use:

-m, --min-overlap=NUM:  The minimum required overlap between the fwd and rev read.  Beware if this number is too small spurious overlaps will be formed.  I recommend at least and overlap of 10bp.  See the FLASH paper for more details.

-M, --max-overlap:  The expected overlap length of the fwd and rev reads.  This parameter can also be calculated using the -r, -f, and -s parameters.

-x, --max-mismatch-density=NUM:  The ratio of number of mismatched bases and overlap length.  This parameter controls the stringency for which pairs are considered to be mergable.  A high ratio of number of mismatches to overlap length will merge more pairs, but some of these pairs may be false merges.  Alternatively, a low ratio will merge fewer pairs but ensures that false merges are excluded.

-r, --read-len=LEN:  The average length of the raw reads.  This number is used in calculations for the -M parameter (--max-overlap).

-f, --fragment-len=LEN:  The expected amplicon length.

-s, --fragment-len-stddev=LEN:  The expected standard deviation of amplicon lengths.

-d, --output-directory=DIR:  The output directory.

-o, --output-prefix=PREFIX:  The output prefix.  For example if "-o my_output" is used all output files will have the prefix "my_output."  This parameter can be useful for organizing different FLASH runs.

Refer to the FLASH help page for explanations about other parameters.


Quality of Merged PE Reads

During the merging process of overlapping PE reads, discrepancies between bases in the overlapping region can be corrected by choosing the base with the highest quality score to represent the consensus base.  To illustrate quality improvements from merging PE reads I use a subset of recently published data where we cloned a known 16S gene into a plasmid and sequenced it using our typical 16S metagenome profiling protocol for 2x250 bp reads on the Illumina MiSeq.  The primed V4 amplicon is roughly 253bp.  After removing extraneous bases added by our protocol (e.g. molecule tags, primers), the PE reads are expected to have an overlapping region of ~183bp.  Because the overlapping region is large the lower quality tails of both the forward and reverse read overlap in high quality regions in the corresponding read.



The distribution of reads based on the errors per base (EPB) in each read is as follows.


The green line (merged reads) has shifted toward the y-axis indicating that there are more reads with lower EPB compared to forward reads and especially reverse reads.  Notice that the y-axis is in a log scale, so while the differences between groups seems small it is actually quite substantial.  The mean EPB for all merged reads is 0.001903164 while the forward and reverse reads are 0.003959693 and 0.00590493, respectively.  

One caveat to this investigation is the quality filtering done prior to analysis.  Naturally, low quality reads are less likely to merge causing merged reads to have artificially higher quality.   To compensate for this effect we rigorously filtered raw reads before calculating EPB.  Consequently, the divergence between merged reads and unfiltered fwd/rev raw reads will be more pronounced than is reflected in the figure strengthening the conclusion that merged reads have higher quality than fwd/rev raw reads.  

Yet another caveat to consider when designing an overlapping PE sequencing experiment is the expected overlap length.  In this study we use a small amplicon which allows the error prone tails to overlap in regions of higher quality in the corresponding read.  With some amplicons this will not be possible, however even when low quality regions overlap each other corrections can still be made but at a lower confidence.  Also, the V4 amplicon has very little variability in terms of length.  Again, this is not the case with all amplicons and can effect merging success.  For example, the ITS region is another amplicon used in metagenome profiling, but its length variability is substantially higher than the 16S V4 amplicon.  Therefore, reads that fail to overlap in an ITS experiment cannot be assumed to be low quality sequence and should be incorporated in downstream analysis as non-overlapping PE reads.

In conclusion, overlapping PE reads can easily be merged using tools like FLASH.  Furthermore, merging these reads yields higher quality merged sequences because sequencing errors in the overlapped portion can be corrected based on read quality scores.  Higher quality sequences reduce noise introduced by sequencing error and consequently lead to more robust downstream analysis and conclusions. 

No comments:

Post a Comment