Friday, November 15, 2013

New SPAdes Assembler Used on MiSeq Reads for 6 Burkholderia Genomes

Introduction

This post introduces a practical assembly strategy using the new SPAdes assembler with Illumina MiSeq reads.  Utilizing the sequence analysis tools sickle, FLASH, SPAdes, and in-house Perl scripts we assemble 6 Burkholderia genomes to draft status using both paired-end (PE) and mate-pair (MP) reads.  This exercise demonstrates the practicality of using the Illumina MiSeq for small scale assembly projects.

The SPAdes Assembler

Many popular de novo assemblers, including SPAdes, rely on a computational data structure called a de Bruijn graph.  SPAdes uses a multisized de Bruijn graph to balance trade-offs between small and large k-mer sizes.  A smaller k causes more repeat regions to collapse into the same node tangling the graph.  However, larger k values can fragment the graph in low coverage regions.  Multisized de Bruijn graphs allow the size of k to vary based on regional coverage depth.  SPAdes also has a unique method for handling PE reads.  Typically PE information is incorporated after the initial assembly by mapping PE reads back to assembled contigs.  When corresponding pairs map to the ends of different contigs those contigs can be merged into a single scaffold.  SPAdes improves on this by incorporating PE distance information directly into the de Bruijn graph.  Because of these features SPAdes can be run with any number of single-end (SE), PE, or MP read files.

Assembling Burkhoderia Genomes with SPAdes

Burkholderia is a genus of bacteria with strains having a variety of environmental effects ranging from plant growth promoting activity to human and plant pathogenesis.  Because our lab is particularly interested in phenotypes of association between bacteria and plants, we sequenced 6 strains of burkholderia isolated from the endophyte compartment (i.e. inner root) of A. thaliana using two runs on our Illumina MiSeq machine.  The first run was done using a standard PE library, and the second was done using a MP library.


The pipeline used for assembly is outlined and described below.  



FLASH
FLASH is a software package the merges PE reads into a single read.  In PE sequencing there is a distribution of DNA fragment lengths from which ends are sequenced (Supp Fig 1).  As reads continue to lengthen more PE reads can potentially overlap.  The main advantage to merging the overlapping reads is the error correction step.  Furthermore, longer reads tend to yield better assemblies because of their ability to span repetitive regions that fragment assemblies.  While longer merged reads may span repetitive regions it is still possible that spanned repeats are incorrectly assembled.  In terms of draft assembly this is a minor point.

After experimenting with various combinations of parameters I decided on the following FLASH parameters:

-m 25 -r 250 -f 500 -s 100

Adapter Trimming
A preprocessing step is required to trim MP reads at the Illumina junction adapter.  For details on how to do this and why it is necessary see this post.  

Sickle
Graph assemblers can be sensitive to erroneous k-mers produced from sequencing errors which convolute the graph with false nodes and connections.  Rigorous quality filtering prior to assembly reduces the number of false k-mers thereby improving assembly.  Sickle is a software package for quality filtering which relies on a sliding window algorithm over the read quality scores.  If the average quality of a window drops below a given minimum the remaining bases are trimmed off.  Any trimmed or untrimmed reads shorter than a given length are removed.  Sickle also produces a file containing high quality reads where the corresponding pair failed the quality check.  These reads can be used in downstream analysis as single-end reads.  

The parameter setting that I used when running Sickle are:

-t sanger -l 127 -q 20

The '-t' parameter specifies the quality value type.  The '-l' parameter specifies the minimum read length and '-q' is the minimum average window quality score.  At this stage it is important to consider the '-l' parameter if running SPAdes is your next step.  The SPAdes parameter '-k' takes a list of k-mers from which to build a multi-sized graph.  For long Illumina reads the SPAdes manual advises '-k' be set to '21,33,55,77,99,127.'  Because the longest k-mer is 127bp reads in you dataset should be at least that long.  To allow for shorter reads when running Sickle simply lower the '-l' parameter and ensure the largest '-k' value is shorter than or equal to that value.  Because many of the MP reads are already short I ran sickle with the -l parameter set to 75 and the max SPAdes -k parameter set to 75.

I should note that SPAdes has a built-in correction algorithm--hammer.  Hammer looks at k-mer frequency to identify potential false k-mers.  However, I recommend running Sickle first for 2 reasons:
  1. Sickle is a much faster correction algorithm than hammer.  Hence, limiting the number of false k-mers that hammer must identify speeds up SPAdes.
  2. Sickle uses quality values.  While quality values are not perfect, they do correlate with the correctness of bases.  If reads have any low-quality k-mers it is possible by chance that some of these k-mers will be the same causing hammer to classify them as a true k-mer.  However, it is likely that one error inside a high quality k-mer will not be trimmed by Sickle, but will be identified by hammer.  This intuition says that a combination of Sickle and hammer is the best regime for quality filtering.
SPAdes
I ran the SPAdes assembler twice for each genome--first with both PE and MP reads and then with only the PE reads.  The second run shows that improvements were made when MP reads are included.  The following command was used to run SPAdes with both PE and MP reads:

spades.py -k 21,33,55,75 -t 5 --careful -s merged_PE_reads.fastq -1 fwd_PE_sickled_reads.fastq -2 rev_PE_sickled_reads.fastq -s PE_sickled_single_reads.fastq --mp1-1 fwd_MP_sickled_reads.fastq --mp1-2 rev_MP_sickled_reads.fastq --mp1-s singles_MP_sickled_reads.fastq --mp1-rf -o spades_out

Using only PE reads (i.e. the merged reads, fwd high quality reads, reverse high quality reads, and single high quality reads) SPAdes was run using the following command:

spades.py -k 21,33,55,77,99,127 -t 5 --careful -s out.extendedFrags.fastq -1 out.notCombined_1_qc.fastq -2 out.notCombined_2_qc.fastq -s out.notCombined_singles_qc.fastq   -o spades_out

QUAST
When the assembly is complete QUAST can be used to compare and view assembly results.  For examples see the Results section of this post.


Results

Assembly
All 6 genomes assembled very well when both PE and MP reads were used.


Does merging PE reads help?
To illustrate how merging reads prior to assembly can improve assembly, I assembled one Burkholderia genome using three different input sets:
  1. Merged and unmerged reads
  2. Raw PE reads with no merging
  3. Only reads that merge

Under the assumption that improved assembly statistics (e.g. number of contigs, N50, etc.) indicate a better assembly, a combination of merged and unmerged reads (option 1) yields the best assembly.

Does adding MP reads help?
When assembling with only PE reads the resulting assemblies are worse than when MP reads are included.


To test if the assembly improvements were primarily a factor of increased coverage or information incorporated in MP reads, I reassembled the PE+MP reads for B1/BMP1 treating the MP reads as SE reads.  Despite MP reads substantial increase in coverage (52 to 103), when treated as SE reads only minor assembly improvements are observed.  This suggests that distance information provided by MP reads can have a greater effect on improving assemblies than additional coverage.


Conclusions

Using a combination of FLASH, Sickle, SPAdes, and in-house Perl scripts we assemble 6 Burkholderia genomes to draft status.  This exercise demonstrates the practicality of the Illumina MiSeq for small scale sequencing projects consisting of a handful of bacterial genomes.  A single PE MiSeq run may be sufficient to assembly a small number of bacterial genomes, however the additional coverage and pairing information of MP reads can improve PE assemblies.  Merging overlapping PE reads and quality trimming before assembly can increase N50 and other assembly statistics.


Supplemental Figures
Figure 1 -- PE insert size distribution for one sample (B1)

4 comments:

  1. Scott,

    Several notes:

    1. The N50 comparison is incorrect, because your assemblies lengths differ too much. You need to evaluate NG50 wrt the expected genome length and compare these numbers. You can use QUAST's --est-ref-size option to provide expected genome length and it will calculate NG50 for you.
    2. SPAdes as of version 2.5.1 does not use long single reads for any repeat resolution. So, only paired-end reads are used here, thus you may obtain sub-optimal assemblies if you have to many merged reads. Your tables clearly show this. Also there is no difference between the assemblies from merged and unmerged reads - the N50, assembly length and longest contig are the same, there are some additional short contigs in unmerged case which are most surely just low-covered junk reads being glued together.
    3. BayesHammer does use quality values for k-mer clustering. Also it does quality trimming by itself, however, pretty conservative.

    There are some good news wrt SPAdes 3.0:

    1. It will use long single reads for repeat resolution
    2. It will include numerous improvements in scaffolder, so it may happen that your assemblies with MP will be better

    ReplyDelete
    Replies
    1. Anton,

      Thanks for your comments/corrections! That's great news about SPAdes 3.0. We are sequencing several more bacterial genomes with primarily MP reads.

      Delete
  2. Hello Scott, Thank for such a interesting blog. For pair end reads do you think the merging with FLASH really needed moreover does it really improve assembly quality??
    or else it does not matte so much weather you mixed or not?? because in Spades web site i did not find any protocol like that before assembly you need to do it .

    ReplyDelete
    Replies
    1. Hi Hiren,

      In terms of N50 and the length of the largest contig the merging doesn't appear to have much of an effect. However, there are fewer short contigs in the assemblies where the reads were merged. Do these short contigs matter? Honestly, probably not much. So the improvements (at least from these data) from merging appear marginal. However, theoretically merging should improve the assembly because it reduces sequencing errors and creates longer reads which helps with repeat resolution. Additionally, merging is computationally inexpensive (i.e. it doesn't take much processing time and memory). So at the very least it will make the assembly marginally better with only a marginal cost. How much merging helps or doesn't help probably depends on what you are assembling and how many reads you have that overlap. I can't think of a reason why merging would make the assembly worse.

      Scott

      Delete