Introduction
This post introduces a practical assembly strategy using the new SPAdes assembler with Illumina MiSeq reads. Utilizing the sequence analysis tools sickle, FLASH, SPAdes, and in-house Perl scripts we assemble 6 Burkholderia genomes to draft status using both paired-end (PE) and mate-pair (MP) reads. This exercise demonstrates the practicality of using the Illumina MiSeq for small scale assembly projects.
The SPAdes Assembler
Many popular de novo assemblers, including SPAdes, rely on a computational data structure called a de Bruijn graph. SPAdes uses a multisized de Bruijn graph to balance trade-offs between small and large k-mer sizes. A smaller k causes more repeat regions to collapse into the same node tangling the graph. However, larger k values can fragment the graph in low coverage regions. Multisized de Bruijn graphs allow the size of k to vary based on regional coverage depth. SPAdes also has a unique method for handling PE reads. Typically PE information is incorporated after the initial assembly by mapping PE reads back to assembled contigs. When corresponding pairs map to the ends of different contigs those contigs can be merged into a single scaffold. SPAdes improves on this by incorporating PE distance information directly into the de Bruijn graph. Because of these features SPAdes can be run with any number of single-end (SE), PE, or MP read files.
Assembling Burkhoderia Genomes with SPAdes
To test if the assembly improvements were primarily a factor of increased coverage or information incorporated in MP reads, I reassembled the PE+MP reads for B1/BMP1 treating the MP reads as SE reads. Despite MP reads substantial increase in coverage (52 to 103), when treated as SE reads only minor assembly improvements are observed. This suggests that distance information provided by MP reads can have a greater effect on improving assemblies than additional coverage.
Conclusions
Using a combination of FLASH, Sickle, SPAdes, and in-house Perl scripts we assemble 6 Burkholderia genomes to draft status. This exercise demonstrates the practicality of the Illumina MiSeq for small scale sequencing projects consisting of a handful of bacterial genomes. A single PE MiSeq run may be sufficient to assembly a small number of bacterial genomes, however the additional coverage and pairing information of MP reads can improve PE assemblies. Merging overlapping PE reads and quality trimming before assembly can increase N50 and other assembly statistics.
Supplemental Figures
Figure 1 -- PE insert size distribution for one sample (B1)
This post introduces a practical assembly strategy using the new SPAdes assembler with Illumina MiSeq reads. Utilizing the sequence analysis tools sickle, FLASH, SPAdes, and in-house Perl scripts we assemble 6 Burkholderia genomes to draft status using both paired-end (PE) and mate-pair (MP) reads. This exercise demonstrates the practicality of using the Illumina MiSeq for small scale assembly projects.
The SPAdes Assembler
Many popular de novo assemblers, including SPAdes, rely on a computational data structure called a de Bruijn graph. SPAdes uses a multisized de Bruijn graph to balance trade-offs between small and large k-mer sizes. A smaller k causes more repeat regions to collapse into the same node tangling the graph. However, larger k values can fragment the graph in low coverage regions. Multisized de Bruijn graphs allow the size of k to vary based on regional coverage depth. SPAdes also has a unique method for handling PE reads. Typically PE information is incorporated after the initial assembly by mapping PE reads back to assembled contigs. When corresponding pairs map to the ends of different contigs those contigs can be merged into a single scaffold. SPAdes improves on this by incorporating PE distance information directly into the de Bruijn graph. Because of these features SPAdes can be run with any number of single-end (SE), PE, or MP read files.
Assembling Burkhoderia Genomes with SPAdes
Burkholderia is a genus of bacteria with strains having a variety of environmental effects ranging from plant growth promoting activity to human and plant pathogenesis. Because our lab is particularly interested in phenotypes of association between bacteria and plants, we sequenced 6 strains of burkholderia isolated from the endophyte compartment (i.e. inner root) of A. thaliana using two runs on our Illumina MiSeq machine. The first run was done using a standard PE library, and the second was done using a MP library.
The pipeline used for assembly is outlined and described below.
FLASH
FLASH is a software package the merges PE reads into a single read. In PE sequencing there is a distribution of DNA fragment lengths from which ends are sequenced (Supp Fig 1). As reads continue to lengthen more PE reads can potentially overlap. The main advantage to merging the overlapping reads is the error correction step. Furthermore, longer reads tend to yield better assemblies because of their ability to span repetitive regions that fragment assemblies. While longer merged reads may span repetitive regions it is still possible that spanned repeats are incorrectly assembled. In terms of draft assembly this is a minor point.
After experimenting with various combinations of parameters I decided on the following FLASH parameters:
-m 25 -r 250 -f 500 -s 100
Adapter Trimming
A preprocessing step is required to trim MP reads at the Illumina junction adapter. For details on how to do this and why it is necessary see this post.
Adapter Trimming
A preprocessing step is required to trim MP reads at the Illumina junction adapter. For details on how to do this and why it is necessary see this post.
Sickle
Graph assemblers can be sensitive to erroneous k-mers produced from sequencing errors which convolute the graph with false nodes and connections. Rigorous quality filtering prior to assembly reduces the number of false k-mers thereby improving assembly. Sickle is a software package for quality filtering which relies on a sliding window algorithm over the read quality scores. If the average quality of a window drops below a given minimum the remaining bases are trimmed off. Any trimmed or untrimmed reads shorter than a given length are removed. Sickle also produces a file containing high quality reads where the corresponding pair failed the quality check. These reads can be used in downstream analysis as single-end reads.
The parameter setting that I used when running Sickle are:
-t sanger -l 127 -q 20
The '-t' parameter specifies the quality value type. The '-l' parameter specifies the minimum read length and '-q' is the minimum average window quality score. At this stage it is important to consider the '-l' parameter if running SPAdes is your next step. The SPAdes parameter '-k' takes a list of k-mers from which to build a multi-sized graph. For long Illumina reads the SPAdes manual advises '-k' be set to '21,33,55,77,99,127.' Because the longest k-mer is 127bp reads in you dataset should be at least that long. To allow for shorter reads when running Sickle simply lower the '-l' parameter and ensure the largest '-k' value is shorter than or equal to that value. Because many of the MP reads are already short I ran sickle with the -l parameter set to 75 and the max SPAdes -k parameter set to 75.
I should note that SPAdes has a built-in correction algorithm--hammer. Hammer looks at k-mer frequency to identify potential false k-mers. However, I recommend running Sickle first for 2 reasons:
I should note that SPAdes has a built-in correction algorithm--hammer. Hammer looks at k-mer frequency to identify potential false k-mers. However, I recommend running Sickle first for 2 reasons:
- Sickle is a much faster correction algorithm than hammer. Hence, limiting the number of false k-mers that hammer must identify speeds up SPAdes.
- Sickle uses quality values. While quality values are not perfect, they do correlate with the correctness of bases. If reads have any low-quality k-mers it is possible by chance that some of these k-mers will be the same causing hammer to classify them as a true k-mer. However, it is likely that one error inside a high quality k-mer will not be trimmed by Sickle, but will be identified by hammer. This intuition says that a combination of Sickle and hammer is the best regime for quality filtering.
SPAdes
I ran the SPAdes assembler twice for each genome--first with both PE and MP reads and then with only the PE reads. The second run shows that improvements were made when MP reads are included. The following command was used to run SPAdes with both PE and MP reads:
spades.py -k 21,33,55,75 -t 5 --careful -s merged_PE_reads.fastq -1 fwd_PE_sickled_reads.fastq -2 rev_PE_sickled_reads.fastq -s PE_sickled_single_reads.fastq --mp1-1 fwd_MP_sickled_reads.fastq --mp1-2 rev_MP_sickled_reads.fastq --mp1-s singles_MP_sickled_reads.fastq --mp1-rf -o spades_out
Using only PE reads (i.e. the merged reads, fwd high quality reads, reverse high quality reads, and single high quality reads) SPAdes was run using the following command:
spades.py -k 21,33,55,77,99,127 -t 5 --careful -s out.extendedFrags.fastq -1 out.notCombined_1_qc.fastq -2 out.notCombined_2_qc.fastq -s out.notCombined_singles_qc.fastq -o spades_out
spades.py -k 21,33,55,75 -t 5 --careful -s merged_PE_reads.fastq -1 fwd_PE_sickled_reads.fastq -2 rev_PE_sickled_reads.fastq -s PE_sickled_single_reads.fastq --mp1-1 fwd_MP_sickled_reads.fastq --mp1-2 rev_MP_sickled_reads.fastq --mp1-s singles_MP_sickled_reads.fastq --mp1-rf -o spades_out
Using only PE reads (i.e. the merged reads, fwd high quality reads, reverse high quality reads, and single high quality reads) SPAdes was run using the following command:
spades.py -k 21,33,55,77,99,127 -t 5 --careful -s out.extendedFrags.fastq -1 out.notCombined_1_qc.fastq -2 out.notCombined_2_qc.fastq -s out.notCombined_singles_qc.fastq -o spades_out
QUAST
When the assembly is complete QUAST can be used to compare and view assembly results. For examples see the Results section of this post.
Results
Assembly
All 6 genomes assembled very well when both PE and MP reads were used.
Does merging PE reads help?
When the assembly is complete QUAST can be used to compare and view assembly results. For examples see the Results section of this post.
Results
Assembly
All 6 genomes assembled very well when both PE and MP reads were used.
Does merging PE reads help?
To illustrate how merging reads prior to assembly can improve assembly, I assembled one Burkholderia genome using three different input sets:
- Merged and unmerged reads
- Raw PE reads with no merging
- Only reads that merge
Under the assumption that improved assembly statistics (e.g. number of contigs, N50, etc.) indicate a better assembly, a combination of merged and unmerged reads (option 1) yields the best assembly.
Does adding MP reads help?
When assembling with only PE reads the resulting assemblies are worse than when MP reads are included.
When assembling with only PE reads the resulting assemblies are worse than when MP reads are included.
To test if the assembly improvements were primarily a factor of increased coverage or information incorporated in MP reads, I reassembled the PE+MP reads for B1/BMP1 treating the MP reads as SE reads. Despite MP reads substantial increase in coverage (52 to 103), when treated as SE reads only minor assembly improvements are observed. This suggests that distance information provided by MP reads can have a greater effect on improving assemblies than additional coverage.
Conclusions
Using a combination of FLASH, Sickle, SPAdes, and in-house Perl scripts we assemble 6 Burkholderia genomes to draft status. This exercise demonstrates the practicality of the Illumina MiSeq for small scale sequencing projects consisting of a handful of bacterial genomes. A single PE MiSeq run may be sufficient to assembly a small number of bacterial genomes, however the additional coverage and pairing information of MP reads can improve PE assemblies. Merging overlapping PE reads and quality trimming before assembly can increase N50 and other assembly statistics.
Supplemental Figures
Figure 1 -- PE insert size distribution for one sample (B1)