Tuesday, April 23, 2013

BioUtils vs. BioPerl

Perl is a popular language in bioinformatics. Its popularity rests primarily on how easy it is to learn and on its plethora of open source libraries. BioPerl is one such library. As a bioinformatician, I appreciate BioPerl because it lets me quickly and easily perform common, menial tasks such as parsing FASTA files and manipulating basic DNA sequence objects. However, BioPerl's runtime and memory usage scale poorly and are becoming limiting factors in large-scale sequencing projects. To accommodate many diverse data types, BioPerl's object hierarchy and object attributes have become somewhat convoluted, thereby increasing memory and runtime requirements.

To address the memory and runtime limitations of BioPerl, I developed BioUtils, a collection of simple Perl modules for FASTA/Q-formatted DNA sequences.  BioUtils can:
  • Store FASTA/Q sequences (including quality values)
  • Read/write FASTA/Q files
  • Perform simple quality control on FASTA/Q sequences
  • Provide summary info for a set of FASTA/Q sequences
  • Build a consensus sequence from 2 or more FASTQ sequences
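
To make the first two capabilities concrete, here is a minimal plain-Perl sketch of a lightweight FASTQ reader and writer. It is not BioUtils' actual API: the function names next_fastq_record and write_fastq_record are hypothetical, and it assumes the simple 4-line FASTQ layout with no line-wrapped sequences. It only illustrates how little state a FASTQ record needs when it is stored as a plain hash.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtils or BioPerl): read the next
# 4-line FASTQ record from a filehandle and return it as a plain hash ref.
# Returns nothing at end of input.
sub next_fastq_record {
    my ($fh) = @_;

    my $header = <$fh>;
    return if !defined $header;    # end of input

    my $seq   = <$fh>;
    my $plus  = <$fh>;
    my $quals = <$fh>;
    die "Truncated FASTQ record\n" if !defined $quals;

    chomp( $header, $seq, $plus, $quals );
    die "Malformed FASTQ header: $header\n" if $header !~ /^@/;
    die "Missing '+' separator line\n"      if $plus !~ /^\+/;
    die "Sequence and quality lengths differ\n"
        if length($seq) != length($quals);

    return { id => substr( $header, 1 ), seq => $seq, quals => $quals };
}

# Hypothetical helper: write one record back out in 4-line FASTQ format.
sub write_fastq_record {
    my ( $fh, $rec ) = @_;
    print {$fh} "\@$rec->{id}\n$rec->{seq}\n+\n$rec->{quals}\n";
    return;
}

# Demo on in-memory filehandles so the sketch runs without any files.
my $input = "\@read1\nACGT\n+\nIIII\n\@read2\nGGCA\n+\n!!II\n";
open my $in, '<', \$input or die "Cannot open input: $!";

my $output = '';
open my $out, '>', \$output or die "Cannot open output: $!";

my $count = 0;
while ( my $rec = next_fastq_record($in) ) {
    write_fastq_record( $out, $rec );
    $count++;
}
close $in;
close $out;

print "Copied $count records\n";
```

Each record here carries only an ID, a sequence, and a quality string, with none of the feature or annotation metadata a general-purpose sequence object has to support.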
Because of its simplicity, BioUtils requires much less memory than BioPerl and drastically reduces the runtime of the above operations.  The following figure displays the CPU time for reading and writing FASTQ sequences from files with increasing numbers of sequences.

Currently, a full Illumina MiSeq run can produce ~6 million paired-end reads (i.e. ~12 million raw reads).  Extrapolating from the figure above, reading and writing a complete MiSeq dataset would take nearly two and a half hours with BioPerl, compared to only 5 minutes with BioUtils.
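
As a rough sanity check, the projection can be turned into a per-read cost and a speedup factor. The short sketch below uses only the approximate figures quoted above (12 million reads, about 2.5 hours, about 5 minutes) and assumes runtime grows linearly with the number of reads, so the results are ballpark values rather than measurements.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Approximate figures taken from the text; linear scaling is an assumption.
my $reads             = 12_000_000;
my %projected_seconds = (
    BioPerl  => 2.5 * 3600,    # "nearly 2 and a half hours"
    BioUtils => 5 * 60,        # "only 5 minutes"
);

for my $lib ( sort keys %projected_seconds ) {
    my $microseconds_per_read = $projected_seconds{$lib} / $reads * 1e6;
    printf "%-8s ~%.0f microseconds per read\n", $lib, $microseconds_per_read;
}

my $speedup = $projected_seconds{BioPerl} / $projected_seconds{BioUtils};
printf "Projected speedup: about %.0fx\n", $speedup;
```

Under these assumptions BioPerl spends roughly 750 microseconds per read versus roughly 25 microseconds for BioUtils, a difference of about 30x.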

Because BioUtils is implemented using the inside-out objects described in Damian Conway's Perl Best Practices, it is difficult to measure its memory usage directly.  However, a look at the source code shows that BioUtils stores less metadata per sequence, and it should therefore require less memory than BioPerl.
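
For readers unfamiliar with the pattern, the following is a generic sketch of an inside-out class in the style of Perl Best Practices. It is not taken from the BioUtils source: the class name My::FastqSeq and its accessors are hypothetical. The object itself is just an empty blessed scalar, and the attribute data lives in lexical hashes inside the class, keyed by the object's address. That is presumably why inspecting an object reference reveals little about how much memory its data occupies.

```perl
#!/usr/bin/env perl
package My::FastqSeq;    # hypothetical class name, not from BioUtils
use strict;
use warnings;
use Scalar::Util qw(refaddr);

{
    # One lexical hash per attribute, keyed by each object's address.
    # Code outside this block cannot reach the data directly.
    my %seq_of;
    my %quals_of;

    sub new {
        my ( $class, %args ) = @_;

        # The object is only an empty anonymous scalar; it holds no data.
        my $self = bless \do { my $anon_scalar }, $class;

        my $id = refaddr($self);
        $seq_of{$id}   = $args{seq};
        $quals_of{$id} = $args{quals};
        return $self;
    }

    sub get_seq {
        my ($self) = @_;
        return $seq_of{ refaddr($self) };
    }

    sub get_quals {
        my ($self) = @_;
        return $quals_of{ refaddr($self) };
    }

    # Inside-out classes must clean up their own attribute hashes.
    sub DESTROY {
        my ($self) = @_;
        my $id = refaddr($self);
        delete $seq_of{$id};
        delete $quals_of{$id};
        return;
    }
}

package main;

my $read = My::FastqSeq->new( seq => 'ACGT', quals => 'IIII' );
print $read->get_seq(), "\n";
```

Dereferencing $read yields only an undefined scalar; the sequence and quality strings are reachable solely through the accessors.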

More information and a BioUtils download can be found on SourceForge: https://sourceforge.net/projects/bioutilsperllib/?source=directory
