Perl is a popular language in bioinformatics. Its popularity rests primarily on how easy it is to learn and on its plethora of open source libraries. BioPerl is one such library. As a bioinformatician, I appreciate BioPerl because it lets me quickly and easily perform common, menial tasks such as parsing FASTA files and manipulating basic DNA sequence objects. However, BioPerl's speed and memory usage scale poorly and are becoming limiting factors in large-scale sequencing projects. To accommodate many diverse data types, BioPerl's object hierarchy and object attributes have become somewhat convoluted, which increases memory and runtime requirements.
To address the memory and runtime limitations of BioPerl, I developed BioUtils, a collection of simple Perl modules for FASTA/Q-formatted DNA sequences. BioUtils can:
- Store FASTA/Q sequences (including quality values)
- Read/write FASTA/Q files
- Perform simple quality control on FASTA/Q sequences
- Provide summary info for a set of FASTA/Q sequences
- Build a consensus sequence from 2 or more FASTQ sequences
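To give a feel for the kind of task in this list, here is a minimal sketch in plain Perl of reading FASTQ records and applying a simple mean-quality filter. It does not use the BioUtils API: the subroutine names `read_fastq` and `mean_quality`, the threshold of 20, and the two sample reads are all made up for illustration, and it assumes four-line records with Phred+33 quality encoding.

```perl
use strict;
use warnings;

# Parse four-line FASTQ records from a filehandle.
# Assumption: sequence and quality strings are not wrapped across lines.
sub read_fastq {
    my ($fh) = @_;
    my @records;
    while ( defined( my $header = <$fh> ) ) {
        my $seq  = <$fh>;
        my $plus = <$fh>;
        my $qual = <$fh>;
        last if !defined $qual;    # skip an incomplete trailing record
        chomp( $header, $seq, $plus, $qual );
        $header =~ s/^@//;         # drop the leading '@' from the header
        push @records, { id => $header, seq => $seq, qual => $qual };
    }
    return @records;
}

# Mean Phred score of a quality string.
# Assumption: Phred+33 encoding (Sanger / Illumina 1.8+).
sub mean_quality {
    my ($qual) = @_;
    return 0 if length($qual) == 0;
    my $sum = 0;
    $sum += ord($_) - 33 for split //, $qual;
    return $sum / length($qual);
}

# Two made-up reads held in memory so the sketch runs without an input file.
my $fastq = "\@read1\nACGT\n+\nIIII\n\@read2\nACGT\n+\n!!!!\n";
open my $fh, '<', \$fastq or die "Cannot open in-memory FASTQ: $!";
my @reads = read_fastq($fh);
close $fh;

# Keep only reads whose mean quality is at least 20 (hypothetical cutoff).
my @passed = grep { mean_quality( $_->{qual} ) >= 20 } @reads;
printf "%d of %d reads passed\n", scalar @passed, scalar @reads;
```

With the two sample reads, `read1` has a mean quality of 40 and `read2` a mean quality of 0, so the sketch reports that 1 of 2 reads passed. Storing each record as a small hash keeps the example short; it is not meant to reflect how BioUtils stores sequences internally.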
Currently, a full Illumina MiSeq run can produce ~6 million paired-end reads (i.e., ~12 million raw reads). Extrapolating from the figure above, the projected runtime for a complete MiSeq dataset would be nearly two and a half hours with BioPerl, compared to only 5 minutes with BioUtils.
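Working backwards from those two projections gives a rough sense of the per-read cost. The short Perl sketch below only rearranges the rounded numbers quoted above (about 12 million raw reads, about 2.5 hours versus 5 minutes), so the results are approximations and not measured benchmarks; the variable names are illustrative.

```perl
use strict;
use warnings;

# Rounded figures taken from the text; treat them as approximations.
my $raw_reads_millions = 12;          # ~12 million raw reads per MiSeq run
my $bioperl_minutes    = 2.5 * 60;    # "nearly two and a half hours"
my $bioutils_minutes   = 5;

# Implied speedup and implied seconds of runtime per million reads.
my $speedup              = $bioperl_minutes / $bioutils_minutes;
my $bioperl_sec_per_mil  = $bioperl_minutes * 60 / $raw_reads_millions;
my $bioutils_sec_per_mil = $bioutils_minutes * 60 / $raw_reads_millions;

printf "Implied speedup: about %.0fx\n", $speedup;
printf "BioPerl:  about %.0f s per million reads\n", $bioperl_sec_per_mil;
printf "BioUtils: about %.0f s per million reads\n", $bioutils_sec_per_mil;
```

Under these rounded inputs the projection implies roughly a 30-fold difference: about 750 seconds per million reads for BioPerl against about 25 seconds for BioUtils.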
Because BioUtils is implemented using inside-out objects, as described in Perl Best Practices (Damian Conway), its memory usage is difficult to measure. However, a peek at the source code will confirm that BioUtils stores less metadata than BioPerl and thus requires less memory.
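For readers unfamiliar with the inside-out pattern, the following is a minimal sketch of the general idea, not the actual BioUtils source. The class name `Sketch::FastaSeq` and its accessors are hypothetical. Each object is just a blessed reference to an empty scalar, and its attributes live in lexical hashes private to the class, keyed by the object's address. That layout is likely one reason per-object memory is awkward to measure: inspecting the object reference alone reveals none of its data.

```perl
use strict;
use warnings;

{
    package Sketch::FastaSeq;    # hypothetical class, not part of BioUtils

    use Scalar::Util qw(refaddr);

    # Attribute storage: one lexical hash per attribute, private to this
    # block and keyed by each object's memory address.
    my %header_of;
    my %seq_of;

    sub new {
        my ( $class, %args ) = @_;

        # The object itself is a blessed reference to an anonymous scalar
        # and carries no data of its own.
        my $self = bless \do { my $anon_scalar }, $class;
        $header_of{ refaddr($self) } = $args{header};
        $seq_of{ refaddr($self) }    = $args{seq};
        return $self;
    }

    sub get_header { my ($self) = @_; return $header_of{ refaddr($self) }; }
    sub get_seq    { my ($self) = @_; return $seq_of{ refaddr($self) }; }

    # Attributes are not freed automatically with the object, so the
    # destructor has to remove them from the class-level hashes.
    sub DESTROY {
        my ($self) = @_;
        delete $header_of{ refaddr($self) };
        delete $seq_of{ refaddr($self) };
        return;
    }
}

# Usage: the data is reachable only through the accessor methods.
my $read = Sketch::FastaSeq->new( header => 'seq1', seq => 'ACGT' );
print $read->get_header(), "\t", $read->get_seq(), "\n";
```

Because the attribute hashes are lexical to the class block, code outside the class cannot reach the data except through the accessors, which is the encapsulation benefit the pattern is usually chosen for.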
More information and a BioUtils download can be found on SourceForge: https://sourceforge.net/projects/bioutilsperllib/?source=directory