Tuesday, April 23, 2013

BioUtils vs. BioPerl

Perl is a popular language in bioinformatics.  Its popularity rests primarily on how easy it is to learn and on its plethora of open source libraries.  BioPerl is one such library.  As a bioinformatician, I appreciate BioPerl because it lets me quickly and easily perform common, menial tasks such as parsing FASTA files and manipulating basic DNA sequence objects.  However, BioPerl's runtime and memory use scale poorly and are becoming limiting factors in large-scale sequencing projects.  In order to accommodate many diverse data types, the object hierarchy and object attributes of BioPerl have become somewhat convoluted, thereby increasing memory and runtime requirements.

To address the memory and runtime limitations of BioPerl, I developed BioUtils, a collection of simple Perl modules for FASTA/Q-formatted DNA sequences.  BioUtils can:
  • Store FASTA/Q sequences (including quality values)
  • Read/write FASTA/Q files
  • Perform simple quality control on FASTA/Q sequences
  • Provide summary info for a set of FASTA/Q sequences
  • Build a consensus sequence from 2 or more FASTQ sequences
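
To make the reading step concrete, here is a minimal sketch of parsing FASTQ records in plain Perl.  It is not BioUtils code and does not use the BioUtils API; the subroutine name next_fastq_record and the hash keys are hypothetical, and the sketch assumes the common four-line FASTQ layout (no wrapped sequence lines).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioUtils): read one four-line FASTQ
# record from an open filehandle.  Returns a hash reference, or undef
# (in scalar context) at end of file.
sub next_fastq_record {
    my ($fh) = @_;

    my $header = <$fh>;
    return if !defined $header;    # end of file

    my $seq  = <$fh>;
    my $plus = <$fh>;
    my $qual = <$fh>;
    die "Truncated FASTQ record\n" if !defined $qual;

    chomp( $header, $seq, $plus, $qual );
    die "Header must start with '\@'\n"   if $header !~ /^\@/;
    die "Separator must start with '+'\n" if $plus   !~ /^\+/;
    die "Sequence and quality lengths differ\n"
        if length($seq) != length($qual);

    return {
        id   => substr( $header, 1 ),    # drop the leading '@'
        seq  => $seq,
        qual => $qual,
    };
}

# Usage: parse two records from an in-memory filehandle.
my $data = "\@read1\nACGT\n+\nIIII\n\@read2\nGGCC\n+\n!!!!\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
while ( my $rec = next_fastq_record($fh) ) {
    push @records, $rec;
}
close $fh;

# Print each record as tab-separated id, sequence, and quality string.
printf "%s\t%s\t%s\n", @{$_}{qw(id seq qual)} for @records;
```

Keeping each record as three plain strings, with no per-record object hierarchy, is the kind of simplicity that keeps per-sequence overhead low.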
Because of its simplicity, BioUtils requires much less memory and drastically reduces runtime of the above operations compared to BioPerl.  The following figure displays the CPU time for reading and writing FASTQ sequences from files with increasing numbers of sequences.


Currently, a full Illumina MiSeq run can produce ~6 million paired-end reads (i.e. ~12 million raw reads).  Extrapolating from the figure above, BioPerl would need nearly two and a half hours to read and write a complete MiSeq dataset, compared to only 5 minutes for BioUtils.

Because BioUtils is implemented using inside-out objects, as described in Perl Best Practices (Damian Conway), its memory usage is difficult to measure directly.  However, a peek at the source code shows that BioUtils stores less metadata per sequence than BioPerl and should therefore require less memory.
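
For readers unfamiliar with the pattern, here is a minimal sketch of an inside-out class.  The package name My::Seq and its accessors are hypothetical and are not taken from the BioUtils source; the sketch only illustrates the general technique.  Attribute values live in lexical hashes inside the class, keyed by each object's address, so the object itself is an empty scalar reference.  That is why inspecting a single object says little about how much memory its data occupies.

```perl
package My::Seq;    # hypothetical class name, not a BioUtils module
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Attribute storage lives in lexical hashes private to this file,
# keyed by each object's memory address, not inside the object itself.
my %seq_of;
my %quals_of;

sub new {
    my ( $class, %args ) = @_;

    # The object is only a blessed reference to an anonymous scalar.
    my $self = bless \do { my $anon }, $class;

    $seq_of{ refaddr $self }   = $args{seq};
    $quals_of{ refaddr $self } = $args{quals};
    return $self;
}

sub get_seq   { my ($self) = @_; return $seq_of{ refaddr $self }; }
sub get_quals { my ($self) = @_; return $quals_of{ refaddr $self }; }

# Without an explicit destructor the attribute hashes would keep
# entries for objects that no longer exist.
sub DESTROY {
    my ($self) = @_;
    delete $seq_of{ refaddr $self };
    delete $quals_of{ refaddr $self };
    return;
}

package main;

my $read = My::Seq->new( seq => 'ACGT', quals => 'IIII' );
print $read->get_seq(), "\n";    # prints the stored sequence
```

The trade-off is that every attribute needs its own hash and the destructor must clean each one up, but attributes cannot be reached by poking inside the object from outside the class.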

More information and a BioUtils download can be found on SourceForge: https://sourceforge.net/projects/bioutilsperllib/?source=directory


Becoming Googlable

This week has been a very "progressive" week in my life with regard to my involvement in social media.  I have never subscribed to social media because I have seen how its addictive nature leads to many lost hours with little or no benefit (which I still believe is true for sites like Facebook and Twitter).  However, I have begun to see the advantages of being "googlable" and have consequently decided to take the plunge into social media by starting a personal website and blog.  A growing number of scientific researchers have personal websites to advertise their research and blogs to quickly communicate novel discoveries and ideas.  I hope to follow that model by using my personal website to deploy some of my useful bioinformatics tools and by blogging about interesting scientific ideas.  Hopefully this experimentation with social media will be productive for both me and others.