The in silico lens

Categorizing musicians by genre using artificial neural networks

2015-05-13T14:53:00.000-07:00

Background
Genres are an important classification scheme for organizing music and other forms of art. However, genre classifications are often ambiguous. For example, undergraduate students asked to classify a given set of songs into 10 genres had only a 72% concordance with recording companies' genre classifications (D. Perrot and R. Gjerdingen, 1999). Some of the most sophisticated computer algorithms can classify songs into genres with about 70% to 80% accuracy using song features such as tempo, time, instruments, etc. The accuracy of genre classification also depends on the number of genres an algorithm attempts to delineate. In the MIREX 2005 audio genre classification contest, the highest accuracy for 6 genre classifications was 87% but dropped to 75% when classifying 10 genres (http://www.music-ir.org/evaluation/mirex-results/).

An alternative to classifying songs into genres is to classify musicians into genres. Categorizing musicians into genres is also a difficult problem because individual musicians can be members of several different, but related, genres. Musicians can be described using a set of terms (i.e. genres and sub-genres) representing their genre membership. For example, Florida Georgia Line can be classified using the terms bro-country, country rock, country pop, and country rap. This multi-class membership makes it difficult to make general comparisons between genres. For example, it would be interesting to test if the word "truck" is more frequently mentioned in country songs (i.e. by country musicians) than in rock songs (i.e. by rock musicians).

The first objective in this series of posts is to categorize musicians into a set of 20 pre-defined genres based on a list of terms associated with each musician using an artificial neural network model.

Artificial Neural Network Background
Artificial neural networks (ANN) are a class of machine learning models frequently used to solve classification problems. ANNs contain a set of input, hidden, and output nodes connected by weighted edges. Input nodes correspond to features in the training dataset. Output nodes correspond to classification types. Input nodes are connected by weighted edges to hidden nodes, and hidden nodes are connected by weighted edges to output nodes. Hidden nodes and weighted edges are the medium by which the input features are transformed into a classification prediction. For example, if a given musician, M1, is defined by features f1, f2, and f3, by "activating" input nodes f1, f2, and f3 the network redirects the flow of weight towards the output nodes. After all the weight has been redirected, the output node with the highest weight has the highest probability of being the correct classification for musician M1.
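
To make the "flow of weight" idea concrete, here is a minimal sketch of a single pass through a tiny network. Everything in it is an assumption for illustration: the weights, the two hidden nodes, the genre names, and the choice of a logistic hidden layer with a softmax output. It is not the model built in this post.

```python
import math

def forward(active_features, w_input_hidden, w_hidden_output):
    """Return one probability per output node for a set of active input features."""
    n_hidden = len(w_hidden_output)
    # Each hidden node sums the weights of the edges coming from the active
    # input nodes, then squashes the sum with a logistic function.
    hidden = []
    for h in range(n_hidden):
        total = sum(w_input_hidden[f][h] for f in active_features)
        hidden.append(1.0 / (1.0 + math.exp(-total)))
    # Each output node sums its weighted hidden activations.
    n_out = len(w_hidden_output[0])
    scores = [sum(hidden[h] * w_hidden_output[h][o] for h in range(n_hidden))
              for o in range(n_out)]
    # Softmax turns the scores into probabilities that sum to one.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    return [e / sum(exps) for e in exps]

# Hypothetical weights: 3 input terms (f1..f3), 2 hidden nodes, 2 genres.
W_IN = {"f1": [2.0, -1.0], "f2": [1.5, -0.5], "f3": [-1.0, 2.0]}
W_OUT = [[3.0, -3.0],   # hidden node 0 -> (genre A, genre B)
         [-3.0, 3.0]]   # hidden node 1 -> (genre A, genre B)
GENRES = ["genre A", "genre B"]

probs = forward(["f1", "f2"], W_IN, W_OUT)
predicted = GENRES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Activating a different set of input nodes (for example only `f3`) sends the weight toward the other output node, which is the whole mechanism in miniature.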

The Training Dataset
The task of building or training an ANN is equivalent to assigning weights to each edge such that the classification accuracy of the model is maximized. The overly simplistic intuition behind this operation is as follows. First, edge weights are randomly initialized. Then features of a single training entry are allowed to flow through the network and the final output for that entry is noted. Using the final output and the known output, an error value can be calculated which measures the level of correctness of the final output. The error is then propagated backward through the network (backpropagation) and the weights are adjusted to reduce the error for that training entry. This process is repeated for each training entry.
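
The loop described above can be sketched in a few lines. To keep it short, this sketch swaps the full network for a single logistic unit with no hidden layer, so the "backward" step is just one weight update; the toy data (a logical OR), the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import math
import random

def predict(weights, features):
    """Output of a single logistic unit; the last weight acts as a bias."""
    inputs = list(features) + [1.0]
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))

def train(examples, n_features, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly initialize the weights.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_features + 1)]
    for _ in range(epochs):
        for features, target in examples:
            inputs = list(features) + [1.0]
            # Step 2: let one training entry flow through the model.
            output = predict(weights, features)
            # Step 3: compare the output with the known answer.
            error = output - target
            # Step 4: nudge each weight in the direction that shrinks the error.
            for i, x in enumerate(inputs):
                weights[i] -= rate * error * x
    return weights

# Toy training set: the target is 1 when either feature is present.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train(OR_DATA, n_features=2)
for features, target in OR_DATA:
    print(features, target, round(predict(weights, features), 3))
```

A real ANN repeats step 4 layer by layer, passing the error from the output nodes back to the hidden nodes, but the cycle of initialize, predict, measure, and adjust is the same.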

In this project, an ANN will be used to classify musicians based on a set of terms describing each musician. Training data was manually gathered and curated. First, for each genre, G, google searches for "popular musicians of genre G" generated a list of popular musicians and their matching genre. Terms describing each musician, M, were gathered from the Echo Nest database. The "popular musician" lists were supplemented with more musicians obtained from Wikipedia pages for each genre. The complete training dataset consisted of 20 genres where each genre contained 80 musicians and each musician is represented by a list of terms. This training dataset, like nearly all training datasets for song classification, is likely to have incorrect annotations. Making more accurate training sets is a difficult task but substantially improves classification accuracy (Cory McKay and Ichiro Fujinaga, 2006).

Feature Selection
Estimating model parameters (i.e. weights) can be computationally expensive. Training sets with a large number of features take much longer to fit than those with fewer features. Here the initial training dataset had over 3,000 features (i.e. terms). Removing zero and near-zero variance features can reduce the required computational resources without having a substantial impact on model accuracy. Additionally, removing highly correlated features reduces runtime and can improve model accuracy. For these data, terms associated with fewer than 1% of musicians were deemed near-zero variance and removed. Also, terms having a Phi coefficient (a measure of association between two binary variables) higher than 0.9 with another term were removed. These filtering measures reduced the number of features (i.e. terms) to 412. The final training dataset contained 1,544 musicians linked to 20 genres and described by 412 terms. Note that 56 musicians are linked to more than one genre.
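
A rough sketch of the two filters follows. It assumes the data are stored as a term-by-musician table of 0/1 flags, and it keeps whichever term of a correlated pair it sees first; the post does not say which member of a pair was dropped, so that rule is an assumption, as are the function names and the toy data.

```python
import math

def phi(a, b):
    """Phi coefficient between two equal-length 0/1 vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

def filter_terms(matrix, min_fraction=0.01, max_phi=0.9):
    """matrix maps term -> list of 0/1 flags, one per musician."""
    kept = []
    for term, column in matrix.items():
        # Drop near-zero variance terms: used by too few musicians.
        if sum(column) / len(column) < min_fraction:
            continue
        # Drop a term that is nearly redundant with one already kept.
        if any(phi(column, matrix[k]) > max_phi for k in kept):
            continue
        kept.append(term)
    return kept

# Made-up example with 10 musicians; the threshold is raised to 20%
# so that the rare term is filtered out of such a small table.
terms = {
    "country": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "twang":   [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "rock":    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
    "rare":    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}
print(filter_terms(terms, min_fraction=0.2))
```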

Building the Model
The ANN model was built using the caret package and principles outlined in Applied Predictive Modeling by Max Kuhn and Kjell Johnson. ANN models have two main parameters: decay and size. After each backpropagation, weights are multiplied by the decay parameter, a number less than one, to prevent weights from growing too large. Excessively large weights cause a model to be overly specific to the given training dataset (i.e. not generalizable to other input data). The second parameter, size, is the number of hidden nodes in the model. Arbitrarily choosing these parameters may lead to suboptimal models. A method called model tuning builds several models based on a range of parameter values and selects the most accurate as the final model. Accuracy for each tuning model built in the project is shown in Figure 1. The final model used parameters decay=0.5 and size=25. Further adjustments to these parameters are not likely to significantly improve model accuracy for the given training data.

Figure 1: Model accuracy for a range of decay and size parameter values. The most accurate model was built using decay=0.5 and size=25. Bars indicate standard deviation.

Accuracy of statistical models can be measured using a technique called cross validation. In cross validation the training data are split into two groups: a training group and a test group. The model is built using the training group and subsequently evaluated using the test group. The process is repeated multiple times using different subsets of training and test groups to ensure that a chance good (or bad) split between the training and test groups does not incorrectly estimate model accuracy. Here the final model correctly classified about 60% of the test cases (Figure 1).
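
The split-train-test cycle can be sketched as below. This uses k-fold splitting, which is one common form of cross validation and not necessarily the resampling scheme used for this model. The "model" is a deliberately trivial stand-in that always predicts the most common training label, and the items and labels are made up.

```python
import random
from collections import Counter

def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists; every item is tested exactly once."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(items, labels, fit, predict, k=5):
    """Return the accuracy measured on each held-out fold."""
    accuracies = []
    for train, test in k_fold_indices(len(items), k):
        model = fit([items[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, items[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies

# Stand-in "model": always predict the most common training label.
def fit_majority(items, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, item):
    return model

items = list(range(20))                      # placeholder musicians
labels = ["rock"] * 15 + ["country"] * 5     # made-up genre labels
scores = cross_validate(items, labels, fit_majority, predict_majority, k=5)
print("mean accuracy: %.2f" % (sum(scores) / len(scores)))
```

Swapping `fit_majority` and `predict_majority` for a real training and prediction function gives the per-fold accuracies whose mean and spread are what a tuning plot reports.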

Model accuracy can also be evaluated using a confusion matrix. A confusion matrix tabulates the observed classifications against the expected classifications for each category. Figure 2 shows the percentages of data observed in each category from the cross validation test cases. The majority of musicians' observed genres matched the expected genre.

Figure 2: Confusion matrix built using all cross validation test cases. Values of each cell indicate the percentage of total musicians observed in each category for a given expected genre. If the classification model were perfect there would be a blue diagonal and all other boxes would be red.
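
For readers who want to build a table like this themselves, here is a minimal sketch that converts paired expected and observed labels into row percentages. The labels are invented and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def confusion_percentages(expected, observed):
    """Map expected genre -> {observed genre: percent of that expected genre}."""
    counts = defaultdict(Counter)
    for exp, obs in zip(expected, observed):
        counts[exp][obs] += 1
    table = {}
    for exp, row in counts.items():
        total = sum(row.values())
        table[exp] = {obs: 100.0 * n / total for obs, n in row.items()}
    return table

# Made-up test cases: one classic_rock musician is mistaken for soft_rock.
expected = ["classic_rock"] * 4 + ["reggae"] * 2
observed = ["classic_rock", "classic_rock", "classic_rock", "soft_rock",
            "reggae", "reggae"]
table = confusion_percentages(expected, observed)
for genre, row in table.items():
    print(genre, row)
```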

Genres having musicians that are frequently classified incorrectly into a single alternative genre are likely similar to that alternative genre. A hierarchical clustering of the confusion matrix percentages (Figure 3) shows that some genres are similar to other genres (e.g. classic_rock and soft_rock; disco and funk; etc.). These similarities match logical expectations and correspond to how accurately the model can distinguish between genres. For example, the model is more likely to incorrectly classify a "classic rock" musician as "soft rock" than as "reggae."

Figure 3: Dendrogram clustering of confusion matrix values. This shows the similarity between genres based on the number of misclassification instances shared between genres. As expected, genres known to be similar cluster together (e.g. classic_rock and soft_rock).

Conclusion
Artificial neural networks are an appropriate model for classification tasks. Here an ANN was built to classify musicians based on a list of terms associated with each musician. Considering accuracy of other models used in classifying songs, this model is sufficiently accurate for classifying musicians into genres. Future analyses will use classifications from this model to make general comparisons between genres.

Additional Notes

The code (which needs to be cleaned and organized) can be found here.
Disclaimer: I am not a statistician. This is a learning exercise for me. I'm sure there are plenty of mistakes and things I overlooked. If anyone has suggestions on how to improve the correctness of this analysis I am interested in hearing them. I want to learn.
For a fantastic primer on using artificial neural networks for regression and classification see the following posts by Brian Dolhansky:

These chapters by Michael Nielsen describing ANNs are also a great resource.

Liang et al. use music features and lyrics to classify genres. However, many popular genres are missing (e.g. country) and there may be mistakes.
Other interesting references:

2015 JGI Users Meeting Notes

2015-04-29T14:02:00.000-07:00

Interested in watching any of these talks? See this webpage!

Jack Gilbert -- Genome-Enabled Flux Balance Metabolic Networks Form Periodically Flooded Soils

Presented an array of studies across many different microbial communities
environmental microbes have roughly a 35 year lag to major climate changes
Cyanobacteria can metabolize nitrogen into ammonium which has been shown to be useful in sphagnum moss
bacteria can protect against allergies. Adding Clostridium to mice alleviates allergy symptoms
added Clostridium to a young man who had really bad allergies. Comparisons to other family members without allergies are in progress
also doing this experiment on dolphins because it is easier to control their environment
microbes can contribute to you being fat or skinny
circadian rhythms in the gut genes and microbes looks to be very important
microbes in roots also have circadian cycles
your house takes on your microbiome
dogs increase the similarity of couple's microbiomes

Francis Martin -- Harnessing Genomics for Understanding Tree-Microbe Interactions in Forest Ecosystems

we know very little about how fungi affect the carbon cycle
4 major groups of fungi in most forest ecosystems: white rotters, brown rotters, litter/soil decayers, and ectomycorrhizal
he studies mycorrhizal fungi and their symbiotic toolkit
mycorrhizal symbiosis has evolved independently several times!
ectomycorrhizal fungi (EMF) have a reduced complement of plant cell wall degrading genes compared to ancestors
small proteins from EMF are secreted and land on plant cells
Look at Plett paper for JAZ interaction (PNAS). MiSSP7 binds to JAZ and prevents the immune response in the plant
each symbiosis event has developed a unique set of effectors

Joan Bennett -- Do Fungi have a 'Volatome'?

aflatoxins can cause cancer in small doses
mycotoxins may cause sick building syndrome
volatile organic compounds (VOCs) are things that we can smell and are frequently made by fungi
using Arabidopsis and flies as models for testing the effects of these compounds
flies exposed to c-8 VOCs acted similarly to model flies for Parkinson's disease
VOCs in the fungal microbiome of humans are likely responsible for attracting mosquitoes

Susanna Theroux -- Marsh Madness: Microbial Communities Driving Greenhouse Gas Cycling in Coastal Wetlands

wants to link wetland microbes with carbon emissions
methanogens break down carbon into methane
wetlands produce a lot of methane
microbes might be useful in minimizing methane production compared to carbon storage
more methanogens yield more methane

Antonis Rokas -- Evolution of Fungal Chemodiversity

fungal metabolism
asks how did chemo diversity originate and why is chemo diversity clustered
toxicity clusters are likely driven by genetic linkage (ie butterflies)
mined metabolic databases for genes that are clustered and genes that are not and measured toxicity. Clustered genes are more toxic
tissue specific expression in humans and mammals is the equivalent of clustering in fungi
this implies that the position of two genes in the fungal genome gives information about how they interact in humans (ie those two genes are likely to be expressed in the same tissue)

Rotem Sorek -- The Immune System of Bacteria

CRISPR is the immune response in bacteria for phages
the Cas proteins find phage DNA in the cell and insert it into the next spacer
CRISPR is in only ~40% of bacteria. How do other bacteria fight phage? The only other known mechanism is restriction enzymes. Can we find new immune systems in bacteria?
immune system signatures: rapidly evolving, high horizontal gene transfer
found the BREX system in B. cereus! Found in 10% of bacteria. We don't understand the mechanisms yet
phages have anti-defense systems. Ergo it would be best for bacteria to have multiple defense systems

Phil Hugenholtz -- Back from the Dead: The Curious Tale of the Predatory Cyanobacterium Vampirovibrio chlorellavorus

these bacteria suck host cells dry!
falls in Cyanobacteria clade
contains a type-IV secretion system partially found on plasmids
non photosynthetic (unlike most Cyanobacteria)

Stephen Wright -- Comparative and Population Genomics in the Brassicaceae: Understanding Genome-Wide Natural Selection

evolution can go backwards. For example, Y chromosome degeneration in humans.
many plants that undergo whole genome duplication revert to diploid
there is evidence in Arabidopsis that a long time ago it had a whole genome duplication event
possible explanations: passive constituency of redundancy, inefficient selection, differential adaptation
there is limited evidence contrary to popular theory that selfing or limited recombination leads to a high density of deleterious mutations and evolutionary dead-ends
Capsella rubella is a model for this phenomenon
ploidy affects the efficiency of natural selection
higher ploidy can weaken efficacy of natural selection
ploidy combined with transition to selfing increases rate of deleterious mutation accumulation
plant sex chromosomes are younger than mammalian sex chromosomes

Steve Briggs -- Protein Regulatory Networks

see Walley et al. PNAS 2013!

Susan Lynch -- The Microbiome--A New Frontier in Human Health

increase in asthma cases in children in US and Australia
asthma is mostly an environmental disease (ie not so much genetic)
now people have far less environmental exposure
Americans spend about 90% of their time indoors
vaginally born children have a different microbiome than c-section born children. And they are less likely to have asthma.
Lactobacillus keeps airways of mice challenged with dust open like normal airways
see Fujimura et al., PNAS, January 2014
WHEALS study
allergic children often develop asthma

The iChip: A tool for high-throughput microbial culturing

2015-01-26T13:03:00.001-08:00

It is commonly accepted that only ~1% of naturally occurring microbes are culturable using standard culturing techniques. Until recently, culturing microbes has been the only way to investigate their presence and effects on various environments (e.g. human gut, soil, plant roots, oceans, etc). Metagenomics, sequencing DNA from a pool of microbes, has broadened our understanding of microbial communities by circumventing the need for culturing. However, culturing remains an important aspect of investigating microbial communities, particularly for environments where the microbial diversity is so great that generating complete genome sequences via metagenomics is impractical (e.g. soil).

In 2010, a study from the Epstein and Lewis groups (Nichols et al., 2010) described a new technology which can increase culturing success from ~1% to nearly 50%! This technology, the isolation chip or iChip, contains 384 chambers. A single microbial cell is deposited in each chamber. Then a semi-permeable membrane is used to cover all the chambers and the iChip is placed into the environment from which the microbes originate. The membrane allows cells access to nutrients and growth factors found in their natural environment.

The iChip has been successfully used to study novel secondary metabolites (Lewis et al., 2010) and discover novel antibiotics (Ling et al., 2015; Lewis, 2013). However, a possible reason why high-throughput culturing techniques have not gained much traction is the emphasis on metagenomic applications and studies which do not suffer from culturing bias. High-throughput culturing and metagenomic sequencing each have a unique set of strengths and weaknesses. Perhaps more thought should be put into designing methods using a combination of high-throughput culturing and metagenomic sequencing to leverage both sets of strengths and mitigate their weaknesses.

Literature Review: Selection on soil microbiomes reveals reproducible impacts on plant function

2014-11-21T06:17:00.000-08:00

I recently found an intriguing microbiome paper, and I would like to post my thoughts about it.

Citation
Panke-Buisse et al., Selection on soil microbiomes reveals reproducible impacts on plant function, ISME Journal 2014, doi: 10.1038/ismej.2014.196.

Review
This paper tests whether phenotypes from a soil microbiome connected with a single plant genotype can be recapitulated in other plant genotypes. First, early- and late-flowering time associated soil microbiomes were generated. Ten generations of Arabidopsis thaliana Col-0 were grown, and at each generation soils from the four earliest and latest flowering pots were collected as the inoculum for the subsequent generation. Using soils from generation 10, final early- and late-flowering time associated inoculums were derived. These inoculums were then administered to four Arabidopsis thaliana ecotypes (Rld, Ler, Be, and Col-0) and a close relative, Brassica rapa. Flowering time, plant biomass, and nitrogen mineralization enzyme activity for each of the genotypes were statistically different between early- and late-flowering associated inoculum treatments except Ler (Fig 4). Microbes specifically associated with early- and late-flowering time treatments were found in low abundance, suggesting that low-abundance microbes can contribute significantly to interesting phenotypes.

Positives In general, I found this paper to be very insightful for the following reasons:

Soil microbial phenotypes can persist in novel hosts. This is a really cool result and could have direct implications for building field-useful microbiome treatments. However, flowering time is a well conserved phenotype, so this result may not apply to other specific phenotypes of interest such as crop yield or disease resistance, or to hosts with greater genetic divergence (e.g. corn flowering time vs Arabidopsis).
Low abundance members of a community can drive an important phenotype. A common assumption in metagenomics is that the most abundant members of a community are the most important. Perhaps more focus should be geared toward linking microbes to phenotypes regardless of abundance.
This artificial selection design doesn't require detailed knowledge of genetic mechanisms in order to produce a desired phenotype. It is similar to the old dogma of plant breeding where two good plants were crossed to produce an even better plant with no underlying knowledge of the genetic mechanisms. While understanding the mechanisms is important for maximizing the phenotypic benefits, it is not required to generate a positive effect.

Critiques and Questions

The methods section seems incomplete.

Why are there so few OTUs in the heatmap and ternary plot? Were OTUs filtered out? I would expect to see hundreds or even thousands of OTUs from soil-derived microbiota.
Were the wild soils combined at equal ratios?
What was their measure of flowering time? I think there are well defined methods in the literature for observing flowering time.

Why are the EF associated samples in the 10 generation experiment actually flowering at a later time in later generations (supp fig 1)?
Why are the control samples not included in supp fig 1?
Even though control samples were included as a covariate to generate figure 4, I would have liked to have seen them graphed there as well.
Perhaps a better way of generating the control samples would have been to randomly pick pots to generate a control inoculum. This is what was done in Swenson et al., 2000.

Future Work
This experimental design could be used to address the following intriguing hypotheses:

Selection on soil microbiomes changes the root endophyte microbial community.

Redo the experiment with the same design but also 16S profile endophyte microbes.
It would also be interesting to see if the microbes specific to each treatment are also found inside the root. This would suggest a direct microbial effect on the plant phenotype.

Selection for soil microbiome associated phenotypes is driven by multiple mechanisms.

Redo the experiment but don't combine replicates such that the end inocula include four independent lines of selection for both early- and late-flowering time soil microbiomes. Compare microbes among treatment groups looking for different microbial profiles that express similar phenotypes.

MiSeq Flow Cell Edges Correlate with Low Quality Scores

2014-10-08T16:54:00.001-07:00

Upon receiving a FASTQ file fresh off the MiSeq, the first question I ask myself is: "Did the sequencing work?" On several occasions I have opened these files to discover that the first few pages of sequence reads are littered with N's and have low quality scores. However, when I run the full set of reads through a QC pipeline (e.g. FastQC) an overwhelming majority are high quality.

So why is it that the reads at the beginning of the FASTQ file have such poor quality?

Reads generated using Illumina technology are ordered by their cluster's y-coordinate on the flow cell. This led me to hypothesize that clusters near the edge of the flow cell are more likely to have low quality scores. To test this hypothesis I took a subset of reads (932,709) from a MiSeq run, calculated their average quality score, and graphed that against the distance from the closest edge.
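
The measurement itself is straightforward to sketch. The version below assumes the standard Illumina header layout (`@instrument:run:flowcell:lane:tile:x:y ...`), Phred+33 quality encoding, and that all reads share one coordinate system. It also approximates the edges by the smallest and largest coordinates seen, which is an assumption rather than the true flow cell geometry. The three reads are fabricated placeholders, not real data.

```python
def mean_quality(quality_line, offset=33):
    """Average Phred score of one read (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)

def cluster_xy(header):
    """Pull the x and y cluster coordinates out of an Illumina read header."""
    fields = header.split()[0].split(":")
    return int(fields[5]), int(fields[6])

def quality_vs_edge(fastq_lines):
    """Return (distance to nearest edge, mean quality) for every read."""
    records = []
    for i in range(0, len(fastq_lines) - 3, 4):
        x, y = cluster_xy(fastq_lines[i])
        records.append((x, y, mean_quality(fastq_lines[i + 3].strip())))
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    # Assumption: the observed coordinate range approximates the imaged area.
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    return [(min(x - x_lo, x_hi - x, y - y_lo, y_hi - y), q)
            for x, y, q in records]

# Placeholder reads: two at the corners with poor quality, one in the middle.
example = [
    "@M00001:1:000000000-A1B2C:1:1101:1000:1000 1:N:0:1", "NNNN", "+", "####",
    "@M00001:1:000000000-A1B2C:1:1101:15000:12000 1:N:0:1", "ACGT", "+", "IIII",
    "@M00001:1:000000000-A1B2C:1:1101:29000:24000 1:N:0:1", "NNNN", "+", "####",
]
points = quality_vs_edge(example)
for distance, quality in points:
    print(distance, quality)
```

The resulting pairs are what would be plotted and smoothed to look for a relationship between edge distance and quality.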

The splines function in R was used to model the relationship between average quality score and distance from the closest flow cell edge (red line). Clusters closer to the edge of the flow cell have significantly lower quality scores. However, the good news is that the vast majority of clusters do not fall close to the edge (light blue heat).

In practical terms this issue is insignificant because so few reads are affected, but it does explain the high concentration of low quality reads at the beginning (and end) of Illumina generated FASTQ files.

When is my genome finished?

2014-09-30T11:21:00.000-07:00

This is a common question for anyone doing genome sequencing and assembly.

The short answer: never.

The slightly longer answer: it depends.

The long answer: Finishing a genome means the order of all nucleotide bases has been correctly resolved. Even for simple genomes this is extremely difficult. Billions of dollars have been spent to sequence the human genome, and it is currently estimated that 92% of the human genome has greater than 99.99% accuracy (Schmutz et al., 2004). In total, 99% of the genome has been assembled, and the remaining 1% is likely to consist of highly repetitive regions with little or no gene content (remember that 1% of 3 billion total bases means there are about 30 million unresolved bases). Using the current technology, it is impossible to reach 100% accuracy across 100% of the human genome. For simpler genomes such as some viral and bacterial genomes it is possible to completely resolve the entire sequence. However, because these genomes are much more dynamic (i.e. change at a faster rate), generating a completely finished genome may not be worth the cost.

Finishing a genome requires a substantial amount of time, work, and money. However, getting a genome to draft status (i.e. incomplete but usable) can be done with minimal costs and resources. So perhaps the most important question is: when is my genome usable? The answer to this question depends on the research questions being considered. For example, when studying the evolutionary history of genomes with a high propensity for genomic rearrangements, generating long contigs/scaffolds is important for determining the orientation and lineage of genomic rearrangements. Alternatively, some research questions are more concerned about the completeness of the gene content for making functional comparisons between genomes. For such a question, generating long, continuous contig/scaffolds is less important.

Here are some possible ways to estimate completeness (with what I think are the best methods near the top):

Compare the gene content of highly conserved genes

Eukaryote: CEGMA
Bacterial: CheckM
Archaeal: A table of 53 conserved COGs

Check for possible errors using REAPR
Calculate assembly metrics like contig number, N50, coverage, and assembly size.
Compare the size of your assembled genome to a related (or set of related) genome(s).
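
Of the assembly metrics listed above, N50 is the one that most often needs explaining: it is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, using made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical assembly of seven contigs (lengths in bases).
lengths = [80, 70, 50, 40, 30, 20, 10]
print("contigs:", len(lengths))
print("assembly size:", sum(lengths))
print("N50:", n50(lengths))
```

Here the two longest contigs (80 + 70 = 150) reach half of the 300 assembled bases, so the N50 is 70.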

See these papers for interesting discussion on evaluating assemblies:

How to Order Contigs Using a Reference Genome

2014-08-22T14:14:00.000-07:00

Background
In genomics, complete (i.e. finished) genomes provide an excellent resource for future sequencing projects. However, generating finished genomes is an expensive and laborious endeavor. In some cases, the returns of finishing a genome are not worth the cost, particularly when draft genomes can provide enough information for robust hypothesis testing. Draft genomes contain a large number of unordered sequences of various lengths called contigs. Contigs can be joined into more contiguous sequences called scaffolds using paired-end reads. While a set of contigs/scaffolds can be useful for a wide variety of projects (gene annotation, gene expression profiling, etc.), contig order provides vital information in comparative genomics and analyses of specific genomic regions. Typically, a rough ordering of contigs can be accomplished by mapping contigs to a closely related reference genome.

ABACAS
Recently, I discovered a nice software package called ABACAS which not only orders contigs but creates a contiguous sequence representing the connected contigs. Gaps between contigs are represented by N's and overlapping regions are resolved (although I could not find information on exactly how). Because ABACAS requires MUMmer, its outputs integrate well with other MUMmer scripts such as those for visualizing alignments (Figure 1, i.e. mummerplot). ABACAS is hosted by SourceForge and is well documented. It was published in Bioinformatics in 2009.
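
The general idea of reference-based ordering can be sketched as follows. This is not how ABACAS works internally; it is only an illustration that assumes each contig already has a single reference start position and strand (for example from an aligner), ignores overlaps and unplaced contigs, and uses a fixed-size run of N's for every gap. The contigs, placements, and function name are hypothetical.

```python
def order_contigs(contigs, placements, gap=100):
    """Join contigs in reference order, padding the joins with N's.

    contigs    -- dict of contig name -> sequence
    placements -- dict of contig name -> (reference start, strand)
    """
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    ordered = sorted(placements, key=lambda name: placements[name][0])
    pieces = []
    for name in ordered:
        seq = contigs[name]
        # Contigs that align to the reverse strand are reverse complemented.
        if placements[name][1] == "-":
            seq = seq.translate(complement)[::-1]
        pieces.append(seq)
    return ordered, ("N" * gap).join(pieces)

# Hypothetical contigs and their positions on the reference.
contigs = {"c1": "AAAA", "c2": "ACGG", "c3": "TTTT"}
placements = {"c1": (500, "+"), "c2": (10, "-"), "c3": (200, "+")}
order, pseudomolecule = order_contigs(contigs, placements, gap=3)
print(order)
print(pseudomolecule)
```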

Notes
Of course, more closely related draft and reference genomes will generate the most correct ordering. However, in cases where the target genome is VERY closely related to the reference genome it may be better to map raw reads to the reference genome instead of building a de novo assembly. Deviations from the reference genome can be discovered using SNP detection software. Read mapping experiments are generally cheaper because they require fewer reads than de novo assembly but can still detect biologically significant differences when compared to the reference genome.

Cool Unix Commands

2014-07-11T12:08:00.001-07:00

I will add to this list as I discover new ones. If you have a favorite or useful command feel free to include it in a comment on this post.

Convert a FASTQ file to FASTA (originally posted here):

sed -n '1~4s/^@/>/p;2~4p' file.fastq > file.fasta

NOTE: this assumes that each FASTQ entry spans only four lines, as is customary. The first~step address syntax (e.g. 1~4) is a GNU sed extension.
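
For systems without GNU sed, the same conversion is easy to sketch in Python. The function name is hypothetical, and like the one-liner it assumes four lines per FASTQ record.

```python
def fastq_to_fasta(lines):
    """Convert four-line FASTQ records into two-line FASTA records."""
    out = []
    for i in range(0, len(lines) - 3, 4):
        header = lines[i].rstrip("\n")
        sequence = lines[i + 1].rstrip("\n")
        out.append(">" + header[1:])   # swap the leading '@' for '>'
        out.append(sequence)
    return out

# Two placeholder reads.
fastq = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
         "@read2\n", "GGCC\n", "+\n", "IIII\n"]
print("\n".join(fastq_to_fasta(fastq)))
```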

Convert a SAM file to FASTA

awk '!/^@/ {print ">" $1 "\n" $10}' file.sam > file.fasta

NOTE: You will lose a lot of information in the sam file. You can save more of that info by adding column variables to the print statement. Also, you may have to change the column variable numbers depending on your sam file format. This is just a general example.

Replace spaces in file names with underscore (originally posted here)

rename ' ' '_' *

NOTE: do NOT put spaces in file names!! This is so annoying!

Get a histogram of sequence lengths from FASTA/Q files (from Surge Biswas)

FASTQ: cat <fastq file> | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c
FASTA: cat <fasta file> | awk '{if(NR%2==0) print length($1)}' | sort -n | uniq -c

NOTE: the FASTA command assumes each sequence sits on a single line (i.e. two lines per record).
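
The awk one-liners rely on a fixed number of lines per record. When FASTA sequences wrap across several lines, a small Python sketch like this one (the function name is hypothetical) gives the same kind of histogram:

```python
from collections import Counter

def fasta_length_histogram(lines):
    """Count how many sequences have each length; sequences may wrap lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return Counter(lengths)

# Placeholder records; the first sequence is wrapped over two lines.
fasta = [">a", "ACGT", "AC", ">b", "ACGTAC", ">c", "AC"]
for length, count in sorted(fasta_length_histogram(fasta).items()):
    print(count, length)
```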

Do arithmetic operations on the bash command line

echo $((1 + 1))
echo $((1 - 1))
echo $((1 * 1))
echo $((1 / 1))
echo $(((1+3) / (1+1)))

For floating point operations you can use the bc tool. For example

echo "scale=1; 1/2" | bc

Add a comment to a bash command on the command line

<command>; # this is a comment line

A practical example: mv file1 old_file1; # there is now a new file1 that is a more recent version

NOTE: Do you ever have a long and complex command for which you would like to save a simple note? You can use this little trick and the note will be saved alongside your command in your history. The next time you look through your history to rerun the command you will also see the associated note.

Count the number of bases in a FASTA file

grep -v ">" file.fasta | wc | awk '{print $3 - $1}'

(from martinghunt on SEQanswers)

How to Learn Bioinformatics

2014-06-26T12:14:00.003-07:00

Introduction

At least once a month someone asks me for help learning bioinformatics. I love it when this happens because it usually means they want to take control of their own analysis thereby freeing up my time for problems that interest me. This post is a collection of tips and resources for people wanting to learn how to do bioinformatics.

Keep These Things in Mind:

Learning the basics of bioinformatics is easy. The basics as described in this post are often taught in high school. However, don't get frustrated if you don't understand everything all at once. Learning anything new takes time and practice no matter its difficulty.
A little bit goes a long way. I estimate that nearly 90% of my work is occupied by simple routine procedures. Learning how to do these tasks will substantially expand your ability to analyze and interpret your data.
Google it. Google is the best resource for learning new techniques and troubleshooting problems. If you have a question, type exactly what you would say to a person into the Google search bar. When you take questions to your bioinformatics friends it's likely they won't know the answer offhand and will google it anyway.
Try it. If you're not sure about something try it and see what happens. Generally, there is very little danger in just trying a command to see if and how it works. That being said, it's a good idea to back up important files and data just in case something goes very wrong. Every Unix programmer that I know has deleted a really important file using the rm command (which is one of the few irreversible Unix commands). It's going to happen to you too, so make a backup.

Learn the Unix Basics

Get on a Unix machine. Doing is the most important aspect of learning Unix. You will never fully understand the basic concepts if you only read about them. Mac users have it easy because OS X is built on Unix. Simply open the terminal application and you are ready to start with an online tutorial. For non-Mac users I recommend finding an old computer and installing a Linux/Unix operating system like Ubuntu. A slightly more difficult approach would be to partition the drive of an existing computer to dual boot a Linux/Unix OS along with the existing OS.
Complete an online tutorial

Buy a book if you are a book learner. However, the basics can pretty much all be learned using online materials. My favorite Unix book is O'Reilly's Unix Power Tools.

Learn a Scripting Language

Pick a scripting language. Scripting languages are computer languages that are not compiled (i.e. they are interpreted by the computer on the fly). The two most popular bioinformatics scripting languages are Perl and Python. Both languages have their strengths and weaknesses, but I personally prefer Perl.
Complete an online tutorial for your language.

Buy a book. My favorite Perl book is Perl Best Practices by Damian Conway. This book is a must have for all Perl programmers! I don't have much experience with Python books, so I would recommend looking at book reviews before making a purchase.

Learn a Statistical/Graphing Language

Pick a language for doing statistical operations and building figures. Languages like R and Matlab are prime choices for both statistics and graphics. Both languages have their strengths and weaknesses, but I personally prefer R. If you choose R I highly recommend using the ggplot2 library for building figures.
Complete an online tutorial for your language.

Give up Excel. Excel is a powerful program but lacks the flexibility of computer languages like R and Matlab. While there is a steeper learning curve for R and Matlab, you will substantially enhance your ability to do statistical analyses and build graphics by getting away from Excel.

Learn Basic Bioinformatics Procedures and Corresponding Software Tools

For example:

Quality control

FastQC

Sequence assembly

Velvet
SPAdes (my preference, also see this post)

Sequence mapping

BWA
Bowtie

Sequence searching

BLAST

Differential expression

edgeR
DESeq

Variant calling

GATK

This is only a small list of procedures and tools primarily focusing on DNA sequence analysis. For a more comprehensive list see OMICtools.

Find a problem

I strongly encourage new bioinformaticians to find some real data and do meaningful science using the above principles and skills. If you don't personally have data, I recommend downloading data from a public repository (e.g. GenBank). A similar alternative would be to choose a paper that uses a procedure you are interested in learning and recapitulate the results.

Predicting Full-Length Ribosomal Gene Sequences

2014-03-26T11:25:00.000-07:00

Introduction

The 16S ribosomal gene has been used extensively in biology for distinguishing relatedness between species. This gene has regions of DNA that are highly conserved among almost all living organisms and other regions that have high DNA sequence variability. The conserved regions are ideal for building PCR primers that can amplify DNA from many different organisms. The variable regions that are amplified using these conserved primers can be used to determine the relatedness between two or more organisms. Closely related species typically have much more similar DNA sequences than distantly related species.

Typically PCR is used to amplify a portion of the 16S ribosomal gene for sequencing. However, whole genome sequences or whole metagenome sequences also contain short DNA reads originating from the 16S gene. These reads can be separated from the pool of other genomic reads and assembled into the entire 16S gene. EMIRGE (Miller, et al. 2011) is an algorithm for reconstructing full-length ribosomal genes from short read DNA sequences.

EMIRGE

EMIRGE reconstructs full-length ribosomal genes from short read DNA sequences. It first maps reads to a database of known 16S genes such as the SILVA or Greengenes database. After the initial mapping, EMIRGE estimates the probability that a given read was generated from the reference to which it mapped. Based on these probability estimates, reference sequences are changed to reflect the 16S sequences that are likely to be represented by the set of reads. Reads are then remapped to the adjusted 16S sequence database and the process is repeated until an equilibrium is achieved. The resulting database of 16S sequences reflects the likely 16S genes represented by the input set of short reads.

This software was primarily built to infer the set of 16S genes from whole metagenome reads. However, it can also be used to infer the single 16S gene from genomic sequences from a single isolate. Full-length 16S genes are difficult to assemble even when only reads from a single genome are considered.

In the Dangl lab, we use EMIRGE to predict full-length 16S genes from reads generated from a single genome of bacteria. An example of the EMIRGE command we use is:

emirge.py my_output_dir -1 fwd_reads.fastq -2 rev_reads.fastq -b SSURef_NR99_115_tax_silva_formated -f SSURef_NR99_115_tax_silva_formated.fasta -i 600 -s 1000 -l 250

The descriptions of each parameter are below:

my_output_dir: the output and working directory for EMIRGE.

-1: the forward or single-end genomic sequencing reads

-2: the reverse end genomic sequencing reads

-b: the bowtie index of the 16S sequence database

-f: the fasta file of the 16S sequence database

-i: insert size of paired-end reads

-s: standard deviation of insert size for paired-end reads

-l: max length of reads
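If you want to launch EMIRGE from a script rather than typing the command by hand, the parameters above can be collected in a list. This is only a small sketch: the file names and values are copied from the example command in this post, and it assumes emirge.py is on your PATH.

```python
import shlex

# Values copied from the example command above; swap in your own paths.
db = "SSURef_NR99_115_tax_silva_formated"

emirge_cmd = [
    "emirge.py", "my_output_dir",   # output and working directory
    "-1", "fwd_reads.fastq",        # forward (or single-end) reads
    "-2", "rev_reads.fastq",        # reverse reads
    "-b", db,                       # bowtie index of the 16S database
    "-f", db + ".fasta",            # fasta file of the 16S database
    "-i", "600",                    # insert size
    "-s", "1000",                   # insert size standard deviation
    "-l", "250",                    # max read length
]

# shlex.join quotes each argument so the string can be pasted into a shell.
print(shlex.join(emirge_cmd))

# To actually run it (assumption: emirge.py is installed and on your PATH):
# import subprocess
# subprocess.run(emirge_cmd, check=True)
```

Keeping the arguments in a list makes it easy to loop over many isolates and only change the read files and output directory.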

Other Details

EMIRGE uses bowtie to map reads to the reference database. To build the bowtie index of the reference database the following command was used:

bowtie-build SSURef_NR99_115_tax_silva_formated.fasta SSURef_NR99_115_tax_silva_formated

Also, the database downloaded from SILVA had to be reformatted using this Perl script. This script requires BioUtils.

2014 JGI Users Meeting Notes

2014-03-24T13:52:00.000-07:00

Here are some notes from a few of the speakers at the JGI Users meeting in California. In general the speakers were fantastic. Some general themes of the conference include: single-cell genomics, synthetic biology, fungal metagenomics, and metabolomics. A personal take-home message for me was the need for creative biological solutions to common issues that the human race currently faces or will face in the near future.

Mark Ackermann (opening keynote) – A Single Cell Perspective on Bacterial Interactions

- Focused on phenotypic heterogeneity, when identical cells have different functional profiles.

- Most genes don’t have clonal variation, but in the ones that do, how is that heterogeneity important for the community?

- Salmonella is an example of phenotypic heterogeneity. One cell type causes inflammation and one uses the inflammation response to reproduce and cause full infection.

- Different cell types survive better in different environmental conditions.

- Another example of phenotypic heterogeneity is in alpine lakes where there are generally large amounts of ammonium that bacteria can use as a nitrogen source. However, there are some cells that fix their own nitrogen in the event that ammonium runs out.

- Preliminary data show that neighboring cells are more likely to be of the same cell type.

Mary Berbee – Pectinases link Early Fungal Evolution to the Land Plant Lineage

- Sequenced early divergent fungal groups.

- The relationship between the early branching groups is still poorly resolved.

- Showed some cool trees where she had overlaid two trees to highlight the differences between the two. I would like to know what software she used to do this.

- Her trees were based on whole genomes but I’m not sure how she built them.

Rytas Vilgalys – Understanding the Forest Microbiome: A Fungal Perspective

- Oak and pine share many fungi while populus has more different fungi.

- Soils from the same region are likely to share the same fungi.

- Populus of different genotypes do not assemble different fungi. At least not nearly as different as fungi from different regions.

- They have isolated ~1,800 fungal isolates. These isolates represent only ~15% of the isolates that are likely populus endophytes.

- Many fungal isolates stimulate plant growth.

- They are re-inoculating these isolates to confirm they are endophytic.

- Mortierella elongata is an isolate that stimulates plant growth in populus and Arabidopsis thaliana.

- M. elongata also harbors bacterial symbionts (Glomeribacter), which are known to affect lipid fermentation and are a sister group to Burkholderia. These bacteria cannot be cultured, possibly because they rely so heavily on the host for nutrients.

- M. elongata migrate to the roots.

- Different genes are expressed in M. elongata grown in culture than those sampled from the rhizosphere.

- Different genes are expressed in M. elongata inoculated on different hosts.

Eddy Rubin

- Bacterial genes are typically ~900bp.

- In a couple of sequenced genomes they saw average bacterial gene lengths as low as 200bp. However, when they adjust the codon table so that one of the stop codons codes for a glycine, the predicted genes have an average length of 900bp! Some bacteria use different codon translations!

- Natalia Ivanova is a gene annotation specialist they consulted for help in this analysis.

- They found evidence of recoding in lots of other bacteria by looking at sequenced isolates.

- Didn’t find evidence of recoding in archaea.

- They show that phages which use different codon profiles can circumvent host cell machinery to match their codon profile!

- CRISPR regions in bacterial cells often contain phage elements that correspond to different codon profiles. This is further evidence that phages with different codon profiles can infect cells with canonical codon profiles.
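To see why recoding a stop codon changes predicted gene lengths so much, here is a toy Python sketch. It is my own illustration, not the analysis from the talk: it scans a single reading frame and reports the lengths of the stretches between stop codons. The choice of TGA as the recoded codon is an assumption (the notes above do not say which stop codon was reassigned), and the sequence is made up.

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}
# Assumption: TGA is the stop codon that is read as glycine.
RECODED_STOPS = STANDARD_STOPS - {"TGA"}

def orf_lengths(seq, stops):
    """Lengths (in bases) of the stretches between in-frame stop codons, frame 0 only."""
    lengths, current = [], 0
    for i in range(0, len(seq) - len(seq) % 3, 3):
        if seq[i:i + 3] in stops:
            if current:
                lengths.append(current)
            current = 0
        else:
            current += 3
    if current:
        lengths.append(current)
    return lengths

# Hypothetical toy sequence: ATG GCA TGA GCT GGA TGA CCA TAA
toy = "ATGGCATGAGCTGGATGACCATAA"

standard = orf_lengths(toy, STANDARD_STOPS)  # TGA breaks the gene into short pieces
recoded = orf_lengths(toy, RECODED_STOPS)    # TGA is read through as a codon

print(standard, recoded)
```

With the standard table the toy gene is chopped into three short fragments, while with the recoded table it is read as one longer open reading frame, which is the same effect as the 200bp versus 900bp observation.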

Nicole Dubilier – Metagenomics and Metaproteomic Analyses of Symbioses between Bacteria and Gutless Marine Worms

- Bacteria can use hydrogen to produce more energy than methane. Nature 2011

- They discovered key genes able to metabolize hydrogen.

- The second half of the talk was about gutless worms living in shallow water. They are completely dependent on bacterial symbionts for feeding and waste excretion.

- There are species specific symbionts.

- Her proteomics data yield more obvious features than comparative genomics. As an example she showed how one isolate contains a protein that does the function of 3 different proteins in the canonical Calvin Cycle. DNA sequencing confirmed this observation, but it would have been a “needle-in-a-haystack” for a comparative genomics project. This work was published in PNAS.

Erin Nuccio – Mapping Soil Carbon from Cradle to Grave: Using Omics and Isotope Analyses to Identify the Microbial Blueprint for Root-enhanced Decomposition of Organic Matter.

- The general question is how do microbes transform and stabilize root carbon in soil.

- Carbon can affect nitrogen rates.

- Plants fix carbon for microbes in the soil.

- Looking at the rhizosphere over time it gradually deviates from bulk soil in carbon levels at time points of 3, 6, 9, and 12 weeks.

- Some preliminary data show that bacteria prefer carbon excreted by the plant as an energy source over nitrogen litter material (i.e. material artificially added to the system).

Michael Fischbach – A Gene-to-Molecule Approach to the Discovery and Characterization of Natural Products

- Discovers natural gene products. By gene products I think he means functional protein units.

- Undiscovered gene products are often coded by clusters of genes.

- Has some type of algorithm to computationally discover these clusters that may produce unknown gene products.

- Lots of his most interesting clusters were found on human associated microbes.

- Discovered several oligosaccharide clusters. These bacteria were very difficult to work with but these clusters and the functions they provide to the human host are of high interest.

- The general observation of this study was that microbes in our gut are making products for which we have no idea what they are or how they function. It’s like taking several prescription drugs for your entire life! We need to figure out what is going on in there.

Kelly Matzen – Genetic Control of Mosquitoes

- In the 50’s DDT was used to control mosquito populations and subsequently mosquito-borne diseases such as dengue. However, DDT is known to be detrimental to the environment in several ways and therefore is being used much less. We are starting to see diseases like dengue make a comeback in places like Florida and of course in places like Central and South America.

- Right now the most effective control is pesticides.

- They are releasing massive numbers of sterile male mosquitoes to control (i.e. reduce) mosquito populations. This technique was used successfully many years ago in the United States to control populations of other insects.

- This technique seems to be working in the small field studies they have been conducting.

- There is some pushback from legislators, but in general it seems like a good solution.

Cameron Coates – Characterization of Cyanobacterial Hydrocarbon composition and Distribution of Biosynthetic Pathways

- Cyanobacteria produce over 30% of the earth’s oxygen.

- They are very diverse and live in all sorts of habitats on earth.

- They can produce hydrocarbons which are relevant for use as biofuels. However, they don’t produce large amounts of hydrocarbons.

- They looked at the evolution of cyanobacteria hydrocarbon pathways. There are two main pathways. Several clades have both pathways suggesting a large amount of horizontal gene transfer.

- This work was published in PLOS ONE.

June Medford – Making Better Plants: Synthetic Approaches in Plant Engineering

- They created a biological input/output system. This allows for some external factor to cause a reaction that can be observed in the plant.

- They use a periplasmic binding protein as the input signal because it can quickly diffuse through the cell wall and is then translocated to the nucleus to transcriptionally regulate some response.

- They can theoretically use this system as a flag for pollutants or other dangers that we currently use very expensive technology to detect.

- They are currently developing a system to detect TNT where the response signal of the plant is to turn white. This system can detect traces of TNT 10x smaller than a dog can! There are still some kinks to work through, like response time, but it looks like a very promising system. This idea has countless unexplored applications!

Kankshita Swaminathan – Genome Biology of Miscanthus

- Miscanthus is in the same clade as sugar cane, corn, and sorghum. These plants have been amenable to breeding.

- The genomic sequence of sorghum is very close to Miscanthus except that Miscanthus has had a whole genome duplication event.

- In the winter all the nutrients migrate to the rhizome leaving only the stalk above ground. The stalk is the most important element for biofuels and can be harvested without significantly depleting soil nutrients.

Annalee Newitz (closing keynote) - How Humans Will Survive a Mass Extinction

- Humans have a very good chance of surviving a mass extinction because we are very adaptable. However, our focus should be how we can preserve the diversity of the earth as it is now.

- A mass extinction is when greater than 70% of the earth's species are killed.

- Five mass extinctions have occurred in the history of the earth. Perhaps the largest was caused by cyanobacteria because they released large amounts of oxygen into the atmosphere. Close to 90% of species became extinct as a result.

- Climate change is inevitable regardless of whether or not humans are the cause.

- The questions we should be asking are: how can we respond to these changing climates and what can we do to preserve the world as we know it?

- Space travel seems like an important step in human survival.

The Pan/Core/Accessory Genome

2014-02-25T11:55:00.000-08:00

Introduction
The term "pan genome" was coined in 2005 by Tettelin in a paper describing the genomes of eight pathogenic Streptococcus strains. The pan genome is the set of all unique genes from a set of genomes (i.e. the gene union). The core genome is the set of genes found in every genome (i.e. the gene intersection). The accessory genome is the set of genes unique to a particular genome (i.e. strain-specific genes).
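These three definitions map directly onto set operations. Here is a minimal Python sketch; the strain names and gene names are made up for illustration, and in a real analysis each "gene" would be an orthologous cluster ID rather than a gene symbol.

```python
# Hypothetical gene content of three strains (names are placeholders).
genomes = {
    "strainA": {"dnaA", "gyrB", "recA", "toxX"},
    "strainB": {"dnaA", "gyrB", "recA", "nifH"},
    "strainC": {"dnaA", "gyrB", "recA", "nifH", "phzF"},
}

# Pan genome: every gene seen in at least one genome (union).
pan = set.union(*genomes.values())

# Core genome: genes present in every genome (intersection).
core = set.intersection(*genomes.values())

# Accessory genome as defined above: genes found in one genome and no other.
accessory = {
    name: genes - set.union(*(g for other, g in genomes.items() if other != name))
    for name, genes in genomes.items()
}

print(len(pan), sorted(core), accessory)
```

Note that nifH is neither core nor strain specific here because it is shared by two of the three strains.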

Previous Studies
Read et al. (2012) discuss the pan and core genome of phytoplankton. They estimate the size of the E. huxleyi pan genome to be large because there are several thousand genes in the reference that are missing from all of the three well-sequenced isolates. This is definitely more of a core genome paper (the analysis of which is easy when you have a reference). Perhaps a better way of showing the diversity of the pan genome is rarefaction curves on the number of homologous genes or perhaps k-mer content. The lead investigator, Igor Grigoriev, is the fungal genomics lead investigator at the JGI.

Pan and Core Genome Dynamics
In general, as more genomes are added to the analysis set, the core-genome shrinks and the pan-genome grows. Collins (2012) describes these dynamics in a Molecular Biology and Evolution publication using an infinitely many genes (IMG) model. In summary, the Collins IMG model is based on the idea of three gene classes: core, shell, and cloud genes. Core genes are those found in all genomes, shell genes are gained and lost from genomes at a relatively slow rate, and cloud genes are rapidly gained and lost from genomes. Empirical data from a set of Bacillaceae genomes support the Collins IMG model. The Collins IMG model can be used to predict the size of core- and pan-genomes. In the Dangl lab this has become a question of substantial interest for determining which relevant clades require more isolate genomes for more robust functional genomics analyses.
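The shrinking core and growing pan genome are easy to see by simply counting as genomes are added one at a time. The sketch below only tallies observed sizes on made-up gene sets; it is not the Collins IMG model, and the single-letter gene names are placeholders.

```python
def pan_core_curve(gene_sets):
    """Return (pan_sizes, core_sizes) after adding each genome in the given order."""
    pan_sizes, core_sizes = [], []
    pan, core = set(), None
    for genes in gene_sets:
        pan |= genes                                         # union can only grow
        core = set(genes) if core is None else core & genes  # intersection can only shrink
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes

# Hypothetical gene content of four genomes.
gene_sets = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f", "g"},
    {"a", "b", "c", "h"},
]

pan_sizes, core_sizes = pan_core_curve(gene_sets)
print(pan_sizes, core_sizes)
```

The intermediate values depend on the order in which genomes are added, so in practice these curves are usually averaged over many random orderings.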

Core vs Accessory
Given a set of genes from various genomes, what gene set is most interesting? Of course this question will most heavily depend on previous knowledge of the genomes in question. In the Dangl lab we are interested in microbes that inhabit the endosphere (inner plant root). Because the set of microbes living inside these roots exhibits a different profile than surrounding soil, one could hypothesize that a single gene or a small set of genes is responsible for a microbe's ability to inhabit the inner root. Under this hypothesis the core-genome would be of particular interest because it should contain these genes. However, in practice the core genome is primarily composed of common cellular functions ubiquitous to all bacteria.

Because the core-genome is likely to reveal nothing of specific interest, the accessory-genome seems most interesting. The genes in this set allude to what makes a particular genome functionally distinct and interesting. They get at questions like: "What functions does a particular bacterium provide to the community?"

Software for pan-genome analysis
The primary aim in any pan-genome analysis is grouping orthologous genes from different genomes. To do this many pan-genome pipelines utilize algorithms and databases such as GO, COG, KEGG, eggNOG, Pfam, etc.

Here are some notes on various pipelines developed for pan-genome analyses.

GET_HOMOLOGUES (my recommendation)

This is my program of choice for pan/core/accessory genome analyses.
Fantastic documentation
Options for bidirectional best hit (BDBH), COGtriangle, and/or orthoMCL algorithms for building clusters of orthologous groups.
Builds clear figures
Several options for powerful downstream analyses.
Parallelization options
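The idea behind the BDBH option can be sketched as reciprocal best hits between two genomes: gene A1 and gene B1 are paired only if B1 is the best hit of A1 and A1 is the best hit of B1. This is a simplified Python illustration of the concept, not the GET_HOMOLOGUES implementation. The gene IDs and bit scores are invented, and in practice the hit lists would come from parsed BLAST output.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, bitscore). Return {query: best-scoring subject}."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)

# Hypothetical hits: genome A genes searched against genome B, and the reverse.
a_vs_b = [("A1", "B1", 500), ("A1", "B2", 200), ("A2", "B2", 450), ("A3", "B2", 300)]
b_vs_a = [("B1", "A1", 495), ("B2", "A2", 440), ("B2", "A3", 310), ("B3", "A3", 100)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

A3 hits B2, but B2 prefers A2, so A3 is left unpaired. Genes like that are candidates for paralogs or strain-specific genes.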

Panseq (Laing, 2010)

Nice web-based interface.
Seems to work (as opposed to some of the following programs)
Output formats are not as user friendly or as concise as those of GET_HOMOLOGUES
Job completion email never comes in. Be sure to save the link somewhere.
More suited for small, quick analyses

PGAT (Brittnacher, 2011)

Nice web-based interface and documentation
Limited to comparisons between only a small number of genomes already stored in their database.

PGAP (Zhao, 2011)

Did not install because it requires the old version of blast (blastall)

PanFunPro (Lukjancenko, 2013)

Still in the development stage. I had a quick look at the source code and there were some things that didn't make sense. I'll give the developers a little more time to work out the kinks.
The installation can take some time because of some large dependencies (e.g. InterProScan). Furthermore, the installation for the PanFunPro Perl scripts could be streamlined using a tool like Module::Build. However, the installation instructions for its dependencies are well written, making installation remarkably easy.

PanOCT (Fouts, 2012)

Primarily an algorithm for determining orthology between a set of genes from 2 or more prokaryote genomes.
Considers conservation of neighboring genomic regions for determining orthology. The basic idea is that two genes that are truly orthologous (as opposed to paralogous) will be situated in the same genomic neighborhood.

A Brief Introduction to Sequence Assembly

2014-02-12T19:23:00.000-08:00

Assembly Background

Sequence assembly is one of the overarching challenges in bioinformatics. To understand the assembly problem it helps to understand some basics of DNA sequencing. Consider a bacterium having a genome comprised of a single 5 megabase (5 million base pairs) chromosome. Ideally, sequencing machines would start at the beginning of the chromosome and read each of the 5 million base pairs until arriving at the end. Unfortunately, the current technology is limited to reading sequences between 30 and ~10,000+ bases. The assembly problem is to take these short segments of DNA called reads and overlap them in such a way to recreate the original 5Mb chromosome.

To illustrate this consider the set of character strings below that come from a quote by Theodore Roosevelt (spaces have been replaced with "_" for clarity). Can you put the pieces together to find out what it says?

You should end up with something that looks like this:

Believe_you_can_and_you're_halfway_there

This is more or less what assembly programs attempt to do with DNA. Some things to notice in the above example:

Repeats can be problematic during assembly. Notice that the word "you" is used twice in this sentence. Looking at the two character strings "Believe_yo" and "you're_ha" you may have incorrectly merged them to form the character string "Believe_you're_ha" which is an incorrect assembly. DNA repeats are common in genomes and can fragment assemblies or cause assembly mistakes.
Longer reads can help with the repeat problem. For example, given only two long character strings "Believe_you_can_and_you're" and "can_and_you're_halfway_there" it is much easier to unambiguously assemble the quotation despite the fact that the word "you" is used twice.
Sequencing errors complicate assembly. For example, if the character string "halfway" was sequenced as "calfway" there would be no way to finish the assembly correctly because "you're_ha" does not overlap with "calfway."
Coverage (i.e. the number of reads in which a given character is represented) helps distinguish sequencing errors. For example, if we sample from the quote (or from the genome in the case of DNA) many times and see the character string "halfway" 100 times and the character string "calfway" only once, we can assume that "calfway" was incorrectly sequenced.
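The puzzle can also be solved by a few lines of Python. The sketch below is a naive greedy assembler: it repeatedly merges the two strings with the longest suffix/prefix overlap. It is only meant to illustrate the idea and the repeat problem. The fragments are my own split of the quote, and real assemblers are far more sophisticated.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of at least min_len remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps; the remaining strings stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads

# Hypothetical fragments of the quote, long enough to span the repeated "you".
fragments = ["halfway_there", "you_can_and_you're", "Believe_you_can", "and_you're_halfway"]
contigs = greedy_assemble(fragments)
print(contigs)

# The repeat problem: two short fragments that only share "yo" get merged incorrectly.
misassembly = greedy_assemble(["Believe_yo", "you're_ha"], min_len=2)
print(misassembly)
```

The first call recovers the full quote as a single contig, while the second produces the incorrect "Believe_you're_ha" described above because the short fragments cannot tell the two copies of "you" apart.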

De Novo vs Mapping Assembly

De novo is a Latin expression meaning "from the beginning" (Wikipedia). De novo sequence assemblies are built with no external information beyond the raw sequencing reads. First pass de novo assemblies are called "draft" assemblies because the genome remains fragmented (i.e. discontinuous) and may contain assembly errors. Extensive resequencing and curation is generally required to "complete" or "finish" a genome assembly.

Alternatively, mapping assembly uses a reference sequence as an anchor to orient sequenced reads. After reads are ordered based on their location in the reference sequence a consensus sequence is generated from all the mapped reads. The consensus sequence can differ from the reference sequence but differences are generally single base differences scattered throughout the genome. Mapping assembly is most useful when the reference sequence and sequenced organism are closely related.
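The consensus step can be illustrated with a simple majority vote at each reference position. This Python sketch assumes the reads are already mapped without gaps and fit inside the reference. The reference, reads, and positions are invented for illustration.

```python
from collections import Counter

def consensus(reference, mapped_reads):
    """mapped_reads: list of (start, sequence) placed on the reference without gaps."""
    columns = [Counter() for _ in reference]
    for start, seq in mapped_reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    # Majority base per column; fall back to the reference base where no read maps.
    return "".join(
        col.most_common(1)[0][0] if col else ref_base
        for col, ref_base in zip(columns, reference)
    )

# Hypothetical example: the sequenced organism has an A where the reference has a T.
reference = "ACGTACGTAC"
mapped_reads = [(0, "ACGAAC"), (2, "GAACGT"), (3, "AACGTA")]

result = consensus(reference, mapped_reads)
print(reference)
print(result)
```

All three reads agree on the A at position 3 (counting from 0), so the consensus differs from the reference by a single base, which is the typical kind of difference described above.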

In some cases (e.g. metagenomics), a combination of de novo and mapping assemblies may be advantageous. However, hybrid assembly algorithms and protocols are not well explored.

De Novo Assembly Algorithm Classes

There are two main classes of assembly algorithms used in de novo assembly: overlap-layout-consensus (OLC) and de Bruijn graph (DBG). Similar to the above example, OLC first finds reads with overlapping ends, builds a layout graph based on these overlaps, and lastly generates a consensus sequence as the graph is traversed. OLC was the first assembly method developed and works well with long-read, low-coverage sequencing technologies like Sanger (and possibly PacBio).

DBG-based assemblers convert the set of reads into a set of k-mers (i.e. short DNA sequences of length k). These k-mers are then used to build a de Bruijn graph from which the genomic sequence is inferred. DBG assemblers work well with high-coverage sequencing methods like Illumina and Ion Torrent.
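Here is a toy Python version of the DBG idea. Reads are broken into k-mers, each k-mer becomes an edge between its prefix and suffix (k-1)-mers, and a sequence is read off by walking the graph. It ignores reverse complements, sequencing errors, and branching, all of which real assemblers must handle. The reads are made up.

```python
from collections import defaultdict

def kmers(read, k):
    """All substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Follow edges while there is exactly one way forward and no node repeats."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:
            break
        seen.add(node)
        path += node[-1]
    return path

# Hypothetical overlapping reads from the sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn(reads, k=4)
contig = walk(graph, "ATG")
print(contig)
```

Notice that the reads are never compared to each other directly. Overlaps are implicit in shared k-mers, which is why this approach scales well to the very large read counts produced by Illumina machines.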

Useful Links

For more information and comparisons between OLC and DBG see this review.
What is finished and why do we care? (Genome Research 2002).
A review of assembly algorithms including: SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo (Genomics 2010).
How to apply de Bruijn graphs to genome assembly (Nature 2011).

New SPAdes Assembler Used on MiSeq Reads for 6 Burkholderia Genomes

2013-11-15T06:24:00.000-08:00

Introduction

This post introduces a practical assembly strategy using the new SPAdes assembler with Illumina MiSeq reads. Utilizing the sequence analysis tools sickle, FLASH, SPAdes, and in-house Perl scripts we assemble 6 Burkholderia genomes to draft status using both paired-end (PE) and mate-pair (MP) reads. This exercise demonstrates the practicality of using the Illumina MiSeq for small scale assembly projects.

The SPAdes Assembler

Many popular de novo assemblers, including SPAdes, rely on a computational data structure called a de Bruijn graph. SPAdes uses a multisized de Bruijn graph to balance trade-offs between small and large k-mer sizes. A smaller k causes more repeat regions to collapse into the same node tangling the graph. However, larger k values can fragment the graph in low coverage regions. Multisized de Bruijn graphs allow the size of k to vary based on regional coverage depth. SPAdes also has a unique method for handling PE reads. Typically PE information is incorporated after the initial assembly by mapping PE reads back to assembled contigs. When corresponding pairs map to the ends of different contigs those contigs can be merged into a single scaffold. SPAdes improves on this by incorporating PE distance information directly into the de Bruijn graph. Because of these features SPAdes can be run with any number of single-end (SE), PE, or MP read files.

Assembling Burkholderia Genomes with SPAdes

Burkholderia is a genus of bacteria with strains having a variety of environmental effects ranging from plant growth promoting activity to human and plant pathogenesis. Because our lab is particularly interested in phenotypes of association between bacteria and plants, we sequenced 6 strains of Burkholderia isolated from the endophyte compartment (i.e. inner root) of A. thaliana using two runs on our Illumina MiSeq machine. The first run was done using a standard PE library, and the second was done using a MP library.

The pipeline used for assembly is outlined and described below.

FLASH

FLASH is a software package that merges overlapping PE reads into a single read. In PE sequencing there is a distribution of DNA fragment lengths from which ends are sequenced (Supp Fig 1). As reads continue to lengthen, more PE reads can potentially overlap. The main advantage to merging the overlapping reads is the error correction step. Furthermore, longer reads tend to yield better assemblies because of their ability to span repetitive regions that fragment assemblies. While longer merged reads may span repetitive regions, it is still possible that spanned repeats are incorrectly assembled. In terms of draft assembly this is a minor point.

After experimenting with various combinations of parameters I decided on the following FLASH parameters:

-m 25 -r 250 -f 500 -s 100
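To make the merging step concrete, here is a simplified Python sketch of what merging a read pair means. The reverse read is reverse complemented, and if its start matches the end of the forward read the two are joined. This is not FLASH's algorithm (FLASH tolerates mismatches and scores candidate overlaps), and it requires an exact match. The min_overlap default mirrors the -m 25 value above, since -m is FLASH's minimum overlap. The fragment is made up.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(fwd, rev, min_overlap=25):
    """Merge a pair if the end of fwd exactly matches the start of revcomp(rev).

    Returns the merged sequence, or None if no overlap of at least min_overlap exists.
    """
    rc = reverse_complement(rev)
    for n in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        if fwd[-n:] == rc[:n]:
            return fwd + rc[n:]
    return None

# Hypothetical 22bp fragment sequenced with 14bp reads from each end (6bp overlap).
fragment = "ACGTTGCAAGGCTTACCGATGA"
fwd = fragment[:14]
rev = reverse_complement(fragment[-14:])

merged = merge_pair(fwd, rev, min_overlap=5)
print(merged)
```

When the fragment is longer than the two reads combined there is nothing to merge and the function returns None. Those pairs stay as ordinary PE reads for the assembler.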

Adapter Trimming
A preprocessing step is required to trim MP reads at the Illumina junction adapter. For details on how to do this and why it is necessary see this post.

Sickle

Graph assemblers can be sensitive to erroneous k-mers produced from sequencing errors which convolute the graph with false nodes and connections. Rigorous quality filtering prior to assembly reduces the number of false k-mers thereby improving assembly. Sickle is a software package for quality filtering which relies on a sliding window algorithm over the read quality scores. If the average quality of a window drops below a given minimum the remaining bases are trimmed off. Any trimmed or untrimmed reads shorter than a given length are removed. Sickle also produces a file containing high quality reads where the corresponding pair failed the quality check. These reads can be used in downstream analysis as single-end reads.
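The sliding window idea can be sketched in a few lines of Python. This is a simplified illustration of the description above (3' trimming only), not Sickle's actual code. The window size and the example quality values are assumptions chosen for the toy example, and the quality string decoding assumes Phred+33 ("sanger") encoding.

```python
def phred33(quality_string):
    """Decode a Phred+33 (Sanger) quality string into integer scores."""
    return [ord(c) - 33 for c in quality_string]

def sliding_window_trim(quals, window=5, min_avg=20, min_len=10):
    """Return how many leading bases to keep, or 0 if the read should be discarded."""
    if not quals:
        return 0
    keep = len(quals)
    for start in range(0, max(len(quals) - window + 1, 1)):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_avg:
            keep = start  # trim everything from the first failing window onward
            break
    return keep if keep >= min_len else 0

# Hypothetical read whose quality decays toward the 3' end.
quals = [38] * 12 + [30, 22, 15, 10, 8, 5, 2, 2]

kept = sliding_window_trim(quals)                   # trimmed but long enough to keep
discarded = sliding_window_trim(quals, min_len=15)  # too short after trimming

print(kept, discarded)
```

The min_avg and min_len arguments play the roles of Sickle's '-q' and '-l' parameters.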

The parameter settings that I used when running Sickle are:

-t sanger -l 127 -q 20

The '-t' parameter specifies the quality value type. The '-l' parameter specifies the minimum read length and '-q' is the minimum average window quality score. At this stage it is important to consider the '-l' parameter if running SPAdes is your next step. The SPAdes parameter '-k' takes a list of k-mer sizes from which to build a multi-sized graph. For long Illumina reads the SPAdes manual advises '-k' be set to '21,33,55,77,99,127'. Because the longest k-mer is 127bp, reads in your dataset should be at least that long. To allow for shorter reads when running Sickle, simply lower the '-l' parameter and ensure the largest '-k' value is shorter than or equal to that value. Because many of the MP reads are already short, I ran Sickle with the -l parameter set to 75 and the max SPAdes -k parameter set to 75.

I should note that SPAdes has a built-in correction algorithm--hammer. Hammer looks at k-mer frequency to identify potential false k-mers. However, I recommend running Sickle first for 2 reasons:

Sickle is a much faster correction algorithm than hammer. Hence, limiting the number of false k-mers that hammer must identify speeds up SPAdes.
Sickle uses quality values. While quality values are not perfect, they do correlate with the correctness of bases. If reads have any low-quality k-mers it is possible by chance that some of these k-mers will be the same causing hammer to classify them as a true k-mer. However, it is likely that one error inside a high quality k-mer will not be trimmed by Sickle, but will be identified by hammer. This intuition says that a combination of Sickle and hammer is the best regime for quality filtering.
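The k-mer frequency intuition is easy to demonstrate. In the Python sketch below, k-mers seen fewer than min_count times are flagged as likely errors. This is only the basic idea, not the hammer algorithm itself, and the reads and threshold are invented for illustration.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def rare_kmers(reads, k, min_count=2):
    """k-mers seen fewer than min_count times; likely sequencing errors at high coverage."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n < min_count}

# Hypothetical reads: five correct copies and one copy with a single error (A -> T).
reads = ["ACGTAC"] * 5 + ["ACGTTC"]

counts = kmer_counts(reads, k=4)
suspect = rare_kmers(reads, k=4)
print(counts)
print(suspect)
```

A single base error creates up to k false k-mers, which is why removing low-quality bases before assembly shrinks the graph so effectively.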

SPAdes

I ran the SPAdes assembler twice for each genome--first with both PE and MP reads and then with only the PE reads. The second run shows that improvements were made when MP reads are included. The following command was used to run SPAdes with both PE and MP reads:

spades.py -k 21,33,55,75 -t 5 --careful -s merged_PE_reads.fastq -1 fwd_PE_sickled_reads.fastq -2 rev_PE_sickled_reads.fastq -s PE_sickled_single_reads.fastq --mp1-1 fwd_MP_sickled_reads.fastq --mp1-2 rev_MP_sickled_reads.fastq --mp1-s singles_MP_sickled_reads.fastq --mp1-rf -o spades_out

Using only PE reads (i.e. the merged reads, fwd high quality reads, reverse high quality reads, and single high quality reads) SPAdes was run using the following command:

spades.py -k 21,33,55,77,99,127 -t 5 --careful -s out.extendedFrags.fastq -1 out.notCombined_1_qc.fastq -2 out.notCombined_2_qc.fastq -s out.notCombined_singles_qc.fastq -o spades_out

QUAST
When the assembly is complete QUAST can be used to compare and view assembly results. For examples see the Results section of this post.

Results

Assembly
All 6 genomes assembled very well when both PE and MP reads were used.

Does merging PE reads help?

To illustrate how merging reads prior to assembly can improve assembly, I assembled one Burkholderia genome using three different input sets:

Merged and unmerged reads
Raw PE reads with no merging
Only reads that merge

Under the assumption that improved assembly statistics (e.g. number of contigs, N50, etc.) indicate a better assembly, a combination of merged and unmerged reads (option 1) yields the best assembly.
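For readers unfamiliar with N50, here is a minimal Python sketch of how the statistic is computed from a list of contig lengths. The lengths in the example are made up for illustration and are not from the Burkholderia assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths: total = 30, so half = 15.
# 10 alone is not enough; 10 + 8 = 18 crosses the halfway point.
print(n50([10, 8, 6, 4, 2]))  # 8
```

A larger N50 from the same total length means the assembly is concentrated in fewer, longer contigs.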

Does adding MP reads help?
When assembling with only PE reads the resulting assemblies are worse than when MP reads are included.

To test whether the assembly improvements were primarily due to increased coverage or to the distance information carried by MP reads, I reassembled the PE+MP reads for B1/BMP1 treating the MP reads as SE reads. Despite the substantial increase in coverage contributed by the MP reads (52 to 103), only minor assembly improvements are observed when they are treated as SE reads. This suggests that the distance information provided by MP reads can have a greater effect on improving assemblies than additional coverage.

Conclusions

Using a combination of FLASH, Sickle, SPAdes, and in-house Perl scripts we assembled 6 Burkholderia genomes to draft status. This exercise demonstrates the practicality of the Illumina MiSeq for small scale sequencing projects consisting of a handful of bacterial genomes. A single PE MiSeq run may be sufficient to assemble a small number of bacterial genomes; however, the additional coverage and pairing information of MP reads can improve PE assemblies. Merging overlapping PE reads and quality trimming before assembly can increase N50 and other assembly statistics.

Supplemental Figures
Figure 1 -- PE insert size distribution for one sample (B1)

Difference Between Paired-End and Mate-Pair Reads

2013-11-02T07:24:00.000-07:00

In DNA sequencing lingo the words "paired-end" (PE) and "mate-pair" (MP) are frequently used interchangeably. While the underlying principles between PE and MP reads have strong similarities, there are inherent differences that are crucial to understand.

The similarities between PE and MP reads include:

Reads come in pairs
Pairs come from the ends of the same DNA strand

The differences between PE and MP reads include:

Library preparation protocols -- In short, PE protocols attach an adapter, SP1, to the fwd end and another adapter, SP2, to the reverse end. The first sequencing step is started by targeting SP1 to generate the forward read. The second sequencing step targets SP2 to generate the reverse read. For MP protocols longer DNA sequences are circularized using biotinylated adapters. During the circularization process the DNA strand ends are connected with the biotinylated adapter between them. The circularized DNA is sheared and the fragments containing the biotinylated adapters that connect the strand ends are pulled down. These fragments can then be sequenced using the same SP1-SP2 adapter protocols used in PE sequencing.
Insert size -- The insert size refers to the distance between the reads in a pair. PE reads generally have a smaller insert size (< 1 kb) than MP reads (2-5 kb). The difference in insert size stems from the difference in protocols. Depending on the length of your reads it is possible for PE reads to have overlapping ends.
Read orientation -- PE reads come in forward-reverse (FR) orientation where read 1 is the forward read and read 2 is the reverse read. Because of the circularization step MP reads come in reverse-forward (RF) orientation where read 1 is the reverse read and read 2 is the forward read. These differences are especially important to understand for assembly algorithms and projects.
Read Trimming -- Theoretically, PE reads require no trimming before sequence analysis. However, in practice it is recommended that low quality portions of the read be trimmed using tools like Sickle. In contrast, MP reads require trimming because biotinylated adapters are often present in the middle of one or both MP reads. Adapter trimming software generally removes adapters and any sequence beyond the adapter. Software options for adapter trimming include cutadapt, Trimmomatic, and FastqMcf. For more reading on adapter trimming see this post and this post.
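Some downstream tools only accept FR-oriented pairs. One common way to present RF mate pairs to such a tool is to reverse-complement both reads (and reverse their quality strings). The following is a minimal Python sketch of that idea; the function names are hypothetical, and tools that accept an orientation flag do not need this step.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def rf_pair_to_fr(read1, qual1, read2, qual2):
    """Flip an outward-facing (RF) pair into an inward-facing (FR) pair.

    Quality strings are reversed so each score stays with its base.
    """
    return (reverse_complement(read1), qual1[::-1],
            reverse_complement(read2), qual2[::-1])


print(reverse_complement("AACG"))  # CGTT
```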

For more details on the sequencing protocols see the Illumina documentation for PE and MP sequencing.

Read simulators review with an emphasis on metagenomics

2013-10-22T09:33:00.000-07:00

Why Read Simulation?

Simulations are an important aspect of bioinformatics that can be used for testing and benchmarking algorithms, optimizing parameters, and generating optimal study design. The following are examples of how read simulations have been successfully utilized in each of these instances.

Testing Algorithms

After writing a new algorithm how do we ensure that it works? The most effective way is to run the algorithm on a set of data where the answers are known. This is a perfect application for simulated data. Katharina Hoff uses DNA sequence simulations to test the effects of sequencing error from several sequencing technologies on specialized metagenomic gene prediction algorithms. She concludes that gene prediction accuracy is poor in the presence of sequencing errors using these algorithms. Furthermore, gene prediction algorithms not specialized for metagenomic data perform as well as or better than their specialized counterparts. This suggests that metagenomic gene predictors could be improved by being more robust to sequence error.

Benchmarking Algorithms

Benchmarking is used to compare accuracy, precision, technical requirements, and other attributes of algorithms. Simulated reads provide a common benchmark for comparing assembly algorithms in the Assemblathon competition [Earl, et al.], and have also been used to benchmark read mapping algorithms such as Bowtie, Soap, and Pass [Horner].

Optimizing Parameters

Researchers often ignore the detailed parameters of algorithms and programs. This can frequently cause problems by violating assumptions made in the software development process. Furthermore, the effects of parameters are largely unknown yet can significantly affect results and conclusions. To better understand the effects of parameters used by the algorithm FLASH, Magoc et al. use simulated reads to build ROC curves illustrating the trade-offs between correctly and incorrectly merged paired-end reads using different values for the mismatch and minimum overlap parameters (Figures 5 and 6).

Optimizing Study Design

Prior to sequencing, it is common to ask questions like, "how much coverage do we need," "what read length should we sequence at," "what sequencing platform is best for our project," "should we use paired-end (PE) or single-end (SE) reads," etc. These and other similar questions can be answered at least in part by sequence simulation. For example, the 1000 Genomes Project Consortium used ART to test the effects of read length and PE insert size on a read's ability to map to the human genome. They conclude that longer reads substantially increase mappability, especially for SE reads. Furthermore, increasing insert size also marginally improves mappability.

Furthermore, Mende et al. use sequence simulations to test several aspects of study design for whole metagenome sequencing. They conclude that quality control measures such as quality filtering and quality trimming have a substantial impact on assembly by improving accuracy and extending contig lengths. They also evaluate assembly quality on Sanger, pyrosequencing, and Illumina platforms with communities of low (10 genomes), medium (100 genomes), and high (400 genomes) complexity. For the low complexity community all platforms were more or less equal in terms of assembly and accurately represented the functional aspects of the community. For the medium complexity community Illumina produced the best assembly and most accurately represented the functional elements. With the high complexity community none of the platforms were particularly good, however, because of the longer reads, Sanger was still able to represent much of the functional composition.

Things to look for in a simulator?

Good/appropriate error model
Models read lengths
Models coverage bias
Includes quality values
Single-end and Paired-end reads
Multiple sequencing platform capabilities (e.g. Illumina, 454, etc.)
Easy to install
Easy to use
Good documentation

What are the simulators out there for...

Genomic DNA sequences

wgsim relies on a simple uniform substitution error model. The uniform error model does not reflect error patterns as accurately as the models and algorithms described below. Another major weakness of wgsim is its inability to generate INDEL errors. Wgsim seems like the most basic read simulator.
simNGS models the processes in an Illumina sequencing run such as cluster generation and cluster intensity. The model must be input as the "runfile" where each line contains parameters for each cycle. The number of cycles contained in the "runfile" corresponds to the number of bases simNGS can model.
MAQ is primarily an assembly algorithm (hence the large number of citations). It also includes a uniform reference sequence mutation algorithm. To simulate reads it uses a first-order Markov chain to model quality at each cycle. Using that model it generates quality values and then generates bases according to those quality values. Documentation is lacking, so I'm not entirely sure all the values in the table are correct.
GemSIM takes as input a SAM file and FASTQ file for empirical error model generation. This is advantageous because it can easily be extended to new sequencing technologies or upgrades as they are released. It can also simulate metagenome reads based on given abundance proportions.
ART uses quality score data from real Illumina reads to model substitution rates. Quality scores are generally very informative for Illumina reads, but may not be perfect. They map real Illumina reads to a reference to model INDEL rates. I would prefer they do that with substitutions as well. Currently the longest Illumina read length model can generate 75bp reads. This is becoming less useful as read lengths continue to grow. I am less familiar with 454 and Sanger reads so I won't comment on those models.
Mason was built using models empirically derived from fly and yeast reads. According to the technical report Mason can model 100bp Illumina reads from the GAII; however, it appears that it may be possible to build a more up-to-date model. Mason uses previously published models for both 454 and Sanger reads. Mason was developed in C++ and is easily extendable, making it useful for incorporating into your personal code.
FlowSim is specifically for 454 pyrosequencing reads. FlowSim builds an empirical model derived using quality trimmed E. coli K-12 and D. labrax reads. Because the 454 platform interprets reads first as flowgrams, FlowSim models these flowgrams, specifically the homopolymer distributions. During the simulation process, flowgrams are generated and are subsequently interpreted into base calls and corresponding quality scores. The number of cycles indicates roughly the number of bases in each read.
SimSeq has limited documentation! One of the useful attributes of SimSeq is its model for mate-pair chimeras.
pIRS builds an error model based on SOAP2 mapping results or a given SAM file. It uses a combination of two matrices to generate bases and quality values. These matrices are based on empirical training data and a first-order Markov chain (similar to MAQ) of quality scores. This seems like a very good model for both bases and quality values. One of the defining features of pIRS is its simulation of coverage bias across a genome based on %GC content. It also includes an option for mutating the given genome to fabricate reference sequence heterogeneity.
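To make the "first-order Markov chain of quality scores" idea used by MAQ and pIRS more concrete, here is a simplified Python sketch. It is not either tool's implementation: the function names are hypothetical, and the transition counts here ignore the cycle (position in the read), whereas the real models are more detailed.

```python
import random
from collections import defaultdict


def train_transitions(quality_lists):
    """Count how often quality `nxt` follows quality `prev` in training reads."""
    counts = defaultdict(lambda: defaultdict(int))
    for quals in quality_lists:
        for prev, nxt in zip(quals, quals[1:]):
            counts[prev][nxt] += 1
    return counts


def simulate_qualities(counts, first_quality, length, rng):
    """Generate a quality list where each value depends only on the previous one."""
    quals = [first_quality]
    while len(quals) < length:
        followers = counts.get(quals[-1])
        if not followers:
            # State never seen with a successor in training: repeat it.
            quals.append(quals[-1])
            continue
        values = list(followers)
        weights = [followers[v] for v in values]
        quals.append(rng.choices(values, weights=weights)[0])
    return quals


# Tiny made-up training set of Phred scores.
model = train_transitions([[38, 38, 30, 20], [38, 30, 30, 20]])
print(simulate_qualities(model, 38, 6, random.Random(0)))
```

Bases can then be drawn conditional on each simulated quality value, which is the second half of the approach described above.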

Metagenomic Sequences

NeSSM can simulate coverage bias but relies on the mapping distributions from real metagenome sequences to the reference database. This is a less useful approach because it cannot be applied to model coverage bias of previously unsequenced metagenomes. A very positive feature of NeSSM is its GPU capabilities. GPU processing is much faster, which facilitates simulating large scale sequencing runs from platforms like the Illumina HiSeq. The download link in the paper was not working, so I was unable to download and test this software.
Grinder -- Can also simulate PCR dependent amplicon sequences for any amplicon based on given primers or a database of reference amplicons. For amplicon simulations it additionally produces chimeric reads and gene copy number biases. Grinder allows users to input abundance profiles and to model alpha and beta diversity. Quality scores in Grinder are assigned a single "high" quality value for all correct bases and a single "low" quality value for bases designated as errors. Grinder includes a graphical user interface (GUI) and command line interface (CLI). Grinder can also be run on the web through the Galaxy interface.
MetaSim -- Stores user defined genomes in a database. MetaSim can simulate reads from any of the genomes in the database using various empirically defined models or a user defined custom model. Abundance measures for each genome allow users to simulate communities of variable abundance. To simulate heterogeneity in the community, MetaSim can mutate the selected reference genomes based on a phylogenetic tree input describing the degree of mutation. MetaSim also includes a GUI, however more advanced options can be accessed via the CLI. One of the short-comings of MetaSim is the lack of simulated quality values. As an increasing number of algorithms rely on these values this highlights an essential feature that is missing.
Bear is still under development.
GemSIM see above.
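The core idea shared by the metagenome simulators above, drawing reads from several genomes in proportion to given abundances, can be sketched in a few lines of Python. This is an illustration only, with hypothetical names and toy sequences; it produces error-free reads and ignores genome length, quality values, and coverage bias, which the real tools handle in their own ways.

```python
import random


def simulate_metagenome_reads(genomes, abundances, n_reads, read_length, rng):
    """Sample error-free reads from genomes in proportion to their abundances.

    genomes: dict of name -> sequence (each at least read_length long)
    abundances: dict of name -> relative abundance (non-negative weights)
    """
    names = list(genomes)
    weights = [abundances[name] for name in names]
    reads = []
    for _ in range(n_reads):
        # Pick the source genome, then a uniformly random start position.
        name = rng.choices(names, weights=weights)[0]
        seq = genomes[name]
        start = rng.randrange(len(seq) - read_length + 1)
        reads.append((name, seq[start:start + read_length]))
    return reads


# Toy community: genome_a is nine times as abundant as genome_b.
toy_genomes = {"genome_a": "ACGTACGTACGTACGT", "genome_b": "TTGGCCAATTGGCCAA"}
toy_abundances = {"genome_a": 0.9, "genome_b": 0.1}
for name, read in simulate_metagenome_reads(toy_genomes, toy_abundances, 5, 6, random.Random(42)):
    print(name, read)
```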

As a disclaimer I should say that I have only tried a few of these simulators. There may be special features that I have missed.

Discussion

Simulations are an important and underutilized method for testing algorithms and planning experiments. It can be tempting to haphazardly start sequencing or push your data through a collaborator's favorite algorithm using the default parameters. These temptations can lead to failed experiments, unusual results, or a misunderstanding of data. A more systematic approach would be to build informative simulations to anticipate and address problematic algorithm assumptions and experimental design flaws. This approach requires a large up-front cost; however, it can save time and money in the long run.

As a cautionary note, simulations are not biology. Sequence simulations are an imperfect image of a very complicated world. Conclusions drawn from simulation experiments must be taken with a grain of salt. In some cases it is vital to design biological experiments to validate such conclusions. The utility of sequence simulations is to perform a cheap, first-pass test of algorithms and experimental designs.

While sequence simulators are not expected to perfectly represent real sequences, the closer they can get to that point the better. Most simulators carefully model quality of bases, but there are only a couple that model biases imposed by sequencing platforms and sample preparation protocols such as GC content bias. This is one area where sequence simulators have room to improve. However, building such models is challenging due to the large number of variables that contribute to such biases. For example, there are hundreds of available sample prep protocols for hundreds of biological techniques. Building a universal model to account for biases from each of these protocols is an enormous task, especially in a world where protocols can have a life expectancy of only a few months. Furthermore, unique biases can be introduced by the DNA sample itself. Sequencing human DNA will produce different biases with different levels of effect than sequencing a single bacterial strain.

In conclusion, sequence simulators can be extremely useful for testing and benchmarking algorithms, optimizing parameters, and designing experiments. There are a number of simulators available, each with its own strengths and weaknesses. Careful consideration should be used when picking a simulator and interpreting simulation experiments.

Merging Overlapping Paired-End Reads

2013-10-09T13:02:00.001-07:00

Paired-End Sequencing Overview

Amplicon sequencing is a technique in which PCR primers are used to amplify a small portion of a genome. For example, if a gene of interest is preceded by a unique DNA sequence, S1, and followed by another unique DNA sequence, S2, PCR amplification using S1 and S2 as primers will exponentially copy the DNA segment flanked by S1 and S2 (see this video). The resulting DNA can be sequenced using a paired-end (PE) sequencing protocol where DNA molecules in a sample are sequenced from both ends. In some cases amplicons (i.e. the portion of DNA flanked by the two primers) are short enough such that the forward read overlaps the reverse read. It can be advantageous to merge these reads into a single sequence for the following reason:

Sequenced reads generally have a higher density of errors near the end of the read. The extra information gained by overlapping the tail end of the forward read with the tail end of the reverse read allows for correction of most of these errors.

Merging PE Reads with FLASH

While there are several bioinformatics tools and algorithms that would be appropriate for merging overlapping PE reads, I prefer to use the assembly tool FLASH. FLASH is freely available under the GPLv3 license and can be downloaded via SourceForge.

Reasons to use FLASH include:

Easy to download and install
Fast!
Uses quality scores to correct possible sequencing errors in the overlapping region. This is by far the most important reason to use FLASH. Correcting potential sequencing errors can lead to much less noise and more definitive results.

Reasons you might want to find another tool:

FLASH is not able to look for gapped alignments between read pairs. Ungapped alignments are appropriate when using Illumina reads, however they can be problematic for other sequencing platforms such as 454.

FLASH is easy to use if you are familiar with the Unix command line interface. First, navigate to where you installed flash and type the command:

./flash -h

This will ensure that FLASH has been installed correctly and prints the FLASH help page to your terminal screen. To run FLASH you will need two FASTQ formatted files representing the forward and reverse reads to merge. These files should have the same number of reads and should be ordered such that the first read in the forward file is paired with the first read in the reverse file.
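Before running FLASH it can be worth confirming that the two files really are in the same order. The following is a minimal Python sketch, not part of FLASH; the function names are hypothetical, and it assumes plain 4-line FASTQ records whose read names differ only by a trailing "/1" or "/2" or by text after the first space.

```python
import io
from itertools import zip_longest


def read_ids(fastq_lines):
    """Yield the read name from each 4-line FASTQ record, without pair suffix."""
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:
            continue  # only the first line of each record holds the name
        fields = line.split()
        if not fields:
            continue
        name = fields[0][1:]  # drop the leading '@'
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def first_mismatch(fwd_lines, rev_lines):
    """Return the 1-based index of the first out-of-sync pair, or None."""
    pairs = zip_longest(read_ids(fwd_lines), read_ids(rev_lines))
    for n, (fwd_id, rev_id) in enumerate(pairs, start=1):
        if fwd_id != rev_id:
            return n
    return None


# In-memory stand-ins for the forward and reverse FASTQ files.
fwd = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n")
rev = io.StringIO("@r1/2\nTTTT\n+\nIIII\n@r2/2\nGGGG\n+\nIIII\n")
print(first_mismatch(fwd, rev))  # None
```

With real data, pass two open file handles instead of the `io.StringIO` objects. A file with extra or missing reads is reported at the first record where the names stop agreeing.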

Similar to other bioinformatics applications, FLASH has several parameters that can be adjusted to maximize the algorithm's effectiveness. The FLASH help page (./flash -h) gives a full explanation of each parameter. To run FLASH using all default parameters type:

./flash <forward reads file path> <reverse reads file path>

The following is a list of useful parameters that I often use:

-m, --min-overlap=NUM: The minimum required overlap between the fwd and rev read. Beware: if this number is too small, spurious overlaps will be formed. I recommend an overlap of at least 10bp. See the FLASH paper for more details.

-M, --max-overlap: The maximum overlap length expected between the fwd and rev reads. This parameter can also be calculated using the -r, -f, and -s parameters.

-x, --max-mismatch-density=NUM: The maximum allowed ratio of the number of mismatched bases to the overlap length. This parameter controls how stringently pairs are judged to be mergeable. A high ratio will merge more pairs, but some of these pairs may be false merges. Alternatively, a low ratio will merge fewer pairs but excludes more false merges.

-r, --read-len=LEN: The average length of the raw reads. This number is used in calculations for the -M parameter (--max-overlap).

-f, --fragment-len=LEN: The expected amplicon length.

-s, --fragment-len-stddev=LEN: The expected standard deviation of amplicon lengths.

-d, --output-directory=DIR: The output directory.

-o, --output-prefix=PREFIX: The output prefix. For example if "-o my_output" is used all output files will have the prefix "my_output." This parameter can be useful for organizing different FLASH runs.

Refer to the FLASH help page for explanations about other parameters.
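To show how the minimum overlap and mismatch density parameters interact, here is a simplified Python sketch of an ungapped overlap search. It is not FLASH's actual implementation: the function names are hypothetical, quality scores and N bases are ignored, and the reverse read is assumed to be reverse-complemented already.

```python
def mismatch_density(fwd, rev_rc, overlap):
    """Mismatch ratio when the last `overlap` bases of fwd are aligned
    (ungapped) with the first `overlap` bases of rev_rc. Requires overlap >= 1."""
    fwd_tail = fwd[-overlap:]
    rev_head = rev_rc[:overlap]
    mismatches = sum(1 for a, b in zip(fwd_tail, rev_head) if a != b)
    return mismatches / overlap


def best_overlap(fwd, rev_rc, min_overlap, max_mismatch_density):
    """Return (overlap, density) for the lowest-density acceptable overlap,
    or None if no overlap of at least min_overlap passes the threshold."""
    best = None
    for overlap in range(min_overlap, min(len(fwd), len(rev_rc)) + 1):
        density = mismatch_density(fwd, rev_rc, overlap)
        if density <= max_mismatch_density and (best is None or density < best[1]):
            best = (overlap, density)
    return best


# Toy reads that truly overlap by 8 bases (CCCCGGGG).
print(best_overlap("AAAACCCCGGGG", "CCCCGGGGTTTT", 4, 0.25))  # (8, 0.0)
```

Lowering `min_overlap` widens the search to short overlaps that can match by chance, which is why very small values tend to create spurious merges.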

Quality of Merged PE Reads

During the merging process of overlapping PE reads, discrepancies between bases in the overlapping region can be corrected by choosing the base with the highest quality score to represent the consensus base. To illustrate quality improvements from merging PE reads I use a subset of recently published data where we cloned a known 16S gene into a plasmid and sequenced it using our typical 16S metagenome profiling protocol for 2x250 bp reads on the Illumina MiSeq. The primed V4 amplicon is roughly 253bp. After removing extraneous bases added by our protocol (e.g. molecule tags, primers), the PE reads are expected to have an overlapping region of ~183bp. Because the overlapping region is large the lower quality tails of both the forward and reverse read overlap in high quality regions in the corresponding read.
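The consensus rule described above, keeping the base with the higher quality score at each overlapping position, can be sketched in Python as follows. This is an illustration rather than FLASH's code: the function name is hypothetical, the overlap length is assumed to be known, the reverse read is assumed to be reverse-complemented already, and the merged quality is simply the higher of the two scores.

```python
def merge_overlap(fwd_seq, fwd_qual, rev_seq, rev_qual, overlap):
    """Merge two reads whose last/first `overlap` bases cover the same positions.

    Sequences are strings, qualities are lists of ints, and
    1 <= overlap <= min(len(fwd_seq), len(rev_seq)).
    """
    split = len(fwd_seq) - overlap
    merged_seq = list(fwd_seq[:split])      # forward-only prefix
    merged_qual = list(fwd_qual[:split])
    for i in range(overlap):
        f_base, f_q = fwd_seq[split + i], fwd_qual[split + i]
        r_base, r_q = rev_seq[i], rev_qual[i]
        # Keep whichever base call has the higher quality score.
        if f_q >= r_q:
            merged_seq.append(f_base)
            merged_qual.append(f_q)
        else:
            merged_seq.append(r_base)
            merged_qual.append(r_q)
    merged_seq.extend(rev_seq[overlap:])    # reverse-only suffix
    merged_qual.extend(rev_qual[overlap:])
    return "".join(merged_seq), merged_qual


# The low-quality 'A' at the end of the forward read is replaced by the
# high-quality 'T' from the reverse read.
seq, qual = merge_overlap("ACGTA", [40, 40, 40, 10, 10], "TTCC", [40, 40, 30, 30], 2)
print(seq, qual)  # ACGTTCC [40, 40, 40, 40, 40, 30, 30]
```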

The distribution of reads based on the errors per base (EPB) in each read is as follows.

The green line (merged reads) has shifted toward the y-axis indicating that there are more reads with lower EPB compared to forward reads and especially reverse reads. Notice that the y-axis is in a log scale, so while the differences between groups seem small they are actually quite substantial. The mean EPB for all merged reads is 0.001903164 while the means for the forward and reverse reads are 0.003959693 and 0.00590493, respectively.

One caveat to this investigation is the quality filtering done prior to analysis. Naturally, low quality reads are less likely to merge causing merged reads to have artificially higher quality. To compensate for this effect we rigorously filtered raw reads before calculating EPB. Consequently, the divergence between merged reads and unfiltered fwd/rev raw reads will be more pronounced than is reflected in the figure strengthening the conclusion that merged reads have higher quality than fwd/rev raw reads.

Yet another caveat to consider when designing an overlapping PE sequencing experiment is the expected overlap length. In this study we use a small amplicon, which allows the error prone tails to overlap in regions of higher quality in the corresponding read. With some amplicons this will not be possible; however, even when low quality regions overlap each other, corrections can still be made but at a lower confidence. Also, the V4 amplicon has very little variability in terms of length. Again, this is not the case with all amplicons and can affect merging success. For example, the ITS region is another amplicon used in metagenome profiling, but its length variability is substantially higher than the 16S V4 amplicon. Therefore, reads that fail to overlap in an ITS experiment cannot be assumed to be low quality sequence and should be incorporated in downstream analysis as non-overlapping PE reads.

In conclusion, overlapping PE reads can easily be merged using tools like FLASH. Furthermore, merging these reads yields higher quality merged sequences because sequencing errors in the overlapped portion can be corrected based on read quality scores. Higher quality sequences reduce noise introduced by sequencing error and consequently lead to more robust downstream analysis and conclusions.

Installing Perl Modules

2013-09-20T09:56:00.000-07:00

One of the great advantages of Perl is the substantial amount of code developed by the Perl community. However, installing and using external code can be a minor annoyance for experienced Perl programmers and a major headache for Perl beginners. For personal future reference and to help ease the pain of this learning curve, I am writing this post to outline the main installation strategies for Perl modules and give some tips for getting things working smoothly.

There are two main approaches to installing Perl modules--CPAN and manual installation via ExtUtils::MakeMaker or Module::Build. CPAN is a fantastic resource. My preference for installing Perl modules is to use the cpanm script (which automates the CPAN download, build, and install process) coupled with local::lib. I also recently learned that cpanm can take a URL argument pointing to a git or SourceForge developer's release. It will also automatically install any dependencies declared in the build script. The command would look something like this:

cpanm http://sourceforge.net/projects/bioutilsperllib/files/latest/download?source=directory

local::lib is an important part of the installation process. Many instructional websites illustrate the installation process with sudo commands. If you are a system administrator or you know that several users will need the Perl module you are installing then it is best to use the sudo command to install into the public set of Perl modules. However, for personal use it is best to install in a local directory. The Perl local::lib module can take care of this very nicely for you. Note that there are other ways to trick CPAN into installing a Perl module in a local directory, but these are not nearly as clean as using local::lib. local::lib is surprisingly easy to set up by simply following the instructions on the CPAN documentation page here. Once local::lib is set up, cpanm will recognize these settings and automatically install into your local Perl module set.

The second option for installing Perl modules is a manual installation via Module::Build or ExtUtils::MakeMaker. If you are a serious Perl programmer I would highly recommend becoming very familiar with one of these two installation modules. I personally like Module::Build, but I admit that I have had limited experience with ExtUtils::MakeMaker. To simply use these for installing Perl modules only a basic understanding of a handful of commands is required.

First, the module package must be downloaded to the machine where you plan to install. I recommend having a directory where you keep all the modules you download. The directory I have designated for this is $HOME/build. Once you have the module downloaded to your build directory you will have to unpack it with the unzip command (or a comparable command). This will create a directory where you unzipped the module. Navigate into the directory. If you see a file called Build.PL you know the module was built using Module::Build. Use the following commands to build, test and install the module:

perl Build.PL
./Build
./Build test
./Build install

If the module fails to install because required dependencies are missing you can try the command:

./Build installdeps

If you do not see the Build.PL file then you should see a file named Makefile.PL. This indicates that the module was built using ExtUtils::MakeMaker. To install such a module use the following commands:

perl Makefile.PL
make
make test
make install

For details on the wide range of capabilities for both these modules carefully read the documentation pages (Module::Build, ExtUtils::MakeMaker).

For more information on installing Perl modules see the following:

As a side note the following command is useful for determining if a module is already installed and where it is located:

perldoc -l <module name>

BioUtils vs. BioPerl

2013-04-23T18:35:00.002-07:00

Perl is a popular language used in bioinformatics. Its popularity is based primarily on how easy it is to learn and on its plethora of open source libraries. BioPerl is an example of one such library. As a bioinformatician, I appreciate BioPerl because it quickly and easily allows me to perform common, menial tasks such as parsing fasta files and manipulating basic DNA sequence objects. However, BioPerl's speed and memory scale poorly and are becoming limiting factors in large-scale sequencing projects. In order to accommodate many diverse data types, the object hierarchy and object attributes of BioPerl have become somewhat convoluted, thereby increasing memory and runtime requirements.

To address memory and runtime limitations of BioPerl, I developed BioUtils, a collection of simple Perl modules for FASTA/Q formatted DNA sequences. BioUtils can:

Store FASTA/Q sequences (including quality values)
Read/write FASTA/Q files
Perform simple quality control on FASTA/Q sequences
Provide summary info for a set of FASTA/Q sequences
Build a consensus sequence from 2 or more FASTQ sequences

Because of its simplicity, BioUtils requires much less memory and drastically reduces runtime of the above operations compared to BioPerl. The following figure displays the CPU time for reading and writing FASTQ sequences from files with increasing numbers of sequences.

Currently, a full Illumina MiSeq run can produce ~6 million paired-end reads (i.e. ~12 million raw reads). Extrapolating from the figure above, BioPerl's projected runtime for a complete MiSeq dataset is nearly two and a half hours, compared to only 5 minutes using BioUtils.

Because BioUtils is implemented using inside-out objects described in Perl Best Practices (Damian Conway), it is difficult to measure memory usage. However, a peek at the source code will confirm that BioUtils stores less metadata and thus requires less memory than BioPerl.

More information and a BioUtils download can be found on sourceforge: https://sourceforge.net/projects/bioutilsperllib/?source=directory

Becoming Googlable

2013-04-23T18:30:00.000-07:00

This week has been a very "progressive" week in my life with regards to my involvement in social media. I have never subscribed to social media because I have seen how its addictive nature leads to many lost hours with little or no benefit (which I still believe is true for sites like Facebook and Twitter). However, I have begun to see the advantages of being "googlable" and have consequently decided to take the plunge into social media by having a personal website and blog. A growing number of scientific researchers have personal websites to advertise their research and blogs to quickly communicate novel discoveries and ideas. I hope to follow that model by using my personal website to deploy some of my useful bioinformatics tools and by blogging about interesting scientific ideas. Hopefully this experimentation with social media will be productive both to myself and others.