Wednesday, May 13, 2015

Categorizing musicians by genre using artificial neural networks

Background
Genres are an important classification scheme for organizing music and other forms of art.  However, genre classifications are often ambiguous.  For example, undergraduate students asked to classify a given set of songs into 10 genres had only a 72% concordance with recording companies' genre classifications (D. Perrot and R. Gjerdingen, 1999).  Some of the most sophisticated computer algorithms can classify songs into genres with about 70-80% accuracy using song features such as tempo, time signature, and instrumentation.  The accuracy of genre classification also depends on the number of genres an algorithm attempts to delineate.  In the MIREX 2005 audio genre classification contest, the highest accuracy was 87% for 6 genres but dropped to 75% for 10 genres (http://www.music-ir.org/evaluation/mirex-results/).

An alternative to classifying songs into genres is to classify musicians into genres.  Categorizing musicians into genres is also a difficult problem because individual musicians can be members of several different, but related, genres.  Musicians can be described using a set of terms (i.e. genres and sub-genres) representing their genre membership.  For example, Florida Georgia Line can be classified using the terms bro-country, country rock, country pop, and country rap.  This multi-class membership makes it difficult to make general comparisons between genres.  For example, it would be interesting to test whether the word "truck" is mentioned more frequently in country songs (i.e. by country musicians) than in rock songs (i.e. by rock musicians).

The first objective in this series of posts is to categorize musicians into a set of 20 pre-defined genres, using an artificial neural network model applied to the list of terms associated with each musician.

Artificial Neural Network Background
Artificial neural networks (ANNs) are a class of machine learning models frequently used to solve classification problems.  ANNs contain a set of input, hidden, and output nodes connected by weighted edges.  Input nodes correspond to features in the training dataset, and output nodes correspond to classification types.  Input nodes are connected by weighted edges to hidden nodes, and hidden nodes are connected by weighted edges to output nodes.  Hidden nodes and weighted edges are the medium by which the input features are transformed into a classification prediction.  For example, if a given musician, M1, is defined by features f1, f2, and f3, then "activating" input nodes f1, f2, and f3 redirects the flow of weight through the network towards the output nodes.  After all the weight has been redirected, the output node with the highest weight has the highest probability of being the correct classification for musician M1.
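
The "flow of weight" described above can be sketched as a forward pass through a tiny network.  Everything here is invented for illustration (the weights, the layer sizes, the sigmoid activation); it is not taken from the actual model, and the analysis in this post was done in R rather than Python:

```python
import math

def sigmoid(x):
    # squashes a weighted sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, hidden_weights, output_weights):
    # each hidden node sums the active input features through its weights
    hidden = [sigmoid(sum(f * w for f, w in zip(features, weights)))
              for weights in hidden_weights]
    # each output node (one per genre) sums the hidden activations
    return [sigmoid(sum(h * w for h, w in zip(hidden, weights)))
            for weights in output_weights]

# toy network: 3 input features (terms), 2 hidden nodes, 2 output genres
hidden_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
output_weights = [[1.0, -1.0], [-1.0, 1.0]]
scores = forward([1, 0, 1], hidden_weights, output_weights)  # M1 has terms f1 and f3
predicted_genre = scores.index(max(scores))  # highest-weight output node wins
```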

The Training Dataset
The task of building or training an ANN is equivalent to assigning weights to each edge such that the classification accuracy of the model is maximized.  The overly simplistic intuition behind this operation is as follows.  First, edge weights are randomly initialized.  Then the features of a single training entry are allowed to flow through the network and the final output for that entry is noted.  Comparing the final output with the known output yields an error value that measures how correct the final output was.  The error is then backpropagated through the network and weights are redistributed to minimize the error for that feature set.  This process is repeated for each training entry.
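
That initialize/forward/error/update cycle can be sketched in a few lines.  To keep it readable, this toy collapses the network to a single linear output node with no hidden layer, so it only illustrates the error-driven weight updates, not the full backpropagation the actual model uses:

```python
import random

def train_step(weights, features, target, learning_rate=0.1):
    # forward pass: the output is a weighted sum of the input features
    prediction = sum(w * f for w, f in zip(weights, features))
    error = target - prediction          # how wrong the current output is
    # update: nudge each weight in the direction that shrinks the error
    new_weights = [w + learning_rate * error * f
                   for w, f in zip(weights, features)]
    return new_weights, error

random.seed(0)
weights = [random.uniform(-0.5, 0.5) for _ in range(3)]  # random initialization
for _ in range(50):                      # repeatedly revisit the training entry
    weights, error = train_step(weights, [1, 0, 1], 1.0)
# after training, the output for this feature set is close to the target
```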

In this project, an ANN will be used to classify musicians based on a set of terms describing each musician.  Training data was manually gathered and curated.  First, for each genre, G, Google searches for "popular musicians of genre G" generated a list of popular musicians and their matching genre.  Terms describing each musician, M, were gathered from the Echo Nest database.  The "popular musician" lists were supplemented with more musicians obtained from Wikipedia pages for each genre.  The complete training dataset consisted of 20 genres, each containing 80 musicians, with each musician represented by a list of terms.  This training dataset, like nearly all training datasets for song classification, is likely to have incorrect annotations.  Making more accurate training sets is a difficult task but substantially improves classification accuracy (Cory McKay and Ichiro Fujinaga, 2006).

Feature Selection
Estimating model parameters (i.e. weights) can be computationally expensive, and training sets with a large number of features take much longer than those with fewer features.  Here the initial training dataset had over 3,000 features (i.e. terms).  Removing zero and near-zero variance features can reduce the required computational resources without a substantial impact on model accuracy.  Removing correlated features also reduces runtime and can improve model accuracy.  For these data, terms associated with fewer than 1% of musicians were deemed near-zero variance and removed.  Also, for term pairs with a Phi coefficient (a measure of association between binary variables) higher than 0.9, one term of the pair was removed.  These filtering measures reduced the number of features (i.e. terms) to 412.  The final training dataset contained 1,544 musicians linked to 20 genres and described by 412 terms.  Note that 56 musicians are linked to more than one genre.
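
The two filtering steps might be sketched as follows.  The actual analysis was done in R (caret has helpers for exactly this); this is a hypothetical Python re-implementation run on a made-up musician-by-term matrix:

```python
def phi(matrix, j, k):
    # Phi coefficient between two binary term columns j and k
    n = len(matrix)
    n11 = sum(row[j] and row[k] for row in matrix)
    n1 = sum(row[j] for row in matrix)
    m1 = sum(row[k] for row in matrix)
    denom = (n1 * (n - n1) * m1 * (n - m1)) ** 0.5
    return (n * n11 - n1 * m1) / denom if denom else 0.0

def filter_terms(matrix, terms, min_freq=0.01, max_phi=0.9):
    """matrix rows are musicians, columns are 0/1 term indicators."""
    n = len(matrix)
    # drop near-zero-variance terms (associated with < min_freq of musicians)
    cols = [j for j in range(len(terms))
            if sum(row[j] for row in matrix) / n >= min_freq]
    kept = []
    for j in cols:
        # drop a term if it is too similar to one already kept
        if all(abs(phi(matrix, j, k)) <= max_phi for k in kept):
            kept.append(j)
    return [terms[j] for j in kept]

terms = ['rock', 'classic_rock', 'jazz']      # made-up toy data
matrix = [[1, 1, 0],
          [1, 1, 1],
          [0, 0, 1],
          [0, 0, 0]]
kept_terms = filter_terms(matrix, terms)      # 'classic_rock' duplicates 'rock'
```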

Building the Model
The ANN model was built using the caret package and principles outlined in Applied Predictive Modeling by Max Kuhn and Kjell Johnson.  ANN models have two main parameters:  decay and size.  After each weight update, the decay parameter penalizes large weights to prevent them from growing too large.  Excessively large weights make a model overly specific to the given training dataset (i.e. not generalizable to other input data).  The second parameter, size, is the number of hidden nodes in the model.  Arbitrarily choosing these parameters may lead to suboptimal models.  A method called model tuning builds several models over a range of parameter values and selects the most accurate as the final model.  Accuracy for each tuning model built in this project is shown in Figure 1.  The final model used parameters decay=0.5 and size=25.  Further adjustments to these parameters are unlikely to significantly improve model accuracy for the given training data.
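
Model tuning is essentially a grid search.  In this sketch, cv_accuracy is a hypothetical stand-in with an invented accuracy surface that happens to peak at the parameters reported above; in the actual project, each grid point meant fitting a full ANN via caret's train() and cross-validating it:

```python
def cv_accuracy(decay, size):
    # invented stand-in for "fit a model with these parameters and return
    # its cross-validated accuracy"; peaks at decay=0.5, size=25
    return 0.6 - abs(decay - 0.5) * 0.1 - abs(size - 25) * 0.002

# tune over a small grid of (decay, size) pairs and keep the most accurate
grid = [(decay, size) for decay in (0.1, 0.5, 1.0) for size in (5, 15, 25)]
best_decay, best_size = max(grid, key=lambda params: cv_accuracy(*params))
```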

 Figure 1:  Model accuracy for a range of decay and size parameter values.  The most accurate model was built using decay=0.5 and size=25.  Bars indicate standard deviation.    


Accuracy of statistical models can be measured using a technique called cross validation.  In cross validation the training data are split into two groups:  a training group and a test group.  The model is built using the training group and subsequently evaluated using the test group.  The process is repeated multiple times using different subsets of training and test groups to ensure that a chance good (or bad) split does not bias the accuracy estimate.  Here the final model correctly classified about 60% of the test cases (Figure 1).
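
The repeated splitting can be sketched like this.  The majority-class "model" used in the demonstration at the bottom is a deliberately trivial placeholder for the actual ANN:

```python
import random

def repeated_holdout(data, labels, fit, score, n_repeats=5, test_frac=0.2, seed=1):
    """Average model accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)                     # a fresh random split each repeat
        cut = int(len(indices) * test_frac)
        test, train = indices[:cut], indices[cut:]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        accuracies.append(score(model,
                                [data[i] for i in test],
                                [labels[i] for i in test]))
    return sum(accuracies) / len(accuracies)

# toy demonstration: "fit" just memorizes the most common training label
def majority_fit(X, y):
    return max(set(y), key=y.count)

def exact_score(model, X, y):
    return sum(label == model for label in y) / len(y)

accuracy = repeated_holdout(list(range(10)), ['a'] * 8 + ['b'] * 2,
                            majority_fit, exact_score)
```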

Model accuracy can also be evaluated using a confusion matrix.  A confusion matrix shows observed versus expected classifications for each category.  Figure 2 shows the percentages of data observed in each category from the cross validation test cases.  The vast majority of observed genre classifications matched the expected genre.
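
A confusion matrix of row percentages like the one in Figure 2 might be computed as below; the genre labels and classifications are made up for the example:

```python
from collections import Counter

def confusion_percentages(expected, observed, genres):
    # count (expected, observed) pairs, then convert each row to percentages
    counts = Counter(zip(expected, observed))
    table = {}
    for g in genres:
        total = sum(counts[(g, o)] for o in genres) or 1
        table[g] = {o: 100.0 * counts[(g, o)] / total for o in genres}
    return table

expected = ['rock', 'rock', 'jazz', 'jazz']   # toy cross-validation results
observed = ['rock', 'jazz', 'jazz', 'jazz']
table = confusion_percentages(expected, observed, ['rock', 'jazz'])
```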


Figure 2:  Confusion matrix built using all cross validation test cases.  Values of each cell indicate the percentage of total musicians observed in each category for a given expected genre.  If the classification model were perfect there would be a blue diagonal and all other boxes would be red.


Genres whose musicians are frequently misclassified into a single alternative genre are likely similar to that alternative genre.  Hierarchical clustering of the confusion matrix percentages (Figure 3) shows that some genres are similar to others (e.g. classic_rock and soft_rock; disco and funk).  These similarities match logical expectations and correspond to how accurately the model can distinguish between genres.  For example, the model is more likely to incorrectly classify a "classic rock" musician as "soft rock" than as "reggae."
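
The dendrogram in Figure 3 could be sketched as naive single-linkage agglomerative clustering over pairwise genre distances.  The distances below are invented (low distance = many shared misclassifications), chosen so that classic_rock and soft_rock merge first; the actual figure was built in R from the real confusion matrix:

```python
def cluster(dist):
    """Naive single-linkage agglomerative clustering.
    dist maps (genre_a, genre_b) pairs to a distance; returns merges in order."""
    genres = sorted({g for pair in dist for g in pair})
    clusters = [(g,) for g in genres]

    def d(c1, c2):
        # single linkage: distance between the closest pair of members
        return min(dist.get((a, b), dist.get((b, a)))
                   for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

dist = {('classic_rock', 'soft_rock'): 1.0,   # invented distances
        ('classic_rock', 'reggae'): 9.0,
        ('soft_rock', 'reggae'): 8.0}
merges = cluster(dist)
```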

Figure 3:  Dendrogram clustering of confusion matrix values.  This shows the similarity between genres based on the number of misclassification instances shared between genres.  As expected, genres known to be similar cluster together (i.e. classic_rock and soft_rock).  


Conclusion
Artificial neural networks are an appropriate model for classification tasks.  Here an ANN was built to classify musicians based on a list of terms associated with each musician.  Considering the accuracy of other models used to classify songs, this model is sufficiently accurate for classifying musicians into genres.  Future analyses will use classifications from this model to make general comparisons between genres.



Additional Notes
  • The code (which needs to be cleaned and organized) can be found here.
  • Disclaimer: I am not a statistician.  This is a learning exercise for me.  I'm sure there are plenty of mistakes and things I overlooked.  If anyone has suggestions on how to improve the correctness of this analysis I am interested in hearing them.  I want to learn.
  • For a fantastic primer on using artificial neural networks for regression and classification see the posts by Brian Dolhansky.  
  • These chapters by  Michael Nielsen describing ANNs are also a great resource.
  • Liang et al. use music features and lyrics to classify genres.  However, many popular genres are missing (e.g. country) and there may be mistakes.
  • Other interesting references:

Wednesday, April 29, 2015

2015 JGI Users Meeting Notes

Interested in watching any of these talks?  See this webpage!


Jack Gilbert -- Genome-Enabled Flux Balance Metabolic Networks Form Periodically Flooded Soils
  • Presented an array of studies across many different microbial communities
  • environmental microbes have roughly a 35-year lag in responding to major climate changes
  • Cyanobacteria can metabolize nitrogen into ammonium, which has been shown to be useful in sphagnum moss
  • bacteria can protect against allergies.  Adding Clostridium to mice alleviates allergic symptoms
  • added Clostridium to a young man who had really bad allergies.  Comparisons to other family members without allergies are in progress
  • also doing this experiment on dolphins because it is easier to control their environment
  • microbes can contribute to you being fat or skinny
  • circadian rhythms in the gut genes and microbes looks to be very important
  • microbes in roots also have circadian cycles 
  • your house takes on your microbiome
  • dogs increase the similarity of couple's microbiomes

Francis Martin -- Harnessing Genomics for Understanding Tree-Microbe Interactions in Forest Ecosystems
  • we know very little about how fungi affect the carbon cycle
  • 4 major groups of fungi in most forest ecosystems:  white rotters, brown rotters, litter/soil decayers, and ectomycorrhizal
  • he studies mycorrhizal fungi and their symbiotic toolkit
  • mycorrhizal symbiosis has evolved independently several times!
  • ectomycorrhizal fungi (EMF) have a reduced complement of plant cell wall degrading genes compared to their ancestors
  • small proteins from EMF are secreted and land on plant cells
  • Look at the Plett paper for the JAZ interaction (PNAS).  MiSSP7 binds to JAZ and prevents the immune response in the plant
  • each symbiosis event has developed a unique set of effectors

Joan Bennett -- Do Fungi have a 'Volatome'?

  • aflatoxins can cause cancer in small doses
  • mycotoxins may cause sick building syndrome
  • volatile organic compounds (VOCs) are things that we can smell and are frequently made by fungi
  • using arabidopsis and flies as models for testing the effects of these compounds
  • flies exposed to c-8 VOCs acted similarly to model flies for Parkinson's disease
  • VOCs in the fungal microbiome of humans are likely responsible for attracting mosquitoes


Susanna Theroux -- Marsh Madness: Microbial Communities Driving Greenhouse Gas Cycling in Coastal Wetlands

  • wants to link wetland microbes with carbon emissions
  • methanogens break down carbon into methane
  • wetlands produce a lot of methane
  • microbes might be useful for minimizing methane production relative to carbon storage
  • more methanogens yield more methane


Antonis Rokas -- Evolution of Fungal Chemodiversity

  • fungal metabolism
  • asks how chemodiversity originated and why chemodiversity is clustered
  • toxicity clusters are likely driven by genetic linkage (i.e. as in butterflies)
  • mined metabolic databases for genes that are clustered and genes that are not and measured toxicity.  Clustered genes are more toxic
  • tissue specific expression in humans and mammals is the equivalent of clustering in fungi
  • this implies that the position of two genes in the fungal genome gives information about how they interact in humans (i.e. those two genes are likely to be expressed in the same tissue)


Rotem Sorek -- The Immune System of Bacteria

  • CRISPR is the immune response of bacteria against phages
  • the Cas proteins find phage DNA in the cell and insert it into the next spacer
  • CRISPR is in only ~40% of bacteria.  How do other bacteria fight phage?  The only other known mechanism is restriction enzymes.  Can we find new immune systems in bacteria?
  • immune system signatures:  rapidly evolving, high horizontal gene transfer
  • found the BREX system in B. cereus!  Found in 10% of bacteria.  We don't understand the mechanisms yet
  • phages have anti-defense systems.  Ergo it would be best for bacteria to have multiple defense systems


Phil Hugenholtz -- Back from the Dead:  The Curious Tale of the Predatory Cyanobacterium Vampirovibrio chlorellavorus

  • these bacteria suck host cells dry!
  • falls in Cyanobacteria clade
  • contains a type-IV secretion system partially found on plasmids
  • non photosynthetic (unlike most Cyanobacteria)


Stephen Wright -- Comparative and Population Genomics in the Brassicaceae:  Understanding Genome-Wide Natural Selection

  • evolution can go backwards.  For example, Y chromosome degeneration in humans.
  • many plants that undergo whole genome duplication revert to diploid
  • there is evidence in Arabidopsis that a long time ago it had a whole genome duplication event
  • possible explanations:  passive constituency of redundancy, inefficient selection, differential adaptation
  • there is limited evidence contrary to popular theory that selfing or limited recombination leads to a high density of deleterious mutations and evolutionary dead-ends
  • Capsella rubella is a model for this phenomenon
  • ploidy affects the efficiency of natural selection
  • higher ploidy can weaken the efficacy of natural selection
  • ploidy combined with transition to selfing increases rate of deleterious mutation accumulation
  • plant sex chromosomes are younger than mammalian sex chromosomes


Steve Briggs -- Protein Regulatory Networks

  • see Walley et al. PNAS 2013!


Susan Lynch -- The Microbiome--A New Frontier in Human Health

  • increase in asthma cases in children in US and Australia
  • asthma is mostly an environmental disease (ie not so much genetic)
  • now people have far less environmental exposure
  • Americans spend about 90% of their time indoors
  • vaginally born children have a different microbiome than C-section born children, and they are less likely to have asthma.
  • lactobacillus keeps airways of mice challenged with dust open like normal airways
  • see Fujimura et al. PNAS January 2014
  • WHEALS study
  • allergic children often develop asthma

Monday, January 26, 2015

The iChip: A tool for high-throughput microbial culturing

It is commonly accepted that only ~1% of naturally occurring microbes are culturable using standard culturing techniques.  Until recently, culturing microbes has been the only way to investigate their presence and effects on various environments (i.e. human gut, soil, plant roots, oceans, etc).  Metagenomics, sequencing DNA from a pool of microbes, has broadened our understanding of microbial communities by circumventing the need for culturing.  However, culturing remains an important aspect of investigating microbial communities particularly for environments where the microbial diversity is so great that generating complete genome sequences via metagenomics is impractical (i.e. soil).

In 2010, a study from the Epstein and Lewis groups (Nichols et al., 2010) described a new technology that can increase culturing success from ~1% to nearly 50%!  This technology, the isolation chip or iChip, contains 384 chambers.  A single microbial cell is deposited in each chamber.  Then a semi-permeable membrane is used to cover all the chambers and the iChip is placed into the environment from which the microbes originated.  The membrane allows cells access to nutrients and growth factors found in their natural environment.

The iChip has been successfully used to study novel secondary metabolites (Lewis et al., 2010) and discover novel antibiotics (Ling et al., 2015; Lewis, 2013).  However, a possible reason why high-throughput culturing techniques have not gained much traction is the emphasis on metagenomic applications and studies, which do not suffer from culturing bias.  High-throughput culturing and metagenomic sequencing each have a unique set of strengths and weaknesses.  Perhaps more thought should be put into designing methods that combine high-throughput culturing and metagenomic sequencing to leverage both sets of strengths and mitigate their weaknesses.  

Friday, November 21, 2014

Literature Review: Selection on soil microbiomes reveals reproducible impacts on plant function

I recently found an intriguing microbiome paper, and I would like to post my thoughts about it.

Citation
Panke-Buisse et al., Selection on soil microbiomes reveals reproducible impacts on plant function, ISME Journal 2014, doi: 10.1038/ismej.2014.196.

Review
This paper tests whether phenotypes from a soil microbiome associated with a single plant genotype can be recapitulated in other plant genotypes.  First, early- and late-flowering-time-associated soil microbiomes were generated:  ten generations of Arabidopsis thaliana Col-0 were grown, and at each generation soils from the four earliest- and latest-flowering pots were collected as the inoculum for the subsequent generation.  Using soils from generation 10, final early- and late-flowering-time-associated inocula were derived.  These inocula were then administered to four Arabidopsis thaliana ecotypes (Rld, Ler, Be, and Col-0) and a close relative, Brassica rapa.  Flowering time, plant biomass, and nitrogen mineralization enzyme activity for each of the genotypes were statistically different between early- and late-flowering-associated inoculum treatments, except for Ler (Fig 4).  Microbes specifically associated with the early- and late-flowering-time treatments were found in low abundance, suggesting that low-abundance microbes can contribute significantly to interesting phenotypes.


Positives
In general, I found this paper to be very insightful for the following reasons:
  • Soil microbial phenotypes can persist in novel hosts.  This is a really cool result and could have direct implications for building field-useful microbiome treatments.  However, flowering time is a well conserved phenotype, so this result may not apply to other specific phenotypes of interest such as crop yield or disease resistance, or with greater host genetic divergence (i.e. corn flowering time vs Arabidopsis).
  • Low abundance members of a community can drive an important phenotype.  A common assumption in metagenomics is that the most abundant members of a community are the most important.  Perhaps more focus should be geared toward linking microbes to phenotypes regardless of abundance.
  • This artificial selection design doesn't require detailed knowledge of genetic mechanisms in order to produce a desired phenotype.  It is similar to the old dogma of plant breeding, where two good plants were crossed to produce an even better plant with no underlying knowledge of the genetic mechanisms.  While understanding the mechanisms is important for maximizing the phenotypic benefits, it is not required to generate a positive effect.

Critiques and Questions
  • The methods section seems incomplete.
    • Why are there so few OTUs in the heatmap and ternary plot?  Were OTUs filtered out?  I would expect to see hundreds or even thousands of OTUs from soil-derived microbiota.
    • Were the wild soils combined at equal ratios?
    • What was their measure of flowering time?  I think there are well defined methods in the literature for observing flowering time.
  • Why are the early-flowering (EF) associated samples in the 10-generation experiment actually flowering later in later generations (supp fig 1)?
  • Why are the control samples not included in supp fig 1?  
  • Even though control samples were included as a covariate to generate figure 4, I would have liked to have seen them graphed there as well.
  • Perhaps a better way of generating the control samples would have been to randomly pick pots to generate a control inoculum.  This was what they did in Swenson et al., 2000

Future Work
This experimental design could be used to address the following intriguing hypotheses:
  • Selection on soil microbiomes changes the root endophyte microbial community.
    • Redo the experiment with the same design but also profile the endophyte microbes by 16S sequencing.  
    • It would also be interesting to see if the microbes specific to each treatment are also found inside the root.  This would suggest a direct microbial effect on the plant phenotype.
  • Selection for soil microbiome associated phenotypes is driven by multiple mechanisms.
    • Redo the experiment but don't combine replicates such that the end inocula include four independent lines of selection for both early- and late-flowering time soil microbiomes.  Compare microbes among treatment groups looking for different microbial profiles that express similar phenotypes.  

Wednesday, October 8, 2014

MiSeq Flow Cell Edges Correlate with Low Quality Scores

Upon receiving a FASTQ file fresh off the MiSeq, the first question I ask myself is:  "Did the sequencing work?"  On several occasions I have opened these files to discover that the first few pages of sequence reads are littered with N's and have low quality scores.  However, when I run the full set of reads through a QC pipeline (e.g. FastQC) an overwhelming majority are high quality.

So why is it that the reads at the beginning of the FASTQ file have such poor quality?

Reads generated using Illumina technology are ordered by their cluster's y-coordinate on the flow cell.  This led me to hypothesize that clusters near the edge of the flow cell are more likely to have low quality scores.  To test this hypothesis I took a subset of reads (932,709) from a MiSeq run, calculated their average quality score, and graphed that against the distance from the closest edge.
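
The per-read computation might look like the sketch below.  Note that the tile width and height are assumed values for illustration (Illumina read IDs report x/y coordinates per tile, and real tile bounds come from the run metadata, which the post does not specify):

```python
def read_stats(header, qual_line, tile_width=30000, tile_height=30000):
    # Illumina read IDs end in ...:<tile>:<x>:<y>
    x, y = (int(value) for value in header.split(':')[-2:])
    # distance from the closest of the four assumed tile edges
    edge_dist = min(x, y, tile_width - x, tile_height - y)
    # average Phred+33 quality across the read
    avg_qual = sum(ord(c) - 33 for c in qual_line) / len(qual_line)
    return edge_dist, avg_qual

# hypothetical read: 1000 units from the top edge, all bases at Q40 ('I')
header = '@M00001:1:000000000-A1B2C:1:1101:5000:29000'
edge_dist, avg_qual = read_stats(header, 'IIII')
```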


The splines function in R was used to model the relationship between average quality score and distance from the closest flow cell edge (red line).  Clusters closer to the edge of the flow cell have significantly lower quality scores.  However, the good news is that the vast majority of clusters do not fall close to the edge (light blue heat).  

In practical terms this issue is insignificant because so few reads are affected, but it does explain the high concentration of low quality reads at the beginning (and end) of Illumina generated FASTQ files.

Tuesday, September 30, 2014

When is my genome finished?

This is a common question for anyone doing genome sequencing and assembly.

The short answer:  never.

The slightly longer answer:  it depends.

The long answer:  Finishing a genome means the order of all nucleotide bases has been correctly resolved.  Even for simple genomes this is extremely difficult.  Billions of dollars have been spent to sequence the human genome, and it is currently estimated that 92% of the human genome has greater than 99.99% accuracy (Schmutz et al., 2004).  In total, 99% of the genome has been assembled, and the remaining 1% is likely to be highly repetitive regions with little or no gene content (remember that 1% of 3 billion total bases means there are about 30 million unresolved bases).  Using current technology, it is impossible to reach 100% accuracy across 100% of the human genome.  For simpler genomes, such as some viral and bacterial genomes, it is possible to completely resolve the entire sequence.  However, because these genomes are much more dynamic (i.e. change at a faster rate), generating a completely finished genome may not be worth the cost.

Finishing a genome requires a substantial amount of time, work, and money.  However, getting a genome to draft status (i.e. incomplete but usable) can be done with minimal costs and resources.  So perhaps the most important question is:  when is my genome usable?  The answer to this question depends on the research questions being considered.  For example, when studying the evolutionary history of genomes with a high propensity for genomic rearrangements, generating long contigs/scaffolds is important for determining the orientation and lineage of genomic rearrangements.  Alternatively, some research questions are more concerned about the completeness of the gene content for making functional comparisons between genomes.  For such a question, generating long, continuous contig/scaffolds is less important.

Here are some possible ways to estimate completeness (with what I think are the best methods near the top):
  • Compare the gene content of highly conserved genes
    • Eukaryote:  CEGMA
    • Bacterial:  CheckM
    • Archaeal:  A table of 53 conserved COGs
  • Check for possible errors using REAPR
  • Calculate assembly metrics like contig number, N50, coverage, and assembly size.  
  • Compare the size of your assembled genome to a related (or set of related) genome(s).  
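
Several of the assembly metrics in the list above can be computed directly from a list of contig lengths.  Here is a small sketch, with made-up contig lengths, including the N50 calculation (the length of the contig at which half the assembly is covered when contigs are sorted longest-first):

```python
def assembly_metrics(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for n50 in lengths:
        running += n50
        if running >= total / 2:   # half the assembly is now covered
            break
    return {'contigs': len(lengths), 'assembly_size': total, 'N50': n50}

metrics = assembly_metrics([100, 80, 50, 30, 20])  # toy contig lengths
```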
See these papers for interesting discussion on evaluating assemblies:  

Friday, August 22, 2014

How to Order Contigs Using a Reference Genome

Background
In genomics, complete (i.e. finished) genomes provide an excellent resource for future sequencing projects.  However, generating finished genomes is an expensive and laborious endeavor.  In some cases, the returns of finishing a genome are not worth the cost, particularly when draft genomes can provide enough information for robust hypothesis testing.  Draft genomes contain a large number of unordered sequences of various lengths called contigs.  Contigs can be joined into more contiguous sequences called scaffolds using paired-end reads.  While a set of contigs/scaffolds can be useful for a wide variety of projects (gene annotation, gene expression profiling, etc.), contig order provides vital information in comparative genomics and analyses of specific genomic regions.  Typically, a rough ordering of contigs can be accomplished by mapping contigs to a closely related reference genome.

ABACAS
Recently, I discovered a nice software package called ABACAS which not only orders contigs but also creates a contiguous sequence representing the connected contigs.  Gaps between contigs are represented by N's and overlapping regions are resolved (although I could not find information on exactly how).  Because ABACAS requires MUMmer, its outputs integrate well with other MUMmer scripts such as those for visualizing alignments (Figure 1, i.e. mummerplot).  ABACAS is hosted by SourceForge and is well documented.  It was published in Bioinformatics in 2009.



Notes
Of course, more closely related draft and reference genomes will generate the most correct ordering.  However, in cases where the target genome is VERY closely related to the reference genome, it may be better to map raw reads to the reference genome instead of building a de novo assembly.  Deviations from the reference genome can be discovered using SNP detection software.  Read mapping experiments are generally cheaper because they require fewer reads than de novo assembly but can still detect biologically significant differences when compared to the reference genome.