Wednesday, May 13, 2015

Categorizing musicians by genre using artificial neural networks

Background
Genres are an important classification scheme for organizing music and other forms of art.  However, genre classifications are often ambiguous.  For example, undergraduate students asked to classify a given set of songs into 10 genres had only a 72% concordance with recording companies' genre classifications (D. Perrot and R. Gjerdigen, 1999).  Some of the most sophisticated computer algorithms can classify songs into genres with about 70% - 80% accuracy using song features such as tempo, time, instruments, etc.  The accuracy of genre classification also depends on the number of genres an algorithm attempts to delineate.  In the MIREX 2005 audio genre classification contest, the highest accuracy for classification into 6 genres was 87%, but it dropped to 75% when classifying into 10 genres (http://www.music-ir.org/evaluation/mirex-results/).

An alternative to classifying songs into genres is to classify musicians into genres.  Categorizing musicians into genres is also a difficult problem because individual musicians can be members of several different, but related, genres.  Musicians can be described using a set of terms (i.e. genres and sub-genres) representing their genre membership.  For example, Florida Georgia Line can be classified using the terms bro-country, country rock, country pop, and country rap.  This multi-class membership makes it difficult to make general comparisons between genres.  For example, it would be interesting to test if the word "truck" is more frequently mentioned in country songs (i.e. by country musicians) than in rock songs (i.e. by rock musicians).

The first objective in this series of posts is to categorize musicians into a set of 20 pre-defined genres based on a list of terms associated with each musician using an artificial neural network model.

Artificial Neural Network Background
Artificial neural networks (ANN) are a class of machine learning models frequently used to solve classification problems.  ANNs contain a set of input, hidden, and output nodes connected by weighted edges.  Input nodes correspond to features in the training dataset.  Output nodes correspond to classification types.  Input nodes are connected by weighted edges to hidden nodes, and hidden nodes are connected by weighted edges to output nodes.  Hidden nodes and weighted edges are the medium by which the input features are transformed into a classification prediction.  For example, if a given musician, M1, is defined by features f1, f2, and f3, by "activating" input nodes f1, f2, and f3 the network redirects the flow of weight towards the output nodes.  After all the weight has been redirected, the output node with the highest weight has the highest probability of being the correct classification for musician M1.
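The flow from input nodes to output nodes described above can be sketched in a few lines of Python.  This is a minimal illustration, not the model built in this post: the weights are made up, bias terms are omitted, and the sigmoid and softmax functions are assumed choices for turning weighted sums into activations and probabilities.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Convert output scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, w_hidden, w_output):
    """Pass 0/1 features through one hidden layer to the output nodes.

    w_hidden[j][i] is the weight of the edge from input i to hidden node j.
    w_output[k][j] is the weight of the edge from hidden node j to output k.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features))) for row in w_hidden]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_output]
    return softmax(scores)

# Hypothetical weights: 3 input features, 2 hidden nodes, 2 output genres.
w_hidden = [[2.0, -1.0, 0.5], [-1.5, 1.0, 1.0]]
w_output = [[1.0, -1.0], [-1.0, 1.0]]

# Musician M1 has features f1 and f2 "activated" but not f3.
probs = forward([1, 1, 0], w_hidden, w_output)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Here `predicted` is the index of the output node with the highest probability, which corresponds to the genre the network would assign to M1.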

The Training Dataset
The task of building or training an ANN is equivalent to assigning weights to each edge such that the classification accuracy of the model is maximized.  The overly simplistic intuition behind this operation is as follows.  First, edge weights are randomly initialized.  Then features of a single training entry are allowed to flow through the network and the final output for that entry is noted.  Using the final output and known output, an error value can be calculated which measures the level of correctness of the final output.  The error is then backpropagated through the network and weights are adjusted to reduce the error for that feature set.  This process is repeated for each training entry.
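The training loop described above can be sketched as follows.  This is a toy illustration under several assumptions that are not from this post: a single output node (one genre, yes or no), sigmoid activations, a cross-entropy error, a fixed learning rate, and a made-up four-entry dataset.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, w_hidden, w_output):
    """Return the inputs (plus bias), hidden activations (plus bias), and output."""
    x = list(features) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_hidden] + [1.0]
    output = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))
    return x, hidden, output

def train(entries, n_hidden, epochs=1000, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(entries[0][0])
    # Step 1: randomly initialize the edge weights (the extra column is a bias).
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for features, target in entries:
            # Step 2: let one training entry flow through the network.
            x, hidden, output = forward(features, w_hidden, w_output)
            # Step 3: compare the final output with the known output.
            delta_out = output - target
            # Step 4: backpropagate the error to the hidden nodes.
            delta_hidden = [delta_out * w_output[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Step 5: adjust weights to reduce the error for this entry.
            for j in range(n_hidden + 1):
                w_output[j] -= rate * delta_out * hidden[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] -= rate * delta_hidden[j] * x[i]
    return w_hidden, w_output

def predict(features, w_hidden, w_output):
    return forward(features, w_hidden, w_output)[2]

# Made-up entries: three 0/1 term features and a 0/1 label ("is country").
entries = [([1, 0, 0], 1), ([1, 1, 0], 1), ([0, 1, 1], 0), ([0, 0, 1], 0)]
w_hidden, w_output = train(entries, n_hidden=3)
```

After training, `predict` returns a value near 1 for entries labeled 1 and near 0 for entries labeled 0.  A model with one output node per genre follows the same pattern with a list of output deltas instead of a single one.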

In this project, an ANN will be used to classify musicians based on a set of terms describing each musician.  Training data was manually gathered and curated.  First, for each genre, G, Google searches for "popular musicians of genre G" generated a list of popular musicians and their matching genre.  Terms describing each musician, M, were gathered from the Echo Nest database.  The "popular musician" lists were supplemented with more musicians obtained from Wikipedia pages for each genre.  The complete training dataset consisted of 20 genres where each genre contained 80 musicians and each musician was represented by a list of terms.  This training dataset, like nearly all training datasets for song classification, is likely to have incorrect annotations.  Making more accurate training sets is a difficult task but substantially improves classification accuracy (Cory McKay and Ichiro Fujinaga, 2006).

Feature Selection
Estimating model parameters (i.e. weights) can be computationally expensive.  Models trained on datasets with a large number of features take much longer to fit than those trained on datasets with fewer features.  Here the initial training dataset had over 3,000 features (i.e. terms).  Removing zero and near-zero variance features can reduce the required computational resources without having a substantial impact on model accuracy.  Additionally, removing highly correlated features reduces runtime and can improve model accuracy.  For these data, terms associated with fewer than 1% of musicians were deemed near-zero variance and removed.  Also, terms having a Phi coefficient (a measure of association between two binary variables) higher than 0.9 with another term were removed.  These filtering measures reduced the number of features (i.e. terms) to 412.  The final training dataset contained 1,544 musicians linked to 20 genres and described by 412 terms.  Note that 56 musicians are linked to more than one genre.
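The two filters can be sketched as below.  This is an illustrative version rather than the code used for this post.  It assumes each term is stored as a 0/1 column over musicians, and it resolves a correlated pair by keeping whichever term was seen first, which is an assumed tie-breaking rule.  The function name `select_terms` and the example terms are hypothetical.

```python
import math

def phi(a, b):
    """Phi coefficient between two equal-length 0/1 columns."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = len(a) - n11 - n10 - n01
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

def select_terms(term_columns, min_fraction=0.01, max_phi=0.9):
    """Drop rare terms, then drop terms too similar to one already kept."""
    n_musicians = len(next(iter(term_columns.values())))
    kept = []
    for term, column in term_columns.items():
        if sum(column) / n_musicians < min_fraction:
            continue  # near-zero variance: the term describes too few musicians
        if any(phi(column, term_columns[other]) > max_phi for other in kept):
            continue  # nearly redundant with a term that was already kept
        kept.append(term)
    return kept

# Made-up data: 8 musicians, 4 terms.
terms = {
    "country": [1, 1, 1, 0, 0, 0, 0, 0],
    "twang":   [1, 1, 1, 0, 0, 0, 0, 0],  # identical to "country", so phi = 1
    "rock":    [0, 0, 0, 1, 1, 1, 0, 0],
    "yodel":   [0, 0, 0, 0, 0, 0, 0, 1],  # describes only 1 of 8 musicians
}
# A 20% cutoff is used here only because the toy dataset is tiny.
kept = select_terms(terms, min_fraction=0.2)
```

With this toy input, `kept` is `["country", "rock"]`: "twang" is dropped as redundant and "yodel" as too rare.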

Building the Model
The ANN model was built using the caret package and principles outlined in Applied Predictive Modeling by Max Kuhn and Kjell Johnson.  ANN models have two main parameters:  decay and size.  The decay parameter controls weight decay, a penalty applied during training that shrinks weights toward zero and prevents them from growing too large.  Excessively large weights cause a model to be overly specific to the given training dataset (i.e. not generalizable to other input data).  The second parameter, size, is the number of hidden nodes in the model.  Arbitrarily choosing these parameters may lead to suboptimal models.  A method called model tuning builds several models based on a range of parameter values and selects the most accurate as the final model.  Accuracy for each tuning model built in the project is shown in Figure 1.  The final model used parameters decay=0.5 and size=25.  Further adjustments to these parameters are not likely to significantly improve model accuracy for the given training data.

 Figure 1:  Model accuracy for a range of decay and size parameter values.  The most accurate model was built using decay=0.5 and size=25.  Bars indicate standard deviation.    


Accuracy of statistical models can be measured using a technique called cross validation.  In cross validation the training data are split into two groups:  a training group and a test group.  The model is built using the training group and subsequently evaluated using the test group.  The process is repeated multiple times using different subsets of training and test groups to ensure that a chance good (or bad) split between the training and test groups does not incorrectly estimate model accuracy.  Here the final model correctly classified about 60% of the test cases (Figure 1).
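One common form of this procedure is k-fold cross validation, sketched below.  The post does not say which resampling scheme was used, so k-fold is an assumption here, and the majority-genre "model" is only a stand-in so that the example runs on its own.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(entries, k, fit, predict, seed=0):
    """Hold out each fold once as the test group; return mean and per-fold accuracy."""
    accuracies = []
    for test_indices in k_fold_indices(len(entries), k, seed):
        held_out = set(test_indices)
        training_group = [e for i, e in enumerate(entries) if i not in held_out]
        test_group = [entries[i] for i in test_indices]
        model = fit(training_group)
        correct = sum(1 for features, genre in test_group
                      if predict(model, features) == genre)
        accuracies.append(correct / len(test_group))
    return sum(accuracies) / k, accuracies

# Stand-in model: always predict the most common genre in the training group.
def fit_majority(training_group):
    return Counter(genre for _, genre in training_group).most_common(1)[0][0]

def predict_majority(model, features):
    return model

# Made-up entries: (features, genre) pairs.
entries = [([1, 0], "rock")] * 8 + [([0, 1], "country")] * 2
mean_accuracy, fold_accuracies = cross_validate(entries, 5, fit_majority, predict_majority)
```

Any model can be substituted by passing different `fit` and `predict` functions.  Looking at the spread of `fold_accuracies`, not just the mean, shows how much the estimate depends on the particular split.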

Model accuracy can also be evaluated using a confusion matrix.  A confusion matrix tabulates the observed classifications against the expected classifications for each category.  Figure 2 shows the percentages of data observed in each category from the cross validation test cases.  The majority of musicians' observed genres matched the expected genre.


Figure 2:  Confusion matrix built using all cross validation test cases.  Values of each cell indicate the percentage of total musicians observed in each category for a given expected genre.  If the classification model were perfect there would be a blue diagonal and all other boxes would be red.
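A confusion matrix of this kind can be computed as in the sketch below, where each row is an expected genre and each cell holds the percentage of that genre's musicians observed in each category.  The genre labels and predictions are made up for illustration.

```python
from collections import Counter

def confusion_percentages(expected, observed, genres):
    """Return matrix[expected_genre][observed_genre] as a percentage of the row total."""
    counts = Counter(zip(expected, observed))
    matrix = {}
    for exp in genres:
        row_total = sum(counts[(exp, obs)] for obs in genres)
        matrix[exp] = {
            obs: (100.0 * counts[(exp, obs)] / row_total if row_total else 0.0)
            for obs in genres
        }
    return matrix

# Made-up test cases: the known genre and the genre the model predicted.
genres = ["classic_rock", "soft_rock", "reggae"]
expected = ["classic_rock"] * 4 + ["soft_rock"] * 2 + ["reggae"] * 2
observed = ["classic_rock", "classic_rock", "classic_rock", "soft_rock",
            "soft_rock", "classic_rock",
            "reggae", "reggae"]
matrix = confusion_percentages(expected, observed, genres)
```

In this toy example 75% of classic_rock musicians land on the diagonal and 25% are observed as soft_rock, while reggae is classified perfectly.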


Genres having musicians that are frequently classified incorrectly into a single alternative genre are likely similar to that alternative genre.  A hierarchical clustering of the confusion matrix percentages (Figure 3) shows that some genres are similar to other genres (e.g. classic_rock and soft_rock; disco and funk; etc.).  These similarities match logical expectations.  They also correspond to how accurately the model can distinguish between genres.  For example, the model is more likely to incorrectly classify a "classic rock" musician as "soft rock" than as "reggae."

Figure 3:  Dendrogram clustering of confusion matrix values.  This shows the similarity between genres based on the number of misclassification instances shared between genres.  As expected, genres known to be similar cluster together (e.g. classic_rock and soft_rock).
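The idea behind the dendrogram can be sketched with a small agglomerative clustering routine.  The distance measure and linkage used for Figure 3 are not stated in this post, so the choices below are assumptions: the distance is 100 minus the average of the two cross-classification percentages, and clusters are joined by average linkage.  The confusion percentages are made up.

```python
def cluster_genres(confusion, genres):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    def distance(a, b):
        # Genres that are often confused with each other are "close".
        return 100.0 - (confusion[a][b] + confusion[b][a]) / 2.0

    clusters = [(g,) for g in genres]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance(a, b) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges

# Made-up confusion percentages (rows: expected genre, columns: observed genre).
confusion = {
    "classic_rock": {"classic_rock": 75.0, "soft_rock": 25.0, "reggae": 0.0},
    "soft_rock":    {"classic_rock": 50.0, "soft_rock": 50.0, "reggae": 0.0},
    "reggae":       {"classic_rock": 0.0,  "soft_rock": 0.0,  "reggae": 100.0},
}
merges = cluster_genres(confusion, ["classic_rock", "soft_rock", "reggae"])
```

The first merge joins classic_rock and soft_rock at a distance of 62.5, and reggae joins last at a distance of 100, which mirrors how similar genres sit on neighboring branches of a dendrogram.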


Conclusion
Artificial neural networks are an appropriate model for classification tasks.  Here an ANN was built to classify musicians based on a list of terms associated with each musician.  Considering accuracy of other models used in classifying songs, this model is sufficiently accurate for classifying musicians into genres.  Future analyses will use classifications from this model to make general comparisons between genres.



Additional Notes
  • The code (which needs to be cleaned and organized) can be found here.
  • Disclaimer: I am not a statistician.  This is a learning exercise for me.  I'm sure there are plenty of mistakes and things I overlooked.  If anyone has suggestions on how to improve the correctness of this analysis I am interested in hearing them.  I want to learn.
  • For a fantastic primer on using artificial neural networks for regression and classification see the following posts by Brian Dolhansky:  
  • These chapters by  Michael Nielsen describing ANNs are also a great resource.
  • Liang et al. use music features and lyrics to classify genres.  However, many popular genres are missing (e.g. country) and there may be mistakes.

Wednesday, April 29, 2015

2015 JGI Users Meeting Notes

Interested in watching any of these talks?  See this webpage!


Jack Gilbert -- Genome-Enabled Flux Balance Metabolic Networks From Periodically Flooded Soils
  • Presented an array of studies across many different microbial communities
  • environmental microbes have roughly a 35 year lag in responding to major climate changes
  • Cyanobacteria can metabolize nitrogen into ammonium, which has been shown to be useful in sphagnum moss
  • bacteria can protect against allergies.  Adding Clostridium to mice alleviates allergic symptoms
  • added Clostridium to a young man who had really bad allergies.  Comparisons to other family members without allergies are in progress
  • also doing this experiment on dolphins because it is easier to control their environment
  • microbes can contribute to you being fat or skinny
  • circadian rhythms in gut genes and microbes look to be very important
  • microbes in roots also have circadian cycles 
  • your house takes on your microbiome
  • dogs increase the similarity of couples' microbiomes

Francis Martin -- Harnessing Genomics for Understanding Tree-Microbe Interactions in Forest Ecosystems
  • we know very little about how fungi affect the carbon cycle
  • 4 major groups of fungi in most forest ecosystems:  white rotters, brown rotters, litter/soil decayers, and ectomycorrhizal fungi
  • he studies mycorrhizal fungi and their symbiotic toolkit
  • mycorrhizal symbiosis has evolved independently several times!
  • ectomycorrhizal fungi (EMF) have a reduced complement of plant cell wall degrading genes compared to their ancestors
  • small proteins from EMF are secreted and land on plant cells
  • See the Plett et al. paper (PNAS) for the JAZ interaction.  MiSSP7 binds to JAZ and prevents the immune response in the plant
  • each symbiosis event has developed a unique set of effectors

Joan Bennett -- Do Fungi have a 'Volatome'?

  • aflatoxins can cause cancer in small doses
  • mycotoxins may cause sick building syndrome
  • volatile organic compounds (VOCs) are things that we can smell and are frequently made by fungi
  • using Arabidopsis and flies as models for testing the effects of these compounds
  • flies exposed to C-8 VOCs acted similarly to model flies for Parkinson's disease
  • VOCs in the fungal microbiome of humans are likely responsible for attracting mosquitoes


Susanna Theroux -- Marsh Madness: Microbial Communities Driving Greenhouse Gas Cycling in Coastal Wetlands

  • wants to link wetland microbes with carbon emissions
  • methanogens break down carbon into methane
  • wetlands produce a lot of methane
  • microbes might be useful in minimizing methane production compared to carbon storage
  • more methanogens yield more methane


Antonis Rokas -- Evolution of Fungal Chemodiversity

  • fungal metabolism
  • asks how chemodiversity originated and why chemodiversity is clustered
  • toxicity clusters are likely driven by genetic linkage (e.g. butterflies)
  • mined metabolic databases for genes that are clustered and genes that are not and measured toxicity.  Clustered genes are more toxic
  • tissue specific expression in humans and mammals is the equivalent of clustering in fungi
  • this implies that the position of two genes in the fungal genome gives information about how they interact in humans (i.e. those two genes are likely to be expressed in the same tissue)


Rotem Sorek -- The Immune System of Bacteria

  • CRISPR is the immune response in bacteria against phages
  • the Cas proteins find phage DNA in the cell and insert it as the next spacer
  • CRISPR is in only ~40% of bacteria.  How do other bacteria fight phage?  The only other known mechanism is restriction enzymes.  Can we find new immune systems in bacteria?
  • immune system signatures:  rapidly evolving, high horizontal gene transfer
  • found the BREX system in B. cereus!  Found in 10% of bacteria.  We don't understand the mechanisms yet
  • phages have anti-defense systems.  Ergo it would be best for bacteria to have multiple defense systems


Phil Hugenholtz -- Back from the Dead:  The Curious Tale of the Predatory Cyanobacterium Vampirovibrio chlorellavorus

  • these bacteria suck host cells dry!
  • falls in Cyanobacteria clade
  • contains a type-IV secretion system partially found on plasmids
  • non photosynthetic (unlike most Cyanobacteria)


Stephen Wright -- Comparative and Population Genomics in the Brassicaceae:  Understanding Genome-Wide Natural Selection

  • evolution can go backwards.  For example, Y chromosome degeneration in humans.
  • many plants that undergo whole genome duplication revert to diploid
  • there is evidence in Arabidopsis that a long time ago it had a whole genome duplication event
  • possible explanations:  passive constituency of redundancy, inefficient selection, differential adaptation
  • there is limited evidence contrary to popular theory that selfing or limited recombination leads to a high density of deleterious mutations and evolutionary dead-ends
  • Capsella rubella is a model for this phenomenon
  • ploidy affects the efficiency of natural selection
  • higher ploidy can weaken efficacy of natural selection
  • ploidy combined with transition to selfing increases rate of deleterious mutation accumulation
  • plant sex chromosomes are younger than mammalian sex chromosomes


Steve Briggs -- Protein Regulatory Networks

  • see Walley et al. PNAS 2013!


Susan Lynch -- The Microbiome--A New Frontier in Human Health

  • increase in asthma cases in children in US and Australia
  • asthma is mostly an environmental disease (ie not so much genetic)
  • now people have far less environmental exposure
  • Americans spend about 90% of their time indoors
  • vaginally born children have a different microbiome than C-section born children, and they are less likely to have asthma
  • Lactobacillus keeps airways of mice challenged with dust open like normal airways
  • see Fujimura et al., PNAS, January 2014
  • WHEALS study
  • allergic children often develop asthma

Monday, January 26, 2015

The iChip: A tool for high-throughput microbial culturing

It is commonly accepted that only ~1% of naturally occurring microbes are culturable using standard culturing techniques.  Until recently, culturing microbes has been the only way to investigate their presence and effects on various environments (e.g. human gut, soil, plant roots, oceans, etc.).  Metagenomics, sequencing DNA from a pool of microbes, has broadened our understanding of microbial communities by circumventing the need for culturing.  However, culturing remains an important aspect of investigating microbial communities, particularly for environments where the microbial diversity is so great that generating complete genome sequences via metagenomics is impractical (e.g. soil).

In 2010, a study from the Epstein and Lewis groups (Nichols et al., 2010) described a new technology which can increase culturing success from ~1% to nearly 50%!  This technology, the isolation chip or iChip, contains 384 chambers.  A single microbial cell is deposited in each chamber.  Then a semi-permeable membrane is used to cover all the chambers and the iChip is placed into the environment from which the microbes originate.  The membrane allows cells access to nutrients and growth factors found in their natural environment.

The iChip has been successfully used to study novel secondary metabolites (Lewis et al., 2010) and discover novel antibiotics (Ling et al., 2015; Lewis, 2013).  However, a possible reason why high-throughput culturing techniques have not gained much traction is the emphasis on metagenomic applications and studies which do not suffer from culturing bias.  High-throughput culturing and metagenomic sequencing each have a unique set of strengths and weaknesses.  Perhaps more thought should be put into designing methods using a combination of high-throughput culturing and metagenomic sequencing to leverage both sets of strengths and mitigate their weaknesses.