Wednesday, May 13, 2015

Categorizing musicians by genre using artificial neural networks

Genres are an important classification scheme for organizing music and other forms of art.  However, genre classifications are often ambiguous.  For example, undergraduate students asked to classify a given set of songs into 10 genres agreed with the recording companies' genre classifications only 72% of the time (D. Perrot and R. Gjerdingen, 1999).  Even some of the most sophisticated computer algorithms classify songs into genres with only about 70-80% accuracy using song features such as tempo, time, and instruments.  Classification accuracy also depends on the number of genres an algorithm attempts to delineate: in the MIREX 2005 audio genre classification contest, the highest accuracy was 87% when classifying 6 genres but dropped to 75% when classifying 10 genres.

An alternative to classifying songs into genres is to classify musicians into genres.  Categorizing musicians into genres is also a difficult problem because individual musicians can be members of several different, but related, genres.  Musicians can be described using a set of terms (i.e. genres and sub-genres) representing their genre membership.  For example, Florida Georgia Line can be classified using the terms bro-country, country rock, country pop, and country rap.  This multi-class membership makes it difficult to make general comparisons between genres.  For example, it would be interesting to test whether the word "truck" is mentioned more frequently in country songs (i.e. by country musicians) than in rock songs (i.e. by rock musicians).

The first objective in this series of posts is to use an artificial neural network model to categorize musicians into a set of 20 pre-defined genres based on the list of terms associated with each musician.

Artificial Neural Network Background
Artificial neural networks (ANNs) are a class of machine learning models frequently used to solve classification problems.  An ANN contains a set of input, hidden, and output nodes connected by weighted edges.  Input nodes correspond to features in the training dataset, and output nodes correspond to classification types.  Input nodes are connected by weighted edges to hidden nodes, and hidden nodes are connected by weighted edges to output nodes.  The hidden nodes and weighted edges are the medium by which the input features are transformed into a classification prediction.  For example, if a given musician, M1, is described by features f1, f2, and f3, then "activating" input nodes f1, f2, and f3 redirects the flow of weight through the network towards the output nodes.  After all the weight has been redirected, the output node receiving the most weight is the most probable classification for musician M1.
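As a concrete (if toy-sized) sketch of this flow, the following Python snippet pushes a feature vector through a single hidden layer to produce a probability for each output node.  The sizes, weights, and activation functions here are illustrative only, not the ones used in this project:

```python
import numpy as np

# Toy forward pass: 3 input features, 4 hidden nodes, 2 genres.
# Weights are random here; a trained network would have learned them.
rng = np.random.default_rng(0)
n_features, n_hidden, n_genres = 3, 4, 2
W1 = rng.normal(size=(n_features, n_hidden))  # input -> hidden edges
W2 = rng.normal(size=(n_hidden, n_genres))    # hidden -> output edges

def forward(x):
    h = np.tanh(x @ W1)                  # hidden-node activations
    z = h @ W2                           # weight arriving at each output node
    return np.exp(z) / np.exp(z).sum()   # softmax -> genre probabilities

# "Activate" input nodes f1 and f3 for a hypothetical musician M1.
x = np.array([1.0, 0.0, 1.0])
probs = forward(x)
predicted_genre = int(np.argmax(probs))  # output node with the most weight
```

The softmax at the end simply converts the raw output weights into probabilities that sum to one, so the largest output is also the most probable genre.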

The Training Dataset
The task of building or training an ANN is equivalent to assigning weights to each edge such that the classification accuracy of the model is maximized.  The (overly simplistic) intuition behind this operation is as follows.  First, edge weights are randomly initialized.  Then the features of a single training entry are allowed to flow through the network and the final output for that entry is noted.  From the final output and the known correct output, an error value is calculated that measures how close the prediction came to the truth.  The error is then backpropagated through the network and the weights are adjusted to reduce the error for that feature set.  This process is repeated for each training entry.
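The loop described above can be sketched in a few lines of Python.  Everything below is a synthetic toy (two features, two classes, made-up data); the real model in this post was trained in R via caret:

```python
import numpy as np

# Toy training loop: random init, per-entry forward pass, error, backprop.
# Synthetic data: label = 1 exactly when the second feature is active.
rng = np.random.default_rng(1)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]])
y = np.array([0, 1, 1, 0])                  # known correct outputs

W1 = 0.5 * rng.normal(size=(2, 3))          # randomly initialized weights
W2 = 0.5 * rng.normal(size=(3, 2))
lr = 0.2                                    # learning rate
for epoch in range(1000):
    for x, label in zip(X, y):
        h = np.tanh(x @ W1)                 # forward pass
        z = h @ W2
        p = np.exp(z - z.max()); p /= p.sum()
        dz = p - np.eye(2)[label]           # error gradient at the outputs
        dW2 = np.outer(h, dz)               # backpropagate layer by layer
        dh = (dz @ W2.T) * (1 - h ** 2)
        dW1 = np.outer(x, dh)
        W2 -= lr * dW2                      # adjust weights to reduce error
        W1 -= lr * dW1

preds = [int(np.argmax(np.tanh(x @ W1) @ W2)) for x in X]
```

After enough passes over the training entries, the network classifies all four toy examples correctly.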

In this project, an ANN will be used to classify musicians based on a set of terms describing each musician.  Training data was manually gathered and curated.  First, for each genre, G, Google searches for "popular musicians of genre G" generated a list of popular musicians and their matching genre.  Terms describing each musician, M, were gathered from the Echo Nest database.  The "popular musician" lists were supplemented with additional musicians obtained from the Wikipedia pages for each genre.  The complete training dataset consisted of 20 genres, each containing 80 musicians, with each musician represented by a list of terms.  This training dataset, like nearly all training datasets for song classification, is likely to contain incorrect annotations.  Building more accurate training sets is a difficult task but substantially improves classification accuracy (Cory McKay and Ichiro Fujinaga, 2006).

Feature Selection
Estimating model parameters (i.e. weights) can be computationally expensive, and training sets with a large number of features take much longer to fit than those with fewer features.  Here the initial training dataset had over 3,000 features (i.e. terms).  Removing zero and near-zero variance features can reduce the required computational resources without a substantial impact on model accuracy.  Removing highly correlated features also reduces runtime and can improve model accuracy.  For these data, terms associated with fewer than 1% of musicians were deemed near-zero variance and removed.  Also, terms having a Phi coefficient (a measure of association between binary variables) higher than 0.9 with another term were removed.  These filtering steps reduced the number of features (i.e. terms) to 412.  The final training dataset contained 1,544 musicians linked to 20 genres and described by 412 terms.  Note that 56 musicians are linked to more than one genre.
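A rough Python sketch of these two filters, applied to a made-up binary musician-by-term matrix, using the same 1% prevalence and 0.9 Phi cutoffs described above:

```python
import numpy as np

# Made-up binary musician-by-term matrix: 200 musicians x 6 terms.
rng = np.random.default_rng(2)
X = (rng.random((200, 6)) < 0.3).astype(float)
X[:, 4] = 0.0
X[0, 4] = 1.0          # term used by 0.5% of musicians: near-zero variance
X[:, 5] = X[:, 0]      # duplicate term: Phi coefficient of 1 with term 0

# Filter 1: drop terms associated with fewer than 1% of musicians.
keep = [j for j in range(X.shape[1]) if X[:, j].mean() >= 0.01]

# Filter 2: drop terms with |Phi| > 0.9 against an already-kept term.
# (For two binary variables the Phi coefficient equals their Pearson
# correlation, so np.corrcoef computes it directly.)
def phi(a, b):
    return np.corrcoef(a, b)[0, 1]

selected = []
for j in keep:
    if all(abs(phi(X[:, j], X[:, k])) <= 0.9 for k in selected):
        selected.append(j)
```

On this toy matrix the rare term (column 4) and the duplicate term (column 5) are removed, leaving the four informative columns.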

Building the Model
The ANN model was built using the caret package and principles outlined in Applied Predictive Modeling by Max Kuhn and Kjell Johnson.  ANN models have two main tuning parameters: decay and size.  The decay parameter penalizes large weights during training, shrinking the weights toward zero at each update and preventing them from growing too large.  Excessively large weights cause a model to be overly specific to the given training dataset (i.e. not generalizable to other input data).  The second parameter, size, is the number of hidden nodes in the model.  Arbitrarily choosing these parameters may lead to suboptimal models.  A method called model tuning builds several models over a range of parameter values and selects the most accurate as the final model.  Accuracy for each tuning model built in this project is shown in Figure 1.  The final model used parameters decay=0.5 and size=25.  Further adjustments to these parameters are unlikely to significantly improve model accuracy for the given training data.
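The tuning here was done with R's caret; a loose Python analogue using scikit-learn is sketched below, where `alpha` (an L2 weight penalty) plays a role similar to decay and `hidden_layer_sizes` corresponds to size.  The data and grid values are synthetic, not those of the post:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the musician/term data.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# Grid of candidate parameter values: alpha ~ decay, hidden_layer_sizes ~ size.
grid = {"alpha": [0.01, 0.1, 0.5],
        "hidden_layer_sizes": [(5,), (15,), (25,)]}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), grid, cv=3)
search.fit(X, y)                 # builds one model per parameter combination
best = search.best_params_       # most accurate combination wins
```

`GridSearchCV` evaluates every combination in the grid with cross validation and keeps the best, which is essentially what caret's tuning step does.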

 Figure 1:  Model accuracy for a range of decay and size parameter values.  The most accurate model was built using decay=0.5 and size=25.  Bars indicate standard deviation.    

The accuracy of statistical models can be measured using a technique called cross validation.  In cross validation the training data are split into two groups: a training group and a test group.  The model is built using the training group and subsequently evaluated using the test group.  The process is repeated multiple times using different subsets of training and test groups to ensure that a chance good (or bad) split between the training and test groups does not incorrectly estimate model accuracy.  Here the final model correctly classified about 60% of the test cases (Figure 1).
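The procedure can be sketched as follows (synthetic data; scikit-learn's `cross_val_score` handles the repeated splitting, and a simple logistic regression stands in for the ANN):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the musician/term matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold cross validation: each fold serves once as the test group while
# the remaining folds form the training group; accuracy is then averaged.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_accuracy = scores.mean()
```

Averaging over the folds gives a more honest accuracy estimate than any single train/test split would.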

Model accuracy can also be evaluated using a confusion matrix.  A confusion matrix tabulates the observed (predicted) classifications against the expected (true) classifications for each category.  Figure 2 shows the percentages of data observed in each category across the cross validation test cases.  The vast majority of observed musicians' genres matched the expected genre.
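A confusion matrix like the one in Figure 2 can be assembled from (expected, observed) pairs and converted to row percentages.  The genre labels and pairs below are made up for illustration:

```python
import numpy as np

# Hypothetical (expected, observed) genre pairs from cross validation folds.
genres = ["country", "rock", "reggae"]
expected = ["country", "country", "rock", "rock", "reggae", "country"]
observed = ["country", "rock",    "rock", "rock", "reggae", "country"]

idx = {g: i for i, g in enumerate(genres)}
cm = np.zeros((len(genres), len(genres)))
for e, o in zip(expected, observed):
    cm[idx[e], idx[o]] += 1               # rows: expected, columns: observed

# Row-wise percentages, as in Figure 2: each expected-genre row sums to 100.
percent = 100 * cm / cm.sum(axis=1, keepdims=True)
```

A perfect classifier would put 100% on the diagonal of `percent`; off-diagonal mass shows which genres get confused.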

Figure 2:  Confusion matrix built using all cross validation test cases.  Values of each cell indicate the percentage of total musicians observed in each category for a given expected genre.  If the classification model were perfect there would be a blue diagonal and all other boxes would be red.

Genres whose musicians are frequently misclassified into a single alternative genre are likely similar to that alternative genre.  Hierarchical clustering of the confusion matrix percentages (Figure 3) shows that some genres are similar to others (e.g. classic rock and soft rock; disco and funk).  These similarities match logical expectations, and they correspond to how accurately the model can distinguish between genres.  For example, the model is more likely to incorrectly classify a "classic rock" musician as "soft rock" than as "reggae."
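The clustering idea can be sketched as follows: treat each genre's row of confusion percentages as a feature vector and cluster the genres hierarchically.  The 4x4 matrix below is invented so that classic_rock/soft_rock and disco/funk are mutually confused:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Invented confusion percentages: classic_rock/soft_rock and disco/funk
# are each frequently mistaken for one another.
genres = ["classic_rock", "soft_rock", "disco", "funk"]
cm = np.array([[70.0, 25.0,  3.0,  2.0],
               [20.0, 72.0,  4.0,  4.0],
               [ 2.0,  3.0, 65.0, 30.0],
               [ 3.0,  2.0, 28.0, 67.0]])

# Cluster genres by the similarity of their confusion-matrix rows.
Z = linkage(cm, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups
```

Genres that share misclassifications end up with the same cluster label, which is what the dendrogram in Figure 3 visualizes.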

Figure 3:  Dendrogram clustering of confusion matrix values.  This shows the similarity between genres based on the number of misclassification instances shared between genres.  As expected, genres known to be similar cluster together (e.g. classic_rock and soft_rock).

Artificial neural networks are an appropriate model for classification tasks.  Here an ANN was built to classify musicians based on a list of terms associated with each musician.  Considering the accuracy of other models used to classify songs, this model is sufficiently accurate for classifying musicians into genres.  Future analyses will use classifications from this model to make general comparisons between genres.

Additional Notes
  • The code (which needs to be cleaned and organized) can be found here.
  • Disclaimer: I am not a statistician.  This is a learning exercise for me.  I'm sure there are plenty of mistakes and things I overlooked.  If anyone has suggestions on how to improve the correctness of this analysis I am interested in hearing them.  I want to learn.
  • For a fantastic primer on using artificial neural networks for regression and classification, see the series of posts by Brian Dolhansky.
  • These chapters by  Michael Nielsen describing ANNs are also a great resource.
  • Liang et al. use music features and lyrics to classify genres.  However, many popular genres are missing (e.g. country) and there may be mistakes.
  • Other interesting references:
