Music trends shift every year: the top songs, the popular genres, and the biggest artists all change. We can look up the top songs from 2010 to 2019 online, but song titles and artist names alone won't tell us much about the musical character of each year. Instead, we can use Spotify to gather data on the most-streamed songs of each year from 2010 to 2019. Spotify is a popular music streaming app that lets people listen to most of the music released in the world, and its user base has grown substantially, passing 113 million users as of the third quarter of 2019.
In this tutorial, we will use computational analysis and data science techniques to analyze the components of the top songs throughout the years. We will look at each song's danceability, valence, energy, acousticness, and more to find correlations among the top songs. Using data science methods and analysis, we aim to leave the audience with a better understanding of how music trends have changed over the decade.
The first thing we have to do is import a dataset of Spotify's top songs from 2010 to 2019. Luckily, we found the dataset readily available on Kaggle, so we did not have to scrape it from a website. One oddity of this dataset, however, is that it is not encoded in standard UTF-8, as most csv files are, but in cp1252, an encoding more common on Windows. The encoding argument below is therefore necessary to avoid errors and load the data properly.
Once we have our dataset, we import the essential libraries, which include pandas to hold the dataframe, as well as numpy, matplotlib, seaborn, and scikit-learn for the numerical and statistical analysis used later. We also filter out any warnings that might be raised by deprecated functions these libraries call implicitly.
# Data handling, numerical work, and plotting
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Models and evaluation tools from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Silence warnings raised implicitly by deprecated library calls
import warnings
warnings.filterwarnings('ignore')
# The dataset is cp1252-encoded rather than UTF-8, so we tell read_csv explicitly
data = pd.read_csv("top10s.csv", encoding='cp1252')
data.head()
Along with the basic information of the title, artist, genre, and release year of each song, there are several other numerical attributes that Spotify assigns to each track. Based on the Spotify Web API's documentation, these attributes are: bpm (the tempo in beats per minute), nrgy (energy), dnce (danceability), dB (loudness in decibels), live (liveness), val (valence), dur (duration), acous (acousticness), spch (speechiness), and pop (popularity).
One strange column in this dataset, however, is the second one, called "Unnamed: 0". We know that it corresponds to the song's rank on Spotify for the year it was released, so let's rename the column below.
data = data.rename(columns = {'Unnamed: 0' : 'rank'})
data.head(20)
One of the data points from 2016 is a "bad" point: it is not properly formatted and skews the data. We only realized this later in the analysis, so to prevent further harm, we remove it in this step by dropping the 2016 row with the lowest dB value.
data = data.drop(data[data.year == 2016].dB.idxmin())
data[440:446]
Let's extract exactly how many unique songs, artists, and genres exist within the dataset.
number_songs=data.title.nunique()
number_artists=data.artist.nunique()
number_genres=data['top genre'].nunique()
print('number_songs =', number_songs)
print('number_artists =', number_artists)
print('number_genres =', number_genres)
number_songs = 584
number_artists = 184
number_genres = 50
It appears that this dataset has a different number of songs per year, with 2013 in particular being especially abundant, as the quick check below shows. We want to change the rank so that it restarts for each year, rather than simply mirroring the row number. A sample of the re-ranked data, spanning the boundary between 2013 and 2014, is shown further below.
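As a quick sanity check on that claim, the per-year counts can be printed directly; a minimal sketch:
# Number of songs in the dataset for each year
print(data['year'].value_counts().sort_index())
The loop below then rewrites the rank column so that it restarts at 1 whenever the year changes.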
year = 2010
counter = 1
# Assign ranks 1..n within each year, in row order
for index, row in data.iterrows():
    if row['year'] != year:
        counter = 1
        year = row['year']
    data.loc[index, 'rank'] = counter
    counter += 1
data[200:230]
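As an aside, the same re-ranking can be done without an explicit loop; a vectorized sketch, assuming the rows are already ordered within each year:
# cumcount() numbers the rows 0, 1, 2, ... within each year group
data['rank'] = data.groupby('year').cumcount() + 1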
We want to analyze this data from several different perspectives. Our first question is: how do these attributes of the top songs change over time, and how are they distributed around the central statistics (the mean or median)? We also want to look at the common features of the highest subset of this data, meaning the top 5 songs of each year, as well as the most common genres.
cols = ['bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch', 'pop']
# One violin plot per attribute, showing its distribution for each year
for var in cols:
    plt.figure(figsize=(10, 8))
    plt.title(var + " Throughout the Years")
    sns.violinplot(x='year', y=var, data=data)
    plt.show()
First, looking at the beats per minute, it appears that over the decade the mean bpm has gradually gone down, based on the central statistic, although there are clearly outliers on both ends. The energy has also declined, but in general the energy of the top songs is very high, above 60, even though Spotify's scale goes all the way down to zero. A few outliers exist in this data too, but the inner boxplot's median is robust to them.
Moving on to danceability, there is not much of a trend in the distribution; the plots look centered close to the middle, with outliers in both directions. Likewise, the volume, recorded in decibels, is not really affected by year, and neither is liveness. Regarding liveness, it is important to note that top songs mostly feature low liveness, with the exception of a few outliers. Valence is spread evenly across both ends of the spectrum and does not change much over time, increasing only a little; this suggests that valence varies widely rather than clustering around a particular point, and stays that way across the decade.
In addition, song durations have only slightly decreased, although short durations make sense at Spotify's scale: most popular songs are broadcast on the radio and at other events, where excessive length can be perceived as a waste outside of special situations. Acousticness is extremely low, averaging near zero, with the overwhelming majority of songs relying on electronic production; this has remained constant throughout the decade. Speechiness has also been very low, with 2013 having no outliers in the amount of spoken words. Lastly, popularity hasn't really changed, which makes sense, since in any year some songs are always more popular than others.
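To attach rough numbers to these trends, the yearly means can be compared directly; a minimal sketch for a few of the attributes:
# Mean tempo, energy, and duration per year, to quantify the violin-plot trends
print(data.groupby('year')[['bpm', 'nrgy', 'dur']].mean().round(1))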
# Pie chart of the seven most common genres for each year
for x in range(2010, 2020):
    data2 = data[data['year'] == x]
    plt.figure(figsize=(10, 8))
    plt.pie(data2['top genre'].value_counts().iloc[:7],
            labels=data2['top genre'].value_counts().iloc[:7].index,
            autopct='%1.1f%%', shadow=False, startangle=50)
    plt.title('Top 7 Genres For Year ' + str(x), fontsize=16)
    plt.show()
Looking at the pie charts, dance pop was the most popular genre every year until 2019. Pop was consistently the second most popular genre until it became the most streamed genre in 2019, overtaking dance pop. In 2013, however, boy band was the second most played genre, ahead of pop, which came in third. Looking at the dataset, 2013 had more boy band tracks than other years, as One Direction enjoyed immense global popularity that year, along with fellow boy band The Wanted. In 2016, pop does not even appear on the pie chart, but this could be an artifact of how Spotify categorizes genres: each track is assigned only one genre even when it could fall under several. For example, Canadian pop is essentially pop where the singer happens to be from Canada, such as Justin Bieber, Shawn Mendes, Drake, and many other popular singers with top tracks.
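The 2013 boy-band spike can be verified directly against the raw data; a quick sketch:
# Count the boy band tracks in each year
print(data[data['top genre'] == 'boy band']['year'].value_counts().sort_index())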
plt.figure(figsize = (10, 8))
data['artist'].value_counts().head(10).plot.bar()
This bar graph shows the ten artists with the most top tracks, with Katy Perry coming in first at 17. However, we'll look into how popular their music was each year and how the set of artists has changed over time, since an artist can release many top hits in one year and then none in later years. This could give us insight into how music trends have shifted along with changes in artists and their respective genres.
# Scatter plot of each year's artists against the popularity of their tracks
for x in range(2010, 2020):
    data2 = data[data['year'] == x]
    plt.figure(figsize=(15, 8))
    plt.title('Artists and Their Top Tracks in ' + str(x))
    plt.scatter(data2['artist'], data2['pop'])
    plt.xticks(data2['artist'], data2['artist'], rotation='vertical', fontsize=12)
    plt.show()
Looking at the scatter plots of artists and their top tracks in each year, we can see how the set of charting artists changes throughout the years.
In 2010, Kesha had 4 top dance pop tracks, with a highest popularity of 80, and Lady Gaga had 3 top dance pop tracks, with a highest popularity of 79. The Black Eyed Peas also had 4 top dance pop tracks. In 2011, Lady Gaga put out the most top dance pop tracks, with 5 songs consistently high in popularity, alongside many other dance pop artists such as Jennifer Lopez, Kesha, LMFAO, Beyoncé, and more. In later years, however, these artists rarely appeared in the charts. The Black Eyed Peas, LMFAO, Kesha, and many others never appeared again, which lines up with the decline of the dance pop genre we saw in the pie charts earlier. This could be due to a lack of new releases: according to Wikipedia, the Black Eyed Peas took an eight-year hiatus after their 2010 album, and Kesha took a five-year break after her 2012 album. These long hiatuses by major dance pop artists may be one factor in the genre's decline among the top tracks.
We also see new artists joining the charts in later years: Ed Sheeran first appears in 2015 with 4 top tracks, including the most popular song of that year. Canadian pop was also the second most played genre in 2015, as Justin Bieber totaled 9 top Canadian pop tracks and The Weeknd appeared on the plots for the first time with a Canadian pop track whose popularity tied Ed Sheeran's highest.
With the hiatuses of some artists and the debuts of others, the most-streamed genres change from year to year. It's also interesting to note the increase in artists and tracks over time: there are more scatter points in 2013 and later years than in the earlier ones. This could be due to Spotify's growth in users; Spotify was reported to have about 650,000 paying users at the end of 2010, 18 million users in 2015, and 113 million users by the third quarter of 2019, according to gigaom.com and statista.com.
We also want to analyze a subset of the top songs: the top fives. These songs are the best of the best and represent the most exemplary part of the dataset, so we want to see whether their various scores are higher than the means, or whether they sit near the expected values (and are thus a good representation of the overall data). First, let's filter out the top fives by rank.
top5 = data[data['rank'] <= 5]
top5.head(20)
Once we do that, we want to compare the Spotify scores of the top fives with each other, to see if the data is overall normally distributed or skewed in one direction.
top5.hist(figsize = (15, 13), column = ['bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch'])
The histograms above look like the expected outcome, consistent with the violin plots of the entire dataset. Duration, volume, energy, and beats per minute tend toward a normal distribution, while speechiness and acousticness are concentrated at the low end of the scale, with long tails toward the high values that Spotify's scale allows. Danceability and valence are spread fairly evenly.
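Rather than eyeballing the shapes, the skew can also be quantified; a minimal sketch using pandas' built-in skewness (values near zero are roughly symmetric, large positive values have a long tail toward high scores):
# Skewness of each attribute among the top fives
print(top5[cols].skew().round(2))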
We also want to compare the mean scores of the top fives with that of the entire dataset, which we will do below.
# Compare the mean of each attribute for the top fives against the whole dataset
# (numeric_only avoids errors from the non-numeric title/artist/genre columns)
means = pd.DataFrame(top5.mean(numeric_only=True))
means['total'] = data.mean(numeric_only=True)
means = means.rename(columns={0: 'top 5'})
means = means.drop(['year', 'pop', 'rank'])
means.plot.bar(rot=0, figsize=(13, 10))
means
It appears that the top fives show no drastic change in the variables, with the exception of acousticness, which is significantly higher, almost double the mean of the entire dataset. One possible reason is an outlier among the top fives that greatly influences the mean, or simply that the very top songs are more acoustic, which may itself influence popularity. Overall, though, the top fives are a good representation of the data as a whole, rather than being extreme examples of certain scores.
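To check the outlier hypothesis for acousticness, one can peek at the most acoustic tracks among the top fives; a quick sketch:
# The five most acoustic songs among the yearly top fives
print(top5.nlargest(5, 'acous')[['title', 'artist', 'acous']])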
# Correlation matrix of the numeric attributes, with rank excluded
data2 = data.copy()
data2.drop('rank', axis=1, inplace=True)
corr_matrix = data2.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(corr_matrix, annot=True, linewidths=.5, ax=ax);
We have created a heatmap showing the correlations between each attribute. From this heat map, we can see that the attributes with the strongest correlations are energy, acousticness, loudness, danceability, and valence. We will use these attributes for further analysis.
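For reference, the strongest pairs can also be pulled out of the matrix programmatically rather than read off the heatmap; a minimal sketch:
# Flatten the matrix, rank by absolute correlation, and drop the self-correlations
pairs = corr_matrix.unstack().abs().sort_values(ascending=False)
print(pairs[pairs < 1.0].head(10))  # each pair appears twice, since the matrix is symmetric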
fig, ax = plt.subplots(figsize=(10,8))
sns.regplot(x=data.acous,y=data.nrgy, ax=ax).set_title('Acoustics of a Song Versus the Energy Level',fontsize=15)
plt.xlabel('Acoustics',fontsize=12);
plt.ylabel('Energy Level',fontsize=12);
This relationship makes intuitive sense, as we would expect the energy level of a song to go down as its acousticness increases. One reason is that electronically produced music offers a lot of variety in the sounds it can create, which can also drive up loudness and danceability, two of the other song attributes.
# Fit a linear regression of energy on acousticness
linreg = LinearRegression().fit(data[['acous']], data[['nrgy']])
acousPred = linreg.predict(data[['acous']])
# Print the slope, intercept, and r2 of the fitted line
print('slope =', linreg.coef_)
print('intercept =', linreg.intercept_)
print('r2 = %.2f' % r2_score(data[['nrgy']], acousPred))
# Plot the predicted best-fit line over the scatter of the raw data
plt.figure(figsize=(10, 8))
plt.scatter(data[['acous']], data[['nrgy']])
plt.plot(data[['acous']], acousPred)
plt.title('Sklearn\'s Regression of Acoustics vs. Energy Levels')
plt.xlabel('Acoustics', fontsize=12)
plt.ylabel('Energy Level', fontsize=12)
plt.show()
As expected, this regression scatterplot looks very similar to the one produced by seaborn. Furthermore, because we have the slope and intercept, we can predict the energy level for a given acousticness; for instance, an acousticness of 50 gives an expected energy level of roughly 55.
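As a sanity check on that arithmetic, the fitted model can make the prediction itself; a minimal sketch:
# Predict the energy level for an acousticness of 50 (equivalent to slope * 50 + intercept)
print(linreg.predict(pd.DataFrame({'acous': [50]})))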
# Features with the strongest correlations, and popularity as the label
all_data = data[["nrgy", "acous", "dB", "dnce", "val"]]
labels = data["pop"]
# Hold out 20% of the rows as a validation set
training_data, validation_data, training_labels, validation_labels = train_test_split(
    all_data, labels, test_size=.2, random_state=100)
# Try every k from 1 to 100 and record the validation accuracy
k_list = range(1, 101)
accuracies = []
for k in k_list:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data, training_labels)
    accuracies.append(classifier.score(validation_data, validation_labels))
plt.figure(figsize=(10, 8))
plt.plot(k_list, accuracies)
plt.xlabel('Amount of Neighbors')
plt.ylabel('Accuracy Score')
plt.title('What K Value is Best?')
plt.show()
Here, we used sklearn's K-Nearest Neighbors algorithm to try to predict the popularity value from energy, acousticness, loudness, danceability, and valence. We chose these attributes because they had the strongest correlations with each other, as seen in the correlation heatmap. We then split the data into 80% training data and 20% validation data, ran the algorithm for every k from 1 to 100, and plotted the accuracy of each k value on a graph.
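The best k can also be read off the sweep programmatically instead of from the plot; a minimal sketch:
# Index of the highest validation accuracy, converted back to a k value
best_k = k_list[int(np.argmax(accuracies))]
print('best k =', best_k, 'accuracy =', max(accuracies))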
knn = KNeighborsClassifier(n_neighbors=39)
knn.fit(training_data, training_labels)
y_pred = knn.predict(validation_data)
print("Accuracy:",metrics.accuracy_score(validation_labels, y_pred))
#this is to be expected because it is trying to predict a continuous variable
Yikes! The best accuracy we could get is only 9.09%; however, this is to be expected, because we are working with a fairly small dataset and are trying to predict a continuous variable with a classifier. Let's try this again with a qualitative variable.
data.rename(columns = {'top genre':'topgenre'}, inplace = True)
top_genre_mapping={"dance pop": 0,
"pop": 1,
"canadian pop": 2,
"barbadian pop": 3,
"boy band": 4,
"electropop": 5,
"big room": 6,
"british soul": 7,
"neo mellow": 8,
"canadian contemporary r&b": 9,
"art pop": 10,
"hip pop": 11,
"complextro": 12,
"australian dance": 13,
"atl hip hop": 14,
"edm": 15,
"australian pop": 16,
"hip hop": 17,
"latin": 18,
"permanent wave": 19,
"tropical house": 20,
"colombian pop": 21,
"electronic trap": 22,
"candy pop": 23,
"folk-pop": 24,
"indie pop": 25,
"acoustic pop": 26,
"canadian hip hop": 27,
"detroit hip hop": 28,
"electro": 29,
"brostep": 30,
"belgian edm": 31,
"baroque pop": 32,
"escape room": 33,
"downtempo": 34,
"danish pop": 35,
"chicago rap": 36,
"australian hip hop": 37,
"moroccan pop": 38,
"metropopolis": 39,
"irish singer-songwriter": 40,
"contemporary country": 41,
"house": 42,
"french indie pop": 43,
"electro house": 44,
"hollywood": 45,
"alternative r&b": 46,
"canadian latin": 47,
"celtic rock": 48,
"alaska indie": 49}
labels=data.topgenre.map(top_genre_mapping)
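As an aside, pandas can build an integer encoding like this automatically instead of writing the dictionary by hand; a minimal sketch (the code values differ from the manual mapping, but KNN only needs distinct integer labels):
# Equivalent encoding: each distinct genre gets its own integer code
auto_labels = data['topgenre'].astype('category').cat.codes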
# Re-split with the genre codes as labels
training_data, validation_data, training_labels, validation_labels = train_test_split(
    all_data, labels, test_size=.2, random_state=100)
# Again try every k from 1 to 100
k_list = range(1, 101)
accuracies = []
for k in k_list:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data, training_labels)
    accuracies.append(classifier.score(validation_data, validation_labels))
plt.figure(figsize=(10, 8))
plt.plot(k_list, accuracies)
plt.xlabel('Amount of Neighbors')
plt.ylabel('Accuracy Score')
plt.title('What K Value is Best?')
plt.show()
# The accuracy isn't bad; it's only low because many genres have just 1 top song.
# Let's get rid of the genres that only have 1 or 2 songs to see how accuracy is affected.
Here, we used sklearn's K-Nearest Neighbors algorithm to try to predict the music genre from energy, acousticness, loudness, danceability, and valence. We split the data 8:2 again and plotted the accuracy of each k value on a graph. We reached an accuracy of around 55.2%, held back by the fact that many of the listed genres have only one or two songs that reached a year's top 50. This is a consequence of using a small dataset with a large selection bias, the bias being popularity. Let's try one more time, taking out the genres that have only one or two songs in the dataset.
# Remove the genres that contribute only one or two songs to the dataset
rare = ["electronic trap", "candy pop", "folk-pop", "indie pop", "acoustic pop",
        "canadian hip hop", "detroit hip hop", "electro", "brostep", "belgian edm",
        "baroque pop", "escape room", "downtempo", "danish pop", "chicago rap",
        "australian hip hop", "moroccan pop", "metropopolis", "irish singer-songwriter",
        "contemporary country", "house", "french indie pop", "electro house", "hollywood",
        "alternative r&b", "canadian latin", "celtic rock", "alaska indie"]
data2 = data[~data['topgenre'].isin(rare)]
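The hard-coded list above could also be derived from the value counts instead of being typed out; a minimal sketch, taking "rare" to mean contributing two songs or fewer:
# Derive the rare genres programmatically; this should match the hand-written list above
genre_counts = data['topgenre'].value_counts()
derived_rare = genre_counts[genre_counts <= 2].index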
all_data=data2[["nrgy","acous","dB","dnce","val"]]
top_genre_mapping={"dance pop": 0,
"pop": 1,
"canadian pop": 2,
"barbadian pop": 3,
"boy band": 4,
"electropop": 5,
"big room": 6,
"british soul": 7,
"neo mellow": 8,
"canadian contemporary r&b": 9,
"art pop": 10,
"hip pop": 11,
"complextro": 12,
"australian dance": 13,
"atl hip hop": 14,
"edm": 15,
"australian pop": 16,
"hip hop": 17,
"latin": 18,
"permanent wave": 19,
"tropical house": 20,
"colombian pop": 21}
labels = data2.topgenre.map(top_genre_mapping)
# Split and sweep k once more on the reduced genre set
training_data, validation_data, training_labels, validation_labels = train_test_split(
    all_data, labels, test_size=.2, random_state=100)
k_list = range(1, 101)
accuracies = []
for k in k_list:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data, training_labels)
    accuracies.append(classifier.score(validation_data, validation_labels))
plt.figure(figsize=(10, 8))
plt.plot(k_list, accuracies)
plt.xlabel('Amount of Neighbors')
plt.ylabel('Accuracy Score')
plt.title('What K Value is Best?')
plt.show()
knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(training_data, training_labels)
y_pred = knn.predict(validation_data)
print("Accuracy:",metrics.accuracy_score(validation_labels, y_pred))
#little better
Hey, not bad! We reached an accuracy of 60.2%.
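To see which genres the classifier confuses with each other, a confusion matrix is a natural follow-up check; a quick sketch:
from sklearn.metrics import confusion_matrix
# Rows are true genre codes, columns are predicted genre codes
print(confusion_matrix(validation_labels, y_pred))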
Based on our findings, we can see how music trends have changed over the years and how Spotify has grown. Looking at the top songs streamed on Spotify between 2010 and 2019, we can see that dance pop is generally a very popular genre, along with pop and Canadian pop. In recent years, boy bands and electropop have become more popular, and in 2019 pop overtook dance pop, which had been the most streamed genre in every previous year. Looking at the increase in top tracks, we can see how more people streamed songs on Spotify as its user base grew over the years.
We analyzed the top 5 subset of the data to evaluate whether the top five songs of each year were an expected sample of the entire dataset or had attributes that differed from the rest. After comparing the attributes of the top fives among themselves (and seeing a predictable distribution for each), we determined in a direct comparison between the top fives and the total data that the former is true: they are an expected sample. Their means were very similar to those of the entire dataset, without any significant variation.
Using a correlation heatmap, we saw that acousticness, energy, loudness, valence, and danceability had the strongest positive or negative correlations with each other. We then used seaborn and sklearn's linear regression to try to predict energy level from acousticness, and found that for an acousticness level x, we can predict the energy level y with the equation y = -0.44589593x + 77.02009125.
We also tried to predict the popularity of each song using sklearn's K-Nearest Neighbors and the variables acousticness, energy, loudness, valence, and danceability. The popularity prediction had a low accuracy of 9.1%, because we were working with a small dataset and trying to predict a continuous variable. We realized it would be better to predict a qualitative variable like music genre, and by using K-Nearest Neighbors again with the same variables, we were able to reach around 60.2% accuracy.
One of the next steps we could take for this project is to analyze a much larger dataset rather than just the top songs: we could also look at the lowest-ranked and least popular songs, compare them with the most popular, and predict why exactly certain songs failed, i.e., which of the song attributes had the biggest influence on the popularity score.
In addition, it would be useful to look at different decades of music, not just the 2010s but also the 2000s and even earlier, like the '90s and '80s, all the way back to the '50s. To accomplish this, however, we would be better off relying on a source other than Spotify for older songs, since Spotify may be biased towards newer releases. Trends across decades may well exist, and there is a wealth of data to be gained from a longer span of time.
Dataset of top Spotify songs from 2010 to 2019: https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year
Spotify API website explaining each song attributes: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/
Recordings of number of Spotify users in 2010: https://gigaom.com/2010/11/22/spotify-2010-revenues/
Statistics on the number of Spotify users from 2015 to 2020: https://www.statista.com/statistics/244995/number-of-paying-spotify-subscribers/
Black Eyed Peas Wikipedia: https://en.wikipedia.org/wiki/Black_Eyed_Peas
Kesha Wikipedia: https://en.wikipedia.org/wiki/Kesha
NumPy: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
Pandas: http://pandas.pydata.org/pandas-docs/stable/
Matplotlib: http://matplotlib.org/contents.html
Seaborn: https://seaborn.pydata.org/
Scikit Basic Decision Tree: https://scikit-learn.org/stable/modules/tree.html
Polynomial features / interaction terms in scikit-learn: https://chrisalbon.com/machine_learning/linear_regression/adding_interaction_terms/
K-Nearest Neighbor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Linear Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html