Analyzing the Top Spotify Songs of the 2010s

By: Chris Rha, Christy Yau, and Amar Mujumdar

1. Introduction

Table of Contents:

  1. Introduction
  2. Data
  3. Exploratory Data Analysis
    • A. Violin plots of each attribute throughout the years
    • B. Pie charts of genres throughout the years
    • C. Top artists
    • D. Top 5 tracks of each year
    • E. Correlations between attributes
    • F. K Nearest Neighbors
  4. Conclusion
  5. References

Music trends shift from year to year: the top songs, the popular genres, and the leading artists all change. We can look up the top songs from 2010 to 2019 online, but song titles and artist names alone do not tell us much about the trends behind them. Instead, we can use Spotify data on the most-streamed songs of each year from 2010 to 2019. Spotify is a popular music streaming app that gives listeners access to most of the music released worldwide. Its user base has grown substantially, with more than 113 million users as of the third quarter of 2019.

In this tutorial, we will use data science techniques to analyze the characteristics of top songs throughout the years. We will look at each song's danceability, valence, energy, acousticness, and more to find correlations among the top songs. We aim to leave the reader with a better understanding of music trends throughout the decade.

2. Data

The first step is to import the dataset of Spotify's top songs from 2010 to 2019. Luckily, the dataset is readily available on Kaggle, so we did not have to scrape it from a website. One oddity is that the file is not encoded in UTF-8, as most CSV datasets are, but in cp1252, an encoding common on Windows. We therefore pass encoding='cp1252' to read_csv so that the file loads without decoding errors and the text displays correctly.

With the dataset downloaded, we import the essential libraries: pandas to hold the dataframe, plus numpy, matplotlib, seaborn, and scikit-learn for the numerical and statistical analysis used later. We also suppress the warnings that deprecated functions called implicitly by these libraries might raise.
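To see why the encoding argument matters, here is a minimal sketch using only the Python standard library. The artist name is just an illustrative string with an accented character; the point is that bytes written as cp1252 are not valid UTF-8.

```python
# Illustrative string with a non-ASCII character (an assumption for the demo,
# not a row read from the CSV file).
name = "Beyoncé"

cp1252_bytes = name.encode("cp1252")  # one byte for the accented letter
utf8_bytes = name.encode("utf-8")     # two bytes for the accented letter

# Decoding cp1252 bytes as UTF-8 fails, which is what happens when a
# cp1252 file is read with the default encoding.
try:
    cp1252_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding with the matching codec recovers the original text.
roundtrip = cp1252_bytes.decode("cp1252")
print(decoded_ok, roundtrip, len(cp1252_bytes), len(utf8_bytes))
```

Passing the matching codec name to the reader is the same idea applied to the whole file.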

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings('ignore')
In [2]:
data = pd.read_csv("top10s.csv", encoding='cp1252')
data.head()
Out[2]:
   Unnamed: 0  title                 artist      top genre        year  bpm  nrgy  dnce  dB  live  val  dur  acous  spch  pop
0           1  Hey, Soul Sister      Train       neo mellow       2010   97    89    67  -4     8   80  217     19     4   83
1           2  Love The Way You Lie  Eminem      detroit hip hop  2010   87    93    75  -5    52   64  263     24    23   82
2           3  TiK ToK               Kesha       dance pop        2010  120    84    76  -3    29   71  200     10    14   80
3           4  Bad Romance           Lady Gaga   dance pop        2010  119    92    70  -4     8   71  295      0     4   79
4           5  Just the Way You Are  Bruno Mars  pop              2010  109    84    64  -5     9   43  221      2     4   78

Along with each song's title, artist, genre, and year, the dataset includes several numerical attributes that Spotify assigns. According to Spotify's Web API documentation, these attributes are:

  • beats per minute (bpm) - the tempo of the song, i.e. how fast it is
  • energy level (nrgy) - how energetic the song is
  • danceability (dnce) - based on tempo, rhythm, and beat strength; the higher the value, the easier it is to dance to the song
  • loudness (dB) - measured in decibels; the higher the value, the louder the song
  • liveness (live) - the higher the value, the more likely the song is a live recording
  • valence (val) - the musical positivity; the higher the value, the happier the track
  • duration (dur) - the length of the song in seconds
  • acousticness (acous) - the higher the value, the more acoustic the track
  • speechiness (spch) - the presence of spoken words; the higher the value, the more speech the track contains
  • popularity (pop) - how popular the song is

However, the first column in this dataset has an odd name, "Unnamed: 0". It is a running row number across the whole dataset, and since the rows within each year appear to be sorted by decreasing popularity, we will use it as the basis for a rank. Let's rename the column accordingly.

In [3]:
data = data.rename(columns = {'Unnamed: 0' : 'rank'})
data.head(20)
Out[3]:
rank title artist top genre year bpm nrgy dnce dB live val dur acous spch pop
0 1 Hey, Soul Sister Train neo mellow 2010 97 89 67 -4 8 80 217 19 4 83
1 2 Love The Way You Lie Eminem detroit hip hop 2010 87 93 75 -5 52 64 263 24 23 82
2 3 TiK ToK Kesha dance pop 2010 120 84 76 -3 29 71 200 10 14 80
3 4 Bad Romance Lady Gaga dance pop 2010 119 92 70 -4 8 71 295 0 4 79
4 5 Just the Way You Are Bruno Mars pop 2010 109 84 64 -5 9 43 221 2 4 78
5 6 Baby Justin Bieber canadian pop 2010 65 86 73 -5 11 54 214 4 14 77
6 7 Dynamite Taio Cruz dance pop 2010 120 78 75 -4 4 82 203 0 9 77
7 8 Secrets OneRepublic dance pop 2010 148 76 52 -6 12 38 225 7 4 77
8 9 Empire State of Mind (Part II) Broken Down Alicia Keys hip pop 2010 93 37 48 -8 12 14 216 74 3 76
9 10 Only Girl (In The World) Rihanna barbadian pop 2010 126 72 79 -4 7 61 235 13 4 73
10 11 Club Can't Handle Me (feat. David Guetta) Flo Rida dance pop 2010 128 87 62 -4 6 47 235 3 3 73
11 12 Marry You Bruno Mars pop 2010 145 83 62 -5 10 48 230 33 4 73
12 13 Cooler Than Me - Single Mix Mike Posner dance pop 2010 130 82 77 -5 70 63 213 18 5 73
13 14 Telephone Lady Gaga dance pop 2010 122 83 83 -6 11 71 221 1 4 73
14 15 Like A G6 Far East Movement dance pop 2010 125 84 44 -8 12 78 217 1 45 72
15 16 OMG (feat. will.i.am) Usher atl hip hop 2010 130 75 78 -6 36 33 269 20 3 72
16 17 Eenie Meenie Sean Kingston dance pop 2010 121 61 72 -4 11 83 202 5 3 71
17 18 The Time (Dirty Bit) The Black Eyed Peas dance pop 2010 128 81 82 -8 60 44 308 7 7 70
18 19 Alejandro Lady Gaga dance pop 2010 99 80 63 -7 36 37 274 0 5 69
19 20 Your Love Is My Drug Kesha dance pop 2010 120 61 83 -4 9 76 187 1 10 69

One of the rows from 2016 is a bad data point: it is not properly formatted and distorts the analysis. We only noticed this later on, so to avoid problems in the following steps we remove it here. The code below drops the 2016 row with the lowest dB value, which is the malformed row.

In [4]:
data = data.drop(data[data.year == 2016].dB.idxmin())
data[440:446]
Out[4]:
     rank  title             artist            top genre                  year  bpm  nrgy  dnce   dB  live  val  dur  acous  spch  pop
440   441  Picky - Remix     Joey Montana      latin                      2016  186    81    70   -3    37   69  225      9     7   29
441   442  Behind Your Back  Nelly Furtado     canadian latin             2016   98    47    80  -10     7   69  228     25     4   18
443   444  Shape of You      Ed Sheeran        pop                        2017   96    65    83   -3     9   93  234     58     8   87
444   445  Closer            The Chainsmokers  electropop                 2017   95    52    75   -6    11   66  245     41     3   86
445   446  Starboy           The Weeknd        canadian contemporary r&b  2017  186    59    68   -7    14   49  230     14    28   85
446   447  Treat You Better  Shawn Mendes      canadian pop               2017   83    82    44   -4    11   75  188     11    34   84
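Before dropping a row like this, it can help to look at it first. The following is a minimal sketch on a made-up toy frame (the titles and values are assumptions, not rows from the dataset) that uses the same idxmin-then-drop pattern as the cell above.

```python
import pandas as pd

# Toy stand-in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C", "Song D"],
    "year": [2015, 2016, 2016, 2016],
    "bpm": [120, 98, 0, 186],
    "dB": [-5, -10, -60, -3],
})

# Index label of the 2016 row with the smallest dB value.
bad_label = toy[toy.year == 2016].dB.idxmin()

# Inspect the suspicious row before removing it.
print(toy.loc[bad_label])

# drop() removes the row by label and leaves the other labels unchanged.
cleaned = toy.drop(bad_label)
print(len(toy), "->", len(cleaned))
```

Because drop keeps the original index labels, a gap remains where the row used to be, which is why the output above jumps from index 441 to 443.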

Let's count exactly how many distinct songs, artists, and genres the dataset contains.

In [5]:
number_songs=data.title.nunique()
number_artists=data.artist.nunique()
number_genres=data['top genre'].nunique()
print('number_songs =', number_songs)
print('number_artists =', number_artists)
print('number_genres =', number_genres)
number_songs = 583
number_artists = 184
number_genres = 50

The dataset has a different number of songs for each year, with 2013 being especially well represented. We want the rank to restart at 1 for each year rather than simply being the row number. A sample of the result, spanning the boundary between 2013 and 2014, is shown below.

In [6]:
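# Re-number the rank so that it restarts at 1 whenever the year changes.
# This relies on the rows of each year being contiguous.
year = 2010
counter = 1
for index, row in data.iterrows():
    if row['year'] != year:
        counter = 1
        year = row['year']
    data.loc[index, 'rank'] = counter
    counter += 1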

data[200:230]
Out[6]:
rank title artist top genre year bpm nrgy dnce dB live val dur acous spch pop
200 62 How Ya Doin'? (feat. Missy Elliott) Little Mix dance pop 2013 201 95 36 -3 37 51 211 9 48 50
201 63 Crazy Kids (feat. will.i.am) Kesha dance pop 2013 128 75 72 -4 13 50 229 4 4 46
202 64 Ooh La La (from "The Smurfs 2") Britney Spears dance pop 2013 128 57 69 -5 11 73 257 2 5 45
203 65 People Like Us Kelly Clarkson dance pop 2013 128 79 60 -5 36 61 259 4 4 45
204 66 Overdose Ciara dance pop 2013 107 70 77 -6 6 79 227 1 3 43
205 67 Right Now - Dyro Radio Edit Rihanna barbadian pop 2013 130 74 53 -6 24 45 186 0 4 42
206 68 Give It 2 U Robin Thicke dance pop 2013 127 83 67 -4 16 58 230 10 7 41
207 69 Foolish Games Jewel alaska indie 2013 132 34 51 -11 12 7 250 23 3 36
208 70 Outta Nowhere (feat. Danny Mercer) Pitbull dance pop 2013 95 84 71 -4 21 66 207 16 3 35
209 71 Freak Kelly Rowland atl hip hop 2013 104 78 65 -5 12 45 274 13 6 28
210 1 All of Me John Legend neo mellow 2014 120 26 42 -7 13 33 270 92 3 86
211 2 Stay With Me Sam Smith pop 2014 84 42 42 -6 11 18 173 59 4 85
212 3 Summer Calvin Harris dance pop 2014 128 86 60 -4 14 74 223 2 3 80
213 4 Happy - From "Despicable Me 2" Pharrell Williams dance pop 2014 160 82 65 -5 9 96 233 22 18 79
214 5 Rude MAGIC! pop 2014 144 76 77 -5 31 93 225 4 4 79
215 6 Shake It Off Taylor Swift pop 2014 160 80 65 -5 33 94 219 6 17 78
216 7 Dark Horse Katy Perry dance pop 2014 132 59 65 -6 17 35 216 0 5 78
217 8 Hey Brother Avicii big room 2014 125 78 55 -5 8 46 255 3 4 78
218 9 Maps Maroon 5 pop 2014 120 71 74 -6 6 88 190 2 3 78
219 10 Treasure Bruno Mars pop 2014 116 69 87 -5 32 94 179 4 4 77
220 11 Let Her Go Passenger folk-pop 2014 75 54 51 -7 10 24 253 39 6 77
221 12 Problem Ariana Grande dance pop 2014 103 81 66 -5 16 63 194 2 15 75
222 13 Pompeii Bastille metropopolis 2014 127 72 68 -6 27 57 214 8 4 73
223 14 Team Lorde art pop 2014 100 58 69 -7 31 42 193 17 9 73
224 15 Love Me Again John Newman pop 2014 126 89 50 -5 10 21 240 0 4 73
225 16 Latch Disclosure house 2014 122 73 50 -5 9 52 256 2 17 72
226 17 Adore You Miley Cyrus dance pop 2014 120 66 58 -5 11 20 279 11 3 72
227 18 Love Never Felt So Good Michael Jackson pop 2014 118 72 78 -6 7 71 246 13 4 71
228 19 Burn Ellie Goulding dance pop 2014 87 78 56 -5 11 33 231 31 4 71
229 20 She Looks So Perfect 5 Seconds of Summer boy band 2014 160 95 49 -4 33 44 202 0 13 71
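The per-year renumbering above can also be written without an explicit loop. This is a sketch of an alternative, not the method used in this notebook: it assumes the rows are already in the desired order within each year, and it runs on a small made-up frame.

```python
import pandas as pd

# Toy frame standing in for the real data (titles are placeholders).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "year": [2013, 2013, 2013, 2014, 2014],
})

# cumcount() numbers the rows inside each group starting at 0,
# so adding 1 gives a rank that restarts at 1 for every year.
toy["rank"] = toy.groupby("year").cumcount() + 1
print(toy)
```

On the real dataframe the equivalent call would be applied to data, grouping on its 'year' column.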

3. Exploratory Data Analysis

We want to analyze this data from several perspectives. First, how do the attributes of the top songs change over time? Second, how are they distributed around their central statistics (the mean or median)? Finally, we look at what the highest-ranked subset of the data has in common, meaning the top 5 songs of each year, as well as the most common genres.

A. Violin plots of each attribute throughout the years

In [7]:
cols = ['bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch', 'pop']
for var in cols:
    plt.figure(figsize = (10, 8))
    plt.title(var + " Throughout the Years")
    sns.violinplot(x = 'year', y = var, data = data)
    plt.show()

Looking first at beats per minute, the central tendency appears to have gradually decreased over the decade, although outliers exist on both ends. Energy has also decreased, but the energy of top songs is generally very high, mostly above 60, even though Spotify's scale goes down to zero. A few outliers exist here too, but the median marked by each violin's inner boxplot is robust to them.

Moving on to danceability, there is no clear trend over time; the distributions are centered near the middle of the range, with outliers in both directions. Likewise, loudness, recorded in decibels, is not really affected by year, and neither is liveness. Note that top songs mostly have low liveness, with the exception of a few outliers. Valence is spread widely across the whole range rather than being centered around a particular point, and this changes little over time, with only a slight increase.

In addition, song durations have decreased only slightly. It makes sense that durations are short: popular songs are also played on the radio and at events, where overly long tracks are impractical outside of special situations. Acousticness is also extremely low, typically close to zero, with the overwhelming majority of songs relying on electronic production; this has remained constant throughout the decade. Speechiness has also been very low, with 2013 showing no outliers. Lastly, popularity has not really changed, which makes sense since every year has some songs that are more popular than others.
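The visual impressions from the violin plots can be checked numerically with a per-year summary. The sketch below uses a made-up toy frame (the numbers are assumptions chosen to include one outlier) to show how the mean and the median react differently to extreme values.

```python
import pandas as pd

# Toy data: 2010 contains one very fast outlier (200 bpm).
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2011, 2011, 2011],
    "bpm": [120, 125, 200, 110, 115, 118],
})

# One row per year with both central statistics side by side.
summary = toy.groupby("year")["bpm"].agg(["mean", "median"])
print(summary)
```

In the toy data the outlier pulls the 2010 mean well above the median, which is why the median is the safer statistic to read off when outliers are present.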

B. Pie charts of genres throughout the years

In [8]:
for x in range(2010, 2020):
    data2 = data[data['year'] == x]
    # Count songs per genre once and keep the seven most common genres
    top_genres = data2['top genre'].value_counts().iloc[:7]
    plt.figure(figsize = (10, 8))
    plt.pie(top_genres, labels=top_genres.index,
            autopct='%1.1f%%', shadow=False, startangle=50)
    plt.title('Top 7 Genres For Year ' + str(x),fontsize=16)
    plt.show()

Looking at the pie charts, dance pop was the most common genre every year until 2019. Pop was consistently the second most common genre until 2019, when it overtook dance pop to become the most streamed genre. The exception is 2013, when boy band was the second most played genre and pop came in third. The dataset contains more boy band tracks in 2013 than in other years, as One Direction enjoyed immense global popularity that year, along with the boy band The Wanted. In 2016, pop does not appear on the pie chart at all, but this may be an artifact of how Spotify categorizes tracks: each track is given a single genre even when it could fall under several. For example, Canadian pop is essentially pop by a Canadian singer, such as Justin Bieber, Shawn Mendes, Drake, and many other popular singers with many top tracks.
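One way to explore the effect of the single-genre labels is to fold regional variants into a broader genre before counting. The sketch below is only an illustration: the mapping dictionary is a hypothetical choice made here, and the list of labels is a toy sample rather than the dataset.

```python
import pandas as pd

# Toy sample of genre labels.
genres = pd.Series(["dance pop", "pop", "canadian pop", "boy band",
                    "barbadian pop", "big room", "pop"])

# Hypothetical mapping: which labels to treat as plain "pop" is a judgment call.
regional_to_broad = {"canadian pop": "pop", "barbadian pop": "pop"}

# Labels without an entry in the mapping are kept unchanged.
coarse = genres.map(lambda g: regional_to_broad.get(g, g))

print(genres.value_counts())
print(coarse.value_counts(normalize=True))
```

With the regional labels merged, pop's share of the toy sample grows, which suggests how strongly the pie charts can depend on the labeling scheme.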

C. Top artists

In [9]:
plt.figure(figsize = (10, 8))
data['artist'].value_counts().head(10).plot.bar()
plt.show()

This bar graph shows the ten artists with the most top tracks, with Katy Perry coming in first with 17. Next, we look at how popular each artist's music was in each year and how the set of artists changed over time, since an artist can release many top hits in one year and none in later years. This can give us insight into how music trends shifted along with the artists and their respective genres.

In [10]:
for x in range(2010, 2020):
    data2 = data[data['year'] == x]
    plt.figure(figsize = (15, 8))
    plt.title('Artists and Their Top Tracks in ' + str(x))
    plt.scatter(data2['artist'], data2['pop'])
    plt.xticks(data2['artist'], data2['artist'], rotation='vertical',fontsize=12)
    plt.show()

Looking at the scatter plots of artists and their top tracks in each year, we can see the changes in artists throughout the years.

In 2010, Kesha had 4 top dance pop tracks, the most popular with a score of 80, and Lady Gaga had 3 top dance pop tracks, the most popular with a score of 79. The Black Eyed Peas also had 4 top dance pop tracks. In 2011, Lady Gaga had the most top dance pop tracks, with 5 songs of consistently high popularity, alongside many other dance pop artists such as Jennifer Lopez, Kesha, LMFAO, and Beyoncé. In later years, however, these artists rarely appeared in the charts. The Black Eyed Peas, LMFAO, Kesha, and many others never appeared again, which matches the decline of the dance pop genre seen in the pie charts earlier. One explanation is that these artists released no new music: according to Wikipedia, the Black Eyed Peas took an eight-year hiatus after their 2010 album, and Kesha took a five-year break after her 2012 album. Long hiatuses by major dance pop artists may therefore have contributed to the genre's decline among the top tracks.

We also see new artists joining the charts in later years. Ed Sheeran first appears in 2015 with 4 top tracks, including the most popular track of that year. Canadian pop was also the second most played genre in 2015: Justin Bieber totaled 9 top Canadian pop tracks, and The Weeknd appears on the plots for the first time with a Canadian pop track that ties Ed Sheeran for the highest popularity.

The hiatuses of some artists and the debuts of others change which genres are streamed most each year. It is also interesting to note the growth in the number of artists and tracks over time, as there are more points in the scatter plots for 2013 and later than for the earlier years. This could be due to Spotify's growth in users: according to gigaom.com and statista.com, Spotify had about 650,000 paying users at the end of 2010, 18 million users in 2015, and 113 million users in the third quarter of 2019.

D. Top 5 tracks of each year

We also want to analyze a subset of the top songs: the top five of each year. These songs are the best of the best and the most exemplary part of the dataset, so we want to see whether their scores are higher than the overall means or closer to the expected values (and thus a good representation of the overall data). First, let's select the top fives using the rank.

In [11]:
top5 = data[data['rank'] <= 5]
top5.head(20)
Out[11]:
rank title artist top genre year bpm nrgy dnce dB live val dur acous spch pop
0 1 Hey, Soul Sister Train neo mellow 2010 97 89 67 -4 8 80 217 19 4 83
1 2 Love The Way You Lie Eminem detroit hip hop 2010 87 93 75 -5 52 64 263 24 23 82
2 3 TiK ToK Kesha dance pop 2010 120 84 76 -3 29 71 200 10 14 80
3 4 Bad Romance Lady Gaga dance pop 2010 119 92 70 -4 8 71 295 0 4 79
4 5 Just the Way You Are Bruno Mars pop 2010 109 84 64 -5 9 43 221 2 4 78
51 1 A Thousand Years Christina Perri dance pop 2011 139 41 42 -7 11 16 285 31 3 81
52 2 Someone Like You Adele british soul 2011 135 33 56 -8 10 28 285 89 3 80
53 3 Give Me Everything Pitbull dance pop 2011 129 94 67 -3 30 53 252 19 16 79
54 4 Just the Way You Are Bruno Mars pop 2011 109 84 64 -5 9 43 221 2 4 78
55 5 Rolling in the Deep Adele british soul 2011 105 76 73 -5 5 52 228 13 3 76
104 1 Titanium (feat. Sia) David Guetta dance pop 2012 126 79 60 -4 13 30 245 7 10 80
105 2 Locked Out of Heaven Bruno Mars pop 2012 144 70 73 -4 31 87 233 5 4 79
106 3 Paradise Coldplay permanent wave 2012 140 59 45 -7 8 20 279 5 3 79
107 4 Payphone Maroon 5 pop 2012 110 75 74 -5 29 55 231 2 4 79
108 5 What Makes You Beautiful One Direction boy band 2012 125 79 73 -2 6 89 200 1 7 78
139 1 Underneath the Tree Kelly Clarkson dance pop 2013 160 81 51 -5 21 69 230 0 5 88
140 2 Wake Me Up Avicii big room 2013 124 78 53 -6 16 64 247 0 5 85
141 3 Story of My Life One Direction boy band 2013 121 66 60 -6 12 29 245 23 5 81
142 4 Just Give Me a Reason (feat. Nate Ruess) P!nk dance pop 2013 95 55 78 -7 13 44 243 35 5 81
143 5 Hall of Fame The Script celtic rock 2013 85 87 42 -4 12 63 203 7 6 80

Next, we want to look at how the Spotify scores of the top fives are distributed, to see whether each attribute is roughly normally distributed or skewed in one direction.

In [12]:
top5.hist(figsize = (15, 13), column = ['bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch'])
plt.show()

The histograms above look as expected and are consistent with the violin plots of the entire dataset. Duration, loudness, energy, and beats per minute tend toward a normal distribution, while speechiness and acousticness are concentrated at the low end with a long tail toward high values (right-skewed), since few top songs come close to the upper end of Spotify's scale. Danceability and valence are spread more evenly across their ranges.

We also want to compare the mean scores of the top fives with those of the entire dataset, which we will do below.

In [13]:
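# numeric_only=True skips the text columns (title, artist, top genre),
# which recent pandas versions refuse to average
means = pd.DataFrame(top5.mean(numeric_only=True))
means['total'] = data.mean(numeric_only=True)
means = means.rename(columns = {0 : 'top 5'})
means = means.drop(['year', 'pop', 'rank'])
means.plot.bar(rot =0, figsize = (13, 10))
means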
Out[13]:
        top 5       total
bpm    117.16  118.742525
nrgy    65.46   70.621262
dnce    65.12   64.486711
dB      -5.48   -5.488372
live    15.56   17.803987
val     55.56   52.312292
dur    229.60  224.671096
acous   24.58   14.350498
spch     8.88    8.372093

The top fives do not differ drastically from the full dataset in most variables. The exception is acousticness, which is about 70% higher than the mean of the entire dataset (24.58 versus 14.35). One possible reason is an outlier among the top fives that strongly influences the mean; another is simply that the very top songs tend to be more acoustic, which may influence popularity. Overall, though, the top fives are a good representation of the data as a whole rather than extreme examples of particular scores.
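A quick way to test the outlier explanation is to compare the mean with the median, which is far less sensitive to a few extreme values. The scores below are made up for illustration (mostly low values plus two very acoustic songs); they are not the actual top-five data.

```python
from statistics import mean, median

# Hypothetical acousticness scores: mostly low, with two very acoustic songs.
scores = [0, 2, 5, 7, 13, 19, 89, 92]

avg = mean(scores)
mid = median(scores)
print("mean =", avg)
print("median =", mid)
```

If the same comparison on the real top fives showed a median close to the overall value, that would point toward a few very acoustic hits driving the higher mean rather than a general shift.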

E. Correlations between attributes

In [14]:
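data2 = data.copy()
data2.drop('rank', axis =1, inplace=True)
# Correlations are only defined for the numeric columns
# (the numeric_only argument requires pandas 1.5 or later)
corr_matrix=data2.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(15,10))
sns.heatmap(corr_matrix, annot=True, linewidths=.5, ax=ax);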

We have created a heatmap showing the correlations between each pair of attributes. From this heatmap, we can see that the attributes with the strongest correlations are energy, acousticness, loudness, danceability, and valence. We will use these attributes for further analysis.
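Reading the strongest pairs off a heatmap by eye can be error-prone, so here is a sketch of how they could be listed programmatically. It runs on a small toy frame (six illustrative rows, three columns) rather than the full dataset, and the variable names are chosen here for the example.

```python
import numpy as np
import pandas as pd

# Small toy frame with three of the attributes.
toy = pd.DataFrame({
    "nrgy":  [89, 93, 84, 37, 26, 42],
    "acous": [19, 24, 10, 74, 92, 59],
    "dB":    [-4, -5, -3, -8, -7, -6],
})
corr = toy.corr()

# Keep only the upper triangle (above the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort pairs by the absolute strength of the correlation, strongest first.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation, such as the one between energy and acousticness, is as informative as a strong positive one.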

In [15]:
fig, ax = plt.subplots(figsize=(10,8))
sns.regplot(x=data.acous,y=data.nrgy, ax=ax).set_title('Acoustics of a Song Versus the Energy Level',fontsize=15)
plt.xlabel('Acoustics',fontsize=12);
plt.ylabel('Energy Level',fontsize=12);

This relationship makes intuitive sense, as we would expect the energy level of a song to go down as its acousticness increases. One reason is that electronically produced music offers a wide variety of sounds, which can also influence loudness and danceability, two of the other attributes of the songs.

In [16]:
# Fit a linear regression that predicts energy from acousticness
linreg = LinearRegression().fit(data[['acous']],data[['nrgy']])
acousPred = linreg.predict(data[['acous']])
# Print the slope, intercept, and r2 of the fitted line
print('slope =', linreg.coef_)
print('intercept =', linreg.intercept_)
print('r2 = %.2f'
      % r2_score(data[['nrgy']], acousPred))
# Plot the fitted line on top of the scatter plot
plt.figure(figsize = (10, 8))
plt.scatter(data[['acous']], data[['nrgy']])
plt.plot(data[['acous']],acousPred)
plt.title('Sklearn\'s Regression of Acoustics vs. Energy Levels')
plt.xlabel('Acoustics',fontsize=12);
plt.ylabel('Energy Level',fontsize=12);
plt.show()
slope = [[-0.44589593]]
intercept = [77.02009125]
r2 = 0.33

As expected, sklearn's regression line looks very similar to the one drawn by seaborn. Furthermore, because we have the slope and intercept, we can predict the energy level for a given acousticness. For instance, an acousticness of 50 gives an expected energy level of roughly 55 (77.02 - 0.446 * 50 is about 54.7).
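The prediction can be reproduced by hand from the printed coefficients. The helper name predict_energy below is introduced only for this sketch; the slope and intercept are the values printed by the cell above.

```python
# Coefficients copied from the regression output above.
slope = -0.44589593
intercept = 77.02009125

def predict_energy(acous):
    """Energy predicted by the fitted line for a given acousticness score."""
    return slope * acous + intercept

for acous in (0, 50, 100):
    print(acous, round(predict_energy(acous), 2))
```

Within a fitted sklearn model, the same numbers would come from calling linreg.predict on a one-column input.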

F. K Nearest Neighbors

In [17]:
all_data=data[["nrgy","acous","dB","dnce","val"]]
labels=data["pop"]
training_data,validation_data,training_labels,validation_labels=train_test_split(all_data,
                                   labels,
                                   test_size=.2,
                                   random_state=100)
k_list=range(1,101)
accuracies=[]
for k in range(1,101):
    classifier=KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data,training_labels)
    accuracies.append(classifier.score(validation_data,validation_labels))
plt.figure(figsize = (10, 8))
plt.plot(k_list,accuracies)
plt.xlabel('Amount of Neighbors')
plt.ylabel('Accuracy Score')
plt.title('What K Value is Best?')
plt.show()

Here, we used sklearn's K Nearest Neighbors algorithm to try to predict the popularity value from energy, acousticness, loudness, danceability, and valence. We chose these attributes because they were the most strongly correlated with each other in the correlation heatmap. We split the data into 80% training data and 20% validation data, fit the classifier for every k from 1 to 100, and plotted the accuracy for each k value.
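To make the idea behind K Nearest Neighbors concrete, here is a minimal pure-Python sketch of the voting step. It is not sklearn's implementation (which uses optimized data structures and its own tie-breaking rules); the function name, the two-feature points, and the labels are all made up for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Return the most common label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the k nearest neighbors.
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy training set: (energy, acousticness) pairs with made-up labels.
train_points = [(89, 19), (84, 10), (92, 0), (33, 89), (26, 92), (42, 59)]
train_labels = ["high energy"] * 3 + ["acoustic"] * 3

print(knn_predict(train_points, train_labels, (80, 15), k=3))
print(knn_predict(train_points, train_labels, (30, 85), k=3))
```

A small k follows the closest examples tightly, while a large k smooths the vote toward the most common class, which is why the accuracy curve is plotted against k.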

In [18]:
knn = KNeighborsClassifier(n_neighbors=39)
knn.fit(training_data, training_labels)
y_pred = knn.predict(validation_data)
print("Accuracy:",metrics.accuracy_score(validation_labels, y_pred))
#this is to be expected because it is trying to predict a continuous variable
Accuracy: 0.09090909090909091

Yikes! The best accuracy we could get is only 9.09%. However, this is to be expected: we are working with a fairly small dataset, and popularity is a numeric score with many distinct values, each of which the classifier treats as a separate class. Let's try this again with a categorical variable.
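Part of the reason the score looks so poor is that accuracy only counts exact matches, so predicting 79 for a true popularity of 80 is scored as a miss. The sketch below contrasts exact-match accuracy with the mean absolute error on made-up predictions; the numbers are assumptions, not output from the model above.

```python
# Hypothetical true and predicted popularity scores.
true_pop = [83, 82, 80, 79, 78]
pred_pop = [80, 82, 77, 79, 73]

# Exact-match accuracy: fraction of predictions that hit the score exactly.
accuracy = sum(t == p for t, p in zip(true_pop, pred_pop)) / len(true_pop)

# Mean absolute error: average distance between prediction and truth.
mae = sum(abs(t - p) for t, p in zip(true_pop, pred_pop)) / len(true_pop)

print("accuracy =", accuracy)
print("mean absolute error =", mae)
```

For a numeric target, an error measure of this kind usually says more about prediction quality than accuracy does.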

In [19]:
data.rename(columns = {'top genre':'topgenre'}, inplace = True) 
top_genre_mapping={"dance pop": 0,
                 "pop": 1,
                 "canadian pop": 2,
                 "barbadian pop": 3,
                 "boy band": 4,
                 "electropop": 5,
                 "big room": 6,
                 "british soul": 7,
                 "neo mellow": 8,
                 "canadian contemporary r&b": 9,
                 "art pop": 10,
                 "hip pop": 11,
                 "complextro": 12,
                 "australian dance": 13,
                 "atl hip hop": 14,
                 "edm": 15,
                 "australian pop": 16,
                 "hip hop": 17,
                 "latin": 18,
                 "permanent wave": 19,
                 "tropical house": 20,
                 "colombian pop": 21,
                 "electronic trap": 22,
                 "candy pop": 23,
                 "folk-pop": 24,
                 "indie pop": 25,
                 "acoustic pop": 26,
                 "canadian hip hop": 27,
                 "detroit hip hop": 28,
                 "electro": 29,
                 "brostep": 30,
                 "belgian edm": 31,
                 "baroque pop": 32,
                 "escape room": 33,
                 "downtempo": 34,
                 "danish pop": 35,
                 "chicago rap": 36,
                 "australian hip hop": 37,
                 "moroccan pop": 38,
                 "metropopolis": 39,
                 "irish singer-songwriter": 40,
                 "contemporary country": 41,
                 "house": 42,
                 "french indie pop": 43,
                 "electro house": 44,
                 "hollywood": 45,
                 "alternative r&b": 46,
                 "canadian latin": 47,
                 "celtic rock": 48,
                 "alaska indie": 49}
labels=data.topgenre.map(top_genre_mapping)
training_data,validation_data,training_labels,validation_labels=train_test_split(all_data,
                                   labels,
                                   test_size=.2,
                                   random_state=100)
k_list=range(1,101)
accuracies=[]
for k in range(1,101):
    classifier=KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data,training_labels)
    accuracies.append(classifier.score(validation_data,validation_labels))
plt.figure(figsize = (10, 8))
plt.plot(k_list,accuracies)
plt.xlabel('Amount of Neighbors')
plt.ylabel('Accuracy Score')
plt.title('What K Value is Best?')
plt.show()

#the accuracy isn't bad; it is only low because there are many genres that have just 1 top song
#let's get rid of the genres that only have 1 or 2 songs to see how accuracy is affected.

Here, we used sklearn's K Nearest Neighbors algorithm to try to predict the music genre from energy, acousticness, loudness, danceability, and valence. We split the data 8:2 again and plotted the accuracy for each k value. We reached an accuracy of only around 55.2%, largely because many of the genres listed have only one or two songs among the top songs of each year. This is a consequence of using a small dataset with a strong selection bias, since it contains only popular songs. Let's try it one more time, taking out the genres that have only one or two songs in the dataset.
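The rare genres can also be found from the data instead of being typed out by hand. The following is a sketch on a toy column (the genre counts are made up), assuming the same topgenre column name and the same rule of keeping genres with more than two songs.

```python
import pandas as pd

# Toy column: two common genres and three rare ones.
toy = pd.DataFrame({
    "topgenre": ["dance pop"] * 4 + ["pop"] * 3
                + ["celtic rock", "alaska indie", "house", "house"]
})

# Count songs per genre and keep the genres with more than two songs.
counts = toy["topgenre"].value_counts()
common = counts[counts > 2].index
filtered = toy[toy["topgenre"].isin(common)]

# Integer codes for the remaining genres, most frequent genre first.
mapping = {genre: code for code, genre in enumerate(common)}
print(mapping)
print(len(toy), "->", len(filtered))
```

Deriving the list this way keeps the filter and the label mapping in sync if the dataset changes.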

In [20]:
# Genres with only one or two songs in the dataset
rare_genres = ["electronic trap","candy pop","folk-pop","indie pop","acoustic pop",
               "canadian hip hop","detroit hip hop","electro","brostep","belgian edm",
               "baroque pop","escape room","downtempo","danish pop","chicago rap",
               "australian hip hop","moroccan pop","metropopolis","irish singer-songwriter",
               "contemporary country","house","french indie pop","electro house",
               "hollywood","alternative r&b","canadian latin","celtic rock","alaska indie"]
data2 = data[~data['topgenre'].isin(rare_genres)]

all_data=data2[["nrgy","acous","dB","dnce","val"]]
top_genre_mapping={"dance pop": 0,
                 "pop": 1,
                 "canadian pop": 2,
                 "barbadian pop": 3,
                 "boy band": 4,
                 "electropop": 5,
                 "big room": 6,
                 "british soul": 7,
                 "neo mellow": 8,
                 "canadian contemporary r&b": 9,
                 "art pop": 10,
                 "hip pop": 11,
                 "complextro": 12,
                 "australian dance": 13,
                 "atl hip hop": 14,
                 "edm": 15,
                 "australian pop": 16,
                 "hip hop": 17,
                 "latin": 18,
                 "permanent wave": 19,
                 "tropical house": 20,
                 "colombian pop": 21}
labels=data2.topgenre.map(top_genre_mapping)
training_data,validation_data,training_labels,validation_labels=train_test_split(all_data,
                                   labels,
                                   test_size=.2,
                                   random_state=100)
k_list=range(1,101)
accuracies=[]
for k in range(1,101):
    classifier=KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data,training_labels)
    accuracies.append(classifier.score(validation_data,validation_labels))
plt.figure(figsize = (10, 8))
plt.plot(k_list,accuracies)
plt.xlabel('Amount of Neighbors')
plt.ylabel('Accuracy Score')
plt.title('What K Value is Best?')
plt.show()
In [21]:
knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(training_data, training_labels)
y_pred = knn.predict(validation_data)
print("Accuracy:",metrics.accuracy_score(validation_labels, y_pred))
#little better
Accuracy: 0.6017699115044248

Hey, not bad! We reached an accuracy of 60.2%.
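One way to put an accuracy figure in context is to compare it with a baseline that always predicts the most frequent class. The sketch below computes that baseline on made-up labels; the counts are assumptions and do not reflect the genre distribution of the validation set used above.

```python
from collections import Counter

# Hypothetical validation labels: one genre dominates.
labels = ["dance pop"] * 5 + ["pop"] * 3 + ["boy band"] * 2

# A classifier that always answers with the most common label
# is right exactly as often as that label occurs.
most_common_label, count = Counter(labels).most_common(1)[0]
baseline_accuracy = count / len(labels)

print(most_common_label, baseline_accuracy)
```

Running the same calculation on validation_labels would show how much of the reported accuracy goes beyond simply guessing the dominant genre.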

4. Conclusion

Based on our findings, we can see how music trends have changed throughout the years and how Spotify has grown. Looking at the top songs streamed on Spotify from 2010 to 2019, dance pop is generally the most popular genre, along with pop and Canadian pop. In recent years, boy bands and electropop have become more popular, and in 2019 pop became the most streamed genre, overtaking dance pop, which had been first in every previous year. Looking at the increase in the number of top tracks, we can see that more people streamed songs on Spotify as its user base grew.

We analyzed the top 5 subset of the data to evaluate whether the top five songs of each year were a typical sample of the entire dataset or whether their attributes differed from the rest of the data. After comparing the attributes of the top fives among themselves (and seeing a predictable distribution for each), we determined in a direct comparison between the top fives and the total data that the former is true: they are a typical sample. Their means were very similar to those of the entire dataset, with acousticness as the one notable exception.

Using a correlation heatmap, we were able to see that acousticness, energy, loudness, valence, and danceability had the strongest positive or negative correlations with each other. We then used seaborn and sklearn's linear regression functions to try to predict energy level based on acousticness. We found that for an acousticness level x, we could predict the energy level y with the equation y = -0.44589593x + 77.02009125.

We also tried to predict the popularity level of each song using sklearn's K-Nearest Neighbors classifier and the variables acousticness, energy, loudness, valence, and danceability. The popularity prediction had a low accuracy of 9.1% because we were working with a small dataset and were trying to predict a numeric score with many distinct values. We realized that it would be better to predict a categorical variable such as music genre, and by using K-Nearest Neighbors again with the same variables, we were able to reach an accuracy of around 60.2%.

One possible next step for this project is to analyze a much larger dataset rather than only the top songs. We could then also look at the worst ranked and least popular songs, compare them with the most popular ones, and try to predict why certain songs failed, that is, which song attributes had the biggest influence on the popularity score.

In addition, it would be useful to look at different decades of music: not just the 2010s, but also the 2000s and older decades such as the 90s and 80s, all the way back to the 50s. To accomplish this, however, it would be best to rely on a database other than Spotify for older songs, since Spotify may be biased towards more recently released songs. Trends across decades may well exist, and a larger span of time would give us far more data to work with.