Introduction


We have collected data from different sources in order to understand and analyse the underlying mechanics of the Eurovision Song Contest. The data covers 1998 to 2012 and includes both the finals and, from 2004 when they were first introduced, the semi-finals. We use musical attributes from Echonest's API and song lyrics from musiXmatch to show various peculiarities and trends in the data.

Using the gathered metadata and historical voting data, we try to predict the winner of the upcoming Eurovision Song Contest 2013. We also try to predict the winner of the Eurovision Song Contest 2012 to evaluate our model's performance.

Below we show a range of visualizations and findings resulting from our analysis. We also explain some of the techniques we have used.

Happiness


Analyzing the lyrics: As lyrics are a central part of a song, analyzing their text could reveal some interesting information and connections. All the English songs in the Eurovision Song Contest from 1998 to 2012 have been analysed for happiness: if a song is identified as English in the dataset, we analyze its lyrics. To calculate a song's average happiness we use a word list in which words have been rated for happiness by humans. We look up the happiness score of each individual word in a song, and from these scores we calculate the average happiness of the song. These average happiness scores can be seen in the graph below:
This plot shows the same average happiness as the plot above, but on a map of Europe. The brighter the green, the happier the country's lyrics are. White marks countries that do not sing in English or do not participate in the Eurovision Song Contest.

This plot shows the happiness broken down by the gender of the artist. The categories are very close to each other: the difference between the categories "Both" and "Male" is only 0.11 in happiness score.
Frequency distribution on Top 10 happy songs:
The word cloud has been generated from the 10 most positive songs. A frequency distribution has then been made on the words to show what the theme is in the happiest songs. In this case, love clearly conquers all.
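As a rough illustration of how such a frequency distribution can be computed, here is a minimal Python sketch. The stop-word list, the tokenisation rule and the sample snippets are our own assumptions for the example and are not taken from the original analysis.

```python
from collections import Counter
import re

# Hypothetical stop-word list; the analysis does not say which words were filtered out.
STOP_WORDS = {"the", "a", "and", "i", "you", "to", "of", "in", "is", "my", "me", "it"}

def word_frequencies(lyrics_list, top_n=10):
    """Count how often each non-stop word occurs across several lyrics."""
    counts = Counter()
    for lyrics in lyrics_list:
        # Lowercase and keep runs of letters/apostrophes as words.
        words = re.findall(r"[a-z']+", lyrics.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up lyric snippets, for illustration only.
sample = ["Love is all, love is everything", "My love, my life"]
top_words = word_frequencies(sample, top_n=3)
```

The resulting (word, count) pairs are what a word cloud would scale its font sizes by.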

Sasha Son with the song "Love" from Lithuania in 2009 is definitely the song with the happiest words and text in our analysis. The song can be heard here and the lyrics can be found here for those with too much time on their hands.
Frequency distribution on Top 10 sad songs:
Among the 10 saddest songs according to our sentiment analysis, the prize goes to Natalia Podolskaya with "Nobody hurt no one".

The song can be heard here and the lyrics read here.

One can also see that the words are generally more negative here. Positive words still dominate the word cloud, but words like "without", "fight", "stay", "need" and many others are also present.


Friendship analysis


One of the most interesting things about the data available from the Eurovision Song Contest is the underlying information about cultural and geographical friendships. When looking at the voting patterns over time, it is clear that a lot of the votes are given based on alliances rather than on the actual quality of the entries.

We want to infer the latent friendship structure in the data, by looking at the voting data from 1998 to 2012. Our approach is to infer estimates for the actual quality of the entries, and then compare this to the votes given - consequently showing who is voting fairly, and who is biased. Before boring you with the math, here is an interactive graph showing the friendship relations between the countries of the Eurovision Song Contest!

Choose the threshold of the weights in the graph:

This graph shows the friendship biases of the countries in the Eurovision Song Contest. A friendly relationship from one country $A$ to another country $B$ is represented by a directed edge from $A$ to $B$. Using the slider, you can set the threshold for how weak a friendship may be and still be shown. Dragging the slider all the way to the left includes a lot of the weaker friendships, and dragging it all the way to the right leaves only the strongest friendships.
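The thresholding behind the slider can be sketched in a few lines of Python. The edge representation (a dictionary keyed by giver and receiver) and the weights are hypothetical and only serve the example.

```python
def strong_friendships(bias_edges, threshold):
    """Keep only directed edges (giver, receiver) whose weight is at least `threshold`."""
    return {edge: w for edge, w in bias_edges.items() if w >= threshold}

# Made-up weights for three placeholder countries; not results from the analysis.
example_edges = {("A", "B"): 2.5, ("B", "C"): 4.0, ("C", "A"): 0.8}
visible = strong_friendships(example_edges, threshold=2.0)
```

Raising `threshold` corresponds to dragging the slider to the right: fewer, stronger edges remain.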

One of the interesting things to observe in the friendship graph is the often discussed "Eastern Mafia", containing for example Croatia, Serbia, Macedonia and Bosnia and Herzegovina. These eastern countries have a lot of internal friendships and are often criticized for being biased.

Another interesting thing is the chain of friendships between Sweden and Finland, Finland and Estonia, Estonia and Latvia, Latvia and Lithuania. Geographically, Sweden is connected to Finland, which in turn is connected to Estonia, which is connected to Latvia which finally is connected to Lithuania.


Inferring friendships and cultural biases


We start by making some mathematical assumptions about the score an entry receives. We assume that the score of an entry $s$ can be described as $$\text{score}(s) = \text{quality}(s) + \sum_{c \in V} \left[ \text{bias}(c, \text{country}(s)) \right] + \varepsilon(s)$$ where $\text{quality}(s)$ is the actual latent quality of the entry, $V$ is the set of voting countries, $\text{bias}(c, \text{country}(s))$ is a measure of the bias of country $c$ towards the country that entered $s$, and $\varepsilon(s)$ captures other effects which we cannot model directly, for example the advantage of having a song in English or of being the host country. Our method for inferring friendship relations between countries is not able to remove the effects captured by $\varepsilon(s)$. Instead it tries to infer $$\text{estimated quality}(s) = \text{quality}(s) + \varepsilon(s)$$ from which we can estimate the biases by subtracting the estimated qualities from the actual voting data. The methods we apply come from information filtering, and have applications in, for example, sensor networks and recommender systems.
The task: Given a voting matrix $X$ where the entry $X_{ij}$ is the number of points given to entry $i$ by country $j$; infer the actual qualities of the entries.
We have implemented and tested two methods for solving this task, one by Cristobald de Kerchove and Paul Van Dooren, and one by Mohammad Allahbakhsh and Aleksandar Ignjatovic. We will only describe the first one here. The method is an iterative fixed-point algorithm, which can be described using two formulas: $${\bf r}^{(t+1)} = \frac{[{\bf X}{\bf w}^{(t)}]}{[\sum_{j=1}^m {\bf w}_j^{(t)}]}$$ $${\bf w}^{(t+1)} = {\bf 1} - k^{(t)} \frac{1}{n} \begin{pmatrix}||{\bf X}_1 - {\bf r}^{(t+1)}||_2^2 \\ \vdots\\ ||{\bf X}_m - {\bf r}^{(t+1)}||_2^2\end{pmatrix}$$ Here ${\bf r}$ is the vector of estimated qualities of the entries, ${\bf w}$ is the vector of trustworthiness scores of the countries, and ${\bf X}$ is the matrix of votes with a row for each entry and a column for each voting country, so that ${\bf X}_j$ is the column of votes cast by country $j$. The number of entries is $n$ and the number of voting countries is $m$. $k^{(t)}$ is a scaling parameter defined as $k^{(t)}: \min_j {\bf w}_j^{(t)} = 0$, that is, it is chosen such that the least trusted voter gets weight zero. The first formula computes the quality estimates as weighted averages of the votes, where the weights are the trustworthiness of the voters. The second formula updates the trustworthiness by considering the distance between a voter's votes and the estimated true qualities. Starting with ${\bf w}^{(0)} = {\bf 1}$ (initially giving all voters maximum trustworthiness) and iterating the two formulas above until convergence produces estimates of the true underlying qualities of the entries.
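To make the iteration concrete, here is a minimal pure-Python sketch of the two update formulas. It is not our original implementation: the function name, the stopping rule, the tolerance and the handling of degenerate cases are assumptions made for the example, and $k$ is taken to be one over the largest distance so that the least trusted voter ends up with weight zero.

```python
def iterative_filtering(X, max_iter=100, tol=1e-9):
    """Estimate entry qualities r and voter trust weights w from a vote matrix.

    X is a list of n rows (entries), each holding m votes (one per voter).
    """
    n, m = len(X), len(X[0])
    w = [1.0] * m  # w^(0) = 1: every voter starts fully trusted
    r = [0.0] * n
    for _ in range(max_iter):
        total = sum(w)
        # First formula: qualities are trust-weighted averages of the votes.
        r = [sum(X[i][j] * w[j] for j in range(m)) / total for i in range(n)]
        # Mean squared distance between each voter's votes and the estimates.
        d = [sum((X[i][j] - r[i]) ** 2 for i in range(n)) / n for j in range(m)]
        d_max = max(d)
        if d_max == 0:  # all voters agree exactly; nothing to penalise
            break
        # Second formula, with k chosen so the least trusted voter gets
        # weight 0 (an assumption about how k is picked).
        k = 1.0 / d_max
        w_new = [1.0 - k * dj for dj in d]
        if sum(w_new) == 0:  # degenerate case: all voters equally far away
            break
        converged = max(abs(a - b) for a, b in zip(w_new, w)) < tol
        w = w_new
        if converged:
            break
    return r, w

# Toy example: three voters agree, the fourth votes the opposite way.
toy_votes = [[10, 10, 10, 0], [5, 5, 5, 5], [0, 0, 0, 10]]
r, w = iterative_filtering(toy_votes)
```

In this toy example the dissenting fourth voter ends up with zero trust and the quality estimates follow the three agreeing voters. The bias of voter `j` towards entry `i` can then be read off as the residual `X[i][j] - r[i]`.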


The best and the worst songs


Having calculated the estimated quality ${\bf r}$ of the songs, we can now find the best and worst performing artists in the entire history of the Eurovision Song Contest:

Worst songs


Best songs:


Here we feel the need to note that our model is not able to distinguish between actual song quality, and the popularity an entry can get from being a gimmick (we are looking at you Lordi!).

That is so unfair!


When building the friend graph, we estimate trustworthiness scores ${\bf w}$ for each country. Let's have a look at which countries are the most fair and the most unfair:
| Fairness rank | Country |
|---|---|
| #1 (most fair) | Monaco |
| #2 | Slovakia |
| #3 | Poland |
| #4 | Belarus |
| #5 | Czech Republic |
| ... | ... |
| #44 | Montenegro |
| #45 | Serbia |
| #46 | Iceland |
| #47 | Austria |
| #48 (most unfair) | Bulgaria |


Exploratory Analysis


Should you sing in English, or maybe host the show?


We would like to explore whether it is an advantage to have a song in English, and whether it is an advantage to be the host country. To determine this, we perform a Wilcoxon Rank Sum Test on the distributions of song qualities for the different groups. The Wilcoxon Rank Sum Test is a non-parametric test (that is, it does not assume the data to be normally distributed) for determining whether two samples come from distributions with different centers.

Performing a one-sided Wilcoxon Rank Sum Test on the song qualities of songs in English versus the song qualities of songs not in English yields a p-value of 0.01066 (meaning that, if there were no real difference between the two groups, the probability of observing a difference in centers at least as large as the one we observe would be around 1%). Since this p-value is less than 0.05, we conclude that songs in English perform significantly better than songs not in English (significance level 5%).

Doing the same for the songs by host-countries and songs by non-host countries yields a p-value of 0.04987, showing a significant advantage of being the host country (significance level 5%).
Plots showing the song qualities of songs in different categories. We have added some noise to the categorical variables so the plots are easier to read. The red horizontal lines show the mean value for each category.
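For readers who want to see what the test computes, here is a self-contained Python sketch of a one-sided rank-sum test. It uses the normal approximation without tie or continuity corrections, so its p-values will differ slightly from those of a statistics package (for instance SciPy's `scipy.stats.mannwhitneyu` with `alternative="greater"`). The function name and the sample numbers are made up for the illustration and are not our song qualities.

```python
import math

def rank_sum_test_greater(a, b):
    """One-sided rank-sum test of whether sample `a` tends to be larger than `b`.

    Returns the U statistic for `a` and an approximate p-value.
    """
    pooled = sorted((value, idx) for idx, value in enumerate(a + b))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share their average rank
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # U statistic of sample a
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal
    return u, p_value

# Made-up quality values, for illustration only.
group_a = [6, 7, 8, 9, 10]
group_b = [1, 2, 3, 4, 5]
u_stat, p_value = rank_sum_test_greater(group_a, group_b)
```

A small p-value suggests that values in the first group tend to be larger than values in the second.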


Prediction


We wanted to see if it was possible to estimate the quality of a song when disregarding the voting data (we subtracted the friendship biases from the voting data to obtain the estimates of song quality). To make this prediction, we downloaded data from three different sources:
  1. Competition data set from Kaggle.com
  2. Song meta-data from Echonest
  3. Lyrics from musiXmatch
The Kaggle data included variables like the gender of the artist, whether it was a solo or group performance, the geographical regions of the countries and the language of the song. The data from Kaggle.com was from 1998 to 2010, so we manually added meta-data like song titles and geographical regions for 2011 and 2012 using Wikipedia.

For each song in the data set, we queried the Echonest API to obtain a set of descriptive values about the song. The variables we got back were: energy, duration, acousticness, danceability, tempo, speechiness, key, liveness, time_signature, mode, loudness and valence.

Next, we obtained lyrics for the entries using the musiXmatch API. We computed a measure of happiness for each song using the labMT wordlist, where the happiness of a song is the average happiness of the words in its lyrics.
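The happiness measure can be sketched as follows. This is a simplified illustration rather than our actual code: the tokenisation, the choice to skip words that are missing from the word list, and the toy scores (which are not taken from labMT) are all assumptions.

```python
import re

def average_happiness(lyrics, word_scores):
    """Mean happiness of the words in `lyrics` that appear in `word_scores`.

    Returns None when no word of the lyrics is in the word list.
    """
    words = re.findall(r"[a-z']+", lyrics.lower())
    scores = [word_scores[w] for w in words if w in word_scores]
    return sum(scores) / len(scores) if scores else None

# Made-up scores for illustration; not values from the real word list.
toy_scores = {"love": 8.0, "happy": 8.0, "fight": 3.0}
score = average_happiness("Love, love will never fight", toy_scores)
```

Here the three scored words ("love" twice and "fight" once) are averaged, while "will" and "never" are ignored because they are not in the toy word list.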

Finally we combined these three data sets, and tried to train a set of models to predict the song quality. We tried using Random Forests, Boosting and Elastic Nets, but none of the methods gave good results.

The black line shows the actual song qualities for the entries in the Eurovision Song Contest 2012. The red line shows the song qualities for the same entries as predicted by our model. The model was an Elastic Net, trained on data from 1998 to 2011. The entries are sorted by actual song quality to show how badly our model performs.
According to our models, the five most important features are: happiness (from the lyrics), danceability, acousticness, energy and liveness. Plotting the happiness together with the song qualities gives the following result:
The happiness of the songs (based on lyrics) plotted together with the estimated song qualities. By looking at this plot, there is no clear relationship.

Predictions and result for the 2012 Final


Finally we want to make predictions about the results of the Eurovision Song Contest. To evaluate our models, we will start by producing a prediction of the competition in 2012, based on the data from 1998 to 2011. We start by predicting the semi finals, and then use the results from that to predict the result of the final.
To make a prediction for each of the semi-finals, we use two measures: the average quality of a country's entries and the friendship relations. For each country in the semi-final, we sorted its competitors by average quality and friendship bias, and then distributed the votes 1, 2, 3, 4, 5, 6, 7, 8, 10 and 12 amongst them. By doing this for each country, we can simulate the voting process for the semi-finals. After doing this for both semi-finals, we tallied the votes, giving a prediction of the countries qualifying for the final. We then did the same thing for the countries in the final (except that all countries now had the possibility to vote). To give a quick overview of our results, the top ten is shown here. On the left are our predictions, and on the right are the actual results:

| Our prediction | Score | Actual outcome | Score |
|---|---|---|---|
| Azerbaijan | 226 | Sweden | 372 |
| Serbia | 171 | Russia | 259 |
| Russia | 169 | Serbia | 214 |
| Greece | 167 | Albania | 150 |
| Sweden | 140 | Azerbaijan | 146 |
| Ukraine | 131 | Estonia | 120 |
| Denmark | 113 | Turkey | 112 |
| Turkey | 113 | Germany | 110 |
| Italy | 110 | Italy | 101 |
| Malta | 98 | Spain | 97 |

As is evident, the predictions could have been better, but some similarities do exist. Comparing the two top tens, six of the ten countries in our predicted top ten are also in the actual top ten, and two of our predicted top three are in the actual top three. One thing to note about our predictions is that they are made without any regard to the quality of the actual songs.
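The simulated voting procedure used for these predictions can be sketched as below. This is a hedged reconstruction, not our original code: in particular, combining average quality and friendship bias by simply adding them is an assumption, as are the function name and the toy numbers.

```python
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]  # Eurovision points, best first

def simulate_votes(voters, contestants, avg_quality, bias):
    """Let every voter rank the contestants and hand out Eurovision points.

    `bias` maps (voter, contestant) pairs to a friendship bias; missing pairs
    count as 0. A country cannot vote for itself.
    """
    totals = {c: 0 for c in contestants}
    for voter in voters:
        candidates = [c for c in contestants if c != voter]
        # Assumption: the ranking score is average quality plus friendship bias.
        ranked = sorted(
            candidates,
            key=lambda c: avg_quality[c] + bias.get((voter, c), 0.0),
            reverse=True,
        )
        for c, pts in zip(ranked, POINTS):
            totals[c] += pts
    return totals

# Toy example with placeholder countries and made-up numbers.
quality = {"A": 3.0, "B": 2.0, "C": 1.0}
friendships = {("D", "C"): 5.0}  # D is strongly biased towards C
totals = simulate_votes(["A", "B", "C", "D"], ["A", "B", "C"], quality, friendships)
```

Sorting `totals` by score gives the predicted ranking. For a semi-final, the top ten would be passed on as the contestants of the final.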

2013 Predictions


And here we have it: our predictions for the forthcoming Eurovision Song Contest 2013.
Our top 10 is displayed below together with the bookmakers' and Youtube's top 10.

The bookmakers' top 10 has been taken from www.oddschecker.com, using the 10 entries ranked as the most likely overall winners. For further comparison, we have gone through www.youtube.com and found all the songs in the semi-finals. By looking at which songs have the most views, we have compiled a Youtube top 10 list of finalists. Here one should note that some countries are larger than others, and that this might have an effect on the view counts.



| Place | Our predictions | Bookmakers | Youtube views |
|---|---|---|---|
| 1 | Azerbaijan | Denmark | Italy |
| 2 | Russia | Ukraine | Germany |
| 3 | Armenia | Norway | Belarus |
| 4 | Greece | Russia | Greece |
| 5 | Serbia | Azerbaijan | Hungary |
| 6 | Sweden | Germany | Montenegro |
| 7 | Ukraine | Italy | Ukraine |
| 8 | Italy | Georgia | Netherlands |
| 9 | Denmark | Netherlands | United Kingdom |
| 10 | Albania | Sweden | Denmark |

So far, what we can see is that 3 of the songs in our top ten, Ukraine, Italy and Denmark (shown in green), are also in both the Youtube and the bookmakers' top 10 lists.

We also have 3 other songs, Azerbaijan, Russia and Sweden (shown in yellow), which are in the bookmakers' list but not in the Youtube top 10.

Concluding on this, we are probably not completely off with our predictions, but it is highly unlikely that we will outperform the bookmakers.

Regarding the two semifinals in 2013, here are our predictions for both:

| Place | Semi final 1 | Semi final 2 |
|---|---|---|
| 1 | Russia | Azerbaijan |
| 2 | Serbia | Greece |
| 3 | Ukraine | Armenia |
| 4 | Denmark | Malta |
| 5 | Estonia | Norway |
| 6 | Lithuania | Georgia |
| 7 | Croatia | Iceland |
| 8 | Slovenia | Romania |
| 9 | Ireland | Albania |
| 10 | Moldova | Finland |


Update from after the first semi final: Now that the first semi final is over, we can begin to evaluate the performance of our model. We got 7 of the 10 finalists correct, doing slightly better than average. Unfortunately, the points awarded in the semi finals are not made public until after the final, so we can't learn much from the new data. Belgium, Netherlands and Belarus qualified for the final even though our model did not predict this, which implies that their entries have high quality. On the other hand, Serbia, which we predicted would easily go to the final, did not qualify, this time implying rather low quality.

For comparison, here is a table of other predictions for the first semifinal:

| Our predictions | International fanclubs | Danish expert | Martin O'Leary |
|---|---|---|---|
| Russia | Denmark | Austria | Russia |
| Serbia | Netherlands | Croatia | Serbia |
| Ukraine | Ukraine | Denmark | Ukraine |
| Denmark | Russia | Russia | Denmark |
| Estonia | Ireland | Ukraine | Croatia |
| Lithuania | Montenegro | Netherlands | Estonia |
| Croatia | Moldova | Belarus | Moldova |
| Slovenia | Belarus | Moldova | Lithuania |
| Ireland | Austria | Ireland | Cyprus |
| Moldova | Croatia | Serbia | Slovenia |
| Total: 7 of 10 | Total: 7 of 10 | Total: 7 of 10 | Total: 6 of 10 |


Update from after the second semi final: The second semi-final is over, and the results look promising for the model. We predicted 9 of the 10 finalists correctly. Our only error was predicting Albania for the final instead of Hungary, but neither our model, the expert models nor the betting companies expected Hungary to qualify for the final.

For comparison, here is a table of other predictions for the second semifinal:

| Our predictions | International fanclubs | Danish expert | Martin O'Leary |
|---|---|---|---|
| Azerbaijan | San Marino | San Marino | Greece |
| Greece | Norway | Macedonia | Armenia |
| Armenia | Azerbaijan | Azerbaijan | Azerbaijan |
| Malta | Iceland | Finland | Albania |
| Norway | Georgia | Malta | Romania |
| Georgia | Israel | Iceland | Norway |
| Iceland | Switzerland | Israel | Israel |
| Romania | Finland | Norway | Iceland |
| Albania | Malta | Georgia | Malta |
| Finland | Bulgaria | Romania | Georgia |
| Total: 9 of 10 | Total: 6 of 10 | Total: 7 of 10 | Total: 8 of 10 |


The final predictions: Tonight is the night, and we have updated our predictions based on the results in the semi-finals.

Here is a table of our predictions, with other people's predictions for comparison:

| Place | Our predictions | Martin O'Leary | Youtube Views | International fanclubs | Betting Companies |
|---|---|---|---|---|---|
| 1 | Azerbaijan | Azerbaijan | Italy | Denmark | Denmark |
| 2 | Russia | Russia | Germany | Norway | Norway |
| 3 | Armenia | Greece | Belarus | Germany | Ukraine |
| 4 | Greece | Ukraine | Greece | Italy | Russia |
| 5 | Sweden | Italy | Hungary | Netherlands | Azerbaijan |
| 6 | Belgium | Denmark | Ukraine | Ukraine | Netherlands |
| 7 | Ukraine | Sweden | Netherlands | United Kingdom | Italy |
| 8 | Italy | Romania | United Kingdom | Sweden | Germany |
| 9 | Denmark | Armenia | Denmark | Russia | Georgia |
| 10 | Malta | Moldova | Azerbaijan | Azerbaijan | Finland |
| 11 | Georgia | Norway | Norway | Iceland | Greece |
| 12 | Estonia | Malta | Sweden | Georgia | Ireland |
| 13 | Iceland | Iceland | Russia | Ireland | Sweden |
| 14 | Norway | Georgia | Romania | Finland | United Kingdom |
| 15 | Romania | Finland | Ireland | Moldova | Malta |
| 16 | Hungary | Estonia | Finland | Belarus | Moldova |
| 17 | Finland | Netherlands | Georgia | Malta | Belgium |
| 18 | Moldova | Ireland | Iceland | Hungary | Romania |
| 19 | Lithuania | United Kingdom | Estonia | Spain | Belarus |
| 20 | Ireland | Hungary | France | Armenia | Iceland |
| 21 | Netherlands | Belgium | Spain | Belgium | Lithuania |
| 22 | France | France | Belgium | Estonia | Estonia |
| 23 | Germany | Lithuania | Lithuania | Greece | Hungary |
| 24 | United Kingdom | Germany | Moldova | Romania | France |
| 25 | Spain | Belarus | Malta | France | Spain |
| 26 | Belarus | Spain | Armenia | Lithuania | Armenia |



Get the data!


We spent a lot of time preparing and cleaning data for this analysis, and the work on combining different data sources was rather tedious. For future research on the subject, our data is available for download as a set of data files. The file eurovision_meta.csv contains metadata about the entries in the Eurovision Song Contest from 1998 to 2012. This file does not include any voting data, but it does include information like the gender of the artist, data from Echonest and the lyric happiness computed from the musiXmatch lyrics.

| Column name | Type | Description |
|---|---|---|
| Year | Number | The year of the entry |
| Country | String | Country of the performer |
| Region | String | Region of the performer (as in the Kaggle data set) |
| Artist | String | Name of artist |
| Song | String | Title of song |
| Artist.gender | String | Male, female or both |
| Group.Solo | String | Group or solo |
| Place | Number | Placement in competition |
| Points | Number | Total number of points |
| Home.Away.Country | String | If the country is hosting |
| Home.Away.Region | String | If the host is in the same region |
| Is.Final | Boolean | If this is a final or semi final |
| Semi.Final.Number | Number | Which semifinal it is |
| Country.[Country name] | Boolean | If this entry is from [Country name] |
| Region.[Region name] | Boolean | If this entry is from [Region name] |
| Artist.gender.[Artist gender] | Boolean | If the gender of the artist is [Artist gender] |
| Group.Solo.[Group type] | Boolean | If this group type is [Group type] |
| Home.Away.Country.[Home or Away] | Boolean | If this country is [Home or away] by country |
| Home.Away.Region.[Home or Away] | Boolean | If this country is [Home or away] by region |
| Song.In.English | Boolean | If the song is in English |
| Song.Quality | Number | Estimated quality of song |
| Normalized.Points | Number | Number of points divided by total number of points in competition |
| energy | Number | energy from Echonest |
| duration | Number | duration from Echonest |
| acousticness | Number | acousticness from Echonest |
| danceability | Number | danceability from Echonest |
| tempo | Number | tempo from Echonest |
| speechiness | Number | speechiness from Echonest |
| key | Number | key from Echonest |
| liveness | Number | liveness from Echonest |
| time_signature | Number | time_signature from Echonest |
| mode | Number | mode from Echonest |
| loudness | Number | loudness from Echonest |
| valence | Number | valence from Echonest |
| Happiness | Number | Average happiness of words in lyrics using labMT wordlist |

The file voting_final.csv contains the votes given in the Eurovision finals from 1998-2012.

| Column name | Type | Description |
|---|---|---|
| Year | Number | Which year |
| Country | String | Which country received the vote |
| Giver | String | Which country gave the vote |
| Score | Number | Number of points |

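As a starting point for working with voting_final.csv, the following Python sketch turns the rows of one year into the voting matrix used in the friendship analysis. It assumes that the file is comma-separated with a header row using the column names listed above and that scores are stored as integers. The function name is our own for this example, and the sample rows are made up.

```python
import csv
import io

def build_vote_matrix(csv_file, year):
    """Return (entries, givers, X) for one year, where X[i][j] is the number of
    points given to entries[i] by givers[j] (0 where no points were recorded)."""
    rows = [r for r in csv.DictReader(csv_file) if int(r["Year"]) == year]
    entries = sorted({r["Country"] for r in rows})
    givers = sorted({r["Giver"] for r in rows})
    X = [[0] * len(givers) for _ in entries]
    for r in rows:
        X[entries.index(r["Country"])][givers.index(r["Giver"])] = int(r["Score"])
    return entries, givers, X

# Made-up rows in the documented column layout, for illustration only.
sample = io.StringIO(
    "Year,Country,Giver,Score\n"
    "2012,A,B,12\n"
    "2012,A,C,10\n"
    "2012,B,C,12\n"
    "2011,A,B,1\n"
)
entries, givers, X = build_vote_matrix(sample, 2012)
```

With the real file, the same function could be called on a file object opened with `open("voting_final.csv", newline="")`.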
The file voting_semifinal.csv contains the votes given in the Eurovision semifinals from 2004-2012.

| Column name | Type | Description |
|---|---|---|
| Year | String | Which year (2008 SF1 for semifinal 1 in 2008) |
| Country | String | Which country received the vote |
| Giver | String | Which country gave the vote |
| Score | Number | Number of points |

We want to thank Martin O'Leary for providing a nice version of the files voting_final.csv and voting_semifinal.csv.

Team

David Kofoed Wind, utdiscant@gmail.com
Helge Munk Jacobsen, helgemunkjacobsen@gmail.com
Benjamin Dalsgaard Hughes, benjamin.dals.hughes@gmail.com
Kristian Michael Clarkson, the.kc114@gmail.com


Code for analysis

Since this work was done for a university project, the code is very much hacked together. If you want to see some of the code, send an e-mail to utdiscant@gmail.com and I can send you whatever you need.