World Happiness

Abigail Protin and Daniela Lulli

Introduction

The World Happiness Report is a survey of the state of global happiness that scores and ranks countries by how happy their citizens perceive themselves to be. We investigated which factors correlate with each country's estimated happiness score. In this tutorial, we walk through how we obtained and organized our data and how we determined which factors were most significant in predicting the happiness scores. We then create a linear regression model that predicts a country's happiness score from those predictors, test the model on held-out data to check the accuracy of its predictions, and use a data visualization library to display happiness scores around the globe.

Libraries Required

Obtaining Data

We obtained our data from Knoema, a site that maintains a comprehensive global dataset. This dataset contains statistics for many countries regarding topics in Agriculture, Economy, Health, and many others. For our purpose of finding what affects the happiness of a country, we chose to look at the following factors:

  1. Food Production
  2. Female Obesity (% of the population)
  3. Male Obesity (% of the population)
  4. Life Expectancy (in years)
  5. Education Expenditure per Capita (in US dollars)
  6. Health Expenditure per Capita (in US dollars)

We downloaded the data for each factor as CSV files, imported them, and combined them into one pandas dataframe, joining on Country and Year as shown below. We used these factors to predict the happiness score of each country, which was also given in the Knoema dataset. The Happiness Scores were calculated through a survey asking people to rate their happiness on a scale from 0 to 10.
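As a rough illustration of the join described above, the sketch below merges two small frames on Country and Year. The country names, values, and the choice of an outer join are assumptions made for the example; in practice each frame would come from `pd.read_csv` on one of the downloaded files.

```python
import pandas as pd

# Hypothetical stand-ins for two of the downloaded CSV files.
life_expectancy = pd.DataFrame({
    "Country": ["Aland", "Aland", "Borduria"],
    "Year": [2014, 2015, 2014],
    "Life Expectancy": [80.1, 80.3, 61.5],
})
happiness = pd.DataFrame({
    "Country": ["Aland", "Aland", "Borduria", "Borduria"],
    "Year": [2014, 2015, 2014, 2015],
    "Happiness Score": [7.2, 7.3, 3.9, 4.0],
})

# An outer join (an assumption) keeps every country-year pair and leaves
# NaN wherever one source has no value for that pair.
combined = life_expectancy.merge(happiness, on=["Country", "Year"], how="outer")
```

Repeating the merge once per factor would build up a single frame with one column per predictor plus the happiness score.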

Tidying the Data

Before analyzing the data, we dropped all rows with null values. Predicting the happiness score requires a numerical value for every factor, so rows with missing data cannot be used.
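A minimal sketch of this step, using a made-up three-row frame (the names and values are placeholders, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame with gaps in two of its rows.
data = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascadia"],
    "Year": [2014, 2014, 2014],
    "Life Expectancy": [80.1, np.nan, 74.0],
    "Happiness Score": [7.2, 3.9, np.nan],
})

# Drop every row that has at least one missing value, then renumber the rows.
tidy = data.dropna().reset_index(drop=True)
```

Only the row with a value in every column survives, which is the behavior the regression step relies on.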

Exploring the Data

We first made a violin plot of the happiness scores from each year to get an idea of the basic distribution of the scores. As you can see below, the distribution remains more or less the same over the three years with no significant trends.

Next, we ran a linear regression model on all of our independent variables, with the Happiness Score as the dependent variable. The summary is displayed below to show the p-value for each factor, which we used to determine whether a given factor was statistically significant in our model.

As you can see, the majority of the resulting p-values are below 0.05, which is the threshold we chose for deciding whether a factor was significant. The Food Production Index, however, has an extremely high p-value, so we will not include it in our model.
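To make the p-value screen concrete, here is a sketch that computes coefficient p-values for an ordinary least squares fit by hand with NumPy and SciPy. This is not necessarily the library the regression summary above came from, and the data is synthetic: `x_signal` is built to drive `y`, while `x_noise` is unrelated to it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x_signal = rng.normal(size=n)  # synthetic predictor that drives y
x_noise = rng.normal(size=n)   # synthetic predictor unrelated to y
y = 5.0 + 2.0 * x_signal + rng.normal(size=n)

# Design matrix with an intercept column, then the least squares fit.
X = np.column_stack([np.ones(n), x_signal, x_noise])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Two-sided p-values from the t distribution with n - k degrees of freedom.
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

# Same 0.05 threshold as in the text.
significant = p_values < 0.05
```

A predictor whose entry in `significant` is False would be dropped, in the same way the Food Production Index was.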

The last thing we need to do before creating our model is see if any of the predictors are related to each other. When they are, the precision of the estimated regression coefficients decreases. To account for this, the correlation between the variables has to be identified and interaction terms should be added to the model. If you are not familiar with interaction terms and would like more information, check out the previously hyperlinked article. To see if any of our predictors are related, we plot each pair of predictors against each other with a regression line and compute the Pearson r squared value, which essentially tells how strong the linear relationship is. We wrote a short function, which we named 'interaction', to quickly do both of these things for each pair of predictors.
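The numeric half of that check can be sketched as follows. The helper name `pairwise_r_squared` and the demo columns are hypothetical, and this is not the 'interaction' function itself: it skips the plotting and only returns the squared Pearson correlation for each pair.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_r_squared(df, predictors):
    """Return the squared Pearson correlation for every pair of predictors."""
    results = {}
    for a, b in combinations(predictors, 2):
        r = np.corrcoef(df[a], df[b])[0, 1]
        results[(a, b)] = r ** 2
    return results


# Made-up columns chosen so the expected values are easy to see.
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "twice_x": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with x
    "zigzag": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with x
})
scores = pairwise_r_squared(demo, ["x", "twice_x", "zigzag"])
```

Pairs whose score meets the chosen threshold are the candidates for interaction terms.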

We decided that our threshold for whether two variables were correlated was an r squared value of 0.5 or above, which was met by two pairs: male obesity and female obesity, and male obesity and life expectancy. We create an interaction term for each pair by multiplying the two related factors together, and we add these interaction terms to our dataframe.
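Building the two interaction terms amounts to element-wise multiplication of the paired columns. The column names and values below are placeholders chosen for the example:

```python
import pandas as pd

# Hypothetical two-row frame holding the three predictors involved.
data = pd.DataFrame({
    "Male Obesity": [20.0, 10.0],
    "Female Obesity": [25.0, 12.0],
    "Life Expectancy": [80.0, 60.0],
})

# Each interaction term is the row-by-row product of its two predictors.
data["Male Obesity x Female Obesity"] = data["Male Obesity"] * data["Female Obesity"]
data["Male Obesity x Life Expectancy"] = data["Male Obesity"] * data["Life Expectancy"]
```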

Now we're ready to make our model!

First, we have to split our data into a train set and a test set. More information can be found here if you are curious about why we split the data into these two sets. We do this to detect overfitting: a model that fits the noise in its training data too closely can look accurate on that data and still predict poorly on data it has not seen. A more detailed explanation of what overfitting is and why it is bad can be found here.
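One simple way to make such a split is to shuffle the row positions and hold out a fixed share of them. The helper `split_train_test`, the 25% test share, and the seed are all assumptions for this sketch; the proportions actually used are not stated above.

```python
import numpy as np
import pandas as pd


def split_train_test(df, test_fraction=0.25, seed=0):
    """Shuffle row positions and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    n_test = int(round(len(df) * test_fraction))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return df.iloc[train_idx], df.iloc[test_idx]


# Placeholder data: 20 rows split into 15 for training and 5 for testing.
data = pd.DataFrame({"x": range(20), "y": range(20)})
train, test = split_train_test(data)
```

Fixing the seed keeps the split reproducible, so the reported test results do not change from run to run.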

Now we are ready to train our model using the Ordinary Least Squares method, commonly known as linear regression. We will get the r squared value of this model to determine how well our resulting regression model fits the data.

Our r squared value is close to 1, which indicates that the model explains most of the variation in the happiness scores.
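For readers who want to see what the fit and its r squared value involve, here is a sketch using NumPy's least squares solver on synthetic data. The helper names `fit_ols`, `predict`, and `r_squared` are hypothetical, and the solver may differ from the library used for the model above. The r squared value is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np


def fit_ols(X, y):
    """Fit intercept and slopes by ordinary least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to new predictor rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r_squared(y, y_hat):
    """Share of the variance in y that the predictions account for."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data: two predictors with known coefficients and small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

coef = fit_ols(X, y)
score = r_squared(y, predict(coef, X))
```

Because the synthetic noise is small relative to the signal, `score` lands near 1, which mirrors the situation described in the text.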

Testing Our Model

We will now use the model on the test data and compare the model's predictions with the actual happiness scores by plotting the predicted scores against the actual ones. A good model would show a strong linear trend.
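Alongside the plot, the same comparison can be summarized numerically. The sketch below uses a hypothetical helper, `compare_predictions`, and made-up scores; it reports the Pearson correlation between actual and predicted values and the root mean squared error.

```python
import numpy as np


def compare_predictions(actual, predicted):
    """Return the Pearson r and RMSE between actual and predicted scores."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(actual, predicted)[0, 1]
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return r, rmse


# Hypothetical test-set scores, not values from the real dataset.
actual = [7.2, 6.1, 4.5, 3.9]
predicted = [7.0, 6.3, 4.4, 4.1]
r, rmse = compare_predictions(actual, predicted)
```

A correlation near 1 corresponds to the strong linear trend a good model should show in the scatter plot, while the RMSE gives the typical size of a miss in happiness-score points.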

As you can see, the scatter plot above shows a clear linear trend. While it's not perfect, the model is accurate for the majority of the data, indicating that it works well in predicting happiness scores.

Data Visualization

To get an idea of what happiness looks like around the world, we use the folium library to create an interactive map that displays happiness scores as a color gradient across the globe. Higher happiness scores appear as darker shades of pink, while lower happiness scores appear as lighter shades. The countries in white are those for which the dataset either did not have happiness scores or did not have enough information from the predictors to compute a happiness score.

Below, we used the same method as in the previous example to create a map of the scores predicted by our linear regression model, as opposed to the given happiness scores. As you can see, the results are not exactly the same, but they are similar.

Conclusion

According to the World Happiness Report in 2014, Switzerland was found to be the happiest country, followed closely by Iceland and Norway. The least happy countries were Togo, Burundi, and Benin. The same results were found in 2015 and 2016.

There are many studies that show things individual people can do to increase their happiness or factors that affect individual happiness. Our tutorial looks instead at factors outside of our individual control that affect the overall happiness of our societies. The results we found show that life expectancy, obesity, health care, and education are certainly correlated with, and may have an impact on, a country's happiness. Studies such as the one discussed in the Berkeley article 'Why Governments Should Care More About Happiness' have shown that societies benefit when their governments care about the happiness of their citizens. If countries allocated more time and resources to addressing the four previously mentioned factors, it is likely that people around the world would be happier!