import pandas as pd
import seaborn as sns # for violin plot and for regplot
import statsmodels.api as sm # for linear regression model
from sklearn.model_selection import train_test_split # split data
from scipy import stats # for pearsonr value of interaction terms
import folium # for interactive map
We obtained our data from Knoema, a site that maintains a comprehensive global dataset. This dataset contains statistics for many countries on topics in Agriculture, Economy, Health, and many others. For our purpose of finding what affects the happiness of a country, we chose to look at the following factors:

- Food Production Index
- Obesity (Female)
- Obesity (Male)
- Life Expectancy
- Education Expenditure
- Health Expenditure
We downloaded the data for each factor as CSV files, imported them, and combined them into one pandas DataFrame, joining on Country and Year as shown below. We used these factors to predict the happiness score of each country, which was also given in the Knoema dataset. The happiness scores were calculated through a survey asking people to rank their happiness on a scale from 0 to 10.
df = pd.read_csv('Food Production.csv')
obese_female = pd.read_csv('Obesity Female.csv')
obese_male = pd.read_csv('Obesity Male.csv')
life_exp = pd.read_csv('Life Expectancy.csv')
ed_exp = pd.read_csv('Education Expenditure.csv')
health_exp = pd.read_csv('Health Expenditure.csv')
happiness = pd.read_csv('Happiness Score.csv')
frames = [obese_female, obese_male,
          life_exp, ed_exp, health_exp, happiness]
for f in frames:
    df = pd.merge(df, f, on=['Country', 'Year'], how='outer')
df.head()
|   | Country | Year | Food Production Index | Obesity Female | Obesity Male | Life expectancy | Education Expenditure | Health Expenditure | Happiness Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | World | 2014 | 125.601824 | 14.4 | 10.4 | 71.742279 | 4.62946 | 1039.40797 | NaN |
| 1 | East Asia & Pacific | 2014 | 129.655064 | NaN | NaN | 75.125170 | 4.15272 | 615.14134 | NaN |
| 2 | American Samoa | 2014 | 116.620000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | American Samoa | 2015 | 120.150000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | American Samoa | 2016 | 120.970000 | NaN | NaN | NaN | NaN | NaN | NaN |
Before analyzing the data, we drop all rows with null values. To predict the happiness score, numerical values are required for all factors, so rows with missing data cannot be used.
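If you are curious how much data this removes, a quick optional check (not part of our original pipeline) is to count the missing values per column and the number of rows before dropping:
print(df.isnull().sum())  # number of missing values in each column
print(len(df))            # number of rows before dropping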
df = df.dropna()
df.head()
|   | Country | Year | Food Production Index | Obesity Female | Obesity Male | Life expectancy | Education Expenditure | Health Expenditure | Happiness Score |
|---|---|---|---|---|---|---|---|---|---|
| 5 | Australia | 2014 | 109.14 | 27.4 | 28.3 | 82.30000 | 5.16477 | 5637.566895 | 7.284 |
| 6 | Australia | 2015 | 109.72 | 27.9 | 28.9 | 82.40000 | 5.31127 | 4887.800781 | 7.313 |
| 7 | Australia | 2016 | 105.58 | 28.4 | 29.6 | 82.44878 | 5.27678 | 4999.810547 | 7.284 |
| 11 | Cambodia | 2014 | 176.12 | 4.3 | 2.4 | 68.27300 | 1.90939 | 73.299316 | 3.819 |
| 29 | Indonesia | 2014 | 140.39 | 8.1 | 4.1 | 70.48100 | 3.28801 | 108.837265 | 5.399 |
We first made a violin plot of the happiness scores from each year to get an idea of the basic distribution of the scores. As you can see below, the distribution remains more or less the same over the three years with no significant trends.
ax = sns.violinplot(x='Year', y='Happiness Score', data=df)
Next, we ran a linear regression model on all of our independent variables, with the Happiness Score as the dependent variable. The summary is displayed below to show the p-values for each factor, which we used to determine whether a given factor was statistically significant in our model.
x = df[['Food Production Index','Obesity Female', 'Obesity Male',
'Life expectancy', 'Education Expenditure', 'Health Expenditure']]
y = df[['Happiness Score']]
model = sm.OLS(y, x).fit()  # note: statsmodels OLS does not add an intercept by default
model.summary()
Dep. Variable: | Happiness Score | R-squared (uncentered): | 0.991 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.991 |
Method: | Least Squares | F-statistic: | 4544. |
Date: | Sat, 05 Dec 2020 | Prob (F-statistic): | 3.08e-255 |
Time: | 16:48:17 | Log-Likelihood: | -210.02 |
No. Observations: | 260 | AIC: | 432.0 |
Df Residuals: | 254 | BIC: | 453.4 |
Df Model: | 6 | | |
Covariance Type: | nonrobust | | |
|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Food Production Index | 0.0011 | 0.001 | 0.801 | 0.424 | -0.002 | 0.004 |
| Obesity Female | 0.0180 | 0.008 | 2.154 | 0.032 | 0.002 | 0.034 |
| Obesity Male | 0.0274 | 0.010 | 2.815 | 0.005 | 0.008 | 0.047 |
| Life expectancy | 0.0526 | 0.003 | 15.258 | 0.000 | 0.046 | 0.059 |
| Education Expenditure | 0.1083 | 0.028 | 3.920 | 0.000 | 0.054 | 0.163 |
| Health Expenditure | 0.0002 | 2.42e-05 | 7.508 | 0.000 | 0.000 | 0.000 |
Omnibus: | 3.897 | Durbin-Watson: | 0.737 |
---|---|---|---|
Prob(Omnibus): | 0.143 | Jarque-Bera (JB): | 3.565 |
Skew: | -0.218 | Prob(JB): | 0.168 |
Kurtosis: | 2.627 | Cond. No. | 2.09e+03 |
As you can see, the majority of the resulting p-values are below 0.05, which is the threshold we decided to use for whether a factor was significant. The Food Production Index, however, has a p-value well above that threshold, so we will not include it in our model.
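If you would rather pull out the significant predictors programmatically instead of reading them off the table, a small sketch (not part of the original analysis) is to filter the fitted model's p-values:
significant = model.pvalues[model.pvalues < 0.05].index.tolist()  # predictors below our 0.05 threshold
print(significant)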
The last thing we need to do before creating our model is to check whether any of the predictors are related to each other. When predictors are correlated, the precision of the estimated regression coefficients decreases. To address this, we identify the correlations between the variables and add interaction terms to the model. If you are not familiar with interaction terms and would like more information, check out the previously hyperlinked article. To see if any of our predictors are related, we plot each pair of predictors against each other with a regression line and compute the Pearson r-squared value, which indicates how strongly the two variables are correlated. We wrote a short function, named 'interaction', to quickly do both of these things for each pair of predictors.
def interaction(x, y):
    print(stats.pearsonr(df[x], df[y])[0] ** 2)
    ax1 = sns.regplot(x=x, y=y, data=df)
interaction('Obesity Female', 'Obesity Male')
0.642773145901948
interaction('Obesity Female', 'Life expectancy')
0.22785701943344175
interaction('Obesity Female', 'Education Expenditure')
0.0755334498857785
interaction('Obesity Female', 'Health Expenditure')
0.029153557390565694
interaction('Life expectancy', 'Education Expenditure')
0.17668285382109106
interaction('Life expectancy', 'Health Expenditure')
0.44293844398926585
interaction('Education Expenditure', 'Health Expenditure')
0.1736263117050802
interaction('Obesity Male', 'Life expectancy')
0.6416543725734769
interaction('Obesity Male', 'Health Expenditure')
0.30353855358128756
interaction('Obesity Male', 'Education Expenditure')
0.14646904978684944
We decided that our threshold for whether two variables were correlated was an r-squared value of 0.5 or above, which was the case for female and male obesity, and for male obesity and life expectancy. We create interaction terms for these pairs by multiplying the related factors together and add them to our dataframe.
df['int_obesity'] = df['Obesity Male']*df['Obesity Female']
df['int_life'] = df['Obesity Male']*df['Life expectancy']
df.head()
|   | Country | Year | Food Production Index | Obesity Female | Obesity Male | Life expectancy | Education Expenditure | Health Expenditure | Happiness Score | int_obesity | int_life |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Australia | 2014 | 109.14 | 27.4 | 28.3 | 82.30000 | 5.16477 | 5637.566895 | 7.284 | 775.42 | 2329.090000 |
| 6 | Australia | 2015 | 109.72 | 27.9 | 28.9 | 82.40000 | 5.31127 | 4887.800781 | 7.313 | 806.31 | 2381.360000 |
| 7 | Australia | 2016 | 105.58 | 28.4 | 29.6 | 82.44878 | 5.27678 | 4999.810547 | 7.284 | 840.64 | 2440.483903 |
| 11 | Cambodia | 2014 | 176.12 | 4.3 | 2.4 | 68.27300 | 1.90939 | 73.299316 | 3.819 | 10.32 | 163.855200 |
| 29 | Indonesia | 2014 | 140.39 | 8.1 | 4.1 | 70.48100 | 3.28801 | 108.837265 | 5.399 | 33.21 | 288.972100 |
First, we have to split our data into a train set and a test set. More information can be found here if you are curious about why we split the data into these two sets. We do this to avoid overfitting, where a model fits the training data too closely and fails to generalize to new data. A more detailed explanation of what overfitting is and why it is bad can be found here.
train, test = train_test_split(df, train_size=0.75, random_state=40)
xtrain = train[['Obesity Female', 'Obesity Male',
'Life expectancy', 'Education Expenditure', 'Health Expenditure','int_obesity']]
ytrain = train[['Happiness Score']]
Now we are ready to train our model using the Ordinary Least Squares method, commonly known as linear regression. We will look at the r-squared value of this model to gauge how good the resulting regression model is.
model = sm.OLS(ytrain, xtrain).fit()
model.rsquared
0.991370876357963
Our r-squared value is close to 1, which indicates that the model fits the training data very well.
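One caveat worth knowing: because we passed the predictors to sm.OLS without adding a constant, statsmodels fits the model without an intercept and reports the uncentered R-squared, which tends to run higher than the usual centered value. A minimal sketch of the variant with an intercept, for comparison only (not the model we use below):
xtrain_const = sm.add_constant(xtrain)  # adds a column of ones named 'const'
model_const = sm.OLS(ytrain, xtrain_const).fit()
print(model_const.rsquared)  # centered R-squared, typically lower than the uncentered value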
We will now use the model on the test data and compare the model's predictions with the actual happiness scores by plotting the predicted values against the actual ones. A good model will show a strong linear trend.
xtest = test[['Obesity Female', 'Obesity Male',
'Life expectancy', 'Education Expenditure', 'Health Expenditure','int_obesity']]
ytest = test[['Happiness Score']]
predicted = model.predict(xtest)
ax1 = sns.regplot(x=predicted, y=ytest, data=df)
As you can see, the scatter plot above shows a clear linear trend. While it is not perfect, it is accurate for the majority of the data, indicating that our model works well in predicting happiness scores.
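To put a number on how well the model does on the held-out data, one option (an extra check, not in the original analysis) is to compute the out-of-sample R-squared and root-mean-squared error from the test-set predictions:
import numpy as np
actual = ytest['Happiness Score']
residuals = actual - predicted
print('Test R-squared:', 1 - (residuals ** 2).sum() / ((actual - actual.mean()) ** 2).sum())
print('Test RMSE:', np.sqrt((residuals ** 2).mean()))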
To get an idea of what happiness looks like around the world, we use the folium library to create an interactive map that displays happiness scores as a color gradient across the globe. Higher happiness scores appear as a darker shade of pink, while lower scores appear as a lighter shade. The countries in white are those for which the dataset either did not have happiness scores or did not have enough information from the predictors to compute one.
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
m = folium.Map(location=[45.9432, 24.9668], zoom_start=4)
folium.Choropleth(
geo_data=country_shapes,
name='World Happiness',
data=df,
columns=['Country', 'Happiness Score'],
key_on='feature.properties.name',
fill_color='PuRd',
nan_fill_color='white'
).add_to(m)
m
Below, we use the same method as in the previous example to create a map of the scores predicted by our linear regression model, as opposed to the given happiness scores. As you can see, the results are not exactly the same, but they are similar.
df['predicted'] = model.predict(df[xtrain.columns])  # predict for every row using the same predictor columns the model was trained on
m = folium.Map(location=[45.9432, 24.9668], zoom_start=4)
folium.Choropleth(
geo_data=country_shapes,
name='World Happiness',
data=df,
columns=['Country', 'predicted'],
key_on='feature.properties.name',
fill_color='PuRd',
nan_fill_color='white'
).add_to(m)
m
According to the World Happiness Report in 2014, Switzerland was found to be the happiest country, followed closely by Iceland and Norway. The least happy countries were Togo, Burundi, and Benin. The same results were found in 2015 and 2016.
There are many studies that show what individual people can do to increase their happiness, or which factors affect individual happiness. Our tutorial looks at factors outside of our individual control that affect the overall happiness of our societies. The results we found show that life expectancy, obesity, health care, and education are certainly correlated with, and may have an impact on, one's happiness. Studies, like the one discussed in a Berkeley article titled 'Why Governments Should Care More About Happiness', have shown that societies benefit when their governments care about the happiness of their citizens. If countries allocated more time and resources to addressing the four previously mentioned factors, it is likely that people around the world would be happier!