Happiness.
A quality and feeling that is quite hard to quantify or tangibly define… especially on a country-wide scale. Several groups have approached this debate with rankings and scales of their own, but of course we also had to try!
For a data science project, we used data from the World Happiness Report and added a few other variables that we thought might be useful predictors, such as population, population density, employment rates, and even alcohol usage! We ran various regression and classification algorithms to see which would prove the best method for determining whether or not a country is ‘happy’. Our full report can be found here.
THE DATASET
The dataset consists of the 156 countries included in the happiness report and 12 columns of predictive variables. The variables in bold are those we sourced and added ourselves to enrich the dataset, make predictions, and seek correlations:
Total Population (in thousands)
Population Density (number of people per square kilometre)
Ladder Score (self-reported aggregated happiness rating on a scale from 0-10)
Log GDP per capita (value of all the goods and services a country produces on a yearly basis divided by the country’s total population)
Social Support (respondents answer 0=No or 1=Yes to the question “if you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”)
Healthy Life Expectancy (average number of “healthy” years a child at birth is estimated to live – calculated by the WHO based on over 100 factors)
Freedom to make life choices (respondents answer 0=No or 1=Yes to the question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”)
Generosity (respondents answer 0=No or 1=Yes to the question “Have you donated money to charity this past month?”)
Perceptions of Corruption (respondents answer 0=No or 1=Yes to the two questions: “Is corruption widespread throughout the government or not?” & “Is corruption widespread within businesses or not?”, and the average was used for this metric)
Dystopia
Employment (percentage of population reported employed)
Number of Deaths by risk factor of alcohol (total number of deaths with top contributing risk factor being alcohol use for all genders and ages in the respective population)
DATA PREPARATION
Country name was dropped: as a unique identifier, it carries no information about happiness. Rank, upper whisker and lower whisker were also removed, as they are derived from the ladder score and do not influence it. To validate our models, we split the available data into training, validation and testing sets. We generally used an 80-10-10 split, though some methods used a different ratio to control overfitting, as our dataset is relatively small. Normalising the data puts the features on a comparable scale, which makes them meaningfully comparable. Lastly, we binarised the ladder score into ‘Happy’ and ‘Unhappy’, which aided analysis since most of the models we used are classification methods. The chosen threshold is the mean happiness rating (5.375), which ensures a balanced dataset.
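A minimal sketch of this preparation pipeline, assuming scikit-learn and a synthetic stand-in for the dataset (the shapes, seeds, and variable names are illustrative, not the report's actual values):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the happiness data: 156 rows of features
# plus a continuous ladder score.
rng = np.random.default_rng(0)
X = rng.normal(size=(156, 10))              # predictive variables
ladder = rng.uniform(2.5, 8.0, size=156)    # self-reported ladder score

# Binarise the ladder score at its mean: 1 = 'Happy', 0 = 'Unhappy'.
y = (ladder >= ladder.mean()).astype(int)

# 80-10-10 split: carve off 20%, then halve it into validation/test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

# Normalise using statistics from the training set only, so no
# information leaks from the validation/test sets into training.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```

Fitting the scaler on the training set alone is the detail that keeps the validation and test scores honest.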
LINEAR REGRESSION
To start with, we needed to determine the influence of each parameter on a country’s happiness ladder score. As all the parameters are continuous, linear regression was the natural model for identifying which parameter has the greatest positive impact on a country’s happiness. To visualise the correlations, we iterated through the parameters, fitting a linear regression model for each on the training set. The correlation between each parameter and the happiness ladder score was then measured with the R² metric on the testing set.
The closer R² is to 1, the stronger the correlation between the parameters in the linear regression model. Generally, the correlation is considered strong when R² is greater than 0.36. To validate the results of this iteration process, we computed a heat map. This colour-coded image plot helped us observe other correlations between the parameters in the dataset. Once we had found the most impactful parameters, we plotted the happiness ladder score against each of them as a scatter plot with the line of best fit.
Log GDP per capita (R² = 0.6657)
Social Support (R² = 0.5947)
Healthy life expectancy (R² = 0.6390)
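The per-feature regression loop above can be sketched as follows, on synthetic data with hypothetical column names (one feature is built to correlate with the ladder score, one is pure noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Illustrative synthetic data: 'log_gdp' is constructed to correlate
# with the ladder score; 'alcohol_deaths' is uncorrelated noise.
rng = np.random.default_rng(1)
n = 156
features = {
    "log_gdp": rng.normal(size=n),
    "alcohol_deaths": rng.normal(size=n),
}
ladder = 5.4 + 0.9 * features["log_gdp"] + rng.normal(scale=0.5, size=n)

# Fit one simple linear regression per feature and record its R².
r2 = {}
for name, x in features.items():
    model = LinearRegression().fit(x.reshape(-1, 1), ladder)
    r2[name] = r2_score(ladder, model.predict(x.reshape(-1, 1)))

best = max(r2, key=r2.get)  # feature with the strongest linear fit
```

The real loop scored each model on the held-out test set rather than in-sample, but the structure is the same.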
Above is a heat map; our zone of interest, the ladder score, which represents how ‘happy’ a country is by the World Happiness Report metrics, is outlined in green. In a heat map, the lighter the box, the higher the correlation between the parameters.
LOGISTIC REGRESSION
Logistic regression is a statistical method that fits a sigmoid function to a binary dependent variable. The sigmoid function gives an ‘S’-shaped curve that maps any real number to a value between 0 and 1. This function allowed us to construct a model to answer the question: which variables of the dataset are the most significant in predicting whether a country is happy or unhappy?
To perform this classification, we had to binarise the ladder score and choose a threshold that would realistically sort countries into unhappy (0) and happy (1). We set this threshold at the mean ladder score: countries at or above the mean were labelled happy, and those below it unhappy.
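A hedged sketch of this approach, on synthetic data where two of three features genuinely drive the binarised label (standardising the features first makes the coefficient magnitudes comparable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: features 0 and 1 are informative, feature 2 is noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(156, 3))
signal = 1.5 * X[:, 0] + 1.0 * X[:, 1]
y = (signal + rng.logistic(size=156) > 0).astype(int)  # binarised label

# Standardise so coefficient magnitudes are comparable across features.
Xs = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(Xs, y)

# Rank features by |coefficient|: a larger magnitude means a stronger
# influence on the predicted probability of being 'happy'.
ranking = np.argsort(-np.abs(clf.coef_[0]))
```

In our actual analysis the ranking was read from the model fitted to the country features, but the mechanics are the same.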
The major limitation of logistic regression is its assumption of a linear relationship between the independent variables and the log-odds of the outcome. Logistic regression also requires a reasonably large dataset, with sufficient training data for each category it must identify. In our case we were limited by the number of countries, and with only 60% of the data used for training, the sample was especially small. Some of the input variables are strongly correlated with each other, making it easier for the model to overfit early on.
In the end our logistic regression returned the following variables as having the highest correlation with a country’s happiness:
Log GDP per Capita
Social Support
Healthy life expectancy
The following table shows our accuracy, precision, and recall rates of the training, validation, and testing sets using a 60-20-20 split.
DECISION TREE
A decision tree is a non-linear flowchart that can help make decisions or predictions based on previous experience, or ‘training’, by learning simple rules inferred from the data’s features. Indirectly, decision trees can identify which features of a dataset are most relevant by selecting those that most precisely split the data. The basic idea behind any decision tree algorithm is to find the best feature on which to make each split into a separate node, or leaf.
Decision trees are structured from the root at the top to the leaves at the bottom. The root node captures the entire sample population. A decision node splits the population on a particular feature (e.g., social support, GDP per capita). A leaf is reached when no further split can improve the Gini index, giving a final separation of the samples (in our case, happy or unhappy) based on a combination of features. The algorithm searches over possible tree structures and selects the one that minimises the probability of mispredicting the terminal value at the final leaf.
The aforementioned Gini impurity index is the probability of incorrectly predicting the terminal value of a random sample at a leaf node, so the optimal decision tree is the one with the lowest Gini index. Using a decision tree with this dataset also required binarisation. Again, the size of the dataset introduced some limitation and inaccuracy into the model; it was susceptible to overfitting, for which a 50-30-20 split was used.
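The Gini impurity and a depth-limited tree can be sketched as follows, again on hypothetical synthetic data (the split ratios and feature names here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions p_i, i.e.
    the probability of mislabelling a random sample drawn from this node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 happy/unhappy node has 0.5.
pure = gini(np.array([1, 1, 1, 1]))
mixed = gini(np.array([0, 1, 0, 1]))

# Synthetic stand-in for the binarised happiness data.
rng = np.random.default_rng(3)
X = rng.normal(size=(156, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A shallow tree, grown with the Gini criterion, to limit overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              random_state=0).fit(X, y)
```

Capping the depth trades a little training accuracy for the robustness a small dataset like ours needs.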
We also generated a confusion matrix, which documents the numbers of false positives/negatives and true positives/negatives that a model generates. This is an important consideration in a classification like this one: it is better to incorrectly predict that a country is unhappy when it is in fact happy than the other way around.
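For a concrete (made-up) example of reading off those four counts with scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative true vs predicted labels (1 = happy, 0 = unhappy).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true labels, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

Here the false-negative count (happy countries predicted unhappy) is the cheaper kind of error for our purposes.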
RANDOM FOREST
A random forest consists of a large number of individual, largely uncorrelated decision trees whose outputs are combined into a final prediction. The combination is a more robust classifier than any individual decision tree and also combats overfitting. Again, the goal of the random forest is to find the combination of branches and leaves across the forest that produces the lowest probability of misclassifying a country, in this case as happy or unhappy, at the terminal leaf… aka having the lowest Gini index.
To build the forest, each decision tree’s algorithm randomly selects rows, with replacement, and columns, without replacement, from the dataset until the tree has been built or the set maximum depth has been reached. Again, binarisation of the dependent variable, the ladder score representing happiness, is necessary.
The dataset is again split into training, validation, and testing sets. To fine-tune the model and guard against overfitting, we plotted how the accuracy and precision on the training and validation sets changed as the maximum depth and minimum impurity decrease increased. We chose a maximum depth of 3, after which the accuracy and precision on the validation data plateau; increasing the depth further brings no benefit and may lead to overfitting.
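The depth sweep and the feature-importance readout might look like this, on synthetic stand-in data (seeds, split ratio, and estimator count are illustrative, not the report's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the binarised happiness data: features 0 and 1
# carry the signal, the rest are noise.
rng = np.random.default_rng(4)
X = rng.normal(size=(156, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

# Sweep max_depth and watch where validation accuracy plateaus.
val_acc = {}
for depth in range(1, 8):
    rf = RandomForestClassifier(n_estimators=200, max_depth=depth,
                                random_state=0).fit(X_tr, y_tr)
    val_acc[depth] = rf.score(X_val, y_val)

# Feature importances (mean impurity decrease) from the fitted forest.
importances = rf.feature_importances_
```

Plotting `val_acc` against depth is how a plateau like the one we saw at depth 3 shows up.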
The most important features from the random forest model analysis are the GDP per capita, social support, and healthy life expectancy.
SUPPORT VECTOR MACHINE CLASSIFIER (SVM)
The support vector machine (SVM) is a supervised machine learning model that applies classification algorithms to two-group problems. SVMs are known to perform well with a limited amount of data, which made them a promising option once our analysis began and we realised our happiness dataset was constrained by its size.
The job of the SVM is to take the training data and three hyperparameters (C, kernel, gamma) and output a hyperplane whose decision boundary, its shape depending on the kernel, can properly sort the happy from the unhappy countries. The question for this form of supervised learning in relation to the happiness dataset therefore becomes: which hyperparameters create the SVM model best able to correctly categorise countries as happy or unhappy?
The kernel is a function that changes the shape and dimension of the decision surface so that non-linear data can be separated more easily. The C value tells the SVM optimisation how much to avoid misclassification: a larger value will choose a smaller-margin hyperplane if it appears to do better, whereas a smaller value will look for a larger margin of separation. In the end, manipulating the gamma parameter did nothing to change our results.
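A hedged sketch of searching over those three hyperparameters with cross-validation, on synthetic two-class data (the grid values here are illustrative, not the ones from our experiments):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data standing in for Happy/Unhappy countries.
rng = np.random.default_rng(5)
X = rng.normal(size=(156, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Small grid over the three hyperparameters discussed above; GridSearchCV
# scores each combination by cross-validation and keeps the best.
grid = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"],
                "C": [0.1, 1, 10],
                "gamma": ["scale", "auto"]},
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_  # hyperparameters of the best-scoring model
```

Note that gamma only affects the non-linear kernels, which is one reason varying it can leave results unchanged.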
There was some evidence of overfitting, as seen in the accuracy and precision plots for both kernel type and C values. However, as the dataset was already quite small, it was undesirable to change the train-test split ratios much further. Compared with the other models in this report, the SVM approach is not one we would carry forward in further research.
IN CONCLUSION
From the linear and logistic regressions, as well as the random forest, the variables with the strongest correlation to a country’s happiness are GDP per capita (R²=0.6657), healthy life expectancy (R²=0.6390) and social support (R²=0.5947), while the predictive variables we added to the dataset out of our own curiosity showed significantly weaker correlations. The decision tree highlighted deaths linked to alcohol as a significant feature. In further research and modelling, it could be interesting to inspect the impact of the pandemic on these indices and variables.