World Happiness Data from Kaggle

Rafael Djamous

Write Up

I am glad to state that I made the decision to complete the project on my own. As the first stage, I conducted some research and went over the list of sources provided for data mining. I concentrated on data sources that are intriguing to me and straightforward to get information from. Finally, I chose to base my judgment on the data I gathered from the website Kaggle. Let's look at the website. As previously stated, data access and modification will be handled through the use of a Google Collab notebook. In addition, the code is available on my Github page.

In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

World Happiness Data

Kaggle

The data obtained from Kaggle is the World happiness data. It contains data such as the Overall rank, Country or region, Score, GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity and Perceptions of corruption.

Extraction of data from Kaggle is quite different when compared to other sources. We make use of the opendatasets library to download the dataset from Kaggle. This prompts us to provide the details like the Kaggle username and the key, we are obtained in a json file downloaded from Kaggle under the ‘Account’ option. After provideng these details, the dataset is downloaded and exracted to our working directory. From here, we can then read the csv file using the read_csv method of Pandas.

In the EDA, we checked for the data types in the data, the correlation between them and also the measures of central tendency for the numerical variables. All are numerical variables except 'Country or region'. The data is also distributed over 156 different countries.

Heatmap_df.png

We can see that there is a high correlation between 'Score' and GDP per capita, Social support, Healthy life expectancy & Freedom to make life choices

Our main interest here was to find the correlation between the happiness score and the GDP per capita. To investigate this, we made a scatterplot using seaborn and this included a regression line. The output graph showed that the two are directly proportional in most parts of the world. Below is the output graph:

HapinessScoreVsGDP.png

We also noted that, in the countries that the freedom to make life choices was high, the happiness score was also high. There is also a high positive correlation between hapiness score and Social support, Healthy life expectancy & Freedom to make life choices

This is illustrated below:

HapinessScoreVsSocialsupport.png

  • These variables with high correlation with 'Score' can be used to predict it using machine learning models like Multiple Linear Regression

Challenges Faced

The main challenge we faced was in the extraction of data from Kaggle since we did not know that a json file from Kaggle was required so as to download the data in the first part.

Code

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Kaggle Dataset

  • The aim is to find out the correlation between hapiness score and the GDP per capita in various countries
In [4]:
! pip install opendatasets
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Requirement already satisfied: kaggle in /usr/local/lib/python3.8/dist-packages (from opendatasets) (1.5.12)
Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from opendatasets) (7.1.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from opendatasets) (4.64.1)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (1.24.3)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (2.23.0)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (7.0.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (2.8.2)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (1.15.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from kaggle->opendatasets) (2022.9.24)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.8/dist-packages (from python-slugify->kaggle->opendatasets) (1.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->kaggle->opendatasets) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->kaggle->opendatasets) (3.0.4)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22
In [5]:
import opendatasets as od
In [6]:
datasets = 'https://www.kaggle.com/datasets/unsdsn/world-happiness'
In [7]:
od.download(datasets)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: rafaeldjamous
Your Kaggle Key: ··········
Downloading world-happiness.zip to ./world-happiness
100%|██████████| 36.8k/36.8k [00:00<00:00, 21.9MB/s]


In [8]:
df = pd.read_csv('./world-happiness/2019.csv')
In [9]:
df.head()
Out[9]:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
In [10]:
df.info() #Let us look at the data types present in the data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB
  • All are numerical variables except 'Country or region'
  • There are 9 columns and 156 instances
In [11]:
#Checking for missing values
missing_values = df.isnull().sum().sort_values(ascending = False)
missing_values
Out[11]:
Overall rank                    0
Country or region               0
Score                           0
GDP per capita                  0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       0
dtype: int64
  • There are no missing values in any column in this dataset
In [12]:
#let us check the distribution of data over countries
print('Total Number: ', len(df['Country or region'].value_counts()))
df['Country or region'].value_counts()
Total Number:  156
Out[12]:
Finland                1
Venezuela              1
Jordan                 1
Benin                  1
Congo (Brazzaville)    1
                      ..
Latvia                 1
South Korea            1
Estonia                1
Jamaica                1
South Sudan            1
Name: Country or region, Length: 156, dtype: int64
  • The data is distributed over 156 different countries
In [13]:
#Next we check the central measures of tendency for the numerical variables
df.describe()
Out[13]:
Overall rank Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
count 156.000000 156.000000 156.000000 156.000000 156.000000 156.000000 156.000000 156.000000
mean 78.500000 5.407096 0.905147 1.208814 0.725244 0.392571 0.184846 0.110603
std 45.177428 1.113120 0.398389 0.299191 0.242124 0.143289 0.095254 0.094538
min 1.000000 2.853000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 39.750000 4.544500 0.602750 1.055750 0.547750 0.308000 0.108750 0.047000
50% 78.500000 5.379500 0.960000 1.271500 0.789000 0.417000 0.177500 0.085500
75% 117.250000 6.184500 1.232500 1.452500 0.881750 0.507250 0.248250 0.141250
max 156.000000 7.769000 1.684000 1.624000 1.141000 0.631000 0.566000 0.453000
In [14]:
#We next check the correlation between the variables
df.corr()
Out[14]:
Overall rank Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
Overall rank 1.000000 -0.989096 -0.801947 -0.767465 -0.787411 -0.546606 -0.047993 -0.351959
Score -0.989096 1.000000 0.793883 0.777058 0.779883 0.566742 0.075824 0.385613
GDP per capita -0.801947 0.793883 1.000000 0.754906 0.835462 0.379079 -0.079662 0.298920
Social support -0.767465 0.777058 0.754906 1.000000 0.719009 0.447333 -0.048126 0.181899
Healthy life expectancy -0.787411 0.779883 0.835462 0.719009 1.000000 0.390395 -0.029511 0.295283
Freedom to make life choices -0.546606 0.566742 0.379079 0.447333 0.390395 1.000000 0.269742 0.438843
Generosity -0.047993 0.075824 -0.079662 -0.048126 -0.029511 0.269742 1.000000 0.326538
Perceptions of corruption -0.351959 0.385613 0.298920 0.181899 0.295283 0.438843 0.326538 1.000000
In [15]:
#Let us visualize this   
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True,cmap="crest")
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
plt.savefig('Heatmap_df.png')
  • We can see that there is a high correlation between 'Score' and GDP per capita, Social support, Healthy life expectancy & Freedom to make life choices
  • These variables can therefore be used to predict the 'Score' variable using a model like Multiple Linear Regression
  • Since we see that hapiness_score isdirectly proportional to the GDP, we can find a different dataset with GDP as the target variable and then see the elements that are directly proportional to it. In so doing, we will find more variables that determine the hapiness_score of a country.
In [16]:
dataset = 'https://www.kaggle.com/datasets/rutikbhoyar/gdp-prediction-dataset'
In [17]:
od.download(dataset)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: rafaeldjamous
Your Kaggle Key: ··········
Downloading gdp-prediction-dataset.zip to ./gdp-prediction-dataset
100%|██████████| 13.4k/13.4k [00:00<00:00, 9.50MB/s]


In [18]:
gdp = pd.read_csv('./gdp-prediction-dataset/world.csv')
In [19]:
gdp.head()
Out[19]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48,0 0,00 23,06 163,07 700.0 36,0 3,2 12,13 0,22 87,65 1 46,6 20,34 0,38 0,24 0,38
1 Albania EASTERN EUROPE 3581655 28748 124,6 1,26 -4,93 21,52 4500.0 86,5 71,2 21,09 4,42 74,49 3 15,11 5,22 0,232 0,188 0,579
2 Algeria NORTHERN AFRICA 32930091 2381740 13,8 0,04 -0,39 31 6000.0 70,0 78,1 3,22 0,25 96,53 1 17,14 4,61 0,101 0,6 0,298
3 American Samoa OCEANIA 57794 199 290,4 58,29 -20,71 9,27 8000.0 97,0 259,5 10 15 75 2 22,46 3,27 NaN NaN NaN
4 Andorra WESTERN EUROPE 71201 468 152,1 0,00 6,6 4,05 19000.0 100,0 497,2 2,22 0 97,78 3 8,71 6,25 NaN NaN NaN
In [20]:
gdp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country                             227 non-null    object 
 1   Region                              227 non-null    object 
 2   Population                          227 non-null    int64  
 3   Area (sq. mi.)                      227 non-null    int64  
 4   Pop. Density (per sq. mi.)          227 non-null    object 
 5   Coastline (coast/area ratio)        227 non-null    object 
 6   Net migration                       224 non-null    object 
 7   Infant mortality (per 1000 births)  224 non-null    object 
 8   GDP ($ per capita)                  226 non-null    float64
 9   Literacy (%)                        209 non-null    object 
 10  Phones (per 1000)                   223 non-null    object 
 11  Arable (%)                          225 non-null    object 
 12  Crops (%)                           225 non-null    object 
 13  Other (%)                           225 non-null    object 
 14  Climate                             205 non-null    object 
 15  Birthrate                           224 non-null    object 
 16  Deathrate                           223 non-null    object 
 17  Agriculture                         212 non-null    object 
 18  Industry                            211 non-null    object 
 19  Service                             212 non-null    object 
dtypes: float64(1), int64(2), object(17)
memory usage: 35.6+ KB
In [21]:
#Checking for missing values
missing_values = gdp.isnull().sum().sort_values(ascending = False)
missing_values
Out[21]:
Climate                               22
Literacy (%)                          18
Industry                              16
Service                               15
Agriculture                           15
Deathrate                              4
Phones (per 1000)                      4
Infant mortality (per 1000 births)     3
Net migration                          3
Birthrate                              3
Arable (%)                             2
Crops (%)                              2
Other (%)                              2
GDP ($ per capita)                     1
Region                                 0
Coastline (coast/area ratio)           0
Pop. Density (per sq. mi.)             0
Area (sq. mi.)                         0
Population                             0
Country                                0
dtype: int64
  • There are missing values in most columns of the dataset, let us drop them
In [22]:
gdp = gdp.dropna()
In [23]:
gdp.shape
Out[23]:
(179, 20)
In [24]:
gdp = gdp.replace(',', '.', regex=True)
In [25]:
gdp.head()
Out[25]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48.0 0.00 23.06 163.07 700.0 36.0 3.2 12.13 0.22 87.65 1 46.6 20.34 0.38 0.24 0.38
1 Albania EASTERN EUROPE 3581655 28748 124.6 1.26 -4.93 21.52 4500.0 86.5 71.2 21.09 4.42 74.49 3 15.11 5.22 0.232 0.188 0.579
2 Algeria NORTHERN AFRICA 32930091 2381740 13.8 0.04 -0.39 31 6000.0 70.0 78.1 3.22 0.25 96.53 1 17.14 4.61 0.101 0.6 0.298
6 Anguilla LATIN AMER. & CARIB 13477 102 132.1 59.80 10.76 21.03 8600.0 95.0 460.0 0 0 100 2 14.17 5.34 0.04 0.18 0.78
7 Antigua & Barbuda LATIN AMER. & CARIB 69108 443 156.0 34.54 -6.15 19.46 11000.0 89.0 549.9 18.18 4.55 77.27 2 16.93 5.37 0.038 0.22 0.743
In [26]:
gdp.columns
Out[26]:
Index(['Country', 'Region', 'Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)',
       'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate',
       'Agriculture', 'Industry', 'Service'],
      dtype='object')
In [27]:
#Let us convert the object type columns to numeric
gdp['Population'] = pd.to_numeric(gdp['Population'])
gdp['Area (sq. mi.)'] = pd.to_numeric(gdp['Area (sq. mi.)'])
gdp['Pop. Density (per sq. mi.)'] = pd.to_numeric(gdp['Pop. Density (per sq. mi.)'])
gdp['Coastline (coast/area ratio)'] = pd.to_numeric(gdp['Coastline (coast/area ratio)'])
gdp['Infant mortality (per 1000 births)'] = pd.to_numeric(gdp['Infant mortality (per 1000 births)'])
gdp['Crops (%)'] = pd.to_numeric(gdp['Crops (%)'])
gdp['Other (%)'] = pd.to_numeric(gdp['Other (%)'])
gdp['Climate'] = pd.to_numeric(gdp['Climate'])
gdp['Birthrate'] = pd.to_numeric(gdp['Birthrate'])
gdp['Deathrate'] = pd.to_numeric(gdp['Deathrate'])
gdp['Agriculture'] = pd.to_numeric(gdp['Agriculture'])
gdp['Industry'] = pd.to_numeric(gdp['Industry'])
gdp['Service'] = pd.to_numeric(gdp['Service'])
In [28]:
#Let's check for the correlation between variables
gdp.corr()
Out[28]:
Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Infant mortality (per 1000 births) GDP ($ per capita) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
Population 1.000000 0.610850 -0.019010 -0.054617 0.002438 -0.033618 -0.062567 -0.137345 -0.018471 -0.064719 -0.050578 -0.007401 0.092468 -0.070320
Area (sq. mi.) 0.610850 1.000000 -0.069010 -0.088162 0.002924 0.068356 -0.160433 0.124528 -0.094852 -0.037473 -0.024266 -0.017035 0.103225 -0.070204
Pop. Density (per sq. mi.) -0.019010 -0.069010 1.000000 0.164036 -0.143214 0.190122 -0.036580 0.066753 -0.012370 -0.174565 -0.130624 -0.144315 -0.145370 0.255477
Coastline (coast/area ratio) -0.054617 -0.088162 0.164036 1.000000 -0.105956 0.035815 0.399358 -0.137085 -0.027063 -0.063464 -0.148592 -0.032327 -0.188972 0.190004
Infant mortality (per 1000 births) 0.002438 0.002924 -0.143214 -0.105956 1.000000 -0.639090 -0.095712 0.148600 -0.366672 0.862113 0.665729 0.758537 -0.085310 -0.618259
GDP ($ per capita) -0.033618 0.068356 0.190122 0.035815 -0.639090 1.000000 -0.207844 0.066445 0.360567 -0.658795 -0.247562 -0.616919 0.032855 0.536551
Crops (%) -0.062567 -0.160433 -0.036580 0.399358 -0.095712 -0.207844 1.000000 -0.582627 -0.003734 0.075813 -0.208984 0.084289 -0.124211 0.029020
Other (%) -0.137345 0.124528 0.066753 -0.137085 0.148600 0.066445 -0.582627 1.000000 -0.318964 0.123943 0.066027 -0.027108 0.122303 -0.081546
Climate -0.018471 -0.094852 -0.012370 -0.027063 -0.366672 0.360567 -0.003734 -0.318964 1.000000 -0.456312 0.021979 -0.187472 -0.077286 0.237414
Birthrate -0.064719 -0.037473 -0.174565 -0.063464 0.862113 -0.658795 0.075813 0.123943 -0.456312 1.000000 0.446220 0.703979 -0.120518 -0.541710
Deathrate -0.050578 -0.024266 -0.130624 -0.148592 0.665729 -0.247562 -0.208984 0.066027 0.021979 0.446220 1.000000 0.416409 -0.012611 -0.366187
Agriculture -0.007401 -0.017035 -0.144315 -0.032327 0.758537 -0.616919 0.084289 -0.027108 -0.187472 0.703979 0.416409 1.000000 -0.352785 -0.613489
Industry 0.092468 0.103225 -0.145370 -0.188972 -0.085310 0.032855 -0.124211 0.122303 -0.077286 -0.120518 -0.012611 -0.352785 1.000000 -0.521413
Service -0.070320 -0.070204 0.255477 0.190004 -0.618259 0.536551 0.029020 -0.081546 0.237414 -0.541710 -0.366187 -0.613489 -0.521413 1.000000
In [29]:
#Let us visualize this
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(gdp.corr(), vmin=-1, vmax=1, annot=True,cmap="crest")
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
  • We can see that GDP has got a high positive correlation with 'climate' and 'service'. On the other hand, it has a great negative correlation with 'Infant Mortality' and 'Birthrate' and 'Agriculture'.
  • As we remember, GDP is positively correlated with happiness-score. Therse features seen above, therefore impact it in the same way. For instance, if infant mortality in a country is high, then hapiness_score will also be low.
In [30]:
sns.lmplot(x='GDP per capita', y='Score',data=df,fit_reg=True) 
plt.title('Hapiness Score vs GDP')
plt.savefig('HapinessScoreVsGDP.png')
plt.show()
In [31]:
sns.lmplot(x='Social support', y='Score',data=df,fit_reg=True) 
plt.title('Hapiness Score vs Social support')
plt.savefig('HapinessScoreVsSocialsupport.png')
plt.show()
In [32]:
sns.lmplot(x='Healthy life expectancy', y='Score',data=df,fit_reg=True) 
plt.title('Hapiness Score vs Healthy life expectancy')
plt.savefig('HapinessScoreVsHealthylifeexpectancy.png')
plt.show()
In [33]:
sns.lmplot(x='Freedom to make life choices', y='Score',data=df,fit_reg=True) 
plt.title('Hapiness Score vs Freedom to make life choices')
plt.savefig('HapinessScoreVsFreedomtomakelifechoices.png')
plt.show()
In [34]:
#Let's create a model to predict hapiness_score given GDP
sns.jointplot(x='GDP per capita', y='Score',data=df)
Out[34]:
<seaborn.axisgrid.JointGrid at 0x7fb6f96ffd30>
In [35]:
from sklearn.linear_model import LinearRegression
In [36]:
import numpy as np
y = df['Score']
x = np.asarray(df['GDP per capita']).reshape(-1, 1)
In [37]:
lm2 = LinearRegression()
In [38]:
lm2.fit(x,y)
Out[38]:
LinearRegression()
In [39]:
print(lm2.intercept_, lm2.coef_) #These are the values of slope and intercept. The regression line is therefore y = 3.4x + 2.22
3.399345178292417 [2.218148]
In [40]:
lm2.predict(x) #Predicting score given gdp
Out[40]:
array([6.3716635 , 6.46704386, 6.6999494 , 6.46038942, 6.49587979,
       6.62009608, 6.47591646, 6.28959202, 6.4271172 , 6.45151683,
       6.44264424, 5.69291021, 6.22970203, 6.96834531, 6.35613646,
       6.72434903, 6.44486238, 6.40715387, 6.57795126, 6.21417499,
       6.73322162, 6.28293758, 5.77276354, 6.33617313, 6.43377164,
       5.97017871, 5.17386358, 6.51140682, 7.13470641, 6.25188351,
       5.94799723, 5.62636577, 5.89254353, 6.88627384, 5.16055469,
       6.26962869, 6.42046276, 6.16315759, 6.12988537, 6.07443167,
       5.05186544, 6.1454124 , 5.58422096, 6.18977536, 4.93873989,
       5.35575172, 5.8215628 , 5.97683316, 6.2008661 , 5.42229616,
       6.72656718, 5.72840058, 6.03228686, 6.28515573, 6.14319426,
       5.24262617, 5.88367094, 6.34282758, 4.82339619, 6.00123278,
       5.12062803, 6.06334093, 5.29586172, 6.2008661 , 5.52876726,
       6.10770389, 4.90103137, 6.02341426, 5.18939062, 5.62636577,
       4.91877656, 5.71509169, 5.73061873, 4.49289214, 5.96130612,
       6.589042  , 5.6507654 , 5.49549504, 6.02341426, 6.10770389,
       5.7661091 , 6.01897797, 5.50214948, 5.57978466, 4.94317619,
       4.62154473, 5.73283688, 5.62192948, 5.17608173, 5.71287354,
       5.58865726, 5.46444097, 5.68181947, 5.04299285, 5.2026995 ,
       4.61710843, 5.8215628 , 4.75463361, 4.66147139, 4.38863919,
       5.25593506, 4.27107734, 4.89215878, 5.74392762, 5.09401025,
       5.52876726, 5.49993134, 5.52876726, 4.67256213, 4.85666841,
       4.39751178, 3.39934518, 5.34909727, 3.7054496 , 4.13355217,
       5.28477098, 5.83930798, 4.24224142, 5.36462431, 4.08253476,
       4.53503695, 4.66368954, 3.85184737, 5.44225949, 4.64594435,
       5.71287354, 3.60785109, 4.25333216, 3.99380884, 5.50436763,
       4.97423026, 4.17569698, 5.21822654, 4.14464291, 5.19826321,
       4.13577031, 5.4245143 , 4.68143472, 4.00933588, 5.07404692,
       3.56126998, 4.00711773, 4.00711773, 4.48401955, 3.50137999,
       4.21118735, 4.11580698, 5.70843725, 4.77237879, 3.82301145,
       4.03595365, 4.19566031, 4.45518363, 4.17569698, 3.45701703,
       4.07809847])

Conclusion

From the project, I was able to import different sets of data and also extract meaningful insights from them.

Also from this course, I was able to learn how to properly use pandas in importing the data from a website link, know the data types in the data, and compute the measures of central tendency in the data. Filtering out columns of a dataframe and ranking them is another concept that I learnt too. In addition to that, I understood how to find the correlation between the data variables.

In our case where the data was going to be obtained from Kaggle. The data was first downloaded using the Opendatasets library and then read using pandas. The data types it contained were 7 float variables, one integer and also one object type variable. The object type variable contained the list of countries. The measures of central tendency were also checked. The observation was that there was a high correlation between Happiness score and GDP per capita, social support, Healthy life expectancy & Freedom to make life choices. These columns could therefore be used to predict score which is the target variable.

In this part, I understood how to make use of the Opendatasets library to download data from Kaggle. To download, a json file from Kaggle, which contained the username and the key was required; we therefore downloaded it. The other concept reinforced here was checking the measures of central tendency like the mean, median and percentiles with Pandas, checking the correlation between the variables and visualizing this using the correlation heatmap with seaborn. Seaborn library was also utilized in visualizing the data and drawing a regression line, especially for those columns that had a positive correlation with Happiness score.

The project was therefore a success coupled with great learning experience and activities that sharpened my programming skills.