I am glad to state that I decided to complete the project on my own. As the first stage, I did some research and went over the list of sources provided for data mining. I concentrated on data sources that are interesting to me and straightforward to get information from. In the end, I chose to base my work on data gathered from the website Kaggle. Let's take a look at it. As previously stated, data access and modification are handled through a Google Colab notebook. In addition, the code is available on my GitHub page.
#Mount Google Drive so that files can be read from and written to it
from google.colab import drive
drive.mount('/content/drive')
The data obtained from Kaggle is the World Happiness data. It contains fields such as Overall rank, Country or region, Score, GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, and Perceptions of corruption.
Extracting data from Kaggle is quite different compared to other sources. We make use of the opendatasets library to download the dataset from Kaggle. This prompts us to provide details such as the Kaggle username and key, which are obtained in a JSON file downloaded from Kaggle under the 'Account' option. After providing these details, the dataset is downloaded and extracted to our working directory. From here, we can read the CSV file using the read_csv method of Pandas.
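For reference, the kaggle.json credentials file downloaded from the 'Account' page is a small JSON document containing only a username and an API key. Below is a minimal sketch (with placeholder values, not real credentials) of writing it into the working directory, where opendatasets can usually pick it up instead of prompting:
#Sketch: create kaggle.json in the working directory with placeholder credentials
import json
kaggle_credentials = {"username": "your_kaggle_username", "key": "your_api_key"} #placeholders only
with open('kaggle.json', 'w') as f:
    json.dump(kaggle_credentials, f)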
In the EDA, we checked the data types in the data, the correlation between the variables, and the measures of central tendency for the numerical variables. All variables are numerical except 'Country or region'. The data is distributed over 156 different countries.
We can see that there is a high correlation between 'Score' and GDP per capita, Social support, Healthy life expectancy, and Freedom to make life choices.
Our main interest here was to find the correlation between the happiness score and GDP per capita. To investigate this, we made a scatterplot with a regression line using seaborn. The output graph showed that the two are directly proportional in most parts of the world. Below is the output graph:
We also noted that, in the countries where the freedom to make life choices was high, the happiness score was also high. There is also a high positive correlation between the happiness score and Social support, Healthy life expectancy, and Freedom to make life choices.
This is illustrated below:
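In addition to the graphs, these correlations can be checked numerically; the short sketch below assumes the dataframe df loaded in the code further down:
#Sketch: correlations of the happiness Score with the other numeric columns, strongest first
score_corr = df.corr(numeric_only=True)['Score'].drop('Score').sort_values(ascending=False)
print(score_corr)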
The main challenge we faced was in the extraction of data from Kaggle, since we did not initially know that a JSON file from Kaggle was required in order to download the data.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
!pip install opendatasets  #Library used to download datasets directly from Kaggle
import opendatasets as od
#Download the World Happiness dataset from Kaggle (prompts for Kaggle credentials if they are not found)
datasets = 'https://www.kaggle.com/datasets/unsdsn/world-happiness'
od.download(datasets)
df = pd.read_csv('./world-happiness/2019.csv')
df.head()
df.info() #Let us look at the data types present in the data
#Checking for missing values
missing_values = df.isnull().sum().sort_values(ascending = False)
missing_values
#let us check the distribution of data over countries
print('Total Number: ', len(df['Country or region'].value_counts()))
df['Country or region'].value_counts()
#Next we check the central measures of tendency for the numerical variables
df.describe()
#We next check the correlation between the numerical variables
df.corr(numeric_only=True) #numeric_only avoids errors from the non-numeric 'Country or region' column in newer pandas
#Let us visualize this
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(df.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap="crest")
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
plt.savefig('Heatmap_df.png')
#Download the GDP prediction dataset (country-level statistics) from Kaggle
dataset = 'https://www.kaggle.com/datasets/rutikbhoyar/gdp-prediction-dataset'
od.download(dataset)
gdp = pd.read_csv('./gdp-prediction-dataset/world.csv')
gdp.head()
gdp.info()
#Checking for missing values
missing_values = gdp.isnull().sum().sort_values(ascending = False)
missing_values
gdp = gdp.dropna()
gdp.shape
#The numeric columns in this file use commas as decimal separators, so replace them with dots
gdp = gdp.replace(',', '.', regex=True)
gdp.head()
gdp.columns
#Let us convert the object type columns to numeric
numeric_cols = ['Population', 'Area (sq. mi.)', 'Pop. Density (per sq. mi.)',
                'Coastline (coast/area ratio)', 'Infant mortality (per 1000 births)',
                'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate',
                'Agriculture', 'Industry', 'Service']
for col in numeric_cols:
    gdp[col] = pd.to_numeric(gdp[col])
#Let's check for the correlation between variables
gdp.corr(numeric_only=True) #numeric_only leaves out any remaining non-numeric columns
#Let us visualize this
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(gdp.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap="crest")
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
sns.lmplot(x='GDP per capita', y='Score', data=df, fit_reg=True)
plt.title('Happiness Score vs GDP')
plt.savefig('HappinessScoreVsGDP.png')
plt.show()
sns.lmplot(x='Social support', y='Score', data=df, fit_reg=True)
plt.title('Happiness Score vs Social support')
plt.savefig('HappinessScoreVsSocialsupport.png')
plt.show()
sns.lmplot(x='Healthy life expectancy', y='Score', data=df, fit_reg=True)
plt.title('Happiness Score vs Healthy life expectancy')
plt.savefig('HappinessScoreVsHealthylifeexpectancy.png')
plt.show()
sns.lmplot(x='Freedom to make life choices', y='Score', data=df, fit_reg=True)
plt.title('Happiness Score vs Freedom to make life choices')
plt.savefig('HappinessScoreVsFreedomtomakelifechoices.png')
plt.show()
#Let's create a model to predict the happiness score given GDP per capita
sns.jointplot(x='GDP per capita', y='Score',data=df)
from sklearn.linear_model import LinearRegression
import numpy as np
y = df['Score']
x = np.asarray(df['GDP per capita']).reshape(-1, 1)
lm2 = LinearRegression()
lm2.fit(x,y)
print(lm2.intercept_, lm2.coef_) #These are the intercept and slope respectively, so the regression line is approximately y = 2.22x + 3.4
lm2.predict(x) #Predicting score given gdp
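As an illustrative use of the fitted model (a sketch; the value 1.0 below is a hypothetical GDP per capita, not a figure taken from the data), we can predict the score for a single GDP value:
#Sketch: predict the happiness score for a hypothetical GDP per capita of 1.0
example_gdp = np.array([[1.0]])
print(lm2.predict(example_gdp)) #roughly slope * 1.0 + intercept, i.e. about 5.6 with the values above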
From the project, I was able to import different sets of data and also extract meaningful insights from them.
Also, from this course I learnt how to properly use pandas to import data from a website link, identify the data types present in the data, and compute the measures of central tendency. Filtering out columns of a dataframe and ranking them is another concept that I learnt. In addition, I understood how to find the correlation between the data variables.
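As a small illustration of the filtering and ranking idea (a sketch against the happiness dataframe df used above, with the 2019 column names), we can keep only the country and score columns and rank the countries by their happiness score:
#Sketch: keep two columns and rank countries by happiness score
ranked = df[['Country or region', 'Score']].sort_values('Score', ascending=False)
ranked['Rank'] = ranked['Score'].rank(ascending=False)
print(ranked.head(10))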
In our case, the data was obtained from Kaggle: it was first downloaded using the opendatasets library and then read with pandas. The data types it contained were seven float variables, one integer, and one object type variable. The object type variable contained the list of countries. The measures of central tendency were also checked. The observation was that there was a high correlation between the happiness score and GDP per capita, Social support, Healthy life expectancy, and Freedom to make life choices. These columns could therefore be used to predict the score, which is the target variable.
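To make that last point concrete, here is a minimal sketch (not part of the original analysis) of fitting a single linear model on those four correlated columns to predict the score; it reuses df and the LinearRegression import from the code above:
#Sketch: predicting Score from the four columns most correlated with it
features = ['GDP per capita', 'Social support', 'Healthy life expectancy',
            'Freedom to make life choices']
multi_lm = LinearRegression()
multi_lm.fit(df[features], df['Score'])
print(multi_lm.score(df[features], df['Score'])) #R^2 of the fit on the training data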
In this part, I understood how to make use of the opendatasets library to download data from Kaggle. To download, a JSON file from Kaggle containing the username and the key was required, so we downloaded it. The other concepts reinforced here were checking measures of central tendency such as the mean, median, and percentiles with pandas, checking the correlation between the variables, and visualizing this with a correlation heatmap in seaborn. The seaborn library was also used to visualize the data and draw regression lines, especially for the columns that had a positive correlation with the happiness score.
The project was therefore a success, coupled with a great learning experience and activities that sharpened my programming skills.