This notebook introduces some foundational machine learning and data science concepts by exploring the problem of heart disease classification.
It is intended to be an end-to-end example of what a data science and machine learning proof of concept might look like.
Classification involves deciding whether a sample belongs to one class or another. If there are only two class options, it's referred to as binary classification; if there are more than two, it's referred to as multi-class classification.
Since we already have a dataset, we'll approach the problem with the following machine learning modelling framework.
6 Step Machine Learning Modelling Framework
More specifically, we'll look at the following topics.
To work through these topics, we'll use pandas, Matplotlib and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.
Tools which can be used for each step of the machine learning modelling process.
We'll work through each step and by the end of the notebook, we'll have a handful of models, all of which can predict whether or not a person has heart disease based on a number of different parameters with considerable accuracy.
You'll also be able to describe which parameters are more indicative than others, for example, sex may be more important than age.
In our case, the problem we will be exploring is binary classification (a sample can only be one of two things).
This is because we're going to be using a number of different features (pieces of information) about a person to predict whether they have heart disease or not.
In a statement,
Given clinical parameters about a patient, can we predict whether or not they have heart disease?
What you'll want to do here is dive into the data your problem definition is based on. This may involve sourcing it, defining different parameters, talking to experts about it and finding out what you should expect.
The original data came from the Cleveland database in the UCI Machine Learning Repository.
However, we've downloaded it in a formatted way from Kaggle.
The original database contains 76 attributes, but here only 14 attributes will be used. Attributes (also called features) are the variables we'll use to predict our target variable.
Attributes and features are also referred to as independent variables and a target variable can be referred to as a dependent variable.
We use the independent variables to predict our dependent variable.
Or in our case, the independent variables are a patient's medical attributes and the dependent variable is whether or not they have heart disease.
The evaluation metric is something you might define at the start of a project.
Since machine learning is very experimental, you might say something like,
If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.
The reason this is helpful is it provides a rough goal for a machine learning engineer or data scientist to work towards.
However, due to the nature of experimentation, the evaluation metric may change over time.
Features are different parts of the data. During this step, you'll want to start finding out what you can about the data.
One of the most common ways to do this is to create a data dictionary.
A data dictionary describes the data you're dealing with. Not all datasets come with them so this is where you may have to do your research or ask a subject matter expert (someone who knows about the data) for more.
The following are the features we'll use to predict our target variable (heart disease or no heart disease).
Note: No personally identifiable information (PII) can be found in the dataset.
It's a good idea to save these to a Python dictionary or in an external file, so we can look at them later without coming back here.
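For example, here's a minimal sketch of what that might look like. The feature names come from the dataset; the short descriptions are paraphrased from the original UCI documentation, so treat them as a starting point rather than the definitive dictionary.
# A minimal sketch of a data dictionary saved as a Python dict
feature_dict = {
    "age": "age in years",
    "sex": "1 = male, 0 = female",
    "cp": "chest pain type (0-3)",
    "trestbps": "resting blood pressure (in mm Hg)",
    "chol": "serum cholesterol (in mg/dl)",
    "thalach": "maximum heart rate achieved",
    "target": "1 = heart disease, 0 = no heart disease"
}

# Save it to an external JSON file so it can be reloaded later without rerunning this cell
import json
with open("heart-disease-data-dictionary.json", "w") as f:
    json.dump(feature_dict, f, indent=2)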
At the start of any project, it's customary to see the required libraries imported in a big chunk like you can see below.
However, in practice, your projects may import libraries as you go. After you've spent a couple of hours working on your problem, you'll probably want to do some tidying up. This is where you may want to consolidate every library you've used at the top of your notebook (like the cell below).
The libraries you use will differ from project to project. But there are a few you'll likely take advantage of during almost every structured data project.
# Regular EDA and plotting libraries
import numpy as np # np is short for numpy
import pandas as pd # pandas is so commonly used, it's shortened to pd
import matplotlib.pyplot as plt
import seaborn as sns # seaborn gets shortened to sns
# We want our plots to appear in the notebook
%matplotlib inline
## Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
## Model evaluators
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve
df = pd.read_csv("https://raw.githubusercontent.com/thisiskhan/project_heart_disease/main/Data/heart-disease.csv")
df.shape
(303, 14)
The goal here is to find out more about the data and become a subject matter expert on the dataset you're working with.
df.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
df.tail()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |
302 | 57 | 0 | 1 | 130 | 236 | 0 | 0 | 174 | 0 | 0.0 | 1 | 1 | 2 | 0 |
df["target"].value_counts()
1 165 0 138 Name: target, dtype: int64
df["target"].value_counts().plot(kind = "bar", color = ["salmon", "lightblue"]);
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 303 entries, 0 to 302 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 303 non-null int64 1 sex 303 non-null int64 2 cp 303 non-null int64 3 trestbps 303 non-null int64 4 chol 303 non-null int64 5 fbs 303 non-null int64 6 restecg 303 non-null int64 7 thalach 303 non-null int64 8 exang 303 non-null int64 9 oldpeak 303 non-null float64 10 slope 303 non-null int64 11 ca 303 non-null int64 12 thal 303 non-null int64 13 target 303 non-null int64 dtypes: float64(1), int64(13) memory usage: 33.3 KB
# Are there any missing values
df.isna().sum()
age 0 sex 0 cp 0 trestbps 0 chol 0 fbs 0 restecg 0 thalach 0 exang 0 oldpeak 0 slope 0 ca 0 thal 0 target 0 dtype: int64
df.describe()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 |
mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.399340 | 0.729373 | 2.313531 | 0.544554 |
std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.525860 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
min | 29.000000 | 0.000000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 47.500000 | 0.000000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 0.000000 | 133.500000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 |
50% | 55.000000 | 1.000000 | 1.000000 | 130.000000 | 240.000000 | 0.000000 | 1.000000 | 153.000000 | 0.000000 | 0.800000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 |
75% | 61.000000 | 1.000000 | 2.000000 | 140.000000 | 274.500000 | 0.000000 | 1.000000 | 166.000000 | 1.000000 | 1.600000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 |
max | 77.000000 | 1.000000 | 3.000000 | 200.000000 | 564.000000 | 1.000000 | 2.000000 | 202.000000 | 1.000000 | 6.200000 | 2.000000 | 4.000000 | 3.000000 | 1.000000 |
df.sex.value_counts()
1 207 0 96 Name: sex, dtype: int64
# Compare target column with sex column
pd.crosstab(df.target, df.sex)
sex | 0 | 1 |
---|---|---|
target | ||
0 | 24 | 114 |
1 | 72 | 93 |
# # Compare age column with target column
# pd.crosstab(df.age, df.target).head(n = 41)
# Create a plot of crosstab
pd.crosstab(df.target, df.sex).plot(
kind = "bar",
figsize = (10, 6),
color = ["salmon", "lightblue"]
);
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disaese, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation = 0)
(array([0, 1]), [Text(0, 0, '0'), Text(1, 0, '1')])
Max heart rate (thalach) vs. age
pd.crosstab(df.thalach, df.age)
age | 29 | 34 | 35 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | ... | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 74 | 76 | 77 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
thalach | |||||||||||||||||||||
71 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
190 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
192 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
194 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
202 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
91 rows × 41 columns
# Create a figure
plt.figure(figsize = (10, 6))
# Scatter with positive examples
plt.scatter(df.age[df.target == 1],
df.thalach[df.target == 1],
c = "salmon"
);
# Scatter with negative examples
plt.scatter(df.age[df.target == 0],
df.thalach[df.target == 0],
c = "lightblue"
);
# Add some info
plt.title("Heart Disease in function of age and Max Heart Rate (thalach)")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate (thalach)")
plt.legend(["Disease", "No Disease"]);
# Check the distribution of age Column with histogram
df.age.plot.hist();
pd.crosstab(df.cp, df.target)
target | 0 | 1 |
---|---|---|
cp | ||
0 | 104 | 39 |
1 | 9 | 41 |
2 | 18 | 69 |
3 | 7 | 16 |
# Make the crosstab more visual
pd.crosstab(df.cp, df.target).plot(kind = "bar",
figsize = (10, 6),
color = ["salmon", "lightblue"]
);
# Add some communication
plt.xlabel("chest pain type (cp)")
plt.ylabel("Amount")
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation = 0);
df.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# Build a correlation matrix
df.corr()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1.000000 | -0.098447 | -0.068653 | 0.279351 | 0.213678 | 0.121308 | -0.116211 | -0.398522 | 0.096801 | 0.210013 | -0.168814 | 0.276326 | 0.068001 | -0.225439 |
sex | -0.098447 | 1.000000 | -0.049353 | -0.056769 | -0.197912 | 0.045032 | -0.058196 | -0.044020 | 0.141664 | 0.096093 | -0.030711 | 0.118261 | 0.210041 | -0.280937 |
cp | -0.068653 | -0.049353 | 1.000000 | 0.047608 | -0.076904 | 0.094444 | 0.044421 | 0.295762 | -0.394280 | -0.149230 | 0.119717 | -0.181053 | -0.161736 | 0.433798 |
trestbps | 0.279351 | -0.056769 | 0.047608 | 1.000000 | 0.123174 | 0.177531 | -0.114103 | -0.046698 | 0.067616 | 0.193216 | -0.121475 | 0.101389 | 0.062210 | -0.144931 |
chol | 0.213678 | -0.197912 | -0.076904 | 0.123174 | 1.000000 | 0.013294 | -0.151040 | -0.009940 | 0.067023 | 0.053952 | -0.004038 | 0.070511 | 0.098803 | -0.085239 |
fbs | 0.121308 | 0.045032 | 0.094444 | 0.177531 | 0.013294 | 1.000000 | -0.084189 | -0.008567 | 0.025665 | 0.005747 | -0.059894 | 0.137979 | -0.032019 | -0.028046 |
restecg | -0.116211 | -0.058196 | 0.044421 | -0.114103 | -0.151040 | -0.084189 | 1.000000 | 0.044123 | -0.070733 | -0.058770 | 0.093045 | -0.072042 | -0.011981 | 0.137230 |
thalach | -0.398522 | -0.044020 | 0.295762 | -0.046698 | -0.009940 | -0.008567 | 0.044123 | 1.000000 | -0.378812 | -0.344187 | 0.386784 | -0.213177 | -0.096439 | 0.421741 |
exang | 0.096801 | 0.141664 | -0.394280 | 0.067616 | 0.067023 | 0.025665 | -0.070733 | -0.378812 | 1.000000 | 0.288223 | -0.257748 | 0.115739 | 0.206754 | -0.436757 |
oldpeak | 0.210013 | 0.096093 | -0.149230 | 0.193216 | 0.053952 | 0.005747 | -0.058770 | -0.344187 | 0.288223 | 1.000000 | -0.577537 | 0.222682 | 0.210244 | -0.430696 |
slope | -0.168814 | -0.030711 | 0.119717 | -0.121475 | -0.004038 | -0.059894 | 0.093045 | 0.386784 | -0.257748 | -0.577537 | 1.000000 | -0.080155 | -0.104764 | 0.345877 |
ca | 0.276326 | 0.118261 | -0.181053 | 0.101389 | 0.070511 | 0.137979 | -0.072042 | -0.213177 | 0.115739 | 0.222682 | -0.080155 | 1.000000 | 0.151832 | -0.391724 |
thal | 0.068001 | 0.210041 | -0.161736 | 0.062210 | 0.098803 | -0.032019 | -0.011981 | -0.096439 | 0.206754 | 0.210244 | -0.104764 | 0.151832 | 1.000000 | -0.344029 |
target | -0.225439 | -0.280937 | 0.433798 | -0.144931 | -0.085239 | -0.028046 | 0.137230 | 0.421741 | -0.436757 | -0.430696 | 0.345877 | -0.391724 | -0.344029 | 1.000000 |
# Let's make our correlation matrix a little prettier
corr_metrics = df.corr()
fig, ax = plt.subplots(figsize = (15, 10))
ax = sns.heatmap(corr_metrics,
annot = True,
linewidths = 0.5,
fmt = ".2f",
cmap = "YlGnBu"
);
df.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# Everything except target variable
X = df.drop("target", axis=1)
# Target variable
y = df.target.values
X
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 |
299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 |
300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 |
302 | 57 | 0 | 1 | 130 | 236 | 0 | 0 | 174 | 0 | 0.0 | 1 | 1 | 2 |
303 rows × 13 columns
y
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# Random seed for reproducibility
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, # independent variables
y, # dependent variable
test_size = 0.2) # percentage of data to use for test set
X_test
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
179 | 57 | 1 | 0 | 150 | 276 | 0 | 0 | 112 | 1 | 0.6 | 1 | 1 | 1 |
228 | 59 | 1 | 3 | 170 | 288 | 0 | 0 | 159 | 0 | 0.2 | 1 | 0 | 3 |
111 | 57 | 1 | 2 | 150 | 126 | 1 | 1 | 173 | 0 | 0.2 | 2 | 1 | 3 |
246 | 56 | 0 | 0 | 134 | 409 | 0 | 0 | 150 | 1 | 1.9 | 1 | 2 | 3 |
60 | 71 | 0 | 2 | 110 | 265 | 1 | 0 | 130 | 0 | 0.0 | 2 | 1 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
249 | 69 | 1 | 2 | 140 | 254 | 0 | 0 | 146 | 0 | 2.0 | 1 | 3 | 3 |
104 | 50 | 1 | 2 | 129 | 196 | 0 | 1 | 163 | 0 | 0.0 | 2 | 0 | 2 |
300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
193 | 60 | 1 | 0 | 145 | 282 | 0 | 0 | 142 | 1 | 2.8 | 1 | 2 | 3 |
184 | 50 | 1 | 0 | 150 | 243 | 0 | 0 | 128 | 0 | 2.6 | 1 | 0 | 3 |
61 rows × 13 columns
We'll train it (find the patterns) on the training set.
And we'll test it (use the patterns) on the test set.
We're going to try 3 different machine learning models:
# Put models in a dictionary
models = {"KNN": KNeighborsClassifier(),
"Logistic Regression": LogisticRegression(),
"Random Forest": RandomForestClassifier()}
# Create function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data
    X_test : testing data
    y_train : labels associated with training data
    y_test : labels associated with test data
    """
    # Random seed for reproducible results
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores
model_scores = fit_and_score(models=models,
X_train=X_train,
X_test=X_test,
y_train=y_train,
y_test=y_test)
model_scores
/Users/razakhan/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result(
{'KNN': 0.6885245901639344, 'Logistic Regression': 0.8852459016393442, 'Random Forest': 0.8360655737704918}
model_compare = pd.DataFrame(model_scores, index = ["Accuracy"])
model_compare.T.plot.bar();
Beautiful! We can't really see it from the graph but looking at the dictionary, the LogisticRegression() model performs best.
Since you've found the best model, let's take it to the boss and show her what we've found.
You: I've found it!
Her: Nice one! What did you find?
You: The best algorithm for predicting heart disease is LogisticRegression!
Her: Excellent. I'm surprised the hyperparameter tuning is finished by now.
You: wonders what hyperparameter tuning is
You: Ummm yeah, me too, it went pretty quick.
Her: I'm very proud, how about you put together a classification report to show the team, and be sure to include a confusion matrix, and the cross-validated precision, recall and F1 scores. I'd also be curious to see what features are most important. Oh and don't forget to include a ROC curve.
You: asks self, "what are those???"
You: Of course! I'll have it to you by tomorrow.
Alright, there were a few words in there which could sound made up to someone who's not a budding data scientist like yourself. But being the budding data scientist you are, you know data scientists make up words all the time.
Let's briefly go through each before we see them in action.
classification_report() - returns some of the main classification metrics such as precision, recall and F1-score.

To cook your favourite dish, you know to set the oven to 180 degrees and turn the grill on. But when your roommate cooks their favourite dish, they use 200 degrees and the fan-forced mode. Same oven, different settings, different outcomes.
The same can be done for machine learning algorithms. You can use the same algorithms but change the settings (hyperparameters) and get different results.
But just like turning the oven up too high can burn your food, the same can happen for machine learning algorithms. You change the settings and the model does so well on the data it was trained on that it overfits and fails to generalise.
We're looking for the Goldilocks model. One which does well on our dataset but also does well on unseen examples.
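As a rough sketch of the idea using a model we've already imported (KNeighborsClassifier), the same algorithm with different hyperparameter values can produce different results on the same data. The specific values of n_neighbors here are chosen purely for illustration.
# Same algorithm, different hyperparameter settings (values chosen purely for illustration)
knn_3 = KNeighborsClassifier(n_neighbors=3)   # fewer neighbors = more flexible, may overfit
knn_15 = KNeighborsClassifier(n_neighbors=15) # more neighbors = smoother, may underfit

knn_3.fit(X_train, y_train)
knn_15.fit(X_train, y_train)

# Same data, same algorithm, different settings -> (usually) different scores
print(knn_3.score(X_test, y_test))
print(knn_15.score(X_test, y_test))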
To test different hyperparameters, you could use a validation set but since we don't have much data, we'll use cross-validation.
The most common type of cross-validation is k-fold. It involves splitting your data into k folds and then training and testing a model k times, holding out a different fold as the test set each time. For example, let's say we had 5 folds (k = 5). This is what it might look like.
Normal train and test split versus 5-fold cross-validation
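To make the 5-fold idea concrete, here's a small sketch using Scikit-Learn's KFold to show how the rows get split. It only prints the sizes of each split; the diagram above shows the same mechanics visually.
# A small sketch of how 5-fold cross-validation splits the data (assumes X is defined as above)
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold holds out a different ~20% of the rows as the test set
    print(f"Fold {fold}: {len(train_idx)} training rows, {len(test_idx)} test rows")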
We'll be using this setup to tune the hyperparameters of some of our models and then evaluate them. We'll also get a few more metrics like precision, recall, F1-score and ROC at the same time.
Here's the game plan:
# Let's tune KNN
train_scores = []
test_scores = []

# Create a list of different values for n_neighbors
neighbors = range(1, 21)

## Setup KNN instance
knn = KNeighborsClassifier()

## Loop through different n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors = i)
    # Fit the algorithm
    knn.fit(X_train, y_train)
    # Update the training scores list
    train_scores.append(knn.score(X_train, y_train))
    # Update the test scores list
    test_scores.append(knn.score(X_test, y_test))
train_scores
[1.0, 0.8099173553719008, 0.7727272727272727, 0.743801652892562, 0.7603305785123967, 0.7520661157024794, 0.743801652892562, 0.7231404958677686, 0.71900826446281, 0.6942148760330579, 0.7272727272727273, 0.6983471074380165, 0.6900826446280992, 0.6942148760330579, 0.6859504132231405, 0.6735537190082644, 0.6859504132231405, 0.6652892561983471, 0.6818181818181818, 0.6694214876033058]
test_scores
[0.6229508196721312, 0.639344262295082, 0.6557377049180327, 0.6721311475409836, 0.6885245901639344, 0.7213114754098361, 0.7049180327868853, 0.6885245901639344, 0.6885245901639344, 0.7049180327868853, 0.7540983606557377, 0.7377049180327869, 0.7377049180327869, 0.7377049180327869, 0.6885245901639344, 0.7213114754098361, 0.6885245901639344, 0.6885245901639344, 0.7049180327868853, 0.6557377049180327]
plt.plot(neighbors, train_scores, label = "Train Scores")
plt.plot(neighbors, test_scores, label = "Test Scores")
plt.xlabel("Number of neighbors")
plt.ylabel("Model scores")
plt.xticks(np.arange(1, 21, 1))
plt.legend()
print(f"Maximum KNN scores on the trian data : {max(train_scores)*100:.2f}%");
print(f"Maximum KNN scores on the test data : {max(test_scores)*100:.2f}%");
Maximum KNN scores on the trian data : 100.00% Maximum KNN scores on the test data : 75.41%
We're going to tune:
# Different LogisticRegression hyperparameters
log_reg_grid = {"C": np.logspace(-4, 4, 20),
"solver": ["liblinear"]}
# Different RandomForestClassifier hyperparameters
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
"max_depth": [None, 3, 5, 10],
"min_samples_split": np.arange(2, 20, 2),
"min_samples_leaf": np.arange(1, 20, 2)}
Now we've got hyperparameter grids set up for each of our models, let's tune them using RandomizedSearchCV...
# Setup random seed
np.random.seed(42)
# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
param_distributions=log_reg_grid,
cv=5,
n_iter=20,
verbose=True)
# Fit random hyperparameter search model
rs_log_reg.fit(X_train, y_train)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 0.4s finished
RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20, param_distributions={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03, 4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02, 2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00, 1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02, 5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]), 'solver': ['liblinear']}, verbose=True)
rs_log_reg.best_params_
{'solver': 'liblinear', 'C': 0.23357214690901212}
rs_log_reg.score(X_test, y_test)
0.8852459016393442
Excellent! Tuning the hyperparameters for each model saw a slight performance boost in both the RandomForestClassifier and LogisticRegression.

This is akin to tuning the settings on your oven and getting it to cook your favourite dish just right.

But since LogisticRegression is pulling out in front, we'll try tuning it further with GridSearchCV.
GridSearchCV

The difference between RandomizedSearchCV and GridSearchCV is that RandomizedSearchCV searches over a grid of hyperparameters performing n_iter combinations, while GridSearchCV will test every single possible combination.

In short:

RandomizedSearchCV - tries n_iter combinations of hyperparameters and saves the best.

GridSearchCV - tries every single combination of hyperparameters and saves the best.

Let's see it in action.
# Different LogisticRegression hyperparameters
log_reg_grid = {"C" : np.logspace(-4, 4, 20),
"solver" :["liblinear"]
}
# Setup grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
param_grid = log_reg_grid,
cv = 5,
verbose = True
)
# Fit grid hyperparameter search model
gs_log_reg.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 0.4s finished
GridSearchCV(cv=5, estimator=LogisticRegression(), param_grid={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03, 4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02, 2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00, 1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02, 5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]), 'solver': ['liblinear']}, verbose=True)
# Check the best parameters
gs_log_reg.best_params_
{'C': 0.23357214690901212, 'solver': 'liblinear'}
gs_log_reg.score(X_test, y_test)
0.8852459016393442
In this case, we get the same results as before since our grid only has a maximum of 20 different hyperparameter combinations.
Note: If there are a large number of hyperparameter combinations in your grid, GridSearchCV may take a long time to try them all out. This is why it's a good idea to start with RandomizedSearchCV, try a certain amount of combinations and then use GridSearchCV to refine them.
Now we've got a tuned model, let's get some of the metrics we discussed before.
We want:
plot_roc_curve()
confusion_matrix()
classification_report()
precision_score()
recall_score()
f1_score()
Luckily, Scikit-Learn has these all built-in.
To access them, we'll have to use our model to make predictions on the test set. You can make predictions by calling predict() on a trained model and passing it the data you'd like to predict on.

We'll make predictions on the test data.
# Make predictions on the test data
y_preds = gs_log_reg.predict(X_test)
y_preds
array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
y_test
array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
Since we've got our prediction values we can find the metrics we want.
Let's start with the ROC curve and AUC scores.
What's a ROC curve?
It's a way of understanding how your model is performing by comparing the true positive rate to the false positive rate.
In our case...
To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.
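As a rough sketch of the two rates a ROC curve compares, here they are computed directly from a confusion matrix of our test predictions (the ROC curve itself plots these rates at every possible classification threshold, not just the default one).
# True positive rate and false positive rate from a confusion matrix (rough sketch)
tn, fp, fn, tp = confusion_matrix(y_test, y_preds).ravel()

true_positive_rate = tp / (tp + fn)   # proportion of actual positives correctly predicted
false_positive_rate = fp / (fp + tn)  # proportion of actual negatives incorrectly flagged as positive

print(true_positive_rate, false_positive_rate)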
Scikit-Learn implements a function plot_roc_curve which can help us create a ROC curve as well as calculate the area under the curve (AUC) metric.

Reading the documentation on the plot_roc_curve function, we can see it takes (estimator, X, y) as inputs, where estimator is a fitted machine learning model and X and y are the data you'd like to test it on.

In our case, we'll use the GridSearchCV version of our LogisticRegression estimator, gs_log_reg, as well as the test data, X_test and y_test.
# Import ROC Curve function from metrics module
from sklearn.metrics import plot_roc_curve
# Plot ROC Curve and Calculate AUC metrics
plot_roc_curve(gs_log_reg, X_test, y_test);
This is great! Our model does far better than guessing (which would be a diagonal line from the bottom left corner to the top right corner, AUC = 0.5). But a perfect model would achieve an AUC score of 1.0, so there's still room for improvement.
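Note: plot_roc_curve was removed in Scikit-Learn 1.2. If you're running a newer version, a roughly equivalent call (a sketch, not part of the original notebook) is RocCurveDisplay.from_estimator, which takes the same style of (estimator, X, y) inputs.
# Equivalent ROC curve plot on newer Scikit-Learn versions (1.2+)
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);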
Let's move onto the next evaluation request, a confusion matrix.
A confusion matrix is a visual way to show where your model made the right predictions and where it made the wrong predictions (or in other words, got confused).
Scikit-Learn allows us to create a confusion matrix using confusion_matrix() and passing it the true labels and predicted labels.
print(confusion_matrix(y_test, y_preds))
[[25 4] [ 3 29]]
As you can see, Scikit-Learn's built-in confusion matrix is a bit bland. For a presentation you'd probably want to make it visual.
Let's create a function which uses Seaborn's heatmap() for doing so.
# Import Seaborn
import seaborn as sns
sns.set(font_scale = 1.5) # Increase font size

def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize = (3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot = True, # Annotate the boxes
                     cbar = False);
    # Rows of confusion_matrix() are the true labels, columns are the predicted labels
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
plot_conf_mat(y_test, y_preds)
Beautiful! That looks much better.
You can see the model gets confused (predicts the wrong label) relatively evenly across both classes. In essence, there are 4 occasions where the model predicted 1 when it should've been 0 (false positives) and 3 occasions where the model predicted 0 when it should've been 1 (false negatives).
We can make a classification report using classification_report() and passing it the true labels as well as our model's predicted labels.

A classification report will also give us information on the precision and recall of our model for each class.
print(classification_report(y_test, y_preds))
precision recall f1-score support 0 0.89 0.86 0.88 29 1 0.88 0.91 0.89 32 accuracy 0.89 61 macro avg 0.89 0.88 0.88 61 weighted avg 0.89 0.89 0.89 61
What's going on here?
Let's get a refresher.
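In short: precision is the proportion of positive predictions which were actually positive, recall is the proportion of actual positives which were correctly predicted, F1-score is the harmonic mean of precision and recall, and support is the number of samples each metric was calculated on. As a quick sketch, we can reproduce the class-1 numbers from the report above with the individual scoring functions we imported earlier.
# A quick sketch reproducing the headline metrics from the classification report (for the positive class)
print("Precision:", precision_score(y_test, y_preds))
print("Recall:", recall_score(y_test, y_preds))
print("F1 score:", f1_score(y_test, y_preds))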
Ok, now we've got a few deeper insights on our model. But these were all calculated using a single training and test set.
What we'll do to make them more solid is calculate them using cross-validation.
How?
We'll take the best model along with the best hyperparameters and use cross_val_score() along with various scoring parameter values.

cross_val_score() works by taking an estimator (machine learning model) along with data and labels. It then evaluates the machine learning model on the data and labels using cross-validation and a defined scoring parameter.
Let's remind ourselves of the best hyperparameters and then see them in action.
# Check best hyperparameters
gs_log_reg.best_params_
{'C': 0.23357214690901212, 'solver': 'liblinear'}
# Import cross_val_score
from sklearn.model_selection import cross_val_score
# Instantiate best model with best hyperparameters (found with GridSearchCV)
clf = LogisticRegression(C=0.23357214690901212,
solver="liblinear")
# Cross-validated accuracy score
cv_acc = cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation
scoring="accuracy") # accuracy as scoring
cv_acc
array([0.81967213, 0.90163934, 0.8852459 , 0.88333333, 0.75 ])
cv_acc = np.mean(cv_acc)
cv_acc
0.8479781420765027
# Cross-validated precision score
cv_precision = np.mean(cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation
scoring="precision")) # precision as scoring
cv_precision
0.8215873015873015
# Cross-validated recall score
cv_recall = np.mean(cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation
scoring="recall")) # recall as scoring
cv_recall
0.9272727272727274
# Cross-validated F1 score
cv_f1 = np.mean(cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation
scoring="f1")) # f1 as scoring
cv_f1
0.8705403543192143
Okay, we've got cross-validated metrics, now what?
Let's visualize them.
# Visualizing cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
"Precision": cv_precision,
"Recall": cv_recall,
"F1": cv_f1},
index=[0])
cv_metrics.T.plot.bar(title="Cross-Validated Metrics", legend=False);
Great! This looks like something we could share. An extension might be adding the metrics on top of each bar so someone can quickly tell what they were.
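Here's a rough sketch of what that extension could look like, using Matplotlib's bar_label (available in Matplotlib 3.4+); the formatting is just one option.
# A rough sketch: add each metric's value on top of its bar
ax = cv_metrics.T.plot.bar(title="Cross-Validated Metrics", legend=False)
for container in ax.containers:
    ax.bar_label(container, fmt="%.2f")  # requires Matplotlib 3.4 or newer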
What now?
The final thing to check off the list of our model evaluation techniques is feature importance.
Feature importance is another way of asking, "which features contribute most to the outcomes of the model?"
Or for our problem, trying to predict heart disease using a patient's medical characteristics, which characteristics contribute most to a model predicting whether someone has heart disease or not?
Unlike some of the other functions we've seen, because how each model finds patterns in data is slightly different, how a model judges how important those patterns are is different as well. This means for each model, there's a slightly different way of finding which features were most important.
You can usually find an example via the Scikit-Learn documentation or via searching for something like "[MODEL TYPE] feature importance", such as, "random forest feature importance".
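For example, tree-based models like RandomForestClassifier expose a feature_importances_ attribute instead of coefficients. A rough sketch of what inspecting it might look like (refitting a fresh Random Forest on our training data purely for illustration):
# A rough sketch of feature importance for a tree-based model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Match each importance value to its column name
rf_importances = dict(zip(X.columns, rf.feature_importances_))
print(rf_importances)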
Since we're using LogisticRegression, we'll look at one way we can calculate feature importance for it.

To do so, we'll use the coef_ attribute. Looking at the Scikit-Learn documentation for LogisticRegression, the coef_ attribute is the coefficient of the features in the decision function.

We can access the coef_ attribute after we've fit an instance of LogisticRegression.
# Fit an instance of LogisticRegression (taken from above)
clf.fit(X_train, y_train);
# Check coef_
clf.coef_
array([[ 0.00369922, -0.90424098, 0.67472823, -0.0116134 , -0.00170364, 0.04787687, 0.33490208, 0.02472938, -0.63120414, -0.57590996, 0.47095166, -0.65165344, -0.69984217]])
Looking at this it might not make much sense. But these values are how much each feature contributes to how a model decides whether the patterns in a sample of patient health data lean more towards having heart disease or not.

Even knowing this, in its current form, this coef_ array still doesn't mean much. But it will if we combine it with the columns (features) of our dataframe.
# Match features to columns
features_dict = dict(zip(df.columns, list(clf.coef_[0])))
features_dict
{'age': 0.003699223396114675, 'sex': -0.9042409779785583, 'cp': 0.6747282348693419, 'trestbps': -0.011613398123390507, 'chol': -0.0017036431858934173, 'fbs': 0.0478768694057663, 'restecg': 0.33490207838133623, 'thalach': 0.024729380915946855, 'exang': -0.6312041363430085, 'oldpeak': -0.5759099636629296, 'slope': 0.47095166489539353, 'ca': -0.6516534354909507, 'thal': -0.6998421698316164}
Now we've matched the feature coefficients to the different features, let's visualize them.
# Visualize feature importance
features_df = pd.DataFrame(features_dict, index=[0])
features_df.T.plot.bar(title="Feature Importance", legend=False);
You'll notice some are negative and some are positive.
The larger the value (bigger bar), the more the feature contributes to the model's decision.
If the value is negative, it means there's a negative correlation. And vice versa for positive values.
For example, the sex attribute has a negative value of -0.904, which means as the value for sex increases, the target value decreases.

We can see this by comparing the sex column to the target column.
pd.crosstab(df["sex"], df["target"])
target | 0 | 1 |
---|---|---|
sex | ||
0 | 24 | 72 |
1 | 114 | 93 |
You can see that when sex is 0 (female), there are almost 3 times as many people with heart disease (target = 1) than without (72 vs. 24).

And then as sex increases to 1 (male), the ratio goes down to almost 1 to 1 (93 with heart disease vs. 114 without).
What does this mean?
It means the model has found a pattern which reflects the data. Looking at these figures and this specific dataset, it seems if the patient is female, they're more likely to have heart disease.
How about a positive correlation?
# Contrast slope (positive coefficient) with target
pd.crosstab(df["slope"], df["target"])
target | 0 | 1 |
---|---|---|
slope | ||
0 | 12 | 9 |
1 | 91 | 49 |
2 | 35 | 107 |
Looking back at the data dictionary, we see slope is the "slope of the peak exercise ST segment".

According to the model, there's a positive correlation of 0.470, not as strong as sex but still more than 0.

This positive correlation means our model is picking up the pattern that as slope increases, so does the target value.

Is this true?

When you look at the crosstab (pd.crosstab(df["slope"], df["target"])) it is. As slope goes up, so does target.
What can you do with this information?
This is something you might want to talk to a subject matter expert about. They may be interested in seeing where the machine learning model is finding the most patterns (highest correlation) as well as where it's not (lowest correlation).
Doing this has a few benefits:
Well, we've completed all the metrics your boss requested. You should be able to put together a great report containing a confusion matrix, a handful of cross-validated metrics such as precision, recall and F1, as well as which features contribute most to the model making a decision.
But after all this you might be wondering where step 6 in the framework is, experimentation.
Well the secret here is, as you might've guessed, the whole thing is experimentation.
From trying different models, to tuning different models to figuring out which hyperparameters were best.
What we've worked through so far has been a series of experiments.
And the truth is, we could keep going. But of course, things can't go on forever.
So by this stage, after trying a few different things, we'd ask ourselves did we meet the evaluation metric?
Remember we defined one in step 3.
If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.
In this case, we didn't. The highest accuracy our model achieved was below 90%.
You might be wondering, what happens when the evaluation metric doesn't get hit?
Is everything we've done wasted?
No.
It means we know what doesn't work. In this case, we know the current model we're using (a tuned version of LogisticRegression) along with our specific dataset doesn't hit the target we set ourselves.
This is where step 6 comes into its own.
A good next step would be to discuss with your team or research on your own different options of going forward.
Could you collect more data?
Could you try a better model? If you're working with structured data, you might want to look into CatBoost or XGBoost (see the sketch below).
Could you improve the current models (beyond what we've done so far)?
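As a rough sketch of the "better model" idea, here's what trying XGBoost might look like. This assumes the xgboost library is installed (it isn't part of this notebook's imports), and the default hyperparameters are just a starting point.
# A rough sketch of trying a gradient-boosted model (assumes `pip install xgboost`)
from xgboost import XGBClassifier

xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
print(xgb_model.score(X_test, y_test))  # accuracy on the test set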
The key here is to remember that your biggest restriction will be time. Hence why it's paramount to minimise the time between experiments.
The more you try, the more you figure out what doesn't work, the more you'll start to get a hang of what does.