Heart Disease Prediction

NikhilNimje
Dec 4, 2022


Introduction

Heart disease has become common not only in older adults but also in younger adults, starting from around age 20. An estimated 18.2 million American adults aged 20 and older have coronary heart disease, and about 2 in every 10 deaths from coronary heart disease occur in adults younger than 65. In the United States, about 659,000 people die from heart disease each year, roughly 1 in every 4 deaths. In this post, we analyze which factors can be used to predict whether a patient has some level of heart disease.

Business Implications

Machine learning is used to analyze, extract, and organize information from large sets of raw data. As one branch of artificial intelligence, machine learning is increasingly used to find patterns in data and apply the results to business decision-making, and healthcare is one of the sectors adopting it. Machine learning can assess a patient's risk of heart disease. A predictive model can serve as an accurate diagnostic aid, help formulate effective treatment plans, and identify high-risk patients by their characteristics. Using patients' symptoms, records, and historical data, such a model can predict heart disease and give doctors insight for planning treatment. It can also help patients notice early symptoms of serious heart disease so that they see a doctor for diagnosis and do not miss the best window for treatment.

Data Exploration

Acquisition

We worked with a heart disease dataset from Kaggle. The response variable originally indicates the presence of heart disease on a scale of 0–4 (where 0 indicates no heart disease and 1, 2, 3, and 4 indicate the presence of heart disease). In this particular dataset, the response variable is named 'target' and has been collapsed into 2 nominal values: 0 (heart disease not present) and 1 (heart disease present).

Here is a brief preview of what the dataset looks like:

There are a total of 1025 observations. The original dataset contains 76 columns, but we use only 14 of them (including the predicted attribute), as these are the most significant. There are no null values in this dataset.
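As a quick sketch of the acquisition step (the file name heart.csv and the use of pandas are assumptions here, not details given in the post), the data can be loaded and inspected as follows:

```python
import pandas as pd

# Assumed file name for the Kaggle heart disease CSV
df = pd.read_csv("heart.csv")

print(df.shape)           # expected: (1025, 14)
print(df.head())          # brief preview of the first rows
print(df.isnull().sum())  # confirm there are no null values
print(df.describe())      # summary statistics for every attribute
```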

Description

The statistics of all attributes are shown below:

Looking at the minimums and maximums, we can see that most of the attributes are nominal, except for age, trestbps, chol, thalach, and oldpeak, which take continuous values.

Below is a detailed description of each of the attributes:

Visualization

The class distribution for our response variable — target, where 1 indicates having heart disease and 0 indicates not having heart disease:

We see that 526 patients have heart disease and 499 do not. The class imbalance is therefore not significant for the target variable in our case, so we use model accuracy and the confusion matrix as the metrics for evaluating the models.
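A one-line check of this class distribution (continuing the pandas sketch above) might look like:

```python
# Counts of the response classes; roughly 526 ones and 499 zeros
print(df["target"].value_counts())
```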

Below are the distributions of all 14 attributes, for a better understanding of the data:

From the histograms above, we can see that age is slightly left-skewed, meaning there are more older people in the dataset. There are more males than females. The distributions of “trestbps” (resting blood pressure), “chol” (serum cholesterol), and “oldpeak” (ST depression induced by exercise relative to rest) are right-skewed. Overall, there are 9 categorical variables and 5 continuous numerical variables.
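A minimal sketch of how such histograms can be drawn (matplotlib is an assumption; the post does not say which plotting library was used):

```python
import matplotlib.pyplot as plt

# One histogram per attribute, reusing the DataFrame loaded above
df.hist(figsize=(14, 10), bins=20)
plt.tight_layout()
plt.show()
```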

According to the correlation matrix shown above, none of the independent variables are strongly correlated with one another, but we still addressed the mild multicollinearity that is present. The strongest positive correlation among predictors is between “slope” and “thalach” at 0.4, and the strongest negative correlation is between “oldpeak” and “slope” at -0.58. Other notable negative correlations include age and thalach (-0.39), cp and exang (-0.4), thalach and oldpeak (-0.35), and thalach and exang (-0.38). Looking at correlations with the response, “cp”, “thalach”, and “slope” have the strongest positive correlations with the target variable (0.43, 0.42, and 0.35 respectively), while “exang”, “oldpeak”, and “ca” have the strongest negative correlations (-0.44, -0.44, and -0.38 respectively).

We will drop “thalach”, since it is highly correlated with more than one attribute, to avoid multicollinearity in the further analysis. We will also drop “slope”, since it is highly correlated with “oldpeak” and is more weakly correlated with the target variable than “oldpeak” is.
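A rough sketch of the correlation check and the column drop described above (continuing with the same DataFrame):

```python
# Pairwise Pearson correlations, including correlations with the target
corr = df.corr()
print(corr["target"].sort_values(ascending=False))

# Drop the two attributes discussed above to reduce multicollinearity
df = df.drop(columns=["thalach", "slope"])
```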

Modeling and Analysis

This is a classification problem: predict whether a patient has the disease or not. We used Python to run the supervised models and Weka to run the unsupervised model:

Supervised:

  1. K-Nearest Neighbors
  2. Naive Bayes
  3. Logistic Regression
  4. Decision Tree

Unsupervised:

Association Rules

For the supervised models, we used Python's sklearn library to run the four classifiers and to pre-process our data. We used the confusion matrix and the overall accuracy as evaluation metrics, but focused mainly on recall, since we cannot afford to misclassify a patient who actually has heart disease. Recall is defined as recall = TP / (TP + FN), i.e. true positives divided by the sum of true positives and false negatives. We performed a 70:30 train/test split on our data.
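A minimal sketch of the split using sklearn (the random_state and the stratification are assumptions for reproducibility; the post does not specify them):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                        # assumed file name
X = df.drop(columns=["target", "thalach", "slope"])  # predictors, minus the two columns dropped earlier
y = df["target"]

# 70:30 train/test split; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
```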

We first ran all our classifiers on the unprocessed data to create a benchmark for the model results. Next, to improve our model performance, we applied discretization to the continuous variables along with grid search and cross-validation of 10 folds to tune the model hyperparameters and to optimize model performance. Lastly, we applied feature selection with grid search to reduce the data to the attributes that matter the most and noted the model performance. We used sklearn’s feature selection tool — SelectKBest, to select the top k features for each algorithm using a pipeline, and then ran the algorithms on their respective selected features.

Finally, for the unsupervised model, we used Weka to run association rules on the dataset to see which sets of factors are related to having heart disease.

Benchmark

Looking at the results for the 4 classifiers after the train/test split, we can see that without any pre-processing the decision tree classifier performed the best. KNN, on the other hand, performed the worst, especially in terms of recall, which means the number of patients who truly have heart disease but were misclassified is quite high. We can see from the confusion matrix that KNN misclassified 33% of patients as not having the disease when in reality they did. Naive Bayes and Logistic Regression yielded similar results. To further improve model performance, we applied a few pre-processing techniques.
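The benchmark step might look like the following sketch (continuing from the split above; the default hyperparameters are assumptions, and the printed numbers will not exactly reproduce the post's results):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# The four classifiers with (mostly) default settings
models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=42),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name,
          "accuracy:", round(accuracy_score(y_test, y_pred), 4),
          "recall:", round(recall_score(y_test, y_pred), 4))
    print(confusion_matrix(y_test, y_pred))
```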

Pre-processing

Grid search uses different hyperparameter values and evaluates model performance. The best model is then chosen by the search, based on the hyperparameter value that yielded the optimal model performance.

For logistic regression, the hyperparameter we tune is the C parameter, the inverse of regularization strength. Regularization improves a model's performance on unseen data by preventing it from overfitting the training data; it penalizes complex models by shrinking the coefficients of less contributive variables towards zero. A high value of C means less weight is given to the penalty, while a low value of C means more weight is given to the penalty. Smaller values of C impose stronger regularization, which can help when the training data is not fully representative of the real world.

For the Decision Tree classifier, the hyperparameter we optimize is the maximum depth of the tree. Limiting the depth is a pre-pruning technique that stops further splits once the tree reaches the chosen depth and so helps avoid overfitting. We ran this classifier with 'entropy' as the splitting criterion.

For K-Nearest Neighbors, we simply optimize the number of neighbors k through grid search. Finally, for the Naive Bayes classifier, we optimized the var_smoothing parameter, which smooths the Gaussian distribution assumed for each feature by adding a fraction of the largest feature variance to all feature variances, making the model more tolerant of observations that fall far from the mean.
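A sketch of the grid search over these four hyperparameters (the candidate values are assumptions; the post does not list the grids it searched):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical candidate values for each hyperparameter
search_spaces = {
    "Logistic Regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.001, 0.01, 0.1, 1, 10, 100]}),
    "Decision Tree": (DecisionTreeClassifier(criterion="entropy", random_state=42),
                      {"max_depth": [2, 3, 4, 5, 6, 8, 10]}),
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": list(range(1, 31))}),
    "Naive Bayes": (GaussianNB(),
                    {"var_smoothing": np.logspace(-9, 0, 10)}),
}

# 10-fold cross-validated grid search on the training split
for name, (clf, grid) in search_spaces.items():
    search = GridSearchCV(clf, grid, cv=10, scoring="accuracy")
    search.fit(X_train, y_train)
    print(name, "best params:", search.best_params_,
          "cv accuracy:", round(search.best_score_, 4))
```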

Discretization

We applied discretization to “age”, “trestbps”, “chol”, and “oldpeak”, as they contain continuous values. Classifiers such as decision trees split on the attribute with the highest information gain; however, if an attribute has too many unique values, the information gain is biased towards it. To avoid this, it is important to bin, or discretize, those attributes. Binning groups the values so that, for example, age 25 falls under a 20–25 category rather than having its own single category.

We used pandas' quantile-based discretization function (qcut) to discretize these attributes. By default, this puts an approximately equal number of instances in each bin, which means the bin intervals are not of consistent width: the first bin might span a range of 20 while the next spans a range of 10, with each interval chosen so that the bins hold roughly equal numbers of instances. All of these continuous attributes were divided into four bins except oldpeak, which was divided into 2 bins because it spans a small range of values. The bins were then converted to ordinal values using sklearn's LabelEncoder; for instance, the first age bin (28–48) is replaced by 0, the next by 1, and so on.
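A sketch of this binning step (the exact bin edges will differ slightly from those in the post, since they depend on the data's quantiles):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("heart.csv")  # assumed file name

# Number of quantile bins per continuous attribute
n_bins = {"age": 4, "trestbps": 4, "chol": 4, "oldpeak": 2}

for col, q in n_bins.items():
    # qcut puts roughly equal numbers of instances in each bin
    binned = pd.qcut(df[col], q=q, duplicates="drop")
    # LabelEncoder replaces each interval with an ordinal code (0, 1, 2, ...)
    df[col] = LabelEncoder().fit_transform(binned)

print(df[["age", "trestbps", "chol", "oldpeak"]].head())
```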

Here are the results after applying the four classifiers on the discretized dataset and using grid search:

Based on the model accuracies, we can see that discretization affected the models to different degrees. Looking at overall accuracy, KNN improved significantly from its benchmark accuracy of 70.24%; its precision, recall, and f1-score also went up significantly. For logistic regression, Naive Bayes, and the Decision Tree classifier, there does not seem to be a significant improvement in accuracy compared to their benchmarks. Even though the precision of the decision tree classifier improved to 1, its recall unfortunately dropped below the benchmark. Hence, we can conclude that KNN benefited the most from this preprocessing step.

Feature Selection

For feature selection, we used SelectKBest to extract the top k features based on a scoring function, in our case chi2. A high chi2 statistic between a predictor and the response variable indicates dependence between the two, and hence that the predictor is significant.

To select the top k features for each model, we need to know the value of k. To find it, we built a separate pipeline combining the SelectKBest tool with each classifier and used grid search over values of k between 1 and 6. This range was chosen to keep the number of features as small as possible and make the heart disease diagnosis process easier; increasing the upper end of the range did not yield significantly different results across the classifiers, so 6 was the maximum number of features given to the grid search.
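Here is a minimal sketch of this pipeline for one classifier (KNN); the same pattern would be repeated for the other three, and the n_neighbors grid is an assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# chi2 requires non-negative feature values, which holds for this data
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2)),
    ("clf", KNeighborsClassifier()),
])

param_grid = {
    "select__k": list(range(1, 7)),          # top 1 to 6 features
    "clf__n_neighbors": list(range(1, 31)),  # assumed neighbor grid
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
mask = search.best_estimator_.named_steps["select"].get_support()
print("selected features:", list(X_train.columns[mask]))
```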

Using the pipeline and the grid search, all four classifiers yielded the same number of features. Here are the features that were selected by the SelectKBest tool according to the top 6 chi2 scores for the 4 classifiers:

After applying the feature-selection preprocessing step, there was a significant improvement in KNN's overall accuracy, recall, and f1-score. There was a slight improvement in accuracy for Naive Bayes, as well as in its precision and f1-score. For logistic regression, accuracy and recall improved slightly. For the Decision Tree classifier, overall accuracy, precision, and f1-score dropped, while recall did not change. Hence, we can conclude that the best-performing model is KNN.

We ran Apriori in Weka to find out which attribute combinations appear most frequently in the dataset together with having heart disease. Even though association rules do not indicate causality, we can use them as a self-exam checklist: if a person has more than one of these symptoms, he or she should be more aware of the risk of heart disease and go to the hospital for further examination.

Before running the association rules, we used J48's accuracy to decide how to discretize the data. After comparing several different set-ups, we decided to discretize the continuous numerical attributes into 3 equal-frequency bins.

Below are the association rules found in Weka, using confidence as the metric type, 0.9 as minMetric, and 10 as numRules:

Looking at the rules whose right-hand side is having heart disease, we see that some attributes show up in multiple combinations, such as “thalach” (maximum heart rate), “ca” (number of major vessels colored by fluoroscopy), “thal” (thalassemia), “exang” (exercise-induced angina), and “age”. We noticed that “ca=0” and “thal=2” appear in all 9 rules that indicate heart disease, probably because of class imbalance: instances that meet these two conditions are significantly more common than others.
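The rules above were mined in Weka's GUI. For readers who prefer to stay in Python, a roughly comparable sketch using the mlxtend library (an assumption: it is not the tool used in the post, and the minimum support of 0.2 and the itemset-size cap are hypothetical choices) could look like this:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("heart.csv")  # assumed file name

# 3 equal-frequency bins for the continuous attributes, mirroring the
# discretization chosen in Weka
for col in ["age", "trestbps", "chol", "thalach", "oldpeak"]:
    df[col] = pd.qcut(df[col], q=3, duplicates="drop").astype(str)

# One-hot encode every attribute=value pair into boolean columns
onehot = pd.get_dummies(df.astype(str)).astype(bool)

# Mine frequent itemsets, then keep rules with confidence >= 0.9
itemsets = apriori(onehot, min_support=0.2, use_colnames=True, max_len=4)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.9)

# Keep only rules whose right-hand side is "heart disease present"
target_rules = rules[rules["consequents"].apply(
    lambda c: c == frozenset({"target_1"}))]
print(target_rules.sort_values("confidence", ascending=False).head(10))
```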

According to the association rules generated from Weka, a person with the following symptoms may have a higher chance of having heart disease:

  1. Resting blood pressure is higher than 161.5 mmHg
  2. No major vessels colored by fluoroscopy
  3. Thalassemia is a fixed defect
  4. Fasting blood sugar is <= 120 mg/dl
  5. Chest pain after exercise
  6. Age is between 29 and 52

Takeaways

Below is a summary of the accuracy scores for the 4 classifiers:

Looking at the overall accuracy summary, feature selection was the best pre-processing step for all of the classifiers except the decision tree. The decision tree performed quite well at the benchmark and similarly after discretization, but feature selection hurt its performance; decision trees can be sensitive to small changes in the data, which can improve or degrade their results. Comparing the two preprocessing steps with the benchmark, we can conclude that KNN's performance is the most affected by preprocessing, especially by discretization. For Naive Bayes and Logistic Regression, performance was roughly constant across the two preprocessing steps. We therefore conclude that the K-Nearest Neighbors classifier is the best-performing model.

Along with overall accuracy, it is imperative to evaluate the confusion matrix. Without any preprocessing or feature selection, the model predicted more false positives and false negatives, both of which carry a high cost because they can have an adverse impact on the life of the patient.

To improve predictions on the test data, it was important to focus on the key features of the data. It is also interesting that, even after preprocessing and feature selection, the Naive Bayes and logistic regression models were not able to increase their precision and recall.

Conclusion

Our project was intended to apply the techniques we learned in course 273 to explore this heart disease dataset, find out which attributes are better predictors of heart disease, and improve the models' accuracy. The cost of a wrong prediction, which matters when choosing a model in a real business situation, was not taken into account in our project; a future expansion of this work could weight different attributes based on cost-efficiency. A larger dataset could also help increase the accuracy of the models.

We can conclude that the number of major vessels colored by fluoroscopy, chest pain, exercise-induced angina, ST depression induced by exercise, age, and sex are good attributes to look at when predicting whether a person has heart disease.

The KNN classifier can be used by doctors to improve their predictions, and by insurance companies and testing laboratories to improve their cost-efficiency. The association rules, on the other hand, are helpful for companies and developers who want to create digital interfaces to sell products or services related to heart disease.
