Project Overview
WARNING: Please note that this project is intended as an illustrative example of the potential application of machine learning in assisting medical professionals with heart disease diagnosis. The information and results presented here (or on the accompanying website) do not constitute medical advice in any form.
Heart disease is a prevalent health condition that requires accurate and timely diagnosis for effective treatment. This project aims to develop a machine learning model to assist doctors in diagnosing heart disease accurately and efficiently. By leveraging machine learning and a comprehensive dataset, the model can analyze various patient factors and provide predictions regarding the probability of heart disease. The implementation of diagnostic machine learning models like this one offers several potential benefits, including improved diagnostic accuracy, reduced burden on medical professionals, and early detection of disease. Furthermore, the project promotes data-driven medicine and contributes to ongoing efforts in machine learning-based medical diagnosis. By providing an additional tool for risk assessment and decision-making, I hope that models like this one can enhance patient outcomes and advance our understanding of heart disease.
Sample Usage
- A medical professional could collect and enter the necessary patient information (such as age, sex, chest pain type, resting blood pressure, etc.) into the website.
- Once all the required data has been entered, they could click the "Predict" button to initiate the prediction process.
- A medical professional could interpret the prediction result alongside other diagnostic information and medical expertise to make informed decisions and provide appropriate care for the patient.
Dataset
Attribute Information
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]
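As a minimal sketch, the dataset can be loaded and checked against the attribute list above with pandas; this assumes the CSV has been downloaded locally as `heart.csv`, as in the Kaggle distribution.

```python
# Sketch: load the Kaggle CSV and confirm the attributes described above.
# Assumes the file is saved locally as "heart.csv".
import pandas as pd

df = pd.read_csv("heart.csv")
print(df.shape)   # expected: (918, 12) per the Source section below
print(df.dtypes)  # numeric vs. object columns match the attribute list
```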
Source
This dataset was created by combining five datasets that were already available independently but had never been combined before. They are joined over 11 common features, making this the largest heart disease dataset available to date for research purposes. The five datasets used for its curation are:
- Cleveland: 303 observations
- Hungarian: 294 observations
- Switzerland: 123 observations
- Long Beach VA: 200 observations
- Statlog (Heart) Data Set: 270 observations
Total: 1190 observations
Duplicated: 272 observations
Final dataset: 918 observations
Creators:
- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
Donor to UC Irvine Machine Learning Repository:
David W. Aha (aha '@' ics.uci.edu) (714) 856-8779
The datasets from the above sources were combined by Kaggle user fedesoriano into the dataset used in this project.
Data source Citation:
fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved May 22, 2023 from https://www.kaggle.com/fedesoriano/heart-failure-prediction.
Exploratory Data Analysis
Questions I Asked About the Data:
- Upon visual examination, the distributions of age, resting blood pressure, and maximum heart rate appeared roughly normal. However, Q-Q plots indicated deviations from a Gaussian distribution, so I conducted Shapiro-Wilk tests on each of these variables, which confirmed their non-normal distribution.
- Notably, a considerable number of cholesterol values (172) were set to 0 to represent null values.
- To put the continuous features on a common scale, I employed the `StandardScaler()` function from the sklearn library, which standardizes each feature to zero mean and unit variance (see the sketch below).
- Initially, when constructing the baseline models, I retained the original cholesterol data without any modifications. To overcome the limitation imposed by the null cholesterol values, I later employed a series of imputation techniques that replace the nulls with values from which models can generate meaningful predictions.
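A minimal sketch of these checks, assuming the column names from the attribute list and pandas/scipy/scikit-learn as the tooling:

```python
# Sketch of the normality checks and scaling described above.
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")

# Shapiro-Wilk: a small p-value indicates departure from normality.
for col in ["Age", "RestingBP", "MaxHR"]:
    w, p = stats.shapiro(df[col])
    print(f"{col}: W={w:.3f}, p={p:.2e}")

# Standardize continuous features to zero mean and unit variance.
continuous = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
df[continuous] = StandardScaler().fit_transform(df[continuous])
```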
How many positive and negative examples are there of the target variable?
The dataset is close to balanced, so there is no need to implement techniques such as Synthetic Minority Over-sampling (SMOTE) or undersampling to improve classification of infrequent categories.
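A quick sketch of this balance check:

```python
# Sketch: check the target's class balance.
import pandas as pd

df = pd.read_csv("heart.csv")
print(df["HeartDisease"].value_counts(normalize=True))
# roughly 55% positive vs. 45% negative -- close to balanced
```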
How are continuous variables distributed (in particular, are they normally distributed)?
How do continuous variables change in conjunction with the target variable?
A visual inspection indicates that age, maximum heart rate, and oldpeak differ most between heart disease positive and negative patients. This was later statistically confirmed with an ANOVA.
How many examples are there of each categorical variable?
How does each categorical variable change in conjunction with the target variable?
Feature Selection With Inferential Statistics
I used inferential statistics to determine the importance of the dataset's features. If a feature had no significant impact on the target variable, it would be worth trying models that discard it: removing an insignificant variable reduces noise in the data, ideally lowering model overfitting and improving classification accuracy. For continuous features, I conducted an ANOVA, and for categorical features, I used a Chi-Squared Test.
ANOVA
Analysis of Variance (ANOVA) is a method from inferential statistics that aims to determine whether there is a statistically significant difference between the means of two (or more) groups. This makes it a strong candidate for determining the importance of continuous features in predicting a categorical output. I used a One-Way ANOVA to test the importance of each continuous feature by checking whether the presence of heart disease had a statistically significant effect on the feature's mean.
ANOVA Results
I found a statistically significant difference in group means (p < 0.05) for each continuous feature. This led me to decide to keep all continuous features as part of my classification models.
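A sketch of this test using `scipy.stats.f_oneway`, grouping each continuous feature by the target (with only two groups, a one-way ANOVA is equivalent to a two-sample t-test, but it matches the approach described here):

```python
# Sketch: one-way ANOVA for each continuous feature vs. HeartDisease.
import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("heart.csv")
for col in ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]:
    pos = df.loc[df["HeartDisease"] == 1, col]
    neg = df.loc[df["HeartDisease"] == 0, col]
    f_stat, p = f_oneway(pos, neg)
    print(f"{col}: F={f_stat:.2f}, p={p:.2e}")  # p < 0.05 => keep the feature
```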
Chi-Squared Test
The Chi-Squared test is a statistical hypothesis test that is used to determine whether there is a significant association between two categorical variables. It compares the observed frequencies in each category of a contingency table with the frequencies that would be expected if the variables were independent. In the context of feature selection for my machine learning models, I used the Chi-Squared test to identify the categorical features that are most significantly associated with the target variable.
Chi-Squared Test Results
As with the continuous features, I found a statistically significant association with heart disease (p < 0.05) for each categorical feature. This led me to decide to keep all categorical features as part of my classification models.
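A sketch of this test, building a contingency table per feature with pandas and testing it with scipy:

```python
# Sketch: chi-squared test of independence for each categorical feature.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("heart.csv")
categorical = ["Sex", "ChestPainType", "FastingBS", "RestingECG",
               "ExerciseAngina", "ST_Slope"]
for col in categorical:
    table = pd.crosstab(df[col], df["HeartDisease"])  # observed frequencies
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{col}: chi2={chi2:.2f}, dof={dof}, p={p:.2e}")
```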
Data Imputation Techniques for Null Value Replacement
As I discovered during Exploratory Data Analysis, the dataset has 172 samples with null values for Cholesterol (which were initially set to 0). I explored various data imputation techniques in an attempt to extract meaningful training data from such samples.
Initially, simple imputation strategies were deployed, namely: mean, median, and mode imputation. A noteworthy improvement in model performance was observed compared to models trained on the original dataset where null values were replaced by a default value of zero. Among these initial imputation techniques, mean imputation was found to deliver the best results for most machine learning models.
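A sketch of the simple-imputation step, assuming scikit-learn's `SimpleImputer`; the zeros must first be converted to NaN so the imputer treats them as missing.

```python
# Sketch: mean imputation of the null (zero) cholesterol values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("heart.csv")
df["Cholesterol"] = df["Cholesterol"].replace(0, np.nan)

# strategy="median" or "most_frequent" gives the median/mode variants.
imputer = SimpleImputer(strategy="mean")
df[["Cholesterol"]] = imputer.fit_transform(df[["Cholesterol"]])
```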
Building upon these initial findings, I applied a more sophisticated imputation method: using regression analysis to estimate the missing values. The regression techniques applied included Linear Regression, Ridge Regression, Lasso Regression, Random Forest Regression, Support Vector Regression, and Regression using Deep Learning. Each of these regression-based imputation techniques displayed a similar level of performance in terms of RMSE and MAE.
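As an illustration of the regression-based approach, here is a sketch using Ridge Regression; the choice of predictor columns (numeric features only, to avoid encoding) is an assumption made for brevity.

```python
# Sketch: impute missing cholesterol by regressing it on other features.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

df = pd.read_csv("heart.csv")
df["Cholesterol"] = df["Cholesterol"].replace(0, np.nan)

predictors = ["Age", "RestingBP", "MaxHR", "Oldpeak", "FastingBS"]
known = df[df["Cholesterol"].notna()]
missing_mask = df["Cholesterol"].isna()

# Fit on rows with known cholesterol, predict for the missing rows.
model = Ridge().fit(known[predictors], known["Cholesterol"])
df.loc[missing_mask, "Cholesterol"] = model.predict(df.loc[missing_mask, predictors])
```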
The regression-based imputations did not perform as well as initially hypothesized, often falling short of the results obtained with mean imputation. Nevertheless, for Random Forest Classification models, the regression-based methods exhibited strong results in terms of precision and specificity; Linear Regression and Ridge Regression-based imputation strategies performed particularly well in these areas.
Classification Models
A key component of this project involved the implementation and performance evaluation of a variety of classification models. The chosen models were tested on the datasets prepared as described in the "Data Imputation Techniques for Null Value Replacement" section. All the datasets utilized in this phase had their continuous features standardized, as described above, to ensure a uniform scale and improved model performance.
1. Logistic Regression
2. Random Forest Classifier
3. Support Vector Machine Classifier
4. Gaussian Naive Bayes Classifier
5. Bernoulli Naive Bayes Classifier
6. XGBoost Classifier
7. Neural Network (of various architectures)
For the neural network models, an expansive exploration of hyperparameter variations was conducted using cross-validation. The architectures explored ranged from a single hidden layer to three hidden layers, with the number of neurons per layer varying from 32 to 128.
Each of these models was trained on 80% of the data, and tested on 20%. Accuracy, Precision, Recall, F1-Score and Specificity metrics were tracked.
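The write-up above does not specify the neural-network framework, so the sketch below uses scikit-learn throughout, with `MLPClassifier` standing in for the neural network and a cross-validated grid over the stated architecture range; the one-hot encoding step is likewise an assumption about how the categorical features were handled.

```python
# Sketch: 80/20 split, architecture search, and the five tracked metrics.
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("heart.csv")
X = pd.get_dummies(df.drop(columns="HeartDisease"))  # encode categoricals
y = df["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 1-3 hidden layers with 32-128 neurons each, chosen by cross-validation.
grid = {"hidden_layer_sizes": [(32,), (64,), (128,), (64, 32),
                               (128, 64), (128, 64, 32)]}
model = GridSearchCV(MLPClassifier(max_iter=1000), grid, cv=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy:   ", accuracy_score(y_test, y_pred))
print("Precision:  ", precision_score(y_test, y_pred))
print("Recall:     ", recall_score(y_test, y_pred))
print("F1-Score:   ", f1_score(y_test, y_pred))
print("Specificity:", tn / (tn + fp))
```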
Results and Model Performance Evaluation
A thorough analysis of over 80 models was conducted in this project, with the evaluation criteria based on several metrics including accuracy, precision, recall (sensitivity), F1-score, and specificity. Out of all the models evaluated, two demonstrated superior performance in their respective contexts.
- Deep Learning Model: best Accuracy, Recall, and F1-Score
- Random Forest Classifier: best Precision and Specificity
The top-performing model by Accuracy, Recall, and F1-Score was a deep learning model trained on data where missing cholesterol values were imputed using the mean of the available values. Performance metrics can be found in Tables 1 and 2 below. To accompany the single-number metrics, probability density functions (PDFs) were also constructed to quantify the uncertainty in this model's sensitivity/recall and specificity. I wrote a comprehensive report, available in the repository, detailing the generation of the sensitivity and specificity PDFs and credible intervals.
Table 1: Deep learning model performance metrics

| Accuracy | Precision | Recall | F1-Score | Specificity |
|---|---|---|---|---|
| 91.30% | 91.96% | 93.64% | 92.79% | 87.84% |
Table 2: Deep learning model 95% credible intervals

| Metric | 95% CI Minimum | 95% CI Maximum |
|---|---|---|
| Sensitivity/Recall | 88.5% | 96.5% |
| Specificity | 80.5% | 93.5% |
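The full construction of these intervals is documented in the report mentioned above; as a sketch of one standard approach, a Beta-Binomial model with a uniform Beta(1, 1) prior gives a Beta(TP + 1, FN + 1) posterior for sensitivity, whose quantiles yield a credible interval. The counts below are illustrative placeholders, not the report's actual figures.

```python
# Sketch: posterior PDF and 95% credible interval for sensitivity,
# assuming a Beta-Binomial model with a uniform prior.
from scipy.stats import beta

tp, fn = 103, 7  # illustrative test-set counts, not the report's figures
posterior = beta(tp + 1, fn + 1)  # Beta posterior over the true sensitivity

lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval for sensitivity: [{lo:.3f}, {hi:.3f}]")
# The same construction with tn/fp gives the specificity interval.
```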
The top-performing model by Precision and Specificity was a Random Forest Classifier trained on data with missing cholesterol values imputed using Ridge Regression. Performance metrics can be found in Tables 3 and 4 below.
Table 3: Random Forest Classifier performance metrics

| Accuracy | Precision | Recall | F1-Score | Specificity |
|---|---|---|---|---|
| 89.67% | 94.34% | 88.50% | 91.32% | 91.55% |
Table 4: Random Forest Classifier 95% credible intervals

| Metric | 95% CI Minimum | 95% CI Maximum |
|---|---|---|
| Sensitivity/Recall | 82.5% | 92.5% |
| Specificity | 84.5% | 96.5% |