This project is an exploratory data analysis (EDA) of a heart disease dataset, which is located in the data folder. The data contains information on heart health indicators, including demographic, clinical, and lifestyle variables.
The EDA is done in a Jupyter notebook called eda.ipynb.
In this task, the Python libraries were imported under aliases. The data was inspected with pandas.
The ca column contained ? placeholder values. These were replaced with the mode of the ca column, and the column was cast to float.
The thal column also contained ? placeholder values. These were replaced with the mode of the thal column, and the column was cast to float.
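The cleaning step described above can be sketched as follows. This is a minimal illustration on a made-up toy DataFrame, not the notebook's actual code; it assumes the columns are read in as strings because of the ? placeholder, and it computes the mode over the valid entries only.

```python
import pandas as pd

# Toy stand-in for the real data; these values are made up for illustration.
df = pd.DataFrame({
    "ca": ["0.0", "3.0", "?", "0.0", "2.0"],
    "thal": ["3.0", "7.0", "3.0", "?", "3.0"],
})

for col in ["ca", "thal"]:
    # Take the mode over valid entries only, so "?" itself can never be picked.
    mode_value = df.loc[df[col] != "?", col].mode()[0]
    # Replace the placeholder, then cast the whole column to float.
    df[col] = df[col].replace("?", mode_value).astype(float)

print(df.dtypes)
```

Excluding ? before taking the mode is a defensive choice: if the placeholder happened to be the most frequent value, a plain mode would return it.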
No duplicate rows were found in the data.
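One way to run such a duplicate check with pandas is shown below. The DataFrame is a made-up toy example, and the variable name n_duplicates is only illustrative.

```python
import pandas as pd

# Toy stand-in for the real data; these values are made up for illustration.
df = pd.DataFrame({"age": [63, 67, 41], "sex": [1, 1, 0]})

# duplicated() flags every row that repeats an earlier row exactly.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```

If the count were nonzero, df.drop_duplicates() would return a copy with the repeated rows removed.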
In Task 3, the heart_disease column was recoded to 0 and 1, where 0 = no heart disease and 1 = heart disease.
This makes the heart_disease column easier to analyze and draw conclusions from.
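A recoding like this might look as follows. The original encoding of the column is not stated here, so this sketch assumes integer values where 0 means no disease and any positive value means disease is present; that assumption and the toy values are illustrative only.

```python
import pandas as pd

# Toy stand-in; ASSUMPTION: the raw column holds integers where 0 means
# no disease and any positive value means disease is present.
df = pd.DataFrame({"heart_disease": [0, 2, 1, 0, 4, 3]})

# Collapse every positive value to 1 and keep 0 as 0.
df["heart_disease"] = (df["heart_disease"] > 0).astype(int)

print(df["heart_disease"].value_counts())
```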
The correlation matrix shows low correlation between heart disease and both age and sex, and slightly higher correlation between heart disease and cp, exang, oldpeak, ca, and thal. From this, a conclusion can be drawn that heart disease depends more on lifestyle than on age and sex.
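To read such a comparison directly off the correlation matrix, the column for heart_disease can be extracted and sorted. The sketch below uses a made-up toy DataFrame with only two features, so its numbers do not reproduce the notebook's results.

```python
import pandas as pd

# Toy stand-in for the real data; these values are made up for illustration.
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 57, 44],
    "oldpeak": [2.3, 1.5, 0.0, 0.8, 2.6, 0.0],
    "heart_disease": [1, 1, 0, 0, 1, 0],
})

# Pearson correlation of each feature with the target, strongest first.
corr_with_target = (
    df.corr()["heart_disease"]
    .drop("heart_disease")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value instead would also surface strong negative correlations near the top.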
The crosstab shows that a male has a higher chance of having heart disease than a female.
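A crosstab normalized by row makes this comparison explicit, because each row then gives the share of that sex with and without heart disease. The sketch below uses made-up toy values and assumes sex is encoded as 1 = male and 0 = female, which is not confirmed here.

```python
import pandas as pd

# Toy stand-in; values are made up. ASSUMPTION: sex is 1 = male, 0 = female.
df = pd.DataFrame({
    "sex": [1, 1, 1, 1, 0, 0, 0, 0],
    "heart_disease": [1, 1, 1, 0, 1, 0, 0, 0],
})

# normalize="index" turns counts into per-row proportions,
# so each row sums to 1 and groups of different sizes are comparable.
rates = pd.crosstab(df["sex"], df["heart_disease"], normalize="index")
print(rates)
```

Comparing proportions rather than raw counts matters when one group has many more patients than the other.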
In this pair plot, we are looking for patterns between the two color groups. Looking at the density plots along the diagonal, there are no features that cleanly separate the groups (age has the most separation). However, looking at the scatterplot for age and thalach (maximum heart rate from an exercise test), there is more clear separation. It appears that patients who are old and have low thalach are more likely to be diagnosed with heart disease than patients who are young and have high thalach. This suggests that we want to make sure both of these features are included in our model.