HeartDiseaseAnalysis.Rmd

---
title: "Analysis and Study of UCI Heart Disease dataset"
output:
  html_document:
    df_print: paged
  word_document: default
  pdf_document: default
---

```{r, echo=FALSE, comment="", tidy=TRUE, error=TRUE}
set.seed(1234)
today <- date()
today <- unlist(strsplit(today, " "))
cat("by Prateek Sarangi,", today[1], ",", today[2:3], today[5])
```

Part 1:- Getting the feature dataset from the actual data
---------------------------------------------------------------------------
## Dataset details
### Attribute information
- age  
- sex  
- chest pain type (4 values) -> cp  
- resting blood pressure -> trestbps  
- serum cholestoral in mg/dl-> chol  
- fasting blood sugar > 120 mg/dl -> fbs  
- resting electrocardiographic results (values 0,1,2) -> restecg  
- maximum heart rate achieved -> thalach  
- exercise induced angina -> exang  
- oldpeak = ST depression induced by exercise relative to rest -> oldpeak  
- the slope of the peak exercise ST segment -> slope  
- number of major vessels (0-3) colored by flourosopy -> ca  
- thal: 3 = normal; 6 = fixed defect; 7 = reversable defect  

## Program for exploratory analysis and preprocessing
### Function to Normalize the data
```{r}
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
```

### Import dataset and view it as a dataframe
```{r}
heart <- read.csv("~/HeartDisease/heart.csv")
head(heart)
```

### Normalize the age and separate it in various groups to show which age group is more prone to heart disease 
It uses the normalize function defined above and then categorize the data into **four** groups.  
After normalization the valuesaranges from **0-1**  
It is grouped as follows  
- **0.00 to 0.25** as **Group 1** giving it value **0.1**  
- **0.25 to 0.50** as **Group 2** giving it value **0.4**  
- **0.50 to 0.75** as **Group 3** giving it value **0.6**  
- **0.75 to 1.00** as **Group 4** giving it value **0.9**  
```{r echo=FALSE, include=FALSE}
library(dplyr)
```

```{r}
dfNorm <- as.data.frame(lapply(heart["age"], normalize))
heart["age"] <- dfNorm
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.0, 0.25), 0.1)}))
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.25, 0.6), 0.4)}))
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.5, 0.75), 0.6)}))
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.75, 1), 0.9)}))
```

### Normalize the Rest Blood Pressure and separate it in various groups to show what blood pressure group is more prone to heart disease 
It uses the normalize function defined above and then categorize the data into **three** groups.  
After normalization the valuesaranges from **0-1**  
It is grouped as follows  
- **0.00 to 0.33** as **Group 1** giving it value **0.2**  
- **0.33 to 0.67** as **Group 2** giving it value **0.6**  
- **0.67 to 1.00** as **Group 3** giving it value **1.0**  
```{r}
dfNorm <- as.data.frame(lapply(heart["trestbps"], normalize))
heart["trestbps"] <- dfNorm
heart["trestbps"] <- as.data.frame(lapply(heart["trestbps"], function(x){replace(x, between(x, 0.0, 0.33), 0.2)}))
heart["trestbps"] <- as.data.frame(lapply(heart["trestbps"], function(x){replace(x, between(x, 0.33, 0.67), 0.6)}))
heart["trestbps"] <- as.data.frame(lapply(heart["trestbps"], function(x){replace(x, between(x, 0.67, 1), 1)}))
```

### Normalize the Cholestrole level and separate it in various groups to show what cholestrole level is more prone to heart disease 
It uses the normalize function defined above and then categorize the data into **five** groups.  
After normalization the valuesaranges from **0-1**  
It is grouped as follows  
- **0.00 to 0.20** as **Group 1** giving it value **0.1**  
- **0.20 to 0.40** as **Group 2** giving it value **0.3**  
- **0.40 to 0.60** as **Group 3** giving it value **0.5**  
- **0.60 to 0.80** as **Group 4** giving it value **0.7**  
- **0.80 to 1.00** as **Group 5** giving it value **0.9**  
```{r}
dfNorm <- as.data.frame(lapply(heart["chol"], normalize))
heart["chol"] <- dfNorm
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.0, 0.2), 0.1)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.2, 0.4), 0.3)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.4, 0.6), 0.5)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.6, 0.8), 0.7)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.8, 1), 0.9)}))
```

### Separate the data in various groups to show which category of chest pain is more prone to heart disease 
It replaces the values of chest pain into different groups.  
- **Value 0**:- Typical angina(**Provided value 0.1**).  
- **Value 1**:- Atypical angina(**Provided value 0.6**).  
- **Value 2**:- Non-anginal pain(**Provided value 0.9**).  
- **Value 3**:- Asymptomatic(**Provided value 0.01**).  
```{r}
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 0, 0.1)}))
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 1, 0.6)}))
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 2, 0.9)}))
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 3, 0.01)}))
```

### Normailze the maximum heart rate of the patient
```{r}
dfNorm <- as.data.frame(lapply(heart["thalach"], normalize))
heart["thalach"] <- dfNorm
```

### Normalize the thal and separate it in various groups to show what thal is more prone to heart disease 
It uses the normalize function defined above and then categorize the data into **three** groups.  
Thal: 3 = normal; 6 = fixed defect; 7 = reversable defect  
After normalization the valuesaranges from **0-1**  
It is grouped as follows  
- **0.00 to 0.25** as **Group 1** giving it value **0.5**  
- **0.25 to 0.50** as **Group 2** giving it value **0.6**  
- **0.50 to 0.75** as **Group 3** giving it value **0.9**  
- **0.75 to 1.00** as **Group 4** giving it value **0.1**  
```{r}
dfNorm <- as.data.frame(lapply(heart["thal"], normalize))
heart["thal"] <- dfNorm
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.0, 0.25), 0.5)}))
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.25, 0.50), 0.6)}))
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.50, 0.75), 0.9)}))
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.75, 1.00), 0.1)}))
```

### Regularizing the rest ECG value of the patient
It replaces the values of Rest ECG in the following manner.  
- **Value 0**:- Normal(**Provided value 0.3**).  
- **Value 1**:- Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)(**Provided value 0.9**).  
- **Value 2**:- showing probable or definite left ventricular hypertrophy by Estes' criteria.(**Provided value 0.1**)  
```{r}
heart["restecg"] <- as.data.frame(lapply(heart["restecg"], function(x){replace(x, x == 0, 0.3)}))
heart["restecg"] <- as.data.frame(lapply(heart["restecg"], function(x){replace(x, x == 1, 0.9)}))
heart["restecg"] <- as.data.frame(lapply(heart["restecg"], function(x){replace(x, x == 2, 0.1)}))
```

### Replace the peak exercise ST segment of the patient
It replaces the values of peak exercise ST segment into different groups.  
- **Value 0**:- Upsloping(**Provided value 0.01**).  
- **Value 1**:- Flat(**Provided value 0.2**).  
- **Value 2**:- Downsloping(**Provided value 0.9**).  
```{r}
heart["slope"] <- as.data.frame(lapply(heart["slope"], function(x){replace(x, x == 0, 0.01)}))
heart["slope"] <- as.data.frame(lapply(heart["slope"], function(x){replace(x, x == 1, 0.2)}))
heart["slope"] <- as.data.frame(lapply(heart["slope"], function(x){replace(x, x == 2, 0.9)}))
```

### Replace the major vessels (0-3) colored by flourosopy
It replaces the values of major vessels colored into different groups.  
- **Value 0**:- Upsloping(**Provided value 0.09**).  
- **Value 1**:- Flat(**Provided value 0.6**).  
- **Value 2**:- Downsloping(**Provided value 0.45**).  
- **Value 3**:- Downsloping(**Provided value 0.3**).  
- **Value 4**:- Downsloping(**Provided value 0.1**).  
```{r}
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 0, 0.9)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 1, 0.6)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 2, 0.45)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 3, 0.3)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 4, 0.1)}))
```

### Replace the fasting bloog sugar
It replaces the values of fasting bloog sugar into different groups.  
- **Value 0**:- Bloog sugar < 120 mg/dl(**Provided value 0.9**).  
- **Value 1**:- Bloog sugar > 120 mg/dl(**Provided value 0.1**).  
```{r}
heart["fbs"] <- as.data.frame(lapply(heart["fbs"], function(x){replace(x, x == 0, 0.9)}))
heart["fbs"] <- as.data.frame(lapply(heart["fbs"], function(x){replace(x, x == 1, 0.1)}))
```
### Replace the sex of the patient with values
It replaces the values of sex into different groups.  
- **Value 1**:- Male, is replaced with **0.9**  
- **Value 0**:- Female, is replaced with **0.1**  
```{r}
heart["sex"] <- as.data.frame(lapply(heart["sex"], function(x){replace(x, x == 0, 0.1)}))
heart["sex"] <- as.data.frame(lapply(heart["sex"], function(x){replace(x, x == 1, 0.9)}))
```

### Replace the exercise induced angina
It replaces the values of peak exercise ST segment into different groups.  
- **Value 0**:- No(**Provided value 0.9**).  
- **Value 1**:- Yes(**Provided value 0.1**).  
```{r}
heart["exang"] <- as.data.frame(lapply(heart["exang"], function(x){replace(x, x == 0, 0.9)}))
heart["exang"] <- as.data.frame(lapply(heart["exang"], function(x){replace(x, x == 1, 0.1)}))
```

## The dataset after the modifications are made.
```{r echo = FALSE}
head(heart)
```

## Splitting of training and testing data
We are using both randomly generated and sequentially choosen 75% of the data as training set and rest 25% as our test set.  
**train_ind_rand** gives the indeces of the samples which are to be used as the training sample in the dataset.  
**trainrand** gives the randomly choosen train dataset.  
**testrand** given the randomly choosen test dataset.  
**trainseq** gives the sequencially choosen train dataset.  
**testseq** gives the sequentially choosen test dataset.  
```{r}
smp_size <- floor(0.75 * nrow(heart))
train_ind_rand <- sample(seq_len(nrow(heart)), size = smp_size)

trainrand <- heart[train_ind_rand, ]
testrand <- heart[-train_ind_rand, ]

trainseq <- heart[1:227, ]
testseq <- heart[227:303, ]
```

## Writting the dataset into csv files so that it can be used in the Python program for Neural network classification.
- **heart1.csv** contain the modified Heart Disease dataset.  
- **trainrand.csv** conains the randomly chosen train dataset.  
- **testrand.csv** conains the randomly chosen test dataset.  
- **trainseq.csv** conains the sequencially choosen train dataset.  
- **testseq.csv** conains the sequentially choosen test dataset. 
```{r}
write.csv(heart, "~/HeartDisease/heart1.csv", row.names = FALSE)
write.csv(trainrand, "~/HeartDisease/trainrand.csv", row.names = FALSE)
write.csv(testrand, "~/HeartDisease/testrand.csv", row.names = FALSE)
write.csv(trainseq, "~/HeartDisease/trainseq.csv", row.names = FALSE)
write.csv(testseq, "~/HeartDisease/testseq.csv", row.names = FALSE)
```

Part 2:- Analysis of the neural network model
---------------------------------------------------------------------------
## Plot for varying the number of hidden layers
Neural network with various number of hidden layers are tested and the results of the test accuracy are ploted.  
Input layer contains **Thirteen** input neurons, each hidden layer has **Tweleve** neurons and the output layer has **One** neuron with **Two** classes.  
- Class zero -> The patient is not suffering from heart disease.  
- Class one -> The patient is suffering from heart disease.  

### Plot of test accuracy for different hidden layers
```{r echo=FALSE}
LayerAcc <- read.csv("~/HeartDisease/LayerAcc.csv", header=FALSE)
LayerAcc.bar <- barplot(LayerAcc$V2, names.arg=LayerAcc$V1, col = 'blue', xlab = "Number of hidden layer", ylab = "Accuracy")
```

### Conclusion from the plot
According to the test runs we can see that model with **Three**, **Four**, **Five** and **Six** hidden layers, having **Tweleve** neuron each perform better than the other choosen models.  

### Plot for varying the number of neurons  
**For each hidden layer, five samples are taken for testing and it's accuracy is plotted below**  

#### For three hidden layers
**Argument numbers and the structure associated with it.**   
1.	13-8-8-8-1  
2.	13-10-10-10-1  
3.	13-8-12-12-1  
4.	13-10-10-12-1  
5.	12-10-12-12-1  
```{r echo=FALSE}
NeuronAcc <- read.csv("~/HeartDisease/Structure.csv", sep=";")
itr = 1:5
NeuraonAcc.bar <- barplot(NeuronAcc$Accuracy[1:5], names.arg=itr, col = 'red', xlab = "Argument", ylab = "Accuracy")
```

#### For four hidden layers
**Argument numbers and the structure associated with it.**   
1.	13-8-8-8-8-1  
2.	13-12-12-12-12-1  
3.	13-8-8-10-10-1  
4.	13-12-12-8-8-1  
5.	13-12-12-10-10-1  
```{r echo=FALSE}
NeuronAcc <- read.csv("~/HeartDisease/Structure.csv", sep=";")
itr = 1:5
NeuraonAcc.bar <- barplot(NeuronAcc$Accuracy[6:10], names.arg=itr, col = 'red', xlab = "Argument", ylab = "Accuracy")
```

#### For five hidden layers
**Argument numbers and the structure associated with it.**   
1.	13-8-8-8-8-8-1  
2.	13-12-12-10-8-8-1  
3.	13-12-10-10-10-8-1  
4.	13-8-8-10-8-8-1  
5.	13-8-8-12-8-8-1  
```{r echo=FALSE}
NeuronAcc <- read.csv("~/HeartDisease/Structure.csv", sep=";")
itr = 1:5
NeuraonAcc.bar <- barplot(NeuronAcc$Accuracy[11:15], names.arg=itr, col = 'red', xlab = "Argument", ylab = "Accuracy")
```

### Conclusion from the plot
**According to the test runs we can see that the models with structure**
- Input layer -> 13 input neuron  
- First hidden layer -> 12 hidden neuraon  
- Second hidden layer -> 12 hidden neuraon  
- Third hidden layer -> 10 hidden neuraon  
- Fourth hidden layer -> 8 hidden neuraon  
- Fifth hidden layer -> 8 hidden neuraon  
- Output layer -> 1 neuron, 2 classes  
gives the best output and so is taken as the final structure of the neural network.

### Plot for final accepted structure
The neural network with the best results is run **Twinty** times and the average mean square error, and accuracy was taken and plot in the following graph.

### Plot for Accuracy of the model
```{r echo = FALSE}
tableAccuracy <- read.csv("~/HeartDisease/tableAccuracy.csv", header=FALSE, sep=";")

meanAccuracy <- colMeans(tableAccuracy[sapply(tableAccuracy, is.numeric)])
itr <- 1:1500

plot(itr, meanAccuracy, "l", xlab = "Iteration", ylab = "Accuracy",
     col = "blue", lwd = 2)
```

### Plot for Mean Square Error of the model
```{r echo = FALSE}
tableMSE <- read.csv("~/HeartDisease/tableMSE.csv", header=FALSE, sep=";")

meanMSE <- colMeans(tableMSE[sapply(tableMSE, is.numeric)])
itr <- 1:1500

plot(itr, meanMSE, "l", xlab = "Iteration", ylab = "Mean Square Error",
     col = "orange", lwd = 2)
```

## Confusion matrix and performance analysis
```{r echo=FALSE}
performance_Matrix <- read.table("~/HeartDisease/performance_Matrix.txt", quote="\"", comment.char="")
FN <- performance_Matrix$X0[1]
FP <- performance_Matrix$X1[1]
TN <- performance_Matrix$X0[2]
TP <- performance_Matrix$X1[2]
knitr::include_graphics('PerformanceMatrix.png')
```

### Performance matrix analysis
```{r}
precision <- TP/(TP+FP)
recall <- TP/(TP+FN)
sensitivity <- TP/(TP+FN)
specificity <- TN/(TN+FP)
FNR <- FN/(FN+TP)
F_score <- (2*precision*recall)/(precision+recall)
MCC <- ((TP*TN)-(FP*FN)) / (sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)))
print(precision)
print(recall)
print(sensitivity)
print(specificity)
print(FNR)
print(F_score)
print(MCC)
```