-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathHeartDiseaseAnalysis.Rmd
369 lines (324 loc) · 15.8 KB
/
HeartDiseaseAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
---
title: "Analysis and Study of UCI Heart Disease dataset"
output:
html_document:
df_print: paged
word_document: default
pdf_document: default
---
```{r, echo=FALSE, comment="", tidy=TRUE, error=TRUE}
set.seed(1234)
today <- date()
today <- unlist(strsplit(today, " "))
cat("by Prateek Sarangi,", today[1], ",", today[2:3], today[5])
```
Part 1:- Getting the feature dataset from the actual data
---------------------------------------------------------------------------
## Dataset details
### Attribute information
- age
- sex
- chest pain type (4 values) -> cp
- resting blood pressure -> trestbps
- serum cholestoral in mg/dl-> chol
- fasting blood sugar > 120 mg/dl -> fbs
- resting electrocardiographic results (values 0,1,2) -> restecg
- maximum heart rate achieved -> thalach
- exercise induced angina -> exang
- oldpeak = ST depression induced by exercise relative to rest -> oldpeak
- the slope of the peak exercise ST segment -> slope
- number of major vessels (0-3) colored by flourosopy -> ca
- thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
## Program for exploratory analysis and preprocessing
### Function to Normalize the data
```{r}
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
```
### Import dataset and view it as a dataframe
```{r}
heart <- read.csv("~/HeartDisease/heart.csv")
head(heart)
```
### Normalize the age and separate it in various groups to show which age group is more prone to heart disease
It uses the normalize function defined above and then categorize the data into **four** groups.
After normalization the valuesaranges from **0-1**
It is grouped as follows
- **0.00 to 0.25** as **Group 1** giving it value **0.1**
- **0.25 to 0.50** as **Group 2** giving it value **0.4**
- **0.50 to 0.75** as **Group 3** giving it value **0.6**
- **0.75 to 1.00** as **Group 4** giving it value **0.9**
```{r echo=FALSE, include=FALSE}
library(dplyr)
```
```{r}
dfNorm <- as.data.frame(lapply(heart["age"], normalize))
heart["age"] <- dfNorm
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.0, 0.25), 0.1)}))
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.25, 0.6), 0.4)}))
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.5, 0.75), 0.6)}))
heart["age"] <- as.data.frame(lapply(heart["age"], function(x){replace(x,between(x, 0.75, 1), 0.9)}))
```
### Normalize the Rest Blood Pressure and separate it in various groups to show what blood pressure group is more prone to heart disease
It uses the normalize function defined above and then categorize the data into **three** groups.
After normalization the valuesaranges from **0-1**
It is grouped as follows
- **0.00 to 0.33** as **Group 1** giving it value **0.2**
- **0.33 to 0.67** as **Group 2** giving it value **0.6**
- **0.67 to 1.00** as **Group 3** giving it value **1.0**
```{r}
dfNorm <- as.data.frame(lapply(heart["trestbps"], normalize))
heart["trestbps"] <- dfNorm
heart["trestbps"] <- as.data.frame(lapply(heart["trestbps"], function(x){replace(x, between(x, 0.0, 0.33), 0.2)}))
heart["trestbps"] <- as.data.frame(lapply(heart["trestbps"], function(x){replace(x, between(x, 0.33, 0.67), 0.6)}))
heart["trestbps"] <- as.data.frame(lapply(heart["trestbps"], function(x){replace(x, between(x, 0.67, 1), 1)}))
```
### Normalize the Cholestrole level and separate it in various groups to show what cholestrole level is more prone to heart disease
It uses the normalize function defined above and then categorize the data into **five** groups.
After normalization the valuesaranges from **0-1**
It is grouped as follows
- **0.00 to 0.20** as **Group 1** giving it value **0.1**
- **0.20 to 0.40** as **Group 2** giving it value **0.3**
- **0.40 to 0.60** as **Group 3** giving it value **0.5**
- **0.60 to 0.80** as **Group 4** giving it value **0.7**
- **0.80 to 1.00** as **Group 5** giving it value **0.9**
```{r}
dfNorm <- as.data.frame(lapply(heart["chol"], normalize))
heart["chol"] <- dfNorm
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.0, 0.2), 0.1)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.2, 0.4), 0.3)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.4, 0.6), 0.5)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.6, 0.8), 0.7)}))
heart["chol"] <- as.data.frame(lapply(heart["chol"], function(x){replace(x, between(x, 0.8, 1), 0.9)}))
```
### Separate the data in various groups to show which category of chest pain is more prone to heart disease
It replaces the values of chest pain into different groups.
- **Value 0**:- Typical angina(**Provided value 0.1**).
- **Value 1**:- Atypical angina(**Provided value 0.6**).
- **Value 2**:- Non-anginal pain(**Provided value 0.9**).
- **Value 3**:- Asymptomatic(**Provided value 0.01**).
```{r}
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 0, 0.1)}))
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 1, 0.6)}))
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 2, 0.9)}))
heart["cp"] <- as.data.frame(lapply(heart["cp"], function(x){replace(x, x == 3, 0.01)}))
```
### Normailze the maximum heart rate of the patient
```{r}
dfNorm <- as.data.frame(lapply(heart["thalach"], normalize))
heart["thalach"] <- dfNorm
```
### Normalize the thal and separate it in various groups to show what thal is more prone to heart disease
It uses the normalize function defined above and then categorize the data into **three** groups.
Thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
After normalization the valuesaranges from **0-1**
It is grouped as follows
- **0.00 to 0.25** as **Group 1** giving it value **0.5**
- **0.25 to 0.50** as **Group 2** giving it value **0.6**
- **0.50 to 0.75** as **Group 3** giving it value **0.9**
- **0.75 to 1.00** as **Group 4** giving it value **0.1**
```{r}
dfNorm <- as.data.frame(lapply(heart["thal"], normalize))
heart["thal"] <- dfNorm
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.0, 0.25), 0.5)}))
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.25, 0.50), 0.6)}))
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.50, 0.75), 0.9)}))
heart["thal"] <- as.data.frame(lapply(heart["thal"], function(x){replace(x, between(x, 0.75, 1.00), 0.1)}))
```
### Regularizing the rest ECG value of the patient
It replaces the values of Rest ECG in the following manner.
- **Value 0**:- Normal(**Provided value 0.3**).
- **Value 1**:- Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)(**Provided value 0.9**).
- **Value 2**:- showing probable or definite left ventricular hypertrophy by Estes' criteria.(**Provided value 0.1**)
```{r}
heart["restecg"] <- as.data.frame(lapply(heart["restecg"], function(x){replace(x, x == 0, 0.3)}))
heart["restecg"] <- as.data.frame(lapply(heart["restecg"], function(x){replace(x, x == 1, 0.9)}))
heart["restecg"] <- as.data.frame(lapply(heart["restecg"], function(x){replace(x, x == 2, 0.1)}))
```
### Replace the peak exercise ST segment of the patient
It replaces the values of peak exercise ST segment into different groups.
- **Value 0**:- Upsloping(**Provided value 0.01**).
- **Value 1**:- Flat(**Provided value 0.2**).
- **Value 2**:- Downsloping(**Provided value 0.9**).
```{r}
heart["slope"] <- as.data.frame(lapply(heart["slope"], function(x){replace(x, x == 0, 0.01)}))
heart["slope"] <- as.data.frame(lapply(heart["slope"], function(x){replace(x, x == 1, 0.2)}))
heart["slope"] <- as.data.frame(lapply(heart["slope"], function(x){replace(x, x == 2, 0.9)}))
```
### Replace the major vessels (0-3) colored by flourosopy
It replaces the values of major vessels colored into different groups.
- **Value 0**:- Upsloping(**Provided value 0.09**).
- **Value 1**:- Flat(**Provided value 0.6**).
- **Value 2**:- Downsloping(**Provided value 0.45**).
- **Value 3**:- Downsloping(**Provided value 0.3**).
- **Value 4**:- Downsloping(**Provided value 0.1**).
```{r}
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 0, 0.9)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 1, 0.6)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 2, 0.45)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 3, 0.3)}))
heart["ca"] <- as.data.frame(lapply(heart["ca"], function(x){replace(x, x == 4, 0.1)}))
```
### Replace the fasting bloog sugar
It replaces the values of fasting bloog sugar into different groups.
- **Value 0**:- Bloog sugar < 120 mg/dl(**Provided value 0.9**).
- **Value 1**:- Bloog sugar > 120 mg/dl(**Provided value 0.1**).
```{r}
heart["fbs"] <- as.data.frame(lapply(heart["fbs"], function(x){replace(x, x == 0, 0.9)}))
heart["fbs"] <- as.data.frame(lapply(heart["fbs"], function(x){replace(x, x == 1, 0.1)}))
```
### Replace the sex of the patient with values
It replaces the values of sex into different groups.
- **Value 1**:- Male, is replaced with **0.9**
- **Value 0**:- Female, is replaced with **0.1**
```{r}
heart["sex"] <- as.data.frame(lapply(heart["sex"], function(x){replace(x, x == 0, 0.1)}))
heart["sex"] <- as.data.frame(lapply(heart["sex"], function(x){replace(x, x == 1, 0.9)}))
```
### Replace the exercise induced angina
It replaces the values of peak exercise ST segment into different groups.
- **Value 0**:- No(**Provided value 0.9**).
- **Value 1**:- Yes(**Provided value 0.1**).
```{r}
heart["exang"] <- as.data.frame(lapply(heart["exang"], function(x){replace(x, x == 0, 0.9)}))
heart["exang"] <- as.data.frame(lapply(heart["exang"], function(x){replace(x, x == 1, 0.1)}))
```
## The dataset after the modifications are made.
```{r echo = FALSE}
head(heart)
```
## Splitting of training and testing data
We are using both randomly generated and sequentially choosen 75% of the data as training set and rest 25% as our test set.
**train_ind_rand** gives the indeces of the samples which are to be used as the training sample in the dataset.
**trainrand** gives the randomly choosen train dataset.
**testrand** given the randomly choosen test dataset.
**trainseq** gives the sequencially choosen train dataset.
**testseq** gives the sequentially choosen test dataset.
```{r}
smp_size <- floor(0.75 * nrow(heart))
train_ind_rand <- sample(seq_len(nrow(heart)), size = smp_size)
trainrand <- heart[train_ind_rand, ]
testrand <- heart[-train_ind_rand, ]
trainseq <- heart[1:227, ]
testseq <- heart[227:303, ]
```
## Writting the dataset into csv files so that it can be used in the Python program for Neural network classification.
- **heart1.csv** contain the modified Heart Disease dataset.
- **trainrand.csv** conains the randomly chosen train dataset.
- **testrand.csv** conains the randomly chosen test dataset.
- **trainseq.csv** conains the sequencially choosen train dataset.
- **testseq.csv** conains the sequentially choosen test dataset.
```{r}
write.csv(heart, "~/HeartDisease/heart1.csv", row.names = FALSE)
write.csv(trainrand, "~/HeartDisease/trainrand.csv", row.names = FALSE)
write.csv(testrand, "~/HeartDisease/testrand.csv", row.names = FALSE)
write.csv(trainseq, "~/HeartDisease/trainseq.csv", row.names = FALSE)
write.csv(testseq, "~/HeartDisease/testseq.csv", row.names = FALSE)
```
Part 2:- Analysis of the neural network model
---------------------------------------------------------------------------
## Plot for varying the number of hidden layers
Neural network with various number of hidden layers are tested and the results of the test accuracy are ploted.
Input layer contains **Thirteen** input neurons, each hidden layer has **Tweleve** neurons and the output layer has **One** neuron with **Two** classes.
- Class zero -> The patient is not suffering from heart disease.
- Class one -> The patient is suffering from heart disease.
### Plot of test accuracy for different hidden layers
```{r echo=FALSE}
LayerAcc <- read.csv("~/HeartDisease/LayerAcc.csv", header=FALSE)
LayerAcc.bar <- barplot(LayerAcc$V2, names.arg=LayerAcc$V1, col = 'blue', xlab = "Number of hidden layer", ylab = "Accuracy")
```
### Conclusion from the plot
According to the test runs we can see that model with **Three**, **Four**, **Five** and **Six** hidden layers, having **Tweleve** neuron each perform better than the other choosen models.
### Plot for varying the number of neurons
**For each hidden layer, five samples are taken for testing and it's accuracy is plotted below**
#### For three hidden layers
**Argument numbers and the structure associated with it.**
1. 13-8-8-8-1
2. 13-10-10-10-1
3. 13-8-12-12-1
4. 13-10-10-12-1
5. 12-10-12-12-1
```{r echo=FALSE}
NeuronAcc <- read.csv("~/HeartDisease/Structure.csv", sep=";")
itr = 1:5
NeuraonAcc.bar <- barplot(NeuronAcc$Accuracy[1:5], names.arg=itr, col = 'red', xlab = "Argument", ylab = "Accuracy")
```
#### For four hidden layers
**Argument numbers and the structure associated with it.**
1. 13-8-8-8-8-1
2. 13-12-12-12-12-1
3. 13-8-8-10-10-1
4. 13-12-12-8-8-1
5. 13-12-12-10-10-1
```{r echo=FALSE}
NeuronAcc <- read.csv("~/HeartDisease/Structure.csv", sep=";")
itr = 1:5
NeuraonAcc.bar <- barplot(NeuronAcc$Accuracy[6:10], names.arg=itr, col = 'red', xlab = "Argument", ylab = "Accuracy")
```
#### For five hidden layers
**Argument numbers and the structure associated with it.**
1. 13-8-8-8-8-8-1
2. 13-12-12-10-8-8-1
3. 13-12-10-10-10-8-1
4. 13-8-8-10-8-8-1
5. 13-8-8-12-8-8-1
```{r echo=FALSE}
NeuronAcc <- read.csv("~/HeartDisease/Structure.csv", sep=";")
itr = 1:5
NeuraonAcc.bar <- barplot(NeuronAcc$Accuracy[11:15], names.arg=itr, col = 'red', xlab = "Argument", ylab = "Accuracy")
```
### Conclusion from the plot
**According to the test runs we can see that the models with structure**
- Input layer -> 13 input neuron
- First hidden layer -> 12 hidden neuraon
- Second hidden layer -> 12 hidden neuraon
- Third hidden layer -> 10 hidden neuraon
- Fourth hidden layer -> 8 hidden neuraon
- Fifth hidden layer -> 8 hidden neuraon
- Output layer -> 1 neuron, 2 classes
gives the best output and so is taken as the final structure of the neural network.
### Plot for final accepted structure
The neural network with the best results is run **Twinty** times and the average mean square error, and accuracy was taken and plot in the following graph.
### Plot for Accuracy of the model
```{r echo = FALSE}
tableAccuracy <- read.csv("~/HeartDisease/tableAccuracy.csv", header=FALSE, sep=";")
meanAccuracy <- colMeans(tableAccuracy[sapply(tableAccuracy, is.numeric)])
itr <- 1:1500
plot(itr, meanAccuracy, "l", xlab = "Iteration", ylab = "Accuracy",
col = "blue", lwd = 2)
```
### Plot for Mean Square Error of the model
```{r echo = FALSE}
tableMSE <- read.csv("~/HeartDisease/tableMSE.csv", header=FALSE, sep=";")
meanMSE <- colMeans(tableMSE[sapply(tableMSE, is.numeric)])
itr <- 1:1500
plot(itr, meanMSE, "l", xlab = "Iteration", ylab = "Mean Square Error",
col = "orange", lwd = 2)
```
## Confusion matrix and performance analysis
```{r echo=FALSE}
performance_Matrix <- read.table("~/HeartDisease/performance_Matrix.txt", quote="\"", comment.char="")
FN <- performance_Matrix$X0[1]
FP <- performance_Matrix$X1[1]
TN <- performance_Matrix$X0[2]
TP <- performance_Matrix$X1[2]
knitr::include_graphics('PerformanceMatrix.png')
```
### Performance matrix analysis
```{r}
precision <- TP/(TP+FP)
recall <- TP/(TP+FN)
sensitivity <- TP/(TP+FN)
specificity <- TN/(TN+FP)
FNR <- FN/(FN+TP)
F_score <- (2*precision*recall)/(precision+recall)
MCC <- ((TP*TN)-(FP*FN)) / (sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)))
print(precision)
print(recall)
print(sensitivity)
print(specificity)
print(FNR)
print(F_score)
print(MCC)
```