# (PART) Supervised Models {-}
# Supervised Classification
A classification problem can be defined as the induction, from a dataset $\cal D$, of a classification function $\psi$ that, given the attribute vector of an instance/example, returns a class ${c}$. In a regression problem, on the other hand, the function returns a numeric value.
The dataset $\cal D$ is typically composed of $m$ instances, each described by $n$ attributes and a class attribute $C$.
| $Att_1$ | ... | $Att_n$ | $Class$ |
|----------|-----| ---------|---------|
| $a_{11}$ | ... | $a_{1n}$ | $c_1$ |
| $a_{21}$ | ... | $a_{2n}$ | $c_2$ |
| ... | ... | ... | ... |
| $a_{m1}$ | ... | $a_{mn}$ | $c_m$ |
Columns are usually called _attributes_ or _features_. Typically, there is a _class_ attribute, which can be numeric or discrete. When the class is numeric, it is a regression problem. With discrete values, we talk about binary classification when the class has two values and multiclass (multinomial) classification when it has more than two. There are also variants such as _multi-label_ classification (we will cover these in the advanced models section).
Once a model has been learned, it is used to classify new instances, as shown in the next figure.

We have multiple types of models such as _classification trees_, _rules_, _neural networks_, and _probabilistic classifiers_ that can be used to classify instances.
Fernández-Delgado et al. provide an extensive comparison of 179 classifiers using the UCI datasets [@FernandezCBA14].
We will show the use of different classification techniques on the problem of defect prediction as a running example. In this example, the datasets are composed of classical metrics (_Halstead_ or _McCabe_ metrics, based on counts of operators/operands and the like) or object-oriented metrics (e.g. Chidamber and Kemerer), together with a class attribute indicating whether the module or class was defective.
## Classification Trees
There are several packages for inducing classification trees, for example the [party package](https://cran.r-project.org/web/packages/party/index.html) (recursive partitioning):
```{r warning=FALSE, message=FALSE}
library(foreign) # To load arff file
library(party) # Build a decision tree
library(caret)
jm1 <- read.arff("./datasets/defectPred/D1/JM1.arff")
str(jm1)
# Stratified partition (training and test sets)
set.seed(1234)
inTrain <- createDataPartition(y=jm1$Defective,p=.60,list=FALSE)
jm1.train <- jm1[inTrain,]
jm1.test <- jm1[-inTrain,]
jm1.formula <- Defective ~ . # formula approach: Defective as dependent variable and the rest as independent variables
jm1.ctree <- ctree(jm1.formula, data = jm1.train)
# predict on test data
pred <- predict(jm1.ctree, newdata = jm1.test)
# check prediction result
table(pred, jm1.test$Defective)
plot(jm1.ctree)
```
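With the C50 package, the model can be specified in two ways: by passing the predictors and the class separately (commented out below) or by using a formula. Here the tree is built on the training set: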
```{r, eval=FALSE}
library(C50)
require(utils)
# c50t <- C5.0(jm1.train[,-ncol(jm1.train)], jm1.train[,ncol(jm1.train)])
c50t <- C5.0(Defective ~ ., jm1.train)
summary(c50t)
plot(c50t)
c50tPred <- predict(c50t, jm1.train)
# table(c50tPred, jm1.train$Defective)
```
Using the ['rpart'](https://cran.r-project.org/web/packages/rpart/index.html) package
``` {r}
# Using the 'rpart' package
library(rpart)
jm1.rpart <- rpart(Defective ~ ., data=jm1.train, parms = list(prior = c(.65,.35), split = "information"))
# par(mfrow = c(1,2), xpd = NA)
plot(jm1.rpart)
text(jm1.rpart, use.n = TRUE)
jm1.rpart
library(rpart.plot)
# asRules() and fancyRpartPlot() are provided by the 'rattle' package
# asRules(jm1.rpart)
# fancyRpartPlot(jm1.rpart)
```
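The `split = "information"` option used above asks `rpart` to choose splits with an entropy-based criterion instead of the default Gini index. As a rough illustration of the idea (this is not `rpart`'s internal code), the following sketch computes the entropy of a class vector and the information gain of one candidate split. The toy vectors `cls` and `loc`, the helper names and the threshold of 50 are made up for the example.

```{r}
# Entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]               # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain obtained by splitting y with a logical condition
infoGain <- function(y, left) {
  n <- length(y)
  entropy(y) -
    (sum(left) / n) * entropy(y[left]) -
    (sum(!left) / n) * entropy(y[!left])
}

# Hypothetical toy data: class label and one size metric per module
cls <- factor(c("Y", "Y", "Y", "N", "N", "N", "N", "N"))
loc <- c(120, 95, 80, 60, 40, 30, 20, 10)

entropy(cls)              # impurity before splitting
infoGain(cls, loc > 50)   # gain of the candidate split loc > 50
```

A tree learner evaluates many such candidate splits and keeps the one with the largest gain, then repeats the process in each resulting partition.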
## Rules
The C5.0 algorithm can also induce a set of rules instead of a tree by setting `rules = TRUE`:
```{r}
library(C50)
c50r <- C5.0(jm1.train[,-ncol(jm1.train)], jm1.train[,ncol(jm1.train)], rules = TRUE)
summary(c50r)
# c50rPred <- predict(c50r, jm1.train)
# table(c50rPred, jm1.train$Defective)
```
## Distance-based Methods
In this case, there is no model as such. Given a new instance to classify, this approach finds its $k$ closest neighbours in the training data and assigns the class that is most frequent among them.

(Source: Wikipedia - https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
```{r}
library(class)
m1 <- knn(train=jm1.train[,-22], test=jm1.test[,-22], cl=jm1.train[,22], k=3)
table(jm1.test[,22],m1)
```
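To make the neighbour search explicit, here is a minimal base R sketch that classifies a single instance by hand. It assumes Euclidean distance and a simple majority vote; the function `knnOne` and the toy data are hypothetical and only for illustration.

```{r}
# Classify one new instance with k nearest neighbours, written out by hand
knnOne <- function(train, cl, newInstance, k = 3) {
  # Euclidean distance from the new instance to every training row
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, as.numeric(newInstance))^2))
  # Classes of the k closest training instances
  nearest <- cl[order(d)[1:k]]
  # Majority vote (ties are resolved by taking the first maximum)
  names(which.max(table(nearest)))
}

# Hypothetical toy data with two numeric attributes
toyTrain <- data.frame(x1 = c(1, 2, 1, 8, 9, 8),
                       x2 = c(1, 1, 2, 8, 8, 9))
toyClass <- factor(c("N", "N", "N", "Y", "Y", "Y"))

knnOne(toyTrain, toyClass, c(2, 2), k = 3)
knnOne(toyTrain, toyClass, c(7, 9), k = 3)
```

Because the method relies on distances, attributes measured on very different scales are usually normalised first, otherwise the attribute with the largest range dominates the result.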
## Neural Networks


## Support Vector Machine

(Source: Wikipedia - https://en.wikipedia.org/wiki/Support_vector_machine)
## Probabilistic Methods
### Naive Bayes
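Naive Bayes is a probabilistic graphical model that assigns a probability to each possible class $C_k$ given the attribute values of an instance, $p(C_k \mid x_1,\ldots,x_n)$, under the assumption that the attributes are independent of each other given the class.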

Using the `klaR` package with `caret`:
```{r warning=FALSE}
library(caret)
library(klaR)
model <- NaiveBayes(Defective ~ ., data = jm1.train)
predictions <- predict(model, jm1.test[,-22])
confusionMatrix(predictions$class, jm1.test$Defective)
```
Using the `e1071` package:
```{r warning=FALSE, message=FALSE}
library (e1071)
n1 <- naiveBayes(Defective ~ ., data=jm1.train)
# Show first 3 results using 'class'
head(predict(n1,jm1.test, type = c("class")),3) # class by default
# Show first 3 results using 'raw'
head(predict(n1,jm1.test, type = c("raw")),3)
```
There are other variants, such as TAN and KDB, that do not assume the independence condition, allowing more complex structures.


A comprehensive comparison of
## Linear Discriminant Analysis (LDA)
One classical approach to classification is Linear Discriminant Analysis (LDA), a generalization of Fisher's linear discriminant, which finds a linear combination of features that separates two or more classes.
```{r warning=FALSE}
ldaModel <- train (Defective ~ ., data=jm1.train, method="lda", preProc=c("center","scale"))
ldaModel
```
Note that we train the model using the formula `Defective ~ .`, where `Defective` is the class variable, `~` separates it from the predictors, and `.` stands for the rest of the variables. We also pre-process the training data (`preProc`) to center and scale the attributes.
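Centering and scaling amount to subtracting the mean of each attribute and dividing by its standard deviation, with both statistics estimated on the training data and then reused for any new data. The following base R sketch illustrates this on hypothetical toy data; it is not the code that `caret` runs internally.

```{r}
# Hypothetical toy training and test sets with two numeric attributes
toyTrain <- data.frame(loc = c(10, 20, 30, 40), cc = c(1, 3, 5, 7))
toyTest  <- data.frame(loc = c(25, 50), cc = c(4, 9))

# Statistics are estimated on the training data only
mu    <- colMeans(toyTrain)
sigma <- apply(toyTrain, 2, sd)

# The same transformation is then applied to both sets
trainScaled <- scale(toyTrain, center = mu, scale = sigma)
testScaled  <- scale(toyTest,  center = mu, scale = sigma)

colMeans(trainScaled)       # approximately 0 for every attribute
apply(trainScaled, 2, sd)   # 1 for every attribute
testScaled
```

Reusing the training statistics on the test set avoids leaking information from the test data into the model.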
The documentation of the `train` function describes how the resampling scheme can be changed through `trainControl`, for example to repeated cross-validation:
> http://topepo.github.io/caret/training.html
```{r warning=FALSE}
ctrl <- trainControl(method = "repeatedcv",repeats=3)
ldaModel <- train (Defective ~ ., data=jm1.train, method="lda", trControl=ctrl, preProc=c("center","scale"))
ldaModel
```
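To see what repeated cross-validation does, the following sketch performs 3 repetitions of 10-fold cross-validation by hand. It is only an approximation of the procedure: it uses plain random folds, the `lda` function from `MASS` directly, and the built-in `iris` data instead of the defect dataset, all of which are choices made for this illustration.

```{r}
library(MASS)  # lda(); MASS is distributed with R

set.seed(1234)
data(iris)
k <- 10        # folds
repeats <- 3   # repetitions, as in repeats=3 above

acc <- c()
for (r in seq_len(repeats)) {
  # Randomly assign each row to one of k folds (a new assignment per repetition)
  fold <- sample(rep(seq_len(k), length.out = nrow(iris)))
  for (f in seq_len(k)) {
    trainSet <- iris[fold != f, ]
    testSet  <- iris[fold == f, ]
    fit  <- lda(Species ~ ., data = trainSet)
    pred <- predict(fit, newdata = testSet)$class
    acc  <- c(acc, mean(pred == testSet$Species))
  }
}

length(acc)   # k * repeats held-out estimates
mean(acc)     # resampled accuracy
```

Averaging over several repetitions reduces the variance that comes from one particular assignment of instances to folds.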
Instead of accuracy, we can activate other metrics such as `ROC`, `sensitivity` and `specificity` using `summaryFunction=twoClassSummary`. To do so, we also need to specify `classProbs=TRUE`.
```{r warning=FALSE}
ctrl <- trainControl(method = "repeatedcv",repeats=3, classProbs=TRUE, summaryFunction=twoClassSummary)
ldaModel3xcv10 <- train (Defective ~ ., data=jm1.train, method="lda", trControl=ctrl, preProc=c("center","scale"))
ldaModel3xcv10
```
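As a reminder of what these three metrics measure, the sketch below computes them in base R from hypothetical true classes and predicted probabilities. The vectors `truth` and `probY`, the 0.5 threshold and the choice of "Y" as the positive class are assumptions made for the example.

```{r}
# Hypothetical true classes and predicted probabilities of the positive class "Y"
truth <- factor(c("Y", "Y", "Y", "N", "N", "N", "N", "N"), levels = c("Y", "N"))
probY <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1)

# Class predictions at a 0.5 threshold
pred <- factor(ifelse(probY >= 0.5, "Y", "N"), levels = c("Y", "N"))
cm <- table(Predicted = pred, Actual = truth)

sensitivity <- cm["Y", "Y"] / sum(cm[, "Y"])   # true positives / actual positives
specificity <- cm["N", "N"] / sum(cm[, "N"])   # true negatives / actual negatives

# Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
# the probability that a random positive is scored above a random negative
r    <- rank(probY)
nPos <- sum(truth == "Y")
nNeg <- sum(truth == "N")
auc  <- (sum(r[truth == "Y"]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)

c(sensitivity = sensitivity, specificity = specificity, auc = auc)
```

Unlike sensitivity and specificity, the area under the ROC curve does not depend on a particular threshold, which is why class probabilities are needed to compute it.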
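Most methods have parameters that need to be optimised, and tuning them through resampling is one of the tasks that `train` automates. For example, for partial least squares (`pls`):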
```{r warning=FALSE, message=FALSE}
plsFit3x10cv <- train (Defective ~ ., data=jm1.train, method="pls", trControl=ctrl, metric="ROC", preProc=c("center","scale"))
plsFit3x10cv
plot(plsFit3x10cv)
```
The parameter `tuneLength` allows us to specify the number of values to consider for each tuning parameter.
```{r warning=FALSE}
plsFit3x10cv <- train (Defective ~ ., data=jm1.train, method="pls", trControl=ctrl, metric="ROC", tuneLength=5, preProc=c("center","scale"))
plsFit3x10cv
plot(plsFit3x10cv)
```
Finally, to predict new cases, `caret` uses the best classifier obtained during tuning.
```{r warning=FALSE}
plsProbs <- predict(plsFit3x10cv, newdata = jm1.test, type = "prob")
```
```{r warning=FALSE}
plsClasses <- predict(plsFit3x10cv, newdata = jm1.test, type = "raw")
confusionMatrix(data=plsClasses,jm1.test$Defective)
```
### Predicting the number of defects (numerical class)
From the Bug Prediction Repository (BPR) [http://bug.inf.usi.ch/download.php](http://bug.inf.usi.ch/download.php)
Some datasets contain CK and 11 other object-oriented metrics for the last version of the system, plus post-release defects categorized by severity and priority. Using one such dataset:
```{r warning=FALSE, message=FALSE}
jdt <- read.csv("./datasets/defectPred/BPD/single-version-ck-oo-EclipseJDTCore.csv", sep=";")
# We just use the number of bugs, so we removed others
jdt$classname <- NULL
jdt$nonTrivialBugs <- NULL
jdt$majorBugs <- NULL
jdt$minorBugs <- NULL
jdt$criticalBugs <- NULL
jdt$highPriorityBugs <- NULL
jdt$X <- NULL
# Caret
library(caret)
# Split data into training and test datasets
set.seed(1)
inTrain <- createDataPartition(y=jdt$bugs,p=.8,list=FALSE)
jdt.train <- jdt[inTrain,]
jdt.test <- jdt[-inTrain,]
```
```{r warning=FALSE}
ctrl <- trainControl(method = "repeatedcv",repeats=3)
glmModel <- train (bugs ~ ., data=jdt.train, method="glm", trControl=ctrl, preProc=c("center","scale"))
glmModel
```
Other methods, such as elastic net, can be used in the same way:
```{r warning=FALSE}
glmnetModel <- train (bugs ~ ., data=jdt.train, method="glmnet", trControl=ctrl, preProc=c("center","scale"))
glmnetModel
```
## Binary Logistic Regression (BLR)
Binary Logistic Regression (BLR) can model fault-proneness as follows:
$$fp(X) = \frac{e^{logit(X)}}{1 + e^{logit(X)}}$$
where the simplest form for logit is:
$logit(X) = c_{0} + c_{1}X$
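Equivalently, the model is linear in the log-odds of being faulty. Solving the expression for $fp(X)$ above for $logit(X)$ gives:

$$\ln\frac{fp(X)}{1 - fp(X)} = logit(X) = c_{0} + c_{1}X$$

As a purely illustrative example with made-up coefficients $c_{0} = -2$ and $c_{1} = 0.1$, a class with $X = 20$ has $logit(X) = 0$ and therefore $fp(X) = \frac{e^{0}}{1 + e^{0}} = 0.5$, whereas for $X = 40$ we get $logit(X) = 2$ and $fp(X) = \frac{e^{2}}{1 + e^{2}} \approx 0.88$.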
```{r warning=FALSE}
jdt <- read.csv("./datasets/defectPred/BPD/single-version-ck-oo-EclipseJDTCore.csv", sep=";")
# Caret
library(caret)
# Convert the response variable into a boolean variable (0/1)
jdt$bugs[jdt$bugs>=1]<-1
cbo <- jdt$cbo
bugs <- jdt$bugs
# Split data into training and test datasets
jdt2 = data.frame(cbo, bugs)
inTrain <- createDataPartition(y=jdt2$bugs,p=.8,list=FALSE)
jdtTrain <- jdt2[inTrain,]
jdtTest <- jdt2[-inTrain,]
```
The model is then fitted with `glm`, using `cbo` as the only predictor $X$:
```{r warning=FALSE}
# logit regression
# glmLogit <- train (bugs ~ ., data=jdt.train, method="glm", family=binomial(link = logit))
glmLogit <- glm (bugs ~ ., data=jdtTrain, family=binomial(link = logit))
summary(glmLogit)
```
Predict a single point:
```{r warning=FALSE}
newData = data.frame(cbo = 3)
predict(glmLogit, newData, type = "response")
```
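The value returned by `predict` with `type = "response"` is the fault-proneness formula evaluated with the fitted coefficients. The following sketch checks this on synthetic data; the variables `x` and `y` and the coefficients used to generate them are hypothetical and unrelated to the JDT dataset.

```{r}
# Hypothetical synthetic data: one metric x and a 0/1 outcome y
set.seed(42)
x <- round(runif(200, min = 0, max = 40))
y <- rbinom(200, size = 1, prob = plogis(-2 + 0.1 * x))
toy <- data.frame(x, y)

fit <- glm(y ~ x, data = toy, family = binomial(link = logit))
co  <- coef(fit)                     # c0 (intercept) and c1 (slope)

# Fault-proneness for x = 3 computed directly from the formula
lg     <- co[["(Intercept)"]] + co[["x"]] * 3
manual <- exp(lg) / (1 + exp(lg))

# The same value as returned by predict()
viaPredict <- predict(fit, data.frame(x = 3), type = "response")
all.equal(manual, unname(viaPredict))
```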
Draw the results, modified from:
http://www.shizukalab.com/toolkits/plotting-logistic-regression-in-r
```{r warning=FALSE}
results <- predict(glmLogit, jdtTest, type = "response")
range(jdtTrain$cbo)
range(results)
plot(jdt2$cbo,jdt2$bugs)
curve(predict(glmLogit, data.frame(cbo=x), type = "response"),add=TRUE)
# points(jdtTrain$cbo,fitted(glmLogit))
```
Another type of graph:
```{r warning=FALSE}
library(popbio)
logi.hist.plot(jdt2$cbo,jdt2$bugs,boxp=FALSE,type="hist",col="gray")
```
## The caret package
There are hundreds of packages to perform classification tasks in R, but many of them can be used through the 'caret' package, which helps with many of the tasks in the data mining process, as described next.
The [caret package](http://topepo.github.io/caret/) provides a unified interface for modeling and prediction with around 150 different models, together with tools for:
+ data splitting
+ pre-processing
+ feature selection
+ model tuning using resampling
+ variable importance estimation, etc.
Website: [http://caret.r-forge.r-project.org](http://caret.r-forge.r-project.org)
JSS Paper: [www.jstatsoft.org/v28/i05/paper](http://www.jstatsoft.org/v28/i05/paper)
Book: [Applied Predictive Modeling](http://AppliedPredictiveModeling.com/)