400_basicModelBuildingSupervised.Rmd

# (PART) Supervised Models {-}

# Supervised Classification

A classification problem can be defined as the induction, from a dataset $\cal D$, of a classification function $\psi$ that, given the attribute vector of an instance/example, returns a class ${c}$. A regression problem, on the other hand, returns an numeric value.

Dataset, $\cal D$, is typically composed of $n$ attributes and a class attribute $C$. 

| $Att_1$  | ... | $Att_n$  | $Class$ |
|----------|-----| ---------|---------|
| $a_{11}$ | ... | $a_{1n}$ | $c_1$   |
| $a_{21}$ | ... | $a_{2n}$ | $c_2$   |
| ...      | ... | ...      | ...     |
| $a_{m1}$ | ... | $a_{mn}$ | $c_m$   |


Columns are usually called _attributes_ or _features_. Typically, there is a _class_ attribute, which can be numeric or discrete. When the class is numeric, it is a regression problem. With discrete values, we can talk about binary classification or multiclass (multinomial classification) when we have more than three values. There are variants such _multi-label_ classification (we will cover these in the advanced models section).

Once we learn a model, new instances are classified. As shown in the next figure.

![Supervised Classification](figures/supervClass1.png)


We have multiple types of models such as _classification trees_, _rules_, _neural networks_,  and _probabilistic classifiers_ that can be used to classify instances.

Fernandez et al provide an extensive comparison of 176 classifiers using the UCI dataset [@FernandezCBA14].

We will show the use of different classification techniques in the problem of defect prediction as running example. In this example,the different datasets are composed of classical metrics (_Halstead_ or _McCabe_ metrics) based on counts of operators/operands and like or object-oriented metrics (e.g. Chidamber and Kemerer) and the class attribute indicating whether the module or class was defective.


## Classification Trees

There are several packages for inducing classification trees, for example with the [party package](https://cran.r-project.org/web/packages/party/index.html) (recursive partitioning):


```{r warning=FALSE, message=FALSE}
library(foreign) # To load arff file
library(party) # Build a decision tree
library(caret) 

jm1 <- read.arff("./datasets/defectPred/D1/JM1.arff")
str(jm1)

# Stratified partition (training and test sets)
set.seed(1234)
inTrain <- createDataPartition(y=jm1$Defective,p=.60,list=FALSE)
jm1.train <- jm1[inTrain,]
jm1.test <- jm1[-inTrain,]

jm1.formula <- jm1$Defective ~ . # formula approach: defect as dependent variable and the rest as independent variables
jm1.ctree <- ctree(jm1.formula, data = jm1.train)

# predict on test data
pred <- predict(jm1.ctree, newdata = jm1.test)
# check prediction result
table(pred, jm1.test$Defective)
plot(jm1.ctree)
```

Using the C50 package, there are two ways, specifying train and testing

```{r, eval=FALSE}
library(C50)
require(utils)
# c50t <- C5.0(jm1.train[,-ncol(jm1.train)], jm1.train[,ncol(jm1.train)])
c50t <- C5.0(Defective ~ ., jm1.train)
summary(c50t)
plot(c50t)
c50tPred <- predict(c50t, jm1.train)
# table(c50tPred, jm1.train$Defective)
```


Using the ['rpart'](https://cran.r-project.org/web/packages/rpart/index.html) package

``` {r}
# Using the 'rpart' package
library(rpart)
jm1.rpart <- rpart(Defective ~ ., data=jm1.train, parms = list(prior = c(.65,.35), split = "information"))
# par(mfrow = c(1,2), xpd = NA)
plot(jm1.rpart)
text(jm1.rpart, use.n = TRUE)
jm1.rpart

library(rpart.plot)
# asRules(jm1.rpart)
# fancyRpartPlot(jm1.rpart)
```


## Rules

C5 Rules

```{r}
library(C50)
c50r <- C5.0(jm1.train[,-ncol(jm1.train)], jm1.train[,ncol(jm1.train)], rules = TRUE)
summary(c50r)
# c50rPred <- predict(c50r, jm1.train)
# table(c50rPred, jm1.train$Defective)
```


## Distanced-based Methods

In this case, there is no model as such. Given a new instance to classify, this approach finds the closest $k$-neighbours to the given instance. 

![k-NN Classification](./figures/279px-KnnClassification.svg.png)
(Source: Wikipedia - https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)


```{r}
library(class)
m1 <- knn(train=jm1.train[,-22], test=jm1.test[,-22], cl=jm1.train[,22], k=3)

table(jm1.test[,22],m1)
```

## Neural Networks

![Neural Networks](./figures/neuralnet.png)

![Neural Networks](./figures/neuralnet2.png)


## Support Vector Machine

![SVM](./figures/Kernel_Machine.svg.png)
(Source: wikipedia https://en.wikipedia.org/wiki/Support_vector_machine)

## Probabilistic Methods

### Naive Bayes

Probabilistic graphical model assigning a probability to each possible outcome $p(C_k, x_1,\ldots,x_n)$ 

![Naive Bayes](./figures/classifier_NB.png)


Using the `klaR` package with `caret`:

```{r warning=FALSE}
library(caret)
library(klaR)
model <- NaiveBayes(Defective ~ ., data = jm1.train)
predictions <- predict(model, jm1.test[,-22])
confusionMatrix(predictions$class, jm1.test$Defective)
```


Using the `e1071` package:

```{r warning=FALSE, message=FALSE}
library (e1071)
n1 <-naiveBayes(jm1.train$Defective ~ ., data=jm1.train)

# Show first 3 results using 'class'
head(predict(n1,jm1.test, type = c("class")),3) # class by default

# Show first 3 results using 'raw'
head(predict(n1,jm1.test, type = c("raw")),3)

```


There are other variants such as TAN and KDB that do not assume the independece condition allowin us more complex structures.


![Naive Bayes](./figures/classifier_TAN.png)


![Naive Bayes](./figures/classifier_KDB.png)


A comprehensice comparison of 


## Linear Discriminant Analysis (LDA)

One classical approach to classification is Linear Discriminant Analysis (LDA), a generalization of Fisher's linear discriminant, as a method used to find a linear combination of features to separate two or more classes.

```{r warning=FALSE}
ldaModel <- train (Defective ~ ., data=jm1.train, method="lda", preProc=c("center","scale"))
ldaModel
```

We can observe that we are training our model using `Defective ~ .` as a formula were `Defective` is the class variable separed by `~` and the ´.´ means the rest of the variables. Also, we are using a filter for the training data to (preProc) to center and scale. 

Also, as stated in the documentation about the `train` method :
> http://topepo.github.io/caret/training.html

```{r warning=FALSE}
ctrl <- trainControl(method = "repeatedcv",repeats=3)
ldaModel <- train (Defective ~ ., data=jm1.train, method="lda", trControl=ctrl, preProc=c("center","scale"))

ldaModel
```

Instead of accuracy we can activate other metrics using `summaryFunction=twoClassSummary` such as `ROC`, `sensitivity` and `specificity`. To do so, we also need to speficy `classProbs=TRUE`.

```{r warning=FALSE}
ctrl <- trainControl(method = "repeatedcv",repeats=3, classProbs=TRUE, summaryFunction=twoClassSummary)
ldaModel3xcv10 <- train (Defective ~ ., data=jm1.train, method="lda", trControl=ctrl, preProc=c("center","scale"))

ldaModel3xcv10
```


Most methods have parameters that need to be optimised and that is one of the

```{r warning=FALSE, message=FALSE}
plsFit3x10cv <- train (Defective ~ ., data=jm1.train, method="pls", trControl=trainControl(classProbs=TRUE), metric="ROC", preProc=c("center","scale"))

plsFit3x10cv

plot(plsFit3x10cv)
```


The parameter `tuneLength` allow us to specify the number values per parameter to consider.

```{r warning=FALSE}
plsFit3x10cv <- train (Defective ~ ., data=jm1.train, method="pls", trControl=ctrl, metric="ROC", tuneLength=5, preProc=c("center","scale"))

plsFit3x10cv

plot(plsFit3x10cv)
``` 


Finally to predict new cases, `caret` will use the best classfier obtained for prediction.

```{r warning=FALSE}
plsProbs <- predict(plsFit3x10cv, newdata = jm1.test, type = "prob")
```

```{r warning=FALSE}
plsClasses <- predict(plsFit3x10cv, newdata = jm1.test, type = "raw")
confusionMatrix(data=plsClasses,jm1.test$Defective)
```

### Predicting the number of defects (numerical class)

From the Bug Prediction Repository (BPR) [http://bug.inf.usi.ch/download.php](http://bug.inf.usi.ch/download.php)

Some datasets contain CK and other 11 object-oriented metrics for the last version of the system plus categorized (with severity and priority) post-release defects. Using such dataset:

```{r warning=FALSE, message=FALSE}
jdt <- read.csv("./datasets/defectPred/BPD/single-version-ck-oo-EclipseJDTCore.csv", sep=";")

# We just use the number of bugs, so we removed others
jdt$classname <- NULL
jdt$nonTrivialBugs <- NULL
jdt$majorBugs <- NULL
jdt$minorBugs <- NULL
jdt$criticalBugs <- NULL
jdt$highPriorityBugs <- NULL
jdt$X <- NULL

# Caret
library(caret)

# Split data into training and test datasets
set.seed(1)
inTrain <- createDataPartition(y=jdt$bugs,p=.8,list=FALSE)
jdt.train <- jdt[inTrain,]
jdt.test <- jdt[-inTrain,]
```

```{r warning=FALSE}
ctrl <- trainControl(method = "repeatedcv",repeats=3)
glmModel <- train (bugs ~ ., data=jdt.train, method="glm", trControl=ctrl, preProc=c("center","scale"))
glmModel
```


Others such as Elasticnet:

```{r warning=FALSE}
glmnetModel <- train (bugs ~ ., data=jdt.train, method="glmnet", trControl=ctrl, preProc=c("center","scale"))
glmnetModel
```


## Binary Logistic Regression (BLR)

Binary Logistic Regression (BLR) can models fault-proneness as follows

$$fp(X) = \frac{e^{logit()}}{1 + e^{logit(X)}}$$

where the simplest form for logit is:

$logit(X) = c_{0} + c_{1}X$

```{r warning=FALSE}
jdt <- read.csv("./datasets/defectPred/BPD/single-version-ck-oo-EclipseJDTCore.csv", sep=";")

# Caret
library(caret)

# Convert the response variable into a boolean variable (0/1)
jdt$bugs[jdt$bugs>=1]<-1

cbo <- jdt$cbo
bugs <- jdt$bugs

# Split data into training and test datasets
jdt2 = data.frame(cbo, bugs)
inTrain <- createDataPartition(y=jdt2$bugs,p=.8,list=FALSE)
jdtTrain <- jdt2[inTrain,]
jdtTest <- jdt2[-inTrain,]
```

BLR models fault-proneness are as follows $fp(X) = \frac{e^{logit()}}{1 + e^{logit(X)}}$

where the simplest form for logit is $logit(X) = c_{0} + c_{1}X$

```{r warning=FALSE}
# logit regression
# glmLogit <- train (bugs ~ ., data=jdt.train, method="glm", family=binomial(link = logit))       

glmLogit <- glm (bugs ~ ., data=jdtTrain, family=binomial(link = logit))
summary(glmLogit)

```


Predict a single point:
```{r warning=FALSE}
newData = data.frame(cbo = 3)
predict(glmLogit, newData, type = "response")
```

Draw the results, modified from:
http://www.shizukalab.com/toolkits/plotting-logistic-regression-in-r


```{r warning=FALSE}
results <- predict(glmLogit, jdtTest, type = "response")

range(jdtTrain$cbo)
range(results)

plot(jdt2$cbo,jdt2$bugs)
curve(predict(glmLogit, data.frame(cbo=x), type = "response"),add=TRUE)
# points(jdtTrain$cbo,fitted(glmLogit))
```


Another type of graph:

```{r warning=FALSE}
library(popbio)
logi.hist.plot(jdt2$cbo,jdt2$bugs,boxp=FALSE,type="hist",col="gray")
```


## The caret package

There are hundreds of packages to perform classification task in R, but many of those can be used throught the 'caret' package which helps with many of the data mining process task as described next. 

The caret package[http://topepo.github.io/caret/](http://topepo.github.io/caret/) provides a unified interface for modeling and prediction with around 150 different models with tools for:

 + data splitting
    
 + pre-processing
    
 + feature selection
    
 + model tuning using resampling
    
 + variable importance estimation, etc.


Website: [http://caret.r-forge.r-project.org](http://caret.r-forge.r-project.org)

JSS Paper: [www.jstatsoft.org/v28/i05/paper](www.jstatsoft.org/v28/i05/paper)

Book: [Applied Predictive Modeling](http://AppliedPredictiveModeling.com/)