The goal of this project is to develop a classification model able to identify cervical dysplasia in two main categories: normal and abnormal. The dataset I'm working on has a target divided into 7 categories according to how serious the dysplasia is. To divide these categories I will implement two unsupervised methods aiming to find:
- The number of clusters.
- How I would split up these clusters.
In addition, I will apply some techniques to detect highly correlated features, and a dimensionality reduction method, in order to identify patterns and merge the target column into fewer categories.
My dataset has 26 features and 500 rows. Although I have to deal with many features, the number of rows is not large, and this is something I will try to increase in order to develop another model and compare the results. First, I used some statistical techniques to examine the distribution of the features, and scatter plots to find which variables are highly correlated.
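The correlation check can be sketched as follows, assuming the data sits in a pandas DataFrame. The column names and the synthetic data here are hypothetical placeholders, and the 0.8 threshold is an illustrative choice, not the project's actual cutoff:

```python
# Sketch of the correlation check on a placeholder DataFrame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["Kerne_Short", "Kyto_Short", "Cyto_Long", "Class"])

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Flag pairs whose absolute correlation exceeds a threshold.
threshold = 0.8
high = [(a, b, corr.loc[a, b])
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) > threshold]
print(high)
```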
![cor](https://user-images.githubusercontent.com/66875726/97926634-21763300-1d6c-11eb-91ff-4ba6dbc76721.png)
Yes, it is a little chaotic with so many features, but after applying the Univariate Feature Selection method I was left with 15 features to manage.
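Univariate Feature Selection can be sketched with scikit-learn's `SelectKBest`. Keeping `k=15` features matches the text; the choice of the ANOVA F-score (`f_classif`) and the synthetic `X`/`y` are assumptions:

```python
# Sketch of Univariate Feature Selection, keeping the 15 features
# with the highest ANOVA F-scores (f_classif is an assumed scorer).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))    # 26 features, 500 rows as in the dataset
y = rng.integers(1, 8, size=500)  # 7 target categories

selector = SelectKBest(score_func=f_classif, k=15)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (500, 15)
```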
After these steps I implemented two unsupervised techniques to determine the number of classes into which I could split the data. I started with a hierarchical algorithm; below is its dendrogram.
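The hierarchical step can be sketched with SciPy. Ward linkage is an assumption here (the original linkage method is not stated), and the data is a random stand-in for the selected features:

```python
# Sketch of hierarchical clustering; Ward linkage is assumed.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))

Z = linkage(X, method="ward")

# Cut the tree into 2 or 3 clusters, as the dendrogram suggests.
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")

# dendrogram(Z) would draw the tree with matplotlib.
```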
As we can see from the dendrogram, I can split the data into 2 or 3 classes. That is definitely an improvement over the seven classes into which the target group is currently divided. The second step, which was really useful for identifying the distribution of the data, was a set of scatter plots showing the correlation of the features with the target group.
*(Scatter plots of Kerne_Short, Kyto_Short and Cyto_Long against the target class.)*
From the above scatter plots we can draw the conclusion that categories 1-4 could be classified in one group (normal cells) and the remaining categories 5-7 in another group (abnormal cells).
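Collapsing the seven categories into the two groups can be sketched in one line of pandas; the column name and labels below are illustrative, not the dataset's actual encodings:

```python
# Sketch of mapping the 7 target categories to normal (1-4)
# vs abnormal (5-7); the series and labels are placeholders.
import pandas as pd

y = pd.Series([1, 2, 3, 4, 5, 6, 7], name="Class")
y_binary = y.map(lambda c: "normal" if c <= 4 else "abnormal")
print(y_binary.tolist())
```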
Another technique that can be really helpful for distinguishing the target categories and verifying the above conclusion is Principal Component Analysis. Looking at the variance ratio of the first two components, 80% of the dataset's variance lies along the first principal component and 14% along the second. With so much information captured by the first two components, let's plot them.
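The PCA step can be sketched with scikit-learn. Standardizing before PCA is an assumption (PCA is scale-sensitive), and `X` is again a random stand-in, so the printed ratios will not match the project's 80%/14%:

```python
# Sketch of projecting the selected features onto two
# principal components; X is a placeholder.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))

# Standardize first, since PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# The project reported ~80% of variance on PC1 and ~14% on PC2;
# random data will of course give different ratios.
print(pca.explained_variance_ratio_)
```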
This plot indeed verifies our target split: normal cells in categories 1-4 and abnormal cells in 5-7. It would also be really interesting to make a 3D plot to examine this classification.
Lastly, before the prediction models, I trained a supervised algorithm to examine prediction across the 7 different cell categories. I trained a KNN model, and below are the results.
The model was able to identify the third and fourth categories almost perfectly, but it is not as accurate with the fifth and sixth.
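The KNN baseline and its per-class errors can be sketched with a confusion matrix. Here `k=5`, the train/test split and the synthetic data are assumptions, not the project's actual settings:

```python
# Sketch of the 7-class KNN baseline; data and k are placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))
y = rng.integers(1, 8, size=500)  # 7 categories

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Row i / column j counts samples of class i predicted as class j,
# showing which categories the model confuses.
cm = confusion_matrix(y_test, knn.predict(X_test))
print(cm)
```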
After those steps, I started the training phase. I trained and optimized 4 supervised models under 4 different assumptions: 2 different feature selection methods and 2 different feature scaling methods (standard scaler and normalizer), aiming to find out how each choice affects the results.
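The comparison loop over the four models and the two scaling methods can be sketched as below. The hyperparameters are illustrative defaults rather than the optimized values, and the binary target is a random placeholder:

```python
# Sketch of comparing the 4 models under 2 scaling assumptions
# via 5-fold cross-validation; settings are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))
y = rng.integers(0, 2, size=500)  # binary target: normal vs abnormal

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
scalers = {"standard": StandardScaler(), "normalizer": Normalizer()}

# Put the scaler inside the pipeline so it is fit only on each
# training fold, avoiding leakage into the validation fold.
scores = {}
for s_name, scaler in scalers.items():
    for m_name, model in models.items():
        pipe = make_pipeline(scaler, model)
        scores[(s_name, m_name)] = cross_val_score(pipe, X, y, cv=5).mean()
print(scores)
```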
Having compared all the different assumptions, optimized the parameters and considered the cross-validation scores as well, the results are below:
*(Results for Logistic Regression, SVM, KNN and Decision Tree, with a closer comparison of KNN and Decision Tree.)*
Considering the models' performance on the train, test and cross-validation sets, the model that performed best is the optimized version of the Decision Tree classifier.