This was an assignment for DSC540 (Machine Learning for Data Science) at GCU that focused on building a classification model using a Support Vector Machine (SVM). The specific task was to determine whether a mushroom is edible or poisonous from its physical characteristics (cap color, cap diameter, stem height, veil type, etc.).
Check out the full report here.
To perform a classification using the support vector machine algorithm, complete the following:
- Access the "UCI Machine Learning Repository," located in the topic Resources. Note: There are about 120 datasets that are suitable for use in a classification task. For this part of the exercise, you must choose one of these datasets, provided it includes at least 10 attributes and 10,000 instances.
- Ensure that the datasets are suitable for classification using this method.
- You may also search for data in other repositories, such as Data.gov, Kaggle, or scikit-learn.
- Examine the repository through which you accessed the dataset and discuss data management measures set in place, such as protecting the privacy of those accessing the site and protecting the intellectual property rights of the data owners/contributors.
For your selected dataset, build a classification model as follows:
- Explain the dataset and the type of information you wish to gain by applying a classification method.
- Explain what makes the SVM algorithm special and different from most other machine learning algorithms. Explain how you will be using it in your analysis (list the steps, the intuition behind the mathematical representation, and address its assumptions).
- Explain the concepts: kernel, hyperplane, and decision boundary, and their role in SVM.
- Explain the concepts: maximum margin, support vectors, and maximum margin hyperplane, and their role in SVM.
- Import the necessary libraries, then read the dataset into a data frame and perform initial statistical exploration.
- Clean the data and address unusual phenomena (e.g., normalization, feature scaling, outliers); use illustrative diagrams and plots and explain them.
- Formulate two questions that can be answered by applying a classification method using the SVM.
- Split the data into 80% training and 20% testing sets using the train_test_split function.
- Use a linear kernel to train the SVM classifier on the training set (e.g., fit the support vector classifier to the training data). Explain the intuition behind each of the key mathematical steps.
- Explain the choice of the optimal hyperplane.
- Make classification predictions.
- Interpret the results in the context of the questions you asked.
- Validate your model using a confusion matrix, accuracy score, an ROC curve with its AUC, and k-fold cross validation. Then, explain the results.
- Include all mathematical formulas used and graphs representing the final outcomes.
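The modeling steps above (80/20 split, linear-kernel SVM, predictions, and validation) can be sketched roughly as follows. This is a minimal illustration, not the code from the report: it assumes scikit-learn and NumPy are installed, and it uses synthetic data from make_classification as a stand-in for the mushroom dataset so that it runs without any download. The feature counts, the value of C, the random seed, and the choice of 5 folds are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic stand-in data (1,000 rows, 10 numeric features,
# 2 classes), not the actual mushroom dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    random_state=42,
)

# 80% training / 20% testing, stratified to keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features, then fit a linear-kernel support vector classifier.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_train, y_train)

# Class predictions, plus signed distances to the separating hyperplane
# (the distances are used as scores for the ROC AUC).
y_pred = model.predict(X_test)
scores = model.decision_function(X_test)

cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, scores)

# 5-fold cross validation; the scaler is refit inside each fold.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# For a linear kernel the hyperplane is w.x + b = 0, and the margin
# width in the scaled feature space is 2 / ||w||.
w = model[-1].coef_
margin_width = 2.0 / np.linalg.norm(w)

print("Confusion matrix:\n", cm)
print("Accuracy:", acc)
print("ROC AUC:", auc)
print("CV accuracy per fold:", cv_scores)
print("Margin width (scaled space):", margin_width)
```

Putting the scaler and the classifier in one pipeline keeps the test data and the held-out cross-validation folds out of the scaling statistics. With the real dataset, categorical attributes such as cap color would first need an encoding step (for example one-hot encoding) before scaling and fitting.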
Prepare a comprehensive technical report as a markdown document or Jupyter notebook, including all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) problem statement, b) algorithm of the solution, c) analysis of the findings, and d) references.
While APA style is not required for the body of this assignment, solid academic writing is expected, and documentation of sources should be presented using APA formatting guidelines, which can be found in the APA Style Guide, located in the Student Success Center.
This assignment uses a rubric. Review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.
You are not required to submit this assignment to LopesWrite.