This project predicts student math scores from demographic and educational features using machine learning models. The application is built with Flask and deployed in Docker containers. DVC (Data Version Control) manages the datasets and models, and several regression models are trained and compared before one is used for prediction. Best model so far: Lasso regression, with the highest test R2 score of 0.8812.
- MLflow Tracking URI: https://dagshub.com/04bhavyaa/mlproject.mlflow/
- Project Overview
- Technologies Used
- Project Gallery
- Setup Instructions
- Running the Application
- Docker Setup
- Model Training
- Ongoing Enhancements
- Contributors
The objective of this project is to build a machine learning model to predict students' math scores based on features such as gender, race, parental education level, lunch type, and test preparation course status. The project includes:
- A Flask-based web application for real-time predictions.
- A machine learning pipeline whose components train several models, including CatBoost, XGBoost, and Random Forest, with GridSearchCV for hyperparameter optimization.
- Data and model versioning with DVC and Dagshub.
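The model comparison described above can be illustrated with a minimal sketch. It is not the project's actual training code: the data is synthetic (generated with `make_regression`), and the candidate models and parameter grids are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real project trains on the student dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical candidates and grids, picked for illustration only.
candidates = {
    "Lasso": (Lasso(), {"alpha": [0.01, 0.1, 1.0]}),
    "Random Forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [20, 50]},
    ),
}


def compare_models(candidates, X_train, y_train, X_test, y_test):
    """Grid-search each candidate and return a {name: test R2} mapping."""
    scores = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=3, scoring="r2")
        search.fit(X_train, y_train)
        # GridSearchCV refits the best estimator, so predict() uses it.
        scores[name] = r2_score(y_test, search.predict(X_test))
    return scores


scores = compare_models(candidates, X_train, y_train, X_test, y_test)
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 4))
```

Selecting the model by held-out test R2 mirrors how the best model is reported in this README; the scores printed here come from the synthetic data and say nothing about the real dataset.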
- Flask: Provides a user-friendly web interface for entering data and displaying predictions.
- DVC: For managing data and model versioning efficiently.
- Dagshub: To version large datasets and track experiments.
- GitHub Actions: Automated workflows for CI/CD using YAML configurations.
- Python 3.8+: The backbone of the entire project.
- Git: For version control of the code and collaboration.
- VSCode: My go-to code editor for writing, testing, and debugging the project.
- Docker: For containerizing the app and ensuring consistency across environments.
- ML Models: Linear Regression, Ridge, Lasso, ElasticNet, Decision Tree, Random Forest, Gradient Boosting, XGBoost, CatBoost, and AdaBoost.
- GridSearchCV: To fine-tune the CatBoost model for optimal performance.
- Libraries: NumPy, Pandas, Scikit-learn, XGBoost, CatBoost, Matplotlib, Seaborn, and more.
- Python 3.8+
- Docker (for containerization)
- DVC installed and configured for your cloud storage (Dagshub, AWS, etc.)
- GitHub repository with necessary secrets for Dagshub and Docker Hub
- Clone the repository:
git clone https://github.com/<your-username>/mlproject.git
cd mlproject
- Set up a Python environment:
conda create --name mlproject python=3.8
conda activate mlproject
- Install dependencies:
pip install -r requirements.txt
- Set up DVC and pull the data:
dvc remote add origin s3://dvc
dvc pull
- Run the Flask application locally:
python app.py
The app will be available at http://localhost:5000.
To run the Flask application locally, follow the steps below:
- Activate your environment:
conda activate mlproject
- Run the Flask app:
python app.py
Open the browser and go to http://localhost:5000 to interact with the web application.
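For readers who want to see how a prediction request might flow through a Flask app, here is a minimal sketch. It is not the project's app.py: the `/predict` route, the JSON field names, the `create_app` factory, and the `stub_predict` placeholder (which always returns 0.0 instead of calling a trained model) are all assumptions made for illustration. The sample values in the request are made up as well.

```python
from flask import Flask, jsonify, request

# Feature names follow the project description; the exact JSON keys are assumed.
REQUIRED_FIELDS = [
    "gender",
    "race",
    "parental_level_of_education",
    "lunch",
    "test_preparation_course",
]


def stub_predict(features):
    """Hypothetical placeholder for the trained model; always returns 0.0."""
    return 0.0


def create_app(predict_fn=stub_predict):
    """Build a Flask app exposing a single (assumed) /predict endpoint."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in REQUIRED_FIELDS if name not in payload]
        if missing:
            return jsonify({"error": "missing fields", "fields": missing}), 400
        features = {name: payload[name] for name in REQUIRED_FIELDS}
        return jsonify({"math_score": predict_fn(features)})

    return app


# Exercise the endpoint in-process with Flask's test client (no server needed).
client = create_app().test_client()
response = client.post(
    "/predict",
    json={
        "gender": "female",
        "race": "group B",
        "parental_level_of_education": "some college",
        "lunch": "standard",
        "test_preparation_course": "none",
    },
)
print(response.status_code, response.get_json())
```

Passing the predictor in as `predict_fn` keeps the route independent of how the model is loaded, so the stub could later be swapped for a function that wraps the saved model.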
The application is containerized using Docker. To build and run the app in a Docker container, follow these steps:
- Build the Docker image:
docker build -t flask-app .
- Run the Docker container:
docker run -p 5000:5000 flask-app
This will start the Flask app inside a Docker container, and the application will be available at http://localhost:5000.
Several candidate models are trained, with hyperparameter tuning performed via GridSearchCV. To train them, follow these steps:
- Train the model by running components.py:
python components.py
- Evaluate the model using regression metrics (e.g., R2 and MSE).
- To edit the training component, go to src/mlproject/components/model_trainer.py.
After training, model.pkl is saved in the artifacts/ folder and is used for predictions in the Flask app.
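The train, evaluate, and save cycle can be sketched as follows. This is an illustration under stated assumptions, not the contents of model_trainer.py: the data is synthetic, the `alpha` value is arbitrary, and a temporary directory stands in for the real artifacts/ folder so the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real project uses the student dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a Lasso model (alpha chosen arbitrarily) and evaluate on held-out data.
model = Lasso(alpha=0.1).fit(X_train, y_train)
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
print("R2:", r2)
print("MSE:", mse)

# A temporary directory stands in for the project's artifacts/ folder.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "artifacts" / "model.pkl"
    model_path.parent.mkdir(parents=True)

    # Save the fitted model, as the training step does with model.pkl.
    with model_path.open("wb") as f:
        pickle.dump(model, f)

    # Load it back, as a prediction service would at startup.
    with model_path.open("rb") as f:
        loaded = pickle.load(f)
    reloaded_predictions = loaded.predict(X_test)
```

A pickled model should be loaded with the same library versions that saved it, and only from a trusted source, since unpickling can execute arbitrary code.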
- Improve the accuracy of the model by exploring other algorithms and feature engineering techniques.
- Add additional prediction models and compare their performance.
- Enhance the user interface of the web app to include more interactive visualizations.
- Expand the dataset and improve generalization.
Bhavya Jha (Developer)