This project classifies emails as spam or not spam using a fine-tuned DistilBERT model. The project includes data preprocessing, model training, evaluation, and real-time email classification.
- Installation
- Setup Kaggle Dataset
- Training the Model
- Evaluating the Model
- Real-time Email Classification
- Translation
- Results
To get started with this project, you need to set up your environment with the required libraries. Follow the steps below:
- Clone the repository and change into it:
    git clone https://github.com/your-repository/Spam_Detection.git
    cd Spam_Detection
- Install the required packages:
    pip install -r requirements.txt
- Make sure you have a Kaggle account and have generated an API token. Place the kaggle.json file in the .kaggle/ folder of your home directory.
- Run the setup_kaggle.py script to download and prepare the dataset:
    python setup_kaggle.py
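Before running the download, it can help to confirm that the API token is where the Kaggle tooling expects it. The snippet below is a minimal sketch of such a check and is not taken from setup_kaggle.py; the function name `find_kaggle_token` and its `home` parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_kaggle_token(home: Optional[Path] = None) -> Path:
    """Return the path to kaggle.json, or raise if it is missing.

    `home` defaults to the current user's home directory. It is a
    parameter only so the check can be pointed at another folder.
    """
    home = Path.home() if home is None else Path(home)
    token = home / ".kaggle" / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(
            f"Expected Kaggle API token at {token}; "
            "generate one from your Kaggle account settings."
        )
    return token


# Report the result instead of crashing when the token is absent.
try:
    print(f"Found Kaggle token at {find_kaggle_token()}")
except FileNotFoundError as err:
    print(err)
```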
- To train the model on the dataset, run main_script.py:
    python main_script.py
This script will:
- Load and preprocess the dataset.
- Split the data for training.
- Train the DistilBERT model.
- Save the trained model to ./trained_model.
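The splitting step can be sketched in plain Python. This is an illustration only: the README does not say how main_script.py splits the data, so the stratified approach, the `stratified_split` name, the fixed seed, and the 20% training share (inferred from the evaluation step using "the remaining 80%") are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, train_fraction=0.2, seed=42):
    """Split (text, label) pairs so each class keeps roughly the same share.

    train_fraction=0.2 is an assumption, not a value read from the project.
    """
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, held_out = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        cut = max(1, round(len(items) * train_fraction))
        train += [(text, label) for text in items[:cut]]
        held_out += [(text, label) for text in items[cut:]]
    return train, held_out


# Toy data: 10 spam and 40 non-spam messages.
texts = [f"spam {i}" for i in range(10)] + [f"ham {i}" for i in range(40)]
labels = [1] * 10 + [0] * 40
train, held_out = stratified_split(texts, labels)
print(len(train), len(held_out))
```

Splitting per class keeps the spam ratio similar in both parts, which matters when spam is the minority class.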
- To evaluate the trained model on the remaining 80% of the dataset, run the evaluation.py script:
    python evaluation.py
This script will:
- Load the remaining dataset.
- Evaluate the model.
- Print the accuracy, precision, recall, and F1 score.
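For reference, the four reported metrics can be computed from true and predicted labels as follows. This is a minimal sketch using the standard definitions, not the code in evaluation.py; the function name `binary_metrics` and the convention that label 1 means spam are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task.

    `positive` is the label treated as spam (assumed to be 1 here).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
scores = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Precision answers "how many flagged emails were really spam", while recall answers "how much of the spam was caught"; F1 is their harmonic mean.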
- Create a file named requirements.py and update it with your Gmail credentials:
    EMAIL = 'your-email@gmail.com'
    PASSWORD = 'your-app-password'
- Run the predict.py script to classify real-time emails:
    python predict.py
This script will:
- Connect to your Gmail account.
- Fetch unread emails.
- Translate the email content to English.
- Classify the email as spam or not spam.
- Label the email as "Potential Spam" if classified as spam.
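The first two steps above (connecting and fetching unread mail) can be sketched with Python's standard `imaplib` and `email` modules. This is an illustration under stated assumptions, not the contents of predict.py: the function names `extract_text` and `fetch_unread` are hypothetical, the host `imap.gmail.com` is assumed, and translation, classification, and labeling are left out.

```python
import email
import imaplib
from email import policy


def extract_text(raw_bytes: bytes) -> str:
    """Return the subject and the text body of a raw RFC 822 message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    content = body.get_content() if body is not None else ""
    return f"{msg['subject'] or ''}\n{content}".strip()


def fetch_unread(user: str, app_password: str, host: str = "imap.gmail.com"):
    """Yield (message number, extracted text) for each unread inbox message."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, app_password)
        # Read-only, so fetching does not mark messages as read.
        # Applying a label afterwards would need a writable mailbox.
        conn.select("INBOX", readonly=True)
        _, data = conn.search(None, "UNSEEN")
        for num in data[0].split():
            _, parts = conn.fetch(num, "(RFC822)")
            yield num, extract_text(parts[0][1])
    finally:
        conn.logout()


# Offline demonstration of the parsing step on a hand-written message.
sample = b"Subject: Hello\nContent-Type: text/plain; charset=utf-8\n\nWin a prize now!\n"
print(extract_text(sample))
```

Each text yielded by `fetch_unread` would then be passed to the translation and classification steps.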
The predict.py script translates email content to English using the Google Translate API before classification.
Results of the evaluation and real-time classification will be printed to the console, including accuracy, precision, recall, and F1 score for evaluation, and classification results for real-time emails.