Spam SMS/E-Mail Detection using Natural Language Processing
Used NLP (Natural Language Processing) techniques in ML (Machine Learning) to detect whether an SMS/e-mail is spam or not spam. Applied NLP techniques such as tokenization, lemmatization, stop word removal, and punctuation removal, using tools such as NLTK and regex. Trained models such as Multinomial Naive Bayes and Logistic Regression to achieve an overall F1 score of 0.99. Also performed feature engineering, handcrafting features such as number of digits, e-mail length, and number of punctuation marks, which further helped the predictions. Also generated word clouds for the different prediction classes.
- Field: NLP (Natural Language Processing)
- Tools: NLTK, regex, scikit-learn, Python
- Concepts: tokenization, lemmatization, stop words, Logistic Regression, Naive Bayes
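
The cleaning steps and handcrafted features described above can be sketched with the standard library alone. This is an illustrative sketch, not the project's actual code: the function names, the tiny `STOP_WORDS` set, and the sample message are assumptions, and lemmatization is left out because it needs NLTK and its downloaded corpora.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); the project uses NLTK instead.
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "and", "of", "for", "your"}


def handcrafted_features(text: str) -> dict:
    """Count-based features computed on the raw, uncleaned message."""
    return {
        "length": len(text),
        "num_digits": sum(ch.isdigit() for ch in text),
        "num_punctuation": sum(ch in string.punctuation for ch in text),
    }


def clean_tokens(text: str) -> list:
    """Lowercase, replace punctuation with spaces, split, and drop stop words."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


# Made-up sample message, not taken from the dataset.
message = "WIN a FREE prize!!! Call 0800123 now."
features = handcrafted_features(message)
tokens = clean_tokens(message)
print(features)
print(tokens)
```

The features are computed before cleaning on purpose: digits and punctuation are exactly what the cleaning step would otherwise throw away.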
Dataset: https://www.kaggle.com/datasets/bagavathypriya/spam-ham-dataset (originally taken from the UCI Machine Learning Repository). Note that although the dataset says SMS, it closely resembles the spam received by e-mail as well, and hence can also be used to train a model to detect spam e-mails :)
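
As a rough sketch of how a Multinomial Naive Bayes spam classifier could be trained with scikit-learn: the toy messages and labels below are invented for illustration and are not taken from the dataset, and the choice of `CountVectorizer` for bag-of-words features is an assumption rather than the project's confirmed setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (assumption), standing in for the real spam/ham dataset.
texts = [
    "win a free prize now",
    "free cash offer call now",
    "claim your free prize today",
    "are we meeting for lunch today",
    "see you at the office tomorrow",
    "can you send me the report",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag-of-words counts feed directly into Multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict(["free prize call now", "see you at lunch tomorrow"])
print(list(predictions))
```

Replacing `MultinomialNB()` with `LogisticRegression()` from `sklearn.linear_model` gives the second model type mentioned above. On the real dataset, evaluation should use a held-out split and a metric such as `sklearn.metrics.f1_score`.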