Chat messages spam classifier using machine learning

In the last blog, Ebbot explained the training process that helps him respond correctly to your queries. As mentioned at the end of that blog, every time Ebbot fails to understand you, he learns from the messages in the conversation to improve his performance. But did you know that not every sentence is useful for the learning process? There is information we do not want Ebbot to memorize, such as phone numbers, emails and spam messages (e.g. asdfda, wqrherewrere safdfa). That is why we decided to build a Machine Learning (ML) model that classifies messages as spam or not spam and filters out the unnecessary data. Keep reading to find out how we trained this spam classifier and how accurate it is!

Collecting and labeling the dataset

Using data from conversation scripts between users and Ebbot, we collected 2924 phrases in total and labeled them as ”Spam” or ”Not Spam”. With the help of our sentence similarity model, we were able to cluster meaningful and spam sentences into two groups. This let us avoid manual data labeling as much as possible and saved a lot of time.
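The clustering idea can be sketched in a few lines. Our actual sentence similarity model is not shown here, so as a lightweight stand-in this sketch uses TF-IDF over character n-grams (keyboard mashing looks very different from real words at that level); the example messages are made up:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy examples standing in for real chat messages (hypothetical data).
messages = [
    "How do I reset my password?",
    "What are your opening hours?",
    "Can I change my delivery address?",
    "asdfda wqrherewrere safdfa",
    "qqqq zzzz xkcdv",
    "jkjkjk asdf asdf",
]

# Character n-gram TF-IDF vectors stand in for the sentence embeddings
# produced by the real sentence similarity model.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(messages)

# Cluster into two groups; a human then only names each cluster
# ("Spam" / "Not Spam") instead of labeling every sentence by hand.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(cluster_ids)
```

With only two clusters to inspect, labeling becomes a quick review step rather than a sentence-by-sentence chore.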

Training and evaluating the model

Inspired by the article ”Create a SMS spam classifier in Python”, we chose the Multinomial Naive Bayes classifier for this project. 75% of the dataset was used for training and 25% was held out for testing. While analyzing the dataset, we also noticed that the average length of spam messages (12.4 characters) is much lower than that of non-spam messages (21.5 characters).
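In scikit-learn, the whole setup fits in a few lines. This is a minimal sketch, not our production code: the messages below are invented stand-ins for our 2924 labeled phrases, and the bag-of-words `CountVectorizer` is the usual pairing with Multinomial Naive Bayes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labeled messages (hypothetical; the real dataset had 2924 phrases).
texts = [
    "how do i reset my password", "what are your opening hours",
    "can i track my order", "i want to change my address",
    "asdf qwer zxcv", "jjjj kkkk llll",
    "wqrherewrere safdfa", "asdfda asdfda",
] * 10  # repeat so the split has enough examples of both classes
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 10  # 0 = not spam, 1 = spam

# 75% for training, 25% held out for testing, as in the post.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Word counts feed the Multinomial Naive Bayes classifier.
vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(X_train), y_train)

accuracy = model.score(vectorizer.transform(X_test), y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Multinomial Naive Bayes works directly on these count vectors, trains in milliseconds, and is a common baseline for short-text spam filtering.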

After fitting the model to the training set, we used the test set to see how it performs. The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) score was very high, approximately 0.97!

AUC-ROC score for the model after fitting our dataset
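Computing the score is straightforward: AUC-ROC measures ranking quality, i.e. how often a random spam message is assigned a higher spam probability than a random non-spam one. The sketch below evaluates a toy model on made-up data (the dataset here is a hypothetical stand-in, so the score will differ from our 0.97):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in dataset (hypothetical), 0 = not spam, 1 = spam.
texts = [
    "how do i reset my password", "what are your opening hours",
    "can i track my order", "i want to change my address",
    "asdf qwer zxcv", "jjjj kkkk llll",
    "wqrherewrere safdfa", "asdfda asdfda",
] * 10
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)

# AUC-ROC needs the predicted spam probability, not the hard 0/1 label.
spam_probs = model.predict_proba(vectorizer.transform(X_test))[:, 1]
auc = roc_auc_score(y_test, spam_probs)
print(f"AUC-ROC: {auc:.2f}")
```

A score of 0.5 would mean the model ranks messages no better than chance, while 1.0 would mean every spam message outranks every non-spam one.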

Building and deploying a web app to host the spam classifier

Before giving Ebbot this filter to help him collect only meaningful data, we still wanted to test and improve the model. Thanks to Streamlit, we were able to build a web app in under an hour. We also included a feedback system that collects false predictions to improve the model. The web app was then deployed on Heroku. If you want to test it live, here is 🥁🥁🥁🥁🥁🥁🥁🥁🥁🥁🥁 the link!

The web app built using Streamlit and deployed with Heroku

We are hoping to implement this filter into our Bot-builder product in the near future, to help our clients reduce the training time of their digital assistants. As soon as this is launched, we will definitely share the good news in another blog. Until then, please feel free to look at other posts on our website or follow our LinkedIn to be updated with exciting news almost every week!


Wanna know more about our product?

If you are curious and want to know more about how Ebbot – a helpful digital employee – can assist you, let’s meet and talk about it! All you need to do is click the button below 👇

