spamBERT

Spam Classification of Email and SMS Texts Using a Fine-Tuned BERT Model

spamBERT (code available here) is a simple, user-friendly web page that classifies email and SMS texts as spam or not spam, using a fine-tuned BERT base model (cased). To leverage BERT’s contextual understanding of language, the pre-trained model was fine-tuned on two combined datasets chosen for this purpose, and it classifies whatever text the user enters.

The datasets

The two datasets, SMS Spam Collection and Spam-Ham Dataset, are collections of spam and non-spam SMS messages and emails. Every sample has two features: the target label (“ham” or “spam”) and the text of the message.
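
As a rough illustration of how two such collections could be combined into one training set, here is a minimal sketch. The function name `combine_datasets`, the `LABEL_TO_ID` mapping, and the sample rows are hypothetical and not taken from the spamBERT code. It assumes each dataset has already been loaded as (label, text) pairs.

```python
from typing import Iterable, List, Tuple

# Assumed numeric encoding of the target feature: ham -> 0, spam -> 1.
LABEL_TO_ID = {"ham": 0, "spam": 1}


def combine_datasets(*datasets: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Merge several (label, text) collections into one list of (text, label_id)."""
    combined: List[Tuple[str, int]] = []
    for rows in datasets:
        for label, text in rows:
            key = label.strip().lower()
            if key not in LABEL_TO_ID:
                raise ValueError(f"unexpected label: {label!r}")
            combined.append((text.strip(), LABEL_TO_ID[key]))
    return combined


# Made-up rows standing in for the SMS and email collections.
sms_rows = [("ham", "See you at 5?"), ("spam", "WIN a free prize now!!!")]
email_rows = [("Spam", "Cheap meds, click here"), ("ham", "Meeting notes attached")]

samples = combine_datasets(sms_rows, email_rows)
```

Each resulting pair holds the message text and an integer label, which is the shape most fine-tuning pipelines expect before tokenization.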

spamBERT architecture

The architecture consists of the pre-trained BERT base cased model (12 encoder layers) followed by a fully connected layer, fine-tuned to perform well on the spam classification task. The model achieved an accuracy of 99%.
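
An encoder-plus-linear-layer design like the one described can be sketched in PyTorch with the Hugging Face `transformers` library, as below. This is not the project's actual code: the class name `SpamClassifier`, the two helper functions, and the choice to feed BERT's pooled output into the linear layer are assumptions made for illustration.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel


class SpamClassifier(nn.Module):
    """BERT encoder followed by a single fully connected layer (hypothetical name)."""

    def __init__(self, encoder: BertModel, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder
        # Maps the encoder's hidden size (768 for BERT base) to ham/spam logits.
        self.classifier = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Assumption: use the pooled [CLS] representation as the sentence vector.
        pooled = outputs.pooler_output            # shape: [batch, hidden_size]
        return self.classifier(pooled)            # shape: [batch, num_labels]


def build_pretrained_classifier() -> SpamClassifier:
    # Downloads the pre-trained bert-base-cased weights (12 encoder layers).
    return SpamClassifier(BertModel.from_pretrained("bert-base-cased"))


def build_tiny_classifier() -> SpamClassifier:
    # Randomly initialised, deliberately small encoder for offline shape checks only.
    config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                        num_attention_heads=2, intermediate_size=64)
    return SpamClassifier(BertModel(config))
```

During fine-tuning, the logits would typically be passed to a cross-entropy loss against the ham/spam labels. At inference time, the larger of the two logits gives the predicted class.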

Results

The interface of the webpage is the following: