AI & Machine Learning - August 30, 2023

Offensive Tweet Classifier

Offensive Tweet Detection and Categorization


NLP, Logistic Regression, TF-IDF, Classification, Scikit-learn

This project, developed for the Natural Language Processing (CSE440) course at BRAC University, detects and categorizes offensive tweets using machine learning. Using the Offensive Language Identification Dataset (OLID), I trained logistic regression models on TF-IDF features to classify tweets as offensive or non-offensive and to further categorize offensive tweets as targeted (insults or threats) or untargeted (general profanity). Before training, I preprocessed the text with tokenization, lemmatization, and mention/symbol removal to reduce noise in the features. The models were evaluated using classification reports and confusion matrices. Through this project, I gained hands-on experience in text preprocessing, feature engineering, model training, and evaluation in NLP.

    Table of Contents

    • Core Features
      • Text Preprocessing
      • Feature Extraction
      • Model Training
      • Model Evaluation
    • Technology Stack
      • Stack
    • Project Overview
      • Introduction & Purpose
      • Dataset
      • Preprocessing
      • Model Training
      • Model Evaluation
      • References

Core Features

  • Text Preprocessing

    • Dropped the 'id' column to remove unnecessary data.
    • Handled missing values by filling empty entries in the category column with the string 'NULL' and removing rows with missing tweet values.
    • Converted all text to lowercase for uniform processing.
    • Removed '@user' mentions to anonymize data and reduce noise.
    • Eliminated non-alphanumeric characters and symbols.
    • Tokenized text into words using NLTK's word_tokenize().
    • Applied lemmatization using NLTK's WordNetLemmatizer for text normalization.
  • Feature Extraction

    • Utilized TfidfVectorizer from sklearn to convert preprocessed text into numerical feature vectors.
    • Applied the fitted TF-IDF vectorizer to transform new text data for model input.
  • Model Training

    • Implemented Logistic Regression for classification tasks.
    • Trained two separate models: one for offensive detection and another for category detection.
    • Configured Logistic Regression with max_iter=1000 to ensure proper model convergence.
  • Model Evaluation

    • Used classification reports to analyze precision, recall, and F1-score for each class.
    • Generated confusion matrices to evaluate model performance on test data.
    • Examined true positives, true negatives, false positives, and false negatives to assess classification accuracy.

Technology Stack

Stack

  • Pandas
  • Scikit-Learn
  • NLTK
  • TF-IDF Vectorizer
  • Logistic Regression

Project Overview

Introduction & Purpose

This project builds machine learning models that classify tweets as offensive or non-offensive and assign offensive tweets to a specific category. The goal is to identify and categorize offensive content on social media platforms automatically.

Dataset

The dataset used is the Offensive Language Identification Dataset (OLID) v1.0, introduced in "Predicting the Type and Target of Offensive Posts in Social Media" by Zampieri et al. (2019). It contains multiple levels of classification: Level A (Offensive vs. Not Offensive), Level B (Targeted Insults vs. Untargeted), and Level C (Target type: Individual, Group, Other).

Figure: Dataset

Preprocessing

The text data undergoes the following preprocessing steps:

  • Column Removal: Dropped the 'id' column.
  • Missing Value Handling: Filled NULL values in the category column with 'NULL' and removed rows with missing tweet values.
  • Lowercasing: Converted all text to lowercase.
  • Mention Removal: Removed '@user' mentions.
  • Symbol Removal: Removed non-alphanumeric characters and symbols.
  • Tokenization: Used NLTK's word_tokenize() function.
  • Lemmatization: Applied NLTK's WordNetLemmatizer.
  • Vectorization: Converted text into TF-IDF features using sklearn's TfidfVectorizer.
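The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `tweet` and `category`, the helper names `clean_text` and `preprocess`, and the sample rows are all assumptions. To keep the sketch runnable without NLTK's downloadable data, tokenization and lemmatization are passed in as callables that default to whitespace splitting and a no-op.

```python
import re
import pandas as pd

# Patterns for the mention and symbol removal steps (applied after lowercasing).
MENTION_RE = re.compile(r"@user\b")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text, tokenize=str.split, lemmatize=lambda token: token):
    """Lowercase, strip mentions and symbols, then tokenize and lemmatize."""
    text = text.lower()
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = [lemmatize(token) for token in tokenize(text)]
    return " ".join(tokens)


def preprocess(df, tokenize=str.split, lemmatize=lambda token: token):
    """Apply the column, missing-value, and text-cleaning steps to a DataFrame."""
    df = df.drop(columns=["id"])                              # column removal
    df = df.assign(category=df["category"].fillna("NULL"))    # fill missing categories
    df = df.dropna(subset=["tweet"]).copy()                   # drop rows without a tweet
    df["clean_tweet"] = df["tweet"].apply(
        lambda t: clean_text(t, tokenize, lemmatize)
    )
    return df.reset_index(drop=True)


# Made-up rows for illustration only (column names are assumptions).
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "tweet": ["@USER You are SO rude!!!", None, "Nice day :)"],
    "category": ["TIN", None, None],
})
result = preprocess(sample)
print(result[["clean_tweet", "category"]])
```

With NLTK installed and its tokenizer and WordNet data downloaded, the same functions could be called with `tokenize=word_tokenize` and `lemmatize=WordNetLemmatizer().lemmatize` to match the tokenization and lemmatization steps described above.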

Model Training

The model is trained using logistic regression.

  • Two instances of logistic regression are used for offensive detection and category detection.
  • Parameters: max_iter=1000 to ensure sufficient iterations for optimization.
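As a rough sketch of this setup, the snippet below fits a TF-IDF vectorizer and two logistic regression models with max_iter=1000. The toy texts are invented, the label strings follow OLID's tag names (OFF/NOT, TIN/UNT) with 'NULL' for tweets that have no category, and sharing one vectorizer between both models is an assumption; the helper `predict_tweet` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up, already-preprocessed tweets standing in for the real training data.
texts = [
    "you are a stupid idiot",
    "have a lovely day everyone",
    "this damn traffic is awful",
    "thanks for the kind words",
    "shut up you pathetic loser",
    "the weather is nice today",
]
offensive_labels = ["OFF", "NOT", "OFF", "NOT", "OFF", "NOT"]
category_labels = ["TIN", "NULL", "UNT", "NULL", "TIN", "NULL"]

# Fit the vectorizer once on the training text and reuse it for both models
# (an assumption; separate vectorizers would work as well).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One model per task, each with max_iter=1000.
offensive_model = LogisticRegression(max_iter=1000).fit(X, offensive_labels)
category_model = LogisticRegression(max_iter=1000).fit(X, category_labels)


def predict_tweet(clean_text):
    """Transform new text with the fitted vectorizer and query both models."""
    features = vectorizer.transform([clean_text])
    return offensive_model.predict(features)[0], category_model.predict(features)[0]


print(predict_tweet("you are a loser"))
```

New text goes through `vectorizer.transform`, not `fit_transform`, so it is mapped onto the vocabulary learned during training.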

Model Evaluation

The trained models are evaluated using classification reports and confusion matrices.

  • The classification report provides precision, recall, and F1-score for each class.
  • The confusion matrix details true positives, true negatives, false positives, and false negatives.
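The evaluation step might look like the sketch below, which uses scikit-learn's classification_report and confusion_matrix. The true and predicted labels here are invented for illustration and are not results from this project; treating OFF as the positive class is also an assumption.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical test labels and predictions (not the project's results).
y_true = ["OFF", "OFF", "NOT", "NOT", "OFF", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "OFF", "OFF"]
labels = ["NOT", "OFF"]  # fixes the row/column order of the matrix

# Per-class precision, recall, and F1-score.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
print(report)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)

# With OFF as the positive class, the 2x2 matrix unpacks as follows.
tn, fp, fn, tp = matrix.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Passing `labels` explicitly keeps the matrix layout stable even if a class is missing from the predictions.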

Figure: Confusion Matrix - Offensive
Figure: Confusion Matrix - Classification

References

  • Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. (2019). Predicting the Type and Target of Offensive Posts in Social Media. Proceedings of NAACL.
  • Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. (2019). SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval).