Machine Learning
  • NLP
  • Logistic Regression
  • TF-IDF
  • Classification
  • Scikit-learn

Offensive Tweet Classifier

Offensive Tweet Detection and Categorization

This project focuses on detecting and categorizing offensive tweets using logistic regression with TF-IDF vectorization. It includes extensive text preprocessing, model training, and evaluation on the Offensive Language Identification Dataset (OLID).

Code: GitHub

Introduction & Purpose

The project builds a machine learning model that classifies tweets as offensive or not offensive and then assigns offensive tweets to a specific category. The goal is a model that can flag offensive content on social media platforms and indicate what kind of offensive content it is.

Dataset

The dataset is the Offensive Language Identification Dataset (OLID) v1.0, introduced in "Predicting the Type and Target of Offensive Posts in Social Media" by Zampieri et al. (2019). It is annotated at three hierarchical levels: Level A (offensive, OFF, vs. not offensive, NOT), Level B (targeted insult, TIN, vs. untargeted, UNT), and Level C (target type: individual, IND; group, GRP; or other, OTH).
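
The three annotation levels are nested: Level B applies only to offensive tweets, and Level C applies only to targeted ones. The sketch below is one way to express that hierarchy in code. The label codes (OFF/NOT, TIN/UNT, IND/GRP/OTH) are the ones used by OLID; the names `OLID_LEVELS` and `is_consistent` are hypothetical and are not taken from the project code.

```python
# Label codes per OLID annotation level.
OLID_LEVELS = {
    "A": ("OFF", "NOT"),         # offensive vs. not offensive
    "B": ("TIN", "UNT"),         # targeted insult vs. untargeted
    "C": ("IND", "GRP", "OTH"),  # individual, group, other
}


def is_consistent(a, b=None, c=None):
    """Return True if a (Level A, B, C) label triple respects the hierarchy."""
    if a not in OLID_LEVELS["A"]:
        return False
    if a == "NOT":
        # Non-offensive tweets carry no Level B or Level C label.
        return b is None and c is None
    if b not in OLID_LEVELS["B"]:
        return False
    if b == "UNT":
        # Untargeted offensive tweets carry no Level C label.
        return c is None
    return c in OLID_LEVELS["C"]
```

A check like this can be handy when deciding how to fill missing labels, since a missing Level B or Level C value is expected for many rows rather than being a data error.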

Text Preprocessing

  • Dropped the 'id' column to remove unnecessary data.
  • Handled missing values by filling NULLs in the category column and removing rows with missing tweet values.
  • Converted all text to lowercase.
  • Removed '@user' mentions.
  • Removed non-alphanumeric characters such as punctuation and symbols.
  • Tokenized text into words using NLTK's word_tokenize().
  • Applied lemmatization using NLTK's WordNetLemmatizer.
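
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `id`, `tweet`, and `category`, the fill value `"NONE"`, and the function names are all assumptions. The NLTK step is imported lazily because it needs the NLTK tokenizer and WordNet data to be downloaded first.

```python
import re

import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the id column, fill missing categories, drop rows without a tweet."""
    df = df.drop(columns=["id"])
    # "NONE" is a placeholder; the fill value used in the project is not stated.
    df = df.assign(category=df["category"].fillna("NONE"))
    return df.dropna(subset=["tweet"]).reset_index(drop=True)


def clean_text(text: str) -> str:
    """Lowercase, strip '@user' mentions, and remove non-alphanumeric characters."""
    text = text.lower()
    text = re.sub(r"@user\b", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize_and_lemmatize(text: str):
    """Tokenize and lemmatize with NLTK (requires its tokenizer and WordNet data)."""
    from nltk import word_tokenize
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)]


# Toy rows standing in for the real dataset.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "tweet": ["@USER You are SO rude!!! #fail", None, "Have a nice day :)"],
        "category": ["TIN", "UNT", None],
    }
)
cleaned = clean_frame(raw)
cleaned["tweet"] = cleaned["tweet"].map(clean_text)
```

Mentions are removed before the symbol filter runs; otherwise the `@` would be stripped first and the bare word "user" would remain in the text.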

Feature Extraction

  • Utilized TfidfVectorizer from sklearn to convert preprocessed text into numerical feature vectors.
  • Reused the fitted TF-IDF vectorizer to transform new text, so that unseen tweets are mapped into the same feature space as the training data.
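
As an illustration of the fit-then-transform pattern, the sketch below fits a `TfidfVectorizer` on a few toy sentences and then transforms a new one. The example texts are made up, and default vectorizer settings are assumed because the project's settings are not stated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-preprocessed texts standing in for the training tweets.
train_texts = ["you are so rude", "have a nice day", "what a rude thing to say"]
new_texts = ["such a rude day"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training texts.
X_train = vectorizer.fit_transform(train_texts)

# transform reuses the learned vocabulary; words never seen in training are ignored.
X_new = vectorizer.transform(new_texts)
```

Calling `fit_transform` on new data instead of `transform` would rebuild the vocabulary and produce columns that no longer line up with what the classifier was trained on.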

Model Training

  • Implemented Logistic Regression for classification tasks.
  • Trained two separate models: one for offensive detection and another for category detection.
  • Set max_iter=1000 on Logistic Regression so the solver has enough iterations to converge on the high-dimensional TF-IDF features.
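
A minimal sketch of the two-model setup might look like the following. The toy texts and labels are invented for illustration, and the choice to train the category model on all rows with a `"NONE"` placeholder for non-offensive tweets is an assumption; the project may instead train it on offensive tweets only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data; labels follow the OLID codes, "NONE" is a hypothetical placeholder.
texts = [
    "you are a rude idiot",
    "have a lovely day",
    "this is complete garbage",
    "thanks for the help",
    "those people are idiots",
    "what a load of garbage",
]
offensive = ["OFF", "NOT", "OFF", "NOT", "OFF", "OFF"]
category = ["TIN", "NONE", "UNT", "NONE", "TIN", "UNT"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One model per task, both trained on the same TF-IDF features.
offense_model = LogisticRegression(max_iter=1000).fit(X, offensive)
category_model = LogisticRegression(max_iter=1000).fit(X, category)

# Predict both labels for a new tweet using the fitted vectorizer.
sample = vectorizer.transform(["you are a rude idiot"])
predicted_offense = offense_model.predict(sample)[0]
predicted_category = category_model.predict(sample)[0]
```

Keeping the two models separate means each can be evaluated and tuned on its own, at the cost of the category model not being able to use the offensive/not-offensive prediction directly.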

Model Evaluation

  • Used classification reports to analyze precision, recall, and F1-score for each class.
  • Generated confusion matrices to evaluate model performance on test data.
  • Examined true positives, true negatives, false positives, and false negatives to see which kinds of errors each model makes.
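
The evaluation steps above can be reproduced with scikit-learn's metrics module. The labels and predictions below are made up for illustration and are not results from the project.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions for the offensive-detection task.
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "OFF"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "OFF", "OFF"]
labels = ["NOT", "OFF"]  # fixes the row/column order of the matrix

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# With "OFF" as the positive class, the binary matrix unpacks in this order.
tn, fp, fn, tp = cm.ravel()

# Per-class precision, recall, and F1-score.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
print(report)
```

Passing `labels` explicitly keeps the matrix layout stable, which matters when comparing confusion matrices across runs or models.
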
[Figure: Confusion Matrix - Offensive]
[Figure: Confusion Matrix - Classification]

Stack

Pandas, Scikit-Learn, NLTK, TF-IDF Vectorizer, Logistic Regression

References

  • Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. (2019). Predicting the Type and Target of Offensive Posts in Social Media. Proceedings of NAACL.
  • Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. (2019). SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval).