This project, developed for the Natural Language Processing (CSE440) course at BRAC University, focuses on detecting and categorizing offensive tweets using machine learning. Utilizing the Offensive Language Identification Dataset (OLID), I trained a logistic regression model with TF-IDF vectorization to classify tweets as offensive or non-offensive and further categorize them into targeted insults, threats, or untargeted profanity. I implemented data preprocessing techniques such as tokenization, lemmatization, and mention/symbol removal to enhance model performance. The model was evaluated using classification reports and confusion matrices. Through this project, I gained hands-on experience in text preprocessing, feature engineering, model training, and evaluation in NLP.
This project builds a machine learning model that classifies tweets as offensive or non-offensive and further categorizes offensive tweets into specific types, with the goal of automatically identifying and characterizing offensive content on social media platforms.
The dataset used is the Offensive Language Identification Dataset (OLID) v1.0, introduced in "Predicting the Type and Target of Offensive Posts in Social Media" by Zampieri et al. (2019). It contains multiple levels of classification: Level A (Offensive vs. Not Offensive), Level B (Targeted Insults vs. Untargeted), and Level C (Target type: Individual, Group, Other).
The text data undergoes the following preprocessing steps:
- Removal of @-mentions and extraneous symbols
- Tokenization
- Lemmatization
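The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the regex patterns and the whitespace tokenizer are assumptions, and the lemmatization step (done in the project, typically via a tool such as NLTK's `WordNetLemmatizer`) is omitted here to keep the sketch dependency-free.

```python
import re

def preprocess(tweet: str) -> list[str]:
    """Clean a tweet and return lowercase tokens (illustrative sketch)."""
    # Remove @-mentions (OLID masks user handles as @USER) and URLs
    text = re.sub(r"@\w+", "", tweet)
    text = re.sub(r"http\S+", "", text)
    # Strip symbols/punctuation, keeping only letters, digits, and whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Lowercase and tokenize on whitespace
    return text.lower().split()

print(preprocess("@USER This is GREAT!!! http://example.com"))
```

A lemmatizer would then map each token to its dictionary form (e.g. "running" to "run") before vectorization.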
The model is trained using logistic regression on TF-IDF features.
- Two instances of logistic regression are used: one for offensive detection (Level A) and one for category detection.
- Parameters: `max_iter=1000` to give the optimizer sufficient iterations to converge.
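The training setup can be sketched with a scikit-learn pipeline. The toy tweets and labels below are placeholders, not OLID data, and the pipeline structure is an assumption about how the vectorizer and classifier are combined; the `max_iter=1000` setting matches the project description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy placeholder data (the real project uses OLID tweets and labels)
tweets = ["you are awful", "have a nice day", "total idiot", "lovely weather"]
labels = ["OFF", "NOT", "OFF", "NOT"]

# TF-IDF vectorization feeding a logistic regression classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(tweets, labels)

print(clf.predict(["have a nice day"]))
```

A second pipeline of the same shape would be trained on only the offensive tweets to predict the category label (targeted insult, threat, or untargeted profanity).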
The trained models are evaluated using classification reports and confusion matrices.
- The classification report provides precision, recall, and F1-score for each class.
- The confusion matrix details true positives, true negatives, false positives, and false negatives.
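The evaluation step can be illustrated with scikit-learn's metrics utilities. The label vectors below are made-up examples, not results from the project.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical gold labels and model predictions for Level A
y_true = ["OFF", "NOT", "OFF", "NOT", "OFF"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "OFF"]

# Per-class precision, recall, and F1-score
print(classification_report(y_true, y_pred))

# Rows are true classes, columns predicted classes:
# [[2 1]   -> 2 offensive tweets caught, 1 missed (false negative)
#  [0 2]]  -> 2 non-offensive tweets correct, 0 false positives
cm = confusion_matrix(y_true, y_pred, labels=["OFF", "NOT"])
print(cm)
```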