Machine Learning
  • Machine Learning
  • Scikit-learn
  • Classification
  • Data Preprocessing
  • Python

Lung Cancer Prediction

Predicting Lung Cancer Risk Using Health & Lifestyle Indicators

A machine learning project developed for BRAC University's AI course, focused on predicting lung cancer risk using a structured dataset of health and lifestyle indicators, with emphasis on preprocessing, model training, and evaluation.
Lung Cancer Prediction
Lung Cancer Prediction

Code: GitHub

Overview

Built as part of BRAC University’s CSE422 (Artificial Intelligence) course, this project uses supervised machine learning to predict lung cancer risk. The dataset contains 1,000 patient records with various features such as age, air pollution exposure, smoking habits, and more. The project demonstrates hands-on experience in data preprocessing, feature selection, model training, and evaluation using standard ML pipelines.

Dataset

The dataset includes the following features:

  • Demographics: Patient ID, Age, Gender

  • Health & Lifestyle: Smoking, Alcohol use, Balanced diet, Obesity, Chronic lung disease

  • Environmental Exposure: Air pollution, Dust allergy, Occupational hazards, Passive smoking

  • Symptoms: Chest pain, Fatigue, Shortness of breath, Wheezing, Frequent colds, Dry cough

  • Indicators: Coughing of blood, Weight loss, Nail clubbing, Snoring

  • Target Variables:

    • Level: Class of lung cancer risk
    • Result: Final prediction output

Data Preprocessing

  • Introduced artificial NULL values and duplicate columns to simulate real-world dirty data.
  • Handled missing data using mean imputation (sklearn.impute.SimpleImputer).
  • Encoded categorical features (e.g., Gender, Level).
  • Dropped low-correlation features like Genetic Risk and Occupational Hazards.
  • Scaled data using MinMaxScaler.
  • Created both scaled and unscaled versions of the dataset.
  • Performed an 75/25 train-test split.
Script For NULL Insertion
Script For NULL Insertion

Models Used

  • Decision Tree
  • Naive Bayes
  • Random Forest
  • Support Vector Machine
  • Grid Search for hyperparameter tuning

Evaluation Metrics

  • Accuracy, Precision, Recall, F1-score, Support
  • Confusion matrices visualized using Seaborn heatmaps

Key Learnings

  • Learned end-to-end machine learning workflow
  • Applied real-world data cleaning and feature selection
  • Gained experience in model tuning and evaluation
  • Improved skills in visualizing model performance and dataset correlations

Citations