Air Quality Classification

Introduction

  • This project analyzes an air quality dataset to classify observations into four categories of air quality using machine learning models. Support Vector Machine (SVM) models with linear, polynomial, and radial kernels, along with a Random Forest model, were developed to evaluate performance across different modeling approaches.

  • The objective is to compare model performance and understand how feature relationships and dataset characteristics influence classification accuracy.

Exploratory Data Analysis

Data Quality and Feature Characteristics

  • A summary of the dataset indicates that predictor variables operate on different scales, with some features (e.g., PM2.5, PM10) exhibiting much larger ranges than others, such as CO, motivating the use of feature standardization for SVM models.

  • The data contained no missing values, but a small number of physically implausible negative values were observed in the PM10 and SO2 measurements (<1% of observations). Given their low frequency and the modeling approaches used, these values were retained without modification.

summary(pol_data[sapply(pol_data, is.numeric)])
  Temperature       Humidity          PM2.5             PM10       
 Min.   :13.40   Min.   : 36.00   Min.   :  0.00   Min.   : -0.20  
 1st Qu.:25.10   1st Qu.: 58.30   1st Qu.:  4.60   1st Qu.: 12.30  
 Median :29.00   Median : 69.80   Median : 12.00   Median : 21.70  
 Mean   :30.03   Mean   : 70.06   Mean   : 20.14   Mean   : 30.22  
 3rd Qu.:34.00   3rd Qu.: 80.30   3rd Qu.: 26.10   3rd Qu.: 38.10  
 Max.   :58.60   Max.   :128.10   Max.   :295.00   Max.   :315.80  
      NO2             SO2              CO       Proximity_to_Industrial_Areas
 Min.   : 7.40   Min.   :-6.20   Min.   :0.65   Min.   : 2.500               
 1st Qu.:20.10   1st Qu.: 5.10   1st Qu.:1.03   1st Qu.: 5.400               
 Median :25.30   Median : 8.00   Median :1.41   Median : 7.900               
 Mean   :26.41   Mean   :10.01   Mean   :1.50   Mean   : 8.425               
 3rd Qu.:31.90   3rd Qu.:13.72   3rd Qu.:1.84   3rd Qu.:11.100               
 Max.   :64.90   Max.   :44.90   Max.   :3.72   Max.   :25.800               
 Population_Density
 Min.   :188.0     
 1st Qu.:381.0     
 Median :494.0     
 Mean   :497.4     
 3rd Qu.:600.0     
 Max.   :957.0     
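The negative readings noted above can be tallied directly. A minimal sketch, using a small toy stand-in for the real pol_data frame (which is loaded elsewhere):

```r
# Toy stand-in for pol_data; values chosen to include one negative per column
pol_data_toy <- data.frame(PM10 = c(12.3, -0.2, 21.7, 38.1),
                           SO2  = c(5.1, 8.0, -6.2, 13.7))
# Share of negative readings in each column
neg_frac <- sapply(pol_data_toy[c("PM10", "SO2")], function(x) mean(x < 0))
round(100 * neg_frac, 1)  # percentage of negative values per column
```

On the real data, the same expression confirms that negatives account for well under 1% of observations in each column.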

Class Distribution

  • The distribution of air quality classes is moderately imbalanced (Good is roughly four times as frequent as Hazardous). This imbalance is accounted for by stratified sampling when the data are split into training and testing sets.
barplot(table(pol_data$Air.Quality),
        main = "Distribution of Air Quality Classes",
        ylab = "Count")
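The imbalance can also be quantified numerically. A minimal sketch, using class counts consistent with the prevalences reported in the Results section (40/30/20/10%):

```r
# Class counts consistent with the reported test-set prevalences
counts <- c(Good = 400, Moderate = 300, Poor = 200, Hazardous = 100)
# Relative frequencies quantify the imbalance visible in the barplot
round(prop.table(counts), 2)
```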

Feature Relationships

  • PM2.5 and PM10 are highly correlated (r ≈ 0.97) and therefore provide largely overlapping information. Both were retained to preserve pollutant-specific interpretability.

  • To better understand how predictor variables relate to air quality categories, CO levels were examined across classes, given CO's relevance as an indicator of air pollution. CO values show clear separation between categories, with higher levels associated with poorer air quality.

  • Some overlap is observed between adjacent categories, particularly Moderate and Poor.

cor_matrix <- cor(pol_data[sapply(pol_data, is.numeric)])
cor_matrix["PM2.5", "PM10"]
[1] 0.9730049
boxplot(CO ~ Air.Quality, data = pol_data,
        main = "CO Levels by Air Quality Category",
        ylab = "CO")

Methodology

  • The dataset was split into training and testing sets using stratified sampling (80/20) to preserve class distribution across all air quality categories.

  • A random seed was set before model fitting so that all results are reproducible.

  • Predictor variables were standardized for Support Vector Machine (SVM) models using training set statistics to ensure consistent scaling and prevent data leakage. Random Forest models were trained on the original feature space, as tree-based methods are invariant to feature scaling.

  • Three SVM models (linear, polynomial, and radial kernels) and a Random Forest model were developed to compare performance across different levels of model complexity.

  • SVM models were tuned using cross-validation to optimize hyperparameters, while the Random Forest model was tuned over different values of the feature-sampling parameter (mtry) using 10-fold cross-validation.

  • Model performance was evaluated on the held-out test set using classification accuracy and class-level metrics.
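The split-and-scale steps above can be sketched in base R. This is a minimal illustration with a toy data frame and a single predictor (the actual analysis may have used helpers such as caret's createDataPartition, and the seed value here is hypothetical):

```r
set.seed(123)  # hypothetical seed; reproducibility as described above
# Toy stand-in mirroring the class proportions; the real data has several predictors
df <- data.frame(
  Air.Quality = factor(rep(c("Good", "Moderate", "Poor", "Hazardous"),
                           times = c(40, 30, 20, 10))),
  CO = rnorm(100, mean = 1.5, sd = 0.5)
)
# Stratified 80/20 split: sample 80% of rows *within* each class
train_idx <- unlist(lapply(split(seq_len(nrow(df)), df$Air.Quality),
                           function(i) sample(i, floor(0.8 * length(i)))))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
# Standardize with training-set statistics only, so no test-set information leaks in
mu <- mean(train$CO); s <- sd(train$CO)
train$CO_z <- (train$CO - mu) / s
test$CO_z  <- (test$CO  - mu) / s
```

Because the scaling parameters come solely from the training rows, the test set is transformed with statistics it never influenced, which is the leakage-prevention point made above.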

Results

Model Accuracy

  • All models achieved high overall accuracy, with Random Forest performing best and SVM variants yielding similar results.
results <- data.frame(
  Model = c("Linear SVM", "Polynomial SVM", "Radial SVM", "Random Forest"),
  Accuracy = c(0.947, 0.933, 0.949, 0.961)
)

knitr::kable(results, digits=3)
|Model          | Accuracy|
|:--------------|--------:|
|Linear SVM     |    0.947|
|Polynomial SVM |    0.933|
|Radial SVM     |    0.949|
|Random Forest  |    0.961|

Confusion Matrices

  • Confusion matrices for the Linear SVM and Random Forest models are presented below to illustrate class-level performance across models.

Linear SVM

Confusion Matrix and Statistics

           Reference
Prediction  Good Hazardous Moderate Poor
  Good       400         0        3    0
  Hazardous    0        84        0    7
  Moderate     0         0      289   19
  Poor         0        16        8  174

Overall Statistics
                                          
               Accuracy : 0.947           
                 95% CI : (0.9312, 0.9601)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.924           
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Good Class: Hazardous Class: Moderate Class: Poor
Sensitivity               1.0000           0.8400          0.9633      0.8700
Specificity               0.9950           0.9922          0.9729      0.9700
Pos Pred Value            0.9926           0.9231          0.9383      0.8788
Neg Pred Value            1.0000           0.9824          0.9841      0.9676
Prevalence                0.4000           0.1000          0.3000      0.2000
Detection Rate            0.4000           0.0840          0.2890      0.1740
Detection Prevalence      0.4030           0.0910          0.3080      0.1980
Balanced Accuracy         0.9975           0.9161          0.9681      0.9200
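The class-level statistics above follow directly from the confusion matrix: sensitivity for a class is its diagonal count over its reference-column total, and specificity is the share of non-class references not predicted as that class. A sketch reconstructing the Linear SVM figures:

```r
# Linear SVM confusion matrix from above (rows = Prediction, cols = Reference)
cm <- matrix(c(400,  0,   0,   0,    # Reference: Good
                 0, 84,   0,  16,    # Reference: Hazardous
                 3,  0, 289,   8,    # Reference: Moderate
                 0,  7,  19, 174),   # Reference: Poor
             nrow = 4,
             dimnames = list(Prediction = c("Good", "Hazardous", "Moderate", "Poor"),
                             Reference  = c("Good", "Hazardous", "Moderate", "Poor")))
accuracy    <- sum(diag(cm)) / sum(cm)   # 947 / 1000 = 0.947
sensitivity <- diag(cm) / colSums(cm)    # per-class recall, e.g. Poor = 174/200 = 0.87
specificity <- sapply(1:4, function(k)
  sum(cm[-k, -k]) / sum(cm[, -k]))       # correct rejections among non-class-k references
```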

Random Forest

Confusion Matrix and Statistics

           Reference
Prediction  Good Hazardous Moderate Poor
  Good       399         0        1    0
  Hazardous    0        85        0    7
  Moderate     1         0      294   10
  Poor         0        15        5  183

Overall Statistics
                                          
               Accuracy : 0.961           
                 95% CI : (0.9471, 0.9721)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9442          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Good Class: Hazardous Class: Moderate Class: Poor
Sensitivity               0.9975           0.8500          0.9800      0.9150
Specificity               0.9983           0.9922          0.9843      0.9750
Pos Pred Value            0.9975           0.9239          0.9639      0.9015
Neg Pred Value            0.9983           0.9835          0.9914      0.9787
Prevalence                0.4000           0.1000          0.3000      0.2000
Detection Rate            0.3990           0.0850          0.2940      0.1830
Detection Prevalence      0.4000           0.0920          0.3050      0.2030
Balanced Accuracy         0.9979           0.9211          0.9821      0.9450

Feature Importance

  • Feature importance from the Random Forest model highlights the relative contribution of predictors to classification performance.
plot(varImp(rf_model))

Analysis

  • The similar performance between linear and radial SVM suggests that the dataset is largely linearly separable, with limited benefit from more complex kernels.

  • The Random Forest model achieved the highest overall accuracy, indicating its ability to capture nonlinear relationships, although the improvement over SVM models was modest.

  • The high correlation between PM2.5 and PM10 reflects redundancy in the feature set.

  • Model performance varied across classes, with less frequent categories generally exhibiting lower predictive accuracy, highlighting the impact of class imbalance.

  • Feature importance results show that CO and proximity to industrial areas are the dominant predictors of air quality, while PM2.5 and PM10 contribute less than their prominence as particulate measures might suggest, indicating that other pollutants play a larger role in classification.

  • The clear separation of CO values across categories helps explain its high importance in the Random Forest model, while overlap between adjacent classes contributes to classification difficulty.

Conclusion

  • All models achieved high classification accuracy, with Random Forest providing the best overall performance, though the improvement over SVM models was modest.

  • The similar performance across models suggests that the dataset is well structured and can be modeled effectively with relatively simple approaches, making the linear SVM a viable alternative when model simplicity is preferred.

  • Feature importance analysis indicates that CO and proximity to industrial areas are the primary drivers of air quality classification.