How do you deal with dataset imbalance in a problem like spam filtering?

Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem.

Designing an asymmetric cost function, where the cost of misclassifying a minority-class example is higher than the cost of misclassifying a majority-class example. Typically, the loss terms belonging to the minority class are multiplied by a scaling factor that can be adjusted during hyperparameter tuning. Note that the evaluation metric used for tuning needs to be aligned as well: for instance, F1 score or AUC is a better measure than plain accuracy.
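As a rough illustration of the asymmetric cost idea, here is a minimal sketch of a weighted binary cross-entropy together with an F1 score for tuning. It assumes label 1 marks the minority class (spam); the names `weighted_log_loss`, `minority_weight` and `f1_score` are hypothetical and chosen only for this example.

```python
import math


def weighted_log_loss(y_true, p_pred, minority_weight=5.0, eps=1e-12):
    """Binary cross-entropy with minority-class terms scaled up.

    Assumption: label 1 is the minority class (spam), label 0 the majority.
    minority_weight is the scaling factor to adjust during tuning.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so that log() never sees 0.
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -minority_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """F1 score for the minority class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, yh in pairs if y == 1 and yh == 1)
    fp = sum(1 for y, yh in pairs if y == 0 and yh == 1)
    fn = sum(1 for y, yh in pairs if y == 1 and yh == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With `minority_weight` greater than 1, a missed spam email contributes more to the loss than a misclassified non-spam email with the same predicted probability, which is the asymmetry described above.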
Undersampling the majority class:
1. Remove randomly sampled data points.
2. Cluster the data points and randomly remove points from the large clusters.
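The random variant of undersampling can be sketched as follows. This is only an illustration: the function name `random_undersample` and its `ratio` parameter (majority examples kept per minority example) are assumptions made for this example, and label 0 is assumed to be the majority class.

```python
import random


def random_undersample(samples, labels, majority_label=0, ratio=1.0, seed=0):
    """Randomly drop majority-class points, keeping every minority point.

    ratio is the number of majority points kept per minority point
    (a hypothetical knob for this sketch).
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    keep_n = min(len(majority), int(round(ratio * len(minority))))
    kept = rng.sample(majority, keep_n)  # sampling without replacement
    idx = sorted(kept + minority)        # preserve the original ordering
    return [samples[i] for i in idx], [labels[i] for i in idx]
```

The cluster-based variant would presumably apply the same kind of sampling inside each large cluster rather than over the whole majority class, so that small clusters are not wiped out.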
Oversampling the minority class:
1. SMOTE (Synthetic Minority Oversampling Technique) is a popular tool.
2. Randomly resampling data points. Note that resampling does not produce enough independent data points to learn complex functions; it only has the effect of assigning a higher weight to some minority-class data points.
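The two oversampling options can be contrasted with a small sketch. It is a simplified, SMOTE-style interpolation written for illustration, not the reference SMOTE implementation; the function names and the parameters `n_new` and `k` are assumptions, and minority points are assumed to be numeric tuples with at least two points available.

```python
import random


def random_oversample(minority, n_new, seed=0):
    """Duplicate existing minority points by sampling with replacement."""
    rng = random.Random(seed)
    return [rng.choice(minority) for _ in range(n_new)]


def smote_like(minority, n_new, k=3, seed=0):
    """Create synthetic points between a minority point and a near neighbour.

    Simplified SMOTE-style sketch: pick a base point, pick one of its k
    nearest minority neighbours, and interpolate at a random position
    on the segment joining them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Sort the remaining points by squared Euclidean distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic
```

`random_oversample` only repeats points that already exist, which is why it behaves like a per-example weight, whereas `smote_like` generates new points that lie between existing minority examples.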
