From Bias to Balance: Solving Imbalanced Data Issues

Priyanka Dave
9 min read · Sep 10, 2023

Table of Contents:

  1. Introduction
  2. Understanding the Imbalanced Classification Problem
  3. Solutions for Imbalanced Classification
  4. Practical Example

1. Introduction

Many real-world datasets are imbalanced: the classes or categories they contain are not equally represented.

In applications where the minority class represents rare events (e.g., fraud detection, disease diagnosis), imbalanced datasets make it challenging to detect these events effectively.

Let’s take a credit card fraud detection dataset as an example. Here, the class of interest is typically the “fraudulent” or “positive” class. This means you are primarily interested in identifying instances where credit card transactions are fraudulent or involve unauthorized activities. This class represents the minority of cases in the dataset since fraudulent transactions are relatively rare compared to legitimate ones.

2. Understanding the Imbalanced Classification Problem

Training a machine learning model with an imbalanced dataset can lead to several issues and challenges, such as:

  1. Bias Toward Majority Class: The model is likely to become biased toward the majority class. In credit card fraud detection, for example, if most transactions are non-fraudulent, the model may perform well on those cases but struggle to identify fraudulent ones.
  2. Poor Generalization: The model may have difficulty generalizing to the minority class because it has seen very few examples of it during training. As a result, it may not perform well on unseen instances of the minority class.
  3. Misleading Accuracy: Traditional accuracy as an evaluation metric can be misleading when dealing with imbalanced datasets. A model that predicts the majority class for all instances can still achieve high accuracy while failing to detect the minority class effectively (the short sketch after this list illustrates this).
  4. Loss of Information: When training on an imbalanced dataset, the model may not learn the underlying patterns and nuances of the minority class, which can lead to missed opportunities for detection or classification.
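
To make point 3 concrete, here is a minimal sketch with made-up counts, showing how a degenerate model that always predicts the majority class scores high accuracy while catching zero fraud:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10

# A degenerate "model" that predicts the majority class every time.
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0  -- detects no fraud at all
```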

3. Solutions for Imbalanced Classification

To address these issues, it’s essential to consider strategies for handling imbalanced datasets. Some common approaches include:

  1. RANDOM - Up sampling and down sampling techniques:

1.1 Up sampling: Increase the frequency of the minority class by duplicating randomly chosen minority records

[Image: Up sampling]

1.2 Down sampling: Decrease the frequency of the majority class by removing randomly chosen majority records

[Image: Down sampling]

Up-sampling by simply duplicating random records from the minority class can cause overfitting.

Down-sampling by simply removing random records from the majority class can cause loss of information.
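
As a concrete reference, here is a minimal sketch of both random techniques using imbalanced-learn’s RandomOverSampler and RandomUnderSampler (the toy dataset is made up for illustration):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 95% class 0 and 5% class 1.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
print("original:", Counter(y))

# Random up-sampling: duplicate randomly chosen minority-class rows.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("up-sampled:", Counter(y_over))

# Random down-sampling: drop randomly chosen majority-class rows.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("down-sampled:", Counter(y_under))
```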

2. SOPHISTICATED — Up sampling and down sampling techniques:

Python’s imbalanced-learn library provides more sophisticated resampling techniques; see the imbalanced-learn documentation for details.

For example, when dealing with imbalanced datasets and opting to down-sample the majority class, one approach is to cluster similar majority-class data points and then remove some points from each cluster. This helps preserve the essential information of the majority class while reducing its size.

For up sampling, instead of creating exact copies of minority-class data points, we can introduce small variations into those copies to create a more diverse synthetic dataset.

2.1 SMOTE (Synthetic Minority Over-sampling Technique):

  • SMOTE works by generating synthetic samples for the minority class.
  • It randomly selects a data point from this class as the starting point and computes the k-nearest neighbors for this point.
  • Then it adds synthetic data points along the line between the chosen point and its neighbors.
[Image: SMOTE]
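
A minimal sketch with imbalanced-learn’s SMOTE (the toy dataset and the k_neighbors value are illustrative; 5 is the library default):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Same kind of toy imbalanced dataset as in the earlier sketch.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)

# k_neighbors controls how many nearest minority neighbours each
# synthetic point is interpolated between.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```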

2.2 Tomek Links:

  • A technique used in machine learning to address the issue of class imbalance and to improve the performance of models, especially in binary classification tasks.
  • The goal of Tomek Links is to identify and remove majority-class data points that are near the decision boundary between the two classes and are potentially causing misclassification.
[Image: Tomek Links]
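
A minimal sketch with imbalanced-learn’s TomekLinks (by default it removes the majority-class member of each link):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

# Toy imbalanced dataset for illustration.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)

# A Tomek link is a pair of opposite-class points that are each other's
# nearest neighbour; removing the majority member cleans the boundary.
X_res, y_res = TomekLinks().fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```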

2.3 SMOTETomek:

  • Combination of two resampling techniques: SMOTE (Synthetic Minority Over-sampling Technique) and Tomek Links.
  • It is designed to address the class imbalance problem in machine learning by simultaneously oversampling the minority class and undersampling the majority class.
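
A minimal sketch with imbalanced-learn’s SMOTETomek on the same kind of toy dataset:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Toy imbalanced dataset for illustration.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)

# SMOTE over-samples the minority class first, then Tomek links are
# removed to clean up the resulting class boundary.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```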

2.4 Cluster-Based Undersampling Technique:

  • It identifies clusters within the majority class and retains only one data point per cluster (the centroid), effectively reducing the size of the majority class.
  • Keep in mind that the choice of clustering algorithm, the number of clusters, and the sampling strategy should be adapted to your specific dataset and problem to ensure that you preserve valuable information while addressing the class imbalance.
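
imbalanced-learn ships a closely related implementation, ClusterCentroids, which clusters the majority class with KMeans and replaces each cluster by its centroid (a synthetic point rather than an original row); a minimal sketch:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Toy imbalanced dataset for illustration.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)

# The majority class is reduced to as many centroids as are needed
# to match the minority class size.
X_res, y_res = ClusterCentroids(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```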

4. Practical Example

  • Let’s take a credit card fraud detection dataset as an example to apply the techniques above.
  • The dataset contains two primary classes: “Legitimate Transactions” and “Fraudulent Transactions.”
  • The “Legitimate Transactions” class significantly outnumbers the “Fraudulent Transactions” class.

1. Explore, clean and prepare dataset:

1.1 Check the shape of the original dataset:

  • We have a total of 284,807 rows and 31 columns in the raw dataset.

1.2 Check the list of columns:

  • Here, Class is the label column and the rest are input features.

1.3 Find and remove duplicate rows:

  • We have 1,081 duplicate rows in our dataset.
  • After removing them, the dataset has 283,726 rows.
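
A sketch of these first steps in pandas (the file name creditcard.csv is an assumption about how the download is saved locally):

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")  # hypothetical local file name

print(df.shape)             # (284807, 31)
print(df.columns.tolist())  # input features plus the 'Class' label

# Find and drop exact duplicate rows.
print(df.duplicated().sum())  # 1081 duplicates
df = df.drop_duplicates()
print(df.shape)               # shape after de-duplication
```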

1.4 Check class distribution:

  • Here, 99.8% of records belong to class 0 (Legitimate)
  • About 0.2% of records belong to class 1 (Fraudulent)
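
Continuing the sketch above, the class ratio can be checked with value_counts:

```python
# Fraction of rows per class: ~99.8% legitimate vs ~0.2% fraudulent.
print(df["Class"].value_counts(normalize=True))
```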

1.5 Display the top 5 rows of the dataset:

1.6 Find and remove outliers:

  • Let’s check the data distribution of the Amount column:
  • Let’s find and remove instances (rows) having Amount ≥ 10,000
  • There are 8 such rows in the dataset.
  • Below is the data distribution of the Amount column after removing outliers:
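
Continuing the same sketch, the outlier filter is a one-line boolean mask:

```python
# Inspect the Amount distribution, then drop extreme outliers.
print(df["Amount"].describe())

print((df["Amount"] >= 10_000).sum())  # 8 such rows
df = df[df["Amount"] < 10_000]
```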

1.7 Split dataset into train and test set:

  • We have 198,602 rows in the training dataset and 85,116 rows in the testing dataset.
  • Below is the class distribution of the training and testing datasets:
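
The row counts are consistent with a stratified 70/30 split, so a sketch of this step might look like this (the exact split parameters are an assumption):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns="Class")
y = df["Class"]

# Stratify so train and test keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # ~198602 vs ~85116 rows
```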

2. Base Model — Let’s train and test a logistic regression model on the cleaned dataset:

2.1 Let’s create 2D/3D scatter plots to check how the data is distributed for each class:

[Image: Base model - 2D/3D scatter plot]

2.2 Train and test logistic regression model
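
A minimal sketch of this step (the hyperparameters are assumptions; the post’s exact settings may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=2))
```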

[Image: Base model - evaluation results]
  • Out of a total of 130 fraudulent transactions, the model correctly identified 86 of them as fraudulent (66% Recall)
  • Out of a total of 134 (48+86) instances that the model predicted as positive, 86 were actually correct (64% Precision)

3. Random Under Sampling:

3.1 Let’s create a scatter plot to visualize the class distribution of the training dataset after applying the random under sampling technique:

[Images: Random under sampling - class distribution and evaluation results]
  • Out of a total of 130 fraudulent transactions, the model correctly identified 118 of them as fraudulent (91% Recall)
  • Out of a total of 4,653 (4,535+118) instances that the model predicted as positive, 118 were actually correct (3% Precision)

4. Random Over Sampling:

4.1 Let’s create a scatter plot to visualize the class distribution of the training dataset after applying the random over sampling technique:

[Images: Random over sampling - class distribution and evaluation results]
  • Out of a total of 130 fraudulent transactions, the model correctly identified 121 of them as fraudulent (93% Recall)
  • Out of a total of 3,319 (3,198+121) instances that the model predicted as positive, 121 were actually correct (4% Precision)

5. SMOTE:

5.1 To apply SMOTE, let’s first identify the optimal value of k using the elbow method:

[Image: Elbow plot for choosing k]
  • Here, K=4 looks like the optimal value. You can experiment with other values as well.
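
The elbow curve here is the standard KMeans inertia plot; computing it on the minority-class training points and feeding the chosen k into SMOTE’s k_neighbors is the heuristic used in this post. A sketch:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

# Within-cluster sum of squares (inertia) for a range of k values,
# computed on the minority-class training points.
X_minority = X_train[y_train == 1]
ks = range(1, 11)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_minority).inertia_
    for k in ks
]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()  # pick the k where the curve flattens (here, k=4)

# Use the chosen k as SMOTE's number of nearest neighbours.
X_res, y_res = SMOTE(k_neighbors=4, random_state=42).fit_resample(X_train, y_train)
```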

5.2 Let’s create a scatter plot to visualize the class distribution of the training dataset after applying SMOTE:

[Images: SMOTE - class distribution and evaluation results]
  • Out of a total of 130 fraudulent transactions, the model correctly identified 115 of them as fraudulent (88% Recall)
  • Out of a total of 3,100 (2,985+115) instances that the model predicted as positive, 115 were actually correct (4% Precision)

6. TomekLinks:

[Image: TomekLinks]

6.1 Let’s create a scatter plot to visualize the class distribution of the training dataset after applying TomekLinks:

[Images: TomekLinks - class distribution and evaluation results]
  • Out of a total of 130 fraudulent transactions, the model correctly identified 86 of them as fraudulent (66% Recall)
  • Out of a total of 134 (48+86) instances that the model predicted as positive, 86 were actually correct (64% Precision)

7. Cluster based Undersampling:

[Image: Cluster based undersampling]

7.1 Let’s create a scatter plot to visualize the class distribution of the training dataset after applying cluster based undersampling:

[Images: Cluster based undersampling - class distribution and evaluation results]
  • Out of a total of 130 fraudulent transactions, the model correctly identified 125 of them as fraudulent (96% Recall)
  • Out of a total of 11,656 (11,531+125) instances that the model predicted as positive, 125 were actually correct (1% Precision)
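
To reproduce all of the numbers compared below, one small loop retrains the same logistic regression after each resampler (parameters as assumed in the sketches above):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import ClusterCentroids, RandomUnderSampler, TomekLinks
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

samplers = {
    "Base (no resampling)": None,
    "Random under sampling": RandomUnderSampler(random_state=42),
    "Random over sampling": RandomOverSampler(random_state=42),
    "SMOTE (k=4)": SMOTE(k_neighbors=4, random_state=42),
    "TomekLinks": TomekLinks(),
    "Cluster-based undersampling": ClusterCentroids(random_state=42),
}

for name, sampler in samplers.items():
    # Resample only the training data; the test set stays untouched.
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    y_pred = model.predict(X_test)
    print(f"{name}: recall={recall_score(y_test, y_pred):.2f}, "
          f"precision={precision_score(y_test, y_pred):.2f}")
```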

8. Comparison:

Summarizing the results above (all evaluated on the same test set):

  Technique                      Recall   Precision
  Base model                       66%       64%
  Random under sampling            91%        3%
  Random over sampling             93%        4%
  SMOTE                            88%        4%
  TomekLinks                       66%       64%
  Cluster based undersampling      96%        1%

9. Conclusion:

When you try to reduce Type 1 errors (false positives), you often increase Type 2 errors (false negatives), and vice versa. This relationship is known as the “trade-off” between Type 1 and Type 2 errors. The results above illustrate it: the techniques that raised recall (fewer missed frauds) did so at the cost of far more false alarms (lower precision).

Balancing datasets and managing the trade-off between Type 1 and Type 2 errors are interconnected challenges in machine learning. Finding the right equilibrium is a nuanced process that demands a thorough understanding of the problem, thoughtful experimentation, and a commitment to aligning model performance with the specific goals and consequences of classification errors in the given application.

You can download the full code from my GitHub repository.

Thank you.
