Methods for dealing with imbalanced data
1 Introduction
Imbalanced data is a common feature of certain types of data, such as fraudulent credit card records, where the number of fraudulent cards is usually very small compared to the number of non-fraudulent ones. The problem with imbalanced data is that the model being trained, such as a knn or svm model, is dominated by the majority class, and hence predicts the majority class more effectively than the minority class, which in turn results in a high sensitivity rate and a low specificity rate (in binary classification).
A simple technique to reduce the negative impact of this problem is to subsample the data. The most common subsampling methods used in practice are the following.
Upsampling: this method increases the size of the minority class by sampling with replacement so that the classes will have the same size.
Downsampling: in contrast to the above method, this one decreases the size of the majority class to match, or come close to, the minority class size by simply taking a random sample.
Hybrid methods: the best-known hybrid methods are ROSE (Random Over-Sampling Examples) and SMOTE (Synthetic Minority Over-sampling Technique); they downsample the majority class and create new artificial points in the minority class. For more detail, see the SMOTE and ROSE references.
Note: all the above methods should be applied only to the training set; the testing set must never be touched until the final model evaluation step.
Some types of models can handle imbalanced data themselves, such as deep learning models with the class_weight argument, which gives more weight to the minority class cases. For other models, however, such as svm or knn, we have to make use of one of the above methods before training.
In this article we will make use of the creditcard data from the kaggle website, which is highly imbalanced. We will train a logistic regression model on the raw data and on the data transformed by each of the above methods, and compare the results. We will also fit a simple deep learning model with and without taking the imbalance problem into account.
First we load the data.
spsm(library(tidyverse))  # spsm() is presumably a shorthand for suppressPackageStartupMessages()
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'tibble' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
data <- read.csv("../sparklyr/creditcard.csv", header = TRUE)
For privacy purposes the original features have been replaced by the PCA variables V1 to V28; only the Time and Amount features are left from the original ones.
Let’s first check the frequency of the Class variable levels (after converting Class to a factor type).
data$Class <- as.factor(data$Class)
prop.table(table(data$Class))
##
## 0 1
## 0.998272514 0.001727486
As we see, the minority class “1” represents only about 0.17% of the total cases. We also display a summary of the data to take an overall look at all the features and check for missing values or unusual outliers.
summary(data)
## Time V1 V2 V3
## Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
## Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
## Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
## Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
## V4 V5 V6 V7
## Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
## 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
## Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
## Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
## V8 V9 V10 V11
## Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14 V15
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
## Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
## V16 V17 V18
## Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
## 1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
## Median : 0.06641 Median : -0.06568 Median :-0.003636
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
## Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
## V19 V20 V21
## Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
## 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
## Median : 0.003735 Median : -0.06248 Median : -0.02945
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
## Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
## V22 V23 V24
## Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
## 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
## Median : 0.006782 Median : -0.01119 Median : 0.04098
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
## Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
## V25 V26 V27
## Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
## 1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
## Median : 0.01659 Median :-0.05214 Median : 0.001342
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
## Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
## V28 Amount Class
## Min. :-15.43008 Min. : 0.00 0:284315
## 1st Qu.: -0.05296 1st Qu.: 5.60 1: 492
## Median : 0.01124 Median : 22.00
## Mean : 0.00000 Mean : 88.35
## 3rd Qu.: 0.07828 3rd Qu.: 77.17
## Max. : 33.84781 Max. :25691.16
Looking at this summary, we do not see any critical issues, such as missing values.
2 Data partition
Before applying any subsampling method, we first split the data into a training set and a testing set, and only the former will be subsampled.
spsm(library(caret))
set.seed(1234)
index <- createDataPartition(data$Class, p = 0.8, list = FALSE)
train <- data[index, ]
test <- data[-index, ]
3 Subsampling the training data
3.1 Upsampling
The caret package provides a function called upSample to perform the upsampling.
set.seed(111)
trainup <- upSample(x = train[, -ncol(train)], y = train$Class)
table(trainup$Class)
##
## 0 1
## 227452 227452
As we see, the two classes now have the same size, 227452, the original size of the majority class.
3.2 Downsampling
In the same way, we make use of the caret function downSample.
set.seed(111)
traindown <- downSample(x = train[, -ncol(train)], y = train$Class)
table(traindown$Class)
##
## 0 1
## 394 394
Now the size of each class is 394, the original size of the minority class in the training set.
3.3 ROSE
To use this technique we have to load the ROSE package.
spsm(library(ROSE))
set.seed(111)
trainrose <- ROSE(Class ~ ., data = train)$data
table(trainrose$Class)
##
## 0 1
## 113827 114019
Since this technique adds new synthetic data points to the minority class and downsamples the majority class, the sizes are now about 114019 for the minority class and 113827 for the majority class.
3.4 SMOTE
This technique requires the DMwR package.
spsm(library(DMwR))
set.seed(111)
trainsmote <- SMOTE(Class ~ ., data = train)
table(trainsmote$Class)
##
## 0 1
## 1576 1182
The majority class has been reduced to 1576 cases and the minority class increased to 1182 cases.
4 Training the logistic regression model
We are now ready to fit a logit model to the original training set (without subsampling) and to each of the subsampled training sets above.
4.1 Without subsampling
set.seed(123)
model <- glm(Class ~ ., data = train, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9290 -0.0291 -0.0190 -0.0124 4.6028
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.486e+00 2.852e-01 -29.753 < 2e-16 ***
## Time -2.673e-06 2.528e-06 -1.057 0.29037
## V1 9.397e-02 4.794e-02 1.960 0.04996 *
## V2 1.097e-02 6.706e-02 0.164 0.87006
## V3 1.290e-03 5.949e-02 0.022 0.98270
## V4 6.851e-01 8.408e-02 8.148 3.69e-16 ***
## V5 1.472e-01 7.301e-02 2.017 0.04372 *
## V6 -8.450e-02 7.902e-02 -1.069 0.28491
## V7 -1.098e-01 7.591e-02 -1.446 0.14816
## V8 -1.718e-01 3.402e-02 -5.050 4.41e-07 ***
## V9 -1.926e-01 1.258e-01 -1.531 0.12579
## V10 -8.073e-01 1.118e-01 -7.224 5.07e-13 ***
## V11 -3.920e-03 9.131e-02 -0.043 0.96575
## V12 2.855e-02 9.432e-02 0.303 0.76210
## V13 -3.064e-01 9.007e-02 -3.401 0.00067 ***
## V14 -5.308e-01 6.816e-02 -7.787 6.86e-15 ***
## V15 -1.285e-01 9.559e-02 -1.344 0.17903
## V16 -2.164e-01 1.423e-01 -1.520 0.12840
## V17 2.913e-02 7.729e-02 0.377 0.70624
## V18 -3.642e-02 1.445e-01 -0.252 0.80095
## V19 6.064e-02 1.094e-01 0.554 0.57938
## V20 -4.449e-01 9.737e-02 -4.570 4.89e-06 ***
## V21 3.661e-01 6.709e-02 5.456 4.87e-08 ***
## V22 5.965e-01 1.519e-01 3.927 8.59e-05 ***
## V23 -1.157e-01 6.545e-02 -1.768 0.07706 .
## V24 8.146e-02 1.625e-01 0.501 0.61622
## V25 4.325e-02 1.482e-01 0.292 0.77043
## V26 -2.679e-01 2.226e-01 -1.203 0.22893
## V27 -7.280e-01 1.542e-01 -4.720 2.36e-06 ***
## V28 -2.817e-01 9.864e-02 -2.856 0.00429 **
## Amount 9.154e-04 4.379e-04 2.091 0.03656 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5799.1 on 227845 degrees of freedom
## Residual deviance: 1768.0 on 227815 degrees of freedom
## AIC: 1830
##
## Number of Fisher Scoring iterations: 12
At this step, to make things simpler, we remove the insignificant variables (those without asterisks) and keep the remaining ones to use in all the following models.
set.seed(123)
model1 <- glm(Class ~ . - Time - V2 - V3 - V6 - V7 - V9 - V11 - V12 - V15 - V16 -
V17 - V18 - V19 - V24 - V25 - V26, data = train, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = Class ~ . - Time - V2 - V3 - V6 - V7 - V9 - V11 -
## V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6514 -0.0290 -0.0186 -0.0117 4.6192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.763e+00 1.510e-01 -58.023 < 2e-16 ***
## V1 2.108e-02 2.918e-02 0.722 0.470129
## V4 7.241e-01 6.306e-02 11.483 < 2e-16 ***
## V5 9.934e-02 3.566e-02 2.785 0.005346 **
## V8 -1.549e-01 2.178e-02 -7.115 1.12e-12 ***
## V10 -9.290e-01 9.305e-02 -9.985 < 2e-16 ***
## V13 -3.307e-01 8.577e-02 -3.855 0.000116 ***
## V14 -5.229e-01 5.566e-02 -9.396 < 2e-16 ***
## V20 -2.388e-01 6.005e-02 -3.976 7.01e-05 ***
## V21 4.811e-01 5.259e-02 9.148 < 2e-16 ***
## V22 7.675e-01 1.277e-01 6.011 1.84e-09 ***
## V23 -1.522e-01 5.925e-02 -2.569 0.010212 *
## V27 -6.381e-01 1.295e-01 -4.927 8.34e-07 ***
## V28 -2.485e-01 9.881e-02 -2.515 0.011900 *
## Amount 2.713e-07 1.290e-04 0.002 0.998323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5799.1 on 227845 degrees of freedom
## Residual deviance: 1798.7 on 227831 degrees of freedom
## AIC: 1828.7
##
## Number of Fisher Scoring iterations: 11
We now have two non-significant predictors, V1 and Amount; they should also be removed.
set.seed(123)
finalmodel <- glm(Class ~ . - Time - V1 - V2 - V3 - V6 - V7 - V9 - V11 - V12 - V15 -
V16 - V17 - V18 - V19 - V24 - V25 - V26 - Amount, data = train, family = "binomial")
summary(finalmodel)
##
## Call:
## glm(formula = Class ~ . - Time - V1 - V2 - V3 - V6 - V7 - V9 -
## V11 - V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26 -
## Amount, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6285 -0.0289 -0.0186 -0.0117 4.5835
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.75058 0.14706 -59.505 < 2e-16 ***
## V4 0.69955 0.05265 13.288 < 2e-16 ***
## V5 0.10650 0.02586 4.119 3.81e-05 ***
## V8 -0.15525 0.01982 -7.833 4.76e-15 ***
## V10 -0.89573 0.07630 -11.740 < 2e-16 ***
## V13 -0.33583 0.08448 -3.975 7.02e-05 ***
## V14 -0.54238 0.04862 -11.155 < 2e-16 ***
## V20 -0.22318 0.04781 -4.668 3.04e-06 ***
## V21 0.47912 0.05205 9.204 < 2e-16 ***
## V22 0.78631 0.12439 6.321 2.60e-10 ***
## V23 -0.15046 0.05498 -2.736 0.00621 **
## V27 -0.58832 0.10411 -5.651 1.60e-08 ***
## V28 -0.23592 0.08901 -2.651 0.00804 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5799.1 on 227845 degrees of freedom
## Residual deviance: 1799.2 on 227833 degrees of freedom
## AIC: 1825.2
##
## Number of Fisher Scoring iterations: 11
For the other training sets we will use only the significant predictors from the above model.
Now let’s get the final results from the confusion matrix.
pred <- predict(finalmodel, test, type = "response")
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56856 41
## 1 7 57
##
## Accuracy : 0.9992
## 95% CI : (0.9989, 0.9994)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 1.581e-08
##
## Kappa : 0.7033
##
## Mcnemar's Test P-Value : 1.906e-06
##
## Sensitivity : 0.9999
## Specificity : 0.5816
## Pos Pred Value : 0.9993
## Neg Pred Value : 0.8906
## Prevalence : 0.9983
## Detection Rate : 0.9982
## Detection Prevalence : 0.9989
## Balanced Accuracy : 0.7908
##
## 'Positive' Class : 0
##
As we see, we have a very high accuracy rate, about 99.92%. However, this rate is almost the same as the no information rate, 99.83% (what we would get by predicting all cases as class label 0). In other words, this high rate is not due to the quality of the model but rather to the imbalanced classes. If we look at the specificity rate, it is about 58.16%, indicating that the model poorly predicts the fraudulent cards, which are the most important class label to predict correctly. Among the available metrics, one better suited to imbalanced data is Cohen’s kappa statistic; according to the interpretation scale suggested by Landis & Koch (1977), the kappa value obtained here, 0.7033, is a good score.
That said, we stick with the accuracy rate (alongside sensitivity and specificity) for pedagogical purposes, to show the effectiveness of the methods discussed above.
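As a side note, these metrics can be extracted programmatically from the object returned by caret’s confusionMatrix; for example:
cm <- confusionMatrix(as.factor(pred), test$Class)
cm$overall["Kappa"]        # kappa statistic (0.7033 here)
cm$byClass["Specificity"]  # specificity rate (0.5816 here)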
4.2 Upsampling the training set
Now let’s use the training data resulting from the upsampling method.
set.seed(123)
modelup <- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + V27 +
V28, data = trainup, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(modelup)
##
## Call:
## glm(formula = Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 +
## V21 + V22 + V23 + V27 + V28, family = "binomial", data = trainup)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2906 -0.2785 -0.0001 0.0159 2.8055
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.271053 0.011741 -278.610 < 2e-16 ***
## V4 0.952941 0.005478 173.966 < 2e-16 ***
## V5 0.126627 0.003976 31.846 < 2e-16 ***
## V8 -0.289448 0.004368 -66.261 < 2e-16 ***
## V10 -0.710629 0.009150 -77.665 < 2e-16 ***
## V13 -0.479344 0.007352 -65.200 < 2e-16 ***
## V14 -0.802941 0.006825 -117.638 < 2e-16 ***
## V20 -0.090453 0.007955 -11.371 < 2e-16 ***
## V21 0.233604 0.007702 30.332 < 2e-16 ***
## V22 0.209203 0.010125 20.662 < 2e-16 ***
## V23 -0.320073 0.005299 -60.399 < 2e-16 ***
## V27 -0.238132 0.017019 -13.992 < 2e-16 ***
## V28 -0.152294 0.019922 -7.644 2.1e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 630631 on 454903 degrees of freedom
## Residual deviance: 136321 on 454891 degrees of freedom
## AIC: 136347
##
## Number of Fisher Scoring iterations: 9
pred <- predict(modelup, test, type = "response")
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 55334 12
## 1 1529 86
##
## Accuracy : 0.9729
## 95% CI : (0.9716, 0.9743)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0975
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.97311
## Specificity : 0.87755
## Pos Pred Value : 0.99978
## Neg Pred Value : 0.05325
## Prevalence : 0.99828
## Detection Rate : 0.97144
## Detection Prevalence : 0.97165
## Balanced Accuracy : 0.92533
##
## 'Positive' Class : 0
##
Now we have a smaller accuracy rate, 97.29%, but a larger specificity rate, 87.75%, which increases the power of the model to predict the fraudulent cards.
4.3 Downsampling the training set
set.seed(123)
modeldown <- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 +
V27 + V28, data = traindown, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred <- predict(modeldown, test, type = "response")
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54837 12
## 1 2026 86
##
## Accuracy : 0.9642
## 95% CI : (0.9627, 0.9657)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0748
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.96437
## Specificity : 0.87755
## Pos Pred Value : 0.99978
## Neg Pred Value : 0.04072
## Prevalence : 0.99828
## Detection Rate : 0.96271
## Detection Prevalence : 0.96292
## Balanced Accuracy : 0.92096
##
## 'Positive' Class : 0
##
With the downsampling method we get approximately the same specificity rate, 87.75%, with a slight decrease in the overall accuracy rate, 96.42%; the sensitivity rate has also decreased, to 96.43%, since we have reduced the majority class size by downsampling.
4.4 Subsampling the training set with the ROSE technique
set.seed(123)
modelrose <- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 +
V27 + V28, data = trainrose, family = "binomial")
pred <- predict(modelrose, test, type = "response")
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56080 14
## 1 783 84
##
## Accuracy : 0.986
## 95% CI : (0.985, 0.987)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1715
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.98623
## Specificity : 0.85714
## Pos Pred Value : 0.99975
## Neg Pred Value : 0.09689
## Prevalence : 0.99828
## Detection Rate : 0.98453
## Detection Prevalence : 0.98478
## Balanced Accuracy : 0.92169
##
## 'Positive' Class : 0
##
Using this method, the specificity rate, 85.71%, is slightly smaller than with the previous methods, but it is still a large improvement in predicting the fraudulent cards compared to the model trained on the original imbalanced data.
4.5 Subsampling the training set with the SMOTE technique
set.seed(123)
modelsmote <- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 +
V27 + V28, data = trainsmote, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred <- predict(modelsmote, test, type = "response")
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 55457 14
## 1 1406 84
##
## Accuracy : 0.9751
## 95% CI : (0.9738, 0.9763)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1029
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.97527
## Specificity : 0.85714
## Pos Pred Value : 0.99975
## Neg Pred Value : 0.05638
## Prevalence : 0.99828
## Detection Rate : 0.97360
## Detection Prevalence : 0.97384
## Balanced Accuracy : 0.91621
##
## 'Positive' Class : 0
##
With this method we get the same specificity rate, 85.71%, as with the ROSE method.
5 Deep learning model (without class weights)
When we use deep learning models, some software lets us assign weights to the labels of the target variable. Here we will make use of the keras package. We will first train the model without weighting the data, then retrain the same model after assigning a weight to the minority class.
To train this model we first convert the data (train and test sets) into numeric matrices and remove the column names (we also convert Class to numeric type). In order to be in line with the above models we keep only their features; this time, however, it is better to normalize the data, since this helps the gradient-based optimization converge faster.
spsm(library(keras))
train1 <- train[, c("V4", "V5", "V8", "V10", "V13", "V14", "V20", "V21", "V22", "V23",
"V27", "V28", "Class")]
test1 <- test[, c("V4", "V5", "V8", "V10", "V13", "V14", "V20", "V21", "V22", "V23",
"V27", "V28", "Class")]
train1$Class <- as.numeric(train1$Class)
test1$Class <- as.numeric(test1$Class)
train1[, "Class"] <- train1[, "Class"] - 1
test1[, "Class"] <- test1[, "Class"] - 1
trainx <- train1[, -ncol(train1)]
testx <- test1[, -ncol(test1)]
trained <- as.matrix(trainx)
tested <- as.matrix(testx)
trainy <- train1$Class
testy <- test1$Class
dimnames(trained) <- NULL
dimnames(tested) <- NULL
Then we apply one-hot encoding to the target variable.
trainlabel <- to_categorical(trainy)
testlabel <- to_categorical(testy)
The final step is normalizing the matrices (trained and tested).
trained1 <- normalize(trained)
tested1 <- normalize(tested)
Now we are ready to create the model, with two hidden layers, each followed by a dropout layer.
modeldeep <- keras_model_sequential()
modeldeep %>% layer_dense(units = 32, activation = "relu", kernel_initializer = "he_normal",
input_shape = c(12)) %>% layer_dropout(rate = 0.2) %>% layer_dense(units = 64,
activation = "relu", kernel_initializer = "he_normal") %>% layer_dropout(rate = 0.4) %>%
layer_dense(units = 2, activation = "sigmoid")
summary(modeldeep)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense (Dense) (None, 32) 416
## ________________________________________________________________________________
## dropout (Dropout) (None, 32) 0
## ________________________________________________________________________________
## dense_1 (Dense) (None, 64) 2112
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 64) 0
## ________________________________________________________________________________
## dense_2 (Dense) (None, 2) 130
## ================================================================================
## Total params: 2,658
## Trainable params: 2,658
## Non-trainable params: 0
## ________________________________________________________________________________
We will use the accuracy rate as the metric, binary cross-entropy as the loss function since we are dealing with a binary classification problem, and adam as the optimizer.
modeldeep %>% compile(loss = "binary_crossentropy", optimizer = "adam", metric = "accuracy")
During training, the model will use 10 epochs (the default), a batch size of 5 samples to update the weights, and will hold out 20% of the inputs (training samples) to assess the model.
You can run this model many times until you are satisfied with the results; then it is better to save it and load it again each time you need it. The fitting and saving chunk itself is not shown here; a minimal sketch of what it might look like, assuming the settings just described, is the following.
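# hedged sketch: the actual fitting chunk is not shown in the source
history <- modeldeep %>% fit(trained1, trainlabel, epochs = 10, batch_size = 5,
    validation_split = 0.2)
# save under the file name used below when reloading
save_model_hdf5(modeldeep, "modeldeep.h5")
The saved model is then reloaded: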
modeldeep <- load_model_hdf5("modeldeep.h5")
All the above metric values are computed during the training process, so they are not very reliable. More reliable values are those computed on unseen data.
pred <- modeldeep %>% predict_classes(tested1)
confusionMatrix(as.factor(pred), as.factor(testy))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56858 64
## 1 5 34
##
## Accuracy : 0.9988
## 95% CI : (0.9985, 0.9991)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 0.00125
##
## Kappa : 0.4959
##
## Mcnemar's Test P-Value : 2.902e-12
##
## Sensitivity : 0.9999
## Specificity : 0.3469
## Pos Pred Value : 0.9989
## Neg Pred Value : 0.8718
## Prevalence : 0.9983
## Detection Rate : 0.9982
## Detection Prevalence : 0.9993
## Balanced Accuracy : 0.6734
##
## 'Positive' Class : 0
##
As with the model trained on the raw data, the specificity rate suffers; at 0.3469 it is even worse than in the other models, which is again caused by the imbalanced data.
5.1 Deep learning model with class weights
Now let’s retrain the previous model, this time taking the class imbalance into account.
modeldeep1 <- keras_model_sequential()
modeldeep1 %>% layer_dense(units = 32, activation = "relu", kernel_initializer = "he_normal",
input_shape = c(12)) %>% layer_dropout(rate = 0.2) %>% layer_dense(units = 64,
activation = "relu", kernel_initializer = "he_normal") %>% layer_dropout(rate = 0.4) %>%
layer_dense(units = 2, activation = "sigmoid")
modeldeep1 %>% compile(loss = "binary_crossentropy", optimizer = "adam", metric = "accuracy")
To define the appropriate weight, we divide the proportion of the majority class by the proportion of the minority class, which tells us how many times larger the former is than the latter.
prop.table(table(data$Class))[1]/prop.table(table(data$Class))[2]
## 0
## 577.876
Now we include this value as a weight in the class_weight argument.
As before, the model was saved ahead of knitting the document, and the fitting chunk is commented out in the source; uncomment it if you want to run it yourself. A minimal sketch of that weighted fit, assuming the same settings as before and the weight computed above (rounded to 578), might be:
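# hedged sketch: the actual chunk is commented out in the source;
# class "1" (fraud) is weighted roughly 578 times class "0"
history1 <- modeldeep1 %>% fit(trained1, trainlabel, epochs = 10, batch_size = 5,
    validation_split = 0.2, class_weight = list("0" = 1, "1" = 578))
save_model_hdf5(modeldeep1, "modeldeep1.h5")
The saved model is then reloaded: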
modeldeep1 <- load_model_hdf5("modeldeep1.h5")
Now let’s get the confusion matrix.
pred <- modeldeep1 %>% predict_classes(tested1)
confusionMatrix(as.factor(pred), as.factor(testy))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 55303 14
## 1 1560 84
##
## Accuracy : 0.9724
## 95% CI : (0.971, 0.9737)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0935
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.97257
## Specificity : 0.85714
## Pos Pred Value : 0.99975
## Neg Pred Value : 0.05109
## Prevalence : 0.99828
## Detection Rate : 0.97089
## Detection Prevalence : 0.97114
## Balanced Accuracy : 0.91485
##
## 'Positive' Class : 0
##
Using this model we get a lower accuracy rate, 0.9724, but the specificity rate is much higher than in the previous model, so this model predicts the negative class label as well as the positive one.
6 Conclusion
With imbalanced data, most machine learning models tend to predict the majority class more efficiently than the minority class. To correct this behavior we can use one of the methods discussed above to bring the class-wise accuracy rates closer together. Deep learning models, however, can handle this problem easily by specifying class weights.