Predicting a binary response variable with the h2o framework

1 Introduction

H2O is an open-source, distributed, scalable framework for training machine learning and deep learning models, as well as for data analysis. It handles large data sets with ease by creating a cluster from the available nodes. Fortunately, it provides an API for R users so they can get the most benefit from it, especially with large data sets, where R has its main limitations.

The beauty is that R users can load and use this system via the package h2o, which can be called and used like any other R package.

# install.packages("h2o") if not already installed
library(tidyverse)
-- Attaching packages -------------
v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.3     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0
Warning: package 'ggplot2' was built under R version 4.0.2
Warning: package 'tibble' was built under R version 4.0.2
Warning: package 'tidyr' was built under R version 4.0.2
Warning: package 'dplyr' was built under R version 4.0.2
-- Conflicts ----------------------
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(h2o)

----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit https://docs.h2o.ai

----------------------------------------------------------------------

Attaching package: 'h2o'
The following objects are masked from 'package:stats':

    cor, sd, var
The following objects are masked from 'package:base':

    %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc

Then, to launch the cluster, run the following script:

h2o.init(nthreads = -1)

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\dell\AppData\Local\Temp\RtmpGuHBDV\file2e5438ee3e8c/h2o_dell_started_from_r.out
    C:\Users\dell\AppData\Local\Temp\RtmpGuHBDV\file2e54103214ed/h2o_dell_started_from_r.err


Starting H2O JVM and connecting: . Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         6 seconds 974 milliseconds 
    H2O cluster timezone:       Europe/Paris 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.30.1.3 
    H2O cluster version age:    13 days  
    H2O cluster name:           H2O_started_from_R_dell_sgv874 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.99 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.1 (2020-06-06) 

Looking at this output, we see that h2o runs on the Java virtual machine (JVM), so you need Java already installed. Notice that I set the nthreads argument to -1 to tell h2o to create its cluster using all the cores available on my machine.
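
You can also cap the JVM heap yourself if the default (about 2 GB here) does not suit your machine; max_mem_size is a standard h2o.init argument, and the value below is only an illustration.

h2o.init(nthreads = -1, max_mem_size = "4G")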

Since our purpose is to understand how to work with h2o, we will use a small data set in which the response is a binary variable. The data we will use is creditcard, downloaded from the Kaggle website.

2 Data preparation

To import the data directly into the h2o cluster, we use the function h2o.importFile as follows.

card <- h2o.importFile("../creditcard.csv")

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |======================================================================| 100%

The following script gives the dimension of this data.

h2o.dim(card)
[1] 284807     31

This data has 284807 rows and 31 columns. According to the description of this data, the response variable is Class, with two values: 1 for a fraudulent transaction and 0 for a regular one. The other variables are PCA components derived from the original ones for privacy reasons, for instance to protect the users’ identities.
So first let’s check the summary of this data.

knitr::kable(h2o.describe(card))
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
Time int 0 2 0 0 0.000000 1.727920e+05 9.481386e+04 4.748815e+04 NA
V1 real 0 0 0 0 -56.407510 2.454930e+00 0.000000e+00 1.958696e+00 NA
V2 real 0 0 0 0 -72.715728 2.205773e+01 0.000000e+00 1.651309e+00 NA
V3 real 0 0 0 0 -48.325589 9.382558e+00 0.000000e+00 1.516255e+00 NA
V4 real 0 0 0 0 -5.683171 1.687534e+01 0.000000e+00 1.415869e+00 NA
V5 real 0 0 0 0 -113.743307 3.480167e+01 0.000000e+00 1.380247e+00 NA
V6 real 0 0 0 0 -26.160506 7.330163e+01 0.000000e+00 1.332271e+00 NA
V7 real 0 0 0 0 -43.557242 1.205895e+02 0.000000e+00 1.237094e+00 NA
V8 real 0 0 0 0 -73.216718 2.000721e+01 0.000000e+00 1.194353e+00 NA
V9 real 0 0 0 0 -13.434066 1.559499e+01 0.000000e+00 1.098632e+00 NA
V10 real 0 0 0 0 -24.588262 2.374514e+01 0.000000e+00 1.088850e+00 NA
V11 real 0 0 0 0 -4.797473 1.201891e+01 0.000000e+00 1.020713e+00 NA
V12 real 0 0 0 0 -18.683715 7.848392e+00 0.000000e+00 9.992014e-01 NA
V13 real 0 0 0 0 -5.791881 7.126883e+00 0.000000e+00 9.952742e-01 NA
V14 real 0 0 0 0 -19.214326 1.052677e+01 0.000000e+00 9.585956e-01 NA
V15 real 0 0 0 0 -4.498945 8.877742e+00 0.000000e+00 9.153160e-01 NA
V16 real 0 0 0 0 -14.129855 1.731511e+01 0.000000e+00 8.762529e-01 NA
V17 real 0 0 0 0 -25.162799 9.253526e+00 0.000000e+00 8.493371e-01 NA
V18 real 0 0 0 0 -9.498746 5.041069e+00 0.000000e+00 8.381762e-01 NA
V19 real 0 0 0 0 -7.213527 5.591971e+00 0.000000e+00 8.140405e-01 NA
V20 real 0 0 0 0 -54.497720 3.942090e+01 0.000000e+00 7.709250e-01 NA
V21 real 0 0 0 0 -34.830382 2.720284e+01 0.000000e+00 7.345240e-01 NA
V22 real 0 0 0 0 -10.933144 1.050309e+01 0.000000e+00 7.257016e-01 NA
V23 real 0 0 0 0 -44.807735 2.252841e+01 0.000000e+00 6.244603e-01 NA
V24 real 0 0 0 0 -2.836627 4.584549e+00 0.000000e+00 6.056471e-01 NA
V25 real 0 0 0 0 -10.295397 7.519589e+00 0.000000e+00 5.212781e-01 NA
V26 real 0 0 0 0 -2.604551 3.517346e+00 0.000000e+00 4.822270e-01 NA
V27 real 0 0 0 0 -22.565679 3.161220e+01 0.000000e+00 4.036325e-01 NA
V28 real 0 0 0 0 -15.430084 3.384781e+01 0.000000e+00 3.300833e-01 NA
Amount real 0 1825 0 0 0.000000 2.569116e+04 8.834962e+01 2.501201e+02 NA
Class int 0 284315 0 0 0.000000 1.000000e+00 1.727500e-03 4.152720e-02 NA

The most important issues we usually check first are missing values and, for classification problems, class imbalance.

For the missing values, you should know that a value is recognized by R as missing only if it is written as NA or left as a blank cell. If a missing value in imported data is written in any other format, for instance as a string like na or missing, we should tell R that these are missing values so they are converted to NA. The same applies, as in our case, when a variable takes the value zero although it should not. The Amount variable, for instance: we know that any transaction involves some amount of money, so it should never equal zero, yet in the data it has 1825 zeros. The same goes for the Time variable, with two zeros. However, since the data is large, this is not a big issue, and we can comfortably remove these rows.

card$Amount <- h2o.ifelse(card$Amount == 0, NA, card$Amount)
card$Time <- h2o.ifelse(card$Time == 0, NA, card$Time)
card <- h2o.na_omit(card)

It is good practice to check your output after each transformation to make sure your code did what you expected.

knitr::kable(h2o.describe(card))
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
Time int 0 0 0 0 1.000000 1.727920e+05 94849.6338858 4.748196e+04 NA
V1 real 0 0 0 0 -56.407510 2.454930e+00 -0.0003483 1.956753e+00 NA
V2 real 0 0 0 0 -72.715728 2.205773e+01 -0.0020179 1.650496e+00 NA
V3 real 0 0 0 0 -48.325589 9.382558e+00 -0.0033027 1.514214e+00 NA
V4 real 0 0 0 0 -5.683171 1.687534e+01 -0.0119933 1.404852e+00 NA
V5 real 0 0 0 0 -113.743307 3.480167e+01 -0.0022396 1.378819e+00 NA
V6 real 0 0 0 0 -26.160506 7.330163e+01 -0.0013051 1.331596e+00 NA
V7 real 0 0 0 0 -43.557242 1.205895e+02 0.0025090 1.233944e+00 NA
V8 real 0 0 0 0 -73.216718 2.000721e+01 0.0000269 1.191177e+00 NA
V9 real 0 0 0 0 -13.320155 1.559499e+01 0.0014642 1.099065e+00 NA
V10 real 0 0 0 0 -24.588262 2.374514e+01 -0.0022783 1.087587e+00 NA
V11 real 0 0 0 0 -4.797473 1.201891e+01 0.0023114 1.018693e+00 NA
V12 real 0 0 0 0 -18.683715 7.848392e+00 0.0008656 9.972279e-01 NA
V13 real 0 0 0 0 -5.791881 7.126883e+00 0.0006992 9.945502e-01 NA
V14 real 0 0 0 0 -19.214326 1.052677e+01 0.0002020 9.555395e-01 NA
V15 real 0 0 0 0 -4.498945 8.877742e+00 0.0036456 9.137113e-01 NA
V16 real 0 0 0 0 -14.129855 1.731511e+01 -0.0010958 8.760560e-01 NA
V17 real 0 0 0 0 -25.162799 9.253526e+00 0.0016190 8.462568e-01 NA
V18 real 0 0 0 0 -9.498746 5.041069e+00 0.0013067 8.386969e-01 NA
V19 real 0 0 0 0 -7.213527 5.591971e+00 0.0019147 8.119902e-01 NA
V20 real 0 0 0 0 -54.497720 3.942090e+01 0.0009807 7.705625e-01 NA
V21 real 0 0 0 0 -34.830382 2.720284e+01 -0.0000481 7.326525e-01 NA
V22 real 0 0 0 0 -10.933144 1.050309e+01 -0.0016073 7.255767e-01 NA
V23 real 0 0 0 0 -44.807735 2.252841e+01 0.0001474 6.230342e-01 NA
V24 real 0 0 0 0 -2.836627 4.584549e+00 0.0002018 6.057968e-01 NA
V25 real 0 0 0 0 -10.295397 7.519589e+00 -0.0005087 5.209869e-01 NA
V26 real 0 0 0 0 -2.604551 3.517346e+00 -0.0013648 4.819297e-01 NA
V27 real 0 0 0 0 -22.565679 3.161220e+01 0.0002533 4.029874e-01 NA
V28 real 0 0 0 0 -15.430084 3.384781e+01 0.0001927 3.303524e-01 NA
Amount real 0 0 0 0 0.010000 2.569116e+04 88.9194915 2.508252e+02 NA
Class int 0 282515 0 0 0.000000 1.000000e+00 0.0016432 4.050350e-02 NA

In contrast, we do have a very serious imbalance problem: the Class variable, which takes only the two values 1 and 0, has a mean of about 0.0016, which means that the vast majority of observations have the label 0.

h2o.table(card$Class)
  Class  Count
1     0 282515
2     1    465

[2 rows x 2 columns] 

As expected, the vast majority of cases have the label 0. Any machine learning model fitted to this data without correcting this problem will be dominated by the label 0 and will hardly ever correctly predict the fraudulent transactions (label 1), which are our main interest.

The h2o package provides a way to correct the imbalance problem. For glm models, for instance, we have three arguments for this purpose (a minimal sketch follows the list):

  • balance_classes: if set to TRUE, the classes are balanced by over/under-sampling, with ratios computed automatically unless specified in the next argument.
  • class_sampling_factors: the desired sampling ratios per class (over- or under-sampling).
  • max_after_balance_size: the desired relative size of the training data after balancing the class counts.
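
As a minimal sketch of how these arguments fit together (not run here; the sampling values are only illustrative, and the response must already be a factor, which we do just below), a balanced glm call would look like this:

#model_balanced <- h2o.glm(
#  x = 1:30,
#  y = 31,
#  training_frame = card,
#  family = "binomial",
#  balance_classes = TRUE,
#  class_sampling_factors = c(1, 50),  # illustrative under/over-sampling ratios per class
#  max_after_balance_size = 2,         # balanced data at most twice the original size
#  seed = 123
#)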

Before going ahead, we should split the data randomly into a training set (80% of the data) and a testing set (the remaining 20%).

card$Class <- h2o.asfactor(card$Class)

parts <- h2o.splitFrame(card, 0.8, seed = 1111)
train <- parts[[1]]
test <- parts[[2]]
h2o.table(train$Class)
  Class  Count
1     0 226268
2     1    372

[2 rows x 2 columns] 
h2o.table(test$Class)
  Class Count
1     0 56247
2     1    93

[2 rows x 2 columns] 

3 Logistic regression

For binary classification problems, the first model that comes to mind is the logistic regression model. This model belongs to the family of glm models: when we set the argument family to binomial, we get a logistic regression model. The following are the main arguments of glm models (besides the arguments discussed above):

  • x: should contain the predictor names (not the data itself) or their indices.
  • y: the name of the response variable (again, not the whole column).
  • training_frame: the training data frame.
  • model_id: to name the model.
  • nfolds: the number of folds to use for cross-validation for hyperparameter tuning.
  • seed: for reproducibility.
  • fold_assignment: the scheme of the cross-validation: AUTO, Random, Stratified, or Modulo.
  • family: many distributions are provided; for a binary response we have binomial and quasibinomial.
  • solver: the algorithm used; with AUTO, h2o decides the best one given the data, but you can choose another one such as IRLSM, L_BFGS, or COORDINATE_DESCENT.
  • alpha: the ratio mixing the L1 (lasso) and L2 (ridge) regularization; larger values yield more lasso.
  • lambda_search: lambda is the regularization strength; if TRUE, the model searches over different lambda values.
  • standardize: to standardize the numeric columns.
  • compute_p_values: to compute p-values; it does not work with regularization.
  • link: the link function.
  • interactions: to specify interactions between predictors.

Now we are ready to train our model with some specified values. But first, let’s try to use the original data without correcting the imbalance problem.

model_logit <- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train,
  model_id = "glm_binomial_no_eg",
  seed = 123,
  lambda = 0,
  family = "binomial",
  solver = "IRLSM",
  standardize = TRUE,
  link = "family_default"
)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |======================================================================| 100%

h2o provides a bunch of metrics already computed during the training process, along with the confusion matrix. We can access them by calling the function h2o.performance.

h2o.performance(model_logit)
H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.0006269349
RMSE:  0.02503867
LogLoss:  0.003809522
Mean Per-Class Error:  0.1103587
AUC:  0.9731273
AUCPR:  0.7485898
Gini:  0.9462545
R^2:  0.6174137
Residual Deviance:  1726.78
AIC:  1788.78

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0   1    Error         Rate
0      226203  65 0.000287   =65/226268
1          82 290 0.220430      =82/372
Totals 226285 355 0.000649  =147/226640

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold         value idx
1                       max f1  0.135889      0.797799 176
2                       max f2  0.058169      0.808081 200
3                 max f0point5  0.475561      0.833333 102
4                 max accuracy  0.135889      0.999351 176
5                max precision  0.999976      0.909091   0
6                   max recall  0.000027      1.000000 397
7              max specificity  0.999976      0.999960   0
8             max absolute_mcc  0.135889      0.797693 176
9   max min_per_class_accuracy  0.001118      0.919355 345
10 max mean_per_class_accuracy  0.002782      0.934336 314
11                     max tns  0.999976 226259.000000   0
12                     max fns  0.999976    282.000000   0
13                     max fps  0.000007 226268.000000 399
14                     max tps  0.000027    372.000000 397
15                     max tnr  0.999976      0.999960   0
16                     max fnr  0.999976      0.758065   0
17                     max fpr  0.000007      1.000000 399
18                     max tpr  0.000027      1.000000 397

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

To extract only the confusion matrix, we call the function h2o.confusionMatrix.

h2o.confusionMatrix(model_logit)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.135888872638703:
            0   1    Error         Rate
0      226203  65 0.000287   =65/226268
1          82 290 0.220430      =82/372
Totals 226285 355 0.000649  =147/226640

By looking at the confusion matrix, we get a very low error rate for the major label (0.029%), whereas the error rate for the minor label is quite high (22.04%). This result is expected, since the data is highly dominated by the label 0.

h2o.confusionMatrix(model_logit, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.0767397449673996:
           0  1    Error       Rate
0      56223 24 0.000427  =24/56247
1         19 74 0.204301     =19/93
Totals 56242 98 0.000763  =43/56340

Using the testing set, the error rate of the major class, 0.043%, is a little larger than the corresponding training rate, whereas the error rate of the minor class, 20.43%, is smaller than its training counterpart (22.04%).
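
If we need the individual predictions rather than the aggregated confusion matrix, we can score the test set directly with h2o.predict; a short sketch (for a binomial glm the output contains the predicted class and the probabilities p0 and p1):

pred <- h2o.predict(model_logit, test)
head(pred)
h2o.table(pred$predict)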

We could correct the imbalance problem by setting the argument balance_classes to TRUE. Unfortunately, I trained this model many times, but this argument did not seem to work for some reason. I do not know whether this problem occurs for everyone with this version of h2o or just for me, due to some problem with my laptop. Anyway, I posted a question about it on Stack Overflow but had not received any answer at the time of writing.

Alternatively, we can correct the imbalance problem by loading the data into R as a data frame, correcting it with the ROSE package, and then converting the corrected data back to an h2o object.

Note: loading data from h2o into R will not always be possible for a very large dataset. I am using this alternative only so that we can carry on with our analysis and do not get stuck.

train_R <- as.data.frame(train)
train_balance <- ROSE::ROSE(Class~., data=train_R, seed=111)$data
table(train_balance$Class)

     0      1 
113244 113396 

Now we feed this corrected data to our model again after converting it back to h2o.

train_h <- as.h2o(train_balance)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
model_logit2 <- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train_h,
  model_id = "glm_binomial_balance",
  seed = 123,
  lambda = 0,
  family = "binomial",
  solver = "IRLSM",
  standardize = TRUE,
  link = "family_default"
)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |======================================================================| 100%

We can check the confusion matrix as follows.

h2o.confusionMatrix(model_logit2)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.448958365921134:
            0      1    Error           Rate
0      110591   2653 0.023427   =2653/113244
1       12594 100802 0.111062  =12594/113396
Totals 123185 103455 0.067274  =15247/226640

Since the reliable measure of model performance is its behaviour on unseen data, let's use our testing set.

h2o.confusionMatrix(model_logit2, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.9289188100923:
           0   1    Error       Rate
0      56200  47 0.000836  =47/56247
1         16  77 0.172043     =16/93
Totals 56216 124 0.001118  =63/56340

Since we are more interested in the minor class, we will consider it an improvement if we get a lower error rate for that class. After correcting the class imbalance problem, the minor class error rate has dropped from 20.43% to 17.20%.

One strategy to improve our model is to remove the less important variables by hand using a threshold. h2o provides a function that lists the predictors in decreasing order of their importance in predicting the response variable, so we can remove the least important ones in the hope of reducing the error rate of the minor class; a sketch of this follows the importance output below.

h2o.varimp(model_logit)
   variable relative_importance scaled_importance   percentage
1        V4         0.994893006       1.000000000 0.1440671995
2       V10         0.916557351       0.921262231 0.1327236697
3       V14         0.532427886       0.535160950 0.0770991394
4       V22         0.481085303       0.483554815 0.0696643880
5        V9         0.371758112       0.373666424 0.0538330752
6       V20         0.360902368       0.362754956 0.0522610906
7       V27         0.340929107       0.342679168 0.0493688281
8       V13         0.324390153       0.326055315 0.0469738762
9       V21         0.309050873       0.310637296 0.0447526453
10      V16         0.230524987       0.231708320 0.0333815688
11   Amount         0.213887758       0.214985688 0.0309723861
12       V8         0.211491780       0.212577411 0.0306254323
13     Time         0.209412404       0.210487362 0.0303243248
14       V6         0.189761276       0.190735361 0.0274787093
15       V5         0.176081346       0.176985209 0.0254977634
16       V1         0.164618852       0.165463875 0.0238379171
17      V12         0.134542774       0.135233410 0.0194826987
18      V11         0.129031560       0.129693906 0.0186846379
19      V28         0.093633317       0.094113956 0.0135587341
20      V26         0.093283287       0.093762130 0.0135080475
21      V17         0.081893193       0.082313568 0.0118586852
22       V7         0.077962451       0.078362649 0.0112894873
23      V23         0.067840817       0.068189058 0.0098238066
24      V18         0.065741510       0.066078975 0.0095198129
25      V25         0.033325258       0.033496323 0.0048257215
26       V2         0.029047974       0.029197083 0.0042063420
27      V24         0.025833162       0.025965769 0.0037408156
28      V19         0.022354254       0.022469003 0.0032370463
29      V15         0.020189854       0.020293493 0.0029236267
30       V3         0.003304571       0.003321534 0.0004785241

Or as a plot:

h2o.varimp_plot(model_logit)
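
As a sketch of the by-hand strategy mentioned above, we could keep only the predictors whose importance share exceeds some threshold (the 1% cut-off below is an arbitrary choice) and refit the balanced model on them:

imp <- as.data.frame(h2o.varimp(model_logit))
keep <- imp$variable[imp$percentage > 0.01]  # arbitrary 1% threshold
model_logit_reduced <- h2o.glm(
  x = keep,
  y = "Class",
  training_frame = train_h,
  family = "binomial",
  lambda = 0,
  seed = 123
)
h2o.confusionMatrix(model_logit_reduced, test)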

A better strategy to remove the less important variables is lasso regression (L1), which strips out the less important ones automatically and is therefore also known as a feature selection method. Lasso, like ridge regression (L2), is a regularization technique used to fight overfitting; besides that, it is also known as a reduction technique, since it reduces the number of predictors. We enable this method in h2o by setting alpha = 1, where alpha controls the trade-off between lasso (L1) and ridge regression (L2); an alpha closer to zero means more ridge than lasso.

model_lasso <- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train_h,
  model_id = "glm_binomial_lasso",
  seed = 123,
  alpha = 1,
  family = "binomial",
  solver = "IRLSM",
  standardize = TRUE,
  link = "family_default"
)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |======================================================================| 100%
h2o.confusionMatrix(model_lasso)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.429938956856689:
            0      1    Error           Rate
0      110315   2929 0.025865   =2929/113244
1       12339 101057 0.108813  =12339/113396
Totals 122654 103986 0.067367  =15268/226640

Using the testing set, the confusion matrix will be:

h2o.confusionMatrix(model_lasso, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.958116537926135:
           0   1    Error       Rate
0      56210  37 0.000658  =37/56247
1         20  73 0.215054     =20/93
Totals 56230 110 0.001012  =57/56340

With the lasso model, the error rate of the minor class has increased from 17.20% to 21.51%, which contradicts the improvement recorded on the training data, where the rate decreased from 11.11% to 10.88% with the lasso model.

The last thing about hyperparameter tuning is that some hyperparameters are not supported by the h2o.grid function, for instance the solver argument. This is not an issue, though, since we can simply loop over the hyperparameters in question. Let's explore the most popular solvers using the R lapply function.

solvers <- c(
  "IRLSM",
  "L_BFGS",
  "COORDINATE_DESCENT"
)

mygrid <- lapply(solvers, function(solver) {
  h2o.glm(
    x = 1:30,
    y = 31,
    training_frame = train_h,
    seed = 123,
    model_id = paste0("logit_", solver),
    family = "binomial",
    solver = solver,
    standardize = TRUE,
    link = "family_default"
  )
})

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |======================================================================| 100%

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |======================================================================| 100%
df <- cbind(
  h2o.confusionMatrix(mygrid[[1]])$Error,
  h2o.confusionMatrix(mygrid[[2]])$Error,
  h2o.confusionMatrix(mygrid[[3]])$Error
)
df <- t(round(df, digits = 6))
dimnames(df) <- list(
  list("IRLSM", "L_BFGS",  "COORDINATE_DESCENT"),
  list("Error (0)", "Error (1)", "Total Error")
  
)
df
                   Error (0) Error (1) Total Error
IRLSM               0.024169  0.110313    0.067270
L_BFGS              0.024354  0.110189    0.067301
COORDINATE_DESCENT  0.025909  0.109051    0.067508

It seems there is no significant difference between these solvers. If we focus on the error of the minor class, however, COORDINATE_DESCENT seems to be the best one, with the lowest error. But this could simply be the result of chance, since we did not use cross-validation.
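
One way to check this would be to rerun the same loop with cross-validation and compare the cross-validated logloss instead; a hedged sketch (nfolds = 5 multiplies the training time accordingly):

mygrid_cv <- lapply(solvers, function(solver) {
  h2o.glm(
    x = 1:30,
    y = 31,
    training_frame = train_h,
    nfolds = 5,
    seed = 123,
    family = "binomial",
    solver = solver
  )
})
sapply(mygrid_cv, h2o.logloss, xval = TRUE)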

4 Random forest

The random forest model is among the most popular machine learning models due to its ability to capture even complex patterns in the data. This capability, however, can also be considered a downside, since the model tends to memorize everything in the data, including the noise, which gives rise to the overfitting problem. That is why this model has a large number of hyperparameters, for regularization among other things, to control the training process. The main hyperparameters provided by h2o are the following1:

  • seed: for reproducibility.
  • ntrees: the number of trees used (also called iterations). The default is 50.
  • max_depth: the maximum depth allowed for each tree. The default is 20.
  • mtries: the number of columns chosen randomly for each tree. The default is \(\sqrt{p}\) for classification and \(\frac{p}{3}\) for regression (where p is the number of columns).
  • sample_rate: the proportion of the training data selected randomly for each tree. The default is 63.2%.
  • balance_classes: the most important hyperparameter for our data, since it is highly imbalanced. The default is FALSE; if set to TRUE, the model corrects this problem by using over/under-sampling.
  • min_rows: the minimum number of instances in a node to allow splitting that node. The default is 1.
  • min_split_improvement: the minimum error reduction required to make a further split. The default is 0.
  • binomial_double_trees: for binary classification. If TRUE, the model builds two random forests, one for each output class. This method can give higher accuracy at the cost of doubling the computation time.
  • stopping_rounds: the number of iterations used for early stopping: training stops if the moving average of the stopping_metric (over this number of iterations) does not improve. The default is 0, which means early stopping is disabled.
  • stopping_metric: works with the previous argument. The default is AUTO, that is, logloss for classification and deviance for regression, but we also have MSE, RMSE, MAE, AUC, and misclassification.
  • stopping_tolerance: The threshold under which we consider no improvement. The default is 0.001.

First, let's try this model with the default values, except for balance_classes, which we set to TRUE. Fortunately, unlike for glm models, this argument works fine with the random forest model.

model_rf <- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = "rf_default",
  balance_classes = TRUE
)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

Now we check how this model did with the training data.

h2o.performance(model_rf)
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.03503944
RMSE:  0.1871882
LogLoss:  0.1012676
Mean Per-Class Error:  6.629307e-06
AUC:  0.999995
AUCPR:  0.9999938
Gini:  0.9999901
R^2:  0.8598422

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0      1    Error       Rate
0      226265      3 0.000013  =3/226268
1           0 226262 0.000000  =0/226262
Totals 226265 226265 0.000007  =3/452530

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold         value idx
1                       max f1  0.060268      0.999993 397
2                       max f2  0.060268      0.999997 397
3                 max f0point5  0.060268      0.999989 397
4                 max accuracy  0.060268      0.999993 397
5                max precision  1.000000      1.000000   0
6                   max recall  0.060268      1.000000 397
7              max specificity  1.000000      1.000000   0
8             max absolute_mcc  0.060268      0.999987 397
9   max min_per_class_accuracy  0.060268      0.999987 397
10 max mean_per_class_accuracy  0.060268      0.999993 397
11                     max tns  1.000000 226268.000000   0
12                     max fns  1.000000 132282.000000   0
13                     max fps  0.000002 226268.000000 399
14                     max tps  0.060268 226262.000000 397
15                     max tnr  1.000000      1.000000   0
16                     max fnr  1.000000      0.584641   0
17                     max fpr  0.000002      1.000000 399
18                     max tpr  0.060268      1.000000 397

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Surprisingly, the model is almost perfect, with a 0.0007% overall error rate, which is very suspicious: the model has likely memorized everything, even the noisy patterns. The real challenge for every model is how well it generalizes to unseen data; that is why we should always hold out some data as a testing set to assess model performance.

h2o.confusionMatrix(model_rf, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00208813635434166:
           0  1    Error       Rate
0      56240  7 0.000124   =7/56247
1         16 77 0.172043     =16/93
Totals 56256 84 0.000408  =23/56340

As expected, the model overfitted the data: the error rate of the minor class on the testing set is now much larger, and is the same as the one obtained from the balanced logistic regression model (17.20%).

4.1 Random forest with binomial double trees

Before going ahead with hyperparameter tuning, let's try the binomial double trees technique discussed above.

model_rf_dbl <- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = "rf_double",
  binomial_double_trees = TRUE
)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
h2o.confusionMatrix(model_rf_dbl)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.473692341854698:
            0   1    Error        Rate
0      226258  10 0.000044  =10/226268
1          83 289 0.223118     =83/372
Totals 226341 299 0.000410  =93/226640
h2o.confusionMatrix(model_rf_dbl, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.42:
           0  1    Error       Rate
0      56239  8 0.000142   =8/56247
1         13 80 0.139785     =13/93
Totals 56252 88 0.000373  =21/56340

As we can see, this model is the best one so far, with the lowest error rate for the minor class at 13.98%.

4.2 Random forest tuning

We can try to tune the hyperparameters related to regularization to fight the overfitting problem. For instance, we use lower values for max_depth and larger values for min_rows to prune the trees, and lower values for sample_rate so that each tree focuses on a smaller part of the training data. We also set some values to stop the training process early if we do not obtain a significant improvement. Finally, to reduce the randomness of the results, we use cross-validation.

#model_rftuned <- h2o.grid(
#  "randomForest",
#  hyper_params = list(
#    max_depth = c(5, 10),
#    min_rows = c(10, 20, 30),
#    sample_rate = c(0.3, 0.5)
#  ),
# stopping_rounds = 5,
#  stopping_metric = "AUTO",
#  stopping_tolerance = 0.001,
#  balance_classes = TRUE,
#  nfolds = 5,
#  fold_assignment = "Stratified",
#  x = 1:30,
#  y = 31,
#  training_frame = train
#)

Since this model took a lot of time to train, I saved the following output in a csv file and then loaded it again.

#df_output <- model_rftuned@summary_table %>% 
#  select(max_depth, min_rows, sample_rate, logloss) %>% 
#  arrange(logloss)
#write.csv(df_output, "df_output.csv",  row.names = F)
df_output <- read.csv("df_output.csv")
knitr::kable(df_output)
max_depth min_rows sample_rate logloss
10 30 0.3 0.0041177
10 20 0.3 0.0041834
10 30 0.5 0.0043959
5 30 0.3 0.0044893
10 10 0.3 0.0045269
5 20 0.5 0.0045655
10 10 0.5 0.0045780
10 20 0.5 0.0046402
5 20 0.3 0.0046463
5 10 0.3 0.0046960
5 30 0.5 0.0047160
5 10 0.5 0.0047175

Using the logloss metric, the best model is obtained with a max_depth of 10, a min_rows of 30, and a sample_rate of 0.3. Now let's run the model with these values.

model_rf_best <- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = "rf_best",
  max_depth = 10,
  min_rows = 30,
  sample_rate = 0.3,
  stopping_rounds = 5,
  stopping_metric = "AUTO",
  stopping_tolerance = 0.001,
  balance_classes = TRUE
)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
h2o.confusionMatrix(model_rf_best)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00180477647421117:
            0      1    Error          Rate
0      225985    283 0.001251   =283/226268
1        2582 223680 0.011412  =2582/226262
Totals 228567 223963 0.006331  =2865/452530

The model did well on the training data. But what about the testing set?

h2o.confusionMatrix(model_rf_best, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00496959295307637:
           0   1    Error       Rate
0      56226  21 0.000373  =21/56247
1         13  80 0.139785     =13/93
Totals 56239 101 0.000603  =34/56340

With this model, we get the same error rate for the minor class as with the binomial double trees model. For the overall error rate, however, the latter is better than the former.
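
Besides the confusion matrices, we could also compare the two forests with threshold-independent metrics computed on the test set; a short sketch:

perf_best <- h2o.performance(model_rf_best, newdata = test)
perf_dbl <- h2o.performance(model_rf_dbl, newdata = test)
c(auc_best = h2o.auc(perf_best), auc_dbl = h2o.auc(perf_dbl))
# printing perf_best or perf_dbl also shows AUCPR, logloss, and the other metrics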

5 Deep learning model

Deep learning models are known for their high accuracy on very large and complex datasets. They have a large number of hyperparameters that can be tuned to handle a wide range of datasets efficiently. Tuning a large number of hyperparameters on large datasets, however, requires a lot of hardware resources and time, which are not always available, or are very costly (for instance on cloud providers' platforms). That is why this type of model requires quite a lot of experience and practice to set the right hyperparameter values.

There are many frameworks for deep learning models. The most used ones are tensorflow and keras, since they are designed specifically for this type of model and can handle almost all the famous architectures, such as feedforward, convolutional, and recurrent neural networks. Besides, they also provide tools to define our own architectures.

h2o, for its part, provides only the feedforward neural network, made of densely connected layers. However, this type of architecture is the most used one in economics. Let's briefly discuss the main hyperparameters provided by h2o for this type of model (in addition to some of the hyperparameters above):

  • hidden: specifies the number of hidden layers and the number of nodes in each layer; the default is 2 layers with 200 nodes each. Notice that the numbers of nodes in the input and output layers are set automatically by h2o given the data.
  • autoencoder: if TRUE, we train an autoencoder model; otherwise the model uses supervised learning, which is the default.
  • activation: the activation function used. h2o provides three, each with or without dropout: Tanh, Rectifier, Maxout, TanhWithDropout, RectifierWithDropout, MaxoutWithDropout. The default is Rectifier.
  • hidden_dropout_ratios: a regularization technique that randomly drops a fraction of the node values in each hidden layer. The default is 0.5.
  • missing_values_handling: with two values MeanImputation and Skip. The default is MeanImputation.
  • input_dropout_ratio: The same as the previous argument but for the input layer. The default is 0.
  • l1 and l2: for lasso and ridge regularization. The default is 0 for both.
  • max_w2: the upper limit on the sum of squared weights incoming to each node. This can help fight the exploding gradient problem.
  • train_samples_per_iteration: The number of samples used before declaring one iteration. At the end of one iteration, the model is scored. The default is -2, which means h2o will decide given the data.
  • score_interval: an alternative to the previous argument; with the default setting, the model is scored every 5 seconds.
  • score_duty_cycle: It is another alternative to the two previous ones. It is the fraction of time spent in scoring, at the expense of that spent in training. The default is 0.1, which means 10% of the total time will be spent in scoring while the remaining 90% will be spent on training.
  • target_ratio_comm_to_comp: It is related to the cluster management. It controls the fraction of the communication time between nodes (The cluster nodes not the layer nodes). The default is 0.05, which means 5% of the total time will be spent on communication, and 95% in training inside each node.
  • replicate_training_data: The default is true, which means replicate the entire data on every cluster node.
  • shuffle_training_data: shuffle the inputs before feeding them into the network. It is recommended when we set balance_classes to true (like in our case). The default is false.
  • score_validation_samples: The number of samples from the validation set used in scoring. if we set this to 0 (which is the default) then the entire validation data will be used.
  • score_training_samples: the number of samples from the training data used for scoring; the default is 10000. It is used when we do not have validation data.
  • score_validation_sampling: used when we score on only a fraction of the validation data (when score_validation_samples has been set to a value other than the default of 0). The default is Uniform, but in our case, with imbalanced classes, we can use Stratified instead, which is the other value provided for this argument.

Since in our case the two classes are imbalanced, we set the balance_classes argument to TRUE and leave all the other arguments at their default settings.

#model_deep <- h2o.deeplearning(
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  model_id = "deep_def",
#  balance_classes = TRUE
#)

To avoid rerunning the model when rendering this document, we save it and then load it again, as we did earlier with the grid output.

#h2o.saveModel(model_deep, 
#              path = #"C://Users/dell/Documents/new-blog/content/sparklyr/h2o",
#              force = TRUE)
model_deep <- h2o.loadModel("C://Users/dell/Documents/new-blog/content/sparklyr/h2o/deep_def")
h2o.confusionMatrix(model_deep)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 9.38199481764512e-07:
          0    1    Error     Rate
0      4904    5 0.001019  =5/4909
1         0 5017 0.000000  =0/5017
Totals 4904 5022 0.000504  =5/9926

Like the models above, this model is almost perfect at predicting the training data.

h2o.confusionMatrix(model_deep, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.991909621262387:
           0  1    Error       Rate
0      56237 10 0.000178  =10/56247
1         17 76 0.182796     =17/93
Totals 56254 86 0.000479  =27/56340

As we can see, this model does not predict the minor class very well. This result is expected since we used only the default values, so let's now try some custom hyperparameter values.

Note: we will not run a full hyperparameter search, since my laptop does not have many resources.

As a guideline, since the default deep learning model above fitted the training data almost perfectly but generalized poorly to the unseen testing data, we should reduce the complexity of the model and add some regularization. So we will set the following values.

  • hidden: we will use two hidden layers, with 100 each (instead of the default of 200 each).
  • nfolds: we will use 5 folds to properly score the model using validation data (not training data).
  • fold_assignment: set it to “Stratified” to be sure to get the minor class in all the folds. This is crucially important with imbalanced classes.
  • hidden_dropout_ratios: we set this to 0.2 for both layers.
  • activation: with the previous argument, we must provide the appropriate activation function RectifierWithDropout.
  • L1: we set this argument to 0.0001.
  • variable_importances: by default it is TRUE, so we set it to FALSE to reduce computation time, since our goal is prediction, not explanation.
  • shuffle_training_data: since replicate_training_data is TRUE (by default), we set this to TRUE (the default is FALSE) to shuffle the training data.

#model_deep_new <- h2o.deeplearning(
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  nfolds = 5,
#  fold_assignment = "Stratified",
#  hidden = c(100,100),
#  model_id = "deep_new",
#  standardize = TRUE,
#  balance_classes = TRUE,
#  hidden_dropout_ratios = c(0.2,0.2),
#  activation = "RectifierWithDropout",
#  l1=1e-4,
#  variable_importances = FALSE,
#  shuffle_training_data = TRUE
#)

To prevent this model from being rerun when rendering our R Markdown document, we save it and load it again for further use.

#h2o.saveModel(model_deep_new, 
#              path = #"C://Users/dell/Documents/new-blog/content/sparklyr/h2o",
#              force = TRUE)
model_deep_new <- h2o.loadModel("C://Users/dell/Documents/new-blog/content/sparklyr/h2o/deep_new")
h2o.confusionMatrix(model_deep_new)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 1.69163968745202e-06:
          0    1    Error        Rate
0      4908   77 0.015446    =77/4985
1        51 5022 0.010053    =51/5073
Totals 4959 5099 0.012726  =128/10058

As expected, this model is less accurate than the default one because it is less flexible. In other words, it has a larger bias, but we hope it also has a lower variance, which we can verify using the testing set.

h2o.confusionMatrix(model_deep_new, test)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.250769269598065:
           0  1    Error       Rate
0      56228 19 0.000338  =19/56247
1         15 78 0.161290     =15/93
Totals 56243 97 0.000603  =34/56340

With these new settings, we obtained a sizeable improvement in the error rate of the minor class, about 16% (compared to about 18% for the default model). But this rate is still larger than that of the best random forest models (13.98%). If you have enough time, you can improve the model further by applying a grid search over some hyperparameters, as sketched below.
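
A hedged sketch of such a grid search, kept deliberately small (the hidden layouts, l1 values, and dropout ratios below are only illustrative, and the run would still be slow on a laptop), commented out like the other heavy runs:

#deep_grid <- h2o.grid(
#  "deeplearning",
#  hyper_params = list(
#    hidden = list(c(50, 50), c(100, 100)),
#    l1 = c(1e-5, 1e-4),
#    hidden_dropout_ratios = list(c(0.2, 0.2), c(0.5, 0.5))
#  ),
#  activation = "RectifierWithDropout",
#  balance_classes = TRUE,
#  nfolds = 5,
#  fold_assignment = "Stratified",
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  seed = 123
#)
#h2o.getGrid(deep_grid@grid_id, sort_by = "logloss", decreasing = FALSE)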

Finally, when you finish your work, do not forget to shut down h2o to free your resources, as follows:

h2o.shutdown()
Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)? 

6 Conclusion

Perhaps the most important lesson from this article is how much the hyperparameter values matter for model performance. The performance difference between models of the same type (with different hyperparameter values) can be larger than the difference between different types of models. In other words, if you do not have much time, spend it fine-tuning the hyperparameters of one model rather than trying different types of models. In practice, for large and complex datasets, the most powerful models are, in order: deep learning, xgboost, and random forest.

7 Further reading

8 Session information

sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] h2o_3.30.1.3    forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
 [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
 [9] ggplot2_3.3.2   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.18         haven_2.3.1       colorspace_1.4-1 
 [5] vctrs_0.3.4       generics_0.0.2    htmltools_0.5.0   yaml_2.2.1       
 [9] blob_1.2.1        rlang_0.4.7       pillar_1.4.6      glue_1.4.2       
[13] withr_2.3.0       DBI_1.1.0         bit64_4.0.5       dbplyr_1.4.4     
[17] modelr_0.1.8      readxl_1.3.1      lifecycle_0.2.0   munsell_0.5.0    
[21] blogdown_0.20     gtable_0.3.0      cellranger_1.1.0  rvest_0.3.6      
[25] codetools_0.2-16  evaluate_0.14     knitr_1.30        fansi_0.4.1      
[29] highr_0.8         broom_0.7.1       Rcpp_1.0.5        scales_1.1.1     
[33] backports_1.1.10  jsonlite_1.7.1    bit_4.0.4         fs_1.5.0         
[37] hms_0.5.3         digest_0.6.25     stringi_1.5.3     bookdown_0.20    
[41] grid_4.0.1        bitops_1.0-6      cli_2.0.2         tools_4.0.1      
[45] ROSE_0.0-3        magrittr_1.5      RCurl_1.98-1.2    crayon_1.3.4     
[49] pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.13.0 xml2_1.3.2       
[53] reprex_0.3.0      lubridate_1.7.9   assertthat_0.2.1  rmarkdown_2.4    
[57] httr_1.4.2        rstudioapi_0.11   R6_2.4.1          compiler_4.0.1   

  1. Darren Cook, Practical Machine Learning with H2O, O'Reilly, 2017, p. 115↩︎

Metales Abdelkader

My research interests include econometrics, statistics, machine learning, and deep learning.