Ordinal data models

1 Introduction

This tutorial explores some of the most popular models used to predict an ordered response variable. We will use the heart disease data downloaded from the Kaggle website, where our response will be the chest pain variable cp instead of the usually used target variable.

2 Data preparation

First, we load the data and the libraries that we need for this illustration as follows.

options(warn = -1)
library(tidyverse)
library(caret)
library(tidymodels)
mydata<-read.csv("../heart.csv",header = TRUE)
names(mydata)[1]<-"age"

The data at hand has the following features:

  • age: age in years.
  • sex: 1 = male, 0 = female.
  • cp: chest pain type.
  • trestbps: resting blood pressure.
  • chol: serum cholesterol.
  • fbs: fasting blood sugar.
  • restecg: resting electrocardiographic results.
  • thalach: maximum heart rate achieved.
  • exang: exercise induced angina.
  • oldpeak: ST depression induced by exercise relative to rest.
  • slope: the slope of the peak exercise ST segment.
  • ca: number of major vessels colored by fluoroscopy.
  • thal: not well defined in the data source.
  • target: presence or absence of heart disease.

A good starting point is to explore the summary of all predictors and the missing values using the powerful skim function from the skimr package.

skimr::skim(mydata)
Table 2.1: Data summary
Name mydata
Number of rows 303
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 ▁▆▇▇▁
sex 0 1 0.68 0.47 0 0.0 1.0 1.0 1.0 ▃▁▁▁▇
cp 0 1 0.97 1.03 0 0.0 1.0 2.0 3.0 ▇▃▁▅▁
trestbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 ▃▇▅▁▁
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 ▃▇▂▁▁
fbs 0 1 0.15 0.36 0 0.0 0.0 0.0 1.0 ▇▁▁▁▂
restecg 0 1 0.53 0.53 0 0.0 1.0 1.0 2.0 ▇▁▇▁▁
thalach 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 ▁▂▅▇▂
exang 0 1 0.33 0.47 0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 ▇▂▁▁▁
slope 0 1 1.40 0.62 0 1.0 1.0 2.0 2.0 ▁▁▇▁▇
ca 0 1 0.73 1.02 0 0.0 0.0 1.0 4.0 ▇▃▂▁▁
thal 0 1 2.31 0.61 0 2.0 2.0 3.0 3.0 ▁▁▁▇▆
target 0 1 0.54 0.50 0 0.0 1.0 1.0 1.0 ▇▁▁▁▇

For our case we will use the chest pain type cp variable as our target variable, since it is a categorical variable. However, for pedagogic purposes, we will manipulate it so that it becomes an ordered factor with only three levels, no pain, moderate pain, and severe pain (instead of the current four levels).

Looking at the above output, we convert the variables that should be of factor type, which are: sex, target, fbs, restecg, exang, slope, ca, and thal. For the response variable cp, we drop its least frequent level along with all its related rows, then rename the remaining levels: no pain for the most frequent, severe pain for the least frequent, and moderate pain for the remaining one.

table(mydata$cp)

  0   1   2   3 
143  50  87  23 

We see that level 3 is the least frequent one.

mydata<-mydata %>%
  modify_at(c("cp", "sex", "target", "fbs", "restecg", "exang",
              "slope", "ca", "thal"),
            as.factor)
mydata<-mydata[mydata$cp!=3,]
mydata$cp<-fct_drop(mydata$cp)
table(mydata$cp)

  0   1   2 
143  50  87 

According to these frequencies, we rename and order the levels as follows.

mydata$cp<-fct_recode(mydata$cp,no="0",sev="1",mod="2")
mydata$cp<-factor(mydata$cp,ordered = TRUE)
mydata$cp<-fct_infreq(mydata$cp)
mydata$cp[1:5]
[1] mod sev sev no  no 
Levels: no < mod < sev

Similar to logistic regression, a rule of thumb applied in practice requires the number of cases in each cell of the cross table between the outcome and each factor to be at least 5 (see the automated check after the tables below).

xtabs(~cp+sex,data=mydata)
     sex
cp      0   1
  no   39 104
  mod  35  52
  sev  18  32
xtabs(~cp+target,data=mydata)
     target
cp      0   1
  no  104  39
  mod  18  69
  sev   9  41
xtabs(~cp+fbs,data=mydata)
     fbs
cp      0   1
  no  125  18
  mod  70  17
  sev  45   5
xtabs(~cp+restecg,data=mydata)
     restecg
cp     0  1  2
  no  78 62  3
  mod 36 50  1
  sev 19 31  0
xtabs(~cp+exang,data=mydata)
     exang
cp     0  1
  no  63 80
  mod 76 11
  sev 46  4
xtabs(~cp+slope,data=mydata)
     slope
cp     0  1  2
  no  11 84 48
  mod  5 33 49
  sev  2 12 36
xtabs(~cp+ca,data=mydata)
     ca
cp     0  1  2  3  4
  no  65 34 29 14  1
  mod 57 20  2  5  3
  sev 37  8  3  1  1
xtabs(~cp+thal,data=mydata)
     thal
cp     0  1  2  3
  no   1 12 52 78
  mod  1  2 62 22
  sev  0  2 39  9
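Rather than eyeballing each table, we can automate this check (a minimal sketch, not part of the original workflow; min_cell and factor_preds are illustrative names introduced here):

min_cell <- function(data, outcome, predictor) {
  # smallest cell count in the outcome-by-predictor cross table
  min(table(data[[outcome]], data[[predictor]]))
}
factor_preds <- c("sex", "target", "fbs", "restecg", "exang",
                  "slope", "ca", "thal")
# predictors whose smallest cell falls below 5 are candidates for removal
sort(sapply(factor_preds, function(p) min_cell(mydata, "cp", p)))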

The following variables do not respect this threshold, and hence they will be removed from the predictor set: restecg, exang, slope, ca, and thal.

mydata<-mydata[,setdiff(names(mydata), 
                        c("restecg", "exang", "slope", "ca",  "thal"))]

The data is now ready, so we can split it into training and testing sets.

set.seed(1122)
parts <- initial_split(mydata, prop=0.8, strata = cp)
train <- training(parts)
test <- testing(parts)

The most popular models that we will use are: the ordered logistic model, the CART model, the ordinal random forest model, and the continuation ratio model.

3 Ordered logistic regression model (logit)

Before training this type of model, let’s show how it works. For simplicity, suppose we have data with an ordered outcome \(y\) that has three class labels (“1”, “2”, “3”) and only two features \(x_1\) and \(x_2\).

First we define a latent variable as a linear combination of the features:

\[\begin{equation} y_i^*=\beta_1 X_{i1}+\beta_2 X_{i2} \end{equation}\]

Then since we have three classes we define two thresholds for this latent variable \(\alpha_1\) and \(\alpha_2\) such that a particular observation \(y_i\) will be classified as follows:

\[\begin{cases} y_i=1 & \text{if $y_i^* \leq \alpha_1$} \\ y_i=2 & \text{if $\alpha_1 < y_i^* \leq \alpha_2$} \\ y_i=3 & \text{if $y_i^* > \alpha_2$} \end{cases}\]

Now we can obtain the probability that a particular observation falls into a specific class as follows:

\[\begin{cases} p(y_i=1)=p(y_i^* \leq \alpha_1)=F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\ p(y_i=2)=p(\alpha_1 < y_i^* \leq \alpha_2)=F(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})-F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\ p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]

It remains now to choose a suitable distribution function F. Two are commonly used for this type of data: the logit function \(F(x)=\frac{1}{1+e^{-x}}\) and the normal distribution function, aka probit.

Note: there exist other functions like loglog, cloglog, and cauchit.

Using the logit function, the probabilities are:

\[\begin{cases} p(y_i=1)=\frac{1}{1+e^{-(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2})}} \\ p(y_i=2)=\frac{1}{1+e^{-(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})}}-p(y_i=1) \\ p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]
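To make these formulas concrete, here is a small R sketch (not from the original script) that computes the three class probabilities directly from the thresholds and coefficients; ordered_logit_probs is an illustrative name, and plogis is base R's logistic distribution function:

ordered_logit_probs <- function(alpha, beta, x) {
  # alpha: thresholds (alpha_1, alpha_2); beta: coefficients; x: feature values
  eta <- sum(beta * x)
  cum <- plogis(alpha - eta)  # F(alpha_k - beta'x) for k = 1, 2
  c(p1 = cum[1], p2 = cum[2] - cum[1], p3 = 1 - cum[2])
}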

The MASS package provides the polr function to perform ordinal logistic regression, which caret exposes through method = "polr".

library(MASS)
set.seed(1234)
model_logistic<-train(cp~., data=train,
                      method="polr",
                      tuneGrid=expand.grid(method="logistic"))

summary(model_logistic)

Coefficients:
              Value Std. Error  t value
age       0.0112236   0.018219  0.61605
sex1      0.2593720   0.316333  0.81993
trestbps -0.0002329   0.009090 -0.02562
chol     -0.0013238   0.002697 -0.49082
fbs1      0.3188826   0.401836  0.79356
thalach   0.0226246   0.008199  2.75933
oldpeak  -0.3360326   0.163547 -2.05465
target1   1.7234740   0.376279  4.58031

Intercepts:
        Value   Std. Error t value
no|mod   4.5786  1.9271     2.3759
mod|sev  6.5004  1.9527     3.3289

Residual Deviance: 376.4697 
AIC: 396.4697 

This table does not provide p-values. However, this is not a big problem, since we can compute them from the t-values with the following script (using the normal approximation).

prob <- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)
                 Value  Std. Error     t value         prob
age       0.0112236479 0.018218848  0.61604597 5.378642e-01
sex1      0.2593719567 0.316332564  0.81993442 4.122535e-01
trestbps -0.0002329023 0.009090066 -0.02562163 9.795591e-01
chol     -0.0013237835 0.002697079 -0.49082122 6.235529e-01
fbs1      0.3188825831 0.401836034  0.79356393 4.274493e-01
thalach   0.0226246089 0.008199317  2.75932853 5.792027e-03
oldpeak  -0.3360326371 0.163547467 -2.05464899 3.991292e-02
target1   1.7234739863 0.376278770  4.58031152 4.642839e-06
no|mod    4.5785821473 1.927119568  2.37586822 1.750771e-02
mod|sev   6.5003986218 1.952726089  3.32888399 8.719471e-04

Using the threshold p-value of 0.05, we remove the non-significant variables: age, trestbps, and chol.

set.seed(1234)
model_logistic<-train(cp~.-age-trestbps-chol, data=train,
                      method="polr",tuneGrid=expand.grid(method="logistic"))
prob <- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)
              Value  Std. Error    t value         prob
sex1     0.25427581 0.308143065  0.8251875 4.092651e-01
fbs1     0.37177505 0.384667871  0.9664832 3.338024e-01
thalach  0.02050951 0.007487511  2.7391620 6.159602e-03
oldpeak -0.33669473 0.161699555 -2.0822242 3.732199e-02
target1  1.71338020 0.369558584  4.6362885 3.547208e-06
no|mod   4.00836398 1.143111953  3.5065367 4.539789e-04
mod|sev  5.92987585 1.185074388  5.0038005 5.621092e-07

Notice that we do not remove the factors sex and fbs, even though they are not significant, because of the significance of the intercepts.

To better understand these coefficients, let’s restrict the model to only two predictors.

set.seed(1234)
model1<-train(cp~target+thalach, 
              data=train,
              method = "polr",
              tuneGrid=expand.grid(method="logistic"))
summary(model1)

Coefficients:
          Value Std. Error t value
target1 1.87953   0.333153   5.642
thalach 0.02347   0.007372   3.184

Intercepts:
        Value  Std. Error t value
no|mod  4.6457 1.0799     4.3018 
mod|sev 6.5325 1.1271     5.7959 

Residual Deviance: 383.3144 
AIC: 391.3144 

Plugging these coefficients into the above equations, we obtain the probability of each class as follows:

\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953X_{i1}-0.02347X_{i2})}} \\ p(mod)=\frac{1}{1+e^{-(6.5325-1.87953X_{i1}-0.02347X_{i2})}}-p(no) \\ p(sev)=1-p(mod)-p(no)\end{cases}\]

Let’s now predict a particular patient, say the third one.

train[3,c("cp","thalach","target")]
   cp thalach target
4 sev     178      1

We plug in the predictor values as follows:

\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953*1-0.02347*178)}} \\ p(mod)=\frac{1}{1+e^{-(6.5325-1.87953*1-0.02347*178)}}-p(no) \\ p(sev)=1-p(mod)-p(no)\end{cases}=\begin{cases} p(no)=0.1959992 \\ p(mod)=0.6166398-0.1959992=0.4206406 \\ p(sev)=1-0.4206406-0.1959992=0.3833602\end{cases}\]
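As a check, plugging the same numbers into the ordered_logit_probs sketch defined in the previous section reproduces these values up to rounding:

ordered_logit_probs(alpha = c(4.6457, 6.5325),
                    beta  = c(1.87953, 0.02347),
                    x     = c(1, 178))
# approximately: p1 = 0.196, p2 = 0.421, p3 = 0.383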

Using the highest probability, this patient is predicted to have mod pain. Now let’s compare these probabilities with those obtained from the predict function.

predict(model1, train[1:3,], type = "prob") %>% tail(1)
         no       mod      sev
4 0.1958709 0.4205981 0.383531

Now we go back to our original model and compute the accuracy rate for the training data.

predict(model_logistic, train) %>% 
  bind_cols(train) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.611

With the logistic regression model we get about 61% accuracy on the training set, which is quite poor. So let’s test the model on the testing set now.

predict(model_logistic, test) %>% 
  bind_cols(test) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.648

Surprisingly, the accuracy rate for the testing set is about 65%, which is larger than that computed from the training data (61%). This can be an indication of an underfitting problem (the opposite of the overfitting problem). Is there any way to improve the model performance? Maybe, by going back and tuning some hyperparameters; but this model does not offer many hyperparameters apart from the link function, which is the logistic by default, with alternatives such as probit, loglog, and cauchit.

For our case, let’s try this model with the probit function.

4 Ordered logistic regression model (probit)

set.seed(1234)
model_probit<-train(cp~.-age-trestbps-chol, data=train,
                    method="polr",
                    tuneGrid=expand.grid(method="probit"))

predict(model_probit, train) %>% 
  bind_cols(train) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.606

This rate is slightly worse than that of the previous model. But what about the testing set?

predict(model_probit, test) %>% 
  bind_cols(test) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.593

This one is also worse than the previous model, which means that the logistic link performs better than the probit one for this data.

When we have tried many things to improve the model performance without gaining much, it is better to consider different types of models.

5 CART model

This is a tree-based model used for both classification and regression. To train this model we make use of the rpartScore package, and for simplicity we will include only the significant predictors from the previous model.

library(rpartScore)
set.seed(1234)
model_cart<-train(cp~.-age-trestbps-chol, data=train,
                  method="rpartScore")
model_cart
CART or Ordinal Responses 

226 samples
  8 predictor
  3 classes: 'no', 'mod', 'sev' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  cp          split  prune  Accuracy   Kappa    
  0.02702703  abs    mr     0.5748197  0.2845545
  0.02702703  abs    mc     0.5796085  0.3011122
  0.02702703  quad   mr     0.5711605  0.2764466
  0.02702703  quad   mc     0.5805216  0.3020125
  0.04504505  abs    mr     0.5620975  0.2719646
  0.04504505  abs    mc     0.5966801  0.3274893
  0.04504505  quad   mr     0.5592845  0.2608402
  0.04504505  quad   mc     0.5930817  0.3208220
  0.21621622  abs    mr     0.5303342  0.1266324
  0.21621622  abs    mc     0.6004116  0.3343997
  0.21621622  quad   mr     0.5290009  0.1143360
  0.21621622  quad   mc     0.5928132  0.3225686

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cp = 0.2162162, split = abs and
 prune = mc.

caret uses bootstrapping by default for hyperparameter tuning. In our case, the largest accuracy rate is about 60.04%, obtained with the complexity parameter cp = 0.2162162, split = abs, and prune = mc.

The argument split controls the splitting function used to grow the tree, setting the misclassification costs in the generalized Gini impurity function to absolute (abs) or squared (quad) form. The argument prune selects the performance measure used to prune the tree: total misclassification rate (mr) or total misclassification cost (mc). If we wanted to search beyond the default grid, we could supply a custom one, as sketched below.
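Here is a hedged sketch of such a custom grid, reusing the parameter names from the resampling table above (the cp values are arbitrary choices and model_cart_tuned is an illustrative name; we continue with the original model_cart below):

grid <- expand.grid(cp = c(0.01, 0.05, 0.1, 0.2),
                    split = c("abs", "quad"),
                    prune = c("mr", "mc"))
set.seed(1234)
model_cart_tuned <- train(cp~.-age-trestbps-chol, data=train,
                          method="rpartScore", tuneGrid=grid)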

predict(model_cart, train) %>% 
  bind_cols(train) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.615

Surprisingly, we get approximately the same accuracy rate as the logit model. Let’s check the testing set.

predict(model_cart, test) %>% 
  bind_cols(test) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.630

With this model, however, we get a lower accuracy rate on the testing set than that of the logistic model.

6 Ordinal random forest model

This model is a corrected version of the random forest that takes into account the ordinal nature of the response variable. For more detail about this model, read this great paper.

To train the ordinal random forest model, we need to load the following packages: e1071, ranger, and ordinalForest.

library(ordinalForest)
library(ranger)
library(e1071)

Since the caret function train uses bootstrapping to tune the hyperparameters, the training process is very slow. That is why I saved the resulting model and load it back here.

# set.seed(1234)
# model_forest<-train(cp~.-age-trestbps-chol, data=train,
#                     method='ordinalRF')
# saveRDS(model_forest,
#         "C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds")

model_forest <- readRDS("C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds")

model_forest
Random Forest 

226 samples
  8 predictor
  3 classes: 'no', 'mod', 'sev' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  nsets  ntreeperdiv  ntreefinal  Accuracy   Kappa    
   50     50          200         0.5808002  0.3008422
   50     50          400         0.5776249  0.2954635
   50     50          600         0.5802381  0.3009845
   50    100          200         0.5805333  0.2982787
   50    100          400         0.5835550  0.3046105
   50    100          600         0.5792347  0.2966789
   50    150          200         0.5781306  0.2957198
   50    150          400         0.5763106  0.2929363
   50    150          600         0.5773418  0.2939428
  100     50          200         0.5825633  0.3037443
  100     50          400         0.5766958  0.2946094
  100     50          600         0.5801625  0.2992074
  100    100          200         0.5817261  0.3017512
  100    100          400         0.5802315  0.2984311
  100    100          600         0.5760195  0.2936909
  100    150          200         0.5791770  0.2986367
  100    150          400         0.5773527  0.2940674
  100    150          600         0.5800019  0.2990121
  150     50          200         0.5738722  0.2890697
  150     50          400         0.5755389  0.2915668
  150     50          600         0.5793087  0.2994984
  150    100          200         0.5821339  0.3039247
  150    100          400         0.5810183  0.3003594
  150    100          600         0.5797573  0.3001752
  150    150          200         0.5792505  0.2992324
  150    150          400         0.5757645  0.2930867
  150    150          600         0.5802099  0.2993488

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nsets = 50, ntreeperdiv = 100
 and ntreefinal = 400.

We can plot the important predictors as follows.

plot(varImp(model_forest))
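If we prefer the numeric scores behind the plot, they can be extracted from the varImp object (a small sketch; the name of the importance column can vary with the model type):

imp <- varImp(model_forest)$importance
imp[order(-imp[, 1]), , drop = FALSE]  # sort by the first importance column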

Now we can obtain the accuracy rate for the training set as follows.

predict(model_forest, train) %>% 
  bind_cols(train) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.819

Great! With this model, the accuracy rate has improved substantially to roughly 82%. But wait: what matters is the accuracy on the testing set.

predict(model_forest, test) %>% 
  bind_cols(test) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.574

This is exactly what is called the overfitting problem: the model generalizes poorly to new, unseen data. We could go back and tune other hyperparameters, such as increasing the minimum node size (the default is 5), to fight the overfitting. We do not do that here, however, since it is not the purpose of this tutorial.

7 Continuation Ratio Model

This model uses vector generalized additive models, which are available in the VGAM package; for more detail about these models, click here.
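For intuition, under one common parameterization with the logit link (sign conventions vary across references), the continuation ratio model specifies the odds of being in class k against all higher classes, so with our two features:

\[\begin{equation} \log\left(\frac{p(y_i=k)}{p(y_i>k)}\right)=\alpha_k-\beta_1 X_{i1}-\beta_2 X_{i2}, \qquad k=1,2 \end{equation}\]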

library(VGAM)
set.seed(1234)
model_vgam<-train(cp~.-age-trestbps-chol, data=train,
                  method="vglmContRatio", trace=FALSE)
model_vgam
Continuation Ratio Model for Ordinal Data 

226 samples
  8 predictor
  3 classes: 'no', 'mod', 'sev' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  parallel  link     Accuracy   Kappa    
  FALSE     logit    0.5962581  0.3323075
  FALSE     probit   0.5942637  0.3302998
  FALSE     cloglog  0.5973844  0.3293056
  FALSE     cauchit  0.5967368  0.3316896
  FALSE     logc     0.5945121  0.3152759
   TRUE     logit    0.5758330  0.2961673
   TRUE     probit   0.5738297  0.2924747
   TRUE     cloglog  0.5838764  0.3014038
   TRUE     cauchit  0.5810184  0.3067004
   TRUE     logc     0.5302522  0.1031624

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were parallel = FALSE and link = cloglog.

The best model is obtained when the argument parallel is FALSE and the link is cloglog, the complementary log-log function.

The accuracy rate of the training data is:

predict(model_vgam, train) %>% 
  bind_cols(train) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.659

And the accuracy of the testing set is:

predict(model_vgam, test) %>% 
  bind_cols(test) %>%
  rename(pred="...1", truth=cp) %>% 
  accuracy(pred, truth)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.630

This testing accuracy matches that of the CART model but is still below the logit model’s rate of about 65%.

8 Compare models

We can compare the above models using the resamples function from caret.

models_eval<-resamples(list(logit=model_logistic,
                            cart=model_cart,
                            forest=model_forest,
                            vgam=model_vgam))
summary(models_eval)

Call:
summary.resamples(object = models_eval)

Models: logit, cart, forest, vgam 
Number of resamples: 25 

Accuracy 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit  0.5060241 0.5731707 0.5822785 0.5871083 0.6097561 0.6627907    0
cart   0.3734940 0.5824176 0.6097561 0.6004116 0.6279070 0.6746988    0
forest 0.4891304 0.5609756 0.5853659 0.5835550 0.6162791 0.6385542    0
vgam   0.4936709 0.5760870 0.6046512 0.5973844 0.6202532 0.6626506    0

Kappa 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit   0.189086980 0.2792369 0.3144822 0.3100458 0.3437500 0.4512651    0
cart   -0.004889406 0.3185420 0.3474144 0.3343997 0.3775576 0.4526136    0
forest  0.186912373 0.2719432 0.3091678 0.3046105 0.3464604 0.4011544    0
vgam    0.144558744 0.2993406 0.3367647 0.3293056 0.3690791 0.4142980    0
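We can also visualize these resampling distributions side by side, since caret provides a lattice bwplot method for resamples objects:

bwplot(models_eval, metric = "Accuracy")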

Based on the training resamples and using the mean accuracy rate, we can say that the CART model is the best model for this data, with about 60.04% accuracy. However, things are different when it comes to using the testing set instead.

test_accuracy <- function(model) {
  predict(model, test) %>% 
    bind_cols(test) %>%
    rename(pred="...1", truth=cp) %>% 
    accuracy(pred, truth) %>% 
    pull(.estimate)
}

tibble(models=c("logit", "cart", "forest", "vgam"), 
       accuracy=map_dbl(list(model_logistic, model_cart,
                             model_forest, model_vgam),
                        test_accuracy))
# A tibble: 4 x 2
  models accuracy
  <chr>     <dbl>
1 logit     0.648
2 cart      0.630
3 forest    0.574
4 vgam      0.630

Using the testing set, the logistic model with the logit link is the best model for predicting this data.
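To go beyond a single accuracy number, we could also inspect the per-class errors of this winning model with caret’s confusionMatrix (a quick sketch):

confusionMatrix(predict(model_logistic, test), test$cp)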

9 Conclusion

We have seen how to model ordinal data by exploring several models, and it turned out that the logistic model is the best one for our data. In general, however, the best model depends strongly on the data at hand.

10 Session information

sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VGAM_1.1-3          e1071_1.7-3         ranger_0.12.1      
 [4] ordinalForest_2.4-1 rpartScore_1.0-1    rpart_4.1-15       
 [7] MASS_7.3-53         yardstick_0.0.7     workflows_0.2.0    
[10] tune_0.1.1          rsample_0.0.8       recipes_0.1.13     
[13] parsnip_0.1.3       modeldata_0.0.2     infer_0.5.3        
[16] dials_0.0.9         scales_1.1.1        broom_0.7.1        
[19] tidymodels_0.1.1    caret_6.0-86        lattice_0.20-41    
[22] forcats_0.5.0       stringr_1.4.0       dplyr_1.0.2        
[25] purrr_0.3.4         readr_1.3.1         tidyr_1.1.2        
[28] tibble_3.0.3        ggplot2_3.3.2       tidyverse_1.3.0    

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1     ellipsis_0.3.1       class_7.3-17        
 [4] base64enc_0.1-3      fs_1.5.0             rstudioapi_0.11     
 [7] listenv_0.8.0        furrr_0.1.0          prodlim_2019.11.13  
[10] fansi_0.4.1          lubridate_1.7.9      xml2_1.3.2          
[13] codetools_0.2-16     knitr_1.30           jsonlite_1.7.1      
[16] pROC_1.16.2          dbplyr_1.4.4         compiler_4.0.1      
[19] httr_1.4.2           backports_1.1.10     assertthat_0.2.1    
[22] Matrix_1.2-18        cli_2.0.2            htmltools_0.5.0     
[25] tools_4.0.1          gtable_0.3.0         glue_1.4.2          
[28] reshape2_1.4.4       Rcpp_1.0.5           cellranger_1.1.0    
[31] DiceDesign_1.8-1     vctrs_0.3.4          nlme_3.1-149        
[34] blogdown_0.20        iterators_1.0.12     timeDate_3043.102   
[37] gower_0.2.2          xfun_0.18            globals_0.13.0      
[40] rvest_0.3.6          lifecycle_0.2.0      future_1.19.1       
[43] ipred_0.9-9          hms_0.5.3            parallel_4.0.1      
[46] yaml_2.2.1           stringi_1.5.3        highr_0.8           
[49] foreach_1.5.0        lhs_1.1.0            lava_1.6.8          
[52] repr_1.1.0           rlang_0.4.7          pkgconfig_2.0.3     
[55] evaluate_0.14        tidyselect_1.1.0     plyr_1.8.6          
[58] magrittr_1.5         bookdown_0.20        R6_2.4.1            
[61] generics_0.0.2       DBI_1.1.0            pillar_1.4.6        
[64] haven_2.3.1          withr_2.3.0          survival_3.2-7      
[67] nnet_7.3-14          modelr_0.1.8         crayon_1.3.4        
[70] utf8_1.1.4           rmarkdown_2.4        grid_4.0.1          
[73] readxl_1.3.1         data.table_1.13.0    blob_1.2.1          
[76] ModelMetrics_1.2.2.2 reprex_0.3.0         digest_0.6.25       
[79] munsell_0.5.0        GPfit_1.0-8          skimr_2.1.2         