Count data Models

1 Introduction:

When we deal with data that has a response variable of integer type, using a linear regression may violate the normality assumption and hence all the classical statistic tests would fail to evaluate the model. However, as we do with logistic regression models, the generalized linear model GLM can be used instead here by specifying the suitable distribution.

The possible distributions for this type of data are the discrete distributions poisson and negative binomial. The former is the best choice if the mean and the variance of the response variable are closer to each other, if they are not however and we persist using this distribution we may cause the rise of the overdispersion problem of the residuals. As a solution thus, we can use the latter distribution that does not have this restriction.

There is another alternative if neither the poisson distribution nor the negative binomial are suitable called the Quasi maximum likelihood. The advantage of this method is that uses only the relationship between the mean and the variance and does not require any prespecified distribution. Moreover, its estimators are approximately as efficient as the maximum likelihood estimators.

2 Data preparation

To well understand how to model the count data we are going be using Doctorvisits data from AER package, in which the variable visits will be our target variable, so let’s call this data with the packages that we need along this article.

ssh <- suppressPackageStartupMessages
ssh(library(performance))
Warning: package 'performance' was built under R version 4.0.2
ssh(library(ModelMetrics))
ssh(library(corrr))
ssh(library(purrr))
ssh(library(MASS))
Warning: package 'MASS' was built under R version 4.0.2
ssh(library(tidyverse))
Warning: package 'ggplot2' was built under R version 4.0.2
Warning: package 'tibble' was built under R version 4.0.2
Warning: package 'tidyr' was built under R version 4.0.2
Warning: package 'dplyr' was built under R version 4.0.2
ssh(library(AER))
Warning: package 'car' was built under R version 4.0.2
Warning: package 'lmtest' was built under R version 4.0.2
Warning: package 'sandwich' was built under R version 4.0.2
Warning: package 'survival' was built under R version 4.0.2
ssh(library(broom))
Warning: package 'broom' was built under R version 4.0.2
data("DoctorVisits")
doc <- DoctorVisits
glimpse(doc)
Rows: 5,190
Columns: 12
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
$ gender    <fct> female, female, male, male, male, female, female, female,...
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.1...
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.1...
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, ...
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0,...
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, ...
$ private   <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, n...
$ freepoor  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, ...
$ nchronic  <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, ye...
$ lchronic  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...

This data from Australian health survey where visits is the number of doctor visits in past two weeks with 11 features listed above.

First we list the summary of the data to inspect any unwanted issue.

summary(doc)
     visits          gender          age             income      
 Min.   :0.0000   male  :2488   Min.   :0.1900   Min.   :0.0000  
 1st Qu.:0.0000   female:2702   1st Qu.:0.2200   1st Qu.:0.2500  
 Median :0.0000                 Median :0.3200   Median :0.5500  
 Mean   :0.3017                 Mean   :0.4064   Mean   :0.5832  
 3rd Qu.:0.0000                 3rd Qu.:0.6200   3rd Qu.:0.9000  
 Max.   :9.0000                 Max.   :0.7200   Max.   :1.5000  
    illness         reduced            health       private    freepoor  
 Min.   :0.000   Min.   : 0.0000   Min.   : 0.000   no :2892   no :4968  
 1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000   yes:2298   yes: 222  
 Median :1.000   Median : 0.0000   Median : 0.000                        
 Mean   :1.432   Mean   : 0.8619   Mean   : 1.218                        
 3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000                        
 Max.   :5.000   Max.   :14.0000   Max.   :12.000                        
 freerepat  nchronic   lchronic  
 no :4099   no :3098   no :4585  
 yes:1091   yes:2092   yes: 605  
                                 
                                 
                                 
                                 

As we see we do not have missing values and the visits values ranges from 0 to 9 but it should be of integer type rather than double. Similarly, the variable illness should be converted to factor type since it has a few different values.

doc$visits<-as.integer(doc$visits)
doc$illness <- as.factor(doc$illness)
tab <- table(doc$visits)
tab

   0    1    2    3    4    5    6    7    8    9 
4141  782  174   30   24    9   12   12    5    1 

The best thing we do to start analyzing the data is by displaying the correlation coefficient of each pair variables we have. Thus, any particular predictor that has high correlation with the target variable could be highly likely to be relevant in our future model. Notice that our target variable is not continuous hence we will use the spearman correlation. As required by correlate function from corrr package, all the variables must be of numeric type so we convert all the factor to integer.

doc1 <-modify_if(doc, is.factor, as.integer)

notice that we have stored the result in another object doc1 to keep save our original data.

M <- correlate(doc1, method="spearman")
rplot(shave(M), colours=c("red", "white", "blue" ))+
   theme(axis.text.x = element_text(angle = 90, hjust = 1))

Looking at this plot all the correlations has low values. however, these correlations assess only the monotonic relations, they say nothing about any other form of relation.
First let’s compare the empirical distribution of the variable visits and the theoretical poisson distribution with \(\lambda\) equals the visits mean 0.3017341, and the total number of observations is 5190.

pos <- dpois(0:9,0.302)*5190
both <- numeric(20)
both[1:20 %% 2 != 0] <- tab
both[1:20 %% 2 == 0] <- pos
labels<-character(20)
labels[1:20 %% 2==0]<-as.character(0:9)
barplot(both,col=rep(c("red","yellow"),10),names=labels)

As we see the two distributions are more or less closer to each other. Let’s now check the negative binomial distribution by first estimate the clumping parameter \(k=\frac{\bar x^2}{s^2-\bar x}\).

k<-mean(doc$visits)^2/(var(doc$visits)-mean(doc$visits))
bin<-dnbinom(0:9,0.27,mu=0.302)*5190
both1<-numeric(20)
both1[1:20 %% 2 != 0]<-tab
both1[1:20 %% 2 == 0]<-bin
labels<-character(20)
labels[1:20 %% 2==0]<-as.character(0:9)
barplot(both1,col=rep(c("red","yellow"),10),names=labels)

With this distribution it seems that the empiricall distribution is more closer to the negative binomial than the poisson distribution.

Note: This data has very large number of zeros for the outcome compared to the other values which means that any trained model that does not take into account this anomaly will be biased to predict more likely the zero value. However, at the end of this article I will show two famous models to handel this type of count data called Haurdle model and zero_inflated model.

3 Data partition

In oreder to evaluate our model we held out 20% of the data as testing set.

set.seed(123)
index<-sample(2,nrow(doc),replace = TRUE,p=c(.8,.2))
train<-doc[index==1,]
test<-doc[index==2,]

4 Poisson model

This model belongs to the generalized linear model families, so in the function glm we set the argument family to poisson. In practice this model is sufficient with a wide range of count data.

set.seed(123)
model1<-glm(visits~., data=train, family ="poisson")
tidy(model1)
# A tibble: 16 x 5
   term         estimate std.error statistic   p.value
   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)   -2.70     0.141     -19.2   9.14e- 82
 2 genderfemale   0.193    0.0620      3.11  1.88e-  3
 3 age            0.436    0.184       2.37  1.77e-  2
 4 income        -0.161    0.0928     -1.74  8.23e-  2
 5 illness1       0.944    0.113       8.35  6.76e- 17
 6 illness2       1.21     0.118      10.3   1.15e- 24
 7 illness3       1.11     0.132       8.43  3.51e- 17
 8 illness4       1.28     0.140       9.13  7.14e- 20
 9 illness5       1.44     0.139      10.4   2.34e- 25
10 reduced        0.126    0.00560    22.6   6.85e-113
11 health         0.0348   0.0112      3.10  1.91e-  3
12 privateyes     0.111    0.0795      1.39  1.64e-  1
13 freepooryes   -0.344    0.190      -1.81  7.00e-  2
14 freerepatyes   0.0377   0.104       0.363 7.16e-  1
15 nchronicyes    0.0186   0.0732      0.254 7.99e-  1
16 lchronicyes    0.0255   0.0916      0.279 7.81e-  1

As we see all the variables are significant except for the income so we remove this variable and reestimate again.

set.seed(123)
model1<-glm(visits~.-income, data=train, family ="poisson")
tidy(model1)
# A tibble: 15 x 5
   term         estimate std.error statistic   p.value
   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)   -2.83     0.121     -23.4   7.36e-121
 2 genderfemale   0.213    0.0609      3.51  4.56e-  4
 3 age            0.479    0.183       2.62  8.91e-  3
 4 illness1       0.946    0.113       8.38  5.44e- 17
 5 illness2       1.21     0.118      10.3   8.29e- 25
 6 illness3       1.12     0.132       8.50  1.93e- 17
 7 illness4       1.28     0.140       9.17  4.71e- 20
 8 illness5       1.45     0.139      10.5   1.05e- 25
 9 reduced        0.126    0.00560    22.6   1.12e-112
10 health         0.0350   0.0112      3.11  1.84e-  3
11 privateyes     0.100    0.0793      1.27  2.06e-  1
12 freepooryes   -0.290    0.188      -1.55  1.22e-  1
13 freerepatyes   0.0683   0.102       0.667 5.05e-  1
14 nchronicyes    0.0171   0.0731      0.235 8.15e-  1
15 lchronicyes    0.0282   0.0914      0.308 7.58e-  1

For the interpretation of the coefficient estimates, we should exponentiate these values to get the marginal effect since the poisson model uses the log link function to preclude negative values. For continuous predictor, say age, if this predictor increases by one year, ceteris-paribus, we expect the doctor visits will be \(exp(0.47876624)=1.614082\) times larger. whereas, for categorical predictor, say gender, the female has \(exp(0.21342446)=1.23791\) larger doctor visits than male.
By looking at p-values all the predictors are significant. However, we have to check other statistics and metrics.

glance(model1)
# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1         4565.    4154 -2685. 5399. 5494.    3486.        4140  4155

since the deviance value 3485.905 is lower than the degrees of freedom 4140, we will then worry about overdispersion problem. Fortunateley, the AER package provides a super easy way to test the significance of this difference via the function dispersiontest.

dispersiontest(model1)

    Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  1.397176 

If our target variable really follows poisson distribution then its variance \(V\) should be approximately equal to its mean \(\mu\), which is the null hypothesis of the following dispersiontest test against the alternative hypothesis that the variance of the form: \[V=\mu+\alpha.trafo(\mu)\]

Where the trafo is an hyperparameter that should be specified as an argument of this test. The popular choices for this argument are:

  • trafo = NULL (default): \(V=(1+\alpha)\mu\)
  • trafo = 1: \(V=\mu+\alpha.\mu\)
  • trafo = 2: \(V=\mu+\alpha.\mu^2\)

For the first choice if true, then the data will be better modeled by quasi-poisson model than poisson model. For the last ones if one of them is true then the negative binomial will be better than poisson model.
Now once the trafo is defined the test estimates \(\alpha\), such that:

  • if \(\alpha = 0\) : equidispersion (The null hypothesis)
  • if \(\alpha < 0\) : underdispersion
  • if \(\alpha > 0\) : overdispersion

Therefore, the result of the test will depend on the direction of the test, where we have two.sided, greater (default) for the overdispersion, and less for underdispersion.

With this in mind the output of the above test (with the default values) tested the overdispersion against the quasi-poisson model, and since the p-value is very small 1.714e-10 then we have overdispersion problem, suggesting the use of quasi-poisson model instead.

Now let’s test the negative binomial now.

dispersiontest(model1, trafo = 1)

    Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true alpha is greater than 0
sample estimates:
    alpha 
0.3971763 

The test suggested the use of negative binomial with linear function for the variance with very tiny p-value 1.714e-10. This model is known as NB1 (with linear variance function).

dispersiontest(model1, trafo = 2)

    Overdispersion test

data:  model1
z = 7.4723, p-value = 3.939e-14
alternative hypothesis: true alpha is greater than 0
sample estimates:
  alpha 
0.95488 

If the relation is in quadratic form then this model is called NB2. And since this p-value 3.939e-14 is smaller than the previous one then NB2 could be more appropriate than NB1.

5 Quasi poisson model

The first test suggested the use of quasi-poisson model, so let’s train this model with the same predictors as the previous one.

set.seed(123)
model2<-glm(visits~.-income, data=train, family ="quasipoisson")
tidy(model2)
# A tibble: 15 x 5
   term         estimate std.error statistic  p.value
   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)   -2.83     0.140     -20.2   1.25e-86
 2 genderfemale   0.213    0.0705      3.03  2.47e- 3
 3 age            0.479    0.212       2.26  2.39e- 2
 4 illness1       0.946    0.131       7.24  5.36e-13
 5 illness2       1.21     0.136       8.89  9.15e-19
 6 illness3       1.12     0.152       7.34  2.49e-13
 7 illness4       1.28     0.162       7.92  2.91e-15
 8 illness5       1.45     0.160       9.06  2.01e-19
 9 reduced        0.126    0.00647    19.5   4.79e-81
10 health         0.0350   0.0130      2.69  7.14e- 3
11 privateyes     0.100    0.0917      1.09  2.74e- 1
12 freepooryes   -0.290    0.217      -1.34  1.81e- 1
13 freerepatyes   0.0683   0.119       0.576 5.64e- 1
14 nchronicyes    0.0171   0.0846      0.203 8.39e- 1
15 lchronicyes    0.0282   0.106       0.266 7.90e- 1

This model uses the quasi maximum likelihood which gives the same coefficient estimates but with different (corrected) standard errors.
Since here also all the variables are significant We see that the models are the same except the correction of the standard errors which are now more larger. In other words, the poisson distribution under overdispersion underestimates the standard errors and hence the t test would be biased towards the rejection of the null hypothesis. To better understand what is going on with quasi-poisson model let’s put the estimates and the standard errors of both models into one table, and we add a column that resulted from dividing the second standard errors vector by the first one.

D1 <- tidy(model1)
colnames(D1) <- NULL
D2 <- tidy(model2)
colnames(D2) <- NULL
tibble(term=D1[[1]], estimate1=D1[[2]], std1=D1[[3]],estimate2=D2[[2]], std2=D2[[3]], dispersion= std2/std1)
# A tibble: 15 x 6
   term         estimate1    std1 estimate2    std2 dispersion
   <chr>            <dbl>   <dbl>     <dbl>   <dbl>      <dbl>
 1 (Intercept)    -2.83   0.121     -2.83   0.140         1.16
 2 genderfemale    0.213  0.0609     0.213  0.0705        1.16
 3 age             0.479  0.183      0.479  0.212         1.16
 4 illness1        0.946  0.113      0.946  0.131         1.16
 5 illness2        1.21   0.118      1.21   0.136         1.16
 6 illness3        1.12   0.132      1.12   0.152         1.16
 7 illness4        1.28   0.140      1.28   0.162         1.16
 8 illness5        1.45   0.139      1.45   0.160         1.16
 9 reduced         0.126  0.00560    0.126  0.00647       1.16
10 health          0.0350 0.0112     0.0350 0.0130        1.16
11 privateyes      0.100  0.0793     0.100  0.0917        1.16
12 freepooryes    -0.290  0.188     -0.290  0.217         1.16
13 freerepatyes    0.0683 0.102      0.0683 0.119         1.16
14 nchronicyes     0.0171 0.0731     0.0171 0.0846        1.16
15 lchronicyes     0.0282 0.0914     0.0282 0.106         1.16

Note: The first two columns are for the model1, and the last one are for the model 2. Not surprisingly that the result of the last column is constant since this is exactly what the quasi maximum likelihood does, it computes the corrected standard errors from the original ones as follows \(std2=dispersion*std1\), with the dispersion value being estimated as 1.15718. if you want to know where this value came from, the answer is simple. this model computes the sigma of the standardized residuals resulted from the original model. we can thus get this value by specifying the argument type to pear then computing sigma by hand as follows:

resid <- resid(model1, type = "pear")
sqrt(sum(resid^2)/4140)
[1] 1.15718

Now to test the prediction qualities of our models we use the testing set test by ploting the original and the predicted values. Let’s start by the model1

pred<- predict.glm(model1,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(pred),col="blue")

If you noticed, and due to the large number of zero’s of the target variable, i have intentionally removed all theses values in order to get clearer plot. From this plot we can say that the model does not fit well the data especially the larger values that are not well captured, however this may due to the fact that the data are very skewed towards zero.

To compare different models we can use the root mean-square error and mean absolute error (all the data with zero’s included). Note: Here we are using the rmse function from ModelMetrics that expects the inpute to be two vectors, and not that with the same name from the performance package that expects the input to be a model object . To avoid thus any such ambiguity you should type this command ModelMetrics::rmse.

pred <- predict.glm(model1, newdata = test, type = "response")
rmsemodelp <- ModelMetrics::rmse(test$visits,round(pred))
maemodelp <- mae(test$visits,round(pred))
rmsemodelp
[1] 0.7381921
maemodelp
[1] 0.284058

By the same way, Now let’s evaluate the quasi-poisson model.

predq<- predict.glm(model2,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predq),col="blue")

This plot does not seem to be very different from the previous plot. The rmse and mae for this model are computed as follows.

predq <- predict.glm(model2,newdata=test, type = "response")
rmsemodelqp <- ModelMetrics::rmse(test$visits,round(predq))
maemodelqp <- mae(test$visits,round(predq))
rmsemodelqp
[1] 0.7381921
maemodelqp
[1] 0.284058

we will not compare this two models until we finish with all the incoming models and we compare all the models at once.

6 Negative binomial model

The negative binomial distribution is used as an alternative for the poisson distribution under overdispersion problem.

set.seed(123)
model3<-glm.nb(visits~.-income, data=train)
summary(model3)

Call:
glm.nb(formula = visits ~ . - income, data = train, init.theta = 0.9715923611, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8413  -0.6894  -0.5335  -0.3540   3.6726  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.894820   0.135760 -21.323  < 2e-16 ***
genderfemale  0.258968   0.075352   3.437 0.000589 ***
age           0.511867   0.230297   2.223 0.026240 *  
illness1      0.880644   0.123264   7.144 9.04e-13 ***
illness2      1.171615   0.130240   8.996  < 2e-16 ***
illness3      1.118067   0.149032   7.502 6.28e-14 ***
illness4      1.263367   0.165370   7.640 2.18e-14 ***
illness5      1.378166   0.169907   8.111 5.01e-16 ***
reduced       0.141389   0.008184  17.275  < 2e-16 ***
health        0.041364   0.015029   2.752 0.005918 ** 
privateyes    0.086188   0.095173   0.906 0.365149    
freepooryes  -0.375471   0.223857  -1.677 0.093487 .  
freerepatyes  0.144928   0.127751   1.134 0.256602    
nchronicyes   0.022111   0.087590   0.252 0.800705    
lchronicyes   0.091622   0.114965   0.797 0.425477    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.9716) family taken to be 1)

    Null deviance: 3208.0  on 4154  degrees of freedom
Residual deviance: 2431.2  on 4140  degrees of freedom
AIC: 5159.6

Number of Fisher Scoring iterations: 1

              Theta:  0.972 
          Std. Err.:  0.103 

 2 x log-likelihood:  -5127.587 

As before we visualize the performance of this model as follows.

prednb<- predict.glm(model3,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(prednb),col="blue")

Again this plot also seems to be the same as the previous ones, so to figure out which model is best we use statistic metrics.

prednb<- predict.glm(model3,newdata=test,type = "response")
rmsemodelnb<-ModelMetrics::rmse(test$visits,round(prednb))
maemodelnb<-mae(test$visits,round(prednb))
knitr::kable(tibble(rms=rmsemodelnb,mae=maemodelnb))
rms mae
0.7808085 0.2966184

we will use these ouputs further.

7 Hurdle model

Originally proposed by Mullahy (1986) this model can take into account the fact that the data has more zeros and also can handle the overdispersion problem. It has two components (or steps), truncated count component defined by the chosen discrete distribution such as poisson or negative binomial, and a hurdle components models zero vs larger counts (that uses censored count distribution or binomial model). In other words, this models asumes that two population distributions underlying the data, one distribution for zero values, and another different distribution the psotive values. For more detail about hurdle and zero inflated models click here

To perform this model we make use of the function hurdle from the package pscl.

7.1 hurdle model with poisson distribution.

This model works in two steps. In the first step it uses binary classification to discriminate between the zero values and the positive values, and in the second step uses the traditional (poisson or binomial model, and here we use poisson model) model for positive values.

library(pscl)
set.seed(123)
modelhp<-hurdle(visits~. -income, data=train,dist = "poisson")
summary(modelhp)

Call:
hurdle(formula = visits ~ . - income, data = train, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5464 -0.4686 -0.3306 -0.2075 11.0887 

Count model coefficients (truncated poisson with log link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.977535   0.261835  -3.733 0.000189 ***
genderfemale  0.073326   0.098034   0.748 0.454480    
age           0.032762   0.287991   0.114 0.909427    
illness1      0.370071   0.251920   1.469 0.141833    
illness2      0.403514   0.256363   1.574 0.115489    
illness3      0.201724   0.278757   0.724 0.469277    
illness4      0.420285   0.277573   1.514 0.129990    
illness5      0.762209   0.269809   2.825 0.004728 ** 
reduced       0.111640   0.007967  14.013  < 2e-16 ***
health        0.007682   0.016452   0.467 0.640554    
privateyes   -0.215649   0.129860  -1.661 0.096787 .  
freepooryes   0.066277   0.269699   0.246 0.805879    
freerepatyes -0.434941   0.166196  -2.617 0.008870 ** 
nchronicyes   0.109660   0.125380   0.875 0.381779    
lchronicyes   0.135612   0.142766   0.950 0.342166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.224621   0.156469 -20.609  < 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  < 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  < 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 22 
Log-likelihood: -2581 on 30 Df

As we see this output has two tables. The above one is for the poisson model performed only on the truncated positive values, and the below one is the result of the logistic regression with only two classes (zero or positive value)
As we did before we plot the results.

predhp<- predict(modelhp,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predhp),col="blue")

As before by only looking at the plot we can not decide which model is the best. So it is better to use the statistic metrics.

predhp<- predict(modelhp,newdata=test, type = "response")
rmsemodelhp<-ModelMetrics::rmse(test$visits,round(predhp))
maemodelhp<-mae(test$visits,round(predhp))
knitr::kable(tibble(rmse=rmsemodelhp,mae=
maemodelhp))
rmse mae
0.7375374 0.2850242

7.2 hurdle model with negative binomial distribution.

Now let’s try to use the negative binomial instead of poisson distribution.

set.seed(123)
modelhnb<-hurdle(visits~.-income, data=train,dist = "negbin")
summary(modelhnb)

Call:
hurdle(formula = visits ~ . - income, data = train, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.9078 -0.4515 -0.3201 -0.2022 10.6552 

Count model coefficients (truncated negbin with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.68462    2.65037  -1.390   0.1645    
genderfemale  0.07432    0.17299   0.430   0.6675    
age           0.26774    0.54001   0.496   0.6200    
illness1      0.27678    0.34564   0.801   0.4233    
illness2      0.37093    0.35241   1.053   0.2925    
illness3      0.04728    0.39747   0.119   0.9053    
illness4      0.40386    0.40517   0.997   0.3189    
illness5      0.68213    0.41357   1.649   0.0991 .  
reduced       0.15813    0.01935   8.171 3.05e-16 ***
health        0.01891    0.03291   0.575   0.5656    
privateyes   -0.45711    0.23118  -1.977   0.0480 *  
freepooryes   0.03334    0.55282   0.060   0.9519    
freerepatyes -0.59189    0.30437  -1.945   0.0518 .  
nchronicyes   0.08737    0.21061   0.415   0.6783    
lchronicyes   0.30274    0.25846   1.171   0.2415    
Log(theta)   -2.80552    2.80120  -1.002   0.3166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.224621   0.156469 -20.609  < 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  < 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  < 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 0.0605
Number of iterations in BFGS optimization: 31 
Log-likelihood: -2524 on 31 Df

And let’s plot the difference between the predicted and the actual values of the testing set. .

predhnb<- predict(modelhnb,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predhnb),col="blue")

And for the metrics.

predhnb<- predict(modelhnb,newdata=test,type = "response")
rmsemodelhnb<-ModelMetrics::rmse(test$visits,round(predhnb))
maemodelhnb<-mae(test$visits,round(predhnb))
knitr::kable(tibble(rmse=rmsemodelhnb,mae=
maemodelhnb))
rmse mae
0.7408052 0.2879227

8 Zero inflated model

Such as the previous model type , this model also combines two components but with the difference that this model performs a mixture of binomial distribution (between zero and positive values) and the poisson (or negative binomial) distribution for the rest of the values (with the zero included).

8.1 Zero inflated model with poisson distribution

Here also we fit tow models one with poisson and one with negative binomial

set.seed(123)
modelzp<-zeroinfl(visits~.-income, data=train,dist = "poisson")
summary(modelzp)

Call:
zeroinfl(formula = visits ~ . - income, data = train, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.6247 -0.4791 -0.3326 -0.1783 12.3448 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.86467    0.23608  -3.663  0.00025 ***
genderfemale  0.03280    0.09078   0.361  0.71789    
age           0.13331    0.26922   0.495  0.62049    
illness1      0.32986    0.21846   1.510  0.13105    
illness2      0.34800    0.22426   1.552  0.12071    
illness3      0.20400    0.24152   0.845  0.39832    
illness4      0.44020    0.24324   1.810  0.07034 .  
illness5      0.72463    0.23632   3.066  0.00217 ** 
reduced       0.09679    0.00809  11.964  < 2e-16 ***
health        0.02269    0.01609   1.410  0.15860    
privateyes   -0.26390    0.12796  -2.062  0.03918 *  
freepooryes   0.04860    0.27675   0.176  0.86059    
freerepatyes -0.51894    0.17070  -3.040  0.00237 ** 
nchronicyes   0.08577    0.11490   0.746  0.45538    
lchronicyes   0.10876    0.12745   0.853  0.39346    

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.44812    0.34933   7.008 2.42e-12 ***
genderfemale -0.48766    0.19038  -2.562 0.010422 *  
age          -0.88816    0.59030  -1.505 0.132431    
illness1     -0.80833    0.31248  -2.587 0.009685 ** 
illness2     -1.41461    0.35338  -4.003 6.25e-05 ***
illness3     -1.69204    0.44028  -3.843 0.000121 ***
illness4     -1.52224    0.46334  -3.285 0.001019 ** 
illness5     -1.08742    0.46493  -2.339 0.019342 *  
reduced      -0.14462    0.03861  -3.746 0.000180 ***
health       -0.05796    0.04486  -1.292 0.196386    
privateyes   -0.73945    0.22597  -3.272 0.001066 ** 
freepooryes   0.73371    0.41402   1.772 0.076370 .  
freerepatyes -1.75454    0.53938  -3.253 0.001142 ** 
nchronicyes   0.13229    0.22623   0.585 0.558697    
lchronicyes   0.03647    0.30620   0.119 0.905194    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 42 
Log-likelihood: -2579 on 30 Df
predzp<- predict(modelzp,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predzp),col="blue")

predzp<- predict(modelzp,newdata=test,type = "response")
rmsemodelzp<-ModelMetrics::rmse(test$visits,round(predzp))
maemodelzp<-mae(test$visits,round(predzp))
knitr::kable(tibble(rmse=rmsemodelzp,mae=
maemodelzp))
rmse mae
0.7485897 0.2898551

8.2 Zero inflated model with negative binomial distribution

Let’s this time try the negative binomial distribution.

set.seed(123)
modelznb<-zeroinfl(visits~., data=train,dist = "negbin")
summary(modelznb)

Call:
zeroinfl(formula = visits ~ ., data = train, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.0440 -0.4582 -0.3031 -0.1680 14.2061 

Count model coefficients (negbin with log link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.383100   0.241778  -5.721 1.06e-08 ***
genderfemale  0.038985   0.090534   0.431 0.666751    
age           0.028837   0.290843   0.099 0.921019    
income       -0.196592   0.147560  -1.332 0.182767    
illness1      0.613258   0.174652   3.511 0.000446 ***
illness2      0.692297   0.179663   3.853 0.000117 ***
illness3      0.664613   0.196061   3.390 0.000699 ***
illness4      0.760162   0.204137   3.724 0.000196 ***
illness5      0.944756   0.206097   4.584 4.56e-06 ***
reduced       0.102651   0.008776  11.697  < 2e-16 ***
health        0.044012   0.015536   2.833 0.004611 ** 
privateyes   -0.168864   0.138680  -1.218 0.223358    
freepooryes  -0.422748   0.306653  -1.379 0.168022    
freerepatyes -0.383558   0.163995  -2.339 0.019344 *  
nchronicyes   0.033374   0.107881   0.309 0.757048    
lchronicyes   0.065834   0.128987   0.510 0.609775    
Log(theta)    0.473936   0.142626   3.323 0.000891 ***

Zero-inflation model coefficients (binomial with logit link):
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.280e+00  5.322e-01   4.285 1.83e-05 ***
genderfemale -7.269e-01  2.753e-01  -2.640  0.00828 ** 
age          -2.003e+00  9.202e-01  -2.177  0.02951 *  
income       -1.803e-01  3.933e-01  -0.458  0.64669    
illness1     -3.327e-01  3.480e-01  -0.956  0.33894    
illness2     -1.112e+00  4.496e-01  -2.473  0.01339 *  
illness3     -9.533e-01  5.127e-01  -1.859  0.06297 .  
illness4     -1.551e+00  7.398e-01  -2.097  0.03599 *  
illness5     -1.230e+00  8.597e-01  -1.431  0.15257    
reduced      -1.298e+00  4.577e-01  -2.836  0.00456 ** 
health       -1.443e-03  5.509e-02  -0.026  0.97910    
privateyes   -8.179e-01  3.178e-01  -2.574  0.01005 *  
freepooryes   2.394e-01  6.648e-01   0.360  0.71878    
freerepatyes -1.572e+01  1.528e+03  -0.010  0.99179    
nchronicyes   4.502e-02  2.982e-01   0.151  0.88001    
lchronicyes  -1.637e-01  4.951e-01  -0.331  0.74085    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 1.6063 
Number of iterations in BFGS optimization: 66 
Log-likelihood: -2512 on 33 Df
predznb<- predict(modelznb,newdata=test,type = "response")
rmsemodelznb<-ModelMetrics::rmse(test$visits,round(predznb))
maemodelznb<-mae(test$visits,round(predznb))
knitr::kable(tibble(rmse=rmsemodelznb,mae=maemodelznb))
rmse mae
0.7309579 0.2753623

Finally let’s compare all the above models.

rmse<-c(rmsemodelp,rmsemodelqp,rmsemodelnb,rmsemodelhp,rmsemodelhnb,
           rmsemodelzp,rmsemodelznb)
mae<-c(maemodelp,maemodelqp,maemodelnb,maemodelhp,maemodelhnb,
           maemodelzp,maemodelznb)
models<-c("pois","q_pois","nb","h_pois","h_nb","zer_pois","zer_nb")

data.frame(models,rmse,mae)%>% 
  arrange(rmse)
    models      rmse       mae
1   zer_nb 0.7309579 0.2753623
2   h_pois 0.7375374 0.2850242
3     pois 0.7381921 0.2840580
4   q_pois 0.7381921 0.2840580
5     h_nb 0.7408052 0.2879227
6 zer_pois 0.7485897 0.2898551
7       nb 0.7808085 0.2966184

Both metrics have chosen the zero inflated negative binomial model as the best model with minimum rmse value 0.7309579, and minimum mae value 0.2753623. this result is in line with the fact that this kind of models take care of the zero inflated data and at the same time the overdispersion problem.

9 Conclusion:

If the data is truly follows Poisson distribution then all the the other models have extra parameters that, during training process, converges to the optimum parameter values for poisson, this relation is like the linear regression to the generalized least squares. However, if the data is very skewed towards zero then it should be better to use the last two models to take care of this issue.

10 Furhter reading:

11 Session info

sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pscl_1.5.5           broom_0.7.1          AER_1.2-9           
 [4] survival_3.2-7       sandwich_3.0-0       lmtest_0.9-38       
 [7] zoo_1.8-8            car_3.0-10           carData_3.0-4       
[10] forcats_0.5.0        stringr_1.4.0        dplyr_1.0.2         
[13] readr_1.3.1          tidyr_1.1.2          tibble_3.0.3        
[16] ggplot2_3.3.2        tidyverse_1.3.0      MASS_7.3-53         
[19] purrr_0.3.4          corrr_0.4.2          ModelMetrics_1.2.2.2
[22] performance_0.5.0   

loaded via a namespace (and not attached):
 [1] httr_1.4.2        jsonlite_1.7.1    splines_4.0.1     modelr_0.1.8     
 [5] Formula_1.2-3     assertthat_0.2.1  highr_0.8         blob_1.2.1       
 [9] cellranger_1.1.0  yaml_2.2.1        bayestestR_0.7.2  pillar_1.4.6     
[13] backports_1.1.10  lattice_0.20-41   glue_1.4.2        digest_0.6.25    
[17] rvest_0.3.6       colorspace_1.4-1  htmltools_0.5.0   Matrix_1.2-18    
[21] pkgconfig_2.0.3   haven_2.3.1       bookdown_0.20     scales_1.1.1     
[25] openxlsx_4.2.2    rio_0.5.16        farver_2.0.3      generics_0.0.2   
[29] ellipsis_0.3.1    withr_2.3.0       cli_2.0.2         magrittr_1.5     
[33] crayon_1.3.4      readxl_1.3.1      evaluate_0.14     fs_1.5.0         
[37] fansi_0.4.1       xml2_1.3.2        foreign_0.8-80    blogdown_0.20    
[41] tools_4.0.1       data.table_1.13.0 hms_0.5.3         lifecycle_0.2.0  
[45] munsell_0.5.0     reprex_0.3.0      zip_2.1.1         compiler_4.0.1   
[49] rlang_0.4.7       grid_4.0.1        rstudioapi_0.11   labeling_0.3     
[53] rmarkdown_2.4     gtable_0.3.0      abind_1.4-5       DBI_1.1.0        
[57] curl_4.3          R6_2.4.1          lubridate_1.7.9   knitr_1.30       
[61] utf8_1.1.4        insight_0.9.6     stringi_1.5.3     Rcpp_1.0.5       
[65] vctrs_0.3.4       dbplyr_1.4.4      tidyselect_1.1.0  xfun_0.18        
Metales Abdelkader
Metales Abdelkader

My research interests include Econometrics, statistics, machine learning, deep learning.

comments powered by Disqus
comments powered by Disqus