Count Data Models
1 Introduction
When the response variable is a count (a non-negative integer), fitting a linear regression model is likely to violate the normality assumption, so the classical statistical tests are no longer valid for evaluating the model. However, just as with logistic regression, we can instead use a generalized linear model (GLM) with a suitably chosen distribution.
The natural distributions for this type of data are the discrete Poisson and negative binomial distributions. The former is the best choice when the mean and the variance of the response variable are close to each other; if they are not and we insist on using it anyway, we run into the problem of overdispersion of the residuals. As a remedy, we can use the negative binomial distribution, which does not impose this restriction.
If neither the Poisson nor the negative binomial distribution is suitable, there is another alternative called quasi maximum likelihood. The advantage of this method is that it uses only the relationship between the mean and the variance and does not require any prespecified distribution. Moreover, its estimators are approximately as efficient as maximum likelihood estimators.
2 Data preparation
To understand how to model count data we will use the DoctorVisits data from the AER package, in which the variable visits will be our target variable. Let’s load this data together with the packages we will need throughout this article.
ssh <- suppressPackageStartupMessages
ssh(library(performance))
ssh(library(ModelMetrics))
ssh(library(corrr))
ssh(library(purrr))
ssh(library(MASS))
ssh(library(tidyverse))
ssh(library(AER))
ssh(library(broom))
data("DoctorVisits")
doc <- DoctorVisits
glimpse(doc)
Rows: 5,190
Columns: 12
$ visits <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
$ gender <fct> female, female, male, male, male, female, female, female,...
$ age <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.1...
$ income <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.1...
$ illness <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, ...
$ reduced <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0,...
$ health <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, ...
$ private <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, n...
$ freepoor <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, ...
$ nchronic <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, ye...
$ lchronic <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
This data comes from an Australian health survey, where visits is the number of doctor visits in the past two weeks, together with the 11 features listed above.
First we display a summary of the data to check for any obvious issues.
summary(doc)
visits gender age income
Min. :0.0000 male :2488 Min. :0.1900 Min. :0.0000
1st Qu.:0.0000 female:2702 1st Qu.:0.2200 1st Qu.:0.2500
Median :0.0000 Median :0.3200 Median :0.5500
Mean :0.3017 Mean :0.4064 Mean :0.5832
3rd Qu.:0.0000 3rd Qu.:0.6200 3rd Qu.:0.9000
Max. :9.0000 Max. :0.7200 Max. :1.5000
illness reduced health private freepoor
Min. :0.000 Min. : 0.0000 Min. : 0.000 no :2892 no :4968
1st Qu.:0.000 1st Qu.: 0.0000 1st Qu.: 0.000 yes:2298 yes: 222
Median :1.000 Median : 0.0000 Median : 0.000
Mean :1.432 Mean : 0.8619 Mean : 1.218
3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
Max. :5.000 Max. :14.0000 Max. :12.000
freerepat nchronic lchronic
no :4099 no :3098 no :4585
yes:1091 yes:2092 yes: 605
As we can see there are no missing values, and visits ranges from 0 to 9, but it should be of integer type rather than double. Similarly, the variable illness should be converted to a factor since it takes only a few distinct values.
doc$visits<-as.integer(doc$visits)
doc$illness <- as.factor(doc$illness)
tab <- table(doc$visits)
tab
0 1 2 3 4 5 6 7 8 9
4141 782 174 30 24 9 12 12 5 1
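Since the choice between the Poisson and the negative binomial hinges on the mean-variance relationship (see the introduction), a quick first check is to compare the two directly. A minimal sketch on the prepared data:
mean(doc$visits)   # about 0.30
var(doc$visits)    # roughly twice the mean, a first hint of overdispersion
We will test this difference formally later with dispersiontest.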
A good way to start analyzing the data is to display the correlation coefficient for each pair of variables: any predictor that is highly correlated with the target is likely to be relevant in our future model. Since our target variable is not continuous, we use the Spearman correlation. The correlate function from the corrr package requires all variables to be of numeric type, so we convert the factors to integers.
doc1 <-modify_if(doc, is.factor, as.integer)
Notice that we have stored the result in a separate object, doc1, to keep our original data intact.
M <- correlate(doc1, method="spearman")
rplot(shave(M), colours=c("red", "white", "blue" ))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Looking at this plot, all the correlations are low. However, these correlations only assess monotonic relationships; they say nothing about any other form of dependence.
First let’s compare the empirical distribution of the variable visits with the theoretical Poisson distribution with \(\lambda\) equal to the mean of visits, 0.3017341, scaled by the total number of observations, 5190.
pos <- dpois(0:9, 0.302) * 5190   # expected Poisson counts for 0 to 9 visits
both <- numeric(20)
both[1:20 %% 2 != 0] <- tab       # odd positions: observed frequencies
both[1:20 %% 2 == 0] <- pos       # even positions: Poisson frequencies
labels <- character(20)
labels[1:20 %% 2 == 0] <- as.character(0:9)
barplot(both, col = rep(c("red", "yellow"), 10), names = labels)
As we can see, the two distributions are more or less close to each other. Let’s now check the negative binomial distribution, first estimating the clumping parameter \(k=\frac{\bar x^2}{s^2-\bar x}\).
k <- mean(doc$visits)^2 / (var(doc$visits) - mean(doc$visits))   # clumping parameter, about 0.27
bin <- dnbinom(0:9, size = 0.27, mu = 0.302) * 5190              # expected negative binomial counts
both1 <- numeric(20)
both1[1:20 %% 2 != 0] <- tab   # odd positions: observed frequencies
both1[1:20 %% 2 == 0] <- bin   # even positions: negative binomial frequencies
labels <- character(20)
labels[1:20 %% 2 == 0] <- as.character(0:9)
barplot(both1, col = rep(c("red", "yellow"), 10), names = labels)
With this distribution it seems that the empirical distribution is closer to the negative binomial than to the Poisson.
Note: this data has a very large number of zeros for the outcome compared to the other values, which means that any trained model that does not take this anomaly into account will be biased towards predicting the zero value. At the end of this article I will show two well-known models designed to handle this type of count data, called the hurdle model and the zero-inflated model.
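To quantify this imbalance, we can compute the share of zeros directly from the frequency table tab built above:
tab[1] / sum(tab)   # 4141/5190, about 80% of the observations are zero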
3 Data partition
In order to evaluate our models, we hold out 20% of the data as a testing set.
set.seed(123)
index <- sample(2, nrow(doc), replace = TRUE, prob = c(0.8, 0.2))
train <- doc[index == 1, ]
test <- doc[index == 2, ]
4 Poisson model
This model belongs to the generalized linear model family, so in the glm function we set the family argument to poisson. In practice this model is sufficient for a wide range of count data.
set.seed(123)
model1<-glm(visits~., data=train, family ="poisson")
tidy(model1)
# A tibble: 16 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.70 0.141 -19.2 9.14e- 82
2 genderfemale 0.193 0.0620 3.11 1.88e- 3
3 age 0.436 0.184 2.37 1.77e- 2
4 income -0.161 0.0928 -1.74 8.23e- 2
5 illness1 0.944 0.113 8.35 6.76e- 17
6 illness2 1.21 0.118 10.3 1.15e- 24
7 illness3 1.11 0.132 8.43 3.51e- 17
8 illness4 1.28 0.140 9.13 7.14e- 20
9 illness5 1.44 0.139 10.4 2.34e- 25
10 reduced 0.126 0.00560 22.6 6.85e-113
11 health 0.0348 0.0112 3.10 1.91e- 3
12 privateyes 0.111 0.0795 1.39 1.64e- 1
13 freepooryes -0.344 0.190 -1.81 7.00e- 2
14 freerepatyes 0.0377 0.104 0.363 7.16e- 1
15 nchronicyes 0.0186 0.0732 0.254 7.99e- 1
16 lchronicyes 0.0255 0.0916 0.279 7.81e- 1
Looking at the p-values, income is not significant, so we remove this variable and refit the model.
set.seed(123)
model1<-glm(visits~.-income, data=train, family ="poisson")
tidy(model1)
# A tibble: 15 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.83 0.121 -23.4 7.36e-121
2 genderfemale 0.213 0.0609 3.51 4.56e- 4
3 age 0.479 0.183 2.62 8.91e- 3
4 illness1 0.946 0.113 8.38 5.44e- 17
5 illness2 1.21 0.118 10.3 8.29e- 25
6 illness3 1.12 0.132 8.50 1.93e- 17
7 illness4 1.28 0.140 9.17 4.71e- 20
8 illness5 1.45 0.139 10.5 1.05e- 25
9 reduced 0.126 0.00560 22.6 1.12e-112
10 health 0.0350 0.0112 3.11 1.84e- 3
11 privateyes 0.100 0.0793 1.27 2.06e- 1
12 freepooryes -0.290 0.188 -1.55 1.22e- 1
13 freerepatyes 0.0683 0.102 0.667 5.05e- 1
14 nchronicyes 0.0171 0.0731 0.235 8.15e- 1
15 lchronicyes 0.0282 0.0914 0.308 7.58e- 1
To interpret the coefficient estimates, we exponentiate them to obtain multiplicative (marginal) effects, since the Poisson model uses the log link function to preclude negative predictions. For a continuous predictor, say age, if this predictor increases by one unit, ceteris paribus, we expect the number of doctor visits to be \(\exp(0.47876624)=1.614082\) times larger. For a categorical predictor, say gender, females have \(\exp(0.21342446)=1.23791\) times more doctor visits than males.
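These multiplicative effects, often called incidence rate ratios, can be computed for all coefficients at once. A minimal sketch using the fitted model1:
exp(coef(model1))      # incidence rate ratios
exp(confint(model1))   # confidence intervals on the same multiplicative scale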
Judging by the p-values, the demographic and illness-related predictors are significant, while several of the insurance dummies are not. In any case, we still have to check other statistics and metrics.
glance(model1)
# A tibble: 1 x 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 4565. 4154 -2685. 5399. 5494. 3486. 4140 4155
Comparing the residual deviance 3485.905 with its 4140 degrees of freedom is only a rough check for dispersion problems; to be sure, we should test the difference formally. Fortunately, the AER package provides a super easy way to do so via the function dispersiontest.
dispersiontest(model1)
Overdispersion test
data: model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.397176
If our target variable really follows a Poisson distribution, then its variance \(V\) should be approximately equal to its mean \(\mu\). That is the null hypothesis of the dispersiontest test, tested against the alternative hypothesis that the variance has the form: \[V=\mu+\alpha\cdot trafo(\mu)\]
where trafo is a transformation of the mean that is specified as an argument of the test. The popular choices for this argument are:
- trafo = NULL (default): \(V=(1+\alpha)\mu\)
- trafo = 1: \(V=\mu+\alpha\cdot\mu\)
- trafo = 2: \(V=\mu+\alpha\cdot\mu^2\)
If the first alternative is true, the data are better modeled by a quasi-Poisson model than by a Poisson model. If one of the last two is true, the negative binomial model will be better than the Poisson model.
Once trafo is defined, the test estimates \(\alpha\), such that:
- if \(\alpha = 0\) : equidispersion (The null hypothesis)
- if \(\alpha < 0\) : underdispersion
- if \(\alpha > 0\) : overdispersion
Therefore, the result depends on the direction of the test, for which we have three choices: two.sided, greater (the default, for overdispersion), and less (for underdispersion).
With this in mind, the above test (run with the default values) tested for overdispersion against the quasi-Poisson alternative, and since the p-value 1.714e-10 is very small, we do have an overdispersion problem, suggesting the use of the quasi-Poisson model instead.
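For completeness, the direction is controlled by the alternative argument of dispersiontest. For instance, a sketch of the two other directions (output not shown):
dispersiontest(model1, alternative = "two.sided")   # test alpha different from 0
dispersiontest(model1, alternative = "less")        # test for underdispersion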
Now let’s test the negative binomial alternatives.
dispersiontest(model1, trafo = 1)
Overdispersion test
data: model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
0.3971763
The test suggests the use of a negative binomial model with a linear variance function, with a very tiny p-value of 1.714e-10. This model is known as NB1 (negative binomial with linear variance function).
dispersiontest(model1, trafo = 2)
Overdispersion test
data: model1
z = 7.4723, p-value = 3.939e-14
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
0.95488
If the relation has the quadratic form, the model is called NB2. Since this p-value 3.939e-14 is even smaller than the previous one, NB2 could be more appropriate than NB1.
5 Quasi-Poisson model
The first test suggested the use of the quasi-Poisson model, so let’s train this model with the same predictors as before.
set.seed(123)
model2<-glm(visits~.-income, data=train, family ="quasipoisson")
tidy(model2)
# A tibble: 15 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.83 0.140 -20.2 1.25e-86
2 genderfemale 0.213 0.0705 3.03 2.47e- 3
3 age 0.479 0.212 2.26 2.39e- 2
4 illness1 0.946 0.131 7.24 5.36e-13
5 illness2 1.21 0.136 8.89 9.15e-19
6 illness3 1.12 0.152 7.34 2.49e-13
7 illness4 1.28 0.162 7.92 2.91e-15
8 illness5 1.45 0.160 9.06 2.01e-19
9 reduced 0.126 0.00647 19.5 4.79e-81
10 health 0.0350 0.0130 2.69 7.14e- 3
11 privateyes 0.100 0.0917 1.09 2.74e- 1
12 freepooryes -0.290 0.217 -1.34 1.81e- 1
13 freerepatyes 0.0683 0.119 0.576 5.64e- 1
14 nchronicyes 0.0171 0.0846 0.203 8.39e- 1
15 lchronicyes 0.0282 0.106 0.266 7.90e- 1
This model uses quasi maximum likelihood, which gives the same coefficient estimates but with different (corrected) standard errors.
We see that the two models are the same except for the correction of the standard errors, which are now larger. In other words, under overdispersion the Poisson model underestimates the standard errors, and hence the significance tests would be biased towards rejecting the null hypothesis.
To better understand what is going on with the quasi-Poisson model, let’s put the estimates and standard errors of both models into one table, and add a column obtained by dividing the second vector of standard errors by the first.
D1 <- tidy(model1)
colnames(D1) <- NULL
D2 <- tidy(model2)
colnames(D2) <- NULL
tibble(term=D1[[1]], estimate1=D1[[2]], std1=D1[[3]],estimate2=D2[[2]], std2=D2[[3]], dispersion= std2/std1)
# A tibble: 15 x 6
term estimate1 std1 estimate2 std2 dispersion
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.83 0.121 -2.83 0.140 1.16
2 genderfemale 0.213 0.0609 0.213 0.0705 1.16
3 age 0.479 0.183 0.479 0.212 1.16
4 illness1 0.946 0.113 0.946 0.131 1.16
5 illness2 1.21 0.118 1.21 0.136 1.16
6 illness3 1.12 0.132 1.12 0.152 1.16
7 illness4 1.28 0.140 1.28 0.162 1.16
8 illness5 1.45 0.139 1.45 0.160 1.16
9 reduced 0.126 0.00560 0.126 0.00647 1.16
10 health 0.0350 0.0112 0.0350 0.0130 1.16
11 privateyes 0.100 0.0793 0.100 0.0917 1.16
12 freepooryes -0.290 0.188 -0.290 0.217 1.16
13 freerepatyes 0.0683 0.102 0.0683 0.119 1.16
14 nchronicyes 0.0171 0.0731 0.0171 0.0846 1.16
15 lchronicyes 0.0282 0.0914 0.0282 0.106 1.16
Note: the first two columns of estimates and standard errors are for model1, and the next two are for model2. Not surprisingly, the last column is constant, since this is exactly what quasi maximum likelihood does: it computes the corrected standard errors from the original ones as \(std2 = dispersion \times std1\), with the dispersion factor estimated here as 1.15718. If you want to know where this value comes from, the answer is simple: it is the standard deviation of the Pearson (standardized) residuals of the original model. We can compute it by hand by setting the type argument of resid to "pear" (short for "pearson") as follows:
resid <- resid(model1, type = "pear")   # Pearson residuals of the Poisson model
sqrt(sum(resid^2) / 4140)               # divided by the residual degrees of freedom
[1] 1.15718
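As a quick cross-check, the same factor can be recovered from the quasi-Poisson fit itself: the summary of a quasi-Poisson glm stores the Pearson dispersion \(\hat\phi\) (about 1.339 here), whose square root is the correction factor. A one-line sketch, assuming model2 as fitted above:
sqrt(summary(model2)$dispersion)   # should reproduce 1.15718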
Now, to assess the predictive quality of our models, we use the testing set test, plotting the original values against the predicted ones. Let’s start with model1.
pred<- predict.glm(model1,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(pred),col="blue")
Due to the large number of zeros in the target variable, I have intentionally removed these values in order to get a clearer plot. From this plot we can say that the model does not fit the data well: the larger values in particular are not well captured. However, this may be due to the fact that the data are very skewed towards zero.
To compare the different models we can use the root mean square error (RMSE) and the mean absolute error (MAE), computed on all the test data with the zeros included.
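For \(n\) test observations with true values \(y_i\) and predictions \(\hat y_i\), these two metrics are defined as: \[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2},\qquad MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat y_i|\]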
Note: here we are using the rmse function from ModelMetrics, which expects two vectors as input, not the function of the same name from the performance package, which expects a model object. To avoid any ambiguity, use the fully qualified call ModelMetrics::rmse.
pred <- predict.glm(model1, newdata = test, type = "response")
rmsemodelp <- ModelMetrics::rmse(test$visits,round(pred))
maemodelp <- mae(test$visits,round(pred))
rmsemodelp
[1] 0.7381921
maemodelp
[1] 0.284058
In the same way, let’s now evaluate the quasi-Poisson model.
predq<- predict.glm(model2,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predq),col="blue")
This plot does not seem very different from the previous one. The rmse and mae for this model are computed as follows.
predq <- predict.glm(model2,newdata=test, type = "response")
rmsemodelqp <- ModelMetrics::rmse(test$visits,round(predq))
maemodelqp <- mae(test$visits,round(predq))
rmsemodelqp
[1] 0.7381921
maemodelqp
[1] 0.284058
We will not compare these two models yet; we will wait until all the remaining models are fitted and then compare them all at once.
6 Negative binomial model
The negative binomial distribution is used as an alternative to the Poisson distribution when there is an overdispersion problem.
set.seed(123)
model3<-glm.nb(visits~.-income, data=train)
summary(model3)
Call:
glm.nb(formula = visits ~ . - income, data = train, init.theta = 0.9715923611,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8413 -0.6894 -0.5335 -0.3540 3.6726
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.894820 0.135760 -21.323 < 2e-16 ***
genderfemale 0.258968 0.075352 3.437 0.000589 ***
age 0.511867 0.230297 2.223 0.026240 *
illness1 0.880644 0.123264 7.144 9.04e-13 ***
illness2 1.171615 0.130240 8.996 < 2e-16 ***
illness3 1.118067 0.149032 7.502 6.28e-14 ***
illness4 1.263367 0.165370 7.640 2.18e-14 ***
illness5 1.378166 0.169907 8.111 5.01e-16 ***
reduced 0.141389 0.008184 17.275 < 2e-16 ***
health 0.041364 0.015029 2.752 0.005918 **
privateyes 0.086188 0.095173 0.906 0.365149
freepooryes -0.375471 0.223857 -1.677 0.093487 .
freerepatyes 0.144928 0.127751 1.134 0.256602
nchronicyes 0.022111 0.087590 0.252 0.800705
lchronicyes 0.091622 0.114965 0.797 0.425477
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.9716) family taken to be 1)
Null deviance: 3208.0 on 4154 degrees of freedom
Residual deviance: 2431.2 on 4140 degrees of freedom
AIC: 5159.6
Number of Fisher Scoring iterations: 1
Theta: 0.972
Std. Err.: 0.103
2 x log-likelihood: -5127.587
As before we visualize the performance of this model as follows.
prednb<- predict.glm(model3,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(prednb),col="blue")
Again, this plot seems much the same as the previous ones, so to figure out which model is best we turn to the metrics.
prednb<- predict.glm(model3,newdata=test,type = "response")
rmsemodelnb<-ModelMetrics::rmse(test$visits,round(prednb))
maemodelnb<-mae(test$visits,round(prednb))
knitr::kable(tibble(rms=rmsemodelnb,mae=maemodelnb))
rms | mae |
---|---|
0.7808085 | 0.2966184 |
We will use these outputs later in the final comparison.
7 Hurdle model
Originally proposed by Mullahy (1986), this model can take into account the fact that the data has an excess of zeros, and it can also handle the overdispersion problem. It has two components (or steps): a truncated count component defined by the chosen discrete distribution, such as the Poisson or the negative binomial, and a hurdle component that models zero versus larger counts (using a censored count distribution or a binomial model). In other words, this model assumes that two population distributions underlie the data: one distribution for the zero values and a different one for the positive values. For more detail about hurdle and zero-inflated models click here
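Formally, if \(f_{count}\) denotes the chosen count density (Poisson or negative binomial) and \(\pi_0\) the zero probability produced by the hurdle (binary) component, the hurdle density can be written as: \[P(Y=0)=\pi_0,\qquad P(Y=y)=(1-\pi_0)\,\frac{f_{count}(y)}{1-f_{count}(0)},\quad y=1,2,\dots\]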
To fit this model we make use of the hurdle function from the pscl package.
7.1 Hurdle model with Poisson distribution
This model works in two steps. In the first step it uses a binary classification model to discriminate between the zero values and the positive values, and in the second step it fits a truncated count model (Poisson or negative binomial; here we use the Poisson) to the positive values.
library(pscl)
set.seed(123)
modelhp<-hurdle(visits~. -income, data=train,dist = "poisson")
summary(modelhp)
Call:
hurdle(formula = visits ~ . - income, data = train, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-1.5464 -0.4686 -0.3306 -0.2075 11.0887
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.977535 0.261835 -3.733 0.000189 ***
genderfemale 0.073326 0.098034 0.748 0.454480
age 0.032762 0.287991 0.114 0.909427
illness1 0.370071 0.251920 1.469 0.141833
illness2 0.403514 0.256363 1.574 0.115489
illness3 0.201724 0.278757 0.724 0.469277
illness4 0.420285 0.277573 1.514 0.129990
illness5 0.762209 0.269809 2.825 0.004728 **
reduced 0.111640 0.007967 14.013 < 2e-16 ***
health 0.007682 0.016452 0.467 0.640554
privateyes -0.215649 0.129860 -1.661 0.096787 .
freepooryes 0.066277 0.269699 0.246 0.805879
freerepatyes -0.434941 0.166196 -2.617 0.008870 **
nchronicyes 0.109660 0.125380 0.875 0.381779
lchronicyes 0.135612 0.142766 0.950 0.342166
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.224621 0.156469 -20.609 < 2e-16 ***
genderfemale 0.305324 0.089648 3.406 0.000660 ***
age 0.700345 0.276089 2.537 0.011191 *
illness1 0.885148 0.136951 6.463 1.02e-10 ***
illness2 1.238227 0.146059 8.478 < 2e-16 ***
illness3 1.263698 0.169344 7.462 8.50e-14 ***
illness4 1.405167 0.195388 7.192 6.40e-13 ***
illness5 1.445585 0.208425 6.936 4.04e-12 ***
reduced 0.154858 0.013488 11.481 < 2e-16 ***
health 0.070464 0.019142 3.681 0.000232 ***
privateyes 0.271192 0.112751 2.405 0.016163 *
freepooryes -0.546177 0.277942 -1.965 0.049406 *
freerepatyes 0.423220 0.153994 2.748 0.005991 **
nchronicyes -0.006256 0.102033 -0.061 0.951106
lchronicyes 0.070658 0.140587 0.503 0.615251
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 22
Log-likelihood: -2581 on 30 Df
As we can see, this output contains two tables. The upper one is for the Poisson model fitted only to the truncated positive counts, and the lower one is the result of the logistic regression that separates zeros from positive values.
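If we need the two sets of coefficients separately, the coef method for hurdle objects accepts a model argument; a small sketch on the fitted modelhp:
coef(modelhp, model = "count")   # truncated count component
coef(modelhp, model = "zero")    # binary hurdle component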
As we did before we plot the results.
predhp<- predict(modelhp,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predhp),col="blue")
As before, we cannot decide which model is best by looking at the plot alone, so it is better to use the metrics.
predhp<- predict(modelhp,newdata=test, type = "response")
rmsemodelhp<-ModelMetrics::rmse(test$visits,round(predhp))
maemodelhp<-mae(test$visits,round(predhp))
knitr::kable(tibble(rmse=rmsemodelhp,mae=
maemodelhp))
rmse | mae |
---|---|
0.7375374 | 0.2850242 |
7.2 Hurdle model with negative binomial distribution
Now let’s use the negative binomial distribution instead of the Poisson.
set.seed(123)
modelhnb<-hurdle(visits~.-income, data=train,dist = "negbin")
summary(modelhnb)
Call:
hurdle(formula = visits ~ . - income, data = train, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.9078 -0.4515 -0.3201 -0.2022 10.6552
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.68462 2.65037 -1.390 0.1645
genderfemale 0.07432 0.17299 0.430 0.6675
age 0.26774 0.54001 0.496 0.6200
illness1 0.27678 0.34564 0.801 0.4233
illness2 0.37093 0.35241 1.053 0.2925
illness3 0.04728 0.39747 0.119 0.9053
illness4 0.40386 0.40517 0.997 0.3189
illness5 0.68213 0.41357 1.649 0.0991 .
reduced 0.15813 0.01935 8.171 3.05e-16 ***
health 0.01891 0.03291 0.575 0.5656
privateyes -0.45711 0.23118 -1.977 0.0480 *
freepooryes 0.03334 0.55282 0.060 0.9519
freerepatyes -0.59189 0.30437 -1.945 0.0518 .
nchronicyes 0.08737 0.21061 0.415 0.6783
lchronicyes 0.30274 0.25846 1.171 0.2415
Log(theta) -2.80552 2.80120 -1.002 0.3166
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.224621 0.156469 -20.609 < 2e-16 ***
genderfemale 0.305324 0.089648 3.406 0.000660 ***
age 0.700345 0.276089 2.537 0.011191 *
illness1 0.885148 0.136951 6.463 1.02e-10 ***
illness2 1.238227 0.146059 8.478 < 2e-16 ***
illness3 1.263698 0.169344 7.462 8.50e-14 ***
illness4 1.405167 0.195388 7.192 6.40e-13 ***
illness5 1.445585 0.208425 6.936 4.04e-12 ***
reduced 0.154858 0.013488 11.481 < 2e-16 ***
health 0.070464 0.019142 3.681 0.000232 ***
privateyes 0.271192 0.112751 2.405 0.016163 *
freepooryes -0.546177 0.277942 -1.965 0.049406 *
freerepatyes 0.423220 0.153994 2.748 0.005991 **
nchronicyes -0.006256 0.102033 -0.061 0.951106
lchronicyes 0.070658 0.140587 0.503 0.615251
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.0605
Number of iterations in BFGS optimization: 31
Log-likelihood: -2524 on 31 Df
Now let’s plot the difference between the predicted and the actual values on the testing set.
predhnb<- predict(modelhnb,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predhnb),col="blue")
And for the metrics.
predhnb<- predict(modelhnb,newdata=test,type = "response")
rmsemodelhnb<-ModelMetrics::rmse(test$visits,round(predhnb))
maemodelhnb<-mae(test$visits,round(predhnb))
knitr::kable(tibble(rmse=rmsemodelhnb,mae=
maemodelhnb))
rmse | mae |
---|---|
0.7408052 | 0.2879227 |
8 Zero-inflated model
Like the hurdle model, this model combines two components, but with the difference that it is a mixture: a binomial component generates the extra zeros, while a Poisson (or negative binomial) component models all the counts, zeros included.
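Formally, if \(\pi\) denotes the mixing probability of the extra-zero component and \(f_{count}\) the Poisson or negative binomial density, the zero-inflated density is: \[P(Y=0)=\pi+(1-\pi)f_{count}(0),\qquad P(Y=y)=(1-\pi)f_{count}(y),\quad y=1,2,\dots\]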
8.1 Zero-inflated model with Poisson distribution
Here also we fit two models, one with the Poisson distribution and one with the negative binomial.
set.seed(123)
modelzp<-zeroinfl(visits~.-income, data=train,dist = "poisson")
summary(modelzp)
Call:
zeroinfl(formula = visits ~ . - income, data = train, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-1.6247 -0.4791 -0.3326 -0.1783 12.3448
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.86467 0.23608 -3.663 0.00025 ***
genderfemale 0.03280 0.09078 0.361 0.71789
age 0.13331 0.26922 0.495 0.62049
illness1 0.32986 0.21846 1.510 0.13105
illness2 0.34800 0.22426 1.552 0.12071
illness3 0.20400 0.24152 0.845 0.39832
illness4 0.44020 0.24324 1.810 0.07034 .
illness5 0.72463 0.23632 3.066 0.00217 **
reduced 0.09679 0.00809 11.964 < 2e-16 ***
health 0.02269 0.01609 1.410 0.15860
privateyes -0.26390 0.12796 -2.062 0.03918 *
freepooryes 0.04860 0.27675 0.176 0.86059
freerepatyes -0.51894 0.17070 -3.040 0.00237 **
nchronicyes 0.08577 0.11490 0.746 0.45538
lchronicyes 0.10876 0.12745 0.853 0.39346
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.44812 0.34933 7.008 2.42e-12 ***
genderfemale -0.48766 0.19038 -2.562 0.010422 *
age -0.88816 0.59030 -1.505 0.132431
illness1 -0.80833 0.31248 -2.587 0.009685 **
illness2 -1.41461 0.35338 -4.003 6.25e-05 ***
illness3 -1.69204 0.44028 -3.843 0.000121 ***
illness4 -1.52224 0.46334 -3.285 0.001019 **
illness5 -1.08742 0.46493 -2.339 0.019342 *
reduced -0.14462 0.03861 -3.746 0.000180 ***
health -0.05796 0.04486 -1.292 0.196386
privateyes -0.73945 0.22597 -3.272 0.001066 **
freepooryes 0.73371 0.41402 1.772 0.076370 .
freerepatyes -1.75454 0.53938 -3.253 0.001142 **
nchronicyes 0.13229 0.22623 0.585 0.558697
lchronicyes 0.03647 0.30620 0.119 0.905194
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 42
Log-likelihood: -2579 on 30 Df
predzp<- predict(modelzp,newdata=test[test$visits!=0,],type = "response")
plot(test$visits[test$visits!=0],type = "b",col="red")
lines(round(predzp),col="blue")
predzp<- predict(modelzp,newdata=test,type = "response")
rmsemodelzp<-ModelMetrics::rmse(test$visits,round(predzp))
maemodelzp<-mae(test$visits,round(predzp))
knitr::kable(tibble(rmse=rmsemodelzp,mae=
maemodelzp))
rmse | mae |
---|---|
0.7485897 | 0.2898551 |
8.2 Zero-inflated model with negative binomial distribution
This time let’s try the negative binomial distribution (note that in this model income is kept among the predictors).
set.seed(123)
modelznb<-zeroinfl(visits~., data=train,dist = "negbin")
summary(modelznb)
Call:
zeroinfl(formula = visits ~ ., data = train, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.0440 -0.4582 -0.3031 -0.1680 14.2061
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.383100 0.241778 -5.721 1.06e-08 ***
genderfemale 0.038985 0.090534 0.431 0.666751
age 0.028837 0.290843 0.099 0.921019
income -0.196592 0.147560 -1.332 0.182767
illness1 0.613258 0.174652 3.511 0.000446 ***
illness2 0.692297 0.179663 3.853 0.000117 ***
illness3 0.664613 0.196061 3.390 0.000699 ***
illness4 0.760162 0.204137 3.724 0.000196 ***
illness5 0.944756 0.206097 4.584 4.56e-06 ***
reduced 0.102651 0.008776 11.697 < 2e-16 ***
health 0.044012 0.015536 2.833 0.004611 **
privateyes -0.168864 0.138680 -1.218 0.223358
freepooryes -0.422748 0.306653 -1.379 0.168022
freerepatyes -0.383558 0.163995 -2.339 0.019344 *
nchronicyes 0.033374 0.107881 0.309 0.757048
lchronicyes 0.065834 0.128987 0.510 0.609775
Log(theta) 0.473936 0.142626 3.323 0.000891 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.280e+00 5.322e-01 4.285 1.83e-05 ***
genderfemale -7.269e-01 2.753e-01 -2.640 0.00828 **
age -2.003e+00 9.202e-01 -2.177 0.02951 *
income -1.803e-01 3.933e-01 -0.458 0.64669
illness1 -3.327e-01 3.480e-01 -0.956 0.33894
illness2 -1.112e+00 4.496e-01 -2.473 0.01339 *
illness3 -9.533e-01 5.127e-01 -1.859 0.06297 .
illness4 -1.551e+00 7.398e-01 -2.097 0.03599 *
illness5 -1.230e+00 8.597e-01 -1.431 0.15257
reduced -1.298e+00 4.577e-01 -2.836 0.00456 **
health -1.443e-03 5.509e-02 -0.026 0.97910
privateyes -8.179e-01 3.178e-01 -2.574 0.01005 *
freepooryes 2.394e-01 6.648e-01 0.360 0.71878
freerepatyes -1.572e+01 1.528e+03 -0.010 0.99179
nchronicyes 4.502e-02 2.982e-01 0.151 0.88001
lchronicyes -1.637e-01 4.951e-01 -0.331 0.74085
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.6063
Number of iterations in BFGS optimization: 66
Log-likelihood: -2512 on 33 Df
predznb<- predict(modelznb,newdata=test,type = "response")
rmsemodelznb<-ModelMetrics::rmse(test$visits,round(predznb))
maemodelznb<-mae(test$visits,round(predznb))
knitr::kable(tibble(rmse=rmsemodelznb,mae=maemodelznb))
rmse | mae |
---|---|
0.7309579 | 0.2753623 |
Finally let’s compare all the above models.
rmse<-c(rmsemodelp,rmsemodelqp,rmsemodelnb,rmsemodelhp,rmsemodelhnb,
rmsemodelzp,rmsemodelznb)
mae<-c(maemodelp,maemodelqp,maemodelnb,maemodelhp,maemodelhnb,
maemodelzp,maemodelznb)
models<-c("pois","q_pois","nb","h_pois","h_nb","zer_pois","zer_nb")
data.frame(models,rmse,mae)%>%
arrange(rmse)
models rmse mae
1 zer_nb 0.7309579 0.2753623
2 h_pois 0.7375374 0.2850242
3 pois 0.7381921 0.2840580
4 q_pois 0.7381921 0.2840580
5 h_nb 0.7408052 0.2879227
6 zer_pois 0.7485897 0.2898551
7 nb 0.7808085 0.2966184
Both metrics have chosen the zero-inflated negative binomial model as the best model, with the minimum rmse value 0.7309579 and the minimum mae value 0.2753623. This result is in line with the fact that this kind of model takes care of the zero inflation and, at the same time, of the overdispersion problem.
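Beyond held-out prediction error, non-nested count models are often compared with the Vuong test from the pscl package. As a sketch (output not shown), we could compare the plain Poisson model with the winning zero-inflated negative binomial:
vuong(model1, modelznb)   # a significantly negative statistic favors the zero-inflated model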
9 Conclusion
If the data truly follow a Poisson distribution, then all the other models have extra parameters that, during the training process, converge to values that recover the Poisson model; this relationship is analogous to that between linear regression and generalized least squares. However, if the data are very skewed towards zero, it is better to use the last two model types, which take care of this issue.
10 Further reading
- Michael J. Crawley, The R book, WILEY, UK, 2013. http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm
11 Session info
sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pscl_1.5.5 broom_0.7.1 AER_1.2-9
[4] survival_3.2-7 sandwich_3.0-0 lmtest_0.9-38
[7] zoo_1.8-8 car_3.0-10 carData_3.0-4
[10] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
[13] readr_1.3.1 tidyr_1.1.2 tibble_3.0.3
[16] ggplot2_3.3.2 tidyverse_1.3.0 MASS_7.3-53
[19] purrr_0.3.4 corrr_0.4.2 ModelMetrics_1.2.2.2
[22] performance_0.5.0
loaded via a namespace (and not attached):
[1] httr_1.4.2 jsonlite_1.7.1 splines_4.0.1 modelr_0.1.8
[5] Formula_1.2-3 assertthat_0.2.1 highr_0.8 blob_1.2.1
[9] cellranger_1.1.0 yaml_2.2.1 bayestestR_0.7.2 pillar_1.4.6
[13] backports_1.1.10 lattice_0.20-41 glue_1.4.2 digest_0.6.25
[17] rvest_0.3.6 colorspace_1.4-1 htmltools_0.5.0 Matrix_1.2-18
[21] pkgconfig_2.0.3 haven_2.3.1 bookdown_0.20 scales_1.1.1
[25] openxlsx_4.2.2 rio_0.5.16 farver_2.0.3 generics_0.0.2
[29] ellipsis_0.3.1 withr_2.3.0 cli_2.0.2 magrittr_1.5
[33] crayon_1.3.4 readxl_1.3.1 evaluate_0.14 fs_1.5.0
[37] fansi_0.4.1 xml2_1.3.2 foreign_0.8-80 blogdown_0.20
[41] tools_4.0.1 data.table_1.13.0 hms_0.5.3 lifecycle_0.2.0
[45] munsell_0.5.0 reprex_0.3.0 zip_2.1.1 compiler_4.0.1
[49] rlang_0.4.7 grid_4.0.1 rstudioapi_0.11 labeling_0.3
[53] rmarkdown_2.4 gtable_0.3.0 abind_1.4-5 DBI_1.1.0
[57] curl_4.3 R6_2.4.1 lubridate_1.7.9 knitr_1.30
[61] utf8_1.1.4 insight_0.9.6 stringi_1.5.3 Rcpp_1.0.5
[65] vctrs_0.3.4 dbplyr_1.4.4 tidyselect_1.1.0 xfun_0.18