Modelling Count Data in R: A Multilevel Framework

A Quick Guide

Francisco Rowe

2021-04-19

Aim

This notebook is a practical guide. It aims to illustrate how to quickly estimate a count data regression model using four different variants: Poisson, Zero-inflated Poisson, Negative Binomial and Zero-inflated Negative Binomial models. It is not supposed to be a comprehensive guide but it is an starting point if you don’t know what R packages and how relevant functions can be used to estimate these models.

Here I am particularly interested in fitting these models in a hierarchical modelling framework. I illustrate this by using R functions within the glmmTMB() package and also glmer() or glmer.nb() within the lme4 package.

Why these two packages? I use lme4 because it is one of the most widely used R packages for fitting multilevel models or generalised linear mixed models (GLMMs) (Bates et al. 2015Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software, Articles 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.). And, I focus primarily on glmmTMB() because of its speed, flexibility and interface’s similarity to lme4 (Brooks et al. 2017Brooks, Mollie E., Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Mächler, and Benjamin M. Bolker. 2017. “glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling.” The R Journal 9 (2): 378–400. https://doi.org/10.32614/RJ-2017-066.). Additionally, while Poisson and Negative Binomial regression models can be estimated using lme4, fitting zero-inflated models requires weighting the data appropriately to their zero probability (Bolker et al. 2012Bolker, Ben, Mollie Brooks, Beth Gardner, Cleridy Lennert, and Mihoko Minami. 2012. “Owls Example: A Zero-Inflated, Generalized Linear Mixed Model for Count Data.” Departments of Mathematics & Statistics and Biology, McMaster University, Hamilton. https://groups.nceas.ucsb.edu/non-linear-modeling/projects/owls/WRITEUP/owls.pdf/@@download.) and glmer.nb() is slower and unstable compared to other functions (Bolker and others 2021Bolker, Ben, and others. 2021. “GLMM Faq.” http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#zero-inflation.).

See Brooks et al. (2017) for more information on glmmTMB(), and Bates et al., (2015) for lme4. Below I quickly call the packages and describe the data before illustrating the use of glmmTMB().

Dependencies

library(tidyverse)
library(lme4)
library(merTools)
library(glmmTMB) # fitting generalised linear mixed models
library(bbmle) # general maximum likelihood estimation
library(ggthemes)
library(showtext)

Set font style

# load font
font_add_google("Roboto Condensed", "robotocondensed")
# automatically use showtext to render text
showtext_auto()

Data

The data are included in the glmmTMB() package and originally taken from Zuur et al. (2009Zuur, Alain, Elena N Ieno, Neil Walker, Anatoly A Saveliev, and Graham M Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Springer Science & Business Media.). They quantify the amount of sibling negotiation (vocalizations when parents are absent) by owlets (owl chicks) in different nests as a function of food treatment (deprived or satiated), the sex of the parent, and arrival time of the parent at the nest. Since the same nests are observed repeatedly, it is natural to consider a mixed-effects model for these data, with the nest as a random effect. Because we are interested in explaining variability in the number of vocalizations per chick, the total brood size in each nest is used as an offset in the model.

I know animals! Human population data for the next iteration.

Reading and transforming data:

Owls <- transform(Owls,
                  Nest = reorder(Nest,NegPerChick),
                  NCalls = SiblingNegotiation,
                  FT = FoodTreatment)

Let’s have a quick look at the dependent variable:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Using glmmTMB

Poisson regression

We fit a varying-intercept Poisson model with offset (i.e. relative risk). We first specify the model and run glmmTMB. The varying-intercept term is fitted using a similar syntax to that in lme4; that is, adding (1 | group) assuming that it contains a group-level identifier - see Bolker et al’s FAQs for details.

eq <- NCalls ~ (FT + ArrivalTime) * SexParent + offset( log( BroodSize)) + ( 1 | Nest)

poisson1 <- glmmTMB(eq, 
                    data=Owls,
                    ziformula=~0,
                    family=poisson)
summary(poisson1)
##  Family: poisson  ( log )
## Formula:          
## NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   5009.4   5040.2  -2497.7   4995.4      592 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.2098   0.4581  
## Number of obs: 599, groups:  Nest, 27
## 
## Conditional model:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                3.58149    0.36264   9.876   <2e-16 ***
## FTSatiated                -0.66682    0.05612 -11.883   <2e-16 ***
## ArrivalTime               -0.11948    0.01440  -8.299   <2e-16 ***
## SexParentMale              0.38769    0.44861   0.864   0.3875    
## FTSatiated:SexParentMale   0.13243    0.07043   1.880   0.0601 .  
## ArrivalTime:SexParentMale -0.01646    0.01836  -0.897   0.3700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

NOTE: If you run into problems using glmmTMB, reinstall glmmTMB from the source code and restart R as described here.

Zero-inflated Poisson regression

Now we will fit a zero-inflated Poisson regression. glmmTMB by default estimates model excluding zero-inflation. To explicitly exclude zero-inflation, use ziformula = ~0.

To estimate a zero-inflated Poisson model with a single zero-inflation parameter applying to all observations, use ziformula = ~1.

Following from above, we fit a varying-intercept zero-inflated Poisson model with an offset parameter.

eq <- NCalls ~ (FT + ArrivalTime) * SexParent + offset( log( BroodSize)) + ( 1 | Nest)

zipoisson1 <- glmmTMB(eq,
                      data=Owls,
                      ziformula=~1,
                      family=poisson)
summary(zipoisson1)
##  Family: poisson  ( log )
## Formula:          
## NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Zero inflation:          ~1
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   4015.6   4050.8  -1999.8   3999.6      591 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.1294   0.3597  
## Number of obs: 599, groups:  Nest, 27
## 
## Conditional model:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                2.53995    0.35656   7.123 1.05e-12 ***
## FTSatiated                -0.29111    0.05961  -4.884 1.04e-06 ***
## ArrivalTime               -0.06808    0.01427  -4.771 1.84e-06 ***
## SexParentMale              0.44885    0.45002   0.997    0.319    
## FTSatiated:SexParentMale   0.10473    0.07286   1.437    0.151    
## ArrivalTime:SexParentMale -0.02140    0.01835  -1.166    0.244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.05753    0.09412  -11.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Negative Binomial regression

Next we fit a negative binomial regression. A negative binomial specification is normally used to handle over-dispersed data. Two parameterisations are often used to fit negative binomial regression models (see Hilbe (2011Hilbe, Joseph M. 2011. Negative Binomial Regression. Cambridge University Press. https://doi.org/10.1017/CBO9780511973420.)): (a) NB1 (variance = \(\mu\) + \(\alpha\mu\)); and, (b) NB2 (variance = \(\mu + \alpha\mu^2\)); where \(\mu\) is the mean, and \(\alpha\) is the overdispersion parameter. NB2 is the default parameterisation. As stated in Hilbe (2011), “NB2 is the standard form of negative binomial used to estimate data that are Poisson-overdispersed, and it is the form of the model which most statisticians understand by negative binomial. NB2 is typically the first model we turn to when we discover that a Poisson model is overdispersed.

NB1 parameterisation

We first fit a negative binomial model based on the NB1 parameterisation.

nbinom1a <- glmmTMB(eq,
                    data=Owls,
                    ziformula=~0,
                    family=nbinom1)
summary(nbinom1a)
##  Family: nbinom1  ( log )
## Formula:          
## NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3361.8   3396.9  -1672.9   3345.8      591 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.1391   0.3729  
## Number of obs: 599, groups:  Nest, 27
## 
## Overdispersion parameter for nbinom1 family (): 6.33 
## 
## Conditional model:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                3.74312    0.87498   4.278 1.89e-05 ***
## FTSatiated                -0.89289    0.13637  -6.548 5.84e-11 ***
## ArrivalTime               -0.12547    0.03571  -3.514 0.000442 ***
## SexParentMale              0.72216    1.09364   0.660 0.509042    
## FTSatiated:SexParentMale   0.12141    0.17051   0.712 0.476427    
## ArrivalTime:SexParentMale -0.02674    0.04473  -0.598 0.550011    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

NB2 parameterisation

We also estimate the NB2 version.

nbinom2b <- glmmTMB(eq,
                    data=Owls,
                    ziformula=~0,
                    family=nbinom2)
summary(nbinom2b)
##  Family: nbinom2  ( log )
## Formula:          
## NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3479.3   3514.4  -1731.6   3463.3      591 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.1087   0.3296  
## Number of obs: 599, groups:  Nest, 27
## 
## Overdispersion parameter for nbinom2 family (): 0.884 
## 
## Conditional model:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                3.90137    0.95359   4.091 4.29e-05 ***
## FTSatiated                -0.77201    0.16200  -4.765 1.88e-06 ***
## ArrivalTime               -0.13042    0.03823  -3.411 0.000647 ***
## SexParentMale             -0.65426    1.24544  -0.525 0.599358    
## FTSatiated:SexParentMale   0.17890    0.20166   0.887 0.374984    
## ArrivalTime:SexParentMale  0.02603    0.05000   0.521 0.602652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Zero-inflated Negative Binomial regression

Similarly we fit zero-inflated negative binomial model regressions for both parametrisations:

NB1 parameterisation

zinbinom1a <- glmmTMB(eq ,
                      data=Owls,
                      ziformula=~1,
                      family=nbinom1)
summary(zinbinom1a)
##  Family: nbinom1  ( log )
## Formula:          
## NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Zero inflation:          ~1
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3350.8   3390.3  -1666.4   3332.8      590 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.1384   0.3721  
## Number of obs: 599, groups:  Nest, 27
## 
## Overdispersion parameter for nbinom1 family (): 4.95 
## 
## Conditional model:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                3.71401    0.83484   4.449 8.64e-06 ***
## FTSatiated                -0.90691    0.13989  -6.483 9.00e-11 ***
## ArrivalTime               -0.11883    0.03422  -3.472 0.000516 ***
## SexParentMale              0.40633    1.05342   0.386 0.699702    
## FTSatiated:SexParentMale   0.26146    0.17400   1.503 0.132930    
## ArrivalTime:SexParentMale -0.01894    0.04326  -0.438 0.661557    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3594     0.3486  -6.767 1.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

NB2 parameterisation

zinbinom2b <- glmmTMB(eq ,
                      data=Owls,
                      ziformula=~1,
                      family=nbinom2)
summary(zinbinom2b)
##  Family: nbinom2  ( log )
## Formula:          
## NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
## Zero inflation:          ~1
## Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3418.3   3457.8  -1700.1   3400.3      590 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.06029  0.2455  
## Number of obs: 599, groups:  Nest, 27
## 
## Overdispersion parameter for nbinom2 family ():  2.3 
## 
## Conditional model:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                3.07026    0.76413   4.018 5.87e-05 ***
## FTSatiated                -0.40904    0.13544  -3.020  0.00253 ** 
## ArrivalTime               -0.09087    0.03092  -2.939  0.00329 ** 
## SexParentMale             -0.39679    0.97448  -0.407  0.68388    
## FTSatiated:SexParentMale   0.16240    0.16278   0.998  0.31842    
## ArrivalTime:SexParentMale  0.01315    0.03943   0.334  0.73873    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2782     0.1223  -10.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By this point you may want to know about more about overdispersion.

What causes overdispersion? Overdispersion is caused by positive correlation between responses or by an excess variation between response probabilities or counts. Overdispersion also arises when there are violations in the distributional assumptions of the data, such as when the data are clustered and thereby violate the likelihood independence of observations assumption.

Why is overdispersion a problem? Overdispersion may cause standard errors of the estimates to be underestimated. As a result, a variable may appear to be a significant predictor when it is not.

How is overdispersion recognised? A model may be overdispersed if the value of the Pearson (or $$2) statistic divided by the degrees of freedom is greater than \(1\). The quotient of either is called the dispersion.

What is apparent overdispersion / how can it be corrected? Apparent overdispersion occurs when: (i) the model omits important explanatory predictors; (ii) the data include outliers; (iii) the model fails to include a sufficient number of interaction terms; (iv) a predictor needs to be transformed to another scale; (v) the assumed linear relationship between the response and the link function and predictors is mis-specified or inappropriate.

Using lme4

We can use lme4 functions to fit Poisson and Negative Binomials models. However, as indicated above, zero-inflated models are not supported so these models are not considered here, but they can be estimated by implementing general expectation-maximisation algorithm as discussed by Bolker et al. (2013Bolker, Benjamin M, Beth Gardner, Mark Maunder, Casper W Berg, Mollie Brooks, Liza Comita, Elizabeth Crone, et al. 2013. “Strategies for Fitting Nonlinear Ecological Models in R, Ad M Odel B Uilder, and Bugs.” Methods in Ecology and Evolution 4 (6): 501–12. https://doi.org/doi.org/10.1111/2041-210x.12044.). If you are interested in this, the relevant R code is here in a function called zipme.

Poisson regression

Similar to the models fitted above, we start by estimating a varying-intercept Poisson model with offset. We use 100 points per axis in the adaptive Gauss-Hermite approximation of the log likelihood specifying nAGQ. The use of more points improves accuracy but will take longer. The default estimation uses the Laplace approximation which is equivalent to 1 point evaluated per axis in Gauss-Hermite approximation.

NOTE: glmmTMB uses maximum likelihood estimation and the Laplace approximation. It does not support restricted maximum likelihood estimation or Gauss-Hermite quadrature to integrate over random effects (Brooks et al. 2017).

eq <- NCalls ~ (FT + ArrivalTime) * SexParent + offset( log( BroodSize)) + ( 1 | Nest)

poisson2 <- glmer(eq,
            data = Owls,
            family = poisson, 
            nAGQ = 100)
summary(poisson2)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 100) [glmerMod]
##  Family: poisson  ( log )
## Formula: NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
##    Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3328.5   3359.3  -1657.3   3314.5      592 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.4711 -1.7551 -0.5822  1.1484 11.1084 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.21     0.4583  
## Number of obs: 599, groups:  Nest, 27
## 
## Fixed effects:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                3.58151    0.36265   9.876   <2e-16 ***
## FTSatiated                -0.66683    0.05612 -11.883   <2e-16 ***
## ArrivalTime               -0.11948    0.01440  -8.299   <2e-16 ***
## SexParentMale              0.38767    0.44861   0.864   0.3875    
## FTSatiated:SexParentMale   0.13243    0.07043   1.880   0.0601 .  
## ArrivalTime:SexParentMale -0.01646    0.01836  -0.897   0.3700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) FTSttd ArrvlT SxPrnM FTS:SP
## FTSatiated  -0.076                            
## ArrivalTime -0.964  0.017                     
## SexParentMl -0.739  0.061  0.759              
## FTSttd:SxPM  0.055 -0.767 -0.010 -0.073       
## ArrvlTm:SPM  0.737 -0.012 -0.765 -0.995  0.012
## convergence code: 0
## Model failed to converge with max|grad| = 0.00406609 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

Negative Binomial regression

We also fit a negative binomial model and remind the reader is glmer.nb may be slower and unstable compared to other functions.

NOTE: glmer.nb estimates a starting \(\theta\) value automatically by fitting a glmer model with family = poisson to estimate an overdispersion parameter from the residuals (see lme4:::est_theta on the console for details).

nbinom3 <- glmer.nb(eq ,
                    data=Owls)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00598841 (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
summary(nbinom3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: Negative Binomial(0.8848)  ( log )
## Formula: NCalls ~ (FT + ArrivalTime) * SexParent + offset(log(BroodSize)) +  
##     (1 | Nest)
##    Data: Owls
## 
##      AIC      BIC   logLik deviance df.resid 
##   3479.3   3514.4  -1731.6   3463.3      591 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -0.9063 -0.7792 -0.2023  0.4372  5.4584 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Nest   (Intercept) 0.1091   0.3303  
## Number of obs: 599, groups:  Nest, 27
## 
## Fixed effects:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                3.87792    0.95170   4.075 4.61e-05 ***
## FTSatiated                -0.77081    0.16144  -4.775 1.80e-06 ***
## ArrivalTime               -0.13015    0.03815  -3.411 0.000647 ***
## SexParentMale             -0.65398    1.24305  -0.526 0.598814    
## FTSatiated:SexParentMale   0.17792    0.20120   0.884 0.376548    
## ArrivalTime:SexParentMale  0.02601    0.04990   0.521 0.602156    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) FTSttd ArrvlT SxPrnM FTS:SP
## FTSatiated  -0.130                            
## ArrivalTime -0.991  0.048                     
## SexParentMl -0.752  0.115  0.747              
## FTSttd:SxPM  0.106 -0.744 -0.045 -0.118       
## ArrvlTm:SPM  0.746 -0.053 -0.753 -0.993  0.042

So What R Package?

As noted by Bolker (2020Bolker, Ben. 2020. “Getting Started with the glmmTMB Package.” R Foundation for Statistical Computing Vienna, Austria.), glmmTMB should generally be expected to offer advantages over lme4 by providing greater flexibility and speed for fitting generalised linear mixed models (GLMMs), particularly for models with large numbers of parameters. In contrast, lme4 should generally be expected to be faster for fitting linear mixed models (LMMs). Additionally, lme4 is a more mature package offering a wide range of diagnostic checks and methods.

(Brooks et al. 2017) compares the flexibility and speed of glmmTMB against various R packages offering functions to fit zero-inflated mixed models. The comparison shows that glmmTMB is faster than glmmADMB, MCMCglmm and brms, and more flexible than INLA and mgcv for modelling zero-inflated count data.

NOTE: The following R packages can be used for modelling count data:

The most commonly used R functions for mixed-effects or hierarchical modelling in R are (Bolker and others 2021Bolker, Ben, and others. 2021. “GLMM Faq.” http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#zero-inflation.):