Interpretation of hurdle models with estimated marginal means

My goal is to interpret the coefficients of a hurdle model through estimated marginal means. I prefer to interpret probabilities (back-transformed from the logit scale), rather than log-odds (model coefficients) or odd-ratio (exp(log-odds)). I would like to use emmeans() for this goal, as it is compatible with many models, and I have experience using it in linear models and binomial GLMs.

The hurdle part of the hurdle model yields the same coefficients as a binomial GLM, which has been commented elsewhere (here or here).

However, I don’t fully understand why, depending on the setting, the estimated means from the hurdle do not match those from the binomial GLM. In particular

  • lin.pred = FALSE. This yields different probabilites from a binomial GLM. Following the emmeans documentation, I think they are averaged at the probability scale.
  • lin.pred = TRUE, type = "response". This setting yields the same probabilites as binomial GLM. Following the emmeans documentation, I think they are averaged at the logit scale.

I have two questions:

    1. Which setting would be preferrable, if one’s goal is to interpret probabilities, rather than logits?
    1. What is the meaning of Inf degrees of freedom?

See below a reproducible example, using the number of papers produced by Biochemstry students during the least 3 years of PhD, focusing on a pairwise comparison between two factor levels (single vs married).

My preferred approach would be the one that matches the estimated probabilites from the binomial GLM. I would interpret that, married PhD students are as likely than single students, to publish at least one paper (= overcome the hurdle). Married students have a 74% chance of producing at least one paper, whereas single ones have a 67% change: this 7% change is a small difference, and not significant.

Many thanks in advance!

library(emmeans)
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
data("bioChemists", package = "pscl")
# str(bioChemists)
hurdle <- hurdle(art ~ ., data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glm <- glm(art > 0 ~ ., data = bioChemists, family = binomial(link = "logit"))

# Same coefficients
coef(summary(hurdle))$zero
#>                Estimate Std. Error    z value     Pr(>|z|)
#> (Intercept)  0.23679601 0.29551883  0.8012891 4.229643e-01
#> femWomen    -0.25115113 0.15910522 -1.5785222 1.144457e-01
#> marMarried   0.32623358 0.18081823  1.8042074 7.119880e-02
#> kid5        -0.28524872 0.11113043 -2.5667921 1.026441e-02
#> phd          0.02221940 0.07955710  0.2792887 7.800233e-01
#> ment         0.08012135 0.01301763  6.1548321 7.515710e-10
coef(glm)
#> (Intercept)    femWomen  marMarried        kid5         phd        ment 
#>  0.23679601 -0.25115113  0.32623358 -0.28524872  0.02221940  0.08012135

####### 
## lin.pred = FALSE
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = FALSE)
#> $emmeans
#>  mar     emmean    SE  df lower.CL upper.CL
#>  Single   0.806 0.167 903    0.478     1.13
#>  Married  0.858 0.122 903    0.619     1.10
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df t.ratio p.value
#>  Single - Married  -0.0522 0.213 903  -0.245  0.8066
#> 
#> Results are averaged over the levels of: fem


####### 
## lin.pred = TRUE, type = "response"
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE, type = "response")
#> $emmeans
#>  mar      prob     SE  df lower.CL upper.CL
#>  Single  0.677 0.0306 903    0.615    0.734
#>  Married 0.744 0.0200 903    0.703    0.781
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale 
#> 
#> $contrasts
#>  contrast         odds.ratio   SE  df null t.ratio p.value
#>  Single / Married      0.722 0.13 903    1  -1.804  0.0715
#> 
#> Results are averaged over the levels of: fem 
#> Tests are performed on the log odds ratio scale


emmeans::emmeans(glm, specs = pairwise ~ mar, regrid = "response")
#> $emmeans
#>  mar     response     SE  df asymp.LCL asymp.UCL
#>  Single     0.677 0.0305 Inf     0.617     0.736
#>  Married    0.743 0.0201 Inf     0.704     0.783
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate     SE  df z.ratio p.value
#>  Single - Married  -0.0667 0.0377 Inf  -1.772  0.0764
#> 
#> Results are averaged over the levels of: fem

#######
# Estimates and contrasts at the logit scale appear to match
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE)
#> $emmeans
#>  mar     emmean    SE  df lower.CL upper.CL
#>  Single   0.741 0.140 903    0.466     1.02
#>  Married  1.068 0.105 903    0.862     1.27
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the logit (not the response) scale. 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df t.ratio p.value
#>  Single - Married   -0.326 0.181 903  -1.804  0.0715
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the log odds ratio (not the response) scale.


emmeans::emmeans(glm, specs = pairwise ~ mar)
#> $emmeans
#>  mar     emmean    SE  df asymp.LCL asymp.UCL
#>  Single   0.741 0.140 Inf     0.467      1.02
#>  Married  1.068 0.105 Inf     0.862      1.27
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the logit (not the response) scale. 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df z.ratio p.value
#>  Single - Married   -0.326 0.181 Inf  -1.804  0.0712
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the log odds ratio (not the response) scale.


sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pscl_1.5.5.1          emmeans_1.8.5-9000004
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.2.0      highr_0.9           tools_4.2.0        
#>  [4] digest_0.6.29       evaluate_0.15       lifecycle_1.0.3    
#>  [7] lattice_0.20-45     rlang_1.1.1         reprex_2.0.2       
#> [10] Matrix_1.4-1        cli_3.6.1           rstudioapi_0.13    
#> [13] yaml_2.3.5          mvtnorm_1.1-3       xfun_0.40          
#> [16] fastmap_1.1.0       coda_0.19-4         withr_2.5.0        
#> [19] stringr_1.5.0       knitr_1.39          fs_1.5.2           
#> [22] vctrs_0.6.3         grid_4.2.0          glue_1.6.2         
#> [25] survival_3.3-1      rmarkdown_2.14      multcomp_1.4-19    
#> [28] TH.data_1.1-1       magrittr_2.0.3      codetools_0.2-18   
#> [31] htmltools_0.5.6     splines_4.2.0       MASS_7.3-57        
#> [34] xtable_1.8-4        numDeriv_2016.8-1.1 sandwich_3.0-1     
#> [37] estimability_1.4.1  stringi_1.7.6       zoo_1.8-10

Created on 2023-09-06 with reprex v2.0.2

  • If you need help interpreting results from statistical models, you should ask for help at Cross Validated instead. You are likely to get better help there. This is not really a specific programming question that’s appropriate for Stack Overflow.

    – 

Leave a Comment