My goal is to interpret the coefficients of a hurdle model through estimated marginal means. I prefer to interpret probabilities (back-transformed from the logit scale), rather than log-odds (model coefficients) or odd-ratio (exp(log-odds)). I would like to use emmeans()
for this goal, as it is compatible with many models, and I have experience using it in linear models and binomial GLMs.
The hurdle part of the hurdle model yields the same coefficients as a binomial GLM, which has been commented elsewhere (here or here).
However, I don’t fully understand why, depending on the setting, the estimated means from the hurdle do not match those from the binomial GLM. In particular
lin.pred = FALSE
. This yields different probabilites from a binomial GLM. Following theemmeans
documentation, I think they are averaged at the probability scale.lin.pred = TRUE, type = "response"
. This setting yields the same probabilites as binomial GLM. Following the emmeans documentation, I think they are averaged at the logit scale.
I have two questions:
-
- Which setting would be preferrable, if one’s goal is to interpret probabilities, rather than logits?
-
- What is the meaning of
Inf
degrees of freedom?
- What is the meaning of
See below a reproducible example, using the number of papers produced by Biochemstry students during the least 3 years of PhD, focusing on a pairwise comparison between two factor levels (single vs married).
My preferred approach would be the one that matches the estimated probabilites from the binomial GLM. I would interpret that, married PhD students are as likely than single students, to publish at least one paper (= overcome the hurdle). Married students have a 74% chance of producing at least one paper, whereas single ones have a 67% change: this 7% change is a small difference, and not significant.
Many thanks in advance!
library(emmeans)
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
data("bioChemists", package = "pscl")
# str(bioChemists)
hurdle <- hurdle(art ~ ., data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glm <- glm(art > 0 ~ ., data = bioChemists, family = binomial(link = "logit"))
# Same coefficients
coef(summary(hurdle))$zero
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.23679601 0.29551883 0.8012891 4.229643e-01
#> femWomen -0.25115113 0.15910522 -1.5785222 1.144457e-01
#> marMarried 0.32623358 0.18081823 1.8042074 7.119880e-02
#> kid5 -0.28524872 0.11113043 -2.5667921 1.026441e-02
#> phd 0.02221940 0.07955710 0.2792887 7.800233e-01
#> ment 0.08012135 0.01301763 6.1548321 7.515710e-10
coef(glm)
#> (Intercept) femWomen marMarried kid5 phd ment
#> 0.23679601 -0.25115113 0.32623358 -0.28524872 0.02221940 0.08012135
#######
## lin.pred = FALSE
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = FALSE)
#> $emmeans
#> mar emmean SE df lower.CL upper.CL
#> Single 0.806 0.167 903 0.478 1.13
#> Married 0.858 0.122 903 0.619 1.10
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Single - Married -0.0522 0.213 903 -0.245 0.8066
#>
#> Results are averaged over the levels of: fem
#######
## lin.pred = TRUE, type = "response"
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE, type = "response")
#> $emmeans
#> mar prob SE df lower.CL upper.CL
#> Single 0.677 0.0306 903 0.615 0.734
#> Married 0.744 0.0200 903 0.703 0.781
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
#>
#> $contrasts
#> contrast odds.ratio SE df null t.ratio p.value
#> Single / Married 0.722 0.13 903 1 -1.804 0.0715
#>
#> Results are averaged over the levels of: fem
#> Tests are performed on the log odds ratio scale
emmeans::emmeans(glm, specs = pairwise ~ mar, regrid = "response")
#> $emmeans
#> mar response SE df asymp.LCL asymp.UCL
#> Single 0.677 0.0305 Inf 0.617 0.736
#> Married 0.743 0.0201 Inf 0.704 0.783
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df z.ratio p.value
#> Single - Married -0.0667 0.0377 Inf -1.772 0.0764
#>
#> Results are averaged over the levels of: fem
#######
# Estimates and contrasts at the logit scale appear to match
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE)
#> $emmeans
#> mar emmean SE df lower.CL upper.CL
#> Single 0.741 0.140 903 0.466 1.02
#> Married 1.068 0.105 903 0.862 1.27
#>
#> Results are averaged over the levels of: fem
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Single - Married -0.326 0.181 903 -1.804 0.0715
#>
#> Results are averaged over the levels of: fem
#> Results are given on the log odds ratio (not the response) scale.
emmeans::emmeans(glm, specs = pairwise ~ mar)
#> $emmeans
#> mar emmean SE df asymp.LCL asymp.UCL
#> Single 0.741 0.140 Inf 0.467 1.02
#> Married 1.068 0.105 Inf 0.862 1.27
#>
#> Results are averaged over the levels of: fem
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df z.ratio p.value
#> Single - Married -0.326 0.181 Inf -1.804 0.0712
#>
#> Results are averaged over the levels of: fem
#> Results are given on the log odds ratio (not the response) scale.
sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8 LC_CTYPE=Spanish_Spain.utf8
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Spain.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pscl_1.5.5.1 emmeans_1.8.5-9000004
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.2.0 highr_0.9 tools_4.2.0
#> [4] digest_0.6.29 evaluate_0.15 lifecycle_1.0.3
#> [7] lattice_0.20-45 rlang_1.1.1 reprex_2.0.2
#> [10] Matrix_1.4-1 cli_3.6.1 rstudioapi_0.13
#> [13] yaml_2.3.5 mvtnorm_1.1-3 xfun_0.40
#> [16] fastmap_1.1.0 coda_0.19-4 withr_2.5.0
#> [19] stringr_1.5.0 knitr_1.39 fs_1.5.2
#> [22] vctrs_0.6.3 grid_4.2.0 glue_1.6.2
#> [25] survival_3.3-1 rmarkdown_2.14 multcomp_1.4-19
#> [28] TH.data_1.1-1 magrittr_2.0.3 codetools_0.2-18
#> [31] htmltools_0.5.6 splines_4.2.0 MASS_7.3-57
#> [34] xtable_1.8-4 numDeriv_2016.8-1.1 sandwich_3.0-1
#> [37] estimability_1.4.1 stringi_1.7.6 zoo_1.8-10
Created on 2023-09-06 with reprex v2.0.2
If you need help interpreting results from statistical models, you should ask for help at Cross Validated instead. You are likely to get better help there. This is not really a specific programming question that’s appropriate for Stack Overflow.