I am working with a dataset that I believe follows a “Negative Binomial” distribution. However, when I fit the Negative Binomial distribution, it turns out to be a poor fit. To explore further, I simulated a Negative Binomial distribution, but even on the simulated data, the overlaying distribution does not provide a good fit.
Here is my simulated data:
library(ggplot2)
library(MASS)
library(fitdistrplus)
# Generating negative binomial random numbers
n <- 1000 # Number of random numbers
size <- 5 # Number of successes
prob <- 0.3 # Probability of success
# Draw the sample
negative_binomial <- rnbinom(n, size, prob)
xx <- data.frame(negative_binomial)
I want to create a histogram with an overlay of the “Negative Binomial” distribution on this data. Let’s assume that I was given this data, so I had to estimate the parameters of the distribution using fitdistr() from MASS.
fit <- fitdistr(negative_binomial,densfun = "negative binomial")
ggplot(data = xx, aes(negative_binomial)) +
geom_histogram(
aes(y = ..density..),
bins = 18, color = "black", fill = "lightblue") +
stat_function(fun = dnbinom ,
args = list(mu = fit$estimate[2] , size = fit$estimate[1]),
color = "red", size = 1)
Question: Despite knowing that the simulated data is Negative Binomial, why does the overlaying distribution provide such a poor fit to the data? What did I do wrong?
You wrote: "a dataset that I believe follows a “Negative Binomial” distribution." How do you know that, and why?
My real data are count data, and based on historical data I know they follow a Negative Binomial distribution. But please set my real data aside and tell me why the Negative Binomial is a poor fit on simulated “Negative Binomial” data.
Usually, when you are interested in the distribution, you are interested in the distribution conditional on predictors. In particular, you would care about a negative binomial distribution if you intend to fit a generalized linear model (GLM) with the negative-binomial distribution family. Fitting the distribution to the data is not helpful for that purpose. So, why are you doing this?
The problem is non-integer x. Running
hist(xx$negative_binomial, prob = TRUE, col = "lightblue", breaks = 18L); curve(dnbinom(x, mu = fit$estimate[2L], size = fit$estimate[1L]), 0L, 40L, col = "red", add = TRUE)
issues warnings that say, roughly, “you are treating a discrete distribution as continuous”.
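Following up on the non-integer x point: stat_function() and curve() evaluate the function on a fine grid of real numbers, and dnbinom() returns 0 (with a warning) wherever x is not an integer, so the red line collapses towards zero almost everywhere. A minimal sketch of one possible fix is below, evaluating the fitted probability mass function only at integer values. The seed, the pmf data frame, and the choice of binwidth = 1 with bins centred on the integers are my assumptions, not part of the original question; after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)
library(MASS)

set.seed(1)  # assumed seed, only so the sketch is reproducible
negative_binomial <- rnbinom(1000, size = 5, prob = 0.3)
xx <- data.frame(negative_binomial)

# Maximum-likelihood fit; the estimates are named "size" and "mu"
fit <- fitdistr(negative_binomial, densfun = "negative binomial")

# dnbinom() is a probability mass function: at a non-integer x it
# returns 0 and emits a "non-integer x" warning
suppressWarnings(dnbinom(2.5, size = 5, prob = 0.3))  # 0

# Evaluate the fitted pmf on the integer support only
# (pmf is a hypothetical helper data frame)
pmf <- data.frame(x = 0:max(negative_binomial))
pmf$p <- dnbinom(pmf$x,
                 size = fit$estimate["size"],
                 mu   = fit$estimate["mu"])

# One bin per integer, so bar heights are directly comparable to the pmf
ggplot(xx, aes(negative_binomial)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 1, boundary = -0.5,
                 color = "black", fill = "lightblue") +
  geom_line(data = pmf, aes(x, p), color = "red") +
  geom_point(data = pmf, aes(x, p), color = "red")
```

With one bin per integer, each bar's height is the observed proportion for that count, so the red points should sit close to the tops of the bars. The connecting line is only a visual aid; the distribution is defined at the points alone.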