I am first calculating the percentage of respondents across different demographics who graduated from high school, based on their program status. This code gets me those percents:
d_perc <- d %>%
group_by(group, levels, program_cat, highschool) %>%
summarize(n = n()) %>%
mutate(percent = n/sum(n)*100) %>%
select(-n)
Next, I want to calculate an error term around these percents. What is the best way to calculate the SEs and the corresponding 95% CIs? (My ultimate goal is to use geom_point() and geom_errorbar() to plot these together, though I already have code to do this.)
I tried something like:
d_perc$se <- sqrt(d_perc$percent*(1-d_perc$percent)/d_perc$percent)
This would then be followed by something like + and - 1.96*d_perc$se to get the upper and lower estimates. However, when I try the above, I just get a series of NaNs for the se column.
Data here (sorry for the large data; I used head(100) to get somewhat more realistic percents by group):
d_perc <- structure(list(highschool= structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), levels = c("no",
"yes"), class = "factor"), program_cat = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
1L), levels = c("0", "1", "2"), class = "factor"), group = c("gender",
"race", "cohort", "gender", "race", "cohort", "gender", "race",
"cohort", "gender", "race", "cohort", "gender", "race", "cohort",
"gender", "race", "cohort", "gender", "race", "cohort", "gender",
"race", "cohort", "gender", "race", "cohort", "gender", "race",
"cohort", "gender", "race", "cohort", "gender", "race", "cohort",
"gender", "race", "cohort", "gender", "race", "cohort", "gender",
"race", "cohort", "gender", "race", "cohort", "gender", "race",
"cohort", "gender", "race", "cohort", "gender", "race", "cohort",
"gender", "race", "cohort", "gender", "race", "cohort", "gender",
"race", "cohort", "gender", "race", "cohort", "gender", "race",
"cohort", "gender", "race", "cohort", "gender", "race", "cohort",
"gender", "race", "cohort", "gender", "race", "cohort", "gender",
"race", "cohort", "gender", "race", "cohort", "gender", "race",
"cohort", "gender", "race", "cohort", "gender", "race", "cohort",
"gender"), levels = structure(c(1L, 3L, 7L, 2L, 5L, 7L, 1L, 3L,
6L, 2L, 4L, 6L, 1L, 5L, 7L, 1L, 3L, 7L, 1L, 3L, 6L, 1L, 3L, 6L,
1L, 3L, 7L, 1L, 5L, 6L, 2L, 5L, 7L, 1L, 5L, 6L, 1L, 3L, 6L, 2L,
3L, 7L, 1L, 3L, 6L, 1L, 4L, 6L, 1L, 5L, 6L, 1L, 5L, 6L, 1L, 4L,
6L, 2L, 3L, 6L, 2L, 3L, 7L, 1L, 3L, 7L, 1L, 3L, 6L, 1L, 4L, 7L,
1L, 4L, 7L, 1L, 3L, 7L, 1L, 3L, 7L, 1L, 4L, 7L, 1L, 3L, 7L, 1L,
3L, 6L, 1L, 3L, 7L, 2L, 3L, 7L, 2L, 5L, 6L, 2L), levels = c("Female",
"Male", "Black", "Hispanic", "White", "CohortA", "CohortB"), class = "factor")), row.names = c(NA,
-100L), class = c("tbl_df", "tbl", "data.frame"))
You are calculating the standard error of a proportion. However, percent is not a proportion, since you multiplied it by 100. So when your formula computes 1 - percent, you get negative numbers, and taking the square root of a negative number results in NaN.
Also, the formula for the standard error of a proportion is sqrt(p * q / n), where q is 1 - p and n is the number of respondents the proportion is based on. Notice that you divide by n, not by p (in your example you are dividing by p). The way you have it written now, d_perc$percent just cancels and you are left with sqrt(1 - d_perc$percent).
More generally, it might be worth asking whether your data are structured in a way that gives you what you want. Yes, your code will generate some output for those grouping variables. However, you might consider pivoting your group and levels columns into a wider format before making this computation. To do that, though, you would need an ID column, which does not appear to be present in your data, that uniquely identifies the individual from a given race, gender, and cohort.