How to calculate standard error and CI to plot in R

I am first calculating the percentage of respondents across different demographics who graduated from high school, based on their program status. This code gets me those percents:

d_perc <- d %>% 
  group_by(group, levels, program_cat, highschool) %>% 
  summarize(n = n()) %>% 
  mutate(percent = n/sum(n)*100) %>% 
  select(-n)

Next, I want to calculate an error term around these percents. What is the best way to calculate the SEs and the corresponding 95% CIs? (My ultimate goal is to use geom_point() and geom_errorbar() to plot these together, though I already have code to do this.)

I tried something like:

d_perc$se <- sqrt(d_perc$percent*(1-d_perc$percent)/d_perc$percent)

Which would then be followed by something like + and - 1.96*d_perc$se to get the upper and lower estimate. However, when I try the above, I just get a series of NaNs for the se column.

Data here (sorry for the large data; I used head(100) to get somewhat more realistic percents by group):

d_perc <- structure(list(highschool= structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), levels = c("no", 
"yes"), class = "factor"), program_cat = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L), levels = c("0", "1", "2"), class = "factor"), group = c("gender", 
"race", "cohort", "gender", "race", "cohort", "gender", "race", 
"cohort", "gender", "race", "cohort", "gender", "race", "cohort", 
"gender", "race", "cohort", "gender", "race", "cohort", "gender", 
"race", "cohort", "gender", "race", "cohort", "gender", "race", 
"cohort", "gender", "race", "cohort", "gender", "race", "cohort", 
"gender", "race", "cohort", "gender", "race", "cohort", "gender", 
"race", "cohort", "gender", "race", "cohort", "gender", "race", 
"cohort", "gender", "race", "cohort", "gender", "race", "cohort", 
"gender", "race", "cohort", "gender", "race", "cohort", "gender", 
"race", "cohort", "gender", "race", "cohort", "gender", "race", 
"cohort", "gender", "race", "cohort", "gender", "race", "cohort", 
"gender", "race", "cohort", "gender", "race", "cohort", "gender", 
"race", "cohort", "gender", "race", "cohort", "gender", "race", 
"cohort", "gender", "race", "cohort", "gender", "race", "cohort", 
"gender"), levels = structure(c(1L, 3L, 7L, 2L, 5L, 7L, 1L, 3L, 
6L, 2L, 4L, 6L, 1L, 5L, 7L, 1L, 3L, 7L, 1L, 3L, 6L, 1L, 3L, 6L, 
1L, 3L, 7L, 1L, 5L, 6L, 2L, 5L, 7L, 1L, 5L, 6L, 1L, 3L, 6L, 2L, 
3L, 7L, 1L, 3L, 6L, 1L, 4L, 6L, 1L, 5L, 6L, 1L, 5L, 6L, 1L, 4L, 
6L, 2L, 3L, 6L, 2L, 3L, 7L, 1L, 3L, 7L, 1L, 3L, 6L, 1L, 4L, 7L, 
1L, 4L, 7L, 1L, 3L, 7L, 1L, 3L, 7L, 1L, 4L, 7L, 1L, 3L, 7L, 1L, 
3L, 6L, 1L, 3L, 7L, 2L, 3L, 7L, 2L, 5L, 6L, 2L), levels = c("Female", 
"Male", "Black", "Hispanic", "White", "CohortA", "CohortB"), class = "factor")), row.names = c(NA, 
-100L), class = c("tbl_df", "tbl", "data.frame"))

  • You are calculating the standard error of a proportion. However, percent is not a proportion since you multiplied it by 100. So in your formula when you do 1-percent you get negative numbers that you are then trying to take the square root of, resulting in NaN.

  • Also, if you are trying to compute the standard error of a proportion, the formula is sqrt(p * q / n), where q is 1 - p and n is the number of observations the proportion is based on. Notice that you divide by n, not p (in your example you are dividing by p). The way you have it written now, d_perc$percent just cancels and you are left with sqrt(1-d_perc$percent).

  • Generally, it might be better to ask if your data are structured in a way that is giving you what you want. Yes, your code will generate some output for those grouping variables. However, it looks like you might consider pivoting your group and levels columns into a wider format before trying to make this computation. To do this though, it looks like there might be some ID column that is not present in your data that will uniquely identify the individual from a given race, gender, and cohort.
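
Putting the first two comments together, here is a minimal sketch of one way to get the SE and a 95% interval per group. It assumes the raw data frame d and the grouping from the question; the toy d below and the column names total, p, lower, and upper are made up for illustration. It uses the normal-approximation (Wald) interval implied by the +/- 1.96 * SE in the question, which can be unreliable for small groups or for proportions near 0 or 1.

```r
library(dplyr)

# Toy stand-in for the raw data `d` (hypothetical values, same column names)
d <- data.frame(
  group       = rep("gender", 8),
  levels      = rep(c("Female", "Male"), each = 4),
  program_cat = rep("1", 8),
  highschool  = c("yes", "yes", "yes", "no", "yes", "no", "no", "no")
)

d_perc <- d %>%
  group_by(group, levels, program_cat, highschool) %>%
  # keep the data grouped by group/levels/program_cat after counting
  summarize(n = n(), .groups = "drop_last") %>%
  mutate(
    total   = sum(n),                      # denominator within each group
    p       = n / total,                   # proportion on the 0-1 scale
    se      = sqrt(p * (1 - p) / total),   # SE of a proportion: sqrt(p * q / n)
    percent = 100 * p,
    lower   = 100 * pmax(p - 1.96 * se, 0),  # clamp the interval to [0, 100]
    upper   = 100 * pmin(p + 1.96 * se, 1)
  ) %>%
  ungroup()

print(d_perc)
```

The lower and upper columns can then be mapped to ymin and ymax in geom_errorbar(). For small groups, binom.test() or prop.test() may give better-behaved intervals than the Wald formula.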
