I am working with the R programming language
I have the following table (notice that for student1, row 1 and row 2 are same except var1 value):
df = structure(list(student = c(1, 1, 1, 1, 2, 2, 2), var1 = c("a",
"b", "b", "a", "c", "a", "b"), start = structure(c(14610, 14610,
15830, 17449, 14610, 16436, 17897), class = "Date"), end = structure(c(15706,
15706, 16687, 17775, 16071, 17410, 19421), class = "Date")), class = "data.frame", row.names = c(NA,
-7L))
student var1 start end
1 1 a 2010-01-01 2013-01-01
2 1 b 2010-01-01 2013-01-01
3 1 b 2013-05-05 2015-09-09
4 1 a 2017-10-10 2018-09-01
5 2 c 2010-01-01 2014-01-01
6 2 a 2015-01-01 2017-09-01
7 2 b 2019-01-01 2023-03-05
My Question:
- Consider years from 2000 to 2020.
- For each student, between March 1st-March 1st of each year, I want to find out if the student had at least one value of var1 =a.
- If yes, then TRUE else FALSE (note: when missing, still FALSE)
Here is my attempt to do this.
First, I create a template table where the final results will be stored (this contains all date ranges between 2000 and 2020):
df$start <- as.Date(df$start)
df$end <- as.Date(df$end)
time_intervals <- data.frame(
Student = rep(unique(df$student), each = 10),
start_time = rep(as.Date(paste0(2010:2019, "-03-01")), length(unique(df$student))),
end_time = rep(as.Date(paste0(2011:2020, "-03-01")), length(unique(df$student)))
)
Then, I tried to write a function to check each student between each time interval:
check_a <- function(student, start_time, end_time) {
subset_df <- df[df$student == student & df$var1 == 'a' & df$start <= end_time & df$end >= start_time, ]
if(nrow(subset_df) > 0) {
return(TRUE)
} else {
return(FALSE)
}
}
Finally, I applied this function to the data:
time_intervals$at_least_one_var1_a <- mapply(check_a, time_intervals$Student, time_intervals$start_time, time_intervals$end_time)
The results look like this:
Student start_time end_time at_least_one_var1_a
1 1 2010-03-01 2011-03-01 TRUE
2 1 2011-03-01 2012-03-01 TRUE
3 1 2012-03-01 2013-03-01 TRUE
4 1 2013-03-01 2014-03-01 FALSE
5 1 2014-03-01 2015-03-01 FALSE
6 1 2015-03-01 2016-03-01 FALSE
7 1 2016-03-01 2017-03-01 FALSE
8 1 2017-03-01 2018-03-01 TRUE
9 1 2018-03-01 2019-03-01 TRUE
10 1 2019-03-01 2020-03-01 FALSE
11 2 2010-03-01 2011-03-01 FALSE
12 2 2011-03-01 2012-03-01 FALSE
13 2 2012-03-01 2013-03-01 FALSE
14 2 2013-03-01 2014-03-01 FALSE
15 2 2014-03-01 2015-03-01 TRUE
16 2 2015-03-01 2016-03-01 TRUE
17 2 2016-03-01 2017-03-01 TRUE
18 2 2017-03-01 2018-03-01 TRUE
19 2 2018-03-01 2019-03-01 FALSE
20 2 2019-03-01 2020-03-01 FALSE
My Problem: However, there are some mistakes here. For example:
- For student 1, between “2018-03-01 and 2019-03-01”, the output shows a value of TRUE. However, this is missing in the original data so the correct output should be FALSE
- For student 2, between ‘2014-03-01 and 2015-03-01’, the output shows a value of TRUE. However, this is missing in the original data so the correct output should be FALSE
Can someone please show me how to correct this?
Thanks!
row 4 shows an end date of september 1st, 2018 and a var1=a. So the output should be TRUE. row 6 is for student 2, has a start in january 2015 and ends in 2017, so we should expect a TRUE for 2014-03-01 – 2015-03-01. The code is fine, it is the expected output that is wrong.
I am voting to close this question down as not reproducible. Please consider doing a minimal check before posting a question.