R: Finding Out Which Rows Match a Condition Between Two Dates [closed]

Question

I am working with the R programming language

I have the following table (notice that for student1, row 1 and row 2 are same except var1 value):

df = structure(list(student = c(1, 1, 1, 1, 2, 2, 2), var1 = c("a", 
"b", "b", "a", "c", "a", "b"), start = structure(c(14610, 14610, 
15830, 17449, 14610, 16436, 17897), class = "Date"), end = structure(c(15706, 
15706, 16687, 17775, 16071, 17410, 19421), class = "Date")), class = "data.frame", row.names = c(NA, 
-7L))

  student var1      start        end
1       1    a 2010-01-01 2013-01-01
2       1    b 2010-01-01 2013-01-01
3       1    b 2013-05-05 2015-09-09
4       1    a 2017-10-10 2018-09-01
5       2    c 2010-01-01 2014-01-01
6       2    a 2015-01-01 2017-09-01
7       2    b 2019-01-01 2023-03-05

My Question:

Consider years from 2000 to 2020.
For each student, between March 1st-March 1st of each year, I want to find out if the student had at least one value of var1 =a.
If yes, then TRUE else FALSE (note: when missing, still FALSE)

Here is my attempt to do this.

First, I create a template table where the final results will be stored (this contains all date ranges between 2000 and 2020):

df$start <- as.Date(df$start)
df$end <- as.Date(df$end)

time_intervals <- data.frame(
  Student = rep(unique(df$student), each = 10),
  start_time = rep(as.Date(paste0(2010:2019, "-03-01")), length(unique(df$student))),
  end_time = rep(as.Date(paste0(2011:2020, "-03-01")), length(unique(df$student)))
)

Then, I tried to write a function to check each student between each time interval:

check_a <- function(student, start_time, end_time) {
  subset_df <- df[df$student == student & df$var1 == 'a' & df$start <= end_time & df$end >= start_time, ]
  if(nrow(subset_df) > 0) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

Finally, I applied this function to the data:

time_intervals$at_least_one_var1_a <- mapply(check_a, time_intervals$Student, time_intervals$start_time, time_intervals$end_time)

The results look like this:

  Student start_time   end_time at_least_one_var1_a
1        1 2010-03-01 2011-03-01                TRUE
2        1 2011-03-01 2012-03-01                TRUE
3        1 2012-03-01 2013-03-01                TRUE
4        1 2013-03-01 2014-03-01               FALSE
5        1 2014-03-01 2015-03-01               FALSE
6        1 2015-03-01 2016-03-01               FALSE
7        1 2016-03-01 2017-03-01               FALSE
8        1 2017-03-01 2018-03-01                TRUE
9        1 2018-03-01 2019-03-01                TRUE
10       1 2019-03-01 2020-03-01               FALSE
11       2 2010-03-01 2011-03-01               FALSE
12       2 2011-03-01 2012-03-01               FALSE
13       2 2012-03-01 2013-03-01               FALSE
14       2 2013-03-01 2014-03-01               FALSE
15       2 2014-03-01 2015-03-01                TRUE
16       2 2015-03-01 2016-03-01                TRUE
17       2 2016-03-01 2017-03-01                TRUE
18       2 2017-03-01 2018-03-01                TRUE
19       2 2018-03-01 2019-03-01               FALSE
20       2 2019-03-01 2020-03-01               FALSE

My Problem: However, there are some mistakes here. For example:

For student 1, between “2018-03-01 and 2019-03-01”, the output shows a value of TRUE. However, this is missing in the original data so the correct output should be FALSE
For student 2, between ‘2014-03-01 and 2015-03-01’, the output shows a value of TRUE. However, this is missing in the original data so the correct output should be FALSE

Can someone please show me how to correct this?

Thanks!

Leave a Comment Cancel reply