How can I optimize this R code to run faster? [closed]

I am an intern working on a project to predict how long orange trees take to recover their productivity after being affected by vines ("cipo" in the code, from the Portuguese word for vine). My coding experience, especially in R, is quite limited; I am a complete beginner. This is my first question here, so I apologize for any mistakes, and for my English, as I am Brazilian. Any response is genuinely appreciated!

The dataset I am working with comes from an analysis carried out every four months, called a survey. During each survey, a drone flies over the orange orchards taking photos, which are then processed with AI to generate the dataset. The dataset contains information about each tree: its id (treeids in the code), grove, farm (fazenda_seetree), and current health status (score: morta = dead, falha = failure, replanta = replanted, saudavel = healthy). It also indicates whether a tree is sick and whether it has been affected by vines (cipo).
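To make the structure concrete, here are a few hypothetical rows with the columns my code uses (the values are invented; the real file has more columns):

example_rows <- data.frame(
  treeids         = c("t1", "t1", "t1", "t2", "t2", "t2"),
  survey          = c(1, 2, 3, 1, 2, 3),
  fazenda_seetree = "farm_1",
  score           = c("saudavel", "saudavel", "saudavel",
                      "saudavel", "morta", "morta"),
  cipo            = c(0, 1, 0, 0, 1, 1)   # 1 = vines observed in that survey
)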

For my project, I believe survival analysis is the most effective way to examine the time a tree needs to recover after being affected by vines. To run the survival analysis, however, I first need to build a dataset with the necessary information: for each tree (identified by treeids), the first survey in which it had vines and the follow-up survey that determines whether it recovered or not. The expected output is sketched below.
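The output I am building (database_sob in the code below) should have one row per affected tree: the tree id, the first survey with cipo, the follow-up survey where the outcome is decided, and an indicator that is 0 when the tree recovered (cipo gone and score saudavel) and 1 otherwise. My code produces unnamed columns, but for the hypothetical rows above (column names invented for readability) the expected result is:

#   treeids survey_start survey_end event
# 1      t1            2          3     0   (recovered)
# 2      t2            2          3     1   (cipo still present)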

The code below is what I am currently using to generate that dataset. It works correctly, but it takes a very long time to finish. Even on Databricks with a more powerful machine, it runs remarkably slowly. This is a significant problem for me, and as a beginner I would appreciate any guidance, even if it means switching to another programming language besides R.

library(dplyr)

database <- read.csv("farm_1t_V4.csv")
database <- database[, !names(database) %in% "X"]   # drop the exported row-index column

k <- 1                        # next row to fill in the output
database_sob <- data.frame()

for (i in unique(database$treeids)) {
  # all surveys for this tree, in chronological order
  aux <- database[database$treeids == i, ] %>% arrange(survey)

  if (!any(is.na(aux$cipo))) {
    if (sum(aux$cipo) > 0) {
      # skip trees whose only cipo observation is in the last survey
      if (aux$cipo[nrow(aux)] != 1 | sum(aux$cipo) > 1) {
        # drop the surveys before the first cipo occurrence
        aux_2 <- cumsum(aux$cipo)
        aux <- aux[(length(aux_2) - length(aux_2[aux_2 > 0]) + 1):length(aux_2), ]
        aux_2 <- cumsum(aux$cipo)
        j <- 1

        # advance j past the first run of consecutive cipo surveys
        # (j is not used afterwards; the loop stops at the end of the
        # vector because aux_2[j + 1] is NA there)
        while (!is.na(aux_2[j] != aux_2[j + 1]) & (aux_2[j] != aux_2[j + 1])) {
          j <- j + 1
        }

        database_sob[k, 1] <- aux$treeids[1]   # tree id
        database_sob[k, 2] <- aux$survey[1]    # first survey with cipo

        for (l in 2:nrow(aux)) {

          # print(i)  # progress tracking; printing every iteration also slows the loop down

          if (aux$cipo[l] == 1 | aux$score[l] %in% c("morta", "falha", "replanta", "saudavel")) {
            database_sob[k, 3] <- aux$survey[l]   # survey where the outcome is decided

            if (aux$cipo[l] == 0 & aux$score[l] %in% c("saudavel")) {
              database_sob[k, 4] <- 0   # recovered: cipo gone and tree healthy
            } else {
              database_sob[k, 4] <- 1   # cipo back, or tree not healthy
            }
            break
          } else {
            # no outcome yet: keep the latest survey seen so far (censored)
            database_sob[k, 3] <- aux$survey[l]
            database_sob[k, 4] <- 1
          }
        }
        k <- k + 1
      }
    }
  }
}
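In words, the loop keeps only trees with no missing cipo values and at least one cipo observation (excluding trees whose single cipo observation is in the last survey), drops the surveys before the first cipo occurrence, and then records the first survey with cipo plus the first follow-up survey in which cipo reappears or a health status is recorded. The two costs that dominate at scale are the full-table scan database[database$treeids == i, ] repeated once per tree and the row-by-row growth of database_sob. For reference, here is a grouped dplyr sketch of my understanding of that same logic that avoids both; the input column names are taken from the code above, the output column names are invented, and the sketch is untested, so please verify it against the sample data:

library(dplyr)

database_sob_v2 <- database %>%
  arrange(treeids, survey) %>%
  group_by(treeids) %>%
  # same exclusions as the loop: complete cipo data, at least one cipo
  # observation, and not a lone cipo observation in the last survey
  filter(!any(is.na(cipo)),
         sum(cipo) > 0,
         sum(cipo) > 1 | last(cipo) != 1) %>%
  # drop the surveys before the first cipo occurrence
  filter(row_number() >= which(cipo == 1)[1]) %>%
  summarise(
    survey_start = first(survey),
    end_idx = {
      # first follow-up survey where cipo reappears or a status is recorded;
      # if there is none, fall back to the last survey (censored)
      hit <- which(seq_along(cipo) > 1 &
                     (cipo == 1 | score %in% c("morta", "falha", "replanta", "saudavel")))
      if (length(hit) > 0) hit[1] else n()
    },
    survey_end = survey[end_idx],
    event = as.integer(!(cipo[end_idx] == 0 & score[end_idx] == "saudavel")),
    .groups = "drop"
  ) %>%
  select(-end_idx)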

Here’s a small sample of my dataset for you to use in testing:

https://drive.google.com/file/d/1GdNzu4pRM461nI7bxM4jmC5erU_r6joV/view?usp=sharing

The provided sample covers 14 surveys of 5 trees each. My actual dataset, however, covers 14 surveys of roughly 20 million trees. With this more detailed explanation, I hope someone can offer insights into making the code run faster; the sheer scale of the dataset is the main challenge, and any guidance on optimizing the process would be greatly appreciated.
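Eventually I plan to feed the resulting table into the survival package, roughly like the sketch below. The time variable here is just the number of surveys elapsed (an assumption on my part; surveys may need converting to actual dates), and note that Surv() treats status 1 as the event of interest, so since recovery is my event, the 0/1 indicator above is flipped:

library(survival)

# hypothetical downstream use: time measured in number of surveys elapsed
sob <- transform(database_sob_v2,
                 time      = survey_end - survey_start,
                 recovered = as.integer(event == 0))  # 1 = recovery observed
fit <- survfit(Surv(time, recovered) ~ 1, data = sob)
plot(fit)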

  • 5

    Don’t tag a bunch of unrelated languages. Do define what performance problems you have, specifically, and what your performance goals are.

    – 

  • 4

    What is the purpose of this code? Without a description of your objectives, we first have to deduce what you are trying to do. Only then can we consider how to optimise it. Some sample data would be helpful too. But one useful maxim when coding in R is “if I’m thinking of using a loop, there’s probably a better way to do it”…

    – 

  • Welcome to SO, Gabriel Carneiro! Every forum has their own expectations, SO is no different. Questions here really benefit from being self-contained, including sample data and expected output, see stackoverflow.com/q/5963269 , minimal reproducible example, and stackoverflow.com/tags/r/info for ways to include sample data (dput(x), data.frame(...), etc). This question shows some lack of familiarity with R- and dplyr-efficiencies, so it seems likely that we can help improve the basis of your code, but for that we would really benefit from a walk-through of what you need in the end. Good luck!

    – 

  • 1

    Some hints on code so far: (1) “never” (almost) use aux$ in dplyr pipes that start from aux; (2) dplyr::group_by may help, might be able to use aux %>% group_by(treeids) %>% filter(!any(is.na(cipo)), last(cipo) != 1, sum(cipo) > 1) will get you past your for loop and three if statements, then we can likely reduce your internal code to much-simpler calculations; (3) in R, vectorizing operations tends to be preferred for many reasons, not sure how best to attack the inner for loop but my guess is that it can be improved.

    – 




  • 1

@tadman I've expanded the explanation of the problem I'm having; I hope it now helps the community understand what I need! And I'm sorry about the tags, beginner mistakes…

    – 
