How can I optimize this R code to run faster? [closed]

I am an intern working on a project to predict how long orange trees take to recover their productivity after being affected by vines ("cipo" in the code, from the Portuguese word for vine). My coding experience, especially in R, is quite limited; I am a complete beginner. This is my first question here, so I apologize for any mistakes, and for my English, as I am Brazilian. Any response is genuinely appreciated!

The dataset I am working with comes from an analysis carried out every four months, called a survey. During each survey, a drone flies over the orange orchards taking photos, which are then processed with AI to generate the dataset. The dataset contains information about each tree: its id (treeids in the code), grove, farm (fazenda_seetree), and current health status (score: morta = dead, falha = failure, replanta = replanted, saudavel = healthy). It also indicates whether a tree is sick and whether it has been affected by vines (cipo).
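To make the structure concrete, here are a few hypothetical rows with the columns my code uses (the values are invented; the real file has more columns):

example_rows <- data.frame(
  treeids         = c("t1", "t1", "t1", "t2", "t2", "t2"),
  survey          = c(1, 2, 3, 1, 2, 3),
  fazenda_seetree = "farm_1",
  score           = c("saudavel", "saudavel", "saudavel",
                      "saudavel", "morta", "morta"),
  cipo            = c(0, 1, 0, 0, 1, 1)   # 1 = vines observed in that survey
)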

For my project, I believe survival analysis is the most effective way to examine the time a tree needs to recover after being affected by vines. To run the survival analysis, however, I first need to build a dataset with the necessary information: for each tree (identified by treeids), the first survey in which it had vines and the follow-up survey that determines whether it recovered or not. The expected output is sketched below.
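The output I am building (database_sob in the code below) should have one row per affected tree: the tree id, the first survey with cipo, the follow-up survey where the outcome is decided, and an indicator that is 0 when the tree recovered (cipo gone and score saudavel) and 1 otherwise. My code produces unnamed columns, but for the hypothetical rows above (column names invented for readability) the expected result is:

#   treeids survey_start survey_end event
# 1      t1            2          3     0   (recovered)
# 2      t2            2          3     1   (cipo still present)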

The code below is what I am currently using to generate that dataset. It works correctly, but it takes a very long time to finish. Even on Databricks with a more powerful machine, it runs remarkably slowly. This is a significant problem for me, and as a beginner I would appreciate any guidance, even if it means switching to another programming language besides R.

library(dplyr)

database <- read.csv("farm_1t_V4.csv")
database <- database[, !names(database) %in% "X"]   # drop the exported row-index column

k <- 1                        # next row to fill in the output
database_sob <- data.frame()

for (i in unique(database$treeids)) {
  # all surveys for this tree, in chronological order
  aux <- database[database$treeids == i, ] %>% arrange(survey)

  if (!any(is.na(aux$cipo))) {
    if (sum(aux$cipo) > 0) {
      # skip trees whose only cipo observation is in the last survey
      if (aux$cipo[nrow(aux)] != 1 | sum(aux$cipo) > 1) {
        # drop the surveys before the first cipo occurrence
        aux_2 <- cumsum(aux$cipo)
        aux <- aux[(length(aux_2) - length(aux_2[aux_2 > 0]) + 1):length(aux_2), ]
        aux_2 <- cumsum(aux$cipo)
        j <- 1

        # advance j past the first run of consecutive cipo surveys
        # (j is not used afterwards; the loop stops at the end of the
        # vector because aux_2[j + 1] is NA there)
        while (!is.na(aux_2[j] != aux_2[j + 1]) & (aux_2[j] != aux_2[j + 1])) {
          j <- j + 1
        }

        database_sob[k, 1] <- aux$treeids[1]   # tree id
        database_sob[k, 2] <- aux$survey[1]    # first survey with cipo

        for (l in 2:nrow(aux)) {

          # print(i)  # progress tracking; printing every iteration also slows the loop down

          if (aux$cipo[l] == 1 | aux$score[l] %in% c("morta", "falha", "replanta", "saudavel")) {
            database_sob[k, 3] <- aux$survey[l]   # survey where the outcome is decided

            if (aux$cipo[l] == 0 & aux$score[l] %in% c("saudavel")) {
              database_sob[k, 4] <- 0   # recovered: cipo gone and tree healthy
            } else {
              database_sob[k, 4] <- 1   # cipo back, or tree not healthy
            }
            break
          } else {
            # no outcome yet: keep the latest survey seen so far (censored)
            database_sob[k, 3] <- aux$survey[l]
            database_sob[k, 4] <- 1
          }
        }
        k <- k + 1
      }
    }
  }
}
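In words, the loop keeps only trees with no missing cipo values and at least one cipo observation (excluding trees whose single cipo observation is in the last survey), drops the surveys before the first cipo occurrence, and then records the first survey with cipo plus the first follow-up survey in which cipo reappears or a health status is recorded. The two costs that dominate at scale are the full-table scan database[database$treeids == i, ] repeated once per tree and the row-by-row growth of database_sob. For reference, here is a grouped dplyr sketch of my understanding of that same logic that avoids both; the input column names are taken from the code above, the output column names are invented, and the sketch is untested, so please verify it against the sample data:

library(dplyr)

database_sob_v2 <- database %>%
  arrange(treeids, survey) %>%
  group_by(treeids) %>%
  # same exclusions as the loop: complete cipo data, at least one cipo
  # observation, and not a lone cipo observation in the last survey
  filter(!any(is.na(cipo)),
         sum(cipo) > 0,
         sum(cipo) > 1 | last(cipo) != 1) %>%
  # drop the surveys before the first cipo occurrence
  filter(row_number() >= which(cipo == 1)[1]) %>%
  summarise(
    survey_start = first(survey),
    end_idx = {
      # first follow-up survey where cipo reappears or a status is recorded;
      # if there is none, fall back to the last survey (censored)
      hit <- which(seq_along(cipo) > 1 &
                     (cipo == 1 | score %in% c("morta", "falha", "replanta", "saudavel")))
      if (length(hit) > 0) hit[1] else n()
    },
    survey_end = survey[end_idx],
    event = as.integer(!(cipo[end_idx] == 0 & score[end_idx] == "saudavel")),
    .groups = "drop"
  ) %>%
  select(-end_idx)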

Here’s a small sample of my dataset for you to use in testing:

https://drive.google.com/file/d/1GdNzu4pRM461nI7bxM4jmC5erU_r6joV/view?usp=sharing

The provided sample covers 14 surveys of 5 trees each. My actual dataset, however, covers 14 surveys of roughly 20 million trees. With this more detailed explanation, I hope someone can offer insights into making the code run faster; the sheer scale of the dataset is the main challenge, and any guidance on optimizing the process would be greatly appreciated.
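Eventually I plan to feed the resulting table into the survival package, roughly like the sketch below. The time variable here is just the number of surveys elapsed (an assumption on my part; surveys may need converting to actual dates), and note that Surv() treats status 1 as the event of interest, so since recovery is my event, the 0/1 indicator above is flipped:

library(survival)

# hypothetical downstream use: time measured in number of surveys elapsed
sob <- transform(database_sob_v2,
                 time      = survey_end - survey_start,
                 recovered = as.integer(event == 0))  # 1 = recovery observed
fit <- survfit(Surv(time, recovered) ~ 1, data = sob)
plot(fit)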

  • 5

    Don’t tag a bunch of unrelated languages. Do define what performance problems you have, specifically, and what your performance goals are.

    – 

  • 4

    What is the purpose of this code? Without a description of your objectives, we first have to deduce what you are trying to do. Only then can we consider how to optimise it. Some sample data would be helpful too. But one useful maxim when coding in R is “if I’m thinking of using a loop, there’s probably a better way to do it”…

    – 

  • Welcome to SO, Gabriel Carneiro! Every forum has their own expectations, SO is no different. Questions here really benefit from being self-contained, including sample data and expected output, see stackoverflow.com/q/5963269 , minimal reproducible example, and stackoverflow.com/tags/r/info for ways to include sample data (dput(x), data.frame(...), etc). This question shows some lack of familiarity with R- and dplyr-efficiencies, so it seems likely that we can help improve the basis of your code, but for that we would really benefit from a walk-through of what you need in the end. Good luck!

    – 

  • 1

    Some hints on code so far: (1) “never” (almost) use aux$ in dplyr pipes that start from aux; (2) dplyr::group_by may help, might be able to use aux %>% group_by(treeids) %>% filter(!any(is.na(cipo)), last(cipo) != 1, sum(cipo) > 1) will get you past your for loop and three if statements, then we can likely reduce your internal code to much-simpler calculations; (3) in R, vectorizing operations tends to be preferred for many reasons, not sure how best to attack the inner for loop but my guess is that it can be improved.

    – 




  • 1

@tadman I've expanded the explanation of the problem I'm having; I hope it now helps the community understand what I need! And I'm sorry about the tags, beginner mistakes…

    – 
