Fuzzy join two data frames

I have two data frames that I want to fuzzy-join in R using fuzzyjoin. This is the code I have written:

library(tidyverse)
library(fuzzyjoin)
library(readxl)

ex_hotels <- readRDS("expedia_hotels.rds")
census <- read_excel("US_Census_October_2023.xlsx")

# Subset data (adjust the number based on your available memory)
subset_ex_hotels <- ex_hotels[sample(nrow(ex_hotels), 1000), ]
subset_census <- census[sample(nrow(census), 1000), ]

# Set up parallel processing
plan(multiprocess)

result <- fuzzy_left_join(
  ex_hotels,
  census,
  by = c(
    "hotel_name" = "Hotel Name",
    "locality" = "City",
    "region_state" = "State",
    "street_address" = "Address 1",
    "country" = "Country",
    "zip_code" = "Postal Code"
  ),
  match_fun = stringdist::stringdistmatrix,
  method = "jaccard"
)

It fails with an error saying the application is out of memory. How can I solve this? Is there a simpler way to do this?

  • (1) I don't see how future::plan(.) is helpful here; does fuzzyjoin have a multiprocess mode I don't know about? (2) Do you really need a stringdist comparison on all six pairs of columns? I imagine some (country and zip_code, most likely) can be normalized externally and compared with a simple ==. (3) Have you tried with fewer than 1000×1000 rows? (4) It would help if you provided sample data, perhaps 10-20 rows of each (ensuring some overlap).


  • I have now reduced the data to 1000 rows and it runs, but the matching returns only NULL values.

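To illustrate point (2) of the first comment, here is a minimal sketch of exact-matching the columns that can be normalized and fuzzy-comparing only the hotel name. It swaps fuzzy_left_join for a plain dplyr join followed by stringdist::stringdist, so strings are only compared for rows that already share a country and zip code. The toy tables, the "jw" method, and the 0.15 cutoff are assumptions chosen for illustration, not values taken from the real data.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical values, not the real data)
hotels <- tibble(
  hotel_name = c("Grand Plaza Hotel", "Sea View Inn"),
  zip_code   = c("10001", "94105"),
  country    = c("US", "US")
)
census <- tibble(
  `Hotel Name`  = c("Grand Plaza Hotl", "Sea Veiw Inn", "Mountain Lodge"),
  `Postal Code` = c("10001", "94105", "80202"),
  Country       = c("US", "US", "US")
)

# Step 1: exact join on the normalized keys, so only rows that share
# a country and zip code become candidate pairs
candidates <- inner_join(
  hotels, census,
  by = c("country" = "Country", "zip_code" = "Postal Code")
)

# Step 2: string distance on the names of each candidate pair only;
# the 0.15 cutoff is an assumed threshold and needs tuning on real data
matched <- candidates %>%
  mutate(name_dist = stringdist::stringdist(
    tolower(hotel_name), tolower(`Hotel Name`), method = "jw"
  )) %>%
  filter(name_dist <= 0.15)

print(matched)
```

Because the exact join runs first, the number of string comparisons scales with the number of candidate pairs per zip code rather than with every row of one table against every row of the other, which is where the memory goes in the all-columns fuzzy join.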
