I have two data frames that I want to join with fuzzyjoin in R. This is the code I have written:
library(tidyverse)
library(fuzzyjoin)
library(readxl)
ex_hotels<-readRDS("expedia_hotels.rds")
census<-read_excel("US_Census_October_2023.xlsx")
# Subset data (adjust the number based on your available memory)
subset_ex_hotels <- ex_hotels[sample(nrow(ex_hotels), 1000), ]
subset_census <- census[sample(nrow(census), 1000), ]
# Set up parallel processing
plan(multiprocess)
result <- fuzzy_left_join(ex_hotels,
                          census,
                          by = c(
                            "hotel_name" = "Hotel Name",
                            "locality" = "City",
                            "region_state" = "State",
                            "street_address" = "Address 1",
                            "country" = "Country",
                            "zip_code" = "Postal Code"
                          ),
                          match_fun = stringdist::stringdistmatrix,
                          method = "jaccard")
The error says the application is out of memory. How can I solve this? Is there a simpler way to do it?
(1) I don’t see how future::plan(.) is helpful here; does fuzzyjoin have a multiproc mode I don’t know about? (2) Do you really need to do a stringdist comparison on all six pairs of columns? I imagine some (country and zip_code, most likely) can be normalized externally and then compared with a simple == comparison. (3) Have you tried with fewer than 1000×1000? (4) It might help if you provide sample data, perhaps 10-20 rows of each (ensuring some overlap).
I reduced the data to 1000 rows now and it is working. But the matching returns only null values.
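A minimal sketch of point (2), assuming the postal code can serve as an exact key once it is normalized: join on that key first, then compute a string distance only within the candidate pairs. The toy rows, the zip_key column, q = 2, and the 0.5 cutoff are all hypothetical and would need adjusting to the real data. It may also be relevant that the fuzzyjoin documentation describes match_fun as a function returning TRUE or FALSE for each pair, whereas stringdist::stringdistmatrix returns a full distance matrix; that mismatch could plausibly account for both the memory use and the empty matches, though this has not been checked against the actual data.

```r
library(dplyr)
library(stringdist)

# Toy stand-ins for ex_hotels and census (hypothetical rows, not real data)
hotels <- tibble(
  hotel_name = c("Grand Plaza Hotel", "Sea Breeze Inn"),
  zip_code   = c("10001", "33139 ")
)
census <- tibble(
  `Hotel Name`  = c("Grand Plaza Hotel NYC", "Seabreeze Inn", "Mountain Lodge"),
  `Postal Code` = c("10001", "33139", "80202")
)

# Normalize the exact-match key on both sides (here: strip stray whitespace)
hotels <- hotels %>% mutate(zip_key = trimws(zip_code))
census <- census %>% mutate(zip_key = trimws(`Postal Code`))

# Exact join on the key keeps the candidate set small,
# then one pairwise Jaccard distance per candidate row
candidates <- hotels %>%
  inner_join(census, by = "zip_key") %>%
  mutate(name_dist = stringdist(tolower(hotel_name),
                                tolower(`Hotel Name`),
                                method = "jaccard", q = 2))

# Keep pairs whose names are close enough (0.5 is an assumed cutoff)
matches <- candidates %>% filter(name_dist <= 0.5)
print(matches)
```

Because the distance is computed row by row with stringdist() on already-paired rows, memory grows with the number of candidate pairs rather than with the full product of the two tables.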