I am trying to use a foreach loop to split a big CSV file into smaller files after doing some minor cleaning. My strategy was to:
1- Use read_csv to read a chunk that fits into RAM
2- Do the cleaning
3- Save the chunk in the new format
4- Repeat 1 to 3 using the skip and n_max arguments, so I can skip what I have already read and read an equal number of rows
I was looping over a sequence that starts at zero and increases in increments of the chunk size, roughly as sketched below.
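Simplified, the sequential version looks like this (the file name and chunk size are placeholders, and clean_chunk() stands in for my cleaning step):

```r
library(readr)

infile     <- "big_file.csv"   # placeholder path
chunk_size <- 100000

# Read the header once so every chunk gets the same column names
header <- names(read_csv(infile, n_max = 0, show_col_types = FALSE))

i <- 0
repeat {
  chunk <- read_csv(
    infile,
    skip      = i * chunk_size + 1,   # +1 skips the header line
    n_max     = chunk_size,
    col_names = header,
    show_col_types = FALSE
  )
  if (nrow(chunk) == 0) break         # nothing left to read

  cleaned <- clean_chunk(chunk)       # placeholder for the cleaning step
  write_csv(cleaned, sprintf("part_%03d.csv", i))
  i <- i + 1
}
```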
This was working with the normal for loop, and I was able to use break if something went wrong, but I could not make it work with the parallel foreach: it does not save the files properly. My assumption was that by using foreach, one core would work on, for instance, the chunk from 0 to 100,000, another core on 100,001 to 200,000, and so on, roughly as in the sketch below.
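The structure I had in mind is roughly the following (simplified, not my exact code; the total row count is assumed known and clean_chunk() is again a placeholder):

```r
library(foreach)
library(doParallel)
library(readr)

infile     <- "big_file.csv"   # placeholder path
chunk_size <- 100000
total_rows <- 1000000          # assumed known row count (excluding header)

header  <- names(read_csv(infile, n_max = 0, show_col_types = FALSE))
offsets <- seq(0, total_rows - 1, by = chunk_size)

cl <- makeCluster(4)
registerDoParallel(cl)

foreach(i = seq_along(offsets), .packages = "readr") %dopar% {
  chunk <- read_csv(
    infile,
    skip      = offsets[i] + 1,      # +1 skips the header line
    n_max     = chunk_size,
    col_names = header,
    show_col_types = FALSE
  )
  cleaned <- clean_chunk(chunk)      # placeholder for the cleaning step
  write_csv(cleaned, sprintf("part_%03d.csv", i))
  NULL                               # nothing needs to be combined
}

stopCluster(cl)
```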
Can you include the code you have tried so far, and a dput() of the first few rows of your data? Thanks. Also: can you describe (if possible) the cleaning process? Is it simple filtering? If so, you could possibly filter before reading, using something like data.table::fread() with a grep-like cmd argument.
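For example (a minimal sketch, assuming a Unix-like shell with grep available; the pattern, header prefix, and file name are placeholders):

```r
library(data.table)

# Keep the header line (starting with "id,") plus any data line matching
# the pattern, letting grep filter before R ever sees the rows
dt <- fread(cmd = "grep -E '^id,|,active,' big_file.csv")
```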