mclapply not returning the same results as equivalent lapply in aws.s3
I have been reading in multiple parquet files from S3 using R. Recently, I decided to parallelize this process to read these files in more quickly, using the same function as before. Both the lapply and mclapply versions are shown below:
## load packages
library(aws.s3)
library(parallel)
library(data.table)
library(dplyr)
library(arrow)

num_cores <- detectCores() - 2
rerating_data_files <- some_vector_of_s3_file_paths
read_parquet_from_s3 <- function(file) {
  tryCatch(
    expr = {
      return(setDT(arrow::read_parquet(aws.s3::get_object(file))))
    },
    error = function(e) {
      ## on error, print the file name; the handler itself returns NULL
      cat(c('\n', 'error with reading in file: ', file, '\n'))
    }
  )
}
## parallel read
test <- rbindlist(
  mclapply(rerating_data_files, function(x) read_parquet_from_s3(x),
           mc.cores = min(num_cores, length(rerating_data_files))),
  use.names = TRUE
)
test$quote_id %>% n_distinct() ## 2342
## sequential read
data_temp <- rbindlist(
  lapply(rerating_data_files, function(x) read_parquet_from_s3(x)),
  use.names = TRUE
)
data_temp$quote_id %>% n_distinct() ## 2542
What makes this so odd is that running the exact same snippet of code 5 or 6 times sometimes results in the mclapply call returning a data.table with the correct 2542 distinct quotes.
I'm curious if anyone else is having this problem with aws.s3 and mclapply.
I am also having issues reading many files using mclapply. For some jobs I get an error (this is random and has nothing to do with my function or files) like:
[[57]]
[1] "Error in curl::curl_fetch_memory(url, handle = handle) : \n  SSL read: error:00000000:lib(0):func(0):reason(0), errno 104\n"
attr(,"class")
[1] "try-error"
I believe the delta in your distinct quote counts is equal to the number of files that failed with these curl errors.
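One way to check this, as a sketch reusing read_parquet_from_s3 and the other names from the question: inspect the raw mclapply result before rbindlist, since the tryCatch handler returns NULL on error and rbindlist silently drops NULL elements (a failed fork can also surface as a "try-error" string, as above).

raw <- mclapply(rerating_data_files, read_parquet_from_s3,
                mc.cores = min(num_cores, length(rerating_data_files)))
## count elements that either hit the error handler (NULL) or came
## back as a serialized "try-error" from a failed fork
failed <- vapply(raw, function(x) is.null(x) || inherits(x, "try-error"), logical(1))
sum(failed)                   ## files silently dropped from the rbindlist result
rerating_data_files[failed]   ## candidates to retry sequentially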
The simple answer is that this is not supported. curl is not compatible with mclapply, because CURL handles (and SSL handles) cannot be forked. It will only work if you fork first and then load curl, i.e. you cannot perform any curl operations before mcparallel(). From experience, the issue is worse on RHEL than on other systems.
In practice, this means that if you want to use mcparallel(), you have to do all curl operations inside the forked code: don't load curl, httr, or aws.s3 in the main session; only load them in the parallel code.
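A minimal sketch of that workaround, assuming the same rerating_data_files and num_cores from the question: keep the curl-backed packages out of the parent session and attach them only inside each forked worker, so every CURL/SSL handle is created after the fork.

library(parallel)
library(data.table)   ## safe to load in the parent: no curl involved

read_worker <- function(file) {
  ## load the curl-backed packages inside the fork only, so each
  ## worker creates its own CURL/SSL handles after forking
  requireNamespace("aws.s3", quietly = TRUE)
  requireNamespace("arrow", quietly = TRUE)
  tryCatch(
    setDT(arrow::read_parquet(aws.s3::get_object(file))),
    error = function(e) NULL
  )
}

test <- rbindlist(
  mclapply(rerating_data_files, read_worker,
           mc.cores = min(num_cores, length(rerating_data_files))),
  use.names = TRUE
)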