I would like to process a number of large datasets in parallel. Unfortunately the speedup I am getting from using Threads.@threads
is very sublinear, as the following simplified example shows.
(I'm very new to Julia, so apologies if I missed something obvious)
Let's create some dummy input data - 8 dataframes with 2 integer columns each and 10 million rows:
using DataFrames
n = 8
dfs = Vector{DataFrame}(undef, n)
for i = 1:n
dfs[i] = DataFrame(Dict("x1" => rand(1:Int64(1e7), Int64(1e7)), "x2" => rand(1:Int64(1e7), Int64(1e7))))
end
Now do some processing on each dataframe (group by x1
and sum x2
)
function process(df::DataFrame)::DataFrame
combine([:x2] => sum, groupby(df, :x1))
end
Finally, compare the speed of doing the processing on a single dataframe with doing it on all 8 dataframes in parallel. The machine I'm running this on has 50 cores and Julia was started with 50 threads, so ideally there should not be much of a time difference.
julia> dfs_res = Vector{DataFrame}(undef, n)
julia> @time for i = 1:1
dfs_res[i] = process(dfs[i])
end
3.041048 seconds (57.24 M allocations: 1.979 GiB, 4.20% gc time)
julia> Threads.nthreads()
50
julia> @time Threads.@threads for i = 1:n
dfs_res[i] = process(dfs[i])
end
5.603539 seconds (455.14 M allocations: 15.700 GiB, 39.11% gc time)
So the parallel run takes almost twice as long per dataset (this gets worse with more datasets). I have a feeling this has something to do with inefficient memory management. GC time is pretty high for the second run. And I assume the preallocation with undef
isn't efficient for DataFrame
s. Pretty much all the examples I've seen for parallel processing in Julia are done on numeric arrays with fixed and a-priori known sizes. However here the datasets could have arbitrary sizes, columns etc. In R workflows like that can be done very efficiently with mclapply
. Is there something similar (or a different but efficient pattern) in Julia? I chose to go with threads and not multi-processing to avoid copying data (Julia doesn't seem to support the fork process model like R / mclapply).