Based on what I have read - for example here - I understand that I/O operations release the GIL. So, if I have to read a large number of files from the local filesystem, my understanding is that threaded execution should speed things up.
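(As a sanity check on that premise: a task that blocks, such as time.sleep, should benefit from threads, since sleeping releases the GIL just as blocking I/O does. A minimal sketch of what I mean, with an arbitrary task count and worker count:

import time
import concurrent.futures

def blocking_task(_):
    time.sleep(0.1)  # releases the GIL while sleeping, like blocking I/O
    return 1

tasks = range(20)

# Sequential: roughly 20 * 0.1 = 2 s of wall time.
sum(map(blocking_task, tasks))

# Threaded with 5 workers: roughly (20 / 5) * 0.1 = 0.4 s,
# because the sleeps overlap.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    total = sum(executor.map(blocking_task, tasks))

So the threading machinery itself seems to work as expected for blocking workloads.)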
To test this, I have a folder (input/) with ~100k files; each file has just one line containing one random integer. I have two functions, one "sequential" and one "concurrent", that just add up all the numbers:
import glob
import concurrent.futures

ALL_FILES = glob.glob('./input/*.txt')

def extract_num_from_file(fname):
    # time.sleep(0.1)
    with open(fname, 'r') as f:
        file_contents = int(f.read().strip())
    return file_contents

def seq_sum_map_based():
    return sum(map(extract_num_from_file, ALL_FILES))

def conc_sum_map_based():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        return sum(executor.map(extract_num_from_file, ALL_FILES))
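(For reproducibility: I generated the input roughly like this; the exact file count and integer range don't matter:

import os
import random

os.makedirs('./input', exist_ok=True)
for i in range(100_000):
    # one random integer per file
    with open(f'./input/{i}.txt', 'w') as f:
        f.write(str(random.randint(0, 1000)))
)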
While both functions give me the same result, the "concurrent" version is about 3-4 times slower:
In [2]: %timeit ss.seq_sum_map_based()
3.77 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit ss.conc_sum_map_based()
12.8 s ± 240 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there something wrong with my code or in my understanding?