
Based on what I have read - for example here - I understand that I/O operations release the GIL. So, if I have to read a large number of files on the local filesystem, my understanding is that a threaded execution should speed things up.

To test this - I have a folder (input) with ~100k files - each file has just one line with one random integer. I have two functions - one "sequential" and one "concurrent" - that just add up all the numbers:

import glob
import concurrent.futures

ALL_FILES = glob.glob('./input/*.txt')

def extract_num_from_file(fname):
    # time.sleep(0.1)
    with open(fname, 'r') as f:
        file_contents = int(f.read().strip())
    return file_contents

def seq_sum_map_based():
    return sum(map(extract_num_from_file, ALL_FILES))

def conc_sum_map_based():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        return sum(executor.map(extract_num_from_file, ALL_FILES))

While both functions give me the same result - the "concurrent" version is about 3-4 times slower.

In [2]: %timeit ss.seq_sum_map_based()                                                                                                     
3.77 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit ss.conc_sum_map_based()                                                                                                    
12.8 s ± 240 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there something wrong with my code or in my understanding?

Mortz
  • I'm not very competent on this, but I assume that the concurrent version just wastes extra resources on spawning threads, switching context, passing GIL around, etc. You could see a win here if the CPU work was more intensive than a simple addition. Need to profile your code to know for sure – Expurple Dec 28 '21 at 16:27
  • If it was related to the overhead that you mention, I would think that increasing the number of files to read should have an impact on the difference in performance. But I don't see that - for 10 files or 100k files, the concurrent version remains equally slow – Mortz Dec 29 '21 at 11:42
  • I doubt that disk I/O parallelizes well, at least on usual disks / file systems. In the end, all the I/O operations have to line up for the same disk. – bereal Dec 30 '21 at 16:48
  • Threaded definitely CAN be faster, but it would only speed things up if you have I/O to spare. If your disk's read speed is slow/saturated, then threading will do you no good... In many cases it will actually slow you down because random reads are much slower than sequential reads. Best-case the files are of some substantial size to take advantage of sequential reads (more I/O to play with). Tested on my own SSD with ~1000 files of various sizes (average 8800 characters) and threading with 5 threads did help somewhat, 3 threads was even faster. – sytech Dec 30 '21 at 20:58

2 Answers


Note: The following only applies to HDDs, which have moving parts that can affect read throughput, not SSDs. The nature of the large performance difference makes it clear to me that this is an HDD-oriented problem, so this information operates under that assumption.

The problem is that while the threads may be operating in parallel, the data must still be read sequentially from the hard drive, as there is only a single read head. Worse, since you've parallelized the I/O operations, the underlying OS will schedule these I/O tasks such that each file is only partially processed before switching to another thread (after all, even if a file holds only a single integer, its headers still need to be processed as well), causing the read head to jump around much more wildly than in your strictly sequential code. All of this results in massively increased overhead compared to simply reading each file in its entirety, one after another, which doesn't require so many jumps.

This wouldn't be as much of a problem if, for instance, you had a single thread loading in large amounts of data from disk while a second thread performs some time-intensive processing of it, as that would allow the time-intensive processing to continue unblocked by the I/O operations. Your particular scenario is just a really, really bad case where you've given up a GIL bottleneck in exchange for a horrifically slow I/O bottleneck.
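To make that reader/worker split concrete, here is a minimal, hypothetical sketch of the pattern (not the OP's code: the glob reuses the question's input folder, and the process callable is just a placeholder for whatever CPU-heavy work would be done per file):

import glob
import queue
import threading

ALL_FILES = glob.glob('./input/*.txt')  # same file set as in the question

def reader(paths, out_q):
    # One thread streams the files off the disk in order (largely sequential access).
    for path in paths:
        with open(path, 'rb') as f:
            out_q.put(f.read())
    out_q.put(None)  # sentinel: no more data

def worker(in_q, process):
    # A second thread does the time-intensive processing, unblocked by the I/O above.
    while True:
        data = in_q.get()
        if data is None:
            break
        process(data)

q = queue.Queue(maxsize=16)  # bounded, so the reader can't buffer the whole dataset ahead of the worker
t_read = threading.Thread(target=reader, args=(ALL_FILES, q))
t_work = threading.Thread(target=worker, args=(q, lambda data: None))  # placeholder for the expensive step
t_read.start(); t_work.start()
t_read.join(); t_work.join()

Here the read head still works through the files one after another; only the processing overlaps with the I/O.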

In short, you've understood correctly that I/O operations release the GIL, you just came to the wrong conclusion about parallelizing file reads.

B. Fleming
  • Good explanation, although this assumes a spinning disk hard drive. SSDs don't have the same problem of the read head jumping around. In short, threading can be faster if you have I/O to spare. On fast SSDs, it would probably be faster to thread the reading. On spinning platter disks, it's unlikely to help. – sytech Dec 30 '21 at 20:56
  • In line with what you and @sytech are saying, when I did move my code from an HDD to an SSD machine, the multi-threaded operation worked about as fast as the single-threaded one. – Mortz Dec 31 '21 at 15:46
  • @sytech Thank you for bringing that up. I forgot to include a disclaimer that the answer is hardware-specific. It was clear to me that the issue was due to a HDD bottleneck, but I should've clarified this in my answer. I've edited my answer accordingly. – B. Fleming Jan 01 '22 at 11:33

The other answer stated this quite well:

you've given up a GIL bottleneck in exchange for a horrifically slow I/O bottleneck. In short, you've understood correctly that I/O operations release the GIL, you just came to the wrong conclusion about parallelizing file reads.

I'll add that threaded file reading can be more performant if you have I/O to spare, like you might have on a very fast SSD.

I tested this on a relatively fast SSD (Samsung 970 EVO 1TB; secondary drive with no activity) reading from ~1000 files of various sizes, averaging 8800 characters. In testing, I got better performance with more threads... but diminishing returns kick in fast.

In [1]: len(ALL_FILES)
995
In [2]: %timeit ThreadedFileReader(ALL_FILES, n=1).join() # single threaded
61.8 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit ThreadedFileReader(ALL_FILES, n=2).join()
54.7 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit ThreadedFileReader(ALL_FILES, n=3).join()
56.1 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit ThreadedFileReader(ALL_FILES, n=4).join()
57.8 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit ThreadedFileReader(ALL_FILES, n=5).join()
58.9 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit ThreadedFileReader(ALL_FILES, n=50).join()
68.6 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So your idea is sound in principle, but only if you have enough I/O to spare compared to a sequential read. Unless you have very fast storage, you're unlikely to need more than an extra thread or two. If your storage is not fast at all, the single-threaded approach may be the way to go.

Remember that if you have many threads reading files at the same time, especially small files, you're likely going to be bottlenecked by your drive's random read capabilities. Comparatively, a single-threaded approach on large files is likely to be bottlenecked closer to your drive's sequential read capabilities. Depending on the hardware, these could be substantially different performance ratings.

Depending on the hardware and the characteristics of the data you're reading, the benefits of sequential read performance may outweigh any potential gains from parallelizing reads.
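If you want to find that sweet spot on your own hardware, one rough way is to sweep the worker count with the stdlib ThreadPoolExecutor, along the lines of the question's code (the glob pattern and worker counts below are just placeholder choices):

import concurrent.futures
import glob
import time

ALL_FILES = glob.glob('./input/*.txt')  # adjust to your own file set

def read_file(fname):
    with open(fname, encoding='utf-8') as f:
        return len(f.read())

# Time the same read workload at several worker counts and compare.
for n in (1, 2, 3, 5, 10, 50):
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as ex:
        total = sum(ex.map(read_file, ALL_FILES))
    print(f'{n:>3} workers: {time.perf_counter() - start:.3f} s ({total} characters read)')

Keep in mind that after the first pass the OS page cache will likely be serving most of the reads, so repeat each measurement a few times (or clear the cache between runs) before drawing conclusions.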

For completeness, the code I used to test this is below, though it doesn't have a particular bearing on the answer.

from collections import deque
import queue
import threading

class ThreadedFileReader:
    def __init__(self, files, n=5):
        self.files = deque(files)
        self.threads = []
        self.results = queue.Queue()
        for _ in range(n):
            t = threading.Thread(target=self.worker)
            t.start()
            self.threads.append(t)

    def worker(self):
        # Each worker pops filenames off the shared deque until it is empty.
        while self.files:
            fname = self.files.pop()
            with open(fname, encoding='utf-8') as f:
                data = f.read()
            self.results.put(len(data))

    def join(self):
        for t in self.threads:
            t.join()
sytech
  • It seems to me what you have done is just create your own multithreading pool implementation that returns its output in task completion order rather than in task submission order (is this a bug or a "feature"?), whereas standard pools support both ways. I am having difficulty understanding how this code could be more performant than the OP's code. – Booboo Jan 01 '22 at 15:31
  • @Booboo the implementation is not super important. The difference in performance comes from difference in hardware speed, not necessarily the code. – sytech Jan 01 '22 at 22:07
  • I got *that* point you were making. But I thought that you were also suggesting a switch from the `ThreadPoolExecutor` class used by the OP to your `ThreadedFileReader` class. Just as an aside having nothing to do with the OP's problem, you are using a `deque` where one would normally be using a second `queue.Queue` for enqueuing tasks. Granted that it is highly unlikely but do you think it impossible that a thread could lose control between the `while self.files:` and `fname = self.files.pop()` statements and end up with an exception because there are now no more elements on the deque? – Booboo Jan 01 '22 at 22:50
  • @Booboo I believe it would be possible, yes. – sytech Jan 01 '22 at 23:46