
I am trying to process a file using a single core of my CPU, but a single core does not seem to be sufficient. If I could get access to the multiple cores of my system, I could make the processing run better and faster.

Unfortunately, I only know how to process a file using a single core. Here is what I did:

data = open('datafile', 'r', encoding='ascii', errors='ignore')
for line in data.readlines():
    splitted = line.lower().strip().split()
    # process() is my own function that works on the first two fields of each line
    check = process(splitted[0], splitted[1])
    if check == '':
        pass
data.close()

I want to know how I can use the full capacity of the CPU to run process() on each line separately and still get the desired output. Also, how can I avoid a thread deadlock while processing, since that could be dangerous for the output?

Please share your view with me.

Jaffer Wilson
  • What deadlock? A deadlock requires at least one lock, you know? Since Python has this thing called the GIL, the only way to utilize multiple cores is to use processes instead of threads. Parallel disk I/O may or may not increase performance (depending on the disk you have), so what I suggest is to use `multiprocessing.Pool` and send "chunks" of the file to it from the main process for parallel processing. – freakish Sep 06 '17 at 11:16
  • @freakish dividing the file into chunks may lose data, which I do not want, as it is important to maintain the complete data. – Jaffer Wilson Sep 06 '17 at 11:19
  • Why would it lose the data? You just read line after line and send each line to a child process. There's no data loss here. – freakish Sep 06 '17 at 11:22
  • @freakish are you referring to chunks of lines or chunks of bytes? Because the bytes method is lossy. With lines I have tried making chunks and giving them to a normal process, but I do not know much about how to multi-thread or multi-process. – Jaffer Wilson Sep 06 '17 at 11:24
  • Both ways may be correct depending on what exactly you are doing. Anyway see my answer. – freakish Sep 06 '17 at 11:39
  • @freakish I am looking at it... it looks quite favorable for my conditions... let me check it. – Jaffer Wilson Sep 06 '17 at 11:40
  • Possible duplicate of [Read large file in parallel?](https://stackoverflow.com/questions/18104481/read-large-file-in-parallel) – stovfl Sep 06 '17 at 14:09

1 Answer


First of all: you need multiple processes, not threads, to utilize multiple cores. That is a limitation imposed by the GIL.

Now here's an example of how you can implement it with multiprocessing.Pool:

from multiprocessing import Pool, cpu_count

def process(arg1, arg2):
    ...

workers_count = 2*cpu_count() + 1  # or whatever you need
pool = Pool(processes=workers_count)

with open('datafile', 'r', encoding='ascii', errors='ignore') as fo:
    buffer = []
    for line in fo:
        splitted = line.lower().strip().split()
        buffer.append((splitted[0], splitted[1]))
        if len(buffer) == workers_count:
            # starmap unpacks each (arg1, arg2) tuple into a process(arg1, arg2) call
            results = pool.starmap(process, buffer)
            buffer = []
            # do something with results
    if buffer:
        results = pool.starmap(process, buffer)
        # do something with results again

pool.close()
pool.join()

So what it does is read the file line by line and, once it has gathered enough data, send it to a multiprocessing pool and wait for the parallel processing to finish. Note that unless you have an SSD, running disk I/O in parallel will only degrade performance (and it would not be trivial to parallelize line-by-line reads anyway).

What you have to be aware of, though, is that since multiple processes are used, you cannot share memory between them, i.e. the process function should not read from or write to global variables.
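For illustration, here is a minimal sketch of that idea (not part of the original answer): each call to process() returns a value and the parent process aggregates the results, instead of mutating a global counter. The match-counting task and the sample data are hypothetical.

from multiprocessing import Pool

def process(arg1, arg2):
    # Hypothetical task: report whether the two fields of a line match.
    return 1 if arg1 == arg2 else 0

if __name__ == '__main__':
    pairs = [('a', 'a'), ('a', 'b'), ('c', 'c')]   # stands in for the buffered lines
    with Pool() as pool:
        results = pool.starmap(process, pairs)
    total_matches = sum(results)   # aggregation happens in the parent process
    print(total_matches)           # prints 2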

freakish
  • Yes, I have an SSD. What will happen if I use a global variable for storing temporary data, as it is favorable and easy to use while processing data through `process()`? – Jaffer Wilson Sep 06 '17 at 11:47
  • @JafferWilson If you are using an SSD then you may want to play with parallel reads, i.e. by calling `.read()` inside the pool. However this will be quite difficult to implement, since each worker would have to know where to start reading the file and how many lines it should read (a rough sketch of that idea is appended after these comments). – freakish Sep 06 '17 at 11:56
  • @JafferWilson As for storing temporary data you can always have a local variable inside `process` function, right? The problem would arise if for example you have a global counter that you increment after each `process` call. Sharing state between calls to `process` won't work with multiple processes. – freakish Sep 06 '17 at 11:58
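Below is a rough sketch of the parallel-read idea from the comments (it is not part of the answer): the main process computes byte ranges whose boundaries fall on line endings, and each worker opens the file itself and reads only its own range. The chunk count, the line-counting placeholder, and the reuse of the 'datafile' name are illustrative assumptions.

import os
from multiprocessing import Pool

FILENAME = 'datafile'   # same file name as in the question; adjust as needed

def find_offsets(path, n_chunks):
    # Split the file into (start, end) byte ranges aligned to line boundaries.
    size = os.path.getsize(path)
    offsets = []
    with open(path, 'rb') as f:
        start = 0
        for i in range(1, n_chunks + 1):
            end = size if i == n_chunks else max((size * i) // n_chunks, start)
            if end < size:
                f.seek(end)
                f.readline()        # advance to the next line boundary
                end = f.tell()
            offsets.append((start, end))
            start = end
    return offsets

def work(chunk):
    # Each worker reads only the lines that start inside its own byte range.
    start, end = chunk
    count = 0
    with open(FILENAME, 'rb') as f:
        f.seek(start)
        while f.tell() < end:
            raw = f.readline()
            if not raw:
                break
            # decode with raw.decode('ascii', errors='ignore') and process it here
            count += 1              # placeholder for real per-line processing
    return count

if __name__ == '__main__':
    chunks = find_offsets(FILENAME, 4)
    with Pool(processes=4) as pool:
        print(sum(pool.map(work, chunks)))   # total number of lines processed

No lines are lost or read twice, because every boundary is pushed forward to the end of the line it falls in, so each line belongs to exactly one range.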