
I've made a script to get inventory data from the Steam API and I'm a bit unsatisfied with the speed. So I read a bit about multiprocessing in Python and simply cannot wrap my head around it. The program works as follows: it gets a SteamID from a list, fetches that player's inventory, and then stores both in a dictionary with the SteamID as the key and the inventory contents as the value.

I've also gathered that there are some issues with using a counter when multiprocessing, which is a small problem since I'd like to be able to resume the program from the last fetched inventory rather than from the beginning again.

Anyway, what I'm asking for is really a concrete example of how to do multiprocessing when opening the URL that contains the inventory data so that the program can fetch more than one inventory at a time rather than just one.

On to the code:

with open("index_to_name.json", "r", encoding=("utf-8")) as fp:
    index_to_name=json.load(fp)

with open("index_to_quality.json", "r", encoding=("utf-8")) as fp:
    index_to_quality=json.load(fp)

with open("index_to_name_no_the.json", "r", encoding=("utf-8")) as fp:
    index_to_name_no_the=json.load(fp)

with open("steamprofiler.json", "r", encoding=("utf-8")) as fp:
    steamprofiler=json.load(fp)

with open("itemdb.json", "r", encoding=("utf-8")) as fp:
    players=json.load(fp)

error=list()
playerinventories=dict()
c=127480

while c<len(steamprofiler):
    inventory=dict()
    items=list()
    try:
        url=urllib.request.urlopen("http://api.steampowered.com/IEconItems_440/GetPlayerItems/v0001/?key=DD5180808208B830FCA60D0BDFD27E27&steamid="+steamprofiler[c]+"&format=json")
        inv=json.loads(url.read().decode("utf-8"))
        url.close()
    except (urllib.error.HTTPError, urllib.error.URLError, socket.error, UnicodeDecodeError) as e:
        c+=1
        print("HTTP-error, continuing")
        error.append(c)
        continue
    try:
        for r in inv["result"]["items"]:
            inventory[r["id"]]=r["quality"], r["defindex"]
    except KeyError:
        c+=1
        error.append(c)
        continue
    for key in inventory:
        try:
            if index_to_quality[str(inventory[key][0])]=="":
                items.append(
                    index_to_quality[str(inventory[key][0])]
                    +""+
                    index_to_name[str(inventory[key][1])]
                    )
            else:
                items.append(
                    index_to_quality[str(inventory[key][0])]
                    +" "+
                    index_to_name_no_the[str(inventory[key][1])]
                    )
        except KeyError:
            print("keyerror, uppdate def_to_index")
            c+=1
            error.append(c)
            continue
    playerinventories[int(steamprofiler[c])]=items
    c+=1
    if c % 10==0:
        print(c, "inventories downloaded")

I hope my problem is clear; if not, just say so. I would ideally avoid third-party libraries, but if that's not possible, so be it. Thanks in advance.

Tenbin

2 Answers


So you're assuming the fetching of the URL might be the thing slowing your program down? You'd do well to check that assumption first, but if it's indeed the case, using the multiprocessing module is huge overkill: for I/O-bound bottlenecks, threading is quite a bit simpler and might even be a bit faster (it takes a lot more time to spawn another Python interpreter than to spawn a thread).
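For instance, a quick way to check is to time one fetch in isolation. This is a rough sketch, assuming steamprofiler is loaded as in your question; YOUR_KEY stands in for your API key:

import time
import urllib.request

# Time a single inventory fetch to see whether the network dominates
start = time.time()
url = urllib.request.urlopen(
    "http://api.steampowered.com/IEconItems_440/GetPlayerItems/v0001/"
    "?key=YOUR_KEY&steamid=" + steamprofiler[0] + "&format=json")
data = url.read()
url.close()
print("one fetch took {0:.2f}s for {1} bytes".format(time.time() - start, len(data)))

If that single request takes the better part of a second, your loop really is I/O bound.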

Looking at your code, you might get away with sticking most of the content of your while loop in a function with c as a parameter, and starting a thread from there using another function, something like:

import threading

def process_item(c):
    # The work goes here: replace all those 'continue' statements with 'return'
    pass

for c in range(127480, len(steamprofiler)):
    thread = threading.Thread(name="inventory {0}".format(c),
                              target=process_item, args=[c])
    thread.start()

A real problem might be that there's no limit to the number of threads being spawned, which might break the program. Also, the guys at Steam might not be amused at getting hammered by your script, and they might decide to un-friend you.

A better approach would be to fill a collections.deque object with your list of c's and then start a limited set of threads to do the work:

import collections
import threading

def process_item(c):
    # The work goes here: replace all those 'continue' statements with 'return'
    pass

def process():
    # Each worker pulls the next c off the shared deque until it's empty
    while True:
        process_item(work.popleft())

work = collections.deque(range(127480, len(steamprofiler)))

threads = [threading.Thread(name="worker {0}".format(n), target=process)
           for n in range(6)]
for worker in threads:
    worker.start()

Note that I'm counting on work.popleft() to throw an IndexError when we're out of work, which will kill the thread. That's a bit sneaky, so consider using a try...except instead.
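That less sneaky variant might look like this (again untested):

def process():
    # Stop cleanly when the deque runs dry instead of dying on IndexError
    while True:
        try:
            c = work.popleft()
        except IndexError:
            return  # no work left, so let this worker thread finish
        process_item(c)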

Two more things:

  1. Consider using the excellent Requests library instead of urllib (which, API-wise, is by far the worst module in the entire Python standard library that I've worked with).
  2. For Requests, there's an add-on called grequests which allows you to do fully asynchronous HTTP requests. That would have made for even simpler code, something like the sketch below.
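A rough, untested sketch of the grequests approach (with YOUR_KEY standing in for your API key, and steamprofiler loaded as in the question):

import grequests

urls = ["http://api.steampowered.com/IEconItems_440/GetPlayerItems/v0001/"
        "?key=YOUR_KEY&steamid={0}&format=json".format(steamprofiler[c])
        for c in range(127480, len(steamprofiler))]

# 'size' caps the number of concurrent requests, much like the thread pool above
for response in grequests.map((grequests.get(u) for u in urls), size=6):
    if response is not None:
        inv = response.json()
        # ...same parsing as in the question's loop body...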

I hope this helps, but please keep in mind this is all untested code.

grainednoise

The outermost while loop can be distributed over a few processes (or tasks).

When you break the loop into tasks, note that you are sharing the playerinventories and error objects between processes. You will need a multiprocessing.Manager to share them safely.
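For example, a minimal, untested sketch of that structure (assuming steamprofiler is loaded as in the question):

import multiprocessing

def process_item(args):
    c, playerinventories, error = args
    # The body of the question's while loop goes here, writing into the
    # shared containers instead of plain dict/list objects
    pass

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    playerinventories = manager.dict()  # shared between worker processes
    error = manager.list()
    pool = multiprocessing.Pool(processes=4)
    pool.map(process_item, [(c, playerinventories, error)
                            for c in range(127480, len(steamprofiler))])
    pool.close()
    pool.join()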

I recommend you start modifying your code from this snippet.

Hyungoo Kang
  • It's essentially separated into: opening the URL -> getting the data from the URL -> translating the data -> appending it to a dictionary. The bottleneck is in opening the URL and getting the data. Could you give some more concrete help for my code? – Tenbin Dec 10 '12 at 16:30
  • Think of a function that gets `counter` as input (`c` in your existing code) and updates `playerinventories` and `error`. The function can be run concurrently on several processes to boost network utilization. – Hyungoo Kang Dec 10 '12 at 18:24
  • Sorry for my poor expression 'separated into'. Please consider it as 'distributed over' – Hyungoo Kang Dec 10 '12 at 19:00