
I have some very simple code that parses a JSON file in which each line is a JSON object. For some reason, the processing time per row keeps increasing as the code runs.

Can someone explain to me why this is happening and how to stop this?

Here is the code snippet:

from ast import literal_eval as le
import re
import string
from pandas import DataFrame
import pandas
import time
f = open('file.json')
df = DataFrame(columns=(column_names))
row_num = 0
while True:
    t = time.time()
    for line in f:
        line = line.strip()
        d = le(line)
        df.loc[row_num] = [d[column_name1], d[column_name2]]
        row_num+=1
        if(row_num%5000 == 0):
            print row_num, 'done', time.time() - t
            break
df.to_csv('MetaAnalysis', encoding='utf-8')

Part of the output is as follows:

5000 done 11.4549999237     
10000 done 16.5380001068    
15000 done 24.2339999676    
20000 done 36.3680000305    
25000 done 50.0610001087    
30000 done 57.0130000114    
35000 done 65.9800000191    
40000 done 74.4649999142 

As you can see, the time taken for each block of 5,000 rows keeps increasing.

silent_dev

3 Answers


Pandas is notoriously slow at appending rows: it maintains indices on the data, and every time you append a row it has to update them (and often reallocate and copy the underlying data).

This means it is much faster to add a thousand rows at once (and update once) than to add a single row (and update) a thousand times.
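
As a rough, self-contained illustration of that difference (a minimal sketch with made-up two-column records, not the mozilla file used below), compare enlarging a DataFrame row by row via .loc with collecting the rows in a plain list and building the DataFrame in one call:

import time
import pandas as pd

rows = [{"a": i, "b": 2 * i} for i in range(5000)]   # made-up records

# Slow: enlarge the DataFrame one row at a time via .loc
start = time.time()
df_slow = pd.DataFrame(columns=["a", "b"])
for i, r in enumerate(rows):
    df_slow.loc[i] = [r["a"], r["b"]]
print("row-by-row .loc:", time.time() - start)

# Fast: collect plain lists first, build the DataFrame once
start = time.time()
df_fast = pd.DataFrame([[r["a"], r["b"]] for r in rows], columns=["a", "b"])
print("built in one call:", time.time() - start)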

Sample code to follow; I am still downloading the mozilla.tar.gz file (453 MB).

Edit: apparently the file I downloaded and extracted (/dump/mozilla/mozall.bson, 890 MB) is a MongoDB dump in BSON with TenGen extensions, containing 769k rows. For testing purposes I took the first 50k rows and re-saved them as JSON (the result is 54 MB; the average line is about 1200 characters), then used Notepad++ to break it into one record per line.
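
If you would rather script that conversion than do it by hand, something along these lines should work using the bson package that ships with pymongo (a sketch only; the paths and the 50k cutoff are taken from the description above):

from bson import decode_file_iter       # part of the pymongo distribution
from bson.json_util import dumps

N = 50000  # keep only the first 50k documents

with open("dump/mozilla/mozall.bson", "rb") as src, open("mozilla.json", "w") as dst:
    for i, doc in enumerate(decode_file_iter(src)):
        if i >= N:
            break
        dst.write(dumps(doc) + "\n")   # one JSON record per line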

Most of the complexity here is for reading the file in chunks:

from itertools import islice
import pandas as pd
from time import time

LINES_PER_BLOCK = 5000

# read object chunks from json file
with open("mozilla.json") as inf:
    chunks = []
    while True:
        start = time()
        block = list(islice(inf, LINES_PER_BLOCK))
        if not block:
            # reached EOF
            break
        json  = "[" + ",".join(block) + "]"
        chunk = pd.read_json(json, "records")
        chunks.append(chunk)
        done = time()
        print(LINES_PER_BLOCK * len(chunks), "done", done - start)
        start = done

# now combine chunks
start = time()
df = pd.concat(chunks)
done = time()
print("Concat done", done - start)

which gives

5000 done 0.12293195724487305
10000 done 0.12034845352172852
15000 done 0.12239885330200195
20000 done 0.11942410469055176
25000 done 0.12282919883728027
30000 done 0.11931681632995605
35000 done 0.1278700828552246
40000 done 0.12238287925720215
45000 done 0.12096738815307617
50000 done 0.20111417770385742
Concat done 0.04361534118652344

for a total time of 1.355s; if you don't need to chunk the file, it simplifies to

import pandas as pd
from time import time

start = time()

with open("mozilla.json") as inf:
    json = "[" + ",".join(inf) + "]"
df = pd.read_json(json, "records")

done = time()
print("Total time", done - start)

which gives

Total time 1.247551441192627
Hugh Bothwell

You are monotonically growing the data structure behind df.loc by inserting new elements in the line

df.loc[row_num] = [d[column_name1], d[column_name2]]

df.loc behaves like a dictionary keyed by row label (see e.g. here). Inserting into a structure like this gets slower the more elements it already holds, as has already been discussed in this Stack Overflow answer. Hence, the growing size of your DataFrame will progressively slow down the inner code of the loop.
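
Whatever the exact mechanism, the slowdown is easy to reproduce; a minimal timing sketch (synthetic data rather than the asker's file) that uses the same .loc insertion pattern as the question shows the per-block time climbing as the frame grows:

import time
import pandas as pd

# Same insertion pattern as in the question, but with synthetic rows
df = pd.DataFrame(columns=["a", "b"])
t = time.time()
for i in range(20000):
    df.loc[i] = [i, 2 * i]
    if (i + 1) % 5000 == 0:
        print(i + 1, "done", time.time() - t)   # each block takes longer than the last
        t = time.time()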

mac7

So, based on the answer by mac7 and the comment by Hugh Bothwell, I was able to identify the issue with the code. I modified it as follows, which cut the time to roughly 1/12th of the original on average. TL;DR: I appended the rows to a plain list and periodically appended that list to the final DataFrame.

from ast import literal_eval as le
import re
import string
from pandas import DataFrame
import pandas
import time
f = open('Filename')
df = DataFrame(columns=cols)
row_num = 0
while True:
    t = time.time()
    l = []
    for line in f:
        line = line.strip()
        bug = le(line)
        l.append([values])
        row_num+=1
        if(row_num%5000 == 0):
            print row_num, 'done', time.time() - t
            df = df.append(pandas.DataFrame(l),ignore_index=True)
            break
df.to_csv('File', index='id', encoding='utf-8')

Output time:

5000 done 0.998000144958
10000 done 1.01800012589
15000 done 1.01699995995
20000 done 0.999000072479
25000 done 1.04600000381
30000 done 1.09200000763
35000 done 1.06200003624
40000 done 1.14300012589
45000 done 1.00900006294
50000 done 1.06600022316
silent_dev
  • Any chance you can share a link to MozillaBugs.json? You could probably greatly speed up loading by using e.g. `df.from_records`. – Hugh Bothwell Jun 30 '17 at 14:33
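
For reference, the from_records suggestion from that comment would look roughly like this (a sketch only: the file name, column names, and output path are the placeholders from the question, not real values, and it reuses the question's literal_eval parsing):

from ast import literal_eval
import pandas as pd

with open("file.json") as f:
    # parse every line up front, then hand all records to pandas in a single call
    records = [literal_eval(line) for line in f]

df = pd.DataFrame.from_records(records, columns=["column_name1", "column_name2"])
df.to_csv("MetaAnalysis", encoding="utf-8")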