I have some huge CSV files (hundreds of megabytes). From the post "Why reading rows is faster than reading columns?" it seems that storing and reading CSV files by rows is more cache efficient and should be about 30 times faster than using columns. However, when I tried this, the file stored by rows was actually slower:

import csv
import time

def get_ms():
    # millisecond wall-clock timestamp (assumed definition; the helper isn't shown above)
    return time.time() * 1000

t = get_ms()
i = None
cols = csv.reader(open(col_csv, "r"))
for c in cols:
    for e in c:
        i = e

s = get_ms()
print("open cols file takes : " + str(s - t))

t = get_ms()
rows = csv.reader(open(row_csv, "r"))
i = None
for r in rows:
    for e in r:
        i = e
s = get_ms()
print("open rows file takes : " + str(s - t))

output:

open cols file takes : 13698
open rows file takes : 14971

Is this problem specific to Python? I know that in C++, wide tables are usually faster than long tables, but I'm not sure whether the same applies in Python.
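
To be clear about what I mean by "stored by rows" versus "stored by columns": both files hold the same data, one just transposed relative to the other. A minimal sketch (the sample data and output filenames here are made up, not my real files):

import csv

data = [[1, 2, 3], [4, 5, 6]]  # hypothetical sample: two records, three fields

# Row-wise storage: one CSV line per record (a long, narrow file)
with open("row.csv", "w", newline="") as f:
    csv.writer(f).writerows(data)

# Column-wise storage: one CSV line per field (a short, wide file)
with open("col.csv", "w", newline="") as f:
    csv.writer(f).writerows(zip(*data))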

edit: fixed a typo in the rows loop (it originally read r = e instead of i = e, as pointed out in the answer)

1 Answer

You aren't performing the same operation by rows as by columns. In the code as originally posted (before the typo edit), for columns you have:

for c in cols:
  for e in c:
    i = e

But for rows you have:

for r in rows:
  for e in r:
    r = e

The first one reads each cell of each column and assigns that value to the variable i. The second reads each cell of each row and then rebinds the loop variable r to the value of the cell, so the two loops aren't doing the same work and the timings aren't comparable. Let's write it with clearer variable names:

Columns:

for column in columns:
  for cell in column:
    output = cell

Rows:

for row in rows:
  for cell in row:
    row = cell
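
Once both loops assign each cell to the same throwaway variable, the two files can be timed fairly. Here is a minimal sketch of such an apples-to-apples benchmark (the use of timeit and the scan helper are my own, not code from the question):

import csv
import timeit

def scan(path):
    # identical work for both files: visit every cell, keep only the last value
    last = None
    with open(path, "r") as f:
        for record in csv.reader(f):
            for cell in record:
                last = cell
    return last

# col_csv and row_csv are the question's file paths
print("cols file:", timeit.timeit(lambda: scan(col_csv), number=1), "s")
print("rows file:", timeit.timeit(lambda: scan(row_csv), number=1), "s")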