I have some huge CSV files (hundreds of megabytes). From the post "Why reading rows is faster than reading columns?" it seems that storing and reading CSV files by rows is more cache efficient and should be about 30 times faster than using columns. However, when I tried this, the file stored by rows was actually slower:

import csv
import time

def get_ms():
    # millisecond wall-clock timestamp (assumed definition; the helper isn't shown above)
    return time.time() * 1000

t = get_ms()
i = None
cols = csv.reader(open(col_csv, "r"))
for c in cols:
    for e in c:
        i = e

s = get_ms()
print("open cols file takes : " + str(s - t))

t = get_ms()
rows = csv.reader(open(row_csv, "r"))
i = None
for r in rows:
    for e in r:
        i = e
s = get_ms()
print("open rows file takes : " + str(s - t))

output:

open cols file takes : 13698
open rows file takes : 14971

Is this problem specific to Python? I know that in C++, wide tables are usually faster than long tables, but I'm not sure whether the same applies in Python.
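
To be clear about what I mean by "stored by rows" versus "stored by columns": both files hold the same data, one just transposed relative to the other. A minimal sketch (the sample data and output filenames here are made up, not my real files):

import csv

data = [[1, 2, 3], [4, 5, 6]]  # hypothetical sample: two records, three fields

# Row-wise storage: one CSV line per record (a long, narrow file)
with open("row.csv", "w", newline="") as f:
    csv.writer(f).writerows(data)

# Column-wise storage: one CSV line per field (a short, wide file)
with open("col.csv", "w", newline="") as f:
    csv.writer(f).writerows(zip(*data))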

edit: fixed a typo in the rows loop (it originally read r = e instead of i = e, as pointed out in the answer)

1 Answer

You aren't performing the same operation by rows as by columns. In the code as originally posted (before the typo edit), for columns you have:

for c in cols:
  for e in c:
    i = e

But for rows you have:

for r in rows:
  for e in r:
    r = e

The first one reads each cell of each column and assigns that value to the variable i. The second reads each cell of each row and then rebinds the loop variable r to the value of the cell, so the two loops aren't doing the same work and the timings aren't comparable. Let's write it with clearer variable names:

Columns:

for column in columns:
  for cell in column:
    output = cell

Rows:

for row in rows:
  for cell in row:
    row = cell
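
Once both loops assign each cell to the same throwaway variable, the two files can be timed fairly. Here is a minimal sketch of such an apples-to-apples benchmark (the use of timeit and the scan helper are my own, not code from the question):

import csv
import timeit

def scan(path):
    # identical work for both files: visit every cell, keep only the last value
    last = None
    with open(path, "r") as f:
        for record in csv.reader(f):
            for cell in record:
                last = cell
    return last

# col_csv and row_csv are the question's file paths
print("cols file:", timeit.timeit(lambda: scan(col_csv), number=1), "s")
print("rows file:", timeit.timeit(lambda: scan(row_csv), number=1), "s")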