7

I have a service that takes a list of about 1,000,000 dictionaries and does the following:

myHashTable = {}
myLists = { 'hits':{}, 'misses':{}, 'total':{} }
mySorted = { 'hits':[], 'misses':[], 'total':[] }  # careful not to name this 'sorted', which would shadow the builtin
for item in myList:
  id = item.pop('id')
  myHashTable[id] = item
  for k, v in item.iteritems():
    myLists[k][id] = v

So, if I had the following list of dictionaries:

[ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
  {'id':'id2', 'hits':300, 'misses':100, 'total':500},
  {'id':'id3', 'hits':100, 'misses':400, 'total':600}
]

I end up with

myHashTable =
{ 
  'id1': {'hits':200, 'misses':300, 'total':400},
  'id2': {'hits':300, 'misses':100, 'total':500},
  'id3': {'hits':100, 'misses':400, 'total':600}
}

and

myLists = 

    {
      'hits': {'id1':200, 'id2':300, 'id3':100},
      'misses': {'id1':300, 'id2':100, 'id3':400},
      'total': {'id1':400, 'id2':500, 'id3':600}
    }

I then need to sort all of the data in each of the myLists dictionaries.

What I'm doing currently is something like the following:

import operator

def doSort(key):
  mySorted[key] = sorted(myLists[key].items(), key=operator.itemgetter(1), reverse=True)

which would yield, in the case of misses:
[('id3', 400), ('id1', 300), ('id2', 100)]

This works great when I have up to 100,000 records or so, but with 1,000,000 it takes at least 5-10 minutes to sort each of the 16 lists (my original list of dictionaries actually has 17 fields, including the 'id' that is popped).

* EDIT * This service is a ThreadingTCPServer with a method that allows a client to connect and add new data. The new data may include new records (meaning dictionaries whose 'id's are unique relative to what is already in memory) or modified records (meaning the same 'id' with different data for the other key/value pairs).

So, once this is running I would pass in

[
  {'id':'id1', 'hits':205, 'misses':305, 'total':480},
  {'id':'id4', 'hits':30, 'misses':40, 'total':60},
  {'id':'id5', 'hits':50, 'misses':90, 'total':20}
]

I have been using dictionaries to store the data so that I don't end up with duplicates. After the dictionaries are updated with the new/modified data, I re-sort each of them.
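
For illustration, a minimal sketch of how that update path might look (applyUpdates is a hypothetical name; it reuses the structures and doSort from above):

def applyUpdates(newItems):
  # a repeated 'id' overwrites the old record; a new 'id' is simply added
  for item in newItems:
    id = item.pop('id')
    myHashTable[id] = item
    for k, v in item.iteritems():
      myLists[k][id] = v
  # after the whole batch is applied, re-sort each field once
  for key in myLists:
    doSort(key)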

* END EDIT *

So, what is the best way for me to sort these? Is there a better method?

sberry
  • This is probably not the answer you're looking for, but using pure Python to process such volumes of data is not a good idea in general. It's not designed for performance when you need to do a lot of small operations (such as, well, comparisons during sorting). – Pavel Minaev Jul 24 '09 at 21:29
  • @Pavel, you're wrong: Python's sort (timsort) is probably THE fastest in-memory sort available, period. Josh Bloch saw it explained at a tech talk at Google and immediately started coding it as the internal sort for the next version of Java; see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124 and http://svn.python.org/projects/python/trunk/Objects/listsort.txt – Alex Martelli Jul 24 '09 at 21:34
  • @alex, do you know which tech talk? Not that I doubt you; it just piqued my interest. :) – Emil H Jul 24 '09 at 21:45
  • Don't know of a talk, but http://svn.python.org/projects/python/trunk/Objects/listsort.txt – Glenn Maynard Jul 24 '09 at 22:55
  • The sort itself may be implemented using the fastest known algorithm. But what does it matter if for every element in the list, it will have to _retrieve the key used for sorting first_. – Pavel Minaev Jul 25 '09 at 00:18
  • python is not retrieving the key each time. that's what key=itemgetter(1) is doing - it grabs the key once and uses that to do the sort. – Peter Recore Jul 27 '09 at 00:09

10 Answers

13

You may find this related answer from Guido: Sorting a million 32-bit integers in 2MB of RAM using Python

OscarRyz
3

What you really want is an ordered container, instead of an unordered one. That would implicitly sort the results as they're inserted. The standard data structure for this is a tree.

However, there doesn't seem to be one of these in Python. I can't explain that; this is a core, fundamental data type in any language. Python's dict and set are both unordered containers, which map to the basic data structure of a hash table. It should definitely have an optimized tree data structure; there are many things you can do with them that are impossible with a hash table, and they're quite tricky to implement well, so people generally don't want to be doing it themselves.

(There's also nothing mapping to a linked list, which also should be a core data type. No, a deque is not equivalent.)

I don't have an existing ordered container implementation to point you to (and it should probably be implemented natively, not in Python), but hopefully this will point you in the right direction.

A good tree implementation should support iterating across a range by value ("iterate all values from [2,100] in order"), find next/prev value from any other node in O(1), efficient range extraction ("delete all values in [2,100] and return them in a new tree"), etc. If anyone has a well-optimized data structure like this for Python, I'd love to know about it. (Not all operations fit nicely in Python's data model; for example, to get next/prev value from another value, you need a reference to a node, not the value itself.)
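
In the absence of a tree type, one standard-library stopgap is the bisect module, which keeps a plain list sorted as you insert: O(log n) to find the slot, though O(n) to shift elements. A minimal sketch:

import bisect

scores = []  # kept sorted ascending as (value, id) pairs

def insert(value, id):
    # O(log n) to find the slot, but O(n) to shift the tail of the list
    bisect.insort(scores, (value, id))

insert(300, 'id1')
insert(100, 'id2')
insert(400, 'id3')
print scores  # [(100, 'id2'), (300, 'id1'), (400, 'id3')]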

Glenn Maynard
1

If you have a fixed number of fields, use tuples instead of dictionaries. Place the field you want to sort on in the first position, and just use mylist.sort().
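
A minimal sketch of that idea (the field order here is an assumption, with the sort field first):

rows = [ (200, 'id1', 300, 400),   # (hits, id, misses, total)
         (300, 'id2', 100, 500),
         (100, 'id3', 400, 600) ]
rows.sort(reverse=True)  # tuples compare element by element, so this sorts on hits, descending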

wbg
  • 1,104
  • 2
  • 8
  • 18
  • I thought about this. The problem is that I will be continually adding new data to the service. Some of the data will be new (meaning a unique 'id'), and some will be updated (same 'id'). Therefore I cannot just add the tuple to a list for sorting. At least, not unless there is a better way to avoid duplicating id entries. – sberry Jul 24 '09 at 21:47
  • @sberry2. Please update your question with this new information. Please provide an example of how you want this multiple occurrence thing to work. – S.Lott Jul 24 '09 at 21:56
  • Could you then use one dictionary containing `id` -> `tuple` instead of `id` -> `dictionary`? Stick `id` in the tuple, too, then sort only the items? I got the impression from the OP that the data was built from scratch each time. If not, it might be worth giving SQLite or another DB module a shot. It could even be worth trying, anyway, using an in-memory DB. It may seem awfully heavyweight, but it's optimised for exactly this kind of task. – wbg Jul 24 '09 at 22:15
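
A rough sketch of that in-memory SQLite idea (the table name and schema are assumptions):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE stats (id TEXT PRIMARY KEY, hits INT, misses INT, total INT)')
# INSERT OR REPLACE handles both new and modified records by id
conn.executemany('INSERT OR REPLACE INTO stats VALUES (?, ?, ?, ?)',
                 [('id1', 200, 300, 400), ('id2', 300, 100, 500)])
for row in conn.execute('SELECT id, misses FROM stats ORDER BY misses DESC'):
    print row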
1

Others have provided some excellent advice; try it out.

As general advice, in situations like this you need to profile your code. Know exactly where most of the time is spent; bottlenecks hide well, in places you least expect them to be.
If there is a lot of number crunching involved, then a JIT compiler like the (now-dead) psyco might also help. When processing takes minutes or hours, a 2x speed-up really counts.
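
A minimal way to do that with the standard library, assuming the doSort function from the question:

import cProfile
cProfile.run("doSort('misses')", sort='cumulative')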

Wojciech Bederski
1

This seems to be pretty fast.

raw = [ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
        {'id':'id2', 'hits':300, 'misses':100, 'total':500},
        {'id':'id3', 'hits':100, 'misses':400, 'total':600} ]

# (value, id) tuples sort on the value first
hits = [ (r['hits'], r['id']) for r in raw ]
hits.sort()

misses = [ (r['misses'],r['id']) for r in raw ]
misses.sort()

total = [ (r['total'],r['id']) for r in raw ]
total.sort()

Yes, it makes three passes through the raw data. I think it's faster than pulling out the data in one pass.

S.Lott
  • I've been doing some benchmarking and this way is faster than the original but not by a huge factor. neither way seems to take 5 minutes on my machine. will post more details in an answer cuz it's gonna take a lot more room than will fit here. – Peter Recore Jul 24 '09 at 23:44
1

Instead of trying to keep your list ordered, maybe you can get by with a heap queue. It lets you push any item while keeping the 'smallest' one at h[0]; pushing, and popping the smallest (which 'bubbles' up the next smallest), are O(log n) operations.

so, just ask yourself:

  • Do I need the whole list ordered all the time? Use an ordered structure (like Zope's BTree package, as mentioned by Ealdwulf).

  • Or the whole list ordered, but only after a day's worth of random insertions? Use sort like you're doing, or like S.Lott's answer.

  • Or just a few 'smallest' items at any moment? Use heapq (a minimal sketch follows this list).
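
A minimal heapq sketch (smallest-first by value; push negated values if you want the largest on top):

import heapq

h = []
heapq.heappush(h, (300, 'id1'))
heapq.heappush(h, (100, 'id2'))
heapq.heappush(h, (400, 'id3'))
print h[0]              # (100, 'id2') -- the smallest is always at h[0]
print heapq.heappop(h)  # (100, 'id2'); the next smallest bubbles up to h[0]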

Javier
  • I have been reading the docs on Zope's BTree package (already had Zope installed), and though it seems like a good solution, I am unclear as to what data I would store in it so that I can maintain unique 'id' values and keep it sorted correctly. Any insight? – sberry Jul 24 '09 at 23:17
0
sorted(myLists[key], key=myLists[key].get, reverse=True)

should save you some time, though not a lot. (It sorts the ids directly, iterating over the dict's keys and looking each value up with .get, so no intermediate list of (id, value) tuples has to be built first.)

Alex Martelli
0

I would look into using a different sorting algorithm. Something like a merge sort might work: break the list up into smaller lists, sort them individually, then merge.

For example, the merge step in Python:

def merge(list1, list2):
    # list1 and list2 are each already sorted
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # one list is exhausted; the remainder of the other is already in order
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result
geofflane
0

Glenn Maynard is correct that a sorted mapping would be appropriate here. Here is one for Python: http://wiki.zope.org/ZODB/guide/node6.html#SECTION000630000000000000000
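
A rough sketch of how that might address the question in the comments above (assuming ZODB's BTrees package; keys are (value, id) tuples so the tree stays ordered by value, while a side dict keeps ids unique):

from BTrees.OOBTree import OOBTree

byScore = OOBTree()  # keys are (misses, id) tuples, iterated in sorted order
byId = {}            # id -> current misses, so an update can evict the stale key

def update(id, misses):
    if id in byId:
        del byScore[(byId[id], id)]
    byId[id] = misses
    byScore[(misses, id)] = None

update('id1', 300)
update('id3', 400)
update('id1', 305)
print list(byScore.keys())  # [(305, 'id1'), (400, 'id3')]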

Ealdwulf
0

I've done some quick profiling of both the original way and S.Lott's proposal. In neither case does it take 5-10 minutes per field; the actual sorting is not the problem. It looks like most of the time is spent in slinging the data around and transforming it.

Also, my memory usage is skyrocketing: my Python process is using over 350 MB of RAM. Are you sure you're not using up all your RAM and paging to disk? Even on my crappy 3-year-old power-saving laptop processor, I am seeing results way under 5-10 minutes per key sorted for a million items.

What I can't explain is the variability in the actual sort() calls. I know Python's sort is extra good at sorting partially sorted lists, so maybe his list is getting partially sorted in the transform from the raw data to the list to be sorted.

Here are the results for S.Lott's method:

done creating data
done transform.  elapsed: 16.5160000324
sorting one key slott's way takes 1.29699993134

Here's the code to get those results (this assumes the million-item myList has already been built):

import time

starttransform = time.time()
hits = [ (r['hits'], r['id']) for r in myList ]
endtransform = time.time()
print "done transform.  elapsed: " + str(endtransform - starttransform)

hits.sort()
endslottsort = time.time()
print "sorting one key slott's way takes " + str(endslottsort - endtransform)

Now the results for the original method, or at least a close version with some instrumentation added:

done creating data
done transform.  elapsed: 8.125
about to get stuff to be sorted 
done getting data. elapsed time: 37.5939998627
about to sort key hits
done  sorting on key <hits> elapsed time: 5.54699993134

Here's the code:

import time
from operator import itemgetter

mysorted = {}
for k in myLists:
    time1 = time.time()
    print "about to get stuff to be sorted "
    tobesorted = myLists[k].items()
    time2 = time.time()
    print "done getting data. elapsed time: " + str(time2 - time1)
    print "about to sort key " + str(k)
    tobesorted.sort(key=itemgetter(1))  # list.sort() sorts in place and returns None
    mysorted[k] = tobesorted
    time3 = time.time()
    print "done  sorting on key <" + str(k) + "> elapsed time: " + str(time3 - time2)
Peter Recore