This question is in relation to another question asked here: Sorting 1M records

I have since figured out the problem I was having with sorting. I was sorting items from a dictionary into a list every time I updated the data. I have since realized that much of the power of Python's sort lies in the fact that it sorts data that is already partially sorted more quickly.

So, here is the question. Suppose I have the following as a sample set:

self.sorted_records = [(1, 1234567890), (20, 1245678903), 
                       (40, 1256789034), (70, 1278903456)]

t[1] of each tuple in the list is a unique id. Now I want to update this list with the following:

updated_records = {1245678903:45, 1278903456:76}

What is the fastest way for me to do so, ending up with

self.sorted_records = [(1, 1234567890), (45, 1245678903),
                       (40, 1256789034), (76, 1278903456)]

Currently I am doing something like this:

updated_keys = updated_records.keys()
for i, record in enumerate(self.sorted_data):
    if record[1] in updated_keys:
        updated_keys.remove(record[1])
        self.sorted_data[i] = (updated_records[record[1]], record[1])

But I am sure there is a faster, more elegant solution out there.

Any help?

* edit It turns out I used bad examples for the ids, since they end up in sorted order when I do my update. I am actually interested in t[0] being in sorted order. After the update I intended to re-sort with the updated data, but it looks like bisect might be the ticket for inserting in sorted order. end edit *

sberry
  • measure carefully (the solution coded in detail in my answer, that in Brian's, and the vague suggestion about bisect), since .sort is often surprisingly fast (esp. on data that's already mostly sorted) while bisect offers little upside. – Alex Martelli Jul 27 '09 at 15:05

4 Answers

2

You're scanning through all n records. You could instead do a binary search, which would be O(log(n)) instead of O(n). You can use the bisect module to do this.
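A minimal sketch of that idea, assuming the tuples are stored as (id, value) and the list is kept sorted by id. That layout is an assumption (the question stores (value, id)), and `find_index` is a hypothetical helper name, not something from the question.

```python
import bisect

def find_index(records, recid):
    """Return the index of the tuple whose first item is recid, or -1."""
    # (recid,) compares lower than any (recid, value) tuple, so bisect_left
    # lands on the tuple with this id if it is present.
    pos = bisect.bisect_left(records, (recid,))
    if pos < len(records) and records[pos][0] == recid:
        return pos
    return -1

# Assumed layout: (id, value), sorted by id.
records = [(1234567890, 1), (1245678903, 20),
           (1256789034, 40), (1278903456, 70)]
updated_records = {1245678903: 45, 1278903456: 76}

for recid, value in updated_records.items():
    i = find_index(records, recid)
    if i >= 0:
        records[i] = (recid, value)  # replace in place; the id order is unchanged
```

Each lookup is O(log n), so k updates cost O(k log n) rather than one O(n) scan. Whether that wins in practice depends on how k compares to n.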

Laurence Gonsalves
  • Isn't `bisect` only for inserting into already sorted arrays? How would one use it to do a search? – Evan Fosmark Jul 27 '09 at 05:29
  • bisect is for searching an array, with insertion being a common use case. It's just a binary search; it's a trickier algorithm to get right in all cases than many people realize, so it makes sense to have it in the standard library. – Glenn Maynard Jul 27 '09 at 05:39
  • Note though that the list seems to be sorted on the **second** item. bisect uses normal comparison, which will give the wrong result in this case. – Brian Jul 27 '09 at 12:28
  • I can actually modify the code to have the current tuples (data, id) inserted as (id, data) and just change the key param to use itemgetter(1) when sorting. – sberry Jul 27 '09 at 13:48
1

Since apparently you don't care about the ending value of self.sorted_records actually being sorted (you have values in order 1, 45, 40, 76 -- that's NOT sorted!-), AND you only appear to care about IDs in updated_records that are also in self.sorted_data, a listcomp (with side effects if you want to change updated_records on the fly) would serve you well, i.e.:

self.sorted_data = [(updated_records.pop(recid, value), recid) 
                    for (value, recid) in self.sorted_data]

The .pop call removes from updated_records the keys (and corresponding values) that end up in the new self.sorted_data. The "previous value for that recid", value, is supplied as the 2nd argument to pop to ensure no change where a recid is NOT in updated_records. This leaves the "new" stuff in updated_records, so you can e.g. append it to self.sorted_data before re-sorting. I suspect you want to continue with something like

# The tuple needs its own parentheses inside the generator expression.
self.sorted_data.extend((value, recid)
                        for recid, value in updated_records.iteritems())
self.sorted_data.sort()

though this part DOES go beyond the question you're actually asking (and I'm giving it only because I've seen your previous questions;-).
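As an illustration, here are both steps run on the question's sample data. The third id in `updated_records` is a made-up one, added only to show what happens to a record not yet in the list, and `.items()` stands in for `.iteritems()` so the sketch also runs on Python 3.

```python
sorted_data = [(1, 1234567890), (20, 1245678903),
               (40, 1256789034), (70, 1278903456)]
# 1299999999 is a hypothetical new id, not part of the question's data.
updated_records = {1245678903: 45, 1278903456: 76, 1299999999: 5}

# Step 1: replace values for known ids; pop() removes each id it uses.
sorted_data = [(updated_records.pop(recid, value), recid)
               for (value, recid) in sorted_data]

# Step 2: whatever is left in updated_records is new, so append and re-sort.
sorted_data.extend((value, recid) for recid, value in updated_records.items())
sorted_data.sort()

print(sorted_data)
# [(1, 1234567890), (5, 1299999999), (40, 1256789034),
#  (45, 1245678903), (76, 1278903456)]
```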

Alex Martelli
  • Correct you are Alex. I will be sorting the data after the updates, and the list IS sorted on the tuples index 1. I will also look into using the bisect module for any assistance it can provide. – sberry Jul 27 '09 at 14:02
1

You'd probably be best served by some form of tree here (preserving sorted order while allowing O(log n) replacements). There is no builtin balanced tree type, but you can find many third-party examples. Alternatively, you could either:

  1. Use a binary search to find the node. The bisect module will do this, but it compares based on the normal Python comparison order, whereas you seem to be sorted based on the second element of each tuple. You could reverse this, or just write your own binary search (or simply take the code from bisect_left and modify it).

  2. Use both a dict and a list. The list contains the sorted keys only. You can wrap the dict class easily enough to ensure this is kept in sync. This allows you fast dict updating while maintaining sort order of the keys. This should prevent your problem of losing sort performance due to constant conversion between dict/list.

Here's a quick implementation of such a thing:

import bisect

class SortedDict(dict):
    """Dictionary which is iterable in sorted order.

    O(n) sorted iteration
    O(1) lookup
    O(log n) replacement  (but O(n) insertion of new items)
    """

    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._keys = sorted(dict.iterkeys(self))

    def __setitem__(self, key, val):
        if key not in self:
            # New key - need to add to list of keys.
            pos = bisect.bisect_left(self._keys, key)
            self._keys.insert(pos, key)
        dict.__setitem__(self, key, val)

    def __delitem__(self, key):
        if key in self:
            pos = bisect.bisect_left(self._keys, key)
            del self._keys[pos]
        dict.__delitem__(self, key)

    def __iter__(self):
        for k in self._keys: yield k
    iterkeys = __iter__

    def iteritems(self):
        for k in self._keys: yield (k, self[k])

    def itervalues(self):
        for k in self._keys: yield self[k]

    def update(self, other):
        dict.update(self, other)
        self._keys = sorted(dict.iterkeys(self)) # Rebuild (faster if lots of changes made - may be slower if only minor changes to large dict)

    def keys(self): return list(self.iterkeys())
    def values(self): return list(self.itervalues())
    def items(self): return list(self.iteritems())

    def __repr__(self):
        return "%s(%s)" % (self.__class__.__name__, ', '.join("%s=%r" % (k, self[k]) for k in self))
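
Option 1 above (a hand-written binary search that compares only the second element of each tuple) could look roughly like this. It is a sketch that assumes the list really is ordered by that id, and `find_by_id` is a hypothetical name.

```python
def find_by_id(records, recid):
    """Binary search on records[i][1]; return the index or -1 if absent."""
    lo, hi = 0, len(records)
    while lo < hi:
        mid = (lo + hi) // 2
        if records[mid][1] < recid:
            lo = mid + 1   # target, if present, is to the right of mid
        else:
            hi = mid       # target, if present, is at mid or to the left
    if lo < len(records) and records[lo][1] == recid:
        return lo
    return -1

# The question's sample data happens to be ordered by id (t[1]).
records = [(1, 1234567890), (20, 1245678903),
           (40, 1256789034), (70, 1278903456)]

i = find_by_id(records, 1245678903)
if i >= 0:
    records[i] = (45, records[i][1])  # new value, same id
```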
Brian
0

Since you want to replace by dictionary key, but have the array sorted by dictionary value, you definitely need a linear search for the key. In that sense, your algorithm is the best you can hope for.

If you preserved the old dictionary value, then you could use a binary search for the value, and then locate the key in the proximity of where the binary search led you.
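
A rough sketch of that second idea, assuming a side dict (here called `current`, a hypothetical name) that remembers each id's present value. Because the tuples are (value, id), searching for the exact old tuple lands directly on the record, with no scan of the neighbourhood needed.

```python
import bisect

records = [(1, 1234567890), (20, 1245678903),
           (40, 1256789034), (70, 1278903456)]
# Hypothetical side index: id -> value currently stored in records.
current = {recid: value for value, recid in records}

def update(records, current, recid, new_value):
    """Move one record to its new sorted position."""
    old_value = current[recid]
    # Binary search for the exact old tuple.
    pos = bisect.bisect_left(records, (old_value, recid))
    del records[pos]
    # Re-insert so the list stays sorted by (value, id).
    bisect.insort(records, (new_value, recid))
    current[recid] = new_value

for recid, value in {1245678903: 45, 1278903456: 76}.items():
    update(records, current, recid, value)
```

The search is O(log n), but `del` and `insort` still shift list elements, so each update remains O(n) in the worst case. It is worth timing this against a plain re-sort.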

Martin v. Löwis