
I am working on a spatial analysis problem using Python 2.7. I have a dictionary `edges` representing the edges of a graph, where each key is an edge ID and the value is the edge's start and end points:

{e1: [(12.8254, 55.3880), (12.8343, 55.3920)], 
e2: [(12.8254, 55.3880), (12.8235, 55.3857)], 
e3: [(12.2432, 57.1120), (12.2426, 57.1122)]}

And I have another dictionary nodes where key is the nodeID and the value is the node coordinates:

{n14: (12.8254, 55.3880), 
n15: (12.8340, 55.3883), 
n16: (12.8235, 55.3857), 
n17: (12.8343, 55.3920)}

I need to get a list which will look like this (the 'n' and 'e' prefixes in the keys are just for illustration in this question; I actually have integers there):

[(e1, n14, n17), (e2, n14, n16), ...]

That is, I iterate over the edges dict, take every key, find the values that exist in the nodes dict, and add them to a tuple. This is how I do it now:

edgesList = []
for featureId in edges:
    edgeFeatureId = [k for k, v in edges.iteritems() if k == featureId][0]
    edgeStartPoint = [k for k, v in nodes.iteritems() if v == edges[featureId][0]][0]  # start point
    edgeEndPoint = [k for k, v in nodes.iteritems() if v == edges[featureId][1]][0]  # end point
    edgesList.append((edgeFeatureId, edgeStartPoint, edgeEndPoint))

This works, but it is very slow on large datasets (with 100K edges and 90K nodes it takes ca. 10 minutes).

I've figured out how to use a list comprehension to get each of the tuple's items, but is it possible to combine my 3 list comprehensions into one, avoiding the outer for loop (if that would speed things up)?

Is there another way I can build such a list faster?

UPDATE

As Martin suggested, I've inverted my nodes dict:

nodesDict = dict((v,k) for k,v in oldnodesDict.iteritems())

having the node coordinates tuple as key and the nodeID as value. Unfortunately, it did not speed up the lookup process (here is the updated code - I flipped the k and v for edgeStartPoint and edgeEndPoint):

edgesList = []
for featureId in edges:
    edgeFeatureId = [k for k, v in edges.iteritems() if k == featureId][0]
    edgeStartPoint = [v for k, v in nodes.iteritems() if k == edges[featureId][0]][0]  # start point
    edgeEndPoint = [v for k, v in nodes.iteritems() if k == edges[featureId][1]][0]  # end point
    edgesList.append((edgeFeatureId, edgeStartPoint, edgeEndPoint))
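For reference, inverting the dict only pays off once the lookups index it directly instead of scanning it with `iteritems()`. A minimal sketch of the difference, in Python 3 syntax (on 2.7, substitute `iteritems()` for `items()`), using the sample data with integer IDs:

```python
# Sketch: direct key access on the inverted nodes dict.
# Sample data from the question, with integer IDs as the question notes.
edges = {
    1: [(12.8254, 55.3880), (12.8343, 55.3920)],
    2: [(12.8254, 55.3880), (12.8235, 55.3857)],
}
nodes = {
    14: (12.8254, 55.3880),
    16: (12.8235, 55.3857),
    17: (12.8343, 55.3920),
}
nodesDict = {v: k for k, v in nodes.items()}  # coordinates -> nodeID

edgesList = []
for featureId, (start, end) in edges.items():
    # Hash lookups instead of scanning every item on every iteration
    edgesList.append((featureId, nodesDict[start], nodesDict[end]))

print(sorted(edgesList))  # [(1, 14, 17), (2, 14, 16)]
```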
  • Could you create an object representing an edge? You could overload the equality function or other functions to ease working with this data structure. – Marcin Feb 27 '15 at 06:55
  • @Marcin, did you mean creating a class for the edge object? I was hoping to be able to do some kind of smart dictionary matching but couldn't find anything appropriate. – Alex Tereshenkov Feb 27 '15 at 07:03
  • Yep, an Edge class. It seems it could simplify your code and the operations on the edges. – Marcin Feb 27 '15 at 07:06
  • OK, thanks for the tip. The operation I am trying to speed up is the last one I will do in Python, all the data processing that occurs before happens in seconds, it is just the last part that takes so much time. I would like to avoid creating class for the Edge as I won't be using it for anything else... – Alex Tereshenkov Feb 27 '15 at 07:13
  • So your problem is the append()? Then try using a deque as `edgesList`: https://docs.python.org/2/library/collections.html#collections.deque "Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction." – donmarkusi Feb 27 '15 at 07:22
  • Why are you using `iteritems()` everywhere? The point of a dictionary is key-based lookup; with `iteritems()` you get O(N) performance on each lookup. Either re-arrange your data to allow lookup by key, or switch to a binary-search-tree type structure, which will at least give O(log n). – user3467349 Feb 27 '15 at 08:53
  • @user3467349, thanks for pointing this out. I believe this is what Martin mentioned in the answer I accepted. – Alex Tereshenkov Feb 27 '15 at 09:46

3 Answers


Since you are matching based on the coordinates, your nodes dictionary should be inverted.

That is, it should look like so:

{(12.8254, 55.3880): n14, 
(12.8340, 55.3883): n15, 
(12.8235, 55.3857): n16, 
(12.8343, 55.3920): n17}

This way, when you are iterating over your edges, you have a very quick look-up to find the corresponding nodes:

edgesList = []
for featureId in edges:
    coordinates = edges[featureId]
    c0, c1 = coordinates

    n0 = nodes[c0]
    n1 = nodes[c1]

    edgesList.append((featureId, n0, n1))

Remember that dictionaries are extremely quick at finding the value for any given key. On average, the lookup takes about the same time whether the dictionary holds one entry or a million.
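A quick sketch to illustrate that claim (Python 3 syntax): a key lookup in a dict of 100,000 entries takes roughly the same time as in a dict of 10, because both are hash lookups rather than scans.

```python
import timeit

# Two dicts keyed by coordinate tuples, differing in size by 10000x
small = {(float(i), float(i)): i for i in range(10)}
large = {(float(i), float(i)): i for i in range(100000)}

key = (5.0, 5.0)  # present in both dicts
t_small = timeit.timeit(lambda: small[key], number=100000)
t_large = timeit.timeit(lambda: large[key], number=100000)

# Both timings come out in the same ballpark despite the size difference
print(t_small, t_large)
```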

  • Martin, what is the coordinates variable? I don't understand how `edgeFeatureId, coordinates = e` works. – Alex Tereshenkov Feb 27 '15 at 07:38
  • `x, y = [1, 2]` means x will become 1 and y will become 2. You need to use the code I supplied, not just invert the dictionary. – Martin Konecny Feb 27 '15 at 08:01
  • I misunderstood your original code slightly. I've updated my answer - it should be clearer now. – Martin Konecny Feb 27 '15 at 08:03
  • With `coordinates = edges[featureId]` it makes sense. I've just realized that I was searching through the values with `==` instead of referring to a value based on the key with `dict[key]`. Thanks for pointing this out, it works correctly and very fast. – Alex Tereshenkov Feb 27 '15 at 09:43

As it turned out in your comments, the problem is the last operation, `edgesList.append((id, start, end))`.

This seems to be a data type problem: appending to a big list slows down by design. Take a look here.

Fortunately, you can use a double-ended queue (deque) instead. From the deque documentation: "Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction."

In code, this means you initialize a deque and append to it with high performance:

from collections import deque

edgesList = deque()
for featureId in edges:
    edgeFeatureId = [k for k, v in edges.iteritems() if k == featureId][0]
    edgeStartPoint = [v for k, v in nodes.iteritems() if k == edges[featureId][0]][0]  # start point
    edgeEndPoint = [v for k, v in nodes.iteritems() if k == edges[featureId][1]][0]  # end point
    edgesList.append((edgeFeatureId, edgeStartPoint, edgeEndPoint))

Based on your example data, here is an example that I think might work:

edges = {
    1: [(12.8254, 55.3880), (12.8343, 55.3920)],
    2: [(12.8254, 55.3880), (12.8235, 55.3857)],
    3: [(12.2432, 57.1120), (12.2426, 57.1122)]}
nodes = {
    14: (12.8254, 55.3880),
    15: (12.8340, 55.3883),
    16: (12.8235, 55.3857),
    17: (12.8343, 55.3920)}
reverseNodes = dict((v, k) for k, v in nodes.iteritems())
edgesList = []
for k, v in edges.iteritems():
    edgesList.append(
        (k,
         reverseNodes.get(v[0], -1),
         reverseNodes.get(v[1], -1)))

Maybe there is something I don't understand about how you build the edgesList, but I think this is simpler and faster.

Again, based on your example code, this is what's consuming your CPU time:

edgeFeatureId = [k for k, v in edges.iteritems() if k == featureId][0]
edgeStartPoint = [v for k, v in nodes.iteritems() if k == edges[featureId][0]][0]#start point
edgeEndPoint = [v for k, v in nodes.iteritems() if k == edges[featureId][1]][0]#end point

These lines sit inside a for loop, so for every edge you:

  • Iterate one extra time over the edges dict (to find the edge ID you already have)
  • Iterate twice over the nodes dict to find the start and end points (no longer needed once you have direct lookup via the reverseNodes dict).

So with your data sizes you do roughly 100000 * (100000 + 90000 + 90000) operations, i.e. O(n^2), which is far more than a single pass over the edges (100000, or O(n)).
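For completeness, this also answers the original question about collapsing the three comprehensions into one: with the reverseNodes dict in place, the whole build is a single O(n) pass. A sketch in Python 3 syntax (use `iteritems()` on 2.7):

```python
# Sketch: the whole O(n) build as one list comprehension over the
# inverted nodes dict, using the sample data with integer IDs.
edges = {
    1: [(12.8254, 55.3880), (12.8343, 55.3920)],
    2: [(12.8254, 55.3880), (12.8235, 55.3857)],
}
nodes = {14: (12.8254, 55.3880), 16: (12.8235, 55.3857), 17: (12.8343, 55.3920)}

reverseNodes = {v: k for k, v in nodes.items()}  # coordinates -> nodeID
edgesList = [(k, reverseNodes.get(v[0], -1), reverseNodes.get(v[1], -1))
             for k, v in edges.items()]

print(sorted(edgesList))  # [(1, 14, 17), (2, 14, 16)]
```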
