7

I was trying to improve the performance of the func function and I found that a simple change in how the aX list is generated improves the performance quite a bit:

import timeit
import numpy as np

def func(a, b):
    return [_ for _ in a if _ not in b]

Na, Nb = 10000, 5000
b = list(np.random.randint(1000, size=Nb))

# Ordered list of Na integers
a1 = [_ for _ in range(Na)]
# Random list of Na integers
a2 = list(np.random.randint(Na, size=Na))
# Ordered list of Na integers generated with numpy
a3 = list(np.arange(Na))

start_time = timeit.default_timer()
ab1 = func(a1, b)
abt1 = timeit.default_timer() - start_time
print("Time ab1", abt1)

start_time = timeit.default_timer()
ab2 = func(a2, b)
abt2 = timeit.default_timer() - start_time
print("Time ab2", abt2)

start_time = timeit.default_timer()
ab3 = func(a3, b)
abt3 = timeit.default_timer() - start_time
print("Time ab3", abt3)

print("Ratio 1/2:", abt1 / abt2)
print("Ratio 1/3:", abt1 / abt3)

In Python 2.7.13 this results in:

('Time ab1', 5.296088933944702)
('Time ab2', 1.5520200729370117)
('Time ab3', 1.5581469535827637)
('Ratio 1/2:', 3.412384302428827)
('Ratio 1/3:', 3.3989662667998095)

In Python 3.5.2 the difference is even larger:

Time ab1 6.758207322000089
Time ab2 1.5693355060011527
Time ab3 1.5148192759988888
Ratio 1/2: 4.306413317073784
Ratio 1/3: 4.461395117608107

I need to process an ordered list integers (i.e: a1 or a3), so my question is:

Why is the random list processed so much faster than the ordered list not generated with numpy?

Gabriel
  • 40,504
  • 73
  • 230
  • 404
  • This might be a silly question but can you not *re-order* or *order* the `list` **after** you are done processing it? – Ma0 Jun 01 '17 at 15:22
  • 2
    Is this a fair test? The max value in the list `a1` will be 10000 (the length of the list) where as the max value in the list `a2` will be 1000 as it would be a random number between 0 and 1000 thus replacing `a1 = [_ for _ in range(Na)]` with `a1 = [_//10 for _ in range(Na)]` gives a ratio of 4.6 still unsure why its faster. Or perhaps I'm misunderstanding this. – Alessi 42 Jun 01 '17 at 15:33
  • @Alessi42 makes a valid point. I'll edit the question to fix this difference. Thank you! – Gabriel Jun 01 '17 at 15:46
  • 1
    Since the ordered list generated with `numpy` (`a3`) is processed as fast as the random list (`a2`), this appears to be simply a matter of "`numpy` is just faster". – Gabriel Jun 01 '17 at 15:59

2 Answers2

7

Your b, a2, and a3 lists are lists of NumPy scalars, while your a1 list is a list of ordinary Python ints. Comparing NumPy scalars to ordinary Python scalars requires a lot of extra type checking and coercion, so the func(a1, b) test, which needs to compare NumPy scalars to ordinary Python scalars, performs slowest.

If you make b a list of Python ints (by calling the tolist method instead of the list function), the time difference is reversed.

You may want to consider using Python sets or NumPy's set-like operations to perform your task.

user2357112
  • 260,549
  • 28
  • 431
  • 505
1

As discussed here numpy arrays are much faster than python lists. This is why the numpy arrays seem faster as you are still using a numpy array when you call the list() function.

Using the numpy .tolist() function converts a NumPy array to regular Python objects all the way down (as user2357112 pointed out) and the performance differences disappear, see:

import timeit
import numpy as np

def func(a, b):
    return [_ for _ in a if _ not in b]

Na, Nb = 10000, 5000
b = list(np.random.randint(Na, size=Nb)) # len: 5000, max: 9999

# Ordered list of Na integers
a1 = [_ for _ in range(Na)] # len: 10000, max: 9999
# Random list of Na integers
a2 = np.random.randint(Na, size=Na).tolist() # len: 10000, max: 9999
# Ordered list of Na integers generated with numpy
a3 = np.arange(Na).tolist()

start_time = timeit.default_timer()
ab1 = func(a1, b)
abt1 = timeit.default_timer() - start_time
print("Time ab1", abt1)

start_time = timeit.default_timer()
ab2 = func(a2, b)
abt2 = timeit.default_timer() - start_time
print("Time ab2", abt2)

start_time = timeit.default_timer()
ab3 = func(a3, b)
abt3 = timeit.default_timer() - start_time
print("Time ab3", abt3)

print("Ratio 1/2:", abt1 / abt2)
print("Ratio 1/3:", abt1 / abt3)

#Time ab1 4.622085004015502
#Time ab2 4.598610720638726
#Time ab3 4.63976530848255
#Ratio 1/2: 1.005104646773301
#Ratio 1/3: 0.9961893968139456

Hopefully this answers your first question!

Gabriel
  • 40,504
  • 73
  • 230
  • 404
Alessi 42
  • 1,112
  • 11
  • 26
  • 3
    can you give a source for `This is why the numpy arrays seem faster as you are still using a numpy array when you call the list() function`? – Maarten Fabré Jun 01 '17 at 16:16
  • I have now included the docs for `ndarray.tolist()` but I can't seem to find a reference as to why `list()` doesn't do the same I assumed that if they added a function for it, it must not do it by default – Alessi 42 Jun 01 '17 at 16:19
  • `list` does build a list. `tolist` converts a NumPy array to regular Python objects all the way down, building a nested list for multidimensional arrays and converting all scalars to ordinary Python scalars. – user2357112 Jun 01 '17 at 16:21