
I have a dataset of a thousand 128-dimensional features, with a shape of e.g. (1000,128).

I want to find the sorted nearest neighbors of a 128-dimensional feature of shape (128,1).

The distance is calculated via a matrix multiplication between the dataset (1000,128) and the feature (128,1), which gives an array of similarities of shape (1000,1):

DATASET (1000,128) x FEATURE (128,1) = SIMILARITIES (1000,1)

This is done via:

# features.shape=(1000,128) ; feature.shape=(128,1) ; similarities.shape=(1000,1)
similarities = features.dot(feature)

After calculating the distance (similarities), I'm finding the nearest neighbors using the code below:

# The n Nearest Neighbors Indexes (But Not Sorted)
nearest_neighbours_indexes_unsorted = np.argpartition(similarities, kth=-n)[-n:]

# The n Nearest Neighbors (But Not Sorted)
nearest_neighbours_similarities_unsorted = similarities[nearest_neighbours_indexes_unsorted]

# The Indexes of n Nearest Neighbors Sorted
nearest_neighbours_indexes_sorted = np.flip(nearest_neighbours_indexes_unsorted[np.argsort(nearest_neighbours_similarities_unsorted)], axis=0)
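For reference, the three steps above can be folded into a single helper. `top_n_sorted` is a name I'm introducing here, not from the question; it also flattens the input first, so a (1000,1) column and a (1000,) vector are handled the same way:

```python
import numpy as np

def top_n_sorted(similarities, n):
    """Indices of the n largest similarities, sorted descending.

    similarities: 1D array, or an (N, 1) column (flattened first).
    """
    sims = similarities.ravel()
    # argpartition puts the n largest (unordered) in the last n positions
    idx_unsorted = np.argpartition(sims, kth=-n)[-n:]
    # Order those n candidates, then reverse for descending similarity
    order = np.argsort(sims[idx_unsorted])[::-1]
    return idx_unsorted[order]
```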

This code runs very fast, even for millions of data points (I'd welcome any tip to make it faster). But I want to be able to find the nearest neighbors of more than one feature in one go:

DATASET (1000,128) x FEATURE (128,n) = SIMILARITIES (1000,n)

One way is to run the above code for each feature in a loop (which is slow); the other is to change the code to accommodate multidimensional indexing, and that's where I'm stuck: I don't know how to write the above code for features of shape (128,n) instead of (128,1).
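The batched similarity computation itself is a single matmul. A minimal sketch with the shapes from the question (random data standing in for the real dataset, and n=3 query features as columns):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 128))   # dataset
queries = rng.standard_normal((128, 3))       # n=3 query features, one per column

# One matmul produces all similarity columns at once:
# (1000, 128) @ (128, 3) -> (1000, 3)
similarities = features @ queries
print(similarities.shape)  # (1000, 3)
```

Column `similarities[:, j]` is exactly what `features.dot(feature)` would give for query `j` alone; the remaining work is the per-column top-n selection.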


1 Answer


Helper functions to get the largest or smallest n indices and elements along an axis

Here's a helper function that selects the indices of the n largest elements along a generic axis of a generic ndarray, making use of np.argpartition and np.take_along_axis -

def take_largest_indices_along_axis(ar, n, axis):
    # Slice that keeps the last n entries along the requested axis
    s = ar.ndim * [slice(None)]
    s[axis] = slice(-n, None)
    # argpartition places the n largest (unordered) in the last n positions
    idx = np.argpartition(ar, kth=-n, axis=axis)[tuple(s)]
    # Sort those n candidates, then flip for descending order
    sidx = np.take_along_axis(ar, idx, axis=axis).argsort(axis=axis)
    return np.flip(np.take_along_axis(idx, sidx, axis=axis), axis=axis)

Extending this to get n-smallest indices -

def take_smallest_indices_along_axis(ar, n, axis):
    # Slice that keeps the first n entries along the requested axis
    s = ar.ndim * [slice(None)]
    s[axis] = slice(None, n)
    # argpartition places the n smallest (unordered) in the first n positions
    idx = np.argpartition(ar, kth=n, axis=axis)[tuple(s)]
    # Sort those n candidates into ascending order
    sidx = np.take_along_axis(ar, idx, axis=axis).argsort(axis=axis)
    return np.take_along_axis(idx, sidx, axis=axis)

Extending these to select the n largest or smallest elements themselves is a simple application of np.take_along_axis, as listed next -

def take_largest_along_axis(ar, n, axis):
    idx = take_largest_indices_along_axis(ar, n, axis)
    return np.take_along_axis(ar, idx, axis=axis)

def take_smallest_along_axis(ar, n, axis):
    idx = take_smallest_indices_along_axis(ar, n, axis)
    return np.take_along_axis(ar, idx, axis=axis)
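One quick way to sanity-check these helpers (my own check, not part of the answer) is to compare against a full descending sort; the helpers are repeated here so the snippet runs standalone:

```python
import numpy as np

def take_largest_indices_along_axis(ar, n, axis):
    s = ar.ndim * [slice(None)]
    s[axis] = slice(-n, None)
    idx = np.argpartition(ar, kth=-n, axis=axis)[tuple(s)]
    sidx = np.take_along_axis(ar, idx, axis=axis).argsort(axis=axis)
    return np.flip(np.take_along_axis(idx, sidx, axis=axis), axis=axis)

def take_largest_along_axis(ar, n, axis):
    idx = take_largest_indices_along_axis(ar, n, axis)
    return np.take_along_axis(ar, idx, axis=axis)

rng = np.random.default_rng(0)
ar = rng.standard_normal((50, 40))
for axis in (0, 1):
    got = take_largest_along_axis(ar, n=5, axis=axis)
    # Reference: full sort descending, then the top 5 along the same axis
    full = np.flip(np.sort(ar, axis=axis), axis=axis)
    want = np.take(full, range(5), axis=axis)
    assert np.array_equal(got, want)
```

The partition-based version does the same selection without sorting the whole array, which is what makes it cheaper than a full `np.sort` for small n.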

Sample runs

# Sample setup
In [200]: np.random.seed(0)
     ...: ar = np.random.randint(0,99,(5,5))

In [201]: ar
Out[201]: 
array([[44, 47, 64, 67, 67],
       [ 9, 83, 21, 36, 87],
       [70, 88, 88, 12, 58],
       [65, 39, 87, 46, 88],
       [81, 37, 25, 77, 72]])

Take largest n indices, elements along axis -

In [202]: take_largest_indices_along_axis(ar, n=2, axis=0)
Out[202]: 
array([[4, 2, 2, 4, 3],
       [2, 1, 3, 0, 1]])

In [203]: take_largest_indices_along_axis(ar, n=2, axis=1)
Out[203]: 
array([[4, 3],
       [4, 1],
       [2, 1],
       [4, 2],
       [0, 3]])

In [251]: take_largest_along_axis(ar, n=2, axis=0)
Out[251]: 
array([[81, 88, 88, 77, 88],
       [70, 83, 87, 67, 87]])

In [252]: take_largest_along_axis(ar, n=2, axis=1)
Out[252]: 
array([[67, 67],
       [87, 83],
       [88, 88],
       [88, 87],
       [81, 77]])

Take smallest n indices, elements along axis -

In [232]: take_smallest_indices_along_axis(ar, n=2, axis=0)
Out[232]: 
array([[1, 4, 1, 2, 2],
       [0, 3, 4, 1, 0]])

In [233]: take_smallest_indices_along_axis(ar, n=2, axis=1)
Out[233]: 
array([[0, 1],
       [0, 2],
       [3, 4],
       [1, 3],
       [2, 1]])

In [253]: take_smallest_along_axis(ar, n=2, axis=0)
Out[253]: 
array([[ 9, 37, 21, 12, 58],
       [44, 39, 25, 36, 67]])

In [254]: take_smallest_along_axis(ar, n=2, axis=1)
Out[254]: 
array([[44, 47],
       [ 9, 21],
       [12, 58],
       [39, 46],
       [25, 37]])

Solving our case here

For our case, assume the input is similarities of shape (1000,128), representing 1000 data points and 128 features, and that we want the largest, say, n=10 features for each of those data points. Then it would be -

take_largest_indices_along_axis(similarities, n=10, axis=1) # indices
take_largest_along_axis(similarities, n=10, axis=1) # elements

The final indices/values array would be of shape (1000, n).

Sample run with the given dataset shape -

In [257]: np.random.seed(0)
     ...: similarities = np.random.randint(0,99,(1000,128))

In [263]: take_largest_indices_along_axis(similarities, n=10, axis=1).shape
Out[263]: (1000, 10)

In [264]: take_largest_along_axis(similarities, n=10, axis=1).shape
Out[264]: (1000, 10)

If instead you were looking to get the n largest data points for each of those features, i.e. a final indices/values array of shape (n, 128), then use axis=0.
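Mapping this back to the question's original setup (dataset of shape (1000,128), a batch of query columns of shape (128,n)): the similarities come out as (1000,n), and the nearest dataset rows per query lie along axis=0. A sketch, repeating the indices helper so it runs standalone:

```python
import numpy as np

def take_largest_indices_along_axis(ar, n, axis):
    s = ar.ndim * [slice(None)]
    s[axis] = slice(-n, None)
    idx = np.argpartition(ar, kth=-n, axis=axis)[tuple(s)]
    sidx = np.take_along_axis(ar, idx, axis=axis).argsort(axis=axis)
    return np.flip(np.take_along_axis(idx, sidx, axis=axis), axis=axis)

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 128))  # dataset
queries = rng.standard_normal((128, 3))      # 3 query features as columns

similarities = features @ queries            # (1000, 3)
# Top-5 dataset rows per query column, sorted by descending similarity
top_idx = take_largest_indices_along_axis(similarities, n=5, axis=0)
print(top_idx.shape)  # (5, 3)
# Matching similarity values, via take_along_axis (plain fancy indexing
# with a 2D index array would give an unwanted (5, 3, 3) result here)
top_sims = np.take_along_axis(similarities, top_idx, axis=0)
```

This is the (5,3)-vs-(5,3,3) shape issue raised in the comments below: indexing with `similarities[top_idx]` treats the 2D index array as row indices and broadcasts, whereas `np.take_along_axis` pairs each index with its own column.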

  • @Cypher It was stated - `But I want to be able to find the nearest neighbors of more than one feature in one go:`. So, I assumed you have an array `features`, which is of shape `(1000,128)` with 128 features and you want to be able to get largest indices for each of those 128 features and so, if you want `n` largest indices would give a `(n,128)` indices array. So, the idea is whichever is your dataset, use it with axis value being the axis other than the axis along which you have the features. Does that make sense? – Divakar May 05 '19 at 16:35
  • sorry, I had to edit my comment: thanks for the answer. But where is the `similarities` array used? your answer is returning a (10,128) array but I expect a (1000,n) answer where n is like 5 (1000 similarities for 5 images) not like 128 (which is the dimension of each feature). – Cypher May 05 '19 at 16:36
  • @Cypher Try : `take_largest_indices_along_axis(similarities, n, axis=1)`? – Divakar May 05 '19 at 16:36
  • can this be done with NumPy < 1.15? I have other dependencies that prevent me from even checking your proposed code. – Cypher May 05 '19 at 17:40
  • @Cypher See if `take_along_axis` from https://stackoverflow.com/a/47048882/ works out in place of `np.take_along_axis`? – Divakar May 05 '19 at 17:42
  • When I calculate `similarities[take_largest_indices_along_axis(similarities, n=5, axis=0)]` for 3 features (I have 3 features and want the 5 nearest features in the `similarities` array), it gives me an array with (5,3,3) shape when I'm looking for a (5,3) array (5 similarities for 3 images). Here's an example code: `similarities = np.random.uniform(low=0.0, high=1.0, size=(1000, 3))` `print(similarities[take_largest_indices_along_axis(similarities, n=5, axis=0)].shape)` – Cypher May 05 '19 at 18:17
  • @Cypher Please read the edited `Solving our case here` section. – Divakar May 05 '19 at 18:50
  • Thank you so much for the detailed help and explanation. Took me a while to completely implement it in my code and I hope you forgive me for coming back to you late. it works alright for now! Thanks again. – Cypher May 06 '19 at 13:24
  • @Divakar Thank you for a great explanation! – kmario23 May 06 '19 at 14:02