1

New at pandas, having an issue relating two DataFrames together based on the function of two columns in each DataFrame.

I have two DataFrames - one representing road segments, and another with points somewhere near that road.

Road Shape DataFrame

         shape_pt_lat  shape_pt_lon  shape_pt_sequence
2583910     53.402329     -6.150988                  1
2583911     53.402334     -6.151043                  2
2583912     53.402345     -6.151175                  3
2583913     53.402359     -6.151328                  4
2583914     53.402518     -6.152953                  5
...               ...           ...                ...

Points DataFrame

     latitude  longitude     timestamp
0   53.376873  -6.216212  1.686826e+09
1   53.370968  -6.223517  1.686827e+09
2   53.363358  -6.234719  1.686827e+09
3   53.360840  -6.238742  1.686827e+09
4   53.355160  -6.246171  1.686827e+09
..        ...        ...           ...

I want to add a column to the Points DataFrame that contains the index of the Road DataFrame row containing the closest point.

I can do this by iterating through the Points DataFrame, and for each row then iterating through the Road DataFrame, calculating each distance, and finding a minimum, but this seems very inefficient. Is there a better way to achieve this?

EDIT

I tried to adapt the approach suggested, it works but it ends up taking about 4x as long as simply doing all the iterations. Is there any way to make this faster?

shape_tup = [tuple(r) for r in shape_df[['shape_pt_lat', 'shape_pt_lon']].to_numpy()]

pos_points["shape_pt_ind"] = np.nan

for index, pos in pos_points.iterrows():
    min_pair = min(shape_tup, key=lambda t: (abs(t[0] - pos["latitude"]) + abs(t[1] - pos["longitude"])))
    min_index = shape_df.index[(shape_df[['shape_pt_lat', 'shape_pt_lon']].values[:, None] == min_pair).all(2).any(1)]
    pos_points.loc[index, "shape_pt_ind"] = min_index[0]
M B
  • 13
  • 4
  • Would this help? Although it doesn't involve DataFrames or coordinates, it could apply to your problem with a few adjustments. https://stackoverflow.com/questions/27239786/python-how-to-find-two-equal-closest-values-between-two-separate-arrays – luanpo1234 Jul 02 '23 at 17:38
  • Hi, thanks, this did let me do this but unfortunately it takes a lot longer to run than simply doing all the iterations involved in the most basic approach. I've added my code in an edit, could be an issue arising from how I tried to adapt this suggested code to my situation of checking both latitude and longitude instead of a single pair. – M B Jul 02 '23 at 20:33
  • I looked into doing something similar to this a while ago. The issue with just looping is that the number grows quite a lot. One way of doing it would be use to KNN – Mark Jul 03 '23 at 03:01

1 Answers1

0

The simplest approach is to use the cdist function from scipy. It calculates the distance between every point in points and every point in road. You have the choice of 20 or so distance functions.

After that it's trivial to find the closest point:

from scipy.spatial.distance import cdist

distance = cdist(
    points[["latitude", "longitude"]],
    road[["shape_pt_lat", "shape_pt_lon"]],
    metric="euclidean"
)

points["shape_pt_sequence"] = road["shape_pt_sequence"].to_numpy()[distance.argmin(axis=1)]

If your coordinates are far apart, the geodesic distance is more appropriate:

from scipy.spatial.distance import cdist
from geopy.distance import geodesic

distance = cdist(
    points[["latitude", "longitude"]],
    road[["shape_pt_lat", "shape_pt_lon"]],
    metric=lambda a, b: geodesic(a, b).kilometers,
)

points["shape_pt_sequence"] = road["shape_pt_sequence"].to_numpy()[distance.argmin(axis=1)]
Code Different
  • 90,614
  • 16
  • 144
  • 163
  • Brilliant, I used Euclidean distance at the points are fairly close. This works very well and is by far the faster, thanks! – M B Jul 03 '23 at 16:02