New at pandas, having an issue relating two DataFrames together based on the function of two columns in each DataFrame.
I have two DataFrames - one representing road segments, and another with points somewhere near that road.
Road Shape DataFrame
shape_pt_lat shape_pt_lon shape_pt_sequence
2583910 53.402329 -6.150988 1
2583911 53.402334 -6.151043 2
2583912 53.402345 -6.151175 3
2583913 53.402359 -6.151328 4
2583914 53.402518 -6.152953 5
... ... ... ...
Points DataFrame
latitude longitude timestamp
0 53.376873 -6.216212 1.686826e+09
1 53.370968 -6.223517 1.686827e+09
2 53.363358 -6.234719 1.686827e+09
3 53.360840 -6.238742 1.686827e+09
4 53.355160 -6.246171 1.686827e+09
.. ... ... ...
I want to add a column to the Points DataFrame that contains the index of the Road DataFrame row containing the closest point.
I can do this by iterating through the Points DataFrame, and for each row then iterating through the Road DataFrame, calculating each distance, and finding a minimum, but this seems very inefficient. Is there a better way to achieve this?
EDIT
I tried to adapt the approach suggested, it works but it ends up taking about 4x as long as simply doing all the iterations. Is there any way to make this faster?
shape_tup = [tuple(r) for r in shape_df[['shape_pt_lat', 'shape_pt_lon']].to_numpy()]
pos_points["shape_pt_ind"] = np.nan
for index, pos in pos_points.iterrows():
min_pair = min(shape_tup, key=lambda t: (abs(t[0] - pos["latitude"]) + abs(t[1] - pos["longitude"])))
min_index = shape_df.index[(shape_df[['shape_pt_lat', 'shape_pt_lon']].values[:, None] == min_pair).all(2).any(1)]
pos_points.loc[index, "shape_pt_ind"] = min_index[0]