I have a df with multiple datetime intervals (start time, end time) and values.
input:
id start end value
1 08:00:00.000 12:00:00.000 5
2 09:00:00.000 10:00:00.000 6
2 10:00:00.000 14:00:00.000 4
1 12:00:00.000 15:00:00.000 3
expected output:
id start end value
1 08:00:00.000 09:00:00.000 5
2 09:00:00.000 10:00:00.000 6
1 10:00:00.000 12:00:00.000 5
2 12:00:00.000 14:00:00.000 4
1 14:00:00.000 15:00:00.000 3
There is overlap between some of them. The objective is to have a succession of interval without overlapping.
When there is overlapping, I want to keep the interval with the highest value.
I coded a thing that loops on the df to find the overlapping intervals, create a new succession of intervals based on the condition and remove the old ones. I would like to find an alternative way, better optimized. Maybe with split of all intervals at the intersections and after, loop on the df and remove the intervals that overlap based on condition.
done = True
while done:
done = False
df_copy = df
for i, row in df.iterrows():
row_interval = pd.Interval(row.start, row.end)
if done:
break
for j, row_copy in row_copy.iterrows():
row_copy_interval = pd.Interval(row_copy.start, row_copy.end)
if i is not j and row_interval.overlaps(row_copy_interval):
earliest_start = np.minimum(row.start, row_copy.start)
latest_start = np.maximum(row.start, row_copy.start)
earliest_end = np.minimum(row.end, row_copy.end)
latest_end = np.maximum(row.end, row_copy.end)
if row.value > row_copy.value:
value = row.value
else:
value = row_copy.value
if row_interval == pd.Interval(earliest_start, earliest_end):
df = df.append('value': row.value, 'start': earliest_start,'end': latest_start}, ignore_index=True)
df = df.append('value': value, 'start': latest_start,'end': earliest_end}, ignore_index=True)
df = df.append('value': row_copy.value, 'start': earliest_end,'end': latest_end}, ignore_index=True)
elif row_interval == pd.Interval(earliest_start, latest_end):
...
elif row_interval == pd.Interval(latest_start, latest_end):
...
elif row_interval == pd.Interval(latest_start, earliest_end):
...
df = df.drop([i, j]).drop_duplicates()
done = True
break