2

Say I have a pandas data frame where there are time intervals between start and end times, and then a value associated with each interval.

import random
import time
import numpy as np

def random_date(input_dt = None):
    if input_dt is None:
        start = 921032233
    else:
        start = dt.datetime.timestamp(pd.to_datetime(input_dt))
    d = random.randint(start, int(time.time()))
    return dt.datetime.fromtimestamp(d).strftime('%Y-%m-%d %H:%M:%S')

date_ranges = []
for _ in range(200):
    date_range = []
    for i in range(2):
        if i == 0:
            date_range.append(random_date())
        else:
            date_range.append(random_date(date_range[0]))
    date_ranges.append(date_range)

date_ranges_df = pd.DataFrame(date_ranges, columns=['start_dt', 'end_dt'])
date_ranges_df['value'] = np.random.random((date_ranges_df.shape[0], 1))

There's 2 ways I can frame the problem and I would accept either answer.

  1. Obtain the sum of every different overlapping interval. Meaning there should be a sum associated with varying (non-overlapping and sequentially complete) time intervals. i.e. if the overlapping time intervals are unchanged for a period of time, the sum would remain unchanged and have a single value - then when the overlapping intervals changes in any way (removal or addition of a time interval) a new sum would be calculated. This may involve some self-merge on the table.

  2. The other (and maybe easier) way would be to define a standard time interval like 1 hour, and ask what is the sum of all overlapping intervals in this hour segment?

Resulting data frame should have a similar structure with start and end times followed by a value column representing the sum of all values in that interval.

EDIT: to obtain the bounty I would need the solutions for both #1 and #2 methods.

Method 1 Image

conv3d
  • 2,668
  • 6
  • 25
  • 45
  • See this question too which I don't believe was really answered: https://stackoverflow.com/questions/66840175/merge-overlapping-datetime-intervals – conv3d Jul 03 '23 at 00:49

2 Answers2

1
  1. Sum of every different overlapping interval: This is more complex, because we need to detect overlapping periods, and then we need to sum up the values of the overlapping periods.

  2. Sum of all overlapping intervals in a defined hour segment: That would need resampling our data to a regular hourly frequency (process of converting your irregular intervals into a regular hourly frequency and aggregating the 'value' data within these hourly intervals) and then calculating the sum.

Sum of every different overlapping interval

import pandas as pd

# Convert to datetime objects
date_ranges_df['start_dt'] = pd.to_datetime(date_ranges_df['start_dt'])
date_ranges_df['end_dt'] = pd.to_datetime(date_ranges_df['end_dt'])

# Sort by start_dt
date_ranges_df = date_ranges_df.sort_values(by='start_dt')

# Create list of tuples: [(start1, end1, value1), (start2, end2, value2),...]
intervals = list(date_ranges_df.itertuples(index=False, name=None))

# Split intervals into start and end points and sort them
points = sorted([(start, value, 1) for start, _, value in intervals] + [(end, value, -1) for _, end, value in intervals])

result = []
current_value = 0
current_start = points[0][0]

for i, (point, value, change) in enumerate(points):
    if i > 0 and point != points[i-1][0]:
        result.append((current_start, points, current_value))
        current_start = point
    current_value += change * value

# Create a dataframe from result
result_df = pd.DataFrame(result, columns=['start_dt', 'end_dt', 'value'])

That would work by transforming the intervals into individual points, each tagged with a value and a flag indicating whether it is a start point (1) or an end point (-1).

The points are then sorted. As we iterate through the points, when we hit a point that's not equal to the previous one (indicating a new interval segment), we record the previous segment along with the cumulative value up to that point. We then update the current start point and continue adding or subtracting values as we encounter start and end points.

The resulting DataFrame result_df contains non-overlapping segments, along with the sum of the values of the intervals that were active during each segment.

Sum of all overlapping intervals in a defined hour segment

# Convert to datetime objects
date_ranges_df['start_dt'] = pd.to_datetime(date_ranges_df['start_dt'])
date_ranges_df['end_dt'] = pd.to_datetime(date_ranges_df['end_dt'])

# Resample to 1-hour intervals
hourly_intervals = pd.date_range(date_ranges_df['start_dt'].min(), date_ranges_df['end_dt'].max(), freq='H')

hourly_df = pd.DataFrame()
for start in hourly_intervals:
    end = start + timedelta(hours=1)
    # Get intervals that overlap with current hour
    mask = ((date_ranges_df['start_dt'] < end) & (date_ranges_df['end_dt'] > start))
    overlap = date_ranges_df.loc[mask]
    if not overlap.empty:
        # Sum values of overlapping intervals
        total_value = overlap['value'].sum()
        hourly_df = hourly_df.append({'start_dt': start, 'end_dt': end, 'value': total_value}, ignore_index=True)

# Convert column types
hourly_df['start_dt'] = pd.to_datetime(hourly_df['start_dt'])
hourly_df['end_dt'] = pd.to_datetime(hourly_df['end_dt'])

Note: The code assumes the start_dt and end_dt columns are in the right format and the value column contains numerical values.
Also, it might not be the most optimized solution for large datasets due to the nested loops. It might be necessary to optimize the code depending on the size of your data.

That should give for both code a resulting DataFrame with the same structure, including 'start_dt', 'end_dt', and 'value' columns.
'start_dt' and 'end_dt' are the boundaries of each interval and 'value' is the sum of all overlapping intervals within these boundaries.

For the first method, which sums values for every distinct overlapping interval, it would look like:

  start_dt                  end_dt                     value
0 2023-07-04 08:06:02+00:00 2023-07-04 14:12:22+00:00  1.2789
1 2023-07-04 17:02:02+00:00 2023-07-04 23:17:54+00:00  0.8672
2 2021-06-30 00:45:11+00:00 2021-06-30 05:32:20+00:00  1.4563
...

For the second method, which sums values for every hour:

  start_dt                  end_dt                     value
0 2023-07-04 08:00:00+00:00 2023-07-04 09:00:00+00:00  0.7489
1 2023-07-04 09:00:00+00:00 2023-07-04 10:00:00+00:00  0.5321
2 2023-07-04 10:00:00+00:00 2023-07-04 11:00:00+00:00  0.4563
...

Note: The value column contains the sum of all overlapping interval values within the 'start_dt' to 'end_dt' range for each row.

VonC
  • 1,262,500
  • 529
  • 4,410
  • 5,250
  • For method 1 I don't think you can iterate over the same list that you `.pop()` from. Otherwise you get a index out of range error. Is this part a typo: `enumerate(intervals[:])` ? – conv3d Jul 04 '23 at 18:06
  • @conv3d True, I have edited the code: The inner `for` loop with `enumerate()` was the issue, as it tried to iterate over a list while removing elements from it. I prefer creating a copy of the intervals list for the enumeration process. – VonC Jul 04 '23 at 18:15
  • Ok that makes sense. One other thing, I think this solution extends intervals rather than segmenting them in the case where two intervals overlap for only a portion of the time range? Please see the attached image I added to my question. I could be wrong, but maybe need some clarification here – conv3d Jul 04 '23 at 18:21
  • @conv3d Right: the code calculates the sum of values for each group of overlapping intervals as a whole. Do you want to segment the overlapping periods so that each distinct interval is represented separately? (i.e., an interval that only partially overlaps with another would be split into separate segments) – VonC Jul 04 '23 at 18:25
  • Yea, maybe I should've made the image the first time. Hard to describe in words. But yes that was my idea. I think the image defines it more accurately. – conv3d Jul 04 '23 at 18:27
  • @conv3d OK. I have quickly rewritten method 1 (not fully tested). Let me know if it is closer to what you are looking for. – VonC Jul 04 '23 at 18:43
  • Interesting! I think this is right, just going to double check. Only change rn is maybe `result.append((current_start, points[i-1][0], current_value))` should be `result.append((current_start, points[i][0], current_value))` I think? – conv3d Jul 04 '23 at 20:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254372/discussion-between-vonc-and-conv3d). – VonC Jul 04 '23 at 20:38
1

Second way: As I am a new user I am only able to create an answer. This answer therefore builds heaviliy on @VonCs answer.

As far as I have tested it, currently intervals which have no value to them get lost in the process and overlapping intervals do not get added up. To adress the first issue, I added the else condition which adds a zero if no values are present within the time interval. To adress the second issue, I added .sum(), which also converts total_value to float. Furthermore, the usage of .append has been avoided.

# Convert to datetime objects
date_ranges_df['start_dt'] = pd.to_datetime(date_ranges_df['start_dt'])
date_ranges_df['end_dt'] = pd.to_datetime(date_ranges_df['end_dt'])

# Resample to 1-hour intervals
hourly_intervals = pd.date_range(date_ranges_df['start_dt'].min(), 
date_ranges_df['end_dt'].max(), freq='H')

hourly_df = pd.DataFrame()
for start in hourly_intervals:
    end = start + timedelta(hours=1)
    # Get intervals that overlap with current hour
    mask = ((date_ranges_df['start_dt'] < end) & (date_ranges_df['end_dt'] > start))
    overlap = date_ranges_df.loc[mask]
    if not overlap.empty:
        # Sum values of overlapping intervals
        total_value = overlap['value'].sum()
        new_entry = pd.Series({'start': start, 'end': end, 'value': total_value})
        hourly_df = pd.concat([hourly_df, new_entry.to_frame().T], ignore_index=True)
    else:
        new_entry = pd.Series({'start': start, 'end': end, 'value': 0})
        hourly_df = pd.concat([hourly_df, new_entry.to_frame().T], ignore_index=True)

# Convert column types
hourly_df['start_dt'] = pd.to_datetime(hourly_df['start_dt'])
hourly_df['end_dt'] = pd.to_datetime(hourly_df['end_dt'])
Tim
  • 11
  • 2