Is there a row in my dataframe within N seconds of a given row?

Question

population = pd.date_range(periods=1000,start='2015-01-01',freq='s')
timestamps = np.random.choice(population,size=16,replace=False)
s = pd.Series(timestamps)

Now, I want to create a boolean column indicating for each row of s, whether there is another row within the next N seconds.

I could use itertools, but that is not pandonic and does not generalize well in cases when I want to do this on a DataFrame. A related problem I would want to solve:

df = pd.DataFrame(
        {'id_col': ['a','b','c','d']*4, 
         'timestamp' : timestamps}
     )

For each row in this dataframe, is there a row with the same id_col value that happened within the next N seconds?

I could always write a function that takes one row and checks the whole dataframe, but it strikes me there may be a better way to do it. Thoughts?

maxymoo · Answer 1 · 2015-12-16T04:11:50.430

I think this is a question about the time-delta between consecutive values (after sorting by timestamp):

N = '30s'
s_sorted = s.sort_values()
deltas =  s_sorted -  s_sorted.shift()
flag = (deltas < N) | (deltas.shift(-1) < N)
pd.DataFrame({'s':s_sorted, 'flag':flag})

     flag                   s
7    True 2015-01-01 00:00:04
0    True 2015-01-01 00:00:32
8    True 2015-01-01 00:00:50
4    True 2015-01-01 00:00:58
5    True 2015-01-01 00:01:42
2    True 2015-01-01 00:01:57
1    True 2015-01-01 00:02:15
12  False 2015-01-01 00:03:52
14   True 2015-01-01 00:05:32
6    True 2015-01-01 00:05:57
11   True 2015-01-01 00:08:24
3    True 2015-01-01 00:08:36
13  False 2015-01-01 00:10:07
9    True 2015-01-01 00:13:53
15   True 2015-01-01 00:14:20
10  False 2015-01-01 00:15:14

For your second question, you can just groupby the id column then use the same technique.

you know, as I asked the question you're right. I forgot one part of the question. Will edit it. Sorry for the mixup — exp1orer, Dec 16 '15 at 17:11
cool can you see how to modify me answer to get you what you want? hint: try and undrerstand what the `flag` line is doing. — maxymoo, Dec 17 '15 at 00:17

score 0 · Answer 2 · edited May 23 '17 at 12:23

This method computes a leading window count, where the window is a number of seconds into the future from each event. The count includes the data point, so is always at least 1. If three events occur in less than lead_window_s seconds, then the count is 3.

from datetime import timedelta

def lead_count(s, lead_in_s):
  ssort = s.sort_index()
  lead = ssort.index + timedelta(seconds=lead_in_s)
  inds = np.searchsorted(ssort.index.astype(np.int64), lead.astype(np.int64)) - 1
  cs = ssort.cumsum()
  return pd.Series(cs[inds].values - cs.values + ssort[inds].values, index=ssort.index)

Your example code randomly generates events (out of chronological order). I'm augmenting the timestamps with a count of 1, so we can sum up the number of events that occur within the window of time, then adding an index on the timestamps.

>>> s = pd.Series([1]*len(timestamps), index=timestamps)
>>> s
2015-01-01 00:00:26    1
2015-01-01 00:05:15    1
2015-01-01 00:13:57    1
2015-01-01 00:10:45    1
2015-01-01 00:05:46    1
2015-01-01 00:00:01    1
2015-01-01 00:15:00    1
2015-01-01 00:13:12    1
2015-01-01 00:16:23    1
2015-01-01 00:13:18    1
2015-01-01 00:07:56    1
2015-01-01 00:00:47    1
2015-01-01 00:04:23    1
2015-01-01 00:02:58    1
2015-01-01 00:03:24    1
2015-01-01 00:11:34    1
dtype: int64

Then, if you call lead_count with a 30-second window

>>> lead_30s = lead_count(s, 30)
>>> df = pd.DataFrame({'s': s, 's_lead30s': lead_30s})
>>> print df.sort_index()
                     s  s_lead30s
2015-01-01 00:00:01  1          2
2015-01-01 00:00:26  1          2
2015-01-01 00:00:47  1          1
2015-01-01 00:02:58  1          2
2015-01-01 00:03:24  1          1
2015-01-01 00:04:23  1          1
2015-01-01 00:05:15  1          1
2015-01-01 00:05:46  1          1
2015-01-01 00:07:56  1          1
2015-01-01 00:10:45  1          1
2015-01-01 00:11:34  1          1
2015-01-01 00:13:12  1          2
2015-01-01 00:13:18  1          1
2015-01-01 00:13:57  1          1
2015-01-01 00:15:00  1          1
2015-01-01 00:16:23  1          1

This is modified from this answer, which uses the same binary search method for inserted values to find a rolling cumulative sum, but it looks into the past (lag) rather than the future (lead).

Is there a row in my dataframe within N seconds of a given row?

2 Answers2