7

I have a list of events that occur at mS accurate intervals, that spans a few days. I want to cluster all the events that occur in a 'per-n-minutes' slot (can be twenty events, can be no events). I have a datetime.datetime item for each event, so I can get datetime.datetime.minute without any trouble.

My list of events is sorted in time order, earliest first, latest last. The list is complete for the time period I am working on.

The idea being that I can change list:-

[[a],[b],[c],[d],[e],[f],[g],[h],[i]...]

where a, b, c, occur between mins 0 and 29, d,e,f,g occur between mins 30 and 59, nothing between 0 and 29 (next hour), h, i between 30 and 59 ...

into a new list:-

[[[a],[b],[c]],[[d],[e],[f],[g]],[],[[h],[i]]...]

I'm not sure how to build an iterator that loops through the two time slots until the time series list ends. Anything I can think of using xrange stops once it completes, so I wondered if there was a way of using `while' to do the slicing?

I also will be using a smaller timeslot, probably 5 mins, I used 30mins as a shorter example for demonstration.

(for context, I'm making a geo plotted time based view of the recent quakes in New Zealand. and want to show all the quakes that occurs in a small block of time in one step to speed up the replay)

Ron Klein
  • 9,178
  • 9
  • 55
  • 88
Jay Gattuso
  • 3,890
  • 12
  • 37
  • 51
  • 1
    If they are in order, you can use itertools.groupby. You just need to write a key function to collapse each 10 minutes to a single value – John La Rooy Jul 25 '13 at 07:25
  • They are indeed in order, sorted by the `datetime.datetime` value. I'll look at `itertools.groupby` - will it group all events that occur in the 'per-n-minutes' block, or will it serialise as per the example? – Jay Gattuso Jul 25 '13 at 07:28
  • I initially thought groupby would work, but the possibility of empty timeslots makes things messier. – user2357112 Jul 25 '13 at 07:28

6 Answers6

10
# create sample data
from datetime import datetime, timedelta
d = datetime.now()
data = [d + timedelta(minutes=i) for i in xrange(100)]

# prepare and group the data
from itertools import groupby

def get_key(d):
    # group by 30 minutes
    k = d + timedelta(minutes=-(d.minute % 30)) 
    return datetime(k.year, k.month, k.day, k.hour, k.minute, 0)

g = groupby(sorted(data), key=get_key)

# print data
for key, items in g:
    print key
    for item in items:
        print '-', item

This is a python translation of this answer, which works by rounding the datetime to the next boundary and use that for grouping.


If you really need the possible empty groups, you can just add them by using this or a similar method:

def add_missing_empty_frames(g):
    last_key = None
    for key, items in g:
        if last_key:
            while (key-last_key).seconds > 30*60:
                empty_key = last_key + timedelta(minutes=30)
                yield (empty_key, [])
                last_key = empty_key
        yield (key, items)
        last_key = key

for key, items in add_missing_empty_frames(g):
    ...
Community
  • 1
  • 1
sloth
  • 99,095
  • 21
  • 171
  • 219
  • Thats awesome! I can see it works by running it, and I'm trying to get my head around how you put that together. It does exactly what I want. Now I just need to absorb!... Thank you. – Jay Gattuso Jul 25 '13 at 07:49
  • Are you sure that actually works? I think it fails on empty time intervals. – user2357112 Jul 25 '13 at 07:55
  • Ahh.. true enough, if I substute the timedelta line for `timedelta(hours=i)` - giving a series that should have lots of empty slots, it doesn't have an empty slot in the result. – Jay Gattuso Jul 25 '13 at 08:17
  • @user2357112 *fails on empty time intervals* If you mean that there won't be any empty groups, then yes, empty groups will not be created. But if you really need the empty groups, its trivial to add them using a loop. – sloth Jul 25 '13 at 08:20
  • @user2357112 I've added a simple method to add the possible missing, empty time frames. – sloth Jul 25 '13 at 08:30
  • Ah, awesome, thanks. I only just finished building in the first suggestion. I'll add the empty frames too, they are important for maintaining temporal consistency. Thanks again for your help. I'm appreciative. – Jay Gattuso Jul 25 '13 at 09:06
2

Consider the following

def time_in_range(t,t_min,delta_t):
    if t<=t_min+delta_t and t>=t_min:
         return True
    else:
         return False
def group_list(input_list,ref_time,time_dx,result=[]):
    result.append([])
    for i,item in enumerate(input_list):
        if time_in_range(item,ref_time,time_dx):
            result[-1].append(item)
        else:
            return group_list(input_list[i:],ref_time+time_dx,time_dx,result=result)
def test():
    input_list = [1,2,3,4,5,8,10,20,30]
    print group_list(input_list,0,5)
test()
# Ouput:
# [[1, 2, 3, 4, 5], [8, 10], [], [20], [], [30]]

where you will need to write your own time_in_range function.

esmit
  • 1,748
  • 14
  • 27
1

If you have the whole list, you can just loop over it and stick each event in the right timeslot directly:

grouped = [[] for _ in xrange(whatever)]
for event in events:
    grouped[timeslot_of(event)].append(event)

If you need to turn an iterable of events into a grouped iterable, things get a bit messier. itertools.groupby almost works, but it skips time intervals with no events in them.

user2357112
  • 260,549
  • 28
  • 431
  • 505
1

Assuming that the events are available in a chronologically ordered list called events, having a datetime attribute called timestamp:

interval = 10    # min
period = 2*24*60 # two days in minutes
timeslots = [[] for slot in range(period/interval)]
for e in events:
    index = int((e.timestamp-events[0].timestamp).total_seconds()/60) / interval
    timeslots[index].append(e)

This uses the first event as t=0 on the timeline. If that's not what you want, just substitute events[0].timestamp with a reference to a datetime instance that represents your t=0.

Henrik
  • 4,254
  • 15
  • 28
  • I was poking around at this notion, but wondered if it was possible to not pre-feed the `period` value. I guess I could calculate by using the first and last values in the time series.... very useful, thanks – Jay Gattuso Jul 25 '13 at 07:40
  • 1
    `int((events[-1].timestamp-events[0].timestamp).to_seconds()/60)` – Henrik Jul 25 '13 at 07:43
0

I wondered if there was a way of using `while' to do the slicing?

I have this definition which might help you. It has no library dependencies and uses a while loop as requested:

If you have 2 lists; unix timestamps and values, each the same length where:

timestamps[0] is the time stamp for values[0] respectively.

timestamps = [unix, unix, unix, ....etc.]
values = [0.1, 0.2, 0.5, 1.1, ....etc.]

lets say you have 30 days of data, starting Nov 2011, and you want it grouped hourly:

BEGIN = 1320105600

hourly_values = []
z = 0
while z < 720:   # 24 hours * 30 days = 720
    hourly_values.append([])  # append an new empty list for each hour
    for i in range(len(timestamps)):
        if timestamps[i] >= (BEGIN + 3600*z):  # 3600 sec = 1 hour
            if timestamps[i] < (BEGIN + 3600*(z+1)):
                hourly_values[z].append(values[i])
    z+=1
return hourly_values

This will return a list of lists for each hour, with empty lists in the hours with no data.

litepresence
  • 3,109
  • 1
  • 27
  • 35
0

You could use the slotter module. I had a similar problem and I ended up writing a generic solution - https://github.com/saurabh-hirani/slotter

An asciinema demo - https://asciinema.org/a/8mm8f0qqurk4rqt90drkpvp1b?autoplay=1

Saurabh Hirani
  • 1,198
  • 14
  • 21