Groupby and count with condition and get week start (from sunday) and week end from date column in Pandas Dataframe?

Question

I have a dataset like below. I need to do count of ticket_id with status closed based on groupby and I need to get week start(sunday) and week end from ticket_time column. Input dataframe

id       status     ticket_time         product         country
1260057   open     2021-10-03 01:20:00  Broadband       Grenada
2998178   open     2021-10-06 00:00:00  Fixed Voice     Bahamas
3762949  closed    2021-10-04 00:00:00  Fixed Voice     St Lucia
3766608  closed    2021-10-10 00:00:00  Broadband       St Lucia
3767125  closed    2021-10-03 00:00:00  TV              Antigua
etc.................

Output dataframe EXAMPLE:

country  product    week_startdate     week_enddate          Value
antigua     TV      2021-10-03 00:00:00 2021-10-10 00:00:00  Count(id) with status=closed
St Lucia    Broadband   ...         ....                  ...
St Lucia    Fixedvoice  ...         ....                  ...

For each country and each product I should get the count of id which is having closed status. Is groupby country and product will work?

INPUT Dataset

{'ticket_id': [1260057,
  2998178,
  3762949,
  3766608,
  3767125,
  6050009,
  6050608,
  6050972,
  6052253,
  6053697,
  6053812,
  6053848,
  6053970,
  6054363,
  6054549,
  6055529,
  6057303,
  6057364,
  6058428,
  6101321,
  6103451,
  6103688,
  6105180,
  693170,
  1259365,
  1259466,
  1260843,
  3762579,
  3762778,
  3764140,
  3768850,
  6050528,
  6050595,
  6051099,
  6053704,
  6054013,
  6054582,
  6055323,
  6056096,
  6056363,
  6057701,
  6058251,
  6058323,
  6102386,
  6103121,
  6104844,
  6104935,
  692914,
  1260730,
  2997275],
 'status': ['open',
  'open',
  'closed',
  'closed',
  'closed',
  'closed',
  'open',
  'open',
  'closed',
  'open',
  'open',
  'closed',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'closed',
  'closed',
  'closed',
  'open',
  'open',
  'closed',
  'closed',
  'open',
  'closed',
  'closed',
  'closed',
  'open',
  'closed',
  'closed',
  'closed',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'cancelled',
  'open',
  'open',
  'open',
  'closed',
  'closed',
  'open'],
 'ticket_start_time': [Timestamp('2021-10-04 01:20:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-01 00:00:00')],
 'product_type': ['Broadband',
  'Fixed Voice',
  'Fixed Voice',
  'Broadband',
  'TV',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'TV',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Fixed Voice',
  'Broadband',
  'Fixed Voice',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'TV',
  'Fixed Voice',
  'TV',
  'TV',
  'TV',
  'TV',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Fixed Voice',
  'Fixed Voice'],
 'org_country': ['Grenada',
  'Bahamas',
  'St Lucia',
  'St Lucia',
  'Antigua',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Curacao',
  'Grenada',
  'Grenada',
  'Grenada',
  'St Lucia',
  'St Lucia',
  'St Vincent',
  'St Lucia',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Curacao',
  'Grenada',
  'Bahamas']}

@Phil Leh I've been referring to this https://stackoverflow.com/questions/70182295/how-to-calculate-monthly-and-weekly-averages-from-a-dataframe-using-python. But not getting for my case. Can you pls help me on this. — Mahilan, Dec 22 '21 at 14:29
can you describe a bit further what your ideal output is? Do you also need to group by week? — Pierre D, Dec 22 '21 at 16:13
My expected output is as above example. Please see above OUTPUTD DATAFRAME EXAMPLE. @PierreD..Yes only one week startdate and week enddate for each country and product. — Mahilan, Dec 22 '21 at 16:19
The example is ambiguous because there is only one row for 'Antigua'. You could improve this by giving a few more and showing what your aggregation logic should be. The section "Output getting from below code" makes no sense: there is no code, and the data is way too big for us to understand what you have in mind. — Pierre D, Dec 22 '21 at 16:21
Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). We are all wasting time trying to understand what exactly you need. — Pierre D, Dec 22 '21 at 19:11

Muhammad Hassan · Answer 1 · 2021-12-22T16:03:58.100

1

IIUC:

def week_start_day(s):
    return s - (pd.to_timedelta((s.dt.weekday + 1)%7, unit='D'))

def week_end_day(s):
    return s + pd.to_timedelta(7-(s.dt.weekday+2)%7, unit='D')

df['ticket_time'] = pd.to_datetime(df['ticket_time'])

df = df[df['status'] == 'closed'].groupby(['country', 'product']).agg(week_startdate = pd.NamedAgg(column='ticket_time', aggfunc=week_start_day),
                                                                      week_enddate = pd.NamedAgg(column='ticket_time', aggfunc=week_end_day),
                                                                      Value = pd.NamedAgg(column='id', aggfunc="size"))
df.reset_index(inplace=True)

OUTPUT

    country      product week_startdate week_enddate  Value
0   Antigua           TV     2021-10-03   2021-10-09      1
1  St Lucia    Broadband     2021-10-10   2021-10-16      1
2  St Lucia  Fixed Voice     2021-10-03   2021-10-09      1

Setup

data = {'id': [1260057, 2998178, 3762949, 3766608, 3767125],
          'status': ['open', 'open', 'closed', 'closed', 'closed'],
          'ticket_time': ['2021-10-03 01:20:00', '2021-10-06 00:00:00', '2021-10-04 00:00:00',
                          '2021-10-10 00:00:00', '2021-10-03 00:00:00'],
          'product': ['Broadband', 'Fixed Voice', 'Fixed Voice', 'Broadband', 'TV'],
          'country': ['Grenada', 'Bahamas', 'St Lucia', 'St Lucia', 'Antigua']}

df = pd.DataFrame.from_dict(data)

edited Dec 22 '21 at 16:03

answered Dec 22 '21 at 15:29

Muhammad Hassan

4,079
1
13
27

Thanks much! But I'm getting week_startdate and enddate as unhashable type: 'DatetimeArray' . I should get only one start date , but it is consisting of array – Mahilan Dec 22 '21 at 15:52
Can you please post output of `df.head(50).to_json(orient='records')` so that i can check it on your dataframe. Its working fine for me for the data you posted in question' – Muhammad Hassan Dec 22 '21 at 15:55
I'm getting this error when I try to use df.head(50).to_json(orient='records') OverflowError: Maximum recursion level reached. – Mahilan Dec 22 '21 at 15:58
But when I work with large set of data ,it is giving me like above! (Pls check updated question). Can you pls help me why I'm facing this. @Muhammed Hassan – Mahilan Dec 22 '21 at 16:02
Plz post the output of `df.to_dict('list')` – Muhammad Hassan Dec 22 '21 at 16:05
Please check my updated question. I have updated my output in list @Muhammed Hassan – Mahilan Dec 22 '21 at 16:08
I need the output of your original dataframe. – Muhammad Hassan Dec 22 '21 at 16:10
I updated the output of my original dataframe @Muhammed Hassan – Mahilan Dec 22 '21 at 16:16
Please post the output of your `Input dataframe` as you posted in the question above. You are posting the result of output dataframe – Muhammad Hassan Dec 22 '21 at 16:20
try posting this: `df.head(50).to_dict('list')` – Muhammad Hassan Dec 22 '21 at 16:40
Yes. I have updated my input dataset above. Please check @Muhammad Hassan – Mahilan Dec 22 '21 at 16:43
Apparently, the reason you are getting arrays in `week_startdate` and `week_startdate` is that for a certain combination of `product_type` and `org_country` you have multiple rows where `status` is `closed`. For example, for `country: Grenada` and `product_type: Fixed Voice` you have two rows where `status` is `closed` – Muhammad Hassan Dec 22 '21 at 16:54
Yes, I understood now.. Thanks much for your effort. Either I can go with your approach(with array of dates) or @Pierre D approach – Mahilan Dec 22 '21 at 17:00
FYI, the answer I provide doesn't care how many times a ticket is 'closed'. It counts just one for a ticket closed, no matter how many times it appears in the df. That's essentially a COUNT DISTINCT. – Pierre D Dec 22 '21 at 17:30
yes.. But Can you pls explain why we are getting multiple rows for each country/product. I'm getting confused there. @Pierre D – Mahilan Dec 22 '21 at 17:56
@Muhammad Hassan.. Is there any way to handle this arrays of dates..because everything is fine expect that date part. – Mahilan Dec 22 '21 at 18:42

Pierre D · Accepted Answer · 2021-12-22T19:37:11.957

Modified answer

After many explanations, it seems that the OP is looking for week start and end as a result of aggregate, not as a grouping.

Therefore, here is the modified answer:

weekday = 6  # Sunday
wo = pd.offsets.Week(weekday=weekday, normalize=True)

t = df['ticket_start_time'] + wo
gbcols = ['org_country', 'product_type']
agg_dict = {
    'ticket_id': 'count',
    'week_startdate': 'first',
    'week_enddate': 'last',
}
out = (
    df
    .assign(week_startdate=t - wo, week_enddate=t)
    .groupby(['ticket_id', 'status'] + gbcols).agg(agg_dict)
    .query("status == 'closed'").groupby(gbcols).agg(agg_dict)
    .reset_index()
    .rename(columns={'ticket_id': 'Value'})
)

From the provided 'input data', we get:

>>> out
   org_country product_type  Value week_startdate week_enddate
0      Antigua           TV      1     2021-10-03   2021-10-10
1      Curacao    Broadband      1     2021-10-03   2021-10-10
2      Grenada    Broadband      1     2021-09-26   2021-10-03
3      Grenada  Fixed Voice      2     2021-09-26   2021-10-10
4      Jamaica    Broadband      5     2021-09-26   2021-10-10
5      Jamaica           TV      1     2021-09-26   2021-10-03
6     St Lucia    Broadband      2     2021-09-26   2021-10-10
7     St Lucia  Fixed Voice      1     2021-09-26   2021-10-03
8     St Lucia           TV      1     2021-09-26   2021-10-03
9   St Vincent    Broadband      1     2021-09-26   2021-10-03
10    Trinidad    Broadband      2     2021-10-03   2021-10-10
11    Trinidad           TV      1     2021-09-26   2021-10-03

Explanation

First the trick of adding / then subtracting the Week offset wo gets around the problem of subtracting an offset when the date is at the offset boundary. For example:

t = pd.to_datetime([
    '2021-10-02', '2021-10-03', '2021-10-03 01:00',
    '2021-10-04', '2021-10-09 23:59:59', '2021-10-10']
)
>>> pd.DataFrame(dict(t=t, ws=t + wo - wo, we=t + wo))
                    t         ws         we
0 2021-10-02 00:00:00 2021-09-26 2021-10-03
1 2021-10-03 00:00:00 2021-10-03 2021-10-10
2 2021-10-03 01:00:00 2021-10-03 2021-10-10
3 2021-10-04 00:00:00 2021-10-03 2021-10-10
4 2021-10-09 23:59:59 2021-10-03 2021-10-10
5 2021-10-10 00:00:00 2021-10-10 2021-10-17

Next, we want the count distinct of ticket_id where the status is 'closed'. This is a bit trickier. We do it with two groupby. The first one is to tally the counts for each (ticket, status) (and all other grouping columns) group. There could be multiple instances of the same ticket being open or closed, but we want to disregard that.

tmp = (
    df
    .assign(week_startdate=t - wo, week_enddate=t)
    .groupby(['ticket_id', 'status'] + gbcols)
    .agg(agg_dict)
)
>>> tmp.head()
                                           ticket_id week_startdate  \
ticket_id status org_country product_type                             
692914    closed Curacao     Broadband             1     2021-10-03   
693170    open   Curacao     Broadband             1     2021-10-03   
1259365   closed Grenada     Fixed Voice           1     2021-09-26   
1259466   closed Grenada     Broadband             1     2021-09-26   
1260057   open   Grenada     Broadband             1     2021-10-03   

                                          week_enddate  
ticket_id status org_country product_type               
692914    closed Curacao     Broadband      2021-10-10  
693170    open   Curacao     Broadband      2021-10-10  
1259365   closed Grenada     Fixed Voice    2021-10-03  
1259466   closed Grenada     Broadband      2021-10-03  
1260057   open   Grenada     Broadband      2021-10-10

Finally, we now simply aggregate again, but only grouping on the gbcols columns, and where status is 'closed' in that tally, giving us the complete expression above.

Addendum

The OP would like to convert the week_startdate and end date to appear as strings, with full H:M:S resolution. For this:

for k in ['week_startdate', 'week_enddate']:
    out[k] = out[k].dt.strftime('%F %T')

And now:

>>> out
   org_country product_type  Value       week_startdate         week_enddate
0      Antigua           TV      1  2021-10-03 00:00:00  2021-10-10 00:00:00
1      Curacao    Broadband      1  2021-10-03 00:00:00  2021-10-10 00:00:00
2      Grenada    Broadband      1  2021-09-26 00:00:00  2021-10-03 00:00:00
3      Grenada  Fixed Voice      2  2021-09-26 00:00:00  2021-10-10 00:00:00
4      Jamaica    Broadband      5  2021-09-26 00:00:00  2021-10-10 00:00:00
5      Jamaica           TV      1  2021-09-26 00:00:00  2021-10-03 00:00:00
6     St Lucia    Broadband      2  2021-09-26 00:00:00  2021-10-10 00:00:00
7     St Lucia  Fixed Voice      1  2021-09-26 00:00:00  2021-10-03 00:00:00
8     St Lucia           TV      1  2021-09-26 00:00:00  2021-10-03 00:00:00
9   St Vincent    Broadband      1  2021-09-26 00:00:00  2021-10-03 00:00:00
10    Trinidad    Broadband      2  2021-10-03 00:00:00  2021-10-10 00:00:00
11    Trinidad           TV      1  2021-09-26 00:00:00  2021-10-03 00:00:00

I should get the week start and end date in datetime format. I tried with (format='%Y-%m-%d %H:%M:%S)..But getting error. Can you pls help — Mahilan, Dec 22 '21 at 17:16
you mean in the output? This is just a formatting issue. The columns are indeed full resolution Timestamps. — Pierre D, Dec 22 '21 at 17:18
for example, try `out.to_dict(orient='records')` to see the Timestamps. If you just want to convert to str, you can also use `out['week_startdate'].dt.strftime('%F %T')` etc. — Pierre D, Dec 22 '21 at 17:28
yes In the output I'm getting the week start and end date as str type .But ideally it should be like this 2021-10-02 00:00:00 in this date time format — Mahilan, Dec 22 '21 at 17:30
Can you pls add that to your code. I'm not getting the weekstart and end date as @Pierre D — Mahilan, Dec 22 '21 at 17:35
Sure. If this addresses all your concerns, please don't forget to accept and uptick. — Pierre D, Dec 22 '21 at 17:41
yes @PierreD But Can you pls explain why we are getting multiple rows for each country/product. I'm getting confused there. — Mahilan, Dec 22 '21 at 17:46
Maybe I misunderstood your problem. I thought you wanted a per-week aggregate as well. Is that incorrect? What is the precise meaning of `week_startdate` and `week_enddate` in your desired output? — Pierre D, Dec 22 '21 at 18:18
Yes, I need week start date and end date from ticket_start_time of that particular row. It is right..but I want to know why it is returning two rows with different strt and end date for same product/country. @Pierre D — Mahilan, Dec 22 '21 at 18:37
Sorry -- do you need the week start and end for the overall aggregate (group by `(country, product)`), or do you need to group by week as well so that you have a break down by `(week, country, product)`? — Pierre D, Dec 22 '21 at 19:10
first one is right.. So I need weekstart and weekend for (groupby( country, product)) with (count of ticket id where status is closed as Value) @Pierre D — Mahilan, Dec 22 '21 at 19:17
I rewrote the answer based on this. It would have saved both you and us quite a bit of time if you had started with a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). It is important to describe precisely the questions you have, highlighting corner cases, the logic of a desired result in the example, etc. — Pierre D, Dec 22 '21 at 19:39
Really Thanks you so much for your effort. Sorry for that, I will make sure my question will be more precise from next time. @Pierre D — Mahilan, Dec 23 '21 at 06:13
The solution is really helpful.. But I need to achieve other usecases which I have created as another question. https://stackoverflow.com/questions/70459619/groupby-with-condition-and-get-week-start-from-sunday-and-week-endsaturday-i . Can you pls help me on this. — Mahilan, Dec 23 '21 at 09:53

Groupby and count with condition and get week start (from sunday) and week end from date column in Pandas Dataframe?

2 Answers2

Modified answer

Explanation

Addendum