Find Overlapping within categories in SQL Big Query

Question

I'm trying to do something similar to this but in BigQuery. I have several users that may have 1 or more categories. I must find the overlapping within categories. Something like this:

What I want as result is something like this:

That is, for example, only one user has only category D (and no other), two users have categories 10 and 30, and so on.

The main problem is that I have a lot of categories (over 40). Previously I had done something like:

SELECT sum(cat1), sum(cat2), sum(cat3)
FROM  table
where cat1 = 0 and cat2 = 1 and cat3 = 0

That way worked, but is too manual and impossible to do it right now because I have lots of categories. Want to use BigQuery if it is possible.

it is not clear why 10-10 has 1 and 30-30 has 1. please clarify the logic — Mikhail Berlyant, Sep 29 '21 at 15:41
Usr D has only category 10. So 10-10 will be those users with only one category. User F has only category 30, that's the 1 in 30-30 — Pablo, Sep 29 '21 at 17:07

Mikhail Berlyant · Accepted Answer · 2021-09-29T17:20:30.440

The main problem is that I have a lot of categories (over 40).

Consider below (BigQuery) approach - works for any reasonable amount of categories

execute immediate (
select '''
  select * from (
    select distinct t1.usr, 
      t1.categories category, t2.categories category2
    from `your_table` t1 left join `your_table` t2 
    on t1.usr = t2.usr and t1.categories != t2.categories
    union all
    select usr, any_value(categories) category, any_value(categories) category2
    from `your_table`
    group by usr
    having count(1) = 1
  )
  pivot (count(usr) cat for category2 in (''' || list || '''))
  order by category
'''
from (
  select string_agg("'" || categories || "'" order by categories) list 
  from (select distinct categories from `your_table`)
  )
)

if applied to sample data in your question - output is

Sergey Geron · Answer 2 · 2021-09-29T14:50:11.077

0

FWIW:

with mytable as (
    select 'D' as Usr, '10' as Categories union all 
    select 'E', '10' union all
    select 'E', '30' union all
    select 'F', '30' union all
    select 'G', '10' union all
    select 'G', '50' union all
    select 'H', '10' union all
    select 'H', '30'
)
select grp, count(*) as cnt
from (
    select Usr, string_agg(Categories order by Categories) as grp
    from mytable
    group by Usr
)
group by grp

edited Sep 29 '21 at 14:50

answered Sep 29 '21 at 14:41

Sergey Geron

9,098
2
22
29

Also make sure category ordered so 10,30 and 30,10 are the same group, but meantime, I don’t think this is what user is asking – Mikhail Berlyant Sep 29 '21 at 14:46
Thank you @MikhailBerlyant. Fixed it. – Sergey Geron Sep 29 '21 at 14:49
the reason I think it is not what OP asks is because you just counting users with exact same categories - but question is more about overlapping categories and because OP expects matrix it looks like they want overlapping pairs (and one) categories - i can be wrong thought :o) – Mikhail Berlyant Sep 29 '21 at 15:00

Radagast · Answer 3 · 2021-09-29T18:13:34.360

This isn't exactly what you're looking for but you can use this output as a source to an Excel Pivot or BI tool to get exactly what you want. Performing pivots on 40+ columns in SQL is doable but not fun.

select a.categories, 
       b.categories as cross_categories, 
       count(distinct a.usr) as counts
from t a
join t b on a.usr=b.usr and a.categories<> b.categories
group by a.categories, b.categories

union all

select max(categories), 
       max(categories), 
       count(distinct categories)
from t
group by usr
having count(distinct categories)=1
order by 1,2

DEMO

Excel Pivot

Find Overlapping within categories in SQL Big Query

3 Answers3