0

I'm trying to do something similar to this but in BigQuery. I have several users that may have 1 or more categories. I must find the overlapping within categories. Something like this:

enter image description here

What I want as result is something like this:

enter image description here

That is, for example, only one user has only category D (and no other), two users have categories 10 and 30, and so on.

The main problem is that I have a lot of categories (over 40). Previously I had done something like:

SELECT sum(cat1), sum(cat2), sum(cat3)
FROM  table
where cat1 = 0 and cat2 = 1 and cat3 = 0

That way worked, but is too manual and impossible to do it right now because I have lots of categories. Want to use BigQuery if it is possible.

Pablo
  • 140
  • 1
  • 11

3 Answers3

1

The main problem is that I have a lot of categories (over 40).

Consider below (BigQuery) approach - works for any reasonable amount of categories

execute immediate (
select '''
  select * from (
    select distinct t1.usr, 
      t1.categories category, t2.categories category2
    from `your_table` t1 left join `your_table` t2 
    on t1.usr = t2.usr and t1.categories != t2.categories
    union all
    select usr, any_value(categories) category, any_value(categories) category2
    from `your_table`
    group by usr
    having count(1) = 1
  )
  pivot (count(usr) cat for category2 in (''' || list || '''))
  order by category
'''
from (
  select string_agg("'" || categories || "'" order by categories) list 
  from (select distinct categories from `your_table`)
  )
)     

if applied to sample data in your question - output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
0

FWIW:

with mytable as (
    select 'D' as Usr, '10' as Categories union all 
    select 'E', '10' union all
    select 'E', '30' union all
    select 'F', '30' union all
    select 'G', '10' union all
    select 'G', '50' union all
    select 'H', '10' union all
    select 'H', '30'
)
select grp, count(*) as cnt
from (
    select Usr, string_agg(Categories order by Categories) as grp
    from mytable
    group by Usr
)
group by grp

enter image description here

Sergey Geron
  • 9,098
  • 2
  • 22
  • 29
  • Also make sure category ordered so 10,30 and 30,10 are the same group, but meantime, I don’t think this is what user is asking – Mikhail Berlyant Sep 29 '21 at 14:46
  • Thank you @MikhailBerlyant. Fixed it. – Sergey Geron Sep 29 '21 at 14:49
  • the reason I think it is not what OP asks is because you just counting users with exact same categories - but question is more about overlapping categories and because OP expects matrix it looks like they want overlapping pairs (and one) categories - i can be wrong thought :o) – Mikhail Berlyant Sep 29 '21 at 15:00
0

This isn't exactly what you're looking for but you can use this output as a source to an Excel Pivot or BI tool to get exactly what you want. Performing pivots on 40+ columns in SQL is doable but not fun.

select a.categories, 
       b.categories as cross_categories, 
       count(distinct a.usr) as counts
from t a
join t b on a.usr=b.usr and a.categories<> b.categories
group by a.categories, b.categories

union all

select max(categories), 
       max(categories), 
       count(distinct categories)
from t
group by usr
having count(distinct categories)=1
order by 1,2

DEMO

Excel Pivot

enter image description here

Radagast
  • 5,102
  • 3
  • 12
  • 27