Category Overlap Analysis

Question

I am trying to perform some category overlap analysis and need help.

I have data made up of customer service tickets. The tickets are labeled with category data. Tickets can contain multiple category labels.

I have a query that pulls ticket ids and categories. I get multiple rows for IDs with more than one category. I am looking for a way to show the category overlap, for example: how many tickets have category A, have A and B, B and C, etc..

I would like to be able to perform this in Excel or R so that it can easily be incorporated into reports for my management.

An example of my query output is as follows:

category  ticket_id

A   3975472 
D   3975472 
B   3975472 
P   3969484 
B   3969484 
S   3969484 
P   3968360 
C   3968360 
D   3964048 
A   3964048 
C   3963748 
E   3963748

Thank you!

I was hoping to achieve an output such as:

Where would you like the overlaps? In the worksheet? As variables in code? Also, how many categories are there in total? — Excel Hero, Aug 24 '15 at 18:01
I imagined a worksheet that would give me the count of ids per category combination. I imagined the worksheet having the categories across the top row and then down the first column with the count of ids at the intersections. In reality there are ~400 categories. — aguadamuz, Aug 24 '15 at 18:30
Wow! 400 categories? So each of those is to be combined as a column heading to do the count for? That's 400^2 combinations. Excel only has 16384 columns in a worksheet. Maybe I misunderstand. — Excel Hero, Aug 24 '15 at 18:37
If you just want to list each category singly per row, then is there a reason the PivotTable answer already provided would not suffice? — Excel Hero, Aug 24 '15 at 18:42
I have added a screen shot with my desired output to my question. I apologize for my lack of ability in describing the desired output. — aguadamuz, Aug 24 '15 at 19:05
building on Jason's answer, to get your expected output try... `crossprod(table(df$ticket_id, df$category))` — user20650, Aug 24 '15 at 19:43
Since each ticket can have 1 or more categories you need to decide how you want them counted. for example how would you deal with A 3975472 D 3975472 B 3975472 would it be counted for AA, AD, AB, DD, DB and BB — Rohit Das, Aug 24 '15 at 19:55
I would also assume the table you showed would have been symmetric. Any reason why its not symmetric or is that just random values you put in. — Rohit Das, Aug 24 '15 at 19:57
Sorry, I'm still not exactly sure what you are after. Do the totals in your "hoping to achieve output" image actually represent what should come from your posted sample data? — Excel Hero, Aug 24 '15 at 19:58

score 3 · Answer 1 · answered Aug 24 '15 at 18:07

3

In Excel you could do this with a Pivot table:

In R, assuming the data is in a data frame named df, you could do something like this:

table(df$ticket_id, df$category)
#         A B C D E P S
# 3963748 0 0 1 0 1 0 0
# 3964048 1 0 0 1 0 0 0
# 3968360 0 0 1 0 0 1 0
# 3969484 0 1 0 0 0 1 1
# 3975472 1 1 0 1 0 0 0

answered Aug 24 '15 at 18:07

JasonAizkalns

20,243
8
57
116

For R, `addmargins` and `zero.print` can be used to mimick the excel output: `print(addmargins(table(df$ticket_id, df$category)),zero.print="")` – Frank Aug 24 '15 at 18:36

score 1 · Answer 2 · answered Aug 24 '15 at 20:58

This was an interesting question. Hope the code below provides the solution. I am using the library reshape2 for some data rearranging.

set.seed(1)
# creating a sample dataset
dat <- data.frame(category = sample(x = letters[1:6], size = 1000,replace = T), ticket = sample(x = 1000:1500, size = 1000,replace = T))
dat <- unique(dat)
dat <- dat[order(dat$ticket, dat$category),]
head(dat)

    category ticket
311        a   1000
916        c   1000
978        d   1000
949        f   1000
72         f   1001
597        c   1002

library(reshape2)

#same as table function but gives a data frame
tab <- dcast(dat,ticket ~ category, length)  

#create all possible 2-way combinations
levels <- sort(unique(dat$category))
combs <- data.frame(rows = rep(levels,times = length(levels)), cols = rep(levels,each = length(levels)))

#calculate count for each combination
combs$count <- apply(combs,1,function(x) sum(tab[,x[1]] & tab[,x[2]]))

overlap <- dcast(combs, rows ~ cols) #convert output into a square matrix

  rows   a   b   c   d   e   f
1    a 140  38  36  41  36  42
2    b  38 128  48  32  41  39
3    c  36  48 161  35  49  36
4    d  41  32  35 123  32  35
5    e  36  41  49  32 139  38
6    f  42  39  36  35  38 138

Let me know if any of this needs further explaining.

Category Overlap Analysis

2 Answers2

Linked