I am attempting to analyse the categorical overlap in a dataset, with the ultimate goal of finding a categorisation that describes the dataset with the smallest number of categories.
Effort thus far:
I am relatively new to R and Python and have been doing the bulk of this work in Excel.
- I have been able to derive a matrix that captures 1° overlap (i.e. the number of items shared by each pair of categories), following the advice in the Category Overlap Analysis question.
- I have unpivoted that matrix into an output that lets me analyse overlaps per pair of categories and identify the duplication that exists at 1° overlap. Through this analysis I identified 300 duplicated 1° categorical overlaps, which resolve to 118 unique categories.
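For reference, the 1° overlap matrix and its unpivoted form can also be built directly in pandas. This is only a sketch: the inline DataFrame reproduces the rows shown in the data snippet further down (the elided "..." rows are omitted), and the column names Category/ItemCode are taken from that snippet.

```python
import pandas as pd

# Rows reproduced from the data snippet below (elided "..." rows omitted).
df = pd.DataFrame({
    "Category": ["G0617", "G0617", "G0617", "G0080", "G0419", "G0532", "G0532",
                 "G0616", "G0659", "G0659", "G0846", "Gtest", "Gtest"],
    "ItemCode": ["5410.001", "5410.006", "5410.903", "5410.001", "5410.001",
                 "5410.001", "5410.903", "5410.006", "5410.001", "5410.903",
                 "5410.001", "5410.903", "5410.006"],
})

# Incidence matrix: one row per category, one column per item code.
inc = pd.crosstab(df["Category"], df["ItemCode"])

# 1° overlap matrix: number of items shared by each pair of categories.
overlap = inc.dot(inc.T).rename_axis(index="cat_a", columns="cat_b")

# Unpivot to one row per unordered pair with a non-zero overlap.
pairs = overlap.stack().rename("shared_items").reset_index()
pairs = pairs[(pairs["cat_a"] < pairs["cat_b"]) & (pairs["shared_items"] > 0)]
```

From here, a pair whose shared_items equals the size of both categories is an exact duplicate, which is the 1° duplication described above.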
Challenge:
- I am struggling to effectively analyse higher-order categorical overlaps and to identify independent ItemCodes.
Thank you!
An example to better illustrate what I am trying to achieve.
Data snippet
Category, ItemCode
G0617,5410.001
G0617,5410.006
G0617,5410.903
...
G0080,5410.001
...
G0419,5410.001
...
G0532,5410.001
G0532,5410.903
...
G0616,5410.006
...
G0659,5410.001
G0659,5410.903
...
G0846,5410.001
...
Gtest,5410.903
Gtest,5410.006
Ideal output
category|equivalent categories|subsumed categories|independent item_code|duplicate category
G0080|||5410.001|
G0419|G0080|||TRUE
G0532||G0080|5410.903|
G0616|||5410.006|
G0617|G0532,G0616;G0080,Gtest;G0532,Gtest|G0080,G0419,G0532,G0616,G0659,G0846||
G0659|G0532|G0080|5410.903|TRUE
G0846|G0080|||TRUE
Gtest||G0616|5410.903|
The "duplicate category" column is biased towards the category with the smaller ID.
(i.e. where categories contain exactly the same item codes, the category with the higher ID is marked as a duplicate of the one with the smallest ID. In this example, G0419 and G0846 are both marked as duplicates of G0080.)
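Here is the logic I am trying to implement, sketched in Python by treating each category as a set of ItemCodes: exact set equality gives the equivalent/duplicate categories, strict subset gives the subsumed categories, and unions of pairs give the higher-order equivalences such as G0617 = G0532 ∪ G0616. Two assumptions in this sketch: "independent item_code" is read as the items not covered by any subsumed category, and every qualifying pair is enumerated, so it also reports independents for exact duplicates (e.g. G0419) and lists Gtest as subsumed by G0617; some post-filtering would be needed to match the ideal output exactly.

```python
from itertools import combinations

import pandas as pd

# Rows reproduced from the data snippet above (elided "..." rows omitted).
df = pd.DataFrame({
    "Category": ["G0617", "G0617", "G0617", "G0080", "G0419", "G0532", "G0532",
                 "G0616", "G0659", "G0659", "G0846", "Gtest", "Gtest"],
    "ItemCode": ["5410.001", "5410.006", "5410.903", "5410.001", "5410.001",
                 "5410.001", "5410.903", "5410.006", "5410.001", "5410.903",
                 "5410.001", "5410.903", "5410.006"],
})

# Category -> set of its ItemCodes.
sets = df.groupby("Category")["ItemCode"].agg(set).to_dict()
cats = sorted(sets)

rows = []
for c in cats:
    s = sets[c]
    others = [o for o in cats if o != c]
    # Categories whose items are a strict subset of c's items.
    subsumed = [o for o in others if sets[o] < s]
    # Single categories with exactly the same items.
    equal = [o for o in others if sets[o] == s]
    # Higher-order equivalents: pairs of categories whose union equals c.
    pair_equiv = [f"{a},{b}" for a, b in combinations(others, 2)
                  if sets[a] != s and sets[b] != s and sets[a] | sets[b] == s]
    # Assumed meaning of "independent": items in no subsumed category.
    covered = set().union(*[sets[o] for o in subsumed])
    rows.append({
        "category": c,
        "equivalent categories": ";".join(equal + pair_equiv),
        "subsumed categories": ",".join(subsumed),
        "independent item_code": ",".join(sorted(s - covered)),
        # Duplicate if an identical category with a smaller ID exists.
        "duplicate category": any(o < c for o in equal),
    })

out = pd.DataFrame(rows)
```

This brute-forces all pairs, which is fine at a few hundred categories but would need pruning (e.g. only testing pairs that 1°-overlap with the target category) at larger scale.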