I am attempting to analyse the categorical overlap in a dataset, with the ultimate goal of finding a categorisation that describes the dataset with the smallest number of categories.
Effort thus far:
I am relatively new to R and Python and have been doing the bulk of this work in Excel.
- I have been able to derive a matrix that captures 1° overlap (i.e. the number of items shared by each pair of categories), following the advice in the Category Overlap Analysis question.
- I have unpivoted that matrix into an output that lets me analyse overlaps per pair of categories and identify the duplication that exists at 1° overlap. Through this analysis I identified 300 duplicated 1° categorical overlaps, which resolve to 118 unique categories.
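For reference, the 1° overlap matrix and its unpivoted form can also be built directly in pandas. This is only a sketch: the inline DataFrame reproduces the rows shown in the data snippet further down (the elided "..." rows are omitted), and the column names Category/ItemCode are taken from that snippet.

```python
import pandas as pd

# Rows reproduced from the data snippet below (elided "..." rows omitted).
df = pd.DataFrame({
    "Category": ["G0617", "G0617", "G0617", "G0080", "G0419", "G0532", "G0532",
                 "G0616", "G0659", "G0659", "G0846", "Gtest", "Gtest"],
    "ItemCode": ["5410.001", "5410.006", "5410.903", "5410.001", "5410.001",
                 "5410.001", "5410.903", "5410.006", "5410.001", "5410.903",
                 "5410.001", "5410.903", "5410.006"],
})

# Incidence matrix: one row per category, one column per item code.
inc = pd.crosstab(df["Category"], df["ItemCode"])

# 1° overlap matrix: number of items shared by each pair of categories.
overlap = inc.dot(inc.T).rename_axis(index="cat_a", columns="cat_b")

# Unpivot to one row per unordered pair with a non-zero overlap.
pairs = overlap.stack().rename("shared_items").reset_index()
pairs = pairs[(pairs["cat_a"] < pairs["cat_b"]) & (pairs["shared_items"] > 0)]
```

From here, a pair whose shared_items equals the size of both categories is an exact duplicate, which is the 1° duplication described above.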
Challenge:
- I am struggling to effectively analyse higher-order categorical overlaps and to identify independent ItemCodes.
Thank you!
An example to better illustrate what I am trying to achieve.
Data snippet
Category, ItemCode
G0617,5410.001
G0617,5410.006
G0617,5410.903
...
G0080,5410.001
...
G0419,5410.001
...
G0532,5410.001
G0532,5410.903
...
G0616,5410.006
...
G0659,5410.001
G0659,5410.903
...
G0846,5410.001
...
Gtest,5410.903
Gtest,5410.006
Ideal output
category|equivalent categories|subsumed categories|independent item_code|duplicate category
G0080|||5410.001|
G0419|G0080|||TRUE
G0532||G0080|5410.903|
G0616|||5410.006|
G0617|G0532,G0616;G0080,Gtest;G0532,Gtest|G0080,G0419,G0532,G0616,G0659,G0846||
G0659|G0532|G0080|5410.903|TRUE
G0846|G0080|||TRUE
Gtest||G0616|5410.903|
The "duplicate category" column is biased towards the category with the smaller ID.
(i.e. where categories contain exactly the same item codes, the category with the higher ID is marked as a duplicate of the one with the smallest ID. In this example, G0419 and G0846 are both marked as duplicates of G0080.)
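Here is the logic I am trying to implement, sketched in Python by treating each category as a set of ItemCodes: exact set equality gives the equivalent/duplicate categories, strict subset gives the subsumed categories, and unions of pairs give the higher-order equivalences such as G0617 = G0532 ∪ G0616. Two assumptions in this sketch: "independent item_code" is read as the items not covered by any subsumed category, and every qualifying pair is enumerated, so it also reports independents for exact duplicates (e.g. G0419) and lists Gtest as subsumed by G0617; some post-filtering would be needed to match the ideal output exactly.

```python
from itertools import combinations

import pandas as pd

# Rows reproduced from the data snippet above (elided "..." rows omitted).
df = pd.DataFrame({
    "Category": ["G0617", "G0617", "G0617", "G0080", "G0419", "G0532", "G0532",
                 "G0616", "G0659", "G0659", "G0846", "Gtest", "Gtest"],
    "ItemCode": ["5410.001", "5410.006", "5410.903", "5410.001", "5410.001",
                 "5410.001", "5410.903", "5410.006", "5410.001", "5410.903",
                 "5410.001", "5410.903", "5410.006"],
})

# Category -> set of its ItemCodes.
sets = df.groupby("Category")["ItemCode"].agg(set).to_dict()
cats = sorted(sets)

rows = []
for c in cats:
    s = sets[c]
    others = [o for o in cats if o != c]
    # Categories whose items are a strict subset of c's items.
    subsumed = [o for o in others if sets[o] < s]
    # Single categories with exactly the same items.
    equal = [o for o in others if sets[o] == s]
    # Higher-order equivalents: pairs of categories whose union equals c.
    pair_equiv = [f"{a},{b}" for a, b in combinations(others, 2)
                  if sets[a] != s and sets[b] != s and sets[a] | sets[b] == s]
    # Assumed meaning of "independent": items in no subsumed category.
    covered = set().union(*[sets[o] for o in subsumed])
    rows.append({
        "category": c,
        "equivalent categories": ";".join(equal + pair_equiv),
        "subsumed categories": ",".join(subsumed),
        "independent item_code": ",".join(sorted(s - covered)),
        # Duplicate if an identical category with a smaller ID exists.
        "duplicate category": any(o < c for o in equal),
    })

out = pd.DataFrame(rows)
```

This brute-forces all pairs, which is fine at a few hundred categories but would need pruning (e.g. only testing pairs that 1°-overlap with the target category) at larger scale.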