-1

So I have a large dataset with 89 variables where multiple are unique identifiers attributing data in a relational DB. I want to see the frequency of unique identifiers as cross referenced by a second variable which is a factor?

i.e. this does not work but is how I think would work -

length(unique(data$PID ~ data$ICD_grouping)

returning a table like

ICD_grouping        unique.PID
C43                   5
C47/C49               1
C50                   2
C56                   1
C57-C58               1
C80                   1

Sample data

 PID ICD_Grouping
1     1          C80
2   918          C43
3   919          C43
4   919          C43
5  1284             
6  1285             
7   550          C43
8   550          C43
9   550          C43
10  550          C50
11  920          C43
12  920          C43
13  921          C50
14  921          C56
15  921       C57-58
16  921       C57-58
17  549          C43
18  549          C43
19  922       C47/49
20  551          C43
  • To be honest, this post is a bit of a mess. I suggest you spend some time reviewing your post and then provide (1) a clear problem statement, and (2) consistent sample data and matching expected output. I also recommend reviewing how to provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). We've been going back and forth here now for a while without any progress. – Maurits Evers Feb 15 '19 at 04:09

1 Answers1

1

I'm not entirely clear on your problem statement but provided I understood you correctly (unfortunately you don't provide sample data) we can use table. Here is a reproducible minimal example based on mtcars

table(mtcars$gear, mtcars$carb)
#
#    1 2 3 4 6 8
#  3 3 4 3 5 0 0
#  4 4 4 0 4 0 0
#  5 0 2 0 1 1 1

The output is an object of type table and shows the frequency for values for mtcars$gear (rows) and mtcars$carb (columns). If you prefer a long data.frame we can do

as.data.frame(table(mtcars$gear, mtcars$carb))
#   Var1 Var2 Freq
#1     3    1    3
#2     4    1    4
#3     5    1    0
#4     3    2    4
#5     4    2    4
#6     5    2    2
#7     3    3    3
#8     4    3    0
#9     5    3    0
#10    3    4    5
#11    4    4    4
#12    5    4    1
#13    3    6    0
#14    4    6    0
#15    5    6    1
#16    3    8    0
#17    4    8    0
#18    5    8    1

Here Var1 corresponds to values of mtcars$gear and Var2 to values of mtcars$carb.

Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Sorry I wasn't clear enough in the initial question. So the ID appear multiple times, so effectively all I want to know is how many unique IDs turn up with Var 2 in the dataset. Effectively asking a simple frequency table for a var2 but only for unique values of the ID – Seb Walpole Feb 13 '19 at 02:53
  • @SebWalpole Can you please provide sample data in a reproducible copy&paste-able format? For example, use `dput` and include the output. It might be useful (for future posts) to review how to provide a [minimal reproducible code example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Maurits Evers Feb 13 '19 at 06:37
  • I have added a small example of the data. So I would like frequency of Var2 but of only unique identifiers from Var1. So 550 has multiple entries with C43 - should only count as 1 but also counts as a unique ID for C50 – Seb Walpole Feb 13 '19 at 11:26
  • @SebWalpole I can't see any updated data in your post. Did you forget to edit? – Maurits Evers Feb 13 '19 at 12:21
  • Sorry, it didn't work when I included it last night somehow. – Seb Walpole Feb 14 '19 at 00:01
  • @SebWalpole Ok I can see the data now but I struggle to relate the format of your data to your question. In your comments you mention columns `Var1` and `Var2`, your example code suggests columns `ID` and `varb`, and your actual data has columns `PID` and `ICD_Grouping`. So what's what here? Please use consistent variable/column names. – Maurits Evers Feb 14 '19 at 08:05
  • I was trying to use generic variable titles so it would be reproducible for any dataset but I wasn't clear enough in my question. Will amend @MauritsEvers – Seb Walpole Feb 15 '19 at 02:51
  • @SebWalpole I'm sorry but I don't think I understand your issue; at first I thought you might be after something like this `aggregate(PID ~ ICD_Grouping, data = df, FUN = function(x) length(unique(x)))`; but this doesn't match your expected output at all. In fact, I don't understand your expected output. Why do you only keep `C34`, `C50`, `C80`, `C56`? What about the other unique `ICD_grouping`s, like `C47/49`, `C57-58`; there also seems to be an empty `ICD_grouping`. Furthermore, how can you have 25 `unique.PID`s for `C50` if your sample data only has 20 rows? I'm very confused. – Maurits Evers Feb 15 '19 at 03:46
  • @MauritsEvers All my edits didn't come through. Check the output table again. – Seb Walpole Feb 15 '19 at 04:57
  • @SebWalpole Still not clear nor reproducible; I see 6 (not 5) unique `PID`s for `C43`. – Maurits Evers Feb 15 '19 at 07:31