Combining columns to preserve uniqueness

Question

I need to combine multiple columns together to get a single "grouping" variable as in the Paste multiple columns together thread. The problem is, I want it to be robust to similar content of the strings, e.g.

tmp1 <- data.frame(V1 = c("a", "aa", "a",  "b", "bb", "aa"),
                   V2 = c("a", "a",  "aa", "b", "b",  "a"))

tmp2 <- data.frame(V1 = c("+",  "++", "+-", "-|",  "||"),
                   V2 = c("-|", "--", "++", "|-+", "|"))

For the data as above, using function apply(x, 1, paste, collapse = sep) with some common separators like "", |, -, + would fail as it would make the columns unidentifiable in output and may lead to mixing together different kinds of columns.

The columns can be assumed to be of different types (numeric, factor, character etc.).

The expected output is a vector with one ID per row, where each ID is assigned to unique combination of values between the two columns. The actual form of the ID's is not important for me. For example,

1 2 3 4 5 2

for the tmp1 data.

Can you suggest a better way to do this? Please notice that I am concerned with performance.

Still not very clear. Wouldn't you just need to pick a value for the `sep` parameter that isn't in your dataset? `paste(c("a", "b", "++"), collapse = "-mycollapsestring-")` `"a-mycollapsestring-b-mycollapsestring-++"` — Eric Watt, Jul 10 '17 at 12:36
@EricWatt edited to add an example. How would you pick sep that is not in the data? Imagine that the data columns can have all the ASCII characters in any order and of any length, then there is no single "character" for this. You can always assign some long, random combinations of charters, but this doesn't sound either robust, or elegant. — Tim, Jul 10 '17 at 12:40

score 1 · Answer 1 · answered Jul 10 '17 at 12:51

Based on the update to your question, if the form the ID doesn't matter, this is easy. Here is a method using data.table, you can do similar with dplyr.

library(data.table)

merge(tmp1,
      unique(tmp1)[, .(V1, V2, ID = 1:.N)],
      by = c("V1", "V2"))

   V1 V2 ID
1:  a  a  1
2:  a aa  3
3: aa  a  2
4: aa  a  2
5:  b  b  4
6: bb  b  5

The second parameter of the merge subsets only the unique combinations and assigns a unique value to each unique row, and then the merge brings it back to the full dataset.

Combining columns to preserve uniqueness

1 Answers1