3

My data frame has the customerID, product and date. I need to create a column that generates a transaction id everytime the customerID or the date changes.

My dataframe currently looks like:

  1. CustomerID-Product--date
  2. 23-------------abv-------12-12-14
  3. 23-------------gfs--------12-12-14
  4. 18-------------gra--------12-12-14
  5. 52-------------wer--------14-12-14
  6. 23-------------qwe-------16-12-14

I need the transactionID column to get populated using r

  1. CustomerID-Product--date----------transactionID
  2. 23-------------abv-------12-12-14----1
  3. 23-------------gfs--------12-12-14----1
  4. 18-------------gra-------12-12-14----2
  5. 52-------------wer-------14-12-14----3
  6. 23-------------qwe------16-12-14----4

I have the data in a csv file. I am not able to create this column in excel as the file is too big and excel is unable to open the entire file.

Thank you for your help!

surser
  • 35
  • 1
  • 5
  • 3
    Are there any other rules we need to know about? Is the "date" column sorted? If you had a case where `CustomerID == 23` and `date == 12-12-14` in row 7, would that be a different transaction ID again or would it be 1? – A5C1D2H2I1M1N2O1R2T1 Dec 18 '14 at 05:44
  • THanks! The customerID == 23 and date == 12-12-14 combination would not appear again in row 7 or later. – surser Dec 18 '14 at 06:54

2 Answers2

2

Assuming that the columns CustomerID and date are ordered,

indx <- as.character(interaction(df[c(1,3)]))
df$transactionID <- cumsum(c(TRUE,indx[-1]!=indx[-length(indx)]))
df$transactionID
#[1] 1 1 2 3 4

Or if the columns are not ordered, for example, suppose there is one more row (6th row) that duplicates the first row

indx1 <- c(indx, indx[1])
as.numeric(factor(indx1, levels=unique(indx1)))
#[1] 1 1 2 3 4 1

Or

match(indx1, unique(indx1))
#[1] 1 1 2 3 4 1

data

df <- structure(list(CustomerID = c(23L, 23L, 18L, 52L, 23L), Product =
c("abv", "gfs", "gra", "wer", "qwe"), date = c("12-12-14", "12-12-14", 
"12-12-14", "14-12-14", "16-12-14")), .Names = c("CustomerID", 
"Product", "date"), class = "data.frame", row.names = c(NA, -5L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Nice alternatives. It would be nicer if we knew what the OP wanted though :-) +1 – A5C1D2H2I1M1N2O1R2T1 Dec 18 '14 at 05:48
  • @AnandaMahto Thanks, I think `match` or `factor` should work whether it is ordered or not. – akrun Dec 18 '14 at 05:49
  • I learned this kind of indexing from you before. But, this is additional information. +1 :) – jazzurro Dec 18 '14 at 07:00
  • @jazzurro Thanks, It works only if the data is ordered. – akrun Dec 18 '14 at 07:02
  • I added that in my note. :) – jazzurro Dec 18 '14 at 07:04
  • @jazzurro I hope someday you will publish a book :-) – akrun Dec 18 '14 at 07:07
  • If SO users work together, we can surely make a decent R book with concrete examples and explanation. That'll add something to our CVs. :) – jazzurro Dec 18 '14 at 07:10
  • 2
    @jazzurro, get something collaborative started with [GitBook](https://www.gitbook.com/) perhaps? I'm sure you already know this, but there are already lots of good R authors active here on SO; my concern would be rehashing content. – A5C1D2H2I1M1N2O1R2T1 Dec 18 '14 at 07:12
  • 1
    @AnandaMahto You just let me realize something important (i.e., Gitbook). I have been missing something big. My shame. As you created `SOfun`, many great R users know what people tend to ask here. Early today, there was a question which got 5-6 answers. I used your `getanID` in my answer. If we can identify some frequent topics like the question for subsetting, string, and etc, and compile possible solutions using various packages, that would be a great document. – jazzurro Dec 18 '14 at 07:23
  • @akrun You've answered so many questions. I guess you can select good contents for the book. :) – jazzurro Dec 18 '14 at 07:30
2

Depending on your responses to my comment, you may also be interested in .GRP from "data.table":

library(data.table)
## In case rows get out of order
DT <- as.data.table(df, keep.rownames = TRUE) 
DT[, transactionID := .GRP, by = list(CustomerID, date)][]
   rn CustomerID Product     date transactionID
1:  1         23     abv 12-12-14             1
2:  2         23     gfs 12-12-14             1
3:  3         18     gra 12-12-14             2
4:  4         52     wer 14-12-14             3
5:  5         23     qwe 16-12-14             4
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • I learned `.GRP` from your answer. Thank you. I'd like to know what `[]` exactly does? I ran the code with and without it, but I see identical results. What is it for? – jazzurro Dec 18 '14 at 07:29
  • @jazzurro, this is the issue we had discussed over email some time back. Depending on the version of "data.table" that you are using, you may not see the output when `:=` is used unless you explicitly call `[]` (which would print the output). See the ["bug fixes" section](https://github.com/Rdatatable/data.table/blob/master/README.md#bug-fixes) of the README file for version 1.9.5, point number 2. – A5C1D2H2I1M1N2O1R2T1 Dec 18 '14 at 07:41
  • Ahhhh that's right! What was I thinking??? Thanks for that again. Yes, I need `[]` to see the outcome. – jazzurro Dec 18 '14 at 07:43