Creating a column in r that auto-increments based on other columns

Question

My data frame has the customerID, product and date. I need to create a column that generates a transaction id everytime the customerID or the date changes.

My dataframe currently looks like:

CustomerID-Product--date
23-------------abv-------12-12-14
23-------------gfs--------12-12-14
18-------------gra--------12-12-14
52-------------wer--------14-12-14
23-------------qwe-------16-12-14

I need the transactionID column to get populated using r

CustomerID-Product--date----------transactionID
23-------------abv-------12-12-14----1
23-------------gfs--------12-12-14----1
18-------------gra-------12-12-14----2
52-------------wer-------14-12-14----3
23-------------qwe------16-12-14----4

I have the data in a csv file. I am not able to create this column in excel as the file is too big and excel is unable to open the entire file.

Thank you for your help!

Are there any other rules we need to know about? Is the "date" column sorted? If you had a case where `CustomerID == 23` and `date == 12-12-14` in row 7, would that be a different transaction ID again or would it be 1? — A5C1D2H2I1M1N2O1R2T1, Dec 18 '14 at 05:44
THanks! The customerID == 23 and date == 12-12-14 combination would not appear again in row 7 or later. — surser, Dec 18 '14 at 06:54

akrun · Answer 1 · 2014-12-18T05:50:54.867

2

Assuming that the columns CustomerID and date are ordered,

indx <- as.character(interaction(df[c(1,3)]))
df$transactionID <- cumsum(c(TRUE,indx[-1]!=indx[-length(indx)]))
df$transactionID
#[1] 1 1 2 3 4

Or if the columns are not ordered, for example, suppose there is one more row (6th row) that duplicates the first row

indx1 <- c(indx, indx[1])
as.numeric(factor(indx1, levels=unique(indx1)))
#[1] 1 1 2 3 4 1

Or

match(indx1, unique(indx1))
#[1] 1 1 2 3 4 1

data

df <- structure(list(CustomerID = c(23L, 23L, 18L, 52L, 23L), Product =
c("abv", "gfs", "gra", "wer", "qwe"), date = c("12-12-14", "12-12-14", 
"12-12-14", "14-12-14", "16-12-14")), .Names = c("CustomerID", 
"Product", "date"), class = "data.frame", row.names = c(NA, -5L))

edited Dec 18 '14 at 05:50

answered Dec 18 '14 at 05:29

akrun

874,273
37
540
662

1

Nice alternatives. It would be nicer if we knew what the OP wanted though :-) +1 – A5C1D2H2I1M1N2O1R2T1 Dec 18 '14 at 05:48
@AnandaMahto Thanks, I think `match` or `factor` should work whether it is ordered or not. – akrun Dec 18 '14 at 05:49
I learned this kind of indexing from you before. But, this is additional information. +1 :) – jazzurro Dec 18 '14 at 07:00
@jazzurro Thanks, It works only if the data is ordered. – akrun Dec 18 '14 at 07:02
I added that in my note. :) – jazzurro Dec 18 '14 at 07:04
@jazzurro I hope someday you will publish a book :-) – akrun Dec 18 '14 at 07:07
If SO users work together, we can surely make a decent R book with concrete examples and explanation. That'll add something to our CVs. :) – jazzurro Dec 18 '14 at 07:10
2

@jazzurro, get something collaborative started with [GitBook](https://www.gitbook.com/) perhaps? I'm sure you already know this, but there are already lots of good R authors active here on SO; my concern would be rehashing content. – A5C1D2H2I1M1N2O1R2T1 Dec 18 '14 at 07:12
1

@AnandaMahto You just let me realize something important (i.e., Gitbook). I have been missing something big. My shame. As you created `SOfun`, many great R users know what people tend to ask here. Early today, there was a question which got 5-6 answers. I used your `getanID` in my answer. If we can identify some frequent topics like the question for subsetting, string, and etc, and compile possible solutions using various packages, that would be a great document. – jazzurro Dec 18 '14 at 07:23
@akrun You've answered so many questions. I guess you can select good contents for the book. :) – jazzurro Dec 18 '14 at 07:30

score 2 · Answer 2 · answered Dec 18 '14 at 05:47

2

Depending on your responses to my comment, you may also be interested in .GRP from "data.table":

library(data.table)
## In case rows get out of order
DT <- as.data.table(df, keep.rownames = TRUE) 
DT[, transactionID := .GRP, by = list(CustomerID, date)][]
   rn CustomerID Product     date transactionID
1:  1         23     abv 12-12-14             1
2:  2         23     gfs 12-12-14             1
3:  3         18     gra 12-12-14             2
4:  4         52     wer 14-12-14             3
5:  5         23     qwe 16-12-14             4

answered Dec 18 '14 at 05:47

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

I learned `.GRP` from your answer. Thank you. I'd like to know what `[]` exactly does? I ran the code with and without it, but I see identical results. What is it for? – jazzurro Dec 18 '14 at 07:29
@jazzurro, this is the issue we had discussed over email some time back. Depending on the version of "data.table" that you are using, you may not see the output when `:=` is used unless you explicitly call `[]` (which would print the output). See the ["bug fixes" section](https://github.com/Rdatatable/data.table/blob/master/README.md#bug-fixes) of the README file for version 1.9.5, point number 2. – A5C1D2H2I1M1N2O1R2T1 Dec 18 '14 at 07:41
Ahhhh that's right! What was I thinking??? Thanks for that again. Yes, I need `[]` to see the outcome. – jazzurro Dec 18 '14 at 07:43

Creating a column in r that auto-increments based on other columns

2 Answers2

data

Linked