My dataframes look like this:

library(dplyr)

N = 10

a <- c("a", "b", "h", "d", "e", "a", "c", "b", "d", "f")
b <- c("a", "b", "h", "d", "e", "z", "d", "g", "h", "q")
z <- rnorm(N) 

df1 <- data.frame(a, b, z) 
df2 <- data.frame(a, b, z)

df1 <- df1 %>% mutate(year = 2012) 
df2 <- df2 %>% mutate(year = 2013) 

I'm trying to create a unique ID for each observation using this code:

dfs <- c("df1", "df2")

for (i in dfs){

assign(i, get(i) %>%
           group_by(a, b, z) %>% 
           mutate(id = cur_group_id()))
}

However, I find that even though the values for a and b are the same across df1 and df2, the observations get different ids. Ideally, the id for the first 5 observations should be the same in df1 and df2. Is there a way to make sure that observations with the same values for a and b have the same id across the dataframes?

anrisakaki96
  • 3
    `rnorm` returns floating point numbers, and the probability of any two values being equal is practically 0. Are your actual values more like integers? Because testing for matching numbers with decimal values is kind of tricky. – MrFlick May 03 '23 at 16:55
  • 3
    I cringe when I see patterns using `get` and `assign` when the frames are all similar/identical _structure_. See https://stackoverflow.com/a/24376207/3358227 for discussions on how to work on a list-of-frames, generally much easier once you understand the application of `lapply` here. – r2evans May 03 '23 at 16:56
  • 3
    When I run your code, `identical(df1$id, df2$id)` gives `TRUE`, so I do not see the issue you see. But I also strongly agree with the points of MrFlick and r2evans. If your real data is not decimal numbers, then there might be other good solutions, e.g., you could assign IDs in one data frame and then use a join to add the ID column to the other data frame. – Gregor Thomas May 03 '23 at 16:58
  • Hi, that's been edited now :) – anrisakaki96 May 03 '23 at 21:09
  • @anrisakaki96 you say `group_by(a, b, c)` but there is no `c` - do you just want `group_by(a, b)`? Could you provide your desired output? – jpsmith May 03 '23 at 22:34
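
Pulling the comments together, one possible approach is to keep the frames in a list, build a single lookup table of every distinct (a, b) pair, and join the id back onto each frame. This is only a sketch under assumptions: it assumes the id should depend on a and b alone (not z), the second toy frame is invented for illustration, and `frames` and `key_table` are hypothetical names that do not appear in the question.

```r
library(dplyr)

# Toy data loosely mirroring the question; z is left out because it is
# assumed not to be part of the key
df1 <- data.frame(
  a = c("a", "b", "h", "d", "e", "a", "c", "b", "d", "f"),
  b = c("a", "b", "h", "d", "e", "z", "d", "g", "h", "q"),
  year = 2012
)
# Hypothetical second frame: same first five pairs, plus one new pair
df2 <- data.frame(
  a = c("a", "b", "h", "d", "e", "x"),
  b = c("a", "b", "h", "d", "e", "y"),
  year = 2013
)

# Keep the frames in a named list instead of using get()/assign()
frames <- list(df1 = df1, df2 = df2)

# One lookup table with every distinct (a, b) pair seen in any frame
key_table <- bind_rows(frames) %>%
  distinct(a, b) %>%
  arrange(a, b) %>%
  mutate(id = row_number())

# Attach the shared id to each frame; left_join keeps the row order of x
frames <- lapply(frames, left_join, y = key_table, by = c("a", "b"))
```

Because the id comes from one shared table, a pair that exists in only one frame cannot shift the ids of the other pairs. The individual frames are then available as `frames$df1` and `frames$df2`.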

1 Answer

I tested your code and it seems to work fine for me:

library(dplyr)
N = 100
a <- rnorm(N) 
b <- rnorm(N) 
c <- rnorm(N) 
df1 <- data.frame(a, b, c) 
df2 <- data.frame(a, b, c)
df1 <- df1 %>% mutate(year = 2012) 
df2 <- df2 %>% mutate(year = 2013) 

df2 <- df2[sample(nrow(df2)), ]  # shuffle the rows of df2

dfs <- c("df1", "df2")

for (i in dfs){
  
  assign(i, get(i) %>%
           group_by(a, b, c) %>% 
           mutate(id = cur_group_id()))
}


full_join(df1,df2,by="id")
#> # A tibble: 100 × 9
#>        a.x     b.x    c.x year.x    id     a.y     b.y    c.y year.y
#>      <dbl>   <dbl>  <dbl>  <dbl> <int>   <dbl>   <dbl>  <dbl>  <dbl>
#>  1  0.772  -0.405  -0.580   2012    73  0.772  -0.405  -0.580   2013
#>  2  0.931  -0.511  -0.312   2012    82  0.931  -0.511  -0.312   2013
#>  3  0.347   0.252  -0.317   2012    57  0.347   0.252  -0.317   2013
#>  4  0.0824  0.845   0.105   2012    47  0.0824  0.845   0.105   2013
#>  5 -0.0472 -2.32   -0.581   2012    44 -0.0472 -2.32   -0.581   2013
#>  6 -0.0556 -0.704  -0.459   2012    43 -0.0556 -0.704  -0.459   2013
#>  7 -0.321  -1.04   -0.159   2012    34 -0.321  -1.04   -0.159   2013
#>  8  0.408  -1.69    2.03    2012    60  0.408  -1.69    2.03    2013
#>  9 -1.35    0.0954  0.174   2012     8 -1.35    0.0954  0.174   2013
#> 10 -1.21   -1.71    0.321   2012     9 -1.21   -1.71    0.321   2013
#> # … with 90 more rows

Created on 2023-05-03 with reprex v2.0.2
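
A possible reason the reprex above shows no mismatch is that both frames contain exactly the same set of key combinations. `cur_group_id()` numbers the groups within a single data frame, in the sorted order of the grouping keys, so the ids can drift apart as soon as one frame contains a combination the other lacks. The small sketch below illustrates this with made-up data; `id1`, `id2`, and `cmp` are hypothetical names.

```r
library(dplyr)

# df2 contains one extra (a, b) pair that sorts between the shared pairs
df1 <- data.frame(a = c("a", "b"), b = c("a", "b"))
df2 <- data.frame(a = c("a", "a", "b"), b = c("a", "z", "b"))

# Same per-frame id logic as in the question, but keyed on a and b only
id1 <- df1 %>% group_by(a, b) %>% mutate(id = cur_group_id()) %>% ungroup()
id2 <- df2 %>% group_by(a, b) %>% mutate(id = cur_group_id()) %>% ungroup()

# Put the ids of the shared pairs side by side
cmp <- inner_join(id1, id2, by = c("a", "b"), suffix = c("_df1", "_df2"))
cmp
```

In `cmp`, the pair ("a", "a") has the same id in both frames, while ("b", "b") does not, because the extra pair ("a", "z") takes the second slot in df2.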

Wael
  • 1
    This isn’t an answer, it should be a comment – jpsmith May 03 '23 at 21:19
  • Just a reprex, as I couldn't reproduce the problem! It could not be posted as a comment; to be deleted if anrisakaki96 produces a reprex or provides some feedback. – Wael May 03 '23 at 21:28