1

I have extremely messy data. A portion of it looks like the following example.

x1_01=c("bearing_coordinates", "bearing_coordinates", "bearing_coordinates", "roadkill")
x1_02=c(146,122,68,1)
x2_01=c("tree_density","animals_on_road","animals_on_road", "tree_density")
x2_02=c(13,2,5,11)
x3_01=c("animals_on_road", "tree_density", "roadkill", "bearing_coordinates")
x3_02=c(3,10,1,1000)
x4_01=c("roadkill","roadkill", "tree_density", "animals_on_road")
x4_02=c(1,1,12,6)
testframe = data.frame(x1_01 = x1_01,x1_02=x1_02,x2_01=x2_01, x2_02=x2_02, x3_01=x3_01, x3_02=x3_02, x4_01=x4_01, x4_02=x4_02)

            x1_01      x1_02        x2_01    x2_02           x3_01     x3_02           x4_01
1 bearing_coordinates   146    tree_density    13     animals_on_road     3        roadkill
2 bearing_coordinates   122 animals_on_road     2        tree_density    10        roadkill
3 bearing_coordinates    68 animals_on_road     5            roadkill     1    tree_density
4            roadkill     1    tree_density    11 bearing_coordinates  1000 animals_on_road
  x4_02
1     1
2     1
3    12
4     6

I noticed when using dplyr spread that if I spread x1_01 and x1_02 on the initial datasheet, e.g.

test <- testframe %>% 
  spread(x1_01, x1_02)

and then used spread on that dataframe for x2_01 and x2_02, e.g.

testtest <- test %>% 
      spread(x2_01, x2_02)

that the second "bearing_coordinates" column would replace the original column, and result in NAs where there were values. To get around that, I went down the route of creating multiple dataframes and merging them together, e.g.

  test <- testframe %>% 
  spread(x1_01, x1_02) %>% 
  mutate(id = row_number())
test2 <- testframe %>% 
  spread(x2_01, x2_02)  %>% 
  mutate(id = row_number())
test3 <- testframe %>% 
  spread(x3_01, x3_02)  %>% 
  mutate(id = row_number())
test4 <- testframe %>% 
  spread(x4_01, x4_02)  %>% 
  mutate(id = row_number())

merge_test <- merge(test, test2, by="id")
merge_test2 <- merge(merge_test, test3, by ="id")
merge_test3 <- merge(merge_test2, test4, by = "id")

This (long-winded) approach is ok if it is a small dataset, like the test data I have supplied. However, as variables increase (x5_01, x5_02, x5_01, x5_02, etc) columns begin getting duplicated and deleting the previous columns named e.g. "bearing_coordinates", which results in loss of data. My question is, is there a way to do this where the data pivots from long to wide, and as it moves across variables, into one logical key:value column, so that all values associated with "bearing_coordinates" are in that column? The data should then look like this:

bearing_coordinates=c(146,122,68,1000)
roadkill=c(1,1,1,1)
tree_density=c(13,10,12,11)
animals_on_road=c(3,2,5,6)
id=c(1,2,3,4)
clean.data = data.frame(bearing.coordinates=bearing_coordinates,roadkill=roadkill,tree_density=tree_density,animals_on_road=animals_on_road,id=id)

bearing_coordinates roadkill tree_density animals_on_road id
1                 146        1           13               3  1
2                 122        1           10               2  2
3                  68        1           12               5  3
4                1000        1           11               6  4

I assume there must be a way to do this surprisingly easily in dplyr, but I rarely have data this messy and so am at a bit of loss as to what tools will accomplish this.

I've been looking through the dplyr documentation and SO posts and everything seems to be almost what I'm looking for but not quite right. For example, this post indicates that there could be a different strategy of taking "bearing.coordinates.x" and "bearing.coordinates.y" and then making those columns have duplicate names before finally merging them with no loss of data. However, that looks like it could be even more long-winded (particularly with multiple key:value pairs, as in my real dataset) and also potentially prone to error. I've also looked at filter as perhaps being a good option, but it seems to still hit that issue of columns deleting each other, and results in a necessary extra coding step to keep all the rest of the data.

Thank you in advance for help.

EDIT: Ben's answer below is correct, but I initially inaccurately represented the variables as being separated by "." and not "_" as they are in my real data. This could be addressed by simply changing the regex to (.*)_(.*), so:

testframe %>%
  pivot_longer(cols = everything(), names_to = c("name", ".value"), names_pattern = "(.*)_(.*)") %>%
  select(-name) %>%
  pivot_wider(names_from = "01", values_from = "02", values_fn = list) %>%
  unnest(cols = everything())

This is a really beautiful and elegant solution. Thank you Ben!

E.O.
  • 351
  • 2
  • 14
  • 1
    Can you clarify what your final/desired data frame should look like, based on your example data? If you don't mind, could you edit your question and add that? – Ben Sep 22 '20 at 02:31
  • @Ben my apologies for not including that from the outset! I have made that edit. – E.O. Sep 22 '20 at 15:25

1 Answers1

0

Maybe you might try something like this below. Based on your needs it could be modified further - but a lot depends on what your actual data looks like. This assumes complete key/value pairs, evenly divided.

Would first use pivot_longer to get your keys/values in two columns. Then you can use pivot_wider so that values are placed in the appropriate key columns.

library(tidyr)
library(dplyr)

testframe %>%
  pivot_longer(cols = everything(), names_to = c("name", ".value"), names_pattern = "x(\\d+)_(\\d+)") %>%
  select(-name) %>%
  pivot_wider(names_from = `01`, values_from = `02`, values_fn = list) %>%
  unnest(cols = everything())

Output

  bearing.coordinates tree.density animals.on.road roadkill
                <dbl>        <dbl>           <dbl>    <dbl>
1                 146           13               3        1
2                 122           10               2        1
3                  68           12               5        1
4                1000           11               6        1
Ben
  • 28,684
  • 5
  • 23
  • 45
  • I should have looked at the new documentation! I had no idea pivot had that kind of power (compared to what it looks like spread and gather could do). This code looks perfect, but I have realized that the regex is incorrect because my variables are separated by an underscore rather than a period. I'm sorry for the error- I will make that edit in the post. – E.O. Sep 22 '20 at 16:10