I have extremely messy data. A portion of it looks like the following example.
x1_01=c("bearing_coordinates", "bearing_coordinates", "bearing_coordinates", "roadkill")
x1_02=c(146,122,68,1)
x2_01=c("tree_density","animals_on_road","animals_on_road", "tree_density")
x2_02=c(13,2,5,11)
x3_01=c("animals_on_road", "tree_density", "roadkill", "bearing_coordinates")
x3_02=c(3,10,1,1000)
x4_01=c("roadkill","roadkill", "tree_density", "animals_on_road")
x4_02=c(1,1,12,6)
testframe = data.frame(x1_01 = x1_01,x1_02=x1_02,x2_01=x2_01, x2_02=x2_02, x3_01=x3_01, x3_02=x3_02, x4_01=x4_01, x4_02=x4_02)
x1_01 x1_02 x2_01 x2_02 x3_01 x3_02 x4_01
1 bearing_coordinates 146 tree_density 13 animals_on_road 3 roadkill
2 bearing_coordinates 122 animals_on_road 2 tree_density 10 roadkill
3 bearing_coordinates 68 animals_on_road 5 roadkill 1 tree_density
4 roadkill 1 tree_density 11 bearing_coordinates 1000 animals_on_road
x4_02
1 1
2 1
3 12
4 6
I noticed when using dplyr
spread
that if I spread x1_01 and x1_02 on the initial datasheet, e.g.
test <- testframe %>%
spread(x1_01, x1_02)
and then used spread
on that dataframe for x2_01 and x2_02, e.g.
testtest <- test %>%
spread(x2_01, x2_02)
that the second "bearing_coordinates" column would replace the original column, and result in NAs where there were values. To get around that, I went down the route of creating multiple dataframes and merging them together, e.g.
test <- testframe %>%
spread(x1_01, x1_02) %>%
mutate(id = row_number())
test2 <- testframe %>%
spread(x2_01, x2_02) %>%
mutate(id = row_number())
test3 <- testframe %>%
spread(x3_01, x3_02) %>%
mutate(id = row_number())
test4 <- testframe %>%
spread(x4_01, x4_02) %>%
mutate(id = row_number())
merge_test <- merge(test, test2, by="id")
merge_test2 <- merge(merge_test, test3, by ="id")
merge_test3 <- merge(merge_test2, test4, by = "id")
This (long-winded) approach is ok if it is a small dataset, like the test data I have supplied. However, as variables increase (x5_01, x5_02, x5_01, x5_02, etc) columns begin getting duplicated and deleting the previous columns named e.g. "bearing_coordinates", which results in loss of data. My question is, is there a way to do this where the data pivots from long to wide, and as it moves across variables, into one logical key:value column, so that all values associated with "bearing_coordinates" are in that column? The data should then look like this:
bearing_coordinates=c(146,122,68,1000)
roadkill=c(1,1,1,1)
tree_density=c(13,10,12,11)
animals_on_road=c(3,2,5,6)
id=c(1,2,3,4)
clean.data = data.frame(bearing.coordinates=bearing_coordinates,roadkill=roadkill,tree_density=tree_density,animals_on_road=animals_on_road,id=id)
bearing_coordinates roadkill tree_density animals_on_road id
1 146 1 13 3 1
2 122 1 10 2 2
3 68 1 12 5 3
4 1000 1 11 6 4
I assume there must be a way to do this surprisingly easily in dplyr
, but I rarely have data this messy and so am at a bit of loss as to what tools will accomplish this.
I've been looking through the dplyr
documentation and SO posts and everything seems to be almost what I'm looking for but not quite right. For example, this post indicates that there could be a different strategy of taking "bearing.coordinates.x" and "bearing.coordinates.y" and then making those columns have duplicate names before finally merging them with no loss of data. However, that looks like it could be even more long-winded (particularly with multiple key:value pairs, as in my real dataset) and also potentially prone to error. I've also looked at filter
as perhaps being a good option, but it seems to still hit that issue of columns deleting each other, and results in a necessary extra coding step to keep all the rest of the data.
Thank you in advance for help.
EDIT: Ben's answer below is correct, but I initially inaccurately represented the variables as being separated by "." and not "_" as they are in my real data. This could be addressed by simply changing the regex to (.*)_(.*)
, so:
testframe %>%
pivot_longer(cols = everything(), names_to = c("name", ".value"), names_pattern = "(.*)_(.*)") %>%
select(-name) %>%
pivot_wider(names_from = "01", values_from = "02", values_fn = list) %>%
unnest(cols = everything())
This is a really beautiful and elegant solution. Thank you Ben!