0

I am working on a NoSQL data, which I need to pivot in R.

Sample data:

structure(list(timestamp = structure(c(1595709882, 1595709882, 
1595709931, 1595709931, 1595710021, 1595710023, 1595710023, 1595710027, 
1595710157, 1595710157, 1595710277, 1595710277, 1595710337, 1595710337, 
1595710397, 1595710397, 1595710457, 1595710457, 1595710517, 1595710517
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), value = c("3000", 
"160", "160", "3000", "6000", "6000", "160", "6000", "6000", 
"160", "160", "6000", "6000", "160", "6000", "160", "6000", "160", 
"6000", "160"), variable = c("ENGINE_RPM", "VEHICLE_SPEED", "VEHICLE_SPEED", 
"ENGINE_RPM", "ENGINE_RPM", "ENGINE_RPM", "VEHICLE_SPEED", "ENGINE_RPM", 
"ENGINE_RPM", "VEHICLE_SPEED", "VEHICLE_SPEED", "ENGINE_RPM", 
"ENGINE_RPM", "VEHICLE_SPEED", "ENGINE_RPM", "VEHICLE_SPEED", 
"ENGINE_RPM", "VEHICLE_SPEED", "ENGINE_RPM", "VEHICLE_SPEED")), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))


timestamp          value    variable
7/25/2020 20:44:42  3000    ENGINE_RPM
7/25/2020 20:44:42  160 VEHICLE_SPEED
7/25/2020 20:45:31  160 VEHICLE_SPEED
7/25/2020 20:45:31  3000    ENGINE_RPM
7/25/2020 20:47:01  6000    ENGINE_RPM
7/25/2020 20:47:03  6000    ENGINE_RPM
7/25/2020 20:47:03  160 VEHICLE_SPEED
7/25/2020 20:47:07  6000    ENGINE_RPM
7/25/2020 20:49:17  6000    ENGINE_RPM
7/25/2020 20:49:17  160 VEHICLE_SPEED
7/25/2020 20:51:17  160 VEHICLE_SPEED
7/25/2020 20:51:17  6000    ENGINE_RPM
7/25/2020 20:52:17  6000    ENGINE_RPM
7/25/2020 20:52:17  160 VEHICLE_SPEED
7/25/2020 20:53:17  6000    ENGINE_RPM
7/25/2020 20:53:17  160 VEHICLE_SPEED
7/25/2020 20:54:17  6000    ENGINE_RPM
7/25/2020 20:54:17  160 VEHICLE_SPEED
7/25/2020 20:55:17  6000    ENGINE_RPM
7/25/2020 20:55:17  160 VEHICLE_SPEED

If we look at the sample data, certain timestamps have both RPM and SPEED, whereas few timestamps have just one of them.

I need those rows that have 2 timestamps as they have both the vehicle speed and RPM that I can later pivot to see at a particular time, what was the vehicle's speed and its engine RPM.

The output I am looking at is:

timestamp           ENGINE_RPM  VEHICLE_SPEED
7/25/2020 20:44:42  3000        160
7/25/2020 20:45:31  3000        160
7/25/2020 20:47:03  6000        160
7/25/2020 20:49:17  6000        160
7/25/2020 20:51:17  6000        160
7/25/2020 20:52:17  6000        160
7/25/2020 20:53:17  6000        160
7/25/2020 20:54:17  6000        160
7/25/2020 20:55:17  6000        160

The query I used is:

data %>% group_by(timestamp, variable, value) %>% 
        mutate(row = row_number()) %>% filter(n() == 2) %>% 
          pivot_wider(names_from = variable, values_from = value) %>% select(-row)

The output I am getting is:

# A tibble: 8 x 3
# Groups:   timestamp [4]
  timestamp           VEHICLE_SPEED ENGINE_RPM
  <dttm>              <chr>         <chr>     
1 2020-08-05 16:09:02 5             NA        
2 2020-08-05 16:09:02 5             NA        
3 2020-08-06 18:32:33 15            NA        
4 2020-08-06 18:32:33 15            NA        
5 2020-08-06 18:32:52 25            NA        
6 2020-08-06 18:32:52 25            NA        
7 2020-08-07 12:03:53 NA            1500      
8 2020-08-07 12:03:53 NA            1500      
> 

Could someone let me know how to get the desired output.

s_baldur
  • 29,441
  • 4
  • 36
  • 69
Karthik S
  • 11,348
  • 2
  • 11
  • 25

2 Answers2

1

You can use the na.omit function after using pivot_wider to make your data wider:

dat %>%
    pivot_wider(names_from = variable, values_from = value) %>%
    na.omit()

  timestamp           ENGINE_RPM VEHICLE_SPEED
  <dttm>              <chr>      <chr>        
1 2020-07-25 20:44:42 3000       160          
2 2020-07-25 20:45:31 3000       160          
3 2020-07-25 20:47:03 6000       160          
4 2020-07-25 20:49:17 6000       160          
5 2020-07-25 20:51:17 6000       160          
6 2020-07-25 20:52:17 6000       160          
7 2020-07-25 20:53:17 6000       160          
8 2020-07-25 20:54:17 6000       160          
9 2020-07-25 20:55:17 6000       160 
bouncyball
  • 10,631
  • 19
  • 31
  • I am getting this error: Warning message: Values in `value` are not uniquely identified; output will contain list-cols. * Use `values_fn = list(value = list)` to suppress this warning. * Use `values_fn = list(value = length)` to identify where the duplicates arise * Use `values_fn = list(value = summary_fun)` to summarise duplicates – Karthik S Aug 18 '20 at 14:55
  • is it possible that you have multiple rows with the same variable for a single timestamp? you can check this with `count(dat, timestamp, variable)` – bouncyball Aug 18 '20 at 15:26
  • Yes, that's why I mentioned "certain timestamps have both RPM and SPEED, whereas few timestamps have just one of them" in my question. A timestamp can have one row for RPM and another row for SPEED. But I need these rows so that I can then pivot in such a way that I'll have one column each for timestamp, RPM and SPEED. – Karthik S Aug 18 '20 at 15:33
  • no, I am talking about the case where you would have multiple rows for __the same variable__ for a single timestamp. for example, for timestamp 1, you could have two rows for RPM. how many rows does `count(dat, timestamp, variable) %>% filter(n > 1)` return? – bouncyball Aug 18 '20 at 15:46
  • Ohk, so I ran count(data, timestamp, variable), I got slightly more than 30k rows, for 60 odd rows I get value of "n" as 2. Am assuming it means one timestamp could have two rows for RPM or SPEED? – Karthik S Aug 18 '20 at 15:53
  • Yes, that it was that means. You have some timestamps with two rows for RPM and/or SPEED. So then when you try to convert your data to "wide", `R` doesn't know which row it should use. – bouncyball Aug 18 '20 at 15:55
  • Is there anyway I can remove such rows in the original code? – Karthik S Aug 18 '20 at 15:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220044/discussion-between-bouncyball-and-karthik-s). – bouncyball Aug 18 '20 at 16:00
0

You can try this

library(tidyr)
library(dplyr)
df2 <- df %>% 
  distinct(.) %>% 
  pivot_wider(names_from = variable, values_from = value) %>% 
  filter(!is.na(VEHICLE_SPEED))

Or

df2 <- df %>% 
  distinct(.) %>% 
  spread(variable, value) %>% 
  filter(!is.na(VEHICLE_SPEED))
    # timestamp           ENGINE_RPM VEHICLE_SPEED
# <dttm>              <chr>      <chr>        
# 1 2020-07-25 20:44:42 3000       160          
# 2 2020-07-25 20:45:31 3000       160          
# 3 2020-07-25 20:47:03 6000       160          
# 4 2020-07-25 20:49:17 6000       160          
# 5 2020-07-25 20:51:17 6000       160          
# 6 2020-07-25 20:52:17 6000       160          
# 7 2020-07-25 20:53:17 6000       160          
# 8 2020-07-25 20:54:17 6000       160          
# 9 2020-07-25 20:55:17 6000       160 
Tho Vu
  • 1,304
  • 2
  • 8
  • 20
  • I am getting same error as above for option 1) for option 2) I am getting this error: Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 118 rows: – Karthik S Aug 18 '20 at 14:57
  • Let's me clarify a little bit. Did you just run these line of codes or you add some more lines? Or did you run the necessary libraries? – Tho Vu Aug 18 '20 at 15:03
  • I ran what you mentioned, didn't get "did you run the necessary libraries". – Karthik S Aug 18 '20 at 15:05
  • `library(tidyr)` and `library(dplyr)` are the ones I mentioned. I think you can check once more time, maybe you miss something. – Tho Vu Aug 18 '20 at 15:09
  • Yes I've loaded those libraries. – Karthik S Aug 18 '20 at 15:18
  • How about adding `distinct(.) %>% ` ?. Does it work in your case? See my update – Tho Vu Aug 18 '20 at 16:02