-1

My data are in the following wide format, in rows according to SUBJECT_ID, with a total of observations of variables X and Y and then various columns of metadata e.g. SUBJECT_BIRTHYEAR, SUBJECT_HOMETOWN:

variableX    variableY    SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
2            1            A              1950                  Townsville
1            2            B              1951                  Villestown

I would like to transfom these into the following long format, where for each observation of variable X and Y for each SUBJECT_ID:

VARIABLE     SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
X            A              1950                  Townsville
X            A              1950                  Townsville
Y            A              1950                  Townsville
X            B              1951                  Villestown
Y            B              1951                  Villestown
Y            B              1951                  Villestown

Specific to my question is how to transform n observations of a continuous variable into n rows of categorical data.

Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
fhng
  • 123
  • 4
  • 2
    Possible duplicate of [Reshaping data.frame from wide to long format](https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format) – duckmayr Nov 24 '17 at 14:15
  • @duckmayr Yes, they are similar questions, but specific to my question is the query of how to transform *n* observations of a continuous variable into *n* rows of categorical data. I've added an edit to my question to make this clear. – fhng Nov 24 '17 at 14:49

3 Answers3

1

Try the following

Data

df <- read.table(text="variableX    variableY    SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
2            1            A              1950                  Townsville
1            2            B              1951                  Villestown", header=TRUE)

Solution

library(tidyverse)
result <- df %>%
        nest(variableX, variableY, .key="VARIABLE") %>%
        mutate(VARIABLE = map(VARIABLE, function(i) {
                                    vec <- unlist(i)
                                    rep(gsub("variable", "", names(vec)), times=vec)
                                })) %>%
        unnest()

# A tibble: 6 x 4
  # SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN VARIABLE
      # <fctr>             <int>           <fctr>    <chr>
# 1          A              1950       Townsville        X
# 2          A              1950       Townsville        X
# 3          A              1950       Townsville        Y
# 4          B              1951       Villestown        X
# 5          B              1951       Villestown        Y
# 6          B              1951       Villestown        Y
CPak
  • 13,260
  • 3
  • 30
  • 48
  • Thank you very much for this answer! One point of clarification: is it possible to implement this as a function (so I can call it on a list of multiple dataframes) or is there a conflict in how `%>%` is handled? I haven't been successful and am not sure why. – fhng Nov 25 '17 at 22:12
  • It's likely not an issue with `%>%` but with how column-names are dealt with. You could either look up previous posts on `non-standard evaluation` or post as another question. – CPak Nov 25 '17 at 22:28
1

The question asks for inverting a call to dcast() which has reshaped the data from long to wide format using length() as aggregation function.

This can be achieved by a call to melt() plus some additional transformations:

library(data.table)
# reshape wide back to long format
long <- melt(setDT(wide), measure.vars = c("variableX", "variableY"))[
  # undo munging of variable names
  , variable := stringr::str_replace(variable, "^variable", "")][]
# undo effect of aggregation by length()
result <- long[long[, rep(.I, value)]][
  # beautify result
  order(SUBJECT_ID), !"value"]
result
   SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variable
1:          A              1950       Townsville        X
2:          A              1950       Townsville        X
3:          A              1950       Townsville        Y
4:          B              1951       Villestown        X
5:          B              1951       Villestown        Y
6:          B              1951       Villestown        Y

.I is a special symbol which holds the row location, i.e., row index.


To demonstrate that this is indeed the inverse operation, result can be reshaped again to reproduce wide:

dcast(result, ... ~ paste0("variable", variable), length, value.var = "variable")
   SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variableX variableY
1:          A              1950       Townsville         2         1
2:          B              1951       Villestown         1         2

Data

library(data.table)
wide <- fread("variableX    variableY    SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
2            1            A              1950                  Townsville
1            2            B              1951                  Villestown")
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

Here is an option using base R

res <- cbind(VARIABLE = rep(substr(names(df1)[1:2], 9, 9)[row(df1[1:2])], t(df1[1:2])), 
        df1[rep(seq_len(nrow(df1)), rowSums(df1[1:2])), -(1:2)])
row.names(res) <- NULL
res
#   VARIABLE SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
#1        X          A              1950       Townsville
#2        X          A              1950       Townsville
#3        Y          A              1950       Townsville
#4        X          B              1951       Villestown
#5        Y          B              1951       Villestown
#6        Y          B              1951       Villestown
akrun
  • 874,273
  • 37
  • 540
  • 662