3

I have a table in long format as shown below, every row is unique in this input table :-

 year variable
  2014   ab  
  2014   cd  
  2014   ef 
  2016   ef 
  2016   gh
  2014   ab  
  2014   cd  
  2014   ef 
  2016   ef 
  2016   gh

I would like convert this table into wide format but only for variable column which looks like a contigency matrix. For example - As shown in the output table below, the combination of ab+cd appears ONCE for year 2014, and the combination of ab+ef also appears Once for year 2014. In that way, the first row of my output table clearly shows all the Counts for different combinations variable column of input table for different years.

year    value  ab  cd  ef  gh  
2014    ab     2    2   2   0
2014    cd     2    2   2   0 
2014    ef     2    2   2   0 
2014    ef     0    0   2   2 
2016    gh     0    0   2   2 

I have tried to reshape table quite a few times but have not been able to achieve the result that I want to have. I would very much appreciate if the solution includes use of data.table. Thank you.

chinsoon12
  • 25,005
  • 4
  • 25
  • 35
relu
  • 333
  • 1
  • 3
  • 18

4 Answers4

2

Here is an option using data.table:

vs <- DT[, unique(variable)]
for (x in vs) set(DT, j=x, value=0L)
DT[, (vs) := {
        m <- as.matrix(.SD)
        m[, match(variable, vs)] <- 1L
        as.data.table(m)
    }, year, .SD=vs]
DT

And also:

DT[, (vs) := {
        m <- copy(.SD)
        m[, match(variable, vs)] <- 1L
        m
    }, year, .SD=vs]

output as per problem description (as pointed out by Daniel O and meriops, there are some inconsistencies between the desired output and problem description):

   year variable ab cd ef gh
1: 2014       ab  1  1  1  0
2: 2014       cd  1  1  1  0
3: 2014       ef  1  1  1  0
4: 2016       ef  0  0  1  1
5: 2016       gh  0  0  1  1

data:

library(data.table)
DT <- fread("year variable
2014   ab  
2014   cd  
2014   ef 
2016   ef 
2016   gh")
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Hey, I think it will take time for me to understand the code. I am fairly new to this. Can I know how the code can be changed further to get the numbers same as contigency table instead of 1 or 0? Thanks again. – relu Apr 15 '20 at 08:38
  • i might need some time to understand contingency table. is it possible for you to update your example to show an example? – chinsoon12 Apr 15 '20 at 09:27
  • I changed the input and output data example. As you can see that instead of getting sparse matrix consisting of 1 and 0 from your code. I want to have reduced matrix that includes sum of number of appearance of said combinations. I hope it is clear. Thank you very much again. – relu Apr 15 '20 at 09:38
  • @newbie, what would be the expected output for `fread("year variable\n 2014 ab\n 2014 cd\n 2014 ef\n 2014 ab")`? – chinsoon12 Apr 15 '20 at 10:30
1

Assuming that you want ef-cd cells to be 1 and not 0, here's a slightly tortured approach using igraph and tidyverse. The idea is to create a bipartite graph, find the 1-mode projection, and create an adjacency matrix from that projection:

library(tidyverse)
library(igraph)

df <- tibble(year = c("2014",
                "2014",
                "2014",
                "2016",
                "2016"),
             variable = c("ab",
                          "cd",
                          "ef",
                          "ef",
                          "gh"))

tab <- df %>% 
  group_split(year) %>% 
  map(~ .x %>% 
        graph_from_data_frame(directed = FALSE) %>% 
        set_vertex_attr("type", value = ifelse(V(.)$name %in% .x$year, TRUE, FALSE)) %>% 
        bipartite_projection(which = FALSE) %>% 
        add_edges(rep(1:length(unique(.x$variable)), 2) %>% sort()) %>% 
        as_adjacency_matrix(sparse = FALSE) %>% 
        as_tibble()) %>% 
  bind_rows() %>% 
  mutate_all(coalesce, 0)

cbind(df, tab)
#>   year variable ab cd ef gh
#> 1 2014       ab  1  1  1  0
#> 2 2014       cd  1  1  1  0
#> 3 2014       ef  1  1  1  0
#> 4 2016       ef  0  0  1  1
#> 5 2016       gh  0  0  1  1

Created on 2020-04-14 by the reprex package (v0.3.0)

paqmo
  • 3,649
  • 1
  • 11
  • 21
-1
df <- data.table(df)
df_dcast <- dcast.data.table(df,year~value,fun=length)
Prahlad
  • 118
  • 1
  • 4
  • No, this isn't correct. You are converting whole input table into plain wide format. The output table includes the combination of ab+cd and ab+ef – relu Apr 14 '20 at 12:42
-1

Here is some code in Base R that accomplishes this task

df_new <- t(sapply(unique(df$year), function(X) lapply(unique(df$variable), function(Y)  length(which(df$variable== Y & df$year == X)))))
row.names(df_new) <- unique(df$year)
colnames(df_new) <- unique(df$variable)

output:

     ab cd ef gh
2014 1  1  1  0 
2016 0  0  1  1 

Input Data:

Input = (
  ' year variable
  2014   ab  
  2014   cd  
  2014   ef 
  2016   ef 
  2016   gh')
df = read.table(textConnection(Input), header = T)
Daniel O
  • 4,258
  • 6
  • 20
  • I am interested in combinations of counts within `variable` column and not the counts for overall `year` column. You are missing the second column of my output table shown in question. – relu Apr 14 '20 at 12:35