1

I am trying to do a join of a data.table with itself. The condition to join is based on the value of a column (not the key) being used to access a matrix. Each row has a date(in seconds) and records should only join with newest records (t1

cn<-unique(sdd$column)
mat<-matrix(data=0,nrow=lde,ncol=lde,dimnames=list(cn,cn))

I am struggling with the documentation FAQ (including the SQL to data.table analogy) and the Beginner's Guide for data.table and multiple similar questions in this forum but I can't find how to solve it.

Q1, Q2, Q3

I get stuck with notation problems.

My DT is sdd:

> colnames(sdd)
[1] "ID" "DelayStartTimeSeconds" "DelayEndTimeSeconds" "EquipmentID"

I have made several attempts such as:

sdd2<-sdd # to avoid problems with the names of variables
sdd[sdd2,eqXrossM[cbind(sdd.EquipmentID,sdd2.EquipmentID)]==1 & sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds, distance:=sdd.DelayEndTimeSeconds-sdd2.DelayStartTimeSeconds][,distance:=sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds] # that would be the whole thing to do generating a new column with the time difference

sdd[sdd2[ sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds, distance:=sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds]] #this is an approximation attempt.

I simply don't get the notation and the different examples seem to use different notations.

EDIT: Well, after a night of sleep some stuff is making sense... but other is still confusing just the same. For example:

sdd_x<-sdd[sdd2,i.DelayStartTimeSeconds>DelayEndTimeSeconds] # returns a vector as long as sdd: len
sdd_x<-sdd[sdd2,i.DelayStartTimeSeconds>DelayEndTimeSeconds & eqXrossM[i.EquipmentID,EquipmentID]==1] # returns a matrix len x len.

Why adding a new condition changes the type of output? I was expecting a case such as the matrix (that would require optimization) In addition the whole matrix is false what is not the expected value as the records are different. In fact for the second case, either the upper or lower diagonal should be TRUE.

Also, looks like the call to the matrix doesn't require using cbind as other answer to a similar question mentioned. Why that?

And my last discovery has been finding out the CJ() operator but trying to use the i. notation doesn't work here. This part doesn't seem too documented.

sdd[CJ(DASDelayID,DASDelayID),i.DelayStartTimeSeconds>DelayEndTimeSeconds]

Any help would be appreciated.

Community
  • 1
  • 1
Picarus
  • 760
  • 1
  • 10
  • 25
  • 1
    add a _simple_ reproducible example that illustrates your confusion – eddi Mar 17 '15 at 18:17
  • @eddi, I have shown a couple of attempts to code it, I am getting errors sdd.EndTime not found. So I really don't know how to refer to the columns for each of the DT in the join and definitely I think the trick of having the DT with a second name is unnecessary. How should I refer to the variables in the first and second DT? – Picarus Mar 17 '15 at 21:36
  • 2
    use `i.` to refer to the `i-expression` `data.table`, e.g. `dt1[dt2, col.from.dt1 := i.col.from.dt2]` – eddi Mar 17 '15 at 21:42
  • @eddi, could you give me any more hints to continue solving the problem? I am almost there... Thanks in advance – Picarus Mar 18 '15 at 11:43
  • I'm afraid I don't know what you're doing or even trying to do. It does look like you're a bit confused about R's recycling though - check out the output of `1:5 * matrix(1:25, nrow = 5)` - once you understand what happens there it should at least clear out some of your confusion (also search "R recycling" on google). – eddi Mar 18 '15 at 15:23

1 Answers1

0

This is how I finally solved the problem:

sddx<<-CJ(ID1=sdd$DASDelayID,ID2=sdd$DASDelayID)[
    ID1<ID2] [,
              ':='(Connected=eqXrossM[cbind(sdd[DASDelayID==ID2,EquipmentID],sdd[DASDelayID==ID1,EquipmentID])]==1,
                   Distance=as.integer(sdd[DASDelayID==ID2,DelayStartTimeSeconds]-sdd[DASDelayID==ID1,DelayEndTimeSeconds]))
              ]

Step by step:

Generate all the combinations of DelayID, the number is large but each row has only two columns integers.

sddx<<-CJ(ID1=sdd$DASDelayID,ID2=sdd$DASDelayID) 

This cuts the size to half, since ID1 are given as they are created, ordered by DelayStartTime and DelayEndTime>DelayStartTime.

[ID1<ID2] 

This enforces the external condition accessing the matrix, note the cbind:

[,':='(Connected=eqXrossM[cbind(sdd[DASDelayID==ID2,EquipmentID],sdd[DASDelayID==ID1,EquipmentID])]==1,

This calculates the distance between Delays, that can be used to filter the ones where it is not strictly positive

Distance=as.integer(sdd[DASDelayID==ID2,DelayStartTimeSeconds]-sdd[DASDelayID==ID1,DelayEndTimeSeconds]))  
              ]

I hope it helps someone else.

Picarus
  • 760
  • 1
  • 10
  • 25