I am trying to figure out a way in R to take the difference of two string vectors, but based only on the first 3 tab-delimited columns of each string. For example, here are list1 and list2:

list1:

    "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
    "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
    "1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"

list2:

    "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
    "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"

I want to do setdiff(list2, list1), so that I get everything in list2 that does not exist in list1, but compared only on the first 3 tab-delimited fields. So in list1 I would only consider:

   "1\t1113200\t1118399"

from the first entry. However, I still want the full string returned; the first 3 columns are used only for the comparison. I am having trouble figuring out how to do this, and any help would be appreciated. I've already looked at several SO posts, but none of them seemed to help.

mks

1 Answer

To extract the first three columns (I'm not sure why you need this as one long string rather than a data frame...), I would use beg2char() from the qdap library. (If the first three columns always have the same total width, base substr() will work fine too.)

library(qdap)
beg2char(list1, '\t', 3) # extracts from the beginning up to the third tab delimiter
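
To make the substr() caveat concrete, here is a small sketch. It uses the first string from the question plus one made-up row (not from the question) whose first field is two characters wide, so the fixed cut-off of 17 characters no longer lines up with the third tab.

```r
# First entry from the question: fields of 1 + 7 + 7 characters plus two tabs = 17
a <- "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
# Made-up row for illustration only: the first field is two characters wide
b <- "10\t1113200\t1118399\t1\t0\n"

key_a <- substr(a, 1, 17)  # exactly the first three fields
key_b <- substr(b, 1, 17)  # stops one character short of the third field's end
```

So substr() is only safe when every key is known to have the same width; otherwise splitting on the delimiter is the more robust choice.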

Then, rather than setdiff() (which, applied to the keys, would return only the keys and not the full strings), I would simply use %in% to check whether the key of each element in list2 matches the key of any element in list1.

beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3) # will give you TRUE/FALSE
list2[!(beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3))]

This returns the full elements of list2 whose first three columns do not appear in list1.
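
If installing qdap is not an option, the same filter can probably be written in base R. The following is only a sketch: key_of() is a hypothetical helper name, and the data are toy strings made up for illustration, assuming tab-delimited fields as in the question.

```r
# Toy data (made up for illustration); fields are tab-delimited as in the question
list1 <- c("1\t100\t200\tgeneA\n",
           "1\t300\t400\tgeneB\n")
list2 <- c("1\t100\t200\tgeneA_alt\n",  # same first 3 fields as list1[1]
           "2\t500\t600\tgeneC\n")      # key not present in list1

# Hypothetical helper: join the first n tab-delimited fields back into one key
key_of <- function(x, n = 3) {
  vapply(strsplit(x, "\t", fixed = TRUE),
         function(parts) paste(head(parts, n), collapse = "\t"),
         character(1))
}

# Keep the full strings of list2 whose key does not occur among list1's keys
only_in_list2 <- list2[!(key_of(list2) %in% key_of(list1))]
only_in_list2
```

Here only the second element of list2 is kept, since the first one shares its first three fields with an entry of list1 even though the rest of the string differs.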

vincentmajor