10

With the function stringdist, I can calculate the Levenshtein distance between strings : it counts the number of deletions, insertions and substitutions necessary to turn a string into another. For instance, stringdist("abc abc","abcd abc") = 1 because "d" was inserted in the second string.

Is it possible to know the operations made to obtain the Levenshtein distance between two strings ? Or else to know the characters that are different between the 2 strings (in this example, only "d")? Thanks.

library(stringdist)
stringdist("abc abc","abcde acc") = 3

I would like to know that :

  • "d" was inserted

  • "e" was inserted

  • "b" was substitued into "c"

Or more simply, I would like to have the list ("d","e","c").

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
yaki
  • 125
  • 7
  • 1
    The Levenshtein editing paths are not necessarily unique: ofthen there are multiple equally long, and minimal sequences of edits that lead from one string to another. –  Jul 26 '19 at 07:19

4 Answers4

10

With adist(), you can retrieve the operations:

drop(attr(adist("abc abc","abcde acc", count = TRUE), "counts"))

ins del sub 
  2   0   1 

From ?adist:

If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively.

tmfmnk
  • 38,881
  • 4
  • 47
  • 67
  • Thanks it helps me a lot! Do you know if there is a function to directly know the characters corresponding to these operations ? Else, I could try to create a function using ```attr(adist("abda cc","abc abc", count = TRUE),"trafos") #= "MMSDMSIM"``` where ```M=match, S=substitute, D=delete, I=insert``` – yaki Jun 30 '19 at 20:53
  • 2
    Don't know about any handy function that will do it. However, I assume that playing around `trafos` will lead you to the desired results. – tmfmnk Jun 30 '19 at 20:57
10

This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings as well as the so-called traceback, which allows you to reconstruct the alignment.

Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) are implemented in the R package {Biostrings}, which is part of Bioconductor.

Since this package implements are more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:

library(Biostrings)

dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')

result = pairwiseAlignment(
    "abc abc", "abcde acc",
    substitutionMatrix = dist_mat,
    gapOpening = 1, gapExtension = 1
)

This won’t simply give you the list c('b', 'c', 'c'), though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:

score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a"  "b"  "c"  "-"  "-"  " "  "a"  "b"  "c"
aligned(result)

— For each character in the second string it provides the corresponding character in the original string, replacing inserted characters by -. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • 1
    Unfortunately the code above requires you to specify `dist_mat` manually such that it contains one row and column for each character that your string might contain. The code shown in this answer thus only allows lower-case letters and spaces, nothing else. – Konrad Rudolph Jun 30 '19 at 22:19
0

Here is the code which extracts the number of changes of each type, and then the corresponding characters for each type of operation:

source_string="12234"
target_string="02345"
lev=adist(source_string,target_string,count=T)

#number of operations of each kind
attributes(lev)$counts[,,"ins"] 
attributes(lev)$counts[,,"del"]
attributes(lev)$counts[,,"sub"]

substitution_bank=deletion_bank=insertion_bank=match_bank=NULL

changes<-strsplit(attributes(lev)$trafos, "")[[1]]

counter_source=counter_target=1
for(j in changes){
 if(j=="S") {
   substitution_bank=rbind(substitution_bank,
           cbind(strsplit(source_string,"")[[1]][counter_source], strsplit(target_string,"")[[1]][counter_target]))
   counter_source=counter_source+1
   counter_target=counter_target+1
 }
 if(j=="I") {
   insertion_bank=rbind(insertion_bank,
                           strsplit(target_string,"")[[1]][counter_target])
   counter_target=counter_target+1
 }
 if(j=="D") {
   deletion_bank=rbind(deletion_bank,
                        strsplit(source_string,"")[[1]][counter_source])
   counter_source=counter_source+1
 }
 if(j=="M") {
   match_bank=rbind(match_bank,
                           strsplit(source_string,"")[[1]][counter_source])
   counter_source=counter_source+1
   counter_target=counter_target+1
 }
 

}

substitution_bank
deletion_bank
insertion_bank
match_bank

Honestly, I'm ashamed of the code -- it seems wasteful to go one character at a time. But in the presence of both insertions and deletions, I couldn't figure out how to extract the right characters... So more elegant answers are welcome!

A Cristia
  • 31
  • 4
0

Investigating the comment by @tmfmnk to look at trafos led me to the following approach.

Function to return parts of string where edit_string is equal to match:

character_match = function(string, edit_string, match, drop = NA){
  # convert to array
  string = strsplit(string, "")[[1]]
  edit_string = strsplit(edit_string, "")[[1]]
  
  if(!is.na(drop)){
    edit_string = edit_string[edit_string != drop]
  }
  
  if(length(string) != length(edit_string)){
    stop("string and edit_string are different lengths")
  }
  
  output = rep("_", length(edit_string))
  is_match = edit_string == match
  output[is_match] = string[is_match]
  
  output = paste0(output, collapse = "")
  return(output)
}

Applying it to this problem:

s1 = "123456789"
s2 = "0123zz67"

out = adist(s1, s2, counts = TRUE)
edit_string = drop(attr(out, "trafos"))

Now the edit string will include the letter codes:

  • I = insert
  • M = match
  • S = substitute
  • D = delete

We can extract these with our function as follows:

# characters in s1 that match s2
character_match(s1, edit_string, "M", "I")
# "123__67__"

# characters in s1 that were substituted out
character_match(s1, edit_string, "S", "I")
# "___45____"

# characters in s1 that were deleted
character_match(s1, edit_string, "D", "I")
# "_______89"

# characters in s2 that match s1
character_match(s2, edit_string, "M", "D")
# "_123__67"

# characters in s2 that were substituted in
character_match(s2, edit_string, "S", "D")
# "____zz__"

# characters in s2 that were inserted
character_match(s2, edit_string, "I", "D")
# "0_______"

From here it is easy to see which characters and position were inserted, deleted, or substituted.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41