Extract strings using fuzzy LR patterns in R

Question

I have been struggling with this for a long time.

I can extract everything between my left and right patterns in a string, as the following example shows.

library(tidyverse)

data=c("everything will be ok one day")

str_extract(string = data, pattern = "(?<=thing).*(?=ok one)")
#> [1] " will be "

Created on 2022-01-26 by the reprex package (v2.0.1)

As the code shows, I extract everything between "thing" and "ok one".

I need to allow for mismatches inside these patterns. I want to allow a maximum of two mismatches, and insertions and deletions (indels) should count as well.

Example 1

For example, one mismatch that I want to account for is the insertion of the letter "s" in "everything":

dat.1=c("everythings will be ok one day")

In this case, I would like to be able to extract the phrase

will be

Example 2

dat.2=c("everythingswillbeokoneday")

In this case, I would like to be able to extract the phrase

will be

PS: This is just a simplified example. My actual data does not contain gaps, and it's complicated. I am looking forward to receiving your help and guidance.

It's not perfectly clear. Please provide additional strings in your `data` that suggest the differences you're talking about. Once you've added that, it would help to show what you currently get and contrast that with the exact strings you *want* to get. Thanks! — r2evans, Jan 26 '22 at 23:00

PaulS · Accepted Answer · 2022-01-27T00:27:13.733

One way is to use fuzzy string matching, for instance with the stringdist package: for each delimiter string (thing and ok, in your example), find the word with the highest similarity score. That is what the function maxsim does below.

library(tidyverse)
library(stringdist)

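dat.1 <- "everythings will be ok one day"

# Return the index of the space-separated word most similar to `delim`
maxsim <- function(text, delim)
{
  text %>%
    str_split(" ") %>% unlist %>%
    map_dbl(~ stringsim(delim, .x)) %>%
    which.max
}

# Keep the words strictly between the two best-matching delimiter words
dat.1 %>%
  str_split(" ") %>% unlist %>%
  .[ (maxsim(dat.1, "thing") + 1) : (maxsim(dat.1, "ok") - 1) ] %>%
  str_c(collapse = " ")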

#> [1] "will be"
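
The similarity above is computed against each whole word, so it cannot by itself enforce a limit such as "at most two mismatches": "everythings" is six edits away from "thing" even though it contains it exactly. As a rough sketch of one way to add such a limit, base R's adist() with partial = TRUE can stand in for stringsim(); it counts only the edits needed to find the delimiter somewhere inside each word. The helper closest_word and its max_dist argument are hypothetical names used for illustration and are not part of the answer's code.

```r
# Sketch only: base R's adist() is swapped in for stringdist::stringsim().
dat.1 <- "everythings will be ok one day"
words <- strsplit(dat.1, " ", fixed = TRUE)[[1]]

# Hypothetical helper: index of the word holding the best approximate match
# for `delim`, or NA if even that match needs more than `max_dist` edits.
closest_word <- function(words, delim, max_dist = 2) {
  # partial = TRUE: edits needed to find `delim` inside each word
  d <- drop(adist(delim, words, partial = TRUE))
  best <- which.min(d)
  if (d[best] > max_dist) NA_integer_ else best
}

left  <- closest_word(words, "thing")
right <- closest_word(words, "ok")

# Words strictly between the two delimiter words
between <- paste(words[(left + 1):(right - 1)], collapse = " ")
between
#> [1] "will be"
```

This still assumes the string can be split on spaces, so it does not address strings without gaps.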

edited Jan 27 '22 at 00:27

answered Jan 27 '22 at 00:13

PaulS


Paul, this is a great answer, and I upvoted it for that. My concern is that it only works with a space delimiter. How would this work if there were no gaps? – LDT Jan 27 '22 at 11:08
Thanks, @LDT. It would be best if you provided an example with no gaps. – PaulS Jan 27 '22 at 11:11
You are right Paul! But overall I am so impressed because you had a genius idea – LDT Jan 27 '22 at 11:12
I have just added one example. Thank you again – LDT Jan 27 '22 at 11:14
Thanks, @LDT. I do not think we can do anything without adding some structure to the delimiters. More specifically, `thing` can be `things` and what else? – PaulS Jan 27 '22 at 11:33
I like your approach of splitting the words and then looking at the string distance: pure genius. I have seen something similar in Python: https://stackoverflow.com/questions/14691908/creating-fuzzy-matching-exceptions-with-pythons-new-regex-module. It is interesting because you only indicate how many errors you expect in your pattern, e.g., {e<=1} allows one error. The tricky part is: how do you find a pattern that contains an error in a string that is NOT separated by gaps? Should I ask that as a separate question first? – LDT Jan 28 '22 at 10:38
I believe that `stringdist` also allows us to specify the number of errors, @LDT. However, the crucial point is to _define_ clearly the problem -- that is what is missing at the moment. Take your example: `everythingswillbeokoneday` and the delimiters `thing` and `ok`. If we allow one error at most, we get `swillbe` or `willbe`. But how can we be sure which one of the two is the _valid_ string? In sum, it appears to me that your problem is not yet well defined. – PaulS Jan 28 '22 at 11:19
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241491/discussion-between-ldt-and-paul-smith). – LDT Jan 28 '22 at 13:53
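
To make the ambiguity discussed in the comments concrete, here is a minimal sketch for strings without gaps. It is not from the answer: it slides a window the same width as the delimiter along the string and keeps the window with the smallest edit distance (base R's adist()), subject to a maximum distance. The names fuzzy_locate, extract_between and max_dist are hypothetical, and the fixed window width is a simplifying assumption; handling insertions or deletions inside a delimiter properly would need windows of varying width.

```r
# Hypothetical helper: start/end of the window of `text` closest to `pattern`,
# or NULL if no window is within `max_dist` edits.
fuzzy_locate <- function(text, pattern, max_dist = 2) {
  width <- nchar(pattern)
  if (nchar(text) < width) return(NULL)
  starts  <- seq_len(nchar(text) - width + 1)
  windows <- substring(text, starts, starts + width - 1)
  dists   <- drop(adist(windows, pattern))   # Levenshtein distance per window
  best    <- which.min(dists)                # first window with the smallest distance
  if (dists[best] > max_dist) return(NULL)
  c(start = starts[best], end = starts[best] + width - 1)
}

# Hypothetical helper: text between the best left and right delimiter matches.
extract_between <- function(text, left, right, max_dist = 2) {
  l <- fuzzy_locate(text, left, max_dist)
  r <- fuzzy_locate(text, right, max_dist)
  if (is.null(l) || is.null(r) || r[["start"]] <= l[["end"]]) return(NA_character_)
  substr(text, l[["end"]] + 1, r[["start"]] - 1)
}

extract_between("everythingswillbeokoneday", "thing", "okone")
#> [1] "swillbe"
```

Because "thing" is found exactly, the extra "s" ends up in the extracted text. Returning "willbe" instead would require a rule saying that "things" counts as the delimiter here, which is exactly the structure the comments say is still missing.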
