12

Suppose I have a string like "A B C (123-456-789)". What is the best way to retrieve "123-456-789" from it?

# My attempt leaves a trailing space and the closing parenthesis behind:
strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C "       "123-456-789)"
zx8754
David Z

5 Answers

11

If we want to extract the digits and hyphens between the parentheses, one option is str_extract from stringr. If a string may contain several matches, use str_extract_all, which returns a list holding all matches for each string.

 library(stringr)
 str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
 #[1] "123-456-789"
 str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')

The code above uses regex lookarounds to extract the digits and hyphens. The positive lookbehind (?<=\\() requires an opening parenthesis immediately before the match, so [0-9-]+ matches in (123-456-789 but not in 123-456-789. Similarly, the positive lookahead (?=\\)) requires a closing parenthesis immediately after the match, so [0-9-]+ matches in 123-456-789) but not in 123-456-789. Taken together, the pattern matches only when both conditions hold, as in (123-456-789), and returns just the text between the lookarounds; it does not match cases like (123-456-789 or 123-456-789).
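To make the lookaround behaviour concrete, here is a minimal base R sketch. The input vector x is made up for illustration (it is not part of the data below), and it uses regmatches/gregexpr with perl=TRUE rather than stringr so that it runs without extra packages.

```r
# Hypothetical inputs: only the last element has both parentheses
x <- c("(123-456-789", "123-456-789)", "(123-456-789)")

# Same pattern as above: "(" must precede and ")" must follow the match
pattern <- "(?<=\\()[0-9-]+(?=\\))"

# gregexpr finds all matches per element; regmatches pulls them out as a list
m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# The first two elements yield character(0); the third yields the number
print(m)
```

Elements with a missing parenthesis come back empty rather than partially matched, which is the point of combining both lookarounds.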

With strsplit, you can specify the split pattern as [()]. Placing the parentheses inside a character class ([]) makes them literal characters; otherwise we would have to escape them ('\\(|\\)').

 strsplit(str1, '[()]')[[1]][2]
 #[1] "123-456-789"

If there are multiple substrings to extract from a string, we can loop over the split results with lapply and keep the parts that contain digits with grep:

 lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))

Or we can use stri_split_regex from stringi, which also has an option to drop empty strings (omit_empty=TRUE).

 library(stringi)
 stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
 #[1] "123-456-789"

 stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)

Another option is rm_round from qdapRegex, if we are interested in extracting whatever is inside the parentheses.

 library(qdapRegex)
 rm_round(str1, extract=TRUE)[[1]]
 #[1] "123-456-789"
 rm_round(str2, extract=TRUE)

data

 str1 <-  "A B C (123-456-789)"
 str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
 "(123-423-498) ABCDD", 
  "(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")
Community
akrun
8

Or with sub from base R:

sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"

Explanation:

[^(]+ : matches one or more characters that are not an opening parenthesis
\\( : matches the opening parenthesis, which sits just before what you want
([^)]+) : captures what you want (retrieved in the replacement as "\\1"), which is one or more characters that are not a closing parenthesis
\\).* : matches the closing parenthesis followed by any characters, 0 or more times
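One thing to keep in mind is that sub returns its input unchanged when the pattern does not match. A minimal sketch of a wrapper that returns NA in that case follows; the name extract_parens is hypothetical, and the pattern is a slight variant of the one above (anchored, and using [^(]* so that a string starting with a parenthesis also matches).

```r
# Hypothetical helper: capture the text inside the first pair of parentheses,
# or NA when the string contains no parenthesized part
extract_parens <- function(x) {
  # [^(]* (zero or more) also allows strings that start with "("
  pattern <- "^[^(]*\\(([^)]+)\\).*$"
  # sub() leaves non-matching strings untouched, so check with grepl() first
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

res <- extract_parens(c("A B C (123-456-789)", "(123-432-423)", "no parentheses"))
print(res)
```

Whether NA or the original string is the better fallback depends on what happens downstream; NA simply makes the non-matches easy to spot.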

Another option, with lookbehind and lookahead (this needs perl=TRUE):

sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"
Cath
5

The capture groups in sub will target your desired output:

sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"

Extra check to make sure I pass @akrun's extended example (because .* is greedy, only the last parenthesized group is returned for the final element):

sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
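For a string with several parenthesized groups, the choice between greedy and lazy quantifiers decides which group sub returns. The sketch below uses the last element of the extended example; the variable names are only illustrative, and the gregexpr variant is one possible way to collect every group.

```r
x <- "ABC (123-423-389) GR (124-233-848) AK"

# Greedy .* runs as far as it can, so the LAST group is captured
last_match <- sub(".*\\((.*)\\).*", "\\1", x)

# Lazy .*? stops as early as it can, so the FIRST group is captured
first_match <- sub(".*?\\((.*?)\\).*", "\\1", x, perl = TRUE)

# gregexpr with lookarounds returns every group
all_matches <- regmatches(x, gregexpr("(?<=\\()[^()]+(?=\\))", x, perl = TRUE))[[1]]

print(last_match)
print(first_match)
print(all_matches)
```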
Pierre L
4

Try this also (note that the match keeps the surrounding parentheses):

k <- "A B C (123-456-789)"
regmatches(k, gregexpr(".(\\d+).*", k))[[1]]
[1] "(123-456-789)"

With suggestion from @Arun:

regmatches(k, gregexpr('(?<=\\()[^A-Z ]+(?=\\))', k, perl=TRUE))[[1]]

With suggestion from @akrun:

regmatches(k, gregexpr('[0-9-]+', k))[[1]]
akrun
user227710
4

You may try these gsub calls.

> x <- "A B C (123-456-789)"
> gsub("[^\\d-]", "", x, perl=TRUE)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"

A few more:

> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=TRUE)
[1] "123-456-789"
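A caveat worth checking against your own data: the patterns built around [0-9-] work by deleting every character that is not a digit or a hyphen, so they assume that no digits or hyphens occur outside the parentheses. A minimal sketch with a made-up input (y is hypothetical, not from the question) shows the difference:

```r
# Hypothetical input with a digit and a hyphen outside the parentheses
y <- "Room 12 A-B (123-456-789)"

# Deleting all non-digit, non-hyphen characters keeps the stray "12" and "-"
stripped <- gsub("[^0-9-]", "", y)

# Removing everything up to "(" and the closing ")" keeps only the contents
inside <- gsub(".*\\(|\\)", "", y)

print(stripped)
print(inside)
```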
Avinash Raj