strsplit by parentheses

Question

Suppose I have a string like "A B C (123-456-789)". What is the best way to retrieve "123-456-789" from it?

strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C" "123-456-789)"

This gets you the number `strsplit(str1, '[()]')[[1]][2]` but it is based on knowing the position beforehand. — akrun, Jul 08 '15 at 12:35

score 11 · Accepted Answer · edited Jun 20 '20 at 09:12

If we want to extract the digits and hyphens between the parentheses, one option is str_extract from stringr. If a string can contain more than one match, use str_extract_all.

 library(stringr)
 str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
 #[1] "123-456-789"
 str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')

The code above uses regex lookarounds to extract the digits and the hyphens. The positive lookbehind in (?<=\\()[0-9-]+ matches a run of digits and hyphens ([0-9-]+) only when it is immediately preceded by an opening parenthesis, so it matches in (123-456-789 but not in 123-456-789. Similarly, the positive lookahead in [0-9-]+(?=\\)) matches such a run only when it is immediately followed by a closing parenthesis, so it matches in 123-456-789) but not in 123-456-789. Taken together, the pattern matches only when both conditions hold, as in (123-456-789), and returns just the text between the lookarounds. It does not match cases like (123-456-789 or 123-456-789).
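The same lookaround pattern can also be used without stringr, since base R supports lookarounds when perl = TRUE is set. A minimal sketch, assuming a single input string; the helper name extract_parens is made up for illustration:

```r
# Hypothetical helper: return the digit/hyphen run enclosed in parentheses,
# or character(0) when the string contains no such run.
extract_parens <- function(s) {
  m <- regexpr("(?<=\\()[0-9-]+(?=\\))", s, perl = TRUE)
  regmatches(s, m)
}

str1 <- "A B C (123-456-789)"
extract_parens(str1)             # both lookarounds are satisfied
extract_parens("(123-456-789")   # closing parenthesis missing, so no match
```

The second call returns an empty character vector because the lookahead cannot be satisfied.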

With strsplit, you can specify the split pattern as [()]. Placing the parentheses inside square brackets makes them literal characters; otherwise each parenthesis has to be escaped ('\\(|\\)').

 strsplit(str1, '[()]')[[1]][2]
 #[1] "123-456-789"
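To see that the character class and the escaped alternation behave the same way, here is a small check in base R. It is only a sketch on the single example string; the variable names by_class and by_escape are made up:

```r
str1 <- "A B C (123-456-789)"

# Inside [] the parentheses are treated as literal characters.
by_class <- strsplit(str1, "[()]")[[1]]

# Outside a character class each parenthesis has to be escaped.
by_escape <- strsplit(str1, "\\(|\\)")[[1]]

identical(by_class, by_escape)
by_class
```

Note that the first piece keeps its trailing space ("A B C "), and that indexing with [2] still relies on the parenthesised part being the second piece.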

If there are multiple substrings to extract from a string, we could loop over the split result with lapply and keep the parts that contain digits with grep.

 lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))

Or we can use stri_split_regex from stringi, which has an option to drop the empty strings as well (omit_empty=TRUE).

 library(stringi)
 stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
 #[1] "123-456-789"

 stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)

Another option is rm_round from qdapRegex if we are interested in extracting the contents inside the parentheses.

 library(qdapRegex)
 rm_round(str1, extract=TRUE)[[1]]
 #[1] "123-456-789"
 rm_round(str2, extract=TRUE)

data

 str1 <- "A B C (123-456-789)"
 str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
           "(123-423-498) ABCDD",
           "(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")

Cath · Answer 2 · 2015-07-08T12:46:36.237

Or with sub from base R:

sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"

Explanation:

[^(]+ : matches one or more characters that are not an opening parenthesis
\\( : matches the opening parenthesis just before the part you want
([^)]+) : captures the part you want, i.e. one or more characters that are not a closing parenthesis (retrieved in the replacement with "\\1")
\\).* : matches the closing parenthesis followed by any characters, zero or more times
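One thing to keep in mind with this approach: when the pattern does not match, sub returns its input unchanged. The sketch below reuses the pattern from this answer and assumes that NA is an acceptable result for strings without a match; the helper name paren_or_na and the second test string are made up:

```r
# Hypothetical helper: captured text where the pattern matches, NA elsewhere.
paren_or_na <- function(s) {
  pattern <- "[^(]+\\(([^)]+)\\).*"
  ifelse(grepl(pattern, s), sub(pattern, "\\1", s), NA_character_)
}

x <- c("A B C (123-456-789)", "no parentheses here")

# Plain sub leaves the second element untouched.
sub("[^(]+\\(([^)]+)\\).*", "\\1", x)

# The wrapper marks it as missing instead.
paren_or_na(x)
```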

Another option with look-ahead and look-behind

sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"

Pierre L · Answer 3 · 2015-07-08T15:15:35.140

The capture groups in sub will target your desired output:

sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"

Extra check to make sure I pass @akrun's extended example:

sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
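As the last element of that output shows, the greedy .* keeps only the final parenthesised group when a string contains more than one. If all groups are needed, one possible base R sketch uses gregexpr with lookarounds (the variable names s and all_groups are made up):

```r
s <- "ABC (123-423-389) GR (124-233-848) AK"

# Greedy .* runs up to the last "(", so only the final group survives.
last_group <- sub(".*\\((.*)\\).*", "\\1", s)

# gregexpr finds every run enclosed in parentheses; perl = TRUE enables lookarounds.
all_groups <- regmatches(s, gregexpr("(?<=\\()[^()]+(?=\\))", s, perl = TRUE))[[1]]

last_group
all_groups
```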

edited Jul 08 '15 at 15:15

answered Jul 08 '15 at 12:48

score 4 · Answer 4 · edited Jul 08 '15 at 13:31

Try this also (note that the match keeps the surrounding parentheses):

k <- "A B C (123-456-789)"
regmatches(k, gregexpr("*.(\\d+).*", k))[[1]]
[1] "(123-456-789)"

With suggestion from @Arun:

regmatches(k, gregexpr('(?<=\\()[^A-Z ]+(?=\\))', k, perl=TRUE))[[1]]

With suggestion from @akrun:

regmatches(k, gregexpr('[0-9-]+', k))[[1]]

edited Jul 08 '15 at 13:31 by akrun

answered Jul 08 '15 at 12:46 by user227710

Avinash Raj · Answer 5 · 2015-07-08T13:31:17.157

You may try these gsub functions.

> x <- "A B C (123-456-789)"
> gsub("[^\\d-]", "", x, perl=TRUE)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"

A few more:

> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=TRUE)
[1] "123-456-789"
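A caveat worth checking before using the character-filtering variants: they keep every digit and hyphen in the string, not only those inside the parentheses. A small sketch with a made-up input (y is hypothetical, not from the question):

```r
y <- "Room 7 (123-456-789)"

# Deleting every character that is not a digit or hyphen also keeps the stray "7".
filtered <- gsub("[^0-9-]", "", y)

# Deleting everything up to "(" and the ")" itself keeps only the enclosed part.
enclosed <- gsub(".*\\(|\\)", "", y)

filtered
enclosed
```

For input like the original example, where the only digits are inside the parentheses, both variants give the same result.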

edited Jul 08 '15 at 13:31

answered Jul 08 '15 at 13:23
