In short:

Given the following string:

Input string -> "hello, world" , oh my, parapappa12

I want to extract these three "tokens":

Output tokens ->

  • "hello, world"
  • oh my
  • parapappa12

Tokenizing string in iOS

I have a file containing some data. It looks something like:

word , word, word 
word , word, word 
word , word, word 

where some words can contain a "," but only when the word starts and ends with a certain character, e.g. starts with " and ends with "

Example of words:

word : blebla bla bla
word : "bla bla bla, bla"

How do I define a regular expression that tokenizes the file on ",", ignoring whitespace between the words and handling this "special" case?

I remember using regexes in Perl to achieve something similar, but that was a long time ago and I have forgotten the syntax. I am also not sure whether this is supported in Objective-C on iOS.

mm24

2 Answers

First, a Perl one-liner:

# echo -n '"hello, world" , oh my, parapappa12' | perl -ne 'print "<$1>\n" while /("[^"]*"|[^, ]+)/g'
<"hello, world">
<oh>
<my>
<parapappa12>

And here is the Objective-C method:

NSString* const str = @"\"hello, world\" , oh my, parapappa12";
[self splitCommas:str];

- (void)splitCommas:(NSString*)str
{
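    // Match either a double-quoted string or a run of characters
    // that contains neither a comma nor a space.
    NSString* const pattern = @"(\"[^\"]*\"|[^, ]+)";

    NSError *error = nil;
    NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:pattern
                                                                      options:0
                                                                        error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return;
    }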
    NSRange searchRange = NSMakeRange(0, [str length]);
    NSArray *matches = [regex matchesInString:str
                                      options:0
                                        range:searchRange];

    for (NSTextCheckingResult *match in matches) {
        NSRange matchRange = [match range];
        NSLog(@"%@", [str substringWithRange:matchRange]);
    }
}

Explanation for the regex:

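  1. Either match a "quoted string": "[^"]*" (a quote, any run of characters other than a quote, then a closing quote)
  2. Or match a run of characters between commas: [^, ]+ (one or more characters that are neither a comma nor a space, which is why "oh my" comes out as two tokens)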

(the square brackets define the "character class" and the caret negates it).

Note: My solution doesn't handle escaped quotes like in "I say \"Hello\""
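One possible way to lift that limitation is to let the quoted branch accept a backslash followed by any character. The following is only a sketch: the function name TokenizeAllowingEscapes is hypothetical, the backslash-escape convention is an assumption about the input, and the second branch is unchanged, so unquoted words are still split on spaces.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the answer above. Same idea as
// splitCommas:, but the quoted branch also accepts backslash escapes.
static NSArray<NSString *> *TokenizeAllowingEscapes(NSString *str)
{
    // Regex: "(?:[^"\\]|\\.)*"  |  [^, ]+
    // Inside quotes: any character except quote and backslash,
    // or a backslash followed by any single character (e.g. \").
    NSString *pattern = @"\"(?:[^\"\\\\]|\\\\.)*\"|[^, ]+";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, str.length);
    for (NSTextCheckingResult *match in
         [regex matchesInString:str options:0 range:whole]) {
        [tokens addObject:[str substringWithRange:match.range]];
    }
    return tokens;
}
```

With the input "I say \"Hello\"", next this should yield two tokens, the whole quoted string (escapes left in place) and next. Removing the backslashes from the token afterwards is a separate step that this sketch does not attempt.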

Alexander Farber
  • Excellent. I lost my Perl touch, and it is great to see a Perl one-liner "to catch them all". I will try the Obj-C solution and then accept the answer straight away. Thanks a lot for taking the time to support me in this. – mm24 May 13 '14 at 11:34
  • It works almost perfectly; however, I would need the tokens "oh" and "my" to stay together. I am trying to update your regex to do so. – mm24 May 15 '14 at 12:24
  • Try removing space from the second pair of square brackets – Alexander Farber May 15 '14 at 13:16
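
A sketch of what that suggestion can look like in practice. Removing the space alone ([^,]+) keeps the whitespace around each field inside the match (for example " oh my", plus a lone " " after the quoted string), so the variant below requires an unquoted token to start and end with a character that is neither a comma nor whitespace. The function name TokenizeKeepingInnerSpaces is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: quoted strings stay whole, unquoted fields keep
// their inner spaces but lose leading and trailing whitespace.
static NSArray<NSString *> *TokenizeKeepingInnerSpaces(NSString *str)
{
    // Regex: "[^"]*"  |  [^,\s](?:[^,]*[^,\s])?
    // Second branch: one non-comma, non-space character, optionally
    // followed by any non-comma run that ends in such a character.
    NSString *pattern = @"\"[^\"]*\"|[^,\\s](?:[^,]*[^,\\s])?";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, str.length);
    for (NSTextCheckingResult *match in
         [regex matchesInString:str options:0 range:whole]) {
        [tokens addObject:[str substringWithRange:match.range]];
    }
    return tokens;
}
```

For the string from the question this gives "hello, world", oh my and parapappa12. A field that mixes unquoted and quoted text, such as foo "a, b", would still be cut at the inner comma.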

Without knowing why you need to parse strings like this I can't give you a great answer, but here are some ideas that might be better than RegEx if you find yourself needing to parse something more complicated, or if you would just like to learn more about state machines and grammars.

  1. You can easily write a basic state machine parser to do basic parsing using NSScanner (the code from that link isn't great so ignore it, but the concept is illustrated)
  2. You can use something like ParseKit for really heavy duty parsing (probably overkill here)
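
To make the first idea concrete, here is a minimal sketch of a two-state scanner (inside quotes, outside quotes) built on NSScanner. It is an illustration under assumptions, not the code from the link mentioned above: the function name ScanTokens is hypothetical, quotes are assumed to appear only at the start of a field, and escaped quotes are not handled.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits one line on commas, keeping quoted
// fields whole and trimming whitespace around unquoted ones.
static NSArray<NSString *> *ScanTokens(NSString *line)
{
    NSScanner *scanner = [NSScanner scannerWithString:line];
    scanner.charactersToBeSkipped = nil;  // handle whitespace explicitly

    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];
    NSCharacterSet *comma =
        [NSCharacterSet characterSetWithCharactersInString:@","];
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];

    while (!scanner.isAtEnd) {
        // Skip leading whitespace of the field.
        [scanner scanCharactersFromSet:whitespace intoString:NULL];

        NSString *token = nil;
        if ([scanner scanString:@"\"" intoString:NULL]) {
            // Quoted state: read up to the closing quote, commas included.
            NSString *body = nil;
            [scanner scanUpToString:@"\"" intoString:&body];
            [scanner scanString:@"\"" intoString:NULL];
            token = [NSString stringWithFormat:@"\"%@\"", body ?: @""];
            // Ignore anything between the closing quote and the comma.
            [scanner scanUpToCharactersFromSet:comma intoString:NULL];
        } else {
            // Unquoted state: read up to the next comma and trim.
            NSString *raw = nil;
            [scanner scanUpToCharactersFromSet:comma intoString:&raw];
            token = [raw stringByTrimmingCharactersInSet:whitespace];
        }

        if (token.length > 0) {
            [tokens addObject:token];
        }
        // Consume the separator, if there is one.
        [scanner scanString:@"," intoString:NULL];
    }
    return tokens;
}
```

Calling ScanTokens on the string from the question should return the three tokens "hello, world", oh my and parapappa12. Compared with a single regular expression, each state is an ordinary branch, so adding rules such as escapes or other delimiters means editing one branch rather than the whole pattern.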

You seem content with RegEx, but maybe this will help future visitors.

Brad Allred