Use RegEx to parse a string with complicated delimiting

Question

This is a RegEx question.

Thanks for any help, and please be patient, as RegEx is definitely not my strength!

Entirely as background...my reason for asking is that I want to use RegEx to parse strings similar to SVG path data segments. I’ve looked for previous answers that parse both the segments and their segment-attributes, but found nothing that does the latter properly.

Here are some example strings like the ones I need to parse:

M-11.11,-22
L.33-44  
ac55         66 
h77  
M88 .99  
Z

I need to have the strings parsed into arrays like this:

["M", -11.11, -22]
["L", .33, -44]
["ac", 55, 66]
["h", 77]
["M", 88, .99]
["Z"]

So far I have found this code in this answer: Parsing SVG "path" elements with C# - are there libraries out there to do this? The post is about C#, but the regex was useful in JavaScript:

var argsRX = /[\s,]|(?=-)/; 
var args = segment.split(argsRX);

Here's what I get:

 [ "M", -11.11, -22, <empty element>  ]
 [ "L.33", -44, <empty>, <empty> ]
 [ "ac55", <empty>, <empty>, <empty>, 66 <empty>  ]
 [ "h77", <empty>, <empty> ]
 [ "M88", .99, <empty>, <empty> ]
 [ "Z", <empty> ]

Problems when using this regex:

An unwanted empty array element is being put at the end of each string's array.
If multiple spaces are delimiters, an unwanted empty array element is being created for each extra space.
If a number immediately follows the opening letters, that number is being attached to the letters, but should become a separate array element.

Here are more complete definitions of incoming strings:

Each string starts with 1 or more letters (mixed case).
Next are zero or more numbers.
The numbers might have minus signs (always preceding).
The numbers might have a decimal point anywhere in the number (except the end).
Possible delimiters are: comma, space, spaces, the minus sign.
A comma with space(s) in front or back is also a possible delimiter.
Even though minus signs are delimiters, they must also remain with their number.
A number might immediately follow the opening letters (no space) and that number should be separate.

Here is test code I've been using:

<!doctype html>
<html>
<head>
<link rel="stylesheet" type="text/css" media="all" href="css/reset.css" /> <!-- reset css -->
<script type="text/javascript" src="http://code.jquery.com/jquery.min.js"></script>

<style>
    body{ background-color: ivory; }
</style>

<script>
    $(function(){


var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";

// separate pathData into segments
var segmentRX = /[a-z]+[^a-z]*/ig;
var segments = pathData.match(segmentRX);

for(var i=0;i<segments.length;i++){
    var segment=segments[i];
    //console.log(segment);

    var argsRX = /[\s,]|(?=-)/; 
    var args = segment.split(argsRX);
    for(var j=0;j<args.length;j++){
        var arg=args[j];
        console.log(arg.length+": "+arg);
    }

}

    }); // end $(function(){});
</script>

</head>

<body>
</body>
</html>

OOPS, a typo actually! I meant to type an array with 3 elements: "M", 88, and .99 -- sorry. — markE, Jun 10 '13 at 06:26

Joseph Myers · Answer 1 · 2013-06-10T07:07:36.187

I had to perform very similar parsing of data for reporting live results at the nation's largest track meet. http://ksathletics.com/2013/statetf/liveresults.js Although there was a lot of both client and server-side code involved, the principles are the same. In fact, the kind of data was practically identical.

I suggest that you do not use one "jumbo" regular expression, but rather one expression which separates data pieces and another which breaks each data piece into its main identifier and the following values. This solves the problem of various delimiters by allowing the second-level regular expression to match the definition of data values rather than having to distinguish delimiters. (This also is more efficient than putting all of the logic into a single regular expression.)

This is a solution tested to work on the input you gave.

<script>
var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";

function parseData(pathData) {
    var pieces = pathData.match(/([a-z]+[-.,\d ]*)/gi), i;
    /* now parse each piece into its own array */
    for (i=0; i<pieces.length; i++)
        pieces[i] = pieces[i].match(/([a-z]+|-?[.\d]*\d)/gi);
    return pieces;
}

var pathPieces = parseData(pathData);
document.write(pathPieces.join('<br />'));
console.log(pathPieces);
</script>

http://dropoff.us/private/1370846040-1-test-path-data.html

Update: The results are exactly equivalent to the specified output you want. One thought that came to mind, however, was whether you also want or need type conversion from strings to numbers. Do you need that as well? I'm just thinking of the next step beyond parsing the data.
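If type conversion turns out to be needed, one possible sketch is to keep the leading letters as a string and pass every other token through `Number`. The helper name `toTypedSegments` is hypothetical, and `parseData` is repeated here (with a small guard for input that has no matches) only so that the block runs on its own.

```javascript
// Same two-level parse as parseData above, repeated so the sketch is self-contained.
function parseData(pathData) {
    // "|| []" is an added guard: match() returns null when nothing matches.
    var pieces = pathData.match(/([a-z]+[-.,\d ]*)/gi) || [];
    return pieces.map(function (piece) {
        return piece.match(/([a-z]+|-?[.\d]*\d)/gi);
    });
}

// Hypothetical helper: the first token (the letters) stays a string,
// every following token is converted to a number.
function toTypedSegments(pieces) {
    return pieces.map(function (piece) {
        return piece.map(function (token, index) {
            return index === 0 ? token : Number(token);
        });
    });
}

var typed = toTypedSegments(parseData("M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z"));
console.log(JSON.stringify(typed));
// [["M",-11.11,-22],["L",0.33,-44],["ac",55,66],["h",77],["M",88,0.99],["Z"]]
```

`Number` accepts strings such as `".33"` and `"-22"` directly, so no extra clean-up of the tokens should be needed for the sample input.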

@markE Just to clarify, the only part of my code you need is the `parseData` function. You can stick that in your code and call it on your data string to convert it to arrays just like you want. — Joseph Myers, Jun 10 '13 at 06:54

Tomalak · Accepted Answer · 2016-10-17T09:32:29.743

^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?

Explanation

^                # start of string
([a-z]+)         # one or more letters, match into group 1
(?:              # non-capturing group
  (-?\d*\.?\d+)  #   first number (optional sign & decimal point, digits)
  [^\d\n\r.-]*   #   delimiting characters (anything but these)
  (-?\d*\.?\d+)? #   second number (optional)
)?               # end non-capturing group, make optional

Use with "case insensitive" flag.
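As a rough usage sketch in JavaScript, assuming each segment is already available as its own string: the helper name `parseSegment` is hypothetical, and the decimal point is written as `\.` so that it matches only a literal dot. Capture groups that did not take part in the match come back as `undefined` and are filtered out.

```javascript
// The pattern from this answer, with the decimal point escaped and the "i" flag set.
var segmentRX = /^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?/i;

// Hypothetical helper: run the pattern on one segment and keep only the
// groups that actually captured something.
function parseSegment(segment) {
    var m = segmentRX.exec(segment);
    if (!m) return null;
    // m[0] is the whole match; m[1] to m[3] are the letters and up to two numbers.
    return m.slice(1).filter(function (group) {
        return group !== undefined;
    });
}

var lines = ["M-11.11,-22", "L.33-44  ", "ac55         66 ", "h77  ", "M88 .99  ", "Z"];
var parsed = lines.map(parseSegment);
console.log(JSON.stringify(parsed));
// [["M","-11.11","-22"],["L",".33","-44"],["ac","55","66"],["h","77"],["M","88",".99"],["Z"]]
```

Because the pattern captures at most two numbers, a segment with more values would need a different approach, for example a global tokenizer.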

edited Oct 17 '16 at 09:32

answered Jun 10 '13 at 06:28

Tomalak

Thanks, looks good except for some unwanted extra elements (due to extra space delimiters). I can deal with these extra elements. – markE Jun 10 '13 at 06:33
In your link Match4 has 1 empty match and Match6 has 2 empty matches. – markE Jun 10 '13 at 06:53
@markE Yes, because there are no numbers to match on these lines? I am not sure what you are getting at. – Tomalak Jun 10 '13 at 06:57
Probably my fault in explaining. When there is more than 1 space separating the numbers, I was hoping to have no matches returned for these extra spaces. And maybe I'm misunderstanding...regex is not my strong suit. :) – markE Jun 10 '13 at 07:00
For your input `'ac55 66'`, my regex returns the array `['ac55 66', 'ac', '55', '66']`. No extra spaces are returned. What am I missing? (Note that multiple spaces are collapsed in comments on this site.) – Tomalak Jun 10 '13 at 07:06
On http://rubular.com/r/EyUNmoONJ7 -- Match4: h,77,empty and Match6: Z,empty,empty. I think these "empty" are the result of multiple spaces in the input string? – markE Jun 10 '13 at 07:11
@markE No, they are the result of numbers not being there. And they do not contain the string "empty", they *are* empty. ;) – Tomalak Jun 10 '13 at 07:16
Thanks again Tomalak, I clearly will have to dedicate more brain cells to regex! :) BTW, if your moniker is a star trek reference, please "live long and prosper". – markE Jun 10 '13 at 07:20
@markE Thanks, and yes, you should. :) Regexes are indispensable; every developer should have a basic working understanding of them. – Tomalak Jun 10 '13 at 07:24

Niet the Dark Absol · Answer 3 · 2013-06-10T13:15:03.537

Your "pattern" consists of one or more letters, followed by a decimal number, followed by a second decimal number. The two numbers are separated by a comma, by whitespace, or by the minus sign of the second number.

Regex: /([a-z]+)(-?(?:\d*\.)?\d+)(?:[,\s]+|(?=-))(-?(?:\d*\.)?\d+)/i

edited Jun 10 '13 at 13:15

answered Jun 10 '13 at 06:12

Niet the Dark Absol

I believe that will fail to parse `L.33-44` correctly. `[,\s]+` needs to be changed to a non-capturing group of either a single comma, one or more whitespace characters, or a lookahead of -. – Robert McKee Jun 10 '13 at 06:20
Thanks very much. This regex is better, but some prefix-letters and numbers are still running together. Boiled down, the pattern is one or more letters followed by numbers that might be decimal or negative. Seems the problem is the delimiters which might be a comma, one-or-more-spaces, a comma-plus-space(s), the minus sign for the next number or the decimal point for the next number. – markE Jun 10 '13 at 06:25
Sorry, posted this answer on my tiny laptop, so I couldn't see the question to double-check everything! Edited answer to implement lookahead for `L.33-44` and added CI flag. – Niet the Dark Absol Jun 10 '13 at 13:15
Numbers for svgs can be float-literals so they can be +- prefixed (not just -) and can contain e or E for exponent, which will also cause issues with your a-z catch. – Tatarize Oct 08 '15 at 01:07
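To illustrate the point about signs and exponents, here is a minimal tokenizer sketch in JavaScript. It assumes that an `e` directly after digits always belongs to the number, which holds for real SVG path data because no path command is named `e`, but may not hold for other letter prefixes. The names `tokenRX` and `tokenizePath` are hypothetical.

```javascript
// Numbers may carry a leading + or - and an exponent such as 1e3 or 2E-2.
// The exponent is consumed as part of the number, so the "e" in 1e3
// never reaches the letter alternative.
var tokenRX = /([+-]?(?:\d+\.?\d*|\.\d+)(?:e[+-]?\d+)?)|([a-z]+)/gi;

// Hypothetical helper: group tokens into [letters, number, number, ...] arrays.
function tokenizePath(pathData) {
    var commands = [], current = null, m;
    tokenRX.lastIndex = 0;
    while ((m = tokenRX.exec(pathData)) !== null) {
        if (m[2]) {
            // A run of letters starts a new command.
            current = [m[2]];
            commands.push(current);
        } else if (current) {
            // A number is attached to the current command;
            // numbers that appear before any letters are ignored.
            current.push(Number(m[1]));
        }
    }
    return commands;
}

var result = tokenizePath("M1e3,2E-2 l-.5e1+4 Z");
console.log(JSON.stringify(result));
// [["M",1000,0.02],["l",-5,4],["Z"]]
```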

Markus Jarderot · Answer 4 · 2013-06-10T06:57:52.963

function parsePathData(pathData)
{
    var tokenizer = /([a-z]+)|([+-]?(?:\d+\.?\d*|\.\d+))/gi,
        match,
        current,
        commands = [];

    tokenizer.lastIndex = 0;
    while (match = tokenizer.exec(pathData))
    {
        if (match[1])
        {
            if (current) commands.push(current);
            current = [ match[1] ];
        }
        else
        {
            if (!current) current = [];
            current.push(match[2]);
        }
    }
    if (current) commands.push(current);
    return commands;
}

var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";
var commands = parsePathData(pathData);
console.log(commands);

Output:

[ [ "M", "-11.11", "-22" ],
  [ "L", ".33", "-44" ],
  [ "ac", "55", "66" ],
  [ "h", "77" ],
  [ "M", "88", ".99" ],
  [ "Z" ] ]

edited Jun 10 '13 at 06:57

answered Jun 10 '13 at 06:31

Markus Jarderot

Shouldn't ac be parsed as one element? – Robert McKee Jun 10 '13 at 06:32
Thank you! The good: Appears to parse out all the letters and numbers. But there are extra elements like "undefined", singular comma and extra empty elements. I guess I could filter these out in javascript afterward. – markE Jun 10 '13 at 06:38
Ok, the output looks good now! My initial "extra" elements were probably a result of my weakness in regex. – markE Jun 10 '13 at 07:07

score 1 · Answer 5 · answered Jun 10 '13 at 06:52

You can try with this pattern:

/([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i

(a bit long, but it seems to work)

Note that the last number can be in the capture group \3 or \4
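A small sketch of how the groups could be collected in JavaScript, assuming one segment per string. The helper name `parseSegment` is hypothetical. The second number lands in group 3 when its minus sign acts as the delimiter and in group 4 otherwise, so the sketch simply keeps every group that captured something.

```javascript
// The pattern from this answer, used with the "i" flag because the letters may be mixed case.
var rx = /([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i;

// Hypothetical helper: letters from group 1, first number from group 2,
// last number from group 3 or group 4, whichever took part in the match.
function parseSegment(segment) {
    var m = rx.exec(segment);
    if (!m) return null;
    return [m[1], m[2], m[3], m[4]].filter(function (group) {
        return group !== undefined;
    });
}

var parsed = ["M-11.11,-22", "L.33-44", "ac55    66", "h77", "Z"].map(parseSegment);
console.log(JSON.stringify(parsed));
// [["M","-11.11","-22"],["L",".33","-44"],["ac","55","66"],["h","77"],["Z"]]
```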

answered Jun 10 '13 at 06:52

Casimir et Hippolyte
