I have a UTF-8 encoded string and I would like to iterate through it, splitting it at any of several delimiters. I also need to know which delimiter matched, as each delimiter has a specific meaning.
An example usage:
algorithm("one, two; three") => Match("one")
algorithm(", two; three") => Delimiter(",")
algorithm(" two; three") => Match(" two")
algorithm("; three") => Delimiter(";")
algorithm(" three") => Match(" three")
Additional information:
- My delimiters are all single ASCII characters, so optimized algorithms that rely on single-byte delimiters are an option.
- A solution that handles UTF-8 substrings would also be appreciated, but isn't required.
- I plan to call the method many times and potentially in a tight loop, so an ideal algorithm would not need to allocate any memory.
- The algorithm only needs to return the first match or delimiter it finds; I can handle restarting the search on the next iteration.
- An ideal algorithm would innately know if it is returning a match or a delimiter, but it's possible to check that after the fact.
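To make the requirements above concrete, here is a minimal sketch of the kind of interface I have in mind. The names `Token` and `next_token` are made up for illustration, and it assumes every delimiter is a single ASCII byte. It borrows slices from the input instead of allocating, and it relies on the fact that an ASCII byte never appears inside a multi-byte UTF-8 sequence, so a plain byte scan cannot split a character.

```rust
/// Hypothetical token type: text between delimiters, or one delimiter.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token<'a> {
    Match(&'a str),
    Delimiter(&'a str),
}

/// Returns the first token of `input` together with the unconsumed rest,
/// or `None` when `input` is empty.
/// Assumption: `delims` contains only ASCII bytes (values below 0x80).
fn next_token<'a>(input: &'a str, delims: &[u8]) -> Option<(Token<'a>, &'a str)> {
    // `?` returns None for an empty input.
    let first = *input.as_bytes().first()?;

    if delims.contains(&first) {
        // An ASCII delimiter is exactly one byte, so index 1 is a char boundary.
        let (delim, rest) = input.split_at(1);
        return Some((Token::Delimiter(delim), rest));
    }

    // Scan bytes for the next delimiter. UTF-8 continuation and lead bytes
    // are all >= 0x80, so they can never compare equal to an ASCII delimiter.
    let end = input
        .bytes()
        .position(|b| delims.contains(&b))
        .unwrap_or(input.len());
    let (text, rest) = input.split_at(end);
    Some((Token::Match(text), rest))
}
```

Calling it in a loop such as `while let Some((token, tail)) = next_token(rest, b",;") { rest = tail; }` should walk through the tokens from the example usage one at a time. This is only a baseline under the stated assumptions; it does nothing clever for speed and does not handle multi-byte delimiters.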
My target language is Rust, but I would appreciate answers in any language with a similar lower-level focus. Pseudocode is fine as well, as long as it recognizes the realities of UTF-8 text. Solutions that use esoteric bit tricks or SIMD instructions are also welcome, but may need more explanation for me to understand ^_^.