
I have a UTF-8 encoded string and I would like to iterate through it, splitting it at one of multiple delimiters. I also need to know which delimiter matched, as each delimiter has a specific meaning.

An example usage:

algorithm("one, two; three") => Match("one")
algorithm(", two; three")    => Delimiter(",")
algorithm(" two; three")     => Match(" two")
algorithm("; three")         => Delimiter(";")
algorithm(" three")          => Match(" three")   

Additional information:

  • My delimiters are all single ASCII characters, so optimized algorithms that require that are possible.
  • A solution that handles UTF-8 substrings would also be appreciated, but isn't required.
  • I plan to call the method many times and potentially in a tight loop, so an ideal algorithm would not need to allocate any memory.
  • The algorithm should return the first matching string or delimiter and I can handle restarting the search on the next iteration.
  • An ideal algorithm would innately know if it is returning a match or a delimiter, but it's possible to check that after the fact.
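For reference, the behavior in the example can be sketched as a plain byte scan. This is only a baseline under stated assumptions, not a tuned solution: the names `Token` and `next_token` are hypothetical, and the delimiters are assumed to be single ASCII bytes. Because an ASCII byte never occurs inside a multi-byte UTF-8 sequence, splitting at the position of a matching byte always lands on a character boundary, and nothing is allocated.

```rust
/// Hypothetical result type: either a run of ordinary text or one delimiter.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Match(&'a str),
    Delimiter(&'a str),
}

/// Returns the first token of `input` plus the remaining text, or `None`
/// when `input` is empty. `delimiters` must contain only ASCII bytes.
fn next_token<'a>(input: &'a str, delimiters: &[u8]) -> Option<(Token<'a>, &'a str)> {
    debug_assert!(delimiters.iter().all(|b| b.is_ascii()));

    let first = *input.as_bytes().first()?;
    if delimiters.contains(&first) {
        // An ASCII delimiter is exactly one byte, so index 1 is a char boundary.
        let (delimiter, rest) = input.split_at(1);
        return Some((Token::Delimiter(delimiter), rest));
    }

    // Scan bytes, not chars: ASCII bytes cannot appear inside a multi-byte
    // UTF-8 sequence, so the first hit is always a valid split point.
    let end = input
        .bytes()
        .position(|b| delimiters.contains(&b))
        .unwrap_or(input.len());
    let (text, rest) = input.split_at(end);
    Some((Token::Match(text), rest))
}
```

Calling `next_token("one, two; three", b",;")` and then feeding the returned remainder back in walks through the sequence shown in the example usage.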

My target language is Rust, but I would appreciate answers in any language with a similar lower-level focus. Pseudocode is fine as well, as long as it recognizes the realities of UTF-8 text. Solutions that use esoteric hex tricks or SIMD instructions are also suitable, but may require more explanation for me to understand ^_^.

Shepmaster
  • I must be missing something. Does Rust provide a way of looping through the characters of a utf-8 string? Just do that till you hit one of your first match or delimiter. – Programmer Person May 05 '15 at 14:40
  • @ProgrammerPerson that is my current solution and a valid answer. I'm mostly asking because I'm using what seems to be a naïve solution, and it's taking up a reasonable amount of time in my applications. Thus I'm hoping there's something more clever that I haven't thought of. – Shepmaster May 05 '15 at 21:34

1 Answer


For a processor-specific solution, x86-64 processors with SSE4.2 provide the PCMPxSTRx family of instructions. One of the modes available with these instructions is Equal Any:

arg1 is a character set, arg2 is the string to search in. IntRes1[i] is set to 1 if arg2[i] is in the set represented by arg1

The basic algorithm is straightforward:

  1. Fill an XMM register with up to 16 single bytes to search for (the needle).
  2. Set the count of needle bytes in rax.
  3. Calculate the memory address of the start of the string, including an offset.
  4. Set the count of haystack bytes in rdx.
  5. Execute PCMPxSTRx with the appropriate control byte.
  6. Check the result in ecx (the index of the first match) or one of the condition flags.
  7. If there was no match and there is still string left to search, advance the offset by 16 and loop.
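The steps above can be sketched with the `std::arch` intrinsic `_mm_cmpestri` rather than inline assembly. This is a minimal illustration under stated assumptions, not the library's actual code: the function names `find_any`, `find_any_sse42`, and `find_any_scalar` are hypothetical, and the final partial chunk is copied into a 16-byte stack buffer so that no load reads past the end of the slice. A scalar fallback is used when SSE4.2 is unavailable.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar fallback: position of the first byte of `haystack` found in `needles`.
fn find_any_scalar(haystack: &[u8], needles: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| needles.contains(b))
}

/// Equal Any search, 16 bytes at a time. `needles` holds at most 16 bytes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn find_any_sse42(haystack: &[u8], needles: &[u8]) -> Option<usize> {
    // Unsigned bytes, Equal Any, report the lowest matching index.
    const MODE: i32 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT;
    assert!(needles.len() <= 16);

    // Steps 1-2: load the needle bytes and remember how many are valid.
    let mut set = [0u8; 16];
    set[..needles.len()].copy_from_slice(needles);
    let set_reg = _mm_loadu_si128(set.as_ptr() as *const __m128i);
    let set_len = needles.len() as i32;

    let mut offset = 0;
    while offset < haystack.len() {
        // Steps 3-4: address of the current chunk and its valid length.
        let chunk = &haystack[offset..];
        let len = chunk.len().min(16);
        let hay_reg = if len == 16 {
            _mm_loadu_si128(chunk.as_ptr() as *const __m128i)
        } else {
            // Assumption of this sketch: copy the tail to stay in bounds.
            let mut tail = [0u8; 16];
            tail[..len].copy_from_slice(chunk);
            _mm_loadu_si128(tail.as_ptr() as *const __m128i)
        };

        // Steps 5-6: the index is 16 when nothing in the chunk matched.
        let index = _mm_cmpestri::<MODE>(set_reg, set_len, hay_reg, len as i32);
        if index < 16 {
            return Some(offset + index as usize);
        }
        // Step 7: move on to the next 16 bytes.
        offset += 16;
    }
    None
}

/// Uses SSE4.2 when the CPU supports it, otherwise the scalar scan.
fn find_any(haystack: &[u8], needles: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if needles.len() <= 16 && is_x86_feature_detected!("sse4.2") {
            // Safe to call: the required CPU feature was just detected.
            return unsafe { find_any_sse42(haystack, needles) };
        }
    }
    find_any_scalar(haystack, needles)
}
```

For example, `find_any(b"one, two; three", b",;")` yields the byte offset of the first comma. Since the needles are ASCII, that offset is also a valid split point in the original UTF-8 string.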

There is a complication around page boundaries, however. Namely, the PCMPxSTRx instructions will always read a full 16 bytes of data from memory, even when fewer bytes of the string remain. This can cause a segmentation fault if the read extends into a page of memory that is protected. A solution is to align all the reads to 16-byte boundaries, since an aligned 16-byte read never crosses a page boundary, and handle the leftover bytes at the beginning of the string separately. Before starting the above algorithm, use something like:

  1. Mask the address of the start of the string with ~0xF. This clears the low four bits, rounding the address down to a 16-byte boundary.
  2. Use a PCMPxSTRM instruction (with a similar setup as above algorithm) for the first 16 bytes. This returns a mask of matching characters. You can shift the mask to ignore leading characters that are not part of your string.
  3. If there was no match and there is more string left to search, start the above algorithm.
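The mask-and-shift step can be illustrated without reading outside any allocation. In this sketch, indices into a hypothetical 16-byte-aligned buffer (`Aligned`) stand in for raw addresses, so the aligned load always stays in bounds; the real technique applies the same arithmetic to the string's address. The names `first_block` and `first_block_sse42` are assumptions made for illustration, and a scalar path covers CPUs without SSE4.2.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical 16-byte-aligned buffer; indices into it model addresses.
#[repr(align(16))]
struct Aligned([u8; 32]);

/// Searches only the aligned 16-byte block containing `start`, ignoring
/// bytes before `start`. The string occupies `buf.0[start..end]`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn first_block_sse42(
    buf: &Aligned,
    start: usize,
    end: usize,
    needles: &[u8],
) -> Option<usize> {
    // Unsigned bytes, Equal Any, result returned as a bit mask.
    const MODE: i32 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK;
    assert!(needles.len() <= 16 && start < end && end <= buf.0.len());

    let mut set = [0u8; 16];
    set[..needles.len()].copy_from_slice(needles);
    let set_reg = _mm_loadu_si128(set.as_ptr() as *const __m128i);

    // Step 1: round down to a 16-byte boundary; `lead` bytes precede the string.
    let block = start & !0xF;
    let lead = start & 0xF;
    let valid = (end - block).min(16) as i32;

    // Step 2: aligned load, then one mask bit per matching byte.
    let hay = _mm_load_si128(buf.0.as_ptr().add(block) as *const __m128i);
    let result = _mm_cmpestrm::<MODE>(set_reg, needles.len() as i32, hay, valid);
    let mask = _mm_cvtsi128_si32(result) as u32;

    // Shift out matches that fall before the start of the string.
    let mask = mask >> lead;
    if mask != 0 {
        Some(start + mask.trailing_zeros() as usize)
    } else {
        None // Step 3: the caller continues with the main loop.
    }
}

/// Dispatches to SSE4.2 when available, otherwise scans the block bytewise.
fn first_block(buf: &Aligned, start: usize, end: usize, needles: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // Safe to call: the required CPU feature was just detected.
            return unsafe { first_block_sse42(buf, start, end, needles) };
        }
    }
    let block_end = ((start & !0xF) + 16).min(end);
    buf.0[start..block_end]
        .iter()
        .position(|b| needles.contains(b))
        .map(|i| start + i)
}
```

Shifting the mask right by `lead` discards any matches in the bytes that share the aligned block but lie before the string, which is what makes the rounded-down read harmless.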

You can see the complete example of this algorithm in my Rust library Jetscii. Inline assembly is used to call out to the PCMPxSTRx instructions.

Shepmaster