NumPy can work with comma-separated lists, but that is a different task. I want to convert each character of a string into an entry of a NumPy array:

x = np.frombuffer('fooλ'.encode(), dtype=np.uint8) #x = [102 111 111 206 187]

But UTF-8 uses a variable number of bytes per character: ASCII characters take one byte, while other Unicode characters take two to four. In this example, "λ" takes two bytes (206 and 187).
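
A minimal sketch of the variable width described above, using one sample character for each UTF-8 length (the sample characters are my own choice, not from the question):

```python
# One sample character per UTF-8 width: ASCII, Greek, euro sign, emoji.
samples = ['f', 'λ', '€', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    # Show the code point and how many bytes UTF-8 needs for it.
    print(hex(ord(ch)), len(encoded))

# Byte count per character, in the same order as `samples`.
widths = [len(ch.encode('utf-8')) for ch in samples]
```

Because the widths differ, reading the encoded bytes as `np.uint8` gives one entry per byte, not one per character.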

To get the code points themselves, `ord()` works well:

x = np.asarray([ord(c) for c in 'fooλ']) #x = [102 111 111 955]

But this solution involves a list comprehension, which is slow because it is not vectorized: the Python interpreter has to call ord() on each character instead of calling one function on the whole string. Is there a faster way?

Edit: this question is very similar, although my answer is much more concise.

Kevin Kostlan
  • You could use str.encode('utf-16') to encode it using fixed-width 16-bit numbers. However, that will still need multiple 16-bit numbers if the Unicode code point does not fit in the Basic Multilingual Plane. – Nick ODell Feb 23 '23 at 04:16
  • See also https://stackoverflow.com/questions/54424433/converting-numpy-arrays-of-code-points-to-and-from-strings – Nick ODell Feb 23 '23 at 04:18
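
The UTF-16 caveat from the comment above can be checked with a short sketch. The `'<u2'` dtype (little-endian unsigned 16-bit) and the emoji test character are my own assumptions, chosen to match `'UTF-16-LE'` and to fall outside the Basic Multilingual Plane:

```python
import numpy as np

# Characters inside the Basic Multilingual Plane: one 16-bit unit each.
bmp = np.frombuffer('fooλ'.encode('UTF-16-LE'), dtype='<u2')

# U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair:
# a single character becomes two 16-bit units.
astral = np.frombuffer('\U0001F600'.encode('UTF-16-LE'), dtype='<u2')

print(bmp.tolist())
print([hex(v) for v in astral.tolist()])
```

So UTF-16 only gives one array entry per character when the whole string stays within the BMP.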

1 Answer

It turns out that UTF-32 is the correct answer. Edit: the "-LE" suffix means "little endian"; specifying the byte order explicitly prevents a 4-byte byte order mark (BOM) from being prepended.

x = np.frombuffer('fooλ'.encode('UTF-32-LE'), dtype=np.uint32) #x = [102 111 111 955]
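
A slightly more defensive sketch of the same idea, assuming you want the result to be independent of the machine's native byte order. The `'<u4'` dtype (little-endian unsigned 32-bit) is my substitution for `np.uint32`; it matches `'UTF-32-LE'` explicitly. The sketch also shows the BOM and the reverse conversion:

```python
import numpy as np

s = 'fooλ'

# Plain 'UTF-32' prepends a 4-byte byte order mark; 'UTF-32-LE' does not.
with_bom = s.encode('UTF-32')
no_bom = s.encode('UTF-32-LE')
bom_size = len(with_bom) - len(no_bom)

# '<u4' is little-endian uint32, matching 'UTF-32-LE' on any platform.
x = np.frombuffer(no_bom, dtype='<u4')
print(x)  # [102 111 111 955]

# Round trip: array of code points back to a string.
back = x.tobytes().decode('UTF-32-LE')
```

Note that `np.frombuffer` on a `bytes` object returns a read-only array; call `x.copy()` if you need to modify it.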

I think this is vectorized: for a long string, most of the work happens in compiled C code rather than in the Python interpreter. UTF-32 is a fixed-width encoding, so every code point takes exactly 4 bytes and the variable-width problem from UTF-8 does not arise.
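
One way to check the speed claim yourself is a rough `timeit` comparison. This is only a sketch: the test string, the repeat count, and the helper names `with_ord` and `with_utf32` are hypothetical, and timings will vary by machine, so no numbers are quoted here.

```python
import timeit

import numpy as np

text = 'fooλ' * 100_000  # hypothetical test string, 400,000 characters


def with_ord(s):
    # One Python-level ord() call per character.
    return np.asarray([ord(c) for c in s])


def with_utf32(s):
    # One encode call, then reinterpret the bytes as 32-bit integers.
    return np.frombuffer(s.encode('UTF-32-LE'), dtype='<u4')


# Both approaches should produce the same code points.
same = np.array_equal(with_ord(text), with_utf32(text))

t_ord = timeit.timeit(lambda: with_ord(text), number=5)
t_utf32 = timeit.timeit(lambda: with_utf32(text), number=5)
print(f'ord loop: {t_ord:.4f} s, UTF-32: {t_utf32:.4f} s')
```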

Kevin Kostlan
  • The extra 4 bytes are a byte order mark. You can avoid them by specifying an endianness explicitly: `'fooλ'.encode('UTF-32-LE')`. – user2357112 Feb 23 '23 at 04:27
  • I think you can accept your answer. Currently UTF-32 uses less than 0.1% of its capacity ([11 bits are always 0](https://en.wikipedia.org/wiki/UTF-32)) – dankal444 Feb 24 '23 at 12:32