Having read UTF-8 Everywhere, I attempted to change some of my code to use std::string. I assumed that if I set a std::string to u8"€" (that's the euro symbol, AltGr+4 on my keyboard), the std::string would hold the 3-byte UTF-8 encoding of the euro symbol's code point (U+20AC). It doesn't. Consider

std::string x[] = {"€", u8"€", u8"\€", "\u20AC", u8"\u20AC"}

size_t size[] = {x[0].size(), x[1].size(), x[2].size(), x[3].size(), x[4].size()};

If I view the results in the debugger local variables I see

x[] = {"€", "€", "â??", "€", "€"}

and

size[] = {1, 1, 3, 3, 3}

From what I can see the last two are the only ones that give me the expected result. I'm obviously missing something to do with string literals, but I'm also puzzled how the debugger shows the correct string for the first two given that it thinks they're one char long and int64_t(x[0].c_str()[0]) == int64_t(x[1].c_str()[0]) == -128.

Also, why does '€' == '\€' but "€" != "\€" and u8"€" != u8"\€"? (Edit: ignore this. Remy pointed out my error below re comparing char pointers.)

The results also raise the question: what is the purpose of the u8 string literal prefix?

Can anybody explain before I revert to wchar_t?

I'm on Windows 10 using RAD studio 10.2.

Edit: Tried it with various non-ASCII Unicode characters using the character map facility. Couldn't get it to work with any of them. size() was always 1, and the debugger showed a different character (often '?') from the one I used. I'm using the Surface Pro Type Cover and, from what I can find, there's no way to enter arbitrary Unicode chars from the keyboard (apart from €). Strictly backslashed codes for me from now on. Glad I've cleared it up even if I did waste a whole day. Thanks all.

NoComprende
  • The euro symbol just so happens to be inside the Windows-1252 code page. The purpose of u8 is for other symbols that can't be found inside user code pages. – KamilCuk Sep 09 '19 at 17:27
  • Your first line shouldn't compile. It's missing a semicolon. Also, `\€` is an unrecognized character escape sequence in Visual C++. Which compiler are you using? – Wyck Sep 09 '19 at 17:48
  • @SergeyA, how can it be a duplicate of a question that doesn't even discuss escape sequences? – Wyck Sep 09 '19 at 17:51
  • @Wyck `\EuroSign` is not an escape sequence. – SergeyA Sep 09 '19 at 17:53
  • I'm using the Clang compiler that comes with RAD studio 10.2 but I can't find the version. Could someone try std::string s = u8"€" and see if it works? – NoComprende Sep 09 '19 at 18:23
  • Yes, it works. `€` encodes in UTF-8 as `E2` `82` `AC`. (`-30` `-126` `-84` in decimal.) https://godbolt.org/z/UZByYI – Wyck Sep 09 '19 at 18:41
  • Thanks Wyck. See edit in opening post. – NoComprende Sep 09 '19 at 19:04

1 Answer

I assumed that if I set a std::string to u8"€" (that's the euro symbol, AltGr+4 on my keyboard), the std::string would hold the 3-byte UTF-8 encoding of the euro symbol's code point (U+20AC). It doesn't.

It should, yes. The u8 prefix guarantees the literal is stored as UTF-8 in the final executable, and U+20AC is indeed encoded as 3 bytes (E2 82 AC) in UTF-8. If you are seeing something different, that is likely a compiler bug that should be reported to Embarcadero.
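
For reference, here is a minimal sketch (not from the original thread) that checks this, assuming a pre-C++20 compiler where u8 literals are plain char arrays; it should print 3 and e2 82 ac:

#include <cstdio>
#include <string>

int main() {
    std::string s = u8"\u20AC"; // euro sign; the u8 prefix forces UTF-8 (pre-C++20, this is a char array)
    std::printf("size = %zu\n", s.size()); // expected: 3
    for (unsigned char c : s)
        std::printf("%02x ", c);           // expected: e2 82 ac
    std::printf("\n");
    return 0;
}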

I'm also puzzled how the debugger shows the correct string for the first two given that it thinks they're one char long and int64_t(x[0].c_str()[0]) == int64_t(x[1].c_str()[0]) == -128.

The second one should be 3 bytes, not 1 byte.

Since both are 1 byte, the display only works by chance. The first literal has no prefix, so it is interpreted using the compiler's default ANSI charset, which in your case evidently has the euro sign at byte 0x80 (as Windows-1252 does).
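
That also explains the -128 you saw: char is signed on your platform, so widening the Windows-1252 euro byte 0x80 sign-extends it. A minimal sketch of that conversion (hypothetical, not from the thread):

#include <cstdint>
#include <cstdio>

int main() {
    char ansiEuro = '\x80'; // the euro sign's single byte in Windows-1252
    // On a platform where char is signed, widening sign-extends 0x80 to -128.
    std::int64_t widened = static_cast<std::int64_t>(ansiEuro);
    std::printf("%lld\n", static_cast<long long>(widened)); // prints -128 here
    return 0;
}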

Also, why does '€' == '\€' but "€" != "\€" and u8"€" != u8"\€"?

Because the first one compares actual char values, whereas the other two compare raw char* pointers, not the characters they point at.
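
A sketch of the distinction (assuming the literals compile to identical bytes): == on two string literals compares the pointers they decay to, so to compare contents you need std::strcmp or std::string:

#include <cassert>
#include <cstring>
#include <string>

int main() {
    const char* a = u8"\u20AC";
    const char* b = u8"\u20AC";
    // a == b would compare addresses; it may be false even though the bytes match
    // (compilers may, but are not required to, pool identical literals).
    assert(std::strcmp(a, b) == 0);           // content comparison: equal
    assert(std::string(a) == std::string(b)); // std::string also compares contents
    return 0;
}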

The results also raise the question: what is the purpose of the u8 string literal prefix?

Exactly what you are expecting - it is supposed to make the compiler output the contents of the string literal in UTF-8 encoding.

Remy Lebeau
  • Once again, thanks Remy. I see my mistake with the char pointers. Setting std::string = u8"€" definitely doesn't work properly. I've maybe picked the one symbol that was going to add to my confusion. – NoComprende Sep 09 '19 at 18:20
  • @NoComprende The linked duplicate explains why `u8"€"` doesn't work as expected. Set the source file to be saved as UTF-8 instead of ANSI. – Remy Lebeau Sep 09 '19 at 19:06
  • Remy, when I pasted a different Unicode character into the code the IDE asked "there is an international character …. Do you wish to save as utf8?". I did so, but it didn't make any difference to the end result. size() would be 1, although (unlike with €) an incorrect character would show in the debugger. I just checked and UTF-8 is checked in the File Format. – NoComprende Sep 10 '19 at 07:34
  • In the case of x[3] = "\u20AC", where the compiler gets it right despite the absence of the u8 prefix, does it default to UTF-8 when the string contains a non-ASCII Unicode character? – NoComprende Sep 10 '19 at 07:42
  • @NoComprende the escape sequence tells the compiler which Unicode codepoint you want, but if you don't specify a prefix then the final encoding of that codepoint in the executable is based on which charset the source file and/or compiler are set to. – Remy Lebeau Sep 10 '19 at 16:28