2
  std::string str1 = "いい";
  std::string str2 = "الحانةالريفية";
  WriteToLog(str1.size());
  WriteToLog(str2.size());

I get "2,13" in my log file which is the exact number of characters in those strings. But how the japanese and arabic characters fit into one byte. I hope str.size() is supposed to return no of bytes used by the string.

mani
  • None of the standard library containers implement `size` to return number of *bytes*, rather all of the `size` methods return number of *elements*. Also you should have a read through [this post](https://stackoverflow.com/questions/3257263/how-do-i-get-stl-stdstring-to-work-with-unicode-on-windows) and especially this [great post](https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring). – Cory Kramer Dec 01 '15 at 13:31
  • Couldn't reproduce in [Wandbox](http://melpon.org/wandbox/permlink/JOw4XRGLJJ2cceQ6). What is your environment (OS, compiler, charset, etc.)? – MikeCAT Dec 01 '15 at 13:32
  • @corykramer: std::string::size returns the number of bytes in the string, which is also the number of elements (since the C++ definition of byte is basically what can fit in a char). If it were a wstring, size would return the number of wchar_t in the wstring, which would match the observed output, but not the program. – rici Dec 01 '15 at 14:44

3 Answers

2

On my UTF-8-based locale, I get 6 and 26 bytes respectively.

You must be using a locale whose character set encodes these non-Latin characters in its high 8-bit range, one byte per character.

If you switch to a UTF-8 locale, you should get the same results as I did.
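For example, here's a minimal check (assuming the source file is saved as UTF-8 and the compiler uses a UTF-8 execution character set):

  #include <iostream>
  #include <string>

  int main() {
      // size() counts char elements (bytes), not characters:
      // "いい" is 2 code points at 3 bytes each in UTF-8,
      // "الحانةالريفية" is 13 code points at 2 bytes each.
      std::string str1 = "いい";
      std::string str2 = "الحانةالريفية";
      std::cout << str1.size() << "," << str2.size() << '\n';  // prints 6,26
  }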

Sam Varshavchik
0

The answer is: they can't.

Those strings don't contain what you think they contain.

  • First make sure you save your source file as UTF-8 with BOM, or as UTF-16. (Visual Studio calls these "UTF-8 with signature" and "Unicode".)

    Don't use any other encoding, as then the meaning of that string literal changes as you move your source file between computers with different language settings.

  • Then you need to make sure the compiler uses a suitable character set to embed those strings in your binary. That's called the *execution character set*; see "Does VC have a compile option like '-fexec-charset' in GCC to set the execution character set?"

Or you can go for the portable solution: encode the strings to UTF-8 yourself and write the string literals as bytes, e.g. "\xe3\x81\x84\xe3\x81\x84" (sketched below).
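For instance, a small sketch of that portable approach; the byte values are the UTF-8 encoding of "いい" (U+3044 twice), so the result is the same on every compiler regardless of source or execution character set:

  #include <iostream>
  #include <string>

  int main() {
      // UTF-8 bytes for "いい" spelled out explicitly: E3 81 84, twice.
      std::string str1 = "\xe3\x81\x84\xe3\x81\x84";
      std::cout << str1.size() << '\n';  // always prints 6
  }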

roeland
-1

Your environment is using an MBCS (multi-byte character set).

While Unicode (UTF-16, in Windows parlance) encodes most characters in two bytes, MBCS encodes common characters in a single byte and uses a special lead byte to signal that a character occupies more than one byte. Confusingly, depending on which character you chose as the second character in the Japanese string, your size might have been 3 rather than 2 or 4.

MBCS is a bit dated; Unicode is recommended for new development when possible. See the link below for more info, and the illustrative sketch after it.

https://msdn.microsoft.com/en-us/library/5z097dxa.aspx
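As an illustration only (Windows-specific, and assuming an MBCS code page such as 932 is active), counting *characters* in an MBCS string means checking each byte for a lead byte; `MbcsCharCount` below is a hypothetical helper, not a standard API:

  #include <windows.h>
  #include <cstdio>
  #include <cstring>

  // Hypothetical helper: advances two bytes whenever the current byte
  // is a lead byte in the active code page, one byte otherwise.
  size_t MbcsCharCount(const char* s) {
      size_t count = 0;
      while (*s) {
          if (IsDBCSLeadByte(static_cast<BYTE>(*s)) && s[1] != '\0')
              s += 2;
          else
              s += 1;
          ++count;
      }
      return count;
  }

  int main() {
      const char* s = "abc";  // substitute an MBCS-encoded string here
      std::printf("%zu characters, %zu bytes\n", MbcsCharCount(s), std::strlen(s));
  }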

Nick
  • Slight correction: Unicode does not encode; UTF-8/16/32 encodes Unicode. Only UTF-32 is fixed-length; UTF-16 can be as long as 32 bits / 4 bytes when using surrogate pairs, and UTF-8 can use from 8 bits / 1 byte up to 48 bits / 6 bytes! – Alastair McCormack Dec 01 '15 at 14:15
  • It seems unlikely that there exists an MBCS code page in which there are single-byte representations for both Japanese and Arabic characters. – rici Dec 01 '15 at 14:49
  • @AlastairMcCormack While UTF-8's original design allowed up to 6 bytes, it is now constrained to a maximum of 4. – Mark Tolonen Dec 01 '15 at 17:02