
I'm working on Unicode support in a Linux console application. I need to change the screen buffer format so that it stores Unicode glyphs instead of bytes representing ASCII characters. Unicode has combining characters, so more than one Unicode code point can be rendered into a single console cell.

The question is: what is the maximum number of Unicode combined characters that may be needed to render one glyph in real-life languages? For example, are there any languages in the world with glyphs that need more than 8 combining characters to render? Assume that I don't need "Zalgo text" support if it comes at the cost of the performance degradation caused by storing each console buffer glyph in a variable-length structure.

unxed
  • The flag of Scotland, for example, uses 7 code points, and some emoji surely use more. "Glyph" is also a bad choice of term. In general, modelling a console as cells is a bad choice: the ECMA/ISO standards do not require monospaced fonts (that is just an assumption made by people who have only seen monospaced consoles). Consider the security implications too: even if you don't find long code point sequences in standard languages, somebody will misuse the feature. Do not optimize prematurely: display engines do a much more complex and dynamic job, so everything you program is just a minimal use of CPU. – Giacomo Catenazzi Feb 23 '22 at 13:59
  • [Tibetan uses up to 8 combining characters](https://stackoverflow.com/a/11983435/65863) – Remy Lebeau Feb 28 '22 at 22:37
  • "*what is the maximum number of Unicode combined characters that may be needed to render one glyph in real-life languages?*" - [there is no limit](https://stackoverflow.com/questions/46184958/) ([How does Zalgo text work?](https://stackoverflow.com/questions/6579844/)). Also see [Is there a limit to combining characters with unicode?](https://stackoverflow.com/questions/65293355/) and [What is a realistic maximum number of unicode combining characters?](https://stackoverflow.com/questions/50272889/). – Remy Lebeau Feb 28 '22 at 22:38

1 Answer


Nobody can be an expert in what makes up a "real-life" character in every language, so I might be missing some longer sequences here. But I do know about a lot of emoji! A few emoji for flags of geographic subdivisions are implemented as a base emoji followed by a sequence of tag codepoints. For example, the flag of Scotland is 7 codepoints, taking up 28 bytes in UTF-32:

  • WAVING BLACK FLAG
  • TAG LATIN SMALL LETTER G
  • TAG LATIN SMALL LETTER B
  • TAG LATIN SMALL LETTER S
  • TAG LATIN SMALL LETTER C
  • TAG LATIN SMALL LETTER T
  • CANCEL TAG
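As a quick check of those numbers, here is a minimal Python sketch that builds the sequence from escapes (so the invisible tag characters survive copy and paste) and counts codepoints and UTF-32 bytes. The names `SCOTLAND_FLAG` and `describe` are just illustrative, not part of any library.

```python
import unicodedata

# The Scotland flag sequence, spelled out code point by code point.
SCOTLAND_FLAG = (
    "\U0001F3F4"  # WAVING BLACK FLAG
    "\U000E0067"  # TAG LATIN SMALL LETTER G
    "\U000E0062"  # TAG LATIN SMALL LETTER B
    "\U000E0073"  # TAG LATIN SMALL LETTER S
    "\U000E0063"  # TAG LATIN SMALL LETTER C
    "\U000E0074"  # TAG LATIN SMALL LETTER T
    "\U000E007F"  # CANCEL TAG
)


def describe(text):
    """Return (U+XXXX, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(SCOTLAND_FLAG):
    print(code_point, name)

# A Python str is a sequence of code points, so len() counts code points.
codepoint_count = len(SCOTLAND_FLAG)
# "utf-32-le" is used instead of "utf-32" to avoid counting a 4-byte BOM.
utf32_bytes = len(SCOTLAND_FLAG.encode("utf-32-le"))
print(codepoint_count, "codepoints,", utf32_bytes, "bytes in UTF-32")
```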

Country flags are shorter: each is just two codepoints (a pair of regional indicator symbols).

Family emojis with 4 people are also 7 codepoints: four person emoji joined by three ZERO WIDTH JOINERs. The only emoji I'm aware of that are longer are family emojis with a skin tone specified for each family member, but these don't have a lot of support right now (if you just see four heads where one should be, then you don't have a font installed that supports it). Such an emoji has 11 codepoints: four people, four skin-tone modifiers, and three joiners.
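The arithmetic for family emoji can be reproduced with a short Python sketch. The choice of family members (man, woman, girl, boy) and of a single medium skin tone for everyone is an assumption made for illustration; other combinations give the same counts.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

MAN, WOMAN, GIRL, BOY = "\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"
# Assumption for illustration: the same modifier applied to every member.
SKIN_TONE = "\U0001F3FD"  # EMOJI MODIFIER FITZPATRICK TYPE-4

people = [MAN, WOMAN, GIRL, BOY]

# Four people joined by three ZWJs.
family = ZWJ.join(people)

# Each person followed by a skin-tone modifier, then joined by three ZWJs.
family_with_tones = ZWJ.join(person + SKIN_TONE for person in people)

print(len(family), "codepoints without skin tones")
print(len(family_with_tones), "codepoints with skin tones")
```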

That being said, keep in mind that not all languages are rendered as a series of separate glyphs in sequence: أهلا is segmented by the Unicode rules into 4 distinct characters, even though it is drawn as connected, cursive shapes.
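To make the trade-off in the question concrete, here is a rough Python sketch of a cell buffer with a fixed cap on codepoints per cell. Everything in it is hypothetical: the cap of 8, the name `to_cells`, and above all the grouping rule, which only attaches codepoints whose general category is a mark (`Mn`, `Mc`, `Me`). It does not implement real grapheme segmentation (UAX #29), so ZWJ sequences and tag sequences like the emoji above would still be split across cells.

```python
import unicodedata

MAX_CODEPOINTS_PER_CELL = 8  # hypothetical cap, taken from the question's example


def to_cells(text, cap=MAX_CODEPOINTS_PER_CELL):
    """Group text into cells: a base code point plus following marks, up to cap."""
    cells = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if cells and is_mark:
            if len(cells[-1]) < cap:
                cells[-1] += ch
            # Otherwise the cell is full and the extra mark is dropped,
            # which is what happens to Zalgo-style stacks under a fixed cap.
        else:
            # A base character (or a stray leading mark) starts a new cell.
            cells.append(ch)
    return cells


accented = to_cells("e\u0301x")                # e + COMBINING ACUTE ACCENT, then x
zalgo = to_cells("a" + "\u0301" * 20)          # one base with 20 stacked accents
arabic = to_cells("\u0623\u0647\u0644\u0627")  # the Arabic word from the text above

print(len(accented), len(zalgo), len(zalgo[0]), len(arabic))
```

With a fixed cap, overlong stacks are truncated rather than stored, so the choice of cap decides which real-world sequences survive.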

Remy Lebeau
smitop