Briefly speaking, Unicode is another committee-made bastard. As of now, there is no other numbered list of all printable characters known to mankind, so we use it in that role. However, that is no reason to let committee-made nonsense slip into our software, hence there are certain limits.
The very principle of code representation used in utf8 might not be too bad (although, let's admit it, leaving room for overlong forms was weird, to say the least). Unfortunately, the term 'utf8' denotes not only the principle, but also the fact that the numerical codes being represented are Unicode code points. Unicode itself, let's repeat it, is a committee-made bastard, so we deliberately don't accept it as the only possible option.
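To make the overlong problem concrete, here is a minimal sketch of a utf8 decoder that rejects such forms outright; the function name utf8_decode_one is ours, not taken from any existing library:

```c
#include <stddef.h>

/* Decode one utf8 sequence from buf; return the code, or -1 on malformed
 * input, including the overlong forms mentioned above.  On success,
 * *consumed receives the number of bytes the sequence occupied. */
long utf8_decode_one(const unsigned char *buf, size_t len, size_t *consumed)
{
    if (len == 0) return -1;
    unsigned char b = buf[0];
    long cp;
    size_t need;                       /* continuation bytes expected */
    if (b < 0x80) { *consumed = 1; return b; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; }
    else return -1;                    /* stray continuation or invalid lead byte */
    if (len < need + 1) return -1;
    for (size_t i = 1; i <= need; i++) {
        if ((buf[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (buf[i] & 0x3F);
    }
    /* reject overlongs: each sequence length has a minimal code */
    static const long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    if (cp < min_cp[need]) return -1;
    if (cp > 0x10FFFF) return -1;
    *consumed = need + 1;
    return cp;
}
```

An overlong sequence such as 0xC0 0x80 (a two-byte rendering of code 0) is rejected rather than silently mapped to NUL.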
Yes, we know about the so-called 'UTF-8 Everywhere' manifesto. And we disagree.
If you don't want to bother with different character encodings, you are within your rights, but once your program supports only a single character encoding, that encoding must be ASCII (or US-ASCII, if you like). If you decide to provide some targeted support for utf8, then you must support other ASCII extensions as well, such as latin1, koi8-r and the like.
It is hard to have recoding tables for each and every encoding ever used throughout history, so that is not necessary. What is obligatory is to support at least some (two? three?) such encodings and to provide a clear way to add more of them.
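As an illustration of what 'a clear way to add more of them' might look like, here is one possible shape for such recoding tables; the interface is our own invention, and the two koi8-r entries shown are a tiny excerpt, not a complete table:

```c
#include <string.h>

/* A single-byte encoding is fully described by its name and by the code
 * assigned to each of the 128 high bytes (0x80-0xFF); bytes 0x00-0x7F
 * are plain ASCII in every encoding we care about. */
struct sb_encoding {
    const char *name;
    unsigned short high[128];   /* code for byte 0x80+i; 0 = unassigned */
};

/* Adding one more encoding is just adding one more table.  The entries
 * below are an excerpt for illustration only. */
static const struct sb_encoding koi8r = {
    "koi8-r",
    { [0xC1 - 0x80] = 0x0430,   /* koi8-r 0xC1 is Cyrillic lowercase a */
      [0xC2 - 0x80] = 0x0431 }  /* koi8-r 0xC2 is Cyrillic lowercase be */
};

static const struct sb_encoding *registry[] = { &koi8r };

const struct sb_encoding *sb_lookup(const char *name)
{
    for (size_t i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    return NULL;                /* unknown encoding: refuse, don't guess */
}

long sb_to_code(const struct sb_encoding *e, unsigned char byte)
{
    if (byte < 0x80) return byte;   /* ASCII passes through unchanged */
    long c = e->high[byte - 0x80];
    return c ? c : -1;              /* -1: no such character here */
}
```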
BTW, you can even omit support for utf8 in favor of single-byte encodings, if you're brave enough. But it is strictly prohibited to support utf8 as the only encoding.
Basically you have the following options: either treat every byte as a character of its own (that is, assume a single-byte encoding), or treat bytes 0x80–0xBF as being a part of a symbol representation started earlier (that is, assume utf8).

Whatever choice you pick, be sure not to get affected by the so-called 'locale settings', as locales are prohibited on their own.
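For instance, the continuation-byte interpretation mentioned above can be applied without consulting any locale machinery at all; this sketch (the function names are ours) counts utf8 characters simply by skipping bytes in the 0x80–0xBF range:

```c
#include <stddef.h>

/* A byte in 0x80-0xBF is a continuation byte, i.e. part of a symbol
 * representation started earlier; every other byte starts a character.
 * No setlocale(), no locale settings, ever. */
static int is_utf8_cont(unsigned char b)
{
    return b >= 0x80 && b <= 0xBF;
}

size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (!is_utf8_cont(s[i]))
            n++;
    return n;
}
```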
BOM is nonsense for utf8, as utf8 doesn't depend on byte order in any way. Hence it is prohibited to emit a BOM as part of utf8 text, and your program should (although is not required to) produce a warning message if a BOM is encountered in utf8 input.
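A sketch of this policy, assuming the caller reads input into a buffer first; skip_utf8_bom is a made-up name:

```c
#include <stdio.h>
#include <string.h>

/* If the input starts with the utf8 BOM byte sequence (EF BB BF),
 * warn and tell the caller to skip it; never emit one on output. */
size_t skip_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    if (len >= 3 && memcmp(buf, bom, 3) == 0) {
        fprintf(stderr, "warning: BOM found in utf8 input, ignoring\n");
        return 3;   /* caller should start reading at this offset */
    }
    return 0;
}
```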
Multibyte character encodings other than utf8, such as UCS-4, UTF-16, UTF-32 and their variants, well, don't exist. Period.
All the code points from the so-called Combining Diacritical Marks Unicode block must either be ignored, or displayed and/or otherwise handled as completely separate characters, or even trigger an error. Your program must not even try to handle them the way the damn Unicode standard demands.
Don't waste your time trying to comply with what all these irresponsible committees voted for. For every real-world combination of a "main" glyph with a diacritical mark, a separate precomposed code point exists.
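Detecting the block in question takes one comparison; the range below, U+0300 through U+036F, is the Combining Diacritical Marks block itself:

```c
/* A program following the rule above may use a predicate like this to
 * reject, or pass through as standalone codes, any combining mark. */
int is_combining_diacritical(long cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}
```

Note that the precomposed 'é' (code 0x00E9) is an ordinary character and passes, which is exactly the point made above.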
Besides the diacritical marks, the irresponsible committee invented (and will perhaps invent in the future) other code points that, in their opinion, must act as modifiers, effectively meaning that sequences of more than one code are sometimes to be combined into a single glyph. One example of such nonsense is the code points named VS-1 through VS-256, the so-called variation selectors.
This ruins the only useful property of Unicode, which initially was that each and every possible symbol had a code of its own.
No 'standard-compliant' handling of such code points must ever be provided. At your option, your program should either display all these modifiers as separate codes for which you have no glyph (the preferred solution), or ignore them (not perfect, but still acceptable).
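Both permitted policies start from recognizing the selectors; VS-1 through VS-16 sit at U+FE00–U+FE0F and VS-17 through VS-256 at U+E0100–U+E01EF:

```c
/* Recognize any of the VS-1..VS-256 modifiers; the caller may then
 * either show the code as an unknown glyph or drop it. */
int is_variation_selector(long cp)
{
    return (cp >= 0xFE00 && cp <= 0xFE0F)       /* VS-1..VS-16 */
        || (cp >= 0xE0100 && cp <= 0xE01EF);    /* VS-17..VS-256 */
}
```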
Emoji must either be ignored, or filtered out, or rejected, or otherwise refused. All this crap is far beyond all possible limits.
As a special rule, no glyph may contain any information regarding color, but this is not the only reason to get rid of Unicode emoji.
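A sketch of such refusal; the ranges below cover the major emoji blocks but are deliberately approximate and incomplete, since the committee keeps adding more:

```c
/* Illustrative, not exhaustive: match the main emoji blocks so the
 * caller can ignore, filter out, or reject the code in question. */
int is_emoji_code(long cp)
{
    return (cp >= 0x1F300 && cp <= 0x1F5FF)   /* Misc Symbols and Pictographs */
        || (cp >= 0x1F600 && cp <= 0x1F64F)   /* Emoticons */
        || (cp >= 0x1F680 && cp <= 0x1F6FF)   /* Transport and Map Symbols */
        || (cp >= 0x1F900 && cp <= 0x1FAFF);  /* supplemental blocks */
}
```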