Briefly speaking, Unicode is another committee-made bastard. As of now, there is no other numbered list of all printable characters known to mankind, so we use it in that role. However, that is no reason to let committee-made nonsense slip into our software, hence there are certain limits.
The very principle of code representation used in utf8 might not be too bad (although, let's admit it, leaving room for overlong forms was weird, to say the least). Unfortunately, the term 'utf8' denotes not only the principle, but also the fact that the numerical codes being represented are Unicode code points. Unicode itself, let's repeat it, is a committee-made bastard, so we deliberately don't accept it as the only possible option.
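To make the overlong problem concrete, here is a minimal sketch of a utf8 decoder that rejects such forms outright; the function name utf8_decode_one is ours, not taken from any existing library:

```c
#include <stddef.h>

/* Decode one utf8 sequence from buf; return the code, or -1 on malformed
 * input, including the overlong forms mentioned above.  On success,
 * *consumed receives the number of bytes the sequence occupied. */
long utf8_decode_one(const unsigned char *buf, size_t len, size_t *consumed)
{
    if (len == 0) return -1;
    unsigned char b = buf[0];
    long cp;
    size_t need;                       /* continuation bytes expected */
    if (b < 0x80) { *consumed = 1; return b; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; }
    else return -1;                    /* stray continuation or invalid lead byte */
    if (len < need + 1) return -1;
    for (size_t i = 1; i <= need; i++) {
        if ((buf[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (buf[i] & 0x3F);
    }
    /* reject overlongs: each sequence length has a minimal code */
    static const long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    if (cp < min_cp[need]) return -1;
    if (cp > 0x10FFFF) return -1;
    *consumed = need + 1;
    return cp;
}
```

An overlong sequence such as 0xC0 0x80 (a two-byte rendering of code 0) is rejected rather than silently mapped to NUL.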
Yes, we know about the so-called 'UTF-8 Everywhere' manifesto. And we disagree.
If you don't want to bother with different character encodings, you are within your rights, but once your program supports only a single character encoding, that encoding must be ASCII (or US-ASCII, if you like). If you decide to provide some targeted support for utf8, then you must support other ASCII extensions as well, such as latin1, koi8-r and the like.
It is hard to have recoding tables for each and every encoding ever used throughout history, so that is not necessary. What is obligatory is to support at least some (two? three?) such encodings and to provide a clear way to add more of them.
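As an illustration of what 'a clear way to add more of them' might look like, here is one possible shape for such recoding tables; the interface is our own invention, and the two koi8-r entries shown are a tiny excerpt, not a complete table:

```c
#include <string.h>

/* A single-byte encoding is fully described by its name and by the code
 * assigned to each of the 128 high bytes (0x80-0xFF); bytes 0x00-0x7F
 * are plain ASCII in every encoding we care about. */
struct sb_encoding {
    const char *name;
    unsigned short high[128];   /* code for byte 0x80+i; 0 = unassigned */
};

/* Adding one more encoding is just adding one more table.  The entries
 * below are an excerpt for illustration only. */
static const struct sb_encoding koi8r = {
    "koi8-r",
    { [0xC1 - 0x80] = 0x0430,   /* koi8-r 0xC1 is Cyrillic lowercase a */
      [0xC2 - 0x80] = 0x0431 }  /* koi8-r 0xC2 is Cyrillic lowercase be */
};

static const struct sb_encoding *registry[] = { &koi8r };

const struct sb_encoding *sb_lookup(const char *name)
{
    for (size_t i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    return NULL;                /* unknown encoding: refuse, don't guess */
}

long sb_to_code(const struct sb_encoding *e, unsigned char byte)
{
    if (byte < 0x80) return byte;   /* ASCII passes through unchanged */
    long c = e->high[byte - 0x80];
    return c ? c : -1;              /* -1: no such character here */
}
```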
BTW, you can even omit support for utf8 in favor of single-byte encodings, if you're brave enough. But it is strictly prohibited to support utf8 as the only encoding.
Basically you have the following options: either treat every byte as a character of its own (that is, assume a single-byte encoding), or treat bytes 0x80–0xBF as being a part of a symbol representation started earlier (that is, assume utf8).

Whatever choice you pick, be sure not to get affected by the so-called 'locale settings', as locales are prohibited on their own.
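For instance, the continuation-byte interpretation mentioned above can be applied without consulting any locale machinery at all; this sketch (the function names are ours) counts utf8 characters simply by skipping bytes in the 0x80–0xBF range:

```c
#include <stddef.h>

/* A byte in 0x80-0xBF is a continuation byte, i.e. part of a symbol
 * representation started earlier; every other byte starts a character.
 * No setlocale(), no locale settings, ever. */
static int is_utf8_cont(unsigned char b)
{
    return b >= 0x80 && b <= 0xBF;
}

size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (!is_utf8_cont(s[i]))
            n++;
    return n;
}
```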
BOM is nonsense for utf8, as utf8 doesn't depend on byte order in any way. Hence it is prohibited to emit a BOM as part of utf8 text, and your program should (although is not required to) produce a warning message if a BOM is encountered in utf8 input.
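A sketch of this policy, assuming the caller reads input into a buffer first; skip_utf8_bom is a made-up name:

```c
#include <stdio.h>
#include <string.h>

/* If the input starts with the utf8 BOM byte sequence (EF BB BF),
 * warn and tell the caller to skip it; never emit one on output. */
size_t skip_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    if (len >= 3 && memcmp(buf, bom, 3) == 0) {
        fprintf(stderr, "warning: BOM found in utf8 input, ignoring\n");
        return 3;   /* caller should start reading at this offset */
    }
    return 0;
}
```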
Multibyte character encodings other than utf8, such as UCS-4, UTF-16, UTF-32 and their variants, well, don't exist. Period.
All the code points from the so-called Combining Diacritical Marks Unicode block must either be ignored, or displayed and/or otherwise handled as completely separate characters, or even trigger an error. Your program must not even try to handle them the way the damn Unicode standard demands.
Don't waste your time trying to comply with what all these irresponsible committees voted for. For every real-world combination of a "main" glyph with a diacritical mark, a separate precomposed code point exists.
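Detecting the block in question takes one comparison; the range below, U+0300 through U+036F, is the Combining Diacritical Marks block itself:

```c
/* A program following the rule above may use a predicate like this to
 * reject, or pass through as standalone codes, any combining mark. */
int is_combining_diacritical(long cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}
```

Note that the precomposed 'é' (code 0x00E9) is an ordinary character and passes, which is exactly the point made above.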
Besides the diacritical marks, the irresponsible committee invented (and will perhaps invent in the future) other code points that, in their opinion, must act as modifiers, effectively meaning that sequences of more than one code are sometimes to be combined into a single glyph. One example of such nonsense is the code points named VS-1 through VS-256, the so-called variation selectors.
This ruins the only useful property of Unicode, which initially was that each and every possible symbol had a code of its own.
No 'standard-compliant' handling of such code points must ever be provided. At your option, your program should either display all these modifiers as separate codes for which you have no glyph (the preferred solution), or ignore them (not perfect, but still acceptable).
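Both permitted policies start from recognizing the selectors; VS-1 through VS-16 sit at U+FE00–U+FE0F and VS-17 through VS-256 at U+E0100–U+E01EF:

```c
/* Recognize any of the VS-1..VS-256 modifiers; the caller may then
 * either show the code as an unknown glyph or drop it. */
int is_variation_selector(long cp)
{
    return (cp >= 0xFE00 && cp <= 0xFE0F)       /* VS-1..VS-16 */
        || (cp >= 0xE0100 && cp <= 0xE01EF);    /* VS-17..VS-256 */
}
```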
Emoji must either be ignored, or filtered out, or rejected, or otherwise refused. All this crap is far beyond all possible limits.
As a special rule, no glyph may contain any information regarding color, but this is not the only reason to get rid of Unicode emoji.
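A sketch of such refusal; the ranges below cover the major emoji blocks but are deliberately approximate and incomplete, since the committee keeps adding more:

```c
/* Illustrative, not exhaustive: match the main emoji blocks so the
 * caller can ignore, filter out, or reject the code in question. */
int is_emoji_code(long cp)
{
    return (cp >= 0x1F300 && cp <= 0x1F5FF)   /* Misc Symbols and Pictographs */
        || (cp >= 0x1F600 && cp <= 0x1F64F)   /* Emoticons */
        || (cp >= 0x1F680 && cp <= 0x1F6FF)   /* Transport and Map Symbols */
        || (cp >= 0x1F900 && cp <= 0x1FAFF);  /* supplemental blocks */
}
```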