c-treeDB API for C++ - Developers Guide

Unicode UTF-8

UTF-8 is the byte-oriented encoding form of Unicode. The UTF-8 encoding is defined in ISO 10646-1:2000 Annex D and also described in RFC 3629 as well as section 3.9 of the Unicode 4.0 standard.

UTF-8 has the following properties:

  • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
  • The first byte of a multi-byte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multi-byte sequence are in the range 0x80 to 0xBF. This allows easy re-synchronization and makes the encoding stateless and robust against missing bytes.
  • All possible 2^31 UCS codes can be encoded.
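  • UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long. RFC 3629 restricts UTF-8 to the range U+0000 to U+10FFFF, so sequences that conform to it are at most four bytes long.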
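  • The sorting order of big-endian UCS-4 byte strings is preserved: comparing two UTF-8 strings byte by byte gives the same result as comparing the corresponding UCS-4 strings.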
  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
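
The properties listed above can be illustrated with a minimal C++ sketch. It is not part of the c-treeDB API; the function names encode_utf8, utf8_sequence_length, and count_code_points are hypothetical and chosen only for this example. The encoder assumes the RFC 3629 range (U+0000 to U+10FFFF, at most four bytes) rather than the theoretical six-byte form, and the length helper inspects only the bit pattern of the lead byte, so it does not reject overlong or otherwise ill-formed sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 (RFC 3629 range).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // ASCII is encoded as a single identical byte.
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: number of bytes in the sequence announced by a lead byte.
// Returns 0 for continuation bytes (0x80-0xBF) and for bytes this sketch
// does not treat as lead bytes (0xF8-0xFF).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical helper: count characters by skipping continuation bytes.
// This works because every byte after the first in a multi-byte sequence
// has the bit pattern 10xxxxxx, and no other byte does.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For example, encode_utf8(0x20AC) yields the three bytes 0xE2 0x82 0xAC for the euro sign, and count_code_points applied to the bytes 0x41 0xC3 0xA9 0xE2 0x82 0xAC returns 3. Because continuation bytes are recognizable on their own, a reader that starts in the middle of a string can skip forward to the next byte that is not of the form 10xxxxxx and resume decoding there.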
