c-treeDB API for C++ - Developers Guide

Unicode UTF-16

UCS-2 and UTF-16 are closely related 16-bit Unicode Transformation Formats: character encoding forms that provide a way to represent a series of abstract characters from Unicode and ISO/IEC 10646 as a series of 16-bit words suitable for storage or transmission over data networks. Strictly speaking, UCS-2 is the older, fixed-width form limited to the first 65,536 code points, while UTF-16 extends it with surrogate pairs for characters beyond that range. UTF-16 is officially defined in Annex Q of ISO/IEC 10646-1. It is also described in "The Unicode Standard" version 3.0 and higher, as well as in the IETF's RFC 2781.

UTF-16 represents a character assigned within the lower 65,536 code points of Unicode or ISO/IEC 10646 as a single code value equal to the character's code point: for example, code point 0 is encoded as the code value 0, and hexadecimal FFFD as FFFD.
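
For characters in this range the mapping is simply the identity, so the 16-bit code value can be produced with a plain cast. The following minimal C++ sketch illustrates this using only standard types; it is not part of the c-treeDB API:

#include <cstdint>
#include <iostream>

int main()
{
    char32_t codePoint = 0x00E9;   // U+00E9 LATIN SMALL LETTER E WITH ACUTE, inside the BMP
    char16_t codeUnit  = static_cast<char16_t>(codePoint);   // same numeric value

    std::cout << std::hex << std::uppercase
              << "U+" << static_cast<uint32_t>(codePoint)
              << " -> code value " << static_cast<uint32_t>(codeUnit) << std::endl;
    return 0;
}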

UTF-16 represents a character above hexadecimal FFFF as a surrogate pair of code values from the range D800-DFFF. For example, the character at code point hexadecimal 10000 becomes the code value sequence D800 DC00, and the character at hexadecimal 10FFFF, the upper limit of Unicode, becomes the code value sequence DBFF DFFF. Unicode and ISO/IEC 10646 do not assign characters to any of the code points in the D800-DFFF range, so an individual code value from a surrogate pair never represents a character on its own.
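
The surrogate-pair arithmetic is straightforward: subtract hexadecimal 10000 from the code point, put the high 10 bits into a leading code value based at D800 and the low 10 bits into a trailing code value based at DC00. The sketch below expresses this in standard C++; the function names encodeSurrogatePair and decodeSurrogatePair are illustrative only and are not part of the c-treeDB API:

#include <cstdint>
#include <iostream>
#include <utility>

// Encode a code point in the range 10000-10FFFF as a UTF-16 surrogate pair.
std::pair<char16_t, char16_t> encodeSurrogatePair(char32_t codePoint)
{
    char32_t v = codePoint - 0x10000;                              // 20-bit value
    char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));     // leading (high) surrogate
    char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));   // trailing (low) surrogate
    return std::make_pair(high, low);
}

// Inverse mapping: recombine a surrogate pair into the original code point.
char32_t decodeSurrogatePair(char16_t high, char16_t low)
{
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         +  (static_cast<char32_t>(low)  - 0xDC00);
}

int main()
{
    std::pair<char16_t, char16_t> p = encodeSurrogatePair(0x10000);   // expect D800 DC00
    std::pair<char16_t, char16_t> q = encodeSurrogatePair(0x10FFFF);  // expect DBFF DFFF

    std::cout << std::hex << std::uppercase
              << static_cast<uint32_t>(p.first) << " " << static_cast<uint32_t>(p.second) << "\n"
              << static_cast<uint32_t>(q.first) << " " << static_cast<uint32_t>(q.second) << "\n"
              << static_cast<uint32_t>(decodeSurrogatePair(0xD800, 0xDC00)) << std::endl;   // expect 10000
    return 0;
}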

These code values are then serialized as 16-bit words, one word per code value. Because the byte order (endianness) of these words varies according to the computer architecture, UTF-16 specifies three encoding schemes: UTF-16, UTF-16LE, and UTF-16BE. The UTF-16LE and UTF-16BE schemes fix the byte order explicitly, while the plain UTF-16 scheme may begin with a byte order mark (the code value FEFF) so a reader can detect the byte order of the stream.
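
To illustrate how the two byte orders and the optional byte order mark affect the serialized form, the following C++ sketch writes a sequence of code values as bytes. The helper serializeUtf16 is a hypothetical example, not a c-treeDB function:

#include <cstdint>
#include <iomanip>
#include <iostream>
#include <vector>

// Serialize 16-bit code values as bytes, either little-endian or big-endian,
// optionally preceded by the byte order mark (FEFF).
std::vector<uint8_t> serializeUtf16(const std::vector<char16_t>& codeValues,
                                    bool littleEndian, bool withBom)
{
    std::vector<uint8_t> bytes;
    auto emit = [&](char16_t u) {
        if (littleEndian) {
            bytes.push_back(static_cast<uint8_t>(u & 0xFF));   // low byte first
            bytes.push_back(static_cast<uint8_t>(u >> 8));
        } else {
            bytes.push_back(static_cast<uint8_t>(u >> 8));     // high byte first
            bytes.push_back(static_cast<uint8_t>(u & 0xFF));
        }
    };
    if (withBom)
        emit(0xFEFF);                 // byte order mark, used by the plain "UTF-16" scheme
    for (char16_t u : codeValues)
        emit(u);
    return bytes;
}

static void dump(const std::vector<uint8_t>& bytes)
{
    for (uint8_t b : bytes)
        std::cout << std::hex << std::uppercase << std::setw(2) << std::setfill('0') << +b << ' ';
    std::cout << '\n';
}

int main()
{
    std::vector<char16_t> text = { 0x0041, 0xD800, 0xDC00 };   // 'A' followed by U+10000 as a surrogate pair

    dump(serializeUtf16(text, true,  true));    // plain UTF-16, little-endian with BOM: FF FE 41 00 00 D8 00 DC
    dump(serializeUtf16(text, false, false));   // UTF-16BE, no BOM: 00 41 D8 00 DC 00
    return 0;
}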

UTF-16 is the native internal representation of text in the NT/2000/XP versions of Windows and in the Java and .NET bytecode environments, as well as in Mac OS X’s Cocoa and Core Foundation frameworks.
