FairCom ISAM for C

Unicode Concepts

Unicode is an effort to standardize how the characters of all written languages are represented on computers. Early standards, such as ASCII, encoded only the letters used in English. Efforts to internationalize began by extending ASCII with characters used in other Western languages, such as letters with umlauts and accents, but these extensions were limited to the 256 characters that fit in 1 byte. Unicode incorporates the characters of the major government standards for ideographic characters from Japan, Korea, China, and Taiwan, among others.

Though Unicode is often thought of as a wide-character encoding with 16 bits per character, the Unicode standard defines three encoding forms: an 8-bit multi-byte encoding (UTF‑8), a 16-bit wide-character encoding (UTF‑16), and a 32-bit wide-character encoding (UTF‑32). FairCom DB supports both UTF‑8 and UTF‑16.

Well-defined conversion routines permit unambiguous translation among UTF‑8, UTF‑16, and UTF‑32.
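As an illustration of why the translation is unambiguous, the following minimal sketch converts a single code point (the UTF‑32 value) to UTF‑8 and back. The function names utf8_encode and utf8_decode are hypothetical helpers written for this example; they are not part of the FairCom API, and production code should use a tested conversion library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a FairCom API): encode one code point as UTF-8.
   Writes 1 to 4 bytes to out and returns the byte count, or 0 if cp is
   not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}

/* Hypothetical helper: decode one UTF-8 sequence from in (len bytes
   available). Stores the code point in *cp and returns the number of
   bytes consumed, or 0 if the sequence is malformed. */
size_t utf8_decode(const unsigned char *in, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t v, min;

    assert(in != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (in[0] < 0x80) {
        *cp = in[0];
        return 1;
    }
    if ((in[0] & 0xE0) == 0xC0)      { n = 2; v = in[0] & 0x1F; min = 0x80; }
    else if ((in[0] & 0xF0) == 0xE0) { n = 3; v = in[0] & 0x0F; min = 0x800; }
    else if ((in[0] & 0xF8) == 0xF0) { n = 4; v = in[0] & 0x07; min = 0x10000; }
    else
        return 0;                         /* stray continuation or bad lead */
    if (len < n)
        return 0;                         /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        v = (v << 6) | (in[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and out-of-range values. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;
    *cp = v;
    return n;
}
```

For example, utf8_encode(0x20AC, buf) produces the 3 bytes E2 82 AC for the euro sign, and utf8_decode on those bytes returns the original value 0x20AC. Rejecting overlong forms in the decoder is what keeps the mapping one-to-one.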

A Unicode string is terminated by a null character, which is one code unit with the value zero: a single zero byte for UTF‑8, 2 zero bytes for UTF‑16, and 4 zero bytes for UTF‑32.
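Because the terminator is as wide as the code unit, the length of a UTF‑16 string has to be found by scanning 16-bit units rather than bytes. The sketch below shows one way to do this. The names utf16_length and utf16_size_bytes are hypothetical helpers for this example, not FairCom functions, and the code assumes the buffer is properly terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a FairCom API): count the 16-bit code units
   in a UTF-16 string, excluding the 2-byte null terminator. */
size_t utf16_length(const uint16_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)   /* compare whole code units, not single bytes */
        n++;
    return n;
}

/* Hypothetical helper: bytes needed to store the string, including
   the 2-byte terminator. */
size_t utf16_size_bytes(const uint16_t *s)
{
    return (utf16_length(s) + 1) * sizeof(uint16_t);
}
```

For the string { 0x0041, 0x00E9, 0x20AC, 0x0000 }, utf16_length returns 3 and utf16_size_bytes returns 8. A byte-oriented routine such as strlen would stop early on the same buffer, because the code unit 0x0041 already contains a zero byte.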

Note: UTF‑16 does not encode every character with a single 16-bit code unit. Characters outside the Basic Multilingual Plane (code points above U+FFFF) are encoded as a sequence of two 16-bit code units, called a surrogate pair.
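The two-unit sequences described in the note can be sketched as follows. A code point above U+FFFF is reduced by 0x10000, and the resulting 20 bits are split between a high unit in the range D800 to DBFF and a low unit in the range DC00 to DFFF. The names utf16_encode and utf16_decode_pair are hypothetical helpers for this example, not FairCom functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a FairCom API): encode one code point as
   UTF-16. Writes 1 or 2 code units to out and returns the count, or 0
   if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;             /* fits in one code unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                          /* beyond the Unicode range */
    cp -= 0x10000;                         /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low: bottom 10 bits */
    return 2;
}

/* Hypothetical helper: combine a high and a low surrogate into a code
   point. Returns 0 if the two units do not form a valid pair. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000 + (((uint32_t)(hi - 0xD800)) << 10) + (uint32_t)(lo - 0xDC00);
}
```

For example, U+1F600 encodes as the pair D83D DE00, while U+20AC stays a single unit, 20AC. Code that counts or truncates UTF‑16 strings by code unit should therefore avoid splitting a pair.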