Codepoints, UTF-8 and Unicode

From Serious Documentation

The traditional ASCII character set is a 7-bit code with only 128 characters, and even its 8-bit extensions (256 characters) cannot represent the characters of every language. Unicode 16, as the term is used here, is a standard that assigns an unsigned 16-bit value, called a "Unicode codepoint" or just "codepoint", to each character across all languages. The first 128 Unicode 16 codepoints are identical to the 128 ASCII characters. (The full Unicode standard has since grown beyond 16 bits, with codepoints up to U+10FFFF.) Unicode is well described at unicode.org.

However, storing text as raw 16-bit (i.e. 2-byte) values for every character can be very inefficient, especially when the strings consist mostly of traditional ASCII characters, including digits and punctuation, that would otherwise take only one byte each. Various character encodings have therefore been developed to store text more compactly than unencoded Unicode 16.
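To make the storage difference concrete, here is a minimal Python sketch. It is only an illustration: Python is not part of SHIP, and the sample string and variable names are made up for this example. It stores the same mostly-ASCII string once as raw 16-bit values and once as UTF-8, then compares the byte counts.

```python
# Compare the storage cost of an ASCII-only string in two encodings.
text = "Temperature: 72 degrees"

fixed_width = text.encode("utf-16-le")  # 2 bytes per character, no byte-order mark
variable_width = text.encode("utf-8")   # 1 byte per ASCII character

print("characters:      ", len(text))
print("16-bit storage:  ", len(fixed_width), "bytes")
print("UTF-8 storage:   ", len(variable_width), "bytes")
```

For a string made up entirely of ASCII characters, the 16-bit form takes exactly twice as many bytes as the UTF-8 form.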

All strings within SHIP are encoded using UTF-8, a variable-length multi-byte encoding. The traditional 7-bit ASCII characters are represented exactly the same in UTF-8, one byte each. Every other character is encoded as a lead byte, whose high bits indicate how many bytes the sequence contains, followed by one or more continuation bytes; together these bytes combine into a single Unicode 16 value. It is important to recognize that "bytes" does not mean "characters": a Unicode 16 character takes 1, 2, or 3 bytes in UTF-8, and codepoints beyond the 16-bit range take 4.
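The byte layout of UTF-8 can be sketched in a few lines of Python. This is an illustrative sketch, not SHIP code: the function name `encode_utf8` is hypothetical, and it deliberately handles only 16-bit codepoints. The first byte of a multi-byte sequence starts with the bits 110 (two bytes) or 1110 (three bytes), and every following byte starts with the bits 10.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one 16-bit codepoint as UTF-8 (illustrative sketch only)."""
    if not 0 <= codepoint <= 0xFFFF:
        raise ValueError("this sketch only handles 16-bit codepoints")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate values are not valid characters")
    if codepoint < 0x80:
        # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([codepoint])
    if codepoint < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (codepoint >> 12),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


# "A", "e with acute accent", and the euro sign
for cp in (0x0041, 0x00E9, 0x20AC):
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")
# U+0041 -> 41
# U+00E9 -> c3 a9
# U+20AC -> e2 82 ac
```

The practical consequence is that the byte length of a string and its character count can differ: a two-character string such as the euro sign followed by "5" occupies four bytes.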

The SHIPTide tool performs all string editing in UTF-8. This is not apparent to the user: text strings look normal, and cutting and pasting from, for example, Google Translate works seamlessly. However, be careful if you copy a script from SHIPTide into an external text editor and then paste it back into SHIPTide. If the external editor does not handle UTF-8 natively, any non-ASCII characters in your embedded strings will be mangled.
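The kind of mangling described above can be reproduced with a short Python sketch. It assumes, purely for illustration, an external editor that treats every byte as one Latin-1 character; real editors may misread the text in other ways. The variable names are hypothetical.

```python
original = "caf\u00e9"                # "cafe" with an accented e: 4 characters
stored = original.encode("utf-8")     # 5 bytes: the accented e takes two bytes

# An editor that assumes one byte per character sees 5 characters, not 4.
mangled = stored.decode("latin-1")
print(len(original), len(stored), len(mangled))   # 4 5 5
print(ascii(mangled))                             # 'caf\xc3\xa9'

# If the editor then saves that text as UTF-8, the damage is baked in:
resaved = mangled.encode("utf-8")
print(len(resaved))                               # 7
print(resaved.decode("utf-8") == original)        # False
```

The single accented character has become two unrelated characters, and re-saving the text turns the original 5 bytes into 7 that no longer decode to the original string.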

There are numerous Sail Functions available for manipulating strings and converting characters to codepoints (and back).
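The Sail function names are not listed on this page, so the sketch below uses Python's built-in `ord` and `chr` purely as an analogy for what a character-to-codepoint conversion and its inverse look like. Nothing here is Sail syntax.

```python
text = "A\u20ac"                          # "A" followed by the euro sign

# Character -> codepoint
codepoints = [ord(ch) for ch in text]
print([f"U+{cp:04X}" for cp in codepoints])   # ['U+0041', 'U+20AC']

# Codepoint -> character
rebuilt = "".join(chr(cp) for cp in codepoints)
print(rebuilt == text)                        # True
```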