Non-Standard Character Set Encodings

Character set encodings that are not in the list of approved standard encodings can be included using "extended segments". An extended segment begins with one of the following sequences:

01/11 02/05 02/15 03/00 M L    variable number of octets per character
01/11 02/05 02/15 03/01 M L    1 octet per character
01/11 02/05 02/15 03/02 M L    2 octets per character
01/11 02/05 02/15 03/03 M L    3 octets per character
01/11 02/05 02/15 03/04 M L    4 octets per character

[This uses the "other coding system" of ISO 2022, using private Final characters.]

The "M" and "L" octets represent a 14-bit unsigned value giving the number of octets that appear in the remainder of the segment. The number is computed as ((M - 128) * 128) + (L - 128). The most significant bit of both M and L is always set to one, so each octet contributes seven bits to the value. The remainder of the segment consists of two parts: the name of the character set encoding and the actual text. The name of the encoding comes first and is separated from the text by the octet 00/02 (STX, START OF TEXT). Note that the length defined by M and L includes the encoding name and the separator.
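The layout described above can be illustrated with a minimal Python sketch. It is not part of the specification; the function names build_extended_segment and parse_extended_segment are hypothetical, and the error handling is an assumption about what a careful implementation might check.

```python
ESC = 0x1B  # 01/11
STX = 0x02  # 00/02, separates the encoding name from the text

# Final octet of the introducer -> octets per character (0 means variable).
OCTETS_PER_CHAR = {0x30: 0, 0x31: 1, 0x32: 2, 0x33: 3, 0x34: 4}


def build_extended_segment(name: bytes, text: bytes, final: int = 0x30) -> bytes:
    """Return introducer + M + L + name + STX + text (illustrative only)."""
    if final not in OCTETS_PER_CHAR:
        raise ValueError("final octet must be 03/00 through 03/04")
    if STX in name:
        raise ValueError("encoding name must not contain STX")
    # The length covers the name, the STX separator, and the text.
    length = len(name) + 1 + len(text)
    if length >= 1 << 14:
        raise ValueError("segment too long for a 14-bit length")
    m_octet = (length >> 7) | 0x80    # high 7 bits, most significant bit set
    l_octet = (length & 0x7F) | 0x80  # low 7 bits, most significant bit set
    header = bytes([ESC, 0x25, 0x2F, final, m_octet, l_octet])
    return header + name + bytes([STX]) + text


def parse_extended_segment(data: bytes):
    """Return (octets_per_char, name, text, rest) for a segment at data[0]."""
    if len(data) < 6 or data[:3] != bytes([ESC, 0x25, 0x2F]):
        raise ValueError("not an extended segment")
    final, m_octet, l_octet = data[3], data[4], data[5]
    if final not in OCTETS_PER_CHAR:
        raise ValueError("unknown final octet")
    if not (m_octet & 0x80 and l_octet & 0x80):
        raise ValueError("M and L must have their most significant bit set")
    length = (m_octet - 128) * 128 + (l_octet - 128)
    body = data[6:6 + length]
    if len(body) != length:
        raise ValueError("segment is truncated")
    # The name ends at the first STX; the text may contain any octet.
    name, sep, text = body.partition(bytes([STX]))
    if not sep:
        raise ValueError("missing STX separator")
    return OCTETS_PER_CHAR[final], name, text, data[6 + length:]
```

For a name of 5 octets and a text of 3 octets, the length is 5 + 1 + 3 = 9, so M is 128 and L is 137. The parser takes the length from M and L rather than scanning for a terminator, because the text itself may contain arbitrary octets.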

[The encoding of the length is chosen to avoid having zero octets in Compound Text when possible, because embedded NUL values are problematic in many C language routines. The use of zero octets cannot be ruled out entirely, however, since some octets in the actual text of the extended segment may have to be zero.]

The name of the encoding should be registered with the X Consortium to avoid conflicts and should, when appropriate, match the CharSet Registry and Encoding registration used in the X Logical Font Description. The name itself should be encoded using ISO 8859-1 (Latin 1), should not use question mark (03/15) or asterisk (02/10), and should use hyphen (02/13) only in accordance with the X Logical Font Description.
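The naming rules above could be checked mechanically before a segment is built. The following Python sketch is an illustration only: the function name encoding_name_problems is hypothetical, and the hyphen check assumes that "in accordance with the X Logical Font Description" means at most one hyphen, separating a CharSet Registry from an Encoding. That reading is an assumption, not something the text states.

```python
def encoding_name_problems(name: str) -> list:
    """Return a list of rule violations for a proposed encoding name."""
    problems = []
    try:
        name.encode("latin-1")
    except UnicodeEncodeError:
        problems.append("not representable in ISO 8859-1")
    if "?" in name:
        problems.append("contains question mark (03/15)")
    if "*" in name:
        problems.append("contains asterisk (02/10)")
    if "\x02" in name:
        # STX separates the name from the text, so it would end the name early.
        problems.append("contains STX (00/02)")
    if name.count("-") > 1:
        # Assumption: XLFD usage means a single registry-encoding hyphen.
        problems.append("more than one hyphen (02/13)")
    return problems
```

An empty list means the name passed these checks; it does not mean the name is registered with the X Consortium, which cannot be verified locally.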

Extended segments are not to be used for any character set encoding that can be constructed from a GL/GR pair of approved standard encodings. For example, it is incorrect to use an extended segment for any of the ISO 8859 family of encodings.

It should be noted that the contents of an extended segment are arbitrary; for example, they may contain octets in the C0 and C1 ranges, including 00/00, and octets comprising a given character may differ in their most significant bit.

[ISO-registered "other coding systems" are not used in Compound Text; extended segments are the only mechanism for non-2022 encodings.]