
Character Codes

As noted in the page on data compression, text can be represented more efficiently using Huffman coding. Since text is composed of words having lengths in a relatively narrow range, separated from each other by single spaces, a multi-state Huffman code, with one set of symbols for word lengths, and another set of symbols for letters, can be used, and it has the added attraction of obscuring this aspect of the structure of a text document.
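As a rough illustration of the two-table idea, the sketch below builds one Huffman code over word lengths and another over letters, then sends each word as its length symbol followed by its letters, so that no space symbol is ever transmitted. The function names and the sample text are assumptions made for this example; the page on data compression may describe a different construction.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Return {symbol: bitstring} for a table of symbol frequencies."""
    if len(freqs) == 1:                      # a lone symbol still needs one bit
        return {next(iter(freqs)): "0"}
    tiebreak = count()                       # keeps heap entries comparable
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)      # two least frequent subtrees
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]

def encode_two_state(text):
    """Encode words as (length symbol, letters...) using two code tables."""
    words = text.split()
    length_code = huffman_code(Counter(len(w) for w in words))
    letter_code = huffman_code(Counter("".join(words)))
    bits = "".join(
        length_code[len(w)] + "".join(letter_code[ch] for ch in w)
        for w in words
    )
    return bits, length_code, letter_code

def read_symbol(bits, pos, inverse):
    """Read one prefix-free codeword starting at pos."""
    end = pos + 1
    while bits[pos:end] not in inverse:
        end += 1
        if end > len(bits):
            raise ValueError("truncated or invalid bit string")
    return inverse[bits[pos:end]], end

def decode_two_state(bits, length_code, letter_code):
    """Alternate between the length table and the letter table."""
    inv_len = {b: s for s, b in length_code.items()}
    inv_let = {b: s for s, b in letter_code.items()}
    words, pos = [], 0
    while pos < len(bits):
        n, pos = read_symbol(bits, pos, inv_len)
        letters = []
        for _ in range(n):
            ch, pos = read_symbol(bits, pos, inv_let)
            letters.append(ch)
        words.append("".join(letters))
    return " ".join(words)
```

The decoder always knows which table applies next, because the length symbol says how many letter symbols follow it.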

Even when it is not intended to perform explicit compression, codes representing characters for transmission can be designed for efficiency.

ITA 2, also called the 5-level code or the Murray code, and generally known as Baudot because it is based on his principle even though it does not resemble his original code, uses only five bits to represent a character, but extra characters are sometimes needed to shift between cases.
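To see how the shifts affect the cost, the sketch below counts the bits needed to send a message in a five-bit code with two cases. It is a simplified model and not the real ITA 2 table: it assumes that letters live in one case and digits and punctuation in the other, that a space can be sent in either case, and that transmission starts in letters case. The function names are made up for this example.

```python
def five_level_bits(message):
    """Bits needed under a simplified two-case five-bit code (assumed model)."""
    case = "letters"              # assumption: transmission starts in letters case
    symbols = 0
    for ch in message:
        if ch == " ":
            needed = case         # assumption: space exists in both cases
        elif ch.isalpha():
            needed = "letters"
        else:
            needed = "figures"
        if needed != case:
            symbols += 1          # one extra shift character
            case = needed
        symbols += 1              # the character itself
    return 5 * symbols

def seven_bit_bits(message):
    """Bits needed when every character costs seven bits and no shifts exist."""
    return 7 * len(message)

for msg in ("HELLO", "ROOM 101", "A1B2"):
    print(msg, five_level_bits(msg), seven_bit_bits(msg))
```

Under these assumptions, plain words favor the five-bit code, while text that alternates between letters and figures can cost more than it would in a seven-bit code.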

ASCII requires seven bits per character, and is simpler to use, since no shifts are required:

      0   0   0   0   1   1   1   1
      0   0   1   1   0   0   1   1
      0   1   0   1   0   1   0   1
0000  NUL DLE SP  0   @   P   `   p
0001  SOH DC1 !   1   A   Q   a   q
0010  STX DC2 "   2   B   R   b   r
0011  ETX DC3 #   3   C   S   c   s
0100  EOT DC4 $   4   D   T   d   t
0101  ENQ NAK %   5   E   U   e   u
0110  ACK SYN &   6   F   V   f   v
0111  BEL ETB '   7   G   W   g   w
1000  BS  CAN (   8   H   X   h   x
1001  HT  EM  )   9   I   Y   i   y
1010  LF  SUB *   :   J   Z   j   z
1011  VT  ESC +   ;   K   [   k   {
1100  FF  FS  ,   <   L   \   l   |
1101  CR  GS  -   =   M   ]   m   }
1110  SO  RS  .   >   N   ^   n   ~
1111  SI  US  /   ?   O   _   o   DEL
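The table can be read as a recipe for the code itself: the three bits heading a column, read from top to bottom, are the high-order bits, and the four bits labelling a row are the low-order bits. A small sketch in Python (the helper name `ascii_from_table` is made up for this example):

```python
def ascii_from_table(column_bits, row_bits):
    """Combine a 3-bit column label and a 4-bit row label into a 7-bit code."""
    return (int(column_bits, 2) << 4) | int(row_bits, 2)

print(chr(ascii_from_table("100", "0001")))  # A
print(chr(ascii_from_table("011", "1010")))  # :
print(ascii_from_table("111", "1111"))       # 127, the code for DEL
```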

But with seven bits per character, the temptation is strong to use a whole 8-bit byte for a character.

The accompanying graphic representation of ASCII illustrates the parity bit, if used, by placing the characters whose parity bit is set under odd parity against a yellow background.
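The rule for the parity bit can be sketched in a few lines. Under odd parity, the eighth bit is set exactly when the seven code bits contain an even number of ones, so that the whole byte always has an odd number of ones. The function name `with_odd_parity` is an assumption made for this example.

```python
def with_odd_parity(code7):
    """Return the 8-bit value whose top bit makes the count of 1 bits odd."""
    code7 &= 0x7F
    ones = bin(code7).count("1")
    parity = 0 if ones % 2 == 1 else 1   # set the bit only when ones is even
    return (parity << 7) | code7

print(f"{with_odd_parity(ord('A')):08b}")  # 11000001: 'A' has two 1 bits, so parity is set
print(f"{with_odd_parity(ord('C')):08b}")  # 01000011: 'C' has three 1 bits, so parity is clear
```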

Originally, many versions of 8-bit ASCII were in use, providing extra characters on a number of computer systems. One common 8-bit ASCII character set found on printers supported the Japanese katakana syllabary; the IBM PC, the Macintosh, and the Atari ST each had their own 8-bit character set. Today, there is a standard; the Amiga was one of the first computers to use it, and it is also used in most fonts in Microsoft Windows. In this standard, the extra characters consist of 32 additional control characters, followed by 95 printable characters, most of which are accented letters for the major European languages. It also includes characters commonly found on typewriters, such as a superscript 2 and 3 for use in typing measurements, though not a complete set of superscripts.

The arithmetic symbols for multiplication and division were placed in the middle of the accented letters, in the positions originally assigned to the OE ligature, rather than with the other new graphic symbols. This part of the standard was still undecided when the Amiga was designed, so those two characters were omitted from its character set; some printers offer the original version as a "Unix character set".
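Assuming the standard described here is ISO 8859-1 (Latin-1), Python's built-in `latin-1` codec offers a quick way to inspect the positions in question. The helper name `latin1_char` is made up for this example.

```python
def latin1_char(code):
    """Decode a single byte value as an ISO 8859-1 character (assumed standard)."""
    return bytes([code]).decode("latin-1")

# Superscripts, then the neighbours of the multiplication and division signs.
for code in (0xB2, 0xB3, 0xD6, 0xD7, 0xD8, 0xF6, 0xF7, 0xF8):
    print(f"0x{code:02X} {latin1_char(code)}")
```

The multiplication sign at 0xD7 and the division sign at 0xF7 each sit between two accented letters, which matches the placement described above.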

As with the 5-level code, some characters in 7-bit ASCII were available for national use. The diagram below illustrates some of the available substitutions:

The current 8-bit form of ASCII includes nearly all of the special characters shown here; however, other languages needed still more characters, and so there are also alternate forms of 8-bit ASCII for languages such as Greek and Russian.

Many of the different national versions of ASCII are described and illustrated on Roman Czyborra's web page.

While printers often have their own escape code sequences to switch between some of these character sets, today there are ambitious proposals to create a single code to encompass nearly all the world's languages.

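There is Unicode, originally defined as a 16-bit character set, and ISO 10646, originally defined as a larger 31-bit character set which includes it (both have since been limited to code points up to U+10FFFF).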

In the following sections, we will be examining a scheme for encoding a character set of potentially unlimited extent with reasonable efficiency, a scheme for encoding ISO 10646 characters which is highly compatible with normal ASCII, and ways in which the 5-level code has been, and could be, extended to handle a wider character repertoire.

