In an earlier section, I showed how, by an elaborate method, one could convert sequences of binary bits to sequences of the 26 letters of the alphabet, in batches of 47 bits and 10 letters. Because 26 to the 10th power was only slightly larger than two to the 47th power, this conversion added so little redundancy to the message that it seemed like one could perform further encryption on the resulting stream of letters without too great a danger of compromising the system. (Ideally, however, an independent key should still be used for post-conversion encryption to be on the safe side.)
A reason for having messages armored as letters only would be to allow them to be transmitted using Morse code. The characters of Morse code are different in length, and the shorter symbols stand for the more common letters of the alphabet.
One could use a Huffman code for the English language to convert bits into letters; the result then would have a frequency distribution that somewhat resembled English text. But given that we know the bandwith cost of each Morse symbol, can we devise an optimal coding into letters that minimizes transmission time of a message in Morse format?
Given that a dot takes one unit of time, a dash three units, and the spaces between dots and dashes in letters take one unit, the time each letter takes can be calculated. However, since one consideration in coding is how many letters are produced (if one wants to produce fewer letters, one would use the ones with longer codes more often), one must also consider overhead costs; the space between letters takes three units, and spaces between words take five units. This last, if we are sending our message in five-letter groups, is convenient, since it lets us avoid fractions, so 4 units of overhead are included in the bandwith cost of each letter.
A 9 E 5 I 7 M 11 Q 17 U 11 Y 17 B 13 F 13 J 17 N 9 R 11 V 13 Z 15 C 15 G 13 K 13 O 15 S 9 W 13 D 11 H 11 L 13 P 15 T 7 X 15
One way to approach the problem of optimized coding into Morse is to ask this question: for what set of probabilities p(letter) is Morse an optimal coding? Knowing these probabilities, one can then construct a Huffman code for such a source of letters, and use that in reverse to convert bits into letters, approximating, as closely as possible if we restrict ourselves to such a simple technique instead of something better, such as arithmetic coding, those probabilities in our stream of letters.
Huffman codes have different lengths; it is immediately obvious that a Huffman code is optimal if the symbol with a 3-bit code has probability 1/8th, a symbol with a 5-bit code has probability 1/32nd, and so on. But a variable length coding into digits would give symbols with the same ratio of lengths probabilities of 1/1000 and 1/100000, which have a different ratio to each other. Since working out a Huffman code involves adding probabilities, this would make a difference. So we need to work out the "worth" of one unit of bandwidth cost in Morse code to do this, for a reason that resembles our need to count the overhead in the cost of each letter.
Thus, given that there is one symbol (E) with bandwidth cost 5, two symbols (I, T) with bandwidth cost 7, and so on, we need to solve the equation:
-5 -7 -9 -11 -13 -15 -17 x +2x +3x +5x +7x +5x +3x = 1
in x. This equation is not linear (although, as a polynomial in 1/x, doubtless it can be solved analytically with the aid of elliptic integrals), but it can be solved using a binary search method to obtain an approximate value of x. The solution that can be obtained in this fashion indicates that we should consider each bandwith unit to be a symbol with 1.35999638070642 possible values; in other words, each bandwidth unit contains about 0.4436028 bits of information. Note that this only applies when the transmitted message consists only of letters, since it derives from that particular set of symbols; the value would be higher if, for example, we allowed ourselves to send digits by Morse code as well.
Thus, we assign the following probabilities to each letter:
5 E 0.214937 2 bits 7 I T 0.116208 3 bits 9 A N S 0.062829 4 bits 11 D H M R U 0.033969 5 bits 13 B F G K L V W 0.018366 6 bits 15 C O P X Z 0.009930 Four 7-bit codes, one 8-bit code 17 J Q Y 0.005369 8 bits
and when we work out a Huffman code on this basis, it comes out very nicely, with only one case where we must assign different-length codes to symbols with the same probability.
A possible coding with these symbol lengths would be:
00000 D 0010010 C 00101110 Q 0100 A 101 T 000010 F 0010011 O 00101111 Y 01010 H 11 E 000011 G 0010100 P 001100 K 01011 M 00010 R 0010101 X 001101 L 0110 N 00011 U 00101100 Z 001110 V 0111 S 001000 B 00101101 J 001111 W 100 I
which is the one that results from the tree used to determine the Huffman code; a more direct one, based on the lengths in the table above, would of course be:
00 E 10110 D 110111 F 1111010 C 11111110 Q 010 I 10111 H 111000 G 1111011 O 11111111 Y 011 T 11000 M 111001 K 1111100 P 1000 A 11001 R 111010 L 1111101 X 1001 N 11010 U 111011 V 11111100 Z 1010 S 110110 B 111100 W 11111101 J
which just assigns symbols in order to the letters in the various length groups.
This particular coding sends an average of 968 bits in 2190 bandwidth units, for an average of 0.4420091 bits per bandwidth unit, just under the theoretically possible maximum of 0.4436028 bits per bandwidth unit given above.
Table of Contents