This section is about a messy little topic that still has to be handled properly in many encryption applications if security is to be maintained.

If every message were an exact multiple of 47 bytes in length, it would be very easy to apply the complicated fractionation methods outlined in the preceding sections to encryption.

First, take the message, and encrypt it while it is in binary form.

Then, breaking the message up into 47-byte chunks, for each chunk generate six pseudorandom bits. Convert that into eight bits which include four 1s and four 0s, using a 4-of-8 code.

Four 47-bit segments of the chunk are immediately converted to ten letters each, and four remain unconverted.

The letters are converted to base-3 and base-5 digits in pairs. The bits are converted in groups of 5 bits and groups of 7 bits, eighteen groups of 5 bits and fourteen groups of 7 bits.

The base-3 and base-5 digits thus produced are independently enciphered independently of each other, then reconstituted to form bits and letters once again. Then the binary chunks are converted to ten letters each.

At this point, 47 bytes of the original binary compressed plaintext (and this is indeed where the problem comes in: Huffman compression produces messages that are an arbitrary number of bits in length, not in whole bytes, never mind exact multiples of 47 bytes) have become 80 letters.

And this number is a multiple of four, so the next step works out evenly as well. Using the following code for the first letter of a group of four:

A 111 D 121 G 131 J 211 M 221 P 231 S 311 V 321 Y 331 B 112 E 122 H 132 K 212 N 222 Q 232 T 312 W 322 Z 332 C 113 F 123 I 133 L 213 O 223 R 233 U 313 X 323

and using the three digits thus produced to modify the remaining three letters in that group as follows:

ABCDEFGHIJKLMNOPQRSTUVWXYZ ---------------------------- 1 %&'()*+,./0123456789:;<=>? 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ 3 abcdefghijklmnopqrstuvwxyz

we now produce a line of 60 characters from an alphabet of 78 characters, suitable for efficient transmission as a text string. While trying to have a simple ASCII coding, the first set of equivalents is made slightly more complicated to avoid using the dollar sign ($), which is different in some national versions of ASCII, and to avoid using the minus sign (-) which has a special meaning at the start of a line for some Internet messages.

Then, an error-correcting code can be applied to the line of 60 7-bit characters. One might produce 28 bits of error-checking information from four 7-bit accumulators, rotated according to different schemes (one rotated only after every 49 characters added to the three illustrated in the diagram, so that a single-bit error in the line can be uniquely located, and with the result that 28 rather than 21 bits of error-checking information are produced).

To keep a correspondence between the bits of the error-checking characters and the bits they encode (so that single-bit errors in transmission don't propagate in the error-checking part of the code in a way the code was not designed to handle), and yet use only acceptable characters, the following code which places five bits in a character might be used:

0 0 1 1 ------- 000|P H p h 001|Q I q i 010|R J r j 011|S K s k 100|T L t l 101|U M u m 110|V N v n 111|W O w o

and thus a line might consist of 60 characters from the 78-character alphabet, a space, and six error-checking characters from this 32-character alphabet, one of which would only take on eight possible values.

But as noted, this is all well and good if the message consists entirely of exact chunks of 47 bytes. What happens if the last chunk falls short, how can we deal with this?

At the beginning of binary encipherment, just after the message has been compressed, it is reasonable to at least pad the message to a whole number of bytes, so that the entire encipherment process does not have to continuously account for an odd number of bits.

For a hash function, a common method of padding is this: append a single 1 bit to the end of the message, and then append as many 0 bits as required to make the message come out to an even number of blocks.

This is not very suitable for encipherment, as it creates a likelihood that the message will end in a sequence of zeroes. Here is a method that lets each individual bit of the end of the message be equally likely to be a 1 or a 0.

Consider only the last 6 through 13 bits of the message. (Since this includes eight consecutive values, one of these values can always be chosen to make the preceding part of the message consist entirely of whole bytes.)

These will be encoded into two bytes which contain the following three fields (the last one may have zero length) in order:

- Three bits will contain a number from 0 to 7, indicating how many bits of random padding have been added to the message.
- The 6 through 13 remaining bits of the message.
- From 7 down to 0 bits of random padding.

Although this kind of padding does ensure that each bit is equally likely to be a 0 or a 1, it does not complicate a brute-force search by ensuring that any random bit sequence, after being unpadded, will be one that decompresses by the compression method used, to a whole number of symbols. This problem can be dealt with by applying some form of encryption before padding, such as choosing the codes for Huffman coding in a random order.

However, it is possible to address this without introducing bias into the compressed form of the message. The key insight is that Morse code is an example of a method of encoding binary strings that also takes advantage of length information, and so if we encode the last symbol in a code similar to Morse code we can ensure that wherever the message ends, it can decompress to something. However, to avoid ambiguity, we must exclude symbols that start with one of our Huffman symbols.

For example, for a message composed only of letters, we might try to use the following Huffman and pseudo-Morse codes:

Letter Huffman Code Letter Pseudo-Morse Code (1) (2) E 000 E --- <null string> T 001 T E 0 A 0100 A T 1 O 0101 O A 00 I 0110 I O 01 H 0111 H I 10 N 1000 N H 11 S 1001 S N 010 R 1010 R S 011 D 10110 D R 100 L 10111 L D 101 U 11000 U L 110 M 11001 M U 111 W 11010 W M 1011 C 11011 C W 1100 Y 11100 Y C 1101 F 111010 F Y 1110 G 111011 G F 1111 P 111100 P G 11101 B 111101 B P 11110 V 111110 V B 11111 K 1111110 K V 111111 X 11111110 X K 1111111 J 111111110 J X 11111111 Q 1111111110 Q J 111111111 Z 1111111111

where, for the last symbol in the compressed ciphertext, column (1) normally applies, but column (2) applies after Z, ZQ, ZQQ, ZQQQ, and so on.

This is because we have run into trouble. We have run out of available Pseudo-Morse codes before having found an equivalent for the least frequent letter, Z.

The number of Morse code symbols with four or fewer dots and dashes is 30, and considering the null string, we obtain 31 symbols, one less than 32, the number of five-bit strings. Since these two different cases of a Morse code avoiding the symbols in another prefix-property code both have the property that the Morse code has exactly one symbol less, it becomes interesting to ask if this is always the case.

Surprisingly, the answer is yes.

The minimal case of a pseudo-Morse code and its corresponding Huffman code is:

Huffman code symbols: [0], [1] Pseudo-Morse code symbols: [<null string>]

In this case, there is one more Huffman code symbol than there are pseudo-Morse symbols.

Any Huffman code can be obtained from the minimal Huffman code by successive transformations of the following form: replace one symbol in the Huffman code by both itself followed by 0 and itself followed by 1.

Then, the pseudo-Morse code is also modified: the old symbol in the Huffman code that was replaced is added to it.

Since both codes have exactly one element added to them by this operation, it is still true that the Huffman code has exactly one more symbol in it than its corresponding pseudo-Morse code. This is therefore always true, by mathematical induction.

There are a number of ways to deal with this problem, but now that we know that we will always have the problem, and it will always only involve the least frequent symbol, it becomes easier to see which of them are most appropriate.

- We could exclude the null string as a possible pseudo-Morse symbol. Then, all the strings which exactly match a Huffman string would be usable as symbols, so we would always have more Pseudo-Morse bit strings than Huffman bit strings.
- Or, we could use the Pseudo-Morse symbol for Q to stand for <nothing> and if our message ended in a Q or Z, we simply end by using its Huffman code followed by the Pseudo-Morse code for <nothing>. In that case, though, we have two rare symbols following one another.
- This can be improved by having two versions of the Pseudo-Morse code. One is used to follow the Huffman symbol for any character other than Q or Z, and does not include a symbol for <nothing>, and therefore does include a symbol for Q (but not Z). The other, following the Huffman symbols for Q and Z, has a code for <nothing> but not for either Q or Z. In this case, since J is more common than Q and Z combined, the symbol for <nothing> would still be the longest in the code, but in cases where the tail of the frequency distribution is flatter, the symbol for <nothing> might move upwards in the code.
- Or, when <nothing> ends a message, assign the shortest Huffman codes to the characters that need to be followed by it, so that E<nothing> stands for Q, and T<nothing> stands for Z.
- Finally, as actually recommended above, following Z or in a string of "Q"s of any length preceded by Z, use a second version of the Pseudo-Morse code in which there is no code for Q or Z, but the code for <nothing> is the shortest character; following anything else, use the normal version of the Pseudo-Morse code in which there is a code for every letter except Z.

However, the first four methods, of which the fourth is the most efficient, all involve backtracking. Thus, the fifth method, which is both efficient and which can be implemented without backtracking in either encoding or decoding, is the most suitable.

Ignoring this detail for the moment, we can illustrate how well this scheme works by an example:

Let our message, using the convention given above, be nnn01010 10101010, where nnn takes on all possible values from 0 to 7. Does the message have a valid decoding for each value of nnn?

Message: Decoding: 000 01010 10101010 0101 A 0101 A 0101 A 0 T 001 01010 1010101x 0101 A 0101 A 0101 A <null string> E 010 01010 101010xx 0101 A 0101 A 010 S 011 01010 10101xxx 0101 A 0101 A 01 I 100 01010 1010xxxx 0101 A 0101 A 0 T 101 01010 101xxxxx 0101 A 0101 A <null string> E 110 01010 10xxxxxx 0101 A 010 S 111 01010 1xxxxxxx 0101 A 01 I

The basic idea shown here is based on a description by David A. Scott of ideas used in his encryption programs, but he is not responsible for the details of my proposal/example. Although I believe that this idea has probably been thought of before, perhaps very early in the development of binary prefix-property codes, I suspect it is entirely possible that Mr. Scott may well be the originator of the concept of generalizing this principle to arithmetic coding.

There is even available an encryption program by Matt Timmermans which makes use of his compression principle as well.

At this point, we now have a binary message that is a whole number of bytes in length. Such a message is convenient for encrypting by a large number of techniques. Among one of the techniques we are likely to apply to a binary message would be a block cipher, such as DES (which is still the best-known one, even if its key is too short by today's standards).

But we've only ensured that our message is made up of a whole number of bytes, not a whole number of eight-byte (or sixteen-byte, if we're using one of the AES candidates) blocks!

This is not a problem. We can encipher a message with the "wrong" number of bytes in it as follows:

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19| -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | | | | | | | | | | | | | | | | | | | ---------------------- ---------------------- | | | | DES || DES | | | | ---------------------- ---------------------- | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ---------------------- | | | | | | | | | | | | DES | | | | | | | | | | | | ---------------------- | | | | | | | | | | | | | | | | | | |

that is, after enciphering as much of the message as possible in whole eight-byte blocks, if there are between one and seven left-over bytes, then encipher the last eight bytes of the message again, thus including the left-over bytes with some bytes that have already been enciphered to make a whole block.

Bruce Schneier's justly famed book *Applied Cryptography* illustrates
a somewhat modified technique for accomplishing this purpose, called
*ciphertext stealing*.

It differs from what is shown in the diagram above by making the extra block,
which still contains the three left-over bytes and the last five already-encrypted bytes,
out of the three left-over bytes *first*, and then the five already-encrypted
bytes *afterwards*.

It looks like this:

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19| -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | | | | | | | | | | | | | | | | | | | ---------------------- ---------------------- | | | | DES || DES | | | | ---------------------- ---------------------- | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | ---------------------- | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ---------------------- | | | | | | | | | | | | DES | | | | | | | | | | | | ---------------------- | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | | | ---------------------- | | | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | | | ----------------------- | | | | | | | | | | | | | | | | | | |

This maintains the alignment of all the bytes of the unencrypted message relative to the blocks in which they are first enciphered. This seems like a cumbersome method of dealing with some rather strange and obscure weaknesses some block ciphers might have. (Thanks to the initial permutation, DES certainly does not have any weakness of the kind this might address.)

However, if we hadn't already padded our message out to a whole number of bytes, or if we were using a block cipher that was very fast, and consisted of operations on whole 32-bit words in our computer, the alignment-preserving property of ciphertext stealing could be useful to save computer time.

If only alignment to the byte of a message that is an odd number of bits is desired, one could mix the simple technique I've outlined with ciphertext stealing, by enciphering, as our extra block, some whole bytes that have already been enciphered, the whole left-over bytes, and one byte composed of the left-over bits plus the last few bits of the already-enciphered byte immediately preceding the ones we've already used.

Once binary encryption is finished, there is the matter of converting the message, in groups of 47 bits at a time, to a message consisting of letters.

There is some good news with respect to the elaborate fractionation scheme discussed in the previous section. Any group of more than 35 bits can be represented by some combination of groups of 5 bits and 7 bits. Thus, as long as the message is sufficiently long, there will be no problem in converting it entirely to base-3 and base-5 digits for fractionation. However, particularly if after the last chunk of 47 bytes, there are less than five bytes left, and on general principles in any case (since transposition and fractionation over too small a block is less secure), the odd part of the message should be fractionated together with the last whole chunk of the message.

But how will we deal with converting a leftover part of the message, less than 47 bits in length, to letters? Again, some padding will be required.

In addition to the full conversion of 47 bits to 10 letters, the component coding which converted 14 bits to three letters:

00aaaabbbbcccc (4,4,4: 1 way) 010iiibbbbcccc (3,4,4: 3 ways) 011aaaajjjcccc 100aaaabbbbkkk 1010iiijjjcccc (3,3,4: 3 ways) 1011iiibbbbkkk 1100aaaajjjkkk 11010iiijjjkkk (3,3,3: 1 way) 11011mbbbbcccc (1,4,4: 3 ways) 11100aaaancccc 11101aaaabbbbo 111100mjjjcccc (1,3,4: 6 ways: 4 used here.) 111101mbbbbkkk 111110iiincccc 111111aaaankkk

is a reasonably efficient method of converting bits into letters. As well, since two to the ninth power is 512, which is under 676, we can convert nine bits into two letters as follows:

0aaaabbbb (4,4: 1 way) 10iiibbbb (3,4: 2 ways) 11aaaajjj

These charts are, of course, interpreted with the single-letter codings noted in the section on the 47-bit to 10-letter coding:

aaaa (or bbbb, cccc) will represent a letter from the code:

0000 A 0001 B 0010 C 0011 D 0100 E 0101 F 0110 G 0111 H 1000 I 1001 J 1010 K 1011 L 1100 M 1101 N 1110 O 1111 P

iii (or jjj, kkk) will represent a letter from the code:

000 Q 001 R 010 S 011 T 100 U 101 V 110 W 111 X

and m (or n, o) will represent a letter from the code:

0 Y 1 Z

Using these smaller encodings, the amount of random padding required to convert an arbitrary number of bits to letters can be reduced.

The amount of padding needed for various numbers of bits can be seen in the following table:

1 to 4 bits [aaaa] with 3 to 0 bits padding 5 to 9 bits [9] with 4 to 0 bits padding 10 to 14 bits [14] with 4 to 0 bits padding 15 to 18 bits [9][9] with 3 to 0 bits padding 19 to 23 bits [14][9] with 4 to 0 bits padding 24 to 28 bits [14][14] with 4 to 0 bits padding 29 to 32 bits [14][9][9] with 3 to 0 bits padding 33 to 37 bits [14][14][9] with 4 to 0 bits padding 38 to 42 bits [14][14][14] with 4 to 0 bits padding 43 to 47 bits [47] with 4 to 0 bits padding

The coding [9][9] is allowed at the end because it produces a number of letters of the form 3n+1, which cannot be confused with a number of the form 3n, involving only the use of [14] blocks, or a number of the form 3n+2, resulting from the use of a single [9] block. Thus, the number of letters present uniquely identifies how to convert the letters to bits, but the bits must then contain information on how many bits of padding are to be ignored (or this information must be available from somewhere else).

Using the single letter coding [aaaa] once, for the smallest number of remaining bits, allows the maximum required number of padding bits to be reduced to 4 from 8. This means that only three bits need to be added to the message to indicate how many bits of random padding are used.

Here, it will be easiest to find the three bits giving the amount of padding if they are the very last three bits converted, with any padding being between them and the message.

Unfortunately, representing a number of padding bits from 0 to 4 by a three-bit number means that the first bit of that number has a high probability of being 0. The probability is not a full 80%, as some of the possible lengths require only from 3 to 0 bits of padding, but it is still a significant source of bias.

One way of solving the problem is to put the information about how much padding is used outside the message, without encrypting it, since usually the length of the message need not be secret. The only disadvantage of this is that it complicates the format of messages. In any case, this small quantity of added redundancy should not be a problem if the alphabetic form of the message is now well-encrypted, and if that encryption includes a transposition cipher, so that the location of the slight extra redundancy is itself secret.

Once we have finished encrypting the message while it is in alphabetic form, for more efficient transmission, an armoring of the message was shown above where groups of four letters from the 26-letter alphabet were combined to form three symbols from a set of 78.

Since no letter as the first letter of a four-letter group coded to 333, a simple way to handle three or fewer leftover letters is simply by printing the leftover letters all in lowercase. This would permit no confusion to arise.

Note that if one or two letters are left, it is not necessary to print them both in lowercase to remove ambiguity. They could be printed in any of the three possible encodings, chosen at random. However, that is only useful if further encryption is performed, and the scheme of 78-character armor was designed to be simple. The absence of the code 333 during the message introduces bias, and thus if the key for the last encryption step were connected to the key previously used, a potential weakness is introduced.

[Next] [Up] [Previous] [Index]