
# Cryptanalyzing the Simple Substitution Cipher

This page is not complete. It is placed here now to reserve space, to allow other changes to this section to take place.

Here is a short message, enciphered only by replacing each of its letters by a different letter on a consistent basis:

```
MGSVR WWJXS VPTRY SSOEF YYTMQ SVSYM MTPTR XYMGS RVRFJ
NFVGX TYFWF EIFUS AXJJQ SJSNM QPMGS TJOTF IMLSS TYSJO
SLQSL LPLTF OYSHM MRSVO FP
```

How would one go about trying to read it?

The first step that would occur to many people would be to make use of the fact that some letters are more common than others in English. E is the most common letter, and letters like J, Q, X, and Z are quite rare.

And so, we count the letters in our message. This produces the following table of frequencies:

```
 A  E  F  G  H  I  J  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y
 1  2  9  4  1  2  7  5 10  2  5  5  4  6 18  9  1  6  3  4  8
```
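The count above is easy to reproduce by machine. The following is only a minimal sketch in Python; the names `CIPHERTEXT` and `letter_counts` are chosen here for illustration and are not part of the original discussion.

```python
from collections import Counter

# The cipher message from above, kept in its five-letter groups.
CIPHERTEXT = """
MGSVR WWJXS VPTRY SSOEF YYTMQ SVSYM MTPTR XYMGS RVRFJ
NFVGX TYFWF EIFUS AXJJQ SJSNM QPMGS TJOTF IMLSS TYSJO
SLQSL LPLTF OYSHM MRSVO FP
"""


def letter_counts(text):
    """Count each letter, ignoring spaces and line breaks."""
    return Counter(ch for ch in text.upper() if ch.isalpha())


counts = letter_counts(CIPHERTEXT)

# Print the letters alphabetically, as in the table above.
for letter in sorted(counts):
    print(f"{letter}: {counts[letter]}")
print("total letters:", sum(counts.values()))
```

Letters that never occur in the message (B, C, D, K, and Z here) simply do not appear in the result.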

In comparison, a frequency count I had my computer perform on a sample of literary text produced these frequencies:

```
A 443747  8.03   H 331686  6.00   O 420966  7.62   V  54921  0.99
B  88298  1.60   I 382552  6.92   P 102205  1.85   W 114048  2.06
C 152187  2.75   J   7112  0.13   Q   5841  0.11   X  12081  0.22
D 225040  4.07   K  33872  0.61   R 330126  5.97   Y  95514  1.73
E 711756 12.88   L 220858  4.00   S 351389  6.36   Z   3519  0.06
F 139985  2.53   M 141726  2.56   T 514613  9.31
G 103279  1.87   N 383526  6.94   U 156536  2.83
```

Arranged in order of frequency, for clarity, they become:

```
E 12.88   H  6.00   F  2.53   K  0.61
T  9.31   R  5.97   W  2.06   X  0.22
A  8.03   D  4.07   G  1.87   J  0.13
O  7.62   L  4.00   P  1.85   Q  0.11
N  6.94   U  2.83   Y  1.73   Z  0.06
I  6.92   C  2.75   M  2.56
S  6.36   B  1.60   V  0.99
```
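For anyone wishing to check the arithmetic, the percentages and their ordering follow directly from the raw counts. This is a sketch only; `RAW_COUNTS` copies the counts from the table above, and `ranked_percentages` is a name invented for this example.

```python
# Raw letter counts copied from the literary-text sample above.
RAW_COUNTS = {
    "A": 443747, "B": 88298, "C": 152187, "D": 225040, "E": 711756,
    "F": 139985, "G": 103279, "H": 331686, "I": 382552, "J": 7112,
    "K": 33872, "L": 220858, "M": 141726, "N": 383526, "O": 420966,
    "P": 102205, "Q": 5841, "R": 330126, "S": 351389, "T": 514613,
    "U": 156536, "V": 54921, "W": 114048, "X": 12081, "Y": 95514,
    "Z": 3519,
}


def ranked_percentages(counts):
    """Return (letter, percent) pairs, most frequent letter first."""
    total = sum(counts.values())
    percents = {letter: 100 * n / total for letter, n in counts.items()}
    return sorted(percents.items(), key=lambda item: item[1], reverse=True)


ranking = ranked_percentages(RAW_COUNTS)
for letter, pct in ranking:
    print(f"{letter} {pct:5.2f}")
```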

Comparing these frequencies to those of the message:

```
18: S     7: J      3: W
10: M     6: R V    2: E I N
 9: F T   5: L O P  1: A H U
 8: Y     4: G Q X
```

it might be tempting to start by aligning like frequencies wherever possible:

```
Cipher: S M Y J W
        ---------
Plain:  e t n i f
```

to begin deciphering the message like this:

```
MGSVR WWJXS VPTRY SSOEF YYTMQ SVSYM MTPTR XYMGS RVRFJ
t e   ffi e     n ee    nn t  e ent t      nt e     i

NFVGX TYFWF EIFUS AXJJQ SJSNM QPMGS TJOTF IMLSS TYSJO
       n f      e   ii  eie t   t e  i     t ee  nei

SLQSL LPLTF OYSHM MRSVO FP
e  e         ne t t e
```
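Filling in a trial decipherment by hand invites slips, so it may help to let a short program do it. The sketch below assumes the tentative five-letter key from the alignment above; `PARTIAL_KEY` and `apply_partial_key` are illustrative names, not anything fixed by the method.

```python
CIPHERTEXT_LINES = [
    "MGSVR WWJXS VPTRY SSOEF YYTMQ SVSYM MTPTR XYMGS RVRFJ",
    "NFVGX TYFWF EIFUS AXJJQ SJSNM QPMGS TJOTF IMLSS TYSJO",
    "SLQSL LPLTF OYSHM MRSVO FP",
]

# Tentative guesses only (cipher letter -> plaintext letter).
PARTIAL_KEY = {"S": "e", "M": "t", "Y": "n", "J": "i", "W": "f"}


def apply_partial_key(line, key):
    """Show the guessed letter under each cipher letter, a blank otherwise."""
    return "".join(key.get(ch, " ") for ch in line)


for line in CIPHERTEXT_LINES:
    print(line)
    print(apply_partial_key(line, PARTIAL_KEY))
    print()
```

Because every unknown letter becomes a blank of the same width, each guessed letter stays directly beneath the cipher letter it came from.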

Here, it looks like we've been luckier than we have a right to expect. With frequencies of 6.94 and 6.92 for N and I respectively, it isn't hard to imagine that I might be more common than N, instead of N being more common than I, in the text of a particular message.
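To see how little a message this short can tell us about fine differences, the counts can be turned into percentages and set beside the English figures. This is only an illustrative sketch; the `ALIGNMENT` list restates the tentative matches and the table values given above, and `percent` is a helper invented here.

```python
MESSAGE_LENGTH = 112  # total number of letters in the cipher message

# (cipher letter, count in message, guessed plaintext letter, English %)
ALIGNMENT = [
    ("S", 18, "e", 12.88),
    ("M", 10, "t", 9.31),
    ("Y", 8, "n", 6.94),
    ("J", 7, "i", 6.92),
    ("W", 3, "f", 2.53),
]


def percent(count, total=MESSAGE_LENGTH):
    """Convert a raw count into a percentage of the message."""
    return 100 * count / total


for cipher, count, plain, english_pct in ALIGNMENT:
    print(f"{cipher} {percent(count):5.2f}%   {plain} {english_pct:5.2f}%")

# A single occurrence more or less moves a letter's share by this much.
print(f"one occurrence = {percent(1):.2f} percentage points")
```

In a message of 112 letters, one occurrence is worth about 0.89 percentage points, far more than the 0.02-point gap between N and I.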

The combination t-e occurs three times from MGS, and once each from MQS, MLS, and MRS, so it seems reasonable to think that G stands for h. The fragment e-ent might be event, and -ffi-e might be office, although it is hard to take seriously the idea that W really stands for f.
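A search for such patterns can also be automated. The sketch below tallies whatever letter falls between two given cipher letters, running across the five-letter group boundaries; `middle_letters` is a hypothetical helper written for this example.

```python
from collections import Counter

CIPHERTEXT = """
MGSVR WWJXS VPTRY SSOEF YYTMQ SVSYM MTPTR XYMGS RVRFJ
NFVGX TYFWF EIFUS AXJJQ SJSNM QPMGS TJOTF IMLSS TYSJO
SLQSL LPLTF OYSHM MRSVO FP
"""


def middle_letters(text, first, last):
    """Tally the letter found between `first` and `last`, ignoring spacing."""
    letters = [ch for ch in text if ch.isalpha()]
    tally = Counter()
    # Slide a three-letter window along the message.
    for a, b, c in zip(letters, letters[1:], letters[2:]):
        if a == first and c == last:
            tally[b] += 1
    return tally


# Which cipher letters sit between M (guessed t) and S (guessed e)?
print(middle_letters(CIPHERTEXT, "M", "S"))
```

On this message the tally is G three times, and Q, L, and R once each.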

To make a good start on breaking a simple substitution, however, single-letter frequencies are not enough. They might work for picking out the letters E and T in most cases, but more information is available that can serve as a better guide.

We've seen that N and I have frequencies of 6.94 and 6.92 respectively. This is a very small difference. But one is a consonant, and the other is a vowel. So we might expect them to behave differently. And they do.
