
To Decode Short Cryptograms

George W. Hart

Communications of the ACM, September 1994/Vol. 37, No. 9

Short cryptograms, in which an encoded sentence or quotation is to be decoded, are common pastimes of many recreational puzzle enthusiasts. Here is a simple example of the type that can be found in many collections of word games, and daily in some newspapers:

Given (cipher text): YP NR, PT MPY YP NR: YJSY OD YJR WIRDYOPM.
Solution (plain text): TO BE, OR NOT TO BE: THAT IS THE QUESTION.

A permutation of the 26-character alphabet is used to encode a sentence with spacing and punctuation intact. Given only the encoded sentence (the "cipher text"), the correct permutation is to be found so that the original sentence (the "plain text") can be understood. By framing the problem as a multiple-hypothesis detection problem, applying a maximum-likelihood criterion, using English language word frequency data, approximating liberally, and constructing a well-organized search tree, a rather simple algorithm results, which quickly deciphers even difficult cryptograms.

Cryptograms of this form--simple permutation substitutions with word divisions--have been employed for message concealment at least since Roman times. The solution of simple permutation ciphers has not been of much practical importance, since their use for military communication was superseded in the nineteenth century, but they remain a formidable puzzle for those who enjoy word games. Experienced solvers can manually solve a typical one-sentence cryptogram in a few minutes, but carefully constructed short puzzles, with unusual letter frequencies or atypical letter combinations, can stymie even expert solvers. Many strategies are published for manual decipherment, e.g., [1-3, 5, 8-10, 12], but these all require human pattern recognition skills "in the loop," and are not explicit enough to be called algorithms. This author is aware of only one previously published method for automatic solution--a relaxation method [7], also see [4]--but it is not suitable for short cryptograms.

The solution of a cryptogram can either be given as an explicit sentence of plain text, or it can be characterized by describing the permutation that was used to code the plain text. This permutation can then be inverted to reconstruct the plain text from the given cipher text. The permutation used for this example codes each letter as the letter to its right on the standard typewriter keyboard (convenient for touch-typists), with the three letters on the extreme right (P, L, and M) "wrapping around" to the left:

Plain:               ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher:              SNVFRGHJOKLAZMPQWTDYIBECUX
Partial Permutation: SN**R**JO****MP*WTDYI*****

An algorithm that decodes cryptograms must effectively choose one of the 26! (approximately 4 x 10^26) different permutations according to some criterion. Actually, there are usually somewhat fewer possibilities, because only a "partial permutation" is required. Only 12 distinct letters appear in the preceding quote, so cipher entries for the plain letters that do not appear may be left undefined, indicated here with an asterisk (*).

Criteria

The first tool one might think of when constructing a selection criterion is a probability distribution for the 26 letters. By tabulating the occurrences of each alphabetic character in large samples of text, one determines that the most frequent letter is E, occurring about 13% of the time, while the least common is Z, with a frequency of about 0.1%. Complete rank orderings vary somewhat depending on the body of text selected for tabulation. Four published examples for modern English [2, 3, 9, 12] are:

ETAONISRHLDCUPFMWYBGVKQXJZ
ETNRIOASDHLCFPUMYGWVBXKQJZ
ETAOINSRHLDCUMFWGYPBVKXJQZ
ETOANIRSHDLUCMPFYWGBVKJXZQ

A natural solution criterion which can make use of a given ordering is to select a permutation that results in a hypothesized plain text with as close an ordering as possible to a standard published ordering. Variations on this idea appear in all discussions of cryptography. The existence of conflicting nominal orderings, such as the preceding four, is only a minor flaw with this approach; there is a much more serious problem. While it is straightforward to construct a permutation that results in the desired letter ordering, the result of such an algorithm (the "output text") will, almost certainly, be gibberish for a short length of text. Figure 1 illustrates such an algorithm, and why it fails. The fundamental problem is that short text fragments have sample statistics that differ considerably from one another and from larger samples. Even large samples vary, as the four orderings above attest, so the use of letter frequencies alone may fail even with texts as large as 10,000 letters.

Figure 1. A possible "message" which matches the known letter frequencies of English can always be found by counting the occurrences of each cipher text letter and ranking them from most to least frequent (1st two columns), then matching with a known ordering (ETAONISRHLDC... in 3rd column). The output of this algorithm comes as close as possible to the correct letter frequencies of English, but is gibberish. The problem is that typical sentences of English do not display the actual statistics of English because they are too short. In the quoted line of Shakespeare, for example, "T" is the most common letter (column 4), not "E."

Given Cipher Text: YP NR, PT MPY YP NR: YJSY OD YJR WIRDYOPM.

Cipher chars   Number of occurrences   Matched letters   Correct plain letters
Y              7                       E                 T
P              5                       T                 O
R              4                       A                 E
N,M,J,O,D      2                       O,N,I,S,R         B,N,H,I,S
T,S,W,I        1                       H,L,D,C           R,A,Q,U

Resulting Output Text: ET OA, TH NTE ET OA: EILE SR EIA DCARESTN.

Similar arguments and examples can be constructed to show that simple modifications of this character-based probability distribution criterion also do not work in short cryptograms. Putative methods might include the frequencies of word-initial letters or word-final letters, or the joint statistics of pairs or triples of letters, using Markov models such as in [5, 8, 10]. However, these considerations only exacerbate the problems, since only a very small fraction of the possible n-tuples appears. The fundamental problem remains that there is a large variance in the sample statistics of short segments of text. These measures can only be expected to converge when large samples of cipher text are available. The method of [7] uses letter triples, but based on the examples presented, it apparently requires approximately 1,000 characters of text.

To solve typical cryptograms containing only 5 to 25 words, a different tack is employed here, analogous to the method of Figure 1, but using complete words rather than letters. A word-based approach appears formidable at first because there are many more words in English than letters--over 100,000. However, it turns out that the use of a word table on the order of 100 to 1,000 entries allows for a very effective method of solution.

As an appropriate criterion, a maximum-likelihood (ML) estimator is used, justified by the fact that it gives a minimum probability of error under the assumption that all permutations are equally likely [11]. Let f represent an encoding permutation of the alphabet, applied to the plain-text string on a character-by-character basis. Let Z be the given cipher text. The corresponding plain text is then f^{-1}(Z). A probabilistic model for natural language text assigns a probability P(S) to any string S. The ML criterion is then to choose the permutation

    f = \arg\max_{f} P(f^{-1}(Z)).    (1)

There are two sizable problems: determining an appropriate probability distribution P for English sentences and choosing among the 26! values for the argument f.

Language Model

It is not clear that the notion of a probability distribution for English sentences makes any mathematical, linguistic, or philosophical sense. People do not decide what to say or write by any procedure analogous to flipping coins. It is more correct to describe the following as a text model based on word frequencies.

Many tabulations of word frequencies have been undertaken. Here is one listing of the 135 top-ranked words of modern American English, starting with the most common [6]:

THE OF AND TO A IN THAT IS WAS HE FOR IT WITH AS HIS ON BE AT
BY I THIS HAD NOT ARE BUT FROM OR HAVE AN THEY WHICH ONE
YOU WERE HER ALL SHE THERE WOULD THEIR WE HIM BEEN HAS
WHEN WHO WILL MORE NO IF OUT SO SAID WHAT UP ITS ABOUT
INTO THAN THEM CAN ONLY OTHER NEW SOME COULD TIME THESE
TWO MAY THEN DO FIRST ANY MY NOW SUCH LIKE OUR OVER MAN
ME EVEN MOST MADE AFTER ALSO DID MANY BEFORE MUST
THROUGH BACK YEARS WHERE MUCH YOUR WAY WELL DOWN
SHOULD BECAUSE EACH JUST THOSE PEOPLE MR HOW TOO LITTLE
STATE GOOD VERY MAKE WORLD STILL OWN SEE MEN WORK
LONG GET HERE BETWEEN BOTH LIFE BEING UNDER NEVER DAY
SAME ANOTHER KNOW WHILE LAST    (2)

Surprisingly, although there are over 100,000 English words, a randomly selected word of an English sentence has a greater than 50% chance of being found in this list.

[...] the most fundamental syntactic principles such as word order. But again the justification is in the results.

We can interpret (5) in two ways, according to how we count repeated words. When a given word appears k times in a single sentence, we could include the corresponding probability k times in the product (if we are [...]