<<

Decipherment

Kevin Knight USC/ISI 4676 Admiralty Way Marina del Rey CA 90292 [email protected]

Abstract oped by Alan Turing and Warren Weaver. We de- scribe recently published work on building auto- The first natural language processing sys- matic translation systems from non-parallel data. tems had a straightforward goal: deci- We also demonstrate how some of the same algo- pher coded messages sent by the en- rithmic tools can be applied to natural language emy. This tutorial explores connections tasks like part-of-speech tagging and word align- between early decipherment research and ment. today’s NLP work. We cover classic mili- Turning back to historical ciphers, we explore a tary and diplomatic ciphers, automatic de- number of unsolved ciphers, giving results of ini- cipherment algorithms, unsolved ciphers, tial computer experiments on several of them. Fi- language translation as decipherment, and nally, we look briefly at as a way to enci- analyzing ancient writing as decipher- pher phoneme sequences, covering ancient scripts ment. and modern applications.

1 Tutorial Overview 2 Outline The first natural language processing systems had 1. Classical military/diplomatic ciphers (15 a straightforward goal: decipher coded messages minutes) sent by the enemy. Sixty years later, we have many 60 cipher types (ACA) more applications, including web search, ques- • Ciphers vs. codes tion answering, summarization, speech recogni- • tion, and language translation. This tutorial ex- Enigma cipher: the mother of natural • plores connections between early decipherment language processing research and today’s NLP work. We find that – computer analysis of text many ideas from the earlier era have become core – language recognition to the field, while others still remain to be picked – Good-Turing smoothing up and developed. We first cover classic military and diplomatic 2. Foreign language as a code (10 minutes) cipher types, including complex substitution ci- Alan Turing’s ”Thinking Machines” phers implemented in the first electro-mechanical • Warren Weaver’s Memorandum encryption machines. We look at mathematical • tools (language recognition, frequency counting, 3. Automatic decipherment (55 minutes) smoothing) developed to decrypt such ciphers on Cipher type detection proto-computers. We show algorithms and exten- • Substitution ciphers (simple, homo- sive empirical results for solving different types of • ciphers, and we show the role of algorithms in re- phonic, polyalphabetic, etc) cent decipherments of historical documents. – plaintext language recognition We then look at how foreign language can be how much plaintext knowledge is ∗ viewed as a code for English, a concept devel- needed 3 Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 3–4, Sofia, Bulgaria, August 4-9 2013. c 2013 Association for Computational Linguistics index of coincidence, unicity dis- 7. Undeciphered writing systems (15 minutes) ∗ tance, and other measures Indus Valley Script (3300BC) – navigating a difficult search space • Linear A (1900BC) frequencies of letters and words • ∗ (1700BC?) pattern words and cribs • ∗ (1800s?) EM, ILP, Bayesian models, sam- • ∗ pling 8. Conclusion and further questions (15 min- – recent decipherments utes) Jefferson cipher, Copiale cipher, ∗ civil war ciphers, naval Enigma 3 About the Presenter Application to part-of-speech tagging, • Kevin Knight is a Senior Research Scientist and word alignment Fellow at the Information Sciences Institute of the Application to machine translation with- University of Southern California (USC), and a • out parallel text Research Professor in USC’s Computer Science Parallel development of cryptography Department. He received a PhD in computer sci- • and translation ence from Carnegie Mellon University and a bach- Recently released NSA internal elor’s degree from Harvard University. Profes- • newsletter (1974-1997) sor Knight’s research interests include natural lan- guage processing, machine translation, automata 4. *** Break *** (30 minutes) theory, and decipherment. In 2001, he co-founded Language Weaver, Inc., and in 2011, he served 5. Unsolved ciphers (40 minutes) as President of the Association for Computational Zodiac 340 (1969), including computa- Linguistics. Dr. Knight has taught computer sci- • tional work ence courses at USC for more than fifteen years and co-authored the widely adopted textbook Ar- Voynich Manuscript (early 1400s), in- • tificial Intelligence. cluding computational work Beale (1885) • Dorabella (1897) • Taman Shud (1948) • Kryptos (1990), including computa- • tional work McCormick (1999) • Shoeboxes in attics: DuPonceau jour- • nal, Finnerana, SYP, Mopse, diptych

6. Writing as a code (20 minutes) Does writing encode ideas, or does it en- • code phonemes? Ancient script decipherment • – – Mayan glyphs – Ugaritic, including computational work – Chinese N¨ushu, including computa- tional work Automatic phonetic decipherment • Application to transliteration • 4