Poet Admits // Mute Cypher: Beam Search to find Mutually Enciphering Poetic Texts

Cole Peterson and Alona Fyshe
University of Victoria

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1339–1347, Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics

Abstract

The Xenotext Experiment implants poetry into an extremophile’s DNA, and uses that DNA to generate new poetry in a protein form. The molecular machinery of life requires that these two poems encipher each other under a symmetric substitution cipher. We search for ciphers which permit writing under the Xenotext constraints, incorporating ideas from cipher-cracking algorithms, and using n-gram data to assess a cipher’s “writability”. Our algorithm, Beam Verse, is a beam search which uses new heuristics to navigate the cipher-space. We find thousands of ciphers which score higher than successful ciphers used to write Xenotext constrained texts.

1 Introduction

For over a decade, poet Christian Bök has been working on The Xenotext (Bök, 2008), a literary experiment which aims to insert poetry into the DNA of an extremophile organism. In contrast to popular modern data storage media like paper and hard disks, DNA is more robust to accidents, and requires minimal maintenance to last hundreds or even thousands of years (Cox, 2001). Many groups are actively pursuing efficient and stable ways to use DNA to encode information (Shimanovsky et al., 2002; Goldman et al., 2013). With his project, Bök aims to leave a lasting cultural contribution inside of an organism, which, as a result of DNA’s durability and self-replicating properties, could conceivably survive longer than all other existing works of art.

Furthermore, Bök aims to craft his poem so the protein it instructs the cell to produce is yet another English poem. Bök thus not only turns a microbe into a genetic book, he also engineers the microbe to be, in a sense, a poet. The organism’s molecular machinery powers this translation between the two texts, which, at a high level, is a symmetric substitution cipher between the letters, and is described in more detail in Section 2. The two poems (the poet’s and the organism’s) must both play by the rules we refer to as the Xenotext Constraints:

• Each text is valid natural language.

• The substitution cipher function applied to one text results in the other text, and vice versa. In other words, the cipher function must be symmetric.

• Whitespace characters (space, new line) encipher themselves, and are the only characters allowed identity mappings.

After four years of work, Bök successfully wrote two English poems which satisfied the Xenotext constraints¹, becoming the first person to do so. The first challenge was finding a cipher which allows for two valid English texts to be written. Finding “writable” ciphers is difficult and is the focus of this paper.

We present Beam Verse, a language-agnostic algorithm driven by the target language’s n-gram data, that searches for “writable” ciphers, and suggests words and phrases which can be written under them. We do not concern ourselves with the full biochemical constraints of The Xenotext (e.g. the actual impact of the protein and the cell’s reaction to it, or its viability after folding) and instead only consider the Xenotext constraints listed above. This problem sits at the intersection of natural language processing and cryptography, and is a prerequisite to the genetic engineering required to fully realize a living xenotext. Our algorithm uncovers new ciphers which make satisfying the Xenotext constraints in English possible, and makes it easier for a poet of any language to investigate the feasibility of writing two mutually enciphering texts in their own tongue.

¹ The two poems used the cipher (abcdefghijlmq/tvukyspnoxrwz).

2 Genetic Background

DNA provides instructions to a living cell to produce a protein. DNA has a double helix structure (reminiscent of a twisted ladder) and is made up of four nucleotides: adenine, cytosine, guanine, and thymine, commonly abbreviated as A, C, G, and T. Each nucleotide always appears paired with another across the “rung” of the ladder, A with T, C with G, and vice versa. To transfer the data in the DNA to the protein-producing ribosome, the double helix is “unzipped”, separating the ladder at the rungs, and a copy of the exposed DNA sequence called an mRNA transcript is synthesized, pairing in the same way the DNA did, with the exception that adenine in the DNA strand pairs with uracil (U) in the mRNA. The ribosome reads the mRNA as instructions to produce a specific protein. A protein is a sequence of amino acids and each triplet of nucleotides in the mRNA (called a codon) represents one of the twenty amino acids (see Table 2) (Campbell and Reece, 2008).

We can write in the DNA of an organism by having a codon represent a letter. When this sequence of codons (the full poem) is read by the organism, it then writes a sequence of amino acids, each of which represents a letter, in reply². The letters in each poem have a bijective relationship determined by the biochemical processes that link them. For example, as shown in Table 2, the letters E and T are mutually linked, as wherever the poet uses the letter E, the cell uses the letter T, and vice versa. If the poet were to write “mute” instead of “poet”, the cell would write “poet” instead of “mute”.

Poet’s Letter   P              O        E        T
DNA Codon       AAC            GAG      CCG      GGC
mRNA Codon      UUG            CUC      GGC      CCG
Amino Acid      phenylalanine  leucine  glycine  proline
Cell’s Letter   M              U        T        E

Table 1: Sample translation from text through DNA to new text

This view of DNA is extremely simplistic, and serves only to motivate and provide context for the rest of this paper. When using a codon to represent a poet’s letter and an amino acid to represent one of the organism’s letters, many complexities arise which add further constraints to the text, which we ignore in the remainder of the paper. When actually inserting his poetry into a cell, Bök struggled to get the organism to express the correct protein, because he failed to account for the full complexity of the cell and caused the cell to “censor” itself (Wershler, 2012). However, we consider these additional constraints to be future work.

² This makes the writing lipogrammatic, as there are only 20 amino acids, and one of them (Serine) must be used as the space character. Serine is used for the space because it is the only amino acid to encipher itself, as both codons AGT and TCA produce it, mirroring the constraint that the space is mapped to itself in our ciphers.

3 Substitution Ciphers

Substitution ciphers are among the earliest known forms of cryptography, having existed since at least the days of Caesar (Sinkov, 1966). They work by replacing one letter in the plaintext with another to form the ciphertext. However, they are never used in modern cryptography because, despite a large keyspace, substitution ciphers do not change the letter frequency between plaintext and ciphertext. When properties of the plaintext are known (like the language it is written in), letter or word frequency data from that language can be used to quickly crack the cipher and uncover the key.

Every word has a deterministic encryption under a cipher. The encrypted word could be nonsense, or it could be another word. The word “poet”, for example, can encrypt to many other words, including “mute”. The “poet ↔ mute” word-pair forms what we call a partial cipher, and notate as (poe/mut). We say this partial cipher has a cardinality of three, as it defines three letter pairings. A full cipher in a 26-letter language has a cardinality of 13. We also refer to (poe/mut) as a primary cipher, because it is the lowest cardinality cipher to contain the word-pairing “poet ↔ mute”.

As no characters except whitespace are allowed identity mappings, a word-pair like “eat ↔ cat” is not valid, as both a and t would have to map to themselves. The symmetric Xenotext constraint prohibits “admits” from pairing with “cipher”, as the letter i would require a mapping to both d and h. However, “admits” can pair with the alternative spelling “cypher”, forming the primary cipher (admits/cypher). We can combine this cipher with (poe/mut), as none of the letter pairs conflict with each other – they are compatible with each other.

4 Scoring a Cipher’s “Writability”

When scoring a cipher, an important consideration is what makes one cipher more “writable” than another. We might score a cipher on the number of valid words under it, as having more words at your disposal makes it easier to write, but this is not necessarily so if all the words are rare and useless. To combat this, we weight words based upon their frequency in language, so that better, more frequent words contribute more to a cipher’s overall score. This values highly frequent and syntactically important words, like “the” or “and”, while also allowing a large number of infrequent words to contribute significantly to the score. However, a word’s usefulness when writing mutually enciphering texts is explicitly tied to its sister word under the cipher. “The” loses its usefulness if every time it is used
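As a rough illustration of the pairing rules described in Section 3, the following Python sketch derives the primary cipher of a word-pair, rejects pairs that would need an identity mapping or a letter with two partners, and combines compatible partial ciphers. It is a minimal sketch and not the Beam Verse implementation: the names `primary_cipher`, `cardinality`, `merge`, and `encipher` are hypothetical, and storing each letter pair in both directions of a dictionary is an assumption made here for convenience.

```python
from typing import Dict, Optional

# Hypothetical representation: a symmetric letter map in which
# cipher[a] == b always implies cipher[b] == a.
Cipher = Dict[str, str]


def primary_cipher(word_a: str, word_b: str) -> Optional[Cipher]:
    """Return the smallest symmetric cipher pairing word_a with word_b, or None."""
    if len(word_a) != len(word_b):
        return None
    cipher: Cipher = {}
    for a, b in zip(word_a, word_b):
        if a == b:  # identity mappings are reserved for whitespace
            return None
        # Each letter may have only one partner, in both directions.
        if cipher.get(a, b) != b or cipher.get(b, a) != a:
            return None
        cipher[a], cipher[b] = b, a
    return cipher


def cardinality(cipher: Cipher) -> int:
    """Number of letter pairs; each pair is stored in both directions."""
    return len(cipher) // 2


def merge(first: Cipher, second: Cipher) -> Optional[Cipher]:
    """Combine two partial ciphers, or return None if any letter pair conflicts."""
    merged = dict(first)
    for letter, partner in second.items():
        if merged.get(letter, partner) != partner:
            return None
        merged[letter] = partner
    return merged


def encipher(text: str, cipher: Cipher) -> str:
    """Apply the cipher; whitespace maps to itself, unmapped letters become '?'."""
    return "".join(ch if ch.isspace() else cipher.get(ch, "?") for ch in text)
```

Under these assumptions, `primary_cipher("poet", "mute")` has cardinality three, `primary_cipher("admits", "cipher")` is rejected because i would need two partners, and merging the “poet ↔ mute” and “admits ↔ cypher” ciphers gives a partial cipher under which `encipher("poet admits", ...)` returns "mute cypher" and vice versa.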