How to Make a Frenemy: Multitape FSTs for Portmanteau Generation

Aliya Deri and Kevin Knight
Information Sciences Institute, Department of Computer Science
University of Southern California
{aderi, knight}@isi.edu

Abstract

A portmanteau is a type of word that fuses the sounds and meanings of two component words; for example, “frenemy” (friend + enemy) or “smog” (smoke + fog). We develop a system, including a novel multitape FST, that takes an input of two words and outputs possible portmanteaux. Our system is trained on a list of known portmanteaux and their component words, and achieves 45% exact matches in cross-validated experiments.

W1          W2          PM
affluence   influenza   affluenza
anecdote    data        anecdata
chill       relax       chillax
flavor      favorite    flavorite
guess       estimate    guesstimate
jogging     juggling    joggling
sheep       people      sheeple
spanish     english     spanglish
zeitgeist   ghost       zeitghost

Table 1: Valid component words and portmanteaux.

1 Introduction

Portmanteaux are new words that fuse both the sounds and meanings of their component words. Innovative and entertaining, they are ubiquitous in advertising, social media, and newspapers (Figure 1). Some, like “frenemy” (friend + enemy), “brunch” (breakfast + lunch), and “smog” (smoke + fog), express such unique concepts that they permanently enter the English lexicon.

Portmanteau generation, while seemingly trivial for humans, is actually a combination of two complex natural language processing tasks: (1) choosing component words that are both semantically and phonetically compatible, and (2) blending those words into the final portmanteau. An end-to-end system able to generate novel portmanteaux with minimal human intervention would be not only a useful tool in areas like advertising and journalism, but also a notable achievement in creative NLP.

Due to the complexity of both component word selection and blending, previous portmanteau generation systems have several limitations. The Nehovah system (Smith et al., 2014) combines words only at exact grapheme matches, making the generation of more complex phonetic blends like “frenemy” or “brunch” impossible. Özbal and Strapparava (2012) blend words phonetically and allow inexact matches, but rely on encoded human knowledge, such as sets of similar phonemes and semantically related words. Both systems are rule-based, rather than data-driven, and do not train or test their systems with real-world portmanteaux.

In contrast to these approaches, this paper presents a data-driven model that accomplishes (2) by blending two given words into a portmanteau. That is, with an input of “friend” and “enemy,” we want to generate “frenemy.”

Figure 1: A New Yorker headline portmanteau.


Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 206–210, Denver, Colorado, May 31 – June 5, 2015. © 2015 Association for Computational Linguistics

We take a statistical modeling approach to portmanteau generation, using training examples (Table 1) to learn weights for a cascade of finite state machines. To handle the 2-input, 1-output problem inherent in the task, we implement a multitape FST. This work’s contributions can be summarized as:

- a portmanteau generation model, trained in an unsupervised manner on unaligned portmanteaux and component words,
- the novel use of a multitape FST for a 2-input, 1-output problem, and
- the release of our training data.¹

¹ Available at both authors’ websites.

2 Definition of a portmanteau

In this work, a portmanteau PM and its pronunciation PMpron have the following constraints:

- PM has exactly 2 component words, W1 and W2, with pronunciations W1pron and W2pron.
- All of PM’s letters are in W1 and W2, and all phonemes in PMpron are in W1pron and W2pron. All pronunciations use the Arpabet symbol set.
- Portmanteau building occurs at the phoneme level. PMpron is built through the following steps (further illustrated in Figure 2):

1. 0+ phonemes from W1pron are output.
2. 0+ phonemes from W2pron are deleted.
3. 1+ phonemes from W1pron are aligned with an equal number of phonemes from W2pron. For each aligned pair of phonemes (x, y), either x or y is output.
4. 0+ phonemes from W1pron are deleted, until the end of W1pron.
5. 0+ phonemes from W2pron are output, until the end of W2pron.

Figure 2: Derivations for friend + enemy → “frenemy” (F1 R1 EH3 N3 D4 / EH3 N3 AH5 M5 IY5) and tofu + turkey → “tofurkey” (T1 OW1 F1 UW3 / T2 ER3 K5 IY5). Subscripts indicate the step applied to each phoneme.

3 Multitape FST model

Finite state machines (FSMs) are powerful tools in NLP and are frequently used in tasks like machine transliteration and pronunciation. Toolkits like Carmel and OpenFST allow rapid implementations of complex FSM cascades, machine learning algorithms, and n-best lists.

Both toolkits implement two types of FSMs: finite state acceptors (FSAs) and finite state transducers (FSTs), and their weighted counterparts (wFSAs and wFSTs). An FSA has one input tape; an FST has one input and one output tape. What if we want one input and two output tapes for an FST? Three input tapes for an FSA? Although infrequently explored in NLP research, these “multitape” machines are valid FSMs.

In the case of converting {W1pron, W2pron} to PMpron, an interleaved reading of the two tapes would be impossible with a traditional FST. Instead, we model the problem with a 2-input, 1-output FST (Figure 3). Edges are labeled x : y : z to indicate input tapes W1pron and W2pron and output tape PMpron, respectively.

4 FSM Cascade

We include the multitape model as part of an FSM cascade that converts W1 and W2 to PM (Figure 4).

Figure 3: A 2-input, 1-output wFST for portmanteau pronunciation generation. [Original figure: states q1–q5, one per derivation step, each paired with an auxiliary state q1a–q5a; edges are labeled x : y : z over the two input tapes and the output tape, with ε marking a tape that is not read or written.]
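The derivation process this wFST encodes (steps 1–5 from Section 2) can be sketched as a small unweighted generator. This is our own illustration in plain Python, not the authors' Carmel/OpenFST implementation:

```python
from itertools import product

def portmanteau_prons(w1, w2):
    """Enumerate candidate portmanteau pronunciations from two phoneme
    sequences, following the paper's five steps (unweighted sketch)."""
    results = set()
    for i in range(len(w1) + 1):              # step 1: output a prefix of w1
        for j in range(len(w2) + 1):          # step 2: delete a prefix of w2
            max_k = min(len(w1) - i, len(w2) - j)
            for k in range(1, max_k + 1):     # step 3: align k >= 1 pairs
                for mid in product(*zip(w1[i:i+k], w2[j:j+k])):
                    # each aligned pair (x, y) emits either x or y;
                    # step 4 drops the rest of w1, step 5 keeps the rest of w2
                    results.add(tuple(w1[:i]) + mid + tuple(w2[j+k:]))
    return results

friend = ("F", "R", "EH", "N", "D")
enemy = ("EH", "N", "AH", "M", "IY")
cands = portmanteau_prons(friend, enemy)
print(("F", "R", "EH", "N", "AH", "M", "IY") in cands)  # → True
```

In the actual model each branch carries a learned weight (P(k) for the step, P(x, y → z) for each aligned pair), so the wFST yields ranked n-best lists rather than an unordered candidate set.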

Figure 4: The FSM cascade for converting W1 and W2 into a PM, and an illustrative example. [Original figure: W1 and W2 each pass through FST A to give W1pron and W2pron; wFST B produces PMpron, wFST C spells it as PM′, wFSA D rescores it as PM′′, and FSA E1,2 filter it to PM′′′. In the example, “jogging” and “juggling” yield PMpron JH AA G AH L IH NG and candidate spellings such as “joggaling” and “joggling,” with “joggling” as the final output.]
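One stage of this cascade, the “mini language model” wFSA D, can be pictured as a smoothed scorer over letter trigrams drawn only from the two component words. The sketch below is a rough illustration of that idea; the add-α smoothing and boundary padding are our assumptions, not the paper's exact construction:

```python
from collections import Counter
import math

def trigram_lm(w1, w2, alpha=0.1):
    """Build a smoothed letter-trigram log-probability scorer from the two
    component words, in the spirit of wFSA D (smoothing scheme assumed)."""
    counts, context = Counter(), Counter()
    for w in (w1, w2):
        padded = "^^" + w + "$"               # boundary padding (assumed)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b, c)] += 1
            context[(a, b)] += 1
    vocab = len(set(w1 + w2)) + 1             # letters plus end marker
    def logprob(word):
        padded = "^^" + word + "$"
        total = 0.0
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            p = (counts[(a, b, c)] + alpha) / (context[(a, b)] + alpha * vocab)
            total += math.log(p)
        return total
    return logprob

score = trigram_lm("jogging", "juggling")
print(score("joggling") > score("joggaling"))  # → True
```

Spellings whose trigrams all occur in the component words, like “joggling,” outscore spellings that introduce unseen trigrams, like “joggaling.”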

We first generate the pronunciations of W1 and W2 with FST A, which functions as a simple look-up from the CMU Pronouncing Dictionary (Weide, 1998).

Next, wFST B, the multitape wFST from Figure 3, translates W1pron and W2pron into PMpron. wFST C, built from aligned graphemes and phonemes from the CMU Pronunciation Dictionary (Galescu and Allen, 2001), spells PMpron as PM′.

To improve PM′, we now use three FSAs built from W1 and W2. The first, wFSA D, is a smoothed “mini language model” which strongly prefers letter trigrams from W1 and W2. The second and third, FSA E1 and FSA E2, accept all inputs except W1 and W2. wFSA D and FSA E1,2 are built at runtime.

5 Data

We obtained examples of portmanteaux and component words from Wikipedia and Wiktionary lists (Wikipedia, 2013; Wiktionary, 2013). We reject any that do not satisfy our constraints; for example, portmanteaux with three component words (“turkey” + “duck” + “chicken” → “turducken”) or without any overlap (“arpa” + “net” → “arpanet”). From 571 examples, this yields 401 {W1, W2, PM} triples.

We also use manual annotations of PMpron for learning the multitape wFST B weights and for mid-cascade evaluation.

We randomly split the data for 10-fold cross-validation. For each iteration, 8 folds are used for training data, 1 for dev, and 1 for test. Training data is used to learn wFST B weights (Section 6) and dev data is used to learn reranking weights (Section 7).

6 Training

FST A is unweighted and wFST C is pretrained. We only need to learn wFST B weights, which we can reduce to weights on the transitions qk → qka and q3a → q3 from Figure 3. The weights qk → qka represent the probability of each step, or P(k). The weights q3a → q3 represent the probability of generating phoneme z from input phonemes x and y, or P(x, y → z).

We use expectation maximization (EM) to learn these weights from our unaligned input and output, {W1pron, W2pron} and PMpron. We use three different methods of normalizing fractional counts. The learned phoneme alignment probabilities P(x, y → z) (Table 2) vary across these methods, but the learned step probabilities P(k) (Table 3) do not.

phonemes (x, y → z)   cond.   joint   mixed
AA AA → AA            1.000   0.017   1.000
AH ER → AH            0.424   0.007   0.445
AH ER → ER            0.576   0.009   0.555
P  B  → P             0.972   0.002   1.000
P  B  → B             0.028   N/A     N/A
Z  SH → SH            1.000   N/A     N/A
JH AO → JH            1.000   N/A     N/A

Table 2: Sample learned phoneme alignment probabilities for each method.

step k   description     P(k)
1        W1pron keep     0.68
2        W2pron delete   0.55
3        align           0.74
4        W1pron delete   0.64
5        W2pron keep     0.76

Table 3: Learned step probabilities. The probabilities of keeping and aligning are higher than those of deleting, showing a tendency to preserve the component words.

6.1 Conditional Alignment

Our first learning method models phoneme alignment P(x, y → z) conditionally, as P(z | x, y).

Since P(z | x, y) tends to be larger than the step probabilities P(k), the model prefers to align phonemes when possible, rather than keep or delete them separately. This creates longer alignment regions.

Additionally, during training a potential alignment P(x | x, y) can compete only with its pair P(y | x, y), making it more difficult to zero out an alignment’s probability. The conditional method therefore also learns more potential alignments between phonemes.

6.2 Joint Alignment

Our second learning method models P(x, y → z) jointly, as P(z, x, y). Since P(z, x, y) is relatively low compared to the step probabilities, this method prefers very short alignments, the reverse of the effect seen in the conditional method.

However, the model can also zero out the probabilities of unlikely alignments, so overall it learns fewer possible alignments between phonemes.

6.3 Mixed Alignment

Our third learning method initializes alignment probabilities with the joint method, then normalizes them so that P(x | x, y) and P(y | x, y) sum to 1. This “mixed” method, like the joint method, is more conservative in learning phoneme alignments. However, like the conditional method, it has high alignment probabilities and prefers longer alignments.

7 Model Combination and Reranking

Using the methods from Sections 6.1, 6.2, and 6.3, we train three models and produce three different 1000-best lists of PMpron candidates for dev data. We combine these three lists into a single one, and compute the following features for each candidate: model scores, PMpron length, percentage of W1pron or W2pron in PMpron, and percentage of PMpron in W1pron or W2pron. We also include a binary feature for whether PMpron matches W1pron or W2pron.

We then compute feature weights using the averaged perceptron algorithm (Zhou et al., 2006), and use them to rerank the candidate list, for both dev and test data. We combine the reranked PMpron lists to generate wFST C’s input.

8 Evaluation

We evaluate our model’s generation of PMpron pre- and post-reranking against our manually annotated PMpron. We also compare PM′, PM′′, and PM′′′. For both PMpron and PM, we use three metrics:

- percent of 1-best results that are exact matches,
- average Levenshtein edit distance of 1-bests, and
- percent of 1000-best lists with an exact match.

model    % exact (dev / test)   avg. dist. (dev / test)   % 1k-best (dev / test)
cond     28.9 / 29.9            1.6 / 1.6                 92.0 / 91.2
joint    44.6 / 44.6            1.5 / 1.5                 91.0 / 89.7
mixed    31.9 / 33.4            1.6 / 1.5                 92.8 / 91.0
rerank   51.4 / 50.6            1.2 / 1.3                 93.1 / 91.5

Table 4: PMpron results pre- and post-reranking.

         % exact   avg. dist.   % 1k-best
PM′      12.03     5.31         42.35
PM′′     42.14     1.80         58.10
PM′′′    45.39     1.59         61.35

Table 5: PM results on cross-validated test data.

W1             W2          gold PM     hyp. PM
affluence      influenza   affluenza   affluenza
architecture   ecology     arcology    architecology
chill          relax       chillax     chilax
friend         enemy       frenemy     frienemy
japan          english     japlish     japanglish
jeans          shorts      jorts       js
jogging        juggling    joggling    joggling
man            purse       murse       mman
tofu           turkey      tofurkey    tofurkey
zeitgeist      ghost       zeitghost   zeitghost

Table 6: Component words and gold and hypothesis PMs.
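The three evaluation metrics are straightforward to compute; the sketch below uses a plain dynamic-programming Levenshtein distance and invented toy inputs (not the paper's data):

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def evaluate(one_bests, nbest_lists, golds):
    """Exact-match rate and average edit distance of 1-bests, plus the
    fraction of n-best lists containing the gold answer."""
    n = len(golds)
    exact = sum(h == g for h, g in zip(one_bests, golds)) / n
    avg_dist = sum(levenshtein(h, g) for h, g in zip(one_bests, golds)) / n
    in_nbest = sum(g in lst for lst, g in zip(nbest_lists, golds)) / n
    return exact, avg_dist, in_nbest

golds = ["frenemy", "chillax"]
one_bests = ["frienemy", "chillax"]
nbests = [["frienemy", "frenemy"], ["chilax"]]
print(evaluate(one_bests, nbests, golds))  # → (0.5, 0.5, 0.5)
```

Here only the second 1-best is an exact match, the first 1-best is one edit from its gold, and only the first n-best list contains its gold answer, so all three metrics come out to 0.5.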

9 Results and Discussion

We first evaluate the model at PMpron. Table 4 shows that, despite less than 50% exact matches, over 90% of the 1000-best lists contain the correct pronunciation. This motivates our model combination and reranking, which increase exact matches to over 50%.

Next, we evaluate PM (Table 5). A component word mini-LM dramatically improves PM′′ compared to PM′. Filtering out component words provides additional gain, to 45% exact matches.

In comparison, a baseline that merges W1pron and W2pron at the first shared phoneme achieves 33% exact matches for PMpron and 25% for PM.

Table 6 provides examples of system output. Perfect outputs include “affluenza,” “joggling,” “tofurkey,” and “zeitghost.” For others, like “chilax” and “frienemy,” the discrepancy is negligible and the hypothesis PM could be considered a correct alternate output. Some hypotheses, like “architecology” and “japanglish,” might even be considered superior to their gold counterparts. However, some errors, like “js” and “mman,” are clearly unacceptable system outputs.

10 Conclusion

We implement a data-driven system that generates portmanteaux from component words. To accomplish this, we use an FSM cascade, including a novel 2-input, 1-output multitape FST, and train it on existing portmanteaux. In cross-validated experiments, we achieve 45% exact matches and an average Levenshtein edit distance of 1.59.

In addition to improving this model, we are interested in developing systems that can select component words for portmanteaux and reconstruct component words from portmanteaux. We also plan to research other applications for multi-input/output models.

11 Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments, as well as our colleagues Qing Dou, Tomer Levinboim, Jonathan May, and Ashish Vaswani for their advice. This work was supported in part by DARPA contract FA-8750-13-2-0045.

References

Lucian Galescu and James F. Allen. 2001. Bi-directional conversion between graphemes and phonemes using a joint n-gram model. In 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis.

Gözde Özbal and Carlo Strapparava. 2012. A computational approach to the automation of creative naming. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 703–711. Association for Computational Linguistics.

Michael R. Smith, Ryan S. Hintze, and Dan Ventura. 2014. Nehovah: A neologism creator nomen ipsum. In Proceedings of the International Conference on Computational Creativity, pages 173–181. ICCC.

Robert Weide. 1998. The CMU pronunciation dictionary, release 0.6.

Wikipedia. 2013. List of portmanteaus. http://en.wikipedia.org/w/index.php?title=List_of_portmanteaus&oldid=578952494. [Online; accessed 01-November-2013].

Wiktionary. 2013. Appendix:List of portmanteaux. http://en.wiktionary.org/w/index.php?title=Appendix:List_of_portmanteaux&oldid=23685729. [Online; accessed 02-November-2013].

Zhengyu Zhou, Jianfeng Gao, Frank K. Soong, and Helen Meng. 2006. A comparative study of discriminative methods for reranking LVCSR n-best hypotheses in domain adaptation and generalization. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2006, Toulouse, France, pages 141–144.
