Voynich2Vec: Using FastText Word Embeddings for Voynich Decipherment

William Merrill (Yale University) [email protected]
Eli Baum (Yale University) [email protected]

1 Introduction

1.1 Word Embeddings

[wm] Word embeddings, or vector representations of words, are a useful way of representing lexical semantics for a variety of natural language processing (NLP) applications. The idea behind this approach is to encode each word as a high-dimensional vector of real numbers such that similar words have similar vector representations. The similarity between two word vectors is often assessed in terms of cosine similarity, which is defined as

$$ \text{cos-sim}(w_1, w_2) = \frac{w_1 \cdot w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert} \qquad (1) $$

and gives the cosine of the angle between the vectors. Vectors that point in the same direction will have a cosine similarity of 1, and vectors that are perpendicular will have a cosine similarity of 0. In the case of word embeddings, word vectors for highly similar words have a cosine similarity close to 1, and words that have very little to do with one another will have a similarity score close to 0.

Most modern approaches to building word embeddings trace from the word2vec method developed by Mikolov et al. (2013), which used a neural network to learn from raw text an embedding that encodes the distributional context in which a word can appear. A neural network is a sophisticated algorithm that has achieved notable success in NLP and other problem domains during the last decade. Importantly, since a word2vec network only needs raw text to learn embeddings, it can be run on the Voynich manuscript. This is possible because the algorithm assumes that the context in which a word appears encodes that word's meaning, which is justified empirically by the success of Mikolov et al. (2013)'s method and theoretically by the Firthian tradition in linguistics (Firth, 1957).

Lample et al. (2017) developed a modification of the core architecture called fastText that was designed to incorporate the morphological properties of words into its learned notion of distributional similarity. Rather than directly learning embeddings for words, fastText learns embeddings for the character n-grams appearing within words. Then, a word embedding is produced by combining the n-gram embeddings for all the n-grams in a word. From a linguistic point of view, this is very similar to asking which sequences carry morphosyntactic information. The size of n-grams that should be considered is a hyperparameter of the model that needs to be set. The end result is a word embedding scheme whose notion of distributional context incorporates more morphosyntactic information than word2vec.
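As a concrete illustration of Equation 1, the following minimal NumPy sketch computes cosine similarity between two vectors. The vector values are made up for illustration and are not taken from our trained model.

```python
import numpy as np

def cos_sim(w1: np.ndarray, w2: np.ndarray) -> float:
    """Cosine similarity between two word vectors (Equation 1)."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

# Toy 3-dimensional "embeddings"; real fastText vectors have 100 dimensions.
daiin = np.array([0.9, 0.1, 0.3])
aiin = np.array([0.8, 0.2, 0.25])
otedy = np.array([0.1, 0.8, -0.4])

print(cos_sim(daiin, aiin))   # close to 1: similar distributional contexts
print(cos_sim(daiin, otedy))  # close to 0: unrelated contexts
```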
1.2 Word Embeddings and Voynich

Perone (2016) reported some preliminary analysis of the Voynich manuscript using word2vec embeddings. Using an established dimensionality reduction technique called t-distributed Stochastic Neighbor Embedding (t-SNE), Perone transformed these word embeddings into two-dimensional space in an effort to visualize them. The plots that he reported had two primary clusters, which he claimed corresponded to the two different hands in the manuscript. Based on this, Perone tentatively suggested that the hands might be two different languages (or at least two different modes of text generation).

Perone (2016) also analyzed the embedding of one star label in an astrological diagram, noting how the embeddings clustered around it also seem to be star names. This sort of bottom-up analysis might suggest that there is some semantic content in the manuscript that word2vec can pick up on. However, Perone only presented analysis of this one word. Thus, there is more potential for work in this vein.

Instead of using word embeddings, Bunn (2017) built embeddings for each folio in the manuscript using the frequency of each character. Each page was represented by a vector encoding the relative probability of occurrence of each character on that page (with a 24-character EVA alphabet, each vector had dimension 24). Bunn (2017) then used the same t-SNE visualization to plot the folio vectors in three-dimensional space. His intention was to use this data to validate Currier's Language A/B hypothesis (which was primarily made on the basis of character probabilities). Bunn's results show that folios tagged as A tend to appear in a different part of the vector space than those tagged as B, although there is some overlap between the regions. Since the boundary between Languages A and B is poorly defined, it is unclear whether many of the folios belong to a cluster.

1.3 Word Embedding Alignment

Unsupervised techniques for aligning fastText embeddings could also prove useful for trying to make sense of the Voynich lexicon. Word alignment is the task of matching up words in different languages with the same meaning. Lample et al. (2017) developed an approach to this problem called Multilingual Unsupervised or Supervised Word Embeddings (MUSE), which uses an adversarial neural network to align word embeddings from different languages without parallel texts. The intuition behind the learning algorithm is that one part of the neural network tries to transform vectors in one language into the other, and the other part of the network tries to distinguish between the transformed vectors and actual vectors in the other language. These adversarial training dynamics can eventually lead to a neural network that is able to map vectors from one language to the other. Given large data sets, the model achieved close to state-of-the-art performance. While this approach could in principle be used to align Voynichese words to, say, Latin words, the amount of Voynichese text in the manuscript is much smaller than the massive corpora typically used to train such sophisticated models. This lack of data is confounded by the fact that the model assumes there is a linear transformation between the semantic spaces of each language, which is problematic if the topics of two small documents are dramatically different.

We aimed to build on the work done by Perone (2016) by training fastText embeddings on the text of the Voynich manuscript. We will analyze these embeddings to address the following three questions:

1. Morphosyntactic Clustering: Do the fastText vectors reveal any suffixes that might function as morphological units?

2. Topic Modeling: Do we find similar vector representations for folios with similar illustrations?

3. Unsupervised Word Alignment: Can we decipher any Voynich words by aligning their word embeddings with embeddings from other languages?

2 Data

[eb] We built our word embeddings from the 1999 Takahashi transcription of the manuscript. While the archive file (text16e6.evt) is nearly machine-readable, it needed to be cleaned up slightly.

The first step in our investigation was to write a "tokenizer" which read the transcript file and produced a list of Takahashi EVA words. The main difficulty here was accounting for special symbols that had been inserted into the transcription to denote illustrations and formatting. For example, if a flower interrupts a word, {plant} would be inserted in the middle of that word. Elsewhere, extra spaces had been inserted to align the Takahashi transcription with other transcriptions, or to show where folds and tears in the parchment were. Manual analysis and removal of such tags allowed us to create a reliable and usable transcription.

While creating the tokenizer, we did notice some errors and ambiguities in our source transcription. While many of these ambiguities are caused by smudged ink or torn pages, a number are very clear on the Beinecke's recent scans. One direction for further work is to create an updated transcription using these new high-definition images. However, for our project, we chose not to modify Takahashi's transcript from the original, to allow for a (hopefully) more consistent analysis.
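The sketch below illustrates the kind of cleanup the tokenizer performs. It is a simplified reconstruction rather than our actual code: the {plant} tag format comes from the example above, but the treatment of comment lines, locator tags, and word separators is an assumption about the EVA transcription format.

```python
import re

def tokenize_eva(path: str) -> list[str]:
    """Read an EVA transcript and return a flat list of Voynichese words.

    Assumptions (illustrative only): lines starting with '#' are comments,
    '<...>' spans are locator/formatting tags, '{...}' spans are inline
    annotations such as {plant}, and '.' or whitespace separate words.
    """
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue
            line = re.sub(r"<[^>]*>", "", line)    # drop locator tags
            line = re.sub(r"\{[^}]*\}", "", line)  # drop inline annotations like {plant}
            words.extend(w for w in re.split(r"[.\s]+", line) if w)
    return words

# tokens = tokenize_eva("text16e6.evt")
```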
3 Methods

3.1 FastText

[wm] We built our fastText vectors using a public Python implementation (Yansyah, 2017). Following the default values in the software package, we chose to train a skipgram model using n-grams of size 3–6. In Section 4.1, we also investigated the effects of switching to n-grams of size 2–6 when we were analyzing length-2 suffixes. Also based on the default parameters of the software package, we excluded words that occurred fewer than 5 times in the manuscript.
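A minimal training sketch in the same spirit is shown below. It uses the current official fasttext Python bindings rather than the exact wrapper cited above, so the function names are an assumption about tooling, not a record of our code; the input file name is hypothetical and the hyperparameters mirror the ones described in this section.

```python
import fasttext

# Train a skipgram model on the cleaned, tokenized Voynich text.
# minn/maxn set the character n-gram sizes (3-6 by default, 2-6 for the
# length-2 suffix experiment); minCount=5 drops words seen fewer than 5 times.
model = fasttext.train_unsupervised(
    "voynich_tokens.txt",
    model="skipgram",
    minn=3,
    maxn=6,
    minCount=5,
    dim=100,
)

vector = model.get_word_vector("daiin")      # 100-dimensional embedding
print(model.get_nearest_neighbors("daiin"))  # (cosine similarity, word) pairs
```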
Recall that fastText embeddings encode some representation of the morphosyntactic distribution of their corresponding words. This means that, if a suffix encodes morphological information, some of the fastText embeddings for words ending with that suffix should cluster. This should hold especially when we don't have a large enough data set to determine meaningful semantic information about all roots.

Looking at word embeddings learned from Secreta Secretorum, a Latin manuscript of similar size to Voynich, we verified that embeddings did in fact show clustering by functional morpheme. For example, one of the embedding pairs with the highest cosine similarity was pulcritudine ("beauty-ABL") and longitudine ("longitude-ABL"). These two words share both the derivational nominalizing suffix -tud- and the ablative case marker -e, which explains why they should have very similar morphosyntactic distribution. It is worth noting that words of higher frequency tended to cluster by root as opposed to functional suffix. For example, the closest alignment overall was Alexandri ("Alexander-GEN") with Alexandro ("Alexander-DAT/ABL").

Inspired by these observations, we developed a method to test the morphosyntactic contribution of suffixes in Voynich. We chose several commonly occurring suffixes in the manuscript and looked at whether words ending in those suffixes clustered in the t-SNE plot of the Voynich word embeddings. We took clustering as evidence that the suffix was a morpheme, and lack of clustering as evidence that it was not.

The three groups of similar suffixes that we decided to focus on were:

1. -or, -ar

2. -aiin, -ain, -oiin, -iiin

3. -edy, -ody

Results of our analysis are discussed in Section 4.1.

Figure 1: Voynich word embeddings. Each dot represents one word (with only the 980 most common words plotted).

3.2 t-SNE

[eb] The fastText vectors that we built from the Voynich manuscript have 100 dimensions. Such vectors are easy for computers to work with but difficult for humans to visualize. Therefore, it makes sense to apply dimensionality reduction algorithms to the data for visualization purposes. The full name of the algorithm we used, which has become a standard in the machine learning community, is t-distributed Stochastic Neighbor Embedding (van der Maaten and Hinton, 2008). t-SNE works by computing the distances between the high-dimensional vectors and trying to replicate those distances in low-dimensional space. Thus, nearby 100-dimensional vectors will be represented as nearby points; clusters of 100-dimensional vectors will become clusters of points.

It is important to note, however, that some information is lost in this reduction; there is no way to perfectly represent 100 dimensions in 2, in the same way that looking at a printed photograph edge-on (one dimension) will never provide a faithful reproduction of the original three-dimensional scene. While t-SNE is generally quite good at maintaining clusters of word vectors, it is a stochastic process that can introduce noise into the output. Thus, a cluster in t-SNE output is a strong indication, but not proof, of a corresponding cluster in the original vector space. Applying clustering algorithms to the unreduced high-dimensional data is an alternate approach that can yield more certainty at the expense of interpretability.

Figure 1 shows an example of the t-SNE output for the VMS. The clustering of certain points will be discussed below.
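The projection itself is a single library call. The sketch below shows one plausible way to produce a plot like Figure 1 with scikit-learn and matplotlib; the choice of the 980 most common words follows the caption above, but the plotting details and variable names are our own illustration, not the original code.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# `model` is a trained fastText model and `vocab` a list of the 980 most
# common Voynich words (both assumed to exist from the previous steps).
vectors = np.array([model.get_word_vector(w) for w in vocab])

# Reduce the 100-dimensional embeddings to 2 dimensions. t-SNE is stochastic,
# so different random seeds can produce visibly different layouts.
points = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1], s=4)
for (x, y), word in zip(points, vocab):
    plt.annotate(word, (x, y), fontsize=4)
plt.savefig("voynich_tsne.png", dpi=300)
```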
3.3 Topic Modeling

[eb] A natural first step for trying to decipher the VMS has been to find (near-)unique words on certain pages. If we are willing to accept that the text (or perhaps just labels) on each herbal page, for example, probably relates to the picture of that plant, then uncommon or unique words that appear nearby might describe the plant in question. Granted, this is an unproven assumption, so we must be slightly skeptical of any results produced using this method.

To formalize this process, we used a statistic known as tf-idf, short for term frequency-inverse document frequency. The term frequency counts how many times a certain term appears on one page; the inverse document frequency measures the occurrence of this term over all pages. tf-idf is simply the product of these two components:

$$ \text{tfidf}(t, p) = \text{tf}(t, p) \cdot \text{idf}(t) \qquad (2) $$

for some term t and page p.

A term that appears many times on one page, but on few other pages, will have a high tf-idf, while a term that appears across many pages will have a low score. For example, proper nouns would probably score higher than function words. Various normalization schemes exist to remove biases towards long documents. We computed the term frequency as the count of the term on a page divided by the number of words on the page. The inverse document frequency is

$$ \text{idf}(t) = \log \frac{N}{|\{p : t \in p\}|} \qquad (3) $$

where t is the term, N is the number of pages, and the denominator counts how many pages contain term t.

One barrier we immediately ran into using tf-idf is the odd word distribution of the manuscript. Of the approximately 8100 words in the vocabulary, 5600 appear only once. Thus, most pages have unique words which are given high tf-idf scores. We can attempt to filter our results by removing single-occurrence words, but this risks throwing away useful data: perhaps the word for a certain plant or star really does only appear once.

While we cannot extract semantic content from tf-idf scores, the metric does provide a general direction for investigating words. It is useful to think about the results returned through this method as denoting words with high specificity or statistical importance on a page. Whether this correlates with semantic content remains to be seen.

The highest tf-idf words in the entire manuscript are on f65r, a page with one plant illustration and three words (otaim dam alam), all of which are relatively uncommon in the rest of the manuscript. We can therefore expect at least one of these three words to be highly specific to the illustration (again, assuming that the illustrations indeed match the text). Other high-scoring words include cthaiin (which occurs twice on f25r but only once on each of another 12 folios) and chaiin (which occurs once or twice on 33 folios, but four times on f4r). But alone, these high-scoring words do not appear particularly important; with so many unique words, we cannot expect to see many incredibly specific but commonly-occurring words.
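Equations 2 and 3 translate directly into a few lines of Python. The sketch below assumes the manuscript is available as a dictionary mapping each folio name to its list of tokens (a hypothetical `pages` variable); it is an illustration of the definitions above rather than our implementation.

```python
import math
from collections import Counter

def tfidf_scores(pages: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Compute tf-idf for every (page, term) pair per Equations 2 and 3."""
    n_pages = len(pages)
    # Document frequency: number of pages on which each term appears.
    df = Counter(term for tokens in pages.values() for term in set(tokens))
    scores = {}
    for page, tokens in pages.items():
        counts = Counter(tokens)
        scores[page] = {
            term: (count / len(tokens)) * math.log(n_pages / df[term])
            for term, count in counts.items()
        }
    return scores

# Example: the highest-scoring words on folio f65r.
# top = sorted(tfidf_scores(pages)["f65r"].items(), key=lambda kv: -kv[1])[:3]
```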

3.4 Folio Vectors

[eb] The tf-idf metric also opens a new avenue into the creation of folio vectors. Just as we can create vector representations of words, we can do the same with each folio of the manuscript. Then we can look at how the folios cluster together.

As has been mentioned, Bunn (2017) attempted a similar analysis using a bag-of-characters representation for folio vectors. The issue with creating folio vectors solely out of character counts is that this sacrifices the potentially rich information encoded in word embeddings. Therefore, we instead decided to create folio vectors using our word vectors. We build a folio vector by taking a weighted average over all the words on the folio, where each word is weighted by its tf-idf score on that folio. In principle, this method should be able to represent a more nuanced notion of folio topic than Bunn (2017)'s bag-of-characters approach.

Once these folio vectors have been constructed, we can run similar analyses as with word vectors: we can look for clusters of similar pages with t-SNE or determine the most similar pages by cosine distance. Results of our folio vector analysis are discussed in Section 4.2.

One issue with this approach, however, is that we can only build folio vectors from existing word vectors; uncommon words that have not been embedded into the vector space must necessarily be ignored. In our case, this means all words appearing fewer than five times overall are not counted.
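The construction described above, a tf-idf-weighted average of the word vectors on a folio, can be sketched as follows. It reuses the hypothetical `pages`, `tfidf_scores`, and trained `model` objects from the earlier sketches and skips words below the frequency cutoff, as the text describes; treat it as an illustration of the method, not the original code.

```python
import numpy as np

def folio_vector(page: str, pages, weights, model, dim: int = 100) -> np.ndarray:
    """tf-idf-weighted average of the embeddings of the words on one folio."""
    total = np.zeros(dim)
    mass = 0.0
    for word in pages[page]:
        if word not in model.words:          # skip words with no embedding
            continue
        w = weights[page].get(word, 0.0)     # tf-idf weight on this folio
        total += w * model.get_word_vector(word)
        mass += w
    return total / mass if mass > 0 else total

# weights = tfidf_scores(pages)
# f65r_vec = folio_vector("f65r", pages, weights, model)
```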

3.5 MUSE

[wm] After building word embeddings from Voynich, we also built word embeddings from medieval and classical manuscripts that we obtained from the Perseus Digital Library (Crane, 2018). Our selection of texts spanned several genres, including alchemical and magical works, religious texts, and political treatises. The texts were written in English, Latin, Greek, Spanish, and Arabic.

We then ran the MUSE word alignment algorithm between the Voynich manuscript and each text. Once the model was trained, we computed the closest alignments for each Voynich word ranked by cosine similarity and also visualized the aligned vector spaces in two dimensions using t-SNE. Results for these alignments are discussed in Section 4.3.
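Once MUSE has produced a mapping, ranking candidate translations reduces to a cosine-similarity search in the shared space. The sketch below illustrates that retrieval step under the assumption that the aligned Voynich and Latin embedding matrices have already been exported to NumPy arrays; the variable names are hypothetical.

```python
import numpy as np

def closest_alignments(voynich_vecs, latin_vecs, latin_words, k=5):
    """For each Voynich vector, return the k Latin words with highest cosine similarity."""
    v = voynich_vecs / np.linalg.norm(voynich_vecs, axis=1, keepdims=True)
    lat = latin_vecs / np.linalg.norm(latin_vecs, axis=1, keepdims=True)
    sims = v @ lat.T                      # cosine similarity matrix
    top = np.argsort(-sims, axis=1)[:, :k]
    return [[latin_words[j] for j in row] for row in top]
```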

4 Results

4.1 Morphosyntactic Clustering

[wm] As can be seen in Figure 2, the suffixes -or and -ar were scattered among many other words in the t-SNE-reduced vector space. We took this as evidence that these suffixes do not carry morphosyntactic information. To verify that these results were not coming from the fact that the minimum n-gram length used by the fastText model was 3 (whereas the length of these two suffixes was 2), we reran the same analysis with a fastText model that included 2-grams. As Figure 3 shows, the resulting distribution looks, if anything, more scattered.

Figure 2: The scattered distribution of -or and -ar suffixes suggests that these suffixes are not morphemes.

Figure 3: Including 2-grams in the fastText model produced a scattered distribution similar to that of Figure 2.

On the other hand, we found evidence that -iin is a morpheme in Figure 4. There is a distinguished cluster of words ending in the frequently occurring suffix -aiin. Additionally, most of the other occurrences of an -iin suffix, which are represented by the cases -oiin and -iiin, also appear in this cluster.

Figure 4: The strong clustering of -aiin, -oiin, and -iiin suggests that these suffixes reflect the same morpheme.

Figure 5 suggests that the commonly occurring suffix -ain may also be a morpheme, but, since its cluster is disjoint from -aiin, it does not carry the same information about a word's morphosyntactic distribution as -aiin/-iin. In other words, it seems that these two common suffixes may be two different morphemes.

Figure 5: The suffixes -aiin and -ain appear to be different morphemes.

Finally, Figure 6 shows strong clustering for -edy, which we take as evidence that -edy is a morpheme. Words ending in -ody do not appear in the -edy cluster and display loose clustering, if any, among themselves.

Figure 6: The suffixes -edy and -ody seem to appear in different contexts.

These results are interesting in the context of Currier (1976)'s theory that -edy, which occurs overwhelmingly in Language B folios, and -ody, which occurs mostly in Language A folios, represent variants of the same morpheme in two different languages or dialects. Under this interpretation, -edy and -ody would have different distributions because -edy could only occur with other words from Language B and vice versa for -ody. We do find that -edy and -ody have different distributions, but our data does not provide evidence that -ody is a morpheme like Currier's theory claims.

[eb] We can also calculate the closest vector to every other vector. Figure 7 shows these connections visually, and some results are presented in Table 1. A variety of presumptive affixes are very apparent: qol vs. ol vs. sol, for example. We can also see that certain letters are closely related and perhaps interchangeable with others. It is important to remember that these vectors were computed based on the word's context; the fact that qolchedy aligns with olchedy suggests that they appear in similar syntactic (or even semantic) positions and not just that they have similar spellings.

Figure 7: Voynich word vector pairings. From each point, a line is drawn to its nearest neighbor. Note that while some notion of clustering remains, other points have neighbors on the other side of the graph. This is the nature of dimensionality reduction.

Table 1: Top closest word vector pairs (ranked by cosine similarity). Morphological similarity is incredibly apparent.

qolchedy    olchedy
polchedy    solchedy
oraiin      soraiin
olkeedy     solkeedy
opchedy     ypchedy
chckhhy     chckhy
tchody      ytchody
tshedy      kshedy
todaiin     podaiin
ykchedy     olkchedy
oraiin      araiin
tcheody     kcheody
pchodaiin   shodaiin
ofchedy     ypchedy
ochedy      ychedy
ksheody     psheody
dshedy      tshedy
okalal      okalar
qolchedy    solchedy
qolchedy    polchedy
araiin      raraiin
osaiin      saiin
olcheol     lcheol
podaiin     odaiin
opchdy      ypchdy
yshedy      lshedy
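Pairs like those in Table 1 can be generated by an exhaustive nearest-neighbor search over the vocabulary. A sketch of that computation, assuming the same trained `model` and `vocab` as in the earlier sketches and not reproducing our exact code, follows:

```python
import numpy as np

def closest_pairs(model, vocab, k=25):
    """Return the k most similar (word, nearest neighbor, cosine) triples."""
    vecs = np.array([model.get_word_vector(w) for w in vocab])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    np.fill_diagonal(sims, -1.0)          # ignore self-similarity
    best = sims.argmax(axis=1)            # nearest neighbor of each word
    triples = [(vocab[i], vocab[j], sims[i, j]) for i, j in enumerate(best)]
    return sorted(triples, key=lambda t: -t[2])[:k]
```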
4.2 Topic Modeling

[eb] Our analysis of folio vectors gives some interesting insights into topic modeling in the manuscript. We observed some clustering among vectors in the bath section, and a slight bias in the astronomical and multiherbal sections (herbal pages containing many small plants, rather than one large plant).

Figure 8: Document vectors colored by topic. Cyan: bath; grey: text; dark green: herbal; light green: multiherbal; red: astrological.

Using word vectors and tf-idf to construct folio vectors is particularly powerful as an analytic tool. Word vectors are very local in scope: they take into account a word's individual letter sequence, as well as the context of words it appears within. tf-idf, on the other hand, treats each page as a bag of words, ignoring the ordering and distribution of words, and instead tries to assess how important a word is globally.

While these are not strong results, the fact that we see clustering at all among the folio vectors suggests that we have found words that are both related to each other and highly specific within their respective folia. That is, we see similar important words in each section (where "importance" is as measured by tf-idf). While we would be remiss to call this a topic in the semantic sense, we nonetheless observe a kind of morphosyntactic "topic" (for example, in the bath section, qo- words have very high tf-idf scores). Whether this is due to true semantic differences or simply different word patterns is beyond the scope of this model.

We also used folio vectors to investigate Currier's language theory. As was the case in Bunn (2017), we observed some segmentation of the data into different languages, but no definitive clustering: the folio vectors do not provide evidence for Currier's labeling (Figure 9). In general, our folio vectors cluster into a single large group, so we cannot visually identify any meaningful groups from the t-SNE plot.

Figure 9: Folio vectors labeled by Currier language. Language A is red, and language B is blue.

4.3 Unsupervised Word Alignment

[eb] Unfortunately, MUSE did not give us the same dazzling results with Voynich as it did with the large corpora used in the original paper. The algorithm was largely unable to align our Voynich vectors with those of many other works in a number of languages. Oddly, we often found that MUSE aligned many Voynich words to the same word in another language, and on t-SNE plots of the resulting alignments, we generally found two mutually exclusive clusters: one of Voynich words, and one of the other text's words (instead of the overlapping clusters we expected).

Since we did not write the code, and instead only repurposed it to work with our embeddings, it is hard to say exactly why MUSE did not work. However, the original paper (Lample et al., 2017) used high-quality word vectors trained on the entirety of Wikipedia (nearly 2 billion words), and then further selected the 200,000 most common words for testing. This is many orders of magnitude more training and test data than we had; it is perhaps unsurprising that this experiment did not perform well.

When we looked closely at many of the alignments, it seemed like similar Voynich words were being mapped to similar (or the same) target language words. In a sense, this is good; this is the general direction we would need to go for MUSE to work. However, this is not a new finding. What we are seeing, most likely, is simply clusters of Voynich vectors mapping to the same nearby Latin word. This is something we can easily visualize in t-SNE: in Figure 10, each line maps a Voynich word to its best Latin match. While our usual notions of distance are skewed by the dimensionality reduction (each line is drawn to the "nearest" point), the common alignment of large clusters is clearly visible.

Figure 10: Mapping visualization between VMS (red) and Secreta Secretorum (green). Each line connects one Voynich word with its alleged Latin counterpart.

MUSE was therefore unhelpful in providing unsupervised word alignments in Voynichese; we are severely limited by the size of our data set and the quality of our word vectors. However, all hope is not lost. This general approach might have some prospects for the manuscript, but would need additional linguistic and computational work.

5 Conclusion

[eb] Once again, the mysterious Voynich Manuscript has evaded decryption. Although we had no luck creating unsupervised Voynich word alignments, we have unearthed a number of fascinating morphosyntactic phenomena by creating word vectors for the manuscript. We identified possible affixes by looking at vector clustering, and there are, of course, many more to investigate. This is an incredibly exciting result; further work in this direction may lead to some semblance of morphological understanding of Voynichese.

While our morphological and topical analyses picked up on structural properties of suffixes and important words in the text, this does not necessarily mean that the text is written in a natural language. Formal language theory tells us that many sequences of structured text can be described by a grammar. Therefore, it is always possible that the properties our embeddings associate with specific suffixes or folia reflect non-natural-language structures or gibberish.

6 Further Work

[eb] Word vectors present many opportunities for experimenting with the relationships between words. A common demonstration of word vectors is to highlight their ability to detect analogies. Vector equations such as queen – woman + man ≈ king show that word vectors can encode a surprising amount of information (in such equations, we add and subtract the vectors, and then search for a nearest match to the result). Therefore, an avenue for further research would be to try to replicate this phenomenon in the VMS. Specifically, it might make sense to test morphological analogies like tchedy – tchody ≈ kchedy – kchody. A more ambitious goal would be to try to use the same notion of analogy to tease at semantic relationships between words.
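The analogy test proposed above is straightforward to express with the embeddings we already have. The sketch below shows the standard vector-offset method (a - b + c, followed by a nearest-neighbor search); it again assumes the `model` and `vocab` objects from the earlier sketches and is only an illustration of the proposed experiment, not a result we report.

```python
import numpy as np

def analogy(model, vocab, a, b, c, k=3):
    """Solve a - b + c ~ ? by cosine similarity over the vocabulary."""
    target = (model.get_word_vector(a)
              - model.get_word_vector(b)
              + model.get_word_vector(c))
    target /= np.linalg.norm(target)
    scored = []
    for w in vocab:
        if w in (a, b, c):
            continue
        v = model.get_word_vector(w)
        scored.append((float(v @ target / np.linalg.norm(v)), w))
    return sorted(scored, reverse=True)[:k]

# e.g. does tchedy - tchody + kchody land near kchedy?
# print(analogy(model, vocab, "tchedy", "tchody", "kchody"))
```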
The Takahashi transcript, while consistent and complete, is nearly twenty years old. Now that extremely high-definition scans of every page of the manuscript are available from the Beinecke, it is high time for an updated EVA transcript. While there definitely are some unreadable words, the new scans make easy what was impossible on old photocopies of the manuscript.

And more work is in store for the Voynich alphabet. While we have been relying on EVA as the gold standard, this is not necessarily a valid assumption. A neural network could be trained on individual, maximally decomposed stroke data, and then try to classify groups of strokes into characters. This might be doable with clustering, or by taking advantage of Zipf's law to model unigram or letter frequencies. Perhaps such efforts to improve transcriptions of the manuscript would also help in the creation of higher-quality word vectors.

References

Julian Bunn. 2017. Using t-distributed stochastic neighbor embedding (tSNE) to cluster folios. https://voynichattacks.wordpress.com/2017/09/26/using-t-distributed-stochastic-neighbor-embedding-tsne-to-cluster-folios/.

Gregory R. Crane, editor. 2018. Perseus Digital Library. http://www.perseus.tufts.edu. Accessed May 6, 2018.

Prescott H. Currier. 1976. Some important new statistical findings. In Proceedings of a Seminar held on 30th November.

John R. Firth. 1957. A synopsis of linguistic theory. Studies in Linguistic Analysis.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781. http://arxiv.org/abs/1301.3781.

Christian Perone. 2016. Voynich manuscript: Word vectors and t-SNE visualization of some patterns. http://blog.christianperone.com/2016/01/voynich-manuscript-word-vectors-and-t-sne-visualization-of-some-patterns/.

L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research.

Bayu Aldi Yansyah. 2017. fasttext: A Python interface for the Facebook fastText library. https://pypi.org/project/fasttext/. Accessed May 6, 2018.