Protein Word Detection using Text Segmentation Techniques

G. Devi
Department of CSE, IIT Madras, Chennai-600036, India
[email protected]

Ashish V. Tendulkar
Google Inc., Hyderabad-500084, India
[email protected]

Sutanu Chakraborti
Department of CSE, IIT Madras, Chennai-600036, India
[email protected]

Abstract

Literature in Molecular Biology is abundant with linguistic metaphors. There have been works in the past that attempt to draw parallels between linguistics and biology, driven by the fundamental premise that proteins have a language of their own. Since word detection is crucial to the decipherment of any unknown language, we attempt to establish a problem mapping from natural language text to protein sequences at the level of words. Towards this end, we explore the use of an unsupervised text segmentation algorithm for the task of extracting "biological words" from protein sequences. In particular, we demonstrate the effectiveness of using domain knowledge to complement data driven approaches in the text segmentation task, as well as in its biological counterpart. We also propose a novel extrinsic evaluation measure for protein words through protein family classification.

1 Introduction

Research works in the field of Protein Linguistics (Searls, 2002) are largely based on the underlying hypothesis that proteins have a language of their own. However, the modeling of protein molecules using linguistic approaches is yet to be explored in depth. This might be due to the structural complexities inherent to protein molecules. Instead of resorting to purely wet lab experiments, we propose to make use of the abundant data available in the form of protein sequences, together with knowledge from domain experts, to model the protein language. From a linguistic point of view, the first step in deciphering an unknown language is to identify its independent lexical units, or words. This motivates our current attempt to establish a problem mapping from natural language text to protein sequences at the level of words. Towards this end, we explore the use of an unsupervised word segmentation algorithm for the task of extracting "biological words" from protein sequences.

Many unsupervised word segmentation algorithms use compression based techniques ((Chen, 2013), (Hewlett and Cohen, 2011), (Zhikov et al., 2010), (Argamon et al., 2004), (Kit and Wilks, 1999)) and are largely centred around the principle of Minimum Description Length (MDL). We use the MDL based segmentation algorithm described in (Kit and Wilks, 1999), which exploits the repeating subsequences present within a text corpus to compress it. The segments generated by this algorithm are found to exhibit close resemblance to words of the English language. There are also other, non-compression based unsupervised word segmentation and morphology induction algorithms in the literature ((Mochihashi et al., 2009), (Hammarström and Borin, 2011), (Soricut and Och, 2015)). However, in this context of protein sequence analysis, we have chosen MDL based unsupervised segmentation because it closely resembles the first natural attempt of a linguist in identifying words of an unknown language, i.e. looking for repeating subsequences as candidates for words.

As we do not have access to ground-truth knowledge about protein words, we propose to use a novel extrinsic evaluation measure based on protein family classification. SCOPe is an extended database of the SCOP hierarchy (Murzin et al., 1995), which classifies protein domains based on structural and sequence similarities. We have proposed an MDL based classifier for the task of automatic SCOPe prediction.

The performance of this classifier is used as an extrinsic measure of the quality of protein segments.

Finally, the MDL based word segmentation used in (Kit and Wilks, 1999) is purely data driven and does not have access to any domain-specific knowledge source. We propose that constraints based on domain knowledge can be profitably used to improve the performance of segmentation algorithms. In English, we use constraints based on pronounceability rules to improve word segmentation. In protein segmentation, we use knowledge of SCOPe Class labels (Fox et al., 2014) to impose constraints. In both cases, constraints based on domain knowledge are seen to improve the segmentation quality.

To summarize, the main contributions of our work are the following:

1. We attempt to establish a mapping from protein sequences to language at the level of words, which is a vital step in the linguistic approach to protein language decoding. Towards this end, we explore the use of an unsupervised text segmentation algorithm for the task of extracting "biological words" from protein sequences.

2. We propose a novel extrinsic evaluation measure for protein words via protein family classification.

3. We demonstrate the effectiveness of using domain knowledge to complement data driven approaches in the text segmentation task, as well as in its biological counterpart.

2 Related Work

Protein Linguistics (Searls, 2002) is the study of applying linguistic approaches to understand the structure and function of protein molecules. Research in the field of Protein Linguistics is largely based on the underlying assumption that proteins have a language of their own. David Searls draws many analogies between Linguistics and Molecular Biology to show how a linguistic metaphor can be seen interwoven into many problems of Molecular Biology. The fundamental analogy is that the 20 amino acids of proteins and the 4 nucleotides of genes are analogous to the 26 letters of the English alphabet.

Literature is abundant with parallels between language and biology (Bralley, 1996; Searls, 2002; Atkinson and Gray, 2005; Gimona, 2006; Tendulkar and Chakraborti, 2013). There are striking similarities between the structure of a protein molecule and a sentence in natural language, some of which are highlighted in Figure 1.

Gimona (2006) presents an excellent discussion on linguistics-based protein annotation and raises the interesting question of whether compositional semantics could improve our understanding of protein organization and functional plasticity. Tendulkar and Chakraborti (2013) have also drawn many parallels between biology and linguistics.

The wide gap between the number of available primary sequences and that of known three dimensional structures suggests that current protein structure prediction methods might struggle due to a lack of understanding of the folding code hidden in the protein sequence. If biological sequences are analogous to strings generated from a specific but unknown language, then it will be useful to find the rules of that unknown language. And word identification is fundamental to the task of learning the rules of an unknown language.

Motomura et al. ((2012), (2013)) use a frequency based linguistic approach to protein decoding and design. They treat the short constituent sequences (SCS) present in protein sequences as words and use availability scores to assess the biological usage bias of SCS. Our approach of using MDL for segmentation is interesting in that it does not require prior fixing of word length as in (Motomura et al., 2012), (Motomura et al., 2013).

3 Word Segmentation

A word is defined as a single distinct conceptual unit of language, comprising inflected and variant forms.¹

¹ https://en.oxforddictionaries.com/definition/word
In English, though space acts as a good approximation for a word delimiter, proper nouns like "New York" or phrases like "once in a blue moon" make sense only when taken as a single unit. Therefore, space is not a good choice for delimiting atomic units of meaning.

Figure 1: Structural levels in a Protein Molecule [Image source: (Wikipedia, 2017)] vs. Natural Language Sentence

Imagine a corpus of English text with spaces and other delimiters removed. Word segmentation is then the problem of dividing a continuous piece of text into meaningful units. For example, the delimiter-free text 'BIRDONTHETREE' can be segmented into four meaningful units as 'BIRD', 'ON', 'THE', 'TREE'. Analogously, we define protein segmentation as the problem of dividing the amino acid sequence of a protein molecule into biologically meaningful segments. For example, the toy protein sequence 'MATGQKLMRAIRVFEFGGPEVLKLQSDVVVPVPQSHQ' can consist of the three segments 'MATGQKLMRAIR', 'VFEFGGPEV', 'LKLQSDVVVPVPQSHQ'.

For our work, we assume that the word segmentation algorithm does not have knowledge of the English lexicon. The significance of this assumption can be understood in the context of protein segmentation: since the ground truth about words in the protein language is not known, we consider the problem of protein segmentation to be analogous to unsupervised word segmentation in English.

We begin this section by explaining why MDL can be a good model selection principle for learning words, followed by a description of the algorithm used and the results obtained on the Brown corpus.

3.1 MDL for Segmentation

According to the principle of Minimum Description Length (MDL):

Data compression → Learning

Any regularity present in data can be used to compress the data, which can also be seen as learning a model underlying the data (Grünwald, 2005). In an unsegmented text corpus, the repetition of words creates statistical regularities. Therefore, the key idea behind using MDL for word segmentation is that we can learn word-like segments by compressing the text corpus.

The Description Length (DL) of a corpus X is defined as the number of bits needed to encode it using Shannon-Fano coding ((Shannon, 2001), (Kit and Wilks, 1999)) and is expressed as:

    DL(X) = -\sum_{x \in V} c(x) \log \frac{c(x)}{|X|}    (1)

where V is the language vocabulary, c(x) is the frequency of word x in the given corpus, and |X| is the total number of words in X.

As an unsupervised learning algorithm does not have access to the language lexicon, the initial DL of the corpus is calculated by using the language alphabet as its vocabulary. When the algorithm learns word-like segments, we can expect the DL of the corpus to be reduced. According to MDL, the segmentation model that minimizes the combined description length of data + model (i.e. corpus + vocabulary) is the best approximation of the underlying word segmentation.

An exponential number of candidate segmentations is possible for a piece of unsegmented text. For example, some candidate segmentations for the text 'BIRDONTHETREE' are given below.

'B', 'IRDONTHETREE'
'BI', 'RD', 'ONTHET', 'R', 'E', 'E'
'B', 'I', 'R', 'D', 'ONTHET', 'REE'
'BIR', 'D', 'ONT', 'HE', 'TREE'
'BIRDON', 'THE', 'TREE'
'BIRD', 'ON', 'THETREE'
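For concreteness, Eq. (1) can be computed in a few lines of Python. The following is a minimal sketch (function and variable names are ours, not from the paper): it contrasts the DL of a small corpus encoded with a character vocabulary against the DL of the same corpus encoded with word-like segments.

import math
from collections import Counter

def description_length(tokens):
    """Eq. (1): DL(X) = -sum over vocabulary of c(x) * log2(c(x) / |X|)."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c * math.log2(c / total) for c in counts.values())

corpus = "BIRDONTHETREE" * 4  # repetition creates the regularity MDL exploits

# Initial DL: the vocabulary is just the alphabet (one token per character).
dl_chars = description_length(list(corpus))

# DL after a word-like segmentation of the same corpus.
dl_words = description_length(["BIRD", "ON", "THE", "TREE"] * 4)

# Note: the full MDL criterion also counts the DL of the vocabulary
# itself (corpus + vocabulary); that term is omitted here for brevity.
print(round(dl_chars, 1), round(dl_words, 1))  # ~157.4 vs 32.0 bits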

(Kit and Wilks, 1999) define a goodness measure called Description Length Gain (DLG) to quantify the compression effect produced by a candidate segmentation. The DLG of a candidate segmentation is equal to the sum of the DLGs of the individual segments within it. The DLG of a segment s is defined as the reduction in description length achieved by retaining this segment as a single lexical unit, while aDLG stands for the average description length gain:

    DLG(s) = DL(X) - DL(X[r \to s] \oplus s)

    aDLG(s) = \frac{DLG(s)}{c(s)}

where X[r → s] represents the new corpus obtained by replacing all occurrences of the segment s by a single token r, c(s) is the frequency of the segment s in the corpus, and ⊕ represents the concatenation of two strings with a delimiter in between. The concatenation is necessary because MDL minimizes the combined DL of corpus and vocabulary. (Kit and Wilks, 1999) use the Viterbi algorithm to find the optimal segmentation of a corpus. The time complexity of the algorithm is O(mn), where n is the length of the corpus and m is the maximal word length.
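The Viterbi search over segmentations can be sketched as a simple dynamic program. This is our illustrative reconstruction, not the authors' code: it assumes per-segment scores (e.g. DLG estimates) have already been collected in a dictionary, with a fixed penalty for unlisted segments.

def viterbi_segment(text, seg_score, max_word_len=8, unseen=-1.0):
    """Return the segmentation of `text` maximizing the summed segment
    scores. best[i] is the best score of text[:i]; back[i] is where the
    last segment starts. Runs in O(m*n) as stated above."""
    n = len(text)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j] + seg_score.get(text[j:i], unseen)
            if score > best[i]:
                best[i], back[i] = score, j
    segments, i = [], n
    while i > 0:
        segments.append(text[back[i]:i])
        i = back[i]
    return segments[::-1]

# Hypothetical DLG-style scores for a toy corpus.
scores = {"BIRD": 3.0, "ON": 1.5, "THE": 2.5, "TREE": 3.0, "THET": 0.5}
print(viterbi_segment("BIRDONTHETREE", scores))
# -> ['BIRD', 'ON', 'THE', 'TREE']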
3.2 Imposing Language Constraints

The MDL based algorithm as described in (Kit and Wilks, 1999) performs uninformed search through the space of word segmentations. We propose to improve the performance of the unsupervised algorithm by introducing constraints based on domain knowledge. These constraints help to improve the word-like quality of the MDL segments. For example, in the English domain we have used the following language constraints, mainly inspired by the fact that legal English words are pronounceable (a code sketch of these checks appears after the segmentation examples below).

1. Every legal English word has at least one vowel in it.

2. There cannot be three consecutive consonants at the beginning of a word except when the first consonant is 's'.

3. Some word beginnings are impossible. For example, 'db', 'km', 'lp', 'mp', 'ns', 'ms', 'td', 'kd', 'md', 'ld', 'bd', 'cd', 'fd', 'gd', 'hd', 'jd', 'nd', 'pd', 'qd', 'rd', 'sd', 'vd', 'wd', 'xd', 'yd', 'zd'.

4. Bigrams having a high probability of occurrence at word boundaries are obtained a priori from a knowledge base to facilitate the splitting of long segments.

3.3 MDL Segmentation of Brown Corpus

The goal of our experiments is twofold. First, we apply an MDL based algorithm to identify word boundaries. Second, we use constraints based on domain knowledge to further constrain the search space and thereby improve the quality of the segments.

The following is a sample input text from the Brown corpus (Francis and Kucera, 1979) used in our experiment.

implementationofgeorgiasautomobiletitlelaw
wasalsorecommendedbytheoutgoingjury
iturgedthatthenextlegislatureprovideenab
lingfundsandresettheeffectivedatesothata
norderlyimplementationofthelawmaybeeffect

The output segmentation obtained after applying the MDL algorithm is given below. It can be seen that the segments identified by the MDL algorithm are close to the actual words of the English language.

implementationof georgias automobile title l a w wasalso recommend edbythe outgoing jury i tur g edthat thenext legislature provide enabling funds andre s et theeffective d ate sothat anorderly implementationof thelaw maybe effect ed

The segments generated by MDL are improved by applying the language constraints listed in the previous section. Sample output is shown below. We can observe the effect of the constraints on the segments; for example, [l][a][w] is merged into [law] and [d][ate] is merged into [date].

implementationof georgias automobile title law wasalso recommend edbythe outgoing jury i tur ged that thenext legislature provide enabling funds andre set theeffective date sothat anorderly implementationof thelaw maybe effect ed
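As referenced above, the pronounceability constraints translate directly into per-segment checks. A minimal sketch (our own illustration; only rules 1-3 are covered, and the merging of a violating segment back into its neighbours is left out):

# Rule 3: impossible English word beginnings (from the list above).
BAD_STARTS = {"db", "km", "lp", "mp", "ns", "ms", "td", "kd", "md", "ld",
              "bd", "cd", "fd", "gd", "hd", "jd", "nd", "pd", "qd", "rd",
              "sd", "vd", "wd", "xd", "yd", "zd"}
VOWELS = set("aeiou")

def violates_constraints(segment):
    """True if the segment breaks any of rules 1-3 from Section 3.2."""
    s = segment.lower()
    if not any(ch in VOWELS for ch in s):                        # rule 1
        return True
    if len(s) >= 3 and not VOWELS & set(s[:3]) and s[0] != "s":  # rule 2
        return True
    if s[:2] in BAD_STARTS:                                      # rule 3
        return True
    return False

for seg in ["law", "date", "ld", "mprove", "string"]:
    print(seg, violates_constraints(seg))
# law False, date False, ld True, mprove True, string False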

Segmentation results are evaluated by averaging the precision and recall over multiple random samples of the Brown Corpus. A segment is declared a correct word only if both its starting and ending boundaries are identified correctly by the segmentation algorithm. Word precision and word recall are defined as follows:

    Word Precision = (No. of correct segments) / (Total no. of segments)
    Word Recall = (No. of correct segments) / (Total no. of words in corpus)

Boundary precision and boundary recall are defined as follows:

    Boundary Precision = (No. of correct segment boundaries) / (No. of segment boundaries)
    Boundary Recall = (No. of correct segment boundaries) / (No. of word boundaries)

The performance of our learning algorithm averaged over 10 samples of size 10,000 characters (taken from random indices in the Brown corpus) is shown in Tables 1 and 2. The reported results are in line with our hypothesis that domain constraints help in improving the performance of unsupervised MDL segmentation.

Algorithm                    Precision   Recall
MDL (Kit and Wilks, 1999)    79.24       34.36
MDL + Constraints            82.57       41.06

Table 1: Boundary Detection by MDL Segmentation

Algorithm                    Precision   Recall
MDL (Kit and Wilks, 1999)    39.81       17.26
MDL + Constraints            52.94       26.36

Table 2: Word Detection by MDL Segmentation

4 Protein Segmentation

In this section, we discuss our experiments in the protein domain. The choice of protein corpus is very critical to the success of MDL based segmentation. If we look at the problem of corpus selection from a language perspective, we know that similar documents will share more words in common than dissimilar documents. Hence, we have chosen our corpora from databases of protein families like SCOPe and PROSITE. We believe that protein sequences performing similar functions will have similar words.

4.1 Qualitative Analysis

The objective of our experiments on the PROSITE database (Sigrist et al., 2012) is to qualitatively analyse the protein segments. Within a protein family, some regions of the protein sequences have been better conserved than others during evolution. These conserved regions are found to be important for realizing the protein function and/or for the maintenance of its three dimensional structure. As part of our study, we examined whether the MDL segments are able to capture the conserved residues represented by PROSITE patterns.

The MDL segmentation algorithm was applied to 15 randomly chosen PROSITE families containing varying numbers of protein sequences.²

² The output segments are available at https://1drv.ms/f/s!AnQHeUjduCq0ae9rWhuoybZoA-U

Within a PROSITE family, some sequences get compressed more than others. An interesting observation is that the less compressed sequences are those that have evolved over time and hence have low sequence similarity with other members of the protein family. But they have the conserved residues intact, and the MDL segmentation algorithm is able to capture those conserved residues.³

³ A PROSITE pattern like [AC]-x-V-x(4)-AV is to be translated as: [Ala or Cys]-any-Val-any-any-any-any-Ala-Val.

For example, consider the PROSITE pattern for the Amidase enzyme (PS00571): G-[GAV]-S-[GS](2)-G-x-[GSAE]-[GSAVYCT]-x-[LIVMT]-[GSA]-x(6)-[GSAT]-x-[GA]-x-[DE]-x-[GA]-x-S-[LIVM]-R-x-P-[GSACTL]. The symbol 'x' in a PROSITE pattern is used for a position where any amino acid is accepted, and 'x(6)' stands for a chain of six amino acids of any type. For patterns with long chains of x, the MDL algorithm captures the conserved regions as a series of adjacent segments. For example, the conserved residues and MDL segments of the protein sequence with UniProtKB id O00519 are shown in Figure 2.

Figure 2: Conserved residues and MDL segments of a protein sequence (UniProtKB id O00519) in PROSITE family PS00571

As another example, consider the family PS00319 with pattern G-[VT]-[EK]-[FY]-V-C-C-P. This PROSITE pattern is short and does not contain any 'x'. In such cases, the conserved residues can get captured exactly by MDL segments. The protein sequence with UniProtKB id P14599 has low sequence similarity, but its conserved residues GVEFVCCP are captured exactly in a single MDL segment.

We also studied the distribution of segment lengths among the PROSITE families. A single corpus was created combining the sequences from 5 randomly chosen PROSITE families, and the distribution of segment lengths is shown in Figure 3. Protein segments that were common among the families were typically four or five amino acids in length. However, within each individual family there were longer segments unique to that family. Very long segments (length > 15) are formed when the corpus contains many sequences with high sequence similarities.

Figure 3: Distribution of MDL segment lengths among PROSITE families PS00319, PS00460, PS00488, PS00806 and PS00818
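Returning to the conserved-residue analysis above: checking whether a PROSITE pattern occurs in a sequence can be automated by translating the pattern into a regular expression. The sketch below is ours and handles only the pattern elements used in this section (single residues, bracketed alternatives, 'x', and exact repetition counts); the example sequence is made up.

import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'G-[VT]-[EK]-[FY]-V-C-C-P'
    into a Python regular expression."""
    parts = []
    for elem in pattern.split("-"):
        count = ""
        rep = re.match(r"(.+)\((\d+)\)$", elem)   # e.g. 'x(6)' -> '.{6}'
        if rep:
            elem, count = rep.group(1), "{%s}" % rep.group(2)
        parts.append(("." if elem == "x" else elem) + count)
    return "".join(parts)

regex = prosite_to_regex("G-[VT]-[EK]-[FY]-V-C-C-P")   # PS00319
print(regex)                                           # G[VT][EK][FY]VCCP
match = re.search(regex, "MKLGVEFVCCPLAE")             # toy sequence
print(match.group())                                   # GVEFVCCP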

4.2 Quantitative Analysis

Unlike in the English language, we do not have access to ground truth about words in proteins. Hence, we propose a novel extrinsic evaluation measure based on protein family classification. We describe a compression based classifier that uses the MDL segments (envisaged as words in proteins) for SCOPe predictions. The performance of the MDL based classifier on SCOPe predictions is used as an extrinsic evaluation measure of the protein segments.

4.2.1 MDL based Classifier

Suppose we want to classify a protein sequence p into one of k protein families. The MDL based classifier is given by

    \text{family}(p) = \operatorname*{argmax}_{i \in \{1,\ldots,k\}} DLG(p, \text{family}_i)    (2)

where DLG(p, family_i) is the measure of the compression effect produced by protein sequence p in the protein corpus of family_i. We hypothesize that a protein sequence will be compressed more by the protein family it belongs to, because of the presence of similar words among the members of the same family.
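To make Eq. (2) concrete, the sketch below classifies a segmented sequence into the family whose corpus of MDL segments compresses it best. The compression gain used here (bits saved by encoding the sequence jointly with the family corpus rather than separately) is a simplified stand-in for the full DLG computation, and the family corpora are toy data.

import math
from collections import Counter

def description_length(tokens):
    """Eq. (1) over a list of segment tokens."""
    counts, n = Counter(tokens), len(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())

def compression_gain(seq_segments, family_segments):
    """Bits saved by encoding the sequence together with the family
    corpus instead of separately (a proxy for DLG(p, family_i))."""
    joint = description_length(family_segments + seq_segments)
    return (description_length(family_segments)
            + description_length(seq_segments)) - joint

def classify(seq_segments, families):
    """Eq. (2): pick the family that compresses the sequence the most."""
    return max(families, key=lambda f: compression_gain(seq_segments, families[f]))

families = {  # toy corpora of MDL segments per family
    "PS00319": ["GVEFVCCP", "MKL", "AE", "GVEFVCCP", "TR"],
    "PS00571": ["GGSSG", "LIV", "RAP", "GGSSG", "QD"],
}
print(classify(["GVEFVCCP", "TR", "XY"], families))  # -> PS00319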

Experimental Setup. The dataset used for protein classification is the ASTRAL Compendium (Chandonia et al., 2004). It contains sequences for domains classified by the SCOPe hierarchy. The ASTRAL 95 subset based on SCOPe v2.05 is used as the training corpus, and the test set is created by accumulating the protein domain sequences that were newly added in SCOPe v2.06. The performance of the MDL classifier is discussed at four SCOPe levels - Class, Fold, Superfamily and Family. At all levels, we consider only the protein domains belonging to the four SCOPe classes A, B, C and D, representing All Alpha, All Beta, Alpha+Beta and Alpha/Beta respectively. The blind test set contains a total of 4821 protein domain sequences.

SCOPe classification poses the problem of class imbalance due to the non-uniform distribution of domains among different classes at all SCOPe levels. Due to this problem, we use macro precision and macro recall (Yang, 1999) as performance measures, given by the equations below, where q is the number of classes:

    Precision_{macro} = \frac{1}{q} \sum_{i=1}^{q} \frac{TP_i}{TP_i + FP_i}    (3)

    Recall_{macro} = \frac{1}{q} \sum_{i=1}^{q} \frac{TP_i}{TP_i + FN_i}    (4)

4.2.2 Performance of MDL Classifier

Class Prediction. Out of the 4821 domain sequences in the test data, the MDL classifier abstains from prediction for 71 sequences due to multiple classes giving the same measure of compression. The MDL classifier achieves a macro precision of 75.64% and a macro recall of 69.63% in Class prediction.

SCOPe level    Macro Average Precision    Macro Average Recall
Class          75.64                      69.63
Fold           60.59                      45.08
Superfamily    56.65                      43.73
Family         43.25                      37.70

Table 3: Performance of MDL Classifier in SCOPe Prediction
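The macro measures of Eqs. (3) and (4), and the instance-weighted variants reported in Table 4 below, differ only in how the per-class scores are averaged. A small illustrative sketch (the counts are invented):

def macro_and_weighted_precision(stats):
    """stats maps a class to (TP, FP, n_instances). Recall is analogous
    with FN in place of FP."""
    per_class = [(tp / (tp + fp), n) for tp, fp, n in stats.values()]
    macro = sum(p for p, _ in per_class) / len(per_class)
    weighted = (sum(p * n for p, n in per_class)
                / sum(n for _, n in per_class))
    return macro, weighted

# A rare class drags the macro average down more than the weighted one.
stats = {"A": (900, 100, 1000), "B": (5, 15, 20)}
print(macro_and_weighted_precision(stats))  # (0.575, ~0.887)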

Fold Prediction. SCOPe v2.05 contains a total of 1208 folds, out of which 991 folds belong to classes A, B, C and D. The distribution of protein sequences among the folds is non-uniform, ranging from 1 to 2254 sequences, with 250 folds containing only one sequence. The MDL classifier achieves a macro precision of 60.59% and a macro recall of 45.08% in fold classification.

Impact of Corpus Size. The number of protein domains per class decreases greatly down the SCOPe hierarchy. The folds (or families, superfamilies) that have very few sequences should contribute less to the overall prediction accuracy. We therefore weighted the macro measures by the number of instances, which resulted in the weighted averages reported in Table 4. The MDL classifier achieves a weighted macro precision of 81.49% in SCOPe fold prediction, which is higher than the precision at any other level. This observation highlights the quality of the protein segments generated by the MDL algorithm. It is also important to note that fold prediction is an important subtask of protein structure prediction, just as word detection is crucial to understanding the meaning of a sentence.

SCOPe level    Weighted Average Precision    Weighted Average Recall
Class          76.38                         69.77
Fold           81.49                         49.25
Superfamily    72.80                         48.23
Family         45.02                         35.85

Table 4: Performance of MDL Classifier in SCOPe Prediction - Weighted Measures

4.3 MDL Classifier as a Filter

The folds which are closer to each other in the SCOPe hierarchy tend to compress protein sequences almost equally. Instead of returning the single fold giving maximum compression, the MDL classifier can return the top-k candidates, thereby reducing the search space for manual or high cost inspections. We define the utility of the MDL classifier when used as a filter as:

    Utility = (No. of predictions where the correct fold is in the top-k list) / (Total no. of predictions)

Figure 4 shows k versus utility on the test data. It can be seen from the graph that at k = 400 (approximately 33% of the total number of folds), the top-k predictions give 93% utility. In other words, for 93% of the test sequences, the MDL filter can be used to achieve a nearly 67% reduction in the search space of 1208 folds.

Figure 4: Variation of Filter Utility with Filter Size k
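Computing the filter utility from ranked predictions is straightforward; a minimal sketch with invented data:

def filter_utility(ranked_folds, true_folds, k):
    """Fraction of test sequences whose correct fold appears among the
    top-k folds ranked by compression (the Utility defined above)."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_folds, true_folds))
    return hits / len(true_folds)

# Three test sequences; folds ranked best-first for each.
ranked = [["f1", "f7", "f3"], ["f2", "f1", "f9"], ["f5", "f4", "f8"]]
truth = ["f7", "f1", "f8"]
print(filter_utility(ranked, truth, k=2))  # 2/3: first two hit, third misses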

4.4 Impact of Constraints based on Domain Knowledge

Similar to the experiments in the English domain, the MDL algorithm on the protein dataset can also be enhanced by including constraints from protein domain knowledge. For example, in a protein molecule, hydrophobic amino acids are likely to be found in the interior, whereas hydrophilic amino acids are more likely to be in contact with the aqueous environment. This information can be used to introduce checks on the allowable amino acids at the beginning and end of protein segments.

Unlike in English, identifying constraints based on protein domain knowledge is difficult because there is no lexicon and no protein language rules are readily available. Domain expertise is needed to obtain explicit constraints. As a proof of concept, we use the SCOPe class labels of protein sequences as domain knowledge and study their impact on the utility of the MDL filter. After introducing class knowledge, the MDL filter achieves a utility of 93% at k = 100, i.e., for 93% of the test sequences, the MDL filter can be used to achieve a nearly 90% reduction in the search space of 1208 folds. In the absence of class knowledge, the same filter utility was obtained only at k = 400, which is a 67% reduction of the search space (Figure 5).

Figure 5: Variation of Filter Utility with Filter Size k after adding constraints based on SCOPe Class labels

Through this experiment, we emphasize that appropriate domain knowledge can help in improving the quality of word segmentation in protein sequences. Such domain knowledge could be imposed in the form of constraints during the unsupervised learning of protein words. We would like to emphasize that introducing domain knowledge in the form of class labels, as in supervised or semi-supervised learning frameworks, may not be appropriate for protein sequences due to our current ignorance of the true protein words.

5 Discussion

In the words of Jones and Pevzner (Jones and Pevzner, 2004), "It stands to reason that if a word occurs considerably more frequently than expected, then it is more likely to be some sort of 'signal' and it is crucially important to figure out the biological meaning of the signal". In this paper, we have proposed the protein segments obtained from MDL segmentation as the signals to be decoded. As part of our future work, we would like to study the performance of SCS words (Motomura et al., 2012), (Motomura et al., 2013) in protein family classification and compare it against MDL words. We would also like to measure the availability scores of MDL segments. It may also be insightful to study the co-occurrence matrix of MDL segments.

6 Conclusion

Given the abundance of unlabelled data, data driven approaches have witnessed significant success over the last decade in several tasks in vision, language and speech. Inspired by the correspondence between biological and linguistic tasks at various levels of abstraction, as revealed by the study of Protein Linguistics, it is only natural that there would be a propensity to extend such approaches to several tasks in Computational Biology. A linguist, however, already knows a lot about language, and a biologist knows a lot about biology; so it makes sense to incorporate what they already know to constrain the hypothesis space of a machine learner, rather than make the learner rediscover what the experts already know. The latter option is not only demanding in terms of data and computational resources; it may also need us to solve riddles we just do not have answers to. Classifying a piece of text as humorous or otherwise is hard at the state of the art; there are far more interactions between variables than we can model: not only do the words interact among themselves, they also interact with the mental model of the person reading the joke. It stretches our wildest imaginations to think of a purely bottom-up Deep Learner, deprived of common sense and world knowledge, learning such end-to-end mappings reliably by looking at data alone. The same is true in biological domains, where non-linear interactions between a large number of functional units make macro-properties "emerge" out of interactions between individual functional units. We feel that a realistic route is one where top down (knowledge driven) approaches complement bottom up (data driven) approaches effectively. This paper will have served a modest goal if it has aligned itself towards demonstrating such a possibility within the scope of discovering biological words, which is just one small step in the fascinating quest towards deciphering the language in which biological sequences express themselves.

References
Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. 2004. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, page 1058.

Quentin D. Atkinson and Russell D. Gray. 2005. Curious parallels and curious connections: phylogenetic thinking in biology and historical linguistics. Systematic Biology 54(4):513-526.

Patricia Bralley. 1996. An introduction to molecular linguistics. BioScience 46(2):146-153.

John-Marc Chandonia, Gary Hon, Nigel S. Walker, Loredana Lo Conte, Patrice Koehl, Michael Levitt, and Steven E. Brenner. 2004. The ASTRAL compendium in 2004. Nucleic Acids Research 32(suppl 1):D189-D192.

Ruey-Cheng Chen. 2013. An improved MDL-based compression algorithm for unsupervised word segmentation. In ACL (2). pages 166-170.

Naomi K. Fox, Steven E. Brenner, and John-Marc Chandonia. 2014. SCOPe: Structural Classification of Proteins - extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Research 42(D1):D304-D309.

W. Nelson Francis and Henry Kucera. 1979. The Brown corpus: A standard corpus of present-day edited American English. Providence, RI: Department of Linguistics, Brown University [producer and distributor].

Mario Gimona. 2006. Protein linguistics - a grammar for modular protein assembly? Nature Reviews Molecular Cell Biology 7(1):68-73.

Peter Grünwald. 2005. A tutorial introduction to the minimum description length principle. Advances in Minimum Description Length: Theory and Applications, pages 23-81.

Harald Hammarström and Lars Borin. 2011. Unsupervised learning of morphology. Computational Linguistics 37(2):309-350.

Daniel Hewlett and Paul Cohen. 2011. Fully unsupervised word segmentation with BVE and MDL. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, pages 540-545.

Neil C. Jones and Pavel Pevzner. 2004. An Introduction to Bioinformatics Algorithms. MIT Press.

Chunyu Kit and Yorick Wilks. 1999. Unsupervised learning of word boundary with description length gain. In Proceedings of the CoNLL99 ACL Workshop. Bergen, Norway: Association for Computational Linguistics, pages 1-6.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, pages 100-108.

Kenta Motomura, Tomohiro Fujita, Motosuke Tsutsumi, Satsuki Kikuzato, Morikazu Nakamura, and Joji M. Otaki. 2012. Word decoding of protein amino acid sequences with availability analysis: a linguistic approach. PLoS ONE 7(11):e50039.

Kenta Motomura, Morikazu Nakamura, and Joji M. Otaki. 2013. A frequency-based linguistic approach to protein decoding and design: Simple concepts, diverse applications, and the SCS package. Computational and Structural Biotechnology Journal 5(6):1-9.

Alexey G. Murzin, Steven E. Brenner, Tim Hubbard, and Cyrus Chothia. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247(4):536-540.

David B. Searls. 2002. The language of genes. Nature 420(6912):211-217.

Claude Elwood Shannon. 2001. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review 5(1):3-55.

Christian J. A. Sigrist, Edouard De Castro, Lorenzo Cerutti, Béatrice A. Cuche, Nicolas Hulo, Alan Bridge, Lydie Bougueleret, and Ioannis Xenarios. 2012. New and continuing developments at PROSITE. Nucleic Acids Research, page gks1067.

Radu Soricut and Franz Josef Och. 2015. Unsupervised morphology induction using word embeddings. In HLT-NAACL. pages 1627-1637.

Ashish Vijay Tendulkar and Sutanu Chakraborti. 2013. Parallels between linguistics and biology. In Proceedings of the 2013 Workshop on Biomedical Natural Language Processing, Sofia, Bulgaria: Association for Computational Linguistics, pages 120-123.

Wikipedia. 2017. Protein structure — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Protein%20structure&oldid=774730776. [Online; accessed 22-April-2017].

Yiming Yang. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval 1(1-2):69-90.

Valentin Zhikov, Hiroya Takamura, and Manabu Okumura. 2010. An efficient algorithm for unsupervised word segmentation with branching entropy and MDL. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 832-842.
