Calculated Attributes of Synonym Sets

Andrew Krizhanovsky, Alexander Kirillov
Institute of Applied Mathematical Research of the Karelian Research Centre of the Russian Academy of Sciences
Petrozavodsk, Karelia, Russia
[email protected], [email protected]

Abstract—The goal of the formalization proposed in this paper is to bring together, as near as possible, the theoretic linguistic problem of the synonym conception and the computer linguistic methods based, generally, on empirical, intuitive, unjustified factors. Using the word vector representation, we have proposed a geometric approach to the mathematical modeling of a synonym set (synset). The word embedding is based on the neural networks (Skip-gram, CBOW) developed and realized as the word2vec program by T. Mikolov. The standard cosine similarity between word-vectors is used as the distance between words. Several geometric characteristics of the synset words are introduced: the rank and the centrality. These characteristics are intended to select the interior of a synset, i.e. the most significant synset words, the words whose senses are the nearest to the sense of the synset. Some experiments with the proposed notions, based on RusVectores resources, are presented.

I. INTRODUCTION

The notion of synonym, though it is in common use, has no rigorous definition and is characterized by descriptive approaches. The descriptive definition runs as follows: synonyms are words identical or close in meaning, expressing the same notion, but differing from each other in shades of sense, belonging to different linguistic levels, and having their own specific expressive tone. This definition immediately raises several questions: what are the meanings and shades of sense, and how can quantitative characteristics be introduced and used for the analysis and description of relations between words? Hence a formalization which would enable one to develop the notion of sense is necessary. Such formalization is particularly significant in natural language processing problems.

In this paper, an approach to the mathematical modeling of a synset is proposed. The notion of synset (a set of synonyms) owes its occurrence to WordNet [12], where different relations (synonymy, homonymy) are indicated between synsets but not between individual words. For this research the synonym relations presented by the Russian Wiktionary have been used. Wiktionary is a freely updated collaborative multifunctional multilingual online dictionary and thesaurus. The machine-readable Russian Wiktionary, which we use in this paper, is regularly updated with the help of the wikokit software (https://github.com/componavt/wikokit) on the base of Russian Wiktionary data [5].

The authors of this paper present an approach to the partial solution of the following problems:

• the automatic ordering of the synonyms in a synset according to the proximity of the words to the sense represented by the synset;
• the development of a mathematical tool for the characterization and comparison of synsets;
• the experimental verification and analysis of synsets using the data of the online dictionary (Russian Wiktionary), and the detection of "weak" synsets (in future investigations) on the basis of the developed mathematical tool, in order to improve the dictionaries;
• the significant problem which has incented the authors of this paper to turn to the word sense disambiguation (WSD) task. Our main task is to combine the neural networks and the proposed methods to solve the WSD problem at a more qualitative level in comparison with existing methods [1].

II. THE BRILLIANCE AND THE POVERTY OF THE WORD VECTOR REPRESENTATION CONSTRUCTION BY THE WORD2VEC TOOL: THE NN-MODELS

The idea of representing a word, using neural networks (NN), as a vector in some vector space, due to the Skip-gram and CBOW constructions proposed by T. Mikolov and his colleagues [9], [10], [11], has enjoyed wide popularity. The main advantages of these NN-models are the simplicity of their usage, the possibility of their wide use with the help of the word2vec instrument developed by T. Mikolov's group on the basis of such constructions, and the availability of large text corpora. It is worth noting that a significant contribution to this field of computer linguistics has been made by the Russian scientists A. Kutuzov and E. Kuzmenko, who have developed, using word2vec, NN-models for the Russian language trained on several corpora. The tool proposed by them is called RusVectores [6].

The "poverty" of Mikolov's approach consists in the rather confined possibilities of its applicability to finding out meaningful pairs of semantic relations. One of the brightest and most well-known examples of word2vec is the relation (king − man + woman ≈ queen), which is not supported by other expressive relations. The slightest deviations from the examples representing the illustrations of the approach lead to far from satisfactory results. The lack of a formal justification of Mikolov's approach was pointed out in the recent paper of Goldberg and Levy [2], which ends with the following appeal to the researchers: "Can we make this intuition more precise? We'd really like to see something more formal" [2].

The presented paper is, to some degree, a partial response to this challenge of the well-known researchers in computer linguistics.
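The analogy relation mentioned above is easy to check against any pre-trained word2vec model. The sketch below uses the gensim library (the same library we rely on in Section IV); the model file name is only a placeholder, and the neighbours actually returned depend on the model.

# A minimal sketch: checking the (king - man + woman ≈ queen) analogy
# with a pre-trained word2vec model loaded through gensim.
# "model.bin" is a placeholder; substitute any model in word2vec binary format.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

# most_similar() builds the vector king - man + woman and returns the words
# nearest to it by cosine similarity; for many English models the top word is "queen".
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))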

Let us consider the main idea of the word vector representation. Denote by D some dictionary and enumerate its words in some way. Let |D| be the number of words in D, and let i be the index number of a word in the dictionary.

Definition 1: The vector dictionary is the set D = {w_i ∈ R^{|D|}}, where the i-th component of the vector w_i equals 1, while the other components are zeros.

Thus, w_i is the image of the i-th word in D. The problem of the word vector representation, as it is understood at present, is to construct a linear mapping L : D → R^N, where N ≪ |D|, such that the vector v = L(w), w ∈ D, has components v_j ∈ R. This procedure is called the distributed word vector representation. Its goal is to replace the very thin set D ⊂ R^{|D|}, consisting of vectors with zero mutual inner (scalar) products, by some subset of R^N, where N ≪ |D|, with the following property: the inner products of vectors from R^N may be used as a measure of word similarity, which is currently accepted in the corresponding problems of natural language processing. If W is the matrix of such a linear mapping L, then v = Ww for v ∈ R^N. In addition, several methods, particularly those based on neural networks, are used to construct W. Recently, the CBOW and Skip-gram methods have become widely used. Their mathematical basis is the modified maximum likelihood method.

For instance, the Skip-gram NN-model provides the matrix W, mentioned above, as the matrix which maximizes the following function F(W):

F(W) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \ln p(w_{t+j} \mid w_t),

p(w_{t+j} \mid w_t) = \frac{\exp u_{t+j}}{\sum_{i=1}^{|D|} \exp u_i}, \qquad u_i = (W w_i, W w_t),

where (·, ·) denotes the inner product and T is the volume of the training context. Here, a word w_t is given in order to find out the appropriate context containing this word and having the size 2c (the size of the "window"). The CBOW (continuous bag of words) NN-model, on the contrary, operates with some given context and provides an appropriate word. These models take into account only the local context. There exist some attempts to use the global context, i.e. the whole document [4]. Such an approach would be useful for solving WSD problems.

III. THE SYNSET GEOMETRY

A. The synset interior: IntS

The distance between (normalized) word-vectors is measured by their inner product, i.e. by the angle between them, as in the theory of projective spaces. Thus, an increase of the inner product corresponds to a decrease of the distance between the vector-words a, b ∈ R^N; their similarity is denoted by sim{a, b}. Hence,

sim\{a, b\} = \frac{(a, b)}{\|a\| \cdot \|b\|},

where (a, b) is the inner product of the vectors a and b, and ||·|| is the norm symbol. Some other measures of the distance between vectors have been proposed, but they are based on the inner product [7], [8], [13].

Let us introduce a designation for the normalized sum of vectors: M((a_i), n) = \sum_{i=1}^{n} a_i / \|\sum_{i=1}^{n} a_i\|. In what follows, the distance between sets of vectors will be measured by the distance between the normalized mean vectors of the sets. Thus, if A = {a_1, ..., a_n} and B = {b_1, ..., b_m}, a_i, b_j ∈ R^N, then sim{A, B} = (M((a_i), n), M((b_j), m)).

Consider a synset S = {v_k, k = 1, ..., |S|}. Let us remove any word v from S (the index of the word is omitted for brevity). Divide the set S \ {v} into two disjunctive subsets: S \ {v} = {v_{i_s}} ⊔ {v_{j_p}}, s = 1, ..., q, p = 1, ..., r, q + r = |S| − 1, i_s ≠ j_p. Denote S_1 = {v_{i_s}}, S_2 = {v_{j_p}}. Then S \ {v} = S_1 ∪ S_2.

Definition 2: The interior IntS of a synset S is the set of all vectors v ∈ S satisfying the following condition:

IntS = \{v ∈ S : sim\{S_1, S_2\} < sim\{S_1 ∪ v, S_2\} \;\wedge\; sim\{S_1, S_2\} < sim\{S_1, S_2 ∪ v\}\}   (1)

for all disjunctive partitions S \ {v} = S_1 ⊔ S_2, where S_1 ≠ ∅, S_2 ≠ ∅.

The sense of this definition: the addition of the vector v ∈ IntS to either of the two subsets of S \ {v} forming its disjunctive partition decreases the distance between these subsets (i.e. increases the similarity).

To illustrate the notion of IntS, consider two-dimensional vectors. In Fig. 1 the vector v (conditionally shown as a circle), added to S_1 or S_2, decreases the distance between S_1 and S_2.

Figure 1. Vector v decreases the distance between S_1 and S_2. If this occurs for all disjunctive partitions of S \ {v}, then v ∈ IntS.
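Below is a minimal computational sketch of Definition 2, assuming each synset word is already mapped to a numpy vector (for example, taken from a word2vec model); the function names are ours and serve only as an illustration.

# A sketch of Definition 2: sim{A, B} as the inner product of normalized mean
# vectors, and the IntS membership test over all disjunctive partitions of S \ {v}.
# Assumes every word of the synset is given as a numpy vector; |S| > 2.
from itertools import combinations
import numpy as np

def sim(A, B):
    # Similarity of two sets of vectors: cosine of their normalized mean vectors.
    a, b = np.sum(A, axis=0), np.sum(B, axis=0)
    return float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))

def partitions(rest):
    # All disjunctive partitions of the list `rest` into two nonempty subsets.
    n = len(rest)
    for size in range(1, n // 2 + 1):
        for left in combinations(range(n), size):
            if 2 * size == n and 0 not in left:
                continue  # skip the mirror image of an equal-size split
            yield ([rest[i] for i in left],
                   [rest[i] for i in range(n) if i not in left])

def in_interior(v_index, vectors):
    # Condition (1) for the word with index v_index in the list `vectors`.
    v = vectors[v_index]
    rest = [vectors[i] for i in range(len(vectors)) if i != v_index]
    for S1, S2 in partitions(rest):
        base = sim(S1, S2)
        if not (base < sim(S1 + [v], S2) and base < sim(S1, S2 + [v])):
            return False
    return True

# IntS is then {i : in_interior(i, vectors)} over all word indices of the synset.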
B. Rank and centrality of a word in synset

Let us introduce the notion of the rank of a synonym v ∈ S. In what follows we consider only disjunctive partitions, and thus, for brevity, a disjunctive partition into two subsets (the elements of the partition) will be called simply a partition. Let P_v = {p_i, i = 1, ..., 2^{n−2} − 1} be the set of all partitions p_i of the set S \ {v}, enumerated in some way, where |S| = n. Here |S| is the power (the number of elements) of S. Suppose n > 2. Consider any partition p_i of the set S \ {v}: S \ {v} = S_1 ⊔ S_2. Denote sim_i = sim{S_1, S_2}, sim_i^1 = sim{S_1 ∪ v, S_2}, sim_i^2 = sim{S_1, S_2 ∪ v}. Using these designations, we obtain

IntS = \{v ∈ S : sim_i < sim_i^1 \;\wedge\; sim_i < sim_i^2\}.   (2)

Introduce the function r_v : P_v → {−1, 0, 1} such that

r_v(p_i) = \begin{cases}
-1, & \text{if } sim_i^1 < sim_i \text{ and } sim_i^2 < sim_i \quad (v \text{ moves } S_1 \text{ and } S_2 \text{ apart}), \\
1, & \text{if } sim_i^1 > sim_i \text{ and } sim_i^2 > sim_i \quad (v \text{ brings } S_1 \text{ and } S_2 \text{ closer}), \\
0, & \text{if } (sim_i^1 - sim_i)\,(sim_i^2 - sim_i) < 0 \quad (\text{closer in one case, apart in the other}).
\end{cases}   (3)

The function r_v is determined for each partition and gives, metaphorically speaking, the "bricks" from which the rank of a synonym will be composed below. Let us briefly explain the last line of the above definition of r_v. The expression (sim_i^1 − sim_i)(sim_i^2 − sim_i) < 0 is equivalent to (sim_i^1 < sim_i ∧ sim_i^2 > sim_i) ∨ (sim_i^1 > sim_i ∧ sim_i^2 < sim_i). In other words, the function r_v(p_i) has the value 0 if adding the word v to one of the elements of a partition p_i decreases (increases) the distance sim_i, while adding it to the other element, on the contrary, increases (decreases) the distance sim_i. In Fig. 2 this is the third partition.
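The case analysis in (3) translates directly into code. The following hypothetical helper, written under the same assumptions as the previous sketch (numpy word-vectors, with the set-similarity function sim passed in as an argument), returns the contribution of a single partition; it is the building block of the rank introduced in Definition 3 below.

# A sketch of the function r_v from (3): the contribution of one partition
# (S1, S2) of S \ {v} to the rank of the word v.
def r_v(v, S1, S2, sim):
    base = sim(S1, S2)              # sim_i
    d1 = sim(S1 + [v], S2) - base   # sim_i^1 - sim_i
    d2 = sim(S1, S2 + [v]) - base   # sim_i^2 - sim_i
    if d1 > 0 and d2 > 0:
        return 1                    # v brings S1 and S2 closer
    if d1 < 0 and d2 < 0:
        return -1                   # v moves S1 and S2 apart
    return 0                        # closer in one case, apart in the other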

Definition 3: The rank of a synonym v ∈ S, where |S| > 2, is the integer of the form

rank(v) = \sum_{i=1}^{|P_v|} r_v(p_i).   (4)

The definition implies that if v ∈ IntS, then rank(v) = 2^{|S|−2} − 1, the number of all nonempty disjunctive partitions of S \ {v} into two subsets, where |S \ {v}| = |S| − 1; i.e. rank(v) is maximal and equals the Stirling number of the second kind {n \brace k} with n = |S| − 1 and k = 2, where n is the number of elements in the partitioned set and k is the number of subsets in a partition [3, p. 244].

The connection between IntS and rank(v) is given by the following theorem.

Theorem 3.1 (IntS theorem): Assume |S| > 2. Then v ∈ IntS if and only if the rank of the word v is maximal in the given synset and equals the Stirling number of the second kind for the partitions of S \ {v} into two nonempty subsets, i.e.

v ∈ IntS ⇔ rank(v) = 2^{|S|−2} − 1, where |S| ≥ 3.

Proof: By (2), v ∈ IntS ⇔ ∀p_i : sim_i^1 > sim_i ∧ sim_i^2 > sim_i (v brings S_1 and S_2 closer); by (3), this holds ⇔ ∀p_i : r_v(p_i) = 1; and by (4), the latter holds ⇔

rank(v) = \sum_{i=1}^{|P_v|} 1 = |P_v| = 2^{|S|−2} − 1,   (5)

since 2^{|S|−2} − 1 is the maximal number of nonempty disjunctive partitions into two subsets.

Definition 4: The centrality of a synonym v ∈ S under a partition p_i of S \ {v} is the value

centrality(v, p_i) = (sim_i^1(v) − sim_i) + (sim_i^2(v) − sim_i).

Definition 5: The centrality of a synonym v ∈ S is the value

centrality(v) = \sum_{i=1}^{|P_v|} centrality(v, p_i).

Hypothesis 1: A word v belonging to IntS has a greater rank and centrality than the other words of the synset S. It is likely that the rank and the centrality show the measure of significance of a word in a synset, i.e. the measure of proximity of this word to the synset sense. Since the centrality is a real number, it gives a more precise characteristic of the significance of a word in a synset than the rank, which is an integer (see Table I).

C. Rank and centrality computations

The definition of centrality implies the following centrality computation Procedures 1 and 2.

Hypothesis 2: The more meanings a word has, the lower are its rank and centrality in different synsets. The following example and Table I support this hypothesis. It is worth noting that this example is not exclusive; the verification of the hypothesis on a large amount of data is the substance of future research.

Procedure 1. Computation of the rank r_v(p_i) and the centrality(v, p_i) of a word v and a correspondent partition p_i of the synset S
Input: a synset S, a word v ∈ S and any correspondent partition p_i of S \ {v};
Require: S \ {v} = S_1 ⊔ S_2;
Output: r_v(p_i), centrality(v, p_i).
1: sim_i ← sim{S_1, S_2}
2: sim_i^1(v) ← sim{S_1 ∪ v, S_2}   // adding the word v to S_1
3: sim_i^2(v) ← sim{S_1, S_2 ∪ v}   // adding the word v to S_2
4: centrality(v, p_i) ← (sim_i^1(v) − sim_i) + (sim_i^2(v) − sim_i)
5: r_v(p_i) ← 1/2 · (sgn(sim_i^1(v) − sim_i) + sgn(sim_i^2(v) − sim_i)), where sgn(x) = 1 if x > 0, sgn(x) = 0 if x = 0, sgn(x) = −1 if x < 0
6: return r_v(p_i), centrality(v, p_i)

Procedure 2. Computation of rank(v) and centrality(v) of a word v of the synset S
Input: a synset S, a word v ∈ S;
Output: rank(v), centrality(v).
1: centrality(v) ← \sum_{i=1}^{|P_v|} centrality(v, p_i)
2: rank(v) ← \sum_{i=1}^{|P_v|} r_v(p_i)
3: return rank(v), centrality(v)

Example 1: Let us consider the synset S = (battle, combat, fight, engagement). Let us find out IntS and calculate the rank and the centrality of each word in the synset.

The calculation of the rank and the centrality of the word "combat" in this synset is shown in Fig. 2. The set S \ {v} of power |S \ {v}| = 3 may be decomposed in three ways into two nonempty subsets. Each partition may add 1, 0 or −1 to rank(v) (Fig. 2). The values of the rank and the centrality equal 2 and 0.36, respectively.

In Table I the rank, the centrality and the belonging to IntS are shown for each word of the synset.

Figure 2. The values of the rank and the centrality for the word "combat" in the synset S = {battle, engagement, fight, combat}. Three possible partitions of the set S \ {combat} = {battle, engagement, fight} into two nonempty subsets, S_1^i, S_2^i, i = 1, 2, 3, are presented; for brevity, S_1^i, S_2^i are denoted as S_1, S_2. The values of rank(v) and centrality(v), v = combat, are calculated as the sums of the appropriate Δrank_i := rank(v, p_i) and Δcentrality_i := centrality(v, p_i).
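For completeness, here is a minimal Python rendering of Procedures 1 and 2; it is our illustrative sketch, not the authors' released script. The sim and partitions helpers are repeated from the Section III-A sketch so that the snippet is self-contained, and the synset is assumed to be a list of numpy word-vectors.

# A sketch of Procedures 1 and 2: rank(v) and centrality(v) of a word in a synset.
from itertools import combinations
import numpy as np

def sim(A, B):
    # Cosine similarity of the normalized mean vectors of two sets of vectors.
    a, b = np.sum(A, axis=0), np.sum(B, axis=0)
    return float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))

def partitions(rest):
    # All disjunctive partitions of `rest` into two nonempty subsets.
    n = len(rest)
    for size in range(1, n // 2 + 1):
        for left in combinations(range(n), size):
            if 2 * size == n and 0 not in left:
                continue  # skip the mirror image of an equal-size split
            yield ([rest[i] for i in left],
                   [rest[i] for i in range(n) if i not in left])

def rank_and_centrality(v_index, vectors):
    # Procedure 2: sum the per-partition contributions computed as in Procedure 1.
    v = vectors[v_index]
    rest = [vectors[i] for i in range(len(vectors)) if i != v_index]
    rank, centrality = 0.0, 0.0
    for S1, S2 in partitions(rest):
        base = sim(S1, S2)                 # sim_i
        d1 = sim(S1 + [v], S2) - base      # sim_i^1 - sim_i
        d2 = sim(S1, S2 + [v]) - base      # sim_i^2 - sim_i
        centrality += d1 + d2              # Procedure 1, line 4
        rank += 0.5 * (np.sign(d1) + np.sign(d2))  # Procedure 1, line 5
    # rank is integer-valued except in the degenerate case of exact ties (sgn = 0).
    return rank, centrality

Applied to the four word-vectors of Example 1 (under the same RNC model), such a computation should reproduce the rank and centrality values collected in Table I.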

Table I. RANK AND CENTRALITY OF EACH WORD IN THE SYNSET; THE BELONGING OF A SYNONYM TO IntS IS SHOWN

Russian synset:    баталия    бой      битва    сражение
Transliteration:   batálija   boj      bítva    sražénije
Translation:       fight      combat   battle   engagement
Centrality:        -0.12      0.34     0.45     0.6
Rank:              -3         2        3        3
IntS:              —          —        +        +

Note. The precise translation of the synset's words is a rather difficult task; the translation serves only to illustrate the model. The rank and the centrality in this example were calculated on the basis of data from the Russian National Corpus.

According to Theorem 3.1 above, the rank of the synonyms belonging to the synset interior IntS equals

2^{|S|−2} − 1 = 2^{4−2} − 1 = 3.

Table I shows that the words "battle" and "engagement" have the largest rank (3) and centrality. Thus, Int(battle, engagement, fight, combat) = (battle, engagement). It means that this pair, battle and engagement, is the closest in sense to all words of the synset.

IV. EXPERIMENTS

In this paper we use the NN-models created by the authors of the project RusVectores [6], namely, the model constructed on the basis of the texts of the Russian National Corpus (RNC) and the model constructed on the basis of the texts of Russian news sites (the News corpus). These models are available on the site of the RusVectores project [6].

The authors of RusVectores, A. Kutuzov and E. Kuzmenko, pay attention to such peculiarities of the RNC as the manual selection of texts for corpus updating, the regulation of the proportions of texts of different genres, and the small size of the main corpus, approximately 107 million words (for comparison, the News corpus consists of 2.4 billion tokens).
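The sketch below shows how such pre-trained models are typically loaded with gensim for the calculations of this section. The file names are placeholders rather than the actual RusVectores distribution names, and the exact key format (for instance, whether a part-of-speech tag is appended to each word) depends on the particular model version.

# A sketch (not the authors' exact script): loading two word2vec models with gensim
# and querying the nearest neighbours of a word. File names are placeholders.
from gensim.models import KeyedVectors

rnc_model = KeyedVectors.load_word2vec_format("ruscorpora.model.bin", binary=True)
news_model = KeyedVectors.load_word2vec_format("news.model.bin", binary=True)

word = "бой"  # the key may need a POS suffix, depending on the model version
if word in rnc_model and word in news_model:
    print(rnc_model.most_similar(word, topn=10))
    print(news_model.most_similar(word, topn=10))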

Table II. EXAMPLES OF SYNSETS WITH EMPTY IntS. THE SYNSETS WERE TAKEN FROM THE RUSSIAN WIKTIONARY; THE WORDS IN THE SYNSETS ARE ORDERED BY RANK AND CENTRALITY. TWO RUSSIAN CORPORA FROM THE PROJECT RusVectores WERE USED TO CONSTRUCT THE NN-MODELS; THESE MODELS WERE USED TO FIND OUT IntS. HERE OutS = S \ IntS.

Wiktionary article              | Synset (from the article)                                                        | |S| | |IntS| | Corpus
Adverb прекрасно (beautifully)  | IntS = ∅, OutS = {wonderfully, remarkably, excellently, perfectly, beautifully} | 5   | 0      | RNC
прекрасно (beautifully)         | IntS = {perfectly, remarkably}, OutS = {wonderfully, beautifully, excellently}  | 5   | 2      | News
Adjective каменный (stony)      | IntS = ∅, OutS = {stony, heartless, hard, cruel, pitiless}                      | 5   | 0      | RNC
каменный (stony)                | IntS = {pitiless}, OutS = {stony, heartless, hard, cruel}                       | 5   | 1      | News

In [6] the notion of corpus representativeness is introduced. The sense of this notion is the ability of a corpus to reflect (to point at) those associations for a word which will be accepted by the majority of native speakers of the language. The associations generated by the NN-models according to the data from the RNC and the Web corpus are exactly what is used in this paper. The problem of corpora comparison is reduced to finding out the words whose meanings in the Web corpus essentially differ from those in the RNC. Let us take into account that for each word in a corpus, via an NN-model, we can obtain the list of the N nearest words (remember: a word is a vector). Then the result of the corpora comparison is as follows: for more than a half of all the words (the words common to the two corpora), not less than three of the ten nearest words were the same [6]. It means that the linguistic world images created on the basis of the RNC and the Internet texts have a lot in common. But the opposite estimation is also necessary: what is the measure of discrepancy between the NN-models?

Let us note that the notion of corpus representativeness acquires a new significance in view of the NN-models created on the basis of a corpus. An unbalanced sample results in an excess weight of some corpus topics and, as a consequence, in a less exact NN-model.

The following observation in [6] is significant for future experiments: for rare words, presented by a small set of contexts, the associative words generated by the NN-models will be inexact and doubtful.

We have conducted experiments for testing the proposed synset model. We have used two NN-models constructed by the authors of RusVectores on the basis of the RNC and the News corpus. While operating with the NN-models we used the program gensim (http://radimrehurek.com/gensim/), because it contains a realization of word2vec in the Python language. The gensim program is presented in [14]; the authors of RusVectores used gensim for constructing the NN-models as well [6]. We developed a number of scripts on the basis of gensim for operating with the NN-models, for the calculation of the rank and the centrality, and for the determination of IntS. These scripts are available online at https://github.com/componavt/piwidict/tree/master/lib_ext/gensim_wsd.

For several thousands of synsets extracted from the Russian Wiktionary, the rank, the centrality and IntS were calculated on the basis of the RNC and the News corpus. The experiments have shown that rare words in the corpora have, as a rule, an empty IntS. The same word in NN-models constructed from different corpora is represented by different vectors; different corpora and NN-models give different word vector representations. Thus, the synset interior IntS of the same word could be different (Table II).
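The corpora-comparison procedure described above can be sketched as follows. This is our illustrative reading of the heuristic from [6], not the original code; it assumes two gensim models such as those loaded earlier and a list of candidate words.

# A sketch of the corpora comparison from [6]: for every word present in both
# models, count how many of its ten nearest neighbours coincide.
def shared_neighbour_counts(model_a, model_b, words, topn=10):
    counts = {}
    for word in words:
        if word not in model_a or word not in model_b:
            continue
        near_a = {w for w, _ in model_a.most_similar(word, topn=topn)}
        near_b = {w for w, _ in model_b.most_similar(word, topn=topn)}
        counts[word] = len(near_a & near_b)
    return counts

# Words with a small count are those whose meanings differ most between the corpora.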
V. CONCLUSION

The world of modern linguistics may be represented, tentatively speaking, as the union of two domains attracting each other but nevertheless weakly bound nowadays: the traditional, more qualitative, linguistics and computational linguistics. The strict formalization of the base notions is necessary for the further development of linguistics as an exact science. The formal definition of such notions as word meaning, synonymy and others will permit one to rely, to the right degree, upon the methods and algorithms of computational linguistics (corpus linguistics, neural networks, etc.), discrete mathematics, and probability theory.

In this paper we present an approach to some formal characterizations (IntS, rank, centrality) of a notion significant for machine-readable dictionaries and thesauri: the set of synonyms, the synset. The proposed formal tool permits one to analyse synsets, to compare them, and to determine the significance of the words in a synset.

In future investigations we will apply the developed tool to the problem of word sense disambiguation (the WSD problem).

ACKNOWLEDGMENT

This paper is supported by grant N 18-012-00117 from the Russian Foundation for Basic Research.

REFERENCES

[1] T. V. Kaushinis, A. N. Kirillov, N. I. Korzhitsky, A. A. Krizhanovsky, A. V. Pilinovich, I. A. Sikhonina, A. M. Spirkova, V. G. Starkova, T. V. Stepkina, S. S. Tkach, Ju. V. Chirkova, A. L. Chuharev, D. S. Shorets, D. Yu. Yankevich, E. A. Yaryshkina, "A review of word-sense disambiguation methods and algorithms: Introduction", Transactions of Karelian Research Centre of Russian Academy of Science, series Mathematical Modeling and Information Technologies, 2015, pp. 69–98, doi: 10.17076/mat135, Web: http://journals.krc.karelia.ru/index.php/mathem/article/view/135.
[2] Y. Goldberg, O. Levy, "word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method", arXiv preprint arXiv:1402.3722, 2014, pp. 1–5.
[3] R. L. Graham, D. E. Knuth and O. Patashnik, Concrete Mathematics. Addison–Wesley, 1994.
[4] E. H. Huang, R. Socher, C. D. Manning, A. Y. Ng, "Improving word representations via global context and multiple word prototypes", in Proc. of the ACL '12, Jeju Island, Korea, 2012, pp. 873–882, Web: http://dl.acm.org/citation.cfm?id=2390524.2390645.
[5] A. A. Krizhanovsky, A. V. Smirnov, "An approach to automated construction of a general-purpose lexical ontology based on Wiktionary", Journal of Computer and Systems Sciences International, 2013, N 2, pp. 215–225, doi: 10.1134/S1064230713020068, Web: http://scipeople.com/publication/113533/.
[6] A. Kutuzov, E. Kuzmenko, "Comparing neural lexical models of a classic national corpus and a web corpus: the case for Russian", Computational Linguistics and Intelligent Text Processing, 2015, pp. 47–58, doi: 10.1007/978-3-319-18111-0_4, Web: https://www.academia.edu/11754162/Comparing_neural_lexical_models_of_a_classic_national_corpus_and_a_web_corpus_the_case_for_Russian.
[7] O. Levy, Y. Goldberg, I. Dagan, "Improving distributional similarity with lessons learned from word embeddings", Transactions of the Association for Computational Linguistics, 2015, vol. 3, pp. 211–225.
[8] S. Mahadevan, S. Chandar, "Reasoning about linguistic regularities in word embeddings using matrix manifolds", arXiv preprint arXiv:1507.07636, 2015, pp. 1–9.
[9] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, S. Khudanpur, "Extensions of recurrent neural network language model", in Proc. of the 2011 IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2011, doi: 10.1109/icassp.2011.5947611.
[10] T. Mikolov, G. Zweig, "Context dependent recurrent neural network language model", in Proc. of the 2012 IEEE Spoken Language Technology Workshop (SLT), 2012, doi: 10.1109/slt.2012.6424228.
[11] T. Mikolov, K. Chen, G. Corrado, J. Dean, "Efficient estimation of word representations in vector space", arXiv preprint arXiv:1301.3781, 2013, Web: http://arxiv.org/abs/1301.3781.
[12] Princeton University website, What is WordNet? Web: http://wordnet.princeton.edu.
[13] G. Sidorov, A. Gelbukh, H. Gómez-Adorno, D. Pinto, "Soft similarity and soft cosine measure: Similarity of features in vector space model", Computación y Sistemas, 2014, vol. 18, N 3, pp. 491–504, Web: http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf.
[14] R. Řehůřek, P. Sojka, "Software framework for topic modelling with large corpora", in Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta: University of Malta, 2010, pp. 45–50, Web: http://is.muni.cz/publication/884893/en.