
ENTROPY ANALYSIS OF N-GRAMS AND ESTIMATION OF THE NUMBER OF MEANINGFUL LANGUAGE TEXTS. CYBER SECURITY APPLICATIONS

February 11, 2021

Anastasia Malashina, Computer Security Department, HSE University, Moscow, Russia, [email protected]

Abstract. We estimate the n-gram entropies of English-language texts, taking punctuation into account, and find a heuristic method for estimating the marginal entropy. We propose a method for evaluating the coverage of empirically generated dictionaries and an approach to address the disadvantage of low coverage. In addition, we assess the probability of obtaining a meaningful text by directly iterating through all possible n-grams of the alphabet and conclude that this is only possible for very short text segments.

Keywords: n-gram entropy, n-gram dictionaries, coverage, meaningful texts, information theory.

1 Introduction

Entropy is the basis of the information-theoretic approach to information security. It is a measure of uncertainty. Data with maximum entropy are completely random, and no patterns can be established in them. For low-entropy data, we are able to predict the subsequently generated values. The level of disorder in the data can be quantified by the entropy of the system: the higher the entropy, the greater the uncertainty and unpredictability, and the more chaotic the system.

Text is also a system that has entropy. Moreover, natural language texts have entropy significantly lower than the maximum entropy of the alphabet. In turn, a random, non-meaningful set of characters has the maximum possible entropy in a given alphabet. The entropy value can therefore be used for automatic recognition of text meaningfulness when searching through decryption options or when organizing a dictionary attack [1].

In addition, entropy can be used in the keyless recovery of encrypted information. If we divide an encrypted message into discrete segments of a fixed length, the entropy value determines how many meaningful recovery options there are for each such segment of the message. Since the number of meaningful texts in a language is significantly smaller than the number of arbitrary (random) ones, this approach critically reduces the complexity of decryption compared to a brute-force attack [2].

There are various methods for determining the entropy of a text or of its individual segments, called n-grams. The most popular of them is the Shannon method, based on representing the text as a Markov chain of depth n [3]. In this paper, we propose a dictionary-based method for determining the entropy of n-grams, combining Kolmogorov's combinatorial approach and a corollary of Shannon's second theorem, which gives an asymptotic estimate of the number of meaningful texts. Moreover, we propose a theoretical approach to estimating the coverage of the created n-gram dictionaries and a method for correcting their volume.

We explore texts in English collected from various web pages based on the corpus [4], considering an extended alphabet that includes the simplest punctuation marks. The aim of the study is to evaluate the entropy of short-length texts and to extend the results obtained to long texts. Based on the entropy data, we theoretically estimate the number of long meaningful texts in a language, for which an empirical estimate is impossible.

The structure of the paper is as follows: in Section 2, we describe the corpus of analyzed texts and its preprocessing; in Section 3, the methodology used for n-gram dictionaries and coverage, as well as the estimation of n-gram entropies; Section 4 presents and discusses the results of our analysis; Section 5 summarizes the main conclusions of this article.

2 Corpus Description

The corpus we analyze is based on text samples from the iWeb corpus of English [4] and contains about 100 million characters collected from web pages.

The corpus we created must meet two main criteria: completeness and representativeness. Completeness is determined by the coverage of the corpus. Optimization of the coverage depends on the problems considered [5]. First, the coverage depends on the volume of the text corpus used for compiling the dictionaries. As the corpus volume increases, this dependence becomes less pronounced, so the data can be extrapolated to larger dictionary volumes [6]. For example, for English, the growth of the volume slows down significantly when the corpus size is in the range of 30 to 50 million. Second, the optimal size of the corpus depends on the sources and novelty of the data [7]. In general, a corpus is considered saturated when the sharp growth of new words stops as the corpus volume increases [5]. There is no metadata in the corpus, since it is not essential for its further use.

Based on the text corpus collected and in accordance with the n-gram language model, we create dictionaries for further research.

2.1 Data preprocessing

To increase the coverage and relevance of the n-gram dictionaries, we restrict the size of the alphabet. Therefore, the corpus created goes through a filtering process. In addition, we delete errors and typos from the text to minimize the probability of type II errors: we assume that only meaningful texts are represented in the n-gram vocabularies.

In general, the normalization process consists of the following steps (a sketch is shown below):
1) deleting HTML tags,
2) recoding,
3) filtering the text (deleting all characters except "a-z", ".", ",", " ", and casting to lowercase),
4) removing double spaces, repeated dots and commas, and spaces before dots and commas,
5) deleting errors and typos.

Thus, the alphabet power of our corpus is 29 characters. We consider it a simple extension of the Latin alphabet that includes punctuation.
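To make these steps concrete, the following is a minimal normalization sketch in Python covering steps 1, 3, and 4 (recoding and typo removal depend on the source data and tooling and are omitted); the function name and the regular expressions are illustrative, not taken from the paper.

import re

ALPHABET = set("abcdefghijklmnopqrstuvwxyz., ")  # 29 characters: a-z, '.', ',', space

def normalize(raw_html: str) -> str:
    """Rough normalization following steps 1, 3, 4 of Section 2.1 (illustrative)."""
    text = re.sub(r"<[^>]+>", " ", raw_html)      # 1) delete HTML tags
    text = text.lower()                           # 3) cast to lowercase
    text = re.sub(r"[^a-z., ]", " ", text)        # 3) keep only the 29-character alphabet
    text = re.sub(r"\.{2,}", ".", text)           # 4) collapse repeated dots
    text = re.sub(r",{2,}", ",", text)            # 4) collapse repeated commas
    text = re.sub(r"\s+([.,])", r"\1", text)      # 4) remove spaces before dots/commas
    text = re.sub(r" {2,}", " ", text).strip()    # 4) collapse double spaces
    return text

print(normalize("<p>Hello,  world  This is   a TEST.</p>"))  # -> "hello, world this is a test."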

3 Methodology

3.1 N-gram vocabularies

Within the n-gram model of the language, we generate the dictionaries. The dictionary units are n-grams. We consider an n-gram to be a sequence of n characters. N-grams are selected from the text with chaining: for the next n-gram, we shift to the right by one character. An example of this process is shown in Figure 1.

Fig. 1 The process of selection with chaining.

The process of dictionary creation consists of extracting all n-grams from the corpus into a list, deleting duplicate n-grams, and sorting. We fix the total number of n-grams in the corpus and the vocabulary volume, that is, the number of non-repeating n-grams in the corpus. Besides, we find the number of unique n-grams in the dictionary, i.e. units of vocabulary occurring only once.

We assume that the optimal length of a text segment to examine is between 10 and 25 characters. Therefore, we generate n-gram dictionaries for 10, 15, 20, and 25 characters. We process about 100 million n-grams. The different lengths of n-grams help us study how the characteristics of the dictionaries change with increasing length of the text segment.

The compiled n-gram dictionaries form the basis of our methodology for calculating the entropy values. We assume that the created dictionaries are a tool for automatically distinguishing meaningful and random n-grams of a language.
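The extraction-with-chaining and counting procedure described above can be sketched in a few lines of Python: a window of length n slides over the normalized text one character at a time, and from the resulting counts one reads off the total number of n-grams, the vocabulary volume K, and the number k of n-grams occurring only once. The names below are illustrative.

from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count n-grams selected with chaining: shift right by one character each time."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def dictionary_stats(text: str, n: int):
    counts = ngram_counts(text, n)
    total = sum(counts.values())                   # total number of n-grams in the corpus
    K = len(counts)                                # vocabulary volume (distinct n-grams)
    k = sum(1 for c in counts.values() if c == 1)  # n-grams occurring only once
    return total, K, k

# Toy example; in the paper the input is the normalized 100-million-character corpus.
print(dictionary_stats("to be, or not to be.", 10))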

3.2 Coverage and vocabulary resizing

Since the dictionaries are compiled on a corpus of limited length, their coverage is incomplete. This means that not all meaningful n-grams of the language are included in a dictionary. That is, type I errors are possible, when a meaningful n-gram that is not present in the dictionary is discarded as random. This situation is possible, for example, when organizing a dictionary attack. Therefore, it is necessary to evaluate the coverage of the dictionaries created, that is, to estimate what proportion of the meaningful n-grams of the language our dictionaries cover.

In this study, we propose a theoretical estimate of the coverage that is independent of empirical tests. The theoretical assessment is based on the number of unique n-grams in the dictionary, i.e. units of vocabulary occurring only once:

coverage = (1 − k/K) · 100%,   (1)

where K is the initial dictionary volume and k is the number of n-grams that occur once.

If the coverage of the initial dictionary is low, then the empirical estimates derived from it may not be accurate enough. Therefore, it is necessary to recalculate the volume of the empirical dictionary to bring it closer to the real one. We propose the following approach to resizing the dictionaries.

Let us obtain a dictionary of K units with k elements occurring once. Then 1 − k/K is the fraction of repeated elements in the dictionary, and (1 − k/K) · K is the number of repeated elements in the initial dictionary. Since it is compiled on a corpus of limited volume, this dictionary does not have full coverage. Obviously, (1 − k/K) · K < K. It is known that, up to a certain point, the number of new n-grams in the dictionary grows at a linear rate [7]. To get a dictionary consisting of K repeated elements, we need to increase its volume by a factor of 1/(1 − k/K). Thus, a new dictionary with volume

K̃ = K / (1 − k/K)   (2)

contains about K repeated elements. The new volume of the dictionary compensates for the lack of coverage of the original one.

Thus, K̃ is a theoretical estimate of the n-gram dictionary volume, obtained on the basis of an empirical dictionary. The theoretical n-gram vocabulary function is

K̃(n) = 2^(H·n).   (3)

This estimation [8] is based on the language entropy H.
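As a sketch, the coverage of Eq. (1) and the resized volume of Eq. (2) follow directly from K and k. The empirical volume K is not reported in the paper, so the example value below is backed out of the 10-gram row of Table 1 (coverage 51.35%, theoretical volume 21912941) purely for illustration.

def coverage_and_resize(K: int, k: int):
    # Eq. (1): share of the language's meaningful n-grams covered, in percent.
    coverage = (1 - k / K) * 100
    # Eq. (2): theoretically resized dictionary volume compensating for low coverage.
    K_resized = K / (1 - k / K)
    return coverage, K_resized

# Illustrative values implied by the 10-gram row of Table 1, not reported directly.
K_10 = 11_252_295                       # hypothetical empirical 10-gram dictionary volume
k_10 = round(K_10 * (1 - 0.5135))       # n-grams occurring once, implied by 51.35% coverage
print(coverage_and_resize(K_10, k_10))  # approx. (51.35, 21.9 million)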

3.3 N-gram entropy

We assume that the volume of the dictionary obtained is the number of meaningful n-grams in the language. This means that we consider all out-of-vocabulary n-grams to be random texts. Based on this assumption, we can estimate the entropy of n-grams.

Within the n-gram language model, the text is treated as the outcome of independent trials whose results are the n-grams of the corresponding natural language. Then the entropy per character of the text is estimated as Ĥ_n/n, where Ĥ_n is the entropy of a random source whose outcomes are n-grams. The limit of this ratio is the language entropy [3].

The entropy of a random source may be calculated in the classical way [3]:

Ĥ_n = − Σ p(a_i1, ..., a_in) · log2 p(a_i1, ..., a_in),   (4)

where the sum is over all n-grams (a_i1, ..., a_in) and p(a_i1, ..., a_in) is the probability of the i-th n-gram.

To avoid calculating n-gram probabilities, we propose a simple empirical way of calculating entropy based on the dictionary volume. This idea is based on Kolmogorov's combinatorial method and Shannon's second theorem [9].

The number of all distinct n-grams over an alphabet of power A, A^n = 2^(n·log2 A), is greater than the number K̃ of meaningful n-grams in the same language. Hence there is a value Ĥ_n such that K̃ = 2^(Ĥ_n), where Ĥ_n < n·log2 A [3]. Let the dictionary volume K̃ be the number of meaningful n-grams in the language. Then Ĥ_n = log2 K̃. The value Ĥ_n is the entropy of n-grams (bits). Therefore, the larger the dictionary coverage, the more accurate the entropy estimate.

The entropy of n-grams per character (bits/symbol) is

H_n = (log2 K̃) / n.   (5)
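Given a (resized) dictionary volume, the entropy estimate of Eq. (5) is a one-liner; for comparison, the classical plug-in estimate of Eq. (4) from raw counts is also sketched. Function names are illustrative.

import math
from collections import Counter

def entropy_per_char_from_volume(K_tilde: float, n: int) -> float:
    # Eq. (5): H_n = log2(K_tilde) / n, in bits per character.
    return math.log2(K_tilde) / n

def entropy_per_char_from_counts(counts: Counter, n: int) -> float:
    # Eq. (4) divided by n: classical plug-in estimate from n-gram probabilities.
    total = sum(counts.values())
    H_hat = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return H_hat / n

# Using the theoretical 10-gram volume from Table 1:
print(entropy_per_char_from_volume(21_912_941, 10))  # approx. 2.44 bits per character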

4 Results and Discussion

In Figure 2, we present the results of the estimation of the n-gram vocabularies for short-length texts. As stated in Section 3.1, the values shown are the empirically obtained numbers of total n-grams and of n-grams occurring only once.

Fig. 2 N-gram vocabularies for short-length texts.

One can notice that the number of n-grams that occur only once increases significantly with the length of the text segment. Regarding the total number of n-grams, we have found that the growth rate of the dictionary volume is less than linear.

Based on the number of n-grams occurring only once, we can estimate the coverage of the empirically obtained dictionaries. Since the coverage of the source dictionaries is insufficient, it is necessary to recalculate the volumes of the n-gram dictionaries using the methodology proposed in Section 3.2. The coverage values and the volumes of the new dictionaries are shown in Table 1.

Table 1 Initial coverage and new size of vocabulary

Text length   Initial coverage, %   Theoretical vocabulary
10            51.35                 21912941
15            32.33                 148511061
20            20.84                 385966128
25            15.59                 605647799

As expected, the new dictionaries correspond to a more complete coverage and are used in the subsequent stages of the study.

To investigate how the volume of the dictionaries changes depending on the corpus size, we have found an interpolation function. In Figure 3, we show this interpolation model for 10-grams. We can see that the growth rate of the dictionaries is below linear and is closest to a square root function.

Fig. 3 10-gram dictionary interpolation function.

Equation (6) gives the interpolation function for 10-grams:

−1.14697 · 10^6 + 1230.21 · √x,   (6)

where x is the corpus size in characters. For other values of n, the form of the interpolation function remains the same, with a slight change in the coefficients. Using this model, we can predict subsequent values: for example, for a corpus of 300 million characters we expect a dictionary of 20 million 10-grams, and for a corpus of 700 million characters we could expect a dictionary of 30 million 10-grams.
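For reproducibility, a square-root model of this form can be fitted by linear least squares in the basis {1, √x}. The measurement pairs below are placeholders (the underlying points are shown only graphically in Figure 3), so the fitted coefficients will differ somewhat from those in Eq. (6).

import numpy as np

# Hypothetical (corpus size in characters, distinct 10-grams) measurements; placeholders only.
x = np.array([20e6, 40e6, 60e6, 80e6, 100e6])
y = np.array([4.4e6, 6.6e6, 8.4e6, 9.9e6, 11.2e6])

# Fit y = a + b*sqrt(x) by linear least squares in the basis [1, sqrt(x)].
A = np.column_stack([np.ones_like(x), np.sqrt(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted model: {a:.4g} + {b:.4g} * sqrt(x)")

# Extrapolate to larger corpora, as in the 300M / 700M character examples above.
for corpus_size in (300e6, 700e6):
    print(corpus_size, a + b * np.sqrt(corpus_size))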

4.1 Distribution of n-gram entropy

It is well known that the amount of information transmitted by a single n-gram increases with the length of the segment. To determine the average amount of information per character, i.e. the specific information content of the source, we need to divide this amount by n. As n grows without bound, the approximate equality turns into an exact one, and the result is an asymptotic relation.

Using the approach presented in Section 3.3, we have determined the entropy of short-length texts based on the volume of the original empirical dictionary K and of the theoretical one K̃. In Figure 4, we can see that the specific entropy of the source (text) decreases with increasing length of the n-gram.

Fig. 4 Entropy per character for short-length texts.

The entropy per symbol means that it takes H_n bits of information to determine the (n+1)-th character of the text. The more information we know, the less uncertainty there is about the next character in the text. This fact explains the decreasing nature of the specific entropy function.

However, as the text length increases, the rate of the entropy decrease slows down. For example, the difference between H_10 and H_15 is only about 0.63 bits. It means that if we know the first 10 characters of a substring of length 15, there is little uncertainty about the remaining 5 characters.
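The quoted 0.63-bit gap can be checked from Table 1 via Eq. (5), and the same quantities give a rough measure of how little uncertainty is left in the 5-character tail. Expressing that tail uncertainty as a difference of total n-gram entropies is our illustration of the statement above rather than a formula from the paper.

import math

# Per-character entropies from the theoretical volumes in Table 1, via Eq. (5).
H_10 = math.log2(21_912_941) / 10     # approx. 2.44 bits per character
H_15 = math.log2(148_511_061) / 15    # approx. 1.81 bits per character
print(f"H_10 - H_15 = {H_10 - H_15:.2f} bits")  # approx. 0.63, as quoted above

# Under the model, the uncertainty of characters 11..15 given the first 10 characters
# is the difference of total n-gram entropies: 15*H_15 - 10*H_10.
tail = 15 * H_15 - 10 * H_10
print(f"{tail:.2f} bits for the remaining 5 characters "
      f"({tail / 5:.2f} bits/char, vs. log2(29) = {math.log2(29):.2f} for a random character)")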

It is important to note that we have considered an extended alphabet, so the resulting n-gram entropy values differ from the known values for English.

With the growth of n, the value of H_n decreases up to some n and with further growth almost does not change; that is, it reaches a certain limit, called the entropy of the language. However, our n-gram model is based on a finite corpus of text samples, so estimating the entropy rate for large values of n gives implausibly low information rates. In other words, as the value of n increases, the experimental estimates of entropy per symbol tend to 0. Indeed, as the model order increases, the number of n-gram samples decreases, so that for very large values of n, knowledge of the first n−1 letters of the text uniquely identifies the text in question, that is, the n-th letter is pre-determined.

Extrapolating these results to large values of n is difficult, because the shape of this sequence of values is generally unknown, except that it is decreasing and remains positive. To obtain the limiting entropy from this set of measurements, we construct a model of sequential estimates.

We have assumed that the sequence of entropy values obeys a linear recurrence relation

F_n − F_{n+1} = k · (F_{n+1} − F_{n+2}),   (7)

with initial conditions F_0 = H_10 and F_1 = H_15, where F_n = H_{5n+10}. The value of k that yields the best approximation is k = 0.62.

In Figure 5, we present the extrapolation results of the entropy per character for the initial and theoretical vocabularies K and K̃.

Fig. 5 Extrapolated entropy rate values.

This extrapolation model is heuristic, but it is sufficient to solve the problems set in this paper. As we can see in the graph, the limiting entropy rate is 0.8 bits per character for the theoretical vocabulary. Therefore, we can estimate the entropy of a long text by this limiting value.

4.2 Number of meaningful texts

Using the entropy values obtained and the extrapolation model constructed, we have estimated the number of meaningful texts among all texts of fixed length. The results are shown in Table 2.
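The figures in Table 2 follow from the entropy model: the number of meaningful texts of length n is about 2^(H_n · n), and the relative part is that count divided by the 29^n possible strings over the extended alphabet. The sketch below uses the limiting rate of 0.8 bits per character, which reproduces the longer rows of Table 2; the shortest rows correspond to larger, length-specific extrapolated values of H_n and are not reproduced by this constant-rate sketch.

import math

A = 29          # power of the extended alphabet
H_LIMIT = 0.8   # limiting entropy rate, bits per character (theoretical vocabulary)

def meaningful_texts(n: int, H: float = H_LIMIT):
    """Number of meaningful n-character texts, about 2^(H*n), and their share of all A^n
    strings, returned as (mantissa, decimal exponent) pairs to avoid overflow/underflow."""
    def split(log10_value: float):
        exp = math.floor(log10_value)
        return 10 ** (log10_value - exp), exp
    log10_count = H * n * math.log10(2)
    log10_share = log10_count - n * math.log10(A)
    return split(log10_count), split(log10_share)

for n in (100, 300, 1000):
    (m1, e1), (m2, e2) = meaningful_texts(n)
    print(f"n={n}: about {m1:.1f}e{e1} meaningful texts, relative part {m2:.1f}e{e2}")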

Table 2 Number of meaningful texts

Length of text segment   Number of meaningful texts   Relative part
30                       1.8 · 10^9                   2.4 · 10^-35
50                       3 · 10^12                    2.3 · 10^-61
100                      1.2 · 10^24                  0.7 · 10^-122
300                      1.8 · 10^72                  0.3 · 10^-366
500                      2.6 · 10^120                 1.7 · 10^-611
1000                     6.7 · 10^240                 2.7 · 10^-1222

Meaningful texts are n-grams that exist in the language considered; n-grams that are not meaningful are random sets of n characters.

The last finding is the proportion of meaningful n-grams among the total number of n-grams. In other words, it is the probability of finding a meaningful text by drawing a random sample from among all possible n-grams. As expected, this probability decreases critically with increasing text length. This confirms the previously stated hypothesis that, to recover individual parts of an encrypted text, it is worth considering n-grams of short length.

5 Conclusion

In this paper, we have estimated the n-gram entropies of natural language texts and examined the number of meaningful texts in English. Most previous studies of n-gram entropy did not take punctuation into account; the values obtained in this paper close this gap. We have found that the empirical method of generating dictionaries can lead to significant type I errors in estimating the number of meaningful n-grams due to low coverage. We have addressed this drawback by offering a method for refining the theoretical dictionary volume.

The entropy of the text per character decreases with the growth of the n-gram length. This can be explained by the fact that, as the length of the known text increases, the uncertainty of the next character decreases. However, starting from a certain n, the entropy values almost do not change, reaching a plateau called the entropy of the language. By extrapolating the data with a linear recurrent sequence, we have heuristically determined the limiting entropy of our corpus, which is 0.8 bits per character.

The limiting value of the entropy allowed us to estimate the number of meaningful long-length n-grams, which is almost impossible to do empirically. The probability of finding a meaningful text among all possible n-grams for large n is catastrophically small. This result confirms our assumption that it is advisable to use short n-grams to recover information using the information-theoretic approach.

References

1. A. Jaglom, I. Jaglom, Moscow: Science (in Russ.) (1973)
2. D. Florencio, C. Herley, in Proceedings of the 16th International Conference on World Wide Web (2007), pp. 657-666
3. C.E. Shannon, The Bell System Technical Journal 27(3), 379 (1948)
4. M. Davies, iWeb: The 14 Billion Web Corpus (Brigham Young University, 2018)
5. R. Rosenfeld, in Fourth European Conference on Speech Communication and Technology (1995)
6. J. Bellegarda, (2001)
7. L. Chase, R. Rosenfeld, W. Ward, in Third International Conference on Spoken Language Processing (1994)
8. J.L. Massey, in Proceedings of 1994 IEEE International Symposium on Information Theory (IEEE, 1994), p. 204
9. D. Sayre, Acta Crystallographica 5(6), 843 (1952)