Novel Database Design for Extreme Scale Corpus Analysis

A thesis submitted to Lancaster University

for the degree of Ph.D. in Computer Science

Matthew Parry Coole

January 2021

Abstract

This thesis presents the patterns and methods uncovered in the development of a new scalable corpus database management system, LexiDB, which can handle the ever-growing size of modern corpus datasets. Initially, an exploration of existing corpus data systems is conducted which examines their usage in corpus linguistics as well as their underlying architectures. From this survey, it is identified that existing systems are designed primarily to be vertically scalable (i.e. scalable through the use of bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and the information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture, several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to querying extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture are evaluated, and it is demonstrated that the architecture is comparably scalable to two modern No-SQL database management systems and outperforms existing corpus data systems in token-level pattern querying whilst still supporting character-level pattern matching.

Declaration

I declare that the work presented in this thesis is my own work. The material has not been submitted, in whole or in part, for a degree at any other university.

Matthew Parry Coole

List of Papers

The following papers have been published during the period of study. Extracts from these papers make up elements (in part) of the thesis chapters noted below.

• Coole, Matthew, Paul Rayson, and John Mariani. “Scaling Out for Extreme Scale Corpus Data.” 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2015. Chapter 5.

• Coole, Matthew, Paul Rayson, and John Mariani. “LexiDB: A Scalable Corpus Database Management System.” 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016. Chapters 3 and 5.

• Coole, Matthew, Paul Rayson, and John Mariani. “LexiDB: Patterns & Methods for Corpus Linguistic Database Management.” Proceedings of the 12th Language Resources and Evaluation Conference. 2020. Chapters 4 and 5.

• Coole, Matthew, Paul Rayson, and John Mariani. “Unfinished Business: Construction and Maintenance of a Semantically Tagged Historical Parliamentary Corpus, UK Hansard From 1803 to the Present Day.” Proceedings of the Second ParlaCLARIN Workshop. 2020. Chapter 5.

Contents

List of Figures

1 Introduction
   1.1 Overview
   1.2 Area of Study
   1.3 Motivation
   1.4 Research Questions
   1.5 Thesis Structure

2 Literature Review
   2.1 Background in Corpus Linguistics
       2.1.1 Corpus Queries
   2.2 Existing Corpus Data Systems and their Query Languages
       2.2.1 Overview of Corpus Data Systems
       2.2.2 Corpus Data Systems Architecture
   2.3 Proposed Systems & Related Work
   2.4 Database Management Systems and IR systems
       2.4.1 Indexing
       2.4.2 Querying Corpora with Database Management Systems and their Query Languages
       2.4.3 Distribution Methodologies & No-SQL Architectures
   2.5 Summary

3 Architecture
   3.1 Architecture Overview
   3.2 Stores
       3.2.1 Numeric Data Representation
       3.2.2 Zipfian Columns
       3.2.3 Continuous Columns
   3.3 Indexing
   3.4 Distribution
       3.4.1 Distributed Querying
       3.4.2 Redundancy
   3.5 Summary

4 Querying
   4.1 Introduction
   4.2 Linguistic Query Types
   4.3 Overview of Query Syntax and Capabilities
   4.4 Resolving QBE objects
       4.4.1 Preamble
       4.4.2 Algorithm
       4.4.3 Example
   4.5 Token stream regex
       4.5.1 Preamble
       4.5.2 Algorithm
       4.5.3 Worked example
       4.5.4 Limitations
   4.6 Resolving Query Types
   4.7 Asynchronous Querying
   4.8 Sorting
   4.9 Summary

5 Evaluation
   5.1 Overview
   5.2 Quantitative Evaluation
       5.2.1 Experiment 1: Scalability of existing DBMSs
       5.2.2 Experiment 2: Scalability of LexiDB
       5.2.3 Experiment 3: Comparative Evaluation
   5.3 Hansard Case Study
       5.3.1 Building the Tool
   5.4 Summary

6 Conclusions
   6.1 Summary of Thesis
   6.2 Proposed Design
   6.3 Limitations and Future Work
   6.4 Research Questions and Contributions

References

Appendices

A Experimental Results 1

B Experimental Results 2

C Experimental Results 3

D Sample Focus Group Questions
   D.1 Using the web interface
   D.2 Comparisons to other systems
   D.3 Future developments

List of Figures

2.1 Growth of corpora over time

3.1 Architecture diagram
3.2 Inserting Documents into Data Blocks

4.1 DFA representing ^[ab]a.*$
4.2 Radix tree representing D

5.1 MongoDB Query Times (High frequency words)
5.2 MongoDB Avg. Query Times (medium & low frequency words)
5.3 Cassandra Query Times (High frequency words)
5.4 Cassandra Avg. Query Times (medium & low frequency words)
5.5 Insertion and Indexing
5.6 Concordance Lines
5.7 Collocations
5.8 N-grams
5.9 Frequency List
5.10 Simple querying for Part-of-Speech (c5 tagset)
5.11 Common POS bigram search
5.12 Common POS bigram search
5.13 Annotation Processing pipeline
5.14 Hansard UI search bar
5.15 Hansard UI Concordance Results
5.16 Hansard UI Histogram Visualisation
5.17 Hansard UI Query Options
5.18 Hansard UI Collocation Word Cloud

Chapter 1

Introduction

1.1 Overview

Corpora are samples of real-world text designed to act as a representation of a wider body of language, and they are growing rapidly: from samples of millions of words (Brown) to hundreds of millions (BNC) by the 1990s. The magnitude of modern corpora is now measured in billions of words (Hansard, EEBO). This trend is only accelerating, with recent linguistic studies making more use of online data streams from Twitter, Google Books and other data analytic services. Compounding this is the increased use of multiple layers of annotation through part-of-speech tagging, semantic analysis, sentiment marking and dependency parsing, all increasing the dimensionality of ever-growing corpus datasets. Traditional corpus tools (AntConc, CWB) were not built to handle this scale and as such are often unable to fulfil specific use cases, leading to many bespoke solutions being developed for individual linguistic research projects. Many larger corpora are now hosted online in systems such as BYU, CQPWeb and SketchEngine. However, if the system does not have the functionality to perform the querying or analysis a linguist needs, they are often left with nowhere to go.

The areas of Database Systems and Information Retrieval Systems have also faced the problem encountered in corpus linguistics in recent years: an explosion in data scale. In those areas, this has led to the development of scalable approaches and solutions for other “big data” problems. These techniques apply to the problems in modern corpus linguistics. However, while such database techniques can handle language data, they may not be well suited to its Zipfian distribution, so adjustments to the techniques are investigated.

This thesis describes a set of approaches, methods and practices that apply to database management systems (DBMSs) tailored to fulfil corpus linguistic requirements. These approaches are realised in a corpus DBMS, LexiDB, that is capable of fulfilling not only the traditional needs of corpus linguists but also the modern needs for scalability and data management that have become ever more prevalent in the field in recent years.

The remaining sections of this chapter outline the area of corpus linguistics, expand on the motivations for this project, formalise the research questions and objectives, and finally outline the thesis structure.

1.2 Area of Study

Corpora are built as representative samples to allow language analysis to be performed reliably without the need to examine every text available in the analysis domain. Corpora can sometimes be general-purpose, meant to represent a language as a whole (BNC), or can be built to examine a specific area such as political discourse (Hansard).

Typically corpus linguists apply a standard set of approaches to analysing corpora (concordances, collocation, frequency lists etc.). As such, various software tools have been developed over the years to allow linguists to do just that. These tools often serve as the basis for analyses that build into wider linguistic research questions.


1.3 Motivation

For some corpus analyses, the standard, widely available tools may not be able to meet the functional or non-functional requirements of the project. There can be many reasons for this: perhaps the corpus data cannot be easily converted to a format for the tool, the type of analysis is different from those the tools typically provide, or the tool is not capable of supporting the size of the corpora being studied. This can often lead (particularly in large, well-funded projects) to specialist tools to manage, query and analyse corpora being built solely for use with a single corpus or a single research project.

Beyond this, multiple tools are often required for the creation and use of a corpus. Typically, query tools such as CQP or SketchEngine rely on the corpus being static and, once indexed, unlikely to change or be extended. This can lead to projects needing to utilise multiple tools for both the corpus creation process and the querying and analysis routine. Dynamic corpora that change as the project grows are harder to manage using the classical tools within a corpus linguistic workflow. A secondary consequence of this is that linguists have less chance to explore the data as corpora are built, eliminating or curtailing the ability to perform a more data-driven analysis of the corpora during compilation. Being able to use and explore dynamic corpora, whilst they are being built or compiled, would be a great advantage to both corpus builders and regular linguists.

Many existing corpus tools are designed to be used only locally and have no mechanisms by which they can scale to handle larger corpora; in particular, the ability to scale out across multiple machines is not explored. Whilst for small specialist corpora this may be sufficient, linguists now regularly wish to analyse corpora consisting of hundreds of millions or even billions of words.

The “big data” trend in recent years has led to many developments in data systems. Distributed systems and techniques have become ubiquitous when dealing with large datasets and have become accepted approaches for dealing with this issue of scale. Many big data analytics systems are built as Software as a Service (SaaS), which has often led to individual researchers relying on classical DBMSs (which have become distributed themselves) for corpus research. This can lead to researchers developing many bespoke corpus systems for individual projects based on existing classical DBMSs. Many of these databases are not well suited to corpus requirements, both in the form in which they handle and store data and in the mechanisms they allow for querying of the data. Clearly, a gap is present between highly specialised/bespoke tools and the less functionally capable general-purpose databases that they are often built on.

1.4 Research Questions

The main research question investigated in this thesis is whether information retrieval techniques and modern database approaches can be married to corpus linguistics to handle the ever-changing array of corpus data requirements. This research question can be broken down into three sub-questions:

1. How can modern database and IR techniques be used to build, store and query corpus data?

(a) How can corpus techniques and requirements be fulfilled by existing IR and database retrieval mechanisms?

(b) How effective are modern distribution methods at scaling to meet the needs of extreme-scale corpora?

(c) Are there any un-tapped methods that can be developed into novel solutions in this problem domain?

2. How can any system developed be evaluated quantitatively against existing systems, both database management systems and corpus data management systems?

3. What qualitative evaluation can demonstrate the usage of the system and its usefulness in real-world applications?

1.5 Thesis Structure

The chapters in this thesis are as follows:

Chapter two provides a detailed look at the corpus linguistic data requirements. It examines existing corpus tools and the techniques that they employ as well as analysing their shortcomings. The chapter reviews the modern literature around storage and indexing within information retrieval and DBMSs, with particular focus on text retrieval methods and how they can be applied to the problems faced in traditional corpus software.

Chapter three describes the architecture designed and implemented for LexiDB. The approaches drawn from the literature review are highlighted, as well as where these approaches have been extended or modified to be applied to a corpus database.

Chapter four outlines the mechanisms developed to fulfil corpus queries and describes a novel corpus querying technique that allows corpus linguists to express queries powerfully and intuitively.

Chapter five is an evaluation of the LexiDB software. It includes a quantitative comparison to existing DBMSs for large scale corpora as well as evaluating the scalability of the database. The qualitative evaluation takes the form of a case study and a focus group where corpus linguists have used a custom-built user interface for LexiDB on a modern corpus dataset.

Chapter six is a summary of the thesis and a conclusion. It provides an examination of the weaknesses of the work as well as looking at potential future work to address these.


Chapter 2

Literature Review

This chapter discusses the background of corpus linguistics, with particular attention to how corpus data is typically queried, including definitions of the five most common query types. This is followed by a survey of existing corpus data systems: how they are used and what types of corpus queries they support, with an emphasis on how their query languages work. This is then contrasted with modern DBMSs and IR systems, examining how these systems work, the methods employed by typical indexing schemes and how such systems' query languages are not suited to corpus linguistic analysis approaches. Finally, we analyse how modern DBMSs, specifically No-SQL systems, utilise distribution methodologies in order to be scalable for larger and larger datasets and what these approaches can teach us with regard to making corpus data systems scalable.

2.1 Background in Corpus Linguistics

Corpus linguistics concerns itself with the study of natural language via methods and techniques centred around corpora, described by McEnery[59] as “a large body of linguistic evidence composed of attested language use”. The study of these corpora is intended to provide a means of analysis that can be


Figure 2.1: Growth of corpora over time (corpus size in words, log scale, plotted against year, 1970–2020)

generalised to all language in the domain and should hold true for texts outside the corpus being studied. Historically, the construction of corpora was a slow, manual task: searching, gathering and curating hundreds if not thousands of texts into a collection that would be representative of a wider body of language, as observed by Bauer[12]. Many modern corpora are often constructed using the web[79]. Corpora can be constructed to be general representations of language, an example of which would be the BNC (British National Corpus)[55], which is intended to represent a language as a whole, but corpora can also be highly specialised, such as the NOW (News on the Web)[23] corpus, which is built from online news articles. What can be inferred from the analysis of a corpus is always bound to the domain from which the corpus was derived.

Corpora are often (but not always) thought of as large collections of texts. However, the exact meaning of “large” in this context has changed over the years. A typical corpus in the 1960s, such as the Brown corpus[38], may have been only around one million words in total. Through the 80s and 90s corpora grew by orders of magnitude, quickly approaching the 100 million word mark with the aforementioned BNC. This trend continued into the new millennium (Figure 2.1), with corpora quickly accelerating through the hundreds of millions of words (COHA[22], EEBO[4]) until now, when modern, popular corpora regularly consist of over one billion words (UK Hansard[88], Wikipedia Corpus[30]). This trend cannot be ignored: with live corpora creating an ever-expanding plethora of what can be termed extreme-scale corpora, the need for corpus data systems to be updated for the big data age has never been more pertinent.

Beyond simple discovery, gathering and curation of texts into corpora, many techniques to process and enhance the insight of corpora are applied. Almost universally, these processing techniques will begin with a form of tokenisation[89], breaking the text down into its constituent chunks, typically whole words. As Palmer[70] observes, any form of tokenisation will likely involve sentence segmentation as well. Traditionally achieved through classical Markov models[52] and rule-based methods[64], tokenisation techniques have moved towards machine learning[85] in recent years. After tokenisation, various levels of linguistic annotation are commonly applied. The most prevalent of these is tagging each word in a corpus with a part-of-speech (POS) tag to explicitly mark its linguistic significance. POS tags typically denote verbs, nouns etc., but various tools such as CLAWS[39] that perform this task utilise ever more finely grained tagsets to define the linguistic category of words more and more precisely. While POS tagging is the most ubiquitous of techniques, many modern systems for processing corpora go far beyond this, utilising methods for tagging named entities[49], semantic categories[74], lemmatisation[72] and dependency marking[68], to name but a few.
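As a purely illustrative aside, the sketch below shows what a naive first pass at tokenisation and sentence segmentation might look like; it is a toy regular-expression approach, not the method used by CLAWS or any of the tools cited above, and real tokenisers handle far more edge cases.

import re

def tokenise(text):
    """Naive tokeniser: split into sentences on terminal punctuation,
    then split each sentence into word and punctuation tokens."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.findall(r"\w+(?:'\w+)?|[^\w\s]", s) for s in sentences]

print(tokenise("The cat sat on the mat. It wasn't amused!"))
# [['The', 'cat', 'sat', 'on', 'the', 'mat', '.'],
#  ['It', "wasn't", 'amused', '!']]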

2.1.1 Corpus Queries

Once a corpus is constructed, several methods of analysis are available to corpus linguists. Although not limited to those presented here, what follows

1 Corpora shown in Figure 2.1: NOW[23], EEBO[4], Hansard[3], USENET[83], YCCQA[28], COCA[21], CLEMETEV[27], CEEC[69], BNC[55], LOB[56], Brown[38]

is an overview of the five most pertinent, based on what is usually available in corpus analysis tools, discussed in section 2.2. Here we discuss them in general terms, although their usage, application and manifestations may vary and are highly dependent on the software tool being utilised when performing each analysis. This initial discussion is agnostic of such considerations.

2.1.1.1 Concordance Lines

Concordance analysis is a common technique used by corpus linguists. It involves the analysis of words surrounding a particular search word or phrase to derive the potential meaning of the word (in lexicography) or to explore patterns of usage. A typical means of displaying concordance lines is to show the search word or phrase with the words immediately adjacent, aligned vertically in a column with other occurrences of the search word, which allows linguists to see typical patterns of usage in context. In information retrieval, this is sometimes known as keyword in context (KWIC). Conclusions can be drawn from this analysis, such as in a study conducted on the portrayal of Muslims in the British Press[9]. How easily conclusions can be drawn is tied to having sufficiently large corpora to find enough evidence in the form of concordance lines. Common words can be examined in relatively small corpora, but when using a corpus that is representative of a larger body of language, less frequent words or phrases require larger and larger corpora to gather sufficient data for a meaningful concordance analysis. Concordance analysis, when done at scale, can be thought of as combining aspects of both qualitative and quantitative analysis[8]. This has motivated the need for larger and larger corpora and entails a need for software tools capable of efficiently handling them.
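As an informal illustration of the mechanics (not drawn from any of the tools discussed later), the following sketch builds concordance lines from a flat token list, aligning the node word with a fixed context window; the function name and window size are arbitrary choices.

def concordance(tokens, node, window=4):
    """Return KWIC lines: left context, node word, right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>40} | {tok} | {right}")
    return lines

tokens = "the white house said the white paper was ready".split()
print("\n".join(concordance(tokens, "white")))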


2.1.1.2 Frequency Lists

Another type of query supported by various corpus tools is the generation of frequency lists. Frequency lists can be generated for entire corpora; this is often done to compare how frequently particular words occur between two different corpora. The ability to filter and search these frequency lists is powerful: it allows regular expressions to be used to search for a particular pattern of words. Beyond this, in the case of annotated corpora which include lemmatisation, the lemma can be used and the frequencies of the different words that share the same lemma analysed. From such frequency lists, dispersion across a whole corpus can also be analysed[41]. This looks not only at the frequency of a word but also at where in a corpus the word occurs; words may occur highly frequently in one specific text in a corpus but scarcely appear in the remainder of the corpus.
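The sketch below is a minimal illustration of this idea over a plain token list rather than an indexed corpus: it produces a frequency list and optionally filters it with a regular expression, as described above. The names and the example pattern are illustrative.

import re
from collections import Counter

def frequency_list(tokens, pattern=None):
    """Count token frequencies, optionally keeping only tokens
    that match a regular expression."""
    counts = Counter(t.lower() for t in tokens)
    if pattern:
        regex = re.compile(pattern)
        counts = Counter({w: c for w, c in counts.items() if regex.fullmatch(w)})
    return counts.most_common()

tokens = "she sells sea shells on the sea shore".split()
print(frequency_list(tokens, r"s.*"))   # tokens beginning with 's'
# [('sea', 2), ('she', 1), ('sells', 1), ('shells', 1), ('shore', 1)]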

2.1.1.3 Collocations

Collocations of a particular lexical feature, for example a word, are a statistical measure of how commonly that feature co-occurs with another in a corpus. A typical example of a collocation might be the words “white” and “house”. Various statistical measures exist for calculating collocations. Usually such metrics require a contingency table to be constructed by finding the frequency with which a feature occurs in the corpus with and without a particular collocate. The resulting contingency table would be as such[54]:

\[
\begin{array}{cc}
f(xy) & f(\bar{x}y) \\
f(x\bar{y}) & f(\bar{x}\bar{y})
\end{array}
\]

In the simple case this table represents when a word x does and does not occur with a collocate y, i.e. $f(xy)$ is the frequency of x occurring with y, $f(x\bar{y})$ is the frequency of x occurring without y, and so on. Metrics such as mutual information (MI)[19] can be calculated from this basis:


\[
MI = \log \frac{P(xy)}{P(x)P(y)}
\]

where the probability is $P(xy) = \frac{f(xy)}{N}$ and N is the total size of the corpus. Further metrics can be calculated from the simple calculation of a contingency table between two corpora, such as log-likelihood:

\[
-2\ln\lambda = 2\sum_i O_i \ln\frac{O_i}{E_i}
\]

Here E represents the expected values and O the observed values[75][31].

Other calculations are of course possible, such as chi-squared ($\chi^2$)[48] etc.[60] From the perspective of a corpus data system, the choice of metric is trivial; what is essential is that the data is structured in a way that collation of the initial contingency table is fast and efficient.
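To make the calculation concrete, the following sketch derives the 2x2 contingency table from four observed frequencies and computes MI and log-likelihood as defined above. It is a simplified illustration only: a base-2 logarithm is assumed for MI, no collocation window or smoothing policy from any particular tool is implied, and the input figures are invented.

import math

def collocation_scores(f_xy, f_x, f_y, n):
    """Mutual information and log-likelihood from observed frequencies:
    f_xy = co-occurrences of x and y, f_x/f_y = totals for x and y,
    n = corpus size. The 2x2 contingency table is derived from these."""
    # Observed values of the contingency table
    o = [f_xy, f_y - f_xy, f_x - f_xy, n - f_x - f_y + f_xy]
    # Expected values under independence of x and y
    e = [f_x * f_y / n, (n - f_x) * f_y / n,
         f_x * (n - f_y) / n, (n - f_x) * (n - f_y) / n]
    mi = math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))   # base-2 log assumed
    ll = 2 * sum(o_i * math.log(o_i / e_i) for o_i, e_i in zip(o, e) if o_i > 0)
    return mi, ll

mi, ll = collocation_scores(f_xy=30, f_x=1000, f_y=200, n=1_000_000)
print(f"MI = {mi:.2f}, LL = {ll:.2f}")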

2.1.1.4 N-Grams

N-grams are contiguous sequences of n items taken from a sequence of data. Within a corpus, n-grams are typically handled at the token or word level but are often applied in computational linguistics at the character level. Searching for n-grams (sometimes referred to as clusters) in corpus data systems provides a means of analysing common patterns that occur within a corpus. N-grams are typically computed when a corpus dataset is first loaded into a system, and the most frequent unigrams, bigrams etc. can then be viewed (it is common to only generate up to 5-grams). It is also often useful to be able to search within n-grams for specific words, so that words commonly occurring with a word of interest can be viewed. N-grams can be used for many statistical analysis techniques, including but not limited to authorship analysis in forensic linguistics[90] and corpus-based machine translation[37].
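A minimal sketch of token-level n-gram extraction and counting follows; it is illustrative only and ignores sentence boundaries, which real corpus tools typically respect.

from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-grams from a token stream as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "to be or not to be that is the question".split()
bigrams = Counter(ngrams(tokens, 2))
print(bigrams.most_common(3))
# [(('to', 'be'), 2), (('be', 'or'), 1), (('or', 'not'), 1)]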


2.1.1.5 Keywords

Some corpus tools allow for keyword analysis. In this context, a keyword is a word that occurs more commonly than expected. Usually, such an analysis will involve two corpora: some metric (possibly the log-likelihood discussed above, or similar) is calculated between a target corpus and a reference corpus. Usually the target corpus will be a specialist corpus (e.g. a health care corpus[2]) and the reference corpus will often be a general-purpose language corpus. The metric is commonly known as keyness, and the results of the analysis can be sorted by keyness. Much like with the calculations for collocations, efficient generation of frequency lists between corpora is essential for calculating keyword lists. Within annotated corpora, this approach can be extended further to look not simply at keywords but at keyness between annotations[73].
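The sketch below illustrates one common way of computing keyness, reusing the log-likelihood metric introduced above with expected values derived from the sizes of the two corpora; the frequencies in the example are invented and no particular tool's exact formula or significance cut-off is implied.

import math

def keyness_ll(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood 'keyness' of a word in a target corpus relative
    to a reference corpus (higher = more over/under-represented)."""
    total = size_target + size_ref
    e_target = size_target * (freq_target + freq_ref) / total
    e_ref = size_ref * (freq_target + freq_ref) / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / e_target)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2 * ll

# 'patient': 1,200 hits in a 2M-word health corpus vs 500 in a 10M-word reference
print(round(keyness_ll(1200, 2_000_000, 500, 10_000_000), 2))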

2.2 Existing Corpus Data Systems and their Query Languages

2.2.1 Overview of Corpus Data Systems

Corpus Workbench (CWB)[18] is a set of corpus management tools for indexing and querying static corpora. It has undergone many changes and developments[34] since its inception and is often used as the backend for various corpus interfaces, including CQPWeb[45]. Corpus Query Processor (CQP) is CWB's means of providing meaningful linguistic query functionality through a query language that is often referred to as CQL (Corpus Query Language). CQP's flavour of CQL can be used to express a pattern in running text by specifying not only words but also word-level annotation such as part-of-speech tags, e.g. [word="in"][word="the"][pos="NN"] would search for nouns preceded by the words “in the”. CQL can also support regular expressions at both the character level and the word level, with the word-level support for regular expression operators being implemented within CQP[33].

Regular expressions can be handled at the character level in CQL: [word="s.*"] would find all words that begin with the letter s. At the word level, regular expression operators can be applied, such as [pos="JJ"]+, which would find one or more general adjectives (CLAWS7 POS tagset). This form of querying is powerful when searching a tokenised corpus. A tokenised corpus, including an arbitrary number of annotation layers, can be described as a token stream. Token streams allow for linguistically meaningful interrogation. However, more traditional methods of searching, i.e. character-level regular expressions across multiple words, are less feasible in this approach, as this may require the original untokenised text or, alternatively, a mechanism by which the text can be reconstructed from multiple tokens. Despite this limitation, this style of querying over token streams has remained the de facto standard querying style in corpus data systems since the introduction of CWB in the mid-1990s.
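To illustrate the idea (though not how CQP itself implements it), the sketch below evaluates a CQL-like sequence of token constraints naively over an annotated token stream, applying character-level regular expressions to each token attribute; the attribute names and POS tags in the example are illustrative.

import re

def match_token_pattern(stream, pattern):
    """Naive evaluation of a CQL-like token pattern over a token stream.
    `stream` is a list of {"word": ..., "pos": ...} dicts; `pattern` is a
    list of constraint dicts whose values are character-level regexes.
    Returns start positions where all consecutive constraints match."""
    hits = []
    for i in range(len(stream) - len(pattern) + 1):
        if all(re.fullmatch(rx, stream[i + j][attr])
               for j, constraints in enumerate(pattern)
               for attr, rx in constraints.items()):
            hits.append(i)
    return hits

stream = [{"word": "in", "pos": "II"}, {"word": "the", "pos": "AT"},
          {"word": "garden", "pos": "NN1"}]
# Roughly equivalent to the CQL pattern [word="in"][word="the"][pos="NN.*"]
print(match_token_pattern(stream, [{"word": "in"}, {"word": "the"}, {"pos": "NN.*"}]))
# [0]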

As a result of regular expression operators being used both at the character level (within tokens) and at the word level, CQL allows multiple ways of expressing what is fundamentally the same query, e.g. [word="(tall|long)"] would be considered synonymous with ([word="tall"]|[word="long"]). Despite this, due to the way CQP attempts to resolve these queries, synonymous queries may result in more or less efficient query execution. This illustrates that even with older systems, users are required to possess a level of understanding of how the underlying architecture resolves queries. In some respects, this could be considered a barrier to entry for using such systems, but it is also reasonable to postulate that such limitations would only occur at the edges of a system's capabilities, particularly in regard to its scalability, i.e. posing a query to CQP, expressed in CQL in a less than optimally efficient manner, might have no noticeable effect if searching for something relatively simple or infrequent in a small corpus.

2 http://ucrel.lancs.ac.uk/claws7tags.html

The interface and functionality of CWB and CQP have been extended into a web-based medium in CQPWeb. CQPWeb provides a more modern interface to many of CWB's features (as opposed to a command-line interface). CQPWeb allows query types such as concordances to be visualised more easily, as well as allowing users to examine dispersion in a corpus. The primary advantage of CQPWeb is the simplified administration it allows, meaning subcorpora and parallel corpora can be easily created, allowing for more refined querying as well as enabling other query types such as keywords. A final noteworthy feature of CQPWeb is a simplified query language, CEQL (Common Elementary Query Language), which allows for a simpler but more tightly coupled query syntax, e.g. word_POSTAG. This allows simple queries to be performed more easily but is less expressive than the full CQL syntax.

Sketch Engine is another corpus management tool, which utilises its own very similar flavour of CQL as its query language of choice. Sketch Engine is a hosted service operating on a SaaS (software as a service) model. This allows various existing corpora to be pre-loaded and hosted. Sketch Engine is capable of performing many of the typical linguistic searches (concordances, frequency lists, collocation etc.) and is built on top of Manatee and Bonito[77]. Manatee is a database management system that allows corpora to be indexed using an inverted text index (described in section 2.4.1) and Bonito provides a web interface in a client-server paradigm. A free alternative to Sketch Engine exists; NoSketch Engine allows users to access many of the features of Sketch Engine without the need to pay for a hosted service.

Poliqarp[51] utilises a custom query language whose syntax appears very similar to CQL but includes subtle variations and a certain amount of syntactic sugar to allow for other forms of specialist queries regarding sentence ambiguity etc. Poliqarp is used as the search interface to the national corpus of Polish. This provides a web-based search interface, which is becoming a more and more common means of performing corpus searches due to the ubiquity of web technologies and HTML. The implementation of Poliqarp as described[50] states that sequences of words from a corpus are stored as bytes corresponding to offset positions within a dictionary of orthographic words. Details are also given on how Poliqarp's custom query language could not be mapped easily onto an SQL statement or set of SQL statements, motivating the wider need for custom data layers for corpus query systems.

3 https://www.sketchengine.eu/

Another platform to consider is COSMAS 2 (Corpus Search, Management and Analysis System). This hosted service provides access to over 500 corpora that can be queried using a custom query language. This query language differs from CQL but provides similar functionality. The platform was developed at the Institute for the German Language in Mannheim and has been succeeded by KorAp[10]. As well as providing a hosted deployment containing several sample corpora, KorAp is also open source and individual modules are freely available (under a BSD licence).

The management tool Corpuscle[63], developed at the University of Bergen, provides a fully bespoke system for storing and querying annotated corpora. Similar to other systems, its query language, CorpusQueL, draws inspiration from and shares similarities with CQL, although as stated in the literature it does not share precisely the same functionality. Whilst a hosted version of this tool exists which allows access to many different corpora, the code itself is not published anywhere and therefore cannot be used by linguists to store and query their own corpora locally. Much like CWB, Corpuscle allows regular expressions to be resolved at both the character and word level, utilising finite state automata and minimising the number of corpus positions that need to be checked within the corpus to resolve them[62]. Corpuscle has been demonstrated to be faster than CWB for many specific query examples, particularly phrases beginning or ending in relatively low-frequency terms.

4 http://nkjp.pl/poliqarp/
5 https://www.ids-mannheim.de/cosmas2
6 https://korap.ids-mannheim.de/
7 https://github.com/KorAP
8 http://clarino.uib.no/korpuskel/

More user-friendly corpus management tools exist that provide more flexibility (allowing users to query their own corpora) and require less technical experience to set up. Tools such as AntConc[5] and WordSmith[78] are widely utilised by corpus linguists for several reasons. AntConc provides means of handling various corpus queries including KWIC, frequency lists, collocations and keywords. Its interface takes the form of a desktop application, as opposed to many other tools that have moved to a web-based interface. AntConc allows users not only to query but to explore their corpora by looking at dispersion, and it makes it easy to compare multiple corpora. Whilst both widely adopted and highly usable, AntConc struggles to provide the kind of support for extreme-scale corpora that many other systems seek to provide (although many of those other systems operate as hosted SaaS solutions).

BYU contains various English language corpora, including several in the order of billions of words, such as EEBO[4] (Early English Books Online) and Hansard[3]. Although BYU has some disadvantages compared to other systems in terms of functionality and expressiveness of linguistic queries, one key advantage of its architecture (discussed in section 2.2.2) is the potential for corpora to be dynamic. Most other corpus data systems require the corpora they use to be static: pre-processing is required upon insertion to prepare the corpus for querying. Davies 2019[24] demonstrates how, utilising relational tables and the architecture of iWeb, corpora can be added to dynamically. The work outlines how this can be done automatically, web scraping articles and passing them through a pre-defined toolchain for annotation before insertion into the database. The importance of this, and the potential advantages in terms of research avenues it presents for corpus methods in the 21st century and the big data age, cannot be overstated.

9 https://corpus.byu.edu/


2.2.2 Corpus Data Systems Architecture

Many of the underlying algorithms and techniques utilised by existing corpus data systems are not richly conveyed in published materials; rather, systems tend to be described in terms of the high-level architecture they employ. What follows is an assessment of the architecture of some of the systems described in the above overview, where such information on their architecture exists.

Corpus Workbench's architecture is split between the data model storing the indexed corpora and a registry which describes them. On top of this sits a set of library functions which access these low-level data structures. These functions can be accessed through a set of utilities or using the Corpus Query Processor, which can handle queries in CQL. This modular architecture splits the various components of CWB down such that other functionality or accessor patterns might be applied; for example, it allows the possibility of replacing CQP with a different query processor or of other tools accessing the lower-level corpus library functions. This modular approach can be seen in other architectures for corpus data systems.

KorAp presents a very modular approach to a corpus data management platform. KorAp utilises Neo4j (see section 2.4) as its database and uses Lucene (see section 2.4.1) for indexing[81]. The access layer to these other tools is a module called Koral. Similar to other tools that use custom query languages, KorAp too has a custom query language, KoralQuery, but KorAp provides the means of translating queries from other widely used corpus query languages such as CQL, AnnisQL and COSMAS. This removes a barrier to entry for users who may be familiar with other systems and thus do not need to learn another query language to use KorAp. KorAp provides a web-based interface called Kalamar but also provides an application programming interface (API) which allows different services to be built on top of it easily. Whilst the modular nature of KorAp makes it flexible, it also means setup and configuration is difficult without proficient technical knowledge and understanding.


Again the modular design of the components allows potential for replacing things like the query processor or the underlying data access layer with other alternatives.

The architecture of BYU's system, sometimes referred to as iWeb, is based on a relational database running on Microsoft SQL Server. Limited information on the design of this database shows that it is primarily split into three relations: a lexicon table representing a dictionary for a particular corpus, a corpus table containing the token stream for a given corpus and a sources table containing information on individual texts in the corpora. The lexicon relation contains columns for the word and associated tags (lemma, POS). Within the corpus table, each row represents an occurrence of a word and includes a foreign key referencing the word in the lexicon table, as well as a foreign key referencing the text in the sources table to which the word instance pertains.
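A minimal sketch of this style of three-relation layout is shown below using SQLite; the table and column names are assumptions for illustration and are not taken from the actual iWeb schema.

import sqlite3

# Illustrative three-relation layout (column names are assumptions,
# not the actual iWeb schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
CREATE TABLE sources (text_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE corpus  (position INTEGER PRIMARY KEY,
                      word_id  INTEGER REFERENCES lexicon(word_id),
                      text_id  INTEGER REFERENCES sources(text_id));
""")
conn.execute("INSERT INTO sources VALUES (1, 'Sample text', 2020)")
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)",
                 [(1, "the", "the", "AT0"), (2, "house", "house", "NN1")])
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 1)])
# Reconstruct the token stream for text 1 via joins on the foreign keys
rows = conn.execute("""
SELECT c.position, l.word, l.pos
FROM corpus c JOIN lexicon l ON c.word_id = l.word_id
WHERE c.text_id = 1 ORDER BY c.position
""").fetchall()
print(rows)   # [(1, 'the', 'AT0'), (2, 'house', 'NN1')]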

Beyond the lexicon and corpus tables, an additional table is required in order to retrieve contextual information around hits. On insert, a table containing columns for each corpus position as well as the 5-10 words either side of a word is computed (as described in Davies 2019[25]). This can effectively be thought of as an n-gram table, with at maximum a 10-gram. As described, the iWeb querying method is to find the lowest-frequency term in a search string and perform a lookup within this table, the middle column being the indexed column which is the target of the lookup. Although this approach is demonstrated to be efficient in certain circumstances, it remains prohibitively expensive for searches containing only high-frequency terms due to the sheer number of index entries that must be retrieved for each term. Beyond this, the query syntax does not allow for the more complex linguistic query expressions that other corpus query languages such as CQL do.

BYU demonstrates a typical way in which corpora can be stored in a classical relational database. The speed of the architecture for performing simple sequence queries has been compared favourably to Sketch Engine. Whilst this data is compelling in demonstrating the potential speed advantages of the approach over Sketch Engine, there is no discussion of the fact that both are running (or are assumed to be running) as hosted services. Direct comparisons made without knowing anything about the underlying hardware or the potential load on each system when the tests were performed, together with the fact that the tests were not run on the same corpora (only two available corpora of comparable size), all illustrate the need for such systems to have more uniform methodologies for evaluation and comparison, which in part is what this work hopes to address.

10 http://www.microsoft.com/sqlserver/
11 As described at https://www.corpusdata.org/database.asp

2.3 Proposed Systems & Related Work

An extension to the underlying data model of CWB has been proposed in a new data model, Ziggurat[35]. This proposes moving from a token-level data model towards a multi-layered approach that could handle the hierarchical structure of corpora in XML format as well as concurrent annotation layers. Whilst the proposed model suggests great functional capability, methods and algorithms to allow for greater scalability are not given, beyond simply switching from CWB's usage of 32-bit integers, which limits corpus size to approximately 2.1 billion words (signed integers must be assumed here), to 64-bit integers. The proposed solution for handling hierarchical structures, such as XML, is fundamentally the same approach utilised by existing XML databases such as BaseX.

Typically, corpora are stored in the form of XML files. Various XML databases exist that can query XML documents using XQuery; a subset of the functionality of XQuery is XPath. LPath[15] presents an alternative to XPath that is based around resolving linguistic queries; specifically, it adds new axes to the XPath syntax. On top of the typical ancestor, descendant and sibling axes already available in XPath, LPath proposes adding two new axes: “immediately-following” and “preceding”. These axes allow queries in this new syntax to express more precisely the contextual queries that are commonly utilised in linguistic data systems. Algorithms are presented to translate such queries from this custom XPath-like syntax into SQL statements, so linguistic data can be stored in a traditional database but still queried in a more powerful way. Storing XML in traditional database tables is discussed in section 2.4.2.2.

12 https://www.english-corpora.org/speed.asp
13 http://basex.org/

Banski[11] presents a strategy for classifying the capabilities of query languages: CQLF (Corpus Query Lingua Franca). The purpose of this work is to label the specific capabilities of the various query languages prevalent in corpus data systems. The work briefly examines various query languages discussed previously, such as the COSMAS query language and the Sketch Engine flavour of CQL. The three main strata into which the specification classifies query languages are: simple, complex and concurrent. “Simple” refers to applying the query over plain text or token streams, “complex” to querying over potentially hierarchical annotations, and “concurrent” to requiring support for dependency annotations. Whilst this work does not provide a basis for approaches to the scalability aspect, it does provide a convenient means of classifying any query language developed to support extreme-scale corpora.

The COW14[80] architecture describes a set of tools used to create, process, store and query corpora of up to 20 billion words. This toolchain's search system, known as Colibri, is based on CWB. It provides a web-based interface that uses the data storage and lookup capabilities of CWB in a similar way to CQPWeb. In a related vein, Colibri also offers a simplified query language as an alternative to CQL, although full use of the CQL syntax is also supported. Despite the fairly common methods employed by the search or querying component of this architecture, the processing component presents a compelling case for the benefits of distributed computing for extremely large corpora. It demonstrated that corpora in the order of tens of billions of words can be linguistically annotated across a large cluster consisting of dozens of nodes in under 5 hours (as opposed to 3 days). Whilst corpus annotation is not a goal of this thesis, scalability is, and this kind of distribution is not classically seen in corpus systems but is starting to become prevalent.

This movement towards distributed approaches to handling corpora is also demonstrated by Porta 2014[71], which demonstrates Map-Reduce algorithms both for compiling n-grams from a large text collection consisting of 800 million words and for collating inverted files (a key method for indexing) across multiple workers. It is noted that the time for building such file inversions does not scale linearly as corpus size increases but is limited by the bottleneck of disk access. Much like other modern DBMSs that divide data up into blocks (see section 2.4), the approach here also dumps data from the file inversion process to disk every million documents that are indexed, thus creating the disk bottleneck, without which the processing ability of Map-Reduce would likely scale more uniformly.

Vandeghinste[87] proposes the reuse of existing tools and techniques from information retrieval, specifically XQuery and XPath, making use of BaseX, to create a query tool that can scale to support a treebank corpus of around 500 million words, SoNaR. The approach taken requires the data to be separated into smaller databases so that complex XPath queries can be computed more efficiently over what have effectively become sub-trees of the entire corpus. The search engine design put forward, referred to as GrETEL, combines a classical database query-by-example approach with TIGER[47], a semantic-web-orientated corpus navigator. Within the architecture, queries are translated into a base XPath form and evaluated in BaseX against the many split parts of the whole corpus.


2.4 Database Management Systems and IR systems

2.4.1 Indexing

2.4.1.1 Postings Lists

Postings lists[53][6] are not only used in classical databases as a way to index and search for specific values; they are also widely utilised by modern information retrieval systems and search engines. Postings lists applied to databases contain a list of records where a particular value occurs in a table. In information retrieval systems, a postings list may contain a list of documents where a particular word occurs[58]. This is sometimes known as a record-level inverted index. A full inverted index differs from this as it not only describes the documents (or records) where the value or word occurs but also its position within the document. Postings lists are frequently used for text indexes in all manner of systems, including DBMSs and search engines[46].

Given a set of documents $D = \{d_1, d_2, \ldots, d_n\}$, a dictionary of all words in D can be compiled, with each word's postings list $w_n$ being expressed as a subset of all documents, $w_n \subseteq D$. The set of all postings lists which expresses the index can be thought of as $P = \{w_1, w_2, \ldots, w_n\}$. In many cases (particularly for common words or values) it may be the case that $w_n = D$. If the documents are written in a natural language, Zipf's Law[92] tells us that around 50% of the set P will contain only one entry[91]. These factors would potentially allow common words to share a postings list D without the need to store each separately, but this is only the case when storing a record-level inverted index; a full inverted index would require positional information within each document to be stored as well.

A full inverted index can be conceptualised by extending the above interpretation further: $w_n$ is no longer a subset of documents, $w_n \subseteq D$, but rather a set of tuples where each tuple contains a document and an integer that is a word's offset from the beginning of the document, $w_n = \{(d_i, n), \ldots\} : d_i \in D, 0 \le n < \mathrm{length}(d_i)$. In practice, for the sake of saving memory and/or disk space, the postings list may be stored as a tuple containing the document (or document ID) and the set of positions where the word occurs in the document, $w_n = \{(d_i, \{n_1, n_2, \ldots, n_j\}), \ldots\}$. In many cases, where documents or records are stored consecutively (perhaps in a single file on disk), the postings lists may simply be stored as a set of offsets from the beginning of the database. The offset may represent a byte length or indeed a multiple of a byte length if the document's contents are converted to some fixed-length set of values, e.g. if all words in the document were converted to 32-bit numbers the offset may be m but the offset in bytes would be 4m (32 bits equals 4 bytes).
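The following sketch shows a minimal in-memory positional (full) inverted index of the kind described above; it is illustrative only, with no compression, dictionary structure or on-disk layout.

from collections import defaultdict

def build_inverted_index(documents):
    """Build a full (positional) inverted index: for each word, a mapping
    of document id -> list of token offsets within that document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

docs = {"d1": "the cat sat on the mat", "d2": "the dog barked"}
index = build_inverted_index(docs)
print(dict(index["the"]))   # {'d1': [0, 4], 'd2': [0]}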

Whilst postings lists provide the basis for performing an efficient search within any database or data store containing textual data, specifically in our case corpus data, they are only the beginning of the process. Once a postings list is established and its dictionary compiled, the dictionary must itself be searchable. Searching a small dictionary may seem trivial; a simple brute-force scan may be sufficient given a dictionary of only a few thousand words, but dictionaries compiled from thousands of documents will be many orders of magnitude larger than this. Beyond finding a specific postings list based on a lookup in the dictionary, an individual postings list itself may become inordinately large at the scale of data with which modern data scientists and corpus linguists are now commonly working. This means storing the entirety of the index in memory, including all postings lists, for the sake of efficiency is no longer practical, as would have been done classically in older, far smaller relational databases. Compression is therefore essential, not only to reduce the size of the index in memory but to minimise the amount of data that must be read from disk in the ever more likely circumstance that an index cannot fit into memory even with compression. Discussed below are the structures and approaches that exist in the literature for solving these problems.


2.4.1.2 Searchable Dictionary Structures

Once such dictionaries are constructed they require a means by which to be searched. Traditional data structures such as b-trees, b+ trees and hash tables are suitable candidates to provide such functionality, but they fall short when dictionaries become vast and particularly when, as is often the case with the scale of corpora we are discussing here, they cannot fit in memory. Radix trees[26] or tries[57] have been demonstrated to be an efficient means of searching dictionary data[13], particularly in the big data age. Tries can be constructed from a dictionary where each node of the trie has a branching factor equal to the number of different letters that occur in that position within the dictionary. Radix trees differ from conventional tries in that each edge need not be indicative of only a single character but can represent a string of characters, for improved compression. Variations on radix tries are also possible to eliminate suffix redundancy, in the form of deterministic acyclic finite state automata[20], but since these structures do not use a vertex for each word, pointers to dictionary entries must be stored at each node as a list or array, leading to potential further searching within this set.
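The sketch below shows the basic dictionary-lookup idea with a plain character trie, where each terminal node points at a postings list; a radix tree would additionally collapse single-child chains into string-labelled edges, which is omitted here for brevity. All names are illustrative.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.postings = None   # this word's postings list, if the node is terminal

def trie_insert(root, word, postings):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.postings = postings

def trie_lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.postings

root = TrieNode()
trie_insert(root, "cat", [3, 17, 42])
trie_insert(root, "car", [5])
print(trie_lookup(root, "cat"))   # [3, 17, 42]
print(trie_lookup(root, "cab"))   # None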

2.4.1.3 Index Compression

2.4.1.3.1 FOR & PFOR-delta   Frame of reference (FOR)[40] can be used to minimise the number of bits needed to encode index positions in a postings list across a large number of records in a database. Given an unsigned 32-bit integer, the maximum potential length or number of records in a database starting with the first record 0 is approximately 4.3 billion, and every position in the index would take up 4 bytes. By arbitrarily splitting a large group of records down into smaller chunks and only storing the position information at these splitting points as an integer, subsequent entries in the index between these points can be stored simply as the offset from this point or frame of reference. This means not all index entries need be stored as 32-bit integers; they can instead be stored using the minimum number of bits needed to represent the gap between the reference points, e.g. if a reference point was created every million records, $n_{for} = 1 \times 10^6$, then the intermediary index entries between these points could be guaranteed to be stored using 20 bits, since $2^{20} > n_{for}$. This reduces the size of any postings list in the index whilst still allowing the list to be read sequentially.

Although FOR can reduce the number of bits required for almost all index entries (besides the reference points for each frame), the total space required can be reduced further by encoding the delta between each index entry (based on Elias-Fano encoding[32][36]) as opposed to its offset from the starting frame of reference. The FOR approach can still be utilised to traverse large lengths of the postings list, but between each frame of reference the number of bits needed is now the minimum number required to express the largest gap between entries. As such, extremely common values in the index will have huge postings lists with potentially millions of entries, but on average their deltas will be smaller and thus require fewer bits to express than those of lower-frequency, sparsely spaced values. If, within a particular frame from a starting reference point, the list of deltas $D = \{\delta_0, \delta_1, \ldots, \delta_n\}$ contains a largest delta $\delta_{max} = \max(D)$, then each entry in the list can be expressed using $\lceil \log_2(\delta_{max}) \rceil$ bits. Consider a postings list $P = \{12, 25, 33, 46, 57, 70\}$: the minimum number of bits to express these values would be $\lceil \log_2(70) \rceil = 7$, so storing this relatively small list would require $7 \text{ bits} \times \mathrm{length}(P) = 7 \text{ bits} \times 6 = 42$ bits. Encoding the deltas $D = \{12, 13, 8, 13, 11, 13\}$ would be possible using $\lceil \log_2(13) \rceil \text{ bits} \times 6 = 24$ bits, or 43% fewer bits in total.
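The bit-width arithmetic from the worked example can be reproduced as follows (using ceil(log2(max + 1)) so that exact powers of two are handled); this illustrates the saving only and is not an actual bit-packing implementation.

import math

def bits_needed(values):
    """Minimum fixed bit width able to hold every value in the list."""
    return max(1, math.ceil(math.log2(max(values) + 1)))

postings = [12, 25, 33, 46, 57, 70]
deltas = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

raw = bits_needed(postings) * len(postings)    # 7 bits x 6 = 42 bits
packed = bits_needed(deltas) * len(deltas)     # 4 bits x 6 = 24 bits
print(raw, packed, f"{100 * (raw - packed) / raw:.0f}% saved")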

Patched Frame of Reference (PFOR)[94] presents a method for increasing the achievable compression even further. Beyond applying the approach in FOR, PFOR-delta indexing uses a frame of reference for each block and, when determining the minimum number of bits needed to store the deltas, exceptions can be made where values are larger than typical; these can then be stored separately (typically after the list of deltas) with a reference pointer left in the original list. This method has been shown to achieve higher levels of compression and throughput for storing inverted indexes. One disadvantage it has over the previously described methods is that the entire reference block must be read to compute the various offsets. FOR/FOR-delta can be read and values from it computed on the fly with no need for the entire index block to be loaded into memory at one time (although this would be typical); PFOR-delta does not make this possible.
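As a toy illustration of the “patched” idea only (real PFOR implementations choose bit widths and encode exception patches rather differently), the sketch below picks a width that covers most of the deltas and stores the outliers separately.

import math

def pfor_split(deltas, coverage=0.9):
    """Toy PFOR-style split: choose a bit width covering ~90% of the
    deltas, store the rest as (position, value) exception pairs."""
    cutoff = sorted(deltas)[int(len(deltas) * coverage) - 1]
    width = max(1, math.ceil(math.log2(cutoff + 1)))
    packed, exceptions = [], []
    for i, d in enumerate(deltas):
        if d < (1 << width):
            packed.append(d)
        else:
            packed.append(0)              # placeholder, patched on decode
            exceptions.append((i, d))
    return width, packed, exceptions

deltas = [3, 2, 4, 3, 2, 3, 250, 2, 3, 4]
print(pfor_split(deltas))
# (3, [3, 2, 4, 3, 2, 3, 0, 2, 3, 4], [(6, 250)])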

2.4.2 Querying Corpora with Database Management Systems and their Query Languages

2.4.2.1 Lucene

Lucene is an open-source search engine library that has become ubiquitous in modern internet applications as well as in various No-SQL databases. Lucene constructs a full-text index and provides a query interface in the flavour of the Lucene Query Syntax. This does not allow for the same kind of expressiveness in linguistic queries as CQL, but it allows for more flexibility than traditional query languages such as SQL. The syntax can support fuzzy queries as well as multi-word queries, searching for a series of words within some arbitrary distance of each other within a document. As Lucene was designed as a search engine, like web search engines it computes metrics for ranking documents returned from its search results, generally based on information retrieval principles, specifically term frequency and inverse document frequency. Lucene has many derivatives and extensions as well as several internal algorithms and data structures that could be utilised by a modern corpus data system or database.

ElasticSearch   ElasticSearch is a distributed search engine built on top of Lucene. Its design allows for sharding (splitting a single dataset) of Lucene indexes across multiple nodes, which allows the system to scale out for increased availability and throughput. ElasticSearch has a domain-specific language based on JSON which is used for querying, but it also includes extensions that allow it to handle SQL-like queries, although unlike most classic SQL-based systems it does not support distributed transactions. Similar to Lucene, ElasticSearch allows for multi-word queries and ranges between these words to find documents that contain a particular phrase. Unfortunately, this approach does not extend far enough to support the kind of token-level regular expression functionality that is available in corpus query languages such as CQL.

14 https://lucene.apache.org/
15 https://www.elastic.co/elasticsearch/

2.4.2.2 XML

Due to the commonality of XML as a storage and interchange medium or file format for modern corpora, it is important to consider what XML database and storage techniques can teach us about how these documents are searched, both when used for corpora and in more general cases. Many of these techniques may be adaptable or relevant in providing novel means by which extreme-scale corpora can be handled. XQuery[16] is the general query language used for querying XML documents. XQuery is based upon XPath but extends it with various functional forms and features. XSLT (XSL Transformations) is a means by which XML can be transformed into some meaningful, presentable form. Many systems utilise a combination of XQuery and XSLT: XQuery firstly to produce a query result, typically one or more XML fragments, and then an XSLT stylesheet transformation to present the results of the query to the user in the desired manner.

eXist DB is an example of a native XML database. Similar to other XML databases, eXist can be classified within the NoSQL flavour of databases as a document store. The database makes use of Lucene for text indexing and searching (section 2.4.2.1). Much like other NoSQL databases, eXist can operate in a clustered configuration. The distributed architecture is a master/slave configuration with a single database instance acting as a front end and, by use of a message-passing system (publish-subscribe pattern), storing data on various slave nodes. This type of distribution is essential to scalability and is something that is clearly missing from typical bespoke approaches to corpus data systems.

16 http://exist-db.org/

Using XQuery for searching corpus-based XML documents is something of a two-edged sword. Whilst the expressive power of XQuery can make it possible to interrogate structural aspects of XML corpora in a more meaningful way, the syntax itself is perhaps more difficult to learn than CQL and other specialised corpus query languages. Take for example the following snippet (in the Parla-CLARIN XML format) as the way a set of corpus documents is stored:

<!-- structure abridged; element names reconstructed following the TEI/Parla-CLARIN
     convention of <s> sentences containing <w> word tokens with pos attributes -->
<TEI>
  <teiHeader>
    ...
  </teiHeader>
  <text>
    <body>
      ...
      <s>
        <w pos="...">The</w>
        <w pos="...">only</w>

Listing 2.1: ParlaClarin XML format schema

To query for all verbs in the corpus, an XQuery expression may look something like this (assuming the corpus documents are held in a collection named corpus):

1 doc ("*)//w[@pos="V.*"]

Listing 2.2: XQuery for all verbs

This syntax uses XPath to express a tree hierarchy of the desired query target, i.e. any w element with a pos attribute that begins with a V. This syntax is straightforward, but only to those with a good grasp of XML or familiarity with XPath. Taking this a stage further pushes the syntax into territory that only experts (or at least experienced users of the syntax) would likely follow. We can extend the previous example to return the sentences to which these verbs belong, rather than returning the verbs themselves. These sentences can then be built into concordance lines:

17 https://github.com/clarin-eric/parla-clarin

for $x in collection("corpus")//s
where $x//w[matches(@pos, "V.*")]
return $x

Listing 2.3: XQuery for sentences containing verbs

Very quickly this syntax builds to a point that is not feasible for users of a corpus data system to learn easily. In the above example, there is no easy way in XQuery (at least within this XML structure) to retrieve the words at the end of the previous sentence. This limitation is in no small part the motivation for LPath, discussed in section 2.3.

Within eXist, additional functional capabilities make keyword-in-context searches possible using Lucene. Combined with an XSLT transformation, a method by which concordance lines can be retrieved is arrived at as follows:

xquery version "3.0";
import module namespace kwic="http://exist-db.org/xquery/kwic";
declare namespace ft="http://exist-db.org/xquery/lucene";
let $keyword := "time"
let $number_of_results := 50
let $context_size := 30
return

  (: illustrative completion: search the corpus with ft:query and pass each
     hit to kwic:summarize to produce concordance lines; the collection
     name is an assumption :)
  for $hit in subsequence(collection("corpus")//s[ft:query(., $keyword)], 1, $number_of_results)
  return kwic:summarize($hit, <config width="{$context_size}"/>)