Novel Database Design for Extreme Scale Corpus Analysis


A thesis submitted to Lancaster University for the degree of Ph.D. in Computer Science

Matthew Parry Coole

January 2021

Abstract

This thesis presents the patterns and methods uncovered in the development of a new scalable corpus database management system, LexiDB, which can handle the ever-growing size of modern corpus datasets. Initially, an exploration of existing corpus data systems is conducted which examines their usage in corpus linguistics as well as their underlying architectures. From this survey, it is identified that existing systems are designed primarily to be vertically scalable (i.e. scalable through the use of bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture, several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to querying extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture are evaluated, and it is demonstrated that the architecture is comparably scalable to two modern No-SQL database management systems and outperforms existing corpus data systems in token-level pattern querying whilst still supporting character-level pattern matching.

Declaration

I declare that the work presented in this thesis is my own work. The material has not been submitted, in whole or in part, for a degree at any other university.

Matthew Parry Coole

List of Papers

The following papers have been published during the period of study. Extracts from these papers make up elements (in part) of the thesis chapters noted below.

• Coole, Matthew, Paul Rayson, and John Mariani. "Scaling Out for Extreme Scale Corpus Data." 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2015. Chapter 5.

• Coole, Matthew, Paul Rayson, and John Mariani. "LexiDB: A Scalable Corpus Database Management System." 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016. Chapters 3 and 5.

• Coole, Matthew, Paul Rayson, and John Mariani. "LexiDB: Patterns & Methods for Corpus Linguistic Database Management." Proceedings of the 12th Language Resources and Evaluation Conference. 2020. Chapters 4 and 5.

• Coole, Matthew, Paul Rayson, and John Mariani. "Unfinished Business: Construction and Maintenance of a Semantically Tagged Historical Parliamentary Corpus, UK Hansard From 1803 to the Present Day." Proceedings of the Second ParlaCLARIN Workshop. 2020. Chapter 5.

Contents

List of Figures
1 Introduction
   1.1 Overview
   1.2 Area of Study
   1.3 Motivation
   1.4 Research Questions
   1.5 Thesis Structure
2 Literature Review
   2.1 Background in Corpus Linguistics
      2.1.1 Corpus Queries
   2.2 Existing Corpus Data Systems and their Query Languages
      2.2.1 Overview of Corpus Data Systems
      2.2.2 Corpus Data Systems Architecture
   2.3 Proposed Systems & Related Work
   2.4 Database Management Systems and IR systems
      2.4.1 Indexing
      2.4.2 Querying Corpora with Database Management Systems and their Query Languages
      2.4.3 Distribution Methodologies & No-SQL Architectures
   2.5 Summary
3 Architecture
   3.1 Architecture Overview
   3.2 Column Stores
      3.2.1 Numeric Data Representation
      3.2.2 Zipfian Columns
      3.2.3 Continuous Columns
   3.3 Indexing
   3.4 Distribution
      3.4.1 Distributed Querying
      3.4.2 Redundancy
   3.5 Summary
4 Querying
   4.1 Introduction
   4.2 Linguistic Query Types
   4.3 Overview of Query Syntax and Capabilities
   4.4 Resolving QBE objects
      4.4.1 Preamble
      4.4.2 Algorithm
      4.4.3 Example
   4.5 Token stream regex
      4.5.1 Preamble
      4.5.2 Algorithm
      4.5.3 Worked example
      4.5.4 Limitations
   4.6 Resolving Query Types
   4.7 Asynchronous Querying
   4.8 Sorting
   4.9 Summary
5 Evaluation
   5.1 Overview
   5.2 Quantitative Evaluation
      5.2.1 Experiment 1: Scalability of existing DBMSs
      5.2.2 Experiment 2: Scalability of LexiDB
      5.2.3 Experiment 3: Comparative Evaluation
   5.3 Hansard Case Study
      5.3.1 Building the Tool
   5.4 Summary
6 Conclusions
   6.1 Summary of Thesis
   6.2 Proposed Design
   6.3 Limitations and Future Work
   6.4 Research Questions and Contributions
References
Appendices
A Experimental Results 1
B Experimental Results 2
C Experimental Results 3
D Sample Focus Group Questions
   D.1 Using the web interface
   D.2 Comparisons to other systems
   D.3 Future developments

List of Figures

2.1 Growth of corpora over time
3.1 Architecture diagram
3.2 Inserting Documents into Data Blocks
4.1 DFA representing ^[ab]a.*$
4.2 Radix tree representing D
5.1 MongoDB Query Times (high frequency words)
5.2 MongoDB Avg. Query Times (medium & low frequency words)
5.3 Cassandra Query Times (high frequency words)
5.4 Cassandra Avg. Query Times (medium & low frequency words)
5.5 Insertion and Indexing
5.6 Concordance Lines
5.7 Collocations
5.8 N-grams
5.9 Frequency List
5.10 Simple querying for Part-of-Speech (c5 tagset)
5.11 Common POS bigram search
5.12 Common POS bigram search
5.13 Annotation Processing pipeline
5.14 Hansard UI search bar
5.15 Hansard UI Concordance Results
5.16 Hansard UI Histogram Visualisation
5.17 Hansard UI Query Options
5.18 Hansard UI Collocation Word Cloud

Chapter 1

Introduction

1.1 Overview

Corpora are samples of real-world text designed to act as a representation of a wider body of language, and they are growing rapidly: from samples of millions of words (Brown) to hundreds of millions (BNC) by the 1990s, the magnitude of modern corpora is now measured in billions of words (Hansard, EEBO). This trend is only accelerating, with recent linguistic studies making more use of online data streams from Twitter, Google Books and other data analytic services. Compounding this is the increased use of multiple layers of annotation through part-of-speech tagging, semantic analysis, sentiment marking and dependency parsing, all increasing the dimensionality of ever-growing corpus datasets. Traditional corpus tools (AntConc, CWB) were not built to handle this scale and as such are often unable to fulfil specific use cases, leading to many bespoke solutions being developed for individual linguistic research projects. Many larger corpora are now hosted online in systems such as BYU, CQPWeb and SketchEngine. However, if the system does not have the functionality to perform the querying or analysis a linguist needs, they are often left with nowhere to go.
The areas of Database Systems and Information Retrieval Systems have also faced the problem encountered in corpus linguistics in recent years: an explosion in data scale. In those areas, this has led to the development of scalable approaches and solutions for other "big data" problems. These techniques are applicable to the problems of modern corpus linguistics. Such database techniques can handle language data but may not be well suited to its Zipfian character (a word's frequency is roughly inversely proportional to its frequency rank, so a handful of very common words account for a large share of all tokens), so adjustments to the techniques are investigated.

This thesis describes a set of approaches, methods and practices that apply to database management systems (DBMSs) tailored to fulfil corpus linguistic requirements. These approaches are realised in a corpus DBMS, LexiDB, that is capable of fulfilling not only the traditional needs of corpus linguists but also the modern needs for scalability and data management that have become ever more prevalent in the field in recent years. The remaining sections of this chapter outline the area of corpus linguistics, expand on the motivations for this project, formalise the research questions and objectives, and finally outline the thesis structure.

1.2 Area of Study

Corpora are built as representative samples to allow language analysis to be performed reliably without the need to examine every text available in the analysis domain. Corpora can sometimes be general-purpose, meant to represent a language as a whole (BNC), or can be built to examine a specific area such as political discourse (Hansard).

Typically, corpus linguists apply a standard set of approaches to analysing corpora (concordances, collocation, frequency lists etc.; two of these are sketched in code below). As such, various software tools have been developed over the years to allow linguists to do just that. These tools are often used as a basis for what may build into a wider analysis driven by linguistic research questions.

1.3 Motivation

For some corpus analyses, the standard, widely available tools may not be able to meet the functional or non-functional requirements of the project. There can be many reasons for this: perhaps the corpus data cannot easily be converted to a format the tool accepts, the type of analysis is different from those the tools typically provide, or the tool is not capable of supporting the size of the corpora being studied. This can often lead (particularly in large, well-funded projects) to specialist tools to manage, query and analyse corpora being built solely for use with a single corpus or a single research project.
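To ground the standard analyses named above, here is a minimal sketch in Python of two of them, a frequency list and KWIC-style concordance lines, over a toy tokenised text. The example corpus and query word are illustrative assumptions, not LexiDB code; note how even this tiny frequency list is head-heavy, which is the Zipfian skew discussed earlier.

```python
from collections import Counter

def frequency_list(tokens):
    """Rank-ordered (word, count) pairs; in Zipfian data the top ranks dominate."""
    return Counter(t.lower() for t in tokens).most_common()

def concordance(tokens, node, context=4):
    """KWIC concordance: each hit of `node` with `context` tokens of co-text."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

if __name__ == "__main__":
    # Toy corpus standing in for a real tokenised text.
    text = ("the cat sat on the mat and the dog sat by the door "
            "while the cat watched the dog").split()
    for word, count in frequency_list(text)[:5]:
        print(f"{word}\t{count}")
    print()
    print("\n".join(concordance(text, "cat")))
```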
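The abstract and contents also refer to token-level pattern querying (Chapter 4's "Token stream regex"). As a rough illustration of the general idea only, not the thesis's algorithm, a stream of annotated tokens can be searched at the annotation level by encoding each part-of-speech tag as a single character and running an ordinary character-level regex over the encoded stream; the tagset, tag codes and sentence below are invented for the example.

```python
import re

# Toy annotated token stream: (word, simplified POS tag) pairs.
TOKENS = [("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
          ("jumps", "VERB"), ("over", "PREP"), ("the", "DET"), ("lazy", "ADJ"),
          ("dog", "NOUN")]

# One character per tag, so a character-level regex engine can be
# reused to match patterns over the *token* stream.
TAG_CODE = {"DET": "d", "ADJ": "a", "NOUN": "n", "VERB": "v", "PREP": "p"}

def find_tag_pattern(tokens, pattern):
    """Yield the word sequences whose POS tags match `pattern`,
    e.g. 'da+n' = a determiner, one or more adjectives, then a noun."""
    encoded = "".join(TAG_CODE[tag] for _, tag in tokens)
    for m in re.finditer(pattern, encoded):
        yield [word for word, _ in tokens[m.start():m.end()]]

for phrase in find_tag_pattern(TOKENS, "da+n"):
    print(" ".join(phrase))  # -> "the quick brown fox", "the lazy dog"
```

This linear scan is purely for exposition; a system of the kind the thesis describes would resolve such patterns against indexed token columns rather than by scanning, but the encoding trick shows why token-level and character-level matching can share regex machinery.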