QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE by CHRISTAN EARL GRANT (A Dissertation)

Total Pages: 16

File Type: PDF, Size: 1020 KB

QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By CHRISTAN EARL GRANT

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2015

© 2015 Christan Earl Grant

To Jesus my Savior; Vanisia my wife; my daughter Caliah; my soon-to-be-born son; and my parents and siblings, whom I strive to impress. Also, to all my brothers and sisters battling injustice while I battled bugs and deadlines.

ACKNOWLEDGMENTS

I had the opportunity to see my dad, a software engineer from Jamaica, work extremely hard to earn a master's degree and work as a software engineer. I even had the privilege of sitting in some of his classes as he taught at a local university. Watching my dad work toward intellectual endeavors made me believe that anything is possible. I am extremely privileged to have someone I could look up to as an example of being a man, father, and scholar.

I had my first taste of research when Dr. Joachim Hammer went out of his way to find a task for me on one of his research projects because I was interested in attending graduate school. After I had worked with the team for a few weeks, he was willing to give me increased responsibility: he let me attend the 2006 SIGMOD Conference in Chicago. It was at this conference that my eyes were opened to the world of database research.

As an early graduate student, Dr. Joseph Wilson exercised superhuman patience with me as I learned to grasp the fundamentals of paper writing. He helped me manage a rocky first few years, and his abundance of wisdom revealed jewels of truth that I still hold sacred. He, along with Peter Dobbins, helped me navigate the road to the Ph.D.

I am delighted to have Dr. Daisy Zhe Wang as a dissertation advisor. I followed her work while she was still a graduate student, and I was thrilled to hear she was considering coming to UF. Having the opportunity to watch someone as gifted as Dr. Wang brainstorm and write was an invaluable experience. I also thank my lab mates Clint P. George and Dr. Kun Li, with whom I have worked for many years, as well as Sean Goldberg, Morteza Shahriari Nia, Yang Chen, Yang Peng, and Xiaofeng Zhou, who have also been mentored by Dr. Wang; I appreciate their valuable feedback.

During the last years of my graduate program there has been a large amount of civil unrest. While these issues do not affect me specifically, they are emotionally difficult to handle and can negatively affect my everyday productivity. It was important for me to have people around me who I know are going through similar circumstances emotionally and still pursuing their degrees. That is why I thank Dr. Pierre St. Juste, Dr. Corey Baker, and Jeremy Magruder for discussions about issues that are sacred to one's race and ethnicity.

Finally, I would like to thank all the individuals who regularly attend the ACM Richard Tapia Celebration of Diversity in Computing. I found this group in 2007 because I was purposely searching for community. It is a group of talented intellectuals who continue to spur me toward excellence. Through them I met Dr. Juan Gilbert, who has been an excellent mentor and role model throughout my research career.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
  1.1 Database as the Querying Engine
  1.2 Query-Driven Machine Learning
  1.3 Question Answering
2 IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS
  2.1 MADden Introduction
  2.2 MADden System Description
    2.2.1 MADden System Architecture
    2.2.2 Statistical Text Analysis Functions
    2.2.3 MADden Implementation Details
  2.3 Text Analysis Queries and Demonstration
    2.3.1 Dataset for MADden Example
    2.3.2 MADden Text Analytics Queries
    2.3.3 MADden User Interface
  2.4 GPText Introduction
    2.4.1 GPText Related Work
    2.4.2 Greenplum Text Analytics
      2.4.2.1 In-database document representation
      2.4.2.2 ML-based advanced text analysis
    2.4.3 CRF for IE over MPP Databases
      2.4.3.1 Implementation overview
      2.4.3.2 Feature extraction using SQL
      2.4.3.3 Parallel linear-chain CRF training
      2.4.3.4 Parallel linear-chain CRF inference
    2.4.4 GPText Experiments and Results
    2.4.5 GPText Application
    2.4.6 GPText Summary
3 MAKING ENTITY RESOLUTION QUERY-DRIVEN
  3.1 Query-Driven Entity Resolution Introduction
  3.2 Query-Driven Entity Resolution Preliminaries
    3.2.1 Factor Graphs
    3.2.2 Inference over Factor Graphs
    3.2.3 Cross-Document Entity Resolution
  3.3 Query-Driven Entity Resolution Problem Statement
  3.4 Query-Driven Entity Resolution Algorithms
    3.4.1 Intuition of Query-Driven ER
    3.4.2 Single-Node ER
    3.4.3 Multi-query ER
  3.5 Optimization of Query-Driven ER
    3.5.1 Influence Function: Attract and Repel
    3.5.2 Query-proportional ER
    3.5.3 Hybrid ER
    3.5.4 Implementation Details
    3.5.5 Algorithms Summary Discussion
  3.6 Query-Driven Entity Resolution Experiments
    3.6.1 Experiment Setup
    3.6.2 Realtime Query-Driven ER Over NYT
    3.6.3 Single-query ER
    3.6.4 Multi-query ER
    3.6.5 Context Levels
    3.6.6 Parallel Hybrid ER
  3.7 Query-Driven Entity Resolution Related Work
  3.8 Query-Driven Entity Resolution Summary
4 A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION
  4.1 Introduction to the Proposal Optimizer
  4.2 Proposal Optimizer Background
  4.3 Accelerating Entity Resolution
  4.4 Proposal Optimizer Algorithms
  4.5 Optimizer
  4.6 Proposal Optimizer Experiment Implementation
    4.6.1 WikiLink Corpus
    4.6.2 Micro Benchmark
  4.7 Proposal Optimizer Summary
5 QUESTION ANSWERING
  5.1 Morpheus QA Introduction
  5.2 Morpheus QA Related Work
    5.2.1 Question Answering Systems
    5.2.2 Ontology Generators
  5.3 Morpheus QA System Architecture
    5.3.1 Using Ontology and Corpora
    5.3.2 Recording
    5.3.3 Ranking
    5.3.4 Executing New Queries
  5.4 Morpheus QA Results
  5.5 Morpheus QA Summary
6 PATH EXTRACTION IN KNOWLEDGE BASES
  6.1 Preliminaries for Knowledge Base Expansion
    6.1.1 Probabilistic Knowledge Base
    6.1.2 Markov Logic Network and Factor Graphs
    6.1.3 Sampling for Marginal Inference
      6.1.3.1 Gibbs sampling
      6.1.3.2 MC-SAT
    6.1.4 Linking Facts in a Knowledge Base
  6.2 Fact Path Expansion Related Work
    6.2.1 SPARQL Query Path Search
    6.2.2 Path Ranking
    6.2.3 Fact Rank
  6.3 Fact Path Expansion Algorithm
  6.4 Joint Inference of Path Probabilities
    6.4.1 Fuzzy Querying
    6.4.2 PostgreSQL Fact Path Expansion Algorithm
    6.4.3 Graph Database Query
    6.4.4 Fact Path Expansion Complexity
  6.5 Fact Path Expansion Experiments
  6.6 Fact Path Expansion Summary
7 CONCLUSIONS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES
2-1 Listing of current MADden functions
2-2 List of each MADden function and its NLP task
2-3 Abbreviated NFL dataset schema
3-1 Mention sets M from a corpus
3-2 Example query node q
3-3 Summary
Recommended publications
  • A Data-Driven Framework for Assisting Geo-Ontology Engineering Using a Discrepancy Index
    University of California, Santa Barbara

    A Data-Driven Framework for Assisting Geo-Ontology Engineering Using a Discrepancy Index

    A Thesis submitted in partial satisfaction of the requirements for the degree Master of Arts in Geography by Bo Yan. Committee in charge: Professor Krzysztof Janowicz, Chair; Professor Werner Kuhn; Professor Emerita Helen Couclelis. June 2016. The Thesis of Bo Yan is approved (Professor Werner Kuhn, Professor Emerita Helen Couclelis, Professor Krzysztof Janowicz, Committee Chair, May 2016).

    Copyright © 2016 by Bo Yan

    Acknowledgements: I would like to thank the members of my committee for their guidance and patience in the face of obstacles over the course of my research. I would like to thank my advisor, Krzysztof Janowicz, for his invaluable input on my work. Without his help and encouragement, I would not have been able to find the light at the end of the tunnel during the last stage of the work, because he provided insight that helped me think outside the box. There is no better advisor. I would like to thank Yingjie Hu, who has offered me numerous pieces of feedback, suggestions, and inspirations on my thesis topic. I would like to thank all my other intelligent colleagues in the STKO lab and the Geography Department (those who have moved on and started anew, those who are still in the quagmire, and those who have just begun) for their support and friendship. Last, but most importantly, I would like to thank my parents for their unconditional love.
  • Open Standards in Open Source Andrew Savory, Luminas
    Open Standards in Open Source
    Andrew Savory, Luminas

    This is a tongue-in-cheek look at the symbiotic relationship between Open Standards and Open Source. It is designed to stimulate discussion rather than to be entirely truthful or accurate!

    Introduction
    • Andrew Savory: OS developer for ~10 years; developer on Apache Cocoon and Jackrabbit; Open Source pragmatist; Director of Luminas and Orixo

    What are standards?
    • Industry standards (PDF? Word?)
    • "open" standards: what's the price of interoperability? (Relax-NG = £55)
    • Open Standards

    The word "standard" is frequently abused, and there are several different types of "standard" in the IT industry:
    • Industry standards: where the market leader uses or abuses its position to push one way of working (typically file formats) and will rarely publish the specifications for that widely adopted "standard".
    • "open" standards, which pretend to be widely available but where you have to pay the standards body to access the specifications.
    • Open Standards, often in the form of Recommendations (W3C) or RFCs (IETF). These are designed by experts and made available to anyone, with feedback and improvements encouraged and expected.

    Why Standards?
    • Interoperability: we aren't interested in vendor lock-in. We want to make sure our software works with others. Even some proprietary developers understand this; a good example is Brent Simmons, author of NetNewsWire.
    • Competition: we're a fiercely competitive lot, and we each believe we're going to produce the best implementation. Because Open Source developers love to compete with each other, we need a frame of reference; standards set out the ground rules.
    • Security and testing
  • Usage-Dependent Maintenance of Structured Web Data Sets
    Usage-dependent maintenance of structured Web data sets

    Dissertation submitted in fulfillment of the requirements for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.) at the Institute of Computer Science, Department of Mathematics and Computer Science, Freie Universität Berlin. Submitted by Dipl.-Inform. Markus Luczak-Rösch, Berlin, August 2013. Referent: Prof. Dr.-Ing. Robert Tolksdorf (Freie Universität Berlin). First co-referee: Natalya F. Noy, PhD (Stanford University). Second co-referee: Dr. rer. nat. Elena Simperl (University of Southampton). Date of defense: 13.01.2014.

    To Johanna. To Levi, Yael and Mili. Thank you for letting me learn through you an attitude toward life "... that is ready to listen to critical arguments and to learn from experience. It is fundamentally an attitude that admits that I may be wrong, that you may be right, and that together we may perhaps get on the track of the truth." (Karl Popper)

    Abstract: The Web of Data is the current shape of the Semantic Web, which has gained momentum outside of the research community and become publicly visible. It is a matter of fact that the Web of Data does not fully exploit the primarily intended technology stack. Instead, the so-called Linked Data design issues [BL06], which are the basis for the Web of Data, rely on much more lightweight technologies. Openly available structured Web data sets are at the beginning of being used in real-world applications. The Linked Data research community investigates the overall goal of approaching the Web-scale data integration problem in a way that distributes efforts between three contributing stakeholders on the Web of Data: the data publishers, the data consumers, and third parties.
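    To make the "lightweight technologies" concrete, here is a minimal sketch (not from the thesis) of a data consumer dereferencing a Linked Data URI and inspecting the RDF triples behind it, assuming Python with the rdflib package; the DBpedia URI is just an illustrative public data source.

      # Minimal Linked Data consumer sketch (assumes the rdflib package).
      # Dereferencing a URI over plain HTTP and parsing the returned RDF
      # is essentially the whole "lightweight" stack the Linked Data
      # design issues rely on: URIs, HTTP, and RDF.
      from rdflib import Graph, URIRef

      resource = URIRef("http://dbpedia.org/resource/Berlin")  # illustrative URI

      g = Graph()
      g.parse(resource)  # content negotiation fetches an RDF serialization

      # Print a handful of (predicate, object) pairs describing the resource.
      for i, (s, p, o) in enumerate(g.triples((resource, None, None))):
          print(p, "->", o)
          if i >= 9:
              break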
  • Supported Reading Software
    Readers: Hardware & Software

    AMIS is a DAISY 2 & 3 playback software application for DTBs. Features include navigation by section, sub-section, page, and phrase; bookmarking; customizable font and color; voice rate and volume control; navigation shortcuts; and two views. http://www.daisy.org/amis?q=project/amis

    Balabolka is a text-to-speech (TTS) program. All computer voices installed on a system are available to Balabolka. On-screen text can be saved as a WAV, MP3, OGG, or WMA file. The program can read clipboard content, view text from DOC, RTF, PDF, FB2, and HTML files, customize font and background color, and control reading from the system tray or by global hotkeys. It can also be run from a flash drive. http://www.cross-plus-a.com/balabolka.htm

    BeBook offers four stand-alone e-book reader devices, from a mini model with a 5" screen to a wireless model with Wi-Fi capability. BeBook supports over 20 file formats, including Word, ePUB, PDF, Text, Mobipocket, HTML, JPG, and MP3. It has a patented Vizplex screen and 512 MB of internal memory (which can store over 1,000 books), while external memory can be used with an SD card. Features include the ability to adjust fonts and font sizes, bookmarking, 9 levels of magnification with PDF sources, and menu support in 15 languages. http://mybebook.com/

    Blio "is a reading application that presents e-books just like the printed version, in full color ... with ... features" and allows purchased books to be used on up to 5 devices with "reading views, including text-only mode, single page, dual page, tiled pages, or 3D 'book view'" (from the web site).
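    As a rough illustration of the save-to-audio-file workflow these TTS readers provide (this is not Balabolka's own API, which is a Windows GUI program), here is a sketch using the pyttsx3 Python package, which likewise drives whatever voices are installed on the operating system:

      # Sketch of a Balabolka-style "read text, save to audio file" flow,
      # assuming the pyttsx3 package; like Balabolka, it uses the voices
      # already installed on the operating system.
      import pyttsx3

      engine = pyttsx3.init()                   # enumerate the system's installed voices
      engine.setProperty("rate", 150)           # speaking rate, like a reader's voice-rate slider

      text = "On-screen text can be saved for later listening."
      engine.save_to_file(text, "excerpt.wav")  # engine.say(text) would speak it aloud instead
      engine.runAndWait()                       # process the queued request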
  • In-RDBMS Hardware Acceleration of Advanced Analytics
    In-RDBMS Hardware Acceleration of Advanced Analytics

    Divya Mahajan*, Joon Kyung Kim*, Jacob Sacks*, Adel Ardalan†, Arun Kumar‡, Hadi Esmaeilzadeh‡
    *Georgia Institute of Technology, †University of Wisconsin-Madison, ‡University of California, San Diego

    ABSTRACT: The data revolution is fueled by advances in machine learning, databases, and hardware design. Programmable accelerators are making their way into each of these areas independently. As such, there is a void of solutions that enables hardware acceleration at the intersection of these disjoint fields. This paper sets out to be the initial step towards a unifying solution for in-Database Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of advanced analytics queries to an FPGA accelerator. The accelerator implementation is generated for a User Defined Function (UDF), expressed as part of an SQL query using a Python-embedded Domain-Specific Language (DSL). To realize an efficient in-database integration, DAnA accelerators contain a novel hardware structure, Striders, that directly interface with the buffer pool of the database.

    [Figure 1: DAnA represents the fusion of three research directions (enterprise in-database analytics, e.g., Bismarck [7], Glade [8], Centaur [3], DoppioDB [4]; modern acceleration platforms, e.g., TABLA [5], Catapult [6]; and analytical programming paradigms), in contrast with prior works [3-5, 7-9] that merge two of the areas.]
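    The interface the abstract describes (an analytics UDF written in a Python-embedded DSL and referenced from an SQL query) might look roughly like the following self-contained sketch; the decorator, registry, and SQL line are illustrative assumptions, not DAnA's actual API.

      # Hypothetical sketch of a DAnA-style interface: an advanced-analytics
      # UDF written as a high-level Python function, registered so a query
      # layer could hand it to a hardware compiler. All names here are
      # illustrative assumptions, not the paper's actual API.
      import math

      UDF_REGISTRY = {}

      def udf(fn):
          """Register a UDF that a DAnA-like system could compile to an FPGA."""
          UDF_REGISTRY[fn.__name__] = fn
          return fn

      @udf
      def logreg_grad(w, x, y):
          # One logistic-regression gradient contribution for a row (x, y);
          # a high-level specification like this is what would be compiled
          # down to an accelerator, with Striders feeding it rows straight
          # from the database buffer pool.
          z = sum(wi * xi for wi, xi in zip(w, x))
          err = 1.0 / (1.0 + math.exp(-z)) - y
          return [err * xi for xi in x]

      # In DAnA the UDF would be referenced from SQL, conceptually:
      #   SELECT train('logreg_grad', ...) FROM patient_features;
      # Here we just run it in software to show what the specification means.
      print(UDF_REGISTRY["logreg_grad"]([0.1, -0.2], [1.0, 3.0], 1))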
  • Document Publishing in the Daisy CMS
    Document publishing in the Daisy CMS
    Cocoon GetTogether, October 4, 2006, Amsterdam
    Bruno Dumon, [email protected]
    http://www.daisycms.org/ http://www.outerthought.org/

    What is Daisy?
    ● CMS = Content Management System
    ● Java-based, frontend built on Cocoon (2.1)
    ● Open source project
    ● Daisy 1.0 released October 12, 2004. Current release: Daisy 1.5, with Daisy 2.0 on the way.
    ● Used by Cocoon for its documentation

    Agenda
    ● General Daisy overview
    ● Demo of Daisy document publishing features
    ● Daisy Wiki overview
    ● Delve into how the document publishing works

    [Architecture diagram: a web browser talks to the Daisy Wiki, which communicates with the Daisy repository server over HTTP+XML; other frontends (e.g. see gsoc), the Cocoon maven plugin, the Forrest plugin, import/export tools, and utility applications (automation of boring tasks) connect to the same interface.]

    Core repository server features
    ● Manages 'documents': identified by an ID (Daisy 2.0: namespaced); parts and fields (defined by a schema); language and branch variants; flat structure (no directories)
    ● Versioning
    ● Locking
    ● Access control
    ● Link extraction
    ● Querying: full text, structured search, faceted browsing
    ● JMS notifications
    ● APIs: native Java; remote access via HTTP+XML and Java
    ● Persistence: SQL database + file system + Lucene index
    ● Backup solution

    Repository server extensions (inside the repository JVM, on top of the core repository server API & SPI): navigation manager, email notifications, thumbnail generation, publisher, document tasks, LDAP authentication, NTLM authentication. (demo)

    The Daisy Wiki
    ● A Cocoon-based application
    ● Much of the tough work is done by the repository, so the Wiki can focus on end-user interaction and styling.
    ● Can be viewed as a ready-to-use application or a front-end platform

    Daisy Wiki customisation possibilities:
    ● Skinning (custom layout)
    ● Document type and query styling
    ● Extensions: /ext/** are forwarded to a custom sitemap.xmap; a flowscript API gives access to the Daisy context (repository API etc.)
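    Because the repository exposes everything over HTTP+XML, a client in any language can fetch a document with plain HTTP; here is a rough Python sketch, where the host, port, and /repository/document path are assumptions for illustration, not Daisy's documented endpoints:

      # Sketch of talking to a Daisy-style repository over HTTP+XML,
      # assuming the `requests` package. The address and URL path are
      # illustrative guesses, not Daisy's documented API.
      import requests
      import xml.etree.ElementTree as ET

      BASE = "http://localhost:9263/repository"   # hypothetical repository address

      resp = requests.get(f"{BASE}/document/1-DSY", auth=("guest", "guest"))
      resp.raise_for_status()

      # The strict two-tier design means the response is plain XML that any
      # XML library can parse; no Daisy-specific client is strictly required.
      doc = ET.fromstring(resp.content)
      print(doc.tag, doc.attrib)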
  • Will Dataspaces Make Data Integration Obsolete?
    DBKDA 2011, January 23-28, 2011, St. Maarten, The Netherlands Antilles

    DBKDA 2011 Panel Discussion: Will Dataspaces Make Data Integration Obsolete?
    Moderator: Fritz Laux, Reutlingen Univ., Germany
    Panelists: Kazuko Takahashi, Kwansei Gakuin Univ., Japan; Lena Strömbäck, Linköping Univ., Sweden; Nipun Agarwal, Oracle Corp., USA; Christopher Ireland, The Open Univ., UK; Fritz Laux, Reutlingen Univ., Germany

    [Diagram: "The Dataspace Idea": the space of data management plotted as functionality versus time and cost, ranging from schemaless, unstructured web search and virtual organizations (low semantic integration) to schema-first enterprise portals, federated DBMS, and desktop/scientific repository search (high semantic integration), with dataspaces offering pay-as-you-go administration; adapted from Franklin, Halevy, Maier (2005).]

    Dataspaces (DS) [Franklin, Halevy, Maier, 2005] are a new abstraction for information management. Paraphrasing and commenting on Franklin (2009), dataspaces are:
    ● Inclusive: they deal with all the data of interest, in whatever form. But semantics matters, and we need access to the metadata! Can a schema be derived from instances? Discovering new data sources raises what might be called the Münchhausen bootstrap problem. [Image: Theodor Hosemann (1807-1875)]
    ● Co-existence:
  • Daisy the Open Source CMS
    Daisy the Open Source CMS
    Steven Noels, Managing Partner, Outerthought, Zwijnaarde, Belgium, [email protected]

    Abstract: Daisy is an open source content management framework consisting of a stand-alone repository server and several client applications, the most notable being the Daisy Wiki application. Daisy has a strict two-tier separation between repository server and clients, which communicate using an HTTP- and XML-based interface. This whitepaper highlights some of Daisy's innovative concepts, and explains where Daisy differs from the multitude of other CMS applications out there. These distinct concepts make Daisy an ideal candidate for managing diverse sets of information: website content management, software documentation, and intranet knowledge sharing.

    1 Distinct Daisy Concepts

    1.1 A Big Bag of Documents

    A Daisy repository can be envisioned as a big bag of documents: no more, no less. Documents can be grouped into collections and queried using their associated metadata, but the repository model itself has no concept of hierarchy. Often, a content repository imposes a hierarchical approach to managing repository contents, using the all-too-familiar "folders" concept. Looking at the typical use of content management systems, the repository hierarchy will then reflect either the organisational structure of a company or the navigation hierarchy of the website(s) published out of the repository; there is no middle ground between the two approaches. This makes reuse of information, and sharing documents across department walls, harder if not impossible. When the company organisation changes, the document repository will need to reflect these changes as well.
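    To illustrate the design choice (a flat "bag" queried by metadata rather than a folder tree), here is a small self-contained Python sketch; the documents, collections, and field names are invented for illustration:

      # Self-contained sketch of the "big bag of documents" model: no
      # folders, just documents carrying metadata fields and collection
      # memberships. All data below is invented for illustration.
      documents = [
          {"id": "1-DSY", "name": "Install guide", "collections": {"docs", "website"},
           "fields": {"audience": "admin", "lang": "en"}},
          {"id": "2-DSY", "name": "Release notes", "collections": {"docs"},
           "fields": {"audience": "all", "lang": "en"}},
          {"id": "3-DSY", "name": "Intranet policy", "collections": {"intranet"},
           "fields": {"audience": "staff", "lang": "nl"}},
      ]

      # "Queried upon using their associated metadata": the same document
      # can surface in many views without living in any one folder.
      website_docs = [d for d in documents if "website" in d["collections"]]
      english_docs = [d for d in documents if d["fields"]["lang"] == "en"]

      print([d["name"] for d in website_docs])  # ['Install guide']
      print([d["name"] for d in english_docs])  # ['Install guide', 'Release notes']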
  • Exploring Digital Preservation Strategies Using DLT in the Context of Personal Information Management
    Forget-me-block: Exploring digital preservation strategies using Distributed Ledger Technology in the context of personal information management

    By James David Hackman, Department of Computer Science, University of Bristol. A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Master of Science by advanced study in Computer Science in the Faculty of Engineering. 15th September 2020. arXiv:2011.05759v1 [cs.CY] 2 Nov 2020

    EXECUTIVE SUMMARY

    Received wisdom portrays digital records as guaranteeing perpetuity; as the New York Times wrote a decade ago: "the web means the end of forgetting" [1]. The reality, however, is that digital records suffer similar risks of access loss as the analogue versions they replaced, but through the mechanisms of software, hardware, and organisational change. The first two of these mechanisms are straightforward. Software change relates to how data is encoded; for instance, later versions of Microsoft Word often cannot access documents written with earlier versions [2]. Likewise, hardware formats obsolesce; even popular technologies such as the floppy disk reach a point where accessing data on these formats becomes increasingly difficult [3]. The third mechanism is more abstract, as it relates to societal structures, and ironically it is often generated as a by-product of attempts to escape the first two risks: in our efforts to rid ourselves of hardware and software change, these risks are often delegated to specialised external parties. Common use cases are those of conveying information to a future self, e.g. calendars, diaries, tasks, etc. These applications are categorised as Personal Information Management (PIM) [4, p.
  • Daisy Version 8.0
    Teleflora Point of Sales: Daisy Version 8.0, PA-DSS Implementation Guide
    Version: 1.7. Version Date: May 2010.

    REVISIONS
    1.7 (May 28, 2010): Updated section "Daisy Connectivity Specifications"; removed outbound internet connections not used by Daisy. Updated section "Collecting Sensitive Data for Debugging" with added information for log settings. Updated the document to make the text more user-friendly and the steps easier for shop owners/managers to follow. Updated firewall requirements; removed old firewall and Netgear information.
    1.6 (May 7, 2010): Updated sections "Storage of cardholder information" and "How to permanently remove credit card information".
    1.5 (Apr 6, 2010): Added CC purge steps.
    1.4 (Sep 22, 2009): Updated document per Chris Campbell's comments.
    1.3 (Mar 1, 2009): Updated document per Doug King's comments.
    1.2 (Feb 27, 2009): Recreated document using Dove POS and updated RTI PA-DSS implementation guides.
    1.1 (Feb 26, 2009): Updated document per Doug King's comments.
    1.0 (Feb 26, 2009): Initial document creation.

    Table of Contents: Purpose of this Document; Scope and Definitions; To Learn More; Dissemination
  • Daisy Producer: an Integrated Production Management System for Accessible Media
    DAISY PRODUCER: AN INTEGRATED PRODUCTION MANAGEMENT SYSTEM FOR ACCESSIBLE MEDIA

    Christian Egli, Swiss Library for the Blind and Visually Impaired Zurich, Grubenstrasse 12, CH-8045 Zurich, Switzerland (DAISY2009, Leipzig)

    ABSTRACT: Large-scale production of accessible media above and beyond DAISY Talking Books requires management of the workflow from the initial scan to the output of the media production. DAISY Producer was created to help manage this process. It tracks the transformation of hard-copy or electronic content to DTBook XML at any stage of the workflow and interfaces to existing order processing systems. Making use of DAISY Pipeline and Liblouis, DAISY Producer fully automates the generation of on-demand, user-specific DAISY Talking Books, Large Print, and Braille. This paper introduces DAISY Producer and shows how creators of accessible media can benefit from this open source tool.

    1 Introduction

    The typical production of accessible media involves a number of processes, such as acquisition, markup, and output generation. In any medium to large organization, a number of people will be involved in this process, possibly in different locations and with different roles. They need to collaborate and share intermediate artifacts of the process. With all these factors taken into consideration, the management of this workflow becomes increasingly complex when scaling to a large production. Parts of this process, such as output generation, have very good tool support in the form of the DAISY Pipeline (DAISY Consortium 2009). Others, such as integrated workflow management and collaboration, are currently lacking. DAISY Producer sets out to fill this gap.
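    Of the two output back-ends named in the abstract, Liblouis is the braille translator; here is a minimal sketch of the kind of call a system like DAISY Producer automates, assuming the liblouis Python bindings (the `louis` module) and its standard English Grade 2 table. This is an illustration, not DAISY Producer's actual internal code.

      # Minimal braille-translation sketch using the liblouis Python
      # bindings; illustrates the Braille-output step, not DAISY
      # Producer's own implementation.
      import louis

      # "en-us-g2.ctb" is liblouis' standard English Grade 2 contraction table.
      braille = louis.translateString(["en-us-g2.ctb"], "Accessible media for everyone")
      print(braille)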
  • Data in Context: Aiding News Consumers While Taming Dataspaces
    DBCrowd 2013: First VLDB Workshop on Databases and Crowdsourcing

    Data In Context: Aiding News Consumers while Taming Dataspaces

    Adam Marcus, Eugene Wu, Sam Madden, MIT CSAIL ({marcua, sirrice, madden}@csail.mit.edu)

    "...were it left to me to decide whether we should have a government without newspapers, or newspapers without a government, I should not hesitate a moment to prefer the latter." (Thomas Jefferson)

    ABSTRACT: We present MuckRaker, a tool that provides news consumers with datasets and visualizations that contextualize facts and figures in the articles they read. MuckRaker takes advantage of data integration techniques to identify matching datasets, and makes use of data and schema extraction algorithms to identify data points of interest in articles. It presents the output of these algorithms to users requesting additional...

    ...reasons: 1) a lack of space or time, as is common in minute-by-minute reporting; 2) the article is a segment in a multi-part series; 3) the reader doesn't have the assumed background knowledge; 4) a newsroom is resource-limited and cannot do additional analysis in-house; 5) the writer's agenda is better served by the lack of context; or 6) the context is not materialized in a convenient place (e.g., there is no readily accessible table of historical earnings). In some cases, the missing data is accessible (e.g., on Wikipedia), and with enough effort an enterprising reader can usually analyze or visualize it themselves. Ideally, all news consumers would have tools to simplify this task. Many database research results could aid readers, particularly those related to dataspace management.
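    As a toy version of the two steps the abstract names (extracting data points of interest from article text, then matching them against candidate datasets), here is a self-contained Python sketch with invented article text and dataset schemas; real data and schema extraction and data integration are far more involved.

      # Toy sketch of MuckRaker's two steps, with invented inputs: pull
      # numeric "data points of interest" out of an article, then pick the
      # dataset whose schema vocabulary best overlaps the article's words.
      import re

      article = ("Unemployment fell to 5.1 percent in March, while median "
                 "earnings rose to 41,000 dollars.")

      # Step 1: naive extraction of numeric data points.
      points = re.findall(r"\d[\d,.]*", article)
      print(points)  # ['5.1', '41,000']

      # Step 2: naive dataset matching by vocabulary overlap with schemas.
      datasets = {
          "bls_unemployment": ["month", "unemployment", "percent"],
          "census_earnings": ["year", "median", "earnings", "dollars"],
          "nfl_scores": ["week", "team", "points"],
      }
      words = set(re.findall(r"[a-z]+", article.lower()))
      best = max(datasets, key=lambda name: len(words & set(datasets[name])))
      print(best)  # 'census_earnings': the dataset offering the most context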