Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data


Leilani Battle (University of Maryland, College Park, Maryland), Philipp Eichmann (Brown University, Providence, Rhode Island), Marco Angelini (University of Rome "La Sapienza", Rome, Italy), Tiziana Catarci (University of Rome "La Sapienza"), Giuseppe Santucci (University of Rome "La Sapienza"), Yukun Zheng (University of Maryland), Carsten Binnig (Technical University of Darmstadt, Darmstadt, Germany), Jean-Daniel Fekete (Inria, Univ. Paris-Saclay, CNRS, Orsay, France), Dominik Moritz (University of Washington, Seattle, Washington)

ABSTRACT

In this paper, we present a new benchmark to validate the suitability of database systems for interactive visualization workloads. While there exist proposals for evaluating database systems on interactive data exploration workloads, none rely on real user traces for database benchmarking. To this end, our long term goal is to collect user traces that represent workloads with different exploration characteristics. In this paper, we present an initial benchmark that focuses on "crossfilter"-style applications, which are a popular interaction type for data exploration and a particularly demanding scenario for testing database system performance. We make our benchmark materials, including input datasets, interaction sequences, corresponding SQL queries, and analysis code, freely available as a community resource, to foster further research in this area: https://osf.io/9xerb/?view_only=81de1a3f99d04529b6b173a3bd5b4d23.

CCS CONCEPTS

• Information systems → Data management systems; Data analytics; • Human-centered computing → Visualization systems and tools.

ACM Reference Format:
Leilani Battle, Philipp Eichmann, Marco Angelini, Tiziana Catarci, Giuseppe Santucci, Yukun Zheng, Carsten Binnig, Jean-Daniel Fekete, and Dominik Moritz. 2020. Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3318464.3389732

1 INTRODUCTION

Motivation. The data science process often begins with users (i.e., analysts, data scientists) exploring possibly massive amounts of data through interactions with a graphical user interface, oftentimes a data visualization tool [4, 9, 12, 66]. However, each time a user interacts with a visualization interface, the underlying data must be processed (filtered, aggregated, etc.) such that the interface can quickly provide a visual response [39, 42, 56]. To meet this growing demand for interactive and real-time performance, the database and visualization communities have developed a variety of techniques, including approximate query processing [2, 14], online aggregation/progressive visualization [3, 19, 26], data cubes [8, 36, 38], spatial indexing [63], speculative query execution [8, 30], and lineage tracking [53].

However, we still lack adequate benchmarks to empirically assess which of these resulting systems provide satisfactory performance, and which systems are actually better than others for interactive, real-time querying scenarios. This issue is exacerbated for the most demanding yet popular visualization scenarios such as crossfilter [41, 58, 67], where one interaction may generate hundreds of queries per second, with an expectation of near-immediate results. Unfortunately, existing database benchmarks such as TPC-H [65], TPC-DS [64], or the Star Schema Benchmark (SSB) [47] are insufficient for making these comparisons. One main reason is that the workloads modeled in these benchmarks are not representative of how database queries are generated through user interactions, such as with tools like Tableau [59] or Spotfire [58]. In contrast, visualization benchmarks provide realistic scenarios, but lack the configurability, precision, and automation afforded by database benchmarks when testing performance [7]. Furthermore, there exists no evaluation platform to test database systems under a range of interactive analysis conditions, such as different dataset characteristics, exploration scenarios, interaction paradigms, or user profiles. Therefore, our community is still woefully unequipped to answer the question "are database systems truly capable of supporting real-time interactive data exploration at scale?" Here, real-time means with a latency below 100 ms, or supporting at least ten frames per second [15, 41, 42]. Unfortunately, our study reveals that more work is needed from the database community to answer positively.

Contributions. While there exist proposals [7, 18, 28, 29, 61] for evaluating database systems on interactive data exploration workloads, none of these approaches derive a benchmark from real user traces, introducing potential artifacts and possibly missing key behavioral patterns. To validate the suitability of database systems for interactive and visual data exploration workloads, this paper thus presents a benchmark constructed from real traces of users interacting with an information visualization system. In this scenario, every query corresponds to an actual interaction from a real user. Given the innumerable analysis scenarios supported by visualization tools [23, 33], it would be unrealistic to have one benchmark represent all possible use cases. Instead, we believe that a database benchmark ecosystem is required for evaluating interactive data exploration workloads, similar to the TPC benchmarks for classical database workloads (OLTP, OLAP, etc.). Our long term goal is to collect user traces representing workloads with different exploration characteristics. As a first step, we mainly use "crossfilter"-style applications to collect user traces and derive our benchmark (see Figure 1). The benchmark development process presented in this paper provides a blueprint for creating more interaction-focused benchmarks in the future.

Figure 1: An example crossfilter application from our user study, visualizing the Flights dataset, composed of six numerical attributes. Each attribute is visualized as a histogram that can be filtered using a range-slider. When the user interacts with a range slider, the entire dataset is filtered, the bins in each histogram are recomputed, and the visualizations are updated.

Given how easy and intuitive they are to use when exploring complex datasets, crossfilter interfaces are pervasive in information visualization tools, from foundational systems like the Attribute Explorer [67] and Spotfire [58] to recent systems like Falcon [41]. Moreover, compared to other interaction designs (e.g., mouse hovers or check-box selection), crossfilter-style applications are arguably the most demanding use case for database systems [41]. With one single interaction, a user can generate hundreds of database queries per second when moving a single crossfilter range-slider. For example, Figure 1 shows information about airline delays; changing the selection of the flight distance using the crossfilter range-slider of the histogram in the upper row (center) causes all other histograms to update at each mouse drag. Crossfilter interfaces are instances of dynamic query interfaces [57], which require live updates from the underlying database system as the slider moves in real time.

In the following, we discuss the contributions of this paper. As a first contribution, we present the results of a user study conducted by the different international research groups who authored this article. In this user study, we asked 22 users with a broad range of data science experience to perform four representative interactive analysis tasks using a crossfilter visualization tool, across three different datasets, producing a total of 128 different traces for further study. Using the log data and video recordings collected from the study, we analyze and summarize our findings regarding the interaction patterns observed and their effects on the resulting query workload. This characterization analysis provides useful insights into how interaction patterns can be taken into account when evaluating and optimizing queries on a database system.

As a second contribution, we present the results of our analysis of the 128 collected traces and characterize the behavior of typical users on interactive tasks.
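To make the crossfilter workload concrete, the sketch below illustrates how a single drag step on one range-slider fans out into one aggregate query per linked histogram, each filtered by the selections on all other attributes. This is only an illustration of the interaction pattern described above: the table name, attribute list, bin widths, and filter ranges are hypothetical and are not the schema or query text shipped with the benchmark.

```python
# A minimal sketch (not the benchmark's released code) of how one drag step on
# a crossfilter range-slider fans out into one aggregate query per linked
# histogram. Table name, attributes, bin widths, and ranges are hypothetical.

ATTRIBUTES = {          # attribute -> histogram bin width
    "distance": 100,
    "arr_delay": 10,
    "dep_delay": 10,
    "air_time": 20,
}

def crossfilter_queries(active_attr, active_range, other_filters):
    """Return one SQL aggregation query per histogram other than the dragged
    one; each histogram is filtered by the selections on all other attributes."""
    filters = dict(other_filters, **{active_attr: active_range})
    queries = []
    for attr, bin_width in ATTRIBUTES.items():
        if attr == active_attr:
            continue  # the dragged histogram is not re-aggregated mid-drag
        preds = [f"{a} BETWEEN {lo} AND {hi}"
                 for a, (lo, hi) in filters.items() if a != attr]
        where = " AND ".join(preds) or "TRUE"
        queries.append(
            f"SELECT FLOOR({attr} / {bin_width}) AS bin, COUNT(*) AS cnt "
            f"FROM flights WHERE {where} GROUP BY bin"
        )
    return queries

# One mouse-move on the 'distance' slider re-issues all of these queries;
# a continuous drag repeats the batch for every intermediate slider position.
for q in crossfilter_queries("distance", (500, 1500), {"arr_delay": (0, 60)}):
    print(q)
```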
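The real-time threshold above (latency below 100 ms, i.e., at least ten frames per second) also suggests a simple way to check a candidate system against such a workload: time each query and compare it to the budget. The sketch below does this with Python's standard sqlite3 module and a toy table purely as a stand-in; the actual benchmark targets whichever database system and dataset are under evaluation.

```python
# A minimal sketch of checking the real-time target: each query should return
# in under 100 ms. The in-memory SQLite table and toy query are placeholders
# for the system and benchmark workload actually being tested.
import sqlite3
import time

LATENCY_BUDGET_MS = 100.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (distance REAL, arr_delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [(d, d % 97 - 30) for d in range(10_000)])

def measure(query: str) -> float:
    """Run one query to completion and return its latency in milliseconds."""
    start = time.perf_counter()
    conn.execute(query).fetchall()
    return (time.perf_counter() - start) * 1000.0

latency = measure(
    "SELECT CAST(distance / 100 AS INT) AS bin, COUNT(*) "
    "FROM flights WHERE arr_delay BETWEEN 0 AND 60 GROUP BY bin"
)
print(f"{latency:.2f} ms ({'ok' if latency <= LATENCY_BUDGET_MS else 'too slow'})")
```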