Visualization Tool for Semantic Document Representations

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Document Maps: Visualization Tool for Semantic Document Representations

BACHELOR'S THESIS

Michal Petr

Brno, Spring 2021

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Michal Petr

Advisor: RNDr. Vít Novotný

Acknowledgements

I want to thank my supervisor Vít Novotný and advisor Jan Byska for their professional supervision, skilful advice, and help with the design and implementation of the application, as well as for their enthusiasm, willingness, and patience throughout the work.

Abstract

Visualising textual data can be beneficial for fields such as machine learning. Unlike computers, which can work with data that make no sense to the human brain, we sometimes need to understand what data we are working with, and thus need a better way to analyse large data sets. The main objective of this thesis is to research the means of interactively visualising sets of textual data, such as documents in a corpus, and to implement an interactive web application that visualises this data. The application also helps us analyse the documents' mutual similarities by placing them in a force-directed simulation and by giving us insight into which words in each document contribute to their similarity.
Keywords

D3.js, HTML, JavaScript, visual analytics, visualization, web application, web development

Contents

1 Introduction
2 Background
  2.1 Data visualisation
  2.2 Visualising similarities of textual data
    2.2.1 Spatial arrangement visualisations
    2.2.2 Force-directed algorithm
    2.2.3 Scatter plot
  2.3 Calculating distances
    2.3.1 Euclidean distance
    2.3.2 Manhattan distance
    2.3.3 Cosine similarity
    2.3.4 Soft cosine similarity
  2.4 Modern web development
    2.4.1 Front-end JavaScript frameworks
    2.4.2 Component-based web development
    2.4.3 Hyper-Text Markup Language
    2.4.4 Cascading Style Sheets
    2.4.5 JavaScript
    2.4.6 TypeScript
    2.4.7 Reactive Extensions for JavaScript
    2.4.8 Node.js
    2.4.9 Angular
    2.4.10 Scalable Vector Graphics
    2.4.11 Data-Driven Documents
  2.5 Existing tools for visualizing similarities
    2.5.1 Sketch Engine
    2.5.2 VisCoDeR
    2.5.3 Data-for-research browser
3 Design
  3.1 Required features
    3.1.1 Visualizing similarities in a corpus
    3.1.2 Word contribution to similarities
    3.1.3 Highlighting of word matches
  3.2 Concepts
    3.2.1 Initial screen
    3.2.2 Map design
    3.2.3 Comparison screen
4 Implementation
  4.1 Application components
    4.1.1 App component
    4.1.2 Init component
    4.1.3 Home component
    4.1.4 Graph component
    4.1.5 User interface component
    4.1.6 Sidenav component
  4.2 Services
    4.2.1 Query service
    4.2.2 Loading service
    4.2.3 JSON validate service
  4.3 Pipes
    4.3.1 Escape HTML pipe
    4.3.2 Pair split pipe
  4.4 Corpus loaded guard
  4.5 Graph data web worker
  4.6 Utility libraries
    4.6.1 Query utility library
    4.6.2 Graph utility library
    4.6.3 Various utility library
  4.7 Documentation
5 Conclusion
  5.1 Future work
A The live demo and the source code
  A.1 Building the project
    A.1.1 Installing the project
    A.1.2 Starting the development server
    A.1.3 Building for deployment
    A.1.4 Generating documentation
B How-to guide
  B.1 Importing the corpus
  B.2 Navigating the map
  B.3 Selecting nodes
  B.4 Viewing the document content
  B.5 Comparing documents
    B.5.1 Selecting words
  B.6 Changing settings
Bibliography

List of Figures

2.1 Causes of mortality by Florence Nightingale
2.2 A simulation of a force layout diagram
2.3 A similarity scatter plot
2.4 A depiction of measures
2.5 A diagram of component based development
2.6 A comparison of raster and vector graphics
2.7 The thesaurus concepts, designed by Lucia Kocinová
2.8 The results of VisCoDeR
3.1 Concept drawings of the initial screen
3.2 A drawing of the loading screen
3.3 The concepts for the map design
3.4 The range of colours used to depict deviations
3.5 The expanded settings menu
3.6 The first conceptualizations of the comparison window
3.7 The initial word match selection concept
3.8 The final comparison screen concept
4.1 The hierarchical structure of the implemented components
4.2 The comparison of naive and sRGB colour mixing
5.1 A screenshot of the developed application
B.1 The on-screen camera controls of the application
B.2 The on-screen UI element displaying the deviation error
B.3 The comparison button used to access the comparison screen
B.4 The word selection UI, showing checked and unchecked words

1 Introduction

Deciding whether two documents are similar can be a difficult task for a human. Many documents cover multiple topics, which makes their similarity harder to judge. They could also use the same terminology but discuss something entirely different, or there could be thousands of documents, which could take a human years to go through. We would therefore like to use a computer to perform this comparison and to give us a summary of why the documents are similar.
To do this, we need to develop an application that gives us this insight into a large set of documents organised into a corpus.

This thesis aims to develop a user-friendly open-source front-end for such an application, where the corpus data will eventually be pulled from a Representational State Transfer Application Programming Interface (REST API). This front-end will aid us in exploring such corpus data and help us visualise and summarise what contributes to the similarity of any two documents.

The thesis is divided into four chapters. The first chapter provides an overview of the current state of visualising textual data. First, we explore existing tools used to visualise textual data, and then we go over the means of developing a modern web application. The second chapter describes what is expected from an implementation of the application and shows the creative process of iterating over the application's design. In the third chapter, we go over the implementation details and the component structure of the implemented code. We conclude in the fourth chapter by summarizing the contributions of this thesis and suggesting directions for future work. Additional information can be found in the appendices, such as the link to the source code and a simple usage guide for the tool.

2 Background

This chapter goes over some of the techniques used to visualise textual data and the similarities between two texts, as well as the technologies used to create modern interactive web applications. We also briefly examine some existing tools used for such visualizations of similarities.

2.1 Data visualisation

Data visualisation and analysis are needed now more than ever as we shift to a more digitised world. They are no longer just an activity for scientists and statisticians; many ordinary people need them in one way or another.
[1] With the acceleration of productivity, we require quick means of consuming as much information as possible. To reach this lofty goal, we need a better way to visualise and analyse said information:

Fortunately, we humans are intensely visual creatures. Few of us can detect patterns among rows of numbers, but even young children can interpret bar charts, extracting meaning from those numbers' visual representations. [2]

Although today's need for visualising data may be unprecedented in its urgency, visualizations have been around for quite some time. Archaeologists have discovered many artefacts from ancient history that seem to visualise numerical data. [3] Innovative data visualisation also saved lives during the Crimean War in the 19th century. The statistical prodigy Florence Nightingale created a new type of diagram, portrayed in Figure 2.1, showing that the primary cause of death in the army was actually disease, which helped to stop its spread.

[Figure 2.1: Causes of mortality by Florence Nightingale]

2.2 Visualising similarities of textual data

When we talk about the similarity of two documents, we can have numerous metrics in mind. However, the most desired metric
