Visualization, Navigation and Relationship Discovery in Graphs

Visualization, Navigation and Relationship Discovery in Graphs ∗ Ján Mojžiš Institute of Informatics Slovak Academy of Sciences Dúbravská cesta 9, 845 07 Bratislava, Slovakia [email protected] Abstract Categories and Subject Descriptors Linked data is a concept used for interlinking several data E.1 [Data Structures]: Distributed data structures, sources, often placed across the world. One of its key re- Graphs and networks; H.1.2 [Models and principles]: quirements is a link. Another is machine readable struc- User/Machine Systems|Human factors, Human infor- ture. But nowadays, still many of data sources on the mation processing; H.3.3 [Information storage and re- web offer plain unstructured data. Newspaper articles, trieval]: Information search and retrieval|Information social networks or business register. Often the data is filtering; H.3.4 [Information storage and retrieval]: HTML formatted, where the formatting is mixed with Systems and software|Distributed systems the content, which is improper for machine reading. And the data would be very useful. We could extract infor- Keywords mation about persons, events, places and other objects. graph, visualization, distributed computing, parallel com- In order to extract such information from unstructured puting, relationship discovery, linked data data sources, an advanced techniques of information extraction are used. Even if data structure is extracted or created, a presentation of information to the end user is 1. Introduction crucial, because an information overload or clutter can be From many kinds of data sources, the Web is most dy- introduced. namic, dense and universal. Originally created by humans for humans, there are sources we use often on daily basis. In scope of our work, we focus on graph data structures, Newspaper articles, business statistics, social networks, data extraction, distributed computing and graph visu- traffic are just a few examples. But despite the fact, that alization. We design, implement and evaluate a single we have a vast set of data available. As humans, in order machine system for data extraction and information re- to get information, we are capable to extract and use only trieval, capable of using advanced graph visualization and a small portion of such data. Indeed a machine computing filtering techniques. We propose a new visualization con- power is very helpful for several reasons. First, machine cept of pen patterns and colors. Next we define a new can process data thousands times faster and more reli- universal graph visualization and filtering method, usable than any human. Next, with the use of appropriate able for filtering and relationship discovery. We propose software environments we could perform information ex- a new distributed algorithm PCMARS, intended to be traction and visualization tasks. used in a Pregel computing cluster for the graph relationship discovery tasks. We implement our proposal in a Based on website internetlivestats1, in 2015, internet con- client, stand alone program AGECRT (Advanced Graph nected roughly 863 millions websites, in comparison with and Clutter Removal Tool) and distributed algorithm PC- 2005 it grown more than 13 times. But this metric does MARS. A solution is dedicated as one single architecture. not include web pages themselves, only base unique IP addresses. Also, worth of a note are dynamic, periodi- ∗Recommended by thesis supervisor: Assoc. Prof. Michal cally changing websites, like news. A big contribution to Laclav´ık the Web is the content of social networks, which are re- Defended at Faculty of Informatics and Information Tech- sults of collaborative work of millions persons across the nologies, Slovak University of Technology in Bratislava on world. Here, a term Information Society is quite suitable. August 25, 2016. c Copyright 2016. All rights reserved. Permission to make digital Google Search is a popular web search engine owned by or hard copies of part or all of this work for personal or classroom use Google Inc, which also maintains web data index. Based is granted without fee provided that copies are not made or distributed 2 for profit or commercial advantage and that copies show this notice on of official Google web page , Google index contains more the first page or initial screen of a display along with the full citation. than 100 millions gigabytes of data (approximately 25 Copyrights for components of this work owned by others than ACM thousands of 4 terabyte discs). Google also maintained must be honored. Abstracting with credit is permitted. To copy other- a Freebase, multi-domain database of billions of N-triple wise, to republish, to post on servers, to redistribute to lists, or to use statements. Now, the project has emerged into Wikidata any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, 1 Vazovova 5, 811 07 Bratislava, Slovakia. www.internetlivestats.com/total-number-of-websites/, 16.6.2016 Mojžiš, J. Visualization, Navigation and Relationship Discovery in 2 Graphs. Information Sciences and Technologies Bulletin of the ACM https://www.google.com/insidesearch/howsearchworks/ Slovakia, Vol. 8, No. 2 (2016) 45-55 crawling-indexing.html, 16.6.2016 46 Mojˇziˇs, J.: Visualization, Navigation and Relationship Discovery in Graphs knowledge base. Wikidata contains structured, machine readable data gathered across all the world. 300 250 News articles, blogs and social networks, together with traditional web pages (homepages, company pages), writ- 200 ten in HTML are types of unstructured data (formatting 150 mixed with content). They are a great part of the Web 100 volume, many times due to their daily or periodically up- dated content. One of most popular social networks in the 50 present, Facebook (FB), holds more than billion monthly active users (at least once per 30 days cycle, user is logged 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015 in). In 2004, the count was 1 million. Ending year 2015, the count grown to 1.5 billion. Users contribute to FB writing status updates, periodical submissions on "time- Figure 1: Slovak Business Register. The evolu- line" and each user maintains his/her own profile, more or tion of registered subjects count. Vertical axis in less detail or public. FB is international social network, thousands. which links a broad spectrum of people across the world, not depending on language or culture. In the recent past, FB is also the space for firms, political parties or non- 6 governmental organizations to promote themselves. FB evolution is illustrated in Fig.3. 4 We see Slovak Business Register (SBR) as one kind of 2 social network, which is a public register, where subjects (natural or legal persons and companies) are listed based on particular law. In comparison to FB, SBR links are 0 not based on friends, instead, we find Partners, Manage- ment Body, Supervisory board or Liquidators. From of- 2002 2004 2006 2008 2010 2012 2014 2016 ficial statistics of Ministry of Justice SR3, the network grown from 52 thousands subjects in 1995 to more than 254 thousands registered subjects in 2015. And, despite Figure 2: en.wikipedia.org. Article published the fact, that SBR performs liquidation and deletion of count on year basis. Vertical axis in millions. registered entries, on the webpage of SBR, there is still possible to find and display all deleted subjects. Fig.1 dis- plays an evolution of registered subjects in SBR database In scope of this paper we propose a new complex solu- based on years. tion for information extraction and processing, involv- ing relationship discovery, resulting in visualization. We The importance and significance of the Web, as a large present a new and universal method, usable for relation- evolving information space is marked by various chal- ship discovery as well as filtering in graph visualization. lenges, like Semantic Web Challenge4 or research papers We identify and mark relation between graph clutter, cog- at WWW 5 or ISWC6 events. It is a kind of motivation nitive science and psychology, from which, based on our for us to try and find relations in the data. To use a research, we design a new pen drawing and color styles for modern distributed computing models, like Pregel, where graph visualization. Combining our solution under archi- the data is represented in a graph structure, giving a new tecture capable of extraction, distributed processing and opportunities for graph algorithms. visualization. We also evaluate our proposal on selected data inputs. A presentation, or a kind of an "overview" for the user is given by techniques from Information visualization field. 1.1 Goals A graph visualization, for instance, visualizes relation- The aim of this work is to combine relationship discov- ships providing graphic representation in vertexes and ery, visualization and navigation in graph data. Design edges. Despite the rich community of researchers and filtering techniques, offering some degree of personaliza- many contributing solutions for many issues in graph visu- tion. For unstructured data of SBR propose a structure alization, the potential for visualization is not fully used. and its creation. Valid structure can serve in data integra- The research is intended to layout algorithms, cluster- tion from various sources. Progress toward the concept of ing or drawing. Many solutions try to address the edge Linked Data by T. Berners-lee[2] for the transparent and crossing problem on traditional basis (layouts, clustering, effective data usage in public concern. A combination bundling, focus+context techniques), but we did not find of visualization and filtration techniques in data integra- the works for edge pens and colors. Only rather contro- tion, navigation and relation discovery tasks could lead versial edge visualization, where edges are rather hidden to interesting results. The work pick the following partial than displayed [4, 5], ultimately solving edge crossing, but goals: making higher uncertainty in relations.

Visualization, Navigation and Relationship Discovery in Graphs

Models of Time Travel

Emmerson Denney

Co-Workers Mourn, Address Elton's Murder

2010 Annual Report

Mckinney Macartney STEWART MEACHEM

Lesley Sharp and the Alternative Geographies of Northern English Stardom

Three's Company

10 Lost Wanted

The Simple Life

800-514-4001 Stay Connected for One Low Price

Interracial Bw/Wm Romance Movies.Docx

Country Shines