Evaluation of Graph Management Systems for Monitoring and Analyzing Social Media Content with Obi4wan
Total Page:16
File Type:pdf, Size:1020Kb
Evaluation of Graph Management Systems for Monitoring and Analyzing Social Media Content with OBI4wan Yordi Verkroost MSc Computer Science VU University Amsterdam [email protected] December 22, 2015 Abstract As social networks become ever more important, companies and other organizations are more and more interested in how people are talking about them on these networks. The Dutch company OBI4wan delivers a complete solution for social media monitoring, webcare and social analytics. This solution provides answers regarding for example who is talking about a company or organization, and with what sentiment. Some of the questions that OBI4wan wants to answer require another way of storing the social data set, because the normal "relational" way of storing does not suffice. This report compares the column store MonetDB to the graph database Titan, and tries to find an answer to the question: is a distributed solution (Titan) better than a centralized solution (MonetDB) when it comes to answering graph queries on a social network. The LDBC data set and its benchmark are used to answer this question. Because the benchmarks have shown that both MonetDB and Titan have their respective problems, another centralized database solution (Virtuoso) has been added to the comparison. For now, Virtuoso seems to be the best choice (out of the three systems) for loading the LDBC data set into a database and executing the LDBC benchmarks on this database. 1 Contents 1 Introduction 6 1.1 Research questions . .7 2 OBI4wan: Webcare & Social Media Monitoring 8 2.1 OBI4wan Twitter data set . .8 2.1.1 Friends and Followers . .9 2.2 ElasticSearch architecture . 12 2.3 Novel questions for OBI4wan's data set . 14 2.4 In need of graph query functionality . 14 2.5 Centralized versus decentralized . 18 2.6 Research questions . 18 3 Related work 19 3.1 Graph networks . 19 3.2 Graph databases . 19 3.2.1 Titan . 19 3.2.2 Neo4j . 20 3.2.3 Sparksee . 20 3.2.4 FlockDB . 21 3.2.5 RDF . 23 3.3 Graph query languages . 24 3.3.1 Gremlin . 24 3.3.2 Cypher . 25 3.3.3 SPARQL . 26 3.3.4 SQL . 27 3.4 Graph benchmarks . 29 3.4.1 Linked Data Benchmark Council (LDBC) . 29 3.4.2 Transactional Processing Performance Council (TPC) . 30 3.4.3 Graphalytics . 31 3.5 Other database systems . 31 3.5.1 MonetDB . 31 3.5.2 Virtuoso . 32 3.6 A note on graph standards . 32 3.7 Research questions . 33 4 Titan 34 4.1 Titan overview . 34 4.1.1 TinkerPop Stack . 35 4.2 Titan architecture . 35 4.2.1 Internals . 35 4.2.2 Data storage layer . 36 4.2.3 Indexing layer . 40 4.2.4 Query capabilities . 42 4.2.5 Remote access . 42 4.3 Research questions . 42 2 5 MonetDB 43 5.1 MonetDB overview . 43 5.2 MonetDB architecture . 43 5.2.1 Query processing . 43 5.2.2 Data storage . 44 5.2.3 Hash tables . 45 5.2.4 Optimizer pipelines: the mitosis problem . 45 5.2.5 Remote access . 46 5.3 Research questions . 47 6 Test data generation 48 6.1 In need of test data . 48 6.1.1 Stanford Network Analysis Platform (SNAP) . 48 6.1.2 Lancichinetti-Fortunato-Radicchi (LFR) . 48 6.1.3 LDBC DATAGEN . 49 6.1.4 Choosing the data source to use . 49 6.2 Generating test data . 49 6.2.1 Stripping down the LDBC data set . 50 6.3 Translating test data to database input . 50 6.3.1 LDBC DATAGEN output format . 50 6.3.2 Translation scripts . 51 6.3.3 Translating LDBC DATAGEN output into Titan Script IO format . 52 6.4 Loading test data into databases . 53 6.4.1 Loading data into a Titan database . 53 6.4.2 Using Hadoop for parallel data loading . 54 6.4.3 Loading data into a MonetDB database . 58 6.5 Research questions . 58 7 OBI4wan data usage 60 7.1 OBI4wan data format . 60 7.2 Translating the OBI4wan data set into database input . 61 7.3 Creating the friends/followers data set . 62 7.3.1 Program execution . 63 7.3.2 Translating friends/followers JSON to user relations file . 63 7.3.3 Keeping the friend- and follower data sets up-to-date . 63 7.4 OBI4wan graph analysis . 65 7.4.1 Friends/followers graph statistics . 67 7.4.2 Mentioned graph statistics . 69 7.4.3 Retweeted graph statistics . 71 7.5 OBI4wan data versus LDBC data . 72 7.5.1 Creating structural indicators for LDBC . 73 7.5.2 Comparing OBI4wan data and LDBC data . 74 7.5.3 Conclusion: OBI4wan friends/followers versus LDBC friends . 76 7.6 Research questions . 76 3 8 Benchmarks setup and execution 77 8.1 Creating the LDBC Driver . 77 8.1.1 LDBC SNB IW driver for Titan . 78 8.1.2 LDBC SNB IW driver for MonetDB . 83 8.1.3 LDBC SNB BI driver . 91 8.2 Validating the LDBC SNB IW driver . 92 8.2.1 Stripping down the validation set . 93 8.2.2 Validating the Titan and MonetDB drivers . 93 8.3 Benchmark set-up . 94 8.3.1 Setting up the general benchmark configuration . 94 8.3.2 Titan-specific configuration parameters . 97 8.3.3 MonetDB-specific configuration parameters . 98 8.4 Running the benchmarks . 98 8.4.1 Running the Titan benchmark . 98 8.4.2 Running the MonetDB benchmark . 99 8.5 Introducing Virtuoso . 100 8.6 Analyzing benchmark results . 100 8.6.1 LDBC SNB IW benchmark, general analysis . 101 8.6.2 LDBC SNB IW benchmark, SF1 specific analysis . 102 8.6.3 LDBC SNB IW benchmark, SF3 specific analysis . 102 8.6.4 LDBC SNB IW benchmark, SF10 specific analysis . 103 8.6.5 Conclusions about LDBC SNB IW . 103 8.6.6 LDBC SNB BI benchmark on MonetDB on SF1, SF3 and SF10 . 103 8.7 Research questions . 103 9 Discussion & Conclusion 105 10 Future work 106 10.1 Research questions . ..