Part 1: Studying Tweet campaigns: methodological issues

Alex Frame, Arnaud Mercier, Gilles Brachotte and Caja Thimm - 9783631670095
Downloaded from PubFactory at 09/26/2021 02:00:34PM via free access

Éric Leclercq, Marinette Savonnet, Thierry Grison, Sergey Kirgizov & Ian Basaille
Laboratoire LE2I – UMR6306 – CNRS – ENSAM, Univ. Bourgogne Franche-Comté

1. SNFreezer: a Platform for Harvesting and Storing Tweets in a Big Data Context

Abstract

In this chapter we show how a multi-paradigm platform can fulfill the requirements of building a corpus of tweets and can reduce the time researchers wait before they can analyse the data. We highlight major issues such as the scalability of the tweet-collecting architecture and its failover mechanism.

1.1 Introduction and Objectives

In general, analysing complex interaction networks, including Twitter data, requires different types of algorithms with different theoretical foundations, such as graph theory, linear algebra, or statistical models. Regarding tweet analysis, the intrinsic links built by operators (i.e. hashtags denoted by #, user mentions denoted by @, replies, and retweets) have a strong impact on the data model and on the performance of the algorithms being used.

Addressing a scientific question often requires mixing different classes of algorithms, using different data models that retrieve data from different storage structures. For instance, graph-based algorithms using an adjacency matrix representation are useful for discovering community structure; a Laplacian matrix is useful to evaluate centrality. In general, graph-based algorithms are near-sighted: they do not take contextual information into account.
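As an illustration of how the operators shape the data model, the following minimal sketch extracts @-mentions and #-hashtags from a few tweets and derives an adjacency matrix and the Laplacian L = D - A. The sample tweets, the function names, and the choice to treat mentions as undirected links are assumptions made for this example; they are not taken from SNFreezer.

```python
import re

# Hypothetical sample tweets as (author, text); not from the SNFreezer corpus.
TWEETS = [
    ("alice", "@bob did you see #debate tonight?"),
    ("bob", "@alice yes, and @carol live-tweeted it #debate"),
    ("carol", "Thanks @bob!"),
]

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def mention_edges(tweets):
    """Return (author, mentioned_user) pairs found in the tweet texts."""
    return [
        (author, name.lower())
        for author, text in tweets
        for name in MENTION.findall(text)
    ]


def adjacency_matrix(edges):
    """Build a symmetric 0/1 adjacency matrix over the sorted user list.

    Assumption: a mention is treated as an undirected link.
    """
    users = sorted({user for edge in edges for user in edge})
    index = {user: i for i, user in enumerate(users)}
    size = len(users)
    matrix = [[0] * size for _ in range(size)]
    for src, dst in edges:
        if src != dst:  # ignore self-mentions
            matrix[index[src]][index[dst]] = 1
            matrix[index[dst]][index[src]] = 1
    return users, matrix


def laplacian(matrix):
    """Graph Laplacian L = D - A, with D the diagonal degree matrix."""
    size = len(matrix)
    return [
        [(sum(matrix[i]) if i == j else 0) - matrix[i][j] for j in range(size)]
        for i in range(size)
    ]


users, A = adjacency_matrix(mention_edges(TWEETS))
L = laplacian(A)
hashtags = sorted({tag.lower() for _, text in TWEETS for tag in HASHTAG.findall(text)})
```

The same edge list could equally be loaded into a graph store or a sparse matrix library; the dense list-of-lists form is used here only to keep the sketch dependency-free.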
Linear algebra algorithms can be used to identify large-scale structures; for instance, clusters can be found using Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). Machine learning algorithms and statistical models are used to predict links or behaviors, and to detect anomalies or events. In a context of massive datasets, the most efficient storage structure should be chosen according to the selected algorithms, knowing that several kinds of algorithms are usually needed.

Analysis of tweets can be performed at different levels of granularity: at an individual level, such as influence assessment or sentiment analysis, by extracting features that are not explicit; and at a corpus level, through the emergence of groups of users exhibiting similar behavior. Thus, possible outcomes of the analysis are the discovery of social structures and social positions, i.e. the roles of individuals.

Our contribution is an open source platform named SNFreezer (https://github.com/SNFreezer) that supports the management and the analysis of social data with different paradigms. We have developed a polyglot storage system to store and retrieve tweets in different structures that are able to scale up to the data flow requirements.

1.2 New Paradigms for Data Management Systems

Most enterprise business applications rely on relational database management systems (RDBMS). This technology is mature, widely understood and widely adopted.
However, some issues have recently emerged:
• RDBMS may not have adequate performance for massive datasets;
• RDBMS cannot provide the scalability required by online social network applications;
• The structure of the relational model can be too rigid or not relevant to deal with the variability of complex data networks;
• SQL was not designed to perform exploratory data analysis queries, which do not provide exact results.

Explicit and implicit links between social data can also be a hindrance to the use of RDBMS if they are combined with massive data. Explicit links are usually used by path queries that require joining tables, but when the length of the path is not known, SQL often needs to be embedded in a programming language. Implicit links can be discovered by data analysis, but the schema of the relational database cannot be quickly and easily updated according to newly discovered relationships.

Considering these drawbacks, a number of systems that do not follow the relational model paradigm have recently emerged. They are often grouped under the umbrella term NoSQL databases (Moniruzzaman 2013).

In general, NoSQL databases rely on schema-less data models and scale horizontally. Their common features are scalability and flexibility in the structure of data.
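The path-query limitation mentioned above can be sketched as follows: when the path length is not known in advance, a host program issues one SQL query per hop until no new user is found. The `mention` table, its sample rows, and the `reachable` function are hypothetical and only serve this illustration. Some engines also offer recursive SQL constructs; the loop below shows the host-language pattern the text refers to.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of hypothetical mention links (src -> dst)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mention (src TEXT, dst TEXT)")
    conn.executemany(
        "INSERT INTO mention VALUES (?, ?)",
        [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("erin", "alice")],
    )
    return conn


def reachable(conn, start):
    """Map each user reachable from `start` to its hop count.

    One SQL query is issued per hop, so the path length does not need to be
    known beforehand; the loop stops when a hop discovers no new user.
    """
    hops = {start: 0}
    frontier = [start]
    depth = 0
    while frontier:
        depth += 1
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT DISTINCT dst FROM mention WHERE src IN ({placeholders})",
            frontier,
        ).fetchall()
        # Keep only users that were not reached by a shorter path.
        frontier = [dst for (dst,) in rows if dst not in hops]
        for user in frontier:
            hops[user] = depth
    return hops


conn = build_demo_db()
paths_from_alice = reachable(conn, "alice")
```

Each iteration is equivalent to one more self-join of the table, which is why a fixed SQL statement cannot express the query when the number of hops is open-ended.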
NoSQL database management systems provide different solutions for specific problems: the volume of datasets is addressed by column-oriented or key-value NoSQL stores (HBase¹, Cassandra²); document and link management is supported by document databases (CouchDB³, MongoDB⁴); high densities of links, nodes and properties are taken into account in graph database management systems (GDBMS), which are also ideal for performing queries that walk down hierarchical relationships (Neo4j⁵, HypergraphDB⁶). XML-oriented databases provide a highly extensible data model but lack scalability in the context of social networks.

¹ http://hbase.apache.org/
² http://cassandra.apache.org/

NoSQL databases can be accessed through different APIs and query languages, so Atzeni et al. (2014) propose a common programming interface to NoSQL systems that hides the specific details of the various systems from application developers. The TinkerPop project⁷ adopts a similar approach for graph databases. It introduces a graph query language, Gremlin, a domain-specific language based on Groovy⁸ and supported by most GDBMS. Unlike most query languages, which are declarative, Gremlin is an imperative language focusing on graph traversals.

The multi-paradigm principle tends to generalize these different approaches. In modelling, multi-paradigm approaches address the need to use multiple modelling paradigms to design complex systems (Hodge et al. 2011).
Indeed, complex systems require the use of multiple modelling languages to: 1) cope with the inherent heterogeneity of such systems; 2) offer different points of view on all their relevant aspects; 3) cover different activities of the design cycle; 4) allow reasoning at different levels of detail during the design process (Hardebolle and Boulanger 2009). As a result, multi-paradigm modelling addresses three orthogonal directions of research: 1) multi-formalism modelling, concerned with the coupling of and transformation between models described in different formalisms; 2) model abstraction, concerned with the relationship between models at different levels of abstraction; 3) meta-modelling, concerned with the description of classes of models dedicated to particular domains or applications, called Domain Specific Languages (DSL).

Multi-paradigm data storage, or polyglot persistence, uses multiple data storage technologies, chosen according to the way data is used by applications and/or algorithms (Sharp et al. 2013). As Ghosh (2010) states, storing data the way it is used in an application simplifies programming and makes it easier to decentralize data processing. ExSchema (Castrejon et al. 2013) is a tool that enables automatic discovery of data schemas from a system that relies on multiple document, graph, relational, and column-family data stores.
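The idea of polyglot persistence can be sketched with a toy class in which a single write fans out to three stores, each shaped for a different kind of query. Everything here is an assumption made for illustration: the `PolyglotStore` class, its method names, and the in-memory stand-ins (SQLite for the relational store, a dictionary of JSON strings for the document store, a dictionary of sets for the graph store) are not SNFreezer's actual components or API.

```python
import json
import sqlite3
from collections import defaultdict


class PolyglotStore:
    """Toy polyglot store: one write fans out to three in-memory back ends."""

    def __init__(self):
        # Relational stand-in: fixed columns, suited to aggregate queries.
        self.relational = sqlite3.connect(":memory:")
        self.relational.execute(
            "CREATE TABLE tweet (id INTEGER PRIMARY KEY, author TEXT, text TEXT)"
        )
        # Document stand-in: the raw tweet kept whole, keyed by its id.
        self.documents = {}
        # Graph stand-in: author -> set of mentioned users.
        self.graph = defaultdict(set)

    def save(self, tweet):
        """Store one tweet (a dict) in all three representations."""
        self.relational.execute(
            "INSERT INTO tweet VALUES (?, ?, ?)",
            (tweet["id"], tweet["author"], tweet["text"]),
        )
        self.documents[tweet["id"]] = json.dumps(tweet)
        for user in tweet.get("mentions", []):
            self.graph[tweet["author"]].add(user)

    def count_by_author(self, author):
        """Aggregate query, answered by the relational store."""
        row = self.relational.execute(
            "SELECT COUNT(*) FROM tweet WHERE author = ?", (author,)
        ).fetchone()
        return row[0]

    def raw(self, tweet_id):
        """Full original record, answered by the document store."""
        return json.loads(self.documents[tweet_id])

    def neighbours(self, author):
        """Link query, answered by the graph store."""
        return sorted(self.graph.get(author, ()))


store = PolyglotStore()
store.save({"id": 1, "author": "alice", "text": "@bob hello", "mentions": ["bob"]})
store.save({"id": 2, "author": "alice", "text": "@carol hi", "mentions": ["carol"]})
```

Each read method is routed to the store whose structure matches the question asked, which is the design choice the paragraph above attributes to polyglot persistence.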
³ http://couchdb.apache.org/
⁴ https://www.mongodb.org/
⁵ http://neo4j.com/
⁶ http://hypergraphdb.org/
⁷ http://www.tinkerpop.com
⁸ http://groovy.codehaus.org/

1.3 SNFreezer Architecture

In order to collect tweets during the political campaign, we started by analysing existing solutions and retained YourTwapperKeeper⁹ (YTK), an open source project that claims to provide users with a tool that archives data from Twitter directly on a server. After a period of tests and code review, we identified some major drawbacks: YTK was not able to collect tweets in various languages; the choice of database engine limits the volume of datasets; and it does not retrieve information on accounts, such as the list of following/followers or the timeline of the users. Thus, we chose to enhance YTK with a real storage layer and to add database connectors in order to allow analysis tools such as R to retrieve data directly from SNFreezer repositories.

1.3.1 The Storage System

To address the problem of tweet storage, both in terms of performance and interoperability (i.e. easy connection of third party tools), we have specified and developed a storage layer. The proposed polyglot persistence storage layer includes relational databases (RDBMS), a graph data store (GDBMS), and a scalable document database management system (DDBMS) that can be used