A Software Architecture for Progressive Scanning of On-line Communities Roberto Baldoni, Fabrizio d’Amore, Massimo Mecella, Daniele Ucci Cyber Intelligence and Information Security Center Dipartimento di Ingegneria Informatica, Automatica e Gestionale Sapienza Universita` di Roma, Italy Email: fbaldoni|damore|
[email protected] ,
[email protected] Abstract—We consider a set of on-line communities (e.g., news, U1,1 U1,2 U1,3 U1,4 U1,5 U1,6 blogs, Google groups, Web sites, etc.). The content of a community C1 t is continuously updated by users and such updates can be seen U2,1 U2,2 U2,3 U2,4 U2,5 U2,6 by users of other communities. Thus, when creating an update, C2 a user could be influenced by one or more updates creating a t U U U U U U semantic causal relationship among updates. This transitively will 3,1 3,2 3,3 3,4 3,5 3,6 C3 allow to trace how an information flows across communities. The t paper presents a software architecture that progressively scan a seman8c causal relaonship community update set of on-line communities in order to detect such semantic causal relationships. The architecture includes a crawler, a large scale storage, a distributed indexing system and a mining system. The paper mainly focuses on crawling and indexing. (a) Evolution along the time of a set of social communities showing how an update is causally influenced by other updates Keywords—On-line communities, progressive scanning, MapR, Nutch I. INTRODUCTION U1,1 U1,2 U1,3 U1,4 U1,5 U1,6 C1 On-line communities are nowadays a fundamental source U U U U U of information in business and information security intelligence.