Università Degli Studi Di Bergamo Moogle: a Web Crawler and Full-Text Search Application for Distributed Private Data
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITÀ DEGLI STUDI DI BERGAMO Dipartimento di Ingegneria Corso di Laurea Specialistica in Ingegneria Informatica Classe n. 35/S - Classe delle lauree specialistiche in ingegneria dell’informazione MOOGLE: A WEB CRAWLER AND FULL-TEXT SEARCH APPLICATION FOR DISTRIBUTED PRIVATE DATA Relatore: Chiar.mo Prof. Stefano Paraboschi Tesi di Laurea Specialistica Paolo COFFETTI Matricola n. 42519 ANNO ACCADEMICO 2013/2014 Abstract You want to search for that sushi restaurant someone recommended you last month but you don't remember its name: you type "sushi restaurant" in your laptop and you get an old tweet from a friend talking about Tokyo Sushi. You also get a comment you wrote on Facebook, an SMS message you sent your brother and a bookmark in your browser, all about the same restaurant. And imagine that you can do this with your smart phone, your laptop, tablet or smart TV. Something so basic yet so far from the reality. This is Moogle - My Own Google, the search engine for private data. This thesis describes the work I accomplished in order to design, develop and deploy a first solid version of Moogle following the iterative and incremental software development process outlined by Grady Booch. 1 Acknowledgments First and foremost I'd like to thank my parents and sisters for their patience and the support they provided me during my studies. I also want to express my gratitude to Professor Stefano Paraboschi for his precious guidance. I wish to thank my girlfriend Boyang for her loving support and who, once again, had to deal with those long hours of my absence while working on this project. A special acknowledgment goes to my dearest friends Marco Bontempi for his valuable and prompt technical advice and Simone Pilla for his daily backing. Finally yet importantly, thanks to United Academics and especially to Louis Lapidaire for his frequent encouraging words and material support. 2 Table Of Contents Abstract ............................................................................................................................. 1 Acknowledgments ............................................................................................................. 2 List Of Figures .................................................................................................................. 5 1. Introduction ................................................................................................................... 6 1.1. Purpose ............................................................................................................................... 8 2. Requirements ................................................................................................................ 9 2.1. Vision And Key Features ................................................................................................... 9 2.2. Use Case Analysis ............................................................................................................ 11 2.3. Non-functional Requirements .......................................................................................... 17 3. Software Development Process .................................................................................. 18 3.1. Macro Process: Phases And Milestones ........................................................................... 20 4. Architecture ................................................................................................................. 24 4.1. 4+1 Architectural View Model ......................................................................................... 25 4.2. Magpie Web Crawler ....................................................................................................... 27 4.3. Moogle Website ................................................................................................................ 46 5. Protocols And Data Entities ........................................................................................ 51 5.1. OAuth Protocol ................................................................................................................. 52 5.2. Facebook API ................................................................................................................... 53 5.3. Twitter API ....................................................................................................................... 61 5.4. Dropbox API .................................................................................................................... 70 5.5. Google API ....................................................................................................................... 76 5.6. Google Drive API ............................................................................................................. 78 5.7. Google Gmail API ............................................................................................................ 81 6. Task Queue With Redis .............................................................................................. 85 6.1. Redis Overview ................................................................................................................ 85 6.2. Data Structures ................................................................................................................. 88 6.3. Redis For Task Queues ..................................................................................................... 92 6.4. Data Model in Redis ......................................................................................................... 93 3 7. Full-Text Search Engine ............................................................................................. 99 7.1. Apache Solr: Key Concepts .............................................................................................. 99 7.2. Schema Design ............................................................................................................... 110 8. Conclusion And Future Developments ..................................................................... 121 Appendix A. Entities ..................................................................................................... 123 A.1. Facebook ........................................................................................................................ 123 A.2. Twitter ........................................................................................................................... 137 A.3. Dropbox ......................................................................................................................... 143 A.4. Google Drive ................................................................................................................. 145 A.5. Gmail ............................................................................................................................. 148 Appendix B. Providers Details ..................................................................................... 149 B.1. Facebook ........................................................................................................................ 149 B.2. Twitter ............................................................................................................................ 151 B.3. Dropbox ......................................................................................................................... 152 B.4. Google ............................................................................................................................ 153 Appendix C. Solr Schema ............................................................................................. 155 C.1. Twitter ............................................................................................................................ 155 C.2. Facebook ........................................................................................................................ 159 C.3. Dropbox ......................................................................................................................... 162 Bibliography ................................................................................................................. 167 4 List Of Figures Figure 2-A Main Use Cases……………………………………………………………12 Figure 3-A Macro process, phases and milestones…………………………………….20 Figure 4-A Main components of the system…………………………………………...24 Figure 4-B 4+1 Architectural View Model…………………………………………….26 Figure 4-C Top-level component diagram for Magpie………………………………...28 Figure 4-D SQLAlchemy as ORM interface to the database…………………………..29 Figure 4-E ER class diagram…………………………………………………………...29 Figure 4-F Class diagram for the Facebook crawler component.......…………………. 31 Figure 4-G Partial hierarchical structure of the different crawlers……………………..33 Figure 4-H Class diagram for the Facebook indexer component………………………33 Figure 4-I Partial hierarchical structure of the different indexers……………………... 35 Figure 4-J Class diagram for the Dropbox downloader component…………………... 36 Figure 4-K Sequence diagram for the Facebook crawling process…………………….38 Figure 4-L Sequence diagram for the Facebook indexing process……………………. 39 Figure 4-M Sequence diagram for the Dropbox downloading process………………...40 Figure 4-N Package diagram for Magpie………………………………………………42 Figure 4-O Deployment diagram for Magpie…………………………………………. 44 Figure 4-P Top-level component diagram for Moogle Website………………………. 47 Figure 4-Q ER class diagram………………………………………………………….