Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse

April Fitzpatrick Akshay Goel Leah Hamilton Ramya Nandigam Esther Robb

CS 4984/CS 5984 - Big Data Text Summarization Edward A. Fox Virginia Tech, Blacksburg, VA 24061

December 13, 2018

Table of Contents

Table of Figures
Table of Tables
Executive Summary
Introduction
Literature Review
Approach
Cleaning the Data
    jusText Cleaning and Initial Modifications
    jusText Evaluation Results
    Parameter based cleaning + Similar Document Removal
Module 1: Most Frequent Words
Module 2: WordNet Synsets
Module 3: Parts of Speech Tagging
Module 4: Words/Word Stems of Discriminating Features
Module 5: Frequent and Important Named Entities
Module 6: Important Topics
    6.1 LSA/LDA
    6.2 Clustering
    6.3 Plotting Clusters
    6.4 Final Topic Results
Module 7: Extractive Summary of Important Sentences
    7.1 Overall Summary Approach
    7.2 Clustered Summaries Approach
Module 8: Values for Each Slot Matching Collection Semantics
Module 9: Readable Summary Explaining Slots and Values
Module 10: Readable Abstractive Summary
    10.1 Pointer-Generator Network (PGN)
    10.2 FastAbsRL


    10.3 AbTextSumm
Evaluation
    ROUGE Evaluation
    Solr Indexing
    Gold Standard Summaries
User's Manual
    Cleaning Data
    Single Word and Entity Based Summary Techniques
    Clustering
    Complete Event Summaries
    Evaluating Gold Standard Summaries
Developer's Manual
    Flow of Development
    Data Processing
    Maintaining and Extending
    List of Files Submitted
Lessons Learned
    Timeline
    Challenges Faced
    Solutions Developed
Conclusions and Future Work
    Future Work
    Conclusions
Acknowledgments
References
Appendices
    Appendix A: Extractive Summaries
    Appendix B: Abstractive Summaries
    Appendix C: Gold Standard Summaries


Table of Figures

1. Flowchart outlining dataset cleaning workflow

2. Dataset cleaning parameters

3. List of entity types supported by SpaCy

4. Flowchart detailing all clustering techniques used

5. Results of K-means clustering

6. Number of K-means clusters vs. sum of squared error within cluster

7. Different results produced from clustering

8. Flowchart of development process

9. Original Gantt chart for 10/8-11/16


Table of Tables

1. Most frequent words in the Solar Eclipse corpus

2. Most frequent words in the Facebook corpus

3. Most frequent WordNet synsets from the Solar Eclipse corpus

4. Most frequent WordNet synsets from the Facebook corpus

5. The most common tagged nouns, verbs, and adjectives from the solar eclipse corpus. Adverbs not included.

6. The most common tagged nouns, verbs, and adjectives from the Facebook corpus. Adverbs not included.

7. Stemming Results

8. Lemmatization Results

9. Named Entity Results

10. LDA results, 5 topics (based on word frequency)

11. LSA results (top 5 topics, based on word frequency)

12. Two Paragraph Summary Clusters

13. A list of planned pieces of information to be extracted from the corpus for the template summary

14. Example Single Document Abstractive Summaries

15. A comparison of Cluster 0 and 24 summaries pre- and post-abstraction

16. Calculated ROUGE results

17. ROUGE results based on sentences

18. Entity Coverage results


Executive Summary

As technologies for mass communication have continued to progress, the amount of text available to be read has increased rapidly, requiring new strategies for identifying relevant and important information. Summaries are desirable for events or topics where a large body of information exists, so that the reader does not need to read redundant, irrelevant, or unimportant information. Automated methods can summarize a larger volume of source material in a shorter amount of time, but creating a good summary with these methods remains challenging. This paper presents the work of a semester-long project in CS 4984/5984 to generate the best possible summary of a collection of 10,829 web pages about the Facebook-Cambridge Analytica data breach, with some early prototyping done on 500 web pages about the 2017 Solar Eclipse.

This report covers implementations and results of natural language processing techniques such as word frequency, lemmatization, and part-of-speech tagging, working up to the creation of a complete human-readable summary. Data cleaning was critical, as text was initially interspersed with boilerplate and HTML, and there were many articles from different sources with identical text. Documents were categorized using k-means clustering on the topics resulting from Latent Semantic Analysis, which was useful both for filtering out irrelevant documents and for ensuring that all relevant topics in the ongoing news coverage were present in the summary even if most documents contained only the initial details. Extractive, abstractive, and combination methods were tested and evaluated for their ability to generate full summaries of the data breach and its consequences.

The summary subjectively evaluated as best was a purely extractive summary built from concatenating summaries of document categories. This method was coherent and thorough, but involved manual tuning to select categories and still had some redundancy. All attempted methods are described and the less successful summaries are also included. This report presents a framework for how to summarize complex document collections with multiple relevant topics. The summary itself identifies information which was most covered about the Facebook-Cambridge Analytica data breach and is a reasonable introduction to the topic.


Introduction

This paper documents our work for the Fall 2018 section of CS 4984/5984: Big Data Text Summarization. The objective of the class was to take the text of a large number of articles scraped from the Internet pertaining to a specific event and summarize them in a human-readable format. The first few techniques (specified in detail in the modules below) were prototyped on a dataset of 500 articles about the 2017 solar eclipse, with later work being done on articles about the Facebook data breach. For the Facebook dataset, a subset of 500 articles was again used for prototyping, with final results coming from the entire set of 10,829 articles.

PySpark was used to interface with a Hadoop cluster for some units, but with the 2,922 articles that remained after removing duplicate or empty pages and cleaning the documents extensively, the Python code for most units could be run in parallel on a single machine in a reasonable amount of time.

The project began with some smaller tasks such as finding the most frequent words, part of speech tagging, and lemmatization. These tasks were useful for familiarizing us with natural language processing and the content of our dataset without getting overwhelmed by the huge task in front of us.

In the end, we created full summaries using various methods: extractive, abstractive, and template based. All of these approaches are described and most yielded decent results, with the extractive summary done on clustered topics being our best summary. The cluster-based extractive summary consists of two 300-word paragraphs with each paragraph covering a different set of clusters. The first paragraph focuses on general information about the event and the second paragraph is about the effects of the data breach. Although the result is a little choppy due to it being a concatenation of sentences from different articles, it gives a fairly good and broad overview of the entire event.


Literature Review

The Natural Language Toolkit (NLTK) textbook provided a great starting point for all of the early modules in this project. [1] gave insight into creating frequency distributions, doing part of speech tagging with lemmatization and stemming, and creating WordNet synsets [2]. For anyone who needs to understand the basics of natural language processing with Python, this book is essential.

The method pioneered by Chen and Bansal [3, 4] uses a combination of neural networks and reinforcement learning to create abstractive summaries from single documents. They first select the most relevant sentences extractively using one neural network, then rewrite those sentences abstractively using a second neural network. To bridge the two networks, they use reinforcement learning, with a sentence-level reward policy based on ROUGE score. The 'agent' being trained by the reinforcement learning is the extractive summarizer. This model replaced the pointer-generator model [5, 6], which switches between extracting text and writing novel text during the decoding step, as the new state of the art.

[7] uses a variety of approaches to extract named entities from an article. The first approach is ne_chunk, which classifies named entities based on three labels: PERSON, ORGANIZATION, and GPE (geopolitical entity). NLTK's ne_chunk requires that the input be a list of pos-tagged (part-of-speech tagged) words. The other approach Li uses is SpaCy [8], which classifies named entities based on multiple labels, including PERSON, ORGANIZATION, LOCATION, LANGUAGE, DATE, and TIME. SpaCy does not require a pos-tagged input, as it interprets the grammar of the sentence as part of its processing pipeline.

The method pioneered by Banerjee, Mitra, and Sugiyama [9,10] uses a four step approach to building an abstractive summary. As described in the paper, the first step is to identify the most important document inside the provided corpus.


Following that, clusters of similar sentences are generated by aligning sentences to the ones in the most important document. Third, K-shortest paths from each sentence in the clusters are generated using a word-graph structure. Finally, sentences are selected from this generated set of shortest paths using an integer linear programming model while maximizing readability and content.


Approach

Our approach is broken into the sub-tasks that we completed in building our final text summaries. This included preprocessing, such as cleaning the dataset first. After cleaning the data (and re-cleaning it as we found new bad data during our work), we set out to complete each of the ten tasks assigned to us. Many of the steps could be done concurrently, with each team member taking the lead on selected tasks.


Cleaning the Data

In both the solar eclipse and Facebook breach datasets, there were a number of unusable records that had to be filtered out. We also found many seemingly valid records that contained only JavaScript or navigation items, rendering them garbage. The flowchart in Figure 1 shows the data flow process.

Figure 1. Flowchart outlining dataset cleaning workflow

jusText Cleaning and Initial Modifications

First, the script ArchiveSpark_WarcToJson.scala was used to extract all text and ​ ​ HTML within the WARC file. It also used the page metadata to keep only the most recent version of any pages which were scraped more than once (based on URL) and then exported the data as a .json file. Earlier versions of this script kept only the text within the tag, and then all text and HTML within the tag. By keeping more HTML later into the cleaning process, the files were larger and took longer to process, but the tags could be used for processing the text and information could be scraped out of the header.


Then, the script jusTextCleaningBig.py was run. The goal of this script was to ​ ​ extract readable text bodies from each document’s HTML using jusText [11,12] and then remove records that had garbage/unusable/empty results.

First, jusTextCleaningBig.py runs jusText on each entry. jusText is a Python package that parses the HTML and then uses a heuristic algorithm to decide whether each <p> tag is relevant text or boilerplate. The main metrics used are whether a paragraph is grammatical English text and the density of links in the paragraph. By running jusText on every record, the remaining dataset contained human-readable English text from all our documents.

The file jusTextCleaningBig.py also uses the Python package Beautiful Soup [13] to scrape a best-guess publication date out of the tags in each article's header. This was an early plan for classifying articles, but it was not possible to get correct publication dates for all articles using this method.

At this point, there were still a significant number of documents with ‘bad’ information, such as completely empty pages, 404s, etc. jusTextCleaningBig.py ​ uses the .contains() function to remove entries whose titles contained words and ​ ​ phrases like “forbidden”, “error”, or “access denied”, as we had multiple entries with little or no text that were due to errors accessing the page. The exact list of phrases used to filter bad documents is given below.

➔ Forbidden
➔ Page not found
➔ 404
➔ 403
➔ 408
➔ ERROR
➔ Access Denied


We observed that all documents with titles containing one or more of these phrases did not contain any useful text.

We also observed a substantial number of records with no text at all after parsing. jusText had completely filtered out the text of some documents, such as those that merely redirected to a different location or those that were in a language other than English. We ran a quick filter query to remove all documents that were left with no text.
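The following is a minimal sketch of this filtering step. It assumes the cleaned records are available as a list of Python dictionaries with title and text fields; the helper names are illustrative and not the exact code in jusTextCleaningBig.py.

BAD_TITLE_PHRASES = ["forbidden", "page not found", "404", "403", "408",
                     "error", "access denied"]

def is_bad_record(record):
    """Return True if a record looks like an error page or has no usable text."""
    title = (record.get("title") or "").lower()
    text = (record.get("text") or "").strip()
    if not text:
        # jusText stripped everything (e.g., redirects or non-English pages).
        return True
    return any(phrase in title for phrase in BAD_TITLE_PHRASES)

def filter_records(records):
    """Keep only records that pass the title and empty-text checks."""
    return [r for r in records if not is_bad_record(r)]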

Following this process, once we were down to about 6,285 documents, we realized that all of our data was encoded in ASCII, which was causing some issues. Text such as Å, å, Ů, and ů was appearing as escaped hex byte sequences and influencing our results. To get rid of these, we had to decode the ASCII and re-encode it in UTF-8. The code to do this can be found in remove_hex_characters.py.

jusText Evaluation Results

Since jusText is an older external library, we were naturally slightly skeptical about its performance and parsing quality. To make sure that jusText did not remove any important documents, we ran some analysis and evaluated the results on both the small and big datasets.

The smaller dataset contained 475 documents, and when using jusText to clean them, all text was removed from 154 documents, leaving empty strings. This amounts to removing 32% of the small dataset. Since the removal of this much of the dataset was initially a concern, we confirmed that all 154 documents were indeed of no value to us.

● 54 / 154 deleted documents consisted only of inaccessible pages such as 404, error pages, page not found, forbidden, 403, etc. (35%)
● 55 / 154 deleted documents were in a different language. (36%)


● 45 / 154 deleted documents were pages that ended up redirecting you to different locations, resulting in the HTML body not containing anything relevant. (29%)

The large dataset had 2,398 documents deleted. Although this seems like a lot, the deleted documents were similar to those in the small dataset.

● 994 / 2398 documents were in a foreign language. (41%)
● 1404 / 2398 documents contained video links, broken pages, 404, 403, page not founds, redirection links, etc. (59%)

Conclusion: Almost 100% of the deleted documents were indeed bad, and jusText performed very well. This does not guarantee that all bad documents from the entire dataset were deleted; it only guarantees that no good documents were deleted.

Parameter based cleaning + Similar Document Removal

Once all the steps before this stage were completed, we sent our dataset through furthur_cleaning.py. The job of this script is to remove documents classified as 'bad' based on whether they pass or fail certain conditions. The input to this file is expected to be a .json file containing a series of JSON objects with the following fields: {title, originalurl, text}. When a record is one large document, text is a single large string. Since we also support functionality to perform evaluations based on paragraphs, text can also be an array of strings where each entry represents a paragraph. The code handles both inputs correctly.

Cleaning was performed based on the filters shown in Figure 2. Note that these conditions were added over time as we found bad documents in our dataset.


Figure 2. Dataset cleaning parameters (Blue: field in JSON object, Yellow: condition)

Our final step was removing similar documents.

While working with our dataset and creating preliminary summaries, we observed that our text contained many duplicate sentences. It did not take us long to realize that there were a huge number of documents that were exact copies of one another, or that at the very least had borrowed most of their text from other articles. This was leading to a lot of duplicate sentences in our corpus and negatively influencing both document classification and summarization.

Additionally, we noticed that there were many documents with the exact same title, and as expected, they also contained the exact same text.

To tackle this, we wrote a script called remove_similar_documents.py that compares every pair of documents and deletes the smaller document if their similarity exceeds a provided threshold. To check the similarity of two documents, the algorithm takes the intersection of the sets of unique sentences in the two documents and compares its size to the size of the smaller document.

For all documents D_i, D_j with len(D_j) < len(D_i):

    Let S_k be the set of unique sentences in document D_k.

    Similarity(i, j) = |S_i ∩ S_j| / |S_j|

The algorithm to remove all document pairs with a similarity score greater than the threshold will do the following:

1. Sort all documents in descending order of size.
2. Compute the similarity score for all document pairs (i, j) where j > i.
3. If the similarity is above the threshold, delete document j.

This algorithm executes in O(n²) time in the worst case. When there is a large proportion of similar or duplicate documents, as in our corpus, the true runtime is far less than O(n²): since a large number of documents get deleted and the search space is pruned very rapidly, up to n comparisons are eliminated with every removal.
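A minimal sketch of this procedure is shown below. It assumes each document is a dictionary with a text field and uses NLTK's sent_tokenize for sentence splitting; it mirrors the description above rather than reproducing remove_similar_documents.py exactly, and its default threshold matches the 0.4 value discussed next.

from nltk.tokenize import sent_tokenize

def similarity(sents_i, sents_j):
    """Overlap of unique sentences, measured against the smaller sentence set."""
    smaller = min(len(sents_i), len(sents_j))
    if smaller == 0:
        return 0.0
    return len(sents_i & sents_j) / float(smaller)

def remove_similar(docs, threshold=0.4):
    """Greedily drop the smaller document of any pair whose similarity exceeds the threshold."""
    # Sort by text length, descending, so the larger document of a similar pair is kept.
    docs = sorted(docs, key=lambda d: len(d["text"]), reverse=True)
    sentence_sets = [set(sent_tokenize(d["text"])) for d in docs]
    keep = [True] * len(docs)
    for i in range(len(docs)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(docs)):
            if keep[j] and similarity(sentence_sets[i], sentence_sets[j]) > threshold:
                keep[j] = False
    return [doc for doc, kept in zip(docs, keep) if kept]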

The best threshold value we found was 0.4, meaning that if two documents are more than 40% similar, we delete the smaller document. At the end of this process, we were left with 2,922 documents. Although this is fairly small compared to what we started with, our summaries and topics came out much better, as we will see in the following sections.

Module 1: Most Frequent Words

The goal of this module was to get a list of the most common important words based on their cumulative frequency across all records in the dataset. If raw word counts were used, stopwords such as "and" and "the" made up the majority of the most frequent words. To correct for this, we removed all words in the NLTK stopword list and added our own list, which mostly consisted of punctuation marks and some additional words like "would", "though", and "upon".

In later tests, due to a large amount of non-standard punctuation introduced at some point in the HTML-parsing workflow, we largely switched from using a fixed list of punctuation to keeping only words at least 3 letters long. As can be seen in the Solar Eclipse data (Table 1), there are still several punctuation marks and other artifacts being counted as words; these ultimately had to be filtered out by allowing only ASCII characters in remove_hex_characters.py, as described in the Cleaning the Data section of the report. This is because characters such as â are represented by multiple bytes in UTF-8, and in Python 2.7 the len() function counts them as longer strings. Tables 1 and 2 show our results for this module.
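A minimal sketch of the counting logic is given below. It assumes documents is a list of cleaned text strings; the extra stopwords shown are only a sample of the list used in Unit1.py, and the isalpha() filter means this version will not reproduce tokens like "2017" or "n't" that appear in Tables 1 and 2.

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

EXTRA_STOPWORDS = {"would", "though", "upon"}   # sample of the custom additions
STOPWORDS = set(stopwords.words("english")) | EXTRA_STOPWORDS

def most_frequent_words(documents, top_n=33):
    counts = Counter()
    for text in documents:
        for token in word_tokenize(text.lower()):
            # Keep only alphabetic words of at least 3 letters, as described above.
            if len(token) >= 3 and token.isalpha() and token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(top_n)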


Table 1. Most frequent words in the Solar Eclipse corpus

Word Frequency Word Frequency Word Frequency

eclipse 5374 see 300 event 151

solar 2222 one 294 nasa 150

sun 1249 â 266 view 144

total 967 2017 220 city 140

totality 639 oregon 206 earth 137

moon 597 viewing 189 park 133

aug. 595 â 187 minutes 126

glasses 499 via 176 first 120

people 367 images 173 near 116

path 330 time 162 state 116

2017 319 eclipses 159 says 107


Table 2. Most frequent words in the Facebook corpus

Word Frequency Word Frequency Word Frequency

facebook 29192 social 1954 could 1144
data 20389 one 1908 apps 1094
cambridge 7215 zuckerberg 1887 used 1075
analytica 5718 n't 1851 time 1074
information 4772 app 1770 google 1074
users 4234 also 1687 companies 985
people 3710 media 1639 ads 920
company 3346 new 1525 election 910
privacy 2879 use 1507 wylie 895
campaign 2750 news 1494 firm 837
trump 2652 political 1474 kogan 804

Both of these results have some noise and some words that show up in multiple forms (Facebook and Facebook's, total and totality, viewing and view, eclipse and eclipses), as well as a number of synonyms. But you can start to get an impression of what each dataset is generally about. The eclipse dataset has a lot of words referring to viewing, eclipse glasses, and astronomical bodies. The Facebook dataset has more words related to the people and entities involved (Cambridge Analytica, Facebook, etc.) and words referring to privacy and the US election. These words are mainly helpful as a topical overview and a beginning idea of which kinds of words are most important in each dataset.


The most recent code for this module is in Unit1.py, and returns the counts both ​ ​ by printing to console and writing a CSV file.


Module 2: WordNet Synsets

The goal of this module was to use WordNet to combine frequency counts for synonyms. WordNet is a database that relates a large number of English words as synonyms [2]. It also has relationships such as hypernyms and hyponyms (“car” is a hypernym of “SUV”, and “SUV” is a hyponym of “car”), but we decided not to use these relationships mostly because our most frequent words in both the eclipse and Facebook datasets didn’t have any obvious hyponym or hypernym relationships. Additionally, other teams found that converting to hypernyms generally made their frequency data too general without collapsing any duplication of meaning.

In WordNet, each English word can have multiple “senses” for its different definitions, each of which is tied to a specific part of speech (POS). This means that each word had to be tagged as either a verb, noun, adverb, or adjective (the only parts of speech included in WordNet), and then the word sense had to be disambiguated (WSD) from all possible senses for that word based on context.

The algorithm we used for this task was the Lesk algorithm, based off the work of Michael Lesk [14]. It assigns a sense to a word such as “drew” in a sentence such as “He drew his bow” by maximizing the overlap between the words present in the dictionary definition of a given sense of “drew” and the words present in the dictionary definitions of the other words in the sentence (the relevant definitions of “bow” and “drew” might both mention a weapon, for instance).
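A minimal sketch of synset counting with NLTK's implementation of the Lesk algorithm is shown below; it assumes sentences is a list of raw sentence strings, whereas Unit2Lemmas.py additionally lemmatizes and POS-tags tokens before disambiguation.

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

def count_synsets(sentences):
    counts = Counter()
    for sentence in sentences:
        tokens = word_tokenize(sentence.lower())
        for token in tokens:
            # Pick the sense whose dictionary gloss overlaps most with the sentence context.
            synset = lesk(tokens, token)
            if synset is not None:
                counts[synset] += 1
    return counts

# count_synsets(...).most_common(20) yields pairs such as (Synset('eclipse.n.01'), 5784),
# whose lemmas are ['eclipse', 'occultation'] as in Table 3.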

The original prototype only included POS tagging, but after developing the lemmatizer for Module 4, we went back and modified the code to lemmatize the words before WSD. The results shown below, in Tables 3 and 4, were generated with this lemmatization step applied before WSD.


Table 3. Most frequent WordNet synsets from the Solar Eclipse corpus

Synset Frequency

['eclipse', 'occultation'] 5784

['solar'] 2563

['sunlight', 'sunshine', 'sun'] 1700

['witness', 'find', 'see'] 1407

['suppose', 'say'] 1281

['moonlight', 'moonshine', 'Moon'] 1038

['totality'] 942

['methamphetamine', 'methamphetamine_hydrochloride', 'Methedrine', 'meth', 'deoxyephedrine', 'chalk', 'chicken_feed', 'crank', 'glass', 'ice', 'shabu', 'trash'] 783

['time'] 730

['people'] 700

['twenty-one', '21', 'xxi'] 655

['way', 'path', 'way_of_life'] 648

['watch', 'view', 'see', 'catch', 'take_in'] 627

['one'] 603

['event'] 576

['suffer', 'sustain', 'have', 'get'] 554


['state_of_matter', 'state'] 540

['travel', 'go', 'move', 'locomote'] 513

['search', 'look'] 509

['worldly_concern', 'earthly_concern', 'world', 'earth'] 497

['watch'] 492

['two', '2', 'ii'] 480

['moment', 'mo', 'minute', 'second', 'bit'] 475

The results are similar to the results from module 1 for this particular dataset, although the word “total” (the occurrences of which were split across multiple synsets) fell significantly below other words which existed in the original text in more diverse word forms. There are far more verbs (watch, search/look, travel, say, see) higher up on the frequency list of synsets here, probably largely because verbs tend to exist in more forms.

Another interesting thing to note is the high frequency of the word sense attributed to "methamphetamine". This was not the case before lemmatization; it seems that the lemmatizer changed every instance of the word "glasses" (referring to eclipse glasses) into "glass", which WordNet correctly recognizes as a different word with different meanings. Other issues seem more straightforward. The "state of matter" synset more likely refers to U.S. states, and the "way of life" synset more likely refers to the path of totality. We suspect that the short definitions in the WordNet dictionary made WSD difficult for some of these more abstract words with many senses.


Table 4. Most frequent WordNet synsets from the Facebook corpus

Synset Frequency

['datum', 'data_point'] 23244

['suppose', 'say'] 13439

["ship's_company", 'company'] 10329

['user'] 9715

['Cambridge_University', 'Cambridge'] 9579

['use'] 9294

['information', 'selective_information', 'entropy'] 8530

['people'] 7406

['witness', 'find', 'see'] 6060

['political_campaign', 'campaign', 'run'] 5724

['take', 'make'] 5566

['privacy', 'privateness', 'secrecy', 'concealment'] 5460

['besides', 'too', 'also', 'likewise', 'as_well'] 5304

['social'] 5297

['time'] 5132

['trump'] 4609

['metier', 'medium'] 4573


['new'] 4523

['uranium', 'U', 'atomic_number_92'] 4416

['take'] 4277

['travel', 'go', 'move', 'locomote'] 4260

['one'] 4147

The only big differences in these results versus the non-lemmatized version are that the counts of most of the synsets increased and “trump” showed up much higher on this list.

WordNet synsets are not a particularly good tool for analyzing text about the Facebook data breach. The approach drops nearly all proper nouns like "Facebook" and "Zuckerberg", misinterprets the proper nouns it keeps (such as "Cambridge", which is taken to refer to the university rather than Cambridge Analytica), and also misinterprets many technology-related words. As seen above, "company" is interpreted as being related to ships, and lower down on the list "platform" is interpreted as being related to weapons and "profile" is also misinterpreted.

The final code for this unit is found in Unit2Lemmas.py. An earlier version of the ​ ​ code which uses part-of-speech tagging but does not lemmatize is found in Unit2Prototype.py.


Module 3: Parts of Speech Tagging

The goal of this module was to use part-of-speech tagging to improve the frequency counts for important words. To accomplish this, we used NLTK's built-in pos_tag() tool, a pre-trained statistical tagger that takes each clause (a string of words separated by punctuation such as a period or comma) and classifies each word based on the context of the sentence.
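The sketch below shows the kind of POS-filtered frequency count used to produce Tables 5 and 6; documents is assumed to be a list of cleaned text strings, and Unit3.py may differ in tokenization details.

from collections import Counter
import nltk

def frequent_by_pos(documents, tag_prefix, top_n=5):
    """Count the most frequent words whose Penn Treebank tag starts with tag_prefix
    ('NN' for nouns, 'VB' for verbs, 'JJ' for adjectives)."""
    counts = Counter()
    for text in documents:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag.startswith(tag_prefix) and word.isalpha():
                counts[word.lower()] += 1
    return counts.most_common(top_n)

# frequent_by_pos(docs, "NN") -> top nouns; "VB" -> top verbs; "JJ" -> top adjectives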

While we determined that certain parts of speech were more interesting for the different datasets, this module was mostly useful to assist the accuracy of other algorithms, like stemming and word sense disambiguation.

Table 5. The most common tagged nouns, verbs, and adjectives from the solar eclipse corpus. Adverbs not included.

Nouns Verbs Adjectives

Noun Frequency Verb Frequency Adjective Frequency

eclipse 5373 said 949 solar 2563
sun 1637 see 713 total 1364
moon 1001 get 388 first 373
totality 942 watch 323 partial 312
glasses 726 viewing 306 american 272


Table 6. The most common tagged nouns, verbs, and adjectives from the Facebook corpus. Adverbs not included.

Nouns Verbs Adjectives

Noun Frequency Verb Frequency Adjective Frequency

facebook 28434 said 9335 social 5297
data 23231 used 3931 new 4481
cambridge 9524 facebook 2753 political 3633
information 8529 use 2642 personal 3142
users 8080 make 2482 many 2135

As can be seen in Tables 5 and 6, the verbs in the solar eclipse dataset were mostly related to seeing, and are thus more specific to the corpus than those from the Facebook set. The nouns are largely the most characteristic of the corpora, however. Aside from adverbs in the solar eclipse set like "totally" and "completely", the adverbs do not offer much insight.

The code for this unit is found in Unit3.py. ​ ​


Module 4: Words/Word Stems of Discriminating Features

The goal of this module was to improve the most frequent word count by either lemmatizing or stemming the words before calculating their frequencies. We decided to try both lemmatization and stemming, to see which results were more relevant.

For lemmatization, we used NLTK’s WordNetLemmatizer. This was a fairly ​ ​ straightforward process. We were able to reuse a lot of the most frequent word code, where we just lemmatized each word before counting the frequencies. All of this code can be found in Unit4_Lemmas.py. ​ ​

For stemming, we tried two different NLTK-supported stemmers, Porter and Snowball. After some research, we decided to use Snowball with an English dictionary instead of Porter, because it is newer and is widely held to be better than the Porter stemmer [1]. We briefly looked at using the Paice/Husk stemmer, but it is too aggressive a stemmer to be useful to us; it "overstems," cutting words down so far that the sense of many words is lost, which is not good for automatically generating a summary [15]. The stemming process was the same as the lemmatization process, just stemming each word before counting it instead of lemmatizing (the code can be found in Unit4_Stems.py). Results from both can be found in Tables 7 and 8.
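The sketch below compares the two normalization options; it assumes words is a list of lowercase tokens and omits the stopword filtering done in Unit4_Stems.py and Unit4_Lemmas.py.

from collections import Counter
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

def lemma_counts(words):
    return Counter(lemmatizer.lemmatize(w) for w in words)

def stem_counts(words):
    return Counter(stemmer.stem(w) for w in words)

# stemmer.stem("companies") -> "compani" (not an English word), while
# lemmatizer.lemmatize("companies") -> "company" (a real word), which is why
# the lemmatized counts were preferred for the template summary.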


Table 7. Stemming Results

Word Frequency Word Frequency Word Frequency

facebook 32043 social 2012 call 1201

data 20419 polit 1966 googl 1191

cambridg 7285 work 1958 firm 1164

use 7143 time 1898 year 1155

user 6950 n't 1851 post 1152

compani 6569 advertis 1706 could 1144

analytica 6175 also 1687 friend 1141

inform 5272 elect 1663 set 1087

peopl 4083 media 1656 get 1084

campaign 3757 report 1636 wyli 1018

app 3749 new 1526 kogan 996

trump 3267 news 1494 access 995

privaci 2884 make 1358 million 920

zuckerberg 2154 ani 1305 want 913

person 2072 account 1246 collect 906

one 2066 share 1224 target 890

The order of the most frequent words after stemming did not change much; however, the counts of each word went up quite a bit. This is due to the fact that words like “company” and “companies” are being combined into one word count of “compani.” This should give us a better representation of word frequency, but poses an issue because it stems words down to non-English words (e.g., “compani”).

Table 8. Lemmatization Results

Word Frequency Word Frequency Word Frequency

facebook 28328 use 1771 kogan 835

data 20757 zuckerberg 1723 account 830

cambridge 7044 new 1602 personal 825

analytica 5598 news 1588 firm 811

information 5303 political 1552 user 810

people 3972 media 1546 facebooks 797

users 3795 time 1181 friends 784

company 3659 apps 1150 ... 722

campaign 2910 could 1115 may 717

privacy 2885 google 1093 get 710

trump 2669 election 1060 online 700

social 2028 used 1004 says 695

n't 1928 companies 952 see 691

app 1889 wylie 896 digital 682

one 1879 access 896 work 663

also 1775 ads 880 make 654


The lemmatization shows very similar results to those obtained with stemming. Many of the first few words show up with higher frequencies because they have been combined with their lemmas. The big difference here between the stemming and lemmatization is that lemmatization did not cut off words into non-English stems. So although the lemmatization counts are not as high for some words as the stemmed counts, since the words are not broken down quite as far, they stay real words.

Thinking of our future plans for using this data, we decided that we should use the lemmatized version of the most frequent words. Even though stemming grouped a few of the words together better, the fact that it turns a good number of words into non-English words (e.g., "people" into "peopl", "cambridge" into "cambridg") was a non-starter. The most likely use for this information is to fill the slots in our template summary, and we definitely want English words for this purpose, without having to do extra processing to turn the stemmed words back into complete words.


Module 5: Frequent and Important Named Entities

The goal of this module was to extract the frequent and important named entities. A named entity is anything that can be denoted with a proper name; named entities include persons, locations, amounts of money, organizations, etc. Extracting named entities is important for text summarization because, for example, it reveals the important persons involved in the topic of interest.

We considered many different approaches in determining the most appropriate method to accomplish this module. The approaches we considered using are NLTK’s ne_chunk and SpaCy. NLTK’s ne_chunk tool takes a list of words that have been pos-tagged and returns a list of words with their corresponding classifier, or named entity type [16]. SpaCy, on the other hand, requires a sentence or document as an input and returns the list of words with their corresponding classifier just like NLTK ne_chunk does [17]. Since SpaCy doesn’t require a pos-tagged input, we decided to proceed with our named entity recognition using SpaCy. The list of named entity types supported by SpaCy is shown in Figure 3.
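A minimal sketch of SpaCy-based entity extraction and counting is shown below; it assumes the small English model en_core_web_sm is installed and documents is a list of text strings, while Unit5.py may use a different model or batching strategy.

from collections import Counter, defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_counts(documents):
    """For each entity type (PERSON, ORG, DATE, ...), count how often each entity string appears."""
    per_type = defaultdict(Counter)
    for doc in nlp.pipe(documents):
        for ent in doc.ents:
            per_type[ent.label_][ent.text.lower()] += 1
    return per_type

# e.g. entity_counts(docs)["DATE"].most_common(10) gives pairs like those in Table 9.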


Figure 3. List of entity types supported by SpaCy. Image credit Explosion AI, via SpaCy documentation [17].

SpaCy has another in-built feature which gives the frequency of each type of named entity. This is important when trying to get a good picture of the text that is being summarized. SpaCy overall is a very convenient tool to use for NER, as it can tokenize, tag parts of speech, and then locate named entities with a single function call. One of the challenges with using SpaCy, however, is that sometimes the same word can be classified as two different types of named entities. In our dataset, the word "trump" was classified both as PERSON (which makes sense given the articles that mention President Trump in our dataset) and, oddly enough, as NORP, which is the classifier given to nationalities or political or religious groups. Another challenge with SpaCy is that its extraction of entities depends on capitalization more than we originally thought: certain entities weren't extracted because they weren't capitalized in the dataset. SpaCy also has other limitations. The word "elizabeth" showed up as an ORDINAL entity; it's possible that SpaCy classifies any word that ends in "th" as ORDINAL, since that's how other ordinal entities such as "fourth" and "fifth" end. There are also some phrases in the PRODUCT category which don't make sense, such as "assets/180427113024-24-koreas." In fact, the only entry that could reasonably be considered a product and was classified as a PRODUCT is "bmw." Figuring out how to deal with these limitations of SpaCy could be a future direction to look into so that our results make more sense and are more relevant to our dataset.

Table 9 shows the most frequent named entities for various entities. Final code for this unit can be found in Unit5.py. ​ ​


Table 9. Named Entity Results

ORDINAL

[('first', 3874), ('third', 2858), ('second', 764), ('3rd', 142), (' ', 126), ('elizabeth', 66), ('fourth', 66), ('fifth', 64), ('2nd', 32), ('firstly', 26)]

PRODUCT

[('morning\\', 16), ('592', 12), ('assets/180427113024-24-koreas', 8), ('twtr.n', 6), ('d.c', 6), ('bmw', 4), ('43/50 7', 4), ('1,730/$1,750/au$2,400', 4), ('7:05pm', 2), ('&', 2)]

TIME

[('morning', 186), ('this morning', 122), ('night', 114), ('hours', 98), ('60 minutes', 78), ('friday night', 46), ('afternoon', 46), ('last night', 44), ('overnight', 44), ('minutes', 42)]

MONEY

[('$15 million', 94), ('billions of dollars', 76), ('40,000', 68), ('1', 64), ('100,000', 46), ('millions of dollars', 46), ('$10 billion', 42), ('2', 38), ('$5 million', 34), ('$1 billion', 34)]

DATE

[('2016', 4386), ('monday', 1820), ('today', 1742), ('2015', 1528), ('2014', 1520), ('last week', 1026), ('friday', 1000), ('tuesday', 938), ('2012', 800), ('sunday', 796)]

QUANTITY

[('a ton', 88), ('as many as 50 million', 52), ('50m', 14), ('tons', 12), ('four-metre', 12), ('two', 10), ('a few feet', 10), ('elon', 8), ('625,000 pounds', 8), ('a tonne', 6)]

LOC

[(' ', 48), ('earth', 18), ('premises', 16), ('the middle east', 6), ('6/11', 4), ('the moon 02:01 billionaire', 4), ('thequint.com/whatsapp', 2), ('the moon', 2), ('malta.[83][84', 2), ('communities', 2)]


Module 6: Important Topics

6.1 LSA/LDA

We applied Latent Dirichlet Allocation (LDA) and later Latent Semantic Analysis (LSA) for topic modelling. We used the results of topic modelling later for clustering, since we found many similar topics that we wanted to combine. We first applied LDA using Gensim's implementation, with the number of topics ranging from 5 to 45, but the selected keywords and the printed headlines of each topic's main articles did not organize into anything that looked like coherent categories to us. Results can be seen in Table 10.

Table 10. LDA results, 5 topics (based on word frequency):

0 0.012*"app" + 0.006*"call" + 0.005*"collect" + 0.005*"make" + 0.005*"share

1 “campaign" + 0.009*"app" + 0.008*"work" + 0.008*"kogan

2 0.012*”campaign" + 0.006*"person" + 0.006*"elect" + 0.005*"polit" + 0.005*"access"

3 0.004*”year" + 0.004*"googl" + 0.004*"share" + 0.004*"report" + 0.003*"person"

4 0.004*"group" + 0.004*"post" + 0.004*"news" + 0.003*"make" + 0.003*"say

We then applied Gensim’s implementation of LSA, to see if it would perform better. We found it was necessary to use the TF-IDF scores instead of word frequency to get good results. LSA created much better topics than LDA, although we found that it would create many similar looking topics (such as topics 3-4 seen in Table 11). LSA has no parameters to tune, and topics are ranked by “explained variance” so typically you just select the first few topics and get rid of all the rest. Each document has some weight towards each topic, so the representative

36

documents for a topic are the ones with the highest weight. The LSA results can be found in Table 11.

Table 11. LSA results (top 5 topics, based on word frequency):

0 0.189*"kogan" + 0.141*"cambridg" + 0.131*"wyli" + 0.130*"said" + 0.127*"analytica"

● trump consultants harvested data from 50 million facebook users: reports | consultants harvested data from 50 million facebook users: reports ● revealed: 50 million facebook profiles harvested for cambridge analytica in major data breach | news | ● what to know about facebook's cambridge analytica problem | time ● facebook suspends data-analysis firm cambridge analytica from platform - business insider

1 0.324*"kogan" + 0.197*"wyli" + -0.171*"ftc" + -0.137*"android" + 0.119*"suspend"

● facebook suspends trump campaign data firm cambridge analytica | business ● presstv-facebook suspends trump campaign data firm ● massachusetts launches probe into cambridge analyticas use of facebook data | thehill ● donald trump consultants harvested data from 50 million facebook users: reports ● facebook bans trump campaigns data analytics firm for taking user data twin cities

2 -0.440*"ftc" + -0.184*"decre" + 0.176*"android" + -0.150*"committe" + 0.132*"text

● grassley, ftc, states turn screws on facebook amid data flap - politico ● u.s. and british lawmakers demand answers from facebook chief executive mark zuckerberg - chicago tribune ● ftc confirms facebook investigation over data privacy concerns | venturebeat ● facebook may have violated ftc privacy deal, say former federal officials, triggering risk of massive fines - ● ftc is investigating facebook over cambridge analytica's use of personal data, source says


3 0.319*"android" + 0.249*"ftc" + 0.243*"text" + 0.167*"log" + 0.157*"phone"

● we dont save android users data without permission facebook | itpulse.com.ng ● facebook acknowledges it has been keeping records of android users calls and texts ● facebook denies taking sms, call data without permission | articles | home ● how facebook was able to siphon off phone call and text logs ● facebook faces new uproar: call and sms metadata

4 0.237*"android" + -0.212*"ftc" + 0.200*"nix" + 0.186*"text" + -0.171*"kogan"

● we dont save android users data without permission facebook | itpulse.com.ng ● facebook acknowledges it has been keeping records of android users calls and texts ● facebook denies taking sms, call data without permission | articles | home ● facebook faces new uproar: call and sms metadata ● facebook suspends scl, trump-linked data analysis firm for policy violation | business standard news
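A minimal sketch of the TF-IDF plus LSA pipeline described in this section is given below; it assumes tokenized_docs is a list of token lists and is written against the Gensim 3.x API that was current at the time.

from gensim import corpora, models

dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

# TF-IDF weighting worked much better for LSA than raw word frequency.
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)
print(lsi.print_topics(num_topics=5, num_words=5))    # top topics, as in Table 11

# Document-by-topic weights used later for clustering.
doc_topic_weights = [lsi[vec] for vec in corpus_tfidf]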

6.2 Clustering

Once we had the results of topic modelling from LSA, the idea was to use the topic vs. document matrix and perform some kind of clustering in n-dimensional space to group documents together. Each column of the matrix is a vector we position in n-dimensional space; the rows represent each topic and how heavily a document is weighted in that topic or dimension. Note that LSA provided us 200 topics for the 2,922 documents, so the dimensions of the topic matrix are 200 x 2922.

There are a total of four things to consider when clustering documents, based on topic weights:

➔ Clustering Algorithm
➔ Distance Function
➔ Number of Topics
➔ Number of Clusters

Since we were not sure which algorithms and parameters would work well, we wrote a script, clustering.py, that runs all of the specified clustering algorithms with a variety of topic counts and cluster sizes and saves the results as a .json file. We ran four kinds of clustering algorithms: K-means, single linkage, average linkage, and complete linkage clustering. We re-ran each of these algorithms for a number of topic sizes, and within each topic size we also varied the number of clusters. See Figure 4 for a visual representation of everything we tried.

Figure 4. Flowchart detailing all clustering techniques used

Therefore, we ran the four clustering algorithms a total of 280 times, all with different parameters. The goal was to see which combination performed the best.


Observe that the number of clusters is centered around half the number of topics. We evaluated the performance by two visual metrics:

1. Plotting the results on a 2-D graph with the first two topics as axes
2. Printing the headlines of 20 randomly selected documents per cluster

Our team manually evaluated the results for each kind of clustering algorithm, made observations, and then finally narrowed it down to a few categories.

Note: Before using Euclidean distance, it is necessary to normalize the matrix! ​ This causes the distance to be proportional to cosine distance, so that the clustering results are equivalent to clusters created using cosine distances. This is preferable because the magnitude of each vector (the amount of weight it has towards each topic) should not matter when computing distances between vectors.
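A minimal sketch of one configuration (K-means on normalized topic vectors) is shown below; it assumes doc_topic is a documents-by-topics NumPy array derived from the LSA output, whereas clustering.py sweeps several algorithms, topic counts, and cluster counts.

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def kmeans_clusters(doc_topic, n_topics=40, n_clusters=16, seed=0):
    X = doc_topic[:, :n_topics]   # keep only the first n_topics LSA topics
    X = normalize(X)              # unit-length rows: Euclidean distance behaves like cosine distance
    km = KMeans(n_clusters=n_clusters, random_state=seed)
    return km.fit_predict(X)      # cluster label for each document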

General Observations

We made the following observations on looking at the results:

➔ For all of the agglomerative hierarchical clustering techniques (single, average, complete), the results of clustering by cosine distance and by Euclidean distance after normalizing were exactly the same. This is consistent with what we had expected.
➔ The single and average linkage algorithms produced one or two massive clusters, and the results did not look good. This is probably because the edges selected between documents ended up forming one giant chain, as most documents are indeed placed close to each other.
➔ The best plots and clusters came from runs that used about half as many clusters as topics.
➔ A lower number of topics seems to generally cluster better. This makes sense because LSA generates topics such that the variance in the dataset is explained more by the first topics than by the last ones. Therefore, by eliminating a larger number of later topics, the clustering focuses more on the topics that matter. This observation inspired us to attempt reweighting the topic x document matrix, which we discuss below.

An interesting idea we came up with was to reweight the topic vectors for each document such that more importance is given to the first few topics. LSA, which is a variant of Principal Component Analysis, orders topics such that the most important ones appear first and the least important ones appear later. Our clustering algorithms do not take this importance into account by default and care equally about each topic. Therefore, we decided to reweight the vectors such that topic 1 is weighted far more heavily than topic 200.

We reweighted each topic vector in each document using the following transformation:

Let D be the topic x document matrix [d_1 d_2 ... d_n], where d_i is the i-th document, d_i1 is the score of the first topic in document d_i, and d_in is the score of the n-th topic in document d_i.

For every document d_i, apply the transformation:

    d_ij → d_ij / (j² / 200)

We pick 200 since there are 200 topics, and dividing the j-th topic score by j² ensures that we rapidly lower the scores, and thereby the importance, of topics as we go down the list. Topic 1 is now far more important during clustering than topic 200.
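A minimal sketch of this reweighting, assuming doc_topic is a documents-by-topics NumPy array with topics in the order returned by LSA:

import numpy as np

def reweight_topics(doc_topic, n_topics=200):
    j = np.arange(1, n_topics + 1)           # topic indices 1..n_topics
    weights = float(n_topics) / (j ** 2)     # dividing topic j by j^2 / 200 boosts early topics
    return doc_topic[:, :n_topics] * weights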

Finalized Clustering

Now that we knew which clustering algorithms worked, we had to select the best topic and cluster sizes. Ultimately, we settled on K-Means with normalized Euclidean distance and complete linkage with cosine distance. We also ran both of these algorithms on the reweighted topic matrix. The topic sizes we selected were [100, 80, 60, 40, 20], with the number of clusters centered at half the number of topics.

We retrieved a total of 120 different results, which we analyze in the next section. Each 'result' consists of a plot and a corresponding text file that contains the titles of 40 random documents in each cluster.

Finally at the end of the clustering process and after evaluating everything, we picked the following two as the best ones:

1. K-Means, 40 topics, 16 clusters (Original Topic Weights)
2. K-Means, 40 topics, 24 clusters (Reweighted Topic Weights)

6.3 Plotting Clusters

We created a script plot_clusters.py to generate plots for all of the various ​ ​ clustering algorithms we tried. To visualize the clusters, we plotted document weights towards the first 2 principal components from LSA against each other, and color-coded them by cluster (this can be seen in Figure 5). Since the earliest principal components explain most of the variance, we decided that this provided a decent visual representation, and found that a clean-looking plot would usually correspond to good clusters.


Figure 5. Results of K-means clustering

Figure 6. Number of K-means clusters vs. sum of squared error within cluster

We used a scree plot with k-means to determine the approximate number of clusters to use. To generate this plot, we ran k-means on the data with varying numbers of clusters (1-600) and plotted the error for each clustering (the sum of squared distances to the cluster centroids). This produces a decaying curve, and the recommended number of clusters is at the inflection point. On our dataset, this inflection point occurred around (num_topics / 2), which is why we used values for num_clusters centered around that, as illustrated in Figure 6.
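A minimal sketch of how such a scree (elbow) plot can be generated with scikit-learn is shown below; X is assumed to be the normalized document-by-topic matrix, and a smaller cluster range is used than the 1-600 sweep described above.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, max_clusters=60):
    ks = list(range(1, max_clusters + 1))
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    sse = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(ks, sse)
    plt.xlabel("number of clusters")
    plt.ylabel("within-cluster sum of squared error")
    plt.show()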

6.4 Final Topic Results

The two clustering results we discuss in this section are:

1. K-Means, 40 topics, 16 clusters (Original Topic Weights)
2. K-Means, 40 topics, 24 clusters (Reweighted Topic Weights)

These seemed to have the best results of everything we analyzed. The clusters formed coherent plots, and the 40 documents we evaluated were clustered well together.

For example, the following headlines are from a cluster that clearly talks about what the FTC did in response to the data breach and how it has been putting pressure on Facebook:

➔ ftc confirms facebook data security investigation
➔ facebook is now under ftc investigation for the cambridge analytica data scandal breaking news
➔ facebook's privacy practices are under investigation, ftc confirms | technology | the guardian
➔ the ftc is officially investigating facebook's data practices | wired
➔ ftc, states increase pressure on facebook on privacy the denver post
➔ ftc confirms probe into facebook data misuse scandal techcrunch - blogmag | eq4c
➔ ftc confirms probe into facebook and cambridge analytica data scandal
➔ ftc launching investigation into facebook over privacy practices - market business news
➔ ftc is investigating facebook over cambridge analytica
➔ ftc to probe facebook over privacy practices | securityweek.com
➔ ftc investigating facebook possible data misuse | wtsp.com
➔ us federal trade commission confirms investigation into facebook over data privacy scandal - daily sabah
➔ ftc investigating facebook over privacy practices; shares slide | fox business

Figure 7: Different results produced from clustering

As can be seen in Figure 7, each algorithm produces three things: a plot used to visualize the clusters, a .txt file so we can read the contents of the clusters in a human-friendly format, and a clusters.json file that contains the entire text of every document in each cluster. The .json file is used as input to the extractive and abstractive summarizers.

We also manually evaluated each cluster and wrote a one-line description for it, so that we could tag the main idea behind each cluster and reference it more easily later. This was just to make the clusters easier to evaluate and does not affect our summaries at all. The two .json files we produced are combined into one using combine_clusters.py, which re-numbers the clusters and concatenates the two .json files together. Note that in each individual clustering output, each article appears in only one cluster, but after combining the two cluster output files, each article appears in two clusters.

Each line in the final output file (clusters.json) used to generate summaries is of the following format:


{
    "title": [array of titles of the documents in this cluster],
    "originalurl": [array of URLs of the documents in this cluster],
    "text": [array of texts of the documents in this cluster],
    "description": the main idea of this cluster,
    "clusterid": the ID of this cluster
}


Module 7: Extractive Summary of Important Sentences

We used two main approaches to create extractive summaries, explained below.

7.1 Overall Summary Approach

The first approach we tried was extracting the most important sentences from each article, merging those sentences into one "article", and then extracting the overall most important sentences from that. To accomplish this we used Gensim's summarize function, which uses TextRank [18]. The first summary generated included the big concepts from the corpus, but it contained many duplicate sentences. To improve this, we simply checked whether a sentence was a duplicate of any sentence already in the summary before adding it. This reduced the repetitiveness of the summary, but there were still many sentences that were not exact matches yet contained the same information.
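A minimal sketch of this two-stage extraction is shown below; it assumes Gensim 3.x (whose gensim.summarization module provides the TextRank-based summarize function) and that articles is a list of document texts. The exact parameters in Unit7.py may differ.

from gensim.summarization import summarize

def overall_summary(articles, word_count=300):
    # Stage 1: a short TextRank summary of each article.
    per_article = []
    for text in articles:
        try:
            per_article.append(summarize(text, word_count=100))
        except ValueError:
            # summarize() raises ValueError for very short or single-sentence inputs.
            continue
    # Stage 2: summarize the concatenation of the per-article summaries.
    return summarize("\n".join(per_article), word_count=word_count)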

After running the extractive summary code on the small and large datasets, we were still seeing some repetitiveness. One thing we noticed was that the majority of news sources represented in our corpus contributed only a single document. However, websites such as The Guardian, which were more reputable primary news sources, contributed multiple articles to the dataset. Given this observation, we decided to try creating summaries using only articles from websites above a certain threshold of articles contributed. We ended up writing a Python script called select_top_k_documents.py that filters the corpus to only those documents that belong to the top k most frequent news sources (based on domain name). The code created a mapping between document and domain name, then selected only documents from websites within the top 100. This performed well. Our hypothesis was that the less common news sources likely provided documents of lower quality, and from the results we obtained, it seems that this hypothesis was correct. The reason this change performed well is that repetition was reduced and only the most credible and important information present in the top 100 sources was selected. With too few documents we would observe irrelevant content, while too many documents induced repetition; a k-value of 100 seemed to balance this out very well.

We filtered our corpus down to the documents from the top k news sources for multiple values of k, and generated extractive summaries from each filtered corpus. We selected the following values for k: [5, 20, 50, 100, 200, 300].

After examination, we had the following results:

- k=5 and k=20 created summaries with unnecessary details which covered only a small number of the events that happened during the incident. We theorize that the extractive summarizer was not able to get a wide enough variety of input text from the low number of articles and homogeneous sources.
- k=200 and k=300 generated summaries that were too broad and contained a wide variety of irrelevant information. By allowing so many sources, many extracted sentences came from opinion pieces and junk articles.
- k=100 generated the best results, with a good balance and wide coverage of incidents during the data breach. There was very little repetition (although it was not non-existent), and the summary was grammatical.

Note that all these results can be found in the submitted files (with names following the format: extractive_summary_top<#>.py). ​ ​

In conclusion, the summary created from the top 100 websites produced the best results; however, there were still some duplicate ideas in that summary that we wanted to get rid of.

We wanted to find a way to compare sentence similarity and figure out how similar two sentences needed to be before we considered them "duplicates." We used difflib.SequenceMatcher to compare the similarity of each sentence as we added it to the summary [19]. Through a trial-and-error process, reading through the summaries each time, we tested different similarity thresholds. We found that removing sentences that were more than 50% similar yielded the best results. Based on word count, about 26.5% of the summary was removed for being too similar to sentences already in the summary. To highlight the importance of this step, Appendix A section A.1 contains a list of duplicate sentences that were removed from our final summary. This code can be found in Unit7.py. The final results from this method of extractive summarization can be found in Appendix A section A.2. The only thing that had to be done manually was going back and adding the correct capitalization, since we had converted all the text to lowercase in order to make data cleaning at the beginning easier.
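A minimal sketch of this near-duplicate filter is shown below; sentences is assumed to be the ordered list of candidate summary sentences, and Unit7.py applies the same check while the summary is being built.

from difflib import SequenceMatcher

def drop_similar_sentences(sentences, threshold=0.5):
    kept = []
    for sentence in sentences:
        # Skip a sentence that is more than 50% similar to one already kept.
        if any(SequenceMatcher(None, sentence, prev).ratio() > threshold for prev in kept):
            continue
        kept.append(sentence)
    return kept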

7.2 Clustered Summaries Approach

The second approach was to use the clusters we created and then apply the extraction on each topic to create a more well-rounded summary. Our first extractive summary was not bad, but it failed to mention anything about Facebook stock falling or the Senate hearing, which were important parts of the event as well. The basic approach taken here was to first create a summary for each cluster. Once that was done, we would combine these and then remove the duplicate ideas again. This was our initial plan but it had to be adjusted a little bit.

First of all, we ended up with 40 clusters, but not all of them were relevant, or at the very least some were less significant than others. So, instead of creating a summary for every cluster, we decided to select only the best clusters to summarize. This included trying topics we labeled as “general” to get the gist of the event and topics we labeled as Congress testimony, as well as others. We had to experiment with which clusters to use to get the best results. One strange finding was the appearance of sentences like “broadsheet leaked scandal mention revealed steps breach respond facebook boss mark zuckerberg apologized for the data breach that was (1) ______ last week” in our summary. These originated from a fill-in-the-blank exercise on an ESL lesson plan website that used news stories to create exercises for teaching English. We removed those web pages from our dataset.

After summarizing each of the clusters we selected, we combined all of the summaries into one large summary. This summary proved to be too long and somewhat repetitive. To fix this, we took an approach similar to our first method of extractive summarization: we removed the similar and repetitive sentences and then created another summary from all of the cluster summaries concatenated together. This summary (generated by Unit7_clusters.py) was still a little too repetitive and still did not capture some of the key topics we were looking for, like the Senate hearing or stock prices dropping.

Our final step in creating a good extractive summary was to make two summaries, each from a different group of clusters. This way we could emphasize topics which were important to the event but were covered in fewer articles and thus did not show up in the earlier single-paragraph summary. We decided that the first paragraph should contain general information about the data breach, and that the second should cover its effects: the Senate hearing and Facebook stock dropping. The code to generate this final summary can be found in Unit7_split_cluster_summary.py. In our final, best summary from the clustered approach, we used the clusters and topics listed in Table 12 for each paragraph.


Table 12. Two Paragraph Summary Clusters

First paragraph:

5 - Actions that Facebook has taken to improve privacy
8 - FTC investigates Facebook
27 - Zuckerberg's response to the whole Facebook crisis
28 - Christopher Wylie, the whistleblower
33 - Actions that Facebook has taken to improve privacy

Second paragraph:

11 - Zuckerberg testifies before US Congress (no UK information)
14 - Facebook stock falls
30 - Zuckerberg testifies before US Congress
35 - Facebook under pressure, value falls

The final extractive summary created using clustering can be found in Appendix A section A.3. As with the first summary, capitalization was added manually afterward. After getting results from all of our summarization methods, we determined that the two-paragraph, cluster-based extractive approach yielded the best results.


Module 8: Values for Each Slot Matching Collection Semantics

In trying to decide which pieces of information were most important to the collection that could be generalized to other similar collections, we originally struggled as there are multiple kinds of events in the corpus. The Cambridge Analytica scandal was predominantly covered as a data breach [20], but it also led to several government hearings and testimonies in different countries [21], the passage of a law [22], and stock prices falling [23], among other fallout [24].

Ultimately, the plan was to use a combination of two event type summaries: a data breach or security event and a legal hearing. The data breach template was written based on information in GDPR incident reporting paperwork and advice from graduate student Ziqian Song who has been researching data breaches (personal correspondence). The legal hearing template was modified from a NIST template for guided summarization of investigations or trials [25]. The final list of tokens and phrases we planned to collect is in Table 13.

We originally planned to do as much as possible with regular expressions, and prototyped date-finding code using them; this work can be found in Unit8RegexPrototype.py. The code utilizes the Python packages num2words [26] and word2number [27] to convert word representations of numbers in the text (e.g., 50 million) to their numeric representation (e.g., 50000000). This standardized the different forms in which numbers were reported and allowed simpler regular expressions to be written for the actual value-finding.
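As a rough illustration of the idea (not the exact logic in Unit8RegexPrototype.py), the snippet below uses a regular expression for mixed forms such as "50 million" and word2number for fully spelled-out numbers; the function name is ours.

import re
from word2number import w2n  # pip install word2number

MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

def normalize_numbers(text):
    """Rewrite figures like '50 million' as plain digits so that later
    regular expressions only need to match numeric tokens."""
    def expand(match):
        value = float(match.group(1))
        return str(int(value * MULTIPLIERS[match.group(2).lower()]))
    return re.sub(r"(\d+(?:\.\d+)?)\s+(thousand|million|billion)\b",
                  expand, text, flags=re.IGNORECASE)

print(normalize_numbers("data on 50 million users was harvested"))
# -> data on 50000000 users was harvested
print(w2n.word_to_num("fifty million"))  # spelled-out form -> 50000000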


Table 13. A list of planned pieces of information to be extracted from the corpus. Extracted values are shown for the pieces of information where they were gathered, but time ran out before code could be written to extract all desired entities.

Cambridge Analytica Data Breach

1. Date/time of breach: 2014
2. Number of affected users: 50000000
3. Name of affected website/service/company: facebook
4. Type of data exposed in breach: information from the accounts friends profiles as well as updates , likes , and in some cases private messages
5. Name of attacker
6. Name of whistleblower
7. Date/time breach made public
8. Company statement
9. Changes enacted by company

Senate Hearing

10. Date/time of hearing
11. Name of defendant
12. Name of court/committee/group hearing case
13. Alleged crime
14. Stance of defendant
15. Outcome or sentence


We originally tried this approach because it would be faster and easier to parallelize than using named entity recognition or other methods that required computationally-intensive language models. Additionally, spaCy, the package many other teams were using for this task, recognized things like “last Monday” as dates with the same entity label as “October 10th”, which would not be particularly helpful for trying to pick out dates that would be meaningful in a summary. But due to the journalistic style of using things like “last Monday” more than the full date when reporting on events, and due to the fact that many dates we were trying to extract were a month or year value only, our regular expression approach could not easily pick out dates from other numbers.

We ran Unit8RegexPrototype.py on specific document clusters corresponding to articles about the Senate hearing or the general data breach, hoping to find the hearing date (10) and either the breach date (1) or the date the Cambridge Analytica scandal broke (7). March 17, the date the scandal broke (7), was only the 4th most common date found in clusters 0 and 2, with only 15 total occurrences; it was beaten out by other dates later in March. April 10, the date of the Senate hearing (10), was the most common date found in clusters 11 and 30, but at this point we still decided to attempt to use spaCy for date and entity recognition instead.

The date of the breach, 2014, was extracted using spaCy named entity recognition to pull dates from sentences that had been filtered to include some word which lemmatized to “steal”, “harvest”, “collect”, or “obtain”. The number of users whose data was leaked and the website they were users of were obtained from the same regular expression. The correct results in the above cases were determined by selecting the most common result.
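A sketch of this kind of filtered entity extraction is shown below; the trigger lemmas are the ones listed above, while the function itself and the model name are illustrative rather than copied from Unit8.py.

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model
TRIGGER_LEMMAS = {"steal", "harvest", "collect", "obtain"}

def most_common_dates(texts, top_n=5):
    """Count DATE entities, but only in sentences that also contain one of
    the trigger verbs, and return the most frequent values."""
    dates = Counter()
    for doc in nlp.pipe(texts):
        for sent in doc.sents:
            if any(tok.lemma_ in TRIGGER_LEMMAS for tok in sent):
                dates.update(ent.text for ent in sent.ents if ent.label_ == "DATE")
    return dates.most_common(top_n)

print(most_common_dates(["the app harvested profile data in 2014",
                         "facebook stock fell sharply on monday"]))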

The phrase that describes the type of data which was leaked was also acquired using spaCy, but instead of named entity recognition, the parse tree functionality was used to get the full object clauses attached to any instances of the verbs “steal”, “harvest”, “collect”, or “obtain”. The results included several instances of the word “data” and some very long phrases which appeared to be poorly parsed as a result of missing punctuation, so we filtered out results of extreme length (by number of tokens). While this improved the data, the three most common clauses were still “no passwords or sensitive pieces of information”, “millions of facebook profiles of us voters”, and “private information from the facebook profiles of more than 50 million users”, none of which were very descriptive. We initially attempted to filter for clauses containing a more generic word such as “data”, “personal”, or “private” to make our solution more generalizable to other corpora about data breaches, but the results were not greatly improved by these filters, and we decided that they would cause issues for corpora where the correct value was, for instance, “social security numbers” or “passwords”. Thus, based on our knowledge of the answer we were hoping to extract for this corpus, we kept only clauses which contained the words “and” and “friends”. This greatly increased the relevance of the remaining answers, although this filtering step would likely need to be changed to generalize to other collections.
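The parse-tree step can be sketched as follows: for each trigger verb, the subtree of its object child is collected as a candidate clause, and overly long clauses are dropped. The token limit and names here are illustrative, not the exact values used in Unit8.py.

import spacy

nlp = spacy.load("en_core_web_sm")
TRIGGER_LEMMAS = {"steal", "harvest", "collect", "obtain"}
MAX_TOKENS = 20  # drop clauses that are suspiciously long (likely bad parses)

def object_clauses(text):
    """Return the full object phrase attached to each trigger verb."""
    clauses = []
    for tok in nlp(text):
        if tok.lemma_ in TRIGGER_LEMMAS and tok.pos_ == "VERB":
            for child in tok.children:
                if child.dep_ in ("dobj", "obj"):
                    span = list(child.subtree)
                    if len(span) <= MAX_TOKENS:
                        clauses.append(" ".join(t.text for t in span))
    return clauses

print(object_clauses("the firm harvested private information from the facebook "
                     "profiles of more than 50 million users"))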

For the type of data leaked in the breach, we ultimately decided to use the longest result remaining (“information from the accounts friends profiles as well as updates , likes , and in some cases private messages”) rather than the most frequent (“personal information about themselves and their friends”) because it was more correct and more specific.

Prototype code is also included in Unit8.py to extract the name of the whistleblower (6); it operates by finding any names adjacent to the word “whistleblower” or “whistle-blower” in the text. However, when we ran the code, we found that it was not picking up any tokens and that, in fact, no instance of “christopher wylie” in the corpus was recognized as a named entity by spaCy. We believe this is due to the lack of capitalization in cleaned.json and clusters.json. Due to time constraints, we decided not to rewrite and rerun all of our cleaning code to generate a corpus with capitalization included, and instead shifted focus to the more promising abstractive and extractive summary results.

The entities we did extract are listed with their values in Table 13, and the code used is in Unit8.py. We used regular expressions in this code where possible, as they were much faster even without parallelizing. Running the initial natural language modelling necessary for spaCy on our entire corpus took about an hour on a high-end personal computer, and each extraction could take a large chunk of time as well. While this is not necessarily prohibitive, it made prototyping and testing potential solutions very slow, so regular expressions have a distinct speed advantage.


Module 9: Readable Summary Explaining Slots and Values

Templates were written to contain the values from the two event type collections in Table 13. They are as follows, using the numbering scheme from Table 13:

In -1-, the data of -2- users of -3- was compromised. -4- was illegally obtained by -5-. The incident was made public by -6- on -7-. -3- has said -8- and will -9-.

On -10-, -11- was brought before the -12- as part of an investigation into -13-. -11- said that -14-. As a result of the hearing, -15-.

We attempted to make the two halves of this summary as general as possible so that the code could be repurposed for other similar incidents while still covering the important data to this specific collection. We might expect the filled summary to look something like this (filled manually), although extracting many of these long tokens would be difficult:

In 2014, the data of 50 million users of Facebook was compromised. Private personal details from profiles of consenting users and their unconsenting friends was illegally obtained by Aleksander Kogan. The incident was made public on March 17, 2018 by Christopher Wylie. Facebook has said that “We know that this was a major violation of people’s trust, and I deeply regret that we didn’t do enough to deal with it.” and will continue to redesign its privacy settings in response to criticism.

On Tuesday April 10, 2018, Mark Zuckerberg was brought before the Senate Commerce and Judiciary Committees as part of an investigation into data privacy on social media and the appropriate use of user data. Mark Zuckerberg said that he was in favor of increased regulation of Facebook and apologized for past mistakes in handling of user data. As a result of the hearing, no legislation has yet been passed.
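A minimal sketch of how the first template could be filled programmatically is shown below. The slot values are typed in by hand here purely for illustration; in practice they would come from the extraction code in Module 8, and the {s1}-style placeholders simply stand in for the -1- markers above.

# Slot numbers follow Table 13; values are hand-filled here for illustration.
breach_template = ("In {s1}, the data of {s2} users of {s3} was compromised. "
                   "{s4} was illegally obtained by {s5}. The incident was made "
                   "public by {s6} on {s7}. {s3} has said {s8} and will {s9}.")

slots = {"s1": "2014",
         "s2": "50 million",
         "s3": "Facebook",
         "s4": "Private personal details from profiles",
         "s5": "Aleksander Kogan",
         "s6": "Christopher Wylie",
         "s7": "March 17, 2018",
         "s8": "that it deeply regrets the violation of trust",
         "s9": "continue to redesign its privacy settings"}

print(breach_template.format(**slots))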


Module 10: Readable Abstractive Summary

There were three approaches we took for generating abstractive summaries of the Facebook breach corpus. All our approaches are explained below, with the first one being the most successful of the abstractive results.

10.1 Pointer-Generator Network (PGN)

Researchers at Stanford University and Google Brain developed an attentional sequence-to-sequence network for abstractive single-document summarization called a Pointer-Generator Network (PGN) [5,6]. This architecture makes modifications to a traditional sequence-to-sequence approach in order to address issues in generating summaries that are factually accurate, non-repetitive, and can contain out-of-vocabulary words.

PGN can be thought of as a mixture between an abstractive and extractive summarizer, as it has a variable probability at each decoder step to either use a pointer to copy words from the input document or to generate new words based on the vocabulary, both guided by the attention distribution. It also uses a coverage vector, a sum of all previous attention distributions, as an input in the attention calculation, making the likelihood of repetition lower.
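Concretely, as described in the original paper [5], at each decoder step t the final word distribution mixes the vocabulary distribution with the attention-based copy distribution via the generation probability p_gen, and the coverage vector c^t accumulates attention from earlier steps:

\[
P(w) = p_{\mathrm{gen}}\,P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}})\sum_{i:\,w_i = w} a_i^t,
\qquad
c^t = \sum_{t'=0}^{t-1} a^{t'}
\]

During training, a coverage loss penalizes overlap between the current attention and the coverage vector, which discourages the decoder from attending to (and thus repeating) the same source positions.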

This methodology, developed in 2017, increases the accuracy of fact reproduction while still allowing for the use of novel words. The original paper trained the model on a dataset of CNN and DailyMail articles and this pre-trained model was available on GitHub [5].

Setup

The code for training, evaluating, and generating summaries with PGN models is available on GitHub [6].


The original code for scraping the CNN/DailyMail dataset and outputting it in the format needed for training, evaluation, or summarization is available on GitHub [28].

Code for taking .json files of the type output by jusTextCleaningBig.py was written ​ ​ (based on the above) by classmate Chreston Miller and is available on GitHub [29].

The pre-trained model was saved using TensorFlow’s checkpoint system, and as such could only be loaded in an environment running the same version of TensorFlow (1.2.1). This meant that the model could not be loaded on the ARC cluster and several code modifications (Python 3 compatibility, added flexibility to specify file locations) were made to the above codebase in order to run on an available system with a compatible GPU. ROUGE evaluation ability was also removed, as we were planning to use the model for unsupervised summarization only and were not interested in ROUGE comparison to gold standards.

The vocabulary file from the CNN/DailyMail dataset was used, based on advice from multiple classmates that using a dataset-specific vocabulary file made little difference in the quality of summaries. The vocabulary file is produced by [28].

Results

PGN creates a 1-3 sentence summary from an article-sized input document. It has a hard upper limit of 120 tokens, but is generally self-limiting below that cutoff. Three different approaches were tried with PGN.

Approach 1: Single Document Summarization

As a proof of concept and an assessment of how well the pre-trained model could handle the dataset, we decided to use the code as intended for single-document summarization and summarize each document in the collection individually. We were concerned that the technical nature of the dataset and its density of names would present a challenge, but the resulting summaries were largely factually and grammatically correct. Table 14 contains some of the summaries generated as examples.

Table 14. Example Single Document Abstractive Summaries

senator amy klobuchar of minnesota , a democratic member of the senate judiciary committee , went so far as to press for mark zuckerberg , facebooks chief executive , to appear before the panel to explain what the social network knew about the misuse of its data to target political advertising and manipulate voters . the calls for greater scrutiny followed reports on saturday in the new york times and the observer of london that cambridge analytica , a political data firm founded by stephen k. bannon and robert mercer .

facebook responded to reports that political data firm cambridge analytica gained access to data on 50 million of its users . she did not say when regulators had started looking into facebook . later on tuesday , cambridge analytica said facebook 's search `` would potentially compromise a regulatory investigation . '' the british government has called the allegations very concerning .

eu justice chief věra jourová said on friday , `` it 's clear that data of europeans have been exposed to a huge risk and i am not sure if facebook took all the necessary steps to implement change ' exclusive / digital services that collect users data , like facebook and gmail , will be pulled under eu consumer protection rules . possible sanctions will be raised to up to 4 % of a companys turnover .

two former federal officials who crafted the landmark consent decree governing how facebook handles user privacy say the company may have violated that decree when it shared information from tens of millions of people . the group included both the 270,000 facebook users who downloaded a psychological testing app and the facebook friends of those people .


political data analytics company cambridge analytica is affiliated with strategic communication laboratories -lrb- scl -rrb- separate from his university work gleaned the data from a personality test app named thisisyourdigitallife . roughly 270,000 us-based facebook users voluntarily responded to the test in 2014 . but the app also collected data on those participants without their consent . christopher wylie , a former employee of cambridge analytica , has stepped forward with explosive details on how the firm acquired the stolen data of a huge number of americans . the mercers essentially single-handedly kept trumps campaign afloat when others were running away . they also lent the donald trump campaign two key staff members .

We were particularly impressed with the correct assignment of people to their roles, such as in “Christopher Wylie, a former employee of Cambridge Analytica” and “Political data analytics company Cambridge Analytica is affiliated with Strategic Communication Laboratories (SCL)”. The model does often leave pronouns undefined, as in “separate from his university work” (which should refer to Kogan) and “she did not say when regulators had started looking into facebook” (antecedent unknown from context). The summaries have mostly correct punctuation, but they would need further processing to be human-readable (such as replacing -lrb- with a parenthesis and removing extra spaces).

Approach 2: Abstractive Summarization of Extractive Summaries

In order to end up with a reasonably sized abstractive summary that covered the breadth of information in the dataset, the file abstracts_to_json.py re-aligns the abstractive summaries made by PGN (which are output as .txt files) with the URL, full text, and other metadata of the original articles they were based on (from clusters.json). This new file, abstracts.json, can then be run through a slightly modified version of the extractive summarization method from Module 7, using abs_to_ext_by_cluster.py. The abstractive summaries for all documents within clusters identified as relevant are concatenated, and then an extractive summary is generated. The same two-paragraph method is used as in Module 7.


The final summary using this approach, both in raw ASCII format and with manual corrections for punctuation and capitalization, is found in Appendix B section B.1.

By doing extractive summarization after individual document PGN, the redundancy-removing properties of the coverage vector were overridden and the output summary is more repetitive than the corresponding extractive summary. It also has some issues with grammar and factual correctness in the second paragraph. By more logically ordering the way the cluster summaries print, the summary flow could likely be improved somewhat.

Approach 3: Extraction of Summaries By Cluster

An extractive summary was generated for each cluster using a modified version of the Module 7 code, all_cluster_summaries.py (mainly modified to use all ​ ​ clusters), and then they were processed into .story files using txt_to_story.py and ​ ​ summarized using PGN, resulting in one abstractive summary per cluster. The best results were selected by hand and concatenated.

Unfortunately, several of the clusters (e.g., Cluster 24 and Cluster 0) had nearly-identical output despite having inputs which only partially overlapped, as can be seen in Table 15. The table also makes it apparent that the PGN is still partially extractive, with some rather large sentence fragments such as “Damian Collins, chairman of the digital, culture, media and sport committee, also accused the chief executive” preserved between input and output. This may be due to the relatively small size of document input in this workflow.


Table 15. A comparison of Cluster 0 and 24 summaries pre- and post-abstraction, with differences in the abstractive output highlighted.

Cluster 0

Input (Extractive): collins, the british lawmaker, said he planned to call alexander nix, the chief executive of cambridge analytica, to return to parliament and answer questions about testimony last month in which he claimed that the company never obtained or used facebook data. nix told the committee last month that his firm had not received data from a researcher accused of obtaining millions of facebook users personal information. adam schiff, the top democrat on the house intelligence committee, called for cambridge analytica to be thoroughly investigated and said facebook must answer questions about how it came to provide private user information to an academic with links to russia. meanwhile, mark warner, the u.s. senator who is vice chair of the intelligence committee, has also requested that zuckerberg, along with other tech ceos, testify in order to answer questions about facebooks role in social manipulation in the 2016 election. during tuesdays committee hearing, wylie suggested facebook may have been aware of the large-scale harvesting of data carried out by cambridge analyticas partner gsr even earlier than had been previously reported. damian collins, chairman of the digital, culture, media and sport committee, also accused the chief executive of cambridge analytica, alexander nix, of deliberately misleading parliament and giving false statements to the committee following allegations it was passed personal data from facebook apps without the consent of the individuals.

Output (Abstractive): mark warner , the chief executive of the intelligence committee , has also requested that zuckerberg , along with other tech ceos , testify in order to answer questions about facebooks role in the 2016 election . during tuesdays committee hearing , wylie suggested facebook may have been aware of the large-scale harvesting of data carried out by cambridge analyticas partner gsr even earlier than had been previously reported . damian collins , chairman of the digital , culture , media and sport committee , also accused the chief executive of deliberately misleading parliament .

Cluster 24

Input (Extractive): adam schiff, the top democrat on the house intelligence committee, called for cambridge analytica to be thoroughly investigated and said facebook must answer questions about how it came to provide private user information to an academic with links to russia. meanwhile, mark warner, the u.s. senator who is vice chair of the intelligence committee, has also requested that zuckerberg, along with other tech ceos, testify in order to answer questions about facebooks role in social manipulation in the 2016 election. during tuesdays committee hearing, wylie suggested facebook may have been aware of the large-scale harvesting of data carried out by cambridge analyticas partner gsr even earlier than had been previously reported. damian collins, chairman of the digital, culture, media and sport committee, also accused the chief executive of cambridge analytica, alexander nix, of deliberately misleading parliament and giving false statements to the committee following allegations it was passed personal data from facebook apps without the consent of the individuals.

Output (Abstractive): mark warner , the u.s. senator who is vice chair of the intelligence committee , has also requested that zuckerberg , along with other tech ceos , testify in order to answer questions about facebooks role in social manipulation in the 2016 election . during tuesdays committee hearing , wylie suggested facebook may have been aware of the large-scale harvesting of data carried out by cambridge analyticas partner gsr even earlier than had been previously reported . damian collins , chairman of the digital , culture , media and sport committee , also accused the chief executive of deliberately misleading parliament .


No individual cluster summary features redundant information, as is the goal of PGN, but since the input summaries are very similar for some clusters, the corresponding outputs are also very similar. A modification to the extractive summary code or the workflow could potentially address this issue. Interestingly, even counting the cluster summaries that did not make it into the final draft in the report, the 50 million figure is much less prominent than in most other summary methods, and in fact different summaries mention slightly different figures (including the 270,000 users who downloaded the original app). The summaries have a much heavier focus on the various investigations, companies, and names. As in the previous PGN workflow, there are still some issues with pronoun confusion, and the incidence of incomplete sentences has gone up, likely as an artefact of the occasionally fragmented and relatively short extractive input.

The final output of this method, with manual corrections for punctuation and capitalization, can be found in Appendix B section B.2.

10.2 FastAbsRL

In addition to the pointer generator network, we also applied FastAbsRL [3]. The authors of the paper provide a pre-trained model online, so we downloaded that and applied it to our dataset. They trained the model on the CNN and DailyMail datasets, which should be similar to our data but over a broader domain of topics.

Results

On our dataset, FastAbsRL always creates a 1-sentence summary for each article. The algorithm decides on its own how long the summary should be, although you can specify the maximum length. Here are some example outputs (one sentence per article):

facebook 's facebook app has been embroiled in controversies over its data sharing .
theresa may has said she expects cambridge analytica and facebook to cooperate fully .
facebook says it does not collect the content of calls or text messages .
facebook has been keeping records of android users calls and texts .
facebook says logging is part of the #deletefacebook campaign .
facebook responded in a blog post on sunday .
facebook questioned about pulling data from android devices sunday , march 25th .
facebook says it has never sold any of the collected metadata .
facebook users have filed a lawsuit against the social-networking giant .
facebook pointed out that the call log was `` a widely used practice to begin ''

To combine these single-article abstractive summaries into a full-length summary of the dataset, we first divide them into clusters using the results from Module 6. We then run Gensim's extractive summarizer on each cluster. To create the final abstractive summary, we run the extractive summarizer on several clusters of single-article abstractive summarizations. This is implemented in make_summary.py, which also has an interface to manually choose which clusters should be used, the order in which they will appear, and the number of sentences for each cluster; a minimal sketch of the per-cluster step follows the example below. Here is an example of a summary extracted from one cluster (Cluster 2) of single-document abstractive summaries:

cambridge analytica improperly used data from some 50 million facebook users . facebook ceo mark zuckerberg says the social media giant has violated their privacy . facebook says it 's suspending cambridge analytica after finding data privacy policies . facebook says the data of more than 50 million users were inappropriately used by us president donald trump . britain is concerned about allegations that the data firm cambridge analytica exploited facebook data to use millions of peoples profiles without authorization . facebook suspended cambridge analytica , a data firm that helped president donald trump with the 2016 election . facebook has suspended cambridge analytica , a data company that helped donald trump win the presidential election . australia is investigating whether local personal information was exposed in the facebook data breach the australian information and privacy commissioner . data analytics firm cambridge analytica harvested private information from 50 million facebook users in developing techniques to support president donald trump 's 2016 election campaign . data analytics firm cambridge analytica harvested private information from 50 million facebook users in developing techniques to support president donald trump 's 2016 election campaign .
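A sketch of this combination step, assuming the single-article FastAbsRL sentences have already been grouped by their Module 6 cluster labels, is shown below (gensim 3.x, whose gensim.summarization module provides the extractive summarize function); make_summary.py adds an interactive cluster-selection interface on top of this idea.

from gensim.summarization import summarize  # available in gensim 3.x

def summarize_clusters(cluster_to_sentences, word_count=60):
    """Run the extractive summarizer over the concatenated single-article
    abstractive sentences of each cluster."""
    summaries = {}
    for cluster_id, sentences in cluster_to_sentences.items():
        text = " ".join(sentences)
        try:
            summaries[cluster_id] = summarize(text, word_count=word_count)
        except ValueError:
            # summarize() refuses inputs with too few sentences; keep them as is.
            summaries[cluster_id] = text
    return summaries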

See Appendix B section B.3 for the full abstractive summary. For our dataset, we found that the PGN (Section 10.1) had a better final summary, which is contrary to what we expected, since FastAbsRL reports better performance on ROUGE scores [3]. However, this may be because of the difference in processes used to obtain the final abstractive summary from the single-document summaries. It also may be due to some unique characteristic of our dataset, which is much more specific than the CNN-DailyMail dataset that PGN and FastAbsRL were both trained and tested on.

10.3 AbTextSumm

AbTextSumm is the implementation for the integer linear programming (ILP)-based abstractive text summarizer explained in the relevant paper [8].

The code can be found on GitHub at [9].

Results

The algorithm was successful in creating abstractive summaries, but unfortunately the algorithm produces sentences that are very short and end abruptly. See AbTextSum_Summary.txt for an example abstractive summary ​ ​ produced. Here are some example bad sentences from the summary:

- “The company and cambridge analytica , a firm founded by stephen k .”
- “The social media site , and sheera frenkel from san francisco .”
- “On tuesday in menlo park , calif. , where the company is based. ”
- “Warner , democrat of minnesota , and robert mercer , the wealthy republican donor .”


These sentences end abruptly, contain grammatical errors, and make little sense. Furthermore, each summary took a long time to produce, so we deemed this approach unsuitable for our scenario.


Evaluation

ROUGE Evaluation

ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used to evaluate automatic summarization methods [30]. It consists of five evaluation metrics, three of which we used in the evaluation of our summaries. The five metrics are ROUGE-N (which consists of ROUGE-1 and ROUGE-2), ROUGE-L, ROUGE-S, and ROUGE-SU [30]. ROUGE-N is the co-occurrence of n-grams. Within this type of metric is ROUGE-1, which is the overlap of 1-grams (each word) between the two types of summaries, and ROUGE-2, which is the overlap of bi-grams (two words) between the two types of summaries. ROUGE-L looks at sentence structure similarity. ROUGE-S compares skip-bigrams, which are any pair of words that occur in the same sentence order between the summaries. ROUGE-SU builds on the ROUGE-S metric and looks at 1-grams on top of skip-bigrams.
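For reference, ROUGE-N is defined as n-gram recall against the reference summaries [30]:

\[
\mathrm{ROUGE\text{-}N} =
\frac{\sum_{S \in \mathrm{References}} \; \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}
     {\sum_{S \in \mathrm{References}} \; \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}
\]

where Count_match is the maximum number of times an n-gram co-occurs in the candidate summary and a reference summary.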

We ran eval.py which was given to us by the Teaching Assistant for the course, ​ ​ Liuqing Li, to compare the hand-written gold standard summary Team 12 created for us to the automatically generated summaries we produced. Instructions for running this can be found in the User’s Manual. The results of our ROUGE ​ ​ evaluation are shown in Tables 16-18.


Table 16. Calculated ROUGE results

Summary Type | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-SU4

Cluster-based extractive | 0.06557 | 0.06557 | 0.01198 | 0.0
Extractive | 0.10714 | 0.0 | 0.10714 | 0.01316
PGN Approach 2 | 0.0678 | 0.01863 | 0.10169 | 0.0
PGN Approach 3 | 0.1 | 0.1 | 0.01829 | 0.0
FastAbsRL | 0.09091 | 0.0 | 0.09091 | 0.01099

Table 17. ROUGE results based on sentences

Summary Type Max ROUGE-1 Max ROUGE-2

Cluster-based extractive 0.50000 0.27273

Extractive 0.75000 0.38889

PGN Approach 2 0.60000 0.27273

PGN Approach 3 0.60000 0.37500

FastAbsRL 0.66667 0.50000


Table 18. Entity Coverage results

Summary Type Entity Coverage

Cluster-based extractive 20.93%

Extractive 20.93%

PGN Approach 2 18.60%

PGN Approach 3 27.91%

FastAbsRL 18.60%

As Table 16 shows, our basic extractive summary achieved the highest ROUGE scores; however, we decided that the best summary we produced was the cluster-based extractive summary because it was more human-readable and flowed better than the other summaries. Readability and coherence are qualities that ROUGE scores do not take into account, but they make a big difference to people reading a summary. For that reason we concluded that, despite its lower scores, the cluster-based extractive summary was actually our best.

Solr Indexing

We also did Solr indexing for our dataset. The main purpose of this was to make it easier for Team 15 to search our dataset while they were creating their Gold Standard summary for our team’s event.

The process for creating the Solr core was not too involved. Once we had our data in .json format (explained in the User's Manual), we first had to copy the .json into our home folder on the Blacklight VM where Solr is hosted. Once that was done, we logged into Blacklight. We were given a Python script, json_formatter.py, that converted our .json to the correct Solr format. We ran the converter script, then used the following curl command (with the formatted .json file as the request body) to add it to our Solr index:

curl 'http://localhost:8983/solr/big_team<#>/update/json?commit=true' --data-binary @[formatted .json file] -H 'Content-type:application/json'

The first time we added our data to the Solr core was before finishing our data cleaning, so once we updated our .json with more cleaning techniques, we first had to delete the old data from Solr using the following curl command:

curl 'http://localhost:8983/solr/big_team<#>/update?commit=true' -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'

Then we had to update our Solr index by redoing the entire process with the updated .json.

As a team, we did not make much use of the Solr core for our dataset and processing. We did, however, use it when we were researching to make the Gold Standard for Team 15, the Shootings. This process is described in more detail below.

Gold Standard Summaries

Gold Standard Creation for Maryland Shooting

To create the gold standard summary for Team 15, which covered multiple Maryland shootings, we had three team members independently write their own summaries of the dataset and then combined details from the three. One team member reviewed 50-100 individual articles from the team's collection, starting with random articles; additional articles on specific aspects were selected by searching Team 15's Solr index for keywords that showed up in the Wikipedia articles for the most prominent recent Maryland shootings [31,32] and in the team's presentations on most common words and bigrams. Another team member based their summary only on articles cited in the aforementioned Wikipedia pages. The third team member searched Team 15's Solr index and looked through the 50th-150th articles, so as not to gather information only from the first articles in the dataset, and then searched for articles that included more details about the events. We found that different team members picked up on different details, so combining details from the three summaries ensured that the final summary was comprehensive and fair. We chose details for the final summary based on whether they were found in multiple summaries and on their subjective importance or interest.

For details not present in all summaries, we used Solr to check for the information in the collection, and if it was not present or was present in no more than 2 unique articles, we flagged it as a minor detail in the collection. Our final summary is found in Appendix C.1.

Gold Standard For Facebook Data Breach

Another team, Team 12, was tasked with creating a Gold Standard summary for our dataset which covered the Facebook data breach. This was given to us after our final summaries were generated in order to be used as an evaluation method, as discussed in the ROUGE Evaluation section above. The summary is found in ​ ​ Appendix C.2.


User’s Manual

This project supports a variety of tasks that deal with creating a summary of a collection of web pages. Below are descriptions of the tasks supported and tutorials for each.

Cleaning Data jusText Cleaning

Set-Up

Make sure you have a distributed computing system with the following requirements installed:

● Python 2.7.x
● PySpark 2.2.0
● Spark 2.2.0
● Scala 2.11.8
● Java 1.8.0_171
● Hadoop

Running and Interpreting Output

1. This codebase is intended to convert a WARC file and a CDX file (or multiple WARC and CDX files) into a single .json file. First, place your zipped WARC file and unzipped CDX file into the Hadoop Distributed File System (HDFS) using hadoop fs -put. Then, run the following files using spark2-submit:
   a. ArchiveSpark_WarcToJson.scala - takes a zipped WARC file (.warc.gz) and a CDX file (.cdx) and outputs a .json file of article texts with duplicate articles from the same URL removed. Change the file paths at the top of the file as appropriate to your HDFS setup.
   b. jusTextCleaningBig.py - takes the .json file output by the previous step, in which each document is a row, parses the page HTML, and returns plain article text with the boilerplate removed. It also does some filtering and attempts to grab a publication date from the page metadata where possible. The output file has one JSON object per row, each representing an article with the fields "isopubdate", "originalurl", "text", and "title"; a short loading sketch for this format follows this list.
   c. Note: to run jusTextCleaningBig.py, you will need to pass FacebookBreackSummarization_PySparkExtPkgs.zip to the submitted job using the argument --archives. For example: spark2-submit --archives FacebookBreackSummarization_PySparkExtPkgs.zip#pyspark jusTextCleaningBig.py
2. Retrieve your files from HDFS using hadoop fs -get.
3. Run the resulting directory of .json files through jsonFileConcat.py to turn the multiple files returned by the worker nodes into a single long .json.
4. The .json file should now contain all of the relevant data from your web pages.
5. Expected runtimes (on a dataset of similar size):
   a. ArchiveSpark_WarcToJson.scala: <10 minutes (cluster with 20 nodes)
   b. jusTextCleaningBig.py: 5-30 minutes (cluster with 20 nodes)
   c. jsonFileConcat.py: < 1 minute
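Since each line of the resulting .json is a self-contained JSON object with the fields listed above, downstream scripts can stream it line by line; a minimal loading sketch (not one of the submitted files) is:

import json

def load_documents(path):
    """Yield one article dictionary per line of the jusText-cleaned .json file."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: print the title and URL of every cleaned article.
for doc in load_documents("cleaned.json"):
    print(doc["title"], doc["originalurl"])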

Final Cleaning

Set-Up


1. Follow the steps above in the jusText tutorial to get a properly formatted ​ ​ .json file

Running and Interpreting Output

1. Further cleaning can be done once the jusText cleaning has been completed and you have a file where each line is a JSON object representing a document (cleaned.json).
2. To perform the further cleaning, run the following files in the given order:
   a. remove_hex_chars.py - takes cleaned.json as input and produces another .json file that has been re-encoded in UTF-8.
   b. further_cleaning.py - takes the output of the previous file as input and produces another .json file with documents matching specific parameters removed, as described previously in this report.
   c. remove_similar_documents.py - takes the output of the previous file as input and produces another .json file with documents categorized as too similar removed.
3. You will now have cleaned.json containing only good data.
4. Expected runtimes (on a dataset of similar size):
   a. remove_similar_documents.py: 20 minutes
   b. further_cleaning.py: 2-5 minutes
   c. remove_hex_chars.py: < 5 seconds

Single Word and Entity Based Summary Techniques

Most Frequent Words

Set-Up

1. Make sure the most recent version of NLTK is installed by running "pip install -U nltk".
2. Get your data ready:
   a. Documents need to be in a .json file which contains a 'text' key with the article text.
   b. If you have a .warc, you can use ArchiveSpark_WarcToJson.scala to convert it to the correct .json format, and follow the jusText cleaning tutorial above to clean it.
3. Edit Unit1.py to refer to your .json file in the "input_file_path" global variable at the top of the file.
4. Edit Unit1.py's main function to write to a .csv with a name appropriate to your dataset by changing the following line: writeCSV(final_results,".","fb_big_frequentwords.csv") and replacing "fb_big_frequentwords.csv" with a file name of your choosing.

Running and Interpreting Output

1. Use Python 2.7 to run the code with the following command: python2.7 Unit1.py
2. Results will be output to the console and to a .csv in order of most frequent to least frequent. The output will contain a word followed by its word count in the dataset.
3. Expected runtimes (on a dataset of similar size): < 1 minute

Named Entities

Set-Up

1. Make sure the most recent versions of NLTK and spaCy are installed by running "pip install -U nltk" and "pip install -U spacy".
2. Install a spaCy model for the English language using "python -m spacy download en". You will need to run this command as an administrator and have a C++ compiler installed.
3. Get your data ready:
   a. Documents need to be in a .json file which contains a 'text' key with the article text.
   b. If you have a .warc, you can use ArchiveSpark_WarcToJson.scala to convert it to the correct .json format, and follow the jusText cleaning tutorial above to clean it.

Running and Interpreting Output

1. Be sure to use Python 2.7 when running this code.
2. Run Unit5.py with the documents in a .json file to extract the named entities from the data.
3. Expected runtimes (on a dataset of similar size): ~5 minutes

Clustering

Set-Up

1. Make sure recent versions of NLTK and Gensim are installed by running "pip install -U gensim" and "pip install -U nltk".
2. Get your data ready:
   a. Documents need to be in a .json file which contains a 'text' key with the article text.
   b. If you have a .warc, you can use ArchiveSpark_WarcToJson.scala to convert it to the correct .json format, and follow the jusText cleaning tutorial above to clean it.

Running and Interpreting Output

1. Be sure to use Python 2.7 when running all of this code.
2. Run clustering.py on the .json dataset (you may need to change the file path, as it is hardcoded) to create the cluster results file.
3. Run plot_clusters.py with the cluster results file as input to create plots and .txt files containing headlines.
   a. You can assess the quality of a cluster by looking at the plot and by reading the headlines. The plot is just the first 2 dimensions of LSA, which we find is decently representative, as those dimensions carry the most variance. The headlines are a random selection from the cluster.
4. Run combine_clusters.py to create clusters.json, which is used later in the final summary scripts.
5. Expected runtimes (on a dataset of similar size):
   a. clustering.py: 15 minutes to 1 hour depending on how many different algorithms are being run
   b. combine_clusters.py: < 1 minute
   c. plot_clusters.py: < 10 minutes
   d. screeplot.py: 10-15 minutes

Complete Event Summaries

There are two complete methods of creating overall summaries provided in our solutions - extractive and abstractive.

Extractive

Set-Up

1. Make sure recent versions of NLTK and Gensim are installed by running "pip install -U gensim" and "pip install -U nltk".
2. Get your data ready:
   a. For the overall approach:
      i. Documents need to be in a .json file which contains a 'text' key with the article text.
      ii. If you have a .warc, you can use ArchiveSpark_WarcToJson.scala to convert it to the correct .json format, and follow the jusText cleaning tutorial above to clean it.
   b. For the cluster-based approach:
      i. Documents need to be in a .json file which contains JSON objects with two keys: 'text', a list of all the articles' text in that cluster, and 'clusterid', identifying each cluster.
      ii. You can create clusters by following the clustering tutorial above.
3. Edit Unit7_clusters.py / Unit7_split_cluster_summary.py to include only the desired cluster IDs (if you are doing the overall approach, you can skip this step).
   a. Unit7_clusters.py
      i. In the "evaluate" function, change the if statement checking cluster IDs to only perform the summaries on the desired clusters by adding and subtracting from that list in the form: if c_id == 5 or c_id == 8 or ...
   b. Unit7_split_cluster_summary.py
      i. In the "evaluate" function, change the if statement checking cluster IDs to only perform the summaries on the desired clusters by adding and subtracting from that list in the form: if c_id == 5 or c_id == 8 or ...
      ii. In the "main" function, inside the for loop, change the if statement to group the correct clusters in each paragraph. The clusters listed in the if statement will be in the first paragraph and all those not listed will be in the second paragraph.
4. Edit Unit7.py / Unit7_clusters.py / Unit7_split_cluster_summary.py to use the correct input_file_path (global variable at the top of the file) referring to your .json file.

Running and Interpreting Output

1. Be sure to use Python 2.7 to run the code with the following command: python2.7 [script name]
2. To redirect the output to a file instead of directly to the console, change the command to: python2.7 [script name] > [output file]
3. Expected runtimes (on a dataset of similar size):
   a. Unit7.py: ~8 minutes
   b. Unit7_clusters.py: ~12 minutes
   c. Unit7_split_cluster_summary.py: ~6:30 minutes
4. Expected output:
   a. Print statements explaining the progress of the program as it runs
   b. A summary of your event data collection

Abstractive - Pointer Generator

Set-Up

The code provided in this unit, unlike most, is set up to work with Python 3.6.x. At ​ ​ present, some of the package versions required to load the pre-trained model will not work with Python 3.7, and the original Pointer Generator codebase was modified, among other things, to work with Python 3 instead of Python 2.


Install the following packages, paying attention to versions:

● pip3 install tensorflow-gpu==1.2.1
● pip3 install protobuf==3.6.0
● pip3 install http://download.pytorch.org/whl/cu80/torch-0.4.1-cp36-cp36m-win_amd64.whl
● pip3 install torchvision

Download a current version of Stanford CoreNLP [33] and set the environment variable CLASSPATH=path/to/stanford-corenlp-3.9.2.jar.

To convert clusters.json to the tokenized files necessary to run PGN: ​ ​

1. Convert your input text to .story files:
   a. If starting with clusters.json:
      i. python json_to_story.py -f path/to/clusters.json -o path/to/output/directory
   b. If starting with an individual .txt file for each document:
      i. python txts_to_story.py -f path/to/input/directory -o path/to/output/directory
   c. If starting with a single .txt file where each line is a different document:
      i. python txt_to_story.py -f path/to/input.txt -o path/to/output/directory
2. Run: python make_datafiles.py -w path/to/directory/from/step/one -m test.bin
   a. This tokenizes the story files and converts them to serialized binary files. It creates chunked files in case they are needed for performance, but the single test.bin file will contain all input documents.
   b. Do not change the name of test.bin.
3. Expected runtimes (on a dataset of similar size):
   a. json_to_story.py, txt_to_story.py, txts_to_story.py: <1 minute
   b. make_datafiles.py: <2 minutes

Running and Interpreting Output

You will need your corpus as a series of .bin files and a url_metadata.txt file from ​ ​ the setup, a vocab file (the included vocab file from the CNN/Daily News corpus is usually sufficient), and the pretrained model (included in FacebookBreackSummarization_PretrainedPGN.zip). If you wish to run the code ​ more than once, you will have to rename, delete, or move the generated decode_test_400maxenc_4beam_35mindec_120maxdec_ckpt directory.

1. Run the decoder:
   python run_summarization.py --mode=decode --data_path=path/to/test.bin --vocab_path=path/to/vocab --log_root=path/to/logging/directory --exp_name=pretrained_model --max_enc_steps=400 --max_dec_steps=120 --coverage=1 --single_pass=1
   a. When mode=decode and single_pass=1, the output summaries will be written to numbered TXT files.
   b. Do not change any of the model parameters (max_enc_steps, max_dec_steps, and coverage).
   c. Your output will be written to $DATA_PATH/$EXP_NAME/decode_test_400maxenc_4beam_35mindec_120maxdec_ckpt/decode. $DATA_PATH/$EXP_NAME should point to the unzipped contents of FacebookBreackSummarization_PretrainedPGN.zip.
2. Re-align the summaries with their source articles:
   python abstracts_to_json.py -j path/to/clusters.json -m path/to/url_metadata.txt -o path/to/output/abstracts.json -s path/to/txt/files/from/step/one
   a. This code will re-align the output of run_summarization.py with clusters.json based on the URL and will create a new output .json file which contains an additional "abstracts" field with the per-document abstractive summaries.
3. The output of this method will be a single 1-3 sentence abstractive summary for each document passed in, and we would advise running this through abs_to_ext_by_cluster.py to generate a single readable summary. You can also try other methods whereby you subset your documents first and concatenate them, or where the documents passed in are some sort of summary themselves.
4. Expected runtimes (on a dataset of similar size):
   a. run_summarization.py: 10-30 minutes
   b. abstracts_to_json.py: <1 minute
   c. abs_to_ext_by_cluster.py: 10-20 minutes

Abstractive - FastAbsRL

Set-Up

1. Obtain the code with the pretrained model and CNN-DailyMail dataset from GitHub [4]

Running and Interpreting Output

1. The readme explains how to preprocess the CNN-DailyMail dataset and run it with the pretrained model. We did not have to make any modifications to get it to run on Ubuntu 18 with the original dataset.
2. To convert a .json to the .story format, run json_to_story.py.
3. Modify the ChenRocks/cnn-dailymail repo to process only the desired .story files.
4. Expected runtimes (on a dataset of similar size):
   a. make_summary.py: provides an interactive interface, so it takes as long as the user wants it to.


Abstractive - AbTextSum

Setup

1. Clone the repository.
2. Unfortunately, the codebase cannot be used directly and some changes need to be made, so follow the steps below:
   a. Install all the dependencies in requirements.txt from pip after updating your packages.
   b. Add sys.path.insert(0, '/home/path_to_AbTextSumm').
   c. Modify PROJECT_DIR=os.path.dirname(os.path.dirname(os.path.realpath(__file__)))+"/"
3. You will need a binary-formatted ARPA language model for this to work, so follow the directions at https://kheafield.com/code/kenlm/ to install kenlm.
4. Convert your entire corpus into a list of sentences using fb_sentences.py and create a text file.
5. Install SRILM, which you will use to generate the ARPA language model.
6. Follow the instructions at https://cmusphinx.github.io/wiki/tutoriallm/#training-an-arpa-model-with-srilm to train a language model (fb.arpa) with SRILM using the text file generated in step 4.
7. Convert fb.arpa to a binary file using bin/build_binary fb.arpa fb.binary.

Running and Interpreting Output

1. Be sure example.py points to the correct binary language model file.
2. Run example.py using Python 2.7.
3. The output will be an abstractive summary based on the .bin input.
4. Expected runtimes (on a dataset of similar size):
   a. dataset_to_sentences.py: < 5 seconds


Evaluating Gold Standard Summaries

Setup

1. Make sure pythonrouge is installed by running "pip install git+https://github.com/tagucci/pythonrouge.git" [34].
2. Get your data ready:
   a. Put both the gold standard and generated summaries in golden/ and predict/ folders, respectively. Use the GitHub link shown above to see proper file and folder formats.

Running and Interpreting Output

1. Rouge_para
   a. Make sure ROUGE-L = True in the code, and then run eval.py with the parameters "-t 1 -g golden/ -p predict/"
2. Rouge_sent
   a. Run eval.py with the parameters "-t 2 -g golden/nameofgoldstandardsummary.txt -p predict/nameofpredictedsummary.txt"
3. Cov_entity
   a. Run eval.py with the parameters "-t 3 -g golden/nameofgoldstandardsummary.txt -p predict/nameofpredictedsummary.txt"
4. Expected runtimes (on a dataset of similar size):
   a. eval.py: ~0 s


Developer’s Manual

Flow of Development

The flowchart in Figure 8 presents the overall workflow we utilized for this project, starting with the original data and following it through all of the modules through to the final summaries.

We started with our corpus as a .warc file containing the raw web pages collected about our event. We then ran code to convert that to a .json and then clean that data up - meaning removing any bad data, such as 404 errors, removing any irrelevant data such as headers and JavaScript, and removing any egregiously duplicate information. cleaned.json is used in almost every unit following its creation.

Once we had the cleaned.json containing the clean text of all the good web pages, ​ ​ we finished working on Unit 1: finding the most frequent words. It was easiest to complete that unit first and then build Units 2, 3, 4, and 5 off of that since the following units also dealt with getting frequency counts. A lot of the Unit 1 code could be reused for the next four units by implementing additional processing before getting the word frequency counts.

Units 6 through 10 did not build as easily off of Unit 1 as the earlier units. Unit 6 only used cleaned.json to find the LSA topics, and then used the LSA topics to do clustering. The clustering output from Unit 6 did feed into the later development of Unit 7, the extractive summaries. Extractive code development started by simply running the cleaned.json data through Gensim's summarize function in multiple stages, but those results did not cover everything we wanted them to. This is where the clustering output, clusters.json, came in: cluster membership was used to split documents prior to summarization in Unit7_clusters.py and Unit7_split_cluster_summary.py in order to get a broader range of topics. Editing which clusters are used and which are grouped together for the multi-paragraph summary yields the best results (steps for this can be found in the User's Manual).

Units 8 and 9 collectively aimed to extract a set list of important words or phrases and slot them into a template that could become a human-readable English summary. We originally intended to use regular expressions for this workflow, an older method for attempting to make rapid summaries of a type of event which occurs frequently. However, this proved unwieldy given the variety of irregularities in the natural language text. It was also a departure from the technologies built in previous units, using only clusters.json to limit the search space for any given word or phrase to articles likely to contain it. Building upon the Unit 5 code, which uses spaCy to tag named entities, we discovered that the flexibility of navigating and filtering the text in spaCy was very powerful. However, it was also much slower than regular expressions, and we ultimately decided to focus our development efforts on Units 7 and 10 for generating the final human-readable summary. More details can be found in Modules 8 and 9 above.

Unit 10, the abstractive summaries, follows two flows. The first, labeled "1," takes clusters.json and runs each cluster through the abstractive summarizer; the resulting abstractive summaries are then concatenated and run through the extractive summarizer to yield the final summary. The second flow, labeled "2," uses the extractive summarizer to create a summary for each cluster, concatenates those, and runs the result through the abstractive summarizer to produce the final output. In-depth explanations of our methodology can be found above in the Module 10 section.
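A minimal sketch of the first stage of flow 2 appears below, assuming clusters.json stores one cluster per line with the article texts in a "text" array (the field name is an assumption for illustration). The concatenated per-cluster extractive summaries written at the end are what would then be handed to the abstractive summarizer.

    import json
    from gensim.summarization import summarize

    # Flow 2, stage 1: one extractive summary per cluster using Gensim [18].
    cluster_summaries = []
    with open("clusters.json") as f:
        for line in f:
            cluster = json.loads(line)
            combined = " ".join(cluster["text"])  # assumed field name
            cluster_summaries.append(summarize(combined, word_count=200))

    # The concatenated result becomes the input to the abstractive summarizer.
    with open("cluster_extractive_summaries.txt", "w") as out:
        out.write("\n\n".join(cluster_summaries))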



Figure 8. Flowchart of development process

Data Processing

The data provided were web pages, meaning that they contained HTML and other machine-readable but inconsistent and irrelevant scripting alongside natural language with all of the typos and idiosyncrasies that come with it. Early filtering steps involved removing boilerplate; removing HTML, JavaScript, and CSS; removing 404 and other error pages; removing splash pages; removing duplicate URLs or duplicate articles; and removing non-English pages. We converted all text to lowercase to make text comparison easier, although we later realized that this made named entity recognition more difficult. An in-depth explanation of our data processing and cleaning can be found in the Data Cleaning section of the Approach.
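As an illustration of the boilerplate-removal step, a minimal jusText [11] call on a single page's HTML might look like the sketch below; the full pipeline in jusTextCleaningBig.py layers error-page, duplicate, and language filtering on top of this.

    import justext

    def extract_main_text(html):
        # Keep only paragraphs that jusText does not classify as boilerplate
        # (navigation, headers, scripts, and similar page furniture).
        paragraphs = justext.justext(html, justext.get_stoplist("English"))
        return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)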

In general, we wrote most of our code to start from either cleaned.json or clusters.json, depending on whether document category was relevant. However, some units performed additional cleaning steps in their own code. For many units, we filtered out a list of stopwords (extended from the one packaged with NLTK) to prevent common tokens from appearing at the top of our results. The lemmatization code (Unit 4) was used to improve the Unit 2 results. TF-IDF filtering was used before topic modelling to improve the quality of the topics.
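As a sketch of these per-unit filters, an extended stopword list and a simple TF-IDF-based vocabulary cut might look like the following; the added stopwords and thresholds are illustrative rather than the exact values used in our code.

    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer

    # NLTK stopword list extended with corpus-specific terms (illustrative additions).
    stop_words = set(stopwords.words("english")) | {"said", "also", "would"}

    def filter_stopwords(tokens):
        return [t for t in tokens if t.lower() not in stop_words]

    # Drop very rare and very common terms before topic modelling.
    vectorizer = TfidfVectorizer(stop_words=list(stop_words), min_df=5, max_df=0.8)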

Maintaining and Extending

For information on how to set up and run the various deliverables we have provided, see the User's Manual section.

There are several enhancements and extensions that could make the project better. One useful enhancement would be to create modules for commonly used functions, since we ended up copy-pasting certain pieces, such as the data-reading code, between many files. Another area that could be improved is the pipeline used to create cleaned.json and clusters.json, as both require several steps. The data processing could be automated further, which would make changes easier to implement.
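For example, a shared helper module (hypothetical name corpus_io.py) could replace the duplicated reading code:

    # corpus_io.py -- hypothetical shared module for the data-reading code that
    # was copy-pasted across units.
    import json

    def read_json_lines(path):
        """Yield one JSON object per non-empty line of a .json-lines file
        such as cleaned.json or clusters.json."""
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)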

Another key improvement would be to adopt a coding standard. Although a high coding standard may not seem essential, a consistent codebase is crucial to a project of this scale. For example, different members of our team had different ways of handling input/output in their code. To keep the repository maintainable, a useful future extension would be to set input and output file paths through command-line parameters.
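A sketch of the proposed command-line interface using argparse is shown below; the flag names are suggestions rather than part of the existing code.

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description="Run one summarization unit.")
        parser.add_argument("--input", default="cleaned.json",
                            help="Path to the input .json (cleaned.json or clusters.json)")
        parser.add_argument("--output", default="output.txt",
                            help="Path to write this unit's results")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        print("Reading from %s, writing to %s" % (args.input, args.output))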

Along those same lines, there are a few discrepancies in the languages used to write this codebase. Scala is used exclusively in the initial processing step, ArchiveSpark_WarcToJson.scala, which converts the .warc and .cdx files into a .json. Integrating this step into jusTextCleaningBig.py or a separate Python script would cut down on the number of languages involved in the project, the number of processing steps, and the number of times files have to be retrieved and concatenated from distributed storage. There are also a few pieces of code, notably Unit8.py and the PGN-based extractive summarizer, that were written in Python 3.6 due to difficulties getting their dependencies working with Python 2.7, which the rest of the code was written in so that it could run on the Hadoop cluster available during the course. It would be wise to streamline the code to a single version of Python and to reduce the number of parallel processing workflows used (PySpark, multithreading, TensorFlow GPU, etc.) to decrease the overall complexity and the number of dependencies in the project.

Due to the timeframe of the class, only pretrained models were used for abstractive summarization, and the way the PGN model in particular was saved restricted the usable version of TensorFlow. We would advise training the model from scratch (using the CNN/Daily Mail corpus or another corpus deemed more relevant) in a newer version of TensorFlow and saving it using a mechanism other than checkpoints, one that does not have the same level of version dependence. Modernizing to a new version of TensorFlow would make the code easier to run.
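As a toy illustration of the suggested export format (not the PGN code itself), a TensorFlow 1.x graph can be written out as a SavedModel, which travels across versions better than raw checkpoints; the tensors below merely stand in for whatever inputs and outputs a retrained model would expose.

    import tensorflow as tf

    # Toy graph standing in for the retrained summarization model.
    x = tf.placeholder(tf.float32, [None, 128], name="article_encoding")
    w = tf.Variable(tf.random_normal([128, 1]))
    y = tf.identity(tf.matmul(x, w), name="summary_score")

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # A SavedModel bundles the graph and weights in one exportable directory.
        tf.saved_model.simple_save(sess, "exported_model",
                                   inputs={"article_encoding": x},
                                   outputs={"summary_score": y})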

The Module 8 code is unfinished. To support the currently attempted method of named entity extraction using spaCy, the early cleaning code would need to be modified to work without converting all text to lowercase. Queries would then have to be written that could extract relevant entities for the remaining slots in Table 13. If the code were finished, the next step for the module would be ensuring that it is generalizable enough to extract good entities from other collections about data breaches.
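A sketch of the kind of spaCy query involved, for a single slot such as the date of the breach, might look like the following; the trigger words and slot choice are illustrative, and Unit8.py contains the actual partial implementation.

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")

    def candidate_breach_dates(cluster_texts):
        """Count DATE entities that occur in sentences mentioning the breach."""
        counts = Counter()
        for text in cluster_texts:
            doc = nlp(text)
            for sent in doc.sents:
                lowered = sent.text.lower()
                if "breach" in lowered or "harvested" in lowered:
                    for ent in sent.ents:
                        if ent.label_ == "DATE":
                            counts[ent.text] += 1
        return counts.most_common(10)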

List of Files Submitted

FacebookBreachSummarization_CodeAndResults.zip:

This is the main repository of code and results, available alongside this report. It should be sufficient for following along with the User's Manual and reproducing the results of this report. See the additional convenience file listing below.

➔ select_top_k_documents
  ◆ select_top_k_documents.py
    ● Input: .json with all cleaned article text, one article per line (cleaned.json)
    ● Output: .json with only the top k documents' text (k can be specified in the global variable at the top of the file) (new_cleaned.json)
  ◆ extractive_summary_top5.txt
    ● An extractive summary performed on the top 5 documents
  ◆ extractive_summary_top20.txt
    ● An extractive summary performed on the top 20 documents
  ◆ extractive_summary_top50.txt
    ● An extractive summary performed on the top 50 documents
  ◆ extractive_summary_top100.txt
    ● An extractive summary performed on the top 100 documents
  ◆ extractive_summary_top200.txt
    ● An extractive summary performed on the top 200 documents
  ◆ extractive_summary_top300.txt
    ● An extractive summary performed on the top 300 documents
  ◆ new_cleaned.json
    ● .json created with only the top k documents' text
➔ jusText_Cleaning_Code
  ◆ ArchiveSpark_WarcToJson.scala
    ● Input: .warc containing all of the web pages of the dataset
    ● Output: .json format containing all the relevant information from the web pages (text, URL, timestamps)
  ◆ jusTextCleaningSmall.py
    ● Input: a .json containing full web page text for all of the documents, as produced by ArchiveSpark_WarcToJson.scala. Intended for use with relatively small data sets.
    ● Output: a single .json containing relevant data and cleaned text for each web page, as well as an estimated publication date.
  ◆ jusTextCleaningBig.py
    ● Input: a .json containing full web page text for all of the documents, as produced by ArchiveSpark_WarcToJson.scala. Intended for use with larger data sets.

    ● Output: One .json file per worker node, each containing relevant data and cleaned text for a subset of web pages, as well as an estimated publication date. Must be concatenated into one file by something such as jsonFileConcat.py
  ◆ remove_hex_chars.py
    ● Input: .json file containing all documents in ASCII encoding
    ● Output: .json file containing all documents in UTF-8 encoding
  ◆ furthur_cleaning.py
    ● Input: .json file containing all documents, one on each line
    ● Output: .json file after deleting documents based on parameters
  ◆ remove_similar_documents.py
    ● Input: .json file containing all documents, one on each line
    ● Output: .json file where similar documents have been removed
  ◆ cleaned.json
    ● .json with all cleaned article text, one article per line
  ◆ jsonFileConcat.py
    ● Input: Path to folder containing all .json files (as produced by jusTextCleaningBig.py)
    ● Output: Single .json file that concatenates all the files in the input folder
➔ Unit_1_most_freq_words
  ◆ Unit1.py
    ● Input: .json with all cleaned article text, one article per line (cleaned.json)
    ● Output: .csv with frequency counts of all tokens (words) in the dataset, not including stop words.
  ◆ eclipse_frequentwords.csv
    ● Counts of all word occurrences (not including stop words) in all documents in the eclipse dataset, ordered by frequency.
  ◆ fb_big_frequentwords.csv
    ● Counts of all word occurrences (not including stop words) in a subset of documents in the Facebook dataset, ordered by frequency.
  ◆ fb_small_frequentwords.csv
    ● Counts of all word occurrences (not including stop words) in all documents in the Facebook dataset, ordered by frequency.
➔ Unit_2_WordNet_synset
  ◆ Unit2Lemmas.py
    ● Input: .json with all cleaned article text, one article per line (cleaned.json)
    ● Output: Count of most frequent Synsets based on lemmatized words
  ◆ Unit2Prototype.py
    ● Input: .json with all cleaned article text, one article per line (cleaned.json)
    ● Output: Prints to console a count of most frequent Synsets based on POS-tagged tokens that have not been lemmatized.
➔ Unit_3_POS
  ◆ Unit3.py
    ● Input: .json with all cleaned article text, one article per line (cleaned.json)
    ● Output: Four .csv files, one for each basic part of speech (adverbs, adjectives, nouns, verbs)
  ◆ eclipse_common_adjectives.csv
    ● Counts of all adjectives in all documents in the eclipse dataset, ordered by frequency.

  ◆ eclipse_common_adverbs.csv
    ● Counts of all adverbs in all documents in the eclipse dataset, ordered by frequency.
  ◆ eclipse_common_nouns.csv
    ● Counts of all nouns in all documents in the eclipse dataset, ordered by frequency.
  ◆ eclipse_common_verbs.csv
    ● Counts of all verbs in all documents in the eclipse dataset, ordered by frequency.
  ◆ facebook_common_adjectives.csv
    ● Counts of all adjectives in all documents in the Facebook dataset, ordered by frequency.
  ◆ facebook_common_adverbs.csv
    ● Counts of all adverbs in all documents in the Facebook dataset, ordered by frequency.
  ◆ facebook_common_nouns.csv
    ● Counts of all nouns in all documents in the Facebook dataset, ordered by frequency.
  ◆ facebook_common_verbs.csv
    ● Counts of all verbs in all documents in the Facebook dataset, ordered by frequency.
➔ Unit_4_Stemming_Lemmatizing
  ◆ Unit4_Stems.py
    ● Input: .json with all the cleaned article text (cleaned.json)
    ● Output: a list of most frequent stemmed words and their frequency counts (stem_output.txt)
  ◆ Unit4_Lemmas.py
    ● Input: .json with all the cleaned article text (cleaned.json)
    ● Output: a list of most frequent lemmatized words and their frequency counts (lemma_output.txt)
  ◆ stem_output.txt
    ● Frequency counts of all of the stemmed words in the Facebook dataset
  ◆ lemma_output.txt
    ● Frequency counts of all of the lemmatized words in the Facebook dataset
➔ Unit_5_Named_Entities
  ◆ Unit5.py
    ● Input: .json with all cleaned article text, one article per line (cleaned.json)
    ● Output: A list of the most frequent named entities
➔ Unit_6_LSA
  ◆ Unit6.py
    ● Input: .json with all the cleaned article jsons (cleaned.json)
    ● Output: .csv file representing the topic x document matrix
➔ Clustering
  ◆ Lsa.csv
    ● .csv file representing the topic x document matrix
  ◆ clustering.py
    ● Input: .csv file representing the topic x document matrix (lsa.csv)
    ● Output: .pickle file representing the clusters as a Python object
  ◆ Lsa_clusters.pickle
    ● .pickle file representing the clusters as a Python object
  ◆ combine_clusters.py
    ● Input: Array of .json files representing cluster contents and descriptions

    ● Output: .json file containing the concatenated results along with cluster descriptions
  ◆ screeplot.py
    ● Input: .csv file representing the topic x document matrix (lsa.csv)
    ● Output: .png containing the screeplot to find the optimal cluster number for k-means
  ◆ plot_clusters.py
    ● Input: .pickle file representing clusters and .csv containing the topic x document matrix
    ● Output: Plots of all the clusters
  ◆ clusters.json
    ● a .json with article text grouped into clusters (clusters.json)
  ◆ original_40topics_k-means_16clusters.txt
    ● Text file containing 30 headlines for each cluster with a configuration of 40 topics, k-means, and 16 clusters
  ◆ reweighted_40topics_k-means_24clusters.txt
    ● Text file containing 30 headlines for each reweighted cluster with a configuration of 40 topics, k-means, and 24 clusters
  ◆ original_40topics_k-means_16clusters_plot.png
    ● The plot of the first two dimensions against each other for each cluster with a configuration of 40 topics, k-means, and 16 clusters
  ◆ reweighted_40topics_k-means_24clusters_plot.png
    ● The plot of the first two dimensions against each other for each reweighted cluster with a configuration of 40 topics, k-means, and 24 clusters
  ◆ original_40topics_k-means_16clusters.json
    ● .json file containing all the documents in each cluster with a configuration of 40 topics, k-means, and 16 clusters
  ◆ reweighted_40topics_k-means_24clusters.json
    ● .json file containing all the documents in each reweighted cluster with a configuration of 40 topics, k-means, and 24 clusters
➔ Unit_7_extractive_summary
  ◆ Unit7.py
    ● Input: .json with the top 100 sources of cleaned article text (new_cleaned.json)
    ● Output: a single-paragraph extractive summary based on the best articles in the dataset
  ◆ Unit7_clusters.py
    ● Input: a .json with article text grouped into clusters (clusters.json)
    ● Output: a single-paragraph extractive summary based on only selected clusters of articles
  ◆ Unit7_split_cluster_summary.py
    ● Input: a .json with article text grouped into clusters (clusters.json)
    ● Output: a multi-paragraph extractive summary where different groups of clusters can be specified for each paragraph
➔ Unit_8_9_template_summary
  ◆ Unit8.py
    ● Input: a .json with the cleaned article text, one cluster per line, with the text and other fields stored in arrays (clusters.json). Note that it uses the order of clusters to determine where to look for which tokens, and a different clustering method or data set would require changing these numbers.
    ● Output: Prints to the console a list of tokens filling each slot in the template, alongside the name of the slot being filled. (Incomplete)
  ◆ RegexResults
    ● Unit8RegexPrototype.py
      ○ Input: a .json with the cleaned article text, one cluster per line, with the text and other fields stored in arrays (clusters.json). Note that it uses the order of clusters to determine where to look for which tokens, and a different clustering method or data set would require changing these numbers.
      ○ Output: A .csv file for every slot in the template summary, listing the matching tokens from the data set and their frequency of occurrence. (Incomplete)
    ● breachDates.csv
      ○ The potential dates of the data breach identified by Unit8RegexPrototype.py, ordered by frequency
    ● numUsers.csv
      ○ The potential numbers of users whose information was leaked, identified by Unit8RegexPrototype.py, ordered by frequency
    ● senDates.csv
      ○ The potential dates of the senate hearing identified by Unit8RegexPrototype.py, ordered by frequency
    ● websites.csv
      ○ The potential websites/companies/organizations responsible for the breach identified by Unit8RegexPrototype.py, ordered by frequency
➔ Unit_10_abstractive_summary
  ◆ AbTextSumm
    ● dataset_to_sentences.py
      ○ Input: .json file containing all documents in our dataset
      ○ Output: a text file with each line containing an individual sentence
    ● fb_sentences.txt
      ○ a text file with each line containing an individual sentence
    ● fb_language_model.arpa
      ○ ARPA file containing the language model for the Facebook dataset
  ◆ FastAbsRL
    ● Make_summary.py
      ○ Input: .json file representing documents in each cluster (clusters.json)
      ○ Output: Interactive command-line output that lets you generate summaries for chosen clusters based on their descriptions
  ◆ PGN
    ● clean_data_for_pgn
      ○ json_to_story.py
        ◆ Input: a .json with the cleaned article text, one cluster per line, with the text and other fields stored in arrays (clusters.json).

        ◆ Output: a directory of .story files, one per article, and two files, all_urls.txt and url_metadata.txt, which are keys to the identity of each .story file.
      ○ txt_to_story.py
        ◆ Input: a .txt with article text, one per line
        ◆ Output: a directory of .story files, one per article, and a file all_urls.txt which is a key to the identity of each .story file.
      ○ txts_to_story.py
        ◆ Input: a directory of .txt files, one article per file
        ◆ Output: a directory of .story files, one per article, and a file all_urls.txt which is a key to the identity of each .story file.
      ○ make_datafiles.py
        ◆ Input: a directory of .story files, one per article, and the file all_urls.txt produced by one of the previous converters
        ◆ Output: a .bin file (and a series of chunked .bin files, if you would like to use those for improved performance) containing the text and article ID for each document to be summarized. This is the input for run_summarization.py
      ○ README.md
        ◆ Some more detailed information on code sources for this unit and expected input/output/run instructions.
    ● vocab
      ○ A vocabulary file, based off of the CNN/Daily Mail dataset, which is used as input for the abstractive summarizer. You most likely do not need to generate your own file for your data set, as pointer-generator is capable of extracting words not in its vocabulary.
    ● abstracts_to_json.py
      ○ Input: The directory of .txt files containing summaries (one per file) generated by the abstractive summarizer, the clusters.json file passed to json_to_story.py, and the url_metadata.txt generated by it. (Note that these are all order-sensitive, so be sure not to get files confused.)
      ○ Output: A .json file of the same format as clusters.json with an additional field "abstracts" that contains an array of individual document summaries for each cluster, in the same order as the URL, text, and other arrays.
    ● abs_to_ext_by_cluster.py
      ○ Input: abstracts.json, as output by abstracts_to_json.py or as included in this folder
      ○ Output: A modification of Unit7_clusters.py, this code outputs a single extractive summary made from the single-document abstractive summaries. It prints the summary to the console.
    ● Pointer_generator
      ○ run_summarization.py
        ◆ Input: a test.bin file or series of chunked .bin files of the type created by make_datafiles.py, a vocab file of the type included with this project, and a directory containing the checkpoint files for the pretrained model (as included with this project in FacebookBreachSummarization_PretrainedPGN.zip or through [6]).
        ◆ Output: a series of text files, one per input document, with the generated abstractive summaries.
        ◆ Note that all other files included have functions that are used by this script and will not be called directly from the command line.
      ○ __init__.py
      ○ attention_decoder.py
      ○ batcher.py
      ○ beam_search.py
      ○ data.py
      ○ decode.py
      ○ inspect_checkpoint.py
      ○ model.py
      ○ util.py
      ○ README.md
    ● abstracts.json
      ○ The output of the abstractive summarizer when run on each document individually, re-ordered and inserted into clusters.json
➔ ROUGE_Evaluation
  ◆ eval.py
    ● Input: Predicted summary and gold standard summary as .txt files
    ● Output: ROUGE scores and percentage of entities covered by the predicted summary

Supplementary code files:

These are larger archives which are not strictly necessary to recreate the results from this report, but which are provided for convenience. They are either time-consuming to compile, only available from unstable sources (e.g., a Google Drive link), or were modified heavily to work with our data. Note that these are large files that will likely take a long time to download and/or unzip.

➔ FacebookBreachSummarization_PySparkExtPkgs.zip
  ◆ A .zip archive containing the source code for the external packages necessary to run FacebookBreachSummarization_CodeAndResults.zip/jusText_Cleaning_Code/JusTextCleaningBig.py and other PySpark code.
  ◆ Must be passed to the worker nodes when submitting a PySpark job.
  ◆ Important packages included:
    ● jusText
    ● NLTK
    ● num2words
    ● word2number
    ● dateutil
    ● BeautifulSoup4
  ◆ This archive can be recreated as follows:
    ● Make a clean Anaconda environment running the same version of Python as your worker nodes.
    ● Use pip to install the needed packages.
    ● Copy ~/.conda/envs into a folder named external_pkgs
    ● Zip external_pkgs
➔ FacebookBreackSummarization_FastAbsRLFork.zip
  ◆ A clone of the original FastAbsRL repository [4], modified to work with our corpus.
➔ FacebookBreachSummarization_FilesForAbsSum.zip
  ◆ facebook/story_files
    ● A directory of .story files, where each contains the raw text of a single article about the data breach, as output by a script like FacebookBreachSummarization_CodeAndResults.zip/Unit_10_abstractive_summary/PGN/clean_data_for_pgn/*_to_story.py
    ● Could be used along with all_urls.txt as the input to any version of the make_datafiles.py script.
  ◆ facebook/all_urls.txt
    ● A text file containing one URL per line, where each is an article present in fb_story_files. Since those file names are hashes of the URLs, this is later used to identify the files. Necessary input for make_datafiles.py.
  ◆ facebook/url_metadata.txt
    ● A version of all_urls.txt which also includes the cluster number for each document.
  ◆ make_datafiles.py
    ● A script which tokenizes the .story files in facebook/story_files and converts them to test/train/validation sets for a pytorch-based summarizer such as FastAbsRL.
    ● Input: a directory of .story files, one per article, and the file all_urls.txt produced by one of the previous converters (FacebookBreachSummarization_CodeAndResults.zip/Unit_10_abstractive_summary/PGN/clean_data_for_pgn/*_to_story.py)
    ● Output:
      ○ A series of .story files containing the tokenized text and article ID (hashed URL) for each document to be summarized.
      ○ A directory finished_files containing .tgz files with test/train/validation data. The .tgz files need to be unzipped for use in the next step. These are the input for the code in FacebookBreackSummarization_FastAbsRLFork.zip.
➔ FacebookBreachSummarization_PretrainedPGN.zip
  ◆ A .zip file containing the pretrained pointer-generator model for TensorFlow 1.2.1, as found at [6]. Must be unzipped and loaded by FacebookBreachSummarization_CodeAndResults.zip/Unit_10_abstractive_summary/PGN/pointer_generator/run_summarization.py in order to generate abstractive summaries.
  ◆ This is simply a stable copy of the model provided for convenience and longevity. [6] contains a link to a Google Drive download, which is where this was obtained.

Lessons Learned

Timeline

In the first month of the course we were able to split the work among the five of us and finish Modules 1 through 4. After the simpler units were done, we needed slightly more planning to stay on track to finish the bigger units and summaries on time.

The Gantt chart in Figure 9 was created in early October to keep us on track for the last few months of the semester. We had originally hoped to finish document clustering, extractive summarization, and template summarization (Units 6-9) quickly so we could focus on training our own deep learning-based abstractive model, but these tasks took longer than expected: many were finished about two weeks later than planned, and the template summarization was never finished to the extent we had envisioned. We realized that our plans were overly ambitious, especially as no one on our team had prior experience with deep learning models, and in the end we finished the abstractive unit using pretrained models and the guidance of classmates. We do not regret the extra time spent tuning the document clustering and extractive summarizer, as these units came together to produce the best summary we generated in the class.


Figure 9. Original Gantt chart for 10/8-11/16, at which point we had originally hoped to have all code done in order to spend the rest of the semester writing.

Challenges Faced

One of the main challenges we faced was finishing our overall summaries on time; we should have allocated more time to this process. Due to the size of our collection, every code change took a while to run and produce results. If we had to do this project over again, we would have started thinking about the end goal sooner to avoid being rushed at the end.

We struggled in general with the breadth of topics covered in the corpus, especially when deciding which information was relevant. There were multiple responses from governments and groups, and our data set contained articles about more than one recent Facebook security incident. We decided to leave out many smaller clusters with information on the international response, and as a result there is nothing about Vote Leave or the GDPR fallout in the final summaries. There is also very little information about the other Facebook incidents in our final summaries, but what remained occasionally caused issues, as bad sequencing sometimes jumbled text about the Facebook Messenger call log breach with information about the major Cambridge Analytica scandal.

In the early units we also struggled with how many important terms in the corpus were proper nouns unrecognized by stemmers, lemmatizers, and WordNet. These tools were not used much in later units for this reason, and any method that ignored out-of-vocabulary words was essentially useless for our dataset.

Another point of struggle for our team was finding the optimal settings for clustering. Since most of us were not knowledgeable about which clustering techniques to apply, we struggled to narrow our options intelligently. In the end we simply tried everything and manually identified the best approaches; this process would have been far more refined had we known more about each clustering algorithm.

Solutions Developed

To meet the overall goal of this course, which was to create a summary of an event given a collection of web pages about it, we ultimately came up with the following solutions:

An extractive summary consists of sentences that appear within the articles in our data collection. We created two versions of this summary: one summarizing all of the web pages at once, and one summarizing individual clusters of articles based on topics and then combining those summaries (both can be found in Appendix A). Of the two, the cluster-based approach yielded better results, with less repetitiveness and more variety in the areas of the event covered. Overall, we found that the cluster-based extractive summary was our best summary.

An abstractive summary consists of newly generated sentences that summarize the content of our data collection. Three different abstractive summarization libraries were tested, and within each we tried different methods of processing the input and output. One method used FastAbsRL on an extractive summary built from selected clusters, one used extractive summaries of each document cluster as the inputs for PGN, and one used PGN to summarize each individual document in the corpus and then made an extractive summary of the concatenated summaries for each cluster. The method applying the PGN abstractive summarizer to the extractive summaries yielded the best results of the abstractive methods tried, with the best grammar, the least repetitiveness, and the most accurate facts. It is slightly less repetitive than the final extractive summary, but the second paragraph (covering the various responses to the data breach) does not flow as well and contains fewer complete sentences. The abstractive summaries can be found in Appendix B, and more details on the methods can be found in the section on Module 10.


Conclusions and Future Work

Future Work

A lot of work was done over the course of the semester on this project; however, we did have some ideas that we ran out of time to try.

One way to improve our final abstractive summary would be to refine the process of converting FastAbsRL's single-document summaries into a final abstractive summary. Theoretically, the results of FastAbsRL should be better than those of PGN, but we did not spend as much time converting FastAbsRL's results into a single final summary. We also wanted to try an abstractive summarization model designed to turn multiple documents into a single summary, such as tensor2tensor [35], but we did not have time to train such a model by the time we were investigating abstractive methods.

We were interested in replacing WordNet with a word2vec metric for word similarity, trained on a corpus such as Hacker News articles that would have greater coverage of tech company names and modern jargon. Training a word2vec model on our own corpus could also have helped with semantic understanding of relationships between entities [36], such as companies and their CEOs or companies related to each other, which make up a large part of the information contained in the corpus.
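A sketch of such a corpus-trained model with gensim follows; the tiny token lists and parameters are illustrative only, and a real run would use the full sentence-tokenized corpus.

    from gensim.models import Word2Vec

    # Each entry is one tokenized sentence from the corpus (toy examples here).
    tokenized_sentences = [
        ["facebook", "suspended", "cambridge", "analytica"],
        ["zuckerberg", "testified", "before", "the", "senate"],
    ]

    # Small illustrative parameters; entities used in similar contexts end up
    # close together in the resulting vector space.
    model = Word2Vec(tokenized_sentences, size=100, window=5, min_count=1, workers=4)
    print(model.wv.most_similar("facebook", topn=3))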

Our summarization relied heavily on clustering, so we could also potentially improve the quality of both our extractive and abstractive summaries by refining the clustering results. In creating the summaries, we found that some clusters overlapped heavily with other clusters that our current best approach kept separate (e.g., there were two clusters about India and two clusters about the senate hearing), so investigating and correcting this might lead to cleaner clusters and better summaries.
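One way to detect such overlap, sketched below, would be to compare cluster centroids in the LSA topic space (lsa.csv) and flag highly similar pairs as candidates for merging; the similarity threshold is illustrative.

    import numpy as np
    from itertools import combinations
    from sklearn.metrics.pairwise import cosine_similarity

    def overlapping_clusters(topic_matrix, labels, threshold=0.9):
        """Return cluster pairs whose centroids are nearly identical in topic space.

        topic_matrix: documents x topics NumPy array (e.g., loaded from lsa.csv)
        labels: cluster label for each document (e.g., from the clustering .pickle)
        """
        labels = np.asarray(labels)
        ids = sorted(set(labels))
        centroids = np.vstack([topic_matrix[labels == c].mean(axis=0) for c in ids])
        sims = cosine_similarity(centroids)
        return [(ids[i], ids[j], sims[i, j])
                for i, j in combinations(range(len(ids)), 2)
                if sims[i, j] >= threshold]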

Future work should use the techniques developed for this corpus to summarize other, similar corpora (about other data breaches, other political events or scandals, a larger corpus about Facebook, etc.) to see how the methods hold up. Unfortunately, due to the heavy reliance on document clustering and manual selection of clusters, the code would not be directly applicable to a new data set without tuning. Further developing the template summary process and generalizing it to work on other collections about data breaches would therefore be useful.

Finally, we would have appreciated a larger data set of articles about the solar eclipse to work with, as it would have been interesting to see more developed summaries about that event, to see what categories of articles were written about the eclipse, and to see whether code written to summarize one solar eclipse could be used directly for articles about a different eclipse. It would also be interesting to apply our Facebook Data Breach approach to the solar eclipse dataset to see whether those summarization techniques would produce good summaries for it as well.

Conclusions

This report presents implementations of various Natural Language Processing and automated summarization techniques, using a collection of web pages about the Facebook-Cambridge Analytica data breach as a corpus. Four main methods of generating complete, grammatical English summaries were tried: (1) a purely extractive method that did not follow a logical progression, was highly repetitive, and was often vague (e.g., "The company" instead of Facebook) but had the best ROUGE scores, likely due to its high coverage of named entities; (2) an extractive method that concatenated summaries based on individual document categories, which was much more readable but did not have the same level of detail about organizations; (3) a mixed method that generated a single extractive summary from pointer-generator abstractive summaries of individual documents, which was slightly shorter and less repetitive than (2) but mainly due to lower readability; and (4) a mixed method that used the workflow from (3) but with FastAbsRL, which was more highly abstractive but had overall low coherency.

The results from (2), the extractive cluster-based method, were subjectively judged the best, but in removing overly repetitive sentences, names such as "Christopher Wylie", "Alexander Nix", and "Strategic Communication Laboratories" that often appeared in the same context were also removed, resulting in lower ROUGE scores. This shows that when attempting to generate a single summary from a large body of documents, a metric such as ROUGE, computed against a single gold standard, rewards the presence of specific words and does not take into account important factors like readability or coherency. Thus, when focusing on a single event collection, we would suggest an iterative design process for this sort of task with frequent human assessment of summary quality.


Acknowledgments

This project would not have been possible without the NSF-funded Global Event and Trend Archive Research (GETAR) project, IIS-1619028, used to create our collections.

We would like to thank, first and foremost, Dr. Edward Fox ([email protected]).

We would also like to thank the Teaching Assistant for this course, Liuqing Li ([email protected]).

We would like to thank our fellow classmates for sharing their ideas, workflows, and in some cases code.

And, finally, all of the presenters and consultants who took time to help us as well:

Xuan Zhang ([email protected])

Srijith Rajamohan ([email protected])

Michael Horning ([email protected])

Matthew Ritzinger ([email protected])

Ziqian Song ([email protected])


References

[1] S. Bird, E. Klein, and E. Loper, Natural language processing with Python. Beijing: O'Reilly, 2011. [Online]. Available: http://www.nltk.org/book/. [Accessed: 28-Nov-2018].

[2] H. T. Ng and J. Zelle, “Corpus-Based Approaches to Semantic Interpretation in Natural Language Processing,” AI Mag., vol. 18, no. 4, pp. 45–64, 15-Dec-1997. ​ ​ [Online]. Available: https://doi.org/10.1609/aimag.v18i4.1321. [Accessed: ​ ​ 28-Nov-2018].

[3] Y. C. Chen and M. Bansal, “Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting,” arXiv:1805.11080v1 [cs.CL], 28-May-2018. ​ ​ [Online]. Available: https://arxiv.org/abs/1805.11080. [Accessed: 28-Nov-2018]. ​ ​

[4] Y. C. Chen, “ChenRocks/fast_abs_rl,” GitHub, 06-Aug-2018. [Online]. Available: ​ ​ https://github.com/ChenRocks/fast_abs_rl. Commit ​ aebf539107caba5be35720f5d1f9f98989a069e8. [Accessed: 28-Nov-2018].

[5] A. See, P. J. Liu, and C. D. Manning, “Get To The Point: Summarization with Pointer-Generator Networks,” arXiv:1704.04368v2 [cs.CL], 25-Apr-2017. [Online]. ​ ​ Available: https://arxiv.org/abs/1704.04368. [Accessed: 28-Nov-2018]. ​ ​

[6] A. See, “abisee/pointer-generator,” GitHub, 09-Jul-2018. [Online]. Available: ​ ​ https://github.com/abisee/pointer-generator. Commit ​ a7317f573d01b944c31a76bde7218bcfc890ef6a. [Accessed: 28-Nov-2018].

[7] S. Li, “Named Entity Recognition with NLTK and SpaCy – Towards Data Science,” Medium, 17-Aug-2018. [Online]. Available: ​ ​ https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c 4a7d88e7da. [Accessed: 28-Nov-2018]. ​


[8] Explosion AI. “spaCy v2.0,” GitHub, 2018. [Online]. Available: https://spacy.io/. ​ ​ ​ ​ [Accessed: 28-Nov-2018].

[9] S. Banerjee, P. Mitra, and K. Sugiyama, “Multi-Document Abstractive Summarization Using ILP Based Multi-Sentence Compression,” in Proceedings of ​ the 24th International Joint Conference on Artificial Intelligence ‘15. Buenos Aires, ​ Argentina, 2015. [Online]. Available: https://arxiv.org/abs/1609.07034. [Accessed: ​ ​ 28-Nov-2018].

[10] S. Du, “StevenLOL/AbTextSumm,” GitHub, 22-Apr-2018. [Online]. Available: ​ ​ https://github.com/StevenLOL/AbTextSumm. Commit ​ ee057f090606fd4e2f0229ebd3a52e500a93e6c7. [Accessed: 28-Nov-2018].

[11] M. Belica. “jusText heuristic based boilerplate removal tool,” GitHub, ​ ​ 5-Mar-2017. [Online]. Available: https://github.com/miso-belica/jusText. Commit ​ ​ ad05130df2ca883f291693353f9d86e20fe94a4e. [Accessed: 28-Nov-2018].

[12] Jan Pomikálek, “Removing Boilerplate and Duplicate Content from Web Corpora,” Ph.D. thesis, Masaryk University, Brno, Czech Republic, 2011. [Online]. Available: https://is.muni.cz/th/o6om2/?lang=en. [Accessed: 28-Nov-2018]. ​ ​

[13] L. Richardson. “Beautiful Soup,” Launchpad, 2018. [Online]. Available: https://www.crummy.com/software/BeautifulSoup/. [Accessed: 28-Nov-2018]. ​

[14] M. Lesk, “Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone,” in Proceedings of ​ the 5th Annual International Conference on Systems Documentation. Toronto, ​ Ontario, Canada, 1986. [Online]. Available: http://doi.acm.org/10.1145/318723.318728. [Accessed: 28-Nov-2018]. ​

[15] A. G. Jivani, "A Comparative Study of Stemming Algorithms," International Journal of Computer Technology and Applications, vol. 2, no. 6, pp. 1930-1938, 2011. [Online]. Available: http://www.ijcta.com/vol2issue6-page2.php. [Accessed: 28-Nov-2018].

[16] S. Bird, E. Klein, and E. Loper, “7. Extracting Information from Text,” in Natural language processing with Python. Beijing: O’Reilly, 2011. [Online]. ​ Available: http://www.nltk.org/book/ch07.html. [Accessed: 28-Nov-2018]. ​ ​

[17] Explosion AI, “Linguistic Features,” spaCy Usage Documentation. [Online]. ​ ​ Available: https://spacy.io/usage/linguistic-features. [Accessed: 28-Nov-2018]. ​ ​

[18] R. Řehůřek , “Gensim: topic modelling for humans,” RaRe Consulting, ​ ​ 20-Sep-2018. [Online]. Available: https://radimrehurek.com/gensim/summarization/summariser.html. [Accessed: ​ 28-Nov-2018].

[19] Python Software Foundation. “7.4. difflib - Helpers for computing deltas,” Python 2.7.15 documentation, 08-Nov-2018. [Online]. Available: ​ https://docs.python.org/2/library/difflib.html. [Accessed: 28-Nov-2018]. ​

[20] C. Cadwalladr and E. Graham-Harrison, “Revealed: 50 million Facebook profiles harvested for Cambridge Analytica in major data breach,” The Guardian, 17-Mar-2018. [Online]. Available: https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-i nfluence-us-election. [Accessed: 25-Nov-2018]. ​

[21] H. Kozlowska, “All the different ways Facebook is in trouble with governments across the world,” Quartz, 21-Mar-2018. [Online]. Available: ​ ​ https://qz.com/1234450/cambridge-analytica-scandal-all-the-different-probes-inve stigations-and-hearings-facebook-is-facing-across-the-world/. [Accessed: ​ 25-Nov-2018].

[22] D. Simberkoff, "How Facebook's Cambridge Analytica Scandal Impacted the Intersection of Privacy and Regulation," CMSWire, 30-Aug-2018. [Online]. Available: https://www.cmswire.com/information-management/how-facebooks-cambridge-analytica-scandal-impacted-the-intersection-of-privacy-and-regulation/. [Accessed: 25-Nov-2018].

[23] T. Baldinazzo, “How Facebook (FB) Stock Fared During the Cambridge Analytica Scandal,” Nasdaq, Zacks.com, 24-May-2018. [Online]. Available: ​ ​ https://www.nasdaq.com/article/how-facebook-fb-stock-fared-during-the-cambrid ge-analytica-scandal-cm968619. [Accessed: 25-Nov-2018]. ​

[24] N. Confessore, “Cambridge Analytica and Facebook: The Scandal and the Fallout So Far,” The New York Times, 04-Apr-2018. [Online]. Available: ​ ​ https://www.nytimes.com/2018/04/04/us/politics/cambridge-analytica-scandal-fall out.html. [Accessed: 25-Nov-2018]. ​

[25] K. Owczarzak and H. T. Dang, "Overview of the TAC 2011 Summarization Track: Guided and AESOP Task," in Proceedings of the 2011 NIST TAC Workshop, ​ ​ Gaithersburg, Maryland, USA, 2011. [Online]. Available: https://tac.nist.gov//2011/Summarization/Guided-Summ.2011.guidelines.html. ​ [Accessed: 25-Nov-2018].

[26] V. Dupras, “num2words,” GitHub, 17-Nov-2018. [Online]. Available: ​ ​ https://github.com/savoirfairelinux/num2words. Commit ​ 58613e0a18ad51e5372b22b59e2d304e958a3ec3. [Accessed: 28-Nov-2018].

[27] A. Nagpal, “Word2Number,” GitHub, 27-Jun-2017. [Online]. Available: ​ ​ https://github.com/akshaynagpal/w2n. Commit ​ 33aac8a1d71ef1dffd4435fe6e9f998154bcb051. [Accessed: 28-Nov-2018].

[28] A. See, “Code to obtain the CNN/Daily Mail dataset,” GitHub, 13-Sep-2017. ​ ​ [Online]. Available: https://github.com/abisee/cnn-dailymail. Commit ​ ​ 8eace60f306dcbab30d1f1d715e379f07a3782db. [Accessed: 28-Nov-2018].


[29] C. Miller, “Process Data for Pointer Summrizer,” GitHub, 19-Nov-2018. ​ ​ [Online]. Available: https://github.com/chmille3/process_data_for_pointer_summrizer. Commit ​ 5f704eda88a2233b3b76bc86b80192081af94410. [Accessed: 28-Nov-2018].

[30] C. Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” In Proceedings of Text Summarization Branches Out, Association for Computational ​ Linguistics, Barcelona, Spain, 2004. [Online]. Available: http://www.aclweb.org/anthology/W04-1013. [Accessed: 05-Dec-2018]. ​

[31] "Capital Gazette shooting," Wikipedia, 29-Nov-2018. [Online]. Available: https://en.wikipedia.org/wiki/Capital_Gazette_shooting. [Accessed: 28-Nov-2018].

[32] “Great Mills High School,” Wikipedia, 20-Sep-2018. [Online]. Available: ​ ​ https://en.wikipedia.org/wiki/Great_Mills_High_School. [Accessed: 28-Nov-2018]. ​

[33] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit,” in Association for Computational Linguistics (ACL) System Demonstrations, 2014. ​ [Online]. Available: http://www.aclweb.org/anthology/P/P14/P14-5010. [Accessed: ​ ​ 28-Nov-2018]

[34] Y. Taguchi, “tagucci/pythonrouge,” GitHub, 12-Jul-2018. [Online]. Available: ​ ​ https://github.com/tagucci/pythonrouge. Commit ​ 7aeda94578558cb189fffffeb596195806c2ddcb. [Accessed: 05-Dec-2018].

[35] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. “Tensor2Tensor for Neural Machine Translation,” arXiv:1803.07416v1 [cs.LG], ​ ​ 16-Mar-2018. [Online]. Available: http://arxiv.org/abs/1803.07416. [Accessed: ​ ​ 28-Nov-2018].


[36] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781v3 [cs.CL], 7-Sep-2013. [Online]. Available: https://arxiv.org/abs/1301.3781. [Accessed: 28-Nov-2018].

[37] "5 dead in shooting at newspaper building in Maryland, suspect in custody," CBS News, 28-Jun-2018. [Online]. Available: https://www.cbsnews.com/news/annapolis-shooting-five-dead-in-shooting-at-capital-gazette-newspaper-in-maryland-suspect-in-custody-2018-06-28-live-updates/. [Accessed: 27-Nov-2018].

[38] E. Pilkington, B. Jacobs, and K. Lyons, “'We're putting out a damn paper' - Capital Gazette publishes despite attack,” The Guardian, 29-Jun-2018. [Online]. ​ ​ Available: https://www.theguardian.com/us-news/2018/jun/28/maryland-annapolis-newspap er-shooting-capital-gazette-publish. [Accessed: 27-Nov-2018]. ​

[39] L. Broadwater and I. Duncan, “Suspect swore 'oath' to kill Capital staff years ago, had restraining orders - but bought gun legally,” Baltimore Sun, 30-Jun-2018. ​ ​ [Online]. Available: https://www.baltimoresun.com/news/maryland/crime/bs-md-ramos-charges-2018 0629-story.html. [Accessed: 27-Nov-2018]. ​

[40] E. Shugerman and A. Buncombe, “Maryland shooting suspect 'was there to kill as many people as he could',” The Independent, 29-Jun-2018. [Online]. ​ ​ Available: https://www.independent.co.uk/news/world/americas/jarrod-ramos-latest-maryla nd-annapolis-shooter-capital-gazette-gunman-shooting-a8423431.html. [Accessed: ​ 27-Nov-2018].

[41] J. Campisi and S. Ahmed, "When the gunman attacked Gazette office, one staffer charged at him," CNN, 10-Jul-2018. [Online]. Available: https://www.cnn.com/2018/07/10/us/wendi-winters-capital-gazette-trnd/index.html. [Accessed: 27-Nov-2018].

[42] B. Witte, “Correction: Shootings-Newspaper story,” AP News, 02-Jul-2018. ​ ​ [Online]. Available: https://apnews.com/01d23bc538ca4ae6b9b8feb3ef86f1fa. ​ ​ [Accessed: 27-Nov-2018].

[43] "Man charged with murder after Maryland shooting," RTÉ, 29-Jun-2018. [Online]. Available: https://www.rte.ie/news/2018/0629/974069-maryland-shooting/. [Accessed: 27-Nov-2018].

[44] M. Kennedy, “Capital Gazette Shooting Suspect Pleads Not Guilty To Murder Charges,” NPR, 31-Jul-2018. [Online]. Available: ​ ​ https://www.npr.org/2018/07/31/634249093/capital-gazette-shooting-suspect-plead s-not-guilty-to-murder-charges. [Accessed: 27-Nov-2018]. ​

[45] J. M. Rogers, “Trial Date Set For Jarrod Ramos In Capital Gazette Shootings,” Patch, 21-Aug-2018. [Online]. Available: ​ https://patch.com/maryland/annearundel/alleged-capital-gazette-shooter-may-pur sue-insanity-defense. [Accessed: 27-Nov-2018]. ​

[46] G. Rago, “Annapolis shooting suspect sent letter to a Virginian-Pilot editor, police say,” Baltimore Sun, 06-Jul-2018. [Online]. Available: ​ ​ https://www.baltimoresun.com/news/maryland/crime/bs-md-virginian-pilot-edito r-ramos-20180705-story.html. [Accessed: 27-Nov-2018]. ​

[47] J. Anderson and D. Ohl, "Ramos pleads not guilty in Capital Gazette shooting," MSN, 30-Jul-2018. [Online]. Available: https://www.msn.com/en-us/news/us/ramos-pleads-not-guilty-in-capital-gazette-shooting/ar-BBLfIEp. [Accessed: 27-Nov-2018].


[48] N. Gaudiano, “As Anne Arundel police prepare for 'red flag' gun seizures, law's sponsor holds Capital Gazette shooting victim close,” Capital Gazette, ​ ​ 25-Aug-2018. [Online]. Available: https://www.capitalgazette.com/news/for_the_record/ac-cn-red-flag-20180824-stor y.html. [Accessed: 27-Nov-2018]. ​

[49] C. Fenwick, "Trump Refuses Call to Lower Flags in Honor of Victims of the Mass Shooting in Maryland Newsroom: Report," Alternet, 02-Jul-2018. [Online]. Available: https://www.alternet.org/news-amp-politics/trump-refuses-call-lower-flags-honor-victims-mass-shooting-maryland-newsroom. [Accessed: 27-Nov-2018].

[50] “Sheriff: Maryland High School Shooter Died by Shooting Himself,” Police ​ Magazine, 26-Mar-2018. [Online]. Available: ​ http://www.policemag.com/channel/patrol/news/2018/03/26/sheriff-maryland-hig h-school-shooter-died-by-shooting-himself.aspx. [Accessed: 27-Nov-2018]. ​

[51] A. Jamieson, E. Hall, and B. Sacks, “The 16-Year-Old Girl Who Was Shot At Her Maryland High School Has Died,” BuzzFeed News, 23-Mar-2018. [Online]. ​ ​ Available: https://www.buzzfeednews.com/article/amberjamieson/jaelynn-willey-maryland- school-shooting. [Accessed: 27-Nov-2018]. ​

[52] A. Cruz, “Two People Injured in Shooting at Great Mills High School in Maryland,” Teen Vogue, 02-Nov-2018. [Online]. Available: ​ ​ https://www.teenvogue.com/story/maryland-school-shooting-great-mills-high-sch ool. [Accessed: 27-Nov-2018]. ​

[53] E. Levenson, “Maryland school officer stops armed student who shot 2 others,” KSAT, 20-Mar-2018. [Online]. Available: ​ ​ https://www.ksat.com/news/national/shooting-reported-at-high-school-in-marylan d-school-district-says. [Accessed: 27-Nov-2018]. ​


[54] S. Ahle, “BREAKING NEWS: Shooting at Maryland High School..Gunman Dead, Two Injured,” David Harris Jr, 20-Mar-2018. [Online]. Available: ​ ​ https://davidharrisjr.com/politics/breaking-news-shooting-at-maryland-high-scho ol-gunman-dead-two-injured/. [Accessed: 27-Nov-2018]. ​

[55] A. Jamieson, “After Surviving A School Shooting Days Ago, These Students Say ‘March For Our Lives’ Is Personal,” BuzzFeed News, 23-Mar-2018. [Online]. ​ ​ Available: https://www.buzzfeed.com/amberjamieson/march-for-our-lives-maryland-shootin g-protest. [Accessed: 27-Nov-2018]. ​

[56] “Gunman shoots two fellow students at US school, dies after shootout,” ABC ​ News, 20-Mar-2018. [Online]. Available: ​ https://www.abc.net.au/news/2018-03-21/maryland-high-school-in-lockdown-after -shooting/9569336. [Accessed: 27-Nov-2018]. ​

[57] P. Wood, “School nurses learn trauma skills they hope never to use,” Baltimore Sun, 28-Aug-2018. [Online]. Available: ​ https://www.baltimoresun.com/news/maryland/education/k-12/bs-md-school-nur se-training-20180827-story.html. [Accessed: 27-Nov-2018]. ​

[58] S. Meredith, “Facebook-Cambridge Analytica: A timeline of the data hijacking scandal,” CNBC, 10-Apr-2018. [Online]. Available: ​ ​ https://www.cnbc.com/2018/04/10/facebook-cambridge-analytica-a-timeline-of-the- data-hijacking-scandal.html. [Accessed: 03-Dec-2018]. ​

[59] A. Hern and D. Pegg, “Facebook fined for data breaches in Cambridge Analytica scandal,” The Guardian, 10-Jul-2018. [Online]. Available: ​ ​ https://www.theguardian.com/technology/2018/jul/11/facebook-fined-for-data-bre aches-in-cambridge-analytica-scandal. [Accessed: 03-Dec-2018]. ​


[60] J. Bennet, K. Kingsbury, M. Cottle, M. Gay, C. Giacomo, J. Interlandi, S. Jeong, L. Kelley, S. Schmemann, B. Staples, J. Wegman, J. Broder, N. Fox, “Did Facebook Learn Anything From the Cambridge Analytica Debacle?,” The New York Times, ​ ​ 06-Oct-2018. [Online]. Available: https://www.nytimes.com/2018/10/06/opinion/sunday/facebook-privacy-breach-zu ckerberg.html. [Accessed: 03-Dec-2018]. ​

[61] S. Mohammed, “Why the recent Facebook/Cambridge Analytica data 'breach' matters for students,” Brookings, 06-Jun-2018. [Online]. Available: ​ ​ https://www.brookings.edu/blog/brown-center-chalkboard/2018/06/06/why-the-re cent-facebook-cambridge-analytica-data-breach-matters-for-students/. [Accessed: ​ 03-Dec-2018].


Appendices

Appendix A: Extractive Summaries

This appendix includes the extractive summaries that resulted from the three major approaches tried in Module 7. All summaries are composed entirely of sentences that were present in the corpus of articles about the Facebook data breach, produced by the Gensim extractive summarizer [18] and cleaned up to fix capitalization and punctuation for readability. See Module 7 for more information on the creation of these summaries.

A.1 Summary with Similar Sentences Removed

Share Cambridge Analytica, a data analysis firm that worked on president Trump's 2016 campaign, and its related company, Strategic Communications Laboratories, pilfered data on 50 million Facebook users and secretly kept it, according to reports in the New York Times, alongside the Guardian and the Observer.

San francisco Facebook has suspended Cambridge Analytica as it investigates whether the Donald Trump-connected data analysis firm failed to delete personal data that the social network says it improperly obtained from users, as many as 50 million, according to an explosive new report.

Over the weekend, the New York Times and the Observer of London posted a ​ ​ ​ ​ blockbuster investigative piece revealing that Cambridge Analytica, the firm brought on by the Trump campaign to target voters online, used the data of tens of millions of people obtained from Facebook without proper disclosures or permission.


While at Cambridge University, Kogan formed a company in the U.K. called Global Science Research, which created an app called 'thisisyourdigitallife' that offered Facebook users personality predictions, in exchange for accessing their personal data on the social network and more limited information about their friends -- including their "likes" -- if their privacy settings allowed it.

A.2 Overall Summary The data firm suspended its Chief Executive, Alexander Nix (pictured), after recordings emerged of him making a series of controversial claims, including boasts that Cambridge Analytica had a pivotal role in the election of Donald Trump. This meant the company was able to mine the information of 87 million Facebook users even though just 270,000 people gave them permission to do so. And now, thanks to a whistleblower and two stunning reports in The Observer and The New York Times, we know that one of those developers siphoned data on more than 50 million Facebook users and shared them with the Trump campaign’s voter targeting firm, Cambridge Analytica, a company that has bragged it has psychological profiles on 230 million American voters, which it uses to target people online with emotionally precise digital messaging to influence elections. The demands came in response to news reports Saturday about how the firm, Cambridge Analytica, used a feature once available to Facebook app developers to collect information on 270,000 people and, in the process, gain access to data on tens of millions of their Facebook friends, few, if any, of whom had given explicit permission for this sharing. It's the latest fallout from a controversy involving Cambridge Analytics, a company that reportedly bought data obtained by Facebook and used it to create voter profiles which were then reportedly shared with the Trump and pro-Brexit campaigns 6:53 "recent media reports regarding the use of personal information posted on Facebook for political purposes raise serious privacy concerns," privacy commissioner Daniel Therrien said in the statement. On Friday night, as Americans began settling into


the weekend, Facebook dropped a pretty substantial piece of news: the company said in a detailed blog post that it had suspended from its platform the political-data firm Cambridge Analytica, which worked with the Trump campaign during the 2016 election, after learning the company hadn’t deleted Facebook user data it had obtained in violation of the social networks policies. Carol Davidsen, who worked as the media director at Obama for America, claimed Obama campaign mined millions of people's information from Facebook she said that Facebook was surprised at the ease with which they were able to 'suck out the whole social graph' but the firm never tried to stop them when they realized what was doing, and even told them they'd made a special exception for them they 'were very candid that they allowed us to do things they wouldn't have allowed someone else to do because they were on our side,' she tweeted. Davidsen said that she felt the project was 'creepy' - 'even though we played by the rules, and didn't do anything I felt was ugly, with the data' Davidsen posted this in the wake of the uproar over Cambridge Analytica, and their mining of information for the Trump campaign Facebook allowed the Obama campaign to access the personal data of users during the 2012 campaign because they supported the democratic candidate according to a high ranking staffer. Grewal said that the social networking giant learned earlier in the week that london-based Cambridge Analytica, used by campaigns to strategically target personalized political messages, and its parent company Strategic Communication Laboratories (SCL), had misused data collected on 270,000 Facebook users.

A.3 Cluster Approach Summary

Tom Pahl, the acting director of the Federal Trade Commission’s Consumer Protection Division, wrote in a statement Monday morning that the agency is investigating Facebook’s privacy practices a week after news broke that the Trump campaign’s political-data firm, Cambridge Analytica, inappropriately


obtained data on more than 50 million Facebook users and then allegedly lied about deleting it. Facebook is attempting to do a face saving act following severe criticisms against it so that it is able to maintain its user base and therefore the flow of advertisements and advertisers and investor. The largest social media platform in the world is facing close scrutiny of its privacy policies and actions both in the U.S. and the U.K. Last week there were allegations against Facebook that it did nothing to prevent the use of personal data of approximately 50 million Americans by British consultancy Cambridge Analytica which allegedly had misused the data during the 2016 Presidential elections in the U.S. The firm was appointed to assist President Donald Trump during the campaign. This weekend, a man named Christopher Wylie spoke with the New York Times about a consulting company he founded called Cambridge Analytica that, according to him, developed Facebook ads for the Trump campaign with the help of Steve Bannon and data stolen from the pages of 50 million Facebook users (including personal details, rather than passwords or private information).

Facebook ended the day down nearly 7 percent, to US$172.56 making it the worst performing stock in the S&P 500, as the company sought to stem the damage from media reports that Cambridge Analytica, the U.S. data-mining arm of a Britain-based research firm, had improperly accessed personal details from nearly 50 million Facebook users to help Trump campaign advisers target political ads during the 2016 election. Calls for probe of misappropriation of the private information of tens of millions of Americans. Former Cambridge Analytica employee Chris Wylie said the company used information to build psychological profiles so voters could be targeted with ads. Wylie criticized Facebook for facilitating the process, saying it should have made more inquiries when they started seeing the records pulled a collection of powerful U.S. senators are demanding that Facebook explain how a third-party firm with ties to the Trump campaign was able to gain access to data on 50 million of its users. Washington revelations that a political data firm may have gained access to the personal information of as many as 50 million Facebook users drew new bipartisan calls on Capitol Hill Monday for Facebook CEO Mark Zuckerberg and the heads of other social media companies to answer questions from Congress.


Appendix B: Abstractive Summaries

This appendix includes the abstractive summaries that resulted from the three most successful approaches tried in Module 10. All summaries were created by some combination of extractive and abstractive methods in order to deal with the large volume of articles in the Facebook data breach corpus. See Module 10 for more information on the creation of these summaries.
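As a rough illustration of this two-stage idea (not the exact Module 10 pipeline), the sketch below pre-selects the highest-scoring sentences with TF-IDF so that a length-limited abstractive model only needs to read a small fraction of the corpus; the library choice, function name, and sentence budget are assumptions made for illustration.

# Hedged sketch of extractive pre-selection before abstractive summarization.
# This is not the exact Module 10 pipeline; it only illustrates the general
# "extract first, then abstract" idea described above.
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)

def select_top_sentences(documents, max_sentences=50):
    """Keep the sentences with the largest total TF-IDF weight as the
    shortened input for an abstractive model (e.g., a pointer-generator)."""
    sentences = [s for doc in documents for s in sent_tokenize(doc)]
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(sentences)        # one row per sentence
    weights = matrix.sum(axis=1).A1                # total TF-IDF weight per sentence
    ranked = sorted(zip(weights, sentences), key=lambda p: p[0], reverse=True)
    return [sentence for _, sentence in ranked[:max_sentences]]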

B.1 PGN Approach 2

The company announced a suite of new, more intuitive privacy controls Wednesday morning, including a way to download and delete data, a redesigned settings menu, and additional shortcuts for controlling private information. Tom Pahl, the acting director of the Federal Trade Commission’s Consumer Protection Division, wrote in a statement Monday morning that the agency is investigating Facebook’s privacy practices a week after news broke that the Trump campaign’s political-data firm, Cambridge Analytica, inappropriately obtained data on more than 50 million Facebook users and then allegedly lied about deleting it. Facebook CEO Mark Zuckerberg apologized on Wednesday for the social media website's role in what he previously called the “Cambridge Analytica Situation” wherein the research firm allegedly accessed 50 million Facebook user profiles improperly. Christopher Wylie, who previously revealed that consultancy Cambridge Analytica had accessed the data of 50 million Facebook users to build voter profiles on behalf of Donald Trump’s campaign, said AggregateIQ (AIQ) had built software called Ripon to profile voters. Facebook has announced new controls, privacy shortcuts, and tools to delete facebook data but said these were in the works before the cambridge analytica scandal exploded.

The invitation asking Zuckerberg to answer questions at an April 10 hearing comes as the Federal Trade Commission confirmed it’s investigating Facebook’s privacy practices after reports the company allowed political consulting firm Cambridge Analytica to harvest 50 million users’ data. Analyst: “the risk here is that Facebook is paramount to the future of this company.” Facebook shares 6 percent and were on track for their worst day in more than three years on reports that a political consultancy worked on president Donald Trump’s campaign gained inappropriate access to data on more than 50 million users. Lawmakers in the United States, Britain, and Europe have called for investigations into media reports that political analytics firm Cambridge Analytica had harvested the private data on more than 50 million Facebook users to support Trump's 2016 presidential election campaign.

B.2 PGN Approach 3

Peter Eavis lawmakers want answers on Facebook’s latest controversy, lawmakers in the U.S. and Britain want Mark Zuckerberg to explain how Cambridge Analytica, the political data firm founded by Steve Bannon and Robert Mercer, harvested private information from over 50 million user profiles. The company has suspended U.K.-based Cambridge Analytica amid suggestions over the weekend that the British firm had lied about deleting user data that it had gained through the use of a psychology-test application posted on facebook.com.

The Guardian, which was the first to report on U.S. elections, in late 2015, noted that the company drew on research spanning tens of millions of Facebook users, harvested largely without their permission. While it is unclear how many responses Global Science Research obtained through Mechanical Turk and how many it recruited through a data company, all five of the sources interviewed by the intercept confirmed that Kogan’s work on behalf of SCL involved collecting data from survey participants networks. The dossier includes a letter from Facebook’s lawyers, dated August 2016, in which he was asked to destroy data collected by GSR.


Cambridge Analytica received user data from Aleksandr Kogan. Agency says EI lawmakers “will investigate fully, calling digital platforms to account” in the UK, Damian Collins, the chair of Parliament's committee overseeing digital matters.

Zuckerberg is expected to testify in the coming weeks before the House Energy and Commerce Committee. One of the panels, the Senate Judiciary Committee, also asked the leaders of Google and to join him at its April hearing. Last week, Facebook officials appeared unprepared and uncertain during briefings with congressional staff. Mark Warner, the chief executive of the Intelligence Committee, has also requested that Zuckerberg, along with other tech CEOs, testify in order to answer questions about Facebook’s role in the 2016 election.

During Tuesday’s committee hearing, Wylie suggested Facebook may have been aware of the large-scale harvesting of data carried out by Cambridge Analytica’s partner GSR even earlier than had been previously reported.

Damian Collins, chairman of the Digital, Culture, Media and Sport Committee, also accused the chief executive of deliberately misleading parliament .

The FTC investigation is connected to a settlement the agency reached with Facebook in 2011 after finding that the company had told users that third-party apps on the social media site, like games, would not be allowed to access their data. The demands came in response to news reports saturday about how the firm, Cambridge Analytica, used a feature once available to Facebook app developers to collect information on 270,000 people.


B.3 FastAbsRL

50 million Facebook profiles harvested for Cambridge Analytica in major data breach the data analytics firm that worked with Donald Trump’s election team and the winning brexit campaign. Facebook users in developing techniques to support president Donald Trump's 2016 election campaign. Data analytics firm Cambridge Analytica harvested private information from 50 million. Facebook suspends Trump campaign data firm Cambridge Analytica. San : Facebook is suspending the Trump-affiliated data analytics firm Cambridge Analytica. He says Facebook and Cambridge Analytica have been implicated in a massive data breach. Facebook : social media company had secretly harvested profile data belonging to 50 million users. Facebook says it's suspended Cambridge Analytica, a data firm. Facebook is suspending the Trump-affiliated data analytics firm Cambridge Analytica. Facebook says it has suspended the account of Cambridge Analytica. Facebook's Mark Zuckerberg, CEO of Facebook, says it's not a breach of privacy. Facebook is reeling from the private data of more than 50 million users. He says the data breach was one of the largest in the history of Facebook. He says Facebook is being called to account for how it handled Cambridge Analytica’s misuse. Facebook needs to explain its role in data misuse of social media. Facebook suspended Cambridge Analytica over the weekend. Trump : Facebook accounts are 185 times larger than the company. Cambridge Analytica improperly used data from some 50 million Facebook users. Facebook set to be grilled by lawmakers over Trump-linked data firm Cambridge Analytica. he says Facebook is trying to protect users from potential privacy violations.

The lawsuit alleges that Facebook did nothing to protect the privacy of users. Facebook says it was ` failing to ensure that data was protected '. Facebook ceo mark zuckerberg says the Facebook data breach was ` nothing short of horrifying '. Facebook has launched an investigation of the social networking company. Facebook shares spiral as u.s. agency launches privacy investigation stocks. Facebook ceo mark zuckerberg says the social media giant has violated their privacy. Playboy quits Facebook over data privacy scandal. The us federal trade commission is going to investigate Facebook's actions in protecting privacy. EU issues ultimatum for Facebook to answer data scandal questions. Facebook and Cambridge Analytica are facing a privacy violations class action lawsuit.

Cambridge Analytica improperly used data from some 50 million Facebook users. Facebook CEO Mark Zuckerberg says the social media giant has violated their privacy. Facebook says it 's suspending Cambridge Analytica after finding data privacy policies. Facebook says the data of more than 50 million users were inappropriately used by us President Donald Trump. britain is concerned about allegations that the data firm Cambridge Analytica exploited Facebook data to use millions of peoples profiles without authorization. Facebook suspended Cambridge Analytica, a data firm that helped President Donald Trump with the 2016 election. Facebook has suspended Cambridge Analytica, a data company that helped Donald Trump win the presidential election. Australia is investigating whether local personal information was exposed in the Facebook data breach the Australian information and privacy commissioner. Data analytics firm Cambridge Analytica harvested private information from 50 million Facebook users in developing techniques to support President Donald Trump's 2016 election campaign. Data analytics firm Cambridge Analytica harvested private information from 50 million Facebook users in developing techniques to support President Donald Trump's 2016 election campaign.

Facebook says it has hired the forensics firm to conduct an audit of Cambridge Analytica’s systems. Facebook says it's suspending Cambridge Analytica after finding data privacy policies. Facebook says the data of more than 50 million users were inappropriately used by US President Donald Trump. Facebook hires digital forensics firm to investigate Cambridge Analytica. The lawsuit alleges Cambridge Analytica deceived millions of Illinois Facebook users. Facebook says it is hiring a digital forensic firm to conduct an audit of Cambridge Analytica. Britain is concerned about allegations that the data firm Cambridge Analytica exploited Facebook data to use millions of peoples profiles without authorization. Analytics firm reportedly used data on millions of Facebook users to support the Trump campaign. Facebook suspended Cambridge Analytica, a data firm that helped president donald Trump with the 2016 election. EU lawmakers will investigate whether the data of more than 50 million Facebook users has been misused.

San : Facebook is suspending the Trump-affiliated data analytics firm Cambridge Analytica. He says Facebook and Cambridge Analytica have been implicated in a massive data breach. Facebook says it's suspended Cambridge Analytica, a data firm. Facebook is suspending the Trump-affiliated data analytics firm Cambridge Analytica. Facebook says it has suspended the account of Cambridge Analytica. British MPs have called on Facebook CEO to testify on the Cambridge Analytica data scandal. Facebook CEO Mark Zuckerberg says he will not appear before a u.k. parliamentary committee. Facebook CEO Mark Zuckerberg declined an invite from the Parliament to come and testify over the current Cambridge Analytica scandal. Mark Zuckerberg will testify in the U.K. over Facebook's Cambridge Analytica scandal. British lawmakers want to question Facebook boss Mark Zuckerberg over how millions of users got into the hands of Cambridge Analytica.


Appendix C: Gold Standard Summaries

This appendix contains two summaries that were written by students in the class for the event collections being automatically summarized. Both summaries were written so that the best results of an automated summarization workflow could be evaluated against them using ROUGE. See the ROUGE Evaluation section for more details.
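As a rough illustration of how a generated summary can be compared against one of these gold standards, the minimal sketch below uses the rouge-score Python package; the package choice and file names are assumptions made for illustration, not necessarily the tooling used in the ROUGE Evaluation section.

# Minimal sketch of comparing a generated summary to a gold standard with ROUGE.
# Assumes the rouge-score package (pip install rouge-score); the report's actual
# ROUGE tooling may differ. File names are hypothetical.
from rouge_score import rouge_scorer

def score_summary(generated_path, gold_path):
    with open(generated_path, encoding="utf-8") as f:
        generated = f.read()
    with open(gold_path, encoding="utf-8") as f:
        gold = f.read()
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    return scorer.score(gold, generated)  # score(target, prediction)

if __name__ == "__main__":
    results = score_summary("pgn_summary.txt", "gold_standard_facebook.txt")
    for metric, score in results.items():
        print(f"{metric}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")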

C.1 Gold Standard for Maryland Shooting Corpus

This is the gold standard summary we wrote for Team 15, based on articles present in their corpus. Their dataset contained information regarding two shootings that occurred in Maryland: the Capital Gazette newsroom shooting and the Great Mills High School shooting. Information on the workflow used to create this summary is found in the Evaluation section of this report.

Key: Blue indicates a detail present in few (<50) articles in the dataset. Yellow indicates a detail which is a human judgment on the actual content of the articles or is otherwise not present in the dataset.

Capital Gazette shooting

On Tuesday, June 28, 2018 at 2:33pm, 38 year-old Jarrod Warren Ramos opened fire on the glass front door of the Capital Gazette newsroom in Annapolis, MD [37]. Of the 11 employees there that day, 5 were killed: John McNamara, Gerald Fischman, Rob Hiaasen, Wendi Winters, and Rebecca Smith [38]. Two more were injured. Ramos used a 12 gauge pump-action shotgun and smoke grenades, and barricaded the back door to the newsroom to prevent people from escaping [39,40]. One Capital Gazette employee, Wendi Winters, grabbed a trash can and a recycling bin and charged the gunman, but was killed [41]. Law enforcement responded within 60 seconds, and arrested the gunman, who was found hiding under a desk [42]. Approximately 170 people were evacuated from the building [42].


The Capital Gazette is a small local newspaper owned by . In 2011, the Capital Gazette reported on a case in which Ramos was convicted of harassing a former high school classmate [39]. Ramos sued Eric Hartley and Thomas Marquardt, employees of the Capital Gazette, for defamation in 2012, and the lawsuit was dismissed [43]. Ramos began to make threats against the Capital Gazette. Editor Thomas Marquardt called the police to report the threats in 2013, but nothing came of it [39].

Ramos faces a total of 23 charges, including five counts of first-degree murder, and was held without bail [44]. Ramos was assigned public defender William Davis. Davis claims that the facial recognition methods used to identify Ramos, which were employed when his fingerprints were slow to process, were impermissible [44]. Ramos has pled not guilty to all charges, and Davis has indicated that they are considering a plea of not criminally responsible by reason of insanity [45].

Several individuals associated with the Capital Gazette received letters, postmarked the day of the shooting, containing Ramos’s personal information and indicating an “objective of killing every person present”, apparently signed by Ramos [46]. The Capital Gazette also reported receiving politically-motivated threats after the initial coverage of the incident [44].

The Capital Gazette continued publishing the paper on schedule despite their loss. Several crowdfunding campaigns were set up for families, victims, and survivors of the attack, as well as a journalism scholarship memorial fund [47].

Shootings like these have sparked calls from some for tighter gun control regulations. A Maryland law passed in April gives family members and law enforcement the ability to temporarily restrict firearm access for individuals believed to be a risk, but the law does not go into effect until October [48]. Many question whether the law could have been used against Ramos to prevent the attack [48].


After saying his “thoughts and prayers” were with the victims, President Trump initially refused requests to fly the American flag at half-mast, but reversed his decision the Tuesday after the shooting [49].

Great Mills High School shooting

On Tuesday, March 20, 2018 at 7:57 a.m., 16 year-old Jaelynn Willey was shot in the halls of Great Mills High School in Great Mills, Maryland [50]. The bullet that hit Willey also struck 14-year-old Desmond Barnes in the leg [50]. Willey was rushed to the hospital and declared brain dead, then taken off of life support two days later [51]. The shooter, Austin Wyatt Rollins, was a 17 year-old Great Mills student [51]. Rollins used a Glock semi-automatic handgun to carry out the attack [52]. The gun was legally owned by Rollins' father, but it should not have been in Rollins' possession since Maryland law states that one needs to be over 21 to carry a gun [50]. Following the shooting, Rollins wandered around the school. The school resource officer, Deputy First Class Blaine Gaskill, confronted Rollins and ordered him to drop his gun [53]. 31 seconds after the confrontation, Rollins and Gaskill fired their weapons simultaneously [50]. Rollins fatally shot himself in the head and Gaskill shot him in the hand [50].

The school was put into lockdown for several hours following the shooting [52]. Law enforcement arrived at the scene at approximately 8:00 a.m., but by then the situation had been contained [54]. 1440 students were evacuated to a reunification center at nearby Leonardtown High School [52]. Classes were cancelled for the rest of the week [55].

Investigators say that Rollins had a previous relationship with Jaelynn Willey that had recently ended [52]. In the month before the shooting, the school had investigated a potential school shooting threat and determined that the threat was unsubstantiated [56]. This was also less than a week after the students had participated in a nation-wide walkout to protest gun violence and the lack of gun regulations as well as show support for the victims of the Marjory Stoneman Douglas High School shooting [56]. It was the 17th American school shooting in 2018 [43]. Following the shooting, school nurses have been trained in how to react to a school shooter, how to triage wounded children, and how to apply a tourniquet [57].

C.2 Gold Standard for Facebook Data Breach Corpus

This is the gold standard summary written about the Facebook data breach by Team 12. Authors: Frank Wanye, Matt Tuckman, Joy Zhang, Fangzheng Zhang, and Samit Ganguli.

In 2014 Facebook authorized a Russian/American researcher named Aleksandr Kogan to view information about people who used his personality quiz app “thisisyourdigitallife”. This data was to be used for research and should have only given him information on the 270,000 people who agreed to download and use his app. Kogan abused Facebook’s terms of service by accessing the data of all the friends of the people using his app. This underhanded technique changed the number of people that he was harvesting data on from 270,000 to 87 million [58]. He then shared this data with Cambridge Analytica, a data analytics company based in the United Kingdom. Cambridge Analytica used this data to help categorize and profile more than 50 million people so that they could be better targeted with advertisements. Upon finding out about the massive data harvesting, Facebook removed “thisisyourdigitallife” from their platform, issued a mandate to Cambridge Analytica to destroy all the data they gathered, and fixed the loophole that subjected users to having their data exposed by their Facebook friends. However, Cambridge Analytica did not comply with Facebook’s mandate, and since Facebook never checked that the data had been deleted, Cambridge Analytica was able to use this data in the 2016 election to aid the Trump campaign.

Throughout the 2016 presidential election, the Trump campaign paid Cambridge Analytica to help target voters with personalized ads. Giving users personalized advertisements may seem innocent at first, however in reality this was voter manipulation on a national scale using stolen data. In general, when advertising through a site like Facebook, the options for selecting who the ads are shown to are limited to things such as “Users who liked a specific page.” The ad targeting carried out by Cambridge Analytica was on a more massive scale since they were looking at so many users, and was more specific since they looked at personal details on each user to try to figure out what type of person they are. This was all kept hidden from both Facebook and its users, and it was not until much later that the details of the situation came out.

In March of 2018, it was revealed to the public that Cambridge Analytica had been misusing the user data that they harvested from Facebook between 2014 and 2015. Christopher Wylie, a whistleblower from Cambridge Analytica, helped expose the specifics of the situation, such as the scale of the operation and timeline of the events. In the past, most cyber attacks like this had been largely ignored, but something about this event specifically made people outraged. People were upset that Facebook was selling large amounts of their data to third party companies simply because the user agreement says they are allowed to do so. The users who downloaded and used the personality app agreed to the terms and services, but their Facebook friends, who also had their data harvested, were affected despite never having heard of “thisisyourdigitallife”. These events sparked a nationwide discussion on the importance of terms and services agreements, as well as who the personal data that users put online belongs to.

Facebook stock fell dramatically as a result of the scandal and the court case that followed it, with about a 15% decrease in stock price. Facebook was also fined £500,000 by the British Information Commissioner’s Office (ICO) for two breaches of the Data Protection Act [59]. Along with this decrease in market value, many users also began deleting their accounts, to prevent themselves from becoming victims of similar privacy breaches in the future. Meanwhile, Cambridge Analytica’s parent company, Strategic Communication Laboratories (SCL) Elections, was criminally prosecuted for misusing the data they obtained from Facebook and not complying with data privacy laws. Two months after the breach was reported by the Observer, SCL Elections filed for bankruptcy [59].

In an attempt to secure their users’ data and improve their public image, Facebook has more recently begun limiting the amount of data that is available through Facebook’s Graph API, which is essentially a tool to allow programmers to get data about their ads, and the users their ads are being shown to. In addition to this, Facebook also revoked all of the accounts of Cambridge Analytica’s parent company, SCL Elections, to ensure that they no longer have access to any user data. Overall, as a result of this scandal Facebook revamped their data privacy policies.

Despite these changes, there are concerns that Facebook did not take the breach seriously enough. In September of 2018, Facebook announced a different data breach that affected 50 million of its users. Due to three bugs in Facebook’s code, hackers were able to obtain the phone numbers of its users. Ironically, the reason Facebook had access to these numbers in the first place was to provide two-factor authentication, a measure intended to increase the security on their users’ accounts. This potentially allowed the hackers to access any of the services that these users log into with their Facebook profile, including Tinder and Instagram [60].

Finally, the Facebook data breach started an important discussion about data privacy, and the increasing potential for its abuse by other companies and organizations. One example of this is the data collected on over 50 million public school students in the United States, which is also shared with researchers and third-party developers. This data is often stored on multiple storage systems, which increases the potential for a similar data breach to occur [61]. Concerns have also been raised about the kind of data political campaigns harvest, and how that information is harvested. In the United Kingdom, the ICO uncovered that the British Labour party was using data provided by Lifecycle Marketing (Mother & Baby) Limited, a company whose main goal is supposed to be providing information to pregnant women and new mothers. The ICO claims that Lifecycle Marketing collected data with no transparency as to how it will be used, and without consent. This data was then used by political parties to profile and target the country’s population [59].
