Master Thesis
Total Page:16
File Type:pdf, Size:1020Kb
MASTER THESIS Bc. Michal Hušek Text mining in social network analysis Department of Theoretical Computer Science and Mathematical Logic Supervisor of the master thesis: doc. RNDr. Iveta Mrázová, CSc. Study programme: Informatics Specialization: Software Systems Prague 2018 2 3 I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources. I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that the Charles University has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act. In…...... date............ signature 4 Title: Text mining in social network analysis Author: Bc. Michal Hušek Department: Department of Theoretical Computer Science and Mathematical Logic Supervisor: doc. RNDr. Iveta Mrázová, CSc., Department of Theoretical Computer Science and Mathematical Logic Abstract: Nowadays, social networks represent one of the most important sources of valuable information. This work focuses on mining the data provided by social networks. Multiple data mining techniques are discussed and analysed in this work, namely, clustering, neural networks, ranking algorithms and histogram statistics. Most of the mentioned algorithms have been implemented and tested on real-world social network data and the obtained results have been mutually compared against each other whenever it made sense. For computationally demanding tasks, graphic processing units have been used in order to speed up calculations for vast amounts of data, e.g., during clustering. The performed tests have confirmed lower time requirements. All the performed analyses are, however, independent of the actually involved type of social network. Keywords: data mining, social networks, clustering, neural networks, ranking algorithms, CUDA 5 Názov: Text mining in social network analysis Autor: Bc. Michal Hušek Katedra: Katedra teoretické informatiky a matematické logiky Vedúca práce: doc. RNDr. Iveta Mrázová, CSc., Katedra teoretické informatiky a matematické logiky Abstrakt: Sociálne siete sú v súčastnosti veľmi hodnotným zdrojom informácií. Táto práca sa zameriava na dolovanie dát pochádzajúcich zo sociálnych sietí. Pojednáva a analyzuje rôzne techniky dolovania dát, konkrétne klastrovanie, neurónové siete, hodnotiace algoritmy a histogramovú štatistiku. Väčšina zmienených algoritmov bola implementovaná a testovaná s dátami z reálnej sociálnej siete. V prípade, že to bolo zmyslupné, výsledky boli vzájomne porovnané. Pre výpočtovo náročné úlohy, konkrétne klastrovanie, boli použité grafické procesory na ich zrýchlenie. Testy takto upravených programov potvrdili nižšie časové nároky. Všetky vykonané analýzy sú však nezávislé na konkrétnej použitej sociálnej sieti. Kľúčové slová: dolovanie dát, sociálne siete, klastrovanie, neurónové siete, hodnotiace algoritmy, CUDA 6 Acknowledgements In the first place, I would like to thank my supervisor doc. RNDr. Iveta Mrázová, CSc for all the help and time spend with my work. I would like to especially thank her for her patience with my not always sufficient progress and my very limited time schedule. In the second place, I would like to thank my faculty for providing me with the necessary software and hardware. Without them, it would be difficult to finish this work. 7 Table of Contents 1. INTRODUCTION ........................................................................................................... 11 1.1 MOTIVATION ............................................................................................................... 11 1.2 METHODOLOGY ............................................................................................................... 12 1.3 THESIS STRUCTURE ............................................................................................................ 12 1.4 OBJECTIVES OF THIS WORK ................................................................................................. 13 2. K-MEANS ALGORITHM ................................................................................................ 14 2.1 MOTIVATION ................................................................................................................... 14 2.2 DESCRIPTION OF THE ALGORITHM ........................................................................................ 14 2.3 CALCULATION OF THE CENTROID .......................................................................................... 15 2.4 CONCLUSION ................................................................................................................... 16 3. ENRON EMAIL DATA SET ............................................................................................. 17 3.1 MOTIVATION ................................................................................................................... 17 3.2 ENRON COMPANY ............................................................................................................. 17 3.3 ENRON EMAIL CORPUS ....................................................................................................... 18 3.4 THE FOLDER STRUCTURE OF ENRON EMAIL CORPUS ................................................................ 18 3.5 THE STRUCTURE OF THE EMAIL MESSAGE .............................................................................. 19 3.6 ENRON EMAIL CORPUS AS A SOCIAL NETWORK ....................................................................... 21 3.7 EMAILS CONTAINING POSSIBLY IMPORTANT INFORMATION ....................................................... 22 3.8 CONCLUSION ................................................................................................................... 23 4. EMAIL - BASED CLUSTERING OF THE ENRON DATA SET ................................................ 24 4.1 MOTIVATION ................................................................................................................... 24 4.2 VECTOR SPACE MODEL ....................................................................................................... 24 4.3 DESCRIPTION OF THE USED APPROACH .................................................................................. 25 4.4 TEST OF THE CLUSTERING ON A SMALL SUBSET OF THE INPUT DATA ............................................ 29 4.5 RESULTS OF THE CLUSTERING USING A SUBSET OF THE INPUT DATA ............................................ 29 4.6 CONCLUSION ................................................................................................................... 37 5. SPEEDING UP THE COMPUTATIONS USING GPUS ........................................................ 38 5.1 MOTIVATION ................................................................................................................... 38 5.2. GPU COMPUTING ............................................................................................................ 38 5.3. NVIDIA CUDA ............................................................................................................... 39 5.4 TASKS SUITABLE FOR GPU CALCULATIONS ............................................................................. 39 5.5. SUMMARY ...................................................................................................................... 40 6 NVIDIA CUDA BASED CLUSTERING ............................................................................... 41 6.1. MOTIVATION .................................................................................................................. 41 6.2 DESCRIPTION OF THE USED APPROACH .................................................................................. 41 6.3 CLUSTERING TESTS WITH THE FIRST VERSION OF CUDA BASED PROGRAM ................................... 43 6.4 CUDA CLUSTERING WITH PARALLEL CALCULATION OF DISTANCES TO CENTROIDS .......................... 44 6.5 MODIFIED CUDA KERNEL .................................................................................................. 46 6.6 SPEED COMPARISON ......................................................................................................... 48 6.7 CONCLUSION ................................................................................................................... 49 7. PERSON - BASED CLUSTERING OF THE ENRON DATA SET ............................................. 51 8 7.1 MOTIVATION ................................................................................................................... 51 7.2 THE IDEA OF THE FIRST PERSON-BASED CLUSTERING EXPERIMENT .............................................. 51 7.3 DESCRIPTION OF THE USED APPROACH IN THE FIRST PERSON-BASED CLUSTERING EXPERIMENT. ....... 51 7.4 RETHINKING THE IDEAS OF PERSON BASED CLUSTERING ........................................................... 54 7.5 PERSON- COMMUNICATION BASED CLUSTERING ..................................................................... 54 7.6 TIME BASED PERSON CLUSTERING ........................................................................................ 58 7.7 CONCLUSION ................................................................................................................... 60 8. HISTOGRAM STATISTICS .............................................................................................. 62 8.1 MOTIVATION ..................................................................................................................