An Architecture to Navigate the Topical World of Instagram

BEng Individual Project Imperial College London Department of Computing An Architecture to Navigate the Topical World of Instagram Author: Supervisor: Andreas Asprou Alex Carver Second Marker: Giuliano Casale June, 2018 Abstract Instagram allows over 800 million users to communicate with each other using a visual vocabulary of images and videos. This visual communication includes high- quality content creators sharing content over a diverse set of niche topics specific to Instagram. Despite this fact, there are no reliable methods to navigate Instagram topically, e.g. search for content creators by niche topic. This work aims to provide a first attempt at establishing topical structure for Instagram. To achieve this, we present an architecture for classifying content creators on Instagram into their niche topics of content over a large custom-curated taxonomy. The nature of Instagram results in unique challenges to the task of topic inference and classification. For instance, the noisy, sparse and unreliable nature of text on Instagram degrades the performance of state-of-the-art topic inference, modelling and discovery approaches. The topics of content creators inferred by this architecture are exposed to end-users through a variety of products, resulting in the additional challenge of achieving both high precision and recall classification. The proposed architecture solves the vast ar- ray of challenges by designing three independent components. The first component aims to solve the challenge of discovering the niche topics that are posted about on Instagram. To this end we design a novel topical local community detection (T- LCD) algorithm to discover niche and tightly knit communities of topically coherent content creators. These communities are utilised to enumerate the vast diversity of topics that are posted about on Instagram. Finally, using these discovered topics, we curate a custom taxonomy of hundreds of topics. With the taxonomy, we developed an ensemble of classifiers to infer the topics of content creators with high-precision (topical seeds). The ensemble model aggregates multiple topical signals (features) of a content creators’ Instagram account to mitigate the noisy and unreliable features. We showed that we were able to achieve high-precision classification of a content creator c, despite the noisy and unreliable features, by aggregating the inferred topics of all content creators u follows. This is a central hypothesis of this work: a content creator c of topic t follows majority content creators of topic t. The topical seeds are then used to spread topics across the content creator graph using a custom designed label spreading algorithm. Our algorithm is designed using various novel observations made about the nature of relationships between content creators on Instagram. The spreading of topics solve the issue of low recall of a high-precision classifier, resulting in a fully topically labelled content creator graph. We show experimentally that our architecture performs extremely well over a hand- curated dataset of labelled content creators. The architecture presented in this work is deployed in production and used to achieve high-performing social media adver- tising campaigns. The architecture is continuously evaluated in real-time through a user-facing topical content creator search engine. 2 Acknowledgements I understand my success to be the a combination of my own ability to reason and the people I expose myself to: “none of us are as smart as all of us”1. To this end, I’d like to express gratitude for a few external entities: • The deeply inspiring research literature which provided a clear pathway to walk along. • To Immanuel Kant, who through his essay “Answering the Question: What Is Enlightenment?” (1784), provided me with the guiding phrase “Sapere Aude” - “Dare to be wise” and its accompanying insight: the courage to act on my own insights and understanding. This idea grounded me in those times where the challenges seemed unsolvable. • My compadre Thevindu Edirisinghe, who was always there to discuss and contribute to the solutions of complex problems, which in turn stimulated ideas in this work. • My colleagues at Filli Studios, for their support both emotionally and in- tellectually, providing assistance in the tasks which required expertise and experience in the space of Instagram. In particular, Nia Pickering, for her exceptional insight in the topical structure of Instagram and her continuous effort in building the taxonomy. • My supervisor Alex Carver, for his continuous support through this project. • Finally, my family. For without them, none of this could have been possible. 1- Kenneth H. Blanchard 3 Contents 1 Introduction8 1.1 Motivation.................................8 1.1.1 Crowd Driven Innovation on Instagram.............8 1.1.2 Improvements to the Platform..................8 1.2 Key Challenges and Solutions...................... 12 1.3 Relevant Taxonomy Construction.................... 12 1.3.1 Accurate Reasoning in a Highly Deceptive and Noisy World. 14 1.4 End-User Requirements......................... 15 1.5 Contributions............................... 16 2 Background 18 2.1 The Data................................. 19 2.2 User Interest Identification and Modelling............... 19 2.2.1 Textual Topical Signals...................... 20 2.2.2 Leveraging Internal Structures.................. 24 2.3 Entity Linking and Extraction...................... 25 2.3.1 Preliminaries........................... 25 2.3.2 Domain Specific Linkers..................... 25 2.3.3 Challenges of Entity Linking in OSNs.............. 26 2.3.4 Voting Schemes.......................... 26 2.3.5 Conclusions............................ 27 2.4 Text Vectorisation............................ 27 2.4.1 Bag-of-Words........................... 27 2.4.2 TF-IDF.............................. 28 2.4.3 Word Embeddings (Word2Vec)................. 28 2.4.4 Document Embeddings (Doc2Vec)................ 28 2.5 Text Preprocessing............................ 29 3 Taxonomy Construction 31 3.1 Background................................ 31 3.2 Topical Community Detection...................... 31 3.2.1 Background and Related Work................. 31 3.2.2 Algorithm Setup and Objectives................. 33 3.2.3 Preliminaries........................... 34 3.2.4 What makes a Good Cluster?.................. 34 3.2.5 Topical Quality Measures.................... 37 3.2.6 Determining Optimal Seeds................... 39 3.2.7 Seed Expansion Algorithm.................... 41 3.2.8 Evaluation............................. 41 4 3.3 Final Taxonomy.............................. 46 3.3.1 Taxonomy Details......................... 46 3.4 Future Work................................ 47 4 High Precision Topic Inference Pipeline 49 4.1 Introduction................................ 49 4.2 Challenges................................. 49 4.2.1 The Impossibility of Human-Annotation............ 49 4.2.2 Unavailability of Labelled Data................. 50 4.2.3 Text on Online Social Networks................. 51 4.2.4 Low Accuracy Concept Extraction............... 52 4.3 High-Precision Topic Inference Pipeline................. 53 4.3.1 Topical Signals Identification................... 53 4.3.2 Pipeline Overview......................... 55 4.3.3 Co Training & Self-Training................... 56 4.4 Topical Dictionary Based Classification of Biographies......... 57 4.4.1 Background............................ 57 4.4.2 Extensions to Related Work................... 58 4.4.3 Dictionary Construction..................... 59 4.4.4 Practical Implementation..................... 62 4.5 Quantitative Evaluation......................... 66 4.5.1 Setup............................... 66 4.5.2 Precision and Recall Metrics................... 66 4.5.3 Model Comparison........................ 69 4.6 Website Classification........................... 70 4.6.1 Related Work........................... 71 4.6.2 An analysis of hyperlinks on Instagram............. 72 4.6.3 Determining Accurate Hyperlink Signals on Instagram.... 73 4.6.4 Automatic Labelled Data Collection.............. 74 4.6.5 Webpage Feature Extraction................... 75 4.7 Design Choices.............................. 75 4.8 Convolutional Neural Networks for Webpage Classification...... 76 4.8.1 Convolutional Neural Networks for Text Classification..... 76 4.8.2 Tooling, Text-Preprocessing and Feature Extraction...... 77 4.8.3 Convolutional Neural Network Architecture.......... 78 4.8.4 Evaluation............................. 78 4.9 Final Ensemble Model Evaluation.................... 80 4.9.1 Results............................... 80 4.10 Future Work................................ 81 4.10.1 Other Topical Signals....................... 81 4.10.2 Multi-Stage Topic Inference................... 81 4.10.3 Out of Vocabulary Word Embeddings.............. 81 4.10.4 Webpage Feature Extraction................... 81 4.10.5 Labelled Website Data Collection................ 81 5 5 Label Spreading 83 5.1 Introduction................................ 83 5.1.1 Assumptions............................ 83 5.1.2 Content Creators and Influencers................ 84 5.1.3 Homophily vs Influence...................... 84 5.1.4 Types of Semi-Supervised Learning............... 86 5.1.5 Graph Construction....................... 86 5.1.6 Hard Clamping Label Propagation............... 87 5.1.7 Soft Clamping Label Spreading................. 88 5.1.8 Regularisation Framework.................... 89 5.2 Influencer Graph Structure.......................

Load more