FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Exploring Visual Content in Social Networks

Mariana Afonso

WORKING VERSION

PREPARATION FOR THE MSC DISSERTATION

Supervisor: Prof. Dr. Luís Filipe Pinto de Almeida Teixeira

February 18, 2015

© Mariana Fernandez Afonso, 2015

Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Objectives
  1.4 Document Structure

2 Related Concepts and Background
  2.1 Social network mining
    2.1.1 Twitter
    2.1.2 Research directions and trends
    2.1.3 Opinion mining and sentiment analysis
    2.1.4 Cascade and Influence Maximization
    2.1.5 TweeProfiles
  2.2 Data mining techniques and algorithms
    2.2.1 Clustering
    2.2.2 Data Stream Clustering
    2.2.3 Cluster validity
  2.3 Content based image retrieval
    2.3.1 The semantic gap
    2.3.2 Reducing the semantic gap
  2.4 Image descriptors
    2.4.1 Low-level image features
    2.4.2 Mid-level image descriptors
  2.5 Performance Evaluation
    2.5.1 Image databases
  2.6 Visualization of image collections
  2.7 Discussion

3 Related Work
  3.1 Discussion

4 Work plan
  4.1 Methodology
  4.2 Development Tools
  4.3 Task description and planning

References

List of Figures

2.1 Clusters in Portugal; Content proportion 100%
2.2 Clusters in Portugal; Content 50% + Spatial 50%
2.3 Clusters in Portugal; Content 50% + Temporal 50%
2.4 Clusters in Portugal; Content 50% + Social 50%
2.5 Visualization of the clusters obtained by TweeProfiles using different weights for the different dimensions. Extracted from [1]
2.6 Example of the K-means clustering algorithm on a 2-D dataset
2.7 Example of a dendrogram for hierarchical clustering, using either agglomerative or divisive methods. Extracted from [2]
2.8 Illustration of single-link clustering. Extracted from [2]
2.9 Illustration of complete-link clustering. Extracted from [2]
2.10 Illustration of average-link clustering. Extracted from [2]
2.11 Clusters found by DBSCAN in 3 databases. Extracted from [3]
2.12 Example of two clustering results (obtained by the K-means algorithm) on two different datasets
2.13 A general model of CBIR systems. Extracted from [4]
2.14 Object-ontology model used in [5]. Extracted from [5]
2.15 CBIR with RF. Extracted from [6]
2.16 Schematic representation of the different types of image descriptors. Extracted from [7]
2.17 The RGB color model. Extracted from [8]
2.18 HSV color representation. Extracted from [8]
2.19 Example of color histograms for an example image
2.20 Two images with similar color histograms. Extracted from [9]
2.21 Three different textures with the same distribution of black and white. Extracted from [10]
2.22 First level of a wavelet decomposition in 3 steps: low- and high-pass filtering in the horizontal direction, the same in the vertical direction, and subsampling. Extracted from [11]
2.23 The basis of Hough transform line detection; (A) (x,y) point image space; (B) (m,c) parameter space. Extracted from [12]
2.24 For each octave, adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images. After each octave, the images are downsampled by a factor of 2, and the process is repeated. Extracted from [13]
2.25 Detection of the maxima and minima of the difference-of-Gaussian images by comparing the neighbors. Extracted from [13]
2.26 Stages of keypoint selection. (a) The 233 x 189 original image. (b) The initial 832 keypoint locations at maxima and minima of the difference-of-Gaussian function. Keypoints are displayed as vectors indicating scale, orientation, and location. (c) After applying a threshold on minimum contrast, 729 keypoints remain. (d) The final 536 keypoints that remain following an additional threshold on the ratio of principal curvatures. Extracted from [13]
2.27 Example of a SIFT descriptor match found in two different images of the same sight. Extracted from [14]
2.28 Left to right: the (discretised and cropped) Gaussian second order partial derivatives in the y-direction and xy-direction, and the approximations thereof using box filters. The grey regions are equal to zero. Extracted from [15]
2.29 The descriptor entries of a sub-region represent the nature of the underlying intensity pattern. Left: in the case of a homogeneous region, all values are relatively low. Middle: in the presence of frequencies in the x direction, the value of \sum |d_x| is high, but all others remain low. If the intensity is gradually increasing in the x direction, both \sum d_x and \sum |d_x| are high. Extracted from [15]
2.30 Process for Bag-of-Features image representation for CBIR. Extracted from [16]
2.31 Example of the construction of a three-level pyramid. Extracted from [17]
2.32 An analogy between image and text document in semantic granularity. Extracted from [18]
2.33 Examples of images from the Corel dataset
2.34 The 100 objects (classes) from the Coil-100 image database. Extracted from [19]
2.35 Examples of images from the MIR Flickr dataset. Also listed are Creative Commons attribution license icons and the creators of the images. Extracted from [20]
2.36 A snapshot of two root-to-leaf branches of ImageNet: the top row is from the mammal subtree; the bottom row is from the vehicle subtree. For each synset, 9 randomly sampled images are presented. Extracted from [21]
2.37 Results obtained by AverageExplorer: the many informative modes and examples of images and patches within each mode. Extracted from [22]
2.38 Examples of average images edited and aligned compared to the unaligned versions, using the AverageExplorer interactive visualization tool. Extracted from [22]
2.39 A summary of 10 images extracted from a set of 2000 images of the Vatican computed by our algorithm. Extracted from [23]

4.1 Gantt diagram of the work plan

Chapter 1

Introduction

1.1 Background

With the emergence of social media websites like Twitter, Facebook, Flickr and YouTube, a huge amount of information is produced every day by millions of users. Facebook alone reports 6 billion photo uploads per month and YouTube sees 72 hours of video uploaded every minute [22]. The truth is that we have experienced an explosion of visual data available online, mainly shared by users through social media websites.

The goal of this thesis is to find patterns in images shared via the social network Twitter. These patterns will be found using a technique called clustering, where, given a set of unlabeled data, groups are formed based on the similarity of the content of the data itself. This means that each cluster will represent a pattern or concept, which could depict, for example, a location, an activity or a photographic trend (e.g. "selfies").

Image analysis has been a popular topic of research among scientists for decades. Inside this large topic, there are more specific areas of research, such as image classification, image retrieval, image clustering and image annotation. All these areas use techniques and methods from image processing, computer vision, machine learning and pattern recognition. In this thesis, image clustering will be the main focus; nonetheless, it is important to understand and have some background on the other fields as well.

Twitter is one of the most popular social networks available and is mainly used as a microblogging service. Twitter users share what are called tweets, which are short messages (maximum of 140 characters). These tweets can also include links and multimedia content such as images. In order to obtain data shared on Twitter, many crawlers (bots for automatic indexing of the Web) have been developed. This project will make use of data collected by TwitterEcho [24].

After the image clusters have been found, a visualization tool will be developed. This should allow the user to explore the clusters obtained by visualizing relevant or representative images among the clusters. This system will be implemented using OpenCV, a popular open-source computer vision library. It is available for C++, Python, Java and Android.


1.2 Motivation

The information shared on social media websites can have many formats, such as text, images or video. Until recent years, research has focused on the analysis and extraction of relevant information from the text content of the information produced. This is because text is simpler to analyze and categorize than the other two formats. However, ignoring the visual content shared on social media could be seen as a waste of important information for research.

Therefore, there is a need to develop more robust and more powerful algorithms for the analysis of images from large image collections. The big issue with visual content is that the features which can be extracted directly from the image, called low-level features, do not give information about the content or high-level concepts present in an image. For that reason, it is extremely difficult to compare images in a way that is understandable and acceptable to humans. Consequently, this is an area of great potential for research.

The possibilities for applications of these types of systems are endless and could include safety, marketing and behavioral studies. This could also provide a different user experience when searching for information on a given topic. For example, if a Twitter user would like to search for images from a given live event, he would receive image results which do not necessarily contain the keywords he specified, because the search would be based on the content of the images instead of the metadata associated with them.

1.3 Objectives

The first objective of this thesis is to study the different methods and strategies for image description and the algorithms for image clustering. This area of study is similar to the area of content-based image retrieval, which will be explained in detail in section 2.3. It is relevant to understand the most recent trends and approaches in these fields in order to create something scientifically up to date.

After the theoretical study, an algorithm will be implemented and some subjective initial testing will be performed. Following this first implementation, there will be a rigorous evaluation phase, which could involve user feedback as well as publicly available image collection databases. This evaluation will help to improve the algorithm and to assess its performance against other recent approaches.

Next, once the clusters are obtained, there needs to be a visual representation of the clusters. This will be obtained using algorithms for the visualization of image collections, which will be presented in section 2.6. Finally, there will be an attempt at a partial or full integration with TweeProfiles [1].

In sum, the goal of this thesis is to be able to distinguish patterns or groups of interest in the huge image database obtained from users' tweets.

1.4 Document Structure

Chapter 2 will describe many important concepts and areas related to the thesis subject. First, in section 2.1, a brief background on social network mining will be given, together with some popular research areas in that field, in order to give an idea of the applications and research trends related to social networks. After that, some concepts of data mining will be presented in section 2.2, with the main focus on clustering algorithms and clustering evaluation. Next, section 2.3 will present content-based image retrieval (CBIR), a very important area of research related to the topic of this thesis. Following that, section 2.4 will describe algorithms for image description, which is a way of extracting features from visual content. After that, in section 2.5, some ideas for performance evaluation will be presented, based on the evaluation methodologies used in similar works. Next, section 2.6 presents some papers on the topic of visualization of image collections. Finally, in chapter 3, some recent and similar work will be presented. These works will give more insight into the best approach to follow in this thesis.

Chapter 2

Related Concepts and Background

2.1 Social network mining

Social networks have become very popular over the last couple of decades. During this technological breakthrough, many social networks have appeared, such as Facebook, Twitter or Flickr. In general, a social network is defined as [25]:

Definition 1. A network of interactions or relationships, where the nodes consist of actors, and the edges consist of the relationships or interactions between these actors.

According to the statistics in [26], Facebook has the largest number of users, with 1.39 billion monthly active users, of which 890 million are daily active users as of the present date. On the other hand, Twitter has 288 million active users and 500 million tweets are sent every day [27]. Many such social networks are extremely rich in content, whether in the form of text, images or other multimedia data. The amount of content that is being shared in these social networks provides huge opportunities from the perspective of knowledge discovery and data mining. According to [25], there are two primary kinds of data which are often analyzed in the context of social networks:

• Linkage-based and structural analysis: analysis of the linkage behavior of the network in order to determine important nodes, communities or links, and to study relationships between the different elements involved. For example, the verification of the small-world phenomenon, preferential attachment, and other general structural dynamics has been a topic of great interest in recent years.

• Content-based analysis: describing, analyzing, linking and classifying the content shared in social networks. An example is text mining, which is used to discover useful patterns and information from the text shared by users in social networks and which can have a wide variety of applications.

The next sections will introduce the social network Twitter and some interesting data analytics that have been applied to data collected from it. After that, some interesting research areas will be presented. In the end, a system called TweeProfiles [1] will be described briefly. As mentioned before, this work intends to be an extension of TweeProfiles in order to allow the mining of visual content.

2.1.1 Twitter

Twitter is an online social networking service used primarily for microblogging. Unlike on most online social networking sites, such as Facebook or MySpace, the relationship of following and being followed requires no reciprocation [28]. A user can follow any other user, and the user being followed does not need to follow back. Being a follower on Twitter means that the user receives all the messages from those the user follows. Users share messages called tweets, which have a maximum length of 140 characters. Additionally, the retweet mechanism empowers users to spread information of their choice beyond the reach of the original tweet's followers.

Twitter users can also share other types of content, such as pictures. In the tweet box, users can add up to 4 images, which will be converted to a pic.twitter.com link in the actual tweet sent (although the image is displayed instead of the link).

Twitter has been the data source for many interesting studies for various types of applications. For instance, the authors of [28] studied the topological characteristics of Twitter and its power as a new medium of information sharing. For this work, they collected 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics and 106 million tweets. Another example is [29], where a hashtag retrieval system was built, in which hashtags were obtained based on the interests expressed in users' tweets. Content analysis has also been performed in many studies (refer to section 2.1.3).

2.1.2 Research directions and trends

Next, two fields of research in the area of social network mining are presented. The first is opinion mining and sentiment analysis, where the data being analyzed is the content of the tweets. The second is information cascades and influence maximization, where the objective is to discover influential users based on the structure and message flow of the network rather than on the content of the tweets.

2.1.3 Opinion mining and sentiment analysis

Opinions have always been a fundamental part of human relationships. Nowadays, opinion content is present almost everywhere and is easily accessible to almost anyone due to the popularity of the Internet and other media such as TV or music. This availability of data creates an opportunity for information technologies to produce systems that can automatically detect and evaluate opinions. Opinions and related concepts such as sentiments, evaluations, attitudes, and emotions are the subjects of study of sentiment analysis and opinion mining [30]. In particular, opinion mining can be defined as a sub-discipline of computational linguistics that focuses on extracting people's opinions from the Web. Thus, given some text data, opinion-mining systems analyze:

1. Which part expresses an opinion;

2. Who wrote the opinion;

3. What is being commented on.

Sentiment analysis, on the other hand, is about determining the subjectivity, polarity (positive or negative) and polarity strength (e.g. weakly positive, mildly positive, strongly positive) of a text message [30]. The concepts of opinion mining and sentiment analysis usually work together in determining the nature of an opinion.

Opinion mining and sentiment analysis have many potential applications. For example, they have been applied to discover the opinions of voters on political personalities or candidates. The authors of [31] designed a system called OPTIMISM, which is an opinion mining system for the detection and classification of opinions about relevant political actors regarding a particular topic of debate. The main goal of OPTIMISM is to detect trend changes in electoral behavior ahead of existing polling technology by mining social media websites. Another application could be the purchase of products and services, where the user would have access to other people's opinions extracted from a collection of reviews, analyzed and presented in a structured way. An example of a system for analyzing product reviews is [32], where the authors introduced OPINE, an unsupervised information extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products. Also in the business area, opinion mining could be used by manufacturers and companies to access valuable critique on a service or a product in order to improve its quality in the future. Finally, a very important application of this area of research is recommendation systems, because, based on the opinions of users, these systems could make recommendations, for example, for products, music or trips. The system presented in [33] uses consumer opinions, expressed online in free-form text, to generate product recommendations.

2.1.4 Cascade and Influence Maximization

Suppose that a company is trying to market a product, or promote an idea or behavior, within a population of individuals. In order to do so, it could target individuals by offering them free samples of the product or explaining an idea. An important question is then whom the company should target. The answer should be to target the most influential individuals. Therefore, this approach is based on the premise that targeting a few key individuals may lead to a strong cascade of recommendations. In this way, decisions can spread through the network from a small set of initial individuals to a potentially much larger group. This is known as the influence maximization problem [34].

Influence maximization would be of interest to many companies as well as individuals that want to promote their products, services, and innovative ideas through the powerful "word-of-mouth" effect (or viral marketing). Online social networks provide good opportunities to address this problem, because they connect a huge number of people and they collect a huge amount of information about the social network structures and communication dynamics. However, they also present challenges due to their large scale, complex connection structures, and variability [35].

A great amount of research has focused on trying to solve this problem. The paper [34] described a general model of influence propagation, denominated the decreasing cascade model, generalizing models used in the sociology and economics communities. Also, in [35], efficient algorithms and heuristics for this issue are presented.

2.1.5 TweeProfiles

TweeProfiles [1] is a project created in 2012 that aimed to identify tweet profiles from collected user data. The profiles integrate multiple types of data, namely spatial, temporal, social and content. Therefore, the goal was to find interesting patterns and clusters using a combination of different types of information extracted from a Twitter community. Another goal of this project was to develop a visualization tool that would allow the effective display of the different patterns.

The data collected and used by TweeProfiles was obtained using TwitterEcho [24]. TwitterEcho is a Twitter crawler that collects data from a particular user community for research purposes. It is available open-source to the whole community in order to promote research in the area of social network analysis. Researchers can, therefore, collect their own data focusing on different communities of users, chosen according to factors such as geographic, demographic, linguistic or even content-based characteristics.

In terms of technical development, TweeProfiles uses data mining methods and processes in order to accomplish the goals mentioned earlier. First, it prepares the data by filtering the tweets that have spatial information (latitude and longitude) and extracts the social relationships between the users that sent the processed tweets. After all the data is collected and prepared, dissimilarity matrices are computed for each dimension, each using a different dissimilarity measure. The spatial distance was computed using the Haversine function [36]. For the temporal distance, a timestamp difference was chosen. As for content, the cosine dissimilarity was used (please refer to section 2.2.1.1). Finally, the social distance was calculated using the graph geodesic [37] (the distance between two vertices in a graph, i.e. the number of edges in a shortest path) between the users in the social graph constructed earlier. A min-max normalization function was applied to all matrices. For combining the matrices of each dimension, weights could be set by the user. This allowed different combinations to be performed and evaluated, producing different results based on these weights.

After the data preparation was done and the dissimilarity matrices were combined using the weights provided by the user, clustering was performed. Among all the possible clustering algorithms, the one chosen was DBSCAN (please refer to section 2.2.1.6), which was applied to every combined matrix.

In the visualization tool, each cluster presents the position of its tweets, represented by points of the color of the cluster, a circular shape that roughly represents the cluster location, a time interval (time of the first and last tweet of that cluster), the most frequent words used (the content dimension) and a mini social graph. An example of the visualization of different clusters obtained using multiple values of the weights is presented in Figure 2.5.

Figure 2.1: Clusters in Portugal; Content proportion 100%.
Figure 2.2: Clusters in Portugal; Content 50% + Spatial 50%.
Figure 2.3: Clusters in Portugal; Content 50% + Temporal 50%.
Figure 2.4: Clusters in Portugal; Content 50% + Social 50%.

Figure 2.5: Visualization of the clusters obtained by TweeProfiles using different weights for the different dimensions. Extracted from [1].

There was little attempt to evaluate the clusters obtained, due to the lack of time and the complexity of the task. Also, the only content used to find patterns was text, which can be restrictive given that, as mentioned before, Twitter users can also share other types of content. For this and other reasons, many other works have been developed since, having the initial version of TweeProfiles as the reference. This thesis will be one more contribution towards improving TweeProfiles.
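To make the combination step described above concrete, the following Python sketch min-max normalizes per-dimension dissimilarity matrices and combines them with user-defined weights, in the spirit of TweeProfiles. It is not code from the original system: the function names, the two toy matrices (spatial distances and temporal gaps for four hypothetical tweets) and the weights are all illustrative assumptions.

import numpy as np

def minmax_normalize(d):
    # Scale a dissimilarity matrix to the [0, 1] interval (min-max normalization).
    return (d - d.min()) / (d.max() - d.min())

def combine_dissimilarities(matrices, weights):
    # Weighted combination of per-dimension dissimilarity matrices.
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # make the weights sum to 1
    return sum(w * minmax_normalize(m) for w, m in zip(weights, matrices))

# Hypothetical toy matrices for 4 tweets: spatial distances (km) and temporal gaps (s).
spatial = np.array([[0, 10, 200, 205],
                    [10, 0, 190, 195],
                    [200, 190, 0, 5],
                    [205, 195, 5, 0]], dtype=float)
temporal = np.array([[0, 3600, 60, 7200],
                     [3600, 0, 3540, 3600],
                     [60, 3540, 0, 7140],
                     [7200, 3600, 7140, 0]], dtype=float)

combined = combine_dissimilarities([spatial, temporal], weights=[0.5, 0.5])
print(combined)

The combined matrix could then be handed to a clustering algorithm such as DBSCAN, mirroring the pipeline described for TweeProfiles.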

2.2 Data mining techniques and algorithms

Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information; information that can be used, for example, in the business field to increase revenue, cut costs, or both. Algorithms for data mining have a close relationship to methods of pattern recognition and machine learning. Data mining techniques and algorithms can be grouped into a number of different categories, each one with a different purpose [38]:

• Anomaly detection (or outlier detection): the identification of unusual data records [39]. Typically, the anomalous items will translate to some kind of issue such as bank fraud, medical problems or errors in text;

• Association rule learning: the discovery of relevant relationships between variables [40]. Product placement inside a commercial establishment is an example of an application of association rules. In that case, the goal is to discover products usually bought together in order to decide where to place those products;

• Classification: a supervised learning method used to predict the label or class of an unseen dataset (test set) based on the information collected from a known dataset (training set) [41]. Supervised in this context means that the examples contained in the datasets are labeled, and therefore there is a reference for training and evaluation;

• Regression: the method of finding the best function to model a given dataset. It is similar to classification apart from the fact that there is no concept of classes but of target variables that can take any real or numerical value [41];

• Clustering: an unsupervised method which consists of finding groups and similarities in the examples of the dataset [41]. In a clustering problem, there are no labels or classes already present in the data, so the goal is to discover possible organizations or patterns;

• Dimensionality reduction: where the goal is to reduce the number of variables of a dataset in order to improve the performance of the algorithms for regression, classification or clustering [41]. There are two types of dimensionality reduction techniques: feature selection and feature extraction. In feature selection, some of the variables (features) are considered irrelevant and are therefore excluded, while in feature extraction, features are combined to create more powerful features for the selected purpose;

• Evaluation: the task of assessing the performance of a model, which could be a clustering result, a classification result, among others [41];

• Summarization: whose goal is to take an information source, extract information from it and present the most important content to the user in a condensed form and in a manner sensitive to the user's application needs [42].

The next section will cover some algorithms used for clustering, since it is the most important concept used for this thesis. The description will cover simple and classic algorithms as well as some more complex and recent approaches.

2.2.1 Clustering

Clustering is one of the most primitive human mental activities and is part of the human learning process [41]. It is used to handle the huge amount of information received every day by defining basic categories (e.g. animal, kitchen, school). In a similar way, computers can apply clustering algorithms to collected data, which can reveal hidden information and structures.

Before applying a clustering technique, the data needs to be prepared by defining the features which will be used. The term features denotes the characteristics or variables that describe each example in the dataset. For example, if the dataset is formed by animals in a zoo, the features collected could be the size of the animal, the color of the fur, the life expectancy, etc. If the features are not carefully selected, so as to encode the maximum information with minimum redundancy, the clustering task will be less efficient [41].

2.2.1.1 Proximity Measures

Proximity measures define how similar or dissimilar two points, or two sets of points, in a dataset are. In this study we will focus only on proximity measures between two points. These measures define the way of computing the difference used to create the clusters. There are two types of proximity measures: dissimilarity measures and similarity measures. The former give a distance between two points and the latter measure how similar they are. Starting with the dissimilarity measures, some of the most popular can be generalized by the weighted Minkowski distance of order p. Using this measure, the dissimilarity between two points x = (x_1, ..., x_l) and y = (y_1, ..., y_l) in the dataset is computed as follows:

d_p(x, y) = \left( \sum_{i=1}^{l} w_i |x_i - y_i|^p \right)^{1/p}    (2.1)

where the following dissimilarity measures can be obtained by changing the values of p and w_i:

• Euclidean distance (p = 2 and w_i = 1):

d_e(x, y) = \sqrt{ \sum_{i=1}^{l} (x_i - y_i)^2 }    (2.2)

The Euclidean distance is the most common measure of distance between two points. It can be extended to any number of dimensions and represents the smallest distance between two points in a Euclidean space [43].

• Weighted Euclidean distance (p = 2 and w_i \ne 1):

d_{we}(x, y) = \sqrt{ \sum_{i=1}^{l} w_i (x_i - y_i)^2 }    (2.3)

• City-block or Manhattan distance (p = 1 and w_i = 1):

d_1(x, y) = \sum_{i=1}^{l} |x_i - y_i|    (2.4)

Other dissimilarity measures include:

• Chebyshev distance:

d_\infty(x, y) = \max_{1 \le i \le l} |x_i - y_i|    (2.5)

• Mahalanobis distance, where B is the inverse of the within-group covariance matrix:

d_m(x, y) = \sqrt{ (x - y)^T B (x - y) }    (2.6)

If the covariance matrix is the identity matrix (uncorrelated variables with unit variance), the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called a normalized Euclidean distance:

d_{ne}(x, y) = \sqrt{ \sum_{i=1}^{l} \frac{(x_i - y_i)^2}{\sigma_i^2} }    (2.7)

The Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data [44]. Thus, it provides a way to measure distances that takes into account the scale of the data.

As for similarity measures, some of the commonly used ones include:

• Inner product:

s_{inner}(x, y) = x^T y    (2.8)

The inner product is a basic similarity function between two points. If x tends to be high where y is also high, and low where y is low, the inner product will be high, and therefore the vectors are more similar.

• Cosine similarity measure:

s_{cosine}(x, y) = \frac{x^T y}{\|x\| \, \|y\|}    (2.9)

where \|x\| = \sqrt{\sum_{i=1}^{l} x_i^2} and \|y\| = \sqrt{\sum_{i=1}^{l} y_i^2}. Since the inner product is unbounded, one way to make it bounded between -1 and 1 is to divide by the vectors' norms, giving the cosine similarity. If x and y are non-negative, it is actually bounded between 0 and 1. The cosine similarity is a measure of the cosine of the angle between the two points or feature vectors. It is thus a judgment of orientation and not magnitude. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered [45]. It is widely used in retrieval systems where each data item is represented by a vector of frequencies (for example, of words in text mining), so that the cosine similarity measure gives a good estimate of how similar the examples are without being too expensive computationally [46].

• Pearson's correlation coefficient:

s_{pearson}(x, y) = \frac{x_d^T y_d}{\|x_d\| \, \|y_d\|}    (2.10)

where x_d = [x_1 - mean(x), ..., x_l - mean(x)]^T and y_d = [y_1 - mean(y), ..., y_l - mean(y)]^T.

The cosine similarity is not invariant to shifts: if x were shifted to x + 1, the cosine similarity would change. This problem is solved by Pearson's correlation, which is the cosine similarity between centered versions of x and y, again bounded between -1 and 1 [45]. Additionally, there are also distance measures between vectors with discrete or binary values, but these will not be covered in this document. To conclude, there are many proximity measures that could potentially be used for clustering analysis. Overall, the choice of the best one depends largely on the application.
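As an illustration of how these measures translate into code, the short Python sketch below computes the Euclidean distance (2.2), the cosine similarity (2.9) and Pearson's correlation coefficient (2.10); the two vectors are made up purely for the example.

import numpy as np

def euclidean(x, y):
    # Euclidean distance, equation (2.2).
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    # Cosine similarity, equation (2.9).
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_correlation(x, y):
    # Pearson's correlation, equation (2.10): cosine similarity of the centered vectors.
    return cosine_similarity(x - x.mean(), y - y.mean())

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.5])
print(euclidean(x, y), cosine_similarity(x, y), pearson_correlation(x, y))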

2.2.1.2 Clustering Algorithms

In the previous subsection, some popular proximity measures were shown. Each of these measures gives a different interpretation of the terms dissimilarity and similarity, associated with the type of clusters one wants to obtain. Next, a number of popular clustering algorithms will be explained.

2.2.1.3 Centroid-based clustering: K-means

Also called partition-based algorithms, these algorithms construct a partition of a database D of n objects into a set of k clusters. Some domain knowledge is required, since an input parameter k is needed, which unfortunately is not available for many applications. The partitioning algorithm typically starts with an initial partition of D and then uses an iterative process to optimize an objective function.

K-means [47] is one of the simplest unsupervised learning algorithms for clustering. It minimizes the within-cluster variance. The number of clusters k is fixed beforehand. The main idea is to define k centroids, one for each cluster. The centroids are initialized randomly in the feature space or can be chosen to coincide with k data points. The next step is to take each data point and associate it to the nearest centroid based on the Euclidean distance (section 2.2.1.1). When all the points are assigned to the nearest centroid, the k centroids are re-calculated as the mean of the feature values of the data points assigned to them. The procedure continues until there are no changes in the clusters obtained in each iteration or a stopping criterion is reached.

Using a different notation, given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, K-means clustering aims to partition the n observations into k (\le n) sets S = {S_1, S_2, ..., S_k} so as to minimize the within-cluster sum of squares (WCSS). Thus, its objective is to find:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2    (2.11)

where \mu_i is the mean of the points in S_i.

An example of K-means clustering can be found in Figure 2.6. In this 2-D dataset, the value used for k was 2 and the algorithm converged to the 2 centroids represented by the red X's. The clusters were then colored green or blue. In this example, the results were perfect due to the nature of the data (two generated Gaussian sets with well-separated means). When working with real data, the results are usually not expected to be ideal.

Figure 2.6: Example of the K-means clustering algorithm on a 2-D dataset.

One of the biggest problems with the K-means clustering algorithm for unsupervised learning is the fact that the value of k (number of clusters) needs to be provided by the user, when in most situations this value is not known. One method generally applied is to run the K-means algorithm with different values of k and compare the results objectively or subjectively. Some attempts to solve this problem include the so-called X-means algorithm [48]. Other problems with the K-means algorithm are that it scales poorly computationally and that the search is prone to local minima depending on the initial seeds. Finally, the shape of all clusters found by any centroid-based algorithm is convex, which is very restrictive. In spite of the problems and disadvantages of this method, and even though K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used [49].
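A minimal Python sketch of the iterative procedure described above (Lloyd's iteration) is shown below. The two-blob synthetic dataset mirrors the situation of Figure 2.6; the function name and parameter values are illustrative, and the sketch deliberately omits details such as handling clusters that become empty.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Plain Lloyd's algorithm: assign each point to the nearest centroid,
    # then recompute each centroid as the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial seeds
    for _ in range(n_iter):
        # n x k matrix of distances from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # note: this sketch does not handle clusters that end up with no points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when the centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated Gaussian blobs, similar in spirit to the 2-D example of Figure 2.6.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)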

2.2.1.4 Connectivity based clustering: Hierarchical Clustering

Hierarchical clustering is a clustering method which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: agglomerative and divisive [41]. In the first type, each observation starts as its own cluster and clusters are then merged, based on the chosen similarity or dissimilarity measure, until there is only one cluster. Alternatively, the algorithm starts with a single cluster containing all the observations, which is then split until each observation becomes a different cluster. The results of a hierarchical clustering process are usually represented by a dendrogram, which is a type of tree diagram. Figure 2.7 shows two examples of dendrograms, one for each type of hierarchical clustering algorithm.

Figure 2.7: Example of a dendrogram for hierarchical clustering, using either agglomerative or divisive methods. Extracted from [2]

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by the use of an appropriate metric (the proximity measures mentioned in section 2.2.1.1) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets; that is, the criterion used to assess distances between clusters. For example, if a cluster is composed of n observations and another of m, there should be a way to compute the distance between these clusters using only a single distance measure between two observations. The most popular linkage criteria are the following:

• Single linkage clustering:

\min \{ d(a, b) : a \in A, b \in B \}    (2.12)

In single-linkage clustering, the distance between two clusters is determined by a single element pair, namely the two elements (one in each cluster) that are closest to each other.

Figure 2.8: Illustration of single-link clustering. Extracted from [2]

• Complete linkage clustering:

\max \{ d(a, b) : a \in A, b \in B \}    (2.13)

In complete linkage, the distance between two clusters is computed as the distance between the two elements (one in each cluster) that are farthest away from each other. Complete linkage clustering avoids a drawback of the alternative single linkage method, the so-called chaining phenomenon, where clusters formed via single linkage clustering may be forced together due to single elements being close to each other, even though many of the elements in each cluster may be very distant from each other. Complete linkage tends to find compact clusters of approximately equal diameters.

Figure 2.9: Illustration of complete-link clustering. Extracted from [2]

• Average linkage clustering:

\frac{1}{|A| \, |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)    (2.14)

Average linkage is a compromise between the two previous methods. The distance between two clusters is the average of the distances between every pair of elements, one from each cluster.

Figure 2.10: Illustration of average-link clustering. Extracted from [2]

To conclude, the type of linkage method is important and needs to be chosen considering the nature of the clusters to be obtained.

The main problem with hierarchical clustering algorithms has been the difficulty of deriving appropriate parameters for the termination (or stopping) condition. The termination condition should find the best place to cut the dendrogram in order to separate all the important clusters and, at the same time, avoid splitting them into more than one cluster. Usually, the stopping condition used is either a threshold indicating that the clusters are too far apart to be merged (distance criterion) or a condition on the number of clusters, stopping when there is a sufficiently small number of clusters (number criterion).
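As an illustration, the following Python sketch applies SciPy's agglomerative hierarchical clustering to a synthetic dataset and shows the two stopping conditions just mentioned (distance criterion and number criterion); the dataset and threshold values are arbitrary assumptions made for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Agglomerative clustering with average linkage (equation 2.14);
# 'single' and 'complete' would correspond to equations (2.12) and (2.13).
Z = linkage(X, method='average', metric='euclidean')

# Two common stopping conditions: cut the dendrogram by distance or by number of clusters.
labels_by_distance = fcluster(Z, t=2.0, criterion='distance')
labels_by_count = fcluster(Z, t=2, criterion='maxclust')
print(labels_by_count)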

2.2.1.5 Graph theory based clustering: Spectral clustering

Spectral clustering is an algorithm which uses the spectral decomposition (eigendecomposition) of an associated matrix, which encodes the underlying similarity among the nodes (data points) of a graph representation. These types of algorithms have attracted a lot of interest over the last years [50]. The algorithms in this family are able to detect various shapes of clusters, a feature shared by only some algorithms.

A simple way of representing the data is in the form of a similarity graph G = (V, E). Each vertex v_i in this graph represents a data point x_i. Two vertices are connected if the similarity s_{ij} between the corresponding data points x_i and x_j is positive or larger than a certain threshold, and the edge is weighted by s_{ij}. Using the concept of a similarity graph, the problem of spectral clustering can be reformulated as a search for a partition of the graph such that the edges between different groups have very low weights (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weights (which means that points within the same cluster are similar to each other).

There are many ways to construct a similarity graph, according to the decision of which vertices will be connected. The three main types of similarity graph are: the ε-neighborhood graph, the k-nearest neighbor graph and the fully connected graph [50]. All these graphs are regularly used in spectral clustering. The choice of the similarity graph to use depends on the application.

Some important concepts on similarity graphs are described in the following. Let G = (V, E) be an undirected graph with vertex set V = {v_1, ..., v_n}. In the following, it is assumed that the graph G is weighted, that is, each edge between two vertices v_i and v_j carries a non-negative weight w_{ij} \ge 0. The weighted adjacency matrix of the graph G is the matrix W = (w_{ij})_{i,j=1,...,n}. If w_{ij} = 0, the vertices v_i and v_j are not connected by an edge. The degree of a vertex v_i \in V is defined as:

d_i = \sum_{j=1}^{n} w_{ij}    (2.15)

The degree matrix D is defined as the diagonal matrix with the degrees d_1, ..., d_n on the diagonal.

For two not necessarily disjoint sets A, B \subset V, the weight between them is defined as:

W(A, B) = \sum_{i \in A, j \in B} w_{ij}    (2.16)

The main tools for spectral clustering are graph Laplacian matrices, which are presented next. For the definitions, it is assumed that G is an undirected, weighted graph with weight matrix W, where w_{ij} = w_{ji} \ge 0. There are two types of graph Laplacian matrices: the unnormalized and the normalized. The unnormalized graph Laplacian matrix is defined as:

L = D - W    (2.17)

As for the normalized graph Laplacians, there are two matrices which fall under this category. Both matrices are closely related to each other and are defined as:

L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}    (2.18)

and

L_{rw} = D^{-1} L = I - D^{-1} W    (2.19)

Finally, with this background on weighted graphs and graph Laplacians, two of the most common spectral clustering algorithms, one using each type of graph Laplacian matrix, can be defined as follows.

Unnormalized spectral clustering according to [50]

Input: similarity matrix S \in R^{n x n}, number k of clusters to construct.

• Construct a similarity graph by one of the ways described previously. Let W be its weighted adjacency matrix.

• Compute the unnormalized Laplacian L.
• Compute the first k eigenvectors u_1, ..., u_k of L.

• Let U \in R^{n x k} be the matrix containing the vectors u_1, ..., u_k as columns.
• For i = 1, ..., n, let y_i \in R^k be the vector corresponding to the i-th row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the K-means algorithm into clusters C_1, ..., C_k.

Output: clusters A_1, ..., A_k with A_i = { j | y_j \in C_i }.

Normalized spectral clustering according to [50]

Input: similarity matrix S \in R^{n x n}, number k of clusters to construct.

• Construct a similarity graph by one of the ways described previously. Let W be its weighted adjacency matrix.

• Compute the unnormalized Laplacian L.

• Compute the first k generalized eigenvectors u_1, ..., u_k of the generalized eigenproblem Lu = \lambda D u.

• Let U \in R^{n x k} be the matrix containing the vectors u_1, ..., u_k as columns.

• For i = 1, ..., n, let y_i \in R^k be the vector corresponding to the i-th row of U.

• Cluster the points (y_i), i = 1, ..., n, in R^k with the K-means algorithm into clusters C_1, ..., C_k.

Output: clusters A_1, ..., A_k with A_i = { j | y_j \in C_i }.

Compared to more classical approaches such as K-means or hierarchical clustering, spectral clustering has obtained better results depending on the application. Also, it is very simple to implement and can be solved efficiently by standard linear algebra methods [50]. Despite its good performance, spectral clustering is usually not applicable to large-scale problems due to its high computational complexity [51]. Therefore, other types of algorithms need to be designed in order to deal with larger datasets.
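The unnormalized algorithm quoted above can be sketched in a few lines of Python. The similarity graph below is a made-up toy example (two small communities weakly connected by one edge), the function name is an assumption, and the eigendecomposition is done densely, which is only viable for small graphs.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(W, k):
    # Follows the unnormalized spectral clustering steps quoted above from [50].
    D = np.diag(W.sum(axis=1))        # degree matrix, equation (2.15)
    L = D - W                         # unnormalized Laplacian, equation (2.17)
    _, eigvecs = eigh(L)              # eigenvectors sorted by increasing eigenvalue
    U = eigvecs[:, :k]                # first k eigenvectors as columns
    # each row y_i of U is the new representation of point i; cluster the rows with K-means
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Toy similarity graph: two 3-node communities weakly connected by a single edge.
W = np.array([[0, 1, 1, 0.05, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0.05, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(unnormalized_spectral_clustering(W, k=2))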

2.2.1.6 Density-based clustering: DBSCAN

In density-based clustering algorithms, clusters are considered as regions of the l-dimensional space that are "dense" in data points [3]. Most of these algorithms do not impose restrictions on the shape of the clusters. In addition, they are able to handle outliers efficiently. Moreover, the time complexity of these algorithms is lower than O(N^2), which makes them eligible for processing large datasets.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an algorithm proposed in [3] to overcome some of the issues concerning the clustering of large spatial databases. This algorithm requires only one input parameter and supports the user in determining an appropriate value for it. It can also discover clusters with arbitrary shape (e.g. spherical, linear, elongated).

DBSCAN estimates the density around a data point p as the number of points in the dataset that fall inside a certain region of the l-dimensional space surrounding p. That is, the key idea of DBSCAN is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points, i.e. the density in the neighborhood has to exceed some threshold. The shape of a neighborhood is determined by the choice of a distance function for two points p and q, denoted by dist(p, q).

In DBSCAN, the Eps-neighborhood of a point p, denoted by N_{Eps}(p), is the set of points within distance Eps of p. This is mathematically defined by:

N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \}    (2.20)

Another important definition for this algorithm is the term directly density-reachable. A point p is directly density-reachable from a point q, given the values of Eps and MinPts, if:

1. p \in N_{Eps}(q), and

2. |N_{Eps}(q)| \ge MinPts (core point condition).

Density-reachability is another relevant concept and is a canonical extension of direct density-reachability. A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i. Finally, the last concept is density-connection. A point p is density-connected to a point q with respect to Eps and MinPts if there is a point w such that both p and q are density-reachable from w using Eps and MinPts.

Given these definitions, the density-based notion of a cluster can be presented. Intuitively, a cluster is defined to be a set of density-connected points which is maximal with respect to density-reachability. Additionally, noise is the set of points in D not belonging to any of its clusters. More specifically, a cluster is defined as follows. Let D be a database of points. A cluster C with respect to Eps and MinPts is a non-empty subset of D satisfying the following conditions:

1. \forall p, q: if p \in C and q is density-reachable from p w.r.t. Eps and MinPts, then q \in C. (Maximality)

2. \forall p, q \in C: p is density-connected to q w.r.t. Eps and MinPts. (Connectivity)

In order to use these definitions to find the clusters, the values of Eps and MinPts are needed for every cluster. The problem is that there is no easy way to get this information in advance for all clusters of the database. However, there is a simple and effective heuristic presented in [3] to determine the parameters Eps and MinPts of the smallest, i.e. least dense, cluster in the database. The reason behind choosing the least dense cluster is that the clustering algorithm will then also find all the denser clusters, while anything sparser should correspond to noise.

Figure 2.11 shows an example of the results of the DBSCAN algorithm applied to datasets with a variety of different shapes. Database 1 has rounded clusters that could be obtained using almost any classic algorithm (e.g. K-means). Database 2 has clusters of various shapes, which would be very hard for a simple algorithm to detect. Nonetheless, DBSCAN can effectively obtain the "correct" clusters. Finally, database 3 contains noise points, which are ignored by DBSCAN.

Figure 2.11: Clusters found by DBSCAN in 3 databases. Extracted from [3]

Pseudocode for the DBSCAN algorithm can be found in [3] for further insight.
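For intuition on how the Eps and MinPts parameters behave in practice, the following Python sketch applies scikit-learn's DBSCAN implementation to a synthetic dataset; the data and parameter values are illustrative choices, not recommendations.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob = rng.normal(0, 0.3, (100, 2))        # one dense cluster
noise = rng.uniform(-4, 4, (20, 2))        # sparse points acting as noise
X = np.vstack([blob, noise])

# eps corresponds to the neighborhood radius Eps, min_samples to the MinPts core-point condition.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # the label -1 marks points classified as noise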

2.2.1.7 Fuzzy Clustering: Fuzzy C-Means

Unlike the algorithms mentioned before, where the final clusters obtained are disjoint (hard clusters), fuzzy clustering algorithms partition the dataset into non-disjoint clusters (soft or fuzzy clusters) [52]. For this reason, a data element can belong to more than one cluster, and usually a membership level is associated with it. This family of clustering algorithms is based on fuzzy logic theory [53]. Fuzzy clustering has been around for a long time [52] and many algorithms have been designed. The most popular is the Fuzzy C-Means (FCM) algorithm, which will be explained next.

The goal of the FCM algorithm is to partition a set of n elements X = {x_1, ..., x_n} into c fuzzy clusters C = {c_1, ..., c_c}. In the Fuzzy C-Means algorithm, each data point has a degree of belonging to a certain cluster (named membership). This value can range between 0 and 1, and the sum of all the memberships of a given data example needs to be 1. The memberships are represented by w_{ij}, where i is the data point and j is the cluster. The FCM algorithm aims to minimize the following objective function [54]:

\arg\min_{C} \sum_{i=1}^{n} \sum_{j=1}^{c} w_{ij}^m \, \|x_i - c_j\|^2    (2.21)

where:

w_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{\frac{2}{m-1}}}    (2.22)

This objective function is similar to that of the K-means clustering algorithm. The variable m is called the fuzzifier and determines the level of fuzziness of the clusters obtained. The minimal value of m is 1, for which the clusters are no longer fuzzy, but hard.

Given these membership degrees w_k(x), for each data point x in the dataset, the centroid of the k-th cluster is obtained by computing the following expression:

c_k = \frac{\sum_x w_k(x)^m \, x}{\sum_x w_k(x)^m}    (2.23)

which is just the mean of all points weighted by their degree of belonging to the given cluster, raised to the power m. As for the algorithm itself, it is very similar to the popular K-means algorithm (section 2.2.1.3) [54]:

1. Select the number of clusters

2. Assign the coefficients of belonging to each cluster randomly

3. Calculate the centroid of each cluster using equation 2.23

4. Compute, for each point, the new coefficients using equation 2.22

5. Repeat steps 3 and 4 until a stopping criterion is reached

Fuzzy clustering has been successfully applied to image segmentation and text categorization, among other areas [55]. For a more detailed description of other fuzzy clustering algorithms, please refer to [55].
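A compact Python sketch of the FCM iteration described above is given below, directly alternating the centroid update of equation (2.23) and the membership update of equation (2.22). The function name, dataset and parameter values are made up for illustration, and the small constant added to the distances only avoids division by zero.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    # Alternate between the centroid update (equation 2.23)
    # and the membership update (equation 2.22).
    rng = np.random.default_rng(seed)
    w = rng.random((len(X), c))
    w /= w.sum(axis=1, keepdims=True)            # memberships of each point sum to 1
    for _ in range(n_iter):
        centers = (w.T ** m @ X) / (w.T ** m).sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = 1.0 / d ** (2.0 / (m - 1.0))
        w = inv / inv.sum(axis=1, keepdims=True)  # equivalent rearrangement of (2.22)
    return w, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
memberships, centers = fuzzy_c_means(X, c=2)
print(centers)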

2.2.2 Data Stream Clustering

Sometimes, the data obtained from a given source is generated and captured continuously and with high variability in data rate. This is the case for most of the data extracted from the Web (for example, through Twitter crawlers). The storage, querying and mining of data of this nature is challenging and complex. In order to obtain a fixed dataset, data has to be stored; this is considered "offline" processing. Recently, researchers have focused on designing algorithms for data stream processing [56]. One of these tasks is clustering analysis. Here, instead of obtaining a single clustering result, the clusters are constantly changing as new data is added and old data is removed.

One of the most popular algorithms for this task is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [57]. This algorithm is able to cluster incoming data points incrementally and dynamically and, at the same time, attempts to solve the memory and time issues of large datasets. The algorithm builds a hierarchical data structure in order to implement the incremental clustering task. The complexity of the algorithm is lower than that of many other clustering algorithms due to the fact that one pass over the data is sufficient for a good clustering result. Other algorithms can be found in [58] and include STREAM and LOCALSEARCH.
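As an illustration of incremental, stream-style clustering, the sketch below feeds batches of a synthetic stream to scikit-learn's BIRCH implementation via partial_fit; the threshold, batch sizes and cluster centers are arbitrary assumptions for the example.

import numpy as np
from sklearn.cluster import Birch

# BIRCH maintains its clustering-feature tree incrementally, so chunks of a stream
# can be fed one at a time with partial_fit instead of storing the whole dataset.
model = Birch(threshold=0.5, n_clusters=3)
rng = np.random.default_rng(0)
for _ in range(10):                               # ten incoming batches of the stream
    centers = rng.choice([0.0, 5.0, 10.0], size=(100, 1))
    batch = np.hstack([centers, centers]) + rng.normal(0, 0.3, (100, 2))
    model.partial_fit(batch)

print(model.predict(np.array([[0.1, 0.0], [5.2, 4.9], [9.8, 10.1]])))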

2.2.3 Cluster validity

As seen in the last section, clustering is a technique for unsupervised learning. This means that there are no true classes represented in the dataset. Therefore, the question now is how to evaluate the quality of the clusters found using the clustering algorithms. To answer this question, researchers have created techniques for what is known as clustering validation or clustering validity, which is the process of estimating how well a partition fits the structure underlying the data [59].

The objectives of cluster validity methods include determining whether non-random structure exists in the data, evaluating how well the results of a cluster analysis fit the data, comparing the results of two different clustering algorithms and determining the best number of clusters present in the data. The last objective refers to the fact that most clustering algorithms require a tuning of the input parameters to achieve optimal results. For example, the K-means algorithm, as seen before, requires the number of clusters as an input. A way of determining this parameter would be to test many different values of k and compute, for each, a validity index measuring the quality of the partition. Therefore, clustering validity methods are a good way of determining the best input parameters to use for a certain clustering analysis. In order to numerically measure different aspects of clustering validity, there are three main types of indexes or criteria:

• External index: used to measure how well the clusters obtained match externally supplied class labels (which may not be available).

• Internal index: used to measure the quality of the clusters obtained without respect to external information.

• Relative index: used to compare two different clusters or clustering results.

In this section, the focus will be on the external and internal indexes, which are the ones generally used in unsupervised learning. The external indexes are applied when a benchmark dataset is available; otherwise, the internal indexes are used.

2.2.3.1 External Indexes

The external criteria measure the similarity of the clusters obtained against the class labels present in the dataset. Although there is a variety of external indexes available, only 4 will be presented in the following. All of these indexes were described in [60]. The first external index is purity, which is computed by assigning each cluster to the majority class of the elements of that cluster; the accuracy is then measured by counting the number of correctly assigned points and dividing by the total number of points N. Mathematically, purity is defined as:

\[
\mathrm{purity}(W,C) = \frac{1}{N}\sum_{k}\max_{j}\,|w_k \cap c_j| \qquad (2.24)
\]

where $W = \{w_1,\dots,w_K\}$ is the set of clusters and $C = \{c_1,\dots,c_J\}$ is the set of class labels. Since purity is 1 if each data point has its own cluster, i.e. if the number of clusters equals N, purity cannot be used as a quality measure for defining the optimal number of clusters.

The next widely used external index is Normalized Mutual Information (NMI), which can take many forms depending on the upper bound chosen [61]. One of these measures is computed as:

\[
\mathrm{NMI}(W,C) = \frac{I(W,C)}{\sqrt{H(W)H(C)}} \qquad (2.25)
\]

where $I(W,C)$ is called the mutual information between W and C and can be obtained as follows:

\[
I(W;C) = \sum_{k}\sum_{j} P(w_k \cap c_j)\log\frac{P(w_k \cap c_j)}{P(w_k)\,P(c_j)} \qquad (2.26)
\]

and H is the entropy:

\[
H(W) = -\sum_{k} P(w_k)\log P(w_k) \qquad (2.27)
\]

The value of the mutual information is expected to be 0 if the clusters are obtained randomly and 1 if the clusters match the labels perfectly. Nonetheless, mutual information has the same problem as purity: it is maximal when the number of clusters equals N. The normalization fixes this problem, since entropy tends to increase with the number of clusters. The NMI is always a number between 0 and 1. Another alternative is the Rand index (RI). The idea is to view a clustering as a series of decisions, one for each pair of data points. The different types of decisions are: true positives (TP), two similar data points assigned to the same cluster; true negatives (TN), two dissimilar data points assigned to different clusters; false positives (FP), two dissimilar data points assigned to the same cluster; and false negatives (FN), two similar data points assigned to different clusters. Then, the index is computed as:

\[
RI = \frac{TP + TN}{TP + FP + FN + TN} \qquad (2.28)
\]

which is simply the accuracy. The Rand index gives the same weight to false positives and false negatives, which can be inappropriate in some situations. In this case, the F measure can be used. This measure penalizes false negatives more than false positives by choosing a value $\beta > 1$. It can be computed as follows:

\[
F_{\beta} = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R} \qquad (2.29)
\]

where $P = \frac{TP}{TP+FP}$ is the precision and $R = \frac{TP}{TP+FN}$ is the recall.
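A minimal sketch of the external indexes above (purity, NMI and Rand index), assuming scikit-learn is available; labels_true plays the role of the externally supplied classes C and labels_pred the clusters W.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, rand_score
from sklearn.metrics.cluster import contingency_matrix

labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0])

cm = contingency_matrix(labels_true, labels_pred)       # classes x clusters counts
purity = cm.max(axis=0).sum() / cm.sum()                # majority class per cluster

print("purity:", purity)
print("NMI   :", normalized_mutual_info_score(labels_true, labels_pred))
print("RI    :", rand_score(labels_true, labels_pred))  # (TP+TN)/(TP+FP+FN+TN)
```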

2.2.3.2 Internal Indexes

A very simple way to measure the quality of a clustering result is the correlation between the points of each cluster. High correlation indicates that points belonging to the same cluster are close to each other with respect to some similarity measure. For this calculation, two matrices need to be computed using all data points: the proximity matrix and the incidence matrix. Both have n rows and n columns, n being the total number of data points in the dataset. The proximity matrix holds the similarity measures between all pairs of data points, normalized to the interval from 0 to 1. The incidence matrix has, at each entry, 1 if the associated pair of points belongs to the same cluster (according to the results obtained from the clustering analysis) and 0 if they belong to different clusters. The correlation between the entries of the two matrices is then computed. Figure 2.12 shows two examples of clustering results on two datasets. The first has a high correlation (corr = -0.9235) and the second a smaller one (corr = -0.5810). This measure has many limitations, especially when the clusters have an arbitrary shape.

Figure 2.12: Example of two clustering results (obtained by K-means algorithm) in two different datasets.

Other than the correlation, most clustering validity indexes are usually defined by combining the following pair of evaluation criteria:

• Compactness or cohesion: measures how closely related the objects in a cluster are.

• Separability: measures how distinct or well-separated a cluster is from other clusters.

There are many articles that study and compare different clustering validity indexes. One of the most recent and extensive ones is [62], in which the authors compared as many as 30 clustering validity indexes in a well defined experimental design using both synthetic and real datasets. The best performing index in almost all the experiments was the silhouette index [63], which is a normalized summation-type index. The cohesion is measured based on the distance between all the points in the same cluster and the separability is based on the nearest-neighbor distance. To define this index, the dataset X is denoted as the set of N objects represented as vectors in an F-dimensional space, $X = \{x_1, x_2, \dots, x_N\} \subseteq \mathbb{R}^F$, and a partition into K clusters is represented as $C = \{c_1, c_2, \dots, c_K\}$.

Finally, the silhouette index is defined as:

\[
\mathrm{Sil}(C) = \frac{1}{N} \sum_{c_k \in C} \sum_{x_i \in c_k} \frac{b(x_i, c_k) - a(x_i, c_k)}{\max\{a(x_i, c_k),\, b(x_i, c_k)\}} \qquad (2.30)
\]

where

\[
a(x_i, c_k) = \frac{1}{|c_k|} \sum_{x_j \in c_k} d_e(x_i, x_j), \qquad
b(x_i, c_k) = \min_{c_l \in C \setminus c_k} \left\{ \frac{1}{|c_l|} \sum_{x_j \in c_l} d_e(x_i, x_j) \right\}
\]

Other indexes that achieved good scores in the experiments performed in [62] were the Calinski-Harabasz index [64] and a variation of the Davies-Bouldin index presented in [65]. The Calinski-Harabasz index is a ratio-type index where the cohesion is estimated based on the distances from the points in a cluster to its centroid. The separation is based on the distance from the centroids to the global centroid. It can be defined as:

\[
CH(C) = \frac{N - K}{K - 1} \cdot \frac{\sum_{c_k \in C} |c_k|\, d_e(\bar{c}_k, \bar{X})}{\sum_{c_k \in C} \sum_{x_i \in c_k} d_e(x_i, \bar{c}_k)} \qquad (2.31)
\]

where

\[
\bar{c}_k = \frac{1}{|c_k|} \sum_{x_i \in c_k} x_i, \qquad \bar{X} = \frac{1}{N} \sum_{x_i \in X} x_i
\]

and

$d_e(x_i, x_j)$ is the Euclidean distance between the vectors $x_i$ and $x_j$. To conclude, clustering analysis is one of the most important techniques in the area of pattern discovery and data mining and its applications are endless. Nevertheless, evaluating the quality of a clustering analysis is an essential but difficult task, especially when there is no reference to the underlying structure of the dataset. This evaluation is done using clustering validity indexes.
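A minimal sketch of the internal indexes discussed above (silhouette and Calinski-Harabasz), assuming scikit-learn; it also illustrates their use for choosing the number of clusters k, as suggested earlier in this section.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Evaluate candidate values of k with the two internal indexes.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels), calinski_harabasz_score(X, labels))
```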

2.3 Content based image retrieval

With the widespread use of the Internet and of image capturing devices, the size of digital image collections is increasing rapidly. For this reason, there is a great need for developing efficient image search systems. Until now, two classes of systems have been developed: text-based and content-based. The text-based approach is the older one, dating back to the 1970s [6]. In such systems, images are searched by a given text query that relies on the text present in the image annotation. One disadvantage of this approach is that many images are not annotated and therefore cannot be found using this method. Another disadvantage is the fact that many images are inaccurately annotated, so the results will probably not correspond to the user's expectations. To overcome these disadvantages, content-based image retrieval (CBIR) was introduced in the early 1980s [66]. In CBIR, images are indexed based on visual features such as color, shape or texture. Since then, there has been a great amount of research focused on the area of CBIR.

The initial step for a CBIR system is to analyze the images in its database. For this purpose, the system needs to describe the content of the images. First, the images are processed using image processing techniques in order to enhance relevant aspects. Next, features are extracted from the images. Usually, low-level features are extracted and combined to create higher-level features. There are several types of low-level features that can be extracted from the images, including color, shape, texture and global features. After a feature vector for each image is obtained, these feature vectors are stored in a feature database. In the case of CBIR by example, when a query image is given, the same image processing and feature extraction steps are also performed on the query image. The following step is to compute the similarity between the query image and the images in the database. This is accomplished by computing similarity measures between the feature vectors, as sketched after Figure 2.13. Figure 2.13 shows a general model of a CBIR system. In the case of a semantic CBIR system, there are more steps to take, since the query will be text-based, which either has to be converted to a feature-based query or the feature database needs to have high-level concepts associated with the images.

Figure 2.13: A general model of CBIR systems. Extracted from [4].
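A minimal sketch of the query-by-example ranking step, assuming feature extraction has already produced one feature vector per image; the feature database and query vector here are hypothetical placeholders.

```python
import numpy as np

def retrieve(query_vec, feature_db, top_k=5):
    """Rank database images by Euclidean distance to the query feature vector."""
    dists = np.linalg.norm(feature_db - query_vec, axis=1)
    return np.argsort(dists)[:top_k]          # indices of the most similar images

# feature_db: N x D matrix built offline; query_vec: D-dimensional query feature.
feature_db = np.random.rand(1000, 64)
query_vec = np.random.rand(64)
print(retrieve(query_vec, feature_db))
```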

2.3.1 The semantic gap

An important difference between text-based and content-based image retrieval systems is the fact that humans tend to use text and keywords to express high-level concepts in order to interpret images and measure their similarity, whereas visual features automatically extracted from images can usually only represent low-level properties. In general, there is no direct link between the low-level features and the high-level concepts [6]. Therefore, the performance of CBIR is still far from the user's expectations. According to [67], there are three levels of queries in CBIR:

• Level 1: Retrieval by primitive (low-level) features such as color, texture and shape. A typical query of this kind is a query by example, i.e. "find a picture like this".

• Level 2: Retrieval of objects of a given type. For example, "find a picture of a flower".

• Level 3: Retrieval by abstract attributes, involving a significant amount of high-level reasoning about the purpose of the objects or scene depicted. An example could be "find pictures of a joyful crowd".

Levels 2 and 3 together are referred to as semantic image retrieval, and the gap between Levels 1 and 2 as the semantic gap. A more detailed definition of the semantic gap is presented in [66]:

Definition 2. The semantic gap is the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation.

Users performing Level 1 retrieval are usually required to submit an example image, which may not be available. Therefore, semantic image retrieval is a more desirable approach.

2.3.2 Reducing the semantic gap

There have been many attempts to create CBIR algorithms that reduce the semantic gap explained previously. According to [6], these approaches can be classified into five different categories: using an object ontology to define high-level concepts; using machine learning methods to associate low-level features with query concepts; using relevance feedback to learn the user's intentions; generating semantic templates to support high-level image retrieval; and fusing the evidence from HTML text and the visual content of images for WWW image retrieval. A description of each of these approaches is presented in the following sections.

2.3.2.1 Object-ontology

An object-ontology is used to derive simple semantics. First, different intervals are defined for the low-level image features, with each interval corresponding to an intermediate-level descriptor of the image, for example, "light blue, large, middle". These descriptors form a simple vocabulary, called an object-ontology, which provides a qualitative definition of high-level query concepts [6]. An example of a CBIR system which uses an object-ontology is presented in [5]. First, the proposed approach employs a segmentation algorithm to divide images into regions. After that, each region of the image is described by color, position, shape and size. One key component in an ontology-based system is the quantization of the color and texture features. A widely used method for color quantization is color naming. This method relates a numerical color space with the semantic color names used in natural language. A well-known color naming system is the CNS (Color Naming System) presented in [68]. This system uses the HSL color space and its complete set of generic hue names is red, orange, brown, yellow, green, blue and purple, in addition to the achromatic terms black, gray and white, resulting in 10 base colors. The object-ontology approach for reducing the semantic gap may be a possibility, but when considering large image databases with a variety of contents, more powerful tools need to be derived.

Figure 2.14: Object-ontology model used in [5]. Extracted from [5]

2.3.2.2 Machine Learning

Another way to learn higher-level concepts from the low-level features extracted from images is to use machine learning tools such as supervised or unsupervised learning. These concepts were already mentioned in Section 2.2. In terms of algorithms, many supervised learning techniques have been applied to CBIR. SVMs (Support Vector Machines), Bayesian classifiers and neural networks are often used for this purpose [6]. For example, in [69], an SVM is employed for image annotation: in the training stage, a binary SVM model is trained for each of the 23 classes (concepts); in the testing stage, the inputs are unlabeled regions and the label produced by the trained model is associated with the image or region (a sketch of this idea is given below). Other learning methods, such as neural networks, are also used for concept learning. For example, in [70], the author chooses 11 categories: brick, cloud, fur, glass, ice, road, rock, sand, skin, tree and water. Then, a large amount of training data, represented as low-level features of segmented regions, is fed into the neural network classifier. The use of neural networks has some issues, since it requires a substantial amount of training data and training is a computationally intensive task. Unsupervised learning techniques, such as clustering, can also be used to reduce the semantic gap in CBIR systems. In [71], a method called CLUE is presented. Unlike other CBIR systems, which display the top matched target images, this system attempts to retrieve semantically coherent image clusters.
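A minimal sketch of SVM-based concept learning in the spirit described above, assuming scikit-learn; the region feature vectors and concept names are hypothetical placeholders, not the data used in [69].

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train: low-level features of segmented regions (e.g. color + texture),
# y_train: concept labels for those regions (placeholders here).
X_train = np.random.rand(200, 32)
y_train = np.random.choice(["sky", "grass", "water"], size=200)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)

# Annotate the regions of a new image with the predicted concepts.
X_new = np.random.rand(5, 32)
print(clf.predict(X_new))
```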

2.3.2.3 Relevance Feedback

Relevance Feedback (RF) is a powerful tool originally used in text-based information retrieval systems. It is an online processing task which tries to reduce the semantic gap by interacting with the user, attempting to learn the user's preferences as it searches for information [6].

The problem of CBIR with relevance feedback can be seen as the problem of ranking a set of images so that images visually consistent with a query image appear earlier in the ordering. The first K images in the ranking are presented to the user, who has the opportunity of marking them as relevant or non-relevant if not satisfied with the result [72]. A typical scenario for RF in CBIR is described by the steps below [6]:

1. The system provides initial retrieval results through query-by-example, sketch, etc.

2. The user judges the above results as to whether, and to what degree, they are relevant (positive examples) or irrelevant (negative examples) to the query.

3. A machine learning algorithm is applied to learn from the user's feedback. Then go back to (2).

Steps (2) and (3) are repeated until the user is satisfied with the result. A representative diagram of the RF process can be seen in Figure 2.15.

Figure 2.15: CBIR with RF. Extracted from [6].

According to [73], different feedback models have been proposed in the literature: positive feedback, which allows the user to select only relevant (positive) images; positive-negative feedback, where the user can specify both relevant and non-relevant (negative) images; positive-neutral-negative feedback, where a neutral class is also added among the user's choices; and feedback with (non-)relevance degree, where the user implicitly ranks the images by specifying a degree of (non-)relevance. The new information inferred from the user can then be used within a short-term-learning or long-term-learning process. The former uses the user feedback only within the current query context, while the latter updates the image similarities in order to benefit from the feedback in future queries.

In order to take advantage of the user interaction, an effective learning method should be adopted to improve the retrieval results. One of the simplest methods for learning from relevance feedback is to automatically adjust the weights of each of the low-level features extracted from the images; a sketch of this idea is given below. The authors of [73] proposed a CBIR system with RF which is based on the random walker algorithm introduced in the context of interactive image segmentation. The idea is to treat the relevant and non-relevant images labeled by the user at every feedback round as "seed" nodes for the random walker problem [74]. The ranking score for each unlabeled image is computed as the probability that a random walker starting from that image will reach a relevant seed before encountering a non-relevant one.
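A minimal sketch of the simple feature re-weighting idea mentioned above (not the random-walker method of [73]): feature dimensions that vary little among the images the user marked as relevant receive larger weights. The data here is a hypothetical placeholder.

```python
import numpy as np

def update_weights(relevant_feats, eps=1e-6):
    """Classic heuristic: weight each dimension by 1 / std over the relevant images."""
    w = 1.0 / (relevant_feats.std(axis=0) + eps)
    return w / w.sum()

def rank(query, feature_db, weights):
    """Weighted Euclidean distance ranking of the database images."""
    dists = np.sqrt(((feature_db - query) ** 2 * weights).sum(axis=1))
    return np.argsort(dists)

feature_db = np.random.rand(500, 16)
query = np.random.rand(16)
relevant = feature_db[[3, 42, 77]]          # images the user marked as relevant
weights = update_weights(relevant)
print(rank(query, feature_db, weights)[:10])
```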

2.3.2.4 Semantic template

Semantic template (ST) is a promising approach in semantic-based image retrieval. It consists of a map between high-level concepts and low-level visual features. According to [6], a semantic template is usually defined as a "representative" feature of a concept calculated from a collection of sample images. An example of a system proposed to reduce the semantic gap using semantic templates is [75]. The authors present an RBIR (Region-Based Image Retrieval) system with high-level semantics derived using decision-tree (DT) learning. First, every image in the database is segmented into different regions, which are represented by their color and texture features. To associate low-level region features with high-level image concepts, the DT algorithm is combined with semantic templates (ST), creating DT-ST. Based on the decision rules derived by DT-ST, each region in an image is associated with a high-level concept. The system supports both query by specified region and query by keyword. When the user submits a query region, the system obtains the query concept using DT-ST and then returns those images that contain region(s) of the same concept as that of the query. In the case of query by keyword, the system returns images with region(s) matching the query concept. Experimental results demonstrate that the proposed system significantly improves the accuracy of image retrieval, compared with a system not using high-level semantics.

2.3.2.5 Web image retrieval

Web image retrieval tries to improve retrieval performance by using textual information together with visual content. Existing Web image search systems, such as Image Search, search images based on textual information only. The problem with this approach is that the textual information might be wrong or misleading, which means that the user will probably have to browse to find the desired result. Using content together with tags (text) helps to achieve a better search experience. The authors of [76] proposed a unified framework for performing tasks related to image retrieval, including content-based image retrieval, text-based image retrieval and image annotation. The framework relies on a two-level data fusion between image contents and tags. They start by extracting global features from every image (grid color moments, local binary patterns, Gabor wavelet textures and edges). After that, an image similarity graph is constructed, which then forms a hybrid graph with the image-tag bipartite graph. After building the hybrid graph, they propose a random walk model to process the image contents and the tags.

2.4 Image descriptors

An image descriptor is a piece of information or a characteristic that can be extracted from an image. There are generally three types of image descriptors: low-level descriptors, mid-level descriptors and high-level descriptors. Figure 2.16 shows a schematic representation of the three levels of image descriptors.

Figure 2.16: Schematic representation of the different types of image descriptors. Extracted from [7].

From this figure, it can be seen that the low-level image descriptors are the ones extracted directly from the raw signal (in this case, the image pixels), whereas the mid-level descriptors use information gathered by the low-level descriptors, as do the high-level descriptors, whose input is generally given by the mid-level descriptors. The high-level descriptors represent a piece of human-interpretable semantic information describing an image. They represent the goal of describing and annotating images [7]. In this section, low-level image features and techniques for building mid-level image descriptors will be described and presented.

2.4.1 Low-level image features

A low-level descriptor is a continuous or discrete numeric or symbolic measurement, which is computed directly from the signal (e.g. image pixels) [7]. It is designed to capture a certain visual property of an image, either globally for the entire image or locally for a small group of pixels. The most commonly used features include those reflecting color, texture, shape and salient/interest points in an image.

In this section, some important and widely used low-level descriptors are presented. These descriptors belong to a number of different groups and have different characteristics and applications.

2.4.1.1 Color features

Color features are among the most widely used features for describing images or segmented regions of an image. They are the most intuitive and most obvious image features. Compared with other image features such as texture and shape, color features are very stable and robust: they are not sensitive to rotation, translation or scale changes, and they are usually very simple to compute [77]. Color is a sensation created in response to the excitation of the human visual system by electromagnetic radiation in the visible region of the electromagnetic spectrum (wavelengths from 400nm to 700nm) [8].

Color spaces A color space is the dimensional space in which colors are defined. A variety of color spaces exist and each has its own applications. Many color spaces have been created and are currently used for different applications, including RGB, YCbCr, HSV, HUV, LAB and LUV. A description of RGB, YCbCr, HSV and L*a*b is given next. A more detailed presentation of color spaces can be found in [8].

RGB The most widely used color space is RGB (Red, Green and Blue), which is based on the three primary colors and a white reference point. Using an appropriate scale along each primary axis, the space can be normalized so that the maximum is 1. Therefore, as can be seen in Figure 2.17, the RGB color solid is a cube with vertices at (0,0,0), which corresponds to black, and (1,1,1), which corresponds to white.

Figure 2.17: The RGB color model. Extracted from [8].

YCbCr For storage and transmission reasons, the human visual system needs to be taken into account to allow compression. According to [8], it is believed that the human visual system forms an achromatic channel and two chromatic color-difference channels in the retina. Therefore, a color image can be represented as a wide-band component corresponding to brightness and two narrow-band color components. YCbCr or Y'CbCr is a color space linearly related to the RGB space and is used for digital video (MPEG). Y'CbCr is calculated using nonlinear gamma-corrected values of R, G, B, denoted R', G', B'. The relationship between Y'CbCr and R'G'B' is presented below:

\[
\begin{bmatrix} Y' \\ C_B \\ C_R \end{bmatrix} =
\begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix} +
\begin{bmatrix} 65.481 & 128.553 & 24.966 \\ -37.797 & -74.203 & 112.0 \\ 112.0 & -93.786 & -18.214 \end{bmatrix}
\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} \qquad (2.32)
\]

The Y'CbCr color space is a scaled and offset version of the YUV color space.
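A minimal sketch of the conversion in equation 2.32, assuming gamma-corrected R', G', B' values normalized to [0, 1].

```python
import numpy as np

M = np.array([[ 65.481, 128.553,  24.966],
              [-37.797, -74.203, 112.0  ],
              [112.0,   -93.786, -18.214]])
offset = np.array([16.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    """rgb: array of shape (..., 3) with values in [0, 1]; returns Y'CbCr values."""
    return offset + np.tensordot(rgb, M.T, axes=1)

print(rgb_to_ycbcr(np.array([1.0, 1.0, 1.0])))  # pure white -> Y' = 235, Cb = Cr = 128
```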

HSV The HSV color model belongs to the group of hue-oriented color coordinate systems (which also includes HSI). It was originally proposed in [78] and is a cylindrical coordinate system. It can be represented by a hexcone model, which is shown in Figure 2.18. The advantages of the hue-oriented color spaces include, among others [8]:

• Good compatibility with human intuition.
• Separability of chromatic values and achromatic values.
• The possibility of using only one component, hue, for segmentation purposes.

Figure 2.18: HSV color representation. Extracted from [8] .

L*a*b Due to the fact that the R, G and B components are highly correlated, chromatic information is not directly fit for use. The L*a*b color space was designed to be uniform with respect to human perception. Therefore, it is a good choice for comparing two color points.

To convert between RGB and L*a*b, a number of computations need to be performed; they can be found in [8].

Color histograms The color histogram is the most used method to extract color features [77]. A color histogram is a frequency statistic of the different colors in a certain color space (such as the ones described previously). It does not require any form of segmentation of the image, since it describes the color content of the whole image. However, one of its drawbacks is that it cannot describe the local distribution of the colors in the image or the spatial position of each color. This means that the color histogram cannot describe specific objects or regions in an image. Another disadvantage of using color histograms is that they are neither unique nor robust to noise [79]. In order to compute a color histogram, the color space needs to be divided into several small ranges, denoted bins; thus, the color is quantized. Each bin counts the number of pixels that fall into the corresponding color space interval. Figure 2.19 shows an example of the color histograms of an image. In total, there are three histograms, one for each color component: red, green and blue.

Figure 2.19: Example of color histograms for an example image.

Color features include the global color histogram and the block color histogram. In [77], the authors used a global color histogram as one of the features for content-based image retrieval, similar to the sketch below. First, the images were converted to the HSV color space in order to meet the visual requirements of humans. Next, the image colors were quantized using 8 levels for H and 3 levels each for S and V. After the histogram of each image was computed, the similarity between images was computed as the Euclidean distance between the histograms of the images. The features extracted this way are invariant to rotation and translation. The drawback is that two completely different images can have the same global color histogram, which is a problem for the purpose of the article (image retrieval). Color histograms have proven effective for small databases, but their limitations become rapidly apparent with larger databases [80].
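A minimal sketch of a quantized global HSV color histogram and histogram comparison, loosely following the 8×3×3 quantization described above; it assumes OpenCV (cv2) is available and the image paths are placeholders.

```python
import cv2
import numpy as np

def hsv_histogram(path, bins=(8, 3, 3)):
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # OpenCV ranges: H in [0, 180), S and V in [0, 256)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / hist.sum()          # normalize so image size does not matter

h1 = hsv_histogram("image1.jpg")
h2 = hsv_histogram("image2.jpg")
print("distance:", np.linalg.norm(h1 - h2))   # Euclidean distance between histograms
```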

Color moments Color moments are another way of characterizing the color distribution of an image. They are based on the idea that the color distribution of an image can be interpreted as a probability distribution, which can be characterized by its central moments. The authors of [81], who were the first to propose such features, suggested storing the first, second and third moments of each color channel. The first moment is the average color, whereas the second and third moments are the standard deviation and the skewness of each color channel. If the value of the i-th color channel at the j-th image pixel is $p_{ij}$, then the first, second and third color moments can be calculated as follows:

\[
E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij} \qquad (2.33)
\]

\[
\sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \right)^{1/2} \qquad (2.34)
\]

\[
s_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^3 \right)^{1/3} \qquad (2.35)
\]

The difference between two color images can thus be calculated using a (weighted) distance measure between the moments of all the color channels of the images. The greatest advantage of using color moments comes from the fact that there is no need to store the complete color distribution, and therefore it is an efficient method.
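A minimal sketch of the color moments of equations 2.33-2.35, computed per channel with plain NumPy; img is assumed to be an H x W x 3 array.

```python
import numpy as np

def color_moments(img):
    pixels = img.reshape(-1, 3).astype(np.float64)   # N x 3, one row per pixel
    mean = pixels.mean(axis=0)                       # E_i
    std = pixels.std(axis=0)                         # sigma_i
    centered = pixels - mean
    skew = np.cbrt((centered ** 3).mean(axis=0))     # s_i (cube root keeps the sign)
    return np.concatenate([mean, std, skew])         # 9-dimensional feature vector

img = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
print(color_moments(img))
```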

Color coherence vectors A color coherence vector (CCV) stores the number of coherent versus incoherent pixels for each color [9]. By separating coherent pixels from incoherent pixels, CCVs provide finer distinctions than color histograms. Coherence can be defined, intuitively, as the degree to which pixels of a given color are members of large similarly-colored regions. These regions are referred to as coherent regions, and they have a significant importance in characterizing images. For example, the images shown in Figure 2.20 have similar color histograms, despite being taken in completely different contexts. The color red appears in both images in approximately the same quantities, but in the left image the red pixels (from the flowers) are widely scattered, while in the right image the red pixels (from the golfer's shirt) form a single coherent region.

Figure 2.20: Two images with similar color histograms. Extracted from [9].

In order to compute the coherence vectors, the pixels of each quantized color need to be classified as either coherent or incoherent, where a coherent pixel is part of a large group of pixels of the same color, based on an 8-neighborhood. After connected components are computed, each pixel belongs to exactly one connected component, and its classification as coherent or incoherent is based on the size of that component. To calculate the difference between two images based on their color coherence vectors, one computes, for each color bin, the difference between the numbers of coherent pixels and the difference between the numbers of incoherent pixels in the two images. CCVs create a finer distinction than color histograms, since two images can have the same histogram but different CCVs depending on the coherent color regions. A sketch of the CCV computation is given below.
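A minimal sketch of a color coherence vector, assuming SciPy is available; colors are quantized per channel, connected components use 8-connectivity, and the coherence threshold tau is a free parameter.

```python
import numpy as np
from scipy import ndimage

def ccv(img, levels=4, tau=50):
    # Quantize each channel into `levels` buckets and combine into one color index.
    q = (img.astype(np.int64) * levels) // 256
    color_idx = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]

    n_colors = levels ** 3
    coherent = np.zeros(n_colors, dtype=np.int64)
    incoherent = np.zeros(n_colors, dtype=np.int64)
    eight = np.ones((3, 3), dtype=int)                 # 8-neighborhood structure

    for c in range(n_colors):
        mask = color_idx == c
        if not mask.any():
            continue
        labeled, _ = ndimage.label(mask, structure=eight)
        sizes = np.bincount(labeled.ravel())[1:]        # component sizes (skip background)
        coherent[c] = sizes[sizes >= tau].sum()
        incoherent[c] = sizes[sizes < tau].sum()

    return np.stack([coherent, incoherent], axis=1)     # (coherent, incoherent) per color

img = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
print(ccv(img).shape)
```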

2.4.1.2 Texture features

Texture gives information about the spatial arrangement of the colors or intensities in an image [10]. Suppose that the histogram of a region shows that it has 50% white pixels and 50% black pixels. Figure 2.21 shows three different regions with this intensity distribution but with very different textures. As illustrated by the figure, texture provides important information for image description.

Figure 2.21: Three different textures with the same distribution of black and white. Extracted from [10].

There are many features that reveal information about the texture of an image. Some of the most used are: Gabor filtering, the wavelet transform, the Tamura texture features and the Edge Histogram Descriptor (EHD).

Gabor filtering One of the most popular signal-processing approaches for texture feature extraction is the use of Gabor filters. It has been proposed that Gabor filters can model the responses of the human visual system. Frequency and orientation information can be extracted by using a bank of filters at different scales and orientations, which allows multichannel filtering of an image. The filter responses can then be used as texture features [82].

A two dimensional Gabor function g(x,y) and its Fourier transform G(u,v) can be written as [83]:

\[
g(x,y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j W x \right] \qquad (2.36)
\]

\[
G(u,v) = \exp\!\left\{ -\frac{1}{2}\left[ \frac{(u - W)^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2} \right] \right\} \qquad (2.37)
\]

where $\sigma_u = 1/(2\pi\sigma_x)$ and $\sigma_v = 1/(2\pi\sigma_y)$. Gabor functions form a complete but nonorthogonal basis set. They provide a localized frequency description. A bank of Gabor filters can then be generated by dilating and rotating the function from equation 2.36:

\[
h_{i,j}(x,y) = a^{-i}\, h(x', y'), \qquad i, j \text{ integer} \qquad (2.38)
\]

where $x' = a^{-i}(x\cos\theta_j + y\sin\theta_j)$, $y' = a^{-i}(-x\sin\theta_j + y\cos\theta_j)$, $\theta_j = j\pi/K$ and $K$ is the total number of orientations. The scale factor $a^{-i}$ is meant to ensure equal energy among the different filters. The statistics of the filter responses can be used to characterize the underlying texture information. Given an image $I(x,y)$, filtering the image with the Gabor filters $h_{i,j}$ results in:

\[
O_{i,j}(x,y) = \int\!\!\int I(x_1,y_1)\, h_{i,j}^{*}(x - x_1,\, y - y_1)\, dx_1\, dy_1 \qquad (2.39)
\]

where $*$ indicates the complex conjugate. Therefore, a simple texture feature can be constructed using the mean and the standard deviation of the amplitude of the filter responses [83].
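A minimal sketch of a Gabor filter bank for texture features, assuming OpenCV; the mean and standard deviation of each filtered image are used as features, as suggested above. The scales and orientations are arbitrary choices.

```python
import cv2
import numpy as np

def gabor_features(gray, n_orientations=4, wavelengths=(4, 8, 16)):
    feats = []
    for lam in wavelengths:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=lam * 0.5,
                                        theta=theta, lambd=lam, gamma=0.5, psi=0)
            response = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)
            feats += [np.abs(response).mean(), np.abs(response).std()]
    return np.array(feats)

gray = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
print(gabor_features(gray).shape)   # 2 features per (scale, orientation) pair
```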

Wavelet transform A wavelet transform decomposes a 1-D signal f (x) onto a basis of wavelet functions [11]:

\[
(W_a f)(b) = \int f(x)\, \psi_{a,b}^{*}(x)\, dx \qquad (2.40)
\]

which is obtained by translating and dilating a single mother wavelet $\psi$:

\[
\psi_{a,b}(x) = \frac{1}{\sqrt{a}}\, \psi\!\left( \frac{x - b}{a} \right) \qquad (2.41)
\]

The wavelet decomposition of a 2-D image can be obtained by performing the filtering consecutively along the horizontal and the vertical directions. The decomposition steps are depicted schematically in Figure 2.22. This leads to a representation with the same number of pixels as the original image. The wavelet image decomposition provides a representation that is easy to interpret. Each subimage contains information of a specific scale and orientation, and therefore the spatial information is not lost.

Figure 2.22: First level of a wavelet decomposition in 3 steps: Low and High pass filtering in horizontal direction, the same in the vertical direction, subsampling. Extracted from [11].
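A minimal sketch of a single-level 2-D wavelet decomposition, assuming the PyWavelets package; the four subimages correspond to the approximation and the horizontal, vertical and diagonal detail bands sketched in Figure 2.22.

```python
import numpy as np
import pywt

img = np.random.rand(128, 128)
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")

# Simple texture features: the mean absolute value (energy) of each detail subband.
features = [np.mean(np.abs(band)) for band in (cH, cV, cD)]
print([b.shape for b in (cA, cH, cV, cD)], features)
```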

Tamura texture features The Tamura features are six textural features corresponding to human visual perception: coarseness, contrast, directionality, line-likeness, regularity and roughness [84]. The authors of [84] considered the first three to be the most important. Coarseness is considered the most fundamental texture feature. It gives information about the size of the texture elements; the higher the coarseness value, the rougher the texture. To compute the coarseness feature, the following steps are taken (a sketch in code is given after equation 2.46):

1. Take averages at every pixel over neighborhoods whose sizes are powers of 2. For a neighborhood of size $2^k \times 2^k$, the averages are calculated as:

\[
A_k(x,y) = \sum_{i=x-2^{k-1}}^{x+2^{k-1}-1} \;\sum_{j=y-2^{k-1}}^{y+2^{k-1}-1} \frac{f(i,j)}{2^{2k}} \qquad (2.42)
\]

2. At each pixel, take differences between pairs of averages corresponding to non-overlapping neighborhoods on opposite sides of the pixel in both the horizontal ($E_{k,h}(x,y)$) and vertical ($E_{k,v}(x,y)$) directions:

\[
E_{k,h}(x,y) = \left| A_k(x + 2^{k-1}, y) - A_k(x - 2^{k-1}, y) \right| \qquad (2.43)
\]

\[
E_{k,v}(x,y) = \left| A_k(x, y + 2^{k-1}) - A_k(x, y - 2^{k-1}) \right| \qquad (2.44)
\]

3. At each pixel, select the size leading to the highest difference value:

\[
S(x,y) = \arg\max_{k=1,\dots,5}\; \max_{d \in \{h,v\}} E_{k,d}(x,y) \qquad (2.45)
\]

4. Finally, take the average of $2^{S(x,y)}$ over the whole image as the coarseness measure:

\[
F_{crs} = \frac{1}{XY} \sum_{x=1}^{X} \sum_{y=1}^{Y} 2^{S(x,y)} \qquad (2.46)
\]
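A simplified sketch of Tamura coarseness following steps 1-4 above, assuming SciPy; border handling and the exact neighborhood alignment are approximate.

```python
import numpy as np
from scipy import ndimage

def coarseness(gray, max_k=5):
    gray = gray.astype(np.float64)
    best_E = np.full(gray.shape, -1.0)
    best_size = np.ones(gray.shape)
    for k in range(1, max_k + 1):
        A = ndimage.uniform_filter(gray, size=2 ** k)       # step 1: local averages
        half = 2 ** (k - 1)
        # Step 2: differences between neighborhoods on opposite sides of each pixel.
        E_h = np.abs(np.roll(A, -half, axis=1) - np.roll(A, half, axis=1))
        E_v = np.abs(np.roll(A, -half, axis=0) - np.roll(A, half, axis=0))
        E = np.maximum(E_h, E_v)
        better = E > best_E                                  # step 3: best scale per pixel
        best_E = np.where(better, E, best_E)
        best_size = np.where(better, 2.0 ** k, best_size)
    return best_size.mean()                                  # step 4: F_crs

gray = np.random.randint(0, 256, size=(128, 128))
print(coarseness(gray))
```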

Contrast can be seen as a measure of picture quality. In more detail, contrast can be considered to be influenced by the following four factors:

1. dynamic range of gray-levels

2. polarization of the distribution of black and white on the gray-level histogram

3. sharpness of edges

4. period of repeating patterns

The contrast of an image can be calculated by:

\[
F_{con} = \frac{\sigma}{\alpha_4^{\,z}}, \qquad \text{with } \alpha_4 = \frac{\mu_4}{\sigma^4} \qquad (2.47)
\]

where $\mu_4$ is the fourth moment about the mean $\mu$, $\sigma^2$ is the variance of the gray values of the image, and $z$ has experimentally been determined to be $1/4$. Directionality is the presence of orientation in the texture. That is, two textures differing only in orientation are considered to have the same directionality [85].

To calculate the directionality, the horizontal and vertical derivatives $\Delta_H$ and $\Delta_V$ are computed by convolving the image $I(x,y)$ with 3x3 derivative operators in the horizontal and vertical directions, respectively,

and then for every position (x,y):

\[
\theta = \frac{\pi}{2} + \tan^{-1}\!\frac{\Delta_V(x,y)}{\Delta_H(x,y)} \qquad (2.48)
\]

is calculated. A 16-bin histogram $H_D$ of these values is built, and the directionality is calculated as the sum of the second moments around each peak, from valley to valley. In the end, the Tamura description yields three values per pixel (coarseness, contrast and directionality); alternatively, a histogram of these values can be computed in order to make the representation more compact.

2.4.1.3 Shape features

A shape feature attempts to quantify the shape of the relevant objects in an image in ways that agree with human intuition [86]. There are many shape features available, used in image processing and computer vision applications, such as aspect ratio, circularity, moments and chain codes. However, shape features are more difficult to apply than color and texture features, due to the fact that they usually require a segmentation procedure to be applied first, which can be highly inaccurate. Nonetheless, shape features have shown to be useful in many domain-specific images, such as man-made objects [6].

Hough transform The Hough transform is a feature extraction technique used in many different applications in image processing and computer vision. The goal of this technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm computing the Hough transform [12]. The classical Hough transform was concerned with the identification of lines in the image, but it has later been extended to identifying positions of arbitrary shapes. The Hough transform is usually applied to a binary image. In order to convert an image to a binary image, significant image feature points need to be determined. One of the alternatives is to apply a threshold to the output of a mask-based edge detection operator such as Sobel or Prewitt. A more sophisticated edge/contour detector like Canny can also be used. After obtaining the significant points, the key idea of the Hough transform can be illustrated by considering the identification of sets of collinear points in an image (a line). A set of image points (x, y) which lie on a straight line can be defined by a relation, f, such that:

\[
f((\hat{m},\hat{c}),(x,y)) = y - \hat{m}x - \hat{c} = 0 \qquad (2.49)
\]

where $m$ is the slope parameter and $c$ is the intercept parameter, which characterize the line. In equation 2.49, the hat symbol is used to indicate that the pair of parameters $(\hat{m},\hat{c})$ is constant, and therefore the pair $(x,y)$ can take many values. An alternative perspective is to consider the opposite: the pair $(m,c)$ can vary and the pair $(\hat{x},\hat{y})$ is constant (a point in the xy space). This function, called g, is defined in equation 2.50:

\[
g((\hat{x},\hat{y}),(m,c)) = \hat{y} - m\hat{x} - c = 0 \qquad (2.50)
\]

Therefore, a point in the xy space corresponds to a line in the mc space and vice-versa. Consequently, a group of points that form a straight line in the xy space is converted to a set of lines in the mc space whose intersection point corresponds to the slope (m) and intercept (c) of the line in the xy space. An example of this behavior can be found in Figure 2.23. The determination of the point of intersection in parameter space is a simple local operation. In order to detect parametrically defined curves other than straight lines, instead of defining the function with the parameters m and c, the parameters will be $\{a_1,\dots,a_n\}$:

\[
f((\hat{a}_1,\dots,\hat{a}_n),(x,y)) = 0 \qquad (2.51)
\]

Figure 2.23: The basis of Hough transform line detection; (A) (x,y) point image space; (B) (m,c) parameter space. Extracted from [12].

Swapping the roles of the parameters and the variables gives:

\[
g((\hat{x},\hat{y}),(a_1,\dots,a_n)) = 0 \qquad (2.52)
\]

This equation maps a point into a hypersurface in the n-dimensional space defined by the parameters. Similarly to the line case, the most likely curve in the image is detected at the point where the largest number of hypersurfaces intersect. The Hough transform can thus be used to detect lines and other shapes, and the detections can be used as features to describe images [12].
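A minimal sketch of line detection with the (probabilistic) Hough transform, assuming OpenCV; the image path and thresholds are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)                      # binarize with an edge detector

lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=30, maxLineGap=5)

# Each detected line is given by its two endpoints; the number of lines or a
# histogram of their orientations can serve as a simple shape-related feature.
if lines is not None:
    angles = [np.arctan2(y2 - y1, x2 - x1) for x1, y1, x2, y2 in lines[:, 0]]
    print(len(lines), np.histogram(angles, bins=8)[0])
```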

Moments An image moment is a particular weighted average (moment) of the image pixels' intensities, or a function of such moments, usually chosen to have some attractive property or interpretation [87]. Image moments can be used to obtain simple properties of an image region, including its area, centroid and orientation. Among the region-based descriptors, moments are very popular. These include invariant moments, Zernike moments and radial Chebyshev moments [86].

The general form of a moment function mpq of order (p + q) of a shape region can be given as:

\[
m_{pq} = \sum_{x}\sum_{y} \Psi_{pq}(x,y)\, f(x,y), \qquad p,q = 0,1,2,\dots \qquad (2.53)
\]

where $\Psi_{pq}$ is known as the moment weighting kernel or basis set and $f(x,y)$ represents the shape region values. Invariant moments (IM) are also called geometric moment invariants. Geometric moments are the simplest of the moment functions, with basis $\Psi_{pq} = x^p y^q$. Despite being complete, they are

also not orthogonal [86]. The geometric moment functions $m_{pq}$ are given by:

\[
m_{pq} = \sum_{x}\sum_{y} x^p y^q f(x,y), \qquad p,q = 0,1,2,\dots \qquad (2.54)
\]

The so-called geometric central moments, which are invariant to translation, are defined by:

\[
\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^p (y - \bar{y})^q f(x,y), \qquad p,q = 0,1,2,\dots \qquad (2.55)
\]

where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$. Finally, the set of 7 invariant moments (IM) is given by:

\[
\begin{aligned}
\phi_1 &= \eta_{20} + \eta_{02} && (2.56)\\
\phi_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 && (2.57)\\
\phi_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 && (2.58)\\
\phi_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 && (2.59)\\
\phi_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\big[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\big] && (2.60)\\
&\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\big[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\big] && (2.61)\\
\phi_6 &= (\eta_{20} - \eta_{02})\big[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\big] && (2.62)\\
&\quad + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) && (2.63)\\
\phi_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\big[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\big] && (2.64)\\
&\quad + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\big[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\big] && (2.65)
\end{aligned}
\]

where $\eta_{pq} = \mu_{pq}/\mu_{00}^{\gamma}$ and $\gamma = 1 + (p+q)/2$ for $p+q = 2,3,\dots$ The IM are simple to compute and invariant to rotation, scaling and translation. However, they suffer from information redundancy and, in the higher-order moments, from noise sensitivity. Zernike moments (ZM) are another type of moments. Unlike the invariant moments, they are orthogonal. The complex Zernike moments are derived from the orthogonal Zernike polynomials [86]:

\[
V_{nm}(x,y) = V_{nm}(\rho\cos\theta,\, \rho\sin\theta) = R_{nm}(\rho)\exp(jm\theta) \qquad (2.66)
\]

where $R_{nm}(\rho)$ is the orthogonal radial polynomial:

\[
R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} (-1)^s \,\frac{(n-s)!}{s!\left(\frac{n-2s+|m|}{2}\right)!\left(\frac{n-2s-|m|}{2}\right)!}\; \rho^{\,n-2s} \qquad (2.67)
\]

where $n = 0,1,2,\dots$; $0 \le |m| \le n$; and $n - |m|$ is even. The Zernike moment of order $n$ with repetition $m$ of the region $f(x,y)$ is given by:

\[
Z_{nm} = \frac{n+1}{\pi} \sum_{\rho}\sum_{\theta} f(\rho\cos\theta,\, \rho\sin\theta)\, R_{nm}(\rho)\, \exp(-jm\theta), \qquad \rho \le 1 \qquad (2.68)
\]

The Zernike moments are rotation invariant and robust to noise, and since the basis is orthogonal, they have minimal information redundancy [86]. However, they are computationally more complex than the other moments.
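A minimal sketch of region-based shape description with the invariant (Hu) moments of equations 2.56-2.65, assuming OpenCV; the binary mask stands in for a segmented shape region.

```python
import cv2
import numpy as np

mask = np.zeros((200, 200), dtype=np.uint8)
cv2.ellipse(mask, (100, 100), (60, 30), 25, 0, 360, 255, -1)   # a filled shape region

m = cv2.moments(mask, binaryImage=True)       # raw, central and normalized moments
hu = cv2.HuMoments(m).flatten()               # the 7 rotation/scale/translation invariants

# Log-scaling is commonly used because the invariants span many orders of magnitude.
print(-np.sign(hu) * np.log10(np.abs(hu) + 1e-30))
```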

2.4.1.4 Local descriptors

Global features have the ability to summarize an entire object or image with a single vector. Consequently, their use in standard classification or retrieval techniques is straightforward. Local features, on the other hand, are computed at multiple points in the image and are consequently more robust to occlusion and clutter. They also do not require a segmentation of the object from the background, unlike many texture features, or representations of the object's boundary (shape features) [88]. One of the disadvantages of dealing with local features is that there may be a different number of feature points in each image, making the comparison of images more complicated. A general framework for computing local descriptors is [89]:

1. Find the interest points.

2. Consider the region around each keypoint.

3. Compute a local descriptor from the region and normalize the feature.

4. Match local descriptors in multiple images.

The first stage of these systems is the search for interest points (keypoints). These points should be chosen so that they can be matched between different images taken under different conditions. Some available algorithms to find keypoints are: Hessian/Harris corner detection, the Laplacian of Gaussian (LoG) detector and the Difference of Gaussians (DoG) detector. Usually, corners are good candidates for keypoints since, in the region around a corner, the image gradient has two or more dominant directions, making corners repeatable and distinctive. Next, SIFT, PCA-SIFT, SURF, GLOH and HOG will be presented in detail.

SIFT SIFT (Scale Invariant Feature Transform) features, presented in [13], are invariant to image scaling and rotation, and partially invariant to changes in illumination and 3D camera viewpoint. SIFT is widely used nowadays due to its robustness. To compute the SIFT features, four major stages are required [13]:

1. Scale-space peak selection: The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.

2. Keypoint localization: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.

3. Orientation assignment: One or more orientations are assigned to each keypoint location based on the local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale and location of each feature, thereby providing invariance to these transformations.

4. Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

An important aspect of this approach is that it generates a large number of features that densely cover the image over the full range of scales and locations. A more detailed explanation of the steps required to compute the SIFT local descriptors follows.

Scale-space peak selection

This step has the goal of identifying candidate locations for the local descriptors. In order for the descriptors to be scale invariant, the locations need to be invariant to scale changes. That can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space. For this, the Gaussian function is chosen as the scale-space kernel. The scale space of an image is defined as a function $L(x,y,\sigma)$, produced by the convolution of the variable-scale Gaussian $G(x,y,\sigma)$ with the image $I(x,y)$:

\[
L(x,y,\sigma) = G(x,y,\sigma) * I(x,y) \qquad (2.69)
\]

where G is the Gaussian function:

\[
G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2+y^2)/2\sigma^2} \qquad (2.70)
\]

The difference-of-Gaussians function convolved with the image, $D(x,y,\sigma)$, can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

\[
D(x,y,\sigma) = \big(G(x,y,k\sigma) - G(x,y,\sigma)\big) * I(x,y) = L(x,y,k\sigma) - L(x,y,\sigma) \qquad (2.71)
\]

An efficient approach to the construction of $D(x,y,\sigma)$ is shown in Figure 2.24. The initial image is incrementally convolved with Gaussians (eq. 2.70) to produce images separated by a constant factor k in scale space, shown in the left column. Each octave of scale space (i.e., a doubling of $\sigma$) is divided into s intervals, thus $k = 2^{1/s}$. Each stack of blurred images must have s + 3 images per octave. Adjacent image scales are subtracted to produce the difference-of-Gaussian images shown on the right, using eq. 2.71. Once a complete octave has been processed, the Gaussian image that has twice the initial value of $\sigma$ is resampled by taking every second pixel in each row and column.

Figure 2.24: For each octave, adjacent Gaussian images are subtracted to produce the difference- of-Gaussian images. After each octave, the images are downsampled with a factor of 2, and the process is repeated. Extracted from [13].

In order to detect the local maxima and minima of $D(x,y,\sigma)$, each sample point is compared to its eight neighbors in the current image and its nine neighbors in each of the scales above and below. This is illustrated in Figure 2.25. The point is selected only if it is larger than all of these neighbors or smaller than all of them.

Figure 2.25: Detection of the maxima and minima of the difference-of-Gaussian images by com- paring the neighbors. Extracted from [13].

Keypoint localization

Once the set of candidate keypoints has been found, the next step is to perform a detailed fit to the nearby data for location, scale and ratio of principal curvatures. This process is required since the first step detects a huge number of points that are not good candidates. First, for each candidate keypoint, interpolation of nearby data is used to accurately determine its position. For this, the interpolated location of the extremum is calculated using the quadratic Taylor expansion of the difference-of-Gaussian scale-space function $D(x,y,\sigma)$, with the candidate keypoint as the origin. This Taylor expansion is given by:

\[
D(\mathbf{x}) = D + \frac{\partial D^{T}}{\partial \mathbf{x}}\mathbf{x} + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\mathbf{x} \qquad (2.72)
\]

The location of the extremum, $\hat{\mathbf{x}}$, is determined by taking the derivative of this function with respect to $\mathbf{x}$ and setting it to zero. If the offset $\hat{\mathbf{x}}$ is larger than 0.5 in any dimension, that is an indication that the extremum lies closer to another candidate keypoint; otherwise, the offset is added to the candidate keypoint to get the interpolated estimate of the location of the extremum. To discard the keypoints with low contrast, the value of the second-order Taylor expansion $D(\hat{\mathbf{x}})$ is computed at the offset $\hat{\mathbf{x}}$. If this value is less than 0.03, the candidate keypoint is discarded; otherwise, it is kept. The difference-of-Gaussian function will also have a strong response along edges, which is not desirable. Such points can be characterized as having a poorly defined peak in the difference-of-Gaussian function, with a large principal curvature across the edge but a small one in the perpendicular direction. The principal curvatures can be obtained from a $2\times 2$ Hessian matrix H, computed at the location and scale of the keypoint:

\[
H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \qquad (2.73)
\]

The derivatives are estimated by taking differences of neighboring sample points. The eigenvalues of H are proportional to the principal curvatures of D. If the ratio $R = \mathrm{Tr}(H)^2/\mathrm{Det}(H)$ for a candidate keypoint is larger than a given threshold, that keypoint is poorly localized and hence rejected. This process suppresses the responses along edges. Figure 2.26 shows an example of the SIFT keypoint selection steps up to the identification of the final keypoints.

Orientation assignment

The next step has the objective of making the method invariant to rotation. For a Gaussian-smoothed image L(x,y) at scale $\sigma$, the gradient magnitude m(x,y) and orientation $\theta(x,y)$ are computed using pixel differences:

\[
m(x,y) = \sqrt{\big(L(x+1,y) - L(x-1,y)\big)^2 + \big(L(x,y+1) - L(x,y-1)\big)^2} \qquad (2.74)
\]

\[
\theta(x,y) = \mathrm{atan2}\big(L(x,y+1) - L(x,y-1),\; L(x+1,y) - L(x-1,y)\big) \qquad (2.75)
\]

The magnitude and direction of the gradient are computed for every pixel in a neighboring region around the keypoint. An orientation histogram with 36 bins is formed, each bin covering 10 degrees, and each sample in the neighboring window contributes to a histogram bin. Once the histogram is filled, the orientation corresponding to the highest peak, together with any local peaks within 80% of the highest peak, is assigned to the keypoint.

Figure 2.26: Stages of keypoint selection. (a) The 233x189 pixel original image. (b) The initial 832 keypoint locations at maxima and minima of the difference-of-Gaussian function. Keypoints are displayed as vectors indicating scale, orientation, and location. (c) After applying a threshold on minimum contrast, 729 keypoints remain. (d) The final 536 keypoints that remain following an additional threshold on the ratio of principal curvatures. Extracted from [13].

Keypoint descriptor The goal of this final step is to create a descriptor vector for each keypoint such that the descriptor is highly distinctive and partially invariant to the remaining variations. First, a set of orientation histograms is created on 4x4 pixel neighborhoods with 8 bins each. The magnitudes are then weighted by a Gaussian function with $\sigma$ equal to one half the width of the descriptor window. The descriptor then becomes a vector of all the values of these histograms, so each vector has 128 elements. This vector is normalized to unit length in order to enhance invariance to affine changes in illumination. Furthermore, to reduce the effects of non-linear illumination, a threshold of 0.2 is applied and the vector is normalized again.

Matching images After the description stage, two images can be matched by comparing the feature vectors obtained with the SIFT descriptors, searching for matches between descriptors using a nearest-neighbor methodology. The nearest neighbors are defined as the keypoints with minimum Euclidean distance to the given descriptor vector. Figure 2.27 presents two images of the same sight with some differences in illumination and partial occlusion. To conclude, the SIFT descriptor is a very robust and distinctive image descriptor, well suited for applications such as object recognition, and it has been applied to many tasks that require identification or matching. A minimal matching sketch is given below.
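A minimal sketch of SIFT keypoint detection and descriptor matching with a nearest-neighbor (Lowe ratio) strategy, assuming OpenCV with SIFT available (cv2.SIFT_create); the image paths are placeholders.

```python
import cv2

img1 = cv2.imread("sight_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("sight_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor in img1, find its two nearest neighbors in img2 and keep
# the match only if the closest is clearly better than the second closest.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]
print(len(kp1), len(kp2), len(good))
```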

PCA-SIFT PCA-SIFT is a variation of the SIFT descriptor, introduced in [90]. The essential difference between the algorithms lies in the description stage: to encode the salient aspects of the image gradient in the feature point's neighborhood, instead of using SIFT's smoothed weighted histograms, PCA-SIFT applies Principal Component Analysis (PCA) to the normalized gradient patch. This creates significantly smaller feature vectors than the standard SIFT feature vector, which can be used with the same matching algorithms. The experiments in [90] demonstrate that the PCA-based local descriptors are more distinctive, more robust to image deformations and more compact than the standard SIFT representation. Next, short background information about PCA is given.

Figure 2.27: Example of SIFT descriptor matches found in two different images of the same sight. Extracted from [14].

PCA Principal Component Analysis (PCA) [91] is a standard technique for dimensionality reduction and has been applied to a broad class of computer vision problems. It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (uncorrelated with) the preceding components. Returning to PCA-SIFT, after the initial steps of keypoint detection and orientation assignment, a 41x41 patch (region of pixels) at the given scale, centered over the keypoint and rotated to align its dominant orientation to a canonical direction, is extracted. The rest of the algorithm can be summarized in the following steps [90]:

1. Pre-compute an eigenspace to express the gradient images of local patches. To build the eigenspace, the authors of [90] ran the first three stages of the SIFT algorithm on a diverse collection of images and collected 21,000 patches. Each was processed as described in the next step to create a 3042-element vector, and PCA was applied to the covariance matrix of these vectors. The matrix consisting of the top n eigenvectors was stored on disk and used as the projection matrix for PCA-SIFT.

2. Given a patch, compute its local image gradient.

The input vector is created by concatenating the horizontal and vertical gradient maps for the 41x41 patch centered at the keypoint. Thus, the input vector has 3042 elements, which are then normalized to minimize the impact of illumination variations.

3. Project the gradient image vector using the eigenspace to derive a compact feature vector. A feature space of n = 20 principal components was used in [90].

This method is effective because projecting the gradient patch onto the low-dimensional space appears to retain the identity-related variation while discarding the distortions induced by other effects [91].
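A minimal NumPy sketch of this projection step is given below; the stand-in training matrix, the patch dimensionality and the number of components n = 20 are illustrative assumptions rather than the exact data of [90]:

```python
import numpy as np

n_components = 20            # dimensionality of the final PCA-SIFT descriptor
rng = np.random.default_rng(0)

# Stand-in for the collection of training gradient patches ([90] used ~21,000):
# each row is a concatenated horizontal + vertical gradient map (3042 values).
train = rng.standard_normal((2000, 3042))

# Build the eigenspace: eigenvectors of the covariance matrix of the training vectors.
mean = train.mean(axis=0)
cov = np.cov(train, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
projection = eigvecs[:, -n_components:]         # keep the top-n eigenvectors

def pca_sift_descriptor(gradient_vector):
    """Project a normalized 3042-element gradient vector onto the eigenspace."""
    return (gradient_vector - mean) @ projection

compact = pca_sift_descriptor(rng.standard_normal(3042))
print(compact.shape)   # (20,)
```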

SURF SURF (Speeded Up Robust Features) is a robust local feature detector, first presented in [15]. It is similar to the SIFT descriptor but is several times faster. It is based on sums of 2D Haar wavelet responses and makes efficient use of integral images. The SURF algorithm is defined by a three-step procedure similar to that of the SIFT features (keypoint detection, orientation assignment and description). A detailed explanation of these steps is given next.

Fast-Hessian Detector In SURF, box filters (squares) are used instead of Gaussian filters. The idea is to rely on integral images for much faster processing. Integral images allow for the fast implementation of box-type convolution filters. The entry of an integral image $I_S(\mathbf{x})$ at a location $\mathbf{x} = (x, y)$ represents the sum of all pixels of the input image I inside the rectangular region formed by the point $\mathbf{x}$ and the origin, as given by the following equation:

$$I_S(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i,j) \qquad (2.76)$$

With the integral image calculated, it takes only four accesses to the integral image (three additions/subtractions) to calculate the sum of the intensities over any upright rectangular area, independently of its size. As for the keypoint detector, SURF uses a blob detector based on the Hessian matrix. Given a point $\mathbf{x} = (x, y)$ in an image I, the Hessian matrix $H(\mathbf{x}, \sigma)$ at scale $\sigma$ is calculated as follows:

$$H(\mathbf{x},\sigma) = \begin{bmatrix} L_{xx}(\mathbf{x},\sigma) & L_{xy}(\mathbf{x},\sigma) \\ L_{xy}(\mathbf{x},\sigma) & L_{yy}(\mathbf{x},\sigma) \end{bmatrix} \qquad (2.77)$$

where $L_{xx}(\mathbf{x},\sigma)$ is the convolution of the Gaussian second-order derivative with the image I at the point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x},\sigma)$ and $L_{yy}(\mathbf{x},\sigma)$. However, instead of Gaussian filters, SURF uses box-filter approximations, which can be computed significantly faster once the integral image is available. The filter approximation can be found in figure 2.28.

The approximations are denoted $D_{xx}$, $D_{xy}$ and $D_{yy}$. Then, the determinant of the approximated matrix is computed:

$$\det(H_{approx}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2 \qquad (2.78)$$

Figure 2.28: Left to right: The (discretised and cropped) Gaussian second order partial derivatives in y-direction and xy-direction, and our approximations thereof using box filters. The grey regions are equal to zero. Extracted from [15].

The approximated determinant of the Hessian matrix represents the blob response of the image at location $\mathbf{x}$. These responses are stored in a blob response map over different scales, and the local maxima are then searched for.
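As a concrete illustration of equation 2.76 and of the constant-time box sums that the Fast-Hessian detector relies on, here is a minimal NumPy sketch; the image and rectangle are placeholders:

```python
import numpy as np

def integral_image(img):
    """I_S(x, y): cumulative sum over rows and columns (equation 2.76)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(int_img, top, left, bottom, right):
    """Sum of intensities inside the inclusive rectangle, using four accesses."""
    total = int_img[bottom, right]
    if top > 0:
        total -= int_img[top - 1, right]
    if left > 0:
        total -= int_img[bottom, left - 1]
    if top > 0 and left > 0:
        total += int_img[top - 1, left - 1]
    return total

img = np.arange(25, dtype=float).reshape(5, 5)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3), img[1:4, 1:4].sum())   # both 108.0
```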

Orientation Assignment In order to be invariant to rotation, a reproducible orientation for each interest point needs to be identified. The Haar wavelet responses in both the x and y directions within a circular neighborhood of radius 6s around the point of interest are computed, where s is the scale at which the point of interest was detected. The obtained responses are weighted by a Gaussian function centered at the point of interest and then plotted as points in a two-dimensional space, with the horizontal response in the abscissa and the vertical response in the ordinate. After that, the dominant orientation is estimated by computing the sum of all responses within a sliding orientation window of size π/3. The two summed responses (horizontal and vertical) yield a local orientation vector. Finally, the longest such vector overall defines the orientation of the point of interest.

Descriptor Components The first step of the description phase consists of constructing a square region (of size 20s) centered around the interest point and oriented along the orientation obtained in the previous section. The interest region is split up into smaller 4 × 4 square sub-regions, and for each one, the Haar wavelet responses at 5 × 5 regularly spaced sample points are computed and weighted with a Gaussian (σ = 3.3s) centered at the interest point. Then, the wavelet responses dx and dy are summed up over each sub-region and form a first set of entries to the feature vector. Hence, each sub-region has a four-dimensional descriptor vector v for its underlying intensity structure:

$$\mathbf{v} = \left(\sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y|\right) \qquad (2.79)$$

The wavelet responses are invariant to a bias in illumination (offset). Also, invariance to contrast (a scale factor) is achieved by turning the descriptor into a unit vector. To illustrate, Figure 2.29 shows the properties of the descriptor for three distinctively different image intensity patterns within a sub-region.

Figure 2.29: The descriptor entries of a sub-region represent the nature of the underlying intensity pattern. Left: in case of a homogeneous region, all values are relatively low. Middle: in presence of frequencies in the x direction, the value of $\sum |d_x|$ is high, but all others remain low. If the intensity is gradually increasing in the x direction, both values $\sum d_x$ and $\sum |d_x|$ are high. Extracted from [15].

To conclude, the authors stated in [15] that the SURF descriptor outperformed the other algorithms available at that date, both in speed and accuracy. The fact that it is less complex and more efficient than the SIFT descriptor makes it a good choice for applications that require real-time computation.
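To make the sub-region computation of equation 2.79 concrete, the following sketch builds the four-dimensional vector v from Haar wavelet responses assumed to be already sampled on the 5 × 5 grid of one sub-region (the responses here are random stand-ins):

```python
import numpy as np

def subregion_vector(dx, dy):
    """Equation 2.79: v = (sum dx, sum dy, sum |dx|, sum |dy|) for one sub-region."""
    return np.array([dx.sum(), dy.sum(), np.abs(dx).sum(), np.abs(dy).sum()])

# Stand-in Haar responses on the 5 x 5 sample points of a single sub-region.
rng = np.random.default_rng(1)
dx = rng.standard_normal((5, 5))
dy = rng.standard_normal((5, 5))

v = subregion_vector(dx, dy)

# Concatenating the vectors of the 4 x 4 sub-regions gives the 64-D SURF
# descriptor, which is finally normalized to unit length for contrast invariance.
descriptor = np.concatenate([v] * 16)          # illustrative: 16 identical sub-regions
descriptor /= np.linalg.norm(descriptor)
print(descriptor.shape)                        # (64,)
```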

GLOH Gradient location and orientation histogram (GLOH) is a descriptor proposed in [92] which extends SIFT by changing the location grid and using PCA to reduce the size. GLOH is an extension of the SIFT descriptor designed to increase its robustness and distinctiveness. The first step is to compute the SIFT descriptor for a log-polar location grid with three bins in the radial direction (the radii set to 6, 11, and 15) and 8 bins in the angular direction, which results in 17 location bins, noting that the central bin is not divided in angular directions. The gradient orientations are quantized into 16 bins, which gives a 272-bin histogram. The size of this descriptor is then reduced with PCA. The covariance matrix for PCA is estimated on 47,000 image patches collected from various images [92]. The 128 largest eigenvectors are finally used for description.

HOG Histogram of Oriented Gradients (HOG) is another method for describing local portions of an image. This method is similar to the previously mentioned scale-invariant feature transform descriptors, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy. The algorithm was first introduced in [93]. The implementation of these descriptors can be achieved by dividing the image into small connected regions, called cells, and then computing a histogram of gradient directions or edge orientations for the pixels within each cell. The combination of these histograms then represents the descriptor. For improved accuracy, the local histograms can be normalized in relation to the contrast by calculating a measure of the intensity across a larger region of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in better invariance to changes in illumination or shadowing.
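A minimal sketch of this cell/block scheme using scikit-image's HOG implementation is shown below (assuming scikit-image is installed; the cell and block sizes are illustrative choices, not prescriptions from [93]):

```python
from skimage import data
from skimage.feature import hog

image = data.astronaut()[:, :, 0]      # sample grayscale image shipped with scikit-image

# Gradient-orientation histograms over 8x8-pixel cells, normalized over 2x2-cell blocks.
features = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
)
print(features.shape)    # one flat descriptor vector for the whole image
```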

2.4.1.5 Global descriptors

As mentioned before, global descriptors compute features from all the pixels of an image. Therefore, there is no need to perform any kind of segmentation or keypoint detection. The only global descriptor that will be introduced in this section is the recent GIST descriptor.

GIST The GIST descriptor was initially proposed in [94]. It was designed for the application of scene recognition. The idea was to develop a low-dimensional representation of the scene, called the spatial envelope, which does not require any form of segmentation. The name GIST reflects the fact that it is a global descriptor, representing the general idea (gist) of the image/scene. The authors proposed a set of perceptual dimensions: naturalness, openness, roughness, expansion and ruggedness. These dimensions represent the dominant spatial structure of a scene. As defined by the authors of [94], a scene exists when there is a larger space between the observer and a fixed point. Therefore, a scene is not an object, but can be considered to be a place where people can move. The features extracted from the images are the discrete Fourier transform (DFT) and the windowed Fourier transform (WFT). The discrete Fourier transform can be computed as follows:

$$I(f_x, f_y) = \sum_{x,y=0}^{N-1} i(x,y)\, h(x,y)\, e^{-j 2\pi (f_x x + f_y y)} = A(f_x, f_y)\, e^{j \Phi(f_x, f_y)} \qquad (2.80)$$

where i(x, y) is the intensity of the image I along the spatial variables, $f_x$ and $f_y$ are the spatial frequency variables and h(x, y) is a circular Hanning window to reduce boundary effects. Also, $I(f_x, f_y)$ is a periodic function whose central period lies in $(f_x, f_y) \in [-0.5, 0.5] \times [-0.5, 0.5]$. The amplitude spectrum of the image is $A(f_x, f_y) = |I(f_x, f_y)|$ and the phase function of the Fourier transform is $\Phi(f_x, f_y)$.

The amplitude spectrum $A(f_x, f_y)$ gives unlocalized information about the image structure, orientation, smoothness, length and width of the contours that compose the scene image. In contrast, the phase function $\Phi(f_x, f_y)$ contains information relative to the form and the position of the image components. Additionally, the energy spectrum $A(f_x, f_y)^2$ is computed as the squared magnitude of the Fourier transform. This feature gives information about the distribution of energy among the different spatial frequencies. Finally, in order to obtain information about the spatial distribution of the spectral information, a windowed Fourier transform is computed, which is given by:

$$I(x, y, f_x, f_y) = \sum_{x',y'=0}^{N-1} i(x', y')\, h_r(x' - x, y' - y)\, e^{-j 2\pi (f_x x' + f_y y')} \qquad (2.81)$$

where $h_r(x', y')$ is a circular Hamming window of radius r. The energy spectrum (spectrogram) of this function provides localized structural information and can be used for a detailed analysis of the scene. After extracting these features from the image, Principal Component Analysis (PCA) is applied in order to reduce the dimensionality of the extracted feature set. After this, images can be compared by comparing the features extracted by this method from each of them (a minimal sketch of this feature extraction appears after the list below). There are some important advantages of using the GIST descriptor in relation to other descriptors:

• It is a holistic approach, requiring no segmentation or grouping operations.
• The resulting feature vector is relatively small.
• For the reason above, it is easy to store, handle and compare.
• It has a small processing time.
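The following rough sketch illustrates the kind of spectral feature used by GIST: the energy spectrum of Hanning-windowed image blocks is computed with NumPy's FFT and reduced with PCA. The block grid, the stand-in image collection and the number of components are illustrative assumptions, not the exact configuration of [94]:

```python
import numpy as np

def block_energy_spectra(img, grid=4):
    """Energy spectrum |FFT|^2 of Hanning-windowed blocks on a grid x grid layout."""
    h, w = img.shape
    bh, bw = h // grid, w // grid
    window = np.outer(np.hanning(bh), np.hanning(bw))
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw] * window
            feats.append(np.abs(np.fft.fft2(block)) ** 2)
    return np.concatenate([f.ravel() for f in feats])

rng = np.random.default_rng(0)
images = rng.random((50, 64, 64))                     # stand-in image collection
X = np.stack([block_energy_spectra(im) for im in images])

# PCA to a small number of components, as in the final GIST step.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
gist_like = Xc @ Vt[:20].T                            # 20-D descriptor per image
print(gist_like.shape)                                # (50, 20)
```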

2.4.2 Mid level image descriptors

A mid-level image descriptor is a continuous or discrete numeric or symbolic measurement obtained after a global analysis of the low-level descriptors, possibly by applying supervised or unsupervised learning methods on a subset of images in a collection or involving the use of external knowledge [7]. Mid-level image descriptors represent an intermediate step between low- and high-level descriptors dedicated to a specific task or application. Next, a review of different approaches to the construction of mid-level image descriptors is presented.

2.4.2.1 Bag-of-Features (BOF)

Bag-of-Features (BoF) approaches are characterized by the use of an orderless collection of image features, lacking any structure or spatial information. Due to its simplicity and performance, the Bag-of-Features approach has become well established in the field of computer vision [16]. The name comes from an analogy with the Bag-of-Words representation used in textual information retrieval (text mining). The idea is to consider that an image is composed of visual words with a given topology. A visual word is a local segment in an image, defined either by a region (image patch or blob) or by a reference point with its neighborhood. Then, the analysis of the visual word occurrences and configurations allows the construction of a frequency histogram of visual words for each image. The general steps for building a Bag-of-Features representation of a collection are [95]:

1. Compute low-level image descriptors: usually, local image descriptors are extracted from the image, such as the SIFT descriptors presented in section 2.4.1.4, preceded by a salient/interest point detection step (such as the Hessian affine region detector).

2. Quantize the descriptors into clusters: once the descriptors of each image are computed (a feature vector), an unsupervised method (clustering) is used to quantize them, keeping only the centroid of each cluster as its representative. These centroids become the visual words that form the vocabulary.

3. Represent each image by a vector of frequencies of visual words in order to do matching, retrieval or similarity analysis.
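These steps can be sketched as follows, assuming OpenCV with SIFT and scikit-learn are available; the vocabulary size and the image list are placeholders, and a real run needs far more pooled descriptors than visual words:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

K = 200                                   # vocabulary size (placeholder)
sift = cv2.SIFT_create()

def local_descriptors(path):
    """Step 1: local SIFT descriptors of one image (128-D each)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return np.empty((0, 128), np.float32)
    _, des = sift.detectAndCompute(img, None)
    return des if des is not None else np.empty((0, 128), np.float32)

paths = ["img_001.jpg", "img_002.jpg"]    # placeholder image list
all_des = [local_descriptors(p) for p in paths]

# Step 2: quantize the pooled descriptors into K visual words (cluster centroids).
vocab = KMeans(n_clusters=K, n_init=10).fit(np.vstack(all_des))

# Step 3: represent each image by its histogram of visual-word frequencies.
def bof_histogram(des):
    words = vocab.predict(des)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / max(hist.sum(), 1.0)

histograms = np.array([bof_histogram(d) for d in all_des])
```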

This process is illustrated in figure 2.30 for the application of content-based image retrieval.

Figure 2.30: Process for Bag-of-Features Image Representation for CBIR. Extracted from [16].

Due to the fact that, usually, a large vocabulary is selected, and most images contain only a few of those visual words, most of the feature vector obtained by the BoF approach is zero. The strong sparsity of these vectors allows for efficient indexing schemes and other performance improvements, as discussed in later sections [16]. To improve the results of the Bag-of-Features model, and in analogy to text retrieval, it is usual to apply a weighting to the components of the feature vector (visual words), rather than using the feature vector directly. The standard weighting is known as 'term frequency-inverse document frequency' (tf-idf) and is computed as follows. Suppose there is a vocabulary of k words; then each document is represented by a k-vector $V_d = (t_1, \dots, t_k)^T$ of weighted word frequencies with components [95]:

$$t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i} \qquad (2.82)$$

where $n_{id}$ is the number of occurrences of word i in document d, $n_d$ is the total number of words in document d, $n_i$ is the number of occurrences of term i in the whole database and

N is the number of documents in the database. With this strategy of term weighting, one can penalize terms found to be too common to be discriminative and emphasize those that are more distinctive. An early work defining the Bag-of-Features image retrieval approach is [95], where the authors presented a project called Video Google. Video Google uses both MSER [96] and Harris-Affine keypoint detectors to extract features, which are represented by SIFT descriptors. The vocabulary is built using K-means clustering. Nearest-neighbor term assignment and the Euclidean distance on tf-idf weighted term vectors are used for similarity scoring.
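The tf-idf weighting of equation 2.82 can be sketched directly from a matrix of visual-word counts (one row per image); the names and the toy counts are illustrative:

```python
import numpy as np

def tfidf(counts):
    """Equation 2.82: t_i = (n_id / n_d) * log(N / n_i) for every image/word pair."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                        # number of documents (images)
    n_d = counts.sum(axis=1, keepdims=True)    # total words in each document
    n_i = counts.sum(axis=0)                   # occurrences of each word in the whole database
    tf = counts / np.maximum(n_d, 1.0)
    idf = np.log(N / np.maximum(n_i, 1.0))
    return tf * idf

counts = np.array([[3, 0, 1],
                   [0, 2, 2],
                   [1, 1, 0]])
print(tfidf(counts))
```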

2.4.2.2 Spatial Pyramid Matching (SPM)

Spatial Pyramid Matching (SPM) [17] is a technique that works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. It is an extension of the Bag-of-Features representation which adds information concerning spatial location. The authors of the method demonstrated significantly improved performance on challenging scene categorization tasks. SPM is a kernel-based recognition method that works by computing rough geometric correspondence on a global scale using an efficient approximation technique adapted from the pyramid matching scheme of [97]. Pyramid match kernels work by placing a sequence of increasingly coarser grids over the feature space and taking a weighted sum of the number of matches that occur at each level of resolution. At any fixed resolution, two points are said to match if they fall into the same cell of the grid. Matches found at finer resolutions are weighted more highly than matches found at coarser resolutions. More specifically, the number of matches $I(H_X^l, H_Y^l)$ (or $I^l$) for each level l is computed as:

$$I(H_X^l, H_Y^l) = \sum_{i=1}^{D} \min\left(H_X^l(i), H_Y^l(i)\right) \qquad (2.83)$$

where $H_X^l(i)$ and $H_Y^l(i)$ are the numbers of points from X and Y that fall into the ith cell of the grid at level l. Also, the grid at level l has $2^l$ cells along each dimension. This is called the histogram intersection function. It can be noted that the number of matches found at level l also includes all the matches found at the finer level l + 1. Therefore, the number of new matches found at level l is given by $I^l - I^{l+1}$ for $l = 0, \dots, L-1$. Then, in order to penalize matches found in larger cells, due to the fact that they involve increasingly dissimilar features, a weight $\frac{1}{2^{L-l}}$ is associated with the matches found at level l. Therefore, the final expression for the pyramid match kernel is:

$$\kappa^L(X, Y) = I^L + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left(I^l - I^{l+1}\right) = \frac{1}{2^L} I^0 + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} I^l \qquad (2.84)$$

The next step is to quantize all feature vectors into M discrete types. Each channel m gives two sets of two-dimensional vectors, $X_m$ and $Y_m$, representing the coordinates of features of type m found in the respective images. The final kernel is then the sum of the separate channel kernels:

$$K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m) \qquad (2.85)$$

This approach reduces to the Bag-of-Features when L = 0. For L levels and M channels, the resulting vector has a dimensionality of:

$$M \sum_{l=0}^{L} 4^l = \frac{M \left(4^{L+1} - 1\right)}{3} \qquad (2.86)$$

The authors of [17] noted no performance increase beyond M = 200 and L = 2.

Figure 2.31: Example of a construction of a three-level pyramid. Extracted from [17].

An illustration of the SPM can be found in figure 2.31. In this figure, the image has three classes of descriptors, indicated by the circles, the crosses and the diamonds. At level 0, the image is not split and the histogram has 3 bins (one per class). Next, at level 1, the image is subdivided into 4 sub-images and, for each of them, the number of occurrences of the three different classes of features is counted, making 4 × 3 histogram bins. Finally, at level 2, the image is divided into 4 × 4 sub-regions and the histograms are again computed. Each spatial histogram is also weighted according to equation 2.84.
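A minimal sketch of the spatial pyramid construction for one image is given below, assuming each local feature has already been quantized to one of M visual words and has normalized (x, y) coordinates in [0, 1); the level weights follow equation 2.84, and the data are random stand-ins:

```python
import numpy as np

def spatial_pyramid(xy, words, M, L=2):
    """Concatenate weighted word histograms of every cell at levels 0..L."""
    feats = []
    for level in range(L + 1):
        cells = 2 ** level                              # cells per dimension at this level
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cx == i) & (cy == j)
                hist = np.bincount(words[in_cell], minlength=M)
                feats.append(weight * hist)
    return np.concatenate(feats)                         # length M*(4**(L+1)-1)/3

rng = np.random.default_rng(0)
xy = rng.random((500, 2))                # feature positions, normalized to [0, 1)
words = rng.integers(0, 200, 500)        # visual-word index of each feature
print(spatial_pyramid(xy, words, M=200).shape)           # (4200,)
```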

2.4.2.3 Comparison and similarity with text processing

Text retrieval systems generally employ a number of standard steps in the processes of indexing and searching a text collection. First, the text document is separated into words (that form the vocabulary). Next, the words are represented by their stems; to give an example, the words "talk", "talking" and "talks" would be represented by the stem "talk". Additionally, to avoid very frequent but non-discriminative words like "or" or "the", a stop list is used. Then, each document is represented by a vector, with its components being the frequencies of occurrence of the words in the document. The search is performed by comparing the documents' vectors [7].

The Bag-of-Features approach for image description introduced in section 2.4.2.1, which uses the concept of visual words, was inspired by the word and vocabulary concepts of text processing research [95]. The concept of an inverted file is also used in the Bag-of-Features method. An inverted file is structured like an ideal book index: it has an entry for each word in the corpus followed by a list of all documents (and positions in those documents) in which the word occurs. Therefore, for image retrieval applications, the inverted file can also be used for the visual words of each image.

Figure 2.32: An analogy between image and text document in semantic granularity. Extracted from [18].

Another analogy with text processing is the semantic granularity of an image and of a text document. Figure 2.32 shows that analogy. It can be seen that pixels are equated to letters, patches to words, patch pairs to phrases, objects to sentences and, finally, the image to a text document. In [18], visual phrases are constructed for an image retrieval application. The algorithm for constructing the visual phrases is:

1. Count the occurrences of each visual word to determine the frequent visual words.

2. Count the occurrences of adjacent patch pairs formed by frequent visual words.

3. Get the visual phrases by selecting the adjacent patch pairs whose occurrence counts exceed a given threshold.
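The three steps above can be sketched as follows, assuming each image is given as the visual word of every patch plus the index pairs of spatially adjacent patches (both are stand-in data here), and with illustrative thresholds:

```python
from collections import Counter

# Stand-in data: for each image, the visual word of every patch and the index
# pairs of spatially adjacent patches (adjacency is assumed to be precomputed).
images = [
    {"words": [3, 7, 3, 9], "adjacent": [(0, 1), (1, 2), (2, 3)]},
    {"words": [3, 7, 8, 3], "adjacent": [(0, 1), (2, 3)]},
]
word_threshold, pair_threshold = 2, 2      # illustrative thresholds

# Step 1: find the frequent visual words.
word_counts = Counter(w for img in images for w in img["words"])
frequent = {w for w, c in word_counts.items() if c >= word_threshold}

# Step 2: count adjacent pairs formed by frequent words (order-independent).
pair_counts = Counter()
for img in images:
    for i, j in img["adjacent"]:
        a, b = img["words"][i], img["words"][j]
        if a in frequent and b in frequent:
            pair_counts[tuple(sorted((a, b)))] += 1

# Step 3: visual phrases = pairs whose count exceeds the threshold.
visual_phrases = [p for p, c in pair_counts.items() if c >= pair_threshold]
print(visual_phrases)    # [(3, 7)]
```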

These analogies with text processing and text mining have proved very useful, since text processing and retrieval are currently more mature than image retrieval due to the simpler nature of text. Visual words constituted a major step in image annotation, as they help reduce the semantic gap by creating mid-level descriptors that can capture part of the semantic meaning of an image.

2.5 Performance Evaluation

The performance evaluation of an image mining system (image clustering, image classification or image retrieval) can be achieved using different types of methodologies. One method is to use labeled data (supervised evaluation), where images were previously annotated. Here, the external indexes for clustering validity (sec. 2.2.3.1) can be used to compare different algorithms and to assess their performance. The other type uses unsupervised techniques, such as estimating statistical parameters that represent the quality of the clusters obtained (please refer to section 2.2.3). The problem with this last method for evaluating the performance of an image mining algorithm is that there is no concrete relationship between the values of the descriptors obtained from the image and the true concept represented by the image (as seen by humans). This is related to the concept of the semantic gap, described in section 2.3.1. There is also another method that has been used to evaluate image clustering, which is to estimate the relevance of the cluster results as perceived by users; this is therefore a subjective measure. This approach has been used, for instance, in [98], where a study involving 20 users was conducted on a subset of the derived image clusters in order to assess the perceived relevance of the produced clusters. Next, some image databases are presented. All these databases have been used for research in the area of image mining.

2.5.1 Image databases

There are many different image databases available. In CBIR, the most popular image database for evaluation and comparison purposes is the Corel image database [99]. The Corel image database contains a large number of images of various contents such as animals, sports and people. These images are divided into 100 different categories. Figure 2.33 presents some example images from the Corel dataset. Although this dataset is one of the most widely used databases for CBIR, it has some problems [75]. First, some images with very similar content are divided into different groups. Also, some category labels are very abstract, and the images within the same category can vary largely in content. Hence, it is usual for researchers to select a subset of the original database (Corel 5-K, Corel 10-K, etc.) or to make some changes in the labels of some categories. Additionally, there is the issue of image quality (images have approximately 200 pixels in size), which is particularly important for modern image feature representations. For example, it is not recommended to use difference-of-Gaussian based SIFT features, because the approach is not suited to small images due to the sparsity of extracted interest regions [100].

Figure 2.33: Examples of images from the Corel dataset.

Another image database available and used for many purposes is the Kodak database of consumer images. It has a total of 1343 labeled images collected by the Kodak company. The pictures have a large variety of contents but mostly depict typical family and vacation scenes [101].

Columbia Object Image Library (Coil-100) is a database for object detection [19]. It consists of 7,200 color images of 100 objects (72 images per object). Since each category contains the exact same object, only shifted or rotated, this database would not be suited for image clustering or CBIR. Nevertheless, it is a very interesting image dataset that could potentially be used for testing different algorithms for image description. Other object datasets are Caltech-101 [102] and ETH-80 [103].

Figure 2.34: The 100 objects (classes) from the Coil-100 image database. Extracted from [19].

In 2008, a new image dataset was released, the MIR Flickr dataset [20]. It is composed of 25,000 high-quality annotated images taken from the Flickr website, which are freely available for research purposes. The images represent a real community of users both in the image content and in the image tags. In Flickr, users share and search pictures based on image tags provided by them. The average number of tags per image collected for the database was 8.94. The majority of tags are in English and were subdivided into various categories to allow for recognition tasks. Figure 2.35 shows some example images from the database together with the Creative Commons attribution license icons and the creators of each image. Another interesting image database is ImageNet, which was released in 2009 [21]. It is organized according to the WordNet hierarchy [104]. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". The aim of ImageNet is to provide on average 1000 images to illustrate each synset. Images of each concept are human-annotated. As of the latest update (April 30, 2010), ImageNet had a total of 14,197,122 images covering 21,841 synsets. There are 27 high-level concepts that represent the first level of the WordNet tree. These include concepts such as animal, food, plant, sport, vehicle and person. Figure 2.36 shows some examples of images belonging to two different subtree groups (mammals and vehicles); the concepts (synsets) become more specific from left to right, and for each synset 9 images are presented. Finally, some works in image mining use a combination of different benchmark image databases [105, 106].

Figure 2.35: Examples of images from the MIR Flickr dataset. Also listed are Creative Common attribution license icons and the creators of the images. Extracted from [20].

Figure 2.36: A snapshot of two root-to-leaf branches of ImageNet: the top row is from the mammal subtree; the bottom row is from the vehicle subtree. For each synset, 9 randomly sampled images are presented. Extracted from [21].

These databases do not need to represent an extensive amount of diversity, due to the fact that they will be sampled and combined, which potentially creates a more diverse and robust dataset. Examples of datasets that have been used for this purpose are the MNIST database [107], the USPS database [108], USF HumanID [109] and the UMIST face image database [110].

2.6 Visualization of image collections

After the clustering process is performed and the images are grouped according to their similarity in content, there is still probably a large number of images in each group. For this reason, there is a need to develop a scheme for image visualization that allows the user of the application to browse the clusters obtained. One alternative is to present, for each cluster, a few representative images, as done in [111]. The problem with this method is how to pick these representative images, since they need to represent well the content and similarity found by the algorithm in that cluster. Another approach is to represent each cluster by one image, or by a set of modes present in that cluster. These images would be the combination of all the images in the cluster (or of a few, in the case of the modes). In [22], a technique called weighted image averaging was used. The proposed system, AverageExplorer, is an interactive system that was developed recently for this purpose. The idea of AverageExplorer is to summarize the visual data by weighted averages of the image collection, with the weights reflecting the user-indicated image and feature importance. It is an interactive system since the user can "edit" the average image by using various types of brushes and warps provided, similarly to image editing software. The input of the system is an image collection representing the same semantic concept (for instance, "cats", "shoes", "Paris", etc.), but with a variety of appearances (images retrieved using a search engine). The output is a set of average images that depict different modes in the data. Figure 2.37 presents the results of AverageExplorer applied to the query "Kids with Santa" after editing and extraction of informative modes.

Figure 2.37: Results obtained by AverageExplorer: the many informative modes and examples of images and patches within each mode. Extracted from [22].

Given a set of N images $\{I_1, \dots, I_N\}$, the simple weighted average ($I_{avg}$) of the images is given by:

$$I_{avg} = \frac{\sum_{i=1}^{N} s_i \cdot I_i}{\sum_{i=1}^{N} s_i} \qquad (2.87)$$

where the weights $s_i$ are initialized as $s_i = 1/N$ for all i. Once the editing begins, the weights are updated to reflect how well each image matches the user's edits:

$$s_i = \sum_{t=1}^{T} \text{match}(w_{user}^t, I_i) \qquad (2.88)$$

where $s_i$ are the weights updated after T edits, $w_{user}^t$ represents the user edit at time t, and match(·) returns how similar an image $I_i$ is to the user edit $w_{user}^t$. Three brush tools are provided [22] (a minimal sketch of the weighted averaging itself is given after the list below):

• Coloring brush tool: the coloring brush allows the user to paint on the average image by adding color strokes. This changes the weights by assigning a higher weight to the images with the selected color in that spatial area.

• Sketching brush tool: the sketching brush tool allows the user to add line strokes to the average image. It is useful for adding fine details (for instance, adding glasses when exploring faces).

• Explorer brush tool: the explorer brush tool is the most important tool of the AverageExplorer system. The goal is to allow the user to find hidden information in the image collection. The main idea is to collect local patches situated in the same spatial position across all database images, and cluster them into a set of visually informative modes.
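The weighted averaging of equations 2.87-2.88 can be sketched as follows, with a crude stand-in match() score in place of the brush-specific matching used in [22]; the image stack and edit are illustrative:

```python
import numpy as np

def weighted_average(images, weights):
    """Equation 2.87: pixel-wise weighted mean of an aligned image stack."""
    w = np.asarray(weights, dtype=float).reshape(-1, 1, 1)
    return (w * images).sum(axis=0) / w.sum()

def update_weights(images, user_edits, match):
    """Equation 2.88: accumulate, per image, the match score of every edit."""
    return np.array([sum(match(edit, img) for edit in user_edits) for img in images])

rng = np.random.default_rng(0)
images = rng.random((10, 32, 32))                    # stand-in aligned grayscale stack
weights = np.full(10, 1.0 / 10)                      # s_i initialized to 1/N
avg0 = weighted_average(images, weights)

# Stand-in edit: a bright stroke in the top-left corner; match() scores each
# image by how bright it is in that region (a crude proxy for the real brushes).
edits = [np.s_[0:8, 0:8]]
match = lambda region, img: img[region].mean()
avg1 = weighted_average(images, update_weights(images, edits, match))
print(avg0.shape, avg1.shape)
```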

Assuming that the images are taken from the Web, using a search engine, for example, even if they share the same semantic concept, that does not mean that they will be spatially aligned. Therefore, if no alignment is performed, the resulting averages will be blurry. AverageExplorer provides, during the interactive editing, an image warping algorithm which transforms the images, leading to image stacks with increasing alignment. Figure 2.38 shows three image collections of "cats", "bikes" and "horses". On the left, the initial averages are presented; they are very blurry and almost no pattern can be distinguished. Next, these averages are edited four times and the resulting images are displayed ("final average"). It can be seen that these images are very sharp and the objects are clearly represented. To the right, the versions of the final averages with no alignment are shown. Compared to the final images with alignment, it is clear why the alignment step is needed in this process, since the quality is much higher. To conclude, although the AverageExplorer system can successfully provide an effective visualization of an image collection, it requires user interactivity, which means that this is not an automatic process. Still, these ideas could potentially be used to create an automatic version, if needed. Other image visualization systems are also presented by other authors; for example, [23] proposes a solution for the problem of scene summarization, which examines the distribution of images in the collection to select a set of canonical views to form the scene summary, using clustering techniques on visual features. First, local image descriptors are obtained for all the images. Then, the features are matched between every pair of images and an incidence matrix is computed using the pairs of images that match in the set.

Figure 2.38: Examples of average images edited and aligned compared to the unaligned versions, using the AverageExplorer interactive visualization tool. Extracted from [22].

In order to select views (images) as a summary for the scene, the concept of image likelihood is used, which states that an image should be included in the summary if it is similar to many other images in the input set. The similarity between images is computed as [23]:

$$\text{sim}(V_i, V_j) = \frac{|V_i \cap V_j|}{\sqrt{|V_i|\,|V_j|}} \qquad (2.89)$$

Equation 2.89 measures the cosine of the angle between the normalized features of the two images. If the views have the same number of features, it can be defined as:

$$\text{sim}(V_i, V_j) = V_i \cdot V_j \qquad (2.90)$$

And therefore, the likelihood of a view is simply:

$$\text{lik}(V) = \sum_{V_i \in n} V_i \cdot V \qquad (2.91)$$

where n is the set of target image views. The algorithm attempts to maximize the following objective function:

$$Q(C) = \sum_{V_i \in n} \left(V_i \cdot C_{c(i)}\right) - \alpha\, |C| \qquad (2.92)$$

where $C_{c(i)}$ is a canonical view. This objective function promotes the orthogonality of the canonical views. Then, a greedy algorithm selects the final set of canonical views. This system attempts to provide a much better image browsing experience than what is available nowadays, for example, for the image collections of the website Flickr. The authors used a set of 500,000 photos of the city of Rome. Most of the photos contained user tags; however, some tags were missing, some were misleading and others were uninformative. Therefore, they presented a browsing system that can work either with no tags or, using an algorithm, by incorporating the tag information to produce better results. An example of the results obtained by this system is shown in Figure 2.39: given a set of images tagged as "Vatican", the algorithm was able to select 10 canonical views to serve as a summary.

Figure 2.39: A summary of 10 images extracted from a set of 2000 images of the Vatican computed by our algorithm. Extracted from [23].
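A rough sketch of a greedy selection driven by equation 2.92 is given below: images are represented by normalized feature vectors, and views are added as long as they increase Q(C). The data, the α value and the stopping rule are illustrative assumptions rather than the exact procedure of [23]:

```python
import numpy as np

def quality(V, selected, alpha):
    """Equation 2.92: sum of similarities to the best canonical view, minus alpha*|C|."""
    if not selected:
        return 0.0
    sims = V @ V[selected].T               # V_i . C for every candidate canonical view
    return sims.max(axis=1).sum() - alpha * len(selected)

def greedy_canonical_views(V, alpha=2.0):
    selected = []
    while True:
        current = quality(V, selected, alpha)
        best_gain, best_idx = 0.0, None
        for i in range(len(V)):
            if i in selected:
                continue
            gain = quality(V, selected + [i], alpha) - current
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:               # no view improves the objective anymore
            return selected
        selected.append(best_idx)

rng = np.random.default_rng(0)
V = rng.random((100, 64))
V /= np.linalg.norm(V, axis=1, keepdims=True)        # normalized view features
print(greedy_canonical_views(V))                      # indices of the summary views
```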

2.7 Discussion

In sum, social networks (sec. 2.1) have been the source of much research in the field of data mining. In terms of the nature of the data analyzed, research has focused mostly on the textual content shared on the Web. Still, researchers have also been exploring in depth the visual information contained in huge image collections, mainly in the field of CBIR (sec. 2.3). The images are characterized by image descriptors (sec. 2.4) obtained directly from the data (low-level) or by further processing (mid-level). However, there is also the issue of capturing the essence or semantic meaning of the image, which has been a constant challenge. The next step is to reach, for image mining, the same level of maturity found in text mining (sec. 2.4.2.3). In order to assess the performance of an image mining system (in this case, image clustering), many image databases (sec. 2.5.1) have been created and some of them are widely used as benchmarks to compare algorithms. Also, a number of different measures can be applied if a labeled database is not available. Finally, for presenting the final results of the clustering analysis to the users, visualization tools (sec. 2.6) are needed; these can rely on presenting aligned averages of the images in each cluster or some representative examples.

Chapter 3

Related Work

In this section, some relevant and recent work closely related to the topic of this thesis will be presented and discussed. In general, image clustering is the focus of this work and, therefore, the studies mentioned will be mainly in that area of research. Clustering has been used extensively for many types of data, from documents to images. Image clustering has been used for many applications, such as improving the performance of CBIR systems [112], image annotation and indexing [113] and image summarization [23]. The most used algorithm for image clustering has been K-means, due to its simplicity. In terms of recent works in this area of Web image clustering, it is important to mention the predecessor of this thesis, [111]. In that study, the problem was the same (find patterns and clusters in images shared via Twitter). As for the approach, the author used local image descriptors, more specifically the SIFT descriptor (sec. 2.4.1.4), to extract features from the images. Then, using a Bag-of-Features technique (sec. 2.4.2.1), a visual vocabulary of 1000 words was created using the K-means clustering algorithm (sec. 2.2.1.3) on the descriptors of a random sub-sample of the total number of images in the database. After that, a frequency histogram was created for each image in relation to the occurrence of the visual words. Next, a distance matrix was computed for all the images in the database using the Euclidean distance (sec. 2.2.1.1). Finally, the clusters were obtained using the DBSCAN clustering algorithm (sec. 2.2.1.6). Each cluster obtained was then represented, in the visualization tool implemented, by a matrix of 9 randomly chosen images. Another recent work, [113], also used the BoF approach for image clustering, with the final purpose of image indexing. The difference is that the authors proposed two new algorithms, SAIL (Summation-based Incremental Learning) and V-SAIL (Summation-based Incremental Learning with Variable Neighborhood Search), which are extensions of the Info-Kmeans algorithm [114], for clustering high-dimensional data in an attempt to deal with the high sparsity of image data (the features extracted from the images). Info-Kmeans belongs to a type of clustering called information-theoretic clustering, in which the loss of mutual information due to the partitioning is minimized [113]. More specifically, Info-Kmeans uses a K-means clustering algorithm with a KL-divergence [115] distance measure. One of the problems with this algorithm is the zero-feature dilemma, which is caused by the number of features in the data that have a zero

value (e.g. the histograms of the BoF method). Apart from other types of data, the authors also presented an experimental design to test the algorithms on the application of image clustering. For that, an image dataset called Oxford 5K was used, which contains 11 classes of images retrieved from the social network Flickr. In terms of feature extraction from the image data, the authors used the BoF approach with 1 million visual words. Another interesting procedure implemented in this work, prior to the image clustering, was noise removal, using an algorithm named CosMiner [116]. This noise refers to images that belong to a given class but do not present visual characteristics that could distinguish them and assign them to the corresponding cluster, and which are therefore seen as noise. The results with this challenging dataset show that the proposed algorithms improve the performance of the Info-Kmeans algorithm for highly sparse data. In [117], the authors proposed a different approach to the usually applied BoF model. The idea was to integrate the processes of image clustering and codebook learning, which are traditionally performed separately. The reason behind this idea is that the two processes are related and, therefore, the correlation between them should not be ignored. The relationship is that the codebook representation is the input for the clustering algorithm and thus influences the clustering results, and, at the same time, the clusters obtained by the clustering algorithm can be helpful to guide the codebook learning by serving as supervised information. For this reason, the authors propose two relevant contributions, which are a Double Layer Gaussian Mixture Model (DLGMM) to integrate clustering and codebook learning and a Spatially Coherent Double Layer Gaussian Mixture Model (SC-DLGMM), which uses a Markov Random Field (MRF) [118] to incorporate the idea of spatial coherence between neighboring patches in the process of codebook training. This last contribution attempts to solve the problem of the BoF approach of ignoring the spatial distribution of the local patches, by building a model in which patches near each other are more likely to be assigned to the same visual word. The experiments conducted in [117] demonstrate that integrating image clustering and codebook learning can improve the codebook and that incorporating spatial coherence may produce better final results. In [119], the authors attempted to cluster images in order to detect landmarks and events in large image collections. This is very useful since, usually, images shared by users online capture experiences related to landmarks (e.g. touristic sites) or events (e.g. a concert). The clustering is performed using the notion of community detection in similarity graphs. Both visual and tag information is used to build the similarity graph. Then, the authors classify each cluster as a landmark or event using temporal, spatial, social and tag characteristics of the image cluster. For the clustering process, the different types of input (visual and tag) generate separate image similarity graphs, which are then combined in order to create a hybrid graph. The visual features extracted from the images are the SIFT descriptors (sec. 2.4.1.4), followed by a BoF model (sec. 2.4.2.1) with a vocabulary of 500 words and K-means clustering.
After that, similarity measures are computed for the image collection's features and the 20 most similar images of each image are considered its neighbors on the graph, provided they pass the chosen similarity threshold. On the other hand, for the tag similarity graph, an inverted table of tags and images is obtained. The edges in the graph are weighted by the number of times a pair of images is found together in the tag list. Additionally, tags which have many images associated with them are excluded to avoid suspicious edges and to reduce the computational time required. After that, a hybrid graph is computed and community-detection-based clustering is performed. For the classification of images into the two large categories (landmarks or events), a supervised learning approach is used based on acquired labeled data (training examples). Finally, spatial information provided by some tagged images assists in predicting the labels of the images classified as landmarks. The evaluation experiments were done using a dataset of 207,750 images collected from Flickr using a geoquery (query on the location) of the city of Barcelona, which demonstrated the efficiency and performance of the proposed techniques.

Due to the fact that the dimensionality of the feature vectors produced by the image description procedure is usually very high, there is the problem of the curse of dimensionality [120], which states that the performance of a learning algorithm will tend to decrease as the number of dimensions grows, if the number of samples remains unchanged. This is caused by the sparsity of the data in a high-dimensional space. For this reason, many attempts have been made to develop more efficient algorithms for these types of data. One of these approaches is to perform dimensionality reduction, such as Principal Component Analysis, PCA (sec. 2.4.1.4), or Linear Discriminant Analysis, LDA [121], on the original feature set in order to project the dataset into a space with fewer features. LDA is a supervised method which attempts to reduce the dimensionality of the feature set while preserving as much of the class discriminatory information as possible. More specifically, for image clustering applications, a simple solution has been to project the high-dimensional data into a low-dimensional subspace using PCA (as a pre-processing step) and then apply a clustering algorithm like K-means. Recently, LDA has been shown to be very useful and to produce better results than PCA. Usually, the K-means algorithm is first used to obtain the clusters, which are labeled; after that, LDA is performed to obtain the most discriminative subspace. Examples of works in those directions are [106], [122] and [123].
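A minimal scikit-learn sketch of this common pipeline is shown below (PCA as pre-processing, K-means for the clusters, then LDA fitted on the cluster labels to obtain a discriminative subspace); the stand-in data and the parameter choices are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.random((500, 1000))                    # stand-in high-dimensional image features

# Dimensionality reduction as a pre-processing step.
X_pca = PCA(n_components=50).fit_transform(X)

# Cluster in the reduced space.
labels = KMeans(n_clusters=10, n_init=10).fit_predict(X_pca)

# Use the cluster labels as supervision for LDA to find a discriminative subspace.
lda = LinearDiscriminantAnalysis(n_components=9)   # at most n_clusters - 1 components
X_lda = lda.fit_transform(X_pca, labels)
print(X_lda.shape)                                 # (500, 9)
```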

Another approach to describing an image collection, similar to a similarity graph, is presented in [124], where a type of graph named Image Webs is introduced. These graphs are different from the similarity graphs discussed so far, since each node no longer represents an image but a region of pixels in an image. Therefore, what is connected are distinctive regions across the images of the collection. The process of extracting these regions is called Affine Cosegmentation, which implements a local-feature matching technique between a pair of images at a time. The result of these processes is a graph which connects components of images and could potentially be useful for exploring an image collection and for doing image clustering. Another interesting work that shares similar ideas is [125]. In this paper, the authors propose an automatic supervised system that seeks to find visual elements (sub-regions of the image) that are, at the same time, frequently occurring within the class and informative (able to distinguish between different classes). The specific application of this work was to geographically classify images. The idea could potentially be useful for this thesis, since it could be an alternative to computing local image descriptors, which are harder to understand and to evaluate by the human eye.

3.1 Discussion

To conclude, although image clustering has been studied extensively in the image processing and computer vision fields, it remains a topic of constant improvement, since a clear and definitive solution to this problem has not yet been presented. Nonetheless, there are many works describing different algorithms; however, most of them concentrate on a narrow application or a small dataset (with low variability). Also, it is clear that the most popular approach is to use local image features, then apply a Bag-of-Features method to create the features for a clustering algorithm such as K-means, which produces the final clusters. Still, there are also other approaches, using techniques such as graph theory to represent the similarity between images in the collection, or using dimensionality reduction techniques in order to try to overcome the problem of the high dimensionality of the feature vectors produced. In the end, the goal is to provide a summarization of the huge amount of information contained in the images shared on the Web.

Chapter 4

Work plan

4.1 Methodology

This dissertation will require numerous steps in order to achieve the final goals, which are to produce an algorithm for clustering general images from Twitter and to create a strategy for the visualization of those clusters. The software for extracting tweets from the Internet is already implemented, but it needs to be run to obtain the desired images for testing. Therefore, first, one needs to learn how to use the Twitter crawler software (TwitterEcho). After that, the tweets that share an image (the ones which have an image link) need to be filtered. Next, the visual model needs to be designed and implemented. This module will, most likely, consist of the extraction of image features from the images. First, low-level features will be obtained, and then mid-level features. This will allow the algorithm to distinguish not only low-level visual concepts but also concepts closer to the semantic meaning of the images. Then, the image clustering algorithm will be performed. Also, TweeProfiles will be used to obtain the clusters (given a distance matrix computed using the image features). The final module will be the visualization module, which requires the implementation of an image summarization strategy that is able to produce representative images for each cluster obtained in the previous module. This visualization tool should allow the user of the system to navigate and explore the content of the clusters. During all the stages of implementation, extensive testing and validation procedures will be performed.

4.2 Development Tools

In order to solve the problem presented in detail in sec. 1, some tools and available software need to be used. First, as mentioned before, TwitterEcho will be used to obtain the tweets; TweeProfiles will also be used. For the implementation of the solution, a computer vision library called OpenCV [126] will be used. This library is very useful since it contains many methods and frameworks for computer vision applications. The programming language used will be Python, which is very popular due to its simplicity and is compatible with the OpenCV library. As for the visualization tool, it is not

clear yet what platform will be used, but some possible candidates are Java (for a local application) and JavaScript (for a Web application).

4.3 Task description and planning

As illustrated by the Gantt diagram in figure 4.1, the work will be divided into several tasks, as follows:

1. System design: in the first 2 weeks of the dissertation, a plan of the system will be designed. During this time, all the steps of the system will be enumerated, including obtaining the images (from Twitter) or from public datasets, the choice of the algorithms and methods that seemed more promising during the state-of-the-art search, and the structure of these algorithms. This initial task will be useful to assist the later implementation phase.

2. Implementation of the initial version of the system: in this stage, which should last for approximately 3 weeks, a prototype system will be realized and, as mentioned before, several strategies will be implemented. The implementation will already give an idea of the most efficient and effective algorithms, even before the testing phase. The final output of the system should be the image clusters obtained.

3. Initial tests of the system: here, over 2 weeks, tests of the initial system will be carried out. These tests will be conducted by evaluating the clusters using internal indexes (sec. 2.2.3) as well as other types of validation such as user feedback, which could be obtained using a questionnaire. Additionally, small public databases could also be tested in this stage.

4. Implementation of the final version of the system: after the initial version has been implemented and tested, the results obtained will be analyzed and a second and final version will be designed. This version will attempt to solve the issues found before and, if possible, the goal will be to make a small contribution to the algorithms available to date. The expected time for completing this task is 5 weeks.

5. Validation of the solution: After the final solution is achieved, a rigorous testing phase of 2 weeks will begin. For this, larger databases obtained directly from Twitter will be used, as well as larger public databases. Also, using popular benchmarks, comparisons will be made with available systems (sec. 3). This will allow for a better evaluation and understanding of the performance of the system. Finally, important conclusions will be drawn from the results.

6. Writing of the scientific paper: during the final 3 weeks of the dissertation, a paper will be written in order to summarize the algorithm implemented, the contributions made and the results obtained. If successfully accomplished, this paper will be sent to a conference for publishing. 4.3 Task description and planning 73

7. Writing of the dissertation: this is the final stage of this work, and will be done at the same time as the writing of the paper, since the contents should be very similar.

Figure 4.1: Gantt diagram of the work plan.

References

[1] Tiago Cunha, Carlos Soares, and Eduarda Mendes Rodrigues. Tweeprofiles: detection of spatio-temporal patterns on twitter. In Advanced Data Mining and Applications, pages 123–136. Springer, 2014.

[2] Saed Sayad. Hierarchical clustering. http://www.saedsayad.com/clustering_ hierarchical.htm. Accessed: 2015-01-25.

[3] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.

[4] Babita Singh and Waseem Ahmad. Content based image retrieval: A review paper. 2014.

[5] Vasileios Mezaris, Ioannis Kompatsiaris, and Michael G Strintzis. An ontology approach to object-based image retrieval. In Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, volume 2, pages II–511. IEEE, 2003.

[6] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 40(1):262–282, 2007.

[7] Jean Martinet and Ismail Elsayad. Mid-level image descriptors. In Intelligent Multimedia Databases and Information Retrieval. 2012.

[8] Konstantinos N Plataniotis and Anastasios N Venetsanopoulos. Color image processing and applications. Springer, 2000.

[9] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vec- tors. In Proceedings of the fourth ACM international conference on Multimedia, pages 65–73. ACM, 1997.

[10] George Stockman and Linda G. Shapiro. Computer Vision. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2001.

[11] S Livens, P Scheunders, G Van de Wouwer, and D Van Dyck. Wavelets for texture anal- ysis, an overview. In Image Processing and Its Applications, 1997., Sixth International Conference on, volume 2, pages 581–585. IET, 1997.

[12] John Illingworth and Josef Kittler. A survey of the hough transform. Computer vision, graphics, and image processing, 44(1):87–116, 1988.

[13] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.


[14] Noah Snavely, Ian Simon, Michael Goesele, Richard Szeliski, and Steven M Seitz. Scene reconstruction and visualization from community photo collections. Proceedings of the IEEE, 98(8):1370–1390, 2010.

[15] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision–ECCV 2006, pages 404–417. Springer, 2006.

[16] Stephen O’Hara and Bruce A Draper. Introduction to the bag of features paradigm for image classification and retrieval. arXiv preprint arXiv:1101.3354, 2011.

[17] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178. IEEE, 2006.

[18] Qing-Fang Zheng, Wei-Qiang Wang, and Wen Gao. Effective and efficient object-based image retrieval using visual phrases. In Proceedings of the 14th annual ACM international conference on Multimedia, pages 77–80. ACM, 2006.

[19] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia object image library (coil-20). Technical report.

[20] Mark J Huiskes and Michael S Lew. The mir flickr retrieval evaluation. In Proceedings of the 1st ACM international conference on Multimedia information retrieval, pages 39–43. ACM, 2008.

[21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[22] Jun-Yan Zhu, Yong Jae Lee, and Alexei A Efros. Averageexplorer: Interactive exploration and alignment of visual data collections. ACM Transactions on Graphics (TOG), 33(4):160, 2014.

[23] Ian Simon, Noah Snavely, and Steven M Seitz. Scene summarization for online image collections. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.

[24] Matko Bošnjak, Eduardo Oliveira, José Martins, Eduarda Mendes Rodrigues, and Luís Sarmento. Twitterecho: a distributed focused crawler to support open research with twitter data. In Proceedings of the 21st international conference companion on World Wide Web, pages 1233–1240. ACM, 2012.

[25] Charu C Aggarwal. Social network data analytics. Springer, 2011.

[26] Facebook. Company info. http://newsroom.fb.com/company-info/. Accessed: 2015-02-04.

[27] Twitter. Company info. https://about.twitter.com/company. Accessed: 2015- 02-04.

[28] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591–600. ACM, 2010.

[29] Miles Efron. Hashtag retrieval in a microblogging environment. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 787–788. ACM, 2010.

[30] David Osimo and Francesco Mureddu. Research challenge on opinion mining and sentiment analysis. Université de Paris-Sud, Laboratoire LIMSI-CNRS, Bâtiment 508, 2012.

[31] Mário J Silva, Paula Carvalho, Luís Sarmento, E de Oliveira, and Pedro Magalhaes. The design of optimism, an opinion mining system for portuguese politics. New trends in artificial intelligence: Proceedings of EPIA, pages 12–15, 2009.

[32] Ana-Maria Popescu and Oren Etzioni. Extracting product features and opinions from reviews. In Natural language processing and text mining, pages 9–28. Springer, 2007.

[33] Silvana Aciar, Debbie Zhang, Simeon Simoff, and John Debenham. Informed recommender: Basing recommendations on consumer product reviews. Intelligent Systems, IEEE, 22(3):39–47, 2007.

[34] David Kempe, Jon Kleinberg, and Éva Tardos. Influential nodes in a diffusion model for social networks. In Automata, languages and programming, pages 1127–1138. Springer, 2005.

[35] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 199–208. ACM, 2009.

[36] MT Markou and P Kassomenos. Cluster analysis of five years of back trajectories arriving in athens, greece. Atmospheric Research, 98(2):438–457, 2010.

[37] Jérémie Bouttier, Philippe Di Francesco, and Emmanuel Guitter. Geodesic distance in planar graphs. Nuclear Physics B, 663(3):535–567, 2003.

[38] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI magazine, 17(3):37, 1996.

[39] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[40] Gregory Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. Knowledge discovery in databases, pages 229–238, 1991.

[41] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition. Academic Press, 2008.

[42] Inderjeet Mani. Automatic summarization, volume 3. John Benjamins Publishing, 2001.

[43] Michiel Hazewinkel. Encyclopaedia of mathematics, supplement III, volume 13. Springer Science & Business Media, 2001.

[44] Martin Anderson, Shan Chen, James Hacking, Marc R Lieberman, Mark Lundin, Vaida Maleckaite, Allan Martin, Ryan Parham, and Mark Steed. Modern pension fund diversification. Journal of Asset Management, 15(3):205–217, 2014.

[45] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, et al. Introduction to data mining, volume 1. Pearson Addison Wesley Boston, 2006.

[46] Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001.

[47] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, number 14, pages 281–297. California, USA, 1967.

[48] Dan Pelleg, Andrew W Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, pages 727–734, 2000.

[49] Anil K Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

[50] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.

[51] Xinlei Chen and Deng Cai. Large scale spectral clustering with landmark-based representation. In AAAI, 2011.

[52] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.

[53] Irina Perfilieva and Jiří Močkoř. Mathematical principles of fuzzy logic. Springer Science & Business Media, 1999.

[54] James C Bezdek, Robert Ehrlich, and William Full. Fcm: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2):191–203, 1984.

[55] J Valente de Oliveira, Witold Pedrycz, et al. Advances in fuzzy clustering and its applications. Wiley Online Library, 2007.

[56] Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy. Mining data streams: a review. ACM Sigmod Record, 34(2):18–26, 2005.

[57] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: an efficient data clustering method for very large databases. In ACM SIGMOD Record, volume 25, pages 103–114. ACM, 1996.

[58] Liadan O’callaghan, Adam Meyerson, Rajeev Motwani, Nina Mishra, and Sudipto Guha. Streaming-data algorithms for high-quality clustering. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pages 0685–0685. IEEE Computer Society, 2002.

[59] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3):107–145, 2001.

[60] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval, volume 1. Cambridge university press Cambridge, 2008.

[61] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854, 2010.

[62] Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M Pérez, and Iñigo Perona. An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256, 2013.

[63] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.

[64] Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1):1–27, 1974.

[65] Minho Kim and RS Ramakrishna. New indices for cluster validity assessment. Pattern Recognition Letters, 26(15):2353–2363, 2005.

[66] Arnold WM Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(12):1349–1380, 2000.

[67] John Eakins, Margaret Graham, and Tom Franklin. Content-based image retrieval. In Library and Information Briefings. Citeseer, 1999.

[68] Toby Berk, Lee Brownston, and Arie Kaufman. A new color-naming system for graphics languages. IEEE Computer Graphics and Applications, 2(3):37–44, 1982.

[69] Rui Shi, Huamin Feng, Tat-Seng Chua, and Chin-Hui Lee. An adaptive image content representation and segmentation approach to automatic image annotation. In Image and Video Retrieval, pages 545–554. Springer, 2004.

[70] Christopher Town and David Sinclair. Content based image retrieval using semantic visual categories. TR2000-14, AT&T Labs Cambridge, 2000.

[71] Yixin Chen, James Ze Wang, and Robert Krovetz. An unsupervised learning approach to content-based image retrieval. In Signal Processing and Its Applications, 2003. Proceedings. Seventh International Symposium on, volume 1, pages 197–200. IEEE, 2003.

[72] Samuel Rota Bulò, Massimo Rabbi, and Marcello Pelillo. Content-based image retrieval with relevance feedback using random walks. Pattern Recognition, 44(9):2109–2122, 2011.

[73] Xiang Sean Zhou and Thomas S Huang. Relevance feedback in image retrieval: A comprehensive review. Multimedia systems, 8(6):536–544, 2003.

[74] Karl Pearson. The problem of the random walk. Nature, 72(1865):294, 1905.

[75] Ying Liu, Dengsheng Zhang, and Guojun Lu. Region-based image retrieval with high-level semantics using decision tree learning. Pattern Recognition, 41(8):2554–2570, 2008.

[76] Hao Ma, Jianke Zhu, MR-T Lyu, and Irwin King. Bridging the semantic gap between image contents and tags. Multimedia, IEEE Transactions on, 12(5):462–473, 2010.

[77] Jun Yue, Zhenbo Li, Lu Liu, and Zetian Fu. Content-based image retrieval using color and texture fused features. Mathematical and Computer Modelling, 54(3):1121–1127, 2011.

[78] Alvy Ray Smith. Color gamut transform pairs. In ACM Siggraph Computer Graphics, volume 12, pages 12–19. ACM, 1978.

[79] Juxiang Zhou, Tianwei Xu, and Wei Gao. Content based image retrieval using local directional pattern and color histogram. In Optimization and Control Techniques and Applications, pages 197–211. Springer, 2014.

[80] Greg Pass and Ramin Zabih. Comparing images using joint histograms. Multimedia systems, 7(3):234–240, 1999.

[81] Markus A Stricker and Markus Orengo. Similarity of color images. In IS&T/SPIE’s Symposium on Electronic Imaging: Science & Technology, pages 381–392. International Society for Optics and Photonics, 1995.

[82] Peter Howarth and Stefan Rüger. Evaluation of texture features for content-based image retrieval. In Image and Video Retrieval, pages 326–334. Springer, 2004.

[83] Wei-Ying Ma and Bangalore S Manjunath. Netra: A toolbox for navigating large image databases. Multimedia systems, 7(3):184–198, 1999.

[84] Hideyuki Tamura, Shunji Mori, and Takashi Yamawaki. Textural features corresponding to visual perception. Systems, Man and Cybernetics, IEEE Transactions on, 8(6):460–473, 1978.

[85] Thomas Deselaers. Features for image retrieval. Rheinisch-Westfälische Technische Hochschule, Technical Report, Aachen, 2003.

[86] Mingqiang Yang, Kidiyo Kpalma, Joseph Ronsin, et al. A survey of shape feature extraction techniques. Pattern recognition, pages 43–90, 2008.

[87] Ming-Kuei Hu. Visual pattern recognition by moment invariants. Information Theory, IRE Transactions on, 8(2):179–187, 1962.

[88] Dimitri A Lisin, Marwan A Mattar, Matthew B Blaschko, Erik G Learned-Miller, and Mark C Benfield. Combining local and global image features for object class recognition. In Computer Vision and Pattern Recognition-Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on, pages 47–47. IEEE, 2005.

[89] Kristen Grauman and Bastian Leibe. Visual object recognition. Morgan & Claypool Publishers, 2010.

[90] Yan Ke and Rahul Sukthankar. Pca-sift: A more distinctive representation for local image descriptors. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–506. IEEE, 2004.

[91] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1):37–52, 1987.

[92] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(10):1615–1630, 2005.

[93] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.

[94] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145–175, 2001.

[95] Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1470–1477. IEEE, 2003.

[96] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and vision computing, 22(10):761–767, 2004.

[97] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1458–1465. IEEE, 2005.

[98] Symeon Papadopoulos, Christos Zigkolis, Giorgos Tolias, Yannis Kalantidis, Phivos Mylonas, Yiannis Kompatsiaris, and Athena Vakali. Image clustering through community detection on hybrid image similarity graphs. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 2353–2356. IEEE, 2010.

[99] Corel. Home page. http://www.corel.com. Accessed: 2015-02-05.

[100] Jonathon S Hare and Paul H Lewis. Automatically annotating the mir flickr dataset: Experimental protocols, openly available data and semantic spaces. In Proceedings of the international conference on Multimedia information retrieval, pages 547–556. ACM, 2010.

[101] Martin Szummer and Rosalind W Picard. Indoor-outdoor image classification. In Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop on, pages 42–51. IEEE, 1998.

[102] Computer Vision at Caltech. Caltech 101. http://www.vision.caltech.edu/Image_Datasets/Caltech101/. Accessed: 2015-02-05.

[103] Max Planck Institut Informatik. Analyzing appearance and contour based methods for object categorization. http://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/object-recognition-and-scene-understanding/analyzing-appearance-and-contour-based-methods-for-object-categorization/. Accessed: 2015-02-05.

[104] Christiane Fellbaum. WordNet. Wiley Online Library, 1998.

[105] Jun Yu, Richang Hong, Meng Wang, and Jane You. Image clustering based on sparse patch alignment framework. Pattern Recognition, 47(11):3512–3519, 2014.

[106] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using local discriminant models and global integration. Image Processing, IEEE Transactions on, 19(10):2761–2773, 2010.

[107] Yann LeCun and Corinna Cortes. The mnist database of handwritten digits, 1998.

[108] CEDAR. Usps office of advanced technology database. http://www.cedar.buffalo.edu/Databases/CDROM1/. Accessed: 2015-02-06.

[109] Sudeep Sarkar, P Jonathon Phillips, Zongyi Liu, Isidro Robledo Vega, Patrick Grother, and Kevin W Bowyer. The humanid gait challenge problem: Data sets, performance, and analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(2):162–177, 2005.

[110] Image Engineering Laboratory, University of Sheffield. Face database. https://www.sheffield.ac.uk/eee/research/iel/research/face. Accessed: 2015-02-06.

[111] Ivo Filipe Valente Mota. Olhó-passarinho: uma extensão do tweeprofiles para fotografias. Master’s thesis, Faculdade de Engenharia da Universidade do Porto, 2014.

[112] Yixin Chen, James Ze Wang, and Robert Krovetz. Clue: cluster-based retrieval of images by unsupervised learning. Image Processing, IEEE Transactions on, 14(8):1187–1201, 2005.

[113] Jie Cao, Zhiang Wu, Junjie Wu, and Wenjie Liu. Towards information-theoretic k-means clustering for image indexing. Signal Processing, 93(7):2026–2037, 2013.

[114] Laurent Galluccio, Olivier Michel, Pierre Comon, and Alfred O Hero. Graph based k-means clustering. Signal Processing, 92(9):1970–1984, 2012.

[115] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, pages 79–86, 1951.

[116] Junjie Wu, Shiwei Zhu, Hongfu Liu, and Guoping Xia. Cosine interesting pattern discovery. Information Sciences, 184(1):176–195, 2012.

[117] Pengtao Xie and Eric P Xing. Integrating image clustering and codebook learning. In 29th AAAI Conference on Artificial Intelligence, 2015.

[118] Stan Z Li. Markov random field modeling in image analysis. Springer Science & Business Media, 2009.

[119] Symeon Papadopoulos, Christos Zigkolis, Yiannis Kompatsiaris, and Athena Vakali. Cluster-based landmark and event detection for tagged photo collections. IEEE MultiMedia, 18(1):52–63, 2011.

[120] Eamonn Keogh and Abdullah Mueen. Curse of dimensionality. In Encyclopedia of Machine Learning, pages 257–258. Springer, 2010.

[121] Bernhard Schölkopf and Klaus-Robert Müller. Fisher discriminant analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing IX, Madison, WI, USA, pages 23–25, 1999.

[122] Chris Ding and Tao Li. Adaptive dimension reduction using discriminant analysis and k-means clustering. In Proceedings of the 24th international conference on Machine learning, pages 521–528. ACM, 2007.

[123] Jieping Ye, Zheng Zhao, and Mingrui Wu. Discriminative k-means for clustering. In Advances in neural information processing systems, pages 1649–1656, 2008.

[124] Kyle Heath, Natasha Gelfand, Maks Ovsjanikov, Mridul Aanjaneya, and Leonidas J Guibas. Image webs: Computing and exploiting connectivity in image collections. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3432–3439. IEEE, 2010.

[125] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei Efros. What makes paris look like paris? ACM Transactions on Graphics, 31(4), 2012.

[126] OpenCV. Main page. http://opencv.org. Accessed: 2015-02-17.