Image Search for Improved Law and Order: Search, Analyse, Predict Image Spread on Twitter

Student Name: Sonal Goel IIIT-D-MTech-CS-GEN-16-MT14026 April, 2016

Indraprastha Institute of Information Technology New Delhi

Thesis Committee Dr. Ponnurangam Kumaraguru (Chair) Dr. AV Subramanyam Dr. Samarth Bharadwaj

Submitted in partial fulfillment of the requirements for the Degree of M.Tech. in Computer Science, in General Category

©2016 IIIT-D-MTech-CS-GEN-16-MT14026 All rights reserved

Keywords: Image Search, Image Virality, Twitter, Law and Order, Police

Certificate

This is to certify that the thesis titled “Image Search for Improved Law and Order: Search, Analyse, Predict Image Spread on Twitter” submitted by Sonal Goel for the partial fulfillment of the requirements for the degree of Master of Technology in Computer Science & Engineering is a record of the bonafide work carried out by her under our guidance and supervision in the Security and Privacy group at Indraprastha Institute of Information Technology, Delhi. This work has not been submitted anywhere else for the award of any other degree.

Dr. Ponnurangam Kumaraguru

Indraprastha Institute of Information Technology, New Delhi

Abstract

Social media is often used to spread images that can instigate anger among people and hurt their religious, political, caste, and other sentiments, which in turn can create law and order situations in society. This results in the need for Law Enforcement Agencies (LEA) to inspect the spread of images related to such events on social media in real time. To help LEA analyse the image spread on microblogging websites like Twitter, we developed an Open Source Real Time Image Search System, where the user provides an image and a supportive text related to the image, and the system finds the images that are similar to the input image along with their occurrences. The proposed system is robust enough to identify images that are cropped (to a certain factor), scaled (to a certain factor), embedded with text, stitched with other images, varied in brightness, rotated, or some combination of these. On the input text, the system runs a text mining algorithm to extract the keywords, retrieves images related to these keywords from Twitter, and uses an image comparison methodology to extract similar images.

The system can analyse the users propagating the content, the sentiments expressed alongside it, and the retweet activity. We found that Improved ORB (Oriented FAST and Rotated BRIEF) performs the best for finding image similarity, with an accuracy above 85% in all the tested cases. The system developed is being used by one of the Government security agencies.

In addition to identifying similar images, we also aim to predict the influence on people of events that can create law and order issues in society. On microblogging sites like Twitter, information provided by tweets diffuses over the users through retweets. Hence, to further enhance the understanding and control of the diffusion of these kinds of images, we focus on predicting the retweet count of such images using visual cues from the images, content-based information, and structure-based features. For this, we build a random forest regression model that takes tweet, image, and structural features to predict the retweet count.

Acknowledgments

I would like to express my deepest gratitude to my advisor Dr. Ponnurangam Kumaraguru for his guidance and support. The quality of this work would not have been nearly as high without his well-appreciated advice. I would like to thank my esteemed committee members Dr. A.V. Subramanyam and Dr. Samarth Bharadwaj for agreeing to evaluate my thesis work. I thank Dr. Samarth Bharadwaj for enriching this thesis with his valuable suggestions and feedback.

I thank all the members of the Precog research group at IIIT-Delhi for their valuable feedback and suggestions, especially Niharika Sachdeva for shepherding me and spending her valuable time on shaping this thesis. Special thanks to all members of the CERC group at IIIT-Delhi and Hareesh Ravi for their valuable inputs.

Last but not least, I would like to thank my supportive family, who encouraged me and kept me motivated throughout the project.

Contents

1 Research Motivation and Aim
  1.1 Research Motivation
  1.2 Research Aim
2 Related Work
  2.1 Comparing Methodologies
  2.2 Image Retrieval Real Time Systems
    2.2.1 Google reverse image lookup
    2.2.2 TinEye
  2.3 Social Media for improved policing
  2.4 Information Diffusion on Social Media
  2.5 Finding and Predicting Image Virality on Online Social Media for improved law and order
3 Contributions
4 Proposed Methodology
  4.1 Architecture design for the search system
  4.2 Solution approach for prediction
  4.3 Data Collection
    4.3.1 Search System
    4.3.2 Prediction
  4.4 Data Annotation
5 Quantifying Image Similarity
  5.1 Keyword Extraction
  5.2 Challenges in finding similar images
  5.3 Image features to analyse similarity
    5.3.1 Hashing
    5.3.2 SSIM
    5.3.3 Colour Histogram
    5.3.4 Keypoint Descriptors
  5.4 Experimental Settings
    5.4.1 Evaluation Metrics
  5.5 Classification Results
    5.5.1 Histogram
    5.5.2 DAISY
    5.5.3 ORB
    5.5.4 Improved ORB
  5.6 Time Evaluation
  5.7 Variation in accuracy with modified images
6 Predicting Image Spread
  6.1 Features Analysed
    6.1.1 Content-Based
    6.1.2 Sentiment
    6.1.3 Structure-Based
    6.1.4 Image-Based
  6.2 Features vs. Retweet Count
  6.3 Experimental Settings
    6.3.1 Linear Regression
    6.3.2 Support Vector Regression
    6.3.3 Random Forest
  6.4 Training and Testing Data
7 Results
  7.1 Efficient Image Features and methodology for quantifying similarity
  7.2 Comparing search results with Google reverse image lookup
  7.3 Real Time Image Search System
  7.4 Law Enforcement Agencies as beneficiaries of the system
  7.5 Evaluating Metric for Prediction
  7.6 Comparing Regression Models
8 Conclusions, Limitations, Future Work
  8.1 Conclusions
  8.2 Limitation
  8.3 Future Work

List of Figures

1.1 News headlines showing the impact of some recent events
4.1 Architecture diagram for the search system
5.1 Input images set
5.2 Accuracy of coloured histogram
5.3 Daisy distance for similar and dissimilar images
5.4 ORB and Improved ORB
5.5 Plotting accuracy for modified images: Kulkarni event-1
5.6 Plotting accuracy for modified images: Kulkarni event-2
5.7 Plotting accuracy for modified images: Kejriwal cartoon
5.8 Plotting accuracy for modified images: Baba Ram Rahim images
5.9 Plotting accuracy for modified images: Charlie Hebdo cartoon
7.1 Comparing proposed system with Google reverse image lookup
7.2 Screen shots comparing output of proposed system with Google reverse image lookup for Shipra Malik image
7.3 Screen shots comparing output of proposed system with Google reverse image lookup for Kanhaiya Kumar image
7.4 Screen shot of the system

List of Tables

4.1 Data collection for six events
4.2 Data after annotation shows the number of similar and non-similar images
5.1 Comparing time (seconds) taken for comparing two images
6.1 Spearman correlation and p-values of features with retweet count
7.1 Comparing results of proposed system with Google’s reverse image lookup

Chapter 1

Research Motivation and Aim

1.1 Research Motivation

Images have a high impact on both the online and the offline world. Recently, the Union Health Ministry made it mandatory for cigarette manufacturing companies to reserve 60 percent of the space on the pack for pictorial warnings [5]. Images shared on social media are equally influential, having the power to evoke people’s emotions and to drive deeper engagement and more profound changes in behaviour [2]. Before the advent of social media, news related to events was more localised and took more time to spread. However, with the widespread usage of social media, information spreads like wildfire and reaches a larger audience, making social media a powerful tool for spreading information [4]. A study reveals that when people hear information, they are likely to remember only 10 percent of that information three days later. However, if a relevant image is paired with the same information, people retain 65 percent of it three days later [23]. On social media specifically, Facebook posts with images achieve 2.3 times more engagement than those without images, and Buffer reported that, for its user base, tweets with images received 150 percent more retweets than tweets without images [23]. When images related to a disturbing event are shared, they have the potential to influence public opinion and hurt religious, political, caste, or communal sentiments, which in turn can disturb the law and order situation in society. For instance, in June 2014 a Facebook post containing obscene images of king Shivaji Maharaj, the late Shiv Sena leader Bal Thackeray, and others provoked violent protests in Maharashtra: many public vehicles were damaged, many people were injured by pelted stones, and the angry crowd even killed a Muslim techie, giving the protest a communal colour [7, 12]. As another example, in 2015 a self-proclaimed Godman, Baba Ram Rahim, posted on social media sites an image of himself posing as the Hindu God Vishnu. This image hurt the religious sentiments of a section of society, who lodged a complaint against him. Fig 1.1 shows some newspaper articles about the unrest created by these events in society [35].


Figure 1.1: (a) Report from Pune Mirror showing how riots in Pune closed the city. (b) Report from Times of India showing the after-effects of the Baba Ram Rahim image.

Due to such severe repercussions of posting instigating images on social media, there arises a need for law enforcement agencies to assess the impact made by such images on society and to control their further spread. A potential solution is to build a system that tells how viral a particular image has gone, based on the frequency of similar images shared, and then to further understand and manage the diffusion by predicting the future spread of such information. Current research has explored Content Based Image Retrieval (CBIR) and image similarity [31,32,38], as well as the prediction of information diffusion [11,39,40]. However, to the best of our knowledge, less attention has been paid to (a) how image spread on Online Social Networks (OSN), and specifically micro-blogging sites, can affect peace in society, and (b) what steps can be taken to help law enforcement agencies maintain a healthy law and order situation in society. In this work we take a step towards filling this gap.

1.2 Research Aim

We aim to develop a real-time image search system which can aid law enforcement agencies in analysing the spread of an image. The system takes an image as input along with a supportive text, and returns the images related to the text keywords and similar to the input image. The system can return images that are scaled, cropped, stitched with other images, overlaid with some text, or slightly altered in colour. The major outputs of the system are:

• Count of most similar, moderately similar and least similar images found on the microblogging site

• Analysis of sentiments floating with the most similar set of images

• Analysis of Tweet vs. Retweet ratio

• Analysis of users who propagate these images

Further, to understand the spread of information, we predict the approximate diffusion of images. This was done by training machine learning models to predict the retweet count of tweets, which in the case of Twitter is a definitive parameter for approximating diffusion. We tested various textual, structural, and visual features to predict the spread.

Chapter 2

Related Work

2.1 Comparing Methodologies

Image retrieval has become an important topic in the past decade. Work has been done on finding visually similar images and on identifying objects in an image. One technique is to extract visual features of images, such as colour, texture, and shape information, and then compare images based on these features [19]. Another widely studied method is to find the important keypoints in the image and compare these keypoints. Feature matching using SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features) has proven remarkably successful in a number of applications using visual features, including object recognition, image stitching, and visual mapping. However, these methods are computationally expensive and therefore not very efficient for real-time systems. Rublee et al. showed that ORB (Oriented FAST and Rotated BRIEF), which combines FAST (Features from Accelerated Segment Test) keypoint detectors with BRIEF (Binary Robust Independent Elementary Features) descriptors, gives almost the same performance at a lower cost [28]. Lei Yu et al. also compared traditional feature matching techniques such as SIFT, SURF, DAISY, and ORB with their own technique, an improved version of ORB, and found that ORB and its variant outperform the traditional techniques owing to the binary descriptor vector returned by BRIEF and the fast comparison of binary strings using Hamming distance, computed with XOR operations [38].

2.2 Image Retrieval Real Time Systems

Reverse image search engines are a special kind of search engine where a user submits a picture and the system retrieves the images similar to it [6]. Some popular reverse image search engines are:

2.2.1 Google reverse image lookup

Here, a reverse photo search is performed by uploading an image from your computer or pasting the link of the image in the search bar; alternatively, you can simply drag and drop the image into the search bar. All of these work equally well. It uses algorithms based on various attributes like shape, size, colour, keypoints, and resolution to find similar pictures [6,17].¹

2.2.2 TinEye

TinEye is the first image search engine on the web to use image identification technology rather than keywords, metadata or watermarks. TinEye regularly crawls the web for new images. When a user inputs an image, it creates a unique and compact digital signature or ‘fingerprint’ for it, then compares this fingerprint to every other image in its index to retrieve matches. It does not typically find similar images (i.e. a different image with the same subject matter); it finds exact matches, including those that have been cropped, edited or resized [6,33].²

2.3 Social Media for improved policing

Police forces widely use OSN to maintain law and order and improve communication with citizens. Research has been done on how OSN-based technology can be adapted to support communication and collaboration for making society safer in developing countries like India. Research shows that police in developed nations have realized the effectiveness of OSN in various activities such as investigation, crime identification, intelligence development, and community policing [29]. A few studies in India show that OSN was used to spread misinformation and public agitation during crisis events such as the Mumbai terror attacks (2011), the Muzaffarnagar riots, and the Assam disturbance (2012). These studies report that, in these events, panic was spread through fake images, messages, and videos on OSN. All in all, multiple studies have examined how OSN can be useful for police and first responders to maintain law and order during riots and crises, to prevent crime, to increase the trust of people, and to maintain peace in society [29].

2.4 Information Diffusion on Social Media

Research has been done on predicting and understanding the spread of information both on online social media in general and on Twitter in particular [20]. One study shows that the popularity of an image on Flickr is closely related to the user's popularity and activity level as well as to its topical content or tags; it takes image features, text features, and user context features into account [24]. Another work predicts diffusion on Twitter by predicting the retweet count using textual features, user-based features, and visual features of the image [11].

¹ https://images.google.com/
² https://www.tineye.com

2.5 Finding and Predicting Image Virality on Online Social Media for improved law and order

Our research is motivated by the above work but, to the best of our knowledge, this is the first research focussed on image spread that helps the police force find the actual spread of an image that can create unrest in society, by retrieving similar images shared on a microblogging site in real time. We also aim to help the police take proactive measures by predicting the approximate spread of the image on microblogging sites, which in our case are represented by Twitter.

Chapter 3

Contributions

In this thesis we provide an integrated solution for law enforcement agencies that helps them identify and evaluate the spread, on micro-blogging sites like Twitter, of images that can result in law and order situations in society. To this end we make the following contributions.

• We compare different image features for finding the similarity between images and conclude that the technique that uses ORB in combination with RANSAC (Random Sample Consensus) to retrieve similar images from Twitter gives the best result. Using this technique, our system is able to quantify the spread with an accuracy above 85% in all the tested cases.

• The system is able to retrieve images that are cropped, scaled to an extent, images with text, contrasted or brightened images and images stitched with other images.

• We also observe that images which are highly scaled (by more than a factor of 3.0), highly cropped images, and modified images with more colours show reduced accuracy when given to the system as input.

• The system is currently being used by one of the Government security agencies.

To enhance the capability of law enforcement agencies to take proactive preventive measures and understand the spread of such images, we forecast the approximate spread by predicting the retweet count of an image.

• We use structure-based features, tweet based and image-based features to predict the retweet count.

• In contrast to previous work, we find that low-level image-based features like mean red, mean blue and mean green are not very effective for predicting the retweet count, as their Spearman correlation with the retweet count is low (less than 0.05). We observe that a high-level image feature like the presence of a face in the image gives a decent correlation and can be used as a parameter for prediction.

• After testing different machine learning models, we find that the random forest regression model predicts the retweet count with the highest accuracy (least RMSE) using structural, textual and image-based features.

Chapter 4

Proposed Methodology

4.1 Architecture design for the search system

The proposed system takes an image and a text related to it as input. This text is passed to a keyword extraction algorithm which extracts some keywords from the text. These keywords are then given to Twitter's Search API (Application Program Interface), part of Twitter's REST API³. The API returns a set of tweets related to the keywords it takes as the query, and the system then filters the tweets containing images to be stored in a database. As the images get stored in the database, an image comparison methodology runs in parallel and computes a similarity score between the input image and each image in the database. We then set two thresholds, t1 and t2 (with t1 greater than t2): if the similarity score is more than t1, the system classifies the image as most similar; if the score is less than t1 but more than t2, the image is classified as moderately similar; and images with a similarity score less than t2 go into the set of least similar images. After the classification is complete, the system outputs all three labelled sets for visualisation and then analyses the most similar image content. It extracts the text related to the most similar set and performs sentiment analysis using the Sentiment140 API⁴. The Sentiment140 API is trained for classifying tweets and returns a score classifying a tweet's sentiment as positive, negative or neutral [3]. The system returns a comparison visual that shows the percentage of tweets with negative, positive and neutral sentiments. Likewise, the system analyses the ratio of tweets to retweets. Another useful feature is the analysis of the users who are propagating these events: the system returns the profile picture of each user and their basic details from Twitter, such as username, description (if available), and the number of followers. Fig 4.1 shows the architecture diagram of the system.

³ https://dev.twitter.com/rest/public/search
⁴ http://www.sentiment140.com/

Figure 4.1: Architecture diagram of the system. The inputs to the system are an image and a keyword. The Twitter REST API takes the keyword and forms an image database; we compare the input image with the images in the database to get similar images, on which the system further performs sentiment and user analysis.
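The three-way labelling by the thresholds t1 and t2 described in this section can be sketched as follows. This is only an illustrative sketch: the function name, the example scores, and the threshold values are hypothetical, and the handling of scores exactly equal to a threshold is our assumption, since the description leaves it open.

```python
def classify_by_similarity(scores, t1, t2):
    """Split image ids into three labelled sets using thresholds t1 > t2.

    scores maps an image id to its similarity score (higher = more similar).
    Assumption: a score exactly equal to a threshold falls into the lower set.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    buckets = {"most_similar": [], "moderately_similar": [], "least_similar": []}
    for image_id, score in scores.items():
        if score > t1:
            buckets["most_similar"].append(image_id)
        elif score > t2:
            buckets["moderately_similar"].append(image_id)
        else:
            buckets["least_similar"].append(image_id)
    return buckets


# Hypothetical scores and thresholds, for illustration only.
example_scores = {"img_a": 0.92, "img_b": 0.55, "img_c": 0.10}
example_buckets = classify_by_similarity(example_scores, t1=0.8, t2=0.4)
```

In the real system the scores would come from the image comparison methodology, and the thresholds would be tuned for the chosen similarity measure.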

4.2 Solution approach for prediction

The length of tweets on Twitter is limited to 140 characters; hence it is hard to analyse features based on the text alone, so we include features of the image attached to the tweet to predict its approximate spread. The most trusted parameter to define the diffusion of content on Twitter is the retweet count [11]. So, to quantify diffusion, we aim to predict the retweet count of the tweet. Our focus is on unique tweets containing an image link. We extract various structural, contextual, sentiment, and visual features of the tweet and the image attached to it. After extracting the set of features, we calculate the Spearman correlation between the target variable (retweet count) and each feature to filter out the most related features. We then train different machine learning regression models over this feature set to forecast the retweet count of a tweet.
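The correlation-based feature filtering step can be illustrated with a small, dependency-free sketch. It computes the Spearman coefficient as the Pearson correlation of average ranks and keeps features whose absolute correlation reaches a cut-off. The feature names, the toy values, and the cut-off of 0.05 are hypothetical, and p-values are not computed here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def select_features(features, target, min_abs_rho):
    """Keep features whose |Spearman rho| with the target is >= min_abs_rho."""
    selected = {}
    for name, column in features.items():
        rho = spearman(column, target)
        if abs(rho) >= min_abs_rho:
            selected[name] = rho
    return selected


# Toy data, for illustration only (not taken from the thesis dataset).
retweets = [1, 5, 20, 100]
toy_features = {"followers": [10, 50, 40, 900], "mean_red": [120, 80, 130, 90]}
kept = select_features(toy_features, retweets, min_abs_rho=0.05)
```

With these toy values only `followers` survives the filter; the real pipeline would apply the same idea to the structural, textual, sentiment, and image features of each tweet.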

4.3 Data Collection

4.3.1 Search System

We collected tweets containing images related to some recent events (as of April 2016) that created unrest in the society in some form. In this section we describe the events, and data collected for each event. Fig 5.1 shows the images that went viral during these events.

• Kulkarni Ink: Black ink was sprayed on the technocrat-turned-columnist Sudheendra Kulkarni by members of a political party, ahead of the launch of former Pakistan Foreign Minister Khurshid Mahmud Kasuri’s book, ‘Neither a Hawk nor a Dove: An Insider’s Account of Pakistan’s Foreign Policy’. An FIR was lodged against the party workers and six workers were arrested [9]. The incident was slammed on social media and the images of the man with black ink on his face went viral. We collected a total of 1,912 images using the keyword “Kulkarni”.

• Baba Ram Rahim: A self-proclaimed Godman, Baba Ram Rahim, posted pictures of himself posing as the Hindu God Vishnu. He faced trouble when the All India Hindu Student Federation accused him of insulting Lord Vishnu and hurting the religious sentiments of Hindus by dressing up as Lord Vishnu, and lodged a complaint against him [35]. We were able to collect 411 images using the keyword #RamRahim.

• Cartoon by Kejriwal: A cartoon tweeted by Delhi’s Chief Minister Arvind Kejriwal drew much flak from the BJP and its affiliates for allegedly hurting religious sentiments. The cartoon purportedly showed Hanuman, clad in saffron robes, addressing Prime Minister Narendra Modi. A political party, the Hindu Sena, lodged a complaint alleging that the Twitter message was posted with intent to hurt the religious sentiments of Hindus [34]. We collected 666 images using the keyword #KejrwalInsultsHanuman.

• Charlie Hebdo Cartoon: The French satirical magazine Charlie Hebdo sparked outrage by publishing a cartoon attempting to satirise the Syrian refugee crisis. The cartoon imagines that Alan Kurdi, the three-year-old Syrian who died at sea in September 2015 on the way to Europe, has grown up to be a sexual abuser. Many called the cartoon racist and said it was incredibly bad [18]. We collected 570 images using the keyword #CharlieHebdo.

• Shani Shingnapur protest: As many as 1,500 women, mostly housewives and college students, planned to storm the Shani Shingnapur temple in Ahmednagar district on Republic Day. The protesters wanted to end the age-old humiliating practice of not allowing women to enter the core shrine area [15]. We collected 183 images related to the keyword #ShaniShingnapur.

4.3.2 Prediction

We collected data for six events: the five discussed above, and the Rohith Vemula suicide case, using the keyword #rohithvemulla. After collecting tweets containing images, we kept only unique tweets posted by the original user to avoid bias. A total of 2,040 unique tweets were collected.

Keyword                   Total tweets collected   Unique Tweets Count
#kulkarni                 1,912                    404
#RamRahim                 420                      117
#KejrwalInsultsHanuman    1,400                    665
#CharlieHebdo             1,079                    312
#ShaniShingnapur          1,230                    183
#rohithvemulha            3,104                    359

Table 4.1: Data collection for six events.

4.4 Data Annotation

After getting the data related to the five events, we picked one image from each event. Our task was to find all the images similar to this image in the dataset of that particular event. To test the accuracy of the image comparison methodologies, we needed to annotate the data beforehand and match the results of each methodology against the annotation. We gave the annotator the set of images related to an event and showed the input image (whose similar images we want to find); the annotator's task was to label each image in the given set as similar or not similar to the input image. The dataset was thus divided into two sets, one containing images similar to the shown image and the other containing images that are not similar to it. In this way, for each event, we obtained a set of similar images and a set of non-similar images. Table 4.2 shows the counts of both sets after the annotation was complete.

Keyword                   Total Images   Similar Images   Non-Similar Images
#Kulkarni                 1,912          348              1,564
#RamRahim                 411            99               312
#KejrwlaInsultsHanuman    666            278              388
#CharlieHebdo             570            114              456
#ShaniShignapur           183            67               116

Table 4.2: Data after annotation shows the number of similar and non-similar images.

Chapter 5

Quantifying Image Similarity

5.1 Keyword Extraction

Keyword extraction is an automatic way of identifying the terms that best describe the subject of a document. Since Twitter does not allow us to directly fetch data using an image, we take some supportive keywords and fetch all the tweets related to them using Twitter's search API. If the user does not know the exact keyword to enter, he/she can enter a text related to the image (or the corresponding tweet), and a text mining algorithm filters out the important keywords from the text. The main goal is to extract noun phrases and verb phrases from the text because they describe the main topic [8]. This is done using NLTK's tagger to define a new n-gram tagger, where n=3.
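To make the phrase extraction idea concrete, the sketch below groups consecutive adjective and noun tokens into candidate noun phrases. It is not the tagger used in the system: it assumes the tokens have already been part-of-speech tagged with Penn Treebank tags (for example by an NLTK tagger), the tags in the example are written by hand, and verb phrases are not handled.

```python
# Penn Treebank tags assumed to be allowed inside a noun phrase.
NOUN_PHRASE_TAGS = {"JJ", "NN", "NNS", "NNP", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_noun_phrases(tagged_tokens):
    """Group runs of adjective/noun tokens into candidate keyword phrases.

    tagged_tokens is a list of (word, tag) pairs produced by some POS tagger.
    A run is kept only if it contains at least one noun.
    """
    phrases = []
    current = []

    def flush():
        if any(tag in NOUN_TAGS for _, tag in current):
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in NOUN_PHRASE_TAGS:
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases


# Hand-tagged example sentence, for illustration only.
example_tagged = [
    ("Black", "JJ"), ("ink", "NN"), ("was", "VBD"), ("sprayed", "VBN"),
    ("on", "IN"), ("Sudheendra", "NNP"), ("Kulkarni", "NNP"),
]
example_keywords = extract_noun_phrases(example_tagged)
```

The resulting phrases could then be passed as query terms to the search API.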

5.2 Challenges in finding similar images

In this section we discuss why finding similar images in a set of images is not a straightforward task. When an image related to an incident goes viral on social media, we generally see different versions of the same picture in circulation; all these versions depict the same scene, but the images are not exact duplicates of each other. We aim to find such similar images, where the content of the image is the same but the image has been modified by:

• Cropping

• Scaling

• Adding Text

• Changing contrast and brightness

• Stitching another image

• Adjusting colours

• Rotating the image

• Adding a part of the image to another image

5.3 Image features to analyse similarity

In this section, we discuss various features of an image that can describe it and how we use these features to compare the similarity between two images. We used the following image features to compare similarity.

5.3.1 Hashing

Features in the image are used to generate a distinct, but not unique, fingerprint, and these fingerprints are comparable. We pass an image to a hash function and compute its image hash based on its visual appearance. Similar images should have similar hashes as well. A perceptual hashing algorithm takes a fingerprint of a multimedia file by deriving it from various features; it takes into account transformations of a given input and yet is flexible enough to distinguish between dissimilar files [10]. This differs from a cryptographic hash, where very tiny changes in the input file result in a substantially different hash. There are many perceptual hashing algorithms, but we use dHash, the difference hashing algorithm, which computes the difference in brightness between adjacent pixels to identify the relative gradient direction. It is also faster and more accurate than other perceptual hashing algorithms [16]. After getting the hashes of two images, we use the Hamming distance to find the difference between them. This distance can be used to quantify the similarity of two images.
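A minimal sketch of difference hashing and the Hamming comparison is given below. It assumes the image has already been converted to grayscale and resized to 8 rows by 9 columns (for instance with an imaging library), so that each row yields 8 brightness comparisons and the hash has 64 bits. The comparison direction (left brighter than right) is a convention we chose, and the toy images are hypothetical.

```python
def dhash_bits(gray_rows):
    """Compute a difference hash from a small grayscale image.

    gray_rows: list of rows of brightness values, assumed to be 8 rows x 9
    columns. Each bit records whether a pixel is brighter than its right
    neighbour, i.e. the horizontal gradient direction.
    """
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length hashes differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Toy 8x9 images: a left-to-right gradient, a brightened copy, and a mirror.
gradient = [[10 * col for col in range(9)] for _ in range(8)]
brighter = [[value + 20 for value in row] for row in gradient]
mirrored = [list(reversed(row)) for row in gradient]

same_scene = hamming_distance(dhash_bits(gradient), dhash_bits(brighter))
flipped = hamming_distance(dhash_bits(gradient), dhash_bits(mirrored))
```

Because only the sign of the brightness difference is stored, a uniform brightness change leaves the hash unchanged, whereas mirroring the toy gradient flips every bit.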

5.3.2 SSIM

The structural similarity (SSIM) index is a method for predicting the perceived quality of digital images. It attempts to model the perceived change in the structural information of the image and is used for measuring the similarity between two images. Its value ranges from -1 to 1, with 1 indicating perfectly similar images [27] and -1 indicating dissimilar images. We use the scikit-image⁵ implementation of SSIM.
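For intuition, the SSIM formula can be evaluated over a whole image treated as a single window, as in the sketch below. This is a simplification and not the scikit-image implementation, which computes the index over local windows and averages the results. The stabilising constants follow the commonly used choice C1 = (0.01 L)^2 and C2 = (0.03 L)^2, where L is the data range; the toy pixel values are hypothetical.

```python
def global_ssim(x, y, data_range=255.0):
    """SSIM of two equal-length grayscale pixel lists, using one global window.

    Simplified sketch: means, variances and covariance are taken over the
    whole image instead of over local sliding windows.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


# Toy flattened images: a ramp and its photographic negative.
ramp = [0, 50, 100, 150, 200, 250]
negative = [255 - value for value in ramp]

ssim_identical = global_ssim(ramp, ramp)
ssim_negative = global_ssim(ramp, negative)
```

An image compared with itself scores 1, while the negative of the ramp scores below zero because its structure is anti-correlated with the original.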

5.3.3 Colour Histogram

The comparison of image colour distributions is an important tool in Content-Based Image Retrieval. It is often done by comparing colour histograms of images, which discards information on the spatial distribution of colours [21]. The image descriptor here is a 3D RGB colour histogram with 8 bins per channel, and we compare descriptors using a similarity metric. We tried different similarity metrics, namely Chi-Squared, Euclidean distance, and Bhattacharyya distance; in our experiments, Bhattacharyya gave better results than the other two [26]. The smaller the distance between two histograms, the more similar the two images are.

⁵ http://scikit-image.org
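The histogram descriptor and its comparison can be sketched in a few lines. The sketch builds a normalised 3D RGB histogram with 8 bins per channel (512 bins in total) and compares two histograms with a bounded form of the Bhattacharyya distance, sqrt(1 - BC), where BC is the Bhattacharyya coefficient. Which exact form was used in the experiments is not stated here, so this choice is an assumption, and the pixel lists are toy data.

```python
import math


def rgb_histogram(pixels, bins=8):
    """Normalised 3D colour histogram, flattened to bins**3 entries.

    pixels: non-empty list of (r, g, b) tuples with channel values in 0..255.
    bins is assumed to divide 256 evenly.
    """
    hist = [0.0] * (bins ** 3)
    width = 256 // bins
    for r, g, b in pixels:
        index = (r // width) * bins * bins + (g // width) * bins + (b // width)
        hist[index] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def bhattacharyya_distance(p, q):
    """Bounded Bhattacharyya distance between two normalised histograms.

    Returns 0.0 for identical distributions and 1.0 for non-overlapping ones.
    """
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - coefficient))


# Toy images: pure red, a slightly darker red, and pure blue.
red = [(255, 0, 0)] * 4
dark_red = [(250, 0, 0)] * 4
blue = [(0, 0, 255)] * 4

d_similar = bhattacharyya_distance(rgb_histogram(red), rgb_histogram(dark_red))
d_different = bhattacharyya_distance(rgb_histogram(red), rgb_histogram(blue))
```

The two reds fall into the same bin and are at distance 0, which also illustrates why coarse binning tolerates slight colour changes.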

5.3.4 Keypoint Descriptors

Certain parts of an image carry more information than others, particularly edges and corners, and these are the parts we can use for image matching; they are known as keypoints. After finding the keypoints of an image, we need to find their descriptors. Descriptors are fixed-length vectors that describe some characteristics of the keypoints. Next, we compare each keypoint descriptor of one image to each keypoint descriptor of the other image. Since the descriptors are vectors of numbers, we can compare them using a simple distance metric such as Euclidean distance or Hamming distance, depending on the requirements. Below, we discuss the approaches we tested for computing keypoint descriptors and comparing them.

DAISY

The DAISY local image descriptor is based on gradient orientation histograms, similar to the SIFT descriptor. It is formulated in a way that allows fast dense extraction, which is useful for, e.g., bag-of-features image representations [36]. After extracting DAISY descriptors, we calculated the distance between the descriptor vectors of two images using a Brute Force matcher with KNN (K-Nearest Neighbours) matching. This distance is then used as the score that defines the similarity of the images.

ORB

ORB stands for Oriented FAST and Rotated BRIEF. This method returns binary strings to describe feature points. It uses an improved FAST for feature detection, and these features are described using an improved, rotation-aware BRIEF descriptor. Since both FAST and BRIEF are very fast, ORB is a suitable choice for real-time systems. ORB is rotation invariant, resistant to noise, and uses image pyramids for scale invariance [28]. After extracting the binary descriptors, we use Hamming distance for feature point matching. We use this Hamming distance as a score to evaluate the similarity of two images.
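Matching binary descriptors by Hamming distance can be sketched without any computer vision library, as below. This is a minimal sketch under stated assumptions: descriptors are taken to be byte-packed `uint8` rows (ORB descriptors are 32 bytes each), the function names are hypothetical, and the score is the mean distance of the ten closest matches, as used for the ORB threshold in section 5.5.3.

```python
import numpy as np

def hamming_matrix(desc_a, desc_b):
    """Pairwise Hamming distances between two sets of byte-packed binary descriptors."""
    xor = np.bitwise_xor(desc_a[:, None, :], desc_b[None, :, :])
    return np.unpackbits(xor, axis=2).sum(axis=2)

def best_matches(desc_a, desc_b):
    """Nearest descriptor in B for each descriptor in A, sorted by ascending distance.

    Each match is a tuple (index_in_a, index_in_b, distance).
    """
    dist = hamming_matrix(desc_a, desc_b)
    nearest = dist.argmin(axis=1)
    matches = [(i, int(j), int(dist[i, j])) for i, j in enumerate(nearest)]
    return sorted(matches, key=lambda match: match[2])

def mean_top_distance(matches, k=10):
    """Mean distance of the k closest matches (fewer if fewer are available)."""
    top = matches[:k]
    return sum(match[2] for match in top) / len(top)

descriptors = np.array([[0b00001111, 0], [255, 255]], dtype=np.uint8)
print(hamming_matrix(descriptors, descriptors))
print(mean_top_distance(best_matches(descriptors, descriptors)))
```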

Improved ORB

To improve the matches given by ORB, we add one extra step. After comparing the keypoint descriptors given by ORB using Hamming distance, we extract the top 30 matches with the least distance and pass these to RANSAC to weed out wrong matching points. The more good matching points are selected, the more likely it is that RANSAC will give a correct matrix [38]. From RANSAC we get the true matching points out of the total matching points passed (30, in our case). We then calculate the ratio of true matches and use this true ratio to quantify the similarity of two images. In the equation below, $t_r$ denotes the true ratio, $n$ is the size of the match set passed to RANSAC, and $A_i$ is the value at the $i$th index of the array returned by RANSAC.

$$t_r = \frac{\sum_{i=0}^{n-1} A_i}{n}$$
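The true ratio can be illustrated with a small, self-contained RANSAC. This sketch makes several assumptions that are not taken from this work: it fits a 2-D affine model rather than the matrix estimated by the actual system, the tolerance of 3 pixels and the 200 iterations are arbitrary, and the function names are hypothetical. The returned mask plays the role of the array $A$ in the equation above (1 for a true match, 0 otherwise).

```python
import numpy as np

def ransac_inlier_mask(src, dst, iterations=200, tolerance=3.0, seed=0):
    """Return a 0/1 mask marking matches consistent with the best affine model found."""
    rng = np.random.default_rng(seed)
    n = len(src)
    homogeneous = np.hstack([src, np.ones((n, 1))])  # rows are (x, y, 1)
    best_mask = np.zeros(n, dtype=int)
    for _ in range(iterations):
        # Three point pairs determine a 2-D affine transform.
        sample = rng.choice(n, size=3, replace=False)
        model, *_ = np.linalg.lstsq(homogeneous[sample], dst[sample], rcond=None)
        error = np.linalg.norm(homogeneous @ model - dst, axis=1)
        mask = (error < tolerance).astype(int)
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask

def true_ratio(mask):
    """t_r: the fraction of the matches passed to RANSAC that it keeps as true matches."""
    return float(np.sum(mask)) / len(mask)

# Synthetic example: 30 matched points, the last 6 of which are wrong matches.
rng = np.random.default_rng(1)
src = rng.uniform(0, 500, size=(30, 2))
dst = src @ np.array([[0.9, 0.1], [-0.1, 0.9]]) + np.array([20.0, -5.0])
dst[24:] += 50.0
print(true_ratio(ransac_inlier_mask(src, dst)))
```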

5.4 Experimental Settings

5.4.1 Evaluation Metrics

To assess the performance of the image features discussed above and of the methodologies used to evaluate them, we use the standard information retrieval metrics, viz. Accuracy, Sensitivity, and Fall-out. Sensitivity, also known as the True Positive Rate (TPR) or recall, measures the proportion of positives that are correctly identified as such. Fall-out, also known as the False Positive Rate (FPR), is the proportion of non-relevant documents that are retrieved, out of all non-relevant documents available. TP denotes the number of images identified by the classifier system as similar that are actually similar. FN denotes the number of images identified by the classifier system as non-similar that are actually similar. FP denotes the number of images classified as similar by the system that are actually non-similar, and TN denotes the number of images correctly classified as non-similar.

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
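The three metrics can be computed directly from the four counts defined above. The sketch below is illustrative; the function name and the example counts are made up for the purpose of the example.

```python
def retrieval_metrics(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, fall_out) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    fall_out = fp / (fp + tn)     # false positive rate
    return accuracy, sensitivity, fall_out

# Hypothetical counts: 50 similar and 50 non-similar images.
accuracy, sensitivity, fall_out = retrieval_metrics(tp=40, fn=10, fp=5, tn=45)
print(accuracy, sensitivity, fall_out)
```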

5.5 Classification Results

In this section, we present the results of our classification models on the image dataset of five different events. The methodologies used are those discussed in section 5.3. We compare the results of the classifier with the results obtained from user annotation of the data. Fig 5.1 shows the input images taken for the five events. The details of the events are discussed in section 4.3.1.

Figure 5.1: Input images set.

5.5.1 Histogram

For each of the five experiments, we plotted the accuracy for Bhattacharyya distance thresholds from 0.1 to 0.9. As Fig 5.2 shows, the accuracy corresponding to a given threshold varies considerably across events. For example, at a threshold of 0.4, the accuracy for Shani Shingnapur and Kejriwal is approximately 0.6, whereas it is 0.9 for Charlie Hebdo and close to 0.8 for Kulkarni and Ram Rahim.

Figure 5.2: Accuracy plot for different distances between histograms of two images. Highly variant accuracy for different events.
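The accuracy curves in this section come from sweeping a distance threshold and comparing the resulting classification with the annotated labels. A minimal sketch of such a sweep is given below; the function names and the toy data are hypothetical, and treating a distance equal to the threshold as "similar" is an assumption of the example.

```python
def accuracy_at_threshold(distances, is_similar, threshold):
    """Fraction of images classified correctly when distance <= threshold means similar."""
    correct = sum((d <= threshold) == label for d, label in zip(distances, is_similar))
    return correct / len(distances)

def sweep_thresholds(distances, is_similar, thresholds):
    """Map each threshold to the accuracy obtained at that threshold."""
    return {t: accuracy_at_threshold(distances, is_similar, t) for t in thresholds}

# Toy data: distance of four images to the input image and their annotated labels.
distances = [0.1, 0.2, 0.6, 0.8]
labels = [True, True, False, False]
thresholds = [i / 10 for i in range(1, 10)]  # 0.1 to 0.9, as in the plots
for threshold, accuracy in sweep_thresholds(distances, labels, thresholds).items():
    print(threshold, accuracy)
```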

5.5.2 DAISY

We plot the histogram of the mean distance between DAISY descriptors for both similar and non-similar images of an event. The mean distance for similar and non-similar images is almost the same, ranging from 0.0 to 0.10, which makes it hard to select a threshold score below which images are classified as similar and above which they are classified as not similar. The dark green bars in Fig 5.3 represent the region of overlap between the two sets, which corresponds to the error of the algorithm and is quite high.

Figure 5.3: Histogram showing the mean distance between the DAISY descriptors of three events: (a) Kejriwal, (b) Charlie Hebdo and (c) Baba Ram Rahim. The distance for both similar and dissimilar sets lies in the range 0.0 to 0.10, with high overlap.

5.5.3 ORB

In ORB, the threshold represents the mean distance between the top ten closest matches obtained by the ORB descriptor match. The variation in accuracy is smaller than for Histogram. For example, for a threshold distance of 30, the accuracy for all five events is above 0.9 (90%), though the threshold at which 90% accuracy is first obtained differs for some events. This shows that ORB is a better choice than Histogram: because the histogram method is colour dependent, the choice of a good threshold varies considerably with the image dataset under consideration.

5.5.4 Improved ORB

In Improved ORB, after getting the match set of the descriptors from ORB, we pass the top thirty matches to RANSAC, which filters out the matches that are not true. The threshold value, in this case, represents the ratio of true matches returned by RANSAC to the total matches passed (thirty in our case). It is clear from the graph that this methodology shows the least variation in accuracy. After 0.3, the accuracy for each event is almost constant, and it is above 0.9 in all cases. Compared with the methods above, Improved ORB gives the best results. Fig 5.4 compares the accuracy of ORB and Improved ORB.

Figure 5.4: Comparing accuracy for different distances between image descriptors for (a) ORB and (b) Improved ORB. Improved ORB shows less variance in accuracy.

5.6 Time Evaluation

Since our aim is to build a real-time image search system, apart from efficiency in terms of accuracy we also need to select a technique that is time efficient. We now compare the approximate time consumed by the aforementioned image comparison techniques.

| Histogram | DAISY | ORB | Improved ORB |
| --- | --- | --- | --- |
| 0.75 sec | 8.12 sec | 1.23 sec | 1.25 sec |

Table 5.1: Comparing time (seconds) taken for comparing two images.

5.7 Variation in accuracy with modified images

In this section we analyse how the accuracy varies when the input image is edited, and how well the system performs in such cases. We consider different cases: images stitched with other images, images with text added, and cropped, scaled and contrasted images.

• Kulkarni Ink

We input modified images found on Twitter to the algorithm and analyse the variation in accuracy. We observe a dip in accuracy if the image is scaled by a factor greater than 2.5, or if the part of the image we are looking at for similarity is very small. There is not much variation, however, when text is added to the image.

Image stitched with other images and scaled

We plot the accuracy graphs for six different images that are stitched with other images. The part taken from the original image is scaled by a different factor in each, and in some images this part is also cropped. In Fig 5.5, we can observe that the cyan line shows the least accuracy, and the scaling factor of the corresponding image part is also quite high (9.6*3.87).

Figure 5.5: Scaling factors for the above images are: Cyan: (9.6*3.87), Purple: (2.08*1.25), Yellow: (1.9*1.2), Red: (2.1*1.23), Green: (1.7*1.3), Blue: (2.7*2.3).

Image with text added

In this sub-section we take input images that are modified by adding text and are scaled. Fig 5.6 shows how the accuracy varies for each of the four images. The blue line is for the original image and the others are for modified images. The image represented by the green line clearly shows a dip in accuracy, and its scaling factor is quite high (7.45*5.25), which again shows that performance dips when the scaling factor is high.

Figure 5.6: Scaling factors for the above images are: Blue: original image, size (600*815); Cyan: (1.2*1.08); Green: (7.45*5.25); Red: (1.66*1.7).

• Kejriwal Cartoon

In this part we take different images (marked as similar) from the related event database. Fig 5.7 shows the accuracy plot for the different images. The cyan and purple lines represent the most heavily modified images: they are scaled, a good proportion of each image is text, and the colours are also varied. We believe this is why they show lower accuracy than the other images.

Figure 5.7: Scaling factors for the above images are: Cyan: (2.35*2.40), Purple: (1.27*1.25), Green: (1.52*1.53), Blue: (1.08*1.07), Red: (1.04*1.07).

• Baba Ram Rahim

In this section we plot the accuracy for different images labelled similar for the Baba Ram Rahim event. Fig 5.8 shows how the accuracy varies when a modified image is taken as input. The images represented by the yellow and cyan lines are both highly cropped (out of the complete body, only the face is present). When images are highly cropped, the resolution changes, and this large variation in resolution causes the performance to dip. In addition, because these images are cropped to a high extent but the other images in the database are not, the keypoints that can be matched also vary.

Figure 5.8: Size of original image: 600*305. Yellow: cropped+text+image, size (226*203); Cyan: cropped image, size (559*289); Blue: scaled, size (254*328); Green: cropped, scaled, size (223*354); Purple: cropped, scaled, size (361*399); Red: cropped, scaled, left image size (220*373), right image size (158*370).

• Charlie Hebdo Cartoon

This part presents the accuracy plots for different modified images from the Charlie Hebdo case. Since the images in this database do not contain many colours, cropping does not affect the accuracy much. Fig 5.9 shows the accuracy variation for the different images. Again, the image with the highest scaling factor (cyan line) has the least accuracy of all, but the dip is not large because the image does not contain many colours.

Figure 5.9: Scaling factors for the above images are: Green: (2.02*1.54), Red: (1.04*1.28), Cyan: (3.9*2.4), Purple: (2.0*1.13), Blue: colour change (white background).

Chapter 6

Predicting Image Spread

6.1 Features Analysed

To predict the retweet count, we use four types of feature sets in our regression models. Since the retweet count values varied widely, we converted them to log scale. This section briefly describes these features.

6.1.1 Content-Based

In this section we study tweet-based features such as the length of the tweet, the number of hashtags, the number of media links attached, and the number of users mentioned. Tweet characteristics such as the length of the tweet can be associated with how informative the tweet is. A very short tweet might not contain much information and can thus be of less interest to readers. Hashtags in a tweet represent the broad topic discussed in the tweet; they also make the tweet easily searchable and attract an audience interested in the specific topic. The inclusion of media links such as image and video attachments increases the credibility of the information: since Twitter allows a maximum of 140 characters, these links provide additional information and thus increase the probability of diffusion. Another feature that can affect message diffusion is user mentions. Mentioning a user in a tweet automatically shows the tweet in the mentioned user's notifications; research shows that increasing mentions might have a negative impact on the retweet count, as the tweet becomes more specific to some users rather than to a general audience [11,22,37].
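The content-based features can be illustrated with a small extraction function. This is a sketch under stated assumptions: the function name is hypothetical, and hashtags, mentions and links are counted from the raw tweet text with regular expressions, whereas a real pipeline would more likely read them from the entity metadata returned with each tweet.

```python
import re

def content_features(text):
    """Count simple content-based features from the raw text of a tweet."""
    return {
        "length": len(text),
        "hashtags": len(re.findall(r"#\w+", text)),
        "mentions": len(re.findall(r"@\w+", text)),
        # Stand-in for the media link count: URLs that appear in the text.
        "links": len(re.findall(r"https?://\S+", text)),
    }

example = "Protest today #Delhi #news @user1 http://t.co/abc"
print(content_features(example))
```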

6.1.2 Sentiment

This factor is studied to understand whether the sentiment of the tweet text has an impact on its diffusion, i.e. whether diffusion differs when the sentiment of the tweet is negative or positive. The sentiment of the tweet text is analysed using the Sentiment140 API, which tags the tweet as positive, negative or neutral.

6.1.3 Structure-Based

In terms of structural features, we take features such as the number of followers a user has, the number of friends, the account age of the user, the number of favourites achieved, the follower-followee ratio, the age of the tweet, whether the user is verified, and the number of statuses the user has posted. In theory, the number of followers should be directly related to the retweet count: the more followers a user has, the more people see the tweet on their home page, which increases the chances of a retweet. Friends are the accounts that the user follows, so having more friends increases the amount of information available for the user to post. Factors such as account age and status count relate to the experience and credibility of the user. A verified user may have a different starting point in terms of diffusion compared with other users. To avoid bias from the large variation between small and large values, we used log-scale values for the status count, follower count, friends count, follower-followee ratio, and favourites count [11,22,37].

6.1.4 Image-Based

We define low-level image features such as the colour distribution in the image, and high-level features such as the objects present in the image. We calculated the mean red, green and blue intensities of the image to define the colour distribution. Since the image dataset we are looking at relates to events that have the potential to create law and order situations in society, we found that the common object in most of the images was a human face. We therefore take a binary variable that indicates the presence of a human face in the image. We used the Haar cascades implementation [1] to detect the presence of a human face.
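The low-level colour feature can be sketched as below. The function name is hypothetical and the image is assumed to be an array in RGB channel order; an image loaded with OpenCV would be in BGR order and would need its channels reversed first.

```python
import numpy as np

def mean_rgb(image):
    """Mean red, green and blue intensities of an H x W x 3 RGB image array."""
    pixels = np.asarray(image, dtype=float).reshape(-1, 3)
    red, green, blue = pixels.mean(axis=0)
    return float(red), float(green), float(blue)

image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, :, 0] = 200   # top row is strongly red
image[..., 2] = 50     # every pixel has some blue
print(mean_rgb(image))
```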

6.2 Features vs. Retweet Count

In this section we show how the features are correlated with the retweet count. We calculate the Spearman correlation values and filter out the features whose absolute correlation is less than 0.05. Table 6.1 shows the Spearman correlation and p-value of each feature with the retweet count.

| Feature | Spearman Correlation value | P-value |
| --- | --- | --- |
| Friends count | 0.077 | 0.004 |
| Favourites count | 0.433 | 2.2e-94 |
| Verified (binary) | 0.311 | 0.15 |
| Sentiment | 0.367 | 5.07e-60 |
| Tweet length | 0.363 | 1.58e-64 |
| Hashtag count | 0.267 | 1.05e-34 |
| Mention count | 0.603 | 4.2e-20 |
| Tweet age | -0.259 | 1.07e-32 |
| Status count | -0.059 | 0.007 |
| Face presence (binary) | -0.066 | 0.0026 |
| User age | 0.109 | 7.22e-07 |
| Media count | -0.683 | 0.002 |
| Follower-Followee ratio | -0.053 | 0.05 |
| Follower count | 0.012 | 0.57 |

Table 6.1: Spearman correlation and p-values of features with retweet count.
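The Spearman correlation is the Pearson correlation of the ranks of the two variables. The sketch below computes it with NumPy and applies the 0.05 cutoff; the function names are hypothetical, tied values receive their average rank, and applying the cutoff to the absolute value of the correlation is an assumption of the example. The three correlation values used at the end are taken from Table 6.1.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def select_features(correlations, cutoff=0.05):
    """Keep feature names whose absolute correlation reaches the cutoff."""
    return [name for name, rho in correlations.items() if abs(rho) >= cutoff]

print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # monotonic, so close to 1
print(select_features({"favourites": 0.433, "tweet_age": -0.259, "followers": 0.012}))
```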

6.3 Experimental Settings

Machine Learning

6.3.1 Linear Regression

In linear regression, the relationships are modelled using linear predictor functions whose unknown model parameters are estimated from the data. We use the following regression equation: $y = b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + c$, where $y$ is the estimated dependent variable, $c$ is a constant (which includes the error term), $b_i$ are the regression coefficients and $x_i$ are the independent variables. Linear regression attempts to fit the straight line that minimises the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. The principal advantages of linear regression are its simplicity, interpretability, scientific acceptance, and widespread availability [25]. Linear regression is the first method to try for many problems. We use the 'LinearRegression' module provided by the 'scikit-learn' library⁶ for this study.
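To illustrate how the coefficients and the constant are estimated, the sketch below fits the regression equation by ordinary least squares using NumPy's solver. This is a stand-in chosen to keep the example self-contained; it is not the scikit-learn module used in this study, and the function names and toy data are hypothetical.

```python
import numpy as np

def fit_linear(features, targets):
    """Least-squares fit of y = b1*x1 + b2*x2 + ... + c; returns (b, c)."""
    features = np.asarray(features, dtype=float)
    # Append a column of ones so that the constant c is estimated with the coefficients.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    params, *_ = np.linalg.lstsq(design, np.asarray(targets, dtype=float), rcond=None)
    return params[:-1], params[-1]

def predict_linear(features, coefficients, constant):
    """Evaluate the fitted regression equation on new feature rows."""
    return np.asarray(features, dtype=float) @ coefficients + constant

# Toy data generated from y = 2*x1 - 1*x2 + 3.
x = [[1, 2], [2, 1], [3, 5], [4, 3]]
y = [3, 6, 4, 8]
b, c = fit_linear(x, y)
print(b, c)
print(predict_linear([[5, 5]], b, c))
```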

6.3.2 Support Vector Regression

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection. Some advantages of support vector machines are: they are effective in high-dimensional spaces; they remain effective when the number of dimensions is greater than the number of samples; they use a subset of the training points (called support vectors) in the decision function, so they are also memory efficient; and different kernel functions can be specified for the decision function. We use the 'SVR' module provided by the 'scikit-learn' library.

⁶ http://scikit-learn.org/

6.3.3 Random Forest

A random forest is an ensemble of decision trees that together output a prediction value. Each decision tree is constructed using a random subset of the training data. Combining trees built in this way lets the model draw on the most informative features of the data, which improves predictive accuracy and controls over-fitting. The method is also commonly credited with coping well with high-dimensional data, missing values and outlier values. In our implementation, we set the number of trees in the random forest to fifty. We use the 'RandomForestRegressor' module provided by the 'scikit-learn' library for this study.

6.4 Training and Testing Data

We perform 10-fold cross-validation for computing the regression results. The dataset is partitioned into 10 subsets. In each test run, 9 subsets are used for training and the remaining subset is used as test data. We thus predict the target value over 10 test runs, which ensures that each subset has been used for training as well as for testing. The final evaluation result (RMSE in our case) is the average of the results from the 10 runs.
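The partitioning and averaging described above can be sketched in plain Python as follows. The function names are hypothetical, and the folds are taken as contiguous blocks without shuffling, which is an assumption of the example; the `evaluate` callback stands for training a model on the training indices and scoring it on the test indices.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validated_score(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)

# With 2040 samples, each of the 10 folds holds 204 test samples.
print([len(test) for _, test in k_fold_splits(2040)])
```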

Chapter 7

Results

7.1 Efficient Image Features and methodology for quantifying similarity

In this section, we evaluate the performance of the various features and techniques used for quantifying image similarity. We tested different image features such as colour, hash and keypoints to find similar images, and comparing keypoints as the feature set proved to be the most efficient. The methodology that worked best for detecting the keypoints and comparing them to get a similarity score was Improved ORB, which is a combination of ORB with RANSAC. We were able to achieve accuracy above 85% in all cases.

7.2 Comparing search results with Google reverse image lookup

Table 7.1 compares the results given by the proposed system with those of Google's reverse image lookup for two cases. In the first case, the input keyword to both systems is “ShipraMalik”; in the second case, it is “KanhaiyaKumar”. Fig 7.1 shows the input images given to both systems in cases 1 and 2 respectively. In Google's reverse image search, we specified the site of interest as “twitter.com”. Fig 7.2 and Fig 7.3 show the output of both systems for the “ShipraMalik” and “KanhaiyaKumar” images respectively, at the time these events occurred.

Figure 7.1: Input images for the keywords ShipraMalik and KanhaiyaKumar respectively, given to both the systems.

| Case | Proposed System’s Result | Google Reverse Image Lookup Result |
| --- | --- | --- |
| 1 | Total images: 28; Most Similar: 3; Moderately Similar: 1; Least Similar: 24 | Total images: 1 |
| 2 | Total images: 892; Most Similar: 42; Moderately Similar: 36; Least Similar: 814 | Total images: 48 |

Table 7.1: Comparing results of proposed system with Google’s reverse image lookup.

7.3 Real Time Image Search System

We built a real-time image search system in which a user inputs an image and a related text. The text is processed to extract keywords, and the extracted keywords are used to build a database of images using the REST API of Twitter. As images arrive in the database, an image comparison methodology starts running to find similar images. When the data collection stops, the system outputs three sets of images: most similar, moderately similar and least similar. The user can then further analyse the sentiments of the most similar set of images, the proportion of the retweet count and the users propagating these images. Fig 7.4 (a) and 7.4 (b) show the input and output screenshots of one such experiment.

Figure 7.2: Comparing output of proposed system with Google image search for Shipra Malik image. (a) Output of the system. (b) Output of the Google reverse image lookup.

Figure 7.3: Comparing output of proposed system with Google image search for Kanhaiya Kumar image. (a) Output of the system. (b) Output of the Google reverse image lookup.

7.4 Law Enforcement Agencies as beneficiaries of the system

Maintaining law and order requires the availability of information from different sources [13]. In the recent past, social media has been exploited to spread content that resulted in violence and disturbance in society. This content is often accompanied by images to leave a stronger, deeper impact on the masses [2]. In the past, LEA have worked on textual information from online social media, but few studies explore multimedia content such as images shared on social media [14,29,30]. To cover this gap, we built a system that serves as a platform where officials can find the spread of an image, the sentiments expressed with the image and the people who are propagating it. We conducted a usability study of our system with officers and found that, apart from identifying similar images on social media, other useful information included metadata such as (a) who the users spreading instigating images are, (b) what sentiments are expressed with the images, and (c) the locations where these images have spread. Currently the system is being used by a Government security agency.

7.5 Evaluating Metric for Prediction

To evaluate the performance of our regression models described in section 6.3, we use the sklearn.metrics module of the “scikit-learn” library. The scoring function used from this module is the mean squared error; taking the square root of the mean squared error gives the root mean squared error. If $y'_i$ is the predicted value of the $i$th sample and $y_i$ is the corresponding true value, then the root mean squared error (RMSE) estimated over $n$ samples is defined as:

$$\mathrm{RMSE}(y, y') = \sqrt{\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i - y'_i\right)^2}$$

The quality of prediction is inversely related to RMSE, i.e. the lower the value of RMSE, the better the performance of the model. To further enhance the robustness of the evaluation, we used one more metric to describe the performance of the regression models. This is also a function of the sklearn.metrics module provided by the “scikit-learn” library, known as the mean absolute error. If $y'_i$ is the predicted value of the $i$th sample and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n$ samples is defined as:

$$\mathrm{MAE}(y, y') = \frac{1}{n}\sum_{i=0}^{n-1}\left|y_i - y'_i\right|$$
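The two error measures used in this section can be written out in a few lines of Python. This is an illustrative sketch with hypothetical function names, not the sklearn.metrics functions themselves; it follows the definitions of RMSE and MAE as the root of the mean squared difference and the mean absolute difference between true and predicted values.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error over paired true and predicted values."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# One prediction out of four is off by 4: RMSE penalises it more than MAE.
print(rmse([1, 2, 3, 4], [1, 2, 3, 8]))
print(mae([1, 2, 3, 4], [1, 2, 3, 8]))
```

RMSE squares each error before averaging, so a single large error raises it more than it raises MAE; reporting both gives a fuller picture of the error distribution.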

7.6 Comparing Regression Models

In this section, we describe the results obtained by our prediction models described in section 6.3. The three algorithms we tested are Linear Regression, Support Vector Regression and Random Forest. We performed the experiments on a total dataset of 2040 unique tweets, collected from six different events. The resultant RMSE and MAE values are obtained using 10-fold cross-validation. We obtained root mean squared error (RMSE) scores of 1.67, 2.42 and 1.10, and mean absolute error (MAE) scores of 1.37, 2.21 and 0.72, for linear regression, support vector regression and random forest respectively. The results show that random forest performs best, as it gives the lowest RMSE and MAE scores.

Figure 7.4: (a) Input to the system is an image URL and a keyword “KanhaiyaKumar”. (b) Output screen: the system finds a total of 892 images related to the keyword KanhaiyaKumar; the 42 images labelled as most similar are shown above.

Chapter 8

Conclusions, Limitations, Future Work

8.1 Conclusions

In this study, we focus on how to assist law enforcement agencies in finding and analysing the spread of images on social media that can potentially affect the law and order situation in society. For this, we developed a system where a user can enter an image and a related keyword, and the system returns a set of images similar to the input image that are circulating on the micro-blogging site Twitter, along with some user and text analysis. The system developed is currently being used by Government security agencies.

To find similar images on Twitter, we tested various methodologies and found that Improved ORB works best, with an accuracy above 85%. We also analysed how the accuracy of the system varies when the input image is scaled, cropped, contrasted, or stitched with another image or text. We found that accuracy dips when the input image is scaled heavily (scaling factor more than 3.0) or when the image is highly cropped, leading to a large variation in resolution. We also observed that modifying a more colourful image leads to lower accuracy than modifying a less colourful image, but even in such cases we were able to achieve an accuracy above 80%.

To further help the police force take proactive steps and understand the spread of such images on Twitter, we target the task of predicting the spread of tweets. Since retweet count is the basic parameter for measuring the diffusion of messages on Twitter, we train models to predict retweet count. To achieve this, we combined well-studied feature sets (content-based, structure-based, sentiment, and image features) in our regression models. We trained linear regression, support vector regression, and random forest regression with these features on a total of 2,040 tweets and obtained RMSE values of 1.67, 2.42 and 1.10, and MAE values of 1.37, 2.21 and 0.72, respectively. With this we conclude that our random forest regression model outperforms the other two models. We used the human face as the high-level image feature object to be identified. We conjecture that future work on high-level feature detection in images can provide a better approximation of image virality.

8.2 Limitation

We have used Improved ORB for a real-time system. Although ORB uses a pyramid scheme for scale invariance, the experiments show that performance dips when the scaling factor is more than 3.0. A similar dip in performance can be seen when images are cropped to an extent that leads to a high variation in resolution. Another limitation, in the case of prediction, is the small dataset of 2,040 tweets. Though we collected data for 6 events, we were left with few tweets after keeping only the unique tweets per user that contained images, and only the images in which Haar cascades detected the presence of a human face. We speculate that predicting on a larger dataset can improve the quality of our prediction model.

8.3 Future Work

As part of future work, we can reduce the time taken by the system to collect data in real time and find the set of similar images by performing these tasks in a multi-threaded way. We can also quantify the scaling factor and cropping factor (which leads to resolution variation) above which there is a sudden dip in performance and the proposed system deviates from the reported accuracy. In this work we considered only the human face in the high-level image feature set for the prediction model; we can expand the dataset and find other objects whose presence in such images can accelerate diffusion. Our focus in this work was on the micro-blogging site Twitter; we can also extend our work to other widely used social networking sites such as Facebook, Instagram, Tumblr and Flickr.

Bibliography

[1] Haar cascades. https://github.com/Itseez/opencv/tree/master/data/haarcascades.

[2] The power of visual storytelling. http://curve.gettyimages.com/article/the-power-of-visual-storytelling.

[3] Sentiment140 general information. http://help.sentiment140.com/.

[4] The positives of social media: Spread of information. http://lifeasoflate.com/2013/11/the-positives-of-social-media-spread-of-information.html, 2013.

[5] 85 http://timesofindia.indiatimes.com/india/85-of-space-on-cigarette-packs-to-be-covered-with-warning/articleshow/44824690.cms, 2014.

[6] Ajinkye. Best reverse image search engines, apps and its uses (2016). http://beebom.com/2016/01/reverse-image-search-engines-apps-uses, 2016.

[7] Mubarak Ansari. Fb post shuts down pune. http://www.punemirror.in/pune/crime/FB-post-shuts-down-Pune/articleshow/35911896.cms, 2014.

[8] Shlomi Babluki. An efficient way to extract the main topics from a sentence. https://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/, 2013.

[9] BBC. Mumbai ink attack: India police arrest six shiv sena workers. http://www.bbc.com/news/world-asia-india-34513322, 2015.

[10] Bertolami. Perceptual hashing. http://bertolami.com/index.php?engine=blog&content=posts&detail=perceptual-hashing, 2014.

[11] Ethem F Can, Hüseyin Oktay, and R Manmatha. Predicting retweet count using visual cues. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 1481–1484. ACM, 2013.

[12] Sushant Kulkarni Chandan Shantaram Haygunde. Muslim techie beaten to death in pune, 7 men of hindu outfit held. http://indianexpress.com/article/india/politics/muslim-techie-beaten-to-death-in-pune-7-men-of-hindu-outfit-held, 2014.

[13] Hsinchun Chen, Jenny Schroeder, Roslin V Hauck, Linda Ridgeway, Homa Atabakhsh, Harsh Gupta, Chris Boarman, Kevin Rasmussen, and Andy W Clements. Coplink connect: information and knowledge management for law enforcement. Decision Support Systems, 34(3):271–285, 2003.

[14] Edward F Davis, Alejandro A Alves, David Alan Sklansky, et al. Social media and police leadership: Lessons from boston. 2014.

[15] Indian Express. Shani shingnapur temple protests: Can’t worship women as goddesses and also deny them right to pray. http://indianexpress.com/article/blogs/shani-shingnapur-sabarimala-temple-entry-protest-rights/, 2016.

[16] Hacker Factor. The hacker factor. http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html, 2013.

[17] Google. Reverse image search. https://support.google.com/websearch/answer/1325808?hl=en.

[18] The Guardian. Charlie hebdo cartoon depicting drowned child alan kurdi sparks racism debate. http://www.theguardian.com/media/2016/jan/14/charlie-hebdo-cartoon-depicting-drowned-child-alan-kurdi-sparks-racism-debate, 2016.

[19] PS Hiremath and Jagadeesh Pujari. Content based image retrieval using color, texture and shape features. In Advanced Computing and Communications, 2007. ADCOM 2007. International Conference on, pages 780–784. IEEE, 2007.

[20] Maximilian Jenders, Gjergji Kasneci, and Felix Naumann. Analyzing and predicting viral tweets. In Proceedings of the 22nd international conference on World Wide Web companion, pages 657–664. International World Wide Web Conferences Steering Committee, 2013.

[21] Sangoh Jeong. Histogram-based color image retrieval. Psych221/EE362 project report, 2001.

[22] Alex Leavitt and Josh Clark. Upvoting hurricane sandy: Event-based news production on a social news site, reddit, 2013.

[23] Jesse Mawhinney. 37 visual content marketing statistics you should know in 2016. http: //blog.hubspot.com/marketing/visual-content-marketing-strategy, 2016.

[24] Philip J McParlane, Yashar Moshfeghi, and Joemon M Jose. Nobody comes here anymore, it’s too crowded; predicting image popularity on flickr. In Proceedings of International Conference on Multimedia Retrieval, page 385. ACM, 2014.

[25] FT Press. Predictive analytics techniques. http://www.ftpress.com/articles/article. aspx?p=2248639&seqNum=5, 2014.

41 [26] Adrian Rosebrock. How-to: 3 ways to compare histograms us- ing opencv and python. http://www.pyimagesearch.com/2014/07/14/ 3-ways-compare-histograms-using-opencv-python/, 2014.

[27] Adrian Rosebrock. How-to: Python compare two images. http://www.pyimagesearch. com/2014/09/15/python-compare-two-images/, 2014.

[28] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: an efficient alter- native to sift or surf. In (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011.

[29] Niharika Sachdeva and Ponnurangam Kumaraguru. Online social networks and police in india: understanding the perceptions, behavior, challenges. In ECSCW 2015: Proceedings of the 14th European Conference on Computer Supported Cooperative Work, 19-23 September 2015, Oslo, Norway, pages 183–203. Springer, 2015.

[30] Niharika Sachdeva and Ponnurangam Kumaraguru. Social networks for police and residents in india: exploring online communication for crime prevention. In Proceedings of the 16th Annual International Conference on Digital Government Research, pages 256–265. ACM, 2015.

[31] Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, and Alexei A Efros. Data-driven visual similarity for cross-domain image matching. In ACM Transactions on Graphics (TOG), volume 30, page 154. ACM, 2011.

[32] Anshuman Vikram Singh. Content-based image retrieval using . 2015.

[33] TinEye. Tineye documentation: What is matchengine. https://services.tineye.com/developers/matchengine/what.html.

[34] India Today. Arvind kejriwal slammed on twitter again! twitterati trend kejriwalinsultshanuman. http://indiatoday.intoday.in/story/kejriwal-insults-hanuman-delhi-cm-slammed-on-twitter/1/597147.html, 2016.

[35] India Today. Case against gurmeet ram rahim for posing as vishnu, will he be arrested? http://indiatoday.intoday.in/story/case-against-gurmeet-ram-rahim-for-posing-as-vishnu-will-he-be-arrested/1/573867.html, 2016.

[36] Engin Tola, Vincent Lepetit, and Pascal Fua. Daisy: An efficient dense descriptor applied to wide-baseline stereo. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(5):815–830, 2010.

[37] Bob van de Velde, Albert Meijer, and Vincent Homburg. Police message diffusion on twitter: analysing the reach of social media communications. Behaviour & Information Technology, 34(1):4–16, 2015.

[38] Lei Yu, Zhixin Yu, and Yan Gong. An improved orb algorithm of extracting and matching. 2015.

[39] Tauhid Zaman, Emily B Fox, Eric T Bradlow, et al. A bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3):1583–1611, 2014.

[40] Qingyuan Zhao, Murat A Erdogdu, Hera Y He, Anand Rajaraman, and Jure Leskovec. Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1513–1522. ACM, 2015.
