Image Search for Improved Law and Order: Search, Analyse, Predict image spread on Twitter
Student Name: Sonal Goel
IIIT-D-MTech-CS-GEN-16-MT14026
April, 2016
Indraprastha Institute of Information Technology
New Delhi
Thesis Committee
Dr. Ponnurangam Kumaraguru (Chair)
Dr. AV Subramanyam
Dr. Samarth Bharadwaj
Submitted in partial fulfillment of the requirements for the Degree of M.Tech. in Computer Science, in General Category
©2016 IIIT-D-MTech-CS-GEN-16-MT14026
All rights reserved
Keywords: Image Search, Image Virality, Twitter, Law and Order, Police
Certificate
This is to certify that the thesis titled "Image Search for Improved Law and Order: Search, Analyse, Predict Image Spread on Twitter" submitted by Sonal Goel in partial fulfillment of the requirements for the degree of Master of Technology in Computer Science & Engineering is a record of the bonafide work carried out by her under our guidance and supervision in the Security and Privacy group at Indraprastha Institute of Information Technology, Delhi. This work has not been submitted anywhere else for the award of any other degree.
Dr. Ponnurangam Kumaraguru
Indraprastha Institute of Information Technology, New Delhi
Abstract
Social media is often used to spread images that can instigate anger among people and hurt their religious, political, caste, and other sentiments, which in turn can create law and order situations in society. Law Enforcement Agencies (LEA) therefore need to inspect the spread of images related to such events on social media in real time. To help LEA analyse image spread on microblogging websites like Twitter, we developed an open-source real-time image search system, where the user provides an image and a supporting text related to it, and the system finds the images that are similar to the input image along with their occurrences. The proposed system is robust enough to identify images that are cropped (up to a certain factor), scaled (up to a certain factor), embedded with text, stitched with other images, varied in brightness, rotated, or modified by some combination of these. On the input text, the system runs a text mining algorithm to extract keywords, retrieves images related to these keywords from Twitter, and uses an image comparison methodology to extract similar images. The system can analyse the users propagating the content, the sentiments accompanying it, and the ratio of tweets to retweets. We found that Improved ORB (Oriented FAST and Rotated BRIEF) performs best for finding image similarity, with an accuracy above 85% in all the tested cases. The system is being used by one of the Government security agencies.
In addition to identifying similar images, we also aim to predict the influence on people of events that can create law and order issues in society. On microblogging sites like Twitter, the information in tweets diffuses among users through retweets. Hence, to better understand and control the diffusion of these kinds of images, we predict the retweet count of such images using visual cues from the images, content-based information and structure-based features. For this, we build a random forest regression model that takes tweet, image and structural features as input to predict the retweet count.
Acknowledgments
I would like to express my deepest gratitude to my advisor Dr. Ponnurangam Kumaraguru for his guidance and support. The quality of this work would not have been nearly as high without his well-appreciated advice. I would like to thank my esteemed committee members Dr. A.V. Subramanyam and Dr. Samarth Bharadwaj for agreeing to evaluate my thesis work. I thank Dr. Samarth Bharadwaj for enriching this thesis with his valuable suggestions and feedback.
I thank all the members of the Precog research group at IIIT-Delhi for their valuable feedback and suggestions, especially Niharika Sachdeva for shepherding me and spending her valuable time helping this thesis take shape. Special thanks to all members of the CERC group at IIIT-Delhi and Hareesh Ravi for their valuable inputs.
Last but not least, I would like to thank my supportive family, who encouraged me and kept me motivated throughout the project.
Contents
1 Research Motivation and Aim
1.1 Research Motivation
1.2 Research Aim
2 Related Work
2.1 Comparing Methodologies
2.2 Image Retrieval Real Time Systems
2.2.1 Google reverse image lookup
2.2.2 TinEye
2.3 Social Media for improved policing
2.4 Information Diffusion on Social Media
2.5 Finding and Predicting Image Virality on Online Social Media for improved law and order
3 Contributions
4 Proposed Methodology
4.1 Architecture design for the search system
4.2 Solution approach for prediction
4.3 Data Collection
4.3.1 Search System
4.3.2 Prediction
4.4 Data Annotation
5 Quantifying Image Similarity
5.1 Keyword Extraction
5.2 Challenges in finding similar images
5.3 Image features to analyse similarity
5.3.1 Hashing
5.3.2 SSIM
5.3.3 Colour Histogram
5.3.4 Keypoint Descriptors
5.4 Experimental Settings
5.4.1 Evaluation Metrics
5.5 Classification Results
5.5.1 Histogram
5.5.2 DAISY
5.5.3 ORB
5.5.4 Improved ORB
5.6 Time Evaluation
5.7 Variation in accuracy with modified images
6 Predicting Image Spread
6.1 Features Analysed
6.1.1 Content-Based
6.1.2 Sentiment
6.1.3 Structure-Based
6.1.4 Image-Based
6.2 Features vs. Retweet Count
6.3 Experimental Settings
6.3.1 Linear Regression
6.3.2 Support Vector Regression
6.3.3 Random Forest
6.4 Training and Testing Data
7 Results
7.1 Efficient Image Features and methodology for quantifying similarity
7.2 Comparing search results with Google reverse image lookup
7.3 Real Time Image Search System
7.4 Law Enforcement Agencies as beneficiaries of the system
7.5 Evaluating Metric for Prediction
7.6 Comparing Regression Models
8 Conclusions, Limitations, Future Work
8.1 Conclusions
8.2 Limitation
8.3 Future Work
List of Figures
1.1 News headlines showing the impact of some recent events
4.1 Architecture diagram for the search system
5.1 Input images set
5.2 Accuracy of coloured histogram
5.3 Daisy distance for similar and dissimilar images
5.4 ORB and Improved ORB
5.5 Plotting accuracy for modified images: Kulkarni event-1
5.6 Plotting accuracy for modified images: Kulkarni event-2
5.7 Plotting accuracy for modified images: Kejriwal cartoon
5.8 Plotting accuracy for modified images: Baba Ram Rahim images
5.9 Plotting accuracy for modified images: Charlie Hebdo cartoon
7.1 Comparing proposed system with Google reverse image lookup
7.2 Screen shots comparing output of proposed system with Google reverse image lookup for Shipra Malik image
7.3 Screen shots comparing output of proposed system with Google reverse image lookup for Kanhaiya Kumar image
7.4 Screen Shot of the system
List of Tables
4.1 Data collection for six events
4.2 Data after annotation shows the number of similar and non-similar images
5.1 Comparing time (seconds) taken for comparing two images
6.1 Spearman correlation and p-values of features with retweet count
7.1 Comparing results of proposed system with Google's reverse image lookup
Chapter 1
Research Motivation and Aim
1.1 Research Motivation
Images have a high impact on both the online and the offline world. Recently, the Union Health Ministry made it mandatory for cigarette manufacturing companies to reserve 60 percent of the space on the pack for pictorial warnings [5]. Images shared on social media are equally influential, having the power to evoke people's emotions and to drive deeper engagement and more profound changes in behaviour [2]. Before the advent of social media, news related to events was more localised and took more time to spread. However, with the widespread usage of social media, information spreads like wildfire and reaches a larger audience, making social media a powerful tool for spreading information [4]. A study reveals that when people hear information, they are likely to remember only 10 percent of it three days later. However, if a relevant image is paired with the same information, people retain 65 percent of it three days later [23]. On social media, Facebook posts with images achieve 2.3 times more engagement than those without images, and Buffer reported that, for its user base, tweets with images received 150 percent more retweets than tweets without images [23]. When images related to a disturbing event are shared, they have the potential to influence public opinion and hurt people's religious, political, caste, or communal sentiments, which in turn can disturb the law and order situation in society. For instance, in June 2014 a Facebook post containing obscene images of king Shivaji Maharaj, late Shiv Sena leader Bal Thackeray and others provoked violent protests in Maharashtra: many public vehicles were damaged, many people were injured by stone pelting, and the angry crowd even killed a Muslim techie, giving the protest communal colours [7, 12]. In another example, in 2015 a self-proclaimed Godman, Baba Ram Rahim, posted an image of himself posing as the Hindu God Vishnu on social media sites. This image hurt the religious sentiments of a section of society, who lodged a complaint against him. Fig 1.1 shows some newspaper articles about the unrest created by these events in society [35].
Figure 1.1: (a) Report from Pune Mirror showing how riots in Pune closed the city. (b) Report from Times of India showing the after-effects of the Baba Ram Rahim image.
Given such severe repercussions of posting instigating images on social media, law enforcement agencies need to assess the impact of such images on society and find ways to control their spread. A potential solution is to build a system that reports how viral a particular image has become, based on the frequency with which similar images are shared, and then to further understand and manage the diffusion by predicting the future spread of such information. Existing research has explored Content Based Image Retrieval (CBIR) and image similarity [31,32,38], as well as the prediction of information diffusion [11,39,40]. However, to the best of our knowledge, less attention has been paid to (a) how image spread on Online Social Networks (OSN), and specifically on micro-blogging sites, can affect peace in society, and (b) what steps can be taken to help law enforcement agencies maintain a healthy law and order situation in society. This work takes a step towards filling this gap.
1.2 Research Aim
We aim to develop a real-time image search system which can aid law enforcement agencies in analysing the spread of an image. The system takes an image as input along with a supporting text, and returns the images that are related to the text keywords and similar to the input image. The system can return images that are scaled, cropped, stitched with other images, overlaid with some text, or slightly changed in colour. The major outputs of the system are:
• Count of most similar, moderately similar and least similar images found on the microblogging site
• Analysis of sentiments floating with the most similar set of images
• Analysis of Tweet vs. Retweet ratio
• Analysis of users who propagate these images
Further, to understand the spread of information, we predict the approximate diffusion of images. We do this by training machine learning models to predict the retweet count of tweets, which on Twitter is a definitive parameter for approximating diffusion. We tested various textual, structural, and visual features to predict the spread.
Chapter 2
Related Work
2.1 Comparing Methodologies
Image retrieval has become an important topic in the past decade. Work has been done on finding visually similar images and on identifying objects in an image. One technique is to extract visual features of images, such as colour, texture, and shape information, and then measure the similarity of images by comparing these features [19]. Another widely studied method is to find the important keypoints in the images and compare these keypoints. Feature matching using SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features) has proven remarkably successful in a number of applications using visual features, including object recognition, image stitching, and visual mapping. However, these methods are computationally expensive and therefore not very efficient for real-time systems. Rublee et al. showed that ORB (Oriented FAST and Rotated BRIEF), which combines FAST (Features from Accelerated Segment Test) keypoint detectors with BRIEF (Binary Robust Independent Elementary Features) descriptors, gives almost the same performance at a much lower cost [28]. Yu et al. also compared traditional feature matching techniques like SIFT, SURF, DAISY and ORB with their own technique, an improved version of ORB, and found that ORB and its variant outperform the traditional techniques because BRIEF returns a binary descriptor vector, and binary strings can be compared quickly using the Hamming distance, computed with XOR operations [38].
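The XOR-based Hamming comparison mentioned above can be illustrated with a minimal sketch. It is not taken from the cited works: the function name and the toy 4-byte descriptors are assumptions made for illustration (descriptors produced by OpenCV's default ORB are 32 bytes long).

```python
def hamming_distance(desc_a: bytes, desc_b: bytes) -> int:
    """Count the bits that differ between two equal-length binary descriptors."""
    if len(desc_a) != len(desc_b):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 bit wherever the two descriptors disagree; count those bits.
    return sum(bin(a ^ b).count("1") for a, b in zip(desc_a, desc_b))


# Toy 4-byte descriptors, hypothetical values chosen only for this example.
d1 = bytes([0b10110010, 0b00001111, 0b11110000, 0b01010101])
d2 = bytes([0b10110011, 0b00001111, 0b00000000, 0b01010101])
print(hamming_distance(d1, d2))
```

A smaller distance means the two keypoints look more alike; because the comparison is only XOR and bit counting, it is much cheaper than the floating-point distances used with SIFT or SURF descriptors.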
2.2 Image Retrieval Real Time Systems
Reverse image search engines are a special kind of search engine where a user submits a picture and the system retrieves images similar to the input image [6]. Some popular reverse image search engines are:
2.2.1 Google reverse image lookup
A reverse photo search here is performed by uploading an image from the computer, pasting the link of the image in the search bar, or simply dragging and dropping the image into the search bar. All of these work equally well. Google Images uses algorithms based on various attributes like shape, size, colour, keypoints and resolution to find similar pictures [6,17] (footnote 1).
2.2.2 TinEye
TinEye is the first image search engine on the web to use image identification technology rather than keywords, metadata or watermarks. TinEye regularly crawls the web for new images. When a user inputs an image, it creates a unique and compact digital signature or 'fingerprint' for it, then compares this fingerprint to every other image in its index to retrieve matches. It does not typically find similar images (i.e. a different image with the same subject matter); it finds exact matches, including those that have been cropped, edited or resized [6,33] (footnote 2).
2.3 Social Media for improved policing
Police forces widely use OSN to maintain law and order and to improve communication with citizens. Research has been done on how OSN-based technology can be adapted to support communication and collaboration for a safer society in developing countries like India. Research shows that police in developed nations have realised the effectiveness of OSN in various activities such as investigation, crime identification, intelligence development, and community policing [29]. A few studies in India show that OSN was used to spread misinformation and public agitation during crisis events such as the Mumbai terror attacks (2011), the Muzaffarnagar riots and the Assam disturbance (2012). These studies report that, in these events, panic was spread through fake images, messages, and videos on OSN. Overall, several studies have examined how OSN can help police and first responders maintain law and order during riots and crises, prevent crime, increase public trust, and maintain peace in society [29].
2.4 Information Diffusion on Social Media
Research has been done on predicting and understanding the spread of information both on online social media in general and on Twitter in particular [20]. One study shows that the popularity of an image on Flickr is closely related to the user's popularity and activity level as well as to the image's topical content or tags; it takes image features, text features, and user context features into account [24]. Another work predicts diffusion on Twitter by predicting the retweet count using textual features, user-based features and visual features of the image [11].
1 https://images.google.com/
2 https://www.tineye.com
2.5 Finding and Predicting Image Virality on Online Social Media for improved law and order
Our research is motivated by the work described above. To the best of our knowledge, however, this is the first research focussed on image spread that helps the police force find the actual spread of an image that can create unrest in society, by retrieving similar images shared on a microblogging site in real time. We also aim to help police take proactive measures by predicting the approximate spread of the image on microblogging sites, which in our case are represented by Twitter.
Chapter 3
Contributions
In this thesis we provide an integrated solution for law enforcement agencies that helps them identify and evaluate the spread, on micro-blogging sites like Twitter, of images that can result in law and order situations in society. To this end we make the following contributions.
• We compare different image features for finding the similarity between images and conclude that the technique that uses ORB in combination with RANSAC (Random Sample Consensus) to retrieve similar images from Twitter gives the best result. Using this technique, our system is able to quantify the spread with an accuracy above 85% in all the tested cases.
• The system is able to retrieve images that are cropped, scaled to an extent, overlaid with text, contrasted or brightened, or stitched with other images.
• We also observe that images which are highly scaled (by more than a factor of 3.0), highly cropped, or modified in many of their colours show reduced accuracy when given to the system as input.
• The system is currently being used by one of the Government security agencies.
To enhance the capability of law enforcement agencies to take proactive preventive measures and to understand the spread of such images, we forecast the approximate spread by predicting the retweet count of an image.
• We use structure-based, tweet-based and image-based features to predict the retweet count.
• In contrast to previous work, we find that low-level image-based features like mean red, mean blue and mean green are not very effective for predicting the retweet count, as their Spearman correlation with the retweet count is low (less than 0.05). We observe that a high-level image feature like the presence of a face in the image gives a decent correlation and can be used as a parameter for prediction.
• After testing different machine learning models, we find that the random forest regression model predicts the retweet count with the highest accuracy (lowest RMSE) using structural, textual and image-based features.
Chapter 4
Proposed Methodology
4.1 Architecture design for the search system
The proposed system takes an image and a text related to it as input. The text is passed to a keyword extraction algorithm, which extracts keywords from it. These keywords are then given to Twitter's Search API (Application Programming Interface), part of Twitter's REST API (footnote 3). The API returns a set of tweets related to the keywords it receives as a query, and the system filters the tweets that contain images and stores them in a database. As images are stored in the database, an image comparison methodology runs in parallel and computes a similarity score between the input image and each image in the database. We then set thresholds t1 and t2: if the similarity score is more than t1, the system classifies the image as most similar; if the score is less than t1 but more than t2, the image is classified as moderately similar; and images with a similarity score less than t2 go into the set of least similar images. After the classification is complete, the system outputs all three labelled sets for visualisation and then analyses the content of the most similar images. It extracts the text related to the most similar set and performs sentiment analysis using the Sentiment140 API (footnote 4). The Sentiment140 API is trained for classifying tweets and returns a score classifying a tweet's sentiment as positive, negative or neutral [3]. The system returns a comparison visual showing the percentage of tweets with negative, positive and neutral sentiments. Likewise, the system analyses the ratio of tweets to retweets. Another useful feature is the analysis of the users who are propagating these events: the system returns the profile pictures of the users and their basic details from Twitter, such as username, description (if available), and the number of followers. Fig 4.1 shows the architecture diagram of the system.
3 https://dev.twitter.com/rest/public/search
4 http://www.sentiment140.com/
Figure 4.1: Architecture diagram of the system. The inputs to the system are an image and a keyword. The Twitter REST API takes the keyword and forms an image database; we compare the input image with the images in the database to get similar images, on which the system further performs sentiment and user analysis.
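The two-threshold classification described in this section can be sketched as follows. This is a minimal illustration rather than the system's code: the function name, the bucket labels, the example scores and the threshold values 0.6 and 0.3 are all assumptions, and the text does not say where a score exactly equal to a threshold belongs, so the sketch places it in the lower bucket.

```python
def classify_by_similarity(scores, t1, t2):
    """Split image ids into three buckets using thresholds t1 > t2."""
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")
    buckets = {"most_similar": [], "moderately_similar": [], "least_similar": []}
    for image_id, score in scores.items():
        if score > t1:
            buckets["most_similar"].append(image_id)
        elif score > t2:
            buckets["moderately_similar"].append(image_id)
        else:
            # Assumption: a score equal to a threshold falls into the lower bucket.
            buckets["least_similar"].append(image_id)
    return buckets


# Hypothetical similarity scores in [0, 1] for three stored images.
example_scores = {"img_a": 0.9, "img_b": 0.5, "img_c": 0.1}
result = classify_by_similarity(example_scores, t1=0.6, t2=0.3)
print(result)
```

Keeping the thresholds as parameters makes it easy to tune them per image comparison method, since different similarity measures produce scores on different scales.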
4.2 Solution approach for prediction
The length of a tweet on Twitter is limited to 140 characters, so it is hard to analyse features based on the text alone. We therefore include features of the image attached to the tweet to predict its approximate spread. The most trusted parameter for defining the diffusion of content on Twitter is the retweet count [11], so we aim to predict the retweet count of a tweet as a measure of its diffusion. Our focus is on unique tweets containing an image link. We extract various structural, contextual, sentiment and visual features of the tweet and of the image attached to it. After extracting the set of features, we calculate the Spearman correlation between the target variable (retweet count) and each feature to select the most related features. We then train different machine learning regression models on this feature set to forecast the retweet count of a tweet.
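The correlation-based feature filtering step can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the code used in this work: the helper names, the example feature columns and the cut-off of 0.5 are hypothetical, and a constant feature (whose correlation is undefined) is simply given a correlation of 0.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (sx * sy)


def select_features(features, target, min_abs_rho):
    """Keep the features whose |rho| with the target reaches the cut-off."""
    rho = {name: spearman(col, target) for name, col in features.items()}
    return {name: r for name, r in rho.items() if abs(r) >= min_abs_rho}


# Hypothetical feature columns and retweet counts for four tweets.
features = {"followers": [10, 200, 50, 1000], "mean_red": [0.5, 0.5, 0.5, 0.5]}
retweets = [1, 20, 5, 300]
selected = select_features(features, retweets, min_abs_rho=0.5)
print(selected)
```

Spearman correlation works on ranks, so it captures monotonic relationships and is not dominated by the few tweets with very large retweet counts.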
4.3 Data Collection
4.3.1 Search System
We collected tweets containing images related to some recent events (as of April 2016) that created unrest in the society in some form. In this section we describe the events, and data collected for each event. Fig 5.1 shows the images that went viral during these events.
• Kulkarni Ink: Black ink was sprayed on the technocrat-turned-columnist Sudheendra Kulkarni by members of a political party, ahead of the launch of former Pakistan Foreign Minister Khurshid Mahmud Kasuri's book, 'Neither a Hawk nor a Dove: An Insider's Account of Pakistan's Foreign Policy'. An FIR was lodged against the party workers and six workers were arrested [9]. The incident was condemned on social media and images of the man with black ink on his face went viral. We collected a total of 1,912 images using the keyword "Kulkarni".
• Baba Ram Rahim: A self-proclaimed Godman, Baba Ram Rahim, posted pictures of himself posing as the Hindu God Vishnu. He faced trouble when the All India Hindu Student Federation accused him of insulting Lord Vishnu and hurting the religious sentiments of Hindus by dressing up as Lord Vishnu, and lodged a complaint against him [35]. We were able to collect 411 images using the keyword #RamRahim.
• Cartoon by Kejriwal: A cartoon tweeted by Delhi's Chief Minister Arvind Kejriwal drew much flak from the BJP and its affiliates for allegedly hurting religious sentiments. The cartoon purportedly showed Hanuman clad in saffron robes speaking to Prime Minister Narendra Modi. A political party, the Hindu Sena, lodged a complaint alleging that the Twitter message was posted with intent to hurt the religious sentiments of Hindus [34]. We collected 666 images using the keyword #KejrwalInsultsHanuman.
• Charlie Hebdo Cartoon: The French satirical magazine Charlie Hebdo sparked outrage by publishing a cartoon attempting to satirise the Syrian refugee crisis. The cartoon imagines that Alan Kurdi, the three-year-old Syrian who died at sea in September 2015 on the way to Europe, has grown up to be a sexual abuser. Many called the cartoon racist and in extremely poor taste [18]. We collected 570 images using the keyword #CharlieHebdo.
• Shani Shingnapur protest: As many as 1,500 women, mostly housewives and college students, planned to storm the Shani Shingnapur temple in Ahmednagar district on Republic Day. The protesters wanted to end the age-old humiliating practice of not allowing women to enter the core shrine area [15]. We collected 183 images related to the keyword #ShaniShingnapur.
4.3.2 Prediction
We collected data for six events: the five discussed above and the Rohith Vemula suicide case, for which we used the keyword #rohithvemulla. After collecting tweets containing images, we kept only the unique tweets posted by the original user, to avoid bias. A total of 2,040 unique tweets were collected.
Keyword                   Total tweets collected   Unique tweets count
#kulkarni                 1,912                    404
#RamRahim                 420                      117
#KejrwalInsultsHanuman    1,400                    665
#CharlieHebdo             1,079                    312
#ShaniShingnapur          1,230                    183
#rohithvemulha            3,104                    359

Table 4.1: Data collection for six events.
4.4 Data Annotation
After collecting the data related to the five events, we picked one image from each event. Our task was to find all the images similar to this image in the dataset of that particular event. To test the accuracy of the image comparison methodologies, we needed to have the data annotated beforehand so that the results of each methodology could be matched against the results of the annotation. We gave the annotator the set of images related to an event and showed them the input image (whose similar images we want to find). The annotator's task was to label each image in the given set as similar or not similar to the image shown. The dataset is thus divided into two sets, one containing images similar to the shown image and the other containing images that are not similar to it. In this way, for each event, we get a set of similar images and a set of non-similar images. Table 4.2 shows the size of each set after the annotation was complete.
Keyword                   Total images   Similar images   Non-similar images
#Kulkarni                 1,912          348              1,564
#RamRahim                 411            99               312
#KejrwalInsultsHanuman    666            278              388
#CharlieHebdo             570            114              456
#ShaniShingnapur          183            67               116

Table 4.2: Data after annotation shows the number of similar and non-similar images.
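One plausible way to match a methodology's output against the annotation is to count the images on which both agree. The sketch below assumes accuracy is the fraction of images labelled the same way by the method and by the annotator; the exact metric used in this work is defined with the evaluation metrics in Chapter 5, and the function name and image ids here are hypothetical.

```python
def annotation_accuracy(predicted_similar, annotated_similar, all_images):
    """Fraction of images on which the method agrees with the annotator."""
    if not all_images:
        raise ValueError("all_images must not be empty")
    agreements = sum(
        (img in predicted_similar) == (img in annotated_similar)
        for img in all_images
    )
    return agreements / len(all_images)


# Hypothetical event dataset with four images.
all_images = ["a", "b", "c", "d"]
annotated = {"a", "b"}   # images the annotator marked as similar
predicted = {"a", "c"}   # images a comparison method marked as similar
print(annotation_accuracy(predicted, annotated, all_images))
```

Here the method agrees with the annotator on images "a" and "d" and disagrees on "b" and "c".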
Chapter 5
Quantifying Image Similarity
5.1 Keyword Extraction
Keyword extraction is the automatic identification of the terms that best describe the subject of a document. Since Twitter does not allow us to fetch data directly using an image, we take supporting keywords and use Twitter's Search API to get all the tweets related to them. If the user does not know the exact keyword to enter, they can enter a text related to the image (or the corresponding tweet), and a text mining algorithm takes the text and filters out the important keywords. The main goal is to extract noun phrases and verb phrases from the text, because they describe the main topic [8]. This is done using NLTK's tagger to define a new n-gram tagger, where n=3.
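The phrase extraction step can be illustrated with a simplified sketch. It does not reproduce the NLTK trigram tagger described above: it assumes the tokens have already been tagged with Penn Treebank part-of-speech tags (the tag set used by NLTK's default English tagger), and the grouping rule, the inclusion of adjectives in noun phrases, the function name and the hand-tagged example sentence are all assumptions made for illustration.

```python
def extract_phrases(tagged_tokens):
    """Group runs of noun-like or verb-like tokens into candidate keyword phrases."""
    phrases, current, current_kind = [], [], None
    for word, tag in tagged_tokens:
        if tag.startswith("NN") or tag.startswith("JJ"):
            kind = "NP"  # nouns, plus adjectives that may modify them (assumption)
        elif tag.startswith("VB"):
            kind = "VP"
        else:
            kind = None  # any other tag ends the current phrase
        if kind != current_kind and current:
            phrases.append((current_kind, " ".join(current)))
            current = []
        if kind is not None:
            current.append(word)
        current_kind = kind
    if current:
        phrases.append((current_kind, " ".join(current)))
    return phrases


# Hand-tagged example; in practice the tags would come from a POS tagger.
tagged = [("Black", "JJ"), ("ink", "NN"), ("was", "VBD"),
          ("sprayed", "VBN"), ("on", "IN"), ("Kulkarni", "NNP")]
print(extract_phrases(tagged))
```

The noun phrases returned by such a step would then be the candidate query terms for the Search API.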
5.2 Challenges in finding similar images
In this section we discuss why finding similar images in a set of images is not a straightforward task. When an image related to an incident goes viral on social media, we generally see different versions of the same picture in circulation. All these versions depict the same scene, but the images are not exact duplicates of each other. We aim to find such similar images, where the content of the image is the same but the image has been modified by:
• Cropping
• Scaling
• Adding Text
• Changing contrast and brightness
• Stitching another image
• Adjusting colours
• Rotating the image
• Adding a part of the image to another image
5.3 Image features to analyse similarity
In this section, we discuss various features of an image that can describe it, and how we use these features to measure the similarity between two images. We used the following image features to compare similarity.
5.3.1 Hashing