Search, Analyse, Predict Image Spread on Twitter
Image Search for Improved Law and Order: Search, Analyse, Predict Image Spread on Twitter

Student Name: Sonal Goel
IIIT-D-MTech-CS-GEN-16-MT14026
April, 2016

Indraprastha Institute of Information Technology, New Delhi

Thesis Committee
Dr. Ponnurangam Kumaraguru (Chair)
Dr. AV Subramanyam
Dr. Samarth Bharadwaj

Submitted in partial fulfillment of the requirements for the Degree of M.Tech. in Computer Science, in General Category

©2016 IIIT-D-MTech-CS-GEN-16-MT14026
All rights reserved

Keywords: Image Search, Image Virality, Twitter, Law and Order, Police

Certificate

This is to certify that the thesis titled “Image Search for Improved Law and Order: Search, Analyse, Predict Image Spread on Twitter” submitted by Sonal Goel for the partial fulfillment of the requirements for the degree of Master of Technology in Computer Science & Engineering is a record of the bonafide work carried out by her under our guidance and supervision in the Security and Privacy group at Indraprastha Institute of Information Technology, Delhi. This work has not been submitted anywhere else for the award of any other degree.

Dr. Ponnurangam Kumaraguru
Indraprastha Institute of Information Technology, New Delhi

Abstract

Social media is often used to spread images that can instigate anger among people and hurt their religious, political, caste, and other sentiments, which in turn can create a law and order situation in society. This creates a need for Law Enforcement Agencies (LEA) to inspect the spread of images related to such events on social media in real time. To help LEA analyse image spread on microblogging websites like Twitter, we developed an open source real-time image search system, in which the user supplies an image and supporting text related to it, and the system finds images similar to the input image along with their occurrences.
The proposed system robustly identifies images that have been cropped (up to a certain factor), scaled (up to a certain factor), embedded with text, stitched with other images, changed in brightness, rotated, or modified by some combination of these. On the input text, the system runs a text mining algorithm to extract keywords, retrieves images related to these keywords from Twitter, and uses an image comparison methodology to extract similar images. The system can also analyse the users propagating the content, the sentiments accompanying it, and the retweets it receives. We found that Improved ORB (Oriented FAST and Rotated BRIEF) performs best at finding image similarity, with an accuracy above 85% in all the tested cases. The system is being used by one of the Government security agencies.

In addition to identifying similar images, we also aim to predict the influence on people of events that can create law and order issues in society. On microblogging sites like Twitter, the information in tweets diffuses across users through retweets. Hence, to better understand and control the diffusion of such images, we focus on predicting their retweet count using visual cues from the images, content-based information, and structure-based features. For this, we build a random forest regression model that takes tweet, image, and structural features to predict the retweet count.

Acknowledgments

I would like to express my deepest gratitude to my advisor Dr. Ponnurangam Kumaraguru for his guidance and support. The quality of this work would not have been nearly as high without his well-appreciated advice. I would like to thank my esteemed committee members Dr. A.V. Subramanyam and Dr. Samarth Bharadwaj for agreeing to evaluate my thesis work. I thank Dr. Samarth Bharadwaj for enriching this thesis with his valuable suggestions and feedback.
I thank all the members of the Precog research group at IIIT-Delhi for their valuable feedback and suggestions, especially Niharika Sachdeva for shepherding me and giving her valuable time to help me shape this thesis. Special thanks to all members of the CERC group at IIIT-Delhi and Hareesh Ravi for their valuable inputs. Last but not least, I would like to thank my supportive family, who encouraged me and kept me motivated throughout the project.

Contents

1 Research Motivation and Aim
  1.1 Research Motivation
  1.2 Research Aim
2 Related Work
  2.1 Comparing Methodologies
  2.2 Image Retrieval Real Time Systems
    2.2.1 Google reverse image lookup
    2.2.2 TinEye
  2.3 Social Media for improved policing
  2.4 Information Diffusion on Social Media
  2.5 Finding and Predicting Image Virality on Online Social Media for improved law and order
3 Contributions
4 Proposed Methodology
  4.1 Architecture design for the search system
  4.2 Solution approach for prediction
  4.3 Data Collection
    4.3.1 Search System
    4.3.2 Prediction
  4.4 Data Annotation
5 Quantifying Image Similarity
  5.1 Keyword Extraction
  5.2 Challenges in finding similar images
  5.3 Image features to analyse similarity
    5.3.1 Hashing
    5.3.2 SSIM
    5.3.3 Colour Histogram
    5.3.4 Keypoint Descriptors
  5.4 Experimental Settings
    5.4.1 Evaluation Metrics
  5.5 Classification Results
    5.5.1 Histogram
    5.5.2 DAISY
    5.5.3 ORB
    5.5.4 Improved ORB
  5.6 Time Evaluation
  5.7 Variation in accuracy with modified images
6 Predicting Image Spread
  6.1 Features Analysed
    6.1.1 Content-Based
    6.1.2 Sentiment
    6.1.3 Structure-Based
    6.1.4 Image-Based
  6.2 Features vs. Retweet Count
  6.3 Experimental Settings
    6.3.1 Linear Regression
    6.3.2 Support Vector Regression
    6.3.3 Random Forest
  6.4 Training and Testing Data
7 Results
  7.1 Efficient Image Features and methodology for quantifying similarity
  7.2 Comparing search results with Google reverse image lookup
  7.3 Real Time Image Search System
  7.4 Law Enforcement Agencies as beneficiaries of the system
  7.5 Evaluating Metric for Prediction
  7.6 Comparing Regression Models
8 Conclusions, Limitations, Future Work
  8.1 Conclusions
  8.2 Limitation
  8.3 Future Work

List of Figures

1.1 News headlines showing the impact of some recent events
4.1 Architecture diagram for the search system
5.1 Input images set
5.2 Accuracy of coloured histogram
5.3 Daisy distance for similar and dissimilar images
5.4 ORB and Improved ORB
5.5 Plotting accuracy for modified images: Kulkarni event-1
5.6 Plotting accuracy for modified images: Kulkarni event-2
5.7 Plotting accuracy for modified images: Kejriwal cartoon
5.8 Plotting accuracy for modified images: Baba Ram Rahim images
5.9 Plotting accuracy for modified images: Charlie Hebdo cartoon
7.1 Comparing proposed system with Google reverse image lookup
7.2 Screen shots comparing output of proposed system with Google reverse image lookup for Shipra Malik image
7.3 Screen shots comparing output of proposed system with Google reverse image lookup for Kanhaiya Kumar image
7.4 Screen shot of the system

List of Tables

4.1 Data collection for six events
4.2 Data after annotation shows the number of similar and non-similar images
5.1 Comparing time (seconds) taken for comparing two images
6.1 Spearman correlation and p-values of features with retweet count
7.1 Comparing results of proposed system with Google’s reverse image lookup

Chapter 1
Research Motivation and Aim

1.1 Research Motivation

Images have a high impact on both the online and offline worlds.
Recently, the Union Health Ministry made it mandatory for cigarette manufacturing companies to reserve 60 percent of the space on each pack for pictorial warnings [5]. Images shared on social media are equally influential, having the power to evoke people’s emotions and drive deeper engagement and more profound changes in behaviour [2]. Before the advent of social media, news about events was more localised and took longer to spread. However, with the widespread use of social media, information spreads like wildfire and reaches a larger audience, making social media a powerful tool for spreading information [4]. One study reports that when people hear information, they are likely to remember only 10 percent of it after three days; if a relevant image is paired with the same information, they retain 65 percent of it over the same period [23]. On social media, Facebook posts with images achieve 2.3 times more engagement than those without, and Buffer reported that, for its user base, tweets with images received 150 percent more retweets than tweets without images [23].

When images related to a disturbing event are shared, they have the potential to influence public opinion and hurt religious, political, caste, or communal sentiments, which in turn can disturb law and order in society. For instance, in June 2014 a Facebook post containing obscene images of King Shivaji Maharaj, the late Shiv Sena leader Bal Thackeray, and others provoked violent protests in Maharashtra: many public vehicles were damaged, many people were injured by stone pelting, and an angry crowd killed a Muslim techie, giving the protest a communal colour [7, 12]. As another example, in 2015 a self-proclaimed godman, Baba Ram Rahim, posted an image of himself posing as the Hindu god Vishnu on social media sites.