Persian Multimedia Search Services' Users Propensities
Total Page:16
File Type:pdf, Size:1020Kb
3th International Conference on Web Research Persian Multimedia Search Services’ Users Propensities Maryam Mahmoudi*, Masomeh Azimzade*, Mehdi Esnaashari**, Mojgan Farhoodi*, Reza Badie* * Information Technology Research Group Information and Telecommunication Research Center (ITRC) Tehran, Iran {Mahmoudy, Azim_ma, Farhoodi, Badie}@itrc.ac.ir ** Faculty of Computer Engineering K. N. Toosi University of Technology Tehran, Iran [email protected] Abstract— Nowadays, search engines are prominent tools, Until now, user propensity recognition is verified in which are required by users, for finding information in web. numerous papers [7-14]. Also for a Persian search engine, Multimedia search engines are of special importance due to two various researches have been carried out to analyze users’ different reasons; 1) attractiveness of multimedia contents and 2) behaviors and recognize their propensities [1-4]. In all of these growing rate of the creation and online dissemination of such researches, one of the two approaches which are introduced contents. In this paper every effort is made to analyze and below, were used for aforementioned purposes: recognize the propensities of the users of Persian multimedia search services. For this purpose, behaviors of Iranian users of 1- Using search engine’s log file: In the papers of this Parsijoo’s image, voice and video search services has been category [2-4] [7-11], contents of the log file of a specific studied by analyzing its usage log files. The analyses, which have search engine for a particular time interval has been gathered been carried out by using users’ queries for a time period of three and parsed. Then based on the achieved information, intended months, can be categorized into two distinct types; holistic analyses have been carried out. analyses and the ones based on using frequently used queries. The results of the analyses have shown that users are mostly after 2- User’s implicit behavior evaluation: Papers in this entertainments and amusement topics when they use multimedia category [1] [12-14] have tried to track users actions during the search services. time that they had interaction with the search service and in this way infer their tendencies and propensities. For tracking users, Keywords— Multimedia search services; Users’ behavior the researchers often use a plugin for a particular web browser. analysis; Usage log files; Search engine; This plugin logs different activities of users, such as clicking on one of the URLs in the result set constructed by the search I. INTRODUCTION service, time elapsed in each search session, time spent reading a web page and etc. Then the acquired information have been Enhancement of computer hardware capabilities on a daily sent to a central server where from the gathered information, basis and constant increase in the available bandwidth for users tendencies have been inferred. accessing the internet motivate users to increase their usage of online multimedia services [5]. At the same time, the amount The common aspect in these papers is that they all focus on of available online multimedia contents and the number of textual search engines. There are a few number of researches websites which provide such contents have a noticeable devoted to multimedia search services for this same purpose. growing rate. This phenomenon motivates search engine For instance, researchers in [15] have shown that more than 25 companies all around the world to provision multimedia search percent of the words used by Excite image search engine users services. were sexual and nudity–related ones and that there was a great amount of diversity in such words used. Also Spink and his Considering the huge volume of multimedia contents, colleagues in [16] concluded that users would often examine indexing and making the contents searchable is a demanding more web pages from the result set of sexual-related queries. In task, especially in large scale scenarios, which asks for [17] researchers evaluated the behaviors of the Dogpile significantly high hardware and bandwidth requirements [6]. multimedia meta-search engine users in 2006. The results of To conquer their hardware and bandwidth limitations, search these investigations have shown that among multimedia engine providers must be able to recognize the propensities of queries, 50% are devoted to image search service, 28% are of users and their topics of interests in order to prioritize the video search type and the remaining 22% belong to voice indexing of different kinds of multimedia contents. This way, search service. All the frequently used video and image search they will be able to provide high quality answers to users’ queries were of sexual and nudity-related type. Frequently used queries. voice queries contained song names or singer names. In [18] by leveraging a browser plugin and considering the users of the 978-1-5386-0420-5/17/$31.00 ©2017 IEEE 3th International Conference on Web Research Yahoo! Search engine, the researchers have observed that extracted data. Resulted data were then stored in a MySQL adolescents within the age range of 10 to 19 use multimedia database. This database is then used as the input for data search services 2.4 times more than adults using the same analysis process. services. In addition, among the queries in the aforementioned time To the best of our knowledge, no research has been carried period, top-200 frequently used queries were also extracted out in the context of the recognition of user propensities of separately for each multimedia service and stored in the same Persian multimedia search services. In this paper, by database for further analysis. investigating the log file of Parsijoo’s multimedia search services, every effort has been made to provide an analysis on C. Data Analysis users’ behavior of such services. Currently this search engine Two types of analysis have been carried out for each of the provides image, voice and video search services with the image, voice and video search services. In the first type of following characteristics: analysis, which is called the holistic analysis task, all the • Image search service: About 100 million images have queries from the considered time period were used during the been indexed and this service is capable of handling investigations. The statistics gathered in the holistic analysis 40,000 requests per day. task are as below: • Voice search service: This service makes it possible to • The total number of queries received by each of the search, download and play more than two million multimedia search services, sounds and music files available in the Persian web. • The average length of queries received by each of the This service receives 4,000 requests per day. multimedia search services, • Video search service: Approximately, one million video • Total number of queries sent from the internet or the files have been indexed and this service handles about intranet, 6,000 requests per day. • Number of unique IP addresses that access each of the The analysis accomplished in this paper include: languages services used in queries, typos in queries, whether queries sent from the internet or the intranet, and classification of queries. The results • Geographical distribution of the users of each achieved from such analysis can identify the hot topics which multimedia search service are the most popular topics among the users of the multimedia search services. Hot topics identification can help the search In the second type of analysis, the evaluation has been service provider to prioritize well the crawling and indexing performed by considering only the most frequently used activities of the multimedia contents available on the internet. queries of each search service. This type of analysis is of special interest due to the fact that most frequently used queries The rest of this paper is organized as described below. In are indicators of the topics which are most appealing to users. the second section, the steps taken for carrying out the analysis Analyses of this type are further classified into two different are described. The third section presents and verifies the results groups; general and specific. In the group of general analyses, achieved from the accomplished analyses. The fourth section we seek to provide analyses which are common among all concludes this paper. multimedia search services. These include queries languages, problematic queries (those queries which have writing II. ANALYSIS STEPS problems, contains sexual or nudity-related words, or like the query “((((((((((” has no meaning), whether queries were sent In order to be able to recognize users’ propensities and from the internet or the intranet. behaviors in the usage of Persian multimedia search services, one must first gather required data, then carry out normal data On the other hand, in the group of specific analyses, each cleaning and preparation and finally perform the various multimedia search service is considered separately for analysis. analysis tasks which are seen useful. The aim was to identify the tendencies and inclinations of Persian users towards each multimedia search service. These A. Data Acquisition analyses are dependent to the nature and characteristics of the queries used and the search service considered and as a result In this paper the data source that was used is the search can be regarded as content based analyses. As an example, it is engine’s log file contents. A portion of this log file, which was possible that identified classes of queries for each search for a period of 3 months was considered for the designated service is different from that of other search services. In the analysis tasks. voice search service, for instance, the classification is done based on names of singers, names of the albums, etc. which is B. Data Cleaning and Preparation completely irrelevant in other search services. In this step, firstly the log file contents were read and parsed based on special tags, in order to extract data about Yet, another example of analyses, within the group of different multimedia search services (i.e.