The Multi-Pattern Matching with Online User Requests for Determining Browser Capabilities
Total Page:16
File Type:pdf, Size:1020Kb
The Multi-Pattern Matching with Online User Requests for Determining Browser Capabilities Ignat Blazhkoa , Tetiana Kovaliukb, Nataliya Kobetsc and Tamara Tielyshevaa a National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, 37, ave. Peremohy, Kyiv, 03056, Ukraine b Taras Shevchenko National University of Kyiv, 64/13, Volodymyrska str., Kyiv, 01601, Ukraine c Borys Grinchenko Kyiv University, 18/2 Bulvarno-Kudriavska str., Kyiv, 04053, Ukraine Abstract One of the most effective ways to get information about Internet users is to analyze User-Agent request header, because it contains information about the user's browser. Quick methods are needed for multi-pattern matching to instances of User Agents. The article considers the algorithm of patterns matching with input strings to determine the capabilities of the browser. The input strings are the headers of User-Agent queries, and strings with special wildcards represent search patterns. The developments in the field of solving the problem of comparing search templates with input lines are considered and analyzed. A new algorithm for solving the problem is proposed. The steps of the algorithm are given and the time complexity of the algorithm for the problem is analyzed. Comparisons of the performance of the received program with analogues are made and the advantage of the offered approach is shown. The program implementation was compared with the most prominent analogues and the advantage of the proposed approach has been shown in terms of execution speed. Keywords 1 Pattern matching, wildcards, browser capabilities, User Agent request 1. Introduction Nowadays, when many companies operate on the Internet, the need to identify user information is very acute. One of the most effective ways to obtain such information is to analyze the User-Agent request HTTP header, which identifies the program making a request to the server and requesting access to the website. The User-Agent request header contains specific information about the hardware and software of the device making the request. This information typically includes information about the browser, web visualization mechanism, operating system, and user’s device, including iPhone, iPad or other mobile device, tablets, desktop computers. Web servers use User-Agent to maintain different web pages in different browsers in case of not adaptive web design, display different content for different operating systems, collect statistics that reflect the browsers and operating systems used by visitors. Web crawling bots also use User-Agent. The data specified in the User-Agent request HTTP header [1] describes browsers (Chrome, Firefox, Internet Explorer, Safari, BlackBerry, Opera), search engines (Google, Google Images, Yahoo), game consoles (PlayStation 3, PlayStation Portable, Wii, Nintendo Wii U), offline browsers (Offline Explorer), links (Link Checker, W3C-check link), electronic readers (Amazon Kindle), validators, cloud platforms, e-mail libraries, scripts. By User-Agent, you can define functions that are supported by a web browser, such as JavaScript, Java Applet, cookie, VBScript, and Microsoft ActiveX. COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine EMAIL: [email protected] (I. Blazhko); [email protected] (T. Kovaliuk); [email protected] (N. Kobets), [email protected] (T. Tielysheva) ORCID: 0000-0002-1383-1589 (T. Kovaliuk); 0000-0003-4266-9741 (N. Kobets); 0000-0001-5254-3371 (T. Tielysheva) ©� 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) The User-Agent string is used to match the content when the source server selects the appropriate content or operating parameters to respond to the received user request. Thus, the concept of adapting content to the needs of users is realized. The User-Agent string is one of the criteria by which search engines can be excluded from accessing certain parts of the website using the robot exclusion standard (according to the robots.txt file). Information about users has obvious applications in the advertising sector, namely in cases where the user's device can operate as a criterion for determining the user's affiliation to the target market. Web analysis is an equally important use case. Due to the availability of information about customers, further analysis can significantly improve the decision about the published content; increase the conversion of the site by turning site visitors into real customers. Analyzing User-Agent request headers that are constantly updated also means that businesses will be aware of the emergence of new devices that visit their resource, and therefore related problems will be identified at an early stage. However, parsing the User-Agent request header is not a trivial task. Many software libraries are used to solve this problem by balancing between accuracy and speed of obtaining results. There are services that provide definition and analysis of the User-Agent request header, in particular, deviceatlas.com. The accuracy problem can be solved by using open lists of User-Agent templates, which are constantly updated. There are more than 62 million User-Agent strings [1], so existing solutions cannot quickly work with all of them. The speed of response to the client is another important indicator for business. Therefore, quick methods of analyzing these patterns are needed. These are methods for matching a large number of search patterns to User-Agent instances. The development of algorithms based on which libraries will be created to determine the capabilities of browsers is an urgent and necessary task for any web service. The purpose of the research is to develop an algorithm for multi-pattern matching with wildcards to be able to detect browser capabilities based on browscap.org data [2]. The library should work faster than its alternatives − the speed should be comparable to libraries that use simplified (inaccurate) methods to detect browser capabilities. Browscap is an open project dedicated to collecting and disseminating a database of browser capabilities, actually a set of different files describing browser capabilities. Browscap has built-in support for using these files. To achieve this goal we need to solve the following tasks: • figure out browscap.org data structure and similarities between different user agents of the same family; • review and analyze existing studies that solve the problem of multi-pattern matching with wildcards • develop an algorithm for multi-pattern matching with the best known lower bound time complexity for determining the browser capabilities; • develop software implementation of the algorithm and compare the speed of the proposed algorithm with analogue libraries, which give approximate results. 2. Problem Statement Let the set of search patterns P = {p1, p2 ,..., pm } be input to the search engine. Each search pattern pi , i =1, m is a string of characters that belong to the alphabet Σ . Each pattern contains special characters "?" and "*", that do not belong to this alphabet, and are called wildcards. The character "?" corresponds to any single character in the alphabet Σ , the character "*" corresponds to any string of characters in the alphabet Σ , including the empty string. The search pattern pi , i =1, m can be represented as a string pi = ci1ki1ci2ki2...cij kij *, where cij is the j -th wildcard character for the i -th pattern, kij is a keyword from a dictionary of unique keywords W , that is, a string of characters in the alphabet Σ . There is an additional condition for the wildcard character ci1 : it can be either skipped or equal to the character "*". Input requests are in the form of a string S ={si },i =1, n , consisting of n characters in the alphabet Σ . It is necessary for each input string S ={si },i =1, n to find all the search patterns from the set P that matches to the query string, i.e. get a set of patterns A ={p , p ,..., p }, where p denotes a1 a2 aanslen aS the search pattern for the S -th request. The number of requests is not limited. Requests must be answered as they are received. One of the features of this task is that the set of search patterns is very large and is in the millions of copies. Query string sizes typically range from 100 to 200 characters. 3. Related Works The main tasks of text string analysis that arise in the development of information verification tools in information retrieval systems include: • the task of matching text strings; • the problem of calculating the distance between text strings; • the problem of fuzzy match text strings; • the task of finding the longest repeating text substring. The algorithms for text string analysis are presented in the works of domestic and foreign researchers, in particular Stephen G.A., Knuth D.E., Morris J.H., Pratt V.R., Karp R.M., Rabin, M.O., Boyer R.S., Moore J.S., Fischer M.J., Hirschberg D.S., Hunt J.W., Szymanski T.G., Landau G.M., Vishkin U. Hamming R.W., Levenshtein V.I. The task of determining the browser capabilities based on the analysis of the User-Agent HTTP header is reduced to the task of matching text strings, which are used as search patterns and incoming requests. Algorithms for matching search patterns with input strings have many important applications and are used in antivirus software, systems for detecting various types of attacks and intrusion prevention [3], in-text compression, text search, data mining, programming grammar checking rules [4] etc. For example, different types of attacks can be defined using rules, which are regular expressions that match a set of possible strings. Algorithms for working with regular expressions require pre- processing which is very memory-intensive and time-consuming. An example of the application of the algorithm for matching search patterns with input strings is the study of DNA sequences, where wildcards are used as a replacement for some components of protein sequences.