2Gather4health: Web Crawling and Indexing System Implementation

2Gather4Health: Web Crawling and Indexing System Implementation Joao˜ N. Almeida INESC-ID, Lisboa, Portugal Instituto Superior Tecnico,´ Universidade de Lisboa, Portugal Email: [email protected] Abstract—In healthcare, patients tend to be more aware of caregivers to develop solutions for health condition derived their needs than market producers. So, it is only normal to start problems, which were not being addressed on the market. seeing innovative behavior emerging from them, or from their caregivers, to help them cope with their health conditions, before The user innovation being observed in the healthcare area producers do. and the noticeable presence of the Internet in today’s society Today, it is believed that these users share their innovative prompts investigators to study the intersection of the two. For health-related solutions, also called patient-driven solutions, on this purpose, Oliveira and Canhao˜ created the Patient Inno- the Internet. However, the size of the Internet makes it hard to efficiently manually browse for these solutions. vation platform, an online open platform with a community A focused crawler is a system that automatically browses the of over 60.000 users and more than 800 innovative patient- Web, focusing its search on a topic of interest. This Master thesis driven solutions developed by patients and informal caregivers. proposes a focused crawler that searches the Web for patient- These solutions were found by manually browsing the Web, driven solutions, storing and indexing them to ease a further searching for a combination of appropriate keywords in search medical screening. To perform the focusing task, it was developed a distiller and a classifier. The distiller ranks the URLs to visit, engines. The problem is that with the amount of information sorting them by a given score. This is a linear combination of currently on the Web, this searching method is not effective three components, that depend on the content, URL context, nor efficient. Consequently, there is the need for a system that or scores of the pages where the URL was found. The classifier automates that Internet search, retrieving solutions in a more automatically classifies visited webpages, verifying if they concern optimal manner. patient-driven solutions. In this thesis, it is shown that the developed system outper- A focused crawler is a system that automatically navigates forms a breadth-first crawling and common focused approaches through the Web in search of webpages concerning a topic on measures of harvest rate and target recall, while searching of interest. In this paper, we propose a focused crawler for patient-driven solutions. The proposed classifier’s results on that efficiently searches for webpages concerning patient- crawled data deviate from its validation results. However, it is proposed an approach to re-train the classifier with crawled data driven solutions, while indexing them. To classify and set that improves its performance. the fetching order of webpages, the crawler implements a classifier and a distiller, respectively. The classifier’s ultimate I. INTRODUCTION goal is to identify webpages regarding patient-driven solutions Users tend to be more aware of their needs than producers, and the distiller’s is to favor Uniform Resource Locators therefore, they are prone to innovate. They innovate expecting (URLs) believed to relate to the solutions being searched to improve the benefit they acquire from a certain product or for. Finally, the crawler outputs relevant results to a web service. This type of innovation is classified as user innovation, indexer, which organizes the information making it easily a phenomenon studied by several socio-economic researchers. accessible and searchable. Our results show that the proposed People need pragmatic solutions for their problems, solu- approach outperforms the broad crawling and common focused tions that are not being covered on the market. This drives crawling approaches, being more efficient while searching for users to invent the solutions themselves, to fulfill the needs the patient-driven solutions. Also, they show that incrementally market is not addressing. In healthcare, the word ”need” is not training the custom classifier with crawled data can improve an understatement. Some users (in this case patients) have been the solution search precision. living with the same health condition for years, sometimes In Section II, it is presented the related work required to even their whole life. These individuals have to cope with daily understand the development and evaluation of the implemented life problems, imposed by their health condition, for which system. Section III describes the dataset used in this system medicine does not provide a solution. This forces them to be development, as well as its pre-processing. Section IV de- constantly thinking of new ways to make their lives better, scribes in detail the system architecture. Section V presents the to approach new methods, to take some risks if that opens experiments done to evaluate the system’s performance, along a whole new better life for them. In fact, Zejnilovic et al. with their results and discussion. Finally, Section VI concludes [1] highlight the innovation capacity of patients and informal this paper with a summary, final thoughts and future work. II. RELATED WORK fuzzy function to the top most frequent terms in each class, In this section, we will talk about the three fundamental which is called the class fingerprint. The same is done for each concepts to understand this paper: text classification, web document to be classified, and its fingerprint is then compared crawling and web indexing. to the classes’ fingerprints, through a similarity score, to see to what class the document belongs to. A. Text Classification A threshold can be set to assign candidates to a ”negative” With nowadays’ digital documents availability and mass class, if the similarity score of the candidate with every class production, automatic text classification is a crucial procedure is below that threshold. for categorizing documents. 1) Classification performance evaluation: It is common Text classification methods use Natural Language Process- to evaluate a text classifier’s performance with metrics of ing (NLP) as a tool for text processing. NLP is the science accuracy, recall (or true positive rate), precision and F1-score that studies text analysis and processing. Documents are sets of (also called F -measure). Accuracy is the fraction of correctly TP +TN unstructured data (text). These can be chaotic, presented in all classified samples ( TP +TN+FP +FN ). Recall is the propor- TP shapes and sizes, so usually it is needed some pre-processing tion of positive samples classified as so ( TP +FN ). Precision before documents are fed to classifiers for training and clas- measures the fraction of positive predicted samples that are TP sification. It is through NLP methods that that processing is in fact positive samples ( TP +FP ). F1-score is the harmonic done. 2×P recision×Recall mean between precision and recall ( P recision+Recall ). Common text pre-processing steps are: Normalization, When applied to multi-class classification problem, it is where the text is lower cased, textual accents are removed, common practice to do a weighted average of all metrics and numbers, dates, acronyms and abbreviations are written in between all classes to get an overall performance. a canonical form or removed; Tokenization, where single or multiple-word terms are separated, to form tokens; Stopword B. Web Crawling removal, where common words (like ”to”, ”and” and ”or” in A web crawler is one of the principal systems of modern the English language) are removed; Lemmatization/Stemming information retrieval. Web crawling is the act of going through where different words are mapped to the same term, when a web graph, that represents a network where each node they are derived or inflected from that term. represents a resource (usually a document), gathering data at The most common approaches for text classification tasks each node according to its needs. The need of web crawling include [2]: Naive¨ Bayes (NB) variants, k-Nearest Neighbors rose with the large increase of web resources available on the (kNN), Logistic Regression or Maximum Entropy, Decision Internet. It has a large range of applications. It can be used for Trees, Neural Network (NN), and Support Vector Machine web data mining, for linguistic corpora collection, to build an (SVM). These usually represent each document or class as an archive, to gather local news of one’s interest, amongst other unordered set of terms. To each term is associated a presence applications. or a frequency value, and no semantic meaning is kept. This is A web crawler always begins its activity from a set of called a bag-of-words model. The most common approaches chosen URLs, called seed URLs. It fetches the content that use variations and combinations of Term Frequency (TF) and they point to (their webpages) and parses it. As a result, the Inverse Document Frequency (IDF) transformations [3] as crawler stores the parsed content and the webpages’ outlinks, feature values. These approaches are purely syntactic based, which are the URLs present in the webpages’ content. The but they demonstrate great results [3], [4]. parsed content is used to check for URL duplicates, which In text classification problems, using terms as features can are different URLs that point to the same content, and the result in a very large feature space. One approach that reduces parsed outlinks are checked to see if they present new URLs, the feature space uses pre-trained Word2Vec [5] models to which have never been crawler. After these operations, newly transform each word into a vector of features. Each word discovered URLs are stored in the URL frontier, where there vector representation is obtained by training a Neural Network lie URLs due for fetching. The process repeats with the URLs model using a very large dataset.

2Gather4health: Web Crawling and Indexing System Implementation

Why We Need an Independent Index of the Web ¬ Dirk Lewandowski 50 Society of the Query Reader

Distributed Indexing/Searching Workshop Agenda, Attendee List, and Position Papers

Web-Page Indexing Based on the Prioritize Ontology Terms

Google Bing Facebook Findopen Foursquare

Awareness Watch™ Newsletter by Marcus P

The Google Search Engine

Context Based Web Indexing for Storage of Relevant Web Pages

Appendix I: Search Quality and Economies of Scale

The Indexing at the Internet

A New Hidden Web Crawling Approach

DEWS: a Decentralized Engine for Web Search

Indexing the World Wide Web: the Journey So Far Abhishek Das Google Inc., USA Ankit Jain Google Inc., USA