2Gather4Health: Web Crawling and Indexing System Implementation

João N. Almeida, INESC-ID, Lisboa, Portugal; Instituto Superior Técnico, Universidade de Lisboa, Portugal. Email: [email protected]

Abstract—In healthcare, patients tend to be more aware of their needs than market producers. So, it is only natural to see innovative behavior emerging from them, or from their caregivers, to help them cope with their health conditions, before producers act. Today, it is believed that these users share their innovative health-related solutions, also called patient-driven solutions, on the Internet. However, the size of the Internet makes it hard to manually browse for these solutions efficiently. A focused crawler is a system that automatically browses the Web, focusing its search on a topic of interest. This Master thesis proposes a focused crawler that searches the Web for patient-driven solutions, storing and indexing them to ease a further medical screening. To perform the focusing task, a distiller and a classifier were developed. The distiller ranks the URLs to visit, sorting them by a given score. This score is a linear combination of three components that depend on the content, the URL context, or the scores of the pages where the URL was found. The classifier automatically classifies visited webpages, verifying whether they concern patient-driven solutions. In this thesis, it is shown that the developed system outperforms breadth-first crawling and common focused approaches on measures of harvest rate and target recall, while searching for patient-driven solutions. The proposed classifier's results on crawled data deviate from its validation results. However, an approach is proposed to re-train the classifier with crawled data that improves its performance.

I. INTRODUCTION

Users tend to be more aware of their needs than producers; therefore, they are prone to innovate. They innovate expecting to improve the benefit they acquire from a certain product or service. This type of innovation is classified as user innovation, a phenomenon studied by several socio-economic researchers. People need pragmatic solutions for their problems, solutions that are not being covered on the market. This drives users to invent the solutions themselves, to fulfill the needs the market is not addressing. In healthcare, the word "need" is not an understatement. Some users (in this case patients) have been living with the same health condition for years, sometimes even their whole life. These individuals have to cope with daily life problems, imposed by their health condition, for which medicine does not provide a solution. This forces them to be constantly thinking of new ways to make their lives better, to approach new methods, to take some risks if that opens a whole new better life for them. In fact, Zejnilovic et al. [1] highlight the capacity of patients and informal caregivers to innovate, developing solutions for problems derived from their health conditions which were not being addressed on the market.

The user innovation observed in the healthcare area and the noticeable presence of the Internet in today's society prompt investigators to study the intersection of the two. For this purpose, Oliveira and Canhão created the Patient Innovation platform, an online open platform with a community of over 60,000 users and more than 800 innovative patient-driven solutions developed by patients and informal caregivers. These solutions were found by manually browsing the Web, searching for a combination of appropriate keywords in search engines. The problem is that, with the amount of information currently on the Web, this searching method is neither effective nor efficient. Consequently, there is the need for a system that automates that Internet search, retrieving solutions in a more optimal manner.

A focused crawler is a system that automatically navigates through the Web in search of webpages concerning a topic of interest. In this paper, we propose a focused crawler that efficiently searches for webpages concerning patient-driven solutions, while indexing them. To classify and set the fetching order of webpages, the crawler implements a classifier and a distiller, respectively. The classifier's ultimate goal is to identify webpages regarding patient-driven solutions, and the distiller's is to favor Uniform Resource Locators (URLs) believed to relate to the solutions being searched for. Finally, the crawler outputs relevant results to a web indexer, which organizes the information, making it easily accessible and searchable. Our results show that the proposed approach outperforms broad crawling and common focused crawling approaches, being more efficient while searching for patient-driven solutions. They also show that incrementally training the custom classifier with crawled data can improve the solution search precision.

Section II presents the related work required to understand the development and evaluation of the implemented system. Section III describes the dataset used in this system's development, as well as its pre-processing. Section IV describes the system architecture in detail. Section V presents the experiments done to evaluate the system's performance, along with their results and discussion. Finally, Section VI concludes this paper with a summary, final thoughts and future work.

II. RELATED WORK

In this section, we present the three fundamental concepts needed to understand this paper: text classification, web crawling and web indexing.
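Before detailing these concepts, the crawl-classify-index pipeline outlined in the introduction can be sketched in a few lines. Everything below (the toy page table, the keyword scorer, the threshold) is an illustrative assumption, not the actual 2G4H implementation:

```python
import heapq

# Hypothetical toy Web: URL -> (page text, outlinks). Stands in for real fetching.
PAGES = {
    "seed1": ("patient built a custom wheelchair aid", ["a", "b"]),
    "a": ("sports news and scores", ["c"]),
    "b": ("caregiver designed a low-cost feeding device", ["c"]),
    "c": ("celebrity gossip", []),
}

def classify(text):
    """Toy stand-in for the classifier: relevance score in [0, 1]."""
    keywords = {"patient", "caregiver", "aid", "device"}
    return len(set(text.split()) & keywords) / len(keywords)

def crawl(seeds, threshold=0.25):
    """Minimal focused-crawl loop: a priority frontier ordered by score,
    and a classifier gating what gets handed to the indexer."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen, index = set(seeds), []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, outlinks = PAGES.get(url, ("", []))
        score = classify(text)
        if score >= threshold:
            index.append(url)  # relevant page goes to the indexer
        for out in outlinks:
            if out not in seen:
                seen.add(out)
                # distiller stand-in: children inherit the parent's score
                heapq.heappush(frontier, (-score, out))
    return index

print(crawl(["seed1"]))
```

The priority frontier is what distinguishes this loop from a plain breadth-first crawl: pages discovered from relevant parents are fetched earlier.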
A. Text Classification

With today's availability and mass production of digital documents, automatic text classification is a crucial procedure for categorizing documents.

Text classification methods use Natural Language Processing (NLP) as a tool for text processing. NLP is the science that studies text analysis and processing. Documents are sets of unstructured data (text). These can be chaotic, presented in all shapes and sizes, so some pre-processing is usually needed before documents are fed to classifiers for training and classification. It is through NLP methods that this processing is done.

Common text pre-processing steps are: Normalization, where the text is lower-cased, textual accents are removed, and numbers, dates, acronyms and abbreviations are written in a canonical form or removed; Tokenization, where single or multiple-word terms are separated, to form tokens; Stopword removal, where common words (like "to", "and" and "or" in the English language) are removed; Lemmatization/Stemming, where different words are mapped to the same term, when they are derived or inflected from that term.

The most common approaches for text classification tasks include [2]: Naïve Bayes (NB) variants, k-Nearest Neighbors (kNN), Logistic Regression or Maximum Entropy, Decision Trees, Neural Networks (NN), and Support Vector Machines (SVM). These usually represent each document or class as an unordered set of terms. Each term is associated with a presence or a frequency value, and no semantic meaning is kept. This is called a bag-of-words model. The most common approaches use variations and combinations of Term Frequency (TF) and Inverse Document Frequency (IDF) transformations [3] as feature values. These approaches are purely syntax-based, but they demonstrate great results [3], [4].

In text classification problems, using terms as features can result in a very large feature space. One approach that reduces the feature space uses pre-trained Word2Vec [5] models to transform each word into a vector of features. Each word vector representation is obtained by training a Neural Network model using a very large dataset. This model obtains each word vector representation by analysis of each word's neighbor words. This representation holds some semantic value: words with similar meanings will have similar vector representations, because they will have similar neighbor words. This presents some additional value over the TF-IDF approach. All the word vectors of a document can then be averaged to make a document vector representation. This approach often reduces the feature space and it can perform better than a regular TF-IDF approach.

One alternative classification approach is the Fuzzy Fingerprint Classifier (FFP-C), which was first used in [6] for authorship identification. This method consists of applying a fuzzy function to the top most frequent terms in each class, which is called the class fingerprint. The same is done for each document to be classified, and its fingerprint is then compared to the classes' fingerprints, through a similarity score, to see to which class the document belongs. A threshold can be set to assign candidates to a "negative" class, if the similarity score of the candidate with every class is below that threshold.

1) Classification performance evaluation: It is common to evaluate a text classifier's performance with metrics of accuracy, recall (or true positive rate), precision and F1-score (also called F-measure). Accuracy is the fraction of correctly classified samples ((TP+TN)/(TP+TN+FP+FN)). Recall is the proportion of positive samples classified as such (TP/(TP+FN)). Precision measures the fraction of positively predicted samples that are in fact positive (TP/(TP+FP)). F1-score is the harmonic mean between precision and recall (2×Precision×Recall/(Precision+Recall)). When applied to a multi-class classification problem, it is common practice to take a weighted average of each metric over all classes to get an overall performance.

B. Web Crawling

A web crawler is one of the principal systems of modern search engines. Web crawling is the act of going through a web graph, which represents a network where each node represents a resource (usually a document), gathering data at each node according to its needs. The need for web crawling rose with the large increase of web resources available on the Internet. It has a large range of applications: it can be used for web data mining, for linguistic corpora collection, to build an archive, or to gather local news of one's interest, amongst other applications.

A web crawler always begins its activity from a set of chosen URLs, called seed URLs. It fetches the content that they point to (their webpages) and parses it. As a result, the crawler stores the parsed content and the webpages' outlinks, which are the URLs present in the webpages' content. The parsed content is used to check for URL duplicates, which are different URLs that point to the same content, and the parsed outlinks are checked to see if they present new URLs, which have never been crawled. After these operations, newly discovered URLs are stored in the URL frontier, where the URLs due for fetching lie. The process repeats with the URLs in the frontier, until a target condition is reached or the URLs in the frontier run out.

For this project, it was decided to use Nutch [7] as the base crawler for the final system. Nutch was developed by The Apache Software Foundation and is still maintained and updated at the time of writing. This crawler is very extensible through a system of plugins, which offers a lot of transparency, flexibility and control over the system. Additionally, Nutch comes with an indexer and search platform plugged into it, Solr [8]. This platform provides an index for the uploaded documents, granting the ability to efficiently search through them. This index can also be integrated in third-party applications.
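As a quick reference for the evaluation metrics defined in Section II-A, the following helper computes them from raw confusion counts (the example counts are arbitrary):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion counts,
    following the definitions in Section II-A."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# e.g. 8 true positives, 80 true negatives, 2 false positives, 10 false negatives
acc, p, r, f1 = metrics(8, 80, 2, 10)
print(acc, p, r, f1)
```

The guards against empty denominators matter in practice: a classifier that never predicts the positive class would otherwise divide by zero when computing precision.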
These approaches may use the gathering as much information as they could (these are refered URL surrounding text and assign it different relevance levels to as broad crawlers), as time advanced, user’s needs and the according to its location. Internet structure evolved as well, together with the growth of A machine learning based approach commonly uses a information available. Thus, different types of crawlers began text classifier to guide a focused crawl. Several popular text to appear. The following mentioned types are the one which classifiers have been tested:NB,kNN, NN and SVM. Besides characterize the system implemented on this paper. common supervised text classification methods, other machine Focused, or preferential, crawlers are only interested in learning techniques have been applied [10],[11]. a subset of all the URLs they extract. They have an extra The implementation described in this paper will use a component, the guiding component, that filters and/or orders two-layered text classifier to identify and score patient-driven the URLs for fetching based on the domain they are part of, solutions and a content-based distiller to set the fetching on the content and/or structure of the webpage where they importance of newly discovered URLs. were found and/or on the geographic location of the server hosting the webpage. This component takes care of classifying 3) Performance Evaluation: For a focused crawler, we want the relevancy of webpages with respect to a specific topic. to check if the pages it is retrieving are relevant for the user’s Newly discovered URLs are sorted in an order influenced by needs and if it is doing so efficiently. Therefore, important the results of the classification process. indicators that test the focused crawler’s performance are: the Incremental, or continuous, crawlers try to always have harvest rate [12], which determines, at several points of the the last image of a webpage. 
While crawling the Web, they crawl, the fraction of pages crawled by the system that are make sequential scheduled fetches to a same webpage, to ex- relevant to the focused topic. It can be seen as the precision amine their updating rate, and adjust their schedule depending in an information retrieval system; and the target recall [13], on whether or not and when it has been modified. To achieve which estimates, during the crawl, the fraction of total relevant that behavior, in this project, we use the Nutch Adaptative pages that are fetched by the crawler. It can be seen as an Fetch Schedule [9] functionality, which has a default and estimate of the recall in an information retrieval system. This bounded re-fetch interval, that is decreased or increased by estimate can be obtained by pre-defining a set of target pages, a configurable factor, depending on whether a webpage has and taking, at several points of the crawl, the fraction of those been modified or not, respectively. pages that were already crawled. Centralized parallel crawling is usually referred to as The formulas for harvest rate and target recall can be t distributed crawling, although there is a difference between seen on equations 1 and 2, respectively. Here, SC is the set this and a fully distributed approach. It usually employs a comprising all the crawled pages at time t, R is the set of master-slave strategy, where the master takes the control of relevant pages on the Web and RT ⊂ R is the set of target the main flow of the crawl (e.g.: initialization, data stor- pages defined. age, process creation and termination) and distributes the t URLs between the available slaves, for them to fetch, parse SC ∩ R HarvestRate = t (1) and/or index. This distribution strategy can be domain-based, SC geographically-based, topic-based, amongst others. In the case t 1 S ∩ RT of Nutch, it is host-based. 
Nutch runs on top of Hadoop , a T argetRecall = C (2) framework which enables distributed processing and storage |RT | through simple programming models. This alone takes care 4) Problems and limitations: Most crawlers are built to run of data replication and fully distribution, making the system through the Internet, a public virtual place where there are no fault tolerant. In this implementation, it was decided to follow explicit rules for one’s behavior, but humans and machines are a single machine approach, using its multi-core capabilities expected to respect guidelines of common sense. Therefore, a to parallelize tasks among threads. Nonetheless, inter-machine crawler must be aware of how its actions affects others and parallelization is easy to configure in Hadoop when needed. itself. 2) The focused crawler guiding component: In the core of a focused crawler lies its guiding component. It is the Concerning crawler politeness, a crawler is expected not to component which navigates the crawler through the Web, overload hosts with requests and respect The Robots Exclusion according to the user’s information needs. In this project, the Protocol [14].This standard consists of a file written by hosts, guiding component follows two types of approaches, described called robots.txt, which states some rules a crawler should on the following paragraphs. follow, along with optional information about the . A content and structure based approach exploits a web- On the other hand, hosts are expected to be polite as well, page content to test it for relevancy. These consist of using unfortunately some hosts set traps for crawlers, intentionally or an established taxonomy, a set of keywords, a prepared set not. These include setting a crawl-delay too high and that dynamically create pages for a crawler to keep following 1http://hadoop.apache.org/ indefinitely. C. Web Indexing and they were crawled with Nutch. 
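Equations 1 and 2 translate directly into code. The sketch below evaluates a hypothetical crawl trace against hand-picked relevant and target sets; all the URLs are made-up examples:

```python
def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant (Eq. 1)."""
    crawled = list(crawled)
    return sum(1 for url in crawled if url in relevant) / len(crawled)

def target_recall(crawled, targets):
    """Fraction of the pre-defined target pages already crawled (Eq. 2)."""
    return len(set(crawled) & targets) / len(targets)

relevant = {"r1", "r2", "r3", "r4"}   # hypothetical relevant set R
targets = {"r1", "r3"}                # hypothetical target set R_T, a subset of R
crawled = ["r1", "x1", "r2", "x2"]    # crawl trace S_C^t at some time t

print(harvest_rate(crawled, relevant))
print(target_recall(crawled, targets))
```

Computing both quantities at several points of a crawl, as the text describes, just means calling these functions on successive prefixes of the crawl trace.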
C. Web Indexing

Indexing documents makes them easily searchable by associating them with a compact, browsable list of terms or sections [15]. Web indexing focuses on creating an index for the content of web documents. Documents may be indexed by their full content and/or by their metadata (title, author, date of creation, topic, key terms, etc.). Making an index avoids linearly scanning all documents while searching for queries. Therefore, there exist several types of indexing techniques to make that search as efficient as possible. The most commonly applied to web indexes is the inverted index method.

1) Inverted index: An inverted index can be seen as a dictionary of unique terms, the terms that occur in the collection of documents. Each term is associated with a list comprising all the documents within which that term occurs, a postings list. Each postings list entry holds the document ID and the frequency of the term in that document, possibly alongside other information (e.g., positional information).

When searching for a query in a system with an inverted index, all the terms in the search query are matched with their corresponding postings lists, after passing through the same text processing phase the documents did. Specific documents from the selected postings lists are then selected, depending on the type of query. For example, if it is a boolean query (a query joining the terms with logical expressions like AND, OR, NOT), the same logical operations are performed on the postings lists to achieve the final result.

2) Solr: This project uses Solr [8] as its web indexer. Solr is a popular open-source search platform that can be integrated in all sorts of projects, to serve as a search interface over a collection of documents, and it also comes with a built-in user interface for off-the-shelf index searching.

Like most web indexers for unstructured data (text), Solr uses an inverted index to index a collection of documents. The structure and content type of the index is managed through a schema, configured in the schema.xml file. This file states which fields (title, text, author, date, etc.) should be indexed and/or stored, which field(s) is (are) the primary key, which is the default field for searching, which of them are required for a document to have, and how to index and search each type of field (i.e., through which type of text processing it should pass).

III. DATASET

This system's dataset is comprised of two different sources of knowledge, DMOZ [16] and the Patient Innovation database. DMOZ is the largest human-edited directory of the Web. It contains approximately 4 million references to external websites that are human-labeled in the directory.

From DMOZ, samples of the Arts, News, Society, Recreation, Business, Sports, Science, Computers and Health categories were acquired. These were the categories thought to be more informative, and that would give a better picture of the Internet in general. The majority of the samples about Health were examined, to make sure no sample could be representative of the kind of samples found in Patient Innovation. To collect these samples, several URLs were extracted from each topic and they were crawled with Nutch.

On the other hand, all samples from the Patient Innovation database were collected, and they were labeled "Post" samples. Within them we can differentiate between two types. "Solution" samples are user-innovative patient-driven solutions, created by a patient or a caregiver; these consist of a new or modified device, an aid, an adaptive behavior or a low-cost alternative to an existing product. "Non-solution" samples are potential solutions considered otherwise, either because they did not pass through the screening process yet (which includes a medical evaluation) or because, after the screening process, they were identified as duplicates of solutions, there was not enough information to classify them as such, they were not considered innovations, they were not developed by a patient, caregiver or collaborator, or they were of commercial intent, offensive, inappropriate or physically intrusive (dangerous), e.g., because they involved a diet, drugs, chemicals, other biologics or an invasive device. They can also be ideas for solutions, health-related advice or other information related to patient-driven solutions, but not actual solutions.

When possible, two versions of each of the "Post" samples were acquired: the treated post content, written by the Patient Innovation analysts, and the original post content, retrieved from the webpage where the post information was taken from. The first version has the advantage of being free of the noisy (boilerplate) content which webpages usually have; however, the posts' syntactic and semantic content may be influenced by the analyst who wrote them. In contrast, the second version preserves the syntactic and semantic content of the targeted information; however, it comes with boilerplate.

To clean the dataset, repetitive patterns (expressions like "More info", "Adapted from", "Watch the video") were removed from the posts' treated content, and certain "Non-solution" samples were removed: those that represented duplicates of "Solution" samples, those classified with missing information and those still under medical evaluation. These reasons made these samples dubious as to which class, "Solution" or "Non-solution", they should belong. The full dataset was also checked for duplicate samples.

After that, all samples passed through a pre-processing phase, consisting of removing samples in languages other than English, URL removal, normalization (lower-casing, number and punctuation removal), tokenization, short-word removal (less than 3 characters), short-sample removal (less than 4 tokens), stopword removal and lemmatization or stemming. Two major datasets were created: the Treated Dataset, comprising the treated posts of the Patient Innovation platform and the samples from DMOZ, and the Original Dataset, which contained the original content of the posts, the treated posts whose original content was impossible to collect, and the samples from DMOZ. In the end, four different dataset configurations were made for each major dataset: a stemmed one, a lemmatized one, one with the full syntax of the words and one with full syntax and stopwords. These were used to test which dataset configuration performed better on the tested classifiers.

Given the definitions above, the focus of this system is to automatically search for "Solution" related webpages.

IV. SYSTEM ARCHITECTURE

This section presents the system architecture. First, a high-level picture of the overall system architecture is presented, explaining the inter-process communications between components. Then, each main component is described, explaining its functionalities, processes and development.

A. Main components, tasks and interactions

Figure 1: Architecture of the proposed system.

We refer to the proposed crawler as the 2Gather4Health (2G4H) crawler. The goal of the proposed system is to crawl for specific information, patient-driven solutions, while indexing it to a web indexer. One can think of this goal as the fulfillment of three tasks: a crawling task, a focusing task and an indexing task. Figure 1 portrays this system's architecture, in which the mentioned tasks are depicted, as well as their interactions, which will be explained in the following paragraphs.

The crawling task takes care of all the basic operations occurring during a crawl. That includes injecting the seed URLs (Injector), generating lists of URLs for fetching (Generator), fetching URLs (Fetcher), parsing webpages (Parser), storing/updating parsed content (Updater) and link information (Link Inverter), and removing duplicate URLs (Deduplicator). In this system, these operations are all executed by the core of Nutch.

The focusing task is in charge of target specification. In this case, this task focuses on identifying and following patient-driven solution webpages, performed by the classifier and the distiller, respectively. The classifier takes the parsed information from webpages and classifies them as patient-driven solutions or other categories. The distiller takes the outlinks parsed from webpages and assigns them a score (the fetching score) which will be used to order the fetching of newly discovered URLs. Additionally, it controls which URLs are re-fetched and what score URLs use when setting the fetching order.

The indexing task takes care of permanently storing and indexing patient-driven solution candidate webpages. The main purpose of this task is to make the relevant information accessible and easily searchable. At the end of each iteration, the indexer takes and filters the gathered information to store and index, exempting the crawler from having to store it.

B. Classifier

The classifier's goal is to categorize crawled webpages and calculate the parsing score of each parsed URL. This score symbolizes the probability of a URL belonging to the assigned class. Additionally, it calculates the solution score, a score that represents the degree of resemblance to a patient-driven solution.

This system's classifier is a text classifier with two layers with different goals. In the first layer, the classifier has to separate webpages between the classes "Other", "Health" and "Post". The class "Other" represents all the information that exists on the Web which is not health-related. Class "Health" represents all health-related information which is neither "Solution" nor "Non-solution". Finally, the class "Post" includes the "Solution" and "Non-solution" related information. In its second layer, the classifier's goal is to separate patient-driven solutions ("Solution") from all the other similar but false solution information ("Non-solution").

This architecture was chosen because "Non-solutions" are very similar to "Solutions", so at a first stage, merging them in one class ("Post") helps differentiate them from other topics. Also in the first layer, the crawler can benefit from having a "Health" class: because "Post" samples are also health-related, knowing that a webpage is not a "Post" sample but is health-related can be useful to help the crawler stay on topic. Classifying samples as "Other" helps the crawler identify webpages that do not present relevant topical information. Observing the raw texts and fingerprints of samples from the classes "Solution" and "Non-solution", we noticed that these two shared a very similar vocabulary. Therefore, having a dedicated layer to separate these two classes allowed applying customized methods to try to separate them, without having to be concerned about separating a third class. The downside of this approach is that relevant data can be lost in the first layer, and samples that are neither "Solutions" nor "Non-solutions" can reach the second layer. The more layers a classifier has, the more information it can lose and the more error it can accumulate.

In its final configuration, the first layer is a Multinomial Naive Bayes text classifier, trained with the full dataset, using unigram features with TF-IDF weights, while the second layer is a Fuzzy Fingerprint Classifier, trained with a subset consisting of just the "Post" samples, using bigrams as features with TF weighting. This final configuration was the result of the performance evaluation described in Section V-A.
In addition to classifying a webpage, the classifier also sets its parsing and solution scores. Both are taken from the probability distributions obtained in each layer. The parsing score is the probability estimate of the classified webpage belonging to the class it was assigned to. The solution score is the probability estimate of the classified webpage belonging to the "Solution" class, which is the similarity score, defined in the FFP-C approach, between the classified webpage's fingerprint and the "Solution" class fingerprint.

C. Distiller

The distiller sets the fetching order of the URLs. Its goal is to prioritize the fetching of relevant URLs. The fetching order is sorted by one of two scores: the fetching score, if it is a newly discovered URL, or the solution score, if the URL is being re-fetched for update purposes. If a page is classified as "Health" or "Other", its solution score is considered to be zero, so these pages are never re-fetched.

In order to define the degree of relevancy of a newly discovered URL, its fetching score is a linear combination of three components: the parent fetching score, the parent parsing score and the context score. Their corresponding weights can be manually set in Nutch's configuration file, nutch-site.xml.

The parent fetching score and the parent parsing score are the fetching and (part of) the parsing scores of the parent page, respectively. The parent page of a URL is the page where the URL was found. If a URL has more than one parent, their fetching and parsing scores are averaged to produce the final parent fetching and parsing scores, respectively. When using the parent fetching score, we hope to give a priority boost to pages whose parents and/or higher-degree ascendants are relevant.

The parent parsing score is determined by the parsing score and class that the classifier assigned to the parent webpage. If the parent page was classified as "Post", its full parsing score is used; if it was classified as "Health", only half of the score is used; and if it was classified as "Other", the parent parsing score is considered zero. The parent parsing score favors the premise that pages about the same topic are connected.

The context score is the similarity score between the set of terms extracted from the URL's anchor text in the parent page and from the URL itself, and the set of terms extracted from the Patient Innovation post titles. The similarity function used is the cosine similarity. The set of terms comprised in the URL is obtained by splitting the URL path component on all dots, hyphens, underscores and slashes. The terms obtained are pre-processed (normalization, stopword removal, stemming), as are the titles they are compared to. A URL can have more than one anchor text, from the same or more than one parent page; in that case, the scores are averaged to form a single context score. It was noticed that a lot of the URLs of the original content of the posts contained the post title or a similar sentence in the URL path. Thus, we believe that a similar sentence could also appear in a URL's anchor text. Therefore, the similarity score presented can represent the level of similarity between a URL and the target information.

While this combination of scores is an original approach, it was inspired by other common approaches seen in the literature. For example, in [17], the parent page is used, along with other pages following the family analogy, to predict the relevancy of a target page to a topic. The parent parsing score and the context score are both methods that use the surrounding text of a URL to predict the URL's relevancy. The only difference is that, in this system, different methods are used for when the surrounding text is the full page and when it is just the anchor text.

D. Web indexer

The web indexer used in this system is Solr. Nutch already comes with integrated Solr interaction, so only some configuration was needed.

It was decided to store the following fields: URL, title, content, anchor texts, last fetched date, solution score and assigned class. Information belonging to the first four fields goes through processes of tokenization, stopword removal, lower-casing and stemming during the indexing and search querying processes. This helps maintain a more lightweight index and improves result matching when searching through the index. The last fetched date can be useful to know the freshness of the stored document. The purpose of storing the solution score is to provide a sorting of the results by their similarity to patient-driven solutions. This gives a degree of importance to the results, which users (the analysts of Patient Innovation) can utilize to prioritize the analysis of webpages. Additionally, it is a way of sorting the results without needing a word/phrase query; users can therefore browse results by their importance, without having to focus on a specific topic, keyword or query. The assigned class is stored for users to know to which class the webpages were automatically assigned. It was decided to only index documents classified as "Solution" and "Non-solution", because there is still an unclear automatic separation between these two classes; thus, their samples can serve as training data to re-train the classifier, improving its performance, as shown in Section V-C. Lastly, there is a combined field which aggregates all text fields (URL, title, content and anchor texts) to provide a default field for text search.

V. EXPERIMENTS AND EVALUATION

This section presents the experiments done to evaluate the performance of the crawler's classifier and of the crawler itself. Along with each experiment there is a discussion of the results obtained. In the end, there is a section discussing important points of the evaluation, connecting the dataset used with the results obtained and proposing related tactics to improve the crawler's performance.

A. Classifier validation

In order to evaluate the performance of different classifiers and configurations over both the Original Dataset and the Treated Dataset, the WEKA [18] framework was used. Several types of text classifiers and configurations were tried to train and validate the two-layered classifier. These include ZeroR, Multinomial Naïve Bayes (MNB), Logistic Regression (LR), SVMs and the novel approach of FFP-C. While WEKA offers an implementation of the first classifiers, an implementation of the last was developed together with the corresponding wrapper for WEKA, so all classifiers could be compared using the same evaluation framework. The layers were trained independently.

After experimenting with all dataset configurations, we concluded that using the dataset with the full syntax of words performed better than the others. Also, we included the post titles in the treated post samples. This configuration was used to obtain three derived datasets: one using unigrams, one using bigrams and one using Word2Vec features. For the first layer, a 10-fold cross-validation was performed, while for the second layer it was 5-fold.

The second layer's validation results indicate that the large majority of samples are being classified as "Solution". Using the FFP-C with a dataset of bigrams with TF weights proved to achieve the best performance. With the threshold set to zero, we noticed that it achieved a performance similar to the methods already tried. However, a threshold was found that separated both classes with very good performance. This changes the essence of the classifier, as it can now be seen as a relevancy classifier, which classifies a sample as non-relevant if its score is below some threshold.

1) Testing with layer dependency: In order to test the classifier's performance as a whole, the validation dataset was divided into a training and a test set, and the latter was classified by both layers of the classifier, sequentially. The results are shown in Table III.
was used a much smaller dataset. The text classifier was configured using the Multinomial The results for both layers can be seen in tables I and II, Naive Bayes with unigram TF-IDF features for the first layer corresponding to using the Original Dataset in the fisrt layer and the FFP-C with bigram TF values and set threshold and using the Treated Dataset in the second layer, respectively. approach for the second layer. Additionally, it was decided to For synthesis purposes, we decided to show on this paper just train the first layer with the Original Dataset and the second the most relevant results. All metrics with no class associated layer with the Treated Dataset, plus the test set was obtained are weighted averages of all classes. from the Original Dataset, to better represent real crawled data.
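The analysis chain described in Section IV-D (tokenization, stopword removal, lowercasing and stemming over the URL, title, content and anchor-text fields, plus a combined default field) could be expressed in Solr along these lines. This is a hypothetical schema sketch, not the system's actual configuration: field names, the field type name and the stopword file are illustrative.

```xml
<!-- Hypothetical Solr schema fragment; names are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="anchor" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- Combined default field aggregating all text fields. -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="url" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>
<copyField source="anchor" dest="text"/>
```

With a single analyzer element, Solr applies the same chain at both index and query time, which matches the behavior described above.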

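Since the aggregate metrics reported below carry no class label, they are weighted averages over all classes; they can be recomputed directly from a confusion matrix. The sketch below uses the confusion-matrix values of table III and assumes the aggregate F-measure is derived from the weighted precision and recall (an assumption about how the aggregate was computed, which reproduces the reported numbers).

```python
# Recompute per-class and weighted-average metrics from a confusion matrix.
# Values are those of table III (rows = actual, columns = classified).
classes = ["Solution", "Non-Solution", "Health", "Other"]
cm = [
    [38, 0, 2, 0],   # actual Solution
    [2, 12, 6, 0],   # actual Non-Solution
    [0, 0, 28, 12],  # actual Health
    [2, 1, 6, 91],   # actual Other
]
total = sum(map(sum, cm))

def class_metrics(i):
    tp = cm[i][i]
    support = sum(cm[i])                        # actual samples of class i
    predicted = sum(row[i] for row in cm)       # samples classified as i
    recall = tp / support
    precision = tp / predicted
    fpr = (predicted - tp) / (total - support)  # false positives over actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    return support, recall, precision, fpr, f1

sup, rec, prec, fpr, f1 = class_metrics(0)
print(f"Solution: TPR={rec:.1%} FPR={fpr:.1%} P={prec:.1%} F={f1:.1%}")
# -> Solution: TPR=95.0% FPR=2.5% P=90.5% F=92.7%

# Metrics with no class associated: support-weighted averages, with the
# F-measure taken over the weighted precision and recall.
w_rec = sum(class_metrics(i)[0] * class_metrics(i)[1] for i in range(4)) / total
w_prec = sum(class_metrics(i)[0] * class_metrics(i)[2] for i in range(4)) / total
w_f = 2 * w_prec * w_rec / (w_prec + w_rec)
print(f"Weighted: R={w_rec:.1%} P={w_prec:.1%} F={w_f:.1%}")
# -> Weighted: R=84.5% P=84.8% F=84.7%
```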
Classifier        TPR (Post)  FPR (Post)  Precision  Recall  F-measure
Multinomial NB    93.7 %      1.9 %       91.0 %     91.0 %  91.0 %
LR w/ Word2Vec    93.8 %      1.6 %       92.4 %     92.6 %  92.4 %
SVM w/ Word2Vec   94.4 %      1.9 %       92.3 %     92.4 %  92.3 %
FFP-C             95.0 %      3.4 %       89.0 %     88.5 %  88.6 %

Table I: First layer validation using the Original Dataset.

Classifier        TPR (Solution)  FPR (Solution)  Precision (Solution)  Recall (Solution)  F-measure (Solution)
Multinomial NB    96.3 %          59.2 %          92.6 %                96.3 %             94.4 %
LR w/ Word2Vec    98.3 %          88.8 %          89.6 %                98.3 %             93.7 %
SVM w/ Word2Vec   94.5 %          77.6 %          90.4 %                94.5 %             92.4 %
FFP-C             92.2 %          46.9 %          93.8 %                92.2 %             93.0 %
FFP-C w/ bigrams  98.8 %          1.0 %           99.9 %                98.8 %             99.3 %

Table II: Second layer validation using the Treated Dataset.

Looking at table I, both classifiers using Word2Vec features show the best results. The Multinomial Naive Bayes performs best among the approaches that use unigrams with TF-IDF values. This is important to keep in mind, as the tasks of preparing the training set and the samples to be classified are much faster with the TF-IDF approach than with the Word2Vec one. So, when crawling the web using this text classifier, one can choose the Multinomial Naive Bayes to achieve a more efficient and still effective crawl, as its validation results do not differ much from those of the Word2Vec approaches.

As we can see in table II, separating between "Solution" and "Non-solution" in the second layer using the same approaches as in the first layer is not satisfactory: the results indicate that the large majority of samples are being classified as "Solution". Using the FFP-C with a dataset of bigrams with TF weights proved to achieve the best performance. With the threshold set to zero, we noticed that it achieved a performance similar to the methods already tried; however, a threshold was found that separated both classes with very good performance. This changes the essence of the classifier, as it can now be seen as a relevancy classifier, which labels a sample as non-relevant if its score is below some threshold.

(a) Confusion matrix (rows: actual; columns: classified).
               Solution  Non-Solution  Health  Other
Solution       38        0             2       0
Non-Solution   2         12            6       0
Health         0         0             28      12
Other          2         1             6       91

(b) Evaluation metrics.
TPR/Recall (Solution)  FPR (Solution)  Precision (Solution)  F-Measure (Solution)  Recall  Precision  F-Measure
95.0 %                 2.5 %           90.5 %                92.7 %                84.5 %  84.8 %     84.7 %

Table III: Results of the two-layered classifier on the validation test set.

Looking at the confusion matrix, the important points that stand out are: some "Post" samples are being mistaken for "Health"; and some (but few) non-"Post" samples, in this case "Other", are being classified as "Solution" when they pass to the second layer. Bearing in mind the goal of this text classifier in the overall system, which is to identify patient-driven solutions, the first point affects the solution recall, while the second affects its precision. However, the results show that these consequences only slightly affect the overall performance.

B. Evaluation on crawled data

A performance evaluation on crawled data was done to test the text classifier's performance in a real scenario. Around 2000 webpages were crawled. From those, two disjoint subsets of around 200 webpages each were sent to an analyst for manual labeling. The subsets were randomly sampled and had the same class distribution as the set they came from. The text classifier was configured with the same specifications of Section V-A1. Table IV shows the results of the text classifier on the first subset of crawled data, after cross-checking the manual labels with the labels automatically given by the classifier.

(a) Confusion matrix (rows: actual; columns: classified).
               Solution  Non-Solution  Health  Other
Solution       22        1             0       2
Non-Solution   29        2             1       7
Health         28        6             46      25
Other          3         3             2       15

(b) Evaluation metrics.
TPR/Recall (Solution)  FPR (Solution)  Precision (Solution)  F-Measure (Solution)  Recall  Precision  F-Measure
88.0 %                 35.9 %          26.8 %                41.1 %                44.3 %  61.9 %     51.6 %

Table IV: Results of the classifier on crawled test set 1.

(a) Confusion matrix (rows: actual; columns: classified).
               Solution  Non-Solution  Health  Other
Solution       22        2             0       1
Non-Solution   18        14            1       6
Health         25        13            59      8
Other          3         6             4       10

(b) Evaluation metrics.
TPR/Recall (Solution)  FPR (Solution)  Precision (Solution)  F-Measure (Solution)  Recall  Precision  F-Measure
88.0 %                 27.5 %          32.4 %                47.3 %                54.7 %  67.5 %     60.4 %

Table V: Results of the classifier on crawled test set 1, after adding test set 2 to its training data.
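The set-threshold variant of the second layer described above can be sketched as follows. This is a minimal illustration only: it assumes a plain bigram term-frequency "fingerprint" and the cosine similarity, whereas the actual FFP-C builds its fingerprints differently; all names are ours.

```python
# Minimal sketch of a threshold-based relevancy classifier (second layer).
# Assumption: fingerprints are plain bigram TF vectors compared by cosine
# similarity; the real FFP-C fingerprints are constructed differently.
from collections import Counter
from math import sqrt

def bigram_tf(text):
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_second_layer(sample_text, solution_fingerprint, threshold=0.3):
    """Label a sample "Solution" only if its similarity to the
    "Solution" class fingerprint reaches the threshold."""
    score = cosine(bigram_tf(sample_text), solution_fingerprint)
    label = "Solution" if score >= threshold else "Non-solution"
    return label, score
```

Note that with the threshold at zero every sample is labeled "Solution", which mirrors the degenerate behavior observed before the threshold was tuned.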

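The two crawl metrics used in Section V-D can be stated compactly: the harvest rate is the fraction of fetched pages that the evaluation classifier deems relevant, and the target recall is the fraction of a pre-collected target set that the crawl has reached. A small sketch under those standard definitions (function and variable names are ours):

```python
# Standard focused-crawling metrics, as used in Section V-D.

def harvest_rate(fetched_pages, is_relevant):
    """Fraction of fetched pages judged relevant by the evaluation classifier."""
    if not fetched_pages:
        return 0.0
    return sum(1 for p in fetched_pages if is_relevant(p)) / len(fetched_pages)

def target_recall(fetched_urls, target_urls):
    """Fraction of the target set already reached by the crawl."""
    targets = set(target_urls)
    if not targets:
        return 0.0
    return len(set(fetched_urls) & targets) / len(targets)

# Example: 3 of 4 fetched pages are relevant, 1 of 2 targets was reached.
pages = ["a", "b", "c", "d"]
relevant = {"a", "b", "c"}
print(harvest_rate(pages, relevant.__contains__))  # 0.75
print(target_recall(pages, ["a", "x"]))            # 0.5
```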
As can be seen, the text classifier's performance drastically decreased from the one shown during validation. Observing the results, we can see that the main reasons for the classifier's poor performance are: many "Health" samples being misclassified in the first layer, especially many "Health" samples being classified as "Solution"; and many non-"Solution" samples being classified as such in the second layer.

By manually inspecting the samples of the crawled test sets, it was seen that the majority of "Health" samples classified as "Post" were either blog posts telling stories of people living with some kind of health-related condition, or webpages describing health organizations for people with special needs. It is presumed that these were classified as "Post" due to the similarity in vocabulary that they have to real "Solution" samples, and because the vast majority of "Health" samples in the training set do not cover these cases; they are mostly webpages about general health topics.

The second problem seems to have appeared due to the lack of training samples representing the negative class in the second layer. Having different and more negative representatives may alter the threshold needed to separate both classes, and bring significance to the fingerprint of "Non-solution", changing the FFP-C essence back to a similarity classifier. Therefore, a solution could be to populate the training data with the manually labeled samples.

Something to notice is that, although the classifier does not seem to identify the "Solution" samples precisely, it appears to identify the majority of health-related samples as such, while having good precision. Health-related samples are the samples from the classes "Solution", "Non-solution" and "Health".

C. Using crawled data to improve the classifier

In order to prove the hypothesis given, the second subset of crawled data was added to the classifier's training data. The classifier model was kept unchanged. The first subset was used again to verify if this addition improved the classifier's performance. The results can be seen in table V.

Observing the results of this experiment, several things can be noticed. There is an overall improvement in the classifier's performance: almost all metrics improved, with very few exceptions. Looking at the confusion matrix, we can see that more "Health" samples are being classified as such, and a much better separation is achieved between the "Solution" and "Non-solution" classes, without affecting the overall "Solution" recall. This confirms the theory that repopulating the training data with significant examples, even without changing the classification model, can improve the classifier's performance.

However, it can also be noticed that a lot of "Health" samples were still classified as "Solution". As there were not many of these examples in the second crawled data subset, the classifier could not improve significantly on this matter. Still, a slight improvement can be seen.

D. Crawler performance comparison

In order to analyze the system's crawling performance, four different crawling approaches were compared: a breadth-first approach, a best-first approach, a URL context approach and the proposed approach. The best-first approach gives priority to the URLs contained in webpages with the highest parsing score; basically, it is an approach that just uses the parent parsing score. The URL context approach gives priority to the URLs with the highest context score, following the approach mentioned in Section IV-C. The proposed approach used equal weights for each component of the fetching score. All crawls started from the same set of seed URLs. Additionally, all crawls were done through Nutch, ran for the same number of iterations, and a limit of 25 pages per host per iteration was imposed to increase webpage diversity. All approaches were run on the same machine and crawled a total of around 4000 webpages.

The evaluation classifier used to classify the visited webpages was the one implemented for the proposed system. As shown in Section V-B, this classifier, with the current training data, does not present a high performance classifying "Solution" samples on crawled data, and its classification cannot be used as ground truth. However, it can be used to compare different crawling approaches, because this way all methods use the same classification truth. One just has to keep in mind that the measures presented might, in reality, have a lower true value.

In order to calculate the target recall, all URLs from pages classified as "Solution" were collected during the breadth-first crawl; this would be the set of target webpages. As the broad approach was used to collect the target set, it was not used for comparison on target recall.

Figure 2: Crawling comparison plots. (a) Harvest rate on the "Solution" topic. (b) Target recall on the "Solution" topic. (c) Harvest rate on the health-related topics.

Figure 2a and figure 2b show the harvest rate and target recall on the "Solution" topic for each crawling strategy, respectively. All the measure points presented were calculated at the end of each iteration for each approach, except for the first iteration. This one was left out because it represented no added value, as it was the seed-URL fetching iteration.

As can be seen in figure 2a, all focused crawling approaches perform better than the broad approach in terms of harvest rate on the target topic. As for the 2G4H approach, its harvest rate maintains an approximately constant value, always higher than the other approaches'. Additionally, it begins from the highest value of harvest rate, meaning that on the second iteration it already performed a better choice of URLs than the other techniques. Regarding the target recall, the proposed approach's value is also always higher than that of the other two focused crawling approaches. Furthermore, the 2G4H approach's target recall keeps constantly increasing, as opposed to the other approaches.

While the evaluation classifier does not have a high accuracy classifying "Solution" samples, it has high accuracy classifying health-related samples, as stated in Section V-B. So it is interesting to analyze its harvest rate on health-related topics, which is depicted in figure 2c. One can see that the proposed crawling approach reaches values of harvest rate on health-related topics slightly above 80%. These are considerably high values, much higher than the other approaches'.

In order to conclude that the proposed approach is in fact a focused crawling approach, a requirement is for it to be more efficient than a breadth-first (broad) crawling approach while searching for patient-driven solutions. This system accomplished that requirement, and it even showed that it can perform better than other common focused crawling approaches regarding the metrics of harvest rate and target recall.

E. Discussion

Upon analyzing the results, it can be seen that there is much room for improvement. While the crawler performance seems satisfactory, the tests on crawled data show that the text classifier needs improvement.

The main problem may be in the dataset used to train the classifier. Using human-labeled webpage repositories, like DMOZ, to collect data for webpage classification has proven to be a satisfactory method when the classes in play are broad topics like Health, Sports, Politics, Technology, etc. However, for very specific classes there must exist a lot of specific information describing the mentioned classes, to separate the specific classification from the broad classification. We tried to achieve the latter by adding the two layers to the text classifier. But there was clearly a lack of negative examples in the second layer to properly separate the positive and negative classes.

Prior to the classifier validation, more negative samples should have been identified. In fact, training the classifier with more "Non-solution" samples helped decrease the false positive rate in the second layer, as shown in Section V-C. In Section V-B, it was noticed that a lot of "Health" samples were being classified as "Solution" due to the similarity in vocabulary. The majority of these samples described stories of people living with some health condition. The "Non-solution" class definition could be broadened to include these examples, as they share the same nature as this class's samples (neither are "Solutions", but both are patient-related and describe how to cope with a health-related problem). Having these samples populate the "Non-solution" class should increase the number of negative examples, and it should also further accentuate the separation between the "Health" class and the "Post" class.

VI. CONCLUSION

For this research, a system was built with the goal of automatically searching for innovative patient-driven solutions and indexing the results. The system is composed of a focused crawler, with two components responsible for classification and focused navigation, the classifier and the distiller, respectively, and a web indexer, which indexes the webpages considered relevant by the crawler.

Following state-of-the-art approaches, it was decided to pursue a machine learning based method to classify the visited webpages. As there were patient-driven solutions already available, this seemed to be the best technique to use. A text classifier with two layers was adopted. The classifier validation results showed to be very good, with all metrics above 90%.
However, the classifier's performance on crawled data was not as satisfactory. Nonetheless, it was noticed that the classifier's results on crawled data were explained by the lack of more contextualized representative samples in the "Health" and "Non-solution" classes of the training data. This demonstrated that the initial dataset needed to be improved, as the problem seemed to lie in the classifier's training data and not in the model. Further results showed that the classification can be improved by re-training the classifier with manually labeled crawled data.

The distiller is the component that prioritizes the URLs to be fetched. It was based on common state-of-the-art approaches, which relied on the content, the URL context or the scores of the pages where the URL being analyzed was found. By combining these methods, it was expected to obtain a better result than using a single one, as they target different aspects and each one proved to work in past research. When tested, the proposed crawling approach showed it is more efficient than a broad approach and than some of the focused approaches the distiller is based on. This supports the belief that the combination of methods used can perform focused crawling, outperforming single methods for the focused task.

This Master thesis shows that it is possible to build a system that successfully searches for innovative patient-driven solutions. The proposed system can be used on online platforms for patient-driven solution diffusion, like Patient Innovation, to increase the search efficiency and the diversity of solutions to be posted. Additionally, this opens doors to the automatic search of user innovation. User innovation researchers can build upon this thesis' methodology and results to learn better ways of collecting and studying user innovation cases.

A. Future work

Sections V-C and V-E suggest that the system would benefit from an online-training classifier that would re-train itself with more relevant data, automatically updating its model through a process of several validation tests, whenever a batch of new manually labeled results becomes available. This system would be supervised by the platform analysts, who would label the new training samples.

REFERENCES

[1] L. Zejnilović, P. Oliveira, and H. Canhão, Innovations by and for Patients, and Their Place in the Future Health Care System. Berlin, Heidelberg: Springer Berlin Heidelberg, 2016, pp. 341–357.
[2] E. Ikonomakis, S. Kotsiantis, and V. Tampakas, "Text Classification Using Machine Learning Techniques," WSEAS Transactions on Computers, vol. 4, pp. 966–974, 2005.
[3] R. Jindal, R. Malhotra, and A. Jain, "Techniques for text classification: Literature review and current trends," Webology, vol. 12, no. 2, p. 1, 2015.
[4] E. Cambria and B. White, "Jumping NLP Curves: A Review of Natural Language Processing Research," IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48–57, 2014.
[5] Google, "word2vec," 2013. [Online]. Available: https://code.google.com/archive/p/word2vec/ (Accessed 2018-08-15).
[6] N. Homem and J. P. Carvalho, "Authorship identification and author fuzzy "fingerprints"," in 2011 Annual Meeting of the North American Fuzzy Information Processing Society, 2011, pp. 1–6.
[7] R. Khare, D. Cutting, K. Sitaker, and A. Rifkin, "Nutch: A flexible and scalable open-source web search engine," Oregon State University, vol. 1, p. 32, 2004.
[8] The Apache Software Foundation, "Apache Solr." [Online]. Available: http://lucene.apache.org/solr/ (Accessed 2018-08-20).
[9] A. Bialecki, "AdaptiveFetchSchedule (apache-nutch 1.12 API)." [Online]. Available: http://nutch.apache.org/apidocs/apidocs-1.12/org/apache/nutch/crawl/AdaptiveFetchSchedule.html (Accessed 2018-09-24).
[10] C. Su, Y. Gao, J. Yang, and B. Luo, "An efficient adaptive focused crawler based on ontology learning," in Fifth International Conference on Hybrid Intelligent Systems (HIS'05), 2005, 6 pp.
[11] M. Shokouhi, P. Chubak, and Z. Raeesy, "Enhancing focused crawling with genetic algorithms," in International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II, vol. 2, 2005, pp. 503–508.
[12] S. Chakrabarti, M. Berg, and B. Dom, "Focused crawling: A New Approach to Topic-Specific Web Resource Discovery," Computer Networks, vol. 31, no. 11, pp. 1623–1640, 1999.
[13] P. Srinivasan, F. Menczer, and G. Pant, "A General Evaluation Framework for Topical Crawlers," Information Retrieval, vol. 8, no. 3, pp. 417–447, Jan. 2005.
[14] M. Koster, "The Web Robots Pages," 1996. [Online]. Available: http://www.robotstxt.org/robotstxt.html (Accessed 2018-08-18).
[15] A. Tripathi, "Unit-14 Overview of Web Indexing, Metadata, Interoperability and Ontologies." IGNOU, 2017.
[16] AOL Inc., "DMOZ - The Directory of the Web." [Online]. Available: http://dmoz-odp.org/ (Accessed 2018-05-28).
[17] X. Qi and B. D. Davison, "Knowing a Web Page by the Company It Keeps," in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, ser. CIKM '06. New York, NY, USA: ACM, 2006, pp. 228–237.
[18] E. Frank, M. Hall, P. Reutemann, and L. Trigg, "Weka 3 - Data Mining with Open Source Machine Learning Software in Java." [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/index.html (Accessed 2018-06-10).