
SPIED: Stanford Pattern-based Information Extraction and Diagnostics

Sonal Gupta and Christopher D. Manning
Department of Computer Science, Stanford University
{sonal, manning}@cs.stanford.edu

Abstract

This paper aims to provide an effective interface for progressive refinement of pattern-based information extraction systems. Pattern-based information extraction (IE) systems have an advantage over machine learning based systems in that patterns are easy to customize to cope with errors and are interpretable by humans. Building a pattern-based system is usually an iterative process of trying different parameters and thresholds to learn patterns and entities with high precision and recall. Since patterns are interpretable to humans, it is possible to identify sources of errors, such as patterns responsible for extracting incorrect entities, and vice-versa, and to correct them. However, this involves time-consuming manual inspection of the extracted output. We present a light-weight tool, SPIED, to aid IE system developers in learning entities using patterns with bootstrapping, and in visualizing the learned entities and patterns with explanations. SPIED is the first publicly available tool to visualize diagnostic information of multiple pattern learning systems, to the best of our knowledge.

1 Introduction

Entity extraction using rules dominates commercial industry, mainly because rules are effective, interpretable by humans, and easy to customize to cope with errors (Chiticariu et al., 2013). Rules, which can be hand crafted or learned by a system, are commonly created by looking at the context around already known entities, such as surface word patterns (Hearst, 1992) and dependency patterns (Yangarber et al., 2000). Building a pattern-based learning system is usually a repetitive process, usually performed by the system developer, of manually examining a system's output to identify improvements or errors introduced by changing the entity or pattern extractor. Interpretability of patterns makes it easier for humans to identify sources of errors by inspecting the patterns that extracted incorrect instances, or the instances that resulted in learning of bad patterns. Parameters range from the window size of the context in surface word patterns to thresholds for learning a candidate entity. At present, there is a lack of tools helping a system developer to understand results and to improve the system iteratively.

Visualizing diagnostic information of a system and contrasting it with another system can make the iterative process easier and more efficient. For example, consider a user trying to decide on the context's window size in surface word patterns, who deliberates whether a part-of-speech (POS) restriction on context words might be required for a reduced window size to avoid extracting erroneous mentions.[1] By comparing and contrasting the extractions of two systems with different parameters, the user can investigate the cases in which the POS restriction is required with the smaller window size, and whether the restriction causes the system to miss some correct entities. In contrast, comparing just the accuracy of two systems does not allow inspecting the finer details of extractions that increase or decrease accuracy, and making changes accordingly.

[1] A shorter context size usually extracts entities with higher recall but lower precision.

In this paper, we present a pattern-based entity learning and diagnostics tool, SPIED. It consists of two components: 1. pattern-based entity learning using bootstrapping (SPIED-Learn), and 2. visualizing the output of one or two entity learning systems (SPIED-Viz). SPIED-Viz is independent of SPIED-Learn and can be used with any pattern-based entity learner. For demonstration, we use the output of SPIED-Learn as an input to SPIED-Viz. SPIED-Viz has pattern-centric and entity-centric views, which visualize learned patterns and entities, respectively, and the explanations for learning them. SPIED-Viz can also contrast two systems by comparing the ranks of learned entities and patterns. In this paper, as a concrete example, we learn and visualize drug-treatment (DT) entities from unlabeled patient-generated medical text, starting with seed dictionaries of entities for multiple classes. The task was proposed and further developed in Gupta and Manning (2014b) and Gupta and Manning (2014a).

Our contributions in this paper are: 1. we present a novel diagnostic tool for visualization of the output of multiple pattern-based entity learning systems, and 2. we release the code of an end-to-end pattern learning system, which learns entities using patterns in a bootstrapped system and visualizes its diagnostic output. The pattern learning code is available at http://nlp.stanford.edu/software/patternslearning.shtml. The visualization code is available at http://nlp.stanford.edu/software/patternviz.shtml.

2 Learning Patterns and Entities

Bootstrapped systems have been commonly used to learn entities (Riloff, 1996; Collins and Singer, 1999). SPIED-Learn is based on the system described in Gupta and Manning (2014a), which builds upon previous bootstrapped pattern-learning work and proposed an improved measure to score patterns (Step 3 below). It learns entities for given classes from unlabeled text by bootstrapping from seed dictionaries. Patterns are learned using labeled entities, and entities are learned based on the extractions of learned patterns. The process is performed iteratively until no more patterns or entities can be learned. The following steps give a short summary of the iterative learning of entities belonging to a class DT:

1. Data labeling: The text is labeled using the class dictionaries, starting with the seed dictionaries in the first iteration. A phrase matching a dictionary phrase is labeled with the dictionary's class.

2. Pattern generation: Patterns are generated using the context around the positively labeled entities to create candidate patterns for DT.

3. Pattern learning: Candidate patterns are scored using a pattern scoring measure and the top ones are added to the list of learned patterns for DT. The maximum number of patterns learned is given as an input to the system by the developer.

4. Entity learning: Learned patterns for the class are applied to the text to extract candidate entities. An entity scorer ranks the candidate entities and adds the top entities to DT's dictionary. The maximum number of entities learned is given as an input to the system by the developer.

5. Repeat steps 1-4 for a given number of iterations.

SPIED provides an option to use any of the pattern scoring measures described in Riloff (1996), Thelen and Riloff (2002), Yangarber et al. (2002), Lin et al. (2003), and Gupta and Manning (2014b). A pattern is scored based on the positive, negative, and unlabeled entities it extracts. The positive and negative labels of entities are heuristically determined by the system using the dictionaries and the iterative entity learning process; the oracle labels of learned entities are not available to the learning system. Note that an entity that the system considered positive might actually be incorrect, since the seed dictionaries can be noisy and the system can learn incorrect entities in the previous iterations, and vice-versa. SPIED's entity scorer is the same as in Gupta and Manning (2014a). Each candidate entity is scored using the weights of the patterns that extract it and other entity scoring measures, such as TF-IDF. Thus, the learning of each entity can be explained by the learned patterns that extract it, and the learning of each pattern can be explained by all the entities it extracts.

3 Visualizing Diagnostic Information

SPIED-Viz visualizes learned entities and patterns from one or two entity learning systems, and the diagnostic information associated with them. It optionally uses the oracle labels of learned entities to color code them, and to contrast the ranks of correct/incorrect entities when comparing two systems. The oracle labels are usually determined by manually judging each learned entity as correct or incorrect. SPIED-Viz has two views: 1. a pattern-centric view that visualizes patterns of one to two systems, and 2. an entity-centric view that mainly focuses on the entities learned.
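The iterative learning procedure of Section 2 can be sketched in miniature as follows. The toy corpus, the function names, and the simple frequency-based pattern score are illustrative assumptions only, not SPIED-Learn's actual implementation:

```python
# Toy sketch of bootstrapped entity learning (Steps 1-5 of Section 2).
# All names and the scoring heuristics are illustrative, not SPIED's code.

def generate_patterns(sentences, positives, window=2):
    """Step 2: contexts around positively labeled entities become candidates."""
    patterns = set()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok in positives:
                left = " ".join(toks[max(0, i - window):i])
                right = " ".join(toks[i + 1:i + 1 + window])
                patterns.add((left, right))
    return patterns

def apply_pattern(pattern, sentences):
    """All tokens whose surrounding context matches the pattern."""
    left, right = pattern
    matches = set()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            l = " ".join(toks[max(0, i - len(left.split())):i]) if left else ""
            r = " ".join(toks[i + 1:i + 1 + len(right.split())]) if right else ""
            if l == left and r == right:
                matches.add(tok)
    return matches

def learn(sentences, seed, iterations=2, max_patterns=2, max_entities=2):
    """Bootstrapping loop: label (1), generate (2), score patterns (3),
    score entities (4), repeat (5)."""
    dictionary = set(seed)
    learned_patterns = set()
    for _ in range(iterations):
        candidates = generate_patterns(sentences, dictionary) - learned_patterns
        # Step 3 stand-in: rank candidates by how many known entities they extract.
        scored = sorted(candidates,
                        key=lambda p: len(apply_pattern(p, sentences) & dictionary),
                        reverse=True)
        learned_patterns.update(scored[:max_patterns])
        # Step 4 stand-in: rank candidate entities by how many patterns extract them.
        cand_entities = {}
        for p in learned_patterns:
            for e in apply_pattern(p, sentences) - dictionary:
                cand_entities[e] = cand_entities.get(e, 0) + 1
        ranked = sorted(cand_entities, key=cand_entities.get, reverse=True)
        dictionary.update(ranked[:max_entities])
    return dictionary, learned_patterns
```

For instance, seeding with {"ibuprofen"} on three sentences of the form "i took ... for the pain" learns the context pattern ("i took", "for the"), and from it the other drug mentions in those contexts.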

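Among the pattern scoring measures that SPIED supports, the RlogF measure used in bootstrapping work such as Riloff (1996) and Thelen and Riloff (2002) is representative: a pattern's score grows with both the fraction and the absolute number of positive entities among its extractions. A minimal sketch (our own simplification, not SPIED's exact scorer):

```python
# RlogF-style pattern score: (pos / total) * log2(pos).
# A simplified illustration of one measure in the family SPIED supports.
import math

def rlogf(num_positive, num_total):
    """Score a pattern by the positive and total entities it extracts.

    Patterns that extract many entities, mostly positive ones, score high;
    patterns that extract no positive entities are never selected."""
    if num_positive == 0:
        return float("-inf")
    return (num_positive / num_total) * math.log2(num_positive)
```

A pattern extracting 8 positives out of 10 entities thus outranks one extracting 2 positives out of 2, rewarding frequent patterns over rare but perfectly precise ones.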

Figure 1: Entity-centric view of SPIED-Viz. The interface allows the user to drill down into the results to diagnose the extraction of correct and incorrect entities, and to contrast the details of the two systems. The entities that are not learned by the other system are marked with either a trophy (correct entity), a thumbs down (incorrect entity), or a star icon (oracle label missing), for easy identification.
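SPIED-Viz is driven by JSON descriptions of the learned entities and patterns together with their provenance. The schema below is entirely hypothetical, invented to illustrate the kind of per-iteration provenance (patterns and entities, their scores, and their labels) such a visualization needs; it is not SPIED's actual input format:

```python
# Hypothetical input sketch for a SPIED-Viz-style visualizer; every field
# name here is invented for illustration and is NOT SPIED's real schema.
import json

run = {
    "system": "system-A",
    "iterations": [{
        "iteration": 1,
        "patterns": [{
            "pattern": "took _ for the pain",        # invented surface pattern
            "score": 2.4,
            "positive": ["ibuprofen"],               # labels the learner assigned
            "negative": [],
            "unlabeled": ["tylenol", "advil"],
        }],
        "entities": [{
            "entity": "tylenol",
            "score": 0.9,
            "extracted_by": ["took _ for the pain"], # provenance for the entity
            "oracle_label": "correct",               # assigned offline by a human
        }],
    }],
}

# Serialize for a browser-based front end.
payload = json.dumps(run, indent=2)
```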

Figure 1 shows a screenshot of the entity-centric view of SPIED-Viz. It displays the following information:

Summary: Summary information for each system, at each iteration and overall. It shows for each system the number of iterations, the number of patterns learned, and the number of correct and incorrect entities learned.

Learned Entities with provenance: A ranked list of the entities learned by each system, along with an explanation of why each entity was learned. The details shown include the entity's oracle label, its rank in the other system, and the learned patterns that extracted the entity. Such information can help the user to identify and inspect the patterns responsible for learning an incorrect entity. The interface also provides a link to search the entity, along with any user-provided keywords (such as the domain of the problem), on Google.

System Comparison: SPIED-Viz can be used to compare the entities learned by two systems. It marks entities that are learned by one system but not by the other, by displaying either a trophy sign (if the entity is correct), a thumbs down sign (if the entity is incorrect), or a star sign (if the oracle label is not provided).

The second view of SPIED-Viz is pattern-centric. Figure 2 shows a screenshot of the pattern-centric view. It displays the following information:

Summary: Summary information for each system, including the number of iterations and the number of patterns learned at each iteration and overall.

Learned Patterns with provenance: A ranked list of patterns along with the entities each pattern extracts and their labels. Note that each pattern is associated with a set of positive, negative, and unlabeled entities, which were used to determine its score.[2] It also shows the percentage of unlabeled entities extracted by a pattern that were eventually learned by the system and assessed as correct by the oracle. A smaller percentage means that the pattern extracted many entities that were either never learned, or learned but labeled as incorrect by the oracle.

[2] Note that the positive, negative, and unlabeled labels are different from the oracle labels, correct and incorrect, for the learned entities. The former refer to the entity labels considered by the system when learning the pattern, and they come from the seed dictionaries and the learned entities. A positive entity considered by the system can be labeled as incorrect by the human assessor, in case the system made a mistake in labeling the data, and vice-versa.

Figure 3 shows an option in the entity-centric view: hovering over an entity opens a window on the side that shows the diagnostic information for the entity as learned by the other system. This allows the user to directly contrast the learning of an entity by both systems; for example, it can help the user to inspect why an entity was learned at an earlier rank in one system than in the other.

An advantage of making the entity learning component and the visualization component independent is that a developer can use any pattern scorer or entity scorer in the system without depending on the visualization component to provide that functionality.

Figure 2: Pattern-centric view of SPIED-Viz. In the figure, a blue pattern was not learned by the other system; an exclamation sign marks a pattern for which less than half of its unlabeled entities were eventually learned with the correct label; green marks an entity that was eventually learned by the system and judged correct by the oracle.

Figure 3: When the user clicks on the compare icon for an entity, the explanations of the entity extraction for both systems (if available) are displayed. This allows direct comparison of why the two systems learned the entity.

4 System Details

SPIED-Learn uses TokensRegex (Chang and Manning, 2014) to create and apply surface word patterns to text. SPIED-Viz takes the details of learned entities and patterns as input in a JSON format. It uses JavaScript, Angular, and jQuery to visualize the information in a web browser.

5 Related Work

Most interactive IE systems focus on annotation of text, labeling of entities, and manual writing of rules. Some annotation and labeling tools are: MITRE's Callisto (http://callisto.mitre.org), Knowtator (http://knowtator.sourceforge.net), SAPIENT (Liakata et al., 2009), brat (http://brat.nlplab.org), Melita (Ciravegna et al., 2002), and XConc Suite (Kim et al., 2008). Akbik et al. (2013) interactively help non-expert users to manually write patterns over dependency trees. GATE (http://gate.ac.uk) provides the JAPE language, which recognizes regular expressions over annotations. Other systems focus on reducing the manual effort of developing extractors (Brauer et al., 2011; Li et al., 2011). In contrast, our tool focuses on visualizing and comparing diagnostic information associated with pattern learning systems.

WizIE (Li et al., 2012) is an integrated environment for annotating text and writing pattern extractors for information extraction. It also generates regular expressions around labeled mentions and suggests patterns to users. It is most similar to our tool, as it displays an explanation of the results extracted by a pattern. However, it is focused towards hand writing and selection of rules. In addition, it cannot be used to directly compare two pattern learning systems.

What's Wrong With My NLP? (https://code.google.com/p/whatswrong) is a tool for jointly visualizing various natural language processing formats, such as trees, graphs, and entities. It can be used alongside our system to visualize the patterns, since we mainly focus on diagnostic information.

6 Future Work and Conclusion

We plan to add a feature for a user to provide the oracle label of a learned entity using the interface; currently, the oracle labels are assigned offline. We also plan to extend SPIED to visualize diagnostic information from a pattern-based relation learning system. Another avenue of future work is to evaluate SPIED-Viz by studying its users and their interactions with the system. In addition, we plan to improve the visualization by summarizing the diagnostic information, such as which parameters led to which mistakes, to make it easier to understand for systems that extract a large number of patterns and entities.

In conclusion, we present a novel diagnostic tool for pattern-based entity learning that visualizes and compares the output of one to two systems. It is a light-weight, web-browser-based visualization that can be used with any pattern-based entity learner. We make the code of an end-to-end system freely available for research purposes; the system learns entities and patterns using bootstrapping, starting with seed dictionaries, and visualizes the diagnostic output. We hope SPIED will help other researchers and users to diagnose errors and tune parameters in their pattern-based entity learning systems in an easy and efficient way.

References

Alan Akbik, Oresti Konomi, and Michail Melnikov. 2013. Propminer: A workflow for interactive information extraction and exploration using dependency trees. In ACL (Conference System Demonstrations), pages 157-162.

Falk Brauer, Robert Rieger, Adrian Mocan, and Wojciech M. Barczynski. 2011. Enabling information extraction by inference of regular expressions from sample entities. In CIKM, pages 1285-1294.

Angel X. Chang and Christopher D. Manning. 2014. TokensRegex: Defining cascaded regular expressions over tokens. Stanford University Technical Report.

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '13, pages 827-832.

Fabio Ciravegna, Alexiei Dingli, Daniela Petrelli, and Yorick Wilks. 2002. User-system cooperation in document annotation based on information extraction. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, EKAW '02, pages 122-137. Springer Verlag.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100-110.

Sonal Gupta and Christopher D. Manning. 2014a. Improved pattern learning for bootstrapped entity extraction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL).

Sonal Gupta and Christopher D. Manning. 2014b. Induced lexico-syntactic patterns improve information extraction from online medical forums. Under submission.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, COLING '92, pages 539-545.

Jin-Dong Kim, Tomoko Ohta, and Jun'ichi Tsujii. 2008. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics.

Yunyao Li, Vivian Chu, Sebastian Blohm, Huaiyu Zhu, and Howard Ho. 2011. Facilitating pattern discovery for relation extraction with semantic-signature-based clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 1415-1424.

Yunyao Li, Laura Chiticariu, Huahai Yang, Frederick R. Reiss, and Arnaldo Carreno-fuentes. 2012. WizIE: A best practices guided development environment for information extraction. In Proceedings of the ACL 2012 System Demonstrations, ACL '12, pages 109-114.

Maria Liakata, Claire Q, and Larisa N. Soldatova. 2009. Semantic annotation of papers: Interface & enrichment tool (SAPIENT). In Proceedings of the BioNLP 2009 Workshop, pages 193-200.

Winston Lin, Roman Yangarber, and Ralph Grishman. 2003. Bootstrapped learning of semantic classes from positive and negative examples. In Proceedings of the ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.

Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the 13th National Conference on Artificial Intelligence, AAAI '96, pages 1044-1049.

Michael Thelen and Ellen Riloff. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '02, pages 214-221.

Roman Yangarber, Ralph Grishman, and Pasi Tapanainen. 2000. Automatic acquisition of domain knowledge for information extraction. In Proceedings of the 18th International Conference on Computational Linguistics, COLING '00, pages 940-946.

Roman Yangarber, Winston Lin, and Ralph Grishman. 2002. Unsupervised learning of generalized names. In Proceedings of the 19th International Conference on Computational Linguistics, COLING '02.