Text Data Processing Language Reference Guide Company
Total Page:16
File Type:pdf, Size:1020Kb
PUBLIC SAP Data Services Document Version: 4.2 Support Package 14 (14.2.14.0) – 2021-01-08 Text Data Processing Language Reference Guide company. All rights reserved. All rights company. affiliate THE BEST RUN 2020 SAP SE or an SAP SE or an SAP SAP 2020 © Content 1 Text Data Processing Language Reference Guide....................................4 2 Overview of the Language Reference Guide........................................5 2.1 Who Should Read This Guide.....................................................5 3 Overview of Text Data Processing Functionality.....................................6 3.1 Character Encodings.......................................................... 9 3.2 About Linguistic Analysis....................................................... 9 3.3 About Entity Extraction.........................................................9 About Customizing Extraction.................................................10 3.4 About Fact Extraction.........................................................10 About Sentiment Analysis ....................................................11 About Public Sector Fact Extraction.............................................11 About Enterprise Fact Extraction...............................................11 4 Linguistic Analysis ..........................................................13 4.1 Word Segmentation...........................................................14 Whitespace Languages......................................................14 Nonwhitespace Languages...................................................30 4.2 Case Normalization.......................................................... 39 4.3 Stemming.................................................................40 Language-Specific Stemming Examples..........................................41 Stemming Unknown Words...................................................81 4.4 Compound Analysis.......................................................... 81 Language-Specific Compound Analysis Examples...................................81 4.5 Part-of-Speech Tagging........................................................87 Part-of-Speech Naming Conventions............................................88 Part-of-Speech and Stem Disambiguation ........................................88 Guessing Part-of-Speech for Unknown Words......................................89 Language-Specific Unknown Words Guessing Examples..............................89 Language-Specific Part-of-Speech Tagging Examples................................93 5 Entity Extraction.......................................................... 208 5.1 Predefined Entity Types by Language.............................................208 5.2 Language-specific Entity Type Examples...........................................214 5.3 Definitions of Entity Types.....................................................295 5.4 Entity Normalization.........................................................308 6 Fact Extraction............................................................314 Text Data Processing Language Reference Guide 2 PUBLIC Content 6.1 Sentiment Analysis Fact Extraction...............................................314 Sentiment Extraction......................................................315 Emoticon Extraction.......................................................329 Request Extraction....................................................... 340 Profanity Extraction.......................................................346 6.2 Public Sector Fact Extraction...................................................346 Common Entity Extraction..................................................348 Action Event Extraction.................................................... 353 Military Unit Extraction.....................................................355 Organizational Information Extraction..........................................356 Person: Alias Extraction.................................................... 357 Person: Appearance Extraction...............................................359 Person: Attribute Extraction................................................. 361 Person: Relationship Extraction...............................................362 Spatial Reference Extraction.................................................364 Travel Event Extraction.....................................................365 6.3 Enterprise Fact Extraction.....................................................367 Management Change Event Extraction..........................................368 Membership Information Extraction............................................371 Merger Information Extraction................................................373 Organizational Information Extraction.......................................... 374 Product Release Event Extraction............................................. 376 6.4 Entity Type Names and Subentity Type Names For Fact Extraction.........................378 Text Data Processing Language Reference Guide Content PUBLIC 3 1 Text Data Processing Language Reference Guide Text Data Processing provides a suite of analyzers that enable developers to construct applications that analyze unstructured data, enabling your application to perform extraction and natural-language processing of text. This guide provides information about the linguistic extraction features, and describes the behavior of the supported language modules. Text Data Processing Language Reference Guide 4 PUBLIC Text Data Processing Language Reference Guide 2 Overview of the Language Reference Guide Welcome to the Language Reference Guide. This guide addresses the variety of languages that can be processed by the text analytics engine: a suite of natural-language processing capabilities based on linguistic, statistical and machine-learning algorithms that model and structure the information content of textual sources in multiple languages. This technology forms the foundation for advanced text processing for a range of applications including search, business intelligence or exploratory data analysis. The major areas of text analytics functionality are: Linguistic Analysis is the most fundamental form of text analytics - tokenization, stemming and part-of- speech tagging. These operations allow for optimized full-text index building because they guarantee both high precision and recall. Entity Extraction is the identification of named entities (persons, organizations, and so on), which eliminates the 'noise' in textual data by highlighting salient information. This process transforms unstructured text into structured information. Fact Extraction is a higher-level semantic processing that links entities as "facts" in domain-specific applications. For example, "Voice of the Customer" classifies sentiments with their corresponding topics. 2.1 Who Should Read This Guide Users of this guide may need to enhance extraction in their text analytics application and should understand text data processing extraction concepts. However, users of this guide are not expected to understand or be familiar with the natural languages of the text being processed by the software. Similarly, users are not required to be familiar with linguistic principles. This document assumes the following: ● You are an application developer or consultant working on enhancing text data processing extraction. ● You understand your organization's text data processing extraction needs. Text Data Processing Language Reference Guide Overview of the Language Reference Guide PUBLIC 5 3 Overview of Text Data Processing Functionality The software includes language modules for the languages supported. Each language module consists of a set of files that include system dictionaries containing words to support the language processing operations for the given natural language. It is the language modules that enable linguistic analysis and extraction of unstructured text in a given language. Language modules use the following language processing technologies: ● Linguistic analysis to handle natural language processing ● Entity extraction to handle named entity extraction and normalization ● Fact extraction to handle sentiment analysis, public sector events, and enterprise facts Linguistic analysis provides ● Word segmentation ● Stemming ● Part of speech tagging ● Tagged stemming (part of speech in context) Extraction provides the following levels: ● Entity Extraction ○ Core Extraction – for named entities (PERSON, DATE, LOCALITY, and so on.) ● Fact Extraction ○ Sentiment Analysis – for positive and negative sentiment and topic extraction ○ Public Sector - for public-sector-specific information. ○ Enterprise - for business-specific information There is an additional language module for the "neutral" language. The neutral language module offers basic tokenization, stemming and part-of-speech tagging in any language other than the 33 currently covered by SAP Text Analysis. The neutral language works for whitespace languages, such as Afrikaans or Bengali. Customization is also possible through dictionaries and rules. Refer to the Extraction Customization Guide for more details about customization. The following table shows the languages which are explicitly supported for Text Data Processing. Entity Extraction Fact Extraction Linguistic Language Analysis Custom Dic Sentiment Public Sec Custom Predefined tionary Analysis tor Enterprise Rules Arabic (AR) X X X X X Catalan (CA) X X X Chinese, sim X X X* X X plified (ZH) Text Data