Gain Insight and Deep Value from Unstructured Text
www.basistech.com, +1 617-386-2090

WHAT IS BIG TEXT?

It's huge volumes of multilingual, unstructured text that must be processed to deliver insights and build connections.

[Figure: sample sentences annotated with part-of-speech and entity labels (Person, Title, Place, Concept), for example "President Clinton helping Malawi" and "Secretary Clinton in" followed by a place name in Urdu ("Islamabad"), plus a Japanese example containing 福島 ("Fukushima").]

Gain insight and deep value from unstructured text

Supported: 55 Languages

Modern enterprise is well-acquainted with the promise of big data to revolutionize our insights and decision making, although it is less well-known that up to 80% of big data is represented by Big Text. Big Text is large quantities of "unstructured" text chunks found in documents, webpages, and databases with all the hallmarks of big data: the three Vs (Volume, Velocity, and Variety). Big Text is also multilingual, covering many languages and scripts, in all of their complexities and challenges.

Because of the intrinsic nature of unstructured text, standard enterprise data solutions have a very limited ability to understand and utilize this treasure trove of information.

Rosette® is a suite of software components for use in enterprise applications. It uses linguistic analysis, statistical modeling, and machine learning to accurately process Big Text, revealing valuable information and actionable data.

Individually, each component is a robust tool for processing language, documents, or names. When combined together, they create powerful solutions that deliver useful information for better decisions and deep value for their users.

Our customers across the globe, in government, finance, eDiscovery, search, social media, and beyond, depend on Rosette to analyze and transform their Big Text.

KEY FEATURES
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, or Windows
- Built to work with Apache Solr™ and Elasticsearch
- Cloudera certified partner

Start using ROSETTE today
Try our free product evaluation: www.basistech.com

THE PROBLEM

CHARACTERISTICS
- 80% of Big Data
- Unstructured
- Multilingual
- Huge Volume

THE ROSETTE SOLUTION AND THE RESULT
- Language Identifier (RLI): Identify languages and encodings. Result: Sorted Languages
- Base Linguistics (RBL): Search many languages with high accuracy. Result: Better Search
- Entity Extractor (REX): Tag names of people, places, and organizations. Result: Tagged Entities
- Entity Resolver (RES): Make real-world connections in your data. Result: Real Identities
- Name Indexer (RNI): Match names between many variations. Result: Matched Names
- Name Translator (RNT): Translate foreign names into English. Result: Translated Names

HEADQUARTERS: One Alewife Center, Cambridge, MA 02140
FEDERAL: 2553 Dulles View Dr., Suite 450, Herndon, VA 20171
WEST COAST: 1700 Montgomery St., San Francisco, CA 94111
EUROPE: Furzeground Way, Middlesex UB11 1BD, UK
ASIA: 9-6 Nibancho, Chiyoda-ku, Tokyo 102-0084, Japan

© 2015 Basis Technology Corporation. "Basis Technology Corporation", "Rosette", and "Highlight" are registered trademarks of Basis Technology Corporation. "Big Text Analytics" is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-26-RLP)

RLI: ROSETTE LANGUAGE IDENTIFIER

Instantly identify and triage many languages within large volumes of text.
[Figure: pie charts labeled "Primary Language" and "Primary Script" for a sample text in English, French, Chinese, and Arabic (scripts shown include Latin); slice labels include 8%, 22%, 31%, and 39%. The tagline is repeated in each language:]
Chinese: 即时识别和处理大量多语言文本
French: Identifiez et triez instantanément plusieurs langues à travers de nombreux textes.
Arabic: التحديد والتصنيف الفوري للعديد من اللغات ضمن كميات كبيرة من النصوص.

Identify languages and transform encodings

Supported: 55 Languages

Rosette® Language Identifier (RLI) analyzes text, from a few words to whole documents, to detect the languages and character encoding with speed and very high accuracy. Automatic language identification is the necessary first step for applications that categorize, search, process, and store text in many languages. Individual documents may be routed to language specialists, or sent into language-specific analysis pipelines (such as Rosette Base Linguistics) to improve the quality of search results.

For applications that analyze tweets, search keywords, and other short text, RLI offers market-leading accuracy for language detection given 1-3 words (<20 bytes) up to a full sentence. RLI achieves its incredible accuracy through the use of proprietary algorithms with information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short text language detection.

Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

KEY FEATURES
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, or Windows
- Component of the Rosette SDK

SELECT CUSTOMERS
- StumbleUpon

Start using RLI today
Try our free product evaluation: www.basistech.com

IDENTIFICATION FEATURES
- Identifies the primary or dominant language of a document
- Identifies the language scripts within the document, such as Latin and Cyrillic
- Determines the languages and their percentages within multilingual documents
- Works with texts that have been transliterated, such as Arabic chat that is written in the Latin script
- Accurate with short strings, from 1-3 words (<20 bytes) to a full sentence, to enable full analysis of search queries, tweets, image captions, metadata, news headlines, email subject lines, and more

LANGUAGE BOUNDARY LOCATOR

[Figure: a sample passage with its language regions labeled ENGLISH, FRENCH, GERMAN, and SPANISH:]
J'ai été surprise par cette surprise. Vice President Biden spoke about this in Munich. El carpintero prensa los bordes de la placa decorativa. Proper wound care management prevents die Geige gibt einen schoenen Laut von sich.

Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans alike. RLI enriches the text with start and end markers for each language within a multilingual document, even if all the languages are written in the same script, such as English, French, German, or Italian. Boundaries of each writing system are also detected, such as Latin, Cyrillic, Japanese kana, or Chinese hanzi.

ENCODING CONVERSION

Although modern text encoding standards, such as XML, mandate the use of Unicode, many existing applications, documents, websites, and data streams use "legacy encodings," such as ASCII, ISO 8859-1, Shift-JIS, and many others.

Rosette accurately converts large collections of text with these legacy encodings into a single, uniform format in the Unicode standard. This converted text can then be used in any language, which eliminates data corruption and other problems due to incompatible encodings.
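The idea behind converting legacy encodings into one uniform Unicode format can be illustrated with a minimal sketch. It uses Python's built-in codecs as a stand-in, not the Rosette API, and it assumes the source encoding is already known (in the workflow described here, that is what language and encoding identification would supply). The function names and the sample strings are hypothetical and chosen only for illustration.

```python
# Sketch only: Python's standard codecs stand in for a conversion engine.
# Function names are hypothetical; this is not the Rosette API.


def legacy_to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string."""
    # errors="strict" raises UnicodeDecodeError on invalid byte sequences
    # rather than silently producing corrupted text.
    return raw.decode(encoding, errors="strict")


def transcode_to_utf8(raw: bytes, encoding: str) -> bytes:
    """Re-encode legacy bytes as UTF-8, one uniform Unicode format."""
    return legacy_to_unicode(raw, encoding).encode("utf-8")


# Illustrative inputs, built here in three legacy encodings.
samples = [
    ("福島".encode("shift_jis"), "shift_jis"),    # Japanese, Shift-JIS
    ("Москва".encode("koi8_r"), "koi8_r"),        # Russian, KOI8-R
    ("été".encode("iso8859_1"), "iso8859_1"),     # French, ISO-8859-1
]

for raw, encoding in samples:
    text = legacy_to_unicode(raw, encoding)
    utf8 = transcode_to_utf8(raw, encoding)
    print(f"{encoding}: {raw!r} -> {text} -> {utf8!r}")
```

Decoding with the wrong encoding is the usual source of the data corruption mentioned above: ISO-8859-1, for instance, accepts any byte sequence, so Shift-JIS bytes decoded as ISO-8859-1 produce unreadable characters without raising an error. That is why identifying the encoding first matters.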
LANGUAGE AND ENCODING COMPATIBILITY

- 188 Language/Encoding Pairs
- 55 Languages with Unicode
- 7 Latin Script Variants (Transliterations)
- 44 Legacy Encodings

Albanian: ISO-8859-1, Windows-1252
Arabic: ISO-8859-6, Windows-720, Windows-1256
Arabic (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Bengali: ISCII-Bengali
Bulgarian: ISO-8859-5, Windows-1251, KOI8-R
Catalan: ISO-8859-1, Windows-1252
Chinese, Simplified: GB-2312, GB-18030, HZ-GB-2312, ISO-2022-CN
Chinese, Traditional: Big5, Big5-HKSCS
Croatian: Windows-1250
Czech: ISO-8859-2, Windows-1250
Danish: ISO-8859-1, Windows-1252
Dutch: ISO-8859-1, Windows-1252
English: ISO-8859-1, Windows-1252
Estonian: ISO-8859-13, Windows-1257
Finnish: ISO-8859-1, Windows-1252
French: ISO-8859-1, Windows-1252
German: ISO-8859-1, Windows-1252
Greek: ISO-8859-7, Windows-1253
Gujarati: ISCII-Gujarati
Hebrew: ISO-8859-8, Windows-1255
Hindi: ISCII-Hindi
Hungarian: ISO-8859-2, Windows-1250
Icelandic: ISO-8859-1, Windows-1252
Indonesian: ISO-8859-1, Windows-1252
Italian: ISO-8859-1, Windows-1252
Japanese: EUC-JP, ISO-2022-JP, Shift-JIS, Shift-JIS-2004 (JIS X 0213)
Kannada: ISCII-Kannada
Korean: EUC-KR, ISO-2022-KR
Kurdish: Windows-1256
Lithuanian: ISO-8859-13, Windows-1257
Macedonian: ISO-8859-5, Windows-1251
Malay: ISO-8859-1, Windows-1252
Malayalam: ISCII-Malayalam
Norwegian: ISO-8859-1, Windows-1252
Pashto: ISO-8859-6, Windows-1256
Pashto (transliterated): ISO-8859-1, Windows-1252
Persian: ISO-8859-6, Windows-1256
Persian (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Polish: ISO-8859-2, Windows-1250
Portuguese: ISO-8859-1, Windows-1252
Romanian: ISO-8859-2, Windows-1250
Russian: ISO-8859-5, Windows-1251, KOI8-R, IBM-866, Mac Cyrillic
Serbian: ISO-8859-5, Windows-1251
Serbian (transliterated): ISO-8859-2, Windows-1250
Slovak: Windows-1250
Slovenian: Windows-1250
Somali: ISO-8859-1, Windows-1252
Spanish: ISO-8859-1, Windows-1252
Swedish: ISO-8859-1, Windows-1252
Tagalog: ISO-8859-1, Windows-1252
Tamil: ISCII-Tamil
Telugu: ISCII-Telugu
Thai: Windows-874
Turkish: ISO-8859-9, Windows-1254
Ukrainian: ISO-8859-5, Windows-1251, KOI8-R
Urdu: ISO-8859-6, Windows-1256
Urdu (transliterated): ISO-8859-1, Windows-1252
Kurdish