Instantly Identify and Triage Many Languages

Rosette® Language Identifier (RLI)
BIG TEXT ANALYTICS

www.basistech.com
[email protected]
+1 617-386-2090

Instantly identify and triage many languages within large volumes of text.
即时识别和处理大量多语言文本。
Identifiez et triez instantanément plusieurs langues à travers de nombreux textes.
التحديد والتصنيف الفوري للعديد من اللغات ضمن كميات كبيرة من النصوص.

[Figure: primary-language and primary-script breakdown of multilingual sample text (English, French, Chinese, Arabic).]

Identify languages and encodings

Supported languages: 55

Rosette® Language Identifier (RLI) analyzes text, from a few words to whole documents, to detect the languages and character encoding with speed and very high accuracy. Automatic language identification is the necessary first step for applications that categorize, search, process, and store text in many languages. Individual documents may be routed to language specialists, or sent into language-specific analysis pipelines (such as Rosette Base Linguistics) to improve the quality of search results.

For applications that analyze tweets, search keywords, and other short text, RLI offers market-leading accuracy for language detection on input ranging from 1-3 words (<20 bytes) up to a full sentence.

RLI achieves this accuracy through proprietary algorithms that use information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short-text language detection.

Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

KEY FEATURES
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, or Windows
- Component of the Rosette SDK

Select Customers
StumbleUpon

Start using RLI today
Try our free product evaluation: www.basistech.com

ROSETTE PRODUCT FAMILY
- Language Identifier (RLI): Identify languages and encodings. Result: sorted languages.
- Base Linguistics (RBL): Search many languages with high accuracy. Result: better search.
- Entity Extractor (REX): Tag names of people, places, and organizations. Result: tagged entities.
- Entity Resolver (RES): Make real-world connections in your data. Result: real identities.
- Name Indexer (RNI): Match names between many variations. Result: matched names.
- Name Translator (RNT): Translate foreign names into English. Result: translated names.
- Categorizer (RCA): Categorize everything in sight. Result: sorted content.
- Sentiment Analyzer (RSA): Detect the sentiments of your text. Result: actionable insights.

IDENTIFICATION FEATURES
- Identifies the primary or dominant language of a document
- Identifies the language scripts within the document, such as Latin and Cyrillic
- Determines the languages and their percentages within multilingual documents
- Works with texts that have been transliterated, such as Arabic chat that is written in the Latin script
- Accurate with short strings, from 1-3 words (<20 bytes) to a full sentence, to enable full analysis of search queries, tweets, image captions, metadata, news headlines, email subject lines, and more

LANGUAGE BOUNDARY LOCATOR

Sample text (English, French, German, Spanish):
J'ai été surprise par cette surprise. Vice President Biden spoke about this in Munich. El carpintero prensa los bordes de la placa decorativa. Proper wound care management prevents die Geige gibt einen schoenen Laut von sich.

Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans alike. RLI enriches the text with start and end markers for each language within a multilingual document, even if all the languages are written in the same script, such as English, French, German, or Italian. Boundaries of each writing system are also detected, such as Latin, Cyrillic, Japanese kana, or Chinese hanzi.

ENCODING CONVERSION

Although modern standards such as XML mandate the use of Unicode, many existing applications, documents, websites, and data streams use "legacy encodings," such as ASCII, ISO 8859-1, Shift-JIS, and many others.

Rosette accurately converts large collections of text with these legacy encodings into a single, uniform format in the Unicode standard. This converted text can then be used in any language, which eliminates data corruption and other problems due to incompatible code.
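As a rough illustration of what converting legacy-encoded text into a single Unicode format involves, the sketch below uses only Python's built-in codecs. It is not the Rosette API. The function name `to_utf8` and the sample strings are hypothetical, and the source encoding of each document is assumed to be known already, whereas the datasheet describes RLI detecting it automatically.

```python
# Illustrative sketch only: Python standard-library codecs, not the Rosette SDK.
# Assumes the legacy encoding of each input is already known.


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # errors="strict" raises UnicodeDecodeError on invalid input
    # instead of silently producing corrupted text.
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Hypothetical sample documents, each stored in a different legacy encoding.
samples = [
    ("日本語のテキスト".encode("shift_jis"), "shift_jis"),
    ("Текст на русском".encode("koi8_r"), "koi8_r"),
    ("Texte en français".encode("iso-8859-1"), "iso-8859-1"),
]

# After conversion, every document shares one uniform encoding (UTF-8).
unified = [to_utf8(raw, encoding) for raw, encoding in samples]
```

Decoding strictly is a deliberate choice here: if the assumed encoding is wrong, the call fails loudly rather than writing mis-decoded characters into the converted collection.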
LANGUAGE AND ENCODING COMPATIBILITY

Language/encoding pairs: 188
Languages with Unicode: 55
Latin script variants (transliterations): 7
Legacy encodings: 44

Albanian: ISO-8859-1, Windows-1252
Arabic: ISO-8859-6, Windows-720, Windows-1256
Arabic (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Bengali: ISCII-Bengali
Bulgarian: ISO-8859-5, Windows-1251, KOI8-R
Catalan: ISO-8859-1, Windows-1252
Chinese, Simplified: GB-2312, GB-18030, HZ-GB-2312, ISO-2022-CN
Chinese, Traditional: Big5, Big5-HKSCS
Croatian: Windows-1250
Czech: ISO-8859-2, Windows-1250
Danish: ISO-8859-1, Windows-1252
Dutch: ISO-8859-1, Windows-1252
English: ISO-8859-1, Windows-1252
Estonian: ISO-8859-13, Windows-1257
Finnish: ISO-8859-1, Windows-1252
French: ISO-8859-1, Windows-1252
German: ISO-8859-1, Windows-1252
Greek: ISO-8859-7, Windows-1253
Gujarati: ISCII-Gujarati
Hebrew: ISO-8859-8, Windows-1255
Hindi: ISCII-Hindi
Hungarian: ISO-8859-2, Windows-1250
Icelandic: ISO-8859-1, Windows-1252
Indonesian: ISO-8859-1, Windows-1252
Italian: ISO-8859-1, Windows-1252
Japanese: EUC-JP, ISO-2022-JP, Shift-JIS, Shift-JIS-2004 (JIS X 0213)
Kannada: ISCII-Kannada
Korean: EUC-KR, ISO-2022-KR
Kurdish: Windows-1256
Kurdish (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Latvian: ISO-8859-13, Windows-1257
Lithuanian: ISO-8859-13, Windows-1257
Macedonian: ISO-8859-5, Windows-1251
Malay: ISO-8859-1, Windows-1252
Malayalam: ISCII-Malayalam
Norwegian: ISO-8859-1, Windows-1252
Pashto: ISO-8859-6, Windows-1256
Pashto (transliterated): ISO-8859-1, Windows-1252
Persian: ISO-8859-6, Windows-1256
Persian (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Polish: ISO-8859-2, Windows-1250
Portuguese: ISO-8859-1, Windows-1252
Romanian: ISO-8859-2, Windows-1250
Russian: ISO-8859-5, Windows-1251, KOI8-R, IBM-866, Mac Cyrillic
Serbian: ISO-8859-5, Windows-1251
Serbian (transliterated): ISO-8859-2, Windows-1250
Slovak: Windows-1250
Slovenian: Windows-1250
Somali: ISO-8859-1, Windows-1252
Spanish: ISO-8859-1, Windows-1252
Swedish: ISO-8859-1, Windows-1252
Tagalog: ISO-8859-1, Windows-1252
Tamil: ISCII-Tamil
Telugu: ISCII-Telugu
Thai: Windows-874
Turkish: ISO-8859-9, Windows-1254
Ukrainian: ISO-8859-5, Windows-1251, KOI8-R
Urdu: ISO-8859-6, Windows-1256
Urdu (transliterated): ISO-8859-1, Windows-1252
Uzbek: ISO-8859-5, Windows-1251, KOI8-R
Uzbek (transliterated): Windows-1251
Vietnamese: TCVN, VIQR, VISCII, VNI, VPS

HEADQUARTERS: One Alewife Center, Cambridge, MA 02140
FEDERAL: 2553 Dulles View Dr., Suite 450, Herndon, VA 20171
WEST COAST: 1700 Montgomery St., San Francisco, CA 94111
EUROPE: Furzeground Way, Middlesex UB11 1BD, UK
ASIA: 9-6 Nibancho, Chiyoda-ku, Tokyo 102-0084, Japan

© 2015 Basis Technology Corporation. "Basis Technology Corporation", "Rosette", and "Highlight" are registered trademarks of Basis Technology Corporation. "Big Text Analytics" is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-29-RLI)
