www.basistech.com [email protected] +1 617-386-2090

What is Big Text?

It's huge volumes of multilingual, unstructured text that must be processed to deliver insights and build connections.

[Figure: example sentences annotated with parts of speech and entity types (Concept, Person, Place, Title), including "President Clinton helping Malawi", "Secretary Clinton in Islamabad" (place name written in Urdu), and 福島 (Japanese: "Fukushima").]

Gain insight and deep value from unstructured text

Supported: 55 Languages

Modern enterprise is well-acquainted with the promise of big data to revolutionize our insights and decision making, although it is less well-known that up to 80% of big data is represented by Big Text. Big Text is large quantities of “unstructured” text chunks found in documents, webpages, and databases with all the hallmarks of big data: the three Vs (Volume, Velocity, and Variety). Big Text is also multilingual, covering many languages and scripts, in all of their complexities and challenges.

Because of the intrinsic nature of unstructured text, standard enterprise data solutions have a very limited ability to understand and utilize this treasure trove of information.

Rosette® is a suite of components for use in enterprise applications. It uses linguistic analysis, statistical modeling, and machine learning to accurately process Big Text, revealing valuable information and actionable data.

Individually, each component is a robust tool for processing language, documents, or names. When combined, they create powerful solutions that deliver useful information for better decisions and deep value for their users. Our customers across the globe, in government, finance, eDiscovery, search, social media, and beyond, depend on Rosette to analyze and transform their Big Text.

KEY FEATURES
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, or Windows
- Built to work with Apache Solr™ and Elasticsearch
- Cloudera certified partner

Start using ROSETTE today
Try our free product evaluation
www.basistech.com

THE PROBLEM

CHARACTERISTICS
- 80% of Big Data
- Unstructured
- Multilingual
- Huge Volume

THE ROSETTE SOLUTION AND THE RESULT
- Language Identifier (RLI): Identify languages and encodings. Result: Sorted Languages
- Base Linguistics (RBL): Search many languages with high accuracy. Result: Better Search
- Entity Extractor (REX): Tag names of people, places, and organizations. Result: Tagged Entities
- Entity Resolver (RES): Make real-world connections in your data. Result: Real Identities
- Name Indexer (RNI): Match names between many variations. Result: Matched Names
- Name Translator (RNT): Translate foreign names into English. Result: Translated Names


© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-26-RLP)

HEADQUARTERS: One Alewife Center, Cambridge, MA 02140
FEDERAL: 2553 Dulles View Dr., Suite 450, Herndon, VA 20171
WEST COAST: 1700 Montgomery St., San Francisco, CA 94111
EUROPE: Furzeground Way, Middlesex UB11 1BD, UK
ASIA: 9-6 Nibancho, Chiyoda-ku, Tokyo 102-0084, Japan

RLI ROSETTE Language Identifier

[Figure: the same sentence in four languages, with the share of the sample text that RLI attributes to each]
- English (8%): Instantly identify and triage many languages within large volumes of text.
- Chinese (22%): 即时识别和处理大量多语言文本
- French (39%): Identifiez et triez instantanément plusieurs langues à travers de nombreux textes.
- Arabic (31%): التحديد والتصنيف الفوري للعديد من اللغات ضمن كميات كبيرة من النصوص.

Primary Language: French
Primary Script: Latin

Identify languages and transform encodings

Supported: 55 Languages

Rosette® Language Identifier (RLI) analyzes text, from a few words to whole documents, to detect the languages with speed and very high accuracy. Automatic language identification is the necessary first step for applications that categorize, search, process, and store text in many languages. Individual documents may be routed to language specialists, or sent into language-specific analysis pipelines (such as Rosette Base Linguistics) to improve the quality of search results.

For applications that analyze tweets, search keywords, and other short text, RLI offers market-leading accuracy for language detection given 1-3 words (<20 bytes) up to a full sentence.

KEY FEATURES
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, or Windows
- Component of the Rosette SDK

RLI achieves its incredible accuracy through the use of proprietary algorithms with information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short text language detection. Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

Select Customers: StumbleUpon

Start using RLI today
Try our free product evaluation
www.basistech.com

IDENTIFICATION FEATURES

- Identifies the primary or dominant language of a document
- Identifies the language scripts within the document, such as Latin and Cyrillic
- Determines the languages and their percentages within multilingual documents
- Works with texts that have been transliterated, such as Arabic chat that is written in the Latin script
- Accurate with short strings, from 1-3 words (<20 bytes) to a full sentence, to enable full analysis of search queries, tweets, image captions, metadata, news headlines, email subject lines, and more

LANGUAGE BOUNDARY LOCATOR

[Figure: one paragraph with each span marked as ENGLISH, FRENCH, GERMAN, or SPANISH]
J'ai été surprise par cette surprise. Vice President Biden spoke about this in Munich. El carpintero prensa los bordes de la placa decorativa. Proper wound care management prevents die Geige gibt einen schoenen Laut von sich.

Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans alike. RLI enriches the text with start and end markers for each language placed within multilingual documents, even if all the languages are written in the same script, such as English, French, German, or Italian. Boundaries of each writing system are also detected, such as Latin, Cyrillic, Japanese kana, or Chinese hanzi.

ENCODING CONVERSION

Although modern text encoding standards, such as XML, mandate the use of Unicode, many existing applications, documents, websites, and data streams use “legacy encodings,” such as ASCII, ISO 8859-1, Shift-JIS, and many others.

Rosette accurately converts large collections of text with these legacy encodings into a single, uniform format in the Unicode standard. This converted text can then be used in any language, which eliminates data corruption and other problems due to incompatible code.
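The writing-system boundaries mentioned above can be illustrated with a minimal sketch built only on Python's standard `unicodedata` module. This is not RLI's method: it finds script runs only, it cannot separate languages that share a script (the harder problem the locator addresses), and the function names `script_of` and `script_runs` are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Rough script label taken from the first word of the Unicode name."""
    if not ch.isalpha():
        return None  # spaces, digits, and punctuation carry no script
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text):
    """Return (script, start, end) spans; end is where the next run starts."""
    runs = []
    current, start = None, 0
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue  # neutral characters stay with the preceding run
        if script != current:
            if current is not None:
                runs.append((current, start, i))
            current, start = script, i
    if current is not None:
        runs.append((current, start, len(text)))
    return runs


if __name__ == "__main__":
    for script, start, end in script_runs("Hello мир 世界"):
        print(script, start, end)
```

Each span's end offset is the start of the next run, so trailing spaces are counted with the run that precedes them.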

LANGUAGE AND ENCODING COMPATIBILITY

- 188 Language/Encoding Pairs
- 55 Languages with Unicode
- 7 Latin Script Variants (Transliterations)
- 44 Legacy Encodings

Albanian: ISO-8859-1, Windows-1252
Arabic: ISO-8859-6, Windows-720, Windows-1256
Arabic (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Bengali: ISCII-Bengali
Bulgarian: ISO-8859-5, Windows-1251, KOI8-R
Catalan: ISO-8859-1, Windows-1252
Chinese, Simplified: GB-2312, GB-18030, HZ-GB-2312, ISO-2022-CN
Chinese, Traditional: Big5, Big5-HKSCS
Croatian: Windows-1250
Czech: ISO-8859-2, Windows-1250
Danish: ISO-8859-1, Windows-1252
Dutch: ISO-8859-1, Windows-1252
English: ISO-8859-1, Windows-1252
Estonian: ISO-8859-13, Windows-1257
Finnish: ISO-8859-1, Windows-1252
French: ISO-8859-1, Windows-1252
German: ISO-8859-1, Windows-1252
Greek: ISO-8859-7, Windows-1253
Gujarati: ISCII-Gujarati
Hebrew: ISO-8859-8, Windows-1255
Hindi: ISCII-Hindi
Hungarian: ISO-8859-2, Windows-1250
Icelandic: ISO-8859-1, Windows-1252
Indonesian: ISO-8859-1, Windows-1252
Italian: ISO-8859-1, Windows-1252
Japanese: EUC-JP, ISO-2022-JP, Shift-JIS, Shift-JIS-2004 (JIS X 0213)
Kannada: ISCII-Kannada
Korean: EUC-KR, ISO-2022-KR
Kurdish: Windows-1256
Kurdish (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Latvian: ISO-8859-13, Windows-1257
Lithuanian: ISO-8859-13, Windows-1257
Macedonian: ISO-8859-5, Windows-1251
Malay: ISO-8859-1, Windows-1252
Malayalam: ISCII-Malayalam
Norwegian: ISO-8859-1, Windows-1252
Pashto: ISO-8859-6, Windows-1256
Pashto (transliterated): ISO-8859-1, Windows-1252
Persian: ISO-8859-6, Windows-1256
Persian (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Polish: ISO-8859-2, Windows-1250
Portuguese: ISO-8859-1, Windows-1252
Romanian: ISO-8859-2, Windows-1250
Russian: ISO-8859-5, Windows-1251, KOI8-R, IBM-866, Mac Cyrillic
Serbian: ISO-8859-5, Windows-1251
Serbian (transliterated): ISO-8859-2, Windows-1250
Slovak: Windows-1250
Slovenian: Windows-1250
Somali: ISO-8859-1, Windows-1252
Spanish: ISO-8859-1, Windows-1252
Swedish: ISO-8859-1, Windows-1252
Tagalog: ISO-8859-1, Windows-1252
Tamil: ISCII-Tamil
Telugu: ISCII-Telugu
Thai: Windows-874
Turkish: ISO-8859-9, Windows-1254
Ukrainian: ISO-8859-5, Windows-1251, KOI8-R
Urdu: ISO-8859-6, Windows-1256
Urdu (transliterated): ISO-8859-1, Windows-1252
Uzbek: ISO-8859-5, Windows-1251, KOI8-R
Uzbek (transliterated): Windows-1251
Vietnamese: TCVN, VIQR, VISCII, VNI, VPS
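To make the idea of converting legacy encodings to Unicode concrete, here is a minimal sketch using Python's built-in codecs rather than Rosette. It assumes the source encoding is already known (detecting it is the part RLI provides), and the helper name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Sample byte strings in three of the legacy encodings listed above.
samples = {
    "iso-8859-1": "café".encode("iso-8859-1"),
    "shift_jis": "日本語".encode("shift_jis"),
    "koi8_r": "Привет".encode("koi8_r"),
}

converted = {enc: to_utf8(raw, enc) for enc, raw in samples.items()}

# Decoding with the wrong encoding silently corrupts the text ("mojibake"),
# which is the kind of data corruption that a uniform format avoids.
mojibake = "café".encode("utf-8").decode("iso-8859-1")

if __name__ == "__main__":
    for enc, data in converted.items():
        print(enc, "->", data.decode("utf-8"))
    print("wrong decoder:", mojibake)
```

Once everything is UTF-8, downstream code no longer needs to track a per-document encoding.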

RBL ROSETTE Base Linguistics

Improve the speed and accuracy of your search application with advanced linguistic analysis.

[Figure: the sentence above with each word labeled by part of speech (Verb, Determiner, Noun, Conjunction, Preposition, Adjective, Punctuation).]

Search many languages with high accuracy

Supported: 40 Languages

Every language, including English, presents unique and difficult challenges for search applications to deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-of-class natural language processing, improving speed and accuracy.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

KEY FEATURES
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Component of the Rosette SDK
- Customizable features such as user dictionaries, orthographic normalization, and script conversion
- Built to work with Apache Solr™ and Elasticsearch
- Cloudera certified partner

Start using RBL today
Try our free product evaluation
www.basistech.com

Advanced Morphological Features

TOKENIZATION

Many search tools use bigrams to understand languages written without spaces between words. This results in a larger index size and a reduction in relevancy. RBL, in contrast, accurately identifies and separates each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.

Example: Chinese
Consider the problem of indexing “Beijing University Biology Department” (北京大学生物系, seven characters) and a subsequent search for “student” (学生):

BIGRAMMING
The index holds every overlapping character pair: 1-2 Beijing, 2-3 (non-word), 3-4 University, 4-5 (Student), 5-6 Biology, 6-7 (non-word).
A search for "Student" incorrectly hits “Beijing University Biology Department”.

RBL MORPHOLOGICAL TOKENIZATION
The index holds the actual words: Beijing, University, Biology, Department.
A search for "Student" correctly misses “Beijing University Biology Department”.

LEMMATIZATION

Most search engines utilize a crude method of chopping off characters at the end of a word in the hopes of removing unimportant differences. This method, called stemming, often results in extra recall and poor precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the root form increases search relevancy and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing.

Example: English
Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.

CHALLENGE                                     QUERY      STEM    LEMMA
Two unrelated words may share a stem.         animals    anim    animal
                                              animated   anim    animate
Stemming may deliver unintended results.      several    sever   several
Irregular verbs and nouns stump the stemmer.  spoke      spoke   speak (v.), spoke (n.)

DECOMPOUNDING

RBL breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.

Example: German
Samstagmorgen is a compound word formed with Samstag (Saturday) and morgen (morning). Decompounding allows for an appropriate match when searching for "Samstag".

PART OF SPEECH TAGGING

As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even with ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, part-of-speech tags differ. Rosette supports the Universal POS Tag standard from which the developer can map to Penn Treebank or other POS tag systems.
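The Chinese and English examples above can be reproduced with a short sketch. It is illustrative only: the Chinese strings are an assumed rendering of “Beijing University Biology Department” and “student”, the word segmentation and lemma dictionary are written by hand rather than produced by RBL, and `crude_stem` is a toy suffix stripper, not a real stemming algorithm.

```python
def bigrams(text):
    """Return every overlapping two-character slice of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Assumed rendering of "Beijing University Biology Department" and "student".
phrase = "北京大学生物系"
query = "学生"

bigram_index = set(bigrams(phrase))
# Hand-written segmentation standing in for a statistical tokenizer.
word_index = {"北京", "大学", "生物", "系"}

bigram_hit = query in bigram_index  # True: false hit on the department name
word_hit = query in word_index      # False: the query correctly misses


def crude_stem(word, suffixes=("ated", "als", "al", "s")):
    """Toy stemmer: strip the first listed suffix that leaves 3+ letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny hand-built lemma dictionary; a real lemmatizer also uses context
# to choose between alternatives such as "speak" and "spoke".
LEMMAS = {
    "animals": ["animal"],
    "animated": ["animate"],
    "several": ["several"],
    "spoke": ["speak", "spoke"],
}

if __name__ == "__main__":
    print("bigram index hit:", bigram_hit, "| word index hit:", word_hit)
    for word, lemmas in LEMMAS.items():
        print(word, "stem:", crude_stem(word), "lemmas:", lemmas)
```

The stemmer collapses "animals" and "animated" to the same string and turns "several" into "sever", while the dictionary lookup keeps them apart.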

SENTENCE DETECTION

The start and end of each sentence is automatically identified even though punctuation use may be ambiguous.

NOUN PHRASE EXTRACTION

Certain nouns, especially proper names, can be very tricky to identify as a single entity. RBL groups the nouns and their modifiers, which is useful in document clustering and concept extraction.

Available Languages

WESTERN EUROPE
- Catalan*
- Czech
- Danish
- Dutch
- English
- Finnish*
- French
- German
- Greek
- Italian
- Norwegian
- Portuguese
- Spanish
- Swedish

EASTERN EUROPE
- Albanian*
- Bulgarian*
- Croatian*
- Estonian*
- Hungarian
- Latvian*
- Polish
- Romanian
- Russian
- Serbian*
- Slovak*
- Slovenian*
- Turkish
- Ukrainian*

MIDDLE EAST
- Arabic
- Hebrew
- Pashto
- Persian
- Urdu

ASIA
- Chinese, Simplified
- Chinese, Traditional
- Indonesian
- Japanese
- Korean
- Malay*
- Thai

* Limited Support

REX ROSETTE Entity Extractor

Automatically find names of people, places, products, and organizations in text across many languages.

Accurate & adaptable statistical entity extraction

Supported: 18 Languages

Rosette® Entity Extractor (REX) delivers structure, clarity, and insight by revealing the key information (names, places, organizations, products, and other words and phrases) lying hidden within large volumes of unstructured Big Text.

REX is the foundation for applications in eDiscovery, social media analysis, financial compliance, and government intelligence. The effectiveness of these mission-critical applications depends on REX for its accuracy, robustness, and ability to find entities across many languages.

KEY FEATURES
- Component of the Rosette SDK
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, or Windows

By nature, statistically trained models are most accurate on the type of data they are trained on. Besides machine learning from a wide range of text beyond news articles, REX is unique among named entity recognition software in its adaptability. REX’s field training mechanism enables you to add your text data to increase REX’s accuracy on your text.

Select Customers: ACTIVE INTELLIGENCE

Start using REX today
Try our free product evaluation
www.basistech.com

How it works

STATISTICAL ENTITY EXTRACTION

Statistical modeling with advanced linguistics solves the three biggest challenges in entity extraction: finding entities which cannot be exhaustively listed, finding entities which are yet unknown, and using context to distinguish between similar entities, e.g., the place “Newton, MA” and the person “Isaac Newton”.

Customized extraction

FIELD TRAINING FOR INCREASED ACCURACY

For users with text that is particularly challenging in format, style, or vocabulary, REX’s unique field training capability has multiple mechanisms to adapt its statistical model to their data. Users just add a quantity of their data (unannotated or annotated), and rebuild the model for maximum accuracy.

PATTERN-MATCHING RULES

Rules expressed as regular expressions find entities which follow a pattern, such as dates, times, and email addresses. Many standard string patterns are included with REX; customers can customize by editing or adding their own rules, based on their specific needs.

REX in action

[Figure: sample texts in several languages with entities highlighted by type: Person, Location, Organization, Date, Time, Title]

English: The New York Philharmonic Orchestra will make a historic trip to North Korea in February, it has announced. The orchestra's president and executive director, Zarin Mehta said it would play in the capital Pyongyang on February 26. In August, the reclusive communist country's Ministry of Culture sent an invitation to the orchestra at Lincoln Center in Manhattan.

French: Dominique de Villepin a été nommé Premier ministre ce mardi en fin de matinée par Jacques Chirac.

French: L'ancien ministre de l'Intérieur, qui n'a jamais participé à une élection, a déjeuné avec les députés UMP et UDF à l'invitation du président de l'Assemblée nationale, Jean-Louis Debré.

Chinese: [sample text not recoverable from the source]

Japanese: 小澤征爾は、日本を代表する世界的な指揮者である。1973年、38歳のときに、アメリカ5大オーケストラの一つであるボストン交響楽団の音楽監督に就任した。

Arabic: الخميس 1431/2/5 هـ - الموافق 2010/1/21 م (آخر تحديث) الساعة 10:01 (مكة المكرمة)، 7:01 (غرينتش) ناتو يفكر بمسؤول مدني لأفغانستان. يخطط حلف شمال الأطلسي (ناتو) لتعيين مسؤول مدني كبير في أفغانستان وسط دعوات لتحسين التنسيق السياسي والتنموي في البلاد وفق ما نقلته صحيفة وول ستريت
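As a rough illustration of the pattern-matching rules described above, the sketch below uses Python's standard `re` module to tag dates, times, and email addresses. The patterns, the labels, and the `extract` function are assumptions made for this example; they are not the rules shipped with REX or its rule syntax.

```python
import re

# Hypothetical rule set: one regular expression per entity type.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}[-/]\d{1,2}[-/]\d{1,2}\b"),
    "TIME": re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract(text):
    """Return (label, matched text, start, end) tuples in document order."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda entity: entity[2])


if __name__ == "__main__":
    sample = "Updated 2010/1/21 at 10:01, contact press@example.com for details."
    for entity in extract(sample):
        print(entity)
```

Adding a rule is a matter of adding one more entry to the dictionary, which mirrors the idea of customers editing or extending a rule set.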

CUSTOM ENTITY LISTS

Custom lists are helpful when users know that specific words or phrases in their data are almost never misspelled and always refer to the same thing (i.e., are unambiguous). REX comes with such lists for entity types like religions and nationalities.

Predefined Entity Types

REX natively supports the following entity types. User-defined entities, such as SKU numbers, are also available.

- Person
- Location
- Organization
- Product
- Title
- Nationality
- Religion
- Credit Card Number
- Geographic Coordinate
- Money
- Generic Number
- Personal ID Number
- Phone Number
- Email Address/URL
- Distance
- Date
- Time

Available Languages

Additional languages are available through custom development.

- Arabic
- Hebrew
- Pashto
- Persian
- Urdu
- Chinese, Simplified
- Chinese, Traditional
- Indonesian
- Japanese
- Korean
- Dutch
- English
- French
- German
- Italian
- Portuguese
- Russian
- Spanish

RES ROSETTE Entity Resolver

Connect your unstructured text to the real-world people, organizations and places you care about.

Linking and learning for real-world data

Rosette® Entity Resolver (RES) reveals meaningful information in your text. It connects the words that represent real-world things to one another and to entities in an entity database like Wikipedia, both within and across documents.

Good quality entity resolution means dealing with three key problems: variety, where one thing can have many names; ambiguity, where many things can have very similar if not exactly the same name; and ghosts, where some collection of names identifies a previously unknown real-world thing.

RES enriches your text with high-quality metadata, enabling you to perform intuitive, entity-centric search and discovery. With it you can power notification applications designed to detect and track new people in text streams. It provides excellent raw material for building the custom knowledge graphs at the heart of many of today’s most innovative applications.

KEY FEATURES
- Standard training from 2.5M Wikipedia entities
- “Learning” Mode: Identifies previously unknown entities (“ghosts”) and learns new aliases from text as it processes
- “Linking” Mode: Rapidly links only known entities
- Custom entity database training
- Fast and scalable
- Industrial-strength support
- Flexible and customizable
- Java
- Unix, Linux, Mac, or Windows

EXAMPLES:
- Tamerlan Tsarnaev: Tamerlin Tsarnaev (TheAtlantic.com) or Tamerlane Tsarnaevy (Mir24.net)
- Apple: Apple Corps Ltd. (Music) or Apple Inc. (Technology)
- Paris: Paris, Texas (33°39 N, 95°32 W) or Paris, France (48°51 N 2°21 E)

Start using RES
Try our free product evaluation
www.basistech.com

Organize Big Text using entity linking and learning

[Figure: an English news story and a Spanish news story resolved against the same entity database]

FISHING NEWS
KENNEBUNKPORT, Maine - Three generations of an American political dynasty went fishing off the Maine coast. The family set out together in the morning on their new powerboat, built by Penobscot Boat Builders. On board were the first President Bush; the second President Bush and his niece, Noelle Bush.

Spanish sample: Un estadounidense que aprendió a saltar en paracaídas durante la Segunda Guerra Mundial cumplió su sueño de poner en práctica su habilidad a los 90 años de edad. Lester Slate saltó este domingo en el estado de Maine acompañado de un guía paracaidista. A pesar de haber volado en numerosas ocasiones, según la prensa local, este veterano de la marina estadounidense nunca se había lanzado en paracaídas. Slate señaló que se sintió inspirado por el expresidente de Estados Unidos Goerge H.W. Bush, quien realizó un salto con motivo de su 85 cumpleaños en 2009. Tras pisar tierra, el nonagenario dijo que esperaba poder repetir en su 95 y 100 cumpleaños.

1 Link to Known Entities
- George H. W. Bush: 41st U.S. President, ID: USPRES41, DOB: June 12, 1924
- George W. Bush: 43rd U.S. President, ID: USPRES43, DOB: July 6, 1946

2 Learn about New Entities
- Unknown Person: Noelle Bush, ID: BD239852
- Unknown Organization: Penobscot Boat Builders, ID: TF354723

Measuring Confidence

Confidence measures are essential for effective use of statistically based systems. RES can be configured to deliver confidence measures with each of its clustering and linking decisions, allowing developers to use the RES output intelligently.

API

Rosette Entity Resolver can be run in one of two modes:

Linking
In linking mode, RES will link the names of people, places and organizations in the text to entities in the entity database. Anything that can’t be associated with an existing entity will be ignored. This mode is optimized for high scale and stable throughput.

Learning
In learning mode, RES not only links names to known entities, but also discovers new entities mentioned in the text (often called “ghosts”), and remembers the new aliases and contexts it has found for all entities. For example, once “J. Doe” has been encountered and linked to the “John Doe” entity, future occurrences of “J. Doe” will be matched with greater confidence.

Under the hood
RES uses a machine-learned model to associate names and their contexts with collections of information drawn from the entity databases with known entities.

In linking mode, RES fixes both the number of entities and the information within. Learning mode allows new entities to be created and new information to be added to existing entities. As this system state grows, RES intelligently prunes the information to maintain performance.

Custom training

RES comes pre-trained to link to a Wikipedia-derived 2M+ entity database. RES may be further trained by adding to this entity database or by providing an entirely new entity database.

What is training?
Training currently involves adding information about real-world entities to the system such as names, aliases, related entities, and example documents. A simple example is adding a new alias to a Wikipedia-derived entity to improve resolution accuracy.

EXAMPLE:
Basketball player Jeremy Lin is often referred to as “Linsanity”. Training allows developers to add the “Linsanity” alias to the entry for Jeremy Lin. The next time “Linsanity” is encountered, it will be resolved appropriately.
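The difference between the two modes, and the effect of adding an alias, can be sketched with a toy resolver. This is only an illustration under heavy assumptions: an exact, case-insensitive alias lookup stands in for the machine-learned model, there is no use of context or confidence, and the class `ToyResolver`, the `GHOST` identifiers, and the ID `JEREMY_LIN` are hypothetical.

```python
class ToyResolver:
    """Alias-table resolver with a linking mode and a learning mode."""

    def __init__(self, entities, learning=False):
        # entities maps an entity ID to the aliases already known for it.
        self.aliases = {}
        for entity_id, names in entities.items():
            for name in names:
                self.aliases[name.lower()] = entity_id
        self.learning = learning
        self._ghost_count = 0

    def add_alias(self, entity_id, alias):
        """Training step: attach one more alias to an existing entity."""
        self.aliases[alias.lower()] = entity_id

    def resolve(self, mention):
        """Return an entity ID, or None if the mention is ignored."""
        key = mention.lower()
        entity_id = self.aliases.get(key)
        if entity_id is not None or not self.learning:
            return entity_id  # linking mode ignores unknown mentions
        # Learning mode: create a new "ghost" entity and remember it.
        self._ghost_count += 1
        entity_id = f"GHOST{self._ghost_count}"
        self.aliases[key] = entity_id
        return entity_id


if __name__ == "__main__":
    known = {"USPRES41": ["George H. W. Bush"], "JEREMY_LIN": ["Jeremy Lin"]}
    linker = ToyResolver(known)
    learner = ToyResolver(known, learning=True)
    print(linker.resolve("Noelle Bush"))   # ignored in linking mode
    print(learner.resolve("Noelle Bush"))  # becomes a ghost entity
    linker.add_alias("JEREMY_LIN", "Linsanity")
    print(linker.resolve("Linsanity"))
```

In linking mode the alias table never grows on its own, which matches the fixed entity set described above; in learning mode every unknown mention adds state.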

Full Support Languages / Partial Support Languages

English      Russian      Pashto
Spanish      Japanese     Persian (Dari)
Chinese      Arabic       Persian (Farsi)
             Korean       Urdu

RNI ROSETTE Name Indexer

[Figure: name variants matched against one entity record, each with a similarity score]

Entity record: Franklin D. Roosevelt, 32nd U.S. President, ID: USPRES32, DOB: Jan. 30, 1882
Franklin Delano Roosevelt, also known by his initials, FDR, was the 32nd President of the United States and a central figure in world events during the mid-20th century, leading the United States during....

- Frank Delano Roosevelt: 97%
- Рузвельт, Франклин: 85%
- President Roosevelt: 84%
- Gov. Franklin Roosevelt: 82%
- 富兰克林·罗塞费尔特: 79%
- Franklin Rosenvelt: 77%
- F. D. R.: 74%
- F. D. Roosev: 73%

Accurate fuzzy name matching in many languages

Supported: 14 Languages

Names are the linchpin that connect data points in financial compliance, anti-fraud, government intelligence, law enforcement, and identity verification. Yet, names are challenging to connect because of their incredible variation in misspellings, nicknames, initials, and titles. In international databases, a single name may also appear in many languages!

Rosette® Name Indexer (RNI) solves these challenges with a linguistic, knowledge-based system that compares and matches names of people, places, and organizations despite their many variations. RNI is unrivalled in its ability to match names because of its intelligent approach.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world. RNI is unrivalled in its ability to match the names of entities. Find out how your organization can utilize this pioneering technology for extraordinary results.

KEY FEATURES
- Component of the Rosette SDK
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java
- Unix, Linux, Mac, or Windows
- Matches names of people, places, and organizations
- Increases name search accuracy
- Ranks results by relevancy with a similarity score
- Built to work with Apache™ Solr and Elasticsearch

Start using RNI today
Try our free product evaluation
www.basistech.com

The Rosette Advantage Integration Options

Our knowledge-based system combines the latest in Natural Language Processing (NLP) to intelligently match names based on their linguistic and cultural structures and norms.

Unlike expensive and less accurate legacy solutions driven by thousands of spelling variants from known names, RNI analyzes the intrinsic structure of each name component and performs an intelligent comparison using advanced linguistic algorithms.

Our approach is not limited to a particular list of variants and reduces the likelihood of both “false positives” (wrong matches) and “false negatives” (missed matches).

List-driven systems cannot equal RNI for matching never-seen-before names or mis-segmented names (Mary Ellen vs. MaryEllen).

Rosette® Name Indexer integrates easily into Apache Solr™ as a plug-in or into applications as a Java library to support its main use cases. RNI can also be adapted to match the needs of each application.

Apache Solr
Apache Solr™-based search systems can easily add high-quality fuzzy name matching to every search by simply adding name fields. RNI provides a special Solr field type for names. This mechanism means Solr can index documents with multiple name fields, each with multiple values (e.g., an “alias” field may contain more than one name). Each document could also contain non-name fields like dates or plain text. For example, a single document might hold the values Muhammad Ali, Cassius Clay Jr, The Greatest, and 1/7/1942.

A single query can then be constructed that gives different weight to the various fields. For example, a single query can find movies starring “Binedict Cumberbund” with screenplays by “Giyermo Diltoro” that were released around 2014.

Java Library
Any application that needs name matching can directly integrate a Java library which takes care of storing watchlists without incurring the overhead of a web-service call.

Use Cases

Financial Compliance
Financial institutions use RNI to manage and update watchlists to block terrorist access to funds, simultaneously avoiding compliance violations and protecting their reputation. Applications also include fraud detection, anti-money laundering, and document triage.

Government Intelligence
Names are often the most critical data point in intelligence, law enforcement, and border control. RNI is being adopted throughout the U.S. government to address the challenge of matching names in all their variations, particularly names from languages written in non-Latin scripts such as Arabic, Russian, Chinese, Korean, or Persian.

Identity Verification in the Sharing Economy
Trust is foundational to the sharing economy. Whether booking room rentals, rides, or odd jobs, it is important to establish ways to connect the online and offline worlds to reinforce that trust and confidence.

Name matching is a key component of verifying online identities with real-world documentation (passports, driver’s licenses). Members of the sharing economy such as Airbnb rely on RNI to match names originating from all over the world, and internationally between names written in alphabets besides the Roman A-to-Z.

Name Matching Capabilities
- Same name in multiple languages: Мао Цзэдун ↔ Mao Zedong ↔ 毛泽东
- Phonetic spelling differences: Cairns ↔ Kearns ↔ Kerns
- Transliteration spelling differences: Abdul Rasheed ↔ Abd-al-Rasheed ↔ Abdulrashid
- Nicknames: William ↔ Will ↔ Bill ↔ Billy
- Initials: J. E. Smith ↔ James Earl Smith
- Titles and honorifics: Dr. ↔ Mr. ↔ Ph.D.
- Out-of-order name components: Diaz, Carlos Alfonzo ↔ Carlos Alfonzo Diaz
- Missing name components: Phillip Charles Carr ↔ Phillip Carr
- Missing spaces or hyphens: MaryEllen ↔ Mary Ellen ↔ Mary-Ellen
- Truncated name components: McDonalds ↔ McD ↔ McDonald
- Name split inconsistently across database fields: Dick • Van Dyke ↔ Dick Van • Dyke

Available Languages and Scripts
RNI matches names from these languages either in transliteration to English or written in their native scripts.
- Arabic scripts: Arabic, Persian, Pashto, Urdu
- Cyrillic: Russian
- Hangul: Korean
- Hanzi (Simplified & Traditional): Chinese
- Kanji, Katakana, Hiragana: Japanese
- Roman scripts: English, Spanish, French, Italian, German, Portuguese

Customize To Your Needs
- Set the minimum threshold of the similarity score to manage the precision and recall of the returned search results.
- Ignore a given list of words (“stopwords”) with respect to matching (e.g., titles, honorifics).
- Force two name words to always match with a given score (e.g., “Elizabeth” and “Lisbeth” always match at 90%).
- Force two names to always match with a given score (e.g., “John Doe” and “Joe Bloggs” always match at 95%).
- Link multiple names to a single individual (e.g., queries for "Marilyn Monroe" and "Norma Jeane Mortensen" include the same person).
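The options under Customize To Your Needs (a minimum score threshold, stopwords, and forced word-pair scores) can be pictured with a small Java sketch. This is an illustration only: the class `NameMatchSettingsSketch` and all of its methods are hypothetical and are not the RNI API, and the scoring is a deliberately naive exact-match placeholder rather than RNI's linguistic comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these names are NOT the RNI API.
class NameMatchSettingsSketch {
    private final double threshold;                        // minimum score that counts as a match
    private final Set<String> stopwords = new HashSet<>(); // words ignored during matching
    private final Map<String, Double> forcedScores = new HashMap<>(); // word-pair overrides

    NameMatchSettingsSketch(double threshold) {
        this.threshold = threshold;
    }

    void addStopword(String word) {
        stopwords.add(normalize(word));
    }

    void forceWordScore(String a, String b, double score) {
        forcedScores.put(pairKey(normalize(a), normalize(b)), score);
    }

    // Lower-case and strip punctuation so "Dr." and "dr" compare equal.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
    }

    // Order-independent key for a pair of normalized words.
    private static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // Split a name into normalized words, dropping stopwords.
    private List<String> words(String name) {
        List<String> out = new ArrayList<>();
        for (String raw : name.trim().split("\\s+")) {
            String w = normalize(raw);
            if (!w.isEmpty() && !stopwords.contains(w)) {
                out.add(w);
            }
        }
        return out;
    }

    // Placeholder similarity: forced score if configured, else exact match only.
    private double wordScore(String a, String b) {
        Double forced = forcedScores.get(pairKey(a, b));
        if (forced != null) {
            return forced;
        }
        return a.equals(b) ? 1.0 : 0.0;
    }

    // Sum of each query word's best score, divided by the word count of the longer name.
    double score(String query, String candidate) {
        List<String> q = words(query);
        List<String> c = words(candidate);
        if (q.isEmpty() || c.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String qw : q) {
            double best = 0.0;
            for (String cw : c) {
                best = Math.max(best, wordScore(qw, cw));
            }
            sum += best;
        }
        return sum / Math.max(q.size(), c.size());
    }

    boolean matches(String query, String candidate) {
        return score(query, candidate) >= threshold;
    }

    public static void main(String[] args) {
        NameMatchSettingsSketch m = new NameMatchSettingsSketch(0.80);
        m.addStopword("Dr.");
        m.forceWordScore("Elizabeth", "Lisbeth", 0.90);
        System.out.println(m.score("Dr. Elizabeth Smith", "Lisbeth Smith"));
        System.out.println(m.matches("Dr. Elizabeth Smith", "Lisbeth Smith"));
    }
}
```

Raising the threshold in this sketch trades recall for precision, which is the tuning knob the first bullet describes; a real deployment would rely on the product's own configuration mechanism instead.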

© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette” and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-26-RNI)

HEADQUARTERS: One Alewife Center, Cambridge, MA 02140
FEDERAL: 2553 Dulles View Dr., Suite 450, Herndon, VA 20171
WEST COAST: 1700 Montgomery St., San Francisco, CA 94111
EUROPE: Furzeground Way, Middlesex UB11 1BD, UK
ASIA: 9-6 Nibancho, Chiyoda-ku, Tokyo 102-0084, Japan

RNT: ROSETTE Name Translator

Example name translations:
- Language: Arabic; Origin: English; Entity Type: Person; Name: جون كينيدي; English: John Kennedy
- Language: Chinese; Origin: Chinese; Entity Type: Person; Name: 姚明; English: Yao Ming
- Language: Arabic; Origin: Arabic; Entity Type: Person; Name: أبو يوسف يعقوب; English: Abu-Yusif Ya'qub
- Language: Japanese; Origin: Japanese; Entity Type: Location; Name: 信濃川; English: Shinano River
- Language: Russian; Origin: Korean; Entity Type: Person; Name: Чан Хо Пак; English: Chan Ho Pak

Instantly translate many names to (and from) English
Supported: 10 Languages

Names are an essential source of information, but most names in the world are not written in English, rendering them nearly useless to Anglocentric corporations and governments. These organizations must quickly and accurately translate names, often at a very large scale. Rosette® Name Translator (RNT) can quickly process millions of names from foreign languages and produce highly accurate, standardized English translations using industry-leading technologies, such as linguistic algorithms and statistical modeling. In addition, RNT can also translate any name written in English into its equivalent in any supported language, such as Arabic or Chinese.

KEY FEATURES
- Component of the Rosette SDK
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java
- Unix, Linux, Mac, or Windows

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.


Start using RNT today
Try our free product evaluation: www.basistech.com

A DIFFICULT PROBLEM
Translating names from other languages into English is quite difficult. Even the most powerful and expensive “machine translation” systems struggle when confronted with the task of accurately translating large numbers of names. Why is this so hard?

A FEW CHALLENGES:
- Which words in a name should be translated according to their spelling (i.e., transliterated) and which words according to their meaning? For example, ملك عبد الله بن عبد العزيز is rendered as King 'Abdallah Bin-'Abd-al-'Aziz: the title is translated by meaning as “King” instead of transliterated as “Malik”, while the rest is transliterated instead of translated by meaning as “Servant of God”, son of “Servant of the Precious One”.
- Within a language, there may be conflicting conventions for translation. Both “Fuji” and “Huzi” are accepted spellings of the iconic Japanese volcano. Arguments over spelling the capital of Ukraine as “Kiev” vs. “Kyiv” have almost triggered diplomatic crises.
- Common practice may conflict with organizational standards. For example, the name of the former ruler of Iraq typically appears in the news media as “Saddam Hussein”. However, the CIA’s official spelling is “Saddam Husayn”. Similarly, the conventional spelling of the Syrian ruler is “Assad”. However, CIA guidelines say “Asad”.
- A name written in a foreign language may be native to that language, such as محمود أحمدي نژاد (Mahmoud Ahmadinejad), or may be an English name written in a foreign alphabet, such as جورج دبليو بوش (George W. Bush).

HOW IT WORKS
RNT combines dictionary look-ups and transliteration to find the most accurate English spelling of a name. First, the foreign name is looked up in user-supplied name dictionaries, known as gazetteers. If the name is not found, RNT transliterates the name into English by using linguistic algorithms and statistical modeling, then matches it using preferred name standards. For example, names written in Chinese are converted from ideographic characters into a phonetic representation. Names written in “unvocalized” Arabic (i.e., without short vowels) are automatically vocalized to enable a phonetic translation according to any of several user-selected standard systems.

UNIQUE CAPABILITIES
- Generate consistent “conventional spellings” of frequently appearing foreign names
- Process “unrecognized” names, i.e., those not appearing in any known catalog of foreign names
- Incorporate complex transliteration standards (such as the IC or U.S. Board on Geographic Names) for translating a name from a foreign alphabet into English
- Automatically resolve name spelling ambiguities in the source language, such as partial vocalization of Arabic, or word segmentation in Chinese

COMBINING REX & RNT
Rosette Entity Extractor (REX) may be paired with RNT to extract and translate key names in a document, with accuracy superior to either statistical or rule-based machine translation systems. This approach may also be used to enrich or remediate the output of such systems in situations where translations of entire paragraphs or documents are required.

Example passage (Arabic):
واقترحت واشنطن بدلا من ذلك إنشاء محكمة جديدة تابعة للأمم المتحدة والاتحاد الأفريقي في تنزانيا وتعهدت بتقديم دعم مالي كبير لها، مطالبة الدول الغنية الأخرى بتوفير مساعدات مماثلة. وأكدت القائمة بأعمال المندوب الأميركي في مجلس الأمن آن باترسون اهتمام بلادها بمساءلة من وصفتهم بمرتكبي الأعمال الوحشية في دارفور

Entity types marked: Person, Location, Organization, Nationality
Names extracted and translated: Washington, United Nations, African Union, Tanzania, Security Council, American, Anne Patterson, Darfur
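The two-step flow described under HOW IT WORKS (gazetteer look-up first, transliteration as the fallback) can be sketched in a few lines of Java. Everything here is an assumption made for illustration: `GazetteerFirstTranslatorSketch`, `addGazetteerEntry`, and `toyTransliterate` are hypothetical names, not the RNT API, and the three-letter Cyrillic table merely stands in for the linguistic algorithms and statistical modeling the text mentions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: class and method names are NOT part of the RNT API.
class GazetteerFirstTranslatorSketch {
    // User-supplied dictionary of known names (the "gazetteer").
    private final Map<String, String> gazetteer = new HashMap<>();
    // Fallback used when a name is not in the gazetteer.
    private final UnaryOperator<String> transliterator;

    GazetteerFirstTranslatorSketch(UnaryOperator<String> transliterator) {
        this.transliterator = transliterator;
    }

    void addGazetteerEntry(String foreignName, String englishName) {
        gazetteer.put(foreignName, englishName);
    }

    // Step 1: dictionary look-up. Step 2: transliterate only if the look-up misses.
    String translate(String foreignName) {
        String known = gazetteer.get(foreignName);
        if (known != null) {
            return known;
        }
        return transliterator.apply(foreignName);
    }

    // Toy stand-in for transliteration: a table covering three Cyrillic letters.
    // Characters missing from the table are passed through unchanged.
    static String toyTransliterate(String s) {
        Map<Character, String> table = new HashMap<>();
        table.put('\u041F', "P"); // Cyrillic capital Pe
        table.put('\u0430', "a"); // Cyrillic small A
        table.put('\u043A', "k"); // Cyrillic small Ka
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            sb.append(table.getOrDefault(ch, String.valueOf(ch)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        GazetteerFirstTranslatorSketch t =
                new GazetteerFirstTranslatorSketch(GazetteerFirstTranslatorSketch::toyTransliterate);
        // Gazetteer entry for the Chinese name rendered as "Yao Ming" in the examples.
        t.addGazetteerEntry("\u59DA\u660E", "Yao Ming");
        System.out.println(t.translate("\u59DA\u660E"));       // found in the gazetteer
        System.out.println(t.translate("\u041F\u0430\u043A")); // falls back to the toy table
    }
}
```

Checking the gazetteer before transliterating is what lets an organization pin its preferred spelling for a frequently seen name while still handling names that appear in no catalog.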

Compatibility
- Code Base: Java
- Platform Support: Unix, Linux, Mac, or Windows

Available Language Pairs
Each language below is paired with English. Depending on the pair, names can be translated either to and from English or only to English.
- Arabic
- Chinese
- Dari
- Farsi
- Japanese
- Korean
- Pashto
- Russian
- Urdu

Additional languages are available via custom development.

© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-26-RNT)

HEADQUARTERS: One Alewife Center, Cambridge, MA 02140
FEDERAL: 2553 Dulles View Dr., Suite 450, Herndon, VA 20171
WEST COAST: 1700 Montgomery St., San Francisco, CA 94111
EUROPE: Furzeground Way, Middlesex UB11 1BD, UK
ASIA: 9-6 Nibancho, Chiyoda-ku, Tokyo 102-0084, Japan