The First International Chinese Word Segmentation Bakeoff

Richard Sproat
AT&T Labs – Research
180 Park Avenue, Florham Park, NJ, 07932, USA
[email protected]

Thomas Emerson
Basis Technology
150 CambridgePark Drive, Cambridge, MA 02140, USA
[email protected]

Abstract

This paper presents the results from the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff held in 2003 and reported in conjunction with the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. We give the motivation for having an international segmentation contest (given that there have been two within-China contests to date) and we report on the results of this first international contest, analyze these results, and make some recommendations for the future.

1 Introduction

Chinese word segmentation is a difficult problem that has received a lot of attention in the literature; reviews of some of the various approaches can be found in (Wang et al., 1990; Wu and Tseng, 1993; Sproat and Shih, 2001). The problem with this literature has always been that it is very hard to compare systems, due to the lack of any common standard test set. Thus, an approach that seems very promising based on its published report is nonetheless hard to compare fairly with other systems, since the systems are often tested on their own selected test corpora. Part of the problem is also that there is no single accepted segmentation standard: There are several, including the four standards used in this evaluation.

A number of segmentation contests have been held in recent years within Mainland China, in the context of more general evaluations for Chinese-English machine translation. See (Yao, 2001; Yao, 2002) for the first and second of these; the third evaluation will be held in August 2003. The test corpora were segmented according to the Chinese national standard GB 13715 (GB/T 13715–92, 1993), though some lenience was granted in the case of plausible alternative segmentations (Yao, 2001); so while GB 13715 specifies the segmentation 毛/泽东 for Mao Zedong, 毛泽东 was also allowed. Accuracies in the mid 80's to mid 90's were reported for the four systems that participated in the first evaluation, with higher scores (many in the high nineties) being reported for the second evaluation.

The motivations for holding the current contest are twofold. First of all, by making the contest international, we are encouraging participation from people and institutions who work on Chinese word segmentation anywhere in the world. The final set of participants in the bakeoff includes two from Mainland China, three from Hong Kong, one from Japan, one from Singapore, one from Taiwan and four from the United States.

Secondly, as we have already noted, there are at least four distinct standards in active use in the sense that large corpora are being developed according to those standards; see Section 2.1. It has also been observed that different segmentation standards are appropriate for different purposes; that the segmentation standard that one might prefer for information retrieval applications is likely to be different from the one that one would prefer for text-to-speech synthesis; see (Wu, 2003) for useful discussion. Thus, while we do not subscribe to the view that any of the extant standards are, in fact, appropriate for any particular application, nevertheless, it seems desirable to have a contest where people are tested against more than one standard.

A third point is that we decided early on that we would not be lenient in our scoring, so that alternative segmentations as in the case of 毛泽东 Mao Zedong, cited above, would not be allowed. While it would be fairly straightforward (in many cases) to automatically score both alternatives, we felt we could provide a more objective measure if we went strictly by the particular segmentation standard being tested on, and simply did not get into the business of deciding upon allowable alternatives.

Comparing segmenters is difficult. This is not only because of differences in segmentation standards but also due to differences in the design of systems: Systems based exclusively (or even primarily) on lexical and grammatical analysis will often be at a disadvantage compared to systems trained exclusively on the training data. Competitions also may fail to predict the performance of the segmenter on new texts outside the training and testing sets. The handling of out-of-vocabulary words becomes a much larger issue in these situations than is accounted for within the test environment: A system that performs admirably in the competition may perform poorly on texts from different registers.

Another issue that is not accounted for in the current collection of evaluations is the handling of short strings with minimal context, such as queries submitted to a search engine. This has been studied indirectly through the cross-language information retrieval work performed for the TREC 5 and TREC 6 competitions (Smeaton and Wilkinson, 1997; Wilkinson, 1998).

This report summarizes the results of this First International Chinese Word Segmentation Bakeoff, provides some analysis of the results, and makes specific recommendations for future bakeoffs. One thing we do not do here is get into the details of specific systems; each of the participants was required to provide a four page description of their system along with detailed discussion of their results, and these papers are published in this volume.

2 Details of the contest

2.1 Corpora

The corpora are detailed in Table 1. Links to descriptions of the corpora can be found at http://www.sighan.org/bakeoff2003/bakeoff_instr.html; publications on specific corpora are (Huang et al., 1997) (Academia Sinica), (Xia, 1999) (Chinese Treebank); the Beijing University standard is very similar to that outlined in (GB/T 13715–92, 1993). Table 1 lists the abbreviations for the four corpora that will be used throughout this paper. The suffixes “o” and “c” will be used to denote open and closed tracks, respectively: Thus “ASo,c” denotes the Academia Sinica corpus, both open and closed tracks; and “PKc” denotes the Beijing University corpus, closed track.

Corpus                     Abbrev.  Encoding                     # Train. Words  # Test. Words
Academia Sinica            AS       Big Five (MS Codepage 950)   5.8M            12K
U. Penn Chinese Treebank   CTB      EUC-CN (GB 2312-80)          250K            40K
Hong Kong CityU            HK       Big Five (HKSCS)             240K            35K
Beijing University         PK       GBK (MS Codepage 936)        1.1M            17K

Table 1: Corpora used.

During the course of this bakeoff, a number of inconsistencies in segmentation were noted in the CTB corpus by one of the participants. This was done early enough so that it was possible for the CTB developers to correct some of the more common cases, both in the training and the test data. The revised training data was posted for participants, and the revised test data was used during the testing phase.

Inconsistencies were also noted by another participant for the AS corpus. Unfortunately this came too late in the process to correct the data. However, some informal tests on the revised testing data indicated that the differences were minor.

2.2 Rules and Procedures

The contest followed a strict set of guidelines and a rigid timetable. The detailed instructions for the bakeoff can be found at http://www.sighan.org/bakeoff2003/bakeoff_instr.html (with simplified and traditional Chinese versions also available). Training material was available starting March 15, testing material was available April 22, and the results had to be returned to the SIGHAN ftp site by April 25 no later than 17:00 EDT.

Upon initial registration, sites were required to declare which corpora they would be training and testing on, and whether they would be participating in the open or closed tracks (or both) on each corpus, where these were defined as follows:

  • For the open test, sites were allowed to train on the training set for a particular corpus, and in addition they could use any other material including material from other training corpora, proprietary dictionaries, material from the WWW and so forth. However, if a site selected the open track the site was required to explain what percentage of the results came from which sources. For example, if the system did particularly well on out-of-vocabulary words then the participants were required to explain if, for example, those results could mostly be attributed to having a good dictionary.
  • In the closed test, participants could only use training material from the training data for the particular corpus being tested on. No other material was allowed.

Other obvious restrictions applied: Participants were prohibited from testing on corpora from their […]

2.3 Participating sites

Participating sites are shown in Table 2. These are a subset of the sites who had registered for the bakeoff, as some sites withdrew due to technical difficulties.

3 Further details of the corpora

An unfortunate, and sometimes unforeseen, complexity in dealing with Chinese text on the computer is the plethora of character sets and character encodings used throughout Greater China. This is demonstrated in the Encoding column of Table 1:

1. Both AS and HK utilize complex-form (or “traditional”) characters, using variants of the Big Five character set. The Academia Sinica corpus is composed almost entirely of characters in pure Big Five (four characters, 0xFB5B, 0xFA76, 0xFB7A, and 0xFAAF are outside the encoding range of Big Five), while the City University corpus utilizes 38 (34 unique) characters from the Hong Kong Supplementary Character Set (HKSCS) extension to Big Five.
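The remark about code points falling outside the encoding range of Big Five can be illustrated with a small range check. The sketch below is not taken from the bakeoff materials: it assumes the commonly cited layout of "pure" Big Five (lead byte 0xA1 to 0xF9, trail byte 0x40 to 0x7E or 0xA1 to 0xFE), and the names `in_pure_big5_range` and `out_of_range_codes` are hypothetical helpers introduced only for this example.

```python
from typing import List

# ASSUMPTION: commonly cited byte ranges for "pure" Big Five.
# These ranges are not stated in the paper itself.
BIG5_LEAD = range(0xA1, 0xFA)                         # lead byte 0xA1..0xF9
BIG5_TRAIL = (range(0x40, 0x7F), range(0xA1, 0xFF))   # trail 0x40..0x7E, 0xA1..0xFE


def in_pure_big5_range(code: int) -> bool:
    """Return True if a two-byte code lies inside the assumed Big Five range."""
    lead, trail = code >> 8, code & 0xFF
    return lead in BIG5_LEAD and any(trail in r for r in BIG5_TRAIL)


def out_of_range_codes(data: bytes) -> List[int]:
    """Scan Big Five-style bytes and collect two-byte codes outside the range.

    Bytes below 0x80 are treated as single-byte ASCII; every other byte is
    taken as the lead byte of a two-byte code. A dangling lead byte at the
    end of the input is reported as-is.
    """
    found = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        if i + 1 >= len(data):
            found.append(data[i])
            break
        code = (data[i] << 8) | data[i + 1]
        if not in_pure_big5_range(code):
            found.append(code)
        i += 2
    return found


# The four code points named in the text all have lead bytes above 0xF9.
for code in (0xFB5B, 0xFA76, 0xFB7A, 0xFAAF):
    print(hex(code), in_pure_big5_range(code))
```

Under these assumed ranges, all four code points are flagged because their lead bytes (0xFA, 0xFB) lie above 0xF9. A scan of this kind only locates such bytes in a corpus; deciding which extension (for example HKSCS) they belong to requires the relevant mapping tables.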