Software Applications for Cultural Diversity

Rod Davis, SIL International

Good morning, my name is Rod Davis from SIL International. I work as the IT Manager for SIL’s office here in Bamako. I’m here on behalf of Michael Cochran, the head of Software Development in , .

1 About SIL International

• Faith-based international NGO

• Language-based development

• Research, translation and literacy

• Active in 20% of the world’s languages

SIL is a faith-based International Non-Governmental Organisation that partners with language communities to help them meet the language-based parts of their development goals. For the last 70 years SIL has served people groups through research, translation and literacy. Today we have active projects in 20% of the world’s languages.

2 SIL Language Software Development

• 53 software development personnel

• Member of the Unicode Consortium and very active on behalf of minority languages

• 60 software titles to support the work of language fieldworkers

Over the years SIL has developed software to meet the challenges we face. Today we have 26 people working full-time on software development and another 27 working part-time. We are a member of the Unicode Consortium and are very active in ensuring that minority languages are included. SIL has developed more than 60 pieces of software to support the work of its fieldworkers. Most are available to download free of charge.

3 One of the Biggest Challenges: Dealing with Complex Scripts

• Encoding
  – Transition to Unicode
• Input (Keyboarding)
  – Complexity
  – Extensibility
• Type Design
  – Unicode based
  – Smart-font compiler technology
• Rendering
  – placement, contextual shaping, ligatures, reordering/splitting, bi-directionality

We have had to overcome a variety of challenges. For example, languages with complex scripts present a variety of keyboarding and rendering challenges. These challenges can be categorized into four different, but closely related, domains: encoding, input, type design, and rendering.

4 Solutions in Dealing with Complex Scripts

• Encoding
  – TechKit, a utility to convert legacy encodings to Unicode
  – http://scripts.sil.org: SIL resource site with information, tutorials, utilities, etc. for making the transition to Unicode
• Input
  – Keyman (Windows); http://tavultesoft.com
  – KMFL (a Linux ‘Keyman’ still under development)
  – IMEs (Input Method Editors)
• Type Design
  – Graphite Compiler adds Graphite rendering tables to a TrueType font, giving it ‘smart-font’ capability
• Rendering
  – Graphite rendering engine: can handle basic display of any complex script in use today
• NOTE: These technologies can benefit not only ‘complex scripts’, but also orthographies with special or a few non-Roman characters.

Our software development staff has been very active in developing solutions in each of these domains. A lot of thought and work has been put into providing tools and resources for making the transition to Unicode. Keyman is a keyboarding solution that was developed by the son of an SIL member; this product is in wide use and very mature. Graphite is the culmination of an effort to provide an extensible, open-source solution to handling any complex script system in use today.

5 What is a Smart Font?

• A font containing data describing how the glyphs are to be displayed.

• The smart font data is in the form of tables inside the font file itself.

• A rendering engine interprets the font tables to appropriately render the glyphs.

• In contrast, a ‘dumb font’ has only a direct correspondence between the data characters and the displayed glyphs.

Smart font technology was developed to deal with the rendering issues associated with complex scripts. A smart font incorporates tables within the font file itself that contain data, associated with the codepoints, that specify how to display a glyph and how to appropriately contextualize that glyph within text. This is in contrast to a ‘dumb font’, which has only a direct correspondence between the data characters and the displayed glyphs.
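The contrast between the two kinds of font can be sketched in a few lines. This is an illustration only: the glyph names and the ligature table are hypothetical, and real smart font tables are far richer than a pair lookup.

```python
# Character-to-glyph map: all that a 'dumb font' has.
# Glyph names here are hypothetical.
CMAP = {"f": "f", "i": "i", "n": "n"}

# Extra contextual data, standing in for the tables a smart font carries.
LIGATURES = {("f", "i"): "f_i"}


def render_dumb(text):
    """One glyph per character, with no knowledge of context."""
    return [CMAP[c] for c in text]


def render_smart(text):
    """Start from the direct mapping, then apply contextual substitutions."""
    glyphs = render_dumb(text)
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])  # two glyphs become one ligature
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out
```

Here `render_dumb("fin")` gives three separate glyphs, while `render_smart("fin")` replaces the first two with the single ligature glyph.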

6 What is Graphite?

• A package, developed by SIL International, that can be embedded into other applications, adding "smart font behavior" to that application.
• This package includes the following:
  – A programming language (GDL) for specifying the font's behavior
  – A compiler for building the smart font
  – A rendering engine for displaying text using the smart font
• Note that it is open-source: http://graphite.sil.org/ http://sourceforge.net/projects/silgraphite

Note that this is a package designed for software developers. The package includes:

• A programming language (GDL) for specifying the font's behavior
• A compiler for building the smart font
• A rendering engine for displaying text using the smart font

To be useful, it must be embedded in another application, and a smart font must be created.

7 What Makes Graphite Special?

• It differs from other complex script technologies:
  – Unlike OpenType, it does NOT assume that script-specific information is incorporated at the application or operating system level (as in Uniscribe).
  – It is extensible, and provides support for characters in Unicode’s Private Use Area ranges … unlike Uniscribe, which handles only script behaviors already part of the Unicode standard.
  – It is an open-source solution versus proprietary.
  – It can handle basic display of any complex script in use today.
• The following slides demonstrate the types of rendering difficulties that Graphite was designed to handle:

Graphite, unlike other existing solutions, is flexible, extensible, and open-source. It is appropriate in every respect for incorporating into software used with minority language scripts for which other solutions would not work well (or at all). It can handle basic display of any complex script in use today, as we can see on the following slides.

8 http://scripts.sil.org/

Applications dealing with complex scripts need to handle different baselines and directions…

9 http://scripts.sil.org/ context sensitive glyphs…

10 http://scripts.sil.org/ multiple and contextually positioned diacritics…

11 http://scripts.sil.org/ glyphs that are typed in one order but are then displayed in another order…

12 http://scripts.sil.org/ and glyphs that split or combine in various contexts.
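Reordering is perhaps the easiest of these behaviors to sketch. In Devanagari, for example, the vowel sign I (U+093F) is stored after its consonant but drawn to the left of it. The function below is a simplified illustration under that one assumption: it moves the sign before the single preceding character and ignores consonant clusters, which a real rendering engine must handle.

```python
# Vowel signs that are stored after the consonant but displayed before it.
# Only one Devanagari sign is listed; a real table would be per script.
PREPOSED = {"\u093F"}  # DEVANAGARI VOWEL SIGN I


def visual_order(logical):
    """Return characters in display order for a string in stored (typed) order."""
    out = []
    for ch in logical:
        if ch in PREPOSED and out:
            # Place the sign in front of the character it follows in storage.
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return out


# KA followed by VOWEL SIGN I is displayed with the sign first.
shown = visual_order("\u0915\u093F")
```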

13 SIL’s Work with UNESCO

• In 2003, SIL International and UNESCO engaged in a cooperative project as part of UNESCO’s Initiative B@bel effort. http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=Babel
• Goal: Enable the development of complex script support in information and communication technologies (ICTs).
• Project SILA: A Graphite-enabled version of Mozilla
  – Project Goal: To enable minority language communities to publish on the Internet …
• Graphite-enabled Edit Control Version 0.9
  – Basic Graphite edit control for data input applications
  – SDK for Windows 2000/XP, developer’s guide and well-commented source code
• Modified version of WorldPad: Graphite-Enabled Text Processor
  – Simple text processor for Windows 2000/XP

For most of 2003, UNESCO and SIL International were engaged in a cooperative project as part of UNESCO’s Initiative B@bel effort. The stated goal was to enable the development of complex script support in information and communication technologies (ICTs).

14 Graphite-enabled Mozilla

Contrast Internet Explorer on the left with Graphite-enabled Mozilla on the right. The Graphite-enabled version of Mozilla displays the multiple diacritics separately as intended. http://sila.mozdev.org/silab2.htm

On the left you can see that Internet Explorer just displays multiple diacritics all superimposed on one another, whereas on the right, Graphite-enabled Mozilla displays the diacritics separately as intended.
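The underlying data explains why this is hard. In Unicode, each diacritic is a separate combining character stored after its base letter, so the renderer has to work out which marks belong to which base and where each one goes. The sketch below uses Python's standard `unicodedata` module and the Unicode combining classes 230 (above) and 220 (below). It only groups and classifies marks and does no positioning, and the sample string is an invented example.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch  # non-zero combining class: attach to the base
        else:
            clusters.append(ch)
    return clusters


def classify_marks(cluster):
    """Return (base, marks_above, marks_below) for one cluster."""
    base, marks = cluster[0], cluster[1:]
    above = [m for m in marks if unicodedata.combining(m) == 230]
    below = [m for m in marks if unicodedata.combining(m) == 220]
    return base, above, below


# e + combining acute + combining circumflex + combining dot below
sample = "e\u0301\u0302\u0323"
base, above, below = classify_marks(split_clusters(sample)[0])
```

A renderer that ignores this grouping draws every mark at the same default position, which is the superimposed result seen on the left.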

15 Graphite-enabled Mozilla E-mail

Here is an e-mail in . We developed the Graphite-enabled version of Mozilla in partnership with UNESCO as part of their Initiative B@bel.

16 Graphite-enabled Mozilla Instant Messaging

Here is instant messaging with Burmese script in both the input field and the display area.

17 Graphite-enabled WorldPad

http://www.ethnologue.com/tools_docs/fieldworks.asp

The Graphite rendering engine enables the WorldPad word-processing application to stack several diacritics both above and below the base glyph.

18 Lessons learned

We are participating in or have created several open source development efforts. We have learned that…
  – FLOSS (Free/Libre Open Source Software) Linux is very attractive to the low-income groups we work with.
  – Setup and maintenance of open source OS and software is too complex for most of our end users.
  – Open source development is complex.
  – Getting support for complex minority scripts into the “core” builds of open source software is hard.
  – Localization of software is complex and time consuming.
  – Despite the difficulties, the benefits are great and we are heading in this direction on a number of fronts.

We are participating in or have created several open source development efforts. We have learned that…
• open source development is complex
• getting support for complex minority scripts into the “core” builds of open source software is hard
• localization of software is complex and time consuming
• setup and maintenance of FLOSS is too complex for most of our end users

19 What is Keyman?

• Keyman allows you to enter text in Windows®-based applications in other languages without changing your physical keyboard (or system keyboard).
• It does this by remapping the character keys according to the font for the language you wish to use.
• It works with both “ANSI” and Unicode. It is in wide use and well tested.
• Keyman 6.1 is a commercial program available at a reasonable price: http://tavultesoft.com/keyman/
• The development tool called ‘Keyman Developer’ is sold separately, and uses a rule-based programming language.
• Note: There are some issues with very large character sets that are handled better with IMEs (Input Method Editors). However, the IMEs are more complicated and are not really user modifiable or extensible.

Keyman allows you to enter text in Windows®-based applications in other languages without changing your physical keyboard (or system keyboard). It does this by remapping the character keys according to the font for the language you wish to use. It works with both “ANSI” and Unicode. It is in wide use and well tested. Note that it is a commercial product, but it is reasonably priced. A keyboard development tool called ‘Keyman Developer’ is sold separately, and uses a rule-based programming language.

20 Technical overview of Keyman

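[Diagram: the operating system sends keystrokes to the Keyman driver, which sends characters to the application software. A Keyman program (.KMN) is compiled by the TIKE compiler into a Keyman executable (.KMX), which the configuration program installs for the driver.]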

A Keyman keyboard is implemented via a program written in Keyman’s programming language. This program is compiled using TIKE (Tavultesoft Integrated Keyboard Editor), which produces a keyboard executable file, generally given a .KMX extension. This executable is installed into the Keyman system using the Keyman configuration program, from which the driver is initialized. When the Keyman driver is running, it intercepts keystrokes from the operating system and transforms them into characters using the active keyboard, then passes those characters on to the current application.

The modules shown above within the dotted lines are those that are part of the Keyman system.
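The driver's role, turning a stream of keystrokes into characters according to the rules of the active keyboard, can be sketched as follows. This is not Keyman's programming language and not its implementation. The rule table, and the choice of `;` as a prefix key, are hypothetical and exist only to show the idea of rules that match on the preceding context.

```python
# Hypothetical keyboard rules: (previous character, key pressed) -> output.
# When a rule matches, the output replaces the previous character.
RULES = {
    (";", "o"): "\u0254",  # ; then o gives open o
    (";", "e"): "\u025B",  # ; then e gives open e
    (";", "n"): "\u014B",  # ; then n gives eng
}


def process_keystrokes(keys):
    """Transform a sequence of keystrokes into the characters an application receives."""
    buffer = ""
    for key in keys:
        context = buffer[-1:]  # last character produced so far, or "" at the start
        if (context, key) in RULES:
            buffer = buffer[:-1] + RULES[(context, key)]
        else:
            buffer += key  # no rule applies: the key passes through unchanged
    return buffer


typed = process_keystrokes("s;og;o")
```

Keys that match no rule pass through untouched, so ordinary typing is unaffected.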

21 SIL Unicode-based Fonts

• Doulos SIL 4.0: A linguist’s general use font … This is a comprehensive inventory of glyphs needed for almost any Roman- or Cyrillic-based writing system, whether used for phonetic or orthographic needs. Status: Released but not a ‘full-family’ font (regular type-face only)

• Gentium is a font family designed to enable the diverse ethnic groups around the world who use the Latin script to produce readable, high-quality publications. It supports a wide range of Latin-based alphabets and includes glyphs that correspond to all the Latin ranges of Unicode. Status: Released and in wide use; Linux installation also released.

• Additional Unicode-based fonts, including one specifically tuned for literacy use, are under development.

• http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=FontDownloads (Just remember http://scripts.sil.org )

SIL International has released two Unicode-based fonts. Doulos SIL 4.0 is designed specifically for linguistic research and language development. Gentium is a Latin-based font that is appropriate for producing readable, high-quality publications. There are additional fonts under development, including one specifically for literacy use.

22 Language and Culture tools

SIL International has also developed software tools that serve the following needs:

• Lexical database programs that facilitate:
  – Grammatical/morphological analysis
  – Dictionary creation and printing
  – Web publication of lexicons
  – Entry and analysis of cultural data
• Language Survey tools for doing statistical comparison of related languages/dialects
• Speech Tools for doing phonetic analysis of speech (including tonal analysis)
• An extensive CD-ROM based reference tool that covers all aspects of language and cultural field work

We have also developed a variety of tools for language and culture analysis. These tools serve a variety of needs, including …

23 FieldWorks Data Notebook for analyzing cultural data

http://www.ethnologue.com/tools_docs/fieldworks.asp

The analysis of cultural data: The Data Notebook is a great tool for capturing cultural information about a society, for example if you want to document how the culture works. It can help with literacy and educational development by documenting language attitudes, social patterns, and annual community cycles. It can help with translation tasks by documenting cultural concepts and terminology.

24 Toolbox for language data management

http://www.sil.org/computing/toolbox/

This program enables you to build a lexical database. Toolbox facilitates grammatical/morphological analysis and the creating and printing of dictionaries. You can also use it for entry and analysis of cultural data (but it is not as sophisticated as the FieldWorks Data Notebook in this domain).
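As a rough picture of what a lexical database record looks like, the sketch below reads plain-text entries in which each field starts with a backslash marker. The layout and the marker names (`\lx` for the headword, `\ps` for part of speech, `\ge` for an English gloss) are assumptions for illustration and are not described in this talk. The two sample entries are invented.

```python
def parse_records(text):
    """Split backslash-marked text into a list of {marker: value} records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything without a marker
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":
            records.append({})  # a headword marker starts a new record
        if records:
            records[-1][marker] = value.strip()
    return records


# Invented sample data using the assumed markers.
SAMPLE = r"""
\lx ji
\ps n
\ge water

\lx so
\ps n
\ge house
"""

entries = parse_records(SAMPLE)
```

Once entries are held as structured records like these, tasks such as sorting, printing a dictionary, or exporting to another tool become simple transformations.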

25 Lexique Pro for publishing interactive lexicons

http://www.lexiquepro.com/

Lexique Pro is a tool for converting a Shoebox/Toolbox database into a web-based dictionary. In fact, we developed this particular application, Lexique Pro, right here in Mali. (Credit goes to Richard Margetts.) This is an entry from a Bambara-French-English lexicon.

26 WordSurv and PalmSurv to compare language survey data

http://wordsurv.css.tayloru.edu http://palmsurv.cyboreal.com

There is also software developed specifically for doing comparative statistical analysis of related languages and dialects. There is also a version specifically for the Palm Pilot, making it more convenient for a field worker to do data entry and basic analysis in the field.
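One simple statistic of this kind is the percentage of word-list items for which two varieties use similar forms. The sketch below is a generic illustration and does not describe how WordSurv or PalmSurv compute their results. It assumes a researcher has already assigned each word to a similarity group per gloss, and the data shown is invented.

```python
def percent_shared(a, b):
    """Percentage of glosses present in both lists whose group labels match."""
    common = a.keys() & b.keys()
    if not common:
        raise ValueError("the two word lists share no glosses")
    same = sum(1 for gloss in common if a[gloss] == b[gloss])
    return 100.0 * same / len(common)


# Invented survey data: gloss -> similarity group assigned by the researcher.
dialect_a = {"water": 1, "fire": 1, "dog": 2, "sun": 1}
dialect_b = {"water": 1, "fire": 2, "dog": 2, "sun": 1}

similarity = percent_shared(dialect_a, dialect_b)
```

With the data above, three of the four shared glosses fall in the same group, so the result is 75.0.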

27 Speech Tools for analyzing speech data

See http://www.sil.org/computing/catalog/index.asp

This program analyses speech data. It is unique in that it facilitates not only basic phonetic analysis, but also tonal analysis of languages.

28 LinguaLinks Library for electronic reference materials to support language and culture fieldwork

• Consulting
• General Reference Works
• Language Learning
• Linguistics
• Literacy
• Sociolinguistics
• Translation
• http://www.ethnologue.com/lingualinks.asp

LinguaLinks Library contains electronic reference materials designed to support language fieldwork. It contains the entire contents of 149 books or book-length works and 223 SIL journal issues.

29 Partnering in Script Development

• Things SIL International could potentially offer:
  – Support script development for languages currently unsupported
  – Partner in the development of keyboarding solutions
  – Continued development of the Graphite rendering technology
  – Support incorporating these technologies into current open source or commercial tools, e.g., OpenOffice
• Things SIL International would benefit from:
  – Resources necessary to actually develop the scripts (people, money, hardware/software)
  – Community cooperation in the standardization of scripts
  – Help getting open source software groups to incorporate these technologies into their core software efforts

We have not been successful to date in getting the open source core development groups to incorporate our software into their development efforts. As a result, we can either fork the code and do our own effort (which would require a major commitment of resources) or keep synchronizing our changes to each new version of their product (infeasible). Note that we would be happy for anyone, open source or not, to help by providing the support needed by complex minority scripts. We have pursued open source mainly because we can modify the code and the result looks attractive to us.

30 Web-sites of Interest

• SIL International: http://sil.org/
• Multilingual Computing: http://www.sil.org/computing/multilingual.html
  Content: Links to some SIL contributions to research and development in the area of multilingual computing, links to additional resources, and a glossary of Character Encoding and Rendering.
• Fonts in Cyberspace: http://www.sil.org/computing/fonts/
  Content: This is a guide to finding language fonts on the Internet, containing more than 400 sources for 123 languages.
• NRSI: Computers & Writing Systems: http://scripts.sil.org
  Content: Character Encoding, Understanding Unicode, Keyboard Design and Keyboarding Utilities, Type Design, and Script Rendering Technologies
• SIL Software Catalog: http://www.sil.org/computing/catalog/index.asp
  Content: SIL has developed more than 60 pieces of software to support the work of its fieldworkers; most are available to the public for free download.
• Keyman 6.1: http://tavultesoft.com/keyman/
  Content: Keyman is commercial Windows keyboard mapping software that, with Keyman Developer, is used to design your own keyboard layouts.
• SIL FieldWorks 2.0: http://www.ethnologue.com/tools_docs/fieldworks.asp
  SIL FieldWorks is a suite of software tools that work together to help language teams worldwide. The current released suite of programs includes WorldPad and the SIL FieldWorks Data Notebook. Additional applications are under development.
• Field Linguist’s Toolbox: http://www.sil.org/computing/toolbox/
  Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data.
• Lexique Pro: http://lexiquepro.com
  Lexique Pro is an interactive lexicon viewer, with hyperlinks between entries, category views, dictionary reversal, search, and export tools. It's designed to display your data in a user-friendly format so you can distribute it to others.

Following are some web sites of interest.
