Digital Representation of Pulaar

David Robinson, Michigan State University
Cheikh Babou, Michigan State University
Bartek Plichta, Michigan State University

The Pulaar Language

• Pulaar (also known as Fulfulde) is the most widely spoken of the West Atlantic languages (Niger-Congo) of Africa.
• In some states (Futa Jalon in Guinea and northern Nigeria, in particular), the elites devoted considerable attention to the development of Pulaar pedagogy.

2/13/2002 1 2/13/2002 2

• This included an ajami system, that is, the writing (usually for recitation and instruction) of Pulaar texts in the Arabic script.
• Indigenous authorities, writers, and expatriate linguists have opted for the Roman script in recent decades.
• An important and largely untapped resource remains available in the Arabic-language texts written largely in the 18th and 19th centuries.

Fulbe Migration and Distribution

Source: H.P. White and M. Gleave, An Economic Geography of West Africa (1971)


David Robinson, Moustapha Kane, Sonja Fagerberg-Diallo, “Une vision iconoclaste de la guerre sainte d’al-Hajj Umar Taal,” Cahiers d’Etudes Africaines, 1994.

Representing Pulaar digitally

• Digitization
– Digitization is the process of converting an analog source material into a computer-readable format
• Language Digitization
– Language digitization is a process of representing language in a computer-readable format
• Representing sound (phonology, phonetics) through audio digitization
• Representing structure (syntax, semantics, discourse) through mark-up
• Combining sound and structure with SMIL


Audio Digitization

Master object
• Prepare analog originals
• A/D conversion, 96 kHz/24 bit
• Transfer digital data via S/PDIF
• Save file as PCM WAV, 96 kHz/24 bit
• Burn 2x copies to CD
• Make analog copies
• Enter metadata record
• Batch convert to 22,050 Hz/16 bit
• Transfer files for editing
• Delete preservation copy
Derivative object

Text Digitization - OCR

• Optical Character Recognition (OCR)
– There exist no Pulaar-specific OCR or proofing tools.
– We have developed a Pulaar-compliant OCR methodology:
  The matrix matching process maps Pulaar characters to codes.
  The feature analysis process is updated by adding new bitmap shapes to its inventory and assigning Unicode codes to them.
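As an illustration of the OCR steps named above (matrix matching that maps character shapes to codes, and an inventory of bitmap shapes that can be extended with new Unicode codes), here is a minimal sketch. It is not the authors' tool: the 5x5 templates, the function names `hamming`, `match_glyph`, and `add_template`, and the use of Hamming distance as the matching score are all assumptions made for the example.

```python
# Hypothetical 5x5 glyph templates keyed by Unicode code point.
# '#' marks ink, '.' marks background. Real OCR templates are far larger.
TEMPLATES = {
    0x0253: ("##...",   # crude shape standing in for b with hook
             "#....",
             "###..",
             "#..#.",
             "###.."),
    0x0257: ("...##",   # crude shape standing in for d with hook
             "....#",
             "..###",
             ".#..#",
             "..###"),
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def match_glyph(bitmap, templates=TEMPLATES):
    """Matrix matching: return the code point whose template is closest to the bitmap."""
    return min(templates, key=lambda code_point: hamming(bitmap, templates[code_point]))


def add_template(templates, code_point, bitmap):
    """Extend the inventory with a new bitmap shape and its Unicode code point."""
    templates[code_point] = tuple(bitmap)


# A scanned glyph with one pixel of noise still maps to U+0253.
noisy = ("##...", "#....", "###..", "#....", "###..")
print(f"U+{match_glyph(noisy):04X}")
```

Keying the inventory by code point means that supporting another Pulaar letter only requires one more `add_template` call; the matching function itself does not change.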

Text Digitization – Character Encoding

Individual characters of the scripts that humans use to record and transmit their languages are encoded in the form of binary numerical codes.

Common character encoding schemes: ASCII, ISO 646, ISO 8859 parts 1-14 (a wide variety of languages), JIS X 0201-1976 (Japanese), GB 2312-80 (Chinese), KS C 5601-1992 (Korean)

• There exists no standard character set for Pulaar.
• Unicode provides a unique number (code) for more characters and languages than any existing system.
• Unicode is platform-independent and has been adopted by such industry leaders as Apple, HP, IBM, Microsoft, Oracle, Sun, and many others.
• Unicode is required by modern standards such as XML and Java.
• Unicode is supported by many operating systems and all modern web browsers.
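To make the encoding point concrete, the sketch below looks up the Unicode code point, the official character name, and the UTF-8 bytes for the special letters used in Pulaar orthography, and checks which legacy encodings can represent them. The helper names `describe` and `encodable` are hypothetical; the code points and the standard-library calls are standard.

```python
import unicodedata

# Special letters used in Pulaar orthography, written as escapes so the
# code points are explicit: hooked B, D and Y, eng, and N with tilde.
PULAAR_LETTERS = "\u0181\u0253\u018a\u0257\u014a\u014b\u01b3\u01b4\u00d1\u00f1"


def describe(ch):
    """Return (code point label, Unicode name, UTF-8 bytes as hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())


def encodable(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for ch in PULAAR_LETTERS:
    print(*describe(ch))

# ASCII and ISO 8859-1 have no slot for the hooked letters; UTF-8 does.
for encoding in ("ascii", "latin-1", "utf-8"):
    print(encoding, encodable(PULAAR_LETTERS, encoding))
```

Only the two N-with-tilde letters fit in ISO 8859-1; the hooked letters and eng need an encoding of Unicode such as UTF-8.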


Unicode codes for Pulaar characters

Character   Unicode code
Ɓ           U+0181
ɓ           U+0253
Ɗ           U+018A
ɗ           U+0257
Ŋ           U+014A
ŋ           U+014B
Ƴ           U+01B3
ƴ           U+01B4
Ñ           U+00D1
ñ           U+00F1

Text Digitization – Mark-up

Hierarchical and Sequential Nature of Linguistic data

Discourse
Paragraph
Sentence
Phrase
Word
Morpheme
Phoneme
Phone

Text sample:
On the matter of the shipwreck did not say much. He only told that it had not occurred in the Mediterranean, but on the other side of Southern France - in the Bay of Biscay. "But this is hardly the place to enter on a story of that kind," he observed, looking round at the room with a faint smile as attractive as the rest of his rustic but well-bred personality.

Vowel data:
vowel   F1         F2
1       293        2295
2       403.7778   1851.222
3       575        1690.25
4       384.125    2179.125
5       588.5      1804.75
6       701.1111   1238.889
7       632.5      1044.5
8       495        994.2857
9       478.5714   1207.571
10      335.3333   1456
11      592.5      1246.5


How do we represent linguistic data in a computer-readable format?

Relational database
• Offers powerful analysis tools
• Fails to capture the hierarchical and sequential nature of linguistic data
• Oracle, MySQL, MS Access

Structured text
• Offers good analysis tools
• Promises to capture the hierarchical and sequential nature of linguistic data
• SGML, XML, TEI

Linguistic Mark-up – example 1

Sentence: John works in the factory.

DTD – Document Type Definition

XML mark-up of the units: John | works | in | the factory
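The following sketch shows one way the sentence "John works in the factory." could be marked up with an internal DTD and then read back both hierarchically and sequentially. The element names (`s`, `np`, `vp`, `v`, `pp`, `p`) and the DTD itself are assumptions made for illustration, not the mark-up from the original slide. Python's `xml.etree.ElementTree` parses the document but does not validate it against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD and tags; the DTD documents the intended structure,
# but ElementTree only parses it and does not enforce it.
DOCUMENT = """<!DOCTYPE s [
  <!ELEMENT s  (np, vp)>
  <!ELEMENT np (#PCDATA)>
  <!ELEMENT vp (v, pp)>
  <!ELEMENT v  (#PCDATA)>
  <!ELEMENT pp (p, np)>
  <!ELEMENT p  (#PCDATA)>
]>
<s>
  <np>John</np>
  <vp>
    <v>works</v>
    <pp>
      <p>in</p>
      <np>the factory</np>
    </pp>
  </vp>
</s>"""


def units(root):
    """Sequential view: the text of every leaf element, in document order."""
    return [el.text.strip() for el in root.iter() if len(el) == 0 and el.text]


def outline(el, depth=0):
    """Hierarchical view: one indented line per element tag."""
    lines = ["  " * depth + el.tag]
    for child in el:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
print(units(root))
print("\n".join(outline(root)))
```

The same tree answers both kinds of question: `units` recovers the word order, while `outline` exposes the nesting of phrases, which is the property a flat relational table does not capture directly.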

Linguistic Mark-up – example 2

Conversation: Mary, how do you like my new baseball hat? It’s, like, OK, I guess.

DTD – Document Type Definition

Marked-up utterances:
Hi, Mary, how you doing?
I am, like, OK, I guess

Combining Text and Audio with SMIL

• Synchronized Multimedia Integration Language (SMIL) is a simple but powerful markup language for assembling multimedia presentations.
• We use SMIL to assemble time-synchronized multimedia language corpora, as well.
• We have developed a methodology for making SMIL:
– Transcribe and time-stamp audio with trans-13.dtd
– Convert trs into XML mark-up with PHP, XSL and Sablotron
– Assemble SMIL, rt, QT, and TEI with PHP

Web example
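As a rough illustration of the final assembly step, the sketch below builds a minimal SMIL document that plays each time-stamped stretch of an audio file in parallel with a text clip. The slides describe doing this with PHP; Python is used here only for the example. The file names, the time stamps, and the function `build_smil` are hypothetical, and layout regions are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical time-stamped utterances: (start in seconds, end in seconds, text clip).
UTTERANCES = [
    (0.0, 2.4, "utterance1.rt"),
    (2.4, 5.1, "utterance2.rt"),
]


def build_smil(audio_src, utterances):
    """Return a <smil> element with one <par> (audio plus text) per utterance."""
    smil = ET.Element("smil")
    body = ET.SubElement(smil, "body")
    seq = ET.SubElement(body, "seq")  # utterances play one after another
    for start, end, text_src in utterances:
        par = ET.SubElement(seq, "par")  # audio and text play together
        ET.SubElement(par, "audio", {
            "src": audio_src,
            "clipBegin": f"npt={start}s",
            "clipEnd": f"npt={end}s",
        })
        ET.SubElement(par, "text", {"src": text_src, "dur": f"{end - start:.1f}s"})
    return smil


smil_text = ET.tostring(build_smil("conversation.wav", UTTERANCES), encoding="unicode")
print(smil_text)
```

Each `par` block ties one stretch of audio to its transcript, so the time stamps produced during transcription are the only information needed to keep sound and text synchronized.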

Thank you!
