From Database to Publication: Tools for Typesetting a Three-Language Dictionary
Dennis Walters
SIL International 2009
SIL Forum for Language Fieldwork 2009-003, December 2009
© Dennis Walters and SIL International. All rights reserved.
Abstract
This document describes a process for producing a three-language dictionary using a specific set of software tools. The solutions described herein are things that many linguists will be able to do themselves, although some steps may require help from a software specialist. The lexical data included roughly 11,000 entries, with English, International Phonetic Alphabet (IPA), Chinese characters, romanized Chinese characters, Nuosu Yi characters, and romanized Nuosu Yi characters. The data records were prepared using SIL FieldWorks Language Explorer (FLEx). The FLEx data were exported to a custom standard format (SFM) database, then manipulated with a Python script to produce an intermediate SFM database. The intermediate database served as input to “Shlex,” a Perl program that produces a formatted dictionary and reversed finder lists. These documents then became subdocuments in an OpenOffice master document and were combined with title page, cataloging-in-publication data, preface, introduction, character indexes, etc., into a single master document. After adjusting the style details for paragraph, character, page, list, and outline, and generating indexes, the master was exported as a .pdf document which could be used to make photo plates for printing. This document will be of interest to working linguists and project supervisors who want to have hands-on control of their data throughout the process of publication.
Overview
Given: Electronic lexical database of roughly 11,000 well-structured entries, containing data in English, IPA, Chinese characters, romanized Chinese, Nuosu Yi characters, and romanized Nuosu Yi characters.
Produce: Dictionary-type reference book, with reversed finder lists, ready for print publication. Solution needs to be reproducible or refinable for future use with comparable data.
This document describes and evaluates the tools and steps I used to solve this problem.
The lexical data were prepared using SIL FieldWorks Language Explorer (FLEx). [1] The FLEx data were exported to a custom Standard Format Markers (SFM) [2] database, then manipulated with a script written in Python to produce an intermediate SFM database. The intermediate database served as input to “Shlex,” a Perl program that produces a formatted dictionary and reversed finder lists. The Shlex dictionary and reversed finder lists are OpenOffice (.odt) documents, with all paragraph and character styles applied. These documents then became subdocuments in an OpenOffice master document. The remaining parts of the dictionary (title page, cataloging-in-publication data, preface, introduction, character indexes, etc.) were produced as separate OpenOffice documents and later incorporated into the single master document. Aside from editing the lexical data in FLEx, most of the work was learning to use OpenOffice to adjust style details for paragraph, character, page, list, and outline; set up page headers with alternating headings and page numbering; and generate each of several different indexes. The final step was to export the OpenOffice master as a .pdf document, which could be used to make photo plates for printing.
[1] “FieldWorks Language Explorer is the lexical and text tools component of SIL FieldWorks. It is an open source desktop application designed to help field linguists perform many common tasks.” http://www.sil.org/computing/fieldworks/flex/
[2] The SFM is a simple scheme for structuring data in a text document. A backslash [\] followed by a letter or series of letters at the beginning of a line serves as a field name, which is then followed by a space and data, e.g., \lx ꀊꃚ specifies a lexeme in our Nuosu Yi database.
Figure 1. Screenshots of FLEx lexical entry and SFM lexical entry
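To make the SFM layout concrete, here is a minimal Python sketch of reading such a file into records. It is an illustration only, not code from the project: the function name parse_sfm and the choice to represent a record as a list of (marker, value) pairs are assumptions, and the only thing taken from the source is the format itself (a backslash and field name at the start of a line, then a space and the data, with \lx opening each entry).

```python
import re

# A field is a backslash, a marker name, then optional data on the same line.
FIELD_RE = re.compile(r"^\\(\w+)(?:\s+(.*))?$")


def parse_sfm(text, record_marker="lx"):
    """Split SFM text into records; each record is a list of (marker, value)."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        match = FIELD_RE.match(stripped)
        if match is None:
            # Not a field line: treat non-blank text as a continuation of
            # the previous field (an assumption about wrapped data).
            if records and stripped:
                marker, value = records[-1][-1]
                records[-1][-1] = (marker, (value + " " + stripped).strip())
            continue
        marker = match.group(1)
        value = (match.group(2) or "").strip()
        if marker == record_marker or not records:
            records.append([])  # a new record starts at each \lx
        records[-1].append((marker, value))
    return records
```

Keeping fields as an ordered list rather than a dictionary preserves repeated markers and field order, both of which matter when the data are later arranged for typesetting.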
Skills and tools
To prepare this publication, a personal computer and a number of different software tools were used. Those tools were built and refined by a substantial team of software developers, most of whom have special expertise in computer processing for natural language data. The goal was to meet publication needs in a way that would develop working tools for many other working linguists.
The solutions described here are, for the most part, accessible to a regular PC user. Several of the software tools don’t have a graphical interface; a command has to be typed instead. Some potential users will find this daunting, although (in my view) the command line tools are no more difficult to use than some of the deeper features of a program like Microsoft Word.
Preparing the data in FLEx
The final export used for publication was done with FieldWorks 5.2, released in March 2008. The PostProcess and Shlex software tools can operate on any SFM database with some adjustments to accommodate the particular data model and database. Well-structured Toolbox [3] data would readily feed into the typesetting process. [4]
[3] “Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data.” See http://www.sil.org/computing/catalog/show_software.asp?id=79
[4] The WeSay team is developing a software tool, “SOLID,” designed to help a consultant repair inconsistencies in SFM lexical data. See http://projects.mseag.org/solid/wiki/Overview and http://wesay.org/wiki/SOLID.
The software tools used for export, postprocess, and typesetting each assume certain things about the data they will operate on. As a result, a number of early decisions were made about how we would record lexical data in FLEx. In practice, both the typesetting software and data entry practice were adjusted over a period of many months, until the processed data were just what we needed. Once the decisions were made, the database was thoroughly checked for consistent application of those decisions. The consistency checking was done by a combination of human and mechanical inspections. In other words, after entering and revising the data in FLEx, we still had to fix errors that showed up during export and postprocessing.
The data were run through export, postprocessing, and typesetting several dozen times over a period of more than a year. Editing a lexical data record is always a package of changes to an entire record, and there are opportunities for introducing new errors into the data in several different fields. While it would have been possible to use a manual Find/Replace process to check for specific errors, this would have been tedious and inefficient. Rather, we chose to do a fairly comprehensive set of automated checks on the data each time it was exported and prepared for typesetting.
For example, correct sorting of the Chinese reversal index required a custom field in FLEx that would maintain a correspondence between each Chinese character in a reversal entry and its pinyin romanization. (The romanized form for many Chinese characters depends on meaning in context.) Chinese character-reversal entries were listed in a single field (\revChn) in FLEx and separated by single-width ANSI semicolons, as were the corresponding romanizations in \revChP.
\revChn 桶; 朵; 串
\revChP tong3; duo3; chuan4
Often, full-width Chinese semicolons were used by mistake to separate items. The postprocess script incorporated a check for pinyin mismatches and recorded them to an output file. This check also caught the mistaken Chinese semicolons. Corrections were then made by hand.
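A check of this kind can be sketched in a few lines of Python. What follows is an assumption about how such a check might work, not the code in the postprocess script: the function name check_reversal and the wording of the messages are invented for illustration. It compares the number of semicolon-separated items in the two fields, which is also why a full-width semicolon shows up as a mismatch, and it reports the full-width semicolon explicitly as an extra convenience.

```python
FULLWIDTH_SEMICOLON = "\uff1b"  # the full-width semicolon typed by mistake


def check_reversal(rev_chn, rev_chp):
    """Return a list of problems found in one \\revChn / \\revChP pair."""
    problems = []
    for name, value in (("revChn", rev_chn), ("revChP", rev_chp)):
        if FULLWIDTH_SEMICOLON in value:
            problems.append(name + ": full-width semicolon used as separator")
    # Only the single-width semicolon counts as a separator.
    items = [s.strip() for s in rev_chn.split(";") if s.strip()]
    romanizations = [s.strip() for s in rev_chp.split(";") if s.strip()]
    if len(items) != len(romanizations):
        problems.append(
            "item count mismatch: %d reversal items, %d romanizations"
            % (len(items), len(romanizations))
        )
    return problems
```

With the example fields above, the check returns an empty list; if the first semicolon in \revChn is typed as a full-width character, it reports both the stray character and the resulting count mismatch.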
Another early decision was to use the Exclude As Headword (EAH) field to arbitrarily exclude certain FLEx entries from the publication. For publication purposes, a researcher might want to exclude words that are rarely used, taboo, borrowed, non-standard, etc. In an ideal system, we might have used a combination of filter criteria in FLEx to export only the records we wanted to publish. A filtered export was not supported in that version of FLEx. [5] Nevertheless, the EAH check box was simple to use and served our purpose. The EAH value of each record was exported to SFM as \eah 0 or \eah 1. Actual removal of tagged entries was done by the postprocess script.
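The removal step itself is simple once the data are in record form. The sketch below assumes each record is a list of (marker, value) pairs, which is a representation chosen for this illustration rather than anything the postprocess script is known to use; the function name drop_excluded is likewise hypothetical.

```python
def drop_excluded(records):
    """Keep only records that are not flagged with \\eah 1."""
    kept = []
    for record in records:
        flags = [value for marker, value in record if marker == "eah"]
        if "1" in flags:
            continue  # flagged Exclude As Headword: leave out of the publication
        kept.append(record)
    return kept
```

A record with no \eah field at all is kept, on the assumption that only an explicit flag should remove an entry.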
To export our lexical data from FLEx, we used a custom export definition, export.xml. This made it possible to export the custom fields used for Chinese reversal entries, as well as the EAH field. The custom export required that every entry include data in the Lexeme field in all three writing systems: Yi characters, Yi Pinyin, and Yi IPA. Thus, a successful export gave a partial check on the validity of this part of each lexical record. [6] (Each writing system is equivalent to a separate field in Toolbox.)
[5] Filtered export to XHTML format is supported in the current development version of FLEx.
[6] Alternatively, the custom export could have allowed missing data in these fields. Then the postprocess script could have flagged missing or empty pinyin and IPA fields.
Typing lexical data is error-prone and time-intensive, so we avoided manual data entry, especially when a data field’s contents could be generated from existing data. The Bulk Edit feature in FLEx is designed for this purpose. To get correct Yi Pinyin for each entry, we used Bulk Edit Entries, Process. There, we named a transducer “YWZ to YPY” and chose a TECkit [7] mapping that specified the correct Yi Pinyin for each Yi character. We chose Lexeme as the Source Field and Lexeme (Yi Pinyin) as the Target Field, then applied the process to generate Yi Pinyin in every entry according to its Yi characters. A similar process generated IPA transcriptions for each entry.
FLEx offers some flexibility for recording variant lexical items. [8] We decided to accept and work with some conventional labels that have been offered by Chinese and Nuosu Yi linguists. Those labels included such things as dialectal variants, conditioned variants, and free variants. For dictionary presentation, we chose not to distinguish these, but to label corresponding sets to show that “X has variant Y” and “Y is a variant of X”:
ꀊꃚ a fu [a³³fu³³] (var. ꀊꉻ a ho) adj. 粗(指圆柱形;条形) thick; fat; big around (of long, cylindrical things) ↔ꀁꃚ ix fu
ꀊꉻ a ho [a³³xo³³] (var. of ꀊꃚ a fu) adj. 粗 thick; fat; big around (of long, cylindrical things) ↔ꀁꉻ ix ho
In this case, ꀊꉻ a ho was recorded in FLEx with an Entry Type of “Dialectal Variant” and ꀊꃚ a fu as its associated Main Entry. FLEx did not give us a way to show variants as having equal status, for instance, as with true free variants. Rather, one variant must be recorded as the Main Entry and the other Variant linked to that Main Entry in the database. These relationships were exported to SFM as follows:
\lx ꀊꃚ
\lxYiP a fu
\va ꀊꉻ
\et Main entry

\lx ꀊꉻ
\lxYiP a ho
\mn ꀊꃚ
\et Dialectal variant
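The step from these exported fields to the printed labels “var.” and “var. of” can be illustrated with a short Python sketch. It is an assumption about one way to do it, not the way Shlex does it: records are taken to be lists of (marker, value) pairs, and variant_label and pinyin_by_lexeme are names invented here. The romanized form of the linked entry is looked up by lexeme, which would need refining for homographs.

```python
def variant_label(record, pinyin_by_lexeme):
    """Build the parenthesized variant label for one entry, or '' if none."""
    fields = dict(record)
    if "va" in fields:
        target, prefix = fields["va"], "var."      # X has variant Y
    elif "mn" in fields:
        target, prefix = fields["mn"], "var. of"   # Y is a variant of X
    else:
        return ""
    roman = pinyin_by_lexeme.get(target, "")
    return "(%s %s)" % (prefix, (target + " " + roman).strip())
```

In use, pinyin_by_lexeme would be built in a first pass over all records, mapping each \lx value to its \lxYiP value, so that either side of the relationship can name the other.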
Some Nuosu Yi words have several homographs. We wanted to publish only the best attested homographs in the dictionary. Sometimes this meant leaving two or three others out of the book, again excluding entries by setting the ExcludeAsHeadword flag in FLEx. This presented the problem that excluded homographs could leave gaps in the homograph number sequence. For instance, because homographs 2 and 3 were excluded, the dictionary would list homographs numbered 1, 4, and 5. To get around this, the affected homograph sequences were reordered by placing the excluded entries at the end of the sequence. [9] An IronPython script, homograph.py, was able to directly query the FLEx database. On first pass through the data, the script logged all occurrences of several different kinds of problems in the homograph sequences. The user could then examine the log and either exit the script or have the script correct the problems. This worked flawlessly and completely solved the homograph numbering problem.
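The reordering rule can be shown on plain data. The sketch below is not homograph.py and makes no use of the FLEx database; it only illustrates the rule described above, on the assumption that each homograph is represented as a dictionary with a number under "hm" and a flag under "excluded" (both key names are invented for this example).

```python
def renumber_homographs(entries):
    """Move excluded homographs to the end, then renumber from 1.

    entries: list of dicts with 'hm' (int) and 'excluded' (bool).
    Returns new dicts; the input list is left unchanged.
    """
    # False sorts before True, so published entries come first,
    # each group keeping its original relative order.
    ordered = sorted(entries, key=lambda e: (e["excluded"], e["hm"]))
    return [dict(entry, hm=position)
            for position, entry in enumerate(ordered, start=1)]
```

For five homographs with numbers 2 and 3 excluded, the published entries formerly numbered 1, 4, and 5 become 1, 2, and 3, and the excluded ones take 4 and 5.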
[7] TECkit is software for converting electronic data between different character encodings. See http://www.sil.org/computing/catalog/show_software.asp?id=77. The mapping file is simply a table that provides replacement values for a list of items, such as /yy/ for the Nuosu Yi character ꒉ.
[8] During the summer of 2007, the FLEX Discussion email list hosted a long and detailed discussion of the available and preferred options for modeling variant lexical forms in FLEx. Depending on how they are analyzed, variants can be recorded as allomorphs, inflectional or derivational variants, dialect variants, spelling variants, etc.
[9] This is a violent thing to do to the lexicon and would be difficult to reverse. At this stage of the process, I was using a version of the database which was dedicated to the publication task. I would use an earlier version of the data to continue language analysis, once the publication project was finished.
Proper names in the dictionary were tagged as such, but, for dictionary presentation, we were also expected to capitalize their romanized forms. I already mentioned that the Yi Pinyin forms were generated by a transducer. Adding capitalization then had to be done manually after generating the Yi Pinyin forms. We filtered the data in FLEx for all proper names and capitalized them. This had an unforeseen consequence. Some of the proper names occurred in the dictionary as the first item in a section, and Shlex used the capitalized forms in the section headings, spoiling the uniformity of the section headings. The fix was to identify all the syllables that had this problem (there were two) and use lower case in their FLEx entries, then capitalize those two entries manually in the dictionary document produced by Shlex. This could have been handled by Shlex if config.xml included a switch to check for capitalized section headings and force them to be output as lower case.
Custom export
Lexical data can be exported from FLEx in several different formats, such as MDF standard format [10] and LIFT [11]. These formats for exported data are each specified in XML configuration files located by default in C:\Program Files\SIL\FieldWorks\Language Explorer\Export Templates. For this dictionary, I used a custom export definition file, which produced an SFM file I will call data0.db.
The custom export differed from standard Multi-Dictionary Formatter (MDF) in several ways. It included a writing system abbreviation in the SFM marker for fields which contained data in multiple writing systems, such as \lxYiI and \lxYiP in addition to \lxYi. It included several fields that are not part of MDF: \et Entry Type; \va Has Variant; \mn Main Entry; \ps Part of Speech Abbreviated; \rev Custom Reversal Chinese; \ut Use Type; and \eah Exclude as Headword.
Late in the process, it was discovered that fields which did not have multiple writing systems would export their data in whatever primary analysis language FLEx was set for at the time of the export. For instance, if I were using Chinese as the primary analysis language, FLEx would display Chinese glosses ahead of English glosses, and FLEx would export field values in Chinese rather than in English. Thus, I might see either \et 方言变体 or \et Dialectal Variant in the exported data, depending on whether Chinese or English was the primary analysis language. For the most part, this contingency was handled by the PostProcess.py script, where either a Chinese or an English field value could be remapped to whatever we wanted to display in the publication.
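A remapping of this kind amounts to a lookup table keyed on both the Chinese and the English value. The Python sketch below is an illustration under assumptions: the table and function names are invented, and the target values are simply the English labels, since the display forms actually chosen for the publication are not all given here. Only the value pairs themselves (方言变体 / Dialectal Variant, 四字格 / Four syllable form) come from the exported data described in this document.

```python
# Keys are lower-cased so that "Dialectal Variant" and "Dialectal variant"
# are treated alike; lower-casing leaves Chinese text unchanged.
CANONICAL_ENTRY_TYPES = {
    "方言变体": "Dialectal Variant",
    "dialectal variant": "Dialectal Variant",
    "四字格": "Four syllable form",
    "four syllable form": "Four syllable form",
}


def normalize_entry_type(value):
    """Map a Chinese or English \\et value onto one canonical label."""
    key = value.strip()
    # Unknown values pass through unchanged so nothing is silently lost.
    return CANONICAL_ENTRY_TYPES.get(key.lower(), key)
```

Once every \et value has been normalized, later steps need to handle only one form of each label, whichever analysis language was active at export time.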
Postprocessing: PostProcess.py(data0.db)
After export, the SFM file data0.db was manipulated by a Python script, PostProcess.py. The script changes some of the database markup into appropriate punctuation, performs several mechanical checks, and adds sequence codes for sorting the Chinese – Yi Index. Specifically, PostProcess.py does the following:
• convert Entry Type values to symbols: "四字格" or "Four syllable form" into " ", and "Enhanced modifier" into " "
• convert Lexical Function values to symbols: "反义" or “Antonym” into "↔"
• check for punctuation suitable for each writing system
• check for valid simplified Chinese characters
• check for valid Pinyin for Chinese characters
• use Chinese character and Pinyin fields to generate a \sortChn field and sorting sequence for Chinese reversals
• check for mismatched brackets and parentheses in definition fields
• convert Pinyin with tone numbers to Pinyin with diacritics
• add inline styles and Pinyin to Yi characters embedded in definitions
• generate an error file flagging suspect or erroneous data, including enough information to find and correct the errors using FLEx.
[10] The MDF is a scheme for constructing a lexical database using a set of predefined fields, and for extracting and typesetting the data in a multilingual dictionary with or without reversal indexes. MDF is described in Coward, David F. and Charles E. Grimes. 1995. Making dictionaries: A guide to lexicography and the Multi-Dictionary Formatter. Waxhaw, North Carolina: Summer Institute of Linguistics.
[11] LIFT (Lexicon Interchange FormaT) is an XML format for storing lexicons/dictionaries. See more on the LIFT standard at http://code.google.com/p/lift-standard and http://www.wesay.org/wiki/LIFT.
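One item in the list above, converting Pinyin with tone numbers to Pinyin with diacritics, follows well-known placement rules and can be sketched briefly in Python. This is not the code in PostProcess.py; the names are invented for illustration. The sketch assumes the standard Hanyu Pinyin convention: the mark goes on a or e if present, on the o of ou, and otherwise on the last vowel, with tone 5 taken as the neutral tone and ü written as the letter ü.

```python
import re

TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}
SYLLABLE_RE = re.compile(r"([a-zü]+)([1-5])", re.IGNORECASE)


def _mark_syllable(match):
    letters, tone = match.group(1), int(match.group(2))
    if tone == 5:
        return letters  # neutral tone: drop the number, add no diacritic
    lower = letters.lower()
    if "a" in lower:
        index = lower.index("a")
    elif "e" in lower:
        index = lower.index("e")
    elif "ou" in lower:
        index = lower.index("o")
    else:
        vowel_positions = [i for i, ch in enumerate(lower) if ch in "iouü"]
        if not vowel_positions:
            return match.group(0)  # no vowel to carry the mark; leave as is
        index = vowel_positions[-1]
    marked = TONE_MARKS[lower[index]][tone - 1]
    if letters[index].isupper():
        marked = marked.upper()
    return letters[:index] + marked + letters[index + 1:]


def numbered_to_diacritic(text):
    """Replace every numbered syllable in text with its diacritic form."""
    return SYLLABLE_RE.sub(_mark_syllable, text)
```

Because the substitution works syllable by syllable, separators such as the semicolons in a \revChP field pass through untouched.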
For PostProcess.py to run successfully, every sense that contains Chinese character data in RevChn must have matching Hanyu Pinyin data in RevChP. If the mismatch is too extreme, PostProcess.py will fail. For minor mismatches, the error file will identify the error and the record so that corrections can be made.
The output of PostProcess.py is a UTF-8 text file in SFM markup. This file is called data1.db.
The syntax for postprocessing is: “PostProcess.py infile outfile > errorreport.txt”.
Each time PostProcess.py is run, I must check the errorreport.txt file. Ideally, any corrections should be made in the FLEx data, rather than in the exported data. Otherwise, the same errors will turn up in the next export. When the FLEx export has run cleanly and PostProcess.py runs without any unacceptable error messages, the resulting file (data1.db) is ready to be input to Shlex.
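As an example of the kind of message that ends up in the error report, the check for mismatched brackets and parentheses can be sketched with a simple stack. The Python below is an assumption about how such a check could be written, not an excerpt from PostProcess.py; the function name, the set of bracket pairs, and the message wording are all invented for illustration.

```python
# Closing bracket -> the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "）": "（", "】": "【"}
OPENERS = set(PAIRS.values())


def check_brackets(text):
    """Return a list of problems; an empty list means the brackets balance."""
    problems = []
    stack = []  # opening brackets seen so far, with their positions
    for position, char in enumerate(text):
        if char in OPENERS:
            stack.append((char, position))
        elif char in PAIRS:
            if stack and stack[-1][0] == PAIRS[char]:
                stack.pop()
            else:
                problems.append("unexpected %r at position %d" % (char, position))
    for char, position in stack:
        problems.append("unclosed %r at position %d" % (char, position))
    return problems
```

Reporting the position alongside the lexeme of the record gives enough information to find the field in FLEx and correct it there.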
On the IBM T40 computer, PostProcess.py takes about 40 seconds to run.
Typesetting Part I: Shlex(data1.db,config.xml)
The output of PostProcess.py (data1.db) serves as input to “Shlex”, which identifies related sets of data in a database and arranges them for dictionary presentation in open document text (.odt) or in TeX format. Shlex refers to config.xml, which specifies the data to extract, its output sequence, and punctuation. Config.xml also specifies style types and style names appropriate to each kind of data.
For this dictionary, Shlex generated three files: Yi – Chinese – English Dictionary, Chinese – Yi Index, and English – Yi Index. The special function of Shlex is to properly handle the structured data as found in data1.db, according to specifications recorded in config.xml for the final appearance of the dictionary.
To produce the main dictionary, config.xml tells Shlex that whenever it encounters a field labeled \lx, it must begin a new dictionary entry. This entry is tagged with the paragraph style DictionaryEntry. The contents of \lxYi, \lxYiP, and \lxYiI provide Yi characters, Yi Pinyin, and IPA, respectively, for the entry. Separate character styles are applied to the contents of each of these fields. Config.xml tells Shlex to place a square bracket before and after the contents of \lxYiI, so that the IPA transcription appears in square brackets. Shlex then identifies senses associated with the entry and extracts sense number, part of speech, and definition for each sense. In each case, config.xml specifies a character style for any data that needs to be displayed in a special typeface or size that is different from the underlying paragraph style.
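The idea of a layout specification that names a field, a character style, and surrounding punctuation can be modeled in a few lines of Python. This sketch does not reproduce the config.xml format or the behavior of Shlex, neither of which is shown in this document. Apart from DictionaryEntry and the \lxYi, \lxYiP, \lxYiI, and \ps markers, every name in it is hypothetical, including the character style names and the \de marker used for the definition.

```python
# Hypothetical layout rules: (marker, character style, text before, text after).
ENTRY_LAYOUT = [
    ("lxYi", "YiCharacters", "", " "),
    ("lxYiP", "YiPinyin", "", " "),
    ("lxYiI", "IPA", "[", "] "),   # IPA appears in square brackets
    ("ps", "PartOfSpeech", "", " "),
    ("de", "DefinitionEnglish", "", ""),
]


def render_entry(record, layout=ENTRY_LAYOUT):
    """Return (paragraph_style, runs); each run is (character_style, text)."""
    fields = dict(record)
    runs = []
    for marker, style, before, after in layout:
        value = fields.get(marker)
        if value:  # fields missing from the record are simply skipped
            runs.append((style, before + value + after))
    return "DictionaryEntry", runs
```

Keeping the rules in data rather than in code mirrors the division of labor described above: the appearance of the dictionary can change without touching the program that walks the records.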
For the Chinese Finder List, config.xml tells Shlex to begin a new entry whenever it finds a field labeled \revChn. This dictionary entry is tagged with the paragraph style ReversalEntryChinese. Config.xml specifies that, within that same entry, Shlex is to extract the contents of \lxYi and \lxYiP and print them with appropriate character style tags. The English Finder List follows a similar pattern.
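Inverting the records into a finder list can be sketched as follows. The Python is illustrative only and is not how Shlex builds the list: records are assumed to be lists of (marker, value) pairs, the function and key names are invented, and entries are ordered by plain string comparison of the numbered pinyin, which is a stand-in for the \sortChn sequence whose details are not given here.

```python
def build_finder_list(records):
    """Turn \\revChn / \\revChP fields into sorted Chinese finder entries."""
    entries = []
    for record in records:
        fields = dict(record)
        heads = [s.strip() for s in fields.get("revChn", "").split(";") if s.strip()]
        romans = [s.strip() for s in fields.get("revChP", "").split(";") if s.strip()]
        # Each Chinese item is paired with its romanization by position.
        for head, roman in zip(heads, romans):
            entries.append((roman, head, fields.get("lxYi", ""), fields.get("lxYiP", "")))
    entries.sort()  # assumption: order by romanization, then by character
    return [
        {"style": "ReversalEntryChinese", "head": head, "pinyin": roman,
         "yi": yi, "yi_pinyin": yi_pinyin}
        for roman, head, yi, yi_pinyin in entries
    ]
```

Pairing by position is what makes the item-count check between \revChn and \revChP so important: a missing romanization would silently shift every later pairing in the field.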
The syntax for Shlex is: shlex -c config.xml [-o outfile] [-s style_info] [-b backend] [-n] infile
(On my computers, Shlex took 5-10 minutes to generate all three .odt files.)
Typesetting Part II: Using OpenOffice
Shlex output files included the Main Dictionary, Chinese-Yi Index, and English-Yi Index.
These three files were copied to a working directory where the OpenOffice master document would refer to them. A master document is a container for a collection of documents which need to be published together. The master document has its own unified style sheet, which includes paragraph styles, character styles, page styles, numbering styles, and list styles. When viewed as part of a master document, the content of each subdocument will reflect the attributes of the master document’s style sheet. For instance, a paragraph in a subdocument may use a style called “Text Body - First Indent,” which uses the font Lucida Console 10 pt. However, the style “Text Body - First Indent” in the master document has its own attributes, such as Times New Roman 12 pt, independent of the subdocument. Every paragraph that has the style “Text Body - First Indent” applied in any subdocument will look the same when displayed in the master document.
A master document can also contain text, indexes, and tables apart from subdocuments. Our dictionary included several indexes which OpenOffice Writer automatically generated. There was a Chinese Table of Contents, an English Table of Contents, an Index of Yi character dictionary entries, and an ABC Index of Yi characters. Besides these, two of the subdocuments contained indexes that were automatically generated by OOo Writer. The open-document format used in OOo Writer made it possible to build custom indexes. First, a sample index was made that listed Yi characters in a table with corresponding page numbers. By following the pattern of the XML code behind the sample, it was possible to automatically construct the entire index of more than 800 entries. After that, OOo Writer could generate the correct index data any time by using the Update Index feature.
One significant adjustment made to the files produced by Shlex was to remove the section breaks which were automatically generated for each “letter” in the dictionary. In this dictionary, the “letters” were actually characters, and there were more than 800 of them. Processing this many sections slowed OpenOffice down to a crawl when it updated the entire master document. The solution was to open each document that came out of Shlex, select the entire text, and use the Format, Sections menu item to delete all the sections listed in the dialog box. Then, by selecting the entire text again and using the same command, a single section was created in each document. Finally, this one section was formatted with the desired column properties.
Working in the OpenOffice master document, I learned that page styles needed to be attached to the text of the master document. (Page styles from subdocuments did not retain their properties when displayed as part of the master document.) To make this happen, I typed the titles for each major section directly into the master document, leaving these titles out of the subdocuments. That way, a page style for each of those sections of the master document could be specified, making it possible to have unique headers for each section.
The best thing about this scheme of typesetting is that Shlex applies all the relevant styles to the lexical data as it builds the open document files. This means that the human typesetter’s main work is to manipulate the styles in the master document so that each kind of data in the dictionary and indexes looks right, behaves right, and works with the other parts of the document. There is little need for the typesetter to manually apply or remove styles from the dictionary data.
On the other hand, preparing the front and back matter required both composition of the content and application of appropriate styles. This, however, is simple compared to the work of applying styles to each type of data in the main body of a dictionary.
The Preface and User's Guide each have a Chinese version and an English version, resulting in two Chinese documents and two English documents. I could have used a single “Title” paragraph style for all these documents. Paragraph styles have an option for specifying an Asian font separately from a Western font, so that was not a problem, but this choice of paragraph styles also had implications for the construction of tables and indexes.
To automatically generate a table of contents, one of the simplest ways was to specify the paragraph style(s) to pick out from the document. If all the titles were to be displayed in a single, two-language table of contents, then it would work well to use the same title style for all four documents. However, it seemed more consistent within the book's design to make a separate table of contents for each language. For this purpose, it worked better to use separate title styles for Chinese and English titles.
The main dictionary and the Chinese and English indexes are each a single document, but they have titles in both Chinese and English. Separating “Title – Chinese” and “Title – English” paragraph styles facilitated including each of these in its appropriate table of contents.
Using separate title styles for Chinese and English titles had one inconvenient result. The outline structure of the document could not be maintained because only one paragraph style can be assigned to a given outline level. I needed to be able to assign two different language styles to the same outline level. The inconvenience surfaced in two areas. The navigation pane of OpenOffice Writer could not show all the relevant section titles, so the navigation feature could not be used to jump directly to the English subdocuments at the beginning of the book. The other inconvenient spot was in the export to .pdf. In a document with a consistent outline structure, all the section headings, at all the levels, can export as bookmarks in the .pdf file. This would have been convenient to have in the .pdf copy of this book, but, since the outline structure was flawed in the master document, many of the bookmarks did not appear in the exported .pdf file.
Setting up alternating headers for dictionary pages was done via page styles. Each major dictionary section had two different page styles; e.g., MainDictionaryFirstPage and MainDictionaryPage. In the final layout, MainDictionaryFirstPage had no header at all. MainDictionaryPage had both left and right pages, which were different from each other. The left page header included the Chinese title, “词 汇 正 文” centered, and the page number at the left edge. The right page header included the English title, “Main Dictionary,” centered, and the page number at the right edge. Both left and right pages included a borderline running beneath the header across the full page width.
One feature commonly used in English dictionaries is that of showing leading or trailing dictionary entries from a given page in that page's header to facilitate the lookup process. If desired, that feature probably could have been made to work, but after seeing a sample of it, I decided it was not helpful for this particular book. Rather, I wanted to use a variation of it by listing all the characters that begin a section on a given page in that page's header. For instance, if a certain page had sections for ꉁ, ꉂ, ꉃ, ꉄ, ꉅ, and ꉆ, these characters would all be listed in the page header. Many Chinese dictionaries use this as a standard feature in their page headers. Unfortunately, it could not be implemented in OpenOffice in time for this project.
In summary, I found that OpenOffice provided all the features I really needed in order to typeset this somewhat complex book. The lack of a consistent outline structure and the limitation in constructing page headers were inconveniences, but not “show stoppers.”
Conclusion
In principle, it is a straightforward task to extract well-structured data and typeset it for publication. However, for a number of reasons, the actual process can include significant challenges. In this case, the data are natural language data recorded in several different writing systems, two of them non-Roman. In the first place, designing a database structure to capture the subtleties of natural language data is notoriously challenging. Designing the same database so that the data are readily accessible for typesetting makes the task even more difficult.
In the past, some software tools were available. Using those, I could collect my data and manipulate it in its romanized form. At that stage, MDF gave me a nice-looking dictionary printout, but I could not make it work with non-Roman data. Later I was able to include Chinese characters and then to generate Yi characters from Yi Pinyin data, but I still had difficulty getting the SFM data out in a publishable format. Software tools available within the past five years have carried me over the last few hurdles to making this data not only available but, I think, even nice-looking. I hope others will be able to adapt the process and the tools to their needs and the needs of other language communities.
From my perspective, the resulting publication is very satisfying. I was able to control many details of what data were published and how they appear on the page, and, because they were not retyped, the data in the publication reflect my database exactly. I was not able to control everything. Some of the problems were solved simply by finding the right programming technique, while other problems remain due to inherent limitations in the data, in the design of the database, or in the software tools.
Acknowledgments
The compilation and publication of the Nuosu Yi - Chinese - English Dictionary is a cooperative project between the East Asia Group of SIL International and the Southwest University for Nationalities Southwest Institute for Ethnology (西南民族大学西南民族研究院). The print version was published in 2008 by Nationalities Publishing House (民族出版社), Beijing. The language data were compiled and edited by Ma Linying, Susan Gary Walters, and myself. The FieldWorks software suite was developed by the Language Software Development department of SIL International. Beginning in 2004, the FieldWorks team worked closely with me to provide features for entering, displaying, manipulating, and exporting the Nuosu Yi language data. Ken Zook facilitated importing the Nuosu Yi language data. Steve McConnel and Victor Roetman (SIL East Asia Group) prepared the custom export specification. Victor Roetman prepared and modified the Homograph.py script, the PostProcess.py script, and the Config.xml specification for use with Shlex. Victor Roetman also built the XML files for the 800+ entry indexes I needed, based on a couple of small sample indexes. Martin Hosken (SIL Non-Roman Scripts Initiative) wrote Shlex and also gave input to the Config.xml specification. Jeff Green, Eric Jackson, Susan Walters, and Ken Zook read and gave feedback on earlier versions of this publication. The efficiency and satisfying results achieved so far in preparing the Nuosu Yi - Chinese - English Dictionary for publication would not have been possible without all this good support. On the other hand, I am responsible for customizations to the scripts, use of the software tools, as well as the explanations provided in this document.
Appendix – Sample dictionary pages
Chinese Table of Contents
English Table of Contents
Yi Character Index First Page
Radical Stroke Index Second Page
ABC Yi Pinyin Index First Page
Main Glossary Page
Chinese - Yi Index First Page
English - Yi Index First Page
Yi Character Table Index First Page
Appendix – Scripts and Files Referenced
If you want to see any of the files or scripts referenced here, please send email to dennis_w[email protected]. These items will be available as samples on request from the authors or from Dennis Walters under reasonable licensing provisions.
config.xml        Used with Shlex. Defines styles and data to be extracted and typeset.
data0.db          Output of FLEx export using export.xml.
data1.db          Output of PostProcess.py, input to Shlex.
errorreport.txt   Output of PostProcess.py flagging errors, such as mismatched parentheses.
export.xml        Custom export specification for extracting specific lexical data from FLEx to an SFM data file.
FLEx              FieldWorks Language Explorer, the language data component of SIL FieldWorks.
homograph.py      IronPython script for finding and correcting errors in homograph sequences in FLEx lexical data.
PostProcess.py    Python script to process SFM data exported from FLEx.
Shlex             Perl script for typesetting structured data.
References
Coward, David F. and Charles E. Grimes. 1995. Making dictionaries: A guide to lexicography and the Multi-Dictionary Formatter. Waxhaw, North Carolina: Summer Institute of Linguistics.