DECEMBER 1989VOL. 12 NO. 4

a quarterlybulletin of the IEEE ComputerSociety

technicalcommittee on Data Engineering

CONTENTS

Letterfrom the Issue Editor 1 R. King Problems and Peculiarities of ArabicDatabases 2 A.S. Elmaghraby, S. El—Shihaby, and S. El—Kassas

Bayan: A Text DatabaseManagementSystemWhichSupports a Full Representation of the ArabicLanguage 12 A. Morfeq, and R. King

Responsa: A Full—TextRetrievalSystem with LinguisticProcessing for a 65—MillionWordCorpus of JewishHeritage in Hebrew 22 V. Choueka IconicCommunication: ImageRealism and Meaning 32 F-S Liu, and J-W Ta!

Overview of Japanese Kanji and KoreanHangulSupport on the ADABAS/NATURALSystem 42 V. ishi!, and H. Nishimoto Call for Papers 52

SPECIALISSUE ON NON-ENGLISHINTERFACES TO DATABASES

~ IEEECOMPUTERSOCIETY 4 -~

IEEE Editor-in—Chief, Data Engineering Chairperson, TC Dr. Won Kim Prof. LarryKerschberg MCC Dept. of informationSystems and SystemsEngineering 3500West BaiconesCenterDrive GeorgeMasonUniversity Austin, TX 78759 4400UniversityDrive (512) 338—3439 Fairfax, VA 22030 (703) 323—4354 AssociateEditors Vice Chairperson, TC Prof. Dma Bitton Prof. Stefano Coil Dept. of EiectricaiEngineering Dipartlmento di Matematica and ComputerScience Universita’ di Modena University of iiiinois Via Campi 213 Chicago, iL 60680 41100Modena, Italy (312) 413—2296

Prof. MichaeiCarey Secretary, TC ComputerSciencesDepartment Prof. Don Potter University of Wisconsin Dept. of ComputerScience Madison, WI 53706 University of Georgia (608) 262—2252 Athens, GA 30602 (404) 542—0361 Prof. Roger King Past Chairperson, TC Department of ComputerScience Prof. Sushli,Jajodia campus box 430 Dept. of informationSystems and SystemsEngineering University of Colorado GeorgeMasonUniversity Boulder, CO 80309 4400University Drive (303) 492—7398 Fairfax, VA 22030 (703) 764—6192 Prof. Z. MoralOzsoyogiu Department of ComputerEngineering and Science CaseWesternReserveUniversity Cieveiand, Ohio 44106 (216) 368—2818

Dr. Sunil Sarin Distribution XeroxAdvancedInformationTechnology Ms. Lori Rottenberg 4 CambridgeCenter IEEE ComputerSociety Cambridge, MA 02142 1730 Massachusetts Ave. (617) 492—8860 Washington, D.C. 20036—1903 (202) 371—1012

DataEngineeringBuiletin is a quarteriypubiicatlon of Membership in the Data EngineeringTechnicalCommittee the IEEE ComputerSocietyTechnicalCommittee on Data is open to individuals who demonstratewillingness to its of interestincludes: data structures in Engineering . scope activelyparticipate the variousactivities of the TC. A and models,accessstrategies,accesscontroltechniques, member of the IEEE ComputerSociety may Join the TC as a databasearchitecture,databasemachines,inteiligentfront full member. A non—member of the ComputerSociety may ends, massstorage for very largedatabases,distributed Join as a participatingmember, with approvalfrom at least databasesystems and techniques,databasesoftwaredesign one officer of the TC. Both full members and participating and Implementation,databaseutilities databasesecurity members of the TC are entitled to receive the quarterly and relatedareas. bulietin of the TC free of charge, until furthernotice.

Contribution to the Bulletin is herebysolicited. News items, letters,technicalpapers, book reviews,meetingpreviews, summaries, casestudies, etc., should be sent to the Editor. All letters to the Editor wiii be considered for publication unlessaccompanied by a request to the contrary. Technical papers are unrefereed.

OpinIonsexpressed in contributions are those of the mdi viduaiauthorrather than the officlaiposition of the TC on Data Engineering, the IEEE ComputerSociety, or orga nizations with which the author may be affiiiated. Letterfrom the Editor

Severalmonths ago, Won Kimasked me to put together the December 1989issue. Ini tially, we agreed on graphicalinterfaces to databases as the topic. As I beganlooking into the literature and trying to figurewhatpapers to invite, I noticed that there wasvery little out therethat had to do withnon-Englishinterfaces to databases. So I set out to find a broadrange of papers, all dealingwithinterfaceproblemsthatAmerican andEuropean researcherswouldfindunusual.

Theindividualswhocontributed to this issuewereasked to put in a lot of effort in a very shortamount of time. A few of them haddifficultywriting in English. Therewere also communicationproblems for the authorslocatedoutside the USA. All of the authors deserve a verywarmthanks.

There are two papersdealing withArabic. The firstpaper, by Elmaghraby,El-Shihaby, and El-Kassas, gives an overview of why Arabicpresentsunusualproblems for the developer of interfaces, and whyEnglish-basedsoftware and hardwaremay not be easily adapted to handleArabic. The secondpaperrepresents the Ph.D.thesiswork of Au Mor feq and describes an effort to develop text databasemanagementtechniquesspecifically suited for .

The thirdpaper also deals with textual data. It is by YaacovChoueka and describes a very aggressiveeffort at developing a database of Hebrew texts. This effort has pro ducedresearchcontributionsthat are of use to developers of textretrievalsystems in gen eral, notjust to developers of Hebrew-basedsystems.

Thefourthpaper is by FangShengLiu and Ju Wei Tai, andconcerns the development of graphicalinterfaces. It suggests that a useful way of constructing and categorizing graphicaltokens can be based on Chinesecharactertheory. Oneinterestingaspect of this paper is that it is not dedicated to Chinese-basedinterfaces.

The fifth paper (by Yoshioki Ishii and HidekiNishimoto)describes the support of Japanese and Koreandatabaseinterfaces. It gives an overview of why the Japanese and Koreanlanguagespresentnovelproblems, and then describes the approach they have taken in developingtheirsystem.

I hopethat the readers of thisissuefindthesepapers to be unusual andinteresting.

RogerKing December, 1989

1. Problems and Peculiarities of ArabicDatabases

A. S. Eliuaghraby1, S. E1-Shihaby2, and S. E1—Kassas3

ABSTRACT

This paper presents a brief overview of Arabiclanguagedatabases, their problems, solutions, and directions for the future. Experience is based on implementation of an Arabizationscheme and severaldatabaseapplicationsusingstandardDBMSpackages and programminglanguages.Most of theseproblems are due to the fact that Arabization is not integrated in the design of the hardware and systemprograms.

1. INTRODUCTION

The need for ArabicDatabases has beenrecognizedsince the introduction of computers in the Middle East. Modernelectroniccomputers were first introduced to Egypt in the 1950’s starting by the IBM model 1620. Currently all types of computers exist, and are Arabized, with the exception of some high end models.

Arabizationschemes for computers are key to understandingproblems of Arabic Databases. It is alsobeneficial to briefly get acquaintedwithsomepeculiarities of the Arabic language.

The Arabiclanguage is written from right to left using the Arabiccharacter set with connectedscript.Twentyeightcharacterscompose the Arabic set, howevereach character has multipleshapes.Displaying an Arabicword requiresknowledge of shapeselection of each letter based on the letterspositions in the word.

The lack of endogenouscomputerdesigns in the Arabworld have forcedArabization to follow a modification and patchworkapproach.PeripheralArabization has been the most popularapproach in the earlyyears and is used for mainframe and minicomputerswhile microcomputerArabization has been approached with a variety of schemes.

Thispaperpresentssome of the problemsfacingDatabasedesign and Databaseinterfaces for Arabicapplications.Some of the necessarybackgroundrelated to Arabiclanguage and Arabizationschemes is also included. One major issue in Arabiclanguage that is briefly

presentedwithoutdetail in this paper is diacritics(al-tashkeel) - which has been dropped

in mostmodernArabicpublications -- diacritics are mainlyused to improvepronunciation.

1 EMA~SDept.,University of Louisville,Kentucky, USA40292. 2 University of Alexandria,Alexandria,EGYPT. EB Dept.,EindhovenUniversity of Technology, the Netherlands.

2. 2. PROBLEMS OF ARABICDATABASES

2.1. Peculiarities of Arabic and Arabization

Differences amongnaturallanguages have caused performancedegradation in many computerenvironments,includingdatabases.However,softwaredesigners,most of the time, includeconsiderations for the differences in Latin-basedlanguages. They do constructlayers to allow the update of the character set -- IBM’sMVS OS allows thatwithin the whole OS interface. This is convenient when applications do not use “clever” techniques and communicatedirectlywithsystemperipherals,such as in the case of Focus on IBMsystems.

When the linguisticdifferencesinclude more than minorvariations in the character set, such as in Arabization, serious problems occur. Operating system designers did not anticipate,whendesigningtheirsupervisorcallshandlingterminals, thatentering a character may alter the shape of an alreadydisplayedcharacter. In order to face these problems designers will assist in manoeuveringaround the systemcapabilities to reachtheseneeds.

In an attempt to build an Arabicdatabase, the designer is facedwith the obviousproblems of using a non-Englishcharacter set in addition to somepeculiarities that relate to Arabic language and Arabizationschemes.

2.1.1. Orientation

The Arabiclanguage is written in a right to left orientation.When the first IBMArabic terminals(XCOM,XCOM1, and XCOM2)wereintroducedtheyallowedwritingcharacters in both orientations;left-to-right (R-L), and right-to-left (L-R). However,because of technicalproblems, not related to Arabiclanguagefeatures,lettersenteredthroughthese terminalswereaccepted in the way theyappear on the VDUscreen. So, in L-R orientation all Arabicstringswerestored in an orderwhich is the reverse of theirentryorder, on the otherhand all Latincharacterswere also reverselyordered if the L-R orientation was used. The Arabic user was forced to switchbetween L-R and R-L orientations. L-R orientation is essential for databasequeries and operatingsystemcommands and R-L orientation has to be initiatedbeforeArabicstrings are entered. The orientationswitching also resulted in a screen reversal of strings entered previously in the other orientation. Besides the inconvenience, this concepthistorically lead to a serious bad practice (also IBM still insists on this mechanism in their new XBASICterminals). The Arabic user still insists on having selectableorientationrather than beingsatisfiedwith one bilingualorientationmechanism. Thus, although other naturallanguages are also defined on a L-R orientation,Arabic computerusers have the addeddifficulty of consideringboth.

2.1.2. BilingualIssues

Perhapsbecause of colonialinfluence in almost all Arabicspeakingnations, the Arabic user will neverrenounce the use of at least one Latinlanguage. The Egyptianscientist will use sometechnicalterms in Englishwhile the Egyptian artist and humanist will use some

3. French.Thisindeed is not verypeculiar.Remember the use of French in the Russiannovel “War and Peace” of Tolstoy(1869). Howeverwhen it comes up with two languages each having its orientation, the problems are moreserious than adding a few charactersfrom a differentcharacter set.

Problems in Databaseinterfacesallowingbilingual entry can be explained by a simple example.Assumeupper case lettersdenoteEnglish and lower case lettersdenoteArabic characters.Then if a bilingual field is entered in the followingsequence“ABCxyzDEF’, a personreadingArabic will expect it’s display as “ABCzyxDEF’. If we are interested in sorting this fieldbased on the first four characters,should we view this string as “ABCx” or as “ABCz” ? Storing the databased on it’s displayformatmeansconsidering“ABCz”! Such problems can (and have been) solved using un-shapingoperations on such fields before performing any Database operations such as search or sort. This howeverintroducesextra performancepenalties and buildslanguagedependenciesdeeper in the Database.

2.1.3.AutomaticShapeDetermination

A charactermust be displayed in a formwhich is a function of the context of characters it appearswithin.Most of the languagesprior to typographyshared this feature. An “e” at the beginning of a wordwould be slightlydifferent from an “e” in the middle of a word. Howeverwithincreasedmechanization this feature was relaxed in Latinlanguages. Writing is considered a fundamental art in the Arabicculture, and is rooted in the predominantly Islamicnotionsrestrictingotherforms of art such as sculpture and painting.Arabicspeaking peoplehave strongfeelings for their language and are not expected to give up their art in favor of speeding up the implementation of new technologies.

Automatic shape determination is needed and can be delegated to the computer, or achieved by using an Arabictypewriter style keyboard with an extendedcharacter set. Essentially this featureleads to problems in storing even purelyArabicfields. If an Arabic field is entered as “abc” it should be displayed as “cba” and “a” should be displayed using its beginning of the wordshape, “b” uses it’s middle of the wordshape, and “c” uses end of wordshape. The sequence“aaa”itselfwill be displayedusingthreedifferentshapes for “a”. If we were to search for occurrences of a particularcharacter we shouldsearch for all three of its representations. One surprisingsimplification in Arabic that contrasts these complications is the lack of capitalletters.Figure 1 displays a sample of Arabicletters and theirmultipledisplayshapesbased on theirposition.

First . .J. ,~

Middle .-.d~ ~ Last LJ~ meem seen em MultipleShapes For Arabiccharacters FIGURE 1

4. 2.1.4. ConnectedScript

Similar to mostlanguages,handwriting is almostalwaysdoneusingconnectedscripts. For the same reasons that Arabic printed characters have not been modified to discard automaticshapedetermination they also have been required to supportconnectedscript. This is essentially a challengingproblem in somecases. Graphicadapters such EGA and VGA for personalcomputers do not allowcompletelyconnected text in their text modes.

2.1.5.Diacritics

Diacritics in Arabiclanguagepresent a majorproblem.They are not reallycharacters,they are charactermodifiers. If Arabicimplementationswouldconsidercharacter-diacritic pairs as the basis of an expandedcharacter set, this would lead to an extremelylargecharacter set. The possibilities of an Arabicimplementation using more than 256 charactersscared most of the Arabic languageimplementers(although other languages such as Kanji considered this problem). In a moreconceptualview the expanded set solution was not view appropriatebecause it does not reflect the nature of Arabicdiacritics. A database query searching for a character in Arabicshouldresult in all occurrences of the letterregardless of its diacriticmodifiers. An alternativeapproach to diacriticswhichseemslogical is to consider the symbols as normalArabiccharacters that are not stated in the alphabet. This is consistentwith the Latinview of diacntics as vowels. A true discussion and analysis of this problem is beyond the scope of the paper and requiresfurtherresearch.

2.1.6. Numbers and DecimalNotation

The aboveproblems are mainlyconcerned with textual data. Numerical data provides othertypes of problems.SomeArabizationmechanisms havenumeralcomputationalwhile others do not. Non-computationalnumerals are needed for manipulation of data. Furthermoremodern Arabic numerals are of Hindi origin and their positional value increases left to right in the samemanner as Englishnumeralswhichthemselves are the old Arabicnumerals. A numeral is consideredcomputational if it can be entered in response to a numericdataentrycommand.Computationalnumerals are standard in seven bit ASCII Arabization,while they are not necessarilystandard in eight bit ones (Sabri1986).

SomeArabcountries such as Tunisia haveswitchedback to the original(currentlyLatin) Arabicnumerals. In bothcaseswhenentering the mostsignificantdigit in an Arabiccontext it’s positionwouldfollowothercharacters and enteringadditionalnumerals will require left shift of previouslyentereddigits. In contrast,whenenteringArabicdigits in a Latincontext they are entered in sequence and in theirappropriatevisualposition.

The differences in the ContinentalDecimalNotation(CDN)betweenLatinlanguages are minor. In the US a period and a comma are used. A numberrepresented in the US as “1,001,000,555.20”must be written in French as “1.001.555,20”. The Arabicculture gave up its decimalnotation and at sometimestartedusing a Persiannotationwithdifferentsymbols for the decimalpoint and the thousandsdelimiters.

5. 3. ARABIZATIONSCHEMES

3.1. PeripheralArabization

Perhaps the most logical scheme for Arabization is based on peripheralArabization, however;

i. Most of the databasesignore the mechanisms of the peripheralArabization, so either they tend to inhibit non standard sets to be sent to the terminal, or they fail to trigger the right sequencerequired by the added Arabizationfunctions. So databases (and otherapplicationprograms) must be revised for theseterminals.

ii. Changes and updates are less convenientwith terminals. A change must support earlierproductsusingemulation if compatibility is not provided.

iii. Text handling in graphicmodes is generallyprovided by the application. In databasesoftware this is typical in the statisticallayersresponsible for charts and plots. Even when the operating system kernel supports such processing, most applications find these OS supervisorcalls at leastnon-efficient if not non-sufficient. With the increasedinterest in Hypertext the issue of coexistinggraphics and text is of increasedimportance.

In general an Arabizedperipheralshouldhandle the followingfunctions:

i. AutomaticShapeDetermination(ASD). ii. Orientation in L-R and R-L modes. iii. Representation of numerals.

SomedatabaseexperiencewithFOCUS on IBMmainframesusingperipheralArabization of the IBM XBASICterminaldemonstrates bad behavior.Arabicstrings are stored in reverse in somemodules and are not displayed from others. In additionspecialcharacters are incorrectlyinterpretedsincetheypossesEBCDICvalues that do not correspond to their Latin characters. And, as far as numbers are concerned, the modernArabicCDN is not supported.

The ARABICVT320terminal fromDigitalconnected to a VAXfunctionsappropriately using Digital’s RMS (RecordManagementSystem).However,output on that terminal is inconsistent with output on the CI600Q VAX printer which is Arabized by the same company.Inconsistency can be attributed to some technicalvariations and some release differences.

3.2. Hardware and Software

Since it is not alwayspossible to invent an Arabizedperipheral that will be respected by the DBMS, it is sometimes morepractical to createsoftwaresolutions.Given the DBMS

6. behavior~ as a finite set of algorithms, it is possible to createsoftwareSolutions(patches). Examples of suchsolutions are patches for some of the popular PC softwarepackagessuch as DBase III. Also, a similar patch for TurboPascalallowedwriting B+ tree code that allowedimplementation of Arabicdatabases.

A simplifiedversionexplainingsoftwareimplementation for AutomaticShaping on an Arabicstring (no lam alefs & kas chars) is given in Figure 2. This algorithm is presented in TurboPascalsyntax.Variables of the type CharTablecontain an entry for eachArabic character, and each entrycontains the differentshapes for that character.

procedureShapeArabicString( var Table : CharTable; var Atext : string); const

ALONE = 0; LAST = 1; FIRST = 2; MID = 3; var

Len, Pos, I : byte; Next, Prey : boolean; {are TRUE if the next/preychars are Arabic} begin Prey := FALSE; Pos := ALONE; Len := Length(Atext); for I := 1 to Len do begin Next := (I < Len) and IsArabicLetter(textSucc(I) ] ) and Not IsNumericChar(textSucc(I)] ); Pos := (Ord(Next) Shl 1) + Ord( Prey); Prey : = Not IsEndArabicLetter(textI] );

if IsArabicLetter(textI] ) then textI] := abletextI] Pos ]; end;

Figure 2 SimplifiedAlgorithm For Automatic shape Determination

A layeredarchitecture for Arabization can be used for Databasemanagementsystems and otherapplicationsoftware. Figure 3 depicts a possiblestructure for this architecture.

For example an application’srequest to handle the display of an Arabicstring can be view as a procedural call to the ArabicInterface. This is given as follows:

APPLICATION ARABICINTERFACE

Display(Str,SCREEN) --> Shape(Str) Orient(Str,ScreenOrientation)

7. Trans(Str,ScreenCharSet) ScreenDisp(Str)

Application (DB, ..etc)

Unshaped text & destination

ArabicInterface + + + I Editing I + + I Shaping / Orientation I + + Devicespecifictranslation I + + I Deviceinterfaces I + + +

+ +

÷ + + low level dev interface eg char gen, Keyboard def + + I Device 1 I + +

Figure 3 LayeredArchitecture For Arabization

3.3. Standards and PortabilityIssues

3.3.1.Approaches to StandardArabicCodes

• Arabic LanguagePeculiarities are compounded by the lack, or more accurately, the multiplicity of standards.Theproblems of StandardizedArabiccodes are due to the absence of a strategic plan for Arabization of informationrepresentation in general.

Two differentapproaches for definingstandardizedArabiccodes can be identified:

i. An academicapproach was initiated by proposing a definitionschemesimilar to the Latin ASCII and EBCDIC. The problems of this approach are associated with the mappingapproximately 30 characters in place of 26*2charactersrepresentingupperand

lower case Englishletters - note that the Arabiclanguagedoes not havecapitalletters. The mappingproblemoccurswhen the codesreplacecharacters that are needed by the

8. applicationsoftware(including DBMS systems). The ASMO 449 and the ARCh standardcodes replacealmost all the graphiccharacters. Most databasesystems use thesegraphiccharacters to create theirwindows!

ii. A commercialapproachwas based on insertingArabiccodes in place of elements of standard codesthought to be not widely used by applicationsoftware. This leads to a codepagewhich is aestheticallyveryunsatisfactory and logicallypeculiar.Arabiccodes in such schemes are sparsed in some code page. The number of standardsdefined to this method is according almostequal to the number of Arabizationlayers on PC’s. Microsoftintroduced an Arabic shell which itselfcontains four different code pages. One of Microsoft’s pages is ASMO 449 compatible, others are called transparent-- indicating that applications should run withoutchange, a goal that is not yet fully achieved. In the PC applicationsAPTECArabization is a popularstandard.

Figure 4 depicts two seven bit standards and samplesfrom the eight bit APTECwhich is a de-factostandard.

3.3.2. Portability

3.3.2.1.DifferencesbetweenCode-Pages

The maindifferencesbetween one code-page(C-P) and another is mainly in one or more of the following: i. Codes of the Arabiccharacters.

ii. Order of Arabiccharacters, for examplesome C-P considers the letter“wawhamza” to follow“alef’whileothersconsider it to follow“waw”. iii. The initial definition of the Arabiccharacter set; someconsider“lam-alef’ as on character whileothers treat it as two collatingones. iv. The set of defineddiacritics.

v. The definition of digits and specialcharacters;some C-P redefinescodes for numerals and others use a single set for both Arabic and Latin.

Arabic code-pages are eitherseven or eight bit pages.Seven bit C-P has Arabicletters defined in the code same area of the Latin character code page. It might seem that this allows most Latin orientedapplications to work correctlywithout major modifications, however: i. Since the number of Arabic characters is slightlygreater than the Latin set. Both upper and lower case codes will be needed. Thus applications that make case translations and those that are caseinsensitive will createproblems.Also, the lack of capitalletters in Arabic as mentionedearlier is sub~tituted by the need for ASD. ii. is ill-defined Bilinguality an problem in 7-bit codes.Insertion of controlsequences to handle the switchingfrom one code area to another is mainly used and indicates a change in language.

9. a) ASMO 449 b) ARCh

.1~ -Q 3 6 .7 —

ob 0 00 . S

00 0 ii I I

00 02

00 ‘3

01 0 04 D( iu~

T .. 01 0 ‘5 %e

• 01 06 . ~l ~d

! 1 1 7~ ~az~

0 C 08

10 C 19 .;-~ T;~

10 0 11 . ;

10 I 111 + ~Lci

11 C I •C

I 1~ * 1: -Ci,C~

I..> ~!i .\ 11 0 14 • t

•~ I I 115 I ~!I — — — — — —

c) Sample 8 bit code

ENGLISH ARABIC

LETTER ASC I I LETTER ASCII

A 65 193

B 66 194

C 67 195

Figure 4 Standards For ArabicCharacter Set

10. Eight bit codes are morepopular and map the Arabiccharacters to the upper half of the ASCII set or basically to an. unused (in the implementersview) area of the character set. The main problem of such methods is that most applicationsinhibitcodes used in these areas.

3.3.2.2.FILTERS

When text files are storedusingASCII or DIF standards the translationfrom one code page to another is simple and is performed by creating a softwarefilter. The purpose of a software filter is best explained as a mappingfunctionfrom one domain to another. Such a mappingfunction is not a one-to-onefunction, but typicallymany-to-many.Some of these filters are sold by softwarecompanies to promote the use of theirproducts. An experience in developingsuch a filter was encountered in an attempt to producelaseroutputusing an HP printer and font cartridge from a database that stored data using a commercial Arabization set (APTEC). When files involvenon-textstandards such as lotusworksheets the design of such filters is more complex. Also, in some caseswheremultipledifferences exist somemanualediting is required after the filteringprocess.

Somepositiveexperience was encountered in uploadingAPTECdatafrom a PC database to an IBM mainframe using a filter. A reversesituationinvolvingtransfer of Arabic EBCDIC data to a PC usingBABY-36softwareposedseriousproblems.

4. CONCLUSIONS

Problems in Arabic databases and interfaces are mainly attributed to a historically inconsistentapproach to Arabization. The basic character set is not yet defined, some definitionsincludemorecharacters or diacritics. In additioninternalrepresentation of each Arabiclettermight be based on its displayshape or not. This depends on the specific Arabizationscheme used. Also, it is often desirable to displaybilingual text. This might haveArabic as the base language and English as the “occasional”language or vice-versa.

Experiences in Arabizing DBMS packages and in creatingdatabases using standard programminglanguages are frustrating.Mostsolutions are engineeringsolutionslacking the neededinvolvement in the design of hardware and systemsoftware.

BIBLIOGRAPHY

APTEC, Inc. “APTEC users Manual”.

Carrabis,Joseph—David, “Dbase III plus: The CompleteReference” OsborneMcGraw Hill, 1987.

InformationBuilders, Inc. “FOCUS Users Manual”, Release 5.5, New York, NY. 1987.

Sabri, Ibrahim“ComputerPeripheralArabization”,presented in NCR Seminarseries 1986, Cairo, Egypt.

11. Bayan: A TextDatabaseManagementSystemwhichSupports a Full Representation of the ArabicLanguage

AliMorfeq RogerKing

Department ofCompwerScience University ofColorado Boulder,Colorado80309

AqilAzmi

CenterforHadithAnalysis Boulder,Colorado80303

ABSTRACT

It is difficult to use existingcomputersystems, based on the Romanalphabct, for languages that are quite different from those that use the Romanalphabet. Manyprojects haveattempted to use conventionalsystems for Arabic, but because of Arabic’smanydiffer ences withEnglish,theseprojects have met with limitedsuccess. In the Bayanproject, the approach has been different. Instead of simplytrying to adopt an environment to Arabic, the properties of the Arabiclanguage were the starting point and everything was designed to meet the of needs Arabic, thus avoiding the shortcomings of otherArabicprojects. A new representation of the Arabiccharacters was designed; a completeArabickeyboardlayout was created; and a window-basedArabic user interface was also designed. A graphical based windowingsystem was used to properlydrawArabiccharacters. Bayan is based on an object-orientedapproachwhichhelps the extensibility of the system for future use. Further more,linguisticalgorithms (forwordgeneration and morphologicaldecomposition of words) weredesigned,leading to the formalization of rules of Arabiclanguagewriting and sentence construction.

1. Introduction

Arabic languageprocessing has beenhamperedbecause, for the mostpart,peoplehavesimplyattempted to adoptEnglish-basedcomputersystems for Arabic such as TAYL86]. This is very difficult and, to some extent, futilebecause the nature of the Arabiclanguage is muchdifferent than that of English. Arabic, for is written from example, right to left; thus, the Arabicsentencestructure and grammar are verydifferent from that in English. Therefore,existingsystems(“adoptedEnglishsystems”)cannotadequatelyprocessArabic data. language In Bayan, on the otherhand, the Arabicproperties are served first. This new environment will to examine the in with tiy problems working the Arabiclanguage by usingArabic text databases as an applica tion of thatenvironment.

Thisresearch will address those problems that have led to the shortcomings in the cun~entArabic pro In the nextsection, an overview of the grams. properties of Arabic will be given so the reader may appreciate some of the in with possibleproblems dealing Arabicusing an English-basedsystem.Someaspects of Arabic wordprocessingrequirements can be found in BECK87]AHMES6].

12. 2. TheArabicLanguage

Arabic is the officiallanguage of 22 countriesstretchingfromeasternAsia to northwestAfrica. Further there are other which more, languages use a modifiedArabicalphabet(such as Persian and Urdu). Theimpor tance of Arabic is perhapsobvious to the reader.

2.1. Properties of Arabicscript The followingsubsectionsdescribe the generalproperties of the Arabicscript,includingcharacters,digits and writingmethods.

Arabiccharacters

There are 28 characters in the . Thesounds of 12 of them do not mapdirectly to sounds that are part of the Englishlanguage. Thefollowing are someimportantproperties of the alphabet. • Arabicmayonly be written in cursivescript. • Arabic is wiittenfromright to left..

• In each character is writing, connected to the preceding or followingcharacters,except six characters, called“stubborncharacters”. These six stubborncharacters are ( .~ -‘ ~ .~ • Eachcharacter, in takcs general, on a differentshapedepending on whether it appears in the beginning, middle or end of a word if it or is standing by itself(forexample,after a “stubborncharacter” at the end of a word).

• The Arabic contains no vowels alphabet as such. Instead,diacriticalmarks,called“tashkeel’. are used to vowelsounds. represent They are writtenabove or below the consonants. There are 13 “tashkeel” Four “tashkeel” are basic and nine are derived by combining one or more “tashkeel’. The Tashkee” are essential to the understanding sentenceproperly;without the tashkeel, a sentence may easily be misun derstood. This is because the of the placement “tashkeel” is determined by the rules of grammar that is, the “tashkeel” tells the if reader the word is a verb,subject,adjective and so on.

• Eachcharacter in Arabiccould be pronounced in 13 differentways,depending on the “tashkeel”written with it.

Arabicnumerals

thc numerals of the Although Englishlanguage are referred to as Arabicnumerals, theydiffercompletely from the numeralsusedthroughout the Arabicworld. There are ten digits in Arabic. These are I ~ V 1 o digits ( e r c u . ). The numbers are written and read from left to right(notright to left, like Arabicwords). Thisimplies,obviously, that the least significantdigit in a number is the rightmostdigit.

2.2. Existingstandards There have been manyattempts to simplymodify Englishcomputersystems to accommodateArabic. Suchattemptshave from and ranged modifyinginput output(I/O)devicedrivers to usinggraphics to draw the Arabicscriptwithin All of applicationprograms. theseprojects have failed in theirattempt to represent all of the aspects of the Arabic For language. example,many of them do not have any means of representing the “tashkeel”.

One of the proposedstandards for representing Arabic characters in computers may be found in (ALSA86].Thatworkcontains a proposedrepresentation of a 7-bitArabiccharacter set (withoutTashkeel) and an 8-bit set for both Arabic(without Tashkeel) and English Still these sets are not. able to meet all the require ments of the satisfactorilyrepresenting Arabiclanguage. The source of theseshortcomings,again, is because thesecharactersets wereintended to workwithEnglishmachines. In general, the proposedstandardslack the following: . Theability to all of the Arabic represent “Lashkeel”,especially the ones that are used as a combination of “tashkeel”(e.g., “ ‘~“

13. • In general,theyrepresent“tashkcel’ as charactersadded to a word. Twowords may have the samechar acters butdiffer in meaningbecause of theirdifferent“tashkeeL”.Whenconsidering“tashkeel” as charac ters added to a word, this differencecannot be taken intoconsideration.

• Theyarrangesome of the characters withspecialmarkings as uniquecharacterswhilethey are simply the

same character with differentsigns. For example, ( ~ • .~ ) are all the same character,called “”. The systemmustconsider all of them as onecharacter and not as differentcharacters.

• Finally,thesestandards are not accepted by all Arabcountriesand,therefore,untilnowthere is no unifor mity on this matter.

Most of the proposedstandards are simplyderivationsfrom ALSA86]. Somesimplychange the order ing of the characterswhileothersremove or add someextracharacters. There are someapplicationprograms whichwhich tried to solvesome of the problemsrelated to the ~tashkeel” by creating a character set which includedeach of the ‘tashkeel~.However,sincethere are only 127 characterentries in the 7-bitextendedchar actersets,theywereforced to ignoresome of the ~ashkeel” in theirimplementation.

3. The Bayanenvironment

The Bayanenvironmentattempts to solve the problemsdescribedabove by attempting to fullyrepresent Arabicscript. In particular,representing the differentforms of “tashkeel” is given high priority.Bayan also includes a database of Arabicwords,roots and derivationsrules.Thisenvironmentcould be a base for future Arabicnaturallanguageprocessingresearch.

TheArabiccharacter set

A 16-bitcodedmcthod is to be used in the Bayansystem.Arabiccharacters are encoded in such a way that each character is represented by one and only one code. That is, the “tashkeel” do not affect the value of the character. Instead, the “tashkeel” are treated as a description or a property of the character. They can be treated in the same way underlining,highlighting and otherscreendisplayinformation are treated. A character will be represented by a byte whileattributes(such as underlining,highlighting and so on) are represented by anotherbyte. It was decided to representArabiccharacters as in Figure 1. In this figure, the righthand side byte is used to represent the characterswhile the left hand sidebyte is used to describe the properties of the character (i.e. the “tashkeet~).

awibuces Arabictharacter -HARAKAr code

Figure 1 A proposedrcpresenLalion of an ArabicCnaraaer

ft may be possible to make the code-bytevalues a modification of the proposedstandard found in (ALSA86] in order to allowsomecompatibilitywith the proposedstandard.However, the characterforms that represent the “tashkeel~ are removed as, it is arguedhere, it is moreefficient not to include the “tashkeel~ in the character set; rather.theyshould be considered as attributes to each character.Furthermore, it is alsoproposed here, thateachcharactarshould be stored in one codeonly.

Outputdrivers must be able to determine the propershape of each character(depending on where it occurs in the word)along with the correctpositioning of the “iashkeel”. This is done by building a contextual analyzerwhichknowshowArabicscript is written.

An Arabickeyboard

Since there are only 22 characters in the Arabicalphabet, it is possible to use an Englishkeyboard and simplymodify it to resemble the Arabictypewriterkeyboard. ThekeyboardBayan is shown in Figure 2. Note that the “tashkcel” havekeys on the Bayankeyboard, but Arabictypewriterstypically do not have “tashkeel” (the tashkeel are added by hand). Devicedriversshould be able to map the scanningcodefrom the keyboard to the appropriateArabiccode and attribute.Thiskeyboardlayout is a collection of differentproposedlayouts and it overlapswithmany of them.

14. HI ~ I I ~__I~J ~ IL I I~

Figure 2 Bayan’sKeyboardLayout.

The defaultinputmethod is to enter the Arabiccharacterfollowed by its “tashkeer.However, an editor has beenadded in order to allow the user to firsttypecharacterswithoutentering the “tashkeel” and to go back and add the proper“tashkeel” later if so desired. This feature was addedbecausesomeusersprefer to write a wordfirstand then go back and add any desired“tashkeel”.

4. Contextualanalysis

In Bayan, a contextualanalyzer was built. Therole of this analyzer is to determine the shape of the Ara bic characteraccording to its position in a word. The analyzertakes into considerationwhether the character appears at the beginning,middle or end of the word. It will represent the characterproperly andalsoplace the “tashkeel” in theirproperplace. The block diagram of the contextualanalyzer is shown in Figure 3. It uses four fonts for the output of Arabiccharacters.There is a separate font for each possibleposition or shape of a character. Thesefontsare: “standalonefont”,“start of wordfont”,“connect at middlefont”, and “connect at end font”. The analyzer will take the current and previouscharacter codes and theirshapes as its input and will return the correctshapes of the previous and currentcharacter. Thismeans that the callingprocedure to the contextualanalyzershould check if the previouscharacterchangesshapeafter the returnfrom the contextualanalyzercall.

Stand Alonefont

Sian of wordfont

Connect at middle font

Connect at Endfont

Figure 3 Bayan’sContextualanalyzer

Thealgorithm to determine the shape of the character at the currentposition is shown in Algorithm I. It uses the followingstructure and identifiers.

15. Datas*rucuuesandconstants. identifier~ 5TART_PosmoN.MID_coNNEcr.END_coNNEcr.andSTAND_ALONE;

Arabic_CharSu~urc of Cods:Byte; I’ Arabiceltaractercode’! Tashkeel: • Byte; /Tashkeelcode’! Shape: tnt; /‘characzershape numbers!

/sss.s.ssesssss.*sssssssssassssssssss.ss.s.s..ss.*e.ess**s.sssss.**sss,a.s Cona~uaIanalysis forthe Arabiclanguage • written in C likelanguage.Stuffbetween ‘I” and arecanrecess. 5/ Procedure ( currentdirArabic_char,prev_chrArabic_chir) car~u_chr.shapc = MID_CONNECT; I’ defaultshape’! if( CAN_NOT_CONNECT(prev~chr.code))( I’ check theprevchr ‘1 airrentchr.shape = START_POSiTION; ~ttO; )/ if can_not..conne~ ‘I

if( NON_LER(carrentcbr.code))( (‘checkasrrcntdiasacters! Qirrent_chr.ShapeSTART_POSITION; I’ currantwill be uan_shape’/ do case ( prev_chr.shape){ caseSTART_POSITION:prev_chr.shape = STAND_ALONE; break;

caseMID_CONNECT:piev_chr.Shape = ~‘D_CONNECT; break;

caseENDCONNECT: caseSTAND_ALONE: break; default: ERRORCSHAPEcode”); (‘docase ‘I ) /‘ if non_tenet ‘I else do case(prev_chr.shape)( caseSTART_POSITION: caseMID_CONNECT:currentchr~hape = MID_CONNECT; break; caseE;’~’tD~CONNECT: caseSTAND_ALONE: curranchr.shapc = MID~CONNECT; currrin~chrShape=MID_CONNECT; break; default: ERROR(ERROR in shapecode”); )/‘ do caseprev_ctir.shape ‘I I’ else’! rewmO I’ end of procedure

Algorithm I Arabiccontextualanalyzer

The identifiersabove are used to represent the differentfontsused in oursystem. Forexample, START_POSmON is the font that contains the shape of all characterswhenthey are at the starting position of a word.

The contextualalgorithm will take the current and previouscharacters as inputs. The contex tual analysisalogrichm will examine the previous and currentcharacters. If the previouscharacter can not connect to the currentcharacter ( e.g. space,comma, a stubborncharacter,...), then it will make the currentcharacter’sshape to be ‘START_POSITION_SHAPE’ and it will exit. Next, the algorithm will check if the currentcharacter is a space, a puctuationmark, or a digit; in whichcase, it will set the shape of the previouscharacter to be the properendingshapedepending on the shape of its previouscharacter(i.e. the second to the last character). If both of the abovetests fail, the algo rithm will change the currentcharacter to connect to the previouscharacterdepending on the shape

16. ofits previouscharacter.

In the foUowingtable, we show an example of howcontextualanalysisdetermines the shape of the Arabiccharacterdisplayed. The first columnshows the current input character. The second columnshowswhat is beingdisplayed as a result of the input so far. The thirdcolumncontainscom ments of what has beendone.Theblackrectangle is the cursor.

Inputcharacter Output on screen Comments

‘-~ start of wordshape.

J I_J_u connect the second to the firstletter.

f a ligature is a must, so go back andjoinboth intoone newchar.

~L.w S° start of wordshapebecause the previous can notconnect to next.

. I s° spacecause the last letter to haveendingshape.

5. Bayan’suserinterface

Bayan’suserinterface was built in a SUNworkstationenvironment.Machinedependency was kept to a minimum. The existinginterface will allow the use of Arabic text in windows; it will also perform all properinput and outputoperations on that window.Furthermore,Arabic text may be edited in thesewindows.

I..t._It ‘$~ ~I~D ~ I

~~Lk~ 4~l_J, ~ ~,j, ~ ~ ~ ~ o, ~

C , .‘—.~-JP 3” ~ e? ••~ ~ ~ ~ ~ U, 4 •~~•~_~_•J, .J~3•L~. ~ Ju.~ ~,( ~.4’~J~-’ ;~:. z.~ 1’JL3i8jL~I 3~_~ .J? ~ ~ ~ ~.i, ~._... .~e. .JI ~ j4-~~ ~ ~-.

• ~ .j~ ~ ~ ~‘ • ~3 i1~, ~ ~

C ~ ~ ~ . i~.~i1 JL~ ~ ~. t.U~ •~~ ~ ....ci ~ I)j~ 9~.91 ~ 4” ..u.I_ ~ ~ ~1 . 4~tUI ~Ifi ~. . .4~J, L. i JL~ ~ ~ j.,~, j~. ~ • ~j, ••~•~• ~. l.~l-,I J~ ~ ~ ~ ,•~ ~~

C.J~Ll~4~.L~1. ~‘ u-I.. ~i L~ , • 4~:_ ~ ~ C . 3 ii ~ ~

i~. ~ Figure 4 Arabicwindow in Bayan. 17. Figure 4 shows an ArabicwindowwhichdisplaysArabic text information. Note that the text has displayed the proper“tashkeel” foreveryword of text.The figureshows the mainoptionsallowed by the system.Currently the Text_unitoption is activated. TheText_unitobjectcalled “Bukhl.new” is displayed. The otheroptions are document,query, learn and exit options. The documentoption, forexample,allows the user to manipulate and createdocuments.

6. TheArabicwordmanager Thismanager is responsible for derivingwordsfromtheirroots and reporting the resultingword to the requestingprocedure. This managermust know the rules for wordderivations in Arabic as well as have the ability to learn new rules.This feature is built into the word managerbecause all rules of derivation that can apply to a wordmay not be known to the userwhen a word is added to the system. This feature will allow the user to add additional rules later if needed. Figure 5 shows the majorcomponents of the wordmanager.

ObjectManager

FigureS Wordm~ag~compoitaiu

This managerallows a databaseadministrator(DBA) to enterwhat is known as the teaching mode. In the teachingmode, the learningmanager is called and then the DBA can add new words to the vocabulary of the system, add new derivationrules and specifywhichwords it applies to. In the teachingmode, the DBAcan modifyexistingderivationrules forthe roots of an Arabicword.

7. Thelearningmanager Thispart of the wordmanagercan learnnewwords. It takesinformationabout a word and adds it to the objectmanager. It is importantthat this manageronly be used by someoneproficient in Ara bic as incorrectinformation can be harmful to the databasesinceBayanstoreswords as objectsrather thanactualcharacters of a word.

Derivationrules arc defined for eachword by use of the learningmanager. This is the onlyway to add a word to the vocabulary of the system. Arabicwords are added to the databaseobjects through this option afterobtaining the requiredinformation to form a completewordobject and to check if the wordalreadyexists.

Thelearningmanageruses an interactivewindow-basedinterface thatprompts the user to sup ply the requiredinformation. This processguides the DBA by requesting the neededinformation beforeaccepting the newword. The DBA is responsible for the correctness of the information in the system.

The learningmanagercan also accept newderivationrules.Rules can be specified by the DBA using the properformat.Rules can be added to the knowledgebase of the learningmanager and are used by the wordgenerationengine.

18. 8. ThewordgeneratiOnengine The wordgenerationengineinterpretsderivationrules and executescommandsencoded in the derivationrules. ft is like a smallcomputerexecuting a limitedinstruction set. It will take a root as an input and produce the newlygeneratedword (if any) as output. If there is an error, the samerootwill be returned with a flag indicating the presence of an error.Errors can range from a null root to the wrongnumber of characters in the rootand so on.

8.1. Derivationrules

All the rules are the same for each root class of words.Therefore, theyneed to be storedonly once;afterwards, the wordgenerationmanagercangenerate the desiredwordfrom its root. ft should be noted that rules are completelydifferent for three characterroots, four characterroots and five characterroots.

In general, a derivation is performed by addingcharactersand/orchanging the “tashkeel” of the word. Theencoding of the derivation is as follows:

concatenatecharacternumber of the root adding the following“tashkeel”.Whereeach will be followed by the proper“tashkeel”. concatenatewherecharacterrepresents an Arabiccharacterwith its “tashkeel~. Anexample of what the derivationruleslooklike is the following:

~ ~r,c

,, ,*# S°

Arabicderivationrules,however, do not onlyinvolve the concatenation of characters at the end of a word.Charactersmay also be added to the middle,beginning or end of a word to derive a new word.

8.2. An example: Given the following:

Root =

Rule =

/ ‘ ‘ / Result—-

___ ~

Thesteps for this result is shown in the followingtable:

19. Stepnumber Resultingword so far Comments

1 the newwordstartswith the arabiccharacter‘Meem’

2 letternumber (1) of the rootword to be added.

~0# 3 —‘ ~ letternumber (2) of the rootword to be added.

4 leuer(waow)tobeadded.

5 ‘,J ~ letternumber (3) of the rootword to be added.

6 letter(waow) to be added.

7 ,~ letter(noon) to be added.

9. Morphologicalanalysis of Arabic A morphologicalanalyzer for Arabicwords is in the process of beingbuilt. It will be able to provide the followinginformationconcerningArabicwords:recognition of differentparts of a word, the root of a word,and affixesattached to words. Thefollowinginformation is required forthe morphologicalanalysis of Arabicwords: (1) Theroots of wordsalongwith all the derivationrulesthatcan be applied to them; (2) Thederivedword and rootlinks(which can be performed by a lookuptable or by a smartalgo rithm)whichtakes a word and returns its root;

(3) A list of all possibleaffixes that can be attached to a word and that are not covered by the derivationrules; and

(4) A list of wordsthat arc not derivedfromroots,such as tools(e.g.,“from”,“to”,“on” and so on).

10. Scenario

A userwill create a text_unitobject usingbayan’seditor.Figure 4 shows an editingsession of a text_unitobjectcalled‘bukh7.new’.External files can also be imported to Bayan’seditorusing the “Load~option. An option can be activated by clicking on the buttonwhichrepresents the command. The updatedText_unit can then be stored in The databaseusing the “Store”option. The user, when done, can ask Bayan to incorporate the new or modifiedText_unitobject into its final database. Bayanwill check the Text_unitobject for new words,index the words, or it can activate the learning manager if a word is not known to Bayan. The user can specify the informationwhichBayan’slearningmanagerpromptshimiher for. This will include the specification of whichclass of Arabicwords this wordbelongs to. The userwill also verfly the derivationmethodused to derive this word from its root. Additionalinformationmay be requested to determineaffixesadded to the word. If the usermade a mistake in the word, can go back and modify the Text_unitobjectusing the editor, thcn ask Bayan to continue its operation. Bayan will add this information to its database. The user can query that Text_unitobject by words thatappear in it.

Documents can be built using the “Documentoption”which can be selected from the main menu. The user will be required to enter the elementswhichmake up documents.Theseelements

20. mayincludeText_unitobjectsand/orotherdocuments. The user can makequeriesaboutwordswithindocumentobjects. The result of the query will be a list of the document objects in which the wordappears. The usercan thenspecify that he wants to see the Text_unitobjectlevel and the Text_unitobjects will be displayed. At the Text_unitobject level the user can browse or modif~’theseobjects. If he modifies a Text_unitobject he wouldneed to ask Bayan to update its database as explainedabove.

Queries can also be madeaboutArabicwords,theirroots, or theirderivationmethods.Queries can ask Bayan to show examples of wordsderived from the same root, or even words of similar meanings.

11. Conclusion

Bayan is a startingstep formoreadvancedresearchregardingArabic. It points to the possibility of moreadvancedstudiesabout the language,such as naturallanguageprocessingbased on Bayan’s environment. Fornow,though,manypersistentproblemshavebeenaddressed,such as the design of an Arabiccharacter set including the “tashkeel”, an Arabickeyboard,word-generationalgorithms and Arabicwordgenerationmethodsfrom the roots of words. At the presenttime, an Arabic text data base,whichutilizes the representation and algorithmsdesigned in Bayan, is beingbuilt. In addition, a built-inmorphologicalwordanalyzer is alsounderconstruction.

Refrences:

(AHME86]Ahmed S. Ahmed, ‘AboutrewievingQuran in Othmaniscriptusingcomputers’, 9th nationalcomputerconference in SaudiArabia1986.

ALSA86JAl-SalehMohammed,‘Standards for Arabiccharacter in information’. 9th national conference in SaudiArabia,1986.

BECK87JJoseph D. Becker,‘Arabic wordprocessing ‘~ Communication of the ACM,July 1987.

B00T86] L. Booth, et a], ‘Arabization ofan automatedlibrarysystem’, 9thcomputerconference in SaudiArabia,1986.

Mohamed HASS87} Hassoun,‘Conception et reitsationd’unebase de donneespour les traitements automarique de Ia Ianguearabe’, 10thcomputerconference in SaudiArabic,1987

TAYL86]Murat et a] Arabic Tayli, , intelligent workstation’, 9th computerconference in Saudi Arabia,1986.

21. Responsa: A Full-textretrievalsystemwithlinguisticprocessing for a 65-millionwordcorpus of Jewishheritage in Hebrew

YaacovChoueka Bar-hanUniversity,Ramat-Gan,Israel

Combiningsophisticatedelements ofmorphologicalprocessing,patternmatching, “localfeedback”and“short-contextfiles”with a richrepertoire oftoolsforquery formulation,browsinganddisplay,coupledwithspecialdatastructuresand algorithmsfor a fastresponsetime, the ResponsaFull-Textsoftware,operational nowwith a corpus of 65 millionwords ofRabbinicalHebrew, is speciallywell- adaptedforprocessingcomplexqueries in largetextualdatabases in general, and in Hebrew(andSemiticlanguages) in particular.

1. Backgroundand highlights

TheResponsaProject(RP) -- recentlyincorporated in the morecomprehensive and ambitious

GlobalJewishDatabase(GJD) -- is a complexnetwork of research,.development,applications, service anddisseminationactivities in the fields of informationretrieval,computationallinguistics, textprocessing andJewishstudies. In its contextwasbuiltone of the veryfirstfull-textretrieval systems, and one of the largestcorpora in the humanities(certainly the largest in Hebrew). Conceived by A. Fraenkel in 1965-1966,whoalsodirected it till 1975, the ResponsaProject waslaunched in an operationalbatchmode 4] in 1967, as a jointendeavor of the WeizmannInstitute of Science(Rehovoth) and Bar-hanUniversity(Ramat-Gan), in Israel. In 1974/75 the projectwas completelyrelocated at Bar-Ilan to becomeformallypart of the Institute for InformationRetrieval and ComputationalLinguistics(IRCOL),which wascreated at that time. Theauthor,whowaspart of the RP researchteamsince its inception, has acted as Head of IRCOLand PrincipalInvestigator of RP since1975.

In 1980 the on-lineversion of the software waslaunched, and the highlypositiveimpact of the conceptconvinced the developers to enlarge the scope of the ResponsaProject -- originallylim ited to the ResponsaRabbinicalliterature -- to encompass the entirespectrum of the JewishHeritage in an ambitiousGlobalJewishDatabase thatwouldcomputerize -- so to speak -- the entireculture of the Jewishpeople. The ideacrystallized in 1983,with the establishment of a three-yearendowment fund for thispurpose by a groupheaded by T. Klutznick in Chicago, and sincethen,classicalwrit ings in Hebrew are beingconstantly-added to the database,subject of course to availablefunding.

Besides the basiccomponents of a largedatabase of Hebrewtexts and of a sophisticated full- texton-linesoftware(withuniquelinguisticcapabilities) for the processing,retrieval,display andprinting of documents and text-passagesfrom the database, the projectwasalsoinvolved in researchactivities in therelevantfields of informationretrieval andcomputationallinguistics, and in the development of the infrastructureneeded for supporting a largenetwork of workstations connected to the system in Israel and abroad, and in servicingusersfromacademic,legal, and rabbiniccircles in processingqueries in batch andon-linemode.

The specialnature of the language of the texts(Hebrew,with its complex and rich morphology) and the specialnature of the database(Rabbinicalwritingsspanning one thousand years)forced the developers to face and solveproblems that usually are not present in similar systemswithEnglish or withmodemtexts. In this shortreport,however, we restrictourselves to describe in somedetailonly two of the above-mentionedcomponents,namely, the database and the software. The latterwill be describedmainlyfrom a user’spoint-of-view;thus, no algorithms,files or data-structureswill be detailed. Forthese the reader is referred to the technicalliteraturegiven in the references and to a morecomprehensive“List of Publications”availablefromIRCOL.

Some of the softwaredistinctfeatures and highlightsare: a rich and varied set of tools -- includingproximity and booleanoperators -- for query formulations; a flexible anddynamicdisplay andbrowsingcomponent for scanninglarge sets of documents in a shorttime; a complete morphologicalcomponent for the automaticgeneration of all linguisticallyvalidmorphological variants of any term in the query; a rich pattern-matchingmodule;tools for locally anddynamically creating,updating,editing and deleting“personalthesauri” and embeddingthem in the query; a “short-context” tool to helpdisambiguatehighlyfrequentandambiguouswords that are part of

the query; and an on-line“localfeedback” tool that suggests to the usermoresynonyms -- or

rather“searchonyms” -- to the termsappearing in the query. All of this in the context of advanced algorithms that assurevery fastresponsetimeeven for complexqueries and in long-distance communicationsenvironments. 22. 2. Thedatabase

Thedatabase of the GJDprojectconsiststoday of some 65 millionwords of running text in Hebrew,organized in severaldistinctcorpora. Thelargest of these is still the Responsa corpuswhichaccount for about75% of the size of the GJDdatabase; thiscorpuswill therefore be described in somedetail. A comprehensive anddetailed list of all works that are parts of the database,includingbibliographicaldetails(author,edition, etc.) is availablefromIRCOL. TheResponsaliterature,writtenmainly in Hebrew andAramaic butoccasionallycontaining expressions or evenwholepassages in vemacularslikeArabic,Ladino,German,Yiddish,English, etc., is an impressivecollection ofrabbinic“queriesandresponses”spanningmanycenturies, and originatingfrom a largenumber of countries. Thequerieswereposed by individuals and communitiesfrom all over the world to the outstandingJewishRabbisandauthorities in each generation, and the decisionsrenderedbecamevaluableprecedents for JewishLaw,many of them beingsubsequentlyincorporatedinto the Codes. TheResponsaliterature is recognized as an invaluablestorehouse of information of interest to scholars in manyareas,such as law,history, economics,philosophy,religion,sociology,linguistics,folklore,etc. Personages,realia, geographicsites,scholars,wars andkings,togetherwiththoseminutiaewhichbring the life of a communityintosharpfocus -- birth,marriage and death,recipes,taxes,medicalpractices -- all these andmoreprovide the scholarwith a uniqueopportunity to recaptureJewish life (and life in general) with an accuracyoftenunobtainablefromothersources.

Interestinglyenough,manytopics that seem to be related to modemsociological and economicconditionssuch as ecology,devaluation ofcurrency,minorityrights, the right to strike,asylumclaims for criminals and suspects,autopsy and euthanasia,wereclearlydescribed, debated anddecided in this literaturehundreds of yearsago. Also,many of the Responsarecently formulateddealwithcurrentproblems of geneticengineering,robotics,spaceastronautics, computerizeddevices, and the like. Finally, the legalmaterialcontained in the Responsaliterature -- which is fromthispoint of viewnothing but a case-lawcompendium ofprecedents -- is of special interest,since it reflects in revealingdetails the operation of one of the oldestappliedlegaltraditions in the WesternWorld -- a system and philosophy of law thathavebeenapplied in diversecountries andunder the mostvariedconditionsrangingfrompredominantlyagricultural andruralsocieties to urban,commercial and industrialstates.

Thisremarkablyrichlegal and historicalmaterial has still to be systematicallycataloged. It is estimatedthatabout1000-1800authorshavecontributed to it, producingsome3500printed volumes(collections) of Responsatotalingmorethanhalf-a-million(probablycloser to three- quarters of a million)documents,originatingfrom a largenumber of countries,mainlyfrom Europe(especiallyEasternEurope),NorthAfrica, the MiddleEast and someparts of Asia. While the earliestresponsacan be tracedback to the Gaonicperiod in Babylonia(sixth to tenthcenturies), it should be noted that for the last thousandyears,responsabookshavebeencontinuouslywritten andpublished, and this activity willprobablyaccompany the Jewishpeople in the foreseeablefuture.

As of today, the computerizedresponsacorpuscontains the full-text of 252volumes of the mostimportantResponsacollections,totalingabout49,000documentswith 50 millionwords of Hebrew. Aboutthirtycountries of origin are represented in thisdatabase,amongthem: Morocco, Egypt,Algeria,Turkey,Iraq,Syria,Greece,Yemen,Palestine,Tunisia,Lybia,Poland,Hungary, Russia,Galicia,France,Germany,Austria,Spain,Provence,Italy, and the UnitedStates. Also, every century as well as everyschool of thought and influentialcenters andauthorsfrom the tenth centurydownwards are represented in the database.Besides the Responsacorpus, the database contains, as the nucleus of the envisagedGJD,about 14 millionwords of manyclassicalworks of Jewishheritage, of which we mention: the Bible; the 36 Tractates of theBabylonianTalmud (5thcentury),including the famousRashicommentary(11thcentury,France), in full; the Haggadic Midrashim, a collection of homileticalcommentariesandexpositionsabout the underlying significance of Biblicaltexts; a few MedievalBiblicalcommentaries; andMaimonides’Code (11thcentury). Thecompleteworks (20 volumes) of the modemIsraeliwriterandNobelPrize winner S. Agnon are alsoincluded, as the nucleus of a modernHebrewliteraturecorpus.

Thiscomplexdatabase is highlystructured in a hierarchy of levels, as per the following scheme. TheGJD is divided intoseveralAreas(Jewishstudies,ModernHebrewliterature...), each areacontainingdifferentCorpora(Responsa,Talmud,Bible...), eachcorpuscomposed of differentAuthors,whomayhaveseveralVolumes,composed ofDocuments,whicharedivided intoParagraphs,Sentences,Wordc, and Characters. Each one of theselevels is separatelyindexed and can be directlyaccessed by the display and searchingmodules.

23. 3. GeneralSetting

Thesystem(software and database) is installed on Bar-hanUniversity’sIBMmainframe, and is runningunder the MVSIFSOOperatingSystem; it can be accessedlocallythrough any standardIBM3270-typeterminal withHebrewoptions.OutsideBar-hanUniversity, it can be accessedusing an IBM-compatiblemicrocomputer, a modem, a telephone lineand a specially developedproprietarycommunicationsdiskette. Thediskettesimulates the IBM3270terminal on the PC, makes all the necessaryprotocolconversionsbetween the asynchronoustelephone communicationsand the synchronousonesrequired by the IBMcommunicationssoftware,while highlycompressing the data sent andreceived. After an initialone-timesetting,justinserting the diskettewillautomaticallymake all the necessarydial-upsand give all therequireddifferent passwordsand accountnumbers,leading the userdirectly to the heart of the Responsasystem. No blackboxes,protocolconverters or cards of any sort are needed.

Thevariousfunctions of the systems are activated by touching the appropriateProgram Function(PF)keys. Besides the standard PFs forHelp,PreviousPage,NextPage,Print and Exit, the followingPFs are available: DatabaseSelection: Forselecting the DB to which all subsequentsearch,display andbrowse functions will refer(untilanotherDB is laterselected);Bibliographies:Displaysuponrequest biographical/bibliographicaldetails on the works/authorsincluded in the DB underinvestigation; Textdisplay: Fordisplaying anydocument (orpart of a document) of which the exactreferences are known. Sincedifferentworksusuallyhavedifferentinternalreferencingsystems(some are subdividedintoparts,chapters, and paragraphs;other are referred to by pages and sections; references are madesometimes by letters,othertimes by numbers, still others by Romannumerals), etc., and since ourapproach is to let the userfeel comfortablewith the systemand use the sametype of referenceswithwhich he and his colleagues are familiar, we had in fact to devise a specificappropriatemenu for each and everyone of the hundreds of worksincluded in the GJD.

Notethatonce a portion of text from theDB is displayed, no matterwhy and by which function, the usercan freelynavigate in the DB,asking for anyothersection or paragraph of the displayeddocument(includingfirst/last/previous/nextpage/paragraph), anyotherdocument of the sameauthor/volume, etc. A title line at the top of the screenkeeps the userinformedabout his exact location in the DB at everymoment.

ThefollowingimportantPFswill be discussed in moredetail in the nextsections:Search, for formulating a query and displaying its results;Thesaurus, for the dynamiccreation andediting of localthesauri to be included in the search;Short-context, for helping the userresolveambiguous terms in the query;Feedback, to suggest to the useradditionalsearchonyms(synonyms,related terms) to his query’sterms; andMorphologicalmodule, thatgives an on-lineaccurateandcomplete morphologicalanalysis of any word(form) in Hebrew.Apparently, the ResponsaSystem is unique (amongotherfull-textsystems in any language) in having the last threefunctionsavailable in an operationalretrievalsystemwith a largecorpus.

4. Queryformulation:methodologicalconsiderations

As is well-known, the basicassumptionunderlying a full-textretrievalsystem is that the original text of the document is fullyavailable on the computerand searchable by the software; the userframes his query, in the mostbasicform, by givingany word or phrase (= sequence of words) which he thinkswouldadequatelyrepresent his topic, in the sensethateverydocumentthatcontains

thisphrase -- andonlysuch a document -- is relevant to the query and should be retrieved by the system. Thespecificlanguage andcorpuswithwhich we are dealinghere,however, and our personalview of the”philosophy” of full-textsystemsdictatesomemethodologicalconsiderations in which the Responsasystemmightdifferfromsome of the otherfull-textones:

a) In many of the systemsoperational in English,French and otherEuropeanlanguages, common words(i.e.,function or grammaticalwords)such as the, with,and,from,... are not searchable on the ground thatthey are devoid ofanyinformationcontent.. In the ResponsaSystem, however,each andeveryword (in fact,letter) of the text is searchable:first,because we believethat evencommonwords can be of crucialimportancewhencombinedwithothertermsintomeaningful phrases; and second,becausemost of thesewords (in Hebrew) are ambiguous and homographic withothercontent words(if/mother;with/nation;but/mourning,etc.) and thuscannot be omitted from the searchabletext. The storage and processing-timeproblems are algorithmic in natureand should be handledadequately by specially-designedcomponents of the system.

24. b) Usually a term in a query is a string of charactersthat has to be matchedexactly by its counterparts in the text. In ourcase,however, a termwould,moreoftenthan not, play the role of a representative of a wholefamily of “variants”,containingtens,hundreds,thousands, or even sometimestens of thousands of elements (= strings ofcharacters). Thismayhappenbecause of several reasons of which we mentionherefour (cf. the appropriatesectionsbelow for moredetails): morphologicalprocessing,where a lemma, say, has to be replaced by the family of its morpho logicallyrelatedvariants;patternmatching,where the string -- withspecial“operators”embedded in it -- stand for a pattern of characters that has to be checkedagainsteveryword in the text for a possiblematch;thesaurusprocessing,where the term is in fact the name of a wholefamily ofrelated that terms have to be substituted for it; andfeedbackprocess,where a set of “searchonyms” is the suggested by system as a substitute for a giventerm of the query. Thisway a simplequery of the form“look for the occurrence of nuclear,explosion,dessert, kill in the samesentence”,becomes a complexcombinatorialproblem in whichfourverylarge setshave to be matchedandprocessed at the sametime. Sophisticatedalgorithmshad therefore to be developed to handlesuchsituations 7], whilekeeping the requirements on internalstorageandprocessingtimewithinreasonablebounds.

c) Notwithstanding the remarksmadeabove, it is important to note that in the general of our the philosophy approach, systemdoes not forceupon the useranypre-definedconcepts or procedures(such as pre-establishedsuffixomission or termexpansion by thesauruscomponents), but him gives alwaysfreedom of choice,besidesestablishing ofcoursedefaultoptions that are acti vatedwhenever he chooses not to interferewith the system’sstandardcourse of action. Whenlarge families of variants are generatedfromsome of the terms in the queries,because of thereasonsjust mentioned in b), the usercan eithersilentlyagreewith the automaticinclusion of all the generated terms in the searchprocesswithoutevenbeingaware of thesegeneratedforms or theirnumbers (andthis is indeed the defaultoption), or he can ask for the generatedfamilies to be displayed for his inspectionbefore the search is done, in whichcase he can selectfrom the displayedformsthose he wishes to include in the search (or, if it is easier, to deletethose that he wishes to omit). Thus a novice or casual usercan convenientlyrely on the defaultoptions,while a moresophisticateduser can have if control, he wishes so, overeach and everyphase of the searchprocess.

5. Queryformulation:basicforms

In its simplestform, a query is a sequence of consecutivewords thathave to be looked for in the text, such as: nuclearenergy, court of appeal, testimonyunderoath.

Notethatcommon(grammatical)wordscan also be part of the query, andthatsentence boundaries in the text are considered as phrasedelineators. Morepowerfultools are available however to the user to formulate his query:proximityoperators,directionspecification,negated distances and alternative terms. Instead of describing,however, the exactsyntaxrequired, wejust give a (somewhatartifical)example thatclarifies the multipletoolsavailable.Asking for nuclear_atomic bomb_weapon(-3,5) desert(-10,10) -Nevada means: occurrences of nuclear any or atomicthat are immediatelyfollowed by bomb or weapon, in the of proximity which(from 3 words to the left up to 5 words to the right)desertoccurs,which is not followed or preceded (in up to 10 ~vords on eachside) by Nevada. (Note thatNevadacan occursomewhereelse in the document or even in the samesentence.)

Qualifying a querywith the operatorsentence in its front, willcause the distances to be measured in units of sentences in the sameparagraphratherthan in units of wordswithin the samesentence.Similarly for the operatorparagraph. Thus sentence: A (-1,1) B (0,0) C is satisfied when A B exactly and are in the same or in adjacentsentences and B and C are in the samesentence.

Note that all of the toolsmentionedabove -- positive and negativedistances,local disjunction,negateddistances, andsentence or paragraphoperators -- can be combined in one query, and will be interpretedaccordingly. Forexample, paragraph:terrorists (0,0) bombs(-1,1) -Lebanon will be satisfied by any paragraphmentioningterrorists or bombs and such thatLebanon is not mentioned in thatparagraph or in an adjacentone.

We note that word finally any (rather, any string of characters) thatoccur in a query as described in this can be if section, made, needed, to represent a wholefamily of hundreds or more 25. relatedterms,that are automaticallyassociatedwith it by algorithms for patternmatching, morphologicalprocessing or thesaurusexpansion (as instructed by the user). Thus,families of words can be manipulatedwithpositive and negativedistanceoperators,negateddistances,local disjunctions, and sentence or paragraphoperators,exactly as is donewith words. This is detailed in the followingsections.

6. Generatingfamilies of terms:patternmatching When form required, a in a query can represent,instead of a formalstring to be exactly matchedwith its occurrences in the text, a pattern to be looked for in the text,where the pattern can containmixed of sequences charactersand specialsymbols, to be describedpresently. Thiscan be veryusefulwhenlooking for foreignnames of persons or places, for technicalterms, for spelling variants and the like (especially in a corpussuch as the Responsawhichspanslongperiods of time andcontainsdifferentlayers of the language), in internationalmediapublicationsthatcoverevents in differentcountries and languages, in technicalliterature(chemistry,biology,pharmaceutics), etc. Someavailable symbols are: the don‘t-caresymbol ! which will match any single character, the alternativestring ofcharacterssymbol / whichrepresents an alternativebetween differentpossiblestrings in the pattern; the variable-lengthdon’t-caresymbol * which will match any sequence of characters of arbitrarylength(including the emptysequence of lengthzero), and which can be embedded at the beginning of the pattern(left-handtruncation), at its end(right-hand truncation), or anywhere in the pattern(middletruncation). Of specialinterest is the grammaticalaffix symbol #, whichwhenoccurring at the beginning/end of a pattern,stands for a predefined list of grammaticalprefixes/suffixesrespectively. ForEnglish,such a list wouldcontain, for and a suffix prefix example, un, re, dis, inter, ... one wouldcontain ed, ing, ment, tion,able Thus car#wouldretrievecars but not careful, and #order wouldretrievereorder and disorder but notborder. Such lists are speciallyhelpful for Hebrew and similarSemiticlanguageswhereprefixes and suffixes are verycommon. (Thetwo lists for Hebrewcontain a few tens of affixeseach.) Thisoperatorreflects a trace of linguistic processing, but it is still,really, a formalpattern-matchingone;indeedcar#would still retrieve caring,eventhoughcaring is related to care and not to car. Tools for truelinguisticprocessing will be described in the nextsection.

Finally,booleanoperators(and, or, but not) can be imposed on substrings in the pattern. Again, we contentourselveswith one actualexample.Looking for variantspellings in Hebrew for Venice, we firstasked for *v*NzI*,getting156(!)differentforms,most of thembeingindeed variants of Venice, but manyothersbeingirrelevant,containing P, K or G at the beginning or end of the word. Askingthen for *v(~p,K,G)*NzI(....p,K,G)*, a set of 85 variants wasretrieved, all of themindeedpertinent to the search. Thus, a couple of minutes of intelligentdynamicinteraction with the systemdramaticallyenhanced its performance for this specificquery. Inclusion of a pattern-matchingstring as one of the terms in the query willcause the system to search for all formsfound in the text that satisfy the pattern, and to substitutethemautomatically for that in pattern the query. This is not an easytask,specially for a corpussuch as ours that contains than more half-a-milliondifferentforms. Insisting as we do on an on-lineresponse precludes the possibility of a simplelinearsearch on this list of differentwords of the corpus to find the forms thatindeedsatisfy the pattern, and advancedalgorithms had to be developed to perform the task 3].

7. Generatingfamilies of terms:morphologicalprocessing

When a user is asking for all text passagesmentioning spy andexpulsion in the same he sentence, obviouslydoes not have in mindthesespecificstringsonly, but any othermorpho logicalvariantsuch as spies,spying,expelled, etc. as well. Yet, the possibility of sometimes a term in a into the expanding query correspondingfamily of its truemorphologicalvariants is 4 from usuallymissing most(all?)retrievalsystemswithEnglishtexts. Twoexcuses are usually advanced to explain this omission. First, the number of suchvariants for a givennoun or verb in English is small,three to fivevariants in general, and in anycase less than ten for mostwords. Thus it is not unreasonable to ask the user to takecare of thesevariantshimself,addingthem as alternative words in the query. Second,mostsuchvariantscan be derived by adding a few well-known that the don’t- suffixes, so carecharactersymbolgives a simplesolution to the problem, as in the case of which will play* expand intoplays,played,playing,player,etc. Theusercan easily convince that himself,however, boththeseexcuses are onlypartiallytrue; andeven if they are partiallyvalid for English,they are certainly not so for French, for example(think of the verbfaire),

26. and theytotallycollapse for Hebrew,Arabic, and similarlanguageswith a rich and complex morphology,where the number of variantscan reach the hundreds and the thousands, and their spellingcannot be adequatelydeveloped by formalpattern-matchingtools. In order to appreciate the complexity of the problem, and the need for an adequatesolution, we nowsketch a shortreview of the pertinentcharacteristics of Hebrewmorphology. TheHebrewalphabetcontains 22 letters (in fact,consonants).Because of the lack of vowels, words are usuallyshort (5 to 6 letters on the average), and it is quiterare to encounterwords in a runningtextwith a length of 10 letters or more; no word-agglutination (as in German) is possible.

Hebrew is a highlyinflectedlanguagewith a rich derivationalpattern, and this can be best demonstratedthroughthreefigures. Thetotalnumber of entries in a modern and comprehensive Hebrewdictionarydoes not exceed35,000etitries(includingsome3,500internationalloan-words such as bank,symphony ) derivablefromabout4,000roots of three to fourletters. On the other hand, the totalnumber of differentHebrewwords(heredefined as strings of characterswhich are meaningfulandlexicallycorrectunits of Hebrew) is estimated to be aboutfifty to one hundred million. Thus,there is a gap of severalorders of magnitudebetween the number of dictionary entries (= lemmas) and the number of words that are generatedfromthem by inflection and derivation. (Thecorrespondingnumbers forEnglish are estimated to be in the order of 35,000 “roots”,150,000dictionaryentries and one millionforms).

Briefly, a noun or an adjective can appear in up to six or sevendifferentvariants (singular/plural/dual,masculine/feminine, and constructstate); ten possessivepronouns(mine, yours,...) can be suffixed to eachnoun, and different(aboutthirty)combinations of prepositions and articles(the,in,from, etc.) can be prefixed to it. For a verb, the pattern is morecomplex. It can be conjugated in up to sevenmodes,fourtenses and twelvepersons;accusativepronouns (me,you, him,...) can be suffixed to transitiveverbs, andprepositions can be prefixed to it (as with nouns). Thenumber of variants for a nominalform is usually in the hundreds and can evenreachone thousand, and for a verbalform can reach a few thousands. It should be emphasizedthatduring the derivationprocessoutlinedabove the originaldictionary-entryformmayundergoquite a radical metamorphosis. TheHebrewsinglewordrepresenting the phraseandsince I sawhim (10 letters) containsonlytwoletters in common with the infinitive to see ; the word in Hebrew for computer has 1024variantswithtwelvedifferentleadingletters, and manymorethousandshave to be added if we want to accountalso for the verb to compute.

Given the complexity of the derivationalpatternsdescribedabove, it is clearthat the morphologicalcomponent of retrievalsystems for Hebrewtextscannot be neglected.Indeed, in a fewexperimentsconducted on our database, in whichsomequerieswereprocessedbothwith and without the morphologicalcomponent, it wasfoundthat the queries in which no linguistic expansion wasperformed,only2%-10% of the relevantmaterialwereretrieved. On the other hand, no ordinaryuserwill have the necessarylinguisticknowledge andmethodicalreasoning, not to mention the time,energy and patience,needed to manuallyinclude all the differentpossible variants of the requiredword in the query, and the tools of the pattern-matchingcomponent are certainly not adequate for this purpose.

It wasthereforenecessary to include in the system an automaticcounterpart ofHebrew morphology, and such a module,adequate for on-lineprocessing,wasindeeddeveloped in the mid-seventies and incorporated in the Responsasystemaround1980. Thetoolscreated can perform a full and correctgrammaticalanalysis of any givenHebrewword, as well as find all linguistically validvariants of any giventerm in a givencorpus.Basically the toolsconsist of a speciallycodified dictionary of Hebrewtogetherwith all the validrules of conjugation andderivation, andexceptions to the rules. Theinterestedreader can find moredetails in 1]; here we onlydescribe the toolsfrom a userpoint-of-view.

To everylinguisticallyvalidform we assign a dictionary-entrylemma,which is usually the singularmasculinevariant for a nominalform and the singularmasculinethirdpersonwithpast tense for a verbalform (thisformassumes the role of the infinitive in English),both stripped of any prepositionalprefixes or pronomialsuffixes. All lemmaswith the samebasicletters and the same semantic field are groupedtogetherunderoneroot. To give an example in English, the form computed andcomputes have the lemma (to) compute, and the lemma of computers is computer. On the otherhand,compute and computer have the sameroot,whichwouldalsoinclude the lemmascomputation,computerize, etc.

Enclosing a term in a query in singleapostrophes(‘compute’)instructs the system that this is lemma a thatshould be replaced by all its variantsoccurring in the corpus.Enclosing it in double quotationmarks(“compute”)indicates to the system that this is a root that should be 27. replaced by all variants of all lemmas thatcan be derivedfrom the givenroot. Finally, for a naive userwhodoes not evenknowwhat is the lemma or the root of someterm he is including in his query,enclosing the form in exclamationmarks(!compute!)instructs the system to firstanalyze the givenform and find its root(s), in order to replace it by all the derivativeformsconnected to this root. In English, forexample, ‘go’ wouldretrievealsogoes,going,gone,went, while“solve” wouldretrieve,besidessolves,solving andsolved, alsosolution, etc. Finally!thought!would alsoretrievethink and thinking besidesthought.

To sum up, whenincluding a term in his query, the userhas the option to consider it as a string of characters to be matched by an exactcounterpart in the text, as a pattern to be replaced by all strings in the database that successfullymatch it, as a lemma to be replaced by all its derivatives andconjugatedforms in the text, as a root to be firstreplaced by all correspondinglemmas andthen by the variants of theselemmas, andfinally as a word to be firstanalyzedand thenreplaced by the variants of its lemma(s).

Finally, it is not uncommon for a query to includefour or moreterms,each to be expanded intoseveralhundredvariants,with tens of thousands of occurrencescorresponding to eachterm and its variants.Following (in Fig. 1) are actualexamples of searches run on ourcorpus,detailing the number of variants foreachterm, and the totalnumber of occurrences for thesevariants: term No. of differentvariants Total no. of occurrences in the database of thesevariants in the database(rounded) lighted 1 16 (to) ‘light’ 210 7400 candle 1 3200 ‘candle’ 175 21300 and-he-testified 1 1 (to) ‘testify’ 336 53000 oath 1 4200 ‘oath’ 261 59000

Fig. 1 Examples of morphologicalexpansion

Thecomputationresourcesneeded to conclude the search in such cases can be enormous, and unless specially-tailored and quiteefficientalgorithms and datastructures(such as described in 7]) are used, the searchcan be almostimpossible to concludeon-line.

8. Generatingfamilies of terms:thesaurusexpansion

TheResponsasystemdoes not include a globalthesaurus thatcan automaticallysupply the userwithsynonyms or othertermssemantically-related to the onesused in his query. Forone,

it is our -- as well as others -- experiencethatbuildingsuch a thesaurus is a verycomplex,costly and time-consumingtask that moreoften that not does not come to completion. Wehavealso seriousdoubtsabout the efficiency and usefulness of such a global toolthat is supposed to solve all problems for all users. In harmonywith ourgeneralapproach, we are more in favor of local thesauri,thatcan be dynamicallycreated by real users for realneeds, andthenupdated,edited, and tailored to specificareas. The systemindeedsupports this activitythrough the ThesaurusProgram Function.Basically, the thesauruscontains“families”,eachfamilyhaving a uniquenameand containing a set ofrelatedwords. Forexample, a usersearchingfrequently for datesmaycreate a familycalledmonth and containing:January,Jan.,February,Feb. etc. A userresearching weaponsmaydefine a familyweaponcontaining:airplane,tank,rjfle, etc., whileanotherone interested in flyingobjectsmaycreate a familyflyingcontaining:airplane,bird,dove, etc. Familiescan be veryeasilycreated,edited,deleted, or unified at will.

Sincesome of the members of a family can be names of otherfamilies, the usercanconstruct hierarchies of relatedterms. Also, a word in a family can be a pattern or a linguisticlemma (orroot), in whichcase, the wordstands for the whole set of strings in the givencorpusthatsatisfy the given criterion.Mentioning the name of a family in the query,instructs the system to expandthis terminto the entirefamilywith the givenname in the thesaurus.

As always, the user has full control on the generated sets of words, if he wishes so, examiningthem anddeletingwhateverseems to be irrelevant for the presentquery.

28. 9. Storingandcombiningqueries

All queriesprocessedduring an on-linesession are givenuniqueidentifyingnumbers and are preserved for as long as the session is active, and can be stored in the user’sworkfilearea for al long as he wishes. Theusercan combinestoredqueries by using the Booleanoperatorsand, or, but not,formingthus a newquerywhose set of retrieveddocumentswould be thenconstructed by the intersection,union or complementation,respectively, of the corresponding sets of the mentioned queries. Given for example the queries: 1. testimony(-4,4)oath, 2. court of appeal, 3. paragraph:MissouriCourt, 4. paragraph:JudgeWilliams; a newquerycan be formed as in: 5. #1 and (#3 or #4) but not #2.

Thisapproach of splitting a queryinto a fewsimpleindependentcomponentsthatcan be combined by Booleanoperatorsintonew,morecomplexones, is in sharpcontrastwith the hierarchical one in which the onlyway to modify a query is by appending to it a newonethatbegins with one of the Booleanoperatorsand, or, but not,thusrestricting or expanding the set of documentsretrieved by the initialquery. In thisway a querymodificationalwaysrefers to the last queryconstructed, and there is no way of retracingbackfalsesteps, of tryingsideissues, or of experimentallystudying the effect of somemodificationwithoutcommittingoneselfdefinitively to thatpath. On the contrary, the adoptedphilosophydescribedhere is welladapted to such modifications in searchstrategy, so typical of complexsearches in documentretrievalsystems. Referringback to the exampleabove, if query 5 provesunsatisfactory,maybebecausecomponent 3 is ill-defined, onecan changequery 3 to: 3. sentence:MissouriCourt, and run query 5 again, or onecanjustignorequery 3 altogether andrun: 6. (#1 and #4) but not #2. By using this approach the user is deliberatelyeducatedintoavoidinglongcomplexformulations that are difficult to understandand to modify, and,instead, to split his requestinto a few small“buildingblocks” thatcan be combined andrecombined at will.

10. Presenting the results: the displayand browsingcomponent

One of the majorlessonslearnedfromour longexperiencewithfull-textdocumentretrieval

systems is that the user/systeminteraction -- in terms of querysubmission andresultsdisplay -- is moreoftenthan not a multi-step,dynamic,conversational and exploringone,ratherthan a one-shot

question - answeringprocedure. The browsingmodule in the Responsasystemwas designed to meetthis situation.Limitedspaceprevents us fromelaborating on thisaspecthere; let us thenjust highlightsomeinterestingpoints.

After the completion of the search,statistics on the number of retrieveddocuments and the number of hits will be displayed as a firstitem of information to the user, since we believethat the userstrategy for scanning the resultsdependmuch on the number of retrieveditems.Togetherwith thesestatistics the systemdisplays a set of selectionoptions (by author,number of hits,etc.), which the usercantrigger to reduce the number of documents to be displayed, if he judges it to be too large for his purpose.Oncethatselection is done, the startingpoint for browsing is a specialKWIC (Key-Word-In-Context) of the occurrences of the hits in theretrieveddocuments. An example, adapted forEnglish, for the querytestimony(-4,4)oath is given in Fig. 2. Vol. IV 1983 .Thecourtshouldalwaysconsider a testimony underoath as a valid... .specially that all these testimonies weregivenunderoath...

.beingunderoath, he testified that the accusedwas... .witnessessometimes lie on oath,such testimonies should be taken...

Vol. VI, 1985 .1 am a doctor, he testified and on oath to savelife... ..no way to convincehim; testifying underoath, he denied...

Vol. VII, 1986 .he refused to testify claimingthat an oath... Fig. 2-- A KWICsample

KWIClinesfromdifferentdocuments in the samevolume are separated by a fewdashes on the screen; this way a quickglance at the screen will be sufficient to assesswhat are the most promisingdocuments. Exactreferences for each line of the KWIC can be asked for, andthesewill then be displayed at the beginning of the line (replacingthere a few words of the KWIC).

29. Ourexperiencewith this type of output as a firstdisplay is excellent. One can scanlarge amounts (i.e. hundreds) of documents in a rathershorttime and get a good andreliableimpression on the relevance/non-relevance of many of thesedocuments. Severaladditionaloptions are however available to the user in case he decides to examine the output in moredepth. The usermaywish, for example, to expandsome of the displayedlinesintolargerchunks of texts,eitherbecause the given context is too small to be conclusive, or on the contrarybecause the item is clearlyrelevantand the userwants to see more of it. The usercan thenmark the lines he would like to expand andeven specifyhowlarge a context he is asking for: document,paragraph, or just n sentencessurrounding the examinedline. Once the requiredcontext is displayed, the usercanbrowsefreely in the document asking for the next/previouslast/firstparagraph/page, and whenfinished, ask for the expansion of the nextmarkedline, or redisplay the KWICpage in order to resume his research.Finally, the KWIC lineswill normally be kept in a file for furtherprocessing or printing. Theusercan marksome of theselines fordeletion -- if judgednon-relevant -- andonly the remaininglineswill be kept or printed.

11. Short-contexttools

This is one of the mostusefultoolsavailable to the user to partiallyresolve the (morphological)ambiguityproblem. When the userrealizes thatone of the terms in his query is highlyambiguous(such as parly or light in English) and can generate a largenumber of irrelevant citations, he can firstturn to the “short-context”component. Given an ambiguousterm as a parameter, the type of shortcontextrequired:followingneighbor or preceding one, and the sort optionrequired:alphabetically or by decreasingfrequency, this PFiriggers the system to display the list of all different left or rightneighbors (as requested) of the giventerm,togetherwith the frequency of thesetwo-wordexpressions in the corpus.Previousexperiments on texts in French 6] haveprovedthat in mostcases a one-wordcontext is sufficient to disambiguate an ambiguousword (at least forcertaintypes of ambiguity), that the errorrate in such a procedure is verysmall, and that indeed a relativelysmallnumber of frequentneighbors will account for a largeproportion of the term’soccurrences in the corpus. Althoughformalresults are not yet available for Hebrew, our experienceshows that the samecharacteristics are alsotrue of Hebrew. Someexperimentsrecently conducted by the authorwhilevisiting Bell CommunicationsResearch in NewJersey,U.S., on a NewYorkTimescorpus of ten millionwordsgavesimilarresults for English. Figure 3 shows, for example, a fewlinesfrom the lists of the rightneighbors of interest and the left neighbors ofparty withtheirfrequencies (as a two-wordexpression) in the corpus:

communist party 489 interestrates 1593 opposition party 41 interestpayments 145 third party 29 interestgroups 35 birthday party 28 interestincome 16 dinner party 24 interestcharges 10 herut party 10 interestcosts 10

Fig. 3 -- Short-contextsamples

Scanning the list of contexts of the givenambiguousterm,sorted by decreasingfrequency, the userwill usually be able in a couple of minutes to assess the appropriateness of manylines of contexts, and to mark fordeletionthose that are notrelevant,probablyeliminating in justone stroke thousands of non-relevantitems. The searchwill now be restrictedautomatically to thosedocuments thatcontain the giventerms in the appropriatecontextsonly.

12. Feedback

Whenframing his query, the usermayinadvertentlyomitcertainsynonyms or otherwise relatedtermsfrom his queryformulation. He mightgive, forexample,airplane but forget helicopter,jet, aircraft or ask for drugs,cocaine but omitopium,heroin, or LSD. Thisproblem can be solvedpartiallyusing a goodthesaurus. There are howevertwoproblemswith this solution. First,comprehensivegoodthesauri are not so readilyavailable or constructed;second,synonymity is a rathermisleadingterm in thiscontext. In informationretrieval,one is not reallylooking for synonyms, i.e. words that are equivalent to his query’sterms in a generallinguisticsense, but rather forsearchonyms, i.e. words that in the context of his specificsearch are indeedequivalent.

Most of the feedbackmethodsdeveloped in the last twentyyearswerebased on the statistical analysis of termsdistributions in specificdocuments and in the wholedatabase, and used the so-calledterm-term and term-documentmatrices. Theirresultswererathermixed. Newideaswere

30. introduced in 2], where a concept of localfeedback wasdefined, in whichproximityconstraints, localcontexts and onlyrelevantretrieveddocumentsweretakenintoaccount, and positiveresults werereportedthere for that approachwhentested in a batchenvironment on a small part of the Responsadatabase. Thisapproach was latelyincorporated in the on-lineversion,first on an experimentalbasis 8] and lately in an operationalone. Afterrunning the search andreturning a set ofretrieveddocuments, the systemreceives as feedbackfrom the user a list of thoseretrieved documentsthat arejudged by him to be relevant. The systemthenprocesses thisinformation and returns a list of up to twentypotentialsearchonyms for eachtermused in the query. Ourexperience with thiscomponent is verypromising. Thereturnedlistsusuallycontainsquite a few interesting searchonyms,sometimes in a totallyunexpectedandevenstartlingway. In an actualexamplerun on the system,asking for smoking in the proximity of health, the feedbackprocesssuggested the followingsearchonyms(amongothers):cigarettes,cigarello,tobacco,inhaling,chronic,risk, bronchitis,coughing Moreover, the feedbackcomponentwas usedalsosuccessfully -- in a peculiar“negativefeedback”fashion -- in order to enhance theprecision of the system, i.e. to reduce the number of non-relevantitemsretrieved 8].

13. Conclusions

Full-textretrievalsystems, once a merecuriosity,havenowcome of age, andcan certainly be consideredtoday a well-establishedmethodology,with a largenumber of applications, specially in the domain of legalmaterial,newspapers and magazines,scientificabstracts, etc.

It is quitesurprisingtherefore to find out thatmost of the availableandoperationalfull-text systems of today are still lackingmany of the featuresdescribedabove anddevelopedessentially in the late seventies andearlyeighties.Automaticmorphologicalanalysis,complexpatternmatching, proximityoperatorswithsentence,paragraph and negationcapabilities,dynamicthesaurusbuilding, short-contextbrowsing,localfeedbackand the like,should by now be standardcomponents of any

professionalfull-textsystem. Evenwhen a full-textsystem -- like the Responsaone -- doesinclude all of thesefeatures(andmore) as standardcapabilities, it is still to be lookedupon as a third- generationsystem(thefirst-generation onesbeing the crudebatchsystems of 1966-1975 and the second-generation onesbeing the simpleon-linesystems of 1975-1982).Researchanddevelopment effortsshould be directednow tofourth-generationsystems, in whichartificialintelligence techniqueswould be used to dramaticallyenhance the performance of the retrievalenvironmentwhile greatlysimplifying the user-systeminterface,reducing it to an almostnaturallanguagedialogue. Timewill showwhethersuchexpectationswill be realizedsoon.

References

1. R. Attar, Y. Choueka, N. Dershowitz and A.S.Fraenkel,KEDMA-Linguistictools in retrieval systems,Journal of ACM 25 (1978),52-66. 2. R. AttarandA.S.Fraenkel,Experiments in localmetricalfeedback in full-textretrievalsystems, Inform.Process. & Management 17 (1981),115-126.

3. P. Bratley and Y. Choueka,Processingtruncatedterms in documentretrievalsystems, InformationProcessing and Management 18 (1982),257-266. See alsoProc. of the 16th NationalConference of IPA,Jerusalem,1981.

4. Y. Choueka,Full-textsystems andresearch in the humanities,Computers and the Humanities 14 (1980),153-169. 5. Y. Choueka, M. Slae and A. Schreiber,TheResponsaproject:computerization of traditional Jewishlaw, in: Legal andLegislativeInformationProcessing (B. Erez,ed.),GreenwoodPress, 1980,ch.18,261-286.

6. Y. Choueka and S. Lusignan,Disambiguation by shortcontexts,Computers and the Humanities 19 (1985),147-157. 7. Y. Choueka,A.S.Fraenkel, S.T. Klein and E. Segal,Improvedtechniques for processing queries in full-textsystems,Proc. of the 10thAnnualInternationalACM-SIGIRConference on Research and Development in InformationRetrievalNewOrleans(1987),306-315.

8. S. Hanani,On-linelocalmethcalfeedbackcomponents in full-textretrievalsystems,M.Sc. thesis,Bar-IlanUniv.Ramat-Gan,Israel, 1987(Hebrew).

31. IconicCommunication: Image Realism and Meaning

Fang Sheng Liu, Ju Wei Tai

Institute of Automation,AcademiaSinica P. 0. Box 2728, Beijing100080, China

ABSTRACT: In iconic communication, an important problem is to visualize concepts such as data and operation with small images calledicons.So the relationbetween the realism of an image and the meaning it represents is an importantproblem to solve. Motivated by Chinese character category theory, this paper discusses how to construct a visual iconic human interface with realistic icons. It gives a set of principles to make icons more realistic and accurate in representingmeaning. Via our group of principles, an icon may form a situation together with some meaningfulimages. At the same time the group of principlesintroducesstructures into an icon. The paper also discusses the interpretation and the evaluation of an icon. Finally, it gives an example of an iconic interface of a databasesystem.

1. INTRODUCTION

The most often used meams by whichpeoplecommunicate is natural language.Naturallanguages are in some sense artificiallaguages, made by people. They are natural because they are closer to our daily life.In fact, naturallanguages are different from each other. There are languages based on phoneticletters, like English, as well as languages based on ideographs, such as Chinese.Phoneticletters conveymeaning by spellingwords,whileideographiclanguagesexpress meaning with pictographs or icons. So they have completelydifferent origins. They are the same in the sense that they are both string languages. In an ideographic language, a character has no direct relation with its pronunciation, every character has only one syllable,but the characteritselfexpresses a ineaning.In many cases, not only in science and technology, but also in daily life, people find that the representation of meaning with graphics is more natural and understandable.Arnheim says that iconicrepresentations are universal and easy to ecognize5J.In human computerinteraction it provides good human factors. When we mention the representation of meaning with icon or iconiclanguages, we are in fact talking about two things, the image and its meaning, i.e. a dual representation of a generalized icon (meaning, image) 1]. So we are faced with at least three problems: how to construct the image part (image or picture for short in the rest of the paper) to express the meaning accurately; what is the relationbetween an image and its meaning; and the interpretation of a generalized icon (icon for short in the rest of the paper). Expressingmeaning with icons is sometimesambiguous 2). Because an icon is a simplifiedpicture, the simpler an image, the more ambi

32. uous3]. On the other hand, there is no commonlyaccepteduniversal set of icons (4]. This paper is focused on iconiccommunication with betterimages,that is to say,thecreation of realistic icons and the relationbetween image realism and meaning. The Chinese language is very successfull in expressingmeaning with small piátures or images. Chinese characters are pictographs. They are the cream of the Chineseculture.Chinesecharactercategory theorycalled “ “ generates and interprets Chinese characters from small meaningfulpictures and combines them to form new ones. ~ Ic1~ “ says that there are six categories of Chinesecharacters. Category 1: ( if~* ) pictographiccharacters or pictographs; Category 2: ( 4~ ) self—explanatorycharacters; Category 3: ( ~ ) associative compounds, which are formed by combining two or more elements, each with a meaning of its own, to create a new meaning; Category 4: ( j~ ) pictophoneticcharacters, with one element indicatingmeaning and the other sound; Category 5: ( $~t ) mutuallyexplanatory or synonymouscharacters; Category 6: ( 1~f~ ) phonetic loan characters,charactersadopted to representhomophones. This is the architecture at the level of Chinesecharacter set. We will see the internalstructure in a Chinesecharacterbelow. Chinesecharacters have been evolving from ancient forms that were small pictures into the modern forms, Chinese ideogram. The evolutionconsists of seven stages:inscriptions on bones or tortoise shells, inscriptions on ancient bronze objects, seal character (a style of Chinese calligraphy),official script in Han Dynasty (an ancient style of calligraphy),cursive hand (characters executed swiftly and with strokes flowing together),regular script, and running hand. For these and other reasons, there are often Chinese characters that have more than one writing form, i.e. ancient form, modern (simpified) forms,avariant form, and/or the originalcomplex form. In the Chinese language, the original meaning of a Chinese charatermeans a concrete thing. But its extenedmeaning covers a larger area. In fact, human languages shoulddenoteconcretethings first, and then abstract, even mysteriousthingssecond. This paper got a lot of enlightenment from the Chinesecharacter categorytheory.

2. SOME PRINCIPLES OF ICON CONSTRUCTION

In this section we discussprinciples for making the image part more realistic and accurate in representing the meaning part of an icon. So, we focus on what appearance an icon should have to express a meaning, and the organization of icons.

2.1 Principle 1: AppearanceReference

No matter what concepts we are faced with, we have three classes of things: things that can be found in nature, artificial things made by man, and abstractthings. To give a concept about a thing we can show its picture or image directly. This is virtually the most superficial and basic method. For a concreteconcept,such as natural and artificialthings,we may take their natural or external shape as the icon image. If a natural thing has no visibleshape, we may take

33. the graphicrepresentation of its phenomenaand/or behavior, etc as the image representing a meaning.Examples are shown in Fig.l. The image of an artificial thing may also take its external form. In many cases, abstract conceptscannot be represented with an image as directly as above. The basic idea for describingabstract concepts is to make the abstract the concrete or to use the following principles, such as Principle 5 and 6. We will use the description of meaning by combinationand/or similarity or analogy. In addition, we may use the artificialsymbols that have been stan pA/\ (/1

(a) (b) (c) (d) 4 (e) (f) (g) (h)

Fig.l VisibleAppearance as Image (a) mountain ( jJj ); (b) fire ( )~ ); (c) steam, river, water ( ~j ); (d) electricity; (e) eye ( ~ ) ; (f) wood, tree ( ~ ); (g) bird ( f~ ); (h) humanbeing ( )~ )

dardized or have becomeconventions.

(a) (b) (c) (d)

Fig.2 Representation of AbstractConcepts with Image (a) left ( opposite to right ) ( ~ ); (b) right (opposite to left) ( ~ ); (c) diode; (d) walk ( ~ )

The image in Fig.2 (a) is a left hand drawn in a simplified style, so its meaning is “left”, with the modern Chinesecharacter in the right-mostbracket of the legend.The image in Fig.2 (d) shows a pair of walking foots. Hence, the meaning of the image is ~~walkt~. Further more, we may combine some small pictures with the image of existing icons made by Principle 1 or remove small pictures from them to create new icons. So these kinds of icons are a little bit different from those created directly from small pictures. Fig.3 gives two examples. LU

(a) (b)

Fig.3 Concept Represented by Images (a) fruit ( ~ ); (b) nest,bird’s nest ( ~ )

34. In Fig.3 (a),the tree is full of fruits.This image can also be drawn as “ “. So we get the meaning“fruit” from it. But we cannot get a definitemeaning from the small picture “6”~ In (b), by combining “ “ with the icon for “tree”, we get an image showing that birds are in the nest on a tree.So we have its meaning“nest,bird’snest”.

2.2 Principle 2: MarkingEmphasis

We can mark a specialposition or area of an image with marker to emphasize the “meaning”,and to make clear its local semantics and its relation to other parts of the icon image. This is often done on an icon formed via Principle 1. Usually we use a pointmarker,arrow, circle, color, blinking icon, or a highlighted icon, etc to make the “meaning” evident to the user. There are some examples in Fig.4. I (a) (b)

Fig.4 Making Meaning Evident (a) root (* ); (b) end,tip ( ~ )

In Fig.4 (a), an emphasismarker is put at the root part of a tree. Hence, an icon expressing the meaning “root” is created. In (b), a point marker is put on the top part of a tree. Hence, an icon representing the meaning “end” is produced.

2.3 Principle 3: Reference by Similarity-Peculiarity

By similarity, we create half of an image of an icon,which shows the class to which the icon belongs or its relation that the icon has to a definite concept. The half may be pictographic. By peculiarity, we construct the other half of the image, which shows its peculiar or characteristicmeaning or properties.

(a) (b)

Fig.5 MeaningRepresented by ConceptConstraints (a) femaleanimal (~E~ ); (b) male animal ( 4± )

In Fig.5 (a), the left half image is an ox head, meaning that the icon is used to describeanimal-relatedconcepts. The right half of the image is image for ewe, sow, mare, etc. Hence, the meaning is “femaleanimal”. In Fig.5 (b), the right half of the icon is a image for ram, boar, horse, etc. So the meaning is “male animal”. With this principle, more icons can be created to represent the meaningneeded. In fact, this principle goes a step further than the above three. Many concrete things and abstract concepts are difficult to expresse with Principle 1 and other principles in the paper. For example, “fish” is a general concept. But we have thousands kinds of fish. If for every kind we create an icon, a

35. large number of icons for “fish” will be produced. So we need a more efficientprinciple to solve the problem, i.e. Principle3.With this principle,we only need to combine an icon that can show the features of a special kind of fish with an icon “ “ to produce an icon representing the concept for the special kind.

2.4 Principle 4: Borrowing from Other Representations

By borrowing,weexpress the images that are not easilyexpressed directly. The borrowed image may have a visible external shape and take the “meaning” as its property, usage, behavior,function, etc. Fig.6 gives two examples. 8

(a) (b)

Fig.6 Creating Icons by Borrowing (a) directioncontrol; (b) fragile

2.5 Principle 5: Extending the Meaning of Existing Icons

In some cases,we may extend the meaning of an existing icon to a larger but related area. This is of course a step further in iconic communication. But conventions may be needed. On the other hand, with this principle,we may make full use of the existing icons and avoid the unnecessaryincrease in the number of the existing icons. For example, the meaning of the icon for “fruit” in 2.1 is extended to express the meaning of “result,consequence”, while the icon for “limit” in 2.6 gets an extendedmeaning of “restrict”.

2.6 Principle6:Interpretation via Integration and Transformation

Combining two or more icons may form a new icon image. The resultant image is often a meaningful image that narrates a process, a situation, a result, etc. Like Chinese character generation, we have differentsubrules to guide the formation of a new icon image 4]. In Fig.7 are severalexamples. In Fig.7 (a),the left half is an

(a) (b) (C) (d) (e)

Fig.7 MeaningsthroughConceptualInterpretation (a) ascend a height, go up ( ~ );(b) descend, fall, lower ( ~ ); (c) limit, bounds, limitation ( ~ ); (d) search ( 1~ ); (e) wade, ford (~) ) image for “mountain”, the right half is an image for “walk”; therefore,thepicture says that two feet are walkingupwards along a mountain. That is to say, the meaning is “go up”. In a similar way, the interpretation of the icon in Fig.7 (b) is “descend”, and the meaning of the icon in Fig. (e) is “wade, ford”. For the icon in

36. Fig.7 (c), we get the observation that a person “ f” turning his head is looking “ “ at the high moutain “ “. Certainly his line of sight cannot go beyond the mountain.Hence we get the meaning “limitation”. Fig.7 (d) gives an icon,theupper part of which is the simplified icon for “house”,themiddle part of which is fire,and the lower part of which is a hand. So the interpretation is searching with a torch in hand, that is, “search”. In Fig.8,we have a group of icons related to hand. In Fig.8(a), a hand in the upper part is picking fruit on a tree in the lower part. So the icon represents the meaning of “pick, pluck, gather”.In Fig.8 (b), a hand in the right part is taking the ear in the left part. In ancientChina, the icon had the meaning of “cut down (sb’s) ear”. Through evolution, it got the current meaning “take, get, fetch”. Fig.8 (C) shows that a hand in the lower part is catching a bird in the upper part.So the icon has the meaning“capture, catch”. The left part of Fig.8 (d) is a hand.The lower right part is mortar. The upper right part is a pestle. So the interpretation of the image

,1

(a) (b) (C) (d)

Fig.8 A Set of Icons with “hand” Icon (a)pick, pluck, gather( * );(b)take, get, fetch( ~ ); (C) capture, catch ( ~ ); (d) insert, stick in ( 4~ )

in (d) is that a hand pokes into a mortar, that .is to say, “insert, stick in”.

2.7 Principle 7: Combination of Graphics with Text

The dual representation of a generalized icon can be extended to (meaning,graphic+text). “text” here refers to function words or key words. Sometimes the text part may be some symbols. In this way, the image part may be described vividly and more accurately. This principle may be nearer to Category 4 in some sense. For example,the meaning of image “® “is “no parking”, while the meaning of “ is “warning”.

2.8 Principle 8: Reference by Association and Comparison

For those icons that may have differentmeanings in different contexts, by attachingpictures with sharp contrast, we may make the desiredmeaningevident and obvious. The Fig.9 is an exainple.This is

Fig.9 An Icon in TransportationControl: car left ,bicycle right

an icon we often see on the street. It means differenttransport

37. vehicles take differentroads.

2.9 Principle 9: Reference by Amplificationand/orSelection

Amplificationmeans making the related part of an image evident and clear. Selection means selecting the image wanted from many images, this is in some sense consistent with Principle 2. We may, for instance, use a “ “ sign to select. The difference is that we have many images here.

This group of icon generationprinciples is not only efficient in creating primitive icons, but also efficient and effective in constructingcomposite icons. They can be used (in differentcombin ations) to form different parts of an image in an icon. When the principles are applied to create an image, not only meaningful small pictures are put together, but also structures between them are built. A typicalstructure is an attributedree2)(3].Furthermore, the constructed image in real systems may be of higher realism, includingexternal shape, greydegree, colors, and so on.

3. MEANING AND INTERPRETATION

The created icons for simplerand/or concreteconcepts may have simplerimages. Their interpreation. is easy. But for more complex images, the interpretation is important. In this case, the composite images of a Chinesecharacter,forexaxnple,may form a situation other than combination of irrelative small picturesrepresentingmeanings. In such a situation there are often subject, object and different adverbial modifiers. Also, a conceptual graph may be built to interpret the meaning.Theinterpretation is sometimesfocused on the process in such a situation,sometimes on the results,sometimes on the extended meaning, and sometimes on other aspects. An example is shown in Fig.8, where “hand” is the subject of behavior.On the other hand, the interpretation has its own logic. The interpretation of icons for different fish is an example. In that case,” “ is used to represent the intension of the concept “fish”. But by adding specialfeatures for a given kind of fish, we may make the new icon cover a narrower intension only focusing on the special kind. We may make an observation that in many interpretations of created icons, some parts of an- image may be changed to other small pictures without changing the meaning. For Chinese character generation only, the special small pictures in the image part are possibly used to represent a specialmeaning, since every Chinese character has its own originalmeaning and extendedmeanings.Also in old times many Chinesecharacters often had one meaning.Afteryears’ evolution only one of the characters is used and others are not used any longer. Generally speaking, we may also deal with such cases by applying the above principles to create icons in iconic communication. The interpretation of icons in Chinese languagerequires back groundknowledge,such as knowledge about nature, society, and so on. As for Chinesecharacters, they were not created in one day. Some of them were createdfirst,and then new icons for more complexconcepts were obtained from images and existing icons. Some examples are illustrated in Fig.lO. Fig.lO (a) shows an icon for “knife”. Fig. 10

38. (b) is a picture in which a knife cuts a thing into pieces. So the meaning is “separate”. Fig.lO (C) is the today’s form of a Chinese character with the meaning of “break off with the fingers and thumb” The left and right part of the character are two hands. The middle part is a character shown in Fig.lO (b). So the explanation is I

(a) (b) (C) (d)

Fig.lO Example Icons for ComplexConcepts (a) knife, sword ( 3] ); (b) divide, separate ( ~ ); (c) break off with the fingers and thumb;(d) delete ( ~ )

“separate with two hands”, that is, the meaning mentionedabove. In Fig.lO (d),the left part is an icon for “volume, book”, since during ancient times a volume or a book made of bamboo slips was stringed together. The right part is the icon for “knife”. Hence, the whole icon produces an interpretation that a knife deletes the unwanted part and holds the desired part. That is the meaning“delete”.

4. AN EXAMPLE OF ICONICINTERFACE IN A DATABASESYSTEM

In computersystems, both data and operations can be represented with cons2].Suppose that we have a multimediadatabasesystem for data private cars mariagement,inwhich the owner’spicture with other such as his signature, and the picture of his car and other data parameters are stored. In the database system, we have a manipulationlanguage based on < D, 0 >, where D is the set of data; O is the set of operations.

Note: 0 = { find next, insert, delete, update ). Suppose that its visual iconic interfaceprovides not only icons for data and operations but also iconicframes to combine iconized data and iconizedoperations to form new icons which are called iconic sentences. First, we iconize the system in Fig.ll to build an interactive human computerinterface. The icon in (e) is a borrowed ILJI

_ (a) (b) (c) (d) (e) ?

(f) (g) (h) (i) (j)

Fig.ll The Basic Icon Set for ExampleDatabaseSystem (a) Icon for cars and related data; (b) Icon for owner and related data; (C) Icon for screen; (d) Icon for “file”; (e) Icon for “database”; (f) Icon for “data movement”; (g) Icon for “condition”; (h) Icon for “loop”; (i) Icon for “find”; (j) Icon for “modularize”

39. denotation. The icon in (g) may be regarded as one by Principle 5.. The other icons in Fig.ll are created by Principle 1. In the rest of the section,wediscuss the construction of icons for data,relations, iconizedoperations or commands to communicate with computers, by the examples shown in Fig.l2. Obviously, we can construct the icons in Fig.l2 by applying the principles in section 2. In this iconic interface, there are two areas called working area and icon area. Icon and iconizedobjects such as iconized data are different. The latter is called an icon instance.Every time icon data is used (e.g. put into a “modularize” icon in an icon sentence, or assigned) or an icon command is executed in a working area, it becomes an icon instance. On the other hand, we can construct icons and put them in icon area to be used in the future.Dataassigment can be implemented by form—filling.”Modularize” icons make separate data or relations a whole unit for latter use (e.g. for data: “connect”operations,for relation: “and” and “or” operations). In an iconicsentence,the sub— icon in the innermost “modularize” icon is executed first, and then

(a) (b) (c) (d) (e)

(f) (g) (i) (j)

Fig.].2 Iconic Data ManipulationLanguage (a) data assignment to car; (b) data assignment to both owner and car; (c) condition about car; (d) loop with condition; (e) insert data in modular frame into database; (f) insert data about car into database; (g) delete data in modular frame from database; (h) update data in the modular in database icon with data outside; (i) find data according to the given condition;(j) find data according to the condition a definite number of times and then display the results on screen the sub—icon in outer “modularize” icon is executed. The result of icon execution is data. In this way, new icons of data, relations and higher level operations can be constructed;therefore, an iconic data manipulationlanguage is obtained.

5. CONCLUSIONS

There are many reasons for the ambiguity in iconicrepresenta tions of meaning. One is the lack of conventions and standardiza tion. Another one is that the meaning of iconic units is related to the situation and context in which~ the icon is interpreted. Still another is the lack of dynamicbehavior in iconicrepresentations. The principles in this paper may be applied to primitive icons as well as compositeicons.They can be used for both simpleconcepts and complexconcepts, to representconcrete concepts and abstract

40. concepts. They make the image part more realistic and closer to its meaning. By studying the Chinesecharactercategorytheory, we may better understand how people interpret icons and eventually develop a betterpsychology of visual iconic human computerinteractions. Chinesecharactercategory theory is closelyrelated to Chinese culture,philosophy,history,geography,religions,and so on. So the principles can be misused.AlthoughChinese is based on ideographs, the Chineselanguage itself is still a stringlanguage. Our research goal is to develop 2-D (even 3-D) languages, since iconiclanguage is universal, easy to understand,powerful in describing many 2—D problems and so on.It should be pointed out that iconiclanguage and form language are often used together. Iconic communication and stringlanguage—basedcommunicationcannotreplace each other. The interpretation of an icon depends on the concepts as well as the conceptual structures that are represented in the icon. The evaluation of icon construction puts more emphases on the ease with which the icon is interpreted. The easier the interpretation is, the better the construction of an icon. If an icon is possibly interpretedambiguously, the construction of an icon is a failure. In the future, more work should be done on the interpretation of an icon, especially the conceptual interpretation based on the structures of meaning. This needs support from databasesystems and artificialintellegence, as well as 2—D computationalliguistics.

REFERENCES

1] S. K. Chang, Visual Languages: A Tutorial and Survey, IEEE Software Vol. 4, No. 1, January 1987 2] F. S. Liu and J. W. Tai, Some Principles of Icon Construction, in Visual Database T. L. Kunii ElsevierScience Systems , ed., Publishing B.V., The Netherlands, 1989 3] F. S. Liu, The Modelling and Analysis of Human Computer Interaction and Visual Representations, ~Ph.D. Dissertation, Institute of Automation,AcademiaSinica, Beijing,China,August, 1989 4] R. R. Korfhage and N. A. Korfhage,Criteria for IconicLanguages in VisualLnaguages, eds. S. K. Chang et al., Plenum, 1986 5] R. Arnheim, VisualThinking,University of CaliforniaPress,1969 6) Xu Shen (author), Duan YuCai (explains),Exposition and Explana tion of ChineseCharacters, ShahgHai Ancient Book Publishing House, 1988, 2nd Edition 7] Lu SiMian, Four Kinds of Chinese Character Studies,ShangHai EducationalPublishingHouse, June, 1985, 1st Edition 8] S. L. Tanimoto,VisualRepresentation in the Game of Aduinbration, Proc. of IEEE 1987 Workshop on VisualLanguages,August, 1987, Linkoping, Sweden 9] K. Furuya,S.Tayama,E.Kutsuwada and K. Matsumura, Approach to Standardize icons, Proc. of IEEE 1987 Workshop on Visual Languages,August, 1987, Linkoping, Sweden 10] C.J. Date, An Introduction to Database Systems, Volume I, Addison-WesleyPublishingCompany, 1981 0

41. Overview ofJapaneseKanji

andKoreanHangulSupport

on theADABAS/NATURALSystem

YoshiokiIshii 1) andHidekiNishimoto 2)

1) SoftwareAG of FarEastInc.,JAPAN 2) RyukokuUniversity,JAPAN

ABSTRACT

Thispaperdescribes the progress ofJapaneselanguagesupport at Software AG of Far Eastand a study on KoreanHangul as well as JapaneseKanji. In addition to, a methodcalledDBCS(double-bytecharacter set) support is introduced. Relatedproblems are discussed and solved by adapting the fourthgeneration language,NATURAL, as a SoftwareAGproduct.

1. JapaneseKanjiand KoreanHangul

Japaneselanguageborrowedextensively from ChineseHanzi by way of Korea to form their own writtenlanguageabout1,500years ago. Later, the characterscame to symbolizenativeJapanesewords similar in meaning to that in Chinese.

ModernJapanese is written as a mixture of ideograms,Kanji, and nativephoneticletters,Kana. The

Kanaphoneticalphabetexists as Hiragana and Katakana, each of whichservesdifferentpurposes and is stylisticallydifferent. A typical piece of JapanesewritingcontainsKanji,Hiragana, and sometimes

Katakana. Each of the Kanasyllabariesconsists of 46 basicsymbols.whileKanjicharacters are limited to about2,000symbols for official anddailyuse. (Fig.1-1)

7’.)~ ‘) ~ ~ ( America and Japan

_____ / / Katakana Hiragana Kanji

Fig~ 1-1 Katakana,HiraganaandKanjicharacters

Englishsentences are horizontallywritten,from the left to the right. In contrast,Japanesesentences are verticallywritten,from the top to the bottom. However, it has recentlybecome very popular to write

Japanesesentences in the Englishformat. Computer or wordprocessoroutputs are usuallyhorizontally written. Figure 1-2 shows a sentencefragmenttakenfrom a newspaperarticle that was verticallywritten andproduced by a wordprocessor as an example ofJapanesewritten in the Englishformat.

42. When writingvertically, the sentences are formatted from the right to the left of the page.

Consequently, the page sequence of Japanesepublications that are verticallywritten is arranged in a differentmanner than whenhorizontallywritten and the publicationsopenfrom left to right. This also happens in the Chineselanguage, and is one of the biggestdifferencesbetweenEastern and Western cultures. Unfortunately,terminals that can displayverticallywrittencharacters have not yet been developed.

7~ =Q)

~ ~ d,JS tC~~•SW1~C ~ ‘I~)-.~ 6~,r~’ ~ I ~ ~ j- ~ 1- ~Q) ~ ~LY~ U ~M’~ ~ ~

3~, ~*~15~: ~ .~. — 5’.i’~— ri: ~c,*t~i B~~{~+WI ~ D~’ ) ~ ~ Lt.~. ~ c,~:*~U: ‘J ~J ~ (~.~ < t~. ~ ~ .~ ~ ~

~t0 ~ ~ =

Fig.1-2NewspaperArticlesandJapaneseWordprocessorOutputs

Japan’scomputerizationstartedwith the use of computers‘made in U.S.A.’ At thattime, all the printer charactersprinted werealphanumeric.

A character set consisting of 63 characters was established and Katakanacharacterswereusedinstead of English, but sentenceswere still horizontallywritten. Aftermainframecomputermakers such as IBM released byte addressingmachines in 1965, the RomanEnglishalphabet and numerals was added to the Katakanacharactersandthisstylecontinueduntilabout1980. After1980, theJapanesecharactersstarted beingutilized to therefullestextent by computers. Until the 1970’s,Japanesecharacterswereonlyused for names and addresses, but today even text data can be stored in the databases.Computers with Kanji terminals,whichrepresentone character by twobytes anddisplay it on the screen as a Japanesecharacter, havemade a lot of progress.

JAPANESE CHARACTER HEX CODE STRINGS

B * OE4562456648E7449A4F5848F2~f 4—IBM

28C6FCCBDCB8ECA4CEB4C1BBFA~ *—Fujitsu

Englishtranslation of* : JapaneseKanji.

Fig.1-3Japanesecharacterandcoderepresentation

43. In Korean, the use of Hanzi is graduallybeingphased out, eventhough the Koreanlanguage was also greatlyaffected by ChineseHanzicharacters.

NowKorean is usuallywritten as a purephoneticalphabet,calledHangul, with no Hanzi. Hangul was invented in 1446during the reign of Sejong, the fourthKing of the Yi dynasty. It is the onlyscriptknown to havebeendesigned by a committee.

This phoneticalphabet has 24 letters,consisting of 10 vowels and 14 consonants. The letters are designed in such a way as be grouped into squareclusters of two, three, or occasionally four letters. Each clusterrepresents one spokensyllable of the language. The number of charactersrequired for general use comes to about2,000clusters(Fig.1-4).

KOREAN HANGIJL DATA LETTERS

* 0ED0658B690F

* Englishtranslation of : Hangul

Fig.1-4KoreanHangulandcoderepresentation

2. CodePresentationandTypingMethod

Fromearly on in the Japanesecomputermarket,there has been an intenseinterest in establishing a system to handleKanji.However, it has taken a long time and vast sums of money to createsomething compatiblewith the existingone-bytecharactersystems and to developdevicespeculiar to Japaneseusage

(e.g.highresolutiondisplays or graphicsprinterswithspecialfonts).

Generally one byte is used to denote a character, but a byte witheight bits can provide for as many as 256characters.

JapaneseKanjimust be represented by codes,usually two byteslong,becausethere are many different kinds ofcharacters(Fig.1-3). Thispoint is pressingcomputermanufacturers andvendors in Japan to greatly revise the standardsystems.

-

A greatnumber of Kanjitypingmethodshavebeendeveloped to date. All of these fall into threemain categories;two-dimensionalselectionarrays, in which the thousands ofavailablecharacters are laid out as a huge two-dimensional array of keys or pen targets,codingschemes, in which the typistanalyzes each character in turn and then figures out or recalls an input code derivedfromsome form of systemcoding rules, and phoneticconversion, in which the typistspells out the sound of the desired word unit and the computerlocatesthatword in a dictionarystored on one of its disk. In the lastcategory, if severalwordshave the samesound, the typist can indicate the one ntended1].Currently,phoneticconversion(Fig.2-1) is the mostpopularmethod for generalusers in Japan by reason of its having so few rules to learn and its use of

English-typingskills,amongotherreasons.

44. K’:~:.1’ I ~“—~ s~ Ji:i~i” 9L~’

1 ( :EiJ’ ) 1 I3*~

~: (j ) ~ 9~

E ( i’—~ ) I ~

F (~ ) 1~ 2~ 3~ 4~J 5~ii 6~ IE 8l~ ~ hi ‘~ 15 ~& 16 ~ 17 !g~ 18 5~. 19 ~ 20 1~ 21 j~ 22 ~ 23 J~ 2/i rj~ 25 ~ 26 ~ 27 ~ 28 ~ 29 ~ 30 ff~ 31 ~ 32 ~i 33 ~ 3/i ~ 35 ~t

1 C EJ~’” ~~ty ( ~ f. ~ !I~. EJ’ J jj

F (7 )

3 C ?9L~ ) 1 ~ 2 P~fl 3 ~

~ ~ *

Englishtranslation of*: JapanesedataincludesKanjicharacters.

Fig.2-1 Anexample ofthe phoneticconversionmethod

A Hangulclustermaycome in differentshapes and positions,depending on the surroundingcontext. A computerizedHangultypingsystemcan have a singlegenerickey or a Romanizedphoneticsymbol for each

Hangulletter.Thecomputer is relied on to decidewhich is the propergraphicform andpositioneachletter in relation to the letter or letterssurrounding it.

Computerdisplays are verygood for “movable”typingbecauseeachnewletterusuallyrequiresrevising theformandplacement of the lettersthathavepreviouslybeenyped1].

Figure 2-2 represents the typing of a four-letterword. The thirdlettertyped,’L’, has not beenfixed at the bottomend of thefirstcharacter or the top of the secondcharacter as it hasjustbeentyped.However, the nextletter,’ ‘, is a vowel, so the systemunderstandsthatthe latter is proper in thiscase. (Fig.2-2).

Typed Displayed Letters Clusters

U

F ‘-F ‘a

Fig.2-2Movabletyping of Hangul

45. 3.ProductProgress

SAGFE has establishedseveralvarioussystemenvironments thatare peculiar to Japaneseusagesince

JapaneseKanjiwassupported on ADABAS* for the firsttime in 1978.

Thissupport of the Japaneselanguage,whichstarted by puttingcharacterstrings on a specialKanji printer, has recently been improving with the development of more advancedterminalequipment and controllers.

The history of supportingKanji on ourADABASsystem in Japancan be divided into the followingfour stages:(Fig.3-1) 2]

Year

~ 1978 79 80 81 52 83 84 85 86 87 88 89 Development Hitachi’sT560-20 IBM’s13278 15550 New of Kanji andFujitsu’s made it was emulators characters Terminals -- F6650made it possible to intro- for5550 can be (Periodthat possible to output output duced. were output by started tO Users kanjicharacters. Kanji introduced, manyP(~s. useterminals) characters.

Release Release of Release of Release of of NATURAL NATURAL. NATURAL 2 NATURAL Vii V1.2

I Output of I I Output of II Mapfunctions I I KMAP I I Development I I Kanji I I Kanji I forrulersand function of KAPRI 2 I characters by I I characters to i I Kanji I bafrJ~ I terminals by II characters I printers ADASCRIPT generated in l Process batches of Kanji I Output of II Conversion I Release of I Start of I Support I Kanji Ii function I KAPRI I Development I characters for Kanji I I ~ KanJI I characters I I sunuort by I b1~’ATURAL andKana I I NATURAL I characters itself

Fig.3-1Thehistory of KanjiSupport in SAGFE

Stage I) PuttingKanjicharacters on a specialprinterusingbatchprocessing

In 1978, we firstsupportedJapaneseKanji on are lawretrievalsystemdeveloped by an ADABASsystem at theAdministrativeManagementAgency(theManagementandCoordinationAgency at present).

Thesystem is for retrieving law and regulationtexts,which the userdoes by specifying key words. The textsare thenprinted out on a printer.

Sincethere was not any terminalequipment for displayingKanjicharacters on the market, a special

Kanjiprinterwasput on after the retrieval by the batchprocessingmode.

Stage II) SupportingearlyKanjiterminals

As soon as earlyKanji-supportedterminalsentered the market in 1980, we triedusing the user-friendly querysystemDASCRIPT3] on an ADABASwith the terminal.

46. In 1981, we succeeded at Kyodo Oil Company in adopting a method by which a map with Kanji charactersandruledlinescould be created in advance.

StageIll) Providing a phoneticconversiontypingmethod

In 1982,SoftwareAGsucceeded in equipping the fourthgenerationlanguageNATURAL*systemwith a phoneticconversionmethod. The typistfirstenters the Hiraganaspelling of a text directlyenclosed with apostrophes, then the word-unithomophones are searched for in the dictionary.After they appear on the This is the methodused screen, the operator can choose the number of the appropriateword. samepopular for personalcomputers in Japan. Additionally, our method can be implemented on any non-intelligent terminalbecause it does notrequire a specialdictionary or an additionalprocessor on the terminal.

Stage IV) Developingthe environmentfor DBCSsupport

In 1982, an importantproblemarosewhen the new version of NATURAL(V1.2) was releasedsince it wasequippedwith a MapEditorfacility for designingthe directoutputimage on the screen.

Processing of the screen I/O is verydifficultwith our method in thiscasebecause the physicallength of

the data going into the terminal is greater by some bytes for the control-codes than the length of the charactersdisplayed on the screen.

This the In 1984, IBM developed a virtualterminalemulator for personalcomputers. facilityadapts data has been same Kanjishiftcodes as the Japaneseproducts use. Sincethen,Kanjicharacter mostly put into the computerthrough a phoneticconversionprocessoron.anintelligentterminal or on a work-station. Consequently,mainframesmainlyundertake datamanipulationsafter the Kanjihasbeenput in.

Thefirstversion of NATURAL2wasformallyreleased in 1987.Some of its screenoperationshavebeen

stronglyreinforced. In the sameyear,SAGFE developed and released the KanjiProto-typingInterface 2 KAPRI2)5]product,whichcorresponds to the NATURAL2system. Recently,SAG has decided to design and developdouble-bytecharacterset(DBCS)products in order to completelysupport all the multi- characters sets in the world. There are severalmanufacturers in Japan,besidesIBM that are developing productscompatible with IBM software. Products for the operatingsystem, TP monitor, and terminal supported by theADABAS/NATURALsystemare shown in Fig.3-2.

* is used at ADABAS is a DBMS for mainframesdeveloped by SoftwareAG, Germany, in 1971 and now is 3,500sitesaround the world.NATURAL, a 4th GenerationLanguagealsodeveloped by SoftwareAG, used at 3,300 sites with ADABAS and NATURALbeing used together in most cases. At present,

ADABAS and NATURAL are run on approximately 500 mainframes in Japan. We had to implement

andimprove a number ofitems in order to supporttheJapanesecharactersmentionedabove.

47. Note 1: Manufactñrer OS TP monitor Terminal (Note)) If a terminal can be connectd to other manufacturers’ OSs and TP monitors, IBM MVS,MVS/XA COM-PLETE 13270 then the terminal is able to work with VS1 TSO 13278 those OSs and TP monitors, i.e. terminals be of OSs and DOSIVSE CICS 15550(andotherPCswith can supportedindependent DOS, TP monitors. VM/CMS IMS Japanese3270emulators) N6300(13270compatible)

FUJITSU OSIVIF4MSP COM-PLETE F9526 OSIV/X8FSP TSS F665X AIM F668X PCswith6650emulators (eg.FMR)

HiTACHI VOS 3 TSS H9415 VOS 2 ADM T560-20 VOS 1 DCCM!1 PCswith560-20 DCCMIII emulators(eg.2020) TMSI3V

MiTSUBISHI GOSIVS TSS M4378 CIMS PCswithemulators

Fig.3-2ADABASINATURALenvironmentssupported in Japan

4.DBCSSupport oftheApproach

Problemscommon to the support of handlingmulti-characters are discussed in this section. The first problem is that several code notationsrepresenting the same character set already exist. In Japan, manufacturersnaturallyestablishedoriginal code specificationsbefore the computerizedcode notation of

eachcharacterbecamestandardizedwithin the country. The twotypes ofcodetables for JapaneseKanji are shown in Fig.4-1.

\ Second \ Second’ (Hexadecimal (Hexadecimal \BYte Notation) NBYta Notation) First\ Firs’~\ /tO CO 50 F? 20 40 60 80 AO CO EQ FF Byte “ 20 40 60 80 Byte “

20 20 Extcndcdnon-Kanji characters V 40 40

BasicKanjicharacters ~

Basic GO ~ ExtendedKnnjicharacLers harOcters 6Li Uscr-deflncdKanjicharacters 80 80

AO AO

Co Co

EO EO (Hexa (liexa decimal decimal Notation) Fl Notation) FF

Fig.4-lIBMandJISKanjicodes

48. Although the IBMKanjicode is organized as an expansion of EBCDIC, the JEFcode by Fujitsu and the

KEIScode by Hitachiadopt the shiftedJIScode. In a distributedprocessingenvironment,characterstrings

thatbelong to the samecharactersets,even if based on differentcodenotations,must be correctlyexecuted

in comparison or move operations.Consequently, a data transformationfunction may be additionally

required,sincethis is seen as a veryimportantrole ofthe databasemanagementsystem.

Thesameprogramwill be executed.

F6650terminal 15550terminal NATURA(J1DABAS

Fig.4-2ADABAS/NATURALdistributedenvironment

Second,there may be manydifficulties in the handling of the shiftcontrolcodes. In a terminal data stream,double-bytecharacters are usuallydistinguished from single-bytecharacters by checking the beginningandend ofstringswith a ShiftOut/ShiftIn(SO/SI)controlcode.

Each manufacturer has its own specifications for these codes, and this is a seriousproblem in a distributedprocessingenvironment.

A databasemanagementsystemshouldalso be able to completelytransform the controlcodesused at any site(Fig.4-2).Additionally, the internalprocedures for data manipulation - Search,Move, Edit and others - should be revisedusingvariousoptions to keep the resultsconsistent(Fig.4-3).

0010** 0020 ** NONSUPPORTEDDBCS FACILITYPROGRAMSAMPLE 0030 ** 0040 ** MOVE & COMPRESS 0050MOVE• • TO #WORK1(A14) 0060MOVE#WORK1 TO #WORK2(A6) 0070COMPRESS#WORK2 JAPAN TO #WORK3(A14)LEAVING NO SPACE 0080DISPLAY#WORK2 / #WORK2(EM H(6)) #WORK3 / #WORK3(EM • H(14)) 0090END

#WORK2 #WORK3

E3~IM —3characterrepresentation 0E4562456648 0E4562456648D1C1D7CID5404040 -+HEXnotation

(afterMOVE) (afterCOMPRESS)

Somedatamay be corruptedwhen an SI code is truncatedafterexecuting the MOVEstatementand the SI code for DBCSdatacan not be specified.

Fig.4-3Example ofdatamanipulationwithtrouble 49. The doublebytecharacter set (DBCS)support 4] of ADABAS/NATURALconsists of two products, the ADABAS/NATURALcodestandardizationand the DBCSfacility.

The first product is designed to correctlyprocess data manipulations, even amongdifferent code notationsbelonging to the samecharacterset.

In the ADABAS/NATURALworld, all representations of characterstrings will usuallyconform to a standardcodenotationthat hasbeenestablished for eachcharacter set in advance(Fig.4-4). DBMSshould I/O keep track of the environments of every site and give a proper code transformationduring communicationonly. Consequently, the internalprocedures of the datamanipulation can keep the results consistentandeachsitedoes notneedanyadditionaltransformationprograms or hugetables.

lull ililitililt~ Ax

Fig.4-4ADABASINATURALstandardcodes

50. The secondproduct, the DBCSfacility, has been developed so that all NATURALoperations can completelyaccepteveryvariable,name, and data valuedescribed by any double-bytecharacter as well as anysingle-bytecharacterdefined as EBCDIC or ASCII(Fig.4-5).

13:50:34.1 ~.-~};}*‘~ ~ SYSTEM

1. 5PJ~I .J~

2. ~i~I ~

3; ~‘} ~*

4. 4~ 1*

9. ~

~ ~Ms~AI.a

Enter—PFI—PF2--—PF3---PF4-——PF5---PF6---PF7”PF8---P19”4F10”PflI”PFIZ”

Englishtranslation ofthemenuwritten in Hangul:

Staffdrivingcarmanagementsystem 1. Dailydrivinginput 2. Dailydrivinginquiry 3. Driverregistration 4. Carregistration 9. Exit Pleaseselectnumber: —

Fig. 4-5 ScreensupportedDBCS

5. Conclusion

We plan to design a completeDBCSsupportfacilityandbuild it into the nucleus of the ADABAS/NATURAL system in order to correctlydealwithdouble-bytecharacters for variousdatamanipulations.Thismaymake it easier to buildapplications for handling othermulti-characterssets, in addition to Kanji or Hangul, if the character data is managed on the ADABAS/NATURALstandard code system and performed using the

DBCSfacility we havepresented.

REFERENCES

1] J.D.Becker,”TypingChinese,Japanese,and Korea”,COMPUTER,Jan.1985,pp.27-34. 2] M.Shimada et al., “Supporting the Character Sets of Japanese Kanji and KoreanHangul in the

ADABAS/NATURALSystem”, Proc. of mt. Symp. on DatabaseSystems for AdvancedApplications, 1989,pp.3-9.

3] P.Schnell,”Implementation and Engineering of a Production-orientedDBMS”,INFORMATION PROCESSING 83, ElsevierSciencePublishersB.V.,1983,pp.637-646. 41 SoftwareAG,“DoubleByteCharacterSet”,SoftwareAGInternalDocument,1988 5] SoftwareAG ofFarEast,“KAPRIReferenceManual”(InJapanese),Tokyo,1988 6] Y. Ishii,“IntroducingADABAS to JananeseMarket”,DatabaseEngineeringMar.1983. 7] J. Martin,“4GLFourthGenerationLanguages”,PrenticeHall,1986.

51. Call for Papers

T 9th International Conference on Entity-RelationshipApproach

October 8-10, 1990, Lausanne,Switzerland

ConferenceChairman : DennisTsichritzis,University of Geneva AmericanConferenceCo-Chair : KathiHogsheadDavis,University of NorthernIllinois ProgramCommitteeChairman : HannuKangassalo,University of Tampere SteeringCommitteeVice-Chair : Sal March,University of Minnesota LocalOrganization : DatabaseLaboratory,SwissFederalInstitute of Technology

The Entity-RelationshipApproach is used in many data base and informationsystemsdesign methodologies and tools. Its use is expanding to newtypes of applications and the ER modelitself is beingdeveloped to meet newrequirementsposed on the advancedmodellingapproach. This ER conferencecontinues its tradition of bringingtogetherpractitioners and researchers to shareideas, experiencesandnewdevelopmentsand issuesrelated to the use of the ERApproach. Theconference consists of presentedpapers,both on the theory and practice of the ER Approach, as well as invited papers,tutorials,panelsessions, and demonstrations.

MajorTopics :

KnowledgeRepresentation

- - Models forknowledgerepresentation Schema as a knowledgebase

- - Schemaevolutionandmaintenance Modeltaxonomies

ConceptualModelling and Data Base Design

- Acquisition of modellingknowledge

- Supportsystems forconceptualmodellingand knowledgeengineering

- - Designmethodologies,tools andinterfaces Practicalexperiences

New Approaches in DatabaseManagementSystems and in InformationSystems

- Object-orientedsystems - Multimediadatabasesystems

- Schemaandquerytranslation - Special-purposesystems

- - Userinterfaces,visualquerylangages Practicalexperiences

- Distributed ER schema

Innovativetheories and applications

- - Accounting - Socialsciences Musiccomposition

- - - Linguistics Softwareengineering Urbandatamanagement

Submission of Papers :

Originalpapers are solicited on the abovetopics or any othertopicrelevant to the ER approach. Fivecopies ofcompletedpapersmust be received by the programcommitteechairman by April 2, 1990.Papersmust be written in English and may not exceed 25 doublespacedpages.Selected paperswill be published in DataandKnowledgeEngineering(North-Holland).

Important Dates : Papersdue : April 2, 1990 Notification of acceptance : June 10, 1990 Camerareadycopies due : August 6, 1990

All requests and submissionsrelated to the programshould be sent to the programcommittee chairman: HannuKangassalo University of Tampere Department of ComputerScience tel : +358-31-156778 P.O. Box 607 fax : +358-31-134478 SF-33101 Tampere,Finland e-mail:[email protected]

52. CALLFORPAPERS

InternationalConference on MultimediaInformationSystems

January16-18,1991

Co-Organizers: ACMSIGIRand TheInstitute of SystemsScience NationalUniversity of Singapore

In Co-operationwith: ACMSIGOIS Co-operationsoughtfrom: ACMSICGRAPH,ACMSIGCHI andACMSIGMOD

Theme: MULTIMEDIAINFORMATIONSYSTEMS - TOWARDSBETTERINTEGRATION

Thelastfewyearshaveseensomedramatictrends in computertechnology:Fasterworkstations,higherresolutiondisplays,cheaperand largerstorage,improvementsin graphicspackagesandprogress in broad-bandcommunications.Whileall thesetopicshaveforumswhere theresearch,developmentandapplicationeffortsarereported,there is yet to be a forumwhichpromotesdiscussionandexchange ofideas on how all theseinterestingtechnologiescouldconvergetowardsthe nextgeneration of toolsandapplications. Thiseffort is verytimely sinceoneperceives a confluence of threemajorindustrialsectors:Publishingand printing,Videoand Movie,and Computertechnology. Throughthisconference, we hope to bringtogetherpeoplefromvariousindividualtechnologies to exploit thecombined.alvantages in buildingintegratedsystemsusingsome or all of the technologies~

Papers Demonstrations

Technicalpapersreportingeitheroriginalresearch or interesting Proposalsare sought for demonstrationsduringtheconference. applications are invited in areasrelevant to the theme. Topics of These can be eitherstandalone or on-lineconnected to overseas interestincludebutarenotlimited to thefollowing: facilities.Proposalsshould be sent byJuly1,1990. Hardware:Optical Disk devices and their use, HighBandwidth Buses, Multimedianetworking,Communicationprotocols for SpecialSessions Multimediasystems, etc. SystemsSoftware:DistributedMultimediaarchitectures,Support Proposalsare invited for sessionswhichfocus for MultimediaDatabases,Object-orientedMultimediasystems, organizingspecial Co-operativeauthoringtools for MultimediaSystems,Video and on emergingtechnologies.Proposals for specialsessionsshould be StillImagecompression, etc. received by May 1, 1990. Content-basedretrieval of MultimediaObjects:Objectoriented languages for multimediaretrieval, Retrieval in hypertext/ Deadlines hypermedianetworks,Hierarchicallanguages,Accessmethodsfor multimedia Indexesforvideo, andvoice, systems, images Browsing Fourcopies of full papersshould be (20 pages or approximately in video, etc. verylargeimages.and 500 words)sent to the followingaddresses by 10 April,1990. UserInterfaces:Visualinterfacesto Multimediasystems,Browsing interfaces, Agents in interfaces,Physicalmetaphors,Adaptive NorthAmericaandEurope: interfaces,Authoringandeditingtools. ProfessorStavrosChristodoulakis Applicationsand Systems:Pictureand document-imagearchival Institute forComputerResearch systems,Standards for electronicdocumentpreparation and of submission,Engineeringapplications,MedicalInformationSystems, University Waterloo Librariesofthefuture,Multimediaineducation,Homeentertainment Waterloo,Ontario,CanadaN2L 3 G1 systems. Bitnet:Watmath!WatDaisy!Schrisodoul @ uunet.uu.net FAX:1-519-8851208 Panels Australia,AsiaandOthers: Dr Proposals are invited for panelsrelevant to the theme of the DesaiNarasimhalu conference. Thepersonproposing a panelshouldidentifythetopic Institute of SystemsScience andtheparticipants.Theco-ordinatorshouldalsoidentifyalternates. NationalUniversity of Singapore Writtenconsent is requiredfromtheproposedpanelistsagreeing to HengMui KengTerrace should be participate in thepanel.Proposalsforpanels received by Singapore0511 31 March,1990 Bitnet:ISSAD @ NUSVM

FAX : 65-7782571 Tutorials

• Notification of acceptance of paperswill be mailed to authors Proposalsfortutorialsarealsowelcomedontopicsrelevant to the on or before 1 July,1990. theme of theconference. Thepersonproposingthetutorialshould • Accepted papers,camera ready, are due no later than I give an outline of thetopicandindicatetheduration ofthetutorial. September,1990. Proposals for tutorialsshould be sent in by 31 March,1990. ~1EEEComputerSociety Non-profitOrg U.S. Postage 1730MassachusettsAvenue. N W PAID Washington. DC 20036 1903 SilverSpring, MD Permit1398

~1enry F.~ ~(orth University of Texas Taylor’ 2124 Dept of Cs f~ustin, TX 78712 Us~