Normalizing Headwords of Cologne Digital Dictionaries
Total Page:16
File Type:pdf, Size:1020Kb
Normalizing headwords of Cologne digital dictionaries Dr. Dhaval Patel A/8, Gokul Flats, Nava Vadaj, Ahmedabad, Gujarat, India, Pin - 380013 [email protected] Abstract 1 Cologne Digital Sanskrit Dictionaries site maintains 36 Sanskrit related dictionaries as on th 2 31 October 2016. In Sanskrit NLP, these dictionaries are the main source for lexical 3 data. Discounting 3 English – Sanskrit dictionaries, there are 33 dictionaries in total which 4 have headwords in Sanskrit. As these dictionaries are compiled over a vast period of time, 5 different conventions were followed by their authors. When we want to align these digital 6 lexica, we need to understand these conventions and arrive at a standard convention, so that 7 these different databases can communicate to one another. 8 Examples would make it more evident. Some dictionaries tend to use first inflected form of 9 the headword e.g. धमः , whereas some tend to use the uninflected form of the headword e.g. 10 धम. Similarly some dictionaries have tendency to use ‘अर ’् for words ending with ऋ e.g. िपतर ,् 11 whereas some have tendency to use ‘ऋ’ e.g. िपतृ and some others have tendency to use आ e.g. 12 िपता. The data referred to by headwords धमः / धम or िपतर /् िपतृ / िपता are the same. When we 13 want these dictionaries to communicate seamlessly with one another, such differences need to 14 be ironed out. Greater consonance can be brought between these dictionaries by some form 15 of standardization in this regards. Present paper tries to analyse the different conventions 16 followed by these authors / editors and come out with a standardized convention, so that 17 the headwords list is in standard format. 18 Introduction 1 19 Cologne Digital Sanskrit Dictionaries website has a rich treasure of Sanskrit related th 20 dictionaries. There are 36 dictionaries hosted there as on 31 October 2016. There are 2 21 three English – Sanskrit dictionaries (AE, BOR and MWE) . Even after discounting these 22 dictionaries whose headwords are in a foreign language, there are 33 digital dictionaries 23 whose headwords are in Sanskrit and are of utmost interest to Sanskrit NLP researchers. 24 There are many generative and analytical computational tools in Sanskrit NLP, which 25 require annotation or extraction of lexical data in various degrees. If these tools are to be 26 supported with lexical data in long run, we need to standardize various conventions used 27 in various dictionaries into a predefined format. Examples would make the need for such 28 standardization in dictionary headwords more evident. AP tends to use inflected form of 1 http://www.sanskrit-lexicon.uni-koeln.de/ 2 For details of abbreviations, refer http://sanskrit-lexicon.uni-koeln.de/ or abbreviation section of this paper. 29 the headword e.g. धमः , whereas MW has tends to use the uninflected form of the headword 30 e.g. धम. Similarly PW has tendency to use ‘अर ’् for words ending with ऋ e.g. िपतर ,् whereas 31 MW has tendency to use ‘ऋ’ e.g. िपतृ and SKD has tendency to use िपता. The data referred 32 to by headwords धमः / धम or िपतर ्/ िपतृ / िपता are the same. For any NLP related program 33 to link properly to the data, we need to iron out these differences. Even if the input is 34 िपतृ, data from िपतर ् of PW or िपता of SKD should be accessible to that user. Because of 35 difference of conventions in various dictionaries, sometimes the user is not able toland 36 on the desired dictionary entry. This calls for a study of various conventions followed by 37 various dictionaries and need to come out with a standardized convention for headword 38 normalization. The suggested standard conventions proposed in this paper are open to 3 39 discussion and correction, if need be. After due deliberation, they can be finalized . 40 Convention 1 - Treatment of अनारु before consonants 41 There are at least 6 conventions of handling अनारु s in various dictionaries. They are 42 described below. When अनारु occurs before any of the letters य,र,ल,व,श,ष,स or ह, it is 43 uniformly depicted as अनारु in all dictionaries. The conventions mentioned below are not 44 applicable to such words. 45 Option 1.1 46 Treat it as अनारु , when occurring in between a word (other than the cases where the first 47 member of compound ends with म )् e.g. अकंुिठत. 48 Dictionaries – AP90 49 Option 1.2 50 Treat it as fifth letter of the वग of following letter, when in between a word (other than the 51 cases where the first member of compound ends with म )् e.g. चल. 52 Dictionaries: ACC, AP, BEN, BHS, BOP, BUR, CAE, CCS, GRA, GST, IEG, INM, MCI, 53 MD, MW, MW72, PD, PE, PGN, PUI, PW, PWG, SCH, SHS, SKD, SNP, STC, VCP, VEI, 54 WIL, YAT 55 Note regarding 1.1 and 1.2– KRM is a special dictionary, whose headwords are verbs 56 only. Therefore, the options 1.1 and 1.2 are not applicable to it. 57 Option 1.3 58 Use अनारु at the end of a word to denote neuter gender e.g. अकौिट.ं 59 Dictionaries: AP90, SKD 3 This is an ongoing project and the latest version of this classification can be seen at https://github.com/ sanskrit-lexicon/hwnorm1. Scholars are requested to send their suggestions or comments at https://github.com/ sanskrit-lexicon/hwnorm1/issues. 2 60 Option 1.4 61 Use अनारु at the end of a word to denote अयs mostly (not to denote neuter gender), where 62 usually म ्is supposed to be. e.g. अनकामु .ं 63 Dictionaries: YAT 64 Option 1.5 65 Treat as अनारु when first member of compound ends with म ्and the second member of 66 compound starts with झर ्letters e.g. सगीतं . 67 Dictionaries:ACC, AP, AP90, BEN, BHS, CAE, CCS, MCI, MD, PD, PW, PWG, SCH, 68 STC, VEI, WIL 69 Option 1.6 70 Treat it as fifth letter of the वग of following letter, when occuring in the cases where the first 71 member of compound ends with म .् e.g. सीत. 72 Dictionaries: BOP, BUR, GRA, GST, KRM, IEG, INM, MW72, PGN, PUI, SKD, VCP, 73 YAT 74 Notes regarding 1.5 and 1.6– (1) PE is inconsistent regarding conventions 1.5 and 1.6. 75 See गासरतीसगमं and वरदासम. (2) SHS is also inconsistent. See सह and कस महं . (3) MW 76 is also inconsistent. See अनिभसि and अिभसिधकं ृ त. (4) SNP doesn’t have any such case prima 77 facie, hence excluded from this list. 78 Standard convention 79 1.1. Convert every nasal followed by consonant to अनारु . 80 1.2. Convert every headword ending with अनारु to ending with म .् 81 It is pertinent to note that majority of dictionaries tend to prefer convention 1.2. Huet 4 82 (2009) also favoured this convention . So statistically we should lean towards 1.2, but there 83 are a few problems with that approach computationally. Many of the dictionaries following 84 1.2 also follow 1.5. Therefore, deciding whether to keep it as fifth letter of a वग or अनारु 85 depends on the knowledge of whether it is last letter of a compound or not. Therefore, 86 uniformly converting every internal nasal to अनारु is computationally easier choice. 87 Convention 2 - Duplication of consonants after ’r’. 88 Option 2.1 89 Duplication is done in all cases e.g. पू . 90 Dictionaries: SKD, WIL 4 See section 1.6 of Huet’s paper. 3 91 Option 2.2 92 Duplication is not done e.g. पवू . 93 Dictionaries: ACC, AP, AP90, BEN, BHS, BOP, BUR, CAE, CCS, GRA, GST, IEG, INM, 94 KRM, MCI, MD, MW, MW72, PD, PE, PGN, PUI, PW, PWG, SCH, SNP, STC, VCP, 95 VEI 96 Note– (1) SHS and YAT are inconsistent in this convention. See िनव / िनक in SHS 97 and वच / चस ्in YAT. (2) VCP highly leans towards option 2.2, but there are a few 98 inconsistent entries as well e.g. पवत and अिपत . 99 Standard convention 100 2. No duplication. 101 Convention 3 – Treatment of words ending with ‘at’ 102 Option 3.1 103 Treat words derived from शतृ suffix as ending with अत ्e.g. गत .् 104 Dictionaries: AP, AP90, BOP, BUR, GRA, GST, MD, MW, MW72, PD, SHS, VCP, WIL, 105 YAT 106 Option 3.2 107 Treat words derived from शतृ suffix as ending with अ ्e.g. अनाग .् 108 Dictionaries: BEN, BHS, CAE, CCS, PW, PWG, SCH, STC, VEI 109 Option 3.3 110 Treat words derived from शतृ suffix as ending with अन ्e.g. पँयन .् 111 Dictionaries: SKD 112 Note regarding options 3.1 to 3.3– (1) ACC, IEG, INM, KRM, MCI, PE, PGN, PUI and 113 SNP don’t have enough words with this suffix to decide conclusively conventions 3.1 to 3.3. 114 Option 3.4 115 Treat words derived from वतपु /् मतपु ्suffix as ending with वत /् मत ्e.g. भगवत .् 116 Dictionaries: ACC, AP, AP90, BOP, BUR, GRA, GST, IEG, INM, MCI, MD, MW, PD, 117 SHS, VCP, WIL, YAT 118 Option 3.5 119 Treat words derived from वतपु /् मतपु ्suffix as ending with व /् म ्e.g.