Manual for Morphological Annotation of Czech Sentences

Manual for Morphological Annotation
Revision for the Prague Dependency Treebank 2.0

ÚFAL Technical Report No. 2005/27

Jiří Hana
Daniel Zeman
Jan Hajič
Hana Hanová
Barbora Hladká
Emil Jeřábek

Table of Contents

Preface to Version 2.0
Preface to Version 1.0
1. Introduction
2. Lemma and tag structure
    2.1. Lemma structure
        2.1.1. Base form and number
        2.1.2. Reference
        2.1.3. Category
        2.1.4. Term
        2.1.5. Style
        2.1.6. Explanational comment
        2.1.7. Comment on derivation
    2.2. Tag Structure
        2.2.1. Positional tags
        2.2.2. Compact tags
        2.2.3. Informal abbreviations
3. Names
    3.1. Personal names
        3.1.1. von, van, etc.
        3.1.2. Chinese and Korean names
        3.1.3. Foreignized Czech names
    3.2. Geographical names
        3.2.1. Countries, cities, rivers, mountains
        3.2.2. Streets
    3.3. Companies and institutions
        3.3.1. Restaurants
        3.3.2. Sport clubs
    3.4. Horses, DJ's etc.
    3.5. Products
    3.6. Sporting and other events
    3.7. Other
        3.7.1. Buildings
        3.7.2. Televisions
        3.7.3. News and magazines
        3.7.4. Song names
    3.8. Adjectives derived from names
4. Abbreviations
    4.1. Gender
    4.2. Isolated letters
    4.3. Units of measurements
    4.4. Authors' signatures
    4.5. Academic titles
5. Colloquial Czech
    5.1. Cos, kdys, jaks...
    5.2. Suffix -ý in plural of neuter
6. Foreign words and phrases
    6.1. Articles
    6.2. English noun clusters
    6.3. Nouns
    6.4. Verbs
        6.4.1. English verbs
    6.5. Slavic languages and Czech dialects
7. Errors
    7.1. Characters
    7.2. Separators
8. Hard to decide
    8.1. až
    8.2. jak
    8.3. málo
    8.4. moc
    8.5. proto
    8.6. svůj
    8.7. tak
9. Selected words
10. Date and time
11. Numbers, numerals and quantifiers
12. Hyphenated composites
13. Insertion
    13.1. Possessive adjectives
    13.2. Words ending with -ismus, -izmus
    13.3. Transcription of pronunciation
    13.4. Crippled forms
    13.5. Isolated morphemes
    13.6. Geometry
    13.7. Chess codes

List of Tables

2.1. Lemma examples
2.2. Lemma categories
2.3. Term types
2.4. Style flags
2.5. Attributes in positional tags
2.6. POS
2.7. SUBPOS
2.8. Obsolete SUBPOS values
2.9. GENDER
2.10. NUMBER
2.11. CASE
2.12. POSSGENDER
2.13. POSSNUMBER
2.14. PERSON
2.15. TENSE
2.16. GRADE
2.17. NEGATION
2.18. VOICE
2.19. VAR
3.1. Name types
3.2. Examples of geographical names
3.3. Examples of names
3.4. Examples of restaurant names
3.5. Examples of sport club names
3.6. Examples of event names
4.1. Examples of abbreviations
4.2. Gender of abbreviations
4.3. Examples of isolated letters
4.4. Examples of units
5.1. Colloquial examples
6.1. Examples of foreign phrases
6.2. Articles in common foreign languages
6.3. Number and case of English nouns
6.4. Examples of English verbs

List of Examples

2.1. Following examples illustrate this:
2.2. Other examples:
3.1. Personal names with von, van etc.
3.2. Chinese and Korean names
3.3. Street names
3.4. Names of horses
3.5. TV names
3.6. Names of periodicals
11.1. Case agreement in counted phrases
12.1. Hyphenated composites
13.1. -ismus, -izmus
13.2. Transcription of pronunciation
13.3. Crippled forms

Preface to Version 2.0

Although the title of this report inherits the word "Manual" from the previous version, it is no longer intended to guide annotators. Rather, it attempts to describe the current state of the morphological annotation in PDT 2.0. Most of the added information resulted from several semi-automatic checks performed on the data before its release. In some cases it was not feasible to bring the data to the desired state; in those cases, both the desired and the current state of the data are described.

PDT 2.0 contains 1,960,657 morphologically annotated tokens in 126,831 sentences. There are 168,454 distinct word forms, 71,716 distinct lemmas, and 1,740 morphological tags.

The final checking and analysis of the data, as well as the work on this manual revision, were supported by the Czech Academy of Sciences program called "Information Society", projects No. 1ET101120503 and 1ET101120413, and the grant No. GA405/03/0913.
Preface to Version 1.0

We are pleased to publish the first version of the manual for morphological annotation of Czech sentences. We believe that such guidelines can be of use to the users of the Prague Dependency Treebank 1.0 (PDT 1.0), as well as for the preparation of new data.

Let us recall the most important steps we took to obtain about two million morphologically annotated words (PDT 1.0). At the very beginning, we put together a team of eight annotators. We introduced them to a system of morphological tags we had designed to describe Czech morphological properties; we also used (as a preprocessing step) a morphological analyzer for processing isolated words; and, last but not least, we relied on the knowledge of Czech morphology they had acquired at secondary school, i.e. we did not offer them any annotation guidelines.

One can object that this strategy is too hazardous: how does one deal with the discrepancies the annotators produce and still ensure the consistency of annotation? First, two annotators annotated each text file. Then, by a "blind" automatic procedure (no matter what word is processed, just comparing two strings), we detected words annotated differently. Subsequently, a single annotator (a member of a team of just two) handled these cases and also checked the morphological annotations against the syntactic-analytic annotations. In this way we compensated for the absence of annotation guidelines by sequential elimination of discrepancies across both the morphological and the syntactic-analytic levels of annotation.

Along the way we were writing this annotation manual. It is not intended as a comprehensive guide to the morphological annotation of Czech sentences (in contrast to the manual for syntactic-analytic annotations). The authors concentrate "only" on those cases which caused the most ambiguities and problems while annotating PDT 1.0. The ongoing effort is directed at treating the not-yet-solved problematic cases in accord with the conventions of the automatic morphological analyzer.

The morphological annotation of PDT 1.0 was carried out in the framework of experimental verification of the definition of a formal representation of the analysis of Czech sentences (the project GAČR 405/96/0198, "Formal representation of language structures"). The material obtained in this way (data) is used in many domains of research in computational linguistics, above all as basic (training) data in projects of automatic language analysis, the MŠMT research project MSM113000006, the "Laboratory for Language Data Processing" (the MŠMT project VS961510) and the Center for Computational Linguistics (the MŠMT project LN00A063). These data have also been used as verification material for various partial projects within the complex program GAČR 405/96/K214 ("Czech Language in Computer Age"). The "Center for Computational Linguistics" project financially supported work on these morphological annotation guidelines.

Chapter 1. Introduction

This manual is not meant to substitute for a grammar book of Czech, so we are not going to systematically define word classes and paradigms. All annotators should understand the fundamentals of Czech morphology, as most native Czech speakers do (the material is taught in elementary schools). What we are going to describe are the difficult or unusual phenomena. Most notably, we will address the annotation of names, foreign words, and abbreviations. Such categories are rarely and sparsely covered by standard dictionaries. To get an idea of what a foreign word, name, etc. means, it is useful to try to find it using an internet portal, an encyclopedia, etc. During annotation, we found the following internet links useful:

Portals.

