Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing

Kaja Dobrovoljc, Jožef Stefan Institute, Ljubljana, Slovenia, [email protected]
Matej Martinc, Jožef Stefan Institute, Ljubljana, Slovenia, [email protected]

Abstract

Despite the significant improvement of data-driven dependency parsing systems in recent years, they still achieve a considerably lower performance in parsing spoken language data in comparison to written data. On the example of the Spoken Slovenian Treebank, the first spoken data treebank using the UD annotation scheme, we investigate which speech-specific phenomena undermine parsing performance, through a series of training data and treebank modification experiments using two distinct state-of-the-art parsing systems. Our results show that utterance segmentation is the most prominent cause of low parsing performance, both in parsing raw and pre-segmented transcriptions. In addition to shorter utterances, both parsers perform better on normalized transcriptions including basic markers of prosody and excluding disfluencies, discourse markers and fillers. On the other hand, the effects of written training data addition and speech-specific dependency representations largely depend on the parsing system selected.

1 Introduction

With an exponential growth of spoken language data available online on the one hand and the rapid development of systems and techniques for language understanding on the other, spoken language research is gaining increasing prominence. Many syntactically annotated spoken language corpora have been developed in recent years to benefit the data-driven parsing systems for speech (Hinrichs et al., 2000; van der Wouden et al., 2002; Lacheret et al., 2014; Nivre et al., 2006), including two spoken language treebanks adopting the Universal Dependencies (UD) annotation scheme, aimed at cross-linguistically consistent dependency treebank annotation (Nivre, 2015).

However, in the recent CoNLL 2017 shared task on multilingual parsing from raw text to UD (Zeman et al., 2017), the results achieved on the Spoken Slovenian Treebank (Dobrovoljc and Nivre, 2016) - the only spoken treebank among the 81 participating treebanks - were substantially lower than on other treebanks. This includes the written Slovenian treebank (Dobrovoljc et al., 2017), with a labeled attachment score difference of more than 30 percentage points between the two treebanks for all of the 33 participating systems.

Given this significant gap in parsing performance between the two modalities, spoken and written language, this paper aims to investigate which speech-specific phenomena influence the poor parsing performance for speech, and to what extent. Specifically, we focus on questions related to data representation in all aspects of the dependency parsing pipeline, by introducing different types of modifications to spoken language transcripts and speech-specific dependency annotations, as well as to the type of data used for spoken language modelling.

This paper is structured as follows. Section 2 addresses the related research on spoken language parsing and Section 3 presents the structure and annotation of the Spoken Slovenian Treebank on which all the experiments were conducted. Section 4 presents the parsing systems used in the experiments (4.1) and the series of SST data modifications aimed at narrowing the performance gap between written and spoken treebanks for these systems, involving the training data (4.3.1), speech transcriptions (4.3.2) and UD dependency annotations (4.3.3). Results are presented in Section 5, while conclusions and some directions for further work are addressed in Section 6.

Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 37-46, Brussels, Belgium, November 1, 2018. © 2018 Association for Computational Linguistics

    eee aha še ena stvar to sem se tukaj s [gap] spomnil zdajle ko vidim ta komentar
    er yes more one thing this I-have (PRON) here r- [gap] remembered now when I-see this comment
    (uhm oh yes one more thing I just r- [gap] remembered this here now that I see this comment)

[Dependency tree with relations including parataxis, discourse, discourse:filler, reparandum, advcl, advmod, obj, aux, expl, nummod, det, mark and punct.]

Figure 1: An example utterance taken from the Spoken Slovenian Treebank.
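UD treebanks such as SST are distributed in the CoNLL-U format, one token per line with the head and relation in fixed columns. As a minimal sketch (not part of the original paper), the frequency of the speech-specific relations illustrated above can be counted from such a file with a few lines of standard-library Python:

```python
# Minimal CoNLL-U reader (sketch): yields sentences as lists of
# (id, form, upos, head, deprel) tuples, skipping comment lines,
# multiword-token ranges and empty nodes.
from collections import Counter

def read_conllu(lines):
    sent = []
    for line in lines:
        line = line.strip()
        if not line:
            if sent:
                yield sent
                sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():  # plain word lines only
                sent.append((int(cols[0]), cols[1], cols[3],
                             int(cols[6]), cols[7]))
    if sent:
        yield sent

def deprel_counts(lines):
    """Count dependency relations (e.g. discourse:filler, reparandum)."""
    counts = Counter()
    for sent in read_conllu(lines):
        counts.update(dep for *_, dep in sent)
    return counts
```

For instance, `deprel_counts(open("sst-ud-train.conllu"))` (a hypothetical file name) would report how many tokens attach via discourse:filler or reparandum.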

2 Related work

In line with divergent approaches to syntactic annotation of transcribed spoken data, which either aim to capture the syntactic structure involving all uttered lexical phenomena in an utterance, or discard the (variously defined) noisy speech-specific structural particularities, research into parsing spoken language can broadly be categorized in two main groups. On the one side of the spectrum, we find approaches that separate disfluencies from parsing. Charniak and Johnson (2001) and Jørgensen (2007), for example, both report a significant increase in parsing performance on the Switchboard section of the Penn Discourse Treebank (Godfrey et al., 1992) if disfluencies are first removed from the data. These two-pass pipeline approaches thus involve a separate task of automatic disfluency detection, one of the fundamental issues in automatic speech processing (Liu et al., 2006; Lease et al., 2006).

Recently, however, several parsing systems using non-monotonic transition-based algorithms have emerged that enable joint parsing and disfluency detection (Honnibal et al., 2013; Honnibal and Johnson, 2015; Rasooli and Tetreault, 2013), showing that joint treatment of both problems can actually outperform state-of-the-art pipeline approaches (Honnibal and Johnson, 2014). These findings open a promising line of future research for the development of speech-specific parsing systems (Yoshikawa et al., 2016), especially those that also incorporate acoustic information (Kahn et al., 2005; Tran et al., 2017).

Nevertheless, apart from research on speech-specific parsing systems, very little research has been dedicated to other, data-related aspects of spoken language parsing. To our knowledge, with the exception of Caines et al. (2017) and Nasr et al. (2014), who investigate the role of different types of training data used for parsing transcripts of speech, there have been no other systematic studies on the role of spoken data representations, such as transcription or annotation conventions, in spoken language parsing.

3 Spoken Slovenian Treebank

The Spoken Slovenian Treebank (Dobrovoljc and Nivre, 2016), which was first released as part of UD v1.3 (under the CC-BY-NC-SA 4.0 licence), is the first syntactically annotated collection of spontaneous speech in Slovenian. It is a sample of the Gos reference corpus of Spoken Slovenian (Zwitter Vitez et al., 2013), a collection of transcribed audio recordings of spontaneous speech in different everyday situations, in both public (TV and radio shows, school lessons, academic lectures etc.) and private settings (work meetings, services, conversations between friends and family etc.).

The SST treebank currently amounts to 29,488 tokens (3,188 utterances), which include both lexical tokens (words) and tokens signalling other types of verbal phenomena, such as filled pauses (fillers) and unfinished words, as well as some basic markers of prosody and extralinguistic speech events. The original segmentation, tokenization and spelling principles described by Verdonik et al. (2013) have also been inherited by SST. Among the two types of Gos transcriptions (pronunciation-based and normalized spelling, both in lowercase only), subsequent manual annotations in SST have been performed on top of the normalized transcriptions.

For syntactic annotation of the transcripts, unavailable in Gos, the SST treebank adopted the Universal Dependencies annotation scheme due to its high degree of interoperability across different grammatical frameworks, languages and modalities. In this original application of the UD scheme to spoken language transcripts, several modifications of the scheme were implemented to accommodate the syntactic particularities in speech, either by extending the scope of application of existing universal labels (e.g. using punct for labeling markers of prosody) or by introducing new speech-specific sub-labels (e.g. discourse:filler for annotation of hesitation sounds). In a subsequent comparison of the SST treebank with the written SSJ Slovenian UD treebank (Dobrovoljc et al., 2017), Dobrovoljc and Nivre (2016) observed several syntactic differences between the two modalities, as also illustrated in Figure 1.

4 Experiment setup

4.1 Parsing systems and evaluation

To enable system-independent generalizations, two parsing systems were selected, UDPipe 1.2 (Straka and Straková, 2017) and Stanford (Dozat et al., 2017), covering the two most common parsing approaches, transition-based and graph-based parsing (Aho and Ullman, 1972), respectively.

UDPipe 1.2 is a trainable pipeline for sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. It represents an improved version of UDPipe 1.1 (used as a baseline system in the CoNLL 2017 Shared Task (Zeman et al., 2017)) and finished as the 8th best system out of the 33 systems participating in the task. A single-layer bidirectional GRU network together with a case-insensitive dictionary and a set of automatically generated suffix rules are used for sentence segmentation and tokenization. The part-of-speech tagging module consists of a guesser, which generates several universal part-of-speech (UPOS), language-specific part-of-speech (XPOS) and morphological feature list (FEATS) tag triplets for each word according to its last four characters. These are given as an input to an averaged perceptron tagger (Straka et al., 2016) to perform the final disambiguation on the generated tags. The transition-based dependency parser is based on a shallow neural network with one hidden layer and without any recurrent connections, making it one of the fastest parsers in the CoNLL 2017 Shared Task. We used the default parameter configuration of ten training iterations and a hidden layer of size 200 for training all the models.

Stanford parser is a neural graph-based parser (McDonald et al., 2005) capable of leveraging word- and character-based information in order to produce part-of-speech tags and labeled dependency parses from segmented and tokenized sequences of words. Its architecture is based on the deep biaffine neural dependency parser presented by Dozat and Manning (2016), which uses a multilayer bidirectional LSTM network to produce vector representations for each word. These representations are used as an input to a stack of biaffine classifiers capable of producing the most probable UD tree for every sentence and the most probable part-of-speech tag for every word. The system was ranked first according to all five relevant criteria in the CoNLL 2017 Shared Task. The same hyperparameter configuration was used as reported in Dozat et al. (2017), with every model trained for 30,000 training steps. For the parameter values that were not explicitly mentioned in Dozat et al. (2017), default values were used.

For both parsers, no additional fine-tuning was performed for any specific data set, in order to minimize the influence of the training procedure on the parser's performance for different data pre-processing techniques, especially given that no development data has been released for the small SST treebank.

For evaluation, we used the official CoNLL 2017 Shared Task evaluation script (Zeman et al., 2017) to calculate the standard labeled attachment score (LAS), i.e. the percentage of nodes with a correctly assigned reference to the parent node, including the label (type) of the relation. For baseline experiments involving parsing of raw transcriptions (see Section 4.2), for which the number of nodes in the gold-standard annotation and in the system output might vary, the F1 LAS score, marking the harmonic mean of the precision and recall LAS scores, was used instead.

4.2 Baseline

Prior to experiments involving different data modifications, both parsing systems were evaluated on the written SSJ and spoken SST Slovenian treebanks, released as part of UD version 2.2 (Nivre et al., 2018).¹ The evaluation was performed both for parsing raw text (i.e. automatic tokenization, segmentation, morphological annotation and dependency tree generation) and parsing

¹ Note that the SST released as part of UD v2.2 involves a different splitting of utterances into training and test sets than in UD v2.0, which should be taken into account when comparing our results to the results reported in the CoNLL 2017 Shared Task.
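The two evaluation measures can be stated compactly. The following is an illustrative reconstruction (the paper uses the official CoNLL 2017 evaluation script, not this code): plain LAS for pre-segmented input, where gold and system tokens are aligned one-to-one, and F1 LAS for raw-text input, where the two token sets may differ.

```python
# Sketch of the two LAS variants: a token counts as correct when both
# its head and its dependency label match the gold tree.

def las(gold, pred):
    """LAS for pre-segmented input: gold and pred are aligned lists of
    (head, deprel) pairs over the same tokens."""
    assert len(gold) == len(pred)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)

def las_f1(gold, pred):
    """F1 LAS for raw-text input: gold and pred are sets of
    (token_span, head_span, deprel) triples, so precision and recall
    are taken over possibly non-identical node sets."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)
```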

                         UDPipe                       Stanford
Parsing raw text
Treebank      Sents  UPOS   UAS    LAS      Sents  UPOS   UAS    LAS
sst           20.35  88.32  52.49  45.47    20.35  93.21  60.35  54.00
ssj           76.49  94.59  79.90  76.32    76.49  96.32  87.50  85.02
ssj 20k       76.42  89.88  71.79  66.40    76.42  94.61  82.60  78.60
Dependency parsing only
Treebank      Sents  UPOS   UAS    LAS      Sents  UPOS   UAS    LAS
sst           100    100    74.66  69.13    100    100    77.58  72.52
ssj           100    100    90.16  88.41    100    100    95.63  94.52
ssj 20k       100    100    86.69  84.21    100    100    91.93  89.60

Table 1: UDPipe and Stanford sentence segmentation (Sents), part-of-speech tagging (UPOS), unlabelled (UAS) and labelled attachment (LAS) F1 scores on the spoken SST and written SSJ Slovenian UD treebanks, for parsing raw text and for parsing texts with gold-standard tokenization, segmentation and tagging information.

gold-standard annotations (i.e. dependency parsing only). For Stanford parser, which only produces tags and dependency labels, the UDPipe tokenization and segmentation output was used as input.

The results displayed in Table 1 (Parsing raw text) confirm the difficulty of parsing spoken language transcriptions, given that both the UDPipe and Stanford systems perform significantly worse on the spoken SST treebank in comparison with the written SSJ treebank, with the difference in LAS F1 score amounting to 30.85 and 31.02 percentage points, respectively. These numbers decrease if we neutralize the important difference in treebank sizes - with 140,670 training-set tokens for the written SSJ and 29,488 tokens for the spoken SST - by training the written model on a comparable subset of the SSJ training data (20,000 tokens); however, the difference between the two modalities remains evident.

A subsequent comparison of results in dependency parsing only (Table 1, Dependency parsing only) reveals that a large share of parsing mistakes can be attributed to difficulties in lower-level processing, in particular utterance segmentation (with an F1 score of 20.35),² as spoken language parsing performance increases to the (baseline) LAS scores of 69.13 and 72.52 for the UDPipe and Stanford parser, respectively. Consequently, the actual difference between written and spoken language parsing reduces to approximately 15-17 percentage points, if based on the same amount of training data.

² Note that the low segmentation score is not specific to UDPipe, but to state-of-the-art parsing systems in general, as none of the 33 systems competing in the CoNLL 2017 Shared Task managed to achieve a significantly better result in SST treebank segmentation: http://universaldependencies.org/conll17/results-sentences.html.

In order to prevent the dependency parsing experiments in this paper being influenced by the performance of the systems responsible for producing other levels of linguistic annotation, the experiments set out in the continuation of this paper focus on evaluation of gold-standard dependency parsing only.

4.3 Data modifications

Given the observed difference in parsing spoken and written language for both parsing systems, several automated modifications of the data featured in the parsing pipeline have been introduced, to investigate the influence of different factors on spoken language parsing performance.

4.3.1 Modifications of training data type

Although the relationship between written and spoken language has often been portrayed as a domain-specific dichotomy, both modalities form part of the same language continuum, encouraging further investigations of cross-modal model transfer. In the first line of experiments, we thus evaluated spoken language parsing by training on spoken (sst) and written (ssj) data alone, as well as on the combination of both (sst+ssj). Given that the transcriptions in the SST treebank are written in lowercase only and do not include any written-like punctuation, two additional models excluding these features were generated for the written treebank (ssj lc and ssj no-punct) to neutralize the differences in conventions
for both modalities.

4.3.2 Modifications of speech transcription

The second line of experiments investigates the role of spoken language transcription conventions for the most common speech-specific phenomena, by introducing various automatically converted versions of the SST treebank (both training and testing data).

Spelling: For word form spelling, the original normalized spelling compliant with standard orthography was replaced by pronunciation-based spelling (sst pron-spell), reflecting regional and colloquial pronunciation variation (e.g. the replacement of the standard pronominal word form jaz "I" by the pronunciation-based word forms jz, jaz, jst, jez, jes, ja etc.).

Segmentation: Inheriting the manual segmentation of the reference Gos corpus, sentences (utterances) in SST correspond to "semantically, syntactically and acoustically delimited units" (Verdonik et al., 2013). As such, the utterance segmentation heavily depends on subjective interpretations of what is the basic functional unit in speech, in line with the multitude of existing segmentation approaches, based on syntax, semantics, prosody, or their various combinations (Degand and Simon, 2009). To evaluate parsing performance for alternative types of segmentation, based on a more objective set of criteria, two additional SST segmentations were created. In the minimally segmented version of the SST treebank (sst min-segm), utterances involving two or more clauses joined by a parataxis relation (denoting a loose inter-clausal connection without explicit coordination, subordination, or argument relation) have been split into separate syntactic trees (clauses), as illustrated in the example below (Figure 2).

    glej jo še kar stoka
    look at-her still (PART) she-moans
    (look at her she's still moaning)

Figure 2: Splitting utterances by parataxis.

Vice versa, the maximally segmented SST version (sst max-segm) includes utterances corresponding to entire turns (i.e. units of speech by one speaker), in which neighbouring utterances by a speaker have been joined into a single syntactic tree via the parataxis relation.

Disfluencies: Following the traditional approaches to spoken language processing, the sst no-disfl version of the SST treebank marks the removal of disfluencies, namely filled pauses, such as eee, aaa, mmm (labeled as discourse:filler), overridden disfluencies, such as repetitions, substitutions or reformulations (labeled as reparandum), and [gap] markers, co-occurring with unfinished or incomprehensible speech fragments (Figure 3).

    mmm ne bom po [gap] prispeval podpisa
    hmmm not I-will sig- [gap] give signature
    (uhm I will not sig- [gap] give my signature)

Figure 3: Removal of disfluencies.

Similar to the structurally 'redundant' phenomena described above, the sst no-discourse version of the SST treebank excludes syntactically peripheral speech-specific lexical phenomena, annotated as discourse, discourse:filler or parataxis:discourse, such as interjections (aha "uh-huh"), response tokens (ja "yes"), expressions of politeness (adijo "bye"), as well as clausal and non-clausal discourse markers (no "well", mislim "I think").

Prosody: Although the SST treebank lacks a phonetic transcription, some basic prosodic information is provided through specific tokens denoting exclamation or interrogation intonation, silent pauses, non-turn-taking speaker interruptions, vocal sounds (e.g. laughing, sighing, yawning) and non-vocal sounds (e.g. applauding, ringing). In contrast to the original SST treebank, in which these nodes were considered as regular nodes of dependency trees (labeled as punct), prosodic markers have been excluded from the sst no-pros version of the treebank.

4.3.3 Modifications of UD annotation

Given that the SST treebank was the first spoken treebank to be annotated using the UD annotation scheme, the UD annotation principles for speech-specific phenomena set out in Dobrovoljc and Nivre (2016) have not yet been evaluated within a wider community. To propose potential
future improvements of the UD annotation guidelines for spoken language phenomena, the third set of SST modifications involved alternations of selected speech-specific UD representations.

Extensions: The SST treebank introduced five new subtypes of existing UD relations to annotate filled pauses (discourse:filler), clausal repairs (parataxis:restart), clausal discourse markers (parataxis:discourse) and general extenders (conj:extend). In the sst no-extensions version of the treebank, these extensions have been replaced by their universal counterparts (i.e. discourse, parataxis and conj).

Head attachment: For syntactic relations, such as discourse or punct, which are not directly linked to the predicate-driven structure of the sentence, the choice of the head node to which they attach is not necessarily a straightforward task. The original SST treebank followed the general UD principle of attaching such nodes to the highest node preserving projectivity, typically the head of the most relevant nearby clause or clause argument. To evaluate the impact of this high-attachment principle on parsing performance, an alternative robust attachment has been implemented for the two categories with the weakest semantic connection to the head, filled pauses (sst discourse:filler) and prosodic markers (sst punct), attaching these nodes to the nearest preceding node instead, regardless of its syntactic role, as illustrated in Figure 4.

    mene je strah ker se snema [all : laughter]
    I is afraid because (PRON) it-tapes [all : laughter]
    (I am afraid because it's being taped)

Figure 4: Change of head for prosody markers.

For the reparandum relation, which currently denotes a relation between the edited unit (the reparandum) and its repair, the opposite principle was implemented in sst reparandum, by attaching the reparandum to the head of its repair, i.e. to the node it would attach to had it not been for the repair (Figure 5).

    da so te eee ti stroški čim manjši
    that are these (F) er these (M) costs most low
    (so that these costs are as low as possible)

Figure 5: Change of head for reparandum.

Following a similar higher-attachment principle, the parataxis:restart relation, used for annotation of sentences replacing an abandoned preceding clause, has been modified in sst parataxis:restart so as to span from the root node instead of the more or less randomly positioned head of the unfinished clause.

Clausal discourse markers: In the original SST treebank, clausal discourse markers (e.g. ne vem "I don't know", (a) veš "you know", glej "listen") have been labeled as parataxis (specifically, the parataxis:discourse extension), in line with other types of sentential parentheticals. Given the distinct distributional characteristics of these expressions (limited list, high frequency) and their similar syntactic behaviour to non-clausal discourse markers (no dependents, both peripheral and clause-medial positions), their label has been changed to discourse in the sst parataxis:discourse version of the treebank. For multi-word clausal markers, the fixed label was also introduced to annotate the internal structure of these highly grammaticized clauses (Figure 6).

    kaj boš pa drugega počel a veš
    what you-will (PART) else do you know
    (what else can you do you know)

Figure 6: Change of annotation for clausal discourse markers.

5 Results

Table 2 gives the LAS evaluation of both parsing systems for each data modification described in Section 4.3 above, including the baseline results for training and parsing on the original SST treebank
(see Section 4.2).

   Model                      UDPipe  Stanford
Training data
1  sst (= baseline)           69.13   72.52
2  ssj+sst                    68.53   77.38
3  ssj no-punct               57.40   62.57
4  ssj                        55.76   62.08
5  ssj lc                     55.61   61.99
Transcriptions
6  sst min-segm               74.89   78.31
7  sst no-disfl               71.47   74.77
8  sst no-discourse           70.73   75.47
9  sst no-pros                68.70   71.78
10 sst pron-spell             67.52   71.64
11 sst max-segm               63.93   68.13
Annotations
12 sst punct                  71.32   73.65
13 sst discourse:filler       69.13   72.85
14 sst parataxis:restart      68.53   71.95
15 sst no-new-ext.            68.45   73.05
16 sst reparandum             68.41   72.81
17 sst parataxis:disc.        68.32   72.35
Best combination
18 sst 6-7-8-12               79.58   N/A
19 sst 6-7-8-12-15            N/A     87.35

Table 2: LAS on the Spoken Slovenian Treebank (sst) for different types of training data, transcription and annotation modifications. Improvements over the baseline are marked in bold.

When evaluating the impact of different types of training data on the original SST parsing, both parsers give significantly poorer results than the baseline sst model if trained on the written SSJ treebank alone (ssj), which clearly demonstrates the importance of (scarce) spoken language treebanks for spoken language processing. In addition, no significant improvement is gained if the written data is modified so as to exclude punctuation (ssj no-punct) or perform lowercasing (ssj lc), which even worsens the results. Somewhat surprisingly, no definite conclusion can be drawn on the joint training model based on both spoken and written data (sst+ssj), as the parsers give significantly different results: while Stanford parser substantially outperforms the baseline result when adding written data to the model (similar to the findings by Caines et al. (2017)), this addition has a negative effect on UDPipe. This could be explained by the fact that global, exhaustive, graph-based parsing systems are more capable of leveraging the richer contextual information gained with a larger training set in comparison with local, greedy, transition-based systems (McDonald and Nivre, 2007).

The results of the second set of experiments, in which LAS was evaluated for different types of spoken language transcriptions, confirm that parsing performance varies with different approaches to transcribing speech-specific phenomena. As expected, both systems achieve significantly better results if parsing is performed on shorter utterances (sst min-segm). On the other hand, a similar LAS drop-off interval is identified for parsing full speaker turns (sst max-segm). These results confirm the initial observations in Section 4.2 that speech segmentation is the key bottleneck in the spoken language dependency parsing pipeline. Nevertheless, it is encouraging to observe that even the absence of any internal segmentation of (easily identifiable) speaker turns returns moderate parsing results.

As has already been reported in related work, parsing performance also increases if the spoken data is stripped of its most prominent speech-specific structures, such as disfluencies, discourse markers and fillers. Interestingly, for Stanford parser, the removal of discourse markers (sst no-discourse) is even more beneficial than the removal of seemingly less predictable false starts, repairs and other disfluencies (sst no-disfl). On the contrary, the removal of prosody markers (sst no-pros) damages the baseline results for both parsers, suggesting that the presence of these markers might even contribute to parsing accuracy for certain types of constructions, given their punctuation-like function in speech.

As for spelling, the results on the treebank based on pronunciation-based word spelling (sst pron-spell) support our initial hypothesis that the multiplication of token types damages parser performance, yet not to a great extent. This could be explained by the fact that token pronunciation information can sometimes help with syntactic disambiguation of the word form in context, if a certain word form pronunciation is only associated with a specific syntactic role (e.g. the colloquial pronunciation tko da of the discourse connective tako da "so that" that does not occur with other syntactic roles of this lexical string).
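As a concrete illustration of the sst no-disfl transformation discussed here, the following sketch (my own, not code from the paper) removes filler, reparandum and [gap] tokens from a sentence and renumbers the remaining ids and heads. It assumes, for simplicity, that the removed tokens are leaves of the tree, which holds for fillers and [gap] markers but not necessarily for whole reparandum subtrees:

```python
# Disfluency-removal pass over one sentence, represented as a list of
# (id, form, head, deprel) tuples with 1-based ids and head 0 = root.
# Assumption: removed tokens have no dependents of their own.
DISFLUENT = {"discourse:filler", "reparandum"}

def remove_disfluencies(sent):
    kept = [tok for tok in sent
            if tok[3] not in DISFLUENT and tok[1] != "[gap]"]
    # renumber ids and remap heads to the new positions
    new_id = {tok[0]: i + 1 for i, tok in enumerate(kept)}
    return [(new_id[t], form, new_id.get(head, 0), dep)
            for (t, form, head, dep) in kept]
```

Applied to the Figure 3 example, the filler mmm and the [gap] token disappear and the remaining tokens keep their attachment to the (renumbered) predicate.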

No definite conclusion can be drawn from the parsing results for the different alternations of speech-specific UD annotations, as the results vary by parsing system and by the type of UD modification. While both systems benefit from an alternative attachment of prosodic markers to their nearest preceding token (sst punct),³ and prefer the current labeling and attachment principles for clausal repairs (sst parataxis:restart) and clausal discourse markers (parataxis:discourse), the effect of other changes seems to be system-dependent. What is more, none of the changes in UD representations seems to affect the parsing performance to a great extent, which suggests that the original UD adaptations for speech-specific phenomena, applied to the Spoken Slovenian Treebank, represent a reasonable starting point for future applications of the scheme to spoken language data.

³ Note that the sst punct results should be interpreted with caution, as a brief analysis of the punct-related parsing errors on the original SST treebank revealed a substantial amount of (incorrect) non-projective attachments of the [gap] marker indicating speech fragments. This issue should be resolved in future releases of the SST treebank.

Finally, all transcription and annotation variables that were shown to improve spoken language LAS for each of the parsing systems have been joined into a single representation, i.e. a treebank with new, syntax-bound utterance segmentation, excluding disfluencies and discourse elements, with a change of prosody-marker attachment (UDPipe), as well as a change of filler attachment and the addition of the written parsing model (Stanford).⁴ Both UDPipe and Stanford achieved substantially higher LAS scores for their best-fitting combination than the original SST baseline model (sst), i.e. 79.58 and 87.35, respectively, moving the SST parsing performance much closer to the performance achieved on its same-size written counterpart (ssj 20k, Table 1), with the gap narrowing to 4.63 for UDPipe and 2.25 for Stanford. This confirms that the speech-specific phenomena outlined in this paper are indeed the most important phenomena affecting spoken language processing scores. Nevertheless, the remaining gap between the two modalities encourages further data-based investigations into the complexity of spoken language syntax, which evidently reaches beyond the prototypical structural and pragmatic phenomena set forward in this paper and the literature in general.

⁴ Modifications 13 (sst discourse:filler) and 16 (sst reparandum), which also increased Stanford parser performance, are not applicable to the Stanford best-combination representation, since discourse fillers and repairs have already been removed by modifications 7 (sst no-disfl) and 8 (sst no-discourse).

6 Conclusion and Future Work

In this paper, we have investigated which speech-specific phenomena are responsible for the below-optimal parsing performance of state-of-the-art parsing systems. Several experiments on the Spoken Slovenian Treebank involving training data and treebank modifications were performed in order to identify and narrow the gap between the performances on spoken and written language data. The results show that besides disfluencies, the most common phenomena addressed in related work, segmentation of clauses without explicit lexical connection is also an important factor in low parsing performance. In addition to that, our results suggest that for graph-based parsing systems, such as the Stanford parser, spoken language parsing should be performed by joint modelling of both spoken and written data excluding punctuation.

Other aspects of spoken data representation, such as the choice of spelling, the presence of basic prosodic markers and the syntactic annotation principles, seem less crucial for the overall parser performance. It has to be emphasized, however, that the UD annotation modifications set forward in this paper represent only a few selected transformations involving labeling and attachment, whereas many others are also possible, in particular experiments involving enhanced representations (Schuster and Manning, 2016).

These findings suggest several lines of future work. For the SST treebank in particular and spoken language treebanks in general, it is essential to increase the size of annotated data and reconsider the existing transcription and annotation principles to better address the difficulties in spoken language segmentation and disfluency detection. Particularly in relation to the latter, our results should be evaluated against the recent speech-specific parsing systems referenced in Section 2, as well as other state-of-the-art dependency parsers. A promising line of future work has also been suggested in related work on other types of noisy data (Blodgett et al., 2018), employing a variety of cross-domain strategies for improving parsing with little
in-domain data.

Our primary direction of future work, however, involves an in-depth evaluation of parsing performance for individual dependency relations, to determine how the modifications presented in this paper affect specific constructions, and to overcome the prevailing approaches to spoken language parsing that tend to over-generalize the syntax of speech.

References

Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

Su Lin Blodgett, Johnny Wei, and Brendan O'Connor. 2018. Twitter Universal Dependency parsing for African-American and Mainstream American English. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425. Association for Computational Linguistics.

Andrew Caines, Michael McCarthy, and Paula Buttery. 2017. Parsing transcripts of speech. In Proceedings of the Workshop on Speech-Centric Natural Language Processing, pages 27–36. Association for Computational Linguistics.

Eugene Charniak and Mark Johnson. 2001. Edit detection and parsing for transcribed speech. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL '01, pages 1–9, Stroudsburg, PA, USA. Association for Computational Linguistics.

Liesbeth Degand and Anne Catherine Simon. 2009. On identifying basic discourse units in speech: theoretical and empirical issues. Discours, 4.

Kaja Dobrovoljc, Tomaž Erjavec, and Simon Krek. 2017. The Universal Dependencies Treebank for Slovenian. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, BSNLP@EACL 2017, Valencia, Spain, April 4, 2017, pages 33–38.

Kaja Dobrovoljc and Joakim Nivre. 2016. The Universal Dependencies Treebank of Spoken Slovenian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR, abs/1611.01734.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 Shared Task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30, Vancouver, Canada. Association for Computational Linguistics.

John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech and Signal Processing - Volume 1, ICASSP '92, pages 517–520, Washington, DC, USA. IEEE Computer Society.

Erhard W. Hinrichs, Julia Bartels, Yasuhiro Kawata, Valia Kordoni, and Heike Telljohann. 2000. The Tübingen treebanks for spoken German, English, and Japanese. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translation, Artificial Intelligence, pages 550–574. Springer Berlin Heidelberg.

Matthew Honnibal, Yoav Goldberg, and Mark Johnson. 2013. A non-monotonic arc-eager transition system for dependency parsing. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 163–172, Sofia, Bulgaria. Association for Computational Linguistics.

Matthew Honnibal and Mark Johnson. 2014. Joint incremental disfluency detection and dependency parsing. Transactions of the Association for Computational Linguistics, 2(1):131–142.

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378. Association for Computational Linguistics.

Fredrik Jørgensen. 2007. The effects of disfluency detection in parsing spoken language. In Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007, pages 240–244.

Jeremy G. Kahn, Matthew Lease, Eugene Charniak, Mark Johnson, and Mari Ostendorf. 2005. Effective use of prosody in parsing conversational speech. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 233–240. Association for Computational Linguistics.

Anne Lacheret, Sylvain Kahane, Julie Beliao, Anne Dister, Kim Gerdes, Jean-Philippe Goldman, Nicolas Obin, Paola Pietrandrea, and Atanas Tchobanov. 2014. Rhapsodie: a prosodic-syntactic treebank for spoken French. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 295–301, Reykjavik, Iceland. European Language Resources Association (ELRA).

Matthew Lease, Mark Johnson, and Eugene Charniak. 2006. Recognizing disfluencies in conversational speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1566–1573.

Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526–1540.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530. Association for Computational Linguistics.

Alexis Nasr, Frederic Bechet, Benoit Favre, Thierry Bazillon, Jose Deulofeu, and André Valli. 2014. Automatically enriching spoken corpora with syntactic information for linguistic studies. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Joakim Nivre. 2015. Towards a universal grammar for natural language processing. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 9041 of Lecture Notes in Computer Science, pages 3–16. Springer International Publishing.

Joakim Nivre, Jens Nilsson, and Johan Hall. 2006. Talbanken05: A Swedish treebank with phrase structure and dependency annotation. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 1392–1395.

Joakim Nivre et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Mohammad Sadegh Rasooli and Joel Tetreault. 2013. Joint parsing and disfluency detection in linear time. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 124–129, Seattle, Washington, USA. Association for Computational Linguistics.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Trang Tran, Shubham Toshniwal, Mohit Bansal, Kevin Gimpel, Karen Livescu, and Mari Ostendorf. 2017. Joint modeling of text and acoustic-prosodic cues for neural parsing. CoRR, abs/1704.07287.

Darinka Verdonik, Iztok Kosem, Ana Zwitter Vitez, Simon Krek, and Marko Stabej. 2013. Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS. Language Resources and Evaluation, 47(4):1031–1048.

Ton van der Wouden, Heleen Hoekstra, Michael Moortgat, Bram Renmans, and Ineke Schuurman. 2002. Syntactic analysis in the Spoken Dutch Corpus (CGN). In Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, May 29-31, 2002, Las Palmas, Canary Islands, Spain.

Masashi Yoshikawa, Hiroyuki Shindo, and Yuji Matsumoto. 2016. Joint transition-based dependency parsing and disfluency detection for automatic speech recognition texts. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1036–1041.

Daniel Zeman et al. 2017. CoNLL 2017 Shared Task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, Vancouver, Canada. Association for Computational Linguistics.

Ana Zwitter Vitez, Jana Zemljarič Miklavčič, Simon Krek, Marko Stabej, and Tomaž Erjavec. 2013. Spoken corpus Gos 1.0. Slovenian language resource repository CLARIN.SI.
