Reading behavior predicts syntactic categories

Maria Barrett and Anders Søgaard
University of Copenhagen
Njalsgade 140
DK-2300 Copenhagen S
{dtq912,soegaard}@hum.ku.dk

Abstract

It is well-known that readers are less likely to fixate their gaze on closed class syntactic categories such as prepositions and conjunctions. This paper investigates to what extent the syntactic category of a word in context can be predicted from gaze features obtained using eye-tracking equipment. If syntax can be reliably predicted from eye movements of readers, it can speed up linguistic annotation substantially, since reading is considerably faster than doing linguistic annotation by hand. Our results show that gaze features do discriminate between most pairs of syntactic categories, and we show how we can use this to annotate words with syntactic categories across domains, when tag dictionaries enable us to narrow down the set of potential categories.

1 Introduction

Eye movements during reading are a well-established proxy for cognitive processing, and it is well-known that readers are more likely to fixate on words from open syntactic categories (verbs, nouns, adjectives) than on closed category items like prepositions and conjunctions (Rayner, 1998; Nilsson and Nivre, 2009). Generally, readers seem to be most likely to fixate and re-fixate on nouns (Furtner et al., 2009). If reading behavior is affected by syntactic category, maybe reading behavior can, conversely, also tell us about the syntax of words in context.

This paper investigates to what extent gaze data can be used to predict syntactic categories. We show that gaze data can effectively be used to discriminate between a wide range of part of speech (POS) pairs, and gaze data can therefore be used to significantly improve type-constrained POS taggers. This is potentially useful, since eye-tracking data becomes more and more readily available with the emergence of eye trackers in mainstream consumer products (San Agustin et al., 2010). With the development of robust eye-tracking in laptops, it is easy to imagine digital text providers storing gaze data, which could then be used to improve automated analysis of their publications.

Contributions We are, to the best of our knowledge, the first to study reading behavior of syntactically annotated, natural text across domains, and how gaze correlates with a complete set of syntactic categories. We use logistic regression to show that gaze features discriminate between POS pairs, even across domains. We then show how gaze features can improve a cross-domain supervised POS tagger. We show that gaze-based predictions are robust, not only across domains, but also across subjects.

2 Experiment

In our experiment, 10 subjects read syntactically annotated sentences from five domains.

Data The data consists of 250 sentences: 50 sentences (min. 3 tokens, max. 120 characters), randomly sampled from each of five different, manually annotated corpora: Wall Street Journal articles (WSJ), Wall Street Journal headlines (HDL), emails (MAI), weblogs (WBL), and Twitter (TWI). The WSJ and HDL syntactically annotated sentences come from the OntoNotes 4.0 release of the English Penn Treebank.[1] The MAI and WBL sections come from the English Web Treebank.[2]

[1] catalog.ldc.upenn.edu/LDC2011T03
[2] catalog.ldc.upenn.edu/LDC2012T13

The TWI data comes from the work of Foster et al. (2011). We mapped the gold POS labels to the 12 Universal POS tags (Petrov et al., 2011), but discarded the category X due to data sparsity.

Experimental design The 250 items were read by all 10 participants, but participants read the items in one of five randomized source domain orders. Neither the source domain of the sentence nor the POS tags were revealed to the participant at any time. One sentence was presented at a time. Sentences were presented in black on a light gray background. Font face was Verdana and font size was 25 pixels. Sentences were centered vertically, and all sentences could fit into one line. All sentences were preceded by a fixation cross. The experiment was self-paced: to switch to a new sentence, and to ensure that the sentence was actually processed by the participant, participants rated their immediate interest towards the sentence on a scale from 1-6 by pressing the corresponding number on the numeric keypad. Participants were instructed to read and continue to the next sentence as quickly as possible. The actual experiment was preceded by 25 practice sentences to familiarize the participant with the experimental setup.

Our apparatus was a Tobii X120 eye tracker with a 15" monitor. Sampling rate was 120 Hz, binocular. Participants were seated on a chair approximately 65 cm from the display. We recruited 10 participants (7 male, mean age 31.30 (± 4.74)) from campus. All were native English speakers. Their vision was normal or corrected to normal, and none were diagnosed with dyslexia. All were skilled readers. Minimum educational level was an ongoing MA. Each session lasted around 40 minutes. One participant had no fixations on around 40 sentences. We believe that erroneous key strokes caused the participant to skip a few sentences.

[Figure 1: Fixation probability boxplots across the five domains (WSJ n=858, TWI n=610, WBL n=716, HDL n=461, MAI n=575).]

Features There are many different features for exploring cognitive effort during reading (Rayner, 1998). We extracted a broad selection of features from the raw eye-tracking data in order to determine which are more fit for the task. The features are inspired by Salojärvi et al. (2003), who used a similarly exploratory approach. We wanted to cover both oculomotor features, such as fixations on previous and subsequent words, and measures relating to early processing (e.g. first fixation duration) and late processing (e.g. regression destinations / departure points and total fixation time). We also included reading speed and reading depth features, such as fixation probability and total fixation time per word. In total, we have 32 features, some of which are highly correlated (such as number of fixations per word and total gaze time on a sentence).

Dundee Corpus The main weakness of the experiment is the small dataset. As future work, we plan to replicate the experiment with a $99 eye tracker, which subjects can easily use at home. This will make it easy to collect thousands of sentences, leading to more robust gaze-based POS models. Here, instead, we include an experiment with the Dundee corpus (Kennedy and Pynte, 2005). The Dundee corpus is a widely used dataset in reading research and consists of gaze data of 10 subjects reading 20 newswire articles (about 51,000 words). We extracted the same features as above, except for 2nd fixation probability, word-based features for 1st fixation, and sentence-level features (in the Dundee corpus, subjects are exposed to multiple sentences per screen window), and used them in our POS tagging experiments (§3).

Learning In our experiments, we used type-constrained logistic regression with L2-regularization and type-constrained (averaged) structured perceptron (Collins, 2002; Täckström et al., 2013). In all experiments, unless otherwise stated, we trained our models on four domains and evaluated on the fifth, to avoid over-fitting to the characteristics of a specific domain. Our tag dictionary is from Wiktionary[3] and covers 95% of all tokens.

[3] https://code.google.com/p/wikily-supervised-pos-tagger/downloads/list
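To make the evaluation protocol concrete, the following is a minimal sketch of the leave-one-domain-out loop, assuming per-domain feature matrices and label vectors stored in a dict. The data layout and names are our own illustration, and the Wiktionary type constraints used in the paper are omitted here.

    # Minimal sketch of the leave-one-domain-out protocol (our illustration;
    # omits the Wiktionary type constraints used in the paper).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    DOMAINS = ["WSJ", "HDL", "MAI", "WBL", "TWI"]

    def leave_one_domain_out(data):
        """data: dict mapping domain -> (X, y), gaze features and POS labels."""
        scores = {}
        for held_out in DOMAINS:
            train = [d for d in DOMAINS if d != held_out]
            X_train = np.vstack([data[d][0] for d in train])
            y_train = np.concatenate([data[d][1] for d in train])
            X_test, y_test = data[held_out]
            # L2-regularized logistic regression, as in the paper
            clf = LogisticRegression(penalty="l2", max_iter=1000)
            clf.fit(X_train, y_train)
            scores[held_out] = clf.score(X_test, y_test)
        return scores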
3 Results

Domain differences Our first observation is that the gaze characteristics differ slightly across domains, but more across POS. Figure 1 presents the fixation probabilities across the 11 parts of speech. While the overall pattern is similar across the five domains (open category items are more likely to be fixated), we see domain differences. For example, pronouns are more likely to be fixated in headlines. The explanation could lie in the different distributions of function words and content words. It is established and unchallenged that function words are fixated on about 35% of the time and content words are fixated on about 85% of the time (Rayner and Duffy, 1988). In our data, these numbers vary among the domains according to the frequency of that word class, see Figure 2. Figure 2a shows that there is a strong linear correlation between content word frequency and content word fixation probability among the different domains: Pearson's ρ = 0.909. From Figure 2b, there is a negative correlation between function word frequency and function word fixation probability: Pearson's ρ = −0.702.

[Figure 2: Scatter plot of frequency and fixation probability for (a) content words (NOUN, VERB, ADJ, NUM) and (b) function words (PRON, CONJ, ADP, DET, PRT).]
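The reported correlations are ordinary Pearson coefficients over five domain-level points (one per domain). A minimal sketch follows; the input values below are illustrative placeholders read off the figure's scale, not values from the paper.

    # Pearson correlation between per-domain word-class frequency and
    # fixation probability. Placeholder values, for illustration only.
    from scipy.stats import pearsonr

    content_freq = [0.51, 0.55, 0.58, 0.61, 0.64]     # one value per domain
    content_fixprob = [0.69, 0.71, 0.74, 0.75, 0.77]  # fixation probability
    rho, p_value = pearsonr(content_freq, content_fixprob)
    print(rho)  # the paper reports rho = 0.909 for content words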

Rank  Feature                                 % of votes
0     Fixation prob                           19.0
1     Previous word fixated binary            13.7
2     Next word fixated binary                13.2
3     nFixations                              12.2
4     First fixation duration on every word    9.1
5     Previous fixation duration               7.0
6     Mean fixation duration per word          6.6
7     Re-read prob                             5.7
8     Next fixation duration                   2.0
9     Total fixation duration per word         2.0

Table 1: 10 most used features by stability selection from logistic regression classification of all POS pairs on all domains, 5-fold cross validation.

Predictive gaze features To investigate which gaze features were more predictive of part of speech, we used stability selection (Meinshausen and Bühlmann, 2010) with logistic regression classification on all binary POS classifications. Fixation probability was the most informative feature, but whether the words around the word are fixated is also important, along with number of fixations. In our binary discrimination and POS tagging experiments, using L2-regularization or averaging with all features was superior (on Twitter data) to using stability selection for feature selection. We also asked a psycholinguist to select a small set of relatively independent gaze features fit for the task (first fixation duration, fixation probability and re-read probability), but again, using all features with L2-regularization led to better performance on the Twitter data.
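Stability selection, as used for Table 1, refits a sparse model on many random subsamples and counts how often each feature survives. Below is a simplified sketch of the procedure, our own reduction of Meinshausen and Bühlmann (2010) with an L1-penalized base model; the parameter values are illustrative, not the paper's.

    # Simplified stability selection: fit an L1-penalized logistic regression
    # on repeated random subsamples and record how often each feature keeps
    # a nonzero coefficient ("% of votes" in Table 1).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stability_selection(X, y, n_rounds=100, frac=0.5, seed=0):
        rng = np.random.default_rng(seed)
        votes = np.zeros(X.shape[1])
        for _ in range(n_rounds):
            idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
            clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
            clf.fit(X[idx], y[idx])
            votes += (np.abs(clf.coef_[0]) > 1e-8)  # feature selected this round
        return votes / n_rounds  # selection frequency per feature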

Binary discrimination First, we trained L2-regularized logistic regression models to discriminate between all pairs of POS tags using only gaze features. In other words, we selected, for example, all words annotated as NOUN or VERB, and trained a logistic regression model to discriminate between the two in a five-fold cross validation setup. We report error reduction, (acc − baseline)/(1 − baseline), in Figure 3.
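For a single POS pair, the procedure amounts to five-fold cross-validated logistic regression scored against the majority baseline. A minimal sketch, assuming X holds the gaze features of all tokens carrying one of the two tags and y the binary labels:

    # Binary POS discrimination (e.g. NOUN vs. VERB) from gaze features,
    # scored as error reduction over the majority-class baseline.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def error_reduction(X, y):
        acc = cross_val_score(LogisticRegression(penalty="l2", max_iter=1000),
                              X, y, cv=5).mean()
        baseline = max(np.mean(y), 1 - np.mean(y))  # majority-class accuracy
        return (acc - baseline) / (1 - baseline)

As a sanity check, applying the same formula to the macro-averages in Table 2 below recovers the error reductions reported there: (0.830 − 0.807)/(1 − 0.807) ≈ 12% and (0.837 − 0.807)/(1 − 0.807) ≈ 16%.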

[Figure 3: Error reduction of logistic regression over a majority baseline. All domains.]

POS tagging We also tried evaluating our gaze features directly in a supervised POS tagger. We trained a type-constrained (averaged) perceptron model[4] with drop-out and a standard feature model (from Owoputi et al. (2013)), augmented with the above gaze features. The POS tagger was trained on a very small seed of data (200 sentences), doing 20 passes over the data, and evaluated on out-of-domain test data: training on four domains, testing on one. For the gaze features, instead of using token gaze features, we first built a lexicon with average word type statistics from the training data. We normalize the gaze matrix by dividing with its standard deviation. This is the normalizer in Turian et al. (2010) with σ = 1.0. We condition on the gaze features of the current word only. We compare performance using gaze features to using only word frequency, estimated from the (unlabeled) English Web Treebank corpus, and word length (FREQLEN).

[4] https://github.com/coastalcph/rungsted
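The type-level gaze lexicon and the scaling step can be sketched as follows; this is a reconstruction under the assumption of token-level (word, gaze vector) pairs, and the function and variable names are ours, not from the paper's code.

    # Build a type-level gaze lexicon by averaging token-level gaze vectors
    # per word type, then scale the matrix as in Turian et al. (2010).
    from collections import defaultdict
    import numpy as np

    def build_gaze_lexicon(tokens, gaze_vectors, sigma=1.0):
        by_type = defaultdict(list)
        for word, vec in zip(tokens, gaze_vectors):
            by_type[word.lower()].append(vec)
        words = sorted(by_type)
        E = np.array([np.mean(by_type[w], axis=0) for w in words])
        E = sigma * E / np.std(E)  # normalizer of Turian et al. (2010), sigma=1.0
        return dict(zip(words, E))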

          SP     +GAZE  +DGAZE  +FREQLEN  +DGAZE+FREQLEN
HDL       0.807  0.822  0.822   0.826     0.843*
MAI       0.791  0.831  0.834*  0.795     0.831
TWI       0.771  0.787  0.800*  0.772     0.793
WBL       0.836  0.854  0.858   0.850     0.861*
WSJ       0.831  0.837  0.838   0.831     0.859*
Macro-av  0.807  0.826  0.830   0.815     0.837*

Table 2: POS tagging results on different test sets, using 200 out-of-domain sentences for training. DGAZE is using gaze features from Dundee. Best result for each row marked with an asterisk.

The first three columns in Table 2 show that gaze features help POS tagging, at least when trained on very small seeds of data. Error reduction using gaze features from the Dundee corpus (DGAZE) is 12%. We know that gaze features correlate with word frequency and word length, but using these features directly leads to much smaller performance gains. Concatenating the two feature sets leads to the best performance, with an error reduction of 16%.

In follow-up experiments, we observe that averaging over 10 subjects when collecting gaze features does not seem as important as we expected. Tagging accuracies on raw (non-averaged) data are only about 1% lower. Finally, we also tried running logistic regression experiments across subjects rather than domains. Here, tagging accuracies were again comparable to our set-up, suggesting that gaze features are also robust across subjects.

4 Related work

Matthies and Søgaard (2013) present results suggesting that individual variation among (academically trained) subjects' reading behavior was not a greater source of error than variation within subjects, showing that it is possible to predict fixations across readers. Our work relates to such work, studying the robustness of reading models across domains and readers, but it also relates in spirit to research on using weak supervision in NLP, e.g., work on using HTML markup to improve dependency parsers (Spitkovsky, 2013) or using click-through data to improve POS taggers (Ganchev et al., 2012).

5 Conclusions

We have shown that it is possible to use gaze features to discriminate between many POS pairs across domains, even with only a small dataset and a small set of subjects. We also showed that gaze features can improve the performance of a POS tagger trained on small seeds of data.

References

Michael Collins. 2002. Discriminative training methods for Hidden Markov Models. In EMNLP.

Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Josef Le Roux, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. From news to comments: Resources and benchmarks for parsing the language of Web 2.0. In IJCNLP.

Marco R. Furtner, John F. Rauthmann, and Pierre Sachse. 2009. Nomen est omen: Investigating the dominance of nouns in word comprehension with eye movement analyses. Advances in Cognitive Psychology, 5:91.
Kuzman Ganchev, Keith Hall, Ryan McDonald, and Slav Petrov. 2012. Using search-logs to improve query tagging. In ACL.

Alan Kennedy and Joël Pynte. 2005. Parafoveal-on-foveal effects in normal reading. Vision Research, 45(2):153–168.

Franz Matthies and Anders Søgaard. 2013. With blinkers on: Robust prediction of eye movements across readers. In EMNLP, Seattle, Washington, USA.

Nicolai Meinshausen and Peter Bühlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473.

Matthias Nilsson and Joakim Nivre. 2009. Learning where to look: Modeling eye movements in reading. In CoNLL.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In NAACL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.

Keith Rayner and Susan A. Duffy. 1988. On-line comprehension processes and eye movements in reading. In M. Daneman, G. E. MacKinnon, and T. G. Waller, editors, Reading Research: Advances in Theory and Practice, pages 13–66. Academic Press, New York.

Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372.

Jarkko Salojärvi, Ilpo Kojo, Jaana Simola, and Samuel Kaski. 2003. Can relevance be inferred from eye movements in information retrieval? In Proceedings of WSOM, volume 3, pages 261–266.

Javier San Agustin, Henrik Skovsgaard, Emilie Mollenbach, Maria Barret, Martin Tall, Dan Witzner Hansen, and John Paulin Hansen. 2010. Evaluation of a low-cost open-source gaze tracker. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pages 77–80. ACM.

Valentin Ilyich Spitkovsky. 2013. Grammar Induction and Parsing with Dependency-and-Boundary Models. Ph.D. thesis, Stanford University.

Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. TACL, 1:1–12.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL.
