
Fine-Grained Prediction of Syntactic Typology: Discovering Latent Structure with Supervised Learning

Dingquan Wang and Jason Eisner
Department of Computer Science, Johns Hopkins University
{wdd,eisner}@jhu.edu

Abstract

We show how to predict the basic word-order facts of a novel language given only a corpus of part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often adjectives follow their nouns, and in general the directionalities of all dependency relations. Such typological properties could be helpful in grammar induction. While such a problem is usually regarded as unsupervised learning, our innovation is to treat it as supervised learning, using a large collection of realistic synthetic languages as training data. The supervised learner must identify surface features of a language's POS sequence (hand-engineered or neural features) that correlate with the language's deeper structure (latent trees). In the experiment, we show: 1) Given a small set of real languages, it helps to add many synthetic languages to the training data. 2) Our system is robust even when the POS sequences include noise. 3) Our system on this task outperforms a grammar induction baseline by a large margin.

1 Introduction

Descriptive linguists often characterize a human language by its typological properties. For instance, English is an SVO-type language because its basic clause order is Subject-Verb-Object (SVO), and also a prepositional-type language because its adpositions normally precede the noun. Identifying basic word order must happen early in the acquisition of syntax, and presumably guides the initial interpretation of sentences and the acquisition of a finer-grained grammar. In this paper, we propose a method for doing this. While we focus on word order, one could try similar methods for other typological classifications (syntactic, morphological, or phonological).

The problem is challenging because the language's true word order statistics are computed from syntax trees, whereas our method has access only to a POS-tagged corpus. Based on these POS sequences alone, we predict the directionality of each type of dependency relation. We define the directionality to be a real number in [0, 1]: the fraction of tokens of this relation that are "right-directed," in the sense that the child (modifier) falls to the right of its parent (head). For example, the dobj relation points from a verb to its direct object (if any), so a directionality of 0.9—meaning that 90% of dobj dependencies are right-directed—indicates a dominant verb-object order. (See Table 1 for more such examples.) Our system is trained to predict the relative frequency of rightward dependencies for each of 57 dependency types from the Universal Dependencies project (UD). We assume that all languages draw on the same set of POS tags and dependency relations that is proposed by the UD project (see §3), so that our predictor works across languages.

Why do this? Liu (2010) has argued for using these directionality numbers in [0, 1] as fine-grained and robust typological descriptors. We believe that these directionalities could also be used to help define an initializer, prior, or regularizer for tasks like grammar induction or syntax-based machine translation. Finally, the vector of directionalities—or the feature vector that our method extracts in order to predict the directionalities—can be regarded as a language embedding computed from the POS-tagged corpus. This language embedding may be useful as an input to multilingual NLP systems, such as the cross-linguistic neural dependency parser of Ammar et al. (2016). In fact, some multilingual NLP systems already condition on typological properties looked up in the World Atlas of Language Structures, or WALS (Dryer and Haspelmath, 2013), as we review in §8.



However, WALS does not list all properties of all languages, and may be somewhat inconsistent since it collects work by many linguists. Our system provides an automatic alternative as well as a methodology for generalizing to new properties.

More broadly, we are motivated by the challenge of determining the structure of a language from its superficial features. Principles & Parameters theory (Chomsky, 1981; Chomsky and Lasnik, 1993) famously—if controversially—hypothesized that human babies are born with an evolutionarily tuned system that is specifically adapted to natural language, which can predict typological properties ("parameters") by spotting telltale configurations in purely linguistic input. Here we investigate whether such a system can be tuned by gradient descent. It is at least plausible that useful superficial features do exist: e.g., if nouns often precede verbs but rarely follow verbs, then the language may be verb-final.

Table 1: Three typological properties in the World Atlas of Language Structures (Dryer and Haspelmath, 2013), and how they affect the directionality of Universal Dependencies relations.
Typology (language) | Example | Relation illustrated
Verb-Object (English) | She gave me a raise | dobj, right-directed
Object-Verb (Hindi) | vah mujhe ek uthaane diya ("She me a raise gave") | dobj, left-directed
Prepositional (English) | She is in a car | case, left-directed
Postpositional (Hindi) | vah ek kaar mein hai ("She a car in is") | case, right-directed
Adjective-Noun (English) | This is a red car | amod, left-directed
Noun-Adjective (French) | Ceci est une voiture rouge ("This is a car red") | amod, right-directed

2 Approach

We depart from the traditional approach to latent structure discovery, namely unsupervised learning. Unsupervised syntax learners in NLP tend to be rather inaccurate—partly because they are failing to maximize an objective that has many local optima, and partly because that objective does not capture all the factors that linguists consider when assigning syntactic structure. Our remedy in this paper is a supervised approach. We simply imitate how linguists have analyzed other languages. This supervised objective goes beyond the log-likelihood of a PCFG-like model given the corpus, because linguists do not merely try to predict the surface corpus. Their dependency annotations may reflect a cross-linguistic theory that considers semantic interpretability and equivalence, rare but informative phenomena, consistency across languages, a prior across languages, and linguistic conventions (including the choice of latent labels such as dobj). Our learner does not consider these factors explicitly, but we hope it will identify correlates (e.g., using deep learning) that can make similar predictions. Being supervised, our objective should also suffer less from local optima. Indeed, we could even set up our problem with a convex objective, such as (kernel) logistic regression, to predict each directionality separately.

Why hasn't this been done before? Our setting presents unusually sparse data for supervised learning, since each training example is an entire language. The world presumably does not offer enough natural languages—particularly with machine-readable corpora—to train a good classifier to detect, say, Object-Verb-Subject (OVS) languages, especially given the class imbalance problem that OVS languages are empirically rare, and the non-IID problem that the available OVS languages may be evolutionarily related.¹ We mitigate this issue by training on the Galactic Dependencies treebanks (Wang and Eisner, 2016), a collection of more than 50,000 human-like synthetic languages. The treebank of each synthetic language is generated by stochastically permuting the subtrees in a given real treebank to match the word order of other real languages. Thus, we have many synthetic languages that are Object-Verb like Hindi but also Noun-Adjective like French. We know the true directionality of each synthetic language and we would like our classifier to predict that directionality, just as it would for a real language. We will show that our system's accuracy benefits from fleshing out the training set in this way, which can be seen as a form of regularization.

1 Properties shared within an OVS language family may appear to be consistently predictive of OVS, but are actually confounds that will not generalize to other families in test data.

A possible criticism of our work is that obtaining the input POS sequences requires human annotators, and perhaps these annotators could have answered the typological classification questions as well. Indeed, this criticism also applies to most work on grammar induction. We will show that our system is at least robust to noise in the input POS sequences (§7.4). In the future, we hope to devise similar methods that operate on raw word sequences.

3 Data

We use two datasets in our experiment:

UD: Universal Dependencies version 1.2 (Nivre et al., 2015). A collection of dependency treebanks for 37 languages, annotated in a consistent style with POS tags and dependency relations.

GD: Galactic Dependencies version 1.0 (Wang and Eisner, 2016). A collection of projective dependency treebanks for 53,428 synthetic languages, using the same format as UD. The treebank of each synthetic language is generated from the UD treebank of some real language by stochastically permuting the dependents of all nouns and/or verbs to match the dependent orders of other real UD languages. Using this "mix-and-match" procedure, the GD collection fills in gaps in the UD collection—which covers only a few possible human languages.

4 Task Formulation

We now formalize the setup of the fine-grained typological prediction task. Let R be the set of universal relation types, with N = |R|. We use →r to denote a rightward dependency token with label r ∈ R.

Input for language L: A POS-tagged corpus u. ("u" stands for "unparsed.")

Output for language L: Our system predicts p(→ | r, L), the probability that a token in language L of an r-labeled dependency will be right-oriented. It predicts this for each dependency relation type r ∈ R, such as r = dobj. Thus, the output is a vector of predicted probabilities p̂ ∈ [0, 1]^N.

Training: We set the parameters of our system using a collection of training pairs (u, p∗), each of which corresponds to some UD or GD training language L. Here p∗ denotes the true vector of probabilities as empirically estimated from L's treebank.

Evaluation: Over pairs (u, p∗) that correspond to held-out real languages, we evaluate the expected loss of the predictions p̂. We use ε-insensitive loss² with ε = 0.1, so our evaluation metric is

  Σ_{r∈R} p∗(r | L) · loss_ε(p̂(→ | r, L), p∗(→ | r, L))   (1)

where
• loss_ε(p̂, p∗) ≡ max(|p̂ − p∗| − ε, 0)
• p∗(→ | r, L) = count_L(→r) / count_L(r) is the empirical estimate of p(→ | r, L).
• p̂(→ | r, L) is the system's prediction of p∗.

The aggregate metric (1) is an expected loss that is weighted by p∗(r | L) = count_L(r) / Σ_{r′∈R} count_L(r′), to emphasize relation types that are more frequent in L.

Why this loss function? We chose an L1-style loss, rather than L2 loss or log-loss, so that the aggregate metric is not dominated by outliers. We took ε > 0 in order to forgive small errors: if some predicted directionality is already "in the ballpark," we prefer to focus on getting other predictions right, rather than fine-tuning this one. Our intuition is that errors < ε in p̂'s elements will not greatly harm downstream tasks that analyze individual sentences, and might even be easy to correct by grammar reestimation (e.g., EM) that uses p̂ as a starting point.

In short, we have the intuition that if our predicted p̂ achieves small loss_ε on the frequent relation types, then p̂ will be helpful for downstream tasks, although testing that intuition is beyond the scope of this paper. One could tune ε on a downstream task.

2 Proposed by V. Vapnik for support vector regression.
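To make the evaluation concrete, the sketch below computes the aggregate ε-insensitive loss of equation (1) for one held-out language from raw relation counts. It is an illustrative reimplementation rather than the authors' released code; the dictionary-based inputs and function names are assumptions.

```python
from collections import Counter

EPSILON = 0.1  # the epsilon used in the paper's evaluation

def loss_eps(p_hat, p_star, eps=EPSILON):
    # loss_eps(p_hat, p_star) = max(|p_hat - p_star| - eps, 0)
    return max(abs(p_hat - p_star) - eps, 0.0)

def aggregate_loss(right_counts, total_counts, p_hat, eps=EPSILON):
    """Equation (1): expected eps-insensitive loss over relation types.

    right_counts[r]: number of right-directed tokens of relation r in L's treebank
    total_counts[r]: total number of tokens of relation r
    p_hat[r]:        the system's predicted directionality for r
    """
    grand_total = sum(total_counts.values())
    total_loss = 0.0
    for r, n_r in total_counts.items():
        p_star_dir = right_counts.get(r, 0) / n_r   # empirical p*(-> | r, L)
        weight = n_r / grand_total                  # p*(r | L)
        total_loss += weight * loss_eps(p_hat.get(r, 0.5), p_star_dir, eps)
    return total_loss

# Toy usage: 90% of dobj tokens are right-directed, 10% of amod tokens are.
right = Counter({"dobj": 90, "amod": 5})
total = Counter({"dobj": 100, "amod": 50})
print(aggregate_loss(right, total, {"dobj": 0.85, "amod": 0.2}))  # 0.0 (both within eps)
```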

5 Simple "Expected Count" Baseline

Before launching into our full models, we warm up with a simple baseline heuristic called expected count (EC), which is reminiscent of Principles & Parameters. The idea is that if ADJs tend to precede nearby NOUNs in the sentences of language L, then amod probably tends to point leftward in L. After all, the training languages show that when ADJ and NOUN are nearby, they are usually linked by amod. Fleshing this out, EC estimates directionalities as

  p̂(→ | r, L) = ecount_L(→r) / (ecount_L(→r) + ecount_L(←r))   (2)

where we estimate the expected →r and ←r counts by

  ecount_L(→r) = Σ_{u∈u} Σ_{1≤i<j≤|u|} p(→r | u_i, u_j)   (3)
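Since the text of equation (3) breaks off above, the following sketch only illustrates the general recipe: sum a tag-pair-conditioned probability of each dependency type over nearby token pairs, then normalize as in (2). The proximity window and the way p(→r | u_i, u_j) is estimated from the training treebanks are assumptions here, not the paper's exact specification.

```python
from collections import defaultdict

WINDOW = 8  # assumed proximity bound; not specified in the truncated equation above

def expected_counts(tagged_corpus, p_right, p_left, window=WINDOW):
    """Sum p(->r | u_i, u_j) and p(<-r | u_i, u_j) over nearby token pairs (i < j).

    tagged_corpus: list of sentences, each a list of POS tags.
    p_right[(t_i, t_j)][r], p_left[(t_i, t_j)][r]: probabilities, estimated from
        training treebanks, that a nearby (t_i, t_j) pair is linked by a rightward
        (resp. leftward) dependency of type r.
    """
    ecount_right = defaultdict(float)
    ecount_left = defaultdict(float)
    for sent in tagged_corpus:
        for i, t_i in enumerate(sent):
            for j in range(i + 1, min(i + 1 + window, len(sent))):
                pair = (t_i, sent[j])
                for r, p in p_right.get(pair, {}).items():
                    ecount_right[r] += p
                for r, p in p_left.get(pair, {}).items():
                    ecount_left[r] += p
    return ecount_right, ecount_left

def ec_directionality(ecount_right, ecount_left, r):
    # Equation (2): p_hat(-> | r, L)
    den = ecount_right[r] + ecount_left[r]
    return ecount_right[r] / den if den > 0 else 0.5
```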

6.3 Design of the featurization function π(u)

Our current feature vector π(u) considers only the POS tag sequences for the sentences in the unparsed corpus u. Each sentence is augmented with a special boundary tag # at the start and end. We explore both hand-engineered features and neural features.

Hand-engineered features. Recall that §5 considered which tags appeared near one another in a given order. We now devise a slew of features to measure such co-occurrences in a variety of ways. By training the weights of these many features, our system will discover which ones are actually predictive.

Let g(t | j) ∈ [0, 1] be some measure (to be defined shortly) of the prevalence of tag t near token j of corpus u. We can then measure the prevalence of t, both overall and just near tokens of tag s:⁶

  π_t = mean_j g(t | j)   (8)
  π_{t|s} = mean_{j: T_j = s} g(t | j)   (9)

where T_j denotes the tag of token j. We now define versions of these quantities for particular prevalence measures g.

Given w > 0, let the right window W_j denote the sequence of tags T_{j+1}, ..., T_{j+w} (padding this sequence with additional # symbols if it runs past the end of j's sentence). We define quantities π^w_t and π^w_{t|s} via (8)–(9), using a version of g(t | j) that measures the fraction of tags in W_j that equal t. Also, for b ∈ {1, 2}, we define π^{w,b}_t and π^{w,b}_{t|s} using a version of g(t | j) that is 1 if W_j contains at least b tokens of t, and 0 otherwise.

For each of these quantities, we also define a corresponding mirror-image quantity (denoted by negating w > 0) by computing the same feature on a reversed version of the corpus.

We also define "truncated" versions of all quantities above, denoted by writing ˆ over the w. In these, we use a truncated window Ŵ_j, obtained from W_j by removing any suffix that starts with #⁷ or with a copy of tag T_j (that is, s). As an example, the truncated version of π^{8,2}_{N|V} asks how often a verb is followed by at least 2 nouns, within the next 8 words of the sentence and before the next verb. A high value of this is a plausible indicator of a VSO-type or VOS-type language.

We include the following features for each tag pair s, t and each w ∈ {1, 3, 8, 100, −1, −3, −8, −100}, as well as the truncated (hatted) versions of all eight window sizes:⁸

  π^w_t,  π^w_{t|s},  π^w_{t|s} · π^w_s,  π^w_{t|s}//π^w_t,  π^w_t//π^w_{t|s},  π^w_{t|s}//π^{−w}_{t|s}

where we define x//y = min(x/y, 1) to prevent unbounded feature values, which can result in poor generalization. Notice that for w = 1, π^w_{t|s} is bigram conditional probability, π^w_{t|s} · π^w_s is bigram joint probability, the log of π^w_{t|s}/π^w_t is bigram pointwise mutual information, and π^w_{t|s}//π^{−w}_{t|s} measures how much more prevalent t is to the right of s than to the left. By also allowing other values of w, we generalize these features. Finally, our model also uses versions of these features for each b ∈ {1, 2}.

6 In practice, we do backoff smoothing of these means. This avoids subsequent division-by-0 errors if tag t or s has count 0 in the corpus, and it regularizes π_{t|s}/π_t toward 1 if t or s is rare. Specifically, we augment the denominators by adding λ, while augmenting the numerator in (8) by adding λ · mean_{j,t} g(t | j) (unsmoothed) and the numerator in (9) by adding λ times the smoothed π_t from (8). λ > 0 is a hyperparameter (see §7.5).

7 In the "fraction of tags" features, g(t | j) is undefined (0/0) when Ŵ_j is empty. We omit undefined values from the means.

8 The reason we don't include π^{−w}_{t|s}//π^w_{t|s} is that it is included when computing features for −w.
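The windowed prevalence quantities are easy to compute directly from the tag sequences. The sketch below implements the "fraction of tags in the right window" version of g for (8)–(9); it omits the backoff smoothing of footnote 6 and the truncated and boolean (b) variants, and all names are illustrative rather than the authors' implementation.

```python
from collections import defaultdict

def window_prevalence(corpus, w):
    """Compute pi^w_t and pi^w_{t|s} of (8)-(9), where g(t | j) is the fraction
    of the next w tags (the right window W_j) that equal t.

    corpus: list of sentences, each a list of POS tags, already augmented with
            the boundary tag '#' at both ends.
    Returns (pi_t, pi_t_given_s) as dicts. No backoff smoothing (footnote 6).
    """
    g_sums = defaultdict(float)        # sum over tokens j of g(t | j)
    g_sums_by_s = defaultdict(float)   # same sums, restricted to tokens with T_j == s
    n_tokens = 0
    n_tokens_by_s = defaultdict(int)

    for sent in corpus:
        for j, s in enumerate(sent):
            window = sent[j + 1 : j + 1 + w]
            window += ['#'] * (w - len(window))   # pad past the sentence end
            n_tokens += 1
            n_tokens_by_s[s] += 1
            for t in set(window):
                g = window.count(t) / w
                g_sums[t] += g
                g_sums_by_s[(t, s)] += g

    pi_t = {t: v / n_tokens for t, v in g_sums.items()}
    pi_t_given_s = {(t, s): v / n_tokens_by_s[s] for (t, s), v in g_sums_by_s.items()}
    return pi_t, pi_t_given_s

# For w = 1, pi_t_given_s[("NOUN", "ADJ")] is the bigram conditional
# probability that the tag right after an ADJ is a NOUN.
```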

Neural features. As an alternative to the manually designed π function above, we consider a neural approach to detect predictive configurations in the sentences of u, potentially including complex long-distance configurations. Linguists working with Principles & Parameters theory have supposed that a single telltale sentence—a trigger—may be enough to determine a typological parameter, at least given the settings of other parameters (Gibson and Wexler, 1994; Frank and Kapur, 1996).

We map each corpus sentence u_i to a finite-dimensional real vector f_i by using a gated recurrent unit (GRU) network (Cho et al., 2014), a type of recurrent neural network that is a simplified variant of an LSTM network (Hochreiter and Schmidhuber, 1997). The GRU reads the sequence of one-hot embeddings of the tags in u_i (including the boundary symbols #). We omit the part of the GRU that computes an output sequence, simply taking f_i to be the final hidden state vector. The parameters are trained jointly with the rest of our typology prediction system, so the training procedure attempts to discover predictively useful configurations.

The various elements of f_i attempt to detect various interesting configurations in sentence u_i. Some might be triggers (which call for max-pooling over sentences); others might provide softer evidence (which calls for mean-pooling). For generality, therefore, we define our feature vector π(u) by soft-pooling of the sentence vectors f_i (Figure 2). The tanh gate in the GRU implies f_{ik} ∈ (−1, 1) and we transform this to the positive quantity f′_{ik} = (f_{ik} + 1)/2 ∈ (0, 1). Given an "inverse temperature" β, define⁹

  π^β_k = ( mean_i (f′_{ik})^β )^{1/β}   (10)

This π^β_k is a pooled version of f′_{ik}, ranging from max-pooling as β → +∞ (i.e., does f′_{ik} fire strongly on any sentence i?) to min-pooling as β → −∞. It passes through arithmetic mean at β = 1 (i.e., how strongly does f′_{ik} fire on the average sentence i?), geometric mean as β → 0 (this may be regarded as an arithmetic mean in log space), and harmonic mean at β = −1 (an arithmetic mean in reciprocal space).

Our final π is a concatenation of the π^β vectors for β ∈ {−4, −2, −1, 0, 1, 2, 4}. We chose these β values experimentally, using cross-validation.

Figure 2: Extracting and pooling the neural features.

Combined model. We also consider a model

  s(u) = α s_H(u) + (1 − α) s_N(u)   (11)

where s_H(u) is the score assigned by the hand-feature system, s_N(u) is the score assigned by the neural-feature system, and α ∈ [0, 1] is a hyperparameter to balance the two. s_H(u) and s_N(u) were trained separately. At test time, we use (11) to combine them linearly before the logistic transform (6). This yields a weighted-product-of-experts model.

9 For efficiency, we restrict the mean to i ≤ 1e4 (the first 10,000 sentences).

6.4 Training procedure

Length thresholding. By default, our feature vector π(u) is extracted from those sentences in u with length ≤ 40 tokens. In §7.3, however, we try concatenating this feature vector with one that is extracted in the same way from just sentences with length ≤ 10. The intuition (Spitkovsky et al., 2010) is that the basic word order of the language can be most easily discerned from short, simple sentences.

Initialization. We initialize the model of (6)–(7) so that the estimated directionality p̂(→ | r, L), regardless of L, is initially a weighted mean of r's directionalities in the training languages, namely

  p̄_r ≡ Σ_L w_L(r) p∗(→ | r, L)   (12)
  where w_L(r) ≡ p∗(r | L) / Σ_{L′} p∗(r | L′)   (13)

This is done by setting V = 0 and the bias (b_V)_r = log (p̄_r / (1 − p̄_r)), clipped to the range [−10, 10]. As a result, we make sensible initial predictions even for rare relations r, which allows us to converge reasonably quickly even though we do not update the parameters for rare relations as often.

We initialize the recurrent connections in the GRU to random orthogonal matrices. All other weight matrices in Figure 1 and the GRU use "Xavier initialization" (Glorot and Bengio, 2010). All other bias weight vectors are initialized to 0.

Regularization. We add an L2 regularizer to the objective. When training the neural network, we use dropout as well. All hyperparameters (regularization coefficient, dropout rate, etc.) are tuned via cross-validation; see §7.5.

Optimization. We use different algorithms in different feature settings. With scoring functions that use only hand features, we adjust the feature weights by stochastic gradient descent (SGD). With scoring functions that include neural features, we use RMSProp (Tieleman and Hinton, 2012).
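Two pieces of the model above are compact enough to sketch directly. First, the soft-pooling of equation (10), assuming the per-sentence GRU outputs have already been mapped to f′ ∈ (0, 1); the use of numpy and the treatment of β = 0 as the geometric-mean limit are implementation choices, not taken from the authors' code.

```python
import numpy as np

BETAS = [-4, -2, -1, 0, 1, 2, 4]   # the beta values used in the paper

def soft_pool(f_prime, betas=BETAS):
    """Pool an (n_sentences, n_features) matrix of values in (0, 1) as in (10).

    beta -> +inf approaches max-pooling, beta -> -inf min-pooling; beta = 1 is the
    arithmetic mean, beta = -1 the harmonic mean, and beta = 0 is taken here as the
    geometric mean (the beta -> 0 limit).
    """
    pooled = []
    for beta in betas:
        if beta == 0:
            pooled.append(np.exp(np.mean(np.log(f_prime), axis=0)))
        else:
            pooled.append(np.mean(f_prime ** beta, axis=0) ** (1.0 / beta))
    return np.concatenate(pooled)        # the neural feature vector pi(u)

# f = np.tanh(...)            # final GRU hidden states, one row per sentence
# f_prime = (f + 1.0) / 2.0   # map from (-1, 1) to (0, 1) as in the text
# pi_u = soft_pool(f_prime)
```

Second, the bias initialization of (12)–(13), which sets each relation's initial logit to the weighted mean of its training directionalities; the dictionary inputs are an assumed representation.

```python
import numpy as np

def initial_bias(p_star_dir, p_star_rate):
    """Compute (b_V)_r = log(p_bar_r / (1 - p_bar_r)), clipped to [-10, 10].

    Both arguments map language L -> {relation r: value} and are assumed to share keys:
    p_star_dir[L][r]  = empirical directionality p*(-> | r, L)
    p_star_rate[L][r] = empirical relative frequency p*(r | L)
    """
    relations = {r for rates in p_star_rate.values() for r in rates}
    bias = {}
    for r in relations:
        total = sum(rates.get(r, 0.0) for rates in p_star_rate.values())
        # Equation (12): weighted mean of r's directionalities, weights from (13)
        p_bar = sum(p_star_rate[L].get(r, 0.0) / total * p_star_dir[L].get(r, 0.5)
                    for L in p_star_rate)
        p_bar = min(max(p_bar, 1e-6), 1.0 - 1e-6)   # keep the logit finite
        bias[r] = float(np.clip(np.log(p_bar / (1.0 - p_bar)), -10.0, 10.0))
    return bias
```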

7 Experiments

7.1 Data splits

We hold out 17 UD languages for testing (Table 2). For training, we use the remaining 20 UD languages and tune the hyperparameters with 5-fold cross-validation. That is, for each fold, we train the system on 16 real languages and evaluate on the remaining 4. When augmenting the 16 real languages with GD languages, we include only GD languages that are generated by "mixing-and-matching" those 16 languages, which means that we add 16 × 17 × 17 = 4624 synthetic languages.¹⁰

Table 2: Data split of the 37 real languages, adapted from Wang and Eisner (2016). (Our "Train," on which we do 5-fold cross-validation, contains both their "Train" and "Dev" languages.)
Train: cs, es, fr, hi, de, it, la_itt, no, ar, pt, en, nl, da, fi, got, grc, et, la_proiel, grc_proiel, bg
Test: la, hr, ga, he, hu, fa, ta, cu, el, ro, sl, ja_ktc, sv, fi_ftb, id, eu, pl

Each GD treebank u provides a standard split into train/dev/test portions. In this paper, we primarily restrict ourselves to the train portions (saving the gold trees from the dev and test portions to tune and evaluate some future grammar induction system that consults our typological predictions). For example, we write u_train for the POS-tagged sentences in the "train" portion, and p∗_train for the empirical probabilities derived from their gold trees. We always train the model to predict p∗_train from u_train on each training language. To evaluate on a held-out language during cross-validation, we can measure how well the model predicts p∗_train given u_train.¹¹ For our final test, we evaluate on the 17 test languages using a model trained on all training languages (20 treebanks for UD, plus 20 × 21 × 21 = 8840 when adding GD) with the chosen hyperparameters. To evaluate on a test language, we again measure how well the model predicts p∗_train from u_train.

7.2 Comparison of architectures

Table 3 shows the cross-validation losses (equation (1)) that are achieved by different scoring architectures. We compare the results when the model is trained on real languages (the "UD" column) versus on real languages plus synthetic languages (the "+GD" column).

Table 3: Average expected loss over 20 UD languages, computed by 5-fold cross-validation. The first column indicates whether we score using hand-engineered features (sH), neural features (sN), or a combination (see end of §6.3). As a baseline, the first line evaluates the EC (expected count) heuristic from §5. Within each column, we boldface the best (smallest) result as well as all results that are not significantly worse (paired permutation test by language, p < 0.05). A starred result is significantly better than the other model in the same row.
Scoring | Depth | UD | +GD
EC | - | 0.104 | 0.099
sH | 0 | 0.057 | 0.037*
sH | 1 | 0.050 | 0.036*
sH | 3 | 0.060 | 0.048
sN | 1 | 0.062 | 0.048
α sH + (1 − α) sN | 1 | 0.050 | 0.032*

The sH models here use a subset of the hand-engineered features, selected by the experiments in §7.3 below and corresponding to Table 4 line 8. Although Figure 1 and equation (7) presented a "depth-1" scoring network with one hidden layer, Table 3 also evaluates "depth-d" architectures with d hidden layers. The depth-0 architecture simply predicts each directionality separately using logistic regression (although our training objective is not the usual convex log-likelihood objective).

Some architectures are better than others. We note that the hand-engineered features outperform the neural features—though not significantly, since they make complementary errors—and that combining them is best. However, the biggest benefit comes from augmenting the training data with GD languages; this consistently helps more than changing the architecture.

10 Why 16 × 17 × 17? As Wang and Eisner (2016, §5) explain, each GD treebank is obtained from the UD treebank of some substrate language S by stochastically permuting the dependents of verbs and nouns to respect typical orders in the superstrate languages R_V and R_N respectively. There are 16 choices for S. There are 17 choices for R_V (respectively R_N), including R_V = S ("self-permutation") and R_V = ∅ ("no permutation").

11 In actuality, we measured how well it predicts p∗_dev given u_dev. That was a slightly less sensible choice. It may have harmed our choice of hyperparameters, since dev is smaller than train and therefore p∗_dev tends to have greater sampling error. Another concern is that our typology system, having been specifically tuned to predict p∗_dev, might provide an unrealistically accurate estimate of p∗_dev to some future grammar induction system that is being cross-validated against the same dev set, harming that system's choice of hyperparameters as well.
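The 16 × 17 × 17 and 20 × 21 × 21 counts follow directly from footnote 10. A small sketch, assuming a GD treebank can be identified by the triple (substrate, R_V, R_N) with None standing for "no permutation"; this tuple representation is illustrative, not the Galactic Dependencies release format.

```python
from itertools import product

def gd_pool(real_languages):
    """Enumerate GD treebank IDs generated by mixing-and-matching the given real languages.

    Each ID is (substrate, R_V, R_N); a superstrate may be the substrate itself
    ("self-permutation") or None ("no permutation"), giving len(real_languages)+1 choices.
    """
    superstrates = list(real_languages) + [None]
    return [(s, rv, rn) for s, rv, rn in product(real_languages, superstrates, superstrates)]

fold_train = [f"lang{i}" for i in range(16)]   # 16 real languages inside one CV fold
print(len(gd_pool(fold_train)))                # 16 * 17 * 17 = 4624
all_train = [f"lang{i}" for i in range(20)]    # all 20 training languages
print(len(gd_pool(all_train)))                 # 20 * 21 * 21 = 8840
```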

7.3 Contribution of different feature classes

To understand the contribution of different hand-engineered features, we performed forward selection tests on the depth-1 system, including only some of the features. In all cases, we trained in the "+GD" condition. The results are shown in Table 4. Any class of features is substantially better than baseline, but we observe that most of the total benefit can be obtained with just PMI or asymmetry features. Those features indicate, for example, whether a verb tends to attract nouns to its right or left. We did not see a gain from length thresholding.

Table 4: Cross-validation losses with different subsets of hand-engineered features from §6.3. "∅" is a baseline with no features (bias feature only), so it makes the same prediction for all languages. "conditional" = π^w_{t|s} features, "joint" = π^w_{t|s} · π^w_s features, "PMI" = π^w_{t|s}//π^w_t and π^w_t//π^w_{t|s} features, "asymmetry" = π^w_{t|s}//π^{−w}_{t|s} features, "b" are the features superscripted by b, and "t" are the features with truncated window. "+" means concatenation. The "Length" field refers to length thresholding (see §6.4). The system in the starred row is the one that we selected for row 2 of Table 3.
ID | Features | Length | Loss (+GD)
0 | ∅ | — | 0.076
1 | conditional | 40 | 0.058
2 | joint | 40 | 0.057
3 | PMI | 40 | 0.039
4 | asymmetry | 40 | 0.041
5 | rows 3+4 | 40 | 0.038
6 | row 5+b | 40 | 0.037
7 | row 5+t | 40 | 0.037
8* | row 5+b+t | 40 | 0.036
9 | row 8 | 10 | 0.043
10 | row 8 | 10+40 | 0.036

7.4 Robustness to noisy input

We also tested our directionality prediction system on noisy input (without retraining it on noisy input). Specifically, we tested the depth-1 sH system. This time, when evaluating on the dev split of a held-out language, we provided a noisy version of that input corpus that had been retagged by an automatic POS tagger (Nguyen et al., 2014), which was trained on just 100 gold-tagged sentences from the train split of that language. The average tagging accuracy over the 20 languages was only 77.26%. Nonetheless, the "UD"-trained and "+GD"-trained systems got respective losses of 0.052 and 0.041—nearly as good as in Table 3, which used gold POS tags.

Table 5: Cross-validation average expected loss of the two grammar induction methods, MS13 (Mareček and Straka, 2013) and N10 (Naseem et al., 2010), compared to the EC heuristic of §5 and our architecture of §6 (the version from the last line of Table 3). In these experiments, the dependency relation types are ordered POS pairs. N10 harnesses prior linguistic knowledge, but its improvement upon MS13 is not statistically significant. Both grammar induction systems are significantly worse than the rest of the systems, including even our two baseline systems, namely EC (the "expected count" heuristic from §5) and ∅ (the no-feature baseline system from Table 4 line 0). Like N10, these baselines make use of some cross-linguistic knowledge, which they extract in different ways from the training treebanks. Among our own 4 systems, EC is significantly worse than all others, and +GD is significantly better than all others. (Note: When training the baselines, we found that including the +GD languages—a bias-variance tradeoff—harmed EC but helped ∅. The table reports the better result in each case.)
     | MS13 | N10 | EC | ∅ | UD | +GD
loss | 0.166 | 0.139 | 0.098 | 0.083 | 0.080 | 0.039

7.5 Hyperparameter settings

For each result in Tables 3–4, the hyperparameters were chosen by grid search on the cross-validation objective (and the table reports the best result). For the remaining experiments, we select the depth-1 combined models (11) for both "UD" and "+GD," as they are the best models according to Table 3. The hyperparameters for the selected models are as follows: When training with "UD," we took α = 1 (which ignores sN), with hidden layer size h = 256, ψ = sigmoid, L2 coeff = 0 (no L2 regularization), and dropout = 0.2. When training with "+GD," we took α = 0.7, with different hyperparameters for the two interpolated models: sH uses h = 128, ψ = sigmoid, L2 coeff = 0, and dropout = 0.4, while sN uses h = 128, emb size = 128, rnn size = 32, ψ = relu, L2 coeff = 0, and dropout = 0.2. For both "UD" and "+GD", we use λ = 1 for the smoothing in footnote 6.

7.6 Comparison with grammar induction

Grammar induction is an alternative way to predict word order typology. Given a corpus of a language, we can first use grammar induction to parse it into dependency trees, and then estimate the directionality of each dependency relation type based on these (approximate) trees.

However, what are the dependency relation types? Current grammar induction systems produce unlabeled dependency edges.

Rather than try to obtain a UD label like r = amod for each edge, we label the edge deterministically with a POS pair such as r = (parent = NOUN, child = ADJ). Thus, we will attempt to predict the directionality of each POS-pair relation type. For comparison, we retrain our supervised system to do the same thing.

For the grammar induction system, we try the implementation of DMV with stop-probability estimation by Mareček and Straka (2013), which is a common baseline for grammar induction (Le and Zuidema, 2015) because it is language-independent, reasonably accurate, fast, and convenient to use. We also try the grammar induction system of Naseem et al. (2010), which is the state-of-the-art system on UD (Noji et al., 2016). Naseem et al. (2010)'s method, like ours, has prior knowledge of what typical human languages look like.

Table 5 shows the results. Compared to Mareček and Straka (2013), Naseem et al. (2010) gets only a small (insignificant) improvement—whereas our "UD" system halves the loss, and the "+GD" system halves it again. Even our baseline systems are significantly more accurate than the grammar induction systems, showing the effectiveness of casting the problem as supervised prediction.

Figure 3: Cross-validation loss broken down by relation. We plot each relation r with x coordinate = the proportion of r in the average training corpus = mean_{L∈Train} p∗_train(r | L) ∈ [0, 1], and with y coordinate = the weighted average Σ_{L∈Heldout} w_L(r) loss_ε(p̂_dev(→ | r, L), p∗_dev(→ | r, L)) (see (13)).

Figure 4: The y coordinate is the average loss of our model (Table 4 line 8), just as in Figure 3, whereas the x coordinate is the average loss of a simple baseline model ∅ that ignores the input corpus (Table 4 line 0). Relations whose directionality varies more by language have higher baseline loss. Relations that beat the baseline fall below the diagonal line. The marker size for each relation is proportional to the x-axis in Figure 3.

7.7 Fine-grained analysis

Beyond reporting the aggregate cross-validation loss over the 20 training languages, we break down the cross-validation predictions by relation type. Figure 3 shows that the frequent relations are all quite predictable. Figure 4 shows that our success is not just because the task is easy—on relations whose directionality varies by language, so that a baseline method does poorly, our system usually does well.

To show that our system is behaving well across languages and not just on average, we zoom in on 5 relation types that are particularly common or of particular interest to linguistic typologists. These 5 relations together account for 46% of all relation tokens in the average language: nmod = noun-nominal modifier order, nsubj = subject-verb order (feature 82A in the World Atlas of Language Structures), dobj = object-verb order (83A), amod = adjective-noun order (87A), and case = placement of both adpositions and case markers (85A).

As shown in Figure 5, most points in the first five plots fall in or quite near the desired region. We are pleased to see that the predictions are robust when the training data is unbalanced. For example, the case relation points leftward in most real languages, yet our system can still predict the right directionality for "hi," "et," and "fi." The credit goes to the diversity of our training set, which contains various synthetic case-right languages: the system fails on these three languages if we train on real languages only. That said, apparently our training set is still not diverse enough to do well on the outlier "ar" (Arabic); see Figure 4 in Wang and Eisner (2016).

Figure 5: Scatterplots of predicted vs. true directionalities (by cross-validation). In the plot for relation type r, each language appears as a marker at (p∗, p̂) (see §4), with the marker size proportional to w_L(r) (see (13)). Points that fall between the solid lines (|p̂ − p∗| ≤ ε) are considered "correct," by the definition of ε-insensitive loss. The last plot (bottom right) shows worse predictions for case when the model is trained on UD only. (Panels: nmod, nsubj, dobj, amod, case, and case trained with UD only.)

Table 6: Accuracy on the simpler task of binary classification of relation directionality. The most common relations are shown first: the "Rate" column gives the average rate of the relation across the 20 training languages (like the x coordinate in Fig. 3).
Relation | Rate | EC | ∅ | UD | +GD
nmod | 0.15 | 0.85 | 0.9 | 0.9 | 0.9
punct | 0.11 | 0.85 | 0.85 | 0.85 | 0.85
case | 0.11 | 0.75 | 0.85 | 0.85 | 1
nsubj | 0.08 | 0.95 | 0.95 | 0.95 | 0.95
det | 0.07 | 0.8 | 0.9 | 0.9 | 0.9
dobj | 0.06 | 0.6 | 0.75 | 0.75 | 0.85
amod | 0.05 | 0.6 | 0.6 | 0.75 | 0.9
advmod | 0.05 | 0.9 | 0.85 | 0.85 | 0.85
cc | 0.04 | 0.95 | 0.95 | 0.95 | 0.95
conj | 0.04 | 1 | 1 | 1 | 1
mark | 0.03 | 0.95 | 0.95 | 0.95 | 0.95
advcl | 0.02 | 0.85 | 0.85 | 0.85 | 0.8
cop | 0.02 | 0.75 | 0.75 | 0.65 | 0.75
aux | 0.02 | 0.9 | 0.6 | 0.75 | 0.65
iobj | 0.02 | 0.45 | 0.55 | 0.5 | 0.6
acl | 0.01 | 0.45 | 0.85 | 0.85 | 0.8
nummod | 0.01 | 0.9 | 0.9 | 0.9 | 0.9
xcomp | 0.01 | 0.95 | 0.95 | 0.95 | 1
neg | 0.01 | 1 | 1 | 1 | 1
ccomp | 0.01 | 0.75 | 0.95 | 0.95 | 0.95
Avg. | - | 0.81 | 0.8475 | 0.855 | 0.8775

7.8 Binary classification accuracy

Besides ε-insensitive loss, we also measured how the systems perform on the coarser task of binary classification of relation direction. We say that relation r is dominantly "rightward" in language L iff p∗(→ | r, L) > 0.5. We say that a system predicts "rightward" according to whether p̂(→ | r, L) > 0.5. We evaluate whether this binary prediction is correct for each of the 20 most frequent relations r, for each held-out language L, using 5-fold cross-validation over the 20 training languages L as in the previous experiment. Tables 6 and 7 respectively summarize these results by relation (equal average over languages) and by language (equal average over relations). Keep in mind that these systems had not been specifically trained to place relations on the correct side of the artificial 0.5 boundary.

Binary classification is an easier task. It is easy because, as the ∅ column in Table 6 indicates, most relations have a clear directionality preference shared by most of the UD languages. As a result, the better models with more features have less opportunity to help. Nonetheless, they do perform better, and the EC heuristic continues to perform worse. In particular, EC fails significantly on dobj and iobj. This is because nsubj, dobj, and iobj often have different directionalities (e.g., in SVO languages), but the EC heuristic will tend to predict the same direction for all of them, according to whether NOUNs tend to precede nearby VERBs.
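A small sketch of the binary evaluation in §7.8 under assumed inputs: both gold and predicted directionalities are thresholded at 0.5 and compared over the most frequent relations.

```python
def binary_direction_accuracy(p_star, p_hat, relations):
    """Fraction of the given relations whose dominant direction is predicted correctly.

    p_star[r], p_hat[r]: true and predicted directionality of relation r in one language.
    relations: e.g. the 20 most frequent relation types.
    """
    correct = sum((p_star[r] > 0.5) == (p_hat[r] > 0.5) for r in relations)
    return correct / len(relations)

# The per-relation rows of Table 6 instead average the 0/1 outcome for a fixed r
# across the 20 training languages (equal weight per language).
```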

7.9 Final evaluation on test data

All previous experiments were conducted by cross-validation on the 20 training languages. We now train the system on all 20, and report results on the 17 blind test languages (Table 8). In our evaluation metric (1), R includes all 57 relation types that appear in training data, plus a special UNK type for relations that appear only in test data. The results range from good to excellent, with synthetic data providing consistent and often large improvements.

Table 7: Accuracy on the simpler task of binary classification of relation directionality for each training language. A detailed comparison shows that EC is significantly worse than UD and +GD, and that ∅ is significantly worse than +GD (paired permutation test over the 20 languages, p < 0.05). The improvement from UD to +GD is insignificant, which suggests that this is an easier task where weak models might suffice.
target | EC | ∅ | UD | +GD
ar | 0.8 | 0.8 | 0.75 | 0.85
bg | 0.85 | 0.95 | 0.95 | 0.95
cs | 0.9 | 1 | 1 | 0.95
da | 0.8 | 0.95 | 0.95 | 0.95
de | 0.9 | 0.9 | 0.9 | 0.95
en | 0.9 | 1 | 1 | 1
es | 0.9 | 0.9 | 0.95 | 0.95
et | 0.8 | 0.8 | 0.8 | 0.8
fi | 0.75 | 0.85 | 0.85 | 0.85
fr | 0.9 | 0.9 | 0.9 | 0.95
got | 0.75 | 0.8 | 0.85 | 0.8
grc | 0.6 | 0.7 | 0.7 | 0.75
grc_proiel | 0.8 | 0.8 | 0.85 | 0.9
hi | 0.6 | 0.45 | 0.45 | 0.7
it | 0.9 | 0.9 | 0.9 | 0.95
la_itt | 0.7 | 0.85 | 0.8 | 0.85
la_proiel | 0.7 | 0.7 | 0.75 | 0.7
nl | 0.95 | 0.85 | 0.85 | 0.85
no | 0.9 | 1 | 1 | 0.95
pt | 0.8 | 0.85 | 0.9 | 0.9
Avg. | 0.81 | 0.8475 | 0.855 | 0.8775

Table 8: Our final comparison on the 17 test languages appears first. We ask whether the average expected loss on these 17 real target languages is reduced by augmenting the training pool of 20 UD languages with 20 × 21 × 21 GD languages. For completeness, we extend the table with the cross-validation results on the training pool. The "Avg." lines report the average of 17 test or 37 training+testing languages. We mark both "+GD" averages with "*" as they are significantly better than their "UD" counterparts (paired permutation test by language, p < 0.05).
Test target | UD | +GD
cu | 0.024 | 0.024
el | 0.056 | 0.011
eu | 0.250 | 0.072
fa | 0.220 | 0.134
fi_ftb | 0.073 | 0.029
ga | 0.181 | 0.154
he | 0.079 | 0.033
hr | 0.062 | 0.011
hu | 0.119 | 0.102
id | 0.099 | 0.076
ja_ktc | 0.247 | 0.078
la | 0.036 | 0.004
pl | 0.056 | 0.023
ro | 0.029 | 0.009
sl | 0.015 | 0.031
sv | 0.012 | 0.008
ta | 0.238 | 0.053
Test Avg. | 0.106 | 0.050*
Train target | UD | +GD
ar | 0.116 | 0.057
bg | 0.037 | 0.015
cs | 0.025 | 0.014
da | 0.024 | 0.017
de | 0.046 | 0.025
en | 0.025 | 0.036
es | 0.012 | 0.007
et | 0.055 | 0.014
fi | 0.069 | 0.070
fr | 0.024 | 0.018
got | 0.008 | 0.026
grc | 0.026 | 0.007
grc_proiel | 0.004 | 0.017
hi | 0.363 | 0.191
it | 0.011 | 0.008
la_itt | 0.033 | 0.023
la_proiel | 0.018 | 0.021
nl | 0.069 | 0.066
no | 0.008 | 0.010
pt | 0.038 | 0.004
All Avg. | 0.076 | 0.040*

These results could potentially be boosted in the future by using an even larger and more diverse training set. In principle, when evaluating on any one of our 37 real languages, one could train a system on all of the other 36 (plus the galactic languages derived from them), not just 20. Moreover, the Universal Dependencies collection has continued to grow beyond the 37 languages used here (§3). Finally, our current setup extracts only one training example from each (real or synthetic) language. One could easily generate a variant of this example each time the language is visited during stochastic optimization, by bootstrap-resampling its training corpus (to add "natural" variation) or subsampling it (to train the predictor to work on smaller corpora). In the case of a synthetic language, one could also generate a corpus of new trees each time the language is visited (by re-running the stochastic permutation procedure, instead of reusing the particular permutation released by the Galactic Dependencies project).
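The bootstrap-resampling idea suggested above is not implemented in the paper; purely as an illustration, each visit to a language could draw a fresh corpus-sized sample of its sentences (or a smaller subsample) before feature extraction.

```python
import random

def bootstrap_variant(corpus, rng=random):
    """Resample a POS-tagged corpus with replacement to create a 'natural' variant.

    corpus: list of sentences (each a list of POS tags). The result has the same
    number of sentences, so the same language yields a different training example
    each time it is visited during stochastic optimization.
    """
    return [rng.choice(corpus) for _ in range(len(corpus))]

def subsample(corpus, k, rng=random):
    # Smaller variant, to train the predictor to work on smaller corpora.
    return rng.sample(corpus, k)
```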

8 Related Work

Typological properties can usefully boost the performance of cross-linguistic systems (Bender, 2009; O'Horan et al., 2016). These systems mainly aim to annotate low-resource languages with help from models trained on similar high-resource languages. Naseem et al. (2012) introduce a "selective sharing" technique for generative parsing, in which a Subject-Verb language will use parameters shared with other Subject-Verb languages. Täckström et al. (2013) and Zhang and Barzilay (2015) extend this idea to discriminative parsing and gain further improvements by conjoining regular parsing features with typological features. The cross-linguistic neural parser of Ammar et al. (2016) conditions on typological features by supplying a "language embedding" as input. Zhang et al. (2012) use typological properties to convert language-specific POS tags to UD POS tags, based on their ordering in a corpus.

Moving from engineering to science, linguists seek typological universals of human language (Greenberg, 1963; Croft, 2002; Song, 2014; Hawkins, 2014), e.g., "languages with dominant Verb-Subject-Object order are always prepositional." Dryer and Haspelmath (2013) characterize 2679 world languages with 192 typological properties. Their WALS database can supply features to NLP systems (see previous paragraph) or gold standard labels for typological classifiers. Daumé III and Campbell (2007) take WALS as input and propose a Bayesian approach to discover new universals. Georgi et al. (2010) impute missing properties of a language, not by using universals, but by backing off to the language's typological cluster. Murawaki (2015) uses WALS to help recover the evolutionary tree of human languages; Daumé III (2009) considers the geographic distribution of WALS properties.

Attempts at automatic typological classification are relatively recent. Lewis and Xia (2008) predict typological properties from induced trees, but guess those trees from aligned bitexts, not by monolingual grammar induction as in §7.6. Liu (2010) and Futrell et al. (2015) show that the directionality of (gold) dependencies is indicative of "basic" word order and freeness of word order. Those papers predict typological properties from trees that are automatically (noisily) annotated or manually (expensively) annotated. An alternative is to predict the typology directly from raw or POS-tagged text, as we do. Saha Roy et al. (2014) first explored this idea, building a system that correctly predicts adposition typology on 19/23 languages with only word co-occurrence statistics. Zhang et al. (2016) evaluate semi-supervised POS tagging by asking whether the induced tag sequences can predict typological properties. Their prediction approach is supervised like ours, although developed separately and trained on different data. They more simply predict 6 binary-valued WALS properties, using 6 independent binary classifiers based on POS bigrams and trigrams.

Our task is rather close to grammar induction, which likewise predicts a set of real numbers giving the relative probabilities of competing syntactic configurations. Most previous work on grammar induction begins with maximum likelihood estimation of some generative model—such as a PCFG (Lari and Young, 1990; Carroll and Charniak, 1992) or dependency grammar (Klein and Manning, 2004)—though it may add linguistically-informed inductive bias (Ganchev et al., 2010; Naseem et al., 2010). Most such methods use local search and must wrestle with local optima (Spitkovsky et al., 2013). Fine-grained typological classification might supplement this approach, by cutting through the initial combinatorial challenge of establishing the basic word-order properties of the language. In this paper we only quantify the directionality of each relation type, ignoring how tokens of these relations interact locally to give coherent parse trees. Grammar induction methods like EM could naturally consider those local interactions for a more refined analysis, when guided by our predicted global directionalities.

9 Conclusions and Future Work

We introduced a typological classification task, which attempts to extract quantitative knowledge about a language's syntactic structure from its surface forms (POS tag sequences). We applied supervised learning to this apparently unsupervised problem. As far as we know, we are the first to utilize synthetic languages to train a learner for real languages:¹² this move yielded substantial benefits.

Figure 5 shows that we rank held-out languages rather accurately along a spectrum of directionality, for several common dependency relations. Table 8 shows that if we jointly predict the directionalities of all the relations in a new language, most of those numbers will be quite close to the truth (low aggregate error, weighted by relation frequency). This holds promise for aiding grammar induction.

Our trained model is robust when applied to noisy POS tag sequences. In the future, however, we would like to make similar predictions from raw word sequences. That will require features that abstract away from the language-specific vocabulary. Although recurrent neural networks in the present paper did not show a clear advantage over hand-engineered features, they might be useful when used with word embeddings.

Finally, we are interested in downstream uses. Several NLP tasks have benefited from typological features (§8). By using end-to-end training, our methods could be tuned to extract features (existing or novel) that are particularly useful for some task.

12 Although Wang and Eisner (2016) review uses of synthetic training data elsewhere in machine learning.

Acknowledgements

This work was funded by the U.S. National Science Foundation under Grant No. 1423276. We are grateful to the state of Maryland for providing indispensable computing resources via the Maryland Advanced Research Computing Center (MARCC). We thank the Argo lab members for useful discussions. Finally, we thank TACL action editor Mark Steedman and the anonymous reviewers for high-quality suggestions, including the EC baseline and the binary classification evaluation.

References

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah Smith. Many languages, one parser. Transactions of the Association of Computational Linguistics, 4:431–444, 2016.

Emily M. Bender. Linguistically naïve != language independent: Why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?, pages 26–32, 2009.

Glenn Carroll and Eugene Charniak. Two experiments on learning probabilistic dependency grammars from corpora. In Working Notes of the AAAI Workshop on Statistically-Based NLP Techniques, pages 1–13, 1992.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, 2014.

Noam Chomsky. Lectures on Government and Binding: The Pisa Lectures. Holland: Foris Publications, 1981.

Noam Chomsky and Howard Lasnik. The theory of principles and parameters. In Syntax: An International Handbook of Contemporary Research, Joachim Jacobs, Arnim von Stechow, Wolfgang Sternefeld, and Theo Vennemann, editors. Berlin: de Gruyter, 1993.

William Croft. Typology and Universals. Cambridge University Press, 2002.

Hal Daumé III. Non-parametric Bayesian areal linguistics. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 593–601, 2009.

Hal Daumé III and Lyle Campbell. A Bayesian model for discovering typological implications. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 65–72, 2007.

Matthew S. Dryer and Martin Haspelmath, editors. The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig, 2013. http://wals.info/.

Joakim Nivre, et al. Universal Dependencies 1.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague. Data available at http://universaldependencies.org, 2015.

Robert Frank and Shyam Kapur. On the use of triggers in parameter setting. Linguistic Inquiry, 27:623–660, 1996.

Richard Futrell, Kyle Mahowald, and Edward Gibson. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics, pages 91–100, 2015.

Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.

Ryan Georgi, Fei Xia, and William Lewis. Comparing language similarity across genetic and typologically-based groupings. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 385–393, 2010.

Edward Gibson and Kenneth Wexler. Triggers. Linguistic Inquiry, 25(3):407–454, 1994.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010.

Joseph H. Greenberg. Some universals of grammar with particular reference to the order of meaningful elements. In Universals of Language, Joseph H. Greenberg, editor, pages 73–113. MIT Press, 1963.

John A. Hawkins. Word Order Universals. Elsevier, 2014.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Dan Klein and Christopher Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 478–485, 2004.

Karim Lari and Steve J. Young. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4(1):35–56, 1990.

Phong Le and Willem Zuidema. Unsupervised dependency parsing: Let's use supervised parsers. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 651–661, 2015.

William D. Lewis and Fei Xia. Automatically identifying computationally relevant typological features. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, 2008.

Haitao Liu. Dependency direction as a means of word-order typology: A method based on dependency treebanks. Lingua, 120(6):1567–1578, 2010.

David Mareček and Milan Straka. Stop-probability estimates computed on a large corpus improve unsupervised dependency parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 281–290, 2013.

Yugo Murawaki. Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 324–334, 2015.

Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1234–1244, 2010.

Tahira Naseem, Regina Barzilay, and Amir Globerson. Selective sharing for multilingual dependency parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 629–637, 2012.

Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham. RDRPOSTagger: A ripple down rules-based part-of-speech tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 17–20, 2014.

Hiroshi Noji, Yusuke Miyao, and Mark Johnson. Using left-corner parsing to encode universal structural constraints in grammar induction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 33–43, 2016.

Helen O'Horan, Yevgeni Berzak, Ivan Vulic, Roi Reichart, and Anna Korhonen. Survey on the use of typological information in natural language processing. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, pages 1297–1308, 2016.

Rishiraj Saha Roy, Rahul Katare, Niloy Ganguly, and Monojit Choudhury. Automatic discovery of adposition typology. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers, pages 1037–1046, 2014.

Jae Jung Song. Linguistic Typology: Morphology and Syntax. Routledge, 2014.

Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 751–759, 2010.

Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. Breaking out of local optima with count transforms and model recombination: A study in grammar induction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1983–1995, 2013.

Oscar Täckström, Ryan McDonald, and Joakim Nivre. Target language adaptation of discriminative transfer parsers. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1061–1071, 2013.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Dingquan Wang and Jason Eisner. The Galactic Dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association of Computational Linguistics, 4:491–505, 2016. Data available at https://github.com/gdtreebank/gdtreebank.

Yuan Zhang and Regina Barzilay. Hierarchical low-rank tensors for multilingual transfer parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1857–1867, 2015.

Yuan Zhang, Roi Reichart, Regina Barzilay, and Amir Globerson. Learning to map into a universal POS tagset. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1368–1378, 2012.

Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. Ten pairs to tag—multilingual POS tagging via coarse mapping between embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1307–1317, 2016.
