Better Call Saul: Flexible Programming for Learning and Inference in NLP

Parisa Kordjamshidi, Tulane University, [email protected]
Daniel Khashabi, University of Illinois at Urbana-Champaign
Christos Christodoulopoulos, University of Illinois at Urbana-Champaign
Bhargav Mangipudi, University of Illinois at Urbana-Champaign
Sameer Singh, University of California, Irvine
Dan Roth, University of Illinois at Urbana-Champaign

COLING-2016, 1 December 2016, Osaka, Japan

WHAT

• Introducing Saul, a declarative learning-based programming language [Kordjamshidi et al., IJCAI-2015].

• In particular, the way we have augmented it with abstraction levels and facilities for designing various NLP tasks, with arbitrary output structures at various linguistic granularities:
  ■ Word level
  ■ Phrase level
  ■ Sentence level
  …

WHY

Most NLP learning tasks seek a mapping from input structures to output structures:

■ Syntactic => Part-of-speech tagging
■ Information extraction => Entity mention / relation extraction
■ Semantic => Semantic role labeling

We often need substantial programming effort and hard-coded solutions to:

■ Benefit from a specific structure
■ Perform (relational) feature extraction (even when using representation learning techniques)

Example Tasks (1)

Semantics: Semantic role labeling

[Figure: a data-model graph for the sentence "Washington covers Seattle for the Associated Press", with nodes such as "covers Seattle" and "Seattle", edges such as cover-Washington, cover-Seattle and cover-for the Associated Press, and properties (POS window, parse path) leading to the role labels A0, A1 and AM-PNC.]

Example Tasks (2)

Information extraction: Entity mention / relation extraction

[Figure: for the sentence "Washington covers Seattle for the Associated Press.", entity nodes Washington (GPE), Seattle (GPE) and the Associated Press (Organization), with relation edges Washington-Seattle and Seattle-the Associated Press, both labeled Located-at.]

Example Tasks (3)

All annotations: Syntax, Semantics, Mentions, Relations …

Figure 1: An instantiation of the data-model for the NLP domain. The colored ovals are some observed properties, while the white ones show the unknown labels. For the POS and Entity Recognition tasks, the boxes represent candidates for single labels; for the SRL and Relation Extraction tasks, they represent candidates for linked labels.

labels that are applicable on the node with the same type as $x_c$. These constraints are added as a part of Saul's objective, so we have the following objective form, which is in fact a constrained conditional model (Chang et al., 2012): $g = \langle w, f(x, y) \rangle - \langle \rho, c(x, y) \rangle$, where $c$ is the constraint function and $\rho$ is the vector of penalties for violating each constraint. This representation corresponds to an integer linear program, and thus can be used to encode any MAP problem. Specifically, the $g$ function is written as the sum of local joint feature functions, which are the counterparts of the probabilistic factors:

$$g(x, y; w) = \sum_{l_p \in l} \sum_{x_k \in \{x\}_\tau} \langle w_p, f_p(x_k, l_{p_k}) \rangle + \sum_{m=1}^{|C|} \rho_m c_m(x, y), \qquad (2)$$

where $C$ is a set of global constraints that can hold among various types of nodes. $g$ can represent a general scoring function rather than the one corresponding to the likelihood of an assignment. The constraints are used during training, for loss-augmented inference, as well as during prediction.
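To make the decomposition concrete, the following is an illustrative instantiation of Equation (2), not from the paper, for the three SRL label types introduced later in Section 4.1.1 ($l_{isPred}$, $l_{isArg}$, $l_{argType}$); the candidate sets are left implicit, since each template defines its own candidate generator:

$$g(x, y; w) = \sum_{x_k} \langle w_{isPred}, f_{isPred}(x_k, l_{isPred,k}) \rangle + \sum_{(x_i, x_j)} \langle w_{isArg}, f_{isArg}(x_{ij}, l_{isArg,ij}) \rangle + \sum_{(x_i, x_j)} \langle w_{argType}, f_{argType}(x_{ij}, l_{argType,ij}) \rangle + \sum_{m=1}^{|C|} \rho_m c_m(x, y)$$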

4 Calling Saul: Case Studies

To program global models in Saul, the programmer needs to declare a) the data-model, which is a global structure of the data, and b) the templates for learning and inference decompositions. The templates are declared intuitively in two forms: classifiers, using the Learnable construct, and first-order constraints, using the ConstrainedClassifier construct. Once these components have been specified, the programmer can easily choose which templates to use for learning (training) and inference (prediction). In this way, the global objective is generated automatically for different training and testing paradigms in the spectrum of local to global models. One advantage of programming in Saul is that one can define a generic data-model for various tasks in each application domain. In this paper, we enrich Saul with an NLP data-model based on EDISON, a recently-introduced NLP library which contains raw-data readers, data structures and feature extractors (Sammons et al., 2016), and use it as a collection of Sensors to easily generate the data-model from the raw data. In Saul, a Sensor is a 'black-box' function that can generate nodes, edges and properties in the graph. An example of a sensor for generating nodes and edges is a sentence tokenizer, which receives a sentence and generates its tokens. Here we provide some examples of the data-model declaration language, but more details are available online². In the rest of the paper, we walk through the tasks of Semantic Role Labeling (SRL), Part-of-Speech (POS) tagging and Entity-Relation (ER) extraction and show how we can design a variety of local to global models by presenting the related code³.
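For instance, the tokenizer sensor mentioned above can be attached to an edge in the same way relToArgument is attached in Figure 2; a minimal sketch, where the node collections (sentences, tokens) and the sensor function sentenceTokenizer are illustrative names rather than library API:

// Minimal sketch: a tokenizer sensor populates the sentence-to-token edge,
// so token nodes are generated from raw sentences when the graph is built.
val sentenceToTokens = edge(sentences, tokens)
sentenceToTokens.addSensor(sentenceTokenizer _)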

² https://github.com/IllinoisCogComp/saul/blob/master/saul-core/doc/DATAMODELING.md
³ https://github.com/IllinoisCogComp/saul/tree/master/saul-examples/src/main/scala/edu/illinois/cs/cogcomp/saulexamples/nlp

Structured Output Models: Common Practice

For extraction and representation of features, as well as for exploiting global output structure, we resort to:

■ Task-specific programming for data structures
■ Model-specific programming for inference and learning
■ This makes results hard to generalize
■ This makes results hard to reuse and reproduce

Structured Output Models: Learning and Inference Paradigms

■ Local models (LO): local classifiers trained, and output components predicted, independently

■ Pipelines

■ Global models
  ■ L+I: local training (LO), global prediction
  ■ IBT: global training and global prediction
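A hedged sketch of how these paradigms map onto Saul calls, using constructs that appear later in this talk (train, ConstrainedClassifier, JointTrain); the classifier and constraint names are illustrative:

// LO: train each classifier independently and predict with it directly.
ArgTypeLearner.train()

// L+I: keep the locally trained model, add global constraints at prediction time.
object ArgTypeConstrained extends ConstrainedClassifier(ArgTypeLearner) {
  def subjectTo = srlConstraints
}

// IBT: train all constrained classifiers jointly, with global inference
// inside the training loop.
JointTrain(sentences, List(ArgTypeConstrained))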

Saul and NLP

Idea: a high-level abstraction for programming various configurations, from local to global learning, as well as building pipelines over NLP data structures.

■ Data model
  ■ Graph (typed nodes, edges and properties)
  ■ Sensors (black-box functions that operate on the graph's base types)
■ Templates for learning and inference decomposition
  ■ Classifiers
  ■ Constraints

Underlying Computational Model

Learning: structured output learning.

[Equation shown on the slide: the structured-output training objective, annotated with the weight vector and the joint feature function.]

Inference: decoding / prediction-time inference.

Underlying Computational Model: Input/Output

[Equation shown on the slide: the prediction objective over the joint feature function and the weight vector, in addition to the constraints between labels.]

Underlying Computational Model: Global Constraints

Constrained Conditional Models (CCM) [Roth & Yih '04, '07; Chang et al. '08, '12]

■ Prediction function: assign values that maximize the objective
  $h(x) = \arg\max_{y \in Y(x)} g(x, y; W)$

■ The objective is linear in features and constraints:
  $g = \langle W, f(x, y) \rangle - \langle \rho, c(x, y) \rangle$

Compile everything into an Integer Linear Program: expressive enough to support decision making in the context of any probabilistic modeling.
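As a hedged illustration of that compilation (standard for CCMs, though not spelled out on the slide): introducing binary indicator variables $z_{c,l}$ for assigning label $l$ to candidate $c$, the prediction problem becomes

$$\max_{z} \sum_{c} \sum_{l} \langle W, f(x, c, l) \rangle \, z_{c,l} \quad \text{s.t.} \quad \sum_{l} z_{c,l} = 1 \;\; \forall c, \qquad z_{c,l} \in \{0, 1\},$$

with one additional linear (in)equality per global constraint, either enforced exactly or penalized through $\rho$.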

Semantic Role Labeling: Data Model

val sentences = node[TextAnnotation]
val predicates = node[Constituent]
val arguments = node[Constituent]
val pairs = node[Relations]
val pos-tag = property(arguments)
val word-form = property(arguments)
val relationsToArguments = edge(relations, arguments)
relationsToArguments.addSensor(relToArgument _)

Figure 2: An example of data-model declarations for nodes, edges and properties, and of using sensors. The sentences nodes are of the TextAnnotation type, a class that is part of Saul's underlying NLP library; many predefined sensors can be applied to it to generate various nodes of type Constituent and Relations, generate properties of those nodes, and establish edges between them.

A graph in terms of typed nodes, edges and properties.

4.1 Semantic Role Labeling

SRL (Carreras and Màrquez, 2004) is a shallow semantic analysis framework, whereby a sentence is analysed into multiple propositions, each one consisting of a predicate and one or more core arguments, labeled with protosemantic roles (agent [Arg0], patient/theme [Arg1], beneficiary [Arg2], etc.), and zero or more optional arguments, labeled according to their semantic function (temporal, locative, manner, etc.). See Figure 1 for an example annotation.

4.1.1 Input-Output Spaces

Each sentence is a node in the data-model, comprised of constituents (derived from a tokenizer Sensor). These constituents are atomic components of $x$ (see Figure 1) and are identified as $x = \{x_1, \ldots, x_4\}$, where $x_i$ is the identifier of the $i$-th constituent in the sentence. Each constituent is described by a number of properties (word-form, pos-tag, ...) and the corresponding feature vector representation of these properties is denoted by $constituent(x_i)$. There are also composed components -- pairs of constituents; their descriptive vectors are referred to as $pair(x_i, x_j)$. The feature vector of a composed component such as a pair, $pair(x_1, x_2)$, is usually described by the local features of $x_1$ and $x_2$ and the relational features between them, such as the order of their positions, etc.

The main label set for the SRL model is $l = \{l_{isPred}, l_{isArg}, l_{argType}\}$, which indicates whether a constituent is a predicate, whether it is an argument, and the argument role, respectively. $l_{isArg}$ and $l_{argType}$ are linked labels, meaning that they are defined with respect to another constituent (the predicate). Depending on the type of correlations we aim to capture, we can introduce new linked labels in the model. These labels are not necessarily the target of the predictions, but they help to capture the dependencies among labels. For example, to capture the long-distance dependencies between two different arguments of the same predicate, we can introduce a linked label connecting the labels of two pairs and impose consistency constraints between this new linked label and the label of each pair. Figure 2 shows some declarations of the data-model's graph representing input and output components of the learning models. The graph can be queried using our graph traversal language, and the queries are directly used as structured machine learning features later.
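As a concrete illustration of such queries, here is a minimal sketch of a relational feature defined by traversing the relationsToArguments edge from Figure 2; property(...) and the ~> traversal operator are Saul's own constructs (cf. Figure 4), while the headWord property is a hypothetical example:

// Hedged sketch: a feature of a pair node, computed by walking the graph
// to the argument constituent and reading one of its properties.
val argumentHeadWord = property(pairs) { r: Relations =>
  val arg = (pairs(r) ~> relationsToArguments).head
  headWord(arg) // hypothetical property of argument nodes
}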

4.1.2 Classifiers and Constraints

As mentioned in Section 3, the structure of a learning model is specified by templates, which are defined as classifiers (feature templates) and global constraints (constraint templates) in Saul. SRL has four main feature templates: 1) Predicate template: connects an input constituent to a single label $l_{isPred}$. The input features of this template are generated based on the properties of the constituents, $constituent$. The candidate generator of this template is a filter that takes all constituents whose pos-tag is VP. 2) Argument template: connects a pair of constituents to a linked label $l_{isArg}$. The candidate generator of this template is the set of rules suggested by Xue and Palmer (2004). 3) ArgumentType template: connects a pair of constituents to the linked label $l_{argType}$. The same Xue-Palmer heuristics are used. 4) ArgumentTypeCorrelations template: connects two pairs of constituents (i.e. relations between relations) to their joint linked label. The candidates are the pairs of Xue-Palmer candidates.

Semantic Role Labeling: Classifiers

object ArgTypeLearner extends Learnable(pairs) {
  def label = argumentLabelGold
  def feature = using(containsMOD, containsNEG, clauseFeatures,
    chunkPathPattern, chunkEmbedding, chunkLength, constituentLength,
    argPOSWindow, argWordWindow, headwordRelation, syntacticFrameRelation,
    pathRelation, phraseTypeRelation, predPosTag, predLemmaR, linearPosition)
}

Figure 3: The ArgumentType feature template. $l_{argType}$ is one linked label as a part of the objective in Equation 1, along with the corresponding block of weights and the pair candidates (the diamonds in the original figure show dot products). label and feature are respectively a property and a list of properties of pair nodes declared in the data-model; these serve as the output and input parts of this template. This template can be used as a local classifier or as a part of the objective of a global model, depending on the learning paradigm indicated by the programmer.

val legalArgumentsConstraint = constraint(sentences) { x =>
  val constraints = for {
    y <- sentences(x) ~> sentenceToPredicates
    candidateRelations = (predicates(y) ~> -relationsToPredicates)
    argLegalList = legalArguments(y)
    relation <- candidateRelations
  } yield classiferLabelIsLegal(argumentTypeLearner, relation, argLegalList) or
    (argumentTypeLearner on relation is "none")
  constraints
}

def classiferLabelIsLegal(classifier, relation, legalLabels) = {
  legalLabels._exists { l => (classifier on relation is l) }
}

Figure 4: Given a predicate, some argument types are illegal according to the PropBank frames (e.g. the verb 'cover' with sense 03 can take only Arg0 or Arg1), which means that they should be excluded from the inference. legalArguments(y) returns the predefined list of legal arguments for a predicate y. In line 3, graph traversal queries (using the ~> operator) are applied to follow an edge from a sentence node to all predicate nodes contained in the sentence, and the constraint is then applied to all of those predicates. Each constraint forces the argumentTypeLearner to assign a legal argument type to each candidate argument, or not to count it as an argument at all, i.e., to assign the none value to the argument type.

The feature templates are instances of Learnable in Saul, and they are in fact treated as local classifiers. The script in Figure 3 shows the ArgumentType template. The constraints are specified by means of first-order logical expressions; we use the constraints specified in Punyakanok et al. (2008) in our models. The script in Figure 4 shows an example expressing the legal-argument constraints for a sentence.

4.1.3 Model Configurations

Programming learning and inference configurations in Saul is simply a matter of composing the basic building blocks of the language, that is, feature and constraint templates, in different ways.

Local models. Training local models is as easy as calling the train function on each specified feature template separately (e.g. ArgTypeLearner.train()). Testing these models is likewise done by calling test on each template (e.g. ArgTypeLearner.test()). In addition, TrainClassifiers(/*a list of classifiers*/) and TestClassifiers(/*a list of classifiers*/) can be used to train/test a number of classifiers independently by passing a list of classifier names to these functions. The training algorithm can be specified when declaring the Learnable; in these experiments we used averaged perceptrons, which is the default model.
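How an algorithm other than the default is attached is not shown in this excerpt; the following is a sketch under the assumption that the Learnable declaration exposes an overridable classifier member (the member name and learner class may differ in Saul's actual API):

// Assumption: the base learner can be swapped at declaration time; the
// averaged perceptron mentioned above is the default.
object ArgTypeLearnerSAP extends Learnable(pairs) {
  def label = argumentLabelGold
  def feature = using(pathRelation, phraseTypeRelation, predPosTag)
  override lazy val classifier = new SparseAveragedPerceptron()
}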

Semantic Role Labeling: Combined Feature Functions

■ Single or composed components of the input are represented by typed nodes in the graph
■ All features are defined as properties of the nodes
■ Labels likewise apply to single or composed components of the input (called linked labels in the latter case)
■ Edges are established between nodes
■ The edges and properties are defined and computed based on a set of given NLP sensors

Semantic Role Labeling: Constraints

Only legal arguments of a predicate may be assigned as a type to the candidate arguments; legality is checked against the PropBank frames (see Figure 4).

Model                                                Precision  Recall  F1
ArgTypeLearner^G (GOLD PREDS)                        85.35      85.35   85.35
ArgTypeLearner^G (GOLD PREDS) + C                    85.35      85.36   85.35
ArgTypeLearner^Xue (GOLD PREDS)                      82.32      80.97   81.64
ArgTypeLearner^Xue (GOLD PREDS) + C                  82.90      80.70   81.79
ArgTypeLearner^Xue (PRED PREDS)                      82.47      80.79   81.62
ArgTypeLearner^Xue (PRED PREDS) + C                  83.62      80.54   82.05
ArgIdentifier^Xue | ArgTypeLearner^Xue (PRED PREDS)  82.55      81.59   82.07
ArgIdentifier^G (PRED PREDS)                         95.51      94.19   94.85

Table 1: Evaluation of various SRL labels and configurations. The superscripts over the different Learners indicate whether gold argument boundaries (G) or the Xue-Palmer heuristics (Xue) were used to generate argument candidates as input. GOLD/PRED PREDS indicates whether the Learner used gold or predicted predicates. '+C' refers to the use of constraints during prediction, and '|' denotes the pipeline architecture.

Pipeline. Previous research on SRL (Punyakanok et al., 2008) shows that a good working model is one that first decides on argument identification and then takes those arguments and decides on their roles. This configuration requires only a very minor change to the templates of the local models: instead of using Xue-Palmer candidates, we can use the arguments identified by an isArgument classifier as input candidates for the ArgTypeLearner model. The rest of the model stays the same.

L+I model. This is simply a locally trained classifier that uses a number of constraints at prediction time. We define a constrained argument predictor based on a previously trained local Learnable as follows:

object ArgTypeConstraintClassifier extends ConstrainedClassifier(ArgTypeLearner) {
  def subjectTo = srlConstraints
}

where srlConstraints is a constraint template. Having this definition, we only need to call the ArgTypeConstraintClassifier predictor at test time as ArgTypeConstraintClassifier(x), which decides the label of x in a global context.

Semantic Role Labeling: Constrained Classifiers

This constrained classifier still applies to a given pair candidate, BUT it uses the global constraints at the sentence level. The sentence is accessed via the edges defined in the data model that connect the relation to its original sentence.

IBT model. The linguistic background knowledge about SRL described in Section 4.1.2 makes it possible to design a variety of global models. The constraints that limit the argument arrangement around a specific predicate help to make sentence-level decisions for each predicate during the training phase and/or the prediction phase. To train the global models, we simply call the joint train function and provide the list of all declared constraint classifiers as parameters. The results of some versions of these models are shown in Table 1. The experimental settings, the data and the train/test splits follow Punyakanok et al. (2008), and the results are comparable. As the results show, the models that use constraints are the best performing ones. For SRL, the global background knowledge on the arguments in the IBT setting did not improve the results.

Program Structure

val srlDataModelObject = PopulateSRLDataModel(…)
val AllMyConstrainedClassifiers = List(argTypeConstraintClassifier, …)
JointTrain(sentences, AllMyConstrainedClassifiers)
ClassifierUtils.TestClassifiers(AllMyConstrainedClassifiers)


Same amount of code for other paradigms!

Other tasks

4.2 Part-Of-Speech Tagging

This is perhaps the most frequently used application of ML in NLP. We use the setting proposed by Roth and Zelenko (1998) as the basis for our experiments. The graph of an example sentence is shown in Figure 1. We model the problem as a single-node graph representing constituents in sentences. We make use of context-window features, and hence our graph has edges between each token and its context window. This enables us to define contextual features by traversing the relevant edges to access tokens in the context. The following code uses the gold POS-tag label (POSLabel) of the two tokens before the current token during training, and the POS-tag classifier's prediction (POSTaggerKnown) of the two tokens before the current token during testing.

[Figure 1: the data-model graph for the sentence "Washington covers Seattle for the Associated Press": nodes for predicate-argument candidates, properties such as the POS window, parse path and label, and edges connecting them.]

val labelTwoBefore = property(tokens) { x: Constituent =>
  // Use edges to jump to the previous constituent
  val cons = (tokens(x) ~> constituentBefore ~> constituentBefore).head
  if (POSTaggerKnown.isTraining) POSLabel(cons)
  else POSTaggerKnown(cons)
}

4.2.1 Model configurations

Here we point to a few interesting settings for this problem and report the results we obtained with Saul in Table 2.

Count-based baseline. The simplest scenario is a count-based baseline: for each constituent, choose its most popular label. This is trivial to program in Saul.
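Outside of Saul, the same baseline is a few lines of plain Scala; a sketch, assuming the training data comes as (word, tag) pairs:

// Count-based baseline: map each word to its most frequent training tag.
def mostFrequentTag(train: Seq[(String, String)]): Map[String, String] =
  train.groupBy(_._1).map { case (word, pairs) =>
    word -> pairs.groupBy(_._2).maxBy(_._2.size)._1
  }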

Independent classifiers. We train independent classifiers for known and unknown words. Though both classifiers use similar sets of features, the unknown-word classifier is trained only on tokens that were seen fewer than 5 times in the training data. Here the Learnable is defined as exemplified in Section 4.1.2.
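A hedged sketch of the two definitions (only POSTaggerKnown, POSLabel and the frequency threshold come from the text; the feature names and the rare-word filter are assumptions):

// Sketch only; Saul's actual API may differ in details.
object POSTaggerKnown extends Learnable(tokens) {
  def label = POSLabel
  override def feature = using(wordForm, labelTwoBefore) // assumed feature set
}
object POSTaggerUnknown extends Learnable(tokens) {
  def label = POSLabel
  override def feature = using(wordForm, labelTwoBefore)
  // Trained only on rare tokens, e.g. with an assumed frequency lookup:
  // POSTaggerUnknown.learn(iterations, tokens().filter(t => freq(t) < 5))
}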

Classifier combination. Given the known and unknown classifiers, one easy extension is to combine them, depending on whether the input instance was seen during the training phase or not. To code this, the defined Learnables for the two classifiers are simply reused in an if construct.
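A minimal sketch of the combination, assuming a helper seenInTraining that checks membership in the training vocabulary:

// Reuse the two trained Learnables inside one predictor.
val combinedTagger = property(tokens) { x: Constituent =>
  if (seenInTraining(x)) POSTaggerKnown(x) else POSTaggerUnknown(x)
}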

Sequence tagging. One can extend the previous configurations by training higher-order classifiers, i.e. classifiers trained on pairs/tuples of neighboring constituents (similar to an HMM or chain CRF). At prediction time, one needs to choose the best structure by doing constrained inference on the predictions of the local classifiers. The following snippet shows how one can write a consistency constraint, given a pairwise classifier POSTaggerPairwise which scores two consecutive constituents.

def sentenceLabelsMatch = constraint(sentences) { t: TextAnnotation =>
  val constituents = t.getView(ViewNames.TOKENS).getConstituents
  // Go through a sliding window of tokens
  constituents.sliding(3)._forall { cons: List[Constituent] =>
    POSTaggerPairwise on (cons(0), cons(1)).second ===
      POSTaggerPairwise on (cons(1), cons(2)).first
  }
}
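Here the pairwise classifier predicts a label pair for each bigram of constituents: .second of the prediction for (cons(0), cons(1)) and .first of the prediction for (cons(1), cons(2)) both refer to the shared middle token, so the constraint forces overlapping windows to agree on its label.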

4.3 Entity-Relation extraction

This task involves labeling entities and recognizing semantic relations among them. It requires making several local decisions (identifying named entities in the sentence) to support relation identification. The models we present here are inspired by well-known previous work (Zhou et al., 2005; Chan and Roth, 2010). The nodes in our models consist of Sentences, Mentions and Relations.

4.3.1 Features and Constraints

For the entity extraction classifier, we define various lexical features for each mention: the head word, POS tags, and words and POS tags in a context window. We also incorporate features based on gazetteers for organizations, vehicles, weapons, geographic locations, proper names and collective nouns. The relation extraction classifier uses lexical, collocation and dependency-based features from the baseline implementation of Chan and Roth (2010). We also use features from the Brown word clusters (Brown et al., 1992); the features for each word are based on the path from the root in its Brown clustering representation. These features are easily available in our NLP data-model. We also use decayed down-sampling of negative examples between training iterations.
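As an illustration, such a cluster feature could be expressed as a property; a sketch, where brownClusterPath (a lookup from a word to its bit-string path from the cluster root) and the prefix lengths are our own assumptions:

// Prefixes of the Brown-cluster root path as features.
val brownPrefixes = property(tokens) { x: Constituent =>
  val path = brownClusterPath(x.toString) // assumed lookup table
  List(4, 6, 10).map(k => path.take(k))
}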


An example of the background knowledge imposed as a constraint on mentions and relations: if WorkFor(x) is true, then PER(x.firstArg) is true and ORG(x.secondArg) is true.
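In Saul-style constraint syntax this could look roughly as follows (a sketch: the node, edge and classifier names, and the ==> / and combinators, are assumptions beyond the on/=== operators seen earlier):

// Hedged sketch of the WorkFor constraint over a sentence's relations.
def workForImpliesTypes = constraint(sentences) { s: TextAnnotation =>
  (sentences(s) ~> sentenceToRelations)._forall { x: Relation =>
    (RelationTypeClassifier on x === "WorkFor") ==>
      ((EntityTypeClassifier on x.firstArg === "PER") and
       (EntityTypeClassifier on x.secondArg === "ORG"))
  }
}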



Results: PoS-Tagging and ER

Setting                    Accuracy
Count-based baseline       91.80%
Unknown Classifier         77.09%
Known Classifier           94.92%
Combined Known-Unknown     96.69%

Table 2: The performance of the POS tagger, tested on sections 22–24 of the WSJ portion of the Penn Treebank (Marcus et al., 1993).

  Scenario                        Precision  Recall  F1
E Mention Coarse-Label            77.14      70.62   73.73
E Mention Fine-Label              73.49      65.46   69.24
R Basic                           54.09      43.89   50.48
R + Sampling                      52.48      56.78   54.54
R + Sampling + Brown              54.43      54.23   54.33
R + Sampling + Brown + HCons      55.82      53.42   54.59

Table 3: 5-fold CV performance of the fine-grained entity (E) and relation (R) extraction on the Newswire and Broadcast News sections of ACE-2005.

Relation hierarchy constraint. Since the coarse and fine labels follow a strict hierarchy, we leverage this information to boost the prediction of the fine-grained classifier by constraining its prediction upon the (more reliable) coarse-grained relation classifier.
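A hedged sketch of this constraint (the label set, the parentOf lookup, and the node/edge names are our own assumptions):

// Fine-grained label implies its coarse parent, for every relation candidate.
def relationHierarchy = constraint(sentences) { s: TextAnnotation =>
  (sentences(s) ~> sentenceToRelations)._forall { r: Relation =>
    fineLabels._forall { f: String =>
      (RelationFineClassifier on r === f) ==>
        (RelationCoarseClassifier on r === parentOf(f))
    }
  }
}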

4.3.2 Model Configuration

Entity type classifier. For the entity type task, we train two independent classifiers: one for the coarse label and a second for the fine-grained entity type. We generate the candidates for entities by taking all nouns and possessive pronouns, base noun phrases, selective chunks from the shallow parse, and named entities annotated by the NE tagger of Ratinov and Roth (2009).
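A sketch of such candidate generation with illinois-cogcomp views (ViewNames.POS and ViewNames.NER_CONLL are standard view names in that library; the exact filter, which omits the chunk-based candidates, is our own illustration):

import scala.collection.JavaConverters._
import edu.illinois.cs.cogcomp.core.datastructures.ViewNames
import edu.illinois.cs.cogcomp.core.datastructures.textannotation.{ Constituent, TextAnnotation }

// Nouns, possessive pronouns and NE-tagger spans as mention candidates.
def entityCandidates(ta: TextAnnotation): List[Constituent] = {
  val posTags = ta.getView(ViewNames.POS).getConstituents.asScala
  val nominals = posTags.filter(c => c.getLabel.startsWith("NN") || c.getLabel == "PRP$")
  val namedEntities = ta.getView(ViewNames.NER_CONLL).getConstituents.asScala
  (nominals ++ namedEntities).toList
}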

Relation type classifier. For the relation types, we train two independent classifiers: one for the coarse-grained relation type label and one for the fine-grained relation type label. We use features from our unified data-model, which are properties defined on the relation nodes in the data-model graph. We also incorporate the relation hierarchy constraint during inference so that the predictions of the two classifiers are coherent. We report some of our results in Table 3.

5 Related Work

This work has been done in the context of Saul, a recently developed declarative learning-based programming language. DeLBP is a new paradigm (Roth, 2005; Rizzolo, 2011; Kordjamshidi et al., 2015) related to probabilistic programming languages (PPL) (Pfeffer, 2009; McCallum et al., 2009) (inter alia), sharing the goal of facilitating the design of learning and inference models. However, compared to PPL, it is aimed at non-expert users of machine learning, and it is a more generic framework that is not limited to probabilistic models. It focuses on learning over complex structures where there are global correlations between variables, and where first-order background knowledge about the data and domain can easily be considered during learning and inference. The desideratum of this framework is a conceptual representation of the domain, data and knowledge that is suitable for non-experts in machine learning; it also addresses relational feature extraction. This differs from the goals of Searn (Daumé III et al., 2009) and Wolfe (Riedel et al., 2014). DeLBP focuses on data-driven learning and reasoning for problem solving and on handling collections of data from heterogeneous resources, unlike Dyna (Eisner, 2008), which is a generic declarative problem-solving paradigm based on dynamic programming. This paper exhibits the capabilities and flexibility of Saul for solving problems in the NLP domain. Specifically, it shows how a unified, predefined NLP data-model can help in performing various tasks at various granularity levels.

6 Conclusion

We presented three examples of NLP applications as defined in the declarative learning-based programming language Saul. The main advantage of our approach compared to traditional, task-specific systems is that Saul allows one to define all the components of the models declaratively, from feature extraction to learning and inference with arbitrary structures. This gives designers and researchers a way to explore different ways to decompose, learn and do inference, and to easily gain insights into the impact of these choices on the final performance.


Conclusion

■ The Saul language facilitates modeling NLP applications that need:
  ■ learning and reasoning over structures
  ■ relational feature extraction
  ■ direct use of expert knowledge beyond data instances
■ Saul saves programming time in designing various configurations and in experimentation: each configuration is expressed in a few lines of declarative code.
■ Saul increases the reusability of code as the data and knowledge about the problem grow and as new algorithms emerge.

The novel ideas include:
■ programming for decompositions of a global optimization that are used in training and prediction
■ an abstraction for unifying various formalisms for learning and reasoning (ongoing work)
■ a unified language for graph querying and structured learning

https://github.com/IllinoisCogComp/saul

Thank you!

I have two open PhD positions and one postdoc position; please contact me at [email protected] if you are interested!

NLP Sensors

def textAnnotationToTree(ta: TextAnnotation): Tree[Constituent]

def textAnnotationToStringTree(ta: TextAnnotation): Tree[String]

def getPOS(x: Constituent): String

def getLemma(x: Constituent): String

def getSubtreeArguments(currentSubTrees: List[Tree[Constituent]]): List[Tree[Constituent]]

def xuPalmerCandidate(x: Constituent, y: Tree[String]): List[Relation]

def fexContextFeats(x: Constituent, featureExtractor: WordFeatureExtractor): String
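These sensors plug directly into data-model properties; a minimal sketch (the property names are our own):

// Wiring sensors into the data model as properties.
val posTag = property(tokens) { x: Constituent => getPOS(x) }
val lemma = property(tokens) { x: Constituent => getLemma(x) }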
