Application of stochastic grammars to
understanding action

by

Yuri A Ivanov

MS Computer Science

State Academy of Air and Space Instrumentation, St. Petersburg, Russia

February

Submitted to the Program in Media Arts and Sciences

School of Architecture and Planning

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN MEDIA ARTS AND SCIENCES

at the

Massachusetts Institute of Technology

February

© Massachusetts Institute of Technology
All Rights Reserved

Signature of Author

Program in Media Arts and Sciences

January

Certified by

Aaron F. Bobick
LG Electronics Inc. Career Development
Assistant Professor of Computational Vision
Program in Media Arts and Sciences
Thesis Supervisor

Accepted by

Stephen A Benton

Chairperson
Departmental Committee on Graduate Students, Program in Media Arts and Sciences

Application of stochastic grammars to understanding action

by

Yuri A Ivanov

Submitted to the Program in Media Arts and Sciences

School of Architecture and Planning

on January

in partial fulfillment of the requirements for the degree of

Master of Science in Media Arts and Sciences

Abstract

This thesis addresses the problem of using probabilistic formal languages to describe and understand actions with explicit structure. The thesis explores several mechanisms of parsing an uncertain input string aided by a stochastic context-free grammar. This method, originating in speech recognition, lets us combine a statistical recognition approach with a syntactical one in a unified syntactic-semantic framework for action recognition.

The basic approach is to design the recognition system in a two-level architecture. The first level, a set of independently trained component event detectors, produces the likelihoods of each component model. The outputs of these detectors provide the input stream for a stochastic context-free parsing mechanism. Any decisions about supposed structure of the input are deferred to the parser, which attempts to combine the maximum amount of the candidate events into a most likely sequence according to a given Stochastic Context-Free Grammar (SCFG). The grammar and parser enforce longer range temporal constraints, disambiguate or correct uncertain or mislabeled low-level detections, and allow the inclusion of a priori knowledge about the structure of temporal events in a given domain.

The method takes into consideration the continuous character of the input and performs structural rectification of it in order to account for misalignments and ungrammatical symbols in the stream. The presented technique of such a rectification uses structure maximization to drive the segmentation.

We describe a method of achieving this goal to increase recognition rates of complex sequential actions, presenting the task of segmenting and annotating an input stream in various sensor modalities.

Thesis Supervisor: Aaron F. Bobick

Title: Assistant Professor of Computational Vision

Application of stochastic grammars to understanding action

by

Yuri A Ivanov

The following people served as readers for this thesis:

Reader

Alex Pentland

Toshiba Professor of Media Arts and Sciences

Media Laboratory, Massachusetts Institute of Technology

Reader

Walter Bender

Associate Director for Information Technology

Media Laboratory, Massachusetts Institute of Technology

Acknowledgments

First and foremost I would like to thank my advisor, professor Aaron Bobick, for giving me the opportunity to join his team at MIT and introducing me to machine vision. I really appreciated his open-mindedness, readiness to help and listen, and giving me the freedom to see things my own way. Every discussion with him taught me something new. I consider myself very fortunate to have him as my advisor; both personally and professionally he has given me more than I could ever ask for. Thanks, Aaron!

I would like to thank my readers, Walter Bender and Sandy Pentland, for taking the time from their busy schedules to review the thesis and for their comments and suggestions.

This work would not be possible without the input from all members of the High Level Vision group, in no particular order: Drew, Stephen, Lee, Claudio, Jim, John, Chris and Martin. My special thanks to Andrew Wilson, who had the misfortune of being my officemate for the last year and a half, and who had the patience to listen to all my ramblings and answer all my endless questions to a degree I never expected.

But I guess this whole thing wouldn't even have happened if George "Slowhand" Thomas hadn't come along and told me that while I am looking for better times, they pass. Thanks, bro.

Thanks to Mom and Dad for giving me all their love and always being there for me, regardless of what mood I was in.

Thanks to Margie for going the extra ten miles every time I needed it.

And finally, thanks to my wife of (has it been that long already?) years, Sarah. Thank you for being so supportive, loving, patient and understanding. Thank you for putting up with all my crazy dreams and making them all worth trying. For you I only have one question: You know what?

Contents

Introduction and Motivation
Motivation
Structure and Content
Bayesian View
AI View
Generative Aspect
Example I: Explicit Structure and Disambiguation
Example II: Choosing the Appropriate Model
Approach
Outline of the Thesis

Background and Related Work
Motion Understanding
Pattern Recognition
Methods of Speech Recognition
Related Work

Stochastic Parsing
Grammars and Languages
Notation
Basic definitions
Expressive Power of Language Model
Stochastic Context-Free Grammars
Earley-Stolcke Parsing Algorithm
Prediction
Scanning
Completion
Relation to HMM
Viterbi Parsing
Viterbi Parse in Presence of Unit Productions

Temporal Parsing of Uncertain Input
Terminal Recognition
Uncertainty in the Input String
Substitution
Insertion
Enforcing Temporal Consistency
Misalignment Cost
On-line Processing
Pruning
Limiting the Range of Syntactic Grouping

Implementation and System Design
Component Level: HMM Bank
Syntax Level: Parser
Parser Structure
Annotation Module

Experimental Results
Experiment I: Recognition and Disambiguation
Recognition and semantic segmentation
Recursive Model

Conclusions
Summary
Evaluation
Extensions
Final Thoughts

List of Figures

Example parsing strategy
Error correcting capabilities of syntactic constraints
Simple action structure
Secretary problem
Left Corner Relation graph
Viterbi parse tree algorithm
Spurious symbol problem
Temporal consistency
Recognition system architecture
Example of a single HMM output
Component level output
SQUARE sequence segmentation
STIVE setup
Output of the HMM bank
Segmentation of a conducting gesture (Polhemus)
Segmentation of a conducting gesture (STIVE)
Recursive model

List of Tables

Complexity of a recursive task
Left Corner Relation matrix and its RT closure
Unit Production matrix and its RT closure

Chapter 1

Introduction and Motivation

In the last several years there has been a tremendous growth in the amount of computer vision research aimed at understanding action. As noted by Bobick, these efforts have ranged from the interpretation of basic movements, such as recognizing someone walking or sitting, to the more abstract task of providing a Newtonian physics description of the motion of several objects.

In particular, there has been emphasis on activities or behaviors where the entity to be recognized may be considered as a stochastically predictable sequence of states. The greatest number of examples come from work in gesture recognition, where analogies to speech and handwriting recognition inspired researchers to devise hidden Markov model methods for the classification of gestures. The basic premise of the approach is that the visual phenomena observed can be considered Markovian in some feature space, and that sufficient training data exists to automatically learn a suitable model to characterize the data.

Hidden Markov Models have become the weapon of choice of the computer vision community in the areas where complex sequential signals have to be learned and recognized. The popularity of HMMs is due to the richness of the mathematical foundation of the model, its robustness, and the fact that many important problems are easily solved by application of this method.

While HMMs are well suited for modeling parameters of a stochastic process with assumed Markovian properties, their capabilities are limited when it comes to capturing and expressing the structure of a process.

If we have a reason to believe that the process has primitive components, it might be beneficial to separate the interrelations of the process components from their internal statistical properties in the representation of the process. Such a separation lets us avoid overgeneralization and employ the methods which are specific to the analysis of either the structure of the signal or its content.

Motivation

Here we establish the motivation for our work based on four lines of argument. First we show how the separation of structural characteristics from statistical parameters in a vision task can be useful. Then we proceed to discuss the proposed approach in a Bayesian formulation, then in an AI framework, and finally present the motivation in the context of recognition of a conscious directed task.

Structure and Content

Our research interests lie in the area of vision where observations span extended periods of time. We often find ourselves in a situation where it is difficult to formulate and learn parameters of a model in clear statistical terms, which prevents us from using purely statistical approaches to recognition. These situations can be characterized by one or more of the following properties:

- complete data sets are not always available, but component examples are easily found;
- semantically equivalent processes possess radically different statistical properties;
- competing hypotheses can absorb different lengths of the input stream, raising the need for naturally supported temporal segmentation;
- the structure of the process is difficult to learn, but is explicit and known a priori;
- the process structure is too complex, and the limited Finite State model, which is simple to estimate, is not always adequate for modeling the process structure.

Many applications exist where purely statistical representational capabilities are limited and surpassed by the capabilities of the structural approach.

Structural methods are based on the fact that often the significant information in a pattern is not merely in the presence, absence or numerical value of some feature set. Rather, the interrelations or interconnections of features yield the most important information, which facilitates structural description and classification. One area which is of particular interest is the domain of syntactic pattern recognition. Typically, the syntactic approach formulates hierarchical descriptions of complex patterns built up from simpler subpatterns. At the lowest level, the primitives of the pattern are extracted from the input data. The choice of the primitives distinguishes the syntactical approach from any other. The main difference is that features are just any available measurements which characterize the input data. The primitives, on the other hand, are subpatterns, or building blocks, of the structure.

Superior results can be achieved by combining statistical and syntactical approaches into a unified framework in a statistical-syntactic approach. These methods have already gained popularity in AI and speech processing.

The statistical-syntactic approach is applicable when there exists a natural separation of structure and contents. For many domains such a division is clear. For example, consider ballroom dancing. There are a small number of primitives (e.g. right-leg-back), which are then structured into higher level units (e.g. box-step, quarter-turn, etc.). Typically, one will have many examples of right-leg-back drawn from the relatively few examples of each of the higher level behaviors. Another example might be recognizing a car executing a parallel parking maneuver. The higher level activity can be described as: first a car executes a pull-alongside primitive, followed by an arbitrary number of cycles through the pattern turn-wheels-left, back-up, turn-wheels-right, pull-forward. In these instances there is a natural division between atomic, statistically abundant primitives and higher level coordinated behavior.

Bayesian View

Suppose we have a sequence of primitives composing the action structure S, and the observation sequence O. We assume a simple probabilistic model of an action where a particular S produces some O with probability P(S, O). We need to find S* such that

    P(S*|O) = max_S P(S|O)

Using Bayes' rule,

    P(S|O) = P(O|S) P(S) / P(O)

The MAP estimate of S is therefore

    S* = argmax_S P(O|S) P(S)

The first term, P(O|S), is a component likelihood (termed the acoustic model in speech recognition), representing the probability of the observation given the sequence of primitives. It is obtained via recognition of the primitive units according to some statistical model. The second term, P(S), is a structure model (also called a language model), as it describes the probability associated with a postulated sequence of primitives.

Typically, when neither term is known, they are assumed to be drawn from a distribution of some general form and their parameters are estimated.
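As a small illustration of the decision rule above, the following Python sketch picks the MAP structure from two hypothetical candidates; the candidate names and probability values are purely illustrative and not from the thesis:

    # Hypothetical structure-model priors P(S)
    structure_prior = {
        "right-handed square": 0.5,
        "left-handed square": 0.5,
    }

    # Hypothetical component likelihoods P(O|S) produced by low-level detectors
    observation_likelihood = {
        "right-handed square": 1e-4,
        "left-handed square": 3e-6,
    }

    # MAP estimate: S* = argmax_S P(O|S) P(S)
    s_map = max(structure_prior,
                key=lambda s: observation_likelihood[s] * structure_prior[s])
    print(s_map)  # -> right-handed square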

The motivation of this work arises from the need to model action-based processes where the structure model can be arbitrarily complex. While the component likelihood is based on variations in the signal, and a statistical model is typically appropriate, the structure model in our context is based on behavior, or directed task. The corresponding probability distribution is often non-trivial and difficult to model by purely statistical means.

Decoupling between statistical models of the structure and the underlying primitives is desirable in our context, since it makes it possible to deploy the appropriate machinery at each level. In order to do that, we need to develop a mechanism for independent estimation of models at each level and propagation of constraints through such a two-level structure.

There have been attempts to decouple structural knowledge from component knowledge and express the structure via establishing interrelations of the components, realizing them in the architecture of the recognition system. For instance, in speech recognition, grammar networks are widely used to impose syntactic constraints on the recognition process. One such example is the parsing strategy employed by the HMM Tool Kit developed by Entropic Research Lab. Individual temporal feature detectors can be tied together according to the expected syntax. The syntactic model is expressed by a regular grammar in extended Backus-Naur form. This grammar is used to build a network of the detectors and perform a long sequence parse by a Viterbi algorithm, based on token passing.

The resulting grammar network is shown in figure (a). The structure of the incoming string can be described by a non-recursive finite state model.

Figure: Illustration of different parsing strategies. (a) Example of HTK: individual temporal feature detectors for symbols a, b, c, d and e are combined into a grammar network. (b) Proposed architecture, which achieves decoupling between the primitive detectors and the structural model (a probabilistic parser in this case).

Our approach goes further in decoupling the structure from content. We do not impose the structural constraints on the detectors themselves. Instead, we run them in parallel, deferring the structural decisions to the parser, which embodies our knowledge of the action syntax. A clear advantage of this architecture is that the structural model can take on arbitrary complexity, not limited by the FSM character of the network, according to our beliefs about the nature of the action.

AI View

AI approaches to recognition were traditionally based on the idea of incorporating the knowledge from a variety of knowledge sources and bringing them together to bear on the problem at hand. For instance, the AI approach to segmentation and labeling of the speech signal is to augment the generally used acoustic knowledge with phonemic knowledge, lexical knowledge, syntactic knowledge, semantic knowledge and even pragmatic knowledge.

Rabiner and Juang show that in tasks of automatic speech recognition the word-correcting capability of higher-level knowledge sources is extremely powerful. In particular, they illustrate this statement by comparing performance of a recognizer both with and without syntactic constraints (see figure). As the deviation (noise) parameter SIGMA gets larger, the word error probability increases in both cases. In the case without syntactic constraints the error probability grows quickly, but with the syntactical constraints enforced it increases gradually with an increase in noise.

Figure: Error correcting capabilities of syntactic constraints (reprinted from Rabiner and Juang).

This is the approach to recognition of structured visual information which we take in this work. It allows us not only to relieve the computational strain on the system, but also to make contextual decisions about likelihoods of each of the models as we parse the signal.

Generative Aspect

In problems of visual action recognition we deal with problems which originate in task-based processes. These processes are often perceived as a sequence of simpler, semantically meaningful subtasks, forming the basis of the task vocabulary.

As we speak of ballet, we speak of plie and arabesque and how one follows another (coincidentally, Webster's dictionary defines ballet as "dancing in which conventional poses and steps are combined with light flowing figures and movements"); as we speak of tennis or volleyball, we speak of backhand and smash and of playing by the rules. This suggests that when we turn to machines for solving the recognition problems in higher level processes, which we humans are used to dealing with, we can provide the machines with many intuitive constraints, heuristics, semantic labels and rules, which are based on our knowledge of the generation of the observation sequence by some entity with predictable behavior.

As we attempt to formulate algorithms for recognition of processes which are easily decomposed into meaningful components, the context, or the structure, of the process can be simply described, rather than learned, and then used to validate the input sequence.

In cognitive vision we can often easily determine the components of the complex behavior-based signal. The assumption is perhaps more valid in the context of conscious directed activity, since the nature of the signal is such that it is generated by a task-driven system, and the decomposition of the input to the vision system is as valid as the decomposition of the task being performed by the observed source into a set of simpler primitive subtasks.

Example I: Explicit Structure and Disambiguation

Consider an example: we will attempt to build a model which would let us perform recognition of a simple hand drawing of a square as the gesture is executed. Most likely, the direction in which a square is drawn will depend on whether the test subject is left- or right-handed. Therefore, our model will have to contain the knowledge of both possible versions of such a gesture and indicate that a square is being drawn regardless of the execution style.

This seemingly simple task requires significant effort if using only statistical pattern recognition techniques. A human observer, on the other hand, can provide a set of useful heuristics for a system which would model the human's higher level perception. As we recognize the need to characterize a signal by these heuristics, we turn our attention to syntactic pattern recognition and combined statistical-syntactic approaches, which would allow us to address the problem stated above.

Having a small vocabulary of prototype gestures (right, left, up and down), we can formulate a simple grammar G_square describing drawing a square.

Figure: Example of the action structure. (a)-(e): frames of a sequence showing a square being drawn in the air; (f) shows the motion phases.

Grammar G_square:

    SQUARE   -> RIGHT_SQ | LEFT_SQ
    RIGHT_SQ -> right down left up | left up right down
    LEFT_SQ  -> left down right up | right up left down

We can now identify a simple input string as being a square, and determine if this particular square is right-handed or left-handed.
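Here is a minimal sketch of how such a grammar could be used to classify a primitive string; the production set mirrors G_square above, but the lookup-table recognizer below is an illustrative stand-in, not the probabilistic parser developed later in the thesis:

    # Right-hand sides of G_square, written as tuples of primitive gestures
    RIGHT_SQ = [("right", "down", "left", "up"), ("left", "up", "right", "down")]
    LEFT_SQ  = [("left", "down", "right", "up"), ("right", "up", "left", "down")]

    def classify_square(primitives):
        """Return the handedness of a square gesture, or None if ungrammatical."""
        s = tuple(primitives)
        if s in RIGHT_SQ:
            return "right-handed square"
        if s in LEFT_SQ:
            return "left-handed square"
        return None

    print(classify_square(["right", "down", "left", "up"]))  # -> right-handed square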

Humans, having observed the video sequence presented in Figure (a)-(e), have no problem discerning a square. Furthermore, it becomes fairly clear somewhere in the middle of the gesture that the figure drawn will be a square and not something else. This demonstrates the fact that the expected (predicted) structure is based on the typicality of the whole gesture and, after the whole gesture has been seen, helps us interpret the arcs in Figure (f) as straight lines rather than semicircles.

Now imagine that we are asked to identify the top side of the square. The task has to deal with inherent ambiguity of the process: in our case we cannot really determine if a particular part of the gesture is the top or the bottom, because it can be implemented by either left or right component gestures, depending on which version of the square is being drawn. For such locally ambiguous cases, disambiguation by context is easily performed in the setting of syntactic recognition. We will revisit this example towards the end of this document.

Figure: Realization of a recursive problem using a finite state network. EN (enter), GP (give-pen), WTS (wait-till-signed), TP (take-pen), EX (exit). (a) initial topology; (b) added links to take into account the case where the boss uses her own pen; (c) redundant network to eliminate the invalid paths EN-GP-WTS-EX and EN-WTS-TP-EX.

Example II: Choosing the Appropriate Model

For the sake of illustration, suppose that an absent-minded secretary keeps losing his pens. Once a day he brings a pile of papers into his boss' office to sign. Quite often he gives the boss his pen to sign the papers and sometimes forgets to take it back. We want to build a system which reminds the secretary that the boss did not return the pen. We can describe the behavior as a sequence of primitive events: enter, give-pen, wait-till-signed, take-pen and exit. We build a simple sequential network of recognizers which detect the corresponding actions, which is shown in figure (a). Now, if a break in the sequence is detected, we will sound an alarm and remind the secretary to take his pen back.

One complication to the task is that the boss occasionally uses her own pen to sign the documents. At first sight it does not seem to present a problem: we just add two extra arcs to the network (figure (b)). But on closer inspection we realize that the system no longer works. There is a path through the network which will allow the boss to keep her secretary's pen.

In order to fix the problem we rearrange the network, adding one more recognizer, wait-till-signed, and two extra links (figure (c)). However, the performance of the original network will slightly drop due to a repeated wait-till-signed detector.

A case similar to the secretary problem that is of particular importance is the monitoring of assembly/disassembly tasks. The problems of this category possess the property that every action in the beginning of the string has an opposite action in the end. The languages describing such tasks are often referred to as palindrome languages. The strings over the palindrome language can be described as L_p = {xy | such that y is a mirror image of x}. It should be clear by now that general palindromes cannot be generated by a finite state machine. However, the bounded instances of such strings can be accepted by an explicit representational enumeration, as was shown above, where the state wait-till-signed was duplicated. The resulting growth of a finite state machine-based architecture can be prohibitive with an increase of the counted nesting.

To substantiate this claim, let us look at the relation of the depth of the nesting and the amount of the replicated nodes necessary to implement such a structure. The number of additional nodes N_d (which is proportional to the decrease in performance) required to implement a general recursive system of depth d can be computed by a recurrent relation in d, where N_0 is the total number of nodes in the original network. The table shows how much larger the network has to be (and, correspondingly, how much slower) to accommodate the secretary-type task with increasing value of d. We assume a simple two-node architecture as a starting point.

Table: Number of additional nodes for a finite state network implementing a recursive system of depth d.

Approach

In this thesis we focus our attention on a mixed statistical-syntactic approach to action recognition. Statistical knowledge of the components is combined with the structural knowledge, expressed in the form of a grammar. In this framework the syntactic knowledge of the process serves as a powerful constraint to recognition of individual components, as well as recognition of the process as a whole.

To address the issues raised in the previous sections, we design our recognition system in a two-level architecture, as shown in figure (b). The first level, a set of independently trained component event detectors, produces likelihoods of each component model. The outputs of these detectors provide the input stream for a stochastic context-free parsing mechanism. Any decisions about supposed structure of the input are deferred to the parser, which attempts to combine the maximum amount of the candidate events into a most likely sequence according to a given Stochastic Context-Free Grammar (SCFG). The grammar and parser enforce longer range temporal constraints, disambiguate or correct uncertain or mislabeled low-level detections, and allow the inclusion of a priori knowledge about the structure of temporal events in a given domain.

Outline of the Thesis

The remainder of this thesis proceeds as follows: the next chapter establishes relations of the thesis to previous research in the areas of pattern recognition, motion understanding and speech processing. Chapter 3 describes the parsing algorithm which is used in this work. Necessary extensions and modifications to the algorithm, which allow the system to deal with uncertain and temporally inconsistent input, are presented in Chapter 4. Chapter 5 gives an overview of the system architecture and its components. Chapter 6 presents some experimental results which were achieved using the method described in this document, and, finally, Chapter 7 concludes the thesis with a brief outline of strong and weak points of the proposed technique.

Chapter 2

Background and Related Work

Inspiration for this work comes primarily from speech recognition, which covers a broad range of techniques. The thesis builds upon research in machine perception of motion and syntactic pattern recognition. This chapter establishes relations of the thesis work to each of the above areas.

Motion Understanding

Bobick introduces a taxonomy of approaches to the machine perception of motion. The necessity of such a classification arises from the breadth of the tasks in the domain of motion understanding. The proposed categorization classifies vision methods according to the amount of knowledge which is necessary to solve the problem. The proposed taxonomy makes explicit the representational competencies and manipulations necessary for perception.

According to this classification, the subject of the recognition task is termed movement, activity or action. Movements are defined as the most atomic primitives, requiring no contextual or sequence knowledge to be recognized. Activity refers to sequences of movements or states, where the only real knowledge required is the statistics of the sequence. And, finally, actions are larger scale events, which typically include interaction with the environment and causal relationships.

As the objective of the system being developed in this thesis, we define the recognition of complex structured processes. Each of the components of the structure is itself a sequential process, which, according to the above taxonomy, belongs to the category of activities. Activity recognition requires statistical knowledge of the corresponding subsequence, modeled in our case by a Hidden Markov Model.

The probabilistic parser, performing contextual interpretation of the candidate activities, essentially draws upon the knowledge of the domain, which is represented by a grammar. The grammar is formulated in terms of primitive activities (grammar terminals) and their syntactic groupings (grammar nonterminals). Although no explicit semantic inferences are made and no interactions with the environment are modeled in the current implementation of the system (for the scope of this work they were not necessary), implicitly we allow more extensive information to be included in the representation with minimal effort.

In the proposed taxonomy the complete system covers the range between activities and actions, allowing for cross-level constraint propagation and easy transition from one level to the other.

Pattern Recognition

The approach to pattern recognition is dictated by the available information which can be utilized in the solution. Based on the available data, approaches to pattern recognition loosely fall into three major categories: statistical, syntactic or neural pattern recognition.

In the instances where there is an underlying and quantifiable statistical basis for the pattern generation, statistical or decision-theoretic pattern recognition is usually employed. Statistical pattern recognition builds upon a set of characteristic measurements, or features, which are extracted from the input data and are used to assign each feature vector to one of several classes.

In other instances, the underlying structure of the pattern provides more information useful for recognition. In these cases we speak about syntactic pattern recognition. The distinction between syntactic and statistical recognition is based on the nature of the primitives. In statistical approaches, the primitives are features, which can be any measurements. In syntactic recognition, the primitives must be subpatterns, from which the pattern subject to recognition is composed.

And, finally, neural pattern recognition is employed when neither of the above cases holds true, but we are able to develop and train relatively universal architectures to correctly associate input patterns with desired responses. Neural networks are particularly well suited for pattern association applications, although the distinction between statistical and neural pattern recognition is not as definite and clearly drawn as between syntactical and statistical recognition techniques.

Each of these methodologies offers distinctive advantages and has inevitable limitations. The most important limitations for each of the approaches are summarized in the table below:

    Statistical recognition        Syntactic recognition      Neural recognition
    Structural information is      Structure is difficult     Semantic information is
    difficult to express           to learn                   difficult to extract
                                                              from the network

The technique developed in this thesis is a version of a mixed statistical-syntactic approach. Low level statistics-based detectors utilize the statistical knowledge of the subpatterns of the higher level syntax-based recognition algorithm, which builds upon our knowledge of the expected structure. This approach allows us to integrate the advantages of each of the employed techniques and provides a framework for cross-level constraint propagation.

Methods of Speech Recognition

Speech recognition methods span an extremely broad category of problems, ranging from signal processing to semantic inference. Most interesting and relevant for our application are the techniques used for determining word and sentence structure.

Most often the N-gram models are used to combine the outputs of word detectors into a meaningful sentence. Given a word sequence W, mutual occurrence of all the words of W is modeled by computing the probability

    P(W) = P(w_1 w_2 ... w_l) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_l|w_1 ... w_{l-1})

Such a model essentially amounts to enumeration of all the word sequences and computing their probabilities, which is computationally impossible to do. In practice it is approximated by unit-long conditional probabilities, and the resulting estimate of P(W) is interpolated if the training data set is limited.
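As a small illustration of the unit-long (bigram) approximation mentioned above, here is a minimal Python sketch; the toy corpus and the add-one smoothing are illustrative assumptions, not part of the thesis:

    from collections import Counter

    corpus = "the boss signs the papers the secretary takes the pen".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    vocab = len(unigrams)

    def p_bigram(w_prev, w):
        """P(w | w_prev) with add-one smoothing."""
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab)

    # Bigram approximation: P(W) ~ P(w1) * prod_i P(w_i | w_{i-1})
    sentence = "the secretary takes the pen".split()
    p = unigrams[sentence[0]] / len(corpus)
    for prev, cur in zip(sentence, sentence[1:]):
        p *= p_bigram(prev, cur)
    print(p)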

Context-free grammars are rarely employed in moderate to large vocabulary natural language tasks. Instead, the bi- or trigram grammars are used to provide the language model. Perhaps the main reason for that is the fact that CFGs are only a crude approximation of the spoken language structure, and CFG statistics are difficult to obtain. N-gram statistics, on the other hand, are relatively straightforward to compute and to use in practice.

The CFG model assumes a significantly greater role in small to moderate vocabulary tasks, where the domain is more constrained and the task of recognition is to detect relatively short and structured command sequences. In such a setting, the units of recognition are the words of the limited vocabulary, and a syntactical description can be formulated with relative ease. (At present time, large vocabulary tasks cannot be accomplished on the basis of word primitives due to computational constraints.)

The latter category of speech recognition tasks is where we find the inspiration for our work. Arguably, CFGs are even more attractive in the vision domain, since recursive structures in vision are more often the case than in speech.

Related Work

The proposed approach combines statistical pattern recognition techniques in a unifying framework of syntactic constraints. Statistical pattern recognition in vision has a long history and well developed tools. It is most relevant to the thesis in its application to activity recognition.

One of the earlier attempts to use HMMs for recognition of activities is found in the work by Yamato et al., where discrete HMMs are used to recognize six tennis strokes, performed by three subjects. A subsampled camera image is used as a feature vector directly.

Activity, as defined above, was the focus of the research where the sequential character of the observation was reflected in the sequential structure of a statistical model, such as work by Darrell, where the recognition task is performed by a time-warping technique closely related to HMM methodology.

Examples of statistical representation of sequences are seen in the recent work in understanding human gesture. For instance, Schlenzig et al. describe the results of their experiments of using HMMs for recognition of continuous gestures, which show it to be a powerful gesture recognition technique.

Starner and Pentland propose an HMM-based approach to recognition of visual language. The task is performed by a set of HMMs trained on several hand signs of American Sign Language (ASL). At run time, the HMMs output probabilities of the corresponding hand sign phrases. The strings are optionally checked by a regular phrase grammar. In this work the authors fully utilize the advantages of the architecture presented in figure (a).

Wilson, Bobick and Cassell analyzed explicit structure of the gesture, where the structure was implemented by an equivalent of a finite state machine, with no learning involved.

Syntactic pattern recognition is based on the advances of formal language theory and computational linguistics. The most definitive text on these topics presents a thorough and broad treatment of the parsing concepts and mechanisms.

A vast amount of work in syntactic pattern recognition has been devoted to the areas of image and speech recognition. A review of syntactic pattern recognition and related methods can be found in the literature.

Perhaps the most noted disadvantage of the syntactical approach is the fact that the computational complexity of the search and matching in this framework can potentially grow exponentially with the length of the input. Most of the efficient parsing algorithms call for the grammar to be formulated in a certain normal form. This limitation was eliminated for context-free grammars by Earley in the efficient parsing algorithm proposed in his dissertation. Earley developed a combined top-down/bottom-up approach which is shown to perform at worst at O(N^3) for an arbitrary CFG formulation.

A simple introduction of probabilistic measures into grammars and parsing was shown by Booth and Thompson, Thomason, and others.

Aho and Peterson addressed the problem of ill-formedness of the input stream. They described a modified Earley's parsing algorithm where substitution, insertion and deletion errors are corrected. The basic idea is to augment the original grammar by error productions for insertions, substitutions and deletions, such that any string over the terminal alphabet can be generated by the augmented grammar. Each such production has some cost associated with it. The parsing proceeds in such a manner as to make the total cost minimal. It has been shown that the error correcting Earley parser has the same time and space complexity as the original version, namely O(N^3) and O(N^2) respectively, where N is the length of the string. Their approach is utilized in this thesis in the framework of uncertain input and multi-valued strings.

Probabilistic aspects of syntactic pattern recognition for speech processing were presented in many publications, some of which demonstrate key approaches to parsing sentences of natural language and show the advantages of probabilistic CFGs. These texts show a natural progression from HMM-based methods to probabilistic CFGs, demonstrating the techniques of computing the sequence probability characteristics, familiar from HMMs, such as forward and backward probabilities, in the chart parsing framework.

An efficient probabilistic version of the Earley parsing algorithm was developed by Stolcke in his dissertation. The author develops techniques of embedding the probability computation and maximization into the Earley algorithm. He also describes grammar structure learning strategies and the rule probability learning technique, justifying the usage of Stochastic Context-Free Grammars for natural language processing and learning.

The syntactic approach in Machine Vision has been studied for more than thirty years, mostly in the context of pattern recognition in still images. The work by Bunke and Pasche is built upon the previously mentioned development by Aho and Peterson, expanding it to multi-valued input. The resulting method is suitable for recognition of patterns in distorted input data and is shown in applications to waveform and image analysis. The work proceeds entirely in a non-probabilistic context.

More recent work by Sanfeliu et al. is centered around two-dimensional grammars and their applications to image analysis. The authors pursue the task of automatic traffic sign detection by a technique based on Pseudo Bidimensional Augmented Regular Expressions (PSB-AREs). AREs are regular expressions augmented with a set of constraints that involve the number of instances in a string of the operands to the star operator, alleviating the limitations of the traditional FSMs and CFGs, which cannot count their arguments. A more theoretical treatment of the approach is given in related work, where the authors introduce a method of parsing AREs which describe a subclass of context-sensitive languages, including the ones defining planar shapes with symmetry.

A very important theoretical work, signifying an emerging information-theoretic trend in stochastic parsing, is demonstrated by Oomen and Kashyap. The authors present a foundational basis for optimal and information theoretic syntactic pattern recognition. They develop a rigorous model for channels which permit arbitrarily distributed substitution, deletion and insertion syntactic errors. The scheme is shown to be functionally complete and stochastically consistent.

There are many examples of attempts to enforce syntactic and semantic constraints in recognition of visual data. For instance, Courtney uses a structural approach to interpreting action in a surveillance setting. Courtney defines high level discrete events, such as "object appeared", "object disappeared", etc., which are extracted from the visual data. The sequences of the events are matched against a set of heuristically determined sequence templates to make decisions about higher level events in the scene, such as "object A removed from the scene by object B".

The grammatical approach to visual activity recognition was used by Brand, who used a simple non-probabilistic grammar to recognize sequences of discrete events. In his case, the events are based on blob interactions, such as "objects overlap", etc. The technique is used to annotate a manipulation video sequence which has an a priori known structure.

And, finally, hybrid techniques of using combined probabilistic-syntactic approaches to problems of image understanding are shown in the pioneering research by Fu.

Chapter 3

Stochastic Parsing

Grammars and Languages

This section presents a brief review of the grammar and language hierarchy as given by Chomsky. We discuss the complexity of the model at each level of the hierarchy and show the inherent limitations.

Notation

In our further discussion we will assume the following notation:

    Symbol                   Meaning                                   Example
    Capital Roman letter     Single nonterminal                        A, B
    Small Roman letter       Single terminal                           a, b
    Small Greek letter       String of terminals and nonterminals      α, β
    Greek letter ε           Empty string

A minor exception to the above notation are three capital Roman literals N, T and P, which we will use to denote the nonterminal alphabet, the terminal alphabet and a set of productions of a grammar, respectively.

Basic definitions

A Chomsky grammar is a quadruple G = {N, T, P, S}, where N is a nonterminal alphabet, T is an alphabet of terminals, P is a set of productions, or rewriting rules, written as α → β, and S ∈ N is a starting symbol, or axiom. In further discussion V_G will denote T ∪ N, and V_G* will stand for (T ∪ N)*, zero or more elements of V_G. The complexity and restrictions on the grammar are primarily imposed by the way the rewriting rules are formulated.

By Chomsky's classification, we call a grammar context-sensitive (CSG) if P is a set of rules of the form αAβ → αγβ, with α and β being strings over V_G, and A ∈ N.

A grammar is context-free (CFG) when all its rules are of the form A → γ, where A ∈ N, γ ∈ V_G*.

A context-free grammar that has at most one nonterminal symbol on the right hand side, that is, if |γ|_N ≤ 1, is called linear. Furthermore, if all rules A → γ of a context-free grammar have γ ∈ T* or γ ∈ T*N, that is, if the rules have the nonterminal symbol in the rightmost position, the grammar is said to be right-linear.

And, finally, a grammar is called regular (RG) when it contains only rules of the form A → a or A → aB, where A, B ∈ N, a ∈ T.

The language generated by the grammar G is denoted by L(G). Two grammars are called equivalent if they generate the same language.

Rules of CFGs and RGs are often required to be presented in a special form, Chomsky Normal Form (CNF). CNF requires all the rules to be written as either A → BC or A → a. Every CFG and RG has a corresponding equivalent grammar in Chomsky Normal Form.

There exist many modifications to the Context-Free Grammars in their pure form, which allow the model designer to enforce some additional constraints on the model, typically available in the lower Chomsky level grammar mechanism. Often these extensions are employed when only limited capabilities of the less constrained grammar are needed, and there is no need to implement the corresponding machine, which is in general more computationally complex and slow.

Expressive Power of Language Model

The complexity and efficiency of the language representation is an important issue when modeling a process. If we have reasons to believe that the process under consideration bears features of some established level of complexity, it is most efficient to choose a model which is the closest to the nature of the process and is the least complex. The Chomsky language classification offers us an opportunity to make an informed decision as to what level of model complexity we need to employ to successfully solve the problem at hand.

(The Chomsky classification assigns level 0 to the least constrained, free form grammar, realized by a general Turing Machine. More constrained grammars and languages, such as CSG, CFG and RG, are assigned to higher levels of the hierarchy: 1, 2 and 3, respectively.)

Regular grammars

The regular grammar is the simplest form of language description. Its typical realization is an appropriate finite state machine, which does not require memory to generate strings of the language. The absence of memory in the parsing/generating mechanism is the limiting factor on the language which the regular grammar represents. For instance, recursive structures cannot be captured by a regular grammar, since memory is required to represent the state of the parser before it goes into recursion, in order to restore the state of the parsing mechanism after the recursive invocation. As was mentioned above, regular productions have the form A → aB or A → a. This form generates a language L_r = {a^n b^m}, where n and m are independent. In other words, knowing the value of one does not improve the knowledge of the other.

A finite automaton is not able to count or match any symbols in the alphabet, except to a finite number.

Context-free grammars

CFGs allow us to model dependencies between n and m in the language L_f, consisting of strings like a^n b^m, which is generally referred to as counting. Counting in L_f should not be confused with counting in the mathematical sense. In linguistics it refers to matching, as opposed to some quantitative expression. That is, we can enforce a relation between n and m in L_f, but we cannot restrict it to have some fixed value (although it can be done by enumeration of the possibilities). Another subtle point of the CFG model is that the counting is recursive, which means that each symbol b matches symbols a in reverse order. That is, by using subscripts to denote order, L_f is in fact a_1 a_2 ... a_n b_n ... b_2 b_1. Typically, a CFG is realized as a push-down automaton (PDA), allowing for recursive invocations of the grammar submodels (nonterminals).

Push-down automata cannot model dependencies between more than two elements of the alphabet (both terminal and nonterminal), that is, a language a^n b^n c^n is not realizable by a PDA.

Context-sensitive grammars

The main difference between CFGs and CSGs is that no production in a CFG can affect the symbols in any other nonterminal in the string. Context-sensitive grammars (CSG) have an added advantage of arbitrary order counting. In other words, a context-sensitive language L_s is such that a^n b^m can be matched in any order. It is realizable by a linearly-bound automaton (essentially a Turing machine with finite tape) and is significantly more difficult to realize than a CFG. Due to the computational complexity associated with LBAs, limited context sensitivity is often afforded by extended or special forms of CFG.

Examples

Let us, without further explanations, list some examples of languages of different levels of the Chomsky hierarchy:

    Regular:            L_1 = {a^n b^m}
    Context-free:       L_2 = {α α~}, where α~ designates α in reverse order
                        L_3 = {a^n b^m c^m d^n}
                        L_4 = {a^n b^n}
    Context-sensitive:  L_5 = {a^n b^m c^n d^m}
                        L_6 = {a^n b^n c^n}
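To make the difference in expressive power concrete, here is a small Python sketch (not from the thesis) contrasting a regular pattern, which can only check membership in a*b*, with a recursive matcher that enforces the context-free relation n = m:

    import re

    # A regular expression can only check membership in a^* b^*:
    regular = re.compile(r"^a*b*$")

    # A recursive (push-down style) matcher can enforce n == m:
    def is_anbn(s):
        """True iff s is in { a^n b^n : n >= 0 }."""
        if s == "":
            return True
        return s.startswith("a") and s.endswith("b") and is_anbn(s[1:-1])

    print(bool(regular.match("aaabb")), is_anbn("aaabb"))  # True, False
    print(bool(regular.match("aabb")), is_anbn("aabb"))    # True, True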

Stochastic Context-Free Grammars

The probabilistic aspect is introduced into syntactic recognition tasks via Stochastic Context-Free Grammars. A Stochastic Context-Free Grammar (SCFG) is a probabilistic extension of a Context-Free Grammar. The extension is implemented by adding a probability measure to every production rule:

    A → λ   [p]

The rule probability p is usually written as P(A → λ). This probability is a conditional probability of the production being chosen, given that nonterminal A is up for expansion (in generative terms). Saying that a stochastic grammar is context-free essentially means that the rules are conditionally independent (in the probabilistic framework, context-freeness of a grammar translates into statistical independence of its nonterminals), and, therefore, the probability of the complete derivation of a string is just a product of the probabilities of the rules participating in the derivation.

Given a SCFG G, let us list some basic definitions:

1. The probability of a partial derivation ν ⇒ μ ⇒ ... ⇒ ρ is defined in an inductive manner as:

   (a) P(ν) = 1
   (b) P(ν ⇒ μ ⇒ ... ⇒ ρ) = P(A → λ) P(μ ⇒ ... ⇒ ρ), where A → λ is a production of G, μ is derived from ν by replacing one occurrence of A with λ, and ν, μ, ..., ρ ∈ V_G*.

2. The string probability P(A ⇒* ξ) (probability of ξ given A) is the sum of the probabilities of all leftmost derivations A ⇒ ... ⇒ ξ.

3. The sentence probability P(S ⇒* ξ) (probability of ξ given G) is the string probability given the axiom S of G. In other words, it is P(ξ|G).

4. The prefix probability P_L(S ⇒* ξ) is the sum of the probabilities of all the strings having ξ as a prefix:

       P_L(S ⇒* ξ) = Σ_{ω ∈ V_G*} P(S ⇒* ξω)

   In particular, P_L(S ⇒* ε) = 1.
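As a minimal illustration of these definitions, the sketch below encodes a tiny SCFG as a set of productions with probabilities, checks that the probabilities attached to each nonterminal sum to one, and computes the probability of one complete derivation as the product of the participating rule probabilities; the grammar and numbers are an illustrative stand-in, not one used in the thesis:

    from collections import defaultdict
    from math import prod, isclose

    # Each production: (lhs, rhs) -> probability
    rules = {
        ("SQUARE", ("RIGHT_SQ",)): 0.5,
        ("SQUARE", ("LEFT_SQ",)): 0.5,
        ("RIGHT_SQ", ("right", "down", "left", "up")): 0.6,
        ("RIGHT_SQ", ("left", "up", "right", "down")): 0.4,
        ("LEFT_SQ", ("left", "down", "right", "up")): 0.7,
        ("LEFT_SQ", ("right", "up", "left", "down")): 0.3,
    }

    # Consistency check: rule probabilities for each nonterminal must sum to 1
    totals = defaultdict(float)
    for (lhs, _), p in rules.items():
        totals[lhs] += p
    assert all(isclose(t, 1.0) for t in totals.values())

    # Probability of the derivation SQUARE => RIGHT_SQ => right down left up
    derivation = [("SQUARE", ("RIGHT_SQ",)),
                  ("RIGHT_SQ", ("right", "down", "left", "up"))]
    print(prod(rules[r] for r in derivation))  # 0.5 * 0.6 = 0.3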

Earley-Stolcke Parsing Algorithm

The method most generally and conveniently used in stochastic parsing is based on an Earley parser, extended in such a way as to accept probabilities.

In parsing stochastic sentences we adopt a slightly modified notation of Stolcke. The notion of a state is an important part of the Earley parsing algorithm. A state is denoted as:

    i :  X_k → λ . Y μ

where "." is the marker of the current position in the input stream, i is the index of the marker, and k is the starting index of the substring denoted by nonterminal X. Nonterminal X is said to dominate substring w_k ... w_i ... w_l, where, in the case of the above state, w_l is the last terminal of substring μ.

In cases where the position of the dot and the structure of the state are not important, for compactness we will denote a state as:

    S^i_k = i :  X_k → λ . Y μ

Parsing proceeds as an iterative process, sequentially switching between three steps: prediction, scanning and completion. For each position of the input stream, an Earley parser keeps a set of states, which denote all pending derivations. States produced by each of the parsing steps are called, respectively, predicted, scanned and completed. A state is called complete (not to be confused with completed) if the dot is located in the rightmost position of the state. A complete state is one that passed the grammaticality check and can now be used as a ground truth for further abstraction. A state explains a string that it dominates as a possible interpretation of symbols w_k ... w_i, suggesting a possible continuation of the string if the state is not complete.
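One possible way to encode such states and the overall control loop is sketched below in Python; this is only a schematic illustration of the bookkeeping described above (state fields and per-position state sets), with the bodies of the three steps left to the following subsections:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class State:
        lhs: str            # X, the nonterminal being expanded
        rhs: tuple          # right-hand side of the production
        dot: int            # position of the marker '.' within rhs
        start: int          # k, index where the expansion of X started
        alpha: float = 1.0  # forward probability
        gamma: float = 1.0  # inner probability

        def complete(self):
            return self.dot == len(self.rhs)

        def next_symbol(self):
            return None if self.complete() else self.rhs[self.dot]

    def parse(tokens, initial_states):
        # chart[i] holds the state set for input position i
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] |= initial_states
        for i in range(len(tokens) + 1):
            # predict(chart, i); scan(chart, i, tokens); complete(chart, i)
            pass  # the three steps are elaborated in the following subsections
        return chart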

Prediction

In the Earley parsing algorithm, the prediction step is used to hypothesize the possible continuation of the input, based on the current position in the parse tree. Prediction essentially expands one branch of the parsing tree down to the set of its leftmost leaf nodes to predict the next possible input terminal. Using the state notation above, for each state S^i_k and production p of a grammar G = {T, N, S, P} of the form

    S^i_k = i :  X_k → λ . Y μ
    p :  Y → ν

where Y ∈ N, we predict a state

    S^i_i = i :  Y_i → . ν

The prediction step can take on a probabilistic form by keeping track of the probability of choosing a particular predicted state. Given the statistical independence of the nonterminals of the SCFG, we can write the probability of predicting a state S^i_i as conditioned on the probability of S^i_k. This introduces a notion of forward probability, which has an interpretation similar to that of the forward probability in HMMs. In SCFG the forward probability α_i is the probability of the parser accepting the string w_1 ... w_{i-1} and choosing state S at a step i. To continue the analogy with HMMs, the inner probability γ_i is the probability of generating a substring of the input from a given nonterminal, using a particular production. The inner probability is thus conditional on the presence of a nonterminal X with expansion started at the position k, unlike the forward probability, which includes the generation history starting with the axiom. Formally, we can integrate computation of α and γ with the non-probabilistic Earley predictor as follows:

    i :  X_k → λ . Y μ   [α, γ]   ==>   i :  Y_i → . ν   [α', γ']

where α' is computed as a sum of the probabilities of all the paths leading to the state i : X_k → λ.Yμ, multiplied by the probability of choosing the production Y → ν, and γ' is the rule probability, seeding the future substring probability computations:

    α' = Σ_{λ,μ} α(i : X_k → λ . Y μ) P(Y → ν)
    γ' = P(Y → ν)

Recursive correction

Because of possible left recursion in the grammar, the total probability of a predicted state can increase with it being added to a set infinitely many times. Indeed, if a nonterminal A is considered for expansion, and given the productions for A

    A → A a
    A → a

we will predict A → . a and A → . A a at the first prediction step. This will cause the predictor to consider these newly added states for possible descent, which will produce the same two states again. In a non-probabilistic Earley parser it normally means that no further expansion of the production should be made and no duplicate states should be added. In the probabilistic version, however, although adding a state will not add any more information to the parse, its probability has to be factored in by adding it to the probability of the previously predicted state. In the above case that would mean an infinite sum due to the left-recursive expansion.

Figure: Left Corner Relation graph of the grammar G_1 (productions A → BB, A → CB, B → AB, B → C, C → a). The matrix P_L is shown to the right of the productions of the grammar.

In order to demonstrate the solution, we first need to introduce the concept of a Left Corner Relation.

Two nonterminals are said to be in a Left Corner Relation X →_L Y iff there exists a production for X of the form X → Y λ.

We can compute the total recursive contribution of each left-recursive production rule where nonterminals X and Y are in a Left Corner Relation. The necessary correction for nonterminals can be computed in matrix form. The form of the recursive correction matrix R_L can be derived using the simple example presented in the figure above. The graph of the relation presents direct left corner relations between nonterminals of G_1. The LC Relationship matrix P_L of the grammar G_1 is essentially the adjacency matrix of this graph. If there exists an edge between two nonterminals X and Y, it follows that the probability of emitting terminal a via

    X ⇒ Y λ ,   Y → a μ

is the sum of the probabilities along all the paths connecting X with Y, multiplied by the probability of direct emittance of a from Y, P(Y → a μ). Formally,

    P(X ⇒ Y λ ⇒ a μ λ) = P(Y → a μ) ( P_1(X →_L Y) + P_2(X →_L Y) + ... )
                       = P(Y → a μ) Σ_k P_k(X →_L Y)

where P_k(X →_L Y) is the probability of a path from X to Y of length k.

Table: Left Corner (P_L) and its Reflexive Transitive Closure (R_L) matrices for a simple SCFG.

In matrix form, all the k-step path probabilities on a graph can be computed from its adjacency matrix as P_L^k. And the complete reflexive transitive closure R_L is a matrix of infinite sums that has a closed form solution:

    R_L = Σ_{k=0}^∞ P_L^k = I + P_L + P_L^2 + P_L^3 + ... = (I - P_L)^(-1)

Thus the correction should be applied by first descending the chain of left corner productions, indicated by a nonzero value in R_L:

    i :  X_k → λ . Z μ   ==>   i :  Y_i → . ν ,   for all Z such that R_L(Z, Y) is nonzero

and then correcting the forward and inner probabilities for the left recursive productions by an appropriate entry of the matrix R_L:

    α' = Σ_{λ,μ} α(i : X_k → λ . Z μ) R_L(Z, Y) P(Y → ν)
    γ' = P(Y → ν)

Matrix R_L can be computed once for the grammar and used at each iteration of the prediction step.
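As a small numerical illustration of the closed-form closure above, the Python sketch below (using numpy) builds a left-corner probability matrix for a toy three-nonterminal grammar and computes R_L = (I - P_L)^(-1); the grammar and numbers are purely illustrative, and the same closure applied to the unit-production matrix P_U yields the matrix R_U used in the completion step below:

    import numpy as np

    # Nonterminal order: S, A, B
    # Illustrative left-corner probabilities P_L(X, Y): total probability of
    # productions X -> Y ... (Y a nonterminal in the leftmost position)
    P_L = np.array([
        [0.0, 0.8, 0.0],   # S -> A ...  with total probability 0.8
        [0.0, 0.0, 0.5],   # A -> B ...  with total probability 0.5
        [0.3, 0.0, 0.0],   # B -> S ...  with total probability 0.3 (recursion through S)
    ])

    # Reflexive transitive closure: R_L = I + P_L + P_L^2 + ... = (I - P_L)^(-1)
    R_L = np.linalg.inv(np.eye(3) - P_L)
    print(np.round(R_L, 3))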

Scanning

The scanning step simply reads the input symbol and matches it against all pending states for the next iteration. For each state X_k → λ . a μ and the input symbol a, we generate a state:

    i :  X_k → λ . a μ   ==>   i+1 :  X_k → λ a . μ

It is readily converted to the probabilistic form. The forward and inner probabilities remain unchanged from the state being confirmed by the input, since no selections are made at this step. The probability, however, may change if there is a likelihood associated with the input terminal. This will be exploited in the next chapter, dealing with uncertain input. In probabilistic form, the scanning step is formalized as:

    i :  X_k → λ . a μ   [α, γ]   ==>   i+1 :  X_k → λ a . μ   [α, γ]

where α and γ are the forward and inner probabilities.

Note the increase in the i index. This signifies the fact that the scanning step inserts the states into the new state set, for the next iteration of the algorithm.

Completion

The completion step, given a set of states which have just been confirmed by scanning, updates the positions in all the pending derivations all the way up the derivation tree. The position in the expansion of a pending state j : X_k → λ . Y μ is advanced if there is a state starting at position j, i : Y_j → ν . , which consumed all the input symbols related to it. Such a state can now be used to confirm other states expecting Y as their next nonterminal. Since the index range is attached to Y, we can effectively limit the search for the pending derivations to the state set indexed by the starting index of the completing state, j:

    j :  X_k → λ . Y μ
    i :  Y_j → ν .          ==>   i :  X_k → λ Y . μ

Again, propagation of forward and inner probabilities is simple. New values of α and γ are computed by multiplying the corresponding probabilities of the state being completed by the total probability of all paths ending at i : Y_j → ν . . In its probabilistic form, the completion step generates the following states:

    j :  X_k → λ . Y μ   [α, γ]
    i :  Y_j → ν .   [α'', γ'']      ==>   i :  X_k → λ Y . μ   [α', γ']

    α' = Σ α(j : X_k → λ . Y μ) γ''(i : Y_j → ν .)
    γ' = Σ γ(j : X_k → λ . Y μ) γ''(i : Y_j → ν .)

Recursive correction

Here, as in prediction, we face the potential danger of entering an infinite loop. Indeed, given the productions

    A → B | a
    B → A

and the complete state i: _j A → a., we complete the state set j, containing

    j: _j A → .B
    j: _j A → .a
    j: _j B → .A

Among others, this operation produces the complete state i: _j B → A., which in turn completes i: _j A → B., which again confirms B → .A, and so on, causing the parse to go into an infinite loop. In a non-probabilistic Earley parser this simply means that we do not add the newly generated duplicate states and proceed with the rest of them. However, this would introduce a truncation of the probabilities, as in the case with prediction. It has been shown that this infinite loop appears due to so-called unit productions.

[Table: Unit Production relation matrix P_U and its reflexive transitive closure R_U for a simple example SCFG over the nonterminals S, A, B, C.]

Two nonterminals are said to be in a Unit Production Relation X →_U Y iff there exists a production for X of the form X → Y.

As in the case with prediction, we compute the closed-form solution for a Unit Production recursive correction matrix R_U (see the table above), considering the Unit Production relation expressed by a matrix P_U. R_U is found as R_U = (I − P_U)^(−1). The resulting expanded completion algorithm accommodates the recursive loops:

    j: _k X → λ.Zμ
    i: _j Y → ν.        ⇒        i: _k X → λZ.μ       for all Z such that R_U(Z, Y) ≠ 0

where the computation of α' and γ' is corrected by the corresponding entry of R_U:

    α'(i: _k X → λZ.μ) = Σ α(j: _k X → λ.Zμ) R_U(Z, Y) γ(i: _j Y → ν.)
    γ'(i: _k X → λZ.μ) = Σ γ(j: _k X → λ.Zμ) R_U(Z, Y) γ(i: _j Y → ν.)

As with R_L, the unit production recursive correction matrix R_U can be computed once for each grammar.
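Both correction matrices reduce to a single matrix inversion over the nonterminals. The following sketch (Python with numpy, not the parser implementation described later) illustrates the computation of R_L and R_U; the toy grammar, its list-of-tuples representation and the probabilities are assumptions made purely for this example.

    # Sketch: computing R_L = (I - P_L)^-1 and R_U = (I - P_U)^-1 for a toy SCFG.
    import numpy as np

    nonterminals = ["S", "A", "B", "C"]
    idx = {nt: i for i, nt in enumerate(nonterminals)}

    # Each production: (lhs, rhs_symbols, probability); lowercase names are terminals.
    productions = [
        ("S", ["A", "B"], 0.8), ("S", ["C"], 0.2),
        ("A", ["A", "B"], 0.3), ("A", ["a"], 0.7),
        ("B", ["b", "B"], 0.4), ("B", ["b"], 0.6),
        ("C", ["B", "C"], 0.5), ("C", ["c"], 0.5),
    ]

    n = len(nonterminals)
    P_L = np.zeros((n, n))   # left-corner relation: leftmost rhs symbol is a nonterminal
    P_U = np.zeros((n, n))   # unit-production relation: rhs is a single nonterminal

    for lhs, rhs, p in productions:
        first = rhs[0]
        if first in idx:
            P_L[idx[lhs], idx[first]] += p
            if len(rhs) == 1:
                P_U[idx[lhs], idx[first]] += p

    # Closed-form reflexive transitive closures (geometric series of the relation matrices).
    R_L = np.linalg.inv(np.eye(n) - P_L)
    R_U = np.linalg.inv(np.eye(n) - P_U)
    print(R_L)
    print(R_U)

Since the grammar does not change during parsing, both inversions are performed once when the grammar is loaded.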

Relation to HMM

While SCFGs look very different from HMMs, they have some remarkable similarities. In order to consider the similarities in more detail, let us look at a subset of SCFGs. A Stochastic Regular Grammar (SRG), as defined earlier, is a tuple G = (N, T, P, S) where the set of productions P is of the form A → aB or A → a. If we consider the fact that an HMM is a version of a probabilistic Finite State Machine, we can apply simple non-probabilistic rules to convert an HMM into an SRG. More precisely, taking a transition A → B of an HMM, labeled with the symbol a and probability p_m, we can form an equivalent production A → aB with the same probability p_m. Conversion of an SCFG in its general form is not always possible: an SCFG rule A → aAa, for instance, cannot be converted to a sequence of transitions of an HMM. The presence of this mechanism makes richer structural derivations, such as counting and recursion, possible and simple to formalize.
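A minimal sketch of this conversion is given below. It assumes one common HMM convention in which each state emits a symbol and the emission and transition probabilities are combined on the arc; the data structures and the toy HMM are illustrative only and are not taken from the thesis.

    # Sketch: converting an HMM into Stochastic Regular Grammar productions A -> a B [p].
    def hmm_to_srg(states, trans, emit, final):
        """trans[A][B] = P(A -> B), emit[A][a] = P(emit a in A), final[A] = P(terminate in A)."""
        productions = []
        for A in states:
            for a, pe in emit[A].items():
                for B, pt in trans[A].items():
                    productions.append((A, [a, B], pe * pt))      # A -> a B
                if final.get(A, 0.0) > 0.0:
                    productions.append((A, [a], pe * final[A]))   # A -> a
        return productions

    states = ["S1", "S2"]
    trans  = {"S1": {"S1": 0.6, "S2": 0.3}, "S2": {"S2": 0.5}}
    emit   = {"S1": {"x": 1.0}, "S2": {"y": 1.0}}
    final  = {"S1": 0.1, "S2": 0.5}

    for lhs, rhs, p in hmm_to_srg(states, trans, emit, final):
        print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")

The reverse direction fails for general SCFGs, as noted above, because rules such as A → aAa require an unbounded memory that a fixed state set cannot provide.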

Another, more subtle, difference has its roots in the way HMMs and SCFGs assign probabilities. SCFGs assign probabilities to the production rules so that for each nonterminal the probabilities sum up to 1. This translates into the probability being distributed over the sentences of the language: the probabilities of all the sentences (or structural entities) of the language sum up to 1. HMMs in this respect are quite different: the probability gets distributed over all the sequences of a given length n.

In the HMM framework the tractability of the parsing is afforded by the fact that the number of states remains constant throughout the derivation. In the Earley SCFG framework the number of possible Earley states increases linearly with the length of the input string, due to the starting index of each state. This makes it necessary in some instances to employ pruning techniques, which allow us to deal with the increasing computational complexity when parsing long input strings.

Viterbi Parsing

The Viterbi parse is applied to the state sets in a chart parse in a manner similar to HMMs. Viterbi probabilities are propagated in the same way as inner probabilities, with the exception that instead of summing the probabilities during the completion step, maximization is performed. That is, given a complete state i: _j Y → ν., we can formalize the process of computing the Viterbi probability v of the state it completes as

    v(i: _k X → λY.μ) = max over i: _j Y → ν. of [ v(j: _k X → λ.Yμ) · v(i: _j Y → ν.) ]

and the Viterbi path includes the state

    argmax over i: _j Y → ν. of [ v(j: _k X → λ.Yμ) · v(i: _j Y → ν.) ]

The completed state keeps a back-pointer to the state which completes it with maximum probability, providing the path for backtracking after the parse is complete. The computation proceeds iteratively within the normal parsing process. After the final state is reached, it contains pointers to its immediate children, which can be traced to reproduce the maximum probability derivation tree.
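A minimal sketch of this max-instead-of-sum propagation with back-pointers is shown below; the State class and labels are toy stand-ins, not the parser's actual data structures.

    # Sketch: Viterbi propagation at completion - maximize over completing children
    # and keep a back-pointer to the maximizing child.
    class State:
        def __init__(self, label, viterbi=1.0):
            self.label = label
            self.viterbi = viterbi
            self.back = None           # back-pointer to the maximizing completing state

    def complete_viterbi(pending, completing_states, completed_label):
        completed = State(completed_label, viterbi=0.0)
        for child in completing_states:
            candidate = pending.viterbi * child.viterbi
            if candidate > completed.viterbi:      # max instead of sum
                completed.viterbi = candidate
                completed.back = child
        return completed

    pending  = State("X -> a . Y b", viterbi=0.2)
    children = [State("Y -> c .", 0.5), State("Y -> c d .", 0.3)]
    done = complete_viterbi(pending, children, "X -> a Y . b")
    print(done.viterbi, done.back.label)   # 0.1 and the first child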

Viterbi Parse in the Presence of Unit Productions

The otherwise straightforward algorithm of updating the Viterbi probabilities and the Viterbi path is complicated in the completion step by the presence of Unit productions in the derivation chain. The computation of the forward and inner probabilities has to be performed in closed form, as described previously, which causes the unit productions to be eliminated from the Viterbi path computations. This results in an inconsistent parse tree with omitted unit productions, since in computing the closed-form correction matrix we collapse such unit production chains. In order to remedy this problem we need to consider two competing requirements on the Viterbi path generation procedure:

On one hand, we need a batch version of the probability calculation, performed as previously described, to compute transitive recursive closures of all the probabilities by using the recursive correction matrices. In this case Unit productions do not need to be considered by the algorithm, since their contributions to the probabilities are encapsulated in the recursive correction matrix R_U.

On the other hand, we need a finite number of states to be considered as completing states, in a natural order, so that we can preserve the path through the parse and have a continuous chain of parents for each participating production. In other words, the unit productions which get eliminated from the Viterbi path during the parse, while computing the Viterbi probability, need to be included in the complete derivation tree.

We solve both problems by computing the forward, inner and Viterbi probabilities in closed form, applying the recursive correction R_U to each completion for non-unit completing states, and then considering only completing unit production states for insertion into the production chain. Unit production states will not update their parents' probabilities, since those are already correctly computed via R_U. Now the maximization step in computing the parent's Viterbi probability needs to be modified to keep a consistent tree. To accomplish this, when using a unit production to complete a state, the algorithm inserts the Unit production state into a child slot of the completed state only if it has the same children as the state it is completing. The overlapping set of children is removed from the parent state. This extends the derivation path by the Unit production state, maintaining a consistent derivation tree.

Pseudocode of the modifications to the completion algorithm which maintain a consistent Viterbi derivation tree in the presence of Unit production states is shown in the figure below.

However, the problem stated in the earlier subsection still remains: since we cannot compute the Viterbi paths in closed form, we have to compute them iteratively, which can make the parser go into an infinite loop. Such loops should be avoided while formulating the grammar. A better solution is being sought.

    function StateSet::Complete()
        if State.IsUnit()
            State.Forward        // left unchanged: unit production contributions
            State.Inner          // are already included via R_U
        else
            State.Forward = ComputeForward()
            State.Inner   = ComputeInner()
        end
        NewState = FindStateToComplete()
        NewState.AddChild(State)
        StateSet::AddState(NewState)
    end

    function StateSet::AddState(NewState)
        if StateSet::AlreadyExists(NewState)
            OldState = StateSet::GetExistingState(NewState)
            CompletingState = NewState.GetChild()
            if OldState.HasSameChildren(CompletingState)
                OldState.ReplaceChildren(CompletingState)
            end
            OldState.AddProbabilities(NewState.Forward, NewState.Inner)
        else
            StateSet::Add(NewState)
        end
    end

Figure: Simplified pseudocode of the modifications to the completion algorithm.

Chapter

Temporal Parsing of Uncertain Input

The approach described in the previous chapter can be effectively used to find the most likely syntactic groupings of the elements in a non-probabilistic input stream. The probabilistic character of the grammar comes into play when disambiguation of the string is required, by means of the probabilities attached to each rule. These probabilities reflect the typicality of the input string. By tracing the derivations of each string we can assign it a value reflecting how typical the string is according to the current model.

In this chapter we address two main problems in the application of the parsing algorithm described so far to action recognition:

1. Uncertainty in the input string. The input symbols, which are formed by low-level temporal feature detectors, are uncertain. This implies that some modifications to the algorithm which account for this are necessary.

2. Temporal consistency of the event stream. Each symbol of the input stream corresponds to some time interval. Since here we are dealing with a single-stream input, only one input symbol can be present in the stream at a time and no overlapping is allowed, which the algorithm must enforce.

Terminal Recognition

Before we proceed to develop the temporal parsing algorithm, we need to say a few words about the formation of the input. In this thesis we are attempting to combine the detection of independent component activities in a framework of syntax-driven structure recognition. Most often these components are detected after the fact, that is, recognition of an activity only succeeds when the whole corresponding subsequence is present in the causal signal. At the point when the activity is detected by the low-level recognizer, the likelihood of the activity model represents the probability that this part of the signal has been generated by the corresponding model. In other words, with each activity likelihood we will also have the sample range of the input signal, or a corresponding time interval, to which it applies. This fact will be the basis of the technique for enforcing temporal consistency of the input, developed later in this chapter. We will refer to the model activity likelihood as a terminal, and to the length of the corresponding subsequence as a terminal length.

Uncertainty in the Input String

The previous chapter described the parsing algorithm where no uncertainty about the input symbols is taken into consideration. New symbols are read off by the parser during the scanning step, which changes neither forward nor inner probabilities of the pending states. If the likelihood of the input symbol is also available, as in our case, it can easily be incorporated into the parse during scanning, by multiplying the inner and forward probabilities of the state being confirmed by the input likelihood. We reformulate the scanning step as follows: having read a symbol a and its likelihood P(a) off the input stream, we produce the state

    i: _k X → λ.aμ    ⇒    i+1: _k X → λa.μ

and compute the new values of α' and γ' as

    α'(i+1: _k X → λa.μ) = α(i: _k X → λ.aμ) P(a)
    γ'(i+1: _k X → λa.μ) = γ(i: _k X → λ.aμ) P(a)

The new values of the forward and inner probabilities will weigh competing derivations not only by their typicality, but also by our certainty about the input at each sample.

Substitution

Introducing likelihoods of the terminals at the scanning step makes it simple to employ multi-valued string parsing, where each element of the input is in fact a multi-valued vector of vocabulary model likelihoods at each time step of an event stream. Using these likelihoods, we can condition the maximization of the parse not only on the frequency of occurrence of some sequence in the training corpus, but also on the measurement or estimation of the likelihood of each of the sequence components in the test data.

The multi-valued string approach allows for dealing with so-called substitution errors, which manifest themselves in the replacement of a terminal in the input string with an incorrect one. It is especially relevant to the problems addressed by this thesis, where sometimes the correct symbol is rejected due to a lower likelihood than that of some other one. In the proposed approach we allow the input at a discrete instance of time to include all nonzero likelihoods, and let the parser select the most likely one based not only on the corresponding value, but also on the probability of the whole sequence, as additional reinforcement of the local model estimate.

From this point on, the input stream will be viewed as a multi-valued string, a lattice, which has a vector of likelihoods associated with each time step (see the figure below). The length of the vector is in general equal to the number of vocabulary models. The model likelihoods are incorporated into the parsing process at the scanning step, as was shown above, by considering all nonzero likelihoods for the same state set, that is

    i: _k X → λ.aμ    ⇒    i+1: _k X → λa.μ       for all a such that P(a) > 0

The parsing is performed in a parallel manner for all suggested terminals.
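A small sketch of this lattice scanning step is given below; the dictionary-based state representation is an assumption chosen only to keep the example self-contained.

    # Sketch: scanning a multi-valued input sample (a lattice column). Every terminal
    # with a nonzero likelihood confirms the pending states expecting it, and the
    # forward/inner probabilities are multiplied by that likelihood.
    def scan(pending_states, likelihoods):
        """pending_states: dicts with 'rule', 'dot', 'alpha', 'gamma';
           likelihoods: dict terminal -> P(terminal) at the current sample."""
        new_state_set = []
        for s in pending_states:
            rhs = s["rule"][1]
            if s["dot"] >= len(rhs):
                continue                               # nothing left to scan
            expected = rhs[s["dot"]]
            p = likelihoods.get(expected, 0.0)
            if p > 0.0:                                # branch over all nonzero terminals
                new_state_set.append({
                    "rule": s["rule"], "dot": s["dot"] + 1,
                    "alpha": s["alpha"] * p, "gamma": s["gamma"] * p,
                })
        return new_state_set

    states = [{"rule": ("A", ("a", "b", "c")), "dot": 0, "alpha": 0.4, "gamma": 0.4},
              {"rule": ("A", ("a", "c", "b")), "dot": 0, "alpha": 0.4, "gamma": 0.4}]
    print(scan(states, {"a": 0.7, "b": 0.1}))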

Insertion

It has been shown that while significant progress has been made in the processing of correct text, a practical system must be able to handle many forms of ill-formedness gracefully. In our case the need for this robustness is paramount.

Indeed, the parsing needs to be performed on a lattice, where the symbols which we need to consider for inclusion in the string come at random times.

[Figure: Example of the lattice parse input for a three-model vocabulary (models a, b and c). Consider a grammar A → abc | acb. The dashed line shows a parse acb. The two rectangles drawn around samples show the spurious symbols for this parse, which need to be ignored for the derivation to be contiguous. If the spurious symbols were simply removed from the stream, an alternative derivation for the sequence abc, shown by a dotted line, would be interrupted. Note the sample which contains two concurrent symbols, which are handled by the lattice parsing.]

This results in the appearance of spurious (i.e. ungrammatical) symbols in the input stream, a problem commonly referred to as insertion. We need to be able to consider these inputs separately at different time steps, and to disregard them if their appearance in the input stream for some derivation is found to be ungrammatical. At the same time we need to preserve the symbol in the stream, in order to consider it in other possible derivations, perhaps even of a completely different string. The problem is illustrated by the figure above, in which we observe two independent derivations on an uncertain input stream. While deriving the string acb we need to disregard the samples which would interfere with the derivation. However, we do not have any a priori information about the overall validity of this string; that is, we cannot simply discard those samples from the stream, because combined with another sample they form an alternative string, abc, and for this alternative derivation a different sample would present a problem if not dealt with.

There are at least three possible solutions to the insertion problem in our context.

Firstly, we can employ some ad hoc method which synchronizes the input, presenting the terminals coming with a slight spread around a synchronizing signal as one vector. That would significantly increase the danger of not finding a parse if the spread is significant.

Secondly, we may attempt to gather all the partially valid strings, performing all the derivations we can on the input and post-processing the resulting set to extract the most consistent, maximum probability set of partial derivations. Partial derivations acquired by this technique show a distinct split at the position in the input where the supposed ungrammaticality occurred. While robust for finding ungrammatical symbols in a single stream, in our experiments this method did not produce the desired results: the lattice extension, extremely noisy input and a relatively shallow grammar made it less useful, producing a large number of string derivations which contained a large number of splits and were difficult to analyze.

And finally, we can attempt to modify the original grammar to explicitly include the ERROR symbol.

In our algorithm the last approach proved to be the most suitable, as it allowed us to incorporate additional constraints on the input stream easily, as will be shown in the next section.

We formulate simple rules for the grammar modifications and perform the parsing of the input stream using this modified grammar.

Given the grammar G, a robust grammar G' is formed by the following rules:

1. Each terminal in the productions of G is replaced by a pre-terminal in G':

       G:  A → bC        ⇒        G':  A → BC

2. For each pre-terminal of G' a skip rule is formed:

       G':  B → b | SKIP b | b SKIP

   This is essentially equivalent to adding a production B → SKIP b SKIP, if SKIP is allowed to expand to ε.

3. A skip rule is added to G' which includes all repetitions of all terminals:

       G':  SKIP → b | c | ... | b SKIP | c SKIP | ...

[Figure: Example of temporally inconsistent terminals. Given a production A → ab | abA, an unconstrained parse will attempt to consume the maximum amount of samples by non-SKIP productions. The resulting parse ababab is shown by the dashed line. The solid line shows the correct parse abab, which includes only non-overlapping terminals; horizontal lines show the sample range of each candidate terminal.]

Again, if SKIP is allowed to expand to ε, the last two steps are equivalent to adding

    G':  B → SKIP b SKIP
         SKIP → ε | b SKIP | ...

This process can be performed automatically, as a preprocessing step when the grammar is read in by the parser, so that no modifications to the grammar have to be written explicitly.
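As an illustration of such a preprocessing step, the sketch below builds a robust grammar from a plain one. The production format and the probabilities assigned to the skip alternatives are assumptions made for this example only.

    # Sketch: automatic "robust grammar" construction - replace terminals by
    # pre-terminals, add B -> b | SKIP b | b SKIP rules, and add a SKIP rule
    # that absorbs repetitions of any terminal.
    def make_robust(productions, terminals, p_skip=0.05):
        pre = {t: t.upper() + "_T" for t in terminals}          # terminal -> pre-terminal
        robust = []
        for lhs, rhs, p in productions:                          # replace terminals in G
            robust.append((lhs, [pre.get(sym, sym) for sym in rhs], p))
        for t in terminals:                                      # skip rules per pre-terminal
            B = pre[t]
            robust += [(B, [t], 1.0 - 2 * p_skip),
                       (B, ["SKIP", t], p_skip),
                       (B, [t, "SKIP"], p_skip)]
        n = len(terminals)
        for t in terminals:                                      # SKIP absorbs any terminal
            robust += [("SKIP", [t], 0.5 / n), ("SKIP", [t, "SKIP"], 0.5 / n)]
        return robust

    g = [("A", ["b", "C"], 1.0), ("C", ["c"], 1.0)]
    for lhs, rhs, p in make_robust(g, ["b", "c"]):
        print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")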

Enforcing Temporal Consistency

Parsing the lattice with the robust version of the original grammar results in the parser consuming the maximum amount of the input. Indeed, the SKIP production, which is usually assigned a low probability, serves as a function that penalizes the parser for each symbol considered ungrammatical. This might result in the phenomenon where some of the noise gets forced to take part in a derivation as a grammatical occurrence. The effect of this can be relieved if we take into account the temporal consistency of the input stream, that is, that only one event happens at a given time. Since each event has a corresponding length, we can further constrain the occurrence of the terminals in the input so that no overlapping events take place. The problem is illustrated in the figure above.

In the Earley framework we can modify the parsing algorithm to work in an incremental fashion, as a filter to the completion step, while keeping track of the terminal lengths during scanning and prediction. In order to accomplish the task we need to introduce two more state variables: h for the high mark and l for the low mark. Each of the parsing steps has to be modified as follows.

Scanning

The scanning step is modified to include reading the time interval associated with the incoming terminal. The updates of l and h are propagated during the scanning as follows:

    i: _k X → λ.aμ [l, h]    ⇒    i+1: _k X → λa.μ [l, h_a]

where h_a is the high mark of the terminal a. In addition to this, we set the timestamp S_t of the whole new (i+1)-th state set to h_a, to be used later by the prediction step.

Completion

Similarly to scanning, the completion step advances the high mark of the completed state to that of the completing state, thereby extending the range of the completed nonterminal:

    j: _k X → λ.Yμ [l, h]
    i: _j Y → ν. [l', h']        ⇒        i: _k X → λY.μ [l, h']

This completion is performed for all states i: _j Y → ν. [l', h'] subject to the constraint l' ≥ h, with SKIP productions exempt from the constraint.

Prediction

The prediction step is responsible for updating the low mark of the state to reflect the timing of the input stream:

    i: _k X → λ.Yμ [l, h]    ⇒    i: _i Y → .ν [S_t, S_t]

where S_t is the timestamp of the state set, updated by the scanning step.

The essence of this filtering is that only the paths that are consistent in the timing of their terminal symbols are considered. This does not interrupt the parses of the sequence, since the filtering, combined with the robust grammar, leaves the subsequences which form the parse connected via the skip states.

Misalignment Cost

In the process of temporal filtering it is important to remember that the terminal lengths are themselves uncertain. A useful extension to the temporal filtering is to implement a softer penalty function, rather than a hard cutoff of all the overlapping terminals as described above. This is easily achieved at the completion step, where the filtering is replaced by a weighting function f of a parameter Δ = l' − h, the misalignment between the completing and the completed states, which is multiplied into the forward and inner probabilities. For instance f(Δ) = e^(−|Δ|/σ), where σ is a penalizing parameter, can be used to weigh overlap and spread equally. The resulting penalty is incorporated into the computation of α' and γ':

    α'(i: _k X → λZ.μ) = f(Δ) Σ α(j: _k X → λ.Zμ) R_U(Z, Y) γ(i: _j Y → ν.)
    γ'(i: _k X → λZ.μ) = f(Δ) Σ γ(j: _k X → λ.Zμ) R_U(Z, Y) γ(i: _j Y → ν.)

More sophisticated forms of f can be formulated to achieve arbitrary spread balancing as needed.
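The sketch below contrasts the hard temporal filter with the soft exponential penalty at completion time; the variable names and the exact penalty form are illustrative assumptions.

    # Sketch: temporal consistency at completion - either a hard cutoff of
    # overlapping terminals, or a soft penalty f(delta) = exp(-|delta| / sigma).
    import math

    def temporal_weight(completed_high, completing_low, sigma=None):
        delta = completing_low - completed_high    # < 0: overlap, > 0: gap (spread)
        if sigma is None:                          # hard filtering
            return 1.0 if delta >= 0 else 0.0
        return math.exp(-abs(delta) / sigma)       # soft penalty

    def complete_with_filter(alpha, gamma, child_gamma, completed_high, completing_low, sigma=2.0):
        w = temporal_weight(completed_high, completing_low, sigma)
        return alpha * child_gamma * w, gamma * child_gamma * w

    print(temporal_weight(10, 12))                      # 1.0: no overlap, hard filter passes
    print(temporal_weight(10, 8))                       # 0.0: overlap, hard filter rejects
    print(complete_with_filter(0.3, 0.3, 0.5, 10, 8))   # soft penalty keeps the path, down-weighted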

On-line Processing

Pruning

As we gain descriptive advantages by using SCFGs for recognition, the computational complexity, as compared to Finite State Models, increases. The complexity of the algorithm grows linearly with the increasing length of the input string. This presents us, in certain cases, with the necessity of pruning low-probability derivation paths. In the Earley framework such pruning is easy to implement: we simply rank the states in the state set according to their per-sample forward probabilities and remove those that fall under a certain threshold. However, when performing pruning, one should keep in mind that the probabilities of the remaining states will be underestimated.
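A minimal pruning sketch is shown below. Pruning relative to the best state in the set is an assumption; the text above only specifies removal of states below a certain threshold.

    # Sketch: prune a state set by per-sample forward probability.
    def prune(state_set, rel_threshold=1e-4):
        if not state_set:
            return state_set
        best = max(s["alpha"] for s in state_set)
        return [s for s in state_set if s["alpha"] >= best * rel_threshold]

    states = [{"id": 1, "alpha": 0.2}, {"id": 2, "alpha": 1e-7}, {"id": 3, "alpha": 0.05}]
    print([s["id"] for s in prune(states)])   # [1, 3]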

Limiting the Range of Syntactic Grouping

Context-free grammars have the drawback of not being able to express countable relationships in the string by a hard bound, by any means other than explicit enumeration (a situation similar to the "secretary problem" with finite state networks). One implication of this drawback is that we cannot formally limit the range of application of the SKIP productions, and formally specify how many SKIP productions can be included in a derivation for the string to still be considered grammatical. When parsing with a robust grammar is performed, some of the nonterminals being considered for expansion cover large portions of the input string while consuming only a few samples by non-SKIP rules. The states corresponding to such nonterminals have low probability and typically have no effect on the derivation, but are still considered in further derivations, increasing the computational load. This problem is especially relevant when we have a long sequence and the task is to find a smaller subsequence inside it which is modeled by an SCFG: parsing the whole sequence might be computationally expensive if the sequence is of considerable length. We can attempt to remedy this problem by limiting the range of the productions.

In our approach, when faced with the necessity to perform such a parse, we use implicit pruning by operating the parser on a relatively small sliding window. The parse performed in this manner has two important features:

1. The correct string, if it exists, ends at the current sample.

2. The beginning sample of such a string is unknown, but lies within the window.

From these observations we can formalize the necessary modifications to the parsing technique to implement such a sliding window parser:

1. The robust grammar should only include SKIP productions of the form A → a | SKIP a, since the end of the string is at the current sample, which means that there will be no trailing noise, which is normally accounted for by a production A → a SKIP.

2. Each state set in the window should be seeded with a starting symbol. This accounts for the unknown beginning of the string. After performing a parse for the current time step, Viterbi maximization will pick out the maximum probability path, which can be followed back to the starting sample exactly.

This technique is equivalent to a run-time version of the Viterbi parsing used with HMMs, with the exception that no backwards training is necessary, since we have the opportunity to seed the state set with an axiom at an arbitrary position.

Chapter

Implementation and System Design

The recognition system is built as a two-level architecture, according to the two-level Bayesian model described earlier. The system is written in C using a LAPACK-based math library. It runs on a variety of platforms; the tests were performed on HP-UX, SGI IRIX, Digital UNIX, Linux and, in a limited way, MS Windows NT. The overall system architecture is presented in the figure below.

Component Level: HMM Bank

The first level of the architecture is an HMM bank, which consists of a set of independent HMMs running in parallel on the input signal. To recognize the components of the model vocabulary, we train one HMM per primitive action component (terminal), using the traditional Baum-Welch re-estimation procedure. At run time each of these HMMs performs a Viterbi parse of the incoming signal and computes the likelihood of the action primitive. The run-time algorithm used by each HMM to recognize the corresponding word of the action vocabulary is a version of the Viterbi algorithm which performs a backward match of the signal over a window of a reasonable size.

At each time step the algorithm performs the Viterbi parse of the part of the input signal covered by the window. The algorithm first inserts a new sample of the incoming signal into the first position of the input FIFO queue, advancing the queue and discarding the oldest sample. The Viterbi trellis is then recomputed to account for the new sample.

[Figure: Architecture of the recognition system developed in the thesis. The system consists of an HMM bank and a parsing module. The HMM bank contains the model HMMs (H1, H2, H3), operating independently on their own input queues (Q1, Q2, Q3). The parsing module consists of a probabilistic (Viterbi) parser, which performs segmentation and labeling (output data stream D1), and an annotation module (output data stream D2).]

To find the maximum probability parse, a trellis search is performed. Since the likelihood of the model decreases with the length of the sequence, the likelihood values in the trellis need to be normalized to express a per-sample likelihood. This is expressed by a geometric mean or, when the computations are performed in terms of log-likelihood, by a log geometric mean.
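In other words, for a subsequence of length n the score exp(log-likelihood / n) makes sequences of different lengths comparable, as in the small sketch below.

    # Sketch: per-sample normalization of an HMM score (geometric mean of the likelihood).
    import math

    def per_sample_likelihood(log_likelihood, n):
        return math.exp(log_likelihood / n)

    print(per_sample_likelihood(-12.0, 10))   # short sequence
    print(per_sample_likelihood(-30.0, 25))   # longer sequence, comparable scale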

The maximum likelihood found in the trellis has a corresponding range of the input signal to which it applies. This value is easily extracted from the trellis length and the position of the maximum likelihood value in it. The figure below shows the output of a single HMM in a bank, illustrating the relation between output probabilities and sequence lengths.

All the HMMs are run in parallel, providing the parser with the maximum normalized probability and the corresponding sequence length at each time step.

Event generation

The continuous vector output of the HMM bank is represented by a series of events, which are considered for probabilistic parsing. It is in terms of these events that the action grammar is expressed. In this step we do not need to make strong decisions about discarding any events; we just generate a sufficient stream of evidence for the probabilistic parser, such that it can perform a structural rectification of these suggestions. In particular, the SKIP states allow for ignoring erroneous events.

[Figure: HMM output plot. Each point of the probability plot is the normalized maximum likelihood of the HMM at the current sample of the input signal. This value is found as a result of the backwards search on the FIFO queue and corresponds to a sequence of a certain length. The plot shows those lengths as horizontal lines ending at the corresponding sample of the probability plot.]

As a result of such a rectification, the parser finds an interpretation of the stream which is structurally consistent (i.e. grammatical), is temporally consistent (uses a non-overlapping sequence of primitives), and has maximum likelihood given the outputs of the low-level feature detectors.

For the tasks discussed here a simple discretization procedure provides good results. For each HMM in the bank we select a very small threshold to cut off the noise in the output, and then search each of the areas of nonzero probability for a maximum. Refer to part (a) of the figure below for an example of the output of the HMM bank superimposed with the results of discretization.

After discretization we replace the continuous-time vector output of the HMM bank with a discrete symbol stream, generated by discarding any interval of the discretized signal in which all values are zero. This output is emitted at run time, which requires a limited capability of the emitter to look back at the sequence of multi-valued observations. It accumulates the samples of each of the HMMs in a FIFO queue until it collects all the samples necessary to make the decision about the position of a maximum likelihood in all the model outputs. When the maxima are found, the corresponding batch of values is emitted into the output stream and the FIFO queue is flushed.

Part (b) of the figure displays an example of the resulting event sequence.



(Alternatively, we can search the areas of nonzero likelihood for the position where the product of the probability and the sequence length is maximal. This results in finding the weighted maximum, which corresponds to the maximum probability sequence of the maximum length.)

[Figure: Example of the output of the HMM bank. (a) The top plot shows the input sequence; the five plots below it show the output of each HMM in the bank, superimposed with the result of discretization. (b) The corresponding discretized, compressed sequence. Input vectors for the probabilistic parser are formed by taking the samples of all sequences for each time step.]
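The sketch below illustrates this event-generation step: each HMM output is thresholded, the peak of every contiguous above-threshold region is kept, and only time steps at which at least one model produced a candidate are emitted. Array shapes, the threshold and the peak-picking details are assumptions made for illustration.

    # Sketch: discretize the HMM bank output and compress it into an event stream.
    import numpy as np

    def discretize(track, eps=1e-3):
        """Keep only the peak of each contiguous above-threshold region."""
        out = np.zeros_like(track)
        above = track > eps
        start = None
        for t, flag in enumerate(np.append(above, False)):   # sentinel closes a trailing region
            if flag and start is None:
                start = t
            elif not flag and start is not None:
                peak = start + int(np.argmax(track[start:t]))
                out[peak] = track[peak]
                start = None
        return out

    def make_event_stream(bank_output, eps=1e-3):
        """bank_output: (num_models, T) likelihoods -> list of (t, likelihood vector)."""
        disc = np.vstack([discretize(row, eps) for row in bank_output])
        return [(t, disc[:, t]) for t in range(disc.shape[1]) if disc[:, t].any()]

    bank = np.array([[0.0, 0.2, 0.4, 0.1, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.3, 0.5, 0.0]])
    print(make_event_stream(bank))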

Syntax Level: Parser

The probabilistic parser accepts input in the form of an event sequence generated by the HMM bank just described. Each sample of that sequence is a vector containing the likelihoods of the candidate terminals, and their corresponding lengths, for the current time step.

Parser Structure

The parser encapsulates its functionality in three complete levels. The lowest, base level is a complete non-probabilistic Earley parser, which accepts an input string and parses it according to a given Context-Free Grammar. It implements all the basic functionality of the parsing algorithm in a non-probabilistic setting.

A stochastic extension to the parser is built upon the basic Earley parser as a set of child classes, which inherit all the basic functionality but perform the parse according to a given Stochastic Context-Free Grammar. After the grammar is read and checked for consistency, the algorithm computes its recursive correction matrices R_L and R_U, as described in the previous chapter. The parser implemented at this level is capable of performing a probabilistic parse of a string, which can be presented in either probabilistic or non-probabilistic form. Optionally, the parser can perform a partial parse, where the most likely partial derivations are found if the string is ungrammatical.

A multi-valued constrained parser is derived from the stochastic extension and implements the robust, temporally consistent lattice parsing algorithm described in the previous chapter. It is at this level that the sliding window algorithm is also implemented.

Annotation Module

The annotation module is built into the Viterbi backtracking step. After the parse, the final state, if found, contains a back-pointer to its immediate children in the most likely derivation tree. Recursively traversing the tree in a depth-first manner, we essentially travel from one nonterminal to another in a natural order. This can be used for attaching an annotation to each of the nonterminals of interest.

The annotation is declared in the grammar file after each production for a nonterminal of interest. An annotation consists of text with embedded commands and formatting symbols, and is emitted during the traversal of the Viterbi tree after the parse is completed. Every time a nonterminal which has an attached annotation is encountered, the annotation text is formatted according to the format specifications of the annotation block and is sent to the output.

A special mode of the annotation module implements representative frame selection from the video sequence, based on the results of segmentation. When the range of a nonterminal is determined, the module computes the frame indices of its starting and ending frames, after which it selects a representative frame based on the minimum distance to the mean image of the range.
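A small sketch of this frame-selection rule is shown below; storing the video as a numpy array is an assumption made to keep the example self-contained.

    # Sketch: pick the frame within a nonterminal's range that is closest to the
    # mean image of that range.
    import numpy as np

    def representative_frame(frames, start, end):
        """frames: (num_frames, H, W) array; returns the index of the chosen frame."""
        clip = frames[start:end + 1].astype(float)
        mean_image = clip.mean(axis=0)
        dists = np.linalg.norm((clip - mean_image).reshape(len(clip), -1), axis=1)
        return start + int(np.argmin(dists))

    frames = np.random.rand(20, 8, 8)          # toy "video"
    print(representative_frame(frames, 5, 12))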

Chapter

Experimental Results

The recognition system described in this thesis has been used to recognize several complex hand gestures. Each gesture has distinct components forming its vocabulary, for which we trained the low-level temporal feature detectors (HMMs). The training data set consisted of examples of the gesture primitives. The resulting HMM bank was run on several test gestures not present in the training set. At run time, each of the HMMs in the bank had its window and threshold parameters set manually. The output of the HMM bank is piped to the parser, which performs the parse and outputs the resulting Viterbi trace and annotated segmentation.

The rest of this section presents three types of experiments which were performed using the system.

Experiment I: Recognition and Disambiguation

The first experiment is simply a demonstration of the algorithm successfully recognizing a hand gesture and disambiguating its components in the global context. We define a SQUARE gesture as either a left-handed counter-clockwise or a right-handed clockwise square. The gesture vocabulary consists of four components: left-right, up-down, right-left and down-up.

The interpretation of the gesture components is dependent on the global context. For instance, in the right-handed square the left-right component is interpreted as the top of the square, whereas in the case of the left-handed square it is the bottom. We formulate a grammar for this model by introducing an additional nonterminal, TOP. This nonterminal will provide a label for left-right and right-left gestures based on the global context.

The grammar G_square provides a model which recognizes a SQUARE regardless of the fact that it can be either a right-handed or a left-handed square (skip rules omitted for simplicity):

G_square:

    SQUARE → RH
            | LH
    RH     → TOP up-down right-left down-up
            | up-down right-left down-up TOP
            | right-left down-up TOP up-down
            | down-up TOP up-down right-left
    LH     → left-right down-up TOP up-down
            | down-up TOP up-down left-right
            | TOP up-down left-right down-up
            | up-down left-right down-up TOP
    TOP    → left-right
            | right-left

We run the experiments first using the data produced by a mouse gesture, and then repeat the approach using data from a Polhemus sensor attached to the hand and data collected from the STIVE vision system. For each of the variations of the experiment we use a separately trained set of HMMs.

The training data in all cases consists of examples of each of the component gestures. HMMs using x and y velocities as feature vectors serve as models of those components, and are trained on the set until the log-likelihood improvement falls below a threshold.

The testing is performed on a long unsegmented SQUARE gesture, performed in either the right-handed or the left-handed version. The new, previously unseen sequence is segmented and labeled by the algorithm.

In the example shown in part (a) of the figure below, the recovered structure was that of a right-hand square. Recognition results for a left-handed square sequence are seen in part (b). Note that the label TOP was assigned to the left-right gesture in the first case and to the right-left in the second. The figures show the sample indices corresponding to each of the gesture primitives, enclosed in square brackets.

[Figure: Example of the SQUARE sequence segmentation; in the presented example the STIVE setup serves as the data source. (a) Right-handed square segmentation (rsquare.dat): labels Right hand square, Top, Up-down, Right-left and Down-up, each with its segment range and log-probability, followed by the overall Viterbi probability. (b) Left-handed square segmentation (lsquare.dat): labels Left hand square, Left-right, Down-up, Top and Up-down, with segment ranges, log-probabilities and the Viterbi probability.]

[Figure: The Stereo Interactive Virtual Environment (STIVE) computer vision system used to collect data.]

The indices which were found during the execution of the algorithm form a continuous coverage of the input signal.

Recognition and semantic segmentation

As a more complex test of our approach we chose the domain of musical conducting. It is easy to think of conducting gestures as complex gestures consisting of a combination of simpler parts for which we can write down a grammar (coincidentally, a book describing baton techniques written by the famous conductor Max Rudolf is called The Grammar of Conducting). We capture the data from a person, a trained conductor, using natural conducting gestures. The task we are attempting to solve is the following: a piece of music by Sibelius includes a part with a complex beat pattern. Instead of using this complex conducting pattern, conductors often use simpler gestures, merging beats into groups of three or two at will. For the experiment we collect the data from a conductor conducting a few bars of the score, arbitrarily choosing either grouping, and attempt to find the bar segmentation while simultaneously identifying the beat pattern used.

For the experiments we use a Polhemus sensor and then apply the technique in the STIVE environment. We trained five HMMs: on the two primitive components of the two-beat pattern and on the three components of the three-beat pattern.



Jean Sib elius Second SymphonyOpusin D Ma jor 100 0 100 0 50 100 150 200 250 0.5 0.5

0 0 0 50 100 150 200 250 0 5 10 15 20 25 30 35 0.5 0.5

0 0 0 50 100 150 200 250 0 5 10 15 20 25 30 35 0.5 0.5

0 0 0 50 100 150 200 250 0 5 10 15 20 25 30 35 1 1

0.5 0.5

0 0 0 50 100 150 200 250 0 5 10 15 20 25 30 35 1 1

0.5 0.5

0 0

0 50 100 150 200 250 0 5 10 15 20 25 30 35

a b

Figure Output of the HMM bank for the sequence a Top plot shows the

y comp onentof the input sequence Fiveplots b elowitshowoutputofeach HMM in the bank

sup erimp osed with the result of discretization b Corresp onding eventsequence

As models we use HMMs with x and y velocity feature vectors for the Polhemus data, and an HMM with a feature vector containing the x, y and z velocities of the hand for the STIVE experiment.

Some of the primitives in the set are very similar to each other, and therefore the corresponding HMMs show a high likelihood at about the same time, which results in a very noisy lattice (part (a) of the figure above). We parse the lattice with the grammar G_c, again omitting the SKIP productions for simplicity:

G_c:

    PIECE → BAR PIECE
           | BAR
    BAR   → TWO
           | THREE
    THREE → down right up
    TWO   → down up

The results of a run of the lower-level part of the algorithm on a conducting sequence are shown in the figure above, with the top plot of (a) displaying the y positional component.

The first figure below demonstrates the output of the probabilistic parsing algorithm, in the form of semantic labels and corresponding sample index ranges for the gesture, using the Polhemus sensor. Results of the segmentation of a gesture in the STIVE experiment are shown in the second figure below.

[Figure: Polhemus experiment. Results of the segmentation of a long conducting gesture for the bar sequence: each BAR is listed with its start and end samples and with whether it was conducted as a two-quarter or a three-quarter beat pattern, followed by the overall Viterbi probability.]

[Figure: STIVE experiment. Results of the segmentation of a conducting gesture for the bar sequence: each BAR is listed with its start and end samples and the beat pattern used, followed by the overall Viterbi probability.]

[Figure: Labeling a recursive process. (a) A Christmas tree drawing as an example of a recursive process. (b) Results of the component recognition process with candidate events identified; the top plot shows the y component of the drawing plotted against the time axis. (c) Segmentation. (d) Lower branch removed.]

The segmentation and labeling achieved in these experiments are correct, which can be seen for one of the Polhemus experiments in the figures above. In the interest of space we do not show intermediate results for the STIVE experiment, but it was also successful in all tests.

Recursive Model

The final experiment is presented here more for the sake of illustrating the simplicity of describing a model for a recursive recognition task.

The task can be described as follows: we make a simple drawing of a tree, as shown in part (a) of the figure above, starting at the left side of the trunk. We want to parse this drawing, presenting the data in the natural order, find the lowest branch and remove it from the drawing.

The drawing is described by a simple recursive grammar:

G:

    TREE   → up BRANCH down
    BRANCH → LSIDE BRANCH RSIDE
            | LSIDE RSIDE
    LSIDE  → left updiag
    RSIDE  → dndiag left

The intermediate results are shown in part (b) of the figure. What is of interest in this example is the recursive character of the drawing, and the fact that the definition of BRANCH includes long-distance influences. In other words, the part of the drawing which we want to label is composed of primitives which are disparate in time, but one causes the other to occur. As seen from part (c) of the figure, the algorithm has no problems with either of these difficulties.

The segmentation data can now be used to complete the task and remove the lower branch of the Christmas tree, which is shown in part (d).

Chapter

Conclusions

Summary

In this thesis we implemented a recognition system based on a two-level recognition architecture for use in the domain of action recognition. The system consists of an HMM bank and a probabilistic Earley-based parser. In the course of the research we developed a set of extensions to the probabilistic parser which allow for uncertain multi-valued input and enforce temporal consistency of the input stream. An improved algorithm for building a Viterbi derivation tree was also implemented. The system includes an annotation module which is used to associate a semantic action with a syntactic grouping of the symbols of the input stream.

The recognition system performs syntactic segmentation, disambiguation and labeling of the stream according to a given Stochastic Context-Free Grammar, and is shown to work in a variety of sensor modalities.

The use of formal grammars and syntax-based action recognition is reasonable if decomposition of the action under consideration into a set of primitives that lend themselves to automatic recognition is possible. We explored several examples of such actions and showed that the recognition goals were successfully achieved by the proposed approach.

Evaluation

The goals of this thesis, and the means with which they are achieved in this work, are summarized in the table below.

Goal: Build a recognition system which is based on a simple description of a complex action as a collection of component primitives.
Means: Achieved by a probabilistic Earley-based parser, a Stochastic Context-Free Grammar and extensions for temporal parsing.
Completed: yes.

Goal: Develop a segmentation technique based on the component content of the input signal.
Means: Segmentation is performed as a structure probability maximization during the Viterbi parse.
Completed: yes.

Goal: Develop a context-dependent annotation technique for an unsegmented input stream.
Means: Annotation is emitted by a grammar-driven annotation module during the structure probability maximization.
Completed: yes.

As is seen from the table, the main goals of the thesis have been reached.

In evaluating the contributions of this thesis, the most important question to be answered is what this technique affords us. First of all, it allows us to assign probabilities to the structural elements of an observed action. Second, it allows for easy formulation of complex hierarchical action models. As an added advantage, it increases the reliability of the estimation of the component primitive activities by filtering out impossible observations, and it accommodates recursive actions.

A quantitative evaluation of the accuracy increase of the technique developed in this thesis is quite difficult to perform, for two main reasons:

1. Direct comparison of the current system with an unconstrained one involves building an alternative system, which is quite an undertaking, and it is still unclear which parameters of the system could be changed and which ones carried over. On the other hand, the evaluation problem can be simply stated as not performing the parse; this alternative comparison is unfair, since it is clearly biased towards the system which we developed in this thesis.

2. The improvement afforded by additional syntactic constraints is problem specific. The essence of the syntactic filtering is that the full sequence space is significantly reduced by the grammar to a nonlinear subspace of admissible sequences, in which the search is performed. The effectiveness of this approach is directly related to the decrease in the size of the search space. However, it is possible to write a grammar which covers the whole sequence space, in which case no improvement is afforded.

In general terms, the expected advantage of syntactic constraints was shown in the introduction. That figure shows that syntactic constraints can dramatically increase the recognition accuracy of the component primitives. This estimation is correct in our case, as was noted above.

Extensions

The recognition system described in this thesis can easily be extended to produce video annotations in the form of Web pages. This can be useful, for instance, for summarizing instructional videos.

It is possible to use Extended SCFGs with the current parser. Extended CFGs allow for limited context-sensitivity of the CFG parsing mechanism without opening the Pandora's box of linearly bounded automata.

One interesting application of the window parsing algorithm described earlier is the direct parsing of the outputs of an HMM bank, with no explicit discretization. In this case the computational complexity can become prohibitive. However, if some knowledge about the expected length of the signal is available, the original parsing task can be reformulated. The solution can be provided by splitting the original grammar into several components, one for each selected nonterminal, and one which gathers the results into a final sentential form. Each of the selected nonterminal models performs the parsing within some window of a limited size, thus restricting the production range to a manageable size, as is routinely done with HMMs. In this case the sequence effectively undergoes a semantic discretization, which is then assembled by the top-level structure, which has no window limit.

Final Thoughts

With all the advantages of the method, there are still problems that were not addressed in our work; for instance, grammar inference is hard to do in the framework of stochastic parsing. We have not researched the opportunities of fully utilizing the production probabilities. In the experiments above we determined them heuristically, using simple reasoning based on our understanding of the process. The rule probabilities, which reflect the typicality of a string derivation, will play a significantly more important role in the recognition and interpretation of actions of a higher complexity than those presented in this thesis. We plan on exploring the added advantages of learned probabilities in our further research.

Bibliography

A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation and Compiling. Volume: Parsing. Prentice Hall, Englewood Cliffs, NJ.

A. V. Aho and T. G. Peterson. A minimum distance error-correcting parser for context-free languages. SIAM Journal of Computing.

J. Allen. Special issue on ill-formed input. American Journal of Computational Linguistics.

M. L. Baird and M. D. Kelly. A paradigm for semantic picture recognition. PR.

A. F. Bobick. Movement, activity and action: The role of knowledge in the perception of motion. In Philosophical Transactions, Royal Society London B.

A. F. Bobick and A. D. Wilson. A state-based technique for the summarization and recognition of gesture. Proc. Int. Conf. Comp. Vis.

T. L. Booth and R. A. Thompson. Applying probability measures to abstract languages. IEEE Transactions on Computers.

M. Brand. Understanding manipulation in video. In AFGR.

R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation.

H. Bunke and D. Pasche. Parsing multivalued strings and its application to image and waveform recognition. Structural Pattern Analysis.

L. Campbell. Recognizing classical ballet steps using phase space constraints. Master's thesis, Massachusetts Institute of Technology.

J. H. Connell. A Colony Architecture for an Artificial Creature. PhD thesis, MIT.

J. D. Courtney. Automatic video indexing via object motion analysis. PR.

T. J. Darrell and A. P. Pentland. Space-time gestures. Proc. Comp. Vis. and Pattern Rec.

E. Charniak. Statistical Language Learning. MIT Press, Cambridge, Massachusetts and London, England.

Jay Clark Earley. An Efficient Context-Free Parsing Algorithm. PhD thesis, Carnegie-Mellon University.

K. S. Fu. A step towards unification of syntactic and statistical pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

F. Jelinek, J. D. Lafferty and R. L. Mercer. Basic methods of probabilistic context-free grammars. In Pietro Laface and Renato De Mori, editors, Speech Recognition and Understanding: Recent Advances, Trends and Applications, NATO Advanced Study Institute, Springer Verlag, Berlin.

P. Maes. The dynamics of action selection. AAAI Spring Symposium on AI Limited Rationality.

R. Narasimhan. Labeling schemata and syntactic descriptions of pictures. Information and Control.

B. J. Oomen and R. L. Kashyap. Optimal and information theoretic syntactic pattern recognition for traditional errors. SSPR: International Workshop on Advances in Structural and Syntactic Pattern Recognition. [The authors present a foundational basis for optimal and information theoretic syntactic pattern recognition. They develop a rigorous model for channels which permit arbitrarily distributed substitution, deletion and insertion syntactic errors. The scheme is shown to be functionally complete and stochastically consistent. VERY IMPORTANT.]

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE.

L. R. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs.

Max Rudolf. The Grammar of Conducting: A Comprehensive Guide to Baton Techniques and Interpretation. Schirmer Books, New York.

A. Sanfeliu and R. Alquezar. Efficient recognition of a class of context-sensitive languages described by augmented regular expressions. SSPR: International Workshop on Advances in Structural and Syntactic Pattern Recognition. [The authors introduce a method of parsing AREs, which describe a subclass of context-sensitive languages, including the ones describing planar shapes with symmetry. AREs are regular expressions augmented with a set of constraints that involve the number of instances in a string of the operands of the star operator.]

A. Sanfeliu and M. Sainz. Automatic recognition of bidimensional models learned by grammatical inference in outdoor scenes. SSPR: International Workshop on Advances in Structural and Syntactic Pattern Recognition. [Shows automatic traffic sign detection by a pseudo-bidimensional Augmented Regular Expression based technique.]

R. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. Wiley, New York.

J. Schlenzig, E. Hunter and R. Jain. Recursive identification of gesture inputs using hidden Markov models. Proc. Second Annual Conference on Applications of Computer Vision.

T. Starner and A. Pentland. Real-time American Sign Language from video using hidden Markov models. In MBR.

T. E. Starner and A. Pentland. Visual recognition of American Sign Language using hidden Markov models. In Proc. of the Intl. Workshop on Automatic Face and Gesture Recognition, Zurich.

A. Stolcke. Bayesian Learning of Probabilistic Language Models. PhD thesis, University of California at Berkeley.

A. Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics.

M. G. Thomason. Stochastic syntax-directed translation schemata for correction of errors in context-free languages. TC.

W. G. Tsai and K. S. Fu. Attributed grammars: a tool for combining syntactic and statistical approaches to pattern recognition. IEEE Transactions on Systems, Man and Cybernetics, volume SMC.

A. D. Wilson, A. F. Bobick and J. Cassell. Temporal classification of natural gesture and application to video coding. Proc. Comp. Vis. and Pattern Rec.

C. Wren, A. Azarbayejani, T. Darrell and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Analysis and Machine Intelligence.

J. Yamato, J. Ohya and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. Proc. Comp. Vis. and Pattern Rec.