DEPARTMENT OF PHILOSOPHY, LINGUISTICS AND THEORY OF SCIENCE

IMPLEMENTING PERCEPTUAL SEMANTICS IN TYPE THEORY WITH RECORDS (TTR)

Arild Matsson

Master's Thesis: 30 credits
Programme: Master's Programme in Language Technology
Level: Advanced level
Semester and year: Spring, 2018
Supervisors: Simon Dobnik and Staffan Larsson
Examiner: Peter Ljunglöf
Keywords: type theory, image recognition, perceptual semantics, visual question answering, spatial relations, artificial intelligence

Abstract

Type Theory with Records (TTR) provides accounts of a wide range of semantic and linguistic phenomena in a single framework. This work proposes a TTR model of perception and language. Utilizing PyTTR, a Python implementation of TTR, the model is then implemented as an executable script. Compared to pure Python programming, TTR provides a transparent formal specification. The implementation is evaluated in a basic visual question answering (VQA) use case scenario. The results show that an implementation of a TTR model can account for multi-modal knowledge representation and work in a VQA setting.

Acknowledgements

Huge thanks to my supervisors and to Professor Robin Cooper, all of whom have provided significant help through this process.

Contents

1 Introduction
  1.1 Contribution of this thesis
2 Background
  2.1 Image recognition and object detection
  2.2 Computational semantics
    2.2.1 Type theory in linguistics
  2.3 Perceptual semantics
  2.4 Type Theory with Records (TTR)
    2.4.1 Overview of TTR
  2.5 Visual question answering (VQA)
3 Method
  3.1 PyTTR: Programming with TTR
  3.2 Object detection with YOLO
  3.3 Objects and perception with TTR
  3.4 Spatial relations
  3.5 Language and visual question answering (VQA)
4 Results
  4.1 TTR model
    4.1.1 Object detection
    4.1.2 Individuation
    4.1.3 Spatial relation classification
    4.1.4 Beliefs
    4.1.5 Language
    4.1.6 Agent
  4.2 Combining situation types
  4.3 Parsing language to TTR
    4.3.1 Parsing to first-order logic (FOL)
    4.3.2 Translating FOL to TTR
  4.4 The relabel-subtype relation
    4.4.1 Subtype relabeling algorithm
    4.4.2 Restrictions of the algorithm
    4.4.3 Implementation
  4.5 Additions to PyTTR
  4.6 Demonstration
5 Discussion
  5.1 Subtype check
  5.2 PyTTR
  5.3 Inference-first
6 Conclusions
  6.1 Suitability
  6.2 Benefits
  6.3 Can it tell us something about semantics?
  6.4 Future work
A Code

1 Introduction

Having computers understand visual input is desirable in several areas. A domestic assistant robot may use a camera to navigate and identify useful objects in a home. Driverless cars need to be able to read road signs and track other moving vehicles. Web crawlers may extract information from images alongside text on the web.

This kind of understanding involves processing sensory (such as visual) input on a cognitive level. Low-level image processing may include tasks such as prominent color extraction, edge detection and visual pattern recognition. Higher-level processing, however, includes identifying objects, their properties and their relations to each other. This information can then be used for language understanding, reasoning, prediction and other cognitive processes. Making the connection between sensory input and cognitive categories is what concerns the field of perceptual semantics (Pustejovsky, 1990).

Humans use language to communicate information. Thus it is useful to add linguistic capacities to a perceptual system. With vision and language connected, a robot can talk about what it sees, and descriptions can be automatically generated for images found on the web. Image caption generation is indeed a popular task for evaluating computer vision systems. Another is visual question answering (VQA) (Antol et al., 2017), where the system is expected to generate answers to natural-language questions in the context of a given image.

The connection between different modes of information, such as vision and language, requires a model of semantic representation. Formal models such as first-order logic (FOL) have long been the models of choice, but recent developments have seen data-driven approaches excel in some cases. Briefly put, the former kind tends to deliver deep structures of information in narrow domains, while the latter more easily covers wide domains, but with shallow information content (Dobnik & Kelleher, 2017). A recent contribution that combines several branches in formal systems is Type Theory with Records (TTR) (Cooper, 2005a, 2016). TTR is implemented in Python as PyTTR (Cooper, 2017).

1.1 Contribution of this thesis

The main purpose of this thesis is to extend the basic implementation of TTR (PyTTR, Cooper (2017)) to apply it for the first time in a practical task relating vision and language, in particular VQA. The questions that this research raises are:

1. To what degree is (a) TTR as a theoretical framework and (b) its existing practical implementation suited to connect existing vision and language systems?
2. What are the benefits of using TTR this way for (a) vision and language systems and (b) visual question answering?
3. What can connecting vision and language systems tell us about semantics (and TTR)?
To explore these questions, a model will be formulated in TTR and implemented using PyTTR. The model will benefit from building on past proposals; especially relevant are Dobnik & Cooper (2013) and Dobnik & Cooper (2017). As a limitation for the VQA task, the language domain is restricted to polar (yes/no) questions.

The theoretical background is summarized in Section 2. In Section 3, the strategies and techniques used for the implementation are described. The implementation is then presented in Section 4. In Section 5, the results are discussed in relation to the questions above. Finally, conclusions are drawn in Section 6.

2 Background

This section highlights some important past research in the relevant fields.

2.1 Image recognition and object detection

In image recognition, visual data is analyzed in order to detect and classify objects. A wide range of models have been developed to solve this task. Some focus only on the detection of objects (Blaschko & Lampert, 2008), some only on classification (e.g. ResNet, He et al., 2016), and others attempt to solve the full problem in an integrated fashion (Redmon et al., 2016; He et al., 2017).

As in machine learning and other kinds of data processing generally, significant values are collected from the input data (the image) in a process known as feature extraction. Various types of features exist for image processing. For example, scale-invariant feature transform (SIFT) is a technique in which significant locations of an image are used to extract keypoints (Lowe, 1999). Classification can then be performed by comparing the keypoints of a target image to those in a database (see the first code sketch below). With deep neural networks, however, the need for prior feature extraction is generally eliminated (He et al., 2016, 2017).

Neural networks that process image data typically contain convolutional network layers. Color image data is high-dimensional, typically represented as a 2D matrix of pixels, where each pixel itself specifies a quantity of each of three basic colors. Convolutional layers are used to divide the image into smaller, overlapping segments, thus capturing locational aspects of the data. This way, the dimensionality can be reduced to a single, one-dimensional vector (see the second code sketch below).

2.2 Computational semantics

Semantics is the study of meaning. Computational semantics is concerned with how to represent meaning digitally, and with using such representations to perform semantic parsing, inference and other tasks (Blackburn & Bos, 2003). A well-established and largely capable formalism for expressing and operating on propositions is first-order logic (FOL). In FOL, the
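As an illustration of the keypoint-based approach described in Section 2.1, the following is a minimal sketch of SIFT extraction and matching. It is not part of the thesis implementation; the use of the OpenCV library is an assumption for illustration, and the image file names are placeholders.

```python
# Sketch: SIFT keypoint extraction and matching (Lowe, 1999) with OpenCV.
# Requires opencv-python >= 4.4; the file names below are placeholders.
import cv2

query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
stored = cv2.imread("stored.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()

# Detect significant locations (keypoints) and compute a descriptor for each.
kp_q, desc_q = sift.detectAndCompute(query, None)
kp_s, desc_s = sift.detectAndCompute(stored, None)

# Compare the target image's keypoints to those of a stored image: for each
# query descriptor, find its two nearest stored descriptors and keep the
# match only if it is clearly better than the runner-up (Lowe's ratio test).
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(desc_q, desc_s, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

print(f"{len(good)} good matches out of {len(matches)}")
```

Classifying a target image then amounts to repeating this comparison against each database image and picking the class with the most good matches.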
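The convolutional dimensionality reduction described in Section 2.1 can likewise be made concrete with a toy sketch. This example uses a single random filter in place of learned weights; a real network stacks many such filters with nonlinearities and pooling between them.

```python
# Toy sketch: how a convolutional layer maps a 2D color image toward a
# single one-dimensional feature vector. A random 5x5 filter stands in
# for learned weights.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # height x width x 3 basic colors
kernel = rng.random((5, 5, 3))    # one convolutional filter

# Slide the filter over small, overlapping segments of the image,
# capturing locational aspects of the data.
out_h, out_w = image.shape[0] - 4, image.shape[1] - 4
feature_map = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        segment = image[i:i + 5, j:j + 5, :]       # local 5x5 patch
        feature_map[i, j] = (segment * kernel).sum()

# Flattening the resulting feature map(s) yields the one-dimensional
# vector that later network layers operate on.
feature_vector = feature_map.reshape(-1)
print(feature_vector.shape)       # (784,), i.e. 28 * 28
```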