Distributional-Relational Models: Scalable Semantics for Databases

Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches: Papers from the 2015 AAAI Spring Symposium

André Freitas (1,2), Siegfried Handschuh (1), Edward Curry (2)
(1) Faculty of Computer Science and Mathematics, University of Passau
(2) Insight Centre for Data Analytics, National University of Ireland, Galway

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The crisp/brittle semantic model behind databases limits the scale at which data consumers can query, explore, integrate and process structured data. Approaches that aim to provide more comprehensive semantic models for databases and are purely logic-based (e.g. as in Semantic Web databases) have major scalability limitations in the acquisition of structured semantic and commonsense data. This work describes a complementary semantic model for databases which has semantic approximation at its center. This model uses distributional semantic models (DSMs) to extend structured data semantics. DSMs support the automatic construction of semantic and commonsense models from large-scale unstructured text and provide a simple model to analyze similarities in the structured data. The combination of distributional and structured data semantics provides a simple and promising solution to address the challenges associated with the interaction and processing of structured data.

Introduction

Data consumers querying, exploring, integrating or analyzing data today need to go through the process of mapping their own conceptualization to the identifiers of database elements. The requirement of a perfect symbolic and syntactic match in the database interaction process forces the user to perform, at query construction time, a time-consuming alignment between the information need and the database symbols. With the growth of the symbolic space associated with contemporary databases, manual alignment to the database symbolic space becomes infeasible and restrictive.

Automatic semantic approximation between the data consumer's information needs and database elements is a central operation for data querying, exploration, integration and analysis. However, effective semantic approximation is heavily dependent on the construction of comprehensive semantic/commonsense knowledge bases. While different semantic approaches based on logical frameworks have been proposed, such as Semantic Web databases, these approaches are limited in addressing the trade-off between providing an expressive semantic representation and the ability to acquire comprehensive knowledge bases under that representation model. Logical frameworks are highly sensitive to consistency and performance problems, which emerge in large-scale knowledge bases.

This work proposes the use of distributional semantic models (DSMs) to address these limitations. The simplification of the semantic representation in DSMs facilitates the construction of large-scale and comprehensive semantic/commonsense knowledge bases, which can be used to support effective semantic approximations for databases. Distributional semantics provides a complementary perspective to the formal perspective of database semantics, one which supports semantic approximation as a first-class database operation.

A distributional semantics approach implies extending the formal database semantics with a distributional semantic layer. In the hybrid model, the crisp semantics of query terms and database elements are extended and grounded over a distributional semantic model (Figure 1). The distributional layer can be used to abstract the database user from the specific conceptualization of the data.

Distributional Semantics

Distributional semantics is built upon the assumption that the context surrounding a given word in a text provides important information about its meaning (Harris 1954; Turney and Pantel 2010). Distributional semantics focuses on the construction of a semantic representation of a word based on the statistical distribution of word co-occurrence in unstructured data. The availability of high-volume and comprehensive Web corpora has made distributional semantic models a promising approach to build and represent meaning at scale.

One of the major strengths of distributional models is acquisition: a semantic model can be built automatically from large unstructured text. In Distributional Semantic Models (DSMs) the meaning of a word is represented by a weighted vector, which can be built automatically from contextual co-occurrence information in unstructured data. The distributional hypothesis (Harris 1954) assumes that the local context in which a term occurs can serve as discriminative semantic features which represent the meaning of the term. While this simplification makes distributional semantics a coarse-grained semantic model, not suitable for all tasks, the scale at which semantic knowledge and associations can be captured makes DSMs effective models for calculating semantic approximations, a fact which is supported by empirical evidence (Gabrilovich and Markovitch 2007).

DSMs are represented as a vector space model, where each dimension represents a context pattern for the linguistic or data context in which the target term occurs. A context can be defined using documents, data tuples, co-occurrence window sizes (number of neighboring words) or syntactic features. The distributional interpretation of a target term is defined by a weighted vector of the contexts in which the term occurs, defining a geometric interpretation under a distributional vector space. The weights associated with the vectors are defined using an associated weighting scheme W, which calibrates the relevance of more generic or more discriminative contexts. The semantic relatedness measure s between two words is calculated using similarity/distance measures such as cosine similarity, Euclidean distance or mutual information, among others. As the dimensionality of the distributional space grows, dimensionality reduction approaches d can be applied.

Figure 1: Depiction of distributional relations, contexts and different representation views for distributional semantics.

Distributional-Relational Models (DRMs)

The semantics of a database element e (e.g. constants, predicates) is represented by the set of natural language descriptors associated with it. This set typically does not include concept associations outside the scope of the specific task that the database was designed to address, which limits its use for semantic approximation of concepts outside the designed database representation. Semantic approximation is a fundamental operation for supporting schema-agnosticism (Freitas, Silva, and Curry 2014), i.e. the ability to interact with a database without a precise understanding of the conceptual model behind it.

In this work, the formal semantics of a database symbol is extended with a distributional semantics description, which captures the large-scale symbolic associations within a large reference corpus. The distributional semantics representation captures large-scale semantic, commonsense and domain-specific knowledge, and uses it in the semantic approximation process between a third-party information need and the database (Figure 3). The hybrid distributional-structured model is called the Distributional-Relational Model (DRM). A DRM embeds the structure defined by relational models in a distributional vector space, where every entity and relationship has an associated vector representation. The distributional associational information embedded [...] or external (a separate reference corpus); F is a map which translates the elements $e_i \in E$ into vectors $\vec{e}_i$ in the distributional vector space $VS^{DSM}$ using the natural language descriptor of $e_i$; and H is the set of thresholds above which two terms are semantically equivalent.

Definition (Distributional-Relational Model (DRM)): A distributional-relational model is a tuple (DSM, DB, RC, F, H) such that:
• DSM is the associated distributional semantic model.
• DB is a structured dataset with DB elements E and tuples T.
• RC is the reference corpus, which can be unstructured, structured or both. The reference corpus can be internal (based on the co-occurrence of elements within the DB) or external (a separate reference corpus).
• F is a map which translates the elements $e_i \in E$ into vectors $\vec{e}_i$ in the distributional vector space $VS^{DSM}$ using the string of $e_i$ and the data model category of $e_i$.
• H is the set of semantic thresholds for the distributional semantic relatedness s: two terms are considered semantically equivalent if their relatedness is equal to or above the threshold.

In this work we assume a simplified data model with a signature $\Sigma = (P, C)$ formed by a pair of finite sets of symbols used to represent binary and unary predicates $p \in P$
