
Attorney Docket No.: RAM.001P

UNSUPERVISED AND EXPLAINABLE NATURAL LANGUAGE GENERATION USING LINK GRAMMAR FOR GENERAL CONVERSATIONAL INTELLIGENCE

TECHNICAL FIELD

The present technology relates to artificial general intelligence (AGI) and natural language processing (NLP).

BACKGROUND

Currently, the fields of AGI and NLP have little overlap, with few existing AGI architectures capable of comprehending natural language and nearly all NLP systems founded upon specialized rules and frameworks that are not generalizable to the various complex domains of human language [1]. Furthermore, most NLP and AGI frameworks that do incorporate some form of natural language understanding are unable to convey knowledge in a form similar to human language without manual, labor-intensive methods such as template-based customization.

DETAILED DESCRIPTION

Many current AGI and NLP architectures do not possess general conversational intelligence—that is, they either do not deal with language or are unable to convey knowledge in a form similar to human language without manual, labor-intensive methods such as template-based customization. The present technology is directed to a new technique to automatically generate grammatically valid sentences using the Link Grammar database. This natural language generation method may serve as the final component in a proto-AGI pipeline that handles natural language material in an understandable and unsupervised manner.

I. Introduction

A. Artificial general intelligence and natural language processing

Currently, the fields of AGI and NLP have little overlap, with few existing AGI architectures capable of comprehending natural language and nearly all NLP systems founded upon specialized rules and frameworks that are not generalizable to the various complex domains of human language [1]. Furthermore, most NLP and AGI frameworks that do incorporate some form of natural language understanding are unable to convey knowledge in a form similar to human language without manual, labor-intensive methods such as template-based customization.

B. Motivations

1. Explainable AI and understandable language processing

Explainable AI (XAI) or interpretable AI refers to methods and techniques in the application of artificial intelligence (AI) such that both the AI model and its results can be understood by humans. It contrasts with the concept of “black box” algorithms in machine learning, where even developers cannot explain how the AI arrived at a specific decision. In this work, the inventors present the idea of understandable language processing (ULP), which is an extension of the XAI concept to NLP; ULP is expected to allow for acquisition of natural language, comprehension of textual communications, and production of textual messages in a reasonable and transparent way. The NLG architecture of the present technology is configured to serve as a ULP method that allows for the generation of grammatically valid sentences in an interpretable way. The work of the inventors of the present technology provides not only explainable results but also an explainable model for sentence generation, since it relies on Link Grammar, which is itself explainable.

2. Unsupervised language learning

Current methods of grammar learning, such as deep neural networks (DNNs), require extensive supervised training on large corpora. However, humans are capable of acquiring explainable and reasonable rules for building sentences from words based on grammatical rules and conversational patterns, and of clearly understanding the grammatical and semantic categories of words. This raises the idea of unsupervised language learning (ULL), which enables acquisition of language grammar from unlabeled text corpora programmatically in an unsupervised way. In a ULL system, the learned knowledge is stored in a human-readable and reasonable representation [2]. Examples of ULL systems include the OpenCog AI platform’s cognitive pipeline that enables unsupervised grammar learning [2] and Glushchenko et al.’s grammar induction technology that learns grammar from only a single input corpus in an unsupervised way [3].

The ULL pipeline consists of five main components: a text pre-cleaner (which preprocesses corpus files with configurable cleanup and normalization options), a sense pre-disambiguator (which performs word disambiguation and builds senses from tokens), a text parser (which parses sentences into word tokens or senses), a grammar learner (which learns word categories and rules from parses), and a tester that evaluates the quality of the inferred grammar. The ultimate output of the ULL pipeline is a model of the human language; one such model created by the ULL pipeline is the Link Grammar dictionary itself, which serves as the knowledge base behind the NLG architecture of the present technology [4].

3. Question answering

Question answering is a computer science discipline within the fields of information retrieval and natural language processing, which is concerned with building systems that automatically answer questions posed by humans in a natural language. An explainable question answering pipeline would consist of two main components: natural language comprehension and natural language generation. Natural language comprehension involves first parsing the question (otherwise known as the input query) based on a formal grammar such as Link Grammar, and then performing semantic interpretation (extracting the concept represented by the query and determining relationships between individual parts of the parsed query obtained in the previous step). The natural language generation component then involves semantic query execution (determining the answer to the input question based on the semantic relationships extracted in the previous step) and the use of a formal grammar to construct grammatically valid sentences from the words associated with the semantic relationships derived during query execution.

The present technology is concerned with the final part of the question answering pipeline; with Link Grammar serving as the requisite formal grammar, the proposed NLG architecture automates sentence construction from the words derived in the penultimate part of the pipeline and thereby enables question answering with minimal to no human interference.

C. Link Grammar

Link Grammar is founded upon the idea that each word possesses a feature structure consisting of a set of typed connectors [4]. Generation involves matching up connectors from one word with connectors from another. Consider the sentence, “The cat caught a mouse.” The Link Grammar parse structure for this sentence is shown in FIG. 1.


The “Link Grammar dictionary” is a database that contains connectors associated with all common words. The notation used to describe feature structures in this dictionary is quite simple. Different types of connectors are denoted by letters or pairs of letters like S or SX. For instance, if a word A has the connector S+, this means that the word can have an S link coming out to the right side. If a word B has the connector S-, this means that the word can have an S link coming out to the left side. In this case, if A occurs to the left of B in a sentence, then the two words can be joined together with an S link [1].
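For illustration, the following minimal Java sketch models the connector matching rule just described; the class and method names are illustrative assumptions and do not reflect the actual implementation of the present technology:

```java
// Illustrative sketch only: hypothetical names, not the actual Loader/Generator API.
public class ConnectorDemo {
    // A typed connector such as S+ or S-: a type letter (or pair of letters)
    // plus a direction, '+' (link to the right) or '-' (link to the left).
    record Connector(String type, char direction) {}

    // Words A and B (A to the left of B) can be joined by a link when A has a
    // rightward connector and B has a leftward connector of the same type.
    static boolean canLink(Connector fromLeftWord, Connector fromRightWord) {
        return fromLeftWord.direction() == '+'
                && fromRightWord.direction() == '-'
                && fromLeftWord.type().equals(fromRightWord.type());
    }

    public static void main(String[] args) {
        Connector a = new Connector("S", '+'); // word A carries S+
        Connector b = new Connector("S", '-'); // word B carries S-
        System.out.println(canLink(a, b));     // true: A may precede B via an S link
    }
}
```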

Disjuncts are sets of connectors that constitute the legal use of a given word. For instance, if word A has the expression {C-} & (X+ or Y+), where {} denotes optionality, then A has the following four disjuncts: C- X+, X+, C- Y+, Y+.
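The enumeration of disjuncts from such an expression can be viewed as a cross product over the alternatives of each conjunct, as the following Java sketch suggests (the representation is an assumption made for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

public class DisjunctDemo {
    public static void main(String[] args) {
        // Each inner list holds the alternatives of one conjunct of
        // {C-} & (X+ or Y+): the optional {C-} is either present or absent,
        // and (X+ or Y+) contributes exactly one of its connectors.
        List<List<String>> conjuncts = List.of(
                List.of("C-", ""),
                List.of("X+", "Y+"));

        // The cross product of the alternatives yields the four disjuncts.
        List<String> disjuncts = new ArrayList<>(List.of(""));
        for (List<String> alternatives : conjuncts) {
            List<String> next = new ArrayList<>();
            for (String partial : disjuncts)
                for (String alternative : alternatives)
                    next.add((partial + " " + alternative).trim());
            disjuncts = next;
        }
        System.out.println(disjuncts); // [C- X+, C- Y+, X+, Y+]
    }
}
```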

Macros are single symbols that define large connector expressions. For instance, let a macro (denoted here as <M> for illustration) define the expression {C-} & (X+ or Y+) as above. Then, the expression <M> or {W+ & Z-} is simply shorthand for ({C-} & (X+ or Y+)) or {W+ & Z-}. Macros are used throughout the Link Grammar dictionary to reduce redundancy of connector expressions and increase readability.
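Macro expansion itself amounts to simple textual substitution, as the following illustrative Java snippet shows (the macro symbol <M> is the hypothetical placeholder used above):

```java
import java.util.Map;

public class MacroDemo {
    public static void main(String[] args) {
        // The macro symbol (here <M>) maps to its parenthesized definition.
        Map<String, String> macros = Map.of("<M>", "({C-} & (X+ or Y+))");

        // Expanding a macro is a textual substitution into the expression.
        String expression = "<M> or {W+ & Z-}";
        for (Map.Entry<String, String> macro : macros.entrySet())
            expression = expression.replace(macro.getKey(), macro.getValue());

        System.out.println(expression); // ({C-} & (X+ or Y+)) or {W+ & Z-}
    }
}
```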

The rules of Link Grammar impose additional constraints beyond the matching of connectors, namely the planarity and connectivity metarules. Planarity means that links do not cross, while connectivity means that the links and words of a sentence must form a connected graph—all the words must be linked to the other words in the sentence via some path.
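The planarity metarule can be checked mechanically over word positions. The sketch below is a minimal illustration under the assumption that links are represented as pairs of word indices; it is not the implementation of the present technology, and the example links are an assumed rendering of the FIG. 1 parse:

```java
import java.util.List;

public class PlanarityDemo {
    record Link(int left, int right) {} // endpoints as word positions, left < right

    // Two links cross when exactly one endpoint of q lies strictly inside
    // the span of p; links that share an endpoint never cross.
    static boolean crosses(Link p, Link q) {
        if (p.left() == q.left() || p.left() == q.right()
                || p.right() == q.left() || p.right() == q.right()) return false;
        boolean leftInside = p.left() < q.left() && q.left() < p.right();
        boolean rightInside = p.left() < q.right() && q.right() < p.right();
        return leftInside != rightInside;
    }

    static boolean isPlanar(List<Link> links) {
        for (int i = 0; i < links.size(); i++)
            for (int j = i + 1; j < links.size(); j++)
                if (crosses(links.get(i), links.get(j))) return false;
        return true;
    }

    public static void main(String[] args) {
        // "The cat caught a mouse" (word positions 0..4), with links
        // the-cat, cat-caught, caught-mouse, and a-mouse.
        List<Link> parse = List.of(
                new Link(0, 1), new Link(1, 2), new Link(2, 4), new Link(3, 4));
        System.out.println(isPlanar(parse)); // true: no links cross
    }
}
```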

The architecture of the present technology utilizes the English Link Grammar dictionaries to perform natural language generation.

1. Why Link Grammar

Besides Link Grammar, other grammar rule dictionaries and APIs exist, such as spaCy (https://spacy.io) and Universal Dependencies (https://universaldependencies.org/introduction.html). Solutions such as spaCy are programming language-specific APIs or packages; for instance, spaCy restricts its grammar database to applications developed using Python. Universal Dependencies (UD) and its alternatives, while downloadable and not restricted to certain programming languages, require knowledge of each word’s part of speech. For instance, the UD grammar rules for the sentence “The dog was chased by the cat.” are as shown in FIG. 2.

Link Grammar not only provides flexibility across multiple programming languages (the architecture of the present technology can be implemented in Java) but also is relatively simple; unlike grammar databases such as UD that include specific rules based on part of speech, Link Grammar solely focuses on links between words and has interpretable metarules. This explainable nature of Link Grammar allows the grammar loading architecture of the present technology to better serve as an XAI method for the purposes of the NLG task.

D. Natural language generation

Natural language generation refers to the process of mapping some representation of data or information (usually non-linguistic) to a linguistic (grammatically and morphologically valid) representation of that information. The task of generating even simple sentences demands a large body of knowledge (grammatical, syntactical, morphological, phonological, etc.). In some embodiments, the architecture of the present technology is configured to build this knowledge base to develop an integrated representation of semantic and grammatical knowledge using an extended Link Grammar schema. In some embodiments, the NLG architecture of the present technology is concerned only with generating sentences using the basic Link Grammar database in its current form.

The architecture of the present technology focuses on the surface realization component of NLG. Surface realization refers to a determination of how the underlying content of a text, in the form of a list of words, should be mapped into a sequence of grammatically correct sentences [5]. Often, the same non-linguistic content can be expressed in various sentential forms; an NLG system must first determine which of these sentential forms is most appropriate, and second ensure that the resulting sentence is syntactically and morphologically correct.

E. Prior Work

When application domains are small and variation is expected to be minimal, sentence generation is a relatively easy task, and outputs can be specified using templates (e.g., Reiter et al., 1995; McRoy et al., 2003). While templates allow for full control over the quality of the output and avoid the generation of ungrammatical structures, they are labor-intensive if constructed by hand (though templates have recently been learned automatically from corpus data, see Angeli et al., 2012 and Kondadadi et al., 2013). Most importantly, templates do not scale well to applications which require considerable linguistic variation [6].

The KPML system (Bateman, 1997) is based on Systemic-Functional Grammar (Halliday & Matthiessen, 2004), and surface realization is modeled as a traversal of a network in which choices depend on both grammatical and semantic-pragmatic information. This level of detail makes these systems difficult to use as simple “plug-and-play” or “off the shelf” modules (e.g., Kasper, 1989).

Lian et al. proposed a natural language generation system via satisfaction of the constraints posed by inverse relations of hypergraph homomorphisms. To perform surface realization, the proposed OpenCog NLGen software uses the SegSim approach to take an Atom set in need of linguistic expression and match its subsets against a data-store of (sentence, link parse, RelEx relationship set, Atom set) tuples, produced via applying OpenCog’s NL comprehension tools to a corpus of sentences. Via this matching, it determines which syntactic structures have been previously used to produce relevant Atom subsets. It then pieces together the syntactic structures found to correspond to its subsets, thereby forming overall syntactic structures corresponding to one or more sentences. The sentence is solved for as a constraint satisfaction problem from the Atom set semantics [1]. While SegSim processes words unproblematically for relatively simple sentences, it becomes unreliable for sentences involving conjunctions or other complex syntactic forms.

Ratnaparkhi proposed three trainable systems for surface natural language generation. The first two systems, called NLG1 and NLG2, require a corpus marked only with domain-specific semantic attributes, while the last system, called NLG3, requires a corpus marked with both semantic attributes and syntactic dependency information. All systems attempt to produce a grammatical natural language phrase from a domain-specific semantic representation. NLG1 serves as a baseline system and uses phrase frequencies to generate a whole phrase in one step, while NLG2 and NLG3 use maximum entropy probability models to individually generate each word in the phrase. The systems NLG2 and NLG3 learn to determine both the word choice and the ordering of the phrase [7]. Not only are Ratnaparkhi’s black box NLG systems inflexible and restricted to certain domain-specific grammar representations, but, as supervised models, they require labeled training data, thus straying from the goal of building an NLG system that satisfies the concepts of ULL and ULP.

Wen et al. proposed a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. The LSTM generator can learn from unaligned data by jointly optimizing sentence planning and surface realization using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates [8]. However, like Ratnaparkhi’s work, the LSTM structure inherently requires large amounts of labeled training data and is a black box algorithm, and hence is neither unsupervised nor interpretable.

II. Methodology

The NLG architecture of the present technology consists of two main components: the Loader, and the Generator. The Loader is simply a utility program used by the Generator to store Link Grammar in memory; it is not specific to the NLG architecture of the present technology but rather a tool for loading Link Grammar that can be used by any application.

A. Loader

The Loader loads and stores the Link Grammar dictionary in various classes for future reference. Namely, the Dictionary class stores a list of Word objects; each Word object contains a unique Rule object; a Rule object contains a list of Disjunct objects; and each Disjunct object stores a list of connectors that constitute the legal use of the given word. Currently, the Loader architecture only supports English and does not handle complex structures such as those needed to support languages that require heavy morphology usage (one such language being Russian).
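The class hierarchy described above may be pictured as in the following Java sketch; the class roles (Dictionary, Word, Rule, Disjunct) come from the description, while the field names and representation are assumptions made for illustration:

```java
import java.util.List;

public class LoaderModelDemo {
    // Class roles come from the description above; field names and the
    // string representation of connectors are assumptions for illustration.
    record Disjunct(List<String> connectors) {} // one legal use of a word
    record Rule(List<Disjunct> disjuncts) {}    // all legal uses of a word
    record Word(String text, Rule rule) {}      // a word and its unique Rule
    record Dictionary(List<Word> words) {}      // the loaded Link Grammar

    public static void main(String[] args) {
        // Hypothetical entry: "the" with a single disjunct D+ (a determiner
        // link to the right).
        Word the = new Word("the", new Rule(List.of(new Disjunct(List.of("D+")))));
        Dictionary dictionary = new Dictionary(List.of(the));
        System.out.println(dictionary.words().get(0).rule().disjuncts());
    }
}
```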

The core of the Loader is the makeDict() function, which converts an array of lines (obtained from the Link Grammar database) into an array of Dictionary objects containing the rules for all common words and phrases that Link Grammar provides support for. The basic framework of makeDict() is as shown in FIG. 3.

As shown in the above algorithm, makeDict() parses the lines of the Link Grammar database to assign rules (within which all macros are expanded) to the words that they define. After splitting the contents of the Link Grammar database file, the Loader calls makeDict() to obtain the Dictionaries that store the Link Grammar rules for use in the Generator.
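While the actual makeDict() framework is the one shown in FIG. 3, the following simplified Java sketch conveys the gist, assuming each dictionary entry takes the form “word1 word2 ...: EXPRESSION;” (macro expansion and multi-line entries, which the real routine handles, are omitted, and the example entries are invented):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MakeDictSketch {
    // Simplified stand-in for makeDict(): maps each word of an entry
    // "word1 word2 ...: EXPRESSION;" to its rule expression.
    static Map<String, String> makeDict(String[] lines) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0 || !line.trim().endsWith(";")) continue; // not an entry
            String expression = line.substring(colon + 1, line.lastIndexOf(';')).trim();
            for (String word : line.substring(0, colon).trim().split("\\s+"))
                rules.put(word, expression); // macro expansion omitted in this sketch
        }
        return rules;
    }

    public static void main(String[] args) {
        String[] lines = {"a the: D+;", "cat mouse: {D-} & (S+ or O-);"};
        System.out.println(makeDict(lines));
        // {a=D+, the=D+, cat={D-} & (S+ or O-), mouse={D-} & (S+ or O-)}
    }
}
```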

B. Generator

After obtaining the loaded Dictionary objects from the Loader, the Generator determines what sentences can be formed from a given list of words via valid Link Grammar rules. First, given a list of words, the Generator determines a subset of all orderings of those words that satisfies initial checks of the planarity and connectivity metarules (for example, one partial connectivity check that the Generator implements involves ensuring that the first and last words are capable of forming links to the right and left, respectively). Then, for each ordering in the subset, the Generator determines if that ordering is valid; specifically, it ensures that every pair of consecutive words (with certain exceptions as discussed below) can be connected via links that are part of the Dictionary objects. To do so, the Generator uses the connects() function, which returns a boolean value indicating whether its two parameters left and right can be linked together, as shown in FIG. 4.

As shown in the above algorithm, connects() first obtains the lists of rules leftList and rightList corresponding to left and right, and then checks if any Disjunct in any Rule in leftList matches with any Disjunct in any Rule in rightList.
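As a simplified illustration of this check (the authoritative version being FIG. 4), the sketch below flattens each word’s rules into lists of disjuncts, where each disjunct is a list of connector strings such as “D+”; connector subscripts and the Rule level are omitted for brevity:

```java
import java.util.List;

public class ConnectsSketch {
    // Simplified stand-in for connects(): two words can be linked when some
    // disjunct of the left word has a rightward connector whose type matches
    // a leftward connector in some disjunct of the right word.
    static boolean connects(List<List<String>> left, List<List<String>> right) {
        for (List<String> leftDisjunct : left)
            for (String lc : leftDisjunct)
                if (lc.endsWith("+"))
                    for (List<String> rightDisjunct : right)
                        for (String rc : rightDisjunct)
                            if (rc.endsWith("-") && lc.substring(0, lc.length() - 1)
                                    .equals(rc.substring(0, rc.length() - 1)))
                                return true;
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> the = List.of(List.of("D+"));       // "the": D+
        List<List<String>> cat = List.of(List.of("D-", "S+")); // "cat": D- & S+
        System.out.println(connects(the, cat)); // true: linked via a D link
    }
}
```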

However, connects() is not always applicable. For instance, when the word “a” is present in the phrase “is a human,” the links are not “is” → “a” and “a” → “human” but rather “is” → “human” and “a” → “human,” as shown in the Link Grammar parse of FIG. 5.

Hence, functions similar to connects() are implemented that deal with specific cases that involve links not between consecutive words but rather from the far left or far right words to each of the other words in a set of three or more consecutive words.

III. Results

The algorithm of the present technology was tested on two distinct corpora in a total of three different tests. The inventors found that the accuracy of their results was affected primarily by the issue of grammatical ambiguity, which refers to situations in which the same word may have different roles in a sentence (i.e., a noun can be either a subject or an object) or may represent different parts of speech (such as the word “saw” in its verb and noun forms). Subject-object ambiguity, a specific case of grammatical ambiguity, refers to the potential interchangeability of the subject and object in a sentence. For instance, consider the sentence “The cat caught a mouse” as referenced in Section I.C. Both “The cat caught a mouse” and “The mouse caught a cat” are grammatically valid sentences that can be constructed from the five words “the,” “cat,” “caught,” “a,” and “mouse”; the presence of these extra sentences that are grammatically valid yet contextually wrong skewed the accuracy of the algorithm of the present technology.

Grammatical disambiguation—a solution to the grammatical ambiguity problem—refers to the process of determining which sentential form is most appropriate to represent some non-linguistic content; for instance, the sentence “The cat caught a mouse” is likely more contextually valid than “The mouse caught a cat.” Semantic (word sense) disambiguation refers to the process of determining which “sense” or meaning of a word is activated by the use of the word in a particular context; for instance, the word “saw,” when used in the sentence “Mom saw a bird,” will have a different sense—and thus different Link Grammar rules—than when it is used in the sentence “The lumberjack is holding a saw.” In some embodiments, the algorithms and/or architecture of the present technology are configured to implement grammatical and semantic disambiguation.

The algorithm of the present technology was primarily tested on 92 sentences with words all part of SingularityNET’s “small world” corpus (http://langlearn.singularitynet.io/data/poc-english/poc_english.txt). For this purpose, the inventors have used the Link Grammar dictionary (automatically inferred from high quality Link Grammar parses created by SingularityNET’s ULL pipeline) containing 42 total words and 5 total word clusters (http://langlearn.singularitynet.io//test/nlp/poc-english_5C_2018-06-06_0004.4.0.dict.txt), as shown in FIG. 6.

When tested on the same 92 sentences while using the complete Link Grammar dictionary for English (https://github.com/opencog/link-grammar/tree/master/data/en), the algorithm achieved the results shown in FIG. 7.

The decreased value of the “Single correct generated sentence” metric and increased value of the “Multiple sentences with one correct” metric in the second test as compared to the first test is a direct result of the increased grammatical and semantic ambiguity caused by using Link Grammar instead of SingularityNET’s “small world” grammar. Since the “small world” grammar was created from the “small world” corpus itself, each of the words in the corpus contains no other grammatical or semantic contexts (and thus no other sets of grammar rules) besides those required to form the 92 sentences that the algorithm was tested on. However, since the Link Grammar database is much larger and contains many more “senses” for each word, there is a greater number of valid sentences that can be constructed from those words, thereby causing a majority of sentences previously classified as “Single correct generated sentence” in the first test to instead satisfy the “Multiple sentences with one correct” metric in the second test.

The NLG architecture of the present technology was also tested on 54 sentences from Charles Keller’s production of Lucy Maud Montgomery’s “Anne’s House of Dreams” as found in the Gutenberg Children corpus (http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-ChildrenBooks/capital/pg544.txt), and performed as shown in FIG. 8.

IV. Conclusion

The NLG architecture of the present technology, when paired with the requisite task-specific NLP algorithms, may be used in a variety of scenarios. It can primarily be applied to the question answering problem as discussed in Section I.B.3; one such application is the Aigents Social Media Intelligence Platform [9]. Currently, the Aigents framework relies on artificially designed controlled language resembling oversimplified “pidgin” English; the NLG algorithm of the present technology can be integrated into the Aigents cognitive architecture to provide Aigents with full conversational intelligence [10]. In this scenario, natural language text would be produced at a quality higher than that provided by Aigents’ current chatbots while also being explainable.

Another prominent subdomain of question answering in which the NLG architecture of the present technology can be utilized is that of virtual assistant AI technologies such as Alexa and Google Home. The proposed NLG system, if integrated into these technologies in conjunction with the other parts of the question answering pipeline, would serve as an understandable and unsupervised rather than black box NLP framework.

In some embodiments, the present technology is configured to: 1) implement grammatical and semantic disambiguation based on the context of a sentence or body of text; and/or 2) extend the algorithm’s generation capabilities to languages other than English (including those that require heavy morphology usage).


The descriptions of embodiments of the technology are not intended to be exhaustive or to limit the technology to the precise form disclosed above. Where the context permits, singular or plural terms may also include the plural or singular term, respectively. Although specific embodiments of, and examples for, the technology are described above for illustrative purposes, various equivalent modifications are possible within the scope of the technology, as those skilled in the relevant art will recognize. For example, while steps are presented in a given order, alternative embodiments may perform steps in a different order. The various embodiments described herein may also be combined to provide further embodiments.

As used herein, the terms “generally,” “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent variations in measured or calculated values that would be recognized by those of ordinary skill in the art.

Moreover, unless the word “or” is expressly limited to mean only a single item exclusive from the other items in reference to a list of two or more items, then the use of “or” in such a list is to be interpreted as including (a) any single item in the list, (b) all of the items in the list, or (c) any combination of the items in the list. Additionally, the term "comprising" is used throughout to mean including at least the recited feature(s) such that any greater number of the same feature and/or additional types of other features are not precluded. It will also be appreciated that specific embodiments have been described herein for purposes of illustration, but that various modifications may be made without deviating from the technology. Further, while advantages associated with certain embodiments of the technology have been described in the context of those embodiments, other embodiments may also exhibit such advantages, and not all embodiments need necessarily exhibit such advantages to fall within the scope of the technology. Accordingly, the disclosure and associated technology can encompass other embodiments not expressly shown or described herein.

V. Code Availability

The NLG architecture of the present technology is open-source and available under the MIT License (a permissive, limited-restriction license) on GitHub at https://github.com/aigents/aigents-java-nlp.


VI. References

[1] R. Lian, et al., “Syntax-Semantic Mapping for General Intelligence: Language Comprehension as Hypergraph Homomorphism, Language Generation as Constraint Satisfaction,” International Conference on Artificial General Intelligence, Springer, Berlin, Heidelberg, pp. 158-167, December 2012.

[2] A. Glushchenko, et al., “Unsupervised Language Learning in OpenCog,” 11th International Conference on Artificial General Intelligence, AGI 2018, Prague, Czech Republic, pp. 109–118, July 2018.

[3] A. Glushchenko, et al., “Programmatic Link Grammar Induction for Unsupervised Language Learning,” Artificial General Intelligence, Springer, Berlin, Heidelberg, pp. 111-120, July 2019.

[4] D. Sleator and D. Temperley, “Parsing English with a Link Grammar,” in Proceedings of the Third International Workshop on Parsing Technologies. Association for Computational Linguistics, 1993, pp. 277–292.

[5] C. Mellish and R. Dale, “Evaluation in the context of natural language generation,” Computer Speech and Language, vol. 10, no. 2, pp. 349–373, October 1998.

[6] A. Gatt and E. Krahmer, “Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation,” Journal of AI Research (JAIR), vol. 61, pp. 75-170, March 2018.

[7] A. Ratnaparkhi, “Trainable Methods for Surface Natural Language Generation,” arXiv Computing Research Repository (CoRR), pp. 194–201, June 2000.

[8] T. Wen, et al., “Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1711– 1721, August 2015.

[9] A. Kolonin, “Personal Analytics for Societies and Businesses: With Aigents Online Platform,” 2017 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON), Novosibirsk, pp. 272-275, September 2017.


[10] A. Kolonin, “Controlled Language and Baby Turing Test for General Conversational Intelligence,” arXiv:2005.09280 [cs.AI], May 2020.
