Guiding Inference with Policy Search Reinforcement Learning

Matthew E. Taylor
Department of Computer Sciences
The University of Texas at Austin
Austin, TX 78712-1188
[email protected]

Cynthia Matuszek, Pace Reagan Smith, and Michael Witbrock
Cycorp, Inc.
3721 Executive Center Drive
Austin, TX 78731
{cynthia,pace,witbrock}@cyc.com

Abstract

Symbolic reasoning is a well understood and effective approach to handling reasoning over formally represented knowledge; however, simple symbolic inference systems necessarily slow as complexity and ground facts grow. As automated approaches to ontology-building become more prevalent and sophisticated, knowledge base systems become larger and more complex, necessitating techniques for faster inference. This work uses reinforcement learning, a statistical machine learning technique, to learn control laws which guide inference. We implement our learning method in ResearchCyc, a very large knowledge base with millions of assertions. A large set of test queries, some of which require tens of thousands of inference steps to answer, can be answered faster after training over an independent set of training queries. Furthermore, this learned inference module outperforms ResearchCyc's integrated inference module, a module that has been hand-tuned with considerable effort.

Introduction

Logical reasoning systems have a long history of success and have proven to be powerful tools for assisting in solving certain classes of problems. However, those successes are limited by the computational complexity of naïve inference methods, which may grow exponentially with the number of inferential rules and the amount of available background knowledge. As automated learning and knowledge acquisition techniques (e.g., Etzioni et al. 2004; Matuszek et al. 2005) make very large knowledge bases available, performing inference efficiently over large amounts of knowledge becomes progressively more crucial. This paper demonstrates that it is possible to learn to guide inference efficiently via reinforcement learning, a popular statistical machine learning technique.

Reinforcement learning (RL) (Sutton & Barto 1998) is a general machine learning technique that has enjoyed success in many domains. An RL task is typically framed as an agent interacting with an unknown (or under-specified) environment. Over time, the agent attempts to learn when to take actions such that an external reward signal is maximized.

Many sequential choices must be made when performing complex inferences (e.g., what piece of data to consider, or what information should be combined). This paper describes work utilizing RL to train an RL Tactician, an inference module that helps direct inferences. The job of the Tactician is to order inference decisions and thus guide inference towards answers more efficiently. It would be quite difficult to determine an optimal inference path for a complex query to support training a learner with classical machine learning. However, RL is an appropriate machine learning technique for optimizing inference, as it is relatively simple to provide feedback to a learner about how efficiently it is able to respond to a set of training queries.

Our inference-learning method is implemented and tested within ResearchCyc (http://research.cyc.com/), a freely available version of Cyc (Lenat 1995). We show that the speed of inference can be significantly improved over time when training on queries drawn from the Cyc knowledge base by effectively learning low-level control laws. Additionally, the RL Tactician is also able to outperform Cyc's built-in, hand-tuned tactician.
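To illustrate the kind of feedback this involves, the sketch below treats each training query as a single RL episode whose reward reflects how quickly the query was answered. It is a simplification, not the training harness used in this work: run_query, the step cap, and the reward scale are all assumptions made for illustration.

```python
import random

# A minimal sketch, NOT the paper's implementation: each training query is one
# episode, and a policy earns more reward when fewer inference steps are needed.
MAX_STEPS = 10_000  # assumed cutoff, mirroring resource-bounded inference


def run_query(policy, query, max_steps=MAX_STEPS):
    """Stand-in inference loop: the policy acts until the query is answered or
    the step budget runs out. Returns (steps used, whether it was answered)."""
    steps, answered = 0, False
    while steps < max_steps and not answered:
        steps += 1
        # Placeholder for "execute the tactic chosen by the policy".
        answered = random.random() < policy(query, steps)
    return steps, answered


def episode_reward(steps, answered, max_steps=MAX_STEPS):
    """Unanswered queries get the worst score; faster answers score higher."""
    return -steps if answered else -max_steps


def evaluate_policy(policy, training_queries):
    """Fitness of a candidate policy = total reward over the training queries."""
    return sum(episode_reward(*run_query(policy, q)) for q in training_queries)


if __name__ == "__main__":
    # A trivial stand-in policy: a constant chance of finishing at each step.
    eager = lambda query, step: 0.01
    print(evaluate_policy(eager, ["query-%d" % i for i in range(5)]))
```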
Inference in Cyc

Inference in Cyc is different from most inference engines because it was designed and optimized to work over a large knowledge base with thousands of predicates and hundreds of thousands of constants. Inference in Cyc is sound but is not always complete: in practice, memory- or time-based cutoffs are typically used to limit very large or long-running queries. Additionally, the logic is "nth order," as variables and quantifiers can be nested arbitrarily deeply inside of background knowledge or queries (Ramachandran et al. 2005).

Cyc's inference engine is composed of approximately a thousand specialized reasoners, called inference modules, designed to handle commonly occurring classes of problem and sub-problem. Modules range from those that handle extremely general cases, such as subsumption reasoning, to very specific modules which perform efficient reasoning for only a single predicate. An inference harness breaks a problem down into sub-problems and selects among the modules that may apply to each problem, as well as choosing follow-up approaches, pruning entire branches of search based on expected productivity, and allocating computational resources. The behavior of the inference harness is defined by a set of manually coded heuristics. As the complexity of the problem grows, that set of heuristics becomes more complex and more difficult to develop effectively by human evaluation of the problem space; detailed analysis of a set of test cases shows that the overall time spent to achieve a set of answers could be improved by up to 50% with a better search policy.

Cyc's inference harness is composed of three main high-level components: the Strategist, Tactician, and Worker. The Strategist's primary function is to keep track of resource constraints, such as memory or time, and interrupt the inference if a constraint is violated. A tactic is a single quantum of work that can be performed in the course of producing results for a query, such as splitting a conjunction into multiple clauses, looking up the truth value of a fully-bound clause, or finding appropriate bindings for a partial sentence via indexing. At any given time during the solving of a query, there are typically multiple logically correct possible actions. Different orderings of these tactics can lead to solutions in radically different amounts of time, which is why we believe this approach has the potential to improve overall inference performance. The Worker is responsible for executing tactics as directed by the Tactician. The majority of inference reasoning in Cyc takes place in the Tactician, and thus this paper will focus on speeding up the Tactician.

The Tactician used for backward inference in Cyc is called the Balanced Tactician, which heuristically selects tactics in a best-first manner. The features that the Balanced Tactician uses to score tactics are tactic type, productivity, completeness, and preference. There are eight tactic types, which describe the kind of operation to be performed, such as "split" (i.e., split a conjunction into two sub-problems). Productivity is a heuristic that is related to the expected number of answers that will be generated by the execution of a tactic; lower-productivity tactics are typically preferred because they reduce the state space of the inference. Completeness can take on three discrete values and is an estimate of whether the tactic is complete in the logical sense, i.e., it is expected to yield all true answers to the problem. Preference is a related feature that also estimates how likely a tactic is to return all the possible bindings. Completeness and Preference are mutually exclusive, depending on the tactic type. Therefore the Balanced Tactician scores each tactic based on the tactic's type, productivity, and either completeness or preference.
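To make this feature set concrete, the sketch below shows one way a best-first tactician could rank pending tactics on these features. The Tactic fields, the type weights, and the scoring formula are illustrative assumptions, not the Balanced Tactician's actual heuristic.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of best-first tactic scoring over the features described
# above (tactic type, productivity, and either completeness or preference).
# The weights and the formula are assumptions, not Cyc's real heuristic.

TYPE_WEIGHTS = {  # assumed biases for a few of the eight tactic types
    "split": 1.0,
    "lookup": 2.0,
    "index-bind": 0.5,
}


@dataclass
class Tactic:
    name: str
    tactic_type: str
    productivity: float                 # expected number of answers produced
    completeness: Optional[int] = None  # one of three discrete values, or None
    preference: Optional[int] = None    # used when completeness does not apply


def score(tactic: Tactic) -> float:
    """Higher is better: favour the tactic type, low productivity, and tactics
    likely to yield all answers (completeness or preference, never both)."""
    logical_quality = (
        tactic.completeness if tactic.completeness is not None
        else (tactic.preference or 0)
    )
    return (TYPE_WEIGHTS.get(tactic.tactic_type, 1.0)
            + logical_quality
            - 0.1 * tactic.productivity)


def choose_best_first(tactics):
    """Order pending tactics from best to worst, as a best-first tactician would."""
    return sorted(tactics, key=score, reverse=True)


if __name__ == "__main__":
    pending = [
        Tactic("split-conjunction", "split", productivity=4.0, completeness=2),
        Tactic("lookup-truth-value", "lookup", productivity=1.0, preference=1),
    ]
    for t in choose_best_first(pending):
        print(f"{t.name}: {score(t):.2f}")
```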
Speeding up inference is a particularly significant accomplishment, as the Balanced Tactician has been hand-tuned for a number of years by Cyc programmers and even small improvements may dramatically reduce real-world running times. The Balanced Tactician also has access to heuristic approaches to tactic selection that are not based on productivity, completeness, and preference (for example, so-called "Magic Wand" tactics: a set of tactics which almost always fail, but which are so fast to try that the very small chance of success is worth the effort). Because these are not based on the usual

In the RL framework, the state of the environment can be described by a set of state variables, so that s = x_1, x_2, ..., x_k. The agent has a set of actions, A, from which to choose. A reward function, R : S → ℝ, defines the instantaneous environmental reward of a state. A policy, π : S → A, fully defines how an agent interacts with its environment. The performance of an agent's policy is defined by how well it maximizes the received reward while following that policy.

RL agents often use a parameterized function to represent the policy; representing a policy in a table is either difficult or impossible if the state space is large or continuous. Policy search RL methods, a class of global optimization techniques, directly search the space of possible policies. These methods learn by tuning the parameters of a function representing the policy. NEAT is one such learning method.

NeuroEvolution of Augmenting Topologies (NEAT)

This paper utilizes NeuroEvolution of Augmenting Topologies (NEAT) (Stanley & Miikkulainen 2002), a popular, freely available method to evolve neural networks via a genetic algorithm. NEAT has had substantial success in RL domains like pole balancing (Stanley & Miikkulainen 2002) and virtual robot control (Taylor, Whiteson, & Stone 2006).

In many neuroevolutionary systems, the weights of a neural network form an individual genome. A population of genomes is then evolved by evaluating each and selectively reproducing the fittest individuals through crossover and mutation. Most neuroevolutionary systems require the designer to manually determine the network's topology (i.e., how many hidden nodes there are and how they are connected). By contrast, NEAT automatically evolves the topology by learning both the network weights and the network structure.

NEAT begins with a population of simple networks: inputs are connected directly to outputs without any hidden nodes. Two special mutation operators introduce new structure incrementally by adding hidden nodes and links to a network.
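The two structural mutations can be sketched as follows. This is a deliberately minimal illustration under an assumed genome encoding (a dict of nodes and weighted connections) and omits NEAT's innovation numbers, crossover, and speciation.

```python
import random

# Simplified sketch of NEAT-style structural mutation (not the full algorithm).
# Networks start minimal and grow by adding hidden nodes and links.


def minimal_genome(n_inputs, n_outputs):
    """Start as NEAT does: inputs wired directly to outputs, no hidden nodes."""
    nodes = list(range(n_inputs + n_outputs))
    conns = [
        {"src": i, "dst": n_inputs + o, "weight": random.uniform(-1, 1), "enabled": True}
        for i in range(n_inputs)
        for o in range(n_outputs)
    ]
    return {"nodes": nodes, "conns": conns}


def mutate_add_node(genome):
    """Split an existing connection: disable it and route through a new hidden node."""
    conn = random.choice([c for c in genome["conns"] if c["enabled"]])
    conn["enabled"] = False
    new_node = max(genome["nodes"]) + 1
    genome["nodes"].append(new_node)
    genome["conns"].append({"src": conn["src"], "dst": new_node,
                            "weight": 1.0, "enabled": True})
    genome["conns"].append({"src": new_node, "dst": conn["dst"],
                            "weight": conn["weight"], "enabled": True})


def mutate_add_link(genome):
    """Add a new weighted connection between two previously unconnected nodes."""
    existing = {(c["src"], c["dst"]) for c in genome["conns"]}
    candidates = [(a, b) for a in genome["nodes"] for b in genome["nodes"]
                  if a != b and (a, b) not in existing]
    if candidates:
        src, dst = random.choice(candidates)
        genome["conns"].append({"src": src, "dst": dst,
                                "weight": random.uniform(-1, 1), "enabled": True})


if __name__ == "__main__":
    g = minimal_genome(n_inputs=3, n_outputs=1)
    mutate_add_node(g)
    mutate_add_link(g)
    print(len(g["nodes"]), "nodes,", len(g["conns"]), "connections")
```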