Developing Engineering Ontology for Information Retrieval

Zhanjun Li
Graduate Research Assistant
Purdue Research and Education Center for Information Systems in Engineering (PRECISE), Purdue University, West Lafayette, IN 47907-2024; School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907-2024

Victor Raskin
Professor
Purdue Research and Education Center for Information Systems in Engineering (PRECISE), Purdue University, West Lafayette, IN 47907-2024; Department of English and Linguistics, Purdue University, West Lafayette, IN 47907-2024

Karthik Ramani1
Professor
Mem. ASME
Purdue Research and Education Center for Information Systems in Engineering (PRECISE), Purdue University, West Lafayette, IN 47907-2024; School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907-2024; School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-2024

When engineering content is created and applied during the product life cycle, it is often stored and forgotten. Since search remains word based, engineers do not have the effective means to harness and reuse past designs and experiences. Current information retrieval approaches based on statistical methods and keyword matching do not satisfy users' needs in the engineering domain. Therefore, we propose a new computational framework that includes an ontological basis and algorithms to retrieve unstructured engineering documents while handling complex queries. The results from the preliminary test demonstrate that our method outperforms the traditional keyword-based search with respect to the standard information retrieval measurement. [DOI: 10.1115/1.2830851]

1Corresponding author.
Contributed by the Computer Engineering Informatics (EIX) Committee of ASME for publication in the JOURNAL OF COMPUTING INFORMATION SCIENCE AND ENGINEERING. Manuscript received December 5, 2006; final manuscript received August 13, 2007; published online February 14, 2008.
Journal of Computing and Information Science in Engineering, MARCH 2008, Vol. 8 / 011003-1. Copyright © 2008 by ASME

1 Introduction

Engineering design is a decision-making process in which the basic sciences, mathematics, and engineering sciences are applied to convert resources optimally in order to develop a product [1]. During this process, a large amount of knowledge is generated in order to describe the product and the process: Some of this knowledge is captured in the form of documents such as reports, notebooks, memos, emails, sketches, and 2D/3D computer-aided design (CAD) drawings, while other knowledge is retained in the memory of the engineers. Some important roles of the documentation include legal issues, patent applications, international standard certifications, internal practices, and sales catalogs. These engineering documents can be classified into internal resources and external resources. Internal resources include documents from product specifications and memos to final project reports and CAD drawings. External resources include online catalogs of suppliers and patents. A major portion of the engineering documents are (1) textual descriptions and (2) CAD drawings that have embedded drawing notes, bills of materials (BOMs), and texts, which describe shape and assembly information [2–4]. The number of digital documents being generated for product development has exploded. For example, there are approximately 40,000 documents produced in the design of a single engine in an aerospace company [2]. At Boeing, digital documents have accumulated up to the scale of petabytes, a number which is expected to double over the next two years [5].

Engineers are dependent on retrieving and using these documents in order to fulfill various engineering design tasks, such as the following.

(1) Acting as "memory extension" for individual engineers and enabling information sharing among them [6].
(2) Exploring design concept alternatives during the early stage of the development. This helps engineers avoid the tendency to take their first idea and start to refine it into a final design [6].
(3) Learning from the original design process and understanding the rationale behind the decisions made. This is especially important for novice engineers since they are not always aware of what they need to know during the design process [7].
(4) Searching for past designs when working on a similar product or problem in order to gain insight from past design scenarios and experiences. This is known as design reuse [8].

In fact, today's engineers simply do not make an effort to find engineering content beyond doing mere keyword searches [9]. However, current information retrieval approaches either retrieve too many or irrelevant results for engineering, or return results that are not in a form that users can navigate and explore. It was reported that design engineers spent 20–30% of their time retrieving and communicating information [10]. Current engineering practices ignore reuse of previous knowledge because appropriate engineering information retrieval tools have not been developed. As a result, a large amount of time is spent reinventing what is already known in the company or is available in outside resources [11]. The redundant effort per employee is increasing and causing enormous cost as the complexity of enterprises and products increases. It is, therefore, imperative to minimize such overhead by developing the science base for contextual retrieval and then using this knowledge to create effective computer-aided tools.

Most engineering documents are unstructured, in contrast to structured data resources such as database tables. Their formalities also vary. Formal documents, such as project proposals and reports, are written to comply with grammars or specific regulations; informal documents, such as engineers' notebooks and textual descriptions in the drawings, on the other hand, are fragmentary descriptions. Engineering documents differ from general text documents because of the syntax variations and semantic complexities of their contents [3]. The syntax variations refer to the usage of abbreviations, e.g., SLA for stereolithography, acronyms, e.g., AL ALY 6061, which stands for a type of aluminum alloy, and synonyms, e.g., hardened way slides versus rectangular way slides, of the regular terms. They reflect the company-specific or domain-specific naming conventions, the diverse background of the authors, and the compositional nature of the designs. The semantic complexities denote the wide range of domain-specific issues and the relationship among these issues that must be considered and documented during the life cycle of product development. Examples of these issues are customer requirements, specifications, functions, performances, structure design, material selections, and manufacturing process selections. Therefore, it is necessary to consolidate and contextualize heterogeneous engineering documents in order to reconstruct the prior knowledge in a more explicit and structured manner.

This research addresses the task of retrieving unstructured en-...

...text of domain-specific content. The following are examples of queries through which users want to investigate the designs that

• lock a car door with a curvilinear slot sliding along a cylindrical pin in the assembly, and
• have a dc motor with an output speed of 100–1000 rpm.

The first query is for a desired function of a mechanism and its components. In the second example, the main concern is the exact value of a property attributed to a specific component. It is impossible to represent these semantic descriptions accurately by using a few keywords. Google's Brin and Page [21] pointed out that approaches using the VS model work well only with small and homogeneous collections such as literature or news releases under a common topic. However, recall that engineering documents usually describe various complex design processes and specifications, and are rich in specific technical terms and abbreviations with which some end users are usually not familiar. Another issue in applying the current IR in the engineering domain is ambiguity. The same term may represent different meanings in different contexts, or multiple terms may be used to mean the same thing [22]. To put it differently, the search results should satisfy the users, who are looking for something that matches their understanding of a pertinent text, an understanding that includes, among other things, the relations among the terms and the ability to disambiguate and to infer. This is where the statistical keyword-based techniques fail the users and defeat their purposes.

IE approaches bring together natural language processing (NLP) tools with domain knowledge to extract meaningful sentence constituents from unstructured texts for retrieval or for knowledge mining purposes [23]. The domain knowledge can either be formalized as expression patterns by experts [24] or learned from a large training corpus [25]. However, the research in IE usually deals with short texts such as news reports on terrorism or extracts very specific information, as in named entity recognition [23]. Engineering documents are more diversified. Therefore, it is very labor intensive to form expression patterns manually. It is also unfeasible
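
The keyword-matching limitation discussed in the Introduction can be illustrated with a minimal Python sketch. It assumes that the "VS model" refers to the standard vector space model and scores a query against a document with term-frequency cosine similarity. The variant table, the sample document, and the sample query are hypothetical; only the SLA and AL ALY 6061 examples are taken from the text. The sketch is not the framework proposed in this paper.

```python
import math
import re
from collections import Counter

# Hypothetical variant table mapping abbreviations and acronyms to one
# canonical term. The two entries reuse the examples given in the text.
VARIANTS = {
    "sla": "stereolithography",
    "al aly 6061": "aluminum alloy 6061",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(text):
    """Replace known variants with their canonical terms (whole words only)."""
    text = text.lower()
    for variant, canonical in VARIANTS.items():
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    return text


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical document and query, invented for this illustration.
doc = "Housing prototype made by stereolithography; final part machined from AL ALY 6061"
query = "SLA aluminum alloy"

plain = cosine(query, doc)                          # literal keyword overlap only
expanded = cosine(normalize(query), normalize(doc))  # after variant normalization

print("plain keyword score:", plain)
print("score after normalization:", expanded)
```

Under these assumptions the raw query shares no token with the document, so its score is zero, while the normalized query and document share three terms and score above zero. A hand-built table of this kind covers only the variants someone thought to list, which is the sort of gap that a more systematic knowledge source is intended to address.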
