Entity Resolution Using the Micron Automata Processor

Entity Resolution Acceleration using Micron's Automata Processor Chunkun Bo1, Ke Wang1, Jeffrey J. Fox2, and Kevin Skadron1 1Department of Computer Science 2Department of Materials Science and Engineering University of Virginia Charlottesville, VA, 22903 USA fchunkun, kewang, jjf5x, [email protected] Abstract Determining whether two records represent the same entity is usually computationally expensive. This is Entity Resolution (ER), the process of finding iden- even worse in the context of big data, because one needs tical entities across different databases, is critical to to compare a huge number of records (O(N 2)). Prior many information integration applications. As sizes of work has proposed different algorithms and computa- databases explode in the big-data era, it becomes com- tion models to improve the performance [2] [3] [4] [5]. putationally expensive to recognize identical entities for However, the performance is still unsatisfying [6]. all possible records with variations allowed. Profiling Micron's Automata Processor is an efficient and scal- results show that approximate matching is the primary able semiconductor architecture for parallel automata bottleneck. Micron's Automata Processor (AP), an processing [7]. It is a hardware implementation of efficient and scalable semiconductor architecture for non-deterministic finite automata (NFA) and is capable parallel automata processing, provides a new oppor- of matching a large number of complex patterns in tunity for hardware acceleration for ER. We propose parallel. Therefore, we propose a hardware acceleration an AP-accelerated ER solution, which accelerates the solution to ER using the AP. This paper focuses on performance bottleneck of fuzzy matching for similar solutions for string-based ER. but potentially inexactly-matched names, and use a real-world application to illustrate its effectiveness. To illustrate how the AP can accelerate ER, we Results show 121x to 4200x speedups for matching one present a framework and performance evaluation of a record, with better accuracy (7.6% more correct pairs real-world ER application. The problem is to identify and 39% less generalized merge distance cost) over the and combine the records with the same identity stored existing CPU method. in SNAC database. 1 Introduction In summary, we make the following contributions. Entity Resolution (ER), also known as Record Link- 1. We propose a novel AP-based hardware accelera- age, Duplication Reduction or Purging/Merging prob- tion framework to solve Entity Resolution. lems, refers to finding records which store the same 2. We present an automata design which can process entity within a single database or across different string-based ER, e.g. fuzzy name matching. databases. ER is an important kernel of a large number 3. We compare the proposed methods with state- of information integration applications. For example, of-the-art technology (Apache Lucene), and it achieves the Social Networks and Archival Context (SNAC) both higher performance (434x speedup on average) collects records from databases all over the world to and better accuracy (7.6% more correct pairs and 39% provide an integrated platform to search historical less GMD cost). collections [1]. In such applications, the records of the same person may be stored with slight differences, 2 Related Work because documents may come from different sources, Many methods have been proposed to solve ER. with different naming conventions, transliteration con- One is a domain-independent algorithm for detecting ventions, etc. SNAC needs to find the records referring approximately duplicate database records [3]. They to the same entity with different representations and first compute minimum edit-distance to recognize pairs merge these records. The intuitive method is to of possible duplicate records, and then a use union/find compare all possible pair records and check whether algorithm to keep track of duplicate records incremen- a pair represents the same entity. tally. Another method sorts the records and checks whether the neighboring records are the same [2]. Counters and Boolean elements are designed to work For approximate duplicates, some researchers define a with STEs to increase the space efficiency of automata window size and a threshold of similarity, so that they implementations and to extend computational capabil- can find records similar enough to satisfy application ities beyond NFAs. requirements. Apache Lucene is a high-performance search engine and it uses a similar method [8]. The 3.2 Speed and Capacity difference lies in that Lucene calculates the score of Micron's current generation D480 chip is built on a document based on the query, and sorts documents 45nm-equivalent technology running at an input symbol instead of every individual record. However, we are (8-bit) rate of 133 MHz. The D480 chip has two half- not aware of any implementations of these algorithms cores and each half-core has 96 blocks. Each block has using accelerators. Our AP-based approach implements 256 STEs, 4 counters and 12 Boolean elements [11]. In a version of the Hamming distance based method [9]. total, one D480 chip has 49,152 STEs, 2,304 Boolean elements, and 768 counter elements. Each AP board As databases become much larger, some researchers can have up to 32 AP chips. have suggested using blocking methods to improve the performance [4]. This allows them to solve a 3.3 Programming and Reconfiguration smaller database, but one needs to have some pre- Automata Network Markup Language (ANML) is known information to divide the data and devote extra an XML language for describing the composition of effort to deal with records comparison across different automata networks. Micron also provides a graphical blocks. user interface tool called the AP Workbench for quick Some novel architectures are proposed to accelerate automaton design and debugging. A \macro" is a pattern matching. Fang et. al. propose the General- container of automata for encapsulating a given func- ized Pattern Matching micro-engine, a heterogeneous tionality, similar to a function in common programming architecture to accelerate FSM-based applications [10]. languages. Micron's AP SDK also provides C and They translate regular expressions to DFA tables, store Python interfaces to build automata, create input DFA tables in local memory and expose DFA paral- streams, parse output and manage computational tasks lelism by a vector instruction interfaces so that they on the AP board. Furthermore, the symbols that an can reduce instruction counts. Kolb et. al. propose De- STE matches can be reconfigured. The replacement doop, a cloud-based infrastructure to speedup ER [5]. time is around 0.24 millisecond for one block. They gain 80x speedup by using 100 Amazon EC2 4 Real-world ER Application computation nodes. In contrast to these approaches, The term entity describes the real-world object and the AP is a memory-derived architecture which can resolution is used because ER is a decision process to directly implement automata efficiently and process a resolve the question. Entities are described by their large number of more complex patterns in parallel in a characteristics and identity attributes are the ones that single compute node. distinguish them from other entities. String-based ER 3 Automata Processor means that the identity attributes are strings. The Automata Processor is an efficient and scal- In this paper, we use a real-world ER problem in able semiconductor architecture for parallel automata SNAC to illustrate how the proposed AP-accelerated processing. It uses a non-Von-Neumann architecture, method works. When building the SNAC platform, the which can directly implement non-deterministic finite same person's name may not always be consistent from automata in hardware, to match complex regular ex- one record to the next because of typos, mis-spellings, pressions. The AP can also match other types of different versions of abbreviation, etc. These differences automata that cannot be conveniently expressed as lead to three major problems. 1) One may miss some regular expressions. correct results when querying a particular user; 2) 3.1 AP Functional Elements multiple entries for one entity wastes storage space; 3) The AP consists of three different types of functional the duplicated items will also slow down the speed of elements that can be used to build automata: State searching because the same entities are compared many Transition Element (STE), Counters and Boolean ele- times. Therefore, SNAC needs to identify and combine ments [7]. records, which is a typical ER problem. In the following Each STE can be configured to match a set of any section, we refer it as the Name Matching problem, 8-bit symbols and activate a set of successor STEs since SNAC uses people's names as identity attributes. connected to it when it finds a match. The STEs can But it is important to point out that the AP-accelerated be configured as start, all-input and report, so that they method is not limited to Name Matching. This design can read symbols from input or report when a match is can be ported to other string-based ER applications found. with small modifications. Formats Example Abc (basic) Aachen Abc Bcd Abad Santillan Abc Bcd Cde Abascal Yea Sousa Abc II Abdulhamid II Table 1: Family name formats. with each other, they can be processed using the same matching structure. For example, row 2, 4, 8, 10 and 13 can use the same structure, thus simplifying the automata design. Figure 1: Workflow of Name Matching. Formats Example Bcd (basic) John 5 Name Matching using the AP Bcd X. Michael K. 5.1 Workflow Bcd Cde Waino Waldemar The workflow of the Name Matching problem using Bcd X. (Bcd Xyz) Chuck L. (Chuck Lehman) the AP is shown in Figure 1. The CPU first extracts the B. X. P. S. P short for Paolo names from the original document in a pre-processing B.

Load more