HIL: A High-Level Scripting Language for Entity Integration

Mauricio Hernández (IBM Research – Almaden, [email protected]), Georgia Koutrika (HP Labs, [email protected]), Rajasekar Krishnamurthy (IBM Research – Almaden, [email protected]), Lucian Popa (IBM Research – Almaden, [email protected]), Ryan Wisnesky (Harvard University, [email protected])

ABSTRACT

We introduce HIL, a high-level scripting language for entity resolution and integration. HIL aims at providing the core logic for complex data processing flows that aggregate facts from large collections of structured or unstructured data into clean, unified entities. Such flows typically include many stages of processing that start from the outcome of information extraction and continue with entity resolution, mapping and fusion. A HIL program captures the overall integration flow through a combination of SQL-like rules that link, map, fuse and aggregate entities. A salient feature of HIL is the use of logical indexes in its data model to facilitate the modular construction and aggregation of complex entities. Another feature is the presence of a flexible, open type system that allows HIL to handle input data that is irregular, sparse or partially known. As a result, HIL can accurately express complex integration tasks, while still being high-level and focused on the logical entities (rather than the physical operations). Compilation algorithms translate the HIL specification into efficient run-time queries that can execute in parallel on Hadoop. We show how our framework is applied to a real-world integration of entities in the financial domain, based on public filings archived by the U.S. Securities and Exchange Commission (SEC). Furthermore, we apply HIL to a larger-scale scenario that performs fusion of data from hundreds of millions of Twitter messages into tens of millions of structured entities.

1. INTRODUCTION

In recent years, data integration has largely moved outside the enterprise. There is now a plethora of publicly available sources of data that can provide valuable information. Examples include: bibliographic repositories (DBLP, Cora, Citeseer), movie databases (IMDB), knowledge bases (Wikipedia, DBPedia, Freebase), and social media data (Twitter, blogs). Additionally, a number of specialized public repositories are starting to play an increasingly important role. These repositories include, for example, U.S. federal government data, congress and census data, as well as financial reports archived by the U.S. Securities and Exchange Commission (SEC).

To enable systematic analysis of such data at the aggregated level, one needs to build an entity-centric or concept-centric view of the domain [5, 7], where the important entities and their relationships are extracted and integrated from the underlying documents. We refer to the process of extracting data from documents, integrating the information, and then building domain-specific entities as entity integration. Enabling such integration in practice is a challenge, and there is a great need for tools and languages that are high-level but still expressive enough to facilitate the end-to-end development and maintenance of complex integration flows.

There are several techniques that are relevant, at various levels, for entity integration: information extraction [9], schema matching [20], schema mapping [11], entity resolution [10], and data fusion [3]. These techniques have received significant attention in the literature, although most often they have been treated separately. In many complex scenarios, all of these techniques have to be used in cooperation (in a flow), since the data poses various challenges. Concretely, the data can be unstructured (hence, it requires extraction to produce structured records), it has variations in format and in the accompanying attributes (hence, it requires repeated mapping and transformation), and it has variations in the identifying attributes of entities (hence, it requires entity resolution, that is, the identification of the same real-world entity across different records). Moreover, fusion is needed to merge all the facts about the same real-world entity into one integrated, clean object.

The HIL Language. In this paper, we introduce HIL (High-level Integration Language), a programming (scripting) language to specify the structured part of complex integration flows. HIL captures in one framework the mapping, fusion, and entity resolution types of operations. High-level languages for information extraction already exist (e.g., AQL [6]) and are complementary to HIL. The main design goal for HIL is to provide the precise logic of a structured integration flow while leaving out the execution details that may be particular to a run-time engine. The target users for HIL are developers who perform complex, industrial-strength entity integration and analysis. Our goal is to offer a more focused, more uniform and higher-level alternative to programming in general-purpose languages (e.g., Java, Perl, Scala), to using ETL tools, or to using general data manipulation languages (e.g., Jaql [2], Pig Latin [18], XQuery).

HIL exposes a data model and constructs that are specific to the various tasks in entity integration flows. First, HIL defines the main entity types, which are the logical objects that a user intends to create and manipulate. Each entity type represents a collection of entities, possibly indexed by certain attributes. Indexes are logical structures that form an essential part of the design of HIL; they facilitate the hierarchical, modular construction of entities from the ground up. The philosophy of HIL is that entities are built or aggregated from simpler, lower-level entities. A key feature of HIL is the use of record polymorphism and type inference [19], allowing schemas to be partially specified. In turn, this enables incremental development where entity types evolve and increase in complexity.

HIL consists of two main types of rules that use a SQL-like syntax. Entity population rules express the mapping and transformation of data from one type into another, as well as the fusion and aggregation of data. Entity resolution rules express the matching and linking of entities, by capturing all possible ways of matching entities, and by using constraints to declare properties on the desired output (e.g., one-to-one or one-to-many types of matches).

The entity population rules borrow features from schema mapping formalisms (e.g., s-t tgds [12] or second-order tgds [13]); however, our design is aimed at making the language intuitive, efficient and easy to use by practitioners.
As such, HIL drops features such as Skolem functions and complex quantifiers, it does not require any a priori schemas, it is polymorphic (to address heterogeneity and complexity in the input data), and it includes user-defined functions that can be used for aggregation and data cleaning (e.g., normalization). Furthermore, indexes are perhaps the most important differentiating aspect of HIL, as they allow the decorrelation of complex integration tasks into simple and efficient rules. It is the presence of indexes that enables the true scalability of HIL in terms of both the design (e.g., creating and maintaining the rules) and the execution (e.g., on large amounts of data such as in social media).

With regard to entity resolution, HIL is not a replacement for the many physical algorithms or operators for matching entities that have been developed in the literature over the years. Instead, HIL entity resolution rules are declarative statements that can express the common patterns used in practice (e.g., blocking, matching, constraints), while making use of specialized functions for string similarity, record matching, or normalization of values. Furthermore, entity resolution rules in HIL are used in conjunction with the entity population rules in order to achieve fusion of the matched or linked entities into the final, integrated entities.

Our motivating example, which we introduce next, will show how the combination of entity resolution rules together with rules for mapping, fusion and aggregation of data gives HIL enough expressive power to achieve complex, end-to-end integration tasks. At the same time, we will use the example to show the relative ease and simplicity of HIL when compared to the alternatives.

1.1 A Motivating Example

As a motivating scenario, we use the financial integration from SEC that was described as part of the Midas system [5]. A different scenario that integrates entities from social media will be described and used for a larger-scale experimental evaluation in Section 6.

In the SEC scenario, company and people entities, together with their relationships, are extracted and integrated from regulatory SEC filings that are in semi-structured or unstructured (text) form. While SEC integration is one example application out of many, it is a good illustration of the kind of integration that is performed in real life, not only in Midas but also by financial data providers¹. These providers often use a combination of manual methods (e.g., copy-and-paste then clean) and low-level coding to achieve a reasonable level of clean integrated data. In Midas, information extraction relies on SystemT [6] and its high-level language AQL. However, the subsequent, structured part of entity integration is a complex mixture of domain-specific rules for entity resolution, mapping and fusion. These rules were expressed mostly using Jaql [2], a general data manipulation language for JSON, which in turn compiles into Map/Reduce jobs on Hadoop. As noted in [5], the integration process required a high cost of development and maintenance; the need for a high-level language to specify what needs to be done, rather than how, was identified as a key issue. Ideally, one would like to focus on the logical entities and the logical integration steps, declaratively, in the same way SQL is a higher-level alternative to a physical plan based on relational algebra operators.

¹ http://en.wikipedia.org/wiki/Financial_data_vendor

Given the public availability of the SEC data, and its wide use by financial data providers, analysts, and regulators, we re-examine in this paper the SEC integration scenario, but from the perspective of using a high-level language. We will use this scenario to illustrate the functionality of HIL, and also to later evaluate HIL. We start by showing a simplified portion of the SEC integration flow in Figure 1.

[Figure 1: Example Integration Flow]

The goal of this flow is to construct an entity type Person, representing the key people of major U.S. companies. The flow uses two input data sets: InsiderReportPerson (or IRP, in short) and JobChange. The first is a set of records extracted from XML insider reports. These reports are filed periodically by companies to state compensation-related aspects about their officers and directors. Each extracted record includes the person name, a central identification key (cik, a global SEC-assigned key for that person), a company identifier, the reporting date, and whether the person is an officer or a director. If the person is an officer, the title attribute contains the executive position (e.g., "CEO", "CFO", etc.).

The second data set, JobChange, consists of records extracted from textual reports that disclose job changes or new appointments in a company. These records exhibit high variability in the quality of data (e.g., people names, positions). A record in JobChange includes the extracted person name, the appointment date, the position (appointedAs), and the appointing company. However, it does not include any key identifying the person. The attributes docid and span identify the document and the position within the document where the person name has been extracted from. Together, they serve as an identifier for the particular person occurrence. (Note that the same real-world person may occur in many documents or in many places in the same document.)

The first step in the flow is a transformation that constructs an initial instantiation of Person from insider reports. Since data from IRP is relatively clean and each person has a key, the resulting Person entities will form a reference data set to be used and further enriched in subsequent steps. For each person key (cik), we create a unique person entity that includes top-level attributes such as the person name and the key itself. Then, for each person, we must construct an employment history by aggregating from many input records. This history includes the companies for which the person worked over the years, and for each company, a list of positions held by that person. Since a position is a string that varies from record to record (e.g., "CEO" and "Chief Exec. Officer"), normalization code must be used to identify and fuse the same position. Moreover, for each position, we must capture the earliest known date and the latest known date for that position. These attributes, earliest_date and latest_date, are the result of a temporal aggregation that considers all the IRP records that refer to the same person, company and position.
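To make the data sets concrete, the following sketch shows the rough shape of one input record from each source, together with the kind of Person entity that the flow ultimately builds. The rendering in Python and all field values are ours, purely for illustration; the actual extracted records carry additional attributes.

    # Hypothetical sample records (illustrative values only).
    irp_record = {
        "name": "John A. Smith", "cik": 1234567,      # global SEC person key
        "company": "Acme Corp", "reportingDate": "2010-02-15",
        "isOfficer": True, "title": "Chief Exec. Officer",
    }
    job_change_record = {                              # no person key here
        "name": "J. Smith", "appointedAs": "CEO",
        "company": "Acme Corp", "apptDate": "2011-06-01",
        "docID": "d-0042", "span": (105, 113),         # person occurrence id
    }
    person_entity = {                                  # target of the flow
        "name": "John A. Smith", "cik": 1234567,
        "emp": [{"company": "Acme Corp",
                 "positions": [{"title": "CEO",
                                "earliest_date": "2010-02-15",
                                "latest_date": "2011-06-01"}]}],
    }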
The second step in the flow starts the integration of the second data set, JobChange, by linking its records to corresponding Person entities. This entity resolution step produces a PeopleLink table that relates each person occurrence (identified by docid and span) in JobChange with an actual person (identified by cik) in Person.

In the third step, we join JobChange with PeopleLink in order to "retrieve" the person cik, and then insert or "fuse" the appropriate data into the employment history of that person entity. This fusion step is non-trivial since it may affect data at several levels in the Person structure. We may either insert a new company into the employment history or modify the set of positions in an existing company. In turn, the latter step either inserts a new position or modifies an existing one (in which case, earliest_date and latest_date may be changed, by reapplying the temporal aggregation).

1.2 Advantages of HIL

We now discuss and contrast two approaches to express the above type of functionality. The first approach is based on the general scripting languages for data manipulation that are currently available. We use Jaql [2] here for illustration, but the same comparison applies to languages such as Pig Latin [18] or XQuery. The second approach, which we take in this paper, is based on HIL. We illustrate our discussion using the fusion step of our example.

Figure 2 gives a snippet of the Jaql code needed to fuse the new employment facts derived from our second data set, JobChange, into Person. We do not assume the reader's familiarity with Jaql syntax here, but assume some understanding of the basic operations to manipulate data (e.g., join, record projection, group by, etc.). Jaql, like many other scripting languages, is a functional language with no direct facilities for updates or fusion of data. Thus, the typical code that one writes takes as input a previous copy of the data (e.g., Person), and includes several stages of data processing, where the result of each stage is pipelined (using → in Jaql) to the next stage. The final stage produces a new version of Person.

    1.  Person =
    2.  join preserve Person, PeopleLink, JobChange
    3.  where Person.cik == PeopleLink.cik
    4.    and PeopleLink.docID == JobChange.docID
    5.    and PeopleLink.span == JobChange.span
    6.    and isOfficer (JobChange.appointedAs)
    7.  into { name: Person.name,
    8.         cik: Person.cik,
    9.         emp: Person.emp,
    10.        delta_emp:
    11.          { company: JobChange.company,
    12.            title: normTitle (JobChange.appointedAs),
    13.            date: JobChange.apptDate }
    14.      }
    15. group by c = $.cik into {
    16.   // copy name, cik, previous emp
    17.   name: $[0].name,
    18.   cik: c,
    19.   emp: $[0].emp,
    20.   // process new emp items
    21.   delta_emp:
    22.     ( $[*].delta_emp
    23.       group by comp = $.company into
    24.       { company: comp,
    25.         positions:
    26.           ($ transform { $.title,
    27.                          $.date })})
    28. }
    29. transform { // copy name, cik, merge employments
    30.   name: $.name,
    31.   cik: $.cik,
    32.   emp:
    33.     ([$.emp, $.delta_emp]
    34.      expand // union of the two employment arrays
    35.      group by comp = $.company into
    36.      { company: comp,
    37.        positions: ( [$[0].positions, $[1].positions]
    38.          expand // another deep union
    39.          ... // further processing

    Figure 2: Snippet of fusion code in Jaql.

In more detail, the first stage (lines 2-14) has the role of joining the existing Person data with "delta" employment facts from JobChange. This stage exploits the links (PeopleLink) that were established by entity resolution, and extends the existing Person records to include a new delta_emp field with data from the corresponding JobChange record. The keyword preserve in front of Person ensures that all existing Person records are in the output, irrespective of whether there is a matching JobChange record. (This is similar to an outer join.) Since the first stage breaks the normalization of the data (i.e., there may be multiple records per person, one for each delta_emp fact), the second stage (lines 15-28) groups first by cik to reconstruct Person entities. As a result of this operation, all the delta_emp facts for a person cik are accumulated together. These records are then processed in lines 22-27 into a hierarchical structure (i.e., grouped by company). We are then ready to go into the next stage, and merge the two employment sets (emp and delta_emp) for each person (lines 33 and beyond). The merging process involves a deep union of hierarchical sets, to merge first the records at the company level, and then at the positions level.

As one can see from the example, the code is relatively low-level, with many physical operations involved. Since data needs to be carried around between stages, there is repeated copying of data, as well as grouping and renesting of the data. Fusion is accomplished by using union (which is often deeply nested and requires appropriate data alignment). As the code evolves and becomes more complex (i.e., while fusing more sources), the higher-level logic of the integration task is lost in the implementation details.

In contrast to the lower-level approach, in HIL we can use just two entity population rules (shown in Figure 3) to capture the logic of the fusion in this example, without going through any of the low-level physical operations (e.g., copy the data, group and regroup, deep-union, etc.). We give below a few high-level details on these rules, while leaving the full discussion of the HIL language (including the additional rules for the SEC example) for Section 2.

    rule m5: insert into Employment ! [cik: l.cik]
             select [ company: j.company,
                      positions: Positions ! [cik: l.cik, company: j.company] ]
             from JobChange j, PeopleLink l
             where j.docID = l.docID and j.span = l.span
                   and isOfficer (j.appointedAs);

    rule m6: insert into Positions ! [cik: l.cik, company: j.company]
             select [ title: normTitle (j.appointedAs), … ]
             from JobChange j, PeopleLink l
             where j.docID = l.docID and j.span = l.span
                   and isOfficer (j.appointedAs);

    Figure 3: Fusion via HIL rules.

Each HIL rule inserts data into a single entity, and these entities are part of the logical model that governs the integration task. In our scenario, Employment and Positions are entities, and they are used to index two types of information: employment (of a given person), and positions (of a given person and company). There is another top-level entity, Person (to be shown later), but the above fusion rules need not touch it; they need only insert the new employment facts and positions.

A HIL rule states where the data comes from, where it goes into (i.e., the target entity, and its key if it is an index), and the actual construction of the tuples being inserted. In our example, the first rule joins JobChange with PeopleLink to retrieve the relevant cik, which is then used as a key to access the Employment index (via the ! operation) and insert new tuples. The rule relies on the Positions index, which is populated by a separate rule. In general, other rules may further populate the same indexes as well as combine these indexes to form other entities (e.g., as we see later, Employment is a building block towards the main Person entity). The HIL compiler will take all the specified rules and evaluate them in the right order.

HIL rules are naturally decorrelated (via indexes), which, in turn, enables the scalability of the design. Each rule is an independent unit that captures a particular piece of the integration logic and that can be understood in isolation. For example, when writing the rule to insert new records into Employment based on JobChange and PeopleLink, one need not worry about other rules for the top-level Person entity, or for Positions, or for any other indexes that may participate in the final construction of a Person entity. Furthermore, the decorrelation of the integration task via HIL indexes plays an important role in the scalability of the runtime itself, as will be demonstrated later by our optimization techniques and by the experiments on social media data.
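To see outside of Jaql why the stage-wise pattern is verbose, the following is a minimal Python sketch of the same three stages (join with the delta facts, regroup by person, deep union of the two employment sets). It is our own simplification, not the Midas code: a position is reduced to a title string and dates are ignored.

    from collections import defaultdict

    def fuse(people, links, job_changes):
        occ = {(j["docID"], j["span"]): j for j in job_changes}
        result = []
        for p in people:                  # 'preserve': every person survives
            # Stage 1: gather the delta employment facts for this person.
            deltas = [occ[(l["docID"], l["span"])] for l in links
                      if l["cik"] == p["cik"]
                      and (l["docID"], l["span"]) in occ]
            # Stages 2-3: regroup by company, then deep-union with old emp.
            merged = defaultdict(set)
            for e in p["emp"]:
                merged[e["company"]] |= set(e["positions"])
            for j in deltas:
                merged[j["company"]].add(j["appointedAs"])
            result.append({"name": p["name"], "cik": p["cik"],
                           "emp": [{"company": c, "positions": sorted(ps)}
                                   for c, ps in merged.items()]})
        return result

Even in this toy form, the copying of unchanged fields and the regrouping at every level of the hierarchy are visible; this is exactly the bookkeeping that rules m5 and m6 delegate to the HIL compiler.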
1.3 Contributions and Paper Outline

The main contributions of this paper are as follows:

• We give, in Section 2, a high-level language (HIL) that covers the major information integration components (mapping, fusion, entity resolution). HIL naturally decorrelates a complex integration task into multiple high-level rules, which offers both increased usability and potential for better optimization.

• We give, in Sections 3 and 4, algorithms to compile a declarative HIL program into lower-level, efficient operations.
  – We exploit general (i.e., platform-independent) optimization techniques that implement logical indexes as binary tables, and map large fragments of HIL rules (both entity population and entity resolution) into joins.
  – We also develop optimization techniques that are targeted towards Hadoop (Map/Reduce). In particular, we use a co-group technique to fuse multiple HIL indexes in parallel and access them as one, thus facilitating the fusion of complexly structured entities from large amounts of data.

• We evaluate the resulting framework, in Section 5, by first applying it to the SEC financial integration scenario.

• We further evaluate, in Section 6, the performance of HIL on a large-scale integration scenario that fuses data from hundreds of millions of Twitter posts into tens of millions of entities.

We note that HIL is extensively used within IBM for entity integration applications that are targeted towards public unstructured data (the SEC data and social media being two primary examples).

2. SCRIPTING IN HIL

In this section, we give an example-driven overview of the HIL language. The main ingredients of HIL are: (1) entities, defining the logical objects (including the data sources); (2) rules, for either populating the entities or linking among the entities; and (3) user-defined functions, which accompany rules and perform operations such as string similarity or the cleansing and normalization of values. A special form of entities in HIL are indexes, which can be shared among the rules and facilitate the hierarchical, modular specification of the integration flow, as well as various forms of aggregation.

2.1 Entity Population Rules

We start by describing the first of the two main types of entity rules in HIL, namely the rules to populate the target entities. We use the mapping task from IRP to Person as an illustration. We will give the entity types that are used, as well as the rules that map between the various entities. There are multiple steps, which gradually increase in complexity as the specification progresses.

Top-level Mapping. We start by declaring the input and output entities needed at this point (IRP and Person), by giving a partial specification of their types. More entities will be added later, to describe additional data structures (e.g., indexes). We also express now a first rule to populate the top-level attributes of Person.

    IRP: set [name: string, cik: int, ?];
    Person: set [name: ?, cik: ?, emp: set ?, ?];

    rule m1: insert into Person select [name: i.name, cik: i.cik]
             from IRP i;

We first explain the entity declarations and then the rule. First, the data model of HIL allows for sets and records that can be arbitrarily nested. In the above, IRP and Person are both sets of records. An important feature of the type system of HIL is that one can give an unspecified type (denoted by ?) in any place where a type can appear (i.e., as the type of an attribute or as the type of the elements in a set). Moreover, records themselves can be left open, meaning that there can be more fields that are unknown or not relevant at this point. (See the ? at the end of the record types for IRP and Person.) Open records are especially useful when schemas are complex but only some fields are relevant to the current transformation. As more rules and declarations are added, HIL dynamically refines the types of the entities, by inferring the most general types that are consistent with all the declarations.

An entity population rule uses a select-from-where pattern to specify a query over one or more input entities; this query extracts data that is then used to populate (partially) the entity in the insert clause. For our example, rule m1 specifies that for each record i from IRP, we select the name and cik fields and use them to populate the corresponding attributes of Person. The select clause of a rule contains, in general, a record expression (possibly composite).

The semantics of an entity population rule is one of containment: for each tuple that is produced by the select-from-where statement, there must be a tuple in the target entity (in the insert clause) with corresponding attributes and values. Thus, like types, entity population rules are open; for our example, Person may contain additional data (e.g., more records or even more attributes for the same record) that will be specified via other rules. This is consistent with the usual open-world assumption in data integration [16]. Since rules define the target entities only partially, it is the role of the HIL compiler (described in Section 3) to take all the declarations and create executable code that produces the final entities.
Using Finite Maps (Indexes). We now introduce indexes, which are central to HIL and allow the modular and hierarchical construction of entities. The above rule m1 specifies how to map the top part of Person, but is silent about the nested set emp, which represents the employment history of a person. One of HIL's design choices, motivated by simplicity, is that entity population rules can only map tuples into one target set. Any nested set (e.g., emp) is populated separately via a finite map or index. Similarly, any aggregated value that needs to appear in an entity will be computed via an index, which is populated separately. An index is declared as a finite map: fmap T1 to T2, where T1 is the type of keys and T2 is the type of entries. In many cases, T2 is a set type itself. In our example, we declare Employment to be an index that associates a person identifier (i.e., cik) with the employment history of that person (i.e., a set of companies, each with a set of positions):

    Employment: fmap [cik: int]
                to set [company: string, positions: set ?];

We now modify the earlier rule m1 to specify that the nested emp set of Person is the result of an index lookup on Employment (we use ! for the lookup operation):

    rule m1': insert into Person
              select [ name: i.name, cik: i.cik,
                       emp: Employment![cik: i.cik] ]
              from IRP i;

Intuitively, the rule assumes that Employment is constructed separately and here we simply access its entries.

The above bits of specification do not state how to populate Employment but rather how it is used in Person. Separate rules can now be used to populate Employment. In particular, the following rule populates Employment based on data from IRP:

    rule m2: insert into Employment![cik: i.cik]
             select [ company: i.company,
                      positions: Positions![cik: i.cik, company: i.company] ]
             from IRP i where i.isOfficer = true;

Following the general pattern discussed above, to populate the positions field, rule m2 relies on a separate entity, Positions, indexed by person cik and company. The other notable thing about m2 is that it populates an index. For each record i in IRP where isOfficer is true, we insert a tuple in the Employment index entry that is associated with the key i.cik. Different entries in Employment, corresponding to different cik values, may be touched. Also, multiple tuples may be inserted into the same Employment entry, corresponding to multiple input records with the same cik but different company values.

Indexes are important data structures in themselves, and often reflect the natural way in which logical entities need to be accessed. In this example, employment histories need to be looked up by person key, while positions represent a finer-grained view that is indexed by both person key and company. Furthermore, indexes are a convenient mechanism that allows us to decorrelate and decompose what would otherwise be complex rules into much simpler rules. In particular, the rules that populate a top-level entity (e.g., a person) are decorrelated from the rules that populate the associated substructures (e.g., the employment of a person). In our example, we can have subsequent rules that further populate the Employment index, without affecting any of the existing rules for Person.
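Operationally, a finite map behaves like a dictionary from keys to sets of entries. The sketch below is our illustration of that behavior (with positions omitted for brevity), not HIL's actual implementation: rule m2 fills the map, and the ! lookup of rule m1' reads it.

    from collections import defaultdict

    employment = defaultdict(set)      # fmap [cik] to set of companies

    def apply_rule_m2(irp):
        # One insertion per officer record; set semantics deduplicates
        # repeated (cik, company) facts coming from many filings.
        for i in irp:
            if i["isOfficer"]:
                employment[i["cik"]].add(i["company"])

    def lookup(index, key):
        # The ! operation: an index lookup; absent keys yield the empty set.
        return index.get(key, set())

    irp = [{"cik": 1234567, "company": "Acme Corp", "isOfficer": True},
           {"cik": 1234567, "company": "Acme Corp", "isOfficer": True},
           {"cik": 1234567, "company": "Globex",    "isOfficer": True}]
    apply_rule_m2(irp)
    assert lookup(employment, 1234567) == {"Acme Corp", "Globex"}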
In title: normTitle(i.title)]) particular, the rules that populate a top-level entity (e.g., a person) So far, we have given the main entity population rules that are are decorrelated from the rules that populate the associated sub- needed to construct a Person entity, and some of the associated structures (e.g., employment of a person). In our example, we can structure (e.g., employment and positions), from one input data have subsequent rules that further populate the Employment index, source. We focus next on how to enrich this basic scenario, based without affecting any of the existing rules for Person. on additional data sources. In particular, we look at entity resolu- (No) Ordering of Rules. We remark that there is no intrinsic tion rules and the second step of our running example. order among the entity population rules. Here we gave the rule to populate Employment after the rule for Person,buttheordercould 2.2 Entity Resolution Rules be switched. It is up to the programmer to define the conceptual An entity resolution rule takes as input sets of entities and pro- flow of entities and of rules. In contrast, it is the role of the compiler duces a set of links between these entities. Each link represents a to stage the execution so that any intermediate entities are fully semantic association between the entities it refers to. For example, materialized before they are used in other entities (i.e., all rules if the input entities contain information about people, the generated for Employment must be applied before materializing Person). The links will connect those entities that contain, presumably, informa- main restriction in HIL is that we do not allow recursion among the tion about the same real-world person. entity population rules (see later Section 3 for more on this). An entity resolution rule uses a select-from-where pattern to spec- User-Defined Functions. We specify below the actual popula- ify how input entities are linked. The from clause specifies the input tion of Positions from IRP,withthehelpofaUDForuser-defined sets of entities to be linked. The where clause describes all possible function, normTitle,tonormalize the title string associated with a ways in which input entities can match. For example, one can spec- particular position. Normalization is an operation that is frequently ify that if the names of people in two lists are “similar”, then a “can- encountered in data cleansing, and often requires customization. didate” link exists between the two people. Additional clauses, in- From the point of view of HIL, all we need to provide is the signa- cluding check, group on and cardinality,specifyconstraintsthatfil- ture of the function. The actual implementation of such function is ter the “candidate” links. For instance, if only one-to-one matches provided (either in Java or Jaql) via a binding mechanism. between people entities are allowed, candidate links that connect normTitle: function string to string; one person in one list with multiple persons in another list will be dropped. Next, we describe these clauses in detail. rule m3: insert into Positions![cik: i.cik, company: i.company] select [title: normTitle(i.title)] In our example (Figure 1), we match Person entities with Job- from IRP i where i.isOfficer = true; Change records using a combination of person name and employ- ment history. If the name of the company that filed the job change Indexes and Aggregation. 
So far, we have given the main entity population rules that are needed to construct a Person entity, and some of the associated structure (e.g., employment and positions), from one input data source. We focus next on how to enrich this basic scenario, based on additional data sources. In particular, we look at entity resolution rules and the second step of our running example.

2.2 Entity Resolution Rules

An entity resolution rule takes as input sets of entities and produces a set of links between these entities. Each link represents a semantic association between the entities it refers to. For example, if the input entities contain information about people, the generated links will connect those entities that contain, presumably, information about the same real-world person.

An entity resolution rule uses a select-from-where pattern to specify how input entities are linked. The from clause specifies the input sets of entities to be linked. The where clause describes all possible ways in which input entities can match. For example, one can specify that if the names of people in two lists are "similar", then a "candidate" link exists between the two people. Additional clauses, including check, group on and cardinality, specify constraints that filter the "candidate" links. For instance, if only one-to-one matches between people entities are allowed, candidate links that connect one person in one list with multiple persons in another list will be dropped. Next, we describe these clauses in detail.

In our example (Figure 1), we match Person entities with JobChange records using a combination of person name and employment history. If the name of the company that filed the job change already appears in the person's employment history, then we use both company and person name for matching. Otherwise, we perform a strong similarity match only on person name. In both cases, we do not want to create a match if different birthdates appear in the inputs. Furthermore, in this particular entity resolution task, one Person entity can match multiple JobChange records. However, multiple Person entities cannot match the same JobChange record. When this conflict arises, we want to preserve the strongest links (e.g., those that match identical person names).

All these matching requirements are compactly captured in the following entity resolution rule (er1), which we analyze next:

    rule er1: create link PeopleLink as
      select [cik: p.cik, docid: j.docID, span: j.span]
      from Person p, JobChange j, p.emp e
      where match1: e.company = j.company and
                    compareName(p.name, j.name),
            match2: normalize(p.name) = normalize(j.name)
      check if not(null(j.bdate)) and not(null(p.bdate))
            then j.bdate = p.bdate
      group on (j.docID, j.span) keep links p.name = j.name
      cardinality (j.docID, j.span) N:1 (p.cik);

The create clause specifies the name of the output set of entities (called PeopleLink here). The select clause restricts the attributes kept from the input entities to describe the link entities. For each link, we keep the key attributes of the input entities so that we can link back to them (along with any other information that may be required). In er1, we keep the (docid, span) from each JobChange and the person cik. Similarly to SQL, the create and select clauses are logically applied at the end, after processing the other clauses.

The from clause names the sets of entities that will be used to create links, which in our example are the sets Person and JobChange. Interestingly, this clause can also include other auxiliary sets, like the nested set p.emp that contains the employment of a person p. In this way, a user can link entities not only by matching attribute values but also by matching a value (such as a company name) to a set of values (e.g., the set of companies in a person's employment history). The from clause defines a set C of tuples of entities, corresponding, roughly, to the cartesian product of all the input sets. However, if a nested set in the from clause is empty, C will still contain an entry that combines the other parts. In our example, if a particular p.emp is empty, the corresponding Person and JobChange entities will appear in C with a value of null in the p.emp part.

The where clause specifies the possible ways in which the input entities can be matched, and essentially selects a subset of C. Each possible matching has a label (used for provenance of matches) and a predicate on the entities bound in the from clause. Rule er1 specifies two matchings, labeled match1 and match2. A matching predicate is a conjunction of conditions that combine equality and relational operators (e.g., e.company = j.company), boolean matching functions (e.g., compareName(p.name, j.name)) and transformation functions (e.g., normalize(p.name)). For example, match1 states that a JobChange entity can match a Person if the company name in JobChange is in the Person's employment history and the person names match. For comparing person names, match1 uses a specialized UDF compareName. Note that match2 uses only an equi-join on the normalized person names, to account for those cases where the company filing a job change for a person is not in the employment history of that person.

HIL filters out any tuple in C that does not satisfy at least one of the specified matchings. In effect, every matching ri (1 ≤ i ≤ n) results in a subset Ci = σri(C) ⊆ C. The result of the where clause is then the union of all these subsets, W = C1 ∪ … ∪ Cn, which we call the "candidate links". An important aspect is that all matchings in an entity resolution rule will be evaluated, regardless of their relative order and whether a matching evaluates to true or false.

Entity resolution rules can also specify semantic constraints that are required to hold on the output links, and provide explicit resolution actions on constraint violations, ensuring that the result is deterministic. The clauses check, group and cardinality serve this purpose and appear in an entity resolution rule in this order.

A check clause specifies further predicates that are applied to each candidate link. A check clause has the form if pk then ck, with pk and ck being predicates over the candidate links. For every candidate link in W, if pk evaluates to true, then we keep the link only if ck also evaluates to true. In our example, we want to enforce that if the entities for a person in a candidate link contain non-null birthdates, then the birthdates must match. In effect, a check clause specifies a global condition that must be satisfied by all candidate links matching pk, regardless of the matching predicates. That is why, although this condition could be "pushed up" to each matching predicate, it is more convenient to specify it in a check clause.

The group on clause applies predicates to groups of candidate links. The clause specifies a list of attributes that serves as a grouping key, and a predicate that is applied to all entities in a group. In our example, a person occurrence in a JobChange entity (identified by (docID, span)) may be linked to multiple entities in Person. We want to examine all these links together. Any link where the person name in both linked entities is exactly the same should be kept (while the other links are rejected), because having the same name provides stronger indication that we actually have a match. Of course, when there are no such "strong" links, in our example, we keep the weaker links. Additional group and cardinality constraints can be specified to further refine the links. We could also specify that only the strongest links survive, by just changing the keep links part of the group clause to keep only links.

Additional types of group constraints are possible. For example, we can use aggregate functions on the attributes of a group to decide whether to keep the links or not. For example, the constraint

    group on (p.cik) keep links
      e.company = j.company and
      j.apptDate = max(j.apptDate)

keeps the most recent job change among all those filed by the same company for the same person (cik). As another example, we could use the link provenance to select links that are created by the stronger matching predicates. For example, if a JobChange matches several Person entities, then we can give priority to the links created by match1, if they exist.

Finally, a cardinality clause asserts the number of links a single entity can participate in (one or many). For example, the cardinality clause in er1 asserts that each (docID, span) pair should link to exactly one Person entity (but that Person entity can link to many JobChange entities). In the final result, if a (docID, span) pair maps to multiple ciks, then all these links are considered ambiguous and dropped from the output.
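Under our own simplified reading (links as Python dicts carrying only the few fields used below), the constraint stages of er1 can be sketched as three successive filters over the candidate links; this illustrates the declared semantics, not the compiled queries of Section 4.

    def resolve(candidates):
        # check: if both birthdates are present, they must agree
        links = [l for l in candidates
                 if l["j_bdate"] is None or l["p_bdate"] is None
                 or l["j_bdate"] == l["p_bdate"]]
        # group on (docID, span): keep the exact-name links when any exist
        groups = {}
        for l in links:
            groups.setdefault((l["docID"], l["span"]), []).append(l)
        kept = []
        for g in groups.values():
            strong = [l for l in g if l["p_name"] == l["j_name"]]
            kept.extend(strong if strong else g)
        # cardinality (docID, span) N:1 (cik): drop ambiguous occurrences
        out = []
        for key in {(l["docID"], l["span"]) for l in kept}:
            g = [l for l in kept if (l["docID"], l["span"]) == key]
            if len({l["cik"] for l in g}) == 1:
                out.extend(g)
        return out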
position indexes of a person. We give here a few more details. HIL filters out any tuple in C that does not satisfy any of the The rules m5 and m6 are similar to the earlier rules m2 and specified matchings. In effect, every matching ri(1 i n) m3,exceptthatthenewdatavalues(forcompany and title)come ≤ ≤ results in a Ci = σri (C) C.Theresultofthewhere clause now from JobChange,whilethepersoncik comes from PeopleLink. ⊆ n is the union of all these subsets, W = i Ci,whichwecallthe The join between JobChange and PeopleLink is based on docid and “candidate links”. An important aspect is that all matchings in an span,whichformakeyforJobChange.Therulesalsoincludea entity resolution rule will be evaluated, regardless! of their relative filter condition to select only officers (and not directors). order and whether a matching evaluates to true or false. Since HIL uses set semantics, the effect of m5 is that a new com- Entity resolution rules can also specify semantic constraints that pany entry will be inserted into the Employment index only if it did are required to hold on the output links and provide explicit res- not exist a priori (e.g., due to rule m2)forthegivenpersoncik.If the company exists, then there is still a chance that the correspond- rules in the next section, where we give the overall HIL compila- ing set of positions will be changed, since rule m6 may apply. tion procedure that takes into account the recursion that may exist We must also ensure that the earliest and latest dates for a po- between entity population and entity resolution rules. sition are adjusted accordingly, since we now have new data. For The naive semantics of entity population rules is to identify all this, we make sure that the inverted index, PosInfo,thatkeepstrack the applicable rules, that is, rules which generate new facts, and of all the reporting dates for a position, is also updated with the new to insert all the new facts into the target entities (either sets or in- data. Thus, we need another rule (not shown here) that is similar to dexes). This process repeats until no new facts are generated. To the earlier rule m4 except that it uses JobChange and PeopleLink in- avoid such iterative and inefficient process, we use compilation (or stead of IRP.Theactualspecificationforearliest_date,whichmust query generation) to implement the semantics. The main assump- be incorporated in rule m6,remainsthesame:theequation(*)and tion that we make is that there is no recursion allowed among the the discussion at the end of Section 2.1 apply here too, with the entity population rules; hence, we can topologically sort the entities difference that the minDate aggregation now works on a larger set. based on the dependencies induced by the rules, and then generate Note that we did not need to modify the main rule m1! for Per- unions of queries (with no recursion) to populatetheentities. son.Also,wedidnotneedanynewentitiesorindexes.Thenew We break query generation into several steps. In the first step, the rules simply assert new data into the same indexes declared by the baseline for HIL compilation,indexesareimplementedasfunc- initial mapping phase. This pattern typically applies when fusing tions and index lookups as function calls. 
We must also ensure that the earliest and latest dates for a position are adjusted accordingly, since we now have new data. For this, we make sure that the inverted index, PosInfo, which keeps track of all the reporting dates for a position, is also updated with the new data. Thus, we need another rule (not shown here) that is similar to the earlier rule m4, except that it uses JobChange and PeopleLink instead of IRP. The actual specification for earliest_date, which must be incorporated in rule m6, remains the same: the equation (*) and the discussion at the end of Section 2.1 apply here too, with the difference that the minDate aggregation now works on a larger set.

Note that we did not need to modify the main rule m1' for Person. Also, we did not need any new entities or indexes. The new rules simply assert new data into the same indexes declared by the initial mapping phase. This pattern typically applies when fusing in any new data source: first, write entity resolution rules to link the new source to the existing target data, then write entity population rules to fuse the new data into the target entities (indexes).

2.4 Other HIL Features

For space reasons, we omit the syntax of HIL here and instead refer the reader to an earlier IBM research report [15]. We include next a discussion of a few other patterns, also expressible in HIL, which are common in complex data integration scenarios.

We have mentioned that user-defined functions can be used to cleanse and normalize the individual values that appear in a source attribute. A slightly different operation that is also common and must involve user-defined functions is conflict resolution. Such an operation is needed when the integration process yields multiple (conflicting or overlapping) values for an attribute that is required to be single-valued, if certain functional dependencies must hold.

To illustrate, consider the earlier rule m1 in Section 2.1. If a person with a given cik appears under different names in the data sources, then the resulting set of Person entities will contain duplicate entries (each with a different name) for the same cik. To avoid such duplication, the typical solution in HIL is to maintain a separate index, call it Aliases, which collects all the variations of a person's name across all known inputs. Rules must be added to explicitly populate the Aliases index from the given data sources. Furthermore, the rule m1 for Person must be modified so that a unique name is selected, possibly via a user-defined function, from the list of aliases. This process becomes more sophisticated if further attributes, such as the provenance of each alias, are also maintained in the index and then used in the selection function.

Finally, we mention two aspects related to entity resolution that can also be expressed in HIL: blocking and score-based matching. Blocking is a common mechanism that is used to reduce the number of comparisons among input entities, by partitioning them according to some criteria (called blocking criteria or keys). Score-based matching, on the other hand, allows matching decisions to be made based on scores assigned to pairs of entities. This flavor of entity resolution is widely used in practice and appears in several commercial systems. A score-based entity resolution rule in HIL uses matching conditions in the where clause that are based on UDFs that compute the similarity of two records (e.g., based on distance, on features, or on probabilistic similarity measures [4]). The scores computed by the matching conditions can then be used in the check clause (e.g., averaged and compared to a threshold value).

3. COMPILATION: ENTITY POPULATION

We now describe the compilation of HIL entity population rules into efficient runtime queries. We will discuss entity resolution rules in the next section, where we give the overall HIL compilation procedure that takes into account the recursion that may exist between entity population and entity resolution rules.

The naive semantics of entity population rules is to identify all the applicable rules, that is, the rules which generate new facts, and to insert all the new facts into the target entities (either sets or indexes). This process repeats until no new facts are generated. To avoid such an iterative and inefficient process, we use compilation (or query generation) to implement the semantics. The main assumption that we make is that no recursion is allowed among the entity population rules; hence, we can topologically sort the entities based on the dependencies induced by the rules, and then generate unions of queries (with no recursion) to populate the entities.

We break query generation into several steps. In the first step, the baseline for HIL compilation, indexes are implemented as functions and index lookups as function calls. In a second step, we transform the baseline queries into more efficient queries, where indexes are implemented as materialized binary tables and index lookups are implemented via joins. A final optimization step, which is targeted at Hadoop, identifies multiple indexes that can be coalesced into super-indexes, which can then be accessed in a single join operation from the parent entity. Other Hadoop-oriented optimizations that are implemented in our compiler include the pipelining of intermediate results whenever possible, and the delaying of duplicate elimination³ until needed. Both optimizations allow the fusion of multiple map jobs into a single job on a Map/Reduce platform.

³ As already mentioned, HIL is based on set semantics.
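The dependency-driven staging can be sketched with a standard topological sort. The Python fragment below is our reconstruction for the rules of Section 2 (the edge set is read off rules m1'-m4 and is otherwise hypothetical):

    from graphlib import TopologicalSorter

    # deps maps each entity to the entities it reads from (via some rule).
    deps = {
        "Person":     ["IRP", "Employment"],   # m1'
        "Employment": ["IRP", "Positions"],    # m2
        "Positions":  ["IRP", "PosInfo"],      # m3 with (*)
        "PosInfo":    ["IRP"],                 # m4
    }
    # Materialize each entity only after everything it depends on.
    order = list(TopologicalSorter(deps).static_order())
    print(order)  # e.g. ['IRP', 'PosInfo', 'Positions', 'Employment', 'Person']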
3.1 Baseline Compilation Algorithm

We now describe the baseline algorithm for query generation. For each entity that appears in the insert clause of an enriched rule, we generate a query term to reflect the effect of that rule. Since there may be many rules mapping into the same entity, the query for an entity will include a union of query terms (one per rule). In the additional case when the entity is an index, we encapsulate the union of query terms into a function. Furthermore, we parameterize the query terms by the argument of the function.

We illustrate on our running example. Assume first that m1' and m2 from Section 2.1 are the only available rules. The following two queries (written here in an abstract syntax that resembles the rules) are generated for Person and Employment.

    Person := select [ name: i.name, cik: i.cik,
                       emp: EmploymentFn (cik: i.cik) ]
              from IRP i;

    EmploymentFn :=
      fn (arg). select [ company: i.company,
                         positions: PositionsFn ([ cik: i.cik,
                                                   company: i.company]) ]
                from IRP i
                where arg = [cik: i.cik] and i.isOfficer = true;

The first query is immediate and reflects directly the rule m1'. The main difference is that, to compute the value of emp, we use a function call that corresponds to the index lookup on Employment. The second query, for Employment, is the actual function, with a parameter arg that represents possible keys into the index. The function returns a non-empty set of values only for a finite set of keys, namely those that are given by the rule m2 (assuming, for now, that this is the only rule mapping into Employment). More concretely, if the parameter arg coincides with an actual key [cik: i.cik] that is asserted by the rule m2, then we return the set of all associated entries. Otherwise, we return the empty set. To achieve this behavior, the body of the function is a parameterized query term, whose where clause contains the equality between the argument and the actual key. Similarly to the query for Person, the positions field in the output employment record is computed via a call to a function (not shown here) that implements Positions.

In the case when multiple rules map into an entity, the expression defining that entity incorporates a union of query terms. For our example, if we consider the additional rule m5 for Employment, the expression for EmploymentFn changes to:

    EmploymentFn :=
      fn (arg). select [ company: i.company,
                         positions: PositionsFn ([ cik: i.cik,
                                                   company: i.company]) ]
                from IRP i
                where arg = [cik: i.cik] and i.isOfficer = true
                union
                select [ company: j.company,
                         positions: PositionsFn ([ cik: l.cik,
                                                   company: j.company]) ]
                from JobChange j, PeopleLink l
                where j.docid = l.docid and j.span = l.span
                  and arg = [cik: l.cik]
                  and isOfficer (j.appointedAs) = true;

For a given parameter arg, there are now two query terms that can generate entries for the Employment index. The first query term is as before; the second query term, obtained from rule m5, contains a similar condition requiring equality between the parameter arg and the actual key [cik: l.cik].

As shown in these examples, during HIL compilation we use an intermediate query syntax that is independent of a particular query language, although it is similar to an object-oriented or complex-value SQL. Translating from this syntax to a language such as Jaql or XQuery is immediate. In our implementation and experiments, we use Jaql as our target execution language.

While the baseline algorithm gives rise to query expressions that map directly to the HIL entity types and rules, these query expressions can also be inefficient. In particular, indexes are not stored; an index lookup is computed, on the fly, by invoking the function associated with the index, which in turn executes the query terms inside its body. As a result, the query terms within a function will be executed many times during the evaluation of a HIL program. We modify next the baseline strategy to avoid this inefficiency.
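To see the cost concretely, here is a small Python rendering (ours, and heavily simplified: positions are dropped) of the baseline strategy, where the index is an ordinary function and every lookup rescans IRP:

    def employment_fn(arg_cik, irp):
        # Baseline: the index is a function; each call re-runs the query term.
        return {i["company"] for i in irp
                if i["isOfficer"] and i["cik"] == arg_cik}

    def person(irp):
        # One lookup per IRP record: the inner scan makes this quadratic.
        return [{"name": i["name"], "cik": i["cik"],
                 "emp": employment_fn(i["cik"], irp)} for i in irp]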
3.2 Finite Maps as Binary Tables

For each HIL entity that is an index (finite map), we generate a query that produces a binary table representing the graph of the finite map, that is, the set of all pairs (k, v), where k is a key and v is the value for that key. Since v is typically a set (e.g., for each person cik, we have a set of employment records), the generated query consists of two parts. First, we generate a union of queries to accumulate pairs of the form (k, e), where e is an individual value (e.g., a single employment record). Then, we apply a group by operation to collect all the entries for the same key into a single set. To illustrate, instead of using a function for the Employment index, we can use the following query:

    Employment := group by key
      (select [ key: [cik: i.cik],
                val: [company: i.company,
                      positions: PositionsFn ([cik: i.cik,
                                               company: i.company]) ] ]
       from IRP i where i.isOfficer = true
       union
       select [ key: [cik: l.cik],
                val: [company: j.company,
                      positions: PositionsFn ([ cik: l.cik,
                                                company: j.company]) ] ]
       from JobChange j, PeopleLink l
       where j.docid = l.docid and j.span = l.span
             and isOfficer (j.appointedAs) = true);

The transformation from EmploymentFn to the actual query for Employment is not yet complete, since the Positions index is still accessed via a function call. We will show how to change this shortly. The two inner query terms are similar to the ones in the earlier EmploymentFn; however, instead of being parameterized by the argument key, they explicitly output all the relevant (key, value) pairs. The outer group by is an operation that transforms set [key: t1, val: t2] into set [key: t1, val: set t2], with the obvious semantics.

We now briefly describe how to modify the queries that refer to an index. For each reference to an index (earlier expressed via a function call), we use a join with the binary table that materializes the index. Since an index is a finite map (i.e., it is defined for only a finite set of keys), the join must be an outer join, where the nullable part is with respect to the index that is being invoked. To illustrate, the earlier query for Person is replaced with:

    Person := select [ name: i.name, cik: i.cik, emp: emptyIfNull (e.val) ]
              from IRP i left outer join Employment e
              on [cik: i.cik] = e.key;

In the above, the left outer join has a semantics similar to its SQL correspondent. Thus, the query always emits an output tuple for each entry in IRP. Furthermore, if there is a match with Employment, as specified by the on clause of the outer join, then e.val is non-null and becomes the output set of employment records. If there is no match, then e.val is null and we output the empty set for emp. The function emptyIfNull has the obvious meaning.

The actual procedure for replacing index lookup operations with such joins is slightly more involved, since it needs to account for the case when a query term has multiple bindings in its from clause and also has its own where clause. In such a situation, we first construct a closure query that includes "everything" that the query term needs (except for the index lookup itself). This closure query is then outer-joined with the binary table representing the index.
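Continuing the simplified Python rendering from Section 3.1 (ours, with positions again omitted), materializing the index as a binary table replaces the repeated scans with one grouping pass and cheap lookups:

    from collections import defaultdict

    def materialize_employment(irp):
        # Build the binary table: group (key, entry) pairs into per-key sets.
        table = defaultdict(set)
        for i in irp:
            if i["isOfficer"]:
                table[i["cik"]].add(i["company"])
        return table

    def person(irp):
        emp = materialize_employment(irp)   # computed once
        # Left outer join against the table; emptyIfNull on a missing key.
        return [{"name": i["name"], "cik": i["cik"],
                 "emp": emp.get(i["cik"], set())} for i in irp]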
3.3 Index Fusion with Co-Group

The final step in the compilation of entity population rules is an optimization that is targeted at the Map/Reduce (or Hadoop) platform. This optimization makes use of a co-group operation that is present in Jaql, and also in Pig Latin, and has the ability to group multiple input sets, in parallel, by the same key. Applying this optimization has a significant impact when an entity needs to aggregate data from many indexes that share the same key. Rather than generating a sequence of binary joins between the parent entity and each of the indexes, we first generate a query to fuse all the contributing indexes into a single super-index with the same key. This super-index is then joined, via a single left-outer-join operation, in the query for the parent entity.

As an example, assume that in addition to the earlier Employment index, we now have a few more indexes that accumulate further information about a person: Board (the board membership history of a person), Transactions (the most recent stock transactions by a person), and possibly others. These indexes use the same key (the cik of a person) to map to the corresponding entries for a given person. By using the outer-join strategy outlined in the previous subsection, the main query for Person needs to include a sequence of left outer joins, one for each index that needs to be accessed. Instead, using the co-group strategy, we perform an index fusion operation to merge such indexes into one super-index. Index fusion is implemented as a single operation of the form⁴:

    FusedIndex := cogroup Employment by x = Employment.key,
                          Board by x = Board.key,
                          Transactions by x = Transactions.key
                  into [ key: x,
                         Emp_value: Employment.value,
                         Board_value: Board.value,
                         Transactions_value: Transactions.value];

⁴ The actual Jaql syntax is slightly different, but the idea is the same.

Intuitively, all the participating indexes are partitioned based on their keys; then, for each key value (x), we stitch together the entries from all the participating indexes that actually have something for that key (or put null otherwise). The main entity for Person can then be obtained via a single outer join with FusedIndex that retrieves in one shot the employment, board, and transaction entries for a person. The implementation of this strategy requires three main steps: detection of when multiple indexes can be fused together, generation of the co-group queries that materialize fused indexes, and rewriting of the queries that can benefit from fused indexes.

While conceptually simple, index fusion can be very beneficial, especially when the number of indexes that can contribute to an entity becomes large. As part of our experiments, we will show in Section 6 how this optimization greatly improves the run-time performance of the fusion of person entities from social media (Twitter).
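The effect of the co-group can be sketched in a few lines of Python (our illustration over dict-based indexes; the names mirror the example above): all participating indexes are partitioned by key in one pass, and missing entries surface as None.

    def cogroup(**indexes):
        keys = set().union(*[idx.keys() for idx in indexes.values()])
        return {k: {name: idx.get(k) for name, idx in indexes.items()}
                for k in keys}

    employment = {1: {"Acme Corp"}}
    board = {1: {"Globex"}, 2: {"Initech"}}
    fused = cogroup(Emp_value=employment, Board_value=board)
    assert fused[2] == {"Emp_value": None, "Board_value": {"Initech"}}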
While conceptually simple, index fusion can be very beneficial, especially when the number of indexes that can contribute to an entity becomes large. As part of our experiments, we will show in Section 6 how this optimization greatly improves the run-time performance of the fusion of person entities from social media (Twitter).

4. COMPILATION: ENTITY RESOLUTION

We first give highlights of the compilation of entity resolution rules in isolation, and then discuss the integrated compilation of both types of rules (entity population and entity resolution).

4.1 Compilation of Entity Resolution Rules

We divide the query generation for entity resolution rules into two steps. The first step handles the where and check clauses. Since the effect of a check clause is local, i.e., it targets individual links, it is safe to apply it in conjunction with the matching predicates of the where clause. The group and cardinality clauses apply to groups of links; thus, all links that belong to a group need to be generated before making a group-based decision on what links to drop. Therefore, these clauses are applied in the second step.

Where and Check Clauses. While the semantics of an entity resolution rule is based on the cross-product of the inputs in the from clause, the compilation algorithm performs two optimizations to produce a more efficient query. First, we use the matching conditions in the where clause to join and select entities from our inputs. Concretely, based on the where clause of er1, we generate the following query for candidate links, corresponding to the union of partial results from each of the matching predicates in er1:

  select [p: p, j: j, emp: e, provenance: 'match1']
  from Person p, JobChange j, p.emp e
  where e.company = j.company and compareName(p.name, j.name)
  union
  select [p: p, j: j, emp: e, provenance: 'match2']
  from Person p, JobChange j, p.emp e
  where normalize(p.name) = normalize(j.name);

In the same spirit, blocking conditions (see also earlier Section 2.4) are also pushed, whenever present, as join conditions in the where clauses of the above query terms.

The second optimization incorporates the conditions of the check clauses within each matching condition. A check clause has the form if pk then ck, which can be re-written as (not pk or ck). As an example, the check clause of er1 is re-written as null(j.bdate) or null(p.bdate) or j.bdate = p.bdate, and then added as a conjunct in the where clause of both query terms in the above union.

While for simplicity the previous query outputs all entities (as well as a provenance attribute), the actual query will project on the attributes mentioned in the select clause of the entity resolution rule, and on any other attributes used in the group and cardinality clauses. To achieve this, the algorithm performs a look-ahead and marks all attributes that need to be carried over.

Group and Cardinality Clauses. Each group and cardinality clause is re-written as a query. For example, the query for the group clause in rule er1 groups candidate links by the (docID, span) attributes and, within each group, checks if there are links that satisfy the condition p.name = j.name. Queries for group constraints are executed in the order specified in the entity resolution rule. The queries required for the cardinality constraints are executed last.

Cardinality clauses are more complex. We outline here what would happen if the cardinality constraint in rule er1 were 1:1 (checking for 1:N is similar but simpler):

  cardinality (docID, span) 1:1 (cik)

This clause requires mapping each pair (docID, span) to exactly one cik and vice versa. To enforce this constraint, we group links by their (docID, span) attributes and we count the number of distinct cik values within each group. Each group of links with more than one cik value is rejected as ambiguous. Then, we group the remaining links by cik and count the number of distinct (docID, span) pairs within each group. We again reject ambiguous groups of links. The remaining links comprise the final set of links that is output.
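Concretely, the two grouping steps can be sketched in the same intermediate syntax (here, CandidateLinks and FinalLinks stand for the materialized inputs and outputs of this phase, and countDistinct is an illustrative helper rather than part of HIL):

  Links1 := select l
            from (group by key [docID, span] CandidateLinks) g, g.val l
            where countDistinct(g.val, 'cik') = 1;

  FinalLinks := select l
                from (group by key cik Links1) g, g.val l
                where countDistinct(g.val, '(docID, span)') = 1;

The first query keeps only the links whose (docID, span) group is unambiguous; the second applies the symmetric check on the cik side.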
4.2 Integrated HIL Compilation

As noted earlier, we do not allow recursion among entity population rules. The main reason for this is to avoid the generation of recursive queries, which are not supported by the languages we target (e.g., Jaql or XQuery). In the absence of recursion, and provided that there are no entity resolution rules, the HIL compilation algorithm constructs a topological sort of all the entities in a HIL program; in this sort, there is a dependency edge from an entity E1 to an entity E2 if there is a rule mapping from E1 to E2. Queries are then generated, in a bottom-up fashion, from the leaves to the roots. The query generation algorithm for each entity E, which was already described, is based on all the rules that have E as their target.

However, when entity resolution rules are present, we do allow a limited form of recursion to take place. Often, in practice, entity resolution needs to use intermediate results in the integration flow, while the results of the entity resolution itself need to be used in the subsequent parts of the flow. Our running example, motivated by Midas, exhibits such behavior. The entity resolution performed in Step 2 of our flow (see Figure 1) makes use of the partial Person entities generated after Step 1. Subsequently, the fusion rules in Step 3 continue to populate Person (and, in particular, the employment records), based on the result of entity resolution.

To achieve such behavior, we take the convention that entity resolution rules induce a staging of the overall program, where we force the evaluation of all the rules prior to a block of entity resolution rules. Thus, the order of the entity resolution rules in a HIL program becomes important. Concretely, for our example, the entity resolution rule er1 in Step 2 requires the evaluation of all the entity population rules in Step 1 of the flow. To this end, we compile all the rules in Step 1 into a set P1 of queries, using the compilation method for entity population rules. We then compile er1, using the method in Section 4.1, into a query P2 that runs on top of the result of P1 (and JobChange, which is source data). The PeopleLink table that results after P2 is materialized and used as new source data in the next stage. This stage compiles together the entity population rules in both Steps 1 and 3, again using the compilation method for entity population rules. As an example, the query that is generated for Employment (see Section 3.2) incorporates rules from both Step 1 and Step 3. The resulting set P3 of queries produces the final data.

Note that, to achieve the fusion of the data produced by the rules in Step 3 with the data produced by the earlier rules in Step 1, we needed to recompile all these entity population rules together. In general, after the evaluation of a block of entity resolution rules, we compile and evaluate all the entity population rules (from the beginning of the HIL program) until the next block of entity resolution rules. Additional optimization is possible, in principle, where the materialized results from one stage (e.g., after P1) are reused in the evaluation of the next stages (e.g., in P3).
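Schematically, the staged evaluation of the running example can be summarized as follows (a restatement of the process just described, with no additional machinery):

  P1 := compile(entity population rules of Step 1)          -- partial Person entities
  P2 := compile(er1) over result(P1) and JobChange          -- materializes PeopleLink
  P3 := compile(entity population rules of Steps 1 and 3)
          over the sources and PeopleLink                   -- final, fused entities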
5. EVALUATION: THE SEC SCENARIO

In this section, we describe our experience in applying HIL to the financial integration scenario from SEC. The goal here is to test not so much the scalability in terms of data size, but rather the specification and execution of a complex scenario (in terms of the rules involved and the types of data), where the input data size is constrained by the domain itself (i.e., the number of relevant documents in SEC). In Section 6, we perform a different experiment to test the scalability of the system, in a parallel setting using a Map/Reduce cluster, where the data scales to hundreds of millions of documents.

We start by describing the implementation and the execution of the three steps of the SEC integration flow in Figure 1, which we call the Basic Person Integration Scenario. We then add several other types of extracted data into the flow and assess performance as the integration flow becomes increasingly more complex.

5.1 Basic Person Integration Scenario

The specification of the first step of the integration (illustrated in Figure 1) was along the lines described in Section 2.1, but included additional rules and entities to produce board memberships (in addition to employment), as well as rules to handle their provenance and temporal aggregation. The HIL code for this step comprised 6 target entity types and 6 entity population rules.

The input IRP data consisted of 348,855 records, corresponding to all the XML documents with insider transactions of executives in the finance industry that were archived by SEC from 2005 to 2010. We compiled the HIL specification into Jaql, using the compilation algorithm described earlier, and ran it on an IBM System x3550 with 2 CPUs (4 cores each) and 32 GB main memory. The total running time for the first step was 1min 28s, and the result consisted of 32,816 Person entities (15.77 MB). The resulting Person entities were then used in the entity resolution step of Figure 1. This step required two rules, one as shown in Section 2.2 and one that is a slight variation exploiting board membership history. The input JobChange data had 1,077 records extracted from text documents. The running time for entity resolution was 12s, and the result consisted of 694 links. The third integration step included 5 more rules for fusion. These rules were along the lines described in Section 2.3 and covered the fusion of board membership in addition to employment histories. They were compiled together with the rules of the first step (as discussed in Section 4) and applied to take into account the links generated by entity resolution. The resulting Jaql ran in 1min 48s, where the new running time includes the joins between JobChange and the links, in addition to the processing of IRP records and the fusion of the results. We obtained the same number of Person entities as after the first step, but each entity is now more complete, with data fused from both IRP and JobChange.

We note that HIL is compiled to optimized Jaql code that, as expected, is more complex than the HIL specification itself. For example, the two entity resolution rules mentioned above are 2.32 KB on disk, compared to the 12.54 KB of corresponding Jaql code. In contrast to the Jaql code, which consists of a staging of many different queries with various intermediate results, the HIL rules are more readable, and they give a succinct indication of what links are created, from what fields, and by which matching functions. As another example, the fusion rules in the third step of this scenario were relatively simple in HIL (as in Section 2.3). In contrast, the compiled code that fused the new rules with the rules in the initial mapping step is complex, as it required staging of the process based on the dependencies between entities, as well as many steps such as the materialization of indexes, the use of outer joins, grouping, duplicate elimination, etc. Note that this complexity is not due to Jaql but to the inherent complexity of the problem and the data. A language such as XQuery, which is similar in spirit to Jaql, would also require complex operations. In the HIL framework, all these low-level operations are hidden from the programmer and automatically handled via the optimizing compiler.

5.2 Adding More Data Sources

We scale up the Basic Person Integration Scenario by adding several new types of extracted data. Each type of extracted data acts as a new data source, since it has its own format (schema) and its own set of records. Before the actual fusion, each data source is first linked to the initial Person entities created in Step 1 of the basic integration scenario, in the same way JobChange was linked before fusion. The characteristics of the data sources (together with the previous ones, IRP and JobChange) are summarized in Table 1. For each source, we give a count of the records, of the relevant attributes (i.e., those actually used in HIL rules), and of the links that are created by entity resolution. We also list the number of rules that each new data source requires.
We distinguish between entity population rules, P-rules, and entity resolution rules, ER-rules.

Table 1: Characteristics of the SEC data to integrate.

  Data source             #Rcds     #Attr   #Links     #P-Rules   #ER-Rules
  IRP (from XML)          348,855   9       no links   6          0
  JobChange (from text)   1,077     7       694        5          2
  Committee (from text)   63,297    10      50,533     3          2
  Bios (from text)        23,195    9       19,750     5          2
  Signatures (from text)  319,154   11      215,848    5          2

[Figure 4: Integration times in the Financial Scenario. (a) ER-Time and P-Time as data sources are added; (b) ER-Time for Signatures as the number of rows grows from 79,789 to 319,154.]

Figure 4(a) shows the performance of the HIL-generated code with an increasing number of sources. We have described in Section 5.1 the initial computation of Person entities from IRP, as well as the addition of JobChange, which is represented by the first data point in Figure 4(a). With each increasing number of data sources, we include: (1) the total time, ER-Time, to generate the links between the data sources and the Person entities (as generated in Step 1 from IRP), and (2) the total time, P-Time, to fuse all the data (obtained by re-compiling and running all the entity population rules for all the sources so far). The second data point in Figure 4(a) corresponds to adding Committee into the flow. The entity resolution time (89s, or 1min 29s) now includes the previous entity resolution time (for JobChange) and the additional time to link Committee. P-Time (154s, or 2min 34s) accounts for running all HIL rules for fusing all three data sources together. After adding Bios into the flow, the cumulative entity resolution time, including now the time to link Bios, is 129s, or 2min 9s. The fusion time for all four data sources is 172s, or 2min 52s. Even though the number of records in Bios is not that large (23,195), the integration flow now processes a lot more textual information (the biographies). This is also reflected in the fact that the resulting set of Person entities is significantly larger, in size, than for the previous data point (43.3 MB vs. 19.2 MB).

Finally, the fourth data point corresponds to adding Signatures into the flow. These records are extracted from a special signature section of a certain type of input documents, and give additional information about key people and their employment (that may have been missed by the other types of extraction). The cumulative entity resolution time is now 8min 49s. We notice a significant increase in the entity resolution time, which is explained by the large number of Signatures records that are compared with the Person entities, as well as by the large number of links that are generated and checked for the cardinality constraint. (These are N:1 links from Signatures to Person.) The total fusion time (P-Time) for all five data sources is now 4min 4s.

An additional experiment focuses on entity resolution alone and its performance with respect to the number of entities to be resolved. Here, we keep the number of Person entities constant (32,816) and modify the number of Signatures to be resolved (from 1/4 of the initial Signatures file to 4/4 of the file). Figure 4(b) shows that the execution time increases linearly with the input size.

5.3 Further Remarks

Integration of entities from SEC was, in general, a complex process, and we have described here only a core part. The full-fledged integration also includes rules for generating company entities, investment entities, relationships (including lender/co-lender and parent-subsidiary types of relationships), and many user-defined functions for cleansing (of people names, company names, etc.) and for conflict resolution. Other integration logic included rules for aggregating the stock transactions made by executives or directors of companies in various years, as well as temporal analysis rules to determine how much such insiders have been holding in company stocks at any given time.
All of these rules were expressed in HIL and used various indexes to access the transaction and holding information along various dimensions (company, person, year, type of security, type of ownership, etc.).

In total, we used 71 HIL rules, populating and linking 34 entity types and calling 21 UDFs, all split into 11 scripts. The UDFs ranged from very simple ones, such as strToUpperCase, isNull, and sortReverseByDate, to more complex ones, such as normalizeCompanyName and xmlToJson (to convert an XML document to JSON format). The size of the generated Jaql code is about 100 KB on disk, and the entire flow runs in approximately 25 mins. The full-fledged HIL-based integration of entities from SEC was used in IBM to populate a commercial master data management (MDM) system, which was demonstrated to financial analysts and regulators (including SEC).

6. EVALUATION: LARGE-SCALE INTEGRATION OF SOCIAL MEDIA PROFILES

In this section, we give a different evaluation of HIL that is targeted at integration from social media, and in particular from Twitter. When compared to the SEC scenario, the data is now much larger, with the overall result consisting of tens of millions of entities, aggregated from hundreds of millions of Twitter messages. Each output entity represents the profile of a social media user, and contains attributes whose values are extracted and fused from the Twitter messages involving that particular user over a time period.

The characteristics of the data are described in Table 2. The input size is measured in terms of the cumulative number of days (from 1 to 6) of Twitter messages that we process, where for each day we have access to 10% of the messages (obtained through the Twitter decahose). For each such number of days, we first apply an extraction phase based on SystemT [6], where each input tweet (an unstructured document) can be annotated with any number of attributes ranging from personal information (address, name, marital status, etc.) to life events (birthdays, weddings, etc.), employment (occupation, employer, etc.), and shopping-related attributes. The latter attributes describe a person's purchase intent or sentiment towards products in the retail domain (e.g., intent to buy a camera at Best Buy, intent to buy groceries at Target, a positive comment about a Samsung smartphone, etc.). The second column in Table 2 shows the total number of tweets that were annotated, for each day. The set of annotated tweets is then passed as input to the fusion phase, which applies HIL rules to obtain a set of integrated profiles, each corresponding to a Twitter user. The last column in the table shows the resulting number of such user profiles. Note that the number of profiles does not increase linearly with the number of days. This is intuitive: as we analyze more Twitter messages, there are fewer new users to be discovered. Instead, the user profiles become more complete in terms of the data they accumulate.

Table 2: Data characteristics for social media integration.

  #days   #Tweets (cumulative)   #Entities (cumulative)
  1       21.72 million          7.06 million
  2       44.13 million          10.64 million
  3       68.25 million          14.01 million
  4       91.13 million          16.15 million
  5       116.24 million         17.80 million
  6       140.17 million         19.16 million

The HIL specification includes a main entity population rule that constructs Person profiles. This rule relies on 12 auxiliary indexes that accumulate facts about each person across all tweets. The indexes all use the same key, namely the Twitter userid, which is in the metadata associated with each tweet. Each index is populated by one or more HIL rules, based on the extracted annotations. For example, a HIL rule iterates over all tweets that have been annotated by a SystemT "Job" extractor, and inserts relevant facts (with attributes such as JobTitle, JobCategory, Date, the text of the tweet, etc.) into a HIL index called Occupation. Other indexes capture life events, purchase intents, sentiment towards products, etc. The main HIL rule accesses these indexes, via userid, to retrieve either an aggregated single value (e.g., the latest entry in the Occupation index for the given person) or the entire set of records for the given person (e.g., the history of purchase intents).
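As an illustration, the query materializing the Occupation index and the corresponding lookups in the main rule have roughly the following shape (a sketch only: the Tweets source, its annotation attributes, the predicate annotatedBy, and the aggregation helper latest are assumed names, and index lookups are written in the function style of Section 3):

  Occupation := group by key userid
    (select [key: [userid: t.userid],
             val: [jobTitle: t.JobTitle, jobCategory: t.JobCategory,
                   date: t.Date, tweet: t.text]]
     from Tweets t
     where annotatedBy(t, 'Job'));

  Person := select [userid: t.userid,
                    occupation: latest(Occupation([userid: t.userid])),
                    purchaseIntents: PurchaseIntents([userid: t.userid]),
                    ...]
            from Tweets t;

Under the compilation of Section 3.3, the 12 index lookups in the main rule collapse into a single co-group followed by one outer join, which is the effect measured in Figure 5(a).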
Figure 5 gives the run-time HIL performance for generating Person profiles, with an increasing number of days of tweets. We highlight: (a) the impact of the co-group optimization (described in Section 3.3) on the main HIL fusion rule, and (b) the resulting performance for the entire HIL fusion (including the population of all the indexes). The results were obtained by running the HIL-generated Jaql queries on a Hadoop cluster with 10 nodes, each an IBM System x3550 with 2 CPUs (4 cores each) and 32 GB memory.

[Figure 5: Performance of HIL fusion over Twitter data: (a) effect of the co-group optimization on the single HIL fusion query using 12 indexes (with vs. without co-group); (b) total fusion time for social media profiles from Twitter (after the co-group optimization).]

As can be seen, the co-group optimization achieves a reduction of 30% to 50%, by replacing in this case 12 outer joins with a single co-group followed by one outer join. The effect of co-group is larger if more indexes are added to the integration flow (to incorporate additional extraction). We note that the main HIL fusion is a relatively large portion of the total integration time. For example, for 6 days of tweets, the time taken by the main HIL fusion (via co-group) is 11.58 mins, while the total time (including the creation of all the indexes) is about 50 mins. The gain in performance due to co-group becomes even more significant with further increases in both the time period and the complexity of the analysis.

Full-fledged integration of entities from social media is more complex than what we have described here. To build user profiles with rich enough historical information, we have extracted and integrated data that spans months of tweets, with billions of messages. Going beyond Twitter, we have developed entity resolution rules to link with other social media sites such as blogs and forums. These HIL rules exploit combinations of attributes such as name, location, and other contextual clues, and were able to link with high accuracy over 2 million profiles across multiple social media sites.

7. RELATED WORK

Our HIL framework bridges two lines of data integration research: schema mapping [11] and entity resolution [10]. A key differentiating aspect between schema mapping formalisms (e.g., s-t tgds [12] or SO tgds [13]) and the entity population rules in HIL is that the latter are designed to facilitate direct programming by a user. Associations between entities are explicitly given in HIL via indexes, which effectively eliminate the need for existential quantifiers or Skolem functions. Although technically similar to the dictionaries in the query optimization framework of [8], HIL indexes differ in that they capture the fusion logic in a data integration flow.
Regarding entity resolution, Dedupalog [1] is another declarative language that uses constraints to specify properties on the outcome of entity resolution. However, Dedupalog has no transformation or fusion (i.e., the equivalent of entity population in HIL), and it only allows expressing constraints (and no matching rules) on the candidate links. Furthermore, much of the execution logic is hidden behind a black-box system that attempts to minimize the number of constraint violations. In contrast, HIL rules provide explicit resolution actions on constraint violations, with deterministic results. Ajax [14] is an early data cleaning framework; however, it was focused on matching and clustering and less on mapping and fusion, and did not have a notion of logical entities. XClean [22] is a data cleaning framework for XML; however, its language is at a lower granularity than HIL, with clauses that correspond to physical-level operators that are manually orchestrated in a procedural flow.

Model management [17] provides a high-level scripting framework, but operates at the metadata (schema) level. Closer to HIL is iFuice [21], which combines mapping with fusion of data; however, it has no entity resolution (it assumes instead that the links are given), and its fusion is focused on attributes (whereas fusion in HIL applies, more generally, in a hierarchy of entities and is driven by indexes). Finally, there are several query/transformation languages with complex data processing capabilities (but not focused on data integration), including XQuery, XSLT, Jaql, and Pig Latin, which are all possible target languages for HIL compilation.

8. CONCLUSION

In this paper, we introduced HIL, a high-level language for entity integration, we gave algorithms for compilation into runtime queries, and we showed applications of HIL to integration in the financial domain and from social media. An immediate direction is to explore incremental compilation and evaluation algorithms, where the target data is incrementally modified when new data sources and rules are added. More broadly, the use of a high-level language opens up many research directions in the space of reasoning, debugging, and maintenance (evolution) of integration flows.
9. REFERENCES

[1] A. Arasu, C. Ré, and D. Suciu. Large-Scale Deduplication with Constraints Using Dedupalog. In ICDE, pages 952–963, 2009.
[2] K. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, and E. Shekita. Jaql: A Scripting Language for Large Scale Semistructured Data Analysis. In VLDB, 2011.
[3] J. Bleiholder and F. Naumann. Data Fusion. ACM Comp. Surv., 41(1), 2008.
[4] S. Boriah, V. Chandola, and V. Kumar. Similarity Measures for Categorical Data: A Comparative Evaluation. In SIAM, 2008.
[5] D. Burdick, M. A. Hernández, H. Ho, G. Koutrika, R. Krishnamurthy, L. Popa, I. R. Stanoi, S. Vaithyanathan, and S. Das. Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study. IEEE Data Eng. Bull., 34(3):60–67, 2011.
[6] L. Chiticariu, R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, and S. Vaithyanathan. SystemT: An Algebraic Approach to Declarative Information Extraction. In ACL, pages 128–137, 2010.
[7] N. N. Dalvi, R. Kumar, B. Pang, R. Ramakrishnan, A. Tomkins, P. Bohannon, S. Keerthi, and S. Merugu. A Web of Concepts. In PODS, pages 1–12, 2009.
[8] A. Deutsch, L. Popa, and V. Tannen. Physical Data Independence, Constraints, and Optimization with Universal Plans. In VLDB, pages 459–470, 1999.
[9] A. Doan, J. F. Naughton, R. Ramakrishnan, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. J. Gao, C. Gokhale, J. Huang, W. Shen, and B.-Q. Vuong. Information Extraction Challenges in Managing Unstructured Data. SIGMOD Record, 37(4):14–20, 2008.
[10] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate Record Detection: A Survey. IEEE TKDE, 19(1):1–16, 2007.
[11] R. Fagin, L. M. Haas, M. A. Hernández, R. J. Miller, L. Popa, and Y. Velegrakis. Clio: Schema Mapping Creation and Data Exchange. In Conceptual Modeling: Foundations and Applications, pages 198–236, 2009.
[12] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange: Semantics and Query Answering. TCS, 336(1):89–124, 2005.
[13] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Composing Schema Mappings: Second-order Dependencies to the Rescue. ACM TODS, 30(4):994–1055, 2005.
[14] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative Data Cleaning: Language, Model, and Algorithms. In VLDB, pages 371–380, 2001.
[15] M. Hernández, G. Koutrika, R. Krishnamurthy, L. Popa, and R. Wisnesky. HIL: A High-Level Scripting Language for Entity Integration. Technical Report RJ10499, IBM Research, June 2012.
[16] M. Lenzerini. Data Integration: A Theoretical Perspective. In PODS, pages 233–246, 2002.
[17] S. Melnik, E. Rahm, and P. A. Bernstein. Rondo: A Programming Platform for Generic Model Management. In SIGMOD, pages 193–204, 2003.
[18] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD, pages 1099–1110, 2008.
[19] B. C. Pierce. Types and Programming Languages. MIT Press, 2002.
[20] E. Rahm and P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334–350, 2001.
[21] E. Rahm, A. Thor, D. Aumueller, H. H. Do, N. Golovin, and T. Kirsten. iFuice - Information Fusion utilizing Instance Correspondences and Peer Mappings. In WebDB, pages 7–12, 2005.
[22] M. Weis and I. Manolescu. XClean in Action (Demo). In CIDR, pages 259–262, 2007.