<<

Acknowledgments

} Joint Work with people from the BDRI Group at UFAM and InWeb at UFMG

} Industrial cooperation

} Support Contexto – Amazonas

Maior Unidade da Federação 98% da floresta original preservada Maior rede hidrográfica do mundo Contexto – Manaus

600 indústrias Lei de Informática 2,2 milhões habitantes Institutos de P&D 6º PIB em capitais Nokia, Samsung, ... Contexto - UFAM

80 cursos de graduação 5 campi por todo estado 35 cursos de pós-grad Diâmetro de 1500 KM 25 k alunos Contexto - IComp

35 doutores Mestrado e Doutorado (4) Graduação: CC, SI, ½ EC 190 mestres e 7 doutores Contexto – Grupo BDRI

Informaon Machine Professores Databases Retrieval Learning Edleno Silva de Moura Algran Soares da Silva João Marcos Cavalcan Marco Antônio Cristo David Fernandes Moisés Carvalho André Carvalho } JASIST, IEEE TDKE, Information Systems } ACM SIGIR, ACM CIKM, ACM SIGMOD, VLDB, WWW } SBBD, WebMedia

6 Where is the data? } Data of interest is no longer only in databases } They are, though, available in on-line sources } In particular: textual sources } Social networks, Wikis, Blogs, Web of Data, RSS, e-mail, … } Search engines are effective and popular tools } Consensus: } its possible to better exploit them

7 How to deal with it?

} Textual Sources } The structure is only implicit } Meta-data is a luxury } Constraints are a utopia } We do need semantics! } Multiple proposals to increase the expressive power } Syntactically: e.g., XML technology, RDF, etc. } Semantically: e.g., Semantic Web, Linked Data, etc. } Challenge: adoption of standards } Governance is needed, and it is good!! } But, the web was born messy and its is likely to remain like that

8 Any alternative ?

} Possible alternative perspective: } Methods & Techniques for “automatically” gathering, extracting , enriching and exploiting data available in textual Web sources } By no means new! } It has been out there for more than a decade! } New impulse: Industrial needs } Advances in Data Management, Information Retrieval, Machine Learning, Data Mining, Artificial Intelligence, … } Research on this subject is immediately applicable } Motivates a continuous feedback between industry and academia

9 Many Problems …

Data Acquisition Information Extraction & & Recording Cleaning

Dissemination Query Integration, Search Aggregation, Publishing ... Representation

Query Processing, Data Modeling & Analysis

10 It is Big Data !

The Big Data Analysis Pipeline H. V. Jagadish – ACM SIGMOD Blog - 05/06/2012 Challenges & Opportunities w/ Big Data – Online report e-Shopping Aggregation

} e-Shopping Aggregators receive and/or crawl hundreds of thousands unstructured product offers from thousands of stores } Available as ordinary unstructured textual descriptions } Different “styles” depending on the source and on the type of product

Apple iPad 2 Wi-Fi + 3G 64 GB - Apple iOS 4 1 GHz - Black $589 LG - 32LE5300 - 32" LED-backlit LCD TV - 1080p (FullHD) - $400 Samsung - UN55D7000 - 55" Class ( 54.6" viewable ) LED-backlit LCD ... $2,048 Mixter Max Accessory Plasma TV Rack Tilt Bracket 248-A05 $65 HP Deskjet 3050 All-in-One Color Ink-jet - Printer / copier / scanner $50

12 e-Shopping Aggregation

HP Deskjet 3050 All-in-One Color Ink-jet - Printer / copier / scanner $50

13 e-Shopping Aggregation } Main Tasks/Services } Crawl product offers over the Web } Product aggregation: cluster offers of a same product } Categorization: put offers in the right category } Structured search: e.g., search by brand } Product comparison: e.g., give me the cheapest 3D 40” TV

} Easier if data in offers is correctly segmented and labeled

Type Brand Size Screen Type Price TV Samsung 55“ LED-backlit $2,048

14 Live showcase: neemu.com by Also powered by Entity recognition by

People

Places Entity Disambiguation at Management of Bib. References in SHINE

19 Structured Data in Textual Content

} We have studied, developed, published and applied methods and techniques for all of these problems

20 Structured Data in Textual Content

} In this talk, focus on 3 specific results for two problems

21 In this talk

} Information Extraction } ONDUX [SIGMOD’10] and JUDIE [SIGMOD’11] } Filling of Web Forms } IForm [VLDB’11] } Complex Schema Matching } EvoMatch [IS’13]

22 IETS l Information extraction by text segmentation (IETS)

l Extracting semi-structured data records by identifying attribute in continuous text

l bibliographic citations, product descriptions, classified ads, etc

l Ungrammatical text – not suitable for NLP methods

Supervised Methods

l Current IETS methods use probabilistic frameworks such as HMM or CRF

l Learn a model for extracting data related to a domain

l Supervised IETS methods

l Require training data from each source

Regent Square $228,900 1028 Mifflin Ave, 6 Bedrooms 2 Bathrooms 412-638-7273 Supervised IETS

Features

f1, f2, f3,...,fk

g1, g2,g3,...,gl Learning

Labeled Segments (Tranining)

Model

Input Texts

Extraction

Unlabeled Input Strings

Output Labeled Segments Supervised IETS

Text Source 3

Text Source 1

Text Source 2 Supervised IETS

Text Source 3

Text Source 1

Text Source 2 Unsupervised IETS methods

l Learn from datasets - Dictionaries, knowledge bases, references tables, etc.

l No need for manual training for each input

l Source Independent

l IETS methods

l Unsup. CRF (Zhao et al. @SIAM ICDM’08)

l ONDUX (Cortez et al. @SIGMOD’10)

l JUDIE (Cortez et al. @SIGMOD’11) Unsupervised IETS ONDUX & JUDIE

Dataset

Learning

Model

Extraction

Output Labeled Segments Unsupervised IETS - ONDUX & JUDIE

Dataset

Content Features f 1 ,f 2 , f

3 Source 2 ,...,f k

Source 1

Source 3 Unsupervised IETS - ONDUX & JUDIE

Dataset

Content Features f 1 ,f 2 , f

3 Source 2 ,...,f k

Source 1

Source 3 Features } IETS methods rely on two types of features: } Content (or state) features: } Related to the contents of the tokens/strings } Structure (or transition) features: } Related to the location of tokens/strings in a sequence Content Features we use

} Vocabulary: } Similarity betweew strings in the input and values of an attribute from the KB } Value Range: } How close a numeric string in the input is from the mean value of a set of numeric values of an attribute in the KB } Format: } Common style often used to represent values of some attributes } URLs, e-mails, telephone numbers, etc Structure Features we use

} Features } Positioning: } position of the values of a given attribute within the input } Sequencing: } relative order of attribute values within the input } Assumption: } Some regularity in the appearance of attribute values within the input texts } Does not necessarily mean assuming a fixed order of appearance Content x Structure Features } Content Features } Domain-dependent but input-independent } For a given attribute A, can be computed from a any representative set of values in domain of A } e.g., from a previous existing dataset } Structure Features } Dependent of the placement of attributes values on the input } Thus, they are input-dependent

35 Unsupervised IETS methods

Method Content Structure Features Features Mansuri@ICDE’06 Dictionaries Seed instances Agichtein@SIGKDD’04 Reference Tables Sample, assumed to have a fixed order Zhao@SICDM’08 Reference Tables Sample, assumed to have a fixed order Cortez@JASIST’09 Bibliographic Files Heuristics for the bibliographic domain Cortez@SIGMOD’10 Knowledge Bases Automatically Induced Cortez@SIGMOD’11 Knowledge Bases Automatically Induced – multiple records

36 ONDUX

} General View

37 Features – Content Related

} Features Considered:

KB

Attribute Vocabulary

Noisy Ingredient OR Value Range White sugar

Value Format

38 Adding Structure Related Features

Matching

Noisy OR

39 Street Mifflin ONDUX

} Reinforcement } Once the PSM is built, we combine the matching, positioning and sequencing evidences using the Bayesian operator OR.

FS(B,ai ) =1− ((1− M (B,ai )) ×(1− t j,i )×(1− pi,k ))

Matching Result Sequence Positioning

40 Experimental Results

Dataset: Web Ads | Source: Folha On-line U-CRF presented a 1 poor performance (very heterogeneous dataset) 0.8

0.6 Due to the Matching U-CRF

F-Measure F-Measure 0.4 Phase and the PSM ONDUX-M that is learned On- 0.2 ONDUX-R Demand, ONDUX achieve very high 0 quality results

Attributes

41 Reinforcement

42 JUDIE

Chocolate Cake Recipe

1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce 2 cups all-purpose flour 1/4 cup cocoa powder 2 teaspoons baking soda 1/8 teaspoon salt 1 cup raisins 1/4 cup dark rum

Quantity Unit Ingredient 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 43 JUDIE

} Joint Unsupervised Structure Discovery and Information Extraction } Detects the structure of each individual record being extracted without any user intervention } Looks for frequent patterns of label repetitions or cycles } Integrates this algorithm in the IE process } Accomplished by successive refinement steps that alternate information extraction and structure discovery.

44 The SD Algorithm

Coincident Cycles

Title Conference Year Author Author Title Conference Year Author Title Conference Year … Author Title Journal Issue Year Author Title Journal Issue Year Author Author Journal Issue Year Title Year … Author Title Conference Year Author Author Author Title Journal Issue Year

Viable Cycle

Conference Title

Author Ye a r

Journal Issue

45 Comparison with baselines – Attribute Level

Attribute JUDIE ONDUX U-CRF Attribute JUDIE ONDUX U-CRF

Author 0.88 0.922 0.87 Bedroom 0.82 0.86 0.79 Title 0.70 0.79 0.69 Living 0.89 0.90 0.72 Booktitle 0.86 0.89 0.56 Phone 0.87 0.92 0.75

Journal 0.84 0.90 0.55 Price 0.92 0.93 0.78

Volume 0.90 0.96 0.43 Kitchen 0.83 0.84 0.78

Pages 0.86 0.84 0.50 Bathroom 0.77 0.79 0.81

Date 0.87 0.89 0.49 Others 0.73 0.79 0.71

Average 0.86 0.88 0.58 Average 0.84 0.85 0.76 CORA Web Ads } Results very close to ONDUX and even better than U-CRF } Recall: JUDIE faces a harder task. 46 More details ….

} Cortez, Silva, Gonçalves & Moura. ONDUX: on-demand unsupervised learning for information extraction. SIGMOD 2010 } Cortez, Oliveira, Silva, Moura & Laender: Joint unsupervised structure discovery and information extraction. SIGMOD 2011

47 One more …

Data Acquisition Information Extraction & & Recording Cleaning

Dissemination Query Integration, Search Aggregation, Publishing ... Representation

Query Processing, Data Modeling & Analysis

48 The Form Filling Problem

} Goal: } To automatically fill out the fields of a given form-based interface with values extracted from a data-rich free text document. 1. Extracting values from the input text; 2. Filling out the fields of the target form using them. Example

} Form-based interface

Text Box Check-box

Selection List Example

} Data-rich free text document Example

} Form Filling

2005 Honda Accord x x x x x low Automatic

x

Alloy Wheels

x Common usage of Web Forms

} A user manually fills each form field } Text-box, selection list, check-box and radio button } Tedious, error prone and repetitive process

values Our Aproach

} IForm: Information Extraction + Form Filling } A Probabilistic Approach for Automatically Filling Form- Based Web Interfaces } Appeared in PVLBD 2010 / VLDB 2011 } With Guilherme Toda, Eli Cortez and Edleno Moura

54 iForm

} Information Extraction + Form Filling

Data-rich text document Values

Verify Values

} Automatic form filling; iForm

} A probabilistic approach for automatically filling form-based interface

} Relies on a model that estimates the probability of each field in the form given the input text based on the values previously used for filling the form.

} Exploits features related to the content and style, which are combined through a Bayesian framework } tokens (words) composing each segment } wording style of each segment Related Work – Information Extraction

} CRF (Conditional Random Fields): state-of-the-art information extraction approach

} Lafferty, J. et al [ICML,2001] } Peng and McCallum [IPM, 2006] } Mansuri and Sarawagi [ICDE, 2006] } Kristjansson et al [IAAA, 2004]

} Usually requires training instances manually labeled } Extracts all segments in a input text } Iform extracts only relevant segments Related Work – Form Filling

} Chen et al. [ICDE, 2010] } USHER, a system used to automatically adapt the form design according to user experience.

} M. Al-Muhammed e Embley D. [ICDE,2007] } An approach that relies on a manually built ontology to guide the user in the form filling process.

} iCRF - Kristjansson et al [IAAA, 2004] - Baseline } CRF approach for the task of automatically filling web forms. } Relies on content and positioning features extracted from training instances } Model requires training instances to be manually labeled. iForm - Overview

Previous Submissions

Data-rich text document Values

Verify Values iForm - Scenario

Web Form

Shutter Island is a 2010 American psychological thriller film directed by . The film is based on 's 2003 novel of the same name . Starring Leonardo DiCaprio, Mark Ruffalo and .

Movie Review - Data-rich text iForm – Selecting plausible segments

} What is the probability of a form field given each text segment?

Shutter Island is a 2010 American psychological thriller film directed by Martin Scorsese. The film is based on Dennis Lehane's 2003 novel of the same name . Starring Leonardo DiCaprio, Mark Ruffalo and Ben Kingsley.

Shuer Shuer Island Shuer Island is Shuer Island is a

… Leonardo Leonardo DiCaprio Kingsley.

Redundant computation of several probabilities can be avoided by using dynamic programming. iForm - Features Previous } Features Considered: Submissions

Token

Bayes. Title Noisy Value OR Shutter Island

Style iForm – Token Similarity

} Likelihood of each token present in the segment occurring in each field

Average number of Shuer Island words of each field

Actors Title Director Genre Joshua Jackson Shutter Masayuki … Terror

Previous Mark Man Shutter Bug Paul J. Animation Submissions Mark Rufallo … … … Leonardo DiCaprio The Departed Martin … Thriller Ewan Mcgregor, The Island Michael B. Action The Island of Dr. .. John Frank Terror iForm – Value Similarity

} Likelihood of the value present in the segment occurring in each field

Mark Ruffalo

Previous Submissions

Actors Title Director Genre Seth Rogen Kung Fu Panda Mark Osborne Animation Ben Affleck Daredevil Mark S. Johson Action , … … … Zooey Deschanel Yes Man Peyton Reed Comedy What Doesn’t Brian Goodma Action Mark Ruffalo Zodiac Thriller iForm – Style Similarity

} Given a text segment, we encode it according to a taxonomy of symbols.

Ben Kingsley [A-Z][a-z]+ [A-Z][a-z]+

} Verifies the likelihood of the sequence following the same wording style of the known values for each field iForm – Combining all probabilities

} iForm models the computation of the probability of a field given a segment using a Bayesian network. iForm – Mapping Segments to Fields

} Given the set of text segments such that theirs probability P( f j | S ab ) is above a threshold ε

} iForm aims at finding a mapping between candidate values and form fields with a maximum aggregate probability } Select non-overlaping segments.

} Accomplished by means of a two-phase procedure iForm – Mapping Segments to Fields

} In the first phase, we begin by computing the candidate values for each field based only on content-based features (token + value). } The initial mapping is composed by the set of all candidate values for all fields and contains segment-field pairs. C j

} Goal: To find a subset of segment-field pairs 〈 S ab , F j 〉 in the mapping whose probabilities are maximum. } iForm relies on a simple greedy heuristic to find an approximate solution. iForm – Mapping Segments to Fields

} Extracts the pair 〈 S ab , F j 〉 with the highest probability from the initial mapping and verifies if the current field was already filled with a text segment.

} To deal with fields that were not mapped to a segment, we use the probabilities derived from the style-related features, in the second phase. } We adopt the two phase mapping after verifying through experiments that the style-related feature is less precise than the other two features adopted. iForm – Filling Form-based interfaces

} Uses the final mapping to fill out the form fields } Text Boxes: Mapped text segments as a field values.

title “Shutter Island” Shutter Island

“Martin Scorsese” director Martin Scorsese

} Check boxes: Set true for mapped fields.

“Movie” iForm – Filling Form-based interfaces

} Selection list } iForm aims at finding an item such that its similarity with the extracted value is maximum – “softTF-IDF”

“psychological thriller” iForm - Overview

Web Form

X

ShutterStructure Island Shutter Island is a 2010 American Martin ScorseseSketching psychological thriller film directed by Martin Scorsese. The film is based on LeonardoPhase DiCaprio 2 Previous Mark Ruffalo Dennis Lehane's 2003 novel of the Ben Kingslev same name . Starring Leonardo Submissions DiCaprio, Mark Ruffalo and Ben Thriller Kingsley. Evaluation – Multi-typed web forms

Movies Type of Field # Fields P R F iForm achieved high Text Box 4 0.74 0.69 0.71 quality results in Submission-Level 0.73 0.67 0.69 all datasets

Cars Type of Field # Fields P R F Text Box 5 0.78 0.73 0.76 The quality of iForm was almost the same Check Box 30 0.79 0.79 0.79 for the text box and Average 0.79 0.78 0.79 the check box fields. Submission-Level 0.77 0.73 0.75 Evaluation – Multi-typed web forms

Cellphones Type of Field # Fields P R F Filling quality above Text Box 2 0.89 0.69 0.78 0.90. In fact, more than Check Box 35 0.94 0.94 0.94 90% of each submission was Average 0.94 0.93 0.93 correctly entered in the Submission-Level 0.96 0.94 0.95 web form interface.

Books 1

Type of Field # Fields P R F Precision levels are Text Box 4 0.88 0.67 0.76 above 0.8 in all cases, and submission-level Drop Down 1 0.96 0.96 0.96 f-measure results for Average 0.90 0.73 0.80 this dataset is above Submission-Level 0.89 0.67 0.76 0.7. Evaluation – Comparison with iCRF

Jobs iForm had Field iForm iCRF superior F-measure levels in nine fields. Application 0.82 0.37 Area 0.18 0.23 City 0.70 0.65 The lower quality obtained by iCRF is explained by the fact that Company 0.41 0.17 segments to be extracted Country 0.77 0.87 from typical free text inputs, such as Desired Degree 0.57 0.37 jobs postings, may not Language 0.84 0.69 appear in a regular context. Platform 0.47 0.38 Recruiter 0.44 0.22 iForm was designed to Req. Degree 0.31 0.59 conveniently exploit these field-related Salary 0.22 0.25 features from previous State 0.85 0.81 submissions Title 0.72 0.49 Previous Submissions Impact

For the Movies and Books 1 datasets, the quality achieved by iForm increases proportionally with the number of previous submissions Previous Submissions Impact

Notice that F-measure values stabilize at around 3000 previous submissions and remain the same until 10000. Besides, even starting with a small number of submissions, iForm is able to help decrease the human effort in the form lling task. Conclusions

} A probabilistic approach for automatically filling form-based interface } Relies on a model that estimates the probability of each field in the form given the input text based on the values previously used for filling the form. } Achieved good results in comparison with iCRF } Our experiments demonstrate that our approach is able to properly deal with different types of input fields, such as text boxes, pull-down lists and check boxes } More in } Toda, Cortez, Silva & Moura: A Probabilistic Approach for Automatically Filling Form-Based Web Interfaces. VLDB 2011 The last one …

Data Acquisition Information Extraction & & Recording Cleaning

Dissemination Query Integration, Search Aggregation, Publishing ... Representation

Query Processing, Data Modeling & Analysis

79 Complex Schema Matching

} A group of elements from a given schema match a group of elements from another schema.

given name surname street number address 1 address 2 suburb rose leslie 26 coranderrk street rowethorpe hill end katheri hand 18 derrington crescent homewood kingsthorpe mary white 23 prescott street bonbeach

full name age address area leslei rose 43 coranderrk 26, rowethorpe hi end katherine hand 33 derrington crescent 18 , homewood kingsthorpe mary wite 39 prescott str bonbeach

80 Figure 1: Complex schema matches investigation, counter-terrorism and homeland security. Most often, data repositories manipulated in such scenarios are deliberately deprived from any form of identification or structural clues to make it dicult to be processed and analyzed. Another important issue we need to deal with is that most real-world situations involve complex matches.Di↵erently from the task of finding 1-1 matches (that represent a relationship between single elements from distinct schemas), identifying complex matches is a more challenging problem. This happens because complex matches require finding combinations of existing elements in one schema that are related to combinations of elements into another one. In this article, we propose an evolutionary approach1 that aims at auto- matically finding complex matches between schema elements of two seman- tically related data repositories. Since we only exploit the data stored in the repositories for this task, we rely on matching strategies that are based on record deduplication and information retrieval techniques to find complex schema matches. As pointed out by Gal [19], only limited research has been devoted to more complex schema matches, such as 1:N, N:1 and N:M, thus our evolutionary approach is a step forward towards solving this problem. The reason for adopting an evolutionary approach is the size of the search

1Our approach adopts concepts and ideas that are strongly inspired by genetic program- ming, but does not use the “classical” data structures usually adopted for representing the solutions.

3 Complex Schema Matching

} A group of elements from a given schema matches a group of elements from another schema.

81 Our approach

} An Evolutionary Approach to Complex Schema Matching } Just accepted to Information Systems to appear in 2013 } With Moises Carvalho, Alberto Laender & Marcos Gonçalves } Given two input schema, use an evolutionary process to generate Schema Matching Solutions for them } Start from an initial set of possible spurious/meaningless schema matching solution } Hopefully reach a final meaningful schema matching solution } Use a fitness function to evaluate and refine the solutions been generated

82 Requirements and Assumptions

} Schemata are known, but we can’t rely on attribute names } Different labels, noisy label extraction } Instances are known, we rely on them } Assumed to be abundant

given name surname street number address 1 address 2 suburb rose leslie 26 coranderrk street rowethorpe hill end katheri hand 18 derrington crescent homewood kingsthorpe mary white 23 prescott street bonbeach

full name age address area leslei rose 43 coranderrk 26, rowethorpe hi end katherine hand 33 derrington crescent 18 , homewood kingsthorpe mary wite 39 prescott str bonbeach

Figure 1: Complex schema matches

83 investigation, counter-terrorism and homeland security. Most often, data repositories manipulated in such scenarios are deliberately deprived from any form of identification or structural clues to make it dicult to be processed and analyzed. Another important issue we need to deal with is that most real-world situations involve complex matches.Di↵erently from the task of finding 1-1 matches (that represent a relationship between single elements from distinct schemas), identifying complex matches is a more challenging problem. This happens because complex matches require finding combinations of existing elements in one schema that are related to combinations of elements into another one. In this article, we propose an evolutionary approach1 that aims at auto- matically finding complex matches between schema elements of two seman- tically related data repositories. Since we only exploit the data stored in the repositories for this task, we rely on matching strategies that are based on record deduplication and information retrieval techniques to find complex schema matches. As pointed out by Gal [19], only limited research has been devoted to more complex schema matches, such as 1:N, N:1 and N:M, thus our evolutionary approach is a step forward towards solving this problem. The reason for adopting an evolutionary approach is the size of the search

1Our approach adopts concepts and ideas that are strongly inspired by genetic program- ming, but does not use the “classical” data structures usually adopted for representing the solutions.

3 Figure 5: Example of a simple match between two schemas.

Schema Matching Solutions. Let A and B be two distinct schemas. A schema matching solution is defined by a set of matches m1,...,mn (n>0) between them. { }

Figure 6 shows an example of a schema matching solution composed of two matches labeled M1 and M2. For simplicity, in this figure base elements Schema Matching Solutions (SMS) are represented by numbers. Attributes Similarity Function Derivation Tree Operators SchemaA A SchemaB B

concat

1 3 concat M1

4 5

concat

concat M2 1 concat 1 3 2 3

84 Figure 6: Example of a schema matching solution composed of two matches.

The ultimate goal of our proposed approach is to generate suitable schema matching solutions for two given input data repositories (e.g., two relational tables), in the sense that they accurately reflect the semantic equivalence between schema elements from these repositories. In the evolutionary process we propose, schema matching solutions are the individuals that evolve, starting from an initial population of possible spurious individuals, i.e., a meaningless schema matching solution, to a final population of fitting individuals, i.e., a meaningful schema matching solution. We describe this process next.

14 1A 1A 1B 1B 2A 2A 1A 1A 1B 1B 1A 1A

+ +

+ + + +

a a + + a / a / a / C1a / C1

a b a b a)a)a / a / b) b) c) c) a a b c b c b c b c b c b c

+ + + + + + + +

+ + + + a / a / a / a / a / C2 a / a C2/ a /

a / a / a / a / b b b c b b b c b c b c b b b c b b b c b c b c

b b b c b b b c b c b c

2A 2A 2B 2B 1B 1B 2B 2B 2A 2A 2B 2B

+ + + + + +

+ + + + a a a / a / a / a / C3 a C3/ a /

a / a / a / a / b c b c b c b c b c b c

b c b c b c b c

+ + + + + + + +

+ + + + a / a / a / C4a / a / C4 a / a b a b

a / a / b b b c b b b c a b a b b c b c b b b c b b b c

b b b c b b b c SMS Evolution

K evolutionary steps 1A1A1A1A 1A 1B1A1B1A1B1B1A1B 1B 1B 1B 2A2A2A2A 2A 2A 2A 2A d) +d) + e) 1Ae) 1A1A1A1A1A2A1A1B1A1B1B1A2A1Bf)1A1B 1B1Af)1B 1B1A 1B 1B 1A1A1A1A 1A 1A 1A 1A

+ + + + + + + + + + + + + + + + + + + + + + + +

a a a a a a + + a a + + + + + + + + + + + a / a / a / + a / a / a / a / a / a / a / a / C1 C1a C1/ a / C1a / C1 C1 a / C1 a / C1 a b a b a b a b a b a b + + a b a b / a / a a)a)a)a a)/ a / a a)/ a)a / a / a / b) b) b) b) b) b) a)a / a)a / b) b) c) c) c) c) c) c) a a a a a a c) c) a b c b c ab c b c b c b c a b a b b c b c b c b c b c b c b c b c 2 + 2 + + + a b a b b c b c

b c b c b c c b b c b c c b b c b c b c a b a b

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ + + + + + + + + + + + + + + + + + + a / a / a / + a / a + / a a / / a a / / + a / a / a / a / a / a / a / a / a / a / a / a a / a / / aa / / a / a / a / C2a C2/ C2 C2 C2 C2a a/ / C2 C2 a / 3 5 3 5 + + a / a / a / a a / / a / a / a a / / a a / / a / a / a / 1 2 1 2 a / a / a / a / a / b b b b c b b b c b a b c / b b b a c / b b b c b b c b c b b b cc a / b c b c b c b b b c b b cb b c b c b b b b c b b b c b b c b b b c b b b b c c b b b c b b c c b c b c b c b b b c bb b bc c b c

b c b b b b c b b b c bb b c c a b b b c b b b c b b c b c bba b cc b b c b c b c b b b c b b c b b c b c b c b c b b b c b b b c

2A2A2A2A 2A 2B2A2B2A2B2B2A2B 2B 2B1B1B2B1B1B 1B2B1B2B1B2B2B1B2B 2B 2B 2B 1B 2A2A1B2A2A2B2A 2B2A2B2A2B2B2A2B 2B2A 2B 2B2A 2B 2B + + + + + + + + + + + + + + + + + + + + + + + +

+ + + + + + + + + + + + + + + + a a a a a a + + a a a / a / a / a / a a / / a a / a / / a / a / a / a / a a / / a / a / a / a / a / a / a / + + C3 C3 C3 C3 C3 C3a / C3 C3 a / + + + +

a / a / a / a / a a / / a a / a / / a / a / a / + + a / a a / / a / b c b c b c b c b b c c b b c b c c b c b c b c b c b b c c b c b c b c b c b c b c b c b c b c / a / a

b c b c b c b c b b c c b b c b c c b c b c b c a b a b b c a b a b b b c c b c a b a b

c b c b

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 2 2 + + + +

+ + + + + + + + + + + + + + + + + + + + a / a / a / a / a / a / a / a / + + + + a / a / a / a / a a / a/ / a a / / a / a / a / C4a C4/ C4 C4 C4a a / / C4 a / a b a b a b a b a b a b C4 C4 a b a b

a / a / a / a / a / a / a / a / a / a / / a / a b b b b c b b b c b b c b b b c b b b c b b b c + + b b b c b b b c a b a b a b a b a b a b / a / a a b a b b c b c b c b b c b b bb c b c b b c b b b cc b b b c b b b c b b b c b c b b b c b c b b b c b b b b c b b b c b b c b b b c b b b c b b b c b b b c b b b c 2 2 b b b c b b b c c b c b c b c b 1 1 a b a b 3 5 3 5

85

Figure 7:Figure Crossover 7: Crossover operation. operation. d)d)d)d)+ +d)+ d)+ d)+ +d) + +e)e)e)e)1A1Ae)1Ae)1A e)1A2A1Ae)2A1A2A2A1A2Af) f)2Af)2A1Af)1Af)2A1A1Af) 1Af) 1B1Af)1B1A1B1B1A1B 1B 1B 1B

+ + + + + + + + + + + + + + + + + + + + + + + +

+ + + + + + + + / /a /a a / a / a / a / a / a

a b a b a b a b a b a b a b a b a b a b a b a b a b a b 2 2 2 +2 + 2+ +2 +2 ++ 2+ ++ ++ + + + + a b a b c b c b c b c b c b c b c b c b a b a b a b a b a b a b a b a b

+ + + + + + + + + + + + ++ + + + + + + The mutationThe mutation operation operation acts locally acts locally on a single on a individual. single individual. In our In case, our it case, it + + + +

+ + + + + + 3 3 35 5 53 3 5 3 5 5 + + 3 5 3 5 a / a / a / a / a / a / 1 1 21 21 21 2 12 12 12a / 2 a / a / a / a / a / a a / / a a / a / / a / a / a / a / a a / / a /

a b a b a b a b a b a b b c b c b c b c b c b c a b a b b c b c b c b c b c b c b b bb b c c b b b c b b b c c b b b c b b b c b b b c works byworks randomly by randomly transforming transforming the trees the that trees compose that compose the derived the derived elements. elements. b c b b b b c c b b b c

This isThis done is by done replacing by replacing base schema base schema elements elements in the1B1B in leafs1B the1B or1B leafs2B operators1B2B1B2B or2B1B operators2B 2B in 2B2A2A2B2A in2A 2A 2B2A2B2A2B2B2A2B 2B 2B 2B + + + + + + + +

+ + + + + + + +

+ + + + + + + ++ + + + + + + + + + + + + + the internalthe internal nodes. nodes. Notice Notice that, although that, although these transformations these transformations are randomly are randomly + +

/ a/ a/ a / a / a / a / a / a

a b a b a b a b a a b a b b aa bb a b a b a b a b a ab b a b a b a b a b a b a b a b a b a b c b c b c b c b c b c b + + + + + + + 2 2+ 2 2 2 2 2 2 c b c b performed,performed, they are they constrained are constrained to preserve to preserve the type the of+ type+ match.+ of+ match.+ + + + + + + + + + + + + + + + + + + + + + + + + + + +

a / a / a / a / / a / /a a /a / a / a / a / a a / / a a / + + / + a + + + + + / a/ a/ a / a / a / a / a / a b b b b c b b b c b b c b b b cc b b b c b c b b c b b b c c b c b c b 2 2 2 2 2 2 2 2 b b b c c b b b b c c b c b c b c b c b c b c b c b c b 1 1 1 1 1 1 a b a b a b a b a b a b 1 1 a b a b The crossoverThe crossover operation operation allows allows for exchanging3 for3 35 exchanging35 53 5 components3 5 3 5 3 components5 5 between between two two selectedselected individuals. individuals. Intuitively, Intuitively, these individuals these individuals play the play role the of role parents of parents whosewhose o↵spring o↵spring will compose will compose part of part the of individuals the individuals of a new of a generation new generation in in the evolutionarythe evolutionary process. process. Like the Like mutation the mutation operation,Figure operation,FigureFigureFigure the 7:Figure 7: Crossover type 7: CrossoverFigure 7: CrossovertheFigure 7:Crossover of type CrossoverFigure 7: the operation. operation.Crossover 7: operation.match of Crossover operation.7: the operation.Crossover match operation. operation. operation. is alsois preserved. also preserved. FigureFigure 4.2 illustrates 4.2 illustrates how the how crossover the crossover operation operation creates creates new individuals new individuals TheTheThe mutationThe mutationThe mutation mutationThe mutationThe operation mutation operationThe operation mutation operation mutation operation acts operation acts operation acts locally acts locally operation acts locally locally acts on locally actson a locally on single a acts on locally single a on single a locally single on aindividual. single individual. on a individual. single aindividual. on single individual. a single individual. In In individual. our In our individual. In our case, In our case, case, ourIn it case, ourIn it case, it our In case, it our it case, it case, it it in ourin approach. our approach. As we As can we see, can initialy see, initialy two individuals two individuals are selected are selected from the from the worksworksworksworks byworks by randomly byworks randomly by randomlyworks by randomlyworks by randomly transforming by randomly transforming randomly transforming by transforming randomly transforming transforming the transforming the transforming treesthe treesthe trees the that trees thethat trees that the compose trees that compose thethat trees compose compose that trees compose that the compose the that derivedthe compose derivedthe derived compose the derived the elements. derived elements. the derived elements. the elements. derived elements. derived elements. elements. elements. populationpopulation (a) and (a) then and matches then matches related related to a same to a schema same schema are paired are paired (1A (1A ThisThisThis isThis is doneThis is doneThis is done byis doneThis by done is replacingThis by replacing done byis replacing by done replacing is by replacing done base by replacing base replacingbaseby schema base replacingschema base schema baseschema schemaelements base elements schema baseelements schemaelements elements schema in elements in the in elements the in the leafs elements in the leafs leafs thein or leafs thein or leafs operators or the operators in leafsor operators the or leafs operators oroperators leafs in oroperators in operators inor in operators in in in in with 2Awith and 2A 1B and with 1B 2B) with (b). 2B) Next, (b). a Next, sequence a sequence of crossover of crossover operations operations (C1, (C1, thethe internalthe internalthe internalthe internalthe internal nodes.the nodes. internal nodes.the internal nodes. Notice nodes. internal Notice nodes. Notice Notice nodes. that, Notice that, nodes. that,Notice although that, Notice although that, although Notice that, although although that, these although these that, these although thesetransformations transformationsalthough these transformations these transformations thesetransformations transformations these transformations are transformations are randomly are randomly are randomly are randomly are randomly are randomly are randomly randomly performed,performed,performed,performed,performed,performed, they theyperformed, they areperformed, they are they areconstrained aretheyconstrained constrained are they constrained are theyconstrained are constrained to are constrained to preserve to constrainedpreserve to preserve to preserve preserveto the thepreserveto the type preserve to the type type thepreserve of type the of match. type of match.the oftype match. the of match. type match. of type match. of match. of match. TheTheThe crossoverThe crossoverThe crossover16 crossoverThe crossoverThe operation16 crossover operationThe operationcrossover operation crossover operation allows operation allows operation allows allows operation for allows for exchanging for allows exchanging for allows exchanging for exchanging allows for exchanging for exchanging components for exchanging components components exchanging components components components between components between between components between between two betweentwo two between two betweentwo two two two selectedselectedselectedselectedselected individuals. individuals.selected individuals.selected individuals. individuals.selected individuals. Intuitively, individuals. Intuitively, individuals.Intuitively, Intuitively, Intuitively, Intuitively, these Intuitively, these these Intuitively, these individuals individualsthese individuals these individuals these individuals individualsthese play individuals play play individuals the play the play the role playthe role rolethe play of role ofthe parentsplay role of parentsthe of role parents ofthe parents role ofparents role ofparents parents of parents whosewhosewhosewhose o↵whose ospring↵ owhosespring↵ ospring↵whose ospring↵ willwhosespring o will↵ willspring o compose↵ will composespring o will compose↵spring compose will compose will part compose part will part compose of part of compose the part of the of part the individuals of the individualspart individuals the of part individuals the of individuals the of individuals of the ofindividuals a of a new individuals of a new of a new generation new a of generation new generation a of generation new a of generation new a generation in new in generation in ingeneration in in in in thethethe evolutionarythe evolutionary evolutionarythe evolutionarythe evolutionarythe evolutionary process.the evolutionary process. process. evolutionary process. process. Like Like process. Like process.the Like the Like process. themutation themutation Like mutation the Like mutation the mutation Like operation, the mutation operation, the operation, mutation operation, mutation operation, theoperation, the operation, thetype thetype operation, type the of type of the the type of the the of type thematch of thematch type match the of typematch the of match the of match the match match isis alsois alsois also preserved. alsois preserved. also preserved.is preserved. alsois preserved. alsois preserved. also preserved. preserved. FigureFigureFigureFigure 4.2Figure 4.2 illustratesFigure 4.2 illustrates 4.2Figure illustrates 4.2 illustratesFigure 4.2 illustrates 4.2 how illustrates how 4.2 illustrates how the how illustrates the how the crossover the crossover how crossover the how crossover the crossover how operation the crossover operation the operation crossover operation crossover operation creates operation creates operation creates creates operation new creates new creates new individuals new creates individuals new individuals creates individuals new individuals new individuals new individuals individuals inin ourin ourin our approach.in our approach. approach. ourin approach. ourin approach. ourin As approach. As our approach. we As we As approach. can we As can we can see, As we can see, As we see, can initialy see, initialy weAs can see,initialy canwe initialy see, two initialy can see, two initialy two individuals see, two initialy individuals individualstwo initialy individuals two individuals two individuals are two are individuals selectedare individuals selectedare selected are selected are selected from are from selected from are selectedthe from the from selected the from the the from the from the the populationpopulationpopulationpopulationpopulationpopulation (a) (a)population (a)andpopulation (a)and and (a) then and then (a) and then (a) matches then and matches then (a) and matches then matches and matches then related matchesrelated then related matches related to matchesrelated to a related to a same to related a same to a same related same toschema a schema same to a schema same schemaa to same schema are a are schema same are paired schema are paired paired are schema paired are (1A paired (1A are paired (1A (1A are paired (1A paired (1A (1A (1A withwithwith 2Awith 2Awith and 2A and 2Awith and 1B2Awith and 1B 2A with and 1Bwith with 1B2A and with 1B 2B) 2A with and 2B) 1B with 2B) and (b). 1B 2B)with (b). 2B) (b). 1B with Next, (b). 2B) Next, with (b). Next, 2B) Next, (b). a 2B) Next,sequence a (b). sequencea Next, sequencea (b). Next,sequence a sequence Next, a of sequence of a crossover of sequence crossover aof crossover sequence of crossover crossover of operations crossover of operations operations crossover of operations crossover operations operations (C1, (C1, operations (C1, (C1, operations (C1, (C1, (C1, (C1,

1616161616 16 16 16 SMS Evolution: A Single Step 1A 1B 1A 2A 1A 1B 1A + 1B 2A 1A + + 1A a + 1B a / + a / C1 + + a b

a a)a / b) + c) a b c a / b c a / C1

a b a)a / b) c) b c a b c b c

b c + + + +

+ + + + + + a / a / a / C2 a / + + a / a / a / C2 a / a / a / b b b c b c a / a / b b b c b c b b b c b c b b b c b c b b b c b c

b b b c b c

2A 2B 2A 1B 2B 2B 1B 2B 2A 2B 2A 2B + + + + + + + + a a / a / C3 a / + + a a / a / a / a / C3 a / b c b c b c a / a / b c b c b c b c b c

b c b c + + + +

+ + + a / + + + a / C4 a / a b

a / + + b b b c a b a / b c b b b c a / C4 a / a b

b b b c a / b b b c a b b c b b b c

b b b c

86 d) + e) 1A 2A f) 1A 1B

+ + +

+ / a

a b d) + e) 2 1A+ 2A+ 1B a b f) 1A c b a b

+ + + + + +

+ 3 5 / a + a / a b 1 2 a / a / 2 + + a b

c b a b b c a b b c b b b c

+ + + 3 5 + 1 2 a / 1B a / 2B a / a b b c 2B b c b b b c 2A +

+

+ + +

/ a

a b a b a b 1B+ 2B 2 2B c b + 2A + + +

a / / a + + / a + 2 b b b c c b + + c b 1 + a b

3 5 / a a b a b a b + 2 c b

+ + + +

a / / a + / a 2 b b b c c b c b 1 a b 3 5 Figure 7: Crossover operation.

FigureThe mutation 7: Crossover operation operation. acts locally on a single individual. In our case, it works by randomly transforming the trees that compose the derived elements. This is done by replacing base schema elements in the leafs or operators in The mutation operationthe internal acts locally nodes. onNotice a single that, although individual. these In transformations our case, it are randomly works by randomly transformingperformed, the they trees are constrained that compose to preserve the derived the type elements. of match. This is done by replacingThe base crossover schema operation elements allows in the for leafs exchanging or operators components in between two selected individuals. Intuitively, these individuals play the role of parents the internal nodes. Noticewhose that, o↵spring although will compose these transformations part of the individuals are randomly of a new generation in performed, they are constrainedthe evolutionary to preserve process. the Like type the mutation of match. operation, the type of the match The crossover operationis also allows preserved. for exchanging components between two selected individuals. Intuitively,Figure 4.2 these illustrates individuals how the crossover play the operation role of createsparents new individuals whose o↵spring will composein our approach. part of the As we individuals can see, initialy of a two new individuals generation are selected in from the the evolutionary process.population Like the (a) mutation and then operation, matches related the type to a of same the schema match are paired (1A with 2A and 1B with 2B) (b). Next, a sequence of crossover operations (C1, is also preserved. Figure 4.2 illustrates how the crossover operation creates new individuals in our approach. As we can see, initialy two individuals16 are selected from the population (a) and then matches related to a same schema are paired (1A with 2A and 1B with 2B) (b). Next, a sequence of crossover operations (C1,

16 1A 1B 2A 1A 1B 1A

+

+ +

a + a / a / C1

a b

a)a / b) c) a b c b c

b c

+ + + +

+ + a / a / a / C2 a /

a / a / b b b c b c b b b c b c

b b b c b c

2A 2B 1B 2B 2A 2B

+ + +

+ + a a / a / C3 a /

a / a / b c b c b c

b c b c

+ + + +

+ + a / a / C4 a / a b

a / b b b c a b b c b b b c SMS Evolution: Crossover b b b c

1A 1B 2A 1A 1B 1A

+

+ +

a + a / a / C1

a b a)a / b) c) a b c b c b c

+ + + +

+ + a / a / a / C2 a /

a / a / b b b c b c b b b c b c b b b c b c d) + e) 1A 2A f) 1A 1B 2A 2B 1B 2B 2A 2B + + + + + +

+ + a a / a / C3 a / + a / a / / a b c b c b c a b b c b c 2 + + a b

+ + c b + + a b

+ + a / a / C4 a / a b

a / b b b c a b b c b b b c

b b b c

+ + + 3 5 + 1 2 a / a / a /

a b b c b c b b b c d) + e) 1A 2A f) 1A 1B

+ + + 1B 2B + / a 2B a b 2A 2 + + a b c b a b + + + + + 3 5 + 1 2 a / + + a / a / +

a b b c b c b b b c / a

a b a b a b + c b 1B 2B 2A 2B 2

+ + + + +

+

+ + +

a / / a / a + / a a b a b a b + c b 2 2 b b b c c b c b + + + + 1 a b

a / / a + / a 3 5 2 b b b c c b c b 1 a b 3 5

87 Figure 7: Crossover operation.

The mutation operation acts locally on a single individual. In our case, it Figure 7: Crossover operation. works by randomly transforming the trees that compose the derived elements. This is done by replacing base schema elements in the leafs or operators in the internal nodes. Notice that, although these transformations are randomly performed, they are constrained to preserve the type of match. The mutation operation acts locally on a single individual. In our case, it The crossover operation allows for exchanging components between two selected individuals. Intuitively, these individuals play the role of parents works by randomly transforming the trees that compose the derived elements. whose o↵spring will compose part of the individuals of a new generation in the evolutionary process. Like the mutation operation, the type of the match This is done by replacing base schema elements in the leafs or operators in is also preserved. Figure 4.2 illustrates how the crossover operation creates new individuals the internal nodes. Notice that, although these transformations are randomly in our approach. As we can see, initialy two individuals are selected from the population (a) and then matches related to a same schema are paired (1A performed, they are constrained to preserve the type of match. with 2A and 1B with 2B) (b). Next, a sequence of crossover operations (C1, The crossover operation allows for exchanging components between two 16 selected individuals. Intuitively, these individuals play the role of parents whose o↵spring will compose part of the individuals of a new generation in the evolutionary process. Like the mutation operation, the type of the match is also preserved. Figure 4.2 illustrates how the crossover operation creates new individuals in our approach. As we can see, initialy two individuals are selected from the population (a) and then matches related to a same schema are paired (1A with 2A and 1B with 2B) (b). Next, a sequence of crossover operations (C1,

16 1A 1B 2A 1A 1B 1A

+

+ +

a + a / a / C1

a b a)a / b) c) a b c b c b c

+ + + +

+ + a / a / a / C2 a /

a / a / b b b c b c b b b c b c

b b b c b c

2A 2B 1B 2B 2A 2B

+ + 1A 1B 2A + 1A 1B 1A + + + a a / a / + +

a a / + C3 a / a / C1

a b

a)a / b) c) a a / a / b c b c b c b c b c b c

+ + b c b c + +

+ + a / a / a / C2 a /

a / a / + b b b c b c + b b b c b c + +

b b b c b c

+ + a / a / C4 a / a b

a / 2A 2B b b b c 1B 2B a b 2A 2B b c b b b c

+ + + b b b c

+ + a a / a / C3 a /

a / a / b c b c b c

b c b c

+ + + +

+ + a / a / C4 a / a b

a / b b b c a b SMS Evolution: New b Solutionc b b b c

b b b c

d) + e) 1A 2A f) 1A 1B

+ + +

+ / a

a b 2 + + a b

c b d) + e) 1A 2A a b 1B + f) 1A + + 3 5 + 1 2 a / a / a /

a b b c b c b b b c + + +

+ / a

a b 2 + + 1B 2B a b 2B c b 2A + a b

+

+ + +

/ a

a b a b a b + + c b 2 + +

+ + + 3 5 + + 1 2 a / a / / a a / + a / / a 2 b b b c c b c b 1 a b b c 3 5a b b c b b b c

1B 2B Figure 7: Crossover operation.2A 2B +

+

+ + +

The mutation operation acts locally on a single/ individual.a In our case, it

a b a b a b + 2 works by randomly transforming the trees that composec b the derived elements.

+ + + This is done by replacing base schema elements in the leafs or operators+ in

a / / a + the internal nodes. Notice that, although these transformations are randomly/ a 2 b b b c c b c b 1 a b 3 5 performed, they are constrained to preserve the type of match. The crossover operation allows for exchanging components between two selected individuals. Intuitively, these individuals play the role of parents whose o↵spring will compose part of the individuals of a new generation in the evolutionary88 process. Like the mutation operation, the type of the match is also preserved. Figure 7:Figure Crossover 4.2 illustrates how operation. the crossover operation creates new individuals in our approach. As we can see, initialy two individuals are selected from the population (a) and then matches related to a same schema are paired (1A with 2A and 1B with 2B) (b). Next, a sequence of crossover operations (C1, The mutation operation acts locally on a single individual. In our case, it 16 works by randomly transforming the trees that compose the derived elements. This is done by replacing base schema elements in the leafs or operators in the internal nodes. Notice that, although these transformations are randomly performed, they are constrained to preserve the type of match. The crossover operation allows for exchanging components between two selected individuals. Intuitively, these individuals play the role of parents whose o↵spring will compose part of the individuals of a new generation in the evolutionary process. Like the mutation operation, the type of the match is also preserved. Figure 4.2 illustrates how the crossover operation creates new individuals in our approach. As we can see, initialy two individuals are selected from the population (a) and then matches related to a same schema are paired (1A with 2A and 1B with 2B) (b). Next, a sequence of crossover operations (C1,

16 SMS Evolution – Details

} Setup } Similarity Functions (e.g., Jaro, Consine, Prob. Density, etc.) } Data types with operators } STRING: concatenation, insertion, substitution, etc. } DATE: sum, sub, conversion (e.g., year to days), etc

} NUMBER: sum, mult, etc. } Next Generation } k individuals with the fitness value above a threshold ε is selected for mutation and crossover

89 4.2. Evolutionary Generation of Schema Matching Solutions Let A and B be two distinct schemas to be matched. The evolutionary process we propose for achieving a suitable schema matching solution is based on the generational algorithm we have described in Section 3.2. In the first step, individuals, that is, schema matching solutions, are randomly generated. This means that schema elements are grouped into derived elements forSMS both schemas. Evolution Pairs – ofDetails derived elements are used to

1A 1B 2A form matches with the addition of1A a similarity1B functions. These1A matches are

+

+ +

a + a / } a / C1 Fitness: which solutions1A are good? a b a)a / b) 1B c) 2A a b c 1A 1A 1B b c b c then grouped into candidate schema matching solutions,+ composing the first + +

+ + a + + + a / a / C1 + + a b a / 1A a / a)a / b) 1B c) a / C2 a / 2A a b c 1A b c a / 1A a / 1B b b b c b c b c b b b c b c

+

+ + b b b c b c

+ + a } + + + a / generation of the evolutionaryGeneral process. idea a / C1 + + a b a / a / a)a / b) c) a / C2 a / a b c b c a / a / b b b c b c b c b b b c b c

b b b c b c 2A 2B + + 1B 2B 2B + + 2A + + a / a / + + a / C2 a / This first generation, as well as all other upcoming generations, are+ evalu- a / a /

+ + b b b c b c a } a / a / b b b c b c Given a SMS, evaluate its matches C3 a / b b b c b c 2A 2B a / a / 1B 2B b c b c 2A 2B b c b c b c

+ + +

+ + + + + + a a / a / C3 a / + + ated according to a fitness function. For a given candidate2A a / 2B schema matching a / a / a / C4 a / a b 1B 2B b c b c b c a / 2B b b b c a b 2A b c b c b c b b b c } + + + In good matches,b b b c similarity functions must give

+ + + + + + a a / a / C3 a / + + a / a / a / a / C4 a / a b b c b c b c a / solution S = m ,m ,...,m , where each m is a match,b b b c the fitness func- a b 1 2 n b c b c i b c b b b c

b b b c

+ + + +

+ + high values a / a / C4 a / { } a b a / b b b c a b b c b b b c tion f(S) is calculated as follows: b b b c

d) + e) 1A 2A f) 1A 1B

+ + n + + / a

a b 2d) + + + e) 1A 2A a b 1B c b f) 1A a b

+ + +

+ + + + / a

a b + 3 5 + a b 2 + 1 eval2 (ma / ) 2A d) + e) 1A a / a / 1B i c b f) 1A a b

b c a b b c b b b c

+ + +

+ + + + / a

a b + 3 5 + a b 2 + 1 2 a / i=1 a / a / c b a b

b c a b f(S)=X 1B 2Bm = b c b b b c (1) + 2B i 2A + + 3 5 + 1 2 a / + a / a /

+

+ + b c a b + n b c b b b c 1B 2B / a 2B a b a b 2A a b + 2 + c b +

+ + + + + + +

/ a

a / / a 1B 2B + / a 2B a b a b 2A a b c b 2 b b b c c b + c b 1 2 + a b +

+ + 3 5 + + + + In Equation 1, let m = a ,b, ↵ be a match composed of derived ele-+ i i i i / a a / / a + / a a b a b a b c b 2 b b b c c b + c b 1 2 a b

h i 3 5 + + + + a / / a 90 + ments a and b , and a similarity function ↵ . eval is an evaluation of the/ a i i 2 i b b b c c b c b 1 a b 3 5 Figure 7: Crossover operation. similarity function ↵i using instances of schemas A and B to compute ai and Figure 7: Crossover operation. bi, respectively, according to theThe mutation sequence operation of actsoperationsFigure locally 7: on Crossover a single they operation. individual. describe. In our case, it A very important issue works in our byThe randomly approach mutation transforming operation is how acts the trees locally the that on instances compose a single individual. the derived of schemas In elements. our case, it A and B are used for evaluating Thisworks is done the byThe byrandomly mutation fitness replacing transforming operation base function. schema acts the elements locally trees In that on this in a compose the single work, leafs individual. the or derived operators we Incon- elements. our in case, it the internalThisworks is done nodes. by randomly by Notice replacing that, transforming base although schema the these elementstrees transformations that in compose the leafs the are or derivedrandomly operators elements. in sider two distinct evaluation performed,the strategies:This internal they is done nodes. are constrainedby Noticean replacingentity-oriented that, to base preserve although schema the these elements type transformations strategyof match. in the leafs areand or randomly operators a in Theperformed,the crossover internal they operation nodes. are constrained Notice allows that, for to exchanging although preserve these the components type transformations of match. between are two randomly value-oriented strategy. Adoptingselectedperformed,The individuals. one crossover of they these Intuitively, operation are constrained evaluation allows these for individuals to preserveexchanging strategies playthe components type the of role has match. of between someparents two implications and may be determinedwhoseselected o↵springThe individuals. crossover will by compose the operation Intuitively, characteristics part of allows the these individuals for individuals exchanging of of play a the new components the reposito-generation role of between parents in two the evolutionarywhoseselected o↵spring individuals. process. will compose Like Intuitively, the part mutation of these the operation, individuals individuals the of type play a new of the the generation role match of parents in ries being processed. We postponeis alsothe preserved.whose evolutionary a o detailed↵spring process. will compose discussion Like the part mutation of the on operation, individuals these the issues of type a new of until thegeneration match in Section 4.4. Figureis alsothe 4.2 preserved. evolutionary illustrates process. how the Likecrossover the mutation operation operation, creates new the individuals type of the match in our approach.isFigure also preserved. 4.2 As illustrates we can see, how initialy the crossover two individuals operation are creates selected new from individuals the As described in Sectionpopulation 3.2,in our the approach.Figure (a) next and 4.2 then As generationillustrates we matches can see, how related initialy the in crossover to the two a same individuals evolutionary operation schema are creates are selected paired new pro- from (1A individuals the withpopulation 2Ain and our 1B approach. (a) with and 2B) then As (b). we matches Next, can see, a related sequence initialy to two of a crossover sameindividuals schema operations are are selected paired (C1, from (1A the cess is obtained based on thewith valuepopulation 2A and calculated 1B (a) with and 2B) then (b). matchesfor Next, each a related sequence individual to ofa same crossover schemaS operationsusing are paired (C1, (1A the fitness function f(S) as follows.with 2A and Thus, 1B with first 2B) (b). the16 Next,n aindividuals sequence of crossover with operations the (C1, highest fitness values are reproduced to form the next16 generation. Then, asetofm<

15 SMS Evolution – Details

} Two different entities } Entity-oriented Strategy: } Assumes a non-negligible overlap between the instances } First, use similarity functions to look for similar entities } Then, verify if the match can detect these entities } Value-oriented Strategy } Assumes an empty or negligible overlap between the instances } First, use similarity functions to look for similar attributes } Then verify if the match can detect these entities

91 SMS Evolution – Details

} Constraints } For a given match, all attributes, operations, similarity functions should be of same data type } The set of possible similarity functions can be select by a specialists } These are practical constraints } The evolutionary process could be carried out without them } But using them we narrow the solution space and save some time

92 Experiments - Datasets Table 1: Experimental Dataset characteristics CharacteristicTable 1: ExperimentalSynthetic Dataset 1, characteristics 2, 3 Real State Inventory TotalCharacteristic of Elements in File A Synthetic12 1, 2, 3 Real32 State Inventory44 TotalTotal of of Elements Elements in in File File B A 127 3219 4438 TotalTotal of of 1-1 Elements Matches in File B 77 197 3827 StringTotal Matches of 1-1 Matches 37 76 2711 NumericalString Matches Matches 43 61 1116 TotalNumerical of Complex Matches Matches 24 121 1611 StringTotal Matches of Complex Matches 22 125 114 NumericalString Matches Matches 02 57 47 Numerical Matches 0 7 7

Table 2: Characteristics of tables obtained from GTF CharacteristicTable 2: Characteristics ofReal tables Estate obtainedCar Dealers from GTFRestaurants TotalCharacteristic of Elements in Table A Real7 Estate Car Dealers28 Restaurants6 TotalTotal of of Elements Elements in in Table Table B A 67 288 69 TotalTotal of of 1-1 Elements Matches in Table B 66 85 92 StringTotal Matches of 1-1 Matches 36 5 22 NumericalString Matches Matches 33 51 20 Numerical Matches 3 1 0 disjoint data scenario, since they do not present overlapping entities. We disjoint data scenario, since they do not present overlapping entities. We also noticed that their schemas do not required complex matches. Table 2 also93 noticed that their schemas do not required complex matches. Table 2 summarizessummarizes the the characteristics characteristics of of these these tables. tables. OurOur experimental experimental evaluation evaluation adopts adopts the the same accuracy metric metric used used in in otherother works works [14, [14, 22, 22, 24], 24], that that is, is, the the fraction fraction of all target target schema schema elements elements whosewhose candidate candidate match match is is correct, correct, as as given given by:

NumberOfCorrectlyIdentifiedMatches Accuracy = NumberOfCorrectlyIdentifiedMatches Accuracy = NumberOfMatchesNumberOfMatches

ForFor each each data data scenario, scenario, di di↵↵erenterent datasets datasets and parameter setups setups (evolu- (evolu- tionarytionary parameters parameters and and similarity similarity functions functions options) were were used. used. This This param- param- etereter setup setup was was chosen chosen after after initial initial tuning tuning experiments experiments with with the the training training set. set. ItIt provided provided good good results results and, and, at at the the same same time, time, avoided problems problems with with over- over- fittingfitting and and extensive extensive training training time time requirements. requirements. We present this this information information inin Table Table 3. 3. WeWe run run our our experiments experiments in in a a workstation workstation with the following following hardware hardware and and softwaresoftware configuration: configuration: Pentium Pentium Core Core 2 2 Duo Duo Quad (2 Ghz) Ghz) processor, processor, with with aa 4 4 GB GB RAM RAM DDR2 DDR2 memory, memory, and and a a 320 320 GB GB SATA hard drive, drive, and and running running aa 64-Bits 64-Bits FreeBSD FreeBSD 7.1 7.1 Unix-based Unix-based operational operational system.

2525 Experiments - Results

Overlap Non-Overlap Table 6: Results - Disjoint data scenario Matches Accuracy Table 4: Results - PartiallyPartial overlapped (ST1) data scenarioRS All 1-1 Matches 85% RS String 1-1 Matches 100% Matches Accuracy RS Numeric 1-1 Matches 0% All 1-1 Matches 57% RS All Complex Matches 25% String 1-1 Matches 100% RS String Complex Matches 60% Numeric 1-1 Matches 24% RS Numeric Complex Matches 0% All Complex Matches 75% INV All 1-1 Matches 40% String Complex Matches 75% INV String 1-1 Matches 100% INV Numeric 1-1 Matches 0% INV All Complex Matches 20% Table 5: Results - FullyFull overlapped (ST2) data scenarioINV String Complex Matches 56% 5.3. Experiments in the PartiallyMatches OverlappedAccuracy Data ScenarioINV Numeric Complex Matches 0% This first set of experimentsAll 1-1 Matches evaluates the100% capability ofST3 our All 1-1 evolutionary Matches 42% Strings 1-1 Matches 100% ST3 String 1-1 Matches 100% approach to find complex matchesNumeric 1-1 Matches in partially100% overlappedST3 data Numeric repositories. 1-1 Matchings 0% All Complex Matches 100% ST3 All Complex Matches 100% In this case, our schema matchingString Complex system Matches has100% to deal withST3 String some Complex “noise” Matches 100% during the search process, since there is no external evidence or structural information available to be used. This is the most common real-world sce- 94 similarity function derived from the classic vector space model. thenario e↵ [28].ectiveness We have of the used Sort only Winkler the Synthetic similarity 1 dataset function. for these experiments, As shown in Table 5, our 1-1 matchThe searcher rationale was of able this to strategy identify is all to 1-1 find matches between schema ele- since there is no partially overlappedments data that in thepresent Real similar State term and frequencies Inventory and share common terms (e.g., matches. Even matches involving numeric schema elements were identified. datasets. personal names, addresses and product descriptions). For this, we compare This happened because the comparisons were performed between entries cor- In this scenario, our searchers employtext vectors the entity-oriented that are assembled strategy using to all find data instances stored in the reposi- responding to the same real-world entities and the numerical values for each the most similar schema element combinationstory that are by associated comparing with the a correspond- combination of schema elements. The higher entry were mostly the same ones in both files, allowing the correct matching. ing instances stored in both files. Thethe most similarity similar value combination between the is considered text vectors, the higher the likelihood of The complex match searcher was also able to identify all existing matches. a possible replica, thus it is used forthe matching schema elementpurposes. combinations being related to each other. The simila- Moreover, since we have used the Sort Winkler similarity function, some of Table 4 shows the results for therity partially boundary overlapped for this experiment data scenario was using empirically chosen after initial tuning the correct answers found in the output solution presented the same schema Levenshtein, a well known edit distancetests and function, its value as was the set similarity to 0.95, function. since it maximized the e↵ectiveness of the elements, but with di↵erent concatenationcosine similarity sequences function. (e.g., forename after This function was the most e↵ective among many others (e.g., Sort Winlker, surname). For matching purposes, thisWe is have not used a problem the Synthetic because 3, Inventory the final and Real State datasets in these Jaro and Soundex [6, 10]) we used in our set of experiments. The similarity solution includes the correct schemaexperiments. elements. However, we notice that the two real datasets are too small boundary value for this experiment was empirically chosen after initial tuning These results show that the e↵ectiveness(only 100 ofentries), our match since they searchers were created was im- using real data, aiming at exper- tests and its value was set to 0.9, since it maximized the e↵ectiveness of the proved by the specific characteristicsiments of the with fully hybrid overlapped schema matchingdata scenario. approaches [14, 22, 24]. Considering Levenshtein function. In this scenario, the match searchersthat only real have datasets to identify are usually subsets much of schema bigger, we overcome this situation by Our 1-1 match searcher was ableusing to a find synthetic 57% of dataset all matches. whose data As better we resembles that found in real elements that are able to improve the overall similarity results. In the pre- can see, most of the unidentified matchesapplications. are numeric ones, which is mainly vious experiments, with partially overlapped data, our searchers also need due to the fact that we have treated allTable schema 6 shows elements the results as of strings. our experiments As a in the disjoint data scenario. to deal with di↵erent entities during the comparisons. Comparisons involv- consequence, the unidentified 1-1 matchesWe use increased the acronyms the RS, search INV space and ST3 of the when referring to the Real State, ing di↵erent entities may mislead the replica identification, being, therefore, complex match searcher. Yet, we wereInventory still able and to Synthetic find about 3 datasets, one fourth respectively. of Our 1-1 match searcher considered as “noise” for the searchers. all the numeric matches in this scenario.missed all On matches the other involving hand, numeric the complex and date schema elements (age, birth dates, prices and discounts) in all datasets. However, it was able to identify 5.5.matching Experiments searcher in found the Disjoint 75% of the Data complex Scenario matches. This result includes mostly complex matches involving addresses and one partial match6 involving This set of experiments evaluates the capability of our evolutionary30 ap- proach to find schema matches in disjoint data repositories. This means that they6A do partial not share match entries is a complex that correspond match that to misses the same one or real-world more schema entities elements. and, therefore, cannot be used as “rules” for identifying semantic relationships. For this reason, this is the most dicult scenario for any schema matching system based on the instance-level approach27 [28]. Since the repositories do no share entries that correspond to the same real-world entities, we cannot use our entity-oriented strategy in this sce- nario. Hence, we use instead our value-oriented strategy, which considers a

29 Experiments – Examples of Matches

} Inventory dataset: } ship-address = (ship-address + ship-postal-code) + (ship-city + ship-country) } Real State dataset: } house-address = (house-street + house-city) + house-zip-code } Synthetic 3 dataset: } fullname = forename + surname

95 Conclusions and Remarks

} Data of interest is no longer in databases, although they are in on-line sources } In particular: Textual Sources } The structure is only implicit } Meta-data is a luxury } Constraints are a utopia

96 Other areas can help a lot

} Information Retrieval } IR models, text indexing, relevance metrics, language models, etc. } Data/Text Mining } Rule Mining, Learning, Categorization, Graph Models } Artificial Intelligence } Ontologies, Automated Reasoning } ….

97 An expanded set of CS foundations is helpful!

} Computer Science Theory for the Information Age } Upcoming book by John Hopcroft and Ravindran Kannan } From the TOC } High-Dimensional Space This is the theory for the next 30 } Random Graphs years !! } Singular Value Decomposition (SVD) } Markov Chains } Learning and VC-dimension } Algorithms for Massive Data Problems } Clustering } Graphical Models and Belief Propagation

98

Many other approaches

} Named Entity Recognition (NER) } E.g. Sarawagi@FTD’08, Ratinov@CoNLL’09 } Open Information Extraction } Unsupervised NER over massive text collections, e.g., the Web } Oren Etzioni (e.g., EMNLP-CoNLL’12, WWW’08, IJICAI’07) } Hidden Web } Juliana Freire (e.g., WWW’07, ICDE’07, WebD’10) } Web Tables } Alon Halevy, Mike Cafarela (e.g., PVLDB’08, CIDR’07) } NoDB – Scientific Data! } Anastacia Ailamaki (e.g., SIGMOD’12)

99