On Scalable Information Retrieval Systems

Ophir Frieder [email protected] www.ir.iit.edu

1 © Ophir Frieder 2002 “Scalable Search”

[Diagram: a search engine takes structured, semi-structured, and unstructured inputs (text, video, etc.) and produces answers]

2 © Ophir Frieder 2002 Scalable Information Systems: “Characteristics”

§ Ingest data from multiple sources
  – Duplicate document detection
§ Process data sources of multiple types
  – Structured & unstructured data integration (SIRE)
§ Use scalable (parallel) technology
  – Parallel SIRE
§ Integrate retrieved data to yield answers
  – IIT Mediator

3 © Ophir Frieder 2002 Duplicate Document Detection

§ The union of data obtained from multiple sources often contains duplicates
§ Duplicates hurt both retrieval effectiveness and retrieval efficiency
§ Duplicate detection is either syntactic or semantic; semantic detection is far more challenging

4 © Ophir Frieder 2002 What is a Duplicate Document?

• Semantic similarity: a document is a duplicate if it contains roughly the same semantic content, whether or not it is a precise syntactic match.

5 © Ophir Frieder 2002 Duplicate Detection Techniques

Main duplicate detection approaches:
– Hash-based approaches (syntactic)
– Information retrieval techniques
– Resemblance: r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

6 © Ophir Frieder 2002 Duplicate Detection with IR

§ Using documents as queries, rank all documents in the collection with similar terms
  – Documents with equivalent weights are duplicates
§ For each query term, the corresponding posting-list entries must be retrieved
  – For large collections, the I/O costs are prohibitive


7 © Ophir Frieder 2002 Duplicate Detection with Resemblance

§ Calculate the resemblance of each document to every other document with matching features
§ Divide the document into shingles (X-term windows), each used to create a unique hash
§ Calculate the resemblance from the hashes rather than the terms
§ N² comparison approaches are not feasible for large collections
§ Optimizations filter which shingles to use
  – E.g., every 25th shingle, or a combination of multiple shingles
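The shingle-and-resemblance computation described above can be sketched as follows; the 3-term shingle size, whitespace tokenization, and Python's built-in hash() are illustrative assumptions, not the original implementation.

```python
# Sketch of shingle-based resemblance; shingle size k=3, whitespace
# tokenization, and Python's built-in hash() are illustrative choices.
def shingles(text, k=3):
    terms = text.lower().split()
    # each k-term window becomes one hashed feature
    return {hash(" ".join(terms[i:i + k])) for i in range(len(terms) - k + 1)}

def resemblance(a, b, k=3):
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    # r(A, B) = |S(A) intersect S(B)| / |S(A) union S(B)|
    return len(sa & sb) / len(union) if union else 0.0
```

Identical documents score 1.0, disjoint documents 0.0, and near-duplicates fall in between, which is why a similarity cutoff is needed to call two documents duplicates.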

8 © Ophir Frieder 2002 Issues with Prior Approaches

§ Hash techniques are not resilient to small changes in document representation.

§ IR techniques are slow for large collections.

§ Resemblance: documents are clustered into multiple clusters due to partitioning, so duplicate classification is difficult.

9 © Ophir Frieder 2002 Combined (I-Match) Algorithm

§ Tokenize the document
§ Create a list of unique tokens
§ Filter the tokens (what to filter?)
§ Create a single hash of the remaining tokens
§ Search the collection for duplicate hashes
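The steps above can be sketched as follows; the idf band [1.0, 3.0] used for filtration is an illustrative assumption (the actual cutoffs come from collection statistics, as the following slides discuss), and SHA-1 stands in for whatever hash the real system used.

```python
import hashlib

# Sketch of the I-Match steps: tokenize, keep unique tokens, filter by an
# illustrative idf band [lo, hi], then hash the survivors once.
def i_match(doc, idf, lo=1.0, hi=3.0):
    tokens = sorted({t.lower() for t in doc.split()})          # unique tokens
    kept = [t for t in tokens if lo <= idf.get(t, 0.0) <= hi]  # filter
    # one hash over all surviving tokens represents the whole document
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
```

Documents that differ only in filtered (e.g., very common) terms collide to the same hash, so the duplicate search is a single hash lookup rather than a pairwise comparison.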

10 © Ophir Frieder 2002 Filtration Based On Collection Statistics

[Diagram: terms partitioned by idf into a low 25% band, a mid 50% band, and a high 25% band; filtration options include dropping the high band, the low band, or both]

1. Sort terms according to idf = log(N / n), where
   N = number of documents in the collection
   n = number of documents the term occurs in
2. Filter the unwanted components
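The sort-and-band step can be sketched as follows; the function and variable names are illustrative, and the quartile split follows the slide's 25% / 50% / 25% partition.

```python
import math

# Sort terms by idf = log10(N/n), then split into high/mid/low bands
# following the 25% / 50% / 25% partition shown on the slide.
def idf_bands(df, num_docs):
    scored = sorted(((math.log10(num_docs / n), t) for t, n in df.items()),
                    reverse=True)                    # most selective first
    q = len(scored) // 4
    high = scored[:q]                                # top 25% by idf
    low = scored[len(scored) - q:]                   # bottom 25% by idf
    mid = scored[q:len(scored) - q]                  # middle 50%
    return high, mid, low
```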

11 © Ophir Frieder 2002 LA Times Collection

§ Create random duplicates to test effectiveness:
  1. For every ith word, pick a random number from one to ten.
  2. If the number is higher than the random threshold (call it alpha), pick a number from 1 to 3.
  3. If that number is a one, remove the word.
  4. If it is a two, swap the word with the word at position i+1.
  5. If it is a three, add a word (picked randomly from the term list).
§ Insert the duplicate into the collection.
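The randomized-duplicate generator above can be sketched as follows; alpha, the fixed seed, and term_list (the vocabulary for inserted words) are illustrative assumptions.

```python
import random

# Sketch of the slide's randomized-duplicate generator; alpha, the seed,
# and term_list are illustrative assumptions.
def make_duplicate(doc, alpha=7, term_list=("stock", "market"), rng=None):
    rng = rng or random.Random(0)   # fixed seed keeps the sketch repeatable
    words = doc.split()
    out, i = [], 0
    while i < len(words):
        if rng.randint(1, 10) > alpha:               # perturb this position
            action = rng.randint(1, 3)
            if action == 1:                          # 1: remove the word
                i += 1
                continue
            if action == 2 and i + 1 < len(words):   # 2: swap with next word
                out += [words[i + 1], words[i]]
                i += 2
                continue
            if action == 3:                          # 3: insert a random term
                out.append(rng.choice(term_list))
        out.append(words[i])
        i += 1
    return " ".join(out)
```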

12 © Ophir Frieder 2002 Document Clusters Formed

Document        Resemblance  Resemblance-Opt  Combined
LA 123190-0013  9            9                7
LA 123190-0022  6            9                2
LA 123190-0025  9            11               3
LA 123190-0037  10           11               1
LA 123190-0043  8            11               2
LA 123190-0053  10           9                2
LA 123190-0058  7            11               3
LA 123190-0073  6            11               3
LA 123190-0074  11           11               1
LA 123190-0080  9            11               9

Average         8.5          10.4             3.3

I-Match did not produce any false positives, while Resemblance did.

13 © Ophir Frieder 2002 Processing Time – 2GB

Algorithm        Mean Time  Std Deviation  Median Time
Resemblance      31838.22   807.9          30862.5
Resemblance-Opt  24514.7    1042.1         24475.5
I-Match          3815.8     975.8          3598.8
Syntactic        65         N/A            N/A

14 © Ophir Frieder 2002 Scalable Information Systems: “Characteristics”

§ Ingest data from multiple sources
  – Duplicate document detection
§ Process data sources of multiple types
  – Structured & unstructured data integration (SIRE)
§ Use scalable (parallel) technology
  – Parallel SIRE
§ Integrate retrieved data to yield answers
  – IIT Mediator

15 © Ophir Frieder 2002 SIRE Goals

§ Integrate structured and semi-structured data using a framework that also integrates unstructured data
§ Improve the accuracy of retrieved results
§ Support scalability:
  – data volume
  – retrieval speed
§ Support legacy data

16 © Ophir Frieder 2002 Portability

The information retrieval prototype was implemented on the following relational platforms:
– NCR Teradata DBC machines
– Microsoft SQL Server
– Sybase
– Oracle
– IBM DB2 and SQL/DS

17 © Ophir Frieder 2002 Relational Inverted Index

All inverted index entries

e.g., vehicle → D1, D3, D4 results in:

term     docID
vehicle  D1
vehicle  D3
vehicle  D4

18 © Ophir Frieder 2002 Text Retrieval Conference (TREC) Sample Document

AP881214-0028
AP-NR-12-14-88 0117EST
u i BC-Japan-Stocks 12-14 0027
BC-Japan-Stocks,0026
Stocks Up In Tokyo
TOKYO (AP) The Nikkei Stock Average closed at 29,754.73 points, up 156.92 points, on the Tokyo Stock Exchange Wednesday.

19 © Ophir Frieder 2002 Relational Document Representation (Term Processing)

DOCUMENT
docID  docname        headline            dateline
28     AP881214-0028  Stocks Up In Tokyo  TOKYO (AP)

INDEX
docID  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday

TERM
term       df     idf
average    2265   1.08
closed     2208   1.08
exchange   2790   1.00
nikkei     234    2.07
points     1627   1.23
stock      2674   1.00
tokyo      725    1.58
up         12746  0.30
wednesday  6417   0.60

20 © Ophir Frieder 2002 Simplistic Models:

Keyword and Boolean Searches

21 © Ophir Frieder 2002 Relational Approach: Keyword Search Techniques

§ Keyword search:

  select i.docID
  from INDEX i, QUERY q
  where i.term = q.term

§ Keyword search with a stop word list:

  select i.docID
  from INDEX i, QUERY q, STOPLIST s
  where (i.term = q.term) and (i.term <> s.term)

22 © Ophir Frieder 2002 Relational Approach: Boolean Search Techniques

§ OR query, in two equivalent formulations:

  select docID
  from INDEX
  where term = term1 OR
        term = term2 OR
        term = term3 OR
        ...
        term = termN

  select docID from INDEX where term = term1
  union
  select docID from INDEX where term = term2
  union
  select docID from INDEX where term = term3
  ...
  union
  select docID from INDEX where term = termN

23 © Ophir Frieder 2002 Relational Approach: Boolean Search Techniques

§ AND query, in two equivalent formulations:

  select docID
  from INDEX a, INDEX b, INDEX c, ..., INDEX N
  where a.term = term1 AND
        b.term = term2 AND
        c.term = term3 AND
        ...
        n.term = termN AND
        a.docID = b.docID AND
        b.docID = c.docID AND
        ...
        N-1.docID = N.docID

  select docID from INDEX where term = term1
  intersect
  select docID from INDEX where term = term2
  intersect
  select docID from INDEX where term = term3
  ...
  intersect
  select docID from INDEX where term = termN

24 © Ophir Frieder 2002 Fixed Join-Count AND Queries

Find all documents that contain all of the terms found in the QUERY relation:

select i.docID
from INDEX i, QUERY q
where i.term = q.term
group by i.docID
having count(distinct(i.term)) =
       (select count(*) from QUERY)
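The fixed join-count query can be run on a toy collection with SQLite. INDEX is a reserved word in SQLite, so the table is named inv here; the sample rows are illustrative.

```python
import sqlite3

# Run the fixed join-count AND query on a toy collection; INDEX is a
# reserved word in SQLite, so the inverted-index table is named inv.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE inv(docID INTEGER, term TEXT);
    CREATE TABLE query(term TEXT);
    INSERT INTO inv VALUES (1,'nikkei'),(1,'stock'),(1,'exchange'),(2,'stock');
    INSERT INTO query VALUES ('stock'),('exchange');
""")
rows = con.execute("""
    SELECT i.docID
    FROM inv i, query q
    WHERE i.term = q.term
    GROUP BY i.docID
    HAVING COUNT(DISTINCT i.term) = (SELECT COUNT(*) FROM query)
""").fetchall()
and_docs = [r[0] for r in rows]   # only document 1 contains both query terms
```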

25 © Ophir Frieder 2002 TAND Queries

Find all documents that contain at least X of the terms found in the QUERY relation:

select i.docID
from INDEX i, QUERY q
where i.term = q.term
group by i.docID
having count(distinct(i.term)) >= X

26 © Ophir Frieder 2002 Relevance Ranking:

Vector Space & Probabilistic Models

27 © Ophir Frieder 2002 Vector Space Model

§ Term Frequency (tf_ik): the number of occurrences of term t_k in document i

§ Document Frequency (df_j): the number of documents that contain t_j

§ Inverse Document Frequency (idf_j): log(d / df_j), where d is the total number of documents

§ Notes:
  – idf measures the uniqueness of a term across the collection
  – tf is the frequency of a term within a given document

28 © Ophir Frieder 2002 Vector Space Model: Sample Relational Query

List all documents in the order of their similarity coefficient where the coefficient is computed using the dot product.

SELECT   d.docID, d.docname, SUM(i.termcnt * t.idf * q.termcnt * t.idf)
FROM     DOCUMENT d, QUERY q, INDEX i, TERM t
WHERE    q.term = i.term AND q.term = t.term AND d.docID = i.docID
GROUP BY d.docID, d.docname
ORDER BY 3 DESC

29 © Ophir Frieder 2002 Similarity Coefficients

§ Several similarity coefficients based on the query vector X and the document vector Y are defined:

  Inner product:       sum_{i=1..t} x_i * y_i

  Cosine coefficient:  ( sum_{i=1..t} x_i * y_i ) / ( sqrt(sum_{i=1..t} x_i^2) * sqrt(sum_{i=1..t} y_i^2) )
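The two coefficients can be computed directly over aligned weight vectors; plain Python lists stand in for the term-weight vectors here.

```python
import math

# Inner product and cosine coefficient over aligned weight vectors X and Y.
def inner_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def cosine(x, y):
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return inner_product(x, y) / norm if norm else 0.0
```

The cosine coefficient normalizes away vector length, so a long document is not favored merely for repeating terms, while the inner product is the unnormalized dot product used in the SQL above.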

30 © Ophir Frieder 2002 SQL for “Probabilistic” Similarity Measure

sum_{i=1..num_terms}  log( (numdocs - df_i + 0.5) / (df_i + 0.5) )
                      * ( 2.2 * tf_id ) / ( 0.3 + 0.75 * (doclength / avgdoclength) + tf_id )
                      * qtf

SELECT   d.docID, d.docname,
         SUM( LOG(((NumDocs - t.df) + 0.5) / (t.df + 0.5))
              * ((2.2 * i.tf) / (.3 + ((.75 * d.DocLen) / AvgDocLen) + i.tf))
              * q.termcnt )
FROM     INDEX i, TERM t, DOCUMENT d, QUERY q
WHERE    i.term = t.term AND i.docID = d.docID AND t.term = q.term
GROUP BY d.docID, d.docname
ORDER BY 3;
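The same probabilistic (BM25-style) measure in plain Python; the argument names and toy statistics are illustrative, not from the original system.

```python
import math

# The probabilistic similarity measure: for each query term, an idf-like
# weight times a saturating tf component times the query term frequency.
def prob_score(query_tf, doc_tf, df, num_docs, doc_len, avg_doc_len):
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf_part = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.75 * doc_len / avg_doc_len + tf)
        score += idf_part * tf_part * qtf
    return score
```

Note that the tf component saturates: doubling an already-frequent term raises the score much less than a first occurrence does, which is the key behavioral difference from the plain dot product.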

31 © Ophir Frieder 2002 Relational Document Representation (Term Processing)

DOCUMENT
docID  docname        headline            dateline
28     AP881214-0028  Stocks Up In Tokyo  TOKYO (AP)

INDEX
docID  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday

TERM
term       df     idf
average    2265   1.08
closed     2208   1.08
exchange   2790   1.00
nikkei     234    2.07
points     1627   1.23
stock      2674   1.00
tokyo      725    1.58
up         12746  0.30
wednesday  6417   0.60

32 © Ophir Frieder 2002 Relational Query Representation (Term Processing)

ORIGINAL QUERY: "nikkei stock exchange american stock exchange"

QUERY
term      termcnt
nikkei    1
stock     2
exchange  2
american  1

SQL (Query Weight * Document Weight):

SELECT   d.docID, d.docname, SUM(a.termcnt * c.idf * b.termcnt * c.idf)
FROM     QUERY a, INDEX b, TERM c, DOCUMENT d
WHERE    a.term = b.term AND a.term = c.term AND b.docID = d.docID
GROUP BY d.docID, docname
ORDER BY 3 DESC

33 © Ophir Frieder 2002 Sample Term Query Result (Inner/Dot Product)

Term      Q-Termcnt  Q-Weight  D-Termcnt  D-Weight  Q-Wt * D-Wt
nikkei    1          2.07      1          2.07      4.28
stock     2          2.00      2          2.00      4.00
exchange  2          2.00      1          1.00      2.00
american  1          0.60      0          0.00      0.00

Similarity Coefficient: 10.28

34 © Ophir Frieder 2002 Simple Phrase Parsing

§ A simple phrase parser with the following rules:
  – Phrases do not include stop terms
  – Phrases do not span punctuation

Example: The Nikkei Stock Average closed at 29,754.73 points up 156.92 points, on the Tokyo Stock Exchange Wednesday.

Phrases: nikkei stock, stock average, average closed, points up, tokyo stock, stock exchange, exchange wednesday
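The parsing rules can be sketched as follows: adjacent non-stop words form two-word phrases, and a stop term or punctuation breaks the run. The stop list is illustrative, and treating numeric tokens as breaks is an added assumption so the output matches the slide's example.

```python
import re

STOP = {"the", "a", "an", "at", "on", "of", "in", "and", "is"}  # illustrative stop list

# Two-word phrases from adjacent non-stop words; a stop term, punctuation,
# or (an added assumption) a numeric token breaks the phrase run.
def phrases(text, stop=STOP):
    out = []
    for clause in re.split(r"[.,;:!?]", text.lower()):
        run = []
        for w in clause.split():
            if w in stop or any(c.isdigit() for c in w):
                run = []                       # break the phrase run
                continue
            run.append(w)
            if len(run) >= 2:
                out.append(run[-2] + " " + run[-1])
    return out
```

On the slide's example sentence this yields exactly the seven phrases listed above.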

35 © Ophir Frieder 2002 Relational Document Representation (Phrase Processing)

DOCUMENT
docID  docname        headline            dateline
28     AP881214-0028  Stocks Up In Tokyo  TOKYO (AP)

INDEX
docID  termcnt  phrase
28     1        nikkei stock
28     1        stock average
28     1        average closed
28     1        points up
28     1        tokyo stock
28     1        stock exchange
28     1        exchange wednesday

PHRASE
phrase              df    idf
average closed      91    2.49
exchange wednesday  13    3.33
nikkei stock        203   2.14
points up           70    2.61
stock average       209   2.14
stock exchange      1818  1.34
tokyo stock         246   2.10

36 © Ophir Frieder 2002 Enhancing Accuracy With Relevance Feedback

37 © Ophir Frieder 2002 Relevance Feedback

§ The modification of the search process so as to improve accuracy by incorporating information obtained from prior relevance judgments.

[Diagram: the original query Q0 is run against the database; new terms from the top matching documents judged relevant are added to form Q1, which is run against the database again]

38 © Ophir Frieder 2002 Relevance Feedback Example

1. Q0 = "tunnel under English Channel" is run against the document collection.
2. Top-ranked document retrieved: "The tunnel under the English channel is often called a 'Chunnel'."
3. The relevant retrieved documents are identified.
4. Q1 = "tunnel under English Channel Chunnel" is run against the collection.
5. Relevant documents are retrieved.

39 © Ophir Frieder 2002 Feedback Mechanisms

§ Manual: relevant documents are identified manually, and new terms are selected either manually or automatically.
§ Automatic: relevant documents are identified automatically by assuming the top-ranked documents are relevant, and new terms are selected automatically.
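The automatic mode can be sketched as pseudo-relevance feedback: assume the top_k ranked documents are relevant and add their most frequent unseen terms to the query. The parameters and the toy documents are illustrative.

```python
from collections import Counter

# Pseudo-relevance feedback sketch: treat the top_k ranked documents as
# relevant and append their n_terms most frequent unseen terms.
def expand_query(query_terms, ranked_docs, top_k=1, n_terms=2):
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]
```

On the Chunnel example from the previous slide, the term "chunnel" would be among the terms added to Q1.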

40 © Ophir Frieder 2002 Relevance Feedback Parameters

Various techniques can be used to improve the relevance feedback process:
– Number of top-ranked documents
– Number of feedback terms
– Feedback term selection techniques
– Term weighting
– Document clustering
– Relevance feedback thresholding
– Term frequency cutoff points
– Query expansion using a thesaurus

41 © Ophir Frieder 2002 Relevance Feedback Evaluation

[Chart: "Improvement from relevance feedback, nidf weights": precision (0.00 to 0.50) plotted against recall (0.00 to 1.00) for two runs, "nidf, no feedback" and "nidf, feedback 10 terms"]

42 © Ophir Frieder 2002 Comparative TREC-8 Results

Run       IIT Avg. Precision  # Above Median  # At Median  # Below Median  # Best  # Worst
iit00t    .1627               30              2            18              5       4
iit00td   .2227               38              1            11              3       0
iit00tde  .2293               37              2            11              4       1
iit00m    .3519               40              3            7               25      1

43 © Ophir Frieder 2002 Performance Optimizations:

Query Thresholds & Clustered Indexes

44 © Ophir Frieder 2002 Query Thresholds

§ Consider a query with terms t1, t2, t3, ..., tn.
§ Sort the terms by their frequency across the collection (least frequent terms first).
§ Define a threshold as the percentage of the original query's terms kept in the newly created reduced query.
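The reduction step can be sketched as follows; df maps each term to its collection frequency, and breaking ties alphabetically is an added assumption to keep the sketch deterministic.

```python
# Keep the least-frequent threshold_pct percent of the query terms;
# alphabetical tie-breaking is an added assumption for determinism.
def reduce_query(terms, df, threshold_pct):
    ranked = sorted(terms, key=lambda t: (df.get(t, 0), t))  # least frequent first
    keep = max(1, round(len(ranked) * threshold_pct / 100))
    return ranked[:keep]
```

Because the rarest (highest-idf) terms are kept, most of the discriminating power of the query survives while the expensive, long posting lists of common terms are skipped.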

[Diagram: ten query terms sorted by ascending collection frequency; a threshold of 20 keeps terms 1-2, 50 keeps terms 1-5, and 80 keeps terms 1-8]

45 © Ophir Frieder 2002 Relevant Retrieved as a Function of Query Thresholds

[Chart: relevant documents retrieved vs. query threshold (percent), rising steeply from 831 at a 10% threshold to 2138 at 100%]

46 © Ophir Frieder 2002 Run Time as a Function of Query Thresholds

[Chart: CPU time vs. query threshold (percent), growing from 475 at a 10% threshold to 33,989 at 100%]

47 © Ophir Frieder 2002 Relevant New Documents Per CPU Cycle

Threshold  Relevant Retrieved  CPU Cycles  New Relevant Docs per Cycle
10         831                 475         1.75
20         1505                950         1.58
25         1601                1238        1.29
33         1657                1736        0.95
50         1856                5215        0.36
75         2119                13181       0.16
100        2138                33989       0.06

48 © Ophir Frieder 2002 Caveat: Logical Design versus Physical Implementation

While the design shown replicates the document identifier, the physical implementation actually uses clustered tables.

That is, attribute values that are logically repeated many times are physically clustered by attribute value to eliminate the replication, storing only one copy for each unique attribute value. (Note the clustered tables in Oracle implementations.)

The I/O to retrieve a posting list is achieved via a grouped block read as opposed to retrieval across distributed storage.

49 © Ophir Frieder 2002 Clustered Indexes: Posting List Processing

TRADITIONAL
term   docID  tf
stock  1      1
stock  28     2
stock  111    5
stock  1024   3
stock  2263   11
stock  11101  12
stock  12345  1
stock  13456  6
stock  23456  4

CLUSTERED
term   postings
stock  { (1, 1), (28, 2), (111, 5), (1024, 3), (2263, 11), (11101, 12), (12345, 1), (13456, 6), (23456, 4) }
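The clustering transformation can be sketched as follows; this is a logical sketch only, since the slides note that the physical version relies on the DBMS's clustered tables rather than application code.

```python
from collections import defaultdict

# Collapse traditional (term, docID, tf) rows into one clustered posting
# list per term, mirroring the layout above.
def cluster_postings(rows):
    lists = defaultdict(list)
    for term, doc_id, tf in rows:
        lists[term].append((doc_id, tf))
    return dict(lists)
```

With the postings for a term physically adjacent, retrieving a posting list becomes one grouped block read instead of many scattered reads.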

50 © Ophir Frieder 2002 Clustered Indexes: Relevance Feedback Processing

TRADITIONAL
docID  termcnt  term
28     1        nikkei
28     2        stock
28     1        average
28     1        closed
28     2        points
28     1        up
28     1        tokyo
28     1        exchange
28     1        wednesday

CLUSTERED
docID  entries
28     { (1, nikkei), (2, stock), (1, average), (1, closed), (2, points), (1, up), (1, tokyo), (1, exchange), (1, wednesday) }

51 © Ophir Frieder 2002 Technology Transfer

Concern: An Academic Solution Without a Public Commercial World Problem

Resolution: National Institutes of Health’s National Center for Complementary and Alternative Medicine Citation Index

52 © Ophir Frieder 2002 NIH-NCCAM Application

§ Short query lengths necessitate query expansion
§ "Advanced Search" techniques are needed, since they are used by roughly 30% of the users
§ Efficient processing is critical; filtration is an option
§ Scalability is not currently a concern, but future needs may dictate it

53 © Ophir Frieder 2002 NCCAM - Search Interface

54 © Ophir Frieder 2002 NCCAM Results Page Interface

55 © Ophir Frieder 2002 System Architecture

[Diagram: a user connects over the Internet to an HTTP server and servlet engine, backed by an Oracle DBMS, all hosted on a Sun ES 4500]

56 © Ophir Frieder 2002 Servlet Architecture

[Diagram: an HTTP search request goes to a generic search servlet, which consults an MRU cache and a query engine; the query engine reaches the RDBMS through a connection pool, and results are returned as dynamic HTML]

57 © Ophir Frieder 2002 Scalable Information Systems: "Characteristics"

§ Ingest data from multiple sources
  – Duplicate document detection
§ Process data sources of multiple types
  – Structured & unstructured data integration (SIRE)
§ Use scalable (parallel) technology
  – Parallel SIRE
§ Integrate retrieved data to yield answers
  – IIT Mediator

58 © Ophir Frieder 2002 Scalability via Parallelism:

“Not My Problem: It’s the Database Vendors’ Problem”

59 © Ophir Frieder 2002 Parallel Information Retrieval

§ Most parallel information retrieval systems require custom hardware and software, which reduces portability across systems.
§ Parallelism in a relational information retrieval system is a function of the database management system and does not require custom hardware or software.

60 © Ophir Frieder 2002 TPC-C Benchmarks

TPC-C Benchmark Results (Revision 3.X); these results are valid as of 7/5/2000 9:46:09 AM.

[Table: per-system TPC-C results for vendors including ALR, Acer, Amdahl, Bull, and Compaq, listing system, spec revision, tpmC, $/tpmC, total system cost, currency, database software, operating system, TP monitor, server CPU type and count, clustering, number of front ends, and submission/availability dates]

61 © Ophir Frieder 2002 More Vendors – TPC-C Benchmarks

[Table: TPC-C results, continued, for DG, Dell, Fujitsu Siemens, and Fujitsu/ICL systems, listing system, tpmC, $/tpmC, total cost, database, OS, TP monitor, CPU configuration, and dates]

62 © Ophir Frieder 2002 More Vendors – TPC-C Benchmarks

[Table: TPC-C results, continued, for HP, IBM, Intergraph, Itautec, Motorola, and NEC systems, listing system, tpmC, $/tpmC, total cost, database, OS, TP monitor, CPU configuration, and dates]

63 © Ophir Frieder 2002 More Vendors – TPC-C Benchmarks

NEC UP 4800/610 C/S (23 CPU)568.13 204793.7 1.16E+08 Yen Informix OnLineNEC UP-UX/V 5.01Tuxedo (REL4.2MP) (R4.2MP) MIPS R7.1R4400 100 MHz 2 N 3 1/27/95 4/21/95 NEC UP 4800/690 C/S 3 1488.6 235302.6 3.5E+08 Yen Informix OnLineNEC UP-UX/V 5.01Tuxedo (REL4.2MP) (R4.2MP) MIPS R6.2R4400 100 MHz 6 N 7 1/27/95 6/21/95 SGI Origin2000 (c/s) 3.3 6040.33 121.98 736822 US $ Sybase AdaptiveSGI IRIX Server 6.5BEA Enterprise Tuxedo MIPS 6.3 11.5.1 R10000 250 MHz 2 N 2 4/23/98 7/31/98 SGI Origin2000 Server 3.3 25309.2 139.04 3519012 US $ Informix OnLineSGI IRIX Dynamic 6.4BEA S2MP Tuxedo Server MIPS 6.1 7.3.UC1 R10000 195 MHz 28 N 26 4/30/97 10/29/97 Sequent NUMACenter 20003.5 93900.85 131.67 12363684 US $ Oracle Oracle8DYNIX/ptx Enterprise BEA4.4 Tuxedo Edition Intel 6.3 8.0.4 Xeon 405 MHz 64 N 8 12/18/98 6/15/99 Sequent NUMACenter 20003.5 48793.4 127.53 6222683 US $ Oracle Oracle8Sequent Enterprise DYNIX/ptxBEA Tuxedo Edition 4.4Intel 6.4 8.0.4 XeonCFS 405 MHz 32 N 4 10/13/98 3/15/99 Sun Enterprise 450 3.5 20123.7 27.7 557377 US $ Fujitsu/ICL Sun SymfoWARE SolarisBEA 7 Server Tuxedo Ultra for 6.3 Workgroup SPARC II 400MHz 2.0 4 N 7 2/2/00 7/28/00 Sun Enterprise 4500 3.5 50268.07 49.88 2507453 US $ Sybase AdaptiveSun Solaris ServerBEA 7 Enterprise Tuxedo Ultra 6.3 11.9.3 SPARC II 400MHz 14 N 15 11/23/99 3/30/00 Sun Enterprise 6500 Cluster3.5 135461.4 97.1 13153324 US $ Oracle8i Ent.Sun Edition SolarisBEA 8.1.6.02.6 Tuxedo Ultra 6.3 SPARC II 400MHz 96 Y 40 9/24/99 1/31/00 Sun Starfire Enterprise 3.510000115395.7 105.63 12189298 US $ Oracle 8i v.8.1.5.1Sun Solaris BEA 7 Tuxedo Ultra 6.3 SPARC II 400MHz 64 N 32 3/24/99 8/22/99 Sybase Digital AlphaServer 84003 11014.1 5/300 c/s 222.02 2445319 US $ Sybase SQLDigital Server UNIX 11.0ITI V3.2C Tuxedo Digital4.2 DECchip 21164 10 N 300 MHz 10 12/21/95 3/1/96 Tandem Integrity NR/4436 Server3 6313.78 c/s 493.42 3115351 US $ Informix OnLineSGI IRIX 7.11 6.2IMC Tuxedo MIPS R4400SC 250 16 MHz N 7 11/10/95 2/28/96 Unisys e-@ction ES5085R3.5 
Enterprise48767.76 Server 20.13 981337 US $ Microsoft SQLMicrosoft Server WindowsMicrosoft 2000 2000 COM+ Intel Datacenter Pentium III Server Xeon 8 550 N MHz 4 2/16/00 7/13/00 Unisys Unisys e-@ction ES2085R3.5 41085.43 Enterprise 37.19 1527980 US $ Oracle8i Ent.UnixWare Edition 7Tuxedo 8.1.6.0Data Center 6.4 Intel and Ed. Pentium6.5 Ver. CFS 7.1.1 III Xeon 8 550 N MHz 5 12/13/99 6/1/00 Unisys Aquanta ES5085R3.5 Server26922.6 (4P) 17.39 468169 US $ Microsoft SQLMicrosoft Server WindowsMicrosoft Enterprise NT COM+Intel Enterprise Edition Pentium 7.0 Edition III Xeon 4.0 4 550 N MHz 3 10/27/99 12/31/99 Unisys Aquanta ES5085R3.5 Server40670.05 18.42 749208 US $ Microsoft SQLMicrosoft Server WindowsMicrosoft Enterprise NT COM+Intel Enterprise Edition Pentium 7.0 Edition III Xeon 4.0 8 550 N MHz 5 10/12/99 12/31/99 Unisys Aquanta ES2025 Server3.5 10265.9 19.15 196642 US $ Microsoft SQLMicrosoft Server WindowsBEA Enterprise Tuxedo NTIntel Enterprise CFSEdition Pentium 6.3 7.0 Edition III Xeon 4.0 2 550 N MHz 1 9/7/99 9/30/99 Unisys Aquanta ES2085R3.5 Server37757.23 19.85 749304 US $ Microsoft SQLMicrosoft Server WindowsBEA Enterprise Tuxedo NTIntel Enterprise CFSEdition Pentium 6.3 7.0 Edition III Xeon 4.0 8 550 N MHz 4 6/22/99 9/30/99 Unisys Aquanta ES2043R3.5 Server23189.9 18.27 423795 US $ Microsoft SQLMicrosoft Server WindowsBEA Enterprise Tuxedo NTIntel Enterprise CFSEdition Pentium 6.3 7.0 Edition III Xeon 4.0 4 500MHz N 3 5/11/99 5/11/99 Unisys Aquanta ES2043 Server3.5 23189.9 17.79 412525 US $ Microsoft SQLMicrosoft Server WindowsBEA Enterprise Tuxedo NTIntel Enterprise CFSEdition Pentium 6.3 7.0 Edition III Xeon 4.0 4 500MHz N 3 5/7/99 5/7/99 Unisys Aquanta ES2045 Server3.5 23852.1 18.05 430592 US $ Microsoft SQLMicrosoft Server WindowsBEA Enterprise Tuxedo NTIntel Enterprise CFSEdition Pentium 6.3 7.0 Edition III Xeon 4.0 4 500MHz N 3 4/1/99 3/31/99 Unisys Aquanta ES5045 3.5 24328.67 19.62 477440 US $ Microsoft SQLMicrosoft Server WindowsBEA Enterprise Tuxedo 
NTIntel Enterprise CFSEdition Pentium 6.3 7.0 Edition III Xeon 4.0 4 500MHz N 3 3/17/99 3/17/99 Unisys Aquanta QR/2 Server3.5 21354.55 23.73 506676 US $ Microsoft SQLMicrosoft Server WindowsTuxedo Enterprise 6.3NTIntel CFSEnterprise Edition Xeon 450 7.0 Edition MHz 4.0 4 N 3 1/5/99 1/5/99 Unisys Aquanta QR/2V Server3.5 19118.37 22.19 424297 US $ Microsoft SQLMicrosoft Server WindowsTuxedo Enterprise 6.3NTIntel CFSEnterprise Edition Pentium 7.0 Edition II Xeon 4.0 4 400 N MHz 3 12/4/98 12/29/98 Unisys Aquanta QS/2 Server3.5 18707.7 24.96 466883 US $ Microsoft SQLMicrosoft Server WindowsTuxedo Enterprise 6.3NTIntel CFSEnterprise Edition Pentium 7.0 Edition II Xeon 4.0 4 400 N MHz 3 11/11/98 12/29/98 Unisys Aquanta QR/2 Server3.5 18343.17 24.83 455520 US $ Microsoft SQLMicrosoft Server WindowsTuxedo Enterprise 6.3NTIntel CFSEnterprise Edition Pentium 7.0 Edition II Xeon 4.0 4 400 N MHz NEC UP 4800/610 C/S (23 CPU) Unisys Aquanta QS/2 Server3.3 (c/s) 18154 25.49 462803 US $ Microsoft SQLMicrosoft Server WindowsTuxedo Enterprise 6.3NTIntel CFSEnterprise Edition Pentium 7.0 Edition II Xeon 4.0 4 400 N MHz NEC UP 4800/690 C/S 3

64 © Ophir Frieder 2002 Experimental Platform: NCR Teradata DBC/1012

[Diagram: DBC/1012 architecture — client computers and workstations (running client-resident software) connect via channels and a LAN to the front-end IFPs, COP, and PE; these communicate over the Ynet interprocessor bus with the AMPs, each owning its own disks, and with a UNIX-based AP on the LAN.]

IFP - Interface Processor
COP - Communications Processor
Ynet - Interprocessor Bus
AMP - Access Module Processor
AP - Application Processor
PE - Parsing Engine

65 © Ophir Frieder 2002 Sample Relevance Ranking Query

SELECT c.qryid, b.docid,
       SUM(((1+LOG(a.termcnt)) / ((b.logavgtf) * (229.50439 + (.20*b.disterm))))
           * (c.nidf * ((1+LOG(c.termcnt)) / (d.logavgtf))))
FROM   trec6$d5$idx a, trec6$d5$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
WHERE  a.docid = b.docid
  AND  c.qryid = d.qryid
  AND  a.term  = c.term
  AND  c.qryid = 301
GROUP BY c.qryid, b.docid
UNION
SELECT c.qryid, b.docid,
       SUM(((1+LOG(a.termcnt)) / ((b.logavgtf) * (229.50439 + (.20*b.disterm))))
           * (c.nidf * ((1+LOG(c.termcnt)) / (d.logavgtf))))
FROM   trec6$d4$idx a, trec6$d4$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
WHERE  a.docid = b.docid
  AND  c.qryid = d.qryid
  AND  a.term  = c.term
  AND  c.qryid = 301
GROUP BY c.qryid, b.docid
ORDER BY 3 DESC;

66 © Ophir Frieder 2002 Breakdown of DBC/1012 Processing Steps

Step 1 - A read lock is placed on tables trec6$d5$idx, trec6$d5$docavgtf, trec6$d4$idx, and trec6$d4$docavgtf.
Step 2 - A single processor is used to select and join rows via a merge join from trec6$q6$qrynidf and trec6$q6$qryavgtf where the value of qryid = 301. The results are stored on spool 3.
Step 3 - An all-processor join is used to select and join rows via a row hash match scan from trec6$d5$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 4.
Step 4 - An all-processor join is used to select and join rows via a row hash match scan from trec6$d5$docavgtf and spool 4 where the values of the docid attribute match. The results are stored on spool 2, which is built locally on each processor.
Step 5 - The SUM value for the aggregate function is calculated from the data on spool 2 and the results are stored on spool 5.
(The next two steps, 6a and 6b, are executed in parallel.)
Step 6a - The data from spool 5 is retrieved and distributed via a hash code to spool 1, which encompasses all processors.
Step 6b - An all-processor join is used to select and join rows via a row hash match scan from trec6$d4$idx and spool 3 where the values of the term attribute match. The results are sorted and stored on spool 9.
Step 7 - An all-processor join is used to select and join rows via a row hash match scan from trec6$d4$docavgtf and spool 9 where the values of the docid attribute match. The results are stored on spool 7, which is built locally on each processor.
Step 8 - The SUM value for the aggregate function is calculated from the data on spool 7 and the results are stored on spool 10.
Step 9 - The data from spool 10 is retrieved and distributed via a hash code to spool 1, which encompasses all processors. A sort is then done to remove duplicates from the data on spool 1.
Step 10 - An END TRANSACTION step is sent to all processors involved and the contents of spool 1 are sent back to the user.

67 © Ophir Frieder 2002 Parallel Performance

Parallel Efficiency measures how efficiently the processors are distributing the workload:

    Parallel Efficiency (PE) = avg_cpu / max_cpu

where avg_cpu is the average CPU time across all processors and max_cpu is the maximum CPU time across all processors.
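The metric above can be computed directly from per-processor timings; a minimal sketch (the function name and sample timings are illustrative, not from the slides):

```python
def parallel_efficiency(cpu_times):
    """PE = avg_cpu / max_cpu over a list of per-processor CPU times."""
    avg_cpu = sum(cpu_times) / len(cpu_times)
    max_cpu = max(cpu_times)
    return avg_cpu / max_cpu

# A perfectly balanced run has PE = 1.0; one straggler drags PE down.
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [5.0, 5.0, 5.0, 25.0]
print(parallel_efficiency(balanced))  # 1.0
print(parallel_efficiency(skewed))    # 0.4  (avg 10 / max 25)
```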

Configuration                                          Avg CPU time    Max CPU time    Storage     Parallel
                                                       per processor   per processor   imbalance   Efficiency
4 processors, disk 2 only, primary index on term           149.45          176.00        6.1%        84.9%
4 processors, disks 2 and 4, primary index on term         393.65          453.20        3.1%        86.9%
24 processors, disk 2 only, primary index on term           17.07           42.04       18.9%        40.6%
24 processors, disks 2 and 4, primary index on term         49.11          111.10       19.2%        44.2%

68 © Ophir Frieder 2002 Hashing Algorithm

4 Processors (terms hashed by starting letter):
  A,E,I,M,Q,U,Y -> Proc #1
  B,F,J,N,R,V,Z -> Proc #2
  C,G,K,O,S,W   -> Proc #3
  D,H,L,P,T,X   -> Proc #4

24 Processors:
  A  -> Proc #1
  B  -> Proc #2
  C  -> Proc #3
  ...
  V  -> Proc #22
  WX -> Proc #23
  YZ -> Proc #24

69 © Ophir Frieder 2002 Term Distribution

Distribution of terms based on starting letter

[Bar chart: Number of Terms (y-axis, 0 to 1,400,000) by Starting Letter (x-axis, a–z), showing a heavily skewed distribution across letters.]
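The skew in this distribution is what unbalances a starting-letter placement; a toy sketch of the scheme (the routing function and the term sample are illustrative, not from the talk):

```python
from collections import Counter

def proc_for_term(term, num_procs):
    """Toy stand-in for starting-letter hashing: route a term to a
    processor by its first letter (illustrative only)."""
    return (ord(term[0].lower()) - ord('a')) % num_procs

# Hypothetical term sample; real collections are far more skewed.
terms = ["search", "scalable", "system", "server", "sql",
         "query", "index", "term", "hash", "amp"]
load = Counter(proc_for_term(t, 24) for t in terms)
print(load.most_common(1))  # the busiest processor and its term count
```

With 24 processors every "s" term lands on the same processor, so one processor holds half of this sample while most hold nothing, mirroring the storage imbalance measured on the DBC/1012.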

70 © Ophir Frieder 2002 Parallel Performance

Parallel Efficiency before and after balancing data storage across 24 processors

Configuration                                                   Avg CPU time    Max CPU time    Storage     Parallel
                                                                per processor   per processor   imbalance   Efficiency
24 processors, disk 2 only, primary index on term                   17.07           42.04       18.9%        40.6%
24 processors, disks 2 and 4, primary index on term                 49.11          111.10       19.2%        44.2%
24 processors, disk 2 only, primary index on docID and term         29.21           31.90        0.8%        91.6%
24 processors, disks 2 and 4, primary index on docID and term       74.18           79.09        0.5%        93.8%
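The gain in the last two rows comes from hashing on the composite key rather than the term alone; a sketch contrasting the two placements (crc32 stands in for the machine's hash function, and the data is made up):

```python
import zlib
from collections import Counter

NUM_PROCS = 24

def place(key):
    """Distribute an index row to a processor by hashing its
    primary-index key (crc32 is an illustrative stand-in)."""
    return zlib.crc32(repr(key).encode()) % NUM_PROCS

# Hypothetical index rows: many documents, few (skewed) terms.
rows = [(doc, term) for doc in range(1000)
        for term in ("the", "search", "parallel")]

by_term = Counter(place(term) for _, term in rows)
by_doc_term = Counter(place((doc, term)) for doc, term in rows)

def imbalance(load):
    """Maximum processor load relative to the ideal even share."""
    return max(load.values()) * NUM_PROCS / sum(load.values())

print(imbalance(by_term))      # term-only key: 3 terms occupy at most 3 processors
print(imbalance(by_doc_term))  # (docID, term) key: near-uniform spread
```

Because there are far more distinct (docID, term) pairs than distinct terms, the composite key gives the hash enough values to spread rows almost evenly, which is the balancing effect the table reports.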

71 © Ophir Frieder 2002 Scalable Information Systems: “Characteristics”

n Ingest data from multiple sources – Duplicate document detection n Process multiple type data sources – Structured & unstructured data integration (SIRE) n Use scalable (parallel) technology systems – Parallel SIRE n Integrate retrieved data to yield answers – IIT Mediator

72 © Ophir Frieder 2002 Current Enterprise Portals

73 © Ophir Frieder 2002 Next Generation Search Engines

74 © Ophir Frieder 2002 75 © Ophir Frieder 2002 76 © Ophir Frieder 2002 77 © Ophir Frieder 2002 78 © Ophir Frieder 2002 IIT Production Mediator Logs

271 Queries 142 With User Feedback

[Pie chart of user feedback: Happy 38%, Unhappy 23%, Ok 13%, Satisfied 13%, Dissatisfied 13%]

79 © Ophir Frieder 2002 Technology Transfer

n Industrial – America Online – BIT Systems – Harris Corporation – IITRI – NCR – Unnamed Others (Proprietary) – Assorted dot-dead companies (hopefully not due to our technology!!!)

n Government – National Institutes of Health – Additional Others

80 © Ophir Frieder 2002 Information Retrieval Laboratory

Faculty Members: O. Frieder, D. Grossman, N. Goharian, X. Li, P. Wan
Students: Steven Beitzel, Rebecca Cathey, Ankit Jain, Eric Jensen, Vincent Nguyen, Angelo Pilotto, Michael Saelee, Chih-Wei Yi, Wang Yu
Senior Affiliates: A. Chowdhury – AOL; D. Holmes – NCR; M. C. McCabe – US Gov.

81 © Ophir Frieder 2002 References

n D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts, “Integrating Structured Data and Text: A Relational Approach,” Journal of the American Society for Information Science, February 1997. n C. Lundquist, O. Frieder, D. Holmes, and D. Grossman, “A Parallel Relational Database Management System Approach to Relevance Feedback in Information Retrieval,” Journal of the American Society for Information Science, April 1999. n O. Frieder, D. Grossman, A. Chowdhury, and G. Frieder, “Efficiency Considerations in Very Large Information Retrieval Servers,” Journal of Digital Information (British Computer Society), 1(5), April 2000. Invited Paper. n A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, “Analyses of Multiple-Evidence Combinations for Retrieval Strategies,” ACM SIGIR Conference, New Orleans, Louisiana, September 2001. n D. Grossman, S. Beitzel, E. Jensen, and O. Frieder, “IIT Intranet Mediator: Bringing Data Together on a Corporate Intranet,” IEEE IT Pro, January/February 2002. n A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe, “Collection Statistics for Fast Duplicate Document Detection,” ACM Transactions on Information Systems (TOIS), April 2002.

82 © Ophir Frieder 2002