
11/4/2019

Information Retrieval

Deepak Kumar

Information Retrieval

Searching within a document collection for a particular piece of needed information.


Query

Search Engines…

Altavista, Entireweb, Leapfish, Spezify, Ask, Stinky Teddy, Faroo, Maktoob, Stumpedia, Bing, Info.com, Miner.hu, Monster Crawler, ChaCha, Walla, Omgili, WebCrawler, Daum, Go, Rediff, Yahoo!, Dmoz, Goo, Scrub The Web, DuckDuckGo, Hakia, Seznam, Egerin, HotBot


Search Engine Marketshare 2019


Search Engine Marketshare 2017

Matching & Ranking

[Figure: a query (“muddy waters”) passes through matching to give the matched pages (“hits”), then through ranking to give the ranked pages 1., 2., 3.]


Index

Inverted Index

• A mapping from content (words) to location.

• Example:

1: the cat sat on the mat
2: the dog stood on the mat
3: the cat stood while a dog sat


Inverted Index

a      3
cat    1 3
dog    2 3
mat    1 2
on     1 2
sat    1 3
stood  2 3
the    1 2 3
while  3

Every word in every web page is indexed!
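The index on this slide can be built mechanically from the three example pages. The following is a minimal sketch, not a description of how any real search engine does it: the names `PAGES` and `build_index` are made up for illustration, and words are assumed to be separated by single spaces.

```python
from collections import defaultdict

# The three example pages from the slides, keyed by page number.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}


def build_index(pages):
    """Map each word to the sorted list of page numbers that contain it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in text.split():
            index[word].add(page_number)
    # Sort the words and the page numbers so the output reads like the slide.
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}


index = build_index(PAGES)
for word, numbers in index.items():
    print(word, *numbers)
```

Each output line is a word followed by the pages it appears on, in the same alphabetical order as the slide.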


Searching

query: cat

cat    1 3

hits:
1: the cat sat on the mat
3: the cat stood while a dog sat

Searching

query: dog

dog    2 3

hits:
2: the dog stood on the mat
3: the cat stood while a dog sat


Searching

query: cat dog

cat    1 3
dog    2 3

hits:
3: the cat stood while a dog sat
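A multi-word query such as cat dog matches only the pages that appear in every word's list, which is why page 3 is the single hit. A minimal sketch of that intersection, assuming the page-number index from the slides is held in a dictionary (`INDEX` and `search` are illustrative names):

```python
# Page-number index copied from the slides.
INDEX = {
    "a": [3], "cat": [1, 3], "dog": [2, 3], "mat": [1, 2], "on": [1, 2],
    "sat": [1, 3], "stood": [2, 3], "the": [1, 2, 3], "while": [3],
}


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.split()
    if not words:
        return []
    # Start with the pages for the first word, then narrow with each other word.
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


print(search(INDEX, "cat"))      # [1, 3]
print(search(INDEX, "dog"))      # [2, 3]
print(search(INDEX, "cat dog"))  # [3]
```

A word that is not in the index yields an empty list, so any query containing it has no hits.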

Searching

query: cat the sat

hits: ???


Phrase Queries

query: “cat sat”

cat    1 3
sat    1 3

hits:
1: the cat sat on the mat
3: the cat stood while a dog sat

Both pages contain cat and sat, but only page 1 contains them next to each other.

How to tell if two words occur next to each other? EFFICIENTLY???

Inverted Index with Location

Each entry is page-position: 1-2 means page 1, word 2.

a      3-5
cat    1-2 3-2
dog    2-2 3-6
mat    1-6 2-6
on     1-4 2-4
sat    1-3 3-7
stood  2-3 3-3
the    1-1 1-5 2-1 2-5 3-1
while  3-4

query: “cat sat”

cat    1-2, 3-2
sat    1-3, 3-7

Only on page 1 are the two locations adjacent (1-2, 1-3).

hits:
1: the cat sat on the mat
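With locations stored, the adjacency check for a phrase needs only the index entries, not the pages themselves. The sketch below is one possible way to do it under simple assumptions: positions count from 1 as on the slide, words are split on spaces, and `build_positional_index` and `phrase_search` are hypothetical helper names.

```python
from collections import defaultdict

# The three example pages from the slides, keyed by page number.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}


def build_positional_index(pages):
    """Map each word to a list of (page, position) pairs, positions from 1."""
    index = defaultdict(list)
    for page_number, text in pages.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].append((page_number, position))
    return dict(index)


def phrase_search(index, phrase):
    """Return the pages where the words of the phrase occur consecutively."""
    words = phrase.split()
    if not words:
        return []
    # Candidate start locations: every occurrence of the first word.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        locations = set(index.get(word, []))
        # Keep a start only if this word sits `offset` positions after it.
        candidates = {
            (page, pos) for page, pos in candidates
            if (page, pos + offset) in locations
        }
    return sorted({page for page, _ in candidates})


index = build_positional_index(PAGES)
print(index["cat"])                     # [(1, 2), (3, 2)]
print(index["sat"])                     # [(1, 3), (3, 7)]
print(phrase_search(index, "cat sat"))  # [1]
```

Page 3 drops out because sat is at position 7 there, not at position 3 right after cat.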


NEAR* Queries

query: cat NEAR dog

cat    1-2 3-2
dog    2-2 3-6

On page 3 the two words are 4 positions apart (3-2, 3-6).

hits:
3: the cat stood while a dog sat

Useful in ranking!

*NEAR: distance <= 5
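A NEAR query can reuse the same locations: two words match on a page when some pair of their positions is at most 5 apart. A minimal sketch, assuming the entries are stored as (page, position) pairs; `POSITIONS` and `near_search` are illustrative names, and the nested loop is the simplest approach rather than an efficient one.

```python
# A few entries of the index with locations, as (page, position) pairs.
POSITIONS = {
    "cat": [(1, 2), (3, 2)],
    "dog": [(2, 2), (3, 6)],
    "sat": [(1, 3), (3, 7)],
}


def near_search(index, first, second, max_distance=5):
    """Return {page: smallest distance} for pages where the words are near."""
    hits = {}
    for page_a, pos_a in index.get(first, []):
        for page_b, pos_b in index.get(second, []):
            distance = abs(pos_a - pos_b)
            if page_a == page_b and distance <= max_distance:
                # Remember the closest pair found on this page.
                hits[page_a] = min(distance, hits.get(page_a, distance))
    return hits


print(near_search(POSITIONS, "cat", "dog"))  # {3: 4}
```

Returning the distance along with the page keeps the information that a ranking step could use later.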


Matching & Ranking

[Figure: a query (“muddy waters”) passes through matching to give the matched pages (“hits”), then through ranking to give the ranked pages 1., 2., 3.]

Ranking & Relevance

1: By far the most common cause of malaria is being bitten by an infected mosquito, but there are also other ways to contract the disease.

2: Our cause was not helped by the poor health of the troops, many of whom were suffering from malaria and other tropical diseases.

also     1-19
…
cause    1-6 2-2
…
malaria  1-8 2-19
…
whom     2-15

query: malaria cause

Both pages match. On page 1 the two words are 2 positions apart (1-6, 1-8); on page 2 they are 17 apart (2-2, 2-19).

Nearness can resolve the ranking!
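The nearness idea can be worked through on the two index entries the query touches. This is a sketch of that single signal only, under the assumption that each word occurs once per page as in this example; `LOCATIONS` and `rank_by_nearness` are hypothetical names, and the slides do not say how nearness would be combined with other ranking factors.

```python
# Entries from the slide, stored as word -> {page: position}.
LOCATIONS = {
    "cause": {1: 6, 2: 2},
    "malaria": {1: 8, 2: 19},
}


def rank_by_nearness(locations, first, second):
    """Order the pages containing both words by the gap between them."""
    shared_pages = locations[first].keys() & locations[second].keys()
    gaps = {
        page: abs(locations[first][page] - locations[second][page])
        for page in shared_pages
    }
    # Smallest gap first: the closer the words, the higher the page ranks.
    return sorted(gaps.items(), key=lambda item: item[1])


print(rank_by_nearness(LOCATIONS, "malaria", "cause"))
# prints [(1, 2), (2, 17)]
```

Page 1, where cause and malaria are 2 words apart, ranks above page 2, where they are 17 apart.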

Using Metadata


CS380: Science of Information (Course Page)

Bryn Mawr College
CS 380: Recent Advances in Computer Science
Topic: Science of Information Fall 2019

BMC Class Number: 2283
Course Materials

Metadata

Each page now has a title as well as a body:

1: my cat
   the cat sat on the mat
2: my dog
   the dog stood on the mat
3: my pets
   the cat stood while a dog sat

With the markup included:

1: <title>my cat</title><body>the cat sat on the mat</body>
2: <title>my dog</title><body>the dog stood on the mat</body>
3: <title>my pets</title><body>the cat stood while a dog sat</body>

The tags are indexed like words and take up positions of their own:

a         3-10
cat       1-3 1-7 3-7
dog       2-3 2-7 3-11
mat       1-11 2-11
my        1-2 2-2 3-2
on        1-9 2-9
pets      3-3
sat       1-8 3-12
stood     2-8 3-8
the       1-6 1-10 2-6 2-10 3-6
while     3-9
<body>    1-5 2-5 3-5
</body>   1-12 2-12 3-13
<title>   1-1 2-1 3-1
</title>  1-4 2-4 3-4


Structure Queries

query: intitle: dog

dog       2-3 2-7 3-11
<title>   1-1 2-1 3-1
</title>  1-4 2-4 3-4

On page 2, dog at 2-3 lies between <title> at 2-1 and </title> at 2-4, so it is in the title. The occurrences at 2-7 and 3-11 come after the closing tag, so they are not.

hits:
2: my dog
   the dog stood on the mat
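An intitle: query can be answered from the index alone by comparing a word's positions with the positions of the title tags. The sketch below assumes the index entries 1-1 2-1 3-1 and 1-4 2-4 3-4 belong to the opening and closing title tags, and that each page has exactly one title; `INDEX` and `intitle` are illustrative names.

```python
# Entries from the slide, stored as token -> {page: [positions]}.
INDEX = {
    "cat": {1: [3, 7], 3: [7]},
    "dog": {2: [3, 7], 3: [11]},
    "<title>": {1: [1], 2: [1], 3: [1]},
    "</title>": {1: [4], 2: [4], 3: [4]},
}


def intitle(index, word):
    """Return the pages where the word occurs between the title tags."""
    hits = []
    for page, positions in sorted(index.get(word, {}).items()):
        starts = index["<title>"].get(page, [])
        ends = index["</title>"].get(page, [])
        # Assumption: tags pair up in order (one title per page here).
        in_title = any(
            start < position < end
            for start, end in zip(starts, ends)
            for position in positions
        )
        if in_title:
            hits.append(page)
    return hits


print(intitle(INDEX, "dog"))  # [2]
print(intitle(INDEX, "cat"))  # [1]
```

Page 3 contains both words, but only in the body, so it matches neither query.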


Web Information Retrieval

• Search Engines
• Queries
    phrase queries
    structure queries (NEAR, intitle:, …)
• Matching
• Inverted Index
    page number
    location
• Ranking & Relevance
• Metadata

Efficient matching is only one half of the story. The other grand challenge is how to rank the matching pages.


References

• Google’s PageRank and Beyond, Amy N. Langville and Carl D. Meyer, Princeton University Press, 2006.
• Nine Algorithms That Changed The Future, John MacCormick, Princeton University Press, 2012.
• Learning Computing with Robots, Deepak Kumar, IPRE 2011.
