
Web Search Algorithms

- 1 - Why web search in this module?

• WWW is the delivery platform and the interface
• How do we find information and services on the web? We try to generate a URL that seems sensible
– Dell Computers – www.dell.ie, Ford Ireland – www.ford.ie
– But products? GPS devices – www.gps.ie is not OK
• Or, we use a Search Engine
– So we rely on Search Engines - we even use them to look up spellings and as a calculator!
• Search Engines bring people to a destination
– For most, such as Google, the ranking algorithm is closely guarded, wholesome, true, uncorrupted, and not paid for
• Advertisements are merely sold based on similarity to query keywords
• This leads to the industry of Search Engine Optimisation (SEO)… the "Google Dance"

- 2 - Text IR - Google as example

• Google has been operational since 1998
– Started by two PhD students from Stanford
• ?? billion documents
– Early search engines competed on the size of their index, which reflected how powerful their infrastructure was. Not an issue now.
– Stopped advertising index size after 8,168,684,336 pages in Aug 2005
– Size now effectively unknown
• Also has ??? billion images – not all unique images
– Flickr had about 2B (Nov 2007); Facebook had 4.1B at that time

- 3 - Searching or Marketing?

• However, Search Engines must make a profit!
– Advertisement sales
– Marketing
– Paid listings
– And selling their indexes
• A lot of Search Engines are also marketing companies…
– This is at odds with the idea that a search engine is a page you visit on the way elsewhere
• The less time you spend there the better!
• But many people 'pass through the doors', so they sell query-focused advertisements
– You can estimate this by looking at the main page of the search engine

- 4 - How do SEs help user searches

• It is known that we search for
– people / home pages
– companies / company HPs (or guess from URLs)
– a particular product or service
– a fact, buried in one or more documents, any one of which will do…
– a document, an entire document, with text/image, and nothing smaller will do
– an overview of a broad or narrow topic
– media search
• an MPEG-4 file
• through image databases
• through a (digital) video library, and/or within a video
• If the SE knows the type of query, then ranking can be tailored to that query, because different search types can be satisfied by different search algorithms.

- 5 - Search Engines

• Originally SEs were web directories
– Manually generated (e.g. Yahoo!)
• Then automatic crawler-based Search Engines developed
– The web got big and manual categorisation was becoming too difficult (e.g. )
– Today the large SEs index over ?? billion web pages
– The first crawler-based SE was the WWWW in 1994

- 6 - Architecture of a Search Engine

- 7 - My Google!

- 8 - Bing

- 9 - Facebook ? Is it a Search Engine?

- 10 - Facebook Social Graph

(Diagram: a social graph with clusters of contacts – A Previous Class, College Friends, Friends, and the IR Research Community.)

- 11 - TWITTER

- 12 - The Landscape is changing

- 13 - Web 1.0 to Web 3.0

• Web 1.0
– Static content… Companies created content
– We were consumers
• Web 2.0
– User generated content
– Communities and creators… We create, filter, and recommend the content
• Web 3.0
– UGC and… Semantic Web… Life streams?
– Social and Location
– What is the next big thing?

- 14 - Web 1.0

• Search engines over prepared and planned content

• Organisations and some users

• SEO was the way to optimise WEB 1.0

• HTML and static content

- 15 -

- 16 - Web 2.0

• User and Organisation Generated Content
• Social Graphs
• Social Filtering and Social Ranking
• Examples:
– Social networks: Facebook, Twitter, LinkedIn
– Shared bookmarks: Digg, Delicious, Reddit, StumbleUpon
– Social media sharing: Flickr, YouTube
– Blogs (MSN Spaces, WordPress, Blogger)
– Even 3D social worlds… Social gaming?

- 17 -

- 18 - Web 3.0

• Semantic Web – Many media types... Integrated for smarter uses

• Rich media integration

• Personalisation to the user context

• Life streaming of content – We are integrated into our own entertainment

- 19 - What is Web3.0 about?

- 20 - The Search Landscape

Changing enormously

- 21 - Continuous Partial Attention

• Be aware of Continuous Partial Attention… a kind of multitasking
• Skimming the surface of the incoming data, picking out the relevant details, and moving on to the next stream
• Continuous, not episodic
• Casting a wider net, but never giving full attention

• So.. How does this impact on search?

http://www.wisegeek.com/what-is-continuous-partial-attention.htm

- 22 - And don’t forget the twitter curve…

http://headrush.typepad.com/creating_passionate_users/2006/12/httpwww37signal.html

- 23 - Google AdSense

- 24 - Spamming

• Spamming is a technique based on the manipulation of content in order to affect ranking by search engines
– Bogus meta tags, hidden text, plain text…
– Also link spamming…
• Huge SE resources are used in defeating spamming - more than in search quality improvement!
• Getting into the top 10 is essential for businesses
– 85% of users only look at the top 10
– This led to the business of Search Engine Optimisation

- 25 - Search Engine Ranking

As we all know, simply examining content as text is not enough. We need to examine ranking factors, both positive and negative.

- 26 - Positive Ranking Factors : Term Location

• In the TITLE of the page – most important
• In the body of the text, but it must MAKE SENSE
• In the heading text (H1, H2…)
• In the domain name
– Also in the page URL
• In the ALT tag and image title
• In BOLD/STRONG tags
• Terms near the top of the page are likely ranked higher than other terms
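A minimal sketch (not any search engine's actual formula) of how term location could feed a ranking score: each field gets a weight, and occurrences of a query term are counted per field. The field names and weights below are illustrative assumptions only.

    # Hypothetical field weights: a term in the title or a heading counts
    # more than the same term in plain body text.
    FIELD_WEIGHTS = {
        "title": 5.0,
        "h1": 3.0,
        "url": 2.5,
        "alt": 1.5,
        "strong": 1.5,
        "body": 1.0,
    }

    def location_score(term, fields):
        """fields maps a field name to the list of terms found in that field."""
        score = 0.0
        for field, terms in fields.items():
            score += FIELD_WEIGHTS.get(field, 1.0) * terms.count(term)
        return score

    page = {
        "title": ["cheap", "gps", "devices"],
        "h1": ["gps", "devices"],
        "body": ["we", "sell", "gps", "devices", "online"],
    }
    print(location_score("gps", page))   # the title hit dominates the score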

- 27 - Positive Ranking Factors : Page Attributes

• Importance of the page in the website
– Number of links to it from the same website
• Quality of links to other pages
• Age of a document
– Older may be more authoritative
• We will see authorities later!
– Newer may be better for some queries (e.g. news)
• Amount of text on the page
• Structure of the page
• Frequency of updates
• Spelling and correctness of the HTML

- 28 - SE Ranking + : Website Issues

• Linkage of the website
– Global link popularity of the website
• Like a global PageRank (SiteRank)
– Relevance of the links into the website
– Link popularity of the site in a topical community
– Rate of new inbound links to a website
• Age of a website (older is better)
• Freshness of a website (new pages are better)
• Relevancy of the website (as well as the page)
• Clickthrough rate for the website
• Reputation of the top-level domain
– E.g. .GOV & .EDU … cannot easily be bought

- 29 - SE Ranking + : Linkage Issues

• Anchor text of inbound links as a description of the WWW page
– Also the text surrounding the link into the webpage
• Topical relationship between source and target of a link
• Link popularity of the page in a topical community
• Age of links
– The older the better, i.e. long-lasting links
• PageRank of the webpage
– Google's PageRank algorithm
• Number of links into a web page

- 30 - Positive Ranking Factors : Images

• Images on a web page
– Can provide a chance to express ideas in a visual way that can convey a considerable amount of information
– Add to the attractiveness and perceived quality of a site
– Recent patent on "Scoring Relevance of a Document Based on Image Text"
– Also… remember to name the image properly and have an alt element

- 31 - Negative Ranking Factors

• Link farm participation
– Trying to artificially increase PageRank
• Proportion of links to or from known spamming sites
• Duplicate content with respect to already-indexed content
• Errors or server down-time
• External links to low-quality content
• Low level of visitors to the website
• Trying to include hidden text on the page

- 32 - Using the Ranking Factors…

(Diagram: the User Query is combined with Term Location Factors, Page Factors, Website Factors, Linkage Factors, PageRank Factors and Negative Factors to produce the Result.)

The Search Engine ranking process is a closely guarded trade secret of the search engines.

- 33 - So let's look in some detail at some of these ranking factors…

Linkage-based Search

- 34 - The Shape of the WWW

This is based on a study of 200 million web pages. Scale up to WWW scale.

- 35 - Spidering : finding WWW content

• A Search Engine needs to find WWW content for its index
– This is done by the spidering software
• Starting from some 'seed' WWW pages, the spider software downloads these pages and extracts the links, thereby learning about new pages to crawl
• WWW-scale crawling means crawling thousands of pages per second

- 36 - A Basic Crawling Algorithm

• You need to be linked to from the main WWW… remember the shape!
• Given a set of 'seed' URLs (WWW page addresses):
– Add them to a (priority) queue of URLs
– While the queue is not empty (!empty):
• Take the first URL (u) off the queue
• Download the WWW page for u
• Store the URL in a list of seen URLs
• Index it
• If u is an HTML page, extract the links (y)
– For each y, add it to the queue if it has not been visited before
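A minimal sketch of this crawling loop in Python, using only the standard library; a production spider would add politeness delays, robots.txt checks, URL canonicalisation and a real priority queue.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def index_page(url, html):
        print("indexed", url, len(html), "bytes")   # placeholder for the indexer

    def crawl(seed_urls, max_pages=100):
        queue = deque(seed_urls)          # the (priority) queue of URLs
        seen = set(seed_urls)             # the list of seen URLs
        while queue and max_pages > 0:
            u = queue.popleft()           # take the first URL off the queue
            try:
                html = urlopen(u, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                  # unavailable: penalise / flag in practice
            index_page(u, html)           # hand the page to the indexer
            max_pages -= 1
            parser = LinkExtractor()
            parser.feed(html)             # if u is HTML, extract the links
            for y in parser.links:
                y = urljoin(u, y)
                if y not in seen:         # add y only if not visited before
                    seen.add(y)
                    queue.append(y)

    # crawl(["https://example.com/"])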

- 37 - Spiders must behave!

• Most crawlers/spiders will follow some rules:
– A spider must never request large numbers of documents from the same host sequentially… change the target website as often as is feasible
– A spider must never (for whatever reason) repeatedly request the same document. If a document is unavailable, its position in the queue must be penalised… repeated failures must be taken into account and the document flagged as unavailable and taken off the queue
– A spider must respect authors' wishes as expressed using the robots exclusion protocol

- 38 - Robots Exclusion

The robots exclusion protocol (robots.txt) allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot. Most good robots will process it… BUT it makes a crawler less efficient… more explorative crawling is required.

To exclude all robots from the entire server:
User-agent: *
Disallow: /

To allow all robots complete access:
User-agent: *
Disallow:

To exclude all robots from part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

To exclude a single robot:
User-agent: BadBot
Disallow: /

To allow a single robot (and exclude all others):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
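A short sketch of how a spider can honour robots.txt using Python's standard library; the URL and user-agent strings are illustrative only.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()                                   # fetch and parse the file

    if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")       # respect the author's wishes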

- 39 - Robots.txt example

- 40 - Another Example

- 41 - And one more…

- 42 - Simple Overview

(Diagram: WWW → 1. Spidering → 2. Indexing → 3. Ranking; the user then views the WWW page.)

- 43 - WWWW – the first SE

• WWWW (1994) did not use the content of a page for indexing; it used:
– the title of the document
– the text in the URL string
– any anchor text from links pointing to the page

• Based on using the UNIX egrep program to search through disk files.

All SEs now use Linkage Analysis to exploit latent human judgement to improve retrieval performance

This is in addition to using the document content.

- 44 - Some history… Citation Analysis

The most significant contribution to web search came from the technique for ranking journals based on quality (impact).

Citation indexing… the 'impact factor' measurement… is based on two elements:
– the number of citations in the current year to any articles published in the journal over the previous two years
– the number of articles published by the journal during these two years

• Letting j be a journal and IF_j be the Impact Factor of journal j, we have:

IF_j = #Citations (last 2 years) / #Published Articles (last 2 years)

• This "impact factor" was originally applied to medical journals as a simple method of comparing journals to each other regardless of their size.
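A tiny worked sketch of the calculation, with made-up figures: a journal that received 450 citations this year to the articles it published in the previous two years, during which it published 150 articles, has an impact factor of 3.0.

    def impact_factor(citations_last_2_years, articles_last_2_years):
        # IF_j = #Citations(last 2 years) / #Published Articles(last 2 years)
        return citations_last_2_years / articles_last_2_years

    print(impact_factor(450, 150))   # 3.0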

- 45 - Hirsch Index (h-index)

• Citation Analysis is a balance between quality (number of citations) and quantity (number of papers)
• Among scientists, the h-index is becoming popular for measurement… it is the largest number h such that the scientist has h published papers with at least h citations each
– Alan Smeaton has 250+ papers, about 3,000 citations, and an h-index of 30
– Desmond Higgins (UCD) has 29,000 citations (22,500 on one paper), and an h-index of 22
• Linkage analysis in web topology does something like this, as we'll see
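A minimal sketch of computing the h-index from a list of per-paper citation counts (the numbers are illustrative only).

    def h_index(citations):
        citations = sorted(citations, reverse=True)
        h = 0
        for i, c in enumerate(citations, start=1):
            if c >= i:        # i papers each have at least i citations
                h = i
            else:
                break
        return h

    print(h_index([10, 8, 5, 4, 3, 0]))   # 4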

- 46 - Linkage Analysis

Linkage Analysis : a method of ranking web sites which is based on the exploitation of latent human judgments mined from the hyperlinks that exist between documents on the WWW.

The first generation of web search engines were effectively based on TF-IDF or BM25, or the equivalent.

And they have addressed the engineering problems of web spidering and efficient searching for large numbers of both users and documents.

Linkage Analysis has been important since the late 90s.

Anecdotally this appears to have improved the precision of retrieval yet there was little scientific evidence in support of this until recently.

- 47 - Origin : Citation Analysis

How to rank Journals based on quality (impact)

Citation indexing… the 'impact factor' measurement… is based on two elements:
– the number of citations in the current year to any articles published in the journal over the previous two years
– the number of articles published by the journal during these two years

• Letting j be a journal and IF_j be the Impact Factor of journal j, we have:

IF_j = #Citations (last 2 years) / #Published Articles (last 2 years)

• This "impact factor" was originally applied to medical journals as a simple method of comparing journals to each other regardless of their size.

- 48 - Mining links can tell us that…

• Bibliographic Coupling – A and B are similar because they both cite C,D,E

• Co-citation Analysis – A and B are similar because they are both cited by C,D,E

- 49 - What else can we do with links?

• Count them?
• Distinguish between good and bad ones?

• How we employ them is called Linkage Analysis
– Linkage-based ranking schemes can be seen to belong to one of two distinct classes:
• Query-independent schemes
– A score is assigned to a document once and used for all subsequent queries
» independent of a given query
– Fast processing at query time!
• Query-dependent schemes
– Assign a linkage score to a page in the context of a given query
– Slower processing at query time!

- 50 - Assumed Properties of Links

When extracting information for linkage analysis from hyperlinks on the Web, two core properties can be assumed:

– A link between two documents on the web carries the implication of related content.

– If different people authored the documents (different domains, therefore off-site links), then the first author found the second document valuable
• An author cannot be allowed to influence the linkage score of documents within his/her own domain
– Off-site links (links between web sites) are more important than links within a site or within a document

- 51 - Link Types

in-links to doc F: 5, 8, 9
out-links from doc F: 4, 6, 10
self-links: 2, 11
on-site links: 6, 8, 12
off-site links: 1, 3, 4, 5, 9, 10
on-site in-links to doc F: ?
off-site out-links of doc F: ?

- 52 - Basic Linkage Analysis

Given a linkage graph (below), Page A is a better page than B because…

(Diagram: a link graph over pages A–J, off-site links only; A has more in-links than B.)

- 53 - Expanding on this…

However, page B may actually be better…

(Diagram: the same graph, but B's in-links now come from important pages such as CNN and Yahoo.)

So we use iterative processes… like PageRank or Kleinberg’s

- 54 - Generating a linkage score

Let n be some web page and S_n be the set of web pages that link into n across off-site links:

P_n = |S_n|

In this case, the P_n score (Popularity score) is based purely on the in-degree of document n…

It could be the sole source of document ranking given a set of relevant documents (Boolean IR), OR it could work by integrating normal document retrieval (TF-IDF / BM25 scores) to generate an overall weight.

Once again, we let n be some web page and S_n be the set of pages that link into n; assuming normalised scores and tuning parameters α and β:

Sc'_n = (α × Sim(q, n)) + (β × S_n)
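A minimal sketch of that combination: a normalised content score (e.g. a TF-IDF or BM25 similarity to the query) is mixed with a normalised in-degree popularity score. The alpha and beta mixing parameters and the figures are illustrative assumptions.

    def combined_score(sim_q_n, in_degree_n, max_in_degree, alpha=0.7, beta=0.3):
        # Normalise the popularity score by the largest in-degree seen,
        # then mix it with the (already normalised) content similarity.
        popularity = in_degree_n / max_in_degree if max_in_degree else 0.0
        return alpha * sim_q_n + beta * popularity

    print(combined_score(sim_q_n=0.62, in_degree_n=40, max_in_degree=200))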

- 55 - More simple linkage techniques

• Weighted Citation Ranking

• Spreading Activation & Co-citation Analysis
– SA: spreads a score across out-links
– CA: passes a score back to the hub document

- 56 - Hubs & Authorities

• A Hub is a document that contains links to many other documents
• An Authority is a document that many documents link to
• A good Hub links to good Authorities
• A good Authority is linked to by good Hubs

(Diagram: a hub document linking out to many documents, and an authority document that many documents link into.)

- 57 - What makes a good Hub…?

What makes a good hub for the query “web browsers”?

(Diagram: a hub page for "web browsers" linking out to Internet Explorer, Netscape, Mozilla, MyBrowser and NeoPlanet.)

- 58 - What Makes a good Authority

What makes a good Authority for the query “web browsers”?

(Diagram: many hub pages linking into browser pages – Internet Explorer, Amaya, Mozilla, Opera, Firefox, MyBrowser, NeoPlanet; the heavily-linked ones are the authorities.)

- 59 - And What makes these authorities good?

Good hubs that themselves link into good authorities… a self-reinforcing relationship!

(Diagram: the same hubs and browser authorities – good hubs and good authorities reinforcing one another.)

- 60 - The Influence of Links

• A document's content can be represented by the anchor text of (all) the in-links into that doc, not by the document itself
• More in-links means more content, and a better chance of being returned for a query
• Very simple, but effective!
• Improved by windowing… (see the sketch below)

(Diagram: the anchor text of documents linking to a page is gathered into a surrogate document for that page.)
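A minimal sketch of building such an anchor-text surrogate for a target page; the link data is illustrative, and a real system would also keep a window of text around each link.

    from collections import defaultdict

    # (source page, target page, anchor text of the link)
    in_links = [
        ("http://a.example/", "http://target.example/", "best formula 1 news"),
        ("http://b.example/", "http://target.example/", "F1 results and news"),
    ]

    surrogate = defaultdict(list)
    for source, target, anchor_text in in_links:
        surrogate[target].append(anchor_text)   # more in-links => richer surrogate

    # The surrogate 'document' that gets indexed for the target page:
    print(" ".join(surrogate["http://target.example/"]))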

- 61 - The Importance of Windows

- 62 - The Importance of Windows

- 63 - Iterative Linkage Algorithms

PageRank

- 64 - PageRank

• Query-INDEPENDENT score for every document

• An important aspect of Google ranking…? It allocates a PageRank (query-independent importance) score to every document in an index, and this score is used when ranking documents.

• Simple iterative algorithm – repeat until convergence

• A simulation of a random user's behaviour when browsing the web.
– Equivalent to a user randomly following links, or getting bored and jumping to a random page anywhere on the WWW. In effect it is based on the probability of a user landing on any given page.

• This can be applied to graphs other than the WWW graph… social networks, blog comments?

- 65 - Key points…

• The PR of A is divided equally among its out-links
• The PR of B is equal to the sum of the transferable PR of all its in-links

(Diagram: W, X, Y and Z each start with PR = 1; A, with PR_A = 1 and four out-links, passes ¼ to each; B receives in-link contributions of 1, ¼, ½ and ½, so PR_B = 2¼.)

- 66 - For Example…

The PageRank PR_F of document F is equal to PR_B divided by the out-degree of B, summed with PR_D divided by the out-degree of D:

PR_F = PR_B / 2 + PR_D / 3

- 67 - The Simplified Technique

1. Calculate a pre-iteration PageRank score for each document:
for all n in N: PR_n = 1 / |N|

2. Calculate a new PageRank score for each document (assume c = 1):
PR'_n = c × Σ_{m ∈ S_n} PR_m / outdegree_m

3. Store the new PageRank scores:
for all n in N: PR_n = PR'_n

4. If not converged, go to step 2
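A minimal sketch of this simplified iteration (c = 1, no damping, and no handling yet of dangling links or rank sinks) on an illustrative three-page graph.

    def simple_pagerank(out_links, iterations=20):
        pages = list(out_links)
        pr = {n: 1.0 / len(pages) for n in pages}    # step 1: uniform start
        for _ in range(iterations):                  # step 4: repeat until converged
            new_pr = {n: 0.0 for n in pages}
            for m in pages:
                share = pr[m] / len(out_links[m])    # step 2: m shares its rank
                for n in out_links[m]:               #         over its out-links
                    new_pr[n] += share
            pr = new_pr                              # step 3: store the new scores
        return pr

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(simple_pagerank(graph))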

- 68 - A Simple Web Graph

(Diagram: a simple web graph over pages A, B, C, D, E, F and G.)

- 69 - PageRank – Sample Graph

(Diagram: the sample graph with an initial PageRank of 1 on each of the seven pages; Total = 7.0.)

- 70 - PageRank – after Iteration 1

(Diagram: PageRank scores after iteration 1 – 1, 1.5, .5, 1, .5, 1, .5; Total = 6.0.)

- 71 - PageRank – after Iteration 2

(Diagram: PageRank scores after iteration 2 – 1.5, 1.5, .5, .75, .5, .5, .25; Total = 5.5.)

- 72 - PageRank – Problem 1 (Dangling Links)

(Diagram: a page with in-links but no out-links – where does its PageRank go?)

- 73 - PageRank - Problem 2 (Rank-Sink)

- 74 - PageRank – Problem 1 (Dangling Links)

(Diagram: the dangling-links example again.)

- 75 - PageRank – Problem 1 (Dangling Links)

(Diagram: the dangling page is removed from the graph.)

- 76 - PageRank - Problem 2 (Rank-Sink)

- 77 - PageRank - Problem 2 (Rank-Sink)

(Diagram: 15% of the rank is redistributed via a vector E over all web pages – here a uniform entry of 0.14 for each of Doc 1 … Doc 7.)

Hence if all PageRanks sum to 1.0, then ||E|| = 0.15.

- 78 - The two problems…

• Dangling Links: these are links that point to a page which itself contains no out-links…
– Docs which the system knows about (and has anchor text descriptions for) but has not downloaded yet
– Or just docs with no links out…
– If the PageRank associated with the targets of these links is not redistributed at each iteration, it is lost from the system
– SOLUTION: remove the page, or use a Universal Document…

• Rank Sinks: these are two or more pages that have out-links to each other, but to no other pages. Assuming we have at least one in-link into these pages from a page outside of them, then at each iteration rank enters these pages and never exits… they accumulate rank…
– SOLUTION: use the E vector with ||E|| = 0.15, or…
– … include a Virtual (Universal) Document

- 79 - How to use this Vector?

• This vector has an entry for each document and is used as an indicator of how to distribute any redundant rank back into the system.
– Each document's entry in the vector (E) represents the proportion of rank to be given to that document; it is usually uniform, with ||E|| = 0.15 if the sum of all PageRanks is 1.
– But we can do personalisation… e.g. to focus on Formula 1 pages, increase their weight in E.

• Letting E_n be some vector over the Web pages that corresponds to a source of rank, c a constant which is maximised, and ||PR|| = 1 (the sum of all PageRanks = 1), we have the following formula:

PR'_n = c × Σ_{m ∈ S_n} PR_m / outdegree_m + (1 − c) × E_n
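A minimal sketch of that damped / personalised update, PR'(n) = c × Σ PR(m)/outdegree(m) + (1 − c) × E(n); with a uniform E this behaves like standard PageRank, and skewing E personalises the ranking. The graph and vector below are illustrative.

    def pagerank_with_e(out_links, e_vector, c=0.85, iterations=50):
        pages = list(out_links)
        pr = {n: 1.0 / len(pages) for n in pages}
        for _ in range(iterations):
            # every page first receives its share of the redistributed rank
            new_pr = {n: (1.0 - c) * e_vector[n] for n in pages}
            for m in pages:
                if out_links[m]:                    # dangling pages pass nothing on
                    share = c * pr[m] / len(out_links[m])
                    for n in out_links[m]:
                        new_pr[n] += share
            pr = new_pr
        return pr

    graph = {"A": ["B"], "B": ["A", "C"], "C": []}
    uniform_e = {n: 1.0 / len(graph) for n in graph}   # ||E|| spread evenly
    print(pagerank_with_e(graph, uniform_e))

To personalise, e.g. towards Formula 1 pages, the entries of e_vector for those pages would simply be increased (and the rest decreased) before running the same iteration.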

- 80 - Alternate Solution!

(Diagram: a Universal Document (UD) is added, linked to and from every page, alongside the uniform vector of 0.14 for Doc 1 … Doc 7.)

The probability of a user getting bored is now 1/(n+1), where n = the number of out-links… not a fixed 0.15.

- 81 - Personalised PageRank

(Diagram: a personalised vector – 0.10, 0.05, 0.05, 0.35, 0.25, 0.10, 0.10 for Doc 1 … Doc 7 – skews the redistributed rank towards the favoured documents.)

- 82 - Using PageRank…

(Diagram: at query time, the content score of a document n for the query is combined with its precomputed PageRank score from the PageRank array – via some undisclosed formula (???) – to produce the final document score.)

- 83 - Kleinberg’s Algorithm

Kleinberg’s algorithm is similar to PageRank, in that it is an iterative algorithm based purely on the linkage of the documents on the web. However it does have some major differences:

• It is executed at query time, not at indexing time, with the associated performance hit that accompanies query-time processing.
• Is it used in SEs?… not commonly!

• It computes two scores per document (hub and authority) as opposed to a single score.

• It is processed on a small subset of ‘relevant’ documents, not all documents as was the case with PageRank.

- 84 - Recall Hubs and Authorities

HUB Page: a hub page is a page that contains a number of links to pages containing information about some topic, e.g. a resource page containing links to documents on a topic such as 'Formula 1 motor racing'. Each such page has a hub score representing its quality as a source of links.

AUTHORITY Page: an authority page is one that contains a lot of information about some topic, an 'authoritative' page. Consequently, many pages will link to this page, thus giving us a means of identifying it. Each such page also has an authority score representing its perceived quality as judged by other people.

Documents with high authority scores are expected to contain relevant content, whereas documents with high hub scores are expected to contain links to relevant documents.

- 85 - HITS Process

(Diagram of the HITS process:
1. form a Root Set of documents for the query;
2. expand it into the Expanded Set – the focused subgraph of the WWW;
3. iterate over that subgraph:
Hub_p = Σ Auth_q over all q that p links to
Auth_p = Σ Hub_q over all q that link to p
4. return the top hubs and authorities.)

- 86 - Hub Scores

Hub_p = Σ Auth_q over all q that p links to

(Diagram: page P links out to X, Y and Z, so Hub_P = Auth_X + Auth_Y + Auth_Z.)

- 87 - Authority Scores

Auth_p = Σ Hub_q over all q that link to p

(Diagram: X, Y and Z link into P, so Auth_P = Hub_X + Hub_Y + Hub_Z.)

- 88 - Kleinberg’s HITS Technique

• Iteratively calculates Hub & Authority scores
• Begin with all Hub & Authority scores = 1
• 10+ iterations needed until convergence
– Hub scores based on the Authority scores of off-site out-link docs
– Auth scores based on the Hub scores of off-site in-link docs
• Return the top X Hubs and/or Authorities
• Once the expanded set is generated there is no further content analysis (topic independent)
• A narrow topic will diffuse to a broader topic
– A broad topic may produce inaccurate results

- 89 - Kleinberg’s Algorithm

Hub_i ← 1, Auth_i ← 1 for all documents i

while (not converged):
for n = 1, 2, …, N:
Auth'_n = Σ_{m ∈ S_n} Hub_m      (S_n : the pages linking into n)
Hub'_n = Σ_{o ∈ T_n} Auth_o      (T_n : the pages n links out to)
Normalise Auth'_n, obtaining Auth_n
Normalise Hub'_n, obtaining Hub_n
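A minimal sketch of the HITS iteration on an illustrative graph; for simplicity it ignores the on-site/off-site distinction and normalises by the sum of scores rather than a particular norm.

    def hits(out_links, iterations=20):
        pages = list(out_links)
        in_links = {n: [m for m in pages if n in out_links[m]] for n in pages}
        hub = {n: 1.0 for n in pages}     # begin with all scores = 1
        auth = {n: 1.0 for n in pages}
        for _ in range(iterations):
            # authority scores from the hub scores of in-linking pages
            auth = {n: sum(hub[m] for m in in_links[n]) for n in pages}
            # hub scores from the authority scores of out-linked pages
            hub = {n: sum(auth[m] for m in out_links[n]) for n in pages}
            a_norm = sum(auth.values()) or 1.0
            h_norm = sum(hub.values()) or 1.0
            auth = {n: s / a_norm for n, s in auth.items()}
            hub = {n: s / h_norm for n, s in hub.items()}
        return hub, auth

    graph = {"hub1": ["ie", "mozilla"], "hub2": ["ie", "mozilla", "opera"],
             "ie": [], "mozilla": [], "opera": []}
    hub, auth = hits(graph)
    print(sorted(auth.items(), key=lambda kv: -kv[1]))   # ie and mozilla lead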

- 90 - Wrapping up SEs

• SEs now provide more than just searching and are "portals" - a consumer-oriented gateway to web resources which is editorially controlled: links to what search engines, or their paying clients, believe you may be interested in
• Search engines are "for profit" ventures, not charities…
– Some sell their indexes
– Mostly advertising
• 10% to 15% of queries to the major search engines are on adult themes
• They offer lots of extras including: media search, identification of names, Amazon links, related-searches listing, page translation, language-specific search…
– then there is photo management, …, music…

- 91 - Final thoughts

• Sub-1-second querying is essential
– No time for interesting algorithms, Q&A, manual query expansion, …

• The belief is that searchers are happy with sub-optimal results as long as there is no delay in getting them.

• No industry standard benchmark for evaluation.
