Web Search Algorithms

- 1 -
Why web search in this module?
• The WWW is the delivery platform and the interface
• How do we find information and services on the web? We try to generate a URL that seems sensible
  – Dell Computers – www.dell.ie, Ford Ireland – www.ford.ie
  – But products?
    • GPS Devices – www.gps.ie is not ok
• Or, we use a Search Engine
  – So we rely on Search Engines: we even use them to look up spellings and as a calculator!
• Search Engines bring people to a website
  – For most, such as Google, the ranking algorithm is closely guarded, wholesome, true, uncorrupted, and not paid
    • Advertisements are merely sold based on similarity to query keywords.
• This leads to the industry of Search Engine Optimisation (SEO) ... the “Google Dance”

- 2 -
Text IR - Google as example
• Google has been operational since 1998
  – Two PhD students from Stanford
• ?? Billion documents
  – Early search engines competed on size of index, related to how powerful their infrastructure was. Not an issue now.
  – Stopped advertising its index size after 8,168,684,336 pages in Aug 2005
  – Size now effectively unknown
• Also has ??? billion images
  – Not all unique images
  – Flickr has about 2B (Nov 2007); Facebook had 4.1B at that time

- 3 -
Searching or Marketing?
• However, Search Engines must make a profit!
  – Advertisement Sales
  – Marketing
  – Paid Listings
  – And selling their indexes
• A lot of Search Engines are also marketing companies…
  – This is at odds with the idea that a search engine is a page you visit on the way elsewhere.
    • The less time you spend there the better!
• But many people ‘pass through the doors’, so they sell query focussed advertisements
  – You can estimate by looking at the main page of the search engine.

- 4 -
How do SEs help user searches?
• It is known that we search for
  – people / home pages.
  – companies / company HPs (or guess from URLs).
  – a particular product or service.
  – a fact, buried in one or more documents, any one of which will do…
  – a document, an entire document, with text/image, and nothing smaller will do.
  – an overview on a broad or narrow topic
  – Media Search
    • an MPEG-4 file.
    • Through image databases.
    • Through (digital) video library, and/or through a video.
• If the SE knows the type of query, then ranking can be tailored to that query, because different search types can be satisfied by different search algorithms.

- 5 -
Search Engines
• Originally SEs were web directories
  – Manually generated (e.g. Yahoo!)
• Then automatic crawler-based Search Engines developed
  – The web got big and manual categorisation was becoming too difficult (e.g. Lycos)
  – Today the large SEs index over ?? billion web pages.
  – The first crawler-based SE was the WWWW in 1994

- 6 -
Architecture of a Search Engine

- 7 -
My Google!

- 8 -
Bing

- 9 -
Facebook? Is it a Search Engine?

- 10 -
Facebook Social Graph
[Figure: social graph with clusters labelled A Previous Class, College Friends, Friends, IR Research Community]

- 11 -
TWITTER

- 12 -
The Landscape is changing

- 13 -
Web 1.0 to Web 3.0
• Web 1.0
  – Static content... Companies created content
  – We were consumers
• Web 2.0
  – User generated content
  – Communities and creators... We create, filter, recommend the content
• Web 3.0
  – UGC and... Semantic Web... Life streams?
  – Social and Location
  – What is the next big thing?

- 14 -
Web 1.0
• Search engines over prepared and planned content
• Organisations and some users
• SEO was the way to optimise Web 1.0
• HTML and static content

- 15 -

- 16 -
Web 2.0
• User and Organisation Generated Content
• Social Graphs
• Social Filtering and Social Ranking
• Examples:
  – Social networks: Facebook, Twitter, LinkedIn
  – Shared bookmarks: Digg, Delicious, Reddit, StumbleUpon
  – Social media sharing: Flickr, YouTube
  – Blogs (MSN Spaces, WordPress, Blogger)
  – Even 3D social worlds... Social gaming?

- 17 -

- 18 -
Web 3.0
• Semantic Web
  – Many media types... Integrated for smarter uses
• Rich media integration
• Personalisation to the user context
• Life streaming of content
  – We are integrated into our own entertainment

- 19 -
What is Web 3.0 about?

- 20 -
The Search Landscape
Changing enormously

- 21 -
Continuous Partial Attention
• Be aware of Continuous Partial Attention... a kind of multitasking
• Skimming the surface of the incoming data, picking out the relevant details, and moving on to the next stream.
• Continuous, not episodic
• Cast a wider net, but never full attention
• So... how does this impact on search?
http://www.wisegeek.com/what-is-continuous-partial-attention.htm

- 22 -
And don’t forget the Twitter curve…
http://headrush.typepad.com/creating_passionate_users/2006/12/httpwww37signal.html

- 23 -
Google AdSense

- 24 -
Spamming
• Spamming is a technique based on the manipulation of content in order to affect ranking by search engines
  – Bogus meta tags, hidden text, plan text…
  – Also link spamming…
• Huge SE resources are used in defeating spamming - more than in search quality improvement!
• Getting in the top 10 is essential for businesses
  – 85% of users only look at the top 10.
  – Led to the business of Search Engine Optimisation

- 25 -
Search Engine Ranking
As we all know, simply examining web page content as text is not enough. We need to examine ranking factors, positive and negative.

- 26 -
Positive Ranking Factors : Term Location
• In the TITLE of the page, most important
• In the body of the text, but must MAKE SENSE
• In the Heading text (H1, H2…)
• In the Domain Name
  – Also in page URL
• In the ALT attribute and image title
• In BOLD/STRONG tags
• Terms near the top are likely ranked higher than other terms

- 27 -
Positive Ranking Factors : Page Attributes
• Importance of the page in the Website
  – Number of links to it from the same website
• Quality of links to other pages
• Age of a document
  – Older may be more authoritative
    • We will see authorities later!
  – Newer may be better for some queries (e.g. news)
• Amount of text on the page
• Structure of the page
• Frequency of updates
• Spelling and correctness of HTML

- 28 -
SE Ranking + : Website Issues
• Linkage of the Website
  – Global link popularity of the website
    • Like a global PageRank (SiteRank)
  – Relevance of the links into the website
  – Link popularity of the site in a topical community
  – Rate of new inbound links to a website
• Age of a website (older is better)
• Freshness of a website (new pages are better)
• Relevancy of the website (as well as the page)
• Clickthrough rate for the website
• Reputation of the top-level domain
  – E.g. .GOV & .EDU … cannot easily be bought

- 29 -
SE Ranking + : Linkage Issues
• Anchor text of inbound links as a description of the WWW page
  – Also text surrounding the link into the webpage
• Topical relationship between source and target of link
• Link popularity of the page in a topical community
• Age of links
  – The older the better, i.e. long lasting links
• PageRank of the webpage
  – Google’s PageRank algorithm
• Number of links into a web page

- 30 -
Positive Ranking Factors : Images
• Images on a web page
  – Can provide a chance to express ideas in a visual way that can convey a considerable amount of information
  – Add to the attractiveness and perceived quality of a site.
  – Recent Microsoft Patent on “Scoring Relevance of a Document Based on Image Text”
  – Also, remember to name the image properly and give it an alt attribute

- 31 -
Negative Ranking Factors
• Link Farm Participation
  – Tries to artificially increase PageRank
• Proportion of links to or from known Spamming sites
• Content that duplicates already indexed content
• Server Errors or server down-time
• External links to low-quality content
• Low level of visitors to the website
• Attempts to include hidden text on the page

- 32 -
Using the Ranking Factors…
[Diagram: User Query feeding into PageRank Factors, Linkage Factors, Negative Factors, Website Factors, Page Factors and Term Location Factors, producing the Result]
The Search Engine ranking process is a closely guarded trade secret of the search engines.

- 33 -
So let’s look in some detail at some of these ranking factors…
Linkage-based Search

- 34 -
The Shape of the WWW
This is based on a study of 200 million web pages. Scale up to WWW scale.

- 35 -
Spidering : finding WWW content
• A Search Engine needs to find WWW content for its index
  – This is done by the spidering software
• Starting from some ‘seed’ WWW pages, the spider software downloads these pages and extracts the links, thereby learning about new pages to crawl.
• WWW-scale crawling means crawling thousands of pages per second

- 36 -
A Basic Crawling Algorithm
• You need to be linked to from the main WWW... Remember the shape!
• Given a set of ‘seed’ URLs (WWW page addresses):
  – Add them to a (priority) queue of URLs
  – While the queue is not empty (!empty)
    • Take the first URL (u) off the queue
    • Download the WWW page for u
    • Store the URL in a list of seen URLs
    • Index it
    • If u is an HTML page, extract the links (y)
      – For each y, add it to the queue if it has not been visited before

- 37 -
Spiders must behave!
• Most crawlers/spiders will follow some rules:
  – A spider must never request large numbers of documents from the same host sequentially… change the target website as often as is feasible.
  – A spider must never (for whatever reason) repeatedly request the same document. If a document is unavailable, … its position in the queue must be penalized … Repeated failures must be taken into account and the document flagged as unavailable and taken off the queue.
  – A spider must respect authors’ wishes as expressed using the robots exclusion protocol

- 38 -
Robots Exclusion
• Allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot.
• Most good robots will process it… BUT it makes a crawler less efficient… more explorative crawling required…

To exclude all robots from the entire server
    User-agent: *
    Disallow: /

To allow all robots complete access
    User-agent: *
    Disallow:

To exclude all robots from part of the server
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

To exclude a single robot
    User-agent: BadBot
    Disallow: /

To allow a single robot
    User-agent: WebCrawler
    Disallow:

    User-agent: *
    Disallow: /

- 39 -
Robots.txt example

- 40 -
Another Example

- 41 -
And one more…

- 42 -
Simple Overview WWW 1.
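
The basic crawling algorithm and the robots exclusion rules from the crawling slides can be sketched together in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any real search engine crawls: the names `crawl`, `LinkExtractor`, `fetch`, `robots_txt` and `ExampleBot` are made up for the example, a plain FIFO queue stands in for the (priority) queue, a single robots.txt text is applied to every URL (a real spider would keep one per host), and the politeness rules about spreading requests across hosts and penalising failures are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, robots_txt="", agent="ExampleBot", max_pages=100):
    """Crawl from the seed URLs and return {url: page text}.

    fetch(url) is supplied by the caller (hypothetical helper) and returns
    the page text, or None if the page is unavailable.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # robots exclusion rules

    queue = deque(seeds)   # URLs waiting to be crawled (FIFO, not priority)
    seen = set(seeds)      # URLs already queued, so none is requested twice
    index = {}             # stand-in for the search engine's index

    while queue and len(index) < max_pages:
        url = queue.popleft()              # take the first URL off the queue
        if not rules.can_fetch(agent, url):
            continue                       # respect the site's wishes
        page = fetch(url)                  # download the page
        if page is None:
            continue                       # unavailable: drop it
        index[url] = page                  # "index it"
        parser = LinkExtractor()
        parser.feed(page)                  # extract the links
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:           # only queue unseen URLs
                seen.add(link)
                queue.append(link)
    return index
```

Passing `fetch` in as a parameter keeps the sketch independent of the network: it can be tried against a small in-memory dictionary of pages (for example `crawl(["http://example.test/"], pages.get, robots_txt="User-agent: *\nDisallow: /private/")`), and a real downloader can be swapped in later.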
