Introduction to Computer Science CS 101

Ilir Capuni
Boston University

Plan for today

- Internet search
- Evaluations

What is the web

- Internet: the hardware (networked computers)
- Web: the software (pages and documents)
- HTML (HyperText Markup Language)
  - Text
  - Images
  - Links
  - …

The need for search

- No comment

Altavista and Yahoo

- Text based
- Search for Toyota
  - You would get links to some celebrity, or to some nasty and popular page that mentions it
- It became useless as the web started growing

- Different approach

Modeling the web

Graph

- G = (V, E)
- Weights on edges to denote importance

Representing a graph

- ,
- Linked list
- Arrays…

A snippet of the web yesterday

Task
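A small snippet of the web can be written down directly as a graph. The sketch below is only an illustration: the four pages and their links are made up, and the helper names `to_adjacency_matrix` and `in_links` are hypothetical. It stores the same graph as an adjacency list (a dictionary of lists) and as an array (an adjacency matrix).

```python
# Hypothetical four-page web; page names and links are invented for illustration.
links = {
    "A": ["B", "C"],  # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def to_adjacency_matrix(adj):
    """Return (pages, matrix) where matrix[i][j] == 1 if page i links to page j."""
    pages = sorted(adj)
    index = {page: i for i, page in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in pages]
    for source, targets in adj.items():
        for target in targets:
            matrix[index[source]][index[target]] = 1
    return pages, matrix


def in_links(adj):
    """Invert the graph: for each page, list the pages that link to it."""
    incoming = {page: [] for page in adj}
    for source, targets in adj.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


pages, matrix = to_adjacency_matrix(links)
for page, row in zip(pages, matrix):
    print(page, row)
print(in_links(links))
```

The list form is compact when most pages link to only a few others; the matrix form makes it cheap to ask whether one particular page links to another.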

- Having this representation, we would like to rank the pages according to their relative importance within the graph
- This applies beyond the web: it could be used for any collection of entities with reciprocal quotations and references

Democracy on the web

- Basic idea: consider a link from page A to page B as a vote by page A for page B
- Also check the voter's reputation: the higher its reputation, the more its vote weighs

The PageRank
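The voting idea can be turned into a short iterative computation. What follows is a minimal sketch, not Google's implementation: each page splits its current rank evenly among the pages it links to, and the process repeats until the values settle. The damping factor of 0.85, the even treatment of pages without outgoing links, and the example graph are all assumptions that do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a dict mapping each page to the pages it links to.

    Assumes every page appears as a key. The damping value is an assumed,
    conventional choice, not something stated in the lecture.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal reputation
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page's vote is weighted by its own rank and split
                # evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links votes for everyone equally.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web, invented for illustration.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page C collects votes from three pages and ends up on top, while page D, which nobody links to, keeps only the baseline share.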

- The PageRank of a page is defined recursively: it depends on the number and the PageRank of all pages that link to it
- It assigns values from 0 to 10

Is it possible to manipulate?

- Yes!
- Many ways have been invented to manipulate the PageRank

Are there better ones?

- Yes, there are
- HITS by J. Kleinberg (a bit before PageRank, and referenced in the PageRank paper)
- IBM Clever
- TrustRank

Any such things in history?

- Yes!
- … analysis by … in the 1950s at UPenn; … developed by Massimo Marchiori at the University of Padova

The random surfer model
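The random surfer model can be made concrete with a small simulation. This is a sketch under stated assumptions: the four-page graph is invented, and the `jump_probability` parameter (the surfer occasionally gets bored and jumps to a random page) is an added assumption that keeps the surfer from getting stuck. The fraction of time spent on each page is then an estimate of that page's rank.

```python
import random


def simulate_surfer(links, steps=100_000, jump_probability=0.15, seed=0):
    """Estimate how often a random surfer visits each page.

    links maps each page to the list of pages it links to. jump_probability
    is an assumed parameter, not something taken from the lecture.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    pages = list(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = links[current]
        if not targets or rng.random() < jump_probability:
            current = rng.choice(pages)    # jump to a random page
        else:
            current = rng.choice(targets)  # follow a random link on the page
    return {page: count / steps for page, count in visits.items()}


# Hypothetical four-page web, invented for illustration.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
frequencies = simulate_surfer(links)
for page in sorted(frequencies, key=frequencies.get, reverse=True):
    print(page, round(frequencies[page], 3))
```

Page D has no incoming links, so the surfer reaches it only through random jumps and it receives the smallest share of visits.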

- We consider a surfer that starts from one page and clicks on links at random
- PageRank represents the probability that a random surfer will arrive at a particular page

What does Google do?
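As a toy illustration of the query side of a search engine, the sketch below looks up the pages that contain every query word and orders them by a precomputed rank. Everything in it is hypothetical: the `index` and `rank` dictionaries, the page names, and the `search` function are invented for this example and say nothing about how Google is actually built.

```python
# Hypothetical data a back end might have prepared in advance.
index = {  # word -> set of pages that contain it
    "toyota": {"toyota.example", "fanpage.example", "dealer.example"},
    "dealer": {"dealer.example"},
}
rank = {  # page -> precomputed rank (made-up values)
    "toyota.example": 0.5,
    "dealer.example": 0.3,
    "fanpage.example": 0.2,
}


def search(query, index, rank):
    """Return the pages containing every query word, highest rank first."""
    words = query.lower().split()
    if not words:
        return []
    # Keep only the pages that match all words in the query.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    # Sort by descending rank; break ties by name so the order is deterministic.
    return sorted(matches, key=lambda page: (-rank.get(page, 0.0), page))


print(search("Toyota", index, rank))
print(search("toyota dealer", index, rank))
```

The expensive work (crawling and ranking) happens ahead of time; answering a query is then a lookup followed by a sort.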

- Back end
  - Crawls the internet
  - Creates the graph
  - Analyzes it and updates the rank
- Front end
  - User enters a query
  - Google analyzes the query
  - Checks its …
  - Outputs the pages that have that tag, sorted according to PageRank

Intentional surfer model

- Once you, and many of the programs you use, come to depend on Google, they start sending Google information about your habits and web history
- Question: do you like this after yesterday's class?
- Question: how come Google is suddenly offering you a search or ad result related to an email that you sent to a friend a week ago?

Many uses

- … of scientific journals
- Word sense disambiguation
- Optimized web crawling

rel="nofollow" option
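To illustrate how a crawler might honor `rel="nofollow"`, here is a minimal sketch using Python's standard `html.parser` module. The class name `FollowedLinkParser`, the sample page, and the URLs are made up for this example; a real crawler does far more than this.

```python
from html.parser import HTMLParser


class FollowedLinkParser(HTMLParser):
    """Collect the href of each <a> tag, separating out rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.followed = []  # links that would count toward ranking
        self.skipped = []   # links marked nofollow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel may hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.skipped.append(href)
        else:
            self.followed.append(href)


# Hypothetical page content, invented for illustration.
page = """
<p>See <a href="https://example.com/good">a trusted page</a> and
<a href="https://example.com/spam" rel="nofollow">a user-submitted link</a>.</p>
"""

parser = FollowedLinkParser()
parser.feed(page)
print(parser.followed)
print(parser.skipped)
```

Only the links in `followed` would be treated as votes in a rank computation; the nofollow link is still visible to readers but casts no vote.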

- In 2005 Google implemented a new value, "nofollow", for the rel attribute of the HTML link and anchor elements
- You can mark links on your page with nofollow to prevent Google from using them in the PageRank computation

Where does the money come from?

- Advertising industry: bid-per-click auctions

Bing

- … the listing of search suggestions as queries are entered, and a list of related searches (called "Explorer pane") based on semantic technology from Powerset, which Microsoft purchased in 2008
- As of January 2010, Bing is the third largest search engine on the web by query volume, at 3.16%, after its competitors Google at 85.35% and Yahoo at 6.15%, according to Net Applications.

Cleaning up your name