PageRank
Andrey Karnauch & Dakota Sanders

Questions

1. Why is the algorithm named PageRank?
2. What algorithm is used to converge on values in PageRank?
3. Why is PageRank better than simple citation counting?

Presenter - Andrey Karnauch

- UTK’s Computer Science Master’s Program
- Advisor: Dr. Mockus
- Born in Binghamton, NY
- Moved to Chattanooga, TN in 2006
- Parents are from Ukraine (Soviet Union)
- Like to watch other people play video games, backpack, and sometimes work out (RIP TRECS)

[Photos: Andrey Karnauch; Ukrainian Food Favorites]

From Chattanooga, TN

Masters in Comp. Sci. this May

Full time Django developer in Knoxville

I have a dog named Boone

Recently started rock climbing (I am not that good)

I don’t take pictures of my food )^:

I like going to the gym, playing board games with friends, and playing video games

(I go by Dakota and Cody; people in this class will know me as either one or the other)

[Photos: Some v0 climb at Stone Fort; Bouldering in Soddy Daisy, TN; Dakota / Cody Sanders; Boone, Mountain Cur, 1.5 years old]

Table of Contents

- Overview
- History
- What is PageRank?
- An Example of the Power Method Converging
- Other Applications
- Implementation Experiment
- Open Issues
- References
- Discussion

Overview

- We present PageRank in its original context with web pages
- Other applications of PageRank are discussed later
- The following terms are used interchangeably throughout:
  - Pages and nodes
  - Links and edges
- The PageRank algorithm has many moving parts
- We try to cover all of PageRank before showing a full-fledged example

Section 1
History of PageRank

Source: https://towardsdatascience.com/graphs-and-paths-pagerank-54f180a1aa0a

History - The

- As the Web grew in the 1990s, web search engines were needed to index and find Web pages
- Several companies built them (e.g. WWWW, Altavista, WebCrawler)
- Query volume: 1994 (1500 queries per day) vs. 1997 (20M+ queries per day)
- By 1997, only 1 of the top 4 commercial search engines “found” itself (returned its own page among the top results when searched for by name)

- Enter: research (~1995-1999)
- Larry Page and Sergey Brin: Stanford grads (worth ~$50B each now)
- Wanted an academic search engine that emphasized search quality and scalability
- At the heart of this was the PageRank algorithm
- Ultimately led to the functional prototype: Google

History - The World Wide Web

- RankDex developed by Robin Li in 1996
- Popularity of a site based on “links” to it
- Similar to PageRank
- Larry Page cited Robin Li in the PageRank patent
- Robin Li founded Baidu using RankDex in 2000

- At the core, both parties borrowed from the idea of citation analysis in the 1950s
- And eigenvector centrality - Phillip Bonacich in 1986

Section 2
What is PageRank?

Source: https://towardsdatascience.com/graphs-and-paths-pagerank-54f180a1aa0a

What is PageRank?

From Google:

The basis of Google's search technology is called PageRank™, and assigns an "importance" value to each page on the web and gives it a rank to determine how useful it is. However, that's not why it's called PageRank. It's actually named after Google co-founder Larry Page.

*https://web.archive.org/web/20010715123343/https://www.google.com/press/funfacts.html

Setting Up

- Construct a directed graph with webpages as nodes, and links as edges

- A page can have any number of forward links or backlinks

- Impossible to know if all backlinks are collected, but forward links are available by downloading the page

[Figure: A & B are backlinks of C]

Why use PageRank?

- Intuitively, pages with many backlinks are “important” (i.e. a high citation count)

- However, if a page has only one backlink, but that backlink is from google.com, we can also consider it to be important

- PageRank handles this case much better than simple citation ranking methods by accounting for the “importance” of each linking page

Definition - Basic

A page has high ranking if the sum of the ranks of its backlinks is high

Formally: Let $u$ be a webpage with front-links $F_u$ and back-links $B_u$. Then let $N_u = |F_u|$, and let $c$ be a factor for normalization:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

Definition - Basic
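As a quick check of the basic definition, the sketch below applies R(u) = c · Σ R(v)/N_v once to a three-page graph. The edges (A→B, A→C, B→C, C→A) are borrowed from the three-page matrix example later in this deck; the starting ranks 10, 5, 10, the choice c = 1, and the function name `basic_rank_update` are assumptions made for illustration.

```python
# Forward links of each page (assumed example graph: A->B, A->C, B->C, C->A).
forward_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def basic_rank_update(ranks, forward_links, c=1.0):
    """Apply R(u) = c * sum(R(v) / N_v over backlinks v of u) once."""
    new_ranks = {page: 0.0 for page in forward_links}
    for v, targets in forward_links.items():
        share = ranks[v] / len(targets)  # v splits its rank evenly over its N_v links
        for u in targets:
            new_ranks[u] += c * share
    return new_ranks


ranks = {"A": 10.0, "B": 5.0, "C": 10.0}  # assumed starting ranks
updated = basic_rank_update(ranks, forward_links)
print(updated)
```

With these starting values the update hands back the same ranks, so they are already a fixed point of the definition for this graph.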

[Figure: example graph of pages A, B, and C annotated with the rank values 10 and 5]

Random Surfer Interpretation

- Imagine a user browsing the network clicking links at random

- At each time step, the user follows one of the current page's links, chosen at random

- The “importance” ranking/PageRank of a page is essentially the limiting probability that the random surfer will be at that node after a sufficiently large time

Representation

- How do we determine initial and final ranks?

- Initial ranks: Any set of ranks you want to use

- Final ranks: Iterate the above computation until convergence
  - Requires us to convert the problem into a matrix representation
  - Specifically, start with a square matrix L whose entry in row v, column u is 1/Nᵤ if there is an edge from u to v, and 0 otherwise

Matrix Representation
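A minimal sketch of building such a matrix from each page's forward links. It follows the layout of the matrices shown in this deck, where column u holds page u's outgoing links and row v collects page v's incoming links. The helper name `build_link_matrix` is hypothetical, and exact fractions are used only to keep the output readable.

```python
from fractions import Fraction


def build_link_matrix(forward_links):
    """Return (pages, L) with L[row][col] = 1/N_col if page `col` links to page `row`."""
    pages = sorted(forward_links)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, targets in forward_links.items():
        for v in targets:
            # Page u spreads its rank evenly over its N_u forward links.
            L[index[v]][index[u]] = Fraction(1, len(targets))
    return pages, L


# Three-page example: A links to B and C, B links to C, C links to A.
pages, L = build_link_matrix({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page, row in zip(pages, L):
    print(page, [str(x) for x in row])
```

Each column sums to 1 as long as every page has at least one forward link.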

- Lᵥ,ᵤ = 1/Nᵤ (row v, column u) if there is an edge from u to v, otherwise it's 0

      A  B  C
  A   0  0  1
  B   ½  0  0
  C   ½  1  0

[Figure: graph of pages A, B, and C]

Dangling Nodes

- What happens if a node has no forward links?

      A  B  C
  A   0  0  1
  B   ½  0  0
  C   ½  1  0

With page D added (C now also links to D; D has no forward links):

      A  B  C  D
  A   0  0  ½  0
  B   ½  0  0  0
  C   ½  1  0  0
  D   0  0  ½  0

[Figure: graph of pages A, B, C, and D]

Dangling Nodes

- How to work with dangling nodes?
  - If a random surfer reaches a page with no outgoing links, they will most likely not stay on that page
  - Instead, they randomly choose another page to continue surfing
- What does this mean for our matrix representation?
  - The dangling node will have a column of zeros originally
  - Instead, replace the zeros with 1/n, where n is the total number of pages, so every page has an equal chance of being visited next

Dangling Node Correction
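The dangling-node fix can be sketched as a small helper that swaps every all-zero column for a uniform column. The function name `fix_dangling_columns` is hypothetical; the matrix is the four-page example from these slides, with rows and columns ordered A, B, C, D.

```python
def fix_dangling_columns(L):
    """Replace every all-zero column of a square matrix with a uniform 1/n column."""
    n = len(L)
    fixed = [row[:] for row in L]  # copy so the input matrix is left untouched
    for col in range(n):
        if all(L[row][col] == 0 for row in range(n)):
            for row in range(n):
                fixed[row][col] = 1.0 / n
    return fixed


# D (last column) has no forward links, so its column is all zeros.
L = [
    [0.0, 0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]
fixed = fix_dangling_columns(L)
for row in fixed:
    print(row)
```

After the fix every column sums to 1, which is what the power method relies on later.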

- What does this mean for our matrix representation?

Before the correction:

      A  B  C  D
  A   0  0  ½  0
  B   ½  0  0  0
  C   ½  1  0  0
  D   0  0  ½  0

After the correction:

      A  B  C  D
  A   0  0  ½  ¼
  B   ½  0  0  ¼
  C   ½  1  0  ¼
  D   0  0  ½  ¼

Random Surfer - Damping Factor

- In our random surfer model, it is possible for a user to get stuck in a cycle of pages that link to each other

- How does PageRank account for this?
  - Assign a damping factor d (value between 0 and 1)
  - Set at 0.85, according to Page and Brin
  - At every time step, the surfer follows a link with probability d and jumps to a random page with probability 1 - d

Matrix Representation

- Lᵥ,ᵤ = 1/Nᵤ (row v, column u) if there is an edge from u to v, otherwise it's 0

      A  B  C
  A   0  0  1
  B   ½  0  0
  C   ½  1  0

[Figure: graph of pages A, B, and C]

Calculating PageRank

- Now, to determine the PageRank of each node, we need to find the eigenvector R of L that satisfies R = cLR (that is, LR = (1/c)R, so the eigenvalue is 1/c)

      A  B  C
  A   0  0  1
  B   ½  0  0
  C   ½  1  0

- How do we find eigenvector R and its eigenvalue?
  - Use the power iteration method!
  - Finds an eigenvector of a square matrix corresponding to the eigenvalue with the largest magnitude

[Figure: graph of pages A, B, and C]

Section 3
An Example of the Power Method

- Initialize directed graph of webpages

- A page has an outgoing edge if the webpage links to the page the edge connects to
- As an example, webpage A has links to pages B, C, and D, as shown in the graph diagram

[Figure: directed graph of pages A, B, C, and D]

First, find the link vector for webpage A, normalizing by the number of links, 3:

$$L_A = \left(0, \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)$$

Then, find the link vector for webpage B

Continue on for webpages C and D

Convert each link vector to a column of a square matrix $L$

Notice that columns are outward links, and now rows are inward links

Now set up a vector $R$ to hold ranks of pages... for page A:

$$R_A = \sum_{j} L_{A,j} R_j$$

For the entire matrix:

$$R = LR$$

Since we don’t have an initial rank for any pages, assume equal probabilities and normalize:

$$R^0 = \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right)$$

Since we update $R$ when we calculate this, our notation for the entire iterative process becomes:

$$R^{i+1} = L R^i$$

Notice now that $R$ is an eigenvector of $L$ with eigenvalue 1
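The undamped iteration can be sketched in a few lines of plain Python. Only page A's links (to B, C, and D) are stated in the slides; the links used here for B (to A and D), C (to D), and D (to B and C) are an assumption, chosen because they reproduce the first-step and limiting rank values listed in this section. The function name `power_iteration` and the tolerance are also illustrative.

```python
def power_iteration(L, tol=1e-10, max_iter=1000):
    """Repeat R <- L R from a uniform start until R stops changing."""
    n = len(L)
    R = [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        new_R = [sum(L[i][j] * R[j] for j in range(n)) for i in range(n)]
        change = max(abs(a - b) for a, b in zip(new_R, R))
        R = new_R
        if change < tol:
            break
    return R, iteration


# Column j is page j's link vector (order A, B, C, D).
# Column A comes from the slides; columns B, C, D are assumed links.
L = [
    [0.0, 1 / 2, 0.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 1 / 2, 1.0, 0.0],
]

# One step from the uniform start, then the converged vector.
R1 = [sum(L[i][j] * 0.25 for j in range(4)) for i in range(4)]
R, iterations = power_iteration(L)
print([round(x, 4) for x in R1])
print([round(x, 4) for x in R], "after", iterations, "iterations")
```

Under the assumed links, the first step gives 0.1250, 0.2083, 0.2083, 0.4583 and the iteration settles at 0.12, 0.24, 0.24, 0.40 for A, B, C, D.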

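Because of how we have constructed L (a column-stochastic square matrix: every column sums to 1), its largest eigenvalue is 1, and the power method targets exactly the eigenvector for that eigenvalue. In order to account for our damping factor, $d$, we transform our original iterative process as follows:

$$R^{i+1} = d\,L R^i + \frac{1 - d}{n}$$

where $n$ is the number of pages and the second term is added to every entry of the vector.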
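A minimal sketch of the damped iteration, assuming the commonly used update R ← d·L·R + (1 − d)/n with n the number of pages. The function name `damped_pagerank` and the stopping tolerance are illustrative choices; the matrix is the three-page example from earlier in the deck.

```python
def damped_pagerank(L, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate R <- d * L R + (1 - d) / n from a uniform start."""
    n = len(L)
    R = [1.0 / n] * n
    for _ in range(max_iter):
        new_R = [
            d * sum(L[i][j] * R[j] for j in range(n)) + (1.0 - d) / n
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new_R, R)) < tol
        R = new_R
        if converged:
            break
    return R


# Three-page example (order A, B, C): A -> B, C; B -> C; C -> A.
L = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
ranks = damped_pagerank(L)
print([round(x, 4) for x in ranks])
```

The (1 − d)/n term gives every page a small baseline rank, so a surfer caught in a cycle still has a way out at every step.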

Now, iteratively calculate until convergence!

        A       B       C       D
  R⁰    ¼       ¼       ¼       ¼
  R¹    0.1250  0.2083  0.2083  0.4583
  R²    0.1354  0.2118  0.2118  0.4410
  ...
  R¹²   0.1200  0.2400  0.2400  0.4000
  R¹³   0.1200  0.2400  0.2400  0.4000

[Figure: directed graph of pages A, B, C, and D]

Number of Iterations

- O(log n) iterations is expected when using d = 0.85
- With a damping factor of 0.85, the original PageRank paper took ~50 iterations to converge on a graph of 322 million links
- The larger your damping factor, the longer it takes to converge
- 0.85 is a “sweet” spot

Section 4
Other Applications

Other Applications

- The PageRank algorithm is not unique to just the web!
- Some other examples include:
  a. Sports
  b. Literature - finding most original authors
  c. Neuroscience
  d. Toxic waste management
  e. Debugging (MonitorRank)
  f. Predicting traffic flow (including human mobility)
  g. Used by Twitter to recommend people you should follow

Source: https://arxiv.org/pdf/1407.5107v1.pdf

Implementation - Our Experiment
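A rough sketch of how an experiment of this shape could be set up: players are nodes, and every decisive game adds an edge from the loser to the winner. This is not the presenters' code. The function name `pagerank_from_games`, the choice to weight edges by the number of games, the fixed iteration count, and the toy players and results are all assumptions for illustration.

```python
from collections import defaultdict


def pagerank_from_games(games, d=0.85, iterations=100):
    """Rank players from (winner, loser) pairs; each loss adds a loser -> winner edge."""
    players = sorted({player for game in games for player in game})
    n = len(players)
    # edges[loser][winner] = number of games that loser lost to that winner
    edges = defaultdict(lambda: defaultdict(int))
    for winner, loser in games:
        edges[loser][winner] += 1

    ranks = {p: 1.0 / n for p in players}
    for _ in range(iterations):
        new_ranks = {p: (1.0 - d) / n for p in players}
        for loser in players:
            total_losses = sum(edges[loser].values())
            if total_losses == 0:
                # Undefeated player (dangling node): spread its rank evenly.
                for p in players:
                    new_ranks[p] += d * ranks[loser] / n
            else:
                for winner, count in edges[loser].items():
                    new_ranks[winner] += d * ranks[loser] * count / total_losses
        ranks = new_ranks
    return ranks


# Made-up results as (winner, loser) pairs.
games = [("Ann", "Bob"), ("Ann", "Cid"), ("Bob", "Cid"), ("Ann", "Bob")]
ranks = pagerank_from_games(games)
for player, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(player, round(score, 4))
```

In this toy data the undefeated player ends up with the highest rank. A real run would need a parsing step for the game database, which is left out here.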

- Based off of sports rankings done for tennis players
- To test the effectiveness of PageRank, we experimented with professional chess players
  a. Downloaded the Caissabase chess database and extracted ~4 million chess games
- Experiment as follows:
  a. Create a directed graph with players as nodes
  b. An edge is created from the loser of a match to the winner
  c. By running PageRank on this graph, we can (in theory) test how influential the player with the highest PageRank is

The Results

- Player with highest PageRank:
  - Korchnoi, Viktor
  - “He is considered one of the strongest players never to have become World Chess Champion.”

Section 5
Conclusion

Open Issues

- Several attempts to manipulate PageRank over the years
  - Spamdexing (search engine poisoning)
  - Creating tons of posts linking to your site
  - Buying and selling links from “important” pages
  - “nofollow” HTML attribute abuse

- PageRank was the core of Google originally, but now it is just one of many working parts in Google’s search engine
  - Google hides these internals nowadays to prevent further abuse

Valuable sources

- https://patents.google.com/patent/US7058628B1/en
- http://infolab.stanford.edu/pub/papers/google.pdf
- http://ilpubs.stanford.edu:8090/697/1/2005-33.pdf
- https://sci2s.ugr.es/sites/default/files//TematicWebSites/hindex/PinskiNarin1976.pdf
- http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
- http://www.ams.org/publicoutreach/feature-column/fcarc-pagerank

Suggested Discussion Topics

- Have any of you used PageRank for any application? If so, how did it turn out?

- Can you think of other applications where PageRank could be valuable?

Questions

1. Why is the algorithm named PageRank?
2. What algorithm is used to converge on values in PageRank?
3. Why is PageRank better than simple citation counting?