SEARCHING Finding Needles in the World’s Biggest Haystack - or - The Page Match & Page Rank Problems

We will look at how a search engine carries out an impossible task:

• Read your search word or phrase;
• Search all 40 billion pages on the Web;
• List all the pages that contain some or all of your phrase;
• Rank these matches so the best matches are first;
• And do this in less than two seconds.

1. The Page Match Problem investigates how your browser takes your Internet search string and passes it to a search engine, which returns an answer incredibly fast. We will see that this process involves a number of hidden tricks and indexes.
2. The Page Rank Problem considers how it is possible, once many matching web pages have been found, to select the very few that correspond most closely to your search string.

Goals for this lecture

1. To see the motivation behind creating the Internet and the World Wide Web;
2. To see the difference between a browser and a search engine;
3. To realize that a search engine doesn't wait for your search string and then start checking one page after another on the web;
4. To realize that it's possible to create an index of the web, just like the index of a book;
5. To understand how Web crawling is done to create an index;
6. To recognize how information searches were done before the Internet.

Reading Assignment

Read Chapter 2, pages 10–23, "9 Algorithms that Changed the Future".

What's the Internet?

Once, computers were simply stand-alone devices. A computer could be used to create information and store it on its own hard drive. I could use my computer to write a letter to you, but to send it, I'd have to print it out and put it in an envelope. Or, if you were in the same office, I could copy my letter onto a removable disk, carry it to your computer, and load it onto your hard disk, so you could view it.

People very quickly invented ways of connecting all the computers in an office using cables, so that information could be moved from one hard disk to another by simple keyboard commands. Gradually, it became clear that it was a good idea to connect all the computers in the whole world this way. This required modifying computers to include an Internet port which could be connected by a thin cable to a local network, over which information could be sent and received. All of this is hardware.

It required building a network of Internet Servers, so that every message could be passed from its sender computer, through a sequence of servers, until it reached its recipient computer. It required coming up with a universal system of numeric addresses, called the Internet Protocol or IP, so that each message included an exact address that the servers could read. It required setting up Domain Name Servers (DNS) so that humans could specify an easy-to-remember address such as facebook.com or fsu.edu, but these would be automatically translated to numeric IP addresses.

Up to this point, the Internet simply connected computers. This was good for scientists, who often wanted to access a big, far away supercomputer without having to get on an airplane, or package their data in an envelope and mail it to the computer. Instead, using the Internet, a scientist could start on a desktop computer, and log in to the supercomputer using programs called ssh or putty. They could run a big scientific program on the supercomputer. Then they could bring the data back to their desktop computer using a file transfer program called ftp or sftp. In fact, scientists still use the Internet this way. One of the problems we are having today with hackers is that the Internet was set up to do research.

What's the World Wide Web?

The Internet had been set up to make it easy for a person on one computer to send information to another computer, which was like sending a letter. But sometimes someone wanted to make an announcement to many people they knew (so a lot of individual letters), or perhaps even to any interested person (even people unknown to the author). This was more like a newspaper or poster. This idea eventually brought about the creation of the World Wide Web (WWW), by Tim Berners-Lee at the European Center for Nuclear Research. WWW took advantage of the existence of the Internet, and its addresses and server system, to create something that had not been imagined before.

The web began with the idea that any computer user might want to share some information with people around the world. To make this possible, the user simply had to set aside a special location or directory or folder on their computer, often called public_html. Any documents placed inside this location would be presumed to be intended to be viewed by any interested person. To access the documents, however, required knowledge of the address of the computer. The addresses had a specific format, starting with the prefix http://www and then listing information that specified the country, type of institution, department of the institution, person, directory and filename. Such an address can be a lot to remember. A fairly short example is:

https://www.sc.fsu.edu/undergraduate/courses

These documents were usually fairly short, and so were called web pages. Thus, to publicize a new class to any interested computer user, an instructor could create a document called newclass.txt and place it in the public_html directory of a computer:

Announcing a new class in Computational Thinking!

This fall, the Department of Scientific Computing will be offering a class called "Computational Thinking".

For details, stop by room 444 Dirac Science Library.

Or check out the syllabus at http://people.sc.fsu.edu/~jpeterson/teaching

As the WWW became better known, people began to take advantage of it, posting interesting and useful information in their public areas. But the crazy complicated address system was a huge drawback. No one could find the information without knowing the address, and the address was so long and complicated that it wasn't possible to try to guess where things were.

The first improvement came with the invention of links, which allowed a person writing a web page to make a bridge or simple connection to other web pages, saving readers from having to know more addresses. A web page using links would allow a user to jump to another web page simply by clicking on highlighted or underlined or colored text, rather than having to enter that address separately:

Announcing a new class in Computational Thinking!

This fall, the Department of Scientific Computing will be offering a class called "Computational Thinking".

For details, stop by room 445 Dirac Science Library.

Or check out the SYLLABUS.

Now that interesting web pages existed on the web, and it wasn't so hard to find them, a new kind of program was developed for computers, called a browser. One early browser was called Mosaic, which transformed into Netscape, and then into Firefox. The browser's job was to display a web page on the computer screen, allow the user to specify an address for a new web page, to effortlessly jump to a new web page if the user clicked on a link on the current web page, and to jump back if the user changed their mind.

Because it still wasn't obvious how to find the most interesting information, browsers often included lists of recommended web pages for users to try. Some browsers included a simple feature which would allow the user to ask a question, and get back a recommendation for a good web page to investigate. Because people added, modified, and deleted web pages every day, the web was very dynamic, and the most interesting places to examine changed from day to day. The simple programs inside a browser for guiding users relied on lists created by a staff who could not keep up with the rapid growth in the number of web pages and users. Browser companies began desperately researching new methods of satisfying users, who wanted good information on what was new on the internet, and how to get there. The problem was that the web changed too rapidly for human analysis, but how could a computer be of any use in finding and evaluating web pages?

Browsers and Search Engines

Originally, internet browsers tried to handle user searching themselves, or simply included pointers to a few places that listed resources. Then users demanded a much improved search facility. It seemed sensible to split up the work, to separate the browser from the search engine. The browser would take care of moving to any particular web page, but the search engine would be a separate program that figured out which web page was the right place to go to next, in response to the user's question. In some ways, the browser was like a car which could go anywhere, while the search engine was like a GPS system, which could direct the browser to a desired location.

Ever since those days, there have been two clearly distinct but cooperating programs for navigating on the web. Browsers like Mosaic/Netscape/Firefox, Chrome, Microsoft Internet Explorer/Edge, Apple Safari, and Opera allow people to view and retrieve information on the internet. Search engines, like Google, Bing, DuckDuckGo, Yahoo! Search, and Ask.com, are used from within a browser, by typing a search string into a search box. The search engine then seeks to list web pages that pertain to that search string, after which the user can ask the browser to display information from a selected site.

Winners and losers in the browser wars

The search engine wars

In 2002, Google, Yahoo, and Microsoft each had about 30% of the search engine market share, but Google made some dramatic improvements to its search engine, and drove Yahoo and Microsoft down to less than 20% shares each (and dropping).

To most of us, the web is simply an enormous pile of

• documents;
• photographs, images, pictures;
• e-mail, messages, chats;
• songs;
• magazines;
• videos, movies;
• programs;
• advertisements, commercials, and promotional videos;
• retail stores and ticket outlets;
• social media sites.

The Web may seem to be able to connect us to everything; but if we don't know where something is, we might as well not have it. So as the Internet has grown in size (number of things connected) and complexity (kinds of things connected), it has become increasingly important to develop and improve ways of:

• finding things you are looking for;
• recommending new things you might be interested in;
• learning enough about your interests to guess what other things you might want to see.

Specialized search engines

You are probably familiar with some specialized search programs that help you find information on the web, such as Netflix, Orbitz, Yelp, MapQuest, and Amazon. These programs work well because they have a very limited vocabulary. You often select choices from a list, or enter simple information into boxes, like your credit card number and address. For such a program, it's easy to imagine a simple procedure for getting the desired information from the user, and then checking a list of song titles, travel dates, or shoe sizes, until a desired match is found, and then finishing up the billing and shipping arrangements. While such programs seem to carry out an elaborate sequence of steps, in general the choices are not so large, and it's easy to determine when you've found a match (the right size, the right price, the correct dates).

But a real search engine has a much more difficult task! Suppose you are looking for a list of the rulers of Russia, or advice on how to whistle, or instructions on how to get rid of a wasp infestation. As soon as the browser realizes you're doing a search, it hands the information to the search engine, as a list of one or more words or search strings. The search engine doesn't speak English or French, or Mandarin. It has no idea what your words actually mean. If your search word is apple, should it concentrate on fruit, on the computer company, on the recording label for the Beatles, Fiona Apple, a movie named "The Apple"? What about Applebee's restaurant? What makes matters worse is that if you search on "Apple", there seem to be more than a billion web pages on which that word appears!

What happens between your question and the answers?

Your browser is running on the computer right in front of you. The information you need is somewhere else, perhaps far away. We assume you've got a connection, wireless or wired, to your local network. Your request goes out from the local network to the Internet... and a few seconds later, your answer appears, a list of hundreds or thousands of "hits", showing about 15 or 20 on the first page, including the location and an extract from the matching text.

It seems like magic; in fact, it's impossible for a search engine to receive a search string and check every word of every web page on the Internet and return the good matches to you unless you are willing to wait days for an answer. But we got the answer in two seconds! It seems impossible for this to work!

Why does this seem like an impossible task?

For your browser to access just one web page:

• the browser converts the web address to a numeric IP (Internet Protocol) address;
• it has to set up a connection to that address;
• once the connection is made, it has to request a copy of the page from the remote computer;
• it has to wait for the remote computer to find the file locally;
• it has to wait for the information in the web page to be copied across the Internet into a local file.
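To get a feel for that timing, here is a minimal Python sketch (an illustration only, not part of any browser or search engine) that times a single page fetch with the standard library; the URL is just an example.

import time
import urllib.request

url = "https://www.fsu.edu/"          # example address; any public page would do
start = time.time()
with urllib.request.urlopen(url, timeout=10) as response:
    page = response.read()            # wait for the page to be copied across the Internet
elapsed = time.time() - start
print(f"Fetched {len(page)} bytes in {elapsed:.2f} seconds")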

Each single web page access can take on the order of 1 second.

There are more than 40 billion web pages on the Internet;

To copy every web page would take 40 billion seconds, which is more than 1,200 years.

Browser users expect rapid response, and browser developers try every trick they can think of to keep their users happy. Although a typical web page access might take a second, sometimes there is a longer delay because the network is busy, or the path to the web page is unusually long, or the server that controls the web page is very slow. Since users often refer to a single web page several times in a single session, browsers added a cache, that is, they saved copies of the most recent web pages that have been visited. When you request a page, rather than going immediately to the web, the browser first checks to see if it already has a copy in its cache.

This Cache Trick usually works fine. But you can see it break down sometimes. If a web page has been updated since you first referred to it, your browser may continue to show you the out-of-date version. This is why most browsers include a Refresh button, which really means throw away the cached page and get a fresh copy from the web!

What really happens is a little more complicated ...

It's natural to assume that if you search for polar bears, then the search engine...really searches every web page on the Web, and comes back with a list of all the pages that contain that word.

That is impossible. So what does happen?

In order to find your matching pages, the search engine did a great deal of work long before your fingers touched the keyboard:

• web crawling: making a map of the Web;
• examining links: making an index of every word.

Let's see how careful advance work makes your search happen so fast!

Step 1: Making a map

A search engine prepares in advance for your question, whatever it is. It starts this process by locating all the web pages it can find. This is called web crawling. It's hard to do right, because there's no map of the internet, and new sites appear and disappear every minute. Imagine the Web as a network of stops in the New York subway system. The search engine needs to take random rides on this subway system, and notice every place it stops, and how it got there, and whether anything has changed since the last visit. All of this goes to updating a map of the Web.

A web crawler, often called a spider, is actually a program. A search engine has many web crawlers exploring the web. From time to time, a crawler visits the main FSU web page. It notes all the links on the FSU web page, some of which changed since the last visit. It makes as many observations as possible in the local FSU web page directory, sending this information back to the search engine. Once it has extracted all the information it can from the main FSU web page, it decides somewhat randomly where to explore next. From their starting sites, the swarm of spiders follow links across the web, recording what they see and sending it back to the main search engine site, where all the information is combined into an updated snapshot of the web, its web sites, the web pages, and the links in those web pages.

How does a spider start its travels over the Web?

Some parts of the web are well known, and don't change their locations. These are servers, computers devoted to a special task. These are good places for an internet crawler to begin its exploration. A web server does nothing but send and retrieve web pages to your browser while you're surfing the Internet. An e-mail server is a computer that works as a virtual post office, receiving and storing mail messages, and interacting with you when you check your mailbox. Facebook servers do nothing but handle all the activities of their users, and they are packed with information that a crawler wants to see.

If you have a web site, you may be surprised or unhappy to realize that a spider will look at your information from time to time and send a summary of it back to Microsoft or Google or Yahoo. Usually, if you have a web site, anybody can look at your information, but still, it can be unsettling to realize that some viewers are actually vacuuming up your information for their own use. Because of user complaints and privacy concerns, some major web sites and social media sites try to restrict access by these crawlers.

Creating a map of the web is part of the search engine's plan to efficiently respond to your requests. When you send a question to the search engine, it relies first on this map, rather than on the actual web. That means that, sometimes, the map and the web differ. The map may include a web page that has since actually disappeared. For this reason, your search request will sometimes point to what seems like an interesting web page, but when you try to view it, it says "Can't find this page!" and you're left thinking, "What? You just told me to look at it!" Also, if you put a web page up, it will not be visible to any search engine for some time, until a web crawler runs into it and adds it to the map. These failures are two consequences of the otherwise very reliable Map Trick.

Step 2: Making an index of the web

So now we have some idea of how it's possible to map the web using crawlers.
But just because we know where every web page is, we still have the impossible task of checking them all against the user's search string. Luckily, Google solved this problem, using other information gathered by the web crawlers. When you do a Google search, you are NOT actually searching the web, but rather Google's index of the web. Google's index is well over 100,000,000 gigabytes of information, requiring one million computing hours to build, and is constantly updated.

Now we want to understand what it means to index the web. When you issue a web search query, it is processed in two stages:

• matching, which searches for all matching pages using the web map;
• ranking, which orders those matching pages so the best appear first.

The query "London bus timetable" seeks matches, and then ranks them.

Library search: Search engines have always existed

One way to realize how indexing makes the page match task possible is to think about what used to happen, in the good old days, when you went to the library to work on a term paper. You might start by wandering through the library, looking for the books you need, but after a while, you realize this is hopeless.

Then you find a librarian and say:

“I need a list of the rulers of Russia.”

The librarian doesn’t know where your information is either. But the librarian knows how you can find it.

The librarian says: “Search the catalog card index!”

The old card catalog index contained hundreds of thousands of search phrases: topics, author names, events, all in alphabetical order.

Every time a new book came into the library, the librarians prepared cards for every topic covered by the book, and added these to the index.

With great work, and over a long time, the card catalog was built up to be a labor-saving device that allowed you to "instantly" (well, within a minute or two) discover the "addresses" (library call numbers) of every book in the library that might pertain to your topic. Searching the card catalog under Russia, Rulers we might see a sequence of cards like:

Lieven, Dominic, "Russia's rulers under the old regime", DK253.L54
Lev, Timofeev, "Russia's secret rulers", DK510.73.T56
Warnes, David, "Chronicle of the Russian Tsars", DK37.6W37
Gooding, John, "Rulers and subjects", DK189.G86

Perhaps the title by Warnes seems the best match for our interest. In that case, we need to make a note of the information DK37.6W37 because this is the Library of Congress catalog number for the book. This number amounts to an address for the book, so once we have a map of the library, we now know where to search next. So search phrase plus index gets us the location of the information.

Books have an index too!

Most nonfiction books include an index, which lists names and topics in alphabetical order, and the pages on which these are discussed. Once we get the book in our hands, we're holding 300 or 400 pages of dense text; we probably don't want to read or even scan the whole book to find a list of Russia's rulers! Instead, we rely on the fact that books include an index at the back, making it possible to rapidly locate the pages on which key topics are presented. So we turn to the back of the book, and luckily, find "Rulers, List: 217" and turn to page 217 and finally have what we want:

Rurik I             862-879
Oleg of Novgorod    879-912
Igor I              913-945
Olga of Kiev        945-962
......

Now Strozier library claims to hold about 3.3 million books. A typical book might have 300 pages. So we can estimate there are a total of 3,300,000 x 300 ≈ 1 billion pages. Our search led us to one page in one book, so essentially we picked one page out of a billion in about 15 minutes. This means that in every single second, we essentially skipped over a million pages of useless text in our search for the right one.

Library search rate ≈ one million pages per second.

Twice, in our library search, we referred to an index as an efficient way to rapidly narrow our search. The library index listed books in the library that might pertain to our topic. The book index listed pages in the book that might pertain to our topic. The indices saved us from having to search every book, or search every page. Of course, the reason we had a fast search was that other people (librarians and publishers) had already done the hard work of preparing accurate indexes for us.

Web search is similar to library search

Searching the web is similar to searching the library:

• There are a huge number of items;
• Each item contains many pieces of information;
• Each item has an address.

We need:

• a way to specify a search topic;
• a list of items that might include this topic;
• an address for each item;
• a map that tells us how to reach any address;
• a way to reach the address;
• a way to view the item;
• a way to locate our search topic within the item.

Search Engines as Fast as Possible by Tech Quickie

https://www.youtube.com/watch?v=ADvI44Sap3g

Map + Match + Rank

We have seen that when you enter a search phrase in your browser, the browser passes this request to a search engine, and the search engine doesn’t actually go out and check the web, but rather looks at an index of the web created from information gathered by web crawlers as they constructed a map of the web. The stages involved in this process include:

1. map the web, its web pages and topics;
2. match the search words to web pages;
3. rank the matching web pages by importance.

Let's assume that the web map has been built, and that it's time to think about how to rapidly match web pages to an arbitrary user search word. In the next lecture we will see how the matches can be ranked.

An index can be used to speed up the search for occurrences of some word, phrase, or topic. An index essentially says "If you're looking for this topic, check out these places." It should be clear that this idea, which worked so well for libraries and books, can be extended to the web and web pages. An index takes a lot of work to prepare; once it's created, it makes searching incredibly faster.

Google search index

Example. Consider a very simplified example of the web where there are three webpages:

Page 1: the cat sat on the mat
Page 2: the dog stood on the mat
Page 3: the cat stood while a dog sat

Indexing these web pages is very similar to indexing a book. We will consider every word important. So we start by listing alphabetically the unique words we find. Each word will be followed by a “1” if it occurs in page 1, a “2” if in page 2, and so on.

Let’s go through this painful indexing process now. Our resulting index should look like this. It is a single file that contains all the words that appear, and which web pages use them:

Word     Web pages containing word
a        3
cat      1 3
dog      2 3
mat      1 2
on       1 2
sat      1 3
stood    2 3
the      1 2 3
while    3

A search engine could answer questions with just this single index file. If we enter the search string dog, the search engine can quickly find the corresponding line in the index (this is quick, because a computer takes advantage of alphabetical ordering even more efficiently than we do). Then the search engine can report that "dog" appears in webpages 2 and 3. Of course, a real search engine would include a bit of the text surrounding the occurrence of "dog" but that's an added feature that we don't need to worry about right now. If we enter the search string cat, our search engine will tell us that this word occurs in pages 1 and 3.

Often, our search string involves more than one word. For example, if we are interested in finding a web page that discusses the problems of having both a cat and a dog, then we don't want pages that have cats and pages that have dogs, we want pages that contain both keywords at the same time. So let's enter a search phrase dog cat. The search engine can determine that "dog" occurs on pages 2 and 3. Then the search engine can find "cat" on pages 1 and 3. Now it must put these two facts together, realizing that "dog" and "cat" both appear only on page 3, and this is the single result returned. So finding two search words together on the same page takes two searches, three words would require three searches, and so on.

Notice an important fact: to answer these questions, we did not have to have access to the original web pages. We needed that when we made the index, but now a single index file allows us to answer questions about all three pages.
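To make the Index Trick concrete, here is a minimal Python sketch (a toy illustration, not any search engine's real code) that builds this word-to-pages index for the three tiny pages and answers one-word and multi-word queries by intersecting the page lists:

pages = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}

index = {}                                   # word -> set of page numbers
for number, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(number)

def match(query):
    # pages that contain every word of the query
    word_sets = [index.get(word, set()) for word in query.split()]
    return sorted(set.intersection(*word_sets)) if word_sets else []

print(match("dog"))        # [2, 3]
print(match("cat"))        # [1, 3]
print(match("dog cat"))    # [3]

The dictionary lookup is the computer's version of flipping straight to the right line of an alphabetized index: it never needs to look at the pages themselves once the index exists.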

Now the World Wide Web has 40 billion web pages. Suppose we can create a similar (but huge!) index file for them. Then, to answer simple matching questions about all the web pages, we only have to search one place, the index file, not the web itself.

This is one clue to how a search task that should take a thousand years can be cut down to 2 seconds.

With our index for this example, can we handle connected keywords?

In most search engines, it is possible to enter a phrase, using quotation marks, such as “cat sat”. In that case, you are not just asking that both words appear somewhere on the same page, but that they appear immediately together, in that order. That is, we want the bigram cat sat.

The index file we created for the tiny web can tell us that both cat and sat occur on pages 1 and 3, but not whether they occur together.

It might seem that the solution is to look up cat and then go to those web pages and find whether sat occurs in the right position.

This is not acceptable! It would require downloading every web page that had cat, and reading the entire web page to see whether sat was the next word. There is no way to guarantee a fast answer. The answer to this problem is to include word locations in the index.

Suppose in version #2 of our index file:

• we label every word in each page with its position;
• we record every occurrence of a word in the page along with its position.

So if "the" occurs twice in a web page, each occurrence is listed, along with its position.

New index including word location

Word     Page-Position
a        3-5
cat      1-2  3-2
dog      2-2  3-6
mat      1-6  2-6
on       1-4  2-4
sat      1-3  3-7
stood    2-3  3-3
the      1-1  1-5  2-1  2-5  3-1
while    3-4

The first number indicates the web page number the word occurs on and the second number indicates its position on that page. Now suppose we are given the search phrase cat sat.

We look up the word cat, and see that it occurs on page 1 as word 2, and on page 3 as word 2. sat is on page 1 as word 3, right after cat, so we have a hit on page 1. sat is on page 3 as word 7, not immediately following cat there, and so that counts as a miss.

By making our index more intelligent, we can now answer any phrase inquiry about our web pages, and we still don’t need to access the web to do this.
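Here is a small Python sketch of the Word Location Trick (again just an illustration), using the positional index above to answer the quoted phrase query cat sat:

# word -> list of (page, position) pairs, copied from the table above
index = {
    "a":     [(3, 5)],
    "cat":   [(1, 2), (3, 2)],
    "dog":   [(2, 2), (3, 6)],
    "mat":   [(1, 6), (2, 6)],
    "on":    [(1, 4), (2, 4)],
    "sat":   [(1, 3), (3, 7)],
    "stood": [(2, 3), (3, 3)],
    "the":   [(1, 1), (1, 5), (2, 1), (2, 5), (3, 1)],
    "while": [(3, 4)],
}

def phrase_match(phrase):
    # pages where the words of the phrase appear consecutively, in order
    words = phrase.split()
    hits = set()
    for page, pos in index.get(words[0], []):
        if all((page, pos + k) in index.get(word, [])
               for k, word in enumerate(words[1:], start=1)):
            hits.add(page)
    return sorted(hits)

print(phrase_match("cat sat"))   # [1]: a hit on page 1, a miss on page 3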

We call this The Word Location Trick.

Exercise. Use the following table of page numbers and word locations to answer the following questions.

Word       Page-Position
a          1-7
animal     1-2  2-6  3-5
be         2-4
cheetah    1-8  2-1  3-1
earth      1-4
fastest    1-1  2-5  3-3
is         1-5  3-2
land       3-4
may        2-2
not        1-6  2-3
on         1-3

1. On what page(s) does the word earth occur?
2. On what page(s), if any, does the bigram on earth occur? If it doesn't occur, enter "0".
3. On what page(s), if any, does the trigram fastest land animal occur? If it doesn't occur, enter "0".

Reading assignment

Read Chapter 3, pages 24–37, "9 Algorithms that Changed the Future".

Socrative Quiz: Searching Quiz 1

CTISC1057

True or false.

1. Safari is a search engine.
2. Google Chrome is a browser.
3. You have to use a browser to get to a search engine.
4. When you type in keywords in a browser, then the search engine checks every word of every web page and returns your results.
5. A spider or a crawler is an electronic robot which is inside your computer.
6. A spider starts its Web search at a popular Web site or server.
7. Currently, Yahoo is the most popular search engine in the U.S.
8. When you type in keywords in a browser then the search engine is not actually searching the Web but an index of the Web.
9. The Internet and the World Wide Web were developed at the same time.
10. Links on a web page provide a bridge to other web pages.

Goals for this lecture

1. Last time we began to look at how an index of the web can be made. We want to expand this idea;
2. To understand the importance of the location of keywords on a Web page;
3. To realize that Web pages are typically written in a language called HTML;
4. How HTML tags can indicate what a web page is about;
5. To understand how web pages are ranked after we find the ones matching the search phrase.

When is one match for two search words better than another?

Suppose we were interested in learning the cause of malaria. We might naturally search on malaria cause although we probably don’t insist that those two words occur exactly together so we don’t need quotes.

Suppose the search engine discovered two web pages containing both search words. We can see that the first web page is a better match. What clues could a search engine use? The answer is "nearness", because close words are typically a better match.

Here we see part of our index file, with keywords highlighted.

Word      Page-Position
by        1-1
cause     1-6  2-2
common    1-5
...
malaria   1-8  2-19
many      2-13
of        2-10  2-14
...
the       1-3  1-24  2-7  2-11

Using our word location index, the search engine can see that on page 1, malaria and cause are just 1 word apart, versus 16 words apart on page 2. Although both pages match both keywords, page 1 may be the better match because the keywords occur much closer to each other.

Notice that the computer does not understand what it is reading! It could do the same kind of analysis for keywords and text written in Italian, or in ancient Mayan. It may look like an intelligent action, but it's based on a very simple idea: physically close keywords suggest a better match (a small scoring sketch appears after the list below). The Nearness Trick is thus also useful for our upcoming page ranking task.

We already know two ways to specify multiple key words to a search engine:

• quoted, we prefer that the words appear together, in that order;
• unquoted, the words can be in any order and far apart.
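As a rough sketch of how the Nearness Trick could be scored (an illustration, not any search engine's actual formula), the number of words separating the two keywords on each matching page can be read straight off the word-location index:

# positions of the two keywords on the two matching pages, from the index fragment above
positions = {
    "malaria": {1: [8], 2: [19]},
    "cause":   {1: [6], 2: [2]},
}

def words_between(word1, word2, page):
    # smallest number of words separating any occurrence of word1 from any occurrence of word2
    return min(abs(p1 - p2)
               for p1 in positions[word1][page]
               for p2 in positions[word2][page]) - 1

for page in (1, 2):
    print(page, words_between("malaria", "cause", page))
# page 1 -> 1 (just one word between them), page 2 -> 16 (far apart),
# so page 1 would be preferred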

It turns out that most search engines automatically prefer situations in which the keywords are close. However, one interesting feature in Google Search allows us to specify how close we want the words to be.

If we use one asterisk between two quoted keywords, then we are asking for pages where the keywords appear in that order, separated by exactly one word

"string1" * "string2" Nearness can also help us avoid spam pages. One reason that search engines also prefer matches in which multiple key- words are close is to avoid being trapped by spamdexes. A spamdex is an artificial web page that simply contains a grab bag of keywords, without any information. You could make such a web page by posting a dictionary, minus the definitions, for instance. We will look at this more closely later in the course.

A search engine looking for hair loss remedy or tap dance lessons or perpetual motion machines will find matches (but no information!) on a spamdex page, and the spamdex operator will pick up some money by displaying ads to the annoyed user.

Thus, even if the user doesn't request that the keywords be close, search engines avoid matching pages that fail the proximity test.

Indexing should notice a web page title

Web pages are actually a little more complicated than the simple text files we have used so far as examples.

Web pages are written in the HyperText Markup Language (HTML). HTML allows the author to vary the typeface and size, to include tables, lists and figures, and to indicate the structure of the document.

A web page author can specify a title for the web page using specific HTML tags to indicate where the title starts and stops.

If a search engine is looking for the key word malaria, doesn't it make a huge difference if the actual title of a page is "Malaria"?

An example of title searching using HTML

Here are three web pages which include title information.

The pages as we see them.

The pages as the browser and search engine see them.

When written in HTML, the Web page typically has a title between the special tags <titleStart> and <titleEnd>. (The actual HTML tags are slightly different.)

An intelligent search engine takes advantage of noticing the title!

Our improved tiny web index will notice and include HTML tags.

Since the spiders see the actual text that generates the Web page, the first "word" to index is the HTML tag <titleStart>, the next words are the ones in the title, then the HTML tag <titleEnd>, then the HTML tag <bodyStart> for starting the body, followed by the body of the page, and finally the HTML tag <bodyEnd> for ending it. For example, for the page entitled "My Cat" we have the following indexing.

<titleStart>(1) my(2) cat(3) <titleEnd>(4) <bodyStart>(5) the(6) cat(7) sat(8) on(9) the(10) mat(11) <bodyEnd>(12)

Word           Page-Position
a              3-10
cat            1-3  1-7  3-7
dog            2-3  2-7  3-11
mat            1-11  2-11
my             1-2  2-2  3-2
on             1-9  2-9
pets           3-3
sat            1-8  3-12
stood          2-8  3-8
the            1-6  1-10  2-6  2-10  3-6
while          3-9
<titleStart>   1-1  2-1  3-1
<titleEnd>     1-4  2-4  3-4
<bodyStart>    1-5  2-5  3-5
<bodyEnd>      1-12  2-12  3-13

Now suppose a user searches for dog. A page in which dog is in the title is probably a stronger match.

Each time a page is found containing the word dog, the engine can check whether this word is actually part of the web page title. It does this by comparing the positions of <titleStart>, dog, and <titleEnd>. If the keyword falls between the two title markers, then this web page is more highly related than if it occurs elsewhere.

By looking at our index, we see the following cases:

Page   titleStart   dog   titleEnd   Start < dog < End?
2      1            3     4          yes
2      1            7     4          no
3      1            11    4          no
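A small Python sketch of this check (illustrative only): with <titleStart> and <titleEnd> stored in the positional index, testing whether a keyword occurrence lies inside the title is a simple comparison of positions.

# a few rows of the positional index for the tagged pages above
index = {
    "<titleStart>": {1: [1], 2: [1], 3: [1]},
    "<titleEnd>":   {1: [4], 2: [4], 3: [4]},
    "dog":          {2: [3, 7], 3: [11]},
}

def in_title(word, page):
    # True if some occurrence of the word falls between the title markers
    start = index["<titleStart>"][page][0]
    end = index["<titleEnd>"][page][0]
    return any(start < pos < end for pos in index[word].get(page, []))

print(in_title("dog", 2))   # True:  dog at position 3 lies between 1 and 4
print(in_title("dog", 3))   # False: dog only appears in the body, at position 11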

This technique is The Metaword Trick.

Maps and Indexes have solved the page match problem

From what we have seen, the impossible problem of quickly responding to a request to find keywords in all the webpages in the world has become the possible problem of intelligently searching a single index file.

Just as with card catalogs and an index at the back of a book, the creation of an index file for the web takes a great deal of time, and space.

Google, for instance, has created enormous collections of computer servers whose job is to collect all the information on all the web pages and create, update, and analyze the corresponding index file. This means that the index file is actually always out of date (like Google Street View) but regularly updated piece by piece.

Computer "Tricks"

The search engine may seem to be intelligent - it’s answering your questions, after all. But actually, we have simply figured out a number of tricks that make it possible to come up with reasonable approximations of good answers:

• The Cache Trick
• The Map Trick
• The Index Trick
• The Nearness Trick
• The Word Location Trick
• The MetaWord Trick

These are examples of computational thinking in action: given a problem, how we can use the strengths of a computer (ability to store information, to gather and remember new information, look up information quickly, repeat operations) to simulate the abilities of a human (read all the web pages, and find the matching ones).

Exercise. Consider the following two sample web pages with HTML tags. Index the pages by completing the given table.

Page 1

<titleStart> Seminoles Football <titleEnd> <bodyStart> Visit Seminoles.com and get FSU news from the official athletic site <bodyEnd>

Page 2

<titleStart> Florida State Seminoles Football Schedule <titleEnd> <bodyStart> Florida State Seminoles Football scores schedule stats roster players news and more <bodyEnd>

Word            Page-Position
and
athletic
Florida
football
from
FSU
get
more
news
official
players
roster
schedule
scores
Seminoles
Seminoles.com
site
State
stats
the
visit

Exercise. Use the index below to answer the following questions.

Word           Page-Position
a              1-5  1-14  2-16
basketball     1-9  2-3
be             2-15
can            2-14
chance         1-15
college        2-2
deep           2-8
for            1-7  1-16  2-7
fsu            1-8  2-4
has            2-5
in             1-20  2-19
madness        1-3  1-19  2-21
major          2-17
march          1-2  1-18  2-20
ncaa           2-9
pieces         2-6
player         2-18
possibility    1-6
run            2-11
season         2-23
some           1-17
still          1-4  1-13
tallahassee    1-21
there's        1-12
this           2-22
tournament     2-10
<titleStart>   1-1  2-1
<titleEnd>     1-10  2-12
<bodyStart>    1-11  2-13
<bodyEnd>      1-22  2-24

1. What is the title of the first page?
2. Does the bigram "march madness" appear on the first page? On the second page?
3. Does the word "tournament" appear in the title of the second page?
4. Does the bigram "march madness" appear in the title of the second page?
5. Does the word "tallahassee" appear on both pages?
6. How many words are in the title of the second page? (excluding <titleStart> and <titleEnd>)
7. Does the bigram "major player" occur on either page? If so, which one.
8. Does the bigram "fsu season" occur on either page? If so, which one.

After finding matches, we need to do ranking!

Even with the tricks we have described, it is common for a search engine to discover thousands or millions of matching web pages.

The mapping and matching algorithms are only the beginning of the process of responding to your web search.

Next it will be necessary to consider the page rank algorithm, which considers all the matching pages that have been found, and sorts them in order of importance, so that even with millions of matches, most users know their best choice is a match on one of the first few pages.

The Page Rank Problem

We have seen that search engines use Web Mapping and Indexing to find matches to search words.

• Before a search engine sees any user search words, it has already done a lot of work in preparation.
• So a search engine must send out a constant stream of web crawlers, which randomly explore the Web, reporting where they are, how they got there, and what they find.
• The information from web crawlers is used to create and update an enormous index of the web:
  – directions to reach every server, folder, directory that contains web pages;
  – a list of every web page;
  – a list of every word and its location in the web page;
  – a list of hyperlinks in every web page.

• Instead of the search engine storing a copy of the entire web, it really transforms the web into an enormous and complicated index file.
• Having a local index means that, when you send in a key word, the search engine can very quickly look it up locally, rather than having to touch the web at all.
• The index will provide a list of all the pages on the web that have your keyword somewhere in their text.
• Since there are 40 billion web pages, it's easy for almost any key word to result in millions of matches.
• Most of those matches are probably not very useful, and there's no way you would have the patience to search through them one by one for the truly relevant ones.
• When you use a search engine, you often find the best matches on the first page. That's because, after the search engine found all the matches, it went through them and made a very good guess as to which pages were the best. But the search engine is a computer program. It can't actually understand the pages it is handling. How can it decide what to put on page 1?
• We saw two simple tricks (the Nearness and Metaword Tricks) that could help with ranking web pages.
• The Nearness Trick assumes that if you specified two keywords like malaria and cause, then matching pages should be preferred where these words were closer together in the text.
• The Metaword Trick assumes that the person who wrote the web page included special editorial comments, using the HTML language. In particular, if your keyword turns out to appear in the title of a web page, that makes it likely to be a better match.
• These two tricks are useful, but when you're dealing with millions of matching pages, we need much more powerful tools in order to quickly pick the best matches.

Sorting a list of a million web sites is different from sorting a list of numbers or names. Sorting a large set of objects by importance is called ranking. If the search engine were a human, and familiar with the question we asked, then it could carefully read every matching web page, and sort them in order of usefulness, showing us the best matches first. Asking a computer to do web page ranking seems to be another example of an impossible task, since the search engine can't actually understand the web pages. But automatic ranking is also a vital task, because we can't afford to pay people to read and rank web pages (nor can we wait that long!), and the quality of pages on the web varies from marvelous to ridiculous.

Is ranking without comprehension possible?

The search engine doesn't understand what the search words or the web pages mean - it doesn't even know English. How can reliable, reputable, reasonable web pages be selected over the ocean of misinformed or irrelevant matter on the web? Early search engines actually did try having people read and rate individual web pages; but such a rating system is very expensive, requires hiring experts in every possible field, and must be updated daily as web pages change and new ones are added. Nonetheless, we will come up with a solution, and it will work well even if the pages are written in Polish, Esperanto, or the Martian language!

Example. Ranking restaurants.

We ourselves sometimes choose between items about which we seem to know nothing, and we don't simply flip a coin. We have our own set of tricks to use. Suppose your job has sent you to work in the country of Vulgonia for a week. Let's assume you don't speak a word of Vulgonian.
You go to the downtown area hoping to get something to eat, and you see two restaurants are open. One says BLATNOSKI LOBSOPPY and the other says DINGLE MARKSWART. There are menus in the window, but you can't read them. The windows are too steamy to see inside. You stand outside the two restaurants for ten minutes, and make up your mind. You are confident you are going to a good restaurant. How is this possible?

It looks like the restaurant on the left is much more popular than the one on the right. You automatically assume the left one is better, based on observing the choices of other people. You are basing your choice on its popularity. Search engines use a similar trick called the Authority Trick.

So we can see that, at least in some simple situations, we may find that we need to make a choice without knowing what we're choosing. If there are already other people making choices, then a reasonable strategy is to prefer the most popular choice. A refinement to this strategy arises if you have some way of judging the people making the choices. If you feel some people are more reliable, or more knowledgeable, or have more in common with you, you might weight their choices more strongly, giving their choices more authority. An example of this is when you take a movie recommendation of a friend over that of someone you just met.

The Authority Trick suggests that you can sometimes make good choices without knowing what you're choosing. But a computer can't count people going to a restaurant. Can we see a way to extend this idea of the authority trick to our web page ranking problem?

Example. Ranking papers in mathematics

Let's suppose that, by an incredible mistake, you've been asked to give some advice about a famous problem in higher mathematics, a language you probably don't speak! This famous mathematical problem is called the Riemann hypothesis.

Ever since Riemann stated his hypothesis in 1859, mathematicians have been investigating and puzzling over how to determine whether it is true that the Riemann zeta function ζ(s) will never produce a zero result except for a special set of input numbers. This has resulted in many papers, of varying quality, being written about the Riemann hypothesis.

Now suppose a friend has suddenly become interested in this Riemann hypothesis, and has asked you to recommend just one paper to read. Knowing nothing about this problem, you go to the library and find 20 mathematical papers on this topic. You can't just hand your friend all 20 papers! Can you make a recommendation that at least looks intelligent?

The 20 mathematical papers you found each include a bibliography, that is, a list of other papers that the author referred to while writing this paper. Even if you don't know much mathematics, you still can recognize a bibliography; that means that you not only have 20 papers, you also have 20 lists of what are probably good, useful papers. Suppose you notice many of the papers cite one or both of these papers:

• Conrey, J. Brian (2003), The Riemann Hypothesis, Notices of the American Mathematical Society.
• Dudek, Adrian W. (2014), On the Riemann hypothesis and the difference between primes, International Journal of Number Theory 11 (03).

Now that you know that people who think about the Riemann hypothesis often cite papers by Conrey or Dudek, you could reasonably go to your friend and suggest that those papers would be an excellent starting point.

You naturally assumed that all the citations in a paper are “votes” or “likes” for other papers, so that if you could keep track of the papers with the most votes, you probably had a few winners. The popularity idea isn’t perfect. A paper could also be frequently cited because it is controversial or disputed or wrong. Your collection of papers might also include many references to:

• Smaley, Ricardo (2012), Riemann was wrong!, Ruritanian Mathematics Journal.

Without mathematical training, we can't judge, even after reading this paper, whether it is a correct reference or not, but the fact that so many people have referred to it suggests that it is nonetheless an interesting reference, and one that you might mention to your friend. The point is that the existence of bibliographies means we can make some guesses about influential and important papers, without reading them. And that suggests that a computer program can sometimes use similar clues to make what look like intelligent decisions.

Ranking: Counting followers of Twitter users

One of the most prominent features for users of social media is the ability to "friend" or "follow" or "like" or "connect" with other users. This automatically creates a ranking among users: those with the most friends, followers, likes, or connections are seen to be most important, and are naturally regarded by new users as worth following as well. On Twitter, for instance, Taylor Swift has more followers than Barack Obama, so, at least as far as Twitter users are concerned, a popular singer has far more authority or importance than the U.S. president. Now you can start to see how a computer can learn about us humans; simply by counting followers on Twitter, a computer program would gain some correct notions of fame and influence.

In our example of the Riemann hypothesis, the most authority was given to the paper which received the most citations from others. In the Twitter example, the most authority was given to the person who had the largest number of followers. But if we think about it, we can use this idea to try to estimate the importance of web pages as well. Most web pages include links, referring the reader to other pages that have related information. These links are created by humans, who have made a choice about which web pages to link. Thus, a link is a sort of vote, recorded on one web page, for another web page.

Ranking: Counting links to web pages

Webpage links

Most web pages include links, allowing a reader to get more information about a specific topic. This is often because a user begins with a very general question, and wants help in gradually focussing the question to get the exact information desired. A high school student might be interested in whether FSU has a foreign study program in its German Department. One way to get there involves moving through a series of web pages, each of which includes a link to the next one. At each step, the user is exploring (sometimes making a mistake, and moving backwards!). The links between pages allow the user to slowly explore the information and usually make it to the right stuff.

How is a link put into a web page?

We have mentioned that the HyperText Markup Language (HTML) includes some tools for formatting a web page, so that it has a title, and can make lists, include images, and so on. HTML also specifies how a writer can insert a link into a web page. For example, near the top of the FSU main web page, there is an item ACADEMICS which is a hyperlink. Actually, ACADEMICS is the visible part of the link, what the user sees. There is also an invisible part, what is called the Uniform Resource Locator (URL), which is simply the web address. For ACADEMICS, this web address is www.fsu.edu/academics/

When you click on the word ACADEMICS on the FSU web page, what happens is the same as if you typed the URL www.fsu.edu/academics/ into your browser. The hyperlink just makes this easy for you. The ACADEMICS hyperlink is set up by inserting the following text into the FSU main web page, which simply lists the URL to be associated with the visible text:

<a href="https://www.fsu.edu/academics/">ACADEMICS</a>
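To a web crawler, spotting these links is just a matter of scanning a page's HTML for anchor tags. Here is a minimal sketch using Python's standard html.parser (an illustration; real crawlers are far more elaborate):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # collects the href value of every <a ...> tag it sees
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# a tiny made-up page containing two links
page = ('<p>See <a href="https://www.fsu.edu/academics/">ACADEMICS</a> or the '
        '<a href="https://www.sc.fsu.edu/">department site</a>.</p>')

collector = LinkCollector()
collector.feed(page)
print(collector.links)
# ['https://www.fsu.edu/academics/', 'https://www.sc.fsu.edu/']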

Any number of links can be included in a web page, and they can point to any place on the web that the author of a web page thinks might be useful. So we can think about a link as a sort of vote by one web page for another web page. We need to think about how to count these votes. Compare web page A, which includes 50 links to other web pages, with web page B, to which 50 different web pages all link. There's no reason to think web page A is very important, but web page B has gotten the attention of 50 different web writers; perhaps there's something worth seeing on web page B! Links TO a web page are important, not links FROM it. So a web site that contains no links might still be very important, whereas a web site that has no links to it seems to be pretty useless. We'll call this initial idea for ranking web pages the Hyperlink Trick.

Links: Inlinks are web pages that vote for you

Example.

As a simplified example of the Hyperlink Trick, suppose you search for a scrambled egg recipe and the search engine finds two matching pages: Ernie’s recipe and Bert’s recipe.

Which recipe should the search engine recommend most strongly? The search engine looks at how many inlinks each website has.
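Counting inlinks is then a one-pass tally over the crawlers' "page A links to page B" reports. Here is a toy sketch with made-up data (the page names and counts are invented for illustration), in which Bert's recipe happens to receive more inlinks than Ernie's:

from collections import Counter

# made-up crawler reports: (linking page, page being linked to)
reports = [
    ("alice-waters.example",    "berts-recipe.example"),
    ("food-blog.example",       "berts-recipe.example"),
    ("cooking-tips.example",    "berts-recipe.example"),
    ("john-maccormick.example", "ernies-recipe.example"),
]

inlinks = Counter(target for _, target in reports)
print(inlinks.most_common())
# [('berts-recipe.example', 3), ('ernies-recipe.example', 1)]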

A search engine can't read or understand the recommending pages. But it can certainly count the number of links coming in to either of the two recipes. The fact that Bert's page has more links is at least a suggestion that people (who can read and understand Web pages!) found Bert's page more useful, or his recipe better tasting. So in the absence of any better information, a search engine could take the number of incoming links to a Web page as a rough indication of the rank or value or authority of that page.

Incoming links can be reported by web crawlers

We have already seen that we needed an army of programs called web crawlers to wander around the web constantly, gathering information about network connections, web sites, web pages, search phrases inside of web pages and the locations of those search phrases. This goes to making up a map of the web, and an index of search words. Now that we see that incoming links are important, we can simply ask our web crawlers to include this information in their searches. As they "read" a web page, they notice every link that points to another page, and they tell the search engine Web page A links to web page B. Now our search engine can track the incoming links to every page.

Is ranking by link count good enough?

It's easy to see some problems with such a simple ranking system.

1. If all the Web pages pointing to Bert's page said "This recipe is terrible!", the search engine would still give Bert's page a higher ranking than Ernie's;
2. If Ernie knew how the search engine works, he could quickly write 10 new Web pages that praise his recipe, so now he ranks higher than Bert;
3. High school students and film critics both make top ten lists of movies, and there are many more high school students than film critics.

Each page has one link, but are these links equal? If we, being humans, know that John MacCormick is not a famous chef, but Alice Waters is, then we are likely to assume that it's safe to prefer Bert's recipe, because Alice Waters's recommendation has more authority.

We would like to modify our ranking procedure. Instead of only counting the number of hyperlinks to pages, we'd like to include somehow a measure of the authority of the Web page that is making the recommendation, that is, the hyperlink to Bert or Ernie's page. In this way, our ranking procedure can take advantage of The Authority Trick. Hyperlinks from pages with high "authority" will result in a higher ranking than links from pages with low authority. But how can a computer determine authority? Let's consider combining the Hyperlink Trick with the Authority Trick.

Let's start by assuming there are a total of 102 web pages that point to either John MacCormick or Alice Waters, and let's assign each of these web pages an authority of 1. Now suppose John MacCormick has 2 hyperlinks pointing to his web page, while Alice Waters has 100. We might give MacCormick an authority of 2, and Alice Waters an authority of 100, as though the lower level web pages were voting for them. Then we might suppose that any recommendation (hyperlink) by Alice Waters should add 100 "authority points" to that web page, while a recommendation by John MacCormick would only be worth 2 points. This means that Ernie's recipe has an authority score of 2, and Bert's 100. Google would know that Alice Waters' site has more authority than John MacCormick's site because her site would have a higher ranking.

Count the links to the pages that link to the pages that link to the pages ...

Adding up links and links to links and so on almost works...

The ideas of using hyperlinks and assigning authority are good ones. We might try to implement these ideas by starting every web page with one authority point. Then, each hyperlink in a page would cause that page’s authority points to be added to the linked page’s authority points.

Then we just have to do this for all web pages and we’re done, right?

Unfortunately, this idea won’t quite work. A problem arises if we encounter a cycle, a sequence of hyperlinks forming a loop. In this case, you can start on page A, jump to B, then E, and back to A. The pages A, B, and E form a cycle. This means that our method for assigning authority points will fail. To see this, let’s imagine trying to compute the rankings for this case.

C and D have no links pointing to them, so they each get a score of 1. C and D point to A, so A gets a score of 2. A points to B, and B points to E, so they get scores of 2 as well. Are we done? No, A’s score is out of date now, because E points back to A! We update A to 4, but then we must update B and E... and then A again, and so on forever. And there must be many cycles like this over the entire Web. To save our idea, we need the Random Surfer Trick.

The Random Surfer will follow a trail of hyperlinks, but only for a while!

The random surfer simulates a person surfing the Internet. A starting page is picked at random. If this page has any hyperlinks, the surfer picks one at random and moves to that new page. If that page has hyperlinks, another random choice leads to another page, and so on. If a page has no links, a new page is chosen at random. The random surfer never moves backward. And even if the current page has links, the surfer is occasionally allowed to jump to a random page instead, as though bored with the current trail; a sketch of this procedure appears below.
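Below is a minimal sketch of the random surfer on the same five-page web. The 15% “boredom” probability and the number of steps are assumptions chosen for illustration; they are tuning choices, not values given in the text.

    import random

    # The same five-page web: C and D point to A; A, B, and E form a cycle.
    links = {
        "A": ["B"],
        "B": ["E"],
        "C": ["A"],
        "D": ["A"],
        "E": ["A"],
    }
    pages = list(links)

    BOREDOM = 0.15      # chance of abandoning the current trail (assumed value)
    STEPS = 100_000

    visits = {page: 0 for page in pages}
    current = random.choice(pages)             # start at a random page

    for _ in range(STEPS):
        visits[current] += 1
        outgoing = links[current]
        if not outgoing or random.random() < BOREDOM:
            current = random.choice(pages)     # bored (or stuck): jump anywhere
        else:
            current = random.choice(outgoing)  # follow a randomly chosen link

    # Authority = percentage of all visits made to each page.
    for page in pages:
        print(page, round(100 * visits[page] / STEPS, 1))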

In some sense, the random surfer models user behavior. The random surfer model takes into account both the quantity (the Hyperlink Trick) and the quality (the Authority Trick) of the incoming links at each page. Randomly surfing the Internet seems an odd way of estimating the authority index we are seeking. However, if we run this experiment for many, many steps, we can see that a web page pointed to by many links is more likely to be visited often by the random surfer. On the other hand, we will never get stuck in an infinite loop, because the surfer occasionally jumps to a random page and restarts the process. So the Random Surfer Trick estimates the authority index by wandering through the Internet and noticing which pages it visits most often.

Example of Random Surfing

Here the surfer starts at page A, and moves to another page following a randomly selected link (darker arrow). Three such steps reach page B. From page B, the surfer jumps (dashed line) to page C, then links to page D, then another page, then another random jump.

From there, the surfer takes two linking steps and stops. It turns out that if you let the random surfer wander around the web like this, then you have solved the authority index problem. This is because, in a natural way, the importance or authority of a web page is related to the number of times the random surfer visited that web page. More precisely, if we make the authority index a percentage, then the authority of a web page is the percentage of visits that were made to that page. The web has lots of cycles that could trap someone who can only move along links, but the surfer gets bored easily and jumps around, escaping the cycle traps.

A simulation on an internet of 16 pages

We have recorded the number of times each page was viewed by the random surfer over 1000 steps. Then, for each page,

    Authority = (number of visits to this page / number of steps) × 100   (to convert to a percent).

Rank = Hyperlink + Authority + Random Surfer

1. The Hyperlink Trick suggested that a page with many incoming links should receive a high ranking. And indeed, the more incoming links a page has, the more likely the random surfer is to visit it.
2. The Authority Trick suggested that an incoming link from a highly authoritative page should improve a page’s rank more than a link from a less authoritative page. And indeed, a popular page is visited more often than an unpopular one, so there are more opportunities for the surfer to arrive from the popular page than from the less popular one.
3. The Random Surfer Trick: a random surfer, in the form of a computer program, starts at a random page and follows hyperlinks from page to page. The surfer occasionally jumps to a random page, eliminating the problem of getting stuck in loops.

Now our search engine is complete

A good search engine needs:

1. a map of the web, its pages, and links;
2. a page match algorithm that finds pages that match the user’s search words;
3. a page rank algorithm that sorts the matching pages so that the most authoritative are listed first.

It took only half a second to carry out 1,000,000 steps of the random surfer procedure in our earlier example with 16 pages. Since the World Wide Web has 40 billion pages, it will obviously take much longer to compute a complete authority index. However, if several computers each carry out a separate random surfer analysis, the results can be combined. Since Google has about two million computer processors available, the task suddenly becomes much more doable. Moreover, Web pages don’t change very fast, so the results can be recomputed every week or so. So a good search engine will have an up-to-date ranking of all Web pages available before a user has made any search request.

In 1998, Larry Page and Sergey Brin announced their PageRank algorithm, built into Google search. Results were noticeably better and faster, and the “first page” of results often contained exactly what users were seeking; Google soon became the dominant search engine. Google and its competitors continue to improve their search engines.

Example. Is Google’s PageRank algorithm foolproof? The case of J. C. Penney.

The Web began as a way for scientific researchers to communicate, but now it involves commercial services and advertising. Getting a company’s advertisement pages onto the first page of search results means big money. People do a lot of shopping online; instead of visiting stores, they look for items by using the search engine in their browser. When the search results come back, most shoppers only look at the first page of results, and about 1/3 of the time they choose the very first result. That means being the first result in a search can make big money for an online company.

The following is excerpted from “The Dirty Little Secrets of Search” by David Segal, which appeared in The New York Times, February 12, 2011.

Pretend for a moment that you are Google’s search engine. Someone types the word “dresses” and hits enter. What will be the very first result? There are, of course, a lot of possibilities. Macy’s comes to mind. Maybe a specialty chain, like J. Crew or the Gap. Perhaps a Wikipedia entry on the history of hemlines. O.K., how about the word “bedding”? Bed Bath & Beyond seems a candidate. Or Wal-Mart, or perhaps the bedding section of Amazon.com. You could imagine a dozen contenders for each of these searches. But in the last several months, one name turned up, with uncanny regularity, in the No. 1 spot for each and every term: J. C. Penney. The company bested millions of sites, and not just in searches for dresses, bedding, and area rugs. This striking performance lasted for months, most crucially through the holiday season, when there is a huge spike in online shopping. Type in “Samsonite carry on luggage”, for instance, and Penney for months was first on the list, ahead of Samsonite.com. Google’s stated goal is to sift through every corner of the Internet and find the most important, relevant Web sites. Does the collective wisdom of the Web really say that Penney has the most essential site when it comes to dresses? And bedding? And area rugs? And dozens of other words and phrases? The New York Times asked an expert, Doug Pierce, to study this question. What he found suggests that the Google search often represents layer upon layer of intrigue.
If you own a Web site about Chinese cooking, your site’s Google ranking will improve as other sites link to it. Even links that have nothing to do with Chinese cooking can bolster your profile. And here’s where the strategy that aided Penney comes in. Someone paid to have thousands of links placed on hundreds of sites scattered around the web, all leading directly to JCPenney.com. Mr. Pierce found 2,015 pages with phrases like “casual dresses”, “evening dresses”, “little black dress”, or “cocktail dress”. Click on any of these phrases and you are bounced to the main page for dresses on JCPenney.com. Some of these sites are related to clothing, but many are not. There are links to JCPenney.com’s dresses page on sites about diseases, cameras, cars, dogs, aluminum sheets, travel, snoring, diamond drills, bathroom tiles...

Google warns against using such tricks to improve search engine rankings. The penalty for getting caught is a pair of virtual concrete shoes: the company sinks in Google’s results. On a Wednesday in 2011, JCPenney was the subject of Google’s “corrective action”. At 7 p.m., JCPenney was the No. 1 result for “Samsonite carry on luggage”; two hours later, it was No. 71. At 7 p.m., Penney was No. 1 in searches for “living room furniture”; by 9 p.m., it had sunk to No. 68. Penney fired its search engine consulting firm and announced that it was “disappointed” with Google’s actions. Google engineer Matt Cutts emphasized that there are 200 million domain names and a mere 24,000 employees at Google. “Spammers never stop,” he said.

What if you want to create your own Web page?

A web page is just a text file, which can be created with any text editor. However, unlike a typical text file, a web page includes special tags that essentially say “This is the title”, or “This part should be in italics”, or “This begins a list”, or “This is a link to another page.” The rules for using and interpreting these tags are part of HTML, the HyperText Markup Language. The browser uses the tags to format the web page, so what you see is the formatted result rather than the tags themselves. To understand how a web page works, you can look at the version that includes the tag information. In Firefox, for instance, you would go to the Developer menu and choose Page Source. If the web page is very fancy, you may be surprised at how complicated the HTML version is! Mozilla has an online editor for learning to create a Web page using HTML. You can find it at https://thimble.mozilla.org/ It is a good place to learn by taking a template and changing it. The simplest template is for creating a poster: a sample is given, and you can modify it however you like. Other projects are included as well, such as creating automatic excuses for late assignments! A sketch of a very small HTML file is shown below.
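Here is a minimal sketch of what such a tagged text file might look like. The title, the text, and the link address are made up purely for illustration.

    <!DOCTYPE html>
    <html>
      <head>
        <title>Bert's Scrambled Eggs</title>           <!-- "This is the title" -->
      </head>
      <body>
        <p>My recipe uses <i>lots</i> of butter.</p>   <!-- italics -->
        <ul>                                           <!-- this begins a list -->
          <li>3 eggs</li>
          <li>1 tablespoon of butter</li>
        </ul>
        <!-- a hyperlink to another page (the address is hypothetical) -->
        <a href="https://ernie-eggs.example">Ernie's recipe</a>
      </body>
    </html>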

Socrative Quiz: Searching Quiz 2

CTISC1057

Use the table below to answer the first 5 questions.

Word      Page-Position(s)
a         3-10
cat       1-3, 1-7, 3-7
dog       2-3, 2-7, 3-11
mat       1-11, 2-11
my        1-2, 2-2, 3-2
on        1-9, 2-9
pets      3-3
sat       1-8, 3-12
stood     2-8, 3-8
the       1-6, 1-10, 2-6, 2-10, 3-6
while     3-9
?         1-1, 2-1, 3-1
?         1-4, 2-4, 3-4
?         1-5, 2-5, 3-5
?         1-12, 2-12, 3-13

1. The word mat occurs on all three pages.
2. The word pets is in only one of the three titles.
3. The word sat is adjacent to cat but not to dog.
4. The word my only occurs in the title of the pages and not in the body.
5. What page(s) is the word “mat” on?
6. When we use Google Search to search the WWW, we are really not searching the Web but rather an index of the Web.
7. A web page which has the search words in its title is probably more important than one which doesn’t.
8. A hyperlink on a web page allows the user to jump from one web address to another without actually knowing the address of the second.
9. Links FROM a web page give more authority to the page than links TO the web page.
10. Google’s page ranking algorithm is foolproof.