11

Quantitative Web History Methods

Anthony Cocciolo

Introduction

This chapter explores how historical research questions, including research questions about the history of the web, can be addressed through quantitative research methods applied to web archives. At a basic level, quantitative methods involve applying mathematical or statistical analysis to numerical data. These techniques can range from simply adding up the occurrences of some word to more sophisticated procedures such as analysis of variance (ANOVA), which will be described in more depth in this chapter. Through the use of web archives, quantitative methods can be used to show patterns and changes over time, thus having utility in addressing historical research questions.

In this chapter, a personal use of quantitative research methods with web archives will be discussed as a way of illustrating how they can be used more broadly (Cocciolo, 2015). In 2014, I was interested in what I perceived as a decreased use of text on the web in favor of image-based content, such as video and photographs. I was particularly interested in what seemed like an erosion of written content online in favor of a form of communication that seemed to share commonalities with children's books, where photographs are accompanied by small amounts of text. In seeing what looked like a movement away from the written word, I was reminded of the work of Walter Ong, who noted the tenacity of orality, or a tendency to attempt to return to an oral culture despite the success and obvious benefits of literacy (Ong, 2002). An oral culture is one without knowledge of literacy, where information, knowledge, and culture are communicated and passed down through means other than the written word, such as oral storytelling, music, and other non-written means. Was the internet, with its newfound ability to easily stream video and high-resolution imagery, allowing for Ong's return to orality?


Although I realized that I could not make such sweeping arguments in an academic study – reviewers would be none too pleased – I was still interested in developing a sound method for determining changes in the amount of text delivered to users over time. To study this, web archives are essential because they contain copies of webpages from the past. Although some countries have extensive web archives of their national domain or other collecting areas, in the United States – which was my main study site – the most extensive web archive is the one kept by the Internet Archive, which it displays through its WayBack Machine. Thus, I knew that if I was to study changes in the presentation of text on websites used by people in the United States, the web archives kept by the Internet Archive would be an essential resource.

Before I discuss how I applied web archives in my research, I will outline the general steps for engaging in quantitative research using web archives. Each of these steps will be described in more detail in the following sections and will draw on this personal example. These steps are:

1 Developing a research question – First, a research question should be developed that can be addressed in full or in part through web archives. Types of questions that can be explored through such methods will be discussed, as well as those that are better suited for other methods.
2 Securing a corpus – Second, a corpus of web archived content that provides coverage of the areas appropriate to the research question ought to be secured, and ways to gain access to such corpora will be discussed.
3 Numerical translation – Third, the corpus needs to be translated into numerical data based on the research question.
4 Analysis – Fourth, using the numerical datasets created in the earlier step, mathematical or statistical analysis techniques can be employed. These can range from simple functions such as summation, average, and standard deviation to more complex analysis, including statistical techniques such as analysis of variance (ANOVA).
5 Drawing conclusions – Fifth, like all research, conclusions should be drawn based on the analysis.

The aim of this chapter is to offer a starting point for historians interested in applying quantitative research methods to web archives to answer historical research questions, using personal experience as an illustration. However, before the stages are discussed, relevant literature on using quantitative research methods with web archives will be introduced.

Relevant Literature

Using quantitative research methods with web archives may be somewhat new to historians. In a traditional sense, historical research involves the close examination of textual records to address historical research questions. The question of whether to use quantitative or qualitative research methods is not generally directed at the historian but rather at the social scientist, such as the psychologist, sociologist, and education researcher. Quantitative and qualitative research – when referred to by social scientists – typically involves studying living people, which may or may not be the case for historians. In social science, qualitative research methods such as interviews and focus groups are often used to get at people's understanding of something that is not well understood, such as motivations or opinions on some topic. Quantitative research can be used to study an issue or topic that may be better understood, but where there is greater interest in seeing how wide or generalizable a given view is. Whereas qualitative research may involve analysis of data such as interview transcripts, quantitative research may involve analysis of numerical data such as that generated from a survey. In this paper, 'quantitative research' is used to refer to performing analysis using numerical and statistical techniques on data, specifically web archives. While this method may not be commonly used by historians, it can be used to help address historical research questions in conjunction with other sources of evidence.


Studying webpages and web-based phenomena using quantitative methods is not new, as it is captured in the research subfield of information science known as webometrics. According to Thelwall and Vaughan, 'Webometrics encompasses all quantitative studies of web-related phenomenon' (2004: 1213). Thelwall (2009) notes that webometrics can be used for studying a variety of web-based phenomena, such as issues relating to election websites, online academic communication, bloggers as amateur journalists, and social networking. The methods can be used for understanding aspects like web impact assessment, citation impact, trend detection, and search engine optimization, among other possible uses. Webometrics grows out of the subfield of information science known as bibliometrics, which uses quantitative analysis to make measurements related to published books and articles, such as citation analysis to determine impact. Related subfields include infometrics, which is the quantitative study of information and can combine analysis of information in whatever form it may occur.

Björneborn and Ingwersen (2004) highlight four main areas of webometric research: 1) webpage content analysis; 2) web link structure analysis; 3) web usage analysis; and 4) web technology analysis. Notably missing from this list is a longitudinal or time-based dimension. However, webometrics researchers highlight the possibilities opened up by web archives. Björneborn and Ingwersen note that 'Web archaeology… could in this webometric context be important for recovering historical Web developments, for example, by means of the Internet Archive (www.archive.org)' (2004: 1217). When webometrics was in its early development in the 2000s, web archives such as the Internet Archive only contained a few years of content, making them less appealing for long-term, longitudinal analysis. However, as web archives have persisted, and notable web archives such as the Internet Archive have surpassed 20 years of crawling websites, new opportunities for historical and longitudinal analysis are increasingly possible.

In the field of communication and media studies, researchers have begun to use web archives to create web histories, which Brügger defines as 'a necessary condition for the understanding of the Internet of the present as well as of new, emerging Internet forms' (2011: 24). Web histories can include studies of multiple facets of the web, such as national domains, which may look at factors such as volume, space, structure, and content, among others (Brügger, 2014).

Stage 1 – Developing a research question or questions

Before progressing further, I must make a quick note on language used in this article, as it varies to some degree by country. By 'homepage', I am referring to the start page or initial page of a website. I also use the term website, which refers to an entire collection of webpages under a given domain. For example, the website 'pepsi.com' is composed of a homepage and other webpages that are hyperlinked together to form the website.

When engaging in quantitative research using web archives, it is necessary to have a research question that lends itself to such methods. For my project mentioned in the introduction, my research questions are the following:

Is the use of text on the World Wide Web declining? If so, when did it start declining, and by how much has it declined?

The above research questions are well suited for quantitative methods using web archives. The first reason is that the questions are essentially quantitative in nature: a 'decline' and by 'how much' are things that can be readily measured numerically by comparing data from some specific year in the past against data from a more recent year.


The second reason is that web archives are the essential resource for seeking answers to the above questions. As the Internet Archive has been archiving the web since 1996 (Goel, 2016), it is possible to use it to analyze homepages for nearly the entire lifespan of the World Wide Web. Although web archives are generally not available for the first few years of the World Wide Web, by the end of the 1990s good web archives – specifically through the WayBack Machine – exist. Thus, web archives work well for studying content from the late 1990s to the present day.

The major limitation when making comparisons between the past and present is that not all present-day webpages existed in the past, and not all past websites continue into the future. Further, some webpages blocked web crawlers because they feared losing control of their content, thus leaving sites like Time Magazine (time.com) poorly represented in 1990s web archives. A further limitation is that most web archives do not copy every webpage within a given domain, but only go a few levels deep off of the homepage. While the Internet Archive has very extensive copies of homepages of top-level domain names for many months and years, webpages several levels below the homepage are less well-represented. Thus, using web archives to make comparisons between homepages is much more feasible than making comparisons against some webpage many levels below the homepage. Understanding the strengths and limitations of a particular web archive necessarily impacts the types of research questions and analysis that it can be used to address.

Although there are an infinite number of possible research questions that may make use of quantitative research methods with web archives, some particular components are more appropriate than others. Research questions that are looking into the occurrence of some 'thing' are particularly noteworthy, which can include the occurrence of a word, image, phrase, visual element, hyperlink, person or place. Basically, if it can be readily identified by a computer or human, it can be used in the research question. Another component can be the co-occurrence of one 'thing' with another. Co-occurrence analysis can factor in the distance between each 'thing', such as word distance, pixel distance, or number of links away. More sophisticated relationships between one or more 'things' can also be studied, such as the nature of sentiments relating one element to another (Liu, 2012). The more complicated the phenomenon – such as sentiment – the more complex the algorithms for identifying it must be. With the increased complexity, there is a greater chance that the algorithm could perform poorly and not correctly identify the sentiment. Thus, simpler phenomena, such as occurrence or co-occurrence, are more straightforward to determine than more sophisticated relationships. Emerging practices that use machine learning and artificial intelligence in algorithms have the potential for identifying complex phenomena and relationships (Kelleher et al., 2015). Although well beyond the scope of this article, machine learning techniques that leverage artificial intelligence have potential application to research questions that are inherently historical in nature.
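To make occurrence and co-occurrence concrete, the short Python sketch below counts how often one word appears in the text extracted from an archived page, and how often it appears within a fixed word distance of a second word. The function names and the sample sentence are illustrative only and are not part of the original study.

    import re

    def occurrences(text, term):
        # Count case-insensitive occurrences of a single word.
        words = re.findall(r"[a-z0-9']+", text.lower())
        return sum(1 for w in words if w == term)

    def co_occurrences(text, term_a, term_b, window=10):
        # Count pairs of term_a and term_b appearing within `window` words of each other.
        words = re.findall(r"[a-z0-9']+", text.lower())
        positions_a = [i for i, w in enumerate(words) if w == term_a]
        positions_b = [i for i, w in enumerate(words) if w == term_b]
        return sum(1 for a in positions_a for b in positions_b if abs(a - b) <= window)

    sample = "The election website published election results alongside video of the results."
    print(occurrences(sample, "election"))                 # 2
    print(co_occurrences(sample, "election", "results"))   # 4 pairs within ten words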
Questions well suited for web archives research could cover the timeframe of the late 1990s to the present, which are the years that are well-represented in some web archives. If earlier years are being included, web archives may need to be augmented with more traditional sources, such as newspapers, magazines, and books. Analysis of such sources can be expedited to some extent by using digitized copies of such works, but print holdings may be needed, as not everything has been digitized nor, if it has, is it necessarily available to researchers.

In a perfectly linear world, researchers move from creating research questions, to devising methods for addressing those questions, to implementing the methodology, analyzing the data, generating results, and drawing conclusions.


However, as many researchers know, the questions are developed in dialectic with the research methods and data sources available; thus each aspect influences the other. Hence, it would not be unusual to refine research questions based on the data that can be secured, or the analysis options available. In the next section, securing the web archive corpus will be discussed.

Stage 2 – Securing a web archive corpus

The next step in the research process is securing or identifying a corpus of web archived webpages that can be used or analyzed to address the research questions. Before obtaining a corpus, it is important to understand the ways in which web archives come into existence. Some of the methods used to archive web content are client-side archiving, server-side archiving, and non-web archiving (Masanès, 2006). Client-side archiving is the most popular form of web archiving and is used by the Internet Archive (2018) to collect webpages for display on the WayBack Machine. In this approach, web crawlers act like normal web users and 'start from seed pages, parse them, extract links, and fetch the linked document', then re-iterate (Masanès, 2006: 23). This method works well for simpler webpages, but can run into difficulty with webpages that exchange content in between webpage loads, an approach to creating web interfaces popularly known as Asynchronous JavaScript and XML (AJAX). Many social media sites make extensive use of this approach, and without special provisions for web archiving, retrieving this content can be challenging. This can be overcome, but may require manual intervention by a skilled web archivist. This approach is also challenging when attempting to download large collections of webpages, which can take a long time to completely download. For example, Masanès notes that 'it will take more than three days to archive a site with 100,000 pages' (2006: 24).
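A minimal sketch of the crawling loop Masanès describes – start from seed pages, extract links, fetch the linked documents, and re-iterate – is given below. It is written in Python with the requests library rather than being a production crawler; the seed URL, depth limit, and delay are illustrative assumptions, and it only follows absolute links.

    import re
    import time
    import requests

    def crawl(seed, max_depth=2, delay=1.0):
        # Breadth-first fetch of pages reachable from the seed, a few levels deep.
        seen = set()
        frontier = [(seed, 0)]
        pages = {}
        while frontier:
            url, depth = frontier.pop(0)
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            pages[url] = html
            # Extract absolute links; a fuller crawler would also resolve relative URLs.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                frontier.append((link, depth + 1))
            time.sleep(delay)  # be polite to the server being archived
        return pages

    # archive = crawl("http://example.com/")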
Some of the limitations of client-side web archiving are overcome by server-side web archiving, in which files are copied directly from the server in conjunction with the site owner's cooperation. This method was used by the Library of Congress to create an archive of Twitter (Osterberg, 2013). The limitations of this approach are re-creating the webpages so that they are authentic to what the user would have experienced, and the extensive effort required to negotiate the transfer of data.

Perhaps the simplest form of web archiving is to create non-web archives, where web content is printed out or converted to a format like Adobe Acrobat PDF or PNG files and stored using something other than the web (e.g., file folders, directories on a computer) (Masanès, 2006). Although this method has some appeal because of its simplicity, it loses the context in which users experienced the content and the way it was navigated using hyperlinks. It also can lose some of the graphical look and feel of the webpage, which is readily evident when most webpages are saved as PDFs or printed out.

In the case of my research project on the decline of text on the web, I was interested in making comparisons between webpages from today and those in the past. Thus, I needed to use webpages that I knew existed in the past and persist until today. I developed a list of 100 popular and prominent websites in the United States that existed from 1999 to 2014 and were available through the WayBack Machine, using indexes like Alexa's Top 500 English-language Website index (Alexa, 2003). Popular and prominent websites were selected – rather than websites that may be obscure and unused – because they may better reflect the interests and desires of the general user population. The list is repeated below in Table 11.1.


Table 11.1  Website categories with respective websites

Consumer products and retail (10): Amazon.com, Pepsi.com, Lego.com, Bestbuy.com, Mcdonalds.com, Barbie.com, Coca-cola.com, Intel.com, Cisco.com, Starbucks.com
Government (11): Whitehouse.gov, SSA.gov, CA.gov, USPS.com, NASA.gov, NOAA.gov, Navy.mil, CDC.gov, NIH.gov, USPS.com, NYC.gov
Higher education (10): Berkeley.edu, Harvard.edu, NYU.edu, MIT.edu, UMich.edu, Princeton.edu, Stanford.edu, Columbia.edu, Fordham.edu, Pratt.edu
Libraries (8): NYPL.org, LOC.gov, Archive.org, BPL.org, Colapubliclib.org, Lapl.org, Detroit.lib.mi.us, Queenslibrary.org
Magazines (12): USNews.com, TheAtlantic.com, NewYorker.com, Newsweek.com, Economist.com, Nature.com, Forbes.com, BHG.com, FamilyCircle.com, Rollingstone.com, NYMag.com, Nature.com
Museums (10): SI.edu, MetMuseum.org, Guggenheim.org, Whitney.org, Getty.edu, Moma.org, Artic.edu, Frick.org, BrooklynMuseum.org, AMNH.org
Newspapers (9): NYTimes.com, ChicagoTribune.com, LATimes.com, NYDailyNews.com, Chron.com, NYPost.com, Suntimes.com, DenverPost.com, NYPost.com
Online service (8): IMDB.com, MarketWatch.com, NationalGeographic.com, WebMD.com, Yahoo.com, Match.com
Technology site (11): CNet.com, MSN.com, Microsoft.com, AOL.com, Apple.com, HP.com, Dell.com, Slashdot.org, Wired.com, PCWorld.com, IBM.com
Television (11): CBS.com, ABC.com, NBC.com, Weather.com, PBS.org, BBC.co.uk, CNN.com, Nick.com, MSNBC.com, CartoonNetwork.com, ESPN.go.com
Total: 100

Although the Internet Archive has archived webpages from 1996 onward, the range of webpages archived improved as years advanced, and by 1999 many more websites were being archived than in 1996. Thus, I would begin my comparisons at the year 1999. To show changes over time, I would analyze those websites every three years (1999, 2002, 2005, 2008, 2011, and 2014). In sum, all websites needed to continuously exist between 1999 and 2014, and all those in Table 11.1 met that criterion.

The way that I chose to secure the archived webpages was to use Memento (2018). Memento is a technical framework aimed at a better integration of the current and the past Web, and provides a way to issue requests and receive responses from web archives (Van de Sompel et al., 2009). For example, you can submit a URL and date to the web service, and it will bring back the URL for the web archived content. The URL returned can be from a variety of web archives, but for all my sites submitted, the Internet Archive was the web archive that was brought back by Memento. I developed a PHP script that issued requests to the Memento web service for the 100 websites for six years, and in each case it returned a URL of the content from the Internet Archive, thus producing 600 web archived pages. Note that websites originally included in the 100 that were not archived for any given year were removed and replaced with a website from the respective category that had all years archived. Thus, each homepage was inspected manually to ensure that it was there and was not some type of error page that would throw off the analysis.
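The original script was written in PHP; as an illustration, the Python sketch below performs the same kind of lookup against the Memento aggregator's TimeGate, which answers a dated request with a redirect to the closest archived copy (Van de Sompel et al., 2009). The endpoint pattern and the reliance on the Location header are assumptions that should be checked against the service's current documentation.

    import requests

    def closest_memento(url, accept_datetime):
        # Ask the Memento aggregator's TimeGate for the archived copy closest to a date.
        # accept_datetime must be an HTTP date, e.g. 'Mon, 01 Feb 1999 00:00:00 GMT'.
        # Assumes the aggregator TimeGate lives at timetravel.mementoweb.org/timegate/<uri>.
        timegate = "http://timetravel.mementoweb.org/timegate/" + url
        response = requests.get(timegate,
                                headers={"Accept-Datetime": accept_datetime},
                                allow_redirects=False)
        # A TimeGate typically replies with a redirect whose Location header points
        # at the closest memento (often a web.archive.org URL).
        return response.headers.get("Location")

    # Example: the pepsi.com homepage as close to 1 February 1999 as possible.
    # print(closest_memento("http://www.pepsi.com/", "Mon, 01 Feb 1999 00:00:00 GMT"))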


Web archived webpages (the HTML and binary files) can be downloaded in a web browser window (using the File -> Save As option), or a script can be used to download the HTML file and related files needed to render the page. For example, the command-line based tool 'wget' makes downloading webpages and related binary files relatively straightforward. Issuing the following command via the Windows command line or Macintosh terminal will download the Internet Archive's July 1997 copy of the homepage of the Pratt Institute's website:

    wget -p -e robots=off https://web.archive.org/web/19970713123416/http://www.pratt.edu/

Note that in the above example, an option is passed to wget to ignore the robots.txt file on the Internet Archive's WayBack Machine, which explicitly blocks all crawlers. It is a strange irony that the site that was built on crawling websites does not allow crawling! However, this is likely not so problematic as long as large amounts of content are not downloaded all at once, which can place strain on a webserver. In fact, the above request will only download a small amount of data. If too many requests are being issued too quickly, it is likely that you could be temporarily blocked. The 'p' option included with the wget command ensures that the crawler downloads all the page pre-requisites, such as GIF and JPG files referenced in the HTML page.

In my case, I was not interested in getting the HTML and related binary files (e.g., JPGs, GIFs), but wanted large visual presentations of the webpage. This was because the method I intended to use to determine which parts of the webpage were graphics and which were text was a computer vision algorithm, which will be discussed more in the next section. To download the full print-screens, I used a simple extension called 'Grab Them All', which creates full webpage screenshots as PNG files using a seed list of URLs (Grab Them All, 2018). I was able to give it a seed list of 600 URLs, and in half an hour I had large PNG visual representations of those archived webpages.

Brügger writes that a web archived website is typically faulty and deficient compared with the original (Brügger, 2008). Web archived websites can have problems with the links or with the content displayed, among other possible issues. One advantage of creating a graphic version of a web archived webpage is that it stabilizes it: since a PNG file can only be opened and rendered one way, there is no chance that it will be displayed differently on different computers or browsers. However, it has the disadvantage of eliminating the special functions of webpages (e.g., hyperlinks, interactive content, moving image content, etc.). Thus, while it eliminates some problems (e.g., browser obsolescence that may render some functions inoperable), it introduces new limitations (e.g., transforming an inherently interactive medium into a static image).

For projects where only the text is important and not the HTML or related binary files, scripts can be developed which remove the HTML. For example, a regular expression script, implemented in PHP or Python, can remove all HTML markup from a file, leaving only the readable text. Further, the PHP function strip_tags() can remove HTML information from a webpage, leaving only the visible text. Lastly, specialized libraries, such as the Beautiful Soup library for Python, can be used for extracting information from HTML and XML files (Beautiful Soup, 2018; PHP.net, 2018).
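As an illustration of the Beautiful Soup route, the Python sketch below reads a saved HTML file, drops script and style elements, and keeps only the visible text; the file name is hypothetical.

    from bs4 import BeautifulSoup  # the Beautiful Soup library (Beautiful Soup, 2018)

    def visible_text(html):
        # Return only the human-readable text of an HTML document.
        soup = BeautifulSoup(html, "html.parser")
        for element in soup(["script", "style"]):
            element.decompose()  # markup for the browser, not text shown to users
        return " ".join(soup.get_text(separator=" ").split())

    with open("pratt_19970713.html", encoding="utf-8", errors="ignore") as f:
        print(visible_text(f.read()))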
When engaging in such tasks as stripping the HTML from a file, it is always a good idea to maintain copies of your data in stages. This is because if you realize further in the process that you made an error (e.g., stripped out more content than was expected), you can refer to copies of the data from the earlier stages.

Note that this is only one method to get access to web archived content. This method diverges from the 'big data' approaches to working with web archives, such as using gigabytes, terabytes or even petabytes of web archived content in analyses. To gain access to big datasets of web archived content, it is necessary to work more directly with providers, rather than downloading small bits from the web.


For example, the non-profit organization Common Crawl provides researchers access to web archived data (Common Crawl, 2018). Since these datasets are so large, it can be inefficient to make copies of the data, and thus sites like Common Crawl allow users to run their analysis against the cloud-based data using computing services provided by Amazon.

Before discussing big-data approaches in more detail, it is important to make a distinction between data and metadata. Metadata can be defined as 'data about data'. A familiar example is that a telephone conversation could be considered the data, but the phone number dialed to enable the connection and the duration of the call might be considered metadata. In the case of web archives, the data could be the HTML of a webpage, where the metadata may be the URL and the date the copy of the HTML was made.

Common Crawl provides users with access to the web archive data and metadata in three formats: WARC files, a metadata-only format (WAT) and a text-only format (WET). WARC files are a standards-based, text-based format for representing web-crawled webpages. WARC files include the HTML for crawled webpages, metadata on the crawl (e.g., what day and time the site was crawled), and the binary files encoded in the text-based format. WARC files can be large and difficult to deal with, especially if you are not using all the data provided in them. For example, all the JPGs, GIFs or other binary data that are part of a webpage get encoded in WARC files, easily making the majority of WARC files comprise nonsensical content, because the binary data is not human-readable but machine-readable. If the binary data, such as images, are not necessary for a particular research project, the other formats Common Crawl offers may be better options. These include the metadata-only format (WAT), which provides information like the page title and outgoing links on the page, making it useful for creating networks of webpage linkages. The last format provided is the text-only format (WET), which removes the HTML, JavaScript, and other parts of a webpage that are not shown to users via a web browser. This format can be especially useful if you are interested in the textual content on a webpage.

One limitation of Common Crawl is that its earliest data is from 2008 and 2009, stored in ARC format, an earlier web archiving format that preceded WARC. ARC is very similar to WARC, and many tools support both formats, so this should not be a major impediment. However, because this data only begins in 2008, it would not be useful for studying the earliest years of the web. To study webpages from the 1990s, the Internet Archive is likely the best source.
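WARC and ARC files can also be read programmatically. The chapter does not prescribe a particular tool; the sketch below assumes the open-source Python library warcio, which handles both formats, and simply lists the URL and crawl date of every archived HTML response in a file (the file name is hypothetical).

    from warcio.archiveiterator import ArchiveIterator  # assumes the warcio library is installed

    def list_html_captures(path):
        # Print the target URL and crawl date of each HTML response in a WARC/ARC file.
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue  # skip request records, metadata, and so on
                headers = record.http_headers
                content_type = headers.get_header("Content-Type") if headers else ""
                if content_type and "text/html" in content_type:
                    print(record.rec_headers.get_header("WARC-Target-URI"),
                          record.rec_headers.get_header("WARC-Date"))

    # list_html_captures("example.warc.gz")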
Stage 3 – Numerical translation

Once the corpus has been secured, whether this comprises WARC files, metadata of web crawls, text from crawls, screenshots of webpages, or other manifestations of data from web archives, it is necessary to begin to prepare the data for numerical analysis. Web archives have many facets, and thus the way in which the data gets translated into numerical data is going to depend very much on the research questions. In the case of my research project described earlier, I was interested in using the 600 full-length screenshots, representing 15 years of homepages from 100 popular and prominent websites, to see if the amount of text presented to users was declining. Thus, I needed a method to decipher the textual content from other content, such as images, videos or whitespace.

I ended up modifying an open-sourced extension called Project Naptha (2018), which could be used for detecting blocks of text within an image. This extension implemented an innovative computer vision algorithm called the Stroke Width Transform (SWT). SWT was created by a Microsoft research team who observed that the 'one feature that separates text from other elements of a scene is its nearly constant stroke width' (Epshtein et al., 2010: 2963). During their initial evaluation, the algorithm was able to identify text regions within natural images with 90% accuracy.


For example, Figure 11.1 shows this process used on the Library of Congress webpage from the Internet Archive's 2002 collection, with the black boxes identifying the text regions. The algorithm is not without minor inaccuracies. It has identified part of the dome incorrectly as a text area, as well as some other very small areas. Nevertheless, it has an accuracy that is consistent with the findings of the Microsoft researchers. A second example provided is that of the White House website from 2002 using this same process (shown in Figure 11.2).

Using the bounding boxes produced by Project Naptha, a percentage of webpage text to non-text was computed and recorded into a MySQL database.

Figure 11.1  Library of Congress website from year 2002, with text areas highlighted with black bounding boxes. Webpage is 23.33% text using this method.


For example, the webpage shown in Figure 11.1 is 23.33% text, whereas the webpage shown in Figure 11.2 is 46.10% text, which indicates what is readily visible: that Figure 11.2 is more text-heavy than Figure 11.1.

Thus, at this point in my research project, I had created numerical data from 600 screenshots of webpages. The data included URL, percentage of text to non-text, and year.

Figure 11.2  WhiteHouse.gov from 2002, with text areas highlighted with black bounding boxes. Webpage is 46.10% text using this method.
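The percentage itself is straightforward to compute once the bounding boxes are known. The sketch below is an illustration rather than the modified Project Naptha code: it rasterizes the detected text boxes onto a pixel mask with NumPy, so overlapping boxes are not double counted, and reports the share of the page they cover. The page dimensions and box coordinates are invented values.

    import numpy as np

    def text_percentage(page_width, page_height, boxes):
        # Percentage of the page covered by text bounding boxes, where boxes is a
        # list of (x, y, width, height) tuples in pixels from a text detector.
        mask = np.zeros((page_height, page_width), dtype=bool)
        for x, y, w, h in boxes:
            mask[y:y + h, x:x + w] = True  # mark text pixels; overlaps count once
        return 100.0 * mask.mean()

    # A 1024 x 3000 pixel screenshot with two detected text regions.
    print(round(text_percentage(1024, 3000, [(50, 100, 600, 40), (50, 200, 400, 300)]), 2))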


This is only one narrow slice of data that can be generated, and other numerical data can be derived. For example, this could include recording information like word counts, properties of images or videos, relative amount of executable code to HTML on webpages, and file sizes, among many other possible numerical properties.

Other methods for translating web archives into numerical data include the use of topic modeling, text mining, and natural language processing (NLP) tools (Graham et al., 2016). NLP is a research field within the discipline of Computer Science, and has a number of applications that can readily produce numeric data. These can include named entity recognition, such as identifying the number of references to a specific person, place or thing; sentiment analysis; and topic identification (Jurafsky and Martin, 2008). A number of free tools and open toolkits are available for engaging in natural language processing (Bird et al., 2018; Natural Language Toolkit, 2018; Open NLP, 2018; Stanford Core NLP, 2018). All of these tools require some experimentation, as well as verification that they are producing the desired result. If such tools are being used, it is important to verify that they are working correctly, such as correctly identifying items from a pre-existing list or accurately identifying a sentiment. Evaluating the utility of these NLP tools can be accomplished through sampling the source material and the resulting outcome to verify that they are working as expected. This is especially important if the method being used has not been proven to work elsewhere. Although there is no hard and fast rule of how large the sample must be, I suggest at least 10% for unproven methodologies. The evaluation can be enhanced by using two independent evaluators to measure how well the tools are working on the sample data. Ensuring this consistency is often referred to as interrater reliability.
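A sketch of that verification step follows: draw a 10% random sample of the tool's outputs for manual review, have two evaluators judge each sampled item, and report how often they agree. Simple percentage agreement is shown here; more formal interrater statistics, such as Cohen's kappa, can be computed from the same labels. The example labels are illustrative.

    import random

    def sample_for_review(items, fraction=0.10, seed=42):
        # Draw a reproducible random sample (default 10%) of items for manual checking.
        rng = random.Random(seed)
        k = max(1, int(len(items) * fraction))
        return rng.sample(items, k)

    def percent_agreement(labels_a, labels_b):
        # Share of sampled items on which two evaluators gave the same judgement.
        matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
        return 100.0 * matches / len(labels_a)

    # Each evaluator marks whether the NLP tool's output was correct for a sampled item.
    evaluator_1 = [True, True, False, True, True]
    evaluator_2 = [True, True, True, True, True]
    print(percent_agreement(evaluator_1, evaluator_2))  # 80.0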
Stage 4 – Analysis

Once web archives have been used to generate numerical data, this data can be used in analysis. In the research project described here, I was interested in knowing if the amount of text online was declining. I had 600 data points, which included URL, year, and percentage of text to non-text, generated using the computer vision algorithm described in the earlier step. An analysis step that I was interested in was computing the average percentages for each year available: 1999, 2002, 2005, 2008, 2011, and 2014. Although this analysis could be readily achieved in any spreadsheet program such as Microsoft Excel, I used the statistics package SPSS. However, average values alone are not enough evidence that text is rising and falling. For example, say the year 2014 was 30% text because all webpages were approximately 30% text, whereas in 1999 the average was also 30% but half of the webpages were 15% text and the other half 45% text. In cases such as this one, the standard deviation statistic is important because it clarifies the extent to which the data diverges from the mean, and indicates how the mean should be interpreted. Microsoft Excel can also be used for computing standard deviation, but again I used SPSS. Table 11.2 shows the results of both these computations, and a visualization of the values is shown in Figure 11.3.

Table 11.2  Mean percentage of text on a webpage per year, with standard deviation values

Year    Mean percentage of text on a webpage    Standard deviation
1999    22.36                                    15.45
2002    30.89                                    14.93
2005    32.43                                    14.60
2008    31.31                                    15.88
2011    28.51                                    15.47
2014    26.88                                    13.23
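Outside SPSS or Excel, the same summary can be produced with a few lines of Python: group the data points by year and report the mean and sample standard deviation of the text percentage for each group. The records below are placeholders for the full dataset (the first two percentages are the values reported for Figures 11.1 and 11.2; the rest are invented).

    import statistics
    from collections import defaultdict

    # Each record is (url, year, percentage of text).
    records = [
        ("http://www.loc.gov/", 2002, 23.33),
        ("http://www.whitehouse.gov/", 2002, 46.10),
        ("http://www.pratt.edu/", 1999, 18.20),
        ("http://www.nypl.org/", 1999, 26.75),
    ]

    by_year = defaultdict(list)
    for url, year, pct in records:
        by_year[year].append(pct)

    for year in sorted(by_year):
        values = by_year[year]
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0  # sample standard deviation
        print(f"{year}: mean {mean:.2f}, standard deviation {sd:.2f}")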


Figure 11.3  Percentage of text on webpages.
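Figure 11.3 itself is not reproduced here, but a comparable visualization can be generated from the Table 11.2 values with a plotting library; the sketch below uses matplotlib, which is not mentioned in the chapter, purely as an illustration.

    import matplotlib.pyplot as plt

    years = [1999, 2002, 2005, 2008, 2011, 2014]
    mean_pct = [22.36, 30.89, 32.43, 31.31, 28.51, 26.88]   # means from Table 11.2
    std_dev = [15.45, 14.93, 14.60, 15.88, 15.47, 13.23]    # standard deviations from Table 11.2

    plt.errorbar(years, mean_pct, yerr=std_dev, fmt="o-", capsize=4)
    plt.xlabel("Year")
    plt.ylabel("Mean percentage of text on a webpage")
    plt.title("Percentage of text on webpages")
    plt.savefig("figure_11_3.png")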

In addition to mean and standard deviation statistics, additional statistical work can be undertaken to provide greater meaning to the average values. In this case, I was interested in whether the means (or the average amount of text on a webpage) were dependent on the year they were produced, or if these means were simply random. Although standard deviation can measure variation, statistical significance tests can ensure that this variation is not merely chance but is dependent on some other variable or variables. In this case, that variable is the year the webpage was produced, which is referred to as the independent variable because it is a fact that does not depend on other variables. The percentage of text on a webpage could then be considered a dependent variable, whose value is hypothesized to be dependent on the year the webpage was produced.

When engaging in statistical tests, a finding of 'statistical significance' indicates that the dependent variable's values are not pure chance but are influenced one way or another by the independent variable. In this project, the specific test I used was the one-way analysis of variance, which is often referred to as one-way ANOVA. SPSS facilitates the computation work for this test, and provides results that can be interpreted to conclude whether there is indeed a statistically significant relationship between these variables. When using statistical tests like ANOVA, the existence of a statistically significant relationship is determined by the 'p-value' or probability value. Although explaining how p-values work is beyond the scope of this article, the ANOVA for this particular research produced a p-value of less than .0005, which led to the conclusion that there was a statistically significant relationship between the percentage of text on a webpage and the year it was produced. In the case of the research project discussed here, a 'one-way ANOVA revealed that the percentage of text on a webpage are not chance occurrences but rather this percentage is dependent on the year the website was produced' (Cocciolo, 2015).
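SPSS is only one way to run the test. The sketch below performs a one-way ANOVA with SciPy, which is not used in the original study, treating each year's list of text percentages as one group; the values shown are invented, but the F statistic and p-value are interpreted in the same way as the SPSS output.

    from scipy import stats

    # One list of text percentages per year (illustrative values, not the study data).
    groups = {
        1999: [15.2, 22.4, 30.1, 18.9],
        2002: [28.7, 35.0, 26.4, 33.2],
        2005: [31.9, 36.5, 29.8, 30.7],
    }

    f_statistic, p_value = stats.f_oneway(*groups.values())
    print(f"F = {f_statistic:.2f}, p = {p_value:.4f}")
    # A p-value below a chosen threshold (commonly .05) indicates that the
    # differences between the yearly means are unlikely to be due to chance alone.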


It should be acknowledged that many historians may not be particularly well-versed in statistics. I received doctoral training at a school of education, where statistics coursework is generally required. However, historians interested in using big data can become more experienced in statistics through their own independent study or through formal coursework. As I do not always use statistical tests in my research, I find myself having to brush up on how to use such tests and how to implement them in computer software, such as SPSS. However, I have found some resources useful, most notably the Laerd Statistics (2018) tutorials on the web. Although they provide some free information, most of it is available via a monthly subscription. It provides comprehensive discussions of all the different types of statistical tests, with plenty of examples, as well as information on how to implement the tests in SPSS and interpret the results. The subscription costs are low and well worth the small investment. Consulting resources such as this, as well as other resources such as statistics textbooks (e.g., Mendenhall et al., 2012), can be useful in understanding how statistical tests work and how the results should be interpreted, including knowing whether there is a statistically significant relationship between the variables. A further option is to explore online courses, such as those available through Khan Academy and Coursera.

Note that there are some limitations to using statistics. If the dataset is small, such as under 30 data points, statistics may not be the best tool and results can be augmented with qualitative information. The study described here could be significantly enhanced by making use of a dataset larger than 600 data points. When using statistics, in general, large datasets are better than small ones, so 6,000, or even six million, data points could enhance the overall significance of the study.

Stage 5 – Drawing Conclusions

Before drawing conclusions from research using quantitative methods, it is necessary to describe the limitations. As mentioned earlier, quantitative methods are well-suited for large datasets, such as ones with at least 30 data points. Beyond concerns of dataset size, other limitations can be articulated in the conclusion. For example, in the research study described here, one limitation was that it only included websites that were popular and prominent in the United States, and thus the changes to the composition of those webpages over time may not be a universal phenomenon but rather US-specific. Although webpages in the United States have significant use from individuals outside of the United States, being able to highlight how aspects like website selection impact the conclusions that can be drawn is necessary. Further, issues around the limitations of web archiving – or what got archived and what got missed – could lead to erroneous conclusions. For example, in this study I noted that some popular and prominent websites – such as Time Magazine – are poorly web archived in the WayBack Machine because of limitations defined by Time in its robots.txt file.

The conclusion reached in the study was that 'the percentage of text on the Web climbed during the turn of the twentieth century, peaked in 2005, and has been on the decline ever since' – with the caveat that this finding is based on popular and prominent websites in the United States (Cocciolo, 2015). Other issues arising from limitations of the web archive corpus can also be discussed when drawing conclusions from the research.

Conclusion

In conclusion, this paper offered a starting point for historians interested in using quantitative research methods with web archives.


Although the use of such methods requires some study of statistical research methods, as well as implementing scripts using tools like PHP and Python, these can all be readily learned using resources described in this chapter, including online resources. Such methods allow for making sense of large amounts of data and addressing historical research questions with precision. In some cases, the best option may be engaging with others who have the necessary expertise to study the phenomena. These can include computer scientists, statisticians, and those with access to and facility with web archives. Through such collaborations, historians can open up exciting new research avenues using web archives.

References

Alexa (2003) 'Alexa's Top 500 English-language Website index (2003 web archive)'. Available at: https://web-beta.archive.org/web/20031209132250/http://www.alexa.com/site/ds/top_sites?ts_mode=lang&lang=en [18 February 2018].
Beautiful Soup Python Library (2018). Available at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ [18 February 2018].
Bird, S., Klein, E., and Loper, E. (2018) Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. Available at: http://www.nltk.org/book/ [19 February 2018].
Björneborn, L., and Ingwersen, P. (2004) 'Toward a basic framework for webometrics', Journal of the American Society for Information Science and Technology 55(14): 1216–1227.
Brügger, N. (2008) 'The archived website and website philology: A new type of historical document', Nordicom Review 29(2): 155–175.
Brügger, N. (2011) 'Web Archiving – Between Past, Present and Future', in M. Consalvo and C. Ess (Eds.), The Handbook of Internet Studies. Malden, MA: Wiley-Blackwell. pp. 24–42.
Brügger, N. (2014) 'Probing a Nation's Web Domain: A New Approach to Web History and a New Kind of Historical Source', in G. Goggin and M. McLelland (Eds.), The Routledge Companion to Global Internet Histories. New York: Routledge. pp. 61–73.
Cocciolo, A. (2015) 'The rise and fall of text on the Web: A quantitative study of Web archives', Information Research 20(3). Available at: http://www.informationr.net/ir/20-3/paper682.html [19 February 2018].
Common Crawl (2018). Available at: http://commoncrawl.org/ [18 February 2018].
Epshtein, B., Ofek, E., and Wexler, Y. (2010) 'Detecting text in natural scenes with stroke width transform', in Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA. New York, NY: IEEE. pp. 2963–2970.
Goel, V. (2016) 'Defining Web pages, Web sites and Web captures', Internet Archive Blogs. Available at: https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/ [19 February 2018].
Grab Them All (2018). Available at: https://addons.mozilla.org/en-US/firefox/addon/grab-them-all/ [19 February 2018].
Graham, S., Milligan, I., and Weingart, S. (2016) Exploring Big Historical Data: The Historian's Macroscope. London: Imperial College Press.
Internet Archive (2018) The WayBack Machine (http://archive.org).
Jurafsky, D., and Martin, J.H. (2008) Speech and Language Processing, 2nd edition. New York: Pearson Prentice Hall.
Kelleher, J.D., Mac Namee, B., and D'Arcy, A. (2015) Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples and Case Studies. Cambridge, MA: MIT Press.
Laerd Statistics (2018). Available at: https://statistics.laerd.com/ [19 February 2018].
Liu, B. (2012) Sentiment Analysis and Opinion Mining. San Rafael, CA: Morgan & Claypool.
Masanès, J. (2006) 'Web Archiving: Issues and Methods', in J. Masanès (Ed.), Web Archiving. Berlin: Springer.
Memento (2018) 'Time Travel'. Available at: http://timetravel.mementoweb.org/ [19 February 2018].


Mendenhall, W., Beaver, R.J., and Beaver, B.M. (2012) Introduction to Probability and Statistics. Stamford, CT: Duxbury Press.
Natural Language Toolkit (2018). Available at: http://www.nltk.org/ [19 February 2018].
Ong, W.J. (2002) Orality and Literacy: The Technologizing of the Word. London: Routledge.
Open NLP (2018). Available at: http://opennlp.sourceforge.net/projects.html [19 February 2018].
Osterberg, G. (2013) 'Update on the Twitter Archive at the Library of Congress', Library of Congress Blog. Available at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/ [19 February 2018].
PHP.net (2018) 'strip_tags'. Available at: http://php.net/manual/en/function.strip-tags.php [18 February 2018].
Project Naptha (2018). Available at: https://projectnaptha.com/ [18 February 2018].
Stanford Core NLP (2018). Available at: http://stanfordnlp.github.io/CoreNLP/ [19 February 2018].
Thelwall, M. (2009) Introduction to Webometrics: Quantitative Web Research for the Social Sciences. San Rafael, CA: Morgan & Claypool.
Thelwall, M., and Vaughan, L. (2004) 'Webometrics: An introduction to the special issue', Journal of the American Society for Information Science and Technology 55(14): 1213–1215.
Van de Sompel, H., Nelson, M.L., Sanderson, R., Balakireva, L.L., Ainsworth, S., and Shankar, H. (2009) 'Memento: Time travel for the Web'. Available at: http://arxiv.org/abs/0911.1112 [18 February 2018].
