Best Practices for Google Analytics in Digital Libraries

Authored by the Digital Library Federation Assessment Interest Group Analytics working group

Molly Bragg, Duke University Libraries
Joyce Chapman, Duke University Libraries
Jody DeRidder, University of Alabama Libraries
Rita Johnston, University of North Carolina at Charlotte
Ranti Junus, Michigan State University
Martha Kyrillidou, Association of Research Libraries
Eric Stedfeld, New York University

September 2015

The purpose of this white paper is to provide digital libraries with guidelines that maximize the effectiveness and relevance of data collected through the Google Analytics service for assessment purposes. The document recommends tracking 14 specific metrics within Google Analytics, and provides library-centric examples of how to employ the resulting data in making decisions and setting institutional goals and priorities. The guidelines open with a literature review, and also include theoretical and structural methods for approaching analytics data gathering, examples of platform specific implementation considerations, Google Analytics set-up tips and terminology, as well as recommended resources for learning more about web analytics. The DLF Assessment Interest Group Analytics working group, which produced this white paper, looks forward to receiving feedback and additional examples of using the recommended metrics for digital library assessment activities.


Table of contents

Section I: Introduction and Literature Review
  A. Introduction
  B. Literature Review
Section II: Google Analytics Prerequisites
  A. Learn About Google Analytics Practices and Policies
  B. Understand Local Digital Library Infrastructure
  C. Goal Setting
Section III: Recommended Metrics to Gather
  A. Content Use and Access Counts
    1. Content Use and Access Counts Defined
    2. Site Content Reports
    3. Bounce Rate
    4. Download Counts
    5. Time
    6. Pageviews
    7. Sessions
  B. Audience Metrics
    1. Location
    2. Mode of Access
    3. Network Domain
    4. Users
  C. Navigational Metrics
    1. Path Through the Site
    2. Referral Traffic
    3. Search Terms
Section IV: Additional Metrics and Custom Approaches
  A. Dashboards and Custom Reports
  B. Event Tracking
  C. Goals and Conversions
Section V: Examples of Platform-specific Considerations
  A. CONTENTdm
  B. DLXS
Section VI: Tips for Google Analytics Account Setup
  A. Terminology
    1. Properties
    2. Views
Section VII: Further Resources on Google Analytics
Section VIII: Conclusions and Next Steps
Appendix
  Other Methods for Collecting Analytics Data
Bibliography


Section I: Introduction and Literature Review

A. Introduction
Libraries invest significant resources in providing online access to scholarly resources, whether through digitization, detailed cataloging and metadata production, or other methods. Once digital materials are published online, resource managers can take a number of approaches to assessing a digital library program. Qualitative methods include web usability studies, focus groups, surveys, and anecdotal information gathering. Quantitative methods to assess digital libraries often center on tracking information about website visitors and their actions through some means of web traffic monitoring, also known as web analytics. This paper will focus on best practices for collecting and using web analytics data in digital libraries, specifically data gathered through Google Analytics.1

While this paper focuses on web analytics, collecting any type of assessment data and using it to inform decision-making can:
● increase understanding of the return on the investment of digital libraries;
● provide more information about users, use, cost, impact, and value;
● help guide improvement of digital library services and the user experience;
● assist in decision-making and strategic focus.
No single type of data or technique can assess every aspect of a digital library; analytics are a single piece of a larger assessment puzzle.

This document is intended for digital library managers and curators who want to use analytics to understand more about users of, access to, and use of digital library materials. The Digital Library Federation Assessment Interest Group (DLF AIG) is using Matusiak’s definition of a digital library as “the collections of digitized or digitally born items that are stored, managed, serviced, and preserved by libraries or cultural heritage institutions, excluding the digital content purchased from publishers.”2 The authors hope to pave the way for cross-institutional resource managers to share benchmarkable and comparable analytics. Their intention is for the information in this paper to evolve over time as more institutions utilize and enhance these guidelines.

We chose to limit our scope to Google Analytics because many libraries use this tool, and because our task needed to be scoped in order to be attainable.3 It is important to be aware, however, that changes in technology may cause fluctuation in the usefulness of any tool -- including Google Analytics -- in the future. If there is enough community interest and volunteers,

1 "Google Analytics," accessed August 4, 2015, http://www.google.com/analytics/. 2 Matusiak, K. (2012). Perceptions of usability and usefulness of digital libraries. International Journal of Humanities and Arts Computing, 6(1-2), 133-147. DOI: http://dx.doi.org/10.3366/ijhac.2012.0044. ​ ​ 3 Over 60% of all websites use Google Analytics: see “Piwik, Privacy,” accessed September 16, 2015, http://piwik.org/privacy/.


other web analytics services could be considered for inclusion after the Digital Library Federation 2015 Forum. An overview of other methods for collecting web analytics data can be found in the appendix of this document.

This document was authored by the analytics working group of the DLF AIG. The DLF AIG was formed in spring of 2014.4 The group arose from two working sessions that took place at the 2013 DLF Forum, "Determining Assessment Strategies for Digital Libraries and Institutional Repositories Using Usage Statistics and Altmetrics"5 and "Hunting for Best Practices in Digital Library Assessment."6 The first of these sessions was concerned with determining how to measure the impact of digital collections; developing areas of commonality and benchmarks in how the community measures collections across various platforms; understanding cost and benefit of digital collections; and exploring how such information can be best collected, analyzed, communicated, and shared effectively with various stakeholders. The second working session set out to test the waters for the potential of a collaborative effort to build community guidelines for best practices in digital library assessment. The two working sessions were well attended, and group leaders formed the DLF AIG to foster ongoing conversation. In the fall of 2014, volunteers from the DLF AIG formed four working groups around citations, analytics, cost assessment, and user studies. The primary purpose of each working group is to develop best practices and guidelines that can be used by all to assess digital libraries in their particular area; the white papers and other products of these working groups can be found on the DLF wiki.7

The Analytics working group is composed of library staff from around the United States working in the fields of digital programs, assessment, and electronic resources. The authors of this document are:
● Molly Bragg (Co-coordinator of the working group, Digital Collections Program Manager, Duke University Libraries)
● Joyce Chapman (Co-coordinator of the working group, Assessment Coordinator, Duke University Libraries)
● Jody DeRidder (Head of Metadata and Digital Services, University of Alabama Libraries)
● Rita Johnston (Digitization Project Librarian, University of North Carolina at Charlotte)
● Ranti Junus (Electronic Resources Librarian, Michigan State University)
● Martha Kyrillidou (Senior Director, Statistics and Service Quality Programs, ARL)
● Eric Stedfeld (Project Manager/Systems Analyst, New York University)

4 See Joyce Chapman's blog post "Introducing the New DLF Assessment Interest Group," posted May 12, 2014, http://www.diglib.org/archives/5901/.
5 "Determining Assessment Strategies for Digital Libraries and Institutional Repositories Using Usage Statistics and Altmetrics," accessed August 4, 2015, http://www.diglib.org/forums/2013forum/schedule/21-2/.
6 "Hunting for Best Practices in Digital Library Assessment," accessed August 4, 2015, http://www.diglib.org/forums/2013forum/schedule/30-2/.
7 The DLF Assessment Interest Group wiki can be found at http://wiki.diglib.org/Assessment. As of the DLF 2015 annual meeting, the citations, users and user studies, and analytics working groups have each produced a white paper, and the cost assessment working group has defined digitization processes for data collection and created a digitization cost calculator using data contributed by the community. See each group's individual wiki page for links and details.


The group began its work by performing a literature review and defining types of audiences, content, and metrics pertinent to digital library assessment. The group continued to refine a list of core metrics to recommend for baseline collection in a digital library program. In this paper, each metric includes a definition and explanation of importance, as well as library-centric examples for how to work with the metric in Google Analytics. This document was distributed to the larger DLF AIG for feedback and comments in two drafts in July and August 2015. The white paper was released in its final form in September 2015.

If you are interested in contributing to the DLF AIG, attend the group's session, "Collaborative efforts to develop best practices in assessment: a progress report" at the DLF Forum in October 2015, or join the Google Group.8 As of the release of this paper, there is also a Google Group associated with the working group of the DLF AIG focused on analytics.9 Currently the Analytics Google Group is used to coordinate the activities of the working group, and not as a discussion forum on web analytics for digital libraries. The future of the Analytics working group and its Google Group will be decided at the 2015 DLF Forum, and this information will be posted to the DLF AIG Google Group. In the meantime, the DLF AIG Google Group is the appropriate place to begin discussions about analytics in digital libraries.

B. Literature Review
This literature review is concerned specifically with efforts to develop guidelines and best practices for web analytics for digital libraries. While extensive literature exists documenting the value of capturing and analyzing web analytics data,10 the existing literature on best practices for collecting web analytics data for digital libraries is limited. One possible reason for this gap is that web metrics were initially developed for use by e-commerce sites. "Success," however, is not measured in monetary terms at cultural heritage institutions. Because cultural heritage institutions measure success based on research value, usage, and other measures beyond sales, mainstream web metrics cannot necessarily be borrowed from the e-commerce world, and a web metrics strategy must still be developed independently for digital libraries (Khoo, 2008).11

According to Szajewski (2014), much of what has been published on this topic for digital libraries focuses on how to use analytics data to improve site usability and the user experience, while “there is a significant lack of scholarship dedicated to the role of web analytics in developing a program specifically intended to increase the visibility, discovery, and use of

8 "Digital Library Assessment - Google Group," accessed August 5, 2015, https://groups.google.com/forum/#!forum/digital-library-assessment. 9 "Digital Library Analytics - Google Groups," accessed August 5, 2015, https://groups.google.com/forum/#!forum/digital-library-analytics. 10 See Szajewski (2013); Kelly (2014). 11 Tools such as Google Analytics refer to “monetizing goals” in order to assign a relative value and measure website effectiveness, but this need not represent commercial value.


digitized archival assets amongst a broad audience of web users" (Szajewski, "Using Google Analytics"). Case studies describe how individual institutions have chosen to work with analytics and what they found,12 tutorials explain how to set up Google Analytics for libraries,13 and numerous articles address transaction log analysis;14 however, there are very few suggestions for overarching guidelines or best practices. In 2004, there was an early effort to standardize the collection of web metrics across the members of the National Science Digital Library (NSDL), although the participants were not librarians. The NSDL brought together a group "to discuss how web metrics could be implemented in a pilot study to identify current NSDL use and develop strategies to support the collection of usage data across NSDL in the future" (Jones et al., "Developing"). However, this effort was still limited to a single consortial group, and did not include efforts to benchmark or standardize metrics for digital libraries at large.

Sparse information exists in the literature on which metrics are most commonly captured and reported by libraries employing web analytics. Kelly reports that the most commonly captured metrics include the number of visitors, the date and time of a visit, the geographical location of an IP address, referral information, which pages were viewed, entrance and exit pages, and information about operating systems and browsers in use (Kelly, 2014). A study on the use of web metrics by libraries and other cultural heritage institutions in the Netherlands found that the number of visits and the number of visitors were the most commonly reported data (Voorbij, 2010).

Some of the articles reviewed for our research included a call for increased understanding of web metrics and sharing of analytics data, as well as a call for action around a lack of interoperable metrics. Prom points out that special collections as a field lacks a systematic understanding of how people interact with digital objects, which should be remedied (2011). Custer's 2013 article calls for the open distribution and sharing of Google Analytics data, and for the hodgepodge of case studies around digital collection and finding aid usage to be turned into a collaborative and professional effort. Custer closes with the statement that "whatever our professional objectives...aggregating, analyzing, and sharing data sets in a unified effort might be the best means that we have to attain them" (Custer, "Mass Representation" p. 495). Despite this well-made argument, our literature review did not uncover developments in this area in the two years since. The topic of standardizing metrics is not limited to web analytics; in 2012 Chapman and Yakel advocated that the field of special collections "create community standards to enable meaningful comparison of metrics across institutions" and "take control of how we are measured by identifying the metrics that are relevant to us and defining these metrics in terms that best represent our institutions, collections, and user communities" (Chapman and Yakel, "Data-Driven Management" p. 151).

12 Such as Khoo (2008); Kroth (2010).
13 Such as Yang & Perrin (2014).
14 Such as Jansen (2006), Jones et al. (2000), Jansen et al. (2000), and Peters (1993).


This exploration of library-related literature on web analytics reveals a gap in the area of best practices. While the case studies and literature reviews cited are relevant to anyone interested in using Google Analytics as part of their assessment program, this paper attempts to address and fill the literature gap. The remainder of this document will explain how web analytics work, examine how digital libraries might approach Google Analytics specifically, recommend and define a baseline set of metrics that institutions can collect, and present further resources for those interested in expanding their knowledge of web analytics.


Section II: Google Analytics Prerequisites

Google Analytics is an extremely powerful tool for understanding digital library users. But before diving into web analytics, digital library managers should attempt to satisfy several prerequisites, including familiarizing themselves with their website infrastructure, institutional data policies, and assessment goals, as well as the Google Analytics application itself.

A. Learn About Google Analytics Practices and Policies
The first prerequisite is understanding Google Analytics policies and practices, as well as attaining a general knowledge of the nuts and bolts of the application. This document provides links to instructions for basic application features, and it is easy to find further information through Google Analytics help.15 In addition to learning how the Google Analytics application works, the authors of this document recommend a clear understanding of the policies used by Google Analytics before adoption. Although Google Analytics is "free," it is also a business, and libraries are not its target market. There may be areas within Google Analytics policies,16 such as Google's privacy policy,17 that conflict with institutional philosophy, mandates, or priorities.

Alternatives to Google Analytics exist, and several are described in the appendix. When choosing a method of collecting web analytics data it is important for digital library managers to understand their institution's preferences and needs, as well as available internal resources. In this way, they can weigh the pros and cons of each method before choosing an approach to analytics. If it is feasible, institutions may prefer to set up an internal log analysis program strictly according to local rules and priorities instead of using an external service.

B. Understand Local Digital Library Infrastructure
Digital library managers need to understand their own digital library environment in order to use any analytics to its full potential. Below are a few questions one should be able to answer before examining analytics data:

● What platform, Content Management System (CMS), or Digital Asset Management System (DAMS) provides access to metadata and digital library resources?
○ Does this platform offer specific training on best practices for using Google Analytics?
● How does the local platform structure URLs?

15 "Analytics Help," Google, accessed August 6, 2015, https://support.google.com/analytics/?hl=en#topic=3544906. 16 "Google Analytics policies - Analytics Help," accessed August 6, 2015, https://support.google.com/analytics/answer/4597324?hl=en&ref_topic=1008008. 17"Data privacy & security - Analytics Help," accessed August 6, 2015, https://support.google.com/analytics/topic/2919631?hl=en.


○ Does it link to metadata separately from digital objects?
○ How does it handle compound objects as opposed to single image objects?
○ Are objects, object components, collections, and/or other defining structural elements easily identifiable based on URLs?
○ Does the platform use session IDs or embed search terms in the URLs as users browse the library (see DLXS examples in section V below)?
● Does the site use AJAX or any other technology that will require event tracking (see section V below)?
● Who within the institution will be responsible for set-up and maintenance of the Google Analytics account and HTML code?
○ Can that person be a technical resource for further understanding the application? Will that person be able to implement additional analytics features in the future to help maximize the relevance of the collected data?
○ Do you understand how your Google Analytics account is structured (see section VI below)?

Understanding URL structure is particularly important when working with Google Analytics. For example, some URLs may include strings of letters or numbers that indicate previous clicks and reference points, session identifiers, or search terms. Since Google Analytics collects statistics by URL, the only way to clarify how many times a particular item has been accessed is to know all the URL variations possible to reach that item, and to combine the access counts for all of those variations. This is not an easy task, and may require input from developers and/or website administrators. If analytics is a priority at an institution and Google Analytics is the tool of choice, URL structure should be considered in advance, and requirements for it should be set in future digital library development.
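For institutions that export pageview data, a small script can combine counts across URL variants. The following is a minimal sketch only: it assumes a CSV export with pagePath and pageviews columns (actual export headers vary), and the normalization rule shown (discarding query strings such as session IDs or search terms) is a hypothetical example that must be adjusted to the local platform's URL conventions.

```python
import csv
from collections import defaultdict
from urllib.parse import urlsplit

def combine_pageviews(report_csv):
    """Sum pageviews across URL variants that resolve to the same item.

    Assumes a CSV with 'pagePath' and 'pageviews' columns. Dropping
    the query string is a hypothetical normalization rule; replace it
    with whatever reflects the local platform's URL structure.
    """
    counts = defaultdict(int)
    with open(report_csv, newline='') as f:
        for row in csv.DictReader(f):
            path = urlsplit(row['pagePath']).path  # strip ?session=...&q=... variants
            counts[path] += int(row['pageviews'])
    return counts
```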

Alternatively, some platforms will link to different item components using distinct URLs. A transcription, the primary metadata, the metadata for a subsection, the third page of a multipage item, the table of contents, or a translation could each have their own unique URL. An institution should consider if accessing item components (metadata, transcriptions, etc.) should be counted as item access or if these should be tracked separately. Depending on the answer, the institution may need to customize default Google Analytics reports to exclude or include only the access instances desired.

Similarly, web pages may include tabs providing access to different item components within a single URL. Tracking how users interact with these components will require event tracking.18 Implementing event tracking in Google Analytics requires the technical ability to modify HTML code, planning, and a clear understanding of the desired behavior to track. Knowing that such elements exist and deciding if they should be tracked is the first step.

18 "About Events - Analytics Help," accessed August 6, 2015, https://support.google.com/analytics/answer/1033068?hl=en.


C. Goal Setting
Digital library programs should consider the questions Google Analytics will answer and how to phrase those questions operationally. This kind of expectation and goal setting should begin before serious data analysis. For example, if the goal is to determine how heavily resources are used, think about what "use" means. Does it mean a pageview? A session? A download? Google Analytics tracks these metrics separately.

Reading the case studies mentioned in the introduction will provide ideas of how to approach these kinds of questions. Thinking about these questions is an iterative process, and an institution may revise its questions and operational definitions as it uses the service more frequently and understands its place in its larger assessment program. Setting assessment goals and operational definitions for those goals can lead to more focused, data-driven decision making.


Section III: Recommended Metrics to Gather

The Analytics working group selected 14 metrics as a baseline recommendation for digital libraries to gather in order to support data-informed decision-making and interoperable metrics for digital library programs. This list was developed by members of the DLF AIG Analytics working group based on feedback from the larger AIG, colleagues, and our own experience as digital library professionals. The list was sent to the larger DLF AIG for feedback in early July 2015; feedback led the DLF AIG Analytics group to refine definitions of metrics, but not add or subtract metrics.

The 14 metrics have been grouped into three categories: Content Use and Access Counts, Audience Metrics, and Navigational Metrics. Below, each of the metrics is defined, its importance is explained, and library-centric examples are provided. A discussion of content use reports and broader information about use and access is also included, in order to frame the individual metrics. The methods and examples provided in this paper do not represent an exhaustive exploration of how Google Analytics can work for digital libraries, and the working group hopes to hear from the community about other approaches to gathering the same or similar analytics data.

Although we have included some information for how to locate each metric within the Google Analytics interface, we recommend referring to Google Analytics help19 for the most up-to-date instructions. Google Analytics makes changes to tools and terminology often, and this white paper will not be updated each time changes are made. If terms or instructions in this paper seem incongruous with the application, Google Analytics may have changed terminology or interface design after October 2015; Google Analytics help will be up-to-date, even if this document is not.

As stated in the introduction, these are baseline metrics. While the analytics working group hopes that these metrics will benefit anyone working with analytics, these metrics will likely be especially helpful for those just getting started.

A. Content Use and Access Counts
1. Content Use and Access Counts Defined
2. Site Content Reports
3. Bounce Rate
4. Download Counts
5. Time
6. Pageviews
7. Sessions

19 “Analytics Help,” accessed September 27, 2015, https://support.google.com/analytics/?hl=en#topic=3544906.


B. Audience Metrics
1. Location
2. Mode of Access
3. Network Domain
4. Users
C. Navigational Metrics
1. Path Through the Site
2. Referral Traffic
3. Search Terms

A. Content Use and Access Counts

1. Content Use and Access Counts Defined
Content use and access counts are related measures that provide some indication of the success of a website. Content use metrics reveal how often users return, how much time users are spending on specific pages, and how they traverse the site. Understanding the frequency and type of use on a website is fundamental to knowing what resources the audience values. Access counts are the number of times URLs on a website have been requested by a web browser or search engine. Access counts are often used to measure trends in reach and engagement; for example, if access counts have grown since last year, the website may be reaching a wider audience or have deeper engagement with the existing audience.

Access has different meanings to different members of the digital library community:
● to some, access to a digital item means viewing the metadata record, with or without a thumbnail image;
● to others, access to a digital item means viewing either all of the content of the item or part of the content of the item;
● and to still others, access is defined by download of either all of the content of the item or part of the content of the item.

The potential to have different definitions of access makes comparisons across institutions difficult. At minimum, libraries that collect counts should document how access was defined, and this definition should be consistent from year to year. While it is challenging for access metrics in Google Analytics to account for every local definition of use and access, we do recommend that digital libraries track these numbers. Best practices exist for collecting physical collection use statistics in the form of standards such as ANSI/NISO Z39.7,20 which supports the importance of collecting analogous use metrics for digital collections. We hope to see more clearly defined best practices for access counts in web analytics evolve, and this paper is an attempt to begin that process.

20 "ANSI/NISO Z39.7-2013, Information Services and Use: Metrics & Statistics for Libraries and Information Providers - Data Dictionary," accessed August 6, 2015, http://z39-7.niso.org/standard/section7.html.


The specific use and access metrics we recommend in this paper are: bounce rate, download count, time, pageviews, and sessions, each of which is described below starting with subsection 3.

2. Site Content Reports
Google Analytics provides the four content analysis reports described below.21 There are limitations to the amount of data that can be retrieved using these reports in the Google Analytics user interface (currently, no more than 5,000 rows can be exported); however, there is a Core Reporting API22 that allows for harvesting much larger quantities of data. A tool has been developed to help explore API queries,23 and other helpful resources exist, such as a python application demonstrating how to use the python client library to access the Core Reporting API v3.24 A sketch of such a query follows the report list below.

● All Pages: displays all accessed pages by URL in order from the highest number of accesses to the lowest (the report can be resorted).
● Content Drilldown: click through specific directories within a website to see which are being accessed more than others.
● Landing Pages: shows the frequency with which users enter the site on particular pages.
● Exit Pages: displays the frequency with which users exit the site on particular pages.
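To make the Core Reporting API mentioned above concrete, here is a minimal sketch using the python client library the paper references. The view (profile) ID, date range, and collection path filter are placeholders, and an authorized http object (http_auth) is assumed to have been created through one of Google's documented OAuth 2.0 flows.

```python
from apiclient.discovery import build

# http_auth is assumed: an httplib2.Http object authorized via OAuth 2.0.
service = build('analytics', 'v3', http=http_auth)

results = service.data().ga().get(
    ids='ga:XXXXXXXX',              # placeholder view (profile) ID
    start_date='2015-01-01',
    end_date='2015-09-30',
    metrics='ga:pageviews,ga:uniquePageviews',
    dimensions='ga:pagePath',
    filters='ga:pagePath=~^/collections/example/',  # hypothetical collection prefix
    max_results=10000).execute()

# Each row contains one value per dimension and metric requested.
for page_path, pageviews, unique_pageviews in results.get('rows', []):
    print(page_path, pageviews, unique_pageviews)
```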

By default, site content reports include pageviews, average time on site, and bounce rate, all of which will be reviewed later in this document. Site content reports can be edited to combine different metrics by adding secondary dimensions25 to the report.

Site content reports provide access information by URL. This can make tracking metrics by collection (or group of related items) challenging. To successfully track collection URLs, each related item should include a common identifier or segment within the URL that can be used as a filter or search term. If they do not, it will be extremely difficult to understand pageviews by collection, because of the potentially high number of URLs that must be sifted from the report in order to collect all the access counts for items in a given collection. If, however, URLs from the

21 As of September 1, 2015, in Google Analytics, information about pages that have been accessed could be found by clicking on behavior → site content. 22 “Analytics Core Reporting API,” accessed September 21, 2015, https://developers.google.com/analytics/devguides/reporting/core/v3/. 23 “Query Explorer,” accessed September 21, 2015, https://ga-dev-tools.appspot.com/query-explorer/. 24 “core_reporting_v3_reference.py,” accessed September 21, 2015, https://code.google.com/p/google-api-python-client/source/browse/samples/analytics/core_reporting_v3_refe rence.py. 25 "Add a secondary dimension to a report - Analytics Help," accessed August 6, 2015, https://support.google.com/analytics/answer/6175970?hl=en.


same collection share a common identifier, use the search box above the main report to identify pageviews by collection.26
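For example (a hypothetical URL scheme), if every item in a collection lives under a shared path segment, entering a regular expression such as the following in the search box restricts the All Pages report to that collection:

```
^/collections/civil-war/
```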

As suggested above, there is no direct equivalent of "access counts" in Google Analytics. If the content accessed consists of web pages, Google Analytics can provide a count of accesses to a specified URL over a designated period of time; this is called pageviews27 in Google Analytics. Pageviews from the same user within a certain timeframe are grouped into sessions.28 Both are described in subsections 6 and 7 respectively below; additionally, both relate to downloads and users, which are described in subsection 4 and under Audience Metrics, respectively.

Session numbers are available elsewhere in Google Analytics, but they are so closely related to pageviews that they have been included here under content use. Similarly, downloads are tracked separately from content use reports but are listed here because of their close relation to access counts and use. More information about both metrics is available below.

There are almost endless ways to parse content use and access count information and combine metrics to draw conclusions about digital library patrons; a few examples include:

● Monitor content use regularly, either by revisiting the site content reports or by setting up a custom report or dashboard (see section IV), to see how top URLs and subdirectories change over time. Consider the data alongside departmental data about outreach, publicity, and frequency of new content added to the site in order to understand the potential impact of such activities on use.
● Find out what content institutional stakeholders value, and track that content specifically. For example, does the site include content that is labor-intensive to produce? Track that content's use versus other content's use to better understand one facet of the return on investment for labor-intensive workflows.29 Or perhaps some content was heavily championed by a top administrator; it would be valuable for that administrator to know how that content is faring compared to the rest of the resources.
● Compare entrance and exit pages with download counts to see how many patrons are finding exactly what they need when they enter your page through a browser search.

26 The search box was available in this location as of the writing of this report in September 2015.
27 "The difference between AdWords Clicks, and Sessions, Users, Entrances, Pageviews, and Unique Pageviews in Analytics - Analytics Help," accessed September 21, 2015, https://support.google.com/analytics/answer/1257084#pageviews_vs_unique_views.
28 "How a session is defined in Analytics - Analytics Help," accessed September 21, 2015, https://support.google.com/analytics/answer/2731565?hl=en.
29 For an example of such an experiment, see Chapman, "Evaluating the effectiveness of manual metadata enhancements," accessed September 22, 2015, https://staff.lib.ncsu.edu/confluence/display/MNC/Evaluating+the+effectiveness+of+manual+metadata+enhancements+for+digital+images.


Please note that collecting useful access counts is impacted by the inclusion of web crawler accesses in the statistics gathered by Google Analytics. In other words, when small programs or scripts access a digital library for indexing purposes, each such pageview is recorded as if it came from a human visitor. This greatly inflates the counts and may give the impression of far more user access than is actually the case. These guidelines recommend exclusion of crawler accesses in order to obtain a more realistic count. Ideally, best practices would also specify exclusion of access by IPs belonging to those employees working on or in the system itself. This is difficult to do in Google Analytics. The employees in question must have static IP addresses, and those must be excluded from all measures by use of filters. This is challenging if IP addresses are not all in a specified range, because Google Analytics filters (as of the writing of this report) only allow 255 characters, and the regular expression necessary to catch a large number of variations in IP ranges may be too long. Another useful approach would be a configuration that ignores traffic including a certain URL parameter, such as "&analytics=off".30
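One sketch of the URL-parameter approach, assuming the Universal Analytics (analytics.js) snippet and using Google's documented per-property opt-out flag; the parameter name and the "UA-XXXXXX-Y" property ID are placeholders:

```javascript
// Place before the analytics.js tracking snippet. When the page URL
// contains analytics=off, set Google's documented opt-out flag so no
// hits are sent for this property during this page load.
if (/[?&]analytics=off/.test(window.location.search)) {
  window['ga-disable-UA-XXXXXX-Y'] = true;
}
```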

3. Bounce Rate
Bounce rate31 is the percentage of times a user exits a site on the same page they entered without having initiated interaction with the page. An interaction is any action that is sent as a second request to the server, such as clicking to download a document or navigating within the web page. In Google Analytics, a bounce is recorded every time there is a single-hit session.

Why Bounce Rate is Important
Bounce rate can be a useful metric to measure user behavior and engagement on a website; taken without context, however, it is a meaningless number.32 A bounce can be good or bad, depending on the purpose of the page or the site in question. For example, a high bounce rate may not be viewed as negative for a collection of archival finding aids, because users frequently only need to view a single finding aid to complete their information-seeking task successfully. On the other hand, if the bounce rate is high in a digital image collection which is intended to encourage sustained browsing, it may imply that the site's design has not been successful in making browsing intuitive.

Bounce Rate and Google Analytics
The working group has no additional recommendations for bounce rate and Google Analytics.

30 "Google Analytics - Excluding Your Own Visits in Development and Production," accessed September 21, 2015, http://tjvantoll.com/2012/08/28/google-analytics-excluding-visits-in-development-and-production/. 31 As of September 1, 2015, general bounce rate for a site is included in Google Analytics on the audience ​ → overview and behavior → overview pages. To obtain bounce rate for a specific page, navigate to behavior → site content → all pages and search for the page in question. To obtain bounce rate for a specific subdirectory of the site, navigate to behavior → site content → content drilldown. For more information about site content reports, please see the content use reports sub-section above. 32 "The Magical, Mysterious Bounce Rate and What It Means," accessed August 6, 2015, http://www.cjadvertising.com/blog/interactive/the-magical-mysterious-bounce-rate-and-what-it-means/.


4. Download Counts
Download counts track the number of times a set of predefined documents were clicked for download. These guidelines propose that a download count include different types of events that incorporate "units or descriptive records examined" as defined by NISO Z39.733 and "item requests" as defined by COUNTER.34 This is similar to the way "access counts" are defined above. Currently COUNTER offers descriptions of usage reports for journals, databases, platforms, books, multimedia, and titles.

Four sources were consulted as the working group sought to define a download count, and these sources may be useful to institutions in articulating how they think about downloads: (a) the NISO Z39.7 data dictionary, (b) the COUNTER Code of Practice Release 4, (c) documentation of how Google Analytics might be configured to track downloads35 (Google Analytics sees downloads as a specific instance of an event), and (d) the ARL Statistics annual survey 2013-2014. The first three sources also convey different levels of permanency: the NISO definitions are adjusted less frequently, Google Analytics often makes changes to its product in a more dynamic way, and the COUNTER Code of Practice is updated more frequently than the NISO Z39.7 data dictionary standard. The COUNTER Code of Practice also works in tandem with the NISO SUSHI standard,36 and the ARL Statistics 2013-201437 follow the COUNTER definitions. In the ARL Statistics, for example, there is a data collection item labeled "number of successful full-text article requests," which has been in use for a few years and needs to be revisited, as ebooks and non-journal article requests are not explicitly included in the current definition.

The NISO Z39.7 definition dictates the following:38

● Number of units or descriptive records examined (including downloads): Number of full-content units examined, downloaded, or otherwise supplied to user, to the extent that these are recordable and controlled by the server rather than the browser (ICOLC Guidelines, September 2006). Viewing documents is defined as having the full text of a digital document or electronic resource downloaded, or any catalogue record or database entry fully displayed during a search (ISO 2789, 3.3.3). Some electronic services (e.g., OPAC, reference database) do not typically require downloading, as simply viewing documents (abstracts, titles) is normally sufficient for users' needs.

33 "ANSI/NISO Z39.7-2013, Information Services and Use: Metrics & Statistics for Libraries and Information Providers - Data Dictionary," accessed August 6, 2015, http://z39-7.niso.org/standard.html. 34 "COUNTER | Code of Practice," accessed August 6, 2015, http://www.projectcounter.org/code_practice.html. 35 "How to Track Downloads in Google Analytics," accessed August 6, 2015, http://www.blastam.com/blog/index.php/2011/04/how-to-track-downloads-in-google-analytics. 36 "SUSHI/COUNTER Schemas," accessed August 6, 2015, http://www.niso.org/schemas/sushi/#counter. 37 "ARL Statistics," accessed August 6, 2015, http://www.arlstatistics.org/About/Mailings/stats_2013-14. 38 "ANSI/NISO Z39.7-2013, Information Services and Use: Metrics & Statistics for Libraries and Information Providers - Data Dictionary," accessed August 6, 2015, http://z39-7.niso.org/standard/section7.html#7.7/.


Project COUNTER defines items and item requests as follows:

● Item (full-text article, TOC, abstract, database record): A uniquely identifiable piece of published work that may be: a full-text article (original or a review of other published work); an abstract or digest of a full-text article; a sectional HTML page; supplementary material associated with a full-text article (e.g., a supplementary data set); or non-textual resources, such as an image, a video, or audio.
● Item requests: Number of items requested by users as a result of a user request, action, or search. User requests include viewing, downloading, emailing, and printing of items, where this activity can be recorded and controlled by the server rather than the browser. Turnaways, also known as rejected sessions,39 will also be counted. (See 3.1.5.4).

Why Download Counts are Important
Download counts are often considered important because they are a high-value indicator of use. Depending on the format of the resource, a download often indicates that the user has taken an action that would allow her to consult the resource later on her own, provided it is an independent object that can be saved locally.

Download Counts and Google Analytics
To track access counts defined as downloads in Google Analytics, a supplemental feature called "event tracking" must be set up that records user interactions with website elements.40 Event tracking is set on each link for which data collection is desired, and a category, action, and label parameter (for example: collection name, "download," unique object ID or name) are encoded to organize and classify event data in the reporting interface. Event tracking will need to be added to every download you want to track, so this can be a time-consuming solution. It is possible that Google Tag Manager41 may be able to make this process more efficient; however, at this time (September 2015), the authors of this document are unfamiliar with the specifics of such an implementation. If there is a preference not to use tags and if there is technical support, there is an alternative solution42 that can be implemented with jQuery JavaScript as well.
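As a sketch of such an event, assuming the analytics.js snippet is already installed (the collection name, file path, and object ID shown are placeholders), a download link might be tagged as follows:

```html
<!-- category: collection name; action: "download"; label: unique object ID -->
<a href="/files/example-item.pdf"
   onclick="ga('send', 'event', 'example-collection', 'download', 'example-item-001');">
  Download PDF
</a>
```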

39 According to COUNTER guidelines, a turnaway is defined as "an unsuccessful log-in to an electronic service due to exceeding the simultaneous user limit allowed by the licence," http://www.projectcounter.org/cop_books_ref.html.
40 "Event Tracking - Web Tracking (analytics.js)," accessed August 7, 2015, https://developers.google.com/analytics/devguides/collection/analyticsjs/events.
41 "Google Tag Manager official website," accessed August 6, 2015, http://www.google.com/tagmanager/.
42 "How to Track Downloads & Outbound Links in Google Analytics," accessed August 6, 2015, http://www.blastam.com/blog/index.php/2013/03/how-to-track-downloads-in-google-analytics-v2.


5. Time
Time43 is the number of minutes a user spends on a site, which generally indicates interest in content and engagement. It is measured as average session duration and average time on page.

Note that similar to bounce rates, if users are spending a long time on the site, that does not necessarily mean they are "engaged." Likewise, a short time on the site could still mean someone is engaged. For example, an engaged user could be looking at images, see what she wants, and download it for offline, in-depth analysis after only a few seconds on the page. On the other hand, a user may leave a browser open on your site for an extended period of time while engaged in other activities. These examples illustrate how, just like any metric, the amount of time on a site needs to be viewed in context.

Why Time is Important:
Time can provide an indication of whether the content on a site is useful and engaging to users. A lack of time spent on the site may indicate problems with its usability.

Time and Google Analytics
The working group has no additional recommendations for time and Google Analytics.

6. Pageviews
Pageviews44 count the number of times a user directs their browser to a specific URL.

Why Pageviews are Important
Pageviews demonstrate which assets in a digital library are viewed more or less often. They are therefore a measure of overall popularity. Like so many of the metrics described here, pageview counts should be evaluated in context. It is very possible that the value of having access to an item online is independent of the number of pageviews recorded for that item.

43 As of September 1, 2015, Google Analytics provides time metrics in multiple ways and places:
● Under audience → overview is a measure for average session duration.
● Under behavior → overview is a measure for average time on page (this is the aggregate average across the entire website).
● Average time on page also appears on the reports under the various site content options (under behavior). In these reports, average time on page is listed for each page, not just as an average for the whole site.
For more information about content use reports see the preface and content use reports section above.
44 As of September 1, 2015, pageviews were available in multiple places within Google Analytics:
● Under audience → overview is a measure for overall pageviews for the website.
● Under behavior → overview is the overall site pageviews.
● Pageviews also appear on the site content reports under behavior → site content. These reports show the number of pageviews per URL.
For more information about content use reports see the preface and content use reports section above.


Pageviews and Google Analytics
A new pageview is recorded every time a user loads a web page. If the user clicks reload, this is counted as an additional pageview. If the user navigates to a different page and then returns to the original page, that return is recorded as another pageview. Unique pageviews aggregate the pageviews generated by the same user during the same session, so they indicate the number of sessions during which that page was viewed one or more times.
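For example, by these rules a visitor who loads a page, reloads it, navigates away, and returns to it within the same session generates three pageviews but only one unique pageview.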

Please note that in systems where multiple URLs define an item (for example, a compound digital object), the pageview approach is problematic for collecting access statistics for the compound object rather than for individual components of the object. There is no clear and simple method for either identifying and defining "access" or collecting access counts in Google Analytics at the item (or even collection) level. Please refer to section II, "Google Analytics Prerequisites," for more help in this area.

7. Sessions
Sessions45 are pageviews sorted by user. If the same user accesses multiple pages within a set time from a single IP address, that is recorded as a session.

Why Sessions are Important
By the definition stated above, sessions are an indicator of sustained engagement and the browsability of a website.

Sessions and Google Analytics
The rules on how Google sets a new user session46 have many variables, particularly with regard to referrals from ad sources, but in general, the following applies:

A session begins when the user lands on a site, no matter the source, and lasts until:

● there are 30 minutes or more of inactivity, or
● the end of day occurs (11:59:59 pm) in the viewer's time zone.

In other words, each new referral starts a new session. Returning after a 30-minute-or-more break starts a new session. The arrival of midnight starts a new session.
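By these rules, for example, a user who arrives at 11:50 pm and continues browsing until 12:10 am generates two sessions: one that ends at 11:59:59 pm and a second that begins at midnight.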

45 As of September 1, 2015, session information can be found on the audience overview page. The average session duration is indicated under sessions. Pageviews and pages per session are also on the same report. Pages per session indicate how many clicks users make during their sessions; higher numbers may therefore indicate greater engagement.
46 "How a session is defined in Analytics - Analytics Help," accessed August 6, 2015, https://support.google.com/analytics/answer/2731565?hl=en.


B. Audience Metrics

1. Location
Location47 is the geographic region from which a session originates. Google Analytics allows for different levels of granularity of location information, including continent, country, state, and city.

Why Location is Important
Reviewing location data is an excellent way to determine if a digital library is reaching its target audience, and to discover unexpected digital collections users. Understanding the geographic makeup of an audience can also help institutions make technology and content decisions. If a digital library is experiencing website traffic from regions known for slow or restricted networks, it may want to adapt its site to run better on handheld devices, or choose not to use certain third-party hosting services like YouTube (which is restricted in China, for example). A digital library may also want to invest in translating areas of its website if it is receiving traffic from countries where a language other than that used on the site is spoken. In addition to location, Google Analytics also provides language metrics,48 which are useful in prioritizing translations.

2. Mode of Access
Google Analytics provides data on the technology (browser, operating system) and devices people use when accessing your site.

Why Mode of Access Information is Important
Information on the kind of technology and devices visitors are using is useful regardless of whether the website is optimized for mobile devices.49 Browser and operating system (OS) information can be used as an indicator of the top browsers used to access the collection, and helps to determine testing and web development priorities. If the website has not implemented responsive design, knowing this information could help determine whether allocating resources to make the site responsive is warranted. Additionally, if Google Analytics shows low rates of access from mobile devices, this is often an indicator that the site is unusable or ineffective on mobile devices.

Browser and Operating System and Google Analytics

47 As of September 1, 2015, location information can be found under audience → geo → location. Google Analytics will first break down your traffic origination by country. By clicking on the country names you can drill down by region and city.
48 "Overview of Audience reports - Analytics Help," accessed September 20, 2015, https://support.google.com/analytics/answer/1012034?hl=en&vid=1-635783729473335734-2997705225#Geo.
49 Read more about responsive design, an approach to designing websites that are optimized for easy viewing and interaction on a range of devices, at https://en.wikipedia.org/wiki/Responsive_web_design or http://scholarworks.gvsu.edu/cgi/viewcontent.cgi?article=1004&context=library_books.


Combined with the operating system50 (OS) information, browser data demonstrates the technology ecosystem visitors are using. Investigating each browser's version helps to see whether users tend to use the latest browser or not. Another way to analyze data from this metric is to look at the trend of browser usage over the last several months. A decline in the number of users of an older version indicates that more users are moving to the newer version. On the other hand, a steady number of users of an old browser version indicates there are regular visitors who consistently use the website with older versions of the browser. Future web design might need to keep this cluster of users in mind.

Growing use of the iOS or Android operating systems indicates increasing use of mobile devices. If combined with a high bounce rate, this may be an indication that mobile users are unable to engage with the content. In such situations, libraries may want to consider designing the website to be more mobile friendly.

Mobile devices and Google Analytics
In Google Analytics, phones and tablets are considered mobile devices.51 Google Analytics will show information for all three types of devices: desktop, tablet, and phone. This is good information for comparing the percentage of users who visit the website using mobile devices versus desktop computers. In this section of Google Analytics, you can see the various brands and models of the devices used to access your site. Check the screen size to find out which sizes are used most often to access and interact with the website.

3. Network Domain / ISP Name
Network domain is the domain name of the Internet Service Provider (ISP) from which users originate, for example "duke.edu" or "rr.com."52 ISP name is the human-readable name chosen by a service provider for their domain, for example, "orange county public schools" or "abu dhabi university."53

Why Network Domain / ISP Name is Important
These two metrics are particularly useful for institutions that want to know how much of their traffic originates locally, or how much of their traffic comes from universities or other educational institutions. While network domain might be viewed as a useful metric for isolating access to a website from scholarly networks, Aery makes an excellent case for why ISP name can arguably provide more accurate data, for two reasons. First, .edu domains are only granted to accredited postsecondary institutions in the United

50 As of September 1, 2015, both browser and operating system information are available under audience → technology → browser & OS.
51 As of September 1, 2015, information about mobile devices can be found on the audience → mobile → devices page.
52 As of October 2015, this is called "Hostname" in the Google Analytics interface and can be accessed via the audience → technology → network interface.
53 As of October 2015, this is called "Service provider" in the Google Analytics interface and can be accessed via the audience → technology → network interface.


States, meaning all traffic from schools or universities outside the United States is discarded by filters or custom reports limited to .edu traffic. Second, in Aery's analysis of analytics at Duke Libraries, a much higher percentage of network domains appeared to register as "(not set)" than ISP names,54 meaning that ISP name may provide data on more visits than network domain (Aery, 2015).

Network Domain / ISP Name and Google Analytics
Google Analytics records users' network domain and ISP name information, but a large percentage of site traffic may be reported as "(not provided)," "(not set)," or "unknown.unknown." This is due to the rise of encrypted searching and privacy options. To set up an ISP name filter for educational institutions, one might create a regular expression filter including information such as "school*|universit*|college*".55
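The same idea can be tested programmatically through the Core Reporting API introduced in the site content reports sub-section; the sketch below reuses the service object from that earlier example, with placeholder IDs and dates, to count sessions by service provider name:

```python
# ga:networkLocation is the service provider dimension; the regular
# expression, like the example above, matches only English-language terms.
results = service.data().ga().get(
    ids='ga:XXXXXXXX',              # placeholder view (profile) ID
    start_date='2015-01-01',
    end_date='2015-09-30',
    metrics='ga:sessions',
    dimensions='ga:networkLocation',
    filters='ga:networkLocation=~school|universit|college',
    sort='-ga:sessions').execute()
```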

4. Users
Unless a website requires a login, there is no way to track the number of people who visit it. One can, however, count unique cookie values in the browsers of users; this is how analytics platforms calculate a user metric. A cookie is a small text file that contains an anonymous string of characters, created the first time a person visits the site from a new computer or browser, and which persists until it is either deleted or it expires.56 It is likely that user data in most analytics platforms is somewhat inflated, because if the same person accesses a site from another device or another web browser within the same device, they will be assigned a new cookie and be counted as a different user.

Why Users are Important
The user metric represents the size of the audience that a website is reaching. It can also help gauge whether publicity and marketing efforts are working by showing how many new users the site has over specified time periods. It also allows an institution to track returning users, which can assist in understanding the "stickiness" of a site: are people deciding to return after the first time they visit? If they are, it may imply that the site is useful or compelling.57 If they are not, a digital library may need to dig deeper to figure out why.

Users and Google Analytics

54 In his June 2015 analysis, Aery found that 24% of all visits to the Duke digital collections site had unknown ISP network domains, whereas only 6% had unknown ISP names: https://blogs.library.duke.edu/bitstreams/2015/06/26/the-elastic-ruler-measuring-scholarly-use-of-digital-collections/.
55 The example provided here only filters on English terms related to educational institutions; terms in other languages could also be included, depending on the goals of the analysis.
56 "Hits, Sessions & Users: Understanding Digital Analytics Data," accessed August 6, 2015, http://cutroni.com/blog/2014/02/05/understanding-digital-analytics-data/.
57 "The 6 Most Important Web Metrics to Track for Your Business Website," accessed August 6, 2015, http://articles.bplans.com/the-6-most-important-web-metrics-to-track-for-your-business-website/.


Google Analytics uses two different techniques for calculating users, depending on the type of report requested. Because of this, discrepancies may appear in the number of users between (1) the Audience Overview, where no segments are applied, or a custom report where only a date dimension is applied; and (2) a custom report with a non-date dimension.58 Google Analytics uses first-party cookies. In the newest version of Google Analytics, “Universal Analytics,”59 the cookie is named _ga and lasts for two years. Previous versions of Google Analytics used a cookie named _utma.
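As an illustration of the cookie mechanism, the sketch below reads the _ga cookie that Universal Analytics sets and extracts the client ID used to distinguish users; it is purely illustrative and assumes a page where the tracking snippet has already run.

// Read the Universal Analytics _ga cookie and return the client ID,
// the last two dot-separated fields (e.g. "GA1.2.1234567890.1441234567"
// yields "1234567890.1441234567").
function getGaClientId() {
  var match = document.cookie.match(/(?:^|;\s*)_ga=([^;]+)/);
  if (!match) return null;          // no _ga cookie present
  return match[1].split('.').slice(2).join('.');
}

Two visits that present the same client ID are counted as one user; the same person on a second browser or device presents a different cookie and is counted again.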

Users is a similar metric to sessions (see section A.7 for more information on sessions). The difference between the two is that the users metric counts each unique user once for a specified time period, regardless of the number of sessions they engage in. Sessions, on the other hand, is a total count of all sessions, whether they are from repeat users or new users. The number of sessions will therefore always be at least as large as the number of users.60 61

C. Navigational Metrics

1. Path Through the Site
This section refers to the path users take through the site from one page to another, including where they enter and exit the site. This information is represented by an ordered series of URLs.

Why Path is Important:
Understanding the path users take through the site can help uncover which content keeps users engaged and how users move through content, which in turn informs interface design.

Paths and Google Analytics:
The behavior flow report62 in Google Analytics visualizes the path users take through a site. The report isolates pages or sources and supports analysis of how users behave once they enter the website. Digital libraries can follow the path a successful visit63 takes and try

58 "How the Users metric is calculated - Analytics Help," accessed August 6, 2015, ​ https://support.google.com/analytics/answer/2992042?hl=en. 59 The new version of Google Analytics launched in 2012. In 2015, new accounts may only use Universal Analytics. All Google Analytics users are encouraged to upgrade. The process is simple and free. 60 "The difference between AdWords Clicks, and Sessions, Users, Entrances, Pageviews, and Unique Pageviews in Analytics - Analytics Help," accessed August 6, 2015, https://support.google.com/analytics/answer/1257084?hl=en. 61 “Hits, Sessions & Users: Understanding Digital Analytics Data,” accessed September 21, 2015, http://cutroni.com/blog/2014/02/05/understanding-digital-analytics-data/. 62 As of September 1, 2015, to configure the report, first choose behavior → behavior flow from the sidebar in Google Analytics. 63 “Successful visit” will be defined differently by different institutions, depending on their goals at the time. Examples could include a visit in which a user downloads digitized content, a visit in which a user


to isolate why that visit was successful when others were not. More detailed instructions are available from Marketing Land.64

2. Referral Traffic
Referral traffic shows how users reach a website. Google Analytics may report the last URL a user visited before coming to the digital library, a social media site where the user found a link to the site, a search term used in Google that led to the site, or simply a count of users who input the site’s URL directly (perhaps via a link in an email, a bookmark, or typing into the URL bar).

Why Referral Traffic is Important:
Understanding referral traffic helps digital libraries understand where and how users are finding their site. This information supports the development of outreach strategies by helping to determine who the primary audiences are and how those audiences find the resources in question.

Referral Traffic and Google Analytics:
Google Analytics breaks up referral traffic65 into four categories:

● Referral lists users who reached the digital library website by clicking a link on another site, including social media. Google Analytics can display the referring site alongside other information, such as the number of sessions and users. For some referrers, digital libraries can also drill down by clicking on the referral path to see which specific page drove traffic to the site.
● Organic search lists the search terms used to reach the site. The search terms of users who are logged into Google, however, are protected; these currently appear as “(not provided)” in the list. Over the past few years, the share of anonymized searches has grown sharply as more users browse the web while logged into Google accounts. One workaround for accessing search term data is to implement Google Search Console, and either view data there or import the data back into the Google Analytics account.66 See the more detailed discussion below in the search terms section, C.3.
● Direct lists users who keyed in the link to the site directly, or used a stored bookmark or other provided link (perhaps from an email or a document). Landing pages are listed separately.

successfully navigates from one page to a series of other set pages, a visit in which a user engages in browsing activities after visiting an item specific page, or a visit in which a user signs up for a digital newsletter. 64 "Behavior Flow: Better Insights, Better Marketing," accessed August 6, 2015, http://marketingland.com/behavior-flow-better-insights-better-marketing-66300. 65 As of September 1, 2015, the main area in which referral information is found is under the acquisition menu, as “channels.” 66 “Google Analytics Keyword “Not Provided” Workaround,” accessed August 7, 2015, ​ http://www.business2community.com/seo/google-analytics-keyword-provided-workaround-01193722.


● Social separates out users according to the social network from which they were referred, such as LinkedIn or Facebook.

Under acquisition, Google Analytics also provides three other reports:

● Treemaps: these only work if the paid AdWords service is set up.
● Source/medium: source refers to the domain the user came from, and medium to how they got there (whether by entering a search term, following a link, etc.).
● Referrals: this report provides a list of all the domains that include direct links to the site.

In addition to the reports under the acquisition section of Google Analytics, digital libraries can access referral and source information by choosing either option as a secondary dimension in any of the content use reports. This allows one to combine referral data with other metrics such as page, landing page, and network domain.

3. Search Terms
Search terms are the keywords users entered in a site or Google search that led them to a given website. There are two types of search term: in this paper we refer to them as “universal” (those entered by a user in a web browser, producing results that are clicked and lead to the site) and “local” (those entered by a user already on the site into a search box that performs a site search).

Why Search Terms are Important
Like referral traffic and path through the site, search terms help digital libraries understand what their users are looking for and how those users made their way to the site’s resources. Understanding which search terms are and are not bringing users to the site will in turn assist digital libraries in managing search engine optimization efforts.67

Search Terms and Google Analytics

a. Universal
Google Analytics used to provide search term data from web browsers to website publishers. However, due to privacy concerns, in October 2011 Google stopped providing search referral data for anyone logged into a Google account while searching. Google did this by encrypting searches and outbound clicks by default with SSL (Secure Sockets Layer). The one exception is Google’s advertisers: if an organization pays to advertise with Google (the

67 There are several free tools that assist in keyword research; one example is Google Trends, https://www.google.com/trends/. Exploring the keywords entered in Google searches over time can provide insight into how often people use different terms, and assist in setting keywords on your website.


AdWords program), it can still see all search referral data for its ad campaigns.68 Google now displays the search term “(not provided)” for all referrals coming through SSL search. How much this affects a digital library’s analytics depends on what percentage69 of its organic search referral data is being blocked; this varies widely from site to site, but is often significant.

Google Search Console70 (previously Google Webmaster Tools71) is a second set of tools provided freely by Google that offers a basic overview of the keywords leading visitors to a site.72 While the tools do not provide a list of every keyword and the number of times it was used, they do provide certain information on the top 2,000 queries that returned the site in search results within the past 90 days, including the number of impressions, clicks, clickthrough rate (CTR), and average position.73

b. Local
Many websites have a site search: a search option that retrieves content only from the website itself. Google Custom Search and Google Site Search are by far the most commonly used site searches, as they do not require any technical skills or custom scripting to implement, and they put the power of Google’s search engine and algorithms at a site’s disposal. Google Custom Search allows anyone to create a search engine and host it on their site for free, using the Custom Search element. Google Site Search is aimed primarily at businesses; it has an annual fee and includes technical support, a tailored look and feel, and more.74

These Google site search options provide free access to the search terms used within a local search box on a site.75 This is an excellent way to understand what users are looking for once they are already on a site, particularly if the site in question does not have the option of creating its own custom logging solution.

Alternatively, digital libraries can create their own custom logging solutions if they have the necessary technical expertise. There are numerous ways to do this, and no particular out-of-the-box solution. One example would be to use Nutch76 to crawl a site, Solr to store and index the site data collected, and JavaScript to log submitted search queries to a database or

68 "Dark Google: One Year Since Search Terms Went 'Not Provided'," accessed August 7, 2015, http://marketingland.com/dark-google-search-terms-not-provided-one-year-later-24341. 69 As of September 1, 2015, to determine the percentage, select a time period and go to acquisitions → campaigns → organic keywords. 70 "What is Search Console? - Search Console Help," accessed August 7, 2015, https://support.google.com/webmasters/answer/4559176?hl=en. 71 Google Webmaster Tools became Google Search Console in May of 2015: http://googlewebmastercentral.blogspot.com/2015/05/announcing-google-search-console-new.html. 72 As of September 1, 2015, this could be found via: traffic → search queries. 73 For more information on what these terms mean, see "Search queries - Search Console Help," accessed September 21, 2015, https://support.google.com/webmasters/answer/35252?hl=en. 74 "Google Site Search vs Google Custom Search - Custom Search Help," accessed August 7, 2015, https://support.google.com/customsearch/answer/72326?hl=en. 75 As of September 1, 2015, these could be found via: behavior → site search → search terms. 76 "Apache Nutch," accessed August 7, 2015, http://nutch.apache.org/.


log file.77 With a custom solution, one would have to extract and analyze search term data oneself, whereas Google has built-in reporting and analysis features. The query-logging piece of such a solution is sketched below.
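A minimal sketch of the JavaScript logging step, assuming a hypothetical search form with the id “site-search”, a query field named “q”, and a hypothetical /log-search endpoint that records the query in a database or log file:

// Record each submitted site-search query before the search itself runs.
document.getElementById('site-search').addEventListener('submit', function () {
  var query = this.querySelector('input[name="q"]').value;
  // A tiny image request carries the query to the logging endpoint.
  new Image().src = '/log-search?q=' + encodeURIComponent(query) +
                    '&t=' + new Date().getTime();  // cache-buster
});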

Search terms for failed searches are also important: they can indicate terms that did not provide the user with any results, or that produced results the user did not choose. This data can be tricky to find, however. Google Search Console provides one method of examining “failed” search data. Clickthrough rates (CTR) are provided for each of the highest-ranking search queries, and the inverse of the CTR is the percentage of the time that the query in question did not result in someone clicking through to the site. For example, a query with 1,000 impressions and a 4% CTR produced 40 clicks, meaning that 96% of the time searchers saw the site in their results but did not visit it. Keep in mind, however, that failed clickthroughs do not mean the user failed to find a result that met their need, just that they did not access the site in question to do so.

Section IV: Additional Metrics and Custom Approaches

A. Dashboards and Custom Reports
In addition to site content reports, custom reports and dashboards are an excellent way to monitor content use and access counts. Dashboards allow Google Analytics users to compile different reports on a single page. For example, landing page metrics and geographic information live in different areas of the Google Analytics application; a dashboard containing both measures displays them on the same page. Dashboards are easy to set up78 and do not require technical expertise.

Customization features allow digital libraries to build nuanced reports beyond those that Google Analytics produces out of the box. For example, one can create a custom report covering all the URLs pertaining to a single collection within a digital library, as sketched below. Custom reports also provide more technically advanced mechanisms for fine-tuning a report, such as support for regular expressions.
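As a sketch, assuming a collection whose pages all share a hypothetical /collections/wwi-posters/ path prefix, a custom report filtered on the Page dimension could use a regular expression such as:

^/collections/wwi-posters/

Any pageview whose path begins with that prefix is then included; a collection whose URLs share no common prefix would instead need an alternation listing each relevant pattern.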

Once created, a custom report can also be added to a dashboard so it can be viewed alongside standard Google Analytics reports. Google Analytics Help provides detailed information about creating and managing custom reports.79

77 This is not a DLF-recommended method; it is used only as an example. There are dozens of ways one might do this. 78 "About Dashboards - Analytics Help," accessed August 7, 2015, https://support.google.com/analytics/answer/1068216?hl=en. 79 "About Custom Reports - Analytics Help," accessed August 7, 2015, https://support.google.com/analytics/answer/1033013?hl=en.


B. Event Tracking
Event tracking80 is a customized process that allows digital libraries to note when users take specific actions on a website, such as downloading items or using dynamic elements (for example, hovering over links to make pop-up menus appear); it can also be used to monitor load time.

Implementing event tracking81 requires planning, whether Google Analytics is managed through code snippets or the tag manager.82 Each event has three main components: category, action, and label. A category is a name given to a group of similar events to be tracked together (for example, “videos”). An action describes the type of interaction (such as “play,” “download,” or “hover”). A label provides a further reporting dimension and a way to identify specific events (such as the title of the video that was played).83 There is also a fourth component, value, which is the numerical value of an event being tracked (such as the length of a video played, or download time).

Event tracking is easier to understand through examples. A digital library might want to know how often videos are played, how often particular videos are played compared with others, and the average load time for videos. In this case, it could assign the category “videos” to every link that accesses a video on the site, define an action called “play” that triggers when a play button is clicked, provide a unique label for each video (such as its title), and set the value variable to record load time. When the resource manager reviews event tracking reports, she will be able to see how many people played videos, how individual videos performed in comparison to one another, the total number of events recorded for different categories or actions, and the average value (in this case, the average load time). A sketch of this setup appears below. Digital library managers may already have event tracking set up without realizing it; we recommend asking the Google Analytics administrator if and how events are tracked locally. ACRL TechConnect also provides examples in a blog post from 2013.84
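A sketch of the video example using the analytics.js event syntax of category, action, label, and value; it assumes the page already carries the standard Google Analytics tracking snippet, and the video title is hypothetical:

// When a visitor first starts a video, send an event with category "videos",
// action "play", the video title as the label, and load time as the value.
var video = document.querySelector('video');
var requested = new Date().getTime();
video.addEventListener('playing', function onFirstPlay() {
  video.removeEventListener('playing', onFirstPlay);
  var loadTimeMs = new Date().getTime() - requested;
  ga('send', 'event', 'videos', 'play', 'Farm Life Oral History', loadTimeMs);
});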

Before setting up the first event, Google Analytics recommends thinking through the types of events to be tracked.85 Ideally, categories and actions should be managed consistently and have labels with unique and descriptive names.

80 "Event Tracking - Web Tracking (analytics.js)," accessed August 7, 2015, https://developers.google.com/analytics/devguides/collection/analyticsjs/events. 81 "Event Tracking - Web Tracking (analytics.js)." 82 "Google Tag Manager official website," accessed August 6, 2015, http://www.google.com/tagmanager/. 83 "Event Tracking - Web Tracking (analytics.js)." 84 "Event Tracking with Google Analytics - ACRL TechConnect Blog," accessed August 7, 2015, http://acrl.ala.org/techconnect/?p=2664. 85 "About Events - Analytics Help," accessed August 7, 2015, https://support.google.com/analytics/answer/1033068?hl=en.


C. Goals and Conversions
A conversion occurs when a goal is successfully completed, or when a sub-goal is reached (called a micro-conversion). Examples of conversions include signing up for an email newsletter or submitting a contact information form. Websites may be structured, and Google Analytics configured, so that conversions can be tracked.

Although goals and conversions lend themselves well to e-commerce or other commercial sites where completing purchases is important, they are somewhat harder to structure, configure, and track in a higher-education or library environment. As with any data collection and analysis, however, efforts should be driven by the business need of the organization in order to make strategic and informed decisions; data should not be collected just for the sake of collecting data.

Higher-education or library examples of goals and conversions may include: guiding users from a university press website exhibit to the web page where the books may be purchased; encouraging users to view digital assets that are linked from finding aids; signing users up for email delivery of a newsletter; or registering users for events at the library.

Metrics may also be applied to track progress toward goals and conversions. This involves both defining the criteria by which the results can be measured (key performance indicators, or KPIs) and assigning desired values to the results (such as an increase by a certain percentage over a particular period of time). Even for non-commercial sites, this is often done through monetization: assigning notional monetary values (chits, or tokens) when certain results occur, such as treating each newsletter signup as worth a set amount.


Section V: Examples of Platform-specific Considerations

Each digital library platform structures its content differently, and this structure has implications for web analytics. This section provides examples of how two digital library platforms structure their content, and suggestions for maximizing the effectiveness of using Google Analytics with these platforms. As discussed in section II of this document, all platforms have their own method of constructing URLs. Thus, to be able to identify collections or items accessed within Google Analytics, it is necessary to study and document the URL patterns supported by the delivery system in question. There may be multiple variations on URLs that access the same item or collection; to capture an accurate count of users and pageviews in such cases, all viable combinations for each collection/item component would need to be combined. This type of analysis and comparison of possible links to various granularities or representations of content provides the information necessary to locate and combine links to specific items and/or collections within the Google Analytics results.

It is important to remember that updates and new versions of the system may alter the links, and hence require re-analysis. This is where documentation is crucial in order to combine counts of accesses to a particular collection prior to and after such changes, as well as to provide business continuity and clarification in statistics gathering.

A. CONTENTdm
CONTENTdm is one of the most widely used digital asset management systems among cultural heritage repositories. CONTENTdm collects some web usage data that can be accessed through CONTENTdm Administration, but it is very limited. CONTENTdm also tracks pageviews of compound objects in a different way than Google Analytics; CONTENTdm’s methods have been considered inaccurate by some and report inflated numbers.86

The default URL string for CONTENTdm collections is structured as: cdm[institution’s server number].contentdm.oclc.org.

Many institutions choose a custom URL to use in place of this string as their favored web address for access. Google Analytics will collect web analytics data for CONTENTdm usage whether users access content through a custom URL or the default CONTENTdm path.

Google Analytics event tracking is already enabled for CONTENTdm collections without any additional scripting required. Supported event categories at the time of publication are print, download, search, advanced search, facets, reference url, share, tags, comments, ratings, navigation, compound objects, and page flip. For more information about using Google

86 "Google Analytics & CONTENTdm: Part I," accessed August 7, 2015, https://parkslibrarypreservation.wordpress.com/2014/02/25/google-analytics-contentdm-part-i/.


Analytics specifically with CONTENTdm, see the OCLC setup guide and tutorial on working with Google Analytics.87

B. DLXS
DLXS88 is no longer supported but is still in use by several institutions. Unfortunately, configuration for Google Analytics event tracking is not simple to implement, and the default URL may vary. This platform is presented here to demonstrate that there may be multiple methods of constructing links that direct the user to the same item, which makes it difficult in Google Analytics to isolate the number of hits or users for a particular set of content. For example, the following five links are equivalent:
1. http://diglib.lib.utk.edu/cgi/t/text/text-idx?c=tdh;cc=tdh;rgn=main;view=toc;idno=pav
2. http://diglib.lib.utk.edu/cgi/t/text/text-idx?c=tdh;cc=tdh;q1=cherry;rgn=main;view=toc;idno=pav
3. http://diglib.lib.utk.edu/cgi/t/text/text-idx?c=tdh;cc=tdh;sid=2acf0eab36829d5a900c979266f5611e;q1=cherry;rgn=main;view=toc;idno=pav
4. http://diglib.lib.utk.edu/cgi/t/text/text-idx?c=tdh;;q1=cherry;rgn=div1;view=toc;idno=pav;node=pav%3A2
5. http://diglib.lib.utk.edu/cgi/t/text/text-idx?c=tdh;;q1=cherry;rgn=div1;view=toc;idno=pav

The first link is the most straightforward; the second includes the original query (which may apply to multiple items), and the third includes a session ID (the “sid” value). The “c” value is the collection; the “cc” value is the collection grouping, and is optional. The “view” value specifies that this is the table of contents (“toc”), and the “rgn” value may indicate the portion of text searched. In the fourth link, a node value was concatenated by navigating back to the table of contents from a containing page. Yet all of these, and more, are links to aspects of the same item, here identified by idno=“pav.” This single attribute in the URL may be the one used to identify hits or accesses to this intellectual item.

If seeking accesses to all items in this collection, all queries containing “c=tdh” would be collected and then sifted to remove results lists (links including “view=reslist”), as sketched below. Hence, collecting hits and users for specific items can be very complex, whereas collecting hits and users for whole collections is usually somewhat simpler.
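A sketch of that sifting step, applied to a hypothetical array (urls) of logged request URLs; it keeps the collection’s traffic, drops results lists, and tallies hits per item by the stable idno attribute:

// Keep requests for the tdh collection, excluding results-list views.
var itemHits = urls.filter(function (u) {
  return u.indexOf('c=tdh') !== -1 && u.indexOf('view=reslist') === -1;
});
// Group the many equivalent link forms under the item identifier (idno).
var hitsPerItem = {};
itemHits.forEach(function (u) {
  var m = u.match(/idno=([^;&]+)/);
  if (m) hitsPerItem[m[1]] = (hitsPerItem[m[1]] || 0) + 1;
});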

In any system, analysis of patterns of URLs used for access to items and collections may be critical for obtaining valid statistics in Google Analytics.

87 "Getting Started with Google Analytics in CONTENTdm," accessed August 7, 2015, http://www.contentdm.org/USC/kbase/tutorials/google-analytics-in-contentdm.pdf. 88 Information in this section is drawn from the experiences of Jody DeRidder in working with DLXS at the University of Tennessee Libraries, from 2002-2008.


Section VI: Tips for Google Analytics Account Setup

Google Analytics is configured hierarchically with Accounts, one or more Properties that are associated with each Account, and one or more Views that are associated with each Property. In this section we will look at some general tips for setting up and managing your account. Consult Google Analytics help information89 for the most thorough and up-to-date set-up instructions.

A. Terminology

1. Properties
A property is a website: a collection of pages that you want to track as one entity in Google Analytics. This might be a homepage and its related pages, a mobile app, or a separate content management system such as LibGuides. Each Google Analytics account can have up to 100 properties.

2. Views
Google defines a view as “...your access point for reports; a defined view of data from a property.”90 Think of a view as the subset of data that an analytics user is working with at any given time, such as online finding aids for a particular archive. Each view has its own set of reports, isolated from other views. If data isn’t included in the current view, it won’t show up in reports. Note that views used to be called “profiles,” and that terminology may still appear in older documentation.

Google Analytics only acquires data through the views that are set up. Data can only be collected from the time a view is established moving forward; it cannot be acquired retroactively. If a property is configured to exclude data from certain URLs, or if a view is configured to filter out particular data, the filtered historic content cannot be recovered by later changes to the view or property setup. New views likewise contain data only from the time of their creation onward, even if the view configuration is copied from an existing view for the property. By default, data cannot be combined across properties within the Google Analytics reporting platform. However, one can do this through the API by exporting Google Analytics data and combining the separate property data through scripting; see the Google Analytics API

89 "Analytics Help," Google, accessed August 6, 2015, https://support.google.com/analytics/?hl=en#topic=3544906. 90 "Hierarchy of accounts, users, properties, and views - Analytics Help," accessed August 7, 2015, https://support.google.com/analytics/answer/1009618?hl=en.


documentation91 for more information. Accomplishing this work will likely require the assistance of a developer or local Google Analytics administrator; a sketch of a single API request appears below.
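As a sketch, one request per property (view) to the version 3 Core Reporting API could be built as follows, with the results then combined in a script; the view ID is a hypothetical placeholder and a valid OAuth access token is assumed:

// One Core Reporting API (v3) request; repeat per property, then merge results.
var url = 'https://www.googleapis.com/analytics/v3/data/ga' +
          '?ids=ga:12345678' +                       // hypothetical view ID
          '&start-date=2015-01-01&end-date=2015-06-30' +
          '&metrics=ga:users,ga:sessions' +
          '&dimensions=ga:pagePath' +
          '&access_token=' + accessToken;            // OAuth token assumed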

When getting started with Google Analytics, it is strongly recommended to plan the configuration that would be most appropriate for your institution. Without careful planning, one can easily end up with a disorganized collection of accounts, properties, and views. Some institutions may have one account for the library (or even one account for the parent institution, with one or more properties for the library). Data that needs to be analyzed in aggregate must be contained within the same property. Properties have a default URL for the base location of the product, so content that resides at different locations is best configured with separate properties. In other words, each property should have its own unique URL.

The tracking code for web pages is set at the property level. In addition to web pages, Google Analytics can also attach code to JavaScript events, such as playing a video. Managing all the potential code across websites and events can get complicated, and Google provides a product called Google Tag Manager to assist with this. Tag Manager can also be used to insert the Google Analytics code itself into a website. Taking the time to learn and use Tag Manager early in the analytics process will make managing a Google Analytics setup much more straightforward later. Google has published a guide on this process.92

When a property is created, a default view called “all website data” is set up. This is an unfiltered view of the raw data from the website, and Google recommends93 that this view be kept as-is so that one can always go back to the raw data from the site. Google also recommends94 creating at least two additional views: a “master view” for how the data is to be presented to end users, and a “test view” for trying out new configurations before adding them to the master view.

Permissions can be set at the account, property, or view level.95 By default, permissions set at the account level carry over to the property and view levels. Permission levels per user include manage users, edit, collaborate, and read & analyze. It is strongly recommended to restrict the manage users and edit permissions to those who have direct responsibility for managing the account, property, or view. Since deleted data, views, properties, and accounts cannot be recovered, great damage can be done if these permissions are not carefully handled.

91 “Google Analytics APIs - Analytics Help,” accessed September 20, 2015, https://support.google.com/analytics/answer/1008004?hl=en. 92 "Solutions Guide for Implementing Google Analytics via Google Tag Manager - Analytics Blog," accessed August 7, 2015, http://analytics.blogspot.com/2015/03/solutions-guide-for-implementing-google.html. 93 "Understanding your account structure - Analytics Help," accessed July 28, 2015, https://support.google.com/analytics/answer/6083325?hl=en. 94 “Understanding your account structure - Analytics Help.” 95 Permissions are called “user management” in the Google Analytics interface as of September 22, 2015.


Section VII: Further Resources on Google Analytics

Google
● Analytics Academy: https://analyticsacademy.withgoogle.com/explorer.
○ Digital Analytics Fundamentals: https://analyticsacademy.withgoogle.com/course01.
○ Google Analytics Platform Principles: https://analyticsacademy.withgoogle.com/course02.
● Analytics Blog: http://analytics.blogspot.com/.
● About Google Analytics Individual Qualification (IQ) - Google Partners Help: https://support.google.com/partners/answer/6089738.
● Analytics Help: https://support.google.com/analytics/#topic=3544906.
● Google Analytics: http://www.google.com/analytics/.
● Google Analytics APIs: https://support.google.com/analytics/answer/1008004?hl=en.
● Google Analytics | Google Developers: https://developers.google.com/analytics/.
● Google Analytics - Google Product Forums: https://productforums.google.com/forum/#!forum/analytics.
● Google Analytics Terms of Service: http://www.google.com/analytics/terms/us.html.
● Google Tag Manager official website: http://www.google.com/tagmanager/?hl=en_US.
● How to prepare for Google Analytics IQ - Analytics Help: https://support.google.com/analytics/answer/3424288?hl=en.
● Privacy Policy - Privacy & Terms: http://www.google.com/policies/privacy/.
● Training & Certification - Google Analytics: http://www.google.com/analytics/learn/index.html.
● Services and Apps - Google Analytics Partner Services and Technologies: http://www.google.com/analytics/partners/search/all.

Implementations
● OCLC, Getting Started with Google Analytics in CONTENTdm 6.4: http://www.accesspadigital.org/staff/tutorials/Version6Tutorials/google-analytics-in-contentdm.pdf.

Lynda.com
● Google Analytics Essential Training: http://www.lynda.com/Google-Analytics-tutorials/Google-Analytics-Essential-Training/197523-2.html.

YouTube
● ConversionUniversity, Google Analytics IQ Lessons: https://www.youtube.com/playlist?list=PL953EF4F771134336.
○ ConversionUniversity, 1. Introduction to Google Analytics | Google Analytics IQ Lessons: https://youtu.be/H1Opn4DS88k.


● Google Analytics: https://www.youtube.com/user/googleanalytics.


Section VIII: Conclusions and Next Steps

The recommendations presented in this paper are an attempt by the DLF AIG analytics working group to bridge the web analytics best practices gap. We realize that this is an evolving field of inquiry and that there are many opportunities to advance these best practices. Future directions could include standardizing methods for sharing metrics across institutions, clearer decision-making around including or excluding web crawler traffic in access counts, reaching further consensus on definitions of access and use, and widening the scope beyond Google Analytics to include other recommended tools and methods.

We encourage analytics beginners and experts alike to get involved in the analytics best practices conversation. If you would like to participate, join the DLF AIG Google Group96 and, if possible, attend the DLF AIG session at DLF in October 2015. For those who cannot attend in person, the session will be freely available in real time via livestream.97 Please also feel free to send feedback and ideas to the larger Google group. The DLF AIG encourages members to actively share information, post questions, and suggest resources that may be relevant to others. Members of the community who want to take an active role in continuing analytics work should express their interest at the DLF AIG lunch session at DLF in October 2015, post to the DLF AIG Google Group, or contact Molly Bragg ([email protected]) and Joyce Chapman ([email protected]) directly.

Analytics best practices will continue to evolve only if interested parties take up the challenge of making them a community-wide concern. One of the goals of this white paper is to demystify the process of gathering relevant analytics data from which resource managers can set priorities and make decisions, and the authors hope they have been effective in empowering the community with these guidelines. There is still more work to do! We hope these guidelines inspire our colleagues to take up the charge of improving and codifying analytics and other assessment best practices for digital libraries.

96 "Digital Library Assessment - Google Group," accessed August 5, 2015, https://groups.google.com/forum/#!forum/digital-library-assessment. 97 A select number of session at the 2015 DLF forum will be livestreamed by the University of British Columbia Library. See the livestream schedule here http://www.diglib.org/forums/2015forum/livestream-schedule. ​


Appendix

Other Methods for Collecting Analytics Data

Google Analytics uses page tagging98 technology to track web page visits and interactions. However, a number of other methods and tools exist for collecting web data, including combining “user panels and browser logging tools to track sample WWW user populations; collecting network traffic data directly from ISP servers; and using site-specific server log parsers or page tagging technologies to measure traffic through a particular site” (Khoo et al. 2008, 375). A Wikipedia entry on web analytics software99 lists seven open source, eight proprietary, and 18 hosted or software-as-a-service options as of 16 April 2014. Coverage of these many options is beyond the scope of this document; however, almost all of them use server logs100 to generate their output. This section will discuss server logs (specifically Apache logs, as they are the most common) as well as page tagging and web beacons, as these are all common methods for collecting web analytics data.

A. Web Server Logs
Server logs contain extremely detailed information, as they collect data about every connection made to the server. Each component within a web page (for example, an image or a CSS file) maps to its own individual URL; when you access a single web page, you actually load every individual URL on the page, and each of those URLs is logged in the server log. Since accessing a single web page can generate multiple HTTP calls (one for each CSS file, JavaScript file, icon, image, and more), server logs can be both overwhelming and intriguing. Parsing server logs to extract pertinent information generally requires scripting skills, and such skills may be necessary if the software packages available cannot provide needed information. However, a general understanding of server logs can help decision makers select the most appropriate software for their needs.

A brief and somewhat technical overview of server log parsing follows. If you explore this approach, you will need the assistance of a developer and/or system administrator who has access to the logs and can provide technical support.

1. Analyzing Server Logs
Apache log files can be configured by your server administrator101 to output information in a variety of ways; only two common formats will be examined here. Remember to check with the

98 "Web Analytics - Page TaggingGoogle Analytics," Wikipedia, last modified August 7, 2015, https://en.wikipedia.org/wiki/Web_analytics#Page_tagginghttps://en.wikipedia.org/wiki/Google_Analytics. 99 "List of web analytics software," Wikipedia, accessed August 6, 2015, https://en.wikipedia.org/wiki/List_of_web_analytics_software. 100 "Server log," Wikipedia, accessed August 6, 2015, https://en.wikipedia.org/wiki/Server_log. 101 "Log Files - Apache HTTP Server Version 2.4," The Apache Software Foundation, accessed August 5, 2015, http://httpd.apache.org/docs/current/logs.html.


server administrator to clarify not only where the logs are located on the server, but also how often and when they are rotated. Rotation prevents the logs from becoming too large, and normally also deletes older ones, which may need to be retained if capturing statistics for a longer period of time.

The first of two examples102 is in the Common Log Format:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

● This example begins with the IP address of the client connecting to the server (this is common).
● The second entry, here represented by a hyphen (which indicates the information is not available), is the RFC 1413 identity103 of the client.
● “frank” here is the userid of the person requesting the document, as determined by HTTP authentication. This value should not be trusted if the status code is 401, as the person has not yet been authenticated. If the document is not password protected, this value will be another hyphen.
● Inside the brackets is the time that the request was received, here formatted as day/month/year:hour:minute:second followed by the offset from Coordinated Universal Time (UTC).
● The section in quotes is the query received.
○ The method is “GET” (as opposed to “POST”, “PUSH”, or “DELETE”). For more information on the methods, see Using HTTP Methods for RESTful Services.104
○ Next is the content requested, expressed relative to the URL base (prefix it with the base URL to reconstruct the link; for example, http://yourServer.org/ may be the URL base. The URL base is not included in the log, only the portion of the link following it).
○ The protocol here is “HTTP/1.0” (as opposed to FTP, SMTP, TCP/IP, POP, or others).
● The “200” indicates that the request resulted in a successful response (all codes beginning with 2), so this code is very important. Codes beginning with 3 indicate a redirection; those beginning with 4 indicate an error caused by the client, and those beginning with 5 indicate an error in the server. All possible codes are spelled out in the Hypertext Transfer Protocol RFC 2616.105
● The last value in this line (“2326”) indicates the size in bytes of the object returned to the client, not including response headers.

102 Examples are extracted from “Log Files - Apache HTTP Server Version 2.2” by Toby Goodwin, The Apache Software Foundation, accessed August 4, 2015, http://httpd.apache.org/docs/2.2/logs.html. 103 "RFC 1413 - Identification Protocol," accessed August 5, 2015, https://tools.ietf.org/html/rfc1413. 104 "Using HTTP Methods for RESTful Services," accessed August 5, 2015, http://www.restapitutorial.com/lessons/httpmethods.html. 105 "Hypertext Transfer Protocol -- HTTP/1.1," accessed August 5, 2015, http://www.w3.org/Protocols/rfc2616/rfc2616.txt.


The second example is in the Combined Log Format, which is the same as the Common Log Format with the addition of two more fields:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The “http://www.example.com/start.html” value indicates the site from which the client was referred -- usually the site that linked to the content requested. Following that is identifying information reported by the client’s browser, which enables collection of data about the types of browser and operating system used (in this example, the Mozilla 4.08 browser on Windows 98).

2. Best Practices for Working with Server Logs
When parsing logs, it is important to separate out requests for supporting files (such as .js, .ico, .gif, .css) to avoid inflating access counts; a sketch of this step follows below. The analysis performed to identify the types of links to count as item hits or collection accesses will inform which URL requests to count. Large files will transfer sequentially in chunks (“packets”), and these entries should be combined into a single access count. Errors are also important to review for entries that identify problems, such as “file does not exist.”
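A sketch of the filtering step in Node.js; the log file name is hypothetical, and the extension list would be adjusted to local content:

// Count only page-level requests, skipping supporting files.
var fs = require('fs');
var supporting = /\.(js|css|ico|gif|png|jpe?g)(\?|"|\s|$)/i;
var pageHits = fs.readFileSync('access.log', 'utf8')
  .split('\n')
  .filter(function (line) {
    var m = line.match(/"(?:GET|POST) ([^ ]+)/);  // the requested URL
    return m && !supporting.test(m[1]);
  });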

If the delivery system does not indicate session identifiers, sessions can be generated by noting any period longer than 30 minutes between accesses by a single IP address, or the arrival of midnight, as sketched below. This assumes that if the client is using a shared computer, at least a half hour would pass before the next user tried to access the site in question.
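A sketch of that rule, assuming a hypothetical hits array of {ip, time} entries (time in milliseconds) already sorted chronologically:

var THIRTY_MINUTES = 30 * 60 * 1000;
var lastSeen = {};   // most recent hit time per IP address
var sessions = 0;
hits.forEach(function (hit) {
  var prev = lastSeen[hit.ip];
  var crossedMidnight = prev !== undefined &&
      new Date(prev).getDate() !== new Date(hit.time).getDate();
  if (prev === undefined || hit.time - prev > THIRTY_MINUTES || crossedMidnight) {
    sessions++;   // a long gap or a new day starts a new session
  }
  lastSeen[hit.ip] = hit.time;
});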

In conclusion, if the software available does not provide the granularity of information needed, or is not transparent about how it collects its counts, analysis of server logs can be a viable, and often the most reliable, solution. However, successful parsing of these logs requires analysis of potential variations on item, collection, and content links, and sifting out unnecessary information. Additionally, server logs can be invaluable for identifying problems that may interfere with user access to the system, offering detailed information that is rarely provided by any software package.

B. Page Tagging
Page tagging is the method of inserting tags into web pages to be read by an analytics program. For example, Google Analytics collects data through page tagging106 technology. Other page-tagging analytics software (such as CrawlTrack,107 Piwik,108 or Open Web Analytics109) is

106 "Web Analytics - Page Tagging," Wikipedia, last modified August 4, 2015, https://en.wikipedia.org/wiki/Web_analytics#Page_tagging. 107 "CrawlTrack, web analytics," accessed August 5, 2015, http://www.crawltrack.net/. 108 "Piwik,” accessed September 16, 2015, http://piwik.org/.


also available. The programs just mentioned are all open source (unlike Google Analytics), are downloaded and installed on a local server, and tend to appeal to users who want to maintain control over their data rather than share it with Google.110 Google Analytics, however, is the most widely used page-tagging analytics tool.

C. Web Beacon
A web beacon111 (or web bug) is a method of web metrics collection that uses an object embedded on each web page tracked by the analytics tool; typically this is a very small image. Every time the page is loaded, the image is also downloaded, which allows the tool to track page use. This method is particularly useful for tracking mobile-device users, because the small size of the image does not hinder the load speed of the page.

109 "Open Web Analytics," accessed August 5, 2015, http://www.openwebanalytics.com/. 110 “Piwik, privacy,” accessed September 16, 2015, http://piwik.org/privacy/. 111 "Web beacon," Wikipedia, last modified August 3, 2015, https://en.wikipedia.org/wiki/Web_beacon.


Bibliography

Aery, Sean. “The Elastic Ruler: Measuring Scholarly Use of Digital Collections.” Bitstreams: Notes from the digital projects team (blog), June 26, 2015. https://blogs.library.duke.edu/bitstreams/2015/06/26/the-elastic-ruler-measuring-scholarly-use-of-digital-collections/.

Chapman, Joyce, and Elizabeth Yakel. “Data-Driven Management and Interoperable Metrics for Special Collections and Archives User Services,” RBM: A Journal of Rare Books, Manuscripts, and Cultural Heritage 13 (2012): 129-51.

Chapman, Joyce. “Evaluating the Effectiveness of Manual Metadata Enhancements for Digital Images,” NCSU Libraries public intranet, 2011, https://staff.lib.ncsu.edu/confluence/display/MNC/Evaluating+the+effectiveness+of+manual+metadata+enhancements+for+digital+images (22 September 2015).

Custer, Mark. “Mass Representation Defined: A Study of Unique Page Views at East Carolina University,” The American Archivist 76 (2013): 481-501.

Jansen, Bernard. “Search log analysis: What it is, what’s been done, how to do it,” Library & Information Science Research 28 (2006): 407-432.

Jansen, Bernard, Amanda Spink, and Tefko Saracevic. “Real life, real users, and real needs: a study and analysis of user queries on the web,” Information Processing & Management 36 (2000): 207-227.

Jones, Casey, Sarah Giersch, Tamara Sumner, Michael Wright, Anita Coleman, and Laura Bartolo. “Developing a Web Analytics Strategy for National Science Digital Library,” D-Lib Magazine 11 (2004). Accessed August 5, 2015. http://www.dlib.org/dlib/october04/coleman/10coleman.html.

Jones, Steve, Sally Jo Cunningham, Rodger McNab, and Stefan Boddie. “A transaction log analysis of a digital library,” International Journal on Digital Libraries 3 (2000): 152-169.

Kelly, Elizabeth Joan. “Assessment of Digitized Library and Archives Materials: A Literature Review,” Journal of Web Librarianship 8 (2014): 384-403. Accessed May 5, 2015. doi:10.1080/19322909.2014.954740.

Khoo, Michael, Joe Pagano, Anne L. Washington, Mimi Recker, Bart Palmer, and Robert A. Donahue. “Using Web Metrics to Analyze Digital Libraries,” Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (2008): 375-84.

Peters, Thomas. “The history and development of transaction log analysis,” Library Hi Tech 11 (1993): 41-66.

Prom, Christopher J. “Using Web Analytics to Improve Online Access to Archival Resources,” The American Archivist 74 (2011): 158-184.

Szajewski, Michael. “Using Google Analytics Data to Expand Discovery and Use of Digital Archival Content,” Practical Technology for Archives 1 (2013). Accessed August 5, 2015. http://practicaltechnologyforarchives.org/issue1_szajewski/.

Voorbij, Henk. “The Use of Web Statistics in Cultural Heritage Institutions,” Performance Measurement & Metrics 11 (2010): 266-79.

Yang, Le, and Joy M. Perrin. “Tutorials on Google Analytics: How to Craft a Web Analytics Report for a Library Web Site,” Journal of Web Librarianship 8 (2014): 404-17. doi:10.1080/19322909.2014.944296.