Data Analytics: the Future of Legal JEREMIAH CHAN1

International In-house Counsel Journal
Vol. 9, No. 34, Winter 2016, 1

Data Analytics: The Future of Legal

JEREMIAH CHAN¹
Legal Director, Google Inc., USA
&
JAY YONAMINE
Senior Data Scientist, Google Inc., USA
&
NIGEL HSU
Head of Patent Operations, Verily Life Sciences LLC, USA

INTRODUCTION

During the 2002 season of Major League Baseball (MLB), the Oakland Athletics and New York Yankees both won an astounding 103 games, a feat topped only three times by any team in the last 24 years. Despite identical records, the New York Yankees' payroll on opening day was $125,928,583 compared to only $39,679,746 for the Oakland Athletics. On a per-win basis, the Oakland Athletics paid $385,240 per win compared to the New York Yankees' $1,222,608. The New York Yankees were not an outlier. All other MLB teams (excluding Oakland) paid on average $853,252 per win in 2002. So how were the Oakland Athletics able to generate wins with two to three times the cost efficiency of other teams in the league?

By now, the answer has been well documented by Michael Lewis in the book and subsequent movie Moneyball, in which Lewis recounts how Oakland Athletics General Manager Billy Beane replaced traditional approaches to scouting and roster management with a rigorous use of data and statistical models. This is often referred to as “data analytics.”

The quest for increased efficiency is certainly not unique to MLB. Be it a widget factory, technology company, or legal department, the desire to maximize the quality of outcomes while maintaining or reducing costs is ubiquitous. As the “Big Data revolution” continues to grow, organizations increasingly look to data analytics as a core driver of increased efficiency. However, this is more difficult for some organizations than others, as it requires not only building or buying sophisticated technical infrastructure, but also generating buy-in from key stakeholders who might not believe in the power of data analytics.
For many in-house legal departments, the debate over whether to move towards data-driven efficiency is over. In-house counsel can no longer afford to disregard relevant data in making decisions. The question facing legal department leadership is not whether to incorporate data analytics, but how to do so most effectively. Regarding how, there is good news and bad news for legal departments. The bad news is that legal as an industry is late to the game. The good news is that many other industries have already paved the way, having spent decades resolving a host of technical challenges, and their efforts have resulted in clear best practices and massive increases in efficiency. The widespread adoption of data analytics across other industries suggests that legal departments may be forced to adapt faster than they would expect or prefer.

This article describes the evolution of the data landscape in the legal industry and highlights several applications for data analytics in legal. It also prognosticates the continued growth of data analytics in legal practice and the potential repercussions based on observations from other industries.

¹ The views expressed herein are those of the authors, alone.

International In-house Counsel Journal ISSN 1754-0607 print/ISSN 1754-0607 online

SECTION 1: THE EVOLUTION OF THE DATA LANDSCAPE IN LEGAL

In 1965, more than one hundred thousand patent applications were filed in the United States. At that time, the U.S. Patent Office was managing thousands of physical documents, including patents, applications, and the voluminous correspondence between an applicant and an examiner in the course of procuring a patent (called “prosecution history”). IP law firms were also handling large volumes of paper in the course of patent law practice.
Patent attorneys needed to search and copy documents from prior art archives in order to understand the novelty of inventions for the purpose of filing new patent applications. They also had to order hard copies of prosecution histories from the Patent Office in order to evaluate the scope and validity of patents asserted in lawsuits.

Fast forward to today. Patents from all over the world have been digitized, run through optical character recognition (OCR), and persisted in a distributed, indexed database. Search functionality has gone from complex Boolean strings to powerful combinations of natural language and semantic search, class codes, citation networks, and a host of other signals. Most importantly, everyone has free access to this search functionality through the Patent Office’s searchable database and tools like Google Patents (patents.google.com).² These tools have not only raised the quality of the patent attorney’s work product, they have also minimized the amount of attorney time spent on tasks that historically required hours of work and weeks of waiting for hard copies.

Figure 1

Figure 1 describes the stages of development in the example above: (A) collecting all of the relevant physical documents; (B) scanning and digitizing the documents; (C) making the data searchable with tools like Google Patents; (D) visualizing the data with dynamic dashboards like Innography’s PatentIQ; and (E) applying advanced analytics to fully automate decisions. Every other industry has experienced the same sequence of events in achieving data efficiency, and the legal industry is no different. At a higher level, the five stages can be grouped into three principal phases of data evolution: (1) availability, (2) visualization, and (3) automation. This section provides a brief overview of each phase.

² Google Patents makes it easy for anyone to search for patents and prior art from many sources, including the machine-translated full text of patents and applications from many patent offices and results from Google Scholar. It scales from simple searches performed by an individual inventor to extensive prior art and invalidation searches performed by patent examiners, agents, and attorneys. To make it possible to search the growing amount of prior art in less time, it focuses on assisting the user in constructing their search query and on surfacing the most relevant results. Full support for the Cooperative Patent Classification (CPC) scheme has been integrated, with results from Google Scholar machine-classified by CPCs to quickly narrow down non-patent prior art, and CPC autocomplete and result clustering suggestions shown to refine a query.

Availability

As in all other industries, the first step in achieving data-driven efficiency for legal departments is data availability, which entails digitizing and persisting the data of interest in a way that enables the end user to search the data. In some domains, users are so accustomed to having data available that it is easy to take this functionality for granted. For example, consider search functionality within an e-mail account, a capability that most legal professionals use every day. Underlying this feature is a highly complex, distributed database architecture that powers near-instant keyword and Boolean searches across potentially terabytes of text, saving countless hours.

For many legal tasks, data availability has yet to be achieved. For example, it is common for legal contracts to exist solely in .pdf format, even in large in-house departments. This means that in-house counsel are unable to perform searches on the text in the .pdf files or utilize contract metadata that could be valuable for other analyses. Fortunately, many industries (including legal) have spent years establishing best practices to achieve data availability for currently “unavailable” data.
At a high level, the best practices include the following three steps:

1) Identify: Determine the specific data of interest and implement a process for obtaining it. For example, if the data of interest is legal contracts:
   a) Identify where the legal contracts are located (on a hard drive, on a web server, etc.)
   b) Determine the current file format (.pdf, .doc, .odt, .wpd, or .rtf)
   c) Implement a process for ingesting the documents on a regular schedule (an “extract” process, web scraper, etc.)
2) Prepare: Ensure that the data is in a format conducive to persistence in a database. If the data in question is document-based, the “prepare” step might include running optical character recognition (OCR) software to convert text in a .pdf to a machine-readable text format.
3) Persist: Store the cleaned, machine-readable data in a database. A schema is required to identify the specific fields of information to be stored, the relationships between fields, and the indexing needed to enable fast queries.³

There are dozens of technical choices to be made, including on-premise vs. cloud-based hardware, open-source vs. commercial software, and SQL vs. NoSQL database architectures. For proprietary data, an organization may need to follow these steps fully within its own firewall. However, for many use cases, external companies provide data availability in the cloud and allow legal departments to access the data through application programming interfaces (APIs) or manually through software as a service (SaaS) applications.

³ A database “field” is identical to a column in a spreadsheet.

For some legal use cases, simply achieving data availability is sufficient, especially when a user has the ability to combine diverse information from multiple database locations. For example, consider a scenario in which your company is considering an acquisition of three companies and is interested in determining the legal risk that each company faces from patent lawsuits. To perform a quick evaluation of patent risk, an analyst might first want to determine the number of past litigations that each of the three companies has faced, together with revenue information about the plaintiffs in each case. Imagine that the database contains a table of corporate revenue information called financial_data and a table of comprehensive information about patent litigations called patent_litigation_data.
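To make the scenario concrete, the following is a minimal sketch of how the "persist" step and this kind of cross-table question might look, using Python's built-in sqlite3 module. Only the table names financial_data and patent_litigation_data come from the text. Every column name, the index, the helper functions, and the rows (placeholder companies with made-up figures) are assumptions chosen for illustration, not the authors' actual schema or data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the two tables named in the text.

    All column names are hypothetical; the article does not specify a schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE financial_data (
            company_name       TEXT PRIMARY KEY,
            annual_revenue_usd REAL
        );
        CREATE TABLE patent_litigation_data (
            case_id   INTEGER PRIMARY KEY,
            plaintiff TEXT,
            defendant TEXT
        );
        -- Index the column used for filtering so lookups by defendant stay fast.
        CREATE INDEX idx_litigation_defendant
            ON patent_litigation_data (defendant);
        """
    )
    # Placeholder rows with made-up values, purely for illustration.
    conn.executemany(
        "INSERT INTO financial_data VALUES (?, ?)",
        [("Plaintiff X", 5_000_000.0), ("Plaintiff Y", 250_000_000.0)],
    )
    conn.executemany(
        "INSERT INTO patent_litigation_data VALUES (?, ?, ?)",
        [
            (1, "Plaintiff X", "Target A"),
            (2, "Plaintiff Y", "Target A"),
            (3, "Plaintiff X", "Target B"),
            (4, "Plaintiff Z", "Target B"),  # plaintiff with no revenue record
        ],
    )
    conn.commit()
    return conn


def litigation_summary(conn: sqlite3.Connection, targets: list[str]) -> list[tuple]:
    """Return (defendant, number of cases, average plaintiff revenue) per target.

    A LEFT JOIN keeps cases whose plaintiff has no revenue record; AVG simply
    ignores those missing values. Targets with no recorded cases are omitted.
    """
    placeholders = ", ".join("?" for _ in targets)
    query = f"""
        SELECT l.defendant,
               COUNT(*)                  AS num_cases,
               AVG(f.annual_revenue_usd) AS avg_plaintiff_revenue_usd
        FROM patent_litigation_data AS l
        LEFT JOIN financial_data AS f
               ON f.company_name = l.plaintiff
        WHERE l.defendant IN ({placeholders})
        GROUP BY l.defendant
        ORDER BY l.defendant
    """
    return conn.execute(query, list(targets)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in litigation_summary(db, ["Target A", "Target B", "Target C"]):
        print(row)
```

The point of the sketch is that once both tables are persisted with a shared key (here, the plaintiff's company name), a single join answers a question that would otherwise require manually cross-referencing two separate sources.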
