
APPLIED NATURAL LANGUAGE PROCESSING FOR LAW PRACTICE

BRIAN S. HANEY*

INTRODUCTION
I. NATURAL LANGUAGE PROCESSING
II. PREPROCESSING
   A. Text Corpora
   B. Vector Space
III. MODELS
   A. Artificial Neural Networks
   B. Reinforcement Learning
   C. Transformer
IV. APPLICATIONS IN LAW
   A. Question Answering
   B. Document Review
   C. Legal Writing
V. ETHICS
   A. Professional Responsibility
   B. Access to Justice
   C. Automated Labor
CONCLUSION
APPENDIX A. SUMMARY OF NOTATION


Abstract: Scholars, lawyers, and commentators are predicting the end of the legal profession, citing specific examples of artificial intelligence (AI) systems out-performing lawyers in certain legal tasks. Yet, technology’s role in the practice of law is nothing new. The Internet, email, and databases like Westlaw and Lexis have been altering legal practice for decades. Despite technology’s evolution across other industries, in many ways the practice of law remains static in its essential functions. The dynamics of legal technology are defined by the organization and quality of data, rather than innovation. This Article explores the state of the art in AI applications in law practice, offering three main contributions to legal scholarship. First, this Article explores various methods of natural language database generation and normalization. Second, this Article provides the first analysis of two types of models in law practice, deep reinforcement learning and the Transformer. Third, this Article introduces a novel natural language processing algorithm for legal writing.

INTRODUCTION

Since its inception at the Big Bang, the universe has been expanding.1 Similarly, since the dawn of communication, so too has the universe of language been expanding.2 Indeed, similar to the way in which entropy guides the Universe from order to disorder,3 time drives the expansion of language.4 As the Bible tells the story, there was once a time at which the whole world had a common language.5 As a result, humans were incredibly powerful, deciding to build a bridge to the heavens called the Tower of Babel.6 But then God said, “[i]f as one people speaking the same language, they have begun to do this, then nothing they plan to do will be impossible for them. Come, let us go down and confuse their language so they will not understand each other.”7 So, God scattered the people’s language across the Earth, stopping the Tower of Babel’s construction.8 As the late philosopher Zoltan Torey described, “[t]he epigenesis of our highly articulated human language is a fascinating story.”9 But, despite its entropic diffusion across time, Torey argues, “[l]anguage need not be a cognitive trap but can become a liberating passport to ever-deepening insights into the world and conscious mind itself.”10

Interestingly, the intersection of the conscious mind and language is the heart of natural language processing (NLP) studies, a sub-field of artificial intelligence.11 In fact, mastery of language is thought by many to be one of the most difficult tasks for computers to conquer.12 For example, machine learning scholar Ethem Alpaydin argues the driving force of computing technology is the realization that every piece of information can be represented as numbers.13 It follows logically that all information can be processed with computers. The divide between syntax and semantics, however—manifested in a computer’s inability to understand human language—remains one of the most challenging problems in artificial intelligence. Generally, Artificial Intelligence (AI) is any system replicating the thoughtful processes associated with the human mind.14 In fact, machine learning pioneer Paul John Werbos argued that, from an engineering point of view, the human brain itself is simply a computer—an information processing system.15 Werbos further argued the function of any computer as a whole system is to compute its outputs.16 And, many thinkers throughout history have argued the human mind is a machine learning system.17 Indeed, AI scholar Murray Shanahan explains, “[a] person’s browser history and buying habits, together with their personal information, are enough for machine learning algorithms to predict what they’ll buy and how much they’ll pay for it.”18

As a result of increasing advancements in AI and NLP technologies, University of Pittsburgh Professor of Law Kevin Ashley argues, “[a]rtificial Intelligence & Law is a research field that is about to experience a revolution.”19 Ashley is not alone. Scholars, lawyers, and commentators alike are now predicting the end of the legal profession, citing specific examples of computers successfully performing lawyers’ jobs and solving the age-old problems associated with access to justice.20 The impact of technology on legal practice, however, is nothing new.21 The Internet, email, and legal research databases like Westlaw have been impacting legal practice for decades.22 Yet, technology continues to fail in solving problems relating to access to justice.23 Indeed, the impacts technology will have on law practice and on problems like access to justice depend on a variety of factors, including institutional barriers and technological capabilities. Today, NLP is the most commonly used method of AI in the practice of law.

This Article proceeds in five parts. Part I explains natural language processing and its evolution as a field of study from its inception to current state.24 Part II explains the preprocessing phase of NLP tasks, which generally includes methods of data gathering, organization, and modeling.25 Part III explores various machine learning models at the heart of contemporary machine learning research.26 Part IV discusses three applications of the models described in Part III to law practice, including a novel algorithm for legal writing.27 Part V explores three ethical considerations regarding the relationship between AI and law practice.28

© 2020, Brian S. Haney. All rights reserved. * B.A. Washington & Jefferson College. J.D. Notre Dame Law School. Thanks to Angela Elias, Broderick Haney, Brad Haney, Leslie Kaelbling, and Branden Keck for the helpful comments, suggestions, and feedback. 1 STEPHEN HAWKING, A BRIEF HISTORY OF TIME 151 (1996). 2 ZOLTAN TOREY, THE CONSCIOUS MIND 51 (2014). 3 BRIAN GREENE, FABRIC OF THE COSMOS 151 (2005); see also Frederic H. Behr, Jr. et al., Estimating and Comparing Entropy Across Written Natural Languages Using PPM Compression, INSTITUT DE RECHERCHE EN INFORMATIQUE FONDAMENTALE 1, https://www.irif.fr/~dxiao/docs/entropy.pdf (last visited May 8, 2020). 4 See Daniel Martin Katz et al., Legal N-Grams? A Simple Approach to Track the ‘Evolution’ of Legal Language (Dec. 16, 2011), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1971953. 5 Genesis 11:1. 6 See id. 11:4–5. 7 Id. 11:6–7. 8 Id. 11:9. 9 TOREY, supra note 2, at 47 (epigenesis describes a theory of development through gradual differentiation). 10 Id. at 105–106. 11 See Noam Chomsky, Language and Nature, 104 MIND 1, 2 (1995). 12 MAX TEGMARK, LIFE 3.0: BEING HUMAN IN THE AGE OF ARTIFICIAL INTELLIGENCE 90–91 (2017). 13 ETHEM ALPAYDIN, MACHINE LEARNING 2 (2016). 14 Brian S. Haney, The Perils and Promises of Artificial General Intelligence, 45 J. LEGIS. 151, 152 (2018). 15 PAUL JOHN WERBOS, THE ROOTS OF BACKPROPAGATION: FROM ORDERED DERIVATIVES TO NEURAL NETWORKS AND POLITICAL FORECASTING 305 (1994). 16 Id. 17 Id. at 307.

18 MURRAY SHANAHAN, THE TECHNOLOGICAL SINGULARITY 170 (2015). 19 KEVIN D. ASHLEY, ARTIFICIAL INTELLIGENCE AND LEGAL ANALYTICS 3 (2017). 20 Nicholas Barry, Man Versus Machine Review: The Showdown Between Hordes of Discovery Lawyers and a Computer-Utilizing Predictive-Coding Technology, 15 VAND. J. ENT. & TECH. L. 343, 344 (2013); see Drew Simshaw, Ethical Issues in Robo-Lawyering: The Need for Guidance on Developing and Using Artificial Intelligence in the Practice of Law, 70 HAST. L.J. 173, 179 (2018); see also David Colarusso, How an Online Game Can Help AI Address Access to Justice, LAWYERIST (Feb. 17, 2020), https://lawyerist.com/learned-hands-launch/. 21 Dana Remus & Frank Levy, Can Robots Be Lawyers?, 30 GEO. J. LEGAL ETHICS 501, 503 (2017). 22 Id. 23 See Ian Weinstein, Coordinating Access to Justice for Low and Moderate Income People, 20 N.Y.U. J. LEGIS. & PUB. POL’Y 501, 501 (2017) (discussing problems relating to poverty and access to justice). 24 See discussion infra Part I. 25 See discussion infra Part II. 26 See discussion infra Part III. 27 See discussion infra Part IV. 28 See discussion infra Part V.

I. NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is an interdisciplinary field of study with influences from computer science, artificial intelligence, and computational linguistics.29 Defined, NLP is the study of computational linguistics, which includes natural language understanding (NLU) and natural language generation (NLG).30 In other words, NLP uses formal logic to analyze the informal structures of human language.31 Pattern recognition is fundamental to this practice.32 NLP systems learn patterns from a text corpus, which is a body of natural language.33 The ultimate goal is to develop machines which process, understand, and generate language representations as well as humans.34 This is a difficult task, however, because interpreting human language depends on abstract concepts like common sense and real world context to account for language instances like sarcasm and visual cues.35 Thus, NLP endeavors to bridge the divide, enabling computers to analyze syntax and process semantics.36

Modern theories of NLP developed in the 1950s with the seminal work of Noam Chomsky.37 Chomsky’s key insight in Syntactic Structures was the independence of grammar and semantics.38 According to Chomsky, a grammar is a device generating all of the grammatical sequences of the language and none of the ungrammatical ones.39 In other words, grammar should be set up to include the clear sentences and exclude the clear non-sentences.40 Chomsky presents an example of a sentence, which is grammatically correct, but lacks any meaning: “[c]olorless green ideas sleep furiously.”41 Thus, Chomsky concluded grammar is independent of meaning.42

29 Peng Lai Li, Natural Language Processing, 1 GEO. L. TECH. REV. 98, 98 (2016). 30 See id. (“Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics.”). 31 STEVEN BIRD ET AL., NATURAL LANGUAGE PROCESSING WITH PYTHON 39 (2009). 32 Id. at 221. 33 ASHLEY, supra note 19, at 234. 34 See Miles Brundage et al., The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation, ARXIV, Feb. 2018, at 12, https://arxiv.org/pdf/1802.07228.pdf (discussing AI with superhuman level performance). 35 See BIRD ET AL., supra note 31, at 32–33. 36 Lai Li, supra note 29. 37 See NOAM CHOMSKY, SYNTACTIC STRUCTURES 34 (1957); see also Claude E. Shannon, A Mathematical Theory of Communication, 27 BELL SYS. TECH. J. 379, 379–423 (1948). 38 See CHOMSKY, supra note 37, at 17. 39 Id. at 13. 40 Id. at 14. 41 Id. at 15. 42 Id. at 15.

Chomsky opposes probabilistic models of language.43 Instead, he analyzes linguistic description in terms of a system with levels of representations.44 In large part, Chomsky’s preference for rule-based systems of language may have been due to the lack of data and computing resources available in the 1950s and 60s.45 Beginning in the 1980s, NLP research and development began to focus on statistics and probability models.46 Probabilistic language models define a probability distribution over an output space, using adjustable parameters to determine the distribution.47 These strategies developed into the early machine learning techniques deployed in the 1990s.48

Machine learning describes a process by which algorithms improve through experience.49 The central architecture of machine learning is the neural network.50 A neural network is a group of neurons influencing each other’s behavior.51 Neural networks draw inspiration from the biological neocortex.52 A biological neuron consists of dendrites—receivers of various electrical impulses from other neurons—that are gathered in the cell body of a neuron.53 Once the neuron’s cell body has collected enough electrical energy to exceed a threshold amount, the neuron transmits an electrical charge to other neurons in the brain through synapses, structures connecting neurons.54 This transfer of information in the biological brain provides the foundation for artificial neural networks (ANNs) to operate.55 Every ANN has an input layer and an output layer.56 Between the input and output layer, ANNs contain multiple hidden layers of connected neurons.57 The neurons are connected by weight coefficients modeling the strength of synapses in the biological brain.58 Typically, information flows through an ANN’s layers, modeled by matrix calculus, to a final output.59 The output’s accuracy, measured against pre-determined labels, determines whether the weight coefficients need to be updated, or learned, to make more accurate predictions.60 ANNs learn through a process called backpropagation.61 Backpropagation describes the way neural networks are trained to derive meaning from data.62 The backpropagation algorithm’s essential mathematical components include partial derivative calculations and a loss function to be minimized.63 In functional terms, the algorithm adjusts the ANN’s weights to reduce output error.64 The algorithm’s ultimate goal is convergence to an optimal network.65

At the turn of the century the digital revolution was in full swing, bringing increased computation power, more data, and deeper ANNs.66 Deep learning is a process by which neural networks learn from large amounts of data.67 An important notion in deep learning is that the data, not the programmers, drive the operation.68 Defined, data is any recorded information about the world.69 In fact, every two days humans create more data than the total amount of data created from the dawn of humanity until 2003.70 The internet is the driving force behind modern deep learning strategies because it enables humanity to organize and aggregate massive amounts of data.71 As a result, deep learning techniques allow statistics-based language models to demonstrate human-level responses in certain contexts.

The most common language models are described as a probability distribution over all strings in a language.72 In other words, a language model is a formalization of a language’s sentences.73 Other language models have also been theorized and developed. For example, Zoltan Torey described language as a method of communicating percepts.74 According to Torey, “[s]ince percepts are private, first person experiences, they cannot be accessed, handled, or communicated without a carrier.”75 In Torey’s language model, the carrier of percepts is the word, which allows the brain to generate mental experiences.76 In the context of NLP, language learning models can be understood as consisting of two elements: data models and learning methods. The central problem to be solved with NLP in law is how best to reconcile the divide between the syntax and semantics of legal language.
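To make the notion of a probability distribution over strings concrete, the sketch below estimates bigram probabilities from a toy corpus and scores candidate sentences. This is a minimal illustration only: the corpus and sentences are invented, and real language models train on vastly larger corpora and apply smoothing to unseen word pairs.

```python
from collections import Counter

# Toy corpus; in practice this would be a large body of text.
corpus = "the court held that the contract was valid . the court denied the motion .".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Estimate P(w2 | w1) from bigram counts."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

def sentence_prob(sentence):
    """Probability of a word sequence under the bigram model."""
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_prob(w1, w2)
    return p

print(sentence_prob("the court held"))   # 0.25 -- bigrams seen in the corpus
print(sentence_prob("the motion held"))  # 0.0  -- contains an unseen bigram
```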

43 Id. at 17. 44 Id. at 18. 45 See Lai Li, supra note 29, at 99. 46 Id. See also Fang Liu, Assessment of Bayesian Expected Power via Bayesian Bootstrap, ARXIV, May 11, 2017, at 14, https://arxiv.org/abs/1705.04366 (providing an illustration reflecting the state-of-the-art in statistical modeling). Specifically, “the bootstrap-based procedures will appeal to non-Bayesian practitioners given their analytical and computational simplicity and easiness in implementation.” Id. 47 Narges Sharif-Razavian & Andreas Zollmann, An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing, CARNEGIE MELLON UNIV. (2008), http://www.cs.cmu.edu/~zollmann/publications/nonparametric.pdf; see also Lise Getoor et al., Selectivity Estimation Using Probabilistic Models, 461, 462 (2001), https://dl.acm.org/doi/pdf/10.1145/375663.375727 (discussing probabilistic graphical models). 48 See generally WERBOS, supra note 15, at 275 (discussing the theoretical background for derivative calculations as a method for backpropagation). 49 TEGMARK, supra note 12, at 72; see also Emily Berman, A Government of Laws and Not of Machines, 98 B.U. L. REV. 1277, 1278 (2018), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3098995 (machine learning is a strand of artificial intelligence that sits at the intersection of computer science, statistics, and mathematics, and it is changing the world). 50 JOHN D. KELLEHER & BRENDEN TIERNEY, DATA SCIENCE 121 (2018). 51 TEGMARK, supra note 12, at 72. 52 Michael Simon et al., Lola v. Skadden and the Automation of the Legal Profession, 20 YALE J.L. & TECH. 234, 254 (2018). 53 See MOHEB COSTANDI, NEUROPLASTICITY 7 (2016) (diagraming nerve cells). 54 Id. at 9. 55 SEBASTIAN RASCHKA & VAHID MIRJALILI, PYTHON MACHINE LEARNING 18 (2017). 56 KELLEHER & TIERNEY, supra note 50, at 124. 57 ALPAYDIN, supra note 13, at 100. 58 Id. at 88. 59 EUGENE CHARNIAK, INTRODUCTION TO DEEP LEARNING 21 (2018). 60 RASCHKA & MIRJALILI, supra note 55, at 21–22. 61 See Steven M. Bellovin et al., Privacy and Synthetic Datasets, 22 STAN. TECH. L. REV. 1, 18 (2019) (discussing neural networks); see also Katerina Fragkiadaki et al., Figure-Ground Image Segmentation Helps Weakly-Supervised Learning of Objects, in LECTURE NOTES IN COMPUTER SCIENCE, VOL. 6316 (Daniilidis ed., 2010), https://link.springer.com/chapter/10.1007/978-3-642-15567-3_41 (optimizing a conditional likelihood of the image collection given the image bottom-up saliency information). 62 KELLEHER & TIERNEY, supra note 50, at 129. 63 WERBOS, supra note 15, at 275. 64 KELLEHER & TIERNEY, supra note 50, at 127. 65 Id. at 130–131. 66 RICHARD SUSSKIND, TOMORROW’S LAWYERS 11 (2017). 67 Haney, supra note 14, at 157. 68 ALPAYDIN, supra note 13, at 3. 69 Id. at 12. 70 SUSSKIND, supra note 66, at 11. 71 Id. 72 CHARNIAK, supra note 59, at 71. 73 Id. 74 TOREY, supra note 2, at 40. 75 Id. 76 Id.

II. PREPROCESSING

Like all machine learning tasks, language learning starts with problem definition and data collection.77 This initial phase is known as preprocessing.78 The central goal of preprocessing is to manipulate a system’s inputs to enable effective computational processing, but not adversely affect the substantive conclusions derived from the model.79 Developing, organizing, and synthesizing data models are the core of the preprocessing stage, accounting for roughly eighty percent of the project’s time.80 Generally, the preprocessing stage involves organizing, aggregating, and synthesizing two elements, the text corpus and a vector space representation.81 This Part explains the advancement of data models from a text corpus to a vector space model through generation of word vectors.82

77 See David Lehr & Paul Ohm, Playing with the Data: What Legal Scholars Should Learn About Machine Learning, 51 U.C. DAVIS L. REV. 653, 668 (2017). 78 See Matthew J. Denny & Arthur Spirling, Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It, 26 POLITICAL ANALYSIS 168, 168 (2018) (explaining preprocessing). 79 Id. 80 KELLEHER & TIERNEY, supra note 50, at 65.

A. Text Corpora

NLP uses data in the form of a text corpus, which is a body of text commonly stored in various formats including SQL, CSV, TXT, or JSON.83 The majority of time developing a deep learning system is spent on the pre-processing stage, aggregating and organizing the corpus.84 During this initial phase, machine learning researchers gather, organize, and aggregate data to be analyzed by neural networks.85 How the data is organized is in large part dependent on the goal for the deep learning system.86 For example, in a system being developed for predictive purposes the data may be labeled with positive and negative instances of an occurrence.87 The labels allow a supervised learning algorithm to learn how to classify future instances of data, making predictions.88

A critical component of corpora development is the normalization process. Indeed, the normalization process allows the corpora to be consistent, readable, and searchable.89 In general, normalization refers to the reduction of text toward a more basic or simplistic form.90 For example, reducing all the text in a corpus to lowercase form is a method of normalization.91 A second example of normalization is stemming.92 Stemming refers to the process of stripping affixes from words, typically with regular expressions.93 A third method of normalizing a raw text corpus is segmentation.94 Text segmentation is the process of dividing written text into more meaningful units.95 One way this may be accomplished is by representing characters with Boolean values, indicating word breaks.96 Interestingly, the segmentation task may be formulated as a search problem—find the bit string causing the text string to be correctly segmented into words.97 A fourth example of normalization is tokenization, which involves identifying and dividing text strings into tokens, which are generally morphemes for processing.98 In other words, tokenization divides a stream of text into smaller meaningful elements.99 The normalization process supports further preprocessing activity toward the development of a vector space model.

In addition to normalization, other preprocessing tasks include text categorization and tagging.100 Text can be tagged with category labels in a normalized corpus.101 Generally, tagging identifies the part of speech for a specific piece of text.102 Further, n-grams, collocations of word sequences commonly occurring together, may be identified.103 For example, bi-grams are lists of word pairs extracted from a larger text.104 The normalization processes for a particular corpus depend in large part on the particular problem, model, and goals of an application or experiment. After a text corpus is adequately developed with normalization and other pre-processing techniques, it may be vectorized.
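A minimal sketch of these normalization steps in plain Python follows. The sample sentence and the crude suffix-stripping rule are invented for illustration; production systems use tested stemmers and tokenizers rather than a one-line regular expression.

```python
import re

raw = "The Courts DENIED the Appellants' motions, citing Rule 12(b)(6)."

# Case folding: reduce all text to lowercase.
text = raw.lower()

# Tokenization: divide the string into word-level tokens.
tokens = re.findall(r"[a-z0-9()]+", text)

def stem(token):
    # Stemming: strip common affixes with a regular expression
    # (a crude stand-in for a real stemmer such as Porter's).
    return re.sub(r"(ed|ing|s)$", "", token)

normalized = [stem(t) for t in tokens]
print(normalized)
# ['the', 'court', 'deni', 'the', 'appellant', 'motion', 'cit', 'rule', '12(b)(6)']
```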

81 ASHLEY, supra note 19, at 217. 82 Id.; see discussion infra Part II. 83 KELLEHER & TIERNEY, supra note 50, at 9–10. 84 Id. at 65. 85 Id. at 1. 86 BIRD ET AL., supra note 31, at 106; see Serena Yeung et al., End-to-end Learning of Action Detection from Frame Glimpses in Videos, THE IEEE CONFERENCE ON COMPUTER VISION & PATTERN RECOGNITION 2678, 2678–87 (2016), https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Yeung_End-To-End_Learning_of_CVPR_2016_paper.pdf (introducing a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions); see also Olga Russakovsky et al., Best of Both Worlds: Human-machine Collaboration for Object Annotation, THE IEEE CONFERENCE ON COMPUTER VISION & PATTERN RECOGNITION 2121, 2121–31 (2015), https://ieeexplore.ieee.org/document/7298824 (introducing a model that integrates multiple computer vision models with multiple sources of human input in a Markov Decision Process). 87 ALPAYDIN, supra note 13, at 68. 88 Id. 89 BIRD ET AL., supra note 31, at 39. 90 Id. 91 Id. at 107–108. 92 See Kamran Kowsari et al., Text Classification Algorithms: A Survey, 10 INFO. 150, 5 (2019). 93 See BIRD ET AL., supra note 31, at 107 (regular expressions are algorithms defining patterns in text). 94 Id. at 112. 95 Id. 96 Id. at 113. 97 Id. at 114. 98 Id. at 109 (morphemes are fundamental meaningful units of language data which cannot be further sub-divided); see also CHOMSKY, supra note 37, at 31–32 (describing phrase structure). 99 Kowsari, supra note 92, at 4. 100 See Aashish R. Karkhanis & Jenna L. Parenti, Toward an Automated First Impression on Patent Claim Validity: Algorithmically Associating Claim Language with Specific Rules of Law, 19 STAN. TECH. L. REV. 196, 207 (2016). 101 BIRD ET AL., supra note 31, at 227. 102 Id. at 179. 103 Id. at 20. 104 See Kowsari, supra note 92, at 5.

B. Vector Space

Vector space models represent words as real-valued vectors.105 The vector values are associated with abstract features.106 For example, vector values may be associated with information retrieval, document classification, or question answering.107 One critical task for developing vector space models for NLP is creating word embeddings.108 Word embeddings are mappings of words to vectors, allowing deep learning models to computationally process textual information.109 Word embeddings follow the distributional hypothesis, which states that words with similar meanings tend to occur in similar contexts.110 Indeed, word embeddings have been studied as a way to quantify meaning because embedding similarity mirrors meaning similarity.111 In essence, word embeddings are a way to vectorize text corpora for computational processing. To start, word vectorization begins by turning words into floating point numbers, allowing machines to process the information.112 Word vectors are created to allow machines to learn from large datasets.113 Indeed, word embedding development supports vector space model production.114 Vector space models represent words in a multi-dimensional vector space.115 Within this space, words are associated via co-occurrences, the rate at which words co-occur within a defined window.116 The cosine similarity of two vectors is a standard measure of how close the two vectors are to one another.117 The computation for arbitrary-dimension cosine similarity is formally expressed:118

$$\cos(u, v) = \frac{u \cdot v}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}}$$

The cosine similarity is computed for each word with respect to all preceding words in the model.119 Vector space models, however, are blind to synonyms, idioms, and antonyms—which is a significant limitation.120 Yet, vector space models still provide state of the art performance in research and industry.121 A recent paper, GloVe: Global Vectors for Word Representation, made a substantial contribution to NLP research by combining two previous methods of word vectorization, global matrix factorization and local context window methods.122 Global matrix factorization is a method of generating low-dimensional word representations.123 Typically, such methods utilize low-rank approximations to decompose larger matrices, capturing statistical information about a text corpus.124 The main goal for developing local context window methods was to design a system for machines to learn similarities among words.125 These methods train high-dimensional word vectors on large amounts of data, so the model is able to detect similarities in word-usage, which correlate with semantic relationships.126 The GloVe model provides a method of capturing global corpus statistics from vector space models.127 Semantic vector space language models represent each word with a real-valued vector.128 The GloVe paper explains if units of texts have similar vectors in a text frequency matrix, then they tend to have similar meanings.129 Further, the GloVe paper analyzes model properties necessary to produce linear directions of meaning.130 In short, GloVe is a global log-bilinear regression model for learning word representations through an unsupervised learning technique.131
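A minimal sketch of the cosine similarity computation, using NumPy. The three-dimensional vectors below are invented for illustration; real embeddings such as GloVe typically have 50 to 300 dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word vectors, invented for illustration.
contract  = np.array([0.9, 0.1, 0.3])
agreement = np.array([0.8, 0.2, 0.4])
banana    = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(contract, agreement))  # ~0.98: similar contexts
print(cosine_similarity(contract, banana))     # ~0.27: dissimilar contexts
```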

105 Tomas Mikolov et al., Efficient Estimation of Word Representations in Vector Space, ARXIV, Sept. 7, 2013, at 1, https://arxiv.org/pdf/1301.3781.pdf. 106 Jeffrey Pennington et al., GloVe: Global Vectors for Word Representation, STANFORD UNIV. (2014), https://nlp.stanford.edu/pubs/glove.pdf. 107 Id. 108 Hongliang Fei et al., Hierarchical Multi-Task Word Embedding Learning for Synonym Prediction, BAIDU RESEARCH (2019), http://research.baidu.com/Public/uploads/5d71c5a158f32.pdf. 109 See generally Lingpeng Kong et al., A Mutual Information Maximization Perspective of Language Representation Learning, ARXIV, Nov. 26, 2019, at 1, https://arxiv.org/pdf/1910.08350.pdf. 110 Tom Young et al., Recent Trends in Deep Learning Based Natural Language Processing, ARXIV, Feb. 20, 2018, at 2, https://arxiv.org/pdf/1708.02709v5.pdf. 111 Id. 112 CHARNIAK, supra note 59, at 73. A floating point number is a number with an arbitrary, unrestricted number of digits after the decimal. Id. For example, 0.883, 1.45, and 17.989891 are all floating point numbers. Id. 113 See Mikolov et al., supra note 105, at 4; see also Justine T. Kao et al., Nonliteral Understanding of Number Words, 111 PNAS 12002–007 (2014), https://www.pnas.org/content/pnas/111/33/12002.full.pdf. 114 Hongliang Fei et al., Hierarchical Multi-Task Word Embedding Learning for Synonym Prediction, BAIDU RESEARCH 834, 836 (2019), http://research.baidu.com/Public/uploads/5d71c5a158f32.pdf. 115 Id.; see also Pennington et al., supra note 106. 116 See id. 117 CHARNIAK, supra note 59, at 75. 118 Id. 119 Id. at 76. 120 Id. 121 See Pennington et al., supra note 106. 122 Id. 123 See id. 124 Id. 125 See Mikolov et al., supra note 105, at 3. 126 Id. at 5. 127 Pennington et al., supra note 106. 128 Id. 129 Id. 130 Id. 131 Id.

The synthesis of vector space models allows for two important improvements: representing the text corpus numerically and modeling similarity among words.132 The preprocessing stage accounts for the majority of time spent on NLP projects and is arguably the most important.133 Indeed, the data define the machine learning systems.134 Thus, it is critical the data set developed for any particular project is accurate and valid.135 Once the pre-processing stage is complete, machine learning algorithms analyze the data.136 There are various machine learning methods and models employable for the creation of generative language models and other applications suited for law practice.137

III. MODELS

In the last few years, artificial neural networks (ANNs) have shown state-of-the-art performance in NLP tasks.138 In particular, two types of ANNs are most commonly used in research and practice, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).139 Further, when paired with reinforcement learning, a type of machine learning for optimization, CNN and RNN models form deep reinforcement learning (DRL) algorithms, which produce superior results.140 Most recently, memory-based models including the Attention Mechanism and the Transformer have arguably changed the field of NLP completely.141 This Part begins by discussing ANNs, followed by DRL, and finally the Attention Mechanism and the Transformer.

A. Artificial Neural Networks

Artificial Neural Networks (ANNs) are at the heart of modern deep learning methods.142 Indeed, ANNs are essentially a function which learns an association of information.143 One major difficulty for AI systems is modeling and understanding the creativity associated with language.144 Interestingly, this problem stems from a lack of ability to associate language meaning in context, due to the difficulties in aligning syntax and semantics.145 As such, ANN models are particularly popular in AI and NLP because of their associative capabilities.146 In particular, two types of ANNs are commonly used in NLP: Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).147

An RNN is an ANN tailored for sequential series of information in which the output contributes to its own input.148 RNNs were developed to allow for an artificial memory mechanism to improve the quality of machine learning methods in NLP.149 Indeed, the term recurrent refers to the way in which the network processes information with a dependency on preceding calculations.150 The memory mechanism is inspired by a biological counterpart in the human brain.151 In the brain, memories are formed by the strengthening of synaptic connections.152 As such, RNNs work by strengthening the relationships between certain nodes in the network through a recurrent feed-forward model.153 Interestingly, RNNs only have one hidden layer, but they also use a replay buffer for memory.154 The depth of an RNN arises from the fact that the memory vector is propagated forward and improved through each input sequence.155 In general, RNNs are appropriate for problems where specific prior nodes influence later nodes in the network,156 because RNNs process sequences of data one element at a time.157 Thus, RNNs are frequently used for language-modeling in particular because language learning is often defined through a problem framework requiring memory.158 The task of updating the network’s weights, representing synapses, is solved with brute force.159 The overall technique is called backpropagation, which takes in a window size and computes error.160 A commonly used backpropagation rule in NLP is the Chain Rule, which states:161

$$\lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

Here, $y$ is a function of $u$, and $u$ is a function of $x$.162 The derivative of $y$ with respect to $x$ is:163

$$\lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x}$$

In other words, the Chain Rule takes the product of the derivative of $y$ with respect to $u$ and the derivative of $u$ with respect to $x$.164 In short, the Chain Rule allows the RNN to update the weights of its network so that it may learn the appropriate associations of syntax and semantics.

In addition to RNNs, Convolutional Neural Networks (CNNs) are also commonly used in NLP tasks.165 Like RNNs, CNNs draw inspiration in design from the biological brain. Indeed, CNNs are modeled based upon the biological visual cortex.166 The biological visual cortex is composed of receptive fields made up of cells that are sensitive to small sub-regions of the visual field.167 In a CNN, these small sub-regions are modeled with a kernel, as described by the model below.168

[Figure: kernel model (omitted in source).]

A kernel is a small square matrix that is applied to each element of the input matrix.169 Further, in a CNN, a neuron’s response to a stimulus in its receptive field is modeled with a mathematical convolutional operation, similar to the way in which light is convoluted by the eye as it passes through the lens to the retina.170 Convolution is a mathematical operation for classification, relying on matrix multiplication between certain kernels and the network’s later layers.171 The convolutional operation allows CNNs to classify objects based upon their similarity.172 Indeed, every CNN contains at least one convolution layer, a layer whose parameters are learnable kernels.173 Each kernel is convolved across an input matrix and the resulting output is called a feature map.174 The full output of the layers is obtained by stacking all of the feature maps to create dimensionality.175 In a CNN, a window is defined over a smaller input space and the units are connected to a small subset of the inputs.176 In other words, the kernel is centered over a subset of the input matrix and then multiplied for the purpose of feature abstraction.177 The process of learning to optimize functions is the core of both RNNs and CNNs and is achieved by learning the appropriate set of weights for the connections in the network.178 When combined with a reinforcement learning algorithm, both CNNs and RNNs function as prediction models for actions in a deep reinforcement algorithm.179
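A minimal sketch of the convolution operation just described, using NumPy: a kernel slides over an input matrix, and the summed element-wise products form the feature map. The input and kernel values are invented for illustration.

```python
import numpy as np

def convolve2d(matrix, kernel):
    """Slide the kernel over the input matrix and return the feature map."""
    m, n = matrix.shape
    k = kernel.shape[0]
    out = np.zeros((m - k + 1, n - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the kernel and the window, summed.
            out[i, j] = np.sum(matrix[i:i + k, j:j + k] * kernel)
    return out

# Toy 4x4 input and a 2x2 kernel that responds to vertical transitions.
x = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
kernel = np.array([[1., -1.],
                   [1., -1.]])

print(convolve2d(x, kernel))  # 3x3 feature map highlighting vertical edges
```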

132 ASHLEY, supra note 19, at 108. 133 KELLEHER & TIERNEY, supra note 50, at 65. 134 ALPAYDIN, supra note 13, at 12. 135 Id. at 156. 136 Id. at 104. 137 See Young et al., supra note 110, at 1. 138 Id. 139 Id. 140 TEGMARK, supra note 12, at 85. 141 See generally Volodymyr Mnih et al., Recurrent Models of Visual Attention, ARXIV, June 24, 2014, at 1, https://arxiv.org/pdf/1406.6247.pdf; see also Alec Radford et al., Language Models Are Unsupervised Multitask Learners (2019), https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. 142 KELLEHER & TIERNEY, supra note 50, at 127. 143 Id.; see also Layla El Asri et al., Frames: A Corpus for Adding Memory to Goal Oriented Dialogue Systems, ARXIV, Apr. 13, 2017, at 14, https://arxiv.org/pdf/1704.00057.pdf (“We propose adding memory as a first milestone towards goal-oriented dialogue systems that support more complex dialogue flows.”). 144 See NOAM CHOMSKY, ASPECTS OF THE THEORY OF SYNTAX 6 (1965) (“Within traditional linguistic theory, furthermore, it was clearly understood that one of the qualities that all languages have in common is their ‘creative’ aspect.”). 145 CHOMSKY, supra note 37, at 15. 146 See Nal Kalchbrenner et al., A Convolutional Neural Network for Modeling Sentences, ARXIV, Apr. 8, 2014, at 1, https://arxiv.org/pdf/1404.2188.pdf (“The aim of a sentence model is to analyse and represent the semantic content of a sentence for purposes of classification or generation.”). 147 Young et al., supra note 110, at 1. 148 CHARNIAK, supra note 59, at 82. 149 Id. at 83. 150 Young et al., supra note 110, at 7. 151 COSTANDI, supra note 53, at 55; see also Mihika Prabhu et al., A Recurrent Ising Machine in a Photonic Integrated Circuit, ARXIV, Sept. 30, 2019, at 2–5, https://arxiv.org/pdf/1909.13877.pdf (experimentally demonstrating a photonic recurrent model on a quantum computer); see also Brian S. Haney, AI Patents: A Data Driven Approach, 19 CHI.-KENT J. INTELL. PROP. (forthcoming 2020), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3527154. 152 Id. 153 CHARNIAK, supra note 59, at 83. 154 JOHN D. KELLEHER, DEEP LEARNING 170–171 (2019). 155 Id. at 172. 156 Kalchbrenner et al., supra note 146, at 3; see also Serena Yeung et al., Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos, ARXIV, June 9, 2017, at 7–11, https://arxiv.org/pdf/1507.05738.pdf (modeling multiple dense labels benefits from temporal relations within and across classes). 157 KELLEHER, supra note 154, at 172. 158 CHARNIAK, supra note 59, at 83. 159 Id. at 84. 160 Id. (a window size is a defined numerical sequence of words). 161 Chain Rule, MIT, https://ocw.mit.edu/courses/mathematics/18-01sc-single-variable-calculus-fall-2010/1.-differentiation/part-a-definition-and-basic-rules/session-11-chain-rule/MIT18_01SCF10_Ses11a.pdf (last visited May 19, 2020). 162 Id. 163 Id. 164 Id. 165 See Yoon Kim, Convolutional Neural Networks for Sentence Classification, ARXIV, Sept. 3, 2014, at 1, https://arxiv.org/pdf/1408.5882.pdf. 166 Manon Legrand, Deep Reinforcement Learning for Autonomous Vehicle Control Among Human Drivers 22–23 (2017) (unpublished M.S. thesis, Université Libre de Bruxelles), https://ai.vub.ac.be/sites/default/files/thesis_legrand.pdf. 167 Brian S. Haney, The Future of Autonomous Vehicles & Liability Theory, 29 ALB. L.J. SCI. & TECH. (forthcoming 2020). 168 See Legrand, supra note 166, at 23; see also Guoyun Tu et al., A Multi-task Neural Approach for Emotion Attribution, Classification, and Summarization, ARXIV, July 24, 2019, at 1, https://arxiv.org/pdf/1812.09041.pdf. 169 CHARNIAK, supra note 59, at 52. 170 Legrand, supra note 166, at 23; see also Carl Zimmer, The Brain: Our Strange, Important, Subconscious Light Detectors, DISCOVER (Feb. 15, 2012), https://www.discovermagazine.com/mind/the-brain-our-strange-important-subconscious-light-detectors?b_start:int=0&-C=. The retina consists of thin layers of light sensitive tissue. The retina transfers electrical signals across the optic nerve to the occipital lobe, where the image is transposed in the visual cortex, the visual processing center of the human brain. 171 ALPAYDIN, supra note 13, at 102. 172 Kabita Thaoroijam, A Study of Document Classification Using Machine Learning Techniques, 11 INT’L J. COMPUTER SCI. ISSUES 217, 217 (2014); see also Olga Russakovsky et al., Object-Centric Spatial Pooling for Image Classification (2012), http://ai.stanford.edu/~olga/papers/eccv12-OCP.pdf; see also Fragkiadaki et al., supra note 61 (optimizing a conditional likelihood of the image collection given the image bottom-up saliency information). 173 See Damien Matti, Combining LiDAR Space Clustering and Convolutional Neural Networks for Pedestrian Detection, ARXIV, Oct. 17, 2017, at 3, https://arxiv.org/pdf/1710.06160.pdf. 174 Legrand, supra note 166, at 24. 175 Kalchbrenner et al., supra note 146; see also Katerina Fragkiadaki et al., Grouping-Based Low-Rank Trajectory Completion and 3D Reconstruction, CARNEGIE MELLON UNIV. (2014), https://www.cs.cmu.edu/~katef/papers/NIPS2014_NRSFM.pdf. 176 ALPAYDIN, supra note 13, at 101; see also Ava P. Soleimany et al., Image Segmentation of Liver Stage Malaria Infection with Spatial Uncertainty Sampling, ARXIV, Nov. 30, 2019, at 1, https://arxiv.org/pdf/1912.00262.pdf (discussing CNN applications for visual system recognition). 177 Legrand, supra note 166, at 23. 178 KELLEHER, supra note 154, at 161. 179 See Serena Yeung et al., A Computer Vision System for Deep Learning-Based Detection of Patient Mobilization Activities in the ICU, 2 NPJ DIGITAL MED. 1, 1 (2019) (introducing an algorithm for detection of mobility activity occurrence); see also Serena Yeung et al., Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos, ARXIV, June 9, 2017, at 10, https://arxiv.org/pdf/1507.05738.pdf (modeling multiple dense labels benefits from temporal relations within and across classes). 180 SUSSKIND, supra note 66, at 14; see also Leslie Pack Kaelbling et al., Reinforcement Learning: A Survey, 4 J. ARTIFICIAL INTELLIGENCE RES. 237, 237 (1996) (surveying the field of reinforcement learning). 181 See Volodymyr Mnih et al., Human-Level Control Through Deep Reinforcement Learning, 518 NATURE INT’L J. SCI. 529, 529 (2015); see also Fragkiadaki et al., supra note 175 (exploring how an agent can be equipped with an internal model of the dynamics of the external world, and how it can use this model to plan novel actions by running multiple internal simulations). 182 ALPAYDIN, supra note 13, at 127; see also Leslie Pack Kaelbling et al., Goals As Parallel Program Specifications, ASSOC. FOR THE ADVANCEMENT OF ARTIFICIAL INTELLIGENCE 60, 60 (1988), https://www.aaai.org/Papers/AAAI/1988/AAAI88-011.pdf?ref=Guzels.TV. 183 RICHARD S. SUTTON & ANDREW G. BARTO, REINFORCEMENT LEARNING: AN INTRODUCTION 3 (2017).

B. Reinforcement Learning

Richard Susskind argues, “one of the most exciting possibilities in legal technology is the use of reinforcement learning in developing systems in law.”180 At its core, reinforcement learning is an optimization algorithm.181 In short, reinforcement learning is a type of machine learning concerned with learning how an agent should behave in an environment to maximize a reward.182 Agents are the software programs making intelligent decisions.183 Generally, reinforcement learning algorithms contain three elements:

(1) Model: the description of the agent-environment relationship;
(2) Policy: the way in which the agent makes decisions; and
(3) Reward: the agent’s goal.184

The fundamental reinforcement learning model is the Markov Decision Process (MDP).185 The MDP model was developed by the Russian mathematician Andrey Markov in 1913.186 Interestingly, Markov’s work over a century ago remains the state-of-the-art in AI today.187 The model below describes the agent-environment interaction in an MDP:188

[Figure: agent-environment interaction in an MDP (omitted in source).]

The environment is made up of states for each point in time in which the environment exists.189 The learning begins when the agent takes an initial action selected from the first state in the environment.190 Once the agent selects an action, the environment returns a reward and the next state.191 The second element of the reinforcement learning framework is the policy. Generally, the goal of the agent is to interact with its environment according to an optimal policy.192 A policy is the way in which an agent makes decisions or chooses actions within a state.193 In other words, the agent chooses which action to take when presented with a state based upon the agent’s policy.194 Intuitively, a greedy person has a policy that routinely guides their decision making toward acquiring the most wealth. The goal of the policy is to allow the agent to advance through the environment to maximize a reward.195 The reward is the third element of the reinforcement learning framework. Ultimately, the purpose of reinforcement learning is to maximize an agent’s reward.196 Nonetheless, the reward itself is defined by the algorithm’s designer.197 For each action the agent takes in the environment, a reward is returned.198 There are various ways of defining reward based upon the specific application.199 But generally, the reward is associated with the final goal of the agent.200 For example, in a trading algorithm, the reward is money.201 In sum, reinforcement learning programs learn good policies for sequential decision problems by optimizing a cumulative future reward.202

Interestingly, many scholars argue that the human mind is a reinforcement learning system.203 And, reinforcement learning algorithms add substantial improvements to deep learning models, especially when the two models are combined.204 Deep reinforcement learning is an intelligence technique integrating deep learning and reinforcement learning.205 MIT professor Max Tegmark suggests deep reinforcement learning was developed by Google in 2015.206 Earlier scholarship, however, explores and explains the integration of neural networks in the reinforcement learning paradigm.207 In fact, the literature on neural networks and reinforcement learning algorithms dates back to the early 1990s and Harvard scholar Paul John Werbos’ work in political forecasting and brain modeling.208 Arguably, deep reinforcement learning is a method of general intelligence because of its theoretic capability to solve any continuous control task.209 For example, deep reinforcement learning algorithms drive state-of-the-art autonomous vehicles.210 But deep reinforcement learning algorithms show poorer performance on other types of tasks, like writing, because mastery of human language is—for now—not describable as a continuous control problem. Regardless of its scalable nature toward general intelligence, deep reinforcement learning is a powerful AI.211

There are two types of deep reinforcement learning algorithms: on-policy and off-policy.212 Deep reinforcement learning algorithms that don’t use old data to learn are called on-policy algorithms.213 On-policy algorithms directly optimize a goal and do not use old data to calculate the updates.214 Alternatively, off-policy deep reinforcement learning algorithms are able to re-use and learn from old data.215 Typically, off-policy algorithms use Bellman equations for optimality.216 More generally, there are three frameworks for deep reinforcement learning: (1) action-value, which involves neural networks’ prediction values for actions in a state space; (2) policy gradient, which involves optimizing policies via a neural network and gradient methods; and (3) actor-critic, which involves two neural networks working together to optimize an outcome.217 As research in deep reinforcement learning grows rapidly, however, so too do the models being explored.218 One of the most influential and interesting developments in recent NLP scholarship is the Transformer.219
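To ground the three elements above—model, policy, and reward—consider a toy sketch of tabular Q-learning, an off-policy method whose update uses the Bellman equation. The four-state chain environment, reward scheme, and hyperparameters are all invented for illustration; deep reinforcement learning replaces the Q-table with a neural network.

```python
import random

# Toy MDP (the model): states 0-3 in a chain; action 0 moves left,
# action 1 moves right. Reaching state 3 returns reward 1 and ends the episode.
N_STATES, N_ACTIONS = 4, 2

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(3, state + 1)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3

# Q-table: the agent's estimate of cumulative future reward per (state, action).
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy policy: mostly exploit, occasionally explore.
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Bellman update toward reward plus discounted future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # Q-values now favor moving right, toward the rewarding state
```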

184 Katerina Fragkiadaki, Deep Q Learning, CARNEGIE MELLON UNIV. (2018), https://www.cs.cmu.edu/~katef/DeepRLFall2018/lecture_DQL_katef2018.pdf. 185 Haney, supra note 14, at 160–161. 186 See Gely P. Basharin et al., The Life and Work of A.A. Markov, 386 LINEAR ALGEBRA AND ITS APPLICATIONS 1, 15 (2004). 187 GEORGE GILDER, LIFE AFTER GOOGLE 75 (2018). 188 SUTTON & BARTO, supra note 183, at 54 (model created by author based on illustration at the preceding citation). 189 ALPAYDIN, supra note 13, at 126–127. 190 SUTTON & BARTO, supra note 183, at 53. 191 MYKEL J. KOCHENDERFER, DECISION MAKING UNDER UNCERTAINTY 77 (2015). 192 Id. at 79. 193 Id. 194 SUTTON & BARTO, supra note 183, at 7. 195 WERBOS, supra note 15, at 311. 196 SUTTON & BARTO, supra note 183, at 50. 197 NICK BOSTROM, SUPERINTELLIGENCE: PATHS, DANGERS, STRATEGIES 239 (2017). 198 KOCHENDERFER, supra note 191, at 77. 199 BOSTROM, supra note 197. 200 MAXIM LAPAN, DEEP REINFORCEMENT LEARNING HANDS-ON 3 (2018). 201 Id. at 217. 202 Hado van Hasselt et al., Deep Reinforcement Learning with Double Q-Learning, ARXIV, Dec. 8, 2015, at 1, https://arxiv.org/pdf/1509.06461.pdf. 203 WERBOS, supra note 15, at 307. 204 ALPAYDIN, supra note 13, at 136. 205 Brian S. Haney, Applied Artificial Intelligence in Modern Warfare & National Security Policy, 11 HASTINGS SCI. & TECH. L.J. 61, 70 (2020). 206 TEGMARK, supra note 12, at 85. 207 WERBOS, supra note 15, at 306–308. 208 Id. 209 TEGMARK, supra note 12, at 85. 210 See generally Legrand, supra note 166 (discussing vehicle control methods using deep reinforcement learning). 211 Proximal Policy Optimization, OPENAI, https://spinningup.openai.com/en/latest/algorithms/ppo.html (last visited May 19, 2020); see also Brian S. Haney, Deep Reinforcement Learning Patents: An Empirical Survey 1, 31 (2020), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3570254. 212 Id. 213 Id. 214 John Schulman et al., Proximal Policy Optimization Algorithms, ARXIV, Aug. 28, 2017, at 2, https://arxiv.org/pdf/1707.06347.pdf. 215 Arpit Agarwal et al., Model Learning for Look-Ahead Exploration in Continuous Control, ARXIV, Nov. 20, 2018, at 4, https://arxiv.org/pdf/1811.08086.pdf. 216 van Hasselt et al., supra note 202. 217 Shixun You et al., Deep Reinforcement Learning for Target Searching in Cognitive Electronic Warfare, 7 IEEE ACCESS 37432, 37438 (2019). 218 See generally Sergey Ivanov & Alexander D’yakonov, Modern Deep Reinforcement Learning Algorithms, ARXIV, July 6, 2019, at 1039, https://arxiv.org/pdf/1906.10025.pdf (surveying a range of deep reinforcement learning algorithms). 219 See discussion infra Part III.C. See generally Ashish Vaswani et al., Attention Is All You Need, ARXIV, Dec. 6, 2017, https://arxiv.org/pdf/1706.03762.pdf (introducing the Attention Mechanism as a model for natural language processing).

C. Transformer

In 2017, a team of researchers from Google and the University of Toronto published the paper, Attention Is All You Need.220 The paper introduced a novel model architecture, the Transformer.221 Rather than using RNNs or CNNs, the Transformer utilizes an autoencoder with an attention mechanism.222 Autoencoders are double-ended neural networks, comprised of an encoder and a decoder, which predict both inputs and outputs for a given word.223 The attention mechanism encodes and stores a series of hidden vectors, which are decoded to generate new text.224 Combining the autoencoder and attention mechanism, the Transformer contains three key features:225

(1) In encoder-decoder attention layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.226
(2) The encoder contains self-attention layers.227
(3) Self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder, up to and including that position.228

The attention mechanism developed based on the realization that human perception does not tend to process a whole scene in its entirety at once.229 In general, an attention function can be described as a vectorized mapping of a query and a set of key-value pairs to an output.230 The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function.231 Self-attention is an attention mechanism relating different positions of a single sequence to compute a representation of the sentence.232 Further, self-attention models perform a variety of tasks including reading comprehension, abstractive summarization, and learning task-independent sentence representations.233
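A minimal sketch of the scaled dot-product attention function from Attention Is All You Need, using NumPy. The token vectors are randomly generated for illustration; in a real Transformer, queries, keys, and values are learned linear projections of the input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key compatibility function
    weights = softmax(scores)        # attention weights sum to 1 per query
    return weights @ V               # weighted sum of the values

# Toy example: 3 tokens with 4-dimensional representations (values invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out = attention(X, X, X)  # self-attention: Q, K, V drawn from one sequence
print(out.shape)          # (3, 4)
```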

220 See id. 221 Id. at 2. 222 Id. at 1–2. 223 LAPAN, supra note 200, at 309. 224 Young et al., supra note 110, at 14. 225 Vaswani et al., supra note 219, at 5. 226 Id. 227 Id. 228 Id. 229 Mnih et al., supra note 141, at 1. 230 Vaswani et al., supra note 219, at 3. 231 Id. at 3. 232 Id. at 2. 233 Id.

An autoencoder is a type of neural network trained to reconstruct its input at its output.234 Because there are fewer intermediary hidden units than inputs, the network is forced to learn a short, compressed representation at the hidden units, which can be interpreted as a process of abstraction.235 According to machine learning scholar Ethem Alpaydin, language understanding is a process of encoding where, from a given sentence, we extract high order abstraction.236 And, language generation is a process of decoding where natural language sentences are synthesized from higher order representations.237 In sum, the Transformer is the first NLP model relying on self-attention and an autoencoder to compute representations of its input and output without using RNNs or CNNs.238

The Transformer has been used by both research teams at Google and OpenAI.239 Further, the Transformer has been at the heart of research aimed at developing a general language model.240 OpenAI’s transformer is a Generative Pre-Training Model (GPT-2).241 Google’s transformer is a Bidirectional Encoder Representation from Transformers (BERT).242 GPT-2 is a large-scale unsupervised language model that generates paragraphs of text, first announced by OpenAI in February 2019.243 Multitask learning is a promising framework for improving general performance.244 Prior NLP systems needed hundreds to thousands of examples to induce functions that generalize well.245 This suggests multitask training may need many effective training pairs to realize its full potential.246 The GPT-2 model connects these two lines of work, continuing the trend toward more general transfer methods.247 GPT-2 performs this connection through unsupervised multitask learning.248 There are four GPT-2 variants; the smallest is 124 million parameters and the largest is 1.5 billion parameters.249 The largest version of GPT-2 released to the public, however, is an intermediate 774 million parameter model. OpenAI, which recently accepted a $1 billion investment from Microsoft, keeps the largest model proprietary.250

In addition to OpenAI, Google has also released a transformer model, BERT. BERT addresses the unidirectional constraints of models like GPT-2 by proposing a new pre-training objective: the masked language model.251 The masked language model (MLM) randomly hides some of the words from the input, and the objective is to predict the original hidden word based only on context.252 The MLM objective allows the representation to fuse the left and the right contextual information into one model.253 In turn, this supports pre-training a deep bidirectional Transformer on various text corpora.254 Although the MLM concept is not new,255 BERT is the first model to use it to pre-train a deep bidirectional network, which allows for applications to several NLP tasks.256 A key difference between GPT-2 and BERT is BERT’s bidirectional nature, compared to the unidirectional nature of GPT-2.257 Although BERT and GPT-2 both show state-of-the-art performance on many language tasks, their application in the law remains to be seen. Further, successful implementation of both systems generally requires cloud computing resources, due to the massive amount of data the Transformer requires.258 There are many ways in which NLP is impacting the legal industry and many AI applications developing for the improved practice of law.
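As a rough illustration of the masked language model objective, consider a sketch using the open-source Hugging Face transformers library. This assumes the library is installed and a pre-trained bert-base-uncased model can be downloaded; the sentence is invented, and the completions and scores will vary by model version.

```python
from transformers import pipeline

# Load a pre-trained BERT model behind a fill-mask pipeline.
fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the hidden word from both left and right context.
for candidate in fill("The court granted the [MASK] for summary judgment."):
    print(candidate["token_str"], round(candidate["score"], 3))
# Plausible completions include words like "motion".
```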

There are many ways in which NLP is impacting the legal industry, and many AI applications are being developed to improve the practice of law.

IV. APPLICATIONS IN LAW

Machine learning applications for NLP are constantly improving and increasing.259 Yet human language remains an inherently complex form of information representation, including lexical, syntactic, and semantic rules at different levels.260

249 Radford et al., supra note 141. 250 Stephen Nellis, Microsoft to Invest $1 Billion in OpenAI, REUTERS (July 22, 2019), https://www.reuters.com/article/us-microsoft-openai/microsoft-to-invest-1-billion-in-openai-idUSKCN1UH1H9. 251 Devlin, supra note 239. 252 Id. 253 Id. 254 Id. 255 CHOMSKY, supra note 37, at 19. 256 Devlin, supra note 239. 257 Id.; see also Radford et al., supra note 141. 258 See generally ALPAYDIN, supra note 13, at 152; see also BERT FineTuning with Cloud TPU: Sentence and Sentence-Pair Classification Tasks, GOOGLE (2019), https://cloud.google.com/tpu/docs/tutorials/bert. 259 ALPAYDIN, supra note 13, at 68.

In addition, language is articulated and understood subjectively, depending on a variety of contextual cues.261 Thus, NLP's central purpose is converting informal textual structures into formal representations computers can understand and analyze.262 This Part explores three applications of NLP systems in law: question answering, document review, and legal writing.

A. Question Answering

A question-answering (Q&A) system searches a large text collection and finds a short phrase or sentence that precisely answers a user's question.263 For example, in a dialogue system the question is first encoded to an abstract level, which is then decoded as the response to the question.264 A Q&A system's essential task is information extraction.265 Information extraction refers to summarizing the essential details particular to a given document.266 Indeed, in a Q&A system, information is extracted from a larger body of information in accordance with certain search terms and returned to the end user.267 One standardized measurement for Q&A system performance is the Stanford Question Answering Dataset (SQuAD), one of the world's largest open source datasets for NLP tasks.268 The original SQuAD was published in 2016, and a more recent version, SQuAD 2.0, was published in 2018.269 One of the main difficulties SQuAD 2.0 sought to address was the problem of identifying when a model lacks sufficient information to answer a question.270 The SQuAD 2.0 dataset consists of Wikipedia data, along with question and answer pairs relating to that data.271 Although NLP models trained on SQuAD are continuously improving, the SQuAD 2.0 paper specifically states, "these systems are still far from true language understanding."272
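As a concrete sketch of the extractive Q&A task SQuAD measures, the snippet below asks a reading comprehension model to pull an answer span out of a short passage. The Hugging Face transformers library and the DistilBERT checkpoint fine-tuned on SQuAD are illustrative assumptions, not tools the SQuAD papers prescribe.

# Extractive question answering in the SQuAD style: the model returns the
# span of the passage most likely to answer the question, with a confidence
# score, rather than generating free text.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

passage = ("Rule 26(a) requires the parties to produce all documents, "
           "electronically stored information, and tangible things to be "
           "used in the course of litigation.")

result = qa(question="What must the parties produce?", context=passage)
print(result["answer"], round(result["score"], 3))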

260 Id. 261 NOAM CHOMSKY, LANGUAGE AND MIND 17 (2006). 262 John Nay, Natural Language Processing and Machine Learning for Law and Policy Texts 1 (Dec. 18, 2019), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3438276. 263 ASHLEY, supra note 19, at 4–5. 264 ALPAYDIN, supra note 13, at 109. 265 ASHLEY, supra note 19, at 5. 266 Id. 267 Id. 268 See Pranav Rajpurkar et al., SQuAD: 100,000+ Questions for Machine Comprehension of Text, ARXIV, Oct. 11, 2016, at 1, https://arxiv.org/pdf/1606.05250.pdf. 269 See id.; see also Pranav Rajpurkar, Know What You Don't Know: Unanswerable Questions for SQuAD, ARXIV, June 11, 2018, at 1, https://arxiv.org/pdf/1806.03822.pdf. 270 Rajpurkar et al., supra note 268; Rajpurkar, supra note 269. 271 Rajpurkar, supra note 269. 272 Rajpurkar et al., supra note 268.

The bleeding edge in NLP research uses the BERT model with SQuAD.273 In fact, virtually all of the top performing models on the SQuAD dataset were developed with BERT.274 Thus, some claim BERT is now foundational to the state of the art in machine reading comprehension.275 According to recent scholarship, a remaining challenge in Q&A tasks is the ability of NLP models to make inferences and reason about information.276 Unlike other models and data, scholars argue there is a certain uniqueness about legal information requiring more complex analysis.277 Indeed, much legal scholarship is devoted to the complex knowledge representations associated with legal reasoning.278 Interestingly, a recent study modeled legal question answering specifically as a classification task.279 The study used a CNN to develop a legal Q&A model for questions relating to the Japanese Civil Code.280 Legal technologist Richard Susskind argues AI-driven legal Q&A will increase everyday citizens' access to the legal system, closing the access to justice gap.281 According to Susskind, with NLP, expert Q&A systems will be able to: (1) understand legal problems expressed through natural language; (2) analyze and classify fact patterns inherent in problems; (3) draw legal conclusions and offer advice; and (4) express guidance as to a course of action.282 Similarly, Kevin Ashley argues "[l]egal QA could be a great boon to making legal knowledge more accessible."283 Nevertheless, Susskind and Ashley are patently misguided about the nature of legal question answering systems and the role they will have in the future of our legal system. In the context of legal Q&A, it is important to note that most decisions humans make are made unconsciously, rather than as a result of conscious deliberation.284

273 Wei Yang et al., End-to-End Open-Domain Question Answering with BERTserini, ARXIV, Sept. 18, 2019, at 1, https://arxiv.org/abs/1902.01718. 274 Sam Schwager & John Solitario, Question and Answering on SQuAD 2.0: BERT Is All You Need, STANFORD UNIV. 1 (2019), https://web.stanford.edu/class/cs224n/reports/default/15812785.pdf. 275 Id. 276 Yuwen Zhang & Zhaozhuo Xu, BERT for Question Answering on SQuAD 2.0, STANFORD UNIV. 1 (2019), http://web.stanford.edu/class/cs224n/reports/default/15848021.pdf. 277 Phong-Khac Do et al., Legal Question Answering Using Ranking SVM and Deep Convolutional Neural Network, ARXIV, Mar. 16, 2017, at 1, https://arxiv.org/pdf/1703.05320.pdf. 278 ASHLEY, supra note 19, at 27. 279 Phong-Khac Do et al., supra note 277. 280 Id. 281 SUSSKIND, supra note 66, at 54–55. 282 Id. at 55. 283 ASHLEY, supra note 19, at 27.

And, the Chicago School's law and economics scholarship casts serious doubt on the extent to which legal syntax affects legal decisions.285 Further, due to the adversarial nature of law, the extent to which legal questions may be answered affirmatively is unclear.286 The unfortunate reality of the modern legal system is that law is an ad hoc, developing structure, depriving those of lower socioeconomic status of freedom. In other words, simply having an answer to a legal question does not improve access to justice when defense counsel plays golf with the judge or the prosecutor worked on the judge's campaign. Law, at its core, is a business in which money wins. And, as legal technology evolves, so too does the gap in wealth and access to justice grow. Perhaps more importantly, the state of the art in legal question answering technology is far from providing any more valuable insight than a simple Google search. As a result, legal Q&A is not a promising application of NLP in law practice. Other applications do show promise, however, from both economic and technology-based perspectives.

B. Document Review

The most lucrative, straightforward, and commonly used NLP application in law practice is document review.287 A document is a class of information containing significant objects.288 Indeed, documents are important because they are considered evidence.289 General descriptions of documents serve three functions: characterization, representation, and relational mapping.290 In litigation, during the discovery process, adverse parties are often required to produce documents relevant to the litigation pursuant to Rule 26 of the Federal Rules of Civil Procedure (FRCP).291 Indeed, Rule 26(a) requires the parties to produce all "documents, electronically stored information, and tangible things" to be used in the course of litigation.292

284 Andrew Campbell et al., Why Good Leaders Make Bad Decisions, HARV. BUS. REV. (2009). 285 See Richard A. Posner, The Economic Approach to Law, 53 TEX. L. REV. 757, 774 (1975) ("…the criticism that economics leaves out too much of what is important in the law is not so much a criticism of the economic approach to law as a prediction that it will ultimately be a barren field."). See generally Isaac Ehrlich & Richard A. Posner, An Economic Analysis of Legal Rulemaking, 3 J. LEGAL STUD. 257, 257 (1974). 286 One example of a legal question which could be answered definitively is "is murder legal?" Beyond basic questions like this, however, which are easily answerable by a Google search, there isn't much weight between legal reasoning and legal decision making. In other words, legal reasoning justifies legal decisions rather than causing legal decisions. 287 Sergio David Becerra, The Rise of Artificial Intelligence in the Legal Field: Where We Are and Where We Are Going, 11 J. BUS. ENTREPRENEURSHIP & L. 27, 39–40 (2019); see also Simon et al., supra note 52, at 238. 288 MICHAEL BUCKLAND, INFORMATION AND SOCIETY 21 (2017). 289 Id. 290 Id. at 79. 291 FED. R. CIV. P. 26.

In the context of corporate litigation, millions of documents may require searching and examination for relevance.293 As a result, many law firms submit to costly contracts for document review systems.294 From a technical perspective, however, document review systems are nearly identical to spam filters for email.295 As machine learning scholar Ethem Alpaydin explains, a basic NLP application is document categorization, the process of assigning various documents to different categories based upon document language.296 The document classification problem is solvable with supervised learning.297 Supervised learning algorithms analyze training data and infer a model which can be used to classify new instances.298 Such models are well suited to the task of making predictions.299 In the same way an email spam filter classifies an email as spam or not spam, a document review system classifies documents as relevant or not relevant.300
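A minimal sketch of this classification approach appears below, assuming the open source scikit-learn library; the toy documents and labels are hypothetical illustrations, not data from any vendor's system.

# Document review as supervised classification, in the same spirit as an
# email spam filter: 1 = relevant, 0 = not relevant.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Documents a lawyer has already coded; toy stand-ins for a real sample.
train_docs = [
    "Email from the CFO discussing the merger price and timing",
    "Memorandum analyzing antitrust exposure of the acquisition",
    "Office newsletter announcing the holiday party",
    "Advertisement for discount printer ink",
]
train_labels = [1, 1, 0, 0]

# TF-IDF represents each document in a vector space; logistic regression
# then learns to separate relevant from irrelevant documents.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

print(model.predict(["Board minutes approving the merger agreement"]))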

292 FED. R. CIV. P. 26(a)(1)(A)(ii). 293 Simon et al., supra note 52, at 254. 294 See Chris D. Birkel, The Growth and Importance of Outsourced E-Discovery: Implications for Big Law and Legal Education, 38 J. LEGAL PROF. 231 (2014). 295 ALPAYDIN, supra note 13, at 68. 296 Id. 297 Michael A. Livermore et al., Computationally Assisted Regulatory Participation, 93 NOTRE DAME L. REV. 977, 1006 (2018). 298 ASHLEY, supra note 19, at 109. 299 SUSSKIND, supra note 66, at 53. 300 ALPAYDIN, supra note 13, at 68. 301 Birkel, supra note 294, at 236. 302 ASHLEY, supra note 19, at 239. 303 Id. 304 Id. 305 Id. 306 Id.

Document review systems are particularly prevalent in the context of e-discovery.301 E-discovery is the collecting, exchanging, and analyzing of electronically stored information in pre-trial discovery.302 Pre-trial discovery in lawsuits involves processing parties' requests for materials in the hands of opponents and others to reveal facts and develop evidence for trial.303 Today, large lawsuits often involve millions of documents.304 Documents produced in litigation are diverse, ranging from corporate memoranda and contracts to tweets and email.305 Thus, a difficult challenge in e-discovery is finding ways to extract uniform information from heterogeneous documents.306 Interestingly, a recent study published by Baidu, the Chinese search engine company, showed state-of-the-art performance in text extraction for heterogeneous documents.307 In short, the study provides a learning model incorporating both a CNN and an attention mechanism for extracting key features from various document types.308 The study reflects the rapidly growing capabilities of NLP-based information extraction systems.309 Selecting the information and features to be extracted from documents remains a critical task often assigned to litigators.310 Feature selection is a preprocessing technique used to represent documents in a vector space model.311 Thus, in the course of this process, litigators commonly construct a theory of relevance called a relevance hypothesis.312 Generally, a relevance hypothesis is a description of subject matter that, if found in a document, would make that document relevant.313 The goal of the relevance hypothesis is to accurately classify litigation-related documents as relevant or irrelevant.314 To accomplish this goal, litigators will first identify keywords to search and identify an initial set of documents to be reviewed.315 Then lawyers classify a document sample as positive or negative instances of what they regard as relevant.316 This task is commonly referred to as predictive coding.317 As this classification takes place, the lawyers are training deep learning models to classify documents, providing labels for an ANN to learn.318 After the lawyers have classified the initial set of documents, the deep learning models are applied to a larger set of documents for relevancy classification.319 Although document review remains the most prolific NLP application in law practice today, NLP applications for legal writing present promise for the future.

307 See generally He Guo et al., EATEN: Entity-aware Attention for Single Shot Visual Text Extraction, ARXIV, Sept. 20, 2019, at 1, https://arxiv.org/pdf/1909.09380.pdf. 308 Id. 309 Young et al., supra note 110, at 12. 310 Thaoroijam, supra note 172. 311 Id.; see also Lise Getoor et al., Learning Probabilistic Models of Relational Structure, STANFORD UNIV. (2001), https://ai.stanford.edu/~koller/Papers/Getoor+al:ICML01.pdf. 312 ASHLEY, supra note 19, at 240. 313 Id. 314 Id. at 237. 315 Nicholas Barry, Man Versus Machine Review: The Showdown Between Hordes of Discovery Lawyers and a Computer-Utilizing Predictive-Coding Technology, 15 VAND. J. ENT. & TECH. L. 343, 351 (2013). 316 ASHLEY, supra note 19, at 241. 317 Id. 318 Barry, supra note 315, at 354. 319 Id.

C. Legal Writing

Natural language generation (NLG) is the process of synthesizing language to form sequences with syntactic accuracy and semantic coherence.320 Although some argue this is a uniquely human activity,321 these processes are capable of logical representation. Indeed, NLG is describable as a reinforcement learning problem.322 And, Transformer models show state-of-the-art performance in NLG.323 First, this section explores a recent NLG study using GPT-2 for patent claim generation. Second, this section introduces a novel machine learning algorithm for legal writing. One approach to developing NLP applications for legal writing is using a Transformer model. Indeed, a recent study used GPT-2 for patent claim generation.324 The researchers created a dataset of 555,890 patent claims that were preprocessed for training a GPT-2 model.325 The study used cloud computing resources from Google to conduct the experiments.326 The researchers hoped the Transformer model would show performance improvement compared to ANN models.327 A significant portion of the study's generated text, however, was senseless.328 Yet, the study's authors suggest that using a deep learning model in conjunction with the Transformer may improve future results.329 A second approach to NLG is deep reinforcement learning.330 At its core, reinforcement learning is a process by which machines learn optimal strategies for achieving goals.331 For example, the Deep Q-Network ("DQN") algorithm, a deep reinforcement learning variant, is goal-oriented.332 The DQN is an example of an action-value based framework, in which an agent begins its interactions with its environment by randomly exploring and gathering information about the environment's states, actions, and rewards.333

320 ALPAYDIN, supra note 13, at 109. 321 John McGinnis, Accelerating AI, 104 NW. U. L. REV. COLLOQUY 366, 368 (2010); see also Milan Markovic, Rise of the Robot Lawyers?, 61 ARIZ. L. REV. 325, 330 (2019). 322 Young et al., supra note 110, at 12. 323 Tianyi Zhang et al., BERTScore: Evaluating Text Generation with BERT, ARXIV, Feb. 24, 2020, at 1, https://arxiv.org/pdf/1904.09675.pdf. 324 Jieh-Sheng Lee & Jieh Hsiang, Patent Claim Generation by Fine-Tuning OpenAI GPT-2, ARXIV, July 1, 2019, at 1, https://arxiv.org/pdf/1907.02052.pdf. 325 Id. at 2. 326 Id. at 3. 327 Id. at 9. 328 Id. at 8. 329 Id. at 9. 330 Jiwei Li et al., Deep Reinforcement Learning for Dialogue Generation, ARXIV, Sept. 29, 2016, at 1, https://arxiv.org/pdf/1606.01541.pdf. 331 See TEGMARK, supra note 12, at 85–86. 332 LAPAN, supra note 200, at 410.

The algorithm stores this information in memory, called experience.334 Over time, the algorithm learns from this experience through a process called experience replay.335 Experience replay refers to the agent's experiences stored in memory, which are used to train the neural network to approximate the value of state-action pairs.336 Thus, the DQN is describable as an off-policy algorithm, meaning it uses data from its memory to optimize performance.337 The DQN algorithm develops an optimal policy $\pi^*$ for an agent with a Q-learning algorithm.338 The optimal policy is the best method of decision making for an agent with the goal of maximizing reward.339 The Q-learning algorithm maximizes a Q-function, $Q(s, a)$, where $s$ is the state of an environment and $a$ is an action in that state.340 In essence, by applying the optimal Q-function $Q^*$ to every state-action pair $(s, a)$ in an environment, the agent is acting according to the optimal policy.341 Nonetheless, computing $Q(s, a)$ for each state-action pair in an environment is computationally expensive.342
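The experience replay mechanism described above is straightforward to express in code. The sketch below is a minimal, illustrative replay buffer written for exposition; the class design and parameter values are assumptions, not details of the cited implementations.

import random
from collections import deque

# Experience replay: the agent's transitions are stored in memory and
# sampled at random to train the network that approximates the value of
# state-action pairs.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest experience is dropped

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive
        # transitions, which stabilizes off-policy learning.
        return random.sample(self.memory, batch_size)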

Instead, the DQN algorithm approximates the value of each state-action pair as follows:343

$$Q(s, a; \theta) \approx Q^*(s, a)$$

Here, $\theta$ represents the function parameters, which are the function's variables.344 The parameters are determined by a neural network using experience replay.345 The network iterates until the Q-function converges, as determined by the Bellman Equation:346

333 Id. at 127. 334 CHARNIAK, supra note 59, at 133. 335 Id. 336 Id. 337 Hasselt, Guez & Silver, supra note 202; see also Yuval Tassa et al., DeepMind Control Suite, ARXIV, Jan. 3, 2018, at 12, https://arxiv.org/pdf/1801.00690.pdf (The DeepMind Control Suite is a set of tasks for benchmarking continuous RL algorithms, developed by Google DeepMind.). 338 Mnih et al., supra note 181; see also U.S. Patent Application No. 20190205753A1 (filed Feb. 27, 2019). 339 KOCHENDERFER, supra note 191, at 81. 340 Mnih et al., supra note 181. 341 LAPAN, supra note 200, at 144; see also U.S. Patent Application No. 2016/015,231 (filed June 22, 2018). 342 U.S. Patent App. No. 2014/097,862 at 5 (filed Dec. 5, 2013); see also DQN, TensorFlow, GITHUB.COM (2020), https://github.com/tensorflow/agents/tree/master/tf_agents/agents/dqn (code for DQN from TensorFlow under an Apache license) (last visited May 19, 2020). 343 Id. 344 Id. 345 CHARNIAK, supra note 59, at 133. 346 Haney, supra note 14, at 162.

$$Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,\right]$$

Here, $\mathbb{E}_{s'}$ refers to the expectation over successor states $s'$, $r$ is the reward, and $\gamma$ is a discount factor, typically defined $0 < \gamma < 1$, allowing present rewards to have higher value.347 Additionally, the max function describes the action at which the Q-function takes its maximal value for each state-action pair.348 In other words, the Bellman Equation does two things: it defines the optimal Q-function, and it allows the agent to consider the reward from its present state as greater relative to rewards in future states.349 The formal description of goals to be achieved, however, remains the most difficult task in AI scholarship.350 From a conceptual standpoint the idea is straightforward. AI consistently outperforms humans in Atari games like Breakout because these environments have a clearly defined goal: score maximization.351 In other words, the purpose of playing the game is to get the most points. Similarly, by defining a goal for a writer as a score maximization problem, AI will outperform any human writer in achieving that goal, developing novel strategies and techniques to optimize performance. Thus, the difficulty is not in developing better AI systems, but rather in defining quality metrics for legal documents.352 In the context of legal writing, the state-space of the document is discrete, reflecting its finite nature.353 As a result, the DQN is a prime candidate as a deep reinforcement learning algorithm to maximize the value of the document according to defined metrics.354 One difficulty in developing reinforcement learning algorithms is defining a reward.355 This is particularly true in the context of NLG because metrics for writing are inherently subjective.356 Harvard Law Fellow Ron Dolin argues, however, that one method of capturing human intuition in measuring legal quality is a weighted geometric mean, formally:357

$$s = \left( \prod_{i=1}^{n} f_i^{w_i} \right)^{1 / \sum_{i=1}^{n} w_i}$$

In the above equation, $s$ is the document score, $n$ represents the number of factors $f_i$, and $w_i$ is the per-factor weight. The root is taken to the index $\sum_{i=1}^{n} w_i$, the total weight across all factors.358 Although the process of scoring legal writing is inherently subjective, Dolin's algorithm does allow for a quantitative formalization of legal work product.359 By defining a reward function correlated with maximizing $s$, a DQN algorithm would optimize the document's score in accordance with pre-defined metrics. One possibility is to develop a machine learning algorithm to learn which metrics matter most to legal quality. Indeed, with deep learning it would be possible to develop computational models recognizing patterns in high quality legal writing, rather than have humans identify features.360 In a deep learning approach, one would need only label select instances of good writing and bad writing. For example, a winning U.S. Supreme Court brief may receive a score of 0.95, while a law student's rough draft of a moot court brief may receive a score of 0.35, and a Facebook rant about a recent politicized Court opinion may receive a score of 0.05. This would allow a neural network to learn the abstractions lawyers find valuable in writing, as opposed to manually defining such metrics. MIT Professor Max Tegmark explains that there are two mathematically equivalent ways of describing physical laws: one in which the past causes the future, and one in which nature optimizes a function.361 In the proposed algorithm, the latter approach is adopted for NLG, the goal being to maximize a score associated with legal quality metrics. Indeed, an agent may optimize the text of a document in accordance with the defined metrics by selecting characters from a list in each state of the environment. Importantly, the only remaining piece of the complete automation of legal writing is defining metrics.
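To make the proposal concrete, the sketch below computes Dolin's weighted geometric mean for a document and shows how it could supply a DQN reward. The factor names, ratings, and weights are hypothetical illustrations; Dolin's manuscript does not prescribe them.

import math

# Weighted geometric mean score: s = (prod f_i^w_i)^(1 / sum w_i).
def document_score(factors, weights):
    total_weight = sum(weights)
    product = math.prod(f ** w for f, w in zip(factors, weights))
    return product ** (1.0 / total_weight)

# Hypothetical factors: clarity, citation accuracy, organization.
old_score = document_score([0.70, 0.60, 0.80], weights=[2.0, 1.0, 1.0])
new_score = document_score([0.90, 0.60, 0.80], weights=[2.0, 1.0, 1.0])
print(round(old_score, 3), round(new_score, 3))

# A DQN reward for an edit could be the resulting change in the score,
# so that maximizing cumulative reward maximizes s.
reward = new_score - old_score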

347 KOCHENDERFER, supra note 191, at 78. 348 Brian S. Haney, The Optimal Agent: The Future of Autonomous Vehicles & Liability Theory, 29 ALB. L.J. SCI. & TECH. 1, 18–19 (2020). 349 LAPAN, supra note 200, at 102–103. 350 TEGMARK, supra note 12, at 249; see also BOSTROM, supra note 197, at 239. 351 Id. at 83. 352 See Ron Dolin, Measuring Legal Quality: Purposes, Principles, Properties, Procedures, and Problems (June 18, 2017) (unpublished manuscript) (on file with the Harvard Law School Center on the Legal Profession). 353 CHOMSKY, supra note 37, at 13. 354 See Jinyoung Choi et al., Multi-focus Attention Network for Efficient Deep Reinforcement Learning, ARXIV, Dec. 13, 2017, at 1, https://arxiv.org/pdf/1712.04603.pdf. 355 BOSTROM, supra note 197, at 239. 356 TOREY, supra note 2, at 61. 357 Dolin, supra note 352. 358 Brian S. Haney, Calculating Corporate Compliance & The Foreign Corrupt Practices Act, 19 U. PITT. J. TECH. L. & POL'Y 1, 24 (2018), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3261443. 359 Dolin, supra note 352. 360 Id. 361 TEGMARK, supra note 12, at 251.

V. ETHICS

Ethics are principles governing human behavior.362 The study of ethics is inherently limited by the subjective nature of personal ethics.363 Indeed, what one person finds unethical may be considered entirely appropriate by another.364 In practice, lawyers are required to follow certain ethical guidelines.365 Yet, as a whole, the profession is commonly perceived as amoral, cruel, and greedy.366 Some argue the profession's commercialization leads to a profit-first mentality,367 and that ethics are peripheral to the legal system as a whole. Instead, ethics are more often used as a justification for the maintenance of socio-economic order. All the while, the evolution of ethical norms within the legal system progresses at slower rates and is dependent on ideological shifts supporting stronger ethical codes.368 Nonetheless, the interplay between ethics and law practice automation raises issues worthy of lively debate. In particular, three ethical issues for the automation of the legal profession arise in the domains of professional responsibility, access to justice, and automated labor.

A. Professional Responsibility

Rules of professional conduct arguably incentivize certain types of ethical behavior.369 A lawyer engaged in an attorney-client relationship must comply with the rules of professional conduct in the state where the lawyer is admitted to practice law.370 Critically, rules of professional conduct only apply to lawyers engaged in the practice of law.371 The line between the practice of law and something else is gray.372 The American Bar Association's (ABA) Model Rules of Professional Conduct outline several requirements lawyers must follow.373

362 See Thomas M. Madden, Law and Strategy and Ethics?, 32 GEO. J. LEGAL ETHICS 181, 200 (2019) (discussing law firm competition). 363 See Veronica Root, More Meaningful Ethics, UNIV. OF CHI. (Jan. 7, 2020), https://lawreviewblog.uchicago.edu/2020/01/07/more-meaningful-ethics-by-veronica-root-martinez/. 364 Id. 365 See generally MODEL RULES OF PROF'L CONDUCT (2019). 366 See Subha Dhanaraj, Making Lawyers Good People: Possibility or Pipedream?, 28 FORDHAM URB. L.J. 2037, 2038 (2001); see also Robert Granfield & Thomas Koenig, "It's Hard to Be a Human Being and a Lawyer": Young Attorneys and the Confrontation with Ethical Ambiguity in Legal Practice, 105 W. VA. L. REV. 495, 495 (2003). 367 Granfield & Koenig, supra note 366, at 498. 368 See MARYAM JAMSHIDI, THE FUTURE OF THE ARAB SPRING: CIVIC ENTREPRENEURSHIP IN POLITICS, ART, AND TECHNOLOGY STARTUPS 27 (2014) (discussing ideological systematic shifts in the Middle East). 369 Root, supra note 363. 370 Id. 371 Simon et al., supra note 52, at 245. 372 See id.

For example, Rule 1.1 states, "[a] lawyer shall provide competent representation to a client. Competent representation requires the legal knowledge, skill, thoroughness and preparation reasonably necessary for the representation."374 Additionally, Rule 1.4 states, "[a] lawyer shall explain a matter to the extent reasonably necessary to permit the client to make informed decisions regarding the representation."375 The Model Rules' Preamble explains the purpose of the rules: "[a] lawyer, as a member of the legal profession, is a representative of clients, an officer of the legal system and a public citizen having special responsibility for the quality of justice."376 Generally, the practice of law by non-lawyers is outlawed by state statute.377 And, the ABA has backed state statutes preventing the unauthorized practice of law by those who are not admitted to the bar.378 As a whole, the profession is unwelcoming to entrepreneurs developing AI systems, with increasing vigilance toward those automating the practice of law.379 Systems engaged in the practice of law are largely outlawed by state statute.380 Thus, whether machines are engaged in the practice of law is a critical issue from both an ethical and a legal perspective.381 According to the ABA, "[t]he definition of the practice of law is established by law and varies from one jurisdiction to another. Whatever the definition, limiting the practice of law to members of the bar protects the public against rendition of legal services by unqualified persons."382 As one court explained, however, it is "difficult, if not impossible, to lay down a formula or definition of what constitutes the practice of law."383 In Lola v. Skadden, the Court of Appeals for the Second Circuit addressed the issue of whether a document review attorney is engaged in the practice of law.384 In doing so, the Second Circuit provided insight into what legal work constitutes the practice of law according to the federal courts.385

373 See generally MODEL RULES OF PROF'L CONDUCT (2019). 374 MODEL RULES OF PROF'L CONDUCT R. 1.1. 375 MODEL RULES OF PROF'L CONDUCT R. 1.4(b). 376 MODEL RULES OF PROF'L CONDUCT Preamble & Scope. 377 Simon et al., supra note 52, at 237. 378 Id. 379 See Thomas E. Spahn, Is Your Artificial Intelligence Guilty of the Unauthorized Practice of Law?, 24 RICH. J.L. & TECH 2, 30 (2018). 380 See id. 381 Simshaw, supra note 20, at 178. 382 MODEL RULES OF PROF'L CONDUCT R. 5.5 cmt. 2. 383 People ex rel. Ill. State Bar Ass'n v. Schafer, 404 Ill. 45, 50 (1949). 384 See Lola v. Skadden, Arps, Slate, Meagher & Flom LLP, 620 Fed. Appx. 37, 39 (2015). 385 See Lola, 620 Fed. Appx. at 41.

The case began in 2013, when David Lola filed a complaint in federal court against the law firm Skadden, Arps, Slate, Meagher & Flom LLP (Skadden).386 Lola worked forty-five to fifty-five hours per week as a document review attorney for Skadden at a rate of twenty-five dollars an hour.387 Lola alleged his work was limited to three essential tasks: (1) looking at documents to see what search terms appeared; (2) marking those documents in predetermined categories; and (3) drawing black boxes to redact portions of certain documents based on specific protocols.388 In short, Lola argued his work was performing command-and-control tasks, using the software system Relativity to aid in the document review process.389 In filing the complaint, Lola sought to receive overtime compensation, time-and-a-half, for instances when he worked more than forty hours in a particular week, pursuant to the Fair Labor Standards Act (FLSA).390 The United States District Court for the Southern District of New York ruled against Lola, however, because the FLSA's overtime protections exclude lawyers.391 In other words, lawyers are not entitled to time-and-a-half compensation for overtime work. On appeal to the United States Court of Appeals for the Second Circuit, Defendants maintained that Lola was exempt from the FLSA's overtime rules because he was a licensed attorney engaged in the practice of law.392 The parties disputed whether the document review he performed constituted "engaging in the practice of law."393 Lola argued for a new federal standard defining the practice of law; the court, however, refused to adopt such a formal rule.394 The court did find the district court erred in concluding that engaging in document review per se constitutes practicing law.395 According to the Second Circuit, "[t]he gravamen of Lola's complaint is that he performed document review under such tight constraints that he exercised no legal judgment whatsoever."396 Accordingly, the court held that Lola adequately alleged in his complaint that he failed to exercise any legal judgment in performing his duties for the Defendants.397

386 Lola v. Skadden, Arps, Slate, Meagher & Flom LLP, No. 13-cv-5008 (RJS), 2014 WL 4626228, at *1–2 (S.D.N.Y. Sept. 16, 2014). 387 Id. at *1. 388 Lola, 620 Fed. Appx. at 40. 389 Simon et al., supra note 52, at 240–241. 390 Lola, 2014 WL 4626228, at *1–2. 391 See id. at *1. 392 Lola, 620 Fed. Appx. at 40. 393 Id. at 41. 394 Id. 395 Id. at 44. 396 Id. at 45.

As such, the case was remanded to the district court before the parties agreed to a settlement.398 The Second Circuit's ruling is important for the future of legal automation because it established that in some circumstances document review may not be considered the practice of law.399 It follows logically that in the event a machine is performing document review, it is not practicing law. In other words, if the practice of law requires something a computer cannot do, then a computer cannot practice law by definition. Yet, all information about the world can be represented as numbers.400 And, despite how complex practicing law may be, a computer can in theory replicate every action a lawyer takes throughout the course of representing a client.401 But if the legal definition of the practice of law excludes purely computational processes, then an AI lawyer cannot be held to the same ethical or professional standards as a human lawyer. Instead, the human lawyer using an AI lawyer to automate part of their practice will likely be liable for any resulting ethics or professional responsibility violations. An interesting scenario will arise if AI systems are used to help those with legal needs who lack the capital to hire a lawyer.

B. Access to Justice

A central question surrounding the AI evolution for law practice is whether it will increase access to justice.402 Access to justice refers to citizens' ability to interact with the legal system regardless of socio-economic standing.403 The idea underlying access to justice is that everyone, regardless of financial standing, should have equal access to the legal system.404 But the American legal system does not treat everyone equally.405 In fact, many people face substantial barriers when they need access to justice in America.406 Moreover, a general trend in the American legal system is increasing power for executive authorities407 and law enforcement at the expense of lower socio-economic classes.408

397 Id. 398 Id.; Simon et al., supra note 52, at 247. 399 See Simon et al., supra note 52, at 238 ("[T]he Lola court did something extraordinary: it constituted the first judicial step in distancing the work of lawyers from that of machines."). 400 ALPAYDIN, supra note 13, at 2. 401 TEGMARK, supra note 12, at 53 (discussing Hans Moravec's landscape of human competence and the complexities of various computational problems). 402 See Simshaw, supra note 20, at 183. 403 SUSSKIND, supra note 66, at 93–94. 404 Id. at 93. 405 See Weinstein, supra note 23. 406 Id. at 502.

Indeed, in the United States, only the wealthy and corporations have access to the legal system.409 Some argue that specific factors such as gender or race define the problem.410 Others argue the complexities of the legal system are a contributing factor in the justice gap.411 Yet, the complexity of the system contributes to the problem far less than the arbitrary, near random, application of law by judges.412 As a result, such arguments are blind to the system's reality, confusing the issue: the access to justice problem is purely economic in nature. Importantly, the American legal system is not designed to administer moral justice.413 Instead, the American legal system is designed to maintain the socio-economic order.414 Indeed, governments influence behavior by passing laws criminalizing activity and regulating markets.415 The way in which the United States government decides which laws to pass is subject to the demands of powerful interest groups struggling among themselves to maximize their members' incomes.416 The system is based on strict hierarchies, maximizing wealth and freedom for those at the top of the pyramid.

407 Emily Berman, Regulating Domestic Intelligence Collection, 71 WASH. & LEE L. REV. 3, 6 (2014) (arguing for changes in administrative law to protect civil liberties associated with intelligence collection). 408 See Maya Steinitz, Whose Claim Is This Anyway? Third Party Litigation Funding, 95 MINN. L. REV. 1268, 1276 (2010) (discussing access to justice); see also Maryam Jamshidi, The Climate Crisis Is a Human Security, Not a National Security, Issue, 93 S. CAL. L. REV. POSTSCRIPT 36, 39–40 (2019). 409 Weinstein, supra note 23. 410 Rebecca L. Sandefur, The Fulcrum Point of Equal Access to Justice: Legal and Nonlegal Institutions Remedy, 42 LOY. L.A. L. REV. 949, 949 (2009). 411 Weinstein, supra note 23. 412 Even at a high level, laws are inconsistently applied. In the lower courts, law is even more inconsistent, relying heavily on the judges and lawyers involved in each case. See National Federation of Independent Business v. Sebelius, 567 U.S. 519, 519 (2012) (finding the Affordable Care Act's individual mandate was a tax within Congress's taxing power); U.S. v. Morrison, 529 U.S. 598, 627 (2000) (Thomas, J., concurring) (arguing that the substantial effects test places virtually no limits on federal power); see also Gonzales v. Raich, 125 S. Ct. 2195 (2005) (aggrandizing federal power to prevent cancer patients from having access to medical marijuana). 413 Oliver Wendell Holmes, Jr., The Path of the Law, 10 HARV. L. REV. 457, 465 (1897). 414 Richard A. Posner, Theories of Economic Regulation, 5 BELL J. ECON. & MGMT. SCI. 335, 335 (1974); see also Lawrence Lessig, The New Chicago School, 27 J. LEGAL STUD. 661, 662 (1998); see also Ruckelshaus v. Monsanto Co., 467 U.S. 986, 986 (1984); see also U.S. CONST. amend. V. The Fifth Amendment guarantees the right to "life, liberty, and property." Id. The right to property, however, guarantees the right to life and liberty. ARTHUR LEE, AN APPEAL TO THE JUSTICE AND INTERESTS OF THE PEOPLE OF GREAT BRITAIN, IN THE PRESENT DISPUTE WITH AMERICA 14 (1775) ("The right of property is the guardian of every other right, and to deprive a people of this, is in fact to deprive them of their liberty."). Critically, without property Americans have no liberty and an inevitably poor quality of life. Indeed, without property, or money, there is no way to access necessities like food or water. 415 PRIMAVERA DE FILIPPI & AARON WRIGHT, BLOCKCHAIN AND THE LAW 174 (2018). 416 Posner, supra note 414, at 335–336.
As a result, scholars have called for a re-defining of the state and its relation to groups of lower socio-economic status.417 Some scholars, including Ron Dolin and Richard Susskind, suggest technology may help to mend the justice gap.418 For example, Susskind argues online legal services could be developed to help the poor identify whether or not they have a legal issue.419 But simply identifying a legal issue is often no help, because unless one has the money to afford an attorney, or a claim lucrative enough for an attorney to work on contingency, pro se representation is the only option.420 As Ron Dolin explains, "it's literally unconscionable to pretend that millions of people are not representing themselves and failing miserably in the process."421 The reality is the poor are consistently suppressed by the American legal system, which is corrupt beyond repair.422 Indeed, in the American legal system money and corruption win, while morality and good conscience are time and again the losers. All the while, the subjectivity of the word "justice" undermines any attempt at a solution to the access to justice problem. As such, access to the justice system is attainable only by the wealthy. It follows that justice is done and administered by the wealthy, for the wealthy, leaving the poor outcast and attacked by government lawyers, judges, and the legal system as a whole. The role the law plays in modern society is thus one of maintaining the established socio-economic order, repressing the poor in the process. In many ways, the law is a process by which natural selection takes place, favoring those with expendable resources and strong political relationships.423 In fact, laws in many respects are merely arbitrary orders backed by threats,424 well described in the Latin maxim, auctoritas nec veritas fecit legem: authority, not truth, makes law.425 While law is a dying art, the way in which world authorities issue policy fostering AI evolution will be critical to the future of humanity, particularly relating to labor automation.426

417 Milena Sterio, A Grotian Moment: Changes in the Legal Theory of Statehood, 39 DENV. J. INT'L L. & POL'Y 209, 237 (2010). 418 SUSSKIND, supra note 66, at 93; see also Ron Dolin, UPL, Technology, and Access to Justice, RADICAL CONCEPTS (Apr. 30, 2015), http://radicalconcepts.com/285/upl-technology-and-access-to-justice/. 419 SUSSKIND, supra note 66, at 97–98 (2017). 420 See Nina Ingwer Van Wormer, Help at Your Fingertips: A Twenty-First Century Response to the Pro Se Phenomenon, 60 VAND. L. REV. 983, 991 (2007). 421 Dolin, supra note 418. 422 See, e.g., Ed Shanahan, Judge Obstructed Justice in $10 Million Corruption Case, U.S. Says, N.Y. TIMES (Oct. 11, 2019), https://www.nytimes.com/2019/10/11/nyregion/brooklyn-supreme-court-judge-sylvia-ash.html. 423 See CHARLES DARWIN, ON THE ORIGIN OF SPECIES BY MEANS OF NATURAL SELECTION, OR THE PRESERVATION OF FAVORED RACES IN THE STRUGGLE FOR LIFE 62 (1859) (explaining the struggle for existence includes dependence of beings on one another, the life of the individual, and success in leaving progeny). 424 H.L.A. HART, THE CONCEPT OF LAW 6–7, 18–25 (2d ed. 1994).

C. Automated Labor

A principal ethical issue relating to AI is the effect of automation on the work force.427 There are a variety of arguments as to the relationship between technology and jobs,428 none of which is decisive.429 The extent to which the practice of law will be automated remains uncertain.430 However, every function humans perform can be automated, including writing, reasoning, and making art.431 Whether a particular job will be automated is largely dependent on economic factors. As a result, jobs involving highly repetitive tasks are more likely to be automated in the near future than jobs requiring creativity, novelty, and social intelligence.432 Thus, it is unlikely lawyers have reason to worry about their work being automated.433 There are, however, three schools of thought on the future of AI and jobs, technological utopianism, extreme inequality, and a moderate approach, each of which leads to different outcomes for the legal labor market. Technological utopianism refers to the idea that digital life is the natural and desirable next step in humanity's cosmic evolution, which will certainly be good.434 As a result of technological utopianism, a majority of the literature on technology is inherently optimistic, both in terms of outcomes and rates of progress. For example, Oxford Professor Nick Bostrom suggests that exponential increases in artificial intelligence technologies will soon lead to super-intelligent machines.435 And, Google's Ray Kurzweil argues that the technological singularity, the time at which the human brain is reverse engineered with computational technologies, is only a decade away.436

425 OTFRIED HÖFFE, THOMAS HOBBES 9 (2016); see also OXFORD LATIN DESK DICTIONARY 20, 120, 202 (James Morwood ed., 2005) (defining Latin to English translations of auctoritas, nec, and veritas). 426 TEGMARK, supra note 12, at 123 (discussing whether AI will eventually lead humans to become totally obsolete). 427 Id. 428 Id. 429 JOHN JORDAN, ROBOTS 163 (2016). John M. Jordan is Clinical Professor of Supply Chain and Information Systems in the Smeal College of Business at Penn State University. 430 Id. 431 Haney, supra note 14, at 151. 432 TEGMARK, supra note 12, at 52–53. 433 One exception to this is document review attorneys. 434 TEGMARK, supra note 12, at 32. 435 BOSTROM, supra note 197, at 34.

Technological utopians typically argue all jobs will soon be automated and, thus, there is an urgent need for a universal basic income.437 The argument follows that machines will eventually replace all human jobs, and therefore society will need a different method of dispersing wealth among its population.438 Under the technological utopian's perspective, developments in technology will lead to the automation of various legal tasks, increasing access to the justice system for those in need.439 And, as certain jobs are automated, more are created.440 As a result, society as a whole should embrace technology, because innovation leads to equality within a society.441 However, the utopian perspective is inherently misguided, ignoring the realities of the human condition.442 Further, the utopians are also incorrect in assuming the practice of law will be automated.443 Indeed, much of law practice is relationship-based, a phenomenon machines will not soon recreate. And, inertia from powerful interest groups will certainly slow any innovation attempting to occur in the legal profession.444 A second argument is that automated labor will lead to extreme economic inequalities.445 Consider the world's richest men, Bill Gates and Jeff Bezos, both of whom made their fortunes in technology.446 New technologies undoubtedly create winners and losers in the labor market.447 However, the degree to which winners reap rewards comes at the expense of the losers. It is no surprise that Northern California's Bay Area is the center of the world's technological innovation, while simultaneously having among the highest rates of homelessness in the United States.448

436 RAY KURZWEIL, HOW TO CREATE A MIND 261 (2012). 437 TEGMARK, supra note 12, at 126. 438 Id. 439 Simshaw, supra note 20, at 178–179. 440 SUSSKIND, supra note 66, at 146. 441 Eleanor Lumsden, The Future Is Mobile: Financial Inclusion and Technological Innovation in the Emerging World, 23 STAN. J.L. BUS. & FIN. 1, 5 (2017) (arguing the best hope for eradicating poverty is technological innovation). 442 Peter Thiel, The Education of a Libertarian, CATO UNBOUND (May 1, 2009), https://www.cato-unbound.org/2009/04/13/peter-thiel/education-libertarian. 443 Harry Surden, Machine Learning and Law, 89 WASH. L. REV. 87, 87 (2014) (arguing AI algorithms have been unable to replicate most human intellectual abilities). 444 See Spahn, supra note 379, at 45 (explaining bar associations usually resist introduction of technologies automating areas of practice). 445 BOSTROM, supra note 197, at 96–97; Haney, supra note 205 (explaining the potential kinetics of a rapid takeoff in AI leading to a unitary power). 446 See The Richest People in the World, FORBES (Mar. 5, 2019), https://www.forbes.com/billionaires/#77480d02251c. 447 Michael Webb, The Impact of Artificial Intelligence on the Labor Market 1 (Jan. 2020) (unpublished manuscript) (on file with Stanford University).

The United States government consistently reallocates wealth from the poor to wealthy technology companies through the broken and corrupt public procurement process.449 Indeed, politics is often a deciding factor in whether a technology company succeeds or fails.450 As such, it is a general rule that technology is more a driver of inequality than a champion for the poor. Indeed, the kinetics of digital innovation have prevented most people and organizations from keeping up with the pace of change.451 And, as this trend progresses, so too does income inequality in the United States.452 Indeed, the poor and middle class are powerless against the federal government's economic repression. Even in this scenario, however, lawyers will be the last to have their profession automated, due to the profession's important role within the government. While AI will surely advance inequality to an extent, others argue the degree to which economic inequality expands may be more limited.453 A third argument regarding the relationship between AI and labor is one in which current trends relating to technology and the workforce continue: a moderate approach.454 The economic theory underlying this position suggests automation acts as a labor-saving device, freeing workers to perform work that adds more value.455 In other words, when a task gets mechanized or automated, workers find new and better ways to be involved in the workforce.456 For example, rather than replace lawyers, Richard Susskind argues technology will change the role of the lawyer, creating new jobs for lawyers to perform and altering the employment picture for legal services.457

448 MANCUR OLSON, THE LOGIC OF COLLECTIVE ACTION 7 (1971) (arguing the State's members often have interests separate and apart from the people); see also U.S. OFFICE OF COMMUNITY PLANNING & DEVELOPMENT, ANNUAL HOMELESS ASSESSMENT REPORT (AHAR) TO CONGRESS 33 (2018), https://www.novoco.com/sites/default/files/atoms/files/hud_ahar_2018_121718.pdf. 449 Brian S. Haney, Automated Source Selection & FAR Compliance, 48 PUB. CONT. L.J. 751, 754 (2019); see Craig Whitlock & Bob Woodward, Pentagon Buries Evidence of $125 Billion in Bureaucratic Waste, WASH. POST (Dec. 5, 2016), https://www.washingtonpost.com/investigations/pentagon-buries-evidence-of125-billion-in-bureaucratic-waste/2016/12/05/e0668c76-9af6-11e6-a0ed-ab0774c1eaa5_story.html; see also Femme Comp Inc. v. United States, 83 Fed. Cl. 704, 767 (2008); University Research Company, LLC, 2004 WL 2496439, at *10 (Comp. Gen. Oct. 28, 2004). 450 Lessig, supra note 414, at 537 (explaining the role of the political economy in the internet's evolution). 451 JORDAN, supra note 429, at 163. 452 Id. at 170; see also SHARON JANK & LINDSAY OWENS, INEQUALITY IN THE UNITED STATES 1, https://inequality.stanford.edu/sites/default/files/Inequality_SlideDeck.pdf. 453 Richard A. Posner, Orwell Versus Huxley: Economics, Technology, Privacy, and Satire 5–6 (1999) (University of Chicago Law School, Working Paper No. 89), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=194572 (technological innovations can also interact with each other or with the social structure to produce unforeseeable long-run consequences that may be good or bad). 454 SUSSKIND, supra note 66, at 146. 455 JORDAN, supra note 429, at 165. 456 Id. at 169.

Susskind argues that new jobs will be created in the legal services market, for example, legal engineers, research and development associates, and legal risk managers.458 However, Susskind's model assumes there will be a market for these types of services.459 It is unlikely firms will be willing to pay for these and other technology-related services. In fact, the knowledge economy is a fallacy.460 Firms are unwilling to pay for information due to its ready availability across the internet.461 Indeed, the internet is home to virtually all of the world's information and is accessible to anyone with a personal computer.462 Instead, the economy is largely driven by attention, relationships, and exploitation.463 As a result, under a moderate approach, more of the same should be expected.464 AI will likely increase competition among firms, driving billable hour requirements up and rates down. There are unquestionably eras throughout human history plagued by existential destruction at scale.465 For example, the Black Death wiped out two hundred million people in the Fourteenth Century.466 However, there have also been long periods of macro-level stability for humanity. And the task of predicting the future is laced with randomness and chance.467 Thus, as Professor John Jordan argues, the relationship between technology and jobs remains fundamentally uncertain.468 Still, there are some things which may be predicted with high probability.469 For example, it is near certain the American lawyer will not soon be replaced by an AI system.
In short, the role of the lawyer is not threatened by technological innovation because interaction with a judge or third-party official is a central part of virtually every area of practice. In other words, the idea that the United States government is a system of laws and not men is entirely a fallacy.470 The law is made up of people, and people work on relationships.

457 SUSSKIND, supra note 66, at 146. 458 Id. at 135. 459 Id. at 147 (suggesting that accounting firms, legal technology companies, and consulting firms will increasingly hire lawyers). 460 JAMES W. CORTADA, INFORMATION AND THE MODERN CORPORATION 3–4 (2011) (discussing knowledge as a vital asset class for corporations). 461 See ALPAYDIN, supra note 13, at 15 (explaining the availability of massive amounts of information on the internet). 462 Haney, supra note 14, at 164–165 (explaining that the world's most advanced weapons systems are accessible to anyone with internet access). 463 Frank Rose, The Attention Economy 3.0, MILKEN INST. REV. 42, 44 (2015) (explaining the value of information over time moves toward zero, while attention is exploited for financial gain). 464 Madden, supra note 362, at 200 (discussing law firm competition). 465 See, e.g., De-coding the Black Death, BBC NEWS (Oct. 3, 2001), http://news.bbc.co.uk/2/hi/health/1576875.stm. 466 Id. 467 GREENE, supra note 3, at 192–193. 468 JORDAN, supra note 429, at 163. 469 KURZWEIL, supra note 436, at 4.

CONCLUSION

The development of language was the key to the transformation of homo erectus into homo sapiens.471 Yet, language is inherently limited in its ability to embody the percepts of the human mind.472 Although much research in computational theory surrounds the study of language, it remains unlikely statistical models of language will allow computational forms of intelligence to master communication at human levels of performance. Further, such models have shown little improvement since the 1950s.473 Indeed, there is little difference between the Transformer's MLM, deep reinforcement learning, and the Markov models dismissed by Chomsky in Syntactic Structures.474 What has changed, however, is the amount of data available from which the models learn.475 Yet, mastery of language still remains an elusive task for machines. Further, even if an AI could generate information indistinguishable from or better than lawyers, the law would hardly change. The industry's protective barriers prevent, stifle, and attack innovation by design.476 Ultimately, although language is the tool of choice for lawyers and judges, the legal system as a whole is made up of people. And people are amoral, self-interested actors.477 As Justice Holmes described, it is a fallacy to think "the only force at work in the development of the law is logic."478 In many ways, language serves as a justification for the application of laws. Indeed, one can give any conclusion a logical form.479 As a result, language is often peripheral in the practice of law.

470 Contra Solem v. Helm, 463 U.S. 277, 313–15 (1983) (Burger, J., dissenting) (arguing that framers' view of the Cruel and Unusual Punishments Clause and controlling authority on the issue demanded precedential application). 471 TOREY, supra note 2, at 29. 472 Id. at 40. 473 See CHOMSKY, supra note 37, at 19–20 (describing earlier models). 474 Id. 475 See SUSSKIND, supra note 66, at 11. 476 Ron A. Dolin & Thomas Buley, Adaptive Innovation: Innovator's Dilemma in Big Law (2015) (Stanford Law Center on the Legal Profession, Working Paper), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2593621. 477 WERBOS, supra note 15, at 307. 478 Holmes, Jr., supra note 413, at 465. 479 Id. at 464.

APPENDIX A. SUMMARY OF NOTATION

Notation        Meaning

$\lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x}$        The derivative of $y$ with respect to $x$.

$\frac{\Delta y}{\Delta z} \cdot \frac{\Delta z}{\Delta x}$        The dot product of the derivative of $y$ with respect to $z$ and the derivative of $z$ with respect to $x$.

$Q^*(s, a)$        Value of taking action $a$ in state $s$ under the optimal policy.

$\gamma$        Discount factor.

$\mathbb{E}[\cdot]$        Expectation of a random variable.

$\arg\max_x f(x)$        A value of $x$ at which $f(x)$ takes its maximal value.

$r$        Reward.

$s_t$        State at time $t$.