The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, et al.



The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
EleutherAI, [email protected]
arXiv:2101.00027v1 [cs.CL] 31 Dec 2020

Abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction (https://pile.eleuther.ai/).

1 Introduction

Recent breakthroughs in general-purpose language modeling have demonstrated the effectiveness of training massive models on large text corpora for downstream applications (Radford et al., 2019; Shoeybi et al., 2019; Raffel et al., 2019; Rosset, 2019; Brown et al., 2020; Lepikhin et al., 2020). As the field continues to scale up language model training, the demand for high-quality massive text data will continue to grow (Kaplan et al., 2020).

The growing need for data in language modeling has caused most existing large-scale language models to turn to the Common Crawl for most or all of their data (Brown et al., 2020; Raffel et al., 2019). While training on the Common Crawl has been effective, recent work has shown that dataset diversity leads to better downstream generalization capability (Rosset, 2019). Additionally, large-scale language models have been shown to effectively acquire knowledge in a novel domain with only relatively small amounts of training data from that domain (Rosset, 2019; Brown et al., 2020; Carlini et al., 2020). These results suggest that by mixing together a large number of smaller, high-quality, diverse datasets, we can improve the general cross-domain knowledge and downstream generalization capabilities of the model compared to models trained on only a handful of data sources.

To address this need, we introduce the Pile: an 825.18 GiB English text dataset designed for training large-scale language models. The Pile is composed of 22 diverse and high-quality datasets, including both established natural language processing datasets and several newly introduced ones. In addition to its utility in training large language models, the Pile can also serve as a broad-coverage benchmark for cross-domain knowledge and generalization ability of language models.

We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.
In addition, we incorporate several existing high-quality datasets: Books3 (Presser, 2020), Project Gutenberg (PG-19) (Rae et al., 2019), OpenSubtitles (Tiedemann, 2016), English Wikipedia, DM Mathematics (Saxton et al., 2019), EuroParl (Koehn, 2005), and the Enron Emails corpus (Klimt and Yang, 2004). To supplement these, we also introduce a new filtered subset of Common Crawl, Pile-CC, with improved extraction quality.

[Figure 1: Treemap of Pile components by effective size.]

Through our analyses, we confirm that the Pile is significantly distinct from pure Common Crawl data. Additionally, our evaluations show that the existing GPT-2 and GPT-3 models perform poorly on many components of the Pile, and that models trained on the Pile significantly outperform both raw and filtered Common Crawl models. To complement the performance evaluations, we also perform an exploratory analysis of the text within the Pile to provide a detailed picture of the data. We hope that our extensive documentation of the construction and characteristics of the Pile will help researchers make informed decisions about potential downstream applications.

Finally, we make publicly available the preprocessing code for the constituent datasets of the Pile and the code for constructing alternative versions (https://github.com/EleutherAI/the-pile). In the interest of reproducibility, we also document all processing performed on each dataset (and the Pile as a whole) in as much detail as possible. For further details about the processing of each dataset, see Section 2 and Appendix C.

1.1 Contributions

The core contributions of this paper are:

1. The introduction of an 825.18 GiB English-language dataset for language modeling combining 22 diverse sources.

2. The introduction of 14 new language modeling datasets, which we expect to be of independent interest to researchers.

3. Evaluations demonstrating significant improvements across many domains by GPT-2-sized models trained on this new dataset, compared to training on CC-100 and raw Common Crawl.

4. The investigation and documentation of this dataset, which we hope will better inform researchers about how to use it as well as motivate them to undertake similar investigations of their own data.

2 The Pile Datasets

The Pile is composed of 22 constituent sub-datasets, as shown in Table 1.
Following Brown et al. (2020), we increase the weights of higher-quality components, with certain high-quality datasets such as Wikipedia being seen up to 3 times ("epochs") for each full epoch over the Pile. Detailed information about the construction of each dataset is available in Appendix C.

Component              Raw Size      Weight    Epochs   Effective Size    Mean Document Size
Pile-CC                227.12 GiB    18.11%    1.0       227.12 GiB         4.33 KiB
PubMed Central          90.27 GiB    14.40%    2.0       180.55 GiB        30.55 KiB
Books3†                100.96 GiB    12.07%    1.5       151.44 GiB       538.36 KiB
OpenWebText2            62.77 GiB    10.01%    2.0       125.54 GiB         3.85 KiB
ArXiv                   56.21 GiB     8.96%    2.0       112.42 GiB        46.61 KiB
Github                  95.16 GiB     7.59%    1.0        95.16 GiB         5.25 KiB
FreeLaw                 51.15 GiB     6.12%    1.5        76.73 GiB        15.06 KiB
Stack Exchange          32.20 GiB     5.13%    2.0        64.39 GiB         2.16 KiB
USPTO Backgrounds       22.90 GiB     3.65%    2.0        45.81 GiB         4.08 KiB
PubMed Abstracts        19.26 GiB     3.07%    2.0        38.53 GiB         1.30 KiB
Gutenberg (PG-19)†      10.88 GiB     2.17%    2.5        27.19 GiB       398.73 KiB
OpenSubtitles†          12.98 GiB     1.55%    1.5        19.47 GiB        30.48 KiB
Wikipedia (en)†          6.38 GiB     1.53%    3.0        19.13 GiB         1.11 KiB
DM Mathematics†          7.75 GiB     1.24%    2.0        15.49 GiB         8.00 KiB
Ubuntu IRC               5.52 GiB     0.88%    2.0        11.03 GiB       545.48 KiB
BookCorpus2              6.30 GiB     0.75%    1.5         9.45 GiB       369.87 KiB
EuroParl†                4.59 GiB     0.73%    2.0         9.17 GiB        68.87 KiB
HackerNews               3.90 GiB     0.62%    2.0         7.80 GiB         4.92 KiB
YoutubeSubtitles         3.73 GiB     0.60%    2.0         7.47 GiB        22.55 KiB
PhilPapers               2.38 GiB     0.38%    2.0         4.76 GiB        73.37 KiB
NIH ExPorter             1.89 GiB     0.30%    2.0         3.79 GiB         2.11 KiB
Enron Emails†            0.88 GiB     0.14%    2.0         1.76 GiB         1.78 KiB
The Pile               825.18 GiB                        1254.20 GiB         5.91 KiB

Table 1: Overview of datasets in the Pile before creating the held-out sets. Raw Size is the size before any up- or down-sampling. Weight is the percentage of bytes in the final dataset occupied by each dataset. Epochs is the number of passes over each constituent dataset during a full epoch over the Pile. Effective Size is the approximate number of bytes in the Pile occupied by each dataset. Datasets marked with a † are used with minimal preprocessing from prior work.

2.1 Pile-CC

Common Crawl is a collection of website crawls from 2008 onwards, including raw web pages, metadata and text extractions. Due to the raw nature of the dataset, Common Crawl has the advantage of including text from diverse domains, but at the cost of varying quality data. Due to this, use of Common Crawl typically necessitates well-designed extraction and filtering. Our Common Crawl-based dataset, Pile-CC, uses jusText (Endrédy and Novák, 2013) on Web Archive files (raw HTTP responses including page HTML) for extraction, which yields higher quality output.

2.2 PubMed Central

PubMed Central (PMC) is a subset of the PubMed online repository for biomedical articles, run by the United States of America's National Center for Biotechnology Information (NCBI), providing open, full-text access to nearly five million publications. Most publications indexed by PMC are recent, and their inclusion is mandated for all NIH-funded research starting from 2008 by the NIH Public Access Policy. We included PMC in the hopes that it will benefit potential downstream applications to the medical domain.

2.3 Books3

Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
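To make the arithmetic behind Table 1 concrete: each component's Effective Size is its Raw Size multiplied by its epoch count, and its Weight is that effective size as a share of the total effective size (1254.20 GiB). The short sketch below, a minimal illustration rather than the Pile's actual build code, recomputes those two columns for a few rows.

```python
# Minimal sketch: reproduce Table 1's "Effective Size" and "Weight" columns
# from each component's raw size (GiB) and epoch count. Illustrative only;
# the real Pile build code lives at https://github.com/EleutherAI/the-pile.
raw_sizes_gib = {
    "Pile-CC": (227.12, 1.0),
    "PubMed Central": (90.27, 2.0),
    "Books3": (100.96, 1.5),
    "Wikipedia (en)": (6.38, 3.0),
    # ... remaining 18 components omitted for brevity
}

# effective size = raw size * epochs
effective = {name: size * epochs for name, (size, epochs) in raw_sizes_gib.items()}

# Total effective size over all 22 components, from the bottom row of Table 1.
TOTAL_EFFECTIVE_GIB = 1254.20

for name, eff in effective.items():
    weight = eff / TOTAL_EFFECTIVE_GIB  # fraction of bytes seen per Pile epoch
    print(f"{name:16s} effective={eff:8.2f} GiB  weight={weight:6.2%}")
```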
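Section 2.1 notes that Pile-CC applies jusText to Web Archive (WARC) responses rather than relying on Common Crawl's pre-extracted text. The sketch below shows what that extraction step can look like using the open-source warcio and justext packages; the file path is a placeholder, and the actual Pile-CC pipeline involves additional filtering beyond this minimal illustration.

```python
# Minimal sketch of jusText-based extraction from a Common Crawl WARC file,
# in the spirit of Pile-CC's approach (the real pipeline applies further
# filtering on top of this). Requires: pip install warcio justext
import justext
from warcio.archiveiterator import ArchiveIterator

def extract_documents(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read()
            try:
                paragraphs = justext.justext(html, justext.get_stoplist("English"))
            except Exception:
                continue  # skip undecodable or non-HTML payloads
            text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
            if text:
                yield text

# "example.warc.gz" is a placeholder path to a downloaded Common Crawl segment.
for doc in extract_documents("example.warc.gz"):
    print(doc[:200], "...")
    break
```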
Recommended publications
  • Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
    Master in Cognitive Science and Language, Master's Thesis, September 2018. Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis. Alejandro Ramírez Atrio. Supervisor: Toni Badia.
    Abstract: In this thesis we study the effect of word reordering as preprocessing for Cross-Lingual Sentiment Analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles that of our source language (English). Our original expectation was that a Long Short-Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with a different word order. We hypothesized that the more the word order of either of our target languages resembles that of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our part-of-speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora, CoStEP and Tatoeba. The results suggest that the bilingual word embeddings we train our Long Short-Term Memory model with do not lead it to learn English word order when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in score even when applying a random reordering that makes them almost unintelligible, neither when classifying between 2 options (positive, negative) nor between 4 (strongly positive, positive, negative, strongly negative).
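    The core preprocessing idea here, rewriting a target-language sentence according to POS-based transformation rules so its word order looks more like the source language's, can be illustrated with a small sketch. The rule below (moving an adjective in front of the noun it follows) is a hypothetical example of the kind of rule involved, not one of the thesis's actual rule sets.

```python
# Illustrative POS-based reordering as preprocessing: swap NOUN-ADJ pairs into
# ADJ-NOUN order so Spanish/Catalan word order looks more English-like.
# The rule and the tagged sentence are hypothetical examples.

def reorder_noun_adj(tagged_tokens):
    """tagged_tokens: list of (word, POS) pairs; returns the reordered word list."""
    tokens = list(tagged_tokens)
    out = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and tokens[i][1] == "NOUN" and tokens[i + 1][1] == "ADJ"):
            out.extend([tokens[i + 1][0], tokens[i][0]])  # put ADJ before NOUN
            i += 2
        else:
            out.append(tokens[i][0])
            i += 1
    return out

# "una casa blanca" ("a white house"), tagged DET NOUN ADJ -> DET ADJ NOUN
sentence = [("una", "DET"), ("casa", "NOUN"), ("blanca", "ADJ")]
print(" ".join(reorder_noun_adj(sentence)))  # -> "una blanca casa"
```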
  • The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
    The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design. Jeffrey Dean, Google Research, [email protected].
    Abstract: The past decade has seen a remarkable series of advances in machine learning, and in particular deep learning approaches based on artificial neural networks, to improve our abilities to build more accurate systems across a broad range of areas, including computer vision, speech recognition, language translation, and natural language understanding tasks. This paper is a companion paper to a keynote talk at the 2020 International Solid-State Circuits Conference (ISSCC) discussing some of the advances in machine learning, and their implications on the kinds of computational devices we need to build, especially in the post-Moore's Law era. It also discusses some of the ways that machine learning may also be able to help with some aspects of the circuit design process. Finally, it provides a sketch of at least one interesting direction towards much larger-scale multi-task models that are sparsely activated and employ much more dynamic, example- and task-based routing than the machine learning models of today.
    Introduction: The past decade has seen a remarkable series of advances in machine learning (ML), and in particular deep learning approaches based on artificial neural networks, to improve our abilities to build more accurate systems across a broad range of areas [LeCun et al. 2015]. Major areas of significant advances include computer vision [Krizhevsky et al. 2012, Szegedy et al. 2015, He et al. 2016, Real et al. 2017, Tan and Le 2019], speech recognition [Hinton et al.
  • Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
    RANLPStud 2011: Proceedings of the Student Research Workshop associated with the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), 13 September 2011, Hissar, Bulgaria. ISBN 978-954-452-016-8. Designed and printed by INCOMA Ltd., Shoumen, Bulgaria.
    Preface: The Recent Advances in Natural Language Processing (RANLP) conference, already in its eighth year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we decided to give the floor to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop, which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high-quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by
  • The Pagerank Algorithm and Application on Searching of Academic Papers
    The PageRank algorithm and application on searching of academic papers. Ping Yeh, Google, Inc., 2009/12/9, Department of Physics, NTU.
    Disclaimer (legal): The content of this talk is the speaker's personal opinion and is not the opinion or policy of his employer. Disclaimer (content): You will not hear physics and you will not see differential equations. You will get a review of PageRank, the algorithm used in Google's web search, which researchers have also applied to evaluate journal status and the influence of nodes in a graph; see some linear algebra and Markov chains associated with it; and see some results of applying it to journal status.
    Outline: Introduction; Google and Google search; the PageRank algorithm for ranking web pages; using MapReduce to calculate PageRank for billions of pages; impact factor of journals and PageRank; conclusion.
    Google: The name is a homophone of the word "googol", which means 10^100. The company was founded by Larry Page and Sergey Brin in 1998, had roughly 20,000 employees as of 2009, spread across 68 offices around the world (23 in North America, 3 in Latin America, 14 in Asia Pacific, 23 in Europe, 5 in the Middle East and Africa). Its mission: "to organize the world's information and make it universally accessible and useful." Google services span web search, YouTube, iGoogle, Talk, Book Search, Chrome, Calendar, Scholar, Translate, Blogger, Android, Product Search, News, Maps, Picasa Web Albums, Video, Groups, Gmail, Desktop, Reader, Earth, and Sky. Google Search: http://www.google.com/ or http://www.google.com.tw/
    The abundance problem: to quote Langville and Meyer's book "Google's PageRank and Beyond: The Science of Search Engine Rankings", the men in Jorge Luis Borges' 1941 short story "The Library of Babel", which describes an imaginary, infinite library.
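    Since this excerpt only sketches the algorithm, here is a minimal, self-contained power-iteration implementation of PageRank. It is an illustrative sketch (the damping factor of 0.85 and the tiny example graph are assumptions), not the speaker's MapReduce implementation.

```python
# Minimal PageRank via power iteration on an adjacency list.
# Illustrative sketch only; real web-scale runs use MapReduce or similar.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outlinks in links.items():
            if outlinks:
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Toy citation graph: paper A cites B and C, B cites C, C cites A.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```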
  • Proceedings of the Second Workshop on Abusive Language Online (ALW2), Pages 1–10, Brussels, Belgium, October 31, 2018
    EMNLP 2018: Second Workshop on Abusive Language Online. Proceedings of the Workshop, co-located with EMNLP 2018, October 31, 2018, Brussels, Belgium. © 2018 The Association for Computational Linguistics. Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL), 209 N. Eighth Street, Stroudsburg, PA 18360, USA; Tel: +1-570-476-8006; Fax: +1-570-476-0860; [email protected]. ISBN 978-1-948087-68-1.
    Introduction: Interaction amongst users on social networking platforms can enable constructive and insightful conversations and civic participation; however, on many sites that encourage user interaction, verbal abuse has become commonplace. Abusive behavior such as cyberbullying, hate speech, and scapegoating can poison the social climates within online communities. The last few years have seen a surge in such abusive online behavior, leaving governments, social media platforms, and individuals struggling to deal with the consequences. As a field that works directly with computational analysis of language, the NLP community is uniquely positioned to address the difficult problem of abusive language online; encouraging collaborative and innovative work in this area is the goal of this workshop. The first year of the workshop saw 14 papers presented in a day-long program including interdisciplinary panels and active discussion. In this second edition, we have aimed to build on the success of the first year, maintaining a focus on computationally detecting abusive language and encouraging interdisciplinary work. Reflecting the growing research focus on this topic, the number of submissions received more than doubled, from 22 in last year's edition of the workshop to 48 this year.
  • Using Morphemes from Agglutinative Languages Like Quechua and Finnish to Aid in Low-Resource Translation
    Using Morphemes from Agglutinative Languages like Quechua and Finnish to Aid in Low-Resource Translation. John E. Ortega ([email protected]), Dept. de Llenguatges i Sistemes Informatics, Universitat d'Alacant, E-03071, Alacant, Spain. Krishnan Pillaipakkamnatt ([email protected]), Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA.
    Abstract: Quechua is a low-resource language spoken by nearly 9 million persons in South America (Hintz and Hintz, 2017). Yet, in recent times there are few published accounts of successful adaptations of machine translation systems for low-resource languages like Quechua. In some cases, machine translations from Quechua to Spanish are inadequate due to error in alignment. We attempt to improve previous alignment techniques by aligning two languages that are similar due to agglutination: Quechua and Finnish. Our novel technique allows us to add rules that improve alignment for the prediction algorithm used in common machine translation systems.
    1 Introduction: The NP-complete problem of translating natural languages as they are spoken by humans to machine-readable text is a complex problem; yet, it is partially solvable due to the accuracy of machine language translations when compared to human translations (Kleinberg and Tardos, 2005). Statistical machine translation (SMT) systems such as Moses require that an algorithm be combined with enough parallel corpora, text from distinct languages that can be compared sentence by sentence, to build phrase translation tables from language models. For many European languages, the translation task of bringing words together in a sequential sentence-by-sentence format for modeling, known as word alignment, is not hard due to the abundance of parallel corpora in large data sets such as Europarl.
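    As context for the word-alignment step this abstract refers to, the sketch below builds a toy alignment table from sentence-aligned parallel data using simple co-occurrence (Dice) scores. It is a deliberately simplified stand-in for real aligners such as GIZA++ or the authors' rule-augmented approach, and the sentence pairs are invented for illustration; note how noisy the scores are for function words, which is exactly why stronger alignment models (and, per the abstract, morpheme-aware rules) are needed.

```python
# Toy co-occurrence-based word alignment over sentence-aligned parallel data.
# A simplified stand-in for real aligners (e.g. GIZA++ / fast_align); the
# sentence pairs below are invented purely for illustration.
from collections import Counter
from itertools import product

pairs = [
    ("the house is big", "la casa es grande"),
    ("the house is small", "la casa es pequeña"),
    ("the dog is big", "el perro es grande"),
]

src_counts, tgt_counts, joint = Counter(), Counter(), Counter()
for src, tgt in pairs:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_counts.update(src_words)
    tgt_counts.update(tgt_words)
    joint.update(product(src_words, tgt_words))

def dice(s, t):
    # Dice coefficient: how often s and t co-occur relative to how often each appears.
    return 2 * joint[(s, t)] / (src_counts[s] + tgt_counts[t])

# Best target-word candidate for each source word.
for s in src_counts:
    best = max(tgt_counts, key=lambda t: dice(s, t))
    print(f"{s:6s} -> {best}  (dice={dice(s, best):.2f})")
```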
  • A Massively Parallel Corpus: the Bible in 100 Languages
    Lang Resources & Evaluation, DOI 10.1007/s10579-014-9287-y, original paper. A massively parallel corpus: the Bible in 100 languages. Christos Christodouloupoulos, Mark Steedman. © The Author(s) 2014. This article is published with open access at Springerlink.com.
    Abstract: We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora. Keywords: parallel corpus; multilingual corpus; comparative corpus linguistics.
    1 Introduction: Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. One of the main uses of the latter kind is as training material for statistical machine translation (SMT), where large amounts of aligned data are standardly used to learn word alignment models between the lexica of two languages (for example, in the GIZA++ system of Och and Ney 2003). Another interesting use of parallel corpora in NLP is projected learning of linguistic structure. In this approach, supervised data from a resource-rich language is used to guide the unsupervised learning algorithm in a target language. Although there are some techniques that do not require parallel texts (e.g. Cohen et al. 2011), the most successful models use sentence-aligned corpora (Yarowsky and Ngai 2001; Das and Petrov 2011).
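    Because Bible translations share a common book/chapter/verse indexing, sentence alignment for a corpus like this can be as simple as joining on verse IDs. The sketch below illustrates that idea with a hypothetical file layout (tab-separated verse ID and text per language); it is not the authors' actual pipeline or data format.

```python
# Align two Bible translations by verse ID (e.g. "GEN 1:1"), assuming each
# language is stored as tab-separated "verse_id<TAB>text" lines. The file
# names and the format are hypothetical, for illustration only.
def load_verses(path):
    verses = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            verse_id, _, text = line.rstrip("\n").partition("\t")
            if text:
                verses[verse_id] = text
    return verses

english = load_verses("bible.en.tsv")
czech = load_verses("bible.cs.tsv")

# Keep only verses present in both translations.
aligned = [(vid, english[vid], czech[vid]) for vid in english if vid in czech]
print(f"{len(aligned)} aligned verse pairs")
for vid, en, cs in aligned[:3]:
    print(vid, "|", en, "|", cs)
```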
  • A Corpus-Based Study of Unaccusative Verbs and Auxiliary Selection
    To be or not to be: A corpus-based study of unaccusative verbs and auxiliary selection. Master's Thesis presented to the Faculty of the Graduate School of Arts and Sciences, Brandeis University, Department of Computer Science. James Pustejovsky, Advisor. In partial fulfillment of the requirements for the Master's degree, by Richard A. Brutti Jr., May 2012. © Copyright by Richard A. Brutti Jr. 2012. All Rights Reserved.
    Abstract: Since the introduction of the Unaccusative Hypothesis (Perlmutter, 1978), there have been many further attempts to explain the mechanisms behind the division in intransitive verbs. This paper aims to analyze and test some of the theories of unaccusativity using computational linguistic tools. Specifically, I focus on verbs that exhibit split intransitivity, that is, verbs that can appear in both unaccusative and unergative constructions, and on determining the distinguishing features that make this alternation possible. Many formal linguistic theories of unaccusativity involve the interplay of semantic roles and temporal event markers, both of which can be analyzed using statistical computational linguistic tools, including semantic role labelers, semantic parses, and automatic event classification. I use auxiliary verb selection as a surface-level indicator of unaccusativity in Italian and Dutch, and test various classes of verbs extracted from the Europarl corpus (Koehn, 2005). Additionally, I provide some historical background for the evolution of this distinction, and analyze how my results fit into the larger theoretical framework.
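    The thesis uses auxiliary selection (e.g. Italian essere vs. avere) as a surface-level indicator of unaccusativity. A minimal way to operationalize that on a tagged or parsed corpus is to count which auxiliary each verb's past participle occurs with; the sketch below does this over toy (auxiliary, participle) pairs, which stand in for real Europarl extractions and are not the thesis's actual method or data.

```python
# Count auxiliary selection per verb as a rough unaccusativity signal.
# The (auxiliary, participle) pairs below are toy stand-ins for tuples
# extracted from a parsed corpus such as Europarl.
from collections import defaultdict, Counter

observations = [
    ("essere", "arrivato"), ("essere", "arrivata"), ("avere", "corso"),
    ("essere", "corso"), ("avere", "mangiato"), ("avere", "mangiato"),
]

by_verb = defaultdict(Counter)
for aux, participle in observations:
    # Very crude lemmatization of the participle, just for grouping.
    lemma = participle.rstrip("oaie")
    by_verb[lemma][aux] += 1

for lemma, counts in by_verb.items():
    total = sum(counts.values())
    essere_ratio = counts["essere"] / total
    label = "unaccusative-leaning" if essere_ratio > 0.5 else "unergative/transitive-leaning"
    print(f"{lemma:10s} essere={counts['essere']} avere={counts['avere']} -> {label}")
```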
  • The Translation Equivalents Database (Treq) As a Lexicographer’S Aid
    The Translation Equivalents Database (Treq) as a Lexicographer's Aid. Michal Škrabal, Martin Vavřín, Institute of the Czech National Corpus, Charles University, Czech Republic. E-mail: [email protected], [email protected].
    Abstract: The aim of this paper is to introduce a tool that has recently been developed at the Institute of the Czech National Corpus, the Treq (Translation Equivalents) database, and to explore its possible uses, especially in the field of lexicography. Equivalent candidates offered by Treq can also be considered as potential equivalents in a bilingual dictionary (we will focus on the Latvian–Czech combination in this paper). Lexicographers instantly receive a list of candidates for target-language counterparts and their frequencies (expressed both in absolute numbers and percentages) that suggest the probability that a given candidate is functionally equivalent. A significant advantage is the possibility to click on any one of these candidates and immediately verify their individual occurrences in a given context, and thus more easily distinguish the relevant translation candidates from the misleading ones. This utility, which is based on data stored in the InterCorp parallel corpus, is continually being upgraded and enriched with new functions (the recent integration of multi-word units, adding English as the primary language of the dictionaries, an improved interface, etc.), and the accuracy of the results is growing as the volume of data keeps increasing. Keywords: InterCorp; Treq; translation equivalents; alignment; Latvian–Czech dictionary.
    1. Introduction: The aim of this paper is to introduce one of the tools that has been developed recently at the Institute of the Czech National Corpus (ICNC) and which could be especially helpful to lexicographers: namely, the Treq translation equivalents database.
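    To illustrate the kind of output described here (candidate target-language equivalents ranked by absolute frequency and percentage), the sketch below computes such a list from word-aligned pairs. The aligned pairs are invented stand-ins for alignments derived from a parallel corpus such as InterCorp; this is not Treq's actual implementation.

```python
# Rank candidate translation equivalents for a source word, Treq-style:
# absolute co-occurrence counts plus percentages. The aligned word pairs
# below are invented stand-ins for parallel-corpus alignments.
from collections import Counter, defaultdict

# (source_word, target_word) pairs harvested from word-aligned sentences.
aligned_pairs = [
    ("maja", "dum"), ("maja", "dum"), ("maja", "domek"),
    ("maja", "budova"), ("maja", "dum"),
]

candidates = defaultdict(Counter)
for src, tgt in aligned_pairs:
    candidates[src][tgt] += 1

def equivalents(word):
    counts = candidates[word]
    total = sum(counts.values())
    return [(tgt, n, 100.0 * n / total) for tgt, n in counts.most_common()]

for tgt, n, pct in equivalents("maja"):
    print(f"maja -> {tgt}: {n} ({pct:.1f}%)")
```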
  • Summer 2017 Issue 11.1
    The magazine of Carnegie Mellon University's School of Computer Science, Summer 2017, Issue 11.1: "60 Years in the Making: CMU AI is Here."
    Bhat, Matthews Win Academy Awards for Technical Achievement. School of Computer Science alumnus Kiran Bhat and former Robotics Institute faculty member Iain Matthews received Oscars on February 11 from the Academy of Motion Picture Arts and Sciences for their work in capturing facial performances. Bhat earned his doctorate in robotics in 2004, and helped design and develop the Industrial Light and Magic facial performance-capture solving system, which transfers facial performances from actors to digital characters in large-scale productions. The system was used in "Rogue One: A Star Wars Story" to resurrect the role of Grand Moff Tarkin, played by the late actor Peter Cushing, as well as to capture Mark Ruffalo's expressions for his character, the Hulk, in "The Avengers." Matthews, a post-doctoral researcher and former faculty member in the Robotics Institute working on face modeling and vision-based tracking, was recognized along with his team for the design, engineering and development of the facial performance-capture and solving system at Weta Digital, known as FACETS. Matthews spent two years helping to develop the facial motion capture system for "Avatar" and "Tintin." With Bhat's and Matthews' wins, Carnegie Mellon alumni and faculty have received nine Academy Awards to date.
    (Sidebar: Computer science at CMU underpins divergent fields and endeavors in today's world, all of which link SCS to profound advances in art, culture, nature, the sciences and beyond.)
  • Device Placement with Reinforcement Learning
    Special Topics, CSci 8980: Machine Learning in Computer Systems. Jon B. Weissman ([email protected]), Department of Computer Science, University of Minnesota.
    Introduction: introductions for everyone; who are you, what interests you, and why are you here? What is this course about? Machine learning, interpreted broadly (learning from data to improve ...), applied to computer systems, also interpreted broadly: compilers, databases, networks, OS, mobile, security, ... (not finding a boat in an image).
    Confession: if you took an ML course, you know more about it than I do. Interestingly, I took an AI course from Geoff Hinton and did an M.S. on neural networks eons ago.
    Web site: http://www-users.cselabs.umn.edu/classes/Spring-2019/csci8980/
    Technical course goals: learn a "little" about ML and DL techniques and understand their scope of applicability; learn about one or more areas of computer systems in more detail; learn how ML/DL can benefit computer systems.
    Non-technical course goals: learn how to write critiques (blogs); learn how to present papers and lead discussions; do a team research project (idea formation, writeup, experiment, present, and, fingers crossed, publish a workshop paper).
    Major topics: machine learning introduction, databases, networking, scheduling, power management, storage, compilers/architecture, fault tolerance, IoT/mobile.
    Course structure and grading: presentations, 2 of them (1 big, 1 small), 10% each; take-home mid-term, 20%; final project, 30%; written critiques (blogging), 10%.
  • Large-Scale Deep Learning with TensorFlow
    Large-Scale Deep Learning with TensorFlow. Jeff Dean, Google Brain team, g.co/brain. In collaboration with many other people at Google.
    What is the Google Brain team? A research team focused on long-term artificial intelligence research, with a mix of computer systems and machine learning research expertise: pure ML research, and research in the context of emerging ML application areas such as robotics, language understanding, and healthcare.
    We disseminate our work in many ways: by publishing our work (see papers at research.google.com/pubs/BrainTeam.html); by releasing TensorFlow, our core machine learning research system, as an open-source project; by releasing implementations of our research models in TensorFlow; and by collaborating with product teams at Google to get our research into real products.
    What do we really want? Build artificial intelligence algorithms and systems that learn from experience, and use those to solve difficult problems that benefit humanity.
    What do I mean by understanding? Consider the query [ car parts for sale ]. Document 1: "... car parking available for a small fee. ... parts of our floor model inventory for sale." Document 2: "Selling all kinds of automobile and pickup truck parts, engines, and transmissions."
    Example needs of the future: Which of these eye images shows symptoms of diabetic retinopathy? Find me all rooftops in North America. Describe this video in Spanish.