
The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
EleutherAI
[email protected]

Abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction (https://pile.eleuther.ai/).

1 Introduction

Recent breakthroughs in general-purpose language modeling have demonstrated the effectiveness of training massive models on large text corpora for downstream applications (Radford et al., 2019; Shoeybi et al., 2019; Raffel et al., 2019; Rosset, 2019; Brown et al., 2020; Lepikhin et al., 2020). As the field continues to scale up language model training, the demand for high-quality massive text data will continue to grow (Kaplan et al., 2020).

The growing need for data in language modeling has caused most existing large-scale language models to turn to the Common Crawl for most or all of their data (Brown et al., 2020; Raffel et al., 2019). While training on the Common Crawl has been effective, recent work has shown that dataset diversity leads to better downstream generalization capability (Rosset, 2019). Additionally, large-scale language models have been shown to effectively acquire knowledge in a novel domain with only relatively small amounts of training data from that domain (Rosset, 2019; Brown et al., 2020; Carlini et al., 2020). These results suggest that by mixing together a large number of smaller, high-quality, diverse datasets, we can improve the general cross-domain knowledge and downstream generalization capabilities of the model compared to models trained on only a handful of data sources.

To address this need, we introduce the Pile: an 825.18 GiB English text dataset designed for training large-scale language models. The Pile is composed of 22 diverse and high-quality datasets, including both established natural language processing datasets and several newly introduced ones.

In addition to its utility in training large language models, the Pile can also serve as a broad-coverage benchmark for cross-domain knowledge and generalization ability of language models.

We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.
In addition, we incorporate several existing high-quality datasets: Books3 (Presser, 2020), Project Gutenberg (PG-19) (Rae et al., 2019), OpenSubtitles (Tiedemann, 2016), English Wikipedia, DM Mathematics (Saxton et al., 2019), EuroParl (Koehn, 2005), and the Enron Emails corpus (Klimt and Yang, 2004). To supplement these, we also introduce a new filtered subset of Common Crawl, Pile-CC, with improved extraction quality.

Through our analyses, we confirm that the Pile is significantly distinct from pure Common Crawl data. Additionally, our evaluations show that the existing GPT-2 and GPT-3 models perform poorly on many components of the Pile, and that models trained on the Pile significantly outperform both raw and filtered Common Crawl models. To complement the performance evaluations, we also perform an exploratory analysis of the text within the Pile to provide a detailed picture of the data. We hope that our extensive documentation of the construction and characteristics of the Pile will help researchers make informed decisions about potential downstream applications.

Finally, we make publicly available the preprocessing code for the constituent datasets of the Pile and the code for constructing alternative versions (https://github.com/EleutherAI/the-pile). In the interest of reproducibility, we also document all processing performed on each dataset (and the Pile as a whole) in as much detail as possible. For further details about the processing of each dataset, see Section 2 and Appendix C.

[Figure 1: Treemap of Pile components by effective size.]

1.1 Contributions

The core contributions of this paper are:

1. The introduction of an 825.18 GiB English-language dataset for language modeling combining 22 diverse sources.

2. The introduction of 14 new language modeling datasets, which we expect to be of independent interest to researchers.

3. Evaluations demonstrating significant improvements across many domains by GPT-2-sized models trained on this new dataset, compared to training on CC-100 and raw Common Crawl.

4. The investigation and documentation of this dataset, which we hope will better inform researchers about how to use it as well as motivate them to undertake similar investigations of their own data.

2 The Pile Datasets

The Pile is composed of 22 constituent sub-datasets, as shown in Table 1. Following Brown et al. (2020), we increase the weights of higher-quality components, with certain high-quality datasets such as Wikipedia being seen up to 3 times ("epochs") for each full epoch over the Pile. Detailed information about the construction of each dataset is available in Appendix C.

Component | Raw Size | Weight | Epochs | Effective Size | Mean Document Size
Pile-CC | 227.12 GiB | 18.11% | 1.0 | 227.12 GiB | 4.33 KiB
PubMed Central | 90.27 GiB | 14.40% | 2.0 | 180.55 GiB | 30.55 KiB
Books3† | 100.96 GiB | 12.07% | 1.5 | 151.44 GiB | 538.36 KiB
OpenWebText2 | 62.77 GiB | 10.01% | 2.0 | 125.54 GiB | 3.85 KiB
ArXiv | 56.21 GiB | 8.96% | 2.0 | 112.42 GiB | 46.61 KiB
GitHub | 95.16 GiB | 7.59% | 1.0 | 95.16 GiB | 5.25 KiB
FreeLaw | 51.15 GiB | 6.12% | 1.5 | 76.73 GiB | 15.06 KiB
Stack Exchange | 32.20 GiB | 5.13% | 2.0 | 64.39 GiB | 2.16 KiB
USPTO Backgrounds | 22.90 GiB | 3.65% | 2.0 | 45.81 GiB | 4.08 KiB
PubMed Abstracts | 19.26 GiB | 3.07% | 2.0 | 38.53 GiB | 1.30 KiB
Gutenberg (PG-19)† | 10.88 GiB | 2.17% | 2.5 | 27.19 GiB | 398.73 KiB
OpenSubtitles† | 12.98 GiB | 1.55% | 1.5 | 19.47 GiB | 30.48 KiB
Wikipedia (en)† | 6.38 GiB | 1.53% | 3.0 | 19.13 GiB | 1.11 KiB
DM Mathematics† | 7.75 GiB | 1.24% | 2.0 | 15.49 GiB | 8.00 KiB
Ubuntu IRC | 5.52 GiB | 0.88% | 2.0 | 11.03 GiB | 545.48 KiB
BookCorpus2 | 6.30 GiB | 0.75% | 1.5 | 9.45 GiB | 369.87 KiB
EuroParl† | 4.59 GiB | 0.73% | 2.0 | 9.17 GiB | 68.87 KiB
HackerNews | 3.90 GiB | 0.62% | 2.0 | 7.80 GiB | 4.92 KiB
YoutubeSubtitles | 3.73 GiB | 0.60% | 2.0 | 7.47 GiB | 22.55 KiB
PhilPapers | 2.38 GiB | 0.38% | 2.0 | 4.76 GiB | 73.37 KiB
NIH ExPorter | 1.89 GiB | 0.30% | 2.0 | 3.79 GiB | 2.11 KiB
Enron Emails† | 0.88 GiB | 0.14% | 2.0 | 1.76 GiB | 1.78 KiB
The Pile | 825.18 GiB | | | 1254.20 GiB | 5.91 KiB

Table 1: Overview of datasets in the Pile before creating the held-out sets. Raw Size is the size before any up- or down-sampling. Weight is the percentage of bytes in the final dataset occupied by each dataset. Epochs is the number of passes over each constituent dataset during a full epoch over the Pile. Effective Size is the approximate number of bytes in the Pile occupied by each dataset. Datasets marked with a † are used with minimal preprocessing from prior work.
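As a concrete illustration of this up-sampling scheme, the short sketch below recomputes Effective Size and Weight from the Raw Size and Epochs columns of Table 1 (effective size = raw size × epochs; weight = a component's share of the total effective size). The code and variable names are ours rather than the Pile build scripts, and only a few of the 22 rows are included, so the printed weights differ from the full-table values.

```python
# Minimal sketch (not the official Pile build code) of the up-sampling
# arithmetic behind Table 1: a component's effective size is its raw size
# times its epoch count, and its weight is its share of total effective size.

# (component, raw size in GiB, epochs) -- a few illustrative rows from Table 1
components = [
    ("Pile-CC",        227.12, 1.0),
    ("PubMed Central",  90.27, 2.0),
    ("Books3",         100.96, 1.5),
    ("Wikipedia (en)",   6.38, 3.0),
]

effective = {name: raw * epochs for name, raw, epochs in components}
total = sum(effective.values())  # over all 22 components this is ~1254.20 GiB

for name, size in effective.items():
    # Weight = fraction of one full epoch over the Pile contributed by the
    # component; with only four rows listed here, the percentages will not
    # match the Weight column of Table 1, which is computed over all 22.
    print(f"{name:15} effective {size:8.2f} GiB, weight {size / total:6.2%}")
```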
2.1 Pile-CC

Common Crawl is a collection of website crawls from 2008 onwards, including raw web pages, metadata, and text extractions. Because of its raw nature, Common Crawl has the advantage of including text from diverse domains, but at the cost of highly variable data quality. As a result, using Common Crawl typically necessitates well-designed extraction and filtering. Our Common Crawl-based dataset, Pile-CC, uses jusText (Endrédy and Novák, 2013) on Web Archive files (raw HTTP responses including page HTML) for extraction, which yields higher-quality output.
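To make this extraction step concrete, here is a minimal sketch of running jusText over Web ARChive (WARC) records. This is not the authors' Pile-CC pipeline: the use of the warcio library for WARC parsing, the helper name extract_text, and the input file name are illustrative assumptions.

```python
# Hypothetical sketch of jusText-based extraction from WARC files, in the
# spirit of the Pile-CC description above; not the authors' actual pipeline.
import justext
from warcio.archiveiterator import ArchiveIterator


def extract_text(warc_path, language="English"):
    """Yield (url, text) pairs of non-boilerplate content from one WARC file."""
    stoplist = justext.get_stoplist(language)
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Keep only HTTP responses that look like HTML pages.
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            try:
                paragraphs = justext.justext(html, stoplist)
            except Exception:
                continue  # skip malformed or undecodable pages
            text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
            if text:
                yield url, text


if __name__ == "__main__":
    # "CC-MAIN-example.warc.gz" is a placeholder path, not a real crawl file.
    for url, text in extract_text("CC-MAIN-example.warc.gz"):
        print(url, len(text))
```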
2.2 PubMed Central

PubMed Central (PMC) is a subset of the PubMed online repository for biomedical articles, run by the United States of America's National Center for Biotechnology Information (NCBI), providing open, full-text access to nearly five million publications. Most publications indexed by PMC are recent, and their inclusion is mandated for all NIH-funded research starting from 2008 by the NIH Public Access Policy. We included PMC in the hopes that it will benefit potential downstream applications to the medical domain.

2.3 Books3

Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).