A Massively Parallel Barcoded Sequencing Pipeline Enables Generation of the First Orfeome and Interactome Map for Rice
Total Page:16
File Type:pdf, Size:1020Kb
A massively parallel barcoded sequencing pipeline enables generation of the first ORFeome and interactome map for rice Shayne D. Wierbowskia,b,1, Tommy V. Vob,1,2, Pascal Falter-Braunc,d, Timothy O. Jobee, Lars H. Krusef, Xiaomu Weia, Jin Liangb, Michael J. Meyera,b, Nurten Akturkb, Christen A. Rivera-Erickb, Nicolas A. Corderob, Mauricio I. Paramob,g, Elnur E. Shayhidinb, Marta Bertolottib, Nathaniel D. Tippensa,b, Kazi Aktherh, Rita Sharmai, Yuichi Katayosej, Kourosh Salehi-Ashtianik,l,m,n, Tong Haol,m, Pamela C. Ronaldo,p,q, Joseph R. Eckerr,s, Peter A. Schweitzert, Shoshi Kikuchiu, Hiroshi Mizunov, David E. Hilll,m, Marc Vidall,m, Gaurav D. Moghef, Susan R. McCouchh,3, and Haiyuan Yua,b,3 aDepartment of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853; bWeill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853; cInstitute of Network Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, 85764 Munich, Germany; dMicrobe-Host Interactions, Faculty of Biology, Ludwig-Maximilians-Universität München, 80539 Munich, Germany; eBotanical Institute, Cluster of Excellence on Plant Sciences (CEPLAS), University of Cologne, 50674 Cologne, Germany; fPlant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, NY 14853; gDepartment of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853; hSection of Plant Breeding and Genetics, School of Integrated Plant Sciences, Cornell University, Ithaca, NY 14853-1901; iSchool of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India; jAdvanced Genomics Breeding Section, Institute of Crop Science, National Agriculture and Food Research Organization, Tsukuba, Ibaraki 305-8634, Japan; kLaboratory of Algal, Systems, and Synthetic Biology, Division of Science and Math, New York University Abu Dhabi, 129188, Abu Dhabi, United Arab Emirates; lCenter for Cancer Systems Biology, Dana-Farber Cancer Institute, Boston, MA 02215; mDepartment of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115; nCenter for Genomics and Systems Biology, New York University Abu Dhabi, 129188, Abu Dhabi, United Arab Emirates; oDepartment of Plant Pathology, University of California, Davis, CA 95616; pThe Genome Center, University of California, Davis, CA 95616; qInnovative Genomics Institute, Berkeley, CA 94704; rHoward Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037; sPlant Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037; tBRC Genomics Facility, Institute of Biotechnology, Cornell University, Ithaca, NY 14853; uPlant Genome Research Unit, Division of Genome and Biodiversity Research, Agrogenomics Research Center, National Institute of Agrobiological Sciences, Tsukuba 305-0856, Japan; and vDepartment of Molecular Biology, National Institute of Agrobiological Resources, Tsukuba 305-0856, Japan PLANT BIOLOGY Contributed by Susan R. McCouch, March 11, 2020 (sent for review October 21, 2019; reviewed by Jing-Dong Jackie Han and Yinong Yang) Systematic mappings of protein interactome networks have pro- regarding plate layout, and cannot detect potential cross- vided invaluable functional information for numerous model contamination between wells (SI Appendix,Fig.S1C and D). organisms. Here we develop PCR-mediated Linkage of barcoded Here, we develop a massively parallel sequencing approach Adapters To nucleic acid Elements for sequencing (PLATE-seq) that called PCR-mediated Linkage of barcoded Adapters To nucleic serves as a general tool to rapidly sequence thousands of DNA elements. We validate its utility by generating the ORFeome for Oryza sativa covering 2,300 genes and constructing a high-quality Significance protein–protein interactome map consisting of 322 interactions between 289 proteins, expanding the known interactions in rice Although recent advances in next-generation sequencing have by roughly 50%. Our work paves the way for high-throughput facilitated the construction of whole genomes from hundreds profiling of protein–protein interactions in a wide range of of organisms, considerable barriers still restrict the functional organisms. understanding of the genes that they contain. An initial pre- requisite for such an understanding is the availability of high- protein–protein interaction | rice | ORFeome | next-generation quality gene libraries (ORFeomes) amenable to high-throughput sequencing | networks functional experiments. Here we develop a massively parallel next-generation sequencing method, PLATE-seq, and leverage the method to construct an ORFeome for rice, providing a toolkit he genomics revolution has democratized sequencing and for systematic functional study in an agricultural species. To structural annotation of genomes; however, assigning func- T demonstrate the utility of these resources, we present a map of tions to predicted genes remains an important unsolved challenge. rice protein–protein interactions. Identification of protein–protein interactions can help advance functional annotation in sequenced genomes. The first step in Author contributions: T.V.V., X.W., J.L., M.J.M., P.C.R., J.R.E., P.A.S., S.K., H.M., D.E.H., systematic, genome-wide mapping of protein–protein interactions M.V., S.R.M., and H.Y. designed research; S.D.W., T.V.V., P.F.-B., T.O.J., L.H.K., X.W., J.L., involves the construction of a comprehensive set of high-quality M.J.M., N.A., C.A.R.-E., N.A.C., M.I.P., E.E.S., M.B., K.A., R.S., Y.K., K.S.-A., S.R.M., and H.Y. performed research; T.V.V., P.F.-B., N.D.T., R.S., Y.K., K.S.-A., T.H., P.C.R., J.R.E., S.K., H.M., open reading frames (ORFs). To generate such ORFeomes, tens D.E.H., and M.V. contributed new reagents/analytic tools; S.D.W., T.V.V., L.H.K., M.J.M., K. of thousands of clones must be sequenced to ensure that only S-A., T.H., G.D.M., S.R.M., and H.Y. analyzed data; and S.D.W., T.V.V., L.H.K., G.D.M., correct, full-length clones are retained. Traditionally, this is S.R.M., and H.Y. wrote the paper. achieved through the labor-intensive and cost-prohibitive process Reviewers: J.-D.J.H., Peking University; and Y.Y., Pennsylvania State University. of Sanger sequencing of each individual clone. Simple barcoding The authors declare no competing interest. strategies that append a barcode at the beginning or end of each Published under the PNAS license. 1 ORF are problematic because they require thousands of unique S.D.W. and T.V.V. contributed equally to this work. barcodes, provide extremely limited ORF coverage, and generate 2Present address: Laboratory of Biochemistry and Molecular Biology, National Cancer SI Institute, National Institutes of Health, Bethesda, MD 20892. a high fraction of uninformative reads that contain no barcode ( 3 Appendix A To whom correspondence may be addressed. Email: [email protected] or haiyuan.yu@ ,Fig.S1 ). A deep-well-pooling approach has recently cornell.edu. been used to sequence ORFeome libraries (1, 2); however, these This article contains supporting information online at https://www.pnas.org/lookup/suppl/ smart pooling approaches cannot accommodate the inclusion of doi:10.1073/pnas.1918068117/-/DCSupplemental. homologous ORFs in one pool, rely on concrete prior knowledge www.pnas.org/cgi/doi/10.1073/pnas.1918068117 PNAS Latest Articles | 1of7 Downloaded by guest on September 25, 2021 acid Elements for sequencing (PLATE-seq)—a broadly utiliz- maximized to span both the plate-specific barcode and the ORF able approach for rapid sequencing of thousands of DNA ele- sequence. Notably, the final low-cycle PCR enrichment ensures that ments. We validate the utility of PLATE-seq by developing an all paired reads contain both barcodes and can be informatively ORFeome for rice and constructing a high-quality, experimen- mapped back to their exact source well. Moreover, because Tn5 tally validated protein–protein interactome map of this impor- tagmentation acts at a random position within the ORF, the frag- tant monocot species. Despite being a staple food for over half of ments generated can theoretically span the entire ORF. However, the world’s population, and an important model for monocot in practice, cluster formation and amplification using current genomics, empirical annotations currently cover fewer than 5% Illumina sequencers become inefficient for prohibitively large of genes in the rice genome (3). Indeed, understanding the fragments. Therefore, we optimized our methods to provide function of plant genes has become a major bottleneck for the coverage of roughly the last 800 to 1,000 bp of each ORF (SI field of plant biology as a whole. Appendix,Fig.S1B). The majority of plant functional studies have been carried out In order to benchmark the accuracy of PLATE-seq, we in the dicot model organism Arabidopsis thaliana (4, 5), and, to implemented the method on a test plate of 94 human ORF clones date, only limited characterization has been performed in mono- selected from the sequence-verified human ORFeome 8.1 library cot species. Combating this disparity in annotation, the develop- (2), demonstrating that PLATE-seq correctly identified the true ment of a comprehensive, high-quality ORFeome for Oryza sativa ORF in 100% of the test cases (Fig. 1C). Determination of clone would enable large-scale reverse-proteomics studies—including identity for each well was made by aligning the reads to the entire the systematic mapping of protein–protein interactions—and thus 8.1 reference library to calculate the fraction of reads in each well would significantly expand the functional genomics toolkit for contributed by each ORF. Importantly, the vast majority of plants. PLATE-seq reads aligned to the true ORF with only a minor A full-length complementary DNA (FLcDNA) clone library fraction aligning to an incorrect ORF introduced through exper- has previously been reported in O. sativa (6); however, such li- imental contamination or alignment ambiguities (Fig. 1D). More- braries are not suitable for high-throughput studies.