<<

sequencing for a better life

annual report 2016 cnag centre nacional d’anàlisi genòmica centro nacional de análisis genómico 01 02 03 director 2016 in facts research highlights

• director’s report • single cell genomics operation at • foreword by the CRG director CNAG–CRG

• the genome of the Iberian lynx

• the IHEC coordinated paper release

• accuracy and speed of germline variant calling pipelines

• should network biology be used for drug discovery?

• the genetic history of Aboriginal Australians

• decoding the complete genome os the olive tree

• ancient admixture between and

04 05 06 platform overview research programmes appendix

• sequencing unit • bioinformatic development & • funding • bioinformatics unit statistical genomics • collaborators • genome assembly and • human resources annotation • projects • biomedical genomics • publications • population genomics • structural genomics • comparative genomics • single cell genomics annual report 2016

01. director director’s report foreword by the CRG director 2016 has been another productive and successful year for CNAG- complete epigenetic descriptions of cells of the immune system and CRG. We have continued our strategic path to offer the best put them in context of different diseases. They also describe tools 02. 2016 in facts possible support to our collaborators in their research projects. that were developed to capture the entire content of epigenetic Of particular focus are areas of patient-near research, such as in profiles. CNAG-CRG played a key role in this effort by sequencing 03. research highlights rare diseases and cancer. From an applications points of view we and analysing nearly 200 whole genome methylomes. single cell genomics operation at CNAG–CRG have extended our expertise in single cell analysis, epigenomics, the genome of the Iberian lynx translational techniques and the integration of population The RD-Connect database that was developed at the CNAG-CRG the IHEC coordinated paper release information. was made available to the European International Rare Disease accuracy and speed of germline variant Research Consortium investigators for β-testing this year. The calling pipeline This year has seen many highlights. We have taken our quality database received a lot of praise from the β-testers. Towards the should network biology be used for drug discovery? system to the next level by our ISO17025:2005 accreditation with end of the year saw the installation of the RD-Connect server. the genetic history of Aboriginal Australians the scope of DNA/RNA analysis by high throughput sequencing decoding the complete genome of the olive tree (NGS) by the Spanish national accreditation body ENAC. The Several of our de novo assembly and annotation projects arrived ancient admixture between chimpanzees and bonobos ISO17025:2005 accreditation covers the laboratory and data at publication this year, notably the iberian lynx, the turbot and analysis. This puts us in a unique position of having achieved the olive tree. The iberian lynx is a critically endangered felid. The this with such a wide scope. We are one of very few NGS annotated genome sequence allows the tuning of conservation 04. platform overview operations to be accredited and to be able to offer it both for efforts of this species with less than 200 remaining animals. sequencing unit academic research projects and clinical application. We have Supported by funding from the Emilio Botin Foundation we bioinformatics analysis unit started working with the clinical services of several hospitals for sequenced an over 1000 year old olive tree, the oldest still living personalized medicine. organism ever sequenced. This genetic information will help olive trees in their development and protection against infection. 05. research programmes In 2016 we started the upgrade process of our sequencer bioinformatic development & park. The first Illumina HiSeq4000 was received and taken into Personalized medicine is arriving and genome analysis is its statistical genomics operation. Our work on the Oxford Nanopore sequencer has major tool, as it provides unprecedented resolution for diagnosing genome assembly and annotation reached a level where we can now offer this to our collaborators. patients. Moving forward it is clear that CNAG-CRG will play biomedical genomics The nanopore sequencers provide an orthogonal type of sequence a key role in the implementation of personalized medicine into population genomics to the Illumina short-read sequencing. Sequences from the healthcare. With our sequencing platform, our sophistication in structural genomics nanopore can reach readlengths of several 10s kb. This datatype data analysis and our databases to make genomic data more user- comparative genomics can be perfectly applied in combination with Illumina short-reads in friendly we are in a prime position to support this monumental task. single cell genomics de novo assembly projects. We are investigating its use in projects where genomes are heavily rearranged, such as for example in 06. appendix cancer. funding collaborators The EU-funded project BLUEPRINT reached its conclusion in 2016 human resources with the publication of a series of 41 scientific papers within the Ivo G. Gut projects IHEC framework in journals of the Cell Press group and other Director publications high impact journals. These papers report on the relationship of

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annualannual report report 2016 2016

01. director director’s report foreword by the CRG director The CRG will be making Integrative Biology the cornerstone of its scientific programme in the coming five years. Addressing 02. 2016 in facts the complexity of biological systems, and more specifically of humans, now more than ever requires concerted consensus-based 03. research highlights and integrative approaches and biomedical interdisciplinary single cell genomics operation at CNAG–CRG science. The CRG has generated important scientific insights the genome of the Iberian lynx into our understanding of the organisation, deployment and the IHEC coordinated paper release evolution of genetic information, the internal workings of cells, their accuracy and speed of germline variant differentiation and reprogramming, their collective organisation calling pipeline in tissue formation and alterations in diseases, including cancer. should network biology be used for drug discovery? The CRG’s strategic fields are medical genetics and personalised the genetic history of Aboriginal Australians medicine. The recent integration of the CNAG within the CRG decoding the complete genome of the olive tree to become the CNAG-CRG is a key factor in this regard. An ancient admixture between chimpanzees and bonobos infrastructure of excellence such as the CNAG, with its associated technological development research, in combination with fundamental biological research and other resources (e.g. the 04. platform overview European Genome-phenome Archive) at the CRG, will position the sequencing unit latter as a leading international player in personalised medicine. bioinformatics analysis unit This, together with the presence of other prominent research institutes and infrastructures such as the Barcelona Super Computing Centre-Centro Nacional de Supercomputación (BSC-CNS) and 05. research programmes hospitals, will make Barcelona a major research hub. bioinformatic development & statistical genomics genome assembly and annotation biomedical genomics population genomics Luis Serrano structural genomics Director of the CRG comparative genomics single cell genomics

06. appendix funding collaborators human resources projects publications

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director 2016 in facts – A look back at the year

02. 2016 in facts

03. research highlights January February March April single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release Four of CNAG-CRG PI’s (Ivo Gut, Tomàs Marquès-Bonet is coauthor The complete sequencing of the CNAG-CRG organises the 2nd accuracy and speed of germline variant Simon Heath, Marc A. Martí-Renom of a study published in Nature that genome of the turbot, conducted by edition of the workshop Introduction to calling pipeline and Tomàs Marquès-Bonet) are finds first genetic evidence of modern scientists form CNAG-CRG and CRG Genome Analysis to help collaborators should network biology be used for drug discovery? present on the top zone of the Ranking human DNA in a Neanderthal is published. get a better understanding and the genetic history of Aboriginal Australians of Scientists in Spain published by individual. improve interpretation of genome data. decoding the complete genome of the olive tree the Webometrics Ranking of World ancient admixture between chimpanzees and bonobos Universities.

04. platform overview May June July September sequencing unit bioinformatics analysis unit Top scientists present their latest CNAG-CRG participates in the 120 students visit CNAG-CRG within Oscar Lao, co-authors a study research discoveries and ideas at the complete genome sequencing of an the “Summer Scientists” programme, published in Nature that analyses the 05. research programmes 5th CNAG Symposium on Genome ancient olive tree that is over 1,000 an outreach initiative to boost scientific genome of the Aboriginal Australian bioinformatic development & Research: Single Cell Studies. years old, and that will probably live vocations among high-school students. population helping to understand statistical genomics CNAG-CRG achieves ISO 17025:2005 another 1,300 years. how modern humans left the African genome assembly and annotation accreditation for its Next Generation continent. biomedical genomics Sequencing platform. population genomics structural genomics comparative genomics single cell genomics October November December

06. appendix Researchers from CNAG-CRG Cell Reports publishes a study led Publication of the genome sequencing funding benchmark six combinations of state- by CNAG-CRG that sheds light on of the Iberian lynx (Lynx pardinus), collaborators of-the-art read aligners and variant a specific aspect of the epigenome: currently one of the world’s most human resources callers for WES and WGS to improve DNA methylation. This paper is part of endangered felines. The sequencing projects accuracy and computing costs. a collection of 41 coordinated studies and assembly of the genome was publications developed within the International done at CNAG-CRG. The study led by Tomàs Marquès-Bonet Human Epigenome Consortium (IHEC) revealing ancient admixture events published in high-impact journals. among bonobos and chimpanzees is published in Science.

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

04. platform overview sequencing unit bioinformatics analysis unit

05. research programmes bioinformatic development & statistical genomics genome assembly and annotation biomedical genomics population genomics structural genomics comparative genomics single cell genomics

06. appendix funding collaborators human resources projects publications

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts single cell genomics operation at CNAG-CRG

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree Single Cell Genomics recently entered the spotlight of (ERCC) with defined concentration and length. While ancient admixture between chimpanzees and bonobos basic and translational research areas by providing both techniques showed high level of accuracy, Smart- high-resolution snapshots of complex samples. seq2 turned out to be the most sensitive approach Profiling the molecular architecture of single cells in to determine transcriptional activity of even lowly 04. platform overview heterogeneous and dynamic systems allows their expressed genes. On the other hand, MARS-Seq sequencing unit deconvolution in distinct cell types or states. Cell-to-cell represents one of the most cost-effective techniques bioinformatics analysis unit variability in genetic, epigenetic and transcriptional suitable for large-scale projects and the profiling of profiles describe phenotype differences between cells 1000s of single cells. and allows the hypothesis-free assessment of sample 05. research programmes heterogeneity. The Single Cell Genomics team has Single cell transcriptome methods rely on the bioinformatic development & implemented several techniques to profile genomes availability of fresh starting material, largely excluding statistical genomics and transcriptomes scalable to the characterization sampling at locations without immediate access to genome assembly and annotation of 1000s of individual cells. Specifically, we specialized equipment. This has major implications biomedical genomics implemented massively parallel single-cell RNA on study designs and is particularly challenging in population genomics sequencing (MARS-Seq) allowing the precise gene clinical context or in time course experiments. In structural genomics expression quantification through digital 3`-trancript order to disconnect time and location of sampling comparative genomics counting. Full-length mRNA is sequenced using Smart- from subsequent processing steps, we established a single cell genomics seq2, providing a technique to sensitively capture workflow that allows the transfer of isolated single cells and characterize single cell transcriptomes. Genetic following their lysis in individual wells. Our recent study alterations and epigenomic variance can be identified further increased the flexibility of sample handling by 06. appendix by whole-genome amplification (scWGA) and open implementing cryopreservation for long-term storage. funding chromatin profiling (scATACseq) strategies. We showed that the RNA content from cryopreserved collaborators cells is indistinguishable from freshly prepared samples human resources The performance of the team’s transcriptomics and that both conditions are equally suitable to assess projects applications was evaluated in a comparative study biological relevant differences between single cells. publications that included a broad spectrum of single cell RNA The work included clinical samples. Archiving material sequencing techniques. Teaming up with groups from in cryoprotectants could drastically broaden the scope Germany (LMU, Munich) and Sweden (KI, Stockholm) of single cell application in this context. Moreover, the MARS-Seq and Smart-seq2 were tested for their simultaneous processing of multiple samples largely efficiency and accuracy to assess a single cell’s RNA reduces technical artifacts (batch effects) that occur expression profile. A standardized experimental setup when single cells are processed at different time points. allowed comparative analysis across techniques; Together, cryopreservation provides a suitable solution based on endogenously expressed genes in mouse to avoid bottlenecks in sample supply, while increasing embryonic stem cells and the analysis of artificial RNAs the robustness of the results. cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts the genome of the Iberian lynx

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree The Iberian lynx (Lynx pardinus) is considered the most variants and substitutions, as well as the lowest ancient admixture between chimpanzees and bonobos endangered felid in the world by the International genome-wide genetic diversity reported so far in any Union for Conservation of Nature (IUCN). For years, mammalian species. Noticeably, it has preserved only this emblematic species has been the target of several half of the genetic diversity observed in the cheetah 04. platform overview coordinated efforts to avoid its extinction. The CNAG- (Acinonyx jubatus). sequencing unit CRG, playing an active role in a genome project The extreme genomic erosion we see in the bioinformatics analysis unit begun in 2010, coordinated by EBD-CSIC and Iberian lynx genome may hamper short- and long- funded by Banco Santander, combined whole genome term viability through reduced fitness or adaptive shotgun sequencing and fosmid pool sequencing potential. However, the knowledge and resources 05. research programmes strategies in order to generate the first annotated draft developed in this study will give a boost to research bioinformatic development & assembly of the Iberian lynx genome. Importantly, this on felid evolution and conservation genomics, which statistical genomics is the first mammalian genome entirely sequenced and will in turn benefit the ongoing conservation and genome assembly and annotation assembled in Spain. In addition to “Candiles”, the management of our national felid. biomedical genomics individual that was chosen for assembly, the CNAG- population genomics CRG re-sequenced the genomes of ten other Iberian structural genomics lynx coming from the two main populations left on the comparative genomics Iberian Peninsula and one Eurasian lynx as well. single cell genomics The study detected a series of severe population bottlenecks predating the known demographic decline Work of reference The Genome Assembly and Annotation team, together with other scientists 06. appendix in the 20th century that have greatly influenced the managed to read and organize 2.4 billion letters of DNA from Candiles, a male funding evolution of the Iberian lynx genome. We observed Extreme genomic erosion after recurrent demographic collaborators lynx born in the Sierra Morena lynx population, who now forms part of a program drastically reduced rates of weak-to-strong substitutions bottlenecks in the highly endangered Iberian lynx human resources for breeding in captivity.(Photo by J.M Perez de Ayala) associated with GC-biased gene conversion and Abascal F*, Corvelo A*, Cruz F* et al including projects increased rates of fixation of transposable elements. The Frias L, Ribeca P, Derdak S, Blanc J, Gut M, Gut I, publications genome of individuals from the two remnant Iberian lynx Marques-Bonet T, Alioto T. Genome Biol. 2016 Dec populations show multiple signatures of genetic erosion, 14;17(1):251. including a high frequency of potentially deleterious *contributed equally

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts the IHEC coordinated paper release

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree The International Human Epigenome Consortium (IHEC) epigenomic changes are established by environmental ancient admixture between chimpanzees and bonobos started in 2010. It is a large international initiative that cues, will likely lead to new biomarkers for better joins 8 individually funded projects with the objective to diagnosis and estimation of prognosis. We contributed generate comprehensive descriptions of the epigenomes by generation and analysis of 200 whole genome 04. platform overview of all cell types in the human body. CNAG-CRG is a bisulphite sequences. Apart from our involvement in sequencing unit partner in the EU-funded Project BLUEPRINT and has establishing the reference epigenomes we contributed bioinformatics analysis unit been responsible for the generation of all of the whole to studies on haematological cancers and diabetes. genome bisulphite sequencing and the primary analysis We established compute-efficient pipelines for DNA of the data, and contributed to downstream integrative methylation analysis, defined the data format for DNA 05. research programmes analysis. BLUEPRINT finished its work in 2016 and methylation sequencing information and contributed bioinformatic development & together with the IHEC summarized its efforts in 41 to simplified views of genome-wide DNA methylation statistical genomics papers that were concertedly published in journals of information. genome assembly and annotation the Cell Press group and other high impact journals in In particular, the CNAG-CRG led a study that gives biomedical genomics November 2016. insights into mechanisms of normal development and, population genomics The primary goals of IHEC are to coordinate the by comparison, what goes wrong at an epigenetic level structural genomics production of reference maps of human epigenomes in disease. Better understanding of these epigenetic comparative genomics for key cellular states relevant to health and diseases, changes in illness will lead to more effective treatment single cell genomics to facilitate rapid distribution of the data to the strategies tailored to the genetic profile of each patient. research community, and to accelerate translation of this new knowledge to improve human health. A 06. appendix critical component of IHEC goals is to coordinate the funding development of common bioinformatics standards, data collaborators models and analytical tools to organize, integrate and human resources display the epigenomic data generated. projects The focus of BLUEPRINT was on the epigenome of publications blood cell types. Many cell types of the haematological Work of reference system are involved in immunity. Apart from the reference epigenomes of blood cell types, the The International Human Epigenome Consortium: A relationship of the epigenomes of immune cell types Blueprint for Scientific Collaboration and Discovery between normal and disease states were investigated. Stunnenberg HG; International Human Epigenome The nucleosome: a complex structure consisting of DNA being wrapped These insights, together with the understanding of how Consortium, Hirst M including Gut IG, Heath S as around complexes of proteins (histones). The nucleosome is the major immune cells alter their epigenomes in reaction to or collaborators. Cell. 2016 Nov 17;167(5):1145- carrier of epigenetic information, and can be affected by a variety of to contribute to a diseased environment, and how the 1149. doi: 10.1016/j.cell.2016.11.007. epigenetic mechanisms. (Photo by Spencer-Phillips, EMBL-EBI). cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts accuracy and speed of germline variant calling pipelines

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree Sequencing a human genome today costs less than results is much lower in another 20% of the genome. ancient admixture between chimpanzees and bonobos 1% of what it did in 2006, and takes hours instead of Finally, we noticed that in the remaining 10% of the years. However, computing costs to analyse sequencing genome it is unfeasible to conduct reliable variant data have decreased far less than the cost of whole calling comparisons with the sequencing technology 04. platform overview exome and whole genome sequencing (WES and and tools assayed in this benchmark and remain a big sequencing unit WGS) and today constitutes a non-negligible fraction challenge. bioinformatics analysis unit of the overall sequence analysis costs. The clinical Regarding the computing costs, the study found genetics community is adopting WES and WGS as substantial differences between tools. It is notable that a standard practice in research and diagnosis and GEM3, the alignment tool developed and used at the 05. research programmes therefore it is essential to choose the most accurate and CNAG-CRG was found to be 4 times faster than the bioinformatic development & cost-efficient analysis pipeline. widely used BWA-MEM. While BWA-MEM required statistical genomics Researchers from the Data Analysis unit benchmarked almost 300 CPU hours for WGS alignment, GEM3 genome assembly and annotation six combinations of state-of-the-art read aligners used less than 60 CPU hours to complete the same biomedical genomics (BWA-MEM, GEM3) and variant callers (FreeBayes, task. Further downstream in the pipeline, we also population genomics GATK-HC, Samtools) on WES and WGS data from the noticed differences in speed of up to 20 times between structural genomics exhaustively-analysed NA12878 sample. The study, the fastest caller (FreeBayes) and the rest on WGS comparative genomics published in Human Mutation, aims to evaluate the samples. single cell genomics robustness of the variant detection process, while taking into account the computing resources required. The results show that the six variant calling pipelines are Venn diagrams illustrating concordance of variant identification for GEM3 alignments. 06. appendix consistent in 70% of the genome. Both the agreement funding between the pipelines and the concordance of the Work of reference collaborators called variants with the NA12878 reference dataset human resources is extremely high for SNVs and somewhat lower for From Wet-Lab to Variations: Concordance and Speed projects InDels. The latter could be due to poorer performing of Bioinformatics Pipelines for Whole Genome and publications methods but also to biases in the generation of the Whole Exome Sequencing reference dataset. Furthermore, we observed a very Laurie S, Fernandez-Callejo M, Marco-Sola S, Trotta high concordance between the variants called in the JR, Camps J, Chacón A, Espinosa A, Gut M, Gut I, WGS and the WES from the same individual, even Heath S, Beltran S**. if they had been sequenced 3 years appart by two Hum Mutat. 2016 Dec;37(12):1263-1271. doi: different labs. 10.1002/humu.23114. Epub 2016 Sep 26. However, we confirmed that the concordance of the **Corresponding author

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts should network biology be used for drug discovery?

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree According to The Tufts Center for the Study of Drug these new problems. More specifically, modern drug ancient admixture between chimpanzees and bonobos Development (TTCSDD), the development and discovery programs should be able to deal with the marketing approval for a New Molecular Entity (NME) massive amount of data generated in the initial stages takes more than 13 years and costs around US$2.6 of the drug discovery pipeline. 04. platform overview billion. Moreover, the cost of putting a new drug In this scenario, our group has recently published a sequencing unit into the market has dramatically increased since the review arguing that computational methods can play bioinformatics analysis unit 1970s. This raise in drug development cost has led a significant role to reduce the time and costs of pre- to a dramatic shrinkage of the efficiency, measured in clinical stages. Over the last thirty years, computational terms of the number of new approved drugs per billion methods have helped the development of new 05. research programmes US dollars of research and discovery spending. Factors therapeutics. However, we are still far from extracting bioinformatic development & that have contributed to the raise of drug development all their potential when applied to the drug discovery statistical genomics costs include increased clinical trial complexity, larger field. We reviewed how computational methods in genome assembly and annotation clinical trial size, greater assessment of safety and general, and network-based methods in particular, biomedical genomics toxicity drug profiles or evaluation on equivalent drugs could be used to optimize the pre-clinical stages of the population genomics to accommodate payer demands for comparative drug discovery process. structural genomics effectiveness data. Simultaneously, the emergence of comparative genomics high throughput technologies such as High Throughput single cell genomics Screenings or Next Generation Sequencing have led to a drug discovery paradigm shift from the traditional single drug perspective towards a more target centric Work of reference 06. appendix view. The application of these technologies alongside funding the increasing complexity of the treating diseases and Should network biology be used for drug discovery? collaborators the growing intricacy of the mechanism of action of Martínez-Jiménez F, Marti-Renom MA** human resources the drug have also significantly increased the costs Expert Opin Drug Discov. 2016 Dec;11(12):1135- projects of pre-clinical stages. Hence, there is an urgent need 1137. Epub 2016 Sep 23. publications for readjusting the drug discovery process to tackle **corresponding author

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts the genetic history of Aboriginal Australians

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree When and how anatomically modern humans from a genetic point of view of the recent arrival of ancient admixture between chimpanzees and bonobos appeared and spread around the world is still not other populations, such as the North European, to the solved. Although consensus among the scientific Australian continent? community supports an African origin, some scientists 04. platform overview based on fossil record argue that such spread out of the Our results suggest that most of current Aboriginal sequencing unit African continent occurred twice, being the Aboriginal Australian individuals are highly admixed with bioinformatics analysis unit Australians an admixed remnant of such first diaspora European, East Asian and Papuan populations. We whereas others claims that there has been a unique showed that the obtained percentages of ancestry colonizing event. could be used for predicting when the admixture 05. research programmes occurred in time, which we could further corroborate bioinformatic development & Within this context, the analysis of the genetic with historical records. An important consequence of statistical genomics variation of the Aboriginal Australian population is such recent admixture was that most of the conducted genome assembly and annotation crucial for disentangling among these competing analyses for understanding the population dynamics biomedical genomics hypotheses. Furthermore, the analysis of the genetic within the continent were severely impaired. However, population genomics diversity within and between Aboriginal Australian when only the Aboriginal Australian ancestry was structural genomics populations can help understanding when and how the considered, we observed a strong correlation between comparative genomics Australian continent was colonized and which are the geography and genetic variation, to the extent that the single cell genomics demographic dynamics of the Australian population, geographic sampling origin could be predicted with a both within the continent but also with other human relatively small error for most of the sampled individuals. populations. Overall, our results provided for the first time a picture Biological anthropologist Dr Westaway from Griffith University obtains 06. appendix at high resolution of the complex recent demographic a saliva sample from Thanakwith Elder Mr Thomas Wales, Cape York, Australia. funding In this study we had the opportunity, for the first time, to history of the Aboriginal Australians. (Photo by Tom Cebula, Wall to Wall Media). collaborators analyze a collection of 84 Aboriginal Australian fully human resources sequenced genomes from nine geographic locations. projects In this collaboration, the CNAG-CRG Population publications Genomics team was involved in the analyses of population substructure of the Aboriginal Australian population. We were interested in answering questions Work of reference such as: what is the amount of population substructure within the Aboriginal Australian population? Can A genomic history of Aboriginal Australia we use this population substructure to predict the Malaspinas AS, Westaway MC, Muller C geographic ancestry of a given individual? What is et al including Lao O. Nature. 2016 Oct the amount of genetic admixture with neighbouring 13;538(7624):207-214. doi: 10.1038/ populations such as the Papuans? What was the impact nature18299. Epub 2016 Sep 21. cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts decoding the complete genome of the olive tree

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree In 2016 the CNAG-CRG, in a collaborative project adaptability to arid conditions, and differences between ancient admixture between chimpanzees and bonobos including Toni Gabaldón of the Centre for Genomic the varieties, sizes and flavor of olives, for example. Regulation (CRG) and Pablo Vargas of the Real Ultimately, it may bolster molecular breeding efforts as Jardín Botánico (CSIC-RJB), finished sequencing and well as support new research ways of protecting the 04. platform overview assembling the genome of a nearly 1,300-year-old olive tree from bacterial and viral infections. In a first sequencing unit olive tree belonging to the Spanish Farga variety. step toward these aims, the CNAG-CRG has annotated bioinformatics analysis unit The three-year research project was funded by over 56,000 genes in its genome, more than double Banco Santander, who also provided the specimen, that of the human genome, and has started to study a monumental olive tree translocated to the Ciudad their functions. 05. research programmes Financiera in Boadilla del Monte, Spain, in 2005. bioinformatic development & This was the first time that such an old living being has statistical genomics had its genome sequenced. These “millennial” olive genome assembly and annotation trees are extremely long-lived, with this one likely to live biomedical genomics another 1,300 years. population genomics The CNAG-CRG Assembly and Annotation team took structural genomics a fosmid pool approach to sequencing the genome, a Work of reference comparative genomics strategy designed to overcome the problem of repetitive single cell genomics sequence and high heterozygosity, resulting in an Genome sequence of the olive tree, Olea europaea assembly of its 1.4 billion base-long genome. This Cruz F, Julca I, Gómez-Garrido J et al including reference genome sequence for the olive will provide Loska D, Frias L, Ribeca P, Derdak S, Gut M, Gut IG, 06. appendix a valuable resource for studying its domestication funding Alioto TS**. Gigascience. 2016 Jun 27;5:29. doi: history, as well as developmental and physiological 10.1186/s13742-016-0134-5. The olive tree sequenced at the CNAG-CRG is a Farga olive tree, named collaborators processes that might give insight into its longevity, its human resources **corresponding author “Santander,” from Sierra del Maestrazgo estimated to have been planted around the projects year 792, and transplanted in 2005 to the Ciudad Financiera del Banco Santander publications in Boadilla del Monte, Spain.

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director research highlights

02. 2016 in facts ancient admixture between chimpanzees and bonobos

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree Between 1,5 and 2 million years ago chimpanzees ancient admixture between chimpanzees and bonobos (Pan troglodytes) and bonobos (Pan paniscus) split Work of reference from a common ancestor and evolved important strong physical and behavioural differences. To this day, the genomic diversity reveals ancient 04. platform overview existence of gene flow between the species has not admixture with bonobos sequencing unit been considered due to the River that physically de Manuel M, Kuhlwilm M, Frandsen P et al including bioinformatics analysis unit separates the geographical distribution ranges of the Lao O, Gut M, Gut I, Marques-Bonet T**. species. This study is then, the first study to reveal Science. 2016 Oct 28;354(6311):477-481. Epub admixture among the species similar to what has been 2016 Oct 27. 05. research programmes reported between humans and Neanderthals. This **corresponding author bioinformatic development & has been possible thanks to the application of recent statistical genomics analytical tools to detect current and ancient admixture genome assembly and annotation among groups. biomedical genomics The studied samples, comprising 75 complete genomes population genomics of chimpanzees and bonobos, cover 10 countries structural genomics in Africa, from the westernmost to the easternmost comparative genomics region of the chimpanzee range. Here, we found single cell genomics that chimpanzees do have very strong geographical stratification of their genome diversity, and thus, the results have a direct application to the conservation 06. appendix of these species because they permit the detection funding of the origin of chimpanzees confiscated from illegal collaborators trafficking. human resources projects publications

Geographic stratification of chimpanzee diversity. PCA plot of chromosome 21 SNP data for a set of central (left) and eastern (right) chimpanzees. Samples with unknown origin are coloured in gray. The test samples (squares) were found to cluster perfectly with genome data for known samples (dots) (deManuel et al. Science 2016). Triangles correspond to SNP data from fecal samples.

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

04. platform overview sequencing unit bioinformatics analysis unit

05. research programmes bioinformatic development & statistical genomics genome assembly and annotation biomedical genomics population genomics structural genomics comparative genomics single cell genomics

06. appendix funding collaborators human resources projects publications

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director sequencing unit

02. 2016 in facts the Sequencing Unit of the CNAG-CRG stands at the forefront of genomic research in Spain, processing thousands of samples every 03. research highlights single cell genomics operation at CNAG–CRG year with a comprehensive palette of state-of-the-art sequencing the genome of the Iberian lynx applications and executing hundreds of sequencing projects from the IHEC coordinated paper release accuracy and speed of germline variant Spain and from international consortia. calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree Head of the Unit: The Sequencing Unit operates several different types Research Projects ancient admixture between chimpanzees and bonobos Marta Gut of Illumina sequencers and MinIon sequencers from Oxford Nanopore Technologies (ONT). The ONT • Implementation of automated protocol for Genotyping- Laboratory Managers: MinIons have been taken into data production this year, By-Sequencing (GBS) Julie Blanc, Katja Kahlem, 04. platform overview with their unprecedented length of sequencing reads • Maintaining of the failure rates during sample sequencing unit Ana González (until August), that allow cheap and portable de novo assembly of preparation to less than 5% Lidia Agueda (from August) bioinformatics analysis unit prokaryotic and eukaryotic genomes. • Testing and implementation of Nimblegen SeqCap EZ MedExome Target Enrichment protocol Laboratory Technicians: The collaborative projects take advantage of the • Operational use of the mitochondrial DNA capture 05. research programmes Marta Lopez, Ana Gonzalez high proficiency and quality standards of the data within the MedExome capture (from September), Pilar Herruzo, bioinformatic development & production in the CNAG-CRG Sequencing Unit with • Increasing the enrichment and specificity of statistical genomics Maite Rico, Caterina Mata, an ISO 9001:2008 and ISO 17025:2005 certified Nimblegen small captures by implementing a double genome assembly and annotation Laetitia Casano, Aurora Padron, and accredited quality management system. With capture protocol biomedical genomics Yasmina Mirassou, Giulia Lunazzi this system we are able to provide data with quality • Enlarging the protocol portfolio with the Kapa population genomics (until June), Regina Antoni, Nuria meeting clinical use regulatory requirements, aiming stranded mRNA protocol structural genomics Aventin to bring the highest standards to genome research in • Implementation of a manual Truseq Amplicon low comparative genomics Spain and Europe. input protocol for the B-CAST project single cell genomics Specialized Technicians: • Development of automated protocol for B-CAST Javier Gutierrez, Esther Lizano Five teams work jointly within the Sequencing Unit Truseq Amplicon with several liquid handlers (until June), Silvia Speroni (from – Biorepository, Sample Preparation, Sequencing • Introducing a new Pipetmax robot (Gilson) to the 06. appendix December) Production, Support and Single Cell Genomics. The Biorepository pipeline funding inter-team communication is assured and information • Implementation of the management process for the collaborators Quality Mananger: tracking guaranteed by the Laboratory Information B-CAST project human resources Lidia Sevilla Management System (LIMS). • Development of sample preparations for MinIon MkI projects sequencers publications Single Cell Genomics Team: The Biorepository team receives, quality controls and • Incorporation of an Illumina HiSeq4000 sequencing Team leader: Holger Heyn stores samples from the collaborators and transfers instrument Postdoctoral Fellows: Amy Lauren sample aliquot to the Sample Preparation team which • Implementation of large scale full length mRNA Guillaumet, Gustavo Rodríguez is responsible for preparing sequencing libraries for the protocols (Smart-seq2) for single cell, low input and Data Analyst: Elisabetta Mereu Sequencing Production team to load prepared libraries degraded samples (from May) onto the sequencing instruments. The Support team tests • Improvements of 3’-end mRNA count single cell PhD Students: Atefeh Lafzi (from and sets up new protocols – manual and automated methods – MARS-seq and C1 HT arrays September) – and follows the correct functioning of the CNAG- • Testing and evaluation of multiple displacement CRG´s large and small technical equipment. The Single amplifications (MDA) and PCR-based methods from Cell Genomics team elaborates new applications to single cells cnag increase the portfolio of offerings from ultra-low input • Preparation and passing the audit of ISO

centre nacional d’anàlisi genòmica samples, down to single cell transcriptomics and 17025:2005 centro nacional de análisis genómico genomics. annual report 2016

01. director director’s report foreword by the CRG director sequencing unit

02. 2016 in facts the Sequencing Unit of the CNAG-CRG stands at the forefront of genomic research in Spain, processing thousands of samples every 03. research highlights single cell genomics operation at CNAG–CRG year with a comprehensive palette of state-of-the-art sequencing the genome of the Iberian lynx applications and executing hundreds of sequencing projects from the IHEC coordinated paper release accuracy and speed of germline variant Spain and from international consortia. calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

04. platform overview sequencing unit bioinformatics analysis unit libraries prepared by application libraries prepared by year 3% 6%

05. research programmes bioinformatic development & 25% statistical genomics genome assembly and annotation 26% biomedical genomics population genomics 8.000 structural genomics comparative genomics whole genome sequencing 2% 6.000 single cell genomics exome / custom capture sequencing

RNA sequencing 4.000 06. appendix 6.455 6.890 bisulphite sequencing (including oxBS) 2.000 funding libraries pass 3.915 2.743 3.073 collaborators genotyping by sequencing (GBS) 954 human resources 38% libraries fail 0 other special application (including HiC) projects 2011 2012 2013 2015 2016 publications 2014

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director bioinformatics analysis unit

02. 2016 in facts the Bioinformatics Analysis unit develops and operates state-of-the-art pipelines, tools and 03. research highlights single cell genomics operation at CNAG–CRG databases to manage and analyse the sequencing the genome of the Iberian lynx data generated at the CNAG-CRG. the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos Head of the Unit: The unit collaborates closely with internal and external groups Sergi Beltran and delivers customised high quality user-friendly results and Research lines actively participates in its interpretation. The unit is also involved 04. platform overview Production Bioinformatics in European projects such as RD-Connect, Elixir-Excelerate and • Bioinformatics for Rare Disease Research and Clinical sequencing unit Team: B-CAST. Genomics bioinformatics analysis unit Matthew Ingham (Manager), Raul Alcántara, Eloi Casals • Development of pipelines and tools The activity is carried out by highly skilled data analysts, • Benchmarking of data analysis methods Data Analysis Team: software engineers and bioinformaticians divided into three • Variant annotation systems 05. research programmes Sergi Beltran (Manager), Jordi teams. The Production Bioinformatics team develops and • Omics data integration bioinformatic development & Camps, Sophia Derdak, Steven operates the Laboratory Information Management System • Agrogenomics (GBS) statistical genomics Laurie, Inés Martínez, Anastasios (LIMS) and pipelines to process, control the quality and transfer genome assembly and annotation Papakonstantinou, Davide Piscia, the data generated by the laboratory. The Data Analysis biomedical genomics Joan Protasio, Raul Tonda, Jean- team develops and operates bioinformatics pipelines to population genomics Rémi Trotta analyse sequencing data, mainly related to germinal variant structural genomics and somatic mutation identification and annotation. Finally, comparative genomics Functional Genomics team: the Functional Genomics team participates in RNA-Seq and single cell genomics Simon Heath (Manager), Marc Methylation studies. Although most of the activity is related to Dabad, Anna Esteve, Marcos clinical research (mainly on Mendelian disorders and cancer), Fernández 06. appendix the participation in agrigenomics and model organisms projects funding has been increasing steadily in the past years. collaborators human resources projects Services publications • Collaborative analyses • Experimental design • Data processing and quality control • Germinal variant identification • Somatic mutation identification • Variant annotation, filtering and interpretation • Copy Number Variant identification • Genotyping by Sequencing (GBS) • Differential expression and functional analysis • Identification of isoforms cnag • Identification of methylation profiles centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director bioinformatics analysis unit

02. 2016 in facts the Bioinformatics Analysis unit develops and operates state-of-the-art pipelines, tools and 03. research highlights single cell genomics operation at CNAG–CRG databases to manage and analyse the sequencing the genome of the Iberian lynx data generated at the CNAG-CRG. the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians Selected publications analysed samples by application decoding the complete genome of the olive tree 2% 2% ancient admixture between chimpanzees and bonobos From Wet-Lab to Variations: Concordance and Speed of Bioinformatics Pipelines for Whole Genome and exome capture sequencing Whole Exome Sequencing. Laurie S, Fernandez- 28% 04. platform overview Callejo M, Marco-Sola S, Trotta JR, Camps J, Chacón RNA sequencing 20% sequencing unit A, Espinosa A, Gut M, Gut I, Heath S, Beltran S**. bioinformatics analysis unit Hum Mutat. 2016 Dec;37(12):1263-1271. doi: custom capture sequencing 10.1002/humu.23114. Epub 2016 Sep 26. genotyping by sequencing **Corresponding author 05. research programmes panel capture sequencing bioinformatic development & Identification of protein-damaging mutations in 10 statistical genomics whole genome sequencing swine taste receptors and 191 appetite-reward genes. genome assembly and annotation Clop A, Sharaf A, Castelló A et al including Derdak S, biomedical genomics 22% Beltran S. BMC Genomics. 2016 Aug 26;17:685. population genomics doi: 10.1186/s12864-016-2972-z. structural genomics 26% comparative genomics single cell genomics Novel Candidate Genes and a Wide Spectrum of Structural and Point Mutations Responsible for Inherited Retinal Dystrophies Revealed by Exome Sequencing de Castro-Miró M, Tonda R, Escudero-Ferruz P et al. 06. appendix analysed samples by year funding PLoS One. 2016 Dec 22;11(12):e0168966. doi: collaborators 10.1371/journal.pone.0168966. eCollection 2016. human resources projects Extreme genomic erosion after recurrent demographic 2000 publications bottlenecks in the highly endangered Iberian lynx Abascal F*, Corvelo A*, Cruz F* et al including Frias L, Ribeca P, Derdak S, Blanc J, Gut M, Gut I, 1500 Marques-Bonet T, Alioto T. Genome Biol. 2016 Dec 14;17(1):251. 1000 1875 *contributed equally 1458 1439 1488 500 1008 ATRX driver mutation in a composite malignant pheochromocytoma. Comino-Méndez I, Tejera ÁM, 0 Currás-Freixes M, Remacha L, Gonzalvo P, Tonda R, Leton R, Blasco MA, Robledo M, Cascon A. Cancer 2012 2013 2014 2015 2016 cnag Genet. 2016 Jun;209(6):272-7. doi: 10.1016/j. centre nacional d’anàlisi genòmica centro nacional de análisis genómico cancergen.2016.04.058. Epub 2016 Apr 26. annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

04. platform overview sequencing unit bioinformatics analysis unit

05. research programmes bioinformatic development & statistical genomics genome assembly and annotation biomedical genomics population genomics structural genomics comparative genomics single cell genomics

06. appendix funding collaborators human resources projects publications

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

research programmes 01. director director’s report foreword by the CRG director bioinformatics development The research focus of the team is on the development and and statistical genomics team application of efficient methods (from both statistical and 02. 2016 in facts computational perspectives) for the large scale processing and integrative analysis of omics datasets. These methods 03. research highlights Team Leader: have been developed into analysis pipelines at CNAG- single cell genomics operation at CNAG–CRG Simon Heath CRG, and have been applied to both in house and the genome of the Iberian lynx collaborative projects. The largest single dataset we are the IHEC coordinated paper release Staff Scientist: currently working on is the epigenome data from the accuracy and speed of germline variant Emanuele Raineri BLUEPRINT project, which consists of over 200 samples calling pipeline Postdoctoral Fellows: of mostly healthy tissue from different blood cell types, but should network biology be used for drug discovery? Angelika Merkel, Ron Schuyler we are also working on diverse epigenetic datasets from the genetic history of Aboriginal Australians the International Human Epigenome Consortium and the decoding the complete genome of the olive tree Software Engineer: ICGC-TCGA Pan Cancer project. ancient admixture between chimpanzees and bonobos Marcos Fernández PhD Student: 04. platform overview Santiago Marco (until July) Research projects Selected Publications sequencing unit bioinformatics analysis unit • Efficient methods for large scale DNA methylation Distinct Trends of DNA Methylation Patterning in the analysis from WGBS experiments, generating calls of Innate and Adaptive Immune Systems. Schuyler RP, both genetic and epigenetic variation Merkel A, Raineri E, et al including Gut I, Heath S**. 05. research programmes • Integrative analysis of diverse epigenetic datasets (RNA- Cell Rep. 2016 Nov 15;17(8):2101-2111. doi: bioinformatic development & Seq, ChIP-Seq, WGBS) across > 200 healthy and 10.1016/j.celrep.2016.10.054. statistical genomics cancer samples from the human haemopoietic system **corresponding author genome assembly and annotation with the aim of uncovering novel epigenetic signatures biomedical genomics of genome regulation Decoding the DNA Methylome of Mantle Cell population genomics • Investigation into the relationship between large scale Lymphoma in the Light of the Entire B Cell Lineage. structural genomics genome organization predicted from Hi-C experiments Queirós AC, Beekman R, Vilarrasa-Blasi R et al including comparative genomics and finer scale epigenetic information from DNA Merkel A, Raineri E, Heath S, Gut IG. Cancer Cell. single cell genomics methylation and chromatin marks, and to what extent 2016 Nov 14;30(5):806-821. doi: 10.1016/j. the different epigenetic data types can predict each ccell.2016.09.014. other 06. appendix • Characterization of partially methylated domains in Information recovery from low coverage whole-genome funding different haemopoietic cell types, and their relationship bisulfite sequencing. Libertini E, Heath SC, Hamoudi RA collaborators with differentiation and cancer et al including Gut M, Gut IG. Nat Commun. 2016 Jun human resources • A systematic evaluation of all stages of our WGBS 27;7:11306. doi: 10.1038/ncomms11306. projects analysis pipeline on simulated and real datasets, publications comparing the ability to call genomic and methylation Integrative epigenome-wide analysis demonstrates variants against other commonly used pipelines that DNA methylation may mediate genetic risk in inflammatory bowel disease. Ventham NT, Kennedy NA, Adams AT et al including Heath S, Gut IG. Nat Commun. 2016 Nov 25;7:13507. doi: 10.1038/ ncomms13507.

Comprehensive genome and epigenome characterization of CHO cells in response to evolutionary pressures and over time. Feichtinger J, Hernández I*, Fischer C et al cnag including Merkel A, Heath S. Biotechnol Bioeng. 2016 Apr 12. doi: 10.1002/bit.25990. [Epub ahead of centre nacional d’anàlisi genòmica centro nacional de análisis genómico print] * Visiting student annual report 2016

research programmes 01. director director’s report foreword by the CRG director genome assembly and We carry out “genome projects” in the classical sense, annotation team i.e. sequencing, assembling and annotating genomes 02. 2016 in facts de novo, specializing in large eukaryotic genomes. We also have expertise in assembly and annotation of 03. research highlights Team Leader: transcriptomes. Genome assembly is not only difficult single cell genomics operation at CNAG–CRG Tyler Alioto due to the sheer size of the data and computational the genome of the Iberian lynx requirements, but also because the biology of genomes the IHEC coordinated paper release Postdoctoral Fellow: is confounded by repetitive elements, polyploidy and accuracy and speed of germline variant Fernando Cruz variation (single-nucleotide, insertions/deletions, and calling pipeline Technician: larger structural variants). We aim to meet and overcome should network biology be used for drug discovery? Jèssica Gómez these challenges, developing new computational the genetic history of Aboriginal Australians protocols as each project demands. In particular, we are decoding the complete genome of the olive tree focusing our efforts on the adoption of single molecule ancient admixture between chimpanzees and bonobos long read technology. Annotation of the gene content of the newly assembled genome is key to understanding the genome, once finished. Our rapid robust annotation 04. platform overview pipeline, which we are constantly improving, helps us to sequencing unit accomplish this task. bioinformatics analysis unit

05. research programmes Research lines Selected publication bioinformatic development & statistical genomics • Genome sequence assembly Genome and transcriptome analysis of the Mesoamerican genome assembly and annotation • Gene prediction/Genome annotation common bean and the role of gene duplications in biomedical genomics • Transcriptome reconstruction establishing tissue and temporal specialization of genes. population genomics • Functional annotation Vlasova A, Capella-Gutiérrez S, Rendón-Anaya M et al structural genomics including Gómez-Garrido J, Alioto T. Genome Biol. 2016 comparative genomics Feb 25;17(1):32. doi: 10.1186/s13059-016-0883-6. single cell genomics Whole Genome Sequencing of Turbot (Scophthalmus maximus; Pleuronectiformes): A Fish Adapted to Demersal 06. appendix Life. Figueras A, Robledo D, Corvelo A et al including funding Gómez-Garrido J, Gut M, Gut IG, Alioto T. DNA Res. collaborators 2016 Mar 6. pii: dsw007. [Epub ahead of print] human resources projects Genome sequence of the olive tree, Olea europaea. publications Cruz F, Julca I, Gómez-Garrido J et al including Loska D, Frias L, Ribeca P, Derdak S, Gut M, Gut IG, Alioto TS**. Gigascience. 2016 Jun 27;5:29. doi: 10.1186/ s13742-016-0134-5. **corresponding author

Extreme genomic erosion after recurrent demographic bottlenecks in the highly endangered Iberian lynx. Abascal F*, Corvelo A*, Cruz F* et al including Frias L, Ribeca P, Derdak S, Blanc J, Gut M, Gut I, Marques-Bonet T, Alioto T. Genome Biol. 2016 Dec 14;17(1):251. cnag *contributed equally centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

research programmes 01. director director’s report foreword by the CRG director biomedical genomics group The Biomedical Genomics Group works on deepening the understanding of the function of the human genome 02. 2016 in facts using sequencing data from disease studies. We apply computational methods to determine genetic and 03. research highlights Group Leader: genomic causes of disease and reversely also study the single cell genomics operation at CNAG–CRG Ivo G. Gut effects of the disease on the genome. For our studies the genome of the Iberian lynx we use data that is generated at the CNAG-CRG and the IHEC coordinated paper release Postdoctoral fellows: combine it with data that we retrieve from other sources. accuracy and speed of germline variant Justin Whalley, Gian-Andri Thun, Miranda Stobbe, Darek Kedra This allows us to increase the power of our studies and calling pipeline (until July), Ryoji Takahashi (from ask the data questions that were not necessarily at the should network biology be used for drug discovery? June) base of the initial study design. We have been working the genetic history of Aboriginal Australians on three classes of diseases: decoding the complete genome of the olive tree PhD students: 1. Cancer ancient admixture between chimpanzees and bonobos Lukasz Roguski 2. Rare diseases 3. Respiratory disorders

04. platform overview sequencing unit Selected publications bioinformatics analysis unit New technologies for DNA analysis - a review of the READNA Project. McGinn S., et al., N Biotechnol. 05. research programmes Research projects 2016 May 25;33(3):311-30. doi: 10.1016/j. bioinformatic development & nbt.2015.10.003. Epub 2015 Oct 26. statistical genomics • Cancer: Our main effort here goes to the PanCancer genome assembly and annotation study, where we have been investigating the origin A whole-genome sequence and transcriptome perspective biomedical genomics of non-random, recurring somatic insertion and on HER2-positive breast cancers. Ferrari A., et al., Nat population genomics deletion mutations in cancer genomes and have been Commun. 2016 Jul 13;7:12222. doi: 10.1038/ structural genomics coordinating the Quality Control working group. We ncomms12222. comparative genomics have defined five non-redundant measures that can be single cell genomics used to describe the quality of tumour genomes. We Decoding the DNA Methylome of Mantle Cell Lymphoma have also gained important insight into the dynamics in the Light of the Entire B Cell Lineage. Queirós AC., et of mutational processes. We concluded the EU-funded al., Cancer Cell. 2016 Nov 14;30(5):806-821. doi: 06. appendix project BLUEPRINT, for which we contributed analyses 10.1016/j.ccell.2016.09.014. funding of epigenomes of haematological cancers. collaborators • Rare Disease: In rare diseases we have been Genome-wide nucleosome specificity and function of human resources concentrating on the RD-Connect project and data with chromatin remodellers in ES cells. Dieuleveult M., et projects the objective to mine the data for phenotype-specific al., Nature. 2016 Feb 4;530(7588):113-6. doi: publications modifier variants. 10.1038/nature16505. Epub 2016 Jan 27. • Respiratory Disorders: In respiratory disorders we have been investigating the manifestation of genomic Emphysema- and airway-dominant COPD alterations and their relation to gene expression and phenotypes defined by standardised quantitative clinical phenotypes with a particular focus on the computed tomography. Subramanian D.R., et presentation of early signs of known and potentially al., Eur Respir J. 2016 Jul;48(1):92-103. doi: causal genomic events. 10.1183/13993003.01878-2015. Epub 2016 May 26. cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

research programmes 01. director director’s report foreword by the CRG director population genomics team The Population Genomics team started in 2015. Our team focuses on describing and quantifying the genetic 02. 2016 in facts variation present in current populations in order to understand the micro-evolution of the given species and 03. research highlights Team Leader: asses the phenotypic consequences of such genetic single cell genomics operation at CNAG–CRG Òscar Lao diversity. In particular, we address questions related to the genome of the Iberian lynx which is the genetic origin from a population point of the IHEC coordinated paper release PhD Students: view of a given individual, which are the demographic accuracy and speed of germline variant Iago Maceda (from March), and selective factors that shaped the genetic variation calling pipeline Amrinder Singh (from May) present in a population and how ultimately this variation should network biology be used for drug discovery? influences and allows us to detect the individual risk in the genetic history of Aboriginal Australians phenotypes of interest. In order to achieve these goals, decoding the complete genome of the olive tree we are actively working on developing new tools and ancient admixture between chimpanzees and bonobos algorithms for describing population substructure in the genome and understanding the biological implications of such structure, identifying the fingerprint of polygenic 04. platform overview adaptation in complex phenotypes and evaluating the sequencing unit impact of archaic introgression in phenotypes of interest. bioinformatics analysis unit Our team focuses on human species but the universality of the proposed methods allows us to apply them to other model organisms. 05. research programmes bioinformatic development & statistical genomics genome assembly and annotation biomedical genomics population genomics structural genomics Research projects Selected publications comparative genomics single cell genomics • Development of new algorithms for predicting genetic Chimpanzee genomic diversity reveals ancient admixture ancestry prediction with bonobos. de Manuel M, Kuhlwilm M, Frandsen P et • Identification of the fingerprint of polygenic adaptation al including Lao O, Gut M, Gut I, Marques-Bonet T**. 06. appendix in complex phenotypes such as body mass index or Science. 2016 Oct 28;354(6311):477-481. Epub funding bone mineral density 2016 Oct 27. collaborators • Analysis of the impact of archaic introgression in the **corresponding author human resources evolution of complex phenotypes such as bone mineral projects publications density or aggression A genomic history of Aboriginal Australia. • Development of new algorithms for statistically Malaspinas AS, Westaway MC, Muller C et al including inferring demographic parameters and distinguishing Lao O. Nature. 2016 Oct 13;538(7624):207-214. amongcompeting models doi: 10.1038/nature18299. Epub 2016 Sep 21. • Analysis of the genetic variants associated to attention deficit hyperactivity disorder (ADHD) phenotype

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

research programmes 01. director director’s report foreword by the CRG director structural genomics group How biomolecules fold and function in a three- dimensional space is one of the most challenging 02. 2016 in facts questions in biology. For example, we have limited knowledge on how the 2-meter-long DNA molecule folds 03. research highlights Group Leader: in the micro-sized nucleus or how RNA, proteins and single cell genomics operation at CNAG–CRG Marc A. Marti-Renom small chemical compounds fold and interact to perform the genome of the Iberian lynx the most basic functions of the cell. Our research group the IHEC coordinated paper release Staff Scientist: employ the laws of physics and the rules of evolution to accuracy and speed of germline variant Davide Baù develop and apply computational methods for predicting calling pipeline Posdoctoral Fellows: the 3D structures of macromolecules and their complexes. should network biology be used for drug discovery? François Serra, Yannick Spill, By doing so, we have recently developed new the genetic history of Aboriginal Australians Marco di Stefano, Irene Farabella computational tools to predict the insertion of proteins decoding the complete genome of the olive tree in the lipid transmembrane of the cell as well as predict ancient admixture between chimpanzees and bonobos PhD Students: drug compounds using network biology. David Dufour, Francisco Martínez- Jiménez, Gireesh K. Bogu, Silvia Galan, Paula Soler 04. platform overview sequencing unit Programmer, Technicians: bioinformatics analysis unit Michael Goodstadt Laboratory Technician: 05. research programmes Yasmina Cuartero bioinformatic development & statistical genomics genome assembly and annotation biomedical genomics population genomics structural genomics Research projects / Research lines Selected publications comparative genomics single cell genomics • Structure determination of genomes. We develop Should network biology be used for drug discovery? methods for determining the 3D organization of the Martínez-Jiménez F, Marti-Renom MA** chromatin Expert Opin Drug Discov. 2016 Dec;11(12):1135- 06. appendix • Comparative RNA structure prediction. We develop a 1137. Epub 2016 Sep 23. funding series of tools for the alignment of RNA structures and **corresponding author collaborators the prediction of their structures and functions. human resources • Protein-Ligand interactions. We develop methods for Structural features of the fly chromatin colors revealed by projects publications comparative docking of small chemical compounds automatic three-dimensional modeling. Serra F, Baù D, and their target proteins. Filion G, Marti-Renom MA** bioRxiv 2016 **corresponding author

Biological insertion of computationally designed short transmembrane segments. Baeza-Delgado C, von Heijne G, Marti-Renom MA, Mingarro I. Sci Rep. 2016 Mar 18;6:23397. doi: 10.1038/srep23397.

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

research programmes 01. director director’s report foreword by the CRG director comparative genomics group Our main line of research is centered in the discovery of the extent of all kinds of genome variation within different 02. 2016 in facts phenotypically genomes. Specifically, we study genome variation (centered on CNVs), gene expression and 03. research highlights Group Leader: epigenetic differences in the human species in the context single cell genomics operation at CNAG–CRG Tomas Marques-Bonet (ICREA of great evolution and other mammalian genomes the genome of the Iberian lynx Research Professor) such as canids. The goal is to create an integrated view of the IHEC coordinated paper release genome evolution by studying changes in the composition, accuracy and speed of germline variant Postdoctoral Fellows: frequency, size and location at every major branch point calling pipeline Inna Povolotskaya, Martin Kuhlwilm, David de Juan of recent human evolution. should network biology be used for drug discovery? the genetic history of Aboriginal Australians PhD Students: decoding the complete genome of the olive tree Raquel Garcia, Jessica Research lines ancient admixture between chimpanzees and bonobos Hernández, Marc de Manuel,

Irene Lobon, Lukas Kuderna, • Genomic variation in ape genomes Claudia Fontsere, Aitor Serres, • Evolution of Gene Regulation 04. platform overview Sojung Han, Manuel Solís, Luis • Canid evolution sequencing unit Ferrández bioinformatics analysis unit

05. research programmes bioinformatic development & Selected publications statistical genomics genome assembly and annotation Chimpanzee genomic diversity reveals ancient admixture biomedical genomics with bonobos. Manuel M., et al., Science. 2016 Oct population genomics 28;354(6311):477-481. Epub 2016 Oct 27. structural genomics comparative genomics Evolution and demography of the great . Kuhlwilm single cell genomics M., et al., Curr Opin Genet Dev. 2016 Dec;41:124- 129. doi: 10.1016/j.gde.2016.09.005. Epub 2016 Oct 4. 06. appendix funding Demographic History of the Genus Pan Inferred from collaborators Whole Mitochondrial Genome Reconstructions. Lobon human resources I., et al., Genome Biol Evol. 2016 Jul 3;8(6):2020-30. projects doi: 10.1093/gbe/evw124. publications Ancient gene flow from early modern humans into Eastern Neanderthals. Kuhlwilm M., et al., Nature. 2016 Feb 25;530(7591):429-33. doi: 10.1038/nature16544. Epub 2016 Feb 17.

Worldwide patterns of genomic variation and admixture in gray wolves. Fan Z., et al., Genome Res. 2016 Feb;26(2):163-73. doi: 10.1101/gr.197517.115. Epub 2015 Dec 17. cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

research programmes 01. director director’s report foreword by the CRG director single cell genomics team The Single Cell Genomics team is dedicated to advance our understanding of genome activity in health and 02. 2016 in facts disease. The mission of the team is the implementation of novel single cell genomics technologies and their 03. research highlights Team Leader: application in a research context, with a focus on single cell genomics operation at CNAG–CRG Holger Heyn translational research. New computational strategies the genome of the Iberian lynx include methods to deconvolute tissue composition, the IHEC coordinated paper release Postdoctoral Fellows: determine cell type heterogeneity, identify cell type accuracy and speed of germline variant Amy Guillaumet, Gustavo Rodriguez defining markers or track transcriptional dynamics. The calling pipeline team’s experience in single cell genomics is unique in should network biology be used for drug discovery? Data Analyst: Spain and suits research on virtual every species, tissue or the genetic history of Aboriginal Australians Elisabetta Mereu (from May) disease context. decoding the complete genome of the olive tree PhD Students: ancient admixture between chimpanzees and bonobos The Single Cell Genomics team is equipped for and Atefeh Lafzi (from September) specialized on the large-scale production of transcriptomic, genomic and epigenomic profiles from single cells. 04. platform overview Several techniques, such as the single cell RNA and DNA sequencing unit sequencing methods MARS-seq, Smart-seq2 and G&Tseq, bioinformatics analysis unit were implemented into standard operation procedures, to allow a reliable and reproducible preparation of samples. Single cell projects are conducted with specialized 05. research programmes equipment, enabling an immediate sample processing and bioinformatic development & the monitoring of high quality data production throughout statistical genomics the experiments. Specialized technical equipment includes Single cells ordered along a predicted developmental trajectory. genome assembly and annotation automated liquid handling robotics and microfluidic single biomedical genomics The colours indicate their progression in pseudotime inferred by cell devices (C1, Fluidigm). In 2016, the Single Cell population genomics their transcriptomic information. Genomics team processed and analyzed >20,000 single structural genomics cell genomes and transcriptomes. comparative genomics single cell genomics

06. appendix funding collaborators Research lines Selected publications human resources projects • Cancer Heterogeneity RNA sequencing validation of the Complexity INdex publications • Developmental Dynamics in SARComas prognostic signature. Lesluyes T., et al., • Complex Tissue Composition Eur J Cancer. 2016 Apr;57:104-11. doi: 10.1016/j. • Single Cell Genomics Technologies ejca.2015.12.027. Epub 2016 Feb 23.

Epigenomic analysis detects aberrant super-enhancer DNA methylation in human cancer. Heyn H., et al., Genome Biol. 2016 Jan 26;17(1):11. doi: 10.1186/ s13059-016-0879-2.

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts

03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

04. platform overview sequencing unit bioinformatics analysis unit

05. research programmes bioinformatic development & statistical genomics genome assembly and annotation biomedical genomics population genomics structural genomics comparative genomics single cell genomics

06. appendix funding collaborators human resources projects publications

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts funding 03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians funding evolution funding by sources decoding the complete genome of the olive tree 2016 ancient admixture between chimpanzees and bonobos Contracts/ Services 2,83M € 04. platform overview sequencing unit 10.000.000 bioinformatics analysis unit 8.000.000 2.165.892 05. research programmes Trustees 6.000.000 2,5M € 4.219.720 5.515.117 8.341.095 bioinformatic development & 6.120.000 Competitive statistical genomics projects 5.179.488 genome assembly and annotation 4.000.000 3,22M € biomedical genomics 6.000.000 Other population genomics 2.000.000 3.694.678 income structural genomics 3.250.000 2.000.000 1.500.000 2.500.000 0,07M € comparative genomics 0 single cell genomics 2011 2012 2013 2014 2015 2016

06. appendix funding other resources collaborators human resources trustees projects publications

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts collaborators 03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians collaborators by origin collaborators by sector decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

04. platform overview sequencing unit 180 180 bioinformatics analysis unit 1 160 160 24 05. research programmes 140 140 bioinformatic development & statistical genomics 3 3 genome assembly and annotation 120 120 biomedical genomics 22 26 population genomics 100 100 11 structural genomics 10 20 31 comparative genomics 166 single cell genomics 80 143 80

60 60 116 118 06. appendix 99 funding 93 93 94 40 83 40 collaborators 3 74 2 human resources projects international 20 private 20 publications 26 27 national public 0 0

2011 2012 2013 2014 2015 2016 2011 2012 2013 2014 2015 2016

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts collaborators 03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians 2016 activity by sequencing application 2016 activity by research area decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

04. platform overview sequencing unit 6% 12% bioinformatics analysis unit 3% 20% 4% Capture-Seq (Exome, panel, custom and amplicon) 05. research programmes HiC 33% animal genomics bioinformatic development & 13% other cancer genomics statistical genomics 4% genome assembly and annotation RNA-Seq disease genomics biomedical genomics WG-BS and OxBS-Seq functional genomics population genomics 8% structural genomics WG-Seq microbial genomics 40% comparative genomics 4% other single cell genomics 22% plant genomics 31% 06. appendix funding collaborators human resources projects publications

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts human resources 03. research highlights staff evolution by area and position single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians 2016 2015 2014 2013 2012 2011 decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

SEQUENCING (LAB) 25 21 22 19 17 13 04. platform overview sequencing unit Unit heads 2 2 1 1 1 1 bioinformatics analysis unit Postdocs and staff scientists 2 2 1 - - -

05. research programmes Technicians bioinformatic development & 20 17 20 18 16 12 statistical genomics genome assembly and annotation Predocs 1 0 0 0 0 0 biomedical genomics population genomics BIOINFORMATICS (IT) 46 42 37 29 25 16 structural genomics comparative genomics Group / Team / Unit leaders 6 7 5 6 7 6 single cell genomics

Postdocs and staff scientists 13 13 14 12 9 5 06. appendix funding Technicians 21 18 15 7 4 4 collaborators human resources Predocs 6 4 3 4 5 1 projects publications MANAGEMENT (ADMIN) 4 5 5 4 5 5

TOTAL 75 68 64 52 47 34

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts human resources 03. research highlights staff evolution by area and gender single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians 2016 2015 2014 2013 2012 2011 decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

SEQUENCING (LAB) 25 21 22 19 17 13 04. platform overview sequencing unit Male 3 2 2 2 2 1 bioinformatics analysis unit Female 22 19 20 17 15 12

05. research programmes BIOINFORMATICS (IT) bioinformatic development & 46 42 37 29 25 16 statistical genomics genome assembly and annotation Male 36 32 29 25 21 14 biomedical genomics population genomics Female 10 10 8 4 4 2 structural genomics comparative genomics MANAGEMENT (ADMIN) 4 5 5 4 5 5 single cell genomics

Male 0 0 1 1 2 2 06. appendix funding Female 4 5 4 3 3 3 collaborators human resources RATIO FEMALES VS TOTAL 48% 50% 50% 46,15% 46,81% 50% projects publications

*In both tables data has been updated according to the CRG categories

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts projects 03. research highlights international competitive projects in force in 2016 single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians acronym project type timeline funding decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

AIRPROM FP7 Collaborative Project 2011-2016 459,400€ 04. platform overview sequencing unit BLUEPRINT FP7 Collaborative Project 2011-2016 2,718,950€ bioinformatics analysis unit RD-CONNECT FP7 Collaborative Project 2012-2018 1,497,365€ 05. research programmes bioinformatic development & IBD-CHARAC FP7 Collaborative Project 2012-2017 994,600€ statistical genomics genome assembly and annotation BBMRI-LPC FP7 Collaborative Project & 2013-2017 494,426€ biomedical genomics Coordination and Support Action population genomics structural genomics 4DGENOME ERC Synergy 2014-2019 1,752,612€ comparative genomics single cell genomics PHYLOCANCER ERC Consolidator 2014-2019 320,004€

06. appendix BCAST H2020 Collaborative Project 2015-2020 2,380,125€ funding collaborators MUG H2020 EINFRA 2015-2018 510,695€ human resources projects ELIXIR-EXCELERATE H2020 INFRADEV 2015-2019 200,000€ publications MIND H2020 MSCA-ITN 2016-2018 179,155€

SINGEK H2020 MSCA-ITN 2016-2019 285,753€

TOTAL 11,793,085€

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts projects 03. research highlights national competitive projects in force in 2016 single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant calling pipeline should network biology be used for drug discovery? the genetic history of Aboriginal Australians reference project type timeline funding decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos

REDBIO2014 FIS/ ISCIII 2014-2017 359,150€ 04. platform overview PT13/0001/0044 sequencing unit bioinformatics analysis unit 2014 SGR 615 Suport a Grups de Recerca 2014-2017 70,000€

CIN-2015-244- ERA-NET 2016-2019 75,000€ 05. research programmes BUSPONTHEAL bioinformatic development & statistical genomics genome assembly and annotation BIO2015-71969-REDI Redes de Excelencia 2015-2017 178,000€ biomedical genomics population genomics BFU2013-047736-P Proyectos de Excelencia 2014-2016 205,000€ structural genomics comparative genomics BIO2015-71792-P Proyectos de Excelencia 2016-2018 296,450€ single cell genomics

BFU2015-68759-P Proyectos de Excelencia 2016-2018 142,296€ 06. appendix funding RYC-2013-14797 Ramón y Cajal Contracts 2015-2019 208,600€ collaborators human resources BES-2014-070327 Predoc training contract 2015-2019 88,400€ projects publications PTA2014-09515-I Technicians contract 2015-2018 39,000€

CP14/00229 Miguel Servet 2016-2019 232,480€

TOTAL 1,894,376€

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts publications 03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant 1. Novel Candidate Genes and a Wide Spectrum of 7. Whole exome sequencing analysis reveals TRPV3 Queirós AC, Beekman R, Vilarrasa-Blasi R et al calling pipeline Structural and Point Mutations Responsible for Inherited as a risk factor for cardioembolic stroke Carrera C, including Merkel A, Raineri E, Heath S, Gut IG. should network biology be used for drug discovery? Retinal Dystrophies Revealed by Exome Sequencing. Jiménez-Conde J, Derdak S et al including Beltran Cancer Cell. 2016 Nov 14;30(5):806-821. doi: the genetic history of Aboriginal Australians de Castro-Miró M, Tonda R, Escudero-Ferruz P et S. Thromb Haemost. 2016 Nov 30;116(6):1165- 10.1016/j.ccell.2016.09.014. decoding the complete genome of the olive tree ancient admixture between chimpanzees and bonobos al. PLoS One. 2016 Dec 22;11(12):e0168966. 1171. Epub 2016 Sep 8. 14. Human Oocyte-Derived Methylation Differences doi: 10.1371/journal.pone.0168966. eCollection 8. Increased DNA methylation variability in type 1 Persist in the Placenta Revealing Widespread 2016. diabetes across three immune effector cell types Transient Imprinting Sanchez-Delgado M, Court 04. platform overview 2. Extreme genomic erosion after recurrent demographic Paul DS, Teschendorff AE, Dang MA et al including F, Vidal E et al including Marques-Bonet T. PLoS sequencing unit bottlenecks in the highly endangered Iberian lynx Heath S, Gut M, Gut IG. Nat Commun. 2016 Nov Genet. 2016 Nov 11;12(11):e1006427. doi: bioinformatics analysis unit Abascal F*, Corvelo A*, Cruz F* et al including 29;7:13555. doi: 10.1038/ncomms13555. 10.1371/journal.pgen.1006427. eCollection Frias L, Ribeca P, Derdak S, Blanc J, Gut M, Gut I, 9. Integrative epigenome-wide analysis demonstrates 2016. Marques-Bonet T, Alioto T. Genome Biol. 2016 Dec that DNA methylation may mediate genetic risk in 15. Chimpanzee genomic diversity reveals ancient 05. research programmes 14;17(1):251. *contributed equally inflammatory bowel disease Ventham NT, Kennedy admixture with bonobos de Manuel M, Kuhlwilm bioinformatic development & 3. Genetic Diversity and Population Structure of Rice NA, Adams AT et al including Heath S, Gut IG. Nat M, Frandsen P et al including Lao O, Gut M, statistical genomics Varieties Cultivated in Temperate Regions Reig-Valiente Commun. 2016 Nov 25;7:13507. doi: 10.1038/ Gut I, Marques-Bonet T**. Science. 2016 Oct genome assembly and annotation biomedical genomics JL, Viruel J, Sales E et al including Gut M, Deardak S. ncomms13507. 28;354(6311):477-481. Epub 2016 Oct 27. population genomics Rice (N Y). 2016 Dec;9(1):58. Epub 2016 Oct 20. 10. The International Human Epigenome Consortium: A **corresponding author structural genomics 4. Evolution and demography of the great apes Blueprint for Scientific Collaboration and 16. A genomic history of Aboriginal Australia comparative genomics Kuhlwilm M, de Manuel M, Nater A, Greminger Discovery Stunnenberg HG; International Malaspinas AS, Westaway MC, single cell genomics MP, Krützen M, Marques-Bonet T*. Curr Opin Human Epigenome Consortium, Hirst M Including Muller C et al including Lao O. Nature. 2016 Genet Dev. 2016 Dec;41:124-129. doi: Gut IG, Heath S as collaborators. Cell. 2016 Oct 13;538(7624):207-214. doi: 10.1038/ 10.1016/j.gde.2016.09.005. Epub 2016 Oct 4. Nov 17;167(5):1145-1149. doi: 10.1016/j. nature18299. Epub 2016 Sep 21. 06. appendix *Corresponding author cell.2016.11.007. 17. Mutations in GLDN, Encoding Gliomedin, a funding 5. From Wet-Lab to Variations: Concordance and Speed 11. β-Glucan Reverses the Epigenetic State of LPS- Critical Component of the Nodes of Ranvier, Are collaborators of Bioinformatics Pipelines for Whole Genome and Induced Immunological Tolerance Novakovic B, Responsible for Lethal Arthrogryposis Maluenda J, human resources Whole Exome Sequencing Laurie S, Fernandez- Habibi E, Wang SY et al including Heath S, Gut I. projects Manso C, Quevarec L et al including Gut M, Gut I. publications Callejo M, Marco-Sola S, Trotta JR, Camps J, Chacón Cell. 2016 Nov 17;167(5):1354-1368.e14. doi: Am J Hum Genet. 2016 Oct 6;99(4):928-933. A, Espinosa A, Gut M, Gut I, Heath S, Beltran S** 10.1016/j.cell.2016.09.034. doi: 10.1016/j.ajhg.2016.07.021. Epub 2016 Hum Mutat. 2016 Dec;37(12):1263-1271. doi: 12. Distinct Trends of DNA Methylation Patterning in Sep 8. 10.1002/humu.23114. Epub 2016 Sep 26. the Innate and Adaptive Immune Systems Schuyler 18. The Fanconi anemia DNA damage repair pathway **Corresponding author RP, Merkel A, Raineri E, et al including Gut I, Heath in the spotlight for germline predisposition to 6. Natural Selection in the Great Apes Cagan A, S. Cell Rep. 2016 Nov 15;17(8):2101-2111. colorectal cancer Esteban-Jurado C, Franch- Theunert C, Laayouni H et al including Marques-Bonet doi: 10.1016/j.celrep.2016.10.054. Expósito S, Muñoz J et al including Serra E, Beltran T. Mol Biol Evol. 2016 Dec;33(12):3268-3283. 13. Decoding the DNA Methylome of Mantle Cell S. Eur J Hum Genet. 2016 Oct;24(10):1501-5doi: Epub 2016 Oct 30. Lymphoma in the Light of the Entire B Cell Lineage 10.1038/ejhg.2016.44. Epub 2016 May 11. cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts publications 03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant 19. Should network biology be used for drug discovery? 26. A whole-genome sequence and transcriptome 32. Saturation analysis for whole-genome bisulfite calling pipeline Martínez-Jiménez F, Marti-Renom MA** Expert Opin perspective on HER2-positive breast cancers Ferrari sequencing data Libertini E, Heath SC, Hamoudi should network biology be used for drug discovery? Drug Discov. 2016 Dec;11(12):1135-1137. Epub A, Vincent-Salomon A, Pivot X et al including Gut RA et al including Gut M, Gut IG. Nat Biotechnol. the genetic history of Aboriginal Australians decoding the complete genome of the olive tree 2016 Sep 23. **corresponding author IG, Gut M. Nat Commun. 2016 Jul 13;7:12222. 2016 Jun 27. doi: 10.1038/nbt.3524. [Epub ancient admixture between chimpanzees and bonobos 20. Sequence variation between 462 huma individuals doi: 10.1038/ncomms12222. ahead of print] fine-tunes functional sites of RNA processing Ferreira 27. CARGO: effective format-free compressed storage 33. Genome sequence of the olive tree, Olea PG, Oti M, Barann M et al including Esteve-Codina of genomic information Roguski Ł, Ribeca P**. europaea Cruz F, Julca I, Gómez-Garrido J et 04. platform overview A. Sci Rep. 2016 Sep12;6:32406. doi: Nucleic Acids Res. 2016 Jul 8;44(12):e114. al including Loska D, Frias L, Ribeca P, Derdak sequencing unit 10.1038/srep32406. doi: 10.1093/nar/gkw318. Epub 2016 Apr 29. S, Gut M, Gut IG, Alioto TS**. Gigascience. bioinformatics analysis unit 21. PRKG1 and genetic diagnosis of early-onset **corresponding author 2016 Jun 27;5:29. doi: 10.1186/s13742-016- thoracic aortic disease Gago-Díaz M, Blanco- 28. Demographic History of the Genus Pan Inferred 0134-5. **corresponding author Verea A, Teixidó G, Huguet F, Gut M, Laurie S, Gut from Whole Mitochondrial Genome Reconstructions 34. Making sense of big data in health research: 05. research programmes I, Carracedo A, Evangelista A, Brion M. Eur J Clin Lobon I, Tucci S, de Manuel M et al including Towards an EU action plan Auffray C, Balling bioinformatic development & Invest. 2016 Sep;46(9):787-94. doi: 10.1111/ Dabad M, Marques-Bonet T**. Genome Biol Evol. R, Barroso I et al including Gut IG. Genome Med. statistical genomics genome assembly and annotation eci.12662. Epub 2016 Aug 18. 2016 Jul 3;8(6):2020-30. doi: 10.1093/gbe/ 2016 Jun 23;8(1):71. doi: 10.1186/s13073- biomedical genomics 22. Identification of protein-damaging mutations in 10 evw124. **corresponding author 016-0323-y. population genomics swine taste receptors and 191 appetite-reward 29. Emphysema and airway- dominant 35. ATRX driver mutation in a composite structural genomics genes Clop A, Sharaf A, Castelló A et al including COPD phenotypes defined by standardised malignant pheochromocytoma Comino-Méndez I, comparative genomics Derdak S, Beltran S. BMC Genomics. 2016 Aug quantitative computed tomography Subramanian Tejera ÁM, Currás-Freixes M, Remacha L, Gonzalvo single cell genomics 26;17:685. doi: 10.1186/s12864-016-2972-z. DR, Gupta S, Burggraf D et al including P, Tonda R, Leton R, Blasco MA, Robledo M, 23. Atad2 is a generalist facilitator of chromatin Gut I. Eur Respir J. 2016 Jul;48(1):92-103. doi: Cascon A. Cancer Genet. 2016 Jun;209(6):272- dynamics in embryonic stem cells Morozumi Y, 10.1183/13993003.01878-2015. Epub 2016 7. doi: 10.1016/j.cancergen.2016.04.058. 06. appendix Boussouar F, Tan M et al including Gut M. J Mol May 26. Epub 2016 Apr 26. funding Cell Biol. 2016 Aug;8(4):349-62. doi: 10.1093/ 30. Whole-genome single nucleotide polymorphism- 36. New technologies for DNA analysis - a review collaborators jmcb/mjv060. Epub 2015 Oct 12. based linkage analysis in spondyloarthritis multiplex of the READNA Project McGinn S, Bauer D, human resources projects 24. Somatic Embryonic FGFR2 Mutations in families reveals a new susceptibility locus in 13q13 Brefort T et al including Gut M, Heath S, Gut IG**. publications Keratinocytic Epidermal Nevi Toll A, Fernández Costantino F, Chaplais E, Leturcq T et al including N Biotechnol. 2016 May 25;33(3):311-30. doi: LC, Pons T et al including Beltran S, Gut M, Gut I. Gut I. Ann Rheum Dis. 2016 Jul;75(7):1380-5. 10.1016/j.nbt.2015.10.003. Epub 2015 Oct J Invest Dermatol. 2016 Aug;136(8):1718-21. doi: 10.1136/annrheumdis-2015-207720. Epub 26. **corresponding author doi: 10.1016/j.jid.2016.03.040. Epub 2016 Apr 19. 2015 Aug 14. 37. MetaTrans: an open-source pipeline for 25. Mitochondrial Complex I Is a Global Regulator 31. Information recovery from low coverage whole- metatranscriptomics Martinez X, Pozuelo M, Pascal of Secondary Metabolism, Virulence and Azole genome bisulfite sequencing Libertini E, Heath SC, V, Campos D, Gut I, Gut M, Azpiroz F, Guarner F, Sensitivity in Fungi Bromley M, Johns A, Davies Hamoudi RA et al including Gut M, Gut IG. Nat Manichanh C. Sci Rep. 2016 May 23;6:26447. E et al including Gut M, Gut I. PLoS One. 2016 Commun. 2016 Jun 27;7:11306. doi: doi: 10.1038/srep26447. Jul 20;11(7):e0158724. doi: 10.1371/ 10.1038/ ncomms11306. journal.pone.0158724. eCollection 2016. cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts publications 03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant 38. Dynamic recruitment of Ets1 to both nucleosome- 44. Biological insertion of computationally designed 50. ETE 3: Reconstruction, analysis and visualization calling pipeline occupied and -depleted enhancer regions mediates short transmembrane segments Baeza-Delgado of phylogenomic data Huerta-Cepas J, Serra F, should network biology be used for drug discovery? a transcriptional program switch during early T-cell C, von Heijne G, Marti-Renom MA, Mingarro I. Bork P. Mol Biol Evol. 2016 Jun;33(6):1635-8. the genetic history of Aboriginal Australians decoding the complete genome of the olive tree differentiation Cauchy P, Maqbool MA, Zacarias- Sci Rep. 2016 Mar 18;6:23397. doi: 10.1038/ doi: 10.1093/molbev/msw046. Epub 2016 ancient admixture between chimpanzees and bonobos Cabeza J et al including Gut M, Gut I. Nucleic srep23397. Feb 26. Acids Res. 2016 May 5;44(8):3567-85. doi: 45. Whole Genome Sequencing of Turbot 51. Genome and transcriptome analysis of the 10.1093/nar/gkv1475. Epub 2015 Dec 15. (Scophthalmus maximus; Pleuronectiformes): A Fish Mesoamerican common bean and the role of 04. platform overview 39. Functional Implications of Human-Specific Changes Adapted to Demersal Life Figueras A, Robledo D, gene duplications in establishing tissue sequencing unit in Great Ape microRNAs Gallego A, Melé Corvelo A et al including Gómez-Garrido J, Gut M, and temporal specialization of genes Vlasova A, bioinformatics analysis unit M, Balcells I et al including Marques-Bonet T. Gut IG, Alioto T. DNA Res. 2016 Jun;23(3):181- Capella-Gutiérrez S, Rendón-Anaya M et al PLoS One. 2016 Apr 22;11(4):e0154194. doi: 92. doi: 10.1093/dnares/dsw007. Epub 2016 Mar 6. including Gómez-Garrido J, Alioto T. Genome Biol. 10.1371/journal.pone.0154194. eCollection 46. Demographically-Based Evaluation of Genomic 2016 Feb 25;17(1):32. doi: 10.1186/s13059- 05. research programmes 2016. Regions under Selection in Domestic Dogs 016-0883-6. bioinformatic development & 40. Comprehensive genome and epigenome Freedman AH, Schweizer RM, Ortega-Del Vecchyo 52. Ancient gene flow from early modern humans statistical genomics genome assembly and annotation characterization of CHO cells in response to D et al including Marques-Bonet T. PLoS Genet. into Eastern Neanderthals Kuhlwilm M, Gronau I, biomedical genomics evolutionary pressures and over time Feichtinger 2016 Mar 4;12(3):e1005851. doi: 10.1371/ Hubisz MJ et al including Marques-Bonet T. Nature. population genomics J, Hernández I, Fischer C et al including Merkel journal.pgen.1005851. eCollection 2016. 2016 Feb 25;530(7591):429-33. doi: structural genomics A, Heath S. Biotechnol Bioeng. 2016 Apr 12. doi: 47. Specific small-RNA signatures in the amygdala 10.1038/nature16544. Epub 2016 Feb 17. comparative genomics 10.1002/bit.25990. [Epub ahead of print] at premotor and motor stages of Parkinson’s 53. Genetic load of loss-of-function polymorphic single cell genomics 41. RNA sequencing validation of the Complexity disease revealed by deep sequencing analysis variants in great apes de Valles-Ibáñez G, INdex in SARComas prognostic signature Pantano L, Friedländer MR, Escaramís G, Lizano Hernandez-Rodriguez J, Prado-Martinez J, Luisi P, Lesluyes T, Pérot G, Largeau MR et al E et al. Bioinformatics. 2016 Mar 1;32(5):673-81. Marquès-Bonet T, Casals F. Genome Biol Evol. 06. appendix including Mendez-Lago M, Gut M, Gut I. Eur doi: 10.1093/bioinformatics/btv632. Epub 2015 2016 Mar 26;8(3):871-7. doi: 10.1093/gbe/ funding J Cancer. 2016 Apr;57:104-11. doi: 10.1016/j. Nov 2. evw040. collaborators ejca.2015.12.027. Epub 2016 Feb 23. 48. Evaluation of mRNA markers for estimating blood 54. Genome-wide nucleosome specificity and function human resources projects 42. Identification of genetic variation in the swine deposition time: Towards alibi testing from human of chromatin remodellers in ES cells de Dieuleveult publications toll-like receptors and development of a porcine forensic stains with rhythmic biomarkers Lech K, Liu M, Yen K, Hmitou I et al including Gut M, Gut TLR genotyping array Clop A, Huisman A, van As F, Ackermann K, Revell VL, Lao O, Skene DJ, Kayser I. Nature. 2016 Feb 4;530(7588):113-6. doi: P, Sharaf A, Derdak S, Sanchez A Genet Sel Evol. M. Forensic Sci Int Genet. 2016 Mar;21:119-25. 10.1038/nature16505. Epub 2016 Jan 27. 2016 Mar 31;48:28. doi: 10.1186/s12711- doi: 10.1016/j.fsigen.2015.12.008. Epub 55. Dissecting Daily and Circadian Expression Rhythms 016-0206-0. 2015 Dec 28 of Clock-Controlled Genes in Human Blood Lech 43. The Fungus Candida albicans Tolerates Ambiguity 49. Worldwide patterns of genomic variation and K, Ackermann K, Revell VL, Lao O, Skene DJ, at Multiple Codons Simões J, Bezerra AR, Moura admixture in gray wolves Fan Z, Silva P, Gronau Kayser M. J Biol Rhythms. 2016 Feb;31(1):68-81. GR, Araújo H, Gut I, Bayes M, Santos MA. Front I et al including Marques-Bonet T. Genome doi: 10.1177/0748730415611761. Epub 2015 Microbiol. 2016 Mar 31;7:401. doi: 10.3389 Res. 2016 Feb;26(2):163-73. doi: 10.1101/ Nov 2. fmicb.2016.00401. eCollection 2016. gr.197517.115 Epub 2015 Dec 17. cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico annual report 2016

01. director director’s report foreword by the CRG director

02. 2016 in facts publications 03. research highlights single cell genomics operation at CNAG–CRG the genome of the Iberian lynx the IHEC coordinated paper release accuracy and speed of germline variant 56. Epigenomic analysis detects aberrant super- calling pipeline enhancer DNA methylation in human cancer should network biology be used for drug discovery? Heyn H*, Vidal E, Ferreira HJ et al including Gut the genetic history of Aboriginal Australians decoding the complete genome of the olive tree M, Gut I. Genome Biol. 2016 Jan 26;17(1):11. ancient admixture between chimpanzees and bonobos doi: 10.1186/s13059-016-0879-2. *first author 57. Selective constraints on protamine 2 in primates and rodents Lüke L, Tourmente M, Dopazo H, 04. platform overview Serra F, Roldan ER. BMC Evol Biol. 2016 Jan sequencing unit 22;16(1):21. doi: 10.1186/s12862-016-0588-1. bioinformatics analysis unit 58. Genetic associations at 53 loci highlight cell types and biological pathways relevant for kidney function Pattaro C, Teumer A, Gorski M 05. research programmes et al including Thun GA. Nat Commun. 2016 Jan bioinformatic development & 21;7:10023. doi: 10.1038/ncomms10023. statistical genomics genome assembly and annotation 59. Bottlenecks and selective sweeps during biomedical genomics domestication have increased deleterious population genomics genetic variation in dogs Marsden CD, structural genomics Ortega-Del Vecchyo D, O’Brien DP, Taylor JF, comparative genomics Ramirez O, Vila C, Marques-Bonet T, Schnabel single cell genomics RD, Wayne RK, Lohmueller KE. Proc Natl Acad Sci U S A. 2016 Jan 5;113(1):152-7. doi: 10.1073/ pnas.1512501113. Epub 2015 Dec 22. 06. appendix 60. Chromatin and RNA Maps Reveal Regulatory funding Long Noncoding RNAs in Mouse Bogu GK, collaborators Vizán P, Stanton LW, Beato M, Di Croce L, human resources projects Marti-Renom MA**.Mol Cell Biol. 2015 Dec publications 28;36(5):809-19. doi: 10.1128/MCB.00955- 15. **corresponding author

cnag

centre nacional d’anàlisi genòmica centro nacional de análisis genómico