This Thesis Has Been Submitted in Fulfilment of the Requirements for a Postgraduate Degree (E.G
Total Page:16
File Type:pdf, Size:1020Kb
This thesis has been submitted in fulfilment of the requirements for a postgraduate degree (e.g. PhD, MPhil, DClinPsychol) at the University of Edinburgh. Please note the following terms and conditions of use: This work is protected by copyright and other intellectual property rights, which are retained by the thesis author, unless otherwise stated. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given. Lost Pigs and Broken Genes: The search for causes of embryonic loss in the pig and the assembly of a more contiguous reference genome Amanda Warr A thesis submitted for the degree of Doctor of Philosophy at the University of Edinburgh 2019 Division of Genetics and Genomics, The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh Declaration I declare that the work contained in this thesis has been carried out, and the thesis composed, by the candidate Amanda Warr. Contributions by other individuals have been indicated throughout the thesis. These include contributions from other authors and influences of peer reviewers on manuscripts as part of the first chapter, the second chapter, and the fifth chapter. Parts of the wet lab methods including DNA extractions and Illumina sequencing were carried out by, or assistance was provided by, other researchers and this too has been indicated throughout the thesis. This work has not been submitted for any other degree or professional qualifications. All sources of information have been acknowledged. Amanda Warr 2019 I Abstract The pig is an economically important species, with pork being the most widely consumed meat in the world. Genomic technologies have the potential to improve reproduction, health and efficiency in the pig industry. Additionally, pigs are more similar to humans than species commonly used as medical models and improved genomic resources for the pig may facilitate its use in medical modelling. The cost of DNA sequencing has greatly decreased in recent years, allowing more researchers to incorporate next generation sequencing into their projects. Many bioinformatic tools are designed to accept a reference genome as a truth against which individuals are compared, however most of the available reference genome sequences are low-quality drafts. It is important to understand the limitations of available reference genomes in order to make full use of the sequencing technologies available. Chapter 2 assesses the quality of the published pig draft reference genome sequence, Sscrofa10.2, using short-read sequencing data from the individual from which the genome was assembled. By identifying regions where the reads disagree with the assembly, regions of low-confidence are identified and a filter is produced to reduce the impact of these regions on genomic analyses. Chapter 3 makes use of exome sequencing data to identify variants that are predicted to truncate proteins in 96 pigs, through application of filters, including the filter designed in Chapter 2. This is reduced to a short list of variants that are likely to have an impact on phenotype, specifically variants that may be associated with reproductive phenotypes and embryonic lethality. Additionally, imputation from the 96 pig III exomes to a larger set of 446 pigs each genotyped for ~60,000 single nucleotide polymorphisms (SNPs) with the PorcineSNP60 BeadChip (Illumina) is carried out, and variants are investigated for association between two reproductive phenotypes and the imputed exome variants using a genome-wide association study (GWAS) and a number of candidate genes are identified. Chapter 4 uses an alternate method of identifying phenotype altering variants by using whole genome sequencing of a trio of individuals (sire, dam and affected individual) to search for a genetic cause of foetal mummification in pigs. Finally, chapter 5 focuses on improving the available resources for the pig by reassembling the pig genome using the latest long- read sequencing technologies, producing a much improved assembly, Sscrofa11.1. The assembly is one of the most contiguous reference genomes currently available with a contig N50 of 48.2Mb, only 103 gaps remaining (less the Y chromosome) and two closed chromosomes. This project improves the available genomic resources for the pig, and identifies several putative causal variants and candidate genes underlying important traits in a commercial population. IV Lay Summary Pork is the most widely consumed meat in the world, and it is important that we are able to produce enough pork to meet demand. In order to improve pork production, it would be beneficial to know exactly why one pig performs better or worse than others. Every living thing is made up of cells, and in those cells are a set of instructions, or “DNA”, that determine the traits of individuals along with environmental influences. We call the full set of DNA in the cell a “genome”, and the sections that specifically instruct on how to make proteins are called the “genes”. Using equipment called DNA sequencers, we can read the instructions. The DNA is written in sequences of “bases” (A, G, T and C). Most DNA sequencers can only read a few hundred bases at a time. A pig’s genome has roughly 2.6 billion bases. If we piece together sequences from the genome we can make a “reference genome”, a sequence that represents the order in which the bases are found in the complete set of DNA in an individual. Using this we can take the short sequences that DNA sequencers produce and compare them to the reference to start to understand what is different between individuals, and how those differences, or “variants”, affect the traits of the individual. For the pig, there is already a reference genome, however, in the second chapter I describe methods I used to identify parts of this reference that are badly assembled. The genome has many mistakes in it and thousands of gaps. If we compare sequences to a reference that doesn’t accurately represent any pig this can confuse analyses and make data harder to interpret. In chapter 3, I use sequences from the genes of 96 pigs and by comparing them to the V reference I try to identify variants that cause the protein to be cut short, truncating the protein potentially makes it unable to function. I use the badly assembled regions from the second chapter to exclude some inaccurate variants and identify several variants that might truncate proteins that are important for pork production. I also try to relate the presence of variants to differences in the breeding success of the pigs. In chapter 4, I look at sequences from a piglet that died during pregnancy, and both of its parents, to see if the piglet inherited any variants from its parents that may have caused its death. Each individual has two copies of a gene, one copy inherited from each parent, so even if both parents are healthy, they might each have one copy of a damaging variant, and both pass down this copy to the piglet. In chapter 2 and 3, badly assembled regions of the reference genome made accurate analysis difficult. In chapter 5, I assemble a new reference genome using a new kind of DNA sequencer that can read tens of thousands of bases at a time instead of only a few hundred. These longer sequences make it easier to find overlapping sequences and assemble the genome with fewer mistakes and fewer gaps. This new reference genome is much more accurate than the previous one and is now available for other researchers to use, and should make genome analysis in the pig more accurate and useful. This work provides other scientists with better resources to work with, and also identifies genetic variants that might cause smaller litter sizes in pigs that following further validation can be avoided to improve pork production. VI Acknowledgements I would like to thank my primary supervisor Mick Watson for his advice, support and encouragement throughout the project. I had no experience of bioinformatics and limited programming experience when I started this project, and Mick’s guidance has helped me to develop these skills. I would also like to thank the Watson group, particularly Christelle Robert who played a big part in introducing me to exome sequencing at the beginning of the project and whose data was used in chapter 3. I’d like to thank my second supervisor, Alan Archibald, who gave me the opportunity to reassemble the pig genome, this was a challenging part of the project, but also very rewarding and Alan’s enthusiasm for producing a high quality reference assembly is contagious. My final supervisor, David Hume, was also very supportive and always full of ideas and advice. All three of my supervisors have shown great faith in me, and have helped to build my confidence in my abilities, for which I am extremely grateful. My project was part funded by Genus PIC, and I’d like to thank Joseph Deeb, Matthew Campbell, Steve Rounsley and Alan Mileham for their support. Particular thanks to Joseph and Matthew, who not only advised me on science, but also arranged accommodation for me and drove me around for ~3 months while I was in the States. Several parts of this project involved wet lab work, in which I had some experience but was not confident. I’d like to thank everyone who welcomed the poor, lost bioinformatician into their lab and were always happy to answer VII silly questions and lend me equipment.