Improving the Interpretability of Convolutional Neural Networks for Population Genetics

Yongxin Xu

December 18, 2020
Version: 1.0.0
Advisor: Sara Mathieson

Abstract

Population genetics is the study of genetic variation in populations and the evolutionary forces that explain this variation. Relevant studies are usually based on simulated genomic data in matrix form. Many existing methods, such as statistical likelihood inference and SVMs, can only work with summary statistics of the simulated matrices, and therefore suffer from a loss of information and reduced accuracy. CNNs, with their ability to process raw genomic data and their stable performance, have outperformed existing methods on many population genetic problems, such as detecting natural selection. However, because the inner architecture of a CNN is complex, it is usually difficult for researchers to understand what their models are learning and why the models make certain decisions on given inputs.

To enhance the interpretability of CNN models, we examine two techniques that have been successfully applied in other fields where the application of CNNs is more mature than in population genetics. The first is an intrinsically interpretable CNN design from speech recognition called SincNet, which uses band-pass filters to limit the number of learnable parameters and thereby improve interpretability. The second is a post-hoc interpretation technique known as saliency maps, which visualize the importance of each input unit to the final decision and have been widely applied in computer vision and natural language processing. Finally, we propose two approaches for adapting these techniques to studies of natural selection.


Acknowledgements

First of all, I would like to thank my advisor, Professor Sara Mathieson, for all her academic guidance and warmhearted support. I also want to thank my thesis group partner and my dearest friend Jiawen Zhang (Ivy). The weekly thesis meeting of the three of us was always the highlight of the week, full of great discussions, new ideas, and fun, especially at this challenging time when we could only meet virtually across the world. Second, I would like to thank my second reader, Professor Alvin Grissom, for detailed feedback on my first draft of the thesis, which helped me think more deeply about the concepts and ideas. Third, I would like to thank Professor Sorelle Friedler for organizing this difficult semester and caring for students from all over the world.


Contents

1 Introduction
  1.1 An Introduction to Population Genetics
    1.1.1 Natural Selection
    1.1.2 Data and Methods
  1.2 An Introduction to Machine Learning
    1.2.1 Convolutional Neural Network
  1.3 CNN in Natural Selection

2 Literature Review
  2.1 An interpretable CNN design: SincNet
    2.1.1 Ravanelli and Bengio, Interpretable Convolutional Filters with SincNet
  2.2 A post-hoc interpretation technique: Saliency Map
    2.2.1 Li et al., Visualizing and Understanding Neural Models in NLP

3 Statement of The Problem
  3.1 Interpretable CNN design
  3.2 Saliency maps

4 Conclusion and Future Work

1 Introduction

1.1 An Introduction to Population Genetics

Population genetics is the study of genetic variation within and between populations and of the evolutionary forces that explain this variation. The field mainly focuses on examining and modeling population gene frequency distributions and their changes over time and space. By examining the patterns of this variation, geneticists identify demographic events like population shifts and gene flow (Yang et al. 2020), deduce the rates of mutation and recombination (Hinch et al. 2011), reveal the impact of natural selection (Elyashiv et al. 2016), and thereby infer the natural history of populations. In particular, population genetics helps improve human health: it helps identify genes related to heritable diseases and develop effective medical interventions to reduce the population frequency of these diseases.

1.1.1 Natural Selection

The evolutionary force that this thesis focuses on is natural selection. Natural selection occurs when individuals in a population have different traits and some traits are better suited to the environment than others. At the genetic level, differences in traits result from genes that have multiple variants (alleles) at the same position (genetic locus) on the chromosome. These allelic variations are usually caused by mutations. Some alleles are beneficial to the adaptation, survival, and reproduction of their carriers, while others are neutral or even harmful. As a result, individuals carrying beneficial alleles pass on more genes to the next generation. These alleles are "naturally selected" to dominate the population; their frequencies increase rapidly and tend to reach "fixation" in the population. Meanwhile, the genetic variants around the beneficial alleles "hitchhike" to fixation as well. This process is called a selective sweep; it reduces genetic diversity at the corresponding locations in the genome and is usually regarded as a signature of natural selection.

However, such a signature is usually hard to detect due to recombination. Recombination is a process by which genetic sequences are broken and rejoined to form new combinations of alleles. It obscures the signature of natural selection by reshuffling these combinations. On the other hand, recombination contributes to identifying the specific locations of selected mutations: by breaking genetic sequences into fragments, it narrows down the region in which to search for beneficial mutations (otherwise, the entire chromosome would "hitchhike" to fixation and it would be impossible to determine which mutations were actually beneficial and selected). Consequently, because recombination increases the "entropy" of genetic variation, it is difficult for humans to detect natural selection and identify selected sites from genomic data entirely on their own. This is why many models have been developed to help researchers learn about the effects of natural selection on populations. Understanding natural selection has become increasingly helpful to fields such as medicine, agriculture, and resource management (Gregory 2009).

1.1.2 Data and Methods

Typically, population genetic studies are based on simulated genomic data when examining and modeling the influence of natural selection and other basic evolutionary factors on genetic variation. One reason for using simulated data is that it is difficult to obtain a large number of real, complete population genomic data sets. Even with such data sets in hand, it is difficult to explain which evolutionary forces are at play in a given population and what their impact on genetic variation is (essentially, there is no "ground truth" for real genomic data), and it is therefore hard to construct a model that describes the evolutionary process.

Simulated genomic data can overcome these challenges. Many simulation tools have been devised so that researchers can easily generate large amounts of genomic data by setting a few parameters that define the effects of evolutionary forces. For example, in studies of natural selection, researchers manually set the mutation rate and selection coefficient (i.e. reproduction rate) for each mutation to mimic the impact of natural selection on different types of mutations in a population. Based on the simulated data and the known evolutionary forces, researchers can construct a model that describes the evolutionary process of the simulated population and apply this model to real genomic data to learn about real evolution.

Fig. 1.1: (A) A genealogical tree of the genes of eight individuals, with eight segregating sites falling on different branches, and the corresponding SNP matrix, where rows are the individuals and columns are the segregating sites. Mutations occurring in a certain lineage are inherited by all its descendants and coded as 1's in the matrix. This data can be used to infer whether a population with mutations has undergone (B) neutral evolution (where mutations, the white dots, do not influence individuals) or (C) positive selection (where an increasing number of individuals carry a beneficial mutation, the green dot). Figure from (Ronen et al. 2013).

In more detail, population genome simulators like MSMS (Ewing and Hermisson 2010) usually first generate a genealogical tree that describes the changes in the distribution of alleles of a sample population under pre-defined selection. A matrix is then derived from the tree, where the rows are the individuals in the sample and the columns are segregating sites, i.e. the loci where mutations occur. An entry in this matrix is 1 when a mutation is observed at that site and 0 when the ancestral allele is observed. As a set of segregating sites is also known as a set of single-nucleotide polymorphisms (SNPs), this matrix is also called the SNP matrix. Most population genetic studies, including those on natural selection, are based on this type of matrix. Fig. 1.1 shows an example of an SNP matrix and its corresponding genealogical tree.
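To make this format concrete, the following minimal Python sketch builds a toy SNP matrix; the entries are random rather than simulator output, so only the shape and encoding match the description above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy SNP matrix in the format described above: rows are the n sampled
    # individuals, columns are segregating sites, and a 1 marks a derived
    # (mutated) allele while a 0 marks the ancestral one. A real matrix
    # would come from a simulator such as MSMS; this one is random.
    n_individuals, n_sites = 8, 8
    snp = (rng.random((n_individuals, n_sites)) < 0.3).astype(int)
    print(snp)            # an 8 x 8 matrix of 0s and 1s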

Statistical methods have been commonly used in population genetic studies for analyzing and making inferences from genomic data. Likelihood-based statistical inference methods like Approximate Bayesian Computation (ABC) (Beaumont et al. 2002) have been popular because of their statistical efficiency. However, instead of working directly with an SNP matrix, these methods use summary statistics derived from the matrix. One common summary statistic is the site frequency spectrum (SFS), the distribution of allele frequencies in the given set of SNPs. For example, the SNP matrix in Fig. 1.1 has the SFS ξ = [4, 1, 1, 1, 0, 0, 1], where ξ_i denotes the number of segregating sites whose mutations occur at frequency i/n in a sample of size n, for 1 ≤ i ≤ n − 1. To be more specific, ξ_1 = 4, as the SNP matrix has four segregating sites (2, 3, 5, and 8) whose derived alleles have a frequency of 1/8.
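As a sanity check of this definition, the sketch below recovers the SFS above from a matrix whose per-site mutation counts match the columns of Fig. 1.1 ([3, 1, 1, 4, 1, 2, 7, 1]); the row-level patterns here are invented for illustration.

    import numpy as np

    def sfs(snp: np.ndarray) -> np.ndarray:
        """Site frequency spectrum: entry i-1 counts the segregating sites
        whose derived allele appears in exactly i of the n individuals."""
        n = snp.shape[0]
        freqs = snp.sum(axis=0)                 # derived-allele count per site
        return np.bincount(freqs, minlength=n + 1)[1:n]

    # A matrix with the same per-site mutation counts as Fig. 1.1;
    # the row patterns are made up, only the column sums matter here.
    snp = np.zeros((8, 8), dtype=int)
    for site, k in enumerate([3, 1, 1, 4, 1, 2, 7, 1]):
        snp[:k, site] = 1
    print(sfs(snp))                             # -> [4 1 1 1 0 0 1]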

One major drawback of using summary statistics is the loss of information, which reduces the accuracy of models. For example, the SFS loses the order of the segregating sites, which provides evidence of a selective sweep on the genes of the sample population. Thus, statistical methods usually fail to capture the full picture of the evolutionary factors affecting the population. Moreover, distilling raw genomic data into summary statistics can take a long time (e.g. 75 CPU days for a comprehensive set of summary statistics) (Sheehan and Y. S. Song 2016). Therefore, researchers have been seeking approaches that address these problems, and in recent years they have turned their attention to machine learning.

1.2 An Introduction to Machine Learning

Machine learning (ML) is the study of computer algorithms that predict the future based on past experience. Whereas traditional programming combines data with manually created rules to generate the output for a task, ML uses data, and sometimes the desired output, to discover the rules behind the task. This process requires computers to learn from the provided data, which is why the field is known as machine learning.

Supervised learning, with which this thesis is mainly concerned, is one of the main approaches of ML. In supervised learning, a set of example inputs and their known outputs (labels or classes) is given to the computer, and the goal is to derive a general function (the rules) that maps the inputs to the outputs by following a learning algorithm. Each input consists of a set of features, i.e. measurable properties of the observed phenomenon. The process of learning from labeled inputs and deriving a function is called training. The function is then evaluated on another set of data that it has never seen during training, a process known as testing. A machine learning algorithm is successful if its accuracy on the unseen data is high. Fig. 1.2 depicts this general process of supervised learning.

Fig. 1.2: Supervised machine learning: a supervised learning algorithm takes in labeled training data and constructs a function f, which can automatically predict labels for future examples that it has never seen before. Figure from (Daumé III 2012).

ML brings many benefits to the work of researchers. First, by reviewing a large volume of data, it can identify underlying patterns of observed phenomena that are not apparent to humans. The rules and patterns identified by computers are often more accurate than the ones created by human experts and have therefore been adopted in many areas of research (Flagel et al. 2019). Second, ML methods are good at handling multi-dimensional data and making accurate predictions. Moreover, having computers make data-driven decisions based only on the input data enables researchers to conduct studies even without a perfect theory. ML therefore allows researchers to proceed with their studies efficiently.

1.2.1 Convolutional Neural Network

The learning algorithm associated with this thesis is the convolutional neural network (CNN). The CNN is a class of neural network commonly applied in computer vision because of its strong ability to capture the important characteristics of images. Because of its high accuracy and stable performance on difficult classification tasks, the CNN has become one of the most popular learning algorithms across various areas, such as image classification (Ciregan et al. 2012), electrical activity recognition (Roy et al. 2019), and natural language processing (Johnson and Zhang 2015).

Fig. 1.3: Left: A regular neural network using a stack of fully-connected layers. Right: A CNN with 3D layers. Every layer transforms a 3D input volume into a 3D output volume through neuron activations. In image-related tasks, the leftmost input layer is an image, so its width and height are the dimensions of the image and its depth is 3 (the red, green, and blue channels). Figure from (Karpathy et al. 2016).

In general, the architecture of a neural network consists of different layers: an input layer, an output layer, and a series of hidden layers (intermediate layers between the input and the output). Each layer is composed of a set of nodes known as artificial neurons, by analogy to the neurons in the biological brain that are responsible for transmitting signals. The neurons in the input layer initialize the network by each holding one real number of the input data. Each neuron in a hidden layer then receives inputs from the previous layer, processes the sum of its inputs with some nonlinear function, and sends the result to the neurons connected to it in the next layer. Typically, each pair of connected neurons has a weight that changes during learning; pairs with higher weights have a greater impact on the next layer. Lastly, the output layer combines all its weighted inputs and produces the final output (Karpathy et al. 2016).

The left part of Fig. 1.3 shows an example of a regular neural network. Notice that this example uses a stack of fully-connected layers. A fully-connected (FC) layer is one of the most common layer types: neurons in adjacent layers are connected in pairs, while neurons within the same layer function independently and share no connections.
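As a concrete illustration, the following sketch computes what a single FC layer does in isolation; the layer sizes and the choice of tanh as the nonlinearity are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    # One fully-connected layer in isolation: every output neuron takes a
    # weighted sum over all inputs, adds its bias, and applies a
    # nonlinearity. The sizes (4 inputs, 3 outputs) are arbitrary.
    x = rng.standard_normal(4)          # activations from the previous layer
    W = rng.standard_normal((3, 4))     # one weight per connected neuron pair
    b = rng.standard_normal(3)          # one bias per output neuron
    h = np.tanh(W @ x + b)              # nonlinear function of the weighted sums
    print(h)                            # three activations for the next layer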

The architecture of a CNN builds upon the regular neural network. The layers of a CNN usually arrange neurons in three dimensions: width, height, and depth. Meanwhile, each neuron in a layer usually connects to only a small region of neurons in the previous layer, in contrast to the fully-connected regular networks described above. The main idea of these layers is to reduce the dimensionality of the data without losing the features that are critical to the final result. Three layer types are commonly used in CNNs: the FC layer introduced above, the convolutional layer (CONV), and the pooling layer (POOL). In more detail (a minimal code sketch combining the three follows this list):

• CONV has a set of learnable filters as its parameters. Each filter is a vector of weights that represents some pattern; it is small along the width and height, but its depth equals the depth of the input volume. In this layer, each filter slides along the width and height of the input volume, and at each position an element-wise multiplication is computed between the entries of the filter and the region of the input it covers. This process is called convolution (Karpathy et al. 2016). In essence, it looks for similarity between the filters and different locations of the input, extracting high-level features while reducing data dimensionality. The hyperparameters of this layer, i.e. the parameters that are manually set to control the learning process, are the filter size and the stride, i.e. the number of units a filter slides at each step. The output of this layer is called a feature map.

• POOL is typically inserted in between successive CONV layers. It is responsible for reducing the spatial size of the data and therefore the amount of computation required in the network. POOL operates independently on each depth slice of the input volume, sliding a window across the slice and applying a mathematical operation to the entries inside the window (Karpathy et al. 2016). The two most common operations are max and average, where the former preserves the most important features while the latter takes context into consideration.

• A set of FC layers is usually applied at the end of a CNN architecture to map the features extracted by the CONV and POOL layers to the final output of the network (e.g. the probability distribution over classes in a classification task). The output layer usually has as many neurons as there are classes, and the value of each neuron is usually computed by the softmax activation function, which normalizes the outputs of the previous FC layer into class probabilities, each between 0 and 1 and all summing to 1 (Yamashita et al. 2018).
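The minimal sketch below stacks the three layer types into a toy classifier; the channel counts, kernel sizes, and input shape are arbitrary choices, not a recommended architecture.

    import torch
    import torch.nn as nn

    # Minimal CNN combining the layer types described above; all sizes
    # are invented for illustration.
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3),   # CONV: 8 learnable 3x3x3 filters
        nn.ReLU(),                        # nonlinearity (see Eq. 1.1 below)
        nn.MaxPool2d(2),                  # POOL: halves width and height
        nn.Flatten(),
        nn.Linear(8 * 15 * 15, 2),        # FC: feature maps -> 2 class scores
        nn.Softmax(dim=1),                # class probabilities summing to 1
    )
    image = torch.randn(1, 3, 32, 32)     # one random 32x32 RGB image
    print(model(image))                   # e.g. tensor([[0.47, 0.53]])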

Fig. 1.4: Visualization of an entire CNN with image inputs. Figure from MathWorks: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html.

Fig. 1.4 displays a visualization of an entire CNN on image inputs. This example involves another activation function besides softmax: the ReLU (Rectified Linear Unit). Activation functions in neural networks are usually non-linear. They are attached to each neuron and threshold the output of the neuron's operation before it is sent to the next layer, introducing non-linearity into the output. ReLU has the formula

$$ f(x) = \max(0, x) \qquad (1.1) $$

That is, ReLU outputs the input directly if the input is positive, and outputs 0 otherwise. It has become popular because it is simple to compute and can speed up the learning process compared to more complex activation functions.

1.3 CNN in Natural Selection

Before CNNs, many population genetic studies introduced other machine learning approaches (Schrider and Kern 2018). For example, the support vector machine (SVM), another supervised machine learning algorithm that analyzes data for classification and regression tasks, has been popular for identifying loci responsive to natural selection (Ronen et al. 2013). While SVM models have proved to be more stable and more accurate than existing statistical methods, their dependence on summary statistics leaves them facing the same issue as statistical methods: loss of information. Missing information such as the signature of selection (e.g. the order of segregating sites) can reduce the accuracy of models significantly. For instance, distinguishing between types of selection with subtle differences remains a challenge for current methods (Flagel et al. 2019). Therefore, some researchers have turned their attention to CNNs.

In fact, CNNs are well suited to learning genomic data. First, CNNs take matrices as inputs; that is, they can work directly with the raw SNP matrices presented in Fig. 1.1 (with input depth equal to 1) and therefore naturally avoid the loss of information described above. Second, CNNs have a strong ability to automatically detect the most important characteristics of the data. This ability allows researchers to investigate the data without first formulating a precise theory, and sometimes even suggests discoveries that humans have never made. With their efficient use of raw data and strong performance, CNNs have been successfully applied not only to detecting and identifying positive selection (Flagel et al. 2019), but also to many other population genetic problems, such as finding recombination hotspots (Chan et al. 2018), extracting demographic information (Flagel et al. 2019), and estimating historical recombination rates (Flagel et al. 2019).

It is worth noting that although CNNs take raw SNP matrices as inputs, standard CNN models are sensitive to the format of these population genetic alignments. The order of the individuals, i.e. the rows of a matrix, can have a significant influence on accuracy, and it sometimes helps to sort individuals by the similarity of their genes (Flagel et al. 2019). However, the alignment of individuals should not be a factor affecting the results. Therefore, permutation-invariant CNN architectures have been constructed that return the same outcome regardless of the order of the inputs (Zaheer et al. 2017; Chan et al. 2018). For example, Chan et al.'s network (2018) avoids combining information across individuals in every convolution and pooling operation (by giving each filter a height of 1) until a permutation-invariant function (e.g. sum, max, or average) collapses each column of the matrix down to a single value. This design not only ensures permutation invariance but also improves both the training speed and the final accuracy of their CNN model in detecting recombination hotspots. The full architecture of their model is shown in Fig. 1.5.
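A toy network in the spirit of this design (not Chan et al.'s exact model; the sizes are invented and the network is untrained) shows how height-1 filters plus a column-wise max yield permutation invariance:

    import torch
    import torch.nn as nn

    # Height-1 convolution filters never mix rows (individuals), and a max
    # over the row axis collapses each column, so shuffling the
    # individuals cannot change the output.
    class PermInvariantNet(nn.Module):
        def __init__(self, n_sites=36, n_classes=2):
            super().__init__()
            self.conv = nn.Conv2d(1, 16, kernel_size=(1, 5))   # height-1 filters
            self.fc = nn.Linear(16 * (n_sites - 4), n_classes)

        def forward(self, snp):                  # snp: (batch, 1, rows, sites)
            h = torch.relu(self.conv(snp))       # rows stay separate here
            h = h.max(dim=2).values              # permutation-invariant collapse
            return self.fc(h.flatten(1))

    net = PermInvariantNet()
    x = torch.randint(0, 2, (1, 1, 8, 36)).float()
    shuffled = x[:, :, torch.randperm(8), :]     # reorder the individuals
    print(torch.allclose(net(x), net(shuffled))) # True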

While CNNs have proven successful in analyzing genomic data for various population genetic problems, their weak interpretability has become one major challenge. Due to their complex inner structure, it is hard for researchers to understand why a CNN model made a particular prediction on a given input. In computer vision, the area where CNNs are most widely used and developed, one common approach to understanding what CNNs are learning is to visualize each internal representation (Simonyan et al. 2013). However, this approach is not directly applicable to population genetic studies; theories would need to be constructed to interpret gene alignments from an image perspective. Therefore, it is currently difficult for population geneticists to know what CNNs are learning in hidden layers, which characteristics of the data their models have captured for making decisions, and whether these characteristics are reasonable according to the theories of population genetics. Moreover, understanding CNNs may help researchers explain why their models make certain mistakes, so that they can adjust the architecture of the models accordingly and improve accuracy.

Fig. 1.5: The architecture of Chan et al.'s permutation-invariant CNN. It avoids combining information across individuals in the CONV and POOL layers by giving each filter a height of 1. A permutation-invariant function (e.g. sum, max, or average) then merges each column into a single value before the FC layers. Figure from (Chan et al. 2018).

As a result, this thesis focuses on techniques that promise to enhance the interpretability of CNNs employed in population genetics, especially in studies of natural selection. We start with two approaches that have been successfully applied in fields where the use of CNNs is more mature than in population genetics: SincNet in speech recognition and saliency maps in computer vision and natural language processing. We then discuss what we can learn from these two techniques and how to construct new methods that are applicable to genomic data and studies of natural selection.

2 Literature Review

2.1 An interpretable CNN design: SincNet

One way to enhance the interpretability of CNN models is to develop an intrinsically interpretable architecture, i.e. an architecture designed for easy interpretation. Efforts have been made to develop interpretable filters that force models to focus on a few parameters with clear physical meanings while filtering out less meaningful parameters and noise. One successful attempt is a model called SincNet, employed in the field of speech recognition (Ravanelli and Bengio 2018).

2.1.1 Ravanelli and Bengio, Interpretable Convolutional Filters with SincNet

Neural network models have been widely employed in speech and speaker recognition. The ability to deal with raw waveforms allows neural network models, especially CNNs, to learn low-level representations and specialize for each type of speech-related task. The disadvantage of processing the raw waveform, however, is that it is difficult to interpret the filter bank learned in the first convolutional layer, a layer that plays a critical role in current waveform-based CNNs.

To solve this problem, Ravanelli and Bengio (2018) introduced a novel model called SincNet that processes speech from the raw waveform. As the name suggests, this model convolves the waveform in its first layer with a set of parametrized sinc functions. These sinc functions form band-pass filters that pass frequencies within a certain range and reject those outside it. In the end, only two parameters per filter, the low and high cutoff frequencies, are learned from the data, so the model is easy for humans to understand.

Fig. 2.1: Architecture of SincNet: the model employs parametrized sinc functions in its first layer and a standard CNN pipeline (pooling, normalization, activations, dropout) afterwards. Figure from (Ravanelli and Bengio 2018).

Here is how the first layer of a SincNet model differs from that of a standard CNN. Standard CNNs perform a series of convolution operations between the waveform input and Finite Impulse Response (FIR) filters, where an FIR filter is one whose impulse response, i.e. its reaction to a change in the input signal, is of finite duration. Each convolution operation can be defined as:

$$ y[n] = x[n] * h[n] = \sum_{l=0}^{L-1} x[l] \cdot h[n-l] \qquad (2.1) $$

where x[n] is a fragment of a speech signal, h[n] is a filter of length L, and y[n] is the filtered output. Typically, standard CNNs need to learn all L elements of each filter from the data. In contrast, SincNet convolves the waveform data with a predefined function g that has only a few learnable parameters θ:

$$ y[n] = x[n] * g[n, \theta] \qquad (2.2) $$

In SincNet, g is defined in the following way, so that only two parameters are learnable from the data:

$$ g[n, f_1, f_2] = 2 f_2 \, \mathrm{sinc}(2 \pi f_2 n) - 2 f_1 \, \mathrm{sinc}(2 \pi f_1 n) \qquad (2.3) $$

where f_1 and f_2 are the learned low and high cutoff frequencies and the sinc function is defined as sinc(x) = sin(x)/x.

Fig. 2.2: Examples of filters learned by a standard CNN and by SincNet on a speaker-identification task trained on the Librispeech data set. The first row displays the filters in the time domain and the second row shows their magnitude frequency responses. Figure from (Ravanelli and Bengio 2018).
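The filter of Eq. 2.3 is easy to reproduce; the sketch below implements it in numpy with frequencies normalized to cycles per sample, and omits the windowing that the published model also applies to the filter.

    import numpy as np

    def sinc_bandpass(f1, f2, L=101):
        """Band-pass filter of Eq. 2.3: the difference of two sinc
        low-pass filters with (normalized) cutoffs f1 < f2. Only f1 and f2
        would be learned; all L filter taps follow from them."""
        n = np.arange(L) - (L - 1) / 2            # symmetric sample indices
        # np.sinc(x) = sin(pi x) / (pi x), so np.sinc(2 f n) equals the
        # sin(x)/x-style sinc(2 pi f n) used in Eq. 2.3.
        return 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)

    # Filter a toy signal chunk, as in Eq. 2.2.
    x = np.random.default_rng(0).standard_normal(400)
    y = np.convolve(x, sinc_bandpass(0.05, 0.15), mode="same")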

Limiting the number of learnable parameters enables SincNet to create a more meaningful set of filters than standard CNNs. Fig. 2.2 shows that by implementing the pre-defined function g as band-pass filters in the first layer, the SincNet model learns filters with well-defined frequency responses, whereas some of the filters learned by a standard CNN have noisy frequency responses (e.g. the first CNN filter) and some have multi-band shapes (e.g. the third CNN filter).

SincNet has been applied to speaker recognition tasks, i.e. identifying the speaker from a speech signal, with the TIMIT (Garofolo et al. 1993) and Librispeech (Ravanelli, Cristoforetti, et al. 2015) data sets. These two data sets are characterized by minimal training data (12-15 seconds per speaker) and short test sentences (lasting 2-6 seconds) (Ravanelli and Bengio 2018). The results show that SincNet converges faster and achieves better performance than standard CNNs by forcing the network to focus on the two parameters that are critical to the final decision. Moreover, SincNet has higher interpretability than its counterparts. For example, SincNet is able to capture the characteristics of a speaker: as Fig. 2.3 indicates, it captures three prominent peaks at the frequencies of the pitch (133 Hz for a male and 234 Hz for a female on average), the first formant (around 500 Hz over English vowels), and the second formant (around 1000 Hz for the vowel /a/). The standard CNN model does not exhibit this meaningful pattern.

Fig. 2.3: Cumulative frequency response of the filters learned by SincNet and a standard CNN on a speaker-identification task. The cumulative frequency response, obtained by summing all the learned filters, indicates which frequency bands the filters cover. SincNet captures three peaks at the frequencies of the pitch, the first formant, and the second formant, i.e. the characteristics of a speaker, while the standard CNN does not. Figure from (Ravanelli and Bengio 2018).

Another property of SincNet is its low error rate. In a speaker identification task, the classification error rate (CER) of SincNet, i.e. the proportion of inputs misclassified over the whole data set, is 0.85% on the TIMIT data set, which contains 462 speakers, while two standard CNN models, CNN-Raw and CNN-FBANK, reach 1.65% and 0.86% respectively. In other tests on different data sets, SincNet outperforms its competing alternatives as well.

The performance of SincNet has attracted the attention of other fields that also work with waveform inputs. In studies of decoding brain activity from electroencephalographic (EEG) data, a model called Sinc-ShallowNet has been built upon the sinc convolutional layer of SincNet (Borra et al. 2020). This model has been tested on motor-execution and motor-imagery data sets, and the results indicate that limiting the number of trainable parameters not only generates results that are easy to interpret but also leads to better performance than other standard CNNs on EEG decoding tasks.

2.2 A post-hoc interpretation technique: Saliency Map

Besides designing CNN models that are intrinsically interpretable, another way to better understand CNNs is to apply post-hoc interpretation techniques, i.e. interpretation methods applied to already-trained models. In recent years, several post-hoc techniques have been developed, such as decomposition (Bach et al. 2015), visualization (Ribeiro et al. 2016), and saliency maps (Simonyan et al. 2013).

Saliency maps, as the name suggests, show which qualities of the inputs have a prominent impact on the final decision. This technique enables researchers to understand which parts of the inputs their models focus on. Saliency maps were first introduced in computer vision, where the salience of an image is found by ranking all of its pixels based on their influence on the score of a given class (Simonyan et al. 2013).

A linear score function for the class c is defined as:

$$ S_c(I) = w_c^T I + b_c \qquad (2.4) $$

where I is an image represented in vector form, w_c is the weight vector, and b_c is the bias. In this case, the importance of each pixel of I for class c is indicated by the magnitude of the corresponding element of w_c.

In cases where the score function is non-linear, S_c(I) can be approximated by a linear function around a specific image I_0 by computing the first-order Taylor expansion:

$$ S_c(I) \approx w^T I + b \qquad (2.5) $$

where w is the derivative of S_c with respect to the image I, evaluated at I_0:

$$ w = \left. \frac{\partial S_c}{\partial I} \right|_{I_0} \qquad (2.6) $$

A saliency map can then be constructed by rearranging the elements of the vector w into the shape of the given image. In this way, the saliency map is a simplified representation of the image, which usually highlights the shape or the location of objects, helping researchers analyze what their model is actually learning.
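In practice, the derivative w is obtained by backpropagation. The following sketch computes a saliency map for a stand-in, untrained classifier; only the mechanics, not the model, reflect real use.

    import torch
    import torch.nn as nn

    # Gradient-based saliency: backpropagate the class score to the input
    # and take the magnitude of each element of w = dS_c/dI as that
    # pixel's salience. The one-layer "model" here is a toy stand-in.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    image = torch.randn(1, 1, 28, 28, requires_grad=True)

    score = model(image)[0, 3]              # S_c for an arbitrary class c = 3
    score.backward()                        # fills image.grad with dS_c/dI
    saliency = image.grad.abs().squeeze()   # rearranged to the 28x28 image
    print(saliency.shape)                   # torch.Size([28, 28])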

The benefits of saliency maps in uncovering the hidden interests of neural network models are not limited to the field of computer vision. Studies in natural language processing (NLP) (Li et al. 2016; Koupaee and W. Y. Wang 2018) and brain activity recognition (Borra et al. 2020), for example, have used saliency maps to enhance the interpretability of their neural network models. This thesis discusses the work of Li et al. (2016), who first adopted saliency maps in NLP.

2.2.1 Li et al., Visualizing and Understanding Neural Models in NLP

Although neural networks have been successfully employed in many NLP tasks, researchers usually find it hard to understand the resulting vector-based models. One unclear aspect is how neural networks handle compositionality, i.e. how they construct the meaning of a sentence by hierarchically combining the meanings of its words and phrases. To understand what neural networks learn from their inputs to build the meaning of a sentence, Li et al., inspired by the studies of computer vision, use saliency maps. A saliency map shows how much each input unit (e.g. a word) contributes to the final decision (e.g. the meaning of a given sentence), which can be approximated by first-order derivatives. Based on saliency maps, researchers can learn which parts of a sentence their models consider important to its meaning.

More formally, given embeddings E for the input words and a class score function S_c(E), the goal of saliency maps is to decide which units of E contribute most to S_c(E) and therefore to the choice of class c. In the case of neural networks, S_c(E) is highly non-linear, so researchers approximate it with a linear function around a specific embedding E_0, such that:

$$ S_c(E) \approx w^T E + b \qquad (2.7) $$

where w is the derivative of S_c with respect to the embeddings E, evaluated at E_0:

$$ w = \left. \frac{\partial S_c}{\partial E} \right|_{E_0} \qquad (2.8) $$

Fig. 2.4: Saliency heatmaps for "I hate the movie I saw last night" from three neural networks: an RNN, an LSTM, and a Bi-Directional LSTM. Each row represents the saliency scores for the different embedding dimensions. In this sentence, the sentiment indicator "hate" signals that the sentiment (i.e. the class) is "very negative". Figure from (Li et al. 2016).

Similar to the definition given by Simonyan et al. (2013), the magnitude of each element of w indicates the importance of the corresponding unit of E_0 for class c; this magnitude is also called the saliency score.

In this study, three types of neural networks (rather than CNNs) are trained for a sentiment analysis task on the Stanford Sentiment Treebank (Socher et al. 2013), a data set that contains sentiment labels for every parse-tree constituent. The three models are: a recurrent neural network (RNN, which is able to deal with temporal information, i.e. data that comes in a sequence, like a sentence), a long short-term memory network (LSTM, a special kind of RNN capable of learning long-term dependencies), and a Bi-Directional LSTM (an extension of the LSTM). Saliency maps for a few sentences are then created based on the resulting models.

Fig. 2.4 shows an example of the results. In this example, the researchers apply the three models to the sentence "I hate the movie I saw last night", and all of them predict the correct sentiment. The resulting saliency maps show that all of the trained models successfully capture the most important sentiment indicator, "hate", when determining the sentiment of this sentence. Meanwhile, the saliency maps expose the weakness of the RNN model at filtering out irrelevant information, as it tends to give too much salience to words that are unrelated to the sentiment (e.g. the word "saw"), while the other two models perform better in this regard. Therefore, with the help of saliency maps, researchers can not only learn what neural networks are learning from data but also explain the strengths and weaknesses of each model.

However, saliency maps have a few limitations. First, the calculation of saliency scores (i.e. the first-order derivatives of the class score function) cannot capture all the information people want to visualize. Because first-order derivatives are rough approximations of individual contributions to the class score, they may not suffice to describe the importance of each unit when the class score function is highly non-linear (Li et al. 2016). Second, saliency maps only measure the contribution of each individual unit, while a combination of input units (e.g. the context of a sentence) may matter more to the decision (e.g. the sentiment of the sentence). Moreover, since saliency scores are computed after the models are trained, they are calculated over all the feature maps; that is, they reveal little about the internal structure and parameters of the neural network models (Koupaee and W. Y. Wang 2018).

Even so, saliency maps have been widely adopted in NLP tasks in recent years. This widespread use has promoted the development of similar but simpler-to-implement techniques that could fill the holes left by saliency maps. Koupaee and Wang (2018), for example, utilize the feature maps that CNN models generate at each CONV layer: they generate heatmaps based on the values of the convolution features for each word in every feature map, instead of using saliency scores. Their proposed method produces results similar to those of saliency maps in highlighting the most contributing words, while being easy to implement for different values of hyperparameters and various architectures (Koupaee and W. Y. Wang 2018).
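A minimal sketch of this feature-map approach, using a toy stand-in model rather than theirs, captures each CONV output with a forward hook for later visualization:

    import torch
    import torch.nn as nn

    # Capture each CONV layer's output during a normal forward pass; the
    # stored values can then be visualized directly as heatmaps.
    feature_maps = {}

    def save(name):
        def hook(module, inputs, output):
            feature_maps[name] = output.detach()
        return hook

    model = nn.Sequential(nn.Conv2d(1, 4, 3), nn.ReLU(), nn.Conv2d(4, 8, 3))
    for i, layer in enumerate(model):
        if isinstance(layer, nn.Conv2d):
            layer.register_forward_hook(save(f"conv{i}"))

    model(torch.randn(1, 1, 16, 16))
    print({k: tuple(v.shape) for k, v in feature_maps.items()})
    # {'conv0': (1, 4, 14, 14), 'conv2': (1, 8, 12, 12)}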

3 Statement of The Problem

The machine learning algorithm CNN has become increasingly attractive for population genetic studies because of its efficient use of raw genomic data and its strong performance. However, one major drawback is its lack of interpretability: the complex inner structure of CNN models hinders researchers from understanding what their models are learning and why they make certain decisions on given inputs. Several approaches to this problem have been developed in other fields where the application of CNNs is more mature than in population genetics. In particular, two methods were discussed in the previous chapter: SincNet, a model designed for easy interpretation, and saliency maps, a post-hoc interpretation technique applied after model training.

However, since these two methods were introduced and developed in areas whose input formats are not similar to that of genomic data, we need approaches to make them more applicable in the context of population genetics, especially for problems of natural selection. Ideally, the adaptation should preserve the permutation invariance of the CNN architecture proposed by Chan et al. (2018), where the order of the individuals (i.e. the rows of an SNP matrix) does not affect the final results. In the rest of this chapter, we propose two methods, inspired by SincNet and saliency maps, to enhance the interpretability of CNN models in studies of natural selection.

3.1 Interpretable CNN design

The idea of constructing an intrinsically interpretable CNN architecture has proven promising, given the successful application of SincNet in speech recognition. The SincNet model contains a set of parameterized band-pass filters that retain only two learnable parameters, the low and high cutoff frequencies, in the first layer. These band-pass filters limit the range of data that may contribute to the final decision while discarding the extreme cases (noise) that might disturb the learning process. SincNet therefore not only enhances interpretability by limiting the number of parameters but also improves accuracy compared to standard CNN architectures.

However, one major gap between SincNet and natural selection studies is the difference in data format. SincNet processes waveform data, while genomic data is an alignment of genetic sequences. The former has a natural notion of low and high, so band-pass filters can easily be applied to reduce the range of learnable parameters (e.g. the low and high cutoff frequencies). Genomic data essentially has no such notion; it is hard to define lower and upper limits on the genetic sequence alignment learned by the models.

In order to apply band-pass filters to the raw genomic data (i.e. the SNP matrix), the data first needs to be transformed. One approach is to collapse each column of the matrix into a single value by summing all the entries in that column. For example, the SNP matrix in Fig. 1.1 can be converted into the 1D vector ς = [3, 1, 1, 4, 1, 2, 7, 1], where each entry ς_i is the total number of mutations at segregating site i. A band-pass filter that defines upper and lower limits on the number of mutations can then be applied to the vector ς.
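A two-line sketch of this transformation also demonstrates its permutation invariance (the matrix here is random, as before):

    import numpy as np

    rng = np.random.default_rng(0)
    snp = (rng.random((8, 8)) < 0.3).astype(int)   # toy SNP matrix

    # Collapse each column (segregating site) to its total mutation count.
    # The sum keeps the order of sites but discards per-individual detail,
    # and permuting the rows leaves the vector unchanged.
    counts = snp.sum(axis=0)
    shuffled = snp[rng.permutation(8)]
    assert (shuffled.sum(axis=0) == counts).all()
    print(counts)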

It is worth noting that ς is a form of summary statistic and looks similar to the site frequency spectrum (SFS) introduced in section 1.1.2. However, ς is preferable to the SFS because it preserves the order of the segregating sites, which the SFS does not. The order of segregating sites matters: a positive selection normally exerts a "sweep" on a group of alleles near the beneficial mutation, and the pattern of allele frequencies around the selected sites is a signature of selection. That is to say, although the vector ς ignores the mutation information of each individual, it retains one of the most important features for detecting and identifying natural selection by preserving the order of the segregating sites. Moreover, the summation that converts the raw SNP matrix into ς makes the network permutation-invariant, i.e. the order of the individuals does not affect the final result. So although this model is not built directly upon the architecture proposed by Chan et al. (2018), it keeps the same property as their model.

Fig. 3.1: The proposed function of the band-pass filters.

The band-pass filters used in the first layer may be constructed with the following function:

$$ f(x) = \begin{cases} x & \text{if } v_1 \le x \le v_2, \\ 0 & \text{otherwise}, \end{cases} \qquad (3.1) $$

where v_1 and v_2 are the lower and upper limits respectively. We may start by rigidly defining a set of (v_1, v_2) pairs (e.g. {(50, 55), (65, 70), (90, 95)} if there are 100 individuals in the data); the values of these pairs could then be hyperparameters that are adjusted in the process of training (or even learned from the data). Intuitively, the upper limit on the number of mutations at a segregating site could be at most the total number of individuals minus one, which strongly indicates that a selection has occurred in favor of the mutation at this specific site. As for the lower limit, 10% of the individuals may be a reasonable minimum value.
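A minimal sketch of Eq. 3.1 applied to ς; the limits chosen below are illustrative, not tuned values:

    import numpy as np

    def mutation_bandpass(counts, v1, v2):
        """Eq. 3.1 applied elementwise: keep per-site mutation counts
        inside [v1, v2] and zero out the rest."""
        counts = np.asarray(counts)
        return np.where((counts >= v1) & (counts <= v2), counts, 0)

    sigma = np.array([3, 1, 1, 4, 1, 2, 7, 1])   # the vector ς from Fig. 1.1
    print(mutation_bandpass(sigma, v1=2, v2=7))  # -> [3 0 0 4 0 2 7 0]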

A standard CNN pipeline (pooling, activation, etc.) can be employed after the first layer, where the band-pass filters convolve with the genomic data ς. Multiple standard CONV, POOL, or FC layers can be stacked together, ending with a softmax classifier that decides whether a positive selection has occurred. Fig. 3.2 visualizes the proposed interpretable CNN architecture.

Fig. 3.2: A prototype of an interpretable CNN architecture, inspired by SincNet, working on genomic data.

3.2 Saliency maps

A saliency map, a post-hoc interpretation technique employed after the training of models, may be easier to implement than the previous approach. It enhances the interpretability of CNN models by measuring the contribution of each input unit to the final decision, and has been applied successfully in the fields of computer vision and NLP. However, saliency maps have not yet been applied in population genetics. It is still uncertain whether this method can improve interpretability in the context of population genetics and whether the resulting maps can be explained by population genetic theory. Nevertheless, we still want to give it a try.

In the context of natural selection studies, saliency maps can be constructed in different ways, depending on the CNN architecture employed and the format of the data fed into the model. First, if our study uses the model of Chan et al. (2018), the input to the network is the entire SNP matrix. Then, just as the saliency score is computed for each pixel in the image recognition tasks of computer vision (Simonyan et al. 2013), in natural selection tasks the saliency score is computed for each entry of the matrix, i.e. each segregating site of every individual.

In other words, given an SNP matrix M_0, a class c, and the model's class score function S_c(M), we would like to rank each entry of M_0 based on its influence on the score S_c(M_0). Since S_c(M) is highly non-linear in the case of CNNs, we approximate it with a linear function around M_0 by computing the first-order Taylor expansion:

$$ S_c(M) \approx w^T M + b \qquad (3.2) $$

where w is the derivative of S_c with respect to the matrix M, evaluated at M_0:

$$ w = \left. \frac{\partial S_c}{\partial M} \right|_{M_0} \qquad (3.3) $$

After that, a saliency map can be obtained by reshaping the elements of the vector w into the shape of the given matrix. The value of each element of w indicates the salience of the corresponding entry of an individual's genome to the final decision. Ideally, specific columns (i.e. segregating sites) will highlight the actual sites under selection, and the neighbors of the selected sites will display a signature of positive selection. Moreover, changing the order of the rows (i.e. individuals) should not significantly change the information provided by the saliency map (i.e. the highlighted columns should remain the same regardless of row order).
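The sketch below walks through this computation with an untrained stand-in model in place of a trained CNN; the per-site aggregation at the end is one plausible way to read off candidate selected sites.

    import torch
    import torch.nn as nn

    # Hypothetical saliency pass over an SNP matrix: the gradient of the
    # selection-class score w.r.t. the input (Eqs. 3.2-3.3) gives one
    # score per (individual, segregating site) entry.
    model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 36, 2))
    snp = torch.randint(0, 2, (1, 1, 8, 36)).float().requires_grad_(True)

    score = model(snp)[0, 1]                # S_c for the "selection" class
    score.backward()
    saliency = snp.grad.abs().squeeze()     # 8 x 36: one value per entry
    per_site = saliency.sum(dim=0)          # aggregate salience per site
    print(per_site.argmax())                # most salient segregating site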

Second, if we use the model proposed in section 3.1, which works with a 1D vector of the number of mutations at each segregating site, then what we need to rank is the saliency score of each segregating site over all individuals. In this form, the problem more closely resembles sentiment analysis in NLP (Li et al. 2016): in sentiment analysis, researchers measure which words are critical to the meaning of a sentence, while here we estimate which segregating sites contribute most to identifying natural selection. Similar formulas can be applied to calculate the vector w by taking the first derivative of the class score function, and reshaping the elements of w yields a saliency map with a height of 1.

Another method worth trying is to create heatmaps based on the convolution values of each feature map, as proposed by Koupaee and Wang (2018). This method is easy to implement compared to saliency maps, and may reveal more information about what the CNN is learning at each layer.

In general, saliency maps should be easier to implement than designing a new intrinsically interpretable CNN architecture. We look forward to seeing how this method performs.

4 Conclusion and Future Work

In conclusion, CNNs have many advantages in population genetic studies. They have a strong ability to work directly with raw genomic data and to accomplish tasks with high accuracy. They can automatically detect the most important characteristics of the data, so researchers can investigate genomic data even without a perfect theory. However, poor interpretability is the main drawback of this learning algorithm, and it hinders researchers from understanding what their models are actually learning from the genomic data. Given that the application of CNNs in population genetic studies, especially on natural selection problems, is still at an early stage, we looked into two techniques from other areas that may enhance the interpretability of CNN models.

The first is SincNet, an interpretable CNN architecture design, first introduced in the study of speech recognition and later adopted for brain activity analysis. This approach makes models easier to understand by limiting the number of learnable parameters with a set of band-pass filters in the first layer. We suggested that, in order to apply similar ideas to genomic data, we convert the data (i.e. SNP matrices) into 1D vectors by summing all the entries of each column. Although the exact mutation information of each individual is lost in this operation, we still preserve the ordering of the segregating sites (i.e. columns), which may signal an occurrence of positive selection. In addition, there are a few hyperparameters, such as the upper and lower limits used in the band-pass filters, that need to be investigated during the training process.

The second technique we looked into is saliency maps, which have been widely used in computer vision and are becoming popular in NLP. Saliency maps measure the importance of each input unit to the final decision by calculating the first-order derivative of the class score function. We proposed several ways to apply saliency maps, depending on the architecture of the CNN models used in our study and the format of the genomic data. We also encouraged trying a variant of saliency maps: drawing heatmaps from the convolution values of each feature map.

Although both methods look promising based on their performance in existing studies, they have never been applied in population genetic studies, so it is difficult to guarantee that their performance on genomic data will be optimal and meaningful. But this uncertainty is exactly what excites us about this research, and we look forward to future work.


References

Bach, Sebastian, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek (2015). "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation". In: PLoS ONE 10.7, e0130140.

Beaumont, Mark A, Wenyang Zhang, and David J Balding (2002). "Approximate Bayesian computation in population genetics". In: Genetics 162.4, pp. 2025–2035.

Borra, Davide, Silvia Fantozzi, and Elisa Magosso (2020). "Interpretable and lightweight convolutional neural network for EEG decoding: Application to movement execution and imagination". In: Neural Networks.

Chan, Jeffrey, Valerio Perrone, Jeffrey Spence, Paul Jenkins, Sara Mathieson, and Yun Song (2018). "A likelihood-free inference framework for population genetic data using exchangeable neural networks". In: Advances in Neural Information Processing Systems, pp. 8594–8605.

Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber (2012). "Multi-column deep neural networks for image classification". In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 3642–3649.

Daumé III, Hal (2012). A Course in Machine Learning. Self-published: http://ciml.info.

Elyashiv, Eyal, Shmuel Sattath, Tina T Hu, Alon Strutsovsky, Graham McVicker, Peter Andolfatto, Graham Coop, and Guy Sella (2016). "A genomic map of the effects of linked selection in Drosophila". In: PLoS Genetics 12.8, e1006130.

Ewing, Gregory and Joachim Hermisson (2010). "MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus". In: Bioinformatics 26.16, pp. 2064–2065.

Flagel, Lex, Yaniv Brandvain, and Daniel R Schrider (2019). "The unreasonable effectiveness of convolutional neural networks in population genetic inference". In: Molecular Biology and Evolution 36.2, pp. 220–238.

Garofolo, John S, L F Lamel, W M Fisher, Jonathan G Fiscus, D S Pallett, and Nancy L Dahlgren (1993). "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM (TIMIT)".

Gregory, T Ryan (2009). "Understanding natural selection: essential concepts and common misconceptions". In: Evolution: Education and Outreach 2.2, p. 156.

Hinch, Anjali G, Arti Tandon, Nick Patterson, Yunli Song, Nadin Rohland, Cameron D Palmer, Gary K Chen, Kai Wang, Sarah G Buxbaum, Ermeg L Akylbekova, et al. (2011). "The landscape of recombination in African Americans". In: Nature 476.7359, pp. 170–175.

Johnson, Rie and Tong Zhang (2015). "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks". In: HLT-NAACL.

Karpathy, Andrej et al. (2016). "CS231n: Convolutional Neural Networks for Visual Recognition". Course notes.

Koupaee, Mahnaz and William Yang Wang (2018). "Analyzing and Interpreting Convolutional Neural Networks in NLP". In: arXiv preprint arXiv:1810.09312.

Li, Jiwei, Xinlei Chen, Eduard H Hovy, and Dan Jurafsky (2016). "Visualizing and Understanding Neural Models in NLP". In: HLT-NAACL.

Ravanelli, Mirco and Yoshua Bengio (2018). "Interpretable convolutional filters with SincNet". In: arXiv preprint arXiv:1811.09725.

Ravanelli, Mirco, Luca Cristoforetti, Roberto Gretter, Marco Pellin, Alessandro Sosi, and Maurizio Omologo (2015). "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments". In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, pp. 275–282.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin (2016). ""Why should I trust you?" Explaining the predictions of any classifier". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.

Ronen, Roy, Nitin Udpa, Eran Halperin, and Vineet Bafna (2013). "Learning natural selection from the site frequency spectrum". In: Genetics 195.1, pp. 181–193.

Roy, Yannick, Hubert Banville, Isabela Albuquerque, Alexandre Gramfort, Tiago H Falk, and Jocelyn Faubert (2019). "Deep learning-based electroencephalography analysis: a systematic review". In: Journal of Neural Engineering 16.5, p. 051001.

Schrider, Daniel R and Andrew D Kern (2018). "Supervised machine learning for population genetics: a new paradigm". In: Trends in Genetics 34.4, pp. 301–312.

Sheehan, Sara and Yun S Song (2016). "Deep learning for population genetic inference". In: PLoS Computational Biology 12.3, e1004845.

Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman (2013). "Deep inside convolutional networks: Visualising image classification models and saliency maps". In: arXiv preprint arXiv:1312.6034.

Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts (2013). "Recursive deep models for semantic compositionality over a sentiment treebank". In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642.

Yamashita, Rikiya, Mizuho Nishio, Richard Kinh Gian Do, and Kaori Togashi (2018). "Convolutional neural networks: an overview and application in radiology". In: Insights into Imaging 9.4, pp. 611–629.

Yang, Melinda A, Xuechun Fan, Bo Sun, Chungyu Chen, Jianfeng Lang, Ying-Chin Ko, Chenghwa Tsang, Hunglin Chiu, Tianyi Wang, Qingchuan Bao, et al. (2020). "Ancient DNA indicates human population shifts and admixture in northern and southern China". In: Science.

Zaheer, Manzil, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola (2017). "Deep sets". In: Advances in Neural Information Processing Systems, pp. 3391–3401.