UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
INSTITUTO DE INFORMÁTICA
CURSO DE CIÊNCIA DA COMPUTAÇÃO

JOSÉ PEDRO SILVEIRA MARTINEZ

An Embedded Categorical Feature Selector applied to the Genotype-Phenotype Prediction Problem

Work presented in partial fulfillment of the requirements for the degree of Bachelor in Computer Science

Advisor: Prof. Dr. Márcio Dorn
Coadvisor: M.Sc. Bruno Iochins Grisci

Porto Alegre
December 2019

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
Reitor: Prof. Rui Vicente Oppermann
Vice-Reitora: Profa. Jane Fraga Tutikian
Pró-Reitor de Graduação: Prof. Wladimir Pinheiro do Nascimento
Diretora do Instituto de Informática: Profa. Carla Maria Dal Sasso Freitas
Coordenador do curso de Ciência da Computação: Prof. Sérgio Luis Cechin
Bibliotecária-chefe do Instituto de Informática: Beatriz Regina Bastos Haro

ACKNOWLEDGMENTS

I thank my advisors, Professor Márcio Dorn and Bruno Grisci, for introducing and discussing with me the theoretical aspects that supported this study, for making me believe I am capable of working as a scientist, and for highlighting stepping stones on the road of my own professional and personal improvement. To Dr. Eduardo Ávila for the attentive review of the work and the precise corrections concerning the biological aspects of the study. To my sister, Tatiana, for providing me emotional support and strengthening my spiritual linkage with our common ancestors. To my colleagues at Meerkat for supporting me financially and trusting that I am doing my best in spite of the new challenges I am facing. Last but far from least, thank you to the INCT Forense for providing the data analyzed in this study.

"The Road to Wisdom?
Well, it's plain
And simple to express:
Err and err and err again,
but less and less and less."

—PIET HEIN, THE ROAD TO WISDOM

ABSTRACT

One crucial preprocessing step in many machine learning algorithms is feature selection, since many prediction tasks are defined without knowledge of which attributes in the data are relevant to the given task. Evaluating all possible feature subsets is computationally intractable, and many heuristics for choosing sub-optimal attribute sets have been proposed and evaluated over the last decades. With the advance of data storage and collection technology, databases posing novel prediction tasks are filled with spurious attributes, which reinforces the need for general and computationally inexpensive feature selection algorithms. Eliminating redundant attributes can improve the predictive capability of machine learning models, while the discovery of relevant features has important scientific value in domains where knowledge about the relationship of collected data attributes is null or insufficient. This study proposes N3O-D, a novel feature selection algorithm based on neuroevolution and mutual information which automatically selects categorical attributes as it learns to solve a given learning task. The proposed method's classification and selection capabilities are experimentally evaluated on the genotype-phenotype prediction of eye and skin color. Experimental results showed that the method has the potential of improving classification performance obtained from state-of-the-art feature selection frameworks, achieving it on some of the evaluated data sets.

Keywords: Neuroevolution, feature selection, metaheuristics, genetic algorithm, neural networks, machine learning, mutual information, information theory.

Um Seletor de Atributos Categóricos baseado em Neuroevolução aplicado ao problema de Predição de Fenótipo a partir de Genótipo

RESUMO

Uma etapa crucial do pré-processamento de muitos algoritmos de aprendizado de máquina é a seleção de atributos, visto que muitas tarefas preditivas são definidas sem um conhecimento prévio de quais atributos presentes nos dados são de fato relevantes para o problema. A tarefa de avaliar todos os possíveis subconjuntos de atributos é computacionalmente intratável e heurísticas que propõem conjuntos de atributos sub-ótimos têm sido propostas e avaliadas nas últimas décadas. Com o avanço tecnológico da coleta e armazenamento de dados em larga escala, bancos de dados apresentam tarefas preditivas inéditas contendo atributos supérfluos, reforçando a carência de algoritmos de seleção de atributos genéricos e computacionalmente baratos. A eliminação de atributos redundantes é capaz de melhorar a capacidade preditiva de algoritmos de aprendizado de máquina, enquanto a descoberta de atributos relevantes tem um valor científico importante para domínios onde o conhecimento sobre a relação dos dados coletados é nulo ou insuficiente. Esta monografia propõe N3O-D, um seletor de atributos baseado em neuroevolução e informação mútua que automaticamente seleciona atributos categóricos enquanto aprende a resolver uma tarefa de aprendizado. O método proposto é avaliado experimentalmente na predição dos fenótipos cor de olho e cor de pele a partir de dados genotípicos. Resultados experimentais demonstraram que o método é capaz de superar a capacidade preditiva obtida a partir de métodos de seleção de atributos no estado da arte, atingindo o objetivo em alguns dos conjuntos de dados analisados.

Palavras-chave: neuroevolução, seleção de atributos, metaheurísticas, algoritmo genético, redes neurais, aprendizado de máquina, informação mútua, teoria da informação.

LIST OF FIGURES

Figure 2.1 Depiction of a NEAT individual's genome and phenotype
Figure 2.2 Depiction of competing conventions
Figure 2.3 Depiction of mutation Add Connection
Figure 2.4 Depiction of mutation Add Node
Figure 2.5 Depiction of crossover operator
Figure 2.6 Depiction of mutation Add Input
Figure 2.7 Depiction of mutation Swap Input
Figure 3.1 Depiction of a single nucleotide polymorphism
Figure 3.2 Depiction of distinct data derived from genomic samples
Figure 4.1 Depiction of mutation Add Input given one hot encoding of input nodes
Figure 4.2 Depiction of mutation Swap Input given one hot encoding of input nodes
Figure 4.3 Diagram depicting how the proposed method is applied to the genotype-phenotype problem
Figure 5.1 Picture of analyzed skin
Figure 5.2 Picture of analyzed eye phenotypes

LIST OF TABLES

Table 2.1 Example of confusion matrix
Table 5.1 Raw data examples
Table 5.2 Skin color class distribution
Table 5.3 Eye color class distribution
Table 5.4 Skin color data views
Table 5.5 Eye color data views
Table 5.6 Chosen hyper-parameters for the proposed method configuration
Table 5.7 Chosen hyper-parameters for SVM grid search optimization
Table 5.8 Evaluated feature selection sizes on skin color data views
Table 5.9 Evaluated feature selection sizes on eye color data views
Table 5.10 SNPs selected by UFS(MI) on skin color data
Table 5.11 SNPs selected by UFS(MI) on eye color data
Table 5.12 SNPs selected by N3O-D on skin color data
Table 5.13 SNPs selected by N3O-D on eye color data
Table 5.14 Classification performance of selected features on skin color data views
Table 5.15 Classification performance of selected features on eye color data views
Table 5.16 Confusion matrix for UFS(MI) skin color predictions
Table 5.17 Confusion matrix for UFS(MI) eye color predictions
Table 5.18 Confusion matrix for N3O-D skin color predictions
Table 5.19 Confusion matrix for N3O-D eye color predictions
Table 5.20 Classification performance of champion topologies on skin color data
Table 5.21 Classification performance of champion topologies on eye color data
Table 5.22 Input size of evolved topologies on skin color data
Table 5.23 Input size of evolved topologies on eye color data views
Table 5.24 Report of relevant selected SNPs
Table 5.25 Report of novel selected SNPs

LIST OF ABBREVIATIONS AND ACRONYMS

CNN Convolutional neural network

CV Cross-validation

DNA Deoxyribonucleic acid

FS Feature selection

FS-NEAT Feature selective neuroevolution of augmenting topologies

FN False negative

FP False positive

GA Genetic algorithm

INCT Institutos Nacionais de Ciência e Tecnologia

KW Kruskal-Wallis one-way analysis of variance

NN Neural network

MI Mutual information

mRMR Minimum redundancy - maximum relevance

N3O Three new operators

N3O-D Three new operators, adapted to discrete data

NEAT Neuroevolution of augmenting topologies

RBF Radial basis function

RFE Recursive feature elimination

RMSProp Root mean square propagation

RNN Recurrent neural network

SNP Single nucleotide polymorphism

STR Short tandem repeat

SVM Support vector machine

SVMRFE Support vector machine methods based on recursive feature elimination

TN True negative

TP True positive

UFS Univariate feature selection

UFS(MI) Univariate feature selection using mutual information as ranking criterion

XOR Exclusive OR (logical operator)

LIST OF SYMBOLS

⊎ Disjoint set union

e Euler number

tanh Hyperbolic tangent

|.|_1 L1-norm

log Logarithm

p(x, y) Joint probability distribution of random variables x and y

ln Natural logarithm

φ Neural network activation function

ψ Neural network aggregation function

Θ Neural network weights

θ Neural network weight

N Normal distribution

p(x) Marginal probability distribution of random variable x

max Maximum

λ Regularization Coefficient

∩ Set intersection

|.| Set length

∈ Set pertinency

∪ Set union

Σ Summation

var(.) Variance (statistical measure)

CONTENTS

1 INTRODUCTION
2 THEORETICAL BASIS
2.1 Genetic algorithms
2.2 Neural networks
2.2.1 Architectures
2.2.2 Optimization
2.3 Neuroevolution
2.3.1 Neuroevolution of augmenting topologies
2.3.2 Feature selective neuroevolution of augmenting topologies
2.3.3 Three new operators
2.4 k-nearest neighbors
2.5 Support vector machine
2.6 Mutual information
2.7 Minimum redundancy - maximum relevance
2.8 Softmax
2.9 Feature selection
2.10 Cross-validation
2.11 Confusion matrix
2.12 Cross-entropy
2.13 Chapter conclusion
3 THE HUMAN PHENOTYPE PREDICTION PROBLEM
3.1 Genomic selection
3.2 Chapter conclusion
4 THE PROPOSED METHOD
4.1 Encoding
4.2 Pipeline example
4.3 Chapter conclusion
5 EXPERIMENTS AND RESULTS
5.1 Data
5.1.1 Imputation
5.1.2 Data views
5.2 Feature selection quality assessment
5.3 Baseline accuracy
5.4 Proposed method configuration
5.4.1 Fitness function
5.4.2 Activation and aggregation functions
5.4.3 Hyper-parameters
5.5 Results and discussion
5.6 Chapter conclusion
6 CONCLUSION
REFERENCES

1 INTRODUCTION

Most predictive models assume that all of the specified attributes are useful to the predictive task, although synthesized and real-world problems might present inputs that are either not correlated to the learning objective (irrelevant) or very similar to other inputs (redundant). Including irrelevant features increases model induction and prediction computational cost and might harm the model's inference capability (LAZAR et al., 2012; AGGARWAL et al., 2014; MIAO; NIU, 2016; CILIA et al., 2019). On most posed tasks, the relevant attributes are unknown and feature selection has the potential of discovering knowledge in the gathered data. Genomic selection results, for instance, indicate which parts of the genome influence phenotype characterization (YOUNG; BENONISDOTTIR; PRZEWORSKI, 2019) and disease treatment response (GRISCI; FELTES; DORN, 2018; LU; CHEN; YAN, 2017). On real-world tasks, collecting, maintaining and providing input data presents an economic cost to be minimized, as well as stricter performance requirements.

An example of a prediction problem in the biology field is genotype-phenotype mapping (YOUNG; BENONISDOTTIR; PRZEWORSKI, 2019), in which specific characteristics of a given individual are to be predicted from its genetic encoding. Since genomes yield a lot of data, which is mostly uninformative given a particular phenotype, feature selection (also referred to as genomic selection, given the nature of the input data) is a natural application to this kind of problem. Data collection techniques, such as those employed to obtain single nucleotide polymorphisms (SNPs) and short tandem repeats (STRs), already focus on more informative parts of the genome; nevertheless, more aggressive filtering on previously collected data is required by most tasks in the biological world. A more detailed discussion of the genotype-phenotype mapping problem is presented in Chapter 3.

While much effort has been put into automatically finding optimal feature sets of machine learning data sets, current results are still unsatisfactory, especially for domains where knowledge about data is shallow (ANG et al., 2016; FELTES et al., 2019). Additional challenges include inherent noise in data distribution and strict assumptions made by feature selection algorithms (LAZAR et al., 2012; AGGARWAL et al., 2014). In biological areas, the discovery of relevant attributes may provide new insights into already established biological knowledge and can inspire more efficient data collection and storage techniques. Feature selective algorithms based on neuroevolution recently presented promising results on reinforcement learning and classification tasks (WHITESON et al., 2005; TAN et al., 2009; SOHANGIR; RAHIMI; GUPTA, 2013; SOHANGIR; RAHIMI; GUPTA, 2014; GRISCI; FELTES; DORN, 2018). In particular, Grisci, Feltes and Dorn (2018) showed that the statistical relationship between attributes is an informative guide to evolution's stochasticity. This work was evaluated on microarray (gene expression) data sets labeled with different cancer conditions. The method could successfully select around 200 informative genes (roughly half of them already discussed in previous research) over multiple runs on distinct databases with a large number of features (typically tens of thousands of features and hundreds of samples). Each experiment constructed neural network populations in which the best performing individual utilized 10 genes on average.
The present study introduces a feature selection algorithm based on neuroevolution that uses mutual information (COVER; THOMAS, 2006; BUNNEY et al., 2017; VERGARA; ESTÉVEZ, 2015) to guide the inclusion of categorical attributes into neural networks that are evolved to optimize prediction accuracy concerning a classification task. The proposed method is evaluated on the genotype-phenotype prediction task of inferring the eye and skin color of an individual given its SNP markers.

This work's objective is to assess the proposed method's genomic selection capability, as well as its ability to solve the genotype-phenotype prediction problem on its own. To assess the proposed method's inference capability, its classification performance is compared to the baseline accuracy. The method's selection quality is investigated by comparing it to a univariate feature selection algorithm, as well as to the selection retrieved by its algorithmic basis, which is also powered by neuroevolution.

The present study is organized in the following manner: Chapter 2 reviews the theoretical foundations of the study; Chapter 3 explains the genotype-phenotype prediction problem. The proposed method is presented in Chapter 4. Chapter 5 describes how gene selection quality is assessed, presents the data set used in the experiments and discusses the results. Chapter 6 relates the research objectives with the obtained results, outlining the work's contributions.

2 THEORETICAL BASIS

This chapter presents computational methods and theoretical definitions that are essential to the present work.

2.1 Genetic algorithms

Genetic algorithms (GAs) are optimization methods inspired by the biological process of evolution. They are essential to this work because the employed neuroevolution methods are supported by a GA. A broader review of the core operators and parameters of GAs and their extensions is given by Beyer and Schwefel (2002). The ones most crucial to this work are detailed next.

A genetic algorithm evolves a population of individuals throughout generations. Individuals represent solutions to the optimization problem and are encoded such that they can be combined to yield offspring individuals, through an operation referred to as crossover. Individuals might also evolve in a stochastic manner through an operation called mutation. Mutations are applied randomly each generation; their change on an individual's encoding is also random and is scaled by the hyper-parameter mutation power. The capability of a given individual to solve the task at hand is assessed by a pre-defined fitness function. The population evolves as individuals are mutated and offspring replace less fit parents.

Optimization occurs by applying selective pressure on the most fit individuals. A common and efficient example is elitism, which copies the most fit individuals of a generation to the next one unchanged. Although the fitness of an individual is an informative guide to optimization, it often leads to sub-optimal solutions. The employment of randomness in diverse operations is the strategy adopted by GAs to avoid local minima.

Next, a pseudo-code describing the implementation of a simple GA is presented. In the described evolution, |P| and G are the chosen population size and number of generations, respectively, P_i represents the population at the i-th generation, g denotes the current evolution generation, M_i represents the mating pairs computed for generation i, and O_i is the set of individuals generated by the application of crossover on the computed mating pairs. For readability, essential operations are abstracted as functions: marriage summarizes the computation of mating pairs (which might consider speciation), mating and mutation respectively abstract the application of the crossover and mutation operators on computed mating pairs and offspring individuals, and selection abstracts how individuals are passed onto the next generation, a process that usually involves fitness assessment and elitism; the disjoint set union is used for denoting which individuals are selected, as the new population might be composed of offspring only. A more general procedure is presented in Beyer and Schwefel (2002).

Algorithm 1: General structure of a genetic algorithm.
Data: |P|, G
Result: P_G
  P_0 ← {new_individual(), ...}          ▷ |P_0| = |P|
  g ← 0
  while g < G do
      M_g ← marriage(P_g)
      O_g ← mating(M_g)
      O'_g ← mutation(O_g)
      P_{g+1} ← selection(P_g ⊎ O'_g)    ▷ |P_{g+1}| = |P|
      g ← g + 1
  end
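The pseudo-code above can be made concrete with a few lines of Python. The sketch below is a hedged illustration only: the toy fitness function, the constants, and the specific crossover, mutation, and elitist selection choices are assumptions made for the example, not part of the framework described in this section.

```python
# Minimal sketch of Algorithm 1; all names mirror the pseudo-code above and
# the concrete operator choices are illustrative assumptions.
import random

POP_SIZE, GENERATIONS, GENOME_LEN, MUT_POWER = 20, 50, 10, 0.5

def new_individual():
    return [random.uniform(-1, 1) for _ in range(GENOME_LEN)]

def fitness(ind):                      # toy objective: maximize -sum(x^2)
    return -sum(x * x for x in ind)

def marriage(pop):                     # random mating pairs
    return [(random.choice(pop), random.choice(pop)) for _ in pop]

def mating(pairs):                     # one-point crossover
    offspring = []
    for a, b in pairs:
        cut = random.randrange(1, GENOME_LEN)
        offspring.append(a[:cut] + b[cut:])
    return offspring

def mutation(offspring):               # Gaussian perturbation scaled by the mutation power
    return [[x + random.gauss(0, MUT_POWER) if random.random() < 0.1 else x
             for x in ind] for ind in offspring]

def selection(candidates):             # elitism over the union: keep the |P| fittest
    return sorted(candidates, key=fitness, reverse=True)[:POP_SIZE]

population = [new_individual() for _ in range(POP_SIZE)]
for g in range(GENERATIONS):
    pairs = marriage(population)
    offspring = mutation(mating(pairs))
    population = selection(population + offspring)

print("best fitness:", fitness(population[0]))
```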

2.2 Neural networks

Neural networks (NNs) are universal function approximators that map vectors from R^i to R^o, where i and o are the numbers of network input and output values, respectively (HORNIK, 1991; QU; WANG, 2019). NNs utilize neurons as an atomic processing unit. Neurons are connected through directed connections that individually weight the contribution of input signals to compute a single activation output. Weighted input values are aggregated and passed to an activation function. Equation 2.1 defines such a process in a general way, where ψ and φ are respectively the aggregation and activation functions, I is the set of input nodes, and W : I → R and V : I → R are functions that map input nodes to real-valued numbers, respectively representing connection weights and input activation values. Equation 2.2 describes the same operation for the case where the average and the identity are respectively chosen as aggregation and activation functions, with w_i = W(x_i) and a_i = V(x_i).

φ(ψ(I, W))    (2.1)

\frac{\sum_{x_i \in I} a_i w_i}{|I|}    (2.2)

Networks applied to learning tasks mainly differ in their architecture, which specifies how neurons are connected. A minimal code sketch of the neuron computation is given below, followed by a short overview of commonly employed NN architectures.
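The small Python function below is a hedged illustration of Equations 2.1 and 2.2: weighted inputs are aggregated and passed through an activation. The averaging aggregation, the tanh activation, and all names are arbitrary choices made for the example.

```python
# Illustrative neuron computation: aggregate weighted inputs, then activate.
import math

def neuron_output(activations, weights, bias=0.0,
                  aggregate=lambda total, n: total / n, activate=math.tanh):
    # activations: list of a_i = V(x_i); weights: list of w_i = W(x_i)
    weighted = [a * w for a, w in zip(activations, weights)]
    aggregated = aggregate(sum(weighted), len(weighted)) + bias
    return activate(aggregated)

print(neuron_output([0.5, -1.0, 0.25], [0.8, 0.1, -0.4]))
```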

2.2.1 Architectures

Feed-forward neural networks: Also referred to as multi-layered perceptrons, feed-forward neural networks have their neurons organized in ordered layers. Neurons from a given layer l are connected to (typically all) neurons from layers l + 1 and l − 1, except for the input and output layers, which are respectively connected to their successor and predecessor layer (RUCK et al., 1990).

Convolutional neural networks (CNNs): CNNs differ from conventional neural networks as they receive input values containing structure that is exploited by the network's aggregation function. Typical choices for aggregation on CNNs are convolutions that utilize spatial relations on the input. Convolutional neural networks have been successfully applied in Computer Vision on tasks such as object detection, face recognition, and depth estimation (DRUZHKOV; KUSTIKOVA, 2016).

Recurrent neural networks (RNNs): Recurrent neural networks aim at processing sequential data. This is achieved by creating recurrent connections within the network's topology. The activation values for a given input sequence element x_i are stored in memory gates, which in turn are passed as input to the network along with the next input, x_{i+1}. Therefore the network output for a sequence element depends on the sub-sequence that precedes it, although one element is processed at a time. RNNs have been applied with success to acoustic modeling and natural language processing (MULDER; BETHARD; MOENS, 2015).

2.2.2 Optimization

Besides the network architecture and the neuron aggregation and activation functions, there are other hyper-parameters, such as the number of neurons and layers. Neural network topologies are commonly designed by humans, while connection weights are optimized by error propagation algorithms, such as backpropagation and RMSProp (RUMELHART; HINTON; WILLIAM, 1986; TIELEMAN; HINTON, 2012). In neuroevolution (introduced next in Section 2.3), evolutionary strategies are employed to optimize such hyper-parameters, and some attempt to evolve topology connections and weights simultaneously (STANLEY; MIIKKULAINEN, 2002; WHITESON et al., 2005). Neural networks are commonly applied to classification tasks (ZHANG, 2000), and their application also extends to reinforcement learning tasks.

2.3 Neuroevolution

Neuroevolution is an area of evolutionary computation that employs evolutionary algorithms to discover and optimize neural networks (NNs) concerning a predictive task. Methods found in the literature differ on how neural networks are genetically encoded and on which operators are used to improve already found solutions (STANLEY; CLUNE; LEHMAN, 2019; FLOREANO; DüRR; MATTIUSSI, 2008; DUFOURQ; BASSET, 2017; XIE; YUILLE, 2017). Genetic algorithms are defined and implemented in a very abstract manner (LUKE, 2013; BEYER; SCHWEFEL, 2002); neuroevolution naturally inherits this flexibility.

Neuroevolution is capable of relaxing restrictions common to usual neural network optimization methods. In principle, neuroevolution algorithms do not require the NN's activation and loss functions to be differentiable, in contrast to backpropagation-based optimization methods (RUMELHART; HINTON; WILLIAM, 1986; TIELEMAN; HINTON, 2012). It is also possible to evolve network topologies (detailed when discussing NEAT in Section 2.3.1), which is a harder problem as it increases the size of the search space. A main drawback of neuroevolution when compared to usual neural network optimization methods is the necessity of iterating over all training data for a single individual's fitness update, although recent work suggests this burden may be diminished (MORSE; STANLEY, 2016).

2.3.1 Neuroevolution of augmenting topologies

As discussed above, neuroevolution is a family of methods that applies concepts of evolutionary computation to optimize neural networks. Neuroevolution of augmenting topologies (NEAT) (STANLEY; MIIKKULAINEN, 2002) is a neuroevolution method that is capable of evolving network weights and topologies. NEAT defines mutation and crossover operators on top of a genetic algorithm framework.

An advantage of automatically evolving topologies is that humans do not need to know in advance which NN architecture is best suited for a given task. For instance, empirical evidence showed that NEAT can discover recurrent neural network architectures from a simple feed-forward topology on tasks where recurrence was known to be a desirable property (WHITESON et al., 2005; TAN et al., 2009).

NEAT has been validated on abstract problems such as the computation of the non-linearly separable logical function XOR and the reinforcement learning task of double pole balancing (STANLEY; MIIKKULAINEN, 2002). Its feature selective variations have been successfully applied to supervised (GRISCI; FELTES; DORN, 2018; SOHANGIR; RAHIMI; GUPTA, 2013; SOHANGIR; RAHIMI; GUPTA, 2014) and reinforcement learning tasks (WHITESON et al., 2005). The rest of this section details NEAT's most important extensions to a regular genetic algorithm, as well as specifies its mutation and crossover operators.

Genetic encoding: A NEAT individual is directly represented by a list of nodes and a list of edges that represent its topology, which denotes the individual's phenotype in the context of evolutionary algorithms. In computational terms, genomes are directed acyclic graph representations. Given an individual I, its genome g may be partitioned into g^nodes ∪ g^edges. Such a partition is important because node genes possess different attributes than connection genes. While a node gene is encoded only by its type (input, hidden or output) and by its bias weight, an edge gene is encoded by the two nodes it connects, its connection weight, and whether it is enabled. A disabled connection gene is ignored in the individual's phenotypical representation.

Topological innovation via speciation: Although evolution strategies yield a vast toolbox for extending the simplest genetic algorithms (see Beyer and Schwefel (2002) for a wider review), NEAT relies mainly on speciation. As a given topology evolves, it needs mutation steps to fine-tune new weights. During this process, it is unfair that a unique, new promising individual must compete with more straightforward but already tuned individuals. Such a problem is approached with speciation and fitness sharing, where individuals from different species do not compete for offspring, and individuals from the same species influence each other's fitness. To group individuals into species, the competing conventions problem and an individuals' difference function must be addressed.

Figure 2.1: Depiction of a NEAT individual’s genome and phenotype.

Historical innovation markers: An issue common to neuroevolution algorithms that evolve topologies is the competing conventions problem. Individuals with apparently distinct genomes might produce the same output. The simplest case, when an individual's genome is a permutation of the other, is depicted in Figure 2.2. Although the depicted networks compute the same function, a naive representation might be unable to reveal their similarities in reasonable computation time. Such a distinction is important because individuals must be distinguished by their behavior, not by their genome.

NEAT handles the competing conventions problem by attributing a global innovation number to newly added genome elements (topology nodes or connections). These innovation markers induce a natural order over individuals' genes, allowing the construction of a difference function able to distinguish between matching, exceeding, and disjoint genes. This way, mutated individuals will surely present an encoding similar to their prior, and offspring receive an encoding similar to their parents. Innovation markers thus support the definition of a difference function between two arbitrary individuals, which is detailed next.

Figure 2.2: Depiction of competing conventions. Circles with loose incoming edges represent input nodes. Circles with loose outgoing edges represent output nodes. Numbers represent bias and connection weights.

Individuals difference: Given that genes carry historical markers, two different genomes are easily comparable. Given a genome g_1, its genes might be partitioned into g_1 = g_1^M(g_2) ∪ g_1^E(g_2) ∪ g_1^D(g_2) with respect to another individual's genome g_2. Matching genes g_1^M(g_2) are the genes in g_1 whose historical marker is found in g_2. Exceeding genes g_1^E(g_2) are the genes in g_1 that are more recent than all of those in g_2, while disjoint genes g_1^D(g_2) are the genes in g_1 whose historical marker is not present in g_2 but is older than the newest gene in g_2. This genome partition is explicitly used by the crossover operator and the difference function. The genome difference of two individuals I_1, I_2, with genomes g_1 and g_2 respectively, is defined by the following equations, where E and D represent the sets of

exceeding and disjoint genes of g_1 with respect to g_2, respectively, and Ŵ denotes the average weight difference of the matching genes.

d(g_1, g_2) = \frac{c_1 |E| + c_2 |D|}{\max\{|g_1|, |g_2|\}} + c_3 \hat{W}    (2.3)

D(I_1, I_2) = d(g_1^{nodes}, g_2^{nodes}) + d(g_1^{edges}, g_2^{edges})    (2.4)

The coefficients c1, c2, c3 balance the importance of evolved topologies and weights.
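A minimal Python sketch of Equations 2.3 and 2.4 follows, assuming each genome part is represented as a dictionary mapping innovation markers to weights; the coefficient values and the dictionary representation are assumptions made for this example, not the encoding used by NEAT implementations.

```python
# Illustrative compatibility distance between two genomes (Eq. 2.3-2.4).
def part_distance(g1, g2, c1=1.0, c2=1.0, c3=0.4):
    newest_g2 = max(g2) if g2 else 0
    matching = g1.keys() & g2.keys()
    excess   = [m for m in g1 if m not in g2 and m > newest_g2]
    disjoint = [m for m in g1 if m not in g2 and m <= newest_g2]
    # W_hat: mean weight difference over matching genes
    w_hat = (sum(abs(g1[m] - g2[m]) for m in matching) / len(matching)
             if matching else 0.0)
    longest = max(len(g1), len(g2), 1)
    return (c1 * len(excess) + c2 * len(disjoint)) / longest + c3 * w_hat

def genome_distance(ind1, ind2):
    return (part_distance(ind1["nodes"], ind2["nodes"])
            + part_distance(ind1["edges"], ind2["edges"]))

i1 = {"nodes": {1: 0.1, 2: -0.3}, "edges": {3: 0.5, 5: 0.2}}
i2 = {"nodes": {1: 0.4, 2: -0.3}, "edges": {3: 0.5, 4: -0.1}}
print(genome_distance(i1, i2))
```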

In a scenario where c_1 = c_2, the difference function is symmetric. Larger values of c_3 enable a more refined distinction between individuals that learned to approximate different functions with similar topologies.

Mutation Change Weight: There is a chance that a given node's bias or a connection's weight is perturbed during optimization. Such an alteration follows the normal distribution and is scaled by the mutation power employed in the evolution. This operator is responsible for fine-tuning topological innovations, given that other mutations only update weight values when new elements are added to the topology.

Mutation Add Connection: A new connection might be created between two unconnected nodes. When the aim is to evolve a feed-forward network (which is the interest of this study), the addition of connections that create cycles within the topology is prohibited.

Figure 2.3: Depiction of mutation Add Connection. Two unconnected nodes become connected by the addition of an edge in the topology. On the left, yellow-colored ellipses represent the nodes that become connected by the dashed arrow on the right.

Mutation Add Node: An existing connection within the topology might be disabled in order to make room for a novel hidden node, connected to the nodes that composed the target edge. The connection incoming to the new node receives a weight of 1, while the outgoing edge receives the same weight as the disabled one.

Figure 2.4: Depiction of mutation Add Node. An existing connection is split by the inclusion of a hidden node in the topology. On the left, the yellow-colored arrow represents the split edge. On the right, the green-colored node represents the node added by the mutation, the dashed arrow indicates that the split connection is disabled, and the yellow-colored arrow denotes that the new gene receives the same weight as the split connection.

Crossover: Given two individuals I_1, I_2, where fitness(I_1) > fitness(I_2), the crossover operator specifies how their offspring are created. Matching genes are inherited randomly, while disjoint and exceeding genes are inherited from I_1. In the case fitness(I_1) = fitness(I_2), the new individual inherits genes from both its parents. NEAT also specifies that an inherited disabled connection gene might become active in case it is enabled in one of the parents, which turns the crossover operator into a stochastic process. Figure 2.5 depicts a crossover scenario and a possible outcome.

Fitness Sharing: Competition is avoided by allocating a fraction of the offspring population to each of the computed species, proportional to the fitness contribution held by the species. Individuals' fitness values are transformed by Equation 2.5 to compensate for the lack of diversity caused by large species, where f_i and f'_i respectively represent the original and adjusted fitness of individual i and S(i) is the set of individuals which were attributed the same species as i. The number of offspring individuals generated by a particular species is proportional to the sum of the adjusted fitness of its individuals.

f'_i = \frac{f_i}{|S(i)|}    (2.5)

2.3.2 Feature selective neuroevolution of augmenting topologies

NEAT assumes all of the available input is relevant to the given task, since every input node is present in new individuals and its operators do not support the exclusion of nodes or edges. Although evolution can nullify the weights of input nodes, such a phenomenon is not explicitly guided by the employed optimization and therefore consumes unnecessary compute resources. Feature selective neuroevolution of augmenting topologies (FS-NEAT) modifies how novel individuals are created by including only one of the possible input nodes. To evolve networks capable of combining information from multiple inputs, a new mutation, Add Input, is added to the genetic algorithm that powers evolution.

With the modifications above, FS-NEAT automatically performs feature selection as it searches for networks with excellent prediction performance concerning the given task. Intuitively, included irrelevant or spurious features will not yield any performance gain. Consequently, individuals mutated this way are not likely to survive in the next generations.

Mutation Add Input: There is a stochastic event in which an arbitrary input node, not present in an individual's genome, becomes connected to a hidden or output node present in the given individual's topology. The new connection's properties are the same as in the Add Connection mutation. Such a mutation is necessary due to FS-NEAT's minimalist start. Topologies containing input nodes that do not contribute to the given task are not likely to survive, therefore only individuals with relevant inputs persist throughout evolution.

2.3.3 Three new operators

Grisci, Feltes and Dorn (2018) proposed three core modifications to FS-NEAT in an algorithm named after the introduced alterations, N3O. The proposed modifications were motivated by the large number of features in the data approached by the method. The same work demonstrated that the proposed alterations improve the exploration of the search space. The method relies on statistics extracted from numerical training data in order to better guide the evolutionary process. The requirement of continuous data posed by the employed statistics motivated the proposal of a novel feature selection method based on neuroevolution, able to provide similar guidance on categorical data, which is presented in Chapter 4. Since N3O is a core basis for the method proposed in this study, its extensions to FS-NEAT are detailed in the next sections.

Input Relevance Assessment: One main improvement of N3O over FS-NEAT is that it incorporates information about input features into evolution. In Grisci, Feltes and Dorn (2018), this is done through the Kruskal-Wallis one-way analysis of variance (KW) (KRUSKAL; WALLIS, 1952). KW tests whether ranked samples from multiple groups G belong to the same statistical distribution. The test is defined by Equation 2.6, where G denotes the sample groups to be tested, N represents the total number of samples, r_x denotes the rank attributed to a given sample x, while r̄_g and r̄ are defined by Equations 2.7 and 2.8, respectively. The test outputs p-values representing the probability of groups in G belonging to equal distributions. Being non-parametric is one important characteristic of KW, as it does not assume that input samples belong to a given distribution. A drawback of the test is that it is designed for real-valued numbers, being inapplicable to categorical data.

H(G) = (N - 1) \frac{\sum_{g \in G} |g| (\bar{r}_g - \bar{r})^2}{\sum_{g \in G} \sum_{x \in g} (r_x - \bar{r})^2}    (2.6)

\bar{r}_g = \frac{\sum_{x \in g} r_x}{|g|}    (2.7)

\bar{r} = \frac{N + 1}{2}    (2.8)

Statistical Filtering: Input features that present a non-negligible probability of belonging to the same distribution as the target class are excluded from training data before optimization begins. This processing step prior to training was motivated by the large number of features in the data on which the method was evaluated. Such filtering might be interpreted as a filter feature selection algorithm, as described in Section 2.9.

Mutation Guided Add Input: p-values output by KW are transformed by −log10(·) and normalized into a probability distribution over input features by the softmax function (Equation 2.13). Such a distribution is used to sample disconnected input nodes when choosing a feature to be included in the topology by the Add Input mutation (a small code sketch of this sampling is given at the end of this section).

Mutation Swap Input: Another mutation proposed in N3O is the Swap Input, which attaches the neighbors of an input node present in an individual's topology to a disconnected input. Such a step is a random and unbiased manner of exploring a part of the search space that is not guided by the Kruskal-Wallis one-way analysis of variance, and is depicted in Figure 2.7.

Crossover Modification: Originally, in the NEAT crossover, resulting offspring do not inherit excess genes from the least fit parent. Grisci, Feltes and Dorn (2018) propose that the least fit parent's input nodes that are connected to nodes present in the fittest parent's topology have a chance to be inherited during the crossover. This alteration is motivated by the intuition that combining the feature selections of multiple promising networks yields a better selection. The restriction to inputs that fit the usually inherited genome is principled, as it minimizes the topological extension of the most fit parent.
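The sketch below illustrates, under assumptions, how the Guided Add Input sampling distribution could be computed: p-values are mapped through −log10 and normalized with softmax, and a disconnected input is drawn from the resulting distribution. The p-value array and the use of NumPy are illustrative only; this is not the N3O implementation.

```python
# Hypothetical sketch of the Guided Add Input sampling in N3O.
import numpy as np

p_values = np.array([1e-6, 0.03, 0.5, 0.9])    # one KW p-value per unused feature
scores = -np.log10(p_values)                   # smaller p-value -> larger score
scores = scores - scores.max()                 # stabilize the exponentials
probabilities = np.exp(scores) / np.exp(scores).sum()

rng = np.random.default_rng(0)
chosen_feature = rng.choice(len(p_values), p=probabilities)
print(probabilities, chosen_feature)
```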

2.4 k-nearest neighbors

The k-nearest neighbors algorithm (KNN) is a computational method that outputs, given a query sample x_i ∈ X, a set of samples Y representing the k points in X most similar to x_i, where x_i ∉ Y. The method is parametrized by the integer k, a set of data points X, and a similarity function f : X^2 → R which compares samples from X. Due to the flexibility of the similarity function's definition, the method can handle both numerical and categorical data points. KNN is commonly applied to classification tasks (NOI; KAPPAS, 2017; HAN; KARYPIS; KUMAR, 2001), where an unlabeled sample is queried and its labeled nearest neighbors are considered to infer the queried sample's class. The algorithm is also employed in data imputation, in which a sample containing missing or invalid values is queried, and the values of its neighbors are used to estimate the attributes that are not valid (BATISTA; MONARD, 2003).
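A hedged sketch of a majority-vote KNN classifier over categorical samples follows, using the Hamming distance as the (dis)similarity function; the data, the distance choice, and the function names are assumptions for illustration only.

```python
# Minimal majority-vote KNN over categorical tuples.
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, samples, labels, k=3, distance=hamming):
    ranked = sorted(range(len(samples)), key=lambda i: distance(query, samples[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

X = [("A", "G"), ("A", "A"), ("G", "G"), ("A", "G")]
y = ["blue", "brown", "brown", "blue"]
print(knn_predict(("A", "G"), X, y, k=3))
```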

2.5 Support vector machine

Support vector machine (SVM) is a supervised machine learning algorithm suited for both classification and regression tasks (CORTES; VAPNIK, 1992). It follows the assumption that training data is well separable by hyper-planes residing in the feature space. The method supports the application of non-linear transformations (kernels) to training data to represent the original features in a more separable space. SVM naturally handles binary classification and only approaches multi-class classification tasks given algorithmic extensions, such as training an ensemble where each classifier separates a particular class from the others.

Among state-of-the-art classifiers, SVM is computationally inexpensive (in particular when compared to deep neural networks) and presents low generalization error on a broad variety of data sets. Nevertheless, its parametrization is very sensitive, including the kernel choice. SVM is essential to this study as it is the classifier that assesses feature selection quality in the experimental evaluation.
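As an illustration, the scikit-learn sketch below fits an RBF-kernel SVM tuned by grid search over C and gamma, the kind of classifier later used to assess selection quality; the data set and the parameter grid are placeholders and do not correspond to the values reported in Chapter 5.

```python
# Hedged example of an RBF-SVM with grid-searched hyper-parameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```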

2.6 Mutual information

Mutual information (MI) is a measure of dependency between categorical variables, defined within information theory (SHANNON, 1948; STEUER et al., 2002; GRAY, 1990). MI (Equation 2.9) is defined in terms of the marginal and joint probabilities of the analyzed variables.

Given two discrete variables X, Y able to respectively assume values in V_X and V_Y, the mutual information between X and Y is defined as:

I(X, Y) = \sum_{x \in V_X} \sum_{y \in V_Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}    (2.9)

When X and Y are independent, I(X, Y) = 0. Higher values of I(X, Y) indicate a stronger dependency between the input variables. Mutual information has been utilized in feature selection methods as a strategy for handling categorical inputs (VERGARA; ESTÉVEZ, 2015; BUNNEY et al., 2017). It has also been successfully applied to discretized noisy continuous data (DING; PENG, 2003). The present study adopts MI for assessing input feature relevance concerning the target class, as well as for measuring the redundancy of input attributes with respect to each other. Such a strategy is presented in the next section.
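A small Python sketch of Equation 2.9 follows, estimating the probabilities from joint counts of two categorical sequences; the example genotype and phenotype values are made up for illustration.

```python
# Empirical mutual information between two categorical variables (Eq. 2.9).
import math
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

genotype = ["AA", "AG", "GG", "AA", "AG", "GG"]
phenotype = ["blue", "blue", "brown", "blue", "brown", "brown"]
print(mutual_information(genotype, phenotype))
```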

2.7 Minimum redundancy - maximum relevance

Minimum redundancy - maximum relevance (mRMR) (DING; PENG, 2003) is a sequential forward feature selection algorithm that ranks and greedily selects a pre-defined number of attributes, employing criteria based on mutual information (presented in Section 2.6). Let S be the set of features already selected by the algorithm and let h denote the target attribute. The redundancy and relevance assessments of the selection set S are respectively defined by the following equations.

W(S) = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)    (2.10)

V(S) = \frac{1}{|S|} \sum_{i \in S} I(h, i)    (2.11)

One way of composing the redundancy and relevance measurements into a single criterion (to be maximized) is the quotient composition, defined by Equation 2.12. Ding and Peng (2003) presented evidence that such a composition outperforms others, such as the difference, on a cancer biomarker selection data set.

\frac{V(S)}{W(S)}    (2.12)

The search over all selection sets S is computationally intractable, as it is an NP-complete problem, so a sequential forward feature selection method is proposed on top of the aforementioned criterion. Starting from an empty set of selected features S and the set of all features Ω, mRMR employs a greedy strategy to select among the remaining features

Ω_S = Ω − S. The attribute that maximizes the chosen criterion composition is added to S. The selection terminates when the pre-defined number of features to be chosen has been inserted into S. Since the redundancy assessment is defined as a function of S, it has to be updated every time a new feature is selected. The whole selection is computed in O(|S|N) steps, N being the number of samples used to infer the relationship between attributes.
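A hedged sketch of the greedy mRMR loop using the quotient criterion (Equation 2.12) is given below; it relies on scikit-learn's mutual_info_score for the MI estimates, and the toy data columns are illustrative only — this is not the reference mRMR implementation.

```python
# Greedy mRMR selection over categorical feature columns (quotient criterion).
from sklearn.metrics import mutual_info_score

def mrmr(features, target, n_select):
    selected = []
    remaining = list(range(len(features)))
    relevance = [mutual_info_score(col, target) for col in features]   # V terms
    while remaining and len(selected) < n_select:
        def score(i):
            if not selected:
                return relevance[i]
            w = sum(mutual_info_score(features[i], features[j])
                    for j in selected) / len(selected)                  # W term
            return relevance[i] / w if w > 0 else float("inf")
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

cols = [["AA", "AG", "GG", "AA"], ["CC", "CT", "CC", "TT"], ["AA", "AG", "GG", "AA"]]
target = ["dark", "dark", "light", "light"]
print(mrmr(cols, target, n_select=2))
```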

2.8 Softmax

The softmax function is a mathematical transformation that normalizes a vector of real-valued numbers into a probability distribution, and is defined by Equation 2.13, where X = (X_1, ..., X_n) is the vector to be transformed and X' its normalized output. Such an operation is important, for instance, to transform a vector of ordered scores into a vector of the same size containing non-negative elements whose L1-norm is 1. The resulting elements can therefore be interpreted as probabilities and the whole vector as a probability distribution.

X'_i = \frac{e^{X_i}}{\sum_{j \in [n]} e^{X_j}}    (2.13)
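A short, numerically stable sketch of Equation 2.13 follows; subtracting the maximum score before exponentiation does not change the result but avoids overflow.

```python
# Numerically stable softmax over a vector of scores.
import numpy as np

def softmax(x):
    e = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]), softmax([2.0, 1.0, 0.1]).sum())
```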

2.9 Feature selection

Feature selection (FS) is part of dimension reduction, one of the data preprocessing tasks that are essential to the construction of machine learning models. While some models, such as Decision Trees, measure feature relevance and use it as an induction criterion, most machine learning models assume all features are relevant to the given task. In both cases, the model's induction computational cost increases significantly with the number of features. Spurious or irrelevant attributes might also harm the model's prediction ability, as there is evidence that the same learning algorithm achieves better performance by inducting on a subset of the same data where some of the original features are discarded (UTANS; MODDY; REHFUSS, 1995; DASH; LIU, 1997; CILIA et al., 2019).

Lazar et al. (2012) divide feature selection algorithms into four categories: filter, wrapper, embedded, and ensemble. Ang et al. (2016) describe a fifth category: hybrid.

Filter selection methods use statistical measures to analyze feature relevance with respect to the target data. These algorithms do not depend on any classifier to perform the selection, therefore they represent the category that induces the least bias. Filter methods are also computationally less expensive than other feature selection methods, and some do not rely on labeled data as they focus on redundancy criteria. An example of a filter method is univariate feature selection (UFS) (AGGARWAL et al., 2014), in which attributes are ranked given a criterion and the best ones are retrieved. Features are ranked in a univariate manner, that is, the correlation of each input feature with the target class is measured independently.

The employed criterion and the number of features to be retrieved are parameters to this algorithm. Univariate feature selection is relevant to this study as its selection quality is compared to that of the proposed method.

Wrapper methods perform a heuristic search on the space of subsets of the original feature set, using a predefined evaluator to measure the quality of candidate subsets. The evaluation of any point in the search space requires the induction and validation of a machine learning model, which is computationally expensive. A wrapper's selection output suffers from any bias induced by the base evaluator, which diminishes the interpretation of selected features as the relevant features for the task. The only restrictions on the quality assessment of proposed search points are the ones presented by the base evaluator, enabling the whole FS algorithm to be applied to unsupervised and reinforcement learning tasks. An example of a wrapper feature selection algorithm is the utilization of Recursive Feature Elimination (RFE) with Support Vector Machines (SVMRFE) (GUYON et al., 2002), in which SVM is the classifier employed to assess the quality of input subsets.

Embedded methods perform feature selection by construction while approaching the prediction task. A classical example is Decision Trees, as they compute attributes' relevance and redundancy and might ignore some of the features initially present in training data. Another embedded feature selection algorithm is FS-NEAT (WHITESON et al., 2005), which is discussed in detail in Section 2.3.2 as it is a basis for the method proposed in this study. SVMRFE is classified as an embedded approach as well, since it utilizes RFE, which is a filter method, to propose selections to SVM in a wrapper-based approach.

Ensemble algorithms handle the sensitivity of most existing methods to noise present in data. An FS algorithm is applied to sampled subsets of the original data, and the results are aggregated into a more reliable selection output.

Hybrid feature selection composes methods of at least two of the aforementioned categories. They present strong potential since they combine the strengths of already established methods in order to diminish their weaknesses.

Filter methods are commonly incorporated into hybrid frameworks (GRISCI; FELTES; DORN, 2018; AGGARWAL et al., 2014) due to their low computational cost. Their application is preferable on learning tasks that present a large number of features, or on data domains where the correlation present in data is straightforward. Wrapper methods are best applicable when the main objective is predictive accuracy and the search space is reasonably small. Embedded feature selection algorithms are less common since they perform (most of the time implicitly) multi-objective optimization, which is harder to construct. They are not applicable when data violates the embedded selection assumptions. Ensemble FS is employed when data present variance that is harmful to the chosen base selector. All mentioned categories may handle unlabeled data, although filter and embedded methods commonly rely on the assessment of feature relevance concerning a target class.
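As an illustration of the filter category, and of the kind of univariate selection used as the UFS(MI) baseline later in this work, the scikit-learn sketch below ranks features by a mutual information criterion and keeps the k best; the synthetic data set and the value of k are assumptions for the example.

```python
# Univariate feature selection with a mutual information criterion.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape, selector.get_support(indices=True))
```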

2.10 Cross-validation

A problem common to learning algorithms is over-fitting (LAWRENCE; GILES, 2000; BELKIN et al., 2019), characterized by the internalization of statistical patterns lying in training data that do not represent the natural data distribution. Such patterns are usually introduced by data collection bias, for instance, when collected samples represent a particular sub-population of the data that is to be analyzed. This problem is detected by testing induced learning models on test samples not present in training data, originated by a different sampling procedure. Nevertheless, some predictive tasks (especially real-world ones) are described by a single data set or collection methodology, which forces the algorithm to learn on data subject to the same bias. A strategy for dealing with this problem is k-fold cross-validation (CV), where the available data is partitioned into k equally sized sets called folds. Every fold is used to test the induced model's predictive capability while the same learning algorithm runs using the remaining k − 1 folds as training data. Finally, k learning models are induced and tested on different partitions of the available data and the computed metrics are aggregated, usually by their mean and standard deviation. When data is partitioned into folds that respect the original class distribution, the experiment is called stratified cross-validation.
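A hedged scikit-learn sketch of stratified 5-fold cross-validation follows; the classifier and the data set are placeholders chosen only to make the example runnable.

```python
# Stratified k-fold CV: each fold preserves the class distribution and
# serves once as the test partition.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    model = SVC().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores), np.std(scores))
```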

2.11 Confusion matrix

The confusion matrix is a square matrix that summarizes a given model's predictions by relating the ground-truth of annotated samples with their predicted class (SOKOLOVA; JAPKOWICZ; SZPAKOWICZ, 2006). The diagonal represents successful predictions, while non-diagonal elements account for samples in which the model failed to predict the correct class. The sum of all matrix entries amounts to the total number of predicted samples. The confusion matrix is a common way of visualizing the classification performance of machine learning models, although it becomes cumbersome for multi-class classification problems.

Table 2.1 depicts a confusion matrix for binary classification. In binary classification tasks, in which one class is taken as positive and the other as negative, the four matrix entries have a distinct meaning and serve as a basis for common machine learning metrics (FLACH, 2003; SOKOLOVA; JAPKOWICZ; SZPAKOWICZ, 2006). True positives (TP) represent predictions in which samples were annotated as positive and the model predicted the positive class; false positives (FP) are positive predictions on samples annotated as negative; false negatives (FN) are negative predictions on samples annotated as positive; and true negatives (TN) represent negative predictions on negative samples.

Table 2.1: Example of confusion matrix. Columns indicate predicted classes and rows indicate annotated classes. TP, FP, FN and TN are abbreviations for true positive, false positive, false negative and true negative, respectively.

        P     N
  P    TP    FN
  N    FP    TN

Accuracy is a common classification metric derived from a given confusion matrix. It does not distinguish errors between different classes and is defined by Equation 2.14. Precision and recall are metrics that focus on evaluating the model's performance concerning a particular class, and are respectively defined by Equations 2.15 and 2.16. Recall is also referred to as sensitivity and is widely employed in biological, medical and image processing studies (SOKOLOVA; JAPKOWICZ; SZPAKOWICZ, 2006).

accuracy = \frac{TP + TN}{TP + FP + FN + TN}    (2.14)

precision = \frac{TP}{TP + FP}    (2.15)

recall = \frac{TP}{TP + FN}    (2.16)
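The small sketch below computes Equations 2.14-2.16 directly from the four entries of a binary confusion matrix; the counts are arbitrary example values.

```python
# Accuracy, precision and recall from binary confusion-matrix entries.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

print(metrics(tp=40, fp=10, fn=5, tn=45))
```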

2.12 Cross-entropy

The cross-entropy function (also referred to as log loss) measures the classification error of a neural network prediction concerning a target class. Such a measure is also applied to multi-class classification by aggregating the prediction errors for each class through summation (BOER et al., 2005; ALPAYDIN; JORDAN, 1996). It is defined by Equation 2.17, where C is a partition of the training set, each element c ∈ C contains the training samples belonging to class c, y is a binary value denoting whether training sample x is in fact labeled with class c, and p(x) represents the network prediction for a c-annotated sample x.

CE(p, C) = -\frac{1}{|C|} \sum_{c \in C} \sum_{(x, y) \in c} \big[ y \ln(p(x)) + (1 - y) \ln(1 - p(x)) \big]    (2.17)
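A minimal sketch of Equation 2.17 for the binary case follows, averaging over samples with predictions clipped away from 0 and 1 for numerical safety; the prediction and target values are illustrative.

```python
# Binary cross-entropy between predicted probabilities and 0/1 targets.
import numpy as np

def cross_entropy(predictions, targets, eps=1e-12):
    p = np.clip(np.asarray(predictions, dtype=float), eps, 1 - eps)
    y = np.asarray(targets, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(cross_entropy([0.9, 0.2, 0.7], [1, 0, 1]))
```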

2.13 Chapter conclusion

This chapter presented the theoretical foundation researched while developing the present study, as well as the computational methods employed in the experiments. The next chapter introduces the problem approached in the experimental evaluation, human phenotype prediction.

Figure 2.5: Depiction of crossover operator. At the top, two individuals depicted by their phenotype; at the bottom, one of the possible offspring generated by the crossover operator. The leftmost topology is owned by the most fit parent. Dashed arrows represent disabled connections and numbers represent genes' historical markers. Blue-colored circles and edges represent matching genes in I1, while green-colored circles and edges represent matching genes in I2. Yellow represents disjoint genes and red represents exceeding genes. In this example, matching genes with markers 1, 3 and 6 are inherited from the least fit parent, while genes with markers 2, 4, 5, 7 and 8 are inherited from I1. Gene 5 could be enabled because it is an active connection in I2, although its properties are inherited from I1.

Figure 2.6: Depiction of mutation Add Input. An input node, not present in an individual's genome, gets connected to a hidden or output node.

Figure 2.7: Depiction of mutation Swap Input. An input node, present in an individual's topology, is replaced by a disconnected input node.

3 THE HUMAN PHENOTYPE PREDICTION PROBLEM

An open problem in biology is the prediction of phenotypical characteristics given genotypical data. Applications for such tasks include animal improvement (CALUS; VEERKAMP, 2007) and control, forensics (JOBLING; GILL, 2004; OGDEN; LINACRE, 2015; WILLIAMS; WIENROTH, 2017; ARENAS et al., 2017) and the study of disease genetics (GRISCI; FELTES; DORN, 2018; DING; PENG, 2003; LAZAR et al., 2012; LOPEZ-RINCON et al., 2018). Available data for such tasks typically contain few samples and, at early stages of preprocessing, many dimensions are reduced, mostly through statistical filters. Gene selection is an example of the employed dimension reduction techniques and is a particular type of genomic selection, where each input dimension corresponds to information derived from a whole gene.

Since interactions between genes themselves are too complex to be analyzed purely by biological and statistical methodologies, machine learning has been applied to problems derived from biological data (GRISCI; FELTES; DORN, 2018; MATUKUMALLI et al., 2006; LAZAR et al., 2012). It is important to notice that the genotype-phenotype relationship is susceptible to significant variations across different populations (AVILA et al., 2019; ZAORSKA; ZAWIERUCHA; NOWICKI, 2019; OGDEN; LINACRE, 2015), therefore results obtained from statistical or machine learning methods may be tightly coupled to the population represented in the data.

While most of an individual's genes are similar to those of others from the same species, the nuclear deoxyribonucleic acid (DNA) of autosomes presents enough variation to support identification by sequencing genome data (CORTE-REAL; VIEIRA, 2015). Single nucleotide polymorphisms (SNPs) are powerful biomarkers for capturing distinctions between individuals. Recent work showed that SNPs are informative for kinship determination (AVILA et al., 2019) and for the prediction of complex phenotypes, such as skin and eye color (ZAORSKA; ZAWIERUCHA; NOWICKI, 2019; LIU et al., 2009; WHITE; RABAGO-SMITH, 2011). The present work follows this evidence and applies the proposed method to an SNP dataset to construct an informative biomarker panel. The following section discusses how previous work on genotypical data analysis relates to feature selection, a core theme in this study.

Figure 3.1: Depiction of a single nucleotide polymorphism. While most of the genome between individuals of the same species is similar, particular locations on the nucleotide sequence present polymorphism. The letters in the image indicate nucleotides in the genome and the arrows point to a position in which polymorphism occurred.

Picture from Academia.edu (2015)

3.1 Genomic selection

The term genomic selection is employed when referring to a selection of attributes that correspond to information derived from biological genes (ANG et al., 2016). Figure 3.2 depicts how data obtained from different analysis techniques are useful when comparing genomes at different granularities of genetic distance.

It is important to highlight that there is much uncertainty in biological data and that its collection is a complex task, prone to error from diverse sources. Such errors might introduce bias into the available data, which in turn yields prediction bias when machine learning algorithms are applied to it. Therefore, genomic selection results should be carefully validated by domain experts in order to reject erroneous results due to systemic and statistical bias possibly introduced during data collection and processing (FELTES et al., 2019).

Figure 3.2: Depiction of distinct data derived from genomic samples. When comparing genomes, different reads on samples are suited for different magnitudes of genetic distance.

Figure from Ogden and Linacre (2015)

One important application of gene selection is the construction of biomarker panels (AVILA et al., 2019; OGDEN; LINACRE, 2015; WILLIAMS; WIENROTH, 2017; ARENAS et al., 2017). Once the markers relevant to a given task are discovered, future data collections might focus on the selected features only. The panel description not only improves data analysis capability (as is the case in general feature selection) but also cheapens the gathering and storage of new data. In animal improvement, for instance, data collection equipment must be distributed to farms over a possibly wide area. In this case, spurious markers increase the cost of individual collection devices, hampering the process of obtaining data on a large scale.

The application of feature selection to SNPs data is also referred to as genomic selection, although the relation between single nucleotide polymorphisms and genes is not straightforward. An SNP represents less information than a gene and does not always reside inside coding regions of the genome; SNPs are often related to genes by physical proximity only. Furthermore, single nucleotide polymorphisms spread throughout the entire genome and may not be related to any known gene.

3.2 Chapter conclusion

This chapter introduced the human phenotype prediction problem and related it to feature selection, a core theme in this work. The following chapter presents the proposed algorithm in detail, emphasizing its essential differences from the feature selection methods already discussed in Chapter 2.

4 THE PROPOSED METHOD

All of N3O's modifications to FS-NEAT rely on the Kruskal-Wallis one-way analysis of variance (Equation 2.6), which is inappropriate for categorical data. To overcome this restriction, the statistical test employed in N3O to guide the addition of new inputs is seen as a component that ranks features, similar to the scoring employed in filter feature selection methods. The present study leans on this abstraction to propose N3O-D, a novel feature selection algorithm based on neuroevolution which is specialized in categorical data.

In order to handle discrete input data, the criterion proposed in the mRMR method (Equation 2.12) is chosen to rank the features that are not yet selected within an individual's topology. Instead of greedily picking the feature with the highest score, the scores of the remaining features are normalized into probabilities using the Softmax function (Equation 2.13) and then fed into the evolutionary algorithm's mutation process through the mutation Guided Add Input (presented in Section 2.3.3). Such a criterion is appropriate for categorical data since it interprets entries as distinct labels, as opposed to numerical quantities. It is appropriate for feature selection since it measures the relevancy of a given feature concerning the target class, as well as that feature's redundancy concerning the already selected attributes. In contrast to N3O's attribution of probabilities to features, N3O-D needs to recompute probabilities every time a new feature is added to the individual's topology (see Equation 2.10). Like most machine learning algorithms, the proposed method does not handle missing values naturally.

The following section presents a processing step that is extraneous to N3O's proposed modifications and is necessary only due to the discrete nature of the data. The remaining sections detail a pipeline as an application example of the proposed method and conclude the chapter.
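To make the mechanism concrete, the sketch below illustrates one possible way of turning the mRMR scores of the not-yet-selected features into the sampling probabilities consumed by Guided Add Input. It is a minimal sketch, not the thesis implementation: it assumes categorical features stored column-wise in a pandas DataFrame, and uses scikit-learn's mutual_info_score as the mutual information estimator.

```python
# Minimal sketch of the Guided Add Input scoring used by N3O-D (illustrative only).
# Assumes a pandas DataFrame `X` of categorical features and a label vector `y`;
# `selected` holds the attributes already present in the individual's topology.
import numpy as np
from sklearn.metrics import mutual_info_score


def mrmr_score(X, y, feature, selected):
    """Relevance of `feature` w.r.t. the class minus its mean redundancy
    w.r.t. the already selected features (mRMR criterion)."""
    relevance = mutual_info_score(X[feature], y)
    if not selected:
        return relevance
    redundancy = np.mean([mutual_info_score(X[feature], X[s]) for s in selected])
    return relevance - redundancy


def guided_add_input_probabilities(X, y, selected):
    """Softmax-normalize the mRMR scores of the remaining features, yielding
    the distribution from which Guided Add Input samples a new input."""
    candidates = [f for f in X.columns if f not in selected]
    scores = np.array([mrmr_score(X, y, f, selected) for f in candidates])
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    return candidates, exp / exp.sum()


# Usage: sample one candidate feature to connect to the network.
# candidates, probs = guided_add_input_probabilities(X, y, selected=["SNP3"])
# new_feature = np.random.choice(candidates, p=probs)
```

Because the probabilities depend on the already selected set, this routine would have to be re-run after every accepted Add Input mutation, matching the recomputation requirement stated above.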

4.1 Encoding

For neural network topologies to distinguish different input values in a discrete manner rather than in a continuous one, categorical features should be strategically encoded into numerical values (POTDAR; TAHER; CHINMAY, 2017). Cedric (2018) and Potdar, Taher and Chinmay (2017) presented evidence that one-hot encoding showed superior performance in comparative studies against other techniques, and it was therefore chosen to encode the input features. Such a transform artificially increases the number of input variables and creates a relationship between input nodes. A mapping between each original attribute and its induced input nodes is kept, so feature-selective variations of NEAT are capable of treating sets of input nodes as a whole attribute. Such a mapping is crucial to the gene selection task, since the objective is to select whole markers rather than parts of their information. Figures 4.1 and 4.2 depict how some of N3O's original mutations change when one-hot encoding is applied.
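A minimal sketch of this encoding step, assuming pandas and genotypes stored as strings (the helper name one_hot_with_mapping is illustrative, not from the thesis code):

```python
# Illustrative sketch of one-hot encoding with an attribute-to-input-node mapping,
# so that mutations such as Add Input and Swap Input can treat the columns induced
# by a single SNP as one attribute.
import pandas as pd


def one_hot_with_mapping(df):
    encoded = pd.get_dummies(df)                 # one column per (SNP, genotype) pair
    mapping = {
        snp: [col for col in encoded.columns if col.startswith(f"{snp}_")]
        for snp in df.columns
    }
    return encoded, mapping


# Example with hypothetical column names:
# df = pd.DataFrame({"SNP1": ["AA", "AC"], "SNP2": ["CT", "CT"]})
# encoded, mapping = one_hot_with_mapping(df)
# mapping == {"SNP1": ["SNP1_AA", "SNP1_AC"], "SNP2": ["SNP2_CT"]}
```

The mapping dictionary is what allows the selection to be reported in terms of whole markers even though the network sees several binary inputs per SNP.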

Figure 4.1: Depiction of mutation Add Input given one hot encoding of input nodes. Inputs colored the same way belong to the same original input feature and therefore must be selected as a group.

Figure 4.2: Depiction of mutation Swap Input given one hot encoding of input nodes. Same color legend as in Figure 4.1.

4.2 Pipeline example

Figure 4.3 depicts the pipeline employed in the experimental evaluation presented in Chapter 5 as an application example for N3O-D.

Data source: Data is provided by particular institutions that hold the technical knowledge required for collecting and storing genotypical data, as well as for applying the first preprocessing and filter stages.

Figure 4.3: Diagram depicting how the proposed method is applied to the genotype-phenotype problem.

Data imputation: Real-world data sets often contain missing or invalid values. The process of handling such values is called imputation and is a necessity given incomplete data.

Neuroevolution: Imputed data is fed into the evolutionary algorithm, which simultaneously selects relevant inputs and classifies samples according to their annotated phenotype.

Results validation: Feature selection results yield candidates for an SNPs panel. The relevancy of the selected biomarkers should be evaluated by the biology community. Appropriate validation might indicate methodology flaws in the experiments that generated such a selection or inspire novel biological research. A simple validation of the SNPs selected in the experimental evaluation is detailed and presented in Chapter 5.

4.3 Chapter conclusion

This chapter presented the feature selection method proposed by this study and explained its necessity by highlighting weaknesses of the algorithms discussed in previous parts of the text. The following chapter is dedicated to the experimental evaluation: it introduces the provided data and details the treatments applied to it, announces the evaluated algorithms while specifying their configurations, and finally presents and discusses the obtained experimental results.

5 EXPERIMENTS AND RESULTS

The prediction performance of N3O-D on the human phenotype prediction problem is experimentally compared to that of its base algorithm, FS-NEAT, which is also powered by neuroevolution, as well as to a univariate feature selection that also assesses input relevance via mutual information. The prediction performance of the neuroevolution algorithms is assessed through 10 evolution runs of 3-fold cross-validation. The feature selections of such methods are retrieved by picking the inputs selected by the best performing individual of each evolution, collected over the 30 executions. Retrieved features are ranked by the frequency with which they are selected throughout the multiple runs. The quality measurement of retrieved feature selections is further detailed in Section 5.2. The following sections describe the data utilized in the experiments, the feature selection methods evaluated, and how N3O-D's hyper-parameters were configured, and then present the experimental results.

5.1 Data

INCT Forense 1 provided a data set with 439 samples of 67 single nucleotide polymorphism (SNP) genetic markers labeled with both eye and skin color phenotypical information, respectively shown in Figures 5.1 and 5.2. Table 5.1 depicts the data format. The value distributions for both target characteristics are presented in Tables 5.2 and 5.3.

Table 5.1: Raw data examples

ID   SNP1   SNP2   SNP3   phenotype (eye)   phenotype (skin)

1    AA     CT     GA     blue              dark
2    AC     CT     ..     green             pale
3    .C     CC     AG     hazel             light

Columns that represent SNPs (features) contain two nucleotides, one for each DNA strand at the particular genome position. Missing values might occur in the whole marker or only partially. The last columns represent phenotype occurrences and show that their values are not necessarily encoded in a manner suitable for learning algorithms. Samples are identified by a specific column (the first, in this example), although this information is ignored when training machine learning models.


Figure 5.1: Picture of analyzed skin phenotypes.

Image provided by INCT Forense

Figure 5.2: Picture of analyzed eye phenotypes.

Image provided by INCT Forense

Table 5.2: Skin color class distribution

Classes          Representation (%)
White            25.79
Pale             31.73
Beige            14.38
Light Brown       9.58
Medium Brown     12.32
Dark Brown        6.16

Table 5.3: Eye color class distribution

Classes          Representation (%)
Blue             23.91
Green             8.2
Hazel             6.6
Light Brown       9.33
Dark Brown       37.12
Black            14.80

The provided data was not readily applicable as input to machine learning algorithms since it contained invalid entries. In addition, the class distributions are imbalanced, which is detrimental to learning models. The next section outlines how these issues are approached throughout data preprocessing.

5.1.1 Imputation

Samples of the provided data contained corrupt measurements for some SNPs. Since neither SVM nor NEAT handles invalid values, imputation had to be performed. The inference of invalid attribute values was carried out through KNN: samples containing missing values are queried using Equation 5.1 and the values to be imputed are estimated as the mode of the retrieved neighbors. The described imputation technique is inspired by Batista and Monard (2003). Although mode imputation induces bias in learning algorithms, it was preferred over the exclusion of samples or features with invalid values, since most attributes contained at least a few samples with incomplete entries.

D(s_1, s_2) = \frac{\sum_{a \in A_{s_1} \cap A_{s_2}} D^a(s_1, s_2)}{|A_{s_1} \cap A_{s_2}|}    (5.1)

D^a(s_1, s_2) = \begin{cases} 1, & \text{if } s_1^a = s_2^a \\ 0, & \text{otherwise} \end{cases}    (5.2)

D(s_1, s_2) represents the difference between two distinct samples s_1 and s_2, s_k^a denotes the value of attribute a for the k-th sample, A_{s_k} is the set of attributes of s_k that contain valid values, and ∩ denotes set intersection.
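A possible realization of this imputation step is sketched below, assuming samples are rows of a pandas DataFrame with numpy.nan marking invalid entries and a free neighborhood size k (not fixed by the text). Note that Equation 5.2, as printed, assigns 1 to matching values, so neighbors are ranked by the highest score; if a distance was intended instead, the lowest scores should be used.

```python
# Sketch of KNN mode imputation guided by Equations 5.1/5.2 (assumptions as above).
import numpy as np
import pandas as pd


def overlap_score(s1, s2):
    """Fraction of commonly valid attributes whose values match (Equations 5.1/5.2)."""
    valid = s1.notna() & s2.notna()
    if valid.sum() == 0:
        return 0.0
    return (s1[valid] == s2[valid]).mean()


def impute_knn_mode(df, k=3):
    imputed = df.copy()
    for i, row in df.iterrows():
        missing = row[row.isna()].index
        if len(missing) == 0:
            continue
        # Score every other sample against the incomplete one and keep the k closest.
        scores = df.drop(index=i).apply(lambda other: overlap_score(row, other), axis=1)
        neighbors = df.loc[scores.nlargest(k).index]
        for attr in missing:
            mode = neighbors[attr].mode(dropna=True)
            if not mode.empty:
                imputed.at[i, attr] = mode.iloc[0]   # impute with the neighbors' mode
    return imputed
```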

5.1.2 Data views

As Tables 5.2 and 5.3 showed, class distributions are imbalanced for both phenotypical characterizations, which encourages the specification of data views that aggregate the original classes to construct a more homogeneous annotation of samples. Class imbalance is detrimental to classification algorithms because less representative classes do not contribute much to the patterns extracted to guide the model's induction, which results in models biased towards the classes that are most represented in the training data. Furthermore, phenotype prediction use cases do not always require a distinction as refined as the one presented originally. For instance, when identifying an individual with blue colored eyes, distinguishing between the multiple brown shades is irrelevant, hence the Blue-vs-Non-Blue view (presented in Table 5.5). Each data view represents a function that transforms the provided data set into one more adapted to a different application by aggregating particular classes into significant partitions. Such transformations are capable of diminishing class imbalance, as underrepresented classes may be merged into a more representative one. The adopted views for the skin and eye color characterizations are respectively specified in Tables 5.4 and 5.5; a small relabeling sketch follows the tables.

Table 5.4: Skin color data views

Partition
White, Pale, Beige, Light Brown, Medium Brown, Dark Brown
Light (White and Pale), Intermediate (Beige and Light Brown), Dark (Medium and Dark Brown)
Light (White, Pale and Beige), Dark (Light, Medium and Dark Brown)

Table 5.5: Eye color data views

Partition
Blue, Green, Hazel, Light Brown, Dark Brown, Black
Blue, Intermediate (Green, Hazel and Light Brown), Dark (Dark Brown and Black)
Blue, Non-Blue
Brown (Light Brown, Dark Brown and Black), Non-Brown
Light (Blue and Green), Non-Light
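As an illustration, a data view can be implemented as a simple relabeling function; the dictionary below encodes the Light/Intermediate/Dark skin view of Table 5.4 (the Python names are illustrative, not taken from the thesis code).

```python
# Hedged sketch: a data view as a mapping from original classes to aggregated ones.
SKIN_LIGHT_INTERMEDIATE_DARK = {
    "White": "Light", "Pale": "Light",
    "Beige": "Intermediate", "Light Brown": "Intermediate",
    "Medium Brown": "Dark", "Dark Brown": "Dark",
}


def apply_view(labels, view):
    """Map the original phenotype labels onto the aggregated classes of a view."""
    return [view[label] for label in labels]


# apply_view(["Pale", "Dark Brown", "Beige"], SKIN_LIGHT_INTERMEDIATE_DARK)
# -> ["Light", "Dark", "Intermediate"]
```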

5.2 Feature selection quality assessment

Since the main objective of the proposed method is to select inputs on training data, a quality measurement of the retrieved selections is established and detailed in the present section. As a strategy for evaluating the contribution of the three operators proposed in Grisci, Feltes and Dorn (2018) to the proposed adaptation to categorical data, the feature selection quality and classification performance of FS-NEAT are also assessed on the target data. As an alternative to neuroevolution, mutual information (Section 2.6) is utilized in a univariate feature selection algorithm (explained in Section 2.9) to measure the contribution of input attributes to the task at hand. This method is referred to as UFS(MI) in the following sections.

Given a rank of the available input dimensions, feature subsets are greedily formed by selecting the top-k attributes. k ∈ [1, 10] was chosen as it is a typical use case specified by the data provider. The obtained selections serve as input to a cross-validation training, employing SVM as the classifier, whose aggregated accuracy indicates how good the analyzed feature selection is. For each evaluation replication, the selection with maximal accuracy is retrieved, preferring selections of minimal size in the case of ties. Since the classification performance concerning all phenotype classes should be accounted for, the terms in Equation 2.14 are adapted as follows: the numerator is replaced by the sum of true positive predictions concerning each target class, and the denominator is the number of evaluated predictions.

In the univariate selection method, the ranking of attributes is an algorithmic by-product, and retrieved features are already ranked. Since the UFS(MI) ranking is not deterministic, its selection capability is evaluated on 50 3-fold CV runs to mitigate its ranking variance. In the methods based on neuroevolution, the attribute rank is computed by measuring the frequency with which the available SNPs were selected throughout the 30 executions that resulted from the 10 3-fold cross-validation runs employed to evaluate the algorithms' classification performance through evolution alone (discussed at the beginning of this chapter). After the 30 executions, a single attribute rank is computed per neuroevolution algorithm, therefore a single CV run is sufficient to measure the selection quality.
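A compact sketch of this evaluation loop is given below. It assumes an already one-hot-encoded feature matrix X (a pandas DataFrame), a label vector y and a precomputed feature ranking; scikit-learn's SVC and cross_val_score stand in for the classifier and the cross-validation, and the helper name is illustrative.

```python
# Sketch of the selection-quality protocol: evaluate the top-k subsets (k = 1..10)
# with an SVM under cross-validation and keep the smallest subset attaining the
# best aggregated accuracy.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def best_top_k(X, y, ranked_features, k_max=10, folds=3):
    best = (0.0, None)
    for k in range(1, k_max + 1):
        subset = ranked_features[:k]
        acc = cross_val_score(SVC(kernel="rbf"), X[subset], y, cv=folds).mean()
        # Only strictly better accuracy replaces the incumbent, so smaller
        # subsets are preferred in case of ties.
        if acc > best[0]:
            best = (acc, subset)
    return best  # (accuracy, selected features)
```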

5.3 Baseline accuracy

Prior to any classification, a baseline accuracy is established. It is defined by Equation 5.3 (where D denotes the whole training data set, C denotes the target classes occurring in D, and D_c corresponds to the samples in D annotated with class c) and translates to the percentage of samples annotated with the most representative class in the training data. A classifier that matches the baseline accuracy is trivial to implement, as it constantly outputs the class represented by most samples, and finding such a class requires a single iteration over the data set. As in Grisci, Feltes and Dorn (2018), the baseline accuracy is compared to the classification performances obtained through the experimental evaluation.

A(D) = \max_{c \in C} \left\{ \frac{|D_c|}{|D|} \right\}    (5.3)
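In code, Equation 5.3 amounts to the share of the most frequent label (a minimal sketch):

```python
# Baseline accuracy: fraction of samples belonging to the most represented class.
from collections import Counter


def baseline_accuracy(labels):
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# baseline_accuracy(["Blue", "Blue", "Dark Brown"])  # -> 0.666...
```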

5.4 Proposed method configuration

As an extension of FS-NEAT, the proposed method requires the specification of multiple hyper-parameters that heavily influence the optimization performance. As already discussed in Section 2.1, genetic algorithms provide a lot of room for micro-optimizations without violating the algorithm specification. The references that propose neuroevolution of augmenting topologies (STANLEY; MIIKKULAINEN, 2002; WHITESON et al., 2005) present vague descriptions of how key components of the meta-heuristic should be implemented, especially of how individuals are partitioned into species. It is important to highlight that existing implementations of NEAT employ extensions that are not prescribed by the original algorithm specification and are beyond the scope of this study; such discrepancies harden the comparison with previous results. The following sections detail the hyper-parameters employed in the experimental evaluation, which were selected based on a literature revision rather than empirically. The feed-forward neural networks induced by individuals' phenotypes, the function employed to assess individuals' fitness and the evolution hyper-parameters are also specified.

5.4.1 Fitness function

An imbalanced version of the cross-entropy (presented in Section 2.12) is utilized as the fitness function and is presented in Equation 5.4, where the symbols' interpretation is the same as in Equation 2.17. The term 1/|c| weights the error induced per sample inversely proportionally to its class representation within the training data. It was employed in Grisci, Feltes and Dorn (2018) to diminish the error estimation bias on imbalanced data sets.

CE(p, C) = -\frac{1}{|C|} \sum_{c \in C} \frac{1}{|c|} \sum_{(x, y) \in c} \left[ \ln(p(x))\,y + \ln(1 - p(x))\,(1 - y) \right]    (5.4)

CE(p, C) + \lambda \frac{1}{2n|E|} \sum_{\theta \in \Theta} \theta^2    (5.5)

The fitness of an individual whose induced phenotype computes p via weights Θ, concerning a training set partitioned by C, is defined by Equation 5.5, where E represents the set of edges within the individual's topology and n denotes the total number of samples in the training data. The second term in the expression regularizes the network weights during optimization, under the motivation that connections and biases with small magnitude better generalize the patterns learned on training data. The employment of regularization turns the optimization into a multi-objective problem whose objectives are weighted by the hyper-parameter λ. Since population individuals possibly yield different numbers of edges, the regularization term is scaled by 1/|E| to avoid penalizing large networks that evolved useful structure.

In contrast to the experiment in Grisci, Feltes and Dorn (2018), this study's neural networks output a probability estimate for each possible occurrence of the target attribute, rather than classifying one class at a time. Such a decision is made upon the soft meaning attributed to label values. Furthermore, the goal of this study is to find a unique set of biomarkers able to explain the distinction between phenotypes, rather than characterizing a particular one. The explanation of specific phenotype occurrences is partly approached by the data views (discussed in Section 5.1.2).
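A direct transcription of Equations 5.4 and 5.5 is sketched below for a 0/1 target indicator; the argument names (probs, targets, classes, weights, lam) are illustrative and do not come from the thesis code.

```python
# Hedged sketch of the regularized, class-weighted cross-entropy fitness.
# `probs`/`targets` are NumPy arrays over all samples, `classes` maps each class
# to the indices of its samples, `weights` are the network parameters, `lam` is
# the regularization coefficient, `n_edges` the number of edges |E|.
import numpy as np


def weighted_cross_entropy(probs, targets, classes):
    eps = 1e-12                                   # avoid log(0)
    total = 0.0
    for idx in classes.values():
        p, y = probs[idx], targets[idx]
        per_sample = y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)
        total += per_sample.sum() / len(idx)      # 1/|c| weighting per class
    return -total / len(classes)                  # 1/|C| average over classes


def fitness(probs, targets, classes, weights, lam, n_samples, n_edges):
    reg = lam * np.sum(np.square(weights)) / (2 * n_samples * n_edges)
    return weighted_cross_entropy(probs, targets, classes) + reg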

5.4.2 Activation and aggregation functions

Neural network neurons require the specification of activation and aggregation functions in order to compute their output. In this experiment, N3O-D was configured with activation functions that were highlighted in comparative studies (PAPAVASILEIOU; JANSEN, 2018). The hyperbolic tangent and the Gaussian function (with 0 mean and standard deviation of 1) are used by the network's hidden and output nodes, respectively, and are defined by Equations 5.6 and 5.7. As in Grisci, Feltes and Dorn (2018), the mean was chosen as the aggregation function to compensate for nodes with different numbers of incoming edges.

\phi(x) = \tanh\left(4.9\,\frac{x}{2}\right)    (5.6)

\phi(x) = e^{-5x^2}    (5.7)
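The sketch below implements the two activation functions as reconstructed above, together with the mean aggregation. Equation 5.7 is assumed here to be the Gaussian-shaped exp(-5x^2) (the form used, for instance, by NEAT-Python's gauss activation); this assumption should be double-checked against the actual implementation.

```python
# Sketch of the node functions assumed from Equations 5.6 and 5.7 (not thesis code).
import numpy as np


def hidden_activation(x):
    return np.tanh(4.9 * x / 2)      # Equation 5.6, used by hidden nodes


def output_activation(x):
    return np.exp(-5 * x ** 2)       # Equation 5.7 (Gaussian-shaped), used by output nodes


def aggregate(incoming):
    return np.mean(incoming)         # mean aggregation over incoming weighted inputs
```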

5.4.3 Hyper-parameters

Neuroevolution algorithms based on NEAT require the specification of several hyper-parameters that guide the evolution. The configuration chosen for this experiment is presented in Table 5.6, where N denotes the normal distribution. It was based on the previously successful results of Grisci, Feltes and Dorn (2018), except for c3 and the population size, whose values were obtained from Whiteson et al. (2005). The parametrization of SVM was obtained via grid search optimization and is specified in Table 5.7.

Table 5.6: Chosen hyper-parameters for the proposed method configuration

Parameter                                            Value
Population size                                      100
Number of generations                                100
Elitism proportion                                   10%
Inter-species mating rate                            1%
Coefficient 1 (c1)                                   1.0
Coefficient 2 (c2)                                   1.0
Coefficient 3 (c3)                                   3.0
Chance mutation Add Node happens                     3%
Chance mutation Add Connection happens               5%
Chance mutation Change Weight happens                4%
Chance mutation Add Input happens                    5%
Chance mutation Swap Input happens                   5%
Chance disabled gene becomes enabled on crossover    50%
Chance input from least fit parent is inherited      75%
Initial weights distribution                         N(0, 1)
Initial biases distribution                          N(0, 1)
Weight mutation power                                0.03
Bias mutation power                                  0.03
Regularization coefficient                           0.1

Table 5.7: Chosen hyper-parameters for SVM grid search optimization

Parameter    Grid Coordinates
C            0.001, 0.01, 0.1, 1, 10
gamma        0.001, 0.01, 0.1, 1, scale, auto 2
kernel       RBF
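Assuming the scikit-learn implementation of SVM, the grid of Table 5.7 translates into the following search (a sketch; X_enc and y are placeholders for the encoded features and labels):

```python
# Sketch of the SVM parametrization via grid search over the values of Table 5.7.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "gamma": [0.001, 0.01, 0.1, 1, "scale", "auto"],
    "kernel": ["rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=3)
# search.fit(X_enc, y)
# best_svm = search.best_estimator_
```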

5.5 Results and discussion

The following tables present experimental results concerning the input feature ranks retrieved by the evaluated methods, as detailed in Section 5.2. Tables 5.8 and 5.9 present the size of the selections retrieved by the three evaluated FS algorithms.

Table 5.8: Evaluated feature selections sizes on skin color data views

Data View                                                   UFS(MI)       N3O-D         FS-NEAT
White, Pale, Beige, Light Brown, Medium Brown, Dark Brown   5.92 ± 1.84   10 ± 0        6.33 ± 2.86
Light, Intermediate, Dark                                   5.09 ± 2.47   6 ± 2.82      7.33 ± 3.77
Light, Dark                                                 6.27 ± 1.97   8.66 ± 0.94   6.66 ± 2.62

Table 5.9: Evaluated feature selections sizes on eye color data views

Data View                                            UFS(MI)       N3O-D         FS-NEAT
Blue, Green, Hazel, Light Brown, Dark Brown, Black   7.35 ± 2.79   7.66 ± 3.29   6.66 ± 2.05
Blue, Intermediate, Dark                             7.84 ± 1.83   6.66 ± 2.86   10 ± 0
Blue, Non-Blue                                       1.98 ± 2.74   6 ± 2.16      9 ± 1.41
Brown, Non-Brown                                     1.7 ± 1.98    2 ± 0         8 ± 0.81
Light, Non-Light                                     1 ± 0         4.33 ± 3.29   9 ± 1.41

Tables 5.10, 5.11, 5.12 and 5.13 make explicit the SNPs selected by the evaluated feature selection methods, as well as their occurrence frequency throughout the multiple execution runs. SNPs that were related to the target phenotypes in previous studies are marked with ∗. For visualization purposes, SNPs selected with a frequency lower than 5% are not presented.

2 scale and auto infer gamma from the input features, respectively attributing (var(X) · M)^{-1} and M^{-1}, where var denotes the statistical variance and M the number of input features.

Table 5.10: SNPs selected by UFS(MI) on skin color data

Data view: White, Pale, Beige, Light Brown, Medium Brown, Dark Brown
Selection: rs1426654∗(93.60%), rs16891982∗(80.67%), rs1448484∗(64.00%), rs11230664∗(63.87%), rs28777∗(39.87%), rs183671∗(32.20%), rs1129038∗(28.80%), rs12913832∗(28.53%), rs6497271∗(18.93%), rs16950987∗(16.53%), rs10777129∗(13.33%), rs2240203∗(11.00%), rs7948623 (9.20%), rs8039195∗(7.13%), rs1375164∗(6.07%), rs642742∗(5.73%), rs6119471∗(5.20%)

Data view: Light, Intermediate, Dark
Selection: rs1426654∗(97.47%), rs16891982∗(82.20%), rs11230664∗(70.93%), rs1448484∗(58.53%), rs28777∗(38.33%), rs183671∗(30.27%), rs1129038∗(28.07%), rs12913832∗(26.20%), rs6497271∗(22.87%), rs16950987∗(15.40%), rs7948623 (13.53%), rs1375164∗(11.40%), rs2240203∗(8.87%), rs10777129∗(7.20%), rs8039195∗(5.60%), rs6119471∗(5.53%), rs895829∗(5.13%)

Data view: Light, Dark
Selection: rs1426654∗(96.00%), rs11230664∗(84.53%), rs1448484∗(74.27%), rs16891982∗(74.20%), rs28777∗(45.40%), rs183671∗(38.47%), rs6497271∗(37.47%), rs10777129∗(30.93%), rs7948623 (10.07%), rs16950987∗(10.07%), rs8039195∗(8.33%), rs6119471∗(7.33%), rs642742∗(7.13%), rs2240203∗(6.20%), rs1375164∗(5.13%)

Table 5.11: SNPs selected by UFS(MI) on eye color data

Data view: Blue, Green, Hazel, Light Brown, Dark Brown, Black
Selection: rs12913832∗(95.00%), rs1129038∗(95.00%), rs7494942∗(50.47%), rs3935591∗(48.80%), rs916977∗(45.80%), rs1426654∗(37.87%), rs7170852∗(31.80%), rs895829∗(25.20%), rs7170989∗(24.53%), rs16891982∗(11.80%), rs1448484∗(10.07%), rs683∗(10.07%), rs16950987∗(9.20%), rs4778138∗(9.13%), rs11636232∗(9.13%), rs1375164∗(6.07%), rs2240203∗(5.80%)

Data view: Blue, Intermediate, Dark
Selection: rs1129038∗(95.13%), rs12913832∗(94.87%), rs3935591∗(55.80%), rs7494942∗(50.40%), rs916977∗(45.33%), rs1426654∗(38.27%), rs7170852∗(31.53%), rs7170989∗(24.67%), rs895829∗(23.13%), rs683∗(14.27%), rs4778138∗(11.67%), rs11636232∗(11.20%), rs4778241∗(8.53%), rs16891982∗(6.27%), rs16950987∗(5.47%), rs1375164∗(5.27%), rs2240203∗(5.13%)

Data view: Blue, Non-Blue
Selection: rs1129038∗(95.00%), rs12913832∗(95.00%), rs3935591∗(61.47%), rs7494942∗(51.87%), rs916977∗(44.53%), rs7170852∗(35.20%), rs7170989∗(25.80%), rs4778241∗(24.60%), rs895829∗(23.60%), rs683∗(17.27%), rs2238289∗(17.00%), rs1426654∗(13.40%), rs4778138∗(8.53%), rs16950987∗(6.40%), rs1375164∗(5.20%)

Data view: Brown, Non-Brown
Selection: rs1129038∗(95.33%), rs12913832∗(94.67%), rs7494942∗(51.27%), rs916977∗(46.67%), rs3935591∗(45.40%), rs7170852∗(28.73%), rs11636232∗(26.87%), rs895829∗(26.20%), rs7170989∗(25.53%), rs1426654∗(25.13%), rs4778138∗(21.47%), rs683∗(11.53%), rs16891982∗(10.87%), rs1375164∗(7.87%), rs4778241∗(5.40%), rs16950987∗(5.13%)

Data view: Light, Non-Light
Selection: rs1129038∗(95.80%), rs12913832∗(94.20%), rs3935591∗(57.13%), rs7494942∗(52.00%), rs916977∗(46.40%), rs7170852∗(36.13%), rs7170989∗(25.67%), rs895829∗(24.73%), rs11636232∗(23.27%), rs4778241∗(20.13%), rs683∗(17.07%), rs4778138∗(16.33%), rs1426654∗(12.73%), rs16950987∗(7.13%)

Table 5.12: SNPs selected by N3O-D on skin color data

Data view: White, Pale, Beige, Light Brown, Medium Brown, Dark Brown
Selection: rs2835630∗(23.33%), rs2402130∗(20.00%), rs1724630∗(13.33%), rs2378249 (13.33%), rs1805005∗(13.33%), rs3768056∗(13.33%), rs10777129∗(10.00%), rs1042602∗(10.00%), rs2733832∗(10.00%), rs6497271∗(10.00%), rs7170852∗(10.00%), rs1805009∗(10.00%), rs1393350∗(10.00%), rs16891982∗(10.00%), rs12896399∗(10.00%), rs4778137∗(10.00%), rs4778138∗(10.00%), rs183671∗(10.00%), rs1325127∗(10.00%), rs1800407∗(10.00%), rs8039195∗(10.00%), rs11230664∗(10.00%), rs2424984∗(6.67%), rs1597196∗(6.67%), rs1426654∗(6.67%), rs2594935∗(6.67%), rs6119471∗(6.67%), rs12913832∗(6.67%), rs2070959∗(6.67%), rs10424065∗(6.67%), rs642742∗(6.67%), rs1800404∗(6.67%), rs1805006∗(6.67%), rs1129038∗(6.67%), rs1900758∗(6.67%), rs1375164∗(6.67%), rs9894429 (6.67%), rs10756819∗(6.67%), rs895828∗(6.67%), rs16950987∗(6.67%), rs1110400∗(6.67%), rs7948623 (6.67%)

Data view: Light, Intermediate, Dark
Selection: rs183671∗(16.67%), rs8039195∗(13.33%), rs885479∗(13.33%), rs1375164∗(13.33%), rs12896399∗(13.33%), rs642742∗(13.33%), rs683∗(13.33%), rs2402130∗(13.33%), rs7948623 (10.00%), rs3768056∗(10.00%), rs12913832∗(10.00%), rs3794606∗(10.00%), rs3935591∗(10.00%), rs4932620∗(10.00%), rs11636232∗(10.00%), rs2835630∗(10.00%), rs11230664∗(10.00%), rs4778241∗(10.00%), rs1448484∗(10.00%), rs6497271∗(10.00%), rs13289∗(10.00%), rs9894429 (6.67%), rs6119471∗(6.67%), rs2378249 (6.67%), rs1724630∗(6.67%), rs7170852∗(6.67%), rs1325127∗(6.67%), rs10424065∗(6.67%), rs2240203∗(6.67%), rs1129038∗(6.67%), rs916977∗(6.67%), rs4959270∗(6.67%), rs1805006∗(6.67%), rs1805009∗(6.67%), rs1426654∗(6.67%), rs1037208∗(6.67%), rs4778137∗(6.67%), rs3212345∗(6.67%)

Data view: Light, Dark
Selection: rs1042602∗(23.33%), rs11636232∗(20.00%), rs10777129∗(16.67%), rs7170989∗(16.67%), rs1800404∗(13.33%), rs12913832∗(13.33%), rs16891982∗(13.33%), rs2424984∗(13.33%), rs1129038∗(13.33%), rs2238289∗(13.33%), rs1800407∗(10.00%), rs2378249 (10.00%), rs1375164∗(10.00%), rs2733832∗(10.00%), rs4778241∗(10.00%), rs6497271∗(10.00%), rs183671∗(10.00%), rs1110400∗(10.00%), rs683∗(10.00%), rs6119471∗(10.00%), rs12896399∗(6.67%), rs1325127∗(6.67%), rs1900758∗(6.67%), rs9894429 (6.67%), rs1393350∗(6.67%), rs1448484∗(6.67%), rs3935591∗(6.67%), rs2402130∗(6.67%), rs8039195∗(6.67%), rs3794606∗(6.67%), rs13289∗(6.67%), rs4778138∗(6.67%), rs1805005∗(6.67%), rs2240203∗(6.67%), rs895829∗(6.67%)

Table 5.13: SNPs selected by N3O-D on eye color data

Data view: Blue, Green, Hazel, Light Brown, Dark Brown, Black
Selection: rs1448484∗(20.00%), rs7170852∗(20.00%), rs3794606∗(16.67%), rs13289∗(16.67%), rs2424984∗(16.67%), rs4778137∗(16.67%), rs4778232∗(16.67%), rs1900758∗(13.33%), rs2036213∗(13.33%), rs11230664∗(10.00%), rs2594935∗(10.00%), rs183671∗(10.00%), rs13289810 (10.00%), rs2402130∗(10.00%), rs11636232∗(10.00%), rs1800404∗(10.00%), rs1805006∗(10.00%), rs3212345∗(10.00%), rs2733832∗(10.00%), rs1042602∗(10.00%), rs10756819∗(10.00%), rs1375164∗(10.00%), rs1724630∗(10.00%), rs642742∗(10.00%), rs885479∗(10.00%), rs8039195∗(10.00%), rs4778138∗(10.00%), rs1800407∗(6.67%), rs1393350∗(6.67%), rs7494942∗(6.67%), rs3768056∗(6.67%), rs7170989∗(6.67%), rs2070959∗(6.67%), rs4932620∗(6.67%), rs895829∗(6.67%), rs6119471∗(6.67%), rs2378249 (6.67%), rs895828∗(6.67%), rs12896399∗(6.67%), rs1426654∗(6.67%), rs4959270∗(6.67%), rs1805005∗(6.67%), rs916977∗(6.67%), rs1805009∗(6.67%), rs3935591∗(6.67%), rs1037208∗(6.67%), rs1129038∗(6.67%)

Data view: Blue, Intermediate, Dark
Selection: rs1129038∗(16.67%), rs12913832∗(13.33%), rs4778138∗(13.33%), rs7170852∗(10.00%), rs1800407∗(10.00%), rs3794606∗(10.00%), rs1037208∗(10.00%), rs8039195∗(10.00%), rs7948623 (10.00%), rs4778241∗(10.00%), rs1448484∗(10.00%), rs1597196∗(10.00%), rs13289∗(10.00%), rs13289810 (10.00%), rs1393350∗(10.00%), rs4932620∗(6.67%), rs9894429 (6.67%), rs1800404∗(6.67%), rs1805006∗(6.67%), rs1805005∗(6.67%), rs885479∗(6.67%), rs2070959∗(6.67%), rs16950987∗(6.67%), rs2733832∗(6.67%), rs12203592∗(6.67%), rs4778137∗(6.67%), rs10756819∗(6.67%), rs11636232∗(6.67%), rs3212345∗(6.67%), rs10777129∗(6.67%)

Data view: Blue, Non-Blue
Selection: rs7494942∗(23.33%), rs3794606∗(16.67%), rs1129038∗(16.67%), rs3935591∗(13.33%), rs11230664∗(13.33%), rs8039195∗(13.33%), rs13289∗(13.33%), rs10424065∗(13.33%), rs2424984∗(13.33%), rs16891982∗(13.33%), rs1037208∗(10.00%), rs2733832∗(10.00%), rs1805006∗(10.00%), rs4959270∗(10.00%), rs6497271∗(10.00%), rs1325127∗(10.00%), rs4778232∗(10.00%), rs1800407∗(6.67%), rs11636232∗(6.67%), rs3212345∗(6.67%), rs6119471∗(6.67%), rs1800404∗(6.67%), rs9894429 (6.67%), rs2378249 (6.67%), rs1375164∗(6.67%), rs13289810 (6.67%), rs1597196∗(6.67%), rs2835630∗(6.67%), rs4778137∗(6.67%), rs1393350∗(6.67%), rs1900758∗(6.67%), rs4932620∗(6.67%), rs16950987∗(6.67%), rs7170852∗(6.67%), rs7948623 (6.67%), rs1110400∗(6.67%), rs2238289∗(6.67%)

Data view: Brown, Non-Brown
Selection: rs7494942∗(23.33%), rs1129038∗(20.00%), rs28777∗(16.67%), rs3768056∗(13.33%), rs2070959∗(10.00%), rs11230664∗(10.00%), rs895829∗(10.00%), rs4778137∗(10.00%), rs1448484∗(10.00%), rs2402130∗(10.00%), rs13289∗(10.00%), rs1800404∗(10.00%), rs2378249 (6.67%), rs4778241∗(6.67%), rs916977∗(6.67%), rs2835630∗(6.67%), rs2240203∗(6.67%), rs4778138∗(6.67%), rs1805006∗(6.67%), rs1805009∗(6.67%), rs1037208∗(6.67%), rs6497271∗(6.67%), rs2424984∗(6.67%), rs6119471∗(6.67%), rs1597196∗(6.67%), rs1375164∗(6.67%), rs885479∗(6.67%), rs3794606∗(6.67%), rs7170852∗(6.67%)

Data view: Light, Non-Light
Selection: rs7494942∗(16.67%), rs12913832∗(16.67%), rs1129038∗(16.67%), rs6497271∗(13.33%), rs13289∗(13.33%), rs16950987∗(13.33%), rs1900758∗(10.00%), rs2733832∗(10.00%), rs885479∗(10.00%), rs1724630∗(10.00%), rs7170852∗(10.00%), rs11636232∗(10.00%), rs2424984∗(10.00%), rs28777∗(6.67%), rs12896399∗(6.67%), rs4959270∗(6.67%), rs9894429 (6.67%), rs3935591∗(6.67%), rs1426654∗(6.67%), rs2238289∗(6.67%), rs2036213∗(6.67%), rs4778138∗(6.67%), rs10777129∗(6.67%), rs6510760 (6.67%), rs3794606∗(6.67%), rs3212345∗(6.67%), rs10424065∗(6.67%), rs2835630∗(6.67%)

Tables 5.14 and 5.15 present their respective measured qualities, in which bold entries indicate which algorithm retrieved the best feature selection and italic entries mark data views in which FS-NEAT outperformed N3O-D.

Table 5.14: Classification performance of selected features on skin color data views

Data View                                                   Baseline   UFS(MI)        N3O-D          FS-NEAT
White, Pale, Beige, Light Brown, Medium Brown, Dark Brown   31.73      44.96 ± 1.53   38.58 ± 0.83   42.46 ± 0.7
Light, Intermediate, Dark                                   57.52      79.07 ± 2.68   73.29 ± 1.36   69.63 ± 1.13
Light, Dark                                                 71.9       93.98 ± 1.66   89.49 ± 0.87   89.26 ± 0.32

Table 5.15: Classification performance of selected features on eye color data views

Data View                                            Baseline   UFS(MI)        N3O-D          FS-NEAT
Blue, Green, Hazel, Light Brown, Dark Brown, Black   23.91      61.92 ± 2.84   56.25 ± 0.51   56.95 ± 0.83
Blue, Intermediate, Dark                             51.92      81.20 ± 2.64   77.93 ± 3.47   78.14 ± 1.8
Blue, Non-Blue                                       76.09      93.49 ± 1.68   93.86 ± 1.41   78.81 ± 0.92
Brown, Non-Brown                                     61.29      89.19 ± 1.56   89.06 ± 1.46   83.11 ± 3.34
Light, Non-Light                                     67.89      94.76 ± 0.8    95.21 ± 0.94   84.01 ± 3.61

Tables 5.16 and 5.17 present the confusion matrices yielded by the evaluation of UFS(MI)'s feature selections, respectively for the original data views of skin and eye color. Tables 5.18 and 5.19 present the same information for the feature selections retrieved by the proposed method. Entries represent the percentage of classifications with respect to the number of evaluated predictions, rather than the absolute number of classifications. Only the predictions filtered to measure the classification performance of each algorithm were accounted for.
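For reference, such percentage-normalized matrices can be obtained as sketched below with scikit-learn (the helper name is illustrative):

```python
# Sketch: confusion matrix expressed as percentages of all evaluated predictions,
# matching the layout of Tables 5.16-5.19.
from sklearn.metrics import confusion_matrix


def confusion_percentages(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return 100 * cm / cm.sum()    # each entry: % of the evaluated predictions
```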

Table 5.16: Confusion matrix for UFS(MI) skin color predictions

                White   Pale    Beige   Light Brown   Medium Brown   Dark Brown
White            9.55   14.68    0.62      0.56           0.36          0
Pale            10.37   18.68    1.72      0.74           0.2           0
Beige            1.21    9.02    1.95      1.31           0.85          0.01
Light Brown      0.05    2.62    0.9       2.79           3.01          0.19
Medium Brown     0.21    0.73    0.26      2.68           7.29          1.14
Dark Brown       0       0.1     0.03      0.36           3.89          1.85

Table 5.17: Confusion matrix for UFS(MI) eye color predictions

                Blue    Green   Hazel   Light Brown   Dark Brown   Black
Blue            22.49    0.08    0.05      0.15          1.13       0
Green            4.78    0.03    0.08      0.12          3.16       0
Hazel            0.46    0.09    0.1       0.3           5.6        0
Light Brown      0.07    0       0.8       0.34          8.75       0.06
Dark Brown       0.01    0.09    0.2       0.52         33.51       2.76
Black            0       0.01    0         0.05         11.22       3.51

Table 5.18: Confusion matrix for N3O-D skin color predictions

                White   Pale    Beige   Light Brown   Medium Brown   Dark Brown
White            0.22   25.57    0         0              0             0
Pale             0      31.05    0         0              0.68          0
Beige            0      13.92    0         0              0.45          0
Light Brown      0.68    6.16    0         0              2.73          0
Medium Brown     0.68    4.33    0         0              7.3           0
Dark Brown       0.22    0.91    0         0              5.02          0

Table 5.19: Confusion matrix for N3O-D eye color predictions

                Blue    Green   Hazel   Light Brown   Dark Brown   Black
Blue            22.09    0       0         0             1.82       0
Green            5.23    0       0         0             2.96       0
Hazel            1.82    0       0         0             4.78       0
Light Brown      3.18    0       0         0             6.15       0
Dark Brown       5.46    0       0         0            28.7        2.96
Black            0.22    0       0         0             9.11       5.46

Furthermore, Tables 5.20 and 5.21 describe the capability of N3O-D and FS-NEAT to solve the given classification task on their own, while Tables 5.22 and 5.23 depict the average input size of the best topology found in each neuroevolution run.

Table 5.20: Classification performance of champion topologies on skin color data

Data View                                                   Baseline   N3O-D           FS-NEAT
White, Pale, Beige, Light Brown, Medium Brown, Dark Brown   31.73      18.08 ± 9.31    18.24 ± 8.7
Light, Intermediate, Dark                                   57.52      32.53 ± 16.91   14.02 ± 3.53
Light, Dark                                                 71.9       46.69 ± 22.6    22.06 ± 3.86

Table 5.21: Classification performance of champion topologies on eye color data

Data View                                            Baseline   N3O-D           FS-NEAT
Blue, Green, Hazel, Light Brown, Dark Brown, Black   23.91      17.85 ± 10.6    18.12 ± 9.8
Blue, Intermediate, Dark                             51.92      39.43 ± 13.91   32.79 ± 11.58
Blue, Non-Blue                                       76.09      52.29 ± 23.78   56.77 ± 23.03
Brown, Non-Brown                                     61.29      55.02 ± 18.06   51.73 ± 12.3
Light, Non-Light                                     67.89      48.06 ± 15.85   55.33 ± 19.56

Table 5.22: Input size of evolved topologies on skin color data

Data View                                                   N3O-D         FS-NEAT
White, Pale, Beige, Light Brown, Medium Brown, Dark Brown   4.4 ± 2.33    4.26 ± 2.51
Light, Intermediate, Dark                                   4.23 ± 2.1    3.53 ± 1.74
Light, Dark                                                 4.23 ± 2.41   3.86 ± 2.56

Table 5.23: Input size of evolved topologies on eye color data views

Data View                                            N3O-D         FS-NEAT
Blue, Green, Hazel, Light Brown, Dark Brown, Black   5.16 ± 2.81   5.16 ± 3.11
Blue, Intermediate, Dark                             3.53 ± 2.3    3.93 ± 2.29
Blue, Non-Blue                                       4.03 ± 2.16   3.6 ± 2.53
Brown, Non-Brown                                     3.5 ± 1.97    2.96 ± 1.74
Light, Non-Light                                     3.53 ± 2.29   3.83 ± 2.46

To perform a functional analysis of the retrieved selections, the selected SNPs were queried on the public databases SNPedia 3, the United States National Library of Medicine 4 (USNBI) and GeneCards 5. SNPs that were somehow associated with skin or eye color phenotypes (either because they reside nearby genes that are known to influence the phenotypes or because they were previously related to skin or eye color pigmentation) are listed in Table 5.24.


SNPs whose association with the analyzed phenotypes is unknown are reported in Table 5.25. The reports present the genetic location of each selected SNP if it is located nearby a known gene.

As Tables 5.20 and 5.21 showed, the individuals evolved by N3O-D and FS-NEAT did not present classification performance superior to the baseline and yielded a high variance on cross-validation results, which raises the suspicion that the optimization implicitly employed by evolution has not converged or is not properly configured. The comparison in Tables 5.14 and 5.15 demonstrated that the proposed method's selections, in general, did not improve on UFS(MI), except for the two least balanced eye data views, Blue vs Non-Blue and Light vs Non-Light. The proposed method retrieved inputs that consistently outperformed those of its base algorithm: except for the original data view for skin color, N3O-D only failed to improve on FS-NEAT by an accuracy difference within the measured standard deviation. In Tables 5.16, 5.17, 5.18 and 5.19 it is observable that the induced classifiers often err by outputting labels of over-represented classes (Blue and Dark Brown for eye data, and Pale for skin data). Such a trend is stronger in N3O-D's predictions.

As in Whiteson et al. (2005), FS-NEAT and N3O-D evolved topologies with an increasing number of selected features as evolution progressed, but at a much slower and more inconsistent pace. In comparison to Grisci, Feltes and Dorn (2018), the NE algorithms evaluated in this study evolved topologies with fewer connected inputs. It is important to highlight that N3O was proposed and evaluated on data that was subject to an intensive curation process (FELTES et al., 2019), while the data applied to the experiments in this study was treated with imputation only, as described in Section 5.1.1. Furthermore, gene expression data is represented by real-valued numbers which can directly serve as input to neural networks, in contrast to the SNPs data which, in this study, was transformed via one-hot encoding.

5.6 Chapter conclusion

This chapter presented the experimental evaluation employed to assess N3O-D's feature selection capability and its ability to evolve networks that solve the phenotype prediction problem. The proposed method was compared to FS-NEAT, its base algorithm, as well as to UFS(MI), a state-of-the-art feature selection framework. Although the topologies found through neuroevolution could not surpass the baseline accuracy, the proposed

Table 5.24: Report of relevant selected SNPs. na entries correspond to SNPs whose location was not related to a known gene. All listed SNPs were previously associated with eye or skin color phenotypes.

SNP          Gene
rs1037208    OCA2
rs10424065   MFSD12
rs1042602    TYR
rs10756819   BNC2
rs10777129   KITLG
rs1110400    MC1R
rs11230664   ASIP
rs1129038    HERC2
rs11636232   HERC2
rs12203592   IRF4
rs12896399   na
rs12913832   HERC2
rs1325127    OCA2
rs13289      SLC45A2
rs1375164    OCA2
rs1393350    TYR
rs1426654    SLC24A5
rs1448484    OCA2
rs1597196    OCA2
rs16891982   SLC45A2
rs16950987   HERC2
rs1724630    MYO5A
rs1800404    OCA2
rs1800407    HERC2
rs1805005    MC1R
rs1805006    MC1R
rs1805009    MC1R
rs183671     SLC45A2
rs1900758    OCA2
rs2036213    OCA2
rs2070959    OCA2
rs2238289    HERC2
rs2240203    HERC2
rs2402130    SLC24A4
rs2424984    OCA2
rs2594935    ASIP
rs2733832    TYRP1
rs2835630    TTC3
rs28777      SLC45A2
rs3212345    na
rs3768056    LYST
rs3794606    OCA2
rs3935591    HERC2
rs4778137    OCA2
rs4778138    OCA2
rs4778232    OCA2
rs4778241    OCA2
rs4932620    HERC2
rs4959270    na
rs6119471    UGT1A6-10
rs642742     na
rs6497271    HERC2
rs683        TYRP1
rs7170852    HERC2
rs7170989    OCA2
rs7494942    HERC2
rs8039195    DDB1
rs885479     MC1R
rs895828     OCA2
rs895829     HERC2
rs916977     HERC2

Table 5.25: Report of novel selected SNPs. na entries correspond to SNPs whose location was not related to a known gene. The listed SNPs are not associated with eye or skin color phenotypes, given the employed functional enrichment.

SNP          Gene
rs13289810   na
rs2378249    PIGU
rs6510760    na
rs7948623    TMEM138
rs9894429    NPLOC4

posed method’s retrieved selection yielded the best accuracy performance on some of the employed data views. Also, N3O-D presented a consistent improvement on FS-NEAT, matching previous results in which mutual information was pointed out as a useful guide to feature selection. The following chapter concludes the study, relating obtained results with the work’s objectives and pointing directions for future work. 65

6 CONCLUSION

A novel feature selection method based on neuroevolution and adapted to categorical data was proposed and evaluated on the genotype-phenotype prediction problem, posed by a data set with SNPs biomarkers labeled with eye and skin color. The proposed method, named N3O-D, utilized mutual information to adapt a recent neuroevolution feature selection algorithm that was successfully applied to microarray data sets. N3O-D's inference and selection capabilities were compared to a state-of-the-art framework that also uses mutual information to guide feature selection. Given the current experimental settings, the proposed method presented poor classification performance on its own, being unable to surpass the baseline accuracy. Nevertheless, some of its selected attributes improved SVM's accuracy in comparison to univariate feature selection, in particular on imbalanced training sets labeled with eye color. The poor classification performance warrants further investigation into N3O-D's parameter configuration, especially on the data sets in which NEAT, FS-NEAT and N3O showed promising results.

The evaluated algorithms based on neuroevolution were developed from scratch in the Python 1 programming language, employing TensorFlow 2, scikit-learn 3 and NetworkX 4 as main dependencies. This venture was motivated by the lack of generalization provided by existing neuroevolution libraries. For instance, NEAT-Python 5 does not provide an implementation of FS-NEAT. Although the library is well documented and provides a configuration file interface, obtaining such an implementation requires the modification (rather than the extension) of the source code, which is not a trivial task. The code developed in this study is to be improved and generalized to provide free, well-written and extensible computational tools for future neuroevolution research, through the release of an open-source library focused on neuroevolutionary algorithms.

Evolutionary algorithms internally produce statistics concerning evolution runs, which provide useful insight into how the algorithm can be improved by tuning its parameters, besides better detailing the evolution process (GRISCI; FELTES; DORN, 2018). Examples are the progress of the population's mean and incumbents' fitness; species rise, progress and extinction; graphical representations of evolved networks; and the frequency with which inputs are selected throughout evolution. In this study, such statistics were not collected but are considered fundamental for the interpretation and improvement of neuroevolution algorithms.

Another line of future work is the employment of quantization (discretization) on continuous data as a noise reduction strategy (DING; PENG, 2003). Such a transformation turns a numerical data set into a discrete one, which motivates the application of N3O-D as a feature selector.

REFERENCES

ACADEMIA.EDU. Image depicting a single nucleotide polymorphism. 2015. Available from Internet: .

AGGARWAL, C. et al. Feature selection for classification: A review. Data Classification: Algorithms and Applications, 2014.

ALPAYDIN, E.; JORDAN, M. Local linear perceptrons for classification. IEEE Transactions on Neural Networks, 1996.

ANG, J. et al. Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016.

ARENAS, M. et al. Forensic genetics and genomics: Much more than just a human affair. PLOS Genetics, 2017.

AVILA, E. et al. Forensic characterization of brazilian regional populations through massive parallel sequencing of 124 snps included in hid ion ampliseq identity panel. Forensic Science International: Genetics, 2019.

BATISTA, G.; MONARD, M. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 2003.

BELKIN, M. et al. Reconciling modern machine learning practice and the classical bias-variance trade-off. National Academy of Sciences of the United States of America (Proceedings), 2019.

BEYER, H.; SCHWEFEL, H. Evolution strategies - a comprehensive introduction. Natural Computing, 2002.

BOER, P. D. et al. A tutorial on the cross-entropy method. Annals of Operations Research, 2005.

BUNNEY, P. et al. Feature selection by optimizing a lower bound of conditional mutual information. Physiology and Behavior, 2017.

CALUS, M.; VEERKAMP, R. Accuracy of breeding values when using and ignoring the polygenic effect in genomic breeding value estimation with a marker density of one snp per cm. Journal of Animal Breeding and Genetics, 2007.

CEDRIC, S. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. School of Electrical Engineering and Computer Science (EECS), 2018.

CILIA, N. et al. An experimental comparison of feature-selection and classification methods for microarray datasets. Information (Switzerland), 2019.

CORTE-REAL, F.; VIEIRA, D. Princípios de Genética Forense. 1. ed. [S.l.: s.n.], 2015. ISBN 978-8576253600.

CORTES, C.; VAPNIK, V. Support-vector networks. IEEE Expert-Intelligent Systems and their Applications, 1992.

COVER, T.; THOMAS, J. Elements of Information Theory. 2. ed. [S.l.: s.n.], 2006.

DASH, M.; LIU, H. Feature selection for classification, in intelligent data analysis. Elsevier Intelligent Data Analysis, 1997.

DING, C.; PENG, H. Minimum redundancy feature selection from microarray gene expression. Computation Systems Bioinformatics, 2003.

DRUZHKOV, P.; KUSTIKOVA, V. A survey of deep learning methods and software tools for image classification and object detection. Pattern Recognition and Image Analysis, 2016.

DUFOURQ, E.; BASSET, B. Eden: Evolutionary deep networks for efficient machine learning. Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), 2017.

FELTES, B. et al. Cumida: An extensively curated microarray database for benchmarking and testing of machine learning approaches in cancer research. Journal of Computational Biology, 2019.

FLACH, P. The geometry of roc space: Understanding machine learning metrics through roc isometrics. Twentieth International Conference on Machine Learning Proceedings, 2003.

FLOREANO, D.; DüRR, P.; MATTIUSSI, C. Neuroevolution: From architectures to learning. Evolutionary Intelligence, 2008.

GRAY, R. M. Entropy and Information Theory. 1. ed. [S.l.]: Springer, 1990. ISBN 978-1441979704.

GRISCI, B.; FELTES, B.; DORN, M. Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of Biomedical Informatics (JBI), 2018.

GUYON, I. et al. Gene selection for cancer classification using support vector machines. Machine Learning, 2002.

HAN, E.; KARYPIS, G.; KUMAR, V. Text categorization using weight adjusted k-nearest neighbor classification. Lecture Notes in Computer Science (LNCS), 2001.

HORNIK, K. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991.

JOBLING, M.; GILL, P. Encoded evidence: Dna in forensic analysis. Nature Review Genetics, 2004.

KRUSKAL, W. H.; WALLIS, W. A. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 1952.

LAWRENCE, S.; GILES, C. Overfitting and neural networks: Conjugate gradient and backpropagation. International Joint Conference on Neural Networks (Proceedings), 2000.

LAZAR, C. et al. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012.

LIU, F. et al. Eye color and the prediction of complex phenotypes from genotypes. Current Biology, 2009.

LOPEZ-RINCON, A. et al. Evolutionary optimization of convolutional neural networks for cancer miRNA biomarkers classification. Applied Soft Computing Journal, 2018.

LU, H.; CHEN, J.; YAN, K. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing, 2017.

LUKE, S. Essentials of Metaheuristics. 2. ed. [S.l.: s.n.], 2013. ISBN 978-1300549628.

MATUKUMALLI, L. et al. Application of machine learning in snp discovery. BMC Bioinformatics, 2006.

MIAO, J.; NIU, L. A survey on feature selection. Procedia Computer Science, 2016.

MORSE, G.; STANLEY, K. Simple evolutionary optimization can rival stochastic gradient descent in neural networks. Genetic and Evolutionary Computation Conference (GECCO), 2016.

MULDER, W. D.; BETHARD, S.; MOENS, M. A survey on the application of recurrent neural networks to statistical language modeling. Computer Speech and Language, 2015.

NOI, P.; KAPPAS, M. Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using sentinel-2 imagery. Sensors, 2017.

OGDEN, R.; LINACRE, A. Wildlife forensics science: A review of genetic geographic origin assignment. Forensic Science International: Genetics, 2015.

PAPAVASILEIOU, E.; JANSEN, B. The importance of the activation function in neuroevolution with fs-neat and fd-neat. IEEE Symposium Series on Computational Intelligence Proceedings (SSCI), 2018.

POTDAR, K.; TAHER, S.; CHINMAY, D. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 2017.

QU, Y.; WANG, M. Approximation capabilities of neural networks on unbounded domains. arXiv:1910.09293, 2019.

RUCK, D. W. et al. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1990.

RUMELHART, D. E.; HINTON, G. E.; WILLIAMS, R. J. Learning internal representations by error propagation. [S.l.]: Parallel Distributed Processing, 1986. ISBN 978-0262291408.

SHANNON, C. E. A mathematical theory of communication. The Bell System Technical Journal, 1948.

SOHANGIR, S.; RAHIMI, S.; GUPTA, B. Optimized feature selection using neuroevolution of augmenting topologies (neat). Proceedings of the 2013 Joint IFSA World Congress and NAFIPS Annual Meeting, 2013.

SOHANGIR, S.; RAHIMI, S.; GUPTA, B. Neuroevolutionary feature selection using neat. Journal of Software Engineering and Applications, 2014.

SOKOLOVA, M.; JAPKOWICZ, N.; SZPAKOWICZ, S. Beyond accuracy, f-score and roc: a family of discriminant measures for performance evaluation. AI 2006: Advances in Artificial Intelligence, 2006.

STANLEY, K.; CLUNE, J.; LEHMAN, J. Designing neural networks through neuroevolution. Nature Machine Intelligence, 2019.

STANLEY, K.; MIIKKULAINEN, R. Evolving neural networks through augmenting topologies. Evolutionary Computation, 2002.

STEUER, R. et al. The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics, 2002.

TAN, M. et al. Automated feature selection in neuroevolution. Evolutionary Intelligence, 2009.

TIELEMAN, T.; HINTON, G. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

UTANS, J.; MODDY, J.; REHFUSS, S. Input variable selection for neural networks: application to predicting the u.s. business cycle. IEEE/IAFE Conference on Computational Intelligence for Financial Engineering, Proceedings (CIFEr), 1995.

VERGARA, J.; ESTÉVEZ, P. A review of feature selection methods based on mutual information. Neural Computing and Applications, 2015.

WHITE, D.; RABAGO-SMITH, M. Genotype-phenotype associations and human eye color. Journal of Human Genetics, 2011.

WHITESON, S. et al. Automatic feature selection via neuroevolution. Proceedings of the Genetic and Evolutionary Computation Conference, 2005.

WILLIAMS, R.; WIENROTH, M. Social and ethical aspects of forensic genetics. Forensic Science Review, 2017.

XIE, L.; YUILLE, A. Genetic cnn. IEEE International Conference on Computer Vision (ICCV), 2017.

YOUNG, A.; BENONISDOTTIR, S.; PRZEWORSKI, M. Deconstructing the sources of genotype-phenotype associations in humans. Science, 2019.

ZAORSKA, K.; ZAWIERUCHA, P.; NOWICKI, M. Prediction of skin color, tanning and freckling from dna in polish population: linear regression, random forest and neural network approaches. Human Genetics, 2019.

ZHANG, G. Neural networks for classification: A survey. IEEE Transactions on Systems, Man and Cybernetics, 2000.