Gene Discovery for Facioscapulohumeral Muscular Dystrophy by Machine Learning Techniques
Genes Genet. Syst. (2015) 90, p. 343–356

Félix F. González-Navarro1*, Lluís A. Belanche-Muñoz2, María G. Gámez-Moreno1, Brenda L. Flores-Ríos1, Jorge E. Ibarra-Esquer3 and Gabriel A. López-Morteo1

1Instituto de Ingeniería, Universidad Autónoma de Baja California, Mexicali 21280, México
2Dept. de Ciències de la Computació, Universitat Politècnica de Catalunya - Barcelona Tech, Barcelona 08034, Spain
3Facultad de Ingeniería, Universidad Autónoma de Baja California, Mexicali 21280, México

(Received 16 February 2015, accepted 8 August 2015; J-STAGE Advance published date: 10 March, 2016)

Edited by Ali Masoudi-Nejad
* Corresponding author. E-mail: [email protected]
DOI: http://doi.org/10.1266/ggs.15-00017

Facioscapulohumeral muscular dystrophy (FSHD) is a neuromuscular disorder that shows a preference for the facial, shoulder and upper arm muscles. FSHD affects about one in 20,000 to one in 400,000 people, and no effective therapeutic strategies are known to halt disease progression or reverse muscle weakness or atrophy. Many genes may be incorrectly regulated in affected muscle tissue, but the mechanisms responsible for the progressive muscle weakness remain largely unknown. Although machine learning (ML) has made significant inroads in biomedical disciplines such as cancer research, no reports have yet addressed FSHD analysis using ML techniques. This study explores a specific FSHD data set from an ML perspective. We report results showing a very promising small group of genes that clearly separates FSHD samples from healthy samples. In addition to numerical prediction figures, we show data visualizations and biological evidence illustrating the potential usefulness of these results.

Key words: Facioscapulohumeral muscular dystrophy, gene discovery, feature selection, machine learning, protein-protein association networks

INTRODUCTION

In recent years, machine learning (ML) has made significant inroads in the fields of bioinformatics and biomedicine; see, for example, Schölkopf et al. (2004). Specifically, in cancer research a variety of ML algorithms have been developed for tumor prediction by associating gene expression patterns with clinical outcomes for patients with tumors (Lukas et al., 2004). The majority of this research has focused on building accurate classification models from reduced sets of features. Some of the analyses also aimed to gain an understanding of the differences between normal and malignant cells and to identify genes that are differentially regulated during cancer development.

Facioscapulohumeral muscular dystrophy (FSHD) is an autosomal dominant neuromuscular disorder that shows a preference for the facial, shoulder and upper arm muscles, and is the third most common inherited muscular dystrophy (Flanigan, 2004; Tawil, 2008). Its incidence varies with geographic location and probably within different racial groups, recent estimates being one in about 400,000 to one in 20,000 (MDC, 2012). Patients usually become symptomatic in their twenties or later (Tawil and Maarel, 2006). The most common FSHD symptoms are progressive weakening of muscles and atrophy of the face, shoulder, upper arm and shoulder girdle, and lower limbs. These are usually accompanied by an inability to flex the foot upward, foot weakness, and an onset of right/left asymmetry (Tawil et al., 1998; Maarel et al., 2007). FSHD is considered a relatively benign dystrophy, despite the fact that around 20% of patients are eventually confined to a wheelchair (Tawil and Maarel, 2006), and an estimated 1% ultimately require breathing assistance (Wahl, 2007). Although no effective therapeutic strategies are known to either halt progression or reverse muscle weakness (or atrophy) (Rose and Tawil, 2004), a number of actions can provide symptomatic and functional improvement in many patients; the use of assistive devices such as braces, standing frames or walkers, and physical therapies such as exercises in water, helped by psychological support and speech therapy, may help to alleviate especially difficult life conditions.

It is believed that FSHD is caused by deletion of a subset of D4Z4 macrosatellite repeat units in the subtelomere of chromosome 4q (Maarel et al., 2011), but this modification needs to occur on a specific chromosomal background to cause FSHD. More than 95% of patients with clinical FSHD have an associated D4Z4 deletion on the 4q35 chromosome (Wahl, 2007). However, a small number of kindreds with typical FSHD do not display this deletion. Recent advances involve the DUX4 gene, a retrogene sequence within D4Z4 that encodes a double homeodomain protein whose exact function is not known. FSHD is the first example of a human disease associated with the inefficient repression of a retrogene in a macrosatellite repeat array (Maarel et al., 2011). Although the mechanisms responsible for progressive muscle weakness remain unknown, the study of this gene may offer a therapeutic route (Maarel et al., 2011).

The simultaneous monitoring of expression levels for thousands of genes may allow the study of the effects of certain treatments, diseases and developmental stages on gene expression. In particular, microarray-based gene expression analysis based on statistical or ML methods can be used to identify differentially expressed genes (which act as features, to use ML terminology) by comparing expression in affected and healthy cells or tissues. A gene expression data set typically consists of dozens of observations and up to tens of thousands of genes. Predictive model construction and validation in this situation are difficult and prone to yield unreliable results. As a result, dimensionality reduction and in particular feature subset selection (FSS) techniques may be very useful, as a way to reduce the problem complexity and facilitate expert medical diagnosis. In a practical medical context, the interpretability of the obtained solutions is also of paramount importance, limiting the applicability of methods such as Principal Components Analysis (PCA), which involves weighted combinations of many genes instead of individual genes. Moreover, data visualization in a low-dimensional representation space may become extremely important, as it helps doctors to gain insights into this medical area. Such contributions are scarce for the case of FSHD-associated gene expression data, probably due to a relative research bias toward more common diseases. This scenario is aggravated by the absence of publicly available scientific data, outside purely medical domains, although the situation is slowly changing. In contrast, there is now a vast body of available microarray gene expression data sets focused on cancer.

The development of simple predictive models that can distinguish between healthy and FSHD samples with a minimal recognition error rate is thus a clear research goal. This study explores a specific FSHD data set from an ML perspective. First, a dimensionality reduction stage is carried out. A feature selection algorithm is used as the main engine to select genes promoting simple models (in terms of low numbers of genes) capable of good generalization. A prior selection stage is carried out, using a well-known statistical test, to filter out genes with negligible discrimination ability. To avoid selection biases, the full selection process is embedded into an outer Monte Carlo validation process.

We report experimental results supporting the practical advantage of combining robust feature selection and classification in the analyzed FSHD data set. The described method was able to unveil a consistent small group of genes that yields high mean test set accuracies. The practical utility of these results is reinforced by low-dimensional visualizations and supported by current biological evidence.

MATERIALS AND METHODS

Data set. The database used in this study was obtained from the EMBL-EBI repository of the European Bioinformatics Institute (EMBL-EBI, 2014). Specifically, Experiment E-GEOD-3307 uses the Affymetrix GeneChip Human Genome HG-U133A and HG-U133B designs to analyze a group of muscle diseases for comparative gene expression profiling purposes. A total of 121 muscle samples of 11 muscle pathologies (plus several healthy samples) constitute the data: acute quadriplegic myopathy, juvenile dermatomyositis, amyotrophic lateral sclerosis, spastic paraplegia, facioscapulohumeral muscular dystrophy, Emery-Dreifuss muscular dystrophy, Becker muscular dystrophy, Duchenne muscular dystrophy, calpain 3, dysferlin, and FKRP. These are diseases with an extremely low incidence rate in the general population. The facioscapulohumeral muscular dystrophy group (FSHD, HG-U133A version), the target of this work, consists of 14 people showing FSHD (hereafter referred to as FSHD samples or cases) and 18 people not showing FSHD (healthy cases), described by p = 22,283 genes or features.

Feature selection algorithm. The first action taken was to filter out those genes that do not possess a minimum discriminative capacity. In particular, we use Welch's t-test (Pan, 2002), computed as follows:

$$t_j = \frac{\bar{x}_j^{+} - \bar{x}_j^{-}}{\sqrt{\dfrac{(s_j^{+})^2}{N^{+}} + \dfrac{(s_j^{-})^2}{N^{-}}}}, \qquad 1 \le j \le p \qquad (1)$$

where the subscript $j$ runs through all genes $G = \{g_1, \ldots, g_p\}$; $\bar{x}_j$ and $s_j^2$ denote the sample mean and variance of a gene, and $N$ is the sample size; the symbols $+$ and $-$ indicate positive (healthy) and negative