ExpertDiscovery and UGENE Integrated System for Intelligent Analysis of Regulatory Regions of Genes
In Silico Biology 11 (2011/2012) 97–108
DOI 10.3233/ISB-2012-0448
IOS Press
1386-6338/11/12/$27.50 © 2011/2012 – IOS Press and the authors. All rights reserved

Y.Y. Vaskin (a,*), I.V. Khomicheva (b), E.V. Ignatieva (c) and E.E. Vityaev (b)
(a) Novosibirsk State University, Department of Information Technology, Novosibirsk, Russia
(b) Institute of Mathematics, SD RAS, Novosibirsk, Russia
(c) Institute of Cytology and Genetics, SD RAS, Novosibirsk, Russia
(*) Corresponding author: Y.Y. Vaskin, Novosibirsk State University, Department of Information Technology, Pirogova St., 2, Novosibirsk, Russia. E-mail: [email protected].

Abstract. The task of automatically extracting the hierarchical structure of eukaryotic gene regulatory regions lies at the junction of biology, mathematics and information technology. Solving the problem involves understanding the sophisticated mechanisms of eukaryotic gene regulation and applying advanced data mining technologies. This paper discusses an integrated system that implements a powerful method for relation mining of biological data. The system makes it possible to take into account prior information about gene regulatory regions known to the biologist, to perform the analysis at each hierarchical level, and to search for a solution from simple hypotheses to complex ones. The integration of the ExpertDiscovery system into the UGENE toolkit provides a convenient environment for conducting complex research and automating the work of a biologist. For demonstration, the system has been applied to the recognition of SF1, SREBP and HNF4 vertebrate binding sites and to the analysis of human gene regulatory regions that promote liver-specific transcription.

Keywords: Complex signal, hierarchical analysis, recognition, gene regulatory regions, bioinformatics

1. Introduction

Analysis of gene regulatory regions and the search for structural and functional patterns are pressing problems of biology that are far from a final solution. In general, this is due to the complex structure of gene regulatory regions and the variety of mechanisms of transcription regulation. There is a need to analyze various data concerning the physical, chemical, structural and informational properties of gene regulatory regions, as well as experimental data on their functions.

The fundamental property of genes is gene expression (i.e., the ability to produce biologically active products: proteins or RNA). Gene expression occurs in two major stages: first, genes are transcribed into RNA, which is then translated to make proteins. Transcription is the first step leading to gene expression. During transcription, a DNA sequence is read by an RNA polymerase to produce an RNA molecule (a primary transcript) with essentially the same sequence as the gene. Given the complexity of multicellular eukaryotes, transcriptional regulation in these organisms needs to be very complex [30].

The intensity of transcription of each gene is precisely regulated depending on cellular conditions (the type of cells and tissues, the stage of organism development, the cell cycle, and inducers or repressors influencing the cell) [4]. A great number of different regulatory proteins are involved in the regulation of transcription, including transcription factors (TFs), coactivators, cosuppressors and mediators [15]. TFs play one of the most important roles in the process. They interact in a specific way with particular DNA regions, the transcription factor binding sites (TFBSs). Besides interacting with DNA, TFs participate in protein-protein interactions with other regulatory proteins, forming multiprotein complexes which activate or suppress gene transcription.

Flexible regulation of eukaryotic gene transcription is made possible by the presence of extensive gene regulatory regions that have a complex block-hierarchical structure [12].

The first level of the regulatory region hierarchy comprises the various transcription factor binding sites: short regions of DNA (10–20 nucleotides) recognized specifically by regulatory proteins (transcription factors) [17].

The next level is represented by composite elements, which consist of neighboring TFBSs that acquire new regulatory properties as a result of protein-protein interactions between the corresponding TFs. Composite elements of the synergistic type provide a non-additively high level of transcription activation as a result of protein-protein interactions. Composite elements of the antagonistic type include overlapping or very closely located TFBSs. In this case two transcription factors compete with each other for binding to DNA, so the stimulating effect of an activator is replaced by the suppressing effect of an inhibitor, or vice versa, depending on cellular conditions [12].

Regulatory units (promoter regions, enhancers, silencers) form the next level in the hierarchical structure of gene regulatory regions. They perform their functions because they contain TFBSs and composite elements that interact with regulatory proteins [5]. The location and length of the regulatory units vary considerably. Enhancers and silencers are units that activate or suppress transcription of a specific gene, and they may be located very far from the transcription start site (up to 50000 bp). Enhancers and silencers may be situated in the 5'- and 3'-flanking regions of genes or in introns. Promoter regions are regulatory units located immediately before the transcription start site. Their size usually varies from 200 to 1000 bp [1].

The highest level of the hierarchical structure of regulatory regions corresponds to the system of integrated transcription regulation [13], which is carried out with the participation of a complex system of regulatory proteins interacting with the whole set of regulatory units and elements of a specific gene. The composition of multiprotein complexes is determined by DNA-protein interactions based on the superposition of different DNA codes (linear, conformational, etc.).

There is a wide range of structures of regulatory regions because each specific gene must be regulated in a particular way depending on the cellular situation. For instance, according to recent data, the human genome encodes about 1500 TFs [2]. One may expect that the human genome (like the genomes of many other eukaryotic organisms) contains approximately the same number of different types of TFBSs, which provide the functional properties of gene regulatory regions.

The regulatory regions of each gene include a unique combination of TFBSs of different types. According to TRRD [28], the regulatory regions of a specific gene may contain more than 20 different experimentally confirmed TFBSs. Therefore the whole system of integrated regulation of a gene may include dozens of regulatory units [14]. The fact that the length and location of regulatory units may differ greatly adds considerably to their variety.

Nowadays there are many widely used computational methods for the analysis of regulatory regions, each of them dealing with a certain hierarchical level. For TFBS recognition, methods such as PWM [35], SITECON [18], SiteGA [16] and others are used. However, the task of TFBS recognition is methodologically very difficult because of the high variety of genomic transcription factor binding sites. As a result, all TFBS prediction methods developed so far have well-known shortcomings such as high over- or under-prediction rates (false positives or false negatives) [12].

The task of recognizing relationships between TFBSs corresponds to another hierarchical level of the regulatory region structure [5]. However, since regulatory regions contain unique combinations of TFBSs, the methods suffer from poorly representative training sets, which do not include enough particular cases of the general situation.

The task of analyzing gene regulatory units and the whole system of integrated transcription regulation is considerably more complex than the analysis of TFBSs and their combinations. The reason is the huge variety of regulatory region structures, caused by the presence of different elementary signals in regulatory regions (TFBSs and conformational, physical and chemical features) and by the variability of the lengths and locations of regulatory regions.

From an informatics point of view, the analysis of eukaryotic gene regulatory regions implies hierarchical analysis of genetic information. Solving the problem requires state-of-the-art computer technologies for intelligent data analysis (Data Mining and Knowledge Discovery). At present, none of the known methods can completely solve this problem. In most cases, in order to achieve a biologically significant result, scientists have to manually analyze a huge amount of information, which may be contradictory in some cases.

In general, automatic methods for the analysis and recognition of gene regulatory regions must consider various contextual, physical, chemical and conformational features of DNA. Constructing an integrated recognition method that would combine signals of different types, obtained by applying other recognition methods, is therefore a pressing problem.

In the current work a new method of intelligent regu- particular problems in the fields of psychophysics, cancer diagnostics and securities rates prediction. The heart of the system is semantic