BioMed Research International

Bioinformatics Applications in Life Sciences and Technologies

Guest Editors: Sílvia A. Sousa, Jorge H. Leitão, Raul C. Martins, João M. Sanches, Jasjit S. Suri, and Alejandro Giorgetti

Copyright © 2016 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in “BioMed Research International.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Contents

Bioinformatics Applications in Life Sciences and Technologies Sílvia A. Sousa, Jorge H. Leitão, Raul C. Martins, João M. Sanches, Jasjit S. Suri, and Alejandro Giorgetti Volume 2016, Article ID 3603827, 2 pages

𝐸-Index for Differentiating Complex Dynamic Traits Jiandong Qi, Jianfeng Sun, and Jianxin Wang Volume 2016, Article ID 5761983, 13 pages

Advancements in RNASeqGUI towards a Reproducible Analysis of RNA-Seq Experiments Francesco Russo, Dario Righelli, and Claudia Angelini Volume 2016, Article ID 7972351, 11 pages

Identification of Gene Expression Pattern Related to Breast Cancer Survival Using Integrated TCGA Datasets and Genomic Tools Zhenzhen Huang, Huilong Duan, and Haomin Li Volume 2015, Article ID 878546, 10 pages

A Genetic Algorithm Based Support Vector Machine Model for Blood-Brain Barrier Penetration Prediction Daqing Zhang, Jianfeng Xiao, Nannan Zhou, Mingyue Zheng, Xiaomin Luo, Hualiang Jiang, and Kaixian Chen Volume 2015, Article ID 292683, 13 pages

How to Use SNP_TATA_Comparator to Find a Significant Change in Gene Expression Caused by the Regulatory SNP of This Gene’s Promoter via a Change in Affinity of the TATA-Binding Protein for This Promoter Mikhail Ponomarenko, Dmitry Rasskazov, Olga Arkova, Petr Ponomarenko, Valentin Suslov, Ludmila Savinkova, and Nikolay Kolchanov Volume 2015, Article ID 359835, 17 pages

Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 3603827, 2 pages http://dx.doi.org/10.1155/2016/3603827

Editorial Bioinformatics Applications in Life Sciences and Technologies

Sílvia A. Sousa,1 Jorge H. Leitão,1 Raul C. Martins,2 João M. Sanches,3 Jasjit S. Suri,4 and Alejandro Giorgetti5,6

1 Institute for Bioengineering and Biosciences (IBB), Department of Bioengineering, Instituto Superior Técnico, Universidade de Lisboa, Avenida Rovisco Pais, 1049-001 Lisbon, Portugal
2 Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
3 Institute for Systems and Robotics, Instituto Superior Técnico, Universidade de Lisboa, Avenida Rovisco Pais, 1049-001 Lisbon, Portugal
4 AtheroPoint, Roseville, CA 95661, USA
5 Department of Biotechnology, University of Verona, Strada Le Grazie 15, 37134 Verona, Italy
6 Computational Biomedicine, Institute for Advanced Simulation IAS-5 and Institute of Neuroscience and Medicine INM-9, Forschungszentrum Jülich, 52425 Jülich, Germany

Correspondence should be addressed to Sílvia A. Sousa; [email protected]

Received 11 April 2016; Accepted 14 April 2016

Copyright © 2016 Sílvia A. Sousa et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Life sciences researchers collect and analyse large amounts of scientific data of different types, including DNA, RNA, and amino acid sequences, in situ and microarray gene expression data, protein structures and biological pathways, and biological signals and images of diverse origin. In recent years, a wealth of bioinformatics applications in the fields of basic and applied life sciences has changed the paradigm of both research and exploitation of knowledge. The development of novel and powerful bioinformatics tools dedicated to biological data acquisition, data mining, and analysis has empowered both basic and applied life sciences research. These bioinformatics developments span from tools for genome annotation and function prediction, gene expression analyses, and databases of biological information, to the emerging fields of biomedical applications of research, including the development of new bioinformatics-based devices and predictive applications.

This special issue is composed of five original research papers selected after in-depth peer review. Selected papers describe novel bioinformatics tools and/or databases for fundamental and/or applied research in the broad range of biological and biomedical sciences.

Understanding the genetic control of complex dynamic traits is of fundamental importance to agricultural, evolutionary, and biomedical genetic research. A statistical mapping framework, called functional mapping, has been developed and extensively used to characterize the quantitative trait loci (QTLs) or nucleotides (QTNs) that underlie a complex dynamic trait. However, this tool is not well suited when the curves are complex, especially in the case of nonmonotonic curves. Therefore, to overcome this problem, in their work J. Qi and colleagues propose the earliness index (E-index) to cumulatively measure the earliness degree to which a variable (or dynamic trait) increases or decreases its value. The authors show by both theoretical proofs and simulation studies that E-index is more general than functional mapping and can be applied to any complex dynamic trait, even those with nonmonotonic curves.

RNA-Seq experiments are nowadays extensively used in a wide range of studies, spanning from genome-wide gene expression and regulatory mechanisms underlying basic physiological traits to human pathologies, including cancer. However, RNA-Seq data analyses are complex and require the use of several different tools to manipulate and process the retrieved data. The work presented by F. Russo and collaborators shows recent advancements and novelties introduced in RNASeqGUI, a graphical interface that allows the user to handle and analyse big datasets (collected from RNA-Seq experiments) in a fast, efficient, and reproducible way. The version of RNASeqGUI presented here combines graphical interfaces with tools for reproducible research, such as literate statistical programming, human readable reports, parallel executions, caching, and interactive and web-exploitable tables of results.

The Cancer Genome Atlas (TCGA) data portal is a platform containing tumor gene expression data, together with clinical information, enabling researchers to gather information on significant genomic alterations that occur during the development and metastasis of a tumor. To help biomedical researchers to identify gene expression patterns related to breast cancer survival, Z. Huang et al. developed a web-based TCGA data analysis platform called TCGA4U, providing a visualization solution for the analysis of the relationships of genomic changes with the available clinical data. The authors believe that the use of TCGA4U will inspire more biomedical researchers to explore the biological mechanisms of those genes and more precisely explain their role in breast cancer development, paving the way for the discovery of more targeted therapies and helping more breast cancer patients.

Predicting blood-brain barrier (BBB) permeation is essential for drug design of molecules that act in the central nervous system (CNS). On the other hand, peripherally acting drugs must show limited ability to cross the BBB and therefore be devoid of action in the CNS. However, understanding the process of permeation is complicated, since compounds can cross the BBB by passive diffusion and/or active transport. As an alternative to invasive animal experiments, in silico screening methods have been introduced to assist in the development of central nervous system active drugs. In their paper, D. Zhang and colleagues describe the design and implementation of a genetic algorithm based support vector machine model to predict the BBB permeation ability of a given molecule, achieving more accurate results than currently available models.

A single nucleotide polymorphism (SNP) is the result of the variation of a single nucleotide at a specific position in the genome. Besides introducing some degree of genetic variation within a population, certain SNPs have been associated with the susceptibility of the individual to specific diseases. The accumulated knowledge resulting from the availability of human genome sequences and the association of specific SNPs with certain diseases prompted the development of the so-called predictive preventive personalized medicine. In their article, P. Ponomarenko and colleagues analysed the effects of SNPs occurring at the promoter region of specific genes on the affinity of the TATA-binding proteins to the promoter region using their web service SNP_TATA_Comparator. Throughout their work, the authors provide some examples and discuss how to use the bioinformatics application SNP_TATA_Comparator to analyse and extract unannotated SNPs from the database “1000 genomes.” Seventeen novel candidate SNP markers, putatively associated with several diseases, are reported.

The development of bioinformatics tools has changed the paradigm of research in both basic and applied biological sciences, as illustrated by the papers published in this special issue. While these tools enable scientists to gain knowledge of complex biological systems, they also allow envisioning the exploitation of results towards novel developments in biomedical applications, thus contributing to the welfare of both individuals and populations.

Sílvia A. Sousa
Jorge H. Leitão
Raul C. Martins
João M. Sanches
Jasjit S. Suri
Alejandro Giorgetti

Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 5761983, 13 pages http://dx.doi.org/10.1155/2016/5761983

Research Article 𝐸-Index for Differentiating Complex Dynamic Traits

Jiandong Qi,1 Jianfeng Sun,1 and Jianxin Wang1,2

1 School of Information, Beijing Forestry University, Beijing 100083, China 2Center for Computational Biology, Beijing Forestry University, Beijing 100083, China

Correspondence should be addressed to Jianxin Wang; [email protected]

Received 3 July 2015; Revised 28 October 2015; Accepted 11 February 2016

Academic Editor: Sílvia A. Sousa

Copyright © 2016 Jiandong Qi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

While it is a daunting challenge in current biology to understand how the underlying network of genes regulates complex dynamic traits, functional mapping, a tool for mapping quantitative trait loci (QTLs) and single nucleotide polymorphisms (SNPs), has been applied in a variety of cases to tackle this challenge. Though useful and powerful, functional mapping performs well only when one or more model parameters are clearly responsible for the developmental trajectory, typically being a logistic curve. Moreover, it does not work when the curves are more complex than that, especially when they are not monotonic. To overcome this inadaptability, we therefore propose a mathematical-biological concept and measurement, 𝐸-index (earliness-index), which cumulatively measures the earliness degree to which a variable (or a dynamic trait) increases or decreases its value. Theoretical proofs and simulation studies show that 𝐸-index is more general than functional mapping and can be applied to any complex dynamic traits, including those with logistic curves and those with nonmonotonic curves. Meanwhile, 𝐸-index vector is proposed as well to capture more subtle differences of developmental patterns.

1. Introduction

Whether there are different genes responsible for the formation of a trait and how these genes regulate the trait are of fundamental importance biologically, agriculturally, and/or medically. Quantitative traits, or characteristics varying in degree, can be attributed to the effects of genes and their environment [1]. Lander and Botstein [2] pioneered the systematic integration of molecular genetics and statistical methodologies to dissect quantitative traits into individual genetic loci, well known as quantitative trait loci (QTLs). Since then, quantitative differences in mass, length, and so forth of the whole individual or an organ in the mature state have been used to identify genes [3–8]. According to QTL mapping, individuals with different marker locus genotypes will have different mean values of a quantitative trait if a QTL is linked to the marker locus.

It should be noted, however, that single-valued traits are only a portion of the numerous traits; many others change with time or other independent variables and are the so-called complex traits. In fact, measurement values in the mature state provide much less information than the growth process leading to it [9]. For example, growth may be defined as quantitative changes in size, mass, or number, and the process is more biologically meaningful than the final state alone, that is, the measurement value of an individual or an organ. The complex traits, which can be expressed as a function or, visually, a curve, were thought of as infinite-dimensional characters in [10] or function-valued traits in [11]. Researchers have studied this problem extensively by biological, mathematical, or statistical means [12–15]. Other researchers tried to solve this problem by considering the complex trait (with a set of sampled values) as a bunch of simple traits [16–19]. However, if the values (a set of traits) of a complex trait are considered separately, the relationships between the values are lost or are too time-consuming to capture due to the large size of the residue covariance matrix. With the eigenvalues of the matrix, the dimensions can be reduced greatly and genetic mapping of a large number of traits becomes feasible [20, 21]. These methods, however, do not take into account the developmental mechanisms that regulate trait formation and variation.

Ma et al. [22] proposed functional mapping, a statistical framework for mapping QTLs that regulate the dynamic trajectories of traits. Functional mapping is constructed on the basis of a biological law, as presented by West et al. [23], that the growth of many organisms follows a logistic curve due to the fundamental metabolic principles for allocating energy between maintaining current tissue and gaining biomass. By incorporating the logistic curve, functional mapping differentiates a complex dynamic trait by the parameters of the logistic function, instead of directly by the trait values, and thus makes computation less time-consuming and the results more biologically meaningful. Since functional mapping was proposed in [22], there has been a wealth of literature about its variations, improvements, and applications [24–29]. To date, functional mapping has been successfully applied to associate high dimensions of SNPs with high dimensions of dynamic traits [30].

We now briefly review how functional mapping differentiates developmental trajectories. First of all, the growth process of an individual or an organ can be described by a growth curve, a function of a measurable variable against time. Theoretically, a growth curve may provide an infinite amount of information, unlike a single measurement value in a mature state. For example, we consider two bunches of growth curves which are described by the following two equations and illustrated in Figure 1:

$$ y = \frac{2}{1 + e^{-rt}} - 1, \tag{1} $$

$$ y = \frac{1}{1 + e^{-r(t-3)}}. \tag{2} $$

Figure 1: Growth curves with different parameter values; both panels plot $y$ (measurement value) against $t$ (time). (a) is a bunch of curves described by (1), with $r = 0.6, 0.7, \ldots, 2.9, 3.0$ from lower to upper. (b) is a bunch of curves described by (2), with $r = 1.1, 1.2, \ldots, 2.4, 2.5$ from lower to upper of the right part.

In practice, discrete values of the developmental process are measured and collected, based on which functional mapping recovers the process by describing it with a curve which is determined by one or more parameters ($r$ here indicating the growth rate). We can observe in Figure 1(a) that as the parameter $r$ increases, the corresponding growth curve becomes steeper at the beginning part. In Figure 1(b), a growth curve with greater $r$ increases slower at the beginning of the left part and faster at the end of the left part. The inclination of the right part is opposite to that of the left part. Therefore the parameter $r$ may act as a characteristic value of the bunch to differentiate the curves and thus differentiate growth types or styles.

If all growth curves can be described by a function with one or more varying parameters, then we can employ these parameters as the characteristic values, which is the essence of functional mapping. Unfortunately, no function is qualified for describing all growth types. Specifically, Figure 2 shows Scammon’s classic illustration [31] of different growth types of human beings that are almost impossible to describe with a uniform function. Therefore, functional mapping fails to work with curves like the nonmonotonic lymphoid type in Figure 2.

Figure 2: Four major types of growth curves (lymphoid, neural, general, and genital) of the organs and the body as a whole, from birth to 20 years [31]; growth (%) is plotted against age (years). (Later we will obtain their $E$-indices from upper to lower as 1.328, 0.802, 0.459, and 0.191, resp.)

The diversity of growth curves gives rise to a problem: how can we differentiate them with one or more characteristic values? An important characteristic value, $E$-index, will be proposed below, and the rest of the paper is organized as follows. $E$-index is defined and its properties are discussed in Section 2. In Section 3, a statistical framework for $E$-index is given and its effectiveness is validated through simulation studies. $E$-index vector is defined and validated in Section 4. Section 5 concludes the paper.
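Returning to the two bunches of growth curves in (1) and (2), the role of $r$ as a characteristic value can be checked numerically. The following is only a minimal sketch and not the authors' code: the names `curve1` and `curve2`, the chosen values of $r$, and the sample time point $t = 1$ are assumptions introduced here for illustration, and (2) is taken exactly as printed.

```python
import math

def curve1(t, r):
    """Equation (1): y = 2 / (1 + exp(-r t)) - 1."""
    return 2.0 / (1.0 + math.exp(-r * t)) - 1.0

def curve2(t, r):
    """Equation (2) as printed: y = 1 / (1 + exp(-r (t - 3)))."""
    return 1.0 / (1.0 + math.exp(-r * (t - 3.0)))

# A few illustrative growth rates (hypothetical choice, not from the paper).
rates = [0.6, 1.0, 2.0, 3.0]

# Early in the process (t = 1), a larger r gives a larger value of (1) ...
early1 = [curve1(1.0, r) for r in rates]
# ... but a smaller value of (2), whose midpoint sits at t = 3.
early2 = [curve2(1.0, r) for r in rates]

for r, y1, y2 in zip(rates, early1, early2):
    print(f"r = {r:.1f}: (1) at t=1 -> {y1:.3f}, (2) at t=1 -> {y2:.3f}")
```

In both cases the ordering of the values follows the ordering of $r$, which is what allows a single parameter to separate the curves within one bunch.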

2. $E$-Index’s Definition and Properties

2.1. Concept and Definition. As is shown in Figure 1, growth and development may proceed faster or slower, earlier or later, depending on the type. The earliness degree of growth, for which we are about to define an index, may play an important role in evaluating a growth curve, both mathematically and biologically.

It is common sense that a growth curve is continuous and smooth, but in order to elucidate the concept of $E$-index intuitively and for simplicity, we design an imaginary scenario in which an individual gains part of its height instantaneously (though this is impossible), as shown in Figure 3. It takes each of the 6 individuals indicated in Figure 3 exactly 9 units of time (from 0 to 9) to gain 5 units of height (from 2 at the beginning to 7 at the end).

Take Figure 3(c), for instance, first. The individual indicated by it keeps its original height 2 for the first 2 units of time, and then its height instantaneously increases by 4 units at the time point 2. After that, it keeps the height for 4 units of time, until it increases its height again by 1 unit at time point 6. Finally it keeps the height 7 to the end point.

Compared with (c), the individual indicated by (d) grows “later” since, at the “earlier” time point 2, it increases less, while, at the “later” time point 6, it increases more. Intuitively, the earliness degree of the individual indicated by (c) is more than that of the individual indicated by (d). But we need to quantitatively measure the earliness degree to systematically reflect the difference and comparison. Obviously, two factors are to be considered: the increased height and the time span from the time point when the increase occurs to the end time point, and therefore we use their product to represent the earliness degree.

On the basis of the analysis and discussion above, we are now able to calculate the earliness degree of individual (c) as the sum of the areas of two rectangles, one being 4 (increased height) by 7 (time span to the end) and the other being 1 (increased height) by 3 (time span). The area sum is 4 × 7 + 1 × 3 = 31, which can be standardized to be 31/45 = 0.689, by being divided by the area of the entire rectangle of 5 (total increased height) by 9 (whole time span from beginning to end).

Similarly, the earliness degree of (d) is calculated as (1 × 7 + 4 × 3)/45 = 0.422, which is much less than 0.689, the earliness degree of (c).

Using the same method we can calculate the earliness degree of the other 4 individuals, with those of (a) and (b) being trivially 1 and 0, respectively. But the cases of (e) and (f) are more complicated, since the height of (e), before reaching the end time point, has increased to a value 8, a greater value than that of the mature state, 7; and the height of (f) has decreased to a value 1, a smaller value than the beginning value 2. Nevertheless, the earliness degree of (e) can be calculated as (6 × 9 − 1 × 2)/45 = 1.156, a value greater than 1, and that of (f) as (−1 × 9 + 1 × 6)/45 = −0.067, a negative value.

We denote the quantitative earliness degree as $E$-index for short. Though imaginary and impossible in real life, the scenario presented in Figure 3 gives us a helpful intuition and clue to define the $E$-index rigorously.

To give the definition, we do not require a growth curve to be globally differentiable, but it is currently required to be piecewise differentiable, a requirement which, as we will see later, need not be met. We hence give the definition of $E$-index.

Definition 1. Suppose that the growth curve is a continuous function $f(t)$ defined on a closed interval $[a, b]$, $a < b$. $\{p_i\}_{i=0}^{n}$ is a sequence of points, $p_0 = a$, $p_n = b$, with $p_i < p_j$ for $0 \le i < j \le n$. Also suppose that $f(t)$ is differentiable on the open interval $(p_{i-1}, p_i)$, $i = 1, 2, \ldots, n$, and that $f'(t)$ is its derivative function. Then we define the $E$-index of the growth curve as follows:

$$ E_a^b(f) = \frac{1}{(b-a)\,(f(b)-f(a))} \sum_{i=1}^{n} \int_{p_{i-1}}^{p_i} f'(t)\,(b-t)\,dt. \tag{3} $$

It should be noted that $f'(t)$, the growth rate at time point $t$, is undefined on each inner split point. But this would not change the integration result, even if we set the growth rate at such a point to be any value. For simplicity, we denote $E_a^b(f)$ as $E(f)$.

How early or how late the growth rate $f'(t)$ occurs is our key concern for growth, and the expression $f'(t)(b-t)$ in (3) quantifies the degree of earliness. The greater the product value $f'(t)(b-t)$ is, the earlier the growth or development occurs. In this sense, the $E$-index measures how early growth occurs in the whole process by accumulating the product along the time.

2.2. Properties of $E$-Index. From the definition above, we can derive several of $E$-index’s properties, which are discussed in the form of propositions. However, their proofs are all omitted since they can be found in calculus textbooks or related literature.

Proposition 2. If the growth curve function $f(t)$ defined on the closed interval $[a, b]$ is strictly monotonically increasing and is globally differentiable, then its $E$-index can be calculated with the following equation:

$$ E(f) = \frac{1}{(b-a)\,(f(b)-f(a))} \left( \int_a^b f(t)\,dt - f(a)\,(b-a) \right) = \frac{\int_a^b f(t)\,dt}{(b-a)\,(f(b)-f(a))} - \frac{f(a)}{f(b)-f(a)}. \tag{4} $$

Proposition 2 provides us with an alternative approach to calculate the $E$-index and reveals the relation between the integrations along the horizontal direction and along the vertical direction.

Proposition 3. The conclusion of Proposition 2 still holds if the growth curve function is still globally differentiable, but not necessarily monotonic.
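The step from the definition (3) to the closed form (4) is a single integration by parts. A short sketch of that step, assuming $f$ is differentiable on the whole of $[a, b]$ as in Propositions 2 and 3 (monotonicity is not used anywhere in it):

```latex
\begin{align*}
\int_a^b f'(t)\,(b-t)\,dt
  &= \Bigl[\, f(t)\,(b-t) \,\Bigr]_a^b + \int_a^b f(t)\,dt \\
  &= -f(a)\,(b-a) + \int_a^b f(t)\,dt ,
\end{align*}
so that, after dividing by $(b-a)\bigl(f(b)-f(a)\bigr)$,
\[
E(f) = \frac{\int_a^b f(t)\,dt}{(b-a)\bigl(f(b)-f(a)\bigr)}
       - \frac{f(a)}{f(b)-f(a)} .
\]
```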

Figure 3: An imaginary scenario about growth. Panels (a) to (f) plot height $h$ against time $t$; the legend distinguishes earliness accumulated, earliness lost, and earliness both accumulated and lost. The time interval of the growth is from 0 to 9, and the measurement values are all 2 at the beginning point and are all 7 at the end point.
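The rectangle calculation for the scenario in Figure 3 can be written out as a few lines of code. This is a minimal sketch rather than the authors' implementation: the function name `step_earliness` is hypothetical, the jumps for (c) and (d) are those stated in the text, and the jumps listed for (a), (b), (e), and (f) are one reading inferred from the quoted products, so they should be treated as assumptions.

```python
def step_earliness(jumps, t_start, t_end, h_start, h_end):
    """Earliness degree of a step-shaped growth curve.

    jumps is a list of (time, height_change) pairs. Each jump contributes
    height_change * (t_end - time); the sum is divided by the area of the
    whole rectangle (total time span times total height gain).
    """
    accumulated = sum(dh * (t_end - t) for t, dh in jumps)
    return accumulated / ((t_end - t_start) * (h_end - h_start))

# (c) and (d) follow the text; (a), (b), (e), (f) are inferred (assumptions).
scenarios = {
    "a": [(0, 5)],            # all growth at the very beginning
    "b": [(9, 5)],            # all growth at the very end
    "c": [(2, 4), (6, 1)],
    "d": [(2, 1), (6, 4)],
    "e": [(0, 6), (7, -1)],   # overshoots to 8, then drops back to 7
    "f": [(0, -1), (8, 6)],   # drops to 1 first, then rises to 7
}

# Every individual grows from height 2 to height 7 over the interval [0, 9].
earliness = {name: step_earliness(jumps, 0, 9, 2, 7)
             for name, jumps in scenarios.items()}

for name, value in earliness.items():
    print(f"({name}) {value:.3f}")
```

Under these assumptions the printed values agree with the figures quoted in the text: 1 and 0 for (a) and (b), 0.689 and 0.422 for (c) and (d), and 1.156 and −0.067 for (e) and (f).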

We can illustrate the proof with Figure 4. Suppose there is only one inner extreme point (the case of more inner extreme points can be proved similarly). The integration for the left part of the curve forms the red area, while that for the right part forms the green area, which is negative. Their sum is the area between the curve and the horizontal line $y = f(a)$, which is the conclusion we wanted.

Proposition 4. The conclusion of Proposition 3 still holds if the growth curve function is piecewise differentiable.

Figure 4: Illustrations for the proposition proofs; each panel plots $y$ against $t$ on $[a, b]$, with the levels $f(a)$ and $f(b)$ marked. (a) is a growth curve strictly monotonically increasing. (b) is a nonmonotonic growth curve. (c) is a piecewise smooth growth curve with an unsmooth internal point $p$. (d) is a resulting growth curve by smoothening that in (c) near the unsmooth internal point, between $p - q$ and $p + q$.
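Propositions 2 to 4 say that the definition (3) and the closed form (4) give the same number. A minimal numerical sketch of that claim follows; it is not taken from the paper. The midpoint-rule integrator, the function names, and the nonmonotonic test curve $f(t) = 5t - t^2$ on $[0, 4]$ are all assumptions made for illustration, while the monotonic example is curve (1) with $r = 1$ on $[0, 8]$.

```python
import math

def integrate(g, a, b, n=20000):
    """Composite midpoint rule for the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def e_index_def(f, df, a, b):
    """E-index from the definition (3), using the derivative df."""
    num = integrate(lambda t: df(t) * (b - t), a, b)
    return num / ((b - a) * (f(b) - f(a)))

def e_index_closed(f, a, b):
    """E-index from the closed form (4), using f only."""
    num = integrate(f, a, b) - f(a) * (b - a)
    return num / ((b - a) * (f(b) - f(a)))

def logistic(t):
    """Curve (1) with r = 1 (monotonic)."""
    return 2.0 / (1.0 + math.exp(-t)) - 1.0

def d_logistic(t):
    """Derivative of curve (1) with r = 1."""
    return 2.0 * math.exp(-t) / (1.0 + math.exp(-t)) ** 2

def bump(t):
    """Hypothetical nonmonotonic curve: peaks at t = 2.5, ends at f(4) = 4."""
    return 5.0 * t - t * t

def d_bump(t):
    """Derivative of the hypothetical nonmonotonic curve."""
    return 5.0 - 2.0 * t

results = {
    "logistic": (e_index_def(logistic, d_logistic, 0.0, 8.0),
                 e_index_closed(logistic, 0.0, 8.0)),
    "bump": (e_index_def(bump, d_bump, 0.0, 4.0),
             e_index_closed(bump, 0.0, 4.0)),
}
for name, (by_def, by_closed) in results.items():
    print(f"{name}: (3) -> {by_def:.4f}, (4) -> {by_closed:.4f}")
```

For the hypothetical nonmonotonic curve the exact value is 7/6, which is above 1 because the curve exceeds its final value for part of the interval. The closed form needs only function values, which is convenient when a derivative is not available.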

Propositions 2–4 indicate that we can calculate $E$-index with (4), no matter whether the growth curve is monotonic or not and no matter whether it is piecewise smooth or globally smooth. In fact, if the growth curve function is not differentiable, or even not continuous, we are still able to calculate its $E$-index with (4), without changing the meaning of $E$-index.

Typically, the measurement value in the growth process is between the value at the beginning and that at the end. We have still another proposition for this situation.

Proposition 5. If the growth curve function $f(t)$ is defined on $[a, b]$ and $f(a) \le f(t) \le f(b)$, then $0 \le E(f) \le 1$.

But $E(f)$ may get a value greater than 1 if, for some $t$, $f(t)$ is greater than $f(b)$, or even be a negative value if, for some $t$, $f(t)$ is less than $f(a)$.

Proposition 5 dictates the range of $E$-index, and we can design a growth curve function whose $E$-index is any designated value in the range.

3. Validating $E$-Index’s Effectiveness

$E$-index can be easily calculated with the integration operation stated in (4). However, is it as effective as the function parameters, say, $r$ in (1) and (2), to differentiate growth curves? In addition, is it able to differentiate the growth curves without a uniform function in Figure 2? Growth curves are usually formed by collecting successive measurements and finding a function (sometimes difficult to find) approximately fitting the data. But is $E$-index applicable in this situation? We will answer all these questions in the following subsections.

3.1. Contrasting $E$-Index and Function Parameters. The function parameter $r$ in the 2 bunches of growth curves can differentiate the curves, as is illustrated in Figure 1. We are trying to find out whether $E$-index is capable of doing so, and the results are shown in Figure 5.

For each of the 25 values of $r$ in Figure 1(a), the $E$-index of the corresponding growth curve function is calculated with (4) as follows (here $a = 0$, $b = 8$, $f_r(0) = 0$, and $f_r(8) \approx 1$):

$$ E(r) = E(f_r(\cdot)) = \frac{1}{8} \int_0^8 \frac{2}{1 + e^{-rt}}\,dt - 1. \tag{5} $$

The relation between the values of the parameter $r$ and the corresponding $E$-index values is plotted in Figure 5(a). It can be observed that as $r$ increases, the $E$-index value increases accordingly, which implies that $E$-index is as capable of differentiating growth curves as the parameter $r$, which is related to growth rate. For the bunch of growth curves in Figure 1(b), we can obtain a similar result, illustrated in Figure 5(b).

3.2. $E$-Index Applied in Nonuniform Functions. In some cases, we may use $E$-index as an equivalent of function parameters, to differentiate growth curves of a uniform type. In addition, we may continue to apply $E$-index to differentiate them without a uniform function describing them. For instance, the 4 types of growth curves in Figure 2 are lymphoid, neural, general, and genital, respectively. Specifically, we can use the following 4 functions to describe them:

$$
\begin{aligned}
f_{\text{Lymphoid}}(t) &= 810\,e^{-(t-17.43)^2/153.1175} - 110.5, && 0 \le t \le 20,\\
f_{\text{Neural}}(t) &= \frac{200}{1 + e^{-0.35t}} - 100, && 0 \le t \le 20,\\
f_{\text{General}}(t) &= \frac{108}{1 + e^{-0.32(t-11)}} - 2.1, && 0 \le t \le 20,\\
f_{\text{Genital}}(t) &=
\begin{cases}
\dfrac{16}{1 + e^{-0.35(t-1)}} - 6, & 0 \le t \le 12,\\[2ex]
\dfrac{100}{1 + e^{-(t-17.7)}} + 9.2, & 12 < t \le 20.
\end{cases}
\end{aligned}
\tag{6}
$$

Applying (4) once again to the above functions, we will obtain the corresponding $E$-indices as 1.328, 0.802, 0.459, and 0.191 for the lymphoid, neural, general, and genital type of growth curves, respectively. This result is consistent with our observation and intuition: growth and development occur earliest for the lymphoid type, compared to the other three types of growth curves; and the genitals grow and develop latest among the four types.

This example illustrates that $E$-index may, at least in some cases, differentiate growth curves even without a uniform function describing them.

3.3. $E$-Index of Spline Interpolation. In order to differentiate growth curves by function parameters, we have to assume the function type first and then calculate the parameters that make the function fit the successively collected measurement values best. The resulting parameters do not work well if the function does not fit the data well.

In fact, spline interpolation performs well to find a smooth function piecewise defined by polynomials. Unfortunately, splines are not uniform functions and therefore function parameters do not work for the case of splines either. $E$-index, however, does work in this situation. Based on the successively collected measurements, we can define a smooth function to fit the data by spline interpolation and then calculate the $E$-index of the function. The resulting $E$-indices will help to differentiate the corresponding growth indicated by the collected measurements.

We will consider 2 growth curves. The first one is described by (1) with the parameter $r = 1$. Suppose that we do not have any knowledge of the curve type and all that we have is the function values of 5 interpolation points evenly dispersed in the time domain. A typical kind of spline, the cubic spline, is calculated and compared to the original curve in Figure 5(c). The second growth curve is described by (2) with $r = 1$, and the derived cubic spline and the original curve are contrasted in Figure 5(d).

It is observed from Figures 5(c) and 5(d) that the spline fits the original growth curve well (and will fit it better with more interpolation points), which indicates that $E$-index works well even without knowledge of the growth curve type.

Next, a spline is calculated for each of the functions in the bunch illustrated in Figure 1(a). For the same value of the parameter $r$, the $E$-index of the original function and that of the spline are calculated, respectively. The obtained $E$-index values are contrasted in Figure 5(e). Similarly, the $E$-index values are contrasted in Figure 5(f) for the original functions illustrated in Figure 1(b) and their splines. The results are encouraging, since the $E$-index values of the splines are quite close to those of the original functions if the number of inner interpolation points is 5 or more (see Figures 5(e) and 5(f)).

3.4. Statistical Framework of $E$-Index. How can we apply $E$-index to differentiate complex dynamic traits? We are typically given two genotypes $A$ and $B$ with $m$ samples of $A$ and $n$ samples of $B$, each sample measured at $T$ time points. Our purpose is to judge whether the genotypes significantly affect the phenotypes.

Suppose that the value vector of the $i$th sample of $A$ is $v_i^A = (v_{i,1}^A, v_{i,2}^A, \ldots, v_{i,T}^A)$, $i = 1, 2, \ldots, m$, and that of the $j$th sample of $B$ is $v_j^B = (v_{j,1}^B, v_{j,2}^B, \ldots, v_{j,T}^B)$, $j = 1, 2, \ldots, n$. The computation steps are as follows, using a $t$-test with $(m + n - 2)$ degrees of freedom to assess the significance.

(a) For each $v_i^A$ and $v_j^B$, we can use spline interpolation to get continuous functions (curves) $f_i^A(t)$ and $f_j^B(t)$.

[Figure 5, panels (a)–(f). Panels (a), (b), (e), and (f) plot 𝐸-index against 𝑟; panels (c) and (d) plot 𝑦 against 𝑡 for the original curve and its cubic spline; panels (e) and (f) compare the neural-type curve with a spline with 5 points.]

Figure 5: Simulations validating the effectiveness of 𝐸-index. (a) 𝐸-indices of the growth curves in Figure 1(a) plotted against parameter 𝑟. (b) 𝐸-indices of the growth curves in Figure 1(b) plotted against parameter 𝑟. (c) A growth curve in Figure 1(a) with 𝑟 = 1 contrasted with its spline. (d) A growth curve in Figure 1(b) with 𝑟 = 1 contrasted with its spline. (e) 𝐸-index values of the bunch of functions in Figure 1(a) compared with those of their splines. (f) 𝐸-index values of the bunch of functions in Figure 1(b) compared with those of their splines.

(b) Then (4) is employed to calculate the 𝐸-index of each curve, namely, $E_i^A$ and $E_j^B$, for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. We denote $E^A = (E_1^A, E_2^A, \ldots, E_m^A)$ and $E^B = (E_1^B, E_2^B, \ldots, E_n^B)$.

(c) After that we define a test statistic,

$$
t = \frac{\bar{E}_A - \bar{E}_B}{\sqrt{s^2\,(1/m + 1/n)}}, \tag{7}
$$

where $\bar{E}_A$ is the mean of the vector $E^A$ and $\bar{E}_B$ is the mean of the vector $E^B$ (here we suppose that $\bar{E}_A > \bar{E}_B$), and the common variance is

$$
s^2 = \frac{(m-1)\,s_A^2 + (n-1)\,s_B^2}{m+n-2}, \tag{8}
$$

where $s_A^2$ is the sample variance of the vector $E^A$ and $s_B^2$ is that of $E^B$.

(d) We can test the null hypothesis that the two groups of samples are not significantly different,

$$
H_0: \bar{E}_A - \bar{E}_B = 0, \tag{9}
$$

versus the alternative hypothesis that the two groups are significantly different,

$$
H_1: \bar{E}_A - \bar{E}_B > 0. \tag{10}
$$

$H_0$ will be rejected if

$$
t > t_\alpha(m+n-2); \tag{11}
$$

otherwise $H_0$ will be accepted, where 𝑡 is the computation result of (7) and $t_\alpha(m+n-2)$ is the 𝑡-distribution value at the significance level 𝛼 with (𝑚+𝑛−2) degrees of freedom.

3.5. Applying 𝐸-Index. With the statistical framework of 𝐸-index in the previous subsection, we can now apply it to differentiate complex dynamic traits.

Two bunches of growth curves of genotypes 𝐴 and 𝐵, respectively, are generated by simulation, and they, together with their 𝐸-indices, are illustrated in Figure 6.

The relative measurement value of each sample at the beginning time point 0 is 0 percent, and that at the ending time point 20 is 100 percent. Consequently, we are not able to differentiate them merely by the measurement value at the mature state and have to resort to the difference of developmental processes.

Intuitively, the two groups are far apart. But we fail to apply the functional mapping framework to differentiate the two groups, since there exist no parameters like 𝑟 in (1) and (2) responsible for the curve shape. In addition, each curve is nonmonotonic, which functional mapping is not able to deal with.

But using the statistical framework given in Section 3.4, we get the standard deviations of the 𝐸-indices for genotypes 𝐴 and 𝐵, $s_A = 0.0428$ and $s_B = 0.0597$, respectively. The group means computed from the 𝐸-indices listed in Figure 6 are 1.416 and 1.062. We have by (8) the common sample variance $s^2 = 0.0027$ and by (7) the test statistic 𝑡 = 10.77, much larger than $t_{0.01}(5+5-2) = 2.896$, which means that there is sufficient evidence to indicate that the genotypes are clearly responsible for the developmental processes of the organs.

[Figure 6: growth (%) plotted against age (years), 0 to 20. Growth curves of genotype 𝐴: 𝐸-indices from upper to lower are 1.47, 1.44, 1.42, 1.39, and 1.36. Growth curves of genotype 𝐵: 𝐸-indices from upper to lower are 1.15, 1.09, 1.05, 1.02, and 1.00.]

Figure 6: Simulated organ growth curves of different genotypes 𝐴 and 𝐵 with sample size 5 each.

4. 𝐸-Index Vector

As is mentioned earlier, a curve may theoretically provide an infinite amount of information about growth. Though the 𝐸-index is sometimes capable of differentiating growth curves, it is after all only one characteristic value revealing one aspect of information. Therefore, it is natural for us to extend 𝐸-index into 𝐸-index vector.

4.1. Definition of 𝐸-Index Vector. Where and why is 𝐸-index insufficient to differentiate growth curves? An example from Figures 7(a), 7(c), and 7(e) will illustrate this.

Comparing the sizes of the shaded areas in Figures 7(a), 7(c), and 7(e), we find that 3 totally different growth curves lead to the same 𝐸-index value (0.5). This is mainly due to the fact that the effect caused by the higher growth rate in Figure 7(c) or Figure 7(e) is counteracted by the lower growth rate later or earlier.

This example indicates that 𝐸-index does not work in some cases to differentiate growth curves. Consequently, we have to move forward to a more sophisticated tool. Naturally we will extend 𝐸-index into 𝐸-index vector.

[Figure 7, panels (a)–(f): each panel plots 𝑦 against 𝑡.]

Figure 7: The limitation of 𝐸-index and the definition of 𝐸-index vector. (a) and (b) illustrate 𝐸-index and 𝐸-index vector for constant growth rate, respectively. (c) and (d) illustrate 𝐸-index and 𝐸-index vector for higher growth rate in earlier half and lower growth rate in later half. (e) and (f) illustrate 𝐸-index and 𝐸-index vector for lower growth rate in earlier half and higher growth rate in later half. 𝐸-indices of the curves in (a), (c), and (e) are of the same value 0.5. 𝐸-index vectors for the curves in (b), (d), and (f) are (0.5, 0.5), (0.716, 0.284), and (0.284, 0.716), respectively.

Definition 6. Suppose that 𝑓(𝑡) describing a growth curve is continuous on a closed interval [𝑎, 𝑏], 𝑎 < 𝑏, and suppose that $Q = \{p_i\}_{i=0}^{n}$ is a prescribed sequence of points, $p_0 = a$, $p_n = b$, $p_i < p_j$, $0 \le i < j \le n$. Then the 𝐸-index vector of 𝑓(𝑡) according to 𝑄 is defined as follows:

$$
V_Q(f) = \left(nE_{p_0}^{p_1}(f),\; nE_{p_1}^{p_2}(f),\; \ldots,\; nE_{p_{n-1}}^{p_n}(f)\right). \tag{12}
$$

$V_Q(f)$ is denoted as $V(f)$ for simplicity and $\left(nE_{p_0}^{p_1}(f), nE_{p_1}^{p_2}(f), \ldots, nE_{p_{n-1}}^{p_n}(f)\right)$ as $(E_1(f), E_2(f), \ldots, E_n(f))$.

The 𝐸-index vectors of the growth curves in Figures 7(b), 7(d), and 7(f) are calculated with (12). The three resulting 𝐸-index vectors, (0.5, 0.5), (0.716, 0.284), and (0.284, 0.716), are apparently different, as is also illustrated with shaded areas in Figure 7. But how can we evaluate this difference quantitatively? In order to answer this question, we first give another definition.

Definition 7. Suppose 𝑓(𝑡) and 𝑔(𝑡) are two functions defined on a closed interval, describing two growth curves with the same prescribed sequence of points $Q = \{p_i\}_{i=0}^{n}$. Then the growth dissimilarity between 𝑓(𝑡) and 𝑔(𝑡) is defined as follows:

$$
D(f,g) = \left[\sum_{i=1}^{n} \left(E_i(f) - E_i(g)\right)^2\right]^{1/2}. \tag{13}
$$

It is easy to prove that growth dissimilarity satisfies the distance axioms; that is,

$$
\begin{aligned}
D(f,f) &= 0,\\
D(f,g) &= D(g,f),\\
D(f,g) + D(g,h) &\ge D(f,h).
\end{aligned}
\tag{14}
$$

We have hence transformed problems about growth curves into problems about vectors, which will help to analyze the relation between the different growth curves in Figure 7. Denote the functions describing the growth curves in Figures 7(b), 7(d), and 7(f) as $f_1(t)$, $f_2(t)$, and $f_3(t)$, respectively. According to (13), their dissimilarities are calculated and listed as follows:

$$
\begin{aligned}
D(f_1,f_2) &= 0.305,\\
D(f_1,f_3) &= 0.305,\\
D(f_2,f_3) &= 0.611.
\end{aligned}
\tag{15}
$$

The results above show that, in the growth perspective of earlier and later halves, $f_2(t)$ is more similar to $f_1(t)$ than it is to $f_3(t)$, which is consistent with what is observed in Figure 7.

Different weights can be designated to differently important phases of growth according to specific problems. So the growth dissimilarity defined in (13) can be accordingly redefined as the following equation with $W_i$ denoted as the 𝑖th weight and $V_i(f) = E_i(f)$ as the 𝑖th component of $V(f)$:

$$
D(f,g) = \left[\sum_{i=1}^{n} W_i \left(V_i(f) - V_i(g)\right)^2\right]^{1/2}. \tag{16}
$$

Equation (16) enables 𝐸-index vector to help differentiate two growth curves. What is more important, however, is to differentiate a set of growth curves or to divide them into groups or clusters, which will be discussed in the next subsection.

4.2. Grouping or Clustering Growth Curves by 𝐸-Index Vector. More and more growth traits are available and they can be described by growth curves. Studies [12] indicate that growth traits are powerful for identifying genes, some of which cannot be identified by traits measured at only one time point.

In order to identify genes with growth traits, we are required to divide all growth curves into groups such that curves are as similar as possible within the same group and as dissimilar as possible between different groups. But a common situation we encounter is that it is difficult to obtain reasonable groups.

𝐸-index, 𝐸-index vector, and the growth dissimilarity definition based on these two concepts may help us to group or cluster the growth curves.

With the 𝐸-index vectors, we can define describing rules for a curve group. Take the growth curves in Figures 7(b), 7(d), and 7(f), for instance. We define the rule describing the first group as “the first component of the 𝐸-index vector is greater than 0.6 and the second less than 0.4.” The second rule is defined as “the first component is less than 0.6 and the second greater than 0.4.” These two rules describe and define two groups, with growth curve (d) in the first group and growth curves (b) and (f) in the second one.

Though describing rules are capable of grouping growth curves, human experts are involved in prescribing the rules, and thus the rules, and consequently the grouping results, may differ from person to person.

Unlike the grouping technique with describing rules, the clustering technique, the 𝑘-mean algorithm and its variations, is almost automatic. It is an iterative algorithm with 𝑘 randomly selected centers. To cluster the growth curves, the growth dissimilarities between each growth curve and each center are calculated with (16), and the group corresponding to a center will include all the growth curves nearer to that center than to the other centers. This process continues until the membership of each group remains unchanged.

We simulated the 𝑘-mean algorithm by randomly generating 60 growth curves and dividing them into 5 clusters by the algorithm, according to the 𝐸-index vector definition in (12) and the growth dissimilarity definition in (16). In Figure 8, the primitive growth curves and the resulting groups are displayed. It can be seen from Figure 8 that each group represents a distinct growth style.

5. Conclusions

In order to generalize functional mapping and overcome its shortcomings, 𝐸-index and 𝐸-index vector are proposed in this paper, respectively, by means of measuring the earliness degree of growth or development in the overall process and in a growth phase. We summarize their features as follows.

[Figure 8, panels (a)–(f): each panel plots 𝑦 against 𝑡 on the interval from 0 to 1.]

Figure 8: Clustering growth curves based on their 𝐸-index vectors. (a) illustrates 60 randomly generated growth curves, with a random color designated to each of them. (b)–(f) are the resulting groups after applying the 𝑘-mean algorithm, with one group in each panel. Each growth curve in (b)–(f) retains its shape, position, and color from its original in (a).
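The clustering procedure of Section 4.2 can be sketched in a few lines. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names `dissimilarity` and `k_means` are hypothetical, the center-update step (mean of the member vectors) is the standard 𝑘-means choice and is assumed here because the text does not spell it out, and the ten toy vectors are made up for illustration. The 𝐸-index vectors are taken as given, since computing them requires (4) and (12).

```python
import math
import random


def dissimilarity(u, v, weights=None):
    """Growth dissimilarity between two E-index vectors.

    Weighted Euclidean distance as in (16); with weights=None every
    weight is 1, which reduces to (13).
    """
    if weights is None:
        weights = [1.0] * len(u)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))


def k_means(vectors, k, weights=None, seed=0, max_iter=100):
    """Cluster E-index vectors; returns (labels, centers).

    Hypothetical helper: centers start as k randomly chosen vectors and
    are updated to the mean of their members (an assumed, standard step).
    """
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every curve to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: dissimilarity(v, centers[c], weights))
            for v in vectors
        ]
        if new_labels == labels:  # memberships unchanged: stop
            break
        labels = new_labels
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# E-index vectors quoted in the text for Figures 7(b), 7(d), and 7(f).
f1, f2, f3 = (0.5, 0.5), (0.716, 0.284), (0.284, 0.716)
print(round(dissimilarity(f1, f2), 3), round(dissimilarity(f2, f3), 3))
# prints: 0.305 0.611

# Made-up vectors: five "early growth" curves and five "late growth" curves.
early = [(0.70 + 0.01 * i, 0.30 - 0.01 * i) for i in range(5)]
late = [(0.30 - 0.01 * i, 0.70 + 0.01 * i) for i in range(5)]
labels, centers = k_means(early + late, k=2)
print(labels)  # the first five labels agree, and differ from the last five
```

Passing `weights` emphasizes selected growth phases, as in (16); with equal weights the distance reduces to (13). Because the initial centers are drawn at random, different seeds may number the clusters differently even when the grouping itself is the same.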

(i) 𝐸-index is capable of differentiating growth curves (such as logistic curves) as function parameters are in the applications of functional mapping. Like functional mapping, 𝐸-index is good at differentiating growth trajectories with the same values of mature state, which traditional QTL takes as the same. In this sense, 𝐸-index generalizes functional mapping.

(ii) 𝐸-index is sometimes unavailable according to its primitive definition given in Definition 1, due to strict restrictions. But it is always available for a growth curve and easier to calculate from another perspective stated in Proposition 4. Moreover, measuring the earliness degree of growth or development, 𝐸-index is as biologically meaningful as the important curve parameters employed in functional mapping.

(iii) A function globally and thoroughly describing the process of growth is unnecessary for calculating 𝐸-index. In fact, a cubic spline (as employed in [32]) approximates it well. Furthermore, 𝐸-index can be applied in any period of the developmental process; on the contrary, functional mapping can only be applied to the whole process in order to get suitable parameters.

(iv) Although it is a key and general characteristic value, 𝐸-index provides limited information. As an extension of 𝐸-index, 𝐸-index vector is focused on the growth in different phases, the number of which may vary, and the time spans for the same vector may not be of equal length, according to the application background and requirements.

(v) By extracting the growth information in a curve and forming a vector, we can use well-developed techniques for analysis, such as describing rules and the 𝑘-mean algorithm.

(vi) 𝐸-index vector helps us reveal detailed characteristics in growth curves. It may be looked on as a microscope employed to observe a desired level of growth detail.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This study was supported by the Fundamental Research Funds for the Central Universities (YX-2010-30), NSFC Grant (31470675), and Special Fund for Forest Scientific Research in Public Welfare of China (201404102).

References

[1] T. F. C. MacKay, E. A. Stone, and J. F. Ayroles, “The genetics of quantitative traits: challenges and prospects,” Nature Reviews Genetics, vol. 10, no. 8, pp. 565–577, 2009.
[2] E. S. Lander and S. Botstein, “Mapping mendelian factors underlying quantitative traits using RFLP linkage maps,” Genetics, vol. 121, no. 1, pp. 185–199, 1989.
[3] C. Neuschl, G. A. Brockmann, and S. A. Knott, “Multiple-trait QTL mapping for body and organ weights in a cross between NMRI8 and DBA/2 mice,” Genetical Research, vol. 89, no. 1, pp. 47–59, 2007.
[4] L. J. Leamy, D. Pomp, E. J. Eisen, and J. M. Cheverud, “Pleiotropy of quantitative trait loci for organ weights and limb bone lengths in mice,” Physiological Genomics, vol. 2002, no. 10, pp. 21–29, 2002.
[5] C. Fan, Y. Xing, H. Mao et al., “GS3, a major QTL for grain length and weight and minor QTL for grain width and thickness in rice, encodes a putative transmembrane protein,” Theoretical and Applied Genetics, vol. 112, no. 6, pp. 1164–1171, 2006.
[6] G. Shao, S. Tang, J. Luo et al., “Mapping of qGL7-2, a grain length QTL on chromosome 7 of rice,” Journal of Genetics and Genomics, vol. 37, no. 8, pp. 523–531, 2010.
[7] P. Ramya, A. Chaubal, K. Kulkarni et al., “QTL mapping of 1000-kernel weight, kernel length, and kernel width in bread wheat (Triticum aestivum L.),” Journal of Applied Genetics, vol. 51, no. 4, pp. 421–429, 2010.
[8] X. Y. Wan, J. M. Wan, L. Jiang et al., “QTL analysis for rice grain length and fine mapping of an identified QTL with stable and major effects,” Theoretical and Applied Genetics, vol. 112, no. 7, pp. 1258–1270, 2006.
[9] R. L. Wu and M. Lin, “Opinion: Functional mapping – how to map and study the genetic architecture of dynamic complex traits,” Nature Reviews Genetics, vol. 7, no. 3, pp. 229–237, 2006.
[10] M. Kirkpatrick and N. Heckman, “A quantitative genetic model for growth, shape, reaction norms, and other infinite-dimensional characters,” Journal of Mathematical Biology, vol. 27, no. 4, pp. 429–450, 1989.
[11] S. D. Pletcher and C. J. Geyer, “The genetic analysis of age-dependent traits: modeling the character process,” Genetics, vol. 153, no. 2, pp. 825–835, 1999.
[12] R.-L. Wu, M.-X. Wang, and M.-R. Huang, “Quantitative genetics of yield breeding for Populus short rotation culture. I. Dynamics of genetic control and selection model of yield traits,” Canadian Journal of Forest Research, vol. 22, no. 2, pp. 175–182, 1992.
[13] W. R. Atchley and J. Zhu, “Developmental quantitative genetics, conditional epigenetic variability and growth in mice,” Genetics, vol. 147, no. 2, pp. 765–776, 1997.
[14] J. M. Cheverud, J. J. Rutledge, and W. R. Atchley, “Quantitative genetics of development: genetic correlations among age-specific trait values and the evolution of ontogeny,” Evolution, vol. 37, no. 5, pp. 895–905, 1983.
[15] W. R. Atchley, “Ontogeny, timing of development, and genetic variance-covariances structure,” The American Naturalist, vol. 123, no. 4, pp. 519–540, 1984.
[16] W.-R. Wu, W.-M. Li, D.-Z. Tang, H.-R. Lu, and A. J. Worland, “Time-related mapping of quantitative trait loci underlying tiller number in rice,” Genetics, vol. 151, no. 1, pp. 297–303, 1999.
[17] J. M. Cheverud, E. J. Routman, F. A. M. Duarte, B. Van Swinderen, K. Cothran, and C. Perel, “Quantitative trait loci for murine growth,” Genetics, vol. 142, no. 4, pp. 1305–1319, 1996.
[18] D. Verhaegen, C. Plomion, J.-M. Gion, M. Poitel, P. Costa, and A. Kremer, “Quantitative trait dissection analysis in Eucalyptus using RAPD markers. 1. Detection of QTL in interspecific hybrid progeny, stability of QTL expression across different ages,” Theoretical and Applied Genetics, vol. 95, no. 4, pp. 597–608, 1997.
[19] L. C. Emebiri, M. E. Devey, A. C. Matheson, and M. U. Slee, “Age-related changes in the expression of QTLs for growth in radiata pine seedlings,” Theoretical and Applied Genetics, vol. 97, no. 7, pp. 1053–1061, 1998.
[20] B. Mangin, P. Thoquet, and N. Grimsley, “Pleiotropic QTL analysis,” Biometrics, vol. 54, no. 1, pp. 88–99, 1998.
[21] A. B. Korol, Y. I. Ronin, A. M. Itskovich, J. Peng, and E. Nevo, “Enhanced efficiency of quantitative trait loci mapping analysis based on multivariate complexes of quantitative traits,” Genetics, vol. 157, no. 4, pp. 1789–1803, 2001.
[22] C. X. Ma, G. Casella, and R. L. Wu, “Functional mapping of quantitative trait loci underlying the character process: a theoretical framework,” Theoretical and Applied Genetics, vol. 97, no. 7, pp. 1053–1061, 1998.
[23] G. B. West, J. H. Brown, and B. J. Enquist, “A general model for ontogenetic growth,” Nature, vol. 413, no. 6856, pp. 628–631, 2001.
[24] R. L. Wu, C.-X. Ma, X.-Y. Lou, and G. Casella, “Molecular dissection of allometry, ontogeny, and plasticity: a genomic view of developmental biology,” BioScience, vol. 53, no. 11, pp. 1041–1047, 2003.
[25] R. L. Wu, C.-X. Ma, M. Lin, and G. Casella, “A general framework for analyzing the genetic architecture of developmental characteristics,” Genetics, vol. 166, no. 3, pp. 1541–1551, 2004.
[26] R. Wu, C.-X. Ma, R. C. Littell, and G. Casella, “A statistical model for the genetic origin of allometric scaling laws in biology,” Journal of Theoretical Biology, vol. 219, no. 1, pp. 121–135, 2002.
[27] R. L. Wu, C.-X. Ma, M. Lin, Z. Wang, and G. Casella, “Functional mapping of quantitative trait loci underlying growth trajectories using a transform-both-sides logistic model,” Biometrics, vol. 60, no. 3, pp. 729–738, 2004.
[28] R. L. Wu, Z. H. Wang, W. Zhao, and J. M. Cheverud, “A mechanistic model for genetic machinery of ontogenetic growth,” Genetics, vol. 168, no. 4, pp. 2383–2394, 2004.
[29] X. Zhao, C. Tong, X. Pang et al., “Functional mapping of ontogeny in flowering plants,” Briefings in Bioinformatics, vol. 13, no. 3, pp. 317–328, 2012.
[30] L. Jiang, J. Liu, X. Zhu et al., “2HiGWAS: a unifying high-dimensional platform to infer the global genetic architecture of trait development,” Briefings in Bioinformatics, vol. 16, no. 6, pp. 905–911, 2015.
[31] L. Bogin, Patterns of Human Growth, Cambridge University Press, 2nd edition, 1999.
[32] M. Liu, X. Li, R. Fan, X. Liu, and J. Wang, “A systematic analysis of candidate genes associated with nicotine addiction,” BioMed Research International, vol. 2015, Article ID 313709, 9 pages, 2015.

Hindawi Publishing Corporation
BioMed Research International
Volume 2016, Article ID 7972351, 11 pages
http://dx.doi.org/10.1155/2016/7972351

Research Article
Advancements in RNASeqGUI towards a Reproducible Analysis of RNA-Seq Experiments

Francesco Russo, Dario Righelli, and Claudia Angelini

Istituto per le Applicazioni del Calcolo, CNR, 80131 Napoli, Italy

Correspondence should be addressed to Francesco Russo; [email protected]

Received 2 July 2015; Revised 11 December 2015; Accepted 3 January 2016

Academic Editor: Sílvia A. Sousa

Copyright © 2016 Francesco Russo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present the advancements and novelties recently introduced in RNASeqGUI, a graphical user interface that helps biologists to handle and analyse large data collected in RNA-Seq experiments. This work focuses on the concept of reproducible research and shows how it has been incorporated in RNASeqGUI to provide reproducible (computational) results. The novel version of RNASeqGUI combines graphical interfaces with tools for reproducible research, such as literate statistical programming, human readable report, parallel executions, caching, and interactive and web-explorable tables of results. These features allow the user to analyse big datasets in a fast, efficient, and reproducible way. Moreover, this paper represents a proof of concept, showing a simple way to develop computational tools for Life Science in the spirit of reproducible research.

1. Introduction

RNA-Seq [1–4] is now the most widely used technology to study genome-wide gene expression and regulatory mechanisms in response to stress conditions or drug treatments and cell development, as well as in the onset and progression of several diseases [5], including cancer. In particular, RNA-Seq experiments allow profiling an entire transcriptome under a condition of interest, detecting differences in transcriptional activities that can be associated with different physiological or pathological conditions, and identifying and estimating isoform abundances as well as identifying novel genes or isoforms. The overall goal of an RNA-Seq experiment is to understand which functional processes are significantly altered (either upregulated or downregulated) when comparing two or more conditions and, then, to identify the biological mechanisms regulating such changes.

Usually, RNA-Seq data analyses are complex and require the usage of several different tools to manipulate and process data, depending on the particular question the researcher is interested in (see [6, 7] for a review). When a reference genome is available, a typical analysis starts with the alignment of millions of raw sequences (i.e., short sequences of about 100 bp, often from paired-end libraries) collected for each sample by means of a mapping procedure (using TopHat [8], e.g.). For complex eukaryotic genomes such as human or mouse, the alignment files (usually in the so-called bam format) are quite big (about 2–5 GBytes per sample). In a typical experiment, researchers can produce from a few units to tens of samples for a total amount that can reach tens or hundreds of GBytes. Moreover, with the decrease of experimental cost, such amount is expected to increase at a fast rate.

Subsequently, the analysis proceeds with the gene expression quantification (which can be viewed either as a simple read counting over a list of annotated genes or as isoform quantification [2]). Then, the data has to undergo a series of preprocessing steps, which include filtering and normalization of the gene expression values, aimed at making the samples comparable and removing different sources of biases.

To have a better insight into the biological process under study, a crucial step is the identification of differentially expressed (DE) genes across different biological conditions [7, 9, 10]. In this context, a researcher has to use one or more statistical tests that are able to assess whether observed differences in gene expression levels are more likely due to the differences in the biological conditions rather than to chance. A typical output of this step is a list of DE genes usually containing a few hundred elements.

The final step of the analysis consists in the identification of pathways and functionalities significantly altered among conditions. Such point is crucial since it allows the biological interpretation of the analysis and it is usually known as Pathway and Gene Ontology analysis.

Several tools are available in the literature to carry out RNA-Seq analyses. Most of them operate as command-line tools (see, e.g., the Tuxedo pipeline in [11]). Unfortunately, the usage of command-line tools can be intimidating for those with a limited knowledge of programming languages. To this purpose, a series of web-servers and graphical user-friendly interfaces (GUIs) have been recently developed. For instance, the well-known web-platform Galaxy [12] has included several tools to build efficient pipelines for carrying out RNA-Seq data analysis and it represents one of the most efficient environments to handle big data over the cloud. The oneChannelGUI [13], originally developed for the analysis of microarray data, has been extended to handle RNA-Seq experiments and it is now a tool that combines several different functions for quantification and differential expression analysis. Analogously, RobiNA [14] and RNASeqGUI [15] are similar tools devoted to the identification of DE genes from RNA-Seq experiments. More recently, RAP [16] has been proposed as a cloud computing web-interface offering the possibility of creating modular analysis workflow. We refer to [17] for a comprehensive review of the available GUIs.

All those GUIs or web-platforms are easy to use and do not require a specific knowledge of a programming language. Therefore, they allow a nonexpert user to run complex and personalized analyses on the datasets of interest. On one hand, web-servers are more suited to build large pipelines that are automatically and completely executed on large amount of data; on the other hand, GUIs are more suited for an interactive analysis of experimental data in which the researcher decides which step to perform on the basis of the inspection of preliminary results. However, the price to pay for this additional flexibility consists in the difficulty of keeping track of all actions performed while using GUIs [17]. Clearly, the latter point is considered a limit in terms of reproducibility of computational results.

In the last decade, we have seen a growing interest in the literature on the concept of reproducible (computational) research (RR in the rest of the paper) [18–21], motivated by the need to improve the transparency of scientific publications and the knowledge transfer. In our opinion, RR is extremely important since it provides a way to inspect the correctness and authenticity of results presented in published papers. This feature consists in the possibility of reexecuting an entire analysis, or parts of it, of accessing all the details, of learning more about a particular study, or of replicating a study up to a certain point and then trying alternative analyses by using other methods (e.g., different normalization procedures and/or filtering procedures and/or DE methods and/or pathway types of analysis). The problem of the reproducibility of data analysis is of great relevance in the Life Science when the analyses are very complex, time consuming, and computationally demanding [22, 23].

In fact, to assure reproducibility, it is necessary to store all initialization parameters and codes of the methods used during an analysis. To date, the lack of reproducibility has constituted one of the main limitations of GUIs. However, in recent years, many different tools provide novel functionalities that support developers to build software (including GUIs) capable of keeping track of all actions performed while executing an analysis (see [24] for an overview).

In this work, we present the novel advancements we introduced in RNASeqGUI [15] with particular focus on the incorporation of the RR.

Since the first version of RNASeqGUI, we increased both the number of interfaces and the number of functionalities within each interface. We added the possibility of handling complex/multifactor designs (up to two covariates) by using several different DE methods and the possibility of conducting two different types of analyses for biological/technical replicates. We also introduced the possibility of performing the pathway analysis with three different methods, such as David [25], Graphite [26], and Gage [27], and of performing the Gene Ontology analysis with David and Gage interfaces. Each of these functionalities gives the possibility of querying some of the major pathway databases. In particular, via David and Graphite it is possible to query Kegg (http://www.genome.jp/kegg/), Reactome (http://www.reactome.org/), and Biocarta (http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways), while Gage uses the Kegg pathway database.

Moreover, in order to face the limit of the reproducibility of the analyses with GUIs, we incorporated the RR feature inside RNASeqGUI. As a result, all actions and steps are automatically recorded and visualized in a human readable report. This report integrates raw data, result tables, figures, and software code. Not only does the report contain detailed information about all the actions performed (along with all initialization settings), but also the code chunks are executed each time the user makes an action. Therefore, each code chunk corresponds to a part of the analysis and can be executed independently in R console. More precisely, the report is presented as an html file in a human readable format, ready for submission as supplementary material of a publication or as piece of code in public repositories like rpubs.com. In a full RR spirit, each time the report file is generated, all the code chunks contained inside the report are executed again.

Clearly, the full reexecution might be very time consuming for both the authors and the readers of a publication, since the amount of data involved in RNA-Seq study can be very large. Moreover, a potential reader might not have the computational resources to run all the analyses. To address this limitation, we also implemented a feature in RNASeqGUI, called caching [28]. Even though this feature is widely used in many fields of Computer Science, from Internet browsers to smartphone apps, it is still not commonly used in Life Science. Caching constitutes a solution to speed up repetitive and computationally expensive code chunks by using intermediate results stored in precomputed databases. In this way, a third-party user with small computational resources can either replicate some pieces of the analysis or execute an alternative analysis starting from a middle point in the report.

The paper is organized as follows. In Section 2.1, we describe the novel version of RNASeqGUI. In Sections 2.2 and 2.3 we explain how RR and caching have been implemented in our platform. Section 2.4 explains how parallel computations are currently handled in RNASeqGUI. Section 2.5 summarizes the main environmental requirements. Section 2.6 describes how to extend RNASeqGUI by adding new functionalities. Finally, Section 3 concludes the paper and draws future development directions.

2. Material and Methods

2.1. RNASeqGUI. RNASeqGUI [15] is an open source graphical user interface, implemented in R, devoted to the analysis of RNA-Seq experiments. It requires the RGTK2 graphical library [29] to run and is freely available at http://bioinfo.na.iac.cnr.it/RNASeqGUI. Overall, RNASeqGUI integrates, in a unified platform, several of the R packages commonly used in the analysis of RNA-Seq data.

RNASeqGUI works at two different levels at the same time. The first one is the user level, composed of all the interfaces available to the user in order to analyse data, while the second one, the reporting level, is automatically executed while the user operates at the first level and regards the caching and reporting features. This new reporting level (not present in previous versions [15, 17]) automatically keeps track of all the operations performed at the user level by registering all the user actions and the input and output data and by creating the databases of the intermediate results. This second level makes the analysis reproducible and constitutes one of the main novelties of the new version of RNASeqGUI.

Figure 1 illustrates a typical RNA-Seq analysis workflow and represents a schematic view of the most important features available in RNASeqGUI. The old functionalities are represented in blue while the novel ones are represented in orange. Moreover, two levels, namely, panel (a) and panel (b), respectively, user level and reporting level, are illustrated.

2.1.1. RNASeqGUI Main Interface. The user interface of the novel version of RNASeqGUI (RNASeqGUI 1.1.0) is divided into seven main sections, as illustrated in Figure 2.

Each section is devoted to a particular step of the data analysis process and contains the access to one or more interfaces. RNASeqGUI is designed to represent a typical RNA-Seq analysis workflow that starts with the alignment file (in bam format). This approach is aimed at guiding the user through all the steps usually performed in an analysis. Clearly, the user is not obliged to access each section, but he can start from any section he wants and decide to skip some steps he considers unnecessary for the specific type of study carried on. As a consequence, RNASeqGUI results are very flexible for any type of usage.

(Count Distr), and scatterplot (Plot All Counts) generated in the Data Exploration Interface the user can decide if a normalization step is needed and also which type of normalization to perform.

In the current release, the first section covers files exploration of the alignment files (bam format). The second concerns the counting process of the mapped reads against a gene annotation file aimed at quantifying the gene expression levels. The third focuses on the exploration of count-data, on the normalization procedures, and on the filtering process, aimed at detecting and removing sources of biases. The fourth is about the identification of the DE genes that can be performed by several methods. Such a crucial section now includes also the possibility of handling complex/multifactor designs up to two covariates, as well as of using methods that can apply a suitable statistical hypothesis test in case of either technical or biological replicates (see Figure 1). Typical output of this section is the list of DE genes between conditions of interest.

The fifth section regards the inspection of the results produced by these methods and the quantitative comparison among them (via Venn diagrams). Using the interfaces available in this section it is possible to produce a wide series of graphical outputs such as Venn diagrams, volcano plots, fold change plots and histograms of 𝑝 values, FDRs, and posterior probabilities. The novel sixth section regards the Gene Ontology and Pathway analysis (see Figure 1). The introduction of such a section allows a self-contained analysis and interpretation of the findings from a biological perspective.

Finally, the seventh section contains the button to generate the HTML report of the analysis executed (called Report) and Utility Interface that provides a series of useful functions for general purposes.

Therefore, the novel version of RNASeqGUI allows the user to conduct a complete analysis from the quality assessment of the alignment files to the Gene Ontology and Pathway analysis, deeply extending the range of applications with respect to previous versions [15, 17]. Moreover, thanks to some peculiar functionalities, like Heatmap in the Gage interface, it is possible to interpret the change in gene expression levels for a particular gene path of interest.

The user manual (available at http://bioinfo.na.iac.cnr.it/RNASeqGUI/Manual.html) constitutes a detailed description of all functionalities, with several suggestions and examples to guide the user through the analysis of RNA-Seq data.

Moreover, in the spirit of RR the novel version of RNASeqGUI keeps track of all actions made by the user and generates a final executable human readable report integrating data and tables of results and figure with executable code chunks. To the best of our knowledge, it is the first tool devoted to the analysis of RNA-Seq that combines the flexibility of an interactive point&click analysis with the tools that assure reproducibility [17].
Within each section or interface, the user can decide what is the most appropriate action to perform in the next 2.1.2. RNASeqGUI Usage. Each analysis must start with the step on the basis of the results obtained in the previous one. creation or the selection of a project that refers to a specific For instance, by looking at mean-difference plot (MDplot), experiment. In principle, the user should create a specific density function (Density or Qplot Density), boxplot of counts project for each dataset and for each workflow applied to 4 BioMed Research International

[Figure 1 diagram. Module and functionality labels: Bam exploration; Read count; Data exploration (Normalization, Filtering); Data analysis (Simple design, Complex design; Technical replicates, Biological replicates); Result inspection; Result comparison; Enrichment analysis (Gene Ontology, Pathway; David, Gage, Graphite); Reproducible research (Reporting, HTML report, Caching); Tools.]

Figure 1: RNASeqGUI pipeline. Old features are represented in blue and the novel features are represented in orange. The boxes represent the software modules, while the ellipses represent the module functionalities. Panel (a) illustrates all the features the user can interact with, while panel (b) shows the reproducible research modules that work without user interaction. Note that panel (a) also illustrates a typical workflow to be executed during the analysis of RNA-Seq data experiments.

such dataset. Then, by choosing the button corresponding to a desired step, the user accesses an interface in which all the parameters needed to perform the chosen step can be configured (for this task a button, called “How to use this interface,” helps the user to set them; however, more advanced information on the usage is provided in the user manual). After the configuration of all required parameters, the user must press the button corresponding to the action he wants to perform. Subsequently, several messages are displayed in the R console to inform the user about the progress of the execution (which can last for a few seconds, several minutes, or hours, the latter for the functionalities in the Read Count Interface).

Figure 3 shows an example of interaction with the Result Inspection Interface, which helps to better understand how the software interfaces are structured.

After a job is executed, the results are presented to the user in a graphical form or, alternatively, the user receives the path where to access them. This second case occurs when the output consists of large tables.

As mentioned before, typical input data consists of a series of alignment files (in bam format) that can be obtained

from the raw sequences using standard mapping procedures. Moreover, in order to quantify gene expression levels the user also has to provide a gene annotation file (in GTF format). The sections are aimed at guiding the user through the data analysis following flow-charts such as the one described in Figure 1. However, we set the sections to be as independent as possible. In this way, the user is not obliged to follow a predetermined flow of execution, but he is free to use each section without a preestablished order.

Figure 2: RNASeqGUI main interface.

2.1.3. RNASeqGUI Output. RNASeqGUI provides results of any action in graphical and/or table-formatted form. The first time the user creates a project, a specific folder, named as the project, is created in the RNASeqGUI_Projects root directory. In that folder the user will find all intermediate and final results of his analysis.

The project directory contains three main directories named Logs, Results, and Plots, as illustrated in Figure 4.

The Logs folder contains all the files (report.Rmd, report.html, report.md, report.txt, and sessionInfo.txt) reporting all the actions performed during the analysis (see Section 2.2) and a subdirectory named cache that holds the caching database files (see Section 2.3). Each database file is created when an action is performed, to store the results obtained and the parameters used.

The Results folder contains all the tables produced during the DE analysis and the Pathway and Gene Ontology analysis. They are saved in txt and tsv (tab separated values) format. Moreover, when the Read Count Interface is used, a new subdirectory inside the Results folder is created to store the results of the specific read count function invoked (either SummarizeOverlaps from the GenomicRanges package [31] or FeatureCounts from the Subread package [30]).

In the Results folder, thanks to the ReportingTools package [32], the most relevant result tables are also available in html format. Therefore, they can also be opened via a web browser (see Figure 5) and it is possible to interact with them. They can be filtered by values and sorted by using different column criteria. Moreover, it is possible to access available information on the genes by a single click. This action automatically redirects the user to two databases, http://www.ensembl.org/ and http://www.ncbi.nlm.nih.gov/, containing relevant biological information on the selected gene. Therefore, it is possible to retrieve information of biological interest in a fast and interactive way.

Finally, the Plots directory contains all the figures in pdf format, generated during the analysis.

2.2. Reproducible Research in RNASeqGUI. RR is the key aspect of the novel version of RNASeqGUI. By means of literate statistical programming, a novel internal module devoted to reproducibility automatically keeps track of all lines of code corresponding to the actions performed by the user during the analysis, by writing (in the Logs folder) an R markdown file, named report.Rmd (an example is given in Supplementary Material Figure 1, available online at http://dx.doi.org/10.1155/2016/7972351).

Each time an action is made by the user, RNASeqGUI registers it with a mark in the Rmd file, writing the executed R code. In this way, when the user clicks the report button (see Figure 2), the report.Rmd file is compiled, all the marks and the code lines are executed, and the HTML file named report.html is generated (see Supplementary Material Figure 2).

Hence, this report.html contains all the information about the code lines used by RNASeqGUI plus all the initialization parameters and the input and output data. Such a report can be considered as a fully detailed log file, written in human readable format, usable as supplementary material, containing executable code along with all initializations and printed results (plots, tables, arrays, etc.).

Therefore, RNASeqGUI provides not only the open source code but also all the lines that have actually been executed during a specific analysis. They are clearly reported as code chunks. These lines constitute complete and independent units of code that can be executed independently in the R console without the need to install RNASeqGUI.

For instance, Supplementary Material Figure 3 shows an excerpt of the HTML report, which contains a code chunk used to produce a fold change plot (PlotFC). In this way, if a reader is interested in generating the same plot, he does not need to read the code of the entire analysis performed. It will be sufficient to copy and paste the code chunk for the particular step of interest inside the R console to generate the same plot. Finally, the user can compare the plot generated in this way with the plot depicted in the report.html to check whether they are identical. This can be done with all the code chunks inside the report.html.

2.3. Caching in RNASeqGUI. Another aspect of RR is given by the possibility of quickly reproducing and sharing analyses and results via Internet.

In fact, when generating the report file, the execution of all code chunks used during the entire performed analysis can be very time consuming. Therefore, to face such issues we used caching: a strategy to store data into several objects in order to retrieve them in a faster and secure way.

Figure 6 represents a typical execution flow involving the caching procedure. During step 1, a code chunk is

Figure 3: An example of execution flow for the Result Inspection Interface. From the main interface, by clicking the Result Inspection Interface button, a second interface opens. This interface is useful to inspect the results produced by the DE analysis. For each DE method, there is a dedicated button that opens a new box at the bottom of the interface. This box contains other buttons. We notice that each interface presents a “How to use this interface” button helping the user with the configuration of the parameters. After selecting a results file (a NoiSeq results file in this example), it is possible to use one of the buttons in the additional boxes to make a graphical representation of the results (PlotFC in this example).

executed producing output data and a caching database file containing the input/output variables. During step 2, when the same code chunk is executed, the output is drawn from the cache database file.

There are lots of R packages useful for caching [33–35]. We chose filehash [36], since it better fits our needs and storage idea. We wrapped some of its functionalities in order to implement, in the novel version of RNASeqGUI, a caching system to create a set of cache database files, stored in the Logs/cache folder, for each analysis flow (project) of RNASeqGUI. In this way, each function, when executed, generates a cache database file containing the input/output variables and some partial computation data. These files are useful during the RNASeqGUI report generation.

Indeed, after the execution of each code chunk, RNASeqGUI generates a mark for it in the R markdown file (cf. Section 2.2) and a cache database file, traced in the R markdown file (see Figure 7(a)).

In this way, during the report generation (activated by the report button in the main interface) the data are loaded from the cache database file, speeding up the entire process (see Figure 7(b)), instead of reexecuting the entire code written in the report.Rmd file.

Moreover, in a complete spirit of transparency the user can share these files via Internet, making it possible to reproduce the same analysis without the complication of searching for and manipulating the data.

In other words, caching makes all the intermediate results available in order to check them separately and to be used as starting points for different analyses. As a consequence, the implementation of caching allows the user to run different types of analyses on the same dataset in a more efficient way and to easily modify an analysis while still preserving reproducibility.

However, when sharing cached data through Internet, reproducibility might be limited unless both the raw data and the code needed to generate cached data are released.

To better understand how caching is implemented in RNASeqGUI, an excerpt of the HTML report file is represented in Supplementary Material Figure 4. To check the execution flow and to speed up the report compilation at the same time, both the commented code used to generate the cached data (in the blue parenthesis (A)) and the code used to load cached data (red parenthesis (B)) are reported. In Supplementary Material Figure 4(B) the result of the upper quartile normalization, stored in the uqua.db object, is loaded via the function LoadCachedObject. In this way, to check if the cached object is correct, a third-party user is able to generate the cached data by uncommenting the code reported in Supplementary Material Figure 4(A) that was used to produce the uqua.db object.

Furthermore, even if some code chunks are very fast to execute (a few seconds), it would be better to cache them as well, since during the generation of the HTML report,

[Figure 4 diagram. Directory tree: RNASeqGUI_Projects contains the project folders (MyProject1, MyProject2, MyProject3); each project contains Logs (R markdown file, full HTML report, Cache with the caching database files), Plots (pdf figures), and Results (csv tables, ReportingTools html files, FeatureCounts and SummarizeOverlaps read counts txt files).]

Figure 4: Output tree for the RNASeqGUI package.

without them, all the code chunks are reexecuted and the overall process could last for several minutes.

To allow a better management of the entire data analysis and an automatic way to keep track of the computational protocol used for analysing a specific dataset, we combined a human readable report, including the code chunks, and caching in RNASeqGUI.

We stress that each execution of RNASeqGUI is linked with the name of the project chosen by the user and the name of the input file used. All the settings are saved in the report. Therefore, the user will keep track of all changes of the input parameters used. However, if a user changes the parameters within the same project and with the same input file, then the cached object will be overwritten along with the previous result file. To avoid this problem, one should create a separate project for each specific workflow. Therefore, if a user wants to try two or more different settings of the same method, then he has to create one project for each setting.

2.4. Parallel Computing in RNASeqGUI. Another crucial aspect, while working with large amounts of data, is the computational cost required to complete each job. In particular, when working with large alignment files from RNA-Seq experiments, the most computationally demanding step consists of the read counting process (i.e., the quantification of the level of each gene in each sample). To handle such a process in a reasonable amount of time also on a standard desktop, we used parallel computing within the R environment.

There are several packages that help to implement parallel computing in R, like doParallel [37] combined with foreach [38] and snow [39]. In RNASeqGUI we used BiocParallel [40], a package allowing parallel evaluation for Bioconductor [41] objects. We chose this package for its multiplatform portability and since it is optimized to work on bam files.

We tested the parallel computation by using two example datasets, one composed of six bam files of a cell culture from mouse (mouse dataset) and one consisting of seven samples of a cell culture from Drosophila melanogaster (Drosophila dataset), published in [42]. The mouse dataset has a total amount of data of about 38.4 GB and approximately 572 million reads, while the Drosophila dataset has a total of approximately 360 million reads for about 11.2 GB.
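The idea behind the parallel read counting step of Section 2.4 can be illustrated with a minimal Python sketch. It is not the BiocParallel-based R code used by RNASeqGUI: the toy annotation, the sample data, and the names `count_reads` and `count_all` are hypothetical, and real alignment files would be read with a dedicated library. The sketch only shows the pattern of distributing one counting task per sample over a pool of workers.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical toy annotation: gene -> (chromosome, start, end), inclusive.
ANNOTATION = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 300, 400),
    "geneC": ("chr2", 50, 150),
}

# Hypothetical toy samples: (sample name, list of (chromosome, read start)).
SAMPLES = [
    ("sample1", [("chr1", 120), ("chr1", 350), ("chr2", 60), ("chr1", 999)]),
    ("sample2", [("chr1", 150), ("chr1", 160), ("chr2", 500)]),
]


def count_reads(sample):
    """Count, for one sample, the reads starting inside each annotated gene."""
    name, reads = sample
    counts = {gene: 0 for gene in ANNOTATION}
    for chrom, pos in reads:
        for gene, (g_chrom, start, end) in ANNOTATION.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[gene] += 1
    return name, counts


def count_all(samples, workers=1, executor_cls=ProcessPoolExecutor):
    """Run count_reads on every sample, serially or with a pool of workers."""
    if workers <= 1:
        return dict(count_reads(sample) for sample in samples)
    # Each sample is an independent task, so the work parallelizes by file.
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(count_reads, samples))
```

Calling `count_all(SAMPLES, workers=2)` from a script guarded by `if __name__ == "__main__":` uses separate processes; passing `executor_cls=ThreadPoolExecutor` is a lighter alternative for quick checks, although threads do not speed up CPU-bound Python code. Parallelizing per sample matches the structure of the problem described above, where each alignment file can be counted independently.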

Figure 5: An example of HTML table using the ReportingTools package. By clicking on the gene of interest the user is redirected to well-known databases, such as NCBI or ENSEMBL.

Table 1: Time (in seconds) necessary to execute the counting procedure for RNA-Seq reads on two example datasets (mouse and Drosophila datasets). On the rows are represented two datasets used for the read counting step and the columns indicate if the parallel computing was used or not, on two different machines. The test was performed on a desktop personal computer with Intel I7-[email protected] GHz and 24 GB of RAM, running Ubuntu 14.04 and on a Cluster node composed of 12 cores of Intel Xeon [email protected] GHz, with 64 GB of RAM running CentOS release 6.5.

                      Desktop                           Cluster
              Parallel (s)  Not parallel (s)    Parallel (s)  Not parallel (s)
Mouse              725           2027               2339           3409
Drosophila         442            559                416            969
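As a small worked example, the speed-up implied by Table 1 can be computed as the ratio between the non-parallel and the parallel time. The sketch below is only arithmetic on the values reported in the table; the dictionary layout and the `speedup` helper are illustrative names, not part of RNASeqGUI.

```python
# Times in seconds from Table 1, stored as (parallel, not parallel).
TIMES = {
    ("desktop", "mouse"): (725, 2027),
    ("desktop", "drosophila"): (442, 559),
    ("cluster", "mouse"): (2339, 3409),
    ("cluster", "drosophila"): (416, 969),
}


def speedup(parallel_s, serial_s):
    """Speed-up factor: how many times faster the parallel run is."""
    return serial_s / parallel_s


SPEEDUPS = {key: round(speedup(p, s), 2) for key, (p, s) in TIMES.items()}

for (machine, dataset), value in SPEEDUPS.items():
    print(f"{machine:<8} {dataset:<11} {value:.2f}x")
```

With these values the speed-up ranges from about 1.26 (Drosophila on the desktop) to about 2.80 (mouse on the desktop), with 1.46 and 2.33 for mouse and Drosophila on the cluster node.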

Figure 6: A typical execution flow involving the caching procedure. During step 1, a code chunk is executed producing output data and a caching database file containing the input/output data. During step 2, when the same code chunk is executed, the output is drawn from the cache database.

For the test we used two machines with R version 3.1.2: one desktop personal computer and one node of a cluster. The desktop PC is configured with an Intel [email protected] GHz running Ubuntu 14.04, while the cluster node is equipped with 12 cores of Intel Xeon [email protected] GHz, running CentOS release 6.5.

As shown in Table 1, the computational time is reduced in every case when parallel computing is used, both on the desktop PC and on the cluster node.

The rows of Table 1 report the times in seconds for the tested datasets, using the SummarizeOverlaps method of the GenomicRanges package [31]. The columns are the computational times, measured with and without parallel computing, on each machine, using 8 cores on the desktop PC and 12 cores on the cluster.

2.5. Installation and Environmental Requirements. RNASeqGUI is designed as a desktop application and requires a machine equipped with at least 8 GB of RAM. The novel

version 1.1.0 successfully runs with R v3.2.2 and Bioconductor v3.2 on all major operating systems, such as Linux, Mac OS X Yosemite, and Windows. Its functionalities work both on complex eukaryotic genomes (e.g., human and mouse) and on simpler organisms (e.g., Drosophila melanogaster). The installation procedure and the additional requirements (specific for each operating system) are detailed in the user manual, available at http://bioinfo.na.iac.cnr.it/RNASeqGUI/Manual.html.

It is also possible to use RNASeqGUI (v 1.1.0) in a cluster environment. To start RNASeqGUI on the cluster, we simply used the command ssh -X user@clusterhostdomain and ran RNASeqGUI in the R shell as described in the manual. In this way, it was possible to use RNASeqGUI in remote mode from a computer running the X Window System, making the data present on the cluster directly accessible by RNASeqGUI.

Figure 7: Schematic illustration of caching in RNASeqGUI. Panel (a) represents the caching file creation process. For each button of RNASeqGUI (i.e., for each function executed, from function 1 to function n), one caching database file is created and a mark in the R markdown file is inserted, for its future load. Panel (b) represents the loading process during the report creation. Once the html button (in the log files section of RNASeqGUI) is selected, the R markdown file is compiled and the data in the caching files are loaded to speed up the creation of the entire report.
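The caching scheme of Section 2.3, summarized in Figure 7, can be sketched in a few lines of Python. This is an illustration of the general compute-once, load-afterwards pattern and not the filehash-based R implementation of RNASeqGUI: the `StepCache` class, its `run` method, the pickle file format, and the toy normalization function are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path


class StepCache:
    """Hypothetical file-backed cache: one database file per analysis step."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def run(self, step, func, *args):
        """Load the stored output of `step` if it exists; else compute and store it."""
        path = self.cache_dir / f"{step}.pkl"
        if path.exists():
            # Second and later executions: draw the output from the cache file.
            with path.open("rb") as handle:
                return pickle.load(handle)["output"]
        # First execution: run the step and store its input and output variables.
        output = func(*args)
        with path.open("wb") as handle:
            pickle.dump({"input": args, "output": output}, handle)
        return output


CALLS = {"normalize": 0}  # counts how many times the step is really executed


def toy_upper_quartile_normalize(counts):
    """Toy stand-in for a normalization step: divide by a simple upper quartile."""
    CALLS["normalize"] += 1
    ordered = sorted(counts)
    q3 = ordered[(3 * (len(ordered) - 1)) // 4]
    return [value / q3 for value in counts]


cache = StepCache(Path(tempfile.mkdtemp()) / "Logs" / "cache")
first = cache.run("uqua", toy_upper_quartile_normalize, [10, 20, 30, 40, 80])
second = cache.run("uqua", toy_upper_quartile_normalize, [10, 20, 30, 40, 80])
```

After the two calls, `CALLS["normalize"]` is 1: the second call returns the stored result without rerunning the function. Because the key is only the step name, a run of the same step with different parameters would collide with the existing entry, which is comparable to the situation the text addresses by recommending one project per setting.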

2.6. Extensibility. One of the most appealing features of RNASeqGUI regards the fact that it is relatively simple to add a new functionality. In fact, only three steps are necessary to add a new button (i.e., a new function).

Firstly, the user has to write his own function, putting it in an appropriate R source file. After that, he has to write the code to create the button in the selected interface section, and, finally, he has to create the code to bind together the function and the button. The user manual explains through an example how the latter two steps can be performed.

As a consequence, in the spirit of open source, the user is allowed not only to include de novo developed functions, but also to use already developed packages in order to extend the features of RNASeqGUI. However, we note that a new method added by a user will not automatically possess the reproducible research and caching features. Consequently, the usage of the new method will not be reported in the report file generated by RNASeqGUI. Future releases of RNASeqGUI will try to face this issue as well.

3. Conclusions

In this work, we have presented a novel version of RNASeqGUI that combines the flexibility of a graphical user interface with the tools available in Bioconductor for RR. The novel version significantly extends the original version with respect to several aspects [15, 17] (see Figure 1).

For each comprehensive analysis, not only does RNASeqGUI keep track of all actions executed by the user, but it also provides a set of cached objects saved in a database (by storing some intermediate results of the analysis) and in addition it generates a human readable report, which combines data, figures, and tables with the source code used to generate them. In this manner, the results (i.e., figures, tables, etc.) can be directly used in a publication, while the report can be viewed as a kind of supplementary information of a paper. Moreover, the database of cached objects can be shared via Internet, allowing collaborators, reviewers, and readers to perform the same analysis using the same data. Thanks to the report and to the availability of the cached objects database, not only does the user promote the transparency of his own work, but he also improves knowledge transfer and allows other readers to execute alternative analyses starting from intermediate results of the original analysis.

Moreover, we extended RNASeqGUI in the number of interfaces and functionalities, also within each interface. We added the possibility of handling complex/multifactor designs by using several different DE methods and the possibility of conducting two different types of analyses for biological/technical replicates and also implemented the Pathway and Gene Ontology analysis. Therefore, the novel version constitutes a self-contained software able to support researchers in extracting biologically relevant information from the analysis of large datasets of RNA-Seq experiments. RNASeqGUI is a growing platform for the analysis of RNA-Seq data. Future releases will include other functionalities such as the possibility of identifying and estimating isoform abundances, in order to extend the range of supported features [17].

Finally, we hope that this work will constitute a proof of concept on how RR features can be incorporated in GUIs in a useful and suitable way, and that it will therefore promote the development of novel computational software for the analysis of other NGS data (e.g., ChIP-Seq data, BS-Seq, etc.) in the spirit of RR.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors want to thank M. Franzese, V. Costa, and R. Esposito for suggestions and discussions and D. Granata for technical support. This work was supported by the Italian Flagship InterOmics Project (PB.P05), BMBS COST Action BM1006, PON01-02460.

References

[1] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[2] V. Costa, C. Angelini, I. De Feis, and A. Ciccodicola, “Uncovering the complexity of transcriptomes with RNA-Seq,” Journal of Biomedicine and Biotechnology, vol. 2010, Article ID 853916, 19 pages, 2010.
[3] F. Ozsolak and P. M. Milos, “RNA sequencing: advances, challenges and opportunities,” Nature Reviews Genetics, vol. 12, no. 2, pp. 87–98, 2011.
[4] E. L. van Dijk, H. Auger, Y. Jaszczyszyn, and C. Thermes, “Ten years of next-generation sequencing technology,” Trends in Genetics, vol. 30, no. 9, pp. 418–426, 2014.
[5] V. Costa, M. Aprile, R. Esposito, and A. Ciccodicola, “RNA-Seq and human complex diseases: recent accomplishments and future perspectives,” European Journal of Human Genetics, vol. 21, no. 2, pp. 134–142, 2013.
[6] S. Pepke, B. Wold, and A. Mortazavi, “Computation for ChIP-seq and RNA-seq studies,” Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[7] A. Oshlack, M. D. Robinson, and M. D. Young, “From RNA-seq reads to differential expression results,” Genome Biology, vol. 11, no. 12, article 220, 2010.
[8] D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg, “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions,” Genome Biology, vol. 14, no. 4, article R36, 2013.
[9] C. Trapnell, D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn, and L. Pachter, “Differential analysis of gene regulation at transcript resolution with RNA-seq,” Nature Biotechnology, vol. 31, no. 1, pp. 46–53, 2013.
[10] F. Finotello and B. Di Camillo, “Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis,” Briefings in Functional Genomics, vol. 14, no. 2, pp. 130–142, 2015.
[11] C. Trapnell, A. Roberts, L. Goff et al., “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks,” Nature Protocols, vol. 7, no. 3, pp. 562–578, 2012.
[12] J. Goecks, A. Nekrutenko, J. Taylor et al., “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences,” Genome Biology, vol. 11, no. 8, article R86, 2010.
[13] R. Sanges, F. Cordero, and R. A. Calogero, “oneChannelGUI: a graphical interface to Bioconductor tools, designed for life scientists who are not familiar with R language,” Bioinformatics, vol. 23, no. 24, pp. 3406–3408, 2007.
[14] M. Lohse, A. M. Bolger, A. Nagel et al., “RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics,” Nucleic Acids Research, vol. 40, no. 1, pp. W622–W627, 2012.
[15] F. Russo and C. Angelini, “RNASeqGUI: a GUI for analysing RNA-Seq data,” Bioinformatics, vol. 30, no. 17, pp. 2514–2516, 2014.
[16] M. D’Antonio, P. D’Onorio De Meo, M. Pallocca et al., “RAP: RNA-Seq analysis pipeline, a new cloud-based NGS web application,” BMC Genomics, vol. 16, supplement 6, article S3, 2015.
[17] A. Poplawski, F. Marini, M. Hess, T. Zeller, J. Mazur, and H. Binder, “Systematically evaluating interfaces for RNA-seq analysis from a life scientist perspective,” Briefings in Bioinformatics, 2015.
[18] R. Gentleman, “Reproducible research: a bioinformatics case study,” Statistical Applications in Genetics and Molecular Biology, vol. 4, no. 1, article 2, 25 pages, 2005.
[19] R. D. Peng, “Reproducible research in computational science,” Science, vol. 334, no. 6060, pp. 1226–1227, 2011.
[20] D. C. Ince, L. Hatton, and J. Graham-Cumming, “The case for open computer programs,” Nature, vol. 482, no. 7386, pp. 485–488, 2012.
[21] “Enhancing reproducibility,” Nature Methods, vol. 10, no. 5, article 367, 2013.
[22] R. D. Peng, “Reproducible research and Biostatistics,” Biostatistics, vol. 10, no. 3, pp. 405–408, 2009.
[23] A. Nekrutenko and J. Taylor, “Next-generation sequencing data interpretation: enhancing reproducibility and accessibility,” Nature Reviews Genetics, vol. 13, no. 9, pp. 667–672, 2012.
[24] V. Stodden, F. Leisch, and R. D. Peng, Eds., Implementing Reproducible Research, CRC Press, 2014.
[25] C. Fresno and E. A. Fernández, “RDAVIDWebService: a versatile R interface to DAVID,” Bioinformatics, vol. 29, no. 21, pp. 2810–2811, 2013.
[26] A. L. Tarca, S. Draghici, P. Khatri et al., “A novel signaling pathway impact analysis,” Bioinformatics, vol. 25, no. 1, pp. 75–82, 2009.
[27] W. Luo, M. S. Friedman, K. Shedden, K. D. Hankenson, and P. J. Woolf, “GAGE: generally applicable gene set enrichment for pathway analysis,” BMC Bioinformatics, vol. 10, no. 1, article 161, 2009.
[28] R. D. Peng, “Caching and distributing statistical analyses in R,” Journal of Statistical Software, vol. 26, no. 7, pp. 1–24, 2008.
[29] M. Lawrence and T. L. Duncan, “RGtk2: a graphical user interface toolkit for R,” Journal of Statistical Software, vol. 37, no. 8, pp. 1–52, 2010.
[30] Y. Liao, G. K. Smyth, and W. Shi, “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote,” Nucleic Acids Research, vol. 41, no. 10, article e108, 2013.
[31] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, and M. Carlson, “Software for computing and annotating genomic ranges,” PLoS Computational Biology, vol. 9, no. 8, Article ID e1003118, 2013.
[32] M. A. Huntley, J. L. Larson, C. Chaivorapol et al., “ReportingTools: an automated result processing and presentation toolkit for high-throughput genomic analyses,” Bioinformatics, vol. 29, no. 24, pp. 3220–3221, 2013.
[33] Z. Liu and S. Pounds, “An R package that automatically collects and archives details for reproducible computing,” BMC Bioinformatics, vol. 15, article 138, 2014.
[34] S. Falcon, weaver: Tools and extensions for processing Sweave documents. R package version, 1(0), 2007.
[35] Y. Xie, Dynamic Documents with R and knitr, CRC Press, New York, NY, USA, 2nd edition, 2015.
[36] R. Peng, “Interacting with data using the filehash package for R,” Working Paper 108, Department of Biostatistics Working Papers, Johns Hopkins University, Baltimore, Md, USA, 2006.
[37] Revolution Analytics and S. Weston, “DoParallel: foreach parallel adaptor for the parallel package,” R Package Version, vol. 1, no. 8, 2014.
[38] S. Weston, “Using The foreach Package,” 2014.
[39] L. Tierney, A. J. Rossini, and N. Li, “Snow: a parallel computing framework for the R system,” International Journal of Parallel Programming, vol. 37, no. 1, pp. 78–90, 2009.
[40] M. Morgan, V. Carey, and M. Lawrence, “BiocParallel: Bioconductor Facilities for Parallel Evaluation,” R Package Version 0.4.1, 2014.
[41] R. C. Gentleman, V. J. Carey, D. M. Bates et al., “Bioconductor: open software development for computational biology and bioinformatics,” Genome Biology, vol. 5, no. 10, article R80, 2004.
[42] A. N. Brooks, L. Yang, M. O. Duff et al., “Conservation of an RNA regulatory map between Drosophila and mammals,” Genome Research, vol. 21, no. 2, pp. 193–202, 2011.

Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 878546, 10 pages http://dx.doi.org/10.1155/2015/878546

Research Article

Identification of Gene Expression Pattern Related to Breast Cancer Survival Using Integrated TCGA Datasets and Genomic Tools

Zhenzhen Huang,1 Huilong Duan,1 and Haomin Li2,3

1 College of Biomedical Engineering and Instrument Science, Zhejiang University, Zhouyiqing Building No. 510, Yuquan Campus, Hangzhou 310027, China
2 The Children’s Hospital, Zhejiang University, Zhouyiqing Building No. 510, Yuquan Campus, Hangzhou 310003, China
3 The Institute of Translational Medicine, Zhejiang University, Zhouyiqing Building No. 510, Yuquan Campus, Hangzhou 310029, China

Correspondence should be addressed to Haomin Li; [email protected]

Received 3 July 2015; Revised 14 September 2015; Accepted 28 September 2015

Academic Editor: Sílvia A. Sousa

Copyright © 2015 Zhenzhen Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Several large-scale human cancer genomics projects such as TCGA offer huge genomic and clinical datasets from which researchers can identify meaningful genomic alterations involved in the development and metastasis of tumors. A web-based TCGA data analysis platform called TCGA4U was developed in this study. TCGA4U provides a visualization solution that illustrates the relationship of these genomic alterations with clinical data. A whole-genome screening of survival related gene expression patterns in breast cancer was conducted. The list of genes that impact breast cancer patient survival was divided into two patterns, and the gene list of each pattern was analyzed separately on DAVID. The results showed that mitochondrial ribosomes play a more crucial role in cancer development. We also report that breast cancer patients with a low HSPA2 expression level had shorter overall survival time, which differs markedly from findings on the HSPA2 expression pattern in other cancer types. TCGA4U provides a new perspective on the TCGA datasets. We believe it can inspire more biomedical researchers to study and explain the genomic alterations in cancer development and to discover more targeted therapies to help more cancer patients.

1. Introduction

Breast cancer is one of the most common cancers and the leading cause of cancer death among women worldwide, with 2.6 women diagnosed every minute and more than 52 women dying every hour in 2008 [1]. With the public availability of genomic data from projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), many bioinformatics researchers have analyzed gene expression data together with clinical data in attempts to predict prognosis and find biomarkers for therapy [2–5]. These studies have made clear progress in the prediction of cancer prognosis. Integrating gene expression data with clinical outcome data makes it possible to correlate expression patterns with survival. Screening the whole genome to identify statistically significant gene expression patterns that impact survival will help to direct targets for translational research. Heterogeneity in the expression of a specific gene across the cancer population is well known. Such heterogeneity may regulate proliferation, survival, angiogenesis, metastasis, and other processes. Mining these gene expression patterns and illustrating their mechanisms at the molecular and signal pathway levels will help researchers and clinicians to subclassify and treat cancer with more precision. In this study, by integrating gene expression data and clinical outcome data of breast cancer from TCGA datasets on a web-based genomic analysis platform (TCGA4U), breast cancer survival related gene expression patterns were identified and analyzed.

2. TCGA4U

TCGA provides a platform for researchers to search, download, and analyze datasets including clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes of nearly 50 tumor types [6]. Many researchers apply statistical methods, novel algorithms, and computational models to these high-throughput genomic data, including copy number alterations (CNAs), mRNA and small RNA expression, somatic mutation, and DNA methylation data, to find potential driver mutations and genes for improving cancer prevention, early detection, and treatment [7–14]. However, many clinical researchers lack sufficient knowledge of data analysis and training in bioinformatics, and they are not equipped to handle gigabytes of downloaded data. In recent years, web-based analysis tools such as Cancer Genome Workbench (https://cgwb.nci.nih.gov/), cBioPortal for Cancer Genomics (http://cbioportal.org/), Integrative Genomics Viewer (http://www.broadinstitute.org/igv/), and Broad Firehose (http://gdac.broadinstitute.org/) have been used by clinicians and researchers to search for meaningful genomic alterations in support of targeted and personalized treatment in clinical practice [6, 14]. These tools take distinct approaches to visualizing the huge volume of cancer genomics data and the relationships within it, and they are mutually complementary. To help clinical cancer researchers fully benefit from the TCGA datasets through a simple and user-friendly tool, we developed a web-based platform called TCGA4U (http://www.tcga4u.org:8888). TCGA4U is an intuitive web-based tool for analyzing high level genomic data of different TCGA samples in distinct cancer types. The platform also offers statistical analysis results and graphical views to help users find interesting results for further investigation. Besides providing a query service for the genomic characteristics of a specific gene or gene list, such as CNAs, somatic mutation, gene expression, and DNA methylation, TCGA4U integrates clinical data, gene ontology, and data mining results with the gene-level data to provide more insights for clinical investigation. One of its unique features is an interactive user interface that allows survival analysis of specific gene alterations. As shown in Figure 1, the distribution of patients with different gene expression values is displayed in the left panel. Survival curves of patient subgroups defined by their expression values are provided for users. In the current TCGA4U, four types of genomic data (somatic mutation, DNA methylation, gene expression, and copy number variation) are integrated with the follow-up data to provide survival related analysis. This makes complex relationships between cancer genomics profiles and clinical outcomes accessible and understandable to researchers and clinicians without bioinformatics expertise, thus facilitating biological discoveries.

Figure 1: Exploring gene expression distribution and survival curves on TCGA4U.

Theoretically, a comprehensive survival related analysis of gene alterations can be carried out for different data types in different cancer types. However, TCGA measures hundreds of thousands of variables for each data type in each sample, and the sheer volume of possible associations across multiple data types is overwhelming even for a computer. In this study, the gene expression data in breast cancer were used to demonstrate the potential power of bioinformatics approaches to leverage the TCGA big data.

In this study, the difference among the survival curves of different patient subgroups is assessed using the Log-Rank test. A gene list can be identified on the basis of a statistical threshold such as p < 0.005. Generally, genes in this list can be further classified into two groups: high expression value correlated with poor outcome and low expression value correlated with poor outcome. By analyzing these gene lists at the molecular and signal pathway levels, some of these genes can serve as biomarkers to predict the clinical outcome.

3. Materials and Methods

3.1. Data Preparation and Integration. All genomic data and clinical data of breast cancer were downloaded from the TCGA data portal over two months, from February 2014 to April 2014. The downloaded data files, comprising four gene-level data types (copy number variants, gene expression, somatic mutation, and DNA methylation) and clinical data, were imported into a relational database whose schema was defined based on the downloaded tab-delimited files. The TCGA barcode IDs for samples and patients in the different data files were used to associate the data tables. In addition, reference data such as the human genome (hg19/build37) were downloaded as a part of the TCGA4U database. The following data mining and analysis were based on these integrated TCGA4U datasets.

3.2. Correlation of Gene Expression Pattern with Survival. The distributions of gene expression values across the population were surveyed for a total of 14,819 genes expressed in breast tumor tissue. For each gene, the cancer population can be divided into two groups: gene expression higher than in normal tissue (log2 Lowess normalized value > 0) and gene expression lower than in normal tissue (log2 Lowess normalized value < 0). The survival of the two patient subgroups for each gene was compared and tested with the Log-Rank test. Genes with a Log-Rank test p value < 0.005 and subgroup observed times greater than 4 were selected into a gene list for further analysis. Genes in this list can be further divided into two gene lists: high expression pattern correlated with poor survival and low expression pattern correlated with poor survival. DAVID (http://david.abcc.ncifcrf.gov/) was used to conduct the Functional Annotation Clustering analysis on these two gene lists. The enriched gene clusters, which contain not only the identified genes but also their related genes, were clustered using the gene expression profiles to confirm that the pathways or functional units play an important role in tumor evolution and patient survival.

3.3. Bioinformatics Tools. Most of the data analyses were conducted in R 3.1.1. The Log-Rank test was conducted with the survdiff function in the survival package (version 2.37-7). Heat maps were plotted with the heatmap.2 function in the gplots package (version 2.14.1). K-means clustering was calculated with the kmeans function in the stats package (version 3.1.1) in R.

4. Results

4.1. Log-Rank Test Results of Gene Expression Patterns Correlated with Survival. Using the aforementioned methods, TCGA4U provides an interactive interface for users to query the distribution of gene expression values and the corresponding survival curves of the two gene expression patterns. Please visit http://www.tcga4u.org:8888/GenomicAnalysis for details. The results of the Log-Rank test of 14,811 genes were published at http://www.tcga4u.org:8888/SurvivalLogRank (please select “breast invasive carcinoma” in the disease type dropdown list and “Expression HighLow” in the characteristic type dropdown list).

As mentioned before, a Log-Rank test p value < 0.005 and observed times greater than 4 were used to select 201 genes whose gene expression pattern is significantly related to patient survival. This gene list was further divided into two gene lists based on gene expression pattern: patients with high gene expression have poor outcome (pattern I) or patients with low gene expression have poor outcome (pattern II). In total, 107 genes were grouped into pattern I and 94 genes were grouped into pattern II (please check the supplementary files for details of the two gene lists in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/878546). The part of the gene lists of the two patterns that will be discussed later is given in Table 1.

Table 1: Part of the gene list of two patterns.

Annotations                  Gene      Log-Rank test (p value)   Mean survival (high/low, months)
Pattern I
Oxidative phosphorylation    ATP5G3    0.0000572                 79.4/156.5
Oxidative phosphorylation    ATP5E     0.0029381                 117.8/182.1
Oxidative phosphorylation    COX8A     0.0003704                 98.4/172.8
Oxidative phosphorylation    COX5B     0.0013046                 120.4/158.3
Oxidative phosphorylation    SDHD      0.0000044                 53.6/155.6
Oxidative phosphorylation    UQCRB     0.0042031                 125.2/182.7
Oxidative phosphorylation    SDHA      0.0000869                 42.2/151.6
Mitochondrial ribosome       MRPL13    0.0027950                 119.1/166.9
Mitochondrial ribosome       MRPL18    0.0000003                 80.4/173.9
Mitochondrial ribosome       MRPS23    0.0037369                 64.1/154.8
Mitochondrial ribosome       MRPS25    0.0000036                 45.0/152.1
Mitochondrial ribosome       MRPS7     0.0014625                 98.8/165.8
Proteasome                   PSMD12    0.0001182                 99.6/161.4
Proteasome                   PSMD14    0.0000011                 62.9/155.2
Proteasome                   PSMA6     0.0026322                 85.5/163.8
Proteasome                   PSMB1     0.0030467                 82.4/153.2
Pattern II
Ribosome                     RPL13A    0.0002459                 154.9/63.2
Ribosome                     RPL3      0.0011936                 158.2/115.1
Ribosome                     RPS27     0.0009400                 162.7/77.4
Ribosome                     RPS9      0.0011063                 160.5/72.6
DNA repair                   MGMT      0.0002336                 150.7/42.7
DNA repair                   ATXN3     0.0048654                 165.7/122.9
DNA repair                   POLI      0.0020534                 165.9/112.8
DNA repair                   PML       0.0000252                 159.2/64.0

4.2. Functional Annotation Clustering. The Functional Annotation Clustering module of DAVID was used to classify each gene list into functionally related gene groups. It generates a 2D view of related gene-term relationships and ranks annotation groups by enrichment. The 107 pattern I genes and the 94 pattern II genes were analyzed separately on DAVID with the species set to Homo sapiens.

The most significant annotation enrichment in the pattern I gene list is “mitochondrion.” In total, 31 genes in the list are related to the mitochondrion, 18 of which are clustered into the “mitochondrion part” annotation and play a crucial role in ATP synthase and mitochondrial protein synthesis. Another enriched annotation of the pattern I gene list is related to protein synthesis and degradation. Five genes from the two subunits of the mitochondrial ribosome and four genes that encode proteasome proteins are found in the pattern I gene list.

The clustering results of the functional annotation of the pattern II gene list do not give a dominant functional group. The most enriched annotation is the “membrane-enclosed lumen,” especially the “nuclear lumen.” Four genes that encode ribosome proteins are found in the pattern II gene list, which is paradoxical given the mitochondrial ribosome gene expression pattern (this is discussed in detail later). As expected, four genes related to “DNA repair” were found in the pattern II gene list. Low expression of such DNA repair genes will increase the risk of cancer and also give different therapy responses that will affect overall survival.

4.3. Aggressive Tumor with More Mitochondrial Activity. Mitochondria generate much of the cellular energy, regulate the cellular redox state, and produce most of the cellular reactive oxygen species (ROS) [15]. Cancer cells need sufficient energy for cell growth, differentiation, and development, which the mitochondria supply in the form of ATP produced by oxidative phosphorylation [16]. Thirty-one genes related to the mitochondrion were identified in the pattern I gene list, which supports the findings mentioned above. Among these genes, 7 are located on the KEGG oxidative phosphorylation pathway, as shown in Figure 2. These 7 genes cover the entire electron transport chain except complex I. In other words, the whole oxidative phosphorylation pathway is more active in breast cancer patients with poor survival.

[Figure 2 diagram: the mitochondrial oxidative phosphorylation chain across the intermembrane space, showing Complex I (NADH dehydrogenase), Complex II (succinate dehydrogenase), Complex III (cytochrome b-c1), Complex IV (cytochrome c oxidase), and ATP synthase, annotated with the genes SDHD, COX5B, ATP5G3, UQCRB, SDHA, COX8A, and ATP5E.]

Figure 2: High expression of oxidative phosphorylation complex proteins correlated with poor survival.

Ribosomes play a critical role in translating those genes into protein. In eukaryotic cells, there are two types of ribosomes: the cytosol ribosome and the mitochondrial ribosome. The cytosol ribosome serves as the site of protein synthesis for most proteins, whereas many mitochondrial proteins that are essential for oxidative phosphorylation are encoded by mRNAs that are translated only on mitochondrial ribosomes. As Table 1 shows, the genes that encode important mitochondrial proteins (such as the oxidative phosphorylation related proteins) and the genes responsible for synthesizing these mitochondrial proteins (such as the genes that encode the mitochondrial ribosome proteins) were highly expressed in the poor survival group. By contrast, the lower expression of cytosol ribosome genes shown by the annotation analysis of pattern II has not been reported in previous studies.

4.4. Mitochondrial Ribosome versus Cytosol Ribosome. The ribosome plays an important role in protein synthesis through translation and is also essential for cell growth, proliferation, and development. An interesting phenomenon in our results is that the mitochondrial ribosome and the cytosol ribosome have very different gene expression patterns. As shown in Figure 3(a), five mitochondrial ribosome genes (MRPL13, MRPL18, MRPS23, MRPS25, and MRPS7) are characterized by pattern I, in which high gene expression is related to shorter overall survival time, while four cytosol ribosome genes (RPL13A, RPL3, RPS27, and RPS9) are characterized by pattern II.

Figure 3(b) gives the gene expression clustering heat map of these 9 genes in the breast cancer dataset. It shows that the cytosol ribosome related genes are highly expressed in breast cancer. This confirms previous findings that ribosome production is enhanced in cancer cells and that ribosome biogenesis plays a crucial role in tumor progression [17, 18]. The gene expression values of the mitochondrial ribosome genes, in contrast, are observably low. The genes from the cytosol ribosome and the mitochondrial ribosome are grouped into two clusters. The patients were also clustered into two groups, in which Cluster I contained samples with relatively higher cytosol ribosome expression and relatively lower mitochondrial ribosome expression. The Kaplan-Meier survival curves of Cluster I and Cluster II are shown in Figure 3(c). They support the finding that breast cancer patients with relatively higher mitochondrial ribosome gene expression and lower cytosol ribosome gene expression had shorter overall survival time.

[Figure 3 panels: (a) nine Kaplan-Meier plots, one each for MRPL13, MRPL18, MRPS23, MRPS25, MRPS7, RPL13A, RPL3, RPS27, and RPS9, with x-axis “Survival time (days)” from 0 to 7000, y-axis “Survival rate” from 0.0 to 1.0, and curves for the Low and High expression groups; (b) heat map with rows RPL13A, RPS9, RPL3, and RPS27 (cytosol) and MRPS25, MRPL13, MRPL18, MRPS7, and MRPS23 (mitochondrion), and sample columns split into Cluster I and Cluster II; (c) Kaplan-Meier plot for Cluster I and Cluster II with the same axes, p = 0.0893.]

Figure 3: Different gene expression patterns of mitochondrial ribosome and cytosol ribosome. (a) The Kaplan-Meier survival curves of gene expression pattern of mitochondrial ribosome genes (MRPL13, MRPL18, MRPS23, MRPS25, and MRPS7) and cytosol ribosome genes (RPL13A, RPL3, RPS27, and RPS9). (b) A gene expression clustering heat map of the above 9 mitochondrial ribosome and cytosol ribosome genes in the TCGA breast cancer dataset (red for high expression value and blue for low expression value). Genes with different expression levels were clustered into cytosol and mitochondrion ribosome groups. The samples were also clustered into two groups (Cluster I and Cluster II) based on the expression values of these 9 genes. (c) The Kaplan-Meier survival curves of Cluster I and Cluster II samples show the difference in overall survival.

Compared with the cytosol ribosome genes, which are highly expressed overall, the highly expressed mitochondrial ribosome genes in breast cancer patients are easier to detect against their generally lower expressed background. Furthermore, the average chi-square statistic in the Log-Rank tests of the five mitochondrial ribosome genes is higher than the corresponding statistic of the four cytosol ribosome genes (15.036 versus 11.381). Considering the role of the mitochondrion in cancer mentioned above, we believe that the mitochondrial ribosomes play a more crucial role in cancer development. The upregulated mitochondrial ribosome may be the result of reprogrammed energy metabolism that the tumor acquired during evolution to fulfill the energy requirement of continuous cell proliferation, while the downregulated cytosol ribosome in a small part of the samples can be explained by the energy gap the tumor cell faces, which forces a tradeoff between cell proliferation and energy generation.

To confirm this finding and identify more potential biomarkers, the gene expression clustering heat map of all the genes that encode mitochondrial ribosome proteins is shown in Figure 4. The genes that cluster into the same group as the identified genes were also investigated. In the cluster analysis of the 28S subunit of the mitochondrial ribosome, MRPS7 and MRPS23, which are both identified in our pattern I gene list and have similar expression profiles, are clustered together. Therefore, MRPS7 and MRPS23 have the potential to become biomarkers for prognosis assessment. In the 39S subunit of the mitochondrial ribosome, MRPL28 and MRPL22, which are clustered with MRPL18, have the same gene expression pattern as MRPL18. Patients with a high MRPL28 expression level have a mean survival time of 101.1 months, and the low expression group has a mean survival time of 157.0 months.

[Figure 4 heat map row labels, as extracted. (a) MTIF3, CHCHD1, MRPS26, MRPS35, MRPS34, MRP63, AURKAIP1, MRPS36, DAP3, MRPS21, MRPS30, MRPS7, MRPS23, MRPS17, MRPS18A, MRPS12, MRPS14, MRPS33, MRPS11, MRPS15, ERAL1, MRPS9, MRPS10, MRPS18B, MRPS16, MRPS31, MRPS27, C7orf30, MRPS25, MRPS2, MRPS6, AIP, MRPL51, MRPS18C, MRPS24, PTCD3, MRPS5, MRPL42, MRPS22. (b) MRPL55, MRPL24, MRPL43, MRPL53, MRPL54, MRPL20, MRPL41, ICT1, MRPL38, MRPL27, MRPL49, MRPL47, MRPL9, MRPL1, MRPL21, MRPL39, MRPL16, MRPL50, MRPL13, MRPL15, MRPL45, MRPL48, MRPL11, MRPL30, MRPL42, MRPL51, MRPL46, MRPL44, MRPL4, MRPL37, MRPL12, MRPL14, MRPL33, MRPL35, MRPL19, MRPL2, MRPL52, GADD45G, MRPL40, MRPL10, MRPL23, MRPL17, MRPL18, MRPL28, MRPL22, MRPL32, MRPL34, MRPL3, MRPL36.]

Figure 4: Clustering heat map of mitochondrial ribosome gene lists. (a) Mitochondrial ribosome 28S. (b) Mitochondrial ribosome 39S.
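The per-gene screening described in Section 3.2, which underlies Table 1 and Figures 3 and 4, can be illustrated with a small self-contained sketch. The original analysis used the survdiff function of the R survival package; the Python code below is only an approximation written from the textbook two-group Log-Rank statistic. The function names (logrank_test, screen_gene), the rule used to label a gene as pattern I or pattern II (observed versus expected deaths in the high-expression group), and the toy data are assumptions made for illustration. The additional “observed times more than 4” filter is not reproduced, because the text does not define it precisely.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group Log-Rank test (hypothetical helper, not the authors' code).

    times_*  : follow-up times in any consistent unit, e.g. days
    events_* : 1 if death was observed, 0 if the patient was censored
    Returns (chi_square, p_value, observed_a - expected_a).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # patients still at risk just before time t, per group
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        # deaths occurring exactly at time t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # the hypergeometric variance term is undefined for n == 1
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0, observed_a - expected_a
    chi2 = (observed_a - expected_a) ** 2 / variance
    # upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, observed_a - expected_a


def screen_gene(log2_values, times, events, alpha=0.005):
    """Split patients at log2 expression 0 and compare the two groups.

    Returns "I" (high expression, poor outcome), "II" (low expression,
    poor outcome), or None if the gene does not pass the p-value threshold.
    The pattern rule is an assumption: more deaths than expected in the
    high-expression group is read as pattern I.
    """
    high = [i for i, v in enumerate(log2_values) if v > 0]
    low = [i for i, v in enumerate(log2_values) if v < 0]  # values equal to 0 are skipped
    if not high or not low:
        return None
    _, p_value, excess_deaths_high = logrank_test(
        [times[i] for i in high], [events[i] for i in high],
        [times[i] for i in low], [events[i] for i in low],
    )
    if p_value >= alpha:
        return None
    return "I" if excess_deaths_high > 0 else "II"
```

Applied to every gene in an expression matrix, screen_gene would yield the two candidate lists that are then passed to annotation tools such as DAVID. With only a handful of patients the p value rarely drops below 0.005, so a looser alpha is convenient when experimenting with toy data.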

[Figure 5 panels: (a) histogram with x-axis “Expression value (log2)” and y-axis “Count”; (b) survival curves for the Low and High groups with x-axis “Survival time (days)” and y-axis “Survival rate”, p = 0.0005; (c) survival curves for the Low, Normal, and High groups with the same axes, high versus low p = 0.0018; (d) scatter plot of “Survival time (days)” against “Expression value (fold change)” with regression line y = 157.89x + 992.49.]

Figure 5: HSPA2 plays a different role in breast cancer. (a) Histogram of HSPA2 expression value. (b) Survival curves of HSPA2 high expression and low expression. (c) Survival curves of 3 patient groups that are grouped by 𝐾-means of HSPA2 gene expression values. (d) The correlations of death days with HSPA2 expression value.
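The three-group split used for panel (c) can be sketched as a one-dimensional K-means clustering of expression values. The paper used the kmeans function of the R stats package, whose default algorithm is Hartigan-Wong with random starts; the Python sketch below instead uses plain Lloyd iterations with a deterministic start, so it is an illustration of the idea and not a reproduction of the original computation. The function name kmeans_1d and the example values are assumptions.

```python
def kmeans_1d(values, k=3, max_iter=100):
    """Cluster scalar values into k groups (k >= 2) with Lloyd's algorithm.

    Initial centers are spread evenly over the sorted values, so the
    returned labels are ordered: 0 is the lowest group, k - 1 the highest.
    Returns (labels, centers).
    """
    ordered = sorted(values)
    centers = [ordered[round(i * (len(ordered) - 1) / (k - 1))] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # assignment step: each value goes to its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers


# invented log2 expression values for nine patients
expression = [0.1, 0.2, 0.15, 1.0, 1.1, 0.9, 3.0, 3.2, 2.9]
labels, centers = kmeans_1d(expression, k=3)
group_names = ["low", "medium", "high"]
groups = [group_names[lab] for lab in labels]
```

The resulting group labels can then be compared pairwise with a Log-Rank test, as in the “high versus low” comparison reported for panel (c).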

Patients with a high MRPL22 expression level have a mean survival time of 107.3 months, and the low expression group has a mean survival time of 157.8 months. The survival curves for the expression patterns of the related mitochondrial ribosome and cytosol ribosome genes are given in Supplement Figure 1.

4.5. HSPA2 Plays a Different Role in Breast Cancer. We manually reviewed all 201 genes in our results using the Gene Reference into Functions (GeneRIFs) provided by DAVID. We used the keyword “cancer” to search the GeneRIFs and identify the more reliable cancer related genes in our results. During this process, HSPA2 was found to behave differently in our breast cancer data than in previously reported studies of other cancer types. HSPA2 is a member of the HSP70 family of heat shock proteins and is important for cancer cell growth and metastasis [19]. HSPA2 has been highlighted as an important biomarker in many cancer types. Fu et al. confirmed that hepatocellular carcinoma patients with higher HSPA2 expression had shorter overall survival time [20]. Scieglinska et al. showed that high HSPA2 expression was significantly related to shorter overall survival in stage I-II non-small-cell lung carcinoma patients [21]. The results of this study, however, show that HSPA2 plays a totally different role in breast cancer.

As shown in Figure 5(a), HSPA2 is highly expressed in most breast cancer patients. However, patients with low HSPA2 expression had a shorter overall survival time (Figure 5(b)). This result contradicts previous findings reported in other cancers. Considering that the relatively small group with low HSPA2 expression was not convincing on its own, K-means was used to group patients into three groups: high, medium, and low. As shown in Figure 5(c), patients with a low level of HSPA2 gene expression have a shorter overall survival time. To further confirm the correlation between HSPA2 gene expression value and survival time, a scatter diagram of patients’ death days against their HSPA2 gene expression values was plotted in Figure 5(d). The regression line shows a positive correlation, which means that patients with higher HSPA2 expression values have longer survival time. We also checked several other breast cancer datasets at Oncomine (https://www.oncomine.org/), which provides 5-year live status for some breast cancer gene expression datasets. Four additional breast cancer datasets were plotted in Supplement Figure 2 to support our results.

5. Discussion

In this study, we focused on exploring breast cancer survival related gene expression patterns. We therefore used gene expression data and follow-up data to analyze the difference between survival curves at different expression levels through the Kaplan-Meier method and the Log-Rank test. We used the Functional Annotation Clustering of DAVID to cluster these genes by annotation and chose the mitochondrial ribosome and the cytosol ribosome as research objects. We explored the difference in expression of mitochondrial ribosome and cytosol ribosome genes in breast cancer patients and discussed the possible biological mechanisms. We expanded the analysis to genes related to the mitochondrial ribosome and the cytosol ribosome with similar expression patterns and prognosis assessments through cluster heat maps and survival analysis on TCGA4U. Through our bioinformatics approaches, we found that HSPA2 plays a different role in breast cancer. We would like to ask biomedical researchers to study HSPA2 in breast cancer to understand the real biological function of this biomarker.

In 2002, van de Vijver et al. used the correlation coefficient of the expression of each gene with disease outcome (correlation coefficient < −0.3 or > 0.3) to identify 231 genes related to breast cancer outcome. Based on this list, they further established a 70-gene prognosis profile that proved to be a more powerful predictor of disease outcome in young patients with breast cancer than standard systems based on clinical and histologic criteria [22]. We used the dataset of that study to further confirm the HSPA2 expression pattern result, as shown in Supplement Figure 3. In 2012, Patsialou et al. identified several markers in migratory tumor cells that predict clinical outcome in breast cancer patients [23]. Different methods and different samples were used in these two studies and in our study, yet four genes (PGK1, GCN1L1, PRDX5, and SDHD) were identified at least twice across the three studies and might become valuable prognostic tools or therapeutic targets in breast cancer.

At the current stage, we have published some meaningful gene lists for researchers at TCGA4U. In the next stage, more potential relationships between high-dimensional variables in the TCGA datasets will be studied. As the demand for data integration, exploration, and analytics grows, several professional, well-funded web-based tools such as cBioPortal for Cancer Genomics (http://cbioportal.org/) have been developed. The potential value of cancer genomics big data, however, lies in the millions of millions of potential relationships that can be presented in different forms on diverse platforms to inspire different researchers. TCGA4U is not a competitor of other cancer genomics tools but a supplement that provides a unique view of the big data and the relationships within it. The cancer genomics big data can sustain more analysis tools of different kinds.

6. Conclusion

In this study, by developing a novel genomics platform, TCGA4U, and using DAVID, the survival related gene expression patterns in breast cancer were studied. Gene expression patterns and survival curves of all genes expressed in breast tumor can be queried on the TCGA4U website. In this paper, some interesting results were reported: (1) mitochondrial ribosomes play a more crucial role in cancer development; (2) HSPA2 has a widely different gene expression pattern in breast cancer compared with previous findings in other cancer types. We believe that the results published on TCGA4U will inspire more biomedical researchers to explore the biological mechanisms of those genes, to explain their role in breast cancer development more precisely, and to discover more targeted therapies to help more breast cancer patients.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was financially supported by the National High-Tech R&D Program of China (2012AA02A601), National Natural Science Foundation of China (30900329), and Fundamental Research Funds for the Central Universities.

References

[1] M. Palme and E. Simeonova, “Does women’s education affect breast cancer risk and survival? Evidence from a population based social experiment in education,” Journal of Health Economics, vol. 42, pp. 115–124, 2015.
[2] Y.-C. Chen, W.-C. Ke, and H.-W. Chiu, “Risk classification of cancer survival using ANN with gene expression data from multiple laboratories,” Computers in Biology and Medicine, vol. 48, no. 1, pp. 1–7, 2014.
[3] S. Rao, L. Welsh, D. Cunningham et al., “Correlation of overall survival with gene expression profiles in a prospective study of resectable esophageal cancer,” Clinical Colorectal Cancer, vol. 10, no. 1, pp. 48–56, 2011.
[4] G. P. Sfakianos, E. S. Iversen, R. Whitaker et al., “Validation of ovarian cancer gene expression signatures for survival and subtype in formalin fixed paraffin embedded tissues,” Gynecologic Oncology, vol. 129, no. 1, pp. 159–164, 2013.

[5] Z.-Y. Xu, J.-S. Chen, and Y.-Q. Shu, “Gene expression profile towards the prediction of patient survival of gastric cancer,” Biomedicine & Pharmacotherapy, vol. 64, no. 2, pp. 133–139, 2010.
[6] J. Gao, G. Ciriello, C. Sander, and N. Schultz, “Collection, integration and analysis of cancer genomic profiles: from data to insight,” Current Opinion in Genetics and Development, vol. 24, no. 1, pp. 92–98, 2014.
[7] F. Gnad, S. Doll, G. Manning, D. Arnott, and Z. Zhang, “Bioinformatics analysis of thousands of TCGA tumors to determine the involvement of epigenetic regulators in human cancer,” BMC Genomics, vol. 16, supplement 8, article S5, 2015.
[8] Z. Kan, B. S. Jaiswal, J. Stinson et al., “Diverse somatic mutation patterns and pathway alterations in human cancers,” Nature, vol. 466, no. 7308, pp. 869–873, 2010.
[9] K. D. Korthauer and C. Kendziorski, “MADGiC: a model-based approach for identifying driver genes in cancer,” Bioinformatics, vol. 31, no. 10, pp. 1526–1535, 2015.
[10] H. Wu, L. Gao, F. Li, F. Song, X. Yang, and N. Kasabov, “Identifying overlapping mutated driver pathways by constructing gene networks in cancer,” BMC Bioinformatics, vol. 16, supplement 5, p. S3, 2015.
[11] D. Li, H. Xia, Z. Li, L. Hua, and L. Li, “Identification of novel breast cancer subtype-specific biomarkers by integrating genomics analysis of DNA copy number aberrations and mirna-mrna dual expression profiling,” BioMed Research International, vol. 2015, Article ID 746970, 17 pages, 2015.
[12] H. Wu, L. Gao, F. Li, F. Song, X. Yang, and N. Kasabov, “Identifying overlapping mutated driver pathways by constructing gene networks in cancer,” BMC Bioinformatics, vol. 16, supplement 5, article S3, 2015.
[13] C. Kandoth, N. Schultz, A. D. Cherniack et al., “Integrated genomic characterization of endometrial carcinoma,” Nature, vol. 497, no. 7447, pp. 67–73, 2013.
[14] H. Thorvaldsdóttir, J. T. Robinson, and J. P. Mesirov, “Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration,” Briefings in Bioinformatics, vol. 14, no. 2, pp. 178–192, 2013.
[15] D. C. Wallace, “Mitochondria and cancer,” Nature Reviews Cancer, vol. 12, no. 10, pp. 685–698, 2012.
[16] J. E. Sylvester, N. Fischel-Ghodsian, E. B. Mougey, and T. W. O’Brien, “Mitochondrial ribosomal proteins: candidate genes for mitochondrial disease,” Genetics in Medicine, vol. 6, no. 2, pp. 73–80, 2004.
[17] S. Belin, A. Beghin, E. Solano-Gonzàlez et al., “Dysregulation of ribosome biogenesis and translational capacity is associated with tumor progression of human breast cancer cells,” PLoS ONE, vol. 4, no. 9, Article ID e7147, 2009.
[18] S. Ray, R. Johnston, D. C. Campbell et al., “Androgens and estrogens stimulate ribosome biogenesis in prostate and breast cancer cells in receptor dependent manner,” Gene, vol. 526, no. 1, pp. 46–53, 2013.
[19] H. Zhang, H. Gao, C. Liu, Y. Kong, C. Wang, and H. Zhang, “Expression and clinical significance of HSPA2 in pancreatic ductal adenocarcinoma,” Diagnostic Pathology, vol. 10, article 13, 2015.
[20] Y. Fu, H. Zhao, X.-S. Li et al., “Expression of HSPA2 in human hepatocellular carcinoma and its clinical significance,” Tumor Biology, vol. 35, no. 11, pp. 11283–11287, 2014.
[21] D. Scieglinska, A. Gogler-Piglowska, D. Butkiewicz et al., “HSPA2 is expressed in human tumors and correlates with clinical features in non-small cell lung carcinoma patients,” Anticancer Research, vol. 34, no. 6, pp. 2833–2840, 2014.
[22] M. J. van de Vijver, Y. D. He, L. J. van ’T Veer et al., “A gene-expression signature as a predictor of survival in breast cancer,” The New England Journal of Medicine, vol. 347, no. 25, pp. 1999–2009, 2002.
[23] A. Patsialou, Y. Wang, J. Lin et al., “Selective gene-expression profiling of migratory tumor cells in vivo predicts clinical outcome in breast cancer patients,” Breast Cancer Research, vol. 14, no. 5, article R139, 2012.

Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 292683, 13 pages http://dx.doi.org/10.1155/2015/292683

Research Article A Genetic Algorithm Based Support Vector Machine Model for Blood-Brain Barrier Penetration Prediction

Daqing Zhang,1 Jianfeng Xiao,2 Nannan Zhou,3 Mingyue Zheng,2 Xiaomin Luo,2 Hualiang Jiang,2,3 and Kaixian Chen2

1 Center for Systems Biology, Soochow University, Suzhou 215006, China
2 Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
3 School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China

Correspondence should be addressed to Mingyue Zheng; [email protected], Xiaomin Luo; [email protected], and Hualiang Jiang; [email protected]

Received 24 February 2015; Revised 7 May 2015; Accepted 19 May 2015

Academic Editor: Sílvia A. Sousa

Copyright © 2015 Daqing Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The blood-brain barrier (BBB) is a highly complex physical barrier that determines which substances are allowed to enter the brain. Support vector machine (SVM) is a kernel-based machine learning method that is widely used in QSAR studies. For a successful SVM model, the kernel parameters and the feature subset selection are the most important factors affecting prediction accuracy. In most studies, they are treated as two independent problems, but it has been shown that they can affect each other. We designed and implemented a genetic algorithm (GA) to optimize kernel parameters and feature subset selection for SVM regression and applied it to BBB penetration prediction. The results show that our GA/SVM model is more accurate than other currently available log BB models. Therefore, optimizing both the SVM parameters and the feature subset simultaneously with a genetic algorithm is a better approach than methods that treat the two problems separately. Analysis of our log BB model suggests that the carboxylic acid group, polar surface area (PSA)/hydrogen-bonding ability, lipophilicity, and molecular charge play important roles in BBB penetration. Among the properties relevant to BBB penetration, lipophilicity could enhance BBB penetration while all the others are negatively correlated with it.

1. Introduction

The blood-brain barrier (BBB) plays important roles in separating the central nervous system (CNS) from circulating blood and maintaining brain homeostasis. BBB penetration, which may be desired or not depending on the therapeutic target, is a critical property in chemical toxicological studies and in drug design. Compounds can cross the BBB by passive diffusion or by means of a variety of catalyzed transport systems that can carry compounds into the brain (carrier-mediated transport, receptor-mediated transcytosis) or out of the brain (active efflux). Various parameters are used for predicting BBB penetration, such as CNS+/−, log BB, and log PS. CNS+/− is a qualitative property denoting the compound's activity (CNS+) or inactivity (CNS−) against a CNS target with its BBB penetration [1]. The problem with CNS+/− datasets is that CNS activity implies BBB permeation, while CNS inactivity might be due to factors other than nonpermeation, such as the fact that compounds might be rapidly metabolized or effluxed from the brain. Log BB, which is defined as the logarithm of the brain/blood partitioning ratio at steady state [2], is by far the most widely used parameter for BBB penetration. However, this parameter may also result in misleading conclusions because it ignores the main parts of the process of permeability [3]. Log PS, which is defined as the logarithm of the permeability-surface area product reflecting the rate of brain permeation, is superior to but more difficult to measure than log BB [4]. In vivo brain uptake methods may be the most reliable evaluation of BBB penetration. However, their low-throughput, expensive, and labor-intensive characteristics make these methods inapplicable in early drug discovery stages. For these reasons, in vitro and in silico methods have been introduced. As there is no one in vitro model which can mimic all properties of the in vivo BBB, developing more reliable models remains challenging [4]. So far, a great number of in silico BBB models have been developed and thoroughly reviewed [2, 3, 5–9]. Because of the highly complex nature of the BBB, most computational models only account for passive diffusion.

Initial studies focused on correlating the BBB permeability of small sets of compounds with simple descriptors and then revealed “rules of thumb.” These rough models reflect some important relationships between BBB penetration and properties of compounds but have a problem of oversimplification [10, 11]. With the accumulation of new data, various more sophisticated models were reported to predict BBB permeability. Classification models [12] have been widely explored for distinguishing between the molecules capable of crossing the BBB and those restricted to the periphery. These models were often developed by using the same dataset of about 1500 drugs compiled by Adenot and Lahana [13], which is the largest single homogeneous up-to-date source of qualitative data published. Some others [12, 14] distinguish molecules based on a certain log BB threshold. However, the main problem is that the threshold is subjectively determined and not unified. Most quantitative models were developed by building QSAR models [10, 15–18]. Since different datasets and validation methods were used, it is difficult to compare the performance of these models [19]. Recently, Carpenter et al. [20] developed a new model predicting the BBB penetration using molecular dynamic simulations and received good results, providing a new thread of BBB permeability prediction. Here we focused on log BB models of BBB penetration by passive diffusion.

Various data mining methods have been employed in BBB penetration models, such as multiple linear regression [21, 22], partial least squares (PLS) regression [13, 23], recursive partitioning [23, 24], neural network [25–27], and support vector machine (SVM) [28–30]. SVM, which was originally developed by Vapnik and coworkers [31], has been extensively used and consistently achieves similar or superior performance compared to other machine learning methods [32]. Its main idea is to map data points to a high-dimensional space with a kernel function, so that these data points can be separated by a hyperplane.

For a successful SVM model, the kernel parameters of SVM and feature subset selection are the two most important factors affecting the prediction accuracy. Various strategies have been adopted for the two problems. The grid-based algorithm is one of the most straightforward strategies for parameter optimization; it discretizes the parameters and then systematically searches every grid point to find the best combination of the parameters [33]. However, its use is limited due to its computational complexity and time consumption. Gradient-based methods [34, 35] are also widely used; they require the kernel function and the scoring function to be differentiable in order to assess the performance of the parameters. Evolutionary methods [36] have also been used and achieved promising results. As for feature selection, genetic algorithm- (GA-) based methods [37–41], F-score based recursive feature elimination [42], and many other methods [43–47] have been employed. Most of these methods focus on feature selection or parameter optimization separately [45]. However, the choice of feature subset influences the appropriate kernel parameters and vice versa [48]. Hence the proper way seems to be to address the two problems simultaneously. GA [41], immune clonal algorithm (ICA) [49], and a Bayesian approach [50] have recently been used for simultaneous feature selection and parameter optimization for SVM on general classification problems. In our study, GA was used to do parameter optimization and feature subset selection simultaneously, and an SVM regression model was developed for blood-brain barrier penetration prediction.

2. Methods

The workflow used in this study for BBB penetration prediction is illustrated in Figure 1.

2.1. Dataset and Molecular Descriptors. The log BB dataset used in this study was compiled by Abraham et al. [51], which was a combination of both in vivo and in vitro data, including 302 substances (328 data points). Abraham et al. applied linear free energy relationship (LFER) to the dataset and obtained good correlation between log BB values and LFER descriptors plus two indicator variables [51]. CODESSA [52] could not calculate descriptors for the first 5 gases ([Ar], [Kr], [Ne], [Rn], and [Xe]) of the original dataset, and they were excluded from the dataset. The final dataset contained 297 compounds (323 data points). The indicator variables Iv and AbsCarboxy used in Abraham's study [51] were retained in this study. Iv was defined as Iv = 1 for the in vitro data and Iv = 0 for the in vivo data. AbsCarboxy was an indicator for carboxylic acid (AbsCarboxy = 1 for carboxylic acid, otherwise AbsCarboxy = 0).

The initial structures in SMILES format were imported to Marvin [53] and exported in MDL MOL format. The AM1 method in AMPAC [54] was used for optimization plus frequencies and thermodynamic properties calculation. The generated output files were used by CODESSA to calculate a large number of constitutional, topological, geometrical, electrostatic, quantum-chemical, and thermodynamic descriptors. Marvin was also used to calculate some physicochemical properties of the compound, including log P, log D, polar surface area (PSA), polarizability, and refractivity. All these descriptors and properties were used as candidate features in later modeling.

Features with missing values or having no change across the data set were removed. If the correlation coefficient of two features is higher than a specified cutoff value (0.999999 used here), then one of them is randomly chosen and removed. The cutoff value used here is very high because very high variable correlation does not mean absence of variable complementarity [55]. A total number of 326 descriptors were left for further analysis. However, many highly correlated features have very similar physicochemical meanings. In our final analysis, similar features were put together by their physicochemical meaning, which we hope could unveil some underlying molecular properties that determine the BBB penetration.
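The feature filtering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, features are held in a plain dictionary, and the later feature of a highly correlated pair is dropped deterministically, whereas the paper removes one of the two at random.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def filter_features(columns, r_cutoff=0.999999):
    """Return the names of the features that survive the three filters.

    columns maps a feature name to its list of values; None marks a missing value.
    """
    kept = []
    for name, values in columns.items():
        if any(v is None for v in values):
            continue  # feature has missing values
        if max(values) == min(values):
            continue  # no change across the dataset (SD = 0)
        # Assumption: keep the earlier feature of a highly correlated pair;
        # the paper removes one of the two at random.
        if any(pearson_r(values, columns[k]) > r_cutoff for k in kept):
            continue
        kept.append(name)
    return kept
```

Constant features are removed before any correlation is computed, so the denominator in `pearson_r` is never zero for the features that reach it.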

[Figure 1 (workflow diagram): 328 data points from Abraham et al. (2006) → calculate descriptors (AbsCarboxy, log P, log D, PSA, H-bond, number of atoms/bonds/rings, charge, HOMO/LUMO, ...) → remove bad values: features with missing values, features having no change over the dataset (SD = 0), and features having very high correlation ((1 − r) ≤ 1e−6), leaving 323 data points and 326 features → scale data to [0, 1] → split into training set/test set with the Kennard-Stone method → feature selection and parameter optimization simultaneously with GA/SVM.]

Figure 1: Workflow of GA/SVM model for BBB penetration prediction.
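The Kennard-Stone split named in the workflow starts from the data point closest to the dataset center (Euclidean distance) and then repeatedly adds the remaining point that is furthest from those already selected. The sketch below is an illustration only, with a hypothetical function name; it assumes "furthest from those already selected" means the usual maximin rule (largest distance to the nearest selected point), which the paper does not spell out.

```python
import math


def kennard_stone(points, n_train):
    """Return the indices of the points chosen for the training set."""
    dim = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Start with the point closest to the center of the dataset.
    first = min(range(len(points)), key=lambda i: math.dist(points[i], center))
    selected = [first]
    remaining = set(range(len(points))) - {first}
    while len(selected) < n_train and remaining:
        # Assumption (maximin rule): add the point whose distance to its
        # nearest already-selected point is largest.
        nxt = max(
            sorted(remaining),
            key=lambda i: min(math.dist(points[i], points[j]) for j in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

In the study's setting, `n_train` would be 260 and the unselected 63 points would form the test set. This direct version recomputes distances on every pass, which is acceptable for a few hundred points.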

The dataset was then split into training set and test set using the Kennard-Stone method [56], which selects a subset of representative data points uniformly distributed in the sample space [57]. At start, the Kennard-Stone method chooses the data point that is the closest to the center of the dataset measured by Euclidean distance. After that, from all remaining data points, the data point that is the furthest from those already selected is added to the training set. This process continues until the size of the training set reaches the specified size. 260 data points were selected as training set and the other 63 were used as test set.

2.2. SVM Regression. Details about SVM regression can be found in the literature [58–60]. As in other multivariate statistical models, the performance of SVM regression depends on the combination of several parameters. In general, C is a regularization parameter that controls the tradeoff between training error and model complexity. If C is too large, the model will have a high penalty for nonseparable points and may store too many support vectors and overfit. If it is too small, the model may underfit. Parameter ε controls the width of the ε-insensitive zone, used to fit the training data. The value of ε can affect the number of support vectors used to construct the regression function. The bigger ε is, the fewer support vectors are selected. On the other hand, bigger ε-values result in flatter estimates. Hence, both C and ε-values affect model complexity (but in a different way). The kernel type is another important parameter. In SVM regression, radial basis function (RBF) (1) was the most commonly used kernel function for its better generalization ability, smaller number of parameters, and fewer numerical difficulties [33] and was used in this study. Parameter γ in RBF controls the amplitude of the RBF kernel and therefore controls the generalization ability of SVM regression. The LIBSVM package (version 2.81) [61] was used in this study for SVM regression calculation, taking the form

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0, \tag{1}$$

where $x_i$ and $x_j$ are training vectors ($i \neq j$, $x_i \neq x_j$) and γ is the kernel parameter.

2.3. Genetic Algorithms. Genetic algorithms (GA) [41] are stochastic optimization and search methods that mimic biological evolution as a problem-solving strategy. They are very flexible and attractive for optimization problems.

Given a specific problem to solve, the input to the GA is a set of potential solutions to that problem, encoded in some fashion, and a fitness function that allows each candidate to be quantitatively evaluated (Figure 2). Selection, mating, and mutation just mimic the natural process. For each generation, individuals are selected for reproduction according to their fitness values. Favorable individuals have a better chance to be selected for reproduction and the offspring have a chance to mutate to keep diversity, while the unfavorable individuals are less likely to survive. After each generation, whether the evolution is converged or the termination criteria are met is checked; if yes, the job is done; if not, the evolution goes into the next generation. After many generations, good individuals will dominate the population, and we will get solutions that are good enough for our problem.

First, in order to solve a problem with GA, each individual in the population should be represented by a chromosome. In our study, since the parameter optimization and feature subset selection should be addressed simultaneously, the chromosome is a combination of parameter genes and feature gene (Figure 3), where $f_n$ is an integer in the range of [1, N] and N is the number of candidate features for model construction. A chromosome represents an individual in genetic algorithms and the parameters contained in the chromosome could be used for SVM modeling. The left part of the chromosome is the parameter genes, of which C, γ, and ε all are float genes.

[Figure 2 (flowchart): define encoding strategy, fitness function, and GA parameters → generate initial population (SVM parameters and selected features) → fitness evaluation (MSE from 10-fold CV SVM) → convergence? If no: select, mating, and mutation, then back to fitness evaluation. If yes: optimized (C, γ, ε) and feature subset.]

Figure 2: Workflow of genetic algorithms.
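The fitness evaluation step in Figure 2 is the mean squared error from 10-fold cross validation. The sketch below shows only that bookkeeping and is not the authors' implementation: `cv_mse` and `mean_predictor` are hypothetical names, the fold assignment by index stride is an assumption, and the stand-in model simply predicts the training mean where the study trains an SVM regression with LIBSVM.

```python
def cv_mse(X, y, fit_predict, k=10):
    """Mean squared error accumulated over k cross-validation folds.

    fit_predict(train_X, train_y, test_X) must return one prediction per
    row of test_X.
    """
    n = len(y)
    squared_error = 0.0
    for fold in range(k):
        # Assumption: sample i belongs to fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        if not test_idx:
            continue
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        squared_error += sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx))
    return squared_error / n


def mean_predictor(train_X, train_y, test_X):
    """Stand-in model (not an SVM): always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return [mean] * len(test_X)
```

Passing the model as a callable keeps the fitness function independent of the regression library, so an SVM wrapper can replace `mean_predictor` without changing the cross-validation loop.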

[Figure 3 (diagram): the chromosome consists of the SVM parameter genes C (float), γ (float), and ε (float), followed by the selected feature genes f1 (int), f2 (int), f3 (int), f4 (int), ..., fn (int).]

Figure 3: Encoding of the chromosome.
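One possible in-memory form of this encoding is sketched below. It is an illustration under assumptions: the class and function names are hypothetical, and the search bounds in `BOUNDS` are placeholders, since the paper does not state the ranges used for C, γ, and ε.

```python
import random
from dataclasses import dataclass


@dataclass
class Chromosome:
    C: float        # regularization parameter (float gene)
    gamma: float    # RBF kernel parameter (float gene)
    epsilon: float  # width of the epsilon-insensitive zone (float gene)
    features: list  # distinct feature indices, each an integer in [1, N]


# Hypothetical search bounds; the paper does not report the ranges it used.
BOUNDS = {"C": (0.01, 100.0), "gamma": (0.001, 1.0), "epsilon": (0.01, 0.5)}


def random_chromosome(n_candidates, n_selected, rng):
    """Draw a random individual with n_selected distinct features out of n_candidates."""
    return Chromosome(
        C=rng.uniform(*BOUNDS["C"]),
        gamma=rng.uniform(*BOUNDS["gamma"]),
        epsilon=rng.uniform(*BOUNDS["epsilon"]),
        features=rng.sample(range(1, n_candidates + 1), n_selected),
    )
```

For the final model of this study, a call such as `random_chromosome(326, 6, random.Random())` would produce one member of an initial population of 6-feature individuals.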

The feature gene is an array of integers, and each integer represents a feature.

The fitness function can be seen as a ruler, which was used to quantitatively evaluate and compare each candidate. In our study, the mean squared error (MSE) of 10-fold cross validation (CV) for SVM was used as fitness function, and a smaller fitness value indicated a better individual. Given a training set containing $n$ compounds, $(x_1, y_1), \ldots, (x_n, y_n)$, $x_i$ is the descriptor vector of compound $i$ and $x_i \in R^n$. $y_i$ is the log BB value, a real number. The objective function can be calculated by

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) K(x_i, x) + b. \tag{2}$$

$\alpha_i$ and $\alpha_i^{*}$ are Lagrange multipliers, and $K(x_i, x)$ is the radial basis function kernel. The MSE of 10-fold CV for SVM was calculated by

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{X}_i - X_i)^2, \tag{3}$$

where $n$ is the number of all data points, $\hat{X}_i$ is the predicted value, and $X_i$ is the experiment value.

Tournament selection was used as the selection strategy in GA, which selected the best 1 from 3 randomly chosen candidates. The advantage of tournament selection over roulette wheel selection is that tournament selection does not need to sort the whole population by fitness value.

Since there are different types of genes in a chromosome, different mating strategies were used for different types of genes (Figure 4):

$$V_{\mathrm{new}} = \beta p_1 + (1 - \beta) p_2, \tag{4}$$

where $\beta$ is a uniformly distributed random number on the interval [−0.25, 1.25], $p_n$ is the value of the parent gene, and $V_{\mathrm{new}}$ is the value of the child gene.

For float genes, the new value is a linear combination of the parents (4). For the feature gene, uniform crossover is used: each element of the child gene is selected randomly from the corresponding items of the parents.

[Figure 4 (diagram): Parent A and Parent B, each encoded as C, γ, ε, f1, f2, f3, f4, ..., fn, are combined into Child A and Child B. The float genes are mated by $V_{\mathrm{new}} = \beta p_1 + (1 - \beta) p_2$; the feature genes are mated by uniform crossover.]

Figure 4: Mating strategy of GA.
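The selection and mating operators can be sketched as three small functions. This is a hedged illustration with hypothetical names rather than the authors' code: each call produces a single child value, and how the second child in Figure 4 is formed is not specified in the text, so it is not modeled here. The uniform crossover below also does not repair duplicate feature indices, which the paper only mentions handling during mutation.

```python
import random


def tournament_select(population, fitness, rng, size=3):
    """Return the best of `size` randomly chosen candidates (lowest MSE wins)."""
    contenders = rng.sample(range(len(population)), size)
    return population[min(contenders, key=lambda i: fitness[i])]


def blend_float(p1, p2, rng):
    """Float-gene mating as in (4), with beta uniform on [-0.25, 1.25]."""
    beta = rng.uniform(-0.25, 1.25)
    return beta * p1 + (1.0 - beta) * p2


def uniform_crossover(features_a, features_b, rng):
    """Feature-gene mating: each position is copied from a randomly chosen parent."""
    return [a if rng.random() < 0.5 else b for a, b in zip(features_a, features_b)]
```

Because β may fall slightly outside [0, 1], a child's float gene can land a little beyond either parent, which lets the search leave the interval spanned by the two parents.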

Table 1: Performance comparison of models with different number of features.

Number of features | Training (CV = 10) MSE | Training (CV = 10) r² | Prediction r², test set | Prediction r², training set | C | γ | ε
4 | 0.1197 | 0.674 | 0.722 | 0.740 | 38.8833 | 0.6081 | 0.1491
5 | 0.1042 | 0.715 | 0.770 | 0.805 | 16.3419 | 0.7973 | 0.2743
6 | 0.0945 | 0.744 | 0.840 | 0.829 | 13.3573 | 0.7158 | 0.1513
7 | 0.0959 | 0.74 | 0.821 | 0.843 | 34.3067 | 0.5218 | 0.1595
8 | 0.0883 | 0.761 | 0.834 | 0.883 | 60.9596 | 0.5871 | 0.2357
9 | 0.0815 | 0.777 | 0.847 | 0.864 | 3.7770 | 0.8764 | 0.1663
10 | 0.0823 | 0.776 | 0.858 | 0.903 | 15.2236 | 0.6247 | 0.1434
11 | 0.0714 | 0.804 | 0.861 | 0.891 | 5.6937 | 0.6531 | 0.1573
12 | 0.0780 | 0.787 | 0.864 | 0.905 | 7.2787 | 0.7428 | 0.1515
13 | 0.0817 | 0.778 | 0.862 | 0.922 | 4.1957 | 0.7791 | 0.1574
14 | 0.0812 | 0.778 | 0.882 | 0.917 | 14.8391 | 0.5002 | 0.2054
15 | 0.0734 | 0.799 | 0.870 | 0.919 | 4.9915 | 0.5231 | 0.1077

Again, different mutation strategies were used for different types of genes. For float genes, the values were randomly mutated upward or downward. The new value was given by

$$V_{\mathrm{new}} = \begin{cases} V - \beta (V - V_{\min}) & \text{if } \mathrm{random}() < 0.5, \\ V + \beta (V_{\max} - V) & \text{if } \mathrm{random}() \geq 0.5, \end{cases} \tag{5}$$

where $\beta$ was a random number distributed in [0, 1], $V$ and $V_{\mathrm{new}}$ are values before and after mutation, and $V_{\min}$ and $V_{\max}$ are the minimum and maximum values allowed for a gene.

For the feature gene, several points were first randomly chosen for mutation, and then a random number in [1, N] (N is the total number of features) was chosen as the new feature while avoiding duplicate features. The GA was terminated when the evolution reached 1000 generations. In our pilot study (data not shown), 1000 generations were enough for the GA to converge. The other parameters for GA were as follows: population size 100, cross rate 0.8, mutation rate 0.1, elite size 2, and number of new individuals in each generation 8.

3. Results and Discussion

3.1. GA/SVM Performance. GA was run with different number of features from 4 to 15. For each number of features, GA was run 50 times, and the best model was chosen for further analysis. From Table 1 and Figure 5, the overall trend of the GA showed the following: (1) the accuracy of the model increased with the number of the features; (2) the accuracy of the model on training set was better than the accuracy on the test set, which was then better than the accuracy of cross validation.

As the feature number increases, the complexity also increases, which will often increase the probability of overfitting. A complex model is also difficult to interpret and apply in practical use, so generally speaking, we need to find a balance between the accuracy and complexity of the model. It is observed that (Table 1, Figure 5(a)) the prediction accuracy ($r^2 = 0.744$) of cross validation ($n = 10$) of the 6-feature model was similar to that of Abraham's model ($r^2 = 0.75$) which used all 328 data points (Table 2). As the number of features increased from 6 to 15, the prediction accuracy

[Figure 5 (plots): (a) r² for cross validation, test set, and training set versus number of features; (b) MSE (10-fold CV) and R² (test set) versus generation (0 to 1000).]

Figure 5: (a) Performance comparison of models with different number of features. (b) Evolution of the best 6-feature model.
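The mutation rule (5) from Section 2.3 and the feature-gene mutation that drives this evolution can be sketched as follows. The function names and the `n_points` argument are hypothetical; the paper says only that "several points" are chosen for mutation. Python's `random()` draws from [0, 1), a negligible difference from the closed interval [0, 1] stated for β.

```python
import random


def mutate_float(value, v_min, v_max, rng):
    """Float-gene mutation as in (5): move toward v_min or v_max by a random fraction."""
    beta = rng.random()
    if rng.random() < 0.5:
        return value - beta * (value - v_min)  # mutate downward
    return value + beta * (v_max - value)      # mutate upward


def mutate_features(features, n_candidates, n_points, rng):
    """Replace n_points randomly chosen positions with new, non-duplicate features.

    Requires n_candidates > len(features) so that an unused feature always exists.
    """
    mutated = list(features)
    for pos in rng.sample(range(len(mutated)), n_points):
        unused = [f for f in range(1, n_candidates + 1) if f not in mutated]
        mutated[pos] = rng.choice(unused)
    return mutated
```

Rule (5) keeps every mutated float gene inside [V_min, V_max] by construction, so no clipping step is needed afterwards.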

Table 2: Comparison of most relevant QSAR studies on BBB permeability.

Descriptors | N train | N test | Methods | r² train | Predictive accuracy on test set | Reference
Δlog P, log P, and log P cyc | 20 | - | Linear regression | 0.69 | - | Young et al. [77]
Excess molar refraction, dipolarity/polarisability, H-bond acidity and basicity, solute McGowan volume | 148 | 30 | LFER | 0.75 | r² test = 0.73 | Platts et al. [66]
ΔG°W | 55 | - | Linear regression | 0.82 | - | Lombardo et al. [78]
PSA, the octanol/water partition coefficient, and the conformational flexibility | 56 | 7 | MLR | 0.85 | r² test = 0.80 | Iyer et al. [79]
CODESSA/DRAGON (482) | 200 | 110 | PLS | 0.83 | r² test = 0.81 | Golmohammadi et al. [62]
CODESSA/DRAGON (482) | 200 | 110 | SVM | 0.97 | r² test = 0.96 | Golmohammadi et al. [62]
Molecular (CODESSA-PRO) descriptors (5) | 113 | 19 | MLR | 0.78 | r² test = 0.77 | Katritzky et al. [15]
Molecular fragment (ISIDA) descriptors | 112 | 19 | MLR | 0.90 | r² test = 0.83 | Katritzky et al. [15]
PSA, log P, the number of H-bond acceptors, E-state, and VSA | 144 | 10 | Combinatorial QSAR (KNN, SVM) | 0.91 | r² test = 0.8 | Zhang et al. [17]
Abraham solute descriptors and indicators | 328 | - | LFER | 0.75 | - | Abraham et al. [51]
Abraham solute descriptors and indicators | 164 | 164 | LFER | 0.71 | s = 0.25, MAE = 0.20 | Abraham et al. [51]
CODESSA/Marvin/indicator (6) | 260 | 63 | GA based SVM | 0.83 | r² test = 0.84, RMSE = 0.23 | This research, GA/SVM, final model; C = 13.3573, γ = 0.715761, ε = 0.151289
CODESSA/Marvin/indicator (236) | 260 | 63 | GA based SVM | 0.97 | r² test = 0.55, RMSE = 0.31 | This research, Grid/SVM; C = 8.0, γ = 0.015625, ε = 0.0625
CODESSA/Marvin/indicator (6) | 260 | 63 | GA based SVM | 0.86 | r² test = 0.58, RMSE = 0.29 | This research, Grid/SVM; C = 8.0, γ = 1.0, ε = 0.125

[Figure 6 (scatter plots): predicted log BB versus experiment log BB for (a) the training set and (b) the test set.]

Figure 6: Prediction accuracy of the final model on training set (a) and test set (b).

on training set increased from 0.829 to 0.919, while the prediction accuracy on test set only slightly increased from 0.840 to 0.870 accordingly. Taking all these into consideration, the 6-feature model (Table 1, Figure 5(a)) was chosen as our final model (Figure 6), of which the prediction accuracy on both test set and training set was similar ($|r^2_{\mathrm{train}} - r^2_{\mathrm{test}}| = 0.011$) and high enough (>0.82).

Figure 5(b) showed the evolution of the prediction performance of the model (MSE and $r^2_{\mathrm{test}}$). In the first 100 generations, the MSE decreased very fast, followed by a plateau stage from about 100 to 650 generations. Another decrease occurred at about 650 generations, and then the evolution became stable at about 850 generations. The models did not improve much in the last 150 generations, which may imply convergence.

A tabular presentation of relevant studies regarding the prediction of the blood-brain distribution is shown in Table 2. These models were constructed by using different statistical learning methods, yielding different prediction capability with $R^2_{\mathrm{test}}$ ranging from 0.5 to over 0.9. Generally, regression by SVM appears to be more robust than traditional linear approaches such as PLS and MLR, with respect to the nonlinear effects induced by multiple potentially cooperative factors governing the BBB permeability. For example, the SVM model by Golmohammadi et al. [62] yielded the highest $R^2_{\mathrm{test}}$ on a test set containing 110 molecules. However, it should be noted that direct comparison with results from previous studies is usually inappropriate because of differences in their datasets. In this study, a combination of both in vivo and in vitro data compiled by Abraham et al. [51] was used for developing the BBB prediction model, which is of high data quality and covers a large chemical diversity space. In addition to the data source, kernel parameter optimization and feature selection are two crucial factors influencing the prediction accuracy of SVM models. To reduce the computational cost, most of the existing models addressed the feature selection and parameter optimization procedures separately. In this study, we used a GA scheme to perform the kernel parameter optimization and feature selection simultaneously, which is more efficient at searching the optimal feature subset space.

Abraham's model [51] is the best model that is currently publicly available. A comparison of our models with Abraham's models is shown in Table 2. In Table 2, the last 3 rows are models from our study. The same dataset was used in Abraham's research and this study, but the data set was split into different training and test sets (our model: train/test = 260/63; Abraham's model: training/test = 164/164). 7 variables were used in Abraham's model, compared to 6 in our final model. The $r^2$ values for the training set in Abraham's 164/164 model and 328/0 model were 0.71 and 0.75, respectively, compared with 0.83 for our model. It has to be noted that the size of our training set (260) is bigger than Abraham's (164).

Our model was also compared with the grid method implemented with the Python toolkit (grid.py) shipped with libsvm [61] for parameter optimization. First, since the grid method cannot be used to select a feature subset, all 326 features were used to construct a BBB prediction model. The prediction accuracy of the training set was very high ($r^2_{\mathrm{train}} = 0.97$) but that of the test set was disappointing ($r^2_{\mathrm{test}} = 0.55$). Then we used the same feature set as our final model (6 features). The result was slightly better, but still too poor for test set prediction ($r^2_{\mathrm{train}} = 0.86$, $r^2_{\mathrm{test}} = 0.58$).

So compared with the grid-based method, our GA-based method could get better accuracy with fewer features, which suggested that GA could get a much better combination of parameters and feature subset. This was also observed in another study [48].

3.2. Feature Analysis. An examination of the descriptors used in the model could provide an insight into the molecular properties that are most relevant to BBB penetration. Table 3 showed the features used in the final 6-feature model and their meanings. In order to explore the relative importance and the underlying molecular properties of the descriptors, the most frequently used features in all 50 6-feature models were analyzed. Table 4 showed the top 10 most frequently used features. Interestingly, AbsCarboxy, an indicator of the existence of carboxylic acid, was the most significant property. Some features with similar meaning were also found to occur in the models, such as PSA related features (M PSA 7.4,

Table 3: Features used in the final model.

Name | Meaning
M log P | log P (Marvin)
HA dependent HDSA-2 [Zefirov's PC] | H-bond donor surface area related (CODESSA)
M PSA 7.4 | PSA at pH 7.4 (Marvin)
AbsCarboxy | Carboxylic acid indicator (Abraham)
HA dependent HDCA-2/SQRT(TMSA) [Zefirov's PC] | H-bond donor charged area related (CODESSA)
Average Complementary Information content (order 0) | Topology descriptor (CODESSA)

Table 4: The most frequently used features for all 6-feature models (a).

Number | Feature name | Occurrence (50 models) | Meaning | Group symbol
11 | AbsCarboxy | 36 | Indicator for carboxylic acid | †
268 | ESP-FHASA Fractional HASA (HASA/TMSA) [Quantum-Chemical PC] | 14 | H-acceptor surface area/total molecular surface area | #
101 | Topographic electronic index (all bonds) [Zefirov’s PC] | 12 | Topological electronic index for all bonded pairs of atoms (b) | ‡
8 | M PSA 7.4 | 11 | PSA at pH 7.4 (c) | §
267 | ESP-HASA H-acceptors surface area [Quantum-Chemical PC] | 10 | H-acceptor surface area | #
5 | delta log 𝐷 | 9 | log 𝐷 (pH 6.5) − log 𝐷 (pH 7.4) (d, e) | ∧
7 | M PSA 7.0 | 9 | PSA at pH 7.0 | §
138 | HA dependent HDCA-2 [Zefirov’s PC] | 9 | H-donors charged surface area | #
6 | M PSA 6.5 | 8 | PSA at pH 6.5 | §
1 | M log 𝑃 | 7 | log 𝑃 | ∧

(a) Rows with the same symbol could be categorized into the same group.
(b) Topological electronic index is a feature to characterize the distribution of molecular charge: $T = \sum_{(i<j)}^{N_B} \frac{|q_i - q_j|}{r_{ij}^2}$, where $q_i$ is the net charge on the $i$th atom and $r_{ij}$ is the distance between two bonded atoms.
(c) 7.4 is the pH in blood.
(d) 6.5 is the pH in intestine.
(e) 𝐷 is the ratio of the sum of the concentrations of all species of a compound in octanol to the sum of the concentrations of all species of the compound in water. For neutral compounds, log 𝐷 is equal to log 𝑃.
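The topological electronic index defined in footnote (b) of Table 4 can be illustrated with a short calculation. The following is only a minimal sketch, not the CODESSA implementation: the function name, the three-atom example, and the choice of units (charges in elementary charge units, coordinates in Å) are assumptions made for illustration.

```python
import math

def topological_electronic_index(charges, coords, bonds):
    """Sum |q_i - q_j| / r_ij**2 over all bonded atom pairs (i, j).

    charges: net atomic charges (assumed units: e)
    coords:  3D atom coordinates (assumed units: angstrom)
    bonds:   list of (i, j) index pairs, one entry per bond
    """
    total = 0.0
    for i, j in bonds:
        # Squared Euclidean distance between the two bonded atoms
        r_squared = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        total += abs(charges[i] - charges[j]) / r_squared
    return total

# Hypothetical three-atom chain, used only to exercise the formula
charges = [-0.4, 0.1, 0.3]
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
bonds = [(0, 1), (1, 2)]
t_index = topological_electronic_index(charges, coords, bonds)
```

In this toy case the first bond contributes 0.5/1 and the second 0.2/4, so a larger charge difference across a short bond dominates the index.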

M PSA 7.0, and M PSA 6.5), and H-bond related descriptors (number 267, 268, and 138). So we decided to put similar descriptors into the same group (Figure 7) to find the underlying properties affecting BBB penetration.

Figure 7: Top features for all 6-feature models (50 in all). [Bar chart; x-axis: feature groups; y-axis: occurrence. AbsCarboxy 36, H-bonding 33, PSA 28, Lipophilicity 16, Molecular charge 12.]

According to the associated molecular properties of the features, those frequently used features are categorized into 5 groups: AbsCarboxy (indicator of carboxylic acid), H-bonding (H-bonding ability, including H-bond donor/acceptor related features), PSA (molecular polar surface area related features), lipophilicity (including M log 𝑃 and delta log 𝐷), and molecular charge (including charge and topological electronic index related features).

The following is observed:

(1) Interestingly, AbsCarboxy was also the most significant feature, which occurred 36 times in total 50 models. This may indicate that the carboxylic acid group plays an important role in the BBB penetration, which is consistent with the study of Abraham et al. [51].

(2) H-bonding (H-bond donor/acceptor) related surface area, polar surface area, log 𝑃 related features, and

Table 5: Most frequently used features for all top models (number of features range from 4 to 15) (∗).

Number | Descriptor name | Occurrence (120 models) | Meaning | Group symbol
11 | AbsCarboxy | 84 | Indicator for carboxylic acid | †
5 | delta log 𝐷 | 35 | log 𝐷 (pH 6.5) − log 𝐷 (pH 7.4) | ∧
7 | M PSA 7.0 | 32 | PSA at pH 7.0 | §
1 | M log 𝑃 | 28 | log 𝑃 | ∧
138 | HA dependent HDCA-2 [Zefirov’s PC] | 27 | H-donors charged surface area | #
268 | ESP-FHASA Fractional HASA (HASA/TMSA) [Quantum-Chemical PC] | 27 | H-acceptor surface area/total molecular surface area | #
6 | M PSA 6.5 | 25 | PSA at pH 6.5 | §
167 | PPSA-1 Partial positive surface area [Quantum-Chemical PC] | 25 | Partial positive surface area | §
101 | Topographic electronic index (all bonds) [Zefirov’s PC] | 23 | Topological electronic index for all bonded pairs of atoms | ‡
8 | M PSA 7.4 | 21 | PSA at pH 7.4 | §

(∗) Rows with the same symbol could be categorized into the same group.
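The group totals discussed for the top models can be checked directly against the rows of Table 5. The sketch below simply sums the occurrences per group; the mapping from footnote symbols to group names (†: AbsCarboxy, ∧: lipophilicity, §: PSA, #: H-bonding, ‡: molecular charge) is inferred from the group descriptions in the text, and the variable and function names are illustrative only.

```python
from collections import defaultdict

# (feature number, occurrence in 120 models, group) transcribed from Table 5;
# group labels are inferred from the footnote symbols, not stated per row.
TABLE5 = [
    (11, 84, "AbsCarboxy"),
    (5, 35, "Lipophilicity"),
    (7, 32, "PSA"),
    (1, 28, "Lipophilicity"),
    (138, 27, "H-bonding"),
    (268, 27, "H-bonding"),
    (6, 25, "PSA"),
    (167, 25, "PSA"),
    (101, 23, "Molecular charge"),
    (8, 21, "PSA"),
]

def group_totals(rows):
    """Sum feature occurrences within each property group."""
    totals = defaultdict(int)
    for _number, occurrence, group in rows:
        totals[group] += occurrence
    return dict(totals)

totals = group_totals(TABLE5)
# PSA and H-bonding merged into a single group, as considered in the text
merged_psa_hbond = totals["PSA"] + totals["H-bonding"]
```

With these rows the PSA group sums to 103 and H-bonding to 54, so the merged group reaches 157 and exceeds the 84 occurrences of AbsCarboxy.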

topological electronic index related features are also significant.

In order to further confirm the previous finding, all top 10 models from features = 4 to 15 were further analyzed, and the result was almost the same (Table 5, Figure 8). The top 10 most frequently used feature sets were very similar. Compared to the 6-feature models, the only difference was that H-bond related feature number 267 is replaced with PSA related feature number 167.

Again, the descriptors were analyzed by group. The compositions of the groups were almost the same. Given that there are 4 descriptors in the PSA group, AbsCarboxy was also the most significant property, followed by PSA, lipophilicity, and H-bonding having similar occurrences, then followed by molecular charge with relatively low frequencies. The high consistency suggested that these groups of features had significant impact on the BBB penetration ability.

PSA and H-bonding descriptors are highly relevant properties: PSA is the molecular area contributed by polar atoms (nitrogen, sulphur, oxygen, and phosphorus), and most of the time these polar atoms can be H-bond acceptors or donors. If PSA and H-bonding were merged into one group (𝑛 = 157), they would become the most significant property group of features.

Figure 8: The most frequently used features for all top models. [Bar chart; x-axis: feature groups; y-axis: occurrence. AbsCarboxy 84, H-bonding 54, PSA 103, Lipophilicity 63, Molecular charge 23; PSA and H-bonding merged: 157.]

3.3. Properties Relevant to BBB Penetration

3.3.1. Carboxylic Acid Group. It was proposed by Abraham et al. [51] that the carboxylic acid group played an important role in BBB penetration, while it was commonly believed that the most important molecular properties related to BBB penetration were H-bonding ability, lipophilicity, and molecular charge [63, 64]. Our study confirmed Abraham’s conclusion and showed that the importance of the carboxylic acid group in BBB penetration could be underestimated. In the models by Abraham et al. [51], the indicator variable of the carboxylic acid group has the largest negative coefficient, indicating its importance in BBB penetration; this is consistent with the observation in our model that the indicator of the carboxylic acid group is the most frequently used descriptor. Zhao et al. [23] tried to classify compounds into BBB positive or BBB negative groups using H-bonding related descriptors, and the indicator of the carboxylic acid group was also found to be important in their model. Furthermore, our results are consistent with the fact that basic molecules have a better BBB penetration than acid molecules [10]. The carboxylic acid group may affect the BBB penetration through molecular charge interactions, since in most cases the carboxylic acid group will exist in the ionized form carrying a negative charge. The carboxylic acid group could also affect the BBB penetration by forming H-bonds with the BBB and hence weaken the BBB penetration abilities of molecules.

Abraham et al. suggested that the hindering effect of the carboxylic acid group on BBB penetration was not simply due to the intrinsic hydrogen bonding and polarity properties of neutral acids [51]. There were some other ways in which the carboxylic acid groups could affect the BBB penetration: acidic drugs could bind to albumin [65], the ionization of the carboxylic acid groups could increase the excess molar refraction and hydrogen bonding basicity, and finally the carboxylic acid groups may be removed from the brain by some efflux mechanism [66].

3.3.2. Polar Surface Area and H-Bonding Ability. As pointed out in our previous analysis, PSA and H-bonding ability are actually two highly correlated properties. If these two groups are merged, they will be the most significant group of properties, even more significant than the carboxylic acid indicator (Figure 8). Norinder and Haeberlein [6] concluded that the hydrogen bonding term is a cornerstone in BBB penetration prediction. In Zhao et al.’s study [23], PSA, AbsCarboxy, number of H-bonding donors, and positively charged form fraction at pH 7.4 were all treated as H-bonding descriptors and were found to be important in the final model.

Furthermore, almost all published models make use of molecular polar and/or H-bonding ability related descriptors, such as PSA [23, 67, 68], high-charged PSA [22], number of hydrogen donors and acceptors [23, 69], and hydrogen bond acidity/basicity [23]. In these models, PSA and/or H-bond ability are all negatively correlated with log 𝐵𝐵, which is in agreement with Abraham et al.’s study [51] in which the coefficients of the H-bond acidity and H-bond basicity are both negative.

After review of many previous works, Norinder and Haeberlein [6] proposed that if the sum of the number of nitrogen and oxygen atoms (N + O) in a molecule was five or less, it had a high chance of entering the brain. Nitrogen and oxygen atoms are known to have great impact on PSA and H-bonding. Norinder and Haeberlein [6] also concluded that BBB penetration could be increased by lowering the overall hydrogen bonding ability of a compound, such as by encouraging intramolecular hydrogen bonding. After an analysis of the CNS activity of 125 marketed drugs, van de Waterbeemd et al. [70] suggested that the upper limit for PSA in a molecule that is desired to penetrate the brain should be around 90 Å², while Kelder et al. [71] analyzed the PSA distribution of 776 orally administered CNS drugs that have reached at least phase II studies and suggested that the upper limit should be 60–70 Å².

Having in mind that molecules mainly cross the BBB by passive diffusion, we think it may be because molecules with strong H-bonding ability have a greater tendency to form H-bonds with the polar environment (the blood), hence weakening their ability to cross the BBB by passive diffusion.

3.3.3. Lipophilicity. Lipophilicity is another property widely recognized as being important in BBB penetration, and most of the current models utilize features related to lipophilicity [22, 64, 67, 72]. Lipophilicity was thought to be positively correlated with log 𝐵𝐵; that is, increasing the lipophilicity of a molecule will increase the BBB penetration of the molecule. Norinder and Haeberlein [6] also proposed that if log 𝑃 − (N + O) > 0, then log 𝐵𝐵 was positive. Van de Waterbeemd et al. [70] suggested that the log 𝐷 of the molecule should be in [1, 3] for good BBB penetration. These observations are consistent with the fact that the lipid bilayer is lipophilic in nature, and lipophilic molecules could cross the BBB and get into the brain more easily than hydrophilic molecules.

3.4. Molecular Charge. From the viewpoint of computational chemistry, the distribution of molecular charge is a very important property that affects the molecule properties greatly. It is the uncharged form that can pass the BBB by passive diffusion. Fischer et al. [73] have shown that acid molecules with p𝐾a < 4 and basic molecules with p𝐾a > 10 could not cross the BBB by passive diffusion. Under physiological conditions (pH = 7.4), acid molecules with p𝐾a < 4 and basic molecules with p𝐾a > 10 will be ionized almost completely and carry net charges. As mentioned in our previous analysis, the carboxylic acid group may affect the BBB penetration through molecular charge interactions. Mahar Doan et al. [74] compared physicochemical properties of 93 CNS (𝑛 = 48) and non-CNS (𝑛 = 45) drugs and showed that 0 of 48 CNS drugs have a negative charge and CNS drugs tend to have less positive charge. These are reasonably consistent with the study of Abraham et al. [51] in which the coefficient of the carboxylic acid indicator is negative.

It has to be noted that all these molecular properties are not independent; they are related to each other. For example, the carboxylic acid group is related to both PSA and H-bonding ability, for the O atom in the carboxylic acid group is a polar atom and has a strong ability to form H-bonds; the carboxylic acid group, which carries charge under most conditions, is also related to molecular charge. The PSA/H-bonding ability is also correlated to molecular charge, because in many cases atoms that contribute to PSA or form H-bonds could also carry charges. Lipophilicity is also related to molecular charge.

We can conclude that the most important properties for a molecule to penetrate the BBB are carboxylic acid group, PSA/H-bonding ability, lipophilicity, and charge. BBB penetration is positively correlated with the lipophilicity and negatively correlated with the other three properties. A comparison of the physicochemical properties of 48 CNS drugs and 45 non-CNS drugs suggested that, compared to non-CNS drugs, CNS drugs tend to be more lipophilic and more rigid and have fewer hydrogen-bond donors, fewer charges, and lower PSA (<80 Å²) [74], which is in reasonable consistency with our finding except that the molecular flexibility is not important in our model.

There are some other properties utilized in some existing models, such as molecular weight, molecular shape, and molecular flexibility. It is suggested by van de Waterbeemd et al. [70] that molecular weight should be less than 450 for good BBB penetration, while Hou and Xu [22] suggested that the influence of molecular bulkiness would be obvious when the size of the molecule was larger than a threshold and found that molecular weight made a negative contribution to the BBB penetration when the molecular weight is greater than 360. This is not widely observed in other studies. In Zhao et al.’s study [23], molecular weight was found to be not important compared to hydrogen bond properties.
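The literature rules of thumb quoted in Sections 3.3.2 to 3.4 can be collected into a small checklist, together with the Henderson-Hasselbalch estimate of how strongly an acid or base is ionized at pH 7.4. This is only an illustrative sketch and not part of the GA/SVM model: the `Molecule` fields, function names, and example values are hypothetical, the thresholds are the ones cited above from [6] and [70], and the carboxylic acid flag only reflects the qualitative finding about AbsCarboxy.

```python
from dataclasses import dataclass

@dataclass
class Molecule:
    n_plus_o: int              # number of nitrogen plus oxygen atoms
    psa: float                 # polar surface area, in square angstroms
    log_p: float
    log_d: float
    mol_weight: float
    has_carboxylic_acid: bool

def ionized_fraction_acid(pka, ph=7.4):
    """Henderson-Hasselbalch: fraction of a monoprotic acid that is deprotonated."""
    return 1.0 / (1.0 + 10 ** (pka - ph))

def ionized_fraction_base(pka, ph=7.4):
    """Henderson-Hasselbalch: fraction of a monoprotic base that is protonated."""
    return 1.0 / (1.0 + 10 ** (ph - pka))

def bbb_rule_flags(m):
    """Return True for each rule of thumb the molecule satisfies."""
    return {
        "n_plus_o_le_5": m.n_plus_o <= 5,                     # [6]
        "logp_minus_n_plus_o_gt_0": m.log_p - m.n_plus_o > 0,  # [6]
        "psa_le_90": m.psa <= 90.0,                           # [70]
        "log_d_in_1_3": 1.0 <= m.log_d <= 3.0,                # [70]
        "mol_weight_lt_450": m.mol_weight < 450.0,            # [70]
        "no_carboxylic_acid": not m.has_carboxylic_acid,      # qualitative finding
    }

# Hypothetical example values, chosen only to exercise the checks
candidate = Molecule(n_plus_o=3, psa=45.0, log_p=3.5, log_d=2.1,
                     mol_weight=300.0, has_carboxylic_acid=False)
flags = bbb_rule_flags(candidate)
acid_ionized = ionized_fraction_acid(4.0)    # acid at the pKa = 4 boundary
base_ionized = ionized_fraction_base(10.0)   # base at the pKa = 10 boundary
```

At pH 7.4 an acid with p𝐾a = 4 is more than 99.9% ionized and a base with p𝐾a = 10 is more than 99% ionized, which is why such molecules carry a net charge under physiological conditions. The flags are independent heuristics taken from different studies, so they are reported separately rather than combined into a single score.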

Lobell et al. [68] proposed that spherical shapes have a small advantage compared with rod-like shapes with regard to BBB penetration; they attributed this to the fact that membranes are largely made of rod-shaped molecules, so rod-like molecules may become more easily trapped within the membrane without exiting into the brain compartment. However, Rose et al.’s model [75], based on electrotopological state descriptors, showed that BBB penetration increased with less skeletal branching. Crivori et al. [76] tried to correlate descriptors derived from 3D molecular fields and BBB penetration and concluded that the size and shape descriptors had no marked impact on BBB penetration.

Iyer et al. [67] found that increasing the solute conformational flexibility would increase log 𝐵𝐵, while in the study of Mahar Doan et al. [74], CNS drugs tend to be more rigid.

However, the roles of molecular weight, molecular shape, and molecular flexibility in BBB penetration seem to be still unclear and not generally agreed upon. Further studies are still needed.

4. Conclusion

In this study, we have developed a GA/SVM model for the BBB penetration prediction, which utilized GA to do kernel parameters optimization and feature selection simultaneously for SVM regression. The results showed that our method could get better performance than addressing the two problems separately. The same GA/SVM method can be extended to other QSAR modeling applications.

In addition, the most important properties (carboxylic acid group, PSA/H-bond ability, lipophilicity, and molecular charge) governing the BBB penetration were illustrated through analyzing the SVM model. The carboxylic acid group and PSA/H-bond ability have the strongest effect. The existence of carboxylic acid group (AbsCarboxy), PSA/H-bonding, and molecular charge are all negatively correlated with BBB penetration ability, while the lipophilicity enhances the BBB penetration ability.

The BBB penetration is a highly complex process and is a result of many cooperative effects. In order to clarify the factors that affect the BBB penetration, further efforts are needed to investigate the mechanistic nature of the BBB, and, as pointed out by Goodwin and Clark [7], the most fundamental need is for more high quality data, both in vivo and in vitro, upon which the next generation of predictive models can be built.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors’ Contribution

Daqing Zhang, Jianfeng Xiao, and Nannan Zhou contributed equally to this work.

Acknowledgments

The authors gratefully acknowledge financial support from the Hi-Tech Research and Development Program of China (Grant 2014AA01A302 to Mingyue Zheng and Grant 2012AA020308 to Xiaomin Luo), National Natural Science Foundation of China (Grants 21210003 and 81230076 to Hualiang Jiang and Grant 81430084 to Kaixian Chen), and National S&T Major Project (Grant 2012ZX09301-001-002 to Xiaomin Luo).

References

[1] K. Lanevskij, J. Dapkunas, L. Juska, P. Japertas, and R. Didziapetris, “QSAR analysis of blood-brain distribution: the influence of plasma and brain tissue binding,” Journal of Pharmaceutical Sciences, vol. 100, no. 6, pp. 2147–2160, 2011.
[2] K. Lanevskij, P. Japertas, and R. Didziapetris, “Improving the prediction of drug disposition in the brain,” Expert Opinion on Drug Metabolism & Toxicology, vol. 9, no. 4, pp. 473–486, 2013.
[3] A. R. Mehdipour and M. Hamidi, “Brain drug targeting: a computational approach for overcoming blood-brain barrier,” Drug Discovery Today, vol. 14, no. 21-22, pp. 1030–1036, 2009.
[4] J. Bicker, G. Alves, A. Fortuna, and A. Falcão, “Blood-brain barrier models and their relevance for a successful development of CNS drug delivery systems: a review,” European Journal of Pharmaceutics and Biopharmaceutics, vol. 87, no. 3, pp. 409–432, 2014.
[5] K. Nagpal, S. K. Singh, and D. N. Mishra, “Drug targeting to brain: a systematic approach to study the factors, parameters and approaches for prediction of permeability of drugs across BBB,” Expert Opinion on Drug Delivery, vol. 10, no. 7, pp. 927–955, 2013.
[6] U. Norinder and M. Haeberlein, “Computational approaches to the prediction of the blood-brain distribution,” Advanced Drug Delivery Reviews, vol. 54, no. 3, pp. 291–313, 2002.
[7] J. T. Goodwin and D. E. Clark, “In silico predictions of blood-brain barrier penetration: considerations to ‘keep in mind’,” Journal of Pharmacology and Experimental Therapeutics, vol. 315, no. 2, pp. 477–483, 2005.
[8] J. A. Nicolazzo, S. A. Charman, and W. N. Charman, “Methods to assess drug permeability across the blood-brain barrier,” Journal of Pharmacy and Pharmacology, vol. 58, no. 3, pp. 281–293, 2006.
[9] L. Di, E. H. Kerns, and G. T. Carter, “Strategies to assess blood-brain barrier penetration,” Expert Opinion on Drug Discovery, vol. 3, no. 6, pp. 677–687, 2008.
[10] Y. Fan, R. Unwalla, R. A. Denny et al., “Insights for predicting blood-brain barrier penetration of CNS targeted molecules using QSPR approaches,” Journal of Chemical Information and Modeling, vol. 50, no. 6, pp. 1123–1133, 2010.
[11] M. H. Abraham, “The factors that influence permeation across the blood-brain barrier,” European Journal of Medicinal Chemistry, vol. 39, no. 3, pp. 235–240, 2004.
[12] I. F. Martins, A. L. Teixeira, L. Pinheiro, and A. O. Falcao, “A bayesian approach to in silico blood-brain barrier penetration modeling,” Journal of Chemical Information and Modeling, vol. 52, no. 6, pp. 1686–1697, 2012.
[13] M. Adenot and R. Lahana, “Blood-brain barrier permeation models: discriminating between potential CNS and non-CNS drugs including P-glycoprotein substrates,” Journal of Chemical

Information and Computer Sciences, vol. 44, no. 1, pp. 239–248, 2004.
[14] M. Muehlbacher, G. M. Spitzer, K. R. Liedl, and J. Kornhuber, “Qualitative prediction of blood-brain barrier permeability on a large and refined dataset,” Journal of Computer-Aided Molecular Design, vol. 25, no. 12, pp. 1095–1106, 2011.
[15] A. R. Katritzky, M. Kuanar, S. Slavov et al., “Correlation of blood-brain penetration using structural descriptors,” Bioorganic & Medicinal Chemistry, vol. 14, no. 14, pp. 4888–4917, 2006.
[16] O. Obrezanova, J. M. R. Gola, E. J. Champness, and M. D. Segall, “Automatic QSAR modeling of ADME properties: blood-brain barrier penetration and aqueous solubility,” Journal of Computer-Aided Molecular Design, vol. 22, no. 6-7, pp. 431–440, 2008.
[17] L. Zhang, H. Zhu, T. I. Oprea, A. Golbraikh, and A. Tropsha, “QSAR modeling of the blood–brain barrier permeability for diverse organic compounds,” Pharmaceutical Research, vol. 25, no. 8, pp. 1902–1914, 2008.
[18] S. van Damme, W. Langenaeker, and P. Bultinck, “Prediction of blood-brain partitioning: a model based on ab initio calculated quantum chemical descriptors,” Journal of Molecular Graphics and Modelling, vol. 26, no. 8, pp. 1223–1236, 2008.
[19] D. A. Konovalov, D. Coomans, E. Deconinck, and Y. V. Heyden, “Benchmarking of QSAR models for blood-brain barrier permeation,” Journal of Chemical Information and Modeling, vol. 47, no. 4, pp. 1648–1656, 2007.
[20] T. S. Carpenter, D. A. Kirshner, E. Y. Lau, S. E. Wong, J. P. Nilmeier, and F. C. Lightstone, “A method to predict blood-brain barrier permeability of drug-like compounds using molecular dynamics simulations,” Biophysical Journal, vol. 107, no. 3, pp. 630–641, 2014.
[21] J. Zah, G. Terre’Blanche, E. Erasmus, and S. F. Malan, “Physicochemical prediction of a brain-blood distribution profile in polycyclic amines,” Bioorganic and Medicinal Chemistry, vol. 11, no. 17, pp. 3569–3578, 2003.
[22] T. J. Hou and X. J. Xu, “ADME evaluation in drug discovery. 3. Modeling blood-brain barrier partitioning using simple molecular descriptors,” Journal of Chemical Information and Computer Sciences, vol. 43, no. 6, pp. 2137–2152, 2003.
[23] Y. H. Zhao, M. H. Abraham, A. Ibrahim et al., “Predicting penetration across the blood-brain barrier from simple descriptors and fragmentation schemes,” Journal of Chemical Information and Modeling, vol. 47, no. 1, pp. 170–175, 2007.
[24] S. R. Mente and F. Lombardo, “A recursive-partitioning model for blood-brain barrier permeation,” Journal of Computer-Aided Molecular Design, vol. 19, no. 7, pp. 465–481, 2005.
[25] P. Garg and J. Verma, “In silico prediction of blood brain barrier permeability: an artificial neural network model,” Journal of Chemical Information and Modeling, vol. 46, no. 1, pp. 289–297, 2006.
[26] A. Guerra, J. A. Páez, and N. E. Campillo, “Artificial neural networks in ADMET modeling: prediction of blood-brain barrier permeation,” QSAR & Combinatorial Science, vol. 27, no. 5, pp. 586–594, 2008.
[27] Z. Wang, A. Yan, and Q. Yuan, “Classification of blood-brain barrier permeation by Kohonen’s self-organizing neural network (KohNN) and support vector machine (SVM),” QSAR & Combinatorial Science, vol. 28, no. 9, pp. 989–994, 2009.
[28] A. Yan, H. Liang, Y. Chong, X. Nie, and C. Yu, “In-silico prediction of blood-brain barrier permeability,” SAR and QSAR in Environmental Research, vol. 24, no. 1, pp. 61–74, 2013.
[29] J. Shen, F. Cheng, Y. Xu, W. Li, and Y. Tang, “Estimation of ADME properties with substructure pattern recognition,” Journal of Chemical Information and Modeling, vol. 50, no. 6, pp. 1034–1041, 2010.
[30] H. Golmohammadi, Z. Dashtbozorgi, and W. E. Acree Jr., “Quantitative structure-activity relationship prediction of blood-to-brain partitioning behavior using support vector machine,” European Journal of Pharmaceutical Sciences, vol. 47, no. 2, pp. 421–429, 2012.
[31] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[32] K. Heikamp and J. Bajorath, “Support vector machines for drug discovery,” Expert Opinion on Drug Discovery, vol. 9, no. 1, pp. 93–104, 2014.
[33] C. W. Hsu, C. C. Chang, and C. J. Lin, A Practical Guide to Support Vector Classification, National Taiwan University, Taipei, Taiwan, 2006.
[34] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 131–159, 2002.
[35] K.-M. Chung, W.-C. Kao, C.-L. Sun, L.-L. Wang, and C.-J. Lin, “Radius margin bounds for support vector machines with the RBF kernel,” Neural Computation, vol. 15, no. 11, pp. 2643–2681, 2003.
[36] F. Friedrichs and C. Igel, “Evolutionary tuning of multiple SVM parameters,” Neurocomputing, vol. 64, no. 1–4, pp. 107–117, 2005.
[37] J. H. Yang and V. Honavar, “Feature subset selection using genetic algorithm,” IEEE Intelligent Systems & Their Applications, vol. 13, no. 2, pp. 44–48, 1998.
[38] M. L. Raymer, W. F. Punch, E. D. Goodman, L. A. Kuhn, and A. K. Jain, “Dimensionality reduction using genetic algorithms,” IEEE Transactions on Evolutionary Computation, vol. 4, no. 2, pp. 164–171, 2000.
[39] S. Salcedo-Sanz, M. Prado-Cumplido, F. Perez-Cruz, and C. Bousono-Calzon, “Feature selection via genetic optimization,” in Artificial Neural Networks—Icann 2002, vol. 2415, pp. 547–552, Springer, 2002.
[40] Z. Q. Wang and D. X. Zhang, “Feature selection in text classification via SVM and LSI,” in Advances in Neural Networks—ISNN 2006, vol. 3971 of Lecture Notes in Computer Science, part 1, pp. 1381–1386, Springer, Berlin, Germany, 2006.
[41] M. Fernandez, J. Caballero, L. Fernandez, and A. Sarai, “Genetic algorithm optimization in drug design QSAR: bayesian-regularized genetic neural networks (BRGNN) and genetic algorithm-optimized support vectors machines (GA-SVM),” Molecular Diversity, vol. 15, no. 1, pp. 269–289, 2011.
[42] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[43] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.
[44] K. Z. Mao, “Feature subset selection for support vector machines through discriminative function pruning analysis,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 34, no. 1, pp. 60–67, 2004.
[45] K.-Q. Shen, C.-J. Ong, X.-P. Li, and E. P. V. Wilder-Smith, “Feature selection via sensitivity analysis of SVM probabilistic outputs,” Machine Learning, vol. 70, no. 1, pp. 1–20, 2008.
[46] Y.-W. Chen and C.-J. Lin, “Combining SVMs with various feature selection strategies,” in Feature Extraction, vol. 207 of

Studies in Fuzziness and Soft Computing, pp. 315–324, Springer, Berlin, Germany, 2006.
[47] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, and V. Vapnik, “Feature selection for SVMs,” in Advances in Neural Information Processing Systems 13, pp. 668–674, MIT Press, 2000.
[48] C.-L. Huang and C.-J. Wang, “A GA-based feature selection and parameters optimization for support vector machines,” Expert Systems with Applications, vol. 31, no. 2, pp. 231–240, 2006.
[49] X. R. Zhang and L. C. Jiao, “Simultaneous feature selection and parameters optimization for SVM by immune clonal algorithm,” in Advances in Natural Computation, vol. 3611 of Lecture Notes in Computer Science, pp. 905–912, Springer, Berlin, Germany, 2005.
[50] C. Gold, A. Holub, and P. Sollich, “Bayesian approach to feature selection and parameter tuning for support vector machine classifiers,” Neural Networks, vol. 18, no. 5-6, pp. 693–701, 2005.
[51] M. H. Abraham, A. Ibrahim, Y. Zhao, and W. E. Acree Jr., “A data base for partition of volatile organic compounds and drugs from blood/plasma/serum to brain, and an LFER analysis of the data,” Journal of Pharmaceutical Sciences, vol. 95, no. 10, pp. 2091–2100, 2006.
[52] A. R. Katritzky, V. S. Lobanov, and M. Karelson, CODESSA: Reference Manual, University of Florida, 1996.
[53] Marvin 4.0.5, ChemAxon, 2006, http://www.chemaxon.com/.
[54] AMPAC 8, 1992–2004 Semichem, Shawnee, Kan, USA.
[55] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[56] R. W. Kennard and L. A. Stone, “Computer aided design of experiments,” Technometrics, vol. 11, no. 1, pp. 137–148, 1969.
[57] M. Daszykowski, B. Walczak, and D. L. Massart, “Representative subset selection,” Analytica Chimica Acta, vol. 468, no. 1, pp. 91–103, 2002.
[58] R. Collobert and S. Bengio, “SVMTorch: support vector machines for large-scale regression problems,” Journal of Machine Learning Research, vol. 1, no. 2, pp. 143–160, 2001.
[59] A. J. Smola and B. Scholkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
[60] R. G. Brereton and G. R. Lloyd, “Support Vector Machines for classification and regression,” Analyst, vol. 135, no. 2, pp. 230–267, 2010.
[61] C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” 2001.
[62] H. Golmohammadi, Z. Dashtbozorgi, and W. E. Acree Jr., “Quantitative structure–activity relationship prediction of blood-to-brain partitioning behavior using support vector machine,” European Journal of Pharmaceutical Sciences, vol. 47, no. 2, pp. 421–429, 2012.
[63] D. E. Clark, “In silico prediction of blood-brain barrier permeation,” Drug Discovery Today, vol. 8, no. 20, pp. 927–933, 2003.
[64] S. Vilar, M. Chakrabarti, and S. Costanzi, “Prediction of passive blood-brain partitioning: straightforward and effective classification models based on in silico derived physicochemical descriptors,” Journal of Molecular Graphics and Modelling, vol. 28, no. 8, pp. 899–903, 2010.
[65] F. Herve, S. Urien, E. Albengres, J.-C. Duche, and J.-P. Tillement, “Drug binding in plasma. A summary of recent trends in the study of drug and hormone binding,” Clinical Pharmacokinetics, vol. 26, no. 1, pp. 44–58, 1994.
[66] J. A. Platts, M. H. Abraham, Y. H. Zhao, A. Hersey, L. Ijaz, and D. Butina, “Correlation and prediction of a large blood-brain distribution data set—an LFER study,” European Journal of Medicinal Chemistry, vol. 36, no. 9, pp. 719–730, 2001.
[67] M. Iyer, R. Mishra, Y. Han, and A. J. Hopfinger, “Predicting blood-brain barrier partitioning of organic molecules using membrane-interaction QSAR analysis,” Pharmaceutical Research, vol. 19, no. 11, pp. 1611–1621, 2002.
[68] M. Lobell, L. Molnár, and G. M. Keserü, “Recent advances in the prediction of blood-brain partitioning from molecular structure,” Journal of Pharmaceutical Sciences, vol. 92, no. 2, pp. 360–370, 2003.
[69] Y. N. Kaznessis, M. E. Snow, and C. J. Blankley, “Prediction of blood-brain partitioning using Monte Carlo simulations of molecules in water,” Journal of Computer-Aided Molecular Design, vol. 15, no. 8, pp. 697–708, 2001.
[70] H. van de Waterbeemd, G. Camenisch, G. Folkers, J. R. Chretien, and O. A. Raevsky, “Estimation of blood-brain barrier crossing of drugs using molecular size and shape, and H-bonding descriptors,” Journal of Drug Targeting, vol. 6, no. 2, pp. 151–165, 1998.
[71] J. Kelder, P. D. J. Grootenhuis, D. M. Bayada, L. P. C. Delbressine, and J.-P. Ploemen, “Polar molecular surface as a dominating determinant for oral absorption and brain penetration of drugs,” Pharmaceutical Research, vol. 16, no. 10, pp. 1514–1519, 1999.
[72] F. Ooms, P. Weber, P.-A. Carrupt, and B. Testa, “A simple model to predict blood-brain barrier permeation from 3D molecular fields,” Biochimica et Biophysica Acta—Molecular Basis of Disease, vol. 1587, no. 2-3, pp. 118–125, 2002.
[73] H. Fischer, R. Gottschlich, and A. Seelig, “Blood-brain barrier permeation: molecular parameters governing passive diffusion,” Journal of Membrane Biology, vol. 165, no. 3, pp. 201–211, 1998.
[74] K. M. Mahar Doan, J. E. Humphreys, L. O. Webster et al., “Passive permeability and P-glycoprotein-mediated efflux differentiate central nervous system (CNS) and non-CNS marketed drugs,” Journal of Pharmacology and Experimental Therapeutics, vol. 303, no. 3, pp. 1029–1037, 2002.
[75] K. Rose, L. H. Hall, and L. B. Kier, “Modeling blood-brain barrier partitioning using the electrotopological state,” Journal of Chemical Information and Computer Sciences, vol. 42, no. 3, pp. 651–666, 2002.
[76] P. Crivori, G. Cruciani, P.-A. Carrupt, and B. Testa, “Predicting blood-brain barrier permeation from three-dimensional molecular structure,” Journal of Medicinal Chemistry, vol. 43, no. 11, pp. 2204–2216, 2000.
[77] R. C. Young, R. C. Mitchell, T. H. Brown et al., “Development of a new physicochemical model for brain penetration and its application to the design of centrally acting H2 receptor histamine antagonists,” Journal of Medicinal Chemistry, vol. 31, no. 3, pp. 656–671, 1988.
[78] F. Lombardo, J. F. Blake, and W. J. Curatolo, “Computation of brain-blood partitioning of organic solutes via free energy calculations,” Journal of Medicinal Chemistry, vol. 39, no. 24, pp. 4750–4755, 1996.
[79] M. Iyer, R. Mishra, Y. Han, and A. J. Hopfinger, “Predicting blood–brain barrier partitioning of organic molecules using membrane–interaction QSAR analysis,” Pharmaceutical Research, vol. 19, no. 11, pp. 1611–1621, 2002.

Hindawi Publishing Corporation
BioMed Research International
Volume 2015, Article ID 359835, 17 pages
http://dx.doi.org/10.1155/2015/359835

Research Article

How to Use SNP_TATA_Comparator to Find a Significant Change in Gene Expression Caused by the Regulatory SNP of This Gene’s Promoter via a Change in Affinity of the TATA-Binding Protein for This Promoter

Mikhail Ponomarenko,1,2 Dmitry Rasskazov,1 Olga Arkova,1 Petr Ponomarenko,3 Valentin Suslov,1 Ludmila Savinkova,1 and Nikolay Kolchanov1,2

1 Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia
2 Department of Natural Sciences, Novosibirsk State University, Novosibirsk 630090, Russia
3 Children’s Hospital Los Angeles, University of Southern California, Los Angeles, CA 90027, USA

Correspondence should be addressed to Mikhail Ponomarenko; [email protected]

Received 3 July 2015; Accepted 24 August 2015

Academic Editor: Jorge H. Leitão

Copyright © 2015 Mikhail Ponomarenko et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The use of biomedical SNP markers of diseases can improve effectiveness of treatment. Genotyping of patients with subsequent searching for SNPs more frequent than in norm is the only commonly accepted method for identification of SNP markers within the framework of translational research. The bioinformatics applications aimed at millions of unannotated SNPs of the “1000 Genomes” can make this search for SNP markers more focused and less expensive. We used our Web service involving Fisher’s 𝑍-score for candidate SNP markers to find a significant change in a gene’s expression. Here we analyzed the change caused by SNPs in the gene’s promoter via a change in affinity of the TATA-binding protein for this promoter. We provide examples and discuss how to use this bioinformatics application in the course of practical analysis of unannotated SNPs from the “1000 Genomes” project. Using known biomedical SNP markers, we identified 17 novel candidate SNP markers nearby: rs549858786 (rheumatoid arthritis); rs72661131 (cardiovascular events in rheumatoid arthritis); rs562962093 (stroke); rs563558831 (cyclophosphamide bioactivation); rs55878706 (malaria resistance, leukopenia), rs572527200 (asthma, systemic sclerosis, and psoriasis), rs371045754 (hemophilia B), rs587745372 (cardiovascular events); rs372329931, rs200209906, rs367732974, and rs549591993 (all four: cancer); rs17231520 and rs569033466 (both: atherosclerosis); rs63750953, rs281864525, and rs34166473 (all three: malaria resistance, thalassemia).

1. Introduction

Biomedical SNP (single nucleotide polymorphism) markers are differences between the personal genomes of patients and the reference human genome, hg19, that occur with significantly elevated frequency. The discovery of SNP markers of hypersensitivity to the HIV-1 reverse transcriptase inhibitor Ziagen in the HLA-B gene of the human major histocompatibility complex [1] prevented deaths of thousands of patients. That is the reason why a search for candidate SNP markers of diseases now represents the bulk of bioinformatics studies aimed at the development of so-called postgenomic predictive preventive personalized medicine, PPPM [2].

In the 20th century, discovery of SNPs and of the resulting associations with diseases was incidental, whereas the postgenomic search for SNPs is systematic and large-scale: it includes the largest worldwide project “1000 Genomes” [3]. Researchers maintaining the dbSNP database [4] accumulate and annotate proven SNPs and continuously refine the human reference genome (hg19), namely, the ancestral variants for all SNPs within the Ensembl [5] and GENCODE v. 19 [6] databases available from the public UCSC Genome Browser [7]. The biomedical databases GWAS (genome-wide association study) [8], OMIM [9], ClinVar [10], and HapMap [11] supplement these SNPs by documenting associations with diseases, with one another, and with the pathogenic haplotypes (e.g., [12]). Furthermore, researchers project these SNPs onto the whole-genome maps of genes, protein-binding sites on DNA predicted in silico and/or detected in vivo using chromatin immunoprecipitation (ChIP), interchromosomal contacts, and nucleosome packaging as well as transcriptomes in health [13] and disease in different tissues [14] and after treatment [15]. Accordingly, the available Web services (e.g., [16–27]) facilitate the bioinformatics search for medically relevant candidate SNP markers by ranking unannotated SNPs according to their similarity to known biomedical SNP markers, as judged by projections of these SNPs onto the whole-genome maps. The Central Limit Theorem [28] implies that the accuracy of such a search should increase asymptotically with an increase in accuracy, volume, representativeness, completeness, number, and diversity of the whole-genome maps as well as with refinement of empirical analyses of similarity between projections of SNPs onto genomic maps [16]. In this way, the most progress has been achieved for many thousands of SNPs within protein-coding regions of genes [9] because the types of disruption in both structure and function of the affected proteins are invariant regardless of the cellular conditions [29]. At the same time, the least progress has been made for a few hundred so-called regulatory SNPs [4, 9, 23, 24] because their manifestations depend on cellular conditions [30].

For the present study, it was helpful that an intermediate position between these extremes belongs to SNPs in the DNA sites binding to the TATA-binding protein (TBP); these SNPs constitute ∼10% of all the known regulatory SNP markers relevant to medicine, whereas TBP is only one of 2600 known DNA-binding proteins in humans [31]. This special place of such SNPs can be mostly explained by the necessity of a TBP-binding site within the [−70; −20] region of the promoter for any mRNA [32] because RNA polymerase II binds to the anchoring TBP-promoter complex, and this event triggers assembly of the transcription preinitiation complex for this mRNA [33]. These results were obtained in studies on unviability of TBP-null animals [34] or animals harboring a knockdown [35] of the TBP gene. Besides, ChIP data confirmed that the TATA-like motifs are the TBP-binding sites in gene promoters in yeast [36] and in mice [37], as did the results of in silico analysis and their selective verification by means of in vivo bioluminescence among human genes [38]. Finally, SNPs in the TBP-binding sites invariantly cause gene overexpression when the SNP enhances the TBP/promoter affinity and deficient expression when the SNP reduces this affinity, regardless of any cellular conditions; these phenomena have been repeatedly demonstrated in independent experiments [39–41]. This stability of the SNP-caused alterations in the TBP/promoter affinity resembles the invariant relation of SNPs in protein-coding gene regions with protein structure/function, rather than such relations involving regulatory SNPs, whose effects strongly depend on the tissue, cell type, and so forth.

In our previous studies, we measured in vitro affinity values of TBP for the representative sets of aptamers of synthetic single-stranded DNA (ssDNA) [42] and double-stranded DNA (dsDNA) [43] including natural TBP-binding sites of human gene promoters [44] that are stored in our database ACTIVITY [45]. Next, we derived formulas for in silico prognosis of the TBP-ssDNA [46], TBP-dsDNA [43], and TBP-promoter [47] affinity using the widely accepted Bucher’s criterion [48] for the canonical TBP-binding sites, the so-called TATA box (synonyms: Goldberg-Hogness box and Hogness box [32]), in the three-step mechanism of the TBP binding to a promoter [47]. This mechanism was observed independently in vitro a year later [49]. Then we confirmed predictions of this three-step empirical predictive bioinformatics model [47] at equilibrium [50], without equilibrium [51], and in real time [52, 53] in vitro. Additionally, we compiled a set of SNPs in the TBP-binding sites associated with human diseases [54], including the AIDS pandemic [55], and with commercially important traits of plants and animals [56]. Then, we confirmed the three-step predictions by means of these SNPs [57] and by means of transcriptomes of the human brain [58], the auxin response in plants [59, 60], and the data from 68 independent experiments (for review, see [61]). To finalize this comprehensive verification of the three-step model of TBP binding to a promoter [47, 49], we created a freely available Web service [62] for users who wish to apply this bioinformatics application to data on the TBP/promoter complexes in humans: http://beehive.bionet.nsc.ru/cgi-bin/mgs/tatascan/start.pl.

In this work, we updated our review of SNPs (in the TBP-binding sites) associated with human diseases [54] by means of the standard keyword search in NCBI databases [4] and existing data from the literature [63], and we provide examples of how to use our Web service [62] to find a significant change in a gene’s expression when this change is caused by the regulatory SNP in this gene’s promoter via a change in the TBP affinity for the promoter. Using a representative set of so-called control data on the total number of 62 SNPs, we show the output of our bioinformatics applications. Using this approach, for the known SNP markers relevant to medicine, we present 17 novel candidate SNP markers that are located nearby, namely, rs549858786 of the IL1B gene (associated with rheumatoid arthritis), rs63750953 and rs281864525 (both: HBB; malaria resistance and β-thalassemia), rs34166473 (HBD; malaria resistance and δ-thalassemia), rs563558831 (CYP2B6; better bioactivation of cyclophosphamide), rs372329931 (ADH7; esophageal cancer), rs562962093 (MBL2; stroke, preeclampsia, and variable immunodeficiency), rs72661131 (MBL2; cardiovascular events in rheumatoid arthritis), rs17231520 and rs569033466 (both: CETP; atherosclerosis), rs55878706 (DARC; low white-blood-cell count and resistance to malaria), rs367732974 and rs549591993 (both: F7; progression of colorectal cancer from a primary tumor to metastasis), rs572527200 (MMP12; low risks of asthma, systemic sclerosis, and psoriasis), rs371045754 (F9; Leiden hemophilia B), rs200209906 (GSTM3; brain, lung, and testicular cancers), and rs587745372 (GJA5; arrhythmia and cardiovascular events). This is the principal result of this work.

2. Methods

2.1. Web Service SNP_TATA_Comparator. The Web service SNP_TATA_Comparator, http://beehive.bionet.nsc.ru/cgi-bin/mgs/tatascan/start.pl [62], is a bioinformatics application installed

[Figure 1 flowchart: reference human genome hg19 (Ensembl [5], GENCODE [6]) → “Get gene” → “Get TSSs” → “Get seq” (BioPerl [64]) → −ln(K_D) ± δ, an estimate of the TBP affinity for the promoter of both the minor and ancestral alleles of a gene (ANSI C) → Z-score and p value, the significance of changes in gene expression of minor alleles relative to the ancestral allele (package R). Panels (a) and (b); SNP labels shown: rs33981098, rs63750953.]

Figure 1: How to use the Web service SNP_TATA_Comparator [62] to find a significant change in gene expression caused by SNPs of this gene’s promoter via a change in affinity of the TATA-binding protein (TBP) for this promoter in the cases of (a) a known biomedical SNP marker and (b) a nearby candidate SNP marker. Solid, dotted, and dashed arrows are the gene, transcript, and sequence lists, respectively, from Ensembl [5] and GENCODE [6] databases of the reference human genome, hg19. Dash-and-dot arrows are an estimate of the statistical significance (Z-score, p value) of deviation of the gene expression in patients carrying minor alleles, relative to the ancestral allele (see (1)–(4) and Algorithm 1).

on the hybrid cluster supercomputer HKC-30T (Hewlett Packard, Palo Alto, CA, US) based on the Intel Xeon 5450 platform of 85-Tflop performance under OS Red Hat Enterprise Linux 5.4 that is supported by the Siberian Supercomputer Center (Novosibirsk, Russia).

One can see screenshots of the user interface of this software in Figure 1 and all the data flowcharts (arrows) between them and two databases, Ensembl [5] and GENCODE v. 19 [6], of the human reference genome, hg19, in Figure 1(a). Using the standard method, we encoded this interface in the dynamic programming language JavaScript and created these flowcharts by means of the BioPerl toolkit [64]. Using the online mode of these modules, a user can prepare input data for the executable applet encoded primarily in the programming language C of the ANSI standard and, then, run this applet (the “Calculate” button). These input data consist of two variants, ancestral (the “Base sequence” window) and minor (the “Editable sequence” window), of the 90 bp DNA sequence $\{s_{-90} \cdots s_i \cdots s_{-1}\}$ in the proximal core-promoter region immediately upstream of the transcription start site (TSS, $s_0$) of interest within the human reference genome, hg19 (where $s_i \in \{a, c, g, t\}$). One can find our description of the bioinformatics model of this executable applet in Section 2.2.

An example of the output data from the above-mentioned executable applet is shown within the two top lines of the “Result” window in Figure 1(b). These data include the maximum value, $-\ln(K_D) \pm \delta$, among all the possible estimates of the TBP binding affinity for the 26 bp DNA fragment $\{s_{i-13} \cdots s_i \cdots s_{i+12}\}$ at the $i$th position ranging from −70 to −20 for both DNA chains [32, 59]. Here, $K_D$ is the equilibrium dissociation constant (expressed in the units of mol per liter; M) of the TBP binding to the ancestral or minor allele of the promoter under study. These quantitative estimates of the SNP-caused change in the TBP-promoter affinity are the input data for another executable applet coded primarily by means of the standard statistical package in the R software. We provided examples of its output data within the bottom line of the “Result” window in Figure 1. These are Fisher’s Z-score value along with its probability rate, p (where α = 1 − p, statistical significance). Within the “Decision” line, one can see the prediction made by our Web service, namely, (i) “excess” for overexpression of the gene after the SNP-caused significant increase in the TBP binding affinity for the minor allele of the gene promoter or (ii) “deficiency” for lowered expression of this gene in the opposite case. This prediction is the main result of the proposed Web service [62].
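The scan over 26 bp windows described above can be illustrated with a minimal Python sketch. It is not the authors' ANSI C applet: the function names (`windows`, `best_window`) and the user-supplied `score` callable are hypothetical, and the only details taken from the text are the 90 bp input $\{s_{-90} \cdots s_{-1}\}$, the window $\{s_{i-13} \cdots s_{i+12}\}$, the range of $i$ from −70 to −20, and the use of both DNA strands.

```python
# Sketch only: enumerates the 26 bp windows that the text says are scored.
# The scoring function itself (equations (1)-(3)) is NOT implemented here;
# `score` is a hypothetical callable supplied by the caller.

COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def windows(promoter, first=-70, last=-20, left=13, right=12):
    """Yield (i, strand, window) for every centre position i.

    `promoter` is the 90 bp sequence s_-90 ... s_-1, so the base at
    promoter position p (negative) sits at string index len(promoter) + p.
    """
    promoter = promoter.lower()
    if len(promoter) != 90:
        raise ValueError("expected the 90 bp region upstream of the TSS")
    n = len(promoter)
    for i in range(first, last + 1):
        start = n + i - left          # index of s_(i-13)
        end = n + i + right + 1       # one past the index of s_(i+12)
        window = promoter[start:end]
        yield i, "+", window
        yield i, "-", reverse_complement(window)


def best_window(promoter, score):
    """Return the (i, strand, window) triple with the highest score."""
    return max(windows(promoter), key=lambda item: score(item[2]))
```

With 51 centre positions and two strands, the generator yields 102 candidate windows; `best_window` simply keeps the one for which the supplied scoring function is largest, mirroring the "maximum value among all the possible estimates" mentioned in the text.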

IF {−ln(K_D;MINOR) is statistically significantly greater than −ln(K_D;ANCESTRAL)},
THEN {DECISION is “there is an excess of the minor allele of a given gene versus the ancestral allele”};
ELSE [IF {−ln(K_D;MINOR) is statistically significantly less than −ln(K_D;ANCESTRAL)},
THEN {DECISION is “there is a deficiency of the minor allele of this gene versus the ancestral allele”}];
OTHERWISE {DECISION is “alteration of the expression of this gene is insignificant”}.

Algorithm 1
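The decision rule of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the R code behind the Web service: the text does not spell out how the Z-score is computed, so the sketch assumes a standard two-sample Z statistic for two independent estimates with known standard deviations and a two-sided normal p value. The names `z_score`, `two_sided_p`, `decide`, and the threshold `p_threshold` are hypothetical.

```python
import math


def z_score(minor, sd_minor, ancestral, sd_ancestral):
    """ASSUMPTION: Z = difference of the two -ln(K_D) estimates divided by
    the root sum of squares of their standard deviations."""
    return (minor - ancestral) / math.sqrt(sd_minor ** 2 + sd_ancestral ** 2)


def two_sided_p(z):
    """Two-sided p value of a standard normal Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def decide(minor, sd_minor, ancestral, sd_ancestral, p_threshold=0.05):
    """Apply Algorithm 1 to -ln(K_D) estimates of the two alleles.

    Returns one of "excess", "deficiency", "insignificant" together with
    the Z-score and p value used to reach the decision.
    """
    z = z_score(minor, sd_minor, ancestral, sd_ancestral)
    p = two_sided_p(z)
    if p < p_threshold and z > 0:
        decision = "excess"          # higher affinity for the minor allele
    elif p < p_threshold and z < 0:
        decision = "deficiency"      # lower affinity for the minor allele
    else:
        decision = "insignificant"
    return decision, z, p
```

For instance, `decide(20.03, 0.05, 19.11, 0.05)` compares −ln(2 nM) with −ln(5 nM) using a made-up standard deviation of 0.05 for both alleles and returns the decision "excess"; the resulting Z value depends entirely on that illustrative standard deviation and is not expected to reproduce the values reported by the Web service.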

2.2. The Bioinformatics Model. The bioinformatics model that we use here is the three-step approximation of the TBP binding to the [−70; −20] region of the core-promoters of eukaryotic genes; this approximation was first suggested by us [47] on the basis of our original experimental data [42–44] and was then discovered independently [49] a year later. Within the framework of this model, (i) TBP binds nonspecifically to DNA and slides along this molecule ↔ (ii) the sliding of TBP stops at a proper TBP-binding site ↔ (iii) the DNA helix bends from the 19° angle to the 90° angle [65] and stabilizes the local TBP-promoter complex. This interaction (binding affinity) can be estimated using the following empirical equation:

$$-\ln(K_D) = 10.9 - 0.2\,\{\ln(K_{\mathrm{SLIDE}}) + \ln(K_{\mathrm{STOP}}) + \ln(K_{\mathrm{BEND}})\}, \tag{1}$$

where 10.9 (ln units) is nonspecific TBP-DNA affinity, $10^{-5}$ M [66], 0.2 is the stoichiometric coefficient [47], and $K_{\mathrm{STOP}}$ is the maximal score value of Bucher’s position-weight matrix, which is the commonly accepted criterion of the TATA box: the canonical form of the TBP-binding site [48].

In (1), $K_{\mathrm{SLIDE}}$ is our empirical estimate of the equilibrium constant of the TBP sliding along DNA that was determined experimentally [67]; namely,

$$-\ln(K_{\mathrm{SLIDE}}) = \mathrm{MEAN}_{15\,\mathrm{bp}}\,\{0.8\,[\mathrm{TA}]_{3'\mathrm{HALF}} - 3.4\,\mathrm{MinorGrooveWidth}_{\mathrm{CENTER}} - 35.1\}, \tag{2}$$

where $[\mathrm{TA}]_{3'\mathrm{HALF}}$ is the total number of instances of dinucleotide TA within the 3′-half of the DNA sequence treated; $\mathrm{MinorGrooveWidth}_{\mathrm{CENTER}}$ is the mean width of the minor groove of the B-form of the DNA helix [68]; 0.8, −3.4, and −35.1 are linear regression coefficients determined by means of our experimental data [43] stored in our database ACTIVITY [45]; $\mathrm{MEAN}_{15\,\mathrm{bp}}$ is the mean arithmetic value for all possible positions and orientations of the TBP-binding site (15 bp long) that was determined empirically [67].

In (1), $K_{\mathrm{BEND}}$ is our empirical estimate of the equilibrium constant at the DNA helix bending step on the basis of the macromolecular dynamics computations [65] describing how TBP can bind to DNA; namely,

$$-\ln(K_{\mathrm{BEND}}) = \mathrm{MEAN}_{\mathrm{TATA\text{-}box}}\,\{0.9\,[\mathrm{WR}]_{\mathrm{FLANK}} + 2.5\,[\mathrm{TV}]_{\mathrm{CENTER}} + 14.4\}, \tag{3}$$

where WR = {TA, AA, TG, AG} and TV = {TA, TC, TG} [46] (the IUPAC-IUB nomenclature [69]); 0.9, 2.5, and 14.4 are linear regression coefficients calculated from our experimental data [42] stored in our database ACTIVITY [45]; $\mathrm{MEAN}_{\mathrm{TATA\text{-}box}}$ is the mean arithmetic value for both DNA strands of the TBP-binding site at the position of the maximal score value of Bucher’s position-weight matrix [48].

Additionally, the standard deviation of the $-\ln(K_D)$ estimates (see (1)), taken over all the 78 possible mononucleotide substitutions, $s_{i+j} \to \xi$, at each $j$th position ($-13 \le j \le 12$; 3 × 26) within the 26 bp DNA window centered at the $i$th position of the promoter DNA analyzed, was heuristically estimated as

$$\delta = \left[ \frac{\displaystyle\sum_{-13 \le j \le 12}\ \sum_{\xi \in \{a,c,g,t\}} \left[ \ln \frac{K_D(\{s_{i-13} \cdots s_{i+j-1}\,\xi\,s_{i+j+1} \cdots s_{i+12}\})}{K_D(\{s_{i-13} \cdots s_{i+j-1}\,s_{i+j}\,s_{i+j+1} \cdots s_{i+12}\})} \right]^{2}}{78} \right]^{1/2}. \tag{4}$$
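Equations (1) and (4) can be illustrated with a short Python sketch. It is only a sketch under assumptions: the three logarithmic terms of (1) are passed in as ready-made numbers (equations (2) and (3) and Bucher's matrix are not implemented), and `minus_ln_kd_of` is a hypothetical caller-supplied function that returns $-\ln(K_D)$ for a 26 bp window. All function names are illustrative.

```python
import math


def minus_ln_kd(ln_k_slide, ln_k_stop, ln_k_bend):
    """Equation (1): combine the three step-wise terms into -ln(K_D)."""
    return 10.9 - 0.2 * (ln_k_slide + ln_k_stop + ln_k_bend)


def kd_nanomolar(minus_ln_kd_value):
    """Convert -ln(K_D), with K_D in mol/L, into K_D expressed in nM."""
    return math.exp(-minus_ln_kd_value) * 1e9


def single_substitutions(window):
    """Yield every sequence that differs from `window` at exactly one base.

    A 26 bp window gives 26 positions x 3 alternative bases = 78 variants.
    """
    for j, ref in enumerate(window):
        for alt in "acgt":
            if alt != ref:
                yield window[:j] + alt + window[j + 1:]


def delta(window, minus_ln_kd_of):
    """Equation (4): root mean square of ln(K_D(mutant) / K_D(original)).

    `minus_ln_kd_of` is a HYPOTHETICAL scoring callable; the squared log
    ratio equals the squared difference of the two -ln(K_D) values.
    """
    reference = minus_ln_kd_of(window)
    squares = [(minus_ln_kd_of(mutant) - reference) ** 2
               for mutant in single_substitutions(window.lower())]
    return math.sqrt(sum(squares) / len(squares))
```

When all three logarithmic terms are zero, `minus_ln_kd` returns 10.9, the nonspecific level quoted in the text. The identical-base term $\xi = s_{i+j}$ of (4) contributes $\ln 1 = 0$ to the sum, which is why the sketch enumerates only the 78 true substitutions.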

This equation (4) estimates the resistance against the For each SNP processed, the decision (Algorithm 1) is the majority of SNPs in the case of the biologically essential main result of the bioinformatics model used. complex of TBP binding to the TBP-binding site of the promoters [55]. 2.3. How to Use SNP TATA Comparator. Practical use of our Finally, the results of (1)–(4) on the promoter DNA Web service [62] is illustrated in Figure 1 and documented sequences of two minor and ancestral alleles of a given gene in Tables 1–3. In this work, we analyzed in silico 31 human are compared with one another in terms of Fisher’s 𝑍-score genes containing 40 known biomedical SNP markers in their and its probability rate, that is, the 𝑝 value (where 𝛼=1−𝑝 core-promoter from our review [54], which was updated in is the statistical significance level). On this basis, a decision is thepresentwork.UsingtheUCSCGenomeBrowser[7], made. we found 163 additional unannotated SNPs nearby that were BioMed Research International 5 , 𝛼=1−𝑝 -score; 𝑍 [10, 70–75] [This work], [76] [78] [81] [82] [This work] [83] [84] [Reference], [this work] , 𝑍 de H. , an estimate [55] of the dissociation ), and norm (=); 𝐷 ↓ 𝐾 e; Helicobacter pylori infection, Graves’ disease, and ; LUC, bioluminescence. 
occurrence of a spurious ), deficient expression ( Gastric cancer in infection, hepatocellular carcinoma in hepatitis C virus infection, non-small cell lung cancer, chronic gastritis and gastric ulcer in pylori major recurrent depression (Hypothetically) Rheumatoid arthritis Myocardial infarction and venous thromboembolism Resistance to malaria, epilepsy riskResistance to [79, methotrexate 80] therapy for leukemia Endometrial cancer caused by a novo TBP-biding site (Hypothetically) Health A healthy Hungarian blood donor participating in a health check-up program A healthy individual in the “Control” cohort selected for comparison with the “Autoimmune Diseases” cohort Known [reference] diseases or hypothetical [this work] ones ↑ −7 −7 −2 −7 −7 −7 −7 −7 −7 310 810 810 810 910 910 610 15 10 14 10 ↑ ↑ ↑ ↑ ↓ ↑ ↑ ↑ ↓ Δ𝑍 𝛼 10 5 ,nM 𝐷 versus versus 𝐾 1versus2 2versus3 2versus5 2versus4 7 9versus15 6versus10 13 versus 24 22 min versus hg19 -flank 󸀠 3 tcttggctgc gccgccccct ataaaaacag gagccgcgtg aaaacagcga aaaggagccg aaagggccac gcgcggggca gggacgaggg t t t t t c c t a a t c c c a a g g min hg19 , the expression change in comparison with the norm: overexpression ( Δ -flank 󸀠 5 ccctttatag ttttgaaagc gtcattccag gtataaatac gtgctataaa ctgcacaaat tgaaagccat agtcgggaga gggagataaa t t t a a t t c c min → → → → → → → → → → SNP 31c 51t 21c 33t 25c 28a 20a 26g 26g − − − − − − − − − hg19 [50]; ND, not documented; in vitro rs10168 rs1143627 dbSNP [4] rel. 141, 142 rs10895068 rs111426889 rs563763767 ND, see [83] rs549858786 ND, see [79] rs544843047 Table 1: Known disease-related SNP markers increasing affinity of the TATA-binding protein (TBP)human for promoters, gene their SNP neighbors. 
120) − , probability; Figure 1); TF, transcription factor; EMSA, electrophoretic mobility shift assay; CAT, chloramphenicol acetyltransferase activity #2 (+1) #1 (+1) #1 (+1) #3 (+1) #2 (+270) #2 (+1) #3 ( RNA (TSS) ) of the TBP-DNA complex 𝑝 𝐷 , total number of SNPs processed; RNA, item number of mRNA in GENCODE v.19 [6]; TSS, transcription start site; hg19, ancestral allele; min, minor allel 𝐾 SNP ) 𝑁 SNP 𝑁 Gene ( IL1B (3) F3 (2) NOS2 (7) DHFR (5) PGR (3) CYP21A2 (1) TNFRSF18 (5) constant ( Note: significance ( 6 BioMed Research International [85–92] [98, 99] [100] [Reference], [this work] [93, 94] [This work], [95] [9, 97] [This work] [101–103] [This work] [9, 96] [This work], [95] ” t 34 and − -thalassemia ”has50%of“ -thalassemia g 𝛽 𝛿 ” ancestral allele t 34 − 28 − -thalassemia -thalassemia (norm), risk of Coppock-like cataract Known [reference] or hypothetical [this work] diseases (observations) Malaria resistance and 𝛽 Health, well-known so-called “silent SNP” (Hypothetically) Malaria resistance, Malaria resistance and 𝛿 (Hypothetically) Malaria resistance, Low white-blood-cell count and resistance to malaria (Hypothetically) Low white-blood-cell count and malaria resistance Lower risk of lung cancer in smokers LUC: “ For “ Low risk of chronic asthma, systemic sclerosis, and psoriasis (Hypothetically) Low risk of asthma, systemic sclerosis, psoriasis −7 −7 −7 −7 −7 −2 −2 −7 −3 −7 −7 −2 −7 −7 −7 −3 −2 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 3 3 3 8 9 4 4 7 11 11 21 18 14 10 27 24 34 ↓ ↓ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ Δ𝑍𝛼 11 5 5 4 10 ,nM 𝐷 hg19 versus versus versus versus versus 𝐾 3versus5 5versus2 8 9versus5 6versus5 9versus2 8 8versus4 7 11 versus 5 21 versus 5 18 versus 5 29 versus 5 min versus 14 14 versus 11 12 12 versus 10 -flank 󸀠 3 tcttggaagc ctatgagtca tgagtcactc atacaacagt atacaacagt ataaaagtca taaaagtcag cttggaagca gtcagggcag gtcagggcag taaaaggcag aagtcagggc aagtcagggc aaagtcaggg agccccgccg aaaggcaaac aaaaggcagg t c c c c g g g g g t t t t 
t t c a a a a a a a a a — — g,c g,c aa t(c) min a,t,g a,g,c hg19 -flank 󸀠 5 cttggctctt ttggctctta tcctgctata gatatcaact gatgatatca ctgggcataa tcaggcagta tgggcataaa gctgggcata gctgggcata ggctgggcat aggaccagca caggaccagc agggctgggc gggctgggca gggctgggca cagggctggg t c g g g c c c g g g,c g,c t(c) min a,t,g a,g,c → → → → → → → → → → → → → 25aa → → → - SNP 30t 28t 27a 31a 26t 34t 25a 27a 28a 30a 27a 30a 27a − − − − − 31c − − − − 29t − − − − hg19 − − rs2814778 rs2276109 rs35518301 dbSNP [4] rs33931746 rs33981098 rs34166473 rel. 141, 142 rs63750953 del rs33980857 rs55878706 rs28399433 rs55999272 rs34598529 rs34500389 rs281864525 rs572527200 ND, see [93] rs397509430 del-29t Table 2: Known disease-related SNP markers decreasing affinity of the TATA-binding protein (TBP)human for promoters, gene their SNP neighbors. 140) − #2 (+1) #3 (+1) #3 (+1) #3 (+1) #1 ( #1 (+1) RNA (TSS) ) SNP ) 5 𝑁 Gene ( HBB (19) HBD (14) DARC (2) CYP2A6 (3) CRYGEP ( MMP12 (2) BioMed Research International 7 [Reference], [this work] [104, 105] [This work], [105] [106] [This work] [107] [108, 109] [113–115] [This work] [This work], [119] [This work] [This work], [120] and Known [reference] or hypothetical [this work] diseases (observations) Hyperalphalipoproteinemia reduces atherosclerosis risk (Hypothetically) Higher risk of atherosclerosis-related autoimmune diseases Better bioactivation of anticancer prodrug cyclophosphamide (Hypothetically) Better bioactivation of cyclophosphamide Familial amyotrophic lateral sclerosis Hemolytic anemia and neuromuscular diseases ESR2-low pT1 tumorBreast cancerVariable immunodeficiency, stroke, and preeclampsia [110, 111] (Hypothetically) Stroke, variable immunodeficiency, preeclampsia (Hypothetically) Risk of [112] cardiovascular events in rheumatoid arthritis Esophageal cancer(Hypothetically) Esophageal cancer Hematuria, fatty liver, obesity [116] Moderate bleeding tendency [117] [118] progression of colorectal cancer from a primary tumor to metastasis −7 −7 −7 
−7 −3 −3 −7 −7 −3 −2 −7 −3 −7 −7 −7 −7 −7 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 3 5 5 5 7 20.05(Hypothetically)Riskof 4 13 15 15 17 17 13 13 13 12 12 10 ↑ ↑ ↓ ↓ ↓ ↑ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↑ ↑ ↓ ↓ ↓ Δ𝑍𝛼 4 1 53 53 2 4 4 2 ,nM 𝐷 hg19 versus versus versus versus versus versus versus versus 𝐾 3versus1 3 1 4versus1 7versus2 5 4versus3 3 2 7versus4 4versus2 8versus6 10 10 versus 4 min versus 18 versus 13 62 versus 53 25 47 Table 2: Continued. -flank 󸀠 3 tatagcctgc tcagtcccat cctgcaccca ggctccaggc taaagtagtc gagaactttg atagcctgca ataggccctg gacatacata ttaaaaggaa ataacagggt tatacaacag ccatggggaa aacagggtgc atacaacaga gggctccagg aagtgggcag agcccagagc c c c c c c c g g g g g t a a a a a t t t t t c g g g g a a a a a a — min hg19 [18 bp] -flank 󸀠 5 tctatttcta atctatttct tttctatata agctgctgtt aactttgccc tgaaatttta gatgaaattt gctgctgtta gcccgtcagt cctctcggtc gcgctctata atacatatac aggtgatatc ggtctggcct ggggctgggc tgcagacata ccttggaggc cgtgggggct g c g c c a c g a a c c c g g a a min → → → → → → → → → → → → → → → → → → SNP 26t 39t 13c 28t 36t 43t 19g 24t 33a 35a 32a del-54 36a 53g 35g 37a 68g 40a − − − − − − − − − − − − − − − − − hg19 rs1800202 rs7277748 rs72661131 rs72661131 rs17231520 dbSNP [4] rs17537595 rel. 141, 142 rs34223104 rs35036378 rs372329931 rs563558831 rs549591993 rs201739205 rs367732974 rs562962093 rs569033466 ND, see [117] ND, see [118] ND, see [104] 48) − #4 (+1) #4 (+1) #201 (+1) #1 (+1) #2 (+1) #1 (+1) #3 (+1) #3 (+1) #1 (+1) RNA (TSS) #1 ( ) SNP ) 4 𝑁 Gene ( CETP (5) CYP2B6 ( SOD1 (4) TPI1 (3) ESR2 (5) HSD17B1 (8) MBL2 (6) ADH7 (3) APOA1 (1) F7 (4) Note: hereinafter, can be seen under Table 1. 
8 BioMed Research International [121, 122] [123] [This work] [129] [126] [12] [133] [This work] [This work], [125] [This work] [This work] [130] [131] [124] [Reference], [this work] [132] [127, 128] ” 39g/76a ” ” ” ” − t a g a 55 33 57 49 − − − − /76g” is 50% of “ ”is50%of“ ”is10%of“ ”is50%of“ ”is84%of“ c a g g a 55 57 33 49 39 − − − − − No differences between proven fathers and infertile men Arrhythmia, cardiovascular events LUC: “ Leiden hemophilia B, EMSA: HNF4-binding site disrupted rather than proximal TBP-binding site (Hypothetically) Risk of brain, lung,and testicular, renal cell carcinomas Oral cancer risk, LUC: “ Ethnic differences such as rare alleles in humans Short stature, EMSA: unknown TF-binding site disrupted rather than TBP-binding site Necessary but not sufficient in hyperbilirubinemia and jaundice (Hypothetically) Leiden hemophilia B (Hypothetically) Congenital adrenal hyperplasia (Hypothetically) Higher risk of oral cancer (Hypothetically) Arrhythmia, cardiovascular events Thrombophlebitis risk LUC: “ Arrhythmia, cardiovascular events LUC: “ Risk of brain, lung, testicular, andcarcinomas, renal cell LUC: “ Hypertensive diabetic patients, EMSA: unknown TF-binding site disrupted rather than TBP-binding site Known [reference] or hypothetical [this work] diseases −7 −7 −2 −3 −7 −3 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 > > > > > > > > > > > < 510 510 510 310 410 2.4 ↓ ↓ ↓ ↓ ↓ ↓ =0 =1 =0 =1 =1 =1 =1 =0 =1.9 =1 10 =0 =0 Δ𝑍 𝛼 6.4 5.7 3.6 ,nM versus versus versus 10.3 10.3 1.54 1.54 1.54 0.67 0.67 𝐷 hg19 versus versus versus 𝐾 12.1 10.3 versus 0.71 versus 1.48 versus min versus 2.02 2.28 0.65 versus 7. 3 v e r s u s 7. 
3 1.4 versus 1.5 2.3 versus 2.1 3.1 versus 3.6 3.1 versus 3.4 5.7 versus 5.7 4.3 6.8 9.6 6.4 versus 6.4 6.4 versus 6.4 -flank 󸀠 3 atatatatat atatatatat tactatatta tactttggta ctatattata ctcccgctca atcgacctta tattaaacac tattatagga catttaagac ggcacttata tataaaaagg ggtacaacta gggtataaag gggggacatt aagaaatcag gcgacagata cgattaaaaa t t c c g g g g a a a a a a t t c c c g g g g g g a a a a a at — — t,a — min hg19 at(at) -flank 󸀠 5 tcttccactt tccacttact ggtttttgcc ggtttttgcc gcttgtactt cttcttccac ccccttatgt agctcagctt tggtacaact cagccttcag caactaagat gtataaagcc tcagcggggg ggcgacagat gcaaatgcag agggagggcc aggggccagg gaggagggaa t c 1;2 c t a a a g a g g a a g t,a min → → → → → → → → → → → → → → → → SNP 33c 57t 42t 29a 32a 33g 55g 49c 39g 29g 52a 48g 55a 22a del-50g 36c ins-55at − − − − − − − − − − − − − − − del-51(at) hg19 Table 3: Known disease-related SNP markers insignificantly changing TBP affinityhuman for promoters, gene their SNP neighbors. rs1332018 rs7586110 rs1394205 dbSNP [4] rs35594137 rs34983651 rs16887226 rs10465885 rs13306848 rs28399433 rs574890114 rs371045754 rs587745372 rs544850971 rs542729995 rs398048306 rs200209906 ND, see [123] ND, see [123] #2 (+16) #1 (+1) #3 (+31) #1 (+1) #4 (+1) #1 (+1) #1 (+1) #1 (+68) #201 (+1) RNA (TSS) #2 (+1) ) SNP 𝑁 Gene ( FSHR (3) F9 (4) StAR (3) GH1 (11) GSTM3 (8) UGT1A7 (4) GJA5 (8) THBD (3) UGT1A1 (10) BioMed Research International 9 detected in the “1000 Genomes” project [3]. Thus, the total (see (1)–(4) and Algorithm 1) is consistent with a number of number of the DNA sequences processed was 203. independent clinical studies [70–75]. We used the ancestral variants of these SNPs from Using the UCSC Genome Browser [7], we found the Ensembl [5] using the GENCODE v. 19 [6]; we also con- unannotated SNP rs549858786 (−28A → T) positioned 4 bp structed their minor alleles by hand in “online real-time” downstream of the above-mentioned known SNP marker mode according to the dbSNP entries [4] and/or literature rs1143627 (–31C → T). 
sources in the case of the SNPs undocumented in this database as shown in Figure 1 and in Tables 1–3. We analyzed each of the 203 SNPs independently from one another. As a result, for most of the unannotated SNPs analyzed, we found insignificant changes in TBP affinity for human promoters: 142 of 163 or 90% of SNPs (data not shown).

Finally, the remaining 17 of the 163 unannotated SNPs (10%) appeared to be new candidate biomedical SNP markers near the existing markers. We italicized and labeled them with the marks “hypothetical” and “this work” in Tables 1–3. We found associations of both known and possible nearby SNP markers with the same human diseases in the case of their codirectional effects on gene expression; otherwise, we did an additional keyword search [54, 63] in NCBI databases [4] and recorded the results below the above-mentioned marks “hypothetical” and “this work.” These 17 new candidate biomedical SNP markers are the main result of the present study on how to use the proposed Web service [62] in practice.

3. Results

3.1. The Results on Seven Known Biomedical SNP Markers That Increase TBP Affinity for Human Gene Promoters. The results on seven known biomedical SNP markers that increase TBP affinity for human gene promoters are presented in Table 1. The most widely studied among them is rs1143627, a substitution of minor T for ancestral C at position −31 (hereafter denoted as −31C → T) in the core-promoter for transcript number 2 of the human IL1B gene (interleukin 1𝛽). Let us analyze it in detail so that we can later briefly describe the rest of our SNPs on the basis of this example.

As one can see in Table 1, this SNP transforms a noncanonical TBP-binding site to the canonical TATA-box, namely, gaaagC₋₃₁ATAAAAcag → gaaagT₋₃₁ATAAAAcag. Obviously, the minor allele −31T can significantly increase TBP affinity for the IL1B promoter relative to the ancestral one, −31C. According to (1)–(4) and Algorithm 1, the affinity estimate in the case of −31T, 𝐾_D = 2 nM (Table 1), is significantly greater (𝑍-score = 14.56, 𝛼 < 10⁻⁶) than in the case of −31C, 𝐾_D = 5 nM (a lower 𝐾_D means tighter binding). According to three independent empirical studies [39–41], this significant increase in TBP affinity for the minor variant of the IL1B promoter corresponds to overexpression of this gene (designated as ↑ in Tables 1–3). This prediction is consistent with clinical findings: overexpression of interleukin 1𝛽 in gastric cancer with Helicobacter pylori infection [10, 70], in hepatocellular carcinoma with infection by hepatitis C virus [71], in non-small cell lung cancer in smokers and during alcohol dependence [72], as well as in nonneoplastic chronic gastritis and gastric ulcer [73], in intractable Graves’ autoimmune disease [74], and even in a neurodegenerative disorder during major recurrent depression [75]. Thus, the prediction by the Web service [62]

As one can see in Figure 1(b), our Web service [63] predicts (see (1)–(4) and Algorithm 1) the affinity of TBP for the minor allele −28T of the promoter analyzed: 7 nM (Table 1); this affinity is significantly less than the norm: 5 nM (𝑍-score = 7.63, 𝛼 < 10⁻⁶). According to some studies [39–41], this significant decrease in TBP affinity for the IL1B promoter corresponds to an interleukin 1𝛽 deficiency in patients. Because the known SNP marker rs1143627 and the unannotated SNP rs549858786 have opposite effects (relative to each other) on IL1B expression, we performed an additional keyword search [54, 63] for “interleukin 1𝛽 deficiency” as a biochemical marker relevant to medicine in the NCBI databases [4]. The result is shown in Table 1 and represents experimental findings [76] in a murine model of human rheumatoid arthritis, which showed an association of the interleukin 1𝛽 deficiency with a high risk of this autoimmune disease. Within the framework of this animal model of the human disease [76], we propose rs549858786 as a candidate SNP marker of an increased risk of rheumatoid arthritis. This is the first novel finding in the present study.

Furthermore, the IL1B promoter under study contains one more unannotated SNP rs4986962 (−67G → T) [3, 4] that was predicted by our Web service [62] to insignificantly change TBP affinity for this promoter (data not shown). Notably, this prediction of (1)–(4) and Algorithm 1 does not rule out the possible usefulness of this SNP for clinical practice as a valid SNP marker of some human diseases. This is because our prediction does not take into account the influence of this SNP, for example, on the DNA sites binding to other transcription factors [23, 77], which can be studied in a different project, for example, using other Web services [25–27].

As one can see in Table 1, the next known SNP marker (of myocardial infarction and venous thromboembolism), rs563763767 (−21C → T) [78], is located within the core-promoter for transcript number 1 of the F3 gene (coagulation factor F3; synonym: tissue factor) and has properties that are similar to those of the above-mentioned basic example. Using the Web service [62], we predicted the SNP-caused overexpression of this gene, in agreement with the known pathogenesis of these cardiovascular diseases [78]. In turn, the known SNP marker −51T → C within the core-promoter of the human NOS2 gene (inducible nitric oxide synthase 2) exemplifies the so-called balanced SNPs, which can have both beneficial (malaria resistance [79]) and adverse effects (epilepsy risk [80]) on human health. Another type of manifestations of SNPs is illustrated by the known SNP marker rs10168 (−26G → A) in the human DHFR gene (dihydrofolate reductase; the main target of methotrexate, which is the key drug for the treatment of children with acute lymphoblastic leukemia) [81]. This gene’s overexpression as a result of −26A causes resistance to the above-mentioned antitumor drug.
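The IL1B example in this section can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions, not the Web service [62] itself: the function names are hypothetical, the TATA-like pattern is a simplified stand-in for the model in (1)–(4), and the 𝐾_D values are simply those quoted from Table 1. It applies −31C → T to the sequence fragment quoted in the text and reads the direction of the expression change from the two 𝐾_D estimates, using the rule that a lower 𝐾_D means higher TBP affinity.

```python
import re

# Simplified TATA-like pattern (an assumption for illustration only):
# TATA, then A or T, then A, then A or T.
TATA_LIKE = re.compile(r"TATA[AT]A[AT]")


def apply_substitution(seq, index, ancestral, minor):
    """Return seq with the base at 0-based index changed from ancestral to minor."""
    if seq[index].upper() != ancestral.upper():
        raise ValueError("ancestral allele does not match the sequence")
    return seq[:index] + minor + seq[index + 1:]


def has_tata_like_site(seq):
    """True if the sequence contains a match to the simplified pattern."""
    return TATA_LIKE.search(seq.upper()) is not None


def expression_change(kd_ancestral_nm, kd_minor_nm, significant):
    """Map two K_D estimates (nM) to the expected direction of expression.

    A lower K_D means tighter TBP binding, which the text associates
    with overexpression; a higher K_D is associated with deficiency.
    """
    if not significant or kd_minor_nm == kd_ancestral_nm:
        return "no significant change"
    if kd_minor_nm < kd_ancestral_nm:
        return "overexpression"
    return "deficiency"


# Fragment quoted in the text; the variable position (-31) is index 5.
ancestral_promoter = "gaaagCATAAAAcag"
minor_promoter = apply_substitution(ancestral_promoter, 5, "C", "T")

print(minor_promoter)
print(has_tata_like_site(ancestral_promoter), has_tata_like_site(minor_promoter))
print(expression_change(5.0, 2.0, True))  # rs1143627: 5 nM -> 2 nM
print(expression_change(5.0, 7.0, True))  # -28T allele: 5 nM -> 7 nM
```

The significance flag is passed in rather than computed, because the statistical test belongs to (1)–(4) and Algorithm 1, which this sketch does not reproduce.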

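The significance levels quoted in this section (for example, 𝑍-score = 7.63 with 𝛼 < 10⁻⁶) can be sanity-checked with a normal approximation. The sketch below assumes a two-sided test on a standard normal 𝑍 statistic; the exact test used by the Web service [62] is defined by (1)–(4) and Algorithm 1, which are not reproduced here, so the function names and the default 0.05 threshold are illustrative assumptions.

```python
import math


def two_sided_alpha(z):
    """Two-sided tail probability of a standard normal Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_significant(z, threshold=0.05):
    """True if the two-sided tail probability is below the threshold."""
    return two_sided_alpha(z) < threshold


# Z-scores quoted in the text for the two IL1B alleles discussed above.
for z in (7.63, 14.56):
    print(z, two_sided_alpha(z) < 1e-6)
```

Under this approximation, both quoted 𝑍-scores fall well below the 10⁻⁶ level, while a 𝑍-score near 1.96 sits at the conventional 0.05 boundary.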
The known SNP marker rs10895068 of the human PGR gene exemplifies the SNP-caused de novo appearance of a spurious TBP-binding site along with the additional pathogenic TSS at position +270 from the normal TSS for transcript number 2 of the same gene [82]. This alternative TSS disrupts the balance between the 𝛼 and 𝛽 isoforms of the progesterone receptor encoded by this gene; this aberration doubles the risk of endometrial cancer in overweight women [82].

Finally, the two bottom lines of Table 1 show two examples of the known SNP markers of so-called silent SNPs: −20A → T within the promoter of the human CYP21A2 gene [83] and rs111426889, which precedes the alternative TSS located at position −120 upstream of the major TSS for transcript number 3 of the TNFRSF18 gene [84]. These silent SNPs are useful for monitoring of migration flows and ethnic composition of regional human subpopulations.

3.2. The Results on 22 Known Biomedical SNP Markers That Decrease TBP Affinity for Human Gene Promoters. The results on 22 known biomedical SNP markers that decrease TBP affinity for human gene promoters are presented in Table 2. Let us analyze them briefly referring to the above examples.

Some of these biomedical SNP markers (8 of 22; 36%) were found within the promoters of two gene-paralogs: HBB and HBD of 𝛽- and 𝛿-hemoglobins. As one can see in Table 2, all of them are “balanced SNPs” causing both resistance to malaria and thalassemia [85–96] with only one exception: substitution −27A → T is of the “silent SNP” type. In addition, the SNP marker rs2814778 within the DARC gene is of the same “balanced SNP” type; namely, it is associated with malaria resistance and a low white-blood-cell count, as positive and negative effects on human health, respectively [97].

The known SNP marker rs28399433 (low risk of lung cancer among smokers) was found here within the human CYP2A6 gene (nicotine oxidase; synonyms: xenobiotic monooxygenase, polypeptide 6 of subfamily A of family 2 of cytochrome p450) [98, 99]. Our Web service [62] predicts (see (1)–(4) and Algorithm 1) reduced affinity of TBP for the minor allele of this gene promoter (Table 2). This result is consistent with empirical studies involving bioluminescence [98, 99]. In addition, three known SNP markers, rs55999272 in the CRYGEP gene, rs2276109 in MMP12, and an 18 bp deletion within the promoter of CETP, are associated with a reduced risk of Coppock-like cataract [100], asthma [101], systemic sclerosis [102], psoriasis [103], and atherosclerosis [104, 105] due to the SNP-caused damage to the TBP-binding sites of the promoters of these genes.

In addition, the known SNP marker rs34223104 within the core-promoter for the undocumented alternative TSS (located 48 bp upstream of the major TSS of the CYP2B6 gene) transforms the canonical form (TATA-box) of the TBP-binding site, 5′-gatgaaatttTATAAcagggt-3′, into the C/EBP-binding site (C/EBP, CCAAT-enhancer-binding protein), which causes increased bioactivation of the anticancer prodrug cyclophosphamide [106]. In this case, our Web service [62] predicts damage to this normal TBP-binding site that is in agreement with the experimentally observed transformation of this TBP-binding site into the SNP-caused C/EBP-binding site [106].

Furthermore, the remaining six known SNP markers, rs7277748 (SOD1) [107], rs1800202 (TPI1) [108, 109], rs35036378 (ESR2) [110, 111], rs201739205 (HSD17B1) [112], rs72661131 (MBL2) [113–115], and rs17537595 (ADH7) [116], including two substitutions, −35A → C (APOA1) [117] and −33A → C (F7) [118], are of the most frequent and best understood type of SNP: pathogenic damage to a normal TBP-binding site. This way, these SNPs can reduce expression of human genes.

Finally, near these 22 known biomedical SNP markers, we found and proposed 13 candidate SNP markers: rs63750953 (HBB), rs281864525 (HBB), rs34166473 (HBD), rs55878706 (DARC), rs572527200 (MMP12), rs17231520 (CETP), rs569033466 (CETP), rs563558831 (CYP2B6), rs562962093 (MBL2), rs72661131 (MBL2), rs372329931 (ADH7), rs36773297 (F7), and rs549591993 (F7), as one can see in Table 2. About a half of them (8 of 13, 62%) have effects on gene expression that are codirectional with the effects of the nearby known SNP markers and thus can serve as markers of the same human diseases (e.g., rs562962093 and rs33931746). For the other half of the SNPs, we found associations with appropriate diseases [119, 120] using a keyword search [54, 63] in NCBI databases [4] (e.g., rs567653539).

3.3. The Results on 10 Known Biomedical SNP Markers That Insignificantly Change TBP Affinity for Human Gene Promoters. The results on 10 known biomedical SNP markers that insignificantly change TBP affinity for human gene promoters are presented in Table 3. Let us discuss them briefly.

First of all, the known SNP marker rs1394205 (−29G → A) within the FSHR gene belongs to one of the most important types of SNP: it causes a frequently occurring disease, for example, male infertility, and this connection has been proven clinically regardless of bioinformatic, biochemical, or any other nonclinical data. As shown in the first line of Table 3, in terms of this biomedical marker, there are no differences between fertile men (who are fathers) and infertile men in Italy [121] and in Turkey [122]. In agreement with these biomedical findings [121, 122], our Web service [62] (see (1)–(4) and Algorithm 1) predicts no differences in TBP affinity for this gene’s promoter between ancestral and minor alleles of this SNP.

The next four substitutions, −48G → C (F9), −42T → A (F9), rs16887226 (StAR), and rs28399433 (GH1), are among the oldest known SNP markers that were discovered by means of the electrophoretic mobility shift assay (EMSA) before the advent of the reference human genome, hg19 [123, 124, 126]. According to these EMSA assays [123, 124, 126], each of these four SNPs pathologically reduces expression of the corresponding gene by disrupting the tissue-specific binding site for a transcription factor rather than by disrupting the ubiquitous TBP-binding site (they overlap). Additionally, the next five known SNP markers, rs1332018 (GSTM3), rs7586110 (UGT1A7), rs10465885 (GJA5), rs35594137 (GJA5), and rs13306848 (THBD), have properties similar to those of the SNPs above, in terms of bioluminescence (LUC) assays [127–132] instead of EMSA. Here we found six nearby unannotated SNPs, rs371045754 (F9), rs544850971 (StAR), rs200209906 (GSTM3), rs574890114 (UGT1A7), rs542729995 (UGT1A7), and rs587745372 (GJA5), which can significantly disrupt the above-mentioned TBP-binding sites and thereby may cause the same diseases in humans as do the six candidate SNP markers (Table 3).

Finally, the last two biomedical SNP markers, rs587745372 and rs398048306, taken together are the well-known unique genetic variation in the TBP-binding site length, A(TA)₅₋₈A in comparison with the norm: A(TA)₇A. The longest of them, rs587745372, is an integral part of several haplotypes associated with a high risk of hyperbilirubinemia and jaundice [133], whereas the two shortest ones, rs398048306 and rs200209906, are “silent SNPs” that are used to study ethnic differences of regional human subpopulations ([12] and Table 3).

Thus, in the vicinity of the 40 known biomedical SNP markers within the TBP-binding sites in humans, we first found 17 candidate SNP markers: rs55878706 (malaria resistance, low white-blood-cell count), rs562962093 (stroke, preeclampsia, and variable immunodeficiency), rs563558831 (cyclophosphamide bioactivation), rs549858786 (rheumatoid arthritis), rs372329931 (esophageal cancer), rs72661131 (cardiovascular events in rheumatoid arthritis), rs200209906 (brain, lung, testicular, and renal cell carcinomas), rs572527200 (low risk of asthma, systemic sclerosis, and psoriasis), rs371045754 (Leiden hemophilia B), rs587745372 (cardiovascular problems), rs367732974 and rs549591993 (both: progression of colorectal cancer from a primary tumor to metastasis), rs17231520 and rs569033466 (both: atherosclerosis), and rs63750953, rs281864525, and rs34166473 (all three: malaria resistance, thalassemia). This is the main result of our study.

4. Discussion

Because the mainstream method of searching for candidate SNP markers is now based on a statistical estimate of the similarity between the projections of unannotated SNPs and known SNP markers on various genome-wide maps, here we simplified the procedure by limiting it to unannotated SNPs only that are located near the known SNP markers in the TBP-binding sites of human genes. Within this framework, we found and analyzed 40 known SNP markers and 163 nearby unannotated SNPs shown within the first column of Tables 1–3 below the gene acronyms. The majority of the unannotated SNPs (153 of 203; 75%) appear to be insignificantly altering TBP affinity for the core-promoter of the corresponding gene in humans (data not shown). This prediction of our Web service [62] seems to be consistent with the commonly accepted paradigm of genetic stability of the human genome and with data from EMSA and LUC assays of SNP-caused pathological disruption of binding sites for tissue-specific transcription factors rather than disruption of the TBP-binding site (overlaps them; they constitute the so-called composite unit [134]; Table 3).

The second most frequent group of SNP markers, 37 of 203 (18%), disrupts TBP-binding sites within core-promoters of human genes and thereby reduces expression of these genes; this deficient gene expression is more often associated with adverse than beneficial effects on human health. This finding is in agreement with the commonly accepted bioinformatics notion that the SNP-caused damage to genetic information is more frequent than SNP-caused genetic benefits.

The third most frequent group of SNP markers, 13 of 203 (7%), increases the TBP binding affinity for core-promoters of human genes and, hence, causes overexpression of these genes. This overexpression can be pathogenic, neutral, or beneficial for human health at approximately equal probabilities. This finding points to a huge diversity of genetic effects of SNPs within the human genome. Indeed, the remaining manifestations of SNPs constitute only rare examples, such as “silent SNPs” (e.g., rs111426889), “balanced SNPs” (e.g., rs35518301), a de novo occurrence of a spurious TBP-binding site (e.g., rs10895068), transformation of a normal TBP-binding site into another regulatory genomic signal (e.g., rs34223104), a change of the composite unit containing the TBP-binding site (e.g., rs28399433), a deletion of the DNA fragment either around or inside the TBP-binding site (e.g., rs63750953), and a duplication of the DNA fragment inside the TBP-binding site (e.g., rs34983651).

As for the SNP-caused pathological changes, the majority (40 of 57; 70%) of the SNP markers of diseases are either increasing or decreasing the risk of human diseases, whereas the rare types of SNPs are associated with drug resistance (e.g., rs10168), prodrug bioactivation (e.g., rs34223104), disease complications (e.g., rs72661131), and ethnic differences (e.g., rs398048306 and rs34223104). In addition, 10 of the 17 proposed candidate SNP markers are codirectionally changing TBP affinity for the core-promoters of human genes with respect to the nearby known SNP markers, whereas the remaining 7 candidate SNP markers do so in the opposite direction. Accordingly, we did additional keyword searches [54, 63] by hand in NCBI databases [4]. Both of these observations mean that our Web service [62], when combined with a manual comprehensive search for keywords [54, 63] by means of the Web-based information sources, is most suitable for precise analysis of specific SNPs, genes, and diseases rather than for a whole-genome search for a wide range of all possible manifestations of any unannotated SNPs.

In this regard, it should be noted that the statistical significance of the proposed 17 candidate SNP markers varies from high confidence (𝛼 < 10⁻⁷) to borderline significance (𝛼 < 0.05). In contrast, 𝐾_D values when expressed in moles (𝑀; representing affinity of TBP binding to the core-promoter in vitro [50]) vary from 1 nM to 62 nM, and their variation among alleles of a given SNP is less than 2% of this range and thus outside the limits of accuracy of empirical measurement of 𝐾_D values, if we are not taking into account additional information on the expected range of the values being measured. Thus, the 𝐾_D values shown in Tables 1–3 are necessary for prognostic affinity analysis of these 17 candidate SNP markers that we made using the Web service [62] for the purpose of their empirical verification by means of sophisticated equipment (e.g., [50–53]).

Finally, our estimates for the 17 candidate SNP markers (Tables 1–3) are only measures of bioinformatic (𝐾_D values,
𝑍-score, 𝛼-value, 𝑝 value, etc.) and biomedical justification (last columns in Tables 1–3) for the highly expensive and laborious verification of SNPs during a search for an SNP marker that can be validated only by a higher incidence in patients than in healthy people. What is healthy or normal depends on ethnic, social, age, and gender composition of a human subpopulation, the settlement ratio and the associated migration flows, climate and environment, living conditions and lifestyle, the technological level of health care and diagnostic procedures, anamnesis, and treatment history [135].

5. Conclusions

The use of biomedical SNP markers can improve effectiveness of treatment and help to develop new medications. The majority of known SNP markers are located in protein-coding regions of human genes and have invariant manifestation of disruption in the protein structure and/or function (e.g., [29]). At the same time, only a minority of known SNP markers are located in regulatory regions of genes because their experimental detection is complicated by the tissue- and developmental-stage-specific variation in binding of a regulatory protein to these DNA regions [23, 25, 27, 30, 77]. Nevertheless, the best-studied regulatory SNPs in TBP-binding sites of human promoters seem to have a lot in common with the SNPs in protein-coding regions rather than with the remaining regulatory SNPs. With this in mind, here we first predicted 17 candidate biomedical SNP markers in TBP-binding sites of human promoters and confirmed them using both clinical and basic research of other investigators (Tables 1–3). Verification of these predictions according to established biomedical standards and protocols can bridge the gap between the best-studied SNPs within protein-coding regions of human genes and the worst-studied regulatory SNPs and thus may advance postgenomic predictive preventive personalized medicine.

Conflict of Interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interests.

Acknowledgments

The authors are grateful to Nikolai A. Shevchuk for English translation and editing and to Dr. Alena D. Zolotarenko for her fruitful ideas. Writing of the paper was supported by Project no. 14-04-00485 (for Ludmila Savinkova and Mikhail Ponomarenko) from the Russian Foundation for Basic Research. The software development was supported by Project no. 14-24-00123 (for Dmitry Rasskazov) from the Russian Scientific Foundation. The data compilation was supported by Project VI.58.1.2 (for Olga Arkova) and the data processing and analysis were supported by Project VI.61.1.2 (for Nikolay Kolchanov and Valentin Suslov, resp.), both from the Russian State Budget.

References

[1] S. Mallal, D. Nolan, C. Witt et al., “Association between presence of HLA-B*5701, HLA-DR7, and HLA-DQ3 and hypersensitivity to HIV-1 reverse-transcriptase inhibitor abacavir,” The Lancet, vol. 359, no. 9308, pp. 727–732, 2002.
[2] G. M. Trovato, “Sustainable medical research by effective and comprehensive medical skills: overcoming the frontiers by predictive, preventive and personalized medicine,” EPMA Journal, vol. 5, no. 1, article 14, 2014.
[3] V. Colonna, Q. Ayub, Y. Chen et al., “Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences,” Genome Biology, vol. 15, no. 6, article R88, 2014.
[4] NCBI Resource Coordinators, “Database resources of the National Center for Biotechnology Information,” Nucleic Acids Research, vol. 43, pp. D6–D17, 2015.
[5] P. Flicek, M. R. Amode, D. Barrell et al., “Ensembl 2011,” Nucleic Acids Research, vol. 39, pp. D800–D806, 2011.
[6] J. Harrow, A. Frankish, J. M. Gonzalez et al., “GENCODE: the reference human genome annotation for the ENCODE project,” Genome Research, vol. 22, no. 9, pp. 1760–1774, 2012.
[7] T. R. Dreszer, D. Karolchik, A. S. Zweig et al., “The UCSC Genome Browser database: extensions and updates 2011,” Nucleic Acids Research, vol. 40, no. 1, pp. D918–D923, 2012.
[8] D. Welter, J. MacArthur, J. Morales et al., “The NHGRI GWAS Catalog, a curated resource of SNP-trait associations,” Nucleic Acids Research, vol. 42, no. 1, pp. D1001–D1006, 2014.
[9] J. S. Amberger, C. A. Bocchini, F. Schiettecatte, A. F. Scott, and A. Hamosh, “OMIM.org: Online Mendelian Inheritance in Man (OMIM), an online catalog of human genes and genetic disorders,” Nucleic Acids Research, vol. 43, pp. D789–D798, 2015.
[10] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[11] D. M. Altshuler, R. A. Gibbs, L. Peltonen et al., “Integrating common and rare genetic variation in diverse human populations,” Nature, vol. 467, no. 7311, pp. 52–58, 2010.
[12] N. Kaniwa, K. Kurose, H. Jinno et al., “Racial variability in haplotype frequencies of UGT1A1 and glucuronidation activity of a novel single nucleotide polymorphism 686C>T (P229L) found in an African-American,” Drug Metabolism and Disposition, vol. 33, no. 3, pp. 458–465, 2005.
[13] Y. Ni, A. W. Hall, A. Battenhouse, and V. R. Iyer, “Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data,” BMC Genetics, vol. 13, article 46, 2012.
[14] J. Hu, J. W. Locasale, J. H. Bielas et al., “Heterogeneity of tumor-induced gene expression changes in the human metabolic network,” Nature Biotechnology, vol. 31, no. 6, pp. 522–529, 2013.
[15] M. Hein and S. Graver, “Tumor cell response to bevacizumab single agent therapy in vitro,” Cancer Cell International, vol. 13, no. 1, article 94, 2013.
[16] C.-Y. Chen, I.-S. Chang, C. A. Hsiung, and W. W. Wasserman, “On the identification of potential regulatory variants within genome wide association candidate SNP sets,” BMC Medical Genomics, vol. 7, article 34, 2014.
[17] M. C. Andersen, P. G. Engström, S. Lithwick et al., “In silico detection of sequence variations modifying transcriptional regulation,” PLoS Computational Biology, vol. 4, no. 1, article e5, 2008.
[18] G. Macintyre, J. Bailey, I. Haviv, and A. Kowalczyk, “is-rSNP: a novel technique for in silico regulatory SNP detection,” Bioinformatics, vol. 26, no. 18, pp. i524–i530, 2010.
[19] A. P. Boyle, E. L. Hong, M. Hariharan et al., “Annotation of functional variation in personal genomes using RegulomeDB,” Genome Research, vol. 22, no. 9, pp. 1790–1797, 2012.
[20] A. Riva, “Large-scale computational identification of regulatory SNPs with rSNP-MAPPER,” BMC Genomics, vol. 13, supplement 4, article S7, 2012.
[21] Y. Fu, Z. Liu, S. Lou et al., “FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer,” Genome Biology, vol. 15, no. 10, article 480, 2014.
[22] C.-C. Chen, S. Xiao, D. Xie et al., “Understanding variation in transcription factor binding by modeling transcription factor genome-epigenome interactions,” PLoS Computational Biology, vol. 9, no. 12, Article ID e1003367, 2013.
[23] J. V. Ponomarenko, G. V. Orlova, T. I. Merkulova et al., “rSNP Guide: an integrated database-tools system for studying SNPs and site-directed mutations in transcription factor binding sites,” Human Mutation, vol. 20, no. 4, pp. 239–248, 2002.
[24] J. V. Ponomarenko, G. V. Orlova, A. S. Frolov, M. S. Gelfand, and M. P. Ponomarenko, “SELEX DB: a database on in vitro selected oligomers adapted for recognizing natural sites and for analyzing both SNPs and site-directed mutagenesis data,” Nucleic Acids Research, vol. 30, no. 1, pp. 195–199, 2002.
[25] D. A. Rasskazov, E. V. Antontseva, L. O. Bryzgalov et al., “rSNP_Guide-based evaluation of SNPs in promoters of the human APC and MLH1 genes associated with colon cancer,” Russian Journal of Genetics: Applied Research, vol. 4, no. 4, pp. 245–253, 2014.
[26] N. L. Podkolodnyy, D. A. Afonnikov, Y. Y. Vaskin et al., “Program complex SNP-MED for analysis of single-nucleotide polymorphism (SNP) effects on the function of genes associated with socially significant diseases,” Russian Journal of Genetics: Applied Research, vol. 4, no. 3, pp. 159–167, 2014.
[27] L. O. Bryzgalov, E. V. Antontseva, M. Y. Matveeva et al., “Detection of regulatory SNPs in human genome using ChIP-seq ENCODE data,” PLoS ONE, vol. 8, no. 10, Article ID e78833, 2013.
[28] M. P. Ponomarenko, J. V. Ponomarenko, A. S. Frolov et al., “Oligonucleotide frequency matrices addressed to recognizing functional DNA sites,” Bioinformatics, vol. 15, no. 7-8, pp. 631–643, 1999.
[29] H. Mitsuyasu, K. Izuhara, X. Q. Mao et al., “Ile50Val variant of IL4R alpha upregulates IgE synthesis and associates with atopic asthma,” Nature Genetics, vol. 19, no. 2, pp. 119–120, 1998.
[30] D. R. Zerbino, S. P. Wilder, N. Johnson, T. Juettemann, and P. R. Flicek, “The ensembl regulatory build,” Genome Biology, vol. 16, article 56, 2015.
[31] M. M. Babu, N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann, “Structure and evolution of transcriptional regulatory networks,” Current Opinion in Structural Biology, vol. 14, no. 3, pp. 283–291, 2004.
[32] M. Ponomarenko, V. Mironova, K. Gunbin, and L. Savinkova, “Hogness box,” in Brenner’s Encyclopedia of Genetics, S. Maloy and K. Hughes, Eds., vol. 3, pp. 491–494, Academic Press, Elsevier Inc, San Diego, Calif, USA, 2013.
[33] M. Ponomarenko, L. Savinkova, and N. Kolchanov, “Initiation Factors,” in Brenner’s Encyclopedia of Genetics, S. Maloy and K. Hughes, Eds., vol. 4, pp. 83–85, Academic Press, San Diego, Calif, USA, 2nd edition, 2013.
[34] I. Martianov, S. Viville, and I. Davidson, “RNA polymerase II transcription in murine cells lacking the TATA binding protein,” Science, vol. 298, no. 5595, pp. 1036–1039, 2002.
[35] F. Müller, L. Lakatos, J.-C. Dantonel, U. Strähle, and L. Tora, “TBP is not universally required for zygotic RNA polymerase II transcription in zebrafish,” Current Biology, vol. 11, no. 4, pp. 282–287, 2001.
[36] H. S. Rhee and B. F. Pugh, “Genome-wide structure and organization of eukaryotic pre-initiation complexes,” Nature, vol. 483, no. 7389, pp. 295–301, 2012.
[37] M.-A. Choukrallah, D. Kobi, I. Martianov et al., “Interconversion between active and inactive TATA-binding protein transcription complexes in the mouse genome,” Nucleic Acids Research, vol. 40, no. 4, pp. 1446–1459, 2012.
[38] M. Q. Yang, K. Laflamme, V. Gotea et al., “Genome-wide detection of a TFIID localization element from an initial human disease mutation,” Nucleic Acids Research, vol. 39, no. 6, pp. 2175–2187, 2011.
[39] B. F. Pugh, “Control of gene expression through regulation of the TATA-binding protein,” Gene, vol. 255, no. 1, pp. 1–14, 2000.
[40] J. J. Stewart and L. A. Stargell, “The stability of the TFIIA-TBP-DNA complex is dependent on the sequence of the TATAAA element,” The Journal of Biological Chemistry, vol. 276, no. 32, pp. 30078–30084, 2001.
[41] I. Mogno, F. Vallania, R. D. Mitra, and B. A. Cohen, “TATA is a modular component of synthetic promoters,” Genome Research, vol. 20, no. 10, pp. 1391–1397, 2010.
[42] A. A. Sokolenko, I. I. Sandomirskii, and L. K. Savinkova, “Interaction of yeast TATA-binding protein with short promotor segments,” Molekuliarnaia Biologiia, vol. 30, no. 2, pp. 279–285, 1996.
[43] M. P. Ponomarenko, J. V. Ponomarenko, A. S. Frolov et al., “Identification of sequence-dependent DNA features correlating to activity of DNA sites interacting with proteins,” Bioinformatics, vol. 15, no. 7-8, pp. 687–703, 1999.
[44] L. K. Savinkova, I. A. Drachkova, M. P. Ponomarenko et al., “Interaction of recombinant TATA-binding protein with the TATA boxes of the of mammalian gene promoters,” Ecological Genetics (St. Petersburg, Russia), vol. 4, pp. 44–49, 2007.
[45] J. V. Ponomarenko, D. P. Furman, A. S. Frolov et al., “ACTIVITY: a database on DNA/RNA sites activity adapted to apply sequence-activity relationships from one system to another,” Nucleic Acids Research, vol. 29, no. 1, pp. 284–287, 2001.
[46] M. P. Ponomarenko, L. K. Savinkova, Y. V. Ponomarenko, A. E. Kel’, I. I. Titov, and N. A. Kolchanov, “Simulation of TATA box sequences in eukaryotes,” Molecular Biology, vol. 31, no. 4, pp. 616–622, 1997.
[47] P. M. Ponomarenko, L. K. Savinkova, I. A. Drachkova et al., “A step-by-step model of TBP/TATA box binding allows predicting human hereditary diseases by single nucleotide polymorphism,” Doklady Biochemistry and Biophysics, vol. 419, no. 1, pp. 88–92, 2008.
[48] P. Bucher, “Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences,” Journal of Molecular Biology, vol. 212, no. 4, pp. 563–578, 1990.
[49] R. F. Delgadillo, J. E. Whittington, L. K. Parkhurst, and L. J. Parkhurst, “The TATA-binding protein core domain in solution variably bends TATA sequences via a three-step binding mechanism,” Biochemistry, vol. 48, no. 8, pp. 1801–1809, 2009.
[50] L. K. Savinkova, I. A. Drachkova, T. V. Arshinova, P. Ponomarenko, M. Ponomarenko, and N. Kolchanov, “An experimental verification of the predicted effects of promoter TATA-box polymorphisms associated with human diseases on interactions between the TATA boxes and TATA-binding protein,” PLoS ONE, vol. 8, no. 2, Article ID e54626, 2013.
[51] I. Drachkova, L. Savinkova, T. Arshinova, M. Ponomarenko, S. Peltek, and N. Kolchanov, “The mechanism by which TATA-box polymorphisms associated with human hereditary diseases influence interactions with the TATA-binding protein,” Human Mutation, vol. 35, no. 5, pp. 601–608, 2014.
[52] I. A. Drachkova, S. V. Shekhovtsov, S. E. Peltek et al., “Surface plasmon resonance study of the interaction between the human TATA-box binding protein and the TATA element of the NOS2A gene promoter,” Vavilov Journal of Genetics and Breeding, vol. 16, no. 2, pp. 391–396, 2012.
[53] O. V. Arkova, N. A. Kuznetsov, O. S. Fedorova, N. A. Kolchanov, and L. K. Savinkova, “Real-time interaction between TBP and the TATA box of the human triosephosphate isomerase gene promoter in the norm and pathology,” Acta Naturae, vol. 6, no. 2, pp. 36–40, 2014.
[54] L. K. Savinkova, M. P. Ponomarenko, P. M. Ponomarenko et al., “TATA box polymorphisms in human gene promoters and associated hereditary pathologies,” Biochemistry, vol. 74, no. 2, pp. 117–129, 2009.
[55] V. V. Suslov, P. M. Ponomarenko, V. M. Efimov, L. K. Savinkova, M. P. Ponomarenko, and N. A. Kolchanov, “SNPS in the HIV-1 tata box and the aids pandemic,” Journal of Bioinformatics and Computational Biology, vol. 8, no. 3, pp. 607–625, 2010.
[56] V. V. Suslov, P. M. Ponomarenko, M. P. Ponomarenko et al., “TATA box polymorphisms in genes of commercial and laboratory animals and plants associated with selectively valuable traits,” Russian Journal of Genetics, vol. 46, no. 4, pp. 394–403, 2010.
[57] P. M. Ponomarenko, M. P. Ponomarenko, I. A. Drachkova et al., “Prediction of the affinity of the TATA-binding protein to TATA boxes with single nucleotide polymorphisms,” Molecular Biology, vol. 43, no. 3, pp. 472–479, 2009.
[58] M. P. Ponomarenko, V. V. Suslov, K. V. Gunbin et al., “Identification of the relationship between variability of expression of signaling pathway genes in the human brain and affinity of TATA-binding protein to their promoters,” Vavilov Journal of Genetics and Breeding, vol. 18, no. 3-4, pp. 1219–1230, 2014.
[59] V. V. Mironova, N. A. Omelyanchuk, P. M. Ponomarenko, M. P. Ponomarenko, and N. A. Kolchanov, “Specific/nonspecific binding of TBP to promoter DNA of the auxin response factor genes in plants correlated with ARFs function on gene transcription (activator/repressor),” Doklady Biochemistry and Biophysics, vol. 433, no. 1, pp. 191–196, 2010.
[60] P. M. Ponomarenko and M. P. Ponomarenko, “Sequence-based prediction of transcription upregulation by auxin in plants,” Journal of Bioinformatics and Computational Biology, vol. 13, no. 1, Article ID 1540009, 2015.
[61] P. M. Ponomarenko, V. V. Suslov, L. K. Savinkova, M. P. Ponomarenko, and N. A. Kolchanov, “A precise equation of equilibrium of four steps of TBP binding with the TATA box for prognosis of phenotypic manifestation of mutations,” Biophysics, vol. 55, no. 3, pp. 358–369, 2010.
[62] D. A. Rasskazov, K. V. Gunbin, P. M. Ponomarenko et al., “SNP TATA COMPARATOR: web-service for comparison of SNPs within gene promoters associated with human diseases using the equilibrium equation of the TBP/TATA complex,” Vavilov Journal of Genetics and Breeding, vol. 17, no. 4/1, pp. 599–606, 2013.
[63] I. Missala, U. Kassner, and E. Steinhagen-Thiessen, “A systematic literature review of the association of lipoprotein(a) and autoimmune diseases and atherosclerosis,” International Journal of Rheumatology, vol. 2012, Article ID 480784, 10 pages, 2012.
[64] J. E. Stajich, D. Block, K. Boulez et al., “The Bioperl toolkit: perl modules for the life sciences,” Genome Research, vol. 12, no. 10, pp. 1611–1618, 2002.
[65] D. Flatters and R. Lavery, “Identification of sequence-dependent features correlating to activity of DNA sites interacting with proteins,” Biophysical Journal, vol. 75, no. 1, pp. 372–381, 1998.
[66] S. Hahn, S. Buratowski, P. A. Sharp, and L. Guarente, “Yeast TATA-binding protein TFIID binds to TATA elements with both consensus and nonconsensus DNA sequences,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 15, pp. 5718–5722, 1989.
[67] R. A. Coleman and B. F. Pugh, “Evidence for functional binding and stable sliding of the TATA binding protein on nonspecific DNA,” The Journal of Biological Chemistry, vol. 270, no. 23, pp. 13850–13859, 1995.
[68] H. Karas, R. Knüppel, W. Schulz, H. Sklenar, and E. Wingender, “Combining structural analysis of DNA with search routines for the detection of transcription regulatory elements,” Computer Applications in the Biosciences, vol. 12, no. 5, pp. 441–446, 1996.
[69] IUPAC-IUB Commission on Biochemical Nomenclature (CBN), “Abbreviations and Symbols for nucleic acids, polynucleotides and their constituents,” Journal of Molecular Biology, vol. 55, no. 3, pp. 299–310, 1971.
[70] E. M. El-Omar, M. Carrington, W.-H. Chow et al., “Interleukin-1 polymorphisms associated with increased risk of gastric cancer,” Nature, vol. 404, no. 6776, pp. 398–402, 2000.
[71] Y. Wang, N. Kato, Y. Hoshida et al., “Interleukin-1𝛽 gene polymorphisms associated with hepatocellular carcinoma in hepatitis C virus infection,” Hepatology, vol. 37, no. 1, pp. 65–71, 2003.
[72] K.-S. Wu, X. Zhou, F. Zheng, X.-Q. Xu, Y.-H. Lin, and J. Yang, “Influence of interleukin-1 beta genetic polymorphism, smoking and alcohol drinking on the risk of non-small cell lung cancer,” Clinica Chimica Acta, vol. 411, no. 19-20, pp. 1441–1446, 2010.
[73] D. N. Martínez-Carrillo, E. Garza-González, R. Betancourt-Linares et al., “Association of IL1B -511C/-31T haplotype and Helicobacter pylori vacA genotypes with gastric ulcer and chronic gastritis,” BMC Gastroenterology, vol. 10, article 126, 2010.
[74] F. Hayashi, M. Watanabe, T. Nanba, N. Inoue, T. Akamizu, and Y. Iwatani, “Association of the -31C/T functional polymorphism in the interleukin-1beta gene with the intractability of Graves’ disease and the proportion of T helper type 17 cells,” Clinical and Experimental Immunology, vol. 158, no. 3, pp. 281–286, 2009.
[75] P. Borkowska, K. Kucia, S. Rzezniczek et al., “Interleukin-1beta promoter (-31T/C and -511C/T) polymorphisms in major recurrent depression,” Journal of Molecular Neuroscience, vol. 44, no. 1, pp. 12–16, 2011.
[76] H. Yamazaki, M. Takeoka, M. Kitazawa et al., “ASC plays a role in the priming phase of the immune response to type II collagen in collagen-induced arthritis,” Rheumatology International, vol. 32, no. 6, pp. 1625–1632, 2012.
[77] G. V. Vasiliev, V. M. Merkulov, V. F. Kobzev, T. I. Merkulova, M. P. Ponomarenko, and N. A. Kolchanov, “Point mutations within 663–666 bp of intron 6 of the human TDO2 gene, associated with a number of psychiatric disorders, damage the YY-1 transcription factor binding site,” FEBS Letters, vol. 462, no. 1-2, pp. 85–88, 1999.
[78] E. Arnaud, V. Barbalat, V. Nicaud et al., “Polymorphisms in the 5′ regulatory region of the tissue factor gene and the risk of myocardial infarction and venous thromboembolism: the ECTIM and PATHROS studies,” Arteriosclerosis, Thrombosis, and Vascular Biology, vol. 20, no. 3, pp. 892–898, 2000.
[79] I. A. Clark, K. A. Rockett, and D. Burgner, “Genes, nitric oxide and malaria in African children,” Trends in Parasitology, vol. 19, no. 8, pp. 335–337, 2003.
[80] J. A. González-Martínez, G. Möddel, Z. Ying, R. A. Prayson, W. E. Bingaman, and I. M. Najm, “Neuronal nitric oxide synthase expression in resected epileptic dysplastic neocortex: laboratory investigation,” Journal of Neurosurgery, vol. 110, no. 2, pp. 343–349, 2009.
[81] F. Al-Shakfa, S. Dulucq, I. Brukner et al., “DNA variants in
[92] M. Poncz, M. Ballantine, D. Solowiejczyk, I. Barak, E. Schwartz, and S. Surrey, “𝛽-Thalassemia in a Kurdish Jew. Single base changes in the T-A-T-A box,” The Journal of Biological Chemistry, vol. 257, no. 11, pp. 5994–5996, 1982.
[93] F. S. Collins and S. M. Weissman, “The molecular genetics of human hemoglobin,” Progress in Nucleic Acid Research and Molecular Biology, vol. 31, pp. 315–462, 1984.
[94] C. Badens, N. Jassim, N. Martini, J. F. Mattei, J. Elion, and D. Lena-Russo, “Characterization of a new polymorphism, IVS-I-108 (T → C), and a new 𝛽-thalassemia mutation, -27 (A → T), discovered in the course of a prenatal diagnosis,” Hemoglobin, vol. 23, no. 4, pp. 339–344, 1999.
[95] R. M. Bannerman, L. M. Garrick, P. Rusnak-Smalley, J. E. Hoke, and J. A. Edwards, “Hemoglobin deficit: An inherited hypochromic anemia in the mouse,” Proceedings of the Society for Experimental Biology and Medicine, vol. 182, no. 1, pp. 52–57, 1986.
[96] H. Frischknecht and F.
Dutly, “Two new delta-globin muta- region for noncoding interfering transcript of Dihydrofolate tions: Hb A2-Ninive [!delta133(H11)Val-Ala] and a delta(+)- reductase gene and outcome in childhood acute lymphoblastic thalassemia mutation [-31 (A → G)] in the TATA box of the leukemia,” Clinical Cancer Research,vol.15,no.22,pp.6931– delta-globin gene,” Hemoglobin,vol.29,no.2,pp.151–154,2005. 6938, 2009. [97] M. A. Nalls, J. G. Wilson, N. J. Patterson et al., “Admixture [82]I.DeVivo,G.S.Huggins,S.E.Hankinsonetal.,“Afunctional mapping of white cell count: genetic locus responsible for lower polymorphism in the promoter of the progesterone receptor white blood cell count in the Health ABC and Jackson Heart gene associated with endometrial cancer risk,” Proceedings of the studies,” American Journal of Human Genetics,vol.82,no.1,pp. National Academy of Sciences of the United States of America,vol. 81–87, 2008. 99, no. 19, pp. 12263–12268, 2002. [98] M. Pitarque, O. von Richter, B. Oke, H. Berkkan, M. Oscar- [83] B. Blasko,´ Z. Banlaki,G.Gyapayetal.,“Linkageanalysisofthe´ son, and M. Ingelman-Sundberg, “Identification of a single C4A/C4B copy number variation and polymorphisms of the nucleotide polymorphism in the TATA box of the CYP2A6 adjacent steroid 21-hydroxylase gene in a healthy population,” gene: impairment of its promoter activity,” Biochemical and Molecular Immunology, vol. 46, no. 13, pp. 2623–2629, 2009. Biophysical Research Communications,vol.284,no.2,pp.455– [84]T.P.Velavan,S.Bechlars,X.Huang,P.G.Kremsner,andJ.F. 460, 2001. J. Kun, “Novel regulatory SNPs in the promoter region of the [99] M. L. Pianezza, E. M. Sellers, and R. F. Tyndale, “Nicotine TNFRSF18 gene in a gabonese population,” Brazilian Journal of metabolism defect reduces smoking,” Nature,vol.393,no.6687, Medical and Biological Research, vol. 44, no. 5, pp. 418–420, 2011. p. 750, 1998. [85] Y. Yamashiro, Y. Hattori, Y. 
Matsuno et al., “Another example of [100]R.H.Brakenhoff,H.A.M.Henskens,M.W.P.C.VanRossum, Japanese beta-thalassemia [-31 Cap (A—-G)],” Hemoglobin,vol. N. H. Lubsen, and J. G. G. Schoenmakers, “Activation of 13, no. 7-8, pp. 761–767, 1989. the gammaE-crystallin pseudogene in the human hereditary [86] Y. Takihara, T. Nakamura, H. Yamada et al., “A novel mutation Coppock-like cataract,” Human Molecular Genetics,vol.3,no. in the TATA box in a Japanese patient with beta +-thalassemia,” 2,pp.279–283,1994. Blood,vol.67,no.2,pp.547–550,1986. [101] G. M. Hunninghake, M. H. Cho, Y. Tesfaigzi et al., “MMP12, [87] Y. J. Fei, T. A. Stoming, G. D. Efremov et al., “𝛽-thalassemia lung function, and COPD in high-risk populations,” The New due to a T → A mutation within the ATA box,” Biochemical and England Journal of Medicine,vol.361,no.27,pp.2599–2608, Biophysical Research Communications,vol.153,no.2,pp.741– 2009. 747, 1988. [102] M. Manetti, L. Ibba-Manneschi, C. Fatini et al., “Association of [88]S.-P.Cai,J.-Z.Zhang,M.Doherty,andY.W.Kan,“AnewTATA a functional polymorphism in the matrix metalloproteinase-12 box mutation detected at prenatal diagnosis for 𝛽-thalassemia,” promoter region with systemic sclerosis in an Italian popula- The American Journal of Human Genetics,vol.45,no.1,pp.112– tion,” Journal of Rheumatology,vol.37,no.9,pp.1852–1857,2010. 114, 1989. [103]N.L.Starodubtseva,V.V.Sobolev,A.G.Soboleva,A.A. [89] S. E. Antonarakis, S. H. Irkin, T. C. Cheng et al., “Beta- Nikolaev, and S. A. Bruskin, “Genes expression of metallopro- thalassemia in American blacks: novel mutations in the ‘TATA’ teinases (MMP-1, MMP-2, MMP-9, and MMP-12) associated box and an acceptor splice site,” Proceedings of the National with psoriasis,” Russian Journal of Genetics,vol.47,no.9,pp. Academy of Sciences of the United States of America,vol.81,no. 1117–1123, 2011. 4, pp. 1154–1158, 1984. [104] W.Plengpanich, W.Le Goff, S. Poolsuk, Z. Julia, M. 
Guerin, and [90]S.Huang,C.Wong,S.E.Antonarakis,T.Ro-lien,W.H.Y.Lo, W. Khovidhunkit, “CETP deficiency due to a novel mutation in and H. H. Kazazian Jr., “The same “TATA” box 𝛼-thalassemia theCETPgenepromoteranditseffectoncholesteroleffluxand mutation in Chinese and US blacks: another example of selective uptake into hepatocytes,” Atherosclerosis,vol.216,no. independent origins of mutation,” Human Genetics,vol.74,no. 2, pp. 370–373, 2011. 2, pp. 162–164, 1986. [105] K. Oka, L. M. Belalcazar, C. Dieker et al., “Sustained phenotypic [91] S. H. Orkin, J. P. Sexton, T.-C. Cheng et al., “ATA box transcrip- correction in a mouse model of hypoalphalipoproteinemia with tion mutation in 𝛽-thalassemia,” Nucleic Acids Research, vol. 11, a helper-dependent adenovirus vector,” Gene Therapy,vol.14, no. 14, pp. 4727–4734, 1983. no.3,pp.191–202,2007. 16 BioMed Research International

[106]J.Zukunft,T.Lang,T.Richteretal.,“AnaturalCYP2B6 complex in colorectal cancer,” Journal of Peking University: TATA box polymorphism (−82T → C) leading to enhanced Health sciences,vol.41,no.5,pp.531–536,2009. transcription and relocation of the transcriptional start site,” [121] M. Pengo, A. Ferlin, B. Arredi et al., “FSH receptor gene poly- Molecular Pharmacology,vol.67,no.5,pp.1772–1782,2005. morphisms in fertile and infertile Italian men,” Reproductive [107] S. Niemann, W. J. Broom, and R. H. Brown Jr., “Analysis of a BioMedicine Online,vol.13,no.6,article2494,pp.795–800, genetic defect in the TATA box of the SOD1 gene in a patient 2006. with familial amyotrophic lateral sclerosis,” Muscle and Nerve, [122] M. Balkan, A. Gedik, H. Akkoc et al., “FSHR single nucleotide vol. 36, no. 5, pp. 704–707, 2007. polymorphism frequencies in proven fathers and infertile men [108] M.Watanabe,B.C.Zingg,andH.W.Mohrenweiser,“Molecular in southeast turkey,” Journal of Biomedicine and Biotechnology, analysis of a series of alleles in humans with reduced activity at vol. 2010, Article ID 640318, 5 pages, 2010. the triosephosphate isomerase locus,” The American Journal of [123] M. J. Reijnen, F. M. Sladek, R. M. Bertina, and P. H. Reitsma, Human Genetics, vol. 58, no. 2, pp. 308–316, 1996. “Disruption of a binding site for hepatocyte nuclear factor 4 [109]J.-L.Vives-Corrons,H.Rubinson-Skala,M.Mateo,J.Estella,E. results in hemophilia B Leyden,” Proceedings of the National Feliu, and J.-C. Dreyfus, “Triosephosphate isomerase deficiency Academy of Sciences of the United States of America,vol.89,no. with hemolytic anemia and severe neuromuscular disease: 14, pp. 6300–6303, 1992. familial and biochemical studies of a case found in Spain,” [124]A.J.Casal,V.J.P.Sinclair,A.M.Capponi,J.Nicod,U.Huynh- Human Genetics,vol.42,no.2,pp.171–180,1978. Do, and P. Ferrari, “Anovel mutation in the steroidogenic acute [110] S. Philips, A. Richter, S. 
Oesterreich et al., “Functional charac- regulatory protein gene promoter leading to reduced promoter terization of a genetic polymorphism in the promoter of the activity,” JournalofMolecularEndocrinology,vol.37,no.1,pp. ESR2 gene,” Hormones and Cancer,vol.3,no.1-2,pp.37–43, 71–80, 2006. 2012. [125]K.M.Caron,S.-C.Soo,W.C.Wetsel,D.M.Stocco,B.J. [111] A. M. Sieuwerts, M. Ansems, M. P. Look et al., “Clinical Clark, and K. L. Parker, “Targeted disruption of the mouse significance of the nuclear receptor co-regulator DC-SCRIPT gene encoding steroidogenic acute regulatory protein provides in breast cancer: an independent retrospective validation study,” insights into congenital lipoid adrenal hyperplasia,” Proceedings Breast Cancer Research,vol.12,no.6,articleR103,2010. of the National Academy of Sciences of the United States of [112] H. Peltoketo, Y. Piao, A. Mannermaa et al., “A point mutation America, vol. 94, no. 21, pp. 11540–11545, 1997. in the putative TATA box, detected in nondiseased individuals [126]M.Horan,D.S.Millar,J.Hedderichetal.,“Humangrowthhor- and patients with hereditary breast cancer, decreases promoter mone 1 (GH1) gene expression: complex haplotype-dependent activity of the 17𝛽-hydroxysteroid dehydrogenase type 1 gene 2 influence of polymorphic variation in the proximal promoter (EDH17B2) in vitro,” Genomics,vol.23,no.1,pp.250–252,1994. and locus control region,” Human Mutation,vol.21,no.4,pp. [113] A. B. W.Boldt, L. Culpi, L. T. Tsuneto, I. R. de Souza, J. F. J. Kun, 408–423, 2003. and M. L. Petzl-Erler, “Diversity of the MBL2 gene in various [127]X.Liu,M.R.Campbell,G.S.Pittman,E.C.Faulkner,M. Brazilianpopulationsandthecaseofselectionatthemannose- A. Watson, and D. A. Bell, “Expression-based discovery of binding lectin locus,” Human Immunology,vol.67,no.9,pp. variation in the human glutathione S-transferase M3 promoter 722–734, 2006. and functional analysis in a glioma cell line using allele-specific [114] A. Cervera, A. M. Planas, C. 
Justicia et al., “Genetically- chromatin immunoprecipitation,” Cancer Research,vol.65,no. defined deficiency of mannose-binding lectin is associated with 1, pp. 99–104, 2005. protection after experimental stroke in mice and outcome in [128]X.Tan,Y.Wang,Y.Hanetal.,“GeneticvariationintheGSTM3 human stroke,” PLoS ONE,vol.5,no.2,ArticleIDe8433,2010. promoter confer risk and prognosis of renal cell carcinoma by [115] I. Sziller, O. Babula, P. Hupuczi et al., “Mannose-binding lectin reducing gene expression,” British Journal of Cancer,vol.109,no. (MBL) codon 54 gene polymorphism protects against develop- 12,pp.3105–3115,2013. ment of pre-eclampsia, HELLP syndrome and pre-eclampsia- [129] T. O. Lankisch, A. Vogel, S. Eilermann et al., “Identification and associated intrauterine growth restriction,” Molecular Human characterization of a functional TATA box polymorphism of the Reproduction,vol.13,no.4,pp.281–285,2007. UDP glucuronosyltransferase 1A7 gene,” Molecular Pharmacol- [116] A. Abbas, M. Lechevrel, and F.Sichel, “Identification of new sin- ogy,vol.67,no.5,pp.1732–1739,2005. gle nucleotid polymorphisms (SNP) in alcohol dehydrogenase [130] R. C. Wirka, S. Gore, D. R. Van Wagoner et al., “A common classIVADH7genewithinaFrenchpopulation,”Archives of connexin-40 gene promoter variant affects connexin-40 expres- Toxicology, vol. 80, no. 4, pp. 201–205, 2006. sion in human atria and is associated with atrial fibrillation,” [117] A. Matsunaga, J. Sasaki, H. Han et al., “Compound het- Circulation: Arrhythmia and Electrophysiology,vol.4,no.1,pp. erozygosity for an apolipoprotein A1 gene promoter mutation 87–93, 2011. and a structural nonsense mutation with apolipoprotein A1 [131] M. Firouzi, H. Ramanna, B. Kok et al., “Association of human deficiency,” Arteriosclerosis, Thrombosis, and Vascular Biology, connexin40 gene polymorphisms with atrial vulnerability as a vol. 19, no. 2, pp. 348–355, 1999. risk factor for idiopathic atrial fibrillation,” Circulation Research, [118] A. 
Kavlie, L. Hiltunen, V. Rasi, and H. P. B. Prydz, “Two vol. 95, no. 4, pp. e29–e33, 2004. novel mutations in the human coagulation factor VII promoter,” [132] L. Le Flem, V. Picard, J. Emmerich et al., “Mutations in pro- Thrombosis and Haemostasis,vol.90,no.2,pp.194–205,2003. moter region of thrombomodulin and venous thromboembolic [119] L. N. Troelsen, P. Garred, B. Christiansen et al., “Double role disease,” Arteriosclerosis, Thrombosis, and Vascular Biology,vol. of mannose-binding lectin in relation to carotid intima-media 19, no. 4, pp. 1098–1104, 1999. thickness in patients with rheumatoid arthritis,” Molecular [133] P. J. Bosma, J. R. Chowdhury, C. Barker et al., “The Immunology,vol.47,no.4,pp.713–718,2010. genetic basis of the reduced expression of bilirubin UDP- [120] J. Q. Tang, Q. Fan, Y. L. Wan et al., “Ectopic expression glucuronosyltransferase 1 in Gilbert’s syndrome,” The New and clinical significance of tissue factor/coagulation factor VII England Journal of Medicine, vol. 333, no. 18, pp. 1171–1175, 1995. BioMed Research International 17

[134] V. Matys, O. V. Kel-Margoulis, E. Fricke et al., “TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes,” Nucleic Acids Research,vol.34,pp.D108–D110, 2006. [135] S. S. Yoo, C. Jin, D. K. Jung et al., “Putative functional variants of XRCC1 identified by RegulomeDB were not associated with lung cancer risk in a Korean population,” Cancer Genetics,vol. 208, no. 1-2, pp. 19–24, 2015.