<<

BioMed Research International

Integrated Analysis of Multiscale Large-Scale Biological Data for Investigating Human Disease 2016

Guest Editors: Tao Huang, Lei Chen, Jiangning Song, Mingyue Zheng, Jialiang Yang, and Zhenguo Zhang Integrated Analysis of Multiscale Large-Scale Biological Data for Investigating Human Disease 2016 BioMed Research International Integrated Analysis of Multiscale Large-Scale Biological Data for Investigating Human Disease 2016

GuestEditors:TaoHuang,LeiChen,JiangningSong, Mingyue Zheng, Jialiang Yang, and Zhenguo Zhang Copyright © 2016 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in “BioMed Research International.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Contents

Integrated Analysis of Multiscale Large-Scale Biological Data for Investigating Human Disease 2016 Tao Huang, Lei Chen, Jiangning Song, Mingyue Zheng, Jialiang Yang, and Zhenguo Zhang Volume 2016, Article ID 6585069, 2 pages

New Trends of Digital Data Storage in DNA Pavani Yashodha De Silva and Gamage Upeksha Ganegoda Volume 2016, Article ID 8072463, 14 pages

Analyzing the miRNA- Networks to Mine the Important miRNAs under Skin of Human and Mouse Jianghong Wu, Husile Gong, Yongsheng Bai, and Wenguang Zhang Volume 2016, Article ID 5469371, 9 pages

Differential Regulatory Analysis Based on Coexpression Network in Research Junyi Li, Yi-Xue Li, and Yuan-Yuan Li Volume 2016, Article ID 4241293, 8 pages

Hybrid Binary Imperialist Competition Algorithm and Tabu Search Approach for Feature Selection Using Data Shuaiqun Wang, Aorigele, Wei Kong, Weiming Zeng, and Xiaomin Hong Volume 2016, Article ID 9721713, 12 pages

Predicting Diagnostic Gene Biomarkers for Non-Small-Cell Lung Cancer Bin Liang, Yang Shao, Fei Long, and Shu-Juan Jiang Volume 2016, Article ID 3952494, 8 pages

A Five-Gene Expression Signature Predicts Clinical Outcome of Ovarian Serous Cystadenocarcinoma Li-Wei Liu, Qiuhao Zhang, Wenna Guo, Kun Qian, and Qiang Wang Volume 2016, Article ID 6945304, 6 pages

DASAF: An R Package for Deep -Based Detection of Fetal Autosomal Abnormalities from Maternal Cell-Free DNA Baohong Liu, Xiaoyan Tang, Feng Qiu, Chunmei Tao, Junhui Gao, Mengmeng Ma, Tingyan Zhong, JianPing Cai, Yixue Li, and Guohui Ding Volume 2016, Article ID 2714341, 7 pages

The Use of -Protein Interactions for the Analysis of the Associations between PM2.5 and Some Diseases Qing Zhang, Pei-Wei Zhang, and Yu-Dong Cai Volume 2016, Article ID 4895476, 7 pages

The Occurrence of Genetic Alterations during the Progression of Breast Carcinoma Xiao-Chen Li, Chenglin Liu, Tao Huang, and Yang Zhong Volume 2016, Article ID 5237827, 5 pages

Using Small RNA Deep Sequencing Data to Detect Human Viruses Fang Wang, Yu Sun, Jishou Ruan, Rui Chen, Xin Chen, Chengjie Chen, Jan F. Kreuze, ZhangJun Fei, Xiao Zhu, and Shan Gao Volume 2016, Article ID 2596782, 9 pages Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification Yin Wang, Rudong Li, Yuhua Zhou, Zongxin Ling, Xiaokui Guo, Lu Xie, and Lei Liu Volume 2016, Article ID 6598307, 11 pages

Analysis and Identification of Aptamer-Compound Interactions with a Maximum Relevance Minimum Redundancy and Nearest Neighbor Algorithm ShaoPeng Wang, Yu-Hang Zhang, Jing Lu, Weiren Cui, Jerry Hu, and Yu-Dong Cai Volume 2016, Article ID 8351204, 9 pages

The Subcellular Localization and Functional Analysis of Fibrillarin2, a Nucleolar Protein in Nicotiana benthamiana LupingZheng,JinaiYao,FangluanGao,LinChen,ChaoZhang, Lingli Lian, Liyan Xie, Zujian Wu, and Lianhui Xie Volume 2016, Article ID 2831287, 9 pages

Mining for Candidate Related to Pancreatic Cancer Using Protein-Protein Interactions and a Shortest Path Approach Fei Yuan, Yu-Hang Zhang, Sibao Wan, ShaoPeng Wang, and Xiang-Yin Kong Volume 2015, Article ID 623121, 12 pages Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 6585069, 2 pages http://dx.doi.org/10.1155/2016/6585069

Editorial Integrated Analysis of Multiscale Large-Scale Biological Data for Investigating Human Disease 2016

Tao Huang,1 Lei Chen,2 Jiangning Song,3 Mingyue Zheng,4 Jialiang Yang,5 and Zhenguo Zhang6

1 Chinese Academy of Sciences, Shanghai, China 2Shanghai Maritime University, Pudong, Shanghai, China 3MonashUniversity,Clayton,VIC3800,Australia 4Shanghai Institute of Materia Medica, Zuchongzhi Rd, Pudong, Shanghai, China 5IcahnSchoolofMedicineatMountSinai,NewYork,NY10029,USA 6Department of , University of Rochester, Rochester, NY 14627, USA

Correspondence should be addressed to Tao Huang; [email protected]

Received 29 August 2016; Accepted 29 August 2016

Copyright © 2016 Tao Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the development of high-throughput omics technolo- (TS) which conducts fine-tune search. The performance of gies, more and more omics data are generated. It has become their method was superior to other similar works. common to have multiomics data for the same samples which J. Li et al. reviewed the paradigm of differential regulatory make the integrative analysis possible. But the data integra- analysis (DRA) based on gene coexpression network (GCN). tionisstillchallengingsincethereareonlyalimitednumber They found that DRA can reveal underlying molecular ofmethodstodosuchanalysis.Tostimulatethemethodology mechanism in large-scale carcinogenesis studies. development and applications of multiomics analysis, we B. Liang et al. constructed a non-small cell lung can- collected 14 novel studies of large scale multiomics data for cer- (NSCLC-) specific functional association network and biomedical researches. applied a network partition algorithm to divide the network P. Y. De Silva and G. U. Ganegoda critically analyzed into gene modules. From these modules, they identified various methods used for encoding and encrypting data onto NSCLC biomarkers. DNA and identified the advantages and capability of every B. Liu et al. developed an R package, detection of auto- scheme to overcome the drawbacks of previous methods. somal abnormalities for fetus (DASAF), which implements J. Wu et al. integrated the MGI, GEO, and miRNA data- the three most popular trisomy detection methods—the base to analyze the genetic regulatory networks under mor- standard 𝑧-score method (STDZ); the GC correction 𝑧-score phology difference of integument of humans and mice. And (GCCZ) method; and the internal reference 𝑧-score (IRZ) they found that the gene expression network in the skin was method—together with one subchromosome abnormality highly divergent between human and mouse. identification method (SCAZ). L.-W. Liu L. et al. analyzed 303 samples of ovarian serous Q. Zhang et al. investigated the associations between cystadenocarcinoma and the corresponding RNA-seq data. PM2.5 and 22 disease classes, such as respiratory diseases, They established a risk assessment model of five genes and the cardiovascular diseases, and gastrointestinal diseases. They AUROC value was 0.67 when predicting the survival time in found that several diseases, such as diseases related to ear, testing set. nose, and throat and gastrointestinal, nutritional, renal, and S. Wang et al. proposed a new hybrid algorithm called cardiovascular diseases, are influenced by PM2.5. HICATS that incorporated imperialist competition algo- F.Wangetal.used931sRNA-seqdatasetsfromtheNCBI rithm (ICA) which performs global search and tabu search SRA database to detect and identify viruses in human cells 2 BioMed Research International or tissues. Six viruses including HPV-18, HBV, HCV, HIV-1, SMRV, and EBV were detected from 36 datasets and SMRV was found in Diffuse Large B Cell Lymphoma cells for the first time. S. Wang et al. attempted to extract important features for aptamer-compound interactions using feature selection methods, such as maximum relevance minimum redun- dancy, and incremental feature selection. They found that quantum-chemical and electrostatic descriptors were impor- tant for aptamer-compound interaction prediction. X.-C. Li et al. constructed oncogenetic tree to imitate the occurrence of genetic and cytogenetic alterations in human breast cancer. They found that ErbB2 copy number variation is the frequent early event of human breast cancer. Y. Wang et al. proposed a -Based Motif Finding Algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenic rules and other statistical indexes for classification, it can effectively reduce the dimension of the large feature spaces generated by the text datasets. L. Zheng et al. analyzed the subcellular localization and biological functions of Fibrillarin2, a nucleolar protein in Nicotiana benthamiana.Theyfoundthattheproteinwas localized in the nucleolus and cajal body of leaf epidermal cells of N. benthamiana and involved growth retardation, organ deformation, chlorosis, and necrosis. F. Yuan et al. tried to predict candidate genes related to pancreatic cancer using protein-protein interactions and a shortest path approach. The genes on the shortest path among known pancreatic cancer genes were considered as candidates that were further filtered by permutation test. Sev- eral predicted genes are promising and worth experimental validation. With this special issue, we hope more and more people will become familiar with multiomics big data analysis and be interested in applying the integrative analysis approaches to reveal the underling mechanisms of complex biomedical phenotypes. Tao Huang Lei Chen Jiangning Song Mingyue Zheng Jialiang Yang Zhenguo Zhang Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 8072463, 14 pages http://dx.doi.org/10.1155/2016/8072463

Review Article New Trends of Digital Data Storage in DNA

Pavani Yashodha De Silva and Gamage Upeksha Ganegoda

Faculty of Information Technology, University of Moratuwa, Katubedda, Moratuwa, Sri Lanka

Correspondence should be addressed to Gamage Upeksha Ganegoda; [email protected]

Received 15 April 2016; Revised 21 July 2016; Accepted 2 August 2016

Academic Editor: Lei Chen

Copyright © 2016 P. Y. De Silva and G. U. Ganegoda. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the exponential growth in the capacity of information generated and the emerging need for data to be stored for prolonged period of time, there emerges a need for a storage medium with high capacity, high storage density, and possibility to withstand extreme environmental conditions. DNA emerges as the prospective medium for data storage with its striking features. Diverse encoding models for reading and writing data onto DNA, codes for encrypting data which addresses issues of error generation, and approaches for developing codons and storage styles have been developed over the recent past. DNA has been identified as a potential medium for secret writing, which achieves the way towards DNA cryptography and stenography. DNA utilized as an organic memory device along with big data storage and analytics in DNA has paved the way towards DNA computing for solving computational problems. This paper critically analyzes the various methods used for encoding and encrypting data onto DNA while identifying the advantages and capability of every scheme to overcome the drawbacks identified priorly. Cryptography and stenography techniques have been analyzed in a critical approach while identifying the limitations of each method. This paper also identifies the advantages and limitations of DNA as a memory device and memory applications.

1. Introduction last only for a limited time period. Memory cards and chips are sustainable for 5 years from their preliminary use [3]. The excursion of data storage initiated from bones, rocks, and Standard hard drives are prone to damage from high temper- paper. Then this journey deviated to punched cards, magnetic atures,moisture,andexposuretomagneticfieldsandthrough tapes, gramophone records, floppies, and so forth. Afterwards mechanical failures. Though solid state drives operate better with the development of the technology optical discs includ- than hard drives, if not power-driven for more than few ing CDs, DVDs, Blu-ray discs, and flash drives came into months they tend to lose their information [3]. Therefore operation. All of these are subjected to decay. Being nonbio- researchers’ devotion has been driven towards development degradable materials these pollute the environment and also of a storage mechanism which overcomes the aforementioned release high amounts of heat energy while using energy for drawbacks successfully. operation [1]. Taking into account the manner fossil bones preserve With the employment of digital systems for the purpose genetic material for ages, researchers paid their attention of generation, transmission, and storage of information, there towards using deoxyribonucleic acid (DNA) as a storage rises a need for active and ongoing maintenance of digital medium. media. With the massive amounts of digital data that has to DNA has an unbelievable storage capacity. Castillo states be stored for future use, a problem arises in the storage of that all the information in the entire Internet could be irresistible amounts of data. The demand for data storage is located in a device which is lesser than unit cubic inch [3]. rapidly increasing day by day. The total information storage DNA is witnessed as the optimal medium in this regard of the entire world was around 2.7 ZB in 2012. Every year the fundamentally because instead of using 1 s and 0 s by the storage necessity is increasing by 50% [2]. Currently almost computer to store data, DNA consisting of adenine, guanine, all of the digital data is stored with a technology that will cytosine, and thymine (A, G, C, and T) already paired into 2 BioMed Research International base pairs A-T and G-C can be utilized for storing computer to massively use its vast parallelism effectively are information in a form of binary code [1]. As the urgent need discussed here along with the drawbacks associated with each for high capacity storage medium rises, DNA is considered approach. ideal in this regard as single nucleotide can represent 2 bits of information. Accordingly 455 EB of data can be encoded 2. Materials and Methods in 1 gram of single stranded DNA (ssDNA) [3]. Entire information that is produced by the world over a year can 2.1. DNA as a Storage Device. Two lengthy strands of be stored in just 4 grams of DNA [1]. High memory space is together compose a DNA . Each offered by DNA as it is 3-dimensional (3D) by structure. DNA nucleotide contains one of four bases (A: adenine, G: guanine, offers readable and reliable information for millennia, which C: cytosine, and T: thymine), along with a deoxyribose can be extended to almost infinity by drying and protecting sugar and a phosphate group. DNA molecule comprises a from oxygen and water [3]. double stranded structure, comprising two single stranded DNA can withstand a broader range of temperatures sets of nucleotides bonded by A=T double hydrogen bond or ∘ ∘ (−800 C–800 C). It utilizes power usage million times more C≡G triple hydrogen bond. These two single strands which effectively than a modern personal computer. Additionally are bonded together by hydrogen bonds are referred to as it privileges more storage options as it stores data in a complimentary strands. A single stranded DNA is positioned 󸀠 󸀠 nonlinear structure unlike most of the media storing data in in between two ends 5 (5 prime) and 3 (3 prime) [4]. a linear structure. DNA promises more options to improve Information is packed into chunks of nucleotides called latency and extraction of data, as it allows reading data in “oligonucleotides” as accumulating of lengthy DNA strands bidirections.TheimportantfactthatDNAisinvisibleto is problematic [3]. Process of synthesizing using a DNA human eye ensures that DNA is secure and is impossible to synthesizer obtains single stranded DNA chains, artificially be harmed by living [1]. composed of about 50–100 oligonucleotides. Hybridization This research paper critically analyzes the techniques is the process of individual single stranded DNA (ssDNA) used in writing and reading data into DNA. Many models of forming double stranded DNA (dsDNA), coupling with com- encoding were used to encode data into DNA. First section plementary ssDNA under certain conditions. Ribonucleic of this review paper provides a background for digital data acid (RNA) is a ssDNA comprising ribonucleotides in which storageinDNA,analyzingtheperfectionofDNAtobeused thymine (T) is replaced with Uracil (U) [4]. as a storage medium. The second section of this review paper Information is usually read in base pairs as DNA is double analyzes the results and methods of the models used for stranded. Currently with the aid of new DNA synthesizing encoding data into DNA while critically analyzing the custom techniques the way in which base nucleotide pairs are in which each model overcomes the drawbacks identified generatedisbeingreformedbecauseitisachallengetoread in the prior models. In view of the organization of the first G-C pair consecutively. Traditional base pairs were A-T and subsection of the second section, function of DNA as a stor- G-C. Innovative base pairs, A-C and G-T, are being used by age device is highlighted in this. and development newer technologies. A-C is employed to code 0 while G-T of encoding models in the recent past have been summa- is used for 1 [4]. Figure 1 diagrammatically represents the rized firstly in this subsection. Secondly encryption schemes general procedure of storing data on DNA. usedforencryptionofdatainDNAhavebeendiscussed in detail, diagnosing their significant features, advantages, 2.1.1. Encoding Data on DNA and drawbacks associated. Thirdly, several approaches used for designing of codons have been discussed. Basic data Microvenus Project. This project was initiated by Joe Davis storagestyleshavebeenanalyzedinacriticalmannerby to store an image in DNA, which had the core intention identifying the advantages over the other finally in the first of storing abiotic data in DNA. Encoding was based on the subsection.ThesecondsubsectionofSection2discussesthe molecularsizeofbases,C→1,T→2,A→3,andG→4. applicability of DNA storage for secret writing in the fields Each and every nucleotide was assigned a phase structure of cryptography and stenography. Considering the structure C → X, T → XX, A → XXX, G → XXXX. of the second subsection, firstly problems associated with Encoding was achieved by placing a nucleotide at each DNA cryptography and stenography have been discussed in repeated position of 1 and 0 bits, for instance, 100101=CTCCT detail while this paper critically analyzes DNA secret writing and 10101=CCCCC. In the process of decoding, “C” could be algorithms by identifying the advantages offered from each decoded as “1” or a “0,” because only the number of repeated technique and the drawbacks associated with each technique bits was taken into consideration at the time of encoding. while suggesting the possible approaches to be implemented For instance, CTCCT could be decoded as 011010 or 100101. to overcome the associated drawbacks secondly. Approach of Therefore this scheme of encoding was inaccurate as it was DNA as an organic memory device and possible applications not distinctively decodable [5, 6]. of DNA as a memory device have been discussed in the third subsection of Section 2. The fourth subsection of Section 2 of Genesis Project. This was introduced by Eduardo Kac. He this article describes in detail manner in which big data stor- createdan“artist’sgene,”asyntheticgenebyconvertinga age and big data analytics have paved the way towards DNA sentence from the bibliographical book “Genesis” into Morse computing. Approaches undertaken to solve computational codeandthenconvertingtheMorsecodeintoDNAbase problemsemployingaDNAcomputerandpossibilityofDNA pairs [7]. Hyphen and full stop were represented by bases T BioMed Research International 3

DNA synthesis Data encoding DNA strings (writing on DNA) ATGATCGATCGACT

Error Data DNA sequencing (writing on DNA) Error Error correction

Types of errors ATGTCGATCGACT Deletion

ATGCTCGATCGACT Insertion

ATCATCGATCGTCT Substitution

Data lost Figure 1: Schematic representation of storing data on DNA. and C while replacing word space and letter space with A and (PPK), forward and reverse primer common 5-6 bases, to G, respectively [5, 7]. Synthetic genes were fused into indicate the stored information. PPK functions as the data and in the presence of an ultraviolet light, mutations were location identifier. In data encoding ternary code is being caused in bacteria. mapped to 3 bases (A, C, T) while sequencing primers consist Thesemutationshaveinturncausedchangesinthe of 4 bases with the 4th position as G in order to avoid sentence. When the genes were decoded back into the Morse mispriming [9]. In decoding process PPK is decoded first codeandthenbackintoEnglish,theoriginalsentencehas with the intention of decoding forward and reverse primers been changed [7]. Genesis project was inaccurate as the and then based on a unique sequencing primer can retrieve original sentence was altered during mutation at the presence the information. of ultraviolet light. Main advantage of PCR based encoding models is high These 2 methods laid the foundation for encoding in securityastherecipientshouldbeawareoftheencryption DNA but were inaccurate as it was not uniquely decodable key and primer. Hitches associated with PCR methods are the andtheoriginalcontentwasalteredduetomutationsinthe need of PCR, need of the knowledge of primers, widespread Microvenus and Genesis projects, respectively. experimental obstacles, and practical problems. Insertion of errors in template region makes recovering the encoded PolymeraseChainReactionBasedEncodingSchemes.In this data unmanageable. Moreover data breakage could occur in method the data sequence is converted into a DNA sequence encoding and decoding procedures due to errors of humans using a rule, which is an encryption key or a “codon.” The [10]. encoded DNA sequence is placed in between two unique As a result researchers paved their attention towards template DNA regions corresponding to the forward and developing PCR independent data encoding models. reverse primers. dsDNA is prepared according to the design of the DNA encoded message and is inserted into genomic Alignment Based Encoding Models. ThismodelwasPCR DNA. In information readout process, amplification is done independent. Alignment based approach introduced a data by Polymerase Chain Reaction (PCR) and is decoded by DNA storage and retrieval method based on of sequences [8]. DNA.Specialtyofthisapproachisthatthisperformsretrieval Limbachiya and Gupta used microdots to store data. without parity checks, DNA template, or error correcting This approach was secure because of its size and even if algorithms [11]. an adversary identifies the microdot, it would be extremely Yachie et al. proposed a method to copy and paste data difficult to read the data without the knowledge of the primer withinasequenceofanorganismtoachieveflexibilityof sequence. But the limitation was scalability of data encoded storageandvigorofdatainheritance,mostlyappropriate in the limited size of microdots, as only 136 bits of data could in using DNA as trademarks/signatures of living modified be encoded [5]. organisms (LMOs) and as valuable transmissible media [12]. To overcome this limitation Kac proposed information Through this approach it is possible to retrieve data from DNA (iDNA) [7] which consists of single Polyprimer Key a living DNA without additional material such as template 4 BioMed Research International

DNA,anditneedsonlythesequencingofthecomplete (iii) Addressing method is utilized only to read the posi- . tion of a read but does not perform selective reads. Data retrieval by PCR based amplification is prone to breakage of either side of DNA annealing sites, which are This method can be effectively utilized for accessing random crucial for reading even parts of encoded data [13]. data sections and also for storing frequently updated data Advantages of this approach are greater speed and lower which needs to memorize the editing history [13]. cost of reading DNA data and lower cost of synthetic DNA. Blaum et al. used DNA sequences consisting of special This approach ensures higher durability and data inheritance strings of addresses to access random information. These as multiple copies of data are available and each copy is DNA sequences are also encoded with error correction mech- capable of detecting and correcting errors of the other copy anisms. Mutually noninterrelated addresses are designed [12]. while they satisfy error-control running digital sum con- Data could be lost during evolution. In order to prevent straint [15]. Encoding is done through threading together this Yachie et al. introduced different nucleotide sequences prefixes of addresses which are accurately concluded. As encoding the same data by multiple data compression paths addresses are noninterrelated and chosen with a considerable [12]. hamming distance there is a very less probability for the Disadvantage of this approach is that multiplication of addresses to be tangled with each other. Prefix encoding cassettes leads to redundant volumes. Parity effects cost a format [16] was chosen to select blocks to be rewritten certainvolumeofdatasequence;atthesametimedata while gBlock [17] and Overlap Extension PCR (OE-PCR) recovery rate is fragile and is proportional to data breakage [18]methodswereusedtoaccomplishrewriting.Sanger which occurs through DNA deletion of long ranges. Positions sequencing [19] is used for decoding. of the data breakages could be identified easily by the gBlock [17] method is more efficient but needs long and alignment results although they were not recoverable [12]. expensive primers for sequencing. OE-PCR method [18] uses Main downside of this approach was the size limit of the cheap and short primers but is done in several steps. cassette oligonucleotides being used to encode the message. Highcostisoneofthedrawbacksofthisapproach. If it increases a certain limit there is a possibility of it to Instead of using employing next genera- appear by chance in host genome. And also sequencing of the tion sequencing methods [19] will drastically reduce the cost entire genome is required to retrieve data. Thereafter Alien- of readouts. berg proposed improved Huffman coding method in which nucleotides were used efficiently and used specific primers Next Generation Digital Information Storage. Microvenus for different types of files. Improved Huffman coding defines project [5, 6], Genesis project [5, 7], PCR based encoding DNA codes for the entire keyboard, for clear-cut information models [11], and alignment based encoding models [12] only coding. This is based on a construction of a plasmid library stored small amounts of data. Hence this model proposed an with specially designed primers embedded along with the efficient one-bit-per-base algorithm. In this approach Church message for fast retrieval. A good encoding scheme should et al. improved a scheme to encode arbitrary information have economical use of nucleotide per character which is employing next generation DNA synthesis and sequencing about 3.5 here [12, 14]. technologies. A draft of html coded book with 53,426 words, 11 JPG images, and 1 JavaScript program is being encoded. Rewritable and Random Access Based DNA Storage System. This encompasses a 5.27 MB stream [20]. Main features of this architecture are the random access to Primary advantages of this approach over the prior data blocks of DNA which promotes nonlinear access and approaches used for encoding include the employment of rewriting capability of information into random locations. one-bit representation per base (G or T for 1 and A or C This approach successfully addresses the drawback of the for zero). This brings in the capability of encoding sequences existing approaches, the need to read the entire file of data to whicharechallengingtobereadorwrittenduetocontaining read only one fragment of data. And also the existing methods repeats, secondary structures, and extreme GC content. As are read-only methods whereas this approach is rewritable. wearesplittingthestreamofbitsintodatablocksthereby Undesirable cross hybridization problems are eliminated in Church et al. are avoiding constructing of long DNA which this method by prohibiting redundancy of information. ischallengingtoassembleinreadingtheinformation.Syn- Tabatabaei Yazdi et al. encoded Wikipedia pages of six thesizing, storing, and sequencing multiple copies of olingo universities, carefully chose parts of the stored data, and are done with the intention of evading sequence verifying edited text written into DNA related to three universities. constructs and cloning. Each copy has the capability to Shifting from current read-only methods to rewritable meth- correct the errors in other copy as the errors are almost ods requires to address the below mentioned drawbacks [13]. never coextensive. In this approach the cost incurred is ∼100,000 less compared with the first generation technologies (i) There is need to rewrite the entire content in order to in encoding and decoding information [20]. edit in a compressive domain. Challenges associated with this scheme include the cost (ii) Fourfold coverage is used to ensure reliability of which is unfeasible and the time for reading and writing onto information which makes the rewriting process much DNA. However the cost associated with synthesizing and complicated because in order to rewrite one base sequencing of DNA has been dropping at 5–12 exponential modification of four locations is needed. rates per year which is relatively much speedy than electronic BioMed Research International 5 media [20]. These can be leveraged in the future as DNA messages which required to be deposited for extensive period sequencers which are hand-held are recently available which of time [22]. fastens the reading process. This research has not implemented the biological proto- Church et al. have not paid their attention towards colstoinsertthesequenceingenomeofbacteria. compression, parity checks, redundant encodings, and error Future work to be addressed includes modification of rate. Therefore attention has to be focused towards error cor- transformation algorithm and designing other mapping rection for the purpose of improving density and safety [20]. function for encoding nucleotide sequence. HencetoovercomethisissueGoldmanincludedimproved base 3 Huffman coding instead of the one-bit-per-base repre- 2.1.2. Codes for Encrypting Data in DNA. Basically 3 codes sentation [21]. This approach encoded a total of files of 730 KB have been used over the past to store information on DNA. including five types of computer files to emphasize the ability All these codes generally considered that an alphabetic to store arbitrary digital information. They included 154 of language is being encoded in DNA. Although most of the Shakespeare’s sonnets (ASCII text), a classic scientific paper researches considered English as the alphabetic language, it (PDF format), a medium-resolution colour photograph couldhavebeenusedforevenshorthand,whichisthewriting (JPEG 2000 format), a 26 s excerpt from Martin Luther scheme for phonetics. King’s 1963 “I Have A Dream” speech (MP3 format) and a Foracodetobeoptimumitshouldsatisfythedualcriteria Huffman code to convert bytes to base 3 digits (ASCII text). as follows: Binary code was transformed to ternary code and then was represented in triplet DNA code. Fourfold redundancies were (i) It should use DNA (nucleotides) economically, mainly used to avoid data loss. To ensure security each redundant because synthesizing of extended oligonucleotides is chunk was complemented in alternate chunk. This approach an expensive process though replicating appears to be ensured high scalability and reliability. This paved the way comparatively economical. towardshighcapacityandlowmaintenancedatastorage (ii) It should be able to reconstruct the message after mechanisms as Goldman et al. were able to achieve a storage encoding of data. capacity of 2.2 PB/gram [20, 21]. Although it is not considered to be essential, if the coding Encoding Scheme for Small Text Files. Majority of the encoding scheme offers some error detection and protection mecha- schemes are performing superior for large text files. This nism it would be of tremendous advantage. But this feature approach is for small text files. This approach primes to high is not considered vitally important, because there are other volume data storage density via dipping number of nucleot- mechanisms for addressing this issue such as using multiple ides for representation while depending on sequence trans- copies of DNA. As the written language inherently consists mission algorithms [22]. of self-correcting mechanisms it makes this feature of error At present many loss less compression algorithms are detection and correction not essentially important [24]. in use but still they require ample context information for encoding purposes. Huffman Coding. This code uses the principle of varying Burrow-Wheeler transformation (BWT) [23] has a decent the length of symbols used for representing a character. compression ratio with adequate use of context information. Most recurrently appearing character in the text is assigned But here in order to achieve this through small text, BWT was the lowest number of symbols while the least recurrently trailed by Move-to-Front transformation (MTF) [23]. appearingcharacterisassignedthemostnumberofsymbols. The following three steps are performed to generate Employing this principle leads to developing of a very eco- context information through this encoding scheme. nomicalcode[25].Averagecodelengthisaround2.2char- acters in Huffman coding scheme. This is the least average (i) Text file compression: Huffman coding method was codon length achieved [24]. employed for this. Output is a binary sequence set. Unambiguity of the code is achieved through comprising (ii) Mapping function: Two nucleotide base pairs were of only one way in which the encrypted message can be read effectively employed to represent four binary bits. once the starting point is mentioned [24]. Foundation for choosing 4 bits for 3 nucleotides is Drawbacks associated with Huffman coding include not output of Huffman coding being a hexadecimal value. outfitting for numbers and symbols. This is mainly because (iii) Encryption: For the purpose of maintaining security, the frequency of showing these symbols is highly dependent encoded message ought to be encrypted. One Time on the text which reviews the fact that they are unable to be Pad (OTP) which require a random key of same included in formulating the Huffman code. Secondly it is not length as message encoded is employed for this. suitable for long term storage due to the fact that when dif- ferent length codons assembled together it might not reveal This approach reviewed that maximum efficiency of com- a pattern. Therefore the future generations might not be able pression is possible to be achieved through performing trans- to detect the significance of the pattern [24]. formation prior to compression thereby reducing number of nucleotides. This directly affects reducing cost factor. The Comma Code. In this approach a single base (G) is con- Thisschemeismoresignificantinmilitaryapplications sidered as the comma. Codons of 5-base length are separated and signatures of living modified as they are small from each other using base G. % base codons consist of other 6 BioMed Research International three bases, namely, A, C, and T, further more limited to English alphabet. DNA was inserted into living organisms single A:T and two G:C base pairs. The C of the and they are subject to losing information due to breakage second G:C pair is always positioned in the upper strand [24]. by mutation, insertion, and deletion. Hence, this approach is Consisting of isothermal melting temperature is the a solution to this problem as it is able to recover data of dam- advantage of the composition of the message DNA utilizing aged DNA. Therefore this method overcomes the drawback of this scheme. Dominant feature of comma code is the reading thecomma-freecode.Thisisalsoabletoidentifyanyframe frame of six codons including G, the comma, which is not shift due to mutation or errors in sequencing. This method achieved by other codes. This helps to identify a clear reading uses unique primer design using plasmid DNA libraries [14]. frame without the necessity to mention a starting point. Protection mechanism from insertion and deletion mutation The Perfect Genetic Code. In this approach Doig points out is also guaranteed by this approach which makes the other the fact that through varying the codon length efficiency of codes much more complex [24]. the code can be increased significantly [27]. Here Doig uses a variable codon length through employing more frequent Drawback of this code is that it is not economical as it amino acids with shorter codon length while rare acids are repeats the comma-base G to create an automatic reading represented using a longer codon length. The only approach frame [10]. which is more efficient than this is using overlapping genes The Alternating Code. Thisschemeconsistsof6basecodons [27]. which are 64 in number including pyrimidines and purines. According to Doig the necessity for a fixed codon length Construction of the message DNA in an entirely synthetic comprising 3 bases requires 42% more DNA than the min- nature is the primary feature of this approach. As this creates imum requirement [27]. Perfect genetic code is 70% more fully artificial DNA it is suitable for long term storage which efficient than the other codes due to the use of a variable code overcomes the drawback of Huffman code. In addition, it length. Doig employs the Shannon-Fano coding scheme to offers benefits such as being isothermal and error detecting derive the binary code [27]. but it is not superior to comma code [24, 26]. As there is very little redundancy, any mutation would Alternating code also comprises repetitive features which cause a change in amino acids. Additionally, if the mutation makes it noneconomical [10]. It is the main drawback is chief to fluctuation of codon length many of the mutations associated with this coding scheme. Therefore attention of the will be extensive frameshift mutations. According to Doig, it researchers has been led towards developing an economical is impossible for the shift towards use of perfect genetic code, code without repetitive features [10]. considering a simple example that Val has been coded for four codons comprising 3 bases. As the third does not convey Comma-Free Code. It is also known as prefix free code. any information it would be effective to code Val employing This comprises fixed length base frames without commas two bases only. The difficulty of machinery to shift from to separate the frames. Therefore, it uses an automatic fixed codon length to variable codon length plus the priorly frame detection mechanism. Comma-free code [24] does not mentioned drawbacks leads to not using this effective code consistofidenticalfourbasepairswhichistheonlywayof though it maximizes the efficiency through using variable hindering from natural DNA sequences. These codons are length codons [27]. possible to be read simply in one way and support error detection mechanisms as well. 2.1.3. Basic Approaches for Designing DNA Codons. There is Although comma-free code is robust and the error no standard structure in which code words are generated due correction works to correct against small-scale loss such as to the fact that importance of constraints to be addressed DNA point mutations, it does not have the ability to recover differs according to the encoding models [11]. broken data when a large DNA segment is deleted from the data encoded DNA region [10]. Template Map Strategy. Codon’s group proposed dividing the constraints on code words into two binary codes, namely, Improved Huffman Coding Scheme. Each approach used for template,specifying GC content and maps indicating mis- information storage in DNA differs in the economical use of match between word pairs. The product of the aforemen- nucleotides. Here Ailenberg and Rotstein use the principles of tioned codes results in a quaternary code. Later on Arita using Huffman coding to define DNA codes for the entire keyboard, the Hadamard code expanded this for longer code words for clear-cut information coding [14]. This overcomes the with the mismatch beyond half the length of the code [11]. drawback of the Huffman code, being limited only to the Downside of this is that melting temperature of the words letters of the alphabet. This is based on a construction ofa may differ without considering the unvarying GC content. plasmid library with specially designed primers embedded Comma freeness is another problem associated with this along with the message for fast retrieval. Index plasmid only approach. Enlarging into multiple words will be an issue due contains details about the structure of the information library. to mishybridization [11]. A good encoding scheme should have economical use of nucleotide per character which is about 3.5 here (base-to- De Bruijn Construction. Arita overcomes the problem of character ratio) [14]. melting at different temperatures and also showed an ideal Other coding schemes have low base-to-character ration algorithm for choosing oligonucleotides to get rid of the but is limited to lower number of characters such as the higher risk of mishybridization because of the matched base BioMed Research International 7 pairs which are placed consecutively [11]. Disadvantages asso- (ii) DNA sequencing: reading DNA. ciated with this approach are small number of mismatches between words and comma freeness. Objectives of cryptography are mainly as follows. (i) Authentication: Authentication is confirmation of Stochastic Method. Thisisawidelyusedmethodintherecent the details about entity from which we are receiv- past. Kac employed generic algorithms to find similar melting ing information. Digital signatures, passwords, and temperaturecodewords.Complexityoftheproblemledthem trademarks are used to ensure authentication. to apply genetic algorithms only to code words up to length 25 [7]. Arita could increase the code words in number designed (ii) Data confidentiality: This is the process of securing by the template map strategy. This approach fails if it is started confidential data from unauthorized personnel. In from the scratch which reveals the fact that it is better to use cryptography this goal is achieved through encryp- this approach to enlarge the code words of already designed tion. word sets [11]. (iii) Data integrity: This is to guarantee that information is received in the exact format in which it has been 2.1.4. Data Storage Style. Data storage style is the manner sent by the official party. That includes the fact in which DNA words are stored in a medium. In these two that no modification or alteration is done during approaches which are discussed in detail, DNA words are cryptography process [31]. stored in solid and liquid media. Data storage style depends on the word design. DNA chips which are an immobilization Practical methods for DNA data embedding are twofold: techniquehadbeenpopularamongthepublicastheweaker (i)DNAbasedstenographicmethodsareproposedby constraint on words limits the design problems. A systematic Bancroft and Clelland by physically secreting infor- word design which avoids mishybridization serves both mation in a living organism so that PCR and secret surface based approach and soluble approach [11]. key are essential for retrieving information [32]. Surface Based Approach. DNA words are placed on a solid (ii) Cox proposed embedding information in living approach which is also known as solid phase approach. beings so that information will be carried by organism Advantages associated with this approach are that code words along with cell replication without affecting biological (codons) can be separated from the complements as single properties of living organism. This can be done in two strand of the double helix structure is powerless, which in ways [33, 34]. turn reduces the risk of unforeseen combination of words and easiness to recognize the words separately for the purpose of (a) Replacement of DNA in noncoding segments information reading due to effective fluorescent labeling [11]. never being transferred to : drawback is that extreme care is needed to ensure that Soluble Approach. Benefitsassociatedareopeningupof this insertion would not affect the biological the possibility for independent information processing and functions. possibility for introducing DNA words into microbes. Dis- (b) Modifying coding DNA (cDNA) partitions advantages include limiting inborn abilities of which get transferred to proteins: this approach through providing easier access to information [11]. is more systematic and safer [11].

2.2. DNA Secret Writing. Secret writing is used to pre- 2.2.1. Problems Associated with DNA Cryptography. Non- vent illegal access of information by unauthorized parties. availability of a hypothetical basis and lack of knowledge Cryptography and steganography are two methods used for associated with DNA cryptographic methods are major secret writing. Cryptography manipulates information for problems. Similarly high cost and difficulty in understanding misunderstanding while steganography hides the existence of also have effect, in addition to inappropriateness to be used by information [4]. Traditional cryptography and stenography general public due to the biological tests and trials which have techniques have been identified as losing power and are to be performed in highly technology equipped laboratories. becomingbreakableastheyarebasedondifficultmathemati- calproblemswhicharematurebothintheoryandrealization. 2.2.2. DNA Secret Writing Algorithms Therefore researchers are investigating developing hybrid cryptosystems involving DNA methodologies into cryptogra- Steganography Technique Using DNA Hybridization. This phy and stenography. Hiding data in terms of DNA sequence method effectively uses the advantages offered by the struc- is referred to as DNA cryptography [28]. Data embedding ture of DNA for vast storing capabilities and parallel molec- is also referred to as stenography or DNA watermarking. ular computation. This approach brings out an algorithm for Adleman’s research [29] and research on hiding messages hiding data in DNA in a digital form [35, 36]. in microdots [30] are the birth of DNA stenography and One Time Pad (OTP) generated keys are used as encryp- cryptography. DNA data embedding is made possible via tion key. This key is used just only once for exactly one message. The used pad is destroyed by the user after encryp- (i) DNA methods: writing into DNA using insertion and tion. Figure 2 summarizes the encryption process using creation; hybridization technique. 8 BioMed Research International

Generation of overcomes this problem as only the types of and OTP key primers need to be known in advance. This algorithm uses the vast randomness of the DNA medium.Thisisthedownsizeofthisapproach.Thiscannotbe Plain text Text in binary Comparison Encrypted data termed as a proper cryptographic algorithm due to this fact. format Hence, OTP key can be used only once [35].

Figure 2: Block diagram of encrypting a message using DNA DNA XOR OTP with Tiles. This is based on DNA tiles and hybridization technique. DNA XOR tiles of Biomolecular Computation (BMC). Tiles carrying the message are binding together in a string. Then an XOR operation is performed between the plaintext and the OTP key OTP which is the encryption key. Tiles possess two pairs of sticky ends one to bind between tiles containing the message and the other to bind to OTP tiles. Restrictive are used to remove the cipher text from the rest of the tiles. Encrypted data Comparison Binary data Data in original format In decryption process OTP is used to extract the encryp- tion key by the user. Self-assembling of the tiles used for Figure 3: Block diagram of decrypting a message using DNA encryption in reverse order results in giving the plain text hybridization. [36–38].

2.3. DNA as Organic Data Memory. Incorporating DNA After decrypting the message receiver destroys the iden- into a living host, who has the ability to withstand risky tical pad which is owned by him. Because of this reason ecological conditions, has the ability to grow rapidly, and is this approach is extremely secure. In this algorithm single able to tolerate addition of artificial gene sequences was, the strandedDNAisusedastheOTP.LengthoftheOTP solution proposed for generating a reliable storage medium. should be 10 times larger than the binary message [35]. Inoculating DNA sequences into an organism is a challenge Encrypted message is hidden in a microdot. Strand consisting because it is difficult to retrieve a message from a whole of the encrypted message is placed between two PCR primer organism composed of many . Another obstacle is sequences. Then it is hidden in many similar structures. the unpredictable nature of genomic mutation [39]. Without the knowledge about start and end primers one DNA memory prototype consists of 4 main steps: willnotbeabletoreadthemessagebyamplifying.In (i) Encoding information as artificial DNA sequences. reading the message: hybridized segments are read as “1” and unchanged ssDNA segments are read as “0” [3]. Figure 3 is a (ii) Injecting the sequences to living organisms. diagrammatic representation of the decryption process using (iii) Permitting the organisms to be nurtured. DNA hybridization technique. DNA hybridization is a slower process at the beginning (iv) Extracting information back from organisms. because it is difficult for two complementary strands to combine together. But later this is a rapid process. This can 2.3.1. Memory Application of DNA. DNA memory can be be effectively utilized in searching and parallel computation. effectively utilized in commercial applications and in national Restrictions at present for this process are time consumption security for information hiding purposes and for data stenog- and expansiveness [35]. raphy. Deinococcus bacteriacanliveandmultiplywithout So, it already provides an honest security and takes solely human interference [39]. This property can be used to less time for the message to be communicated [37]. preserve data at nuclear catastrophe [14]. There exists a competition among seed companies to DNA Indexing. In this approach Homo sapiens’ protect their investments. Therefore, incorporating a DNA chromosomal sequence is used as the OTP key for the process watermark in the seeds could be a better approach to track of encryption and decryption. Plain text is converted to theirsalesandpreservetheircopyrightedproductsagainst ASCII code and then to binary code and lastly concerted illegitimate planting [39]. This would not affect the farmers into DNA format. OTP key of Homo sapiens is taken and who are in need but covetous farmers. scanned to examine the DNA form of plain text in the In order to capture pollutants that will contaminate OTP key. For each and every character in plain text found with the ecological resources and will pollute the resources, in the chromosome an array of indexes is formed placing researchers drill wells to gather samples of soil. In cooperating the starting point of the index in chromosome using DNA sufficient information in bacteria which could update the indexing method [4]. current status of soil with time uninterruptedly by tracking Diagrammatic representation of the encryption process bacteria’s distribution spatially and temporally, using devel- by chromosome DNA indexing is represented in Figure 4 opedtechnologieswouldbeaneffectiveapproachforthis while Figure 5 details the decryption process by chromosome purpose [39]. DNA indexing. Communication of primers, chromosomes, Endangered could be identified by injecting a and OTP is a major difficulty in cryptography. This algorithm DNA watermark into the genome of the subject, replacing the BioMed Research International 9

OTP key Encrypted data

Text in binary Text in DNA Plain text Comparison Indexed table format format

Figure 4: Block diagram of encryption process using DNA Chromosome Indexing.

OTP key

Data in Data in Encrypted Comparison Data in DNA binary original data format format format

Figure 5: Block diagram of decryption process using DNA Chromosome Indexing. other synthetic identification. This could also be used to pre- a random strand comprising 20 bases. Each and every edge serve safely the personal information of a person such as med- of the graph was represented by another different oligonu- ical information and family history in their own bodies [39]. cleotide consisting of 20 bases complementary to the source node’s second half and target node’s first half. This results 2.4. Big Data Storage in DNA and DNA Computing. Era of in self-assembling and ligation of compatible edges by the DNA computing begins with the identification of limitations function of T4 DNA ligase . In order to filter the in electronic computers. The volume of data that can be stored paths the product of the self-assembling process is subjected in an electronic computer and the speed thresholds that can to amplification by PCR. To filter the paths with the exact be reached which is governed by the physical characteristics length, separation and recovery of DNA strands with the of computers are the main limitations identified in big data exact length are done by Agarose gel electrophoresis. Agarose storage [40]. DNA computer addresses the above-mentioned gelelectrophoresis[41]istheidentifiedmosteffectiveway limitations through solving computational problems engag- of separating DNA fragments constituted of variable lengths. ing molecule manipulations while discovering natural com- PriortotheemploymentofAgarosegelforseparationof putational models leading to the big data storage and big data DNA fragments, it was done using sucrose density gradient analytics in DNA. centrifugation. It only provided a separation based on the Advantages offered by DNA computing include consum- approximation of size [41]. Filtering out the paths which ing significantly less energy than the electronic computers. Energy consumed by DNA computers is billion times com- cover all the nodes is achieved through affinity purification paratively less than other electronic computers. The storage successively for all nodes. In order to check the presence of the space needed to store information is less than trillion times path, PCR was employed to check the existence of a molecule over electronic computers [40]. Furthermore DNA comput- encoding a path of Hamiltonian [27, 40]. ers offer parallelism at a high level. Millions and trillions of The time duration required for the practical approach molecules perform chemical reactions parallel [40]. was around 7 lab days. Adleman’s algorithm was more labour extensive. Process automation could address the problem of 2.4.1. Hamilton Path Problem. Adleman addresses the high labour intensity. The algorithm used here was inefficient Directed Hamilton Path problem through exploring possi- as the number of oligonucleotides needed increased linearly bilities of information encoding in DNA sequences and there- with the increase of edges and exponentially with the number after performing simple operations for strand manipulations. of vertices. This is energy efficient. It is not clear whether the The simplified version of this problem was the salesman prob- large number of inexpensive operations could be used for lem, which needs to find out the optimal path out of a pool of resolution of real computational problems [27]. Speed of a cities through which salesman has to travel. Adleman compli- computer is measured by two factors. cated this problem through restricting the connection routes between cities and specified the start and end of the journey. (i) The number of operations it can perform parallel. Thisapproachsolvestheabove-mentionedproblemwith the intention of generating random paths through the graph (ii)Howmanystepseachprocesscanperformperunit where Adleman encoded each node in the graph into time. 10 BioMed Research International

With regard to first measure it is in favour of DNA the issue of not allowing random access and rewriting. Com- computers because of the vast parallelism it offers. Second parison of encoding models used is summarized in Table 1. measureisinfavourofelectroniccomputersbecausecon- PCR based encoding models [5] are highly secure. Main sidering a personal computer, it can perform 100 million drawback associated with PCR based models is scalability instructions per second [42]. But as the parallelism offered of data. Yachie et al. proposed iDNA [8] method to over- by DNA computers is massive, time for executing one come scalability problem. Drawbacks of requisition of PCR, instruction is not problematic. knowledge on primers, and insertion of errors in the template With the enormous flexibility offered by electronic com- region which results in impossibility of recovering the data puters through the numerous operations it is pretty efficient have not been addressed. Therefore researchers searched for to multiply two 100 digit numbers whereas it is overwhelming a PCR independent encoding model. to perform this using a DNA computer with the protocols and Alignment based encoding model [10] achieves the enzymes presently available [27]. advantage of not needing a template DNA. It performs retrievals without parity checks or error correcting algo- 2.4.2. 20-Variable 3-SAT Problem. Thisisthelargestproblem rithms. Though this approach efficiently retrieves data from a solved yet with a DNA computer which is 20 variables for genome, it requires sequencing of the entire genome which is three-satisfiability problem. This problem is an NP (Nonde- themaindownsideassociatedwiththisapproach.Highspeed terministic Polynomial) time-complete computational prob- and lower cost of reading are identified in this approach. lem. As the problem complexity is very high, even with fastest As the data could be lost during evolution of the living sequential algorithms, exponential time to solve this problem organisms, Yatchie et al. [10] introduced multiple copies is required [43]. This was the reason it leads to examining of sequences compressed by multiple compression paths, the performance of DNA computers. Subsequences based though this factor overcomes the data loss during evolution separation used in Sticker Model [44] is used to solve this and it leads to redundant volumes of data. Data recovery rate problem. Two basic operations are used by Striker Model for is fragile in this approach. Size limit of the oligonucleotides is computations: application of strikers and separation based a downside of this approach as if the size is increased it might on subsequence. 20-variable SAT problem uses separations. appear in the original genome. Oligonucleotide probes restrained in polyacrylamide gel- Church and Goldman model [12, 14] marked a milestone filled glass modules are employed for carrying out separa- in DNA storage as all the prior work was only for small tions. Electrophoresis moves the DNA strands through the volumes of data. This algorithm efficiently uses bases as module. Strands which are complementary to the immobi- it encodes one bit per base. Significant drawback associ- lized modules hybridized while the strands which do not ated here is the lack of error correction mechanism. Next contain complementary strands pass by. Electrophoresis at a generation digital information storage techniques employ higher temperature which is higher than the melting temper- one-bit representation per base. Cost associated and time ature of the probes is used to free netted strands. Remaining for reading and writing data are main drawbacks of this strands are transported to other modules [41]. This problem approach. Aforementioned drawbacks could be accessed in maximizes the use of parallelism offered by DNA computers. the near future as hand-held reading devises are becoming available. Church et al. have not paid their attention towards 3. Results and Discussion compression, parity checks, redundant encodings, and error rate. Attention has to paid towards improving the storage DNA has been identified as the potential medium for data density and safety. Placing multiple copies of data would lead storage due to its vast storage capacity, high data density, sus- to increasing the safety measures. taining to extreme environmental conditions, and so forth. Numerous encryption schemes have been used to encrypt Evolution of data storage in DNA is described in detail in data in DNA. Huffman code [24, 25] has been developed for this review article. Idea of data storage in DNA emerges with a limited number of symbols only the English alphabet. This MicrovenusprojectofDavis[5,6]:tostoreanimagetoDNA drawback has is overcome by the improved Huffman code leads the foundation for DNA based storage system with the [14]bydefiningthecodefortheentirekeyboard,images,and core idea of storing abiotic information on DNA. Genesis music.CommacodeusesbaseGtoidentifyandseparatethe project [5, 7] succeeded the Microvenus project. Microvenus reading frames. Both comma code and alternating code use was inaccurate as it was not uniquely decodable while Genesis repetitive bases to separate reading frames. Comma-free code was also inaccurate and was inefficient because it did not has an automatic frame detection mechanism through which return the original text after decoding. Although DNA has it overcomes the drawback of redundancy. Alternate code is been identified as the potential medium for long term data not superior to the comma code. Although perfect genetic storage there these issues were identified as the hitches to code [27] varies the codon length which in turn increases the be addressed: high cost of DNA synthesis, data read back efficiency a paradiagram shift is needed to move into perfect at a slower rate, DNA not possessing rewriting capabilities, genetic code as machines need to read variable length codons. and DNA not allowing random access. Among the above Comparison of the encryption schemes used is summarized identified issues Yatchi et al. have been able to overcome the in Table 2. issueofslowerdatareadbackspeedusingalignmentbased DNA is effectively used for secret writing due to the encoding model [10]. Rewritable and random access based security mechanisms offered by DNA. DNA cryptosystems storage systems [13] proposed by Tabatabaei et al. overcome are more secure due to the huge size of the OTP key used for BioMed Research International 11

Table1:Comparisonofencodingmodels.

Encoding model Advantages Disadvantages Laid the foundation for storing abiotic Microvenus project Being inaccurate and not distinctively decodable information in DNA Laidtheresearchworktoexploretheintricate relationship between biology, belief systems, Inaccurate as the original sentence was altered during Genesis project information technology, dialogical interaction, mutation at the presence of ultraviolet light ethics, and the Internet Insertionoferrorsintemplateregionmakingit unmanageable to recover the encoded data High security because of the size of the microdots Need of the knowledge of primers PCR based encoding and even if an adversary identifies the microdot it Widespread experimental obstacles and practical models would be extremely difficult without the problems knowledge of the primer sequence Need of PCR Data breakage that could occur in encoding and decoding procedures due to errors of humans Multiplication of cassettes leads to redundant volumes Parity effects cost a certain volume of data sequence Data recovery rate is fragile and is proportional to data Independent of Polymerase Chain Reaction breakage which occurs through DNA deletion of long Greater speed and lower cost of reading DNA data ranges Alignment based encoding and lower cost of synthetic DNA Sequencing of the entire genome is required to retrieve models Positions of the data breakages that could be data identified easily by the alignment results although There is size limit of the cassette oligonucleotides being they were not recoverable used to encode the message. If it increases a certain limit there is a possibility of it to appear by chance in host genome Random access to data blocks of DNA which promotes nonlinear access Rewriting capability of information into random Rewritable and random locations access based DNA storage Cross hybridization problems that are eliminated High cost system in this method by prohibiting redundancy of information Being used to store frequently updated data which needs to memorize the editing history Employment of one-bit representation per base High scalability High data storage density Next generation digital Cost is unfeasible Highly reliable information storage Time for reading and writing onto DNA is high Each copy having the capability to correct the errors in the other copy as the errors are almost never coextensive High volume data storage density Not needing ample context information for Encoding scheme for small Have not proceeded in implementing the biological encoding purposes text files protocols to insert the sequence in genome of bacteria Maximum efficiency of compression Reducing cost factor

encryption. It will be extremely hard to break the algorithm the key obtained from the public database is extremely without knowing the primer sequences and scientific spec- more huge than the hybridization method. Encrypted data ifications of the organism. Running time of cryptographic is obtained by scanning a randomly picked number from the systems is less. Larger key and the indexes of arrays used key and the plain text of DNA form. Random pick of numbers in DNA systems need high memory space. Therefore in alongwiththelengthyOTPkeyimprovestheconfidentiality. practical situations it might require a separate storage device. Implementation of cryptographic systems is highly costly. In the hybridization method, an OTP key 10 times bigger is In order to provide more security, OTP key generated in generated for each binary bit of plain text. In DNA indexing the hybridization method can be increased in length further. 12 BioMed Research International

Table2:Comparisonofencryptioncodes.

Alternate Improved Huffman Perfect genetic Huffman code Comma code Comma-free code code code code Base-to-character ∼2.2 ∼6 ∼6Variable ∼3.5 Variable ratio Economical Very economical No No Yes Long term storage No Yes Yes Error correcting Yes Yes Yes Protection from Yes Yes No mutation Isothermal melting Yes Yes temperature Synthetic DNA Yes Yes Fixed length base Uses the principle frames without of varying the commas to length of symbols separate the frames used for Consists of fixed Does not consist of 70% more efficient representation length reading identical four bases Stores text, images, than the other Special features based on the frames of 6 bases Does not have the and music in DNA codes due to the recurrence of a including the capability to Expensive use of a variable character comma, G recover the broken code length Not applicable to data when a large numbers and segment is deleted symbols from a data encoded region

It is possible by generating an OTP key of more than 10 bases Table 3: Comparison of the performance of cDNA secret writing in length (say 12 or more). It can be concluded that “the higher techniques. the length of the key data, the higher the security.” DNA hybridization Chromosome DNA In order to enhance the security of cryptographic systems Features technique indexing further step of hiding the data after encryption could be prac- ticed.Itcanbeachievedthroughperformingthebiological Running time Less More Large depending on Large independent of process of hiding the encrypted data between the primers in Size of the key the DNA sequence. Comparison of the performance of the the input the input basic cryptography algorithms is summarized in Table 3. High based on the High based on the key Incorporatingmessagesintohuman,mouse,orbac- Strength of the type, size, and the type and key size and terium was popular but the challenges imposed were extract- algorithm randomness of the the randomly ing the information from the whole genome without knowing key produced index Needs more memory any tracks about the embedded message and the unpre- More than the space for storing the dictable nature of genomic mutations. Wong et al. [39] over- hybridization type lengthy key and come this challenges through identifying fixed size sequences Memory space because of the huge performing the key length and the that usually do not exist in the living genome. This is the operations involving index array involved critical operation as they do not want to cause any damage it to the living genome. Mutation of the organisms is a major Cost High High challenge, influencing the integrity of the messages in this Believed to withstand Believed to withstand approach. Selecting organisms with low mutation rate would Longevity beapossiblesolutiontoovercomeit. any duration any duration Cost and the data retrieval rate are major problems in DNA based storage systems. At present data can be read at a rate about 100MB per second by storage devices Making a custom DNA molecule is expensive and this has which is much higher rate than natural storage. Synthesis been identified as the major obstacle for DNA based storage and sequencing processes are time consuming and require systems [45]. Recently parallel automation and decreasing the expertise knowledge which makes this method inaccessible cost of synthesis using next generation sequencing techniques to general public. Even though DNA is scalable, robust, and are in operation. Therefore, simplified purification techniques stable the above-mentioned drawbacks have a high concern. are a major area to be focused. At present researchers BioMed Research International 13 have focused on developing DNA storage system which is [2] H. A. Hakami, Z. Chaczko, and A. Kale, “Review of big consisting of reading and writing chambers [8] as the initial data storage based on DNA computing,” in Proceedings of the step towards developing a molecular data storage system. Asia-Pacific Conference on Computer-Aided System Engineering To ensure security the researchers have translocated the (APCASE ’15),pp.113–117,Quito,Ecuador,July2015. information to safety zones known as parking spots. In order [3]M.Castillo,“FromharddrivestoflashdrivestoDNAdrives,” to read DNA, it has been transferred to coding stations using American Journal of Neuroradiology,vol.35,no.1,pp.1–2,2014. electric field gradients, electronic motors, and so forth. In [4] M. Borda and O. Tornea, “DNA secret writing techniques,” in order to make molecular storage device a commercial appli- Proceedings of the 8th International Conference on Communica- cation researchers have to focus on scaling natural storage tions (COMM ’10), pp. 451–456, Bucharest, Romania, June 2010. capacity. High time consumption associated with the overall [5] D. Limbachiya and M. K. Gupta, “Natural data storage: a review on sending information from now to then,” ACM Journal on general process of encoding, amplifying, sequencing, restruc- Emerging Technologies in Computer Systems, http://arxiv.org/abs/ turing, and decoding is also a major challenge. Many errors 1505.04890. such as homopolymers, sequencing errors, lower access rate [6] J.Davis,“Microvenus,”Art Journal,vol.55,no.1,pp.70–74,1996. errors are available. Though autocorrection mechanism is [7] E. Kac, “Genesis-art of DNA,”1999, http://www.ekac.org/geninfo available in natural DNA there is no such mechanism in .html. synthetic DNA. Future work includes analyzing the environ- [8] N. Yachie, Y. Ohashi, and M. Tomita, “Stabilizing synthetic data mental conditions under which the experiments are carried in the DNA of living organisms,” Systems and Synthetic Biology, outaswellaspreservationofDNAforlongtermstorage. vol.2,no.1-2,pp.19–25,2008. [9] C.Bancroft,T.Bowler,B.Bloom,andC.T.Clelland,“Long-term storage of information in DNA,” Science,vol.293,no.5536,pp. 4. Conclusion 1763–1765, 2001. [10] N. Yatchie, Y. Ohashi, and M. Tomita, “Stabilizing synthetic data This review article critically analyzes the existing methods in the DNA of living organisms,” Systems and Synthetic Biology, of storing data onto DNA. Data is encrypted into DNA vol.2,no.1-2,pp.19–25,2008. using diverse codes and this article analyzes and discusses [11] M. Arita, “Writing information into DNA,” Aspects of Molecular the codes used for encrypting data. Multiple approaches for Computing,vol.2950,pp.23–35,2004. designing DNA codons and diverse data storage styles have [12]N.Yachie,K.Sekiyama,J.Sugahara,Y.Ohashi,andM.Tomita, been analyzed in detail identifying the pros and cons of each “Alignment-based approach for durable data storage into living approach. Secret writing techniques using DNA molecules organisms,” Biotechnology Progress,vol.23,no.2,pp.501–505, forsecuredatastoragearealsodiscussedthroughthisarticle. 2007. DNA can be used as an organic memory to store massive [13] S. M. H. Tabatabaei Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. amounts of data. This paper also analyzes the mechanism Milenkovic, “A rewritable, random-access DNA-based storage where living organism could be used as storage devices system,” Scientific Reports, vol. 5, article 14138, 2015. while identifying limitations and appropriate applicability. [14] M. Ailenberg and O. D. Rotstein, “An improved Huffman Challenges faced through trying to apply organic memory coding method for archiving text, images, and music characters concepts are also discussed through this paper. Big data stor- in DNA,” BioTechniques,vol.47,no.3,pp.747–754,2009. age and analytics and the way it has led to DNA computing to [15]M.Blaum,S.Litsyn,V.Buskens,andH.C.vanTilborg,“Error- solvehardcomputationalproblemsarealsodiscussedhere. correcting codes with bounded running digital sum,” IEEE The outcome of this study is a review article which identifies Transactions on Information Theory,vol.39,no.1,pp.216–227, 1993. the limitations of existing encoding algorithms and proposes [16] E. N. Gilbert, “Synchronization of binary messages,” IRE Trans- methods to overcome the identified limitations. actions on Information Theory,vol.6,pp.470–477,1960. [17] H. Packer, “CRISPR and Cas9 for flexible genome editing,”Tech. Competing Interests Rep., 2014, https://www.idtdna.com/pages/products/genes/ gblocks-gene-fragments/decoded-articles/decoded/2013/12/13/ The authors declare that there are no competing interests crispr-and-cas9-for-flexible-genome-editing. regarding the publishing of this paper. [18] A. V. Bryksin and I. Matsumura, “Overlap extension PCR cloning: a simple and reliable way to create recombinant plasmids,” BioTechniques,vol.48,no.6,pp.463–465,2010. Acknowledgments [19] S. C. Schuster, “Next-generation sequencing transforms today’s biology,” Nature,vol.5,no.1,pp.16–18,2008. Pavani Yashodha De Silva would like to express her sincere [20] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital gratitude to her supervisor, Dr. Gamage Upeksha Ganegoda, information storage in DNA,” Science,vol.337,no.6102,p.1628, for the guidance and support extended to drive her on the 2012. correct path to carry out her independent study. [21] N. Goldman, P.Bertone, S. Chen et al., “Towards practical, high- capacity, low-maintenance information storage in synthesized DNA,” Nature,vol.494,no.7435,pp.77–80,2013. References [22] R. Vishwakarma and N. Amiri, “High density data storage in DNA using an efficient message encoding scheme,” Inter- [1]S.ShrivastavaandR.Badlani,“DatastorageinDNA,”Interna- national Journal of Information Technology Convergence and tional Journal of Electrical Energy,vol.2,no.2,pp.119–124,2014. Services,vol.2,no.2,pp.41–46,2012. 14 BioMed Research International

[23] M. Burrows and D. J. Wheeler, “A block-sorting lossless [44] S. Roweis, E. Winfree, R. Burgoyne et al., “Asticker-based model data compression algorithm,” Tech. Rep. 124, Digital System for DNA computation,” Journal of ,vol.5, Research Center Research, Palo Alto, Calif, USA, 1994. no. 4, pp. 615–629, 1998. [24] G. C. Smith, C. C. Fiddes, J. P. Hawkins, and J. P. L. Cox, “Some [45] A. Rosenblum, “Microsoft reports a big leap forward for DNA possible codes for encrypting data in DNA,” Biotechnology data storage,” 2016, https://www.technologyreview.com/s/601851/ Letters,vol.25,no.14,pp.1125–1130,2003. microsoft-reports-a-big-leap-forward-for--data-storage. [25] D. A. Huffman, “A method for the construction of minimum- redundancy codes,” Proceedings of the IRE,vol.40,no.9,pp. 1098–1101, 1952. [26] M. K. Rogers and K. C. Seigfried-Spellar, “Digital forensics and cyber crime,” in Proceedings of the 4th International ICST Conference on Digital Forensics & Cyber Crime (ICDF2C ’12), Lafayette, Ind, USA, October 2012. [27] A. J. Doig, “Improving the efficiency of the genetic code by varying the codon length—the perfect genetic code,” Journal of Theoretical Biology, vol. 188, no. 3, pp. 355–360, 1997. [28] Y. Zhang and L. H. Bochen Fu, “Research on DNA cryptography,”in Applied Cryptography and Network Security,J. Sen,Ed.,pp.357–376,InTech,Rijeka,Croatia,2012,http://www .intechopen.com/books/applied-cryptography-and-network- security/research-on-dna-cryptography. [29] L. M. Adleman, “Molecular computation of solutions to com- binatorial problems,” Science,vol.266,no.5187,pp.1021–1024, 1994. [30] C. T. Clelland, V. Risca, and C. Bancroft, “Hiding messages in DNA microdots,” Nature, vol. 399, no. 6736, pp. 533–534, 1999. [31] N. Sangwan, “Text encryption with huffman compression,” International Journal of Computer Applications,vol.54,no.6, pp.29–32,2012. [32] C. Bancroft and C. T. Clelland, “DNA-based steganography for security making,” U.S. Patent .6,312, p. 911, June 2001. [33] F. Balado, “On the embedding capacity of DNA strands under substitution, insertion, and deletion mutations,” in Media Forensics and Security II, 754114,vol.7541ofProceedings of SPIE, San Jose, Calif, USA, January 2010. [34] J. P. Cox, “Long-term data storage in DNA,” Trends in Biotech- nology,vol.19,no.7,pp.247–250,2001. [35] M. E. Borda, O. Tornea, and T. Hodorogea, “Secret writing by DNA hybridization,” Acta Technica Napocensis Electronics and Telecommunications,vol.50,no.2,pp.21–24,2009. [36] M. Borda, Fundamentals in Information Theory and Coding, Springer, Berlin, Germany, 2011. [37] G. Jeevitha and G. Sasikala, “Implementation of cryptography using DNA secret writing techniques with OTP,” International Journal of Advanced Research in Computer Science and Software Engineering,vol.5,no.5,pp.970–974,2015. [38]O.TorneaandM.E.Borda,“DNAcryptographicalgorithms,” MediTech,vol.26,pp.223–226,2009. [39] P. C. Wong, K.-K. Wong, and H. Foote, “Organic data memory using the DNA approach,” Communications of the ACM,vol.46, no. 1, pp. 95–98, 2003. [40] L. Kari and L. F. Landweber, “Computing with DNA,” Methods in Molecular Biology,vol.132,pp.413–430,1999. [41] P.Y. Lee, J. Costumbrado, C. Y. Hsu, and Y. H. Kim, “Agarose gel electrophoresis for the separation of DNA fragments,” Journal of Visualized Experiments, no. 62, Article ID e3923, 2012. [42] R. J. Lipton, “DNA solution of hard computational problems,” Science,vol.268,no.5210,pp.542–545,1995. [43]R.S.Braich,N.Chelyapov,C.Johnson,P.W.K.Rothemund, and L. Adleman, “Solution of a 20-variable 3-SAT problem on a DNA computer,” Science,vol.296,no.5567,pp.499–502,2002. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 5469371, 9 pages http://dx.doi.org/10.1155/2016/5469371

Research Article Analyzing the miRNA-Gene Networks to Mine the Important miRNAs under Skin of Human and Mouse

Jianghong Wu,1,2,3,4,5 Husile Gong,1,2 Yongsheng Bai,5,6 and Wenguang Zhang1

1 College of Science, Inner Mongolia Agricultural University, Hohhot 010018, China 2Inner Mongolia Academy of Agricultural & Animal Husbandry Sciences, Hohhot 010031, China 3Inner Mongolia Prataculture Research Center, Chinese Academy of Science, Hohhot 010031, China 4State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of , Chinese Academy of Sciences, Kunming 650223, China 5Department of Biology, Indiana State University, Terre Haute, IN 47809, USA 6The Center for Genomic Advocacy, Indiana State University, Terre Haute, IN 47809, USA

Correspondence should be addressed to Yongsheng Bai; [email protected] and Wenguang Zhang; [email protected]

Received 11 April 2016; Revised 15 July 2016; Accepted 27 July 2016

Academic Editor: Nicola Cirillo

Copyright © 2016 Jianghong Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Genetic networks provide new mechanistic insights into the diversity of species morphology. In this study, we have integrated the MGI, GEO, and miRNA database to analyze the genetic regulatory networks under morphology difference of integument of humans and mice. We found that the gene expression network in the skin is highly divergent between human and mouse. The GO term of secretion was highly enriched, and this category was specific in human compared to mouse. These secretion genes might be involved in eccrine system evolution in human. In addition, total 62,637 miRNA binding target sites were predicted in human integument genes (IGs), while 26,280 miRNA binding target sites were predicted in mouse IGs. The interactions between miRNAs and IGs in human are more complex than those in mouse. Furthermore, hsa-miR-548, mmu-miR-466,andmmu-miR-467 have an enormous number of targets on IGs, which both have the role of inhibition of host immunity response. The pattern of distribution on the chromosome of these three miRNAs families is very different. The interaction of miRNA/IGs has added the new dimension in traditional gene regulation networks of skin. Our results are generating new insights into the gene networks basis of skin difference between human and mouse.

1. Introduction species. Mammals have diverse coat morphology, due to changes in the gene regulatory pathway/network. Common The integument includes the skin, coat/hair, and nails [1]. The genetic network variants have been shown to affect many mammalian coat hair is one of the defining characteristics complex traits, including integument morphology. There- of mammals [2]. Enormous morphological variations are fore, to understand the differences of molecular mechanism found among mammalian integument, especially in coat hair. for integument phenotype between mouse and human, it Hair is present in differing degrees in all mammals [3]. In is necessary to employ the scope of system biology. The most mammals, the hair is abundant enough to cover the comparison of gene coexpression networks and regulation body and form a thick coat, while dolphins, naked mole-rats, networks between two species is often a useful method to and humans are among the most hairless of all mammals identify the differences of critical biologic process associated [4]. However, humans with the hypertrichosis syndrome with morphology diversity. have hair covering their faces, their eyelids, and their bodies Mammalian Phenotype (MP) ontology [7] (http://www [5, 6]. Therefore, we provide a hypothesis that key genes .informatics.jax.org/) is made up of 17,330 terms (as of related to hair coat formation exist in every mammalian 02/08/2016), and most of these terms were characterized 2 BioMed Research International from abnormal mouse phenotypes. The ontology database Weighted Gene Co-Expression Network Analysis (WGCNA) deposited 1627 mouse/human orthologous genes with phe- [16, 17] was used for these expression datasets to create notype annotation related with integument (MP:0010771). consensus networks between human and mouse, and the High-throughput experiments have produced many microar- networks were visualized in Cytoscape 3.3.0 [18]. rays and next generation sequencing data that are collected in Gene Expression Omnibus (GEO) database [8]. Genetic 2.3. Functional Enrichments for IGs Network. To classify the networks have been widely used in biological research, terms and group for IGs in (GO) and KEGG bridging the gap between single genes and biological systems pathway, we used the Cytoscape plug-in ClueGO to imple- by investigating the relationships between different genes. ment enrichment analysis. We set kappa score threshold to miRNAs are small noncoding RNA molecules which play 0.5 and the 𝑝 value to 0.05; we used two-sided test, Bonferroni an essential role during skin development [9]. Variations in step down and GO term fusion. The other parameters of the theinteractionbetweenmiRNAandtargetgenearelikelyto software were set to default values [19]. influence the phenotypic differences between species. Here, weuseWeightedGeneCo-ExpressionNetworkAnalysis 2.4. Prediction of miRNA Targets for Integument Genes. For pre- (WGCNA) to reveal shared and unique characters of the diction of miRNA targets for hub genes, miRanda [20] soft- gene networks in human and mouse skin. We identify ware version aug2010 available at http://cbio.mskcc.org/mic- networks of coexpressed genes, which might be associated rorna data/miRanda-aug2010.tar.gz was employed to predict with biological functionally relevant coat morphology, and miRNAs regulated IGs. miRanda was running as the follow- explore differences in these biological processes between two ing command options: sc ≥ 180, en = 1 (no energy filtering), species. We also found that candidate miRNA may play a go = −9.0, ge = −4.0, and scale = 4.0. We set the option role in the regulation of gene expression networks in skin. -strict to prevent gaps and noncanonical base pairing in These results provide a system-level insight into evolutionary target sites. Human/mouse miRNA sequences were used as changes of the integument. query sequences, and the IGs sequences were used as [20]. We also predicted the targets with two miRNA prediction 2. Materials and Methods databases, miRTarBase [21] and miRDB [22], and selected In the present study, we emphasize the comparison between the intersections targets to construct the miRNA-mRNA the gene network for integument of human and that of regulatory network. mouse and identification of candidate miRNAs relevant to the integument genetic pathway and constructed an interaction 3. Results network between miRNAs and targeted genes. Figure S1 in Supplementary Material available online at http://dx.doi.org/ 3.1. Chromosome Location for Integument Genes of Human 10.1155/2016/5469371 shows pipeline for the study. and Mouse. We observed a uniform distribution of the 1627 integument genes across all chromosomes in human and 2.1. Selection of Genes and miRNAs for Human and Mouse. mouse (Supplementary Figure S2). As can be seen in Figure Mouse/Human Orthology with Phenotype Annotations were S1, in general, 1627 integument genes are dispersed along the downloaded from the MGI database [10]. 1627 of these genes chromosomes of human and mouse. These results provide are related to the integument phenotype (MP:0010771). To insights into the biological underpinnings of integument draw the chromosome location for integument genes (IGs), phenotype and the pleiotropic connections between traits. It the R package org.Mm.eg.db [11] and org.Hs.eg.db [12] were is very difficult to detect which chromosome is essential to used. We downloaded the gene expression data of skin for integument phenotype because these integument genes will mouse and human from the Gene Expression Omnibus be uniformly distributed on all chromosomes. (GEO) [13] and mouse and human miRNAs from miRBASE (http://www.mirbase.org/cgi-bin/browse.pl/) [14]. Mouse and 󸀠 3.2. Gene Expression Network. ByusingWGCNA,wegen- human 3 UTR sequences of 1627 integument genes were ob- erated two consensus networks of the human and mouse tained from Ensemble (http://www.ensembl.org/) by biomaRt (Figure1)whichshowanetworkheatmapplotofagenenet- in R language [15]. work together with the corresponding hierarchical clustering and the resulting modules. Then, the gene 2.2. Construction of Gene Networks. To compare the gene expression networks were visualized in Cytoscape. There are networks of skin for human and mouse, we selected many two groups in human skin consensus networks constructed relevant data from the GEO database [8], and then we from 560 genes (nodes) and 18391 interactions (edges), while filtered out all but a core collection of datasets that were only one group in mouse skin gene networks was constructed similar enough for useful bioinformatic comparison. First, we from 368 genes and 1757 interactions (Figure 2). To eval- downloaded all datasets that were run on same Affymetrix uate the complexity of gene networks, we calculated the platform, one platform in human (GPL570) and one in mouse interaction degrees for gene expression network. The average (GPL1261) (Table S1). Second, we selected only relevant interaction degrees are 32.8 (18391/560) and 4.8 (1757/368) for experiments about skin samples on the platform. Third, we human and mouse, respectively. extracted IGs expression data from the chip matrix. Finally, We use the ClueGO to compare three clusters of genes a total of 1487 genes were analyzed in this study. Then, from Group 1 and Group 2 of human and the mouse BioMed Research International 3

(a) (b)

Figure 1: The TMO plot shows the modules of gene expression in human (a) and mouse (b) by WGCNA.

SOD1 IFT27 OSTM1 HNF1A AGATAX1BP1 ICMT TFAM VAC14 BBS2 HERC3CREB5ZMPSTE24 THRA CCDC57 ARID4A SMOC1 NR2F6 SIRT2 ITPKB SMC6SIK3 CHD2 SHH RNF8 ETS2CRIM1 HSF1HDAC4 MAF1MAPKAPK5APC TULP3 GJB3 MAEA WRNFARP2 MECP2FREM2LRIG2 STIM2NR2C2 NEO1ERBB2 PRKDC ABCD4SRSF4DICER1CREBBP SIK2OPHN1 ERBB2IP NCOA3ATMMSH2 MKLN1 PRDM16PYGO2 SLC4A4SMCHD1KALRN SMARCA2ALDH2SYNE2 BBS4BRD7 MIA3 RHBDF1 RB1CC1YAP1 NCOR1OXTRCLCN6 CDKN2CROCK1 FIGN NFE2L1 EZH1 FBXW7CDC73 PLCL1 FAAH CDK5RAP2 MSH3TERF2IPPPM1AMPV17NEK1 LEF1 FGF11 SSFA2 TAP1MCOLN1RAE1RAF1PHF2 BMPR1AOXTMBD5 TAB2 NCOA2PGGT1BCBX7IDUATSC1 VDR KDM4C SOS1 PAQR3PRLR GAB1 ATP2C1SLC6A2XIAPP2RX6 TRAF3IP2RB1 MIB1 Group 1 Group 2 GFRA2 GABBR1 RINT1FOXO3LMF1XPAPOU2F1BRD1HIRAPPARARXRBMTDHMORF4L1BIRC6 POLG FBLN5 KDM4B PPP2R5DCDKAL1CISD2CDK7TCF7L2 ERN1 PLEKHG5 BSCL2 KIF1B MAF GABRA4MAPK7PTPN11 VPS33A PRL CASR ARMNTYY1 YES1 POFUT1 DLG4 FBN1 DPT MEGF8 PARK2GAK NAV2 NOTCH1 EGLN2 PLCB3ITPR2DOPEY2 WDR48EGF RPS6KA3AEBP2PTCH1 SCARA5 MAPT SLC12A6CBS GIT2PPM1F CYSLTR2 TFPI F2 TRPM3 DDR1 SNIP1 TSC2 COL1A2DCN ANK1MAP2K2 COQ9 MED1EGFRSOD2TNFRSF19 ARNTGNAS NCOA1USP4 NF1SLC17A7PI4KA RPS20 FSTL1 CNTNAP2 ERP44 CLPP XPC ADORA1 ARID3A COL3A1COL1A1 FGF6 HPS3 KISS1R DLG4ARIH2 ATR SLC19A1 LUM NIT1 MMP24PLCG1 MAP2K7 SLIT2 F2RL1 SLC11A2GBF1 LATS1 ZBTB20MEOX2S100A10 BECN1SLC39A3CARD11 E2F5BARX2 FN1COL5A2 CAMK2A T GNAO1 B3GNT3 ADAMTS5 FGF9HMOX2 CDK5PAX6 KIF1A GLRA3SLC12A1 EBF1PDGFRB ADORA1MADD MAP3K11 GATA5 TCF15SLC38A10 IGFBP7 FMOD EDARADDEDAR LMX1BCHRNGGRIK5 KCNA6AP3D1 GABRA2 ATE1 LMO2CDH5A2M NRG3 GPR3CDH8PIK3CG SENP1FGF10ATP8A2GBX1 DLGAP3CYP7A1OXTRSLC6A4CSN2FCRLA COL5A1 HOXB1CACNA1A RCE1 VWA1 TNFSF11SCN11APROKR2STOML3BTN1A1TMC1HFE2CAMK4CDK5R2AGERTLL1 VHL KDR UBE3B POU3F4GRIA2 MCOLN3 PLG HTR3AFOXB1RHBDL1AGAP2GHRHRLALBAGRM1GRM7KIF1AGPC5RXFP1DBHCSF2 THBS2AEBP1 SERTAD2COL5A3 GRIA4 P2RX3 OPRL1PRDM8NTSR1SLC5A7NOS2HOXB13SSTR4GSCPOU3F2SLC12A5OPRD1NKX2-5ASCL1MUSKHOXD13TPH2CTSESYT7AGTR2NRP2GNRHRBRSK1TCF21HOXB9IL4HTR7CSF3VGFGRIA1LTC4S ARL6IP5 CD34 RHOT2 CCDC57PRKCASLC6A3CDH23GPRC6AGFI1BOPRM1ITGB3KCNJ11CLDN6SCN10AUSH1CAVPR2FOXP3CD19LIPINPB TWF1NFKB2 RPL24 RECK CDH8 BRSK2ERCC2ZBTB17 CLP1RGS4NMUR2NTSR2CACNA1SLMX1ACLEC1BCSN3CACNA1BESR2NAGSPARK2RLN1PDYNAQP2CD40LGFGAFGF20BRSK2KCNIP3SEMA6CMC5RCRHHNF1AADAMTS20 PRKAR1B GAK ALDH2 PDYNTMC1 STOML3 PSEN1 PNPLA6 ESR1 KISS1RTMPRSS6NMUR1HCN1PLD4ALAS2LMX1BCHRNB4P2RX3AICDAP2RX6CNR2CLRN1TFF1COL19A1CCR10OPRK1RSPO2HOXD1CHRNA4HOXB1HAP1CHRNGDUSP26BCL2HCN2ERCC2FGF6NR4A1RARGEMILIN1TRAPPC6A RELN RINT1 GALR2POU3F4EDARADDOXTPTCH2GATA1NTRK1CHST8CAMK2ARLBP1SHHEMX1MYOGSOX2GATA4ADCY1FGF23TRPM3HTR2CSRCMKL1PAHCARD11TOP3BPRX ZBTB17 EVPL CLDN6MTA1 RXRA PDE6BAMHR2SOX21TREML1CCKBRRXFP2FSHRPOMCMTA2GABRR1FOXA2KRT76CASRPDX1APOA1RAC3CLPSGRM5 CDKN2A CCM2 RUNX1T1RAPH1DDR1 PCSK2 SCN10AUBE2B ATOH1PTGER1TREX2SLC17A6IL10PNOCCITED1SOX11DRD2CNTNAP2FGF5WNT10APDGFBETV4ADRBK1HSF1PORCN CTNND1 SDC1 RAF1 NBL1CHUK YY1 POMC PAX3 SOX18SLC45A2KSR1CLCN1WNT7ATRPA1PITX3ZAP70SNAP25MYO7AGBX1ANK1GFRA2 CHRNB2SLC39A3ZFPM1POLD1RXRBNR1H2TRPC4AP JAG2 KCNJ1ARX PTGDS ALX4HTR1FSLC4A4PTGIRSLC4A1ARXGRIK5IL13TRPV4PLCB2 LIPE AGPAT2 NDFIP1NDUFS4TMEM79 HOXB13SLC12A5 GRIK1 ECE1 P2RX7ANO1ACADSCDKN2C SOD1ARCN1CBLBNXF1PROS1HDAC1DBI HOMER2 GHRHR SSTR4 FOXG1GATA5 IL17RABCL2L1 AARS CRIM1PTPN6ERCC1 MAP3K1 CACNA1B ARHGAP1 HOMER3 HLX DSC3DICER1SLC20A1ATF4MAPK6IRF2TSC1TIMM23SNX5CTSALAMP2KITGBF1 INPP5D MTHFRTYRO3SREBF2 PTGES3GNAQGABBR1PFAS PRDM6 PAX1OPRD1 JDP2BCL2L11 HOXC8 USP9XTAX1BP1NPM1JAG1FBXO11NFE2L1ILK NCOA1ARPC2 ARIH2 TOM1L2 PRL STAT5B FES PHB HIF1A RAC1NPEPPSTERF2IPHMGN1MRPS35GUSBRFWD2ATRDHCR24KIF1B IFT27 CDK5R1 ALDH16A1 FARP2TOM1L2 LDLR PAK1IP1CHD2RPL27ATNFRSF1AYAP1RPL38BIRC6ROBO1PPP1R13BYES1 ERCC3LIPA CBX6 CYP19A1 FMR1APLP2ERCC5PNNMGAT1LBRTRPV1SMC6DDB2BMPR1AERP44NSUN2NPY1RASPAITPA GABRA2 GNAO1 TNIP1IRS2OPHN1STIM1BRD1BBS2WRNEZH2ATP2C1 BIN1TGFBR1 GNRHR OPRL1 STK11 PRLR FOXO3ARID4ASMARCA5FTOSHOC2PERPPPM1ARALBP1ZMPSTE24NCOR1FBXW7BUB3PLCB3CISD2ASB1 USH1C MYO6SGSHITGB5PSTPIP2PRMT5USF2CDK4 UBE3ACLCN6 LBH XPA TK2DNM1LBSCL2MAP2K1TPH1 SP3 GATA4 ABCC6 SUPV3L1SRA1TBK1NR2C2LRRK2NOTCH2ICMTTSG101OTCATOX1USP4HDAC4SIRT6 T TRPC4AP SLC25A37ZDHHC13TNRC6ALRP4SPENMPP5CDK7SMAD2PHF2APRT NTSR2 TRIM21 POFUT1KIFAP3CREBBPPOU2F1 ETS1 CDC73 NFE2MED31 PTPRS CBY1 CELSR1 LIPI MAPK1 GOLM1AFAP1L2RPS19 HPS4CBLTBC1D20NOTCH3F3 SCARF1MAPK8 LSR HTR1F OTC RAG1 MKLN1 PPP1R15BSMARCA2 MYBGSTP1 RELA TCF3 POLB CENPJPDGFAPKNOX1 HOXD9PYGO2 LIPHSLC7A7S1PR2 SCN9A IKBKB TRAF3IP2 EDN3 TERF1 UACA LMNB1 HTR2C MEFV MGRN1 MAPK14 B4GALT1 MAP4K5 MDM2 KCNIP3 IL17REGNAZ Human Mouse (a) (b)

Figure 2: The consensus networks for skin of human and mouse: (a) for human and (b) for mouse.

consensus networks (Figure 2). 44 significantly overrepre- genes and response to organic substance were common in sented categories were identified (Figure 3). Most of the three clusters. positive regulations of cellular process were clustered into Furthermore, we found that four GO terms were highly two independent modules. More than half of the common enriched, and these terms were specific in human compared genes among three clusters were involved in regulation of to mouse. These four GO terms could be divided into three cellular and metabolic processes, such as negative regulation categories (Figure 4). The regulation of secretion category of nucleobase-containing compound metabolic process and contains two GO terms: regulation of secretion and regula- regulation of macromolecule biosynthetic process. Some of tion of secretion by cell. the most significant categories were response to stimulus, regulation of localization, negative regulation of biological 3.3. miRNA Target Sites in IGs. miRNAs are implicated process, cell differentiation, single-multicellular organism in integument development. To examine how the miRNAs process, and response to organic substance. Analysis of the interacted with the gene networks under integument, we modules showed that the majority of the response to stimulus used miRanda to predict the miRNA targets for IGs. Total 4 BioMed Research International

Figure 3: Overrepresented GO categories in the integument gene list of human and mouse. The figure was drawn by Cytoscape 3.3.0 and shows enriched GO categories for common genes in three clusters. The pie chart shows the leading group term in each of the GO categories. ∗∗ ∗ The kappa score threshold was set to ≥0.5. If the term/group is oversignificant𝑝 ( value < 0.001). If the term/group is significant (0.001 <𝑝value < 0.05).

% genes/term 01 23456789101112 Regulation of secretion 89∗∗ Regulation of secretion by cell 80∗∗ Cell-cell signaling 143∗∗ G-protein coupled receptor 90∗∗ signaling pathway

∗∗ Figure 4: Specific GO term enrichment as estimated for the human versus mouse. If the term/group is oversignificant𝑝 ( value < 0.001). ∗ If the term/group is significant (0.001 <𝑝value < 0.05).

62,637 miRNA binding target sites were predicted for 1627 datasetof1915mousemiRNAsusedfortheanalysis.Asingle human IGs selected and the dataset of 2588 human miRNAs miRNA can regulate hundreds of transcripts and a single used for the analysis. Total 26,280 miRNA binding target transcript can have binding sites for multiple miRNAs of the sites were predicted for 1627 mouse IGs selected and the same or different sequence. To determine the portion of the BioMed Research International 5

Table 1: Top 10 miRNA families for IGs of human and mouse.

Human miRNA family Target number Mouse miRNA family Target number 1 hsa-miR-548 4643 mmu-miR-466 1704 2 hsa-miR-1273 547 mmu-miR-467(669) 956 3 hsa-miR-30 412 mmu-let-7 235 4 hsa-miR-520 332 mmu-miR-297 222 5 hsa-miR-302 281 mmu-miR-6951 176 6 hsa-miR-3689 277 mmu-miR-3473 173 7 hsa-miR-4763 273 mmu-miR-30 166 8 hsa-miR-619 266 mmu-miR-7116 154 9 hsa-miR-4775 262 mmu-miR-181 140 10 hsa-miR-513 241 mmu-miR-7661 133

Scale 100 kb mm10 chr2: 10,400,000 10,450,000 10,500,000 10,550,000 UCSC genes (RefSeq, GenBank, tRNAs, and comparative genomics) AK076687 Sfmbt2 Sfmbt2 mir-466m mir-467a-4 mir-467d Sfmbt2 mir-669f mir-467a-3 mir-466 Sfmbt2 Sfmbt2 mir-669e Gm25285 mir-669m-2 Sfmbt2 Sfmbt2 mir-669b mir-669a-1 mir-466n mir-669d mir-467a-4 mir-669o mir-669l mir-466b-4 mir-466g mir-669d-2 mir-669a-1 mir-466l mir-297a-3 mir-467c mir-467a-4 mir-669i mir-466b-4 mir-466b-4 mir-669h mir-669a-3 mir-467a-4 mir-669k mir-466b-2 mir-467a-9 mir-669a-1 mir-466b-8 mir-467a-4 mir-669a-1 mir-466b-4 mir-669g mir-669a-1 mir-669j mir-467a-3 mir-467a-2 mir-466b-3 mir-466b-8 mir-669a-1 mir-669a-1 mir-669a-2 mir-467b mir-669m-1 mir-466c-2 mir-669a-1 mir-467a-6 mir-466c-2 mir-669a-1 mir-467a-4 mir-466b-4 mir-669a-1 mir-466b-4 Mouse ESTs that have been spliced Spliced ESTs Simple nucleotide polymorphisms (dbSNP 142) found in ≥ 1% of samples Common SNPs(142) Detailed visualization of RepeatMasker annotations RepeatMasker Viz. Repeating elements by RepeatMasker RepeatMasker

Figure 5: The chromosome location of mmu-miR-466 and mmu-miR-467 family.

main miRNA in the total miRNA target file, we selected the 3.4. Construction of a miRNA-mRNA Regulatory Network. top 10 miRNA families based on the target number of IGs Two miRNA prediction databases miRTarBase and miRDB (Table 1). The hsa-miR-548 family has predicted 4643 target were used to verify the results of miRanda. The intersections sites distributed on 541 IGs of the human. The mmu-miR- targets of three algorithms were selected to construct the 466 family and mmu-miR-467 family have predicted 1,704 miRNA-mRNA regulatory network (Table S2). The resulting target sites and 956 target sites, respectively, distributed on miRNA/target mRNA pairs and gene networks for human 375 and 310 IGs of the mouse. The hsa-miR-548 family is and mouse were visualized in Cytoscape, where edges from widely distributed in the whole . However, miRNA to genes represent a potential regulatory relationship bothofthesetwomiRNAfamiliesarelocatedinanintron and edges from gene to gene represent an expression correla- of Sfmbt2 gene on of the mouse (Figure 5). tion. The network consisted of two almost separate groups, These three miRNA families have common target genes which could be connected when miRNAs is added to the (Figure 6) and might present some similarity function to skin expression gene network of human (Figure 7). These miRNA- morphogenesis. target interactions were added to the network of the mouse, 6 BioMed Research International

Hsa-miR-548 Mus-miR-466 Analysis of the GO terms category showed that the majority of the positive regulations of cellular process and response to stimulus genes were common in three clusters, 65 295 which may be due to the perception function of skin to in vivoandinvitroenvironments[30,31].Inotherwords,these 118 common categories in human and mouse skin were involved in protection [32, 33], sensation [34], temperature regulation 121 [34], immunity [35], exocrine, and endocrine. By comparing 61 the GO terms for those genes in expression network, the term 71 of regulation of secretion category was enriched in human skin. Human has eccrine glands and thermal apocrine glands. Apocrine glands are always associated with hair follicles, and eccrine sweat glands cover almost the entire body surface 57 ofhuman[36],whileeccrinesweatglandsinthemouseare found only on the footpads [37]. These results might provide an important insight into the evolution of thermal eccrine Mus-miR-467 system in human. The secretion category of the genes might Figure 6: The Venn diagram of the target gene list of three miRNAs. be involved in the eccrine system in human. miRNA plays an essential role in the regulation of skin development and morphogenesis [38]. miRNA family is a group of miRNAs that share common seed sequences, which which have not significantly affected the consensus networks has a similar regulation function [39]. We found that the in mouse skin (Figure 7). hsa-miR-548 family has the highest amount of target sites among the identified miRNAs in human and the mmu-miR- 4. Discussion 466 and mmu-miR-467 familiesaretoptwointhemiRNAs list predicted in the mouse. Generally, increasing the number Recently, a lot of biology databases have been published. of target sites within a gene improves the efficiency of miRNA How to integrate these resources in system biology is a regulation [40]. More target sites provide more opportunity bigchallengeforanalysisofcertainbiologicalproblems for recognition by miRNA and increase the kinetics and over- [23]. In this study, we have integrated the MGI, GEO, and all level of transcript regulation [41]. miR-548 is a - miRNA databases to analyze the genetic regulatory networks specific miRNA family, which has 69 members distributed under morphology difference of integument of human and in almost all human chromosomes [42]. The hsa-miR-548 mouse. 1627 mouse/human orthologous genes related with family is involved in multiple biological processes, such as sig- integument phenotype were deposited in MGI database. naling pathways, immunity, and osteogenic differentiation, These genes were widely scattered across the mouse genome and some [43–47]. hsa-miR-548 takes part in IFN and the human genome, which suggest that the integument signaling which responds to the virus and bacterial infections is a quantity phenotype with polygenic determinism. We also on the cell [46]. hsa-miR-548 also can turn down the host have constructed the expression correlation networks of IGs antiviral response by degradation of IFN-𝜆1 [43]. UVB irra- with the gene expression matrix from the GEO database. diations can downregulate hsa-miR-548 of human epidermal With the process of evolution, the organisms have increased melanocytes cell [48]. These results suggested that hsa-miR- in complexity. However, the organism’s complexity is not 548 might contribute to dynamic regulatory network of skin simply associated with the number of its genes [24]. Now, of human. Compared to human, mmu-miR- the interaction degrees as a straightforward detector were 466 and 467 families only located in intron 10 of Sfmbt2 used to evaluate the complexity of gene networks [25]. Our genes. This area of intron 10 has a larger cluster of miRNAs, results revealed that the networks of human skin are much which is specifically present in mouse and rat [49]. Based morecomplexthanthoseofmousewhichmightbeexplained on miRBase database (http://www.mirbase.org/) definition, by the fact that the integument structure at the anatomical clustered miRNAs are a group of miRNAs located within levelismuchmorecomplexinhuman,comparedwithmouse 10 Kb of distance on the same chromosome. In this study, [26]. Many researches reported that one gene could evolve mmu-miR-466, mmu-miR-467,andmmu-miR-669 clusters new function to adapt to the change of environment during haveonecorepromoterregionandtranscriptionalstartsite species divergence, from lower to higher species [27, 28]. shared with the Sfmbt2 gene. The expressions of mmu-miR- Erwin and Davidson reported that the reorganizations of 466 and mmu-miR-467 markedly waved during the hair gene networks could change the animal morphology and follicle cycling in mouse [50] and were downregulated in the basic of the evolution is regulatory changes within a melanoma of mouse by curcumin diet [51]. And histone gene regulation network [29]. These results also confirm our deacetylation and metabolic oxidative stress can induce the hypothesis that those key genes related with the hair coat exist activity of mmu-miR-466 [52]. Skin serves as a barrier in every mammalian genome and the diverse morphology between body and pathogens (disease causing organisms), in mammal just because of the difference in gene regulation which is part of the innate immune system [53]. Both of networks. hsa-miRNA-548 and mmu-miR-466 and mmu-miR-467 can BioMed Research International 7

POLG

ARID3ACYSLTR2

BARX2 ADORA1 ATE1 SLC38A10CYP19A1 MIB1 IKBKB mmu-miR-466b-3pNRAS E2F5 FES mmu-miR-466p-3p STAT5B LATS1GNAO1 TRAPPC6A mmu-miR-466c-3p PRLR SLC6A4GABRA2TCF15 RARGSREBF2 AGPAT2 WDR37 hsa-let-7b-5p PRL GRIK1HTR3AMKL1FGF23KISS1RVGFARXSLC12A1INPP5D HLX LIPE mmu-miR-340-5p GPRC6ACLDN6FGF5SCN11AKRT76HTR1FPTGIRTLL1CSF2OXTRTOM1L2TYRO3ACADS mmu-miR-466m-3pmmu-miR-466o-3pTGFBR1 hsa-miR-18b-5p RGS4GRIA1GSCGALR2GPR3RSPO2TMC1LMX1BAVPR2NTRK1TCF21IL4RLN1SNAP25CITED1CHRNA4OPRM1NPBADCY1KCNIP3CHST8SEMA6CHOXD1LTC4SGLRA3VHLIL17RARXRBPORCNCDKN2C MAPK14 hsa-miR-6740-5phsa-miR-1294hsa-miR-98-5p SOX21NMUR2RLBP1CLRN1MTA2FGF20NMUR1ADAMTS20OPRK1PTCH2KCNJ11NAGSCLPSANK1EMX1CLCN1FOXP3CNR2GBX1NTSR1AICDACHRNB4GABRR1AGTR2PARK2HOXB9GRM7MUSKCHRNGGFRA2SLC4A1DUSP26ARHGAP1ADRBK1ERCC2FGF6 NR1H2P2RX7 CHEK1mmu-miR-7b-5p hsa-miR-18a-5pESR1 CCDC57TREML1RHBDL1HOXD13OPRD1SOX11TREX2P2RX6MYOGALX4GATA1CDH23AQP2TMPRSS6SLC17A6HFE2CCR10AMHR2RXFP1PDX1CTSESRCAGAP2FGATFF1EDARADDCD19PAHLIPICSN2IL13PRXSLC39A3POLD1TWF1CCM2TRPC4AP MAPK1 hsa-miR-6124hsa-miR-3919hsa-miR-3927-3p OPRL1NTSR2SLC5A7CACNA1SFOXB1SYT7PTGER1DRD2PROKR2CLEC1BLMX1ANOS2FSHRCDH8NKX2-5USH1CBRSK1CD40LGPOU3F2GHRHRCASRHTR2CCACNA1BSLC45A2IL10FOXA2CRHBRSK2DLGAP3CYP7A1DBHWNT10ACARD11SLC4A4GRM5BCL2HCN2ETV4TRPV4ZFPM1HOMER3CHRNB2ANO1EMILIN1 CNR1 MEGF8 hsa-let-7f-5p CLP1PDE6BPDYNPRDM8GFI1BCDK5R2ZAP70STOML3ASCL1POMCHAP1HOXB13GRM1APOA1ATOH1TRPM3RXFP2HCN1SLC12A5OXTSOX2CSN3SCN10AESR2ALAS2TRPA1HNF1ACSF3GRIK5TOP3BPLCB2HSF1ZBTB17NR4A1HOXC8 RAD18 hsa-miR-22-3p FOXG1VWA1PRKCAGATA4WNT7AHOXB1POU3F4TNFSF11CAMK4PITX3P2RX3PLD4CAMK2ALALBASSTR4KIF1AAGERRAC3GNRHRNRP2SHHMYO7APNOCCCKBRITGB3GPC5HTR7PDGFBECE1CDKN2ANFKB2 mmu-miR-1952 SLC6A3BTN1A1COL19A1KSR1MC5RTPH2CNTNAP2BCL2L1FCRLA mmu-miR-5101 IGF1R GATA5 SOX18PIK3CG JDP2 B3GNT3 MTHFRPRKAR1BBCL2L11hsa-miR-5589-5p ERN1 IDS FGF11 PTGS2 FARP2 SSTR4RXRA mmu-miR-342-3p hsa-miR-8485 hsa-miR-24-3p mmu-miR-466l-5pGATA4 OPRL1 WFS1mmu-miR-466i-3p mmu-miR-126a-5p hsa-miR-30e-5p hsa-miR-32-5p ALDH16A1PSEN1ZBTB17 TLR4XPC GABBR1mmu-miR-495-3pF2RL1 mmu-miR-136-5p hsa-miR-137 MYO5A USH1CTOM1L2TRPC4APFGF6STK11 ANK1 SFN PLEKHG5GRIA1 hsa-miR-548c-3p PLG ABCC6 hsa-miR-23b-3p MMP12mmu-miR-466f-3p FIGN SLC19A1 PTEN RAG1 EDN3 CNTNAP2CAMK2AGRIK5 LIPI TSC2 BDNFmmu-miR-3082-5pLEF1F2 PAX1 MAPT hsa-miR-21-5pNCOA3hsa-miR-20b-5phsa-miR-301b-3phsa-miR-19a-3p hsa-miR-23a-3p ATP8A2GBX1mmu-miR-377-3pAP3D1SNIP1SMOC1POFUT1VEGFA hsa-miR-17-5p PTGDS CASR SRSF4 SHH hsa-miR-20a-5phsa-miR-106b-5p hsa-miR-26a-5p GNRHR MCOLN3FGF9FGF10ADORA1KIF1A PNPLA6MPV17PPP2R5D GNAQ STAT3 hsa-miR-3666 DLG4hsa-miR-203a-3p OTC CACNA1B KCNA6BRSK2 GNAO1THRAPLCL1GFRA2 mmu-miR-297b-5p hsa-miR-106a-5pLDLR EGLN2 PDYN PI4KA OXTMORF4L1PPM1F ATRNCD28mmu-miR-297c-5p E2F1 PAX3 ETS2KCNIP3NAB2RUNX3 hsa-miR-451b IRF1 hsa-miR-19b-3p PRDM6ARX STOML3P2RX3 TRPM3 P2RX6CDKN2CPRLCLCN6 SOD2 KCNJ16 hsa-miR-497-5pKL hsa-miR-4668-5p SCN9A GABRA4 DLG4TNFRSF19CISD2PAQR3 SNAI2TBX15SNAP25 CASK GABRA2 mmu-miR-466l-3p CACNA1AARIH2 RPS6KA3 PRLR TULP3NGFRGRASPHOXC8 VEGFA hsa-miR-93-5p SLC12A5EGFR DDR1 WDR48CDK7 TCF7L2SLC4A4 PTPRBHTT mmu-miR-3062-5p CRKL PCSK2POMC TMC1GRIA2GATA5MAP2K7T NF1 PTCH1PTGDR NEO1PRDM8TACR1 DMTF1hsa-miR-199a-5p CSF1 HTR1F GRIA4 COQ9SLC17A7ADAMTS9RXRBCDC73 FECH mmu-miR-669n hsa-miR-199b-5p POU3F4RHOT2BECN1CHRNGUBE3B GIT2 mmu-miR-297a-5pTAB2CD274 GRIA3 hsa-miR-130b-3phsa-miR-301a-3p NCOA1MED1 MTDHPPARASLC12A6 KCNJ1OXTRPIRT CASKCD4TNFACPP mmu-miR-574-5p hsa-miR-449ahsa-miR-1277-5p CDH8 LMX1BUBE2BGNAS MBD5MIB1 MDM2ECE1 MDM4 hsa-miR-16-5p GJA1 HTR2C MAP3K1 CARD11PAX6 CBSNCOA2 BMPR1ABIRC6PPM1AAEBP2OPHN1RHBDF1TCEB3 OPRM1 NOTCH1 hsa-miR-15b-5p hsa-miR-130a-3p MTA1 HOXB1CLDN6 PLCG1MMP24 EGF IDUA SIK2CREB5CHRNB4NQO2 hsa-miR-4703-5pDDR1 hsa-miR-215-5p ZEB1 HOXB13MADD RCE1SCN10A KISS1R YES1 MAF NAB1 CASP8 hsa-miR-34a-5p hsa-miR-192-5p hsa-miR-200a-3p NRG3SSFA2 GBF1DOPEY2 KDM4CPARK2MSH3RB1ZMPSTE24TERF2IP KALRNWRN BCL2 TERF1 mmu-miR-466i-5p NTSR2EDARKIF1B SENP1PGGT1B AR HIRATSC1FREM2 CRIM1 RUNX1 MAP4K5YY1PPP1R15BALDH2DSC3SERTAD2hsa-miR-107 hsa-miR-141-3p GAB1 GAKMAPK7 CDKAL1 NCOR1SMCHD1GNA13SOCS5 mmu-miR-466p-5p TMEM79DICER1hsa-miR-103a-3p EDARADD CBX7 SMARCA2ATR POLBCDC73NBL1CTNND1AARS ZBTB20 hsa-miR-429 E2F3 MAP2K2ERCC2SLC39A3ITPR2 PTPN11SLC6A2 NEK1 FARP2mmu-miR-9-5p mmu-miR-466a-5p CHUK KIF1BGOLM1HIF1Ahsa-miR-15a-5pRINT1 MEOX2RPS20hsa-miR-200b-3pPTPRZ1 HOMER2 USP4 ARNT NAV2 MIA3ATM GSTP1 PERPNPY1RGNAQLAMP2 hsa-miR-200c-3p CDK5R1MAP3K11NIT1 TAP1 MNTPOU2F1RB1CC1 mmu-miR-1187 mmu-miR-466e-5p RFWD2KIFAP3KITCTSATK2EVPLRAC1PAK1IP1RPL27A KDRADAMTS5COL1A2 ZFPM2 GHRHR EZH1SIRT2 HERC3ERBB2 ARIH2 SMARCA2ASPABBS2SHOC2ARID4ACDK7SLC20A1 OPRD1 RINT1BRD1VPS33ALMF1MSH2SYNE2 CCDC57VDR ITPKBSIK3 ERBB2IP S1PR2 LBHSCARF1ERCC3 NOTCH2JAG1PTGES3FMR1DBIRPL38YAP1CHD2CD34RPL24THBS2A2MEBF1LMO2COL3A1 FAAHSLC11A2 MKLN1BRD7BBS4 MAVS DHCR24CISD2ZMPSTE24FBXW7NCOR1NDUFS4ATP2C1ROBO1PROS1SNX5BIRC6YES1ZDHHC13TNFRSF1AAPLP2HDAC1TNIP1NXF1ARCN1CRIM1 S100A10LUM CDK5 PLCB3 RAF1SOS1NFE2L1 NR2C2 SOD1 IFT27TGFBR1APRTMRPS35BUB3LRRK2TERF2IPTRPV1CDK4LBRPRMT5ERP44HMGN1ERCC5NFE2L1FBXO11NPM1SDC1SOD1RECK CDH5 FN1 TFPI BSCL2 ERP44HPS3XIAPXPA ROCK1 XPAMAPK8SP3POFUT1TBC1D20SMAD2OTCTSG101MAP2K1GABBR1TPH1BSCL2ARPC2RALBP1BMPR1ASMARCA5ERCC1PTPN6OPHN1NCOA1ILKUSP9XMAPK6ARL6IP5AEBP1PDGFRBIGFBP7 GJB3 CLPPKDM4BYY1 PHF2MCOLN1CREBBPTAX1BP1ALDH2 SMC6SIRT6 hsa-miR-223-3pASB1HDAC4MYBUBE3AUSF2PSTPIP2SMC6FOXO3NPEPPSDDB2SUPV3L1IRS2TSC1ATF4NDFIP1RAPH1RUNX1T1RELN COL1A1 APCCHD2 FOXO1 CBX6 CLCN6ITPAATRPHF2LIPANFE2SLC25A37NSUN2EZH2PPM1ATBK1WRNSTIM1ITGB5BRD1TAX1BP1CBLB COL5A1COL5A2SLIT2 TRAF3IP2RAE1YAP1FBXW7 MECP2 USP4ATOX1PLCB3MED31NR2C2DNM1LPFASFTOLRP4MGAT1PNNhsa-miR-6838-5p DCNDPT mmu-miR-466d-5pHMOX2 FOXO3PRDM16 mmu-miR-1192 POU2F1 PTPRSRELAICMTTIMM23PPP1R13BMYO6IRF2 JAG2MKLN1 FMOD FSTL1SCARA5 STIM2 MAF1 VAC14TFAM T SIRT6CREBBPF3NOTCH3BIN1TRIM21GUSB SGSHHPS4 ATP2C1 MAPKAPK5KRAS CBY1CBL TNRC6AAFAP1L2SPENRAF1RPS19GAK CDK5RAP2 RNF8 CLOCKAGAmmu-miR-298-5p TCF3 PYGO2HOXD9 PDGFASRA1 FBN1FBLN5 DICER1 CELSR1LIPH PKNOX1TRAF3IP2MPP5 PHB mmu-miR-466n-5p PYGO2HDAC4LRIG2 CENPJ hsa-miR-424-5p PRKDC NCOA3 BBS2 SLC7A7LMNB1 GBF1 ETS1 LSR MEFV UACAhsa-miR-320aCOL5A3 ICMT HSF1OSTM1mmu-miR-15a-5pmmu-miR-669f-3p hsa-let-7c-5pB4GALT1hsa-miR-101-3p ABCD4 mmu-miR-22-3p hsa-miR-150-5p mmu-miR-132-3pmmu-miR-182-5pmmu-miR-25-3p ARID4A MDM2MGRN1 hsa-miR-125b-5p mmu-miR-466k HNF1A FGFR1 NOTCH1MAEA IFT27 IL17REGNAZ hsa-miR-29b-3p hsa-miR-3929 mmu-let-7b-5p hsa-miR-149-3p NR2F6 AKT2

hsa-miR-29a-3p mmu-miR-34a-5p Human Mouse (a) (b)

Figure 7: The interaction networks between IGs and miRNA family. (a) For human and (b) for mouse. The purple nodes represent miRNAs and the blue nodes represent target gene.

inhibit the host immunity response [54]. It is necessary to further detect the role of these miRNAs, miRNA/IGs specific carryouttheresearchonhowthesemiRNAscontributeto regulatory networks were added in IGs expression correlated integument morphogenesis in the future. When we added networks, which will advance the dimension to skin regula- the relationship of miRNAs and their target genes by three tion networks. hsa-miR-548, mmu-miR-466,andmmu-miR- prediction algorithms to the expression correlation networks, 467 have an enormous number of targets on IGs, which those two clusters in gene regulatory networks of human both have the role of inhibition of host immunity response. have been integrated, which add a new dimension to genetic The regulations of factors to downstream genes networks under human integument. The changes of the hub and miRNA to transcription factors affect the spatial and gene in gene regulatory networks result in the evolutionary temporal expression of genes during skin morphogenesis. alternation and the morphological difference [29]. Our results provide a new avenue to understand the genetic networks basis of skin difference between human and mouse. 5. Conclusions Competing Interests Genetic networks variants have been shown to affect many complex traits, including integument morphology. In this The authors declare that there are no competing interests. study, we try to compare the regulatory networks of miRNAs andIGsintheskinofhumanandmouse.Wedownloaded mRNAexpressiondatainhumanandmouseskinfromthe Authors’ Contributions GEO database to create within-species consensus networks by WGCNA. There is a big difference in consensus networks Jianghong Wu and Husile Gong contributed equally to this between human and mouse; human consensus networks are work. more complex compared to mouse. Two principal regulatory networks were found in human: one module contains 286 IGs Acknowledgments specifically involved in secretion, whereas the other module contains 250 IGs which are enriched for cellular response to The authors thank Lizhong Ding, Department of Biology, stress and catabolic process. The secretion category, which is IndianaStateUniversity,USA,forRscriptsandthank specific category for human compared to mouse, of the genes Ethan Rath, Department of Biology, Indiana State University, mightbeinvolvedineccrinesysteminhuman.Then,total USA, for polishing manuscript and thank the Center for 62,637 miRNA binding target sites were predicted in human Genomic Advocacy at Indiana state University for provid- IGs, while 26,280 miRNA binding target sites were predicted ing the workspace and cutting-edge computational server. in mouse IGs. The interactions between miRNAs and IGs This work was supported by the National Natural Science of human are also more complex compared to mouse. To Foundation of China (31560623 to Dr. Jianghong Wu and 8 BioMed Research International

31260538/31560622 to Dr. Wenguang Zhang), the Natu- [16]M.A.Osterhoff,T.Frahnow,A.C.Seltmannetal.,“Identifi- ral Science Foundation of Inner Mongolian (2013MS0414), cation of gene-networks associated with specific lipid metabo- the Innovation Foundation of IMAAAHS (2013CXJJM09 lites by Weighted Gene Co-Expression Network Analysis to Dr. Jianghong Wu), State Key Laboratory of Genetic (WGCNA),” Experimental and Clinical Endocrinology & Dia- Resources and Evolution, Kunming Institute of Zoology, betes,vol.122,no.3,2014. Chinese Academy of Sciences (GREKF13-02), and the China [17] P. Langfelder and S. Horvath, “WGCNA: an R package for Postdoctoral Science Foundation (2014M562351 to Jianghong weighted correlation network analysis,” BMC , Wu). vol. 9, article 559, 2008. [18] P. Shannon, A. Markiel, O. Ozier et al., “Cytoscape: a software Environment for integrated models of biomolecular interaction References networks,” Genome Research, vol. 13, no. 11, pp. 2498–2504, 2003. [1] O. F. Chernova, “Skin derivatives in vertebrate ontogeny and [19]G.Bindea,B.Mlecnik,H.Hackletal.,“ClueGO:acytoscape phylogeny,” Biology Bulletin,vol.36,no.2,pp.175–183,2009. plug-in to decipher functionally grouped gene ontology and [2] P. F. A. Maderson, “Mammalian skin evolution: a reevaluation,” pathway annotation networks,” Bioinformatics,vol.25,no.8,pp. Experimental Dermatology,vol.12,no.3,pp.233–236,2003. 1091–1093, 2009. [3] W.J. Murphy, E. Eizirik, W.E. Johnson, Y.P.Zhang, O. A. Ryder, [20] D.Betel,M.Wilson,A.Gabow,D.S.Marks,andC.Sander,“The and S. J. O’Brien, “Molecular and the origins of microRNA.org resource: targets and expression,” Nucleic Acids placental mammals,” Nature,vol.409,no.6820,pp.614–618, Research,vol.36,supplement1,pp.D149–D153,2008. 2001. [21] C.-H. Chou, N.-W. Chang, S. Shrestha et al., “miRTarBase [4]M.PagelandW.Bodmer,“Anakedapewouldhavefewerpar- 2016: updates to the experimentally validated miRNA-target asites,” Proceedings of the Royal Society of London B: Biological interactions database,” Nucleic Acids Research,vol.44,no.1,pp. Sciences, vol. 270, supplement 1, pp. S117–S119, 2003. D239–D247, 2016. [5]H.W.Zhu,D.Shang,M.Sunetal.,“X-linkedcongenital [22] N. Wong and X. Wang, “miRDB: an online resource for hypertrichosis syndrome is associated with interchromosomal microRNA target prediction and functional annotations,” insertions mediated by a human-specific palindrome near Nucleic Acids Research, vol. 43, no. 1, pp. D146–D152, 2015. SOX3,” American Journal of Human Genetics,vol.88,no.6,pp. [23] D. Gomez-Cabrero, I. Abugessaisa, D. Maier et al., “Data 819–826, 2011. integration in the era of omics: current and future challenges,” [6] R. M. Rashid and L. E. White, “A hairy development in hyper- BMC Systems Biology,vol.8,supplement2,articleI1,2014. trichosis: a brief review of Ambras syndrome,” Dermatology [24] C. Vogel and C. Chothia, “Protein family expansions and Online Journal,vol.13,no.3,article8,2007. biological complexity,” PLoS Computational Biology,vol.2,no. [7] J. T. Eppig, J. E. Richardson, J. A. Kadin, M. Ringwald, J. A. 5, article e48, 2006. Blake,andC.J.Bult,“MouseGenomeInformatics(MGI): [25]K.Xia,Z.Fu,L.Hou,andJ.-D.J.Han,“Impactsofprotein- reflecting on 25 years,” Mammalian Genome,vol.26,no.7-8,pp. protein interaction domains on organism and network com- 272–284, 2015. plexity,” Genome Research,vol.18,no.9,pp.1500–1508,2008. [8] J. Hogan, C. T. O’Connor, A. Azizl et al., “Application of the [26] A. E. Vinogradov, “Human more complex than mouse at Gene Expression Omnibus (GEO) to generate and validate cellular level,” PLoS ONE,vol.7,no.7,ArticleIDe41753,2012. consensus transcriptomic profiles that accurately differentiate [27] J. Wu, Husile, H. Sun et al., “Adaptive evolution of Hoxc13 genes nodal status in colorectal cancer,” Irish Journal of Medical in the origin and diversification of the vertebrate integument,” Science, vol. 181, pp. S181–S181, 2012. Journal of Experimental Zoology Part B: Molecular and Develop- [9] R. Yi, M. N. Poy, M. Stoffel, and E. Fuchs, “A skin microRNA mental Evolution,vol.320,no.7,pp.412–419,2013. promotes differentiation by repressing ‘stemness’,” Nature,vol. [28]J.H.Wu,H.Xiang,Y.X.Qietal.,“Adaptiveevolutionofthe 452,no.7184,pp.225–229,2008. STRA6 genes in mammalian,” PLoS ONE,vol.9,no.9,Article [10] S. Bello, C. L. Smith, H. Dene et al., “Integrating large- ID e108388, 2014. scale phenotyping data into the mouse genome informatics [29] D. H. Erwin and E. H. Davidson, “The evolution of hierarchical database,” Transgenic Research,vol.23,no.5,p.850,2014. gene regulatory networks,” Nature Reviews Genetics,vol.10,no. [11] M. Carlson, S. Falcon, H. Pages, and N. Li, “org.Mm.eg.db: 2, pp. 141–148, 2009. Genome wide annotation for Mouse,” 2012. [30] M. Mildner, J. A. Jin, L. Eckhart et al., “Knockdown of filaggrin [12] M. Carlson, S. Falcon, H. Pages, and N. Li, org. Hs. eg. db: impairs diffusion barrier function and increases UV sensitivity Genome Wide Annotation for Human,RPackageVersion,2013. in a human skin model,” Journal of Investigative Dermatology, [13] T. Barrett and R. Edgar, “Gene expression omnibus: microarray vol. 130, no. 9, pp. 2286–2294, 2010. data storage, submission, retrieval, and analysis,” Methods in [31] A. Slominski, D. J. Tobin, M. A. Zmijewski, J. Wortsman, and Enzymology, vol. 411, pp. 352–369, 2006. R. Paus, “Melatonin in the skin: synthesis, metabolism and [14] G. Van Peer, S. Lefever, J. Anckaert et al., “miRBase Tracker: functions,” Trends in Endocrinology and Metabolism,vol.19,no. keeping track of microRNA annotation changes,” Database,vol. 1,pp.17–24,2008. 2014, Article ID bau080, 2014. [32] E. Fuchs, “Scratching the surface of skin development,” Nature, [15] S. Durinck, Y. Moreau, A. Kasprzyk et al., “BioMart and vol.445,no.7130,pp.834–842,2007. Bioconductor: a powerful link between biological databases and [33] L.-H. Gu and P. A. Coulombe, “Keratin function in skin microarray data analysis,” Bioinformatics,vol.21,no.16,pp. epithelia: a broadening palette with surprising shades,” Current 3439–3440, 2005. Opinion in Cell Biology,vol.19,no.1,pp.13–23,2007. BioMed Research International 9

[34] A. A. Romanovsky, “Skin temperature: its role in thermoregu- [51] I. N. Dahmke, C. Backes, J. Rudzitis-Auth et al., “Curcumin lation,” Acta Physiologica,vol.210,no.3,pp.498–507,2014. intake affects miRNA signature in murine melanoma with [35] M. Pasparakis, I. Haase, and F. O. Nestle, “Mechanisms reg- mmu-miR-205-5p most significantly altered,” PLoS ONE,vol. ulating skin immunity and ,” Nature Reviews 8, no. 12, Article ID e81122, 2013. Immunology,vol.14,no.5,pp.289–301,2014. [52] A. Druz, M. Betenbaugh, and J. Shiloach, “Glucose deple- [36] G. E. Folk Jr. and A. Semken Jr., “The evolution of sweat glands,” tion activates mmu-miR-466h-5p expression through oxidative International Journal of Biometeorology,vol.35,no.3,pp.180– stress and inhibition of histone deacetylation,” Nucleic Acids 186, 1991. Research,vol.40,no.15,pp.7291–7302,2012. [37] D. K. Taylor, J. A. Bubier, K. A. Silva, and J. P. Sundberg, [53] J. A. Segre, “Epidermal barrier formation and recovery in skin “Development, structure, and keratin expression in C57BL/6J disorders,” The Journal of Clinical Investigation,vol.116,no.5, mouse eccrine glands,” Veterinary Pathology,vol.49,no.1,pp. pp. 1150–1158, 2006. 146–154, 2012. [54]Y.Li,X.Fan,X.Heetal.,“MicroRNA-466linhibitsantiviral [38] R. Yi and E. Fuchs, “MicroRNA-mediated control in the skin,” innate immune response by targeting interferon-alpha,” Cellular Cell Death & Differentiation,vol.17,no.2,pp.229–235,2010. and Molecular Immunology,vol.9,no.6,pp.497–502,2012. [39]Q.Zou,Y.Mao,L.Hu,Y.Wu,andZ.Ji,“miRClassify:an advanced web server for miRNA family classification and annotation,” Computers in Biology and Medicine,vol.45,no.1, pp.157–160,2014. [40] B. Gentner, G. Schira, A. Giustacchini et al., “Stable knockdown of microRNA in vivo by lentiviral vectors,” Nature Methods,vol. 6,no.1,pp.63–66,2009. [41] B. D. Brown and L. Naldini, “Exploiting and antagonizing microRNA regulation for therapeutic and experimental appli- cations,” Nature Reviews Genetics,vol.10,no.8,pp.578–585, 2009. [42] T. Liang, L. Guo, and C. Liu, “Genome-wide analysis of mir-548 gene family reveals evolutionary and functional implications,” Journal of Biomedicine and Biotechnology,vol.2012,ArticleID 679563, 8 pages, 2012. [43]Y.Li,J.Xie,X.Xuetal.,“MicroRNA-548down-regulateshost antiviral response via direct targeting of IFN-𝜆1,” Protein and Cell,vol.4,no.2,pp.130–141,2013. [44]R.Zhou,L.Yang,andC.Liu,“Hsa-Mir-548linhibitsthe invasion and metastasis of non-small cell lung cancer by targeting Erk1 And Pkn2,” American Journal of Respiratory and Critical Care Medicine, vol. 187, Article ID A3437, 2013. [45] W. Gu, J. An, P. Ye, K.-N. Zhao, and A. Antonsson, “Prediction of conserved microRNAs from skin and mucosal human papillomaviruses,” Archives of Virology,vol.156,no.7,pp.1161– 1171, 2011. [46] M. C. Dickson, V. J. Ludbrook, H. C. Perry, P. A. Wilson, S. J. Garthside, and M. H. Binks, “A model of skin inflammation in humans leads to a rapid and reproducible increase in the interferon response signature: a potential translational model for drug development,” Inflammation Research,vol.64,no.3-4, pp. 171–183, 2015. [47] W. Zhang, L. C. Zhang, Y. Zhou et al., “Synergistic effects of BMP9 and miR-548d-5p on promoting osteogenic differentia- tion of mesenchymal stem cells,” BioMed Research International, vol. 2015, Article ID 309747, 9 pages, 2015. [48] Q. Jian, Q. An, D. N. Zhu et al., “MicroRNA 340 is involved in UVB-induced dendrite formation through the regulation of RhoA expression in melanocytes,” Molecular and Cellular Biology, vol. 34, no. 18, pp. 3407–3420, 2014. [49] Q. Wang, J. Chow, J. Hong et al., “Recent acquisition of imprinting at the Sfmbt2 correlates with insertion of a large block of miRNAs,” BMC Genomics,vol.12,article204, 2011. [50] A. N. Mardaryev, M. I. Ahmed, N. V.Vlahov et al., “Micro-RNA- 31 controls hair cycle-associated changes in gene expression programs of the skin and hair follicle,” The FASEB Journal,vol. 24,no.10,pp.3869–3881,2010. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 4241293, 8 pages http://dx.doi.org/10.1155/2016/4241293

Review Article Differential Regulatory Analysis Based on Coexpression Network in Cancer Research

Junyi Li,1,2 Yi-Xue Li,1,2,3,4 and Yuan-Yuan Li2,3,4

1 Key Lab of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China 2Shanghai Center for Bioinformation Technology, 1278 Keyuan Road, Shanghai 201203, China 3Shanghai Industrial Technology Institute, 1278 Keyuan Road, Shanghai 201203, China 4Shanghai Engineering Research Center of Pharmaceutical Translation, 1278 Keyuan Road, Shanghai 201203, China

Correspondence should be addressed to Yi-Xue Li; [email protected] and Yuan-Yuan Li; [email protected]

Received 14 April 2016; Revised 9 June 2016; Accepted 12 June 2016

Academic Editor: Zhenguo Zhang

Copyright © 2016 Junyi Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With rapid development of high-throughput techniques and accumulation of big transcriptomic data, plenty of computational methods and algorithms such as differential analysis and network analysis have been proposed to explore genome-wide gene expression characteristics. These efforts are aiming to transform underlying genomic information into valuable knowledges in biological and medical research fields. Recently, tremendous integrative research methods are dedicated to interpret the development and progress of neoplastic diseases, whereas differential regulatory analysis (DRA) based on gene coexpression network (GCN) increasingly plays a robust complement to regular differential expression analysis in revealing regulatory functions of cancer related genes such as evading growth suppressors and resisting cell death. Differential regulatory analysis based on GCN is prospective and shows its essential role in discovering the system properties of carcinogenesis features. Here we briefly review the paradigm of differential regulatory analysis based on GCN. We also focus on the applications of differential regulatory analysis based on GCN in cancer research and point out that DRA is necessary and extraordinary to reveal underlying molecular mechanism in large-scale carcinogenesis studies.

1. Introduction been revealed, the whole dysregulation mechanisms are far from clear. In the past decade, plenty of computational methods and Cancer is a complex disease and an effective way to study algorithms such as differential analysis and network analysis regulatory role of genes involved in cancer is to summarize [1, 2] are proposed to explore genome-wide gene expression them into network [6]. It is suggested that genes having characteristics with rapid development of high-throughput similar or correlated expression patterns might contribute to technologies and accumulation of big transcriptomic data. the same regulatory function and gene coexpression patterns These efforts in computational genomic area are dedicated revealed by coexpression network analysis may lead to more to transform underlying genomic information into valuable insightful discovery on the underlying regulatory mecha- knowledges in biological and medical research fields [3, 4]. nisms [2, 7]. By comparing the difference of the regulatory Recently, tremendous integrative research aims to interpret networks between cancer and normal status, specific differ- the development and progress of cancers because elucidating ential network of genes can be identified as dysfunctional in molecular regulatory mechanisms, especially the dysregula- cancer. A large number of reverse engineering approaches tion mechanisms, of neoplastic diseases makes great benefit have been developed to construct regulatory network from in medical and pharmaceutical aspects. Although partial gene expression data. For examples, Xiao suggested Boolean different regulatory functions of cancer hallmarks such as model to analyze and stimulate the gene regulatory network evading growth suppressors and resisting cell death [5] have [8]. Some methods based on Bayesian model lead to Bayesian 2 BioMed Research International networks and they are widely applied [9–11]. Nonlinear threshold is selected. After removing the nonsignificant edges differential equation model is also developed to construct the or negligible coordinated gene pairs by the threshold, gene regulatory network [12]. Prior biological knowledge such as coexpression network is constructed by the significantly - (TF-) target regulatory relationships or correlated gene pairs remained as shown in Figure 1(a) [29]. miRNA-target regulatory relationships can also be integrated Weighted correlation network analysis (WGCNA) [27, into modelling framework [11, 13, 14]. These reverse and for- 28] is widely used for constructing coexpression network wardintegratedapproachesaresupposedtohavesmallerfalse based on gene expression data and implementing network positive rate to extract informative insights of transcriptomic analysis [19, 21, 30]. It summarizes clusters of highly cor- behaviors. related genes by defining a continuous network adjacency Although network analysis provides the possibility to which is a power of initial one to reduce the low-adjacency comprehensively understand biological processes, it does gene pairs. WGCNA analyzes the cluster structure and increase the computational complexity. Decreasing the explores the relationships between modules or that between searching space before network analysis is necessary in high modules and genes. GCN topological characters can be well dimension data analysis. An obvious strategy of reducing studiedbyWGCNAandthisgreatadvantagemakesWGCNA the computational burden is to build a subnetwork around a oneofmostusedGCNconstructionmethodsinresearch. givensetofgenessuchaspreviouslyreporteddisease-related There are many coexpression module detection methods genes [15] or around differentially expressed genes [16–18]. provided by WGCNA for users to choose their own preferred Differential expression analysis (DEA) compares the mean one. Meanwhile, choosing a suitable threshold is required for expression value of genes between case and control samples GCN construction in WGCNA. One limitation of WGCNA and identifies significantly differentially expressed genes by is that its GCN construction is undirected. Other prior statistical tests. In current transcriptomic analysis procedure, knowledge is needed if further regulatory analysis based on DEA has become the basic and the very first analysis step. GCN is designed. Recently, differential coexpression analysis (DCEA) Link-based quantitative methods in DCGL [26, 31] increasingly plays a robust complement to DEA [2] and is employ a half-thresholding strategy to construct specific widely used in discovering the system properties of carcino- GCNs. That is, if at least one of the two coexpression values genesis features. By calculating the change of correlations of a specific link exceeds the threshold, the link in both between gene pairs instead of mean expression level, DCEA coexpression networks from two different conditions is kept provides more information about phenotypic change-related [26, 31]. In this way, minute variations are ignored by filtering regulatory network [19–24]. Therefore, differential regulatory out those noninformative links whose correlation values in analysis based on coexpression network may detect more both networks are insignificant. GCNs constructed by DCGL insights into regulatory mechanisms. arealsowithoutdirections. Gaussian graphical model (GGM) is another approach to In this review, we will introduce the paradigm of differ- construct gene coexpression network [32–34]. Based on the ential regulatory analysis (DRA) based on gene coexpression assumption that the covariance of gene pairs follows a mul- network (GCN). We also focus on the applications of DRA tivariate Gaussian distribution, partial correlation between basedonGCNincancerresearchandpointoutthatDRAis genepairsiscalculatedasthedegreeofcorrelationafter necessary and extraordinary to reveal underlying regulatory the effects of other genes are removed. Unlike the fact that mechanism in large-scale carcinogenesis studies. correlations in PCC-based method are calculated by gene pairs themselves, correlations of gene pair in GGM-based 2. Paradigm of Differential methods take into account information of other genes, which Regulatory Analysis Based on Gene makes GGM-based GCN more similar to real biological Coexpression Network network. Some algorithms are proposed with focus on how to Differential regulatory analysis based on gene coexpression infer the structure of gene correlation relationships. Algo- network has been widely used in carcinogenesis regulation rithm for the Reconstruction of Accurate Cellular Networks research and basically includes three procedures as shown (ARACNE) [35, 36] is a method of GCN construction, which in Figure 1: constructing gene coexpression network based also pays attention to partial network properties of GCN by on transcriptomic data, regulatory analysis according to gene counting gene triplets. The Context Likelihood of Related- coexpression network, and differential regulatory compari- ness (CLR) [37] algorithm calculates the relative correlation son between different conditions. basedontheempiricalcorrelationsoversurroundinggenes. MRNET is an iterative feature selection algorithm and uses 2.1. Construction of Gene Coexpression Network. In a gene a maximum relevance and minimum redundancy criterion coexpression network, genes are nodes and their correlations [38]. are represented by the edges of network. Pearson correla- DECODE (differential coexpression and differential tion coefficient (PCC) is the mostly used score to measure expression) combines the information from both differential the tendency of gene expression correlation [25–28]. The coexpression and differential expression to set up the thresh- value of PCC ranges from −1to1andhigherabsolute olds systematically based on a chi-square maximization [39]. value of PCC means higher correlation between gene pairs. A recent GCN construction algorithm is proposed by When constructing gene coexpression network, a correlation Planar Filtered Network Analysis (PFNA) and Multiscale BioMed Research International 3

(a) Construction of coexpression network

Pearson correlation matrix

PCC-based methods (e.g., WGCNA and Coexpression DCGL); network GGM method; MEGENA method

(b) Regulatory analysis according to constructed coexpression network

Integration of prior regulatory Clustering method to discover knowledges (e.g., DCGL2) subnetworks and topological analysis

Transcription factor (TF) Hub gene TF’s target Non-TF’s target Subnetwork

(c) Differential regulatory comparison between different conditions

Normal condition Cancer condition Differential regulatory TF and its targets

Different subnetworks in different conditions

Figure 1: Paradigm of differential regulatory analysis based on gene coexpression network. The paradigm of differential regulatory analysis based on gene coexpression network includes but is not limited to three procedures. (a) Constructing gene coexpression network based on genomic transcriptomic data. (b) Regulatory analysis according to gene coexpression network. (c) Differential regulatory comparison between different conditions. 4 BioMed Research International

Embedded Gene Coexpression Network Analysis (MEGENA) and build a network based on these selected gene pairs [40, 41]. According to this algorithm, GCN is constructed on [47, 48]. Differential regulatory comparison between dif- a spherical surface so that links between gene pairs do not ferent conditions is able to find the differential genes or cross the others. They have advantages in extracting most gene modules across different conditions, providing useful relevant information from similarity matrix from complex information as well. In DCGL v2, differentially coexpressed network system based on topological sphere. TFs are defined as differential regulated genes (DRGs), and DRGs are ranked for prioritizing regulators that are putatively 2.2. Regulatory Analysis according to Gene Coexpression Net- causative to the phenotype of interests in DRrank function in work. After the gene coexpression network is constructed the R package of DCGL2 [25]. For example, RIF algorithm based on the transcriptomic data, regulatory information can in DRrank function combines three types of transcriptomic be extracted by various regulatory analysis methods from the information and assigns a high score to those TFs that are GCN according to research desires as shown in Figure 1(b). “cumulatively most differentially wired to the abundant most The most common way is to use the prior knowledge of TF- differentially expressed genes” [25, 49]. target regulatory relationships or miRNA-target regulatory relationships to highlight the specific regulatory subnetwork [25]. 3. Applications of Differential Regulatory Another important method looking for regulatory ele- Analysis Based on GCN in Cancer Research ments is clustering method. A major goal of coexpression Recently, differential regulatory analysis based on GCN is analysis is discovering biologically related modules or gene applied in more and more cancer studies. In the following, groups. In WGCNA, hierarchical clustering method is used to identify highly correlated gene subnetworks [27]. Sig- we give some examples of its applications and summarize nificantly compact subclusters are verified by the average advantages of this integrative method. shortest path distance within each cluster over the cluster size. Nonnegative matrix factorization clustering method [42–44] 3.1. Revealing Dysfunctional Regulatory Genes and Subnet- is also employed to clustering coexpressed genes with features works in Cancer Research. The direct advantage of differential of interests. regulatory analysis (DRA) is that DRA is able to distinguish According to topological structure of gene coexpression dysfunctional regulatory subnetworks or pathways in cancer network, some interesting characters such as hub genes can status. For example, Jiang et al. constructed highly preserved help to explain regulatory function contained in the GCN. gene ontology biological process (GO BP) gene coexpression Connectivity in GCN presents how a gene connects to other network and prostate cancer coexpression network by using genes and hub genes are the ones with very high level of WGCNA approach. With regulatory analysis they discovered connectivity. Hub genes are normally connections between 548 GO BP coexpression modules and 294 prostate cancer different gene modules and should be a specific research focus coexpression modules. By comparing the difference of these for investigations into cancer-correlated gene modules. For modules, they identified 55 conserved prostate cancer coex- instance, Yang et al. build gene coexpression networks based pression modules [30]. And there are five modules which on transcriptomic and clinical data of four cancer types and are significantly enriched with prostate cancer candidate discovered that prognostic mRNA genes tended not to be genes. These five modules are featured with regulation of hub genes [24]. They suggested that hubs genes coordinate , response to stress, cellular localization, and protein genes over different pathways to participate in the regulatory localization [30]. Udyavar et al. performed coexpression net- processes. Chou et al. investigated endometrial cancers (ECs) work construction based on a dataset of combined normal, hub genes by constructing WGCNA coexpression network adenocarcinoma, squamous cell carcinoma, and small-cell andthesehubgenesareinvolvedinantigenprocessing,cell lung cancer (SCLC) tissue specimens by WGCNA. They adhesion, and cell-cycle regulation [21]. On the other hand, compared the distribution of significant modules across four loss of connectivity in coexpression network is a common types of samples and derived an SCLC-specific hub network topological trait among the different kinds of cancer [45]. classifier and identified spleen tyrosine kinase as candidate biomarker and therapeutic target for SCLC [19]. 2.3. Differential Regulatory Comparison between Differ- Cancer is considered as a complex disease with multilevel ent Conditions. Distinguishing different regulatory elements progressing process. DRA based on GCN is able to bring light between different conditions such as tumor and normal to dynamic regulatory relationships of cancer in its different tissue, or different cancer types, or even cross-species (human progress levels. For instance, Cao et al. first constructed gene and mouse) [46] help to understand the dysfunctional regu- coexpression networks of normal, adenoma, and carcinoma- lation. specific gastric carcinogenesis to decrease the searching space There are two ways to perform differential regulatory for potential regulatory genes [48]. After these potential comparison between different conditions. The first way is regulatory genes are acquired, three differential networks are toconstructgenecoexpressionnetworkbasedoneach constructed. By comparing constructed differential network- condition and compare the difference between constructed ing information and signaling pathway information of three GCNs to extract different regulation elements as shown developing stages, the regulation roles of GATA6 and ESRRG in Figure 1(c) [24, 30]. The other way is to calculate the and their signaling pathways in gastric carcinogenesis were significantly different correlation between various conditions suggested [48]. This work frame is prospective and extendable BioMed Research International 5 to other cancer researches. Wu et al. also performed a system- 3.3. Applicable in Medical and Pharmaceutical Aspects. Since levelstudyofgastriccancerbyconstructingfivephenotype- DRA based on GCN is a system-level analysis method specific coexpression networks [50]. Their comparison anal- and explores the regulatory mechanisms of diseases, it has ysis of connectivity reveals that hub genes which only exit been widely applied in medical and pharmaceutical aspects. in the normal networks play important roles in gastric By integrating other pieces of information such as clinical tumorigenesis and hub genes only related to tumor networks information or drug-target genes information, DRA based on are enriched in specific biological terms. Ruan et al. identi- GCN has more potential to contribute to theoretical base of fied specific pathways associated with renal cell carcinoma medical and pharmaceutical researches. (RCC) based on differentially coexpressed links which were For example, prognostic genes are very important for detected by three methods: Pearson’s correlation, Bayesian cancer prognosis and treatment. By integrating survival network, and WGCNA. These RCC-related pathways help information, Yang et al. studied the system-level prognostic to explain underlying regulatory mechanisms of RCC [29]. genes across four cancer types by DRA based on GCN Khosravi et al. built independent gene regulatory networks [24]. Discovering new biomarkers or molecular subtypes of from each prostate cancer and found critical transcription cancer is also valuable for stratification in clinical studies. factorsinvolvedinprostatecancerbasedonhubtypevari- DRA-based signature has the ability to classify patients into ation[51].Thesedynamicnetworkstudiesacrosscancer different subtypes with different clinical results. Meanwhile, stages well reveal the change of regulatory patterns during the DRA-based signature is featured with different regulatory progression of cancers. patterns of each subtype. Wu et al. revealed a novel three- transcription-factor signature including AHR, NFIL3, and 3.2. Practical and Beneficial in Multilevel Network Analysis ZNF423 for glioma molecular subtypes by DRA based on When Integrated with miRNAs and lncRNAs Data. With the GCN. This three-gene DRA-based signature clusters glioma emerging roles of microRNAs (miRNAs) and long noncod- patients into three major subtypes which are significantly ing (lncRNAs) in gene regulatory networks, more different in patient survival as well as transcriptomic patterns and more genomic studies on miRNAs and lncRNAs are [47].Jinetal.captureda12-genenetworkmoduleofovarian performed in cancer research aspects [52–58]. By integrat- cancer by constructing weighted survival and differential ing genomic miRNAs and lncRNAs with mRNAs data to coexpression network and this module shows a close corre- construct multi-level co-expression network and analyze lation with cell death [63]. All these prognostic DRA studies differential regulatory mechanism, the understanding of based on GCN help to provide a more accurate survival carcinogenesis functions of miRNAs and lncRNAs can be prediction. Moreover, DRA-based prognostic signature has greatly improved from the views of system level. For example, Lin et al. constructed a cross-cancer miRNA more potential to explore carcinogenesis mechanisms which differential coexpression network and identified two poten- lead to a better precision medicine in cancer diagnosis and tial miRNA-regulated oncomodules associated with poor treatment. survival outcomes in patients [59]. This study suggested that disruption of miRNA positive coexpression in cancer 4. Discussion might contribute to cancer development. There are many efforts made to discover the regulatory action of lncRNAs Regulatory analysis is always a focal point in biological in cancer [56, 57]. DRA based on coexpression network is research. Understanding the function of each regulatory also extendable to lncRNAs-cancer gene network analysis. element in biological process is fundamental and challenging. InCaNet is a data resource which contains precalculated Large-scale and multilevel sequencing data provide more significant coexpression pairs of 9641 lncRNAs and 2544 opportunities to reveal molecular regulatory mechanism well-classified cancer genes in 2922 matched TCGA samples from the systematic viewpoint. Differential regulatory anal- [60]. And InCaNet helps to explore regulatory functions of ysis is designed for distinguishing the differential regulatory particular lncRNA-cancer gene interaction in cancer studies. elements in different conditions or dysfunctional regulation Most lncRNAs’ regulatory functions are unknown and DRA specific for an abnormal condition. For example, cancer based on lncRNAs-cancer gene network has the exact ability is considered as a complex genetic disease and different to perform the predictions. According to the assumption phonotypes in cancer embody regulatory level mechanisms. that genes or nodes in a subgroup may execute similar Dysfunctional regulatory elements take priority in carcino- functions, Cogill and Wang identified a list of previously genesis studies because of their important roles in regulation uncharacterized lncRNAs coexpressed with key cancer genes as well as their potential in cancer treatments. in their study [20] and Hao et al. inferred lncRNAs related to In the early past decade, gene expression data for spe- esophageal squamous cell carcinoma ESCC from constructed cific phenotype is very limited and researchers had to use coexpression network and differential regulatory analysis various cell lines data to construct conceptual gene regula- [61]. Moreover, an integrated miRNA-mRNA-lncRNA coex- tory networks [14]. Then it is very difficult to explain the pression network analysis was performed by Wu et al. differential regulatory relationships between tumor types. to study the oestrogen receptor-regulated transcriptome Recently, with the accumulation of large-scale and multiscale in breast cancer [62]. All these multilevel studies expand data, researchers are able to apply differential regulatory our understanding of regulatory mechanism in cancer analysis to identify specific regulatory patterns in a given biology. cancer type. Since differential regulatory analysis based on 6 BioMed Research International coexpression network has a system-level property, it has great References strength to discover underlying molecular mechanism and dysfunctional regulatory elements from large-scale data of [1] E. E. Schadt, “Molecular networks as sensors and drivers of complex system. With the development of differential regu- common human diseases,” Nature,vol.461,no.7261,pp.218– 223, 2009. latoryanalysisbasedoncoexpressionnetworkitselfandits applications in more and more genomic and big data research, [2] A. de la Fuente, “From ‘differential expression’ to ‘differential it presents its essential and prospective role in cancer networking’—identification of dysfunctional regulatory net- works in diseases,” Trends in Genetics,vol.26,no.7,pp.326–333, research. 2010. Technically, construction of coexpression networks as [3] E. E. Schadt, S. H. Friend, and D. A. Shaywitz, “A network the basic starting point of differential regulatory analysis view of disease and compound screening,” Nature Reviews Drug playsimportantandcriticalroleinthewholeinvestigation Discovery,vol.8,no.4,pp.286–295,2009. process. Therefore, both the data sets and the way to con- [4] E.E.Schadt,B.Zhang,andJ.Zhu,“Advancesinsystemsbiology struct networks must be carefully examined. For example, are enhancing our understanding of disease and moving us Ballouz et al. suggest minimal experimental criteria to obtain closer to novel disease treatments,” Genetica,vol.136,no.2,pp. useful functional connectivity and topology information of 259–269, 2009. coexpression with microarrays greater than 20 samples and [5] D. Hanahan and R. A. Weinberg, “Hallmarks of cancer: the next read depth greater than 10 M per sample [64]. Selection generation,” Cell,vol.144,no.5,pp.646–674,2011. of differential regulatory genes relies on the method or [6] P.K. Kreeger and D. A. Lauffenburger, “Cancer systems biology: algorithmchoseninDRAbasedonGCN[22,65,66].For a network modeling perspective,” Carcinogenesis,vol.31,no.1, instance, differential analysis between two conditions, which pp.2–8,2009. is followed by regulatory analysis, can be performed after [7]M.R.J.Carlson,B.Zhang,Z.Fang,P.S.Mischel,S.Horvath, two gene coexpression networks are constructed. Sometimes, and S. F. Nelson, “Gene connectivity, function, and sequence regulatory analysis based on coexpression networks between conservation: predictions from modular yeast co-expression two conditions is conducted first and comparison of differen- networks,” BMC Genomics,vol.7,article40,2006. tial regulatory elements is performed later. All these steps in [8] Y. Xiao, “A tutorial on analysis and simulation of boolean gene current differential regulatory analysis methods are relatively regulatory network models,” Current Genomics,vol.10,no.7,pp. flexible depending on the aim of specific research and infor- 511–525, 2009. mation available. The DRA based on GCN method might [9] D. Husmeier, “Sensitivity and specificity of inferring genetic need further standardization and refinement to better serve regulatory interactions from microarray experiments with more carcinogenesis research in medical and pharmaceutical dynamic Bayesian networks,” Bioinformatics,vol.19,no.17,pp. fields. 2271–2282, 2003. [10] Z. S. H. Chan, L. Collins, and N. Kasabov, “Bayesian learning of sparse gene regulatory networks,” BioSystems,vol.87,no.2-3, 5. Conclusion pp.299–306,2007. In this review, we summarize the paradigm of differential [11] S. Gao and X. Wang, “Quantitative utilization of prior biological regulatory analysis based on coexpression network of tran- knowledge in the Bayesian network modeling of gene expres- sion data,” BMC Bioinformatics,vol.12,article359,2011. scriptomic data and the applications of differential regulatory analysis based on GCN in cancer research. Differential [12] T. T. Vu and J. Vohradsky, “Nonlinear differential equation regulatoryanalysisbasedonGCNisdemonstratedasa model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae,” Nucleic Acids necessary and potential tool to reveal underlying molecular Research,vol.35,no.1,pp.279–287,2007. mechanism in basic functional genomic research as well as practical carcinogenesis studies. [13]K.Tu,H.Yu,Y.-J.Huaetal.,“Combinatorialnetworkof primary and secondary microRNA-driven regulatory mecha- nisms,” Nucleic Acids Research,vol.37,no.18,pp.5969–5980, Competing Interests 2009. [14]H.Yu,K.Tu,Y.-J.Wangetal.,“Combinatorialnetworkof The authors declare that they have no competing interests. transcriptional regulation and microRNA regulation in human cancer,” BMC Systems Biology,vol.6,article61,2012. [15] B. Zhang, H. Li, R. B. Riggins et al., “Differential dependency Acknowledgments network analysis to identify condition-specific topological changes in biological networks,” Bioinformatics,vol.25,no.4, ThisworkissupportedbythegrantsfromtheNational pp. 526–532, 2009. Key Scientific Instrument and Equipment Development [16] D. Nitsch, L.-C. Tranchevent, B. Thienpont et al., “Network Project of China (2012YQ03026108), National Natural Sci- analysis of differential expression for the identification of ence Foundation of China (31501077), the National “973” Key disease-causing genes,” PLoS ONE,vol.4,no.5,ArticleID Basic Research Development Program (2012CB316501 and e5526, 2009. 2013CB910801), and the Fundamental Research Program of [17]R.Gill,S.Datta,andS.Datta,“Astatisticalframeworkfor Shanghai Municipal Commission of Science and Technology differential network analysis from microarray data,” BMC Bioin- (14DZ1951300 and 14DZ2252000). formatics, vol. 11, article 95, 2010. BioMed Research International 7

[18]D.Nitsch,J.P.Gonc¸alves,F.Ojeda,B.deMoor,andY. [34] S. Aburatani, F. Sun, S. Saito, M. Honda, S.-I. Kaneko, and K. Moreau, “Candidate gene prioritization by network analysis Horimoto, “Gene systems network inferred from expression of differential expression using machine learning approaches,” profiles in hepatocellular carcinogenesis by graphical gaussian BMC Bioinformatics,vol.11,article460,2010. model,” Eurasip Journal on Bioinformatics & Systems Biology, [19] A. R. Udyavar, M. D. Hoeksema, J. E. Clark et al., “Co- vol. 2007, Article ID 47214, 2007. expression network analysis identifies Spleen Tyrosine Kinase [35] A. A. Margolin, I. Nemenman, K. Basso et al., “ARACNE: an (SYK) as a candidate oncogenic driver in a subset of small-cell algorithm for the reconstruction of gene regulatory networks lung cancer,” BMC Systems Biology,vol.7,supplement5,article in a mammalian cellular context,” BMC Bioinformatics,vol.7, S1, 2013. no.1,articleS7,2006. [20] S. B. Cogill and L. Wang, “Co-expression network analysis of [36] R. A. Chavez´ Montes, G. Coello, K. L. Gonzalez-Aguilera,´ human lncRNAs and cancer genes,” Cancer Informatics,vol.13, N. Marsch-Mart´ınez,S.deFolter,andE.R.Alvarez-Buylla, supplement 5, pp. 49–59, 2014. “ARACNe-based inference, using curated microarray data, of [21] W.-C. Chou, A.-L. Cheng, M. Brotto, and C.-Y.Chuang, “Visual root transcriptional regulatory networks,” gene-network analysis reveals the cancer gene co-expression in BMC Plant Biology,vol.14,no.1,article97,2014. human endometrial cancer,” BMC Genomics,vol.15,article300, [37] J.J.Faith,B.Hayete,J.T.Thadenetal.,“Large-scalemappingand 2014. validation of Escherichia coli transcriptional regulation from a [22] M. M. Bourdakou, E. I. Athanasiadis, and G. M. Spyrou, compendium of expression profiles,” PLoS biology,vol.5,no.1, “Discovering gene re-ranking efficiency and conserved gene- article e8, 2007. gene relationships derived from gene co-expression network [38] P. E. Meyer, K. Kontos, F. Lafitte, and G. Bontempi, analysis on breast cancer data,” Scientific Reports,vol.6,Article “Information-theoretic inference of large transcriptional ID 20518, 2016. regulatory networks,” EURASIP Journal on Bioinformatics and [23] S. Hong, H. Dong, L. Jin, and M. Xiong, “Gene co-expression Systems Biology,vol.2007,ArticleID79879,2007. network and functional module analysis of ovarian cancer,” [39]T.W.H.Lui,N.B.Y.Tsui,L.W.C.Chan,C.S.C.Wong, International Journal of Computational Biology and Drug P.M.F.Siu,andB.Y.M.Yung,“DECODE:anintegrated Design,vol.4,no.2,pp.147–164,2011. differential co-expression and differential expression analysis of [24]Y.Yang,L.Han,Y.Yuan,J.Li,N.Hei,andH.Liang,“Gene gene expression data,” BMC Bioinformatics,vol.16,no.1,article co-expression network analysis reveals common system-level 182, 2015. properties of prognostic genes across cancer types,” Nature [40] W.-M. Song and B. Zhang, “Multiscale embedded gene co- Communications,vol.5,article3231,2014. expression network analysis,” PLoS Computational Biology,vol. [25] J. Yang, H. Yu, B.-H. Liu et al., “DCGL v2.0: an R pack- 11, no. 11, Article ID e1004574, 2015. age for unveiling differential regulation from differential co- [41] M. Tumminello, T. Aste, T. Di Matteo, and R. N. Mantegna, “A expression,” PLoS ONE,vol.8,no.11,ArticleIDe79729,2013. tool for filtering information in complex systems,” Proceedings [26] B.-H. Liu, H. Yu, K. Tu, C. Li, Y.-X. Li, and Y.-Y. Li, “DCGL: an of the National Academy of Sciences of the United States of R package for identifying differentially coexpressed genes and America,vol.102,no.30,pp.10421–10426,2005. links from gene expression microarray data,” Bioinformatics, [42] D. D. Lee and H. S. Seung, “Learning the parts of objects by vol. 26, no. 20, Article ID btq471, pp. 2637–2638, 2010. non-negative matrix factorization,” Nature,vol.401,no.6755, [27] P. Langfelder and S. Horvath, “WGCNA: an R package for pp. 788–791, 1999. weighted correlation network analysis,” BMC Bioinformatics, [43] J. J.-Y. Wang, X. Wang, and X. Gao, “Non-negative matrix vol. 9, article 559, 2008. factorization by maximizing correntropy for cancer clustering,” [28] B. Zhang and S. Horvath, “A general framework for weighted BMC Bioinformatics, vol. 14, article 107, 2013. gene co-expression network analysis,” Statistical Applications in [44] K. Inamura, T. Fujiwara, Y. Hoshida et al., “Two subclasses of Genetics and Molecular Biology,vol.4,no.1,article17,ArticleID lung squamous cell carcinoma with different gene expression 20518, 2005. profiles and prognosis identified by hierarchical clustering and [29] X. Ruan, H. Li, B. O. Liu et al., “A novel method to identify non-negative matrix factorization,” ,vol.24,no.47,pp. pathways associated with renal cell carcinoma based on a gene 7105–7113, 2005. co-expression network,” Oncology Reports,vol.34,no.2,pp. [45] R. Anglani, T. M. Creanza, V. C. Liuzzi et al., “Loss of 567–576, 2015. connectivity in cancer co-expression networks,” PLoS ONE,vol. [30] J. Jiang, P. Jia, Z. Zhao, and B. Shen, “Key regulators in 9, no. 1, Article ID e87075, 2014. prostate cancer identified by co-expression module analysis,” [46] A. Aytes, A. Mitrofanova, C. Lefebvre et al., “Cross-species BMC Genomics,vol.15,no.1,article1015,2014. regulatory network analysis identifies a synergistic interaction [31] H. Yu, B.-H. Liu, Z.-Q. Ye, C. Li, Y.-X. Li, and Y.-Y. Li, between FOXM1 and CENPF that drives prostate cancer malig- “Link-based quantitative methods to identify differentially nancy,” Cancer Cell,vol.25,no.5,pp.638–651,2014. coexpressed genes and gene Pairs,” BMC Bioinformatics,vol.12, [47]S.Wu,J.Li,M.Cao,Y.-X.Li,andY.-Y.Li,“Anovelintegrated article 315, 2011. gene coexpression analysis approach reveals a prognostic three- [32] T.Wang,Z.Ren,Y.Dingetal.,“FastGGM:anefficientalgorithm transcription-factor signature for glioma molecular subtypes,” for the inference of gaussian graphical model in biological in Proceedings of the International Conference on Intelligent networks,” PLOS Computational Biology,vol.12,no.2,Article Biology and Medicine ((ICIBM ’15),Indianapolis,Ind,USA, ID e1004755, 2016. November 2015. [33] P. Ingkasuwan, S. Netrphan, S. Prasitwattanaseree et al., [48]M.S.Cao,B.Y.Liu,W.T.Dai,W.X.Zhou,Y.X.Li,andY.Y.Li, “Inferring transcriptional gene regulation network of starch “Differential network analysis reveals dysfunctional regulatory metabolism in Arabidopsis thaliana leaves using graphical networks in gastric carcinogenesis,” American Journal of Cancer Gaussian model,” BMC Systems Biology,vol.6,article100,2012. Research, vol. 5, no. 9, pp. 2605–2625, 2015. 8 BioMed Research International

[49]N.J.Hudson,A.Reverter,andB.P.Dalrymple,“Adifferential wiring analysis of expression data correctly identifies the gene containing the causal mutation,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000382, 2009. [50] J. Wu, X. Zhao, Z. Lin, and Z. Shao, “A system level analysis of gastric cancer across tumor stages with RNA-seq data,” Molecular BioSystems, vol. 11, no. 7, pp. 1925–1932, 2015. [51] P. Khosravi, V. H. Gazestani, M. Akbarzadeh, S. Mirkhalaf, M. Sadeghi, and B. Goliaei, “Comparative analysis of prostate can- cer gene regulatory networks via hub type variation,” Avicenna Journal of Medical Biotechnology,vol.7,no.1,pp.8–15,2015. [52] C.-C. Lin, Y.-J. Chen, C.-Y. Chen, Y.-J. Oyang, H.-F. Juan, and H.-C. Huang, “Crosstalk between transcription factors and microRNAs in human protein interaction network,” BMC Systems Biology,vol.6,article18,2012. [53] J. Xu, C.-X. Li, Y.-S. Li et al., “MiRNA-miRNA synergistic net- work: construction via co-regulating functional modules and disease miRNA topological features,” Nucleic Acids Research, vol.39,no.3,pp.825–836,2011. [54] G. A. Calin and C. M. Croce, “MicroRNA signatures in human cancers,” Nature Reviews Cancer,vol.6,no.11,pp.857–866, 2006. [55] V. N. Kim and J.-W. Nam, “Genomics of microRNA,” Trends in Genetics, vol. 22, no. 3, pp. 165–173, 2006. [56] S.W.Cheetham,F.Gruhl,J.S.Mattick,andM.E.Dinger,“Long noncoding RNAs and the genetics of cancer,” British Journal of Cancer,vol.108,no.12,pp.2419–2425,2013. [57] S. Wang and E. J. Tran, “Unexpected functions of lncRNAs in gene regulation,” Communicative and Integrative Biology,vol.6, no. 6, Article ID e27610, 2013. [58] A. Quitadamo, L. Tian, B. Hall, and X. Shi, “An integrated network of microRNA and gene expression in ovarian cancer,” BMC Bioinformatics,vol.16,supplement5,articleS5,2015. [59] C.-C. Lin, R. Mitra, F. Cheng, and Z. Zhao, “A cross-cancer differential co-expression network reveals microRNA-regulated oncogenic functional modules,” Molecular BioSystems,vol.11, no. 12, pp. 3244–3252, 2015. [60] Y. Liu and M. Zhao, “lnCaNet: pan-cancer co-expression network for human lncRNA and cancer genes,” Bioinformatics, vol.32,no.10,pp.1595–1597,2016. [61] Y. Hao, W.Wu, F. Shi et al., “Prediction of long noncoding RNA functions with co-expression network in esophageal squamous cell carcinoma,” BMC Cancer,vol.15,article168,2015. [62]Q.Wu,L.Guo,F.Jiang,L.Li,Z.Li,andF.Chen,“Analysisof the miRNA-mRNA-lncRNA networks in ER+ and ER- breast cancer cell lines,” Journal of Cellular and Molecular Medicine, vol.19,no.12,pp.2874–2887,2015. [63] N. Jin, H. Wu, Z. Miao et al., “Network-based survival- associated module biomarker and its crosstalk with cell death genes in ovarian cancer,” Scientific Reports,vol.5,ArticleID 11566, 2015. [64] S. Ballouz, W. Verleyen, and J. Gillis, “Guidance for RNA- seq co-expression network construction and analysis: safety in numbers,” Bioinformatics,vol.31,no.13,pp.2123–2130,2015. [65] O. Odibat and C. K. Reddy, “Ranking differential hubs in gene co-expression networks,” Journal of Bioinformatics and Computational Biology,vol.10,no.1,ArticleID1240002,2012. [66]H.Yu,R.Mitra,J.Yang,Y.Li,andZ.Zhao,“Algorithms for network-based identification of differential regulators from transcriptome data: a systematic evaluation,” Science China Life Sciences, vol. 57, no. 11, pp. 1090–1102, 2014. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 9721713, 12 pages http://dx.doi.org/10.1155/2016/9721713

Research Article Hybrid Binary Imperialist Competition Algorithm and Tabu Search Approach for Feature Selection Using Gene Expression Data

Shuaiqun Wang,1 Aorigele,2 Wei Kong,1 Weiming Zeng,1 and Xiaomin Hong1

1 Information Engineering College, Shanghai Maritime University, Shanghai 201306, China 2Faculty of Engineering, University of Toyama, Toyama-shi 930-8555, Japan

Correspondence should be addressed to Shuaiqun Wang; [email protected]

Received 15 April 2016; Accepted 22 June 2016

Academic Editor: Jialiang Yang

Copyright © 2016 Shuaiqun Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features over a large number of gene expression data. Lately, many researchers devote themselves to feature selection using diverse computational intelligence methods. However, in the progress of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to thehuge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm HICATS incorporating imperialist competition algorithm (ICA) which performs global search and tabu search (TS) that conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works including the conventional version of binary optimization algorithm in terms of classification accuracy and the number of selected genes.

1. Introduction and irrelevant genes, the conventional classification methods cannot be effectively applied to gene classification due to DNA microarray technology which can measure the expres- the poor classification accuracy. With the inherent property sion levels of thousands of genes simultaneously in the field of gene data, efficient algorithms are needed to solve this of biological tissues and produce databases of cancer based problem in reasonable computational time. Therefore, many on gene expression data [1] has great potential on cancer supervised machine learning algorithms, such as Bayesian research. Because the conventional diagnosis method for can- networks, neural networks, and support vector machines cer is inaccurate, gene expression data has been widely used (SVMs), combined with feature selection techniques, have to identify cancer biomarkers closely associated with cancer, been used to process the gene data [3]. Gene selection is which could be strongly complementary for the traditional the process of selecting the smallest subset of informative histopathologic evaluation to increase the accuracy of cancer genes that are most predictive to its relative class using a diagnosis and classification [2] and improve understanding of classification model. The objectives of feature selection prob- the pathogenesis of cancer for the discovery of new therapy. lems are maximizing the classifier ability and minimizing Therefore, it has gained popularity by the application of thegenesubsetstoclassifysamplesaccurately.Theoptimal geneexpressiondataoncancerclassification,diagnosis,and featureselectionproblemfromgenedataisNP-hardproblem. treatment. Hence, it is more effective to use metaheuristics approaches, Due to the high dimensions of gene expression data such as nature inspired computation, to solve this problem. In compared to the small number of samples, noisy genes, recent years, metaheuristic algorithms based on global search 2 BioMed Research International strategy rather than local search strategy have shown their function evaluation. Section 4 describes the parameter setting advantages in solving combinatorial optimization problems, and the experiment result based on several benchmark and a number of metaheuristic approaches have been applied gene datasets including comparative results between HICATS on feature selection, for example, genetic algorithm (GA), and other variants of PSO. Finally, concluding remarks are particle swarm optimization (PSO), tabu search (TS), and presented in Section 5. artificial bee colony (ABC). Metaheuristic algorithms, as a kind of random search techniques, cannot guarantee finding the optimal solution 2. Related Algorithm every time. Due to the fact that a single metaheuristic 2.1. Generic Imperialist Competition Algorithm (ICA). ICA is algorithm is often trapped into an immature solution, the a population-based stochastic optimization technique, which recent trends of research have been shifted towards the was proposed by Atashpaz-Gargari and Lucas [10]. ICA, as several hybrid methods. Kabir et al. [4] introduced a new one of the recent metaheuristic optimization techniques, is hybrid genetic algorithm incorporating a local search to fine- inspired by sociopolitical behavior. A review on last studies tunethesearchforfeatureselection.Shenetal.[5]presented showed that this method has not been used to solve gene ahybridPSOandTSforfeatureselectiontoimprovethe expression data for feature selection. Like other evolutionary classification accuracy. Next, Li et al. proposed a hybrid PSO algorithms, ICA begins with an initial set of solutions (coun- and GA [6]. Unfortunately, the experiment results did not tries) called population. Each individual of population is an obtain high classification accuracy. Alshamlan et al. brought array which is called “country” in ICA and “chromosome” in outanideaofABCtosolvefeatureselection.Theyfirst GA.Theempireiscomposedofthecountriesthatwouldbe attempted applying ABC algorithm in analyzing microar- either an imperialist or a colony. The powerful countries are ray gene expression combined with minimum redundancy considered to be imperialists and the colonies are assigned maximum relevance (mRMR) [7]. Then, they also hybridized to each empire based on the power of each imperialist state. ABC and GA algorithm to select genetic feature for microar- After generating the empires, the colonies are assimilated ray cancer classification and the goal was to integrate the by their related imperialist which would make the colonies advantages of both algorithms [8]. The result obtained by stronger and move towards the promising region. If the ABC algorithm was improved to some extent, but the small colony is better than its imperialist when moving towards the number of genes cannot get the high accuracy. Chuang et imperialist, then exchange positions of the imperialist and al. [9] introduced an improved binary PSO in which the its colony. As an empire has more power, it attracts more global best particle was reset to zero position when its fitness colonies and imperialist competition among the empires values did not change after three consecutive iterations. With forms the basis of the ICA. The powerful imperialists are a large number of selected genes, the result of the proposed reinforced and the weak ones are weakened and gradually algorithm obtained 100% classification accuracy in many collapsed when the imperialist has no colony. Finally, the datasets. algorithm converges to the optimal solution. The flowchart of So,inthispaper,weconcentrateonimperialistcompe- ICA is shown in Figure 1. ICA has been successfully applied tition algorithm inspired by sociopolitical behavior which is in many areas: fuzzy system control, function optimization, a kind of new swarm intelligent optimization algorithms to artificial neural network training, and other application prob- address the process of feature selection from gene expression lems. data. It starts with an initial population and effectively searches the solution space through some specially designed operators to converge to optimal or near-optimal solution. 2.2. The Tabu Search Algorithm. Tabu search (TS) was pro- Although ICA has been proved a potential search technique posed by Glober in 1986 [11], which is a famous local search for solving optimization problem, it still faces some difficul- algorithm,tosolveawiderangeofhardcombinationopti- ties that ICA is easy to trap into local optima and cannot get a mization problems. The algorithm begins with initial solu- better result. Tabu search (TS) as a local search technique just tions and evaluates the fitness values for the given solutions. can make up for the deficiency of the ICA algorithm. It has Thenanassociatedsetoffeasiblesolutionscanbeobtained the ability to avoid convergence to local optimal by a flexible by applying a simple modification with given solution. This memory system including aspiration criterion and tabu list. modification by a simple and basic transformation is called DuetolocalsearchpropertyofTS,theconvergencespeed move. If the best of these neighbors is not in the tabu list, of TS largely depends on the initial solution. The parallelism then select it as the new current solution. The tabu list keeps of population in ICA would help the TS find the promising track of previously explored solutions and prevents TS from regions of the search space very quickly. So the hybrid revisiting them again to avoid falling into local optimum. A algorithm HICATS effectively combines the advantages of move is created to increase diversity even if it is worse than ICA and TS and shows the superiority in feature selec- the current solution. At the same time, tabu list is introduced tion. and used to memorize the better local neighbors which have The rest of the paper is organized as follows. Section 2 been searched and will be neglected. After a subset of feasible describes the related algorithm incorporating the process solutions is created according to the tabu list and evaluated of generic ICA and TS. Section 3 elaborates the proposed by the objective function. The best optimal solution will be HICATS including the framework, individual representa- selected as the next solution. This loop is stopped when the tion, empire initialization, colonies assimilation, and fitness stopping criteria are satisfied. BioMed Research International 3

Initial empires

Country 1 Colony assimilation

New Imperialist position of colony

Country 2 x d 𝜃

Initial country Initial . . Colony

Country n No

Is there a colony better

Yes than its imperialist? Eliminate the imperialist which has no colony Exchange the No position of imperialist Empire Empire and colony Compute the total power of each empire Yes

Empire 1 Empire Empire The weakest empire Weakest colony in the weakest empire

Empire 2 Empire 3 Empire N_imp

Stop criteria satisfied? ··· Yes

End Imperialist competition

Imperialist 1 Colony 1

2 Imperialist 2 Colony Colony 3 Imperialist 3

Imperialist m Colony m

Figure 1: The flowchart of ICA scheme. 4 BioMed Research International

The start of HICATS Move the colonies toward their imperialist

Set parameters and initialize HICATS’s population Compare the objective values between imperialist and its colonies

Calculate the fitness of countries If the colony in an empire is better than its imperialist

Generate empires: imperialists and colonies Yes No Exchange position of imperialist and its best colony Apply tabu search (TS) on imperialist

Compute the total power of an empire

Apply imperialist competition algorithm (ICA)

Choose the weakest colony from the No weakest empire and give it to the most likelihood empire to possess it Update the empires No

If an empire with no colony Stop criteria satisfied? Yes

Yes Eliminate the empire

Output the optimal solution and stop the algorithm HICATS

Figure 2: The framework of the proposed algorithm HICATS.

3. Proposed HICATS country in the population which utilizes support vector machine classifier (SVM). The fitness is decided bythe ICA as a global search metaheuristic algorithm reveals the percentage of classification accuracy of SVM and the number advantage in solving combinatorial optimization problems; of feature subsets. Then empires are generated depending on however, the diversity of the population would be greatly their fitness values. reduced after some generations and the algorithm might lead to premature convergence. TS as a local search technique Step 2. Apply TS on imperialist in each empire. Generate and can exploit the neighbors of current solutions to get better evaluate the neighbors of imperialist. Select the new solution candidates, but it will take much time to obtain the global according to the tabu list and aspiration criterion to replace optimum or near-global optimum. The incorporation of the old imperialist. TS into ICA as a local improvement strategy enables the method to maintain the population diversity and prohibits Step 3. Apply a learning mechanism on colonies which is the misleading local optimal. Each binary coded string (country) same as Baldwinian Learning (BL) mechanism [12]. Find out represents a set of genes, which is evaluated by the fitness the different genes between imperialist and one of its colonies; function. TS is applied on imperialist in each empire to then use the randomly generated learning probability to select the new imperialist and avoid premature conver- decide the number of selected genes for a colony. This strategy gence. The framework of the proposed algorithm HICATS makes the colonies move towards their imperialist. can be shown in Figure 2, which is described further as follows. Step 4. Comparetheobjectivevaluesbetweenimperialistand itscoloniesinthesameempire.Exchangethepositionsof Step 1. Set parameters of the algorithm and initialize coun- imperialist and its colony when a colony is better than its tries with binary representation 0 and 1. Evaluate each imperialist. BioMed Research International 5

𝑚 Gene data f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 imperialists proportionally, normalized fitness value of th imperialist is defined by String 1100 1 0 1 00 1 Fit𝑚 = fit𝑚 − min {fit𝑖}, 𝑖∈1,2,...,𝑁imp, (2) Country f1 f2 f5 f7 f10 where Fit𝑚 and fit𝑚 are the normalized fitness value of Figure 3: An illustrated example with generated subset and individ- 𝑚th imperialist and the fitness value of 𝑚th imperialist, ual representation. respectively. The normalized power for this imperialist is defined by 󵄨 󵄨 Step 5. Calculate the total power of an empire and compare 󵄨 󵄨 𝑝 = 󵄨 Fit𝑚 󵄨 . all empires; then eliminate the weakest empire when it loses 𝑚 󵄨 𝑁 󵄨 (3) 󵄨∑ imp 󵄨 all of its colonies. 󵄨 𝑖=1 Fit𝑖 󵄨

Step 6. If the termination condition (the predefined max The normalized power of an imperialist reveals the strength 𝑚 iterations)isnotfulfilled,gobacktoStep2.Otherwise,output of this imperialist. So the possessed colonies of th empire theoptimalsolutioninthecurrentpopulationandstopthe will be algorithm. 𝑁𝐶𝑚col = round {𝑝𝑚 ⋅𝑁col}, (4) It is clear that HICATS integrates two quite different 𝑁 𝑁𝐶 search strategies for feature selection, that is, the operation where col is the total number of colonies and 𝑚col is the initial number of colonies of 𝑚th empire. To generate on ICA which can explore the new region and provide the 𝑁𝐶 ideal solution for TS, while TS can exploit the neighbors of each empire, we randomly choose 𝑚col colonies and give imperialist for better candidate and avoid getting into local them to each imperialist. Figure 1 shows the initial population optimal according to memory system. The evaluation func- of each empire including imperialist and colonies with the tion, incorporating the accuracy of SVM with the number same color. It is obvious that bigger imperialists have greater of selected genes in feature subset, assists HICATS to find number of colonies while weaker ones have less. Imperialist 1 the most salient features with less redundancy. A reliable has the most colonies and formed the most powerful empire. selection method for genes classification should have higher classification accuracy and contain less redundant genes. For 3.2. Colonies Assimilation. In HICATS, assimilation is an more comprehensibility, details about each component of important operation and could likely be a momentous help HICATS are described in the following sections. in the progress of colonies evolution. In this paper, the idea of continuous BL is introduced into the HICATS for colonies 3.1. Individual Representation and Empire Initialization. In assimilation by their imperialist. This strategy can utilize this paper, we utilize random approach to generate a binary some specific differential information from the imperialist, that is, the differential information between imperialist and coded string (country) composed of 0, 1 and its length is 𝑋 −𝑋 equal to the dimensions of gene expression data. A value of colony IM CO, indicating a more effective way to learn 1incountryindicatesthatthisgeneshouldbeselectedwhile from the excellent solution. It can be defined as follows: the value of 0 represents the uselessness of corresponding 𝑋CO =𝑋CO +⌊𝛽∗(𝑋IM −𝑋CO)⌋ . (5) gene. In order to clearly understand these operations, we take an example for explanation. Assume that the gene data have The operation of difference states that, 1 subtracting 0, the 10 dimensions (10 features: 𝑓1, 𝑓2, 𝑓3, 𝑓4, 𝑓5, 𝑓6, 𝑓7, 𝑓8, 𝑓9, 𝑓 𝑋 = {1, 1, 0, 0, 1, 0, 1, 0, 0, 1} result is 1; 1 subtracting 1, the result is 0; 0 subtracting 1, the and 10); the country country is result is 0; and 0 subtracting 0, the result is 0. The learning initialized with 0 and 1. The bits of the string template are rate 𝛽 ∈ (0, 1) is a randomly generated real number which 10 which is equal to the dimensions of gene data. The string means the proportion of selected genes from the differences. is randomly generated and the number of 1 is 5. Hence, the ⌊⋅⌋ is the operator that rounds its argument towards the features 𝑓1, 𝑓2, 𝑓5, 𝑓7,and𝑓10 are selected to form a country closest integer and ⌊𝛽 ∗ (𝑋IM −𝑋CO)⌋ represents the selected which is shown in Figure 3. genes. In order to reduce the dimension of the country, our After generating the population, we should evaluate the research adopts a randomly generated template depending countries and initialize empires composed of imperialists and on the larger dimensions of imperialist and colonies. For colonies. The fitness value of a country is estimated by the imperialist (IM) with five characteristics 𝑓1, 𝑓2, 𝑓5, 𝑓7,and fitness function 𝐹: 𝑓10 and one of its colonies with six characteristics 𝑓1, 𝑓3, 𝑓 𝑓 𝑓 𝑓 fit =𝐹(𝑋country)=𝐹(𝑓1,𝑓2,...,𝑓𝐷). (1) 6, 8, 9,and 10 in an empire, the dimension of binary template (BT) is 6. The template of colony is generated from In this study, assume that the initial population size is 𝑁pop; nonoperation of BT, denoted by BTF. Because the number of 𝑁imp most powerful countries are selected as imperialists IM feature genes is less than the template BT, the IMBT just and the remaining 𝑁col (𝑁col =𝑁pop −𝑁imp) countries takes one part of BT. To describe how it works, we will illus- are assigned to these empires according to the power of trate the following numerical example. Imperialist represents imperialists as their colonies. To assign the colonies among 𝑋IM = {1, 1, 0, 0, 1, 0, 1, 0, 0, 1}; one of its colonies encodes 6 BioMed Research International

1 1 0 0 1 0 1 0 0 1 f1 f IM f2 1 f1 f2 f5 f7 f10 f5 IMBT 1 0 1 1 0 f5 f 7 f7 f10 BT 1 0 1 1 0 0 󳰀 IM CO 1 0 1 0 0 1 0 1 1 1 IM BTF 0 1 0 0 1 1 f1 f3 f6 f8 f9 f10 f1 f3 f6 COBT 0 1 0 0 1 1 Difference f f f f8 IM-CO 2 5 7 f9 f f10 3 󳰀 Number of different genes: 3, randomly generated number between 1 f9 CO and 3, for example 2; randomly selected two genes f2 and f7 from IM-CO CO f10

Assimilated CO f2 f3 f7 f9 f10

Figure 4: A colony is assimilated by an imperialist.

𝑋CO = {1, 0, 1, 0, 0, 1, 0, 1, 1, 1}; then the differential informa- 11001 0 1 0 0 1 tion is described as 𝑋IM −𝑋CO = {0, 1, 0, 0, 1, 0, 1, 0, 0, 0}.Itis obvious that the number of different genes is 3 (the number 111 0 1 001 0 1 of 1). According to the parameter 𝛽,thenumberofselected genes from different genes set is 2 and the features of 𝑓2 Figure 5: Producing nearby solutions in TS. and 𝑓7 arechosen.Atthesametime,BT = {1, 0, 1, 1, 0, 0} is produced by random strategy and BTF = {0, 1, 0, 0, 1, 1} is the nonoperation of BT. IMBT = {1, 0, 1, 1, 0} is obtained the whole algorithm has a speed convergence. However, the from BT and COBT = {0, 1, 0, 0, 1, 1} is equal to BTF. After classical ICA is easy to fall into local optimum. Therefore, this process, CO becomes a country with the features 𝑓3, 𝑓9, the exploitation is performed by TS to search the better 𝑓 and 10 and the assimilated CO combining different genes solution nearby the current imperialist and to escape from 𝑓 𝑓 𝑓 𝑓 between CO and IM, with five features 2, 3, 7, 9,and local optimal in this paper. How to produce the solution 𝑓 10, is produced. Therefore, assimilated CO is generated by and the tabu list is very important in TS algorithm. In our BL operation which is shown in Figure 4. study, one bit of the solution with nonoperation is utilized to produce the nearby solutions. For example, if the gene 3.3. Fitness Function. The feature selection of gene expression expression data with 10 dimensions and the current country data needs to consider the classification accuracy and the is 𝑋country = {1, 1, 0, 0, 1, 0, 1, 0, 0, 1},thenearbysolution number of selected informative genes. Hence, the fitness canbeobtainedfromthecurrentsolutionthroughTS-based function is defined as follows: algorithm in Figure 5. 𝑛−𝐷(𝑋) (𝑋 )=𝑤 ×𝐴(𝑋)+(𝑤 × 𝑖 ) fitness 𝑖 1 𝑖 2 𝑛 (6) 4. Experiment 𝐴(𝑋 )∈[0,1] in which 𝑖 is the leave-one-out-cross-validation 4.1. Gene Expression Datasets and Parameter Setting. In this 𝑋 (LOOCV) classification accuracy of one country 𝑖 (gene paper, except for SRBCT which was gained by continuous subset) obtained by SVM model. 𝑛 is the dimensions of image analysis, the rest of the gene microarray datasets optimal problem; in other words, it is the total number of were obtained by the oligonucleotide technique. Presently, 𝐷(𝑋 ) genes for each sample and 𝑖 is the number of selected there is no standard method for processing gene expres- 𝑋 𝑤 𝑤 genes in 𝑖. We use the parameters 1 and 2 to measure sion data. Therefore, we designed an effective algorithm the importance of classification accuracy and the number HICATS to perform feature selection for improving the of selected genes, respectively. The classification accuracy is classification accuracy. The datasets consist of 10 pieces more crucial than the number of selected genes and setting of gene expression data, which can be downloaded from of the parameters satisfies constraint condition as follows: http://www.gems-system.org/. The description of datasets is 𝑤 ∈ [0, 1] 𝑤 =1−𝑤 1 and 2 1. listed in Table 1, which contains the dataset name and detailed expression. Table 2 gives the related samples, genes, and 3.4. TS-Based Local Search. In HICATS, each colony can be classes. These datasets contained binary-class and multiclass assimilated by its imperialist and then improve itself. Thus, data that contained thousands of genes. BioMed Research International 7

Table 1: Cancer-related human gene microarray datasets used in this study.

Dataset name Description 9 Tumors Oligonucleotide microarray gene expression profiles for the chemosensitivity profiles of 232 chemical compounds 11 Tumors Transcript profiles of 11 common human tumors for carcinomas of the prostate, breast, colorectum, lung, liver, gastroesophagus, pancreas, ovary, kidney, and bladder/ureter DNA microarray gene expression profiles derived from 99 patient samples. The medulloblastomas included primitive Brain Tumor 1 neuroectodermal tumors, atypical teratoid/rhabdoid tumors, malignant gliomas, and the medulloblastomas activated by the sonic hedgehog pathway Brain Tumor 2 Transcript profiles of four malignant gliomas, including ssiccla glioblastoma, nonclassic glioblastoma, classic oligodendroglioma, and nonclassic oligodendroglioma Leukemia 1 DNA microarray gene expression profiles of acute myelogenous leukemia (AML) and acute lymphoblastic leukemia (ALL) of B-cell and T-cell Leukemia 2 Gene expression profiles of a chromosomal translocation to distinguish mixed- leukemia, ALL, and AML Lung Cancer Oligonucleotide microarray transcript profiles of 203 specimens, including lung adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinomas, small-cell lung carcinomas, and normal lung tissue SRBCT cDNA microarray gene expression profiles of small, round blue cell tumors, which include neuroblastoma, rhabdomyosarcoma, non-Hodgkin’s lymphoma, and the Ewing family of tumors cDNA microarray gene expression profiles of prostate tumors. Based on MUC1 and AZGP1 gene expression, the prostate Prostate Tumor cancer can be distinguished as a subtype associated with an elevated risk of recurrence or with a decreased risk of recurrence DLBCL DNA microarray gene expression profiles of DLBCL, in which the DLBCL can be identified as cured versus fatal or refractory disease

Table 2: Description of gene expression datasets. computation efficiency. Large number of countries would require more computational times for completing feature Number of Dataset number Dataset name selection while if the number is too small, although the Samples Genes Classes algorithm can take place in a relatively short period of 19Tumors 60 5726 9 time, the performance of the algorithm is not guaranteed. 211Tumors 174 12533 11 Therefore, the intermediate values for the size of population 3BrainTumors 1 90 5920 5 and iteration are chosen to be 15 and 50, respectively. Since 4BrainTumors 2 50 10367 4 population is composed of imperialists and colonies, the 5 Leukemia 1 72 5327 3 number of imperialists also needs to be determined. If the 6Leukemia272112253number of imperialists is 1, the HICATS algorithm is trans- 7 Lung Cancer 203 12600 5 formed to single-population evolutionary algorithm instead of multisubpopulation while if the number of imperialists is 8SRBCT8323084 too large, the number of colonies cannot be guaranteed. The 9ProstateTumor 102 10509 2 number of imperialists is chosen to be 4 in our experiment. In 10 DLBCL 77 5469 2 Section 3.3, the parameters of 𝑤1 and 𝑤2 are introduced and the range of values is given. In order to guarantee that 𝑤1 is Table 3: Parameter settings for HICATS. larger than 𝑤2,thevaluesof𝑤1 and 𝑤2 are set as 0.8 and 0.2 in our proposed algorithm with the same parameter setting Parameters Values of EPSO [13]. The number of countries 15 The number of imperialists 4 4.2. Experiment Results. In this paper, a hybrid algorithm The number of colonies 11 HICATS incorporating ICA and TS is used to perform feature The number of iterations (generations) 50 selection for the gene expression data. TS was embedded in 𝑤1 0.8 the ICA to prevent the method from getting trapped into a 𝑤2 0.2 local optimum, while applying TS on imperialist can improve the performance and speed up the convergence of TS. The experiment results included classification accuracy The parameter values for HICATS are shown in Table 3.It and the number of selected feature genes obtained by isveryclearthattheparametersoftheproposedalgorithmare HICATS over 10 independent runs for 10 datasets included less than binary particle swarm optimization (BPSO). Hence, 11 Tumors, 9 Tumors, SRBCT, Leukemia 1, Leukemia 2, the influence of parameter setting on HICATS is relatively DLBCL, Prostate Tumor, Lung Cancer, Brain Tumors 1, and small and the robustness of algorithm is better. The size of Brain Tumors 2 which are shown in Tables 4, 5, and 6. It is the population affects the performance of the algorithm and found that the classification accuracy of HICATS achieves 8 BioMed Research International

Table 4: The computational results obtained by our proposed algorithm HICATS for 10 independent runs on11 Tumors, 9 Tumors, and SRBCT datasets. 11 Tumors 9 Tumors SRBCT Runs Acc. (%) Selected genes Acc. (%) Selected genes Acc. (%) Selected genes 1 97.70 287 75.00 245 100 10 2 96.55 302 76.67 262 100 14 3 94.83 330 75.00 233 100 15 4 95.40 268 75.00 249 100 13 5 96.55 290 76.67 257 100 9 6 96.55 356 81.67 242 100 12 7 94.83 323 83.33 259 100 16 8 94.83 349 76.67 238 100 9 9 95.98 275 81.67 247 100 9 10 95.40 295 81.67 253 100 10 Ave. ± SD 95.86 ± 0.97 307.5 ± 30.46 78.33 ± 3.33 248.5 ± 9.38 100 ± 0 11.70 ± 2.67

Table 5: The computational results obtained by our proposed algorithm HICATS for 10 independent runs on Leukemia 1, Leukemia 2, and DLBCL datasets. Leukemia 1 Leukemia 2 DLBCL Runs Acc. (%) Selected genes Acc. (%) Selected genes Acc. (%) Selected genes 1 100 3 100 8 100 4 2 100 3 100 10 100 3 3 100 3 100 6 100 5 4 100 3 100 6 100 3 5 100 3 100 7 100 4 6 100 3 100 8 100 3 71003100 5 100 4 8 100 3 100 7 100 5 9 100 3 100 5 100 6 10 100 3 100 6 100 4 Ave. ± SD 100 ± 03± 0100± 0 6.80 ± 1.55 100 ± 0 4.10 ± 0.99

100% with less than 10 informative genes for Leukemia 1, classification datasets. The comparison results including the Leukemia 2, and DLBCL and with less than 20 selected optimal classification accuracy and the number of selected genes for SRBCT. The average classification accuracy is more genes obtained by HICATS and ICA are given in Table 7. than 92.22% for all the datasets except for 9 Tumors. In The difference between these two algorithms is only whether other words, it is strongly demonstrated that HICATS can each contains local search mechanism TS or not. It is quite efficiently select informative genes from high-dimensional, clear that HICATS performs better than original ICA for binary-class, or multiclass datasets for classification. For all all datasets. Hence, ICA combined with TS can effectively best classification results, the selected genes are less than 10 jump out of local optimum and HICATS achieves better except for 9 Tumors and 11 Tumors, while, for the average performance. Table 8 compares experiment results obtained classification result, the informative genes in subset are also by other approaches from the literature and the proposed less than 10 except for 9 Tumors, 11 Tumors, and SRBCT method HICATS. Various methods including non-SVM and datasets. Furthermore, the standard deviation is less than 5 MC-SVM were used to compare with our proposed method. for all datasets except for 9 Tumors and 11 Tumors. From the The experiment results listed in Table 8 were taken from classification accuracy and the selected informative genes, it Chuang et al. for comparison [9]. Non-SVM contains the 𝐾- is not difficult to find that HICATS is an efficient algorithm nearest neighbor method [9, 14, 15], backpropagation neural for feature selection and produces a near-optimal gene subset networks [16], and probabilistic neural networks [17]. MC- from gene expression data. SVM includes one-versus-one and one-versus-the-rest [18], In order to verify the effectiveness of the proposed algo- DAG-SVM [19], the method by Weston and Watkins [20, 21], rithm, firstly, we will compare the performance of HICATS and the method by Crammer and Singer [20, 22]. It is obvious with pure ICA using SVM as a classifier under the same that our proposed approach HICATS obtained all the high- experimental conditions; then, we will compare HICATS est classification accuracies for the 10 benchmark datasets. with other optimization algorithms on several benchmark The average highest classification accuracy of non-SVM, BioMed Research International 9

Table 6: The computational results obtained by our proposed algorithm HICATS for 10 independent runs on Prostate Tumor, Lung Cancer, Brain Tumor 1, and Brain Tumor 2 datasets. Prostate Tumor Lung Cancer Brain Tumor 1 Brain Tumor 2 Runs Acc. (%) Selected genes Acc. (%) Selected genes Acc. (%) Selected genes Acc. (%) Selected genes 1 98.04 8 95.57 6 94.44 6 94 5 2 97.06 7 96.06 6 93.33 12 90 6 3 98.04 5 96.06 9 94.44 9 94 7 4 98.04 7 95.57 8 91.11 10 92 5 5 97.06 6 96.06 7 93.33 8 92 3 6 98.04 7 97.04 11 92.22 14 94 8 7 97.06 10 96.06 8 91.11 7 92 4 8 98.04 8 96.06 7 93.33 9 94 3 9 98.04 9 96.06 9 94.44 6 90 9 10 98.04 5 97.04 7 93.33 8 94 8 Ave. ± SD 97.75 ± 0.47 7.2 ± 1.62 96.16 ± 0.50 7.8 ± 1.55 93.10 ± 1.26 8.9 ± 2.55 92.60 ± 1.60 5.8 ± 2.14

Table 7: Classification accuracies and selected genes obtained by HICATS and ICA for gene expression data.

Methods Datasets HICATS ICA Acc. (%) Selected genes Acc. (%) Selected genes 9 Tumors 83.33 259 76.67 282 11 Tumors 97.70 287 95.98 293 Brain Tumor 1 94.44 6 91.11 8 Brain Tumor 2 94 3 92 5 Leukemia 1 100 3 97.50 7 Leukemia 2 100 5 97.32 8 Lung Cancer 97.04 7 95.57 12 SRBCT 100 9 100 10 Prostate Tumor 98.04 5 97.06 6 DLBCL 100 3 97.50 5

Table 8: Classification accuracies of gene expression data obtained via different classification methods.

Methods HICATS Datasets Non-SVM MC-SVM SVM 𝐾NN [9] NN PNN OVR OVO DAG WW CS OVR 9 Tumors 78.33 19.38 34.00 65.10 58.57 60.24 62.24 65.33 83.33 11 Tumors 93.10 54.14 77.21 94.68 90.36 90.36 94.68 95.30 97.70 Brain Tumor 1 94.44 84.72 79.61 91.67 90.56 90.56 90.56 90.56 94.44 Brain Tumor 2 94.00 60.33 62.83 77.00 77.83 77.83 73.33 72.83 94 Leukemia 1 100 76.61 85.00 97.50 91.32 96.07 97.50 97.50 100 Leukemia 2 100 91.03 83.21 97.32 95.89 95.89 95.89 95.89 100 Lung Cancer 96.55 87.80 85.66 96.05 95.59 95.59 95.55 96.55 97.04 SRBCT 100 91.03 79.50 100 100 100 100 100 100 Prostate Tumor 92.16 79.18 79.18 92.00 92.00 92.00 92.00 92.00 98.04 DLBCL 100 89.64 80.89 97.50 97.50 97.50 97.50 97.50 100 (1) Non-SVM: traditional classification method. (2) MC-SVM: multiclass support vector machines. (3) 𝐾NN: 𝐾-nearest neighbors. (4)NN:backpropagation neural networks. (5) PNN: probabilistic neural networks. (6) OVR: one-versus-the-rest. (7) OVO: one-versus-one. (8) DAG: DAGSVM. (9)WW:methodby Weston and Watkins. (10) CS: method by Crammer and Singer. (11) HICATS: improved binary imperialist competition algorithm. 10 BioMed Research International

Table9:ThenumberofselectedgenesfromdatasetsbetweenHICATSandIBPSO.

HICATS IBPSO Datasets Genes selected Percentage of genes selected Genes selected Percentage of genes selected 9 Tumors 259 0.045 2941 0.51 11 Tumors 287 0.022 3206 0.26 Brain Tumor 1 6 0.001 754 0.13 Brain Tumor 2 3 0.0003 1197 0.12 Leukemia 1 3 0.0006 1034 0.19 Leukemia 2 5 0.0004 1292 0.12 Lung Cancer 7 0.0005 1897 0.15 SRBCT 9 0.004 431 0.19 Prostate Tumor 5 0.0005 1294 0.12 DLBCL 3 0.0005 1042 0.19 Average 5 0.00097 1117.6 0.15

9_Tumors 11_Tumors 1 1

0.95 0.8

0.9 0.6 Fitness Fitness 0.85

0.4 0.8

0.2 0.75 0 10 20 30 40 50 0 10 20 30 40 50 Generation Generation

Best Best Average Average

Figure 6: The convergence graphs of the best and average accuracy classification by HICATS algorithm on9 Tumors and 11 Tumors datasets.

MC-SVM, and HICATS is 97.14, 93.63, and 97.81, respectively. a modified sigmoid function to increase the probability of For the datasets of Leukemia 1, Leukemia 2, SRBCT, and the bits in a particle’s position to be zero and minimized DLBCL, the classification accuracy can reach 100%. The the number of selected genes. In our proposed algorithm, average classification accuracy of HICATS and IBPSO seems randomly generated binary templates are used to reduce to be the same; however, the selected genes of HICATS are the dimension of selected genes in each generation due to significantly less than those of IBPSO listed in Table 9 because the assimilation mechanism that the colonies learn a lot of dimension reduction mechanisms are different. IBPSO algo- different genes from their imperialist. Hence, it is not hard rithm mainly utilizes the value of sigmoid function to deter- to find that the speed of convergence is very fast and the mine whether the gene is selected. In the initial iteration, the differences of the number of selected genes between HICATS probabilities of 0 and 1 are 0.5 by a standard sigmoid function and IBPSO are huge. without any constraint and no modification. Then, in the The convergence graphs of the best and average classifica- next iteration, the probabilities are potentially influenced by tion accuracy obtained by HICATS for 9 Tumors, 11 Tumors, velocity vectors; however, the probabilities of 0 and 1 are SRBCT, and DLBCL are shown in Figures 6 and 7. It can be mostly maintained for its application on the gene expression seen that the best classification accuracy is achieved to be dataduetoitshighdimensionsandalargesearchspace.The 100% less than 10 iterations for SRBCT and between 10 and number of genes is minimized about half of the total number 20 generations for DLBCL. Therefore, HICATS possesses a of genes using the standard sigmoid function in high- faster convergence speed and achieves the optimal solution dimensional data. Therefore, Mohamad et al. [13] introduced rapidly. BioMed Research International 11

SRBCT DLBCL 1 1

0.99 0.95

0.98 0.9 0.97 Fitness Fitness 0.85 0.96

0.8 0.95

0.94 0.75 0 10 20 30 40 50 0 10 20 30 40 50 Generation Generation

Best Best Average Average

Figure 7: The convergence graphs of the best and average accuracy classification by HICATS algorithm on SRBCT and DLBCL datasets.

5. Conclusions [3]H.Alshamlan,G.Badr,andY.Alohali,“Acomparativestudyof cancer classification methods using microarray gene expression In this paper, a hybrid algorithm HICATS incorporated profile,” in Proceedings of the 1st International Conference on binary imperialist competition algorithm and tabu search Advanced Data and Information Engineering (DaEng ’13),vol. is used to perform feature selection and SVM with one- 285 of Lecture Notes in Electrical Engineering, pp. 389–398, versus-the-rest serves as an evaluator of HICATS for gene Springer, 2014. expression data classification problems. This work effectively [4]M.M.Kabir,M.Shahjahan,andK.Murase,“Anewlocal combines the advantages of two kinds of different search search based hybrid genetic algorithm for feature selection,” mechanism algorithms to obtain the higher classification Neurocomputing,vol.74,no.17,pp.2914–2928,2011. accuracy for gene expression data problems. In general, the [5] Q. Shen, W.-M. Shi, and W. Kong, “Hybrid particle swarm classification performance of HICATS is as good as IBPSO; optimization and tabu search approach for selecting genes for however, HICATS is superior to IBPSO and other methods tumor classification using gene expression data,” Computational in terms of selected genes. In our proposed algorithm, in Biology and Chemistry,vol.32,no.1,pp.53–60,2008. order to avoid imperialist premature convergence, a local [6] S. Li, X. Wu, and M. Tan, “Gene selection using hybrid particle search strategy TS embedded in ICA while TS is applied swarm optimization and genetic algorithm,” Soft Computing, on imperialist in each empire can exploit the neighbors vol.12,no.11,pp.1039–1048,2008. of imperialist to speed the convergence and assist in the [7] H. Alshamlan, G. Badr, and Y. Alohali, “MRMR-ABC: a hybrid imperialist evolution. Experimental results show that our gene selection algorithm for cancer classification using microar- method effectively classifies the samples with reduced feature ray gene expression profiling,” BioMed Research International, genes. In the future work, imperialist competition algorithm vol.2015,ArticleID604910,15pages,2015. combined with other intelligent search strategies will be used [8]H.M.Alshamlan,G.H.Badr,andY.A.Alohali,“GeneticBee to select informative genes. Colony (GBC) algorithm: a new gene selection method for microarray cancer classification,” Computational Biology and Chemistry,vol.56,pp.49–60,2002. Competing Interests [9] L.-Y. Chuang, H.-W. Chang, C.-J. Tu, and C.-H. Yang, The authors declare that they have no competing interests. “Improved binary PSO for feature selection using gene expres- sion data,” Computational Biology and Chemistry,vol.32,no.1, pp. 29–37, 2008. References [10] E. Atashpaz-Gargari and C. Lucas, “Imperialist competitive algorithm: an algorithm for optimization inspired by imperi- [1] H. Feilotter, “A biologist’s guide to analysis of DNA microarray alistic competition,” in Proceedings of the 2007 IEEE Congress data,” The American Journal of Human Genetics,vol.71,no.6, on Evolutionary Computation (CEC ’07), pp. 4661–4667, IEEE, pp.1483–1484,2002. Singapore, September 2007. [2] R. Simon, “Analysis of DNA microarray expression data,” Best [11] F. Glover, “Future paths for integer programming and links to Practice and Research: Clinical Haematology,vol.22,no.2,pp. artificial intelligence,” Computers & Operations Research,vol.13, 271–282, 2009. no.5,pp.533–549,1986. 12 BioMed Research International

[12]M.Gong,L.Jiao,andL.Zhang,“Baldwinianlearninginclonal selection algorithm for optimization,” Information Sciences,vol. 180, no. 8, pp. 1218–1236, 2010. [13]M.S.Mohamad,S.Omatu,S.Deris,M.Yoshioka,A.Abdullah, and Z. Ibrahim, “An enhancement of binary particle swarm optimization for gene selection in classifying cancer classes,” Algorithms for Molecular Biology,vol.8,article15,2013. [14] B. V. Dasarathy, NN Concepts and Techniques: Nearest Neigh- bours (NN) Norms, IEEE Computer Society Press, 1991. [15] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory,vol.13,no.1,pp.21– 27, 1967. [16] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997. [17] D. F. Specht, “Probabilistic neural networks,” Neural Networks, vol.3,no.1,pp.109–118,1990. [18] C. Orsenigo and C. Vercellis, “Multicategory classification via discrete support vector machines,” Computational Management Science,vol.6,no.1,pp.101–114,2009. [19] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin dags for multiclass classification,” Advances in Neural Informa- tion Processing Systems,vol.12,no.3,pp.547–553,2010. [20] L. Chin-Jen, “A comparison of methods for multiclass support vector machines,” IEEE Transactions on Neural Networks,vol. 13, no. 2, pp. 415–425, 2002. [21] J. Weston, “Multi-class support vector machines,”in Proceedings of the European Symposium on Artificial Neural Networks (ESANN ’99), pp. 1–28, 1999. [22] K. Crammer and Y. Singer, “On the learnability and design of output codes for multiclass problems,” Machine Learning,vol. 47,no.2-3,pp.201–233,2002. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 3952494, 8 pages http://dx.doi.org/10.1155/2016/3952494

Research Article Predicting Diagnostic Gene Biomarkers for Non-Small-Cell Lung Cancer

Bin Liang, Yang Shao, Fei Long, and Shu-Juan Jiang

Department of Respiratory Medicine, Shandong Provincial Hospital Affiliated to Shandong University, Jinan, Shandong 250021, China

Correspondence should be addressed to Shu-Juan Jiang; [email protected]

Received 6 April 2016; Accepted 7 June 2016

Academic Editor: Jiangning Song

Copyright © 2016 Bin Liang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Lung cancer is the primary reason for death due to cancer worldwide, and non-small-cell lung cancer (NSCLC) is the most common subtype of lung cancer. Most patients die from complications of NSCLC due to poor diagnosis. In this paper, we aimed to predict gene biomarkers that may be of use for diagnosis of NSCLC by integrating differential gene expression analysis with functional association network analysis. Wefirst constructed an NSCLC-specific functional association network by combining gene expression correlation with functional association. Then, we applied a network partition algorithm to divide the network into gene modules and identify the most NSCLC-specific gene modules based on their differential expression pattern in between normal and NSCLC samples. Finally, from these modules, we identified genes that exhibited the most impact on the expression of their functionally associated genes in between normal and NSCLC samples and predicted them as NSCLC biomarkers. Literature review of the top predicted gene biomarkers suggested that most of them were already considered critical for development of NSCLC.

1. Introduction with NSCLC were already in the unresectable stage of IIIB or IV and the option of surgery treatment was not feasible, Lung cancer is not only the most common cancer in the which is the main reason why the overall 5-year survival world, but also the primary reason for death due to cancer [1]. rate of NSCLC patients is only 12%. However, the overall 5- At present, the number of Chinese patients who were newly year survival rate can be increased to 50% if patients can diagnosedwithlungcancerisabout500thousandeachyear, be diagnosed in an early stage [6]. Currently, the diagnosis and the number is expected to reach one million by 2025 [2]. of NSCLC mainly depends on X-ray, CT, sputum cytology, Non-small-cell lung cancer (NSCLC) is the most common fiber bronchoscopy, and cellular pathology. However, X-ray subtype of lung cancer, and NSCLC patients account for is not sensitive to small lesions and covert lesions; sputum about 80% to 85% of all lung cancer incidences [3]. The most cytology only works on lung cancer originated in the center common types of NSCLC are squamous cell carcinoma, large ofairwaysandhasalowdetectionrate;fiberbronchoscopyis cell carcinoma, and adenocarcinoma [4]. Adenocarcinoma is notsuitableforearlydiagnosisduetoitsinvasiveness;cellular currently the most common type of lung cancer in “never pathology cannot precisely determine malignance stages [7]. smokers” (lifelong nonsmokers) [5]. Large cell lung carci- As such, identifying molecular biomarkers for early diagnosis noma (LCLC) is a heterogeneous group of undifferentiated of NSCLC is urgently needed. malignant neoplasms originating from transformed epithelial During the development of cancer, tumor cells undergo cells in the lung [3]. Currently, the most often used treatments significant alterations at both genetic and molecular levels, for NSCLC are surgery, radiotherapy, and chemotherapy, in which are accompanied by significant changes in gene expres- which the complete surgical resection is the most effective sion [8]. Genes with significant expression change during method [4]. Despite the improvements of surgical techniques tumor cell development can therefore be used as biomarkers and instruments in recent years, the overall 5-year survival for early diagnosis. For example, Yang et al. conducted a rate for NSCLC patients after surgery is from 15% to 45% meta-analysis on high-throughput gene expression data of [6]. On the other hand, most patients that were diagnosed lung cancer and identified tumor necrosis factor-𝛼 (TNF-𝛼) 2 BioMed Research International that predicts prognosis of relapse survival for lung cancer The human STRING network (version 9.1) [18, 19] was [9]. Spinola et al. found that the expression of PDCD5 was downloaded from http://string.embl.de/, which consists of significantly lower in lung cancer patients and suggested that 19,038 genes and 2,271,610 weighted edges. Next, we filtered itcouldbeusedasapotentialmolecularmarkerforpredict- the STRING network by requiring the functional association ing the diagnosis of lung cancer [10]. In another example, scores between two genes to be greater than 500. Finally, the expression of RRM1 encoding ribonucleotide reductase we mapped the coexpression gene pairs identified in both was found to be associated with the response to gemcitabine normal and NSCLC datasets to the filtered STRING network chemotherapy, making it a prognostic marker for lung cancer and obtained a lung cancer-specific functional association [11, 12]. However, identifying gene biomarkers based only network. In this network, any two linked genes not only are on their expression change is not a reliable approach, as strongly functionally associated, but also are either positively the results may be significantly biased by expression noise or negatively coexpressed in normal or NSCLC samples. [13]. On the other hand, since cancer development is a complicated process involving interplay between many genes 2.2. Network Partition and the Identification of NSCLC- concordantly, it is necessary to take into consideration the Specific Gene Modules. Again, following what was already complicated relationships between genes when identifying described, we applied a network partition algorithm named gene biomarkers of diagnostic use. Thus, a number of meth- iNP [28] to divide the lung cancer-specific network into ods have been developed to integrate network or pathway gene modules. In order to guarantee each gene module information for discovering disease genes [14, 15]. including several genes, we also set the number of genes Recently, Long et al. developed an integrated approach inside a module to be greater than 4 so that we can get some for predicting biomarkers that distinguish NSCLC from stable modules. From these gene modules, we identified the small-cell lung cancer (SCLC) [16]. In their approach, they NSCLC-specific gene modules by inspecting their differential first constructed a lung cancer-specific functional network expression pattern in between NSCLC and normal samples. by integrating gene coexpression relationships identified in Briefly, for each module, we recorded the median expression NSCLC, SCLC, and normal lung tissues with gene functional value in each sample as the representative expression value associations obtained from the STRING network [17]. Then, for the module; then, we applied 𝑡-test in 𝑅 to determine the they applied a network partition algorithm to identify gene significance of the genes’ differential expression in between modules from the lung cancer-specific network and selected normal and NSCLC samples. Following Long et al. [16], we the gene modules with distinctive gene expression pattern used the Benjamini-Hochberg false discovery rate (FDR) [29, in between NSCLC and SCLC. Finally, from the selected 30] to conduct multiple test correction and chose the cutoff gene modules, they identified those genes whose function- to adjust 𝑝 value at 0.01 and also required log (fold change) ally associated partners exhibit significant gene expression ≥2 to select candidate gene modules. These gene modules alterations in between NSCLC and SCLC and predicted them were considered NSCLC-specific gene modules. To quantify as candidate biomarkers. Inspired by their approach, in this the specificity of each selected module to NSCLC, for each study, we aimed to predict gene biomarkers that distinguish module, we plotted a receiver operating characteristic (ROC) NSCLC from normal lung tissues and therefore might be of curve by applying the modules’ median gene expression value use for diagnosis of NSCLC. to distinguish NSCLC from normal cases. The area under theROCcurve(AUC)couldbegreaterorsmallerthan 2. Materials and Methods 0.5, corresponding to up- or downregulated gene expression patterns in NSCLC in comparison to their expression levels 2.1. Dataset Collection, Processing, and Construction of Lung in normal case. The farther the AUC is from 0.5, the more Cancer-Specific Functional Association Network. In Long et specific the expression of a module is to NSCLC. al.’s study [16], they obtained three gene expression datasets forNSCLCandSCLCfromGeneExpressionOmnibus 2.3.GeneSetEnrichmentAnalysis. Genesetenrichmentanal- 𝑅 (GEO) database of NCBI (http://www.ncbi.nlm.nih.gov/geo/) ysis was performed by Fisher test in . Gene set annotations by following these criteria: samples must be from untreated include gene ontology (GO) biological process terms [31] and human tissue samples and the sample size should be greater gene sets from molecular signatures database (MSigDB) [32]. 𝑅 than 100. Since one of the expression datasets was SCLC, Multiple test correction was conducted by FDR in . here we only used two expression datasets: normal lung tissue dataset (GSE23546) and NSCLC dataset (GSE41271) 2.4. Scoring Genes for Candidate Gene Biomarkers of NSCLC. that include 1349 and 275 samples, respectively. We followed Given a selected gene module, we followed Long et al. Long et al. [16] to process the normal and NSCLC datasets [16]toscoreeachgeneinsidethemoduleforthegenes’ separatelyandtoconstructthelungcancer-specificfunc- usefulness as NSCLC biomarker. Briefly, there were two tional association network. Briefly, after normalizing gene component scores: the cancer-specificity score determined expression profiles of normal and NSCLC samples separately, by the AUC ROC of the module the gene belongs to minus we calculated the Pearson Correlation Coefficient (PCC) 0.5 and the coexpression change score of the gene. To obtain of gene expression levels for every gene pair in the two thecoexpressionchangescore,allgenesconnectingtothe datasets separately and selected the top 1% and the bottom target gene were first recorded. Then, for each connection, a 1% of gene pairs ranked by PCC, representing a strong coexpression status difference score was computed by using relationship of positive and negative coexpression, respectively. the coexpression status of the two genes in NSCLC minus BioMed Research International 3 their coexpression status in normal case. The coexpression NSCLC samples from normal samples, from which the area statuscantakethreescores:1,−1, and 0, indicating positive under curve (AUC) was computed. Here, we labeled cancer correlation, negative correlation, and others, respectively. samples with “1” and normal samples with “0.” Then, by Then, the absolute coexpression status difference was aver- sorting the gene expression values of a module in each sample aged across all connections to derive the coexpression change from high to low values, we computed the true positive rate score,whichindicatedtheabilityofthegenetoimpactthe (number of samples with label of “1” above the value/number expression pattern of its functionally associated genes. The of all samples with label “1”) and false positive rate (number two component scores were then multiplied, and a positive of samples with label of “0” above the value/number of all score indicated the gene tended to be upregulated in NSCLC, samples with label “0”) at each expression value. Finally, a or vice versa. ROCcurvewasobtainedbyplottingthetruepositiverate against the false positive rate. If the expression values of a 3. Result module in cancer samples were similar to that in normal samples, then the AUC of the ROC curve would be close to 3.1. Construction of an NSCLC-Specific Functional Association 0.5. If a module typically had higher gene expression values Network. The pipeline for predicting NSCLC diagnostic in cancer samples (upregulated in cancer samples), then its genebiomarkersisshowninFigure1.Therationaleofthe AUC would be greater than 0.5; otherwise (downregulated) methodology behind the pipeline was that, given a gene it would be smaller than 0.5. The farther the AUC of a ROC module in which genes not only were functionally associated, from 0.5, the more significant the difference that we would but also had correlated expression patterns, if the module’s observe in between the gene expression values of cancer overall expression pattern (the median expression value of and normal samples. In other words, the AUC of a ROC genes inside a module) was significantly different in between curve quantitatively determines the specificity of a module cancer and normal samples, then this module could be to NSCLC. Because AUC ROC measures the expression considered a cancer-specific gene module; within this cancer- alteration of a group of genes that are coexpressed, it is specific gene module, if a gene’s expression correlations more robust than the expression alteration of individual genes with its functionally associated partners were significantly and should be more appropriate to be used to represent the altered from normal to cancer samples, then this gene’s expression change of a given gene of interest. Here, we found expression might be critical to cancer development and was 2 modules with AUC value greater than 0.9 (significantly then considered a potential biomarker for diagnostic use. For upregulatedinNSCLC)and9withAUCvaluesmallerthan detailsaboutthemethodology,pleaserefertoLongetal. 0.1 (significantly downregulated) (Table 1). [16]. Briefly speaking, the pipeline consists of the following To validate the specificity of these modules to lung cancer, three steps: construction of a disease-specific functional we performed function enrichment analysis for genes inside association network by integrating gene coexpression with each module and showed the most significant function for functional association, identification of disease-specific gene each module in Table 1. It could be easily seen that the modules, and prediction of gene biomarkers. Following the enriched functions of selected modules were significantly pipeline, we have constructed an NSCLC-specific functional associated with cancer’s development. In the upregulated association network by integrating gene coexpression infor- gene modules, M44 has 5 genes with an AUC of 0.996; the mation obtained from both lung normal and NSCLC datasets top enriched function of this module was protein targeting with the filtered functional association extracted from the to membrane (𝑝 value: 1.68𝐸 − 04). Recently, Li and Perez- STRING network (see Materials and Methods for details). Soler [33] found that skin toxicity is related to inhibition of This network is a binary network consisting of 4,452 genes epidermal growth factor receptor (EGFR), a target for NSCLC and 13,831 edges. treatment. Another example of upregulated modules is M394 that consisted of 15 genes with an AUC of 0.909. The top 3.2. Identification of NSCLC-Specific Gene Modules. The enriched function was epidermis development with a 𝑝 value NSCLC-specific network was partitioned into 637 gene mod- of 1.80𝐸 − 08. Tian et al.’s [34] research result showed that an ulesbyusinganetworkpartitionalgorithmnamediNP.There important gene in epidermis development, DHHC,encoding are 254 gene modules with the cutoff of including more than palmitoyltransferase catalyzes S-palmitoylation by targeting 4 genes. To identify NSCLC-specific gene modules, following on the , and siRNA targeting this gene was Long et al. [16], we used the median gene expression value able to inhibit the growth of NSCLC cell lines. There were of the genes inside the module to represent the module’s nine NSCLC-specific gene modules that were downregulated. expression value in normal or NSCLC dataset and then Among these modules, M315 consisted of 9 genes and had an inspected the differential expression pattern of each module AUC of 0.0913. The top enriched function was cell chemotaxis between normal and NSCLC samples (see Materials and (8.50𝐸 − 07), which was found to promote the development Methods for details). We found 11 gene modules whose of NSCLC [35]. In another example, M349 consisted of seven expression was significantly different in between normal and genes with an AUC of 0.0091. All these seven genes were NSCLC samples. There are 2 upregulated modules and 9 associated with the function of BMP2 targeting with a 𝑝 downregulated modules among these differentially expressed valueof0.00159.Gautschietal.[36]foundthatId1plays modules. The representative expression values of each of the an important role in Src-mediated tumor cell invasion, and 11 modules were also used to plot ROC curves to discriminate BMP2 could induce the expression of Id1, suggesting that 4 BioMed Research International

Genes (normal) Genes (cancer) PCC STRING

A network associated with lung cancer

M1 M2

M3 M4

Up Down Modules

1.0 1.0 0.8 AUC = 0.98 0.8 AUC = 0.01 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 True positive rate positive True True positive rate positive True 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 1.0 False positive rate Normal cancer Normal cancer False positive rate ROC ROC

3 Gene 3 Gene 1 Gene 3 Gene 1 Gene 3 Gene 1 Gene Gene 1

Gene 4 Gene 4 Gene 2 Gene 2 Gene 2 Gene 2

Normal Cancer Normal Cancer ∑[(gene 1− gene 2) − (gene 1−gene 2) ] = (AUC − 0.5) × cancer normal Score n Score(gene 2) = (0.98 − 0.05) ∗(1−(−1) + 1 − (−1))/2 = 1.86 Score(gene 2) = (0.01 − 0.05) ∗ ((1−1) + (1−1) + (1 − 1))/3 = 0

Figure 1: The pipeline for predicting diagnostic gene biomarkers of NSCLC. In the bottom of the figure, the color of the line connecting gene 2 and the other genes indicates the coexpression status between gene 2 and the other genes, with red color corresponding to positive coexpression and blue color corresponding to negative coexpression. For details about the pipeline, refer to the Materials and Methods.

BMP2 targeting might be important for cancer development. from normal tissues. Following the strategy described by Therefore, functional enrichment results validated the speci- Long et al. [16], we computed the module’s specificity to ficity of the partitioned gene modules to NSCLC. NSCLC, which is AUC 0.5. Then, for each gene inside a cancer-specific module, we computed a coexpression change 3.3. Predicting Diagnostic Gene Biomarkers for NSCLC. After score to indicate its potential impact on the alteration of the identifying NSCLC-specific gene modules, we aimed to coexpression pattern of its functionally associated partner determine the genes inside an NSCLC-specific module that genes. For details about the score, please refer to the Materials have the best discriminating power to distinguish NSCLC andMethods.Thefinalscorecanbeeitherpositiveor BioMed Research International 5

Table 1: The 11 cancer-specific gene modules.

Module name AUC ROC The most significantly enriched function𝑝 ( value) M44 0.996 GO: protein targeting to membrane (0.00110) M394 0.909 GO: epidermis development (1.80e − 08) M53 0.000404 MSigDB: CTRL VS DAY3 LAIV IFLU VAC CINE PBMC UP (0.00670) M348 0.00113 MSigDB: MIKKELSEN IPS LCP WITH H3K4ME3 (0.00670) M350 0.00313 MSigDB: BOSCO TH1 CYTOTOXIC MODULE (1.80e − 08) M60 0.00833 MSigDB: MCLACHLAN DENTAL CARIES UP (8.66e − 11) M349 0.0091 MSigDB: BMP2 TARGETS UP (0.00159) M264 0.0159 MSigDB: SHEDDEN LUNG CANCER GOOD SURVIVAL (5.66e − 04) M83 0.0273 MSigDB: BOYLAN MULTIPLE MYELOMA (9.10e − 04) M94 0.0284 None M315 0.0913 GO: cell chemotaxis (8.50e − 07)

Table 2: Summary of the top 10 upregulated and top 10 downregu- and NSCLC cases indicated that it might be an important lated gene biomarkers for NSCLC. gene that could potentially have a large impact on NSCLC’s development. N6AMT1 ranked the top among downregulated Gene Gene name (upregulated) Gene name (downregulated) gene biomarkers. It had two functionally associated genes rank YRDC and ETF1 in gene module M53 whose AUC value 1 SEC61B N6AMT1 ∗ was 0.0004. Interestingly, the coexpression pattern of both 2 S100P [20] CYCS YRDC and ETF1 with N6AMT1 reversed from normal to 3 RPL23 YRDC NSCLC samples: they were both negatively coexpressed with 4 SEC61G PPP1R15B N6AMT1 in normal lung samples, while they were both ∗ ∗ 5 SPRR2D [21] TOP1 [22] positively coexpressed with N6AMT1 in NSCLC samples (Figure 2). Thus, it was very likely that the coexpression 6 RPS7 MMRN2 ∗ ∗ of N6AMT1 with its functionally associated genes may play 7 S100A2 [23] P2RY14 [24] ∗ important roles in the development of NSCLC. 8 SPRR3 FOXA2 [25] ∗ ∗ 9 DSG3 [26] GCA [27] ∗ 3.4. Case Reports for the Predicted Gene Biomarkers. We have 10 SPRR1A [21] MGAM conducted a thorough literature review on the predicted ∗ indicates that the gene was relevant to NSCLC directly, with the references genebiomarkers.AsshowninTable2,9ofthemwere shown in parenthesis. knowntoberelevanttoNSCLC,andtheother11were knowntoberelevanttocancer,stronglyvalidatingour predictions and also suggesting that they may be of use negative, indicating that the corresponding gene is either as diagnostic biomarkers for NSCLC. Below, we presented upregulated or downregulated in NSCLC, respectively. an example for both upregulated and downregulated gene We obtained 59 genes with nonzero scores. Among them, biomarkers, respectively. One example is S100P that encodes 11 were upregulated and 48 were downregulated. In Table 2, a member of small calcium-binding proteins family and we listed the top 10 upregulated and top 10 downregulated is highly expressed in NSCLC. S100P played a key role genes and considered them as potential biomarkers for downstream for Keap1-Nrf2 interaction. Keap1 encodes E3 NSCLC. SEC61B wasrankedthetopamongtheupregu- ligase and is involved in cellular defense response to oxidative lated genes. It was functionally associated with three genes stress through an interaction with nuclear factor erythroid-2- (SEC61G, RPL23,andRPS7)ingenemoduleM44thatwas related factor 2 (Nrf2). It has been shown that Keap1 could significantly associated with NSCLC (the AUC of the ROC inhibit tumor metastasis by targeting Nrf2/S100P pathway curvebyusingthemediangeneexpressionvalueofM44to in NSCLC cells [20]. Thus, S100P is of high importance to discriminate NSCLC from normal case was 0.996). As shown NSCLC. In another example, FOXA2 is a tumor suppressor in Figure 2, the coexpression pattern of SEC61B with its func- and has been suggested to be a new target protein for the tionally associated genes changed significantly from normal treatment of NSCLC [25]. to NSCLC samples. For example, SEC61B and RPL23 were not positive or negatively coexpressed in normal case; however, 3.5. The Validation of Analysis Strategy. FollowingLonget they were significantly positively coexpressed in NSCLC. As al. [16], in order to prove the availability and reasonability for SEC61G and RPS7, they were both positively coexpressed of the method, we used another NSCLC dataset (GSE10245; with SEC61B in normal samples; in NSCLC samples, however, detailed information was shown in Long et al. [16]) to they were no longer positively coexpressed with SEC61B. repeat the analysis for predicting gene biomarkers. Based on The significant changes in the coexpression pattern of genes this dataset, we predicted 84 genes biomarkers for NSCLC, functionally associated with SEC61B in between normal with 12 overlapping predictions. The number of overlapped 6 BioMed Research International

M44 0.996 1.0

RPL23 RPL23 0.8 SEC61G SEC61G SEC61B SEC61B 0.6 RPL26 RPL26 0.4 RPS7 RPS7

True positive rate positive True 0.2

0.0 0.0 0.2 0.4 0.6 0.8 1.0 False positive rate (a) M53 0.000404 CYCS CYCS 1.0

0.8 DDX21 PPP1R15B DDX21 PPP1R15B 0.6 ETF1 YRDC ETF1 YRDC 0.4 PPP2CA N6AMT1 PPP2CA N6AMT1 True positive rate positive True 0.2 TOP1 TOP1 0.0 Normal NSCLC 0.0 0.2 0.4 0.6 0.8 1.0 False positive rate ROC

(b)

Figure 2: The predicted gene biomarkers for NSCLC. (a) shows an example of upregulated gene biomarker—SEC61B. Its coexpression patterns with the functionally associated genes in normal and NSCLC were shown separately. Line colors of red, blue, and orange indicated the positive, negative, and random coexpression. (b) is similar to (a), except that the example is a downregulated gene biomarker—N6AMT1.Thecolor of each gene represented its expression level. The ROC curves to the right were based on the corresponding gene modules of SEC61B and N6AMT1. predictions was significantly higher than random guess (<10– functionally associated partners in between normal and 4, 10,000 randomizations), proving the robustness of our NSCLC samples. For instance, a gene is positively correlated predictions. withitsfunctionallyassociatedpartnersinnormalsamples in terms of gene expression, while this pattern significantly 4. Discussion altered to negative correlation. This abnormal coexpression pattern alteration thus made the gene potentially important Lung cancer is the primary reason for death due to cancer for NSCLC cancer cell development. Thus, the abnormal worldwide, and NSCLC is the most common subtype of lung coexpression pattern alteration could be potentially useful cancer. In the present study, we predicted 20 gene biomarkers for diagnostic use. Specifically, rather than focusing on the potentially useful for diagnosis of NSCLC. Literature reviews expression value alteration of the predicted gene biomarkers, of our predictions revealed that some of them were already we ought to consider the expression of both the predicted reported to be specifically relevant to NSCLC, while the gene biomarkers and their functionally associated partner others were relevant to cancer. As such, the predicted gene genes. For example, we could develop a microchip that inte- biomarkers may be of use for further exploitation as diagnos- grates not only the predicted gene biomarkers but also their ticbiomarkersforNSCLC. functionally associated partners, and then, by examining The predicted gene markers have the following char- theexpressionofthesegenesasawhole,wecoulddevelop acteristics. First of all, they are within gene modules in a system to detect the abnormal coexpression alteration which genes are strongly functionally associated with each for assessing whether a sample is NSCLC. The predicted other based on the STRING network and have correlated biomarkers were ranked according to our scoring scheme expression in normal or NSCLC samples. In addition, the that considers not only the level of expression alteration, but modules are specific to cancer as measured by the median also the extent of the expression change of the biomarkers’ expression values of genes inside the module. Thirdly, the partner genes. Therefore, a gene with higher rank should not predicted genes had altered expression correlation with their only have a more significant expression alteration in cancer, BioMed Research International 7

but also have higher impact on the expression of its partner [11] G. Bepler, I. Kusmartseva, S. Sharma et al., “RRM1 modulated in genes.Consequently,thisgeneshouldbemoreeasilydetected vitro and in vivo efficacy of gemcitabine and platinum in non- than those genes with lower ranks and should be studied with small-cell lung cancer,” Journal of Clinical Oncology,vol.24,no. higher priority. 29, pp. 4731–4737, 2006. Given the predicted biomarkers and their partner genes [12] W. Gong, X. Zhang, J. Wu et al., “RRM1 expression and for NSCLS, an immediate application is for clinical diagnostic clinical outcome of gemcitabine-containing chemotherapy for use of NSCLC. By developing a microchip to measure the advanced non-small-cell lung cancer: a meta-analysis,” Lung expression of these biomarkers and their partner genes, we Cancer,vol.75,no.3,pp.374–380,2012. can collaborate with local hospitals to apply this chip to [13]S.Karni,H.Soreq,andR.Sharan,“Anetwork-basedmethod NSCLC patients and normal people. Then, we can construct a for predicting disease-causing genes,” Journal of Computational clinical model based on the results from hundreds of patients Biology,vol.16,no.2,pp.181–189,2009. and normal people. This model will be used to diagnose [14] K.-Q. Liu, Z.-P. Liu, J.-K. Hao, L. Chen, and X.-M. Zhao, whether a new patient likely has NSCLC by measuring the “Identifying dysregulated pathways in cancers from pathway expression of the predicted biomarkers and their partners in interaction networks,” BMC Bioinformatics,vol.13,no.1,article the chip. 126, 2012. [15] X. Liu, Z.-P. Liu, X.-M. Zhao, and L. Chen, “Identifying disease genes and module biomarkers by differential interactions,” Competing Interests Journal of the American Medical Informatics Association,vol.19, no. 2, pp. 241–248, 2012. The authors declare that they have no competing interests. [16] F. Long, J.-H. Su, B. Liang, L.-L. Su, and S.-J. Jiang, “Identi- fication of gene biomarkers for distinguishing small-cell lung Acknowledgments cancer from non-small-cell lung cancer using a network-based approach,” BioMed Research International,vol.2015,ArticleID The authors would like to thank the financial support of the 685303, 8 pages, 2015. National Natural Science Foundation of China (81370138). [17] D. Szklarczyk, A. Franceschini, M. Kuhn et al., “The STRING database in 2011: functional interaction networks of proteins, References globally integrated and scored,” Nucleic Acids Research,vol.39, no. 1, pp. D561–D568, 2011. [1] R. Siegel, D. Naishadham, and A. Jemal, “Cancer , 2013,” [18] D. Szklarczyk, A. Franceschini, M. Kuhn et al., “The STRING CA Cancer Journal for Clinicians,vol.63,no.1,pp.11–30,2013. database in 2011: functional interaction networks of proteins, [2]J.Z.P.HaoandW.Q.Chen,2012 Chinese Cancer Registration globally integrated and scored,” Nucleic Acids Research,vol.39, Report, Junshi Yi Xue Ke Xue Chu Ban She, 2012. supplement 1, pp. D561–D568, 2011. [3]J.R.Molina,P.Yang,S.D.Cassivi,S.E.Schild,andA.A. [19] A. Franceschini, D. Szklarczyk, S. Frankild et al., “STRING v9.1: Adjei, “Non-small cell lung cancer: epidemiology, risk factors, protein-protein interaction networks, with increased coverage treatment, and survivorship,” Mayo Clinic Proceedings,vol.83, and integration,” Nucleic Acids Research,vol.41,no.1,pp.D808– no. 5, pp. 584–594, 2008. D815, 2013. [4] NCI, Non-Small Cell Lung Cancer Treatment, National Cancer [20] M.-H. Chien, W.-J. Lee, F.-K. Hsieh et al., “Keap1-Nrf2 inter- Institute, 2008. action suppresses cell motility in lung adenocarcinomas by targeting the S100P protein,” Clinical Cancer Research,vol.21, [5]S.Couraud,G.Zalcman,B.Milleron,F.Morin,andP.-J. no. 20, pp. 4719–4732, 2015. Souquet, “Lung cancer in never smokers—a review,” European [21] J. Tesfaigzi and D. M. Carlson, “Expression, regulation, and Journal of Cancer,vol.48,no.9,pp.1299–1311,2012. function of the spr family of proteins: a review,” Cell Biochem- [6]A.Matsuda,K.Yamaoka,andT.Tango,“Qualityoflifein istry and Biophysics, vol. 30, no. 2, pp. 243–265, 1999. advanced non-small cell lung cancer patients receiving pallia- [22] A.-K. Jakobsen, K. L. Lauridsen, E. B. Samuel et al., “Correlation tive chemotherapy: a meta-analysis of randomized controlled between topoisomerase I and tyrosyl-DNA phosphodiesterase trials,” Experimental and Therapeutic Medicine,vol.3,no.1,pp. 1 activities in non-small cell lung cancer tissue,” Experimental 134–140, 2012. and Molecular Pathology,vol.99,no.1,pp.56–64,2015. [7] D. E. Midthun, “Screening for lung cancer,” Clinics in Chest [23] T. Wang, Y. Liang, A. Thakur et al., “Diagnostic significance of Medicine,vol.32,no.4,pp.659–668,2011. S100A2 and S100A6 levels in sera of patients with non-small cell [8] Y. Bosse,´ D. S. Postma, D. D. Sin et al., “Molecular signature of lung cancer,” Tumor Biology,vol.37,no.2,pp.2299–2304,2016. smoking in human lung tissues,” Cancer Research,vol.72,no. [24] T. Muller,¨ H. Bayer, D. Myrtek et al., “The P2Y14 receptor of 15, pp. 3753–3763, 2012. airway epithelial cells: coupling to intracellular Ca2+ and IL-8 [9]C.Yang,J.Liu,X.Dong,Z.Cai,W.Tian,andX.Wang, secretion,” American Journal of Respiratory Cell and Molecular “Short-term and continuing stresses differentially interplay with Biology,vol.33,no.6,pp.601–609,2005. multiple hormones to regulate plant survival and growth,” [25] S.-M. Jang, J.-H. An, C.-H. Kim, J.-W. Kim, and K.-H. Choi, Molecular Plant,vol.7,no.5,pp.841–855,2014. “Transcription factor FOXA2-centered transcriptional regula- [10] M. Spinola, P. Meyer, S. Kammerer et al., “Association of the tion network in non-small cell lung cancer,” Biochemical and PDCD5 locus with lung cancer risk and prognosis in smokers,” Biophysical Research Communications,vol.463,no.4,pp.961– Journal of Clinical Oncology, vol. 24, no. 11, pp. 1672–1678, 2006. 967, 2015. 8 BioMed Research International

[26] M. Gomez-Morales,´ M. Camara-Pulido,´ M. T. Miranda-Leon´ et al., “Differential immunohistochemical localization of desmo- somal plaque-related proteins in non-small-cell lung cancer,” Histopathology,vol.63,no.1,pp.103–113,2013. [27] W. Wu, H. Li, H. Wang et al., “Effect of polymorphisms in XPD on clinical outcomes of platinum-based chemotherapy for chinese non-small cell lung cancer patients,” PLoS ONE,vol.7, no. 3, Article ID e33200, 2012. [28] S. Sun, X. Dong, Y. Fu, and W. Tian, “An iterative network partition algorithm for accurate identification of dense network modules,” Nucleic Acids Research,vol.40,no.3,p.e18,2012. [29] B. Singh, A. M. Ronghe, A. Chatterjee, N. K. Bhat, and H. K. Bhat, “MicroRNA-93 regulates NRF2 expression and is associated with breast carcinogenesis,” Carcinogenesis,vol.34, no. 5, pp. 1165–1172, 2013. [30] Y. Chand and M. A. Alam, “Network biology approach for identifying key regulatory genes by expression based study of breast cancer,” Bioinformation, vol. 8, no. 23, pp. 1132–1138, 2012. [31] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics,vol.25,no.1,pp. 25–29, 2000. [32] A.Liberzon,A.Subramanian,R.Pinchback,H.Thorvaldsdottir,´ P. Tamayo, and J. P. Mesirov, “Molecular signatures database (MSigDB) 3.0,” Bioinformatics,vol.27,no.12,ArticleIDbtr260, pp.1739–1740,2011. [33] T. Li and R. Perez-Soler, “Skin toxicities associated with epider- mal growth factor receptor inhibitors,” Targeted Oncology,vol. 4, no. 2, pp. 107–119, 2009. [34] H. Tian, J.-Y. Lu, C. Shao et al., “Systematic siRNA screen unmasks NSCLC growth dependence by palmitoyltransferase DHHC5,” Molecular Cancer Research,vol.13,no.4,pp.784–794, 2015. [35]M.W.Y.Li,R.Liu,L.Guo,T.Xu,J.Chen,andS.Xu, “Investigational study of mesenchymal stem cells on lung cancer cell proliferation and invasion,” ZhongguoFeiAiZaZhi,vol.18, no. 11, pp. 674–679, 2015. [36]O.Gautschi,C.G.Tepper,P.R.Purnelletal.,“Regulationof Id1 expression by Src: implications for targeting of the bone morphogenetic protein pathway in cancer,” Cancer Research, vol.68,no.7,pp.2250–2258,2008. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 6945304, 6 pages http://dx.doi.org/10.1155/2016/6945304

Research Article A Five-Gene Expression Signature Predicts Clinical Outcome of Ovarian Serous Cystadenocarcinoma

Li-Wei Liu,1 Qiuhao Zhang,1 Wenna Guo,2 Kun Qian,1 and Qiang Wang1

1 State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China 2School of Life Sciences, Shanghai University, Shanghai 200444, China

Correspondence should be addressed to Qiang Wang; [email protected]

Received 16 April 2016; Accepted 25 May 2016

Academic Editor: Jialiang Yang

Copyright © 2016 Li-Wei Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Ovarian serous cystadenocarcinoma is a common malignant tumor of female genital organs. Treatment is generally less effective as patients are usually diagnosed in the late stage. Therefore, a well-designed prognostic marker provides valuable data for optimizing therapy. In this study, we analyzed 303 samples of ovarian serous cystadenocarcinoma and the corresponding RNA-seq data. We observed the correlation between gene expression and patients’ survival and eventually established a risk assessment model of five factors using Cox proportional hazards regression analysis. We found that the survival time in high-risk patients was significantly shorter than in low-risk patients in both training and testing sets after Kaplan-Meier analysis. The AUROC value was 0.67 when predicting the survival time in testing set, which indicates a relatively high specificity and sensitivity. The results suggest diagnostic and therapeutic applications of our five-gene model for ovarian serous cystadenocarcinoma.

1. Introduction available and well-studied biomarkers for ovarian cancers. Other studies using miRNAs as biomarkers also suggest the Ovarian serous cystadenocarcinoma is a common female limited value for clinical application, and miRNA therapy genital cancer that causes more deaths than any other cancer is still not clinically feasible. Compared with the foregoing ofthefemalereproductivesystem.AccordingtoGlobalCan- methods, gene expression markers not only possess higher cer Statistics, approximately 230,000 women are diagnosed practical value, but also yield higher accuracy. with ovarian cancer every year, and an estimated 150,000 Here, we analyzed 303 clinical samples of ovarian serous women die of this disease annually [1]. Ovarian serous cystadenocarcinoma and the corresponding RNA-seq data. cystadenocarcinoma, a type of epithelial ovarian cancer, We determined the relationship between gene expression accounts for about 90% of all ovarian cancers [2]. Studies data and survival time, in an effort to develop effective and suggest that the risk factors for the disease include nulliparity, accurate biomarkers for outcome prediction and personal- early menarche, late menopause, and family history [3]. Since ized treatment. the disease is often asymptomatic, the majority of patients are diagnosed at an advanced stage, with tumor invasion. Studies 2. Materials and Methods showed that the 5-year survival of stage I patients is greater than 90%, while that of patients in stages III to IV is less than 2.1. Patient Samples and Gene Expression Data. We collected 20% [4, 5]. The recent increase in the incidence of ovarian data from a total of 587 samples of serous cystadeno- cancer has attracted the interest and attention of researchers carcinoma (April 2016) from TCGA (http://cancergenome worldwide. .nih.gov/) and finally used 303 samples (Table S1, at With the development of sequencing technology, the Supplementary Material available online at http://dx.doi.org/ research focus has been on the study of signature analysis for 10.1155/2016/6945304) in this study after excluding 284 - prognostic monitoring of ovarian cancer [6–12]. Microarray ples with unknown survival time or insufficient gene expres- studies require precise design of probes despite the currently siondata.The303sampleswereassignedinto13batches 2 BioMed Research International and randomly allocated to training and testing sets. The Table 1: Assignment of patient demographic and clinical character- prognostic marker model was established with a training istics. set containing 8 batches (batches 9, 11–15, and 17-18) with Patients 168 samples and validated using a testing set, comprising 5 Characteristic Training set Testing set Total batches (batches 19–22, 24, and 409) with 135 samples. Age at diagnosis (years) 57 59 58 2.2. Statistical Analysis. Initially, we screened the samples Median by excluding those with unclear survival time or status. We Range 34–87 30–87 30–87 retained only those genes expressed in more than half of the Vital status samples for further analysis. The expression level was then Living 58 62 120 determined by logarithmic transformation and univariate Dead 110 73 183 𝑝 Cox regression analysis. The significance of genes with Follow-up (days) value less than 0.001 was evaluated using random forests. 1018 883 949 We selected 100 genes of the largest importance to perform Median multivariate Cox’s analysis. Considering the practicality of Median (dead) 1155 919 1069 clinical testing, we established 75,287,520 models with vari- Clinical stage ables ranging from one to five genes using Cox proportional Stage I 011 hazards regression analysis [35]. Further, all the 75,287,520 Stage II 81321 models were subjected to Receiver Operating Characteristic Stage III 142 99 241 (ROC) analysis and the model with the largest area was 18 20 38 selected. Stage IV Kaplan-Meier analysis was then conducted in both Unknown 022 training and testing groups to validate the efficiency of the model. In order to test the independence and repro- ducibility of our model, we divided the samples into different datasets according to their ages and disease trafficking [21]. Its use as a prognostic molecular marker in stages. We then performed Kaplan-Meier analyses and ROC liver disease is also discussed. TSPAN9 is probably directly analyses in each condition with IBM SPSS Statistics 22 related to the proliferation of cancer cells. Other genes not (http://www.ibm.com/analytics/us/en/technology/spss/). directly correlated with the development of cancer may affect metabolism via signal transduction and indirectly affect the 3. Results development of cancer.

3.1. Sample Characteristics. According to the screening cri- 3.3. Test the Predictive Ability of the Constructed Model Using teria described, we randomly allocated the 303 samples with Testing Set. After constructing the five-variable model with explicit survival time, survival state, and expression data training set, we performed a Kaplan-Meier survival analysis into training and testing sets for modeling and validation, of both training and testing sets to determine its prognostic respectively. The median age of diagnosis in the selected value. In the training set, by calculating each patient’s risk patients was 58 years, the median survival time was 949 days, scoreusingthemodel,wedividedthepatientsintotwo and the median survival of late-stage patients was 1069 days. groups, designated as high-risk (𝑛=84)andlow-risk A single patient was found in clinical stage I and 21, 241, and groups (𝑛=84), based on their risk scores. The average 38 patients were in stages II, III, and IV, respectively. The survival time of patients in the low-risk group was 1,443 clinical stages of two patients were unknown (Table 1). days, longer than in the high-risk group, which was 892 days. Kaplan-Meier analysis indicated a significant difference 3.2. Obtain Genes Associated with Survival Time. Subse- (𝑝 < 0.001) between the high-risk and low-risk groups quently, we constructed 75,287,520 models comprising factors in survival time [Figure 1(a)]. The prognosis of high-risk from1to5basedonthe100geneswiththehighestsignifi- group appeared worse than that of the low-risk group, cance in the random forest method. The survival risk score of indicating that our model successfully distinguished the risk each patient was calculated according to the corresponding pattern. The higher risk tended to result in shorter survival risk formula in each model, and the ROC curves were time. Similar results of Kaplan-Meier analysis were found drawn. We extracted a batch of 5 genes (GPR128, AGXT, in the test group [Figure 1(b)], suggesting that our model CYTH3, C10orf76, and TSPAN9) (Table 2) with the largest was universally applicable in determining the risk level and AUROC using the following formula: risk score = (0.0796 × predicting the survival of patients. expression point of GPR128) + (0.3451 × expression point of In order to further confirm the prognostic value of our AGXT) + (0.3402 × expression point of CYTH3) + (0.6198 × model in predicting the survival time, we performed ROC expression point of C10orf76) + (0.2534 × expression point analysis of the test group, setting 3 years as the cut-off, and of TSPAN9). All of these genes were reported previously calculated the risk score as the variable. The AUROC value (Table 3). The CYTH3 gene was expressed in the liver alone, of 0.670 (Figure 2) indicated a relatively high specificity and playing a key role in regulating protein sorting and membrane sensitivity. BioMed Research International 3

Table 2: Five genes strongly correlated with patients’ survival time in training set.

Gene name 𝑝 value Hazard ratio Coefficient Variable importance Relative importance GPR128|84873 0.00092 1.0828 0.0796 0.0009 0.2478 AGXT|189 0.00038 1.4121 0.345 0.0005 0.1442 CYTH3|9265 0.00048 1.4052 0.3402 0.0005 0.1432 C10orf76|79591 0.00037 1.8585 0.6198 0.0009 0.2446 TSPAN9|10867 0.0008 1.2884 0.2534 0.0002 0.0518

Table 3: Five-gene functions in previous research.

Chromosomal Start site End site Function Playing important role in the transduction of intercellular signals across the plasma GPR128 chr3 100328433 100414323 membrane; related to weight gain and intestinal contraction frequency in mouse [13–16]. Expressing proteins involved in glyoxylate detoxification in the peroxisomes; its AGXT chr2 240868479 240880502 mutation causes primary hyperoxaluria type I, a severe inborn error of metabolism [17–20]. Mediating the regulation of protein sorting and membrane trafficking; related to CYTH3 chr7 6161776 6272644 HCC (hepatocellular carcinoma) tissues and could serve as prognostic factor [21–24]. C10orf76 chr10 101845599 102056193 Currently unknown; a recent study suggested the loss of C10orf76 resulted in the upregulation of several genes [25–29]. Mediating signal transduction events that play a role in the regulation of cell TSPAN9 chr12 3077355 3286564 development, activation, growth, and motility; associated with adhesion receptors of the integrin family and regulates integrin-dependent cell migration [30–34].

Training group Testing group

1.0 1.0

0.8 0.8

0.6 0.6

0.4 0.4 Proportion surviving Proportion surviving

0.2 0.2

0.0 0.0

0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 6000 Time (days) Time (days)

Low-risk Low-risk High-risk High-risk (a) (b)

Figure 1: Kaplan-Meier curves with two-sided log rank test show correlation between five-gene model and survival time in both training set and testing set. (a) In training set, by calculating each patient’s risk score out of the model, we divided the patients into two groups, named as high-risk group (𝑛=84)andlow-riskgroup(𝑛=84), based on their risk scores. Kaplan-Meier analysis was then performed and significant difference𝑝 ( < 0.001) was found between high-risk and low-risk group in the level of survival time. (b) Similar process and results are showed in testing set. 4 BioMed Research International

Table 4: Cox proportional hazard regression analyses in training and testing sets.

Univariable model Multivariable model Variables HR (95% CI) 𝑝 value HR (95% CI) 𝑝 value Training group Five-gene model 2.672 (1.801–3.965) <0.001 2.536 (1.832–3.509) <0.001 Age 1.683 (1.153–5.457) 0.007 1.013 (0.994–1.031) 0.173 Testing group Five-gene model 2.248 (1.397–3.620) 0.001 2.224 (1.379–3.586) 0.001 Age 1.224 (0.772–1.941) 0.389 1.153 (0.726–1.830) 0.546 Training group Five-gene model 2.672 (1.801–3.965) <0.001 2.725 (1.821–4.078) <0.001 Stage 1.080 (0.670–1.741) 0.752 0.883 (0.541–1.442) 0.62 Testing group Five-gene model 2.248 (1.397–3.620) 0.001 2.385 (1.387–3.562) <0.001 Stage 1.032 (0.580–1.461) 0.453 0.685 (0.432–1.238) 0.428

1.0 Table 5: Kaplan-Meier analysis and ROC analysis were conducted to validate the reproducibility of five-gene model.

Prognostic Group Kaplan-Meier AUROCs 0.8 factor 𝑝 value ≤57 (146) <0.001 0.653 Age >57 (157) 0.001 0.683 0.6 I, II (22) 0.018 0.625 Stage III (241) <0.001 0.664 IV (38) <0.1 0.778 Sensitivity 0.4

0.2 revealed that, in both groups, patients in low-risk group survived longer than in the high-risk group (𝑝 ≤ 0.001). Similar results were obtained with the groups of patients at 0.0 different disease stages (stages I and II were merged because 0.0 0.2 0.4 0.6 0.8 1.0 of limited specimen) except stage IV (Figure S1), which may 1−specificity be attributed to the relatively small sample size. However, the AUROC of this group was rather high. These analyses Low-risk established that our model was independent of other risk High-risk factors and successfully distinguished low risk from high risk Figure 2: Receiver Operating Characteristic (ROC) analysis of the in each dataset. selected five-gene model. AUROC value is 0.670 (𝑝 < 0.001). 4. Discussion 3.4. The Independence and Reproducibility of the Five-Gene Ovarian serous cystadenocarcinoma is a common female Model. The survival of patients is associated with their age, genital cancer. Due to the absence of early-stage clinical clinical stage, and other factors. To determine the inde- symptoms and effective diagnosis, most patients were diag- pendence of our model, we conducted a multivariate Cox nosedwithadvanceddisease.Further,duetothelackof regression analysis using age and disease stages. We found effective treatment, the management of epithelial ovarian that the five-gene model was independent of age and disease cancer is passive. Developing reliable prognostic molecular stage (Table 4). markers provides meaningful guidance for a reasonable and Further Kaplan-Meier analysis and ROC analysis were effective management program. then conducted (Table 5). We merged the training and In this study, we analyzed 303 clinical samples of ovarian testing sets into an overall dataset, which was divided into serous cystadenocarcinoma and the corresponding RNA-seq twoseparategroupsbyage57.TheKaplan-Meieranalysis data, observed the correlation between gene expression and BioMed Research International 5 survival time, and eventually established a risk assessment [8] N. Jin, H. Wu, Z. Miao et al., “Network-based survival- model based on five factors. Two of these genes (TSPAN9 associated module biomarker and its crosstalk with cell death [30–34], CYTH3 [21–24]) were directly correlated with can- genes in ovarian cancer,” Scientific Reports,vol.5,ArticleID cer, with CYTH3 identified as a biomarker in liver cancer. 11566, 2015. By calculating each patient’s risk score, we found that each [9]M.Schwede,D.Spentzos,S.Bentinketal.,“Stemcell-like set showed significant differences in survival time between gene expression in ovarian cancer predicts type II subtype and low-risk and high-risk groups, indicating that the model prognosis,” PLoS ONE,vol.8,no.3,ArticleIDe57799,2013. accurately predicted the mortality risk. The AUROC value [10] K. P. Prahm, G. W. Novotny, C. Høgdall, and E. Høgdall, in testing group is 0.670, representing a relatively high “Current status on microRNAs as biomarkers for ovarian specificity and sensitivity. cancer,” APMIS,vol.124,no.5,pp.337–355,2016. [11] P. Mapelli, E. Incerti, F. Fallanca, L. Gianolli, and M. Picchio, In conclusion, our gene expression biomarkers can be 18 used for accurate patient risk assessment, demonstrating “Imaging biomarkers in ovarian cancer: the role of F-FDG practical value in predicting clinical outcomes. Our results PET/CT,” The Quarterly Journal of Nuclear Medicine and Molec- are based on the samples derived from 303 individuals. ular Imaging,vol.60,no.2,pp.93–102,2016. ˇ Expanding sample size, especially including early-stage can- [12] I. Sedlakov´ a,´ J. Laco, J. Toˇsner, and J. Spacek,ˇ “Prognostic sig- cer patients, will further improve the prognostic value of the nificance of Pgp, MRP1, and MRP3 in ovarian cancer patients,” Ceska Gynekologie,vol.80,no.6,pp.405–413,2015. model. [13]A.Chase,T.Ernst,A.Fiebigetal.,“TFG,atargetofchromo- some translocations in lymphoma and soft tissue tumors, fuses Competing Interests to GPR128 in healthy individuals,” Haematologica,vol.95,no.1, pp.20–26,2010. The authors declare that there is no conflict of interests [14] R. Fredriksson, D. E. I. Gloriam, P. J. Hoglund,¨ M. C. Lager- regarding the publication of this paper. strom,¨ and H. B. Schioth,¨ “There exist at least 30 human G- protein-coupled receptors with long Ser/Thr-rich N-termini,” Authors’ Contributions Biochemical and Biophysical Research Communications,vol.301, no.3,pp.725–734,2003. Li-Wei Liu and Qiuhao Zhang contributed equally to this [15] T. K. Bjarnadottir,´ R. Fredriksson, P. J. Hoglund,D.E.Gloriam,¨ work. M. C. Lagerstrom,¨ and H. B. Schioth,¨ “The human and mouse repertoire of the adhesion family of G-protein-coupled recep- tors,” Genomics,vol.84,no.1,pp.23–33,2004. Acknowledgments [16] Y.Suzuki,R.Yamashita,M.Shirotaetal.,“Sequencecomparison This work was supported by the National Natural Science of human and mouse genes reveals a homologous block struc- ture in the promoter regions,” Genome Research,vol.14,no.9, Foundation of China (31471200). The authors are grateful pp. 1711–1718, 2004. to the High Performance Computing Center (HPCC) of NanjingUniversityfordoingthenumericalcalculationsin [17] R. Montioli, E. Oppici, M. Dindo et al., “Misfolding caused by the pathogenic mutation G47R on the minor allele of this paper on its IBM Blade cluster system. alanine: glyoxylate aminotransferase and chaperoning activity of pyridoxine,” Biochimica et Biophysica Acta (BBA)—Proteins References and Proteomics, vol. 1854, no. 10, pp. 1280–1289, 2015. [18] E. Oppici, R. Montioli, and B. Cellini, “Liver peroxisomal ala- [1] R. L. Siegel, K. D. Miller, and A. Jemal, “Cancer statistics, 2016,” nine:glyoxylate aminotransferase and the effects of mutations CA—Cancer Journal for Clinicians,vol.66,no.1,pp.7–30,2016. associated with primary hyperoxaluria type I: an overview,” [2] Cancer Genome Atlas Research Network, “Integrated genomic Biochimica et Biophysica Acta,vol.1854,no.9,pp.1212–1219, analyses of ovarian carcinoma,” Nature,vol.474,no.7353,pp. 2015. 609–615, 2011. [19] N.Miyata,J.Steffen,M.E.Johnson,S.Fargue,C.J.Danpure,and [3] Y. Lee, A. Miron, R. Drapkin et al., “A candidate precursor to C. M. Koehler, “Pharmacologic rescue of an enzyme-trafficking serous carcinoma that originates in the distal fallopian tube,” defect in primary hyperoxaluria 1,” Proceedings of the National The Journal of Pathology,vol.211,no.1,pp.26–35,2007. Academy of Sciences of the United States of America, vol. 111, no. [4] T. Meyer and G. J. S. Rustin, “Role of tumour markers in 40, pp. 14406–14411, 2014. monitoring epithelial ovarian cancer,” British Journal of Cancer, [20] R. Montioli, A. Roncador, E. Oppici et al., “S81L and G170R vol. 82, no. 9, pp. 1535–1538, 2000. mutations causing Primary Hyperoxaluria type I in homozy- [5] D. M. Gershenson, C. C. Sun, K. H. Lu et al., “Clinical gosis and heterozygosis: an example of positive interallelic behavior of stage II-IV low-grade serous carcinoma of the complementation,” Human Molecular Genetics, vol. 23, no. 22, ovary,” Obstetrics & Gynecology,vol.108,no.2,pp.361–368, pp. 5998–6007, 2014. 2006. [21] Y. Fu, J. Li, M.-X. Feng et al., “Cytohesin-3 is upregulated [6] T. R. Adib, S. Henderson, C. Perrett et al., “Predicting biomark- in hepatocellular carcinoma and contributes to tumor growth ers for ovarian cancer using gene-expression microarrays,” and vascular invasion,” International Journal of Clinical and British Journal of Cancer,vol.90,no.3,pp.686–692,2004. Experimental Pathology,vol.7,no.5,pp.2123–2132,2014. [7] X. Yu, X. Zhang, T. Bi et al., “MiRNA expression signature [22] A. W.Malaby, B. van den Berg, and D. G. Lambright, “Structural for potentially predicting the prognosis of ovarian serous basis for membrane recruitment and allosteric activation of carcinoma,” Tumor Biology,vol.34,no.6,pp.3501–3508,2013. cytohesin family Arf GTPase exchange factors,” Proceedings of 6 BioMed Research International

the National Academy of Sciences of the United States of America, vol. 110, no. 35, pp. 14213–14218, 2013. [23] C. Pilling, K. E. Landgraf, and J. J. Falke, “The GRP1 PH domain, like the AKT1 PH domain, possesses a sentry glutamate residue essential for specific targeting to plasma membrane PI(3,4,5)P3,” Biochemistry,vol.50,no.45,pp.9845–9856,2011. [24] M.-B. Poirier, G. Hamann, M.-E. Domingue, M. Roy, T. Bardati, andM.-F.Langlois,“Generalreceptorforphosphoinositides1, a novel repressor of thyroid hormone receptor action that pre- vents deoxyribonucleic acid binding,” Molecular Endocrinology, vol. 19, no. 8, pp. 1991–2005, 2005. [25] M. K. Wojczynski, M. Li, L. F.Bielak et al., “Genetics of coronary artery calcification among African Americans, a meta-analysis,” BMC Medical Genetics,vol.14,no.1,article75,2013. [26] C. A. Rietvelda, T. Eskoc, and G. Davies, “Common genetic variants associated with cognitive performance identified using the proxy-phenotype method,” Proceedings of the National Academy of Sciences, vol. 111, no. 38, pp. 13790–13794, 2014. [27] A. Grupe, Y. Li, C. Rowland et al., “A scan of chromosome 10 identifies a novel locus showing strong association with late-onset Alzheimer disease,” The American Journal of Human Genetics, vol. 78, no. 1, pp. 78–88, 2006. [28] E. D. Neto, R. G. Correa, S. Verjovski-Almeida et al., “Shotgun sequencing of the human transcriptome with ORF expressed sequence tags,” Proceedings of the National Academy of Sciences of the United States of America,vol.97,no.7,pp.3491–3496, 2000. [29]A.Castello,B.Fischer,K.Eichelbaumetal.,“InsightsintoRNA biology from an atlas of mammalian mRNA-binding proteins,” Cell,vol.149,no.6,pp.1393–1406,2012. [30] T. Yamaguchi, H. Nakaoka, K. Yamamoto et al., “Genome- wide association study of degenerative bony changes of the temporomandibular joint,” Oral Diseases,vol.20,no.4,pp.409– 415, 2014. [31] J. Kotha, C. Zhang, C. M. Longhurst et al., “Functional relevance of tetraspanin CD9 in vascular smooth muscle cell injury phenotypes: a novel target for the prevention of neointimal hyperplasia,” Atherosclerosis,vol.203,no.2,pp.377–386,2009. [32] M. B. Protty, N. A. Watkins, D. Colombo et al., “Identification of Tspan9 as a novel platelet tetraspanin and the collagen receptor GPVI as a component of tetraspanin microdomains,” Biochemical Journal,vol.417,no.1,pp.391–401,2009. [33] V. Serru, P. Dessen, C. Boucheix, and E. Rubinstein, “Sequence and expression of seven new tetraspans,” Biochimica et Biophys- icaActa(BBA)—ProteinStructureandMolecularEnzymology, vol.1478,no.1,pp.159–163,2000. [34] F. Berditchevski, “Complexes of tetraspanins with integrins: more than meets the eye,” Journal of Cell Science,vol.114,no. 23, pp. 4143–4151, 2001. [35] A. A. Margolin, E. Bilal, E. Huang et al., “Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer,” Science Translational Medicine,vol. 5,no.181,ArticleID181re1,2013. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 2714341, 7 pages http://dx.doi.org/10.1155/2016/2714341

Research Article DASAF: An R Package for Deep Sequencing-Based Detection of Fetal Autosomal Abnormalities from Maternal Cell-Free DNA

Baohong Liu,1 Xiaoyan Tang,2 Feng Qiu,3,4 Chunmei Tao,4 Junhui Gao,3,4 Mengmeng Ma,4 Tingyan Zhong,4 JianPing Cai,1 Yixue Li,2,3 and Guohui Ding2,3,4

1 State Key Laboratory of Veterinary Etiological Biology, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Xujiaping 1, Lanzhou, Gansu 730046, China 2Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China 3Shanghai Center for Bioinformation Technology, Shanghai Industrial Technology Institute, Shanghai 200235, China 4EG Information Technology Enterprise (EGI), Basepair Biotechonology. Co. Ltd., Suzhou 215123, China

Correspondence should be addressed to Yixue Li; [email protected] and Guohui Ding; [email protected]

Received 28 March 2016; Accepted 30 May 2016

Academic Editor: Zhenguo Zhang

Copyright © 2016 Baohong Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background. With the development of massively parallel sequencing (MPS), noninvasive prenatal diagnosis using maternal cell-free DNA is fast becoming the preferred method of fetal chromosomal abnormality detection, due to its inherent high accuracy and lowrisk.Typically,MPSdataisparsedtocalculateariskscore,whichisusedtopredictwhetherafetalchromosomeisnormal or not. Although there are several highly sensitive and specific MPS data-parsing algorithms, there are currently no tools that implement these methods. Results. We developed an R package, detection of autosomal abnormalities for fetus (DASAF), that implements the three most popular trisomy detection methods—the standard 𝑍-score (STDZ) method, the GC correction 𝑍-score (GCCZ) method, and the internal reference 𝑍-score (IRZ) method—together with one subchromosome abnormality identification method (SCAZ). Conclusions. With the cost of DNA sequencing declining and with advances in personalized medicine, the demand for noninvasive prenatal testing will undoubtedly increase, which will in turn trigger an increase in the tools available for subsequent analysis. DASAF is a user-friendly tool, implemented in R, that supports identification of whole-chromosome as well as subchromosome abnormalities, based on maternal cell-free DNA sequencing data after genome mapping.

1. Introduction in the maternal blood which refers to traditional nonin- vasive detection [6]. However, the traditional noninvasive Fetal autosomal aneuploidies are one type of abnormalities methods are lacking accuracy because they are indirect for chromosome number with a death rate of 6%–11% in measures of the underlying chromosomal defect [7, 8]. newborns. And the most common autosomal aneuploidies So pregnant women have to choose the invasive meth- are Down’s syndrome (trisomy 21) with the incidence of ods including chorionic villus sampling and cordocentesis, 1 in every 160 newborns causing mental retardation and coupled with fetal cell karyotyping which yield definitive hypoplasia [1]. Besides whole-chromosome aneuploidies, a answers. But there is a 0.5% risk of miscarriage which adds considerable number of fetuses are at high risk for sub- additional concern to the pregnant women and their families chromosomal abnormalities [2–4] that also result in mental [9, 10]. illnesses and other abnormalities [5]. The discovery of cell-free fetal DNA in maternal serum The traditional screening for chromosomal abnormalities [11] and recent advances in massive parallel sequencing combined the maternal age, ultrasonographic examination (MPS) technologies [12–14] now enable noninvasive prenatal of the fetus, and levels of various proteins or hormones testing (NIPT) of fetal chromosomal aneuploidies [15–18], 2 BioMed Research International

Table 1: Description of reference datasets.

Dataset name Description NCR ref Ratios of uniquely mapping reads for every chromosome and the total number of sequences uniquely mappedtothegenomeforall120samples GC ref GC content for every chromosome for all 120 samples Tag pos Genomic bins positions for widths of 1 Mb and 100 kb Bin GC GC content calculated for genomic bins for all samples Nearest bin ref Bin names of the 10 bins for the 1 Mb data and the 40 bins for the 100 kb data, which are with the nearest GC content for every divided bin BRV ref Ratios of reads within a bin to the total number of reads in bins with the nearest GC percentages

with very high specificity and sensitivity [19–21]. In addition popular trisomy detection methods (STDZ, GCCZ, and IRZ) to being noninvasive, NIPT requires only 5 mL of maternal and one subchromosome abnormality identification method peripheral blood for sequencing. Sequences are analyzed (SCAZ). We have also included a fetal gender prediction using bioinformatics methods to calculate a hazard score, module in the DASAF package. With the cost of DNA which is then used to determine whether fetal chromosomes sequencing declining and with advances in personalized are normal or not. Although the standard 𝑍-score (STDZ) medicine, we believe that the demand for NIPT will increase, method was originally used, it was later discovered that the which will undoubtedly trigger an increase in the tools accuracy of this method varied depending on the GC content available for subsequent analysis. of the chromosomes in question [15]. More specifically, the coefficients of variance (of measuring the percentage of representation of each chromosome) were much larger for 2. Materials and Methods chromosomes 18 and 13 than for chromosome 21 [15, 16]. This variation in accuracy is linked to the difference in This study was approved by the Independent Ethics Commit- sequencing efficacy as a function of chromosome size and GC tee of Shanghai Clinical Research Center. The reference data content. In recent years, many methods have emerged to solve used here consists of DNA sequencing data from one hundred the aforementioned problem, including a GC correction 𝑍- and twenty pregnant women from Huzhou Maternity & score (GCCZ) method [21], internal reference 𝑍-score (IRZ) Child Care Hospital located in Huzhou, Zhejiang, China. method [20], and the noninvasive fetal trisomy (NIFTY) test All data were produced by Illumina HiSeq2000 for 100 bp 6 6 [19], as well as the method of Srinivasan et al., henceforth pair-end with 7 × 10 to 17 × 10 sequence read pairs per referred to as the subchromosome abnormality 𝑍-score sample. (SCAZ) method [22]. The first three methods are similar to The sequencing reads were aligned to the human genome the standard method (𝑍-score) for identifying abnormalities assembly hg19 with Bowtie short read aligner (version 1.1.2), inwholechromosomes,whilethelastmethodisusedtoiden- allowing for two base mismatches at most when aligning [24]. tify subchromosomal (i.e., chromosomal regions) losses and Only uniquely mapped reads were kept. gains. Lau et al. indicated that the standard 𝑍-score (STDZ) Before using DASAF, sequencing data should be aligned method accurately detects trisomy 21 early in pregnancy of 11 using the above method, which is independent of the DASAF weeks with low accuracy for other aneuploidies, being 0% for software and needs to be completed by the users themselves. trisomy 13 and 40% for trisomy 18, while the GC correction The results file from Bowtie is used as input for DASAF. with LOESS regression method (GCCZ) is more accurate All the reference datasets are described in Table 1. A typical than STDZ but still with low detection rate for trisomy 18. DASAF workflow involves two procedures: mapping read And the adjusted method using 𝑍-scores with an internal statistics and autosomal aneuploidy prediction (Figure 1). reference (IRZ), which corrects for GC bias and sequencing efficiency, substantially improved the performance of the test [20]. 2.1. Read Mapping Statistics. Read mapping statistics produce On the other hand, Verweij et al. investigated the attitudes two files: one contains the unique mapping read counts among pregnant women regarding NIPT for the detection of for every chromosome and the other contains the mapping trisomy 21 (T21): they had a positive attitude regarding NIPT location for every unique read. The normalized chromosome for detection of T21, and more than 50% of them who rejected ratio (NCR) is generated according to the following equation the traditional screening would accept NIPT if available [23]. for every chromosome in each sample: NCR is the ratio of However, although NIPT has become increasingly popular number of reads uniquely mapped to the specific chromo- and acceptable and subsequent data analysis algorithms have some divided by the total number of reads uniquely mapped emerged, there are no tools currently available to implement to all autosomal chromosomes [15, 25]. If the GCCZ method these data analysis methods. In the present study, we devel- is used, the GC content for every chromosome is calculated oped an R package, DASAF, that implements the three most from the mapping results. BioMed Research International 3

Sequencing Hg19 data

Mapping DASAF workflow result (.sam)

Reference Unique Reference GC content GC mapping Reads NCRs positions contents counts

Reference NCR BRVs BRVs

GCCZ STDZ IRZ SCAZ

Whole-chromosome abnormality Subchromosome abnormality

No Normal ?>3

Yes

Abnormal

Figure 1: Workflow of the R package DASAF. The DASAF workflow consists of two main parts: (a) mapping reads statistics and (b) autosomal aneuploidy prediction. The results from (a) are used to calculate the risk score using any of the four methods implemented in (b). STDZ: standard 𝑍-score; GCCZ: GC correction 𝑍-score; IRZ: internal reference 𝑍-score; and SCAZ: subchromosome abnormalities 𝑍-score.

2.2. Autosomal Aneuploidy Prediction of chromosome 𝑖 in the reference samples, and 𝑖 is the specific chromosome number, that is, 13, 18, and 21 [15]. 𝑍 𝑍 2.2.1. Standard -Score Method. In the standard -score For the average value and standard deviation values for 𝑍 (STDZ) theory method, a hazard ratio of the -score is calcu- theNCRs,onecanusethereferencefiles(NCRref.txt) lated to determine whether the fetal chromosome is normal contained in the DASAF package or calculate them based on or not: one’s own samples. The 𝑍-score is a number indicating how far an observation deviates from the average in a population NCR𝑖 − NCR𝑖 [26].Usually,a𝑍-scoreof3isselectedasthresholdto (STDZ)𝑖 = , (1) SD𝑖 determine whether the fetus is normal or not [22]. where NCR𝑖 istheratioofthesequencecountsuniquely mapped to the specific chromosome and the total number 2.2.2. GC Correction 𝑍-Score Method. We calculated the of the sequences uniquely mapped to all of the autosomal slope from the NCR values (in reference file NCR ref.txt) chromosomes, NCR𝑖 is the average NCR of chromosome 𝑖 in of chromosomes 13, 18, and 21 of the 120 reference samples the reference samples, SD𝑖 is the standard deviation for NCRs against their GC content (in reference file GC ref.txt) by 4 BioMed Research International linearregressionandacorrectedNCRvaluewillbecalculated which is similar to the standard 𝑍-score method, while using the following equation: the median absolute deviations (MAD) were adjusted to 𝑎MAD (i.e., MAD was multiplied by 1.4826); here 𝑎 is 1.4826. (GC − GC ) = − average ref , Consider NCRGC NCR (2) Sloperef BRV𝑖𝑗 −𝑚BRV𝑖𝑗 (SCAZ)𝑖𝑗 = . (4) 𝑎MAD𝑖𝑗 where NCRGC istheNCRvalueafterGCcorrection,NCRis the original value, and GC and Slope are the mean average ref ref The absolute values of 𝑍-score larger than 3 indicate that there values of references’ chromosomal GC content and the slope were CNVs in fetal chromosome for the specific genomic of linear regression from the reference samples [21]. regions [22]. Then,themeanandSDoftheGC-correctedNCRwere calculated for the reference dataset and the 𝑍-score was calculated for the chromosome of the sample tested using (1) 3. Results and Discussions with a 𝑍-score cutoff of 3. We built the DASAF R package, which supports three existing methods for identifying whole chromosome abnormalities 2.2.3. Internal Reference 𝑍-Score Method. To minimize the and one for identifying subchromosome abnormalities from sequencing bias (stemming from differences in GC content), MPS data. We then compared the running time and identifi- Lau et al. presented a 𝑍-score method that relies on an cation accuracy of the four methods. internal reference chromosome [20]. They showed that using chromosomes 4, 8, and 14 as internal reference chromosomes provided the most accurate results for the detection of 3.1. Comparison of Chromosomal Abnormality Detection trisomy 13, trisomy 18, and trisomy 21, respectively. The Methods. All the detection methods used here were derived methodisasfollows:thecomparativeNCRiscalculatedusing from existing algorithms and their accuracy has been tested / previously [19, 20]. Here, we therefore only list the previously the value from the internal reference as NCR𝑖 NCRIR,where IR is the internal reference chromosome for chromosome reported results for these algorithms. Lau et al. provided 𝑖.The𝑍-score is also calculated by (1) that the IR adjusted detection rates and false-positive rates for the three whole- NCR value for the test sample subtracts the averaged IR chromosome trisomy detection methods. Their research adjusted NCR values from the reference samples and the revealed that the false-positive rates were 0 for all the three differenceisthendividedbythestandarddeviationfromthe methods and the method of IRZ was the most sensitive, with IR adjusted NCR values for the reference samples. A 𝑍-score a 100% detection rate for all trisomies examined (13, 18, and of 3 was selected as threshold for the diagnosis of trisomy in 21). For the method of STDZ, the detection rate was 100% chromosome 𝑖 of the testing sample [20]. for detecting trisomy 21 but only 40% for trisomy 18 and almost 0% for trisomy 13, while the GCCZ method with a detection rate of 100% for trisomy 21, 90% for trisomy 2.2.4. Subchromosome Abnormality 𝑍-Score Method. In addi- 18, and 100% for trisomy 13 was better than the standard tion to whole-chromosome abnormalities, subchromosome method but worse than the IRZ method [20]. Jiang et al. losses and gains are also important components of chro- also evaluated the performance of these three methods for mosomal diseases [4]. The subchromosome abnormality 𝑍- 903 cases and found that the Coefficient of Variation (CV) score (SCAZ) is a method used to identify abnormalities for the STDZ method was larger than that for the other two for chromosomal regions with lengths between 100 kb and approaches among clinically relevant chromosomes (13, 18, 1 Mb [22]. In the first step, positions uniquely mapped to the and 21). Thus, the STDZ method has poor sensitivity for the genome are retrieved and counted as tags. And the whole detection of trisomies 13 and 18. However, the performance of genome was divided into continuous bins with length of 1 Mb theGCCZapproachdemonstratedover99%sensitivityand and 100 kb and tags were assigned to individual bins for the specificity for the detection of trisomies 13, 18, and 21, while following analysis. Then GC content percentage of each bin the IRZ approach displayed CV larger than GCCZ but smaller was calculated to rank the bins across the entire genome. than STDZ for chromosomal trisomies 13, 18, and 21 [19]. And then every bin was normalized using the ratio of tags In summary, the adjusted methods (GCCZ and IRZ) more within the bin to the sum of the tag counts in bins with accurately identify trisomies than the STDZ method. the nearest GC content percentages. Bins with nearest GC It was also reported that the SCAZ method, which content percentages include 10 bins of 1 Mb length and 40 bins identifies chromosome CNVs, can accurately detect losses of 100 kb length. The equation is as follows: and gains for chromosomal regions [22]. Tags 𝑖𝑗 = , BRV𝑖𝑗 (3) 3.2. Evaluation of Diagnostic Accuracy as a Function of ∑ Tags 𝑘𝑚 Sequencing Depth. In order to evaluate the effect of sequenc- where BRV𝑖𝑗 is the ratio for the 𝑗th bin for chromosome 𝑖 and ing depth on diagnostic accuracy, we randomly subsampled Tags 𝑖𝑗 is the count of tags in the 𝑗th bin for chromosome 𝑖. 𝑘𝑚 the 100 bp pair-end (PE) sequencing data at read counts of at represents the bins with length of 100 kb and 1 Mb. least 3 M, 5 M, 7 M, and 12 M. Using the STDZ method, cases Further, every BRV was examined for deviation from were diagnosed as T21-positive or T21-negative. Importantly, the median values collected across all the reference samples we found that, even at a read count of 3 M, T21 was accurately BioMed Research International 5

Table 2: Execution time (in seconds) to detect chromosomal abnormalities using different methods.

12 M reads 20 M reads 40 M reads Standard 𝑍-score (STDZ) method 1 1 2 GC Correction 𝑍-score (GCCZ) method 312 629 1,156 Internal reference 𝑍-score (IRZ) method 1 1 1 Subchromosome abnormality 𝑍-score (SCAZ) method 2,074 2,105 2,278 The computing platform is a Linux system with 16 threads (0.8 GHZ for each) and RAM of 64 GB. Execution time was averaged over five repetitive runs.

20 for whole chromosomal abnormality detection, summarize the results of the three available methods (i.e., average the 15 three 𝑍-scores) for detection of trisomies 13, 18, and 21. We chose a 𝑍-score threshold of 3 to predict fetal chro- 10 mosome abnormalities. The reference datasets under the

score directory of data in the package can be updated or replaced Z- 5 byusersasthesamplesincrease,whichcanpromotethe accuracy of these methods. A detailed vignette is included 0 with the DASAF package to assist nonexperts in the field (http://lifecenter.sgst.cn/dasaf/). 08020 40 60100 120 The cost of high-throughput sequencing has decreased Sample number dramatically over the past few years, thus increasing its utility in clinical practice [27, 28]. Noninvasive prenatal diagnosis is 12 M 5 M the most widely used method for detecting trisomic abnor- 7 M 3 M malities or the loss or gain of chromosomal regions, and an Figure 2: 𝑍-score distribution for maternal cell-free DNA samples increasing number of pregnant women are benefitting from at varying sequencing depths. The black dots represent samples from this technology. In August 2014, noninvasive prenatal DNA the 100 bp pair-end run at a depth of 12 M reads. The black squares, diagnosis finally obtained legal status in China following diamonds, and triangles represent samples from the original 100 bp the approval of the registration of second-generation gene- paired end run that have been randomly subsampled to 7 M, 5 M, 𝑍 sequencing diagnostic products. This represents a major and 3 M reads, respectively. The black line indicates the -score advance in the field of prenatal screening that will undoubt- threshold of 3. edly benefit numerous pregnant women and their families. diagnosed, which suggests that the cost of sequencing can be Abbreviations considerably reduced by decreasing the sequence coverage. The results shown in Figure 2 demonstrate that the 𝑍-scores DASAF: Detection of autosomal abnormalities for fetus for all the positive samples are larger that 3 (above the MPS: Massively parallel sequencing horizontal line of 𝑦=3). NIPT: Noninvasive prenatal testing STDZ: Standard 𝑍-score 𝑍 3.3. Execution Time Comparison. We tested the running time GCCZ: GC correction -score 𝑍 for all methods included in the DASAF package on datasets IRZ: Internal reference -score withreadpairsof12M,20M,and40M(derivedfrom NIFTY: Noninvasive fetal trisomy 𝑍 patients of the Huzhou Maternity & Child Care Hospital). SCAZ: Subchromosome abnormalities -score STDZ and IRZ ran faster than the other methods if the NCR: Normalized chromosome representation NCR values for references were prepared beforehand. The BRV: Bin ratio value. GCCZ method requires the user to calculate the GC content for every chromosome, which consumes a considerable amountoftime.TheSCAZmethodhadthelongestruntime Additional Points because the BRV needs to be calculated for every bin by (i) Project name is DASAF. (ii) Project home page is http://life- counting the tags. While all running times were acceptable, center.sgst.cn/dasaf/. (iii) Operating systems are Linux and these times can be dramatically reduced by decreasing the Windows (Linux recommended). (iv) Programming lan- sequence read counts to 3–5 M (Table 2). guages are R and Perl. (vi) License is GPL ≥ 2. (vii) There is no nonacademics restrictions for using. 4. Conclusions WedevelopedanRpackagethatsupportschromosomal Competing Interests abnormality detection. For chromosomal abnormality detec- tion, users can select one of four supported methods or, The authors declare that they have no competing interests. 6 BioMed Research International

Authors’ Contributions [9]A.O.Odibo,J.M.Dicke,D.L.Grayetal.,“Evaluatingtherate and risk factors for fetal loss after chorionic villus sampling,” Baohong Liu designed the study, performed the analysis, Obstetrics and Gynecology,vol.112,no.4,pp.813–819,2008. cowrote the paper, constructed the R package, and prepared [10]A.O.Odibo,D.L.Gray,J.M.Dicke,D.M.Stamilio,G.A. the paper. Xiaoyan Tang processed the data and cowrote Macones,andJ.P.,“Revisitingthefetallossrateafter the paper. Chunmei Tao also cowrote the paper. Feng Qiu second-trimester genetic amniocentesis: a single center’s 16- prepared the experiment datasets. Junhui Gao, Mengmeng year experience,” Obstetrics & Gynecology, vol. 111, no. 3, pp. Ma, and Tingyan Zhong assisted in the software website 589–595, 2008. construction. JianPing Cai also helped to write the paper. [11] Y.M. Dennis Lo, N. Corbetta, P.F. Chamberlain et al., “Presence Yixue Li and Guohui Ding assisted in retrieving the test of fetal DNA in maternal plasma and serum,” The Lancet,vol. data and revising the paper. All authors read and approved 350, no. 9076, pp. 485–487, 1997. the final paper. Baohong Liu and Xiaoyan Tang are equally [12] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow et al., contributing first authors. “Accurate whole human genome sequencing using reversible terminator chemistry,” Nature,vol.456,no.7218,pp.53–59, 2008. Acknowledgments [13] E. R. Mardis, “A decade’s perspective on DNA sequencing ThisresearchwassupportedbytheNationalBasicResearch technology,” Nature,vol.470,no.7333,pp.198–203,2011. Program of China (2011CB910204, 2011CB510102, and [14] M. L. Metzker, “Sequencing technologies the next generation,” 2010CB529200), the National Key Technology Support Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010. Program (2013BAI101B09), and the National Key Scientific [15]R.W.K.Chiu,K.C.A.Chan,Y.Gaoetal.,“Noninvasive Instrument and Equipment Develeopment Project (2012YQ- prenatal diagnosis of fetal chromosomal aneuploidy by mas- 03026108). sively parallel genomic sequencing of DNA in maternal plasma,” Proceedings of the National Academy of Sciences of the United States of America,vol.105,no.51,pp.20458–20463,2008. References [16] H. C. Fan, Y. J. Blumenfeld, U. Chitkara, L. Hudgins, and S. R. Quake, “Noninvasive diagnosis of fetal aneuploidy by [1] D. A. Driscoll and S. Gross, “Prenatal screening for aneuploidy,” shotgun sequencing DNA from maternal blood,” Proceedings of The New England Journal of Medicine,vol.360,no.24,pp.2502– the National Academy of Sciences of the United States of America, 2562, 2009. vol. 105, no. 42, pp. 16266–16271, 2008. [2]B.D.Blumberg,J.D.Shulkin,J.I.Rotter,T.Mohandas, [17]R.W.K.Chiu,H.Sun,R.Akolekaretal.,“Maternalplasma andM.M.Kaback,“Minorchromosomalvariantsandmajor DNA analysis with massively parallel sequencing by ligation chromosomal anomalies in couples with recurrent abortion,” for noninvasive prenatal diagnosis of trisomy 21,” Clinical American Journal of Human Genetics,vol.34,no.6,pp.948– Chemistry,vol.56,no.3,pp.459–463,2010. 960, 1982. [18] Y. M. D. Lo, “The quest for accurate measurement of fetal DNA [3] D. S. Borgaonkar, D. R. Bolling, C. Partridge, F. H. Ruddle, and in maternal plasma,” Clinical Chemistry,vol.57,no.3,pp.522– V. A. McKusick, “Chromosomal variation in man: catalog of 523, 2011. chromosomal variants and anomalies,” BirthDefects:Original [19] F. Jiang, J. Ren, F. Chen et al., “Noninvasive Fetal Trisomy Article Series, vol. 11, no. 3, pp. 82–84, 1975. (NIFTY) test: an advanced noninvasive prenatal diagnosis [4]F.R.Grati,A.Barlocco,B.Grimietal.,“Chromosomeabnor- methodology for fetal autosomal and sex chromosomal aneu- malities investigated by non-invasive prenatal testing account ploidies,” BMC Medical Genomics,vol.5,article57,2012. for approximately 50% of fetal unbalances associated with [20] T. K. Lau, F. Chen, X. Pan et al., “Noninvasive prenatal diag- relevant clinical phenotypes,” American Journal of Medical nosis of common fetal chromosomal aneuploidies by maternal Genetics, Part A,vol.152,no.6,pp.1434–1442,2010. plasma DNA sequencing,” Journal of Maternal-Fetal and Neona- [5]D.W.Bianchi,R.P.Rava,andA.J.Sehnert,“DNAsequencing tal Medicine,vol.25,no.8,pp.1370–1374,2012. versus standard prenatal aneuploidy screening,” The New Eng- [21] D. Liang, W. Lv, H. Wang et al., “Non-invasive prenatal testing land Journal of Medicine,vol.371,no.6,p.578,2014. of fetal whole chromosome aneuploidy by massively parallel [6] F. D. Malone, J. A. Canick, R. H. Ball et al., “First-trimester or sequencing,” Prenatal Diagnosis,vol.33,no.5,pp.409–415,2013. second-trimester screening, or both, for down’s syndrome,” The [22] A. Srinivasan, D. W. Bianchi, H. Huang, A. J. Sehnert, and R. P. New England Journal of Medicine,vol.353,no.19,pp.2001–2011, Rava, “Noninvasive detection of fetal subchromosome abnor- 2005. malities via deep sequencing of maternal plasma,” American [7]G.E.Palomaki,J.E.Hardow,G.J.Knightetal.,“Risk- Journal of Human Genetics,vol.92,no.2,pp.167–176,2013. basedprenatalscreeningfortrisomy18usingalpha-fetoprotein, [23] E. J. Verweij, D. Oepkes, M. de Vries, M. E. van den Akker, E. unconjugated oestriol and human chorionic gonadotropin,” S. van den Akker, and M. A. de Boer, “Non-invasive prenatal Prenatal Diagnosis,vol.15,no.8,pp.713–723,1995. screening for trisomy 21: what women want and are willing to [8]B.F.Crandall,F.W.Hanson,S.Keener,M.Matsumoto,and pay,” Patient Education and Counseling,vol.93,no.3,pp.641– W.Miller, “Maternal serum screening for 𝛼-fetoprotein, uncon- 645, 2013. jugated estriol, and human chorionic gonadotropin between [24] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast 11 and 15 weeks of pregnancy to detect fetal chromosome and memory-efficient alignment of short DNA sequences to abnormalities,” American Journal of Obstetrics and Gynecology, the human genome,” Genome Biology,vol.10,no.3,articleR25, vol. 168, no. 6, pp. 1864–1869, 1993. 2009. BioMed Research International 7

[25]M.Ehrich,C.Deciu,T.Zwiefelhoferetal.,“Noninvasivedetec- tion of fetal trisomy 21 by sequencing of DNA in maternal blood: a study in a clinical setting,” AmericanJournalofObstetricsand Gynecology, vol. 204, no. 3, pp. 205.e1–205.e11, 2011. [26] A.B.Sparks,C.A.Struble,E.T.Wang,K.Song,andA.Oliphant, “Noninvasive prenatal detection and selective analysis of cell- free DNA obtained from maternal blood: evaluation for trisomy 21 and trisomy 18,” American Journal of Obstetrics and Gynecol- ogy,vol.206,no.4,pp.319.e1–319.e9,2012. [27] Z. Brownstein, L. M. Friedman, H. Shahin et al., “Targeted genomic capture and massively parallel sequencing to identify genes for hereditary hearing loss in middle eastern families,” Genome Biology, vol. 12, no. 9, article R89, 2011. [28] C. J. Bell, D. L. Dinwiddie, N. A. Miller et al., “Carrier testing for severe childhood recessive diseases by next-generation sequencing,” Science Translational Medicine,vol.3,no.65, Article ID 65ra4, 2011. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 4895476, 7 pages http://dx.doi.org/10.1155/2016/4895476

Research Article The Use of Protein-Protein Interactions for the Analysis of the Associations between PM2.5 and Some Diseases

Qing Zhang,1 Pei-Wei Zhang,2 and Yu-Dong Cai1

1 School of Life Sciences, Shanghai University, Shanghai 200444, China 2Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China

Correspondence should be addressed to Yu-Dong Cai; cai [email protected]

Received 20 February 2016; Revised 8 April 2016; Accepted 18 April 2016

Academic Editor: Jialiang Yang

Copyright © 2016 Qing Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nowadays, pollution levels are rapidly increasing all over the world. One of the most important pollutants is PM2.5. It is known that the pollution environment may cause several problems, such as greenhouse effect and acid rain. Among them, the most important problem is that pollutants can induce a number of serious diseases. Some studies have reported that PM2.5 is an important etiologic factor for lung cancer. In this study, we extensively investigate the associations between PM2.5 and 22 disease classes recommended by Goh et al., such as respiratory diseases, cardiovascular diseases, and gastrointestinal diseases. The protein-protein interactions were used to measure the linkage between disease genes and genes that have been reported to be modulated by PM2.5. The results suggest that some diseases, such as diseases related to ear, nose, and throat and gastrointestinal, nutritional, renal, and cardiovascular diseases, are influenced by PM2.5 and some evidences were provided to confirm our results. For example, a total of 18 genes related to cardiovascular diseases are identified to be closely related to PM2.5, and cardiovascular disease relevant gene DSP is significantly related to PM2.5 gene JUP.

1. Introduction This leads to the situation that most researches mainly focus on lung diseases and ignore other health risks we are facing. Though air pollution varies widely depending on its regions, Undoubtedly, in this way, our health risks will be largely average pollution levels are increasing rapidly around the underestimated. world, especially in some industrializing countries in Asia. In order to comprehensively understand our risks, we With the effect of weather and seasons, regional hazes might turn our attention to the associations between PM2.5 and appear due to the mixture of pollutants, which further lead to most human disorders. Since diseases are so countless that visibility impairment, traffic jams, and the reducing of living we hardly calculate each one’s relationship with PM2.5, we qualities [1]. Currently, measurement of PM2.5 (particulate chose 22 disorder classes recommended by Goh et al. [4], 𝜇 matter with particle aerodynamic diameters of 2.5 mand which include most known human disorders and are mainly smaller, also called fine particulate matter) is the most used classified based on the physiological system affected. The rele- method as an indicator pollutant to monitor air quality vant genes of each disorder, retrieved from Online Mendelian [2]. The sources of PM2.5 are diverse but mostly are from Inheritance in Man (OMIM) [5], and genes modulated by industrial emissions, biomass burning, domestic heating, and PM2.5, reported in a study authored by Gualtieri et al. [6], cigarette smoking. Also, the annual average range of haze can 3 were employed in mining the possible associations between differ from 10 to 100 𝜇g/m globally [1]. the investigated disorders and PM2.5. For each disease class, Exaggerated by air pollution, the increasing of health we calculated the maximum interaction score of each related risks, such as lung diseases, cardiovascular diseases, and gene in this class to PM2.5 relevant genes to evaluate its inflammation, is threatening us [3]. The influenced degree impact caused by PM2.5 based on the protein-protein inter- towards our health is arguable. Until now, our methods to actions (PPIs) from STRING (Search Tool for the Retrieval of understand how PM2.5 can influence our health are limited. Interacting Genes/Proteins) [7]. In addition, a permutation 2 BioMed Research International test was executed for each related gene to further evaluate Table 1: The number of genes related to each of OMIM classes of the accuracy of the aforementioned measurement, yielding disease. another measurement, namely, permutation FDR. Finally, for OMIM disease class Number of related genes each disease class, the proportion of genes with permutation Bone 35 FDRs less than 0.05 was calculated, which was used to evaluate the strength of the associations between the disease Cancer 166 class and PM2.5. Our results show that five disease classes Cardiovascular 120 are highly related to PM2.5: ear, nose, throat, gastrointestinal, Connective tissue disorder 46 nutritional, renal, and cardiovascular. These results recovered Dermatological 72 some known associations between PM2.5 and some diseases, Developmental 54 validating the accuracy of the results and providing a new Ear, nose, and throat 38 way to investigate the associations between PM2.5 and other Endocrine 80 diseases. Gastrointestinal 25 Hematological 65 2. Materials and Methods Immunological 85 2.1. Genes Related to 22 Diseases. The genes for each dis- Metabolic 177 ease were downloaded from OMIM (http://www.omim.org/, Multiple 179 accessed in January 2014) [5]. Because there are too many Muscular 59 diseases and many diseases are actually similar, according Neurological 226 to the morbid map file with OMIM disease ID and class Nutritional 18 assignment proposed in Goh et al.’s study [4], similar diseases Ophthalmological 90 were combined into the same disease class. 22 disease classes Psychiatric 29 were obtained (please see column 1 of Table 1). For genes Renal 43 relatedtoacertaindiseaseclass,thosewithoutEnsemblIDs Respiratory 45 occurring in the PPI network reported in STRING were Skeletal 56 discarded. Column 2 of Table 1 lists the number of remaining genesrelatedtoeachoneofthe22diseaseclasses.Thedetailed Unclassified 13 codes of the related genes are provided in Supplementary Material I (in Supplementary Material available online at http://dx.doi.org/10.1155/2016/4895476). databases, such as Database of Interaction Proteins (DIP) [21], BioGRID [22], which only provide PPIs validated by 2.2. Genes Modulated by PM2.5. Genes that can be modu- experiments, PPI used in this study can measure associ- lated by PM2.5 were retrieved from Gualtieri et al.’s study [6], ations between proteins from the point of view of both in which 177 differentially expressed genes were reported to 𝑝 < the protein physical properties and the protein functional be modulated by PM2.5 winter and summer with value properties because they are derived from the following 0.05 and 43 differentially expressed genes were reported to be (1) (2) <−0.5 > four sources: genomic context, high-throughput modulated by PM2.5 with log 2 fold change or 0.5. We experiments, (3) (conserved) coexpression, and (4) previous combined these differentially expressed genes and obtained knowledge. All PPI information was contained in a file called 189 genes that are provided in Supplementary Material II. For “protein.links.v9.0.txt.gz” which can be downloaded from convenience, they were called PM2.5 genes in this study and 𝐷 STRING. Each PPI consists of two proteins, represented by comprised the gene set PM. Ensembl IDs, and one score with range between 150 and 999, indicating the strength of the interaction. For proteins 𝑝1 2.3. PPI. PPI information is a widely used type of informa- and 𝑝2, let us denote the score of the PPI between them by tion to investigate various protein-related and gene-related 𝐼(𝑝1,𝑝2).If𝑝1 and 𝑝2 are identical, 𝐼(𝑝1,𝑝2) was set to 1,000, problems. Many investigators have built several effective while it was set to zero if the interaction between them is not computational methods to deal with different biological reported in STRING. problems, such as protein functions prediction [8–12] and disease gene identification [13–19]. Most of the methods werebasedonawidelyacceptedfactthattwoproteinsina 2.4. Measuring the Associations between Disease Genes and PPI always share similar functions [8, 9, 11, 19]. To uncover PM2.5. To analyze the associations between each OMIM the associations between 22 disease classes and PM2.5, it diseaseclassandPM2.5,wecaninvestigatetheassociations is natural to analyze the relationships between their related between genes related to each disease and PM2.5 genes, genes, while the PPI information is one of the most suitable thereby inducing the likelihood of genes modulated by materials to address this problem. PM2.5. In this study, we used the PPI information reported in For a gene 𝑔 related to one OMIM disease class, if STRING (version 9.0, http://string.embl.de/) [20], a large it has strong associations with one PM2.5 gene, it may online database containing a large number of PPIs for several have similarity functions with the PM2.5 gene, suggesting organisms. Comparing to the PPIs reported in some other that it may be modulated by PM2.5. Thus, we calculated BioMed Research International 3

PM2.5 genes D ( PM)

Using (1)

Genes related Each Maximum to one disease gene interaction score disease Permutation FDR

1000 scores

Using (2)

1000 gene sets

Select disease genes with permutation FDRs less than 0.05

Proportion of selected genes

Figure 1: A flowchart to illustrate the procedures of our method. the maximum interaction score for each disease gene 𝑔 as for disease gene 𝑔.Obviously,smallpermutationFDRfora follows: gene suggests that its position in the whole PPI network is

󸀠 󸀠 notspecialanditishighlyrelatedtoPM2.5ifitsmaximum 𝑄(𝑔)=max {𝐼 (𝑔, 𝑔 ):𝑔 ∈𝐷PM}. (1) interaction score is high.

Obviously, this score measures the associations between 𝑔 2.5. Measuring the Associations between Diseases and PM2.5. and PM2.5 genes. A high score means that there is at least one For each OMIM disease class, there are several genes related PM2.5 gene that is highly related to 𝑔.BecauseeachPM2.5 to it. Each gene has measured its associations with PM2.5 by gene can be modulated by PM2.5, it can infer that 𝑔 can be calculating the maximum interaction score and permutation modulated by PM2.5 with high probability. FDR. If a gene received a small permutation FDR, it may be Each disease gene measured the associations between it highly related to PM2.5. In view of the fact that 0.05 is always and PM2.5 by investigating the PPIs between it and PM2.5 selected as the cutoff value for the significance level in the genes. However, some disease genes may occupy the special traditional test and has been applied in some studies [14, 15, locations in the PPI network, meaning that they are highly 23], we also set 0.05 as the threshold for the permutation FDR related to any gene. In this case, the maximum interaction to filter most related disease genes among the genes related to scorecalculatedby(1)cannotreflectthereality.Inthiscase, an OMIM disease class. Because the numbers of genes related another measurement should be employed for each disease to 12 disease classes are of great difference, it is not reasonable gene 𝑔.Toobtainthismeasurement,werandomlyproduced to measure the associations between diseases and PM2.5 only 1,000 sets of genes, denoted by 𝐷1,𝐷2,...,𝐷1000,andeachof considering the number of selected genes. The proportion of them had the same size of 𝐷PM.Forgeneset𝐷𝑖,calculatethe the selected genes to all genes is more suitable. It is clear that score as follows: if a disease received a high proportion, that is, many genes 󸀠 󸀠 relatedtothisdiseasearecloselyrelatedtoPM2.5,itmayhave 𝑄𝑖 (𝑔) = max {𝐼 (𝑔, 𝑔 ):𝑔 ∈𝐷𝑖}. (2) strong associations with PM2.5 and PM2.5 can be a potential etiologic factor of this disease. Accordingly,thereisonescorefor𝐷PM and 1,000 scores for 1,000 randomly produced gene sets. The measurement, 3. Results and Discussion namely,permutationFDR(FalseDiscoveryRate),was defined to be the proportion of 1,000 scores for randomly In this study, we proposed a computational method to produced gene sets that were larger than the score for 𝐷PM. measure the associations between 22 disease classes and For convenience, this measurement was denoted by FDR(𝑔) PM2.5. The whole procedures are illustrated in Figure 1. This 4 BioMed Research International

Table 2: The associations between the 22 disease classes and PM2.5. Muscular Cancer OMIM disease class Proportion Neurological Ear, nose, and throat 0.2368 Ophthalmological Developmental Gastrointestinal 0.2000 Hematological Nutritional 0.1667 Connective tissue disorder Renal 0.1628 Unclassified Multiple Respiratory 0.1556 Bone Cardiovascular 0.1500 Immunological Endocrine 0.1500 Metabolic Dermatological Skeletal 0.1429 Psychiatric Psychiatric 0.1379 diseaseOMIM class Skeletal Endocrine Dermatological 0.1250 Cardiovascular Metabolic 0.1243 Respiratory Immunological 0.1176 Renal Nutritional Bone 0.0857 Gastrointestinal Multiple 0.0838 Ear, nose, and throat Unclassified 0.0769 0 0.05 0.1 0.15 0.2 0.25 Connective tissue disorder 0.0652 Proportion Hematological 0.0615 Figure 2: Bar chart illustrating the proportion of the selected genes Developmental 0.0556 to all disease genes for each disease class. Ophthalmological 0.0556 Neurological 0.0531 “neurological,” “cancer,” and “muscular” classes are the least Cancer 0.0482 five disease classes. This section would give some evidences Muscular 0.0169 to show that some top disease classes are closely related to PM2.5. section would give the detailed results and the discussion 3.3.1. Nasal Biopsies Exposed to PM2.5 Show Squamous basedonthem. Metaplasia and Some Biochemical Changes (e.g., the Increasing Secretion of Amphiregulin and IL-8). Nose and throat are 3.1. Results of the Associations between Disease Genes and directly exposed to outside pollutants. Also, running nose, PM2.5. As described in Section 2.4, we calculated two blockednose,andcoughareoftenreportedinpatientsliving measurements: maximum interaction score and permutation in polluting areas. For example, an experiment based on FDR for each disease gene, to quantify the associations nasal biopsies of children from polluted areas in Southwest between it and PM2.5. These values are listed in Supplemen- Metropolitan Mexico City has revealed that their biopsies tary Material III. displayed squamous metaplasia and decreased numbers of goblet and ciliated cells, and so on [24]. Another experiment 3.2. Results of the Associations between Diseases and PM2.5. using apical membranes of well-differentiated human nasal For genes related to one disease class, we excluded those epithelial (HNE) cells exposed to PM2.5 suggested that PM2.5 with permutation FDR larger than or equal to 0.05. The can stimulate both amphiregulin and IL-8 secretion [25]. proportion of the remaining genes was calculated to measure the associations between the disease and PM2.5. These 3.3.2. Air Pollutants Are Associated with Intestinal Diseases, proportions are listed in Table 2. For easy observation, a bar Such As Appendicitis and Children Acute Diarrheal Disease. chart (Figure 2) was also plotted to illustrate the proportions The influence of PM2.5 on the intestinal system is poorly for all 22 disease classes. It can be seen that the disease class investigated until now. Despite the fact that the oral route can “ear, nose, and throat” is the leader disease which is deemed be accessible easily, large fractions of air pollutants will also to be closely related to PM2.5, followed by disease classes enter and further impact the intestine, because of systemic “gastrointestinal,” “nutritional,” and so forth. The following inflammation [26]. Fortunately, there are still studies that selection would give the detailed discussion on these findings, show an association between air pollutants and different suggesting that our method can recover some known results aspects of intestinal diseases. Kaplan and his colleagues andalsoimplysomenewresults. showed that the incidence of appendicitis is related to some pollutants, such as SO2 and NO2 in summer [27]. Orazzo and 3.3. Analysis of the Results. According to the results listed his colleagues reported that children acute diarrheal disease in Table 2, “ear, nose, and throat,” “gastrointestinal,” “nutri- is positively corrected with SO2 and CO [28]. tional,” “renal,” “respiratory,” and “cardiovascular” classes are the top six disease classes that are most related to PM2.5. 3.3.3. Exposure to PM2.5 Increases Risk to Diabetic and On the other hand, “developmental,” “ophthalmological,” Obese Individuals. Nutritional diseases refer to any of the BioMed Research International 5

Table 3: Detailed information on 18 cardiovascular related genes with permutation FDR less than 0.05.

Disease gene Maximum interaction score Permutation FDR Most related PM2.5 gene DSP 999 0.009 JUP PTGIS 977 0.011 PTGS2 EPHX1 995 0.012 CYP1A1 PPARG 999 0.017 PPARGC1A OLR1 944 0.018 CCL2 PKP2 984 0.024 JUP TCF4 994 0.028 ID1 NR3C2 953 0.029 SGK1 TCF7L2 997 0.029 JUP ELN 993 0.033 FBN2 NEUROD1 939 0.035 ID2 F5 998 0.036 PROC F13A1 970 0.037 FGG BMPR2 999 0.039 BMP6 PNMT 952 0.04 EGR1 IL6 999 0.042 IL6ST TNFSF4 859 0.042 STAT4 HNF1B 947 0.044 SOX9 nutrient-related diseases and conditions that cause illness from which we can see that the total of 18 genes related to in humans, including eating disorders and obesity, deficien- cardiovascular diseases were identified to be closely related cies or excesses in the diet, and chronic diseases such as to PM2.5 by our method. Here, some genes that are most cardiovascular disease, diabetes mellitus, and hypertension. relevant to PM2.5 would be discussed. For example, gene Schneider et al. claimed that exposure to PM2.5 is detrimental DSP (desmoplakin, a component of functional desmosomes) to diabetic individuals, in detail, patients with type 2 diabetes plays a crucial role in intercellular junctions between adjacent shown endothelial cell dysfunction after exposure to PM2.5 cells. DSP is significantly related to PM2.5 gene JUP.Research [29]. Also, experimental data suggest that the bad effects of has found that mutations in DSP can cause arrhythmogenic inhaled PM2.5 can be exacerbated by a high-fat diet [30], and right ventricular dysplasia/cardiomyopathy (ARVD/C) and increased risk is accompanying obese people [31]. increase risk of sudden death. More importantly, mutations of JUP itself have also been identified in patients with 3.3.4. Heavy Metals in PM2.5 (e.g., Lead and TiO2)Can ARVD/C [36], which further explains that PM2.5 might Damage Kidneys. PM2.5 contains many heavy metals that increase risks of cardiovascular diseases. Another example is candamagekidneys.Eventhoughleadconcentrationis PTGS2. The abnormal expression of PTGS2 is found when dependent on the sampling location, lead often exists in cells are exposed to PM2.5, and the PM2.5 gene PTGS2 PM2.5.Reporthasshownthatexposuretoleadcancausekid- is related to PTGIS (a member of the cytochrome P450 neydamage[32].Also,titaniumdioxide(TiO2)isfrequently superfamily of enzymes), which encodes monooxygenases used in a wide range of plastics, paints, and paper. It is shown that are involved in synthesis of lipids, such as cholesterol and that it can induce serious swelling in rats’ renal glomerulus steroids. Patients who are at high risks of developing stroke andleadtosignificantlesionsofkidneys[33]. and myocardial infarction often have mutations in PTGS2 or abnormal expression of PTGIS [37]. 3.3.5. PM2.5 Increases Risk of Cardiovascular Diseases Sup- Here, two considerations relevant to our research should ported by Disease Statistics and Associations in Genes (e.g., be mentioned. First, we should confirm that current findings Cardiovascular Relevant Gene DSP Is Related to PM2.5 Gene that PM2.5 can aggravate lung and heart diseases, such as JUP). Sufficient evidence has proved that PM2.5 is associated respiratory symptoms and heart attacks, are easily influenced with cardiovascular diseases. A study in the northeastern by subjective judgments and unknown factors. This is partly United States confirmed that risks in all-cause coronary heart because most findings between diseases and PM2.5 are based disease(CHD)areincreasingwhentheexposuretoPM2.5 on hospital admissions and information from questionnaires, is increasing [34]. In Massachusetts, an association between which could be largely influenced by weather, epidemic occurrence of acute myocardial infarction (AMI, one of the diseases, and other factors. In addition, few researches have specific cardiovascular diseases) and long-term exposure to focused on other diseases, such as gastrointestinal diseases, area PM2.5 has been proven [35]. In this study, cardiovascular due to no enough resources and reasonable methods for diseases are also predicted to be highly related to PM2.5. them to dig into all these diseases. Thus, it will be unwise Table 3 lists the maximum interaction scores and permutation to conclude which disease is mostly influenced by PM2.5, FDRs of selected genes related to cardiovascular diseases, without ruling out other diseases’ associations with PM2.5 6 BioMed Research International duetolackofrelevantresearch.Second,weadopted22 National Academy of Sciences of the United States of America, disease classes based on Maurizio Gualtieri and his colleges’ vol. 104, no. 21, pp. 8685–8690, 2007. study. The mixtures of several diseases to one disorder might [5] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V.A. cover some relevant information or add some irrelevant McKusick, “Online Mendelian Inheritance in Man (OMIM), a information. Despite the existence of the flaw, our goal is knowledgebase of human genes and genetic disorders,” Nucleic to research associations between PM2.5 and most human Acids Research, vol. 33, supplement 1, pp. D514–D517, 2005. diseases that have not been executed before. Since no other [6] M. Gualtieri, E. Longhin, M. Mattioli et al., “Gene expression solution has been found to solve this problem well, we assume profiling of A549 cells exposed to Milan PM2.5,” Toxicology that this method is accepted, relevant, and accurate. Letters,vol.209,no.2,pp.136–145,2012. [7] L. J. Jensen, M. Kuhn, M. Stark et al., “STRING 8—a global view on proteins and their functional interactions in 630 organisms,” 4. Conclusions Nucleic Acids Research, vol. 37, supplement 1, pp. D412–D416, 2009. This contribution gives an intensive investigation on the [8] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction linkages between PM2.5 and several diseases. For each dis- of protein function,” Molecular Systems Biology, vol. 3, article 88, ease, its association with PM2.5 was measured by its related 2007. genes and PM2.5 genes using the protein-protein interaction [9]K.-L.Ng,J.-S.Ciou,andC.-H.Huang,“Predictionofprotein information. Our method affirms that some diseases are functions based on function-function correlation relations,” highly related to PM2.5. It is hopeful that the new findings in ComputersinBiologyandMedicine,vol.40,no.3,pp.300–305, this study may give new ways to study the causal relationship 2010. between diseases and PM2.5. In addition, our method used [10] L. L. Hu, T. Huang, X.-J. Liu, and Y.-D. Cai, “Predicting protein the PPIs to evaluate the associations between PM2.5 and phenotypes based on protein-protein interaction network,” several diseases. In future, we will focus on integrating some PLoS ONE,vol.6,no.3,ArticleIDe17668,2011. advanced network algorithms, such as the shortest path [11] P. Gao, Q.-P. Wang, L. Chen, and T. Huang, “Prediction of algorithm and random walk algorithm, into the PPI network human genes’ regulatory functions based on proteinprotein to build more effective methods for addressing the problem. interaction network,” Protein and Peptide Letters,vol.19,no.9, pp. 910–916, 2012. [12] S. Letovsky and S. Kasif, “Predicting protein function from Competing Interests protein/protein interaction data: a probabilistic approach,” Bioinformatics,vol.19,supplement1,pp.i197–i204,2003. The authors declare that there are no competing interests [13]J.Zhang,M.Jiang,F.Yuanetal.,“Identificationofage- regarding the publication of this paper. related macular degeneration related genes by applying shortest path algorithm in protein-protein interaction network,” BioMed Authors’ Contributions Research International, vol. 2013, Article ID 523415, 8 pages, 2013. Qing Zhang and Pei-Wei Zhang contributed equally to this [14] B.-Q. Li, T. Huang, L. Liu, Y.-D. Cai, and K.-C. Chou, “Identifi- work. cation of colorectal cancer related genes with mrmr and shortest path in protein-protein interaction network,” PLoS ONE,vol.7, no. 4, Article ID e33393, 2012. Acknowledgments [15]L.Chen,Z.Xing,T.Huang,Y.Shu,G.Huang,andH.Li, “Application of the shortest path algorithm for the discovery of This study was supported by the National Natural Science breast cancer-related genes,” Current Bioinformatics,vol.11,no. Foundation of China (31371335). 1,pp.51–58,2016. [16] T. Gui, X. Dong, R. Li, Y.Li, and Z. Wang, “Identification of hep- References atocellular carcinoma–related genes with a machine learning and network analysis,” Journal of Computational Biology,vol.22, [1] D. Loomis, Y. Grosse, B. Lauby-Secretan et al., “The carcino- no. 1, pp. 63–71, 2015. genicity of outdoor air pollution,” The Lancet Oncology,vol.14, [17] L. Chen, C. Chu, X. Kong, G. Huang, T. Huang, and Y.-D. Cai, no. 13, pp. 1262–1263, 2013. “A hybrid computational method for the discovery of novel [2] M. A. Kioumourtzoglou, B. A. Coull, F. Dominici, P. Koutrakis, reproduction-related genes,” PLoS ONE,vol.10,no.3,Article J. Schwartz, and H. Suh, “The impact of source contribution ID e0117090, 2015. uncertainty on the effects of source-specific PM2.5 on hospital [18]L.Chen,J.Yang,T.Huang,X.Kong,L.Lu,andY.Cai, admissions: a case study in Boston, MA,” Journal of Exposure “Mining for novel tumor suppressor genes using a shortest path Science and Environmental Epidemiology,vol.24,no.4,pp.365– approach,” Journal of Biomolecular Structure and Dynamics,vol. 371, 2014. 34,no.3,pp.664–675,2015. [3] C. Arden Pope III, R. T. Burnett, M. C. Turner et al., “Lung can- [19] J. Zhang, J. Yang, T. Huang, Y. Shu, and L. Chen, “Identification cer and cardiovascular disease mortality associated with ambi- of novel proliferative diabetic retinopathy related genes on ent air pollution and cigarette smoke: shape of the exposure- protein-protein interaction network,” Neurocomputing,Inpress. response relationships,” EnvironmentalHealthPerspectives,vol. [20] A. Franceschini, D. Szklarczyk, S. Frankild et al., “STRING v9.1: 119, no. 11, pp. 1616–1621, 2011. protein-protein interaction networks, with increased coverage [4]K.I.Goh,M.E.Cusick,D.Valle,B.Childs,M.Vidal,andA. and integration,” Nucleic Acids Research,vol.41,no.1,pp.D808– L. Barabasi,´ “The human disease network,” Proceedings of the D815, 2013. BioMed Research International 7

[21] I. Xenarios, D. W. Rice, L. Salwinski, M. K. Baron, E. M. [36] Z. Yang, N. E. Bowles, S. E. Scherer et al., “Desmosomal dys- Marcotte, and D. Eisenberg, “DIP: the database of interacting function due to mutations in desmoplakin causes arrhythmo- proteins,” Nucleic Acids Research,vol.28,no.1,pp.289–291, genic right ventricular dysplasia/cardiomyopathy,” Circulation 2000. Research,vol.99,no.6,pp.646–655,2006. [22] C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, [37] E. Fosslien, “Cardiovascular complications of non-steroidal and M. Tyers, “BioGRID: a general repository for interaction anti-inflammatory drugs,” Annals of Clinical & Laboratory datasets,” Nucleic Acids Research, vol. 34, supplement 1, pp. Science,vol.35,no.4,pp.347–385,2005. D535–D539, 2006. [23]Y.-F.Gao,Y.Shu,L.Yangetal.,“Agraphicmethodforiden- tification of novel glioma related genes,” BioMed Research International,vol.2014,ArticleID891945,8pages,2014. [24] L. Calderon-Garciduenas, A. Rodriguez-Alcaraz, G. Valencia- Salazar et al., “Nasal biopsies of children exposed to air pollutants,” Toxicologic Pathology,vol.29,no.5,pp.558–564, 2001. [25] F. Auger, M.-C. Gendron, C. Chamot, F. Marano, and A.-C. Dazy, “Responses of well-differentiated nasal epithelial cells exposed to particles: role of the epithelium in airway inflam- mation,” Toxicology and Applied Pharmacology,vol.215,no.3, pp.285–294,2006. [26] L. Calderon-Garcidue´ nas,˜ M. Franco-Lira, R. Torres-Jardon´ et al., “Pediatric respiratory and systemic effects of chronic air pollution exposure: nose, lung, heart, and brain pathology,” Toxicologic Pathology,vol.35,no.1,pp.154–162,2007. [27] G. G. Kaplan, E. Dixon, R. Panaccione et al., “Effect of ambient air pollution on the incidence of appendicitis,” Canadian Medi- cal Association Journal,vol.181,no.9,pp.591–597,2009. [28] F. Orazzo, L. Nespoli, K. Ito et al., “Air pollution, aeroallergens, and emergency room visits for acute respiratory diseases and gastroenteric disorders among young children in six Italian cities,” Environmental Health Perspectives, vol. 117, no. 11, pp. 1780–1785, 2009. [29] A. Schneider, N. E. Alexis, D. Diaz-Sanchez et al., “Ambient PM2.5 exposure up-regulates the expression of costimulatory receptors on circulating monocytes in diabetic individuals,” Environmental Health Perspectives,vol.119,no.6,pp.778–783, 2011. [30] Q. Sun, P. Yue, J. A. Deiuliis et al., “Ambient air pollution exaggerates adipose inflammation and insulin resistance in a mouse model of diet-induced obesity,” Circulation,vol.119,no. 4, pp. 538–546, 2009.

[31] C. Potera, “Toxicity beyond the lung: connecting PM2.5, inflam- mation, and diabetes,” EnvironmentalHealthPerspectives,vol. 122,no.1,p.A29,2014. [32]H.M.Aburas,M.A.Zytoon,andM.I.Abdulsalam,“Atmo- spheric lead in PM 2.5 after leaded gasoline phase-out in Jeddah City, Saudi Arabia,” Clean—Soil, Air, Water,vol.39,no.8,pp. 711–719, 2011. [33] J. Wang, G. Zhou, C. Chen et al., “Acute toxicity and biodistri- bution of different sized titanium dioxide particles in mice after oral administration,” Toxicology Letters,vol.168,no.2,pp.176– 185, 2007. [34] R. C. Puett, J. E. Hart, J. D. Yanosky et al., “Chronic fine and coarse particulate exposure, mortality, and coronary heart disease in the Nurses’ Health Study,” Environmental Health Perspectives,vol.117,no.11,pp.1697–1701,2009. [35] J. Madrigano, I. Kloog, R. Goldberg, B. A. Coull, M. A. Mittleman, and J. Schwartz, “Long-term exposure to PM2.5 and incidence of acute myocardial infarction,” Environmental Health Perspectives,vol.121,no.2,pp.192–196,2013. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 5237827, 5 pages http://dx.doi.org/10.1155/2016/5237827

Research Article The Occurrence of Genetic Alterations during the Progression of Breast Carcinoma

Xiao-Chen Li,1 Chenglin Liu,2 Tao Huang,3 and Yang Zhong1

1 School of Life Sciences, Fudan University, Shanghai 200433, China 2School of Life Sciences and Biotechnology, Shanghai Jiaotong University, Shanghai 200240, China 3Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China

Correspondence should be addressed to Yang Zhong; [email protected]

Received 4 November 2015; Accepted 28 March 2016

Academic Editor: Xin-yuan Guan

Copyright © 2016 Xiao-Chen Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The interrelationship among genetic variations between the developing process of carcinoma and the order of occurrence has not been completely understood. Interpreting the mechanisms of copy number variation (CNV) is absolutely necessary for understanding the etiology of genetic disorders. Oncogenetic tree is a special phylogenetic tree inferential pictorial representation of oncogenesis. In our present study, we constructed oncogenetic tree to imitate the occurrence of genetic and cytogenetic alterations in human breast cancer. The oncogenetic was built on CNV of ErbB2, AKT2, KRAS, PIK3CA, PTEN, and CCND1 genes in 963 cases of tumors with sequencing and CNA data of human breast cancer from TCGA. Results from the oncogenetic tree model indicate that ErbB2 copy number variation is the frequent early event of human breast cancer. The oncogenetic tree model based on the phylogenetic tree is a type of mathematical model that may eventually provide a better way to understand the process of oncogenesis.

1. Introduction describes this linear cancer genome evolution by construct- ing an oncogenetic tree representing common sequences of Copy number variations (CNVs) are a form of structural genetic events in tumor progression. The oncogenetic models genetic alterations contributed to the initiation and progres- are flexible since they can represent multiple pathways at sion of breast cancer. They are an underlying cause of the can- the same time [3]. Various bioinformatics software programs cer cell population adaption by affecting cell evolution includ- have been proposed for the construction of oncogenetic tree ing proliferation, fitness, and clonal selection [1]. Distinct including TrAp [4] and oncotree [5]. CNV patterns have been reported in breast cancer subgroups, In this paper, we construct two oncogenetic trees based which are correlated to different stages of cancer progression on independent models based for human breast carcinoma. [2]. The study to understand the evolutionary relationship In the distance-based oncogenetic tree construction, the state of CNVs based on their occurrence can reveal important of a normal cell is the root of the tree [6]. Each subclone biomarkers and benefit the diagnostic of breast carcinoma. is represented by an imbalanced tree, with the nodes being Apowerfultooltorevealtheevolutionaryrelationships described by probabilities of CNV occurrence based on is the phylogenetic tree. Besides its application to depict the conjunctive Bayesian networks. CNV occurring early in the evolutionary descent of different species or organisms from cytogenetic evolution event is close to the root in the tree. the same ancestor, it is widely used in cancer study to reveal The gene positive selection is conveyed by degree of the the phylogenetic relationships of tumor subclones during imbalanceorasymmetryofthetreetopology,whilethose cell evolution. Based on the assumption that each of the closely balanced subsets are known as clusters in the tree subclones derived from a parental cell population carries all and hidden genetic events before the internal crossover point. its alterations and acquires additional ones, phylogenetic tree However, the limitation is the insufficiency to explain the 2 BioMed Research International negative correlation between two aberrations. Hence, we 3. Results verify our results by introducing the branching oncogenetic trees to estimate the probabilities of false negative and false 3.1. Frequency of Gene Alteration. The copy number variation positive observations [7]. is detected in at least one of the six candidate genes in thirty- six percent (342 in 963 cases) of patients. Among the six copy number alterations assayed, the CCND1 alteration was 2. Material and Methods detected as the most frequent one, existing in 16% of breast cancer cases. AKT2 alteration was the least common, found The sequence and CNA data of 963 breast cancer patients in only 2% of patients. The CNV frequencies of the other were collected from the Cancer Genome Atlas (TCGA) Breast four candidate genes are PIK3CA (5%, 51/963), KRAS (3%, Invasive Carcinoma project. Important copy number varia- 25/963), PTEN (6%, 55/963), and ErbB2 (13%, 121/963), as tionswereselectedbysearchingthekeywords“copynumber showninFigure1. alterations”, “CNAS”, “breast cancer”, “deletion”, and “ampli- According to the distance-based oncogenetic trees (Fig- fication” in Online Mendelian Inheritance in Man (OMIM), ure 2(a)), the branch of ErbB2 event yields two clusters, and candidate alteration biomarkers for breast cancer including KRAS and PIK3CA and AKT2 and PTEN. In this were chosen as ErbB2, AKT2, KRAS, PIK3CA, PTEN, and model, the interconnection of the paths forms the cluster CCND1, whose copy number alterations we supposed would of genetic imbalances which indicates high correlation of play a key role in the development of breast cancer based events. The closer the final points are to the root of the tree, upon literature research. the earlier the imbalances occur in genetic evolution. The The distance-based oncogenetic tree model is constructed ErbB2 alteration, presumably a primary event, is shown to be based on the idea of path models by Desper et al. [7]. The close to the root of the tree. The next branching in the tree branchesofthetreerepresentthemultiplepathwaysofthe separates CCND1 from the remaining events. given tumor. The pathways either are parallel to each other The branching oncogenetic tree model of the dataset or branch out in tree topology, each showing a dependent or showed four branches departing from the root (Figure 2(b)). independent event. Based on first-order Markov property, a ErbB2 was an important early event as it is close to the root genetic event only depends on its single direct predecessors andalsotherootofanewsubtree.TheCCND1alteration, in the tree. The occurrence probability of the subclones PTEN alteration, and AKT2 alteration are located on one derived from their parental clones is described as conditional branch emanating from the root. Those branches do not probability and represented as the edge weight in the - continue to any other alteration. The next main branch child relations in the tree topology. In this tree model, the emanating from the root leads to alteration of ErbB2, which root of the tree serves as a normal event and the inner nodes is marked as a first node, and then branches again to KRAS represent the hypothetical taxonomic units. A probability and PIK3CA alteration. The KRAS subbranch continued to 𝑒 𝑝 is assigned to each edge 𝑒 in the tree. In the beginning, its alteration to PIK3CA with high probability. Inthe1000-timereplicatedbootstrapresamplingproce- none of events corresponding to the nodes in the tree has dure for the primary tree, the original tree is reconstructed in occurred. Starting at the root of the tree, the event of the 776times,whichisthemostfrequenttreeinthebootstrap- next node occurs based on the model regulations using ping. The frequent trees and the original tree are showed in the following probabilities. Denote the occurring event as 𝑋 Figure 3. Among trees representing at least 1000 resampled node in the tree; it occurs independently in a continuous oncogenetic trees, AKT2, KRAS, PTEN, and CCND1 are 𝑝𝑒 marginal probability .Inthistreemodel,theweightsof direct or indirect descendants of ErbB2. ErbB2 is always − (𝑝𝑒) edges are defined as log .Finally,astoasingletumor,the closetotheroot,andKRASandPIK3CAalwaysstayinone subset of leaf events occurring are the results of the random branch. In 48 cases in the bootstrapping, the KRAS-PIK3CA experiment. The clusters of trees characterize the subset of the branch is independent of ErbB2 from the root. In another changes that have close correlations [8], and clusters close to 45 cases, the PICKA subbranch consistently shows high root reveal the early events. Maximum likelihood estimation probabilityofalterationtoErbB2andthentoKRAS.PTEN is performed to determine the best fitting tree model for the is an independent branch of the root in the tree in 30 cases. given copy number data. The nonparametric bootstrap [9] is adopted to evaluate the uncertainty of properties of the 4. Discussion estimated tree model. The bootstrap confidence values are obtained for the inner edges and the resulting tree displays the In this paper, we construct the distance-based tree models splits that occurred in >10 percent of the bootstrap datasets and the branching models based on six candidate genes [10]. whose CNVs are reported as important to breast carcinoma Two types of oncogenetic tree models were proposed to basedonOMIMdatabase.Basedontheseconstructed describe the occurrence of copy number alterations during oncogenetic tree models, some common findings are revealed the progression of breast cancer. The branching trees were as the following inferences: (a) ErbB2 is close to the root of the constructed using the software Oncotree [5], (https://cran.r- tree, indicating that its alteration is potentially an early event project.org/web/packages/Oncotree/), while the distance- in breast cancers; (b) at least three subtrees appear to follow based trees were generated by the R software package up the event of ErbB2, which indicates that there are pathways (https://cran.r-project.org/web/packages/oncomodel/) [10]. ofprogressioninhumanbreastcancer;(c)acloserelationship BioMed Research International 3

PIK3CA 5 AKT2 2 KRAS 2 PTEN 5 ErbB2(%) 12 CCND1 15 Genetic alteration

Amplification Deep deletion

Figure 1: The CNV frequency of six candidate genes in the breast carcinoma patients. The CNVs of at least one of the six candidate genes occur in 342 (36%) of 963 cases/patients. The CNV frequencies of the six candidate genes are PIK3CA (5%, 51/963), AKT2 (2%, 22/963), KRAS (3%, 25/963), PTEN (6%, 55/963), ErbB2 (13%, 121/963), and CCND1 (16%, 153/963).

PTEN

I1 Root AKT2

I3 0.63

KRAS ErbB2 I2 I4 PIK3CA 0.64 0.57 0.60 0.74

AKT2 KRAS PTEN CCND1 Root I5 ErbB2 0.90

CCND1 PIK3CA (a) (b)

Figure 2: The distant-based and branching oncogenetic tree models for copy number alterations in 963 breast tumors. (a) In the distance- based oncogenetic tree, hidden or unknown events are represented as internal nodes. Leaves between the distances depend on the meaning of copy number changes with greater probability of cluster-related copy number variation. The root of the tree close to the left of the last point in the imbalance of genetic evolution occurred early. Interconnection path represents a high correlation between events, forming clusters of genetic balance. (b) In the branching oncogenetic tree, root represents the event with no alterations. The alterations genes include ErbB2, AKT2, KRAS, PIK3CA, PTEN, and CCND. The aberration of ErbB2 is potentially early event as it is close to the root and also the rootofa new subtree.

is revealed between KRAS and PIK3CA alterations, since they According to the ErbB2 branch, copy number alterations are consistently grouped to the same subcluster. of AKT2, PTEN, CCND1, KRAS, and PIK3CA may associate Some contradictories are discovered to be mainly relative with the upstream pathway lesions on the ErbB2 pathway to the order of KRAS and PIK3CA alteration, that is, whether based on oncogenetic tree theory. It is reported that ErbB2 is PIK3CA alteration as a part of RAS pathway is the late event a major driver of tumor growth in 20% of breast cancers [12]. after KRAS alteration or PIK3CA the late event in breast The downstream signaling cascade PI3K signaling associated cancer. Hollestelle et al. reported that PIK3CA alteration with signals originating from HER2 plays a vital role in occurs almost exclusively in invasive tumors, which is a late tumorigenesis, drug resistance, and tumor progression in event of the PI3K pathway [11]. However, Hollestelle et al. breast cancer [13]. CCND1 and ErbB2 have been reported found that there are downstream effectors of the oncogenic to play an early role in sporadic breast cancer [14]. The link PI3K and RAS pathways in breast cancer [11]. However, as between ErbB2 and the AKT pathway in breast cancers has shown in our results, KRAS and PIK3CA alteration tend to beenvalidatedinvitroandinvivo,indicatingthepossible occurafterErbB2,andacloserelationshipmustexistbetween role of AKT activation in ErbB2-mediated breast cancer them. progression [15]. 4 BioMed Research International

Root Root Root

ErbB2 ErbB2 ErbB2

AKT2 KRAS PTEN CCND1 AKT2 KRAS PTEN CCND1 AKT2 KRAS PTEN CCND1

PIK3CA PIK3CA PIK3CA

(a) Original tree (b) Tree based on most frequent parent (c) Observed frequency = 776 Root Root Root

ErbB2 ErbB2 ErbB2

AKT2 KRAS PTEN CCND1 AKT2 KRAS PTEN CCND1 AKT2 KRAS PTEN CCND1

PIK3CA PIK3CA PIK3CA

(d) Observed frequency = 48 (e) Observed frequency = 45 (f) Observed frequency = 30

Figure 3: The frequent oncogenetic trees constructed in the 1000-time bootstrap procedure for the primary 𝑛 = 963 trees. (a) The original tree; (b) the tree based on the most frequent parent; (c) the tree repeatedly appearing 776 times; (d) the tree replicated 48 times; (e) the tree replicated 45 times; (f) the tree replicated 30 times.

Based on the oncogenetic tree model of human breast [4]F.Strino,F.Parisi,M.Micsinai,andY.Kluger,“TrAp:atree carcinoma, we demonstrate the following three key points: approach for fingerprinting subclonal tumor composition,” Nucleic Acids Research,vol.41,no.17,p.e165,2013. (i) The alterations of ErbB2 are a potential early event in [5] A. Szabo and K. Boucher, “Estimating an oncogenetic tree when cellular transformation of breast cancer. false negatives and positives are present,” Mathematical Bio- (ii) The copy number alterations of AKT2, PTEN, sciences,vol.176,no.2,pp.219–236,2002. CCND1, KRAS, and PIK3CA may associate with their [6] R. Desper, F. Jiang, O.-P.Kallioniemi, H. Moch, C. H. Papadim- upstream pathway lesions ErbB2 alterations. itriou, and A. A. Schaffer,¨ “Distance-based reconstruction of tree models for oncogenesis,” Journal of Computational Biology, (iii) A close relationship exists between KRAS and vol. 7, no. 6, pp. 789–803, 2001. PIK3CA alterations. [7] R. Desper, F. Jiang, O.-P.Kallioniemi, H. Moch, C. H. Papadim- In summary, the oncogenetic tree models can provide a better itriou, and A. A. Schaffer,¨ “Inferring tree models for oncogenesis way to understand the process of oncogenesis. They can be from comparative genome hybridization data,” Journal of Com- applied to other somatic alterations analysis to describe a putational Biology,vol.6,no.1,pp.37–51,1999. reasonable order of biological events for the major pathways [8]T.Longerich,M.M.Mueller,K.Breuhahn,P.Schirmacher,A. of cancer. Benner, and C. Heiss, “Oncogenetic tree modeling of human hepatocarcinogenesis,” International Journal of Cancer,vol.130, no. 3, pp. 575–583, 2012. Competing Interests [9]D.Felsenstein,W.P.Carney,V.R.Iacoviello,andM.S. Hirsch, “Phenotypic properties of atypical lymphocytes in The authors declare that they have no competing interests. cytomegalovirus-induced mononucleosis,” The Journal of Infec- tious Diseases,vol.152,no.1,pp.198–203,1985. References [10] A. von Heydebreck, B. Gunawan, and L. Fuzesi,¨ “Maximum likelihood estimation of oncogenetic tree models,” Biostatistics, [1] N. Zaman, L. Li, M. Jaramillo et al., “Signaling network assess- vol. 5, no. 4, pp. 545–556, 2004. ment of mutations and copy number variations predict breast [11] A. Hollestelle, F. Elstrodt, J. H. A. Nagel, W. W. Kallemeijn, cancer subtype-specific drug targets,” Cell Reports,vol.5,no.1, and M. Schutte, “Phosphatidylinositol-3-OH kinase or RAS pp.216–223,2013. pathway mutations in human breast cancer cell lines,” Molecular [2] C. Curtis, S. P. Shah, S.-F. Chin et al., “The genomic and tran- Cancer Research,vol.5,no.2,pp.195–201,2007. scriptomic architecture of 2,000 breast tumours reveals novel [12] A.-G. A. Selim, G. El-Ayat, and C. A. Wells, “c-erbB2 oncopro- subgroups,” Nature,vol.486,no.7403,pp.346–352,2012. tein expression, gene amplification, and aneu- [3] E. R. Fearon and B. Vogelstein, “A genetic model for colorectal somy in apocrine adenosis of the breast,” Journal of Pathology, tumorigenesis,” Cell,vol.61,no.5,pp.759–767,1990. vol. 191, no. 2, pp. 138–142, 2000. BioMed Research International 5

[13]Y.Lu,X.Zi,Y.Zhao,andM.Pollak,“OverexpressionofErbB2 receptor inhibits IGF-I-induced Shc-MAPK signaling pathway in breast cancer cells,” Biochemical and Biophysical Research Communications,vol.313,no.3,pp.709–715,2004. [14] C. B. J. Vos, N. T. Ter Haar, J. L. Peterse, C. J. Cornelisse, and M. J. Van De Vijver, “Cyclin D1 gene amplification and overex- pression are present in ductal carcinoma in situ of the breast,” The Journal of Pathology,vol.187,no.3,pp.279–284,1999. [15] M. P.Scheid and V.Duronio, “Dissociation of cytokine-induced phosphorylation of Bad and activation of PKB/akt: involvement of MEK upstream of Bad phosphorylation,” Proceedings of the National Academy of Sciences of the United States of America,vol. 95,no.13,pp.7439–7444,1998. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 2596782, 9 pages http://dx.doi.org/10.1155/2016/2596782

Research Article Using Small RNA Deep Sequencing Data to Detect Human Viruses

Fang Wang,1 Yu Sun,2 Jishou Ruan,3 Rui Chen,4 Xin Chen,2 Chengjie Chen,5 Jan F. Kreuze,6 ZhangJun Fei,7 Xiao Zhu,8 and Shan Gao2

1 Department of Gynaecology, The Second Hospital, Tianjin Medical University, Tianjin 300211, China 2CollegeofLifeSciences,NankaiUniversity,Tianjin300071,China 3School of Mathematical Sciences, Nankai University, Tianjin 300071, China 4Tianjin Institute of Agricultural Quality Standard and Testing Technology, Tianjin Academy of Agricultural Sciences, Tianjin 300381, China 5College of Horticulture, South China Agricultural University, Guangzhou, Guangdong 510642, China 6International Potato Center (CIP), Apartado 1558, Lima 12, Peru 7Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, NY 14853, USA 8Guangdong Provincial Key Laboratory of Medical Molecular Diagnostics, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, Guangdong 523808, China

Correspondence should be addressed to Xiao Zhu; [email protected] and Shan Gao; gao [email protected]

Received 7 November 2015; Revised 13 January 2016; Accepted 3 February 2016

Academic Editor: Jozef Anne´

Copyright © 2016 Fang Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Small RNA sequencing (sRNA-seq) can be used to detect viruses in infected hosts without the necessity to have any prior knowledge or specialized sample preparation. The sRNA-seq method was initially used for viral detection and identification in plants and then in invertebrates and fungi. However, it is still controversial to use sRNA-seq in the detection of mammalian or human viruses. In this study, we used 931 sRNA-seq runs of data from the NCBI SRA database to detect and identify viruses in human cells or tissues, particularly from some clinical samples. Six viruses including HPV-18, HBV, HCV, HIV-1, SMRV, and EBV were detected from 36 runs of data. Four viruses were consistent with the annotations from the previous studies. HIV-1 was found in clinical samples without the HIV-positive reports, and SMRV was found in Diffuse Large B-Cell Lymphoma cells for the first time. In conclusion, these results suggest the sRNA-seq can be used to detect viruses in mammals and humans.

1. Introduction methods (e.g., ELISA, PCR, or microarrays) cannot be used in some cases due to failure to satisfy certain requirements Infection by pathogens is one of the main risk factors for (e.g., prior knowledge of the potential pathogen or the ability many diseases [1–4], particularly for cancers. In 2008, approx- to cultivate and purify the pathogen [6]). In addition, they imately two million new cancer cases (16%) worldwide were are time-consuming and difficult to use in detection of highly caused by pathogen infection. Most cancers inducing infec- divergent or novel viruses. tious agents were viruses [5], including Epstein-Barr virus To overcome these limitations, next generation sequenc- (EBV), hepatitis B and C virus (HBV and HCV,resp.), Kaposi ing (NGS) technologies have been applied for virus and sarcoma herpes virus (KSHV, also known as human herpes viroid discovery in plants and [7, 8]. Compared virus type 8, HHV-8), human immunodeficiency virus type to other NGS based methods requiring the use of viral 1 (HIV-1), human papillomavirus type 16 (HPV-16), and enrichment and concentration procedures [7], the small RNA human T-cell lymphotropic virus type 1 (HTLV-1). Therefore, sequencing (sRNA-seq) based method simplifies the virus therapidandaccuratedetectionandidentificationofthese detection, with the aid of virus fragments enriched by the viruses is essential to human health. Conventional detection RNA interference (RNAi) mechanism. RNAi is a cytoplasmic 2 BioMed Research International cell surveillance system which recognizes double-stranded information between the virus genus and the host was RNA (dsRNA) and specifically destroys single-stranded RNA from the International Committee on of Viruses and dsRNA molecules homologous to the dsRNA inducer, (ICTV). For some virus genera which did not have host using small interfering RNAs (siRNAs) as a guide [9]. The information assigned to them, we were able to assign host abundant siRNAs accumulated during the RNAi process categories after reading their NCBI annotations. facilitate virus detection and the study of RNAi mechanism. For each detected virus, we assigned a putative reference RNAi has been proposed as a key antiviral intrinsic immune genome from the NCBI GenBank database to represent the response in plants, nematodes, and [10]. Based on virus (Supplementary File 1 in Supplementary Material avail- such theory, the sRNA-seq method was originally used for able online at http://dx.doi.org/10.1155/2016/2596782). We viral detection and identification in plants [8, 11, 12] and in used the reference genome coverage and the average depth to invertebrates [13–15], but not in mammals or humans. There quantify the detected viruses. The genome coverage repre- was evidence that antiviral RNAi functions in mammalian sents the proportion of read-covered positions against the germ cells and embryonic stem cells (ESCs), as well as some genome length. The average depth is equal to the total carcinoma cell lines [10]. No evidence had been provided to base pairs of the aligned reads divided by the read-covered prove RNAi functions in mammalian somatic cells until Li positions on the reference genome (Tables 2 and 3). et al.’s work was published [16]. Although Li et al. discovered low level siRNA duplexes in the baby kidney 21 cells, 3. Results and Discussion the role of RNAi in viral defence in mammalians remains controversial. Therefore, using sRNA-seq to detect viruses in 3.1. HPV-18 from HeLa Cell Lines. To test our virus detection mammals and humans is a highly promising but hard topic. pipeline,weusedHeLacelllinedatafromtheprevious In this study, we used 931 sRNA-seq runs of data from study SRP001381 as positive controls (Table 1). The HeLa the NCBI SRA database [17] to detect and identify viruses in cell line, derived from cervical cancer cells of the patient human cells or tissues, particularly from some clinical sam- Henrietta Lacks, contains HPV-18, one of the carcinogenic ples. These tissues came from saliva, tongue, laryngopharynx, HPV . In this study, HPV-18 was detected in oropharynx, prefrontal cortex, liver, cervix, serum, plasma, all of the three runs of data (SRR031635, SRR031636, and lymph,andsoforth.Asaresult,sixvirusesincludingHPV-18, SRR031637). The assembled HPV-18 in the data SRR031636 HBV, HCV, HIV-1, SMRV (squirrel monkey retrovirus), and covered 74.1% of the reference genome M20325 with an EBV were detected from 36 runs of data. In brief, the existence average depth 8.5. The 19 long viral contigs (≥ 40 bp) covered of HPV-18, HBV, HCV, and EBV was consistent with the 62.5% of the reference genome with a uniform distribution findings from the original studies, whereas HIV-1 and SMRV (Supplementary File 1). had not been identified previously in the experimental sam- ples. The nucleotide polymorphism, read-enriched regions 3.2.HBVandHCVfromHumanLiverandHCCTissues. (hotspots), and RNAi responses of detected viruses were Chronic hepatitis B virus (HBV) is one of the first viruses to analyzed, following the detection of these viruses. be causally linked to a human tumor and is a major global cause of hepatocellular carcinoma (HCC). HBV, hepatitis C 2. Materials and Methods virus (HCV), and cirrhosis between them contribute to the genesis of almost all global HCCs [23]. Conventional clinical Using NCBI SRA advanced searching tools (http://www.ncbi tests use markers at the protein level, including the HBV .nlm.nih.gov/sra/advanced), we retrieved 2,820 runs of data surface antigen (HBsAg), HBV envelope antigen (HBeAg), by the combined keywords including Illumina, small RNA, and HBV core antigen (HBcAg) and their from and Homo sapiens (November 1, 2014). We subsequently fil- the patients’ serum. However, these protein markers are not tered these data based on the following criteria: (1) to always present for various reasons [23]. remove non-small RNA-seq data by reading the annotations; In the previous study SRP002272 from the NCBI SRA (2) to remove data containing keyword “cell line”; (3) to database (Table 2), 15 clinical samples had been sequenced remove data from cDNA library selection during library con- including three normal liver tissues, one HBV-infected liver struction. Ultimately, we used 931 runs of data from 42 pre- tissue, one severe chronic hepatitis B liver tissue, two HBV- vious studies in this study (Table 1). positive HCC tissues, one HCV-positive HCC tissue, and The software Fastq clean [18] was used for sRNA data one HCC tissue without HBV or HCV [24]. In this study, cleaning and quality control. To detect and identify viruses the detection and identification results in 15 runs of data using sRNA-seq data, we developed an automatic pipeline were consistent with the findings from the previous study using Perl scripts. This pipeline had performed well in the SRP002272 with one exception SRR039619 (Table 2). The detection and identification of plant and insect viruses in sRNA data SRR039619 from a HBV-positive HCC patient our previous studies [12, 19–21]. The pipeline integrated three should have contained HBV but it was not found by our sequence databases: The first one was an rRNA database, pipeline. SRR039619 contained 9,161,157 reads, which possibly which was built based on the SILVA ribosomal RNA gene were not deep enough to catch adequate virus derived small database[22].Thesecondonewasthehumanhostgenome RNAs (vsRNAs) for detection. forthesubtractionofhostgenomesequences.Thelastone The assembled HBV in the data SRR039620 covered contained the Vertebrata viral sequences constructed from 54.6% of the reference genome JQ688404 with an average the NCBI GenBank database, version 197. The relationship depth 6 (Supplementary File 1). In the data SRR039620, seven BioMed Research International 3

Table 1: The 42 previous studies from the SRA database.

Study ID Runs Sample source Disease DRP000998 3 Whole saliva, salivary exosome Healthy ERP001908 63 Tongue, laryngopharynx, oropharynx HNSCC ERP004592 23 Prefrontal cortex Huntington’s disease SRP001381 3 HeLa cell line HPV18(+) SRP002118 14 Hek293T cell line NA SRP002272 15 Liver HBV(+), HCV(+), HCC SRP002326 38 Cervical tumor Cervical cancer SRP002402 3 Sperm Healthy SRP007825 67 Skin Psoriasis SRP008258 2 Hek293, HeLa cell line NA SRP009246 4 Primary human fibroblast NA SRP014020 20 Thyroid tumor Follicular thyroid adenoma SRP017809 4 Dorsolateral prefrontal cortex Healthy SRP017979 4 Colorectal tumor Colorectal cancer SRP018255 35 Plasma, serum, placenta Healthy FTLD, PSP, BHS, DLB, SRP021130 20 Cerebral cortex Alzheimer’s disease SRP021193 40 Heart NIC, IC SRP021911 12 Cumulus granulosa cell, mural granulosa cell NA SRP021924 5 Brain frontal cortex NA SRP022043 70 Blood Alzheimer’s disease Sigma, liver, coecum, colon ascendens, lymph SRP022054 26 Colorectal cancer node SRP026081 2 Penicillium marneffei NA SRP026558 2 PBMC Osteopetrosis SRP026562 11 Prefrontal cortex Alzheimer’s disease SRP027589 42 Serum Breast cancer SRP028291 78 ACA,ACCtumor,adrenaltissue ACA,ACC SRP028738 16 MiRQC, serum, liver NA SRP029599 9 FFPE, serum Nonkeratinizing NPC, NPC SRP032650 4 Serum Latent PTB, PTB SRP032953 12 Alpha cell, beta cell, whole islet Type 2 diabetes mellitus SRP033505 3 Plasma Healthy Connective tissue, plasma, neuronal tissue, SRP033566 185 primary cell, cardiac muscle, epithelium, skeletal DCM, IC muscle SRP034547 4 Primary fibroblast Microcephaly SRP034586 24 Serum, PBMC Healthy SRP034590 14 Plasma NA Tensor fascia lata, quadricep vastus, vastus SRP034654 12 FSHD externe, rhomboid, iliopsoas SRP034698 8 Skin, lymph node MCC, SCC, melanoma, BCC SRP040421 12 Exosome in human semen Healthy SRP041082 2 Seminal fluid Prostate cancer DLBCL, Burkitt’s lymphoma, SRP046046 12 Lymphoblastoid EBV(+) SRP046234 2 Breast epithelium Triple negative breast cancer 4 BioMed Research International

Table 1: Continued. Study ID Runs Sample source Disease SRP048290 6 Platelet Healthy “Study ID” is uniq for each high-throughput project in the NCBI SRA database. ACA: adrenal cortical adenoma, ACC: adrenal cortical carcinoma, BCC: Cell Carcinoma, BHS: bilateral hippocampal sclerosis, DCM: Dilated Cardiomyopathy, DLB: dementia with Lewy bodies, DLBCL: Diffuse Large B- Cell Lymphoma, FSHD: Facioscapulohumeral Muscular Dystrophy, FTLD: frontotemporal lobar dementia, HCC: HBV-related hepatocellular carcinoma, HNSCC: Head and Neck Squamous Cell Carcinoma, IC: Ischemic Cardiomyopathy, MCC: Merkel Cell Carcinoma, NIC: Nonischemic Cardiomyopathy, NPC: nasopharyngeal carcinoma, PBMC: Peripheral Blood Mononuclear Cell, PSP: Progressive Supranuclear Palsy, PTB: Pulmonary Tuberculosis, and SCC: Squamous Cell Carcinoma.

Table 2: HBV and HCV from the SRP002272 study.

Run ID Sample Source Reference Cov (%) Depth SRR039611 Human Normal Liver Tissue NA NA NA SRR039612 Human Normal Liver Tissue NA NA NA SRR039613 Human Normal Liver Tissue NA NA NA SRR039614 HBV-Infected Liver Tissue JQ688405 423 (13.2) 3.0 SRR039615 Severe Chronic Hepatitis B Liver Tissue NA NA NA SRR039616 HBV(+) Distal Tissue NA NA NA SRR039617 HBV(+) Adjacent Tissue NA NA NA SRR039618 HBV(+) Side Tissue NA NA NA ∗ SRR039619 HBV(+) HCC Tissue NA NA NA SRR039620 HBV(+) Adjacent Tissue JQ688404 1756 (54.6) 6.0 SRR039621 HBV(+) HCC Tissue GQ475344 321 (10) 1.5 SRR039622 HCV(+) Adjacent Tissue D85516 1032 (10.8) 1.8 SRR039623 HCV(+) HCC Tissue GU133617 805 (8.3) 8.0 SRR039624 HBV(−)HCV(−) Adjacent Tissue NA NA NA SRR039625 HBV(−)HCV(−) HCC Tissue NA NA NA “Run ID” is uniq for each high-throughput fastq file in the NCBI SRA database. “Reference” uses the NCBI GenBank accession number. “Cov (%)” and “Depth” represent the genome coverage and the average depth, respectively. “Side Tissue” is close to the border between the tumor tissues and the normal tissues but 0–2 cm far from the tumor tissues. “Adjacent Tissue” is the normal tissues 2–5 cm far from the tumor tissues. “Distal Tissue” is the normal tissues at least 10 cm ∗ far from the tumor tissues. “SRR039619 ” should have contained HBV but it was not found by our pipeline.

Table 3: SMRV and EBV from the SRP046046 study.

Run ID Sample Source Reference Cov (%) Depth SRR1563015 DLBCL M23385 8714 (99.2) 146.1 SRR1563017 DLBCL Exosome M23385 8732 (99.4) 494.5 SRR1563018 EBV(+) BL KC207813 2765 (1.6) 29.2 SRR1563056 EBV(+) BL Exosome KC207813 33107 (19.3) 9.6 SRR1563057 EBV(−)BL NA NA NA SRR1563058 EBV(−) BL Exosome NA NA NA SRR1563059 EBV(+) LCL KC207813 13757 (8) 358.2 SRR1563060 EBV(+) LCL Exosome M80517 7444 (4) 288.8 SRR1563061 EBV(+) LCL M80517 18688 (10.2) 151.1 SRR1563062 EBV(+) LCL Exosome KC207814 7931 (4.6) 198.2 SRR1563063 EBV(+) LCL M80517 37898 (20.6) 52.8 SRR1563064 EBV(+) LCL Exosome M80517 57850 (31.4) 17.6 “Run ID” is uniq for each high-throughput fastq file in the NCBI SRA database. “Reference” uses the NCBI GenBank accession number. “Cov (%)” and “Depth” represent the genome coverage and the average depth, respectively. long viral contigs (≥ 40 bp) covered the HBV x (HBx), HBV detected in the data SRR039623 with genome coverage 8.3% core (HBc), and HBV polymerase (HBp) gene regions but did and average depth 1. not cover the HBV surface (HBs) gene region. The long viral contigs (≥ 40 bp) in the data SRR039614 and SRR039621 only 3.3. HIV-1 from Breast Cancer Patients. HIV as a member of covered the HBx gene region. The assembled HCV in the data the genus Lentivirus causes acquired immunodeficiency syn- SRR039622 covered 10.8% of the reference genome D85516 drome (AIDS). As technology evolves, HIV testing assays are with an average depth 1 (Supplementary File 1). HCV was also being improved on sensitivity and specificity [25]. However, BioMed Research International 5 the tests still provide false negative results due to the diag- the data SRR1563018, SRR1563056, SRR1563059, SRR1563060, nostic window or other reasons [25]. In the previous study SRR1563061, SRR1563062, SRR1563063, and SRR1563064. SRP027589 from the NCBI SRA database (Table 1), 42 sam- This finding confirmed the results in the previous study ples had been sequenced for the discovery and profiling of SRP046046. However, the reference genome coverage by circulating microRNAs in the serum of 42 stage II-III locally vsRNAs was uneven in eight runs of data varying from 1.6% advanced and inflammatory breast cancer (BC) patients [26]. to 31.4%. This large variance could result from sample extrac- These patients received neoadjuvant chemotherapy (NCT) tion, small RNA library construction, sequencing quality, or followed by surgical tumor resection. However, no AIDS sequencing depth. In the data SRR1563063, the assembled or HIV-positive results of these patients had been reported EBV contigs covered 20.6% of the reference genome M80517 in the previous study SRP027589. In this study, HIV-1 was (Supplementary File 1). detected at a very high level in the data SRR941591. The assembled HIV-1 in the data SRR941591 covered 39.3% of 3.5. Nucleotide Polymorphism, Hotspots, and RNAi Responses. the reference genome M19921 with an average depth 210.1 TheplantsRNA-seqdatahadbeenshowntocontainadequate (Supplementary File 1). As far as we know, this was the first information for studying nucleotide polymorphism of the time to report the detection of HIV-1 using sRNA data from actual virus [35]. Among the six human viruses found in clinical samples. this study, HIV-1 in the data SRR941591 showed the highest nucleotide polymorphism rate covering 2.66% (155/5,831) of 3.4. SMRV and EBV from B Cells and Exosomes. SMRV, an the genomic positions (Figure 1), as compared to SMRV in endogenous virus of squirrel monkeys, had been isolated by the data SRR1563017, EBV in the data SRR1563063, and HPV- cocultivation of squirrel monkey lung cells with canine cells 18 in the data SRR031636 covering only 0.41% (36/8,732), [27]. In previous studies, SMRV had been detected in Burkitt’s 0.29% (110/37,898), and 0.13% (3/2,324) of the genomic lymphoma (BL) cell lines [28]. Specifically, the insertion of positions, respectively (Supplementary File 2). HIV-1 is a the incomplete SMRV proviral genomes had been detected single-stranded RNA (ssRNA) reverse-transcribing virus. in Namalwa cell lines [29]. However, we found no reports HIV reverse transcriptase has been shown to be exceptionally that SMRV had been detected in the Diffuse Large B-Cell inaccurate [36] and may explain the high polymorphism rates Lymphoma (DLBCL). To the best of our knowledge, DLBCL observed in this study. HPV-18 and EBV are double-stranded hadonlybeenreportedtobecausedbyEBV[30],HCV[31], DNA(dsDNA)viruseswhichhavelowerrorratesduring HIV [32], and SV40 (Simian Virus 40) [33]. their replication. SMRV, as ssRNA retrovirus, was expected In this study, SMRV was detected in the data SRR1563015 to have a high nucleotide polymorphism rate but this was not and SRR1563017 (Table 3). The assembled SMRV in these reflected in these data. HBV and HCV showed no polymor- two runs of data covered 99.2% and 99.4% of the reference phism whatsoever, probably due to the low sequencing depth. genome M23385 at an average depth of 146.1 and 494.5, Consistent with our previous results in plant virus detec- respectively. In the data SRR1563017, the longest viral contig tion, the distribution of vsRNA coverage over the human was assembled to have a length of 6,760 bp and an identity virus genomes was not even, with some read-enriched 99% (6,751/6,764) of the reference sequence M23385 (Supple- regions (hotspots) in the vsRNA-covered regions on both mentary File 1). As far as we know, this was the first time to of the positive and negative strands (Supplementary File report the detection of SMRV using sRNA data from DLBCL 3). In HPV-18, HBV, HCV, and SMRV, the vsRNA-covered samples. region on the positive strand was more than nine times larger Epstein-Barr virus (EBV) has been firmly linked to than the vsRNA-covered region on the negative strand, while some cancers and proliferative diseases, including Burkitt’s HIV-1 and EBV had little difference between vsRNA-covered lymphoma (BL), nasopharyngeal carcinoma, immunoblastic regions over the positive and negative strands. Using the data lymphoma, a subset of gastric carcinomas, rare T- and NK- SRR941591 as an example (Figure 1), the number of bases cell lymphomas or leiomyosarcoma, acute infectious mon- covered by vsRNA reads on the HIV-1 positive strand against onucleosis, and Hodgkin’s disease. Almost 100% of BL cases thenegativestrandwas4,961bpto3,945bpwithoverlap in Equatorial Africa carry EBV. Children infected early in 52.74% (3,075/5,831). There were three obvious hotspots on life with the highest titres to the virus are at the the putative HIV-1 reference M19921. The first (779–810 bp) highest risk of developing the tumor [34]. EBV-positive BL and second (2,017–2,045 bp) hotspot resided on the HIV-1 predominant in Africa and EBV-negative BL predominant in positive strand. Different from the first and second hotspot, Europe and/or the United States have different causation and the third hotspot (12,006–12,044 bp) consisted of positive- characteristics [34]. and negative-strand vsRNAs. In the previous study SRP046046 from the NCBI SRA To investigate the RNAi responses using 36 virus- database, 12 samples had been sequenced to distinguish the containing runs of data, we analyzed the length distribution small RNA composition in six B cells from their exosomes. of the reads aligned to the virus reference sequences. Viral SixBcellsincludedthreeEBV-positivelymphoblastoidB small RNA read lengths of HIV-1 in the data SRR941591 cells (LCLs), one EBV-positive Burkitt’s lymphoma (BL) cell, had the distribution pattern expected from a RNAi response, oneEBV-negativeBLcell,andoneDiffuseLargeB-Cell similar to what had been found in previous studies [16]. Lymphoma (DLBCL) cell. As a result, EBV had been detected This pattern consists of positive- and negative-strand vsRNAs from two EBV-positive BL samples and six EBV-positive with countable values at the 21, 22, 23, and 24 bp read LCL samples (Table 3). In this study, EBV was detected in length (Figure 2). Another characteristic of RNAi responses 6 BioMed Research International

8000 #1 5000 #2 #3 Hotspot1 (#1) Hotspot1 (#2) 1000

4000 6000 500

5000 3000 4000 0 3

Counts Counts Counts Hotspot (# ) 2000 −500 2000 1000 −1000

0 0 Read_counts 15 21 23 24 15 21 23 24 15 21 23 24 22 22 22 32 32 32 30 31 30 31 30 31 Read_length (bp) Read_length (bp) Read_length (bp) 0 Strand direction Strand direction Strand direction Reverse Reverse Reverse Forward Forward Forward

21 bp duplexes 22 bp duplexes 23 bp duplexes 24 bp duplexes 12023-CTTGATCCGGCAAACAAACCACC (12) 12022-TCTTGATCCGGCAAACAAACC (74) 12024-TTGATCCGGCAAACAAACCACC (7) 12016-GGTAGCTCTTGATCCGGCAAACA (9) 12024-TTGATCCGGCAAACAAACCACCGC (15) 12017-GTAGCTCTTGATCCGGCAAAC (22) 12017-GTAGCTCTTGATCCGGCAAACA (10) 12015-TGGTAGCTCTTGATCCGGCAAAC (864) 12022-TCTTGATCCGGCAAACAAACCACC (444) 12013-GTTGGTAGCTCTTGATCCGGC (86) 12013-GTTGGTAGCTCTTGATCCGGCA (4) 12011-GAGTTGGTAGCTCTTGATCCGGC (3) 12015-TGGTAGCTCTTGATCCGGCAAACA (895) 12000-CTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTG-12049 12000-CTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTG-12049 12000-CTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTG-12049 12000-CTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTG-12049 12011-CTCAACCATCGAGAACTAGGC (9) 12011-CTCAACCATCGAGAACTAGGCC (5) 12009-TTCTCAACCATCGAGAACTAGGC (3) 12013-CAACCATCGAGAACTAGGCCGTTT (143) 12015-ACCATCGAGAACTAGGCCGTT (2) 12015-ACCATCGAGAACTAGGCCGTTT (9) 12013-CAACCATCGAGAACTAGGCCGTT (16) 12020-CGAGAACTAGGCCGTTTGTTTGGT (737) 12020-CGAGAACTAGGCCGTTTGTTT (237) 12022-AGAACTAGGCCGTTTGTTTGGT (24) 12014-AACCATCGAGAACTAGGCCGTTT (13) 12022-AGAACTAGGCCGTTTGTTTGGTGG (14) 12021-GAGAACTAGGCCGTTTGTTTGGT (44)

0 5000 10000 Position Figure 1: Nucleotide polymorphism, hotspots, and siRNA duplexes of HIV-1. The 𝑥-axis represents positions on the HIV-1 reference genome (GenBank: M19921). The 𝑦-axis represents the read counts from the data SRR941591 on each position. The dots in the top black box represent positions with polymorphic nucleotides. #1, #2, and #3 are the size distributions of positive- and negative-strand viral reads in hotspot 1 (779– 810 bp), hotspot 2 (2,017–2,045 bp), and hotspot 3 (12,006–12,044 bp). The read counts of 21 bp, 22 bp, 23 bp, and 24 bp siRNA duplexes are marked in parentheses.

5e + 06 10000

4e + 06 7500

3e + 06 5000 Counts

2e + 06 Counts 2500

1e + 06 0

0e + 00

15 21222324 3031323334 15 21 22 23 24 30 31 32 33 Read_length (bp) Read_length (bp)

Read counts Strand direction HIV-1×100 reads Reverse All reads Forward (a) (b)

Figure2:DistributionofthetotalandHIV-1viralreadlengthonbothofthestrands.The𝑥-axisrepresentsreadlength.The𝑦-axis represents the read counts of each length in the data SRR941591. HIV-1 ×100 reads represent 100 times of reads which can be aligned to the HIV-1 reference genome (GenBank: M19921). is that there must be positive- and negative-strand vsRNAs strands being rapidly degraded following their creation. In in some hotspots. In the data SRR941591, the third hotspot the third hotspot, we found three canonical 22 bp siRNA satisfied this criterion. The last and key step to identify duplexes containing a 20 nt perfectly base-paired duplex 󸀠 RNAi responses is to find the siRNA duplexes from hotspots. region with 2 nt 3 overhangs. We also found 21, 23, and TheyareusuallyonlyaminutefractionofthetotalvsRNAs, 24 bp siRNA-like duplexes, respectively (Figure 1). However, because the duplexes are short lived, due to one of the two we used the putative HIV-1 reference M19921 for this analysis, BioMed Research International 7

Predicted miRNA precursor sequence, 90 bp, dG =−69.8[initially − 69.3] gccgggcggccgccggtgggtcCGCTGGGCCGCTGCCCCGCTCC GGGTGGGGGGTGGCCCCGCTGGGCACCGCTGCGCCGCCGCCAGGT C U C G G C C C C C U C G U C C G G G U G G G G G G C G G G C

󳰀 G G C C U U G G G 5 G G U G G C C G C G C C G C G G C G C C G C G G G 󳰀 C C U C C C G U G G C C C G C C G G U 3 G A C A C 0 1 (a) 0 35000 70000 105000 140000 175000 EBV reference genome (M80517) 13 repeating regions

50578 50702 50827 50952 51077 51202 51327 51452 51577 51702 51827 51952 52077 13 repeating regions 52115

50578 CCGCCAGGTCCTGGGGCAGCCGGGGTTCCTGGCGCTCCGGGGGCAgccgggcggccgccggtgggtc 50664 50645 CGCTGGGCCGCTGCCCCGCTCCGGGTGGGGGGTGGCCCCGCTGGGCACCGCTGCGCCG 50702 (b)

Figure 3: The predicted miRNAs of EBV. The EBV detected in data SRR1563063 is represented using the reference genome (GenBank: M80517) in this study. The sequence of the predicted mature miRNA is represented using the lowercase letters. (a) The second structures of the miRNA were predicted using RNAfold. (b) The first repeating unit (50578-50702) contains the predicted mature miRNA (50624-50646). This mature miRNA is repeated 12 times in 13 repeated units.

without knowledge of the exact HIV-1 sequence. M19921 used RNAfold online server (http://rna.tbi.univie.ac.at/cgi- is recombinant clone pNL4-3, which includes the HIV-1 bin/RNAfold.cgi) to predict the second structures of their virus region and vector region. Since the pNL4-3 clone is pre-miRNAs (Supplementary File 4). As a result, we iden- only constructed for the experiment use, the vsRNAs on tified a new EBV pre-miRNA, with a length of 90 bp and a the pNL4-3 vector region could be contamination during very high minimal folding free energy index (MEFI) over the library construction or sequencing process. Although the 1.0 (Figure 3(a)). Furthermore, this pre-miRNA resided in third hotspot existed in the pNL4-3 vector region, rather than a repeating region on the reference M80517. This repeating the HIV-1 region on the reference M19921, the reliability of region from 50,578 to 52,077 bp had 13 units (Figure 3(b)). these siRNA duplexes in the third hotspot was supported by Each unit with a length of 125 bp contained this pre-miRNA their uniqueness in the NCBI GenBank database and high sequence. It suggested that this repeating region comprise a sequencing depth. Therefore, this RNAi response could have primary miRNA (pri-miRNA). happened in other samples. In addition to the siRNAs, vsRNAs include other small 4. Conclusions RNA reads, for example, microRNAs (miRNAs), piwi- interacting RNAs (piRNAs), or degraded mRNA fragments. In this study, we used 931 sRNA-seq runs of data from the DNA viruses produce their own miRNAs facilitating the NCBI SRA database to detect and identify human viruses. detection of DNA viruses, for example, EBV. In the data Six viruses were detected and two of them were not found SRR1563063, we found seven miRNA-like duplexes from in previous studies. These results suggest the sRNA-seq can EBV vsRNAs (Supplementary File 3). Then, we blasted these be used to detect viruses in mammals and humans. The duplexes to the miRBase database (http://www.mirbase.org/) sRNA-seq data contains the heterozygosity information that and identified three known mature virus miRNAs. They were canbeusedtoinvestigatethepathogenevolutioninone rlcv-mir-rL1-1-3p, ebv-mir-BART1-5p, and ebv-mir-BART1- person and design therapies to deal with a specific virus 3p located at positions 53,801, 151,640, and 151,676 on the population. The sRNA-seq data can also be used to find EBVreferenceM80517.ThreematuremiRNAscamefromtwo new virus miRNA or to investigate the RNAi responses in miRNA precursors (pre-miRNAs) rlcv-mir-rL1-1 and ebv- mammals and humans. However, sRNA-seq data used in mir-BART1. Compared to two other miRNAs, the ebv-mir- this study were not from virus-infection experiments with BART1-5p was expressed at a very high level of 17,987 read knowledge of the exact virus sequences. In place of the exact counts. As for the remaining four duplexes, we confirmed virus sequences, the putative virus sequences were used to theycouldnotbematchedtothehumangenome.Then,we investigate the RNAi responses. Although using the putative 8 BioMed Research International virus sequences brought some uncertainties, the results of parallel next generation sequencing,” PLoS ONE,vol.6,no.12, this study still shed light on the studies of virus induced RNAi Article ID e28879, 2011. in mammals. [8] J. F. Kreuze, A. Perez, M. Untiveros et al., “Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, Conflict of Interests discovery and sequencing of viruses,” Virology,vol.388,no.1, pp.1–7,2009. The authors declare that no financial competing interests [9] S. Mlotshwa, G. J. Pruss, and V. Vance, “Small RNAs in viral exist. infection and host defense,” Trends in Plant Science,vol.13,no. 7, pp. 375–382, 2008. Authors’ Contribution [10] B. R. Cullen, S. Cherry, and B. R. Tenoever, “Is RNA interference a physiologically relevant innate antiviral immune response in Shan Gao conceived the project. Shan Gao and Xiao Zhu mammals?” Cell Host & Microbe,vol.14,no.4,pp.374–378, supervised this study. Shan Gao wrote the main paper text. 2013. JanF.KreuzeandJishouRuanrevisedthepaper.FangWang, [11] C. Hagen, A. Frizzi, J. Kao et al., “Using small RNA sequences to Yu Sun, and Rui Chen downloaded, managed, and processed diagnose, sequence, and investigate the infectivity characteris- thedata.XinChenpreparedthefiguresandtables.Chengjie tics of vegetable-infecting viruses,” Archives of Virology,vol.156, Chen conducted programming. Zhangjun Fei gave sugges- no.7,pp.1209–1216,2011. tion to build the virus detection pipeline. Fang Wang and Yu [12]R.G.Li,S.Gao,A.G.Hernandez,W.P.Wechter,Z.J.Fei,and Sun contributed equally to this paper. K.-S. Ling, “Deep sequencing of small RNAs in tomato for virus and viroid identification and strain differentiation,” PLoS ONE, vol. 7, no. 5, Article ID e37127, 2012. Acknowledgments [13] A. Nayak, M. Tassetto, M. Kunitomi, and R. Andino, “RNA interference-mediated intrinsic antiviral immunity in inverte- The authors appreciate the help equally from the people brates,” in Intrinsic Immunity,pp.183–200,Springer,2013. listed below. They are Associate Professor Jijun Tang from [14] Q. Wu, Y. Luo, R. Lu et al., “Virus discovery by deep sequencing the Department of Computer Science & Engineering, Uni- and assembly of virus-derived small silencing RNAs,” Proceed- versity of South Carolina, Associate Professor Xiujun Gong ings of the National Academy of Sciences of the United States of from the College of Computer Science & Tech, Tianjin America, vol. 107, no. 4, pp. 1606–1611, 2010. University, and Professor Wenjun Bu from the College of [15] V. Gausson and M.-C. Saleh, “Viral small RNA cloning and Life Sciences, Nankai University. The data analysis in this sequencing,” Methods in Molecular Biology,vol.721,pp.107–122, study was supported by the National Scientific Data Sharing 2011. Platform for Population and Health Translational Cancer [16]Y.Li,J.Lu,Y.Han,X.Fan,andS.-W.Ding,“RNAinterference Medicine Specials. This work was supported partly by the functions as an antiviral immunity mechanism in mammals,” National Natural Science Foundation of China (81541153), Science,vol.342,no.6155,pp.231–234,2013. Guangdong Provincial Research Project of Science and [17] R. Leinonen, H. Sugawara, and M. Shumway, “The sequence Technology (2015A050502048 and 2014A020212295), and read archive,” Nucleic Acids Research,vol.39,no.1,pp.D19–D21, Science and Technology Research Project in Dongguan City 2011. (2013508152011 and 2013508152002). [18] M. Zhang, H. Sun, Z. Fei, F. Zhan, X. Gong, and S. Gao, “Fastq clean: an optimized pipeline to clean the illumina sequencing data with quality control,” in Proceedings of the IEEE Interna- References tional Conference on Bioinformatics and Biomedicine (BIBM ’14), pp. 44–48, IEEE, Belfast, Northern Ireland, November 2014. [1] A. A. Ansari, “Clinical features and pathobiology of Ebolavirus [19]R.G.Li,S.Gao,Z.J.Fei,andK.S.Ling,“Completegenome infection,” Journal of Autoimmunity,vol.55,pp.1–9,2014. sequence of a new tobamovirus naturally infecting tomatoes [2] K. Neyt and B. N. Lambrecht, “The role of lung dendritic in Mexico,” Genome Announcements,vol.1,no.5,ArticleID cell subsets in immunity to respiratory viruses,” Immunological e00794-13, 2013. Reviews,vol.255,no.1,pp.57–67,2013. [20]C.Padmanabhan,S.Gao,R.Li,S.Zhang,Z.Fei,andK.-S. [3] G.F.Wang,W.Li,andK.Li,“Acuteencephalopathyandenceph- Ling, “Complete genome sequence of an emerging of alitis caused by influenza virus infection,” Current Opinion in tobacco streak virus in the united states,” Genome Announce- Neurology,vol.23,no.3,pp.305–311,2010. ments, vol. 2, no. 6, Article ID e01138-14, 2014. [4] A. A. Lackner, M. Mohan, and R. S. Veazey, “The gastrointesti- [21] R.G.Li,S.Gao,S.Berendsen,Z.J.Fei,andK.S.Ling,“Complete nal tract and aids pathogenesis,” Gastroenterology,vol.136,no. genome sequence of a novel genotype of squash mosaic virus 6, pp. 1966–1978, 2009. infecting Squash in Spain,” Genome Announcements,vol.3,no. [5] C. De Martel, J. Ferlay, S. Franceschi et al., “Global burden of 1, Article ID e01583-14, 2015. cancersattributabletoinfectionsin2008:areviewandsynthetic [22] C. Quast, E. Pruesse, P. Yilmaz et al., “The SILVA ribosomal analysis,” The Lancet Oncology,vol.13,no.6,pp.607–615,2012. RNAgenedatabaseproject:Improveddataprocessingandweb- [6] O. Isakov, S. Modai, and N. Shomron, “Pathogen detection based tools,” Nucleic Acids Research,vol.41,no.1,pp.D590– using short-RNA deep sequencing subtraction and assembly,” D596, 2013. Bioinformatics,vol.27,no.15,pp.2027–2030,2011. [23] P.Arbuthnot and M. Kew, “Hepatitis B virus and hepatocellular [7] G. M. Daly, N. Bexfield, J. Heaney et al., “A viral discovery carcinoma,” International Journal of Experimental Pathology, methodology for clinical biopsy samples utilising massively vol.82,no.2,pp.77–100,2001. BioMed Research International 9

[24] J. Hou, L. Lin, W. Zhou et al., “Identification of miRNomes in human liver and hepatocellular carcinoma reveals miR-199a/b- 3p as therapeutic target for hepatocellular carcinoma,” Cancer Cell,vol.19,no.2,pp.232–243,2011. [25] S. Butto,` B. Suligoi, E. Fanales-Belasio, and M. Raimondo, “Laboratory diagnostics for HIV infection,” Annali dell’Istituto Superiore di Sanita`,vol.46,no.1,pp.24–33,2010. [26] X.Wu,G.Somlo,Y.Yuetal.,“Denovosequencingofcirculating miRNAs identifies novel markers predicting clinical outcome of locally advanced breast cancer,” Journal of Translational Medicine,vol.10,no.1,article42,2012. [27] D. Colcher, R. L. Heberling, S. S. Kalter, and J. Schlom, “Squirrel monkey retrovirus: an endogenous virus of a new world primate,” Journal of Virology, vol. 23, no. 2, pp. 294–301, 1977. [28] C. C. Uphoff, S. A. Denkmann, K. G. Steube, and H. G. Drexler, “Detection of EBV, HBV, HCV, HIV-1, HTLV-I and- II, and SMRV in human and other primate cell lines,” Journal of Biomedicine and Biotechnology,vol.2010,ArticleID904767, 2010. [29]P.G.Middleton,S.Miller,J.A.Ross,C.M.Steel,andK.Guy, “Insertion of SMRV-H viral DNA at the c-myc gene locus of a BL cell line and presence in established cell lines,” International Journal of Cancer,vol.52,no.3,pp.451–454,1992. [30] C. Y. Ok, T. G. Papathomas, L. J. Medeiros, and K. H. Young, “EBV-positive diffuse large B-cell lymphoma of the elderly,” Blood,vol.122,no.3,pp.328–340,2013. [31] L. Arcaini, D. Rossi, M. Lucioni et al., “The NOTCH pathway is recurrently mutated in diffuse large B-cell lymphoma associated with hepatitis C virus infection,” Haematologica,vol.100,no.2, pp.246–252,2015. [32]K.Liapis,A.Clear,A.Owenetal.,“Themicroenvironmentof AIDS-related diffuse large B-cell lymphoma provides insight into the pathophysiology and indicates possible therapeutic strategies,” Blood,vol.122,no.3,pp.424–433,2013. [33] S.-I. Nakatsuka, A. Liu, Z. Dong et al., “Simian virus 40 sequences in malignant lymphomas in Japan,” Cancer Research, vol.63,no.22,pp.7606–7608,2003. [34] D. A. Thorley-Lawson and M. J. Allday, “The curious case of the tumour virus: 50 years of Burkitt’s lymphoma,” Nature Reviews Microbiology,vol.6,no.12,pp.913–924,2008. [35] D. Kutnjak, M. Rupar, I. Gutierrez-Aguirre, T. Curk, J. F.Kreuze, and M. Ravnikar, “Deep sequencing of virus-derived small interfering RNAs and RNA from viral particles shows highly similar mutational landscapes of a plant virus population,” Journal of Virology,vol.89,no.9,pp.4760–4769,2015. [36] J. D. Roberts, K. Bebenek, and T. A. Kunkel, “The accuracy of reverse transcriptase from HIV-1,” Science,vol.242,no.4882, pp. 1171–1173, 1988. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 6598307, 11 pages http://dx.doi.org/10.1155/2016/6598307

Research Article Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification

Yin Wang,1,2 Rudong Li,3 Yuhua Zhou,4 Zongxin Ling,5 Xiaokui Guo,4 Lu Xie,2 and Lei Liu1,2

1 Shanghai Public Health Clinical Center and Institutes of Biomedical Sciences, Fudan University, Shanghai 200032, China 2Shanghai Center for Bioinformation Technology, Shanghai 201203, China 3Key Lab of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China 4Department of Medical Microbiology and Parasitology, Institutes of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200240, China 5Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310003, China

Correspondence should be addressed to Lei Liu; [email protected]

Received 28 October 2015; Accepted 12 January 2016

Academic Editor: Zhenguo Zhang

Copyright © 2016 Yin Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background. Text data of 16S rRNA are informative for classifications of -associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Results. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. Conclusions. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data.

1. Introduction Database Project (RDP) website would facilitate investiga- tions of key microorganisms associated with certain host The microbial ecology in human determines or promotes diseases [7]. necessary bioprocesses in human bodies, and compositions On the other hand, traditional (supervised) methods such of microbial communities can be reflections for the health as feature selection are frequently adopted in classifications of conditions of the hosts [1]. In fact, the complex microbial microbiota-associated disease samples, for example, selecting communities play key roles in human health from time to the microorganism(s) which can maximally discriminate dis- time. For example, dysfunction of microbiota eased and healthy hosts [6]. Nonetheless, substantial amounts or infection of pathogenic microbiota would lead to a series of of sequence data actually embody the characteristics of entire human diseases, like pneumonia [2], dentes cariosus [3], and microbial communities rather than individual microbes [8]. so on [4, 5]. Fortunately, sequencing of 16S rRNA provides Hence, the mapping results of 16S rRNA segments to indi- informative knowledge for the distributions of microbiota vidual microbes based on the sequencing data would not [6]. For instance, microbiota taxonomy analysis based on be informative enough for the aftermath feature selection. the sequence data by bioinformatic tools such as Ribosomal Furthermore, sequences that cannot be mapped to known 2 BioMed Research International microbes might also have certain importance. Therefore, Table 1: Alphabet of generalized letters. algorithms focusing on the textual features of microbiota sequencesthemselves(e.g.,𝑘-mer/𝑘-tun features) have been Letter Members Antisense complementary letter applauded by recent researchers, as they skip the sequence- microbe mapping and hence avoid the intrinsic drawbacks RAG Y [9, 10]. YCT R However, abundantly many features can be defined WAT W regarding raw text data (i.e., strings); in other words, the MAC K dimension of feature space would usually be extremely high; KGT M thus the “curse of dimensionality” resulted [11]. In this regard, SCG motif-oriented algorithms are capable of accelerating the fea- S ture selection pipeline, as generalizing or lumping the textual HACT D features of a lot of strings into certain motifs is equivalent to BCGT V degenerating the feature space (i.e., dimension reduction) [12, VACG B 13]. Nonetheless, extracting the motifs put forward another DCGT H issue,intuitivelybecausemotifscanbedefinedinvarious NACGT N different ways and there is no universal solution. Therefore, systematic approaches for motif extraction/definition are necessary. For this purpose, here we present an improved text min- normal were set as the test data, so that classifications were ing method named Phylogenetic Tree-Based Motif Finding evaluated. For the 𝑘-mer counting profiles of 16S rRNAs fasta algorithm (PMF). In this method, relevance between text file collected from dental plaques samples, the training data strings is considered, which are defined by the phylogeny contained 23 dental decay patients and 20 normal samples of the strings. By statistically associating the motif counts and the test data contained 9 dental decay patients and computed via PMF with disease statuses, efficient classifi- 8 normal samples. For the 𝑘-mer counting of 16S rRNAs cation of disease samples based on (microbiota) sequence collected from saliva samples, the training data contained texts could be achieved. We have simulated the 16S rRNA 23 dental decay patient samples and 19 normal samples; datasets of pneumonia and dentes cariosus patients with this and the test data contained 10 dental decay patient samples pipeline, respectively. Compared to previous results [14], our and 8 normal samples. The partition of the training and new pipeline shows better classifications. Additionally, the test datasets, as well as the cross validation of training data pipeline is suitable for issues with high-dimensional feature themselves, was adopted from the previous study; hence spaces. impartial comparisons (with previous results [14]) could be performed. 2. Data and Methods 2.2. Phylogenetic Tree-Based Motif Finding (PMF) Method. 2.1. Data and Preprocessing. We acquired 16S rRNA sequenc- The improved text mining method, Phylogenetic Tree-Based ing fasta files of pneumonia patients and dental decay patients Motif Finding algorithm (PMF), handled counting results of from Zhou et al. [2] and Ling et al. [3], respectively. Two each person’s 16S rRNA fasta file. PMF algorithm consisted to six length 𝑘-mer counting results in each meta-genomic of three parts: motif finding, motif sorting, and model eval- sequence were calculated [15]. The 𝑘-mer counting results uation. Motif finding was the main part of the algorithm, in were shown in Files S1 and S2 in Supplementary Material which key step was constructing a clustering tree to combine available online at http://dx.doi.org/10.1155/2016/6598307. the original strings to a new motif, that is, transforming Each counting and its antisense complementary result were original letters (“A,” “T,” “C,” and “G”) into the generalized summarized and combined together. The 𝑘-mer frequencies letters (“Y,” “R,” “W,” “K,” “M,” “S,” “D,” “V,” “B,” “H,” were normalized by the reciprocal of length of each sequence and “N”). The rules of the generalized letters were shown in as weight and divided by the number of sequences in each Table 1. fasta file. Identified microbes of 16S rRNAs’ sequences from Minimum distance method was used to cluster the phylo- Zhou et al. [2], which were downloaded from NCBI website genetictree.Foreachpairofsequenceswiththesamelength, (ID: GU737566 to GU737625 and HQ914698 to HQ914775) the phylogenetic distance was calculated by summarizing dif- (http://www.ncbi.nlm.nih.gov), were used for constructing ferences of all sites. For the generalized letters, the differences the phylogenetic trees. After removing redundant sequences, were calculated using the number of intersections divided a total of 90 microbe species were used for further analysis. by the number of unions. The phylogenetic distance of two The pneumonia samples included 101 patients with motifs was estimated by summarizing differences of both hospital-acquired pneumonia (HAP), 43 patients with original and generalized sites. If the phylogenetic distance community-acquired pneumonia (CAP), and 42 normal of antisense complementary sequence was smaller than the personsascontrol.71HAPcases,32CAPcases,and30cases original sequence, the instance of its antisense complemen- of normal samples were allocated as training data; fitness tary sequence was selected. To calculate the complementary was calculated using 5-fold proportional cross validation. generalized letters, each member of the generalized letters The other 30 cases of HAP, 13 cases of CAP, and 12 cases of was calculated, (i.e., “A” versus “T” and “C” versus “G”), BioMed Research International 3 and results were summarized by rules of Table 1. If there was necessary motifs. Other than training errors calculated by (5- more than one pair of sequences with the minimum distance, fold proportional) cross validation, number of dimensions the phylogenetic distances were sorted using Kruskal-Wallis was also considered. Therefore, at most the first 𝑝 (𝑝= statistics in descending order as follows: Sqrt{𝑁}) models with minimum training errors were set as candidate models to be evaluated and combined with dimen- 𝑛 ∑ (1 − 𝑝original) sions (in descending order). The number of dimensions was =1−𝑝new − 𝑖=1 𝑖 , (1) KW 𝑛 considered as the logarithm penalty [17], together with the training errors, so the minimum value model was selected as where 𝑝new was the Kruskal-Wallis test 𝑝 value of new follows: original motif profile, 𝑝𝑖 was the Kruskal-Wallis test 𝑝 value of 𝑖th original sequence covered by the new motif, and 𝑛 𝑖←󳨀arg min {training error + log (dimension) was the number of original sequences or their antisense (5) complementary sequences covered by the new motif. The (𝑁∗𝑘) ∗ log }, generalized motif (and its antisense complementary motif) (𝑁∗𝑘) wascomposedofagroupoforiginalsequences;theoriginal sequences with profiling were defined as covered by the new 𝑁=min {number training data,𝑏} , (6) motifs. Profile of the new motif was calculated as follows: min (dimension) 𝑘= , (7) profiling motif = profile (:, covered) ∗ LDA weights, (2) max (dimension) where “profile(:, covered)”weretheprofilesoforiginal where 𝑏 was the number of candidate models with lower sequences covered by the new motif (i.e., rows were samples dimensions than the model minimum training error. and columns were the covered original sequences), and The pipeline of the proposed algorithm is described in LDA weights were the linear combination weights calculated detail below and its flowchart was shown in Figure 1. by Linear Discriminant Analysis (LDA) [16] method with maximum Fisher’s Rayleigh quotient (shown in “Linear Step 1 (initialization). Delete features (sequences) with zeros Discriminant Analysis” part in the other method). Therefore variance profile. Set counters of 2 to 6 length sequences to profiles of the covered sequences were replaced by the profile zero. of new motif. With the clustering rule, the original sequences could be Step 2. Enter each loop from Step 3 to Step 5 until the counter 𝑛 −1 transformed into the generalized motifs. To suit for high- reaching 𝑘 forlengthofsequencesisfrom2to6, 𝑛 = 2, 3, 4, 5, 6 dimension characteristics of text data, first 𝑚th nonredun- respectively (i.e., ). dant pairs of sequences were combined to new motifs, where 𝑘𝑛/2 𝑚=square root (Sqrt in short) of the total number Step 3. Select and sort first pairs of sequences with min- of sequences with the same length. The batch computing imum phylogenetic distance combined with profile Kruskal- method could accelerate motif finding part of PMF and avoid Wallis statistics in descending order as (1). overfitting the training data [17]. Therefore the generalized Step 4 (remove redundancy of the selected pairs). For each motifs were found iteratively until all sites changed into “N.” current sorted pairs, delete any later selected pairs having After finding motifs with the same length, different length intersection with the current one. Finally select 𝑚=Sqrt(𝑘𝑛) (e.g., from two to six) motifs needed to be sorted using motif pairs at most. sorting part of PMF by integrating Kruskal-Wallis 𝑝 value [18] and specificity in descending order. For each motif, the Step 5. Merge the final selected pairs of sequences into the specificity was calculated as follows: generalized sequences/motifs. Combine profiles of original 𝐾 sequences covered by the generalized motif using Linear Dis- ∑𝑖=1 (1 − 𝑥𝑖/4) specificity = , (3) criminant Analysis (LDA) method with maximum Fisher’s 𝐾 Rayleigh quotient value. Counter ← Counter +𝑚. where 𝐾 was the length of each sequence and 𝑥𝑖 was the Step 6. SortthemotifsbythespecificityandKruskal-Wallis number of members of 𝑖th site’s (generalized) letter. Therefore 𝑝 value in descending order using (4). each motif could be sorted in descending order as follows:

1−𝑝value + specificity. (4) Step 7. Evaluate models by (5). Treat original profiles by selected suitable motifs using (2). With suitable motifs, original profiles could be merged into profiles of motifs by (2), and covered profiles could be deleted 2.3. Other Methods so dimensions reduction would be performed. The more original sequences were replaced by generalized motifs; the 2.3.1. Kruskal-Wallis Test. Kruskal-Wallis [18] test is a non- linear bias was getting greater, but variance was getting lower, parametric method for testing whether samples originate and vice versa. To compromise between the bias and variance from the same distribution. The test assumes that all samples criteria, model evaluation part was performed to select from the same group have the same continuous distribution, 4 BioMed Research International

Gain is also known as mutual information. 𝑘-mer counting Begin values were discretized using two thresholds’ mean ± std. If more than one sequence was with the same Information Gain 0 Initialize the counters, delete the variance features value, they were sorted by Kruskal-Wallis 𝑝 value.

2.5. Chi-Square Statistic. This method uses the Chi-square For each k length sequences: enter the loops: statistic to discretize numeric attributes and achieves feature selection via discretization [20]. The Chi-square value is

Compute and select first nk/2 pairwise relationship of the defined as 2 remaining features by both phylogenetic distance and 𝑐 𝑘 (𝐴 −𝐸 ) 2 𝑖𝑗 𝑖𝑗 Kruskal-Wallis statistics 𝜒 = ∑ ∑ , 𝑖=1 𝑗=1 𝐸𝑖𝑗 (9) 𝑀𝑖𝐵𝑗 Find first mth nonredundant motifs,m=Sqrt(nk) 𝐸 = , 𝑖𝑗 𝑁

where 𝑐 is the number of intervals, 𝑘 is the number of classes, Treat profiles by selected motifs using LDA method. 𝐴𝑖𝑗 is the number of samples in the 𝑖th interval and the 𝑗th counter ← counter + 1 class, 𝑀𝑖 isthenumberofsamplesinthe𝑖th interval, 𝐵𝑗 is the number of samples in the 𝑗th class, and 𝑁 is the total number of samples. 𝑘-mer counting values were discretized using two ← + Normalization weights; the counter counter 1 thresholds’ mean ± std. If more than one sequence was with thesameChi-squarestatistic,theyweresortedbyKruskal- Wallis 𝑝 value. Yes If the counter equals nk −1 2.6. Linear Discriminant Analysis. Linear Discriminant No Analysis (LDA) is a typical variable transformation method Exit the loops to reduce dimensions [16]. The key step of LDA is to maximize the Rayleigh quotient: 𝛼𝑇𝑆 𝛼 𝐽 (𝑊) = 𝐵 , (10) 𝛼𝑇𝑆 𝛼 Sort all motifs by (1 − Kruskal-Wallis p value + specificity) in 𝑊 descending order where the “between-class scatter matrix” is defined as 󸀠 (𝑚𝑘 −𝑚)(𝑚𝑘 −𝑚) Model evaluation 𝑆 = ∑ (𝑝 −1)∑ 𝐵 𝑘 (𝐾−1) (11) 𝑘

End and the “within-class scatter matrix” is defined as 𝐾 (𝑦 − 𝑚 )(𝑦−𝑚 )󸀠 Figure 1: PMF algorithm flowchart. 𝑆 = ∑ 𝑘 𝑘 . 𝑊 (𝑁−𝐾) (12) 𝑘 𝐾 𝑝 and they are mutually independent. In this study, Kruskal- isthenumberofclasses, 𝑘 isthenumberofthesamples 𝑘 𝑚 Wallis 𝑝 value was used to rank features. within the th class, 𝑘 isthemeanvalueofthesamplewithin the 𝑘th class, and 𝑚 isthemeanvalueofallthesamples. 2.4. Information Gain Method. Information Gain [19] mea- Traditional LDA requires the total scatter matrix to be sures the classification ability of each feature with respect nonsingular. To deal with the singularity problems, classical to the relevance with the output class, which is defined as LDA method was modified in a way that a unit diagonal Information Gain = 𝐻(𝑆) − 𝐻(𝑆 |𝑥): matrix with small weights was added to the within-class scatter matrix, if the scatter matrix is singular [14]. 𝐻 (𝑆) =−∑ 𝑝 (𝑠) log2 (𝑝 (𝑠)), 𝑠∈𝑆 (8) 3. Results and Discussion 𝐻 (𝑆|𝑥) =−∑ 𝑝 (𝑥) ∑ 𝑝 (𝑠|𝑥) log2 (𝑝 (𝑠|𝑥)), 𝑥∈𝑋 𝑠∈𝑆 We first performed the PMF algorithm on pneumonia sam- ples. We considered both 2-class problem (pneumonia: CAP where 𝑆 and 𝑥 are features. When measuring the mutual rela- + HAP, versus normal) and 3-class problem (HAP, CAP, tion between the extracted features and the class, Information versus normal). Conventionally, due to the data imbalance, BioMed Research International 5

Table 2: Classification results of pneumonia data in 3-class problem.

Error rate Method Dimension Feature On training data On test data SVM/FMS 0.1895 0.2637 29 Microbes SVM/PMF 0.062 0.0756 411 Sequences SVM/Kruskal-Wallis 0.1187 0.5273 272 Sequences SVM/Information Gain 0.143 0.2124 12 Sequences SVM/Chi-square statistic 0.1743 0.5909 1280 Sequences SVM 0.2187 0.3812 4390 Sequences NNA/FMS 0.2013 0.3406 112 Microbes NNA/PMF 0.2152 0.2081 786 Sequences NNA/Kruskal-Wallis 0.2718 0.3363 85 Sequences NNA/Information Gain 0.2354 0.4141 39 Sequences NNA/Chi-square statistic 0.2649 0.3107 69 Sequences NNA 0.442 0.6162 4390 Sequences

the accuracy for each class was used to measure the classi- SVM for the 𝑘-mer counting profiles of pneumonia samples fication, which was equivalent to combining the specificity were used for further analysis. and sensitivity in general classifications. Two widely applied Heat map is a frequently used matrix of pairwise sample methods, nearest neighbor algorithm (NNA) and support correlations indicating anticorrelation or correlation using vector machine (SVM), were used to select the optimal a color scale, that is, green to red. Figure 4(a) showed that classifier set of motifs extracted by PMF for pneumonia theoriginaldataprofilewasalmostinvisibleforpatternsor samples. Since SVM mainly suits pairwise classifications, sample classifications, after being analyzed by our method, normal samples must be discriminated against the pneumo- since the original feature space had been reduced to a much nia samples (CAP and HAP) before CAP and HAP were smaller space spanned by a few features (with the most classified in a 3-class problem. To evaluate the performance important variances retained). Therefore as shown in Figures of our (𝑘-mer) motif-oriented method, we compared our 4(b) and 4(c), the heat maps of the samples were much clearer results with those of previous methods, including the Feature with high resolutions for classifications. Merging and Selection algorithm (FMS) based on sequence- Profiles with reduced dimensions obtained by PMF microbe associations [14], as well as other 𝑘-mer counting were sorted according to the Kruskal-Wallis 𝑝 values. The feature selection algorithms, for example, the Information top 5 motifs with 𝑝 value < 0.05 in 3-class problem Gain method [19], Chi-square statistic method [20], and were “KCTCWT,” “TTCGHT,” “CGATCS,” “TCWCTA,” primitive Kruskal-Wallis statistic method [18]. and “TTWCGC”. Sequences (including antisense comple- Figure 2 showed the learning curves for the training mentary sequences) covered by the first motif “KCTCWT” data; combined with logarithm penalty evaluation, the best (𝑝 value = 0.0126) were matched to the microbe taxonomic evaluated models were selected with 1321 and 1369 runs of resultsofZhouetal.[2].6outof19matchedmicrobes PMF for the 3-class problem, with SVM and NNA classifiers, were among the top 20 genera suspiciously contributing respectively. Optimal models for the 2-class problem were to pneumonia [2] (Table S1). 10 out of the remaining 13 selected with 1218 and 1136 runs of PMF (with SVM and microbes were also related to pneumonia [21–30] (Table NNA). As shown in Tables 2 and 3, our method had the S1). The 2 motifs with 𝑝 value < 0.05 in 2-class problem lowest mean error in both 3-class and 2-class problem (with were “WTCGTC” and “ATCWCT”. Sequences covered by either SVM or NNA combined), compared with the previous first one “WTCGTC”𝑝 ( value = 0.0288) were matched to methods mentioned earlier. the published taxonomic data [2]. Five out of 6 matched In statistics, a receiver operating characteristic (ROC) microbes were related to pneumonia [2, 29, 30] (Table S2). curve is the summary of both sensitivity and specificity for Furthermore, by pinpointing the 25 (e.g., 19 + 6)matched various thresholds. ROC was constructed for each subset microbes in a phylogenetic tree constructed from the pub- of features (Figure 3). As shown, the optimal features that lishedtaxonomicdata(usingMEGA6software[31]with areselectedunderthecombinedcriteriaofcrossvalidation minimum distance method), we observed that distribution of and model evaluation possessed high specificity∼ ( 80%) with the microbes was dispersed, indicating the diverse functions high sensitivity (∼70%) for the 3-class problem, indicating performed by microbiota for human (Figure 5). the ability of our method. Moreover, even higher specificity Our method was also tested on 𝑘-mer counting profiles and sensitivity were obtained (>0.95) for the 2-class problem. from dental decay sample. These samples were collected Noteworthy, PMF combined with SVM performed better in from saliva and dental plaques separately. Combined with the classification; therefore the results derived by PMF with logarithm penalty evaluation, the best evaluated models were 6 BioMed Research International

0.2 0.2 0.18 0.18 0.16 0.16 0.14 0.14 0.12 0.12 0.1 0.1

Error rate Error 0.08 rate Error 0.08 0.06 0.06 0.04 0.04 0.02 0.02 0 0 0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400 Running times Running times (a) (b) 0.4 0.4

0.35 0.35 0.3

0.3 0.25

0.2 Error rate Error 0.25 rate Error

0.15 0.2 0.1

0.05 0 500 1000 1500 0 500 1000 1500 Running times Running times (c) (d)

Figure 2: Learning curves of PMF algorithm for 3-class problem with NNA (a), for 3-class problem with SVM (b), for 2-class problem with NNA (c), and for 2-class problem with SVM (d).

Table 3: Classification results of pneumonia data in 2-class problem.

Error rate Method Dimension Feature On training data On test data svm/FMS 0.0922 0.1279 42 Microbes svm/PMF 0 0 551 Sequences svm/Kruskal-Wallis 0.01 0 28 Sequences svm/Information Gain 0 0.0116 26 Sequences svm/Chi-square statistic 0.01 0.0417 127 Sequences svm 0.0667 0.0116 4390 Sequences NNA/FMS 0.1279 0.2393 20 Microbes NNA/PMF 0 0.0833 361 Sequences NNA/Kruskal-Wallis 0.0167 0.125 13 Sequences NNA/Information Gain 0.0214 0.125 12 Sequences NNA/Chi-square statistic 0.0381 0.125 26 Sequences NNA 0.2667 0.5 4390 Sequences BioMed Research International 7

1 1

0.8 0.8

0.6 0.6

Sensitivity 0.4 Sensitivity 0.4

0.2 0.2

0 0 0 0.5 1 0 0.5 1 1−specificity 1−specificity (a)auc=0.7943 (b)auc=0.9652 1 1

0.8 0.8

0.6 0.6

Sensitivity 0.4 Sensitivity 0.4

0.2 0.2

0 0 0 0.5 1 0 0.5 1 1−specificity 1−specificity (c) auc = 0.6347 (d)auc=0.8810 1 1

0.8 0.8

0.6 0.6 Sensitivity 0.4 Sensitivity 0.4

0.2 0.2

0 0 0 0.5 1 0 0.5 1 1−specificity 1−specificity (e)auc=0.6139 (f) auc = 0.7648

Figure 3: Continued. 8 BioMed Research International

1 1

0.8 0.8

0.6 0.6

Sensitivity 0.4 Sensitivity 0.4

0.2 0.2

0 0 0 0.5 1 0 0.5 1 1−specificity 1−specificity (g) auc = 0.7219 (h)auc=0.7520

Figure 3: ROC curves of motif finding step of PMF algorithm for pneumonia samples (CAP + HAP) in 2-class problem (a, b), normal samples (c, d), CAP samples in 3-class problem (e, f), and HAP samples in 3-class problem (g, h), with NNA or SVM.

Normal CAP HAP Normal CAP HAP

(a) (b)

Normal CAP + HAP

(c)

Figure 4: Heat map of 𝑘-mer counting profiles of original pneumonia data for 3-class problem (a), data after treating by PMF for 3-class problem, (b) and data after treating by PMF for 2-class problem (c). Rows are retained motifs and columns are disease classes. From leftto right are 30 normal, 32 CAP, and 71 HAP samples for 3-class problem and 30 normal 103 pneumonia samples for 2-class problem. BioMed Research International 9

GU737614 HQ914733 GU737613 GU737615 HQ914768 GU737618 GU737619 HQ914754 GU737620 HQ914713 GU737621 GU737616 GU737617 GU737611 HQ914743 GU737612 HQ914744 HQ914740 HQ914774 GU737622 HQ914728 HQ914770 HQ914699 HQ914708 HQ914772 HQ914750 HQ914736 HQ914773 HQ914745 HQ914766 HQ914760 HQ914759 HQ914769 HQ914734 GU737624 GU737625 GU737623 HQ914701 GU737585 GU737584 HQ914746 HQ914705 HQ914709 GU737573 HQ914749 HQ914721 GU737583 GU737582 HQ914752 HQ914737 GU737581 HQ914715 HQ914707 HQ914735 HQ914757 HQ914765 HQ914722 HQ914767 HQ914706 HQ914764 HQ914720 HQ914730 HQ914704 HQ914731 HQ914762 HQ914725 HQ914710 HQ914729 HQ914748 HQ914775 HQ914755 HQ914711 HQ914712 HQ914716 HQ914751 HQ914727 HQ914739 HQ914738 GU737579 HQ914761 HQ914700 HQ914714 HQ914771 GU737578 HQ914703 GU737576 GU737577 HQ914741 GU737580 HQ914719 GU737566 GU737567 GU737568 GU737569 GU737575 GU737574 GU737570 HQ914698 GU737571 GU737572 HQ914726 HQ914724 HQ914747 GU737610 HQ914758 HQ914763 GU737586 GU737587 GU737588 HQ914702 GU737608 GU737589 GU737609 HQ914732 GU737606 HQ914717 GU737607 GU737605 HQ914723 GU737603 GU737604 HQ914742 HQ914756 GU737600 GU737601 GU737602 GU737596 GU737598 GU737599 GU737597 HQ914753 GU737591 GU737590 GU737592 GU737593 GU737595 GU737594 HQ914718 0.02 Figure 5: Phylogenetic relationship of identified microbiota signatures. Identified microbes matched by significant motifs are highlighted with underline (“KCTCWT”) or strikethrough (“WTCGTC”). 10 BioMed Research International selected with 2330 and 2485 runs of PMF for dental plaques andapprovedthepaper.YinWangandRudongLiequally (with SVM) and saliva samples (with NNA), respectively contributed to this paper. (Figure S1). The results showed that our method could also select suitable classifiers and perform better on the test data Acknowledgments than the previous and other methods (Tables S3 and S4). This work was supported by the National High Tech- 4. Conclusions nology Research and Development Program (863 Pro- gram) (2012AA02A602, 2015AA020104), National Science In this paper, we presented the PMF method to analyze and Technology Major Project (2012ZX09303013-015), and the align-free 𝑘-mer counting profiles of 16S rRNA micro- special funds for scientific research in the health industry bial data. The improved pipeline systematically analyzed (201302010). relevance between each pair of sequences using minimum distance phylogenetic trees. Moreover, PMF also considered References relationships between 𝑘-mer counting profiles and the disease status. In addition, by combining original profiles using the [1] H. Sokol, B. Pigneur, L. Watterlot et al., “Faecalibacterium LDA method, PMF learned profiles of text strings suitable prausnitzii is an anti-inflammatory commensal bacterium iden- for disease classification. Batching method also accelerated tified by gut microbiota analysis of Crohn disease patients,” PMF and avoided overfitting of training data. As a result, Proceedings of the National Academy of Sciences of the United via combing characteristics of sequences and classification States,vol.105,no.43,pp.16731–16736,2008. statistics of text profiles, PMF selected suitable motifs to [2] Y.Zhou, P.Lin, Q. Li et al., “Analysis of the microbiota of sputum evaluate metagenome characteristics of human microbiota samples from patients with lower respiratory tract infections,” disease. Acta Biochimica et Biophysica Sinica,vol.42,no.10,pp.754–761, In conclusion, we developed an improved motif-based 2010. text mining algorithm, and the new pipeline was verified [3] Z. Ling, J. Kong, P. Jia et al., “Analysis of oral microbiota by both pneumonia and dentes cariosus samples. As the in children with dental caries by PCR-DGGE and barcoded classification results have shown, it was demonstrated that pyrosequencing,” Microbial Ecology, vol. 60, no. 3, pp. 677–690, 2010. PMF was an effective approach for finding informative motifs from training data, and it was validated well compared with [4]Z.Gao,G.I.Perez-Perez,Y.Chen,andM.J.Blaser,“Quanti- tation of major human cutaneous bacterial and fungal popula- the previous study and other widely used methods. PMF per- tions,” Journal of Clinical Microbiology, vol. 48, no. 10, pp. 3575– formed well and it could extend evolutionary/phylogenetic 3581, 2010. approaches to perform supervised learning on microbiota [5]E.M.Bik,P.B.Eckburg,S.R.Gilletal.,“Molecularanalysisof text data to discriminate disease/pathology status. thebacterialmicrobiotainthehumanstomach,”Proceedings of the National Academy of Sciences of the United States of America, Abbreviations vol. 103, no. 3, pp. 732–737, 2006. [6] D. Knights, E. K. Costello, and R. Knight, “Supervised classifi- PMF: Phylogenetic Tree-Based Motif Finding algorithm cation of human microbiota,” FEMS Microbiology Reviews,vol. FMS: Feature Merging and Selection algorithm 35, no. 2, pp. 343–359, 2011. LDA: Linear Discriminant Analysis [7]Q.Wang,G.M.Garrity,J.M.Tiedje,andJ.R.Cole,“Na¨ıve HAP: Hospital-acquired pneumonia Bayesian classifier for rapid assignment of rRNA sequences CAP: Community-acquired pneumonia into the new bacterial taxonomy,” Applied and Environmental NNA: Nearest neighbor algorithm Microbiology,vol.73,no.16,pp.5261–5267,2007. SVM: Support vector machine. [8] J. Handelsman, “Metagenomics: application of genomics to uncultured microorganisms,” Microbiology and Molecular Biol- Disclosure ogy Reviews,vol.68,no.4,pp.669–685,2004. [9] C. Simon and R. Daniel, “Metagenomic analyses: past and The funders had no role in study design, data collection and future trends,” AppliedandEnvironmentalMicrobiology,vol.77, analysis,decisiontopublish,orpreparationofthepaper. no. 4, pp. 1153–1161, 2011. [10] S. Karlin, J. Mrazek,´ and A. M. Campbell, “Compositional biases of bacterial genomes and evolutionary implications,” Journal of Conflict of Interests Bacteriology,vol.179,no.12,pp.3899–3913,1997. [11] T. Thomas, J. Gilbert, and F. Meyer, “Metagenomics—a guide Theauthorsdeclaredthattheyhavenocompetinginterests. from sampling to data analysis,” Microbial Informatics and Experimentation,vol.2,no.1,article3,2012. Authors’ Contribution [12] T.-H. Lin, R. F. Murphy, and Z. Bar-Joseph, “Discriminative motif finding for predicting protein subcellular localization,” Yin Wang performed algorithm design and wrote the paper. IEEE/ACM Transactions on Computational Biology and Bioin- Yuhua Zhou and Zongxin Ling collected the data. Lei Liu, formatics,vol.8,no.2,pp.441–451,2011. Lu Xie, and Xiaokui Guo designed and sponsored the study. [13] J. Korhonen, P.Martinmaki,¨ C. Pizzi, P.Rastas, and E. Ukkonen, Rudong Li contributed and edited the paper. All authors read “MOODS: fast search for position weight matrix matches in BioMed Research International 11

DNA sequences,” Bioinformatics,vol.25,no.23,pp.3181–3182, [30] M. Koide, F. Higa, M. Tateyama, H. L. Cash, A. Hokama, 2009. and J. Fujita, “Role of Brevundimonas vesicularis in supporting [14] Y. Wang, Y. Zhou, Y. Li et al., “An improved dimensionality the growth of Legionella in nutrient-poor environments,” New reduction method for meta-transcriptome indexing based dis- Microbiologica,vol.37,no.1,pp.33–39,2014. eases classification,” BMC Systems Biology,vol.6,supplement3, [31] K. Tamura, G. Stecher, D. Peterson, A. Filipski, and S. Kumar, article S12, 2012. “MEGA6: molecular evolutionary genetics analysis version 6.0,” [15] F. Zhou, V. Olman, and Y. Xu, “Barcodes for genomes and Molecular Biology and Evolution,vol.30,no.12,pp.2725–2729, applications,” BMC Bioinformatics,vol.9,article546,2008. 2013. [16] X.-Y. Jing, D. Zhang, and Y.-Y. Tang, “An improved LDA approach,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,vol.34,no.5,pp.1942–1951,2004. [17] A. E. Isabelle Guyon, “An introduction to variable and feature selection,” Journal of Machine Learning Research,vol.3,pp.1157– 1182, 2003. [18] L. J. Wei, “Asymptotic conservativeness and efficiency of kruskal-wallis test for K dependent samples,” Journal of the American Statistical Association,vol.76,no.376,pp.1006–1009, 1981. [19] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, 1991. [20] H. Liu and R. Setiono, “Chi2: feature selection and discretiza- tion of numeric attributes,” in Proceedings of the 7th Interna- tional Conference on Tools with Artificial Intelligence, pp. 388– 391, Herndon, Va, USA, November 1995. [21]N.Jafari,R.Behroozi,D.Farajzadeh,M.Farsi,andK.Akbari- Noghabi, “Antibacterial activity of Pseudonocardia sp. JB05, a rare salty soil actinomycete against Staphylococcus aureus,” BioMed Research International,vol.2014,ArticleID182945,7 pages, 2014. [22]H.Lu,G.Qian,Z.Renetal.,“AlterationsofBacteroides sp., Neisseria sp., Actinomyces sp., and Streptococcus sp. populations in the oropharyngeal microbiome are associated with liver cirrhosis and pneumonia,” BMC Infectious Diseases,vol.15,no. 1, article 239, 2015. [23] P. Li, C. Yang, J. Xie et al., “Acinetobacter calcoaceticus from a fatal case of pneumonia harboring blaNDM-1 on a widely distributed plasmid,” BMC Infectious Diseases,vol.15,article131, 2015. [24] A. Hasegawa, T. Sato, Y. Hoshikawa et al., “Detection and identification of oral anaerobes in intraoperative bronchial fluids of patients with pulmonary carcinoma,” Microbiology and Immunology,vol.58,no.7,pp.375–381,2014. [25] M. M. Pettigrew, A. S. Laufer, J. F. Gent, Y. Kong, K. P. Fennie, and J. P. Metlay, “Upper respiratory tract microbial communi- ties, acute otitis media pathogens, and antibiotic use in healthy and sick children,” Applied and Environmental Microbiology, vol.78,no.17,pp.6262–6270,2012. [26]J.P.Balikian,P.G.Herman,andJ.J.Godleski,“Serratia pneumonia,” Radiology,vol.137,no.2,pp.309–311,1980. [27] L. Yu, A. H. Gunasekera, J. Mack et al., “Solution structure and function of a conserved protein SP14.3 Encoded by an essential Streptococcus pneumoniae gene,” Journal of Molecular Biology, vol. 311, no. 3, pp. 593–604, 2001. [28] P. Panagou, L. Papandreou, and D. Bouros, “Severe anaerobic necrotizing pneumonia complicated by pyopneumothorax and anaerobic monoarthritis due to Peptostreptococcus magnus,” Respiration,vol.58,no.3-4,pp.223–225,1991. [29]C.M.Chao,C.C.Lai,H.Y.Tsaietal.,“Pneumoniacausedby Aeromonas species in Taiwan, 2004–2011,” European Journal of Clinical Microbiology and Infectious Diseases,vol.32,no.8,pp. 1069–1075, 2013. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 8351204, 9 pages http://dx.doi.org/10.1155/2016/8351204

Research Article Analysis and Identification of Aptamer-Compound Interactions with a Maximum Relevance Minimum Redundancy and Nearest Neighbor Algorithm

ShaoPeng Wang,1 Yu-Hang Zhang,2 Jing Lu,3 Weiren Cui,4 Jerry Hu,5 and Yu-Dong Cai1

1 School of Life Sciences, Shanghai University, Shanghai 200444, China 2Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China 3Key Laboratory of Molecular Pharmacology and Drug Evaluation (Ministry of Education), Collaborative Innovation Center of Advanced Drug Delivery System and Biotech Drugs in Universities of Shandong, School of Pharmacy, Yantai University, Shandong, Yantai 264005, China 4CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China 5Department of Mathematics and Computer Science, School of Arts and Sciences, University of Houston-Victoria, Victoria, TX 77901, USA

Correspondence should be addressed to Yu-Dong Cai; cai [email protected]

Received 5 November 2015; Accepted 5 January 2016

Academic Editor: Zhenguo Zhang

Copyright © 2016 ShaoPeng Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The development of biochemistry and molecular biology has revealed an increasingly important role of compounds in several biological processes. Like the aptamer-protein interaction, aptamer-compound interaction attracts increasing attention. However, it is time-consuming to select proper aptamers against compounds using traditional methods, such as exponential enrichment. Thus, there is an urgent need to design effective computational methods for searching effective aptamers against compounds. This study attempted to extract important features for aptamer-compound interactions using feature selection methods, such as Maximum Relevance Minimum Redundancy, as well as incremental feature selection. Each aptamer-compound pair was represented by properties derived from the aptamer and compound, including frequencies of single nucleotides and dinucleotides for the aptamer, as well as the constitutional, electrostatic, quantum-chemical, and space conformational descriptors of the compounds. As a result, some important features were obtained. To confirm the importance of the obtained features, we further discussed the associations between them and aptamer-compound interactions. Simultaneously, an optimal prediction model based on the nearest neighbor algorithm was built to identify aptamer-compound interactions, which has the potential to be a useful tool for the identification of novel aptamer-compound interactions. The program is available upon the request.

1. Introduction molecules have several advantages. Apart from their high affinity and wide range of applications, it is much easier to Aptamers are defined as single-stranded nucleic acids or screen and accurately amplify aptamers than antibodies. With peptides that act like antibodies [1, 2]. These specific selective the development of molecular biology techniques, it is even molecules can easily recognize and identify certain targets possible for us to modify the aptamers after screening, which in the proper environment. In vitro, aptamers are widely may be much harder for antibodies. Moreover, purification artificially selected from a large random sequence pool; is always difficult and cumbersome in molecular technology. at the same time, natural aptamers always exist in the However, polymerase chain reaction makes it amazingly easy riboswitches [3]. Compared to antibodies, these artificial to attain quantities of target aptamers without a complex 2 BioMed Research International purification process [4]. All in all, aptamers are a potentially compounds that cannot match the SMILE strings were valuable class of ligands that are sure to be widely used in the also removed. Finally, we obtained 159 aptamer-compound fields of biology and medicine [5]. interactions, involving 20 compounds and 156 aptamers. Previous studies have focused on aptamer-protein inter- These 159 aptamer-compound interactions were considered actions [6]. With the development of biochemistry and to be positive interactions in this study. molecular biology, compounds have been shown to play an To characterize features of aptamer-compound interac- increasingly significant role in several biological processes; tions, the negative data were also necessary, constructed therefore, it is necessary to focus on aptamer-compound according to the following rules: (1) randomly combine one interactions. The most widely used method to select aptamers compound from 20 compounds and one aptamer from 156 is systematic evolution of ligands by exponential enrich- aptamers to constitute an interaction; (2)theconstructed ment (SELEX) [1, 2]. Similar to aptamer-protein interac- interactions were not positive interactions. Because the pos- tions, SELEX is also used to select proper aptamers against sibility of one compound and one aptamer being an actual compounds [7, 8]. However, aptamers are highly target- aptamer-compound interaction is very low, we randomly specific and environment dependent. As a result, select- produced 318 negative interactions, which was twice as ing proper aptamers from random combinatorial libraries many as the positive interactions. The positive and negative is monotonously repetitive and inefficient. A proper, high interactions are all provided in Supplemental Material I, affinity aptamer takes months or even years to be screened. available online at http://dx.doi.org/10.1155/2016/8351204. Currently, we can design effective computational methods to screen proper aptamers. In this study, we analyzed the 2.2. Representation of Aptamer-Compound Interactions. To mechanism underlying aptamer-compound interactions by buildaneffectivepredictionmodel,encodingeachsample synthesizing characteristics of both the compounds and the with its essential properties is one of the most important aptamers. To encode each investigated interaction into a steps. In this study, we encoded each aptamer by the nucleo- numerical vector that can be processed by computers, the tide composition and compound using descriptors, includ- constitutional, electrostatic, quantum-chemical, and space ing constitutional, topological, geometric, electrostatic, and conformational descriptors of the compounds were taken quantum-chemical features. into consideration, as was the nucleotide composition of the aptamers. Then, like the aptamer-protein feature selection 2.2.1. Aptamer Representation. The frequencies of single reported in a previous study [9], the Maximum Relevance nucleotides (“a,” “c,” “g,” and “u(t)”) and dinucleotides (“aa,” Minimum Redundancy (mRMR) method and the Incremen- “ac,” “ag,” “au(t),” “ca,” “cc,” “cg,” “cu(t),” “ga,” “gc,” “gg,” tal Feature Selection (IFS) method were applied to screen the “gu(t),” “u(t)a,” “u(t)c,” “u(t)g,” and “u(t)u(t)”) were used to optimal features for the determination of aptamer-compound encode each aptamer. Thus, each investigated aptamer can be interactions. Simultaneously, an optimal prediction model represented by a 20D (20-dimensional) numerical vector. based on the nearest neighbor algorithm (NNA) was built. Our results may help broaden the applications of aptamers in 2.2.2. Compound Representation. The initial structures of all biological and medical fields. compounds were optimized by Sybyl 6.8 [11], and structural optimization was performed using the AM1 semiempirical 2. Materials and Methods method implemented in AMPAC 8.16 [12]. To describe the characteristics of the compounds, a total of 499 descrip- 2.1. Materials. Aptamer Base (http://aptamerbase.semantic- tors,includingconstitutional,topological,geometric,electro- science.org/) is a collaboratively created and maintained static, and quantum-chemical features, were calculated with knowledge base about aptamers, including their interac- Codessa 2.7.2 [13]. After removing those descriptors with zero tions and detailed experimental conditions with citations variance or missing values for some compounds, 301 descrip- to primary scientific literature [10]. It contains a total of tors remained. The distribution of these 301 descriptors is 1,994 entries of interactions (accessed in May 2014), in listed in Table 1. As a result, each investigated compound was which 1,335 entries involve one or more compounds. After represented by a 301D (301-dimensional) numerical vector. searching the 1,335 entries, we obtained 1,507 interactions between aptamers and compounds. Moreover, because of the 2.2.3. Interaction Representation. Because each interaction extension of freebase itself, it is easy to obtain compound consisted of one aptamer and one compound, it can be information from another freebase “compound.” Most of the represented by a 321D (321-dimensional) numerical vector, “compound IDs” and some SMILE strings were also available where 20 components represented the properties of aptamers from direct query on this freebase. and the others represented the properties of compounds (see To obtain a well-defined dataset, 1,507 aptamer-com- Table 1). pound interactions were further refined using the following rules: (1) interactions containing compounds whose Pub- 2.3. mRMR. As mentioned in Section 2.2, 321 features rep- chem IDs were not available were excluded; (2) interactions resented each aptamer-compound interaction. Clearly not containing compounds whose molecular weights are greater all features equally contribute to the identification of actual than 800 were removed because it is time-consuming to aptamer-compound interactions. Some of features make key make structural optimization by AMPAC for compounds contributions, whereas some others are less important. To with high molecular weights; and (3) interactions containing analyze the features, a popular feature selection method, BioMed Research International 3

Table 1: Distribution of the features investigated in this study. and some optimal classification models have been built [15– 24]. Here, we denoted the MaxRel features list and the mRMR Feature type Number of features features list as follows: Features of aptamer

Frequency of single nucleotide 4 𝑀 𝑀 𝑀 MaxRel features list: 𝐹MaxRel =[𝑓1 ,𝑓2 ,...,𝑓𝑛 ], Frequency of dinucleotide 16 (2) 𝑚 𝑚 𝑚 Features of compound mRMR features list: 𝐹mRMR =[𝑓1 ,𝑓2 ,...,𝑓𝑛 ]. Constitutional 24 Electrostatic 57 For a detailed description of this method, please refer to Geometrical 12 Peng et al.’s [14] or visit the website http://home.penglab.com/ Quantum-chemical 171 software/Hanchuan Peng Software/software.html. Topological 37 Total 321 2.4. Basic Prediction Engine. Based on the mRMR features list obtained by the mRMR method and a basic prediction mRMR, which was first proposed by Peng et al. [14] in engine, one can construct an optimal prediction model 2005, was employed. This method measures the investigated using key features to represent samples. Here, we tried four prediction engines: (1)NNA[25];(2) Random Forest (RF) features for a certain problem by providing two lists, MaxRel 3 4 features list and mRMR features list. The MaxRel features [26]; ( ) Sequential Minimal Optimization (SMO) [27]; ( ) list sorts the investigated features by their contributions into Dagging [28]. Their brief descriptions were as follows. classifications, that is, with relevance to class labels. The mRMR features list sorts features by considering not only 2.4.1. NNA. NNA is a classic classifier. Although it is simple, their contributions to classification but also the redundancies it performs well in many cases [29–32]. For a query sample, to features listed before them. The detailed descriptions are the distances between it and samples in the training set are as follows. Firstly, the above factors can be encoded into computed and the class of the sample with the minimum numbers using the mutual information (MI), which can be distance is assigned to it. calculated by 𝑝(𝑥,𝑦) 2.4.2. RF. RF is an ensemble classifier proposed by Breiman 𝐼(𝑥,𝑦)=∬ 𝑝(𝑥,𝑦) 𝑑𝑥 𝑑𝑦, log 𝑝 (𝑥) 𝑝(𝑦) (1) [26]. It integrates a number of decision trees, which are constructed by randomly selecting samples from the original where 𝑥 and 𝑦 represent two variables, 𝑝(𝑥, 𝑦) represents the training set and randomly selecting features to split each joint probabilistic density of 𝑥 and 𝑦,and𝑝(𝑥) represents the node. Because it contains two procedures of random selec- marginal probabilistic density of variable 𝑥. tions, it always yields good performance and has been applied For a problem involving 𝑁 features, the MI of each to deal with many biological problems [33–37]. feature as well as the target vector, consisting of samples class labels, is calculated. The MaxRel features list ranks the 2.4.3. SMO. SMO is a type of support vector machines (SVM) features with the descending order of MI values. For the that is optimized by the John Platt’s Sequential Minimal Opti- mRMR features list, because it additionally considers the mization algorithm [27]. The optimization problem of SVM redundancies between features, it is produced using a loop is divided into several of the smallest possible subproblems, Ω 𝑁 Ω procedure. Suppose is a set containing features and 𝑠 is and they are solved analytically. a set consisting of already selected features (initially, Ω𝑠 =Φ) and Ω𝑡 consists of the rest features; that is, Ω𝑡 =Ω−Ω𝑠. The contribution of feature 𝑓 in Ω𝑡 is measured using the 2.4.4. Dagging. Dagging is a metaclassifier containing multi- MI of it and target vector 𝑐,thatis,𝐷=𝐼(𝑓,𝑐), while the ple prediction models that are derived from a number of dis- redundancies between it and features in Ω𝑠 are measured by joint subsets of the original training set and a single machine 𝑅 = (1/|Ω |) ∑ 𝐼(𝑓,𝑓) Ω =Φ𝑅 learning algorithm [28]. Its predicted result integrated the 𝑠 𝑓𝑖∈Ω𝑠 𝑖 (if 𝑠 , is set to zero). To select a feature with maximum contributions for classification results of the prediction models by majority voting. In Weka [38], four classifiers (IB1, Random Forest, SMO, and minimum redundancies between it and features in Ω𝑠, the feature yielding the maximum 𝐷-𝑅 will be selected in the and Dagging) implement the above four methods. For conve- nience,theywereemployedtomakeclassificationsandthey next loop and removed from Ω𝑡 to Ω𝑠.Whenallfeaturesare were all executed with their default parameters. in Ω𝑠, the loop stops. The mRMR features list ranks features using the selection sequence of features. By analyzing the MaxRel features list and mRMR features 2.5. Accuracy Measurement. Identification of aptamer- list, we can extract important features among the investigated compound interactions is a two-class classification problem. featuresandbuildanoptimalpredictionmodelbasedonone To measure the performance of a classifier on this type of machine learning algorithm. Currently, the mRMR method problem, four values were counted, true positive (TP), true has been applied to study a number of biological problems negative (TN), false positive (FP), and false negative (FN) 4 BioMed Research International

[29, 39]. Furthermore, these values can be used to calculate 0.25 the following measurements: 0.2 TP SN = , 0.15 TP + FN 0.1

TN Proportion SP = , TN + FP 0.05 TP + TN 0 ACC = , (3) TP + TN + FP + FN MCC Topological Electrostatic Geometrical

⋅ − ⋅ Constitutional = TP TN FP FN . √

( + ) ⋅ ( + ) ⋅ ( + ) ⋅ ( + ) Quantum-chemical TN FN TN FP TP FN TP FP Features of aptamer Feature type To correctly measure the performance of a classifier, one measurement listed in (3) should be selected as the key Figure 1: The proportion of features listed in the top 10% of the measurement. Obviously, SN and SP are not perfect mea- MaxRel features list in each feature type. surements because they only partly use TP, TN, FP, and FN. Regarding ACC and MCC [40], we prefer to use MCC as the key measurement because MCC is a balanced measurement features in each feature type is different, only considering even if the number of samples in each class greatly differs. thenumberoffeatureslistedinthetop10%oftheMaxRel Therefore, in this study, MCC is always used to measure the features list for each feature type has its limitation. Thus, we performance of the current prediction method, whereas SN, computed the proportion of the number of features in the top SP, and ACC are provided as reference. 10% of the MaxRel features list and total number of features in each feature type, as illustrated in Figure 1. It can be observed 2.6. IFS. By combining the mRMR features list and a basic from Table 2 and Figure 1 that features of electrostatic and prediction engine (e.g., NNA and RF), one can build an opti- quantum-chemical descriptors were more related to the mal prediction model, in which each sample is represented determination of aptamer-compound interactions than other by extracted key features and the adopted basic prediction interactions. engine provides the best performance. This procedure is called IFS, which can be implemented as follows: 3.2. Results of IFS. By analyzing the MaxRel features list, we 𝑚 𝑚 (i) Based on the mRMR feature list 𝐹mRMR =[𝑓1 ,𝑓2 , obtained only some important features that may play key 𝑚 ...,𝑓𝑛 ], 𝑁 feature sets were constructed such that roles in the determination of aptamer-compound interac- 𝑚 𝑚 𝑚 IFS𝑖 ={𝑓1 ,𝑓2 ,...,𝑓𝑖 } (1≤𝑖≤𝑛). tions. On the other hand, an optimal prediction model based on a certain basic prediction engine can be built according (ii) For the 𝑖th feature sets IFS𝑖,eachsamplewasrepre- to the mRMR features list and the IFS method. Following sented by features in IFS𝑖 and the basic prediction the procedures described in Section 2.6, a set of MCCs can engine was executed on all samples for classification be obtained using different numbers of features for each and was evaluated by tenfold cross-validation [41]. of the four basic prediction engines, which are listed in (iii) Evaluate the performance of the basic prediction Supplemental Material III. For the readers’ interest, the SNs, enginebycalculatingMCCandselectfeaturesinthe SPs and ACCs are also provided in Supplemental Material feature set that induces the highest MCC as the opti- III. Based on the MCCs obtained by IFS method and four mal features. basic prediction engines, we plotted four curves, namely, IFS curves, for four basic prediction engines by setting MCC as 3. Results and Discussion the 𝑦-axis and the number of considered features (i.e., the subscript 𝑖 of IFS𝑖)asthe𝑥-axis. Figure 2 shows these four 3.1. Results of mRMR. The investigated 477 interactions curves, from which we can clearly observe that the maximum were represented by 321 features. The mRMR method was MCC for NNA, RF, SMO, and Dagging was 0.670, 0.629, employed to analyze these features. As a result, we obtained 0.425, and 0.483, respectively, when the first 80, 135, 42, and two lists, the MaxRel features list and the mRMR features 54 features in the mRMR features list were used. Thus, the list, which are provided in Supplemental Material II. For the NNA yielded the best performance (MCC 0.670) using the MaxRel features list, we investigated the top 10% of features, first 80 features in the mRMR features list. For readers’ inter- which were important for the determination of aptamer- est, the SN, SP, and ACC obtained using the NNA and first compound interactions. Table 2 gives the distribution of these 80 features in the mRMR feature lists were 0.780, 0.890, and features, from which we can see that no features of the 0.853, respectively. It can be observed that the performance aptamers were among the top 10% of features of the MaxRel of the NNA is much better than the performances of SMO features list. Furthermore, because the number of considered andDagging.Thepossiblereasonisthatthecurrentdata BioMed Research International 5

Table 2: Distribution of the top 10% features in the MaxRel features list.

Feature type Number of features Feature names Features of aptamer 0 — Constitutional 2 Number of double bonds; number of O atoms DPSA-1 difference in CPSAs (PPSA1-PNSA1) [Zefirov’s PC]; HA dependent HDCA-2 [Zefirov’s PC]; Max partial charge for H atom [Zefirov’s PC]; PNSA-3 Electrostatic 11 atomic charge weighted PNSA [Zefirov’s PC]; HACA-2 [Zefirov’s PC]; HACA-1 [Zefirov’s PC]; min(#HA #HD) [Zefirov’s PC]; count of H-acceptor sites [Zefirov’s PC]; HA dependent HDSA-1/TMSA [Zefirov’s PC]; DPSA-3 difference in CPSAs (PPSA3-PNSA3) [Zefirov’s PC]; HA dependent HDCA-1 [Zefirov’s PC] Geometrical 0 — Tot dipole of the molecule; tot point-charge comp. of the molecular dipole; ESP-HA dependent HDSA-2 [quantum-chemical PC]; ESP-HA dependent HDCA-2 [quantum-chemical PC]; ESP-HACA-2 [quantum-chemical PC]; HA dependent HDSA-2 [quantum-chemical PC]; final heat of formation; ESP-Max net atomic charge for H atom; ESP-DPSA-1 difference in CPSAs (PPSA1-PNSA1) Quantum-chemical 18 [quantum-chemical PC]; HA dependent HDCA-2 [quantum-chemical PC]; HOMO - LUMO energy gap; ESP-HA dependent HDSA-1 [quantum-chemical PC]; min(#HA #HD) [quantum-chemical PC]; ESP-count of H-acceptor sites [quantum-chemical PC]; ESP-min(#HA #HD) [quantum-chemical PC]; count of H-acceptor sites [quantum-chemical PC]; DPSA-1 difference in CPSAs (PPSA1-PNSA1) [quantum-chemical PC]; HA dependent HDCA-1 [quantum-chemical PC] Topological 1 Average structural information content (order 1)

Table 3: Predicted results of some specific examples obtained by the optimal prediction model.

Compound Aptamer Predicted class True class Arsenate 20000526-arsenic-5 Positive Positive Isoleucine 15772067-isoleucine-1 Positive Positive Dopamine 9245404-dopamine-4 Positive Positive Chitin 10743940-chitin-5 Positive Positive N-Acetylneuraminic acid 23042406-Neu5Ac-1 Positive Positive Isoleucine 14980623-sialyllactose-1 Positive Negative Dopamine 18983163-ochratoxin A-3 Positive Negative Chitin 10786843-L tyrosine-3 Positive Negative Tyrosine 20000526-arsenic-Ma-1 Positive Negative N-Acetylneuraminic acid 21076782-L-tryptophan-1 Positive Negative of aptamer-compound interactions is so complicated that provide more clues for other investigators to study aptamer- its distribution is not clear, inducing difficulties for making compound interactions, we listed the predicted results of 477 prediction by the kernel function methods (e.g., SMO) or interactions in Supplemental Material IV. Because the SN boosting methods (e.g., Dagging), while the NNA is good at obtained by the optimal prediction model was 0.78, mean- dealing with this type of data. The IFS results of NNA suggest ing that 124 of 159 aptamer-compound interactions were cor- that the first 80 features in the mRMR feature lists were the rectly predicted, five such examples are listed in first five optimal features to identify aptamer-compound interactions. rowsinTable3.Forthenegativeinteractions,thosethatwere The prediction model based on the NNA and 80 optimal predictedtobe“positive”weremoreimportantthanothers features was the optimal prediction model. The following because they may be potential true aptamer-compound section gives a detailed discussion of the 88 features used in interactions. The last five rows of Table 3 list such five negative the optimal prediction model. interactions.

3.3. Prediction Results of Some Specific Examples. According 3.4. Analysis of the Optimal Features. The 80 optimal features to the results mentioned in Section 3.2, the optimal predic- can be categorized into six types, including features of tion model used the NNA as the classifier and the 80 optimal aptamer, constitutional, electrostatic, geometrical, quantum- features to represent aptamer-compound interactions. To chemical, and topological features. The distributions of these 6 BioMed Research International

0.7 (80, 0.670)

0.6 (135, 0.629)

0.5 (54, 0.483) 0.4

MCC (42, 0.425) 0.3

0.2

0.1 1 21 81 41 61 101 121 141 201 221 241 281 301 161 181 261 Number of features NNA SMO RF Dagging

Figure 2: Four IFS curves plotted by taking MCC as the 𝑦-axisandthenumberofconsideredfeaturesasthe𝑥-axis for four basic prediction engines. The MCC values indicate the performance of various prediction models using different classifiers and different combination of features to represent interactions. It can be observed that using NNA as the classifier and the first 80 features in the mRMR features list to represent interactions can yield the best performance with the highest MCC value of 0.670.

40 38 0.9 35 0.8 30 0.7 25 0.6 0.5 20 17 0.4 15 14

Proportion 0.3 10 0.2

Number of features of Number 6 5 3 2 0.1 0 0 Topological Topological Electrostatic Electrostatic Geometrical Geometrical Constitutional Constitutional Quantum-chemical Quantum-chemical Features of aptamer of Features Features of aptamer of Features Feature type Feature type (a) (b)

Figure 3: (a) The distribution of the 80 optimal features. (b) The proportion of features among the 80 optimal features in each feature type.

sixfeaturetypesareillustratedinFigure3(a).Liketheanalysis the structural foundation of aptamer-compound binding ofthetop10%featuresintheMaxRelfeatureslist,wealso [47]. Furthermore, the selected quantum-chemical features calculated the proportion of the number of features among also describe conformational changes and atomic reactivity the80optimalfeaturesandthetotalnumberoffeaturesin during the interaction. These traits explain aptamers’ target each feature type, as illustrated in Figure 3(b). specificity and why an aptamer can easily detect changes ina The quantum-chemical features make up approximately target’smoleculestructure[48,49].Theresultsabovesuggest 50%of80optimalfeatures.Amongthesefeatures,thetot that our aptamer prediction has to include consideration of dipole moment of the target molecule seems to be statistically the molecular polarity and the surface electrostatic charge essential for aptamer-target interactions, represents specific distribution of the target molecules. Consequently, prediction polarity characteristics, and, to some extent, reflects the space using the optimal prediction model might be widely imple- conformation of the target molecule [42, 43]. This finding is mented in the design of aptamers. consistent with those of previous studies that show that the The electrostatic features were also a part of the optimal space conformation of the targets plays an important role features. These traits reflect the distribution of the spe- in interactions with aptamers [44–46]. Moreover, quantum- cific molecule surface charge. Molecule-molecule interac- chemical features also contain the characteristics of the tion (such as aptamer-target) is largely dependent on the total surface area and surface functional groups that may interaction of respective charge [50, 51]. Such surface charge participateinthereaction.Thesecharacteristicsmakeup distribution is sure to have a correlation with aptamer-target BioMed Research International 7 interaction. Indeed, the polarity of targets as well as aptamers Acknowledgments can induce aptamers to recognize their specific targets [52]. The distribution has also been demonstrated to be involved This study was supported by the National Basic Research Pro- in aptamer-protein interactions. A typical example is the TBA gram of China (2011CB510101, 2011CB510102), the National (thrombin binding aptamer) [53]. Similarly, polarity may also Natural Science Foundation of China (31371335), the Innova- play a crucial role in aptamer-compound interactions. tion Program of Shanghai Municipal Education Commission Constitutional features also play a unique role in the (12ZZ087), and a grant from “The First-Class Discipline of interaction. Certain features may combine to act as a standard Universities in Shanghai.” to distinguish the material categories. Apart from character- istics describing the target compounds, aptamer frequency References (the composition of nucleotide and dual nucleotide) can also interfere with the reaction by remodeling the spatial confor- [1] C. Tuerk and L. Gold, “Systematic evolution of ligands by mation of the aptamers. A stable and target-specific spatial exponential enrichment: RNA ligands to bacteriophage T4 conformation is the foundation of the aptamers’ function DNA polymerase,” Science, vol. 249, no. 4968, pp. 505–510, 1990. [54–56]. Considering that the conformation of is [2] A. D. Ellington and J. W. Szostak, “In vitro selection of RNA mainly based on interactions between nucleotides, the com- molecules that bind specific ligands,” Nature,vol.346,no.6287, position of nucleotides and dual nucleotides may influence pp.818–822,1990. aptamers’ specific three-dimensional structures and their [3] Q. Vicens, E. Mondragon,´ and R. T. Batey, “Molecular sensing stability. Moreover, some specific compounds may have the by the aptamer domain of the FMN riboswitch: a general model ability to recognize nucleotide chains, which may contain a for ligand binding by conformational selection,” Nucleic Acids characteristic nucleotide frequency. Those compounds inter- Research, vol. 39, no. 19, pp. 8586–8598, 2011. act with aptamers based on sequence specificity [57, 58]. Our [4] K.P.F.Janssen,K.Knez,D.Spasic,J.Schrooten,andJ.Lammer- results further confirm that the polar properties and distribu- tyn, “Multiplexed protein detection using an affinity aptamer tion of molecular surface charge and aptamer frequency are amplification assay,” Analytical and Bioanalytical Chemistry,vol. 404, no. 6-7, pp. 2073–2081, 2012. significant for the interaction between the aptamers and their respective targets. [5]J.G.Bruno,M.P.Carrillo,A.M.Richarte,T.Phillips,C. Andrews, and J. S. Lee, “Development, screening, and analysis All in all, our prediction of proper aptamers against of DNA aptamer libraries potentially useful for diagnosis and compounds depends on the traits of polarity, surface charge passive immunity of arboviruses,” BMC Research Notes,vol.5, distribution of the compounds, constitutional features, and no. 1, article 633, 2012. aptamer frequency. Our prediction using the mRMR pro- [6] G. Lautner, Z. Balogh, A. Gyurkovics, R. E. Gyurcsanyi,´ and gram depends on the propensities of the compounds and the T. Mesz´ aros,´ “Homogeneous assay for evaluation of aptamer- nucleotide (dual nucleotide) frequency of aptamers. In con- protein interaction,” Analyst,vol.137,no.17,pp.3929–3931,2012. clusion, in addition to protein analysis, mRMR can also be [7]D.Zhu,X.Zhou,andD.Xing,“Anewkindofaptamer-based applied to design matching aptamers to specifically identify immunomagnetic electrochemiluminescence assay for quanti- objective compounds. tative detection of protein,” Biosensors and Bioelectronics,vol. 26, no. 1, pp. 285–288, 2010. 4. Conclusions [8] S. Xie and S. P. Walton, “Development of a dual-aptamer-based multiplex protein biosensor,” Biosensors and Bioelectronics,vol. Our study analyzed and identified the important features 25, no. 12, pp. 2663–2668, 2010. that influence the matching of aptamers to compounds. Max- [9] B.-Q. Li, Y.-C. Zhang, G.-H. Huang, W.-R. Cui, N. Zhang, imum Relevance Minimum Redundancy and incremental and Y.-D. Cai, “Prediction of aptamer-target interacting pairs feature selection were performed on a dataset, in which com- with pseudo-amino acid composition,” PLoS ONE,vol.9,no.1, pounds and aptamers were represented by descriptors and Article ID e86729, 2014. nucleotide compositions, respectively. As a result, some key [10] J. Cruz-Toledo, M. McKeague, X. Zhang et al., “Aptamer base: a features were extracted and an optimal prediction model was collaborative knowledge base to describe aptamers and SELEX built based on the nearest neighbor algorithm. The novel experiments,” Database: The Journal of Biological Databases and findings of our study may give new insights into the investi- Curation, vol. 2012, Article ID bas006, 2012. gation of aptamer-compound interactions. [11] Sybyl,Tripos,St.Louis,Mo,USA,2013. [12] AMPAC, Semichem Inc, Shawnee, Kan, USA. [13] A. R. Katritzky, R. Petrukhin, H. Yang, and M. Karelson, Com- Conflict of Interests prehensive Descriptors for Structural and Statistical Analysis (CODESSA), Semichem, Shawnee, Kan, USA, 2002. The authors declare that there is no conflict of interests regarding the publication of this paper. [14]H.Peng,F.Long,andC.Ding,“Featureselectionbasedon mutual information: criteria of max-dependency, max-rele- vance, and min-redundancy,” IEEE Transactions on Pattern Authors’ Contribution Analysis and Machine Intelligence,vol.27,no.8,pp.1226–1238, 2005. ShaoPeng Wang and Yu-Hang Zhang contributed equally to [15] L. Chen, W.-M. Zeng, Y.-D. Cai, and T. Huang, “Prediction of this work. metabolic pathway using graph property, chemical functional 8 BioMed Research International

group and chemical structural set,” Current Bioinformatics,vol. [32] L. Lan, N. Djuric, Y. Guo, and S. Vucetic, “MS-kNN: protein 8, no. 2, pp. 200–207, 2013. function prediction by integrating multiple data sources,” BMC [16] T. Huang, L. Chen, Y.-D. Cai, and K.-C. Chou, “Classification Bioinformatics, vol. 14, supplement 3, article S8, 2013. and analysis of regulatory pathways using graph property, [33]L.Chen,C.Chu,T.Huang,X.Kong,andY.Cai,“Prediction biochemical and physicochemical property, and functional and analysis of cell-penetrating peptides using pseudo-amino property,” PLoS ONE,vol.6,no.9,ArticleIDe25297,2011. acid composition and random forest models,” Amino Acids,vol. [17] Y. Zhang, C. Ding, and T. Li, “Gene selection algorithm by 47,no.7,pp.1485–1493,2015. combining reliefF and mRMR,” BMC Genomics,vol.9,supple- [34]B.-Q.Li,K.-Y.Feng,L.Chen,T.Huang,andY.-D.Cai,“Pre- ment 2, article S27, 2008. diction of protein-protein interaction sites by random forest [18] H. Mohabatkar, M. M. Beigi, and A. Esmaeili, “Prediction of algorithmwithmRMRandIFS,”PLoS ONE,vol.7,no.8,Article ID e43927, 2012. GABAA receptor proteins using the concept of Chou’s pseudo- amino acid composition and support vector machine,” Journal [35] G. Pugalenthi, K. K. Kandaswamy, K.-C. Chou, S. Vivekanan- of Theoretical Biology,vol.281,no.1,pp.18–23,2011. dan, and P. Kolatkar, “RSARF: prediction of residue sol- [19]G.S.Han,Z.G.Yu,V.Anh,A.P.D.Krishnajith,andY.-C.Tian, vent accessibility from protein sequence using random forest “An ensemble method for predicting subnuclear localizations method,” Protein and Peptide Letters,vol.19,no.1,pp.50–56, from primary protein structures,” PLoS ONE,vol.8,no.2, 2012. Article ID e57225, 2013. [36] Z. Qiu and X. Wang, “Improved prediction of protein ligand- [20] Y. Xu, Y. Deng, Z. Ji et al., “Identification of thyroid carcinoma binding sites using random forests,” Protein and Peptide Letters, related genes with mRMR and shortest path approaches,” PLoS vol. 18, no. 12, pp. 1212–1218, 2011. ONE,vol.9,no.4,ArticleIDe94022,2014. [37] P. P. Pai and S. Mondal, “MOWGLI: prediction of protein- MannOse interacting residues with ensemble classifiers using [21]Z.Li,L.Chen,Y.Lai,Z.Dai,andX.Zou,“Theprediction evolutionary information,” Journal of Biomolecular Structure of methylation states in human DNA sequences based on and Dynamics,2015. hexanucleotide composition and feature selection,” Analytical Methods,vol.6,no.6,pp.1897–1904,2014. [38] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Boston, [22] L.Song,D.Li,X.Zeng,Y.Wu,L.Guo,andQ.Zou,“nDNA-prot: Mass, USA, 2005. identification of DNA-binding proteins based on unbalanced classification,” BMC Bioinformatics,vol.15,no.1,article298, [39] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. 2014. Nielsen, “Assessing the accuracy of prediction algorithms for classification: an overview,” Bioinformatics,vol.16,no.5,pp. [23]Y.Zhou,N.Zhang,B.-Q.Li,T.Huang,Y.-D.Cai,andX.-Y. 412–424, 2000. Kong, “A method to distinguish between lysine acetylation and [40] B. W. Matthews, “Comparison of the predicted and observed lysine ubiquitination with feature selection and analysis,” Jour- secondary structure of T4 phage lysozyme,” Biochimica et Bio- nal of Biomolecular Structure and Dynamics,vol.33,no.11,pp. physica Acta—Protein Structure,vol.405,no.2,pp.442–451, 2479–2490, 2015. 1975. [24] L.Chen,C.Chu,andK.Feng,“Predictingthetypesofmetabolic [41] “A study of cross-validation and bootstrap for accuracy esti- pathway of compounds using molecular fragments and sequen- mation and model selection,” in Proceedings of the 14th Inter- tial minimal optimization,” Combinatorial Chemistry & High national Joint Conference on Artificial Intelligence (IJCAI ’95), Throughput Screening,vol.19,no.4,2016. R. Kohavi, Ed., pp. 1137–1143, Lawrence Erlbaum Associates, [25] T. Cover and P. Hart, “Nearest neighbor pattern classification,” Montreal, Canada, August 1995. IEEE Transactions on Information Theory,vol.13,no.1,pp.21– [42] R. C. Reid, M.-K. Yau, R. Singh, J. Lim, and D. P.Fairlie, “Stereo- 27, 1967. electronic effects dictate molecular conformation and biological [26] L. Breiman, “Random forests,” Machine Learning,vol.45,no.1, function of heterocyclic amides,” JournaloftheAmerican pp. 5–32, 2001. Chemical Society,vol.136,no.34,pp.11914–11917,2014. [27]Z.Xu,M.Dai,andD.Meng,“Fastandefficientstrategiesfor [43] D. Schmitz, C. Schmitz-Antoniak, A. Warland et al., “The dipole model selection of Gaussian support vector machine,” IEEE moment of the spin density as a local indicator for phase Transactions on Systems, Man, and Cybernetics, Part B: Cyber- transitions,” Scientific Reports,vol.4,article5760,2014. netics,vol.39,no.5,pp.1292–1307,2009. [44]S.P.Hennelly,I.V.Novikova,andK.Y.Sanbonmatsu,“The [28] “Stacking bagged and dagged models,”in Proceedings of the 14th expression platform and the aptamer: cooperativity between 2+ International Conference on Machine Learning (ICML ’97),K. Mg and ligand in the SAM-I riboswitch,” Nucleic Acids M. Ting and I. H. Witten, Eds., pp. 367–375, Morgan Kaufmann, Research,vol.41,no.3,pp.1922–1935,2013. San Francisco, Calif, USA, 1997. [45] B. Waybrant, T. R. Pearce, P. Wang, S. Sreevatsan, and E. [29] L. Chen, K.-Y. Feng, Y.-D. Cai, K.-C. Chou, and H.-P. Li, “Pre- Kokkoli, “Development and characterization of an aptamer dicting the network of substrate-enzyme-product triads by binding ligand of fractalkine using domain targeted SELEX,” combining compound similarity and functional domain com- Chemical Communications,vol.48,no.80,pp.10043–10045, position,” BMC Bioinformatics,vol.11,article293,2010. 2012. [30] L. Chen, J. Lu, N. Zhang, T. Huang, and Y.-D. Cai, “A hybrid [46]H.Y.LiuandX.Gao,“Auniversalproteintagfordeliveryof method for prediction and repositioning of drug Anatomical SiRNA-aptamer chimeras,” Scientific Reports,vol.3,article3129, Therapeutic Chemical classes,” Molecular BioSystems,vol.10, 2013. no. 4, pp. 868–877, 2014. [47] J. Yang, M. Palla, F. G. Bosco et al., “Surface-enhanced Raman [31] S.-B. Wan, L.-L. Hu, S. Niu et al., “Identification of multiple sub- spectroscopy based quantitative bioassay on aptamer-function- cellular locations for proteins in budding yeast,” Current Bioin- alized nanopillars using large-area Raman mapping,” ACS formatics,vol.6,no.1,pp.71–80,2011. Nano,vol.7,no.6,pp.5350–5359,2013. BioMed Research International 9

[48] H.-Y. Cao, A.-H. Yuan, W. Chen, X.-S. Shi, and Y. A. Miao, “A DNA aptamer with high affinity and specificity for molecular recognition and targeting therapy of gastric cancer,” BMC Cancer,vol.14,no.1,article699,2014. [49]K.Ji,W.S.Lim,S.F.Y.Li,andK.Bhakoo,“Atwo-stepstimulus- response cell-SELEX method to generate a DNA aptamer to recognize inflamed human aortic endothelial cells as a potential in vivo molecular probe for atherosclerosis plaque detection,” Analytical and Bioanalytical Chemistry,vol.405,no.21,pp. 6853–6861, 2013. [50] D. Ben-Yaakov, D. Andelman, and H. Diamant, “Interaction between heterogeneously charged surfaces: surface patches and charge modulation,” Physical Review E,vol.87,no.2,ArticleID 022402, 2013. [51] X. Jia, J. Zeng, J. Z. H. Zhang, and Y. Mei, “Accessing the appli- cability of polarized protein-specific charge in linear interaction energy analysis,” JournalofComputationalChemistry,vol.35, no. 9, pp. 737–747, 2014. [52] L. Martino, A. Virno, A. Randazzo et al., “A new modified 󸀠 󸀠 thrombin binding aptamer containing a 5 –5 inversion of polarity site,” Nucleic Acids Research,vol.34,no.22,pp.6653– 6662, 2006. [53] I. R. Krauss, A. Merlino, C. Giancola, A. Randazzo, L. Maz- zarella, and F. Sica, “Thrombin-aptamer recognition: a revealed ambiguity,” Nucleic Acids Research,vol.39,no.17,pp.7858–7867, 2011. [54] C. Wu, T. Chen, D. Han et al., “Engineering of switchable aptamer micelle flares for molecular imaging in living cells,” ACS Nano,vol.7,no.7,pp.5724–5731,2013. [55] T. Xia, J. Yuan, and X. Fang, “Conformational dynamics of an ATP-binding DNA aptamer: a single-molecule study,” The JournalofPhysicalChemistryB,vol.117,no.48,pp.14994–15003, 2013. [56] Y.-W.Cheung,J.Kwok,A.W.L.Law,R.M.Watt,M.Kotaka,and U. A. Tanner, “Structural basis for discriminatory recognition of Plasmodium lactate dehydrogenase by a DNA aptamer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 40, pp. 15967–15972, 2013. [57] Y. M. K. Yoga, D. A. K. Traore, M. Sidiqi et al., “Contribution of the first K-homology domain of poly (C)-binding protein 1 to its affinity and specificity for C-rich oligonucleotides,” Nucleic Acids Research,vol.40,no.11,pp.5101–5114,2012. 󸀠 [58] K. Jaudzems, X. Jia, H. Yagi et al., “Structural basis for 5 - end-specific recognition of single-stranded DNA by the R3H domain from human S𝜇bp-2,” Journal of Molecular Biology,vol. 424, no. 1-2, pp. 42–53, 2012. Hindawi Publishing Corporation BioMed Research International Volume 2016, Article ID 2831287, 9 pages http://dx.doi.org/10.1155/2016/2831287

Research Article The Subcellular Localization and Functional Analysis of Fibrillarin2, a Nucleolar Protein in Nicotiana benthamiana

Luping Zheng,1,2 Jinai Yao,3 Fangluan Gao,1,2 Lin Chen,2 Chao Zhang,2 Lingli Lian,2 Liyan Xie,2 Zujian Wu,2 and Lianhui Xie2

1 Key Laboratory of Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou 350002, China 2Key Laboratory of Plant Virology of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou 350002, China 3Institute of Plant Protection, Fujian Provincial Academy of Agricultural Sciences, Fuzhou 350002, China

Correspondence should be addressed to Lianhui Xie; [email protected]

Received 14 October 2015; Accepted 1 December 2015

Academic Editor: Jialiang Yang

Copyright © 2016 Luping Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nucleolar proteins play important roles in plant cytology, growth, and development. Fibrillarin2 is a nucleolar protein of Nicotiana benthamiana (N. benthamiana). Its cDNA was amplified by RT-PCR and inserted into expression vector pEarley101 labeled with yellow fluorescent protein (YFP). The fusion protein was localized in the nucleolus and Cajal body of leaf epidermal cells of N. benthamiana.TheN. benthamiana fibrillarin2 (NbFib2) protein has three functional domains (i.e., glycine and arginine rich domain, RNA-binding domain, and 𝛼-helical domain) and a nuclear localization signal (NLS) in C-terminal. The protein 3D structure analysis predicted that NbFib2 is an 𝛼/𝛽 protein. In addition, the virus induced gene silencing (VIGS) approach was used to determine the function of NbFib2. Our results showed that symptoms including growth retardation, organ deformation, chlorosis, and necrosis appeared in NbFib2-silenced N. benthamiana.

1. Introduction nucleolar protein 56 (Nop56) [6], is localized in the mid- dle and C-terminal of fibrillarin, respectively. The RNA- Fibrillarin is a major nucleolar protein, playing multifunc- binding domain and 𝛼-helical domain together form an Ado- tional roles in RNA biogenesis. It localizes in nucleolus and Met-dependent methyltransferase- (MTase-) like region. The Cajal bodies (CBs), subnuclear dynamic particles involved MTase-like region, which is evolutionarily conserved, con- in RNA transcription and editing [1]. Fibrillarin is an evolu- tains S-adenosylmethionine (SAM, the methionine group tionarily conserved protein. Human fibrillarin shares 94.2% donor) binding motifs and encodes a MTase required for the 󸀠 sequence identity with mouse fibrillarin and 82.9% sequence methylation of 2 -O-ribose [5]. identity with fibrillarin [2]. Homologs of human In addition to understanding its functions in pre- fibrillarinarealsoreportedinadvancedplants[3]. rRNA processing, modification, and ribosomal assembly [7], Fibrillarin usually consists of three domains including a researches in fibrillarin are focused on its interactions with glycine and arginine rich domain (GAR domain), an RNA- virus-encoded-proteins and the roles of these interactions binding domain, and an 𝛼-helical domain. The GAR domain in viral movement and infection. For example, Melen´ et al. is critical for the function of fibrillarin. This domain is found that fibrillarin interacted with nonstructural protein localizedintheN-terminaloftheproteinanditsarginine 1(NS1)inInfluenza A H3N2 subtype virus via C-terminal residues are methylated [4]. In human and plants, the nuclear localization signal 2 (NLS2) [8]. Kim et al. found GARdomainisinvolvedinthetranslocationoffibrillarin that fibrillarin is involved in the long distance movement into nucleoli. The RNA-binding domain, which interacts and infection of Groundnut rosette virus (GRV) [9, 10]. with RNA [5] and 𝛼-helical domain, which interacts with Specifically, ORF3 (movement protein) in GRV migrates 2 BioMed Research International intonucleolusviaCBs,bindswithfibrillarininnucleolus, In addition, the web server I-TASSER [17] was adopted relocates into cytoplasm, and finally assembles with virus to predict the 3D structure of NbFib2 using PDB ID 1g8sA ribonucleoprotein (vRNP) particles in cytoplasm for long- (Pyrococcus horikoshii fibrillarin) as template (Figure 1(c)). distance movement and systemic infection. It was also The analysis predicted that NbFib2 belongs to 𝛼/𝛽 proteins, a reported that fibrillarin interacts with viral genome-linked class of structural domains in which the secondary structure protein (VPg) in Potato virus A (PVA) and 2b silencing is composed of alternating 𝛼-helices and 𝛽-sheets along the suppressor protein in Cucumber mosaic virus (CMV) [11, 12]. backbone. The 𝛽-sheet is an external structure while most It is known that genes in some plants can be cosuppressed part of 𝛼-helices is internal. The GAR region mostly consists if the plants are transformed with homologous transgenes. of hairpin-𝛽 motif. The RNA binding region forms a 𝛽-𝛼-𝛽 This mechanism will result in new types of intercellular motif and is randomly coiled (Figure 1(c)). The C-terminal communication and viral defense mechanisms [13]. The event region is 𝛼-helical and the subcellular localization signal in which virus vectors carrying the host-derived sequence peptide is exposed outside. silence the corresponding host genes in the infected plants Furthermore, the quality and reliability of protein struc- is defined as virus-induced gene silencing (VIGS) [14]. VIGS ture prediction were evaluated by several assessment methods is the manifestation of an RNA-mediated defense mechanism including the C-score, TM-score, and root-mean-square and is believed to be a fast and powerful method to determine deviation (RMSD). A high C-score indicates a high confi- gene function. In 2001, a novel VIGS vector TRV was dence in prediction and vice versa, and a protein structure modified from the RNA virus Tobacco rattle virus.Thisvector prediction is considered as reliable if its C-score is in the range successfully silenced endogenous genes such as phytoene between −5 and 2. Similarly, the TM-score and RMSD are desaturase (PDS)andNicotiana FLO/LEY (NFL)inNicotiana often used to measure the accuracy of a structure modeling benthamiana (N. benthamiana)[14].Sincethen,TRVhas if a reference structure is known. For NbFib2, the C-score become the most commonly used vector in VIGS studies. The of the predicted structure is −2.34, and the estimated TM- vector used in this study is a modified TRV [15], containing score and RMSD are 0.44±0.14 and 11.8±4.5 A,˚ respectively. duplicated CaM V 35S promoters (2 × 35S) and nopaline These results suggest that the predicted structure of NbFib2 is synthase terminator (NOSt) in the C-terminal, which can reliable and closely matched to the topology of the reference ensure the accumulation of viral RNAs to a higher level. The protein (e.g., PDB ID 1g8sA). new TRV has two genomes, designated as pTRV1 and pTRV2 We then used the COFACTOR server [18] to predict the (pYL156). The latter genome acts as a VIGS vector containing functions of NbFib2, including the Enzyme Classification multiple cloning sites. (EC) number, Gene Ontology (GO), and protein-ligand Fibrillarin2 is a protein in Fibrillarin family. In our previ- binding sites, using the protein structure (PDB ID: 2ipxA) ous study, we identified the importance of fibrillarin2 from as template. The analysis found that NbFib2 and human N. benthamiana (NbFib2) during the process of Rice stripe fibrillarin share high similarity in ligand-binding sites with an virus infection [16]. However, the subcellular localization, EC value of 0.437, which is in the range of reliable scores for the 3D structure, and functions of NbFib2 were not fully theECprediction.Thus,theNbFib2proteinmightalsohave resolved. The objectives of this study include (1) determining a similar active site at residue 131 like human fibrillarin. The the subcellular localization of fibrillarin2 in N. benthami- function prediction also revealed that NbFib2 has a NLS and ana (NbFib2); (2) predicting the functional domains and 3 several regions targeting nucleolus and Cajal body, suggesting dimensional(3D)structureofNbFib2;and(3)identifyingthe that the protein may localize in the two organelles. These rolesofNbFib2inplantgrowthanddevelopment. predictions were confirmed by DAPI staining.

2. Results 2.2. Subcellular Localization of NbFib2 and Western Blot 2.1. The Functional Domains of NbFib2. NbFib2 is highly Analysis. NbFib2wasfoundinthenucleolusandCajalbody homologous to AtFib2 (GenBank accession AAG10153) and of N. benthamiana epidermis cells by DAPI fluorescence HsaFib (GenBank accession AAH19260). They share more staining (Figure 2). Western blot analysis was used to confirm than 74% amino acid sequence identity (Figure 1(a)). The the protein expression. In the leaves inoculated with Agrobac- protein has 314 amino acid (aa) residues and consists of three terium carrying 35S-GFP, a protein with a molecular weight functional regions including a GAR region, an RNA binding identical to GFP (30 kD) was detected, while, in the leaves region, and an 𝛼-helical region (Figure 1(b)). The GAR region inoculated with Agrobacterium carrying pEarley101-NbFib2, has 61 aas (aa8–68) and is located in the N-terminal of the a protein with a molecular weight identical to NbFib2-YFP protein. The RNA binding region is seated in the middle of the fusion protein (75 kD) was detected. The two proteins were protein (aa131–221), in which the most possible interaction not detected in the negative control leaves (Figure 3). sites with RNA are from aa176 to aa183. The 𝛼-helical region is located near to the C-terminal (aa231–279), followed by a nuclear localization signal (NLS, aa307–313) motif. It is worth 2.3. Verifying the Function of NbFib2 Using VIGS. Ten mentioning that proline encoded by aa131 is predicted to days after infiltrating, plants with different genes silenced be involved in most protein activities of NbFib2. The RNA started to develop different phenotypes. At about three weeks binding region, 𝛼-helical region, and NLS together form a postinoculation (dpi), the obvious and typical symptoms MTase domain, implying that NbFib2 will localize in nucleus. appeared. The leaves in the NbPDS-silenced plants (positive BioMed Research International 3

𝛼 1 TT TTT TTT TT NbFib2 1 MV AP TRGRG GGGF RGGRG DG G G RGGR GG RGGF GGGRG GGG SA..MK RGGG RG GGGR GG GG ...... RG G GRG GRG ....G G FK G AtFib2 1 .M R P PLTGS GGGF SGGRG RG G Y SGGR .G DGGF SGGRG GGG ...... RGGG RG FSDR GG RG RGRGPPRG G ARG GRG PAGRG G MK G HsaFib 1 .M KP GFSPR GGGF .GGRG GF G D RGGR GG RGGF GGGRG RGG GFRGRG RGGG GG GGGG GG GG RGGGGFHS G GNR GRG RGGKR G NQ S Consensus > 50 ...P.....GGGF.GGRG..G..GGR.G.GGF.GGRG.GG...... RGGG.G....GG.G...... G...GRG.....G...

𝛽1 𝛽2 𝛽3 𝛽4 𝛼2 𝛽5 TT TTTTT TT NbFib2 73 G NK V VVEPHRH GGVFI A KGKEDAL CTKNLVPGE AVY NEKR IS VQN ED GT K VEYR VWNPFRSKLAAA VLGGVD DI WIKPGAKVLY AtFib2 77 G SK V IVEPHRH AGVFI A KGKEDAL VTKNLVPGE AVY NEKR IS VQN ED GT K TEYR VWNPFRSKLAAA ILGGVD NI WIKPGAKVLY HsaFib 83 G KN V MVEPHRH EGVFI C RGKEDAL VTKNLVPGE SVY GEKR VS ISE GD .D K IEYR AWNPFRSKLAAA ILGGVD QI HIKPGAKVLY Consensus > 50 G..V.VEPHRH.GVFI..GKEDAL.TKNLVPGE.VY.EKR!S!.#.D..K.EYR.WNPFRSKLAAA!LGGVD#I.IKPGAKVLY

𝛼3𝛽6𝛼4 𝜂1𝛽7 𝛼5𝛽8 𝛼6 NbFib2 157 LGAASGTTVSHVSD LVGP GG VVYAVEFSHRSGRDL V T MAKKRT D VIP II GDARHP AKY GML V GMVDVIF SDVAQPDQ ARI LALN AtFib2 LGAASGTTVSHVSD LVGP EG CVYAVEFSHRSGRDL V N MAKKRT N VIP II EDARHP AKY RML V GMVDVIF SDVAQPDQ ARI LALN HsaFib 166 LGAASGTTVSHVSD IVGP DG LVYAVEFSHRSGRDL I N LAKKRT N IIP VI EDARHP HKY RML I AMVDVIF ADVAQPDQ TRI VALN Consensus > 50 LGAASGTTVSHVSD.VGP.G.VYAVEFSHRSGRDL!.$AKKRT#!IP!I.DARHP.KY.ML!.MVDVIF.DVAQPDQ.RI.ALN

𝛽9 𝛼7 𝛼8𝛽10 𝛽11 TT TT NbFib2 241 A SY FL KA GGHFV MSIKANCIDST V PAEAVF AQ EVKK LQ AE Q FKP MEQ VTLEP FERDHA CVVG AYR AP KK QK A AA.. AtFib2 A SY FL KS GGHFV ISIKANCIDST V PAEAVF QT EVKK LQ QE Q FKP AEQ VTLEP FERDHA CVVG GYR MP KK PK A ATAA HsaFib 250 A HT FL RN GGHFV ISIKANCIDST A SAEAVF AS EVKK MQ QE N MKP QEQ LTLEP YERDHA VVVG VYR PP PK VK N .... Consensus > 50 A..FL..GGHFV.SIKANCIDST..AEAVF..EVKK$Q.E#.KP.EQ.TLEP%ERDHA.VVG.YR.P.K.K..... (a)

𝛼 GAR-region “R region” “ -rich region”

Phe 131 RBS NLS

GAR 𝛼

[176–183] GVVYAVEF

N-terminal domain MTase domain (b) (c)

Figure 1: The structural and functional characteristics of NbFib2. (a) Amino acid sequence alignment among NbFib2, AtFib2, andHsaFib; (b) a sketch of functional domains in NbFib2; (c) 3D model of NbFib2 in which the 𝛼-helical is in red, the 𝛽-sheet region is in yellow, random coil region is in green, and NLS is in cyan. control, Figure 4(e)) became bleach, whereas the NbFib2- domains: a GAR domain, an RNA binding domain, and silenced plants developed curved interior leaves (Figures 4(a), an 𝛼-helix domain (Figure 1(a)). This result confirmed that 4(b), and 4(d)), chlorotic full leaves (Figures 4(b), 4(c), and NbFib2 belong to fibrillarin protein family, a highly con- 4(d)), and short stems and internodes and became stunted served protein family. Previous literatures suggested that (Figure 4(d)). However, the plants infiltrated by empty TRV the GAR domain always contains a nucleolar localization vector (negative control, Figure 4(e)) did not develop any signalandatargetingsiteoffibrillarinincells[19,20].In symptom. addition, hordeiviral movement protein encoded by the first gene of the triple-gene block (TGBp1)inPoa semilatent virus 2.4. Semiquantitative RT-PCR Analysis of NbFib2 mRNA (PLSV) interacts with AtFib, which occurs between the GAR Accumulation. The accumulation of NbFib2 transcripts in domain of AtFib and the N-terminal of TGBp1 [21]. Thus, the silenced plants is significantly lower than that in the plants GAR region might be a potential domain for NbFib2 to inoculated by empty TRV vector. A negative correlation bind to virus-encoded proteins. The RNA-binding domain was shown between transcript level of target genes and the is essential for the presence of fibrillarin in nucleoli [5]. The severity of disease symptom; in other words, silenced plants C-terminal region of Fib protein family always contains two with lower transcript in target genes usually showed more short sequences: one sequence forms an 𝛼-helix structure, severe symptoms (Figure 5). which targets fibrillarin to CBs [1] and interacts directly with Nop56 [6]; the other is NLS. The feature of those domains 3. Discussion reveals that NbFib2 should localize in cell nucleus, which is consistent with the results of subcellular localization of NbFib2 shares high homology (more than 74% aa sequence) NbFib2. The RNA-binding region and C-terminal region with AtFib2 and HsaFib; they also contain three functional can constitute a conserved methyltransferase- (MTase-) like 4 BioMed Research International

(a) (b)

(c) (d)

(e) (f)

Figure 2: Subcellular localization of pEarley101-NbFib2. (a) pEarley101-NbFib2; (b) bright field; (c) overlay of (a) and (b); (d) DAPI fluorescence staining; (e) pEarley101-NbFib2; (f) overlay of (d) and (e). BioMed Research International 5

M 12 3 small nucleolar ribonucleoproteins (snoRNPs) [25], which is an essential type of proteins for plant growth [26, 27]; (kD) fibrillarin also plays important roles on biogenesis of different 72 RNAs and ribosomal subunits; those show that NbFib2 is 65 involved in plant growth and development. An important feature of plant viral proteins is to interact with fibrillarin, and this feature is not restricted to one or 34 two taxonomic groups [28]. Plant viruses are able to recruit fibrillarin to facilitate their infections in various stages of host 26 development, which suppresses host defense responses. Since N. benthamiana isamodelplantandthehostofmanyplant viruses, the study of the functions, genetics, and 3D structure of NbFib2 in N. benthamiana is important to understand the interaction between hosts and viruses.

4. Materials and Methods Figure 3: Western blot analysis of proteins expression in Nicotiana benthamiana leaves.M:proteinMarker;1:pEarley101-NbFib2;2:CK 4.1. Plant Growth Conditions. The N. benthamiana plants ∘ (healthy plant); 3: 35S-GFP. were grown and maintained in a greenhouse at 25 C.

4.2. Plasmid Construction. Total RNA was extracted from domain, responsible for the methyltransferase activity of N. benthamiana leaves using EasyPure Plant RNA Kit fibrillarin [22]. manufactured by Beijing Transgen Biotech Co. Ltd. (Bei- 3D structure is very important in determining the func- jing, China). Reverse transcription was carried out using tions and localizations of proteins and their interaction with FastQuant RT Kit with gDNase (Tiangen biotech Co., Ltd., other molecules. Homology modeling is one of the most Beijing, China). cDNA encoding NbFib2wereamplifiedby popular approaches to predict the 3D structure of proteins. PCR using primers (Table 1) designed from N. benthamiana Homology modeling requires the identification of a template sequences (GenBank accession: AM269909) downloaded sequence that matches best the query sequence. The template from GenBank. pEarley101-NbFib2 construct was generated could be identified using homology search programs such by first inserting NbFib2 cDNA into entry vector pDONR221 as PSI BLAST against a PDB database using consecutive or and then inserting it into destination vector pEarley101 using spaced seed techniques [23, 24]. Gateway recombination system [29]. The constructs were The COFACTOR server [18] provides a variety of anno- confirmed by capillary sequencing conducted by Takara tations for functional prediction of proteins. One of the most Biotechnology Co., Ltd. (Dalian, China). The PCR product of important advantages of this algorithm is the combination NbFib2 was digested with EcoRI and BamHI and ligated into of the global and local structural comparisons. In addition, TRV vector by pYL156 digested with the same enzymes. sinceCOFACTORtakesintoaccountfortheglobalstructure similarity,itismorerobustthanthosemethodsrelyingon 4.3. Functional Domain and 3D Structure Prediction of only local pocket comparisons. NbFib2. Fibrillarin sequences from three species were VIGS has emerged as a powerful method to study gene aligned using the MAFFT program (http://mafft.cbrc.jp/ functions. In this study, we used VIGS to identify the alignment/software). Subcellular localization of NbFib2 function of NbFib2 in the growth and development of N. was predicted by web-based program WolfPsort (http:// benthamiana.WefoundthatNbFib2-silenced N. benthami- wolfpsort.org/). Homology identification was performed by ana plants developed growth retardation, organ deformation, submitting NbFib2 sequences into I-TASSR server [17] and and necrosis. The silenced plants can be divided into two the 3D structure of NbFib2 was constructed using Pyrococcus groups based on the reduction of NbFib2 expression. The horikoshii fibrillarin as reference on the same server. The plants in the first group developed curly, chlorotic, and NbFib2 function was predicted using the COFACTOER deformed interior and upper leaves, while those in the second server [18]. group developed deformed upper leaves developed shortened stems and internodes and became stunted. No symptoms were developed on the negative control (TRV-empty). We 4.4. Agrobacterium-Mediated Transient Expression. Agrobac- believe that the symptoms of NbFib2 infiltrated plants are terium tumefaciens (A. tumefaciens) strain EHA105 carry- caused by the inserted sequences carried by TRV vector ing pEarley101-NbFib2 and strain GV3101 carrying pTRV1, rather than TRV vector itself. The RT-PCR result confirms pYL156-NbPDS, and pYL156-NbFib2 were grown separately ∘ that the difference in symptom development is consistent to OD600 =0.8at28C on Luria-Bertani liquid medium with the level of NbFib2 silenced, suggesting that NbFib2 supplemented with 50 𝜇g/𝜇L of rifampicin and 50 𝜇g/𝜇L contributes to the growth and development of plants in N. of kanamycin. They were then transferred to induction benthamiana. Fibrillarin is a catalytic component of box C/D media (10 mM MES, pH 5.6, 10 mM MgCl2,and150𝜇M 6 BioMed Research International

(a) (b)

(c) (d)

(e) (f)

Figure 4: Symptoms of NbFib2 silenced Nicotiana benthamiana.(a)–(d)NbFib2 silenced; (e) NbPDS (positive control); (f) empty pYL156 (negative control). BioMed Research International 7

Table 1: The sequences, homologous recombination, and restriction sites of PCR primers.

󸀠 󸀠 a Primer and purpose Sequence (5 → 3 ) Modification Construction for entry vector pDONR221 NbFib2-GF ggggacaagtttgtacaaaaaagcaggcttcATGGTTGCACCAACTAGAGG Homologous recombination NbFib2-GR ggggaccactttgtacaagaaagctgggtcGGCAGCAGCCTTCTGCTTCT Homologous recombination Construction for VIGS vector pYL156 NbFib2-VF CGgaattcCGATGGTTGCACCAACTAGAGGTCGCG EcoRI NbFib2-VR CGggatccCGTTAAATTTTCTAGGCAGCAGCCTTC BamHI Semiquantitative RT-PCR analysis of NbFib2 mRNA accumulation NbFib2-F ATGGTTGCACCAACTAGAGGTCGCG NbFib2-R GGCAGCAGCCTTCTGCTTCTTCCGGC Actin-F ACTGATGAAGATACTCACAGA Actin-R TGGAATTGTATGTGGTTTCAT a The lowercased letters indicate homologous recombination sequence or a restriction enzyme site.

123 Subcellular localization of NbFib2 was determined by infiltrating culture of A. tumefaciens harboring pEarley101- NbFib2 onto the fully grown upper leaves of N. benthamiana. In the VIGS assay, A. tumefaciens harboring pTRV1 was mixedinanequalproportionofA. tumefaciens harboring (a) pYL156-NbFib2,pYL156-NbPDS,andemptypYL156,respec- tively (V/V), and infiltrated onto fully grown upper leaves. Six-week-old N. benthamiana was used for the experiment.

(b) 4.5. Confocal Imaging Analysis. Subcellular locations of pro- teins were monitored at 48 hours after infiltration under a confocal microscope (Microsystems CMS GmbH Leica TCS SP5). The fluorophores in YFP were excited using 514 nm light and images were taken using BA535–565-nm emission filters. To locate the fluorescent proteins in nuclei, the N. (c) benthamiana leaves were infiltrated with PBS containing 󸀠 󸀠 4 ,6 -diamidino-2-phenylindole (DAPI).

4.6. Western Blot Analysis. Leaves infiltrated with pEarley101- (d) NbFib2 (500 mg) were collected at 72 hours after inocula- tions and homogenized in 2 mL extraction buffer containing 50 mM phosphate (pH 8.0), 10 mM Tris (pH 8.0), 500 mM NaCl, 0.1% Tween20, 0.1% NP-40, 0.1% 𝛽-mercaptoethanol, 1 mM PMSF, and 1/4 of Roche Protease inhibitor cocktail MINI tablet. The crude extracts were centrifuged at 12,000 g (e) for 10 minutes. Supernatant was transferred into a new cen- trifuge tube and centrifuged at 12,000 g for another 15 min- utes. 10 𝜇L supernatant was mixed with 2 𝜇L5× SDS-PAGE loading buffer. Proteins in the extracts were separated by electrophoresisin12%SDS-PAGEat80Vfor1hourandthen (f) at 120 V for another 40 minutes. Proteins in gels were trans- ferred onto polyvinylidene difluoride PVDF membranes by Figure 5: RT-PCR analysis for the accumulation of NbFib2 tran- Electrophoresis Cell at 60 V for 1 hour (Beijing WoDeLife script in Nicotiana benthamiana plants. 1–3: PCR amplifications after Sciences Instrument Company, Beijing, China) and probed 30, 35, and 40 cycles; (a)–(d) amplifications of NbFib2 from corre- with a rabbit-anti-GFP polyclonal antibody (GenScript Co., sponding plants shown in Figures 4(a)–4(d); (e)-(f) amplifications Ltd.,Nanjing,China).Thepolyclonalantibodywasagoat- of NbFib2 and Nbactin from plant in Figure 4(f). anti-rabbit IgG conjugated with alkaline phosphatase (Sigma, St. Louis, MO, USA) and used at 1 : 10000 (V/V) dilution. Acetosyringone). The induction media with the bacterial Proteins on the membrane were visualized by NBT-BCIP cultures were incubated at room temperature for 3 hours. solution (Promega). 8 BioMed Research International

4.7. Analysis of Fibrillarin-Silenced Plants. When silenced protein targets into the nucleus and binds primarily via its plants developed symptoms, which usually occur at ten C-terminal NLS2/NoLS to nucleolin and fibrillarin,” Virology days after infiltration, the upper leaves of silenced plants Journal,vol.9,no.1,article167,13pages,2012. were collected and maintained individually. Accumulation [9] S. H. Kim, E. V. Ryabov, N. O. Kalinina et al., “Cajal bodies and of NbFib2 mRNA was analyzed by RT-PCR using specific the nucleolus are required for a plant virus systemic infection,” primers (Table 1) designed from NbFib2 sequence. NbActin The EMBO Journal,vol.26,no.8,pp.2169–2179,2007. mRNA (280 bp) was used as a control for the RT-PCR [10] H. K. Sang, S. MacFarlane, N. O. Kalinina et al., “Interaction of analysis. a plant virus-encoded protein with the major nucleolar protein fibrillarin is required for systemic virus infection,” Proceedings of the National Academy of Sciences of the United States of Conflict of Interests America, vol. 104, no. 26, pp. 11115–11120, 2007. [11] I. Gonzalez,L.Mart´ ´ınez,D.V.Rakitinaetal.,“Cucumber mosaic The authors declare no conflict of interests. virus 2b protein subcellular targets and interactions: their significance to RNA silencing suppressor activity,” Molecular Authors’ Contribution Plant-Microbe Interactions, vol. 23, no. 3, pp. 294–303, 2010. [12] M.-L. Rajamaki¨ and J. P. T. Valkonen, “Control of nuclear and Luping Zheng and Jinai Yao contributed equally to this work. nucleolar localization of nuclear inclusion protein a of picorna- like Potato virus ainNicotiana species,” Plant Cell,vol.21,no.8, Acknowledgments pp. 2485–2502, 2009. [13] D. Robertson, “VIGS vectors for gene silencing: many targets, This work was supported by National Science Foundation many tools,” Annual Review of Plant Biology,vol.55,pp.495– of China (31401715 and 31171821) and Fujian Science and 519, 2004. Technology Agency of China (2014R1024-3). The authors [14] F. Ratcliff, A. M. Martin-Hernandez, and D. C. Baulcombe, “Technical advance: tobacco rattle virus as a vector for analysis would like to thank Dr. Xinzhong Cai at Zhengjiang Univer- of gene function by silencing,” Plant Journal,vol.25,no.2,pp. sity, China, for providing vectors TRV and plasmid pYL156- 237–245, 2001. NbPDS and Dr. Aiming Wang at AAFC-Southern Crop [15] Y. Liu, M. Schiff, and S. P. Dinesh-Kumar, “Virus-induced gene Protection and Food Research Centre, Canada, for providing silencing in tomato,” Plant Journal,vol.31,no.6,pp.777–786, vectors pDNOR221 and pEarleyGate101. The authors also 2002. thank Dr. Jiasui Zhan at Fujian Agriculture and Forestry [16] L. Zheng, Z. Du, C. Lin et al., “Rice stripe tenuivirus p2 may University, China. recruit or manipulate nucleolar functions through an inter- action with fibrillarin to promote virus systemic movement,” References Molecular Plant Pathology,vol.16,no.9,pp.921–930,2015. [17] A. Roy, A. Kucukural, and Y. Zhang, “I-TASSER: a unified plat- [1] S. Snaar, K. Wiesmeijer, A. G. Jochemsen, H. J. Tanke, and R. form for automated protein structure and function prediction,” W. Dirks, “Mutational analysis of fibrillarin and its mobility in Nature Protocols,vol.5,no.4,pp.725–738,2010. living human cells,” The Journal of Cell Biology,vol.151,no.3, [18] A. Roy, J. Yang, and Y. Zhang, “COFACTOR: an accurate pp.653–662,2000. comparative algorithm for structure-based protein function [2] J. P. Aris and G. Blobel, “cDNA cloning and sequencing of annotation,” Nucleic Acids Research,vol.40,no.1,pp.W471– human fibrillarin, a conserved nucleolar protein recognized by W477, 2012. autoimmune antisera,” Proceedings of the National Academy of [19] K. T. Pih, M. J. Yi, Y. S. Liang et al., “Molecular cloning Sciences of the United States of America, vol. 88, no. 3, pp. 931– and targeting of a fibrillarin homolog from Arabidopsis,” Plant 935, 1991. Physiology,vol.123,no.1,pp.51–58,2000. [3] D.L.Pearson,R.D.Reimonenq,andK.M.Pollard,“Expression [20] S. A. Levitskiy, K. S. Mukharyamova, V. P. Veiko, and O. and purification of recombinant mouse fibrillarin,” Protein V. Zatsepina, “Identification of signal sequences determining Expression and Purification,vol.17,no.1,pp.49–56,1999. the specific nucleolar localization of fibrillarin in HeLa cells,” [4] Q. Liu and G. Dreyfuss, “In vivo and in vitro arginine methyla- Molecular Biology,vol.38,no.3,pp.405–413,2004. tion of RNA-binding proteins,” Molecular and Cellular Biology, [21] M. A. Semashko, I. Gonzalez,´ J. Shaw et al., “The extreme vol. 15, no. 5, pp. 2800–2808, 1995. N-terminal domain of a hordeivirus TGB1 movement protein [5]V.V.Barygina,V.P.Veiko,andO.V.Zatsepin,“Analysisof mediates its localization to the nucleolus and interaction with nucleolar protein fibrillarin mobility and functional state in fibrillarin,” Biochimie, vol. 94, no. 5, pp. 1180–1188, 2012. living HeLa cells,” Biochemistry, vol. 75, no. 8, pp. 979–988, 2010. [22] H. Wang, D. Boisvert, K. K. Kim, R. Kim, and S.-H. Kim, “Crys- [6] T. Lechertier, A. Grob, D. Hernandez-Verdun, and P. Roussel, tal structure of a fibrillarin homologue from Methanococcus “Fibrillarin and Nop56 interact before being co-assembled in jannaschii,ahyperthermophile,at1.6A˚ resolution,” The EMBO box C/D snoRNPs,” Experimental Cell Research,vol.315,no.6, Journal,vol.19,no.3,pp.317–323,2000. pp.928–942,2009. [23] S. F. Altschul, T. L. Madden, A. A. Schaffer¨ et al., “Gapped [7]D.Tollervey,H.Lehtonen,M.Carmo-Fonseca,andE.C.Hurt, BLAST & PSI-BLAST: a new generation of protein database “The small nucleolar RNP protein NOP1 (fibrillarin) is required search programs,” Nucleic Acids Research,vol.25,no.17,pp. for pre-rRNA processing in yeast,” The EMBO Journal,vol.10, 3389–3402, 1997. no. 3, pp. 573–583, 1991. [24] J. Yang and L. Zhang, “Run probabilities of seed-like patterns [8] K. Melen,´ J. Tynell, R. Fagerlund, P. Roussel, D. Hernandez- and identifying good transition seeds,” Journal of Computa- Verdun, and I. Julkunen, “Influenza A H3N2 subtype virus NS1 tional Biology, vol. 15, no. 10, pp. 1295–1313, 2008. BioMed Research International 9

[25] J. Venema and D. Tollervey, “Ribosome synthesis in Saccha- romyces cerevisiae,” Annual Review of Genetics,vol.33,no.1,pp. 261–311, 1999. [26] D. L. J. Lafontaine and D. Tollervey, “Nop58p is a common component of the box C+D snoRNPs that is required for snoRNA stability,” RNA,vol.5,no.3,pp.455–467,1999. [27] A. Narayanan, W. Speckmann, R. Terns, and M. P. Terns, “Role of the box C/D motif in localization of small nucleolar RNAs to coiled bodies and nucleoli,” Molecular Biology of the Cell,vol.10, no.7,pp.2131–2147,1999. [28] M. E. Taliansky, J. W.S. Brown, M. L. Rajamaki,¨ J. P.T. Valkonen, and N. O. Kalinina, “Involvement of the plant nucleolus in virus and viroid infections. Parallels with animal pathosystems,” Advances in Virus Research,vol.77,pp.119–158,2010. [29] M. Karimi, D. Inze,´ and A. Depicker, “GATEWAY vectors for Agrobacterium-mediated plant transformation,” Trends in Plant Science,vol.7,no.5,pp.193–195,2002. Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 623121, 12 pages http://dx.doi.org/10.1155/2015/623121

Research Article Mining for Candidate Genes Related to Pancreatic Cancer Using Protein-Protein Interactions and a Shortest Path Approach

Fei Yuan,1 Yu-Hang Zhang,1 Sibao Wan,2 ShaoPeng Wang,2 and Xiang-Yin Kong1

1 Institute of Health Sciences, Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS) & Shanghai Jiao Tong University School of Medicine (SJTUSM), Shanghai 200031, China 2CollegeofLifeSciences,ShanghaiUniversity,Shanghai200444,China

Correspondence should be addressed to Xiang-Yin Kong; [email protected]

Received 6 September 2015; Accepted 15 October 2015

Academic Editor: Jialiang Yang

Copyright © 2015 Fei Yuan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Pancreatic cancer (PC) is a highly malignant tumor derived from pancreas tissue and is one of the leading causes of death from cancer. Its molecular mechanism has been partially revealed by validating its and tumor suppressor genes; however, the available data remain insufficient for medical workers to design effective treatments. Large-scale identification of PC-related genes can promote studies on PC. In this study, we propose a computational method for mining new candidate PC-related genes. A large network was constructed using protein-protein interaction information, and a shortest path approach was applied to mine new candidate genes based on validated PC-related genes. In addition, a permutation test was adopted to further select key candidate genes. Finally, for all discovered candidate genes, the likelihood that the genes are novel PC-related genes is discussed based on their currently known functions.

1. Introduction crucial to study this severe disease. Similar to most types of tumors, PC is induced by both environmental and hereditary The pancreas is a significant versatile organ of the digestive elements. Extrinsic factors such as age, gender, race, cigarette system and endocrine system; it is significant because it assists smoking, and obesity are all factors that may contribute to in digestion and maintains hormonal balance via digestive tumor initiation [3–5]. Further, certain chronic pancreas- enzymes and specific hormones, respectively. Pancreatic associated diseases, such as diabetes mellitus and chronic cancer (PC) originates from the pancreas, has a high rate of pancreatitis, are also related to PC [6, 7]. metastasis, and is a highly malignant tumor derived from Over the last decade, the genetic background for PC, pancreas tissue. Among various types of PC, pancreatic especially pancreatic adenocarcinoma (PAC) that comprises ductal adenocarcinoma (PDAC) accounts for more than 90% most cases, has been revealed through validating a list of of all pancreatic tumors. PDAC is a malignancy with a oncogenes and tumor suppressor genes for PC. Based on poor prognosis, which is demonstrated through its one-year hereditary features, mutations in pancreatic adenocarcinoma survival rate of approximately 18% for all stages of the disease can be divided into two clusters defined as common somatic [1]. In the Western world, pancreatic cancer is one of the mutations and germline mutations. A somatic mutation, top killers for human beings [2]. In 2012 alone, it resulted KRAS, is regarded as the earliest and key mutation in non- in 33000 deaths all over the world. In the Western world, familial PAC initiation, and it aids in maintaining invasion pancreatic cancer is the fourth leading cause of death from status and tumor progression [8]. In addition to tumor cancer with a poor prognosis (5-year survival in less than 5% development, more mutated genes contribute to malignant of cases according to most reports). Such a high fatality rate phenotypes. The tumor suppressor genes p16/INK4A are is attributed to the low rate of diagnosis at early age. Only a significant somatic mutations and are downregulated in minority of patients can receive proper treatment with a 5- pancreatic adenocarcinoma [9, 10]. In many types of tumors, year survival rate up to 22%. Therefore, it is significant and excessive activation of the TGF-𝛽 pathway is a mechanism of 2 BioMed Research International tumor progression and invasion [11]. Another pair of tumor “protein.links.v9.1.txt.gz” and selected protein interactions suppressor genes, SMAD/DPC4, is involved in PAC through beginning with “9606.”, which produced 2,425,314 human regulating the TGF-𝛽 pathway and is critical to advanced PPIs. Each PPI includes two proteins and one score, which tumors [12]. measures the strength of the interaction in the range between Further, heredofamilial pancreatic adenocarcinoma is 150 and 999. For the formulation, let us denote the score of associated with certain other significant genes with more an interaction between proteins 𝑝1 and 𝑝2 as 𝑄𝑖(𝑝1,𝑝2). complex mechanisms. As indicated by the available literature, The aforementioned PPIs were used to construct a large most such genes participate in the DNA repair process, such network by taking proteins as nodes. Two nodes are adjacent as MSH1/2, PMS1/2, and BRCA1/2, which may participate if and only if the corresponding proteins comprise an inter- in nonspecific tumor induction [13–15]. However, in several action that is a member of the 2,425,314 PPIs. Furthermore, inherited familial conditions, specific known mutations have the interaction score should also be added to the constructed not been identified, which may hint at the complexity of network. To generate compatibility between our network and carcinogenic mechanisms and the potential oncogenes as well ashortestpathapproach,eachedgee was assigned a weight as tumor suppressor genes [16]. 𝑤(𝑒) as follows: 𝑤(𝑒) = 1000𝑖 −𝑄 (𝑝1,𝑝2),where𝑝1 and 𝑝2 To predict more potential tumor-related genes, we pro- are corresponding proteins of the endpoints of 𝑒. posed a new method that considers protein interactions Two proteins that can interact with each other always from STRING (Search Tool for the Retrieval of Interacting share similar attributes [19–23]. Further, considering the Genes/Proteins) [17] and mines potential PC-related genes. interaction score, this notion can be generalized as follows: STRING is a database with massive amounts of information twoproteinsinaninteractionwithahighscorearemore on physical and functional associations between different likely to share similar attributes than those with a low score. proteins. With an established score system, STRING enables a Moreover, if we consider a series of proteins 𝑝1,𝑝2,...,𝑝𝑛 user to search and browse the protein interactions data as well such that two consecutive proteins can comprise an inter- as simultaneously quantify the statistic cooccurrence in the action with a high score, then these proteins may all share background [17]. Here, depending on the method and using some common attributes. By mapping these proteins onto a comprehensive analysis of the protein interaction network, the constructed network, they may comprise a shortest path we fully utilized the database containing reported PC-related connecting 𝑝1 and 𝑝𝑛,whichisapathconnecting𝑝1 and 𝑝𝑛 genes and predicted potential genes involved in PC. such that the summation of the weights of edges on the path is minimum. Thus, we searched all shortest paths connecting any two PC-related genes. For two consecutive nodes in each 2. Materials and Methods of these paths, their corresponding proteins can comprise a PPI with a high score because they lie on a shortest path. 2.1. Materials. PC-related genes were retrieved from KEGG As mentioned above, they can share similar functions. For PATHWAY, which is one of the main databases in KEGG a specific shortest path 𝑝1,𝑝2,...,𝑝𝑛,where𝑝1 and 𝑝𝑛 are (Kyoto Encyclopedia of Genes and Genomes) [18]. By exam- encoded by PC-related genes, 𝑝2 shares similar functions with ining the pathway hsa05212, pancreatic cancer (http://www 𝑝1;thatis,𝑝2 may be encoded by an invalidated or known .genome.jp/dbget-bin/www bget?hsa05212, accessed in De- PC-related gene; 𝑝3 shares similar functions with 𝑝2;thus𝑝3 cember 2014), we obtained 65 PC-related genes, which may also share similar functions with 𝑝1 and may be encoded comprisethegenesetS. Detailed information on these genes by an invalidated or known PC-related gene. This can be is listed in Supplementary Material I available online at induced to 𝑝4,𝑝5,...,𝑝𝑛−1. Thus, we extracted genes in these http://dx.doi.org/10.1155/2015/623121. shortest paths and excluded those that were members of 65 PC-related genes. These genes were referred to as shortest 2.2. Method for Mining New Candidate Genes in a Protein- path genes and deemed to have special relationships with PC. Protein Interaction Network. Protein-protein interaction The similar scheme has been applied to extract novel genes (PPI) information is useful for investigating protein-related orchemicalsrelatedtootherdiseasesorsomebiological and gene-related problems [19–25]. Most methods for processes [26–29]. In addition, to distinguish those genes, predicting protein attributes are based on the notion that each shortest path gene was assigned a value, which is referred two proteins that interact always share similar attributes to as betweenness and defined as the number of shortest paths [19–23]. Because PC-related genes must share common that contain the gene. The betweenness indicates the direct features related to PC, it is reasonable to use PPI information and indirect associations between shortest path genes and to identify whether a gene is related to PC. Here, we adopted PC-related genes [30]. the PPI information reported in STRING (version 9.1, http://www.string-db.org/) [17], which is a large online database that reports known and predicted protein interac- 2.3. Further Selection. By executing the method mentioned in tions. The PPIs reported in STRING are derived from the Section 2.2, certain shortest path genes can be extracted from following four sources: (1) genomic context, (2) high- the constructed network. However, certain such genes may throughput experiments, (3) (conserved) coexpression, and be false positives. To exclude this type of gene, the following (4) previous knowledge, which imply that the physical and method was adopted. functional associations of the proteins were measured. To Certain nodes in the constructed network are general extract PPI information for humans, we downloaded the file hubs; their corresponding genes may always receive a high BioMed Research International 3 betweenness value even if we randomly selected certain genes 3.3. Analysis of Significant Candidate Genes. Using our as PC-related genes, but the genes exhibit few relationships method, we predicted thirteen genes that may participate with PC. To exclude this type of gene among shortest path in PC. Based on the principle underlying our method, the genes, we randomly produced 1,000 gene sets with the same candidate genes are specifically connected with PC-related size as S andcomparedthebetweennessoftheshortestpath genes. Moreover, such genes have all been reported as genes genes in these sets to S. The detailed procedures are described that are relevant to PC and may exhibit diverse functions in as follows. tumor initiation and invasion, especially in PC. Among the candidate genes, three are inhibitory genes (I) Randomly produce 1,000 gene sets, such as 𝑆1, that may participate in PC through their respective mech- 𝑆2,...,𝑆1000, each of which with the same size as S, anisms. The first candidate gene is NFKBIA (see row 1 the set consisting of PC-related genes. of Table 1, with betweenness 73 and permutation FDR (II) The method described in Section 2.2 was executed for 0.002), which is expressed in pancreatic tissues [31]. NFKBIA 1,000 rounds. For the 𝑖th round, the PC-related genes encodes a member of the NF-kappa-B inhibitor family, which has been confirmed to further participate in interactions with in S were replaced with the genes in 𝑆𝑖,andthe REL dimers to inhibit the NF-kappa-B pathway in processes shortest paths connecting any pair of genes in 𝑆𝑖 were searched, thereby counting the betweenness of each of inflammation immune response and tumorigenesis [32, shortest path gene based on these shortest paths. 33]. Targeted by a specific microRNA, miR196a, NFKBIA hasbeenprovedtobeassociatedwithPC,especiallyinthe (III) For each shortest path gene, there were 1,000 metastasis process [31]. As we have mentioned above, in most betweenness on randomly produced gene sets and tumors including pancreatic cancer, NF-kappa-B pathway one betweenness on S.Aftercomparingthem,we hasbeenwidelyreportedtobeoveractivatedandhasa generated a measurement referred to as permutation close connection with the patients’ prognosis, indicating the FDR, which defines the proportion of randomly underlying relationship between NFKBIA and the pancreatic produced gene sets on which the betweenness was cancer [34–37]. Mutations have been widely reported to larger than that on S. contribute to specific functional alteration of crucial proteins (IV) To exclude shortest path genes with general hubs including NFKBIA in diseases especially in various cancer inthenetworkandfewrelationshipswithPC,we subtypes like pancreatic cancer [38–41]. What is more, as excluded shortest path genes with permutation FDR one of the crucial inhibitory components of NF-kappa-B values equal to or larger than 0.05. The remaining pathway, NFKBIA has been reported to be downregulated shortest path genes are referred to as candidate genes. in cancer [42, 43]. The overexpression of NFKBIA has also been reported to be associated with a better prognosis of various treatment methods in different human tumor sub- 3. Results and Discussion types, especially for the prognosis of patients that have taken alpha 1-adrenoceptor antagonists and radiotherapy [44, 45]. 3.1. Shortest Path Genes. Based on the method described in Further, considering the inflammation associated function Section2.2,wesearchedallshortestpathsconnectingany of such gene, such mutations and expression alteration may pair of PC-related genes and produced 2,080 shortest paths contribute to the process of tumorigenesis through two (each pair of PC-related genes can be connected by a shortest individual regulation mechanisms: proliferation associated path), which are provided in Supplementary Material II. A pathways that involve NF-kappa-B and specific immune graphwiththese2,080shortestpathsisshowninFigure1. response associated pathways in tumor microenvironment The detailed information of edges in this graph is provided [46]. In addition to the NF-kappa-B pathway, the JNK in Supplementary Material III. These 2,080 paths involved pathway is also a specific pathway in tumor initiation and 134 Ensembl gene IDs. We excluded 65 Ensembl IDs for PC- progression, including PC [47, 48]. A specific regulator of related genes, resulting in 69 shortest path genes, which are Rho protein exchange reactions which is crucial for JNK listed in Supplementary Material IV. In addition, we counted pathway, ARHGDIA (see row 2 of Table 1, with between- the betweenness of each shortest path gene, which is also ness 94 and permutation FDR 0.012), was also identified provided in Supplementary Material IV. using our method and functions in several types of tumors [49, 50]. In addition, this apoptosis inhibitory protein has 3.2. Candidate Genes. As mentioned in Section 3.1, several been confirmed to control Rho protein and shortest path genes were retrieved. According to the principle participate in the initiation and progression of PC through underlying the method in Section 2.2, these genes may have the apoptosis associated GDP/GTP exchange reaction via special relationships with PC. However, certain such genes RhoA-Rho pathway [51]. Apoptosis inhibition widely exists may be false positives and have few relationships with PC. in tumor tissues and is an effective way to induce malignant Thus, we performed a permutation test to control for this cells, implying the potential relationship between ARHGDIA type of genes. After calculating the permutation FDR for and pancreatic cancer [52, 53]. As we all know, pathways each shortest path gene, which are listed in Supplementary that contribute to the proliferation and invasion of cancer Material IV, we discarded the genes with a permutation FDR cells (JNK-STAT, RhoA-Rho signaling pathway, etc.) have greater than or equal to 0.05, thereby generating thirteen all been confirmed to involve phosphorylation and dephos- candidate genes, which are listed in Table 1. phorylation process which may be regulated by the survival 4 BioMed Research International

ENSP00000266970ENSP00000362649ENSP00000417281ENSP00000268182 ENSP000002671011 ENSP00000278616 ENSP00000361423 ENSP00000329623 ENSP00000304895 ENSP00000297494 ENSP00000348986 ENSP00000338934 1 1 1 2 1 ENSP00000303830 ENSP000003694971 ENSP00000338018 ENSP00000355249ENSP00000267868ENSP00000269300ENSP0000035212111 1 1 1 ENSP00000345571 ENSP000003025641 1 ENSP00000340944 ENSP00000329357 ENSP000002695712 ENSP00000321410 ENSP00000265171 ENSP00000304283ENSP00000384675 ENSP00000262435 ENSP00000348461 1 1 6 1 ENSP00000289153 ENSP00000343204 1 ENSP00000278568 ENSP00000288986 ENSP000003091031 ENSP00000263025 ENSP00000287647 1 ENSP00000366563 ENSP00000320940ENSP000002629041 1 1 1 1 1 ENSP000002954006 2 5 6 ENSP00000282561 ENSP000002843841ENSP00000262741 11 6 1 ENSP00000366244 1 ENSP00000293288 ENSP00000332973 5 ENSP000002680588 1 1 ENSP000003611201 1 ENSP000003543941 2 ENSP00000358022 1 4 2 ENSP00000254066 1 1 1 ENSP00000330237 1 1 ENSP00000341551 1 1 1 ENSP00000298316 1 1 ENSP00000361125 ENSP00000309845ENSP00000262160 1 1 1 2 1 1 1 3 1 1 3 2 ENSP00000375892ENSP00000371067 1 1 1 1 ENSP00000250894ENSP00000274335 1 2 1 2 1 1 1 1 2 3 6 ENSP00000355896ENSP000003478581 1 ENSP00000355153 1 1 4 5 ENSP00000384515 1 1 4 1 2 1 1 ENSP00000005257 1 1 11 1 8 11 ENSP00000003084 3 ENSP00000288602 2 2 1 1 3 1 ENSP00000269321 1 1 2 11 3 1 1 1 2 ENSP000000193172 ENSP00000251849 1 1 1 3 ENSP000002626131 2 1 3 ENSP0000026403332 1 ENSP000003427931 10 2 1 1 ENSP00000353483 1 1 ENSP00000344818 ENSP00000272519 ENSP000002288722 1 1 ENSP000003521571 13 1 1 1 ENSP00000349467 1 1 1 2 1 1 11 ENSP000002275071 ENSP00000341189ENSP000002506171 1 1 2 1 1 3 1 1 1 ENSP00000215832ENSP00000046794 ENSP00000249071 3 1 1 ENSP000003005741 1 1 ENSP000002693051 1 1 1 1 1 ENSP000002440071 ENSP00000275493 1 1 1 ENSP00000299421 2 1 1 ENSP00000302486 ENSP000002386821 1 1 1 1 1 1 ENSP00000302269 1 1 1 ENSP00000226574 ENSP00000344456ENSP00000267163 1 1 1 1 1 ENSP000003391511 1 1 1 ENSP00000265734 1 ENSP00000162330 ENSP000002447411 1 2 1 11 1 1 1 1 1 ENSP0000035862211 1 ENSP000002646571 1 3 1 ENSP000002283071 1 ENSP00000314458 1 ENSP000003594241 1 1 ENSP00000359206 ENSP000003842733 ENSP000002639671 2 ENSP00000263826 2 ENSP00000350941 1 ENSP00000222254 ENSP00000270202 1 ENSP00000339007 ENSP00000257904ENSP00000364133ENSP000003519051 1ENSP000002219301 1 1 2 1 1 1 ENSP00000360683 ENSP000002230232 1 1 1 1 1 1 ENSP00000206249 ENSP00000261799 1 35 1 1 1 ENSP000002623671 ENSP00000222005 1 1 1 ENSP00000263253 ENSP00000354558 1 ENSP00000335153 ENSP000003095032 1 ENSP00000350283 ENSP00000300161ENSP00000216797ENSP000003602661 ENSP00000219476

Figure 1: A graph consisting of 2,080 shortest paths. The nodes on the inner circle (red nodes) represent 65 PC-related genes, while the nodes on the outer circle (blue nodes) represent 69 shortest path genes. The numbers on the edges represent the weights of the edges.

associated GDP/GTP exchange reaction, further validating to be related to the progression of pancreatic cancer [58]. the potential relationship between ARHGDIA and pancreatic Furthermore, we also obtained the gene XIAP as one of cancer [54–58]. What is more, regarded as a therapeutic target the candidate genes in our list (see row 3 of Table 1, with of pancreatic cancer, ARHGDIA (also known as RhoGDI) has betweenness 64 and permutation FDR 0.006) that encodes a been actually proved to be associated with the proliferation functional apoptosis suppressor protein. As the most potent and invasion process of the pancreatic tumorigenesis which apoptosis suppressor, XIAP interacts with several caspases may further affect the prognosis of various patients [51]. directly and prevents the apoptosis process [59–61]. In PC, The expression quantity of such gene, ARHGDIA, as the XIAP has been proved to be critical for the progression specific negative regulator of Rho protein exchange reactions, and prognosis of this disease via several classical pathways hasbeenwidelyconfirmedtobedownregulatedintumors, (Erk, PTEN/PI3K/AKT, etc.) [62–64]. What is more, the which further induce the activation of survival associated overexpression of XIAP has been confirmed to be associated Rho GTPases and promote the progression of various tumors with poor prognosis of pancreatic cancer. Therefore, such [55–57]. As for specific mutations that may affect the bio- protein has already been applied as a useful laboratory test logical function of such gene, coincidentally, specific triple parameter and a functional therapeutic target of pancreatic Y156F/S101A/S174A-RhoGDI mutation has been confirmed cancer [64, 65]. As an immune associated functional gene, BioMed Research International 5

Table 1: Detailed information on the thirteen candidate genes.

Row number Ensembl ID Gene symbol Full name Betweenness Permutation FDR 1 ENSP00000216797 NFKBIA NuclearFactorofKappaLightPolypeptide 73 0.002 Gene Enhancer in B-Cells Inhibitor, Alpha 2 ENSP00000269321 ARHGDIA Rho GDP Dissociation Inhibitor (GDI) 94 0.012 Alpha 3 ENSP00000347858 XIAP X-Linked Inhibitor of Apoptosis, E3 64 0.006 Ubiquitin Protein Ligase 4 ENSP00000250894 MAPK8IP3 Mitogen-Activated Protein Kinase 8 64 <0.001 Interacting Protein 3 Solute Carrier Family 9, Subfamily A 5 ENSP00000262613 SLC9A3R1 (NHE3, Cation Proton Antiporter 3), 128 0.031 Member 3 Regulator 1 6 ENSP00000384515 PARVB Parvin, Beta 64 <0.001 7 ENSP00000254066 RARA Retinoic Acid Receptor, Alpha 24 0.005 8 ENSP00000299421 ILK1 Integrin-linked kinase 1 64 0.03 9 ENSP00000338934 EZR Ezrin 128 0.026 10 ENSP00000309845 HRAS Harvey Rat Sarcoma Viral Oncogene 113 0.033 Homolog 11 ENSP00000335153 HSP90AA1 Heat Shock Protein 90kDa Alpha 230 0.007 (Cytosolic), Class A Member 1 12 ENSP00000348986 INS-IGF2 INS-IGF2 Readthrough 64 0.031 13 ENSP00000284384 PRKCA Protein Kinase C, Alpha 175 <0.001

the polymorphism of XIAP may contribute to the complex our candidate gene SLC9A3R1 (see row 5 of Table 1, with and personalized immune response to pancreatic cancers. betweenness 128 and permutation FDR 0.031) regulates the Someofthefunctionalvariantshavealreadybeenprovedto subcellular location and function of SLC9A3 and plays a be directly associated with the prognosis of various subtypes specific role in the Wnt pathway [77–79]. Known as NHERF1, of pancreatic cancer, indicating the potential significant role SLC9A3R1 acts as a scaffold protein in many types of tumors of XIAP in pancreatic cancer [62, 66]. depending on its specific role in Wnt pathway, especially In addition to inhibitory genes, a group of scaffold genes in PC [78, 80, 81]. During the course of pancreatic cancer was also among the thirteen candidate genes. MAPK8IP3 raising, SLC9A3R1 has been confirmed to be overexpressed (see row 4 of Table 1, with betweenness 64 and permutation andinvolvethepoorprognosisofsuchdisease[81,82].Just FDR <0.001), which is also known as JIP3, is a scaffold like other candidates encoding scaffold proteins, SLC9A3R1 proteininvolvedintheJNKpathway[67].Associatedwith has also been proved to contain specific nonsynonymous the MAP kinase cascade, it may be related to PC via mutations in various cancer subtypes, especially in pancreatic potential downstream mechanisms [67–69]. With high tissue cancer [83, 84]. Coding an adaptor protein, the candidate specificity, such gene has been reported to actually induce genePARVB(seerow6ofTable1,withbetweenness64 the excessive proliferation of pancreatic cells, which may and permutation FDR <0.001) is critical to certain specific further initiate the biological process of pancreatic cancer biological processes in cancer, such as ERK signaling pathway [70].MAPK8IP3isalsooneofthecrucialcomponentsofJNK and focal adhesion which both have been identified in pathway which has been overactivated in tumors including pancreatic cancer [85, 86]. Therefore, this gene is critical pancreatic tumor [71, 72]. During the initiation and invasion for tumorigenesis and may have a specific role in PC [87]. stages of pancreatic cancer, MAPK8IP3 has been reported As a crucial adaptor protein, the overexpression of PARVB to be overexpressed which may further contribute to the hasbeenprovedtobeassociatedwithhighcellproliferation excessive proliferation of pancreatic cells in clinical cases rate which indicates worsening clinical prognosis of tumors, [70]. Mutations have also been identified in MAPK8IP3 including pancreatic cancer [86, 88]. The polymerase of such which has been reported to partially interfere with its own gene, PARVB, has also been identified and confirmed to be functions under pathological conditions [73, 74]. Some of associated with a group of severe diseases including cancers the specific mutations of MAPK8IP3 have been reported [87]. Such abnormal expression level and high polymerase in to influence the adhesion and invasion process of various tumor cells suggest that PARVB may play a crucial role in the cancer cells including the pancreatic cancer cells, validating specific pathogenic process of pancreatic cancer. the crucial role of MAPK8IP3 in pancreatic cancer [75, 76]. Six remaining genes may also be associated with PC As another functional scaffold protein that participates in in their respective way. RARA (see row 7 of Table 1, the interaction between the cell membrane and cytoskeleton, with betweenness 24 and permutation FDR 0.005) encodes 6 BioMed Research International a specific receptor for retinoic acid, which may further reg- contribute to the tumorigenesis process of various cancer ulate germ cell development during spermatogenesis and the subtypes including pancreatic cancer [118]. All in all, EZR expression of specific genes through recruiting the chromatin may be a potential pancreatic-cancer-associated gene and complex (KMT2E, MLL5, etc.) [89–91]. As we all know, the may contribute to similar pathways with SL9A3R1 as we have stem-cell-like features of tumor cells may be associated with mentioned above [110]. the potential invasion and proliferation ability of such tumor Further,wealsopredictedoneofthemostfamous cells [92–94]. In pancreatic cancer, it has been confirmed that oncogenes HRAS (see row 10 of Table 1, with betweenness the more “stem-like” the cancer cell is, the more malignant 113 and permutation FDR 0.033), which binds GDP/GTP the cancer is [95]. Similar to its specific function in germ cells, and exhibits intrinsic GTPase activity [119]. Associated with RARA is related to the regulation of stem- cell-like features in several tumors and similar to its homologues (r-RAS, k- PC, which implies that it may further affect the prognosis of Ras, etc.), HRAS is a powerful oncogene that can initiate such a severe malignancy [96]. The overexpression of RARA tumors through inducing excessive growth factor activation has been reported to be related to specific biological behav- in cells and promoting malignant proliferation of tumor iors of tumors such as EMT (epithelial-to-mesenchymal) cells, especially in tumors in situ [120–122]. In PC, such and invasion, which indicates poor prognosis of pancreatic gene has also been regarded as a main oncogene and is cancers in clinical cases [97]. Although few mutations have widely reported to be overexpressed during the entire clinical been identified in RARA, crucial fusion gene variants have course of pancreatic cancer, validating our prediction of the been widely identified in cancers which may be analyzed later. potential pancreatic cancer-associated genes [123]. Several As we have mentioned above, SLC9A3R1 has been reported genetic alterations in HRAS (mutations, CNVs, etc.) have to regulate cellular morphology as a scaffold protein [78]. been identified and reported to be associated with such a Similarly, the candidate gene ILK1 (see row 8 of Table 1, with severe type of cancer [124–126]. Recently, some of the crucial betweenness 64 and permutation FDR 0.005) also mediates mutations have already been confirmed to directly con- cell architecture, signaling transduction and cell adhesion tribute to the tumorigenesis of pancreatic cancer, indicating via integrin-mediated signaling transduction, especially in the underlying relationship between mutations of candidate tumor tissues [98, 99]. Although studies have not explained genes and the process of tumorigenesis [127]. Heat shock clearly how ILK1 actually contributes to the process of pancre- proteins compose a unique group of proteins produced by atic cancer, it has been shown to participate in several types of cells to resist harmful and stressful conditions, including tumors and may be associated with several unique molecules the tumor microenvironment, which contains less oxygen (GSK, AKT, PTEN, etc.) that all have been confirmed to be and a lower pH [128–130]. This cluster of proteins has been critical to PC [98–103]. Recently, such gene has been reported confirmed to be functional in tumor microenvironment and to be associated with the expression of TGF-𝛼,whichmay promote the progression of various tumor subtypes [131, 132]. partially explain the underlying carcinogenic mechanism of One of the candidate genes, HSP90AA1 (see row 11 of Table 1, such gene [104]. What is more, such gene has been confirmed with betweenness 230 and permutation FDR 0.007), also to be overexpressed during the whole biological process of encodes a specific heat shock protein, heat shock protein tumorigenesis including the proliferation, migration, and 90 kDa alpha, which has been reported to be expressed in invasion of pancreatic cancers [104–106]. Further, specific the cytoplasm. Several types of heat shock proteins (such mutationshavealsobeenidentifiedinsuchgene.Integrin- as HSP20, HSP70, and HSP90), including our predicted linked kinase 1 (ILK1) and HSP90 (HSP90AA1, which is protein, are not only related to tumor genesis but specifically also in our list and will be analyzed later) have both been contribute to PC [133–135]. Heat shock proteins have also confirmed to contain quite a lot of tumor-associated variants. been confirmed to be overexpressed in pancreatic cancer, Such mutations of ILK1 have been reported to be functional including HSP90 which is encoded by our predicted gene pancreatic cancer biomarkers [107–109]. All in all, ILK1 [136–140]. The overexpression of heat shock proteins has hasbeenprovedtoregulatethetumorigenesisprocessof been confirmed to actually contribute to the tumorigenesis PC through a specific TGF-𝛼-associated mechanism. As a process of pancreatic cancer which validates our predication functional SLC9A3R1-associated protein, a specific protein of HSP90AA1 as a functional candidate gene [133]. As a ezrin is encoded by one of our candidate genes EZR (see hormone-associated gene, our candidate gene INS-IGF2 (see row 9 of Table 1, with betweenness 128 and permutation FDR row 12 of Table 1, with betweenness 64 and permutation 0.026) [110]. Crucial for cytoplasmic peripheral membrane, FDR 0.031) is a readthrough gene of INS and IGF2. This EZRisassociatedwiththeactincytoskeletonandregulates gene may act as a posttranscriptional regulatory factor for the surface expression of actin [111, 112]. It has been widely the two genes INS and IGF2, which has been shown to be reported that ezrin is critical for the proliferation, metastasis, associated with several pancreatic-associated diseases, such and invasion process of several tumor subtypes, especially as diabetes [141–143]. As a specific fusion gene, INS-IGF2 of pancreatic cancer [113, 114]. The overexpression of EZR participates in several cancer types, especially in prostate hasbeenwidelyreportedtobeassociatedwiththeinitia- cancer [144–146]. The fusion of a tumor-associated gene tion, proliferation, and metastasis processes of tumor, which and a pancreas-related gene, INS-IGF2, probably participates strongly affect the patients’ prognosis especially in pancreatic in PC progression and invasion specifically. Recently, such cancer [115–117]. EZR, as the crucial scaffold protein which fusion gene INS-IGF2 is reported to be specifically expressed mayinteractwithSLC9A3R1thatwehavementionedabove, in pancreatic islets especially in patients with autoimmune has also been reported to contain specific mutations that may diseases under pathological conditions [147, 148]. Since BioMed Research International 7 immune response is quite significant for tumor surveillance, Conflict of Interests INS-IGF2 is definitely associated with the unique tumor- associated immune response and may further contribute to The authors declare that there is no conflict of interests the prognosis of pancreatic cancer [148]. Although genes like regarding the publication of this paper. RARA (mostly identified as FIP1L1/PML-RARA fusion gene) (we have mentioned above) and INS-IGF2 have not been Acknowledgments reported to contain specific mutations that may be associated with pancreatic cancer, such fusion genes have both been con- This study was supported by the National Basic Research firmed to have abnormal gene functions which may directly Program of China (2011CB510101 and 2011CB510102), the contribute to the tumorigenesis of various cancer subtypes, National Natural Science Foundation of China (31371335), including pancreatic cancer [149, 150]. The last candidate and the Innovation Program of Shanghai Municipal Educa- gene PRKCA (see row 13 of Table 1, with betweenness 175 tion Commission (12ZZ087). and permutation FDR <0.001) encodes a specific kinase that has been reported in several types of tumors [151, 152]. References As a widely reported oncogene, PRKCA contributes to the phosphorylation process of crucial tumor-associated proteins [1] M. Hidalgo, S. Cascinu, J. Kleeff et al., “Addressing the chal- [153]. The overexpression of PRKCA has been proved to lenges of pancreatic cancer: future directions for improving enhance the transformed proliferation and invasion process outcomes,” Pancreatology,vol.15,no.1,pp.8–18,2015. of pancreatic cancer which further definitely influences the [2]C.Verbeke,M.Lohr,¨ J. Severin Karlsson, and M. Del Chiaro, prognosis of such disease [154, 155]. While at the same “Pathology reporting of pancreatic cancer following neoadju- time, genes like PRKCA which encodes functional kinase vant therapy: challenges and uncertainties,” Cancer Treatment Reviews,vol.41,no.1,pp.17–26,2015. may be strongly affected by genetic alterations especially by some specific nonsynonymous mutations [75, 156]. Therefore, [3] J. Freilich, T. Strom, G. Springett et al., “Age and resected pancreatic cancer outcomes,” International Journal of Radiation the polymerase of gene PRKCA may definitely affect the Oncology Biology Physics,vol.90,no.1,supplement,p.S356, function of protein kinase C which has also been identified 2014. in pancreatic cancer [156, 157]. However, recent studies show that, at least in vitro, PRKCA has a specific function [4] S. Mizuno, Y. Nakai, H. Isayama et al., “Smoking, family history of cancer, and diabetes mellitus are associated with the age of in tumor suppression [156]. The complex role of PKC in onset of pancreatic cancer in japanese patients,” Pancreas,vol. cancer may reflect the complicated interactive relationship 43,no.7,pp.1014–1017,2014. between PRKCA and PC [156, 158]. All in all, our predicted [5] G. Preziosi, J. A. Oben, and G. Fusai, “Obesity and pancreatic gene PRKCA has been confirmed to be definitely associated cancer,” Surgical Oncology,vol.23,no.2,pp.61–71,2014. with pancreatic cancer, while, at the same time, the various functions of PRKCA imply the complex role of such gene in [6] S. Deng, S. Zhu, B. Wang et al., “Chronic pancreatitis and pancreatic cancer demonstrate active epithelial-mesenchymal pancreatic cancer. transition profile, regulated by miR-217-SIRT1 pathway,” Cancer Consequently, all candidate genes obtained using our Letters,vol.355,no.2,pp.184–191,2014. method play a critical role in tumor initiation and progres- [7]K.S.Snima,R.S.Nair,S.V.Nair,C.R.Kamath,andV.-K. sion. Because the detailed regulation mechanisms of only few Lakshmanan, “Combination of anti-diabetic drug metformin genes remain unclear, those candidate genes may be genes and boswellic acid nanoparticles: a novel strategy for pancreatic that are critical to PC. PC features genetic heterogeneity, cancer therapy,” JournalofBiomedicalNanotechnology, vol. 11, and many familial hereditary related genes remain ambigu- no.1,pp.93–104,2015. ous [16]. Therefore, in addition to the confirmed genes, [8] J. Luttges,¨ A. Reinecke-Luthge,¨ B. Mollmann¨ et al., “Duct such as KRAS and p16/INK41 [8, 9], more genes must be changes and K-ras mutations in the disease-free pancreas: screened and validated to clearly demonstrate the mecha- analysis of type, age relation and spatial distribution,” Virchows nisms of pancreatic adenocarcinoma. In conclusion, based Archiv,vol.435,no.5,pp.461–468,1999. onreportedPC-relatedgenes,theproposedmethodwas [9] D. Georgiadou, T. N. Sergentanis, S. Sakellariou et al., “Cyclin effective at predicting candidate tumor-associated genes in D1, p16INK4A and p27Kip1 in pancreatic adenocarcinoma: assess- PC. ing prognostic implications through quantitative image analy- sis,” Acta Pathologica, Microbiologica, Et Immunologica Scandi- navica,vol.122,no.12,pp.1230–1239,2014. 4. Conclusions [10]Z.Wang,S.Ali,S.Banerjeeetal.,“ActivatedK-Rasand INK4a/Arf deficiency promote aggressiveness of pancreatic In this study, we proposed a computational method to mine cancer by induction of EMT consistent with cancer stem cell new candidate genes related to pancreatic cancer, which phenotype,” Journal of Cellular Physiology,vol.228,no.3,pp. utilized protein-protein interaction information. The analy- 556–562, 2013. ses of the obtained candidate genes indicate that they may [11] L. Zhang, W. Liu, Y. Qin, and R. Wu, “Expression of TGF-𝛽1 be novel PC-related genes. Hopefully, this contribution will in Wilms’ tumor was associated with invasiveness and disease promote studies on pancreatic cancer and provide new hope progression,” Journal of Pediatric Urology,vol.10,no.5,pp.962– for designing effective treatments. 968, 2014. 8 BioMed Research International

[12] Y. El-Gohary, S. Tulachan, P. Guo et al., “Smad signaling [28]L.Chen,C.Chu,X.Kong,G.Huang,T.Huang,andY.Cai, pathways regulate pancreatic endocrine development,” Develop- “A hybrid computational method for the discovery of novel mental Biology,vol.378,no.2,pp.83–93,2013. reproduction-related genes,” PLoS ONE,vol.10,no.3,Article [13] J. A. Veenstra, “Immunocytochemical demonstration of a ID e0117090, 2015. homology in peptidergic neurosecretory cells in the suboe- [29] B.-Q. Li, J. Zhang, T. Huang, L. Zhang, and Y.-D. Cai, “Identi- sophageal ganglion of a beetle and a locust with antisera to fication of retinoblastoma related genes with shortest path in a bovine pancreatic polypeptide, FMRFamide, vasopressin and 𝛼- protein-protein interaction network,” Biochimie,vol.94,no.9, MSH,” Neuroscience Letters,vol.48,no.2,pp.185–190,1984. pp.1910–1917,2012. [14] R. Das Gupta and R. Kolodner, “Identification of postmeiotic [30] J. B. M. Craven, Markov Networks for Detecting Overlapping segregation (PMS) mutants as novel alleles of mismatch repair Elements in Sequence Data, MIT Press, 2005. (MMR) genes,” The FASEB Journal,vol.11,no.9,p.1971,1997. [31] F. Huang, J. Tang, X. Zhuang et al., “MiR-196a promotes pan- creatic cancer progression by targeting nuclear factor kappa-B- [15]T.Golan,Z.S.Kanji,R.Epelbaumetal.,“Overallsurvivaland inhibitor alpha,” PLoS ONE,vol.9,no.2,ArticleIDe87897,2014. clinical characteristics of pancreatic cancer in BRCA mutation carriers,” British Journal of Cancer, vol. 111, no. 6, pp. 1132–1138, [32]P.-Y.Chang,K.Draheim,M.A.Kelliher,andS.Miyamoto, 2014. “NFKB1 is a direct target of the TAL1 oncoprotein in human T leukemia cells,” Cancer Research, vol. 66, no. 12, pp. 6008–6013, [16] A. P. Klein, K. A. Brune, G. M. Petersen et al., “Prospective 2006. risk of pancreatic cancer in familial pancreatic cancer kindreds,” [33] N. Fullard, C. L. Wilson, and F. Oakley, “Roles of c-Rel Cancer Research,vol.64,no.7,pp.2634–2638,2004. signalling in inflammation and disease,” International Journal of [17] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: Biochemistry and Cell Biology,vol.44,no.6,pp.851–860,2012. protein-protein interaction networks, integrated over the tree [34] Y. Ito, E. Kikuchi, N. Tanaka et al., “Down-regulation of NF of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, kappa B activation is an effective therapeutic modality in 2015. acquired platinum-resistant bladder cancer,” BMC Cancer,vol. [18] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes 15, article 324, 2015. and genomes,” Nucleic Acids Research,vol.28,no.1,pp.27–30, [35] L. Lessard, A.-M. Mes-Masson, L. Lamarre, L. Wall, J.-B. Lat- 2000. touf, and F.Saad, “NF-𝜅B nuclear localization and its prognostic [19]M.Deng,K.Zhang,S.Mehta,T.Chen,andF.Sun,“Prediction significance in prostate cancer,” BJU International,vol.91,no.4, of protein function using protein-protein interaction data,” pp.417–420,2003. Journal of Computational Biology,vol.10,no.6,pp.947–960, [36] T. Gilmore, M.-E. Gapuzan, D. Kalaitzidis, and D. Star- 2003. czynowski, “Rel/NF-𝜅B/I𝜅B signal transduction in the gener- [20]L.Hu,T.Huang,X.Shi,W.-C.Lu,Y.-D.Cai,andK.-C.Chou, ation and treatment of human cancer,” Cancer Letters,vol.181, “Predicting functions of proteins in mouse based on weighted no. 1, pp. 1–9, 2002. protein-protein interaction network and protein hybrid prop- [37] B. Haefner, “NF-kappa B: arresting a major culprit in cancer,” erties,” PLoS ONE, vol. 6, no. 1, Article ID e14556, 2011. Drug Discovery Today,vol.7,no.12,pp.653–663,2002. [21] P. Gao, Q.-P. Wang, L. Chen, and T. Huang, “Prediction of [38]H.Kinugasa,K.Nouso,K.Miyaharaetal.,“DetectionofK- human genes regulatory functions based on proteinprotein ras gene mutation by liquid biopsy in patients with pancreatic interaction network,” Protein and Peptide Letters,vol.19,no.9, cancer,” Cancer,vol.121,no.13,pp.2271–2280,2015. pp. 910–916, 2012. [39] A. Purohit, M. Varney, S. Rachagani, M. Ouellette, S. K. Batra, [22] K.-L. Ng, J.-S. Ciou, and C.-H. Huang, “Prediction of protein andR.K.Singh,“CXCR2signalingaxis-ahelpinghandto functions based on function-function correlation relations,” KRAS(G12D) mutation in pancreatic cancer initiation,” Cancer ComputersinBiologyandMedicine,vol.40,no.3,pp.300–305, Research, vol. 74, article 4874, 2014. 2010. [40] H. Hayashi, H. Ueno, Y. Sakamoto et al., “Gene mutation profile of pancreatic cancer in Japanese patients and its association with [23] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction prognosis,” Annals of Oncology,vol.25,2014. of protein function,” Molecular Systems Biology, vol. 3, article 88, 2007. [41] K. Saito, Y.Nakai, H. Isayama et al., “Associations between K-ras mutation, smoking, and prognosis of pancreatic cancer,” Journal [24] P. Bogdanov and A. K. Singh, “Molecular function prediction of Clinical Oncology, vol. 32, supplement 3, abstract 298, 2014. using neighborhood features,” IEEE/ACM Transactions on Com- [42]M.Bredel,H.Kim,T.K.Nandaetal.,“Deletionofthe putational Biology and Bioinformatics,vol.7,no.2,pp.208–217, tumor suppressor NFKBIA in triple-negative breast cancer,” 2010. International Journal of Radiation Oncology, Biology, Physics, [25] M. D. M. AbdulHameed, G. J. Tawa, K. Kumar et al., “Systems vol. 87, no. 2, p. S98, 2013. level analysis and identification of pathways and networks [43] G. R. Harsh, D. M. Scholtens, A. K. Yadav et al., “NFKBIA associated with liver fibrosis,” PLoS ONE,vol.9,no.11,Article in glioblastomas: tumor suppressor and potent predictor of ID e112193, 2014. outcome,” Journal of Neurosurgery,vol.115,p.A404,2011. [26]L.Chen,C.Chu,J.Lu,X.Kong,T.Huang,andY.D.Cai, [44] T. Mukogawa, F. Koyama, M. Tachibana et al., “Adenovirus- “A computational method for the identification of new candi- mediated gene transduction of truncated I𝜅B𝛼 enhances date carcinogenic and non-carcinogenic chemicals,” Molecular radiosensitivity in human colon cancer cells,” Cancer Science, BioSystems, vol. 11, no. 9, pp. 2541–2550, 2015. vol. 94, no. 8, pp. 745–750, 2003. [27] T. Gui, X. Dong, R. Li, Y. Li, and Z. Wang, “Identification of [45] J. V. Partin, I. E. Anglin, and N. Kyprianou, “Quinazoline- hepatocellular carcinoma-related genes with a machine learn- based 𝛼1-adrenoceptor antagonists induce prostate cancer cell ing and network analysis,” Journal of Computational Biology, apoptosis via TGF-𝛽 signalling and I𝜅B𝛼 induction,” British vol.22,no.1,pp.63–71,2015. Journal of Cancer, vol. 88, no. 10, pp. 1615–1621, 2003. BioMed Research International 9

[46]K.M.Reid-Lombardo,B.L.Fridley,W.R.Bamlet,J.M. [61]Z.Cao,X.Li,J.Li,W.Luo,C.Huang,andJ.Chen,“X-linked Cunningham, M. G. Sarr, and G. M. Petersen, “Survival is inhibitor of apoptosis protein (XIAP) lacking RING domain associated with genetic variation in inflammatory pathway localizes to the nuclear and promotes cancer cell anchorage- genes among patients with resected and unresected pancreatic independent growth by targeting the E2F1/Cyclin E axis,” cancer,” Annals of Surgery,vol.257,no.6,pp.1096–1102,2013. Oncotarget,vol.5,no.16,pp.7126–7137,2014. [47] S. Suzuki, M. Okada, K. Shibuya et al., “JNK suppression [62]X.P.Yi,T.Han,Y.X.Li,X.Y.Long,andW.Z.Li,“Simultaneous of chemotherapeutic agents-induced ROS confers chemoresis- silencing of XIAP and survivin causes partial mesenchymal- tance on pancreatic cancer stem cells,” Oncotarget,vol.6,no.1, epithelial transition of human pancreatic cancer cells via the pp.458–470,2015. PTEN/PI3K/Akt pathway,” Molecular Medicine Reports,vol.12, [48] X.P.Yuan,M.Dong,X.Li,andJ.P.Zhou,“GRP78promotesthe pp.601–608,2015. invasion of pancreatic cancer cells by FAK and JNK,” Molecular [63] L.-P. Cao, J.-L. Song, X.-P. Yi, and Y.-X. Li, “Double inhibition and Cellular Biochemistry,vol.398,no.1-2,pp.55–62,2014. of NF-𝜅B and XIAP via RNAi enhances the sensitivity of [49] L. Liang, Q. Li, L. Y.Huang et al., “Loss of ARHGDIA expression pancreatic cancer cells to gemcitabine,” Oncology Reports,vol. is associated with poor prognosis in HCC and promotes 29,no.4,pp.1659–1665,2013. invasion and metastasis of HCC cells,” International Journal of [64] Z.-H. Wang, H. Chen, H.-C. Guo et al., “Enhanced antitu- Oncology,vol.45,no.2,pp.659–666,2014. mor efficacy by the combination of emodin and gemcitabine [50] M. Dal Bo, F. Pozzo, R. Bomben et al., “ARHGDIA, a mutant against human pancreatic cancer cells via downregulation of the TP53-associated Rho GDP dissociation inhibitor, is over- expression of XIAP in vitro and in vivo,” International Journal expressed in gene expression profiles of TP53 disrupted chronic of Oncology, vol. 39, no. 5, pp. 1123–1131, 2011. lymphocytic leukaemia cells,” British Journal of Haematology, [65]H.Zai,X.Yi,Y.Li,C.Jiang,andX.Lu,¨ “Simultaneous inhi- vol. 161, no. 4, pp. 596–599, 2013. bition of XIAP and survivin expression on EMT and invasion of human pancreatic cancer cells,” Journal of Central South [51] M. A. Harding and D. Theodorescu, “RhoGDI signaling pro- University, vol. 37, no. 9, pp. 883–888, 2012. vides targets for cancer therapy,” European Journal of Cancer, vol.46,no.7,pp.1252–1259,2010. [66] R.Yu,S.M.Albarenque,R.H.Cool,W.J.Quax,A.Mohr,andR. M. Zwacka, “DR4 specific TRAIL variants are more efficacious [52]D.Dixit,V.Sharma,S.Ghosh,V.S.Mehta,andE.Sen, than wild-type TRAIL in pancreatic cancer,” Cancer Biology and “Inhibition of Casein kinase-2 induces p53-dependent cell Therapy,vol.15,no.12,pp.1658–1666,2014. cycle arrest and sensitizes glioblastoma cells to tumor necrosis factor (TNF𝛼)-induced apoptosis through SIRT1 inhibition,” [67] T. Sun, N. Yu, L.-K. Zhai et al., “C-Jun NH2-terminal kinase Cell Death and Disease,vol.3,no.2,articlee271,2012. (JNK)-interacting protein-3 (JIP3) regulates neuronal axon elongation in a - and JNK-dependent manner,” Journal [53] Q. Liu, X.-Y. Zhao, R.-Z. Bai et al., “Induction of tumor of Biological Chemistry, vol. 288, no. 20, pp. 14531–14543, 2013. inhibition and apoptosis by a candidate tumor suppressor gene DRR1 on 3p21.1,” Oncology Reports,vol.22,no.5,pp.1069–1075, [68] H. Matsuura, H. Nishitoh, K. Takeda et al., “Phosphorylation- 2009. dependent scaffolding role of JSAP1/JIP3 in the ASK1-JNK signaling pathway. A new mode of regulation of the MAP kinase [54] M. J. Marinissen, M. Chiariello, T. Tanos, O. Bernard, S. Naru- cascade,” The Journal of Biological Chemistry,vol.277,no.43,pp. miya, and J. S. Gutkind, “The small GTP-binding protein RhoA 40703–40709, 2002. regulates c-jun by a ROCK-JNK signaling axis,” Molecular Cell, [69] N. Skrypek, R. Vasseur, A. Vincent, B. Duchene,ˆ I. Van vol. 14, no. 1, pp. 29–41, 2004. Seuningen, and N. Jonckheere, “The oncogenic receptor ErbB2 [55] W. P. Bozza, Y. Zhang, K. Hallett, L. A. Rivera Rosado, and B. modulates gemcitabine and irinotecan/SN-38 chemoresistance Zhang, “RhoGDI deficiency induces constitutive activation of of human pancreatic cancer cells via hCNT1 transporter and Rho GTPases and COX-2 pathways in association with breast multidrug-resistance associated protein MRP-2,” Oncotarget, cancer progression,” Oncotarget,2015. vol. 6, no. 13, pp. 10853–10867, 2015. [56] S. Hooshmand, A. Ghaderi, K. Yusoff, K. Thilakavathy, R. Rosli, + [70]J.Størling,N.Allaman-Pillet,A.E.Karlsen,N.Billestrup,C. and Z. Mojtahedi, “Differentially expressed proteins in ER − Bonny, and T. Mandrup-Poulsen, “Antitumorigenic effect of MCF7 and ER MDA-MB-231 human breast cancer cells by proteasome inhibitors on insulinoma cells,” Endocrinology,vol. 𝛼 RhoGDI- silencing and overexpression,” Asian Pacific Journal 146, no. 4, pp. 1718–1726, 2005. of Cancer Prevention,vol.15,no.7,pp.3311–3317,2014. [71] C. L. Standen, N. J. Kennedy, R. A. Flavell, and R. J. Davis, [57] B. Zhang, “Rho GDP dissociation inhibitors as potential targets “Signal transduction cross talk mediated by Jun N-terminal for anticancer treatment,” Drug Resistance Updates,vol.9,no.3, kinase-interacting protein and insulin receptor substrate scaf- pp. 134–141, 2006. fold protein complexes,” Molecular and Cellular Biology,vol.29, [58] Z. X. Wang and D. C. Thurmond, “Differential phosphorylation no. 17, pp. 4831–4840, 2009. of RhoGDI mediates the distinct cycling of Cdc42 and Rac1 to [72]P.P.Ongusaha,H.H.Qi,L.Rajetal.,“IdentificationofROCK1 regulate second-phase insulin secretion,” Journal of Biological as an upstream activator of the JIP-3 to JNK signaling axis in Chemistry,vol.285,no.9,pp.6186–6197,2010. response to UVB damage,” Science Signaling,vol.1,no.47,p. [59] G. Jiao, W. Guo, T. Ren et al., “BMPR2 inhibition induced ra14, 2008. apoptosis and autophagy via destabilization of XIAP in human [73] S. N. Rao and V. N. Balaji, “Molecular docking studies on JNK chondrosarcoma cells,” Cell Death and Disease,vol.5,no.12, inhibitors at the allosteric JNK-JIP interaction site,” Current Article ID e1571, 2014. Science,vol.102,no.7,pp.1009–1016,2012. [60]R.Hu,Y.Yang,Z.Liuetal.,“TheXIAPinhibitorEmbelin [74] W.Engstrom¨ and M. Granerus, “Expression of JNK-interacting enhances TRAIL-induced apoptosis in human leukemia cells by protein JIP-1 and insulin-like growth factor II in Wilms tumour DR4 and DR5 upregulation,” Tumor Biology,vol.36,no.2,pp. cell lines and primary Wilms tumours,” Anticancer Research, 769–777, 2015. vol. 29, no. 7, pp. 2467–2472, 2009. 10 BioMed Research International

[75] T. Takino, M. Nakada, H. Miyamori et al., “JSAP1/JIP3 coop- [90] S. S. W. Chung, X. Wang, S. S. Roberts, S. M. Griffey, P. R. erates with focal adhesion kinase to regulate c-Jun N-terminal Reczek, and D. J. Wolgemuth, “Oral administration of a retinoic kinaseandcellmigration,”The Journal of Biological Chemistry, Acid receptor antagonist reversibly inhibits spermatogenesis in vol. 280, no. 45, pp. 37772–37781, 2005. mice,” Endocrinology,vol.152,no.6,pp.2492–2502,2011. [76] T. Matsuguchi, A. Masuda, K. Sugimoto, Y. Nagai, and Y. [91] S. Flajollet, B. Lefebvre, C. Cudejko, B. Staels, and P. Lefebvre, Yoshikai, “JNK-interacting protein 3 associates with Toll-like “The core component of the mammalian SWI/SNF complex receptor 4 and is involved in LPS-mediated JNK activation,” The SMARCD3/BAF60c is a coactivator for the nuclear retinoic acid EMBO Journal,vol.22,no.17,pp.4455–4464,2003. receptor,” Molecular and Cellular Endocrinology,vol.270,no.1- [77] N. Broere, M. Chen, A. Cinar et al., “Defective jejunal and 2, pp. 23–32, 2007. + + colonic salt absorption and altered Na /H exchanger 3 (NHE3) [92] Q. H. Wu, R. J. Guo, M. Lin, B. Zhou, and Y. F. Wang, activity in NHE regulatory factor 1 (NHERF1) adaptor protein- “MicroRNA-200a inhibits CD133/1+ ovarian cancer stem cells deficient mice,” Pflugers¨ Archiv—European Journal of Physiol- migration and invasion by targeting E-cadherin repressor ogy,vol.457,no.5,pp.1079–1091,2009. ZEB2,” Gynecologic Oncology,vol.122,no.1,pp.149–154,2011. [78] A. Bellizzi, M. R. Greco, R. Rubino et al., “The scaffolding [93] F. Yu, J. Li, H. Chen et al., “Kruppel-like factor 4 (KLF4) is protein NHERF1 sensitizes EGFR-dependent tumor growth, required for maintenance of breast cancer stem cells and for cell motility and invadopodia function to gefitinib treatment in migration and invasion,” Oncogene,vol.30,no.18,pp.2161–2172, breast cancer cells,” International Journal of Oncology,vol.46, 2011. no. 3, pp. 1214–1224, 2015. [94] K. M. Bae, N. N. Parker, Y. Dai, J. Vieweg, and D. W. Siemann, [79] B. Hopwood, A. Tsykin, D. M. Findlay, and N. L. Fazzalari, “E-cadherin plasticity in prostate cancer stem cell invasion,” “Microarray gene expression profiling of osteoarthritic bone American Journal of Cancer Research, vol. 1, pp. 71–84, 2011. suggests altered bone remodelling, WNT and transforming [95] A. Yasuda, H. Sawai, H. Takahashi et al., “The stem cell factor/c- growth factor-𝛽/bone morphogenic protein signalling,” Arthri- kit receptor pathway enhances proliferation and invasion of tis Research and Therapy,vol.9,no.5,articleR100,2007. pancreatic cancer cells,” Molecular Cancer,vol.5,article46, [80] Y. Jiang, S. Wang, J. Holcomb et al., “Crystallographic analysis 2006. of NHERF1-PLCbeta3 interaction provides structural basis [96] M. Herreros-Villanueva, T. K. Er, and L. Bujanda, “Retinoic for CXCR2 signaling in pancreatic cancer,” Biochemical and acid reduces stem cell-like features in pancreatic cancer cells,” Biophysical Research Communications,vol.446,no.2,pp.638– Pancreas,vol.44,no.6,pp.918–924,2015. 643, 2014. [97] A. Doi, K. Ishikawa, N. Shibata et al., “Enhanced expression [81] S.Wang,Y.Wu,Y.Houetal.,“CXCR2macromolecularcomplex of retinoic acid receptor alpha (RARA) induces epithelial-to- in pancreatic cancer: a potential therapeutic target in tumor mesenchymal transition and disruption of mammary acinar growth,” Translational Oncology,vol.6,no.2,pp.216–225,2013. structures,” Molecular Oncology,vol.9,no.2,pp.355–364,2015. [82] A. Mangia, C. Saponaro, A. Malfettone et al., “Involvement of [98]H.-J.Yang,Y.-B.Zheng,T.Jietal.,“OverexpressionofILK1in nuclear NHERF1 in colorectal cancer progression,” Oncology breast cancer associates with poor prognosis,” Tumor Biology, Reports, vol. 28, no. 3, pp. 889–894, 2012. vol. 34, no. 6, pp. 3933–3938, 2013. [83] Y. Hayashi, J. R. Molina, S. R. Hamilton, and M.-M. Georgescu, [99] C. Leung-Hagesteijn, A. Mahendra, I. Naruszewicz, and G. “NHERF1/EBP50 is a new marker in colorectal cancer,” Neopla- E. Hannigan, “Modulation of integrin signal transduction by sia,vol.12,no.12,pp.1013–1022,2010. ILKAP, a protein phosphatase 2C associating with the integrin- [84] Y. Y. Jiang, S. O. Wang, J. Holcomb et al., “Crystallographic linked kinase, ILK1,” The EMBO Journal,vol.20,no.9,pp.2160– analysis of NHERF1-PLC𝛽3 interaction provides structural 2170, 2001. basis for CXCR2 signaling in pancreatic cancer,” Biochemical [100] J. Sanghera, P. Costello, L. Kojic et al., “Integrin-linked kinase 1 and Biophysical Research Communications,vol.446,no.2,pp. (ILK1): a ‘hot’ therapeutic target,” Clinical Cancer Research,vol. 638–643, 2014. 7, pp. 3766s–3767s, 2001. [85]T.Fukuda,L.Guo,X.Shi,andC.Wu,“CH-ILKBPregulatescell [101] D. Bang, W.Wilson, M. Ryan, J. J. Yeh, and A. S. Baldwin, “GSK- survival by facilitating the membrane translocation of protein 3𝛼 promotes oncogenic KRAS function in pancreatic cancer kinase B/Akt,” The Journal of Cell Biology,vol.160,no.7,pp. via TAK1-TAB stabilization and regulation of noncanonical NF- 1001–1008, 2003. 𝜅B,” Cancer Discovery,vol.3,no.6,pp.690–703,2013. [86] A. Eslami, K. Miyaguchi, K. Mogushi et al., “PARVB overex- [102]P.Shi,T.Yin,F.Zhou,P.Cui,S.Gou,andC.Wang,“Valproic pression increases cell migration capability and defines high risk acid sensitizes pancreatic cancer cells to natural killer cell- for endophytic growth and metastasis in tongue squamous cell mediated lysis by upregulating MICA and MICB via the carcinoma,” British Journal of Cancer,vol.112,no.2,pp.338– PI3K/Akt signaling pathway,” BMC Cancer,vol.14,article370, 344, 2015. 2014. [87] J. L. Sepulveda and C. Wu, “The parvins,” Cellular and Molecular [103] J. Y. C. Chow, M. Ban, H. L. Wu et al., “TGF-𝛽 downregulates Life Sciences,vol.63,no.1,pp.25–35,2006. PTEN via activation of NF-𝜅B in pancreatic cancer cells,” Amer- [88] G. Rosenberger, I. Jantke, A. Gal, and K. Kutsche, “Interaction ican Journal of Physiology: Gastrointestinal and Liver Physiology, of 𝛼PIX (ARHGEF6) with 𝛽-parvin (PARVB) suggests an vol. 298, no. 2, pp. G275–G282, 2010. involvement of 𝛼PIX in integrin-mediated signaling,” Human [104] L.-H. Chang, S.-L. Pan, C.-Y. Lai, A.-C. Tsai, and C.-M. Molecular Genetics,vol.12,no.2,pp.155–167,2003. Teng, “Activated PAR-2 regulates pancreatic cancer pro- [89] N. A. Nejad, F. Amidi, M. A. Hoseini et al., “Male germ-like gression through ILK/HIF-𝛼-induced TGF-𝛼 expression and cell differentiation potential of human umbilical cord Wharton’s MEK/VEGF-A-mediated angiogenesis,” The American Journal jelly-derived mesenchymal stem cells in co-culture with human of Pathology,vol.183,no.2,pp.566–575,2013. placenta cells in presence of BMP4 and retinoic acid,” Iranian [105]A.H.Huang,S.H.Pan,W.H.Changetal.,“PARVApromotes JournalofBasicMedicalSciences,vol.18,no.4,pp.325–333,2015. metastasis by modulating ILK signalling pathway in lung BioMed Research International 11

adenocarcinoma,” PLoS ONE,vol.10,no.3,ArticleIDe0118530, phenotype of the pancreatic cancer cell line PANC-1,” Genes 2015. Chromosomes & Cancer,vol.39,no.3,pp.224–235,2004. [106]Z.P.Yan,H.Z.Yin,R.Wangetal.,“Overexpressionof [121] F. Schonleben,¨ W. Qiu, J. D. Allendorf, J. A. Chabot, H. E. integrin-linked kinase (ILK) promotes migration and invasion Remotti, and G. H. Su, “Molecular analysis of PIK3CA, BRAF, of colorectal cancer cells by inducing epithelial-mesenchymal and RAS oncogenes in periampullary and ampullary adenomas transition via NF-𝜅Bsignaling,”Acta Histochemica,vol.116,no. and carcinomas,” Journal of Gastrointestinal Surgery,vol.13,no. 3, pp. 527–533, 2014. 8, pp. 1510–1516, 2009. [107] A. Traister, M. Walsh, S. Aafaqi et al., “Mutation in integrin- 𝑅211𝐴 [122] J. Sayyah, A. Bartakova, N. Nogal, L. A. Quilliam, D. G. Stupack, linked kinase (ILK ) and heat-shock protein 70 comprise and J. H. Brown, “The Ras-related protein, Rap1A, medi- a broadly cardioprotective complex,” PLoS ONE,vol.8,no.11, ates thrombin-stimulated, integrin-dependent glioblastoma cell Article ID e77331, 2013. proliferation and tumor growth,” The Journal of Biological [108]J.A.Gilbert,L.J.Adhikari,R.V.Lloyd,T.R.Halfdanarson, Chemistry, vol. 289, no. 25, pp. 17689–17698, 2014. M. H. Muders, and M. M. Ames, “Molecular markers for [123] J.-S. Park, K.-M. Lim, S. G. Park et al., “Pancreatic cancer novel therapeutic strategies in pancreatic endocrine tumors,” induced by in vivo electroporation-enhanced sleeping beauty Pancreas,vol.42,no.3,pp.411–421,2013. transposon gene delivery system in mouse,” Pancreas,vol.43, [109] F.Bergmann, S. Aulmann, B. Sipos et al., “Acinar cell carcinomas no. 4, pp. 614–618, 2014. of the pancreas: a molecular analysis in a series of 57 cases,” [124] T. Shichinohe, N. Senmaru, K. Furuuchi et al., “Suppression of Virchows Archiv,vol.465,no.6,pp.661–672,2014. pancreatic cancer by the dominant negative ras mutant, N116Y,” [110] L.Mery,B.Strauss,J.F.Dufour,K.H.Krause,andM.Hoth,“The The Journal of Surgical Research,vol.66,no.2,pp.125–130,1996. PDZ-interacting domain of TRPC4 controls its localization and [125] V. Sirivatanauksorn, Y. Sirivatanauksorn, and N. R. Lemoine, surface expression in HEK293 cells,” Journal of Cell Science,vol. “Molecular pattern of ductal pancreatic cancer,” Langenbeck’s 115,no.17,pp.3497–3508,2002. Archives of Surgery,vol.383,no.2,pp.105–115,1998. [111] I. Del Giudice, M. Messina, S. Chiaretti et al., “Behind the [126] A. Nagy, L. Kozma, I. Kiss et al., “Copy number of cancer genes scenes of non-nodal MCL: downmodulation of genes involved predict tumor grade and survival of pancreatic cancer patients,” in actin cytoskeleton organization, cell projection, cell adhe- Anticancer Research,vol.21,no.2,pp.1321–1326,2001. sion, tumour invasion, TP53 pathway and mutated status of immunoglobulin heavy chain genes,” British Journal of Haema- [127] T. Seufferlein, J. Van Lint, S. Liptay, G. Adler, and R. M. Schmid, 𝛼 tology,vol.156,no.5,pp.601–611,2012. “Transforming growth factor activates Ha-Ras in human pan- creatic cancer cells with Ki-ras mutations,” Gastroenterology, [112] X. L. Li, X. Lu, S. Parvathaneni et al., “Identification of RECQ1- vol. 116, no. 6, pp. 1441–1452, 1999. regulated transcriptome uncovers a role of RECQ1 in regulation of cancer cell migration and invasion,” Cell Cycle,vol.13,no.15, [128] T. J. Haggerty, I. S. Dunn, L. B. Rose, E. E. Newton, F. pp. 2431–2445, 2014. Pandolfi, and J. T. Kurnick, “Heat shock protein-90 inhibitors enhance antigen expression on melanomas and increase T cell [113] M. D. Pastor, A. Nogal, S. Molina-Pinelo et al., “Identification of recognition of tumor cells,” PLoS ONE,vol.9,no.12,ArticleID proteomic signatures associated with lung cancer and COPD,” e114506, 2014. Journal of Proteomics,vol.89,pp.227–237,2013. [114] J. D. Figueroa, S. S. Han, M. Garcia-Closas et al., “Genome- [129] Y. J. Zhou and R. J. Binder, “The heat shock protein-CD91 path- wide interaction study of smoking and bladder cancer risk,” way mediates tumor immunosurveillance,” OncoImmunology, Carcinogenesis,vol.35,pp.1737–1744,2014. vol. 3, no. 4, Article ID e28222, 2014. [115] J. H. Zhou, Y. J. Feng, K. T. Tao et al., “The expression and [130] K. Dodd, S. Nance, M. Quezada et al., “Tumor-derived phosphorylation of ezrin and merlin in human pancreatic inducible heat-shock protein 70 (HSP70) is an essential compo- cancer,” International Journal of Oncology,vol.45,no.6,pp. nent of anti-tumor immunity,” Oncogene,vol.34,pp.1312–1322, 2059–2067, 2014. 2015. [116]Y.X.Meng,Z.H.Lu,S.N.Yu,Q.A.Zhang,Y.H.Ma,andJ. [131]N.Qiao,Y.Zhu,H.Li,Z.Qu,andZ.Xiao,“Expressionofheat Chen, “Ezrin promotes invasion and metastasis of pancreatic shock protein 20 inversely correlated with tumor progression in cancer cells,” Journal of Translational Medicine, vol. 8, article 61, patients with ovarian cancer,” European Journal of Gynaecologi- 2010. cal Oncology, vol. 35, no. 5, pp. 576–579, 2014. [117] H. M. Kocher, J. Sandle, T. A. Mirza, N. F. Li, and I. R. Hart, [132] U. Ramp, C. Mahotka, S. Heikaus et al., “Expression of heat “Ezrin interacts with cortactin to form podosomal rosettes in shock protein 70 in renal cell carcinoma and its relation to pancreatic cancer cells,” Gut,vol.58,no.2,pp.271–284,2009. tumor progression and prognosis,” Histology and Histopathol- [118] Y. Oda, S. Aishima, K. Morimatsu et al., “Differential ezrin and ogy,vol.22,no.10–12,pp.1099–1107,2007. phosphorylated ezrin expression profiles between pancreatic [133] J. J. Hyun, H. S. Lee, B. Keum et al., “Expression of heat shock intraepithelial neoplasia, intraductal papillary mucinous neo- protein 70 modulates the chemoresponsiveness of pancreatic plasm, and invasive ductal carcinoma of the pancreas,” Human cancer,” Gut and Liver,vol.7,no.6,pp.739–746,2013. Pathology, vol. 44, no. 8, pp. 1487–1498, 2013. [134] Y. Kuramitsu, Y. Wang, K. Taba et al., “Heat-shock protein 27 [119] J. V. Thevathasan, E. Tan, H. Zheng et al., “The small GTPase plays the key role in gemcitabine-resistance of pancreatic cancer HRas shapes local PI3K signals through positive feedback cells,” Anticancer Research,vol.32,no.6,pp.2295–2299,2012. and regulates persistent membrane extension in migrating [135] Z. Zhang, K. Kawamura, Y. Jiang et al., “Heat-shock protein fibroblasts,” Molecular Biology of the Cell,vol.24,no.14,pp. 90 inhibitors synergistically enhance melanoma differentiation- 2228–2237, 2013. associated gene-7-mediated cell killing of human pancreatic [120] H. Fensterer, K. Giehl, M. Buchholz et al., “Expression profil- carcinoma,” Cancer Gene Therapy,vol.20,no.12,pp.663–670, ing of the influence of RAS mutants on the TGFB1-induced 2013. 12 BioMed Research International

[136] T. N. MacKenzie, N. Mujumdar, S. Banerjee et al., “Triptolide [151] T. A. Bailey, H. Luan, E. Tom et al., “A kinase inhibitor induces the expression of miR-142-3p: a negative regulator of screen reveals protein kinase C-dependent endocytic recycling heat shock protein 70 and pancreatic cancer cell proliferation,” of ErbB2 in breast cancer cells,” The Journal of Biological Molecular Cancer Therapeutics,vol.12,no.7,pp.1266–1275, Chemistry,vol.289,no.44,pp.30443–30458,2014. 2013. [152] H. Qi, B. Sun, X. Zhao et al., “Wnt5a promotes vasculogenic [137] Y. K. Yu, A. Hamza, T. Zhang et al., “Withaferin A targets mimicry and epithelial-mesenchymal transition via protein heat shock protein 90 in pancreatic cancer cells,” Biochemical kinase C𝛼 in epithelial ovarian cancer,” Oncology Reports,vol. Pharmacology,vol.79,no.4,pp.542–551,2010. 32,no.2,pp.771–779,2014. [138] S. Mori-Iwamoto, K. Taba, Y. Kuramitsu et al., “Interferon-𝛾 [153] A. Doller, C. Winkler, I. Azrilian et al., “High-constitutive HuR down-regulates heat shock protein 27 of pancreatic cancer cells phosphorylation at Ser 318 by PKC delta propagates tumor and helps in the cytotoxic effect of gemcitabine,” Pancreas,vol. relevant functions in colon carcinoma cells,” Carcinogenesis,vol. 38,no.2,pp.224–226,2009. 32,no.5,pp.676–685,2011. [139] S. A. Lang, C. Moser, A. Gaumann et al., “Targeting heat [154]A.M.Butler,M.L.S.ScottiBuzhardt,S.Li,K.E.Smith,A. shock protein 90 in pancreatic cancer impairs insulin-like P. Fields, and N. R. Murray, “Protein kinase C Zeta regulates growth factor-I receptor signaling, disrupts an interleukin- human pancreatic cancer cell transformed growth and invasion 6/signal-transducer and activator of transcription 3/hypoxia- through a STAT3-dependent mechanism,” PLoS ONE,vol.8,no. inducible factor-1𝛼 autocrine loop, and reduces orthotopic 8, Article ID e72061, 2013. tumor growth,” Clinical Cancer Research,vol.13,no.21,pp. [155]Y.Chen,G.Z.Yu,D.H.Yu,andM.H.Zhu,“PKC𝛼-induced 6459–6468, 2007. drug resistance in pancreatic cancer cells is associated with [140] V.Dudeja,P.Phillips,R.Sharif,Y.Zhan,R.Dawra,andA.Saluja, transforming growth factor-𝛽1,” Journal of Experimental & “Heat shock protein 70 (HSP70) confers resistance to apoptosts Clinical Cancer Research,vol.29,no.1,article104,2010. in pancreatic cancer cells by lysosomal stabilization,” Pancreas, [156] C. E. Antal, A. M. Hudson, E. Kang et al., “Cancer-associated vol. 33, no. 4, p. 457, 2006. protein kinase C mutations reveal kinase’s role as tumor sup- [141] W. Fendler, I. Klich, A. Cie´slik-Heinrich, K. Wyka, A. Szad- pressor,” Cell,vol.160,no.3,pp.489–502,2015. kowska, and W. Młynarski, “Increased risk of type 1 diabetes in 󸀠 [157] M. Linch, M. Sanz-Garcia, E. Soriano et al., “Acancer-associated Polish children—association with INS-IGF2 5 VNTR and lack mutation in atypical protein kinase C𝜄 occurs in a substrate- of association with HLA ,” Endokrynologia Polska,vol. specific recruitment motif,” Science Signaling, vol. 6, no. 293, 62,no.5,pp.436–442,2011. article ra82, 2013. [142] G. Sutherland, G. Mellick, J. Newman et al., “Haplotype anal- [158] P. Storz, “Targeting protein kinase C subtypes in pancreatic ysis of the IGF2-INS-TH gene cluster in Parkinson’s disease,” cancer,” Expert Review of Anticancer Therapy,vol.15,no.4,pp. American Journal of Medical Genetics. Part B: Neuropsychiatric 433–438, 2015. Genetics,vol.147,no.4,pp.495–499,2008. [143] C. Polychronakos, A. Kukuvitis, N. Giannoukakis, and E. Colle, “Parental imprinting effect at the IPIS-IGF2 diabetes susceptibility locus,” Diabetologia,vol.38,no.6,pp.715–719, 1995. [144]G.Y.F.Ho,A.Melman,S.-M.Liuetal.,“Polymorphismofthe insulin gene is associated with increased prostate cancer risk,” British Journal of Cancer, vol. 88, no. 2, pp. 263–269, 2003. [145]I.A.Eaves,S.T.Bennett,P.Forsteretal.,“Transmissionratio distortion at the INS-IGF2 VNTR,” Nature Genetics,vol.22,no. 4, pp. 324–325, 1999. [146] C. A. Haiman, Y. Han, Y. Feng et al., “Genome-wide testing of putative functional exonic variants in rlationship with breast and prostate cancer risk in a multiethnic population,” PLoS Genetics,vol.9,no.3,ArticleIDe1003419,2013. [147]N.Kanatsuna,A.Delli,C.Anderssonetal.,“Doublyreac- tive INS-IGF2 autoantibodies in children with newly diag- nosed autoimmune (type 1) diabetes,” Scandinavian Journal of Immunology,vol.82,pp.361–369,2015. [148]N.Kanatsuna,J.Taneera,F.Vaziri-Sanietal.,“Autoimmunity against INS-IGF2 protein expressed in human pancreatic islets,” JournalofBiologicalChemistry, vol. 288, no. 40, pp. 29013– 29023, 2013. [149] G. Paroni, M. Fratelli, G. Gardini et al., “Synergistic antitumor activity of lapatinib and retinoids on a novel subtype of breast cancer with coamplification of ERBB2 and RARA,” Oncogene, vol. 31, no. 29, pp. 3431–3443, 2012. [150] A. Saumet, G. Vetter, M. Bouttier et al., “Transcriptional repres- sion of microRNA genes by PML-RARA increases expression of key cancer proteins in acute promyelocytic leukemia,” Blood, vol.113,no.2,pp.412–421,2009.