International Conference on Bioinformatics Delegate Book 2014 Page 0

TABLE OF CONTENTS

CONFERENCE SPONSORS 2

ORGANISING COMMITTEES 3

DELEGATE INFORMATION
  Venue 6
  Organisers Office and Registration Desk 6
  Registration 6
  Name Tags 6
  Speaker Presentations 6
  Social Functions 6
  Hotel Check Outs 6
  Insurance 6
  Disclaimer 6
  Smoking 6

INVITED SPEAKERS 7

PROGRAM
  Thursday 31st July 2014 12
  Friday 1st August 2014 15
  Saturday 2nd August 2014 19

POSTER LISTING 20

ABSTRACTS
  Orals 22
  Posters 69

AUTHOR INDEX 90

EXHIBITORS 96

ATTENDEES 97


CONFERENCE SPONSORS

ORGANISER

SPONSORS

EXHIBITORS

SUPPORTERS

AFFILIATIONS


SCIENTIFIC PROGRAM COMMITTEE

Chair
Shoba Ranganathan, Macquarie University, Australia

Members
Bruno Gaeta, University of New South Wales, Australia
Kenta Nakai, Tokyo University, Japan
Asif M Khan, Perdana University, Malaysia
Christian Schönbach, Nazarbayev University, Kazakhstan
Tin Wee Tan, National University of Singapore, Singapore

Program Committee Co-Chairs
Shoba Ranganathan, Macquarie University, Australia
Christian Schönbach, Nazarbayev University, Kazakhstan

Local Organising Committee
Dr. Daniel Sze, Hong Kong Polytechnic University, Hong Kong
Dr. Abidali Mohamedali, Macquarie University, Australia
Ms. Sowmya Gopichandran, Macquarie University, Australia
Mr. Mohammad Islam, Macquarie University, Australia

Members
Shandar Ahmad, National Institute of Biomedical Innovation, Japan
Tatsuya Akutsu, Kyoto University, Japan
Shunsuke Aoki, Kyushu Institute of Technology, Japan
Nicola Armstrong, University of New South Wales, Australia
Vladimir Bajic, King Abdullah University of Science and Technology, Kingdom of Saudi Arabia
Christopher Baker, University of New Brunswick, Canada
Sergio Baranzini, University of California at San Francisco, USA
Arsen Batagov, Bioinformatics Institute, Singapore
Alex Bateman, The Wellcome Trust Sanger Institute, UK
Vladimir Brusic, Dana-Farber Cancer Institute, USA
Zhi-Wei Cao, Shanghai Center for Bioinformation Technology, China
Filippo Castiglione, National Research Council (CNR), Italy
Jonathan Chan, King Mongkut's University of Technology Thonburi, Thailand
Jiajia Chen, Suzhou University of Science & Technology, China
Ming Chen, Zhejiang University, China
Wai-Ki Ching, Hongkong University, Hong Kong
Qinghua Cui, Peking University, China
Ning Deng, Zhejiang University, China
Frank Eisenhaber, Bioinformatics Institute, Singapore
Mahmoud Elhefnawi, National Research Centre, Egypt
Mohd Firdaus-Raih, Universiti Kebangsaan Malaysia, Malaysia
Andrew French, University of Nottingham, UK
Ge Gao, Peking University, China
Pascale Gaudet, Swiss Institute of Bioinformatics, Switzerland
Charles Gilman, Nazarbayev University, Kazakhstan
Marsia Gustiananda, Eijkman Institute for Molecular Biology, Indonesia
Timothy Hancock, Kyoto University, Japan
Matthew He, Nova Southeastern University, USA
Yongqun He, University of Michigan Medical School, USA
Chia-Lang Hsu, National Taiwan University, Taiwan
Guang Hu, Soochow University, China
Chun-Hsi Huang, University of Connecticut, USA
Ulibek Kairov, Nazarbayev University, Kazakhstan
Asif M Khan, Perdana University, Malaysia
Javed Khan, Macquarie University, Australia
Tsung-Fei Khang, University of Malaya, Malaysia
Daisuke Kiga, Tokyo Institute of Technology, Japan
Akira Kinjo, Osaka University, Japan
Akihiko Konagaya, Tokyo Institute of Technology, Japan
Shinji Kondo, National Institute of Polar Research, Japan
Anton Kratz, RIKEN Omics Center, Japan
Gaurav Kumar, Virginia Commonwealth University, USA
Igor Kurochkin, Bioinformatics Institute, Singapore
Chee Keong Kwoh, Nanyang Technological University, Singapore
Guozheng Li, Tongji University, China
Jinyan Li, University of Technology, Sydney, Australia
Xiaoli Li, Institute for Infocomm Research, Singapore
Wei Lin, Fudan University, China
Xinghua Lu, University of Pittsburgh, USA
Hiroshi Mamitsuka, Kyoto University, Japan
Hideo Matsuda, Osaka University, Japan
Bui Quang Minh, Center for Integrative Bioinformatics Vienna, Austria
Lenny Moise, University of Rhode Island, USA
Santo Motta, University of Catania, Italy
Kenta Nakai, University of Tokyo, Japan
Yasushi Okazaki, Saitama Medical University, Japan
Francesco Pappalardo, University of Catania, Italy
Ashwini Patil, University of Tokyo, Japan
Nikolai Petrovsky, Flinders Medical Centre, Australia
Jiang Qian, Johns Hopkins University, USA
Yasubumi Sakakibara, Keio University, Japan
Meena Sakharkar, University of Tsukuba, Japan
Daniele Santoni, National Research Council, Italy
Arman Saparov, Nazarbayev University, Kazakhstan
Tetsuo Shibuya, The University of Tokyo, Japan
Narayanaswamy Srinivasan, Indian Institute of Science, India
Durai Sundar, Indian Institute of Technology Delhi, India
Y-H. Taguchi, Chuo University, Japan
Yoichi Takenaka, Osaka University, Japan
Martti Tammi, Prince Songkhla University, Thailand
Weidong Tian, Fudan University, China
Paolo Tieri, National Research Council, Italy
Joo Chuan Tong, National University of Singapore, Singapore
Sissades Tongsima, National Center for Genetic Engineering and Biotechnology, Thailand
Ikuo Uchiyama, National Institute for Basic Biology, Japan
Chandra Verma, Bioinformatics Institute, Singapore
Mauno Vihinen, Lund University
Guohua Wang, Harbin Institute of Technology, China
Jin Wang, Nanjing University, China
Junbai Wang, Oslo University Hospital, Norway
Yong Wang, Chinese Academy of Mathematics and Systems Science, China
Xiujie Wang, CAS Institute of Genetics and Developmental Biology, China
Martin Wasser, Bioinformatics Institute, Singapore
Dongqing Wei, Shanghai Jiao Tong University, China
Gonghong Wei, University of Oulu, Finland
Limsoon Wong, National University of Singapore, Singapore
Jingfa Xiao, Beijing Institute of Genomics, China
Chao Xie, National University of Singapore, Singapore
Yu Xue, Huazhong University of Science and Technology, China
Wenying Yan, Soochow University, China
Yan Zhang, Chinese Academy of Sciences (CAS), China
Guang Lan Zhang, Boston University, USA
Xingming Zhao, Tongji University, China
Zhongming Zhao, Vanderbilt University Medical Center, USA
Dongxiao Zhu, Wayne State University, USA
Shanfeng Zhu, Fudan University, China

Publications Committee

Kenta Nakai, Tokyo University, Japan
Shoba Ranganathan, Macquarie University, Australia
Christian Schönbach, Nazarbayev University, Kazakhstan
Tin Wee Tan, National University of Singapore, Singapore

Conference Secretariat
ASN Events Pty Ltd
PO Box 200 (3056 Frankston Flinders Road)
Balnarring VIC 3926
Ph: 03 5983 2400
Fax: 03 5983 2223
Email: [email protected]


DELEGATE INFORMATION

Venue
Cnr. Grand Parade & Princess Street
Brighton-le-Sands, NSW 2216
Australia
Phone: +61 2 9556 5111

Organiser’s Office and Registration Desk
The organiser’s office is located in the hotel foyer. Any enquiries can be directed to ASN Events staff at the organiser’s office, with the exception of enquiries regarding accommodation, which should be directed to Novotel Brighton Beach. The conference office hours are:
Wednesday 30th July 7:00 pm – 8:00 pm
Thursday 31st July 8:00 am – 5:00 pm
Friday 1st August 8:00 am – 5:30 pm
Saturday 2nd August 8:00 am – 12:00 pm

Registration
Conference delegates receive the following services as part of their registration:
- Access to all sessions
- Conference satchel complete with program*
- Morning teas, lunches and afternoon tea
- Conference Dinner at Sahra Restaurant (only for full conference registrants)
- Use of the conference app: http://incob2014.m.asnevents.com.au

*All delegates receive a copy of the proceedings, but satchels can only be given to trade delegates if supply allows

Name Tags
Delegates and partners are required to wear their name tags to all scientific and catered sessions.

Speaker Presentations
ASN staff will be able to assist presenters with loading their presentations. Speakers can contact ASN staff members at the registration desk during any break. Speakers are encouraged to load their presentations as soon as possible to avoid a last-minute rush. The standard AV set-up for all presentations will be data projection using MS PowerPoint. All presentations will be run from a PC.

Social Functions

Conference Dinner: Sahra Restaurant, 88A The Grand Parade, Brighton-Le-Sands
All delegates with a ticket are welcome to attend the dinner. Dinner is from 7:00 pm. Partners can attend the function on payment for an additional conference dinner ticket (see the registration desk). There is no reserved seating for this function.

Hotel Check Outs
You are required to check out of your room before 10:00 am. The hotel reception has facilities to store your luggage. If you ask reception early enough, you may be able to organise a late check-out.

Insurance
The hosts and organisers are not responsible for personal accidents, any travel costs, or the loss of private property and will not be liable for any claims. Delegates requiring insurance should make their own arrangements.

Disclaimer
The hosts, organisers and participating societies are not responsible for, or represented by, the opinions expressed by participants in either the sessions or their written abstracts.

Smoking
Smoking is not permitted in the venue.


SPEAKERS

Professor Mary O’Kane

Professor Mary O’Kane has served as the NSW Chief Scientist & Engineer since 2008. She is also Executive Chairman of Mary O’Kane & Associates Pty Ltd, a Sydney-based company that advises governments, universities and the private sector on innovation, research, education and development. Professor O’Kane’s other appointments include Chair of the Australian Centre for Renewable Energy, Chair of the Development Gateway and the Development Gateway International, Chair of the CRC for Spatial Information, a director of PSMA Ltd, Business Events Sydney, and the Australian Business Foundation and a Board member of NICTA (National ICT Australia Ltd). Professor O’Kane was Vice-Chancellor and President of the University of Adelaide from 1996-2001 and Deputy Vice-Chancellor (Research) from 1994-96. Before that, she was Dean of the Faculty of Information Sciences and Engineering at the University of Canberra. She is a former member of the Australian Research Council, the Co-operative Research Centres (CRC) Committee, the board of FH Faulding & Co Ltd and the board of the CSIRO. She is Vice President of the Academy of Technological Sciences and Engineering and a Fellow of Engineers Australia. Professor O’Kane is spearheading the NSW Translational Bioinformatics initiative, which seeks to encourage research and innovation in health services, strengthen the research workforce, and build research assets and maximise their use.

Professor Gil Omenn

Professor Gil Omenn is the Director of the Center for Computational Medicine and Bioinformatics (CCMB) at the University of Michigan Medical School. He is a Professor of Molecular Medicine & Genetics, Professor of Human Genetics, Professor of Internal Medicine, Professor of Public Health at the School of Public Health and Research Professor at the Department of Computational Medicine & Bioinformatics. Professor Omenn's research focuses on cancer proteomics and informatics. He leads the Proteomics Alliance for Cancer Research, the HUPO Plasma Proteome Project, the Driving Biological Problems Core of the National Center for Integrative Biomedical Informatics, and the Center for Computational Medicine and Bioinformatics. Each of these initiatives offers datasets for the application of analytical tools and research teams eager to engage in collaborative studies. He has long-standing interests in mechanisms of genetic predispositions to risks from environmental and occupational exposures, pharmacogenetics and pharmacogenomics, and science-based risk analyses. Professor Omenn also served as Executive Vice President for Medical Affairs and as Chief Executive Officer of the University of Michigan Health System from 1997-2002. He is PI of the Michigan Life Sciences Corridor Proteomics Alliance for Cancer research program and leader of the international Human Proteome Organization (HUPO). He also recently received the 2013 David E. Rogers Award from the Association of American Medical Colleges (AAMC).


Professor Terry Gaasterland

Dr. Terry Gaasterland is a computer scientist turned computational molecular biologist. Her work seeks to understand the program of the cell encoded in the genome. She earned her undergraduate degree in Computer Science and Russian with a minor in Chemistry from Duke University as an A.B. Duke Scholar, and her Ph.D. in Computer Science from the University of Maryland. As an Enrico Fermi Fellow at the Department of Energy's Argonne National Laboratory and then as an Assistant Professor of Computer Science at the University of Chicago, she applied techniques from her work in "cooperative answering", natural language processing, and deductive database research to the interpretation of the first three DOE-funded microbial genomes and a fourth Canadian-funded archaeal genome. During seven years as a Head of Lab at Rockefeller University, Dr. Gaasterland focused on the integration of gene expression data and genome sequence data analysis in human and model eukaryotic organisms. Ten years ago, Dr. Gaasterland moved her Laboratory of Computational Genomics to UCSD to establish the Scripps Genome Center, a UCSD resource based at the Scripps Institution of Oceanography in the Marine Biology Division, with bioinformatics hardware and software housed at the San Diego Supercomputer Center. At UCSD, she is now Professor of Computational Biology and Genomics at SIO and a faculty member in UCSD’s Institute for Genomic Medicine. Since receiving the Presidential Early Career Award in Science and Engineering (PECASE) in 2000, she has been continuously funded by the National Science Foundation and the National Institutes of Health to develop and use methods in computational genomics. Her accomplishments in computational molecular biology, as well as her early career work in deductive databases, are reflected in over 90 refereed publications, with over 80 indexed in PubMed.

Dr. Gaasterland designs and uses computational tools to decipher and interrogate cell systems through integrated analysis of genomic and proteomic data. Her work aims to address the general question: How does regulation of transcription and translation modulate and affect cell state changes? She applies this approach to understand optic nerve degeneration in primary open angle glaucoma (POAG). This chronic eye disease affects more than 2 million Americans over age 40, and causes blindness in over 3 million people worldwide each year, affecting quality of life in aging populations. Her laboratory is seeking genes and molecular mechanisms responsible for risk and progression. A member of the NEIGHBOR Consortium to study POAG and the NHGRI Medical Sequencing program, Dr. Gaasterland is sequencing and analyzing variation in transcribed exons genome-wide for 400 primary open angle glaucoma cases and controls. To decipher molecular mechanisms affected by variation, Dr. Gaasterland uses a combination of high-throughput sequencing, quantitative PCR, and computational analysis to identify and test regulatory binding sites and non-coding RNAs affected by mutations.

Professor Terry Speed

Professor Terry Speed is an expert mathematician and statistician who has applied mathematical theories to a range of problems in forensic science, medical science, farming and mining. However, his main research focus is the application of statistics to problems in genetics and molecular biology. Internationally, Professor Speed is regarded as the leading expert in the analysis of microarray data and has made ground-breaking contributions to the fields of bioinformatics, statistical genetics, the analysis of designed experiments, graphical models and Bayes networks. He was recently awarded the 2013 Prime Minister’s Science Prize, Australia’s highest award for excellence in science research, for his work in bioinformatics. In the same year that he received the prize, he was elected a Fellow of the Royal Society, United Kingdom, while in 2012 he was the recipient of the Victoria Prize for Science and Innovation and won the Thomson Reuters Citation Award in Biochemistry and Molecular Biology for being the most cited Australian researcher in that field for the past decade. He also received the inaugural National Health and Medical Research Council (NHMRC) Achievement Award for Excellence in Health and Medical Research in 2007, an NHMRC Fellowship in 2009 and the Australian Government Centenary Medal in 2001. He is now a Senior Principal Research Scientist and the head of the Bioinformatics division at the Walter and Eliza Hall Institute of Medical Research in Melbourne, where his main research focus is metabolic flux analysis, estimating 13C enrichment in time course experiments, basecalling for resequencing chips, and phylogenomics.


Professor John Mattick

Professor John Mattick is the Executive Director of the Garvan Institute of Medical Research, one of Australia's best medical research organisations. He is also a Professor of Molecular Biology and Australian Research Council Federation Fellow at the Institute for Molecular Bioscience, University of Queensland. He has worked at Baylor College of Medicine in Houston and the CSIRO Division of Molecular Biology in Sydney, and has been based at the Universities of Cambridge, Oxford, Cologne, Strasbourg and Queensland since 1988. He was the Foundation Director of the Australian Genome Research Facility and the Institute for Molecular Bioscience. Professor Mattick completed his Bachelor’s degree with First Class Honours in Biochemistry at the University of Sydney. He then went on to obtain his PhD at Monash University in Melbourne. Soon afterwards, the idea occurred to him that ‘junk’ DNA was not actually evolutionary debris, as previously thought, but may be involved in the orchestration of the growth, division and differentiation of genes. He went on to test his idea, and has published more than 200 scientific papers to date on the subject in journals such as Science, Nature Genetics, Nature Reviews Genetics, Genome Research, PNAS, Human Molecular Genetics and Scientific American. As a result, Professor Mattick’s research has contributed great insight into the complexity and depth of the human transcriptome. He has received numerous awards for his contribution, including the Biotechnology Medal by the Australian Biochemical Society, the Centenary Medal of the Australian Government, the CSIRO Eureka Prize for Leadership in Science, the inaugural Gutenberg Professorship of the University of Strasbourg, and the Julian Wells Medal of the Lorne Genome Society.

Professor Lars Nielsen

Professor Lars Nielsen is the Chair of Biological Engineering at the Australian Institute for Bioengineering and Nanotechnology (AIBN) at the University of Queensland. His research interests are in the fields of haematotherapy, immunotherapy and organotypic models for the study of disease and treatment in the tissue engineering area. He has developed novel strategies for generating microtissues for drug screening and for using stem cells to produce red and white blood cells for transfusion. He leads a research group at the Centre for Systems and Synthetic Biology that has made considerable advances in projects involving polymer production in bacteria, recombinant protein and virus production in animal cells, and metabolic engineering of sugarcane. Professor Nielsen has been granted four patents in stem cells and in metabolic engineering and has received the UQ Foundation Research Excellence Award and the Australian Institute of Political Science Queensland Young Tall Poppy Award. His work has also led to project collaborations with the world's leading metabolic engineers from Korea and the United States.

Associate Professor Regina Berretta

A/Prof Regina Berretta is Head of Discipline of Computer Science and Software Engineering at the University of Newcastle, Australia and one of the founding academics of the Priority Research Centre for Bioinformatics, Biomarker Discovery and Information Based Medicine. A/Prof Berretta holds degrees in Computational and Applied Mathematics and Mathematics for Teaching, as well as a Master's and a PhD (all from UNICAMP, Brazil) in the area of metaheuristic methods to address integer programming problems. She held a prestigious early-career fellowship at the University of Sao Paulo, Brazil prior to joining the University of Newcastle in 2003. Her main research interest is the development of mathematical models and computational methods to solve problems in bioinformatics, with an emphasis on personalized medicine. She has substantial expertise in the development of heuristics and metaheuristics for tackling complex combinatorial optimization problems in several areas (production planning, education timetabling, functional genomics, etc.). She has published more than 60 papers and has been awarded more than 15 competitive grants.


Associate Professor Jean Yang

Jean Yee Hwa Yang is currently an Associate Professor and an ARC Future Fellow in the School of Mathematics and Statistics at the University of Sydney. Her research work has centred on the development of statistical methodology and the application of statistics to problems in genomics, proteomics and biomedical research. In particular, her focus is on developing methods for integrating expression studies and other biological metadata such as miRNA expression, sequence information and clinical data. As a statistician who works in the bioinformatics area, she works in a collaborative environment with scientific investigators from diverse backgrounds. Jean completed a bachelor’s degree in statistics at the University of Sydney before her PhD in the Department of Statistics at the University of California, Berkeley, on the design and analysis of cDNA microarray experiments. Jean is a member of the core team in the Bioconductor project, an open source and open development software project for the analysis of genomic and other biological data, and actively contributes to organizing the annual Sydney Bioinformatics Research Symposium.

Dr Michelle Brazas

Dr. Michelle Brazas is currently working at the Ontario Institute for Cancer Research (OICR) as a Manager of Research and Knowledge Translation, where she brokers knowledge between research areas to further research outcomes. In this role, she coordinates and facilitates the advanced bioinformatics workshops offered through Bioinformatics.ca. She also plays an active role in the ISCB Education committee, and is the Secretariat on the Executive Board of the Global Organisation for Bioinformatics Learning, Education & Training (GOBLET), where she helps coordinate bioinformatics training endeavors worldwide.


PROGRAM

Thursday 31st July 2014

Registration Opens 8:00 AM Foyer

Welcome to InCoB 2014 9:00 AM – 9:45 AM Endeavour 1 & 2
Mary O’Kane, Chief Scientist & Engineer, NSW Government, Australia
Mark Baker, President, Human Proteome Organization (HUPO)
Terry Gaasterland, Vice-President, International Society for Computational Biology (ISCB)
Shoba Ranganathan, President, Asia-Pacific Bioinformatics Network (APBioNet)

Plenary 1 9:45 AM – 10:30 AM Endeavour 1 & 2
Chair: Mark Baker
Gil Omenn: Strategies and Progress of the HUPO Human Proteome Project abs#001

Coffee Break 10:30 AM – 11:00 AM Pre-Function Area

Plenary 2 11:00 AM – 11:45 AM Endeavour 1 & 2
Chair: Shoba Ranganathan
ISCB Speaker: Terry Gaasterland: Integrating exome sequencing, mRNA-seq, and microRNA-seq to identify genes and mechanisms in optic nerve degeneration abs#002

Plenary 3 11:45 AM – 12:30 PM Endeavour 1 & 2
Chair: Shoba Ranganathan
Terry Speed: Normalization of -omic data after 2007 abs#003

Lunch Break & Poster/Trade Display Viewing 12:30 PM – 1:30 PM Pre-Function Area/Sirius 2

Genome & Transcriptome Informatics I 1:30 PM – 3:00 PM Endeavour 1
Chair: Hideo Matsuda
1:30 PM Tzu-Hsien Yang: cisMEP: an integrated repository of genomic epigenetic profiles and cis-regulatory modules in Drosophila abs#004
1:45 PM Po-Cheng Hung: YNA: an integrative gene mining platform for studying chromatin structure and its regulation in yeast abs#005
2:00 PM Naoki Matsushita: Metagenome Fragment Classification Based on Multiple Motif-Occurrence Profiles abs#006
2:15 PM Mostafa Abbas: Assessment of genome assemblers for fungal draft genomes abs#007
2:30 PM Xiu-Jie Wang: Broad existence of pluripotent factor regulated transcript isoforms with stage-specific alternative first exons (SAFE) in mouse embryonic stem cells abs#008
2:45 PM Fu-Jou Lai: A comprehensive performance evaluation on the prediction results of existing cooperative transcription factors identification algorithms abs#009


Protein and Proteome Informatics I 1:30 PM – 3:00 PM Endeavour 2
Chair: Paul Horton
1:30 PM Yu Xue: CPLM: a database of protein lysine modifications abs#010
1:45 PM Yi-Yuan Chiu: Homopharma: A new concept for exploring the molecular binding mechanisms and drug repurposing abs#011
2:00 PM Cheng-Tsung Li: Characterization and identification of protein O-GlcNAcylation sites with substrate specificity abs#012
2:15 PM Ahmet Sinan Yavuz: Predicting sumoylation sites using support vector machines based on various sequence features, conformational flexibility and disorder abs#013
2:30 PM Haifen Chen: IFACEwat: the interfacial water-implemented re-ranking algorithm to improve the discrimination of near native structures for protein rigid docking abs#014
2:45 PM Dana Pascovici: Combining protein ratio p-values yields a useful pragmatic approach to the analysis of multi-run iTRAQ experiments abs#015

Structural Bioinformatics 1:30 PM – 3:00 PM Endeavour 3
Chair: Durai Sundar
1:30 PM Sonam Grover: Computational identification of novel natural inhibitors of glucagon receptor for checking type II diabetes mellitus abs#016
1:45 PM Fabian A Buske: Utilising graph databases to analyse the 3D organisation of chromatin abs#017
2:00 PM Avinash Mishra: Bhageerath-H: A homology/ab initio hybrid server for predicting tertiary structures of monomeric soluble proteins abs#018
2:15 PM Yi-Fan Liou: SCMHBP: Prediction and analysis of heme binding proteins using propensity scores of dipeptides abs#019
2:30 PM Abdollah Dehzangi: Improving protein fold recognition using the amalgamation of evolutionary based and structural based information abs#020
2:45 PM Qian Liu: Use B-factor related features for accurate classification between protein binding interfaces and crystal packing contacts abs#021

High Performance & Supercomputing in Bioinformatics & Bioimaging 1:30 PM – 3:00 PM Sirius 1
Chair: Tin Wee Tan & Akihiko Konagaya
1:30 PM Ahmed Metwalli: CloudSACA: distributed suffix array construction algorithms Package on Cloud abs#022
1:45 PM Mohammad Islam & Stuart Allen: Intersect data storage – technologies used, lessons learned abs#023
2:00 PM Carlos Riveros: Discovery of gene interactions by GPU-enabled computation of pairwise expression level metafeatures abs#024
2:15 PM Dimitri Perrin: Whole-Brain imaging at the single-cell resolution with the CUBIC Method abs#025
2:30 PM Kuleesha: FMAj: a tool for high content analysis of muscle dynamics in Drosophila metamorphosis abs#026
2:45 PM Akihiko Konagaya: Automated microtubule path tracking on gliding assay using hidden Markov model abs#027


Coffee Break 3:00 PM – 3:30 PM Pre Function Area

Genome & Transcriptome Informatics II 3:30 PM – 4:30 PM Endeavour 1
Chair: Jean Yang
3:30 PM Gong Zhang: FANSe2: an accurate read mapping algorithm that re-interprets the next generation sequencing abs#028
3:45 PM Kulwadee Somboonviwat: miRNA Workbench: a computational toolkit for miRNA identification and target prediction experiments abs#029
4:00 PM Conrad Burden: Error estimates for the analysis of differential expression from RNA-seq count data abs#030
4:15 PM Margaret R Donald: Evaluating two-factor experimental results for RNA-Seq data using simulation abs#031

Protein and Proteome Informatics II 3:30 PM – 4:30 PM Endeavour 2
Chair: Ge Gao
3:30 PM Vladimir Brusic: Tumor antigens as proteogenomic biomarkers in invasive ductal carcinomas abs#032
3:45 PM Tzu-Hsien Yang: iPhos: toolkit to streamline the alkaline phosphatase assisted comprehensive LC-MS phosphoproteome investigation abs#033
4:00 PM Julian Uszkoreit: The bacterial proteogenomic pipeline abs#034
4:15 PM Vladimir Brusic: Pathway analysis and transcriptomics improve protein identification by shotgun proteomics from samples of small number of cells abs#035

Pathways & Networks 3:30 PM – 4:30 PM Endeavour 3
Chair: Lars Nielsen
3:30 PM Ashwini Patil: TimeXNET: Identifying active gene sub-networks using time-course gene expression profiles abs#036
3:45 PM Rebecca L Barter: Network-based biomarkers enhance classical approaches to prognostic gene expression signatures abs#037
4:00 PM Chia-Hao Chin: cytoHubba: identify hub objects and sub-network from complex interactome abs#038
4:15 PM Maad Shatnawi: Protein Inter-Domain Linker Prediction Using Random Forest and Amino Acid Physiochemical Properties abs#039

Bioinformatics Software Testing & Quality Assurance 3:30 PM – 4:30 PM Sirius 1
Chair: Joshua Ho
1. Joshua Ho: Software quality assurance in genomic medicine and systems biology abs#040
2. Michael Charleston: Mutations and metamorphics: good software development practices for bioinformaticians abs#041
3. Tsong Chen: Techniques for testing large and complex software abs#042

APBioNet Report and Annual General Meeting 4:30 PM – 5:00 PM Endeavour 1

Welcome Reception and Poster Session 1 5:00 PM – 6:00 PM Pre Function Area & Sirius 2

Friday 1st August 2014

Registration Opens 8:00 AM Foyer

Plenary 4 9:00 AM – 9:45 AM Endeavour 1 & 2
Chair: Terry Speed
Jean Yang: Vertically integrated multi-layered omics data for biomarker discovery abs#043

Plenary 5 9:45 AM – 10:30 AM Endeavour 1 & 2
Chair: Bruno Gaeta
ABN Speaker: Michelle Brazas: Supporting trainers to improve bioinformatics education globally abs#044

Coffee Break 10:30 AM – 11:00 AM Pre Function Area

Corporate Presentation 1 11:00 AM – 11:30 AM Endeavour 1 & 2
Henry Wang, QIAGEN: An automatic pipeline to find and annotate rare subclonal somatic variants in a paired tumor/normal sample abs#045

Corporate Presentation 2 11:30 AM – 12:00 PM Endeavour 1 & 2
Siddarth Singh, Pacific Biosciences: PacBio single molecule long-read sequencing: applications and bioinformatic tools abs#046

Corporate Presentation 3 12:00 PM – 12:30 PM Endeavour 1 & 2
Sarah Reed, AB SCIEX: next generation data independent analysis: SWATH 2.0 abs#047

Lunch Break & Poster/Trade Display Viewing 12:30 PM – 1:30 PM Pre Function Space & Sirius 2

Sequencing & Sequence Analysis 1:30 PM – 3:00 PM Endeavour 1
Chair: Terry Gaasterland
1:30 PM Igor N Berezovsky: The fundamental tradeoff in genomes and proteomes of prokaryotes established by the genetic code, codon entropy, and physics of nucleic acids and proteins abs#048
1:45 PM JeHoon Jun: Whole genome sequence and analysis of the Marwari horse breed and its genetic origin abs#049
2:00 PM A K M Abdul Baten: De novo assembly of the complete sequence and comparative analysis of the chloroplast genome of Macadamia integrifolia (Proteaceae) abs#050
2:15 PM Joshua Ho: Verification and validation of bioinformatics software without a gold standard: a case study of BWA and Bowtie abs#051
2:30 PM Jaeyoung Choi: funRNA: a fungi-centered genomics platform for genes encoding key components of RNAi abs#052
2:45 PM Yun Zheng: Revealing editing and SNPs of microRNAs in colon tissues by analyzing high-throughput sequencing profiles of small RNAs abs#053

Systems Biology I 1:30 PM – 3:00 PM Endeavour 2 Chair: Ashwini Patil
1:30 PM Hagen Meckel: Strategies for combining multi OMICS data abs#054
1:45 PM Haijun Gong: Computational analysis of the roles of ER-Golgi network in the cell cycle abs#055
2:00 PM Yushan Qiu: An efficient method for observability of singleton attractors in Boolean networks abs#056
2:15 PM Takefumi Moriya: Effects of downstream genes on synthetic genetic circuits abs#057
2:30 PM Kuan-Bei Chen: dCaP: Detecting differential binding events in multiple conditions and proteins abs#058
2:45 PM Jinwoo Kim: Drug-induced toxicity prediction for multi-organ pathological findings based on integrative model of gene expression data abs#059

Disease Informatics I 1:30 PM – 3:00 PM Endeavour 3 Chair: Gil Omenn
1:30 PM Sridharan Srinath: Novel SNP improves differential survivability and mortality in non-small cell lung cancer patients abs#060
1:45 PM Ping Zhang: Genetic algorithm with logistic regression for diagnosis and prognosis of Alzheimer’s disease abs#061
2:00 PM Mani Grover: Identification of novel therapeutics for complex diseases from genome-wide association data abs#062
2:15 PM Christine LP Eng: Predicting host tropism of influenza A virus proteins using random forest abs#063
2:30 PM Ran Su: Supervised prediction of drug induced nephrotoxicity based on interleukin-6 and -8 expression levels abs#064
2:45 PM Matloob Khushi: Bioinformatic analysis of cis-regulatory interactions between progesterone and estrogen receptors in breast cancer abs#065

Immunoinformatics 1:30 PM – 3:00 PM Sirius 1 Chair: Vladimir Brusic
1:30 PM Jinyan Li: Rule discovery and distance separation to detect reliable miRNA biomarkers for the diagnosis of lung squamous cell carcinoma abs#066
1:45 PM Aidan R O’Brien: Scalable clustering of genotype information using MapReduce abs#067
2:00 PM Shanfeng Zhu: MHC2MIL: a novel multiple instance learning based method for MHC II peptide binding prediction by considering peptide flanking region and residue positions abs#068
2:15 PM Chih-Ta Lin: Anticancer drug design using kinase profiling, kinase expression and KIDFamMap abs#069
2:30 PM Radha Mahendran: Immunoinformatics and molecular docking studies of outer membrane proteins with MHC Class I alleles for fish pathogens abs#070
2:45 PM Hamid Alinejad-Rokny: A new tool to avoid errors associated with the analysis of hypermutated viral sequences by the widely used Hypermut program abs#071

Coffee Break 3:00 PM – 3:30 PM Pre Function Area

Ontology, Text mining & Evolution & Data Integration 3:30 PM – 5:00 PM Endeavour 1 Chair: Regina Berretta
3:30 PM Jean-Marc Schwartz: Molecular profiling of thyroid cancer subtypes using large-scale text mining abs#072
3:45 PM Benjamin Drinkwater: Introducing TreeCollapse: a novel greedy algorithm to solve the cophylogeny reconstruction problem abs#073
4:00 PM Patrick B Thomas: Multi-species sequence comparison reveals conservation of preproghrelin splice variants and a novel variant encoding a truncated ghrelin peptide abs#074
4:15 PM Jiayin Wang: Identifying significant associations with interacting germline variation and somatic mutational events for cancers abs#075
4:30 PM Modest von Korff: Mining for gene-disease associations with MeSH terms in MEDLINE and ArrayExpress abs#076
4:45 PM Jiyuan An: J-Circos: a Java graphic user interface for Circos plot abs#077

Systems Biology II 3:30 PM – 5:00 PM Endeavour 2 Chair: Nicola Armstrong
3:30 PM Sriganesh Srihari: Complex-based analysis of deregulated cellular processes in cancer abs#078
3:45 PM Yuki Kato: Using hidden Markov models to investigate G-quadruplex motifs in genomic sequences abs#079
4:00 PM Lingxiao Zhou: Combining spatial and chemical information for clustering pharmacophores abs#080
4:15 PM Haifen Chen: Highly sensitive inference of time-delayed gene regulations by network deconvolution abs#081
4:30 PM Qingyao Wu: Semi-supervised multi-label collective classification ensemble for functional genomics abs#082
4:45 PM Tzu-Wen Lin: Predicting functional related proteins based on characteristic of the gene sequences of the protein pairs abs#083

Disease Informatics II 3:30 PM – 5:00 PM Endeavour 3 Chair: Jinyan Li
3:30 PM YH Taguchi: TINAGL1 and B3GALNT1 are potential therapy target genes to suppress metastasis in non-small cell lung cancer abs#084
3:45 PM Abhinav Grover: Fragment based group QSAR and molecular dynamics mechanistic studies on arylthioindole derivatives targeting the α-β interfacial site of human tubulin abs#085
4:00 PM Aliaksandr A Yarmishyn: HOXD-AS1 is a novel long noncoding RNA encoded in HOXD cluster and a marker of neuroblastoma progression revealed via integrative analysis of noncoding transcriptome abs#086
4:15 PM Ko-Chun Yang: Transcriptome alterations of mitochondrial and coagulation function in schizophrenia by cortical sequencing analysis abs#087
4:30 PM Junhee Seok: A gene set method to predict patient survival risks from gene expression data abs#088
4:45 PM Mahmoud ElHefnawi: Bioinformatics analysis of the most potent tumor suppressor microRNAs in hepatocellular carcinoma revealing new links to immune system modulation and insights into cancer pathways abs#089

Bioinformatics Education and Training 3:30 PM – 5:00 PM Sirius 1 Chair: David Lovell & Bruno Gaeta
1. David Lovell: A (two year) snapshot of bioinformatics education and training in Australia abs#090
2. Mark Crowe: A survey of Bioinformatics training needs in Australia abs#091
3. Harriet Dashnow, Marek Cmero, Andrew Lonsdale: How we became bioinformaticians: the student experience abs#092
4. Scott C Ritchie: A Software carpentry perspective of Bioinformatics abs#093
5. Bruno Gaeta: Bioinformatics as engineering or science? A tale of two degrees abs#094
6. Asif Khan: A one-year postgraduate diploma programme for a foundation in bioinformatics: a case study in Malaysia abs#095
7. Annette McGrath: Strengthening bioinformatics capabilities at CSIRO abs#096
8. Nathan Watson-Haigh: Delivering Bioinformatics Training Using Cloud Computing Infrastructure abs#097

Panel Discussion with all speakers and Michelle Brazas Moderated by Vicky Schneider-Gricar

Poster Session 2 5:00 PM – 6:00 PM Pre Function Area and Sirius 2

Conference Dinner – Sahra Restaurant 7:00 PM – 9:00 PM

Saturday 2nd August 2014

Registration Opens 8:00 AM Foyer

Plenary 6 9:00 AM – 9:45 AM Endeavour 1 & 2 Chair: Tin Wee Tan MQ BioFocus Speaker Lars Nielson: Genome scale regulatory network modelling abs#099

Plenary 7 9:45 AM – 10:30 AM Endeavour 1 & 2 Chair: Pablo Moscato Regina Berretta: Combinatorial optimisation models for analysing biological data set abs#100

Coffee Break 10:30 AM – 11:00 AM Foyer

Plenary 8 11:00 AM – 11:45 AM Endeavour 1 & 2 Chair: Shoba Ranganathan John Mattick: RNA at the epicenter of human development abs#101

InCoB 2015 Presentation 11:45 AM – 12:00 PM Endeavour 1 & 2

InCoB Awards 12:00 PM – 12:30 PM Endeavour 1 & 2

Close of InCoB 12:30 PM

POSTER PRESENTATIONS

POSTER SESSION ONE - Thursday

Manjula Algama abs#201 Drosophila 3'UTRs are more complex than protein-coding sequences
Rawan AlSaad abs#202 SIDRAiTrip: a high performance translational research platform for personalized medicine
Maina Bitar abs#203 An assessment of ncRNAs in Trypanosoma cruzi
Jingmin Che abs#204 A strategy of Gene Prioritization by Integrating Genetic Resources with Improved TOPSIS
Zhiliang Chen abs#205 Differences in early transcription factor upregulation underlie socially-induced developmental plasticity in the Australian black field cricket
Yee Siew Choong abs#206 Assembly of Salmonella enterica ser. Typhi TolC in DMPE and POPE
Hon-Nian Chua abs#207 Microbial community pattern detection in human body habitats via ensemble clustering framework
Brett Cooke abs#208 Proteogenomic Workflows on Draft Genomes
Marek Cmero abs#209 Structural variations as a method for phylogenetic reconstruction of sub-clonal tumour evolution
Harriet Dashnow abs#210 Genotyping microsatellites in next-generation sequencing data
Nandan Deshpande abs#211 Sequencing, assembly and comparative analysis of five strains of the fungal pathogen Cryptococcus gattii
Westa Domanova abs#212 Modelling the insulin signalling network: unravelling the molecular mechanisms of insulin resistance
Richard Edwards abs#213 Computational prediction of protein interaction motifs from integrated protein sequence, structure and interaction data
Sowmya Gopichandran abs#214 A site for direct integrin αvβ6•uPAR interaction from structural modelling and docking
Dianjing Guo abs#215 Detecting the characteristics of human branch point sequence (BPS) using a novel prediction model
Kyungsook Han abs#216 Predicting Protein-Binding Nucleotides with Consideration of a Binding Partner of RNA
Hao Jiang abs#217 A Parsimonious Model for Predicting Drug Side-effect Profiles
Jahangir Khan abs#218 Establishing relationship of virus titer with agro-economic characteristics of tomato
Jahangir Khan abs#219 Tomato leaf curl Palampur virus associated with chili pepper leaf curl disease in Pakistan
Swaminathan Krishnaswamy abs#220 Ligand Based Docking Studies of Genus Jatropha against Human Breast Cancer Protein BRCA1
Dhiendra Kumar abs#221 Integrative analysis of multi-omics data to discover novel protein forms
Piramanayagam Shanmughavel abs#222 In silico approach on CXCR4 Antagonists as potential Microbicides against HIV-1 subtype C Receptor

POSTER SESSION TWO – Friday

John Lai abs#251 Evaluation of fusion transcripts as markers of prostate cancer treatment resistance
Michael Lee abs#252 Sequencing analysis of telomeres reveals unexpected sequence heterogeneity
Jie Li abs#253 Effects of sample size and unbalance on finding cancer biomarker
Peijie Lin abs#254 Estimation of amplicon methylation patterns from bisulphite sequencing data
Yue Liu abs#255 The transcriptome sequence analysis of Artemisia frigida
Chao Liu abs#256 Computational analysis of DNA repair pathways using gene expression data
Andrew Lonsdale abs#257 Enhancing metabolic pathway databases with localisation data: integrating SUBA with AraCyc
Andrew Lonsdale abs#258 COMBINE: a bioinformatics group aimed at students and early-career researchers
Ranjeeta Menon abs#259 Computational prediction of molecular mimicry in host-pathogen protein-protein interactions
Heloisa Milioli abs#260 Meta-features as predictors of breast cancer intrinsic subtypes in the METABRIC gene expression dataset
Nabilatul Hani Mohd Radzman abs#261 Molecular docking predictions of stevioside-insulin receptor (IR) interactions in a Mus musculus IR model
Santo Motta abs#262 A tool for fast development of new ontologies
Nagarajan Raju abs#263 Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins
Hidayat Setiadji abs#264 Construction and cloning of encoded gene of structural protein of hepatitis C virus in Escherichia coli and its expression in Chinese hamster ovary cells
Renhua Song abs#265 Rule discovery and distance separation to detect reliable miRNA biomarkers for the diagnosis of lung squamous cell carcinoma
Aidan Tay abs#266 Validation of transcripts assembled from RNA-seq data using proteomics data
Huai-Kuang Tsai abs#267 Intrinsic properties of genomic sequences allow prediction of transcription factor binding regions
Daryi Wang abs#268 Role of antisense RNAs in evolution of yeast regulatory complexity
Ko-Chun Yang abs#269 Pain-TENS: a database for pain research using transcutaneous electrical nerve stimulation
Ko-Chun Yang abs#270 The role of zinc finger family genes in schizophrenia
Reza Yuridian abs#271 Molecular Docking of Various Type Soybean-Phosphatidylcholine to Fas-receptor for Drug Design Studies of Inducing Adipocyte Apoptosis and Isolating-Proliferation of Adipocyte-derived Stem Cells
Yan Zheng abs#272 Comparative analysis of gene expression profiles induced by IL-4 and IL-6 in human peripheral blood mononuclear cells
Yun Zheng abs#273 Genome-Wide discovery and analysis of phased small Interfering RNAs in Chinese sacred lotus

ABSTRACTS

ORALS

1

STRATEGIES AND PROGRESS OF THE HUPO HUMAN PROTEOME PROJECT
Gil Omenn1
1. University of Michigan, Ann Arbor, MI, United States

After several years of discussion and formal planning, the Human Proteome Organization (HUPO) announced the global Human Proteome Project (HPP) at the Sydney World Congress of Proteomics in September 2010. A year later the Project was launched at the Geneva World Congress. In 2012 in Boston and in 2013 in Yokohama there was tremendous progress reported by the Chromosome-centric HPP, led by Young-Ki Paik (Korea), Bill Hancock (USA), and Gyorgy Marko-Varga (Sweden), and the Biology and Disease-driven HPP, led by Ruedi Aebersold (Switzerland) and Jennifer van Eyk (USA). There are now 24 chromosome-centric teams and 16 B/D teams (including the pre-existing HUPO organ and biofluid-based initiatives). These activities are supported by resource pillars covering the mass spectrometry, protein capture, and knowledge-base domains. The Journal of Proteome Research in 2013 and in 2014 published special issues with 48 and 32 C-HPP and C-HPP-related articles, respectively. The HPP stimulated the emergence of ProteomeXchange for submission of all datasets. PeptideAtlas and GPMDB provide standardized re-analyses of the accumulating data, with appropriate rigorous quality thresholds; Human Protein Atlas provides extensive information on tissue expression by immunohistochemistry; and neXtProt integrates and curates the combined identifications and annotations. As of the January 2014 Lane et al update (JPR 13:15-20), there were 15,646 confidently identified proteins and 3844 protein-coding genes for which protein-level evidence was missing or insufficient. Additional large datasets from TCGA, Pandey lab, and Kuster lab, among others, will be subjected to the standardized reanalysis and incorporated into the HPP Metrics.
Many investigators are focused on finding credible evidence for the missing proteins, and many others on characterizing the presence, dynamics, and functions of splice variants, post-translational modifications, and sequence variants. The B/D-HPP has stimulated development toward robust, moderate-cost, high throughput mass spectrometers (Proteome Analyzer project), and has initiated consensus development of priority protein lists for specific major diseases, like type 2 diabetes and ovarian cancers, for which SRM reagents and spectral libraries are available for use throughout the life sciences and biomedical research community. Current information about the publications and other activities of the HPP is available at www.thehpp.org and at www.c-hpp.org. All experimental datasets are expected to be submitted to PRIDE/EBI (MS/MS) or PASSEL/ISB (SRM) to be made available to investigators globally and linked to additional databases and neXtProt through ProteomeXchange.

2

INTEGRATING EXOME SEQUENCING, MRNA-SEQ, AND MICRORNA-SEQ TO IDENTIFY GENES AND MECHANISMS IN OPTIC NERVE DEGENERATION
Terry Gaasterland1
1. University of California San Diego and Scripps Institution of Oceanography, La Jolla, United States

In glaucoma, progressive optic nerve degeneration can lead to irreversible vision impairment and eventual blindness, despite treatment. Genetic causes and influences are not yet clear in primary open angle glaucoma (POAG), the most prevalent form of the disease in North America, Europe, and several other parts of the world. The genetics of POAG are complex; to date, no single genomic variant has been established as causing the disease.

Genome-wide sequencing of exons from protein coding and non-coding genes in 333 patients with primary open angle glaucoma revealed over 100 associated SNP sites in over 70 genes. To rank and prioritize genes and generate hypotheses about molecular mechanisms disrupted by associated variant sites, mRNA and small RNA (microRNA) were sequenced from ocular tissues relevant to the disease.

Analysis protocols and techniques for integrated data interpretation to construct putative regulatory networks underlying disease will be discussed. The approach revealed two strong candidate models explaining neurodegeneration in POAG. Data collection and analysis methods are generally applicable beyond glaucoma to other chronic, progressive diseases associated with aging.

3

NORMALIZATION OF OMIC DATA AFTER 2007
Terry Speed1
1. Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia

For over a decade now, normalization of transcriptomic, genomic and more recently metabolomic and proteomic data has been something you do to “raw” data to remove biases, technical artifacts and other systematic non-biological features. These features could be due to sample preparation and storage, reagents, equipment, people and so on. It was a “one-off” fix to what I’m going to call removing unwanted variation. Since around 2007, a more nuanced approach has been available, due to JT Leek and J Storey (SVA) and O Stegle et al (PEER). These new approaches do two things differently. The first is that they do not assume the sources of unwanted variation are known in advance; they are inferred from the data. And secondly, they deal with the unwanted variation in a model-based way, not “up front.” That is, they do it in a problem-specific manner, where different inference problems warrant different model-based solutions. For example, the solution for removing unwanted variation in estimation is not necessarily the same as that for prediction. Over the last few years, I have been working with Johann Gagnon-Bartsch and Laurent Jacob on these same problems through making use of positive and negative controls, a strategy which we think has some advantages. In this talk I’ll review the area, and highlight some of the advantages of working with controls. Illustrations will be from microarray, mass spec and sequence data.

4

CISMEP: AN INTEGRATED REPOSITORY OF GENOMIC EPIGENETIC PROFILES AND CIS-REGULATORY MODULES IN DROSOPHILA
Tzu-Hsien Yang1, Chung-Ching Wang1, Po-Cheng Hung1, Wei-Sheng Wu1
1. Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan

Cis-regulatory modules (CRMs), or the DNA sequences required for regulating gene expression, play a central role in biological research on transcriptional regulation in metazoan species. Nowadays, the systematic understanding of CRMs still relies mainly on computational methods because experimental methods are time-consuming and small in scale. But the accuracy and reliability of different CRM prediction tools are still unclear. Without comparative cross-analysis of the results and combinatorial consideration with extra experimental information, there is no easy way to assess the confidence of the predicted CRMs. This limits the genome-wide understanding of CRMs. It is known that transcription factor binding and epigenetic profiles tend to determine functions of CRMs in gene transcriptional regulation. Thus integration of the genome-wide epigenetic profiles with systematically predicted CRMs can greatly help researchers evaluate and decipher the prediction confidence and possible transcriptional regulatory functions of these potential CRMs. However, these data are still fragmentary in the literature. Here we performed computational genome-wide screening for potential CRMs using different prediction tools and constructed a pioneering database, cisMEP (cis-regulatory module epigenetic profile database), to integrate these computationally identified CRMs with genomic epigenetic profile data. cisMEP collects the literature-curated TFBS location data and nine genres of epigenetic data for assessing the confidence of these potential CRMs and deciphering the possible CRM functionality.
cisMEP aims to provide a user-friendly interface for researchers to assess the confidence of different potential CRMs and to understand the functions of CRMs through experimentally-identified epigenetic profiles. The deposited potential CRMs and experimental epigenetic profiles for confidence assessment provide experimentally testable hypotheses for the molecular mechanisms of metazoan gene regulation. cisMEP is available online at http://cosbi3.ee.ncku.edu.tw/cisMEP/. We believe that the information deposited in cisMEP will help biologists to study the modular regulatory mechanisms between different TFs and their target genes.

5

YNA: AN INTEGRATIVE GENE MINING PLATFORM FOR STUDYING CHROMATIN STRUCTURE AND ITS REGULATION IN YEAST
Po-Cheng Hung1, Tzu-Hsien Yang1, Hung-Jiun Liaw2, Wei-Sheng Wu1
1. Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan
2. Department of Life Sciences, National Cheng Kung University, Tainan, Taiwan

Histone modification and remodeling play crucial roles in regulating gene transcription. These post-translational modifications of histones alter chromatin structure, thus facilitating the binding between protein domains and histones that regulate DNA accessibility during transcription. An emerging theme is that multivalent interactions between specifically modified histones and domains of protein complexes allosterically regulate the activity of a protein complex. Therefore, understanding the combinatorial pattern of the histone code is vital to understanding biological processes. However, most of these chromatin-regulation datasets are scattered across the literature, and no comprehensive tool for investigating these data is available.

To decipher the mechanism of transcriptional regulation, we developed the Yeast Nucleosome Atlas database, or the YNA database, to integrate available experimental data of nucleosome occupancy, histone modifications, factors for chromatin regulation, and expression profiles. In addition, to acquire experimentally testable hypotheses, we implemented the genome-wide gene miner to provide the interface for researchers to fetch gene lists by custom-defined filtering criteria based on those previously published datasets. Moreover, the biological significance analyzer, which addresses the issues concerning the enrichment of histone modification and binding proteins, expression profiles, and functional categories, was constructed to help researchers propose testable hypotheses for downstream analysis. Compared to previously established genome browsing databases, YNA provides the integrative and comprehensive information about global chromatin structure and gene regulation. Most importantly, YNA provides gene mining and analyzing functions for advanced analysis.

6

METAGENOME FRAGMENT CLASSIFICATION BASED ON MULTIPLE MOTIF-OCCURRENCE PROFILES
Naoki Matsushita1, Shigeto Seno1, Yoichi Takenaka1, Hideo Matsuda1
1. Osaka University, Osaka, Japan

An enormous amount of metagenomic data has been obtained by extracting multiple genomes simultaneously from microbial communities, including from uncultivable microbes. By analyzing metagenomic data, such microbes are discovered and new microbial functions are elucidated. The first step in analyzing the data is to classify each sequenced read into the reference genome from which it could be derived. The Naïve Bayes Classifier is one of the methods for this classification. To identify the derivation of the reads, the method calculates a score based on the occurrence of a DNA sequence in each reference genome. However, the reference genomes differ greatly in size, which biases the scoring of the genomes. This bias may cause erroneous classification and diminish the classification accuracy. To cope with this issue, we have enhanced the Naïve Bayes Classifier with multiple sets of occurrence profiles for each reference genome, leveling genome sizes by dividing each genome sequence into a set of subsequences of approximately the same length and generating a profile for each subsequence. The multiple profile scheme improves the accuracy of the results by the Naïve Bayes Classifier for simulated and Sargasso Sea datasets.
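The multiple-profile idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the use of overlapping k-mers as the "motifs", add-one smoothing, a uniform prior over profiles, taking the best-scoring chunk as the genome's score, and all function names and the toy genomes are hypothetical choices made for this example.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return every overlapping k-mer in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def split_genome(seq, chunk_len):
    """Split a genome into chunks of roughly chunk_len bases each.

    k-mers that span a chunk boundary are ignored in this simplified sketch.
    """
    n_chunks = max(1, round(len(seq) / chunk_len))
    size = max(1, math.ceil(len(seq) / n_chunks))
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def build_profiles(genomes, k, chunk_len):
    """Build one k-mer count profile per chunk, labelled with its genome."""
    profiles = []
    for name, seq in genomes.items():
        for chunk in split_genome(seq, chunk_len):
            counts = Counter(kmers(chunk, k))
            profiles.append((name, counts, sum(counts.values())))
    return profiles


def classify(read, profiles, k):
    """Return the genome whose best-scoring chunk profile explains the read."""
    vocab = 4 ** k  # number of possible DNA k-mers, used for add-one smoothing
    best_name, best_score = None, -math.inf
    for name, counts, total in profiles:
        # Naive Bayes log-likelihood: k-mers are treated as independent.
        score = sum(math.log((counts[km] + 1) / (total + vocab))
                    for km in kmers(read, k))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Toy genomes of very different sizes (hypothetical data).
genomes = {"genome_A": "ACGT" * 100, "genome_B": "GGCCTTAA" * 10}
profiles = build_profiles(genomes, k=3, chunk_len=80)
```

Because every profile is built from a chunk of about the same length, the smoothed probabilities are comparable across a 400-base and an 80-base genome, which is the size-leveling effect the abstract describes.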

7

ASSESSMENT OF GENOME ASSEMBLERS FOR FUNGAL DRAFT GENOMES
Mostafa M Abbas1, Qutaibah M Malluhi1, Balakrishnan p1
1. KINDI Lab for Computing Research, College of Engineering, Qatar University, Doha, Qatar

Background: Recently, several bio-projects dealing with the release of fungal genomes have transpired. Most of these projects use the new generation sequencing platforms. As a consequence, many de novo assembly tools have been developed to assemble the reads generated by these platforms. Each tool has its own inherent advantages and disadvantages, which makes the selection of an appropriate tool a challenging problem.

Results: We have evaluated the performance of frequently used de novo assemblers, namely ABySS, IDBA-UD, Minia, SOAP, SPAdes, Sparse, and Velvet. These assemblers are assessed based on their output quality during the assembly process conducted over fungal data. We compared the performance of these assemblers by considering both computational and quality metrics. By analyzing these performance metrics, the assemblers are ranked and a tentative procedure for choosing the candidate assembler is illustrated.

Conclusions: In this study, we propose an assessment method for the selection of de novo assemblers by considering their computational as well as quality metrics at the draft genome level. We divide the quality metrics into three groups: g1 measures the goodness of the assemblies, g2 measures the problems of assemblies, and g3 measures the conservation elements in the assemblies. Our results demonstrate that the assemblers ABySS and IDBA-UD exhibit a good performance for the studied data from fungal genomes in terms of running time, memory, and quality. The results suggest that whole genome shotgun sequencing projects should make use of different assemblers by considering their merits. Our results are available for free for academic research at http://confluence.qu.edu.qa/display/download/bioinf

8

BROAD EXISTENCE OF PLURIPOTENT FACTOR REGULATED TRANSCRIPT ISOFORMS WITH STAGE-SPECIFIC ALTERNATIVE FIRST EXONS (SAFE) IN MOUSE EMBRYONIC STEM CELLS
Guihai Feng1, Man Tong1, Baolong Xia2, Guan-Zheng Luo1, Meng Wang1, Dongfang Xie1, Haifeng Wan2, Qi Zhou2, Xiu-Jie Wang1
1. Institute of Genetics & Developmental Biology, Chinese Academy of Sciences, Beijing, China
2. Institute of Zoology, Chinese Academy of Sciences, Beijing, China

Stage-specific alternative first exon (SAFE) usage is an important type of alternative splicing that is implicated in the regulation of many important biological processes, especially in a spatial and temporal manner. Yet the presence of SAFE transcripts and their roles in embryonic stem cells (ESCs) are still largely unknown. By comparing transcriptomes of mouse ESCs (mESCs) and somatic cells, we identify 137 mESC SAFE isoforms of 128 genes with broad expression in both ESCs and somatic cells. More than half of the mESC SAFE isoforms have open reading frame (ORF) changes as compared to the corresponding commonly expressed isoform of the same gene. The promoter regions of SAFE isoforms exhibit enriched H3K4me3 and Pol II binding as well as higher DNase I sensitivity in mESCs, but not in mouse embryonic fibroblasts (MEFs) and other somatic cells, in support of the ESC-specific expression patterns of these transcripts. Promoter regions of about 42% of SAFE isoforms have interactions with the key pluripotent factors Oct4, Sox2 or Nanog. Knocking down these pluripotency-related genes indeed impairs the expression of SAFE isoforms. The expression of SAFE isoforms is activated during the reprogramming process of induced pluripotent stem (iPS) cells, and dynamically regulated in early stage embryos or during cell differentiation. These results reveal the wide presence of SAFE isoforms in ESCs; the involvement of pluripotent factors in regulating the expression of SAFE isoforms indicates their functional importance in ESCs.

9

A COMPREHENSIVE PERFORMANCE EVALUATION ON THE PREDICTION RESULTS OF EXISTING COOPERATIVE TRANSCRIPTION FACTORS IDENTIFICATION ALGORITHMS
Fu-Jou Lai1, Yueh-Min Huang1, Wei-Sheng Wu2
1. Department of Engineering Science, National Cheng Kung University, Tainan, Taiwan
2. Computational Systems Biology Lab, Department of Electrical Engineering, National Cheng Kung University, Tainan

Background: Eukaryotic transcriptional regulation is known to be highly connected through the networks of cooperative transcriptional regulators. Measuring the cooperativity of transcriptional regulators is helpful for understanding their biological relevance in regulating genes. Recent advances in computational techniques have led to various predictions of significant cooperative transcription factor (TF) pairs by genome-wide analysis in yeast. Because each study used different data resources and distinctive algorithms, each had its own merits and claimed to outperform the others. However, such claims were prone to subjectivity because each study compared itself with only a few other studies and used only a small set of performance indices for comparison. This motivated us to develop and propose a series of measurement approaches to generate performance indices in order to objectively evaluate the prediction performance of each study. Based on these performance indices, we conducted a comprehensive performance evaluation and comparison among these studies.

Results: We collected and compiled the predicted cooperative TF pairs (PCTFPs) from 14 existing algorithms. Using the 7 performance indices we proposed, the cooperativity of each TF pair in each set of PCTFPs was measured, and a ranking score according to the mean cooperativity of the set was given for each individual performance index. The ranking scores of a set of PCTFPs vary with different performance indices, implying that an algorithm for predicting cooperative TF pairs may be strong by one measure but weak by another. We finally made a comprehensive ranking for these 14 sets. The results showed that Wang J’s study obtained the best performance evaluation on the prediction of cooperative TF pairs.

Conclusions: Our study has the following features: (i) It used various published datasets in modelling 7 performance indices; (ii) It compared 14 sets of PCTFPs reported in the literature; (iii) It carried out an objective, comprehensive performance evaluation on the prediction of PCTFPs; and (iv) The performance indices we proposed can be readily applied to measure the performance of new PCTFPs in future studies and to make quick comparisons with others.
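The per-index ranking and the comprehensive ranking described above might be combined along the following lines. The abstract does not state how the per-index ranking scores were aggregated, so the rule used here (a rank-based score per index, summed over all indices) is an assumption, and the function names, algorithm labels and numbers are hypothetical.

```python
def rank_scores(mean_by_algorithm):
    """Score algorithms on one index: worst mean gets 1, best gets N.

    Ties are broken by dictionary order in this simplified sketch.
    """
    ordered = sorted(mean_by_algorithm, key=mean_by_algorithm.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


def comprehensive_ranking(index_tables):
    """Sum the per-index ranking scores and order algorithms best first."""
    totals = {}
    for table in index_tables.values():
        for name, score in rank_scores(table).items():
            totals[name] = totals.get(name, 0) + score
    order = sorted(totals, key=totals.get, reverse=True)
    return order, totals


# Mean cooperativity of each algorithm's predicted pairs under three
# hypothetical performance indices (made-up numbers for illustration).
index_tables = {
    "index_1": {"alg_A": 0.9, "alg_B": 0.5, "alg_C": 0.1},
    "index_2": {"alg_A": 0.2, "alg_B": 0.8, "alg_C": 0.4},
    "index_3": {"alg_A": 0.5, "alg_B": 0.6, "alg_C": 0.3},
}
order, totals = comprehensive_ranking(index_tables)
```

In this toy input, alg_A is best on index_1 but worst on index_2, which mirrors the observation that a method's ranking score varies from one performance index to another.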

10

CPLM: A DATABASE OF PROTEIN LYSINE MODIFICATIONS
Yu Xue1, Zexian Liu1, Yongbo Wang1
1. Huazhong University of Science and Technology, Wuhan, HUBEI, China

Through covalent modification of residues in proteins, post-translational modification (PTM) greatly expands the proteome diversity and regulates the dynamic functions of proteins. Recently, lysine was discovered to be a hot spot amino acid for PTMs. Besides relatively well-studied PTMs such as methylation, acetylation and ubiquitination, a number of new PTMs were discovered to modify lysine residues, for example, butyrylation, crotonylation and succinylation. Although the detailed regulatory mechanisms are far from understood, it is anticipated that these PTMs play critical roles in various biological processes. Here, we report CPLM (Compendium of Protein Lysine Modification), an integrated database of protein lysine modifications (PLMs). In total, 203,972 modification events on 189,919 modified lysines of 45,748 proteins for 11 types of PTMs were manually collected, including acetylation, ubiquitination, methylation, sumoylation, propionylation, butyrylation, succinylation, crotonylation, glycation, malonylation, and pupylation. With the dataset, we identified a total of 76 types of co-occurrences of various PLMs on the same lysine residues, and the most abundant PLM crosstalk is between acetylation and ubiquitination. Up to 53.5% of acetylation and 33.1% of ubiquitination events co-occur at 10,746 lysine sites. Thus, the various PLM crosstalks suggest that a considerable proportion of lysines are competitively and dynamically regulated in a complicated manner. Since these various lysine modifications have attracted great attention recently, we anticipate that such a comprehensive resource will be useful for the research community. The CPLM database is free to all users at: http://cplm.biocuckoo.org (1).

1. Liu Z, Wang Y, Gao T, Pan Z, Cheng H, Yang Q, Cheng Z, Guo A, Ren J, Xue Y. (2014) CPLM: a database of protein lysine modifications. Nucleic Acids Res. 42(1): D531-6. http://www.ncbi.nlm.nih.gov/pubmed/24214993

11

HOMOPHARMA: A NEW CONCEPT FOR EXPLORING THE MOLECULAR BINDING MECHANISMS AND DRUG REPURPOSING Yi-Yuan Chiu1 1. Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan Background Development of drugs that simultaneously target multiple proteins could improve efficacy, particularly in the treatment of complex diseases (e.g. cancer and central nervous system disorders). We have introduced the Space-Related Pharmamotif (SRPmotif) method to identify pharma-interfaces sharing similar binding environments. However, the atomic interactions between a compound and a protein are also important for understanding how a compound targets a protein. Combining similar binding environments with protein-compound interactions would provide opportunities to explore molecular binding mechanisms. Results In this study, we proposed a new concept, "homopharma", to describe a group of protein-compound interactions. A homopharma is a set of proteins with a conserved sub-binding environment at the protein-compound interfaces and a set of compounds with similar topology. The results demonstrated that complexes of a homopharma group share similar protein-compound interactions and comprise conserved specific residues and important functional sites. Based on homopharma groups, four flavonoid derivatives were tested against 32 human protein kinases using in vitro enzymatic profiling. The experimental results identified 56 novel protein-compound interactions, 25 of which may have IC50 values of less than 1 μM. Some novel protein-compound interactions suggest that these flavonoids could be used as anticancer compounds, for example against oral and colon cancer. Conclusions The experimental results showed that the new homopharma concept is not only useful for identifying potential targets of compounds, but can also reveal the key binding environment. Moreover, it would be useful for discovering new uses for existing drugs.
We believe that this approach can be further applied to understand molecular binding mechanisms and provide new concepts for drug development.

12

CHARACTERIZATION AND IDENTIFICATION OF PROTEIN O-GLCNACYLATION SITES WITH SUBSTRATE SPECIFICITY Tzong-Yi Lee1, Hsin-Yi Wu, Cheng-Tsung Lu 1. Yuan Ze University, Chungli, Taiwan Background: Protein O-GlcNAcylation, involving the attachment of a single N-acetylglucosamine (GlcNAc) to the hydroxyl group of serine or threonine residues, is catalyzed by O-GlcNAc transferase (OGT). Elucidation of O-GlcNAcylation sites on proteins is required in order to decipher its crucial roles in regulating cellular processes and to aid in drug design. With an increasing number of O-GlcNAcylation sites identified by mass spectrometry (MS)-based proteomics, several methods have been proposed for the computational identification of O-GlcNAcylation sites. However, no existing method focuses on the investigation of OGT substrate motifs. Thus, we were motivated to design a new method for the identification of protein O-GlcNAcylation sites that takes the substrate site specificity of OGT into consideration. Results: In this study, 375 experimentally verified O-GlcNAcylation sites were collected from dbOGAP, which is an integrated resource for protein O-GlcNAcylation. Due to the difficulty in characterizing the substrate motifs by conventional sequence logo analysis, a recursive statistical method was applied to obtain statistically significant conserved motifs. Support Vector Machines (SVMs) were then adopted to construct a two-layered predictive model learned from the identified substrate motifs. The predictive model was evaluated using five-fold cross-validation, which yielded a sensitivity of 0.76, a specificity of 0.80, and an accuracy of 0.78. Additionally, an independent testing set, which was truly blind to the training data of the predictive model, was used to demonstrate that the proposed method could provide a promising accuracy (0.94) and outperform three other O-GlcNAcylation site prediction tools.
Conclusion: A case study demonstrated that the proposed method could be a feasible means of conducting preliminary analyses of protein O-GlcNAcylation. We also propose that the substrate motifs may facilitate the study of the extensive crosstalk between O-GlcNAcylation and phosphorylation. This method may help unravel their mechanisms and roles in signaling, transcription, chronic disease, and cancer.

13

PREDICTING SUMOYLATION SITES USING SUPPORT VECTOR MACHINES BASED ON VARIOUS SEQUENCE FEATURES, CONFORMATIONAL FLEXIBILITY AND DISORDER Ahmet Sinan Yavuz1, Osman Ugur Sezerman1 1. Biological Sciences and Bioengineering Program, Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey Background Sumoylation, which is a reversible and dynamic post-translational modification, is one of the vital processes in a cell. Before a protein matures to perform its function, sumoylation may alter its localization, interactions, and possibly structural conformation. Aberrations in protein sumoylation have been linked with a variety of disorders and developmental anomalies. Experimental approaches to the identification of sumoylation sites may not be effective due to the dynamic nature of sumoylation and the laborious and costly experiments involved. Therefore, computational approaches may guide experimental identification of sumoylation sites and provide insights for further understanding of the sumoylation mechanism. Results In this paper, the effectiveness of using various sequence properties in predicting sumoylation sites was investigated with statistical analyses and a machine learning approach employing support vector machines. These sequence properties were derived from windows of size 7 and include position-specific amino acid composition, hydrophobicity, estimated sub-window volumes, predicted disorder, and conformational flexibility. 5-fold cross-validation results on experimentally identified sumoylation sites revealed that our method successfully predicts sumoylation sites with a Matthews correlation coefficient, sensitivity, specificity, and accuracy equal to 0.66, 73%, 98%, and 97%, respectively. Additionally, we have shown that our method compares favorably to the existing prediction methods and a basic regular expression scanner. Conclusions By using support vector machines, a new, robust method for sumoylation site prediction was introduced.
In addition, the possible effects of predicted conformational flexibility and disorder on sumoylation site recognition were explored computationally, for the first time to our knowledge, as additional parameters that could aid in sumoylation site prediction.
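The four performance measures quoted in this abstract can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code; the function name and the example counts are hypothetical and were chosen only to illustrate how a high accuracy can coexist with a more modest Matthews correlation coefficient when negative sites greatly outnumber positive ones.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts.

    Assumes at least one positive and one negative example, so that the
    sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (not taken from the abstract): 100 positive sites
# and 1000 negative sites.
metrics = classification_metrics(tp=73, fp=20, tn=980, fn=27)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the MCC uses all four cells of the confusion matrix, it is usually the more informative single number on imbalanced site-prediction data.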

14

IFACEWAT: THE INTERFACIAL WATER-IMPLEMENTED RE-RANKING ALGORITHM TO IMPROVE THE DISCRIMINATION OF NEAR NATIVE STRUCTURES FOR PROTEIN RIGID DOCKING Chinh Tran-To Su1, Thuy-Diem Nguyen2, Jie Zheng13, Chee-Keong Kwoh1, Haifen Chen1 1. Bioinformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore 2. Parallel and Distributed Computing Centre, School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore 3. Genome Institute of Singapore, Agency for Science, Technology, and Research (A*STAR), Biopolis, Singapore 138672, Singapore Background: Protein-protein docking is an in silico method to predict the formation of protein complexes. Due to limited computational resources, the docking approach has been developed under the assumption of rigid docking, in which one of the two protein partners remains rigid during protein association and the water contribution is ignored or only implicitly represented. While rigid docking has successfully predicted structures of various protein complexes, to date most docking algorithms often fail to discriminate the correct predictions from the false positives, especially for Antigen/Antibody complexes. To tackle this issue, a new energy-based scoring function, IFACEwat, which combines Interface Atomic Contact Energy (IFACE) with the water effect, is proposed in this paper to re-rank the results of a rigid docking algorithm and thereby further improve the discrimination of the near-native structures from the false positives, especially for Antigen/Antibody complexes. Unlike other re-ranking techniques, IFACEwat explicitly places interfacial water in the protein interfaces to account for the water-mediated contacts during protein interactions. Results: Our results showed that IFACEwat both increased the number of near-native structures and improved their ranks compared with the initial rigid docking.
IFACEwat achieved a success rate of 83.8% for Antigen/Antibody complexes, and 92.3% and 90% for medium and difficult cases of protein complexes, respectively. Conclusion: The improvement is achieved by explicitly taking into account the contribution of water during protein interactions, which was ignored or not fully represented by the initial rigid docking and other re-ranking techniques. In addition, IFACEwat maintains the computational efficiency of the initial docking algorithm, yet improves both the ranks and the number of the near-native structures found.

15

COMBINING PROTEIN RATIO P-VALUES YIELDS A USEFUL PRAGMATIC APPROACH TO THE ANALYSIS OF MULTI-RUN ITRAQ EXPERIMENTS Dana Pascovici1, Xiaomin Song, Edmond Breen, Jemma Wu, Mark Molloy 1. Australian Proteome Analysis Facility, Macquarie University, NSW, Australia The promise of iTRAQ has been tempered by some well-documented difficulties, including ratio compression and run variability, which have made the analysis of iTRAQ experiments challenging, particularly in the multi-run scenario. From the statistical standpoint, a main difficulty in working with protein ratios in the presence of variability across experiments is that the protein ratios are not “all born equal”, but have varying credibility depending on the number and quality of the peptides that were used to generate them. Whilst one can easily measure the credibility of a protein ratio using confidence intervals or p-values, it is hard to integrate such measures directly alongside the ratios in a standard statistical analysis such as ANOVA. Hence more sophisticated methods of statistical analysis, such as those introduced by Hill and Oberg, revert to the peptide ratios rather than working directly with the protein ratios, but yield complex ANOVA models whose solution relies on computational approaches such as stage-wise regression, which are non-trivial to run and harder to verify. A more pragmatic approach can be taken to generate combined measures of ratio confidence across experiments, in a fashion similar to running a meta-analysis across different iTRAQ runs. We present and evaluate such an analysis method, which relies on combining p-values for the iTRAQ ratios using a measure such as Stouffer’s Z-transform test alongside a run consistency measure. The core advantages are simplicity, high tolerance of run variability, and emphasis on proteins with high identification confidence.
We show some limitations on the types of experiment designs that can be tackled using this approach, and also explore the applicability of multiple testing correction procedures to the context of iTRAQ protein level data. The main example iTRAQ dataset used belongs to a large multi-run pathogen exposure time course in wheat leaves.
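Stouffer's Z-transform test, named in this abstract as one way of combining p-values across runs, converts each p-value to a standard normal deviate, sums the deviates, and rescales the sum. The sketch below shows the textbook form of the test for one-sided p-values, using only the Python standard library. It is an illustration under stated assumptions rather than the authors' pipeline: the function name, the optional weights, and the example p-values are hypothetical, and the run consistency measure mentioned in the abstract is not reproduced because it is not specified there.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def stouffer_combine(p_values, weights=None):
    """Combine one-sided p-values with Stouffer's Z-transform test.

    Each p-value must lie strictly between 0 and 1. The optional weights
    are an assumption of this sketch (for example, one weight per run).
    Returns the combined Z score and the combined one-sided p-value.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    # Small p-values map to large positive Z scores.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / math.sqrt(
        sum(w * w for w in weights)
    )
    combined_p = 1.0 - _STD_NORMAL.cdf(combined_z)
    return combined_z, combined_p


# Hypothetical p-values for one protein ratio observed in three runs.
z, p = stouffer_combine([0.05, 0.05, 0.05])
print(f"combined Z = {z:.3f}, combined p = {p:.4f}")
```

Three runs that each give only borderline evidence combine into a clearly smaller p-value, which is the behaviour that makes this kind of meta-analysis attractive for multi-run designs.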

16

COMPUTATIONAL IDENTIFICATION OF NOVEL NATURAL INHIBITORS OF GLUCAGON RECEPTOR FOR CHECKING TYPE II DIABETES MELLITUS Sonam Grover1, Jaspreet Kaur Dhanjal1, Sukriti Goyal2, Abhinav Grover3, Durai Sundar1 1. Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology, Delhi, India 2. Apaji Institute of Mathematics & Applied Computer Technology, Banasthali University, Tonk, Rajasthan, India 3. School of Biotechnology, Jawaharlal Nehru University, Delhi, India Background Interaction of the small peptide hormone glucagon with the glucagon receptor (GCGR) stimulates the release of glucose from hepatic cells during fasting; hence GCGR performs a significant function in glucose homeostasis. Inhibiting the interaction between glucagon and its receptor has been reported to control hepatic glucose overproduction, and thus GCGR has emerged as an attractive therapeutic target for the treatment of type II diabetes mellitus. Results In the present study, a large library of natural compounds was screened against the seven-transmembrane domain of GCGR to identify novel therapeutic molecules that can inhibit the binding of glucagon to GCGR. Molecular dynamics simulations were performed to study the dynamic behaviour of the docked complexes, and the molecular interactions between the screened compounds and the ligand binding residues of GCGR were analysed in detail. We report two natural drug-like compounds, PIB and CAA, which showed good binding affinity for GCGR and are potent inhibitors of its functional activity. Conclusion This study contributes evidence for the application of these compounds as prospective small ligand molecules against type II diabetes. Novel natural drug-like inhibitors against the seven-transmembrane domain of GCGR have been identified, which showed high binding affinity and potent inhibition of GCGR.

17

UTILISING GRAPH DATABASES TO ANALYSE THE 3D ORGANISATION OF CHROMATIN Fabian A Buske12, Sami El Hilali2, Denis C Bauer3, Susan J Clark12 1. St Vincent's Clinical School, University of NSW Australia, Sydney, NSW, Australia 2. Garvan Institute of Medical Research, Darlinghurst, NSW, Australia 3. Division of Computational Informatics, CSIRO, Sydney, NSW, Australia Background: Biological data often describe relationships between objects that can be represented as graphs, such as networks, pathways or ontologies. These datasets are frequently too large to keep in random access memory, which complicates their analysis and use as resources. The recent advent of NoSQL database technologies, with their focus on scalability and distributed analysis, thus presents an attractive solution for large biological datasets. Graph databases in particular can store complex data transparently while providing efficient analysis algorithms developed in the field of graph theory. Here, we utilise Neo4j, a graph database, to combine and analyse relationships within and between gene expression and DNA looping datasets. Results: Integrating ChIP-Seq, RNA-Seq and DNase hypersensitivity data into a common graph enables us to identify, by simple graph traversal, gene regulatory circuits of transcription factors that can act as feedback loops. In addition, we demonstrate how data aggregation can be used to identify TF modules and illustrate how these modules may contribute to DNA looping interactions. Conclusions: Graph databases enable the connection of data from different domains and provide powerful algorithms for graph traversal and analysis. While some datasets naturally integrate into a graph framework, other datasets, such as sequential and temporal data, require discretisation first.
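The idea of finding feedback loops by graph traversal can be illustrated without a database. The sketch below is a plain in-memory depth-first search over a directed "regulates" graph that reports every simple cycle up to a chosen length. It is not the authors' Neo4j workflow: the function name, the dictionary representation, the length limit, and the transcription factor labels are all hypothetical and serve only to show what such a traversal computes.

```python
def find_feedback_loops(regulates, max_length=4):
    """Enumerate simple directed cycles with at most max_length edges.

    regulates maps each transcription factor to the factors it regulates.
    Each cycle is reported once, rotated so that it starts at its
    alphabetically smallest member.
    """
    loops = set()

    def extend(path):
        for target in regulates.get(path[-1], ()):
            if target == path[0]:
                # Closed a cycle: store a canonical rotation of the path.
                start = path.index(min(path))
                loops.add(tuple(path[start:] + path[:start]))
            elif target not in path and len(path) < max_length:
                extend(path + [target])

    for factor in regulates:
        extend([factor])
    return sorted(loops)


# Hypothetical regulatory edges (labels are placeholders, not real data).
regulates = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": ["C"],
}
print(find_feedback_loops(regulates))
```

A graph database performs the same kind of traversal over data held on disk, which matters once the integrated graph no longer fits in memory.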

18

BHAGEERATH-H: A HOMOLOGY/AB INITIO HYBRID SERVER FOR PREDICTING TERTIARY STRUCTURES OF MONOMERIC SOLUBLE PROTEINS B Jayaram1, Priyanka Dhingra1, Avinash Mishra1, Rahul Kaushik1, Goutam Mukherjee1, Ankita Singh1, Shashank Shekhar 1. Indian Institute of Technology, New Delhi, DELHI, India Background: The advent of the human genome sequencing project has led to a spurt in the number of protein sequences in the databanks. Despite significant progress in the area of experimental protein structure determination, the sequence-structure gap is continually widening. Data-driven, homology-based computational methods have proved successful in predicting tertiary structures for sequences sharing medium to high sequence similarities. As the similarities of query sequences dwindle, advanced homology/ab initio hybrid approaches are being explored to solve the structure prediction problem. Here we describe Bhageerath-H, a homology/ab initio hybrid software/server for predicting protein tertiary structures to advance drug design attempts. Results: The Bhageerath-H web server was validated on 75 CASP10 targets and showed a TM-score of > 0.5 in 91% of the cases and Cα RMSDs of < 5 Å from the native in 58% of the targets, which is ahead of the current limits. Comparison with some leading servers demonstrated the uniqueness of the hybrid methodology in effectively sampling conformational space, scoring the best decoys and refining low resolution models to high and medium resolution. Conclusion: The Bhageerath-H methodology is web enabled for the scientific community as a freely accessible web server at http://www.scfbio-iitd.res.in/bhageerath/bhageerath_h.jsp. The methodology is fielded in the on-going CASP11 experiment.

19

SCMHBP: PREDICTION AND ANALYSIS OF HEME BINDING PROTEINS USING PROPENSITY SCORES OF DIPEPTIDES Yi-Fan Liou1, Phasit Charoenkwan1, Yerukala Sathipati Srinivasulu1, Tamara Vasylenko1, Shih-Chung Lai1, Hua-Chin Lee12, Hui-Ling Huang12, Shinn-Ying Ho12 1. Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan 2. Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan Background: Heme binding proteins (HBPs) are metalloproteins containing a heme ligand (an iron-porphyrin complex) as the prosthetic group. Several computational methods have been proposed to predict heme binding residues in order to understand the interactions between heme and its host proteins. However, few in silico methods have been reported for identifying HBPs. Results: This study proposes a method based on the scoring card method (SCM), named SCMHBP, for predicting and analyzing HBPs from sequences. A balanced dataset of 747 HBPs (selected using the Gene Ontology term GO:0020037) and 747 non-HBPs (selected from 92,309 putative non-HBPs) with identity 25% was established first. Next, a set of propensity scores of amino acids and dipeptides to be HBPs was estimated using SCM by maximizing the prediction accuracy of SCMHBP. Finally, we identified informative physicochemical properties by utilizing the estimated propensity scores for categorizing HBPs. The training and mean test accuracies of SCMHBP on three independent test datasets are 85.90% and 83.76%, respectively. SCMHBP performs well compared with methods such as support vector machine (SVM), decision tree J48, and Bayes classifiers. The putative non-HBPs with high sequence propensity scores are potential HBPs which can be further validated. The propensity scores of individual amino acids and dipeptides are examined to recognize the interactions between heme and its host proteins.
Moreover, the following characteristics of HBPs are derived from the propensity scores: 1) aromatic side chains are important to the performance of specific HBP functions; 2) a hydrophobic environment plays an important role in the interaction between heme and binding sites; and 3) the HBP as a whole has low flexibility, while the heme binding residues are relatively flexible. Conclusions: SCMHBP aims at discovering knowledge for further understanding HBPs rather than pursuing high prediction accuracy only. The datasets used and the source code of SCMHBP are available at http://iclab.life.nctu.edu.tw/SCMHBP/.

20

IMPROVING PROTEIN FOLD RECOGNITION USING THE AMALGAMATION OF EVOLUTIONARY-BASED AND STRUCTURAL-BASED INFORMATION Alok Sharma1, Kuldip K Paliwal1, Abdollah Dehzangi1, James Lyons1 1. , Brisbane, QLD, Australia Deciphering the three-dimensional structure of a protein sequence is a challenging task in biological science. Protein fold recognition and protein secondary structure prediction are transitional steps in identifying the three-dimensional structure of a protein. For protein fold recognition, evolutionary-based information of amino acid sequences from the position specific scoring matrix (PSSM) has recently been applied with improved results. Separately, the SPINE X predictor has been developed and applied for protein secondary structure prediction. To date, several methods for protein fold recognition have been developed, but with limited recognition accuracy. In this paper, we have developed a strategy of combining evolutionary-based information (from PSSM) and secondary structure predicted using SPINE X to improve protein fold recognition. The strategy is based on finding the probabilities of amino acid pairs (AAP). The proposed method has been tested on several protein benchmark datasets and an improvement of 8.9% in recognition accuracy has been achieved. We have achieved, for the first time, over 90% and 75% prediction accuracies when the sequential similarity rate is less than 40% and 25%, respectively. We report 90.6% and 77.0% prediction accuracies, respectively, for the Extended Ding and Dubchak benchmark and the Taguchi and Gromiha benchmark, both of which have been widely used for protein fold recognition in the literature.

21

USE B-FACTOR RELATED FEATURES FOR ACCURATE CLASSIFICATION BETWEEN PROTEIN BINDING INTERFACES AND CRYSTAL PACKING CONTACTS Qian Liu1, Zhenhua Li2, Jinyan Li1 1. Advanced Analytics Institute, FEIT, University of Technology, Sydney, Sydney, NSW, Australia 2. School of Computer Engineering, Nanyang Technological University, Singapore Background: Distinguishing true protein interactions from crystal packing contacts is important for structural bioinformatics studies, which need accurate classification of the rapidly increasing number of protein structures. There are many unannotated crystal contacts, and there also exist false annotations, in this rapidly expanding volume of data. Previous tools have been proposed to address this problem. However, challenging issues still remain, such as low performance when the training and test data contain mixed interfaces having diverse sizes of contact areas. Methods and Results: The B factor is a measure that quantifies the vibrational motion of an atom, and is a more relevant feature than interface size for characterizing protein binding. We propose to use three features related to the B factor for the classification between biological interfaces and crystal packing contacts. The first feature is the sum of the normalized B factors of the interfacial atoms in the contact area, the second is the average of the interfacial B factor per residue in the chain, and the third is the average number of interfacial atoms with a negative normalized B factor per residue in the chain. We investigate the distribution properties of these basic features and a compound feature on four datasets of biological binding and crystal packing, and on a protein binding-only dataset with known binding affinity. We also compare the cross-dataset classification performance of these features with existing methods and with interface area, a widely used and the most effective existing feature.
The results demonstrate that our features remarkably outperform the interface area approach and the existing prediction methods in many tests on all of these datasets. Conclusion: Our computational methods have potential for large-scale and accurate identification of biological interactions from the experimentally determined structural data stored at PDB, which may have diverse interface sizes.

22

CLOUDSACA: DISTRIBUTED SUFFIX ARRAY CONSTRUCTION ALGORITHMS PACKAGE ON CLOUD Ahmed Abdelhadi1, Ahmed Ali2, Ahmed Kandil1, Mohamed Abouelhoda12 1. Biomedical Engineering, Cairo University, Giza, Egypt 2. Center of Information Systems, Nile University, Giza, Egypt The suffix array is an important indexing data structure for biological sequence analysis. Since its introduction, many algorithms have been proposed for constructing it either in memory or on disk. However, there is no implementation of a distributed algorithm that can run on an MPI-based computer cluster. Moreover, there is no ready-to-use package that can run on cloud computing platforms. There are so far two distributed suffix array construction algorithms that can run on a computer cluster: Futamura-Aluru-Kurtz and Kulla-Sanders. The latter algorithm has better theoretical time complexity, but there is no proof of its superiority in practice. Due to the lack (and difficulty) of any implementation of either algorithm, it is still an open question which algorithm would be the best in practice. In this paper, we have implemented the two distributed algorithms along with a number of variations of them. These algorithms are further optimized with recent parallel distributed algorithms to improve their performance. Regarding the cloud implementation, we have developed a module that can automatically create resources on either the Amazon or Azure cloud, depending on user preferences. The module also manages the upload of input data, the submission of jobs, and the transfer of results to persistent storage. We also conducted experiments to measure the performance of both algorithms on DNA sequences and provided detailed profiling, including the timing of intermediate steps. Our package is the first that supports distributed construction of suffix arrays over an MPI computer cluster. It is also the first that supports multiple cloud providers.
Our implementation is available through a command line and a web interface, hiding all technical details of establishing and using cloud resources. Regarding performance, the Futamura-Aluru-Kurtz algorithm is the more efficient in practice, even though the Kulla-Sanders algorithm has better time complexity.
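For readers unfamiliar with the data structure, a suffix array lists the starting positions of all suffixes of a text in lexicographic order, so that every occurrence of a pattern occupies one contiguous block of the array. The sketch below is a deliberately naive, single-machine construction by direct sorting, followed by a binary-search lookup. It is neither the Futamura-Aluru-Kurtz nor the Kulla-Sanders algorithm and is unrelated to the package's code; the function names and the short DNA string are hypothetical.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by suffix (naive construction).

    Sorting whole suffixes is simple but slow for long texts; practical
    and distributed algorithms avoid comparing full suffixes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern, found via binary search."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # locate the first suffix that is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Suffixes that start with the pattern are adjacent in the array.
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


text = "ACGTACGA"  # hypothetical toy sequence
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "ACG"))
```

The cost of comparing and moving suffixes is what the distributed algorithms spread across the nodes of a cluster.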

23

INTERSECT DATA STORAGE – TECHNOLOGIES USED, LESSONS LEARNED Mohammad Islam1 1. Macquarie University, Sydney, NSW, Australia Transfer, storage and sharing of data are big challenges for the modern bioinformatician. Intersect is assisting researchers and bioinformaticians throughout NSW by providing a versatile and cost-effective cloud storage platform. While developing this platform, Intersect has supported a variety of use-cases and workflows through a complementary set of technologies and interfaces. As Intersect enters its seventh year, it now offers high performance computing (HPC), high performance storage, and application hosting, as well as software engineering and other services. Analysts Mohammad Islam and Stuart Allen will discuss some of the technologies used and the lessons learned while ingesting Intersect’s first petabyte of active research data.

24

DISCOVERY OF GENE INTERACTIONS BY GPU-ENABLED COMPUTATION OF PAIRWISE EXPRESSION LEVEL METAFEATURES Carlos Riveros1, Heloisa Milioli1, Renato Vimieiro1, Regina Berretta1, Pablo Moscato1 1. Centre for Bioinformatics, Biomarker Discovery and Information-based Medicine, HMRI - University of Newcastle, New Lambton Heights, NSW, Australia Background: Intrinsic across-sample variability and noise in microarray gene expression data mask the discovery of the variables of interest associated with the dysregulation of the biomolecular network in disease. Here we explore the discovery of interactions (differences, ratios, etc.) between levels of gene expression as the independent variables, through their massive computation and characterisation using GPUs. We show the utility of this approach by applying it to the analysis of two complex gene expression datasets. Methods: We developed a GPU-based framework for the high-performance computation and analysis of metafeatures constructed as a function of a pair of features, for all feature pairs in a dataset. We use meta-programming techniques to tailor, at run-time, the GPU components of the computation to the function being tested. We test and rank metafeatures using the CM1 score and characterise the most significant metafeatures by their combined sample classification power. Results: Compared with a distributed parallel implementation of the proposed method in R, our framework achieves wall-clock speedups of 25x and resource savings of 16x to 20x. The GPU tool can quickly analyse large datasets on relatively modest hardware. The metafeatures selected by the method provide extended classification power compared to gene probesets alone. In a breast cancer dataset with consistent molecular subtype labelling, we obtain an average Cramer’s V of 0.91 ± 0.04 for the top ranked differences vs. 0.73 ± 0.06 for the top ranked probes.
Conclusions: Exploration of pairwise combinations of expression levels in the analysis pipeline of a disease or its subtypes proves a useful tool both for searching for biomarkers and for discovering lesser-known interactions affected by confounding factors. Our framework using GPUs makes this kind of analysis available without specialised computing environments.
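To make the notion of a pairwise metafeature concrete, the sketch below computes the difference metafeature for every pair of features in a small expression table on the CPU. It is an illustration only, not the GPU framework described in the abstract: the function name, the dictionary layout, and the probe labels and values are hypothetical, and the CM1 ranking step is omitted because its definition is not given here.

```python
from itertools import combinations


def pairwise_differences(expression):
    """Build one difference metafeature per unordered pair of features.

    expression maps a feature (probe) name to its per-sample levels.
    The result maps each pair (a, b), with a before b alphabetically,
    to the per-sample values of a minus b.
    """
    metafeatures = {}
    for a, b in combinations(sorted(expression), 2):
        metafeatures[(a, b)] = [
            x - y for x, y in zip(expression[a], expression[b])
        ]
    return metafeatures


# Hypothetical expression levels for three probes across two samples.
expression = {
    "probe1": [5.0, 7.0],
    "probe2": [2.0, 3.0],
    "probe3": [1.0, 1.0],
}
metafeatures = pairwise_differences(expression)
print(len(metafeatures), "metafeatures")
print(metafeatures[("probe1", "probe2")])
```

With n features there are n(n-1)/2 such pairs, so a dataset with tens of thousands of probes yields hundreds of millions of metafeatures, which is the scale that motivates the massively parallel computation.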

25

WHOLE-BRAIN IMAGING AT THE SINGLE-CELL RESOLUTION WITH THE CUBIC METHOD Dimitri Perrin1, Etsuo A Susaki1234, Kazuki Tainaka234, Hiroki R Ueda1234 1. RIKEN Center for Developmental Biology, Kobe, Japan 2. RIKEN Quantitative Biology Center, Kobe, Japan 3. Graduate School of Medicine, The University of Tokyo, Tokyo, Japan 4. CREST, Japan Science and Technology Agency, Saitama, Japan A major challenge of systems biology is to understand how emergent properties at the organism level result from specific phenomena at the cellular scale. This is particularly difficult (and crucial) in the brain, due to the complexity and importance of the organ. Whole-brain imaging has the potential to be a facilitating technology in these efforts, provided that it can operate at single-cell resolution and through a method with very high throughput. In this talk in the Highlights Track, we will present a method, called CUBIC (Clear, Unobstructed Brain Imaging Cocktails and Computational analysis), which we published this April. CUBIC is the result of a comprehensive chemical screening, and is a simple and efficient method involving the immersion of brain samples in chemical mixtures containing aminoalcohols, which enables rapid whole-brain imaging with single-photon excitation microscopy. The method can be applied to multicolor imaging of fluorescent proteins or immunostained samples in adult brains, and it is scalable from a primate brain to subcellular structures. We also developed a whole-brain cell-nuclear counterstaining protocol and a computational image analysis pipeline. All these elements, taken together, enable the visualization and quantification of neural activities induced by environmental stimulation. In this presentation, we will explain the challenges associated with whole-brain imaging, as well as how CUBIC addresses limitations of previous methods and in doing so enables time-course expression profiling of whole adult brains with single-cell resolution.
We will put particular emphasis on the computational challenges related to the handling and analysis of high-resolution 3D images.

26

FMAJ: A TOOL FOR HIGH CONTENT ANALYSIS OF MUSCLE DYNAMICS IN DROSOPHILA METAMORPHOSIS Kuleesha Kuleesha12, Puah Wee Choo1, Lin Feng2, Martin Wasser1 1. Bioinformatics Institute, Singapore 2. Nanyang Technological University, Singapore, Singapore During metamorphosis in Drosophila, muscles undergo developmentally regulated remodeling, which involves cell death of obsolete muscles and atrophy of persistent muscles. Thanks to the ability to perform live imaging of muscle development in transparent pupae and the power of genetics, metamorphosis in Drosophila can be used as a model to study the regulation of skeletal muscle mass. We performed targeted gene perturbation in muscles and acquired 3D time series images of muscle development in metamorphosis using laser scanning confocal microscopy. To help us quantify the phenotypic effects of gene perturbations in large numbers of images, we designed an ImageJ-based Fly Muscle Analysis tool (FMAj) and MySQL frameworks for image processing and data storage, respectively. The image analysis pipeline of FMAj is divided into three modules. The first module assists the user in adding annotations to the time-lapse datasets, such as genotype, experimental parameters and temporal reference points, which are used to compare different datasets. The second module performs segmentation and feature extraction of muscle cells and nuclei. The third module performs comparative quantitative analysis of muscle phenotypes. We demonstrated our tool in the time series phenotypic characterization of two atrophy-related genes, which were silenced by RNA interference. Reduction of the Drosophila TOR homolog resulted in atrophy, while inhibition of the autophagy factor Atg9 led to hypertrophy and abnormal morphology of muscle fibers. Statistical analysis showed significant differences in muscle diameter between controls and the two types of gene perturbation.
Our in vivo imaging experiments revealed that genes involved in TOR signalling and autophagy are not only conserved in sequence between mammals and Drosophila, but also perform similar functions in regulating muscle mass. Extending our approach to a genome-wide scale has the potential to identify new genes involved in muscle size regulation.

27

AUTOMATED MICROTUBULE PATH TRACKING ON GLIDING ASSAY USING HIDDEN MARKOV MODEL Bulibuli Mahemuti1, Yuexing Han12, Daisuke Inoue3, Akira Kakugo34, Akihiko Konagaya1 1. Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Japan 2. School of Computer Engineering and Science, Shanghai University, Shanghai, China 3. Faculty of Science, Hokkaido University, Sapporo, Japan 4. Graduate School of Chemical Sciences and Engineering, Hokkaido University, Sapporo, Japan Object tracking is an important issue in bio-imaging and is necessary to elucidate the dynamics of molecules from video data. In microtubule gliding assays, object tracking becomes non-trivial due to the occurrence of compound objects, such as crossing and snuggling microtubules, as well as the sudden appearance and disappearance of microtubules. To address these issues, we discuss a newly created object tracking methodology that uses a Hidden Markov model. The microtubule Hidden Markov model enables us to estimate plausible tracking paths efficiently by decomposing the hidden states of compound objects. These microtubule tracking paths can enhance our understanding of the dynamics of microtubule movement. Our future work will focus on further improving the microtubule recognition accuracy.

28

FANSE2: AN ACCURATE READ MAPPING ALGORITHM THAT RE-INTERPRETS THE NEXT-GENERATION SEQUENCING Gong Zhang1 1. Jinan University, Guangzhou, China Correct and bias-free interpretation of deep sequencing data depends on the complete mapping of all mappable reads to the reference sequence. However, the accuracy and robustness of previous read-mapping algorithms are not satisfactory in many cases, impairing reproducibility and verifiability. We developed an algorithm, FANSe2, with an iterative mapping strategy based on the statistics of real-world sequencing error distribution to substantially accelerate the mapping without compromising its high accuracy, robustness and verifiability. The sensitivity and accuracy of FANSe2 are higher than those of previous algorithms in tests using both prokaryotic and eukaryotic sequencing datasets. The gene identification results of FANSe2 are experimentally validated, while the previous algorithms produce false positives and false negatives. Also, the SNV identifications based on FANSe2 results are experimentally validated, while the other algorithms provide false positive and false negative SNV identifications. We implemented a scalable and almost maintenance-free parallelization method that can utilize the computational power of multiple office computers, a novel feature not present in any other mainstream algorithm. Its speed can exceed that of the BWT-based algorithms, matching the speed of the coming generation of sequencers. In sum, FANSe2 provides verifiable and robust accuracy, full indel sensitivity, fast speed, versatile compatibility and economical computational utilization, making it a useful and practical tool for deep sequencing applications. FANSe2 is freely available at http://bioinformatics.jnu.edu.cn/software/fanse2/.

29

miRNA WORKBENCH: A COMPUTATIONAL TOOLKIT FOR MIRNA IDENTIFICATION AND TARGET PREDICTION EXPERIMENTS Kulwadee Somboonviwat1, Napol Kaewkascholkul2, Kunlaya Somboonwiwat2 1. Software Engineering Program, International College, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand 2. Center of Excellence for Molecular Biology and Genomics of Shrimp, Department of Biochemistry, Faculty of Science, Chulalongkorn University, Bangkok, Thailand MicroRNAs (miRNAs) are small, non-protein-coding RNAs (ncRNAs) about 22 nucleotides in length. They are increasingly recognized as important regulators of many processes in eukaryotic systems. Empowered by next-generation sequencing, researchers are now able to comprehensively analyze miRNA data. The sheer size of the miRNA data available has generated sizable demand for computational tools to assist researchers in their miRNA analysis pipelines. Here, we present "miRNA Workbench", a toolkit for miRNA identification and target prediction experiments. The miRNA Workbench provides an integrated software environment that facilitates systematic miRNA data analysis. Starting with the raw input data in FastQ format, the miRNA Workbench pre-processes the input data by converting it into FastA format and removing low-quality sequences. The analysis of miRNA data is divided into two main steps. The first step is the identification of known and novel miRNAs, and the second step is miRNA target prediction. All these miRNA data analysis tasks are tied together into a single workflow. Another key feature of the miRNA Workbench is the customization of the algorithms used in each of the analysis steps. With the apparent increase in the importance of miRNAs, we anticipate that many new algorithmic techniques for miRNA data analysis will be developed in the near future. In this respect, the miRNA Workbench can be used by researchers to experiment with different algorithms for miRNA data analysis.
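
The pre-processing step described in this abstract (FastQ to FastA conversion with removal of low-quality sequences) might look roughly like the sketch below. This is an illustration only, not the miRNA Workbench implementation: the function names, the four-line record layout, the Phred+33 quality encoding and the mean-quality threshold of 20 are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple


def fastq_to_fasta(lines: Iterable[str],
                   min_mean_quality: float = 20.0) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for reads that pass a mean-quality filter.

    Assumes four-line FastQ records and Phred+33 quality encoding
    (both are assumptions of this sketch, not facts from the abstract).
    """
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not record:
            continue  # skip blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, _plus, quality = record
            record = []
            if not header.startswith("@") or not quality:
                continue  # malformed record; simply skipped in this sketch
            mean_q = sum(ord(c) - 33 for c in quality) / len(quality)
            if mean_q >= min_mean_quality:
                yield header[1:], sequence


def format_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Render (name, sequence) pairs as FastA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

A real pipeline would typically add adapter trimming and length filtering before miRNA identification; those steps are omitted here because the abstract does not describe them.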

30

ERROR ESTIMATES FOR THE ANALYSIS OF DIFFERENTIAL EXPRESSION FROM RNA-SEQ COUNT DATA Conrad Burden1, Sumaira Qureshi1, Susan Wilson12 1. Australian National University, Canberra, ACT, Australia 2. University of New South Wales, Sydney, NSW, Australia A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression. We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is Polyfit, an add-on package to edgeR and DESeq which we introduce here. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure. We find the best performing package, in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, to be the QLSpline implementation of QuasiSeq. This finding holds provided the number of biological replicates in each condition is at least 4. The next best performing packages are edgeR and DESeq. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of edgeR (for 10 or more replicates per condition) or DESeq (for 6 or more replicates per condition) in our tests with synthetic data.
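
For readers unfamiliar with the Storey-Tibshirani procedure mentioned in this abstract, the sketch below shows a generic q-value calculation of that type: the proportion of true nulls, pi0, is estimated from the p-values above a cut-off, and q-values are then computed from the ranked p-values. It is not the Polyfit method. The fixed cut-off `lam = 0.5` and the function names are assumptions made for illustration, and a non-empty list of p-values is assumed.

```python
from typing import List


def estimate_pi0(p_values: List[float], lam: float = 0.5) -> float:
    """Estimate the fraction of true null hypotheses from p-values above lam."""
    m = len(p_values)
    pi0 = sum(p > lam for p in p_values) / (m * (1.0 - lam))
    return min(1.0, pi0)


def q_values(p_values: List[float], lam: float = 0.5) -> List[float]:
    """Return q-values in the same order as the input p-values."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q
```

The estimate of pi0 relies on null p-values being uniform on [0, 1], which is the assumption the abstract reports as frequently violated for overdispersed count models.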

31

Evaluating Two-Factor Experimental Results for RNA-Seq Data using Simulation Margaret R Donald1, Susan R Wilson12 1. University of New South Wales, Kensington, NSW 2052, Australia 2. Australian National University, Acton, ACT 0200, Australia Background: Many programs for the analysis of RNA-Seq data are available in R, and there has been some assessment of whether, and when, they may be expected to give correct answers. Generally, simulations to test the software are based on the Poisson or the negative binomial distribution, and contrast two groups with varying numbers of replications. However, experimenters often use more complex experimental designs. We simulate negative binomial data from a two-factor design to assess several R packages. Two different simulation methods are used. Our motivating data set is a two-factor experiment with eight patients suffering from Myelodysplastic Syndrome and seven from Chronic Myelomonocytic Leukaemia, and RNA-Seq counts estimated before and after drug treatment. We compare the DE genes used to generate the simulated data with the DE genes found from the simulations by the R packages DESeq2, PoissonSeq, edgeR, and QuasiSeq. Results: Since the simulated data follow a negative binomial distribution, the Poisson-based methods unsurprisingly performed poorly. The number of DE genes in common using one set of simulations varied from 0% to 19%, and in the other from 4% to 37%. Conclusions: We conclude that for a two-factor experiment with 30 experiments per gene, the negative binomial is sufficiently flexible to account for extra-Poisson variability. For simulated data for 2000 randomly selected genes the task of normalisation made some packages almost entirely uninformative as to which genes were DE, with PoissonSeq performing least well. Normalisation affects parameter estimates, and seemed badly estimated with 2000 genes. 
Normalisation rates operate on the gene counts, raising or lowering them, and hence, diminish or enhance the looked-for signal. Packages edgeR with tagwise dispersion, and QuasiSeq with tagwise, common and trend dispersion estimates, performed reasonably well. DESeq2 was harder to assess, but performed well when the interest is in looking at ‘top’ genes.

32

TUMOR ANTIGENS AS PROTEOGENOMIC BIOMARKERS IN INVASIVE DUCTAL CARCINOMAS Vladimir Brusic1 1. Boston University, Boston, MA, United States BACKGROUND: The majority of genetic biomarkers for human cancers are defined by statistical screening of high-throughput genomics data. While a large number of genetic biomarkers have been proposed for diagnostic and prognostic applications, only a small number have been applied in the clinic. Similarly, the use of proteomics methods for the discovery of cancer biomarkers is increasing. The emerging field of proteogenomics seeks to enrich the value of genomics and proteomics approaches by studying the intersection of genomics and proteomics data. This task is challenging due to the complex nature of transcriptional and translation regulatory mechanisms and the disparities between genomic and proteomic data from the same samples. In this study, we have examined tumor antigens as potential biomarkers for breast cancer using genomics and proteomics data from previously reported laser capture microdissected ER+ tumor samples.

RESULTS: We applied proteogenomic analyses to study the genetic aberrations of 32 tumor antigens determined in the proteomic data. We found that tumor antigens that are aberrantly expressed at the genetic level and expressed at the protein level are likely involved in perturbing pathways directly linked to the hallmarks of cancer. The results of the proteogenomic analysis of the 32 tumor antigens studied here capture largely the same pathway irregularities as those elucidated from large-scale genomic screening, where several thousand genes are often found to be perturbed.

CONCLUSION: Tumor antigens are a group of proteins recognized by the cells of the immune system. Specifically, they are recognized in tumor cells where they are present in larger than usual amounts, or are physiochemically altered to a degree at which they no longer resemble native human proteins. This proteogenomic analysis of 32 tumor antigens suggests that tumor antigens have the potential to be highly specific biomarkers for different cancers.

33

iPhos: TOOLKIT TO STREAMLINE THE ALKALINE PHOSPHATASE ASSISTED COMPREHENSIVE LC-MS PHOSPHOPROTEOME INVESTIGATION Tzu-Hsien Yang1, Hong-Tsun Chang1, Eric S.L. Hsiao2, Juo-Ling Sun2, Chung-Ching Wang1, Hsin-Yi Wu3, Pao-Chi Liao2, Wei-Sheng Wu1 1. Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan 2. Department of Environmental and Occupational Health, National Cheng Kung University, Tainan, Taiwan 3. Institute of Chemistry, Academia Sinica, Taipei, Taiwan Comprehensive characterization of the phosphoproteome in living cells is critical in signal transduction research. However, the low abundance of phosphopeptides among the total proteome in cells remains an obstacle in mass spectrometry-based proteomic analysis. As a solution, an alternative analytic strategy has been proposed that confidently identifies phosphorylated peptides by combining alkaline phosphatase (AP) treatment with high-resolution mass spectrometry. While the process is applicable, the key integration along the pipeline was mostly done by tedious manual work. We developed a software toolkit, iPhos, to facilitate and streamline the workflow of AP-assisted phosphoproteome characterization. The iPhos toolkit includes one assister and three modules. The iPhos Peak Extraction Assister automates batch-mode peak extraction for multiple liquid chromatography mass spectrometry (LC-MS) runs. iPhos Module-1 processes the peak lists extracted from the LC-MS analyses of the original and dephosphorylated samples to mine potential phosphorylated peptide signals based on the mass shift caused by the loss of one or more phosphate groups. iPhos Module-2 provides customized inclusion lists with peak retention time windows for subsequent targeted LC-MS/MS experiments. iPhos Module-3 helps link the peptide identifications from protein search engines with the quantification results from pattern-based label-free quantification tools. 
We further demonstrated the utility of the iPhos toolkit on data from human metastatic lung cancer cells (CL1-5). In a comparison of the control group of CL1-5 cell lysates and the treatment group of dasatinib-treated CL1-5 cell lysates, we demonstrated the applicability of the iPhos toolkit and reported the experimental results based on the iPhos-facilitated phosphoproteome investigation. We also compared the strategy with pure DDA-based LC-MS/MS phosphoproteome investigation. The results of iPhos-facilitated targeted LC-MS/MS convey more thorough and confident phosphopeptide identification than the results of pure DDA LC-MS/MS. The iPhos software toolkit and sample tutorial data are available at http://cosbi3.ee.ncku.edu.tw/iPhos/.
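
The mass-shift idea behind Module-1 can be illustrated with a short sketch: a peak in the original sample is a phosphopeptide candidate if its mass exceeds that of a peak in the dephosphorylated sample by a whole number of phosphate groups (HPO3, monoisotopic mass 79.96633 Da). This is not the iPhos code. The function name, the use of neutral monoisotopic masses, the limit of three sites and the 10 ppm tolerance are assumptions for the example, and retention time is ignored.

```python
from typing import List, Tuple

PHOSPHATE_SHIFT_DA = 79.96633  # monoisotopic mass of HPO3


def find_phospho_candidates(original_masses: List[float],
                            dephospho_masses: List[float],
                            max_sites: int = 3,
                            tol_ppm: float = 10.0) -> List[Tuple[float, float, int]]:
    """Return (original_mass, dephospho_mass, n_sites) for matching peak pairs.

    Masses are assumed to be neutral monoisotopic peptide masses in daltons.
    """
    candidates = []
    for m_orig in original_masses:
        for m_dephos in dephospho_masses:
            for n in range(1, max_sites + 1):
                expected = m_dephos + n * PHOSPHATE_SHIFT_DA
                if abs(m_orig - expected) <= expected * tol_ppm * 1e-6:
                    candidates.append((m_orig, m_dephos, n))
    return candidates
```

The nested loops are quadratic in the number of peaks; for large peak lists, sorting one list and using a binary search would be the usual refinement.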

34

THE BACTERIAL PROTEOGENOMIC PIPELINE Julian Uszkoreit1, Nicole Plohnke1, Sascha Rexroth1, Martin Eisenacher1 1. Ruhr-Universitaet Bochum, Bochum, Germany Proteogenomics combines cutting-edge methods from genomics and proteomics. While it has become cheap to sequence whole genomes, the correct annotation of protein-coding regions in the genome is still tedious and error-prone. Mass spectrometry, on the other hand, relies on good characterizations of the proteins deriving from the genome, but can also be used to help improve the annotation of genomes or find species-specific peptides. Additionally, proteomics is widely used to find evidence for differential expression of proteins under different conditions, e.g. growth conditions for bacteria. Though the concept of proteogenomics is not new, mainly in-house scripts or special tools for eukaryotic and human analyses have been developed. The Bacterial Proteogenomic Pipeline, which is written entirely in Java, simplifies proteogenomic analyses of bacteria. From a given genome sequence, a naïve six-frame translation is performed and, if desired, a decoy database is generated. This database is used to identify MS/MS spectra by common peptide identification algorithms. After combination of the search results and optional flagging for different experimental conditions, the results can be browsed and further inspected. In particular, for each peptide the number of identifications for each condition and the positions in the corresponding protein sequences are shown. Intermediate and final results can be exported into GFF3 format for visualization in common genome browsers.
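
A naïve six-frame translation of the kind mentioned in this abstract can be sketched as follows. The pipeline itself is written in Java; this Python version is only an illustration and is not taken from it. It assumes the standard genetic code for codon assignments, writes stop codons as `*` and unknown codons as `X`, and uses frame labels `+1` to `-3` that are invented for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(genome: str) -> dict:
    """Return the translations of all six reading frames, keyed '+1'..'-3'."""
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

To build a search database from these frames, each translation would normally be split at stop codons into candidate open reading frames; that step and the decoy generation are left out of the sketch.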

35

PATHWAY ANALYSIS AND TRANSCRIPTOMICS IMPROVE PROTEIN IDENTIFICATION BY SHOTGUN PROTEOMICS FROM SAMPLES COMPRISING SMALL NUMBER OF CELLS – A BENCHMARKING STUDY Vladimir Brusic1 1. Boston University, Boston, MA, United States BACKGROUND: Proteomics research is enabled by high-throughput technologies, but our ability to identify the expressed proteome is limited in small samples. The coverage and consistency of proteome expression are critical problems in proteomics. Here, we propose pathway analysis and the combination of microproteomics and transcriptomics analyses to improve mass-spectrometry protein identification from small samples.

RESULTS: Multiple proteomics runs using the MCF-7 cell line detected 4,957 expressed proteins. About 80% of expressed proteins were present in MCF-7 transcript data; highly expressed transcripts are more likely to have expressed proteins. Approximately 1,000 proteins were detected in each run of the small sample proteomics and more than 4,000 proteins were extracted from the gene sets representing canonical pathways. The identified canonical pathways were largely overlapping between individual runs. Of the identified pathways, 182 were shared between three individual small sample runs.

CONCLUSIONS: Current technologies enable us to directly detect 10% of expressed proteomes from small samples comprising as few as 50 cells. We used knowledge-based approaches to elucidate the missing proteome that can be verified by targeted proteomics. This knowledge-based approach includes pathway analysis and the combination of gene expression and protein expression data for target prioritization. Proteins present in canonical pathways represent approximately 50% of expressed proteomes and 90% of targets from canonical pathways were estimated to be expressed. Highly expressed transcripts indicate a high probability of protein expression. However, approximately 10% of expressed proteins are not matched with the expressed transcripts.

36

TIMEXNET: IDENTIFYING ACTIVE GENE SUB-NETWORKS USING TIME-COURSE GENE EXPRESSION PROFILES Ashwini Patil1, Kenta Nakai1 1. Institute of Medical Science, University of Tokyo, Tokyo, Japan Background: Time-course gene expression profiles are frequently used to provide insight into the changes in cellular state over time and to infer the molecular pathways involved. When combined with large-scale molecular interaction networks, such data can provide information about the dynamics of cellular response to stimulus. However, few tools are currently available to predict a single active gene sub-network from time-course gene expression profiles.

Results: We introduce a tool, TimeXNet (http://timexnet.hgc.jp/), which identifies active gene sub-networks with temporal paths using time-course gene expression profiles in the context of a weighted gene regulatory and protein-protein interaction network. TimeXNet uses a specialized form of the network flow optimization approach to identify the most probable paths connecting the genes with significant changes in expression at consecutive time intervals. TimeXNet has been extensively evaluated for its ability to predict novel regulators and their associated pathways within active gene sub-networks in the innate immune response. Compared to other similar methods, TimeXNet identifies up to 40% more novel regulators from independent experimental datasets. It also predicts paths within a greater number of known pathways with longer overlaps (up to 7 consecutive edges) within these pathways.

Conclusions: TimeXNet is a reliable tool that can be used to study cellular response to stimuli through the identification of time-dependent active gene sub-networks in diverse biological systems. TimeXNet is implemented in Java as a stand-alone application and supported on Linux, MS Windows and Macintosh. It can be downloaded from http://timexnet.hgc.jp/. The output of TimeXNet can be directly viewed in Cytoscape. TimeXNet is freely available for non-commercial users.

37

NETWORK-BASED BIOMARKERS ENHANCE CLASSICAL APPROACHES TO PROGNOSTIC GENE EXPRESSION SIGNATURES Rebecca L Barter1, Sarah-Jane Schramm23, Yee Hwa (Jean) Yang13, Graham J Mann23 1. School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, Australia 2. Westmead Millenium Institute, The University of Sydney, Sydney, NSW, Australia 3. Melanoma Institute Australia, Sydney, NSW, Australia Classical approaches to predicting patient clinical outcome via gene expression information are based on differential expression of unrelated genes (single-gene approaches) or genes related by biologic features (gene-sets). Recently, network-based approaches utilising interaction information between genes have emerged. An open problem is whether such approaches add value to traditional methods of modelling. We explore this question via comparison of single-gene, gene-set, and network-based methods, using gene expression microarray data from two cancers. We consider two general network approaches. The first of these identifies informative genes using gene expression and network information drawn from prior knowledge of protein-protein interactions (PPIs). In the second approach, classification features are small networks of interacting proteins (again, identified from prior knowledge) or are obtained from such networks e.g., by considering edges (interactions) or hubs (highly-connected proteins). For all methods we perform 100 rounds of 5-fold cross-validation under three different classifiers. For network-based approaches, we consider two PPI networks. We quantify resulting patterns of misclassification and discuss the relative value of each with respect to ongoing development of prognostic biomarkers. We find that single-gene, gene-set and network methods yield similar classification error rates across cancer data sets. 
Crucially, however, our detailed patient-level analyses reveal that the different methods are correctly classifying alternate subsets of patients within each cohort. We also find that the network-based NetRank feature selection method is the most stable. Network-based methods of signature modelling harness data from external sources and are foreshadowed as a standard mode of analysis. But do they add to traditional approaches? Our findings indicate there is value in the way different subspaces of the patient sample are captured differently among the various methods, highlighting the possibility of ‘combination’ classifiers capable of identifying which patients will be more accurately classified by one particular type of method over another.
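
The evaluation scheme named in this abstract, 100 rounds of 5-fold cross-validation, can be sketched as an index generator. This only illustrates the resampling design and is not the authors' code: the function name, the fixed random seed and the unstratified splitting are assumptions, and the classifiers and feature selection steps are omitted.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_samples: int, n_folds: int = 5, n_rounds: int = 100,
                   seed: int = 0) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yield (round_index, fold_index, train_indices, test_indices)."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for round_index in range(n_rounds):
        rng.shuffle(indices)  # a fresh random partition in every round
        for fold_index in range(n_folds):
            test = sorted(indices[fold_index::n_folds])
            held_out = set(test)
            train = [i for i in range(n_samples) if i not in held_out]
            yield round_index, fold_index, train, test
```

Recording, for every patient, how often each method misclassifies them across the rounds is what makes the patient-level comparison described above possible.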

38

CYTOHUBBA: IDENTIFY HUB OBJECTS AND SUB-NETWORK FROM COMPLEX INTERACTOME Chia-Hao Chin1, Shu-Hwa Chen2, Hsin-Hung Wu3, Chin-Wen Ho4, Ming-Tat Ko23, Chung-Yen Lin2356 1. Department of Computer Science and Information Engineering, Nanhua University, Chiayi, Taiwan 2. Institute of Information Science, Academia Sinica, Taipei, Taiwan 3. Research Center of Information Technology Innovation, Academia Sinica, Taipei, Taiwan 4. Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan 5. Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan 6. Institute of Fisheries Science, College of Life Science, National Taiwan University, Taipei, Taiwan

Background: Networks are a useful tool for presenting many types of biological data, including protein-protein interactions, gene regulation, cellular pathways, and signal transduction. Measuring nodes by their network features lets us infer their importance in the network and helps us identify the central elements of biological networks.

Results: We introduce cytoHubba, a novel Cytoscape plugin for ranking nodes in a network by their network features. CytoHubba provides 11 topological analysis methods: Degree, Edge Percolated Component, Maximum Neighborhood Component, Density of Maximum Neighborhood Component, Maximal Clique Centrality (MCC) and six centralities based on shortest paths (Bottleneck, EcCentricity, Closeness, Radiality, Betweenness, and Stress). Among the eleven methods, the newly proposed method, MCC, achieves better precision in predicting essential proteins from the yeast PPI network.

Conclusions: CytoHubba provides a user-friendly interface to explore important nodes in biological networks. It computes all eleven methods in a single step. In addition, researchers are able to combine cytoHubba with other plugins into a novel analysis scheme. The network and sub-networks captured by this topological analysis strategy will lead to new insights on essential regulatory networks and protein drug targets for experimental biologists. CytoHubba is available as a Cytoscape plug-in and can be accessed freely at http://hub.iis.sinica.edu.tw/cytohubba/ for more detail.
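
As an illustration of ranking nodes by a network feature, the sketch below implements only the simplest of the eleven measures, Degree, on an undirected edge list. It is not cytoHubba code: the function name, the handling of duplicate edges and self-loops, and the alphabetical tie-breaking are assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple


def rank_by_degree(edges: Iterable[Tuple[str, str]],
                   top_n: Optional[int] = None) -> List[Tuple[str, int]]:
    """Rank nodes of an undirected network by number of distinct neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # self-loops are ignored in this sketch
        neighbours[a].add(b)
        neighbours[b].add(a)
    ranked = sorted(((node, len(nbrs)) for node, nbrs in neighbours.items()),
                    key=lambda item: (-item[1], item[0]))  # ties: alphabetical
    return ranked if top_n is None else ranked[:top_n]
```

For example, `rank_by_degree([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")], top_n=1)` returns `[("A", 3)]`, identifying A as the hub of this small network.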

39

PROTEIN INTER-DOMAIN LINKER PREDICTION USING RANDOM FOREST AND AMINO ACID PHYSIOCHEMICAL PROPERTIES Maad Shatnawi1, Nazar Zaki1, Paul D. Yoo2 1. UAEU, Abu Dhabi, United Arab Emirates 2. Center for Distributed and High Performance Computing (J12), University of Sydney, Sydney, Australia Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function prediction. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this work, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physiochemical and domain-linker information of amino acid sequences. The proposed approach was tested on two well-known benchmark protein datasets and achieved 68% sensitivity and 99% precision, which is better than any existing protein domain-linker predictor. Without applying any data balancing technique, such as class weighting or data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets. Our experimental results show that the proposed approach is useful for domain-linker identification in highly imbalanced single-domain and multi-domain proteins.

40

SOFTWARE QUALITY ASSURANCE IN GENOMIC MEDICINE AND SYSTEMS BIOLOGY AmirHossein Kamali12, Alistair McEwan2, Joshua Ho13 1. Victor Chang Cardiac Research Institute, Darlinghurst, NSW, Australia 2. University of Sydney, Sydney, Australia 3. UNSW, Sydney, Australia Software quality assurance becomes especially critical if bioinformatics tools are to be used in a translational medical setting, such as analysis and interpretation of WGS data. We must ensure that only validated algorithms are used, and that they are implemented correctly in the analysis pipeline – and not disrupted by hardware or software failure. Recently it has been shown that the concordance of multiple widely used variant-calling pipelines is very low (<60% in SNP calling, and <30% in indel calling). Considering there is only one ground truth, the high level of discrepancy is troubling, and tells us that even the most popular bioinformatics tools to date can generate results with non-negligible false positive or false negative rates. In this talk, we will give examples from our recent experience in implementing QA in commonly used bioinformatics programs in genomic medicine and systems biology. We will explore how concepts and tools in the software testing field can be adapted to perform quality assurance in genomic medicine and systems biology.

41

MUTATIONS AND METAMORPHICS: GOOD SOFTWARE DEVELOPMENT PRACTICES FOR BIOINFORMATICIANS Michael Charleston1 1. University of Sydney, Camperdown, NSW, Australia Software that has not been tested cannot be regarded as reliable, yet there is a mass of bioinformatics software available whose testing is unclear, or even absent. It is common to see journal articles that "test" performance of an algorithm or an implementation by pitting programs against each other and using secondary measures of quality: score of a function, memory footprint, time taken, number of contigs --- and so on.

Software testing can be broadly divided into verification --- that a program is doing what we intended it to do, and validation --- that our intention was correct in the first place. It is more common to see validation of software, by such means as direct comparison across implementations, yet comparatively rare to see verification, which must be based on tests of output correctness. But perhaps that's not so surprising: computationally complex bioinformatics problems must often be solved with heuristics, which may even include a stochastic element. Testing programs whose output can vary given the same input therefore clearly presents significant challenges.

This talk outlines some practices that we can use to test correct execution of bioinformatics software, either treating programs as "black boxes" in the case where source code is not available, or as more transparent ones when we can inspect the code directly. I will present some work we have completed on some published and very commonly used phylogenetics software.
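
Metamorphic testing, referred to in the title of this talk, checks relations that must hold between the outputs of related inputs, so no known correct answer is needed. The sketch below is a generic illustration and is not drawn from the talk: the program under test (`p_distance`, the proportion of differing sites between two aligned sequences), the two relations chosen and all names are assumptions for the example.

```python
import random
from typing import Callable


def p_distance(seq_a: str, seq_b: str) -> float:
    """Program under test: proportion of differing sites in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def check_metamorphic_relations(distance: Callable[[str, str], float],
                                seq_a: str, seq_b: str,
                                trials: int = 100, seed: int = 0) -> bool:
    """Treat `distance` as a black box and test two metamorphic relations."""
    rng = random.Random(seed)
    baseline = distance(seq_a, seq_b)
    # Relation 1: swapping the two inputs must not change the output.
    if distance(seq_b, seq_a) != baseline:
        return False
    # Relation 2: permuting alignment columns identically in both
    # sequences must not change the output.
    columns = list(range(len(seq_a)))
    for _ in range(trials):
        rng.shuffle(columns)
        perm_a = "".join(seq_a[i] for i in columns)
        perm_b = "".join(seq_b[i] for i in columns)
        if distance(perm_a, perm_b) != baseline:
            return False
    return True
```

The same pattern extends to heuristic or stochastic programs by comparing summary properties of the outputs rather than exact values.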

42

TECHNIQUES FOR TESTING LARGE AND COMPLEX SOFTWARE Tsong Chen1 1. Swinburne University of Technology, Hawthorn, VIC, Australia Testing bioinformatics programs poses a particular challenge because it is often hard to systematically verify the correctness of the output of a program due to the nature of bioinformatics programs. In the software testing field, a number of effective methods have been developed to tackle this problem. In my talk, I will explain and discuss some of these state-of-the-art methods.

43

VERTICALLY INTEGRATED MULTI-LAYERED OMICS DATA FOR BIOMARKER DISCOVERY Jean Yang1 1. University of Sydney, Sydney, NSW, Australia Over the last decade, several statistical techniques have been proposed to tackle genome-wide expression data. However, with the advancement of many other high-throughput biotechnologies, the interest of researchers has shifted to utilizing multiple data sources together with clinical data to improve the prognosis of disease outcome. Integrating the components from different platforms has become a crucial step to better understand the relationships between clinical and -omics data and the information they provide to classify some response. The statistical task of preserving the stability and interpretability of the classifier has become more challenging in this framework. One major issue is that the large dimension of -omics data can completely dominate the modelling procedure, and it is an open question how best to combine different types of variables. This talk will present our most recent results on improving upon standard classification procedures for metastatic melanoma cancer data. We will use a two-stage process that involves simultaneously changing the number of observations and features to classify different disease outcomes well.

44

SUPPORTING TRAINERS TO IMPROVE BIOINFORMATICS EDUCATION GLOBALLY Michelle Brazas1 1. Ontario Institute for Cancer Research, Ontario, Canada A needs assessment isn’t necessary to realize that across the globe, there is a high demand for quality bioinformatics training in all domains of life science. Delivering on this demand however is not trivial. In addition to computational infrastructure and software tools, quality bioinformatics training depends upon excellent trainers and training resources. With a focus on the trainer in the learning equation, the Global Organization for Bioinformatics Learning, Education and Training (GOBLET) aims to facilitate the advancement of bioinformatics education globally by training and supporting a network of bioinformatics trainers. Activities include coordinating training efforts, sharing data sets and teaching materials, discussing best practices and building up teaching standards and teaching recognition. Examples to improve your bioinformatics training programs will be provided. Through support and development of trainer excellence, GOBLET is working to improve the global landscape in bioinformatics education.

45

AN AUTOMATIC PIPELINE TO FIND AND ANNOTATE RARE SUBCLONAL SOMATIC VARIANTS IN A PAIRED TUMOR/NORMAL SAMPLE Henry Yun Herng Wang1 1. QIAGEN, Taipei, Taiwan Identifying and characterizing somatic variants in deep genome sequence data from tumor samples remains challenging and time-consuming. Of special interest in cancer research and diagnostics is the detection and annotation of rare subclonal somatic variants found only in a small proportion of primary tumor cells. Such variants can drive tumor spread and recurrence, but are often neglected in choosing treatments. Currently, few tools reliably distinguish such rare subclonal variants from sequencing errors. And even among real somatic variants, drivers (of tumor growth, spread, or resistance) are hard to distinguish from passengers. Doing so entails integrating diverse information on variants, genes, pathways, cancer-relevant phenotypes, and treatments (including insights on population allele frequencies and broader evolutionary conservation; known/likely effects on gene product structure, function, expression, and interaction; and relations among gene products, phenotypes, and drugs). Software for effectively integrating such data in light of genomic variation in samples, to highlight relevant findings through clear visualization, has been a pressing need. Here we present an end-to-end analysis workflow for finding and functionally characterizing rare subclonal variants, using the newly developed CLC Cancer Research Workbench to feed the interpretive platform of Ingenuity Variant Analysis, to identify cancer driver mutations in paired tumor/normal samples. We will show new interesting results from this analysis, which were not shown beforehand on this publicly available cancer dataset (Case Reports in Oncological Medicine, Volume 2013 (2013), Article ID 270362) from a patient with massive acinic cell carcinoma.

46

PACBIO SINGLE MOLECULE LONG-READ SEQUENCING: Applications and Bioinformatic Tools Siddarth Singh1 1. Pacific Biosciences, Menlo Park, CA, United States PacBio SMRT technology is capable of generating extremely long DNA sequencing reads (on average 8,500bp) with no systematic error, no sequence context bias, and no requirement for amplification. These unique properties allow PacBio to access regions of the genome - and areas of biology - previously unattainable. In this seminar, Dr Singh will give a scientific account of his hands-on experience with PacBio data from projects completed in the Asia Pacific region. The focus will be on bioinformatics tools and new approaches for dealing with long-read sequencing data. Key applications include De Novo Assembly / Genome Finishing, Full-length Transcript Sequencing (IsoSeq), Repeat Expansions and Structural Variation, Haplotype Phasing, Epigenome Analysis & Base Modification Detection. Dr Siddarth Singh is a senior scientist at PacBio responsible for overseeing informatics and application development across the Asia Pacific region. Dr Singh has many years of experience in Next Generation Sequencing analysis across multiple applications and platforms. Dr Singh has a PhD in Bioinformatics from Devi Ahilya University, India.

47


NEXT GENERATION DATA INDEPENDENT ANALYSIS : SWATH 2.0 Sarah Reed1 1. AB SCIEX, Australia & New Zealand, VIC, Australia Not available at time of print

48

THE FUNDAMENTAL TRADEOFF IN GENOMES AND PROTEOMES OF PROKARYOTES ESTABLISHED BY THE GENETIC CODE, CODON ENTROPY, AND PHYSICS OF NUCLEIC ACIDS AND PROTEINS Igor N Berezovsky1 1. Bioinformatics Institute, Singapore Diversity of extreme environments, phylogeny, and life styles of prokaryotes are reflected in nucleotide compositions of their genomes and amino acid compositions of their proteomes. Despite significant efforts, the causal relationship between these compositions remains unresolved. While the genetic code inherently bridges the realms of nucleic and amino acids, the rules of the mutual adjustment of nucleotide and amino acid compositions have not yet been established. We discovered a fundamental tradeoff, which analytically describes mutual adjustment of compositions and its effect on the mutational biases. The tradeoff is determined by the interplay between the genetic code, optimization of the codon entropy, and demands on the structure and stability of nucleic acids and proteins. In particular, an increase of the purine load in genomes with low GC content changes the balance between two major determinants of the double-stranded DNA stability: base pairing and base stacking interactions. Simulations yield an increasing proportion of nonsynonymous mutations in genomes with low GC. Corresponding amino acid substitutions result, however, mostly in changes into chemically similar amino acids. As a result, stability of the nucleic acids is maintained, while stability of the encoded proteins is not compromised. The tradeoff is a unifying property of all prokaryotes regardless of differences in their phylogenies, life styles, and extreme environments. It provides a foundation for the work of natural selection and underlies mutational biases characteristic for genomes with skewed GC compositions.

49

Whole genome sequence and analysis of the Marwari horse breed and its genetic origin JeHoon Jun1, Yun Sung Cho1, Haejin Hu1, Hak-Min Kim1, Sungwoong Jho1, Priyvrat Gadhvi1, Kyung Mi Park2, Jeongheui Lim3, Woon Kee Paek3, Kyudong Han4,5, Andrea Manica6, Jeremy S Edwards7, Jong Bhak2,8,9,10 1. Personal Genomics Institute, Genome Research Foundation, Suwon, Republic of Korea 2. Theragen BiO Institute, TheragenEtex, Suwon, Republic of Korea 3. National Science Museum, Daejeon, Republic of Korea 4. Department of Nanobiomedical Science & BK21 PLUS NBM Global Research Center for Regenerative Medicine, Dankook University, Cheonan, Republic of Korea 5. DKU-Theragen institute for NGS analysis (DTiNa), TheragenEtex, Cheonan, Republic of Korea 6. Evolutionary Ecology Group, Department of Zoology, University of Cambridge, Cambridge, UK 7. Department of Chemistry and Chemical Biology, Department of Molecular Genetics and Microbiology, Department of Chemical and Nuclear Engineering, Cancer Research and Treatment Center, University of New Mexico, Albuquerque, NM, USA 8. Personal Genomics Institute, Genome Research Foundation, Suwon, Korea 9. Advanced Institutes of Convergence Technology Nano Science and Technology, Suwon, Republic of Korea 10. Program in Nano Science and Technology, Department of Transdisciplinary Studies, Seoul National University, Suwon, Republic of Korea

Background The horse (Equus ferus caballus) is one of the earliest domesticated species and has played numerous important roles in human societies over the past 5,000 years. In this study, we characterized the genome of the Marwari horse, a rare breed with certain unique characteristics such as inwardly turned ear tips. The breed is thought to have arisen from breeding local Indian ponies with Arabian horses beginning in the 12th century. Results We generated 101 Gb (~30× coverage) of whole genome sequences from a Marwari horse using the Illumina HiSeq2000 sequencer. The sequences were mapped to the horse reference genome at a mapping rate of ~98% and with ~95% of the genome having at least 10× coverage. A total of 5.9 million single nucleotide variations and 0.6 million small insertions or deletions were identified. We confirmed a strong Arabian and Mongolian component in the Marwari genome. Novel variants from the Marwari sequences were annotated, and were found to be enriched in olfactory functions. Additionally, we suggest a potential functional genetic variant in the TSHZ1 gene (p.Ala344>Val) associated with the inward-turning ear tip shape of the Marwari horses. Conclusions Here, we present an analysis of the Marwari horse genome. This is the first genomic data for an Asian breed, and is an invaluable resource for future studies of genetic variation associated with phenotypes and diseases in horses.


50

DE NOVO ASSEMBLY OF THE COMPLETE SEQUENCE AND COMPARATIVE ANALYSIS OF THE CHLOROPLAST GENOME OF MACADAMIA INTEGRIFOLIA (PROTEACEAE) A K M Abdul Baten1, Catherine J Nock1, Graham J King1 1. Southern Cross University, Lismore, NSW, Australia

Background Sequence data from the chloroplast genome have played a central role in elucidating the evolutionary history of flowering plants, Angiospermae. In the past decade, the number of complete chloroplast genomes has burgeoned, leading to well-supported angiosperm phylogenies. However, some relationships, particularly among early-diverging lineages, remain unresolved. The diverse Southern Hemisphere plant family Proteaceae arose in Gondwanan rainforests early in angiosperm history and is a model group for adaptive radiation in response to changing climatic conditions. Genomic resources for the family are limited, and until now it has been one of the few early-diverging ‘basal eudicot’ lineages not represented in chloroplast phylogenomic analyses.

Results The chloroplast genome of the Australian nut crop tree Macadamia integrifolia was assembled de novo from Illumina paired-end sequence reads. Three contigs, corresponding to a collapsed inverted repeat, a large and a small single copy region were identified, and used for genome reconstruction. The complete genome is 159,714bp in length and was assembled at deep coverage (3.29 million reads; ~2000 x). Phylogenetic analysis based on an 83-gene and 87-taxa alignment, the largest sequence-rich dataset to include the basal eudicot family Proteaceae, provided strong support for a Proteales clade that includes Macadamia, Platanus and Nelumbo. Genome structure and content followed the ancestral angiosperm pattern and was highly conserved in the Proteales, whilst size differences were largely explained by the relative contraction of the single copy regions and expansion of the inverted repeats in Macadamia.

Conclusions The Macadamia chloroplast genome presented here is the first in the Proteaceae, and confirms placement of this basal eudicot family within the order Proteales. It provides a high-quality reference genome for future evolutionary studies and will be of benefit for taxon-rich phylogenomic analyses aimed at resolving relationships among early-diverging angiosperms, and more broadly across the plant tree of life.

51

VERIFICATION AND VALIDATION OF BIOINFORMATICS SOFTWARE WITHOUT A GOLD STANDARD: A CASE STUDY OF BWA AND BOWTIE Eleni Giannoulatou1, Shin-Ho Park1, David Humphreys1, Joshua WK Ho1 1. Victor Chang Cardiac Research Institute, Darlinghurst, NSW, Australia Background: Bioinformatics software quality assurance is essential in genomic medicine. Systematic verification and validation of bioinformatics software is difficult because it is often not possible to obtain a realistic “gold standard” for systematic evaluation. Here we apply a technique that originates from the software testing literature, namely Metamorphic Testing (MT), to systematically test three widely used short read sequence alignment programs. Results: MT alleviates the problems associated with the lack of a gold standard by checking that the results from multiple executions of a program satisfy a set of expected or desirable properties that can be derived from the software specification or user expectations. We tested BWA, Bowtie and Bowtie2 using simulated data and one HapMap dataset. It is interesting to observe that multiple executions of the same aligner using a slightly modified input FASTQ sequence file, such as one with randomly re-ordered reads, may yield different alignment results. Furthermore, we found that the list of variant calls can be affected unless strict quality control is applied during variant calling. Conclusion: Thorough testing of bioinformatics software is important in delivering clinical genomic medicine. This paper demonstrates a different framework for testing a program, one that involves checking its properties, thus greatly expanding the number and repertoire of test cases we can apply in practice.
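The read re-ordering check described in this abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_align` and `capped_align` are hypothetical stand-ins, not BWA or Bowtie, and the sketch uses a deterministic reversal of the reads where the abstract describes random re-ordering. It only shows the shape of a metamorphic relation: run the program twice on related inputs and compare the outputs, without needing a gold standard.

```python
def toy_align(reference, reads):
    """Hypothetical stand-in for an aligner: map each read name to the
    first exact-match position in the reference (-1 if unmapped)."""
    return {name: reference.find(seq) for name, seq in reads}


def capped_align(reference, reads):
    """Deliberately order-sensitive toy: only the first two reads in the
    input are processed, so the result depends on read order."""
    return toy_align(reference, reads[:2])


def relation_holds(align, reference, reads, permuted_reads):
    """Metamorphic relation: re-ordering the input reads must not change
    the alignment reported for any individual read."""
    return align(reference, reads) == align(reference, permuted_reads)


# Toy data, invented for this example.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = [("r1", "ACGT"), ("r2", "TAGC"), ("r3", "GGCT"), ("r4", "TTTT")]
permuted = list(reversed(reads))  # a deterministic re-ordering of the reads

print("toy_align satisfies relation:",
      relation_holds(toy_align, reference, reads, permuted))
print("capped_align satisfies relation:",
      relation_holds(capped_align, reference, reads, permuted))
```

A violation of the relation does not say which of the two outputs is wrong, only that the program's behaviour depends on something it should not depend on.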


52

FUNRNA: A FUNGI-CENTERED GENOMICS PLATFORM FOR GENES ENCODING KEY COMPONENTS OF RNAI Jaeyoung Choi1, Ki-Tae Kim1, Jongbum Jeon1, Jiayao Wu2, Hyeunjeong Song1, Fred O Asiegbu2, Yong-Hwan Lee1 1. Seoul National University, Seoul, South Korea 2. University of Helsinki, Helsinki, Finland Background RNA interference (RNAi) is involved in genome defence and diverse cellular, developmental, and physiological processes. Key components of RNAi are Argonaute, Dicer, and RNA-dependent RNA Polymerase (RdRP), which have been functionally characterized mainly in model organisms. The key components are believed to exist throughout the eukaryotes; however, no systematic platform is present for archiving and dissecting these important gene families. In addition, only a few fungi have been studied to date, limiting our understanding of RNAi in fungi. Here we present funRNA (http://funrna.riceblast.snu.ac.kr/), a fungal kingdom-wide genomics platform for putative genes encoding Argonaute, Dicer, and RdRP. Description To identify and archive the genes encoding the key components, protein domain profiles were determined from the characterized sequences. The domain profiles were searched on fungal, metazoan, and plant genomes as well as bacterial and archaeal ones. In total, 1,163 Argonaute, 442 Dicer, and 678 RdRP-encoding genes were predicted. Based on the identification results, active site variation of Argonautes, diversification of Dicers, and structural conservation of RdRP were discussed in a fungi-oriented manner. funRNA provides results from diverse bioinformatics programs and job submission forms for BLAST, BLASTMatrix, and ClustalW. Furthermore, sequence collections created in funRNA are synced with several family analysis portals and databases, offering further analysis opportunities. Conclusions By providing the identification results from a broad range of taxonomy and diverse analysis functions, funRNA could be used in diverse comparative and evolutionary studies. Therefore, funRNA would serve as a versatile genomics workbench for key components of RNAi.

53

REVEALING EDITING AND SNPS OF MICRORNAS IN COLON TISSUES BY ANALYZING HIGH-THROUGHPUT SEQUENCING PROFILES OF SMALL RNAS Yun Zheng1, Ting Li1, Ren Ren2, Donghua Shie3, Shengpeng Wang1 1. Faculty of Life Science and Technology, Kunming University of Science and Technology, Kunming, Yunnan, China 2. School of Life Sciences, Fudan University, Shanghai, China 3. Zhongshan Hospital, Fudan University, Shanghai, China Editing and mutations in microRNAs (miRNAs) can change the stability of pre-miRNAs and/or complementarities between miRNAs and their targets. Small RNA (sRNA) high-throughput sequencing (HTS) profiles contain miRNAs that originate from mutated DNA or are edited during their biogenesis. It is largely unknown whether miRNAs are edited in colon tissues, since existing studies have mainly focused on the editing of miRNAs in brain tissues. Through comprehensive analysis of four high-throughput sequencing profiles of normal and cancerous colon tissues, we identified 548 editing sites and/or SNPs in miRNAs that are significant in at least one of the sequencing profiles used. Our results show that the most abundant editing events of miRNAs in colon tissues are 3’-A and 3’-U. In addition to four known A-to-I editing sites previously reported in brain tissues, four novel A-to-I editing sites are also identified in colon tissues. This suggests that A-to-I editing of miRNAs is potentially a common mechanism in different tissues to diversify the possible functional roles of miRNAs, but only a small portion of different miRNAs are edited by the A-to-I mechanism at a significant level. Our results suggest that there are other types of editing in miRNAs through unknown mechanisms. Furthermore, several SNPs in miRNAs are also identified.


54

STRATEGIES FOR COMBINING MULTI OMICS DATA Hagen Meckel1, Michael Kohl1, Martin Eisenacher1, Katrin Marcus1 1. Medical Proteom-Center, Bochum, Germany Multi-OMICS approaches aim to measure the dynamics of the most important biomolecules (e.g. genes, mRNAs, proteins and metabolites) in order to gain better understanding of the complex regulation of a cell. Combining data from different platforms provides comprehensive insights into biological processes. Furthermore, in biomedical research such an approach offers great advantages for the identification and characterization of disease-related processes, biomarkers and drug targets. We introduce a miRNA, mRNA, protein data processing workflow as well as strategies and methods to explore and to integrate different OMICS datasets, mainly focusing on miRNA, mRNA and protein analyses. Our strategies include on the one hand a joint classification concept where selected features of all available OMICS datasets are considered to improve classification accuracy with the aim of generating a ‘cross-omics-biomarker-panel’. On the other hand we focus on strategies to reveal the relationship between miRNA, mRNA and proteins, preferably measured from the same patient samples, in order to analyze translational regulation effects and to gain better biological insights. These strategies include, among other things, the analysis of effects of miRNA target relations and their influence on the protein level as well as the integration of subcellular localizations and functional annotations. A challenging research question will be the quantitative estimation of the effect of miRNA changes on mRNA translation.

55

COMPUTATIONAL ANALYSIS OF THE ROLES OF ER-GOLGI NETWORK IN THE CELL CYCLE Haijun Gong1, Lu Feng1 1. Saint Louis University, St. Louis, MO, United States Background: ER-Golgi network plays an important role in the processing, sorting and transport of proteins, and it’s also a site for many signaling pathways that regulate the cell cycle. Accumulating evidence suggests that the stressed ER and malfunction of the Golgi apparatus are associated with the pathogenesis of cancer and Alzheimer’s disease (AD). Moreover, our previous work discovered and verified that altering the expression levels of target SNARE and GEF could modulate the size of the Golgi apparatus. Because the Golgi’s size changes dramatically during the development of several diseases, it is important to investigate the roles of the ER-Golgi network in cell cycle progression and in some diseases. Results: In this work, we develop a computational model to systematically investigate the ER stress-induced and Golgi-related apoptosis-survival signaling pathways. Then, we propose and apply both Synchronous and Asynchronous Model Checking methods, which extend our previous verification technique, to automatically and formally analyze different signaling pathways regulated by ER-Golgi network and identify important regulatory components in the cell cycle progression through verifying some computation tree temporal logic (CTL) formulas. Our technique has advantages for large network verification (it can check up to 10^100 possible states in minutes) over traditional methods. Conclusion: The proposed Asynchronous and Synchronous Symbolic Model Checkers verified several temporal and dynamic properties related to cancer and Alzheimer's disease. We also identified some signaling components in the ER-Golgi network, including the NFκB, IKK, ATF4, ASK1 and TRAF2, which might be key players in the pathogenesis of cancer and AD. Our studies indicate that the ER stress-induced and Golgi-related pathways might serve as potent therapeutic targets for cancer and Alzheimer's disease, and the crosstalk among different signaling pathways may be responsible for the pathogenesis of AD and cancer even if some pathways are blocked by certain single-gene targeted therapies.


56

AN EFFICIENT METHOD FOR OBSERVABILITY OF SINGLETON ATTRACTORS IN BOOLEAN NETWORKS Yushan Qiu1, Xiaoqing Cheng1, Wenpin Hou1, Wai-Ki Ching1 1. The University of Hong Kong, Hong Kong, China Boolean network (BN) is a popular mathematical model for studying genetic regulatory networks and its observability plays a vital role in understanding the underlying network. Several research works have been done on observability of BNs and complex networks. However, observability of attractor cycles is not yet fully addressed in the literature and it is a challenging issue. In this paper, we propose a novel problem on the observability of attractors in a BN. Identification of the minimum set of contiguous nodes that can determine which attractor cycle the system belongs to can serve as a biomarker for different disease types (different attractors). Thus, detection of the minimum set plays a significant role in the study of signaling networks. We propose a novel method for solving the problem in O(n) time where n is the number of genes in the network. Furthermore, computational experiments are conducted to demonstrate both the efficiency and the effectiveness of our proposed method for the captured observability problem.
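To make the problem statement in this abstract concrete, the following Python sketch enumerates the singleton attractors (fixed points) of a tiny Boolean network and then searches for the shortest run of contiguous nodes whose values tell those attractors apart. It is a brute-force illustration only, exponential in the number of nodes, and is not the O(n) method the authors propose. The three-node network, the function names and the update rules are all invented for this example.

```python
from itertools import product


def singleton_attractors(update_rules):
    """Return every state s with F(s) == s, by exhaustive enumeration.
    update_rules[i] maps the full state tuple to the next value of node i."""
    n = len(update_rules)
    return [s for s in product((0, 1), repeat=n)
            if tuple(rule(s) for rule in update_rules) == s]


def min_contiguous_observer(attractors):
    """Find the shortest window of consecutive node indices whose values
    are pairwise different across all attractors.
    Returns (start_index, width), or None if there are no attractors."""
    if not attractors:
        return None
    n = len(attractors[0])
    for width in range(1, n + 1):
        for start in range(n - width + 1):
            views = {a[start:start + width] for a in attractors}
            if len(views) == len(attractors):
                return start, width
    return None


# Hypothetical 3-node network: x0' = x0, x1' = x1 AND x0, x2' = x0.
rules = [
    lambda s: s[0],
    lambda s: s[1] & s[0],
    lambda s: s[0],
]

attractors = singleton_attractors(rules)
observer = min_contiguous_observer(attractors)
print("singleton attractors:", attractors)
print("smallest distinguishing window (start, width):", observer)
```

In this toy network no single node separates the three fixed points, so two adjacent nodes have to be observed together.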

57

EFFECTS OF DOWNSTREAM GENES ON SYNTHETIC GENETIC CIRCUITS Takefumi Moriya1, Masayuki Yamamura1, Daisuke Kiga1 1. Tokyo Institute of Technology, Yokohama, Japan In order to understand and regulate complex genetic networks in living cells, it is important to build simple and well-defined genetic circuits. We designed such circuits using a synthetic biology approach that included mathematical modeling and simulation, with a focus on the effects by which downstream reporter genes are involved in the regulation of synthetic genetic circuits. Our results indicated that downstream genes exert two main effects on genes involved in the regulation of synthetic genetic circuits: (1) competition for regulatory proteins and (2) protein degradation in the cell. Our findings regarding the effects of downstream genes on regulatory genes and the role of impedance in driving large-scale and complex genetic circuits may facilitate the design of more accurate genetic circuits. This design will have wide applications in future studies of systems and synthetic biology.

58

dCAP: DETECTING DIFFERENTIAL BINDING EVENTS IN MULTIPLE CONDITIONS AND PROTEINS Kuan-Bei Chen1, Ross Hardison1, Yu Zhang1 1. Penn State U, State College, PA, United States Background: Current ChIP-seq studies are interested in comparing multiple epigenetic profiles across several cell types and tissues simultaneously for studying constitutive and differential regulation. Simultaneous analysis of multiple epigenetic features in many samples can gain substantially more power and specificity than analyzing individual features and/or samples separately. Yet there are currently few tools that can perform joint inference of constitutive and differential regulation in multi-feature-multi-condition contexts with statistical testing. Existing tools either test regulatory variation for one factor in multiple samples at a time, or for multiple factors in one or two samples. Many of them only identify binary rather than quantitative variation, which makes them sensitive to threshold choices. Results: We propose a novel and powerful method called dCaP for simultaneously detecting constitutive and differential regulation of multiple epigenetic factors in multiple samples. Using simulation, we demonstrate the superior power of dCaP compared to existing methods. We then apply dCaP to two datasets from human and mouse ENCODE projects to demonstrate its utility. We show in the human dataset that the cell-type specific regulatory loci detected by dCaP are significantly enriched near genes with cell-type specific functions and disease relevance. We further show in the mouse dataset that dCaP captures genomic regions showing significant signal variations for TAL1 occupancy between two mouse erythroid cell lines. The novel TAL1 occupancy loci detected only by dCaP are highly enriched with GATA1 occupancy and differential gene expression, while those detected only by other methods are not. Conclusions: Here, we developed a novel approach that utilizes the cooperative property of proteins to detect differential binding in multivariate ChIP-seq samples with better power, aiming to complement existing approaches and provide new insights for method development in this field.


59

DRUG-INDUCED TOXICITY PREDICTION FOR MULTI-ORGAN PATHOLOGICAL FINDINGS BASED ON INTEGRATIVE MODEL OF GENE EXPRESSION DATA Jinwoo Kim1, Miyoung Shin1 1. Bio-Intelligence & Data Mining Lab, School of Electronics Engineering, Kyoungpook National University, Daegu, Korea Publish consent withheld

60

NOVEL SNP IMPROVES DIFFERENTIAL SURVIVABILITY AND MORTALITY IN NON-SMALL CELL LUNG CANCER PATIENTS Tzia Liang Mah1, Xin Ning Adeline Yap1, Vachiranee Limviphuvadh2, Nanpu Li1, Srinath Sridharan1, Vellaisemy Kuralmani1, Mengling Feng1, Natalia Liem3, Sharmila Adhikari2, Wei Peng Yong3, Ross A Soo3, Sebastian Maurer-Stroh2, Frank Eisenhaber2, Joo Chuan Tong4 1. Institute for Infocomm Research, Singapore 2. Bioinformatics Institute, Singapore 3. National University Health System, Singapore 4. Institute of High Performance Computing, Singapore Background: Non-small cell lung cancer (NSCLC) is a major cause of cancer-related death worldwide due to poor patient prognosis and clinical outcome. Here, we studied the genetic variations underlying NSCLC pathogenesis based on their association to patient outcome after gemcitabine therapy. Methods: Bioinformatics analysis was used to investigate possible effects of POLA2 G583R (POLA2+1747=GG/GA, dbSNP ID: rs487989) in terms of protein function. Using biostatistics, POLA2+1747=GG/GA (rs487989, POLA2 G583R) was identified as strongly associated with mortality rate and survival time among NSCLC patients. Results: It was also shown that POLA2+1747=GG/GA is functionally significant for protein localization via green fluorescent protein (GFP)-tagging and confocal laser scanning microscopy analysis. The single nucleotide polymorphism (SNP) causes DNA polymerase alpha subunit B to localize in the cytoplasm instead of the nucleus. This inhibits DNA replication in cancer cells and confers a protective effect in individuals with this SNP. Conclusions: The results suggest that POLA2+1747=GG/GA may be used as a prognostic biomarker of patient outcome in NSCLC pathogenesis.


61

GENETIC ALGORITHM WITH LOGISTIC REGRESSION FOR DIAGNOSIS AND PROGNOSIS OF ALZHEIMER’S DISEASE Ping Zhang1, Piers Johnson1, Luke Vandewater1, William Wilson2, Paul Maruff3, Greg Savage4, Petra Graham4, Lance Macaulay5, Kathryn A Ellis6, Cassandra Szoeke6, Ralph Martins7, Christopher Rowe6, Colin Masters6, David Ames6 1. CCI, CSIRO, Marsfield, NSW, Australia 2. CCI, CSIRO, North Ryde, NSW, Australia 3. Cogstate Ltd, Melbourne, VIC, Australia 4. Macquarie University, North Ryde, NSW, Australia 5. CMSE, CSIRO, Parkville, VIC, Australia 6. The , Parkville, VIC, Australia 7. Edith Cowan University, Perth, WA, Australia

Background Assessment of risk and early diagnosis of Alzheimer’s disease (AD) is key to its prevention or to slowing the progression of the disease. Previous research on risk factors for AD typically utilizes statistical comparison tests or stepwise selection with regression models. Outcomes of these methods tend to emphasize single risk factors rather than a combination of risk factors. However, a combination of factors, rather than any one alone, is likely to affect disease development. Genetic algorithms (GA) can be useful and efficient for searching for the combination of variables that gives the best result (e.g. accuracy of diagnosis), especially when the search space is large, complex or poorly understood, as is the case in prediction of AD development.

Results GA in combination with logistic regression (LR) was used for finding one or more sets of neuropsychological tests which can best predict the progression to AD. Data from the Australian Imaging, Biomarkers & Lifestyle (AIBL) Study of Ageing with 36 months follow up were examined. A set of 37 neuropsychological variables including depression and anxiety measures was used for identifying the best subsets for prediction of conversion from healthy to mild cognitive impairment (MCI) or AD and for conversion from MCI to AD. Multiple sets of neuropsychological variables were identified by GA to best predict conversions between clinical categories, with a cross validated AUC of 0.90 for prediction of HC conversion to MCI/AD and 0.86 for MCI conversion to AD within 36 months.

Conclusions This study showed the potential of GA application in the neuroscience area. It demonstrated that the combination of the variables is superior in performance to the use of single significant variables for prediction of progression of disease. Variables more frequently selected by GA might be more important as part of the algorithm for prediction of disease development.

62

IDENTIFICATION OF NOVEL THERAPEUTICS FOR COMPLEX DISEASES FROM GENOME-WIDE ASSOCIATION DATA Mani Grover1, Merridee Wouters1 1. Deakin University, Geelong Waurn Pond, VIC, Australia

Background Human genome sequencing has enabled the association of phenotypes with genetic loci, but our ability to effectively translate this data to the clinic has not kept pace. In silico tools such as candidate gene prediction systems allow rapid identification of disease genes by identifying the most probable candidate genes linked to genetic markers of the disease or phenotype under investigation. Integration of drug-target data with candidate gene prediction systems can identify novel phenotypes which may benefit from current therapeutics. Such a drug repositioning tool can save valuable time and money spent on preclinical studies and phase I clinical trials. Methods We previously used Gentrepid (www.gentrepid.org) as a platform to predict 1,497 candidate genes for the seven complex diseases considered in the Wellcome Trust Case-Control Consortium genome-wide association study; namely Type 2 Diabetes, Bipolar Disorder, Crohn’s Disease, Hypertension, Type 1 Diabetes, Coronary Artery Disease and Rheumatoid Arthritis. Here, we adopted a simple approach to integrate drug data from three publicly available drug databases: the Therapeutic Target Database, the Pharmacogenomics Knowledgebase and DrugBank; with candidate gene predictions from Gentrepid at the systems level. Results Using the publicly available drug databases as sources of drug-target association data, we identified a total of 428 candidate genes as novel therapeutic targets for the seven phenotypes of interest, and 2,130 drugs feasible for repositioning against the predicted novel targets. Conclusions


By integrating genetic, bioinformatic and drug data, we have demonstrated that currently available drugs may be repositioned as novel therapeutics for the seven diseases studied here, quickly taking advantage of prior work in pharmaceutics to translate ground-breaking results in genetics to clinical treatments.

63

PREDICTING HOST TROPISM OF INFLUENZA A VIRUS PROTEINS USING RANDOM FOREST Christine LP Eng1, Joo Chuan Tong1,2, Tin Wee Tan1 1. Department of Biochemistry, National University of Singapore, Singapore 2. Institute of High Performance Computing, Singapore Background: The majority of influenza A viruses reside and circulate among animal populations, seldom infecting humans due to host range restriction. Yet when some avian strains do acquire the ability to overcome the species barrier, they might become adapted to humans, replicating efficiently and causing disease, potentially leading to a pandemic. With the huge influenza A virus reservoir in wild birds, it is a cause for concern when a new influenza strain emerges with the ability to cross the host species barrier, as shown by the recent H7N9 outbreak in China. Several influenza proteins have been shown to be major determinants of host tropism. Further understanding and determining host tropism would be important in identifying zoonotic influenza virus strains capable of crossing the species barrier and infecting humans. Results: In this study, computational models for 11 influenza proteins were constructed using the random forest machine learning algorithm for the prediction of host tropism. The prediction models were trained on influenza protein sequences isolated from both avian and human samples, which were transformed into feature vectors of amino acid physicochemical properties. The resulting prediction models were highly accurate (ACC>96.57; AUC>0.980; MCC>0.916) and capable of determining the host tropism of individual influenza proteins. In addition, features from all 11 proteins were used to construct a combined model to predict host tropism of influenza virus strains. This would help assess a novel influenza strain’s host range capability. Conclusions: All of the prediction models constructed achieved high prediction performance, indicating clear distinctions between avian and human influenza proteins. 
Understanding and predicting host tropism of influenza proteins lays an important foundation for future work in constructing computational models capable of predicting interspecies transmission of influenza viruses. The prediction models are available at http://flupred.bic.nus.edu.sg.
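The abstract above reports accuracy (ACC) and the Matthews correlation coefficient (MCC) without stating how they were computed. As a minimal sketch, the standard definitions for a binary avian/human classifier can be written from confusion-matrix counts. The function names are illustrative and are not taken from the authors' software.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to 1.

    By convention (an assumption made here) 0.0 is returned when any
    marginal total is zero and the coefficient is undefined.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike accuracy, MCC stays near zero for a classifier that simply predicts the majority class, which is why it is often reported alongside ACC when class sizes differ.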

64

SUPERVISED PREDICTION OF DRUG-INDUCED NEPHROTOXICITY BASED ON INTERLEUKIN-6 AND -8 EXPRESSION LEVELS Ran Su1, Yao Li2, Daniele Zink2, Lit-Hsin Loo1 1. Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), Singapore 2. Institute of Bioengineering and Nanotechnology, Agency for Science, Technology and Research (A*STAR), Singapore Drug-induced nephrotoxicity causes acute kidney injury and chronic kidney diseases, and is a major reason for late-stage failures in the clinical trials of new drugs. Therefore, early, pre-clinical prediction of nephrotoxicity could help to prioritize drug candidates for further evaluation, and increase the success rates of clinical trials. Recently, an in vitro model for predicting renal-proximal-tubular-cell (PTC) toxicity based on the expression levels of two inflammatory markers, interleukin (IL)-6 and -8, has been described. However, this and other existing models mostly use linear and manually determined thresholds to predict nephrotoxicity. Here, we report a systematic comparison of the performances of three supervised classifiers, namely support vector machine (SVM), k-nearest-neighbor and naive Bayes classifiers, in predicting PTC toxicity based on IL-6 and -8 expression levels. Using a dataset of human primary PTCs treated with 41 well-characterized compounds that are toxic or not toxic to PTC, we found that SVM classifiers based on radial-basis-function kernels have the highest cross-validated classification performance (mean accuracy=83.05%, sensitivity=83.29%, and specificity=82.78%). Furthermore, we also found that IL-8 is more predictive than IL-6, but a combination of both markers gives higher classification accuracies. Finally, we also show that our SVM classifiers trained automatically on the whole dataset have higher mean accuracy than a previous threshold-based classifier constructed for the same dataset (92.52% vs. 81.80%). 
Our results suggest that an SVM classifier based on these two markers can be used to automatically predict drug-induced PTC toxicity.


65

BIOINFORMATIC ANALYSIS OF CIS-REGULATORY INTERACTIONS BETWEEN PROGESTERONE AND ESTROGEN RECEPTORS IN BREAST CANCER Matloob Khushi1,2, Christine Clarke2, Dinny Graham2 1. Children's Medical Research Institute, Westmead, NSW, Australia 2. Westmead Institute for Cancer Research, Sydney Medical School, University of Sydney, Westmead, NSW, Australia Chromatin factors interact with each other in a cell- and sequence-specific manner in order to regulate transcription, and a wealth of publicly available datasets exists describing the genomic locations of these interactions. Our recently published BiSA (Binding Sites Analyser) database contains transcription factor binding locations and epigenetic modifications collected from published studies and provides tools to analyse stored and imported data. Using BiSA we investigated the overlapping cis-regulatory role of estrogen receptor alpha (ERα) and progesterone receptor (PR) in the T-47D breast cancer cell line. We found that ERα binding sites overlap with a subset of PR binding sites. To investigate further, we re-analysed raw data to remove any biases introduced by the use of distinct tools in the original publications. We identified 22,152 PR and 18,560 ERα binding sites (<5% false discovery rate) with 4,358 overlapping regions between the two datasets. BiSA statistical analysis revealed a non-significant overall overlap correlation between the two factors, suggesting that ERα and PR are not partner factors and do not require each other for binding to occur. However, one quarter of ERα binding sites overlapped with PR binding sites, suggesting a biologically significant interaction on specific DNA regions. Motif analysis revealed that the shared binding regions were enriched with binding motifs for ERα, PR and a number of other transcription and pioneer factors. Some of these factors are known to co-locate with ERα and PR binding. 
Our data suggest that ERα and PR, in general, function independently at the molecular level, but that their activities converge on a specific subset of transcriptional targets.
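The overlap figures in this abstract come from comparing two sets of genomic intervals. The sketch below shows one simple way such a count could be made. It is not the BiSA implementation, and the `(chrom, start, end)` tuple layout with half-open coordinates is an assumption chosen for illustration. For orientation, 4,358 out of 18,560 is about 23%, which is roughly the "one quarter" quoted above if each overlapping region corresponds to one ERα site.

```python
from collections import defaultdict


def count_overlapping(sites_a, sites_b):
    """Count sites in sites_a that overlap at least one site in sites_b.

    Each site is a (chrom, start, end) tuple with half-open coordinates
    [start, end). This layout is a hypothetical choice for the sketch.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in sites_b:
        by_chrom[chrom].append((start, end))

    count = 0
    for chrom, start, end in sites_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, [])):
            count += 1
    return count
```

The pairwise scan is quadratic per chromosome. For tens of thousands of sites a sorted sweep or an interval tree would be the usual replacement.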

66

RULE DISCOVERY AND DISTANCE SEPARATION TO DETECT RELIABLE MIRNA BIOMARKERS FOR THE DIAGNOSIS OF LUNG SQUAMOUS CELL CARCINOMA Renhua Song1, Qian Liu1, Gyorgy Hutvagner1, Hung Nguyen1, Ramamohanarao Kotagiri2, Limsoon Wong3, Jinyan Li1 1. University of Technology Sydney, Broadway, NSW, Australia 2. The University of Melbourne, Melbourne, VIC, Australia 3. National University of Singapore, Singapore Altered expression profiles of miRNAs are linked to many diseases including lung cancer. miRNA expression profiling is reproducible and miRNAs are very stable. These characteristics of miRNAs make them ideal biomarker candidates. This work aims to detect 2- and 3-miRNA groups, together with specific expression ranges of these miRNAs, to form simple linear discriminant rules for biomarker identification and biological interpretation. Our method is based on a novel committee of decision trees to derive 2- and 3-miRNA 100%-frequency rules. This method is applied to a data set of lung miRNA expression profiles of 61 squamous cell carcinoma (SCC) samples and 10 normal tissue samples. A distance separation technique is also used to select the most reliable rules, which are then evaluated on a large independent data set. We obtained four 2-miRNA and three 3-miRNA top-ranked rules. One important rule is: if the expression level of miR-98 is above 7.356 and the expression level of miR-205 is below 9.601 (log2 quantile normalized MirVan miRNA Bioarray signals), then the sample is normal rather than cancerous, with specificity and sensitivity both 100%. The classification performance of our best miRNA rules remarkably outperformed that of randomly selected miRNA rules. Our data analysis also showed that miR-98 and miR-205 have two common predicted target genes, FZD3 and RPS6KA3, which are genes associated with carcinoma according to the OMIM database. 
We also found that most of the chromosomal loci of these miRNAs have a high frequency of genomic alteration in lung cancer. On the independent data set (with balanced controls), the three miRNAs miR-126, miR-205 and miR-182 from our best rule can separate the two classes of samples with an accuracy of 84.49%, sensitivity of 91.40% and specificity of 77.14%. We also discuss a mapping between lung tissue-specific and plasma-specific miRNA biomarkers for minimally invasive diagnosis.
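The 2-miRNA rule quoted in this abstract can be written directly as a threshold test. The sketch below uses the two cut-offs stated in the abstract. Labelling every sample that fails the rule as "cancerous" is an assumption made to keep the example binary, and the function name is illustrative.

```python
# Thresholds quoted in the abstract (log2 quantile normalized signals).
MIR98_MIN = 7.356
MIR205_MAX = 9.601


def classify_sample(mir98, mir205):
    """Apply the miR-98 / miR-205 rule to one sample.

    Returns "normal" when both conditions of the rule hold. Treating all
    other samples as "cancerous" is an assumption of this sketch.
    """
    if mir98 > MIR98_MIN and mir205 < MIR205_MAX:
        return "normal"
    return "cancerous"
```

A rule of this form is easy to audit: each condition names one miRNA and one expression range, which is the interpretability argument the abstract makes for rule-based biomarkers.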


67

SCALABLE CLUSTERING OF GENOTYPE INFORMATION USING MAPREDUCE Aidan O'Brien1 1. CSIRO, Ryde, NSW, Australia Processing genomic information from whole genome sequence studies poses computational challenges due to the unprecedented data volume generated, which renders traditional approaches insufficient. However, by utilising advancements in modern hardware accelerators and data processing we can provide the means for scalable solutions. We therefore aim to provide the interface between standard genomic data formats and advanced and scalable analysis libraries like Mahout. We achieve a 2-fold speedup by using the scalable k-means MapReduce implementation over the equivalent analysis performed in R, with comparable accuracy. However, the real benefit lies in scaling beyond R's capability to a population-size analysis. We successfully clustered more than 5000 individuals, each having more than 15 million variants. Using modern compute paradigms is essential to scale to modern genomic research in an efficient, sustainable way.
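To illustrate why k-means fits the MapReduce model, the sketch below splits one iteration into a map step (assign each individual to its nearest centroid) and a reduce step (average the members of each cluster). This is a single-process illustration only and not Mahout's implementation. Representing each individual as a tuple of numeric genotype values is an assumption.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_step(points, centroids):
    """Map: emit a (cluster index, point) pair for every point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_step(pairs, centroids):
    """Reduce: average the points emitted for each cluster index."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = list(centroids)  # clusters with no members keep their centroid
    for index, members in groups.items():
        updated[index] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds and return the centroids."""
    for _ in range(iterations):
        centroids = reduce_step(map_step(points, centroids), centroids)
    return centroids
```

Because the map step treats every point independently, it can be distributed across workers, and only the per-cluster sums need to be combined in the reduce step.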

68

MHC2MIL: A NOVEL MULTIPLE INSTANCE LEARNING BASED METHOD FOR MHC II PEPTIDE BINDING PREDICTION BY CONSIDERING PEPTIDE FLANKING REGION AND RESIDUE POSITIONS Yichang Xu1, Cheng Luo1, Mingjie Qian2, Xiaodi Huang3, Shanfeng Zhu1 1. Fudan University, Shanghai, China 2. University of Illinois at Urbana-Champaign, Urbana, IL, USA 3. Charles Sturt University, Albury, Australia Background: Computational prediction of major histocompatibility complex class II (MHC-II) binding peptides can assist researchers in understanding the mechanism of immune systems and developing peptide-based vaccines. Although many computational methods have been proposed, the performance of these methods is far from satisfactory. The difficulty of MHC-II peptide binding prediction comes mainly from the large length variation of binding peptides. Methods: We develop a novel multiple instance learning based method called MHC2MIL to predict MHC-II binding peptides. We treat each peptide in MHC2MIL as a bag, and some substrings of the peptide as the instances in the bag. Unlike previous multiple instance learning based methods that consider only instances of fixed length 9 (9 amino acids), MHC2MIL is able to deal with instances of both length 9 and length 11 (11 amino acids) simultaneously. As such, MHC2MIL incorporates important information in the peptide flanking residues. Furthermore, when measuring the distances between different instances, MHC2MIL explicitly highlights the amino acids in some important positions. Results: Experimental results on a benchmark dataset have shown that the performance of MHC2MIL is significantly improved by considering instances of both 9 and 11 amino acids, as well as by emphasizing amino acids at key positions in the instance. The results are consistent with those reported in the literature on MHC II peptide binding. 
In addition to five important positions (1, 4, 6, 7 and 9) for HLA (human leukocyte antigen, the name of MHC in humans) DR peptide binding, we also find that position 2 may play some role in the binding process. Using 5-fold cross validation on the benchmark dataset, MHC2MIL outperforms two state-of-the-art methods, MHC2SK and NN-align, with statistical significance on 12 HLA DP and DQ molecules. In addition, it achieves performance close to that of MHC2SK and NN-align on 14 HLA DR molecules. MHC2MIL is freely available at http://datamining-iip.fudan.edu.cn/service/MHC2MIL/index.html .
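The bag-of-instances representation described in this abstract can be sketched as follows. The abstract says only that "some substrings" of a peptide become instances, so enumerating every contiguous 9-mer and 11-mer is an assumption, and the function name is illustrative rather than part of MHC2MIL.

```python
def peptide_to_bag(peptide, lengths=(9, 11)):
    """Return all contiguous substrings of the given lengths.

    The peptide is the "bag"; each substring is one "instance". Peptides
    shorter than a given length contribute no instances of that length.
    """
    bag = []
    for k in lengths:
        for start in range(len(peptide) - k + 1):
            bag.append(peptide[start:start + k])
    return bag
```

For a 12-residue peptide this yields four 9-mers and two 11-mers, so the bag size grows with peptide length, which is how the representation absorbs the length variation mentioned in the Background.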


69

ANTICANCER DRUG DESIGN USING KINASE PROFILING, KINASE EXPRESSION AND KIDFAMMAP Chih-Ta Lin1, Chun-Yu Lin1, Yi-Yuan Chiu1, Kai-Cheng Hsu1, Jhang-Wei Huang1, Tzu-Ying Sung1, Yi-Syuan Jhuang1, Weng-Chon Lam1, Kuan-Hsiu Liu1, Jen-Hu Tseng1, Jinn-Moon Yang1,2 1. Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan 2. Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan Background Protein kinases, which mediate most of the signal transduction that controls cellular processes, have become primary drug targets, especially in cancers. To date, over thirty thousand kinase inhibitors have been identified; however, only 27 small molecule drugs have been approved by the US FDA. The low clinical development success rates for investigational inhibitors may result from difficulty in drug target validation for particular diseases, as well as from evaluation of inhibitor selectivity that does not consider the roles of target kinases in particular diseases. Results Here, we propose “Approvance scores” to quantify the clinical development success rates of inhibitors for particular cancers, and use KIDFamMap to provide optimization guidance for enhancing the success rate of drugs for particular cancers. Approvance scores consider not only the potency of an inhibitor against its target kinases but also the role of the target kinases in a particular disease. Our results show that the kinase candidates identified from expression data for computing the approvance score were highly correlated with cancer-related genes and biological processes of gene ontology, and that approvance scores are consistent with the efficiencies of inhibitors. Kinase profiling results also show that the optimization guidance of KIDFamMap can be used to design kinase inhibitors with high approvance scores. 
Conclusions We believe that approvance scores provide an index for designing drugs for a particular disease and for providing personalized medicine according to the patient’s gene expression. Using KIDFamMap and approvance scores, we can design kinase inhibitors for particular diseases with high clinical development success rates.

70

IMMUNOINFORMATICS AND MOLECULAR DOCKING STUDIES OF OUTER MEMBRANE PROTEINS WITH MHC CLASS I ALLELES FOR FISH PATHOGENS: AN IN SILICO VACCINE DESIGN APPROACH Radha S Mahendran1, Gayathri S Sitharaman1 1. Bioinformatics Department, Vels University, Chennai, TN, India Background Edwardsiellosis and Columnaris are two important infectious diseases of fish, caused by the bacterial pathogens Edwardsiella tarda and Flavobacterium columnare. Since efficient vaccinations are still not in place for the disease outbreaks, we carried out an in silico, immunoinformatic approach to identify T cell epitopes that can be used as potential peptide vaccine candidates. Determination of T cell epitopes and their binding and interaction with major histocompatibility complex (MHC) proteins plays a very important role in the activation of T cells. Upon activation, the T cell receptors invade the pathogens by inducing apoptosis. Results We have identified potential T cell epitopes that bind with MHC from the outer membrane proteins of the pathogens. The sequences of the outer membrane proteins (OMPs) were analyzed owing to the fact that they are increasingly recognized as potential targets for inducing immune responses. OMPs were selected based on their antigenic and immunogenic properties. The OMPs of genes TolC and FCOL_04620 from E. tarda and F. columnare were taken for study. We identified four cytotoxic T cell epitopes from the OMP of E. tarda. Out of four, two epitopes exhibited excellent protein-peptide interaction. Eighteen cytotoxic T cell epitopes were identified from the OMP of F. columnare. Out of eighteen, five epitopes bound well with MHC class I alleles and had good protein-peptide interaction. Conclusion Activation of T cell receptors plays a very important role in the destruction of the invading foreign pathogens. Cytotoxic T cells bound with MHC class I and class II alleles activate the T cells and convert them to T cell receptors. 
This study identified potential peptides from the OMPs of the fish pathogens that bound well with MHC class I alleles. There is ample scope for further in vitro studies to develop potential peptide vaccines using the peptides mentioned above.


71

A NEW TOOL TO AVOID ERRORS ASSOCIATED WITH THE ANALYSIS OF HYPERMUTATED VIRAL SEQUENCES BY THE WIDELY USED HYPERMUT PROGRAM Hamid Alinejad-Rokny1, Miles Davenport1, Diako Ebrahimi1 1. UNSW, SYDNEY, NSW, Australia The human genome encodes a family of editing enzymes known as APOBEC3 (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3). They induce context-dependent G-to-A changes in the genome of sub-populations of viruses such as HIV, SIV, HBV and endogenous retroviruses, a process referred to as “hypermutation”. Hypermut is a program by the Los Alamos National Laboratory that is widely used to analyse and identify hypermutation. It is shown here that insertions and deletions in the sequences result in several different errors in this program, leading to the incorrect identification of hypermutated sequences. This in turn results in erroneous biological inferences based on the outcome of the Hypermut program. In this paper we identify and report these errors using published and unpublished viral sequences and present a new algorithm, which we refer to as G2A3, to avoid these errors.

72

MOLECULAR PROFILING OF THYROID CANCER SUBTYPES USING LARGE-SCALE TEXT MINING Chengkun Wu1,2,3, Jean-Marc Schwartz1, Georg Brabant4,5, Goran Nenadic3,6,7 1. Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom 2. Doctoral Training Centre in Integrative Systems Biology, University of Manchester, Manchester, United Kingdom 3. Manchester Institute of Biotechnology, University of Manchester, Manchester, United Kingdom 4. Department of Endocrinology, Christie Hospital, University of Manchester, Manchester, United Kingdom 5. Experimental and Clinical Endocrinology, Med Clinic I, University of Lübeck, Lübeck, Germany 6. School of Computer Science, University of Manchester, Manchester, United Kingdom 7. Health eResearch Centre (HeRC), University of Manchester, Manchester, United Kingdom Background Thyroid cancer is the most common endocrine tumor, with a steady increase in incidence. It is classified into multiple histopathological subtypes with potentially distinct molecular mechanisms. Identifying the most relevant genes and biological pathways reported in the thyroid cancer literature is vital for understanding the disease and developing targeted therapeutics. Results We developed a large-scale text mining system to generate molecular profiles of thyroid cancer subtypes. The system first uses a subtype classification method for the thyroid cancer literature, which employs a scoring scheme to assign different subtypes to articles. We evaluated the classification method on a gold standard derived from the PubMed Supplementary Concept annotations, achieving an F1-score of over 80% for most subtypes. We then used the subtype classification results to extract genes and pathways associated with different thyroid cancer subtypes. Conclusions Identification of key genes and pathways plays a central role in understanding the molecular biology of thyroid cancer. 
An integration of subtype context will allow prioritized screening for diagnostic biomarkers and novel molecular targeted therapeutics. Source code used for this study is made freely available online at https://github.com/chengkun-wu/GenesThyCan.


73

INTRODUCING TREECOLLAPSE: A NOVEL GREEDY ALGORITHM TO SOLVE THE COPHYLOGENY RECONSTRUCTION PROBLEM Benjamin Drinkwater1, Michael A Charleston1 1. University of Sydney, University Of Sydney, NSW, Australia Background: Cophylogeny mapping is used to uncover deep coevolutionary associations between two or more phylogenetic histories at a macro coevolutionary scale. As cophylogeny mapping is NP-Hard, this technique relies heavily on heuristics to solve all but the most trivial cases. One notable approach utilises a metaheuristic to search only a subset of the exponential number of fixed node orderings possible for the phylogenetic histories in question. This is of particular interest as it is the only known heuristic that guarantees biologically feasible solutions. This has enabled research to focus on larger coevolutionary systems, such as coevolutionary associations between figs and their pollinator wasps, including over 200 taxa. Although able to converge on solutions for problem instances of this size, a reduction from the current cubic running time is required to handle larger systems, such as Wolbachia and their insect hosts. Results: Rather than solving this underlying problem optimally we present a greedy algorithm called TreeCollapse, which uses common topological patterns to recover an approximation of the coevolutionary history where the internal node ordering is fixed. This approach offers a significant speed-up compared to previous methods, running in linear time. This algorithm has been applied to over 100 well-known coevolutionary systems converging on Pareto optimal solutions in 68% of test cases, where in some cases the reported solution has not previously been recoverable. Further, while TreeCollapse applies a local search technique, it can guarantee solutions are biologically feasible, making this the fastest method to provide such a guarantee. 
Conclusion: As a result, we argue that the newly proposed algorithm is a valuable addition to the field of coevolutionary research. Not only does it offer a significantly faster method for recovering cophylogeny mappings but by using this approach, in conjunction with existing heuristics, it can assist in recovering a larger subset of the Pareto front than has previously been possible.

74

MULTI-SPECIES SEQUENCE COMPARISON REVEALS CONSERVATION OF PREPROGHRELIN SPLICE VARIANTS AND A NOVEL VARIANT ENCODING A TRUNCATED GHRELIN PEPTIDE Inge Seim1, Carina M Walpole1, Patrick B Thomas1, Penny L Jeffery1, Jenny NT Fung1, Peiyi Yap1, Angela O'Keeffe1, John Lai1, Eliza J Whiteside1, Adrian C Herington1, Lisa Chopin1 1. TRI-IHBI, Queensland University of Technology, Brisbane, Queensland, Australia Background The peptide hormone ghrelin is a potent orexigen produced predominantly in the stomach. It has a number of other biological actions, including a role in energy balance, the stimulation of growth hormone release and the regulation of cell proliferation. Recently, several ghrelin gene (GHRL) splice variants have been described. In this manuscript, we attempted to identify conserved alternative splicing of the ghrelin gene by cross-species sequence comparisons. Results We have identified a novel human exon 2-deleted preproghrelin variant and provide preliminary evidence that this splice variant and the in1-ghrelin preproghrelin variant encode a C-terminally truncated form of the ghrelin peptide, termed minighrelin. These preproghrelin variants are expressed in humans and mice, demonstrating conservation of alternative splicing spanning 90 million years. Minighrelin appears to have similar actions to canonical ghrelin, as treatment with exogenous minighrelin peptide stimulates appetite and feeding in mice. Forced expression of the exon 2-deleted preproghrelin variant mirrors the effect of the canonical preproghrelin, stimulating cell proliferation and migration in the PC3 prostate cancer cell line. Conclusions This is the first study to characterise an exon 2-deleted preproghrelin variant and to demonstrate sequence conservation of preproghrelin splice variants that encode a truncated ghrelin peptide. This adds further impetus for studies into the alternative splicing of the ghrelin gene and the function of novel ghrelin peptides in vertebrates.


75

IDENTIFYING SIGNIFICANT ASSOCIATIONS WITH INTERACTING GERMLINE VARIATION AND SOMATIC MUTATIONAL EVENTS FOR CANCERS Zhongmeng Zhao1, Xuanping Zhang1, Wenke Wang1, Yu Geng1, Mingchao Xie2, Beifang Niu2, Kai Ye2, Kimberly Johnson3, Li Ding2, Xiao Xiao4, Jiayin Wang2 1. Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, P.R.China 2. The Genome Institute, Washington University in St. Louis, St. Louis, MO, United States 3. Brown School Master of Public Health Program, Washington University in St. Louis, St. Louis, MO, United States 4. State Key Laboratory of Cancer Biology, Xijing Hospital of Digestive Diseases, Xi'an, P.R.China Background: Identifying novel deleterious germline variation and somatic events is one of the essential questions in cancer genomics. A series of association approaches have been proposed to achieve this, among which burden-test-based methods are the most popular. However, these methods are challenged by multiple issues, such as over-dependence on pre-selected genetic models, difficulty in differentiating deleterious variants from neutral ones, and low statistical power. Moreover, interactions between germline and somatic variation have been widely reported recently, but are not considered in burden tests. Results: Motivated by these issues, we propose a novel association approach to identify deleterious variants using combined germline variants and somatic mutational events from cancer genome sequencing data. As a model-free strategy, our approach, RareProb-C, makes algorithmic selections of causal variants and eliminates singular cases, and then collapses the candidate causal mutations into a statistical test. In addition, an improved four-gamete test is introduced to enhance the accuracy and reduce false positives. We compare RareProb-C to existing burden-test approaches on both artificial and real datasets. 
RareProb-C achieves higher statistical power than the existing approaches under different simulation configurations. We apply RareProb-C to an ATM gene screening dataset and an ovarian cancer research dataset that consists of 419 cases with tumor-normal pair Exome-Seq data, where our approach successfully identifies most of the highlighted variants that are considered to contribute to disease susceptibility.

76

MINING FOR GENE-DISEASE ASSOCIATIONS WITH MESH TERMS IN MEDLINE AND ARRAYEXPRESS Modest von Korff1, Bernard Deffarges, Valerie Siefken, Thomas Sander 1. Actelion Pharmaceuticals Ltd., Allschwil, BASEL, Switzerland

Background This study examines the possibility of extracting meaningful gene-disease associations from public databases. Gene-disease associations (GDA) are of high interest in medicine and in drug discovery. Two fully automatic methods were implemented to mine two databases, PubMed Central and ArrayExpress. A database is queried with a gene name and the retrieved result records are searched for disease-related MeSH terms. The MeSH terms are ranked by their frequency of occurrence in the result records. A test dataset with 38 drugs was compiled to examine the relevance of the described approaches. Drugs were used because a drug provides a triple association: to the gene that encodes the target of the drug, and to a disease that is treated by the drug. A test record contained the drug name, the disease MeSH term (indication) and the gene name of the target protein. For a test, one of the databases was queried with the gene name. The results were searched and the MeSH terms ranked. Finally, the relative rank of the disease MeSH term from the test record was used as the figure of merit for the relevance of the gene-disease association.

Results A total of 53 test records were derived from the 38 drugs, as for some drugs more than one target was compiled from the literature. For mining ArrayExpress, the median of the relative ranks of all test records was 0.675. For mining PubMed Central, the median was 0.951.

Conclusions Mining PubMed Central for relevant gene-disease associations was much more successful than mining ArrayExpress. For PubMed Central, the disease MeSH term for the underlying indication of the test record was, in the majority of cases, within the top five percent of the ranked diseases. This demonstrated that the described method delivered meaningful gene-disease associations.
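The ranking procedure described in this abstract can be sketched in a few lines. The abstract does not define "relative rank", so the scaling used here (1.0 for the top-ranked term, 0.0 for the last) is an assumption, chosen because it matches the reading in which a median of 0.951 places the indication near the top of the list. Counting each term once per record is also an assumption, and all names are illustrative.

```python
from collections import Counter


def rank_mesh_terms(records):
    """Rank disease MeSH terms by the number of records they occur in.

    records is a list of term lists, one list per retrieved result record.
    """
    counts = Counter(term for record in records for term in set(record))
    return [term for term, _ in counts.most_common()]


def relative_rank(ranked_terms, target):
    """Scale the position of target to [0, 1], with 1.0 meaning top-ranked.

    Returns 0.0 when the target term was not retrieved at all.
    """
    if target not in ranked_terms:
        return 0.0
    if len(ranked_terms) == 1:
        return 1.0
    return 1.0 - ranked_terms.index(target) / (len(ranked_terms) - 1)
```

Terms with equal counts are ordered arbitrarily in this sketch. A real evaluation would need an explicit tie-breaking rule, since ties change the relative rank.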


77

J-CIRCOS: A JAVA GRAPHIC USER INTERFACE FOR CIRCOS PLOT Jiyuan An, John Lai, ChenWei Wang, Melanie Lehman, Colleen Nelson Circos plots are graphical outputs that have many useful applications, such as displaying three-dimensional chromosomal interactions and fusion transcripts. However, the Circos tool is difficult for non-bioinformaticians to use, as the Perl-implemented Circos tool requires users to install related packages and process data files in a Unix environment. This has resulted in the development of an R-based Circos tool (RCircos), although RCircos is equally inaccessible to non-bioinformaticians. Thus, we have developed a Circos plot tool (J-Circos) that is targeted towards biologists with limited bioinformatics skills, as it uses an intuitive Graphic User Interface (GUI). J-Circos is written in Java so that it can be used on most operating systems (Windows, MacOS, Linux). Users can input data into J-Circos using flat data formats, as well as from the GUI. Additionally, J-Circos has a mouse hover function that provides information for specific data points. Collectively, J-Circos is an easy-to-use tool that is accessible to biologists. This will enable biologists to further study more complex chromosomal interactions and fusion transcripts that are otherwise difficult to visualise from next-generation sequencing data.

78

COMPLEX-BASED ANALYSIS OF DEREGULATED CELLULAR PROCESSES IN CANCER Sriganesh Srihari1, Piyush B Madhamshettiwar1, Sarah Song1, Chao Liu1, Peter T Simpson2, Kum Kum Khanna3, Mark A Ragan1 1. Institute for Molecular Bioscience, The University of Queensland, St Lucia, QLD, Australia 2. The University of Queensland, UQ Centre for Clinical Research, Brisbane, QLD, Australia 3. Signal Transduction Laboratory, QIMR-Berghofer Institute of Medical Research, Brisbane, QLD, Australia Background: Differential expression analysis of (individual) genes is often used to study their roles in diseases. However, diseases such as cancer are a result of the combined effect of multiple genes. Gene products such as proteins seldom act in isolation, but instead constitute stable multi-protein complexes performing dedicated functions. Therefore, complexes aggregate the effect of individual genes (proteins) and can be used to gain a better understanding of cancer mechanisms. Here, we observe that complexes show considerable changes in their expression, in turn directed by the concerted action of transcription factors (TFs), across cancer conditions. We seek to gain novel insights into cancer mechanisms through a systematic analysis of complexes and their transcriptional regulation. Results: We integrated large-scale protein-interaction (PPI) and gene-expression datasets to identify complexes that exhibit significant changes in their expression across different conditions in cancer. We then devised a log-linear model to relate these changes to the differential regulation of complexes by TFs. The application of our model on two case studies involving pancreatic and familial breast tumour conditions revealed: (i) complexes in core cellular processes, especially those responsible for maintaining genome stability and cell proliferation (e.g. 
DNA damage repair and cell cycle) show considerable changes in expression; (ii) these changes include decrease and countering increase for different sets of complexes indicative of compensatory mechanisms coming into play in tumours; and (iii) TFs work in cooperative and counteractive ways to regulate these mechanisms. Such aberrant complexes and their regulating TFs play vital roles in the initiation and progression of cancer. Conclusions: Complexes in core cellular processes display considerable decreases and countering increases in expression, strongly reflective of compensatory mechanisms in cancer. These changes are directed by the con-certed action of cooperative and counteractive TFs. Our study highlights the roles of these complexes and TFs and presents several case studies on compensatory processes, providing novel insights into cancer mechanisms.

79

USING HIDDEN MARKOV MODELS TO INVESTIGATE G-QUADRUPLEX MOTIFS IN GENOMIC SEQUENCES Masato Yano1, Yuki Kato2 1. Nara Institute of Science and Technology, Ikoma, Japan 2. Center for iPS Cell Research and Application (CiRA), Kyoto University, Kyoto, Japan G-quadruplexes are four-stranded structures formed in guanine-rich nucleotide sequences. Several functional roles of DNA G-quadruplexes have so far been investigated, where their putative functional roles during DNA replication and transcription have been suggested. A necessary condition for G-quadruplex formation is the presence of four regions of tandem guanines called G-runs and three nucleotide subsequences called loops that connect G-runs. A simple computational way to detect potential G-quadruplex regions in a given genomic sequence is pattern matching with regular expression. Although many putative G-quadruplex motifs can be found in most genomes by the regular expression-based approach, the majority of these sequences are unlikely to form G-quadruplexes because they are unstable as compared with canonical double helix structures. Here we present elaborate computational models for representing DNA G-quadruplex motifs using hidden Markov models (HMMs). Use of HMMs enables us to evaluate G-quadruplex motifs quantitatively by a probabilistic measure. In addition, the parameters of HMMs can be trained by using experimentally verified data. Experiments in prediction of G-run regions in bona fide G-quadruplex sequences and discrimination of putative G-quadruplexes in the human genome were carried out, indicating that all HMM-based models simulate G-quadruplex structures well and one of them has the possibility of reducing false positive G-quadruplexes predicted by existing regular expression-based methods. Furthermore, our results show that one of our models can be specialized to detect G-quadruplex sequences whose functional roles are expected to be involved in DNA transcription. 
The HMM-based method along with the conventional pattern matching approach can contribute to reducing costly and laborious wet-lab experiments to perform functional analysis on a given set of potential G-quadruplexes of interest.
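
The conventional pattern matching approach mentioned above can be illustrated with a short Python sketch. The abstract does not state the G-run or loop lengths, so the pattern below (G-runs of at least three guanines, loops of 1 to 7 nucleotides) and the function name find_g4_candidates are assumptions chosen for illustration; the sketch scans one strand only and does not represent the HMM-based models.

```python
import re

# Assumed motif: four G-runs (>= 3 guanines each) joined by three loops of 1-7 nt.
# These length limits are illustrative; the abstract does not specify them.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def find_g4_candidates(sequence):
    """Return (start, end, motif) for each non-overlapping match on the given strand."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Human telomeric repeat, used here only as a toy input.
    for hit in find_g4_candidates("TTAGGGTTAGGGTTAGGGTTAGGGTTA"):
        print(hit)
```

Every hit from such a scan is only a putative motif, which is the limitation the probabilistic scoring described in the abstract is meant to address.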

80

COMBINING SPATIAL AND CHEMICAL INFORMATION FOR CLUSTERING PHARMACOPHORES Lingxiao Zhou1, Renate Griffith, Bruno Gaeta 1. UNSW, Sydney, NSW, Australia

Background A pharmacophore model consists of a group of chemical features arranged in three-dimensional space that can be used to represent the biological activities of the described molecules. Clustering of molecular interactions of ligands on the basis of their pharmacophore similarity provides an approach for investigating how diverse ligands can bind to a specific receptor site or different receptor sites with similar or dissimilar binding affinities. However, efficient clustering of pharmacophore models in three-dimensional space is currently a challenge.

Results We have developed a pharmacophore-assisted Iterative Closest Point (ICP) method that is able to group pharmacophores in a manner relevant to their biochemical properties, such as binding specificity. The implementation of the method takes pharmacophore files as input and produces distance matrices. The method integrates both alignment-dependent and alignment-independent concepts.

Conclusions We apply our three-dimensional pharmacophore clustering method to two sets of experimental data, including 31 globulin-binding steroids and 4 groups of selected antibody-antigen complexes. Results are translated from distance matrices to Newick format and visualised using dendrograms. For the steroid dataset, the resulting classification of ligands shows good correspondence with existing classifications. For the antigen-antibody datasets, the classification of antigens reflects both antigen type and binding antibody. Overall the method runs quickly and accurately for classifying the data based on their binding affinities or antigens.
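
As an illustration of the step from a distance matrix to Newick format, the following Python sketch builds a tree by average-linkage (UPGMA) clustering. The abstract does not say which tree-building procedure was used, so UPGMA, the function name upgma_newick and the toy labels are assumptions made for this example only.

```python
def upgma_newick(names, dist):
    """Cluster by average linkage (UPGMA) and return a Newick string.

    names: list of leaf labels; dist: symmetric matrix (list of lists).
    UPGMA is an assumption for illustration, not the authors' stated procedure.
    """
    # Each cluster id maps to (newick_text, number_of_leaves, node_height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        ta, sa, ha = clusters.pop(a)
        tb, sb, hb = clusters.pop(b)
        height = dab / 2.0
        text = "(%s:%.3f,%s:%.3f)" % (ta, height - ha, tb, height - hb)
        # Size-weighted average distance from the merged cluster to the others.
        merged = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            merged[c] = (sa * dac + sb * dbc) / (sa + sb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        for c, v in merged.items():
            d[(c, next_id)] = v  # existing ids are always smaller than next_id
        clusters[next_id] = (text, sa + sb, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


if __name__ == "__main__":
    labels = ["A", "B", "C"]  # hypothetical ligand labels
    matrix = [[0, 2, 6],
              [2, 0, 6],
              [6, 6, 0]]
    print(upgma_newick(labels, matrix))
```

The resulting string can be opened in any dendrogram viewer that reads Newick.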

81

HIGHLY SENSITIVE INFERENCE OF TIME-DELAYED GENE REGULATIONS BY NETWORK DECONVOLUTION Haifen Chen1, Piyushkumar A. Mundra2, Li Na Zhao1, Feng Lin1, Jie Zheng1,3 1. Nanyang Technological University, Singapore 2. Metabolomics Laboratory, Baker IDI Heart and Diabetes Institute, Melbourne, Australia 3. Genome Institute of Singapore, Singapore Background: The gene regulatory network (GRN) is a fundamental topic in systems biology. The dynamics of a GRN can shed light on cellular processes, which facilitates our understanding of the mechanisms of disease when these processes are dysregulated. Accurate reconstruction of GRNs could also provide guidelines for experimental biologists. Therefore, inferring gene regulatory networks from high-throughput gene expression data is a central problem in systems biology. However, due to the inherent complexity of gene regulation, noise in measuring the data and short time-series data, it is very challenging to reconstruct accurate GRNs. On the other hand, a better understanding of gene regulation could help to improve the performance of GRN inference. Time delay is one of the most important characteristics of gene regulation. By incorporating the information of time delays, we can achieve more accurate inference of GRNs. Results: In this paper, we propose a method to infer time-delayed gene regulations based on cross-correlation and network deconvolution (ND). First, we employ cross-correlation to obtain the probable time delays for the interactions between each target gene and its potential regulators. Then, based on the inferred delays, the technique of network deconvolution (ND) is applied to identify direct interactions between the target gene and its regulators. Experiments on real-life gene expression datasets show that our method achieves overall better performance than existing methods for inferring time-delayed GRNs. 
Conclusion: By taking into account the time delays among gene interactions, our method is able to infer GRN more accurately. The effectiveness of our method has been shown by the experiments on three real-life gene expression datasets of yeast. Compared with other existing methods which were designed for learning time-delayed GRN, our method has significantly higher sensitivity without much reduction of specificity.
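
The first step of the method, estimating a probable time delay by cross-correlation, can be sketched in Python as follows. This is a minimal illustration that assumes Pearson correlation on shifted time series and picks the delay with the largest absolute correlation; the function names and toy data are hypothetical, and the network deconvolution step is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_delay(regulator, target, max_delay):
    """Return (delay, r) maximising |corr(regulator[t], target[t + delay])|."""
    best = (0, 0.0)
    for delay in range(max_delay + 1):
        x = regulator[:len(regulator) - delay]
        y = target[delay:]
        if len(x) < 3:  # too few overlapping time points to correlate
            break
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (delay, r)
    return best


if __name__ == "__main__":
    regulator = [0, 1, 3, 2, 5, 4, 6, 1]
    # Toy target that follows the regulator two time points later.
    target = [0, 0] + [2 * v for v in regulator[:-2]]
    print(best_delay(regulator, target, max_delay=3))
```

With short time series the number of overlapping points shrinks as the delay grows, so max_delay should stay small relative to the series length.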

82

SEMI-SUPERVISED MULTI-LABEL COLLECTIVE CLASSIFICATION ENSEMBLE FOR FUNCTIONAL GENOMICS Qingyao Wu1, Yunming Ye1, Shen-Shyang Ho2, Shuigeng Zhou3 1. Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China 2. School of Computer Engineering, Nanyang Technological University, Singapore 3. School of Computer Science, Fudan University, Shanghai, China Background: With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, the process of manually predicting the functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins with similar attribute features to unannotated proteins, or learning from a fully labeled protein interaction network with a large number of labeled nodes. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning from only a few labeled examples can lead to poor performance, as limited supervision knowledge can be obtained from the similar proteins or from the connections between them. To effectively annotate proteins even when labeled data are scarce, it is of interest to take advantage of all data sources that are available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data. Results: In this paper, we show that the underlying nature of predicting functional properties of proteins using various data sources of relational data is a typical Collective Classification (CC) problem in machine learning. The protein functional prediction task with limited annotation is then cast into a Semi-supervised Multi-label Collective Classification (SMCC) framework. As such, we propose a novel Generative Model based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions and predict functional properties of unannotated proteins. To further boost the predicting performance, we extend the method in an ensemble manner, called EGM-SMCC, by utilizing multiple heterogeneous networks with various latent linkages constructed to explicitly model the relationships among the nodes for effectively propagating the supervision knowledge from labeled to unlabeled nodes. Conclusion: Experimental results on the KDD Cup 2001 tasks of predicting the function and localization of the protein corresponding to a given yeast gene demonstrate the effectiveness of the proposed method. In the comparison, we find that the proposed algorithms perform better than the other compared algorithms.

83

PREDICTING FUNCTIONAL RELATED PROTEINS BASED ON CHARACTERISTIC OF THE GENE SEQUENCES OF THE PROTEIN PAIRS Tien-Hao Chang1, Tzu-Wen Lin1, Shao-Ting Jang1 1. National Cheng Kung University, Tainan, Taiwan Various protein functions are essential to diverse biological processes. Elucidating these protein functions and linking functionally related proteins helps our understanding of the mechanisms of biological systems at the molecular level. Nowadays, various protein intrinsic features (e.g. protein sequences, structures, functions and so on) have been studied to predict functionally related proteins. However, no studies have analysed the regulatory features (e.g. transcription factors that regulate the gene of a protein) between two interacting proteins. This study aims to answer whether regulatory features preserve effects on functional relation after the gap from gene to protein, as well as to build a regulatory feature-based prediction model for functionally related proteins. This study has conducted a comprehensive analysis of regulatory features. It collected eight kinds of transcriptional characteristics and encoded them as 16 transcriptional features: DNA bendability, gene size (with sense or with antisense), gene distance, nucleosome occupancy, TATA box information, TF binding and knockout information and eight regulatory similarities based on TFBS data. The experimental results show that gene distance, gene size, and TATA box information improved the prediction performance by 7% in area under the curve and indicate that these regulatory features did influence the functional relation after the gap from gene to protein. In Saccharomyces cerevisiae, our method’s prediction is better than that of previous methods. This work is the first study to discuss regulatory features in predicting functionally related proteins, and the results suggest that this category of features must be considered in the future. The proposed new regulatory characteristic encoding method has been shown to be capable of identifying whether two proteins are functionally related. The constructed prediction model is helpful for discovering the unknown molecular mechanisms of specific regulatory functions. Finally, this study encourages subsequent work on related research topics to consider regulatory features, even when the topics are at the protein level.

84

TINAGL1 AND B3GALNT1 ARE POTENTIAL THERAPY TARGET GENES TO SUPPRESS METASTASIS IN NON-SMALL CELL LUNG CANCER Hideaki Umeyama1, Mitsuo Iwadate2, Y-h. Taguchi1 1. Department of Physics, Chuo University, Tokyo, Japan 2. Department of Biological Science, Chuo University, Tokyo, Japan Background: Non-small cell lung cancer (NSCLC) remains lethal despite the development of numerous drug therapy technologies. About 85% to 90% of lung cancers are NSCLC and the 5-year survival rate is at best still below 50%. Thus, it is important to find drug target genes for NSCLC in order to develop an effective therapy. Results: Integrated analysis of publicly available gene expression and promoter methylation patterns of two highly aggressive NSCLC cell lines generated by in vivo selection was performed. We selected eleven critical genes that may mediate metastasis using recently proposed principal component analysis based unsupervised feature extraction. The eleven selected genes were significantly related to cancer diagnosis. The tertiary protein structures of the selected genes were inferred by Full Automatic Modeling System, a profile based protein structure inference software, to determine protein functions and to specify genes that could be potential drug targets. Conclusions: We identified eleven potentially critical genes that may mediate NSCLC metastasis using bioinformatic analysis of publicly available data sets. These genes are potential target genes for therapy of NSCLC. Among the eleven genes, TINAGL1 and B3GALNT1 are possible targets for drug compounds that inhibit their gene expression.

85

FRAGMENT BASED GROUP QSAR AND MOLECULAR DYNAMICS MECHANISTIC STUDIES ON ARYLTHIOINDOLE DERIVATIVES TARGETING THE Α-Β INTERFACIAL SITE OF HUMAN TUBULIN Chetna Tyagi1, Ankita Gupta2, Sukriti Goyal3, Jaspreet Kaur Dhanjal1, Abhinav Grover1 1. Jawaharlal Nehru University, New Delhi, Delhi, India 2. Department of Biotechnology, Delhi Technological University, New Delhi, Delhi, India 3. Apaji Institute of Mathematics & Applied Computer Technology, Banasthali University, Tonk, Rajasthan, India Background A number of microtubule disassembly blocking agents and inhibitors of tubulin polymerization have been of great interest in anti-cancer therapy, some of them even entering clinical trials. One such class of tubulin assembly inhibitors is the arylthioindole derivatives, which cause effective microtubule disorganization responsible for cell apoptosis by interacting with the colchicine binding site of the β-unit of tubulin close to the interface with the α unit. We modelled the human tubulin β unit (chain D) protein and performed docking studies to elucidate the detailed binding modes associated with their inhibition. The activity-enhancing structural aspects were evaluated using a fragment-based Group QSAR (G-QSAR) model, which was validated statistically to determine its robustness. A combinatorial library was generated keeping the arylthioindole moiety as the template and the activities of its members were predicted. Results The G-QSAR model obtained was statistically significant with r2 value of 0.85, cross validated correlation coefficient q2 value of 0.71 and pred_r2 (r2 value for test set) value of 0.89. A high F test value of 65.76 suggests robustness of the model. Screening of the combinatorial library on the basis of predicted activity values yielded two compounds, HPI (predicted pIC50 = 6.042) and MSI (predicted pIC50 = 6.001), whose interactions with the D chain of modelled human tubulin protein were evaluated in detail. A toxicity evaluation showed MSI to be less toxic than HPI. Conclusions The study provides an insight into the crucial structural requirements and the necessary substitutions required for the arylthioindole moiety to exhibit enhanced inhibitory activity against human tubulin. The two reported compounds HPI and MSI showed promising anti-cancer activities and thus can be considered as potent leads against cancer. The toxicity evaluation of these compounds suggests that MSI is a promising therapeutic candidate. This study provided another stepping stone in the direction of evaluating tubulin inhibition and microtubule disassembly degeneration as viable targets for development of novel therapeutics against cancer.

86

HOXD-AS1 IS A NOVEL LONG NONCODING RNA ENCODED IN HOXD CLUSTER AND A MARKER OF NEUROBLASTOMA PROGRESSION REVEALED VIA INTEGRATIVE ANALYSIS OF NONCODING TRANSCRIPTOME Aliaksandr A Yarmishyn1, Arsen O Batagov1, Jovina Z Tan1, Gopinath M Sundaram2, Prabha Sampath2, Vladimir A Kuznetsov1, Igor V Kurochkin1 1. Department of Genome and Gene Expression Data Analysis, Bioinformatics Institute, A*STAR, Singapore 2. Translational Control in Development and Disease Group, Institute of Medical Biology, A*STAR, Singapore Background Long noncoding RNAs (lncRNAs) constitute a major but poorly characterized part of the human transcriptome. Recent evidence indicates that many lncRNAs are involved in cancer and can be used as predictive and prognostic biomarkers. A significant fraction of lncRNAs is represented on widely used microarray platforms; however, they have usually been ignored in cancer studies. Results We developed a computational pipeline to annotate lncRNAs on popular Affymetrix U133 microarrays, creating a resource allowing measurement of expression of 1,581 lncRNAs. This resource can be utilized to interrogate existing microarray datasets for various lncRNA studies. We found that these lncRNAs fall into three distinct classes according to their statistical distribution by length. Remarkably, these three classes of lncRNAs were co-localized with protein coding genes exhibiting distinct gene ontology groups. This annotation was applied to microarray analysis which identified a 159-lncRNA signature that discriminates between localized and metastatic stages of neuroblastoma. Analysis of an independent patient cohort revealed that this signature also differentiates relapsing from non-relapsing primary tumors. This is the first example of a signature developed solely via the analysis of lncRNA expression. One of these lncRNAs, termed HOXD-AS1, is encoded in the HOXD cluster. HOXD-AS1 is evolutionarily conserved among hominids and has all bona fide features of a gene. Studying the retinoic acid (RA) response of the SH-SY5Y cell line, a model of human metastatic neuroblastoma, we found that HOXD-AS1 is subject to morphogenic regulation, is activated by the PI3K/Akt pathway and is itself involved in control of RA-induced cell differentiation. Knock-down experiments revealed that HOXD-AS1 controls expression levels of clinically significant protein-coding genes involved in angiogenesis and inflammation, the hallmarks of metastatic cancer. Conclusions Our findings greatly extend the number of noncoding RNAs functionally implicated in tumor development and patient treatment and highlight their role as potential prognostic biomarkers of neuroblastomas.

87

TRANSCRIPTOME ALTERATIONS OF MITOCHONDRIAL AND COAGULATION FUNCTION IN SCHIZOPHRENIA BY CORTICAL SEQUENCING ANALYSIS Kuo-Chuan Huang1, Ko-Chun Yang2, Han Lin2, Theresa Tsun-Hui Tsao3, Sheng-An Lee4 1. Department of Psychiatry, Beitou Branch, Tri-Service General Hospital, Taipei, Taiwan 2. Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan 3. Department of Biochemical Science and Technology, National Taiwan University, Taipei, Taiwan 4. Department of Information Management, Kainan University, Taoyuan, Taiwan

Background Genetic and protein interactions in schizophrenia may predispose to biological dysfunction of energy metabolism and hemostasis. A comparison of schizophrenic candidate genes from literature reviews was explored. The differential expression levels of schizophrenic candidate genes from NGS BA22 brain samples, together with associated mediator genes, were used to construct a schizophrenia-mediator network (SCZMN). The corresponding pathways were searched against pathway databases such as PID, Reactome, HumanCyc, and Cell-Map, and the candidate complexes from CORUM were identified by MCL clustering for the potential pathogenesis of schizophrenia.

Results We identified genes which were over- or under-expressed in the BA22 brain samples of schizophrenia and proposed them as schizophrenia candidate marker genes (SCZCGs). The genetic interactions of mitochondrial genes surrounded by most under-expressed SCZCGs indicate the genetic predisposition of mitochondria dysfunction in schizophrenia. The biological functions of SCZCGs, as listed in the Pathway Interaction Database (PID), indicate that these genes have roles in DNA binding transcription factor, signal and cancer-related pathways, coagulation and cell cycle regulation and differentiation pathways. The relationship between antipsychotic target genes (DRD2/3 and HTR2A) and coagulation factor genes (F3, F7 and F10) appeared to cascade the following hemostatic process implicating the bottleneck of coagulation genetic network by the bridging of actin-binding protein (FLNA).

Conclusions Transcriptome sequencing of brain specific samples provides enrichment analysis of differential expression and genetic interaction in evaluation of mitochondrial and coagulation function in schizophrenia. Energy metabolism and hemostatic process have important roles in the pathogenesis for schizophrenia.

88

A GENE SET METHOD TO PREDICT PATIENT SURVIVAL RISKS FROM GENE EXPRESSION DATA Junhee Seok1,2, Ronald Davis3, Wenzhong Xiao4 1. School of Electrical Engineering, Korea University, Seongbuk-gu, Seoul, South Korea 2. Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA 3. Stanford Genome Technology Center, Stanford University, Palo Alto, CA, USA 4. Department of Surgery, Massachusetts General Hospital, Boston, MA, USA Background: Gene sets representing modules of biological functions, such as pathways and transcriptional regulations, have been actively studied and applied in the analyses of high-throughput gene expression data in clinical studies. While many previous studies have focused on finding gene sets significantly associated with disease conditions, the use of gene sets for the prediction of patient risks from censored survival times hasn’t been well studied. Results: In this work, we propose a method that utilizes gene sets in the prediction of patient survival risks using gene expression profiles. The method uses the gene set information by summarizing the expression indices of member genes, and incorporates both the single gene and gene set information in the framework of conventional prediction methods. Tested over multiple data sets of cancer and severe injury, the method shows significantly improved prediction power for patient survival risks compared with conventional single gene predictions, and the performance of prediction seems to benefit from the use of an integrated super-collection of multiple available gene set collections. Detailed examination of the results of prediction in the injury data shows that gene sets selected by the method for the prediction are highly interpretable in biology. Conclusions: To date, most outcome predictions using gene expression data have focused on single gene information. The development utilizing gene set information is expected to be applicable in a wide range of survival prediction problems in clinical genomics and personalized medicine.
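
The idea of summarizing member-gene expression into gene set features and using them alongside single genes can be sketched in Python as below. The abstract does not state the summary statistic, so averaging per-gene z-scores is an assumption made for illustration, and the function names and the "set:" prefix are hypothetical. The resulting feature table would then be passed to a conventional survival model, which is not shown.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise one gene across patients (all zeros if the gene is constant)."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def add_gene_set_features(expression, gene_sets):
    """Combine single-gene and gene-set features.

    expression: gene -> list of per-patient expression values.
    gene_sets:  set name -> list of member genes.
    Returns feature name -> per-patient values. The mean of member z-scores
    is an assumed summary, not necessarily the one used by the authors.
    """
    standardized = {gene: zscores(vals) for gene, vals in expression.items()}
    features = dict(standardized)  # keep the single-gene features
    n_patients = len(next(iter(expression.values())))
    for name, members in gene_sets.items():
        present = [g for g in members if g in standardized]
        if not present:
            continue  # no measured member genes for this set
        features["set:" + name] = [
            mean(standardized[g][i] for g in present) for i in range(n_patients)
        ]
    return features


if __name__ == "__main__":
    expr = {"G1": [1, 2, 3], "G2": [2, 4, 6], "G3": [5, 5, 5]}  # toy data
    sets = {"pathway_X": ["G1", "G2", "G_unmeasured"]}
    for feature, values in add_gene_set_features(expr, sets).items():
        print(feature, values)
```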

89

BIOINFORMATICS ANALYSIS OF THE MOST POTENT TUMOR SUPPRESSOR MICRORNAS IN HEPATOCELLULAR CARCINOMA REVEALING NEW LINKS TO IMMUNE SYSTEM MODULATION AND INSIGHTS INTO CANCER PATHWAYS Mahmoud ElHefnawi1, Bangly Soliman1,2, Mohamed Ghazy3, Ahmed Salem3 1. Centre of Excellence for Advanced Sciences, Informatics and Systems, National Research Centre, Cairo, Egypt 2. Biochemistry, Faculty of Science Ainshams University, Cairo, Egypt 3. Biochemistry, Faculty of Science Ainshams University, Cairo, Egypt Interest in miR-34a, let-7a, and miR-199 a&b is growing as more insights into their roles as master regulators of cellular processes emerge. These 3 micro-RNAs possess tumor suppressor activity that makes them potential new anti-cancer agents for hepatocellular carcinoma. In our current study, we performed in silico functional enrichment analysis using four innovative servers (miRror Suite, miRWalk, miRGator v3.0, and GeneTrail) in order to demonstrate the combinatorial and individual regulation by these 3 suppressor miRs of the expression of hundreds of specific target genes involved in a variety of pathways of the immune system and HCC/cancer hallmarks. We determined eighty-seven common target genes which are coordinately regulated by our 3-miRNA set using miRror 2.0 target prediction programs with p-value ≤ 0.05 and miRror Internal Score (miRIS) = 0. Furthermore, functional enrichment analysis of these miRNA targets by DAVID functional annotation (KEGG, BIOCARTA, GO) and REACTOME reveals two pathways linked to the immune system, eight pathways linked to HCC/cancer hallmarks, and two pathways that mediate an interconnected dual function between the immune system and HCC/cancer hallmarks. Moreover, there are seven functionally enriched GO terms. An additional interesting finding from the miRror suite is the protein-protein interaction network for the predicted common target genes of our 3-miRNA set, illustrated by STRING Cytoscape. Regarding the individual analysis of each miRNA of interest through the miRWalk database, we could determine some of the novel oncogenes which are regulated by these miRs. Furthermore, by miRWalk pathway analysis we determined many enriched immune system pathways and other cancer hallmark pathways where these 3 miRs mediate regulation. Using the miRGator v3.0 server, we analyzed deep sequencing data of liver hepatocellular carcinoma (TCGA-LIHC) under our own approaches to determine some putative targets with significant anti-correlation of expression with the canonical mature miR-34a, let-7a and miR-199 a&b, based on Pearson and Spearman correlations. As a result, the number of target genes was 36 for miR-34a, 34 for let-7a, 29 for miR-199a and 26 for miR-199b. Then, the GeneTrail server was used for functional enrichment analysis.
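
The anti-correlation screen based on Pearson and Spearman correlations can be illustrated with the following Python sketch. It is a simplified stand-in for the server-based analysis: the function names, the toy expression values and the cut-off of -0.5 are all assumptions, and no significance testing is performed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def anticorrelated_targets(mirna, genes, threshold=-0.5):
    """Genes whose Pearson and Spearman correlations with the miRNA are both <= threshold.

    The threshold is a hypothetical cut-off chosen for this example.
    """
    return sorted(
        gene for gene, expr in genes.items()
        if pearson(mirna, expr) <= threshold and spearman(mirna, expr) <= threshold
    )


if __name__ == "__main__":
    mir_expression = [1, 2, 3, 4, 5]  # toy miRNA levels across five samples
    gene_expression = {
        "GENE_DOWN": [10, 8, 7, 3, 1],
        "GENE_UP": [1, 2, 2, 4, 6],
        "GENE_FLAT": [3, 3, 3, 3, 3],
    }
    print(anticorrelated_targets(mir_expression, gene_expression))
```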

90

A (TWO-YEAR) SNAPSHOT OF BIOINFORMATICS EDUCATION AND TRAINING IN AUSTRALIA David Lovell1 1. CSIRO/Australian Bioinformatics Network, Canberra, ACT, Australia For the past two years, AustralianBIoinformatics.net, the main online presence of the Australian Bioinformatics Network, has been gathering and advertising information about bioinformatics education and training events in Australia. While not comprehensive, this is probably the best snapshot we’ve got, and in this short presentation we will see what this data tells us about learning, education and training.

91

A SURVEY OF BIOINFORMATICS TRAINING NEEDS IN AUSTRALIA Mark Crowe1 1. QFAB Bioinformatics, Brisbane, QLD, Australia In 2013, the Bioinformatics Resource Australia EMBL (BRAEMBL) undertook a survey of the bioinformatics needs of the Australian life sciences community to better understand and meet these needs. One of the most clear-cut results from this survey was the overwhelming demand for bioinformatics training, which led BRAEMBL to adopt the goal “to engage in Australia-wide training” as one of its key activities. A follow-up survey carried out by QFAB Bioinformatics reinforced this view, and refined the estimates of both the extent of demand for training and the specific bioinformatics tools and applications of most interest to researchers. In this presentation, I will review the main conclusions of the two surveys, and will discuss some of the ways in which BRAEMBL and QFAB, in partnership with the Genomics Virtual Lab project, have been working together to meet the training needs identified. These include hands-on workshops using individual cloud-based analysis servers, self-guided tutorials, video presentations, and post-training discussion and support forums.

92

A ONE-YEAR POSTGRADUATE DIPLOMA PROGRAMME FOR A FOUNDATION IN BIOINFORMATICS: A CASE STUDY IN MALAYSIA Mohammad Asif Khan1, HSA Raman1, S Tan1, NE Mohamed1, NA Azhar1, MF Sjaugi1 1. Perdana University, Serdang Selangor Darul Ehsan, Malaysia Background Bioinformatics is a transformative science that is at the crux of interdisciplinary research. With the advancements in molecular biology, genomics, genetics and medical sciences, there is currently a high demand for bioinformaticians worldwide, which is unmet, in part due to a shortage in the supply of qualified and trained bioinformaticians. In Malaysia, a number of undergraduate degree programmes have been offered by the local institutions of higher learning to meet this demand locally; however, the necessarily broad nature of these programmes makes the transition to the workplace or further education difficult for the majority of the students. We present herein a one-year Postgraduate Diploma Programme to provide students with a strong foundation in bioinformatics. Method The curriculum for the programme was developed over two years, in consultation with leading academic and industry experts of the field and also students from the life science field with an interest in bioinformatics. The process, procedures and mechanisms for the curriculum design included a comprehensive 11-step strategy. Results The programme comprises 10 courses to be covered over three trimesters for a foundation in bioinformatics for students from a pure biology or pure computer science background. The first trimester is designed to establish a common base by focussing on critical thinking, scientific communication and an overview of the core discipline areas. In the second trimester, students experience interdisciplinary courses that connect the core discipline areas. These courses are aimed at preparing the students for their research mini-project in the third trimester. All courses are taught in a sequential order, except for the research seminar (starts in the first trimester and ends in the second) and the mini-project (preparations start in the second trimester). Discussions The programme is designed with the over-arching goal of empowering bioscientists with the ability to develop and/or apply innovative bioinformatics solutions to solve biological problems, by providing them with the foundation necessary to manage and mine the wealth of available biological data for knowledge discovery. The programme aims to provide a balance between theoretical understanding and practical skills, with sufficient exposure to the research pipeline, from inception and critique of ideas to communication and defence of research findings. The programme is being implemented at Perdana University, with the first intake of students scheduled for September 2014.

93

HOW WE BECAME BIOINFORMATICIANS: THE STUDENT EXPERIENCE Harriet Dashnow1,2,3, Marek Cmero4, Andrew Lonsdale5 1. Life Science Computation Centre, Victorian Life Sciences Computation Initiative, Carlton, VIC, Australia 2. The University of Melbourne, Parkville, VIC, Australia 3. Murdoch Childrens Research Institute, Parkville, VIC, Australia 4. Centre for Neural Engineering, The University of Melbourne, Carlton, Victoria, Australia 5. School of Botany, University of Melbourne, Melbourne, VIC Publish consent withheld

94

A SOFTWARE CARPENTRY PERSPECTIVE OF BIOINFORMATICS TRAINING Scott C Ritchie1,2, Damien Irving2,3, David Flanders2 1. Medical Systems Biology, Department of Pathology & Microbiology and Immunology, The University of Melbourne, Parkville 2. ITS Research, The University of Melbourne, Parkville 3. School of Earth Sciences, The University of Melbourne, Parkville Programming is increasingly becoming an essential skill for researchers in the life sciences. However, most scientists doing bioinformatics receive no formal training in programming, which inhibits both research efficiency and reproducibility. Software Carpentry is a volunteer organisation whose goal is to make scientists more productive, and their work more reliable, by teaching them basic programming skills through intensive two-day workshops (bootcamps). In this session we will discuss the Software Carpentry philosophy and reflect on our experiences running bootcamps for bioinformaticians in the Melbourne region.

95

BIOINFORMATICS AS ENGINEERING OR SCIENCE? A TALE OF TWO DEGREES Bruno Gaeta1 1. UNSW, UNSW-Sydney, NSW, Australia The need for just-in-time bioinformatics training for biologists is well-established. However, the jury is still out when it comes to formal education in bioinformatics. Bioinformaticians are called to apply computational methods to life science data with a view to contributing to biological discoveries – a scientific task, but they are also often required to design and implement new methods and infrastructure for life science computing – a task that draws on engineering skills and mindset. Putting the two together into one degree is difficult, especially at the undergraduate level, given the breadth of foundational knowledge required. UNSW has for the last 13 years offered a degree in Bioinformatics engineering with a strong engineering core. This degree is now being complemented by a Bachelor of Science in bioinformatics that aims to train biologists with a strong computational focus. The two degrees share common courses but each has a distinct emphasis and target audience.

96

DELIVERING BIOINFORMATICS TRAINING USING CLOUD COMPUTING INFRASTRUCTURE Nathan S Watson-Haigh1 1. Australian Centre for Plant Functional Genomics, Urrbrae, SA, Australia

Not available at time of print

97

STRENGTHENING BIOINFORMATICS CAPABILITIES AT CSIRO Annette McGrath1 1. CSIRO, Canberra, ACT, Australia CSIRO is Australia’s national science agency. As a highly multidisciplinary research organisation in which the life sciences are an important focus, bioinformatics is a vital, but complex undertaking. Surveys in 2011 indicated that lack of awareness and training were the main factors limiting the application of bioinformatics in CSIRO. The varying backgrounds and skills of bioinformatics practitioners add to the complexity. In this talk I will share what steps were taken to address both of these issues. From in-house developed programs on in-house hosted platforms through to national and international collaborations, I will share insights gained from a spectrum of bioinformatics initiatives aimed at raising bioinformatics capability across CSIRO.

99

GENOME SCALE REGULATORY NETWORK MODELING Lars Nielsen1 1. AIBN, University of Queensland, QLD, Australia Curated cellular signaling databases such as Reactome, Panther and NCI are approaching or exceeding many metabolic models. Conventional tools used for signal transduction models are unsuited to modeling networks with 1,000+, let alone 10,000+, entities. While catalytic cascades cancel the advantage of flux balance modeling (and flux model formulation carries significant overheads), a direct logical translation of biochemical reaction networks is possible. Moreover, a biochemical interpretation of inhibition overcomes a common problem of Boolean formulation and greatly reduces logical incoherence in large models. Using efficient pruning strategies and a linearly scalable algorithm, it is possible on a standard PC to compute all minimal (unique) input sets from 2909 sources capable of generating each of 1851 outputs in Reactome. While the framework was initially restricted to point-stable signal transduction systems, several extensions to differentiating or oscillating systems have been developed.
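The idea of a direct logical translation and of minimal input sets can be illustrated on a toy network. The sketch below is only an illustration under stated assumptions: the reactions and entity names are hypothetical, each reaction is read as an AND over its reactants and each entity as an OR over the reactions that produce it, and the search is plain brute force. It does not reproduce the pruning strategies or the linearly scalable algorithm mentioned in the abstract, and it ignores inhibition.

```python
from itertools import combinations

# Hypothetical toy network: each reaction is (set of reactants, product).
REACTIONS = [
    ({"ligand", "receptor"}, "complex"),
    ({"complex"}, "kinase_active"),
    ({"stress"}, "kinase_active"),
    ({"kinase_active", "substrate"}, "output"),
]


def closure(inputs, reactions):
    """Return every entity that becomes present when the inputs are supplied."""
    present = set(inputs)
    changed = True
    while changed:
        changed = False
        for reactants, product in reactions:
            # A reaction fires only if all reactants are present (logical AND);
            # an entity is present if any producing reaction fires (logical OR).
            if product not in present and reactants <= present:
                present.add(product)
                changed = True
    return present


def minimal_input_sets(sources, target, reactions):
    """Brute-force all minimal subsets of sources that can generate target."""
    minimal = []
    for size in range(1, len(sources) + 1):
        for combo in combinations(sorted(sources), size):
            candidate = set(combo)
            # Any superset of a known minimal set is not minimal.
            if any(found <= candidate for found in minimal):
                continue
            if target in closure(candidate, reactions):
                minimal.append(candidate)
    return minimal


SOURCES = {"ligand", "receptor", "stress", "substrate"}
RESULT = minimal_input_sets(SOURCES, "output", REACTIONS)
```

Because subsets are enumerated by increasing size and supersets of earlier hits are skipped, every set returned is minimal; the exponential enumeration is exactly what a scalable method would need to avoid.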


100

COMBINATORIAL OPTIMISATION MODELS FOR ANALYSING BIOLOGICAL DATA SETS Regina Berretta1 1. The University of Newcastle, Callaghan, NSW, Australia This talk will present combinatorial optimisation models and algorithmic techniques that have been developed to analyse large datasets. First, the presentation will focus on an approach, based on a combinatorial optimisation problem called the (α,β)-k-Feature Set Problem, to the problem of selecting groups of features, such as genes, that discriminate between different existing classes. We will illustrate the application of different variations of the model on several datasets. Next, the presentation will illustrate how a classical and well-known combinatorial optimisation problem, the Quadratic Assignment Problem (QAP), is employed as a mathematical model to produce a visualization of a data set, based on the relationships between the elements in the data set. The visualization method can also incorporate the results of a clustering algorithm to facilitate the process of data analysis.
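To make the QAP-based layout idea concrete, the following sketch evaluates the standard QAP objective, the sum over all pairs of flow[i][j] × dist[π(i)][π(j)], and solves a tiny instance by exhaustive search. The similarity matrix, the one-dimensional slot layout and the brute-force solver are assumptions chosen for illustration; the talk's actual model and heuristics are not described in the abstract.

```python
from itertools import permutations


def qap_cost(flow, dist, assignment):
    """QAP objective: sum of flow[i][j] * dist[assignment[i]][assignment[j]]."""
    n = len(flow)
    return sum(
        flow[i][j] * dist[assignment[i]][assignment[j]]
        for i in range(n)
        for j in range(n)
    )


def solve_qap_bruteforce(flow, dist):
    """Try every assignment; feasible only for very small instances."""
    n = len(flow)
    return min(permutations(range(n)), key=lambda p: qap_cost(flow, dist, p))


# Hypothetical similarities between four data elements (higher = more related).
SIMILARITY = [
    [0, 9, 1, 1],
    [9, 0, 1, 1],
    [1, 1, 0, 9],
    [1, 1, 9, 0],
]
# Distances between four display slots placed on a line at positions 0..3.
DISTANCE = [[abs(a - b) for b in range(4)] for a in range(4)]

BEST = solve_qap_bruteforce(SIMILARITY, DISTANCE)
BEST_COST = qap_cost(SIMILARITY, DISTANCE, BEST)
```

Minimising this cost places strongly related elements in nearby slots, which is what makes the assignment usable as a visual layout. Realistic instances need heuristics, since the number of assignments grows factorially.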

101

RNA AT THE EPICENTER OF HUMAN DEVELOPMENT John Mattick1 1. Garvan Institute, Darlinghurst, NSW, Australia It appears that the genomic programming of humans and other complex organisms has been misunderstood for the past 50 years, because of the incorrect assumption that most genetic information, including regulatory information, is transacted by proteins. Derived assumptions, such as the presumed “explosive” (i.e., factorial) scaling of regulatory options by “combinatoric interactions” between regulatory proteins, are not only unjustified theoretically and mechanistically, but are clearly incorrect on the empirical evidence. Surprisingly, the human genome contains only about 23,000 protein-coding genes, similar in number to, and with largely orthologous functions to, those in other animals, including developmentally simple nematodes and sponges. On the other hand, the extent of non-protein-coding DNA increases with increasing developmental and cognitive complexity, reaching 98.5% in humans. Moreover, high throughput analyses have shown that the vast majority of the human genome is dynamically transcribed to produce a previously hidden world of different classes of small and large, overlapping and interlacing intronic, intergenic and antisense non-protein-coding RNAs. The transcriptome is in fact far more complex than the genome, which is best viewed as a zip file that is unpacked in highly stage- and cell-specific patterns during development. This is illustrated by the use of targeted RNA sequencing to reveal thousands of previously unknown exons and spliced isoforms of oncogenes and tumor suppressors, as well as at least 1500 new long noncoding RNA (lncRNA) genes in intergenic GWAS regions associated with complex diseases. The functions of lncRNAs are varied and include a number of widely expressed lncRNAs that play central roles in the formation of differentiation-specific subnuclear organelles.
However, recent evidence suggests that the main function of the tens of thousands of highly cell-specific lncRNAs is to dynamically organize chromosome territories and guide chromatin-modifying complexes to their sites of action, to specify the architectural trajectories of development. Moreover, this system has subsequently evolved plasticity, via an as-yet-unexplored universe of retrotransposon expression and mobilization, as well as RNA editing and modification, which appears to be the molecular basis of environmental-epigenome interactions and brain function.

POSTERS

201

DROSOPHILA 3'UTRS ARE MORE COMPLEX THAN PROTEIN-CODING SEQUENCES Manjula Algama1, Edward Tasker1, Christopher Oldmeadow2, Kerrie Mengersen3, Jonathan M Keith1 1. School of Mathematical Sciences, Monash University, Clayton, Victoria, Australia 2. School of Medicine and Public Health, University of Newcastle, Newcastle, NSW, Australia 3. School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia The 3' UTRs of eukaryotic genes participate in a variety of post-transcriptional (and some transcriptional) regulatory interactions. Some of these interactions are well characterised, but an undetermined number remain to be discovered. While some regulatory sequences in 3' UTRs may be conserved over long evolutionary time scales, others may have only ephemeral functional significance as regulatory profiles respond to changing selective pressures. Here we propose a sensitive segmentation methodology for investigating patterns of composition and conservation in 3' UTRs based on comparison of closely related species. We describe encodings of pairwise and three-way alignments integrating information about conservation, GC content and transition/transversion ratios, and apply the method to three closely related Drosophila species: D. melanogaster, D. simulans and D. yakuba. Incorporating multiple data types greatly increased the number of segment classes identified compared to similar methods based on conservation or GC content alone. We propose that the number of segments and the number of types of segment identified by the method can be used as proxies for functional complexity. Our main finding is that the number of segments and segment classes identified in 3' UTRs is greater than in the same length of protein-coding sequence, suggesting greater functional complexity in 3' UTRs. There is thus a need for sustained and extensive efforts by bioinformaticians to delineate functional elements in this important genomic fraction.
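An alignment encoding of the kind described above might look like the following sketch, which collapses each column of a pairwise alignment into one symbol capturing conservation, GC content and substitution type. The five-symbol alphabet is a hypothetical choice made for illustration and is not the authors' encoding; the segmentation step itself is not shown.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def encode_column(ref, other):
    """Encode one pairwise alignment column as a single symbol.

    Hypothetical alphabet (an assumption, not the published encoding):
      'S' / 'W' : conserved strong (G/C) or weak (A/T) base
      'i' / 'v' : transition or transversion
      '-'       : gap or ambiguous base in either sequence
    """
    ref, other = ref.upper(), other.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or other not in valid:
        return "-"
    if ref == other:
        return "S" if ref in "GC" else "W"
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    same_class = ({ref, other} <= PURINES) or ({ref, other} <= PYRIMIDINES)
    return "i" if same_class else "v"


def encode_alignment(seq_a, seq_b):
    """Encode two aligned sequences of equal length, column by column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(encode_column(a, b) for a, b in zip(seq_a, seq_b))


ENCODED = encode_alignment("ACGT-AGC", "ACATTCGN")
```

A segmentation model would then be run over the encoded string, so that segment classes reflect joint patterns of conservation and composition rather than either signal alone.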

202

SIDRAiTrip: A HIGH PERFORMANCE TRANSLATIONAL RESEARCH PLATFORM FOR PERSONALIZED MEDICINE Rashid Al-Ali1, Nagarajan Kathiresan1, Rawan AlSaad1, M. Ramzi Temanni1, Abdou Kadri1, Emad ElSebakhi1, Radja Badji1, Francesco Marincola1 1. Sidra Medical and Research Center, Doha, Qatar Recent advances in Next Generation Sequencing (NGS) technology have made it possible to link omics data, such as genomics, proteomics and metabolomics, with clinical data; the result is referred to as a clinico-genomic environment. This integrated clinico-genomic data is a basic building block for Precision Health and Personalized Medicine (PHPM). Scientists and clinicians can design research experiments and choose patient cohorts based not only on clinical data but also on omics data, in order to design the most valuable personalized medicine. Achieving efficient personalized treatments is often hindered by the lack of integrated translational research informatics platforms. The SIDRA integrated Translational Research Informatics Platform (SIDRAiTrip) is a research and development project at SIDRA Medical and Research Center. The objective of this project is to design, develop, implement, deploy and demonstrate a novel, comprehensive, feature-rich, and secure biomedical research platform. SIDRAiTrip uses an effective “cohort selection” process over petabytes of genomic and clinical data, which necessitates the development of new types of translational research informatics platforms that allow physicians, scientists and translational researchers to mine, analyze and visualize a variety of omics data in the context of defined clinical outcomes. The SIDRAiTrip platform uses a 3-tier approach: a) a data processing layer, b) a data aggregation layer and c) a user-interface layer. SIDRAiTrip utilizes the High Performance Computing (HPC) paradigm as the underlying computing infrastructure to address compute- and memory-intensive genomic analysis and data-intensive clinical analytics.
Furthermore, SIDRAiTrip addresses the so-called ‘translational informatics’ concept, which means expediting research studies and discoveries from the bench, i.e. raw genomics data produced by the NGS instruments, all the way to the clinical side, i.e. clinicians accessing the processed data from the translational research platform. Hence, SIDRAiTrip provides hassle-free prediction and a new approach to realizing translational informatics, leading to precision health and personalized medicine.


203

AN ASSESSMENT OF NCRNAS IN TRYPANOSOMA CRUZI Maina Bitar1,2, Priscila Grynberg3, Martin A Smith2, Gloria R Franco1,2, John S Mattick 1. Universidade Federal de Minas Gerais, Belo Horizonte, Brazil 2. Garvan Institute for Medical Research, Sydney, Australia 3. EMBRAPA / CENARGEN, Brasilia, Brazil The prediction of non-coding RNA (ncRNA) expression, structure and function is a rapidly expanding field of research. A great variety of ncRNAs with different regulatory, catalytic and structural functions have been described. We performed in silico experiments to predict and classify ncRNAs of the protozoan Trypanosoma cruzi, the causative agent of Chagas disease. A total of 4195 ncRNA candidates were identified through comparative genomics between T. cruzi and Trypanosoma brucei, using eQRNA to identify compensatory mutations. Of the 1382 candidates that did not present significant protein-coding potential, 49 were classified as tRNAs or rRNAs and 29 showed similarity to previously characterised ncRNAs from public databases. Here, we describe a novel in silico protocol for the identification of ncRNAs in different life-cycle stages of T. cruzi. We have compared the mapping efficiency of over 22 million publicly available RNAseq reads of the Y strain to each of the 8 currently sequenced T. cruzi genomes using BWA and Bowtie. The best results regarding mapping quality were obtained with Bowtie, allowing 3 mismatches between aligned sequences, against the genomes of the CL Brener Esmeraldo-like, CL Brener non-Esmeraldo-like and Sylvio strains. To account for problematic data features, such as short length, the genetic difference between strains, and the poor assembly and annotation of the genomes, we decided to consider only those reads which mapped to orthologous regions in all three aforementioned genomes. The final sets of ncRNA candidates from both strategies were compared and further annotated based on currently available ncRNA databases.
These were then submitted to structural analyses using the DotAligner algorithm for RNA structure clustering. Next we intend to assess the different functional classes of ncRNAs from T. cruzi and contribute to a more thorough understanding of the role of these RNAs in parasite evolution, development and pathogenicity.

204

A STRATEGY OF GENE PRIORITIZATION BY INTEGRATING GENETIC RESOURCES WITH IMPROVED TOPSIS Jingmin Che1, Miyoung Shin1 1. Bio-Intelligence & Data-Mining Laboratory, School of Electronics Engineering, Kyungpook National University, Daegu, Korea Publish consent withheld


205

DIFFERENCES IN EARLY TRANSCRIPTION FACTOR UPREGULATION UNDERLIE SOCIALLY-INDUCED DEVELOPMENTAL PLASTICITY IN THE AUSTRALIAN BLACK FIELD CRICKET Zhiliang Chen1, Michael M Kasumovic2, Marc Wilkins1,3 1. Systems Biology Initiative, University of New South Wales, Sydney, NSW 2052, Australia 2. Evolution & Ecology Research Centre, School of Biological, Earth & Environmental Sciences, The University of New South Wales, Sydney, NSW 2052, Australia 3. School of Biotechnology and Biomolecular Science, University of New South Wales, Sydney, NSW 2052, Australia

Background Juvenile developmental trajectories in the Australian black field cricket (Teleogryllus commodus) are influenced by a whole suite of biotic and abiotic factors. Male and female individuals alter their developmental trajectory, and therefore their adult morphology and behaviour, as a consequence of the calls they hear during maturation. The ecological and evolutionary consequences of this developmental plasticity are well understood. However, we have a poor understanding of the underlying mechanisms controlling early developmental decisions. Here we use RNA-seq to assemble a transcriptome for the black field cricket. We used our transcriptome to explore differences in the transcription factors expressed in the brains of last juvenile instar males and females reared in two different ecologically relevant social environments and sampled at two time points.

Results We assembled the T. commodus transcriptome using a total of 489.7 million RNA-seq reads and three assemblers: Trans-ABySS, Velvet-Oases and Trinity. Among the three assemblers, Velvet-Oases assembled 80,476 transcripts, with the longest contig (49,365 bp), the highest average contig length (2,484 bp), and the highest number of transcripts (47,763; 59.2%) with significant similarity to Drosophila melanogaster isoforms. The differential expression analysis shows that both treatment groups up-regulated transcripts associated with moulting later in development. Individuals exposed to cricket calls up-regulated 17 transcription factors associated with sexual development only at the earlier time point, while individuals in the silent (control) treatment up-regulated 12 of this group of transcription factors only at the later time point. A number of transcription factors associated with developmental maturation and neuronal development were also found to be up-regulated earlier in the individuals exposed to cricket calls.
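Summary figures of the kind used above to compare assemblers (number of transcripts, longest contig, average contig length) can be computed from a list of contig lengths. The sketch below is a minimal illustration using made-up lengths; the function name and the toy data are assumptions and do not come from the study.

```python
def contig_stats(lengths):
    """Summarise an assembly from its contig lengths (in bp)."""
    if not lengths:
        raise ValueError("no contigs supplied")
    total = sum(lengths)
    return {
        "count": len(lengths),           # number of assembled contigs
        "longest": max(lengths),         # longest contig in bp
        "mean": total / len(lengths),    # average contig length in bp
        "total": total,                  # total assembled bases
    }


# Hypothetical contig lengths for a toy assembly.
TOY = contig_stats([500, 1200, 300, 4000, 2000])
```

Running the same function over the output of each assembler gives directly comparable rows for a table of assembly metrics.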

Conclusions Our results demonstrate that individual developmental trajectories and adult behaviours are associated with differences in early expression of transcription factors expressed as a consequence of ecologically relevant stimuli.

206

ASSEMBLY OF SALMONELLA ENTERICA SER. TYPHI TOLC IN DMPE AND POPE Yee Siew Choong1, Siew Wen Leong1, Theam Soon Lim1, Gee Jun Tye1 1. Universiti Sains Malaysia, Minden, PG, Malaysia The TolC protein is found in many pathogenic Gram-negative bacteria. It is an outer membrane channel for the expulsion of drugs and toxins from the cell. In Salmonella enterica ser. Typhi, the causative agent of typhoid fever, the TolC outer membrane protein is also found to be antigenic. Since the lipid environment is an important modulator of membrane protein structure and function, TolC from Salmonella enterica serovar Typhi was assembled in two different lipid membranes, namely DMPE and POPE. The conformation of TolC from molecular dynamics simulation in DMPE and POPE bilayers was evaluated. S. Typhi TolC showed conformational dynamics similar to those of the TolC protein family. Flexibility of the protein is seen in the C-terminal, extracellular loops and α-helical region. Similar TolC conformational behaviour, namely rotational motion of the C-terminal residues and extracellular loops, was observed in both DMPE and POPE bilayers. Nevertheless, hydrophobic matching effects of the TolC protein were exhibited in DMPE, particularly lengthening of the lipids and subtle movements of the β-barrel towards the lower leaflet. The study demonstrated the use of molecular dynamics simulation in revealing the differential effects of membrane protein and lipids on each other. In this case, POPE is the more suitable lipid for further simulation of the S. Typhi TolC protein.


207

MICROBIAL COMMUNITY PATTERN DETECTION IN HUMAN BODY HABITATS VIA ENSEMBLE CLUSTERING FRAMEWORK Peng Yang1, Xiaoquan Su2, Le Ouyang3, Hon-Nian Chua1, Kang Ning2, Xiao-Li Li1 1. Institute for Infocomm Research, A*STAR, Singapore 2. Qingdao Institute of Bioenergy and Bioprocess Technology, Qingdao, China 3. Sun Yat-Sen University, Guangzhou, China

Background The human habitat is a host where microbial species evolve, function, and continue to evolve. Elucidating how the microbial community responds to human habitats is a fundamental and critical task, as establishing baselines of the human microbiome is essential to understanding its role in disease and health. Recent studies on the healthy human microbiome focus on particular body habitats, assuming that microbiomes develop similar structural patterns to perform similar ecosystem functions under the same environmental conditions. However, current studies usually overlook the complex and interconnected landscape of the human microbiome and restrict their analysis to particular body habitats, using learning models with a specific criterion. Therefore, these methods cannot capture the underlying microbial pattern efficiently.

Results To obtain a comprehensive view, we propose a novel ensemble clustering framework to structure microbial community patterns on large-scale metagenomic data. We first build a microbial similarity network by integrating 1920 metagenomic samples from three body habitats of healthy adults. Then a novel symmetric Nonnegative Matrix Factorization (NMF) based ensemble model is applied to the network to detect clustering patterns. Experiments are conducted to evaluate the effectiveness of our model in deriving microbial communities with respect to body habitat and host gender. From the clustering results, body habitat exhibits a strong bound but non-unique microbial structural pattern. Meanwhile, the human microbiome reveals different degrees of structural variation over body habitat and host gender.
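For readers unfamiliar with symmetric NMF, the sketch below factorises a small similarity matrix A as H·Hᵀ with non-negative H and reads cluster labels from the strongest factor in each row. It uses a commonly used damped multiplicative update and plain Python lists; the toy similarity network, the parameter values and the single (non-ensemble) model are assumptions for illustration and should not be read as the authors' ensemble method.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def symmetric_nmf(a, k, iterations=200, seed=0, eps=1e-12):
    """Approximate symmetric non-negative a by h @ h.T with h >= 0.

    Update rule (an assumed, commonly used choice):
        h <- h * (0.5 + 0.5 * (a @ h) / (h @ h.T @ h))
    """
    rng = random.Random(seed)
    n = len(a)
    h = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]  # positive start
    for _ in range(iterations):
        numerator = matmul(a, h)
        denominator = matmul(h, matmul(transpose(h), h))
        h = [
            [h[i][j] * (0.5 + 0.5 * numerator[i][j] / (denominator[i][j] + eps))
             for j in range(k)]
            for i in range(n)
        ]
    return h


def reconstruction_error(a, h):
    """Frobenius norm of a - h @ h.T."""
    approx = matmul(h, transpose(h))
    n = len(a)
    return sum((a[i][j] - approx[i][j]) ** 2 for i in range(n) for j in range(n)) ** 0.5


# Hypothetical similarity network: samples 0-2 and 3-5 form two tight groups.
A = [[1.0 if (i < 3) == (j < 3) else 0.0 for j in range(6)] for i in range(6)]
H = symmetric_nmf(A, k=2)
LABELS = [max(range(2), key=row.__getitem__) for row in H]  # strongest factor per sample
```

An ensemble approach would combine several such factorisations (for example from different similarity measures or initialisations) into a consensus clustering; that step is omitted here.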

Conclusions In summary, our ensemble clustering framework can efficiently explore integrated clustering results to identify accurate microbial communities. The clustering results indicate that the structure of the human microbiome varies systematically across body habitats and host genders. Such trends depict an integrated biography of microbial communities, which offers new insight towards uncovering a pathogenic model of the human microbiome.

208

STRUCTURAL VARIATIONS AS A METHOD FOR PHYLOGENETIC RECONSTRUCTION OF SUB-CLONAL TUMOUR EVOLUTION Marek Cmero1, Geoff J Macintyre1, David C Wedge2, Christopher M Hovens3 1. University of Melbourne, Parkville, VIC, Australia 2. Wellcome Trust Sanger Institute, Hinxton, UK 3. Department of Surgery, Royal Melbourne Hospital and the Australian Prostate Cancer Research Centre Epworth, Richmond, VIC Tumour evolution is a complex and multifaceted process that arises from the driving forces of carcinogenesis. Intra-tumour heterogeneity results in distinct cellular populations with inheritable genetic characteristics that can be observed within a single tumour. Additionally, primary tumours can seed metastases in distant parts of the body from one or several cellular sub-populations. The ability to trace the progression of a cancer, by identifying sub-populations and inferring the relationships between them from shared genetic features, has only recently become feasible. Next-generation sequencing technologies are able to provide fine-grained genomic data which can quantify the relationships between intra-tumour populations as well as distant metastases. Understanding these relationships using methods of phylogenetic reconstruction can inform the evolution of invasive or metastatic genetic changes in the evolutionary history of a cancer. This information can also assist in prognostication and prediction of cancer evolution in a clinical setting. Several potential methods exist for reconstructing the phylogeny of cancer populations, including those based on single-nucleotide variations (SNVs) and copy-number variations (CNVs), although there is no gold standard approach. Particularly in prostate cancer, structural variations (SVs) are commonly observed mutational events in the genome, comprising insertions, deletions, duplications, translocations and/or inversions.
By comparing multiple cancer samples from the same patient, distinct cellular populations and their ancestral relationships can be de-convolved and the occurrence of a particular SV within a cancer's evolution can be estimated. We present a method that seeks to reconstruct the phylogenetic relationships of a tumour's sub-clonal cellular populations using structural variation data, detected using the Socrates algorithm. We demonstrate that tumour phylogenies can be reconstructed with SV data alone, and that SVs can play a useful role in resolving uncertainties in particular tree branches when compared to other data, such as SNVs and CNVs.


209

PROTEOGENOMIC WORKFLOWS ON DRAFT GENOMES Apurv Goel1, Karthik Kamath1, Ignatius Pang2, Aiden Tay2, Marc Wilkins2, Brett Cooke1 1. Macquarie University, Sydney, NSW, Australia 2. System Biology Initiative, University of New South Wales, Sydney, NSW, Australia We sought to improve our in-house genome annotation abilities through the use of proteogenomic pipelines. By combining mass spectrometry data with genomic information, we increased the number of novel proteins detected compared with methods such as Mascot searching against a FASTA database created through gene prediction (such as Glimmer) or a simple 6-frame translation. We used the Nexus Proteogenomic pipeline, which works through the creation of intermediate ‘virtual proteins’ which can be confirmed by physical evidence, i.e. mass spectrometry, and combined to form expected proteins. We first tested the pipeline on an unannotated, incomplete genome, that of Scedosporium aurantiacum. This was only available to us as ~10000 contigs; nevertheless, the pipeline was able to outperform a standard 6-frame translation when detecting known proteins (verified against Swiss-Prot through batch BLAST searches). The pipeline has also been tested on the Pacific Oyster (Crassostrea gigas) genome, currently comprising 11969 contigs, with similar results. This lends weight to considering the other detected proteins as potentially novel. We find that our use of the Nexus pipeline, when accompanied by optimizations for each genome being run, gives us results similar to those of other proteogenomic pipelines. The major advantage is that it can be run on incomplete, unassembled and unannotated genomes, whereas other pipelines need complete and/or annotated genomes.
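The "simple 6-frame translation" used as a baseline above can be sketched as follows: the sequence and its reverse complement are each translated in three reading frames with the standard genetic code. This is a generic illustration, not the Nexus pipeline; the function names and the choice to render unknown codons as 'X' are assumptions.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second, then third base in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons with unknown bases (e.g. N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3 ('*' marks a stop codon)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


FRAMES = six_frame_translation("ATGGCCTAA")
```

In a proteogenomic search, each frame would typically be split at stop codons and the resulting peptides written to a FASTA database for matching against mass spectra.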

210

GENOTYPING MICROSATELLITES IN NEXT-GENERATION SEQUENCING DATA Harriet Dashnow1,2,3, Susan Tan4, Debjani Das4, Simon Easteal4, Alicia Oshlack2,3 1. Victorian Life Sciences Computation Initiative, Carlton, VIC, Australia 2. The University of Melbourne, Parkville, VIC, Australia 3. Murdoch Childrens Research Institute, Parkville, VIC, Australia 4. John Curtin School of Medical Research - ANU, Canberra, ACT, Australia Publish consent withheld

211

SEQUENCING, ASSEMBLY AND COMPARATIVE ANALYSIS OF FIVE STRAINS OF THE FUNGAL PATHOGEN CRYPTOCOCCUS GATTII Nandan P Deshpande1, Yu-Wen Lai2, Leona Campbell2, Chi Nam Ignatius Pang1, Dee Carter2, Marc Wilkins1 1. Systems Biology Initiative, University of New South Wales, Sydney, NSW, Australia 2. Biochemistry, School of Molecular Bioscience, The University of Sydney, Sydney, NSW, Australia Fungal diseases are an increasing problem worldwide. Current anti-fungal drugs are limited by their spectrum of activity or toxicity, and resistance is an emerging issue. Cryptococcus gattii, a basidiomycete yeast, is an emerging agent of cryptococcosis in healthy individuals. Current treatment of cryptococcosis involves induction with amphotericin B (AMB) plus 5-flucytosine, with maintenance using fluconazole (FLC). However, some C. gattii strains have been found to be inherently FLC resistant. Chong et al (2010) found very high inherent resistance in C. gattii strain 97/170, with intermediate resistance found in additional strains. We have begun to explore novel synergistic methods to improve the efficacy of FLC and reduce resistance using iron (Fe) chelators. Our group is using fungal gene expression networks during a synergistic antifungal response to understand the synergistic process. We have sequenced C. gattii strain 97/170 using Illumina next generation sequencing. Four additional C. gattii strains of varying resistance levels were also sequenced (average genome size ~17.5 Mb), to allow a comprehensive analysis of the influence of species, strain and inherent antifungal resistance on potential synergy with Fe chelators. We used both de novo and comparative genome annotation methods with the gene prediction tool Augustus to define an average of 7300 gene models across the C. gattii genomes. The quality of gene prediction for strain 97/170 was validated using RNA-Seq data generated for this strain.
Comparative genomics analysis was carried out using the genome-based alignment tools Mauve and MUMmer, and the protein ortholog clustering tool OrthoMCL. The highly collinear genomes contained low levels of repetitive DNA. Proteins specific to individual strains, and those with varying conservation patterns across the strains, were highlighted. These high-resolution draft genomes will be used as references in the RNA-seq-based differential expression analysis, leading into the development of co-expression networks for genes influenced by drug-chelator synergy.

212

MODELLING THE INSULIN SIGNALLING NETWORK: UNRAVELLING THE MOLECULAR MECHANISMS OF INSULIN RESISTANCE Westa Domanova1,2,3, James Krycer3,4, Fatemeh Vafaee2, David James2,3, Zdenka Kuncic1,2 1. University of Sydney, Sydney, NSW, Australia 2. Charles Perkins Centre, Sydney, NSW, Australia 3. Garvan Institute of Medical Research, Darlinghurst, NSW, Australia 4. The University of New South Wales, Sydney, NSW, Australia Intracellular signalling networks are robust due to feedback mechanisms and cross talk between pathways. Rewiring or short-circuiting this network within the insulin signalling pathway can lead to insulin resistance. This is characterized by a reduced cellular response to insulin, a hallmark of type 2 diabetes. Although some important nodes of the insulin signalling network are known, they do not explain insulin resistance. Thus, we wish to better understand how insulin mediates its effects upon the cell by building up the insulin signalling network a priori.

To address this, we have previously performed a phosphoproteomic screen investigating insulin action over time in adipocytes (fat cells). Previous analysis has included clustering and machine learning to predict novel substrates of the three major kinases (protein signalling hubs) Akt, mTOR and PKA. Here, we extend this study, using statistical, mathematical and computational techniques to build up the insulin signalling network a priori. This is done by combining the phosphorylation time-course data with: (1) experimentally-validated kinase-substrate interactions, from databases and the literature; and (2) predicted interactions (e.g., from protein-protein interaction studies, consensus sequences). This will enable us to perform a more statistically rigorous analysis with improved predictive power. Our preliminary results suggest involvement of kinases (e.g. G protein coupled receptor kinase) that have not been associated with the insulin signalling pathway before. Furthermore, our data shows that proteins are phosphorylated at different time points depending on their kinase and intrinsic properties such as location, sequence motif and abundance. We aim to develop new mathematical and computational techniques to assign kinases to phosphorylation events in the insulin signalling network. This will lead to a better understanding of the mechanisms driving kinase action and discovery of potential therapeutic targets for overcoming type 2 diabetes.

213

COMPUTATIONAL PREDICTION OF PROTEIN INTERACTION MOTIFS FROM INTEGRATED PROTEIN SEQUENCE, STRUCTURE AND INTERACTION DATA. Richard J Edwards1,2,3, Nico Palopoli2 1. School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia 2. Centre for Biological Sciences, University of Southampton, Southampton, Hampshire, UK 3. Institute for Life Sciences, University of Southampton, Southampton, Hampshire, UK Protein-protein interactions (PPI) between globular domains and Short Linear Motifs (SLiMs) play a crucial part in many biological processes. SLiMs are short stretches of 5 to 15 amino acids with high evolutionary plasticity that are usually found in disordered regions of proteins. Their role as ligands for molecular signalling, post-translational modifications and subcellular targeting has been increasingly studied over recent years, but experimental discovery of SLiMs remains a challenging task due to their small size and high degeneracy. As a consequence, computational tools for prediction and analysis of SLiMs are a valuable resource. We have previously developed SLiMFinder1, a motif discovery tool that applies a model of convergent evolution to estimate the statistical significance of over-represented motifs with high specificity2. Here, we aim to improve motif discovery by integrating SLiMFinder with methods that predict new domain-motif interactions directly from structural features in high-resolution 3D data. To this end we have developed “Query” SLiMFinder (QSLiMFinder), which uses knowledge of the interaction interface to constrain the motif search space and thereby increase search sensitivity. We have benchmarked QSLiMFinder using the Eukaryotic Linear Motif (ELM) database and simulated data. As expected, specific domain-motif interaction data can increase the power of de novo SLiM prediction from a set of proteins with a common PPI partner.
We are now applying QSLiMFinder to large-scale analysis of public PPI and 3D structure data. Domain-motif interactions are predicted from structures in the protein data bank (PDB). QSLiMFinder then identifies patterns within the putative motif region that are over-represented in the other known PPI partners of the domain-containing protein. This will add crucial molecular details to the interactome.


214

A SITE FOR DIRECT INTEGRIN ΑVΒ6•UPAR INTERACTION FROM STRUCTURAL MODELLING AND DOCKING Sowmya Gopichandran1 1. Macquarie University, Sydney, NSW, Australia Integrin αvβ6 is an epithelially-restricted heterodimeric transmembrane glycoprotein, known to interact with the urokinase plasminogen activating receptor (uPAR), playing a critical role in cancer progression. While the X-ray crystallographic structures of segments of other integrin heterodimers are known, there is no structural information for the complete αvβ6 integrin to assess its direct interaction with uPAR. We have performed structural analysis of αvβ6•uPAR interactions using model data with docking simulations to pinpoint their interface, in accord with earlier reports of the β-propeller region of integrin α-chain interacting with uPAR. Interaction of αvβ6•uPAR was demonstrated by our previous study using immunoprecipitation coupled with proteomic analysis by mass spectrometry. Recently this interaction was validated with proximity ligation assays and peptide arrays. The data suggested that two potential peptide regions from domain II and one peptide region from domain III of uPAR, interact with αvβ6 integrin. Only the peptide region from domain III is consistent with the three-dimensional interaction site proposed in this study. The molecular basis of integrin αvβ6•uPAR binding using structural data is discussed for its implications as a potential therapeutic target in cancer management.

215

DETECTING THE CHARACTERISTICS OF HUMAN BRANCH POINT SEQUENCE (BPS) USING A NOVEL PREDICTION MODEL Dianjing GUO1, Qing Zhang, Xiaodan Fan, Yejun Wang 1. The Chinese University of Hong Kong, Hong Kong In mammalian spliceosome assembly, splicing factor 1 (SF1) and the 65 kDa subunit of U2AF (U2AF65) recognize the branch point sequence (BPS) and the polypyrimidine tract (PPT), respectively, which is important in forming the early E complex with U1 and the 35 kDa subunit of U2AF (U2AF35). In this paper, we propose a novel computational model (BPPT) integrating BPS and PPT characteristics for BPS prediction in human and other genomes. Specifically, a mixture model was used to infer the BPS motif and a novel scoring system was developed to estimate the affinity between U2AF65 and the query PPT sequence. BPPT was applied to all human introns to predict the candidate BPS. Analysis of the predicted BPS indicates that BPS with constitutive splice sites undergo more adaptive evolution. By estimating the relationships between predicted BPS, PPT, 5SS and 3SS in a set of orthologous introns, we find clues that BPS and PPT may co-evolve and coordinately facilitate the formation of the spliceosome.

216

PREDICTING PROTEIN-BINDING NUCLEOTIDES WITH CONSIDERATION OF A BINDING PARTNER OF RNA Narankhuu Tuvshinjargal1, Jinyong Im1, Byungkyu Park1, Wook Lee1, Kyungsook Han1 1. Inha University, Incheon, South Korea Background In recent years several computational methods have been developed to predict RNA-binding sites in protein. Most of these methods do not consider interacting partners of a protein, so they predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNAs. In contrast to the problem of predicting RNA-binding sites in protein, the problem of predicting protein-binding sites in RNA has received much less attention.

Results In this study we identified effective features of RNA and protein molecules and developed a support vector machine (SVM) model to predict protein-binding nucleotides from RNA and protein sequence data. The model that used both protein and RNA sequence data achieved an accuracy of 86.3% and a Matthews correlation coefficient (MCC) of 0.69 in 10-fold cross-validation; it achieved an accuracy of 79.2% and an MCC of 0.48 in independent testing. For comparison, we built another SVM model that uses RNA sequence alone. This model achieved an accuracy of 82.2% and an MCC of 0.63 in 10-fold cross-validation, and an accuracy of 75.5% and an MCC of 0.45 in independent testing.

Conclusions In both cross-validation and independent testing, the model that used both RNA and protein sequences performed better than the model that used RNA sequence data alone. Unlike previous computational approaches that predict the RNA- or DNA-binding residues in a protein sequence without considering the binding partners of the target protein, our prediction model predicts different binding sites for a given RNA sequence when its binding partner is changed. To the best of our knowledge, this is the first sequence-based prediction of protein-binding nucleotides that considers the binding partner of the RNA.
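
As a reading aid for the accuracy and MCC figures quoted above, the following minimal Python sketch shows how both measures are computed from a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

# Hypothetical counts (binding = positive, non-binding = negative).
acc, mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when binding and non-binding nucleotides are unevenly represented.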


217

A PARSIMONIOUS MODEL FOR PREDICTING DRUG SIDE-EFFECT PROFILES Hao Jiang, Yushan Qiu, Xiaoqing Cheng, Wai-Ki Ching The identification of potential side effects for promising drugs is one of the critical stages in drug development. The costly and time-consuming process of elucidating adverse drug effects has become a major bottleneck in drug development. Traditional drug design with a one-drug, one-target approach tends to overlook system-wide effects. Recent studies on drug side-effect prediction show a trend from independent analysis towards systematic investigation of the side-effect profiles of drugs. Therefore, a new approach capable of detecting potential drug side effects systematically is urgently needed to improve side-effect identification and, at the same time, to provide efficient evaluation schemes in drug development. In this article, we investigate the relationship between potential side effects of drug candidates and their chemical structures. We propose a novel and efficient model (SR) for drug side-effect prediction. The promising feature of the method lies in the efficiency of obtaining model parameters in training, for which a closed-form solution exists. Because the model is founded on regression, drug side-effect profiles can be evaluated efficiently. An improved version of the model (NSR) with regularization further improves the prediction accuracy. The usefulness of the proposed method is demonstrated in a cross-validation setting through prediction of 1385 side effects in the SIDER database from the chemical structures of 888 approved drugs. Our new method exhibits high efficiency as well as good accuracy compared with some state-of-the-art methods. A theoretical analysis of the regularization parameter is conducted to explain its possible role in improving both prediction accuracy and efficiency.
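
The abstract does not specify the SR and NSR models, so the sketch below uses ordinary ridge regression purely as a stand-in to illustrate what a closed-form, regularized regression fit looks like: w = (X^T X + lam * I)^(-1) X^T y. The feature matrix, labels and function names are hypothetical, and nothing here should be read as the authors' method.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(X, y, lam):
    """Closed-form ridge weights; lam > 0 keeps the system invertible."""
    d = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(xtx, xty)

# Hypothetical data: rows are drugs, columns are substructure indicators,
# y marks whether one particular side effect is reported for each drug.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]
w_plain = ridge_fit(X, y, lam=0.0)  # unregularized least squares
w_ridge = ridge_fit(X, y, lam=1.0)  # weights shrunk towards zero
print(w_plain, w_ridge)
```

In this toy setting the whole fit is a single linear solve per side effect, which is the kind of training efficiency a closed-form solution provides; the regularization term trades a little bias for more stable weights.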

218

ESTABLISHING RELATIONSHIP OF VIRUS TITER WITH AGRO-ECONOMIC CHARACTERISTICS OF TOMATO Jahangir Khan1, Wajeeha Tariq2, Muhammad Shafiq2, Muhammad Saleem Haider2 1. Engro Eximp, Lahore, Pakistan 2. Institute of Agricultural Sciences, University of the Punjab, Lahore, Punjab, Pakistan Abstract: Tomato leaf curl virus, a geminivirus, is a contagious pathogen of tomato and of many other crops and weeds. Its wide host range is due to its vector, the polyphagous whitefly (Bemisia tabaci), which also spreads it on a large scale. This pathogen is responsible for major yield reductions in tomato-growing countries. To evaluate the effect of virus titer on crop yield, ten commercial tomato cultivars were selected. The yield potential and other traits of these cultivars were assessed on the basis of symptom development and virus DNA accumulation. We establish the relationship between virus titer, symptom severity and agro-economic traits. Our results show that a high level of virus accumulation in plant tissue results in the development of severe symptoms and leads to a major reduction in yield in susceptible cultivars, but this is not true for cultivars showing intermediate resistance. The virus DNA level remains low and approximately constant in resistant cultivars and has minimal effect on yield and plant health. Key words: Geminiviruses, Tomato leaf curl virus, Vector, Virus titer, Agro-economic.

219

TOMATO LEAF CURL PALAMPUR VIRUS ASSOCIATED WITH CHILI PEPPER LEAF CURL DISEASE IN PAKISTAN Jahangir Khan1, Wajeeha Tariq2, Muhammad Shafiq2, Muhammad Saleem Haider2 1. Engro Eximp, Lahore, Pakistan 2. Institute of Agricultural Sciences, University of the Punjab, Lahore, Punjab, Pakistan Chili pepper is an important crop widely grown and consumed as a condiment in Pakistan. A distinct bipartite begomovirus, Tomato leaf curl Palampur virus (ToLCPMV), was found to be associated with chili pepper plants showing characteristic symptoms on the leaves. On the basis of these symptoms, it was predicted that ToLCPMV is associated with chili pepper leaf curl disease. The presence of ToLCPMV in chili pepper plants was confirmed by polymerase chain reaction (PCR) and Southern blotting. The PCR product was cloned into the pTZ57R/T vector and sequenced. The sequence is available in the databases under the accession number HF912449.


220

LIGAND BASED DOCKING STUDIES OF GENUS JATROPHA AGAINST HUMAN BREAST CANCER PROTEIN BRCA1 Swaminathan Krishnaswamy1, Piramanayagam Shanmughavel 1. Bharathiar Univer, Coimbatore, TN, India Background: Breast cancer is a genetic disease and one of the most common malignant diseases, especially affecting women. It occurs mostly in hereditary and sporadic forms. BRCA1 is a tumour suppressor gene and was the first breast cancer susceptibility gene to be identified. Germ line mutations and genomic rearrangements in the human BRCA1 gene can cause 80% of inherited breast cancers. The BRCA1 gene plays an important role in cell cycle control, apoptosis, maintaining genomic stability, DNA damage, DNA double-strand breaks, protein ubiquitination, transcriptional regulation, chromatin modeling, cell differentiation, cell spreading and mobility. When germ line mutations occur in human BRCA1, its normal function is blocked or it is over-expressed, which can lead to the development of breast cancer. In this study, we evaluated compounds from the herbal plant genus Jatropha as candidate breast cancer drugs, with ten commercially available breast cancer compounds as reference compounds, in molecular docking studies against the human breast cancer protein BRCA1 using the Schrodinger suite. Results: From the docking results, the Glide score, Glide energy and number of hydrogen bonds were analyzed. Among the genus Jatropha compounds, eight (Multifidol, Cleomiscosin A, 12-Deoxy-16-hydroxyphorbol, 16-Hydroxyphorbol, Fraxetin, (2α,13α,14β,20s)-2,24,25-Trihydroxylanost-7-en-3-one, 3β,14α-Hydroxypimara-7,9(11),15-triene-12-one and Multifidol glucoside) had better Glide scores and Glide energies against the human breast cancer protein BRCA1 than the commercially available breast cancer drug compounds. ADME (pharmacokinetic) properties were predicted for the best eight genus Jatropha compounds, and all were within the acceptable range.
Conclusion: From the in silico docking studies, we conclude that the following genus Jatropha compounds are the best drug candidates for breast cancer and inhibit over-expression of BRCA1 in humans: Multifidol, Cleomiscosin A, 12-Deoxy-16-hydroxyphorbol, 16-Hydroxyphorbol, Fraxetin, (2α,13α,14β,20s)-2,24,25-Trihydroxylanost-7-en-3-one, 3β,14α-Hydroxypimara-7,9(11),15-triene-12-one and Multifidol glucoside.

221

INTEGRATIVE ANALYSIS OF MULTI-OMICS DATA TO DISCOVER NOVEL PROTEIN FORMS

Dhirendra Kumar 1 2 Amit Kumar Yadav 1 Xinying Jia 2 Jason Mulvenna 2 Debasis Dash 1 CSIR-Institute of Genomics and Integrative Biology, New Delhi, India QIMR Berghofer Medical Research Institute, Herston, QLD, Australia

Rattus norvegicus (Norway Rat) is a model organism for the study of human diseases but annotation of the rat genome lags behind similar efforts in human and mice. Since RNA splicing patterns are reported to be specific to either the rat or mouse, mouse annotations are not likely to be a suitable surrogate for rat proteins and protein evidence specific to the rat is urgently needed. We have developed a novel analysis pipeline, EuGenoSuite, and analysed publicly available RNA-Seq and mass spectrometry datasets to improve the rat genome annotation. Using EuGenoSuite, 276 unique mapping novel peptides were discovered in rat brain microglia tissue. Among these, 145 mapped to the intergenic regions, 28 to annotated non-coding loci, 25 to UTRs of genes, 14 to the intronic regions, 18 to a different translation frame than the annotated CDS and 45 were splice peptides. Intergenic novel peptides were mostly un-annotated parts of genes, peptides from non-coding loci highlighted translation of eight annotated pseudogenes and novel splice-junction peptides identified novel exons and splice variants. Our analysis highlights the major shortcomings in current annotations of the rat genome and improved annotations will provide a better reference for human disease studies. We are now extending this approach to the discovery of novel protein forms specific to aggressive forms of breast cancer.


222

IN SILICO APPROACH ON CXCR4 ANTAGONISTS AS POTENTIAL MICROBICIDES AGAINST HIV-1 SUBTYPE C RECEPTOR Piramanayagam Shanmughavel1 1. Bharathiar University, Coimbatore, TAMIL, India HIV/AIDS is a destructive pandemic affecting humans worldwide. Microbicides play an important role in controlling HIV. Microbicides are self-administered prophylactic agents that impede transmission of HIV. They are products designed to be applied to the vagina or rectum for the purpose of reducing the acquisition of STIs, including HIV. Antiretroviral (ARV) drugs act at different stages of the HIV life cycle, hindering the process of HIV infection and replication. ARV drugs that specifically target HIV include nucleotide reverse transcriptase inhibitors (NtRTIs), non-nucleoside reverse transcriptase inhibitors (NNRTIs) and entry inhibitors. The present study focuses on entry inhibitors because they target the viral life cycle at the initial stage of infection. Homology modeling was used to build 3D models of the CXCR4 receptor. The resultant models were subjected to structure validation and were found to be satisfactory. Protein-ligand interaction studies were carried out to predict the binding conformations of various CXCR4 antagonists with their respective receptors, which is an essential factor in designing new drugs. Of the 114 antagonists selected, about five showing the best Glide scores were shortlisted. The predicted ligand-receptor interaction models provided a satisfactory explanation for the binding between the receptor and the corresponding ligands. The entry inhibitors can be considered as potential microbicides, which can be subjected to in vitro and in vivo studies to decide their efficacy and the formulation in which they should be developed. Entry inhibitors act very early in the HIV life cycle, long before integration occurs. HIV infects a cell by binding to the CD4 receptor on the target cell membrane.
In addition to binding to CD4, HIV must also bind to a co-receptor, i.e. the chemokine receptors CCR5 and CXCR4 expressed on the cell membrane, to enter a T cell.

251

EVALUATION OF FUSION TRANSCRIPTS AS MARKERS OF PROSTATE CANCER TREATMENT RESISTANCE. John Lai123, Jiyuan An123, Inge Seim123, Carina Maree Walpole123, Chenwei Wang123, Melanie Lehman123, Judith Clements123, Colleen Nelson123, Jyotsna Batra123 1. Australian Prostate Cancer Research Center -- QLD, Brisbane 2. Translational Research Institute, Brisbane 3. Institute of Health and Biomedical Innovation, Queensland University of Technology, Woolloongabba, QLD, Australia Prostate cancer relies on the male hormone androgen for growth and survival. The action of androgens in the prostate is mediated by its cognate receptor, the androgen receptor (AR), which is a transcription factor that binds to promoters and enhancers to activate or repress gene transcription. The importance of the AR in prostate cancer is highlighted by therapeutic targeting using anti-androgens such as bicalutamide (AstraZeneca) and enzalutamide (Medivation); however, many tumours eventually progress to an anti-androgen resistant state. Here, we use RNAseq to detect 71 high-confidence fusion transcripts that are expressed in the LNCaP prostate cancer cell line after treatment with androgen (DHT) or anti-androgen drugs (bicalutamide and enzalutamide). Most of these fusion transcripts result from ‘read-through’ transcription of adjacent genes, with the distances between fusion transcripts originating from two genes that are located on the same chromosome varying from 0.4 kilobases to 61 megabases. Inspection of ChIPseq profiles indicated that there is an increase in AR occupancy around fusion transcript loci after androgen treatment, suggesting that the AR plays a role in mediating fusion transcription. Using clinical RNAseq data, we also found that a number of the androgen and anti-androgen regulated fusion transcripts were differentially expressed in prostate cancers compared to normal adjacent tissue.
Collectively, we conclude that fusion transcripts might reflect a novel mechanism adopted by prostate cancer cells as they acquire treatment resistance to anti-androgen therapy.


252

SEQUENCING ANALYSIS OF TELOMERES REVEALS UNEXPECTED SEQUENCE HETEROGENEITY Michael Lee1, Mark Hills2, Dimitri Conomos13, Michael D Stutz1, Roger R Reddel34, Hilda A Pickett13 1. Telomere Length Regulation Group, Children's Medical Research Institute, Westmead, NSW, Australia 2. Terry Fox Laboratory, BC Cancer Agency, Vancouver, Canada 3. Sydney Medical School, University of Sydney, Sydney, NSW, Australia 4. Cancer Research Unit, Children’s Medical Research Institute, Westmead, NSW, Australia Telomeres are terminal repetitive DNA sequences at the ends of chromosomes, and are considered to consist almost exclusively of the hexameric sequence TTAGGG. We analysed telomeres in humans using whole-genome sequencing followed by telomeric read extraction in a panel of mortal and immortal cell lines. We identified a wide range of telomere variant repeat sequences in human cells, and found evidence that telomerase- and ALT-mediated telomere lengthening generate variant repeats in mechanistically distinct ways. Telomerase-mediated telomere extension resulted in the generation of biased variant repeats that differed to the canonical sequence at positions 1 and 3, but not at positions 2, 4, 5 or 6. In contrast, cell lines that use the ALT pathway of telomere maintenance contained a large variety of variant repeats that differed between lines. This is consistent with variant repeats spreading from proximal telomeric regions throughout telomeres in a stochastic manner by recombination-mediated templating of DNA synthesis. The presence of unexpectedly large numbers of variant repeats in cells utilizing either telomere maintenance mechanism suggests a conserved role for variant sequences at human telomeres. To further investigate this we have carried out a comparative evaluation of telomere sequence content in different organisms and found that telomere variant repeat profiles differ between species. 
We propose that different variant repeats fulfil specific functional roles within telomeres, and are currently investigating the mechanisms underlying variant repeat generation.

253

EFFECTS OF SAMPLE SIZE AND UNBALANCE ON FINDING CANCER BIOMARKER Jie Li1, Yadong Wang 1. Harbin Institute of Technology, Harbin, China Background: Cancer biomarkers play an important role in cancer diagnosis and treatment, but very few robust molecular cancer biomarkers have been discovered in recent decades. One reason is that many researchers do not sufficiently analyse the clinical cancer samples and ignore the effects of sample size and imbalance on finding robust biomarkers. Here we study these effects on finding cancer biomarker genes in order to draw more attention to them. Methods: We identified a large number of prognostic biomarker gene sets from randomly selected breast cancer data sets with different sample sizes and ratios using a survival risk analysis method, and evaluated their stability and robustness in 8 breast cancer data sets using the proposed evaluation method. Results: Experimental results show that the number, stability and robustness of biomarker genes change significantly when sample size and ratio change. Conclusions: Sample size and imbalance have significant effects on finding stable and robust biomarker genes. A large number of cancer samples is necessary to find high-quality biomarker genes. The larger the sample size of the training data sets, the more robust the identified biomarker genes. In addition, it is critical to keep an appropriate ratio of different types of cancer samples to overcome the negative effects of sample imbalance. The identified biomarker genes are generally more robust when the sample ratio is near 1. The use of data sets from different laboratories also has an important effect on finding and testing biomarker gene sets.


254

ESTIMATION OF AMPLICON METHYLATION PATTERNS FROM BISULPHITE SEQUENCING DATA Peijie Lin1, Sylvain Foret2, Susan Wilson3, Conrad Burden1 1. Mathematical Sciences Institute, Australian National University, Canberra, ACT, Australia 2. Research School of Biology, Australian National University, Canberra, ACT, Australia 3. School of Mathematics and Statistics, University of New South Wales, Sydney, NSW, Australia Many of the known mechanisms driving gene regulation fall into the category of epigenomic modifications. One example of an epigenomic modification is DNA methylation, in which a cytosine (C) in the genomic DNA sequence can be altered by the addition of a methyl group. Methylation patterns in DNA amplicons are detected by treating with bisulphite, which converts unmethylated cytosines to uracils while leaving methylated cytosines intact. Treated DNA amplicons are sequenced, mapped to a reference genome and methylation patterns inferred. However, the bisulphite conversion is not 100% efficient, and this introduces errors in the read distribution. A second source of errors is the read errors. In this project we have developed a model for these two sources of errors, based on an incomplete conversion rate and site-dependent read error rates, both of which can be estimated independently. We have also developed an algorithm to estimate the true distribution of the methylation patterns and to predict spurious patterns which appear in the reads solely due to incomplete conversion and read errors. This algorithm is currently being developed as an R Bioconductor package. As the true distribution is always unknown in the lab, synthetic data has been constructed to test the effectiveness of the algorithm. We have found that the estimated distribution given by the algorithm is closer to the 'true' distribution than the observed read distribution. The algorithm is also effective in predicting spurious patterns. 
The results of applying the model and the algorithm to data on the methylation patterns of the honey bee amplicons are presented.
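
To make the two error sources concrete, the sketch below computes the probability of observing a methylation pattern given a true pattern under one simple set of assumptions: sites are independent, an unmethylated cytosine fails to convert with probability 1 - conv_rate, and a read error at a site flips its apparent state. These assumptions, and all names and rates, are hypothetical illustrations and are not the authors' model.

```python
def site_prob(observed, true, conv_rate, err):
    """P(observed state | true state) at one site; 1 = methylated, 0 = unmethylated."""
    # Probability that the site still looks methylated after bisulphite treatment.
    looks_meth = 1.0 if true == 1 else 1.0 - conv_rate
    # A read error is modelled as flipping the apparent state (simplifying assumption).
    p_obs_meth = looks_meth * (1.0 - err) + (1.0 - looks_meth) * err
    return p_obs_meth if observed == 1 else 1.0 - p_obs_meth

def pattern_prob(observed, true, conv_rate, errs):
    """P(observed pattern | true pattern), assuming independent sites."""
    p = 1.0
    for o, t, e in zip(observed, true, errs):
        p *= site_prob(o, t, conv_rate, e)
    return p

# Hypothetical rates: 99% conversion and a 1% read error at each of three sites.
errs = (0.01, 0.01, 0.01)
p_same = pattern_prob((1, 0, 1), (1, 0, 1), 0.99, errs)
p_spurious = pattern_prob((1, 1, 1), (1, 0, 1), 0.99, errs)
print(p_same, p_spurious)
```

Even with these small rates, about 2% of reads from a molecule with true pattern (1, 0, 1) would show the spurious pattern (1, 1, 1), which is why observed read distributions need correcting before methylation patterns are interpreted.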

255

COMPUTATIONAL ANALYSIS OF DNA REPAIR PATHWAYS USING GENE EXPRESSION DATA Chao Liu1, Kum Kum Khanna2, Sriganesh Srihari1, Peter T. Simpson3, Mark Ragan1, Kim-Anh Le Cao4 1. Institute for Molecular Bioscience, University of Queensland, St Lucia, QLD, Australia 2. QIMR-Berghofer Medical Research Institute, Brisbane, Queensland, Australia 3. The University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia 4. The University of Queensland Diamantina Institute, Brisbane, Queensland, Australia Human DNA is constantly subject to threats posed by various endogenous and exogenous factors, such as ultraviolet radiation, cigarette smoke and oxidative by-products from cellular respiration. At least six DNA repair pathways have evolved to counteract these threats. The difference in activity of these repair pathways amongst subgroups of various cancers has been associated with radio- and chemoresistance, and more recently with response to poly [ADP-ribose] polymerase 1 (PARP1) inhibitor-related targeted therapy for breast and ovarian cancer. It is thus important to investigate systematically the status of all six repair pathways in cancer, but to our knowledge no such studies have been done. As DNA repair research is a fast-advancing area, we first manually curated these pathways by combining literature search and domain expertise to provide up-to-date knowledge of them. We then evaluated the sensitivity and specificity of four popular self-contained pathway analysis methods on the curated pathways, using three publicly available gene expression data sets for which the status of each repair pathway is already known. We chose to focus on self-contained methods because they can test pathway-phenotype association directly. Our preliminary results show that all four methods display good sensitivity but poor specificity, which is likely due to pathway crosstalk.
Moreover, all these methods fail to identify defective repair pathways that are common in cancer and important for predicting therapy response. In order to obtain a better estimate of the activity of these repair pathways, we propose to develop a pathway analysis methodology that incorporates a recently proposed gene signature for detecting deficiencies in the homologous recombination repair pathway, and an algorithm that corrects for pathway crosstalk. This methodology will then be applied to breast cancer subgroups to investigate their DNA repair capabilities.


256

THE TRANSCRIPTOME SEQUENCE ANALYSIS OF ARTEMISIA FRIGIDA Yue Liu1234, Xiaoxiao Feng1, Yi Wang34, Naxin Huo34, Xu Ma1, Chuanchuan Chen1 1. Minzu University of China, Beijing, China 2. National Resource Center for Chinese Materia Medica, China Academy of Traditional Chinese Medicine, Beijing, China 3. USDA-ARS, Western Regional Research Center, Albany, CA, USA 4. Department of Plant Science, University of California, Davis, CA, USA Artemisia frigida, also known as Xiaobaihao or Hanhao, is a traditional Mongolian medicinal plant. It belongs to the genus Artemisia in the family Asteraceae. This plant is used to stanch bleeding and reduce swelling, and is widely applied in treating conditions such as bleeding, arthroncus, colds, coughs and mountain fever. Besides its medical efficacy, it is also valued as a very important feeding resource, and as a cultural symbol in the daily life of the Mongolian and indigenous peoples of America. Transcriptome sequencing of three tissues from sample N01 of Artemisia frigida was carried out using Illumina 100 bp paired-end sequencing technology on the Illumina HiSeq 2000 system. A total of 15,938,471,800 bp from 159,320,702 reads of the three Artemisia frigida samples were obtained. After removing low-quality sequence, we obtained 12,317,103,117 bp of data, which accounted for 77.28% of the raw data. These clean data were assembled into 21,221, 32,339 and 9,028 unigenes in the leaf, stem and root samples of Artemisia frigida, respectively. The N50 length of the unigenes in each sample ranged from 598 bp to 808 bp and the mean length of the unigenes ranged from 575 bp to 713 bp. We then carried out transcriptome analysis of Artemisia frigida, including gene annotation, functional classification and differential expression.
According to the analysis, the expression levels of genes encoding flavonoid biosynthesis enzymes, such as phenylalanine ammonia-lyase, cinnamic acid-4-hydroxylase, chalcone synthase, chalcone isomerase and flavone synthase II, were higher in the stem than in the leaf, and higher in the leaf than in the root. Therefore, we speculated that the most suitable medicinal parts of Artemisia frigida are the leaf and stem. In this study, the transcriptome of Artemisia frigida was analyzed, which enriched the genetic information available for Artemisia frigida and will help in studying flavonoid secondary metabolic pathways in this plant. It will also lay a foundation for improving the active ingredients of ethnic medicines by genetic engineering in the future.
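
The N50 statistic quoted for the unigene assemblies can be illustrated with a minimal Python sketch. It uses the common definition (the length L such that sequences of length at least L contain half or more of the assembled bases); whether the authors' assembler used exactly this definition is an assumption, and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half is the N50.
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical unigene lengths in bp, for illustration only.
lengths = [200, 300, 400, 500, 600, 1000]
print(n50(lengths), sum(lengths) / len(lengths))
```

Because a few long sequences contribute many bases, the N50 is typically larger than the mean length, which matches the pattern of the figures reported above.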

257

ENHANCING METABOLIC PATHWAY DATABASES WITH LOCALISATION DATA: INTEGRATING SUBA WITH ARACYC Andrew Lonsdale1 1. School of Botany, University of Melbourne, Melbourne, VIC Integrating bioinformatics resources can make use of specialised information from different sources to enhance their utility. The subcellular location database for Arabidopsis proteins (SUBA) includes locations based on predictions or experimental evidence for over 35,888 proteins (May 2014). SUBA allows structured localisation queries, including combinations of protein location, evidence type and literature reference. The BioCyc framework of metabolic databases creates pathway/genome databases (PGDB) that allow for queries based on the pathways of an organism. AraCyc is a heavily curated PGDB for Arabidopsis thaliana. AraCyc 11.5 has 321 localisations (May 2014). We present work in progress on the integration of SUBA localisation data into the Arabidopsis metabolic database (AraCyc) in a way consistent with the evidence ontology of the BioCyc framework. This integration will allow metabolic pathway focused queries to benefit from the SUBA localisation database, enhancing the utility of both SUBA and AraCyc. This approach can be extended to other organisms with BioCyc databases and to other sources of localisation data.

258

COMBINE: A BIOINFORMATICS GROUP AIMED AT STUDENTS AND EARLY-CAREER RESEARCHERS Andrew Lonsdale1, Harriet Dashnow234 1. School of Botany, University of Melbourne, Melbourne, VIC 2. Life Science Computation Centre, Victorian Life Sciences Computation Initiative, Carlton, VIC, Australia 3. The University of Melbourne, Parkville, VIC, Australia 4. Murdoch Childrens Research Institute, Parkville, VIC, Australia COMBINE is a student-run Australian organisation for researchers in computational biology, bioinformatics, and related fields. COMBINE is the official International Society for Computational Biology (ISCB) Regional Student Group (RSG) for Australia. We aim to bring together students and early-career researchers from the computational and life sciences for networking, collaboration, and professional development. Australia has many research institutes, each with their own cohorts of students. Aside from conferences, there are few opportunities that bring these students together, allowing them to discover the different kinds of research going on at other local institutes. COMBINE aims to bridge this institutional divide by organising a variety of lectures and social events that allow students to connect with each other in a casual environment.


259

COMPUTATIONAL PREDICTION OF MOLECULAR MIMICRY IN HOST-PATHOGEN PROTEIN-PROTEIN INTERACTIONS Ranjeeta Menon1, Nicolas Palopoli2, Richard J Edwards34 1. School of Biotechnology and Biomolecular Sciences, University Of New South Wales, Sydney, NSW, Australia 2. Centre for Biological Sciences, University of Southampton, Southampton, UK 3. School of Biotechnology and Biomolecular Sciences, University Of New South Wales, Sydney, NSW, Australia 4. Centre for Biological Sciences, University of Southampton, Southampton, UK Background: Short Linear Motifs (SLiMs) are short (3-15aa) segments of proteins that mediate numerous protein-protein interactions (PPI) in critical biological pathways and signaling networks. SLiMs typically occur in structurally disordered regions of proteins and have few sites specific to function, resulting in high evolutionary plasticity and frequent convergent evolution on different protein backgrounds. Such “molecular mimicry” is abundant in viruses, which exploit SLiMs to influence the molecular machinery of host cells. Methods: We are combining publicly available datasets of host-host and host-pathogen PPI with recently developed tools from the SLiMSuite package: (1) SLiMProb to identify novel candidates for viral mimicry of known host SLiMs, and (2) QSLiMFinder to predict entirely new SLiM classes. In addition to novel SLiMs, we also aim to predict new human targets for known viral SLiMs. In each case, signals of convergent evolution are identified using statistical over-representation of motifs in unrelated proteins. SLiM predictions will be put in context using network analysis of the host interactome. Results: The approach will be benchmarked using known cases of molecular mimicry in viral proteins. Novel candidates of molecular mimicry that appear to target important proteins and pathways along with GO annotations for the viral life cycle will be highlighted.
Conclusion: Systems-level analysis of molecular mimicry in virus-host and host-host PPI data using high throughput computational SLiM discovery tools has great potential to increase our understanding of how viruses manipulate their hosts and identify candidates for novel therapeutic targets. Although still in its early stages, this work reveals key considerations for future analysis.
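The over-representation idea in the Methods can be illustrated with a minimal sketch. It assumes a simplified regular expression for the Rb-binding LxCxE motif, invented toy sequences, an invented background probability of 0.05 per protein, and a plain binomial tail; it is not the statistic used by SLiMProb or QSLiMFinder.

```python
import re
from math import comb


def count_motif_hits(motif, sequences):
    """Count how many sequences contain at least one match to the motif regex."""
    pattern = re.compile(motif)
    return sum(1 for seq in sequences.values() if pattern.search(seq))


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits by luck."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy, invented sequences (hypothetical names); not taken from the abstract.
toy_proteins = {
    "viral_A": "MHGDTPTLHEYMLDLQPETTDLYCYEQLNDSS",
    "viral_B": "MSAALTCHEKDD",
    "host_C": "MKKLLCPEGGSA",
    "host_D": "MGGSSAAPPQQ",
}

# "L.C.E" is a simplified form of the LxCxE motif ('.' matches any residue).
hits = count_motif_hits("L.C.E", toy_proteins)

# Assumed background: each unrelated protein matches by chance with p = 0.05.
p_value = binomial_tail(hits, len(toy_proteins), 0.05)
print(hits, p_value)
```

A small tail probability means the motif occurs in more unrelated proteins than the assumed background would explain, which is the kind of signal the abstract treats as evidence of convergent evolution.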

260

META-FEATURES AS PREDICTORS OF BREAST CANCER INTRINSIC SUBTYPES IN THE METABRIC GENE EXPRESSION DATASET Heloisa H Milioli12, Renato Vimieiro13, Carlos Riveros13, Regina Berretta13, Pablo Moscato13 1. Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, Newcastle, NSW, Australia 2. School of Environmental and Life Science, The University of Newcastle, Newcastle, NSW, Australia 3. School of Electrical Engineering and Computer Science, The University of Newcastle, Newcastle, NSW, Australia Gene expression microarray data have expanded our understanding of breast cancer and supported its classification into five distinct subtypes: luminal A, luminal B, HER2-enriched, normal-like, and basal-like. The investigation of individual transcriptomic signatures remains a valuable tool to determine patient diagnosis and prognosis, and to predict therapy response. Novel methods for tumour stratification, biomarker identification and subtype prediction are, therefore, urgently needed for future applications in clinical practice. In this study, we assess the ability of a newly proposed method to select reliable combinations of probes for distinguishing breast cancer subtypes. We expanded the analysis of the original METABRIC breast cancer data set, with over 2,000 samples, by considering the pair-wise differences of the gene expression values (meta-features). In addition, we computed the CM1 scores for each subtype and selected the partial balanced top-10 meta-features from the five groups of patients. The ability of these meta-features to assign subtypes was assessed using a list of classifiers from the Weka software suite, based on a 10-fold cross-validation model and a training-test setting.
Classifiers demonstrated extensive predictive power on labelling samples, with the average Cramer’s V of 0.91 ± 0.044 and 0.92 ± 0.034 using the selected meta-features, in the discovery and validation sets, respectively. Our results also revealed an almost perfect agreement (κ≈0.97) between labels assigned by the majority of classifiers and the refined labels from METABRIC in both discovery and validation sets. The selected meta-features included classic genes outlining breast cancer subtypes and markedly improved label prediction using novel potential biomarkers in the pair-wise analysis. Moreover, our approach highlighted the greater performance of an ensemble of classifiers or methods to accurately predict sample subtype. Ultimately, our achievements may enhance the molecular understanding of breast cancer gene signatures and support future applications in clinical practice.
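As a rough illustration of two quantities named in this abstract, the sketch below builds pair-wise difference meta-features for one sample and computes Cramér's V from a contingency table of assigned versus reference labels. The gene names are used as stand-ins for probes, the expression values and the table are invented, and the CM1 scoring step is not reproduced.

```python
from itertools import combinations
from math import sqrt


def pairwise_meta_features(expression):
    """Return {(probe_a, probe_b): value_a - value_b} for every probe pair."""
    return {
        (a, b): expression[a] - expression[b]
        for a, b in combinations(sorted(expression), 2)
    }


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


# Invented log-expression values for one hypothetical sample.
sample = {"ESR1": 9.2, "ERBB2": 6.1, "KRT5": 4.0}
meta = pairwise_meta_features(sample)

# Invented 2x2 table in which predicted and reference labels agree perfectly.
perfect = cramers_v([[10, 0], [0, 10]])
print(meta, perfect)
```

With p probes there are p(p-1)/2 such meta-features, which is why a selection step such as the CM1 ranking described above is needed before classification.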


261

MOLECULAR DOCKING PREDICTIONS OF STEVIOSIDE-INSULIN RECEPTOR (IR) INTERACTIONS IN A MUS MUSCULUS IR MODEL Nabilatul Hani Mohd-Radzman1, Siti Azma Jusoh1, Aishah Adam1, Wan Iryani Wan Ismail1 1. Universiti Teknologi MARA, Bandar Puncak Alam, Malaysia The escalating prevalence of metabolic disorders such as obesity and diabetes, driven by sedentary lifestyles and modern eating habits, is a growing concern. Side effects and incompatibilities of current drugs have prompted a rise in alternative therapies, mainly from natural products such as stevioside. As one of the steviol glycosides extracted from Stevia rebaudiana Bertoni, stevioside holds considerable promise as a possible treatment for insulin resistance and diabetes mellitus. It has zero calories despite being more than 300 times sweeter than sucrose. Past reports have also indicated that stevioside lowers postprandial blood glucose levels in both rats and human subjects. Furthermore, stevioside was found to increase glucose uptake and elevate proteins related to the insulin signalling pathway in 3T3-L1 (Mus musculus) adipocytes. This finding prompted the present project, which evaluates the interactions between stevioside and the insulin receptor (IR) by bioinformatics means. A three-dimensional (3D) structure of the Mus musculus IR was first built with the MODELLER programme, based on the human IR structure determined by X-ray diffraction (PDB ID: 3LOH). The modelled mouse IR was then subjected to docking of stevioside using the AutoDock Vina programme. The docking simulations showed that stevioside docked to three different binding pockets within the mouse IR. Most interestingly, stevioside also docked favourably to the same binding region as insulin.
In summary, this finding may shed some light on stevioside-IR interactions and their part in enhancing the activity of the insulin signalling pathway and improving insulin sensitivity.

262

A TOOL FOR FAST DEVELOPMENT OF NEW ONTOLOGIES Abdul-Mateen Rajput1, Marzio Pennisi2, Santo Motta2, Francesco Pappalardo2 1. Bonn-Aachen International Center for Information Technology [B-IT], University of Bonn, Bonn, Germany 2. University of Catania, Catania, Italy Ontology construction is a time-consuming and labor-intensive task. It may take many months to construct an ontology because, according to standard practice, each concept must have synonyms, a domain-specific definition, a unique identifier and references. Current practice requires this data to be entered manually through programs such as Protege. We designed a small application that speeds up the development of new ontologies. It provides a convenient, easy-to-use interface that in principle allows an ontology to be built within a few days. The output of our program can be opened and used directly in a standard ontology editor such as Protege.


263

NOVEL APPROACH FOR SELECTING THE BEST PREDICTOR FOR IDENTIFYING THE BINDING SITES IN DNA BINDING PROTEINS Nagarajan Raju1, Shandar Ahmad2, Michael Gromiha M1 1. Indian Institute of Technology Madras, Chennai, India 2. National Institute of Biomedical Innovation, Japan DNA-binding proteins (DBPs) play vital roles in many cellular processes through the interactions of their amino acids with DNA. Because structures are difficult to obtain experimentally and the gap between the available sequences and structures of DBPs is growing exponentially, computational methods have been developed to predict DNA-interacting residues from protein sequence. However, their performance varies, depending mainly on the training dataset, feature selection and learning capacity. Hence, it is important to reveal the correspondence between the performance of methods and the properties of DBPs. To address this problem, we collected all available methods for predicting DNA-binding sites and assessed their performance on unbiased, stringent and diverse datasets of DBPs with 25% sequence identity, based on various aspects: i) structural class, ii) fold, iii) superfamily, iv) family, v) binding motif, vi) DNA strand, vii) conformation of DNA and viii) protein function. We observed that the best performing methods for each of the datasets showed significant biases toward the datasets selected for their benchmark. We also analyzed the performance of methods on disordered regions, on structures not included in the training datasets and on recently solved structures. The resulting selection is more reliable than randomly choosing any method or combination of methods. Our analysis revealed important features that could be used to estimate these context-specific biases and hence suggest the best method to be used for a given problem (http://www.biotech.iitm.ac.in/DNA-protein).

264

CONSTRUCTION AND CLONING OF STRUCTURAL PROTEIN GENE OF HEPATITIS C VIRUS IN Escherichia coli AND ITS EXPRESSION IN CHINESE HAMSTER OVARY CELLS Hidayat aji Setiadji1, Debbie Debbie Retnoningrum1, Neni Neni Nuraini1 1. Bandung Institute Of Technology, Bandung, INDON, Indonesia Hepatitis C virus (HCV) is a member of the genus Hepacivirus in the family Flaviviridae. It has at least six genotypes and more than 100 subtypes, which are distinguished on a phylogenetic tree. The HCV genome consists of positive-sense single-stranded RNA of about 9.6 kb. The viral RNA encodes a polyprotein precursor of approximately 3,000 aa, which is cleaved co-translationally and post-translationally to generate structural (C, E1, E2) and non-structural proteins. The C protein multimerizes and forms a capsid that protects the viral RNA on the cytoplasmic surface of the endoplasmic reticulum, and interacts with envelope protein 1 (E1). A gene encoding the C, E1 and E2 proteins was constructed to obtain the HCV structural proteins. The synthetic gene was derived from a multiple alignment of amino acid sequences retrieved from several web sites; a dominant motif sequence was then analyzed for its homology to HCV. The sequence was analyzed for epitopes, signal sequences and codon preference in Chinese hamster ovary (CHO) cells. The synthetic gene was also assessed for its likelihood of forming a virus-like particle (VLP) by examining its signal peptide, glycosylation sites, disulfide bonds and the host cell factors that support VLP formation. The gene, obtained from IDT DNA, was ligated into the pVAX1 expression vector, and the recombinant pVAX1 plasmid was transformed into E. coli. After verification by polymerase chain reaction (PCR), migration analysis, sequencing and homology analysis, the positive recombinant vector was introduced into CHO cells to produce the HCV structural proteins. Expression of the HCV structural proteins was characterized by flow cytometry using fluorescence-activated cell sorting (FACS).


265

RULE DISCOVERY AND DISTANCE SEPARATION TO DETECT RELIABLE MIRNA BIOMARKERS FOR THE DIAGNOSIS OF LUNG SQUAMOUS CELL CARCINOMA Renhua Song1, Qian Liu1, Gyorgy Hutvagner2, Hung Nguyen2, Kotagiri Ramamohanarao3, Limsoon Wong4, Jinyan Li1 1. Advanced Analytics Institute, University of Technology, Sydney, Sydney, NSW, Australia 2. Centre for Health Technologies, University of Technology, Sydney, Sydney, NSW, Australia 3. Department of Computing and Information Systems, the University of Melbourne, Melbourne, VIC, Australia 4. School of Computing, National University of Singapore, Singapore Altered expression profiles of microRNAs (miRNAs) are linked to many diseases, including lung cancer. miRNA expression profiling is reproducible and miRNAs are very stable. These characteristics make miRNAs ideal biomarker candidates. This work aims to detect 2- and 3-miRNA groups, together with their specific expression ranges, to form simple linear discriminant rules for biomarker identification and biological interpretation. Our method is based on a novel committee of decision trees that derives 2- and 3-miRNA 100%-frequency rules. The method is applied to a data set of lung miRNA expression profiles of 61 squamous cell carcinoma samples and 10 normal tissue samples. A distance separation technique is used to select the most reliable rules, which are then evaluated on an independent data set. We obtained four 2-miRNA and three 3-miRNA top-ranked rules. One important rule is: if the expression level of miR-98 is above 7.356 and the expression level of miR-205 is below 9.601 (log2 quantile-normalized mirVana miRNA Bioarray signals), then the sample is normal rather than cancerous, with specificity and sensitivity both 100%. The classification performance of our best miRNA rules was markedly better than that of randomly selected miRNA rules.
Our data analysis also showed that miR-98 and miR-205 have two common predicted target genes FZD3 and RPS6KA3, associated with carcinoma according to the Online Mendelian Inheritance in Man database. We also found that most of the chromosomal loci of these miRNAs have a high frequency of genomic alteration in lung cancer. On the independent data set, the three miRNAs miR-126, miR-205 and miR-182 from our best rule can separate the two classes of samples at the accuracy of 84.49%, sensitivity of 91.40% and specificity of 77.14%. Our results indicate that rule discovery followed by distance separation is a powerful computational method to identify reliable miRNA biomarkers.
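The 2-miRNA rule quoted in this abstract can be written out directly. The sketch below encodes only that one rule with the thresholds given in the text; the sample values are invented for illustration, and the committee-of-decision-trees and distance-separation steps are not reproduced.

```python
def is_normal(mir98, mir205):
    """Rule from the abstract: normal if miR-98 > 7.356 and miR-205 < 9.601."""
    return mir98 > 7.356 and mir205 < 9.601


def sensitivity_specificity(samples):
    """samples: (mir98, mir205, true_label) tuples; cancer is the positive class."""
    tp = fn = tn = fp = 0
    for mir98, mir205, label in samples:
        predicted = "normal" if is_normal(mir98, mir205) else "cancer"
        if label == "cancer":
            if predicted == "cancer":
                tp += 1
            else:
                fn += 1
        else:
            if predicted == "normal":
                tn += 1
            else:
                fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented expression values (log2 scale assumed); not data from the study.
toy_samples = [
    (8.0, 9.0, "normal"),
    (7.5, 8.2, "normal"),
    (6.9, 10.4, "cancer"),
    (7.8, 10.1, "cancer"),
    (7.0, 9.0, "cancer"),
]
sensitivity, specificity = sensitivity_specificity(toy_samples)
print(sensitivity, specificity)
```

Because the rule is a conjunction of two threshold tests, a sample is called cancerous as soon as either condition fails, which keeps the rule easy to interpret biologically.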

266

VALIDATION OF TRANSCRIPTS ASSEMBLED FROM RNA-SEQ DATA USING PROTEOMICS DATA Aidan Tay12, Chi Pang12, Natalie Twine12, Linda Harkness3, Gene Hart-Smith12, Moustapha Kassem3, Marc Wilkins12 1. Systems Biology Initiative, The University of New South Wales, Sydney, New South Wales, Australia 2. School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, New South Wales, Australia 3. Department of Endocrinology and Metabolism, Odense University Hospital & University of Southern Denmark, Odense, Denmark Alternative splicing of mRNA is known to play a major role in diversifying the function of proteins in humans, resulting in cell-specific proteomic variation. While transcriptomic experiments have enabled the detection of many alternatively spliced transcripts, the majority of these transcripts seem to lack protein-coding potential. Earlier this year, we published the Proteomic–Genomic Nexus (PG Nexus) pipeline, which showed that integrating proteomics and transcriptomics data can efficiently validate alternatively spliced transcripts. To accommodate analysis of novel transcripts assembled from RNA-seq data, we have developed an additional software component. The new module, TranscriptCoder, uses previously identified exons to translate RNA transcripts from Cufflinks into a protein sequence database for MS/MS searches. In conjunction with other PG Nexus tools, users can identify and validate splice junction boundaries and mRNA transcripts that have protein-coding potential. For our analyses, we used data from human undifferentiated mesenchymal stem cells (MSC) to validate 4187 unique splice junctions annotated in ENSEMBL. Comparison with other approaches, including a 3-frame translation of all RNA-seq transcripts, revealed 4472 unique splice junctions, of which 4062 were the same as those identified by TranscriptCoder.
There were 125 junctions that were found by TranscriptCoder only, and 410 junctions found by 3-frame translation only. The above suggests a combination of both methods could enhance the coverage of splice peptides. Our tool and results highlight an integrative approach that is incorporated into our PG Nexus pipeline, allowing us to validate alternatively spliced forms of protein.
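The junction counts reported in this abstract can be checked with simple set arithmetic. The sketch below uses only the three figures given in the text (4062 shared, 125 TranscriptCoder-only, 410 found by 3-frame translation only); the function name and dictionary keys are illustrative.

```python
def summarise_overlap(shared, only_a, only_b):
    """Totals for two methods A and B from their shared and unique counts."""
    total_a = shared + only_a
    total_b = shared + only_b
    union = shared + only_a + only_b
    return {
        "total_a": total_a,
        "total_b": total_b,
        "union": union,
        # Fractional gain in coverage when B's unique finds are added to A.
        "gain_over_a": only_b / total_a,
    }


# A = TranscriptCoder, B = 3-frame translation; counts taken from the abstract.
summary = summarise_overlap(shared=4062, only_a=125, only_b=410)
print(summary)
```

The totals reproduce the 4187 and 4472 junctions stated in the abstract, and the union of 4597 junctions is what a combination of both methods would cover.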


267

GENOME-WIDE DISCOVERY AND ANALYSIS OF PHASED SMALL INTERFERING RNAS IN CHINESE SACRED LOTUS Yun Zheng1, Shengpeng Wang1, Ramanjulu Sunkar2 1. Faculty of Life Science and Technology, Kunming University of Science and Technology, Kunming, Yunnan, China 2. Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, Oklahoma, USA Phased small interfering RNA (phasiRNA) generating loci (PHAS for short) in plants are a novel class of genes that are normally regulated by microRNAs (miRNAs). Like miRNAs, the phasiRNAs encoded by PHAS loci play important regulatory roles by targeting protein-coding transcripts in plant species. We performed a genome-wide discovery of PHAS loci in Chinese sacred lotus and identified a total of 106 PHAS loci. Of these, 47 loci generate 21-nucleotide (nt) phasiRNAs and 59 loci generate 24nt phasiRNAs. We have also identified a new TAS3 locus and two TAS4 loci in the lotus genome. Our results show that some of the nucleotide-binding, leucine-rich repeat (NB-LRR) disease resistance proteins and MYB transcription factors generate phasiRNAs. Furthermore, our results suggest that some LSU-rRNAs can give rise to phasiRNAs, potentially as a result of crosstalk between the small RNA biogenesis pathways that process rRNAs and PHAS loci, respectively. Some of the identified phasiRNAs have trans-targets with fewer than 4 mismatches, suggesting that the identified PHAS loci are involved in many different pathways. Finally, the discovery of 24nt phasiRNA-generating loci in lotus suggests that there are 24nt PHAS loci in dicots.
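To illustrate what "phased" means for these loci, the sketch below finds the dominant register of small RNA start positions at a fixed phase length. It is a deliberately simplified stand-in, with invented start coordinates and a hypothetical function name, and is not the phasing score or detection pipeline used by the authors.

```python
from collections import Counter


def dominant_phase(start_positions, phase_length=21):
    """Return the most common register (position mod phase_length) and its share."""
    registers = Counter(pos % phase_length for pos in start_positions)
    register, count = registers.most_common(1)[0]
    return register, count / len(start_positions)


# Invented start coordinates of small RNA reads at one hypothetical locus.
starts = [105, 126, 147, 168, 189, 131, 210]
register, fraction = dominant_phase(starts, phase_length=21)
print(register, fraction)
```

When most reads at a locus share one register, as six of the seven toy reads do here, the locus is a candidate PHAS locus; setting `phase_length=24` would apply the same check to the 24nt class.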

268

INTRINSIC PROPERTIES OF GENOMIC SEQUENCES ALLOW PREDICTION OF TRANSCRIPTION FACTOR BINDING REGIONS Tsung-Yeh Zing Tsai1, Shin-Han Shiu2, Huai-Kuang Tsai1 1. Academia Sinica, Taipei, Taiwan 2. Michigan State University, Michigan, USA Transcription factor (TF) binding is determined by multiple factors, including sequence specificity and chromatin accessibility, where the latter is influenced by both chromatin state and DNA structural properties. Although these features can be used to predict TF binding sites, their relative and joint contributions remain unclear. In particular, given that some of these features can be predicted from genomic sequence alone, it remains an open question how well they can be applied to predicting binding regions. A systematic assessment of the impact of jointly considering 23 features in predicting TF binding preference shows that chromatin state and DNA structural properties are better predictors of binding than the sequence motif of a TF. In addition, simultaneously considering chromatin state and DNA structural properties further improves the accuracy of TF binding prediction, indicating that these two feature sets are highly synergistic. However, their relative contributions differ greatly between TFs. Most importantly, we show that three DNA intrinsic properties are particularly critical in predicting TF binding. Using the intrinsic model, we can predict binding regions not only across TFs but also across DNA-binding domain families with distinct structural folds. The intrinsic property model allows TF binding predictions across DNA-binding domain families that are present in most eukaryotes, suggesting that the model is likely universal and can be used across species. Our findings thus demonstrate the feasibility of establishing a universal model for identifying regulatory regions in any sequenced genome.

269

ROLE OF ANTISENSE RNAS IN EVOLUTION OF YEAST REGULATORY COMPLEXITY Daryi Wang1, Huai-Kuang Tsai2, Chih-Hsu Lin1 1. Biodiversity Research Center, Academia Sinica, Taipei, Taiwan 2. iis, Academia Sinica, Taipei Publish consent withheld


270

PAIN-TENS: A DATABASE FOR PAIN RESEARCH USING TRANSCUTANEOUS ELECTRICAL NERVE STIMULATION Ko-Chun Yang12, Chun-Ping Yen34, Wen-Chieh Chang12, Cheng-yan Kao25, Sheng-An Lee6 1. Taiwan Resonant Waves Research Corporation, Taiwan 2. Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan 3. School of Post-baccalaureate Chinese Medicine, China Medical University, Taiwan 4. Tao-Yuan General Hospital, Taiwan 5. Department of Computer Science and Information Engineering, National Taiwan University, Taiwan 6. Department of Information Management, Kainan University, Taiwan Transcutaneous electrical nerve stimulation (TENS) is used clinically by a variety of healthcare professionals for the reduction of pain. Pain is typically classified as either acute or chronic. Chronic pain persists for weeks or months and is usually associated with an underlying condition, as in post-operative pain, cancer pain and low back pain. Several recent studies have suggested how to evaluate the clinical effectiveness of TENS using different experimental designs. Approximately 1000 articles on TENS studies have already been published. Scientists describe theories that support the use of TENS based on the release of endogenous opioids and on the gate control theory. Low frequencies, usually below 10 Hz, activate µ-opioid receptors according to the endogenous opioid system; however, high frequencies, above 50 Hz, activate δ-opioid receptors based on the gate control system. The main tools available for measuring pain fall into two categories, namely unidimensional scales and multidimensional scales. Unidimensional tools include numeric rating scales, the visual analogue scale and the face rating scale. Multidimensional tools include the brief pain inventory and the McGill pain questionnaire.
In this study, we review the clinical literature on TENS effectiveness, experimental designs, number of cases, TENS with varying frequency, classification of diseases, and the assessment of pain. We construct a literature-curated TENS and pain research database that makes systematic reviews easier, and we also demonstrate visualization of the results.

270

THE ROLE OF ZINC FINGER FAMILY GENES IN SCHIZOPHRENIA Ko-Chun Yang1,2, Chun-Ping Yen3,4, Wen-Chieh Chang1,2, Cheng-yan Kao2,5, Sheng-An Lee6 1. Department of Psychiatry, Beitou Branch, Tri-Service General Hospital, Taipei, Taiwan 2. Department of Biochemical Science and Technology, National Taiwan University, Taipei, Taiwan 3. Department of Information Management, Kainan University, Taoyuan, Taiwan 4. Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan

Schizophrenia is believed to be a progressive brain disorder accompanied by a reduction in brain size, especially in the hippocampus, thalamus, temporal lobe and prefrontal cortex. Zinc finger family genes, which encode zinc finger proteins, are among the most abundant and widely distributed gene families, with molecular and biological functions that include DNA recognition, transcriptional activity, regulation of apoptosis, and protein folding and assembly. GWAS and SNP studies have identified one zinc finger family gene, ZNF804A, as a susceptibility gene that is strongly associated with schizophrenia.

To explore the relationships between zinc finger family genes and schizophrenia, we constructed the schizophrenic zinc-finger mediator network (SZFMN) to illustrate the interactions between schizophrenia candidate genes and zinc finger family genes. Important hub genes in the SZFMN included ZNF174, ZNF195, ZNF200, ZNF259, TP53 and UBC. Across the literature reviewed, the over-expressed zinc finger genes are ZNF24 and ZNF200, and the under-expressed zinc finger genes are ZNF148, ZNF174, ZNF195, ZNF200 and ZNF512. ZNF200 showed inconsistent expression levels across different reports.

The SZFMN revealed connections between zinc finger genes and schizophrenia candidate genes, which implies that dysfunction of zinc finger genes may be associated with brain atrophy and may contribute to important disease mechanisms of schizophrenia.


271

MOLECULAR DOCKING OF VARIOUS TYPE SOYBEAN-PHOSPHATIDYLCHOLINE TO FAS-RECEPTOR FOR DRUG DESIGN STUDIES OF INDUCING ADIPOCYTE APOPTOSIS AND ISOLATING-PROLIFERATION OF ADIPOCYTE-DERIVED STEM CELLS Reza Yuridian1, Iis Rosliana2, Fadilah .3, Ariyani .2, Prasetyawan Yunianto4, Hans-Joachim Freisleben5 1. Indonesian Medical Mesotherapy Association, Jakarta, PUSAT, Indonesia 2. érpour Dermatology, Mesotherapy, Stem Cells, and Aesthetic Medicine Research Center, Central Jakarta, Indonesia 3. Department of Chemistry, Faculty of Medicine, Universitas Indonesia, Indonesia 4. Agro Industrial Technology Development, Biomedical Laboratory, Serpong, Tanggerang, Indonesia 5. Medical Research Unit, Faculty of Medicine, Salemba, Jakarta, Indonesia Subcutaneous injection of phosphatidylcholines (PPCs) has been promoted as an efficacious alternative to liposuction for the removal of local fat deposits. PPCs have been shown to induce apoptosis of adipocytes, with clear evidence of the involvement of apoptotic proteins, including cleavage of caspase-8. Cleavage of caspase-8 is known to be involved in the Fas-induced death pathway of apoptosis. However, whether PPCs have Fas ligand (FasL) agonistic activity that may directly activate the Fas receptor (FasR) remains unknown. Separation of high-purity soya PPC by reverse-phase high-performance liquid chromatography showed 6 peaks, which may correspond to various types of PPC with different fatty acyl chain combinations at the sn-1 and sn-2 positions of the glycerol backbone. It has been reported that PPCs from soya bean are isolated as a mixture of PPCs with palmityl, stearyl, oleyl, linoleyl and linolenyl at the sn-1 position and oleyl, linoleyl and linolenyl at the sn-2 position of the glycerol backbone, resulting in 15 types of PPC. In this study, we examined the FasL agonistic properties of various types of soya PPC by molecular docking to FasR.
We also compared the interaction between PPCs and FasR with the interaction between genistein and FasR. Genistein is an estrogenic soy isoflavone that has pro-apoptotic, antiproliferative and anti-adipogenic activities on adipocytes. The 3D molecular model of FasR was obtained from the RCSB Protein Data Bank (PDB ID: 3TJE). Candidate active sites in FasR for ligand binding were determined using CastP Site Finder. Docking was performed around the region of interest positioned at x = -6.693, y = 6.965 and z = 12.774. We found Tyr 75, Arg 86, Arg 89, Asn 108, Cys 111 and Cys 124 of FasR to be binding residues for PPCs, while genistein binds to FasR at residue Arg 89. 1,2-Dioleyl-sn-glycero-3-phosphocholine bound most strongly to FasR, with a binding affinity of -5.3 kcal/mol, while 1-stearyl-2-linolenyl-sn-glycero-3-phosphocholine bound most weakly, with the lowest binding affinity of 4.1 kcal/mol. Compared with genistein, whose binding affinity to FasR is -6.9 kcal/mol, PPCs have lower binding affinity to FasR. We confirmed the result of the in silico study with a toxicity assay using the 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl-tetrazolium bromide (MTT) method. We tested soya PPCs at concentrations from 0 mg/ml to 1 mg/ml on adipose-derived stem cells (ADSCs) and measured cell viability. The MTT assay showed that PPCs tend to reduce ADSC viability, with lower toxicity than a PPC solution containing 2% sodium deoxycholate (SD), a necrosis-inducing agent. We conclude that soya PPCs may serve as potent pro-apoptotic agents at certain doses and should be developed further to obtain a new, effective PPC formulation with fewer side effects than the PPC/SD formulation that we have tested for 9 years as a clinical medication.

272

COMPARATIVE ANALYSIS OF GENE EXPRESSION PROFILES INDUCED BY IL-4 AND IL-6 IN HUMAN PERIPHERAL BLOOD MONONUCLEAR CELLS Gaopeng Li1, Hongwu Du2, Guangyu Chen3, Charles Fathman3, Yan Zhang1 1. Institute for Nutritional Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China 2. School of Chemistry & Biotechnology Engineering, University of Science & Technology Beijing, Beijing, China 3. Department of Medicine, Division of Immunology and Rheumatology, Stanford University School of Medicine, Stanford, CA, USA Background: IL-4 and IL-6 play important roles in a variety of biological processes such as inflammation and immune response. Although the crosslinks between these two molecules have been reported previously, little work has been done at the systematic level to investigate their joint effects on different cells. Results: We carried out microarray analysis of human peripheral blood mononuclear cells with IL-4 and IL-6 stimuli at different time points and under different culture conditions (hydrogen peroxide or phytohaemagglutinin). A large number of differentially expressed genes were detected within each condition, and were enriched in a variety of GO terms and pathways. Investigation of some representative pathways revealed that different conditions may influence the same pathway; however, the effects were quite different. PPI network analysis was also performed for identification of key regulators and condition-specific genes. Conclusions: Our data provide new insights into the complex roles of IL-4 and IL-6 as well as their cellular effects under different conditions.


273



AUTHOR INDEX

Abbas, M.M 7 Buske, F.A 17 Cmero, M 93,208 Abdelhadi, A 22 Campbell, L 211 Conomos, D 252 Abouelhoda, M 22 Cao, K.L 255 Cooke, B 209 Adam, A 261 Carter, D 211 Crowe, M 91 Adeline Yap, X 60 Chang, H 33 Das, D 210 Adhikari, S 60 Chang, T 83 Dash, D 222 Ahmad, S 263 Chang, W 270 Dashnow, H 93,210,258 Al-Ali, R 202 Charleston, M 41 Davenport, M 71 Algama, M 201 Charleston, M.A 73 Davis, R 88 Ali, A 22 Charoenkwan, P.M 19 Deffarges, B 76 Alinejad-Rokny, H 71 Che, J 204 Dehzangi, A 20 AlSaad, R 202 Chen, C 256 Deshpande, N.P 211 Ames, D 61 Chen, G 272 Dhanjal, J 16,85 An, J 77,251 Chen, H 14,81 Dhingra, P 18 Asiegbu, F.O 52 Chen, K 58 Ding, L 75 Azhar, N 92 Chen, S 38 Domanova, W 212 Badji, R 202 Chen, T 42 Donald, M.R 31 Barter, R.L 37 Chen, Z 205 Drinkwater, B 73 Batagov, A.O 86 Cheng, X 56,217 Du, H 272 Baten, A 50 Chin, C 38 Easteal, S 210 Batra, J 251 Ching, W 56,217 Ebrahimi, D 71 Bauer, D.C 17 Chiu, Y 11,69 Edwards, J.S 49 Berezovsky, I.N 48 Cho, Y 49 Edwards, R.J 213,259 Berretta, R 24,100,260 Choi, J 52 Eisenacher, M 34,54 Bhak, J 49 Choong, Y 206 Eisenhaber, F 60 Bitar, M 203 Brabant, G 72 Chopin, L 74 El Hilali, S 17

Brazas, M 44 Chua, H 207 ElHefnawi, M 89

Breen, E 15 Clark, S.J 17 Ellis, K.A 61

Brusic, V 32,35 Clarke, C 65 ElSebakhi, E 202

Burden, C 30,254 Clements, J 251 Eng, C.L 63


Fan, X 215 Han, K 49,216 Jang, S 83 Jayaram, B 18 Fathman, C 272 Han, Y 27 Jeffery, P.L 74 Feng, G 8 Hardison, R 58 Jeon, J 52 Feng, L.D 26 Harkness, L 266 Jho, S 49 Feng, L 55 Hart-Smith, G 266 Jhuang, Y 69 Feng, M 60 Herington, A.C 74 Jia, X 222 Feng, X 256 Hills, M 252 Ho, C 38 Jiang, H 217 Flanders, D 94 Ho, J 40 Johnson, K 75 Foret, S 254 Ho, J.W 51 Johnson, P 61 Franco, G.R 203 Ho, S 82 Jun, J 49 Freisleben, H 271 Ho, S.M 19 Jusoh, S 261 Fung, J.N 74 Hou, W 56 Kadri, A 202 Gaasterland, T 2 Gadhvi, P 49 Hovens, C.M 208 Kaewkascholkul, N 29

Gaeta, B 80,95 Hsiao, E 33 Kakugo, A 27

Geng, Y 75 Hsu, K 69 Kamali, A 40

Ghazy, M 89 Hu, H 49 Kamath, K 209

Giannoulatou, E 51 Huang, H.M 19 Kandil, A 22

Goel, A 209 Huang, J 69 Kao, C 270

Gong, H 55 Huang, K 87,273 Kassem, M 266

Gopichandran, S 214 Huang, X 68 Kasumovic, M.M 205

Goyal, S 16,85 Huang, Y 9 Kathiresan, N 202

Graham, D 65 Humphreys, D 51 Kato, Y 79

Graham, P 61 Hung, P 4,5 Kaushik, R 18

Griffith, R 80 Huo, N 256 Keith, J.M 201

Grover, A 16,85 Hutvagner, G 66,265 Khan, J 218,219

Grover, M 62 Im, J 216 Khan, M 92

Grover, S 16 Inoue, D 27 Khanna, K.K 255

Grynberg, P 203 Irving, D 94 Khanna, K 78

Guo, D 215 Islam, M 23 Khushi, M 65

Gupta, A 85 Iwadate, M 84 Kiga, D 57

Haider, M 218,219 James, D 212 Kim, H 49

Kim, J 59 Li, N 60 Madhamshettiwar, P.B 78

Kim, K 52 Li, T 53 Mah, T 60

King, G.J 50 Li, X 207 Mahemuti, B 27

Ko, M 38 Li, Y 64 Mahendran, R.S 70

Kohl, M 54 Li, Z 21 Malluhi, Q.M 7

Konagaya, A 27 Liao, P 33 Manica, A 49

Kotagiri, R 66 Liaw, H 5 Mann, G.J 37

Krishnaswamy, S 220 Liem, N 60 Marcus, K 54

Krycer, J 212 Lim, J 49 Marincola, F 202

Kuleesha, K.M 26 Lim, T 206 Martins, R 61

Kumar, D 222 Limviphuvadh, V 60 Maruff, P 61

Kumar Yadav, A 222 Lin, C 38,69,69,269 Masters, C 61

Kuncic, Z 212 Lin, F 81 Matsuda, H 6

Kuralmani, V 60 Lin, H 87 Matsushita, N 6

Kurochkin, I.V 86 Lin, P 254 Mattick, J 101 Lin, T 83 Kuznetsov, V.A 86 Mattick, J.S 203 Liou, Y.M 19 Kwoh, C 14 Maurer-Stroh, S 60 Liu, C 78,255 Lai, F 9 McEwan, A 40 Liu, K 69 Lai, J 74,77,251 McGrath, A 97 Liu, Q 21,66,265 Lai, S.M 19 Meckel, H 54 Liu, Y 256 Lai, Y 211 Menon, R 259 Lam, W 69 Liu, Z 10 Milioli, H 24 Lee, H.M 19 Lonsdale, A 93,257,258 Milioli, H.H 260 Lee, M 252 Loo, L 64 Mishra, A 18 Lee, S 87,270,273 Lovell, D 90 Mohamed, N 92 Lee, T 12 Luo, C 68 Mohd-Radzman, N 261 Lee, W 216 Luo, G 8 Molloy, M 15 Lee, Y 52 Lyons, J 20 Moriya, T 57 Lehman, M 77,251 M, M 263 Moscato, P 24,260 Leong, S 206 Ma, X 256 Motta, S 262 Li, G 272 Macaulay, L 61 Mukherjee, G 18 Li, J 21,66,253,265 Macintyre, G.J 208

Mulvenna, J 222 Plohnke, N 34 Sharma, A 20 Mundra, P 81 Qian, M 68 Shatnawi, M 39 Nakai, K 36 Qiu, Y 56,217 Shekhar, S 18 Nelson, C 77,251 Nenadic, G 72 Qureshi, S 30 Shie, D 53

Nguyen, H 66,265 Ragan, M 255 Shin, M 59,204

Nguyen, T 14 Ragan, M.A 78 Siefken, V 76

Nielsen, L 99 Rajput, A 262 Simpson, P.T 78

Ning, K 207 Raju, N 263 Simpson, P.T 255

Niu, B 75 Ramamohanarao, K 265 Singh, A 18

Nock, C.J 50 Raman, H 92 Singh, S 46

Nuraini, N.N 264 Reddel, R.R 252 Sitharaman, G.S 70

O'Brien, A 67 Reed, S 47 Sjaugi, M 92

O'Keeffe, A 74 Ren, R 53 Smith, M.A 203

Oldmeadow, C 201 Retnoningrum, D.D 264 Soliman, B 89

Omenn, G 1 Rexroth, S 34 Somboonviwat, K 29

Oshlack, A 210 Ritchie, S.C 94 Song, H 52

Ouyang, L 207 Riveros, C 24,260 Song, R 66,265

P, B 7 Rosliana, I 271 Song, S 78

Paek, W 49 Rowe, C 61 Song, X 15

Paliwal, K.K 20 Salem, A 89 Soo, R.A 60

Palopoli, N 213,259 Sampath, P 86 Speed, T 3

Pang, C 211,266 Sander, T 76 Sridharan, S 60

Pang, I 209 Savage, G 61 Srihari, S 78,255 Schramm, S 37 Pappalardo, F 262 Srinivasulu, Y.M 19 Schwartz, J 72 Park, B 216 Stutz, M.D 252 Seim, I 74,251 Park, K 49 Su, C.T 14 Seno, S 6 Park, S 51 Su, R 64 Seok, J 88 Pascovici, D 15 Su, X 207 Setiadji, H.a 264 Patil, A 36 Sun, J 33 Pennisi, M 262 Sezerman, O 13 Sundar, D 16 Perrin, D 25 Shafiq, M 218,219 Sundaram, G.M 86 Pickett, H.A 252 Shanmughavel, P 220,221

Sung, T 69 von Korff, M 76 Xie, M 75 Sunkar, R 267 Walpole, C.M 74 Xu, Y 68 Susaki, E.A 25 Walpole, C 251 Xue, Y 10 Szoeke, C 61 Wan, H 8 Yamamura, M 57 Taguchi, Y 84 Wan Ismail, W 261 Yang, J 43,69 Tainaka, K 25 Wang, C 4,33,77,251 Wang, D 269 Yang, K 87,270,273 Takenaka, Y 6 Wang, J 75 Yang, P 207 Tan, J.Z 86 Wang, M 8 Yang, T 4,5,33 Tan, S 92,210 Wang, S 53,267 Yang, Y 37 Tan, T 63 Wang, W 75 Yano, M 79 Tariq, W 218,219 Wang, X 8 Yap, P 74 Tasker, E 201 Tay, A 209,266 Wang, Y 10,45,215,253,256 Yarmishyn, A.A 86

Temanni, M 202 Wasser, M.D 26 Yavuz, A 13

Thomas, P.B 74 Watson-Haigh, N.S 96 Ye, K 75

Tong, J 60,63 Wedge, D.C 208 Ye, Y 82

Tong, M 8 Wee Choo, P.M 26 Yen, C 270

Tsai, H 268,269 Whiteside, E.J 74 Yong, W 60

Tsai, T.Z 268 Wilkins, M 205,209,211,266 Yoo, P 39

Tsao, T 87,273 Wilson, S 30,254 Yunianto, P 271

Tseng, J 69 Wilson, S.R 31 Yuridian, R 271

Tuvshinjargal, N 216 Wilson, W 61 Zaki, N 39

Twine, N 266 Wong, L 66,265 Zhang, G 28

Tyagi, C 85 Wouters, M 62 Zhang, P 61

Tye, G 206 Wu, C 72 Zhang, Q 215

Ueda, H.R 25 Wu, H 12,33,38 Zhang, X 75

Umeyama, H 84 Wu, J 15,52 Zhang, Y 58,272

Uszkoreit, J 34 Wu, Q 82 Zhao, L 81

Vafaee, F 212 Wu, W 4,5,9,33 Zhao, Z 75

Vandewater, L 61 Xia, B 8 Zheng, J 14,81

Vasylenko, T.M 19 Xiao, W 88 Zheng, Y 53,267

Vimieiro, R 24,260 Xiao, X 75 Zhou, L 80 Xie, D 8

Zhou, Q 8 Zink, D 64 Zhu, S 68 Zhou, S 82


EXHIBITORS

Exhibitor Floor Plan

Exhibitor Trade Listing

AB SCIEX, Booth 1
AB SCIEX is a global leader in the development of best-in-class mass spectrometer-based technologies, enabling scientists & analysts to push the limits of their fields & answer complex questions, thereby improving our world. We partner with leaders in many disciplines to create instrumentation, software, reagents, & services that are reliable, sensitive, and intuitive to use.

Millennium Science, Booth 2
Millennium Science is a specialist distribution company established in 1999 and is a leading solution supplier to the science industry, with a core competency in the delivery of high-technology products. These include hardware, software and consumables. Focusing heavily on pre- and post-sales technical support, Millennium Science employs scientifically qualified Technical Sales Representatives backed up by PhD-qualified Application Scientists with strong bioinformatics knowledge to support the ever-increasing data analysis requirements of the market.

QIAGEN, Booth 3
QIAGEN offers industry-leading applications for the analysis, interpretation, and reporting of biological data. Understanding raw data is one of the most significant challenges in modern molecular methods. Data must be examined within the context of complex biological processes, and rapidly increasing throughput makes analyses time- and labor-intensive. QIAGEN’s portfolio of powerful tools addresses this bottleneck with innovative applications based on cutting-edge bioinformatics. Visit our booth for more information.


CONFERENCE ATTENDEES

Mostafa Abbas, Qatar University, Qatar
Mohd Basyaruddin Abdl Rahman, Universiti Putra Malaysia, Malaysia
Rashid Al-Ali, Sidra Medical and Research Center, Qatar
Manjula Algama, Monash University, Australia
Zhiliang Chen, Systems Biology Initiative, University of New South Wales, Australia
Jaeyoung Choi, Seoul National University, South Korea
Yee Siew Choong, Universiti Sains Malaysia, Malaysia
Hon Nian Chua, Institute for Infocomm Research, Singapore
Sowmya Gopichandran, Macquarie University, Australia
Abhinav Grover, Jawaharlal Nehru University, India
Mani Grover, Deakin University, Australia
Sonam Grover, Indian Institute of Technology Delhi, India

Hamid Alinejad-Rokny, UNSW, Australia
Mark Crowe, QFAB Bioinformatics, Australia
Boris Guennewig, Garvan, Australia

Jiyuan An, Queensland University of Technology, Australia
Nicola Armstrong, University of Sydney, Australia
Daniel Damiano, Oracle Corporation, Australia
Abdollah Dehzangi, Griffith University, Australia
Kyungsook Han, Inha University, South Korea
Joshua Ho, Victor Chang Cardiac Research Institute, Australia

Rebecca Barter, The University of Sydney, Australia
A K M Abdul Baten, Southern Cross University, Australia
Conrad Burden, Australian National University, Australia
Rebecca Carter, CSIRO, Australia
Michael Charleston, University of Sydney, Australia
Margaret Donald, University of New South Wales, Australia
Benjamin Drinkwater, University of Sydney, Australia
Richard Edwards, University of New South Wales, Australia
Mahmoud ElHefnawi, NRC, Egypt
Christine Eng, National University of Singapore, Singapore
Amina Ibrahim, National University of Medical and Technical Studies - Sudan, Sudan
Mohammad Islam, Macquarie University, Australia
Hao Jiang, China
JeHoon Jun, Personal Genomics Institute, Genome Research Foundation, South Korea
Jahangir Khan, Engro Foundation, Pakistan

Jingmin Che, Kyungpook National University, South Korea
Bruno Gaeta, UNSW, Australia
Mohammad Asif Khan, Perdana University, Malaysia

Matloob Khushi, Children Medical Research Institute (CMRI) and Westmead Institute for Cancer Research; Sydney Medical School, University of Sydney, Australia
Jinwoo Kim, Kyungpook National University, South Korea
Akihiko Konagaya, Tokyo Institute of Technology, Japan
Swaminathan Krishnaswamy, Bharathiar Univer, India
Dhirendra Kumar, QIMR Berghofer Medical Research Institute, Dhirendra.Kumar@qimrberghofer.edu.au, Australia
Peijie Lin, Australian National University, Australia
Yue Liu, Minzu University of China, China
David Lovell, CSIRO/Australian Bioinformatics Network, Australia
Cheng-Tsung Lu, Yuan Ze University, Taiwan
Bulibuli Mahemuti, Tokyo Institute of Technology, Japan
Radha Mahendran, Vels University, India
Helen McCormick, Victor Chang Cardiac Research Institute, Australia
Annette McGrath, CSIRO, Australia
Ranjeeta Menon, University of New South Wales, Australia
Ahmed Metwally, Cairo University, Egypt
Heloisa Milioli, University of Newcastle, Australia
Avinash Mishra, Indian Institute of Technology, India
Abidali Mohamedali, Macquarie University, Australia
Nabilatul Hani Mohd Radzman, Universiti Teknologi MARA, Malaysia
Takefumi Moriya, Tokyo Institute of Technology, Japan
Pablo Moscato, HMRI - University of Newcastle, Australia
Aidan O'Brien, CSIRO, Australia
Ashish Patel, National Institute of Technology Raipur, India
Ashwini Patil, Institute of Medical Science, University of Tokyo, Japan
Dimitri Perrin, RIKEN, Japan
Yushan Qiu, The University of Hong Kong, Australia
Shoba Ranganathan, Macquarie University, Australia
Scott Ritchie, The University of Melbourne, Australia
Carlos Riveros, HMRI - University of Newcastle, Australia
Aedan Roberts, Kids Research Institute, Children's Hospital at Westmead, Australia
Abdul Sattar, Griffith University, Australia
Sarah-Jane Schramm, The University of Sydney, Australia
Jean-Marc Schwartz, University of Manchester, United Kingdom
Hidayat Setiadji, Bandung Institute of Technology, Indonesia
Maad Shatnawi, UAEU, United Arab Emirates
Avtar Singh, GGS Institute of Information Communication Technology India, India
Kulwadee Somboonviwat, International College, King Mongkut's Institute of Technology Ladkrabang (KMITL), Thailand

Jiangning Song, Monash University, Australia
Huai-Kuang Tsai, Academia Sinica, Taiwan
Normi Yahaya, Universiti Putra Malaysia, Malaysia

Srinath Sridharan, Institute for Infocomm Research, Singapore
Durai Sundar, Indian Institute of Technology (IIT) Delhi, India
Y-h. Taguchi, Chuo University, Japan
Tin Wee Tan, National University of Singapore, Singapore
Patrick Thomas, Australia
Shradha Tiwari, Amity University, India
Mukesh Kumar Verma, National Institute of Technology Raipur (C.G.), India
Jiayin Wang, Washington University in St. Louis, United States
Xiu-Jie Wang, Institute of Genetics & Developmental Biology, Chinese Academy of Sciences, China
Nathan Watson-Haigh, Australian Centre for Plant Functional Genomics, Australia
Yu Xue, Huazhong University of Science and Technology, China
Jean Yang, University of Sydney, Australia
Ahmet Sinan Yavuz, Sabanci University, Turkey
Reza Yuridian, Indonesian Medical Mesotherapy Association, Indonesia
Ping Zhang, CSIRO, Australia
Gong Zhang, Jinan University, China
Yun Zheng, Kunming University of Science and Technology, China

Yudong Zhou, Harbin Institute of Technology, China
