IMB 2014 WINTER SCHOOL IN MATHEMATICAL & COMPUTATIONAL BIOLOGY

7-11 July 2014

Auditorium, Queensland Bioscience Precinct, The University of Queensland, Brisbane, Australia

Hosted by: ARC Centre of Excellence in Bioinformatics

PROGRAM

2014 Winter School in Mathematical and Computational Biology 7-11 July 2014 http://bioinformatics.org.au/ws14

Queensland Bioscience Precinct (Building #80) The University of Queensland Brisbane, Australia

Monday 7 July 2014 NEXT GENERATION SEQUENCING & BIOINFORMATICS

8:15 a.m. REGISTRATION OPENS

9:00 a.m. Welcome and introduction Dr Nicholas Hamilton Institute for Molecular Bioscience The University of Queensland

09:05 a.m. Next-generation sequencing: an overview of technologies and applications Dr Ken McGrath Australian Genome Research Facility Ltd (AGRF) Brisbane Node (The University of Queensland)

09:45 a.m. NGS mapping, errors and quality control Dr Felicity Newell The University of Queensland Diamantina Institute

10:30 a.m. Morning Tea

11:00 a.m. Defensive NGS informatics – what can go wrong and how do you know when to throw in the towel? Mr John Pearson QIMR Berghofer Medical Research Institute

11:45 a.m. Structural variants detection using whole genome sequencing Dr Ann-Marie Patch Queensland Centre for Medical Genomics Institute for Molecular Bioscience The University of Queensland

12:30 p.m. Lunch

13:30 p.m. De novo genome assembly Dr Torsten Seemann Victorian Bioinformatics Consortium Monash University

14:00 p.m. Introduction to RNA-seq Dr Nadia Davidson Murdoch Childrens Research Institute Royal Children’s Hospital

14:30 p.m. RNA-seq differential expression Dr Annette McGrath CSIRO Canberra

15:00 p.m. Afternoon Tea


15:30 p.m. MicroRNAs – sequencing, analysis … and then what? Dr Nicole Cloonan Genomic Biology Laboratory QIMR Berghofer Medical Research Institute

16:00 p.m. NGS experimental design and statistical power Dr Stephen Rudd QFAB Bioinformatics

16:30 p.m. Genomic infrastructure for NGS A/Professor Mik Black School of Medical Sciences University of Otago

17:00 p.m. What the Australian Bioinformatics Network can do for you Dr David Lovell CSIRO

17:30 p.m. Welcoming BBQ


Tuesday 8 July 2014 NEXT GENERATION SEQUENCING & BIOINFORMATICS

09:00 a.m. Great expectations or why sequencing platforms are not magic wands Dr Lauren Bragg CSIRO

Dr Michael Imelfort School of Chemistry & Molecular Biosciences, The University of Queensland

10:00 a.m. Phylogeny-based methods for analysing and comparing uncultured microbial communities A/Professor Aaron Darling University of Technology, Sydney

10:30 a.m. Morning Tea

MODELLING FROM HIGH-THROUGHPUT BIO-DATA

11:00 a.m. Exploring the structure of whole-genome conservation profiles using Bayesian segmentation Dr Jonathan Keith School of Mathematical Sciences Monash University

11:45 a.m. Machine learning in action Ms Tatyana Goldberg Technische Universität München, Germany

12:30 p.m. Lunch

13:30 p.m. Detection of recombination events in bacterial genomes Dr Nouri Ben Zakour School of Chemistry and Molecular Biosciences The University of Queensland

14:15 p.m. Epigenomics: The many garments of the genome sequence Dr Fabian Buske Garvan Institute of Medical Research

15:00 p.m. Afternoon Tea

15:30 p.m. Mixed linear model analyses of human complex traits using SNP data Dr Jian Yang Queensland Brain Institute The University of Queensland

16:15 p.m. Detection and replication of epistasis influencing transcription in humans Dr Joseph Powell Queensland Brain Institute The University of Queensland

17:00 p.m. An introduction to BRAEMBL services Dr Webber Liao BRAEMBL


Wednesday 9 July 2014 MODELLING FROM HIGH-THROUGHPUT BIO-DATA

09:00 a.m. The future of DNA sequencing technology Professor Graham Taylor

09:45 a.m. Population-scale high-throughput sequencing data analysis Dr Denis Bauer Computational Informatics (CCI) CSIRO Sydney

10:30 a.m. Morning Tea

11:00 a.m. Translating exome and whole genome sequencing to the clinic A/Professor Marcel Dinger Garvan Institute of Medical Research

12:00 noon Panel discussion Moderated by Dr Nicole Cloonan QIMR Berghofer Medical Research Institute

*** FREE WEDNESDAY AFTERNOON ***


Thursday 10 July 2014 BIG DATA, STATISTICS AND APPLICATIONS

08:45 a.m. Taming the Big Data Dragon Professor John Quackenbush Dana-Farber Cancer Institute & Harvard School of Public Health, USA

10:00 a.m. From Big Data to smart knowledge – integrating multimodal biological data and modelling metabolism Professor Falk Schreiber Monash University and University Halle-Wittenberg, Germany

10:40 a.m. Morning Tea

11:10 a.m. Visual analytics of Big Data Professor Seok-Hee Hong University of Sydney

11:50 a.m. The life-sciences as a pathfinder in data-intensive research practice Dr Andrew Treloar Australian National Data Service (ANDS)

12:30 p.m. Lunch

13:30 p.m. Statistical experiment design principles for biological studies Dr Alec Zwart CSIRO

14:15 p.m. Genome-wide association studies Professor David Evans The University of Queensland Diamantina Institute

15:00 p.m. Afternoon Tea

15:30 p.m. Mixture models for analysing transcriptome and ChIP-chip data Dr Marie-Laure Martin Magniette Mathématiques et Informatique Appliquées Institut National de la Recherche Agronomique, France

16:15 p.m. Multivariate models for dimension reduction and biomarker selection in omics data Dr Kim-Anh Lê Cao The University of Queensland Diamantina Institute


Friday 11 July 2014 MOLECULAR PHYLOGENETICS

09:00 a.m. An introduction to phylogenetic inference Dr Robert Lanfear Macquarie University

10:15 a.m. Morning Tea

10:45 a.m. Loss of information at deeper divergences, and what we can do about it Distinguished Professor David Penny Institute of Fundamental Sciences Massey University, NZ

*** IMB FRIDAY SEMINAR ***

12:00 noon From mutation to macroevolution Professor Lindell Bromham Research School of Biology Australian National University

13:00 p.m. Lunch

14:00 p.m. The application of high throughput DNA barcoding for landscape ecology and management Professor Mike Wilkinson School of Agriculture Food & Wine The University of Adelaide

15:00 p.m. Open forum/questions Moderated by Professor Mark Ragan Institute for Molecular Bioscience The University of Queensland

16:00 p.m. Student travel award presentation & close of Winter School

~*~*~*~*~


BIOGRAPHY AND ABSTRACT

Dr Ken McGrath Australian Genome Research Facility Ltd (AGRF) Brisbane Node (The University of Queensland)

Biography: Ken McGrath is the manager of the Brisbane Lab of the Australian Genome Research Facility. He completed his undergraduate degree with honours in 2001 at QUT working with the plant biotechnology group on developing transgenic bioreactors, and transitioned to UQ for his PhD work investigating the genetic regulation of plant defence responses to disease, in collaboration with CSIRO and the CRC for Tropical Plant Protection. Following this, his postdoctoral research with the Schmidt and Schenk labs at UQ involved examining the transcriptomes of mixed microbial communities in industrial and agricultural settings. In 2009, Ken joined the AGRF as sequencing supervisor, and currently helps manage submissions and workflows on a range of next-generation sequencing platforms.

Date: Monday 7 July 2014

Presentation title: Next-generation sequencing: an overview of technologies and applications

Abstract: The “Next-Generation Sequencing” landscape is one of constant change, with new and emerging technologies constantly competing with established platforms. This abundance of competition is resulting in faster and cheaper methods to perform sequencing of DNA and RNA samples, but it also brings with it a confusing array of options, each with its own strengths and weaknesses. Ken will give an overview of the available sequencing technologies and run through some example projects that can be run on them, as well as describe the typical bioinformatics approaches for these projects, and also take a look at what’s “next” in Next-Gen.


Dr Felicity Newell The University of Queensland Diamantina Institute

Biography: Felicity originally trained in the fields of molecular and cell biology. In her PhD and first post-doctoral position at the University of Queensland, she investigated the role of growth factors in the differentiation of human preadipocytes. After this, she developed an interest in software development and bioinformatics, obtaining a Master of Information Technology from the Queensland University of Technology. She worked for two years as a software developer developing bioinformatics web applications at QFAB Bioinformatics, before moving to the Queensland Centre for Medical Genomics at UQ. At QCMG, she developed software for the analysis of cancer sequencing data, including a tool used for the detection of structural variants. She is currently carrying out postdoctoral research at the University of Queensland Diamantina Institute, using next generation sequencing data to understand human disease, and has a particular interest in structural variation.

Date: Monday 7 July 2014

Presentation title: NGS mapping, errors and quality control

Abstract: An important step in next generation sequencing is the alignment (mapping) of the short reads that are generated to a reference genome. Tools designed for mapping are required to efficiently and accurately align each read and more than 60 applications are currently available for this purpose. In this presentation I will describe some of the approaches to sequence alignment, highlighting popular tools that are used such as BWA, Novoalign and Bowtie. An important consideration for mapping and downstream sequence analysis is the ability to recognise and deal with common errors and biases that can occur during the process. I will discuss some of the common errors that occur in next generation sequencing and the approaches to quality control that should be applied in order to obtain high quality data.


Mr John Pearson QIMR Berghofer Medical Research Institute

Biography: John Pearson has qualifications in biochemistry, physiology, computing science and technology management and has spent 20 years creating software for scientists. John was Computer Systems Manager for the Genetic Epidemiology Laboratory at the Queensland Institute of Medical Research (QIMR) prior to moving to the United States in 2000 where he was the lead programmer in the Bioinformatics and Scientific Programming Core (BSPC) at the National Human Genome Research Institute (NHGRI) within the National Institutes of Health (NIH) in Bethesda, Maryland. In 2003, John left the NIH to become a founding Faculty member at the Translational Genomics Research Institute (TGen) where he led the Bioinformatics Research Unit and also served as a Division Director with oversight of all bioinformatics activities at TGen. John has held software development grants from the American Cancer Society, the National Institutes of Health and Microsoft and has been focusing on next-generation sequencing since the end of 2007. John returned from the US to take up a position in early 2010 as Senior Bioinformatics Manager for the Queensland Centre for Medical Genomics (QCMG).

Date: Monday 7 July 2014

Presentation title: Defensive NGS informatics - what can go wrong and how do you know when to throw in the towel?

Abstract: Next-generation sequencing has radically changed medical research by allowing deep interrogation of the DNA and RNA of pathogenic organisms, families with inherited disorders and the de-novo mutations responsible for tumourigenesis. As with any new technology, a "gold rush" mentality can arise where being first to the answer can push rigour and methodological soundness into the background. In this seminar, I'll talk from QCMG experience about some of the ways sequencing can go wrong, how the problems became apparent, what we did about them, and tools we developed to try to catch the same problems in future.


Dr Ann-Marie Patch Queensland Centre for Medical Genomics Institute for Molecular Bioscience The University of Queensland

Biography: Ann-Marie is currently a Senior Bioinformatics Researcher within the multi-skilled group at the Queensland Centre for Medical Genomics led by Prof Sean Grimmond. Her current research focuses on the detection of small indels and somatic structural rearrangements in ovarian, pancreatic and other cancers, with a personal interest in the mechanisms of DNA repair. Her PhD, gained in 2006 from the University of Exeter UK, combined bioinformatics and laboratory approaches to study the nature of tandem repetitive elements in the model genomes of fission and budding yeast. She then joined the Peninsula College of Medicine & Dentistry as an associate research fellow in Prof Andrew Hattersley’s group employing next generation sequencing to identify monogenic causes of neonatal diabetes and to identify causal mutations across a broad spectrum of genetic disorders for the Royal Devon and Exeter Molecular Genetics Laboratory.

Date: Monday 7 July 2014

Presentation title: Structural variants detection using whole genome sequencing

Abstract: As part of the International Cancer Genome Consortium, the Queensland Centre for Medical Genomics has established a world class laboratory and computational infrastructure balanced with high level expertise to enable the analysis of whole human genomes for the presence of DNA, RNA and epigenetic variants that are associated with the hallmarks of cancer. This talk will describe and discuss the principles and challenges of identifying structural variants (SVs) using whole genome sequencing. I will present the basis of detecting SVs, a tool developed at QCMG, and examples of how SV analysis can identify mechanisms driving tumorigenesis.


Dr Torsten Seemann Victorian Bioinformatics Consortium Monash University

Biography: Dr Torsten Seemann is the Scientific Director of the Victorian Bioinformatics Consortium at Monash University, and a Senior Research Scientist at the Life Sciences Computation Centre in Melbourne. He originally trained as a computer scientist and did his PhD in image processing and data compression, but his first postdoc in 2002 saw him thrown into the middle of Australia's first large genome project, and he hasn't looked back since. He specialises in microbial comparative genomics, genome assembly, and genome annotation; and is a strong believer in writing high quality, useful software tools and contributing back to the bioinformatics community. You can learn more at his group website www.bioinformatics.net.au, his blog TheGenomeFactory.blogspot.com.au, and on Twitter @torstenseemann.

Date: Monday 7 July 2014

Presentation title: De novo genome assembly

Abstract: De novo assembly is the process of reconstructing a genome's DNA sequence using only a set of much shorter error-prone sequences (reads) sampled from the genome. It is the "original" genomics-based bioinformatics problem, because it is all we can do when we don't have any related reference genome sequences, with the exemplar being the original human genome project. This presentation will discuss the principles of and approaches to de novo assembly of data, and practical issues like computational and memory requirements, limitations of de novo assembly, terminology, file formats, available software, and an example run-through of an assembly using the Velvet software if time permits.


Dr Nadia Davidson Murdoch Childrens Research Institute Melbourne

Biography: Dr Nadia Davidson is a bioinformatician working within the Oshlack group at the Murdoch Childrens Research Institute, Melbourne. She was trained in physics and software engineering and completed her PhD in Experimental Particle Physics from the University of Melbourne in 2011. Her research interests include methodology development and analysis of next-generation RNA sequencing data. She has been involved in a diverse set of projects, ranging from studying sex development in birds to identifying genomic rearrangements in cancer, all with the common theme of de novo transcriptome assembly.

Date: Monday 7 July 2014

Presentation title: Introduction to RNA-Seq

Abstract: The central dogma of genetics is that the genome, comprised of DNA, encodes many thousands of genes that can be transcribed into RNA. Following this, the RNA may be translated into amino acids giving a functional protein. While the genome of an individual will be identical for each cell throughout their body, the number of transcribed copies of each gene, as RNA, will differ due to the different functional requirement of each tissue type. An important area of research within genetics is to study the genome in action, through RNA. One example is comparing the quantities of each gene’s RNA between different tissue types, through development, in disease or in different environments, which is known as differential gene expression analysis.

RNA-Seq, or high throughput RNA sequencing, has accelerated research in this area. The technology works by reverse transcribing the RNA back into DNA, shearing it into smaller fragments, then reading each fragment’s sequence in parallel to give millions of short “reads”, each approximately 50-200 bases in length. With these data comes a computational and statistical challenge because the biology must be inferred from millions of short sequences. Along with technical biases, there is true biological variability between samples of the same type, which must be accounted for.

In this talk I will discuss the applications of RNA-Seq, its challenges and some of the bioinformatics strategies being employed to analyse this complex data. In particular, I will focus on the steps involved in differential gene expression analysis, both for model organisms, like human, and for more exotic organisms without a sequenced genome.


Dr Annette McGrath CSIRO Canberra

Biography: Dr Annette McGrath is the Bioinformatics Core Leader at CSIRO where her team works on enhancing bioinformatics capability and developing and supporting enterprise bioinformatics infrastructure for CSIRO’s bioinformaticians and bioscientists. They also collaborate with researchers on a number of genomics research projects. She has qualifications in biochemistry, molecular biology and statistics, and has worked in bioinformatics roles in industry, the not-for-profit sector and now CSIRO since 1998.

Date: Monday 7 July 2014

Presentation title: RNA-seq differential expression

Abstract: RNASeq has become one of the most popular applications of NGS technology and it is used to give a snapshot of the RNA that is present, and in what relative quantity, in a particular biological material at a given point in time. The previous presentation covers a number of applications of RNASeq including applications in non-model organisms. RNASeq can be used for many applications including spliced gene discovery, differential expression, RNA editing and detection of variants and this talk will focus on the tools and methods of data analysis for these applications.


Dr Nicole Cloonan QIMR Berghofer Medical Research Institute

Biography: Nicole Cloonan is an ARC Future Fellow who has recently established the Genomic Biology Laboratory at the QIMR Berghofer Medical Research Institute. Her work is multi-disciplinary in nature, involving computational biology and bioinformatics, biochemistry, cell biology, and molecular biology – all of which she uses to understand the complexity, function, and systems biology of RNA.

Date: Monday 7 July 2014

Presentation title: MicroRNAs - sequencing, analysis ... and then what?

Abstract: MicroRNAs (miRNAs) are an important class of non-coding regulatory RNAs, which interfere with the translation of protein-coding mRNA transcripts. By incorporation into the RNA induced silencing complex (RISC), miRNAs can inhibit translation, promote sequestration of mRNAs to P-bodies, and/or destabilise and degrade target mRNAs. The small size of mature miRNAs (typically only 20 to 24 nucleotides) makes them ideal for characterisation using short-tag RNA-sequencing (RNA-seq) technologies as you can capture the entire molecule in a single read. Unlike hybridisation approaches such as microarray profiling or Northern blotting, massive-scale sequencing provides a way to discriminate discrete but closely related RNA molecules, and profile miRNAs without a priori knowledge of expression.

MicroRNAs perform their biological roles by binding to mRNAs through Watson-Crick base-pairing. The attractive simplicity of using nucleotide complementarity to identify mRNA targets has given rise to many bioinformatics tools. These are based (to differing extents) on complementarity to the seed, evolutionary conservation, and free energy of binding.

So with great technology and plenty of well researched and well respected bioinformatics tools, miRNAs should be easy, right? This talk will systematically crush this rosy view of miRNAs as a field of study, and lay before you the desolate wasteland to navigate on your path to publication. Those towards the end of their PhD study on miRNAs may wish to avoid this talk.


Dr Stephen Rudd Queensland Facility for Advanced Bioinformatics (QFAB) The University of Queensland

Biography: Stephen is Head of Computational Biology at QFAB Bioinformatics, a bioinformatics and biostatistics services organisation based here at IMB. Over the last 15 years Stephen has worked as a genome biologist and bioinformatician in academia and industry in five different countries. He is a classical geneticist by training with a PhD in molecular biology and an adjunct professorship in plant genome bioinformatics. In a service provision role Stephen has seen some of the worst experimental designs imaginable (pharmaceutical industry) and regularly provides "-omics disaster recovery" for when complex sequence-based studies don't seem like such a good idea after all. He really does not like Excel and knows that you will appreciate reactive approaches to data visualisation. For this reason Stephen continues to develop open-source software with the aim of enabling comparative genomics by bridging the divide between bench-biologists and big-data.

Date: Monday 7 July 2014

Presentation title: NGS experimental design and statistical power

Abstract: Today's sequencing platforms make it rather too easy to inexpensively generate hundreds of gigabases of DNA sequence data. It is advisable to plan your research study carefully before you start collecting samples, pooling controls and sequentially sending off your lovingly extracted cDNA to the cheapest sequencing service provider around. In this talk we will explore the anatomy of a sequence-based genomics study. We will consider experimental design and the selection of appropriate controls to design a simple hypothesis-driven project. The application of statistical power calculations will be used to determine the appropriate number of samples and we will consider how potential batch-effects associated with library preparation and sequencing may confound downstream analyses. Experimental metadata will be discussed and I will reference anecdotal studies where additional metadata would have greatly simplified the data interpretation. Building on the old adage of "Junk in :: Junk out", we will take a paranoid look at NGS data quality control and I will provide pointers to a number of suitable workflows for understanding whether a provided RNA-Seq / ChIP-Seq / exome or WGS dataset is really fit-for-purpose.


A/Professor Mik Black Department of Biochemistry University of Otago Dunedin, New Zealand

Biography: Mik received a BSc (Hons) in statistics from the University of Canterbury, and an MSc (mathematical statistics) and PhD (statistics) from Purdue University. After completing his PhD in 2002, Mik returned to New Zealand to work as a lecturer in the Department of Statistics at the University of Auckland. An ongoing involvement in a number of Dunedin-based collaborative genomics projects resulted in a move to the University of Otago in 2006. Mik's research focuses on the development and application of statistical methods for the analysis of data from genomics experiments, with a particular emphasis on human disease. Mik is also heavily involved in two major initiatives designed to put in place sustainable national research infrastructure for NZ: NZGL (New Zealand Genomics Ltd) for genomics (where he was the interim Bioinformatics Team Leader during 2012-2013), and NeSI (New Zealand eScience Infrastructure) for computing/eResearch.

Date: Monday 7 July 2014

Presentation title: Genomic Infrastructure for NGS

Abstract: In the current research environment, the ability to manage, analyse and interpret data produced by high-throughput sequencing platforms has become an essential skill for both wet- and dry-lab researchers. While a number of options exist for outsourcing these tasks, the reality is that researchers still need (and desire) a level of analytic skill that allows them to perform basic exploratory analysis of their data, without having to rely on external assistance.

In this talk, I will discuss some of the infrastructure initiatives that have been undertaken in New Zealand and Australia to provide both genomics and bioinformatics support for researchers, as well as highlighting some of the tools and skills that help to ensure the robustness and reproducibility of the analyses being carried out.


Dr Lauren Bragg CSIRO

Biography: Lauren is a research scientist in the CSIRO Digital Productivity and Services Flagship. Lauren completed her Bachelors in Science (Bioinformatics) at the University of Sydney in 2005, and subsequently worked for a year as a software developer for the Capital Markets CRC. In 2007, Lauren joined CSIRO’s Division of Mathematics, Informatics and Statistics at North Ryde, and worked on a variety of projects spanning microarray design, analysis and genomic tool development. Inspired by the global oceanic survey (GOS) study, Lauren moved to Queensland to begin a CSIRO-UQ PhD in the area of metagenomics, supervised by Professor Gene Tyson. Lauren's thesis focused on the development of statistical and computational methods for analysis of environmental sequencing, where she established a protocol for metagenome assembly, developed a novel tool for correcting errors in pyrosequenced amplicons ('Acacia'), and evaluated the quality of Ion Torrent PGM as a platform for environmental sequencing applications. Upon completing her thesis research in 2012, Lauren returned to CSIRO and by using metabolomic information as a proxy for metabolic capability and activity, is developing expression models that will predict the consequences of controlled perturbations (such as dietary changes and probiotics) on the complex microbial communities present in the digestive tracts of animals and humans.

Dr Michael Imelfort School of Chemistry & Molecular Biosciences The University of Queensland

Biography: Michael is a bioinformatician at the Australian Centre for Ecogenomics, The University of Queensland. During his PhD he worked almost exclusively with plant genomes, but now he focusses on the genomics of environmental microbial communities, particularly those communities which cannot be cultured. His current research involves finding ways to merge and analyse data produced using a variety of DNA sequence-based experimental frameworks, including 16S pyrotag community profiling and metagenomic and meta-transcriptomic sequencing. Recently he has been developing novel techniques that cluster metagenomic contigs into population-specific groups (differential coverage binning).

Date: Tuesday 8 July 2014

Presentation title: Great expectations or why sequencing platforms are not magic wands

Abstract: Environmental microbial sequencing (e.g. amplicon sequencing, metagenomics, metatranscriptomics) provides a culture-independent means to investigate the composition, genomic potential and activity of microbial communities.

These approaches have been rapidly and widely adopted with the result that the corresponding data typically constitutes a critical component of many studies. Unfortunately, highly complex microbiomes coupled with poor experimental design and unrealistic goals have all too often led to doomed studies, disappointment and tears. In this tag-team talk, we provide an overview of environmental microbial sequencing techniques, with a focus on appropriate experimental design and bioinformatic analyses. We aim to provide a broad overview of what can be achieved with a HiSeq and some derring-do, and what cannot. We will illustrate our main points with a few case studies.


A/Professor Aaron Darling University of Technology Sydney

Biography: A/Professor Aaron Darling is an internationally recognised expert in computational biology and bioinformatics. Darling’s career began 14 years ago in the team that sequenced the first few E. coli genomes and he went on to develop the widely used Mauve software for genome analysis and comparison. Darling has been awarded several competitive fellowships, research grants, and industry sponsored research contracts. He has published over 50 manuscripts in journals ranging from PLoS to PeerJ to Nature.

Date: Tuesday 8 July 2014

Presentation title: Phylogeny-based methods for analysing and comparing uncultured microbial communities

Abstract: Sequencing of uncultured microbial communities via both shotgun metagenomic and 16S amplicon methods has provided great insight into the diversity of microbes and their roles in the environment and human health. The most commonly used methods for analysing such datasets are based on identification of Operational Taxonomic Units (OTUs): collections of sequences within a predefined percent nucleotide identity. These OTU-based approaches have some shortcomings, such as ambiguity in OTU definition and limited resolution. In this seminar I will review recent work on alternative approaches to quantifying and comparing microbial community diversity using Bayesian phylogenetic inference. This will include an introduction to basic phylogenetic models and the concepts of alpha and beta diversity in microbial communities.


Dr Jonathan Keith Faculty of Science Monash University

Biography: Dr Jonathan Keith was awarded a PhD in mineral processing by the University of Queensland in 2000, and was a postdoctoral fellow there and at Queensland University of Technology before moving to Monash University. He has worked in Bayesian methodology and applications since 2000 and has developed a trans-dimensional generalisation of the Gibbs sampler and adaptive Markov chain Monte Carlo methods. His methods have been applied in comparative genomics to investigate the non-protein-coding fraction of eukaryotic genomes, and also in phylogenetics, in genetic linkage and association studies, and in modelling the spread of invasive pest species.

Date: Tuesday 8 July 2014

Presentation title: Exploring the structure of whole-genome conservation profiles using Bayesian segmentation

Abstract: Conservation is a key indicator of function in genomes, and can potentially be used to discover novel functional non-protein-coding RNAs and regulatory sequences. However, recent investigations have demonstrated that a simple dichotomy between conserved and non-conserved sequence is too naïve a distinction to reflect the full complexity of the numerous types of structural and functional constraints acting on genomes. This presentation will discuss recent investigations into the detailed structure of whole-genome conservation profiles, using Bayesian segmentation techniques to identify multiple classes of conservation level. By integrating information about conservation with profiles of other properties indicative of function, including GC content and transition/transversion ratios, a much finer level of structure can be detected. The method has been applied to a range of species including Drosophila, zebrafish, malaria and bacterial genomes, and results from each of these will be presented. One key implication of these results is that the proportion of functionally constrained sequence in eukaryotic genomes may be very much larger than previously supposed. Another key implication is that genomic sequences may be subject to ephemeral functional constraints that act on too short a time scale to be detected in most comparative genomic studies. The functional content of various classes of conserved sequence will also be discussed.


Ms Tatyana Goldberg Technical University of Munich Germany

Biography: Tatyana Goldberg is a PhD student in Bioinformatics at Technical University of Munich, Germany. In her research Tatyana focuses on applying Machine Learning to answer various biological questions. In particular she is interested in the prediction of protein sub-cellular localisation and the understanding of micro-world warfare (prediction of bacterial pathogen effectors). Tatyana is leading students in several scientific projects, including those participating in “The CAFA Challenge” and “Google Summer of Code”.

Date: Tuesday 8 July 2014

Presentation title: Machine learning in action

Abstract: Advances in high-throughput sequencing technologies have led to an enormous increase in the amount of data stored in public databases. The experimental annotation of this data, however, remains a challenging task, thus widening the sequence-to-annotation gap. Reliable computational methods for predicting protein function could counter this trend; they are becoming invaluable in the analysis and annotation of biological data. In this presentation I will give an introduction to machine learning and its applications in bioinformatics. Using protein sub-cellular localisation prediction as an example, I will discuss a typical workflow for applying machine learning methods and provide code samples.


Dr Nouri Ben Zakour School of Chemistry and Molecular Biosciences Australian Infectious Diseases Research Centre The University of Queensland

Biography: Dr Nouri Ben Zakour is a researcher in Microbial and Evolutionary Genomics with over 10 years of international experience in the field. After completing her PhD in Bioinformatics at the French National Institute for Agricultural Research, she held a post-doctoral position at the Roslin Institute, University of Edinburgh, to work on the genomic basis of host adaptation in staphylococcal species. In 2009, she joined the Australian Infectious Diseases Research Centre and School of Chemistry and Molecular Biosciences at the University of Queensland as a senior post-doctoral fellow. Working with Dr Scott Beatson and the Microbial Genomics Group, she has expanded her knowledge on the evolution of bacterial pathogens of medical and veterinary importance. Her interests range from population genetics and evolutionary genomics to functional genomics, to elucidate how pathogenic bacteria evolve to colonise new ecological niches and cause outbreaks.

Date: Tuesday 8 July 2014

Presentation title: Detection of recombination events in bacterial genomes

Abstract: Bacteria have the extraordinary ability to evolve not only by accumulating point mutations, but also by acquiring foreign DNA through lateral gene transfer. They can also “reshuffle” alleles present in a bacterial population through a mechanism called homologous recombination, which allows them to exchange homologous DNA regions. Recombination can mediate large evolutionary jumps in bacterial genomes by rapidly spreading variants associated with increased virulence, antibiotic resistance or fitness. A corollary of this adaptive diversification is that laterally exchanged variations introduced by recombination conflict with the phylogenetic signal of vertically transmitted variations. Detecting recombination in bacterial genomes is not only essential to understand the patterns of bacterial evolution and adaptation, but can also be crucial when attempting to infer phylogenies.

A plethora of approaches has been developed in recent years to solve the computational challenges of detecting recombination events in bacterial genomes. I will review some of the current approaches, with a particular emphasis on those adapted to large-scale population studies. I will also briefly illustrate, with some examples, how recent advances in the detection of recombination have helped shift some of the established dogmas of bacterial evolution.


Dr Fabian Buske Garvan Institute of Medical Research

Biography: Dr. Fabian Buske specialises in Big Data analysis of sequence, epigenomic, transcriptomic and medical data. He did his PhD on nucleic acid triple helices at the Institute for Molecular Bioscience, The University of Queensland. In 2013, he joined Prof. Susan Clark's lab at the Garvan Institute of Medical Research in Sydney on the quest to advance cancer research. He accepted the challenge of integrating the wide array of epigenetic data sets and of extending our predominantly one-dimensional view of genomic data to the third dimension in order to gain new insights into the cellular mechanisms that contribute to cancer.

Date: Tuesday 8 July 2014

Presentation title: Epigenomics: The many garments of the genome sequence

Abstract: Epigenetic modifications are reversible modifications on the DNA that affect gene expression without changing the actual genome sequence. The spectrum of modifications ranges from DNA methylation, histone modification and nucleosome positioning to DNA packaging and chromatin organisation in three-dimensional space. This presentation will highlight different assays and bioinformatic approaches used to query epigenetic modifications genome-wide as well as how these layers of information can be integrated into meaningful models.


Dr Jian Yang Senior Research Fellow Queensland Brain Institute The University of Queensland

Biography: Jian Yang is a Senior Research Fellow at Queensland Brain Institute, The University of Queensland. He received his PhD in 2008 from Zhejiang University, China, which was followed by postdoctoral research at the Queensland Institute of Medical Research. He joined The University of Queensland in 2012. His research interests are in developing novel methods and software tools to better understand the genetic architecture of complex diseases and traits using high-throughput genetic and genomic data. In 2012, he won the Centenary Institute Lawrence Creative Prize, which is awarded annually to only one young medical researcher in Australia. He was awarded an NHMRC RD Wright Career Development Fellowship in the same year, and was part of a team shortlisted for the Eureka Prize in Scientific Research. In 2013, he received a UQ Foundation Research Excellence award and was one of two recipients of the Sylvia and Charles Viertel Charitable Foundation’s Senior Medical Research Fellowship.

Date: Tuesday 8 July 2014

Presentation title: Mixed linear model analyses of human complex traits using SNP data

Abstract: Most traits and common diseases in humans, such as height, cognitive ability, psychiatric disorders and obesity, are influenced by many genes and their interplay with environmental factors. These diseases/traits are called “complex” traits to differentiate them from “Mendelian” traits that are caused by single genes. Understanding the genetic architecture of human complex traits, e.g. how much of the difference between people’s susceptibilities to diseases is accounted for by their difference in DNA sequence, how many genes are involved in the etiology of diseases, where the genes are located and how large the effects of the genes are on disease risk, is essential to diagnosis, discovery of new drug targets and prevention. To date, thousands of gene loci, as represented by single nucleotide polymorphisms (SNPs), have been identified as associated with hundreds of human complex traits by the genome-wide association study (GWAS) technique. In this lecture, I will introduce the use of mixed linear models in the analysis of GWAS data: to estimate the proportion of variance for a trait that can be explained by all SNPs (the SNP heritability), to quantify the extent to which two traits (or diseases) share a common genetic basis (genetic correlation) using all SNPs, and to control for population structure in genome-wide association analyses of individual SNPs.


Dr Joseph Powell Queensland Brain Institute The University of Queensland

Biography: Dr Joseph Powell is a team leader in the Centre for Neurogenetics and Statistical Genomics based at the Queensland Brain Institute. He received his PhD from the University of Edinburgh in 2010 followed by two years working as a post-doctoral researcher at QIMR Berghofer.

Joseph has worked on a range of research projects involving methods, theory and application around the nexus of quantitative, statistical and population genetics. This has provided a good foundation for his more recent work investigating the genetic architecture regulating gene expression and its role within a systems genetics framework.

Date: Tuesday 8 July 2014

Presentation title: Detection and replication of epistasis influencing transcription in humans

Abstract: Epistasis is the phenomenon whereby one polymorphism’s effect on a trait depends on other polymorphisms present in the genome. The extent to which epistasis influences complex traits and contributes to their variation is a fundamental question in evolution and human genetics. Although often demonstrated in artificial gene manipulation studies in model organisms, and some examples have been reported in other species, few examples exist for epistasis among natural polymorphisms in human traits. Its absence from empirical findings may simply be due to low incidence in the genetic control of complex traits, but an alternative view is that it has previously been too technically challenging to detect owing to statistical and computational issues. Here we show, using advanced computation and a gene expression study design, that many instances of epistasis are found between common single nucleotide polymorphisms (SNPs). In a cohort of 846 individuals with 7,339 gene expression levels measured in peripheral blood, we found 501 significant pairwise interactions between common SNPs influencing the expression of 238 genes (P < 2.91 × 10⁻¹⁶). Replication of these interactions in two independent data sets showed both concordance of direction of epistatic effects (P = 5.56 × 10⁻³¹) and enrichment of interaction P values, with 30 being significant at a conservative threshold of P < 9.98 × 10⁻⁵. Forty-four of the genetic interactions are located within 5 megabases of regions of known physical chromosome interactions (P = 1.8 × 10⁻¹⁰). Epistatic networks of three SNPs or more influence the expression levels of 129 genes, whereby one cis-acting SNP is modulated by several trans-acting SNPs. For example, MBNL1 is influenced by an additive effect at rs13069559, which itself is masked by trans-SNPs on 14 different chromosomes, with nearly identical genotype–phenotype maps for each cis–trans interaction. This study presents the first evidence, to our knowledge, for many instances of segregating common polymorphisms interacting to influence human traits.
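The idea of a genotype–phenotype map for a pair of SNPs can be made concrete with a small sketch. This is not the statistical test used in the study; it is a hypothetical helper that assumes a 3×3 table of unweighted mean expression values, indexed by genotype at one SNP (rows) and at another SNP (columns), and reports how far each cell departs from what the marginal row and column effects alone would predict. Non-zero residuals indicate that the effect of one SNP depends on the genotype at the other.

```python
def interaction_residuals(cell_means):
    """Return the part of each cell mean not explained by marginal effects.

    cell_means is a rectangular table (list of rows) of mean trait values,
    with rows indexed by genotype at SNP A and columns by genotype at SNP B.
    Assumption: cells are weighted equally, ignoring genotype frequencies.
    """
    n_rows = len(cell_means)
    n_cols = len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in cell_means]
    col_means = [
        sum(cell_means[i][j] for i in range(n_rows)) / n_rows
        for j in range(n_cols)
    ]
    # Residual = observed - row effect - column effect + grand mean.
    return [
        [cell_means[i][j] - row_means[i] - col_means[j] + grand
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Purely additive map: each SNP shifts the trait independently.
additive = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]

# Masking map: SNP A has an additive effect only when SNP B has genotype 0.
masked = [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
```

Applied to `additive`, every residual is zero; applied to `masked`, the residuals are non-zero, reflecting a cis-acting additive effect that disappears under some genotypes at the second SNP.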


Professor Graham Taylor Department of Pathology University of Melbourne

Biography: Professor Graham Taylor is the Herman Professor of Genomic Medicine, Department of Pathology, University of Melbourne, and Director of the Australian Node of the Human Variome Project.

In 2006 he led the UK Department of Health Funded project “New genetic diagnostic technologies for consanguineous families at risk of recessive genetic disease” and became Head of Genomic Services for Cancer Research UK, chairing the advisory committee for genome wide association (GWA) studies and leading a review of CR-UK bioinformatics demand and capacity and an evaluation of Next Generation Sequencing (NGS) technology.

In 2009 he joined the Leeds Teaching Hospitals and Leeds University as Professorial Head of the Genomics Translation Unit. The Unit was instrumental in establishing the Leeds Genetics Service as the leading provider of genetic diagnosis using NGS within the NHS. His team developed the Grouped Read Typing method for diagnostic amplicon sequencing in fixed tissue, copy number variation analysis by NGS and streamlined conventional genetic testing by NGS. In 2012 he joined the University of Melbourne.

Date: Wednesday 9 July 2014

Presentation title: The future of DNA sequencing technology

Abstract: I will review the recent history of “post-Sanger” sequencing technology, and then make some wild and unjustified extrapolations into the future based on too few data points.

I will review some of the technologies on the horizon and ask how we can appraise them.

For example, if we can define sequence read quality as a composite of read length and base-calling accuracy, recent trends have overwhelmingly been in the direction of quantity at the expense of quality. As a consequence a great deal of informatics effort has been expended in managing rather poor quality data. Of course the human genome, along with many other genomes, is not particularly amenable to analysis, containing entities such as pseudogenes, non-coding regions (sometimes referred to as “junk”, sometimes claimed to be functionally important) and short repeats. So how does the collision of a relatively refractory analyte like the human genome and an imperfect sequencing method result in a “genomics revolution”? What have we gained and what are the current limitations that need to be addressed in future technologies?

I will look at two examples of the impact of current and pending sequencing technology: tumour analysis in fixed and fresh tissue and the identification of allele expansions.


Dr Denis Bauer Computational Informatics (CCI) CSIRO

Biography: Dr. Bauer is interested in high-performance computer systems for integrating large data-volumes to inform strategic interventions for human health. She has a PhD in Bioinformatics and post-docs in machine-learning and genetics, has published in Nature Genetics and Genome Research, was an invited speaker at Bio-IT World Asia 2013, and has attracted more than AU$360,000 in funding (NSW Cancer Institute, CSIRO).

Date: Wednesday 9 July 2014

Presentation title: Population-scale high-throughput sequencing data analysis

Abstract: Unprecedented computational capabilities and high-throughput data collection methods promise a new era of personalised, evidence-based healthcare, utilising individual genomic profiles to tailor health management as demonstrated by recent successes in rare genetic disorders or stratified cancer treatments. However, processing genomic information at a scale relevant for the health system remains challenging due to high demands on data reproducibility and data provenance. Furthermore, meeting the computational requirements calls for a large investment in computer hardware and IT personnel, which is a barrier to entry for small laboratories and difficult to maintain at peak times for larger institutes. This hampers the creation of time-reliable production informatics environments for clinical genomics. Commercial cloud computing frameworks like Amazon Web Services (AWS) provide an economical alternative to in-house compute clusters as they allow outsourcing of computation to third-party providers, while retaining the software and compute flexibility.

To cater for this resource-hungry, fast-paced yet sensitive environment of personalised medicine, we developed NGSANE, a Linux-based, HPC-enabled framework that minimises overhead for set up and processing of new projects yet maintains full flexibility of custom scripting and data provenance when processing raw sequencing data either on a local cluster or Amazon’s Elastic Compute Cloud (EC2).


A/Professor Marcel Dinger Head of Clinical Genomics & Genomic Informatics Garvan Institute of Medical Research

Biography: A/Prof. Marcel Dinger is the Head of Clinical Genomics and Genome Informatics at the Garvan Institute of Medical Research. Prior to his position at the Garvan Institute, A/Prof. Dinger led Cancer Genomics and Transcriptomics at the University of Queensland Diamantina Institute. Marcel received his PhD from the University of Waikato in 2003. While undertaking his PhD, Marcel founded an informatics company that produced a series of highly successful products and services. In 2005, he resumed his academic career with a prestigious New Zealand Foundation for Research Science and Technology Postdoctoral Fellowship to join Professor Mattick’s group at the Institute for Molecular Bioscience at The University of Queensland to study the role of long noncoding RNAs in mammalian development and disease. In 2009, he was awarded an NHMRC Career Development Award and a Smart Futures Fellowship.

Date: Wednesday 9 July 2014

Presentation title: Translating exome and whole genome sequencing to the clinic

Abstract: Since sequencing the draft human genome in 2001, the number of diseases with known genetic basis has increased >50-fold to over 3000. Despite this remarkable success, >2000 Mendelian disorders remain unsolved, and up to 70% of patients presenting at the clinic with genetic disorders remain undiagnosed. Clinical-grade genome sequencing holds the dual promise of improving diagnostic rates, and empowering genetic research through the discovery of novel disease-associated variants. The long-term research value of performing whole exome and genome sequencing in a diagnostic setting on thousands of individuals will offset its initially higher cost and complexity compared with a targeted gene-panel approach.

In late 2012, we established the Kinghorn Centre for Clinical Genomics (KCCG) with the aim of implementing genomic medicine in Sydney. At the heart of the KCCG are 2 Illumina HiSeq 2500 sequencers that are used for rapid turnover exome sequencing, and more recently, one of the world’s first HiSeq X Ten sequencing suites, with the capability of sequencing more than 300 whole human genomes per week. Since we intend to provide NATA-certified, clinical-grade sequencing, much of our work over the past 12 months has been focused on the development of standardised procedures for test procurement in the clinic through to wet-lab processes, bioinformatics and clinical reporting. The bioinformatics workflow includes phenotype capture, read alignment, mutation calling, variant annotation and filtering by inheritance pattern, rarity, predicted functional impact and known disease association.

To date, we have sequenced exomes from >100 patients, from a range of conditions, largely reflecting the undiagnosed caseload at the Sydney Children’s Hospital. We will present some early success stories from sequencing these exomes and reflect on the possibilities presented by low-cost whole genome sequencing in the diagnosis of inherited disease.

Marcel E. Dinger¹, Mark J. Cowley¹, Kevin Ying¹, Jiang Tao¹, Liviu Constantinescu¹, Derrick Lin¹, Paula Morris¹, Kerith-Rae Dias¹, Warren Kaplan¹, Lisa Ewans², Tony Roscioli²

¹ Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
² Sydney Children’s Hospital and the School of Women’s and Children’s Health, UNSW, Randwick, NSW, Australia


Professor John Quackenbush Dana Farber Cancer Institute & Harvard School of Public Health USA

Biography: John Quackenbush is a Professor of Computational Biology and Bioinformatics in the Department of Biostatistics, Harvard School of Public Health and at the Dana-Farber Cancer Institute. He received his PhD in 1990 in theoretical physics from UCLA on string theory models. Following two years as a postdoctoral fellow in physics, Dr Quackenbush applied for and received a Special Emphasis Research Career Award from the National Center for Human Genome Research to work on the Human Genome Project. He spent two years at the Salk Institute and two years at Stanford University working at the interface of genomics and computational biology. In 1997 he joined the faculty of The Institute for Genomic Research (TIGR) where his focus began to shift to understanding what was encoded within the human genome. Since joining the faculties of the Dana-Farber Cancer Institute and the Harvard School of Public Health in 2005, his work has focused on the use of genomic data to reconstruct the networks of genes that drive the development of diseases such as cancer and emphysema.

Date: Thursday 10 July 2014

Presentation title: Taming the Big Data Dragon

Abstract: Nearly every major scientific revolution in history has been driven by one thing: data. Today, the availability of Big Data from a wide variety of sources is transforming health and biomedical research into an information science, where discovery is driven by our ability to effectively collect, manage, analyse, and interpret data. New technologies are providing abundance levels of thousands of proteins, population levels of thousands of microbial species, expression measures for tens of thousands of genes, information on patterns of genetic variation at millions of locations across the genome, and quantitative imaging data, all on the same biological sample. These omic data can be linked to vast quantities of clinical metadata, allowing us to search for complex patterns that correlate with meaningful health and medical endpoints. Environmental sampling and satellite data can be cross-referenced with health claims information and Internet searches to provide insights into the impact of atmospheric pollution on human health. Anonymised data from cell-phone records and text messages can be tied to health outcomes data, helping us explore disease transmission networks. Realising the full potential of Big Data will require that we develop new analytical methods to address a number of fundamental issues and that we develop new ways of integrating, comparing, and synthesising information to leverage the volume, variety, and velocity of Big Data. Drawing on our work, I will present concrete examples that highlight the challenges and opportunities that present themselves in today’s data-rich environment.


Professor Falk Schreiber Monash University; and University Halle-Wittenberg, Germany

Biography: Falk Schreiber was awarded a PhD and a habilitation in Computer Science from the University of Passau (Germany). In 2001-2002 he worked as a Research Fellow and Lecturer at the University of Sydney. Since 2003, he has been head of a bioinformatics research group at the Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben, Germany. In 2007 he was appointed professor of Bioinformatics at the Martin Luther University Halle-Wittenberg (Germany) and additionally Bioinformatics coordinator at the IPK Gatersleben. He is currently taking up a position as professor in the Faculty of IT at Monash University.

Dr Schreiber has been researching topics in bioinformatics and computational systems biology for more than 15 years. His main interests are visual computing and visual analytics of biological data, analysis of structure and dynamics of biological networks, integrative analysis of omics data, graphical standards for systems biology, as well as modelling and analysis of metabolism.

Date: Thursday 10 July 2014

Presentation title: From Big Data to smart knowledge - integrating multimodal biological data and modelling metabolism

Abstract: Modern data acquisition methods in the life sciences allow the procurement of different types of data in increasing quantity, facilitating a comprehensive view of biological systems. As data are usually gathered and interpreted by separate domain scientists, it is hard to grasp multi-domain properties and structures. Consequently there is a need for the integration, analysis, modelling, simulation, and visualisation of life science data from different sources and of different types.

This talk focuses on two aspects. Firstly, methods for the integration and visualisation of multimodal biological data are presented. This is achieved based on two graphs representing the meta-relations between biological data, and the measurement combinations, respectively. Both graphs are linked and serve as different views of the integrated data with navigation and exploration possibilities. Data can be combined and visualised in many different ways, resulting in views of the integrated biological data. Secondly, methods to reconstruct, simulate, and analyse detailed metabolic models are presented. We will focus on stoichiometric models, and see how different types of data are used to gather new insights into metabolic processes, illustrated with an example of metabolism in plants.


Professor Seok-Hee Hong ARC Future Fellow School of Information Technologies University of Sydney

Biography: Prof. Hong is a Professor and a Future Fellow at the School of IT, University of Sydney. She was a Humboldt Fellow in 2013-2014, and project leader of the VALACON (Visualisation and Analysis of Large and Complex Networks) project at NICTA (National ICT Australia) in 2004-2007. Her research interests include graph drawing, algorithms, information visualisation and visual analytics.

In 2006, she won the CORE (Computing Research and Education Association of Australasia) Chris Wallace Award for Outstanding Research Contribution in the field of Computer Science, for her research "Theory and Practice of Graph Drawing". The award was given for notable breakthroughs and a contribution of particular significance.

Prof. Hong has held research funding of $4.5M, from her three fellowships (Future Fellowship, ARC Research Fellowship and Humboldt Fellowship), three ARC Discovery Projects and two ARC Linkage Projects including her latest project on "Algorithmics for Visual Analytics of Massive Complex Networks". She has more than 140 publications including 10 edited books, 7 book chapters, 40 journal papers, and 90 conference papers, and she has given 10 invited talks at international conferences as well as 50 invited seminars worldwide. In particular, she has developed the open-source visual analytics software GEOMI with her research team members.

Prof. Hong serves as a Steering Committee member of GD (International Symposium on Graph Drawing), IEEE PacificVis (International Symposium on Pacific Visualisation) and ISAAC (International Symposium on Algorithms and Computation) and an editor of JGAA (Journal of Graph Algorithms and Applications). She has served as a Program Committee Chair of AWOCA 2004, APVIS 2005/2007, GD 2007, ISAAC 2008 and IEEE PacificVis 2013, and a Program Committee Member of 50 international conferences. In particular, she has formed the Information Visualisation research community in the Asia-Pacific Region, by founding the IEEE PacificVis Symposium.

Date: Thursday 10 July 2014

Presentation title: Visual analytics of Big Data

Abstract: Recent technological advances have led to the production of Big Data, and consequently to many massive complex network models in many domains including science and engineering. Examples include biological networks such as phylogenetic networks, gene regulatory networks, metabolic pathways, biochemical networks and protein-protein interaction networks. Other examples are social networks such as Facebook, Twitter, LinkedIn, telephone calls, patents, citations and collaborations.

Visualisation is an effective analysis tool for such networks. Good visualisation reveals the hidden structure of the networks and amplifies human understanding, thus leading to new insights, new findings and predictions. However, constructing good visualisation of Big Data can be challenging.

In this talk, I will present a framework for visual analytics of Big Data. Visual Analytics is the science of analytical reasoning facilitated by interactive visual interfaces. Our framework is based on the tight integration of network analysis methods with visualisation methods to address the scalability and complexity issues. I will present a number of case studies using various networks derived from Big Data, in particular social networks and biological networks.


Dr Andrew Treloar Director of Technology Australian National Data Service (ANDS)

Biography: Dr Andrew Treloar is the Director of Technology for the Australian National Data Service (ANDS) (http://ands.org.au/), with particular responsibility for international engagement. In 2008 he led the project to establish ANDS. He is currently co-chair of the Research Data Alliance (http://rd-alliance.org/) Technical Advisory Board and Visiting Fellow at the Data Archive and Network Services organisation in the Netherlands (http://dans.knaw.nl/). His research interests include data management and scholarly communication. He never seems to be able to make enough time for practising his cello, or reading, but does try to prioritise talking to his chickens and working in his vegetable garden and orchard. Further details at http://andrew.treloar.net/ or follow him on Twitter as @atreloar.

Date: Thursday 10 July 2014

Presentation title: The life-sciences as a pathfinder in data-intensive research practice

Abstract: The advent of the Internet is bringing about fundamental changes in the ways that research is performed and communicated. These have been particularly driven by the growing importance of data, as well as the tools available to work with this data. This presentation will examine this shift, drawing on examples from the life-sciences, and try to make some predictions about the next five years.


Dr Alec Zwart CSIRO, Canberra

Biography: Dr Zwart holds three degrees from the University of Waikato in New Zealand: BSc Honours in Computing and Mathematical Sciences, PhD in Industrial Magnetohydrodynamics and Master of Science in Statistics.

After completing his PhD in 1998, he joined New Zealand's National Institute for Water and Atmospheric Research as a mathematical modeller. He then completed his Master of Science in Statistics in 2002 and worked as a tutor, lecturer and part time statistical consultant.

Alec joined CSIRO in Canberra in 2006 as a biometrician. He has particular interests in agricultural and horticultural statistics, particularly the robust design of agricultural/horticultural experiments and field trials and the analysis of datasets arising from such experiments.

Date: Thursday 10 July 2014

Presentation title: Statistical experiment design principles for biological studies

Abstract:

“To consult the statistician after an experiment is finished is often merely to ask them to conduct a post mortem examination. They can perhaps say what the experiment died of.”

- Sir Ronald Aylmer Fisher, the father of modern statistics.

Statistical experimental design, accompanied by the appropriate statistical analyses, plays a crucial role in producing valid and precise inferences, and in avoiding ‘design disasters’ in empirical science. I will quickly refresh some of the basic elements of experimental design in general, and discuss some key issues and examples that arise in the areas of genetics/genomics and high-throughput data.

Professor David Evans The University of Queensland Diamantina Institute

Biography: David Evans is Professor of Statistical Genetics and Head of Genomic Medicine at the University of Queensland Diamantina Institute. He obtained his PhD at the University of Queensland in 2003, before undertaking a four-year postdoctoral fellowship in statistical genetics at the Wellcome Trust Centre for Human Genetics, University of Oxford. In 2007 he moved to take up a Senior Lecturer then Reader position at the University of Bristol, where he has led the genome-wide association studies work in the Avon Longitudinal Study of Parents and Children (ALSPAC). His research interests include the genetic study of several complex traits and diseases, including ankylosing spondylitis, osteoporosis, atopic dermatitis and three-dimensional face shape, via genome-wide association and next-generation sequencing approaches. His other main research interest is the development of statistical methodologies in genetic epidemiology, including approaches for gene mapping, individual risk prediction, causal modelling (including Mendelian randomisation) and dissecting the genetic architecture of complex traits. On weekends he likes to surf and is enjoying the temperature difference between Queensland waters and the northern coast of Devon.

Date: Thursday 10 July 2014

Presentation title: Genome-wide association studies

Abstract: Genome-wide association studies have been spectacularly successful over the last few years in terms of identifying common genetic variants associated with complex traits and diseases. David will explain how simple statistical tests can be used to map genetic loci associated with complex traits. This will include a discussion of genotype imputation, meta-analysis, approaches to detect and correct for population stratification, as well as some guidelines on how the results from genome-wide association studies should be interpreted and replicated.

Dr Marie-Laure Martin-Magniette Mathématiques et Informatique Appliquées Institut National de la Recherche Agronomique France

Biography: Marie-Laure Martin-Magniette is a director of research at the French National Institute for Agronomical Research (INRA) in the Unit of Applied Mathematics and Computer Sciences (Statistics & Genome team) and in the Plant Genomics Research Unit (Bioinformatics for predictive genomics team). In 2001, she received her PhD from Université Paris-Sud, France, for the development of new survival models that take into account measurement error in covariates and allow the estimation of a flexible hazard function. She did a one-year postdoctoral fellowship in epidemiology at INRA and at Nantes Hospital and was recruited as a junior researcher at INRA in the Plant Breeding Department in 2003.

Since 2003, Marie-Laure has been strongly involved in the analysis of genomic data and works at the interface between statistics and molecular biology. For 11 years she has been in charge of the statistical analyses of the data produced by the transcriptomic platform of the Plant Genomics Research Unit. Over that time she has acquired strong expertise in data normalisation and differential analysis for microarray and high-throughput sequencing technologies. She has also investigated the analysis of ChIP-chip data to detect enriched regions and differentially methylated regions.

Since 2005 she has focused on the discovery and characterisation of underlying structures in genomic data with mixture models and Hidden Markov Models. She conceived these models in close collaboration with fellow biologists and statisticians. Since September 2013, she has led the team Bioinformatics for predictive genomics of the Plant Genomics Research Unit. Her team project is highly interdisciplinary and deals with the construction of genomic networks of the model plant Arabidopsis thaliana for the discovery of functional modules and the prediction of functions of orphan genes involved in stress responses.

Date: Thursday 10 July 2014

Presentation title: Mixture models for analysing transcriptome and ChIP-chip data

Abstract: Mixture models are useful for identifying underlying structures. In such models, the density of the observations is modelled by a weighted sum of parametric densities (e.g. each component is a Gaussian distribution), and each component represents a subpopulation composed of observations sharing common characteristics. The first part of my talk will be dedicated to a presentation of mixture models. I will explain the concept and the outputs of a mixture-based analysis through easy examples. In the second part of my talk, I will show how mixture models can be applied to analyse transcriptomic data (co-expression analysis of Arabidopsis thaliana genes) and ChIP-chip data (detection of enriched regions and of differentially methylated regions).
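As a rough illustration of the weighted-sum idea described in this abstract (a minimal sketch only, not the speaker's own models or software), the density of a two-component Gaussian mixture and the probability that an observation belongs to each component can be computed as below. The function names and all parameter values (weights, means, standard deviations) are hypothetical and chosen purely for illustration.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, means, sds):
    """Mixture density: a weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def posterior_probabilities(x, weights, means, sds):
    """Probability that observation x belongs to each component (Bayes' rule)."""
    parts = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(parts)
    return [p / total for p in parts]


# Hypothetical parameters: a large "background" component and a smaller
# "enriched" component (illustrative values, not taken from any dataset).
weights = [0.7, 0.3]
means = [0.0, 4.0]
sds = [1.0, 1.0]

for x in (0.0, 2.0, 4.0):
    post = posterior_probabilities(x, weights, means, sds)
    print(x, round(mixture_density(x, weights, means, sds), 4),
          [round(p, 4) for p in post])
```

In practice the parameters are not known in advance and are estimated from the data, typically with the EM algorithm; each observation is then assigned to the component with the highest posterior probability.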

Dr Kim-Anh Lê Cao The University of Queensland Diamantina Institute

Biography: Dr Kim-Anh Lê Cao was awarded her PhD in 2008 by the Université de Toulouse, France. She was awarded the 2009 "Marie-Jeanne Laurent-Duhamel" prize of the Société Française de Statistique (French Statistical Society) for her PhD thesis.

She started her postdoc in late 2008 at the ARC Centre of Excellence in Bioinformatics with Prof. Geoff McLachlan and then worked as a research-only academic at QFAB Bioinformatics. She is now based at the University of Queensland Diamantina Institute.

Since the beginning of her PhD, Kim-Anh has initiated a wide range of valuable collaborative and research opportunities in both statistics and molecular biology. Her research interests are multidisciplinary: they focus on the statistical characterisation of molecular biological systems, and she is interested in developing sound statistical frameworks to address new biological questions arising from these frontier molecular technologies. Her main research focus is on variable selection for biological data (‘omics’ data) coming from different functional levels by means of dimension reduction approaches.

Date: Thursday 10 July 2014

Presentation title: Multivariate models for dimension reduction and biomarker selection in omics data

Abstract: Recent advances in high-throughput ‘omics’ technologies enable quantitative measurements of the expression or abundance of biological molecules across a whole biological system. The transcriptome, proteome and metabolome are dynamic entities, with the presence, abundance and function of each transcript, protein and metabolite being critically dependent on its temporal and spatial location.

Whilst single omics analyses are commonly performed to detect between-group differences from either static or dynamic experiments, the integration or combination of multi-layer information is required to fully unravel the complexities of a biological system. Data integration relies on the currently accepted biological assumption that the functional levels are related to one another. Therefore, considering all the biological entities (transcripts, proteins, metabolites) as part of a whole biological system is crucial to unravel the complexity of living organisms.

With many contributors and collaborators, we have further developed several multivariate approaches to project high dimensional data into a smaller space and select relevant biological features, while capturing the largest sources of variation in the data. These approaches are based on variants of partial least squares regression and canonical correlation analysis and enable the integration of several types of omics data.

In this presentation, I will illustrate how various techniques enable exploration, biomarker selection and visualisation for different types of analytical frameworks.

Dr Rob Lanfear Senior Lecturer Department of Biological Sciences Macquarie University

Biography: Rob is a senior lecturer at Macquarie University in Sydney, where he works on molecular evolution and phylogenetics with the aim of understanding the causes and consequences of molecular evolution. His work bridges spatial and temporal scales: from developing methods to identify and understand mutations that occur within a single individual over a few decades, to analysing the long-term evolution of globally distributed clades of species over millions of years. He also investigates theoretical aspects of molecular evolution, and has developed new statistical methods and software to help infer phylogenies from huge DNA datasets. Rob has an undergraduate degree in Ecology (Durham), a Masters in Artificial Intelligence (Sussex), and a PhD in Developmental Biology (Sussex). He switched to studying molecular evolution and phylogenetics full-time during his postdoctoral work at the Australian National University.

Date: Friday 11 July 2014

Presentation title: An introduction to phylogenetic inference

Abstract: Phylogenies are fantastically important in biology. In addition to telling us the relationships among organisms, they can be used to date evolutionary divergences, delineate species, track disease outbreaks, understand molecular evolution, and inform conservation decisions. This talk will give a quick overview of some of these applications, and then delve deeper into the methods that can be used to infer phylogenies from molecular sequence data. The talk will explain and compare parsimony methods, distance methods, maximum likelihood, and Bayesian approaches to phylogenetic inference. It will finish up by introducing some of the most recent methodological advances for inferring phylogenies from phylogenomic datasets – gigantic datasets that can include thousands of genes from thousands of species.

Distinguished Professor David Penny Institute of Fundamental Sciences Massey University New Zealand

Biography: David Penny has been involved with reconstructing evolutionary trees from DNA and protein sequences for over 30 years, and is now extending this to predicted tertiary structures. As a biologist, he has worked with mathematicians (particularly Professor Mike Hendy) in order to allow quantitative evaluation of the results, and to measure the rate of convergence as sequences get longer. His interests include any likely deviations from the model of evolution that is assumed, and what effect, if any, this is likely to have on the tree that is produced.

David holds undergraduate degrees in Botany (BSc) and Chemistry (BSc Honours) from Canterbury University College (Christchurch NZ), and a PhD in Biology from Yale University. Following postdoctoral research at McMaster University (Hamilton, Ontario, Canada) he returned to New Zealand (Massey University), where he is now Distinguished Professor of Theoretical Biology. In 2000 he was awarded the Marsden Medal of the NZ Association of Scientists in recognition of his outstanding service to science. He is a Fellow of the Royal Society of New Zealand, and in 2004 was awarded the Rutherford Medal in recognition of his contributions in theoretical biology, molecular evolution, and the analysis of DNA. In 2006 he was made a Companion of the New Zealand Order of Merit for services to science. He is a former president of the NZ Association of Scientists.

Date: Friday 11 July 2014

Presentation title: Loss of information at deeper divergences, and what we can do about it

Abstract: It has been shown by Mossel and Steel (2004) that simple Markov models lose information at the deepest divergences (say, greater than 400 million years ago), and that the fall-off is exponential at deeper times. However, that does not mean that there is no information left; for example, the three-dimensional structure of proteins should still retain information about deeper divergences, although we may not yet know how to use that information. Biologists still want to estimate the deeper divergences and thus it is a significant question to find additional sources of information. Several suggestions are offered that require a more formal analysis. Firstly, we probably expect that where there is a real Gamma distribution of rates, information may be retained for longer. Secondly, if there is really a bimodal distribution of rates, then identifying and eliminating the faster-evolving sites should help. Thirdly, the inference of ancestral sequences at deeper divergences appears quite robust, and there is some evidence that this may help recover deeper divergences. Fourthly, it is increasingly possible to infer three-dimensional structures, and these should retain information longer. Fifthly, there may be differences between the loop regions of Akaryote and Eukaryote proteins, and only taking the regions crossing the central 3D region might help. Sixthly, an approach of weighting, not of characters, but of the partitions they are consistent with, might help. Seventhly, gene order information might be helpful. Several examples of such approaches will be presented, and a challenge issued to theoreticians to solve some of these fundamental issues. There is still a lot to learn about protein evolution.

Professor Lindell Bromham Centre for Macroevolution & Macroecology, Evolution, Ecology & Genetics Research School of Biology Australian National University

Biography: I am an evolutionary biologist, and I am interested in ways of testing ideas about macroevolutionary patterns and mechanisms, particularly the way that phylogenies constructed from DNA sequence data can be used to understand the evolutionary past and evolutionary processes. I have used comparative analyses to investigate processes of evolutionary change spanning timescales from current patterns of biodiversity to ancient evolutionary patterns. But in order to use molecular data to understand evolution, we need to understand how evolutionary information is recorded in the genome, so I also study the way that patterns and rates of molecular evolution are influenced by species characteristics, environment, and macroevolutionary processes.

Date: Friday 11 July 2014

Presentation title: From mutation to macroevolution

Abstract: Molecular phylogenetics allows us to use the patterns of changes in the genomes of different species to reconstruct evolutionary history. This has revolutionised studies of macroevolution, which focus on the patterns and processes of variation in biodiversity over time, space or lineages. But molecular phylogenies are not just a useful tool in macroevolution; they are also a way of thinking about the connection between change at the genomic level and evolution at the level of global biodiversity. I will use a number of examples to explore how molecular phylogenetic analysis has the potential to overcome the hierarchical distinction between macroevolution and microevolution by allowing us to consider genome-level, population-level and lineage-level patterns in a single analysis.

Professor Mike Wilkinson Head, School of Agriculture, Food and Wine The University of Adelaide

Biography: Professor Mike Wilkinson is Head of the School of Agriculture, Food and Wine at the University of Adelaide and Director of the Waite Research Institute. The UK-born research scientist, who joined the University in September 2011, is best known for his work on quantifying the risks associated with GM crops, and has published extensively in this area.

He has a PhD from the University of Leicester in hybridisation and evolutionary processes in wild grasses. Prior to immigrating to Adelaide in 2011, Professor Wilkinson established the world’s first Master of Science focused on training regulators of GM crops, a project funded by the Bill and Melinda Gates Foundation.

A specialist in plant genetics, Professor Wilkinson has previously worked at the Scottish Crop Research Institute in crop research and cytogenetics, was Director of the Institute of Biological Sciences at Aberystwyth University and also Trustee of the National Botanic Gardens in Wales.

Professor Wilkinson has over 20 years of research experience in plant and animal genetics and has published several significant works in the area of plant epigenetics. Most recently, his studies into epigenetics featured several papers in high-impact international journals including Nature Communications (on the epigenetics of the human parasite schistosomiasis), Analytical Chemistry (two works on the chemistry of DNA methylation) and Journal of Experimental Botany (on heritable epigenetic effects). He also holds three patents in the field and has secured several million dollars of external funding in support of epigenetics research. He is well acquainted with all the methods to be used in the project and co-developed some of them. Over the course of his career, he has supervised >30 PhD students to completion (all within 4 years).

Date: Friday 11 July 2014

Presentation title: The application of high throughput DNA barcoding for landscape ecology and management

Abstract: One of the chief justifications for the development of DNA barcoding for species identification rested on its potential for the rapid identification of cryptic species, or of representatives from taxonomically problematic groups, without the need for detailed anatomical characterisation or reference to a small number of specialists for the group. This need is most keenly felt in poorly studied regions of high biodiversity, in cases where morphological identification is rendered impossible because of incomplete or degraded specimens, or for mixed samples containing multiple species. In this presentation, I will provide a series of case studies to illustrate the value of next-generation sequencing in enhancing the potential of DNA barcoding for the purposes of species discovery, the risk assessment of GM crops, diet reconstruction and the study of ancient DNA.
