Centre for Bioinformatics

Second Report 31 May 2004

Centre Director: Prof Michael Sternberg www.imperial.ac.uk/bioinformatics

Support Service Head: Dr Sarah Butcher www.codon.bioinformatics.ic.ac.uk

Centre for Bioinformatics - Imperial College London - Second Report - May 2004

Contents

Summary
1. Background and Objectives
   1.1 Objectives of the Report
   1.2 Background
   1.3 Objectives of the Centre for Bioinformatics
   1.4 Objectives of the Bioinformatics Support Service
2. Management
3. Biographies of the Team
4. Bioinformatics Research at Imperial
   4.1 Affiliates of the Centre
   4.2 Research
5. Teaching
   5.1 MSc in Bioinformatics
   5.2 Wellcome Trust 4 year PhD in Bioinformatics
   5.3 Computational Bioinformatics
6. The Bioinformatics Support Service
   6.1 Introduction
   6.2 Management
   6.3 Hardware, Software and Databases
   6.4 Support & Training
   6.4 Research Projects
   6.5 Grants
   6.6 Courses
   6.7 Financial Arrangements
7. The London Bioinformatics Forum
   7.1 Mission of the London Bioinformatics Forum
   7.2 Objectives
   7.3 Management
   7.4 Activities
8. Seminar Programme
9. Achievements and Plans
   9.1 Achievements
   9.2 Plans
Appendix 1 - Selected Publications

Summary

The mission of the Centre for Bioinformatics is to promote and co-ordinate world-class research and training in Bioinformatics within Imperial College London and to provide state-of-the-art Bioinformatics support to members of Imperial for their research. Our second report describes the activities of the Centre from 1st February 2003 to 31st May 2004 and documents our publications and grants for the calendar year 2003. Additional information can be found at www.imperial.ac.uk/bioinformatics.

The main achievements of the Centre during this period are:

• The provision across the College of a Bioinformatics Support Service with 230 registered users as of May 2004.

• The successful role of the Bioinformatics Support Service in obtaining a £600K BBSRC grant for the application of E-science to provide support for microarray analysis.

• The development of collaborative research projects between the Support Service and several research groups in the College.

• The expansion of the Centre with the addition of 11 new Affiliates.

• The publication by our Affiliates of more than 50 refereed papers in Bioinformatics during 2003.

• The award during 2003 of more than £2 million of grant support for research and training in Bioinformatics.

• The co-ordination of postgraduate teaching of Bioinformatics across the College.

• The running of a seminar series that attracts an audience from the College and other organisations in the London area.

• The establishment with colleagues from other London groups of the London Bioinformatics Forum.

1. Background and Objectives

1.1 Objectives of the Report

This is the second report of the Centre for Bioinformatics and covers the period 1st February 2003 to 31st May 2004. The objectives of this report are:

• To document the status of the Centre as of May 2004, particularly describing the contribution from our Steering Committee, External Advisors and the Affiliates.

• To highlight the main developments over the period of this report.

• To describe the activities of the Bioinformatics Support Service in terms of facilities provided and its uptake by users.

• To report grants awarded and publications for the calendar year 2003.

1.2 Background

• Bioinformatics can be defined as the use of computational, mathematical and statistical methods to organise, analyse and interpret biological information, particularly at the molecular, genetic and genomic levels.

• Bioinformatics is central to the interpretation and exploitation of the wealth of biological data being generated in the post-genome era with the consequential major clinical and commercial benefits.

• It is vital that Imperial has world-class research in Bioinformatics together with state-of-the-art facilities for all users. Since Bioinformatics research is located in all four Faculties, a clearly identifiable focus is required, in particular to co-ordinate multi-disciplinary research. In parallel, it is essential to provide biologists and clinical researchers with state-of-the-art Bioinformatics to empower them to deliver world-class research.

• To address these issues, in 2001 the Deputy Rector together with the Faculties of Life Sciences and of Medicine established the Centre for Bioinformatics and the associated Bioinformatics Support Service.

• The Bioinformatics Support Service is located on the newly-refurbished third floor of the Biochemistry Building on the South Kensington Campus. This acts as the focus for the Centre with its links to the different Departments and Campuses of the College.

1.3 Objectives of the Centre for Bioinformatics

The objectives of the Centre for Bioinformatics are to:

• Co-ordinate the strategic development of Bioinformatics at Imperial across the four Faculties and at the different Campuses.

• Develop new collaborative projects within and outside Imperial, particularly those that are multi-disciplinary.

• Contribute to a broad College-wide view of Bioinformatics including the development of links with areas such as statistical genetics, chemometrics and image processing.

• Have a strategic role in the provision of teaching and training in Bioinformatics.

• Organise seminar programmes and symposia on Bioinformatics.

• Disseminate relevant software, databases and information to the UK and world scientific communities, both academic and industrial.

• Facilitate the provision of state-of-the-art Bioinformatics support to members of Imperial by directing the Bioinformatics Support Service.

1.4 Objectives of the Bioinformatics Support Service

The objectives of the Bioinformatics Support Service are to provide the following services to all Imperial Campuses:

• In-house facilities for major Bioinformatics tasks, such as sequence database searching and microarray processing.

• Access to appropriate commercial software and data.

• Curated links to public domain sites providing additional services.

• Expertise and training courses on the use of the above facilities.

• Collaborative research on specific topics.

• Support for undergraduate and postgraduate teaching.

2. Management

• The Director of the Centre is Professor Michael Sternberg, Department of Biological Sciences.

• The Centre is managed by a Steering Committee composed of senior academic staff at Imperial involved in Bioinformatics. The Committee reports to the Principals of the Life Sciences and Medical Faculties via the Director.

Membership of the Steering Committee

Member                            Faculty / Affiliation
Prof Michael Sternberg            Life Sciences / Biological Sciences (Chair)
Prof Timothy Aitman               Medicine / CSC
Prof David Balding                Medicine / Epidemiology and Public Health
Prof John Darlington              Engineering / Computer Science / LESC
Prof Paul Freemont                Life Sciences / Biological Sciences / CSB
Prof Philippe Froguel             Medicine
Prof                              Engineering / Computer Science
Prof James Scott                  Medicine / GGRI
Prof Richard Templer              Physical Sciences / Chemistry
Dr Sarah Butcher (in attendance)  Head Bioinformatics Support Service

CSC - MRC Clinical Sciences Centre; LESC - London e-Science Centre; CSB - Centre for Structural Biology; GGRI - Genetics and Genomics Research Institute.

• There is also a panel of External Advisors drawn from leading scientists in academia and industry with a strong interest in Bioinformatics. Dr Philippe Sanseau has kindly agreed to replace our previous industrial representative, Prof Charlie Hodgman, who has now taken up a post with Nottingham University. We take this opportunity to thank Prof Charlie Hodgman for his work as an External Advisor.

External Advisors

Member                    Affiliation
Prof Alan Bundy           Division of Informatics, Edinburgh University
Prof Lon Cardon           Wellcome Trust Centre for Human Genetics, Oxford
Prof Anna Dominiczak      Western Infirmary, Glasgow
Dr Philippe Sanseau       GlaxoSmithKline, Stevenage
Prof , FRS                European Bioinformatics Institute, Hinxton

3. Biographies of the Team

Professor Michael Sternberg is the Director of the Centre. He holds the Chair of Structural Bioinformatics in the Department of Biological Sciences. His first degree was in Theoretical Physics (Cambridge) followed by an MSc in Computing at Imperial. He moved into the Life Sciences via his D.Phil. research in Oxford on protein modelling. Prior to joining Imperial in 2001, he held posts in the Department of Crystallography, Birkbeck College and at Cancer Research UK.

Dr Sarah Butcher is the Head of the Bioinformatics Support Service (BSS). Her first degree was in Applied Biology (Imperial) followed by a PhD in Cellular Immunology from the National Institute for Medical Research (CNAA). She then worked as a in Virology for the NERC Centre for Virology and Environmental Microbiology, Oxford. Subsequently Sarah joined Oxford University Bioinformatics Centre – which she later managed for 3 years. Sarah took up her post at Imperial in June 2002.

Dr James Abbott is the main software developer for the BSS. He obtained a BSc in Biology with Biotechnology at the University of Luton, before undertaking a PhD in plant biochemistry at the University of Dundee. Following this, James worked as a bioinformatics specialist for Zeneca Agrochemicals (latterly Syngenta), contributing to the expression bioinformatics project and running the SRS project.

Dr Gail Bartlett is the Computational Biologist for the BSS, responsible for user support, tutorials and training. She obtained an undergraduate Masters degree in Biochemistry from the and went on to study for a PhD in Bioinformatics with Professor Janet Thornton, initially at University College London and later at the European Bioinformatics Institute.

Mr Derek Huntley is a research assistant for the BSS, specialising in second-line user support including development of custom Java programs and interfaces. He obtained a first degree in Biology at Sussex University and went on to complete an MSc in Computer Science at Birkbeck College, London. He worked in the Department of Computing at Imperial College London for 4 years developing genomic annotation software before joining the BSS. He has recently submitted a PhD.

Ms Ruth Walters is the Administrator of the Centre. She joined the Centre in December 2001. Previously she obtained a degree in Philosophy (Southampton) and then worked in educational administration.

Dr Suhail Islam assists the Centre part-time in the management of local computing. He is a member of the Structural Bioinformatics Group, develops software and manages the Linux farm and the molecular graphics system. Previously he held similar posts at King's College, Birkbeck College and Cancer Research UK.

Additional support is obtained from members of the London e-Science Centre. The BSS login server is housed within the Department of Computing and the BSS receives Unix system support and additional advice from the team led by Professor John Darlington and Dr Steven Newhouse.

4. Bioinformatics Research at Imperial

4.1 Affiliates of the Centre

The Centre coordinates Bioinformatics research at Imperial via a network of Affiliates spanning all four Faculties and many Campuses. Affiliates are members of Imperial directly involved in Bioinformatics who are either pursuing independent research or are providing major support or development of Bioinformatics. In addition, several heads of Imperial Centres are Affiliates, thereby representing the collective Bioinformatics interests of a set of people. The Affiliates and their research interests are given below.

During 2003, there were two major new appointments in Bioinformatics. Professor Jaroslav Stark joined the Department of Mathematics and Dr Michael Stumpf joined Biological Sciences, with his research group being located within the Centre for Bioinformatics. In addition to these two appointments, nine other academics at Imperial became Affiliates of the Centre. The eleven new Affiliates are highlighted by (*) below.

Computer Science, Mathematics and Statistics

Dr Mauricio Barahona (*)      Biomathematics and dynamical systems
Dr Simon Colton (*)           and artificial intelligence
Prof John Darlington          High performance computing, e-science and the grid
Prof Yike Guo                 Machine learning & data mining
Dr Martin Howard (*)          Biophysics and pattern formation
Prof Henrik Jensen            Evolution of interacting networks
Prof David Hand               Statistical and machine learning methods
Prof Stephen Muggleton        Machine learning and its application to bioinformatics
Prof Sylvia Richardson        Hierarchical Bayesian models, clustering microarray data
Prof Marek Sergot             Automated reasoning applied to problems in bioinformatics
Prof Jaroslav Stark (*)       Mathematical modelling of biological systems
Dr David Stephens             Bayesian probabilistic analysis of biological sequences
Prof Guang-Zhong Yang         Image processing applied to biomolecular modelling

DNA and Protein Sequence Analysis (including Phylogenetics)

Dr Austin Burt                Evolution of non-Mendelian genetic elements
Prof Charles Godfray, FRS     Population biology & phylogenetics
Dr Andy Purvis                Inferring evolutionary processes from phylogenetic patterns
Dr Mike Tristem               Retroviral and retroelement evolution
Dr Alfried Vogler             Comparative genomics and molecular systematics of insects

Genetics and Genomics

Prof David Balding            Disease gene mapping & population genetics
Dr Mark Field (*)             Molecular parasitology
Prof Philippe Froguel (*)     Genome annotation and SNP analysis
Prof Neil Ferguson, OBE (*)   Modelling of pathogen population dynamics and evolution
Prof James Scott, FRS         Genetics & genomics
Prof Brian Spratt, FRS        Characterisation of isolates of bacterial strains
Dr Michael Stumpf (*)         Population genetics and comparative genomics
Dr John Whittaker             Statistical methods to identify disease genes
Prof Douglas Young (*)        Infection & Immunity, pathogen genomics/proteomics

High-throughput 'Omics Methodologies

Dr Helen Causton              Gene expression analysis & data warehousing
Prof Anne Dell, FRS           Mass spectrometric sequencing of biopolymers
Dr David Perkins (*)          Proteomics and sequence analysis


Macromolecular Structures

Prof Paul Freemont            Structure and function of biological macromolecules
Prof Michael Sternberg        Structural bioinformatics (especially protein modelling)

Physical and Chemical Methods

Dr Ian Gould                  Simulation methods for biological systems
Dr Henry Rzepa                Quantum chemical modelling, XML, semantic web

Support and Training

Dr Sarah Butcher (*)          Bioinformatics support

4.2 Research

Since the Centre started in 2001, Affiliates of the Centre have obtained over £8M of grants to support Bioinformatics research within Imperial. Most of this support was obtained over 2001-2 and this has financed extensive research by the Affiliates. Appendix 1 lists the research publications for calendar year 2003. There are more than 50 papers including four in Nature, Science and the Proceedings of the National Academy of Sciences, USA.

During 2003, over £2M of grants were obtained by Affiliates of the Centre for research and training in Bioinformatics at Imperial. In the list below, multi-disciplinary grants are placed under the research area of the principal investigator. We have not included grants to our Affiliates for research outside Bioinformatics. Sums quoted are the support to Imperial with the total award in brackets afterwards. The main investigators at Imperial are reported.

Computer Science, Mathematics and Statistics

APRIL II – Application of probabilistic inductive logic programming. EU. £300K (£1M). 2004-2006. Muggleton & Sternberg. To develop a sound theoretical understanding of probabilistic logic learning that enables one to develop effective probabilistic learning systems. To apply these methods to applications in bioinformatics including protein folding, modelling metabolic pathways and genetics.

Computational tools for Bayesian bioinformatics. MRC Training fellowship. £135K. 2003-2006. Lunn, Best & Whittaker. This grant is to develop a user-friendly specialist interface and computational algorithms tailored specifically for Bayesian statistical modelling.

Adverse event data mining. EPSRC. Case PhD Studentship with GSK. 2003-2006. Hand. Drugs in the marketplace are subject to constant monitoring, so that possible side effects and interactions with other drugs can be detected. Novel statistical tools are required to analyse the large sparse datasets which are produced.

Genetics and Genomics

Bioinformatics for the analysis and exploitation of re-sequenced genomes. MRC/DTI link. £400K (£1.5M). 2003-2006. Balding. The goal of this research is to develop simulation models and statistical tools to investigate optimal strategies for the use of whole genome re-sequencing data to investigate DNA variants involved in disease causation. Joint with European Bioinformatics Institute, Wellcome Trust Sanger Institute, and Solexa Ltd.

Molecular evolution of G-protein coupled receptors. Royal Society. £10K. 2003-2004. Stumpf. G-protein coupled receptors take a central role in cell-cell signaling. In this project we attempt to understand the evolutionary history of these proteins and the amount of selection that has operated on them in the human lineage.

Bayes network models of gene regulation. Royal Society. 2004-2006. £20K. Stumpf. In order to understand the sequence of interactions in a gene regulation network, we develop and test Bayesian network models of gene regulation. Our framework is suitable for both simulation and inferential procedures, as we can determine parameters of the network model from real data. With Professor C Wiuf.

Statistical modelling for the association of multiple SNP genotypes and phenotype. MRC. PhD Studentship. Balding & Whittaker. This project will exploit recent developments in spatial statistics to develop methods for the analysis of data from genetic association studies where many SNPs have been genotyped.

High-throughput 'Omics Methodologies

Microarrays in clinical practice. Department of Health. £247K. 2004-2006. Causton, Aitman, Navarange, Bloom & Stamp. This project aims to extend the current Microarray Centre data warehouse to accommodate clinical data. The routine clinical use of microarrays will bring a better understanding of the relationship between genes and disease, provide tools for more accurate diagnosis that allow treatment to be tailored to the individual, and assist in the development of new and more effective therapies.

Macromolecular Structures

Modelling and prediction of docked protein-protein complexes. MRC. PhD Studentship. 2003-2006. Sternberg. The aim is to enhance computational methods to predict the structure of a protein-protein complex starting from the coordinates of the unbound components.

Prediction of protein specificity using machine learning. BBSRC. PhD Studentship. 2003-2006. Sternberg & Muggleton. The aim is to develop a machine learning approach to predict protein function from structure. Of particular importance is the identification of those residues involved in providing specificity of function.

Teaching and Training Programmes

A four year PhD programme in bioinformatics. Wellcome Trust. £1.2M. 2003-2008. Sternberg & Field. This programme supports 5 students commencing 2003 and 7 commencing 2004 to undertake a 4 year PhD in bioinformatics at Imperial. In the first year the students will attend the MSc in Bioinformatics. In the following three years, each student will undertake a PhD in any of the Departments associated with the programme.

The statistical analysis of gene expression data. EPSRC. £23K. 2003. Richardson. Funding to organise a workshop to promote good statistical practice, to initiate new methodological research on ways to analyse this type of data and to foster the interface between new technological developments and the biological and experimental context. With P. Brown (Kent).

5. Teaching

5.1 MSc in Bioinformatics

Imperial established an MSc in Bioinformatics in the academic year 2001-2. The course takes graduates from either the Life Sciences or the Numerical/Physical Sciences and trains them in Bioinformatics. The first half of the year has formal courses covering Computer Programming (C++, Java, Perl), Mathematics & Statistics, and Bioinformatics. The second part of the year is spent on research projects. Staff from all four Faculties at Imperial contribute to the formal teaching and offer research projects.

In 2003, we awarded 16 degrees including five with distinction. Several students are progressing to PhD research and many others obtained positions in industry and academia employing their Bioinformatics skills.

We have 13 students enrolled on the 2003-4 course. The MRC provided one funded MSc place for 2003-4.

Recently, the BBSRC provided support for five places on the MSc for three annual intakes (2004 to 2006). In addition, the MRC are funding two places for admission in 2004.

5.2 Wellcome Trust 4 year PhD in Bioinformatics

In January 2003, the Centre for Bioinformatics and the Department of Biological Sciences were awarded support from the Wellcome Trust to establish a 4 year PhD programme in Bioinformatics. We recruited five students who started in October 2003. In keeping with the aims of the programme, these students came from a broad range of undergraduate disciplines – Biological Sciences, Computing, Mathematics and Statistics. In the first year the students are following the MSc in Bioinformatics. Towards the end of the first year, students will select PhD research topics offered by the contributing Departments (Biological Sciences, Chemistry, Computing, Mathematics, the MRC Clinical Sciences Centre and Primary Care Division of the Medical School). The students will then join the department of their primary supervisor. To foster inter-disciplinary training, there will be a second supervisor from a complementary discipline. The cohort will maintain contact with each other via common seminar programmes. We have recruited seven students to join the programme in October 2004.

This programme provides an excellent opportunity for Imperial to attract the best students to cross disciplines and train in Bioinformatics.

5.3 Computational Bioinformatics

In 2003, the Department of Computing introduced a module “Introduction to Bioinformatics” that is an option in three courses: the third year undergraduate degree in Computing, the MSc in Computing (Conversion Course), and the third year undergraduate degree in Electrical Engineering. The course is run by Drs Yike Guo and Simon Colton from the Department of Computing.

6. The Bioinformatics Support Service

6.1 Introduction

The mission of the Bioinformatics Support Service (BSS) is to deliver state-of-the-art software and training to members of the College to assist their research and teaching in core areas of Bioinformatics. The team consists of Dr Sarah Butcher (Head), Dr James Abbott (Software Developer), Dr Gail Bartlett (User Support & Training) and Mr Derek Huntley (Research Assistant). Ms Nadia Anwar left the service in October 2003 to take up a PhD position at the University of Glasgow. She was replaced by Dr Gail Bartlett, who joined the BSS in January 2004.

The offices of the Service moved to custom SRIF-refurbished space on level three of the Biochemistry building on the South Kensington Campus in November 2003. This places the service next to the Bioinformatics research groups of Prof Sternberg and Dr Stumpf. The Service has its own meeting room and can access two rooms with PC clusters.

6.2 Management

The Head of the Support Service reports to the Steering Committee of the Centre via the Director of the Centre. To assist the Support Service in achieving its mission, an Operations Committee has been established. The initial membership of the Committee reflects the essential input required in Computing, user support and Bioinformatics.

Membership of the Bioinformatics Support Service Operations Committee

Member                        Affiliation
Professor Michael Sternberg   Director of Centre (Chair)
Dr Sarah Butcher              Head of Support Service
Dr Helen Causton              MRC Micro Array Centre, Clinical Sciences Centre
Dr Steven Newhouse            London e-Science Centre, Dept. of Computing
Mr Arthur Spirling            Information and Communication Technologies

6.3 Hardware, Software and Databases

The main BSS login server remains a Sun V880 (8x750 MHz processors, 32 GB RAM, 430 GB disk). We are fortunate to benefit from considerable additional shared compute resources within the London e-Science Centre (LeSC) through Professor John Darlington. To date, these comprise a Sun 6800 (24x750 MHz processors, 32 GB RAM, 6 TB disk, 24 TB tape system) funded through a JREI grant, currently used as an SRS server and for selected compute-intensive jobs, and a 133 dual-processor Intel/Linux cluster (1 or 2 GB RAM per node). The latter is funded as part of a £2 million investment to support applied computational scientists within Imperial, primarily for Bioinformatics, high energy Physics and Computational Engineering.

Job scheduling between the login server and additional shared resources has been developed using Sun Grid Engine. Currently, selected large-scale analyses (e.g. BLAST, Interproscan, HMMSearch with >300 input sequences) are targeted for scheduling on the Linux cluster. Additional wrappers are under development to extend the scope of shared resource use, with emphasis on ease of use and transparency to users.
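The routing rule described above can be illustrated with a minimal sketch. It assumes only what the paragraph states (BLAST, Interproscan and HMMSearch runs with more than 300 input sequences are sent to the Linux cluster); the function names and the target labels "linux-cluster" and "login-server" are hypothetical and do not reflect the actual BSS wrappers or any Sun Grid Engine configuration.

```python
from typing import Iterable

# Threshold taken from the report: >300 input sequences go to the cluster.
CLUSTER_THRESHOLD = 300

# Analyses named in the report as candidates for cluster scheduling.
LARGE_SCALE_PROGRAMS = {"blast", "interproscan", "hmmsearch"}


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count sequences in FASTA-formatted text (one '>' header per record)."""
    return sum(1 for line in lines if line.startswith(">"))


def choose_target(program: str, n_sequences: int) -> str:
    """Return a hypothetical target label for a job.

    Illustrative only: the real service submits jobs through
    Sun Grid Engine, which is not modelled here.
    """
    if program.lower() in LARGE_SCALE_PROGRAMS and n_sequences > CLUSTER_THRESHOLD:
        return "linux-cluster"
    return "login-server"
```

For example, a BLAST run with 301 query sequences would be labelled for the cluster, whereas the same run with 300 sequences, or any program outside the listed set, would stay on the login server.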


The BSS maintains a fully up-to-date comprehensive local set of public biological databases employing cumulative updates – see Table 1. These are checked for consistency offline and indexed for BLAST and SRS using a fully-automated system designed and implemented by Dr Abbott. The Centre recently installed the commercial BRENDA database of enzyme functional data, now available from the web-site or directly via SQL queries. The BRENDA database illustrates how the Centre will install and maintain additional databases as requested by users.

The Centre supports a wide range of Unix-based bioinformatics software (see Table 2). The majority are freely available packages but commercial packages are used where they add significant functionality (e.g. SRS – Lion Biosciences). The Centre recently acted as a beta test site for the SRS8 package.

A large number of packages have additional PISE-generated web interfaces and have been integrated within SRS for ease of access. In addition, the service has adapted and hosted new web-based SiRNA design software from the Wistar Institute, as well as building a custom graphical BLAST interface integrated with SRS.

A ‘wish-list’ for commercial software of potential interest is also available from the web-site. Users can request additions to the list and if sufficient interest is registered from other users through the accompanying form, the software will be considered for purchase and central installation.

Table 1 - Bioinformatics Databases Provided

DNA: EMBL, Genbank, dbEST, REFSEQN, Repbase

Protein: Genpept, REFSEQP, UNIPROT (PIR, Swissprot, TrEMBL)

Sequence-Related: BRENDA, DSSP/FSSP/HSSP, INTERPRO (BLOCKS, PFAM, PRINTS, PROSITE, Prodom), PDB, Unigene, Uniref/Uniseq

Miscellaneous: Enzyme, GO/GOA, Locuslink, OMIM, Rebase, Taxonomy

Table 2 - Major Bioinformatics Software

Multifunction Packages: Emboss (with JEMBOSS), NCBI Toolkit, HMMER, PHYLIP

Codon Use: codonw

Database Searching: ballast, blixem, NCBI blast2, blimps, dbwatcher, fasta, hmmer, Interproscan, MSPcrunch, ssaha, Washington University blast2

Database Text Searching: SRS, entrez

Genome Analysis/Annotation: apollo, artemis, act, firstef, genscan, glimmer, grailEXP, qrna, tricross, vista, wise2

Sequence Comparisons: avid, clustalw, clustalx, dialign2, dotter, hmmerviewer, jalview, lalnview, seaview, sim4, t-coffee

Sequence Manipulation: phd2fasta, readseq

Phylogenetic Analysis: bonsai, fastdnaml, nifas, njplot, orthostrapper, phylip, protml, rio, tree-puzzle

aqua, domainer, procheck, rasmol, structer

Repeats: maskeraid, repeatmasker

RNA: qrna, snoscan, trnascan

SNP Discovery: polybayse, polyphred, refcomp, snp_pipe (Oxagen)

Sequence Assembly & Trace Data: phrap, phred, staden package

Linkage: genehunter, morgan, QTLReaper, Simwalk2, transmit

Primer/SiRNA Design: oligoArray, primer3, siRNA


Figure 1 – Number of Users (plot of the number of users, on a scale of 0 to 250, by month from Nov-02 to May-04)

Figure 2 – Users by Faculty (categories: Life Sciences, Medicine, Engineering, Physical Sciences)

Figure 3 – Users by Campus (categories: South Kensington, Hammersmith, St Mary’s, Charing Cross, Others)

6.4 Support & Training

The Service has 230 registered users as of May 2004, and user numbers continue to grow (see Fig 1). Users split almost 50/50 between the Faculties of Life Sciences and Medicine, with a small number in the Faculties of Engineering and Physical Sciences (see Fig 2). Many campuses are represented, with the largest numbers of users from the South Kensington, Hammersmith and St Mary’s campuses (see Fig 3). Users are supported via email, phone, one-to-one or group meetings, and site visits. An email queue tracking system facilitates automatic call logging and tracking. One-to-one and group-based advice and consultation are available on many aspects of analysis, with the emphasis on training users to perform their own analyses. Bespoke scripts and interfaces are developed to assist users as necessary, particularly with bulk processing tasks. Where appropriate, these may be adapted and made available for more generalised use by other College researchers, or published for outside dissemination.

The BSS also provides help for researchers writing grant proposals. The scope of this can vary from providing advice on appropriate data analysis methods and references to include, through to active participation as co-applicants, with provision of part or full-time posts to provide bioinformaticians for specific analyses and/or development of new databases, scripts and software. Often the scope of Bioinformatics analyses to fulfil a particular aim can appear difficult to quantify in terms of resources. The BSS can provide costs for necessary staff time and identify the resources required (e.g. additional disk storage). The Service also undertakes pilot work for grant proposals e.g. exploratory analyses to provide preliminary results to show proof-of-concept and strengthen cases. The Service produces statements outlining the expertise of the BSS, which can be included in resource justifications, to indicate how grants accessing the BSS resources have made a provision for optimal data analysis.

6.4 Research Projects

The Service has already been engaged in a number of large projects with users where significant new scripts, programs and/or user interfaces have been produced for specified functions. These have enabled the researchers to process large and/or problematic datasets, which would otherwise have proven difficult to handle with a more piecemeal approach. A few of these are outlined below, and in several cases, such work has led to substantial further collaborations:

• The development of AriadneDB, a program for automated EST clustering and filtering with a Java interface for phylogeny.

• The development of an automated system for investigating LINE repeat structure and distribution within selected mouse chromosomes, including a Java interface to view details, which can be zoomed from chromosome to sequence level.

• The writing of scripts for bulk pattern matching within SwissProt and filtering of results into manageable groupings based on SwissProt annotations and GO terms.

• The development of extensive scripts for reformatting multiple large EST datasets and BLAST results from a non-standard format to ‘extended’ EMBL format. In addition, custom-built SRS parsers were written so that the resulting data are available as SRS indices for easy user interrogation via the command line and the SRS web interface.

• The re-annotation of a section of the Anopheles and Drosophila genomes, with emphasis on putative alternative splice sites, and storage of the results in EMBL format for viewing by multiple users in the Artemis graphical viewer.

The Centre’s web site (www.codon.bioinformatics.ic.ac.uk - internal access only) has continued to develop. We now offer direct access to a large number of software packages as well as help pages covering a wide range of practical information on common tasks, locally held software, databases and good practice. We are also developing a set of self-contained downloadable tutorial exercises e.g. ‘Introduction to using Unix for Bioinformatics’.

6.5 Grants

Dr Butcher recently led a successful grant application to the BBSRC Bioinformatics and E-Science Programme (Butcher, Sternberg, Newhouse, Darlington, Causton & Aitman - A distributed system for E-support of microarray data analysis and management). This 3-year grant (£600K) provides three new postdoc positions and will enable the BSS, together with the MRC Microarray Centre at the Hammersmith Hospital and the London E-Science Centre, to develop new methods for supporting microarray data management and analysis within Imperial. This will underpin the outreach of the BSS towards actively supporting microarray analysis within the coming year.

6.6 Courses

The Centre has started to run a programme of modular taught courses. A half-day introductory course is now available and has already been delivered three times. This course is free to registered users and is expected to run periodically on any Campus where suitable computer teaching facilities are present. It has also been complemented by practical software demonstrations at Wye College and the Kennedy Institute.

A number of other practical half-day courses are in preparation and will commence later this year. Titles shortly to be released include: ‘Biological databases and getting the most from database interrogation’, ‘Sequence alignments and database searching’, ‘Multiple sequence alignments – methods and uses’. It is envisaged that these courses will incur a small fee.

Dr Butcher has also given a number of lectures on the facilities of the BSS, and their use within the College. These include lectures as part of core introductory courses for postgraduates and for undergraduates.

6.7 Financial Arrangements

The Support Service was established with major financial support initially from the Pro-Rectors Reserve and subsequently from the Faculties of Life Sciences and of Medicine. Clearly, given the realities of university finances, the Service cannot continue to run on major funding from the Faculties. Research and teaching grants that have a Bioinformatics component are required to include an access fee for use of the Support Service. The access charge has initially been set at £1,500 per annum per postdoc per new grant, and will include support of associated PhD students at no extra charge. The College has facilitated administration of this access charge by inclusion of a specific check-box on internal grant processing forms.

This access fee should be considered as funding a small fraction of skilled support that maintains software and databases and additionally provides expert assistance and advice. If individual groups were to undertake their own Bioinformatics support in house, this would be far more expensive in terms of staff time and the resultant service would almost always be far poorer. Thus the access fee is exceptionally good value for money in a research grant. In addition to a core level of support for many researchers, certain projects will require extensive Bioinformatics support. If the research group wishes the Service to provide such support, then the group would need to finance the appropriate level of staff time and computational resources from the Service.

We consider these mechanisms to be the most effective strategy to finance the Service. The alternative of charging each user has been shown in many organisations, both in the UK and abroad, to be exceptionally problematic to administer.

A well-financed Bioinformatics Support Service will empower a large number of users in the College to perform first-class Bioinformatics in their projects, which will translate into a substantial enhancement of the quality of their research.

7. The London Bioinformatics Forum

7.1 Mission of the London Bioinformatics Forum

During 2003, the Centre for Bioinformatics at Imperial, working with other bioinformaticians in London, established the London Bioinformatics Forum (LBF). The mission of the LBF is to promote discussion amongst London-based groups in all areas of Bioinformatics including research, teaching, support and training with a view to encouraging collaboration.

7.2 Objectives

The objectives of the London Bioinformatics Forum are:

• To exchange information about research activities, teaching, support and training in Bioinformatics.

• To organise a Bioinformatics seminar series primarily with contributions from London-based researchers.

• To identify areas for inter-institutional collaborations in both Bioinformatics and other disciplines.

• To identify funding opportunities that can be pursued by members of the Forum via their home institutions.

• To highlight to the UK and international communities the strengths in Bioinformatics in London by mechanisms such as a common web site and scientific meetings.

• To facilitate public engagement with science in respect of Bioinformatics.

7.3 Management

The chair of the LBF is Prof Michael Sternberg (Imperial) and the deputy chair is Prof David Jones (UCL). The Steering Committee of 30 has representatives from 17 London organisations. Further details can be found at the web site www.londonbioinformatics.org.

7.4 Activities

In November 2003, the LBF held an Inaugural Open Day at Imperial with speakers from several London organisations (see Section 8). We intend that future events will include an opportunity for graduate students to present their work. The web site provides links to both the research activities and major training programmes within London. The LBF also has mailing lists for news announcements and bioinformatics discussion ([email protected] and [email protected]) maintained by staff of the Centre for Bioinformatics.

8. Seminar Programme

The Centre runs a seminar programme. The seminars are followed by networking stimulated by refreshments. The audience comes from many of the Departments and Campuses at Imperial and from other London groups. During 2003 we also held scientific presentations as part of the Inaugural Open Day of our Centre and for the Opening of the London Bioinformatics Forum. The 2003 programme consisted of the following seminars.

Monday 3 March

Dr Nigel Saunders Pathology, University of Oxford ‘Functional genomics and bacterial pathogenesis’

Dr Adrian Cootes Biological Sciences, Imperial College London ‘The automatic discovery of structural principles describing protein fold space’

Tuesday 18 March - Inaugural Open Day of the Centre for Bioinformatics

Professor Janet Thornton FRS Director of the European Bioinformatics Institute, Hinxton ‘The evolution of protein function from a structural perspective’

Professor Lon Cardon Wellcome Trust Centre for Human Genetics, Oxford ‘Use of the human haplotype map in complex disease association studies’

Professor Carole Goble Dept of Computing, University of Manchester ‘Ontologies and BioGrid services: prospects and pitfalls’

Dr Peer Bork EMBL, Heidelberg ‘Function, prediction and protein networks’

Monday 24 March

Professor Luis Montero University of Havana ‘Modeling molecules and biomolecules: basic principles and drug engineering’

Dr Jordi Villa i Freixa Structural bioinformatics laboratory (GRIB), IMIM/Universitat Pompeu Fabra ‘Seeking for realistic energy profiles in ion channels simulations’

Friday 11 April

Professor Nikolay Kolchanov Institute of Cytology and Genetics, Novosibirsk, Russia ‘TRRD: Database of transcription regulatory regions - implications for analysis of expression data’

Monday 28 April

Professor David Hand Department of Mathematics, Imperial College London ‘Statistical pattern detection in genomics and proteomics’

Dr Hilary Booth Centre for Bioinformation Science (CbiS), Australian National University, Australia ‘Normalization of sequence alignment scores’

Monday 2 June

Dr Helen Causton Microarray Centre, Faculty of Medicine, Imperial College London ‘Low level analysis of affymetrix gene expression data’

Professor Sylvia Richardson Department of Epidemiology and Public Health, Imperial College London ‘Bayesian hierarchical models for gene expression data’

Monday 7 July

Professor Mark Sansom Laboratory of Molecular Biophysics, University of Oxford ‘Membrane proteins: structural dynamics via simulations’

Monday 20 October

Dr Sarah Teichmann Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge ‘Gene regulatory network growth by duplication’

Mr Jonathan Swire Department of Biological Sciences, Imperial College London ‘Gradients in amino acid composition within the yeast genome as a response to selection on cost’

Wednesday 26 November - Opening of the London Bioinformatics Forum

Professor David Jones Department of Computer Science, University College London ‘Predicting old and new folds for genome sequences’

Professor Stephen Muggleton Department of Computing, Imperial College London ‘Machine learning for bioinformatics’

Professor Richard Goldstein NIMR, Mill Hill. ‘Evolutionary studies of G-protein coupled receptors’

Dr Lorenz Wernisch School of Crystallography, Birkbeck College ‘Graphical models for interpreting microarray experiments’

9. Achievements and Plans

9.1 Achievements

The major achievements of the Centre for Bioinformatics over the period of this report (1st February 2003 - 31st May 2004) are:

• The provision across the College of a Bioinformatics Support Service with 230 registered users as of May 2004.

• The successful role of the Bioinformatics Support Service in obtaining a £600K BBSRC grant for the application of E-science to provide support for microarray analysis.

• The development of collaborative research projects between the Support Service and several research groups in the College.

• The expansion of the Centre with the addition of 11 new Affiliates.

• The publication by our Affiliates of more than 50 refereed papers in Bioinformatics during 2003.

• The award during 2003 of more than £2 million of grant support for research and training in Bioinformatics.

• The co-ordination of postgraduate teaching of Bioinformatics across the College.

• The running of a seminar series that attracts an audience from the College and other organisations in the London area.

• The establishment with colleagues from other London groups of the London Bioinformatics Forum.

9.2 Plans

Our key objectives for the next year are:

• To continue with the development of the Bioinformatics Support Service in terms of the number of users assisted, the range of software that is available, the breadth of advice and training provided, and the number of collaborative research projects undertaken.

• To extend the Bioinformatics research community within Imperial.

• To stimulate the establishment of new research projects in Bioinformatics within the College, particularly those which are multi-disciplinary.

• To facilitate the inclusion of Bioinformatics within undergraduate and postgraduate courses in all Faculties.

• To develop further links with colleagues outside Imperial both nationally and internationally.

Appendix 1 - Selected Publications

Below we list a selection of publications in 2003 by the Affiliates of the Centre. We focus on those with a Bioinformatics component and do not cite publications from our Affiliates in areas outside Bioinformatics.

Computer Science, Mathematics and Statistics

Balding, D.J. (2003). Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology, 63, 221-230.

Byng, M.C, Fisher, S.A., Lewis, C.M. & Whittaker, J.C. (2003). Variance components linkage analysis for adjusted systolic blood pressure in the Framingham Heart Study. BMC Genetics 4(Suppl 1), S4.

Byng, M.C., Whittaker, J.C., Cuthbert, A.P., Mathew, C.G. & Lewis, C.M. (2003). SNP subset selection for genetic association studies. Annals of Human Genetics, 67, 543-556.

Callard, R.E., Yates, A. & Stark, J. (2003). Fratricide: A Mechanism for T Memory Cell Homeostasis. Trends in Immunology, 24, 370-375.

Chan, C.C.W., George, A.J.T. & Stark, J., (2003). T Cell Sensitivity and Specificity - Kinetic Proofreading Revisited. Discrete and Continuous Dyn. Sys. B, 3, 343-360.

Clifford, R. & Sergot, M.J. (2003). Distributed and paged suffix trees for large genetic databases. In ‘Proc. of the 14th Annual Symposium on Combinatorial Pattern Matching (CPM'03), R. Baeza-Yates, E. Chávez and M. Crochemore, editors, Morelia, Mexico, June 2003, LNCS 2676’. 70-82. Springer-Verlag.

Clifford, R. & Sergot, M.J. (2003). Distributed suffix trees and their application to large-scale genomic analysis. In ‘Proc. International Conference on Computational Methods in Sciences and Engineering (ICCMSE'03), Kastoria, Greece, September 2003'.

Colton, S. & Muggleton, S.H. (2003). ILP for mathematical discovery. In ‘Proceedings of the 13th International Conference on Inductive Logic Programming’. 93-111. Springer-Verlag.

Denham, M.C. & Whittaker, J.C. (2003). A Bayesian approach to disease gene location using allelic association. Biostatistics, 4, 399-409.

Excoffier, L., Laval, G. & Balding, D.J. (2003). Gametic phase estimation over large genomic regions using an adaptive window approach. Human Genomics, 1, 7-19.

Balding, D.J., Bishop, M. & Cannings, C. (2003). Editors of 'Handbook of Statistical Genetics, 2nd edition'. Wiley.

Green, P.J., Hjort, N.L., Richardson, N. & Richardson, S. (2003). Editors of 'Highly Structured Stochastic Systems'.

Morris, A., Whittaker, J., Xu, C-F., Hosking, L. & Balding, D.J. (2003). Multipoint LD mapping narrows location interval and identifies mutation heterogeneity. Proc. Natl. Acad. Sci. USA. 100, 13442–13446.

Muggleton, S.H., Tamaddoni-Nezhad, A. & Watanabe, H. (2003). Induction of enzyme classes from biological databases. In ‘Proceedings of the 13th International Conference on Inductive Logic Programming’, 269-280. Springer-Verlag.

Phillips, M.S., Lawrence, R., Sachidanandam, R., Morris, A.P., Balding, D.J., Cardon, L.R. & 29 authors, (2003). Chromosome-wide distribution of haplotype blocks and the role of recombination hotspots. Nature Genetics, 33, 382-387.

Puech, A. & Muggleton, S.H. (2003). A comparison of stochastic logic programs and Bayesian logic programs. In ‘ICAI03 Workshop on Learning Statistical Models from Relational Data’. ICAI.

Sibly, R.M, Meade, A., Boxall, N., Wilkinson, M., Corne, D.W. & Whittaker, J.C. (2003). The structure of interrupted human AC microsatellites. Mol. Biol. Evol. 20, 453-459.

Stark, J., Brewer, D., Barenco, M., Tomescu, D., Callard, R. & Hubank, M. (2003). Reconstructing Gene Networks: What Are the Limits? Biochemical Society Transactions, 31, 1519–1525.

Stark, J. & Hardy, K. (2003). Chaos: Useful at Last. Science, 301, 1192-1193.

Stark, J., Callard, R. & Hubank, M. (2003). From the Top Down: Towards a Predictive Biology of Gene Networks. Trends in Biotechnology, 21, 290-293.

Sternberg, M.J.E. & Muggleton, S.H. (2003). Structure activity relationships (SAR) and pharmacophore discovery using inductive logic programming (ILP). QSAR and Combinatorial Science, 22, 527-532.

Whittaker, J.C., Harbord, R.M., Boxall, N., Mackay, I., Dawson, G. & Sibly, R.M. (2003). Likelihood-based estimation of microsatellite mutation rates. Genetics, 164, 781-787.

Huntley, D., Hummerich, H., Smedley, D., Kittivoravitkul, S., McCarthy, M., Little, P.F.R. & Sergot, M.J. (2003). GANESH: Software for customised annotation of genome regions. Genome Research, 13, 2195-2202.

Whittaker, J.C., Gharani, N., Hindmarsh, P. & McCarthy, M.I. (2003). Estimation and testing of parent of origin effects for quantitative traits. Am. J. Hum. Gen. 72, 1035-1039.

Wilson, J., Weale, M.E. & Balding, D.J. (2003). Inferences from DNA data: population histories, evolutionary processes, and forensic match probabilities. Journal of the Royal Statistical Society A. 166(2), 155-187.

DNA and Protein Sequence Analysis (including Phylogenetics)

Bininda-Emonds, O.R.P., Jones, K.E., Price, S.A., Grenyer, R., Cardillo, M., Habib, M., Purvis, A. & Gittleman, J.L. (2003). Supertrees are a necessary not-so-evil: a response to Gatesy et al. Systematic Biology, 52, 724-729.

Gifford, R. & Tristem, M. (2003). The evolution, distribution and diversity of endogenous retroviruses. Virus Genes, 26, 291-315.

Grenyer, R. & Purvis, A., (2003). A composite species-level phylogeny of the 'Insectivora' (Mammalia, Order Lipotyphla Haeckel 1866). Journal of Zoology (London), 260, 245-257.

Isaac, N.J.B., Agapow, P.-M., Harvey, P.H. & Purvis, A. (2003). Phylogenetically nested comparisons for testing correlates of species-richness: a simulation study of continuous variables. Evolution, 57, 18-26.

Kambol, R., Kabat, P. & Tristem, M. (2003). Complete nucleotide sequence of an endogenous retrovirus from the amphibian, Xenopus laevis. Virology, 311, 1-6.

Lynch, C. & Tristem, M. (2003). A co-opted gypsy-type LTR-retrotransposon is conserved in the genomes of humans, sheep, mice and rats. Current Biology, 13, 1518-1523.

Mace, G.M., Gittleman, J.L. & Purvis, A. (2003). Preserving the Tree of Life. Science, 300, 1707-1709.

Genetics and Genomics

Capelli, C., Redhead, N., Abernethy, J.K., Gratrix, F., Wilson, J.F., Moen, T., Hervig, T., Richards, M., Stumpf, M.P.H., Underhill, P.A., Bradshaw, P., Shaha, A., Thomas, M.G., Bradman, N. & Goldstein, D.B. (2003). A Y-chromosome census of the British Isles. Current Biology, 13, 979-984.

Ferguson N.M. & Donnelly C.A. (2003). Assessment of the risk posed by bovine spongiform encephalopathy in cattle in Great Britain and the impact of potential changes to current control measures. Proc. R. Soc. Lond. B. Biol. Sci. 270. 1579-1584.

Ferguson N.M., Keeling M.J., Edmunds W.J, Gani R., Grenfell B.T., Anderson R.M. & Leach S. (2003). Planning for smallpox outbreaks. Nature, 425, 681-685.

Griffin, J.L., Bonney S.A., Mann, C., Hebbachi, A.M., Gibbons, G.F., Nicholson, J.K., Shoulders, C.C. & Scott, J. (2003). An Integrated Reverse Functional Genomic and Metabolic Approach to Understanding Orotic Acid Induced Fatty Liver. Physiological Genomics.

Hagenaars T.J., Donnelly C.A., Ferguson N.M. & Anderson R.M. (2003). Dynamics of a scrapie outbreak in a flock of Romanov sheep: estimation of transmission parameters. Epidemiol. Infect. 131, 1015-1022.

Jones B, Jones, E.L., Bonney, S. A, Patel, H.N., Mensenkamp, A.R., Rudling, M., Myrdal, U., Annesi, G., Naik, S., Meadows, N., Quattrone, A., Naoumova, R.P., Angelin, B., Infante, R., Levy, E., Roy, C.C., Freemont, P.S., Scott, J. & Shoulders, C.C. (2003). Lipid Absorption Disorders of the Intestine Caused by Mutations of a Sar1 GTPase. Nature Genetics, 34, 29-31.

Mead, S., Stumpf, M.P.H., Whitfield, J., Beck, J.A., Poulter, M., Campbell, T., Uphill, J.B., Goldstein, D.B., Alpers, M., Fisher, E.M. & Collinge, J. (2003). Balancing selection at the prion protein gene consistent with prehistoric kuru-like epidemics. Science, 300, 640-643.

Naoumova, R.P., Bonney, S.A., Eichenbaum-Voline, S., Patel, H.N., Jones, B., Jones, Joanna E.L., Amey, J., Colilla, S., Neuwirth, C.K.Y., Seed, M., Betteridge, D.J., Galton, D.J., Cox, N.J., Bell, G.I., Scott, J. & Shoulders, C.C. (2003). Confirmed Locus on Chromosome 11p and Candidate Loci on 6q and 8p for the Triglyceride and Cholesterol Traits of Combined Hyperlipidemia. Arterioscler. Thromb. Vasc. Biol. 23, 2070-2077.

Redmond, S., Vadivelu, J. & Field, M.C. (2003). RNAit: an automated web-based tool for the selection of RNAi targets in Trypanosoma brucei. Molecular and Biochemical Parasitology, 128, 115-118.

Riley S., Donnelly C.A. & Ferguson N.M. (2003). Robust parameter estimation techniques for stochastic within-host macroparasite models. J. Theor. Biol. 225, 419-430.

Stumpf, M.P.H. & McVean, G.A.T. (2003). Estimating recombination rates from population-genetic data. Nat. Rev. Genet. 4, 959-968.

Stumpf, M.P.H. & Goldstein, D.B. (2003). Demography, recombination hotspot intensity, and the block structure of linkage disequilibrium. Current Biology, 13, 1-8.

Wiuf, C., Laidlaw, Z. & Stumpf, M.P.H. (2003). Some notes on the combinatorial properties of haplotype tagging. Math. Biosci. 185, 205-216.

High-throughput 'Omics Methodologies

Causton, H.C., Quackenbush, J. & Brazma, A. (2003). Microarray Gene Expression Data Analysis - A Beginner's Guide. 1st edn. Blackwell publishing.

Kemp, T.J., Causton, H.C. & Clerk, A. (2003). Changes in gene expression induced by H2O2 in cardiac myocytes. Biochem. Biophys. Res. Commun. 307, 416-421.

Shen, W.C., Bhaumik, S.R., Causton, H.C., Simon, I., Zhu, X., Jennings, E.G., Wang, T.H., Young, R.A. & Green, M.R. (2003). Systematic analysis of essential yeast TAFs in genome-wide transcription and preinitiation complex. EMBO J. 22, 3395-3402.

Macromolecular Structures

Cootes, A.P., Muggleton, S.H. & Sternberg, M.J.E. (2003). The automatic discovery of structural principles describing protein fold space. J. Mol. Biol. 330, 839-850.

Janin, J., Henrick, K., Moult, J., Eyck, L. T., Sternberg, M.J.E, Vajda, S., Vakser, I. & Wodak, S. J. (2003). CAPRI: A Critical Assessment of Predicted Interactions. Proteins 52, 2-9.

Smith, G.R. & Sternberg, M.J.E. (2003). Evaluation of the 3D-Dock protein docking suite in rounds 1 and 2 of the CAPRI blind trial. Proteins 52, 74-79.

Physical and Chemical Methods

Gkoutos, G.V., Rzepa, H.S. & Murray-Rust, P. (2003). Online Validation and Comparison of Molfile and CML Molecular Atom-Connection Descriptors. Internet J. Chem. article 1.

Gkoutos, G. V., Rzepa, H. S., Clark, R. M., Adjei, O. & Johal, H. (2003). Chemical Machine Vision: Automated extraction of chemical meta-data from raster images. J. Chem. Inf. Comp. Sci., 43, issue 5.

Murray-Rust, P. & Rzepa, H. S. (2003). Chemical Markup, XML and the Worldwide Web. Part 4. CML Schema. J. Chem. Inf. Comp. Sci. 43, issue 4.

Murray-Rust, P. & Rzepa, H. S. (2003). Towards the Chemical Semantic Web. An introduction to RSS. Internet J. Chem. 6, article 4.

Murray-Rust, P. & Rzepa, H. S. (2003). XML for scientific publishing. OCLC Systems and Services. 19, 162-169.

Murray-Rust, P. & Rzepa, H. S. (2003). In 'Handbook of Chemoinformatics. Part 2. Advanced Topics, ed. J. Gasteiger & T. Engel', Vol 1.
