Stockholm Bioinformatics Centre Annual Report 2006

Director’s summary

The year 2006 was very eventful for SBC, as it changed both management and location. The SSF grant that had provided the core funding for SBC since its inception in 2000 ended in 2005. Instead, the centre now receives core funding from KTH and the Faculty of Natural Sciences at SU. In conjunction with this shift, a new director (Erik Sonnhammer) was appointed as of March 2006. A steering group was also created, consisting of Gunnar von Heijne (representing SU), Joakim Lundeberg (representing KTH), and Hugh Salter, AstraZeneca (external).

To improve the working environment, the SBC moved out of the Albanova main building to two locations. The groups of Lindahl and Elofsson joined the newly started Center for Biomembrane Research (CBR) at the Frescati campus, while the rest of the SBC moved to Roslagstullsbacken 35 to join the computational biology groups headed by Professors Anders Lansner and Erik Aurell.

A recruitment already under way for an assistant professor in bioinformatics, funded by the SU Dept. of Biochemistry and Biophysics (DBB), was unfortunately postponed due to the poor financial situation of the department. It will be resumed as soon as the finances stabilise, which may not be until 2008.

The SBC is now in an excellent position to collaborate efficiently with research groups in the Stockholm area and to further advance the level of bioinformatics research. In particular, the contacts with experimental groups at CBR and at the Human Proteome Resource at Biotechnology-KTH have led to many promising synergies. The coursework at SBC will also be expanded by a new international Master programme in Bioinformatics, hosted at DBB, which will start in the autumn of 2007.

Personnel

Prof. Arne Elofsson
Åsa Björklund, PhD student
Diana Ekman, PhD student
Olivia Eriksson, PhD student
Johannes Frey-Skött, PhD student
Linnea Hedin, PhD student
Kristoffer Illergård, PhD student
Per Larsson, PhD student
Håkan Viklund, PhD student
Erik Granseth, PhD student
Björn Wallner*, PhD student
Sara Light*, PhD student
Costas Papaloukas*, Guest professor

Ass. Prof. Erik Lindahl
Anna Johansson, PhD student
Aron Hennerdal, PhD student
Pär Bjelkmar, PhD student
Jenny Falk, PhD student
Yana Vereshchaga, Post-doc

Prof. Jens Lagergren
Öjvind Johansson, Post-doc
Ali Tofigh, PhD student
Örjan Åkerborg, PhD student
Isaac Elias, PhD student
Samuel Andersson*, PhD student

Prof. Erik Sonnhammer
Andrey Alexeyenko, Post-doc
Olof Karlberg, Post-doc
Mats Lindskog, Post-doc
Tomas Ohlson*, Post-doc
Carsten Daub*, Post-doc
Kristoffer Forslund, PhD student
Anna Henricson, PhD student
Gabriel Östlund, PhD student
Timo Lassmann*, PhD student
Abhiman Saraswathi*, PhD student
Lukas Käll*, PhD student
Tom Casavant*, Guest professor

Prof. Gunnar von Heijne
Andreas Bernsel, PhD student
Karin Melen, PhD student

Bengt Sennblad, Assistant professor
Lars Arvestad, Senior scientist
Olof Emanuelsson, Research associate
Karin Julenius, Assistant professor
Erik Sjölund, System administrator

*) Left during 2006

Collaboration partners

EU bioinformatics network Biosapiens
EU bioinformatics network Embrace
EU bioinformatics network Genefun
Center for Biological Sequence Analysis, Danmarks Tekniska Universitet (Prof. Søren Brunak & Anders Gorm Pedersen)
Bioinformatics Laboratory, BioInfoBank Institute, Poznan (Dr. Leszek Rychlewski)
Institut Pasteur, Paris (Dr. Marc Delarue)
Stanford University (Prof. , Prof. Vijay S. Pande, Prof. James Trudell)
Uppsala University (van der Spoel)
University of Wyoming (Dr. David Liberles)
McGill Centre for Bioinformatics (Dr. Mike Hallett)
University of British Columbia (Dr. Wyeth Wasserman)
Yale University, New Haven, CT (Dr. Mark Gerstein)
University of Buffalo (Dr. Daniel Fischer)
Cornell University, Ithaca, NY (Dr. Klaas van Wijk)
Inst. för Molekylärbiologi, Köpenhamns Universitet (Prof. )
The Pfam database consortium (Dr. Richard Durbin, Sanger Institute; Prof. Sean Eddy, Janelia Farm, VA, USA)
University of Valencia, Spain (Dr. Gustavo Camps-Valls)
University of Rochester Medical Center (Dr. Fred Hagen)
University of Paris René Descartes (Prof. Jean-Laurent Casanova)
Marie Öhman, Molecular Biology & Functional Genomics, SU
Mattias Höglund, Clinical Genetics, LU
Gunnar Norstedt, CMM, KI

Scientific publications

(From http://www.sbc.su.se/publications)

von Heijne, G. (2006) Membrane-protein topology. Nat Rev Mol Cell Biol 7 (12) : 909-918.

Johansson, A.C. and Lindahl, E. (2006) Amino-Acid solvation structure in transmembrane helices from molecular dynamics simulations. Biophys J 91 (12) : 4450-4463.

Alexeyenko, A., Millar, A.H., Whelan, J. and Sonnhammer, E.L. (2006) Chromosomal clustering of nuclear genes encoding mitochondrial and chloroplast proteins in Arabidopsis. Trends Genet 22 (11) : 589-593.

Bjorklund, A.K., Ekman, D. and Elofsson, A. (2006) Expansion of Protein Domain Repeats. PLoS Comput Biol 2 (8)

Julenius, K. and Pedersen, A.G. (2006) Protein evolution is faster outside the cell. Mol Biol Evol 23 (11) : 2039-2048.

Alexeyenko, A., Tamas, I., Liu, G. and Sonnhammer, E.L. (2006) Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22 (14) : e9-15.

Ohlson, T., Aggarwal, V., Elofsson, A. and MacCallum, R.M. (2006) Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps. BMC Bioinformatics 7: 357.

Viklund, H., Granseth, E. and Elofsson, A. (2006) Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes. J Mol Biol 361 (3) : 591-603.

Kim, H., Melen, K., Osterberg, M. and von Heijne, G. (2006) A global topology map of the Saccharomyces cerevisiae membrane proteome. Proc Natl Acad Sci U S A 103 (30) : 11142-11147.

Osterberg, M., Kim, H., Warringer, J., Melen, K., Blomberg, A. and von Heijne, G. (2006) Phenotypic effects of membrane protein overexpression in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 103 (30) : 11148-11153.

Lindahl, E., Azuara, C., Koehl, P. and Delarue, M. (2006) NOMAD-Ref: visualization, deformation and refinement of macromolecular structures based on all-atom normal mode analysis. Nucleic Acids Res 34 (Web Server issue) : W52-6.

Azuara, C., Lindahl, E., Koehl, P., Orland, H. and Delarue, M. (2006) PDB_Hydro: incorporating dipolar solvents with variable density in the Poisson-Boltzmann treatment of macromolecule electrostatics. Nucleic Acids Res 34 (Web Server issue) : W38-42.

Amico, M., Finelli, M., Rossi, I., Zauli, A., Elofsson, A., Viklund, H., von Heijne, G., Jones, D., Krogh, A., Fariselli, P., Luigi Martelli, P. and Casadio, R. (2006) PONGO: a web server for multiple predictions of all-alpha transmembrane proteins. Nucleic Acids Res 34 (Web Server issue) : W169-72.

Ekman, D., Light, S., Bjorklund, A.K. and Elofsson, A. (2006) What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol 7 (6) : R45.

Arvestad, L. (2006) Efficient methods for estimating amino acid replacement rates. J Mol Evol 62 (6) : 663-673.

Svensson, O., Arvestad, L. and Lagergren, J. (2006) Genome-wide survey for biologically functional pseudogenes. PLoS Comput Biol 2 (5) : e46.

Abhiman, S., Daub, C.O. and Sonnhammer, E.L. (2006) Prediction of function divergence in protein families using the substitution rate variation parameter alpha. Mol Biol Evol 23 (7) : 1406-1413.

Wallner, B. and Elofsson, A. (2006) Identification of correct regions in protein models using structural, alignment, and consensus information. Protein Sci 15 (4) : 900-913.

Rapp, M., Granseth, E., Seppala, S. and von Heijne, G. (2006) Identification and evolution of dual-topology membrane proteins. Nat Struct Mol Biol 13 (2) : 112-116.

(Not in http://www.sbc.su.se/publications:)

“Kalign, Kalignvu and Mumsa: web servers for multiple sequence alignment” Lassmann T, Sonnhammer EL. Nucleic Acids Res., 34:W596-W599 (2006)

“Overview and comparison of ortholog databases” Julia Lindberg, Andrey Alexeyenko, Åsa Pérez-Bercoff, Erik L.L. Sonnhammer Drug Discovery Today: Technologies, 3:137-143 (2006)

“NovelFam3000 - uncharacterized human protein domains conserved across model organisms” Kemmer D, Podowski RM, Arenillas D, Lim J, Hodges E, Roth P, Sonnhammer EL, Höög C, Wasserman WW BMC Genomics, 7:48 (2006)

“A Hidden Markov Model for Identification of G-protein Coupled Receptors in Protein Sequences” Markus Wistrand, Lukas Käll and Erik L.L. Sonnhammer Protein Science, 15:509-521 (2006)

“Pfam: clans, web tools and services” Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer ELL, Bateman A. Nucleic Acids Res., 34:D247-251 (2006)

Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA, 2006, Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms. J Exp Zoolog B Mol Dev Evol 308B:58-73

Motif Yggdrasil: Sampling from a Tree Mixture Model. Samuel A. Andersson and Jens Lagergren. In Tenth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2006), pages 458-472.

New Probabilistic Network Models and Algorithms for Oncogenesis. M. Hjelm, M. Höglund, and J. Lagergren. Journal of Computational Biology, May 2006, Vol. 13, No. 4: 853-865.

Compatibility of unrooted phylogenetic trees is FPT. D. Bryant and J. Lagergren. Theoretical Computer Science 351 (2006) 296-302

Bertone P, Trifonov V, Rozowsky J S, Schubert F, Emanuelsson O, Karro J, Kao M-Y, Snyder M, Gerstein M (2006): Design Optimization Methods for Genomic DNA Tiling Arrays. Genome Res. 16:271-281

Royce T E, Rozowsky J S, Luscombe N M, Emanuelsson O, Yu H, Zhu X, Snyder M, Gerstein M B (2006): Extrapolating traditional DNA microarray statistics to the tiling and protein microarray technologies. Methods Enzymol. 411:282-311

Courses and workshops

Algoritmisk bioinformatik (Algorithmic bioinformatics) (4p) 2D1450 by Jens Lagergren
Bioinformatics (4p) 2D1396 by Lars Arvestad
Structural biochemistry and bioinformatics (5p) by Arne Elofsson, Erik Lindahl

7th Swedish Bioinformatics Workshop (SBW2006) for PhD students and Postdocs, November 24-25, 2006 at Albanova University Center, Stockholm

Invited lectures and seminars

“Automatic clustering of orthologs and inparalogs shared by multiple proteomes”, ISMB'06 Fortaleza, Brazil (Aug, 2006), by Andrey Alexeyenko

“FunCoup: data integration and networks of functional coupling in eukaryotes”, Interactome networks, Hinxton, England, by Andrey Alexeyenko

“Normal-mode based refinement techniques”, D.E. Shaw Research Inc., 2006-01-06 by Erik Lindahl

“Molecular simulation with GROMACS: Applications, lessons, and future grand challenges”, Forschungszentrum Juelich, Psi-k/MolSimu forward look, 2006-11-13 by Erik Lindahl

“Simulation of biomolecules through loosely coupled of distributed simulation”, Freie Universität Berlin, 2006-12-12 by Erik Lindahl

“The Motif Yggdrasil (MY) sampler”, McGill University, Montreal, Canada, 2006-02 by Jens Lagergren

“Are there human pseudogenes with a regulatory role?”, BCB 2006 Fourth Bertinoro Computational Biology Meeting, Italy, 2006-06-25 by Jens Lagergren

“Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes”, ISMB'06 Fortaleza, Brazil (Aug, 2006), by Erik Granseth

III International Symposium on Biochemistry and Molecular Biology, Havana, October 2006 by Arne Elofsson

CASP7, Pacific Grove, CA (Dec, 2007) by Björn Wallner

“Sequence conservation in mucin-type O-glycosylation sites and in extracellular proteins in general”, Linnaeus Centre for Bioinformatics, Uppsala, 14 Feb 2006 by Karin Julenius

“Bioinformatics”, KTH Computational Science and Engineering Centre Annual Meeting, Lovik conference centre, Stockholm, 7 December 2006 by Erik Sonnhammer.

“Exploiting orthology to infer protein networks of functional coupling”, Seminar series, Dept. of Molecular Biology and Functional Genomics, Stockholm University, 4 December 2006 by Erik Sonnhammer.

“Kan vi hitta svaret på cancergåtan i bananflugans proteiner?” (“Can we find the answer to the cancer riddle in the fruit fly’s proteins?”), Café ledande forskning (faculty club), Stockholm University, 17 October 2006 by Erik Sonnhammer.

“Exploiting orthology to infer protein networks of functional coupling”, Seminar series, Danish Royal Veterinary and Agricultural University, Copenhagen, 20 September 2006 by Erik Sonnhammer. (Also presented at CBS-DTU and BINF-KU on 18 and 19 Sept.)

“Prediction of protein function” (“Prediktion av proteinfunktion”), Stockholm-Uppsala symposium in mathematical statistics, Stockholm University, 7 June 2006 by Erik Sonnhammer.

“Linking Model Organism Databases (MODs) with InParanoid”, Workshop on Phylogenetic Ontology, St. Louis, Missouri, 22 May 2006 by Erik Sonnhammer.

Seminar series, Medical Bioinformatics and Biophysics, Karolinska Institutet, 16 May 2006 by Erik Sonnhammer.

“Integrating Proteomics and Genomics Data with Orthology to Predict Functional Networks”, Seminar series, Biotechnology dept., Royal Inst. of Technology, 3 February 2006 by Erik Sonnhammer.

NIGMS Protein Structure Initiative Workshop (invited lecture). Bethesda, US. April 2006 by Gunnar von Heijne.

Symposium on “Membrane Protein Structure and Function” (invited lecture). Monte Verità, Switzerland. May 2006 by Gunnar von Heijne.

3rd KEY symposium “Membrane Transport Proteins in Health and Disease” (invited lecture). Stockholm, Sweden. May 2006 by Gunnar von Heijne.

EMBO Workshop “Cell Membrane Organization and Dynamics” (invited lecture). Bilbao, Spain. June 2006 by Gunnar von Heijne.

Gordon Conference “Bacterial Cell Surfaces” (invited lecture). New London, NH, US. June 2006 by Gunnar von Heijne.

Gordon Conference “Ion Channels” (invited lecture). New England, US. July 2006 by Gunnar von Heijne.

Klaus Tschira Foundation Symposium “Molecular Forces of Life” (invited keynote lecture). Heidelberg, FRG, September 2006 by Gunnar von Heijne.

Danish Society for Biochemistry and Molecular Biology Symposium “Membrane Proteins: Structure and Function” (invited EMBO Lecture). Ebberup, Denmark. October 2006 by Gunnar von Heijne.

EMBO Workshop “Protein Targeting” (invited lecture). Gdansk, Poland. October 2006 by Gunnar von Heijne.

SFB Symposium “Membrane Proteins and Cellular Dynamics” (invited lecture). Osnabrück, FRG. November 2006 by Gunnar von Heijne.

National Congress of the Mexican Society of Biochemistry (invited plenary lecture). Monterrey, Mexico. November 2006 by Gunnar von Heijne.

Computer infrastructure

The SBC employs a highly standardized computer system in which each workplace has an identically configured desktop computer. All user disk storage is hosted at PDC and accessed via the AFS file system (in 5-10 GB volumes). Heavy computation is carried out on a compute cluster, also maintained by PDC, which can also access the user disks. A summary of the infrastructure is listed below.

Desktop computers: 43 Pentium 4 2.80 GHz, 1 GB RAM, 40 GB disk, running CentOS Linux 4.4

Compute cluster: 364 CPUs, mixed Intel and AMD, mostly 1 GB RAM, 40 GB disk.

Disk servers: 2 servers, ~2.8 TB in total

Internal servers: mail, cups, life, mickey, sbcdb

Web servers:

http://www.sbc.su.se: AMD Athlon 900 MHz, 0.8 GB RAM, 100 GB disk
23.1 million hits from 151627 IP numbers (2.3 million from 142944 if filtering out robots) (Feb-Dec)
Hosted services:
* PRIMETV: Visualize tree reconciliations.
* Pmembr: A threading method for membrane proteins.
* HMMER: High capacity site for use of HMMER to search SCOP or Pfam.
* ProQ: A protein model quality predictor.
* PeroxiP: Predict peroxisomal proteins and Pfam domains.
* PRODIV-TMHMM: Topology and reentrant predictions.
* TMHMMfix: TMHMM with optional fixing and reliability score calculation.
* DAS: Prediction of transmembrane regions.
* NucPred: Nuclear localization prediction.
* DRIP-PRED: Disorder/order prediction for proteins.
* GPCPRED: Contact map prediction for proteins.
* SVMHC: Prediction of MHC class I binding peptides.
* PhylProM: Phylogenetic profiles.
* OVOP: Automatic view generation for protein structures (source code available).
* modhmm: A modular HMM program used in PRO(DIV)-TMHMM and other studies.
* LGscore: A program to measure the similarity between proteins.
* Palign: Our alignment/threading programs.
* ssHMM: Secondary structure HMMs based on HMMER.
* LEPRA: Protein modelling C++ library.
* TAED: The Adaptive Evolution Database.
* www.genefun.org: GeneFun EU collaboration.
* www.perlgp.org: PerlGP, The Open Source Perl Genetic Programming System.
* www.socbin.org: Society for Bioinformatics in Northern Europe.
* prime.sbc.su.se: Probabilistic Integrated Models of Evolution.

http://sbcweb.pdc.kth.se: 320037 hits
Hosted services:
* FUNCOUP: Prediction of functional coupling.

http://sly.sbc.su.se: Pentium 4 2.80 GHz, 1 GB RAM, 40 GB disk
59082 hits from 259 IP numbers (56518 from 257 if filtering out robots) (not deployed)
Hosted services:
* inparanoid.sbc.su.se: Eukaryotic ortholog groups.
* DAS services for Phobius (http://das.sbc.su.se:9000/das/phobius)

fw.sbc.su.se:
Hosted services:
* Web Services for EMBRACE

Compute cluster statistics: The compute cluster has delivered 1.1-1.6 million “aggregate wallclock” hours of compute time per year for the last 3 years.