Python Tools in Computational Chemistry (And Biology)
Andrew Dalke
Dalke Scientific, AB
Göteborg, Sweden
EuroSciPy, 26-27 July, 2008

“Why does ‘import numpy’ take 0.4 seconds? Does it need to import 228 libraries?”
- My first Numpy-discussion post (paraphrased)

Your use case isn't so typical and so suffers on the import time end of the balance.
- Response from Robert Kern

(Others did complain. Import time down to 0.28s.)

52,000 structures
PDB doubles every 2½ years

HEADER    PHOTORECEPTOR                           23-MAY-90   1BRD      1BRD   2
COMPND    BACTERIORHODOPSIN                                             1BRD   3
SOURCE    (HALOBACTERIUM $HALOBIUM)                                     1BRD   4
EXPDTA    ELECTRON DIFFRACTION                                          1BRD   5
AUTHOR    R.HENDERSON,J.M.BALDWIN,T.A.CESKA,F.ZEMLIN,E.BECKMANN,        1BRD   6
AUTHOR   2 K.H.DOWNING                                                  1BRD   7
REVDAT   3   15-JAN-93 1BRDB   1       SEQRES                           1BRDB  1
REVDAT   2   15-JUL-91 1BRDA   1       REMARK                           1BRDA  1
..
ATOM     54  N   PRO     8      20.397 -15.569 -13.739  1.00 20.00      1BRD 136
ATOM     55  CA  PRO     8      21.592 -15.444 -12.900  1.00 20.00      1BRD 137
ATOM     56  C   PRO     8      21.359 -15.206 -11.424  1.00 20.00      1BRD 138
ATOM     57  O   PRO     8      21.904 -15.930 -10.563  1.00 20.00      1BRD 139
ATOM     58  CB  PRO     8      22.367 -14.319 -13.591  1.00 20.00      1BRD 140
ATOM     59  CG  PRO     8      22.089 -14.564 -15.053  1.00 20.00      1BRD 141
ATOM     60  CD  PRO     8      20.647 -15.054 -15.103  1.00 20.00      1BRD 142
ATOM     61  N   GLU     9      20.562 -14.211 -11.095  1.00 20.00      1BRD 143
ATOM     62  CA  GLU     9      20.192 -13.808  -9.737  1.00 20.00      1BRD 144
ATOM     63  C   GLU     9      19.567 -14.935  -8.932  1.00 20.00      1BRD 145
ATOM     64  O   GLU     9      19.815 -15.104  -7.724  1.00 20.00      1BRD 146
ATOM     65  CB  GLU     9      19.248 -12.591  -9.820  1.00 99.00   1  1BRD 147
ATOM     66  CG  GLU     9      19.902 -11.351 -10.387  1.00 99.00   1  1BRD 148
ATOM     67  CD  GLU     9      19.243 -10.169 -10.980  1.00 99.00   1  1BRD 149
ATOM     68  OE1 GLU     9      18.323 -10.191 -11.782  1.00 99.00   1  1BRD 150
ATOM     69  OE2 GLU     9      19.760  -9.089 -10.597  1.00 99.00   1  1BRD 151
ATOM     70  N   TRP    10      18.764 -15.737  -9.597  1.00 20.00      1BRD 152
ATOM     71  CA  TRP    10      18.034 -16.884  -9.090  1.00 20.00      1BRD 153
ATOM     72  C   TRP    10      18.843 -17.908  -8.318  1.00 20.00      1BRD 154
ATOM     73  O   TRP    10      18.376 -18.310  -7.230  1.00 20.00      1BRD 155
..

Structure input
Parse file into list of atoms
Format spec? What spec? Which spec?
Distance search to identify bonds
Residue assignment
Characterize molecules as protein, DNA, water
Secondary structure assignment

Structure visualization
Part science, part esthetics
(Pretty pictures get to be on journal covers.)
Spheres assume everything is equally important.
Specialized ways to visualize protein, DNA, even water.
Molecular surfaces, charge and density isosurfaces

Interactive use
Display the results with OpenGL
Atom/region selection (mouse and text)
Change representation style and color
GUIs to control all of this

Scriptability
Setup scripts, movies, demos, analysis
Display other items in the scene
Tcl is a great language for this!
VMD switches between Tcl and Python.
PyMol adds a command syntax to Python.
IPython?

Python is popular!
VMD, PyMol, Chimera, PMV, Vida, BALLView, Yasara

Visualization programs are popular!
Tcl(ish): VMD, RasMol, gOpenMol
Java: JMol, MarvinView, OpenAstexViewer
Other/commercial: Sybyl, MOE
(and 100+ more)

Molecular Dynamics
well studied - 1950s for gases, 1970s for biomolecules
F = ma
U = U_{bond} + U_{angle} + U_{dihedral} + U_{improper} + U_{Urey-Bradley} + U_{electrostatic} + U_{van der Waals}
Numerically integrated with ~1 femtosecond timesteps
http://www-dsv.cea.fr/instituts/institut-de-recherches-en-technologies-et-sciences-pour-le-vivant-irtsv/unites-de-recherche/laboratoire-chimie-et-biologie-des-metaux-lcbm/equipe-modelisation-interactions-et-repliement/breve-introduction-a-la-mecanique-mm-et-a-la-dynamique-moleculaire-dm

O(n²); or O(n log n) using Particle mesh Ewald
O(n) with cutoffs
http://www-dsv.cea.fr/instituts/institut-de-recherches-en-technologies-et-sciences-pour-le-vivant-irtsv/unites-de-recherche/laboratoire-chimie-et-biologie-des-metaux-lcbm/equipe-modelisation-interactions-et-repliement/breve-introduction-a-la-mecanique-mm-et-a-la-dynamique-moleculaire-dm

Plus:
Choice of force fields
Long-range cutoffs
User-defined forces
Boundary conditions
Choice of integration methods
Special integrators for hydrogen (SHAKE)
Rigid body dynamics
Dihedral dynamics
Constant E/T/P/count
Hybrid quantum/classical
...

NAMD (C++ with Tcl scripting)
DL_POLY (Fortran)
AMBER (Fortran, C, C++)
GROMACS (C rewrite of GROMOS/Fortran)
TINKER (Fortran and some C)
CHARMM (Fortran)
MOLDY (C, and GPLed)

Where’s Python?
MMTK and nMOLDYN, BALLView, Molecular Dynamics Language (MDL).
Minority software

Bioinformatics
~3 billion base pairs
20-25,000 protein coding genes

GenBank
82 million records
85 billion base pairs
data doubles every 18 months

LOCUS       NM_052942               2431 bp    mRNA    linear   PRI 25-MAY-2008
DEFINITION  Homo sapiens guanylate binding protein 5 (GBP5), mRNA.
ACCESSION   NM_052942
VERSION     NM_052942.2  GI:31377630
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 2431)
  AUTHORS   Ito,Y., Shibata-Watanabe,Y., Ushijima,Y., Kawada,J., Nishiyama,Y.,
            Kojima,S. and Kimura,H.
  TITLE     Oligonucleotide microarray analysis of gene expression profiles
            followed by real-time reverse-transcriptase polymerase chain
            reaction assay in chronic active Epstein-Barr virus infection
  JOURNAL   J. Infect. Dis. 197 (5), 663-666 (2008)
   PUBMED   18260761
...
FEATURES             Location/Qualifiers
     source          1..2431
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="1"
                     /map="1p22.2"
     gene            1..2431
                     /gene="GBP5"
                     /synonym="GBP-5"
                     /note="guanylate binding protein 5"
                     /db_xref="GeneID:115362"
                     /db_xref="HGNC:19895"
                     /db_xref="HPRD:13571"
                     /db_xref="MIM:611467"
⋯
     CDS             525..2285
                     /gene="GBP5"
                     /codon_start=1
                     /product="guanylate-binding protein 5"
                     /protein_id="NP_443174.1"
                     /db_xref="GI:16418425"
                     /db_xref="CCDS:CCDS722.1"
                     /db_xref="GeneID:115362"
                     /db_xref="HGNC:19895"
                     /db_xref="HPRD:13571"
                     /db_xref="MIM:611467"
                     /translation="MALEIHMSDPMCLIENFNEQLKVNQEALEILSAITQPVVVVAIV
                     GLYRTGKSYLMNKLAGKNKGFSVASTVQSHTKGIWIWCVPHPNWPNHTLVLLDTEGLG
                     DVEKADNKNDIQIFALALLLSSTFVYNTVNKIDQGAIDLLHNVTELTDLLKARNSPDL
                     DRVEDPADSASFFPDLVWTLRDFCLGLEIDGQLVTPDEYLENSLRPKQGSDQRVQNFN
                     LPRLCIQKFFPKKKCFIFDLPAHQKKLAQLETLPDDELEPEFVQQVTEFCSYIFSHSM
                     TKTLPGGIMVNGSRLKNLVLTYVNAISSGDLPCIENAVLALAQRENSAAVQKAIAHYD
                     QQMGQKVQLPMETLQELLDLHRTSEREAIEVFMKNSFKDVDQSFQKELETLLDAKQND
                     ICKRNLEASSDYCSALLKDIFGPLEEAVKQGIYSKPGGHNLFIQKTEELKAKYYREPR
                     KGIQAEEVLQKYLKSKESVSHAILQTDQALTETEKKKKEAQVKAEAEKAEAQRLAAIQ
                     RQNEQMMQERERLHQEQVRQMEIAKQNWLAEQQKMQEQQMQEQAAQLSTTFQAQNRSL
                     LSELQHAQRTVNNDDPCVLL"
...
ORIGIN
        1 ctccaggctg tggaaccttt gttctttcac tctttgcaat aaatcttgct gctgctcact
       61 ctttgggtcc acactgcctt tatgagctgt aacactcact gggaatgtct gcagcttcac
      121 tcctgaagcc agcgagacca cgaacccacc aggaggaaca aacaactcca gacgcgcagc
      181 cttaagagct gtaacactca ccgcgaaggt ctgcagcttc actcctgagc cagccagacc
      241 acgaacccac cagaaggaag aaactccaaa cacatccgaa catcagaagg agcaaactcc
      301 tgacacgcca cctttaagaa ccgtgacact caacgctagg gtccgcggct tcattcttga
      361 agtcagtgag accaagaacc caccaattcc ggacacgcta attgttgtag atcatcactt
      421 caaggtgccc atatctttct agtggaaaaa ttattctggc ctccgctgca tacaaatcag
      481 gcaaccagaa ttctacatat ataaggcaaa gtaacatcct agacatggct ttagagatcc
      541 acatgtcaga ccccatgtgc ctcatcgaga actttaatga gcagctgaag gttaatcagg
      601 aagctttgga gatcctgtct gccattacgc aacctgtagt tgtggtagcg attgtgggcc
      661 tctatcgcac tggcaaatcc tacctgatga acaagctggc tgggaagaac aagggcttct
      721 ctgttgcatc tacggtgcag tctcacacca agggaatttg gatatggtgt gtgcctcatc
      781 ccaactggcc aaatcacaca ttagttctgc ttgacaccga gggcctggga gatgtagaga
      841 aggctgacaa caagaatgat atccagatct ttgcactggc actcttactg agcagcacct
      901 ttgtgtacaa tactgtgaac aaaattgatc agggtgctat cgacctactg cacaatgtga
      961 cagaactgac agatctgctc aaggcaagaa actcacccga ccttgacagg gttgaagatc
     1021 ctgctgactc tgcgagcttc ttcccagact tagtgtggac tctgagagat ttctgcttag
     1081 gcctggaaat agatgggcaa cttgtcacac cagatgaata cctggagaat tccctaaggc
     1141 caaagcaagg tagtgatcaa agagttcaaa atttcaattt gccccgtctg tgtatacaga
     1201 agttctttcc aaaaaagaaa tgctttatct ttgacttacc tgctcaccaa aaaaagcttg
     1261 cccaacttga aacactgcct gatgatgagc tagagcctga atttgtgcaa caagtgacag
     1321 aattctgttc ctacatcttt agccattcta tgaccaagac tcttccaggt ggcatcatgg
     1381 tcaatggatc tcgtctaaag aacctggtgc tgacctatgt caatgccatc agcagtgggg
     1441 atctgccttg catagagaat gcagtcctgg ccttggctca gagagagaac tcagctgcag
     1501 tgcaaaaggc cattgcccac tatgaccagc aaatgggcca gaaagtgcag ctgcccatgg
     1561 aaaccctcca ggagctgctg gacctgcaca ggaccagtga gagggaggcc attgaagtct
     1621 tcatgaaaaa ctctttcaag gatgtagacc aaagtttcca gaaagaattg gagactctac
     1681 tagatgcaaa acagaatgac atttgtaaac ggaacctgga agcatcctcg gattattgct
     1741 cggctttact taaggatatt tttggtcctc tagaagaagc agtgaagcag ggaatttatt
     1801 ctaagccagg aggccataat ctcttcattc agaaaacaga agaactgaag gcaaagtact
     1861 atcgggagcc tcggaaagga atacaggctg aagaagttct gcagaaatat ttaaagtcca
     1921 aggagtctgt gagtcatgca atattacaga ctgaccaggc tctcacagag acggaaaaaa
     1981 agaagaaaga ggcacaagtg aaagcagaag ctgaaaaggc tgaagcgcaa aggttggcgg
     2041 cgattcaaag gcagaacgag caaatgatgc aggagaggga gagactccat caggaacaag
     2101 tgagacaaat ggagatagcc aaacaaaatt ggctggcaga gcaacagaaa atgcaggaac
     2161 aacagatgca ggaacaggct gcacagctca gcacaacatt ccaagctcaa aatagaagcc
     2221 ttctcagtga gctccagcac gcccagagga ctgttaataa cgatgatcca tgtgttttac
     2281 tctaaagtgc taaatatggg agtttccttt ttttactctt tgtcactgat gacacaacag
     2341 aaaagaaact gtagaccttg ggacaatcaa catttaaata aactttataa ttattttttc
     2401 caaactttaa aaaaaaaaaa aaaaaaaaaa a
//

How do they sequence a genome?

Is my newly sequenced DNA similar to existing DNA?
What does “similar” mean? How similar?

What about Levenshtein distance? “edit distance”
Mutation
Insertion/Deletion
CAT
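
The edit-distance idea named above can be sketched in a few lines of Python. This is a minimal illustration and not code from the talk: the function name `levenshtein` and the strings compared against CAT are made up for the example, and every mutation, insertion, or deletion is assumed to cost 1.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character mutations,
    insertions, and deletions needed to turn string a into string b.

    Assumption for this sketch: every edit costs 1.
    """
    # At the start of each outer iteration, prev[j] is the distance
    # between the characters of a consumed so far and the first j
    # characters of b. Initially nothing of a is consumed, so the
    # distance is j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # keep ca (match) or mutate it into cb
            ))
        prev = curr
    return prev[-1]


# Hypothetical three-letter examples, one per edit type.
print(levenshtein("CAT", "CGT"))   # one mutation
print(levenshtein("CAT", "CAAT"))  # one insertion
print(levenshtein("CAT", "CT"))    # one deletion
```

Each of the three calls prints 1. The function keeps only two rows of the dynamic-programming table, so memory grows with the length of `b` while time grows with the product of the two lengths.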