The 8th annual Bioinformatics Open Source Conference (BOSC 2007) 18th July, Vienna, Austria
Biopython Project Update
Peter Cock, MOAC Doctoral Training Centre, University of Warwick, UK Talk Outline
What is python? What is Biopython? Short history Project organisation What can you do with it? How can you contribute? Acknowledgements
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria What is Python?
High level programming language Object orientated Open Source, free ($$$) Cross platform: Linux, Windows, Mac OS X, … Extensible in C, C++, …
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria What is Biopython?
Set of libraries for computational biology Open Source, free ($$$) Cross platform: Linux, Windows, Mac OS X, … Sibling project to BioPerl, BioRuby, BioJava, …
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Popularity by Google Hits
Python 98 million Biopython 252,000
Perl 101 million BioPerlBioPerl 610,000
Ruby 101 million BioRuby 122,000
Java 289 million BioJava 185,000
Both Perl and Python are strong at text Python may have the edge for numerical work (with the Numerical python libraries)
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Biopython history 1999 : Started by Jeff Chang & Andrew Dalke 2000 : Biopython 0.90, first release 2001 : Biopython 1.00, “semi-complete” 2002 : Biopython 1.10, “semi-stable” 2003 : Biopython 1.20, 1.21, 1.22 and 1.23 2004 : Biopython 1.24 and 1.30 2005 : Biopython 1.40 and 1.41 2006 : Biopython 1.42 2007 : Biopython 1.43 The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Biopython Project Organisation
Releases: No fixed schedule Currently once or twice a year Work from a stable CVS base Bugs: Online bugzilla Some small changes handled on mailing list Tests: Many based on unittest python library Also simple scripts where output is verified The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria What can you do with Biopython?
Read, write & manipulate sequences Restriction enzymes BLAST (local and online) Web databases (e.g. NCBI’s EUtils) Call command line tools (e.g. clustalw) Clustering (Bio.Cluster) Phylogenetics (Bio.Nexus) Protein Structures (Bio.PDB)
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Manipulating Sequences
Use Biopython’s Seq object, holds: Sequence data (string like) Alphabet (can include list of letters) Alphabet allows type checking, preventing errors like appending DNA to Protein
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Manipulating Sequences from Bio.Seq import Seq from Bio.Alphabet.IUPAC import unambiguous_dna my_dna=Seq('CTAAACATCCTTCAT', unambiguous_dna) print 'Original:' print my_dna print 'Reverse complement:' print my_dna.reverse_complement()
Original: Seq('CTAAACATCCTTCAT', IUPACUnambiguousDNA()) Reverse complement: Seq('ATGAAGGATGTTTAG', IUPACUnambiguousDNA())
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Translating Sequences from Bio import Translate bact_trans=Translate.unambiguous_dna_by_id[11] print 'Forward translation' print bact_trans.translate(my_dna) print 'Reverse complement translation' print bact_trans.translate( \ my_dna.reverse_complement())
Forward translation Seq('LNILH', HasStopCodon(IUPACProtein(), '*')) Reverse complement translation Seq('MKDV*', HasStopCodon(IUPACProtein(), '*'))
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Sequence Input/Output
Bio.SeqIO is new in Biopython 1.43 Inspired by BioPerl’s SeqIO Works with SeqRecord objects (not format specific representations) Builds on existing Biopython parsers
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria SeqIO – Sequence Input from Bio import SeqIO handle = open('ls_orchid.fasta') format = 'fasta' for rec in SeqIO.parse(handle, format) : print "%s, len %i" % (rec.id, len(rec.seq)) print rec.seq[:40].tostring() + "..." handle.close() gi|2765658|emb|Z78533.1|CIZ78533, len 740 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCAT... gi|2765657|emb|Z78532.1|CCZ78532, len 753 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCAT......
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria SeqIO – Sequence Input from Bio import SeqIO handle = open('ls_orchid.gbk') format = 'genbank' for rec in SeqIO.parse(handle, format) : print "%s, len %i" % (rec.id, len(rec.seq)) print rec.seq[:40].tostring() + "..." handle.close()
Z78533.1, len 740 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCAT... Z78532.1, len 753 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCAT......
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria SeqIO – Extracting Data from Bio import SeqIO handle = open('ls_orchid.gbk') format = 'genbank' from sets import Set print Set([rec.annotations['organism'] \ for rec in SeqIO.parse(handle, format)]) handle.close()
Set(['Cypripedium acaule', 'Paphiopedilum primulinum', 'Phragmipedium lindenii', 'Paphiopedilum papuanum', 'Paphiopedilum stonei', 'Paphiopedilum urbanianum', 'Paphiopedilum dianthum', ...])
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria SeqIO – Filtering Output i_handle = open('ls_orchid.gbk') o_handle = open('small_orchid.faa', 'w') SeqIO.write([rec for rec in \ SeqIO.parse(i_handle, 'genbank') \ if len(rec.seq) < 600], o_handle, 'fasta') i_handle.close() o_handle.close()
>Z78481.1 P.insigne 5.8S rRNA gene and ITS1 ... CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTT... >Z78480.1 P.gratrixianum 5.8S rRNA gene and ... CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTT......
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria 3D Structures
Bio.Nexus was added in Biopython 1.30 by Frank Kauff and Cymon Cox Reads Nexus alignments and trees Also parses Newick format trees
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Newick Tree Parsing (Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636, (Gorilla:0.17147,(Chimp:0.19268, Human: 0.11927):0.08386):0.06124):0.15057): 0.54939,Mouse:1.21460):0.10; from Bio.Nexus.Trees import Tree tree_str = open("simple.tree").read() tree_obj = Tree(tree_str) print tree_obj tree a_tree = (Bovine,(Gibbon,(Orang,(Gorilla, (Chimp,Human)))),Mouse);
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Newick Tree Parsing tree_obj.display()
# taxon prev succ brlen blen (sum) support 0 - None [1,2,11] 0.0 0.0 - 1 Bovine 0 [] 0.69395 0.69395 - 2 - 0 [3,4] 0.54939 0.54939 - 3 Gibbon 2 [] 0.36079 0.91018 - 4 - 2 [5,6] 0.15057 0.69996 - 5 Orang 4 [] 0.33636 1.03632 - 6 - 4 [7,8] 0.06124 0.7612 - 7 Gorilla 6 [] 0.17147 0.93267 - 8 - 6 [9,10] 0.08386 0.84506 - 9 Chimp 8 [] 0.19268 1.03774 - 10 Human 8 [] 0.11927 0.96433 - 11 Mouse 0 [] 1.2146 1.2146 -
Root: 0
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria 3D Structures
Bio.PDB was added in Biopython 1.24 by Thomas Hamelryck Reads PDB and CIF format files
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Working with 3D Structures
Before: After: This example (online) uses Bio.PDB to align 21 alternative X-Ray crystal structures for PDB structure 1JOY. http://www.warwick.ac.uk/go/peter_cock/ python/protein_superposition/
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Population Genetics (planned)
Tiago Antão (with Ralph Haygood) plans to start a Population Genetics module
See also: PyPop: Python for Population Genetics Alex Lancaster et al. (2003) www.pypop.org
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Areas for Improvement
Documentation! I’m interested in sequences & alignments: Seq objects – more like strings? Alignment objects – more like arrays? SeqIO – support for more formats AlignIO? – alignment equivalent to SeqIO Move from Numeric to NumPy Move from CVS to SVN?
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria How can you Contribute?
Users: Discussions on the mailing list Report bugs Documentation improvement Coders: Suggest bug fixes New/extended test cases Adopt modules with no current ‘owner’ New modules
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Biopython Acknowledgements
Open Bioinformatics Foundation or O|B|F for web hosting, CVS servers, mailing list Biopython developers, including: Jeff Chang, Andrew Dalke, Brad Chapman, Iddo Friedberg, Michiel de Hoon, Frank Kauff, Cymon Cox, Thomas Hamelryck, me Contributors who report bugs & join in the mailing list discussions
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Personal Acknowledgements
Everyone for listening Open Bioinformatics Foundation or O|B|F for the BOSC 2007 invitation Iddo Friedberg & Michiel de Hoon for their encouragement The EPSRC for my PhD funding via the MOAC Doctoral Training Centre, Warwick http://www.warwick.ac.uk/go/moac
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria Questions?
The 8th annual Bioinformatics Open Source Conference Biopython Project Update @ BOSC 2007, Vienna, Austria