DendroPy Tutorial
Release 3.12.1

Jeet Sukumaran and Mark T. Holder

March 22, 2014

CHAPTER 1
Phylogenetic Data in DendroPy

1.1 Introduction to Phylogenetic Data Objects

1.1.1 Types of Phylogenetic Data Objects

Phylogenetic data in DendroPy is represented by one or more objects of the following classes:

Taxon
    A representation of an operational taxonomic unit, with an attribute, label, corresponding to the taxon label.

TaxonSet
    A collection of Taxon objects representing a distinct definition of taxa (for example, as specified explicitly in a NEXUS “TAXA” block, or implicitly in the set of all taxon labels used across a Newick tree file).

Tree
    A collection of Node and Edge objects representing a phylogenetic tree. Each Tree object maintains a reference to a TaxonSet object in its attribute, taxon_set, which specifies the set of taxa that are referenced by the tree and its nodes. Each Node object has a taxon attribute (which points to a particular Taxon object if there is an operational taxonomic unit associated with this node, or is None if not), a parent_node attribute (which will be None if the Node has no parent, e.g., a root node), an edge attribute (the Edge object connecting the node to its parent), as well as a list of references to child nodes, a copy of which can be obtained by calling child_nodes().

TreeList
    A list of Tree objects. A TreeList object has an attribute, taxon_set, which specifies the set of taxa that are referenced by all member Tree elements. This is enforced when a Tree object is added to a TreeList: the TaxonSet of the Tree object and all Taxon references of the Node objects in the Tree are mapped to the TaxonSet of the TreeList.

CharacterMatrix
    Representation of character data, with specializations for different data types: DnaCharacterMatrix, RnaCharacterMatrix, ProteinCharacterMatrix, StandardCharacterMatrix, ContinuousCharacterMatrix, etc.
    A CharacterMatrix can be treated very much like a dict object, with Taxon objects as keys and character data as values associated with those keys.

DataSet
    A meta-collection of phylogenetic data, consisting of lists of multiple TaxonSet objects (taxon_sets), TreeList objects (tree_lists), and CharacterMatrix objects (char_matrices).

1.1.2 Creating New (Empty) Objects

All of the above names are imported into the dendropy namespace, so to instantiate new, empty objects of these classes you only need to import dendropy:

    >>> import dendropy
    >>> tree1 = dendropy.Tree()
    >>> tree_list1 = dendropy.TreeList()
    >>> dna1 = dendropy.DnaCharacterMatrix()
    >>> dataset1 = dendropy.DataSet()

Or import the names directly:

    >>> from dendropy import Tree, TreeList, DnaCharacterMatrix, DataSet
    >>> tree1 = Tree()
    >>> tree_list1 = TreeList()
    >>> dna1 = DnaCharacterMatrix()
    >>> dataset1 = DataSet()

1.1.3 Reading and Writing Phylogenetic Data

DendroPy provides a rich set of tools for reading and writing phylogenetic data in various formats, such as NEXUS, Newick, PHYLIP, etc. These are covered in detail in the following “Reading Phylogenetic Data” and “Writing Phylogenetic Data” sections, respectively.

1.2 Reading Phylogenetic Data

1.2.1 Creating and Populating New Objects

The Tree, TreeList, CharacterMatrix-derived, and DataSet classes all support “get_from_*” factory methods that allow for the simultaneous instantiation and population of the objects from a data source:

get_from_stream(src, schema, **kwargs)
    Takes a file or file-like object opened for reading the data source as the first argument, and a string specifying the schema as the second.

get_from_path(src, schema, **kwargs)
    Takes a string specifying the path to the data source file as the first argument, and a string specifying the schema as the second.
get_from_string(src, schema, **kwargs)
    Takes a string containing the source data as the first argument, and a string specifying the schema as the second.

All these methods minimally take a source and a schema specification string as arguments and return a new object of the given type populated from the given source:

    >>> import dendropy
    >>> tree1 = dendropy.Tree.get_from_string("((A,B),(C,D))", schema="newick")
    >>> tree_list1 = dendropy.TreeList.get_from_path("pythonidae.mcmc.nex", schema="nexus")
    >>> dna1 = dendropy.DnaCharacterMatrix.get_from_stream(open("pythonidae.fasta"), "dnafasta")
    >>> std1 = dendropy.StandardCharacterMatrix.get_from_path("python_morph.nex", "nexus")
    >>> dataset1 = dendropy.DataSet.get_from_path("pythonidae.nex", "nexus")

The schema specification string can be one of: “nexus”, “newick”, “nexml”, “fasta”, or “phylip”. Not all formats are supported for reading, and not all formats make sense for particular objects (for example, it would not make sense to try to instantiate a Tree or TreeList object from a FASTA-formatted data source).

Alternatively, you can also pass a file-like object and a schema specification string to the constructor of these classes using the keyword arguments stream and schema respectively:

    >>> import dendropy
    >>> tree1 = dendropy.Tree(stream=open("mle.tre"), schema="newick")
    >>> tree_list1 = dendropy.TreeList(stream=open("pythonidae.mcmc.nex"), schema="nexus")
    >>> dna1 = dendropy.DnaCharacterMatrix(stream=open("pythonidae.fasta"), schema="dnafasta")
    >>> std1 = dendropy.StandardCharacterMatrix(stream=open("python_morph.nex"), schema="nexus")
    >>> dataset1 = dendropy.DataSet(stream=open("pythonidae.nex"), schema="nexus")

Various keyword arguments can also be passed to these methods which customize or control how the data is parsed and mapped into DendroPy object space. These are discussed below.
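The relationship between the three “get_from_*” entry points can be pictured with a small, self-contained toy model. This is only a sketch of the factory-method pattern, not DendroPy's actual implementation: the class ToyLabels and its regular-expression "parser" are hypothetical, and the way the stream and path variants delegate to the string variant is an assumption made for illustration. It does not import dendropy.

```python
import io
import os
import re
import tempfile


class ToyLabels(object):
    """Hypothetical stand-in for a DendroPy data class (not part of DendroPy)."""

    def __init__(self, labels=None):
        self.labels = list(labels or [])

    @classmethod
    def get_from_string(cls, src, schema, **kwargs):
        # Factory method: build and return a new, populated object.
        if schema != "newick":
            raise ValueError("unsupported schema: %r" % (schema,))
        # Toy parser: collect the unquoted leaf labels of a Newick string.
        return cls(re.findall(r"[(,]\s*([^(),:;\s]+)", src))

    @classmethod
    def get_from_stream(cls, src, schema, **kwargs):
        # Assumed delegation: read the file-like object, then parse the text.
        return cls.get_from_string(src.read(), schema, **kwargs)

    @classmethod
    def get_from_path(cls, src, schema, **kwargs):
        # Assumed delegation: open the path, then reuse the stream variant.
        with open(src) as stream:
            return cls.get_from_stream(stream, schema, **kwargs)


newick = "((A,B),(C,D));"
from_string = ToyLabels.get_from_string(newick, schema="newick")
from_stream = ToyLabels.get_from_stream(io.StringIO(newick), schema="newick")

# Write the same data to a temporary file to exercise the path variant.
fd, path = tempfile.mkstemp(suffix=".tre")
with os.fdopen(fd, "w") as handle:
    handle.write(newick)
try:
    from_path = ToyLabels.get_from_path(path, schema="newick")
finally:
    os.remove(path)

print(from_string.labels, from_stream.labels, from_path.labels)
```

All three calls return a new object with the same contents; only the form of the first argument (text, open file-like object, or file path) differs, which mirrors the descriptions of the three methods given above.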
1.2.2 Reading and Populating (or Repopulating) Existing Objects

The Tree, TreeList, CharacterMatrix-derived, and DataSet classes all support a suite of “read_from_*” instance methods that parallels the “get_from_*” factory methods described above:

read_from_stream(src, schema, **kwargs)
    Takes a file or file-like object opened for reading the data source as the first argument, and a string specifying the schema as the second.

read_from_path(src, schema, **kwargs)
    Takes a string specifying the path to the data source file as the first argument, and a string specifying the schema as the second.

read_from_string(src, schema, **kwargs)
    Takes a string containing the source data as the first argument, and a string specifying the schema as the second.

When called on an existing TreeList or DataSet object, these methods add the data from the data source to the object, whereas when called on an existing Tree or CharacterMatrix object, they replace the object’s data with data from the data source. As with the “get_from_*” methods, the schema specification string can be any supported and type-appropriate schema, such as “nexus”, “newick”, “nexml”, “fasta”, “phylip”, etc.
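The add-versus-replace distinction described above can be modelled in a few lines of plain Python. This is a toy sketch under stated assumptions, not DendroPy code: ToyTree and ToyTreeList are hypothetical classes, and splitting on semicolons stands in for real Newick parsing.

```python
class ToyTree(object):
    """Hypothetical stand-in for a single-item container such as Tree."""

    def __init__(self):
        self.newick = None

    def read_from_string(self, src, schema, **kwargs):
        # Replacement: whatever was held before is discarded.
        self.newick = src.strip().rstrip(";")


class ToyTreeList(list):
    """Hypothetical stand-in for an aggregating container such as TreeList."""

    def read_from_string(self, src, schema, **kwargs):
        # Aggregation: every tree statement in the source is appended.
        for statement in src.split(";"):
            if statement.strip():
                tree = ToyTree()
                tree.read_from_string(statement, schema)
                self.append(tree)


tree = ToyTree()
tree.read_from_string("((A,B),(C,D));", "newick")
tree.read_from_string("((A,C),(B,D));", "newick")  # replaces the first tree

trees = ToyTreeList()
trees.read_from_string("((A,B),(C,D)); ((A,C),(B,D));", "newick")
trees.read_from_string("((A,D),(B,C));", "newick")  # adds to the first two

print(tree.newick)
print(len(trees))
```

After the second call the toy tree holds only the most recently read topology, while the toy list holds every tree read so far.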
For example, the following accumulates post-burn-in trees from several different files into a single TreeList object. Here tree_offset=200 skips the first 200 trees of each file, so each call adds 801 trees to the list:

    >>> import dendropy
    >>> post_trees = dendropy.TreeList()
    >>> post_trees.read_from_path("pythonidae.nex.run1.t", "nexus", tree_offset=200)
    >>> print(post_trees.description())
    TreeList object at 0x550990 (TreeList5573008): 801 Trees
    >>> post_trees.read_from_path("pythonidae.nex.run2.t", "nexus", tree_offset=200)
    >>> print(post_trees.description())
    TreeList object at 0x550990 (TreeList5573008): 1602 Trees
    >>> post_trees.read_from_path("pythonidae.nex.run3.t", "nexus", tree_offset=200)
    >>> print(post_trees.description())
    TreeList object at 0x550990 (TreeList5573008): 2403 Trees
    >>> post_trees.read_from_path("pythonidae.nex.run4.t", "nexus", tree_offset=200)
    >>> print(post_trees.description())
    TreeList object at 0x550990 (TreeList5573008): 3204 Trees

The TreeList object automatically handles taxon management, and ensures that all appended Tree objects share the same TaxonSet reference. Thus all the Tree objects created and aggregated from the data sources in the example share the same TaxonSet and Taxon objects, which is important if you are going to carry out comparisons or operations between multiple Tree objects.

In contrast to the aggregating behavior of the read_from_* methods of TreeList and DataSet objects, the read_from_* methods of Tree- and CharacterMatrix-derived objects show replacement behavior.
For example, the following changes the contents of a Tree by re-reading it:

    >>> import dendropy
    >>> t = dendropy.Tree()
    >>> t.read_from_path('pythonidae.mle.nex', 'nexus')
    >>> print(t.description())
    Tree object at 0x79c70 (Tree37413776: '0'): ('Python molurus':0.0779719244,(('Python sebae':0.1414715009,((((('Morelia tracyae':0.0435011998,('Morelia amethistina':0.0305993564,(('Morelia nauta':0.0092774432,'Morelia kinghorni':0.0093145395):0.005595,'Morelia clastolepis':0.005204698):0.023435):0.012223):0.025359,'Morelia boeleni':0.0863199106):0.019894,(('Python reticulatus':0.0828549023,'Python timoriensis':0.0963051344):0.072003,'Morelia oenpelliensis':0.0820543043):0.002785):0.00274,(((('Morelia viridis':0.0925974416,('Morelia carinata':0.0943697342,('Morelia spilota':0.0237557178,'Morelia bredli':0.0357358071):0.041377):0.005225):0.004424,('Antaresia maculosa':0.1141193265,(('Antaresia childreni':0.0363195704,'Antaresia stimsoni':0.0188535952):0.043287,'Antaresia perthensis':0.0947695442):0.019148):0.007921):0.022413,('Leiopython albertisii':0.0698883547,'Bothrochilus boa':0.0811607602):0.020941):0.007439,(('Liasis olivaceus':0.0449896545,('Liasis mackloti':0.0331564496,'Liasis fuscus':0.0230286886):0.058253):0.016766,'Apodora papuana':0.0847328612):0.008417):0.006539):0.011557,('Aspidites ramsayi':0.0349772256,'Aspidites