BEASTling Documentation
Release 0.0.0

Luke Maurits
February 13, 2017

Contents

1 Contents
    1.1 Overview
    1.2 Installation
    1.3 Tutorial
    1.4 Usage
    1.5 Configuration file
    1.6 Data formats
    1.7 Modelling details
    1.8 Clock models
    1.9 Substitution models
    1.10 Scripting BEASTling

A linguistics-focussed command line tool for easily generating BEAST 2.x XML files for phylogenetic analyses.

1 Contents

1.1 Overview

1.1.1 Motivation

BEASTling is aimed (at least in part) at making BEAST somewhat more accessible to linguists who have, or want to develop, a quantitative bent; people who might read a historical linguistics paper published by biologists and computer scientists and think “Gee, that’s interesting. I wonder what would happen if you relaxed this constraint, or added this extra datapoint?”, but have no hope in hell of investigating this because, being linguists, none of their data sits around in NEXUS files and they quite reasonably don’t yet know how to write a Python script to programmatically generate a 100,000 line XML file.

If at any point in using BEASTling to set up a BEAST analysis of linguistic data you have to understand or give any thought to:

• NEXUS and/or Newick
• XML and associated concepts like namespaces, ids or idrefs
• Sequences, alignments, populations, or anything else to do with biology
• Codemaps
• Class names, method names or call signatures of any objects in the BEAST source code

then BEASTling has failed in its goal.

Of course, you should still understand at least the basics of the model you are using and MCMC in general. The idea is not to let you easily play with black boxes you don’t understand. The idea is to cut away the many, many layers of irrelevant technical detail that you would otherwise have to understand in addition to the linguistic problem at hand.

BEASTling is also aimed at people who are quite comfortable wrangling XML but would like a convenient, consistent, easily scriptable way to do it which, for example, makes generating thousands of BEAST configs for a simulation study manageable.

1.1.2 What does BEASTling actually do?

BEASTling is designed to take short, clear, high level configuration files which are human readable and writable, like this:

    [admin]
    basename = my_analysis
    log_trees = True
    log_params = True
    [MCMC]
    chainlength = 50000
    [languages]
    families = Indo-European, Uralic
    monophyletic = True
    [model my_model]
    data = my_data.csv
    model = mk
    rate_variation = False

and turn them into corresponding 100,000 line XML files. The text of the configuration file is embedded as a comment at the top of the XML file, along with the time and date the XML was generated and the version of BEASTling which did the generating. This means you can quickly get a feel for what an XML file you generated six months ago does, without spending an hour grepping around for details.

BEASTling relies on data being provided in CSV format.
If your data is not already in CSV or some format which can be easily programmatically transformed into CSV, you’re doing something wrong. The expected CSV format is one in which every row corresponds to one language, every column to one feature, and languages are identified by their three letter ISO 639 codes (the header for the language column must be “iso”).

The insistence on using ISO codes allows BEASTling to have some situational awareness of the data it is working with. E.g., the example config above includes the line:

    families = Indo-European, Uralic

This means that even if the provided data file “my_data.csv” contains data for all the languages on Earth, BEASTling will pick out only the languages which belong to the Indo-European or Uralic language families (as determined by Glottolog). Because of the line:

    monophyletic = True

BEASTling will automatically apply monophyly constraints derived from Glottolog’s family classifications, i.e. the resulting BEAST analysis will enforce that e.g. all Germanic languages belong in a single clade.

The [model my_model] section of the config allows you to specify which substitution model you’d like to use (Lewis Mk in this case), as well as control various high-level features of the model, like whether or not rate variation is permitted. Any details of the model which are not specified in the config will be automatically set to sensible, generic defaults.

1.2 Installation

1.2.1 Dependencies

Although technically not a dependency, BEASTling is pretty useless without BEAST installed. The config files generated by BEASTling are only compatible with BEAST versions 2.x.y. They will not work with old BEAST 1.x.y installations. The latest versions of BEAST 2 depend on Java version 1.8, so it’s a good idea to update your Java installation before you install BEAST.

Many of the config files generated by BEASTling will make use of features which are not a part of the BEAST core, but rather are implemented in packages.
Managing packages is fairly straightforward using the BEAUti GUI. To save headaches, install the BEAST_CLASSIC, BEASTlabs and morph-models packages before you do anything with BEASTling.

1.2.2 Installation methods

setup.py

BEASTling is installed using the setup.py script in the root of the repository. Installation will look something like this:

    $ git clone https://github.com/lmaurits/BEASTling.git
    $ cd BEASTling
    $ sudo python ./setup.py install

This will install an executable beastling, which should be put somewhere in your default PATH, so you can run it from the command line simply by typing beastling and hitting enter.

Everything else

Coming soon!

1.3 Tutorial

This tutorial will explain step-by-step how to use BEASTling to set up, configure, run and analyze a Bayesian phylogenetic analysis of language data. As an example, we will use a small dataset of lexical data for the Indo-European language family. This tutorial will only scratch the surface of using BEASTling, using BEAST, and Bayesian phylogenetic analysis in general. It should be a convenient first step, but you should make use of as many other resources as you can to learn how to use these tools and interpret the results. The official BEAST book is a great resource.

BEASTling is a command line tool. The actual analysis tool, BEAST 2, is most easily run from the command line interface as well. We will therefore begin by giving you a very short introduction to working with the command line. If you are already familiar with this, you can skip it and go directly to Installation. If you have BEASTling and BEAST 2 installed and accessible from your CLI, skip further to Using BEASTling.

1.3.1 Fundamentals

While you may be used to driving applications by pointing and clicking with the mouse and very occasionally typing text, command-line interfaces (CLI) use text commands to drive computer programs.
Somewhat like human language, these text commands must obey a specific syntax to be understood (but unlike human language, if you don’t follow the syntax strictly, nothing will happen). This syntax makes commands composable, which makes automating complex or repetitive tasks easier. In addition, the text representation means that command line instructions can be easily copied, shared, reproduced, and modified.

BEASTling was created to automate complex tasks and to improve reproducibility and adaptation for Bayesian inference on linguistic data, so it has naturally been implemented as a command line tool. This part of the tutorial is there to ensure that BEASTling does not fail in its goal of making inference tools less daunting just by living on the CLI. So here are some instructions to get you started using that powerful tool.

The CLI application, where you type commands and see their outputs, is called a shell. The most common shell these days is bash (or variants thereof), which is the default on Linux and Mac systems in the Terminal application. By default, Windows systems only include the Command Prompt, which you can start by looking for cmd.exe and running that. The Command Prompt is far less flexible and user-friendly than other available shells, but sufficient for running beastling.

If you are working under Windows, you will need a working Python installation to run beastling, for which you will install Anaconda in the next section. Anaconda gives you a Command Prompt set up to work more cleanly with its Python installation under the name Anaconda Prompt. For the purposes of this introduction to the command line, Command Prompt and Anaconda Prompt are interchangeable.

Now, start your shell: open a Terminal application, start cmd.exe or run an Anaconda Prompt, whichever is available to you.
You should now have a window that shows you some text: often some information about you, then a directory name (where ~ means “your home directory”) and then a prompt symbol ($ or >), followed by a cursor. Type dir and press Enter (on Linux and Mac, ls does the same job). The shell should show you the contents of the directory you are in, which is probably your home directory.

For the remainder of this tutorial, we will use the notation

    $ echo Example Command
    Example Command

to show you what to type on the command line and what to expect as output. The two lines above mean that you should type echo Example Command after the prompt symbol (which may be > instead of $, if you are working on Windows), and expect the output Example Command.
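
If you would like some files to experiment with from the command line, the data format and the example configuration described in the Overview can be written to disk with a few lines of Python. This is only a sketch under stated assumptions: the function name write_example_inputs, the config file name my_analysis.conf, and the feature names and values are all made up for illustration. Only the section names, the option names and the “iso” column header come from the Overview.

```python
import configparser
import csv
from pathlib import Path


def write_example_inputs(directory):
    """Write a toy CSV data file and a matching config file.

    Hypothetical helper for illustration only; it is not part of BEASTling.
    Returns the paths of the two files that were written.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    # One row per language, one column per feature. The language column
    # must be headed "iso" and hold three letter ISO 639 codes.
    # The ISO codes are real (Danish, German, Finnish, Hungarian);
    # the feature names and values are invented.
    fieldnames = ["iso", "feature_1", "feature_2"]
    rows = [
        {"iso": "dan", "feature_1": "A", "feature_2": "B"},
        {"iso": "deu", "feature_1": "A", "feature_2": "A"},
        {"iso": "fin", "feature_1": "B", "feature_2": "B"},
        {"iso": "hun", "feature_1": "B", "feature_2": "A"},
    ]
    data_path = directory / "my_data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Section and option names are copied from the example configuration
    # in the Overview; all values are stored as plain strings.
    config = configparser.ConfigParser()
    config["admin"] = {
        "basename": "my_analysis",
        "log_trees": "True",
        "log_params": "True",
    }
    config["MCMC"] = {"chainlength": "50000"}
    config["languages"] = {
        "families": "Indo-European, Uralic",
        "monophyletic": "True",
    }
    config["model my_model"] = {
        "data": data_path.name,
        "model": "mk",
        "rate_variation": "False",
    }
    # The file name below is an assumption, not a BEASTling requirement.
    config_path = directory / "my_analysis.conf"
    with config_path.open("w", encoding="utf-8") as handle:
        config.write(handle)

    return data_path, config_path
```

Calling write_example_inputs("beastling_example") creates the directory if needed and returns the paths of the two files. Because the configuration is ordinary text, the same approach extends to generating many configs in a loop, which is the scripted use case mentioned in the Overview.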