BEASTling Documentation Release 0.0.0

Luke Maurits

February 13, 2017

Contents

1 Contents
    1.1 Overview
    1.2 Installation
    1.3 Tutorial
    1.4 Usage
    1.5 Configuration file
    1.6 Data formats
    1.7 Modelling details
    1.8 Clock models
    1.9 Substitution models
    1.10 Scripting BEASTling


A linguistics-focussed command line tool for easily generating BEAST 2.x XML files for phylogenetic analyses.


CHAPTER 1

Contents

1.1 Overview

1.1.1 Motivation

BEASTling is aimed (at least in part) at making BEAST somewhat more accessible to linguists who have, or want to develop, a quantitative bent; people who might read a historical linguistics paper published by biologists and computer scientists and think “Gee, that’s interesting. I wonder what would happen if you relaxed this constraint, or added this extra datapoint?”, but have no hope in hell of investigating this because, being linguists, none of their data sits around in NEXUS files and they quite reasonably don’t yet know how to write a Python script to programmatically generate a 100,000 line XML file.

If at any point in using BEASTling to set up a BEAST analysis of linguistic data you have to understand or give any thought to:

• NEXUS and/or Newick
• XML and associated concepts like namespaces, ids or idrefs
• Sequences, alignments, populations, or anything else to do with biology
• Codemaps
• Class names, method names or call signatures of any Objects in the BEAST source code

then BEASTling has failed in its goal.

Of course, you should still understand at least the basics of the model you are using and MCMC in general. The idea is not to let you easily play with black boxes you don’t understand. The idea is to cut away the many, many layers of irrelevant technical detail that you would otherwise have to understand in addition to the linguistics problem at hand.

BEASTling is also aimed at people who are quite comfortable wrangling XML but would like a convenient, consistent, easily scriptable way to do it which, for example, makes generating thousands of BEAST configs for a simulation study manageable.

1.1.2 What does BEASTling actually do?

BEASTling is designed to take short, clear, high level configuration files which are human readable and writable, like this:

    [admin]
    basename = my_analysis
    log_trees = True
    log_params = True
    [MCMC]
    chainlength = 50000
    [languages]
    families = Indo-European, Uralic
    monophyletic = True
    [model my_model]
    data = my_data.csv
    model = mk
    rate_variation = False

and turn them into corresponding 100,000 line XML files. The text of the configuration file is embedded as a comment at the top of the XML file, along with the time and date the XML was generated and the version of BEASTling which did the generating. This means you can quickly get a feel for what an XML file you generated six months ago does, without spending an hour grepping around for details.

BEASTling relies on data being provided in CSV format. If your data is not already in CSV or some format which can be easily programmatically transformed into CSV, you’re doing something wrong. The expected CSV format is one in which every row corresponds to one language, every column to one feature, and languages are represented using three-letter ISO 639 codes (the header for the language column must be “iso”).

The insistence on using ISO codes allows BEASTling to have some situational awareness of the data it is working with. E.g., the example config above includes the line:

    families = Indo-European, Uralic

This means that even if the provided data file “my_data.csv” contains data for all the languages on Earth, BEASTling will pick out only the languages which belong to the Indo-European or Uralic language families (as determined by Glottolog). Because of the line:

    monophyletic = True

BEASTling will automatically apply monophyly constraints derived from Glottolog’s family classifications, i.e. the resulting BEAST analysis will enforce that e.g. all Germanic languages belong in a single clade. The [model my_model] section of the config allows you to specify which substitution model you’d like to use (Lewis Mk in this case), as well as control various high-level features of the model, like whether or not rate variation is permitted. Any details of the model which are not specified in the config will be automatically set to sensible, generic defaults.
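Because the configuration format is so small, it is easy to generate from a script, which is the kind of use the Motivation section has in mind for simulation studies. The following is only a minimal sketch using Python's standard configparser module. The helper name build_config is hypothetical, and the section name "languages" is an assumption about where the family filter lives; the other section and option names are copied from the example config above.

```python
import configparser
import io


def build_config(families, chainlength=50000):
    """Return the text of a BEASTling-style configuration file."""
    config = configparser.ConfigParser()
    config["admin"] = {
        "basename": "my_analysis",
        "log_trees": "True",
        "log_params": "True",
    }
    config["MCMC"] = {"chainlength": str(chainlength)}
    # Assumption: "languages" is the section that holds the family filter.
    config["languages"] = {
        "families": ", ".join(families),
        "monophyletic": "True",
    }
    config["model my_model"] = {
        "data": "my_data.csv",
        "model": "mk",
        "rate_variation": "False",
    }
    # Serialise to a string instead of a file so the result is easy to inspect.
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


config_text = build_config(["Indo-European", "Uralic"])
print(config_text)
```

Writing config_text to a file and calling build_config in a loop with different arguments would give one configuration file per analysis.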

1.2 Installation

1.2.1 Dependencies

Although technically not a dependency, BEASTling is pretty useless without BEAST installed. The config files generated by BEASTling are only compatible with BEAST versions 2.x.y. They will not work with old BEAST 1.x.y installations. The latest versions of BEAST 2 depend upon Java version 1.8, so it’s a good idea to update your Java installation before you install BEAST.

Many of the config files generated by BEASTling will make use of features which are not a part of the BEAST core, but rather are implemented in packages. Managing packages is fairly straightforward using the Beauti GUI. To save headaches, you should install the BEAST_CLASSIC, BEASTlabs and morph-models packages before you do anything with BEASTling.


1.2.2 Installation methods

setup.py

BEASTling is installed using the setup.py script in the root of the repository. Installation will look something like this:

    $ git clone https://github.com/lmaurits/BEASTling.git
    $ cd BEASTling
    $ sudo python ./setup.py install

This will install an executable beastling, which should be put somewhere in your default PATH, so you can run it from the command line simply by typing beastling and hitting enter.

Everything else

Coming soon!

1.3 Tutorial

This tutorial will explain step-by-step how to use BEASTling to set up, configure, run and analyze a Bayesian phylogenetic analysis of language data. As an example, we will use a small dataset of lexical data for the Indo-European language family. This tutorial will only scratch the surface of using BEASTling, using BEAST, and Bayesian phylogenetic analysis in general. It should be a convenient first step, but you should make use of as many other resources as you can to learn how to use these tools and interpret the results. The official BEAST book is a great resource.

BEASTling is a command line tool. The actual analysis tool, BEAST 2, is most easily run from the command line interface as well. We will therefore begin by giving you a very short introduction to working with the command line, which you can skip if you are already familiar with this and go directly to Installation. If you have BEASTling and BEAST 2 installed and accessible from your CLI, skip further to Using BEASTling.

1.3.1 Fundamentals

While you may be used to driving applications by pointing and clicking with the mouse and very occasionally typing text, command-line interfaces (CLI) use text commands to drive computer programs. In some sense similar to human language, these text commands must obey a specific syntax to be understood (but as opposed to human language, if you don’t follow the syntax strictly, nothing will happen), and this syntax powers compositionality, which makes automatising complex or repetitive tasks easier. In addition, the text representation means that command line instructions can be easily copied, shared, reproduced, and modified.

BEASTling was created to automatise complex tasks and improve reproducibility and adaptation for Bayesian inference on linguistic data, so it has naturally been implemented as a command line tool. This part of the tutorial is there to ensure that BEASTling does not fail its goal of making inference tools less daunting just by living on the CLI. So here are some instructions to get you started using that powerful tool.

The CLI application, where you type commands and see their outputs, is called a shell. The most common shell these days is bash (or variants thereof), which is the default on Linux and Mac systems in the Terminal application. By default, Windows systems only include the Command Prompt, which you can start by looking for cmd.exe and running that. The Command Prompt is far less flexible and user-friendly than other available shells, but sufficient for running beastling.


If you are working under Windows, you will need a working Python installation to run beastling, for which you will install Anaconda in the next section. Anaconda gives you a Command Prompt set up to work more cleanly with its Python installation under the name Anaconda Prompt. For the purposes of this introduction to the command line, Command Prompt and Anaconda Prompt are interchangeable.

Now, start your shell: open a Terminal application, start cmd.exe or run an Anaconda Prompt, whichever is available to you. You should now have a window that shows you some text: often some information about you, then a directory name (where ~ means “your home directory”) and then a prompt symbol ($ or >), before a cursor.

Type dir and press Enter. The shell should show you the contents of the directory you are in, which is probably your home directory.

For the remainder of this tutorial, we will use the notation

    $ echo Example Command
    Example Command

to show you what to type on the command line and what to expect as output. The two lines above mean that you should type echo Example Command after the prompt symbol (which may be > instead of $, if you are working on Windows), and expect the output Example Command. Sometimes, we will abbreviate the expected output, and write [...].

It is important to know how to navigate the file system on the command line, otherwise you will be stuck running all analyses inside your home directory! So let us create a new directory

    $ mkdir example_directory

and step inside with the cd (“change directory”) command.

    $ cd example_directory
    $ dir

As you can see, this directory is empty: in bash, dir outputs nothing, while on Windows it lists two directories, . and .. These two directories are special (they exist under Linux as well, they just are not shown): they are this directory example_directory itself, and its parent directory, where we have just come from, respectively. We can use these special directories to move up using cd:

    $ cd ..
    $ dir
    [... The same output as before, and the new directory:]
    example_directory
    [...]
    $ cd example_directory

Paths like this can be combined using /, so if you are inside example_directory,

    $ cd ../example_directory

will do nothing. This knowledge should allow you to go from any directory to any other directory on your hard drive, and on Windows, you can use your other hard drive’s letter, such as D:, as a command to change hard drives. (Under Linux, other hard drives etc. just behave like any other directory, so you change hard drives like you change directories.)

Unfortunately, the Command Prompt and bash understand different languages. For example, in bash we might have

    $ echo This text is put into a file. > file.txt
    $ dir
    file.txt
    $ cat file.txt
    This text is put into a file.


The cat command does not exist in Windows, as the Prompt will tell you:

    ‘cat’ is not recognized as an internal or external command, operable program or batch file.

There is however a Windows command called type that you can use in place of cat, which will output the content of a file. In this tutorial, we will use the language of bash to give coded examples, but where needed we will give a Windows command or a way outside the CLI to achieve the same result.

1.3.2 Installation

BEASTling is written in the Python programming language, and BEAST 2 is written in Java 8. We will therefore first have to install these core dependencies.

Java 8

Java 8 can be downloaded from the official Oracle website. You only need the JRE, not the JDK, to use BEAST. Please note that BEAST 2 will not work with Java 7 or earlier versions, so even if you already have Java installed, you may need to upgrade.

BEAST 2

Once you have a working Java 8 installation, download BEAST 2 from the official BEAST 2 website. The README file included in the package you download will include installation instructions for your operating system. In addition to installing BEAST 2, you should probably install some of its extension packages. Without these, you will be very limited in the kinds of analyses you can run. You can read about installing BEAST packages here.

Python

Most current Linux distributions come with a pre-packaged Python installation. If your Python version (which you can see by running python --version in a shell) is lower than 2.7, you will want to upgrade your Python in the way you usually install new software. If you want to run BEASTling on Windows, we recommend the Anaconda Python distribution. Download it here and run the Python 3.5 installer for your system.

BEASTling and its Python dependencies

If you want to control the details of your installation, refer to the Installation instructions elsewhere in the BEASTling documentation. Otherwise, BEASTling is available from the Python Package Index, which is easily accessible using the pip command line tool, so it will be sufficient to run

    $ pip install beastling
    [...]

in order to install the package and all its dependencies. All current Python versions (above 2.7.9 and above 3.4) are shipped with pip – if you have an older version of Python installed, either check how to get pip elsewhere, consider upgrading your Python or check the Installation chapter for alternative installation instructions.


1.3.3 Using BEASTling

First, create a new empty directory. We will collect the data and run the analyses inside that folder. Open a command line interface, and make sure its working directory is that new folder. For example, start a terminal and execute

    $ mkdir indoeuropean
    $ cd indoeuropean

For this tutorial, we will be using lexical data, i.e. cognate judgements, for a small set of Indo-European languages. The data is stored in CLDF format in a csv file called ie_cognates.csv which can be downloaded as follows:

    $ curl -OL https://raw.githubusercontent.com/lmaurits/BEASTling/release-1.2/docs/tutorial_data/ie_cognates.csv
    [... Download progress]

(curl is a command line tool to download files from URLs, available under Linux and Windows. You can, of course, download the file yourself using whatever method you are most comfortable with, and save it as ie_cognates.csv in this folder.)

If you look at this data, using your preferred text editor or importing it into Excel or however you prefer to look at csv files, you will see that

    $ cat ie_cognates.csv
    Language_ID,Feature_ID,IPA,Value
    [...]

it is a comma-separated CLDF file, which is a format that BEASTling supports out-of-the-box.

So let us start building the most basic BEASTling analysis using this data. Create a file called ie_vocabulary.conf using your favourite text editor with the following content:

    [model ie_vocabulary]
    model = covarion
    data = ie_cognates.csv

– ie_vocabulary.conf

This is a minimal BEASTling file that will generate a BEAST 2 XML configuration file that tries to infer a tree of Indo-European languages from the dataset using a binary Covarion model.

Let’s try it!

    $ beastling ie_vocabulary.conf
    $ ls
    [...]
    beastling.xml
    [...]
    $ cat beastling.xml
    [... Many xml lines describing the model in detail]


We would like to run this in BEAST to test it, but the default chain length of 10000000 will make waiting for this analysis to finish tedious (over an hour on most machines). Because this is a small data set, we can get away with a shorter chain length (we will discuss how to tell what chain length is required later), so let’s reduce it for the time being:

    [MCMC]
    chainlength=500000
    [model ie_vocabulary]
    model=covarion
    data=ie_cognates.csv

– ie_vocabulary.conf

Now we can run beastling again (after cleaning up the previous output) and then run BEAST.

    $ rm beastling.xml
    $ beastling ie_vocabulary.conf
    $ beast beastling.xml
    [...]

    [...]
    BEAST v2.4[...], 2002-2016
    Bayesian Evolutionary Analysis Sampling Trees
    Designed and developed by
    Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard
    [...]
    ======[...]
    Start likelihood: [...]
    [...]
    Sample ESS(posterior) prior likelihood posterior
    [...]

BEAST will now spend some time sampling trees. Because this is a simple analysis with a small data set, BEAST should finish in 5 or 10 minutes unless you are using a relatively slow computer. When BEAST has finished running, you should see two new files in your directory:

    $ ls
    [...]
    beastling.log
    beastling.nex
    beastling.xml
    [...]

beastling.log is a log file which contains various details of each of the 10,000 trees sampled in this analysis, including their prior probability, likelihood and posterior probability, as well as the height of the tree. In more complicated analyses, this file will contain much more information, like rates of change for different features in the dataset, details of evolutionary clock models, the ages of certain clades in the tree and more.

beastling.log is a tab separated value (tsv) file. You should be able to open it up in a spreadsheet program like Microsoft Excel, LibreOffice Calc or Gnumeric. Let’s look at the first few lines of the log file.

    $ head beastling.log
    [... Numbers are stochastic and may vary]
    Sample  prior                 likelihood           posterior            treeHeight
    0       -11.466785941356303   -5434.4533277100645  -5445.920113651421   2.930099025192108
    50      -14.507387085439145   -4948.559786139161   -4963.0671732246     2.8632651425342983
    100     -13.715625758051573   -4588.294198523788   -4602.009824281839   2.8235811961563644
    150     -14.455572518334662   -4353.763156917764   -4368.218729436098   2.720387319308833
    200     -10.719230155244194   -4219.189086103397   -4229.908316258641   2.0137609414490942
    250     -2.906983109341201    -4176.574925532654   -4179.481908641995   1.4462030568578153
    300     -2.9491105164545837   -4027.5833312195637  -4030.5324417360184  1.4462030568578153
    350     5.795184249496499     -3866.294505320323   -3860.4993210708267  0.6592039530882482
    400     8.927313730401623     -3757.008703631417   -3748.0813899010154  0.5651416164402189

(head is a command available in most Unix-based platforms like Linux and OS X which prints the first 10 lines of a file. You can just look at the first ten rows of your file in Excel or similar if you don’t have head available.)

Don’t panic if you don’t see exactly the same numbers in your file. BEAST uses a technique called Markov Chain Monte Carlo (MCMC), which is based on random sampling of trees. This means every run of a BEAST analysis will give slightly different results, but the overall statistics should be the same from run to run. Imagine tossing a coin 100 times and writing down the result. If two people do this and compare the first 10 lines of their results, they will not see exactly the same sequence of heads and tails, and the same is true of two BEAST runs. But both people should see roughly 50 heads and roughly 50 tails over all 100 tosses, and two BEAST runs should be similar in the same way.

Even though you will have different numbers, you should see the same 6 columns in your file. Just for now, we will focus on the first five.

The Sample column simply indicates which sample each line corresponds to. We asked BEAST to draw 500,000 samples (with the chainlength setting). Usually, not every sample in an MCMC analysis is kept, because consecutive samples are too similar to one another. Instead, some samples are thrown away, and samples are kept at some periodic interval. By default, BEASTling asks BEAST to keep enough samples so that the log file contains 10,000 samples. In this case, this means keeping every 50th sample, which is why we see 0, 50, 100, 150, etc. in the first column. If we’d asked BEAST to draw 50,000 samples instead, we’d have to keep every 5th sample to get 10,000 by the end, so the first column would start with 0, 5, 10, 15, etc.
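The arithmetic behind the Sample column can be sketched in a few lines of Python. This only reproduces the calculation described in the text; the function name log_interval is hypothetical, and the exact rule BEASTling applies internally (for instance for chain lengths shorter than 10,000) is an assumption here.

```python
def log_interval(chainlength, logged_samples=10000):
    """Return how many MCMC steps lie between two logged samples."""
    # Integer division; never log less often than every step.
    return max(1, chainlength // logged_samples)


for chainlength in (500000, 50000):
    step = log_interval(chainlength)
    first_samples = [i * step for i in range(4)]
    print(chainlength, step, first_samples)
```

For a chain length of 500,000 this gives an interval of 50 and the sample numbers 0, 50, 100, 150; for 50,000 it gives an interval of 5.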
The next three columns, prior, likelihood and posterior, record the important probabilities of the underlying model: the prior probability of the tree and any model parameters, the likelihood of the data under the model, and the posterior probability which is the product of these two values. These probabilities are stored logarithmically, e.g. the probability 0.5 would be stored as -0.69, which is the natural logarithm of 0.5. This simply makes it easier for computers to store very small numbers, which are common in these analyses. Because a product of probabilities becomes a sum of logarithms, each value in the posterior column is the sum of the prior and likelihood values on the same line.

The fifth column, treeHeight, records the height of each of the sampled trees (the total distance along the branches from the root to the leaves). Later, we will provide calibration dates for some of the Indo-European languages, and then the treeHeights will be recorded in units of years, and these values will give us an estimate of the age of proto-Indo-European. However, in this simple analysis, we have no calibrations, so the treeHeight is in units of the average number of changes which have happened in the data, per feature, from the root to the leaves.

Log files like this one are usually inspected using specialist tools to extract information from them (such as the mean value of a parameter across all samples, which is commonly used as an estimate of the parameter). A tool called Tracer is commonly used for this task. We will discuss using Tracer later. In a pinch, you can use spreadsheet software like Excel to analyse one of these files, too.

For now, let’s turn our attention to the other log file. beastling.nex is a tree log file which contains the actual 10,000 sampled trees themselves. This file is in a format known as Nexus, which itself expresses phylogenetic trees in a format known as Newick, which uses nested brackets to represent trees. If you open this file in a text editor like Notepad and scroll down a little, you will be able to see these Newick trees.
One of them might look like this:

    tree STATE_0 = (((((1:0.0699,10:0.0699):0.1936,9:0.2635):0.0767,(2:0.1176,5:0.1176):0.2225):0.9013,(6:0.4338,((((7:0.0262,12:0.0262):0.0649,8:0.0911):0.1889,((15:0.0884,19:0.0884):0.1319,16:0.2203):0.0597):0.0817,17:0.3617):0.0721):0.8076):0.6963,(((3:0.0438,14:0.0438):0.0124,4:0.0563):0.3858,((11:0.0154,18:0.0154):0.0507,13:0.0661):0.376):1.4957):0;
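For readers curious about the bracket notation, here is a small sketch of how the leaves of such a tree can be picked out with a regular expression. It is an illustration only, not how BEAST or BEASTling read trees. The string is a small subtree copied from the example tree, and the function name leaf_labels is hypothetical. In this example the leaves are numbers rather than language names.

```python
import re

# A small subtree copied from the example tree, closed off with ";".
NEWICK = "((1:0.0699,10:0.0699):0.1936,9:0.2635);"


def leaf_labels(newick):
    """Return the leaf labels of a Newick string, in order of appearance."""
    # A leaf label follows "(" or "," and is followed by ":" and a branch length.
    # Internal nodes follow ")" instead, so they are not matched.
    return re.findall(r"[(,]([^(),:;]+):", newick)


labels = leaf_labels(NEWICK)
print(labels)  # ['1', '10', '9']
```

Each number after a colon is the length of the branch leading to that leaf or group of leaves.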

As you can see, Newick trees are very hard to read directly, especially for large trees. Instead, these files can be visualised using special purpose programs, which makes things much easier. FigTree is a popular example, but there are many more. Let’s take a look at our trees! Remember there are 10,000 trees saved in the beastling.nex file. When you open the file in FigTree, by default it will show you the first one in the file (which corresponds to sample 0 in the beastling.log file). There are Prev/Next arrows near the top right of the screen which let you examine each tree in turn. The first tree in the file is the starting point of the Markov Chain, and BEAST chooses it at random. So the first tree you are looking at will probably not look like a plausible history of Indo-European! Here is an example:


Once again, you should not expect to see the exact same tree in your file, because the trees are randomly sampled. But you should have a random tree which does not reflect what we know about Indo-European. However, regardless of the random starting tree, the consecutive sampled trees will tend to have a better and better match to the data. Let’s look at the 10,000th and final tree in the file, which should look better (you don’t have to press Next 10,000 times! Use the “Current Tree” menu to the left of the screen):

Here the Germanic, Romance and Slavic subfamilies have been correctly separated out, and the Germanic family is correctly divided into North and West Germanic. You should see similar good agreement in your final tree, although the details may differ from here, and the fit might not be quite as good or may be a little better. Bayesian MCMC does not sample trees which strictly improve on the fit to data one after the other. Instead, well-fitting trees are sampled more often than ill-fitting trees, with a sampling ratio proportional to how well they fit. So there is no guarantee that the last tree in the file is the best fit, but it will almost certainly be a better fit than the first tree.

Just like tools like Tracer are used on log files to summarise all of the 10,000 samples into a useful form, like the mean of a parameter, there are tools to summarise all of the 10,000 trees to produce a so-called “summary tree”. One tool for doing this is distributed with BEAST and is called treeannotator. If you are an advanced command line user you may like to use the tool phyltr, which is also written by a BEASTling developer and uses the idea of a Unix pipeline.

The image below shows a “majority rules consensus tree”, produced using phyltr. This shows all splits between languages which are present in at least 5,000 of the 10,000 trees. The numbers at each branching point show the proportion of trees in the sample compatible with each branching.

In this style of consensus tree, the tree may sometimes split into more than two branches at once (i.e. the tree is not a binary tree). For example, look at the Scandinavian languages. Here the tree splits into four languages. This is because the relationships among the Scandinavian languages are uncertain. All of the 10,000 trees in our posterior sample are binary trees, but this summary tree only shows relationships which are supported by at least half the trees. Perhaps in our 10,000 trees, Icelandic is most closely related to Norwegian in 4,500 of them, to Swedish in 3,000 of them and to Danish in 2,500 of them. None of these relationships is supported at least half the time, so the summary tree shows only a polytomy. But the posterior tree log file always contains full information about the uncertainty, i.e. by counting the relationships above we know that Icelandic is more likely to be related to Norwegian than Danish, and we know how much more likely (almost twice as likely).
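The counting argument in the Icelandic example can be written out as a short Python sketch. The counts are hypothetical illustration values chosen to sum to the 10,000 trees in the sample (45%, 30% and 25% of the trees); they are not results from a real run, and this is not how phyltr or treeannotator are implemented.

```python
# Hypothetical number of posterior trees in which Icelandic is the closest
# relative of each language; the counts sum to the 10,000 sampled trees.
sister_counts = {"Norwegian": 4500, "Swedish": 3000, "Danish": 2500}

total_trees = sum(sister_counts.values())
support = {lang: count / total_trees for lang, count in sister_counts.items()}

# A majority rules consensus tree only keeps groupings found in at least
# half of the trees; if none qualifies, the result is a polytomy.
majority = [lang for lang, share in support.items() if share >= 0.5]

# The full sample still tells us how much more likely one grouping is.
norwegian_vs_danish = support["Norwegian"] / support["Danish"]

print(support)
print("shown in consensus tree:", majority)
print("Norwegian vs Danish:", round(norwegian_vs_danish, 2))
```

No grouping reaches the 50% threshold, so nothing is shown in the consensus tree, while the ratio of 1.8 matches the “almost twice as likely” reading of the full sample.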

1.3.4 More advanced modelling

The BEASTling analysis we have used so far has a very short and neat configuration, but it is not based on a terribly realistic model of linguistic evolution, and so we may want to make some changes (however, it is always a good idea when working with a new data set to try to get very simple models working first and add complexity in stages).

The main oversimplification in the default analysis is the treatment of the rate at which linguistic features change. The default analysis makes two simplifications: first, it assumes that all features in the dataset change at the same rate as each other. Second, it assumes that the rate of change is fixed at all points in time and at all locations on the phylogenetic tree. Both of these things are very unlikely to have been true about Indo-European vocabulary. BEASTling makes it easy to relax either of these assumptions, or both. The cost you pay is that your analysis will not run as quickly, and you may experience convergence issues.


Rate variation

You can enable rate variation by adding rate_variation = True to your [model] section, like this:

    [model ie_vocabulary]
    model=covarion
    data=ie_cognates.csv
    rate_variation=True

– ie_vocabulary.conf

This will assign a separate rate of evolution to each feature in the dataset (each meaning slot in the case of our cognate data). The words for some meaning slots, such as pronouns or body parts, may change very slowly compared to the average, while the words for other meaning slots may change more quickly. With rate variation enabled, BEAST will attempt to figure out relative rates of change for each of your features (the rates across all features are assumed to follow a Gamma distribution).

Note that BEAST now has to estimate one extra parameter for each meaning slot in the data set (110), which means the analysis will have to run longer to provide good estimates, so let’s increase the chain length to 2,000,000. Ideally, it should be longer, but this is a tutorial, not a paper for peer review, and we don’t want to have to wait too long for our results.

    [mcmc]
    chainlength=2000000
    [model ie_vocabulary]
    model=covarion
    data=ie_cognates.csv
    rate_variation=True

– ie_vocabulary.conf

BEAST will now infer some extra parameters, and we’d like to know what they are. By default, these will not be logged, because the logfiles can become very large, eating up lots of disk space, and in some cases we may not be too interested. We can switch logging on by adding an admin section and setting the log_params option to True.

    [admin]
    log_params=True
    [mcmc]
    chainlength=2000000
    [model ie_vocabulary]
    model=covarion
    data=ie_cognates.csv
    rate_variation=True

– ie_vocabulary.conf

Now rebuild your XML file and run BEAST again:

    $ beastling --overwrite ie_vocabulary.conf
    $ beast beastling.xml
    [...]
    Bayesian Evolutionary Analysis Sampling Trees
    [...]

If you look at the new beastling.log file, you will notice that many extra columns have appeared compared to our first analysis. Many of these are the new individual rates of change for our meaning slots. You should see columns with the following names: featureClockRate:ie_vocabulary:I, featureClockRate:ie_vocabulary:all, featureClockRate:ie_vocabulary:ashes, featureClockRate:ie_vocabulary:bark, featureClockRate:ie_vocabulary:belly, etc. These are the rates of change for the meaning slots “I”, “all”, “ashes”, “bark” and “belly”. They are expressed as multiples of the overall average rate. In my run of this analysis, the mean value of featureClockRate:ie_vocabulary:I is about 0.16, meaning cognate replacement for this meaning slot happens a bit more than 6 times more slowly than the average meaning slot. This is to be expected, as pronouns are typically very stable. On the other hand, my mean value for featureClockRate:ie_vocabulary:belly is about 2.14, suggesting that this word evolves a little more than twice as fast as average. Features with a mean value of around 1.0 are evolving at the average rate.

In addition to providing information on the relative rates of change for features, permitting rate variation can impact the topology of the trees which are sampled. If two languages have different words for a meaning slot which evolves very slowly, this is evidence that the languages are only distantly related. However, if two languages have different words for a meaning slot which evolves rapidly, then this does not necessarily mean they cannot be closely related. This kind of nuanced inference cannot be made in a model where all features are forced to evolve at the same rate, so the tree topology which comes out of the two models can differ significantly.

Rate variation can also influence the relative timing of the branching events in a tree. If two languages share cognates for most meaning slots and differ in only a few, the rates of change of those few meaning slots give us some idea of how long ago the languages diverged.

Let’s look at our new trees, or rather, at a consensus tree:

Notice that the Scandinavian languages are now a little bit better resolved: Swedish and Danish are directly related in about 6,310 of our 10,000 posterior trees, so the tree splits in two here now! This may be due to the rate variation (maybe some of the cognates Swedish and Danish share belong to very stable meaning slots but BEAST could not use this information previously), or it might just be because we ran our chain for longer and got better samples (we are working a little “off the cuff” in this tutorial). Also notice that the Romance languages are a little less well resolved! Rate variation can cause this too. Perhaps the cognates shared by Romanian and French turned out to be for quickly changing meaning slots.
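The per-feature rates discussed earlier in this section are normally summarised with Tracer, but the same means can be computed with a few lines of Python. The sketch below works on a hypothetical three-line excerpt of a tab separated log, not on real output. The numbers are invented so that the means come out at 0.16 and 2.14, and the function name column_means is an assumption. Skipping lines that start with # is a precaution in case the log file carries comment lines.

```python
import csv

# Hypothetical three-sample excerpt of a beastling.log file (tab separated).
SAMPLE_LOG = (
    "Sample\tfeatureClockRate:ie_vocabulary:I\tfeatureClockRate:ie_vocabulary:belly\n"
    "0\t0.15\t2.10\n"
    "200\t0.16\t2.14\n"
    "400\t0.17\t2.18\n"
)


def column_means(log_text):
    """Return the mean of every column except Sample."""
    # Drop comment lines, if there are any, before parsing the table.
    lines = [line for line in log_text.splitlines() if not line.startswith("#")]
    rows = list(csv.DictReader(lines, delimiter="\t"))
    names = [name for name in rows[0] if name != "Sample"]
    return {
        name: sum(float(row[name]) for row in rows) / len(rows)
        for name in names
    }


means = column_means(SAMPLE_LOG)
for name, mean in means.items():
    feature = name.rsplit(":", 1)[-1]
    if mean < 1.0:
        # Rates are multiples of the average, so 1/mean is the slowdown factor.
        print(f"{feature}: {1.0 / mean:.2f} times slower than average")
    else:
        print(f"{feature}: {mean:.2f} times faster than average")
```

With a real log file you would read the text from beastling.log instead of using SAMPLE_LOG, and you would probably want to discard the earliest samples first.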

Clock variation

If you want the rate of language change to vary across different branches in the tree (which correspond to different locations and times), you can specify your own clock model:

    [admin]
    log_params=True
    [mcmc]
    chainlength=2000000
    [model ie_vocabulary]
    model=covarion
    data=ie_cognates.csv
    rate_variation=True
    [clock default]
    type=relaxed


Here, in ie_vocabulary.conf, we have specified a relaxed clock model. This means that every branch on the tree will have its own specific rate of change. However, all of these rates will be sampled from one distribution, so that most branches will receive rates which are only slightly faster or slower than the average, while a small number of branches may have outlying rates. By default, this distribution is log-normal, but it is possible to specify an exponential or gamma distribution instead. Another alternative to the default “strict clock” is a random local clock, but relaxed clocks are more commonly used.

Note that we have left rate variation on as well, but this is not required for using a relaxed clock. Rate variation and non-strict clocks are two separate and independent ways of making your model more realistic.

Rebuild your XML file and run BEAST again in the now-familiar manner:

    $ beastling --overwrite ie_vocabulary.conf
    $ beast beastling.xml
    [...]

Just like when we switched on rate variation, you should be able to see that using a relaxed clock added several additional columns to your beastling.log logfile. In particular, you should see: clockRate.c:default, rate.c:default.mean, rate.c:default.variance, rate.c:default.coefficientOfVariation and ucldSdev.c:default.

Two of the new columns, clockRate.c:default and ucldSdev.c:default, are the mean and standard deviation respectively of the log-normal distribution from which the clock rates for each branch are drawn. In this analysis, the mean is fixed at 1.0 because there are no calibrations. You will see how this changes later in the tutorial. The next two, rate.c:default.mean and rate.c:default.variance, are the empirical mean and variance of the actual rates sampled for the branches, which may differ slightly from the distribution parameters. Finally, rate.c:default.coefficientOfVariation is the ratio of the standard deviation of the branch rates to their mean, and provides a measure of how much variation there is in the rate of evolution over the tree. If this value is quite low, say 0.1 or less, this suggests that there is very little variation across the branches, and using a relaxed clock instead of a strict clock will probably not have enough impact on your results to be worth the increased running time. High values mean the data is strongly incompatible with a strict clock.

Once again, we can look at a consensus tree to see how this change has affected our analysis.

Notice that the Scandinavian and Romance subfamilies are now both completely resolved! For more details on clock models supported by BEASTling, see the Clock models page.
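To make the relaxed clock summary columns more concrete, here is a minimal Python sketch of how a mean, variance and coefficient of variation can be computed from a set of branch rates. It takes the coefficient of variation in its usual sense (standard deviation divided by mean) and uses the population variance; whether BEAST uses exactly these conventions is an assumption, and the branch rates below are invented for illustration.

```python
import statistics


def summarise_branch_rates(rates):
    """Return (mean, variance, coefficient of variation) of branch rates."""
    mean = statistics.fmean(rates)
    # Population variance; BEAST's exact convention is an assumption here.
    variance = statistics.pvariance(rates, mu=mean)
    coefficient_of_variation = variance ** 0.5 / mean
    return mean, variance, coefficient_of_variation


# Hypothetical branch rates, invented for illustration only.
nearly_strict = [0.98, 1.01, 1.00, 1.02, 0.99]
clearly_relaxed = [0.40, 0.75, 1.00, 1.30, 1.55]

for label, rates in [("nearly strict", nearly_strict),
                     ("clearly relaxed", clearly_relaxed)]:
    mean, variance, cov = summarise_branch_rates(rates)
    print(f"{label}: mean={mean:.3f} variance={variance:.4f} CoV={cov:.3f}")
```

The first set of rates falls well below the 0.1 rule of thumb and the second well above it, which is the kind of contrast the coefficientOfVariation column is meant to reveal.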


Adding calibrations

The trees we have been looking at up until now have all had branch lengths expressed in units of expected number of substitutions, or “change events”, per feature. One common application of phylogenetics in linguistics is to estimate the age of language families or subfamilies. In order to do this, we need to calibrate our tree by providing BEAST with our best estimate of the age of some points on the tree. If we do this, the trees in our beastling.nex output file will instead have branch lengths in units which match the units used for our calibration.

Calibrations are added to their own section in the configuration file. Suppose we wish to calibrate the common ancestor of the Romance languages in our analysis to have an age coinciding with the collapse of the Roman Empire, say 1,400 to 1,600 years BP. We will specify our calibrations in units of millennia:

    [admin]
    log_params=True
    [mcmc]
    chainlength=2000000
    [model ie_vocabulary]
    model=covarion
    data=ie_cognates.csv
    rate_variation=True
    [clock default]
    type=relaxed
    [calibrations]
    French,Italian,Portuguese,Romanian,Spanish=1.4-1.6

Once again we rebuild and re-run, using the updated ie_vocabulary.conf:

    $ beastling --overwrite ie_vocabulary.conf
    $ beast beastling.xml
    [...]
    Bayesian Evolutionary Analysis Sampling Trees
    [...]

Including this calibration will have changed several things about our output. First, let’s look at the log file. The most obvious difference will be in the treeHeight column. Whereas previously this value was in rather abstract units of “average number of changes per meaning slot”, now it is in units of millennia, matching our calibration. Instead of a mean value of around 0.82, you should see a mean value of something like 5.72. This is our analysis’ estimate of the age of proto-Indo-European (i.e. about 5,700 years). In addition to a point estimate like this, we can get a plausible interval: 95% of the samples in our analysis are between 1.35 and 15.00, so the age of Indo-European could plausibly lie anywhere in this range. This is quite a broad range, which is not unexpected here - we are using a very small data set (in terms of both languages and meaning slots) and have only one internal calibration. Serious efforts to date protolanguages require much more care than this analysis; however, it demonstrates the basics of using BEASTling for this purpose.

You should also see some new columns, including one with the (somewhat unwieldy) name mrcatime(French,Italian,Portuguese,Romanian,Spanish). This column records the age (in millennia BP) of the most recent common ancestor of the Romance languages in our analysis. Because we placed a calibration on this node, you should see that almost all values in this column are between 1.4 and 1.6. In my run of this analysis, I see a mean of 1.497 and a 95% HPD interval of 1.399 to 1.6, indicating that the calibration has functioned exactly as intended.

As is now usual, we can build a consensus tree to summarise the results of our analysis.


If you compare this tree to the previous one, after we introduced the relaxed clock, you will notice that they have exactly the same topology, and the posterior support values are very similar. This is to be expected. Adding a single calibration point essentially does nothing but rescale the tree branch lengths. Adding multiple calibrations, however, could potentially change the topology.
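Summaries such as the mean and 95% range of the treeHeight column can also be computed outside Tracer. The following Python sketch assumes the log file is tab-separated, with comment lines starting with "#" and a header row of column names; the helper names, the 10% burn-in and the example values are all invented for illustration. It reports a central interval, which is simpler than, and not identical to, the HPD interval that Tracer shows.

```python
import csv
import io
import statistics


def read_log_column(log_text, column, burnin_fraction=0.1):
    """Return one column of a tab-separated log as floats, minus burn-in."""
    rows = [line for line in log_text.splitlines()
            if line.strip() and not line.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(rows)), delimiter="\t")
    values = [float(row[column]) for row in reader]
    return values[int(len(values) * burnin_fraction):]


def central_interval(values, mass=0.95):
    """Return the central interval holding roughly `mass` of the samples."""
    ordered = sorted(values)
    skip = int(len(ordered) * (1.0 - mass) / 2.0)
    return ordered[skip], ordered[len(ordered) - 1 - skip]


# Tiny made-up log for illustration; real logs have thousands of rows.
EXAMPLE_LOG = "\n".join([
    "# made-up values, not real BEAST output",
    "Sample\ttreeHeight",
    "0\t12.4", "200\t6.1", "400\t5.2", "600\t5.9", "800\t6.3",
    "1000\t4.8", "1200\t5.5", "1400\t6.0", "1600\t5.7", "1800\t6.2",
])

heights = read_log_column(EXAMPLE_LOG, "treeHeight")
print("mean:", round(statistics.fmean(heights), 3))
print("central 95% interval:", central_interval(heights))
```

For a real analysis you would pass the contents of beastling.log instead of the example string, and choose the burn-in fraction after inspecting the trace.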

1.3.5 Best practices

Bayesian phylogenetic inference is a complicated subject, and this tutorial can only ever give you a quick first impression of what is involved. We urge you to make use of the many other learning resources available for mastering the art. However, to help you get started we offer a very brief discussion of some important “best practices” you should follow.

Keep it simple

For serious linguistic studies, you will almost always end up using some model more complicated than the default provided by BEASTling, perhaps using multiple substitution models, rate variation, non-strict clocks and multiple calibrations in either time or space. Each complication brings an additional chance of problems, and at the very least means your analysis will take longer to run. You should always begin a study by using the simplest model possible, even if it is not a perfect match to reality. Make sure the model runs with a strict clock, no rate variation and without any calibrations first. Add these details later one at a time to see what impact each one has on the results. If you encounter any problems, at least you will know which part of the model is the cause.

Sample from the prior

An essential part of Bayesian modelling is using prior distributions to influence your results. Complicated models usually come with complicated priors. All BEASTling-generated analyses feature a prior distribution over the phylogenetic tree, and depending upon your setup your analysis may add additional components to the prior such as monophyly constraints, timing calibrations and geographic constraints.


Even if it is not obvious, these prior constraints can interact with one another in unexpected ways, and this can introduce biases into your results. If your posterior tree sets suggest that some languages are related, you must not simply assume that this is due to phylogenetic signal in the data. It may be that there are actually only a small number of ways to simultaneously satisfy all of your constraints, and most or all of these may involve your languages being related. In this case, your results will show the languages to be related no matter what data you give your model!

To guard against this, you should always sample from the prior distribution of your final analysis, i.e. do a run where the data is ignored. You should then compare the results you get from this to the results you get from the full analysis, to make sure that the data is contributing most of the result.

BEASTling makes this easy: run it with the --prior option. For our Indo-European example, instead of doing the usual

    $ beastling ie_vocabulary.conf

we can do

    $ beastling --prior ie_vocabulary.conf

Instead of creating a beastling.xml file, this will create a file named beastling_prior.xml. This file will contain the configuration for a BEAST analysis which is identical to the one specified in ie_vocabulary.conf, but it will sample from the prior. Run it with:

    $ beast beastling_prior.xml
    [...]

The output files will be beastling_prior.log and beastling_prior.nex, and these can be interpreted in precisely the same way as the regular log files.

How long should I run my chains?

The essence of what BEAST does when it runs an analysis configured by BEASTling is to sample 10,000 trees (and 10,000 values of all parameters), and we use these samples as an estimate of the posterior distribution. This is true regardless of the configured chain length. If we run the chain for 10,000 iterations, then each one is kept as one of our samples. If we run the chain for 100,000 iterations, then only every 10th sample is kept and the others are thrown out. Since we get 10,000 samples either way, how do we know how long to set our chain length?

In order for our estimate to be a “good one”, we need to take a few things into account. The MCMC sampler sets the tree and all parameters to random initial values, and then at each iteration attempts to change one or more of these values. The state of the chain drifts away from the random initial state (which is probably a very bad fit to the data) and then, once the values are a good fit, the chain wanders around the space of good fitting values, sampling values in proportion to their posterior probability. So, one thing we need to be sure of is that our chain runs for enough iterations to get out of the initial bad fit and into a region of good fit. This is known as “getting past burn in”.

Another thing to consider is that we want our 10,000 samples to be roughly independent. Suppose we have a weighted coin and we want to estimate the bias. We can flip it 10,000 times and count the heads and tails and compute the ratio to get a good estimate of the bias. Suppose instead of flipping the coin ourselves, we give it to a coin-flipping robot. The robot isn’t very good at its job (but it’s trying its best!), and it only succeeds in flipping the coin every 5 tries. Instead of getting a sequence like this:

    H, T, H, T, H, H, T, T, H, T

we get a sequence like this:

    H, H, H, H, H, T, T, T, T, T, H, H, H, H, H, T, T, T, T, T, ...
Obviously, if we let the robot produce 10,000 samples for us, we will not get as good an estimate as flipping the coin ourselves. We are getting 10,000 samples, but intuitively, there is only as much information as 2,000 “real” samples, due to the duplications.

A complicated MCMC analysis is kind of like this not-so-good robot. Consecutive samples tend to be identical or very similar to one another, so if we just took the first 10,000 samples out of the chain after burn in, there might actually only be a very small amount of information in them and our estimate would not be reliable. Because of this, we need to run the chain for more than 10,000 iterations (sometimes much more) and only record every 10th or 100th or 1,000th sample in order to ensure good quality estimates. The more complicated your analysis, the harder the MCMC robot’s job becomes, so the longer the required chain length and the longer you have to wait for results. Very complicated analyses with very large data sets can easily take several days or even weeks to provide a good sample!

So, how do we know when we have run our chain long enough to get past the burn in, and spaced our samples out enough to get a reliable estimate? The Tracer program distributed with BEAST can help us with this task. When you load a BEAST .log file in Tracer, in addition to seeing the mean value of all the columns in the log file, you can see the ESS, or Effective Sample Size. This tells you how many independent samples would carry as much information as your 10,000 samples (in our coin-flipping robot example above, the ESS of the 10,000 samples was about 2,000, because each flip was reported five times). As a rule of thumb, an ESS of below 100 is too low for a reliable estimate, and an ESS of 200 or more is considered acceptable. Accordingly, Tracer will colour ESSes below 100 red to let you know they are problematic, and ESSes between 100 and 200 yellow to let you know they are not quite ideal.
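The coin-flipping robot is easy to simulate. The Python sketch below is purely illustrative (the function names, the 0.6 bias and the random seed are invented): it produces 10,000 reported samples in which every real flip is repeated five times, then keeps every fifth sample, in the same spirit as logging only every n-th state of an MCMC chain.

```python
import random


def robot_flips(n_samples, tries_per_flip=5, p_heads=0.6, seed=1):
    """Simulate the clumsy robot: each real flip is reported several times."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n_samples:
        outcome = "H" if rng.random() < p_heads else "T"
        samples.extend([outcome] * tries_per_flip)
    return samples[:n_samples]


def thin(samples, every):
    """Keep every `every`-th sample and discard the rest."""
    return samples[::every]


samples = robot_flips(10_000)
independent = thin(samples, 5)

print(len(samples), "reported samples,", len(independent), "independent flips")
print("estimate from all samples:    ", samples.count("H") / len(samples))
print("estimate from thinned samples:", independent.count("H") / len(independent))
```

Both estimates come out identical, because the discarded samples were exact copies: the 10,000 reported samples carry no more information than the 2,000 thinned ones. Real MCMC samples are correlated rather than exactly duplicated, which is why Tracer estimates the ESS instead of simply counting.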

1.4 Usage

BEASTling is a command-line tool, with no graphical interface. On Linux or OS X machines, it can be used from a terminal. On Windows machines, it can be run from the command prompt or, for a less painful experience, you can use cygwin.

1.4.1 Basic usage

Typical usage is to run:

    $ beastling my_config.conf

where “my_config.conf” is a valid BEASTling configuration file. This will produce an XML file, whose name is determined by the “basename” parameter in the config file. Alternatively, the output filename can be specified as a second parameter:

    $ beastling my_config.conf my_output.xml

If the my_output.xml file already exists and you want to overwrite it, use the --overwrite option:

    $ beastling --overwrite my_config.conf my_output.xml

To write the XML output to stdout instead of a file, use - in place of an output filename:

    $ beastling my_config.conf -

1.4.2 Running your analysis

Once you have your output XML file, you can get BEAST to run your analysis by simply running:

    $ beast my_output.xml
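If you generate and run many analyses, it can be convenient to assemble these two commands in a script. The following Python sketch only builds the argument lists for the command forms described above; the function name is invented, the "beastling.xml" fallback assumes the default basename is in use, and the "-" (stdout) form is deliberately not handled.

```python
import shlex


def beastling_commands(config, output=None, overwrite=False, verbose=False):
    """Return argv lists for generating the XML and then running BEAST."""
    generate = ["beastling"]
    if overwrite:
        generate.append("--overwrite")
    if verbose:
        generate.append("--verbose")
    generate.append(config)
    if output is not None:
        generate.append(output)
    # Without an explicit output name the XML is named after "basename";
    # "beastling.xml" assumes the default basename has not been changed.
    xml_name = output if output is not None else "beastling.xml"
    return [generate, ["beast", xml_name]]


for argv in beastling_commands("my_config.conf", "my_output.xml", overwrite=True):
    print("$", shlex.join(argv))
```

To actually execute the commands rather than print them, each list could be passed to subprocess.run(argv, check=True), provided beastling and beast are on your PATH.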

1.4.3 Verbose mode

If you run BEASTling in verbose mode, using either -v or --verbose, BEASTling will print messages while processing your configuration file. These messages will let you know of BEAST packages that your analysis depends upon, and of various decisions it makes which you may like to be aware of. For example:


    $ beastling -v my_config.conf my_output.xml
    [DEPENDENCY] ConstrainedRandomTree is implemented in the BEAST package BEASTLabs.
    [DEPENDENCY] The Lewis Mk substitution model is implemented in the BEAST package "morph-models".
    [INFO] Model "my_model": Trait f3 excluded because its value is constant across selected languages. Set "remove_constant_features=False" in config to stop this.
    [INFO] Model "my_model": Trait f6 excluded because there are no datapoints for selected languages.
    [INFO] Model "my_model": Using 8 features from data file ./tests/data/basic.csv
    [INFO] 5 languages included in analysis.

In future, BEASTling in verbose mode may also offer hints on how you can tweak your configuration to improve performance.
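When running many configurations in a batch, it may help to collect the verbose messages by type, for instance to list every BEAST package that needs installing. The sketch below is an illustration only: it assumes each message occupies one line starting with a bracketed tag, as in the example output above, and the function name is invented.

```python
import re
from collections import defaultdict

# A message is assumed to look like "[TAG] free text" on a single line.
MESSAGE = re.compile(r"^\[(?P<tag>[A-Z]+)\]\s*(?P<text>.*)$")


def group_messages(output):
    """Group verbose-mode messages by their bracketed tag."""
    grouped = defaultdict(list)
    for line in output.splitlines():
        match = MESSAGE.match(line.strip())
        if match:
            grouped[match.group("tag")].append(match.group("text"))
    return dict(grouped)


# Two lines copied from the example output above.
SAMPLE_OUTPUT = "\n".join([
    "[DEPENDENCY] ConstrainedRandomTree is implemented in the BEAST package BEASTLabs.",
    "[INFO] 5 languages included in analysis.",
])

for tag, messages in group_messages(SAMPLE_OUTPUT).items():
    print(tag, len(messages))
```

Lines without a leading tag, such as the echoed command itself, are simply ignored.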

1.4.4 Generating reports

In addition to creating a BEAST XML file, BEASTling is also capable of simultaneously creating high-level, human-readable analysis reports. To generate these reports, include the --report option when running BEASTling. This will produce two files, my_config.md and my_config.geojson.

The my_config.md file contains Markdown-formatted text. This report briefly summarises things like which languages are included in the analysis and which families they come from, how many features from the datafiles are used and which substitution models have been applied, calibration dates which have been applied, and more.

The my_config.geojson file is a GeoJSON file which encodes the location of all the languages in your analysis.

If you keep your BEASTling configuration file and the generated reports in a GitHub repository, then when you view the reports GitHub will automatically render the Markdown into nicely formatted text, and will automatically render the GeoJSON as a zoomable, pannable world map, where languages are colour-coded by family. This is probably the quickest and easiest way to view the reports, and it makes it super simple to share your work with others by sending around the URLs for these reports. People who have no idea how to read BEAST XML files or even BEASTling configuration files can look at these two reports and immediately understand the high-level details of what you are doing.

Besides, you were already going to put your data and configuration on GitHub anyway, right, so your fellow scientists can reproduce your results and easily run their own modifications?

1.4.5 Extracting configurations from XMLs

If you have a pre-existing BEAST XML file which was generated by BEASTling, then you can use the --extract option to extract the original configuration file and, if embed_data was enabled in that configuration file, any data files. This makes it extremely easy to start experimenting with variations on a published analysis. Note that --extract will not overwrite existing files unless --overwrite is specified.

1.4.6 Advanced stuff

These usage patterns will cover the vast majority of uses of BEASTling. If you’re feeling funky, you can read the linguistic data from stdin instead of a .csv file (see Data formats), or you can generate XML files directly from a Python script, using BEASTling as a library (see Scripting BEASTling).

1.5 Configuration file

Understanding BEASTling is, mostly, a matter of understanding the configuration file format. Config files have the following form:


    [section1]
    param1a = value1a
    param2a = value2a
    param3a = value3a
    param4a = value4a
    param5a = value5a
    [section2]
    param1b = value1b
    param2b = value2b
    param3b = value3b
    [section3]
    param1c = value1c
    param2c = value2c
    param3c = value3c
    param4c = value4c
    ...

i.e. they are divided into sections, which are indicated by names enclosed in square brackets (in the above, the section names are section1, section2, etc.), and each section consists of some number of parameters and assigned values. Each line of each section corresponds to assigning one value to one parameter, with the parameter name on the left of the equals sign and the value on the right.

BEASTling configuration files can range from very simple (the only compulsory sections are one or more model sections) to relatively complicated - although in all cases they are vastly simpler than any BEAST XML file. If you provide minimal configuration information, “sensible defaults” will be used for all settings. It is your responsibility to know what the defaults are and to make sure that they truly are sensible for your application.

The recognised config file sections are as follows:

1.5.1 admin section

The admin section may contain the following parameters:

• basename: this is any user-friendly string which will be used in e.g. filenames. If the basename is, say, “IE_cognates”, then the BEAST XML file which BEASTling produces will be named IE_cognates.xml, and when the BEAST analysis is run, the trees will be logged in IE_cognates.nex, etc. If unspecified, it will default to “beastling”.

• glottolog_release: the number of a Glottolog release (>=2.7), from which to obtain the language classification.

• screenlog: this must be set to “True” or “False” and controls whether or not BEAST should output basic MCMC data like ESS to the screen while running. Default is True.

• log_probabilities: “True” or “False”. Controls whether or not the prior, likelihood and posterior should be logged to a file called basename.log. This is generally a good idea, so that you can check e.g. ESSes for these things in Tracer, so the default is True.

• log_params: “True” or “False”. Controls whether or not all model parameters are also included in basename.log. Default is False.

• log_trees: “True” or “False”. Controls whether or not sampled trees should be logged to basename.nex. Default is True.

• log_all: “True” or “False”. Setting this to True is simply a shorthand for setting log_probabilities, log_params and log_trees all to True. Default is False.

• log_every: an integer specifying how many MCMC samples should elapse between consecutive entries in the log file. If not specified, BEASTling will set this based on the chainlength such that the log file will be 10,000 entries long. This is a good compromise between getting lots of information about the posterior and conserving disk space.
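As a rough illustration of the default log_every behaviour, the interval that yields about 10,000 log entries is simply the chain length divided by 10,000. The Python sketch below is not BEASTling’s own code; the function name is invented and the exact rounding BEASTling applies is an assumption.

```python
def default_log_every(chainlength, target_entries=10_000):
    """Logging interval giving roughly `target_entries` log entries."""
    # Integer division, never less than 1; the rounding rule is an assumption.
    return max(1, chainlength // target_entries)


for chainlength in (10_000, 2_000_000, 10_000_000):
    print(chainlength, "->", default_log_every(chainlength))
```

So the tutorial’s chain of 2,000,000 iterations would be logged every 200 iterations under this rule.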

1.5.2 MCMC section

The MCMC section may contain the following parameters:

• chainlength: number of iterations to run the MCMC chain for. Default is 10,000,000.

• sample_from_prior: “True” or “False”. If True, BEAST will ignore all supplied data, and the tree, all clock rates and any model parameters will all be sampled from their prior distributions. Default is False.

1.5.3 languages section

The languages section may contain the following parameters:

• exclusions: One of:

    – A comma-separated list of language names or codes to exclude from the analysis, spelled exactly as they are in the data file(s).
    – The path to a file which contains one language per line.

  This can be used by itself to remove a few problematic languages from a data file, or in conjunction with families or macroareas to better control which languages are included, e.g. you may set macroareas = , but use exclusions to remove some outliers like Austronesian languages on or Indo-European languages in .

• families: One of:

    – A comma-separated list of language families to include in the analysis, spelled exactly as they are in Glottolog. E.g. Indo-European, Uralic, Dravidian.
    – The path to a file which contains one language family per line.

  If no value is assigned to this parameter, all languages present in the data file will be included (unless languages (see below) is used). families and languages cannot both be used in a single configuration.

• languages: One of:

    – A comma-separated list of language names or codes to include in the analysis, spelled exactly as they are in the data file(s).
    – The path to a file which contains one language per line.

  If no value is assigned to this parameter, all languages present in the data file will be included (unless families (see above) is used). languages and families cannot both be used in a single configuration.

• macroareas: One of:

    – A comma-separated list of Glottolog macroareas to include in the analysis.
    – The path to a file which contains one macroarea per line.

  Valid macroareas are: Africa, Australia, Eurasia, North America, Papunesia, South America. This can be used in conjunction with languages or families, in which case a language must meet both criteria to be included. E.g. if you set families = Afro-Asiatic and macroareas = Africa, you will get only the Afro-Asiatic languages located in Africa, and those located in Eurasia will be excluded.


• monophyly (or monophyletic): “True” or “False”. Controls whether or not to impose the family structure in Glottolog as monophyly constraints in the BEAST analysis. Default is False. If True, very fine-grained control over exactly how much constraint is imposed can be gained by using additional options, documented below.

• monophyly_levels: An integer specifying how many levels of the Glottolog classification to impose as monophyly constraints. By default, levels are added in a top-down fashion (but see monophyly_direction below). E.g. if monophyly_levels = 3 is specified, then Indo-European languages will be constrained to be monophyletic (one level), and so will Armenian, Celtic and Germanic, among others (two levels), and so will Gothic and Northwest Germanic, among others (three levels), but North Germanic and West Germanic, or any descendant groups, will not be. This allows one to enforce the high level structure of Glottolog, while leaving the “fine details” of relationships among leaves to be inferred from data. If no value is specified, the entire Glottolog classification will be imposed.

• monophyly_direction: One of top_down (the default) or bottom_up. Determines the effect of monophyly_levels. If monophyly_direction = top_down, constraints will be added from the roots of Glottolog trees downward (e.g. Indo-European, Germanic, North Germanic,...). If bottom_up, constraints will be added from the leaves upward (e.g. Macro-Swedish, East Scandinavian, North Germanic,...).

• monophyly_start_depth: An integer specifying an initial number of levels of the Glottolog classification to skip over when imposing constraints (default 0). E.g., with top down constraints, setting monophyly_start_depth=2 will skip over Indo-European and Germanic, so that if monophyly_levels=3, the imposed levels will be, e.g. Western Germanic, Franconian and High Franconian. With bottom up constraints, this controls skipping initial levels above the leaves.

• monophyly_end_depth: An integer specifying a level in the Glottolog classification below which constraints will not be imposed. If monophyly_end_depth is specified, then monophyly_direction and monophyly_levels are ignored. The imposed constraints will be those between monophyly_start_depth and monophyly_end_depth, interpreted in a top down fashion. This is a “low level” approach to controlling monophyly, and in general the “configurational sugar” of using monophyly_direction, monophyly_start_depth and monophyly_levels should be preferred.

• overlap: One of union or intersection. Controls how to deal with mismatches between the sets of languages in different input data sets.

    – If set to union (the default), languages missing in one data set will be added with missing datapoints (“?”) for all features.
    – If set to intersection, only languages present in all data sets will be used.

• starting_tree: Used to provide a starting tree. Can be a Newick format tree or the name of a file which contains a Newick format tree. If not specified, a random starting tree (compatible with monophyly constraints, if active) will be used.

• sample_branch_lengths: If True, the branch lengths of the starting tree will be sampled during the analysis to fit the data. If False, the starting branch lengths will be kept fixed. Use this in conjunction with starting_tree when you have a tree you trust and want to fit model parameters to it. Default is True.

• sample_topology: If True, the topology of the starting tree (i.e. the details of which leaves are connected to which and how) will be sampled during the analysis to fit the data. If False, the topology will be kept fixed. Use this in conjunction with starting_tree when you have a tree you trust and want to fit model parameters to it. Default is True.
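The difference between the two overlap settings described above can be illustrated with plain set operations. This Python sketch is not BEASTling’s implementation: the function name, the in-memory data layout and the example languages and features are all invented, and it assumes feature names do not clash between data sets.

```python
def combine_languages(datasets, overlap="union"):
    """Merge data sets given as dicts of language -> {feature: value}."""
    language_sets = [set(data) for data in datasets]
    if overlap == "union":
        selected = set().union(*language_sets)
    elif overlap == "intersection":
        selected = set.intersection(*language_sets) if language_sets else set()
    else:
        raise ValueError(f"unknown overlap setting: {overlap!r}")

    combined = {language: {} for language in sorted(selected)}
    for data in datasets:
        features = sorted({f for row in data.values() for f in row})
        for language in combined:
            row = data.get(language, {})
            for feature in features:
                # Languages absent from this data set get "?" for its features.
                combined[language][feature] = row.get(feature, "?")
    return combined


# Hypothetical data sets, invented for illustration.
cognates = {"swe": {"belly": "A"}, "dan": {"belly": "A"}, "fra": {"belly": "B"}}
structure = {"swe": {"word_order": "SVO"}, "fra": {"word_order": "SVO"}}

print(combine_languages([cognates, structure], "union"))
print(combine_languages([cognates, structure], "intersection"))
```

With union, “dan” is kept and receives “?” for the structural feature; with intersection, it is dropped because it is missing from the second data set.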

1.5.4 calibration section

The calibration section should contain one parameter for each distinct calibration point that you wish to include in the analysis.


The name of each parameter should be a comma-separated list of family names or Glottocodes. Optionally, the name can be enclosed in originate( ) to place the calibration not on the MRCA of the languages/families specified, but on the originate, i.e. the top of the branch leading to the MRCA.

The value for each calibration can be a string in one of several supported formats. The two simplest formats are to specify a range of ages, or a single upper or lower bounding age.

Ranges can be specified as follows:

    Austronesian = 4750 - 5800

You may use arbitrary units without problems, i.e. you could provide dates in millennia BP:

    Austronesian = 4.75 - 5.8

The only time this matters is when it comes time to interpret tree heights or clock and/or mutation rates.

With this kind of calibration, BEASTling will set a normal distribution prior on the age of the family indicated. The mean of the distribution will be equal to the midpoint of the provided range (5275 in the above case). The standard deviation will be set such that 95% of the probability mass will lie within the range provided. In other words, the range you provide is treated as a 95% credibility interval.

Bounds can be specified as follows:

    Austronesian = > 4750

or

    Austronesian = < 5800

With this kind of calibration, BEASTling will set a uniform distribution prior on the age of the family indicated. The upper or lower bound will be set to the provided age, and the other bound will be set to zero or infinity as appropriate.

If you require more control over your priors, you can explicitly provide the type of distribution (either normal, lognormal or uniform) and the parameters as follows:

    Austronesian = normal(5275, 535.71)     # First param is mean, second is standard deviation
    Austronesian = lognormal(8.57, 0.05)    # First param is mean, second is standard deviation, both are in logspace
    Austronesian = uniform(4.75, 5.80)      # First param is lower bound, second is upper bound

Finally, it is possible to specify an age range and ask for a lognormal distribution to be fitted to it, as follows:

    Austronesian = lognormal(4750 - 5800)

With this kind of calibration, BEASTling will set a lognormal distribution prior on the age of the family indicated. The mean of the distribution will be set so that the median of the lognormal distribution equals the midpoint of the range provided. The standard deviation will be set to the mean of two values: one with the property that the provided lower bound is at the 5th percentile of the lognormal distribution, and one with the property that the provided upper bound is at the 95th percentile. The provided interval does not quite end up being a 95% credible interval, but it is roughly so. Explicitly set the lognormal parameters as shown above if you need more control over the matching than this.
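The two range-fitting rules described in this section can be worked through numerically. The Python sketch below follows the prose description only: it is not BEASTling’s own code, the function names are invented, and the values BEASTling actually produces may differ in detail. It uses the standard normal quantiles for 97.5% (about 1.96) and 95% (about 1.645).

```python
import math
from statistics import NormalDist


def normal_from_range(lower, upper, mass=0.95):
    """Normal prior whose central `mass` interval is [lower, upper]."""
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)  # about 1.96 for 95%
    mean = (lower + upper) / 2.0
    sd = (upper - lower) / (2.0 * z)
    return mean, sd


def lognormal_from_range(lower, upper):
    """Lognormal prior (log-space mean, log-space sd) fitted to a range."""
    z = NormalDist().inv_cdf(0.95)  # about 1.645
    # The median of a lognormal is exp(mu), so put it at the midpoint.
    mu = math.log((lower + upper) / 2.0)
    sigma_low = (mu - math.log(lower)) / z    # lower bound at 5th percentile
    sigma_high = (math.log(upper) - mu) / z   # upper bound at 95th percentile
    return mu, (sigma_low + sigma_high) / 2.0


mean, sd = normal_from_range(4750, 5800)
mu, sigma = lognormal_from_range(4750, 5800)
print(f"normal({mean:.0f}, {sd:.2f})")
print(f"lognormal({mu:.2f}, {sigma:.3f})")
```

Because the lognormal is asymmetric, the two candidate standard deviations differ slightly, which is why the fitted interval is only approximately the one requested.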

1.5.5 model sections

A BEASTling config file must include at least one model section, but it can contain several. Model sections are different from almost all other sections in that you must give each one a name. A [model] section is invalid, but [model mymodel] will work.

Suppose you want to perform an analysis using both cognate data and structural data, and you want to use different model settings for the different kinds of data (say different substitution models). You could have a [model cognate] section and a [model structure] section. You can have as many models as you like, as long as each one gets a unique name.

Each model section must contain the following parameters, i.e. they are mandatory and BEASTling will refuse to work if you omit them:


• model: should specify the name of the substitution model type you want to use. Available models are:
  – “covarion” (Binary covarion model)
  – “bsvs” (Bayesian Stochastic Variable Selection)
  – “mk” (Lewis Mk model)
  For more information on the available models, see Substitution models.
• data: should be one of:
  – A path to a file containing your language data in a compatible .csv format
  – The string “stdin” if you wish for data to be read from stdin rather than a file.
  Note that if data is a relative path, this will be interpreted relative to the current working directory when beastling is run, not relative to the location of the configuration file. Regardless of whether data is read from a file or from stdin, it must be in one of the two compatible .csv formats. These are described in Data formats. Note that BEASTling can also be made to read data from stdin by using the --stdin command line argument.

Additionally, each model section may contain the following parameters, i.e. they are optional:

• binarised or binarized: “True” or “False”. This option is only relevant if the binary covarion model is being used (see Binary Covarion). If unspecified, BEASTling will try to guess whether the supplied data has already been binarised, and will automatically translate multistate features into multiple binary features if not. If BEASTling is guessing wrong, you can use this option to explicitly inform it whether or not your data has already been binarised.
• clock: Assigns the clock to use for this model. See clock sections below for details.
• file_format: Can be used to explicitly set which of the two supported .csv file formats the data for this model is supplied in, to be used if BEASTling is mistakenly trying to parse one format as the other (which should be very rare). Should be one of:
  – “beastling”
  – “cldf”
• language_column: Can be used to indicate the column name in the .csv file header which corresponds to the unique language identifier. If the column name is one of “iso”, “iso_code”, “glotto”, “glotto_code”, “language”, “language_id”, “lang” or “lang_id”, BEASTling will recognise it automatically. This parameter is only needed if you have a pre-existing data file which uses a different column name which you don’t want to change (perhaps because it would break compatibility with another tool).
• pruned: “True” or “False”. Make use of “pruned trees”. This can improve performance in data sets with a lot of missing data. Default is False.
• rate_variation: “True” or “False”. Estimate a separate substitution rate for each feature (using a Gamma prior).
• remove_constant_features: “True” or “False”. By default, this is set to “True”, which means that if your data set contains any features which have the same value for all of the languages in your analysis (which is not necessarily all of the languages in your data file, if you are using the “families” parameter in your “languages” section!), BEASTling will automatically remove that feature from the analysis (since it cannot possibly provide any phylogenetic information). If you want to keep these constant features in for some reason, you must explicitly set this parameter to False.
• minimum_data: Indicates the minimum percentage of languages that a feature should have data present for to be included in an analysis. E.g., if set to 50, any feature in the dataset which has more question marks than actual values for the selected languages will be excluded.
• features: Is used to select a subset of the features in the given data file. Should be one of:


  – A comma-separated list of feature names (as they are given in the data CSV’s header line)
  – A path to a file which contains one feature name per line
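The naming rule and the two mandatory parameters can be checked with a short script. This is a minimal sketch, assuming the config file uses the INI-style syntax that Python's configparser module can read; the section names, file names and the helper read_model_sections are hypothetical and not part of BEASTling.

```python
import configparser

# Hypothetical config text with two named model sections.
CONFIG_TEXT = """
[model cognate]
model = covarion
data = cognates.csv

[model structure]
model = bsvs
data = structure.csv
rate_variation = True
"""

# The two parameters described above as mandatory.
MANDATORY = ("model", "data")


def read_model_sections(text):
    """Return {model name: options dict} for every [model NAME] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    models = {}
    for section in parser.sections():
        parts = section.split(None, 1)
        if parts[0] != "model":
            continue
        # A bare [model] section has no name and is rejected.
        if len(parts) != 2:
            raise ValueError("model sections must be named, e.g. [model mymodel]")
        options = dict(parser[section])
        missing = [key for key in MANDATORY if key not in options]
        if missing:
            raise ValueError("[%s] is missing: %s" % (section, ", ".join(missing)))
        models[parts[1]] = options
    return models


models = read_model_sections(CONFIG_TEXT)
print(sorted(models))
```

All option values come back as strings, so “True” and “False” would still need interpreting by whatever consumes them.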

1.5.6 clock sections

clock sections are quite similar to model sections, in that they must be given names, e.g. [clock myclock]. A BEASTling config file may include any number of clock sections, including zero, but it makes no practical sense to define more clock sections than you have model sections. clock sections are used to define clock models, which determine how tree branch lengths are transformed into a measure of evolutionary time. Each model in your analysis has an associated clock model. You can share one clock across all your models, or give each model its own clock, or assign clocks in any other way you like.

If no clock section is defined, all models will be associated with a default clock (of type “strict”). Alternatively:

• You may define your own [clock default] section. Because the name is default, this clock will be associated with all model sections, unless those sections have a different clock specifically assigned.
• You may explicitly assign a clock to a model by setting the model section’s clock option equal to the name of a clock section.
• If a model section and a clock section have the same name, then they are automatically associated with each other (unless the model section explicitly assigns a different clock).

Each clock section must contain the following parameters, i.e. they are mandatory and BEASTling will refuse to work if you omit them:

• type: should specify the type of clock model you want to use. Available models are:
  – “strict” (Strict clock)
  – “relaxed” (Uncorrelated relaxed clock)
  – “random” (Random local clock)
  For more information on the available models, see Clock models.
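The order in which a clock is chosen for a model can be summarised in a few lines of Python. This is only a sketch of the rules as described above, not BEASTling's own code; the function name resolve_clock and its arguments are hypothetical.

```python
def resolve_clock(model_name, model_options, clock_names):
    """Pick the name of the clock section to use for one model.

    model_options is the dict of options from the model section and
    clock_names is the set of names of all defined clock sections.
    """
    # 1. An explicit clock option in the model section always wins.
    explicit = model_options.get("clock")
    if explicit is not None:
        return explicit
    # 2. Otherwise, a clock section sharing the model's name is used.
    if model_name in clock_names:
        return model_name
    # 3. Otherwise, fall back to the default clock: a user-defined
    #    [clock default] section if present, else the built-in strict clock.
    return "default"


clocks = {"cognate", "slow"}
print(resolve_clock("cognate", {}, clocks))
print(resolve_clock("cognate", {"clock": "slow"}, clocks))
print(resolve_clock("structure", {}, clocks))
```

The three calls exercise the three rules in turn: a name match, an explicit assignment that overrides the name match, and the fallback to the default clock.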

1.6 Data formats

BEASTling relies on data being provided in CSV files. Two particular CSV formats are supported.

1.6.1 BEASTling format

In this format, each line of the CSV file contains all of the data for a single language. The first line of the file must be a header, giving the column names for the rest of the file. The column which contains each language’s unique identifier should be named one of:

• iso
• iso_code
• glotto
• glottocode
• language
• language_id
• lang
• lang_id

A column with one of these names will be automatically recognised as containing language identifiers. If you absolutely have to use a different column name, use the language_column parameter in your configuration file’s [model] section to tell BEASTling the name.

Languages can be identified by arbitrary strings, provided each language has a unique identifier. However, certain features of BEASTling will not function unless your language identifiers are either:

• three character ISO 639 codes
• Glottocodes as assigned by the Glottolog project

All columns other than the language identifier column correspond to independent language features. The names and values of features can both be arbitrary strings, so long as each feature has a unique name. Question marks (“?”) can be used to indicate missing data.

Example valid BEASTling format data files are shown below.

Using ISO codes and numeric data:

    iso,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9
    aiw,1,1,1,1,1,1,?,1,?,1
    aas,2,2,2,1,2,2,?,?,1,3
    kbt,3,3,1,1,2,3,?,2,?,5
    abg,4,2,2,1,1,4,?,?,3,4
    abf,5,1,1,1,2,5,?,3,?,2

Using Glottocodes and alphabetical data:

    glotto,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9
    aari1239,A,A,A,A,A,A,?,A,?,A
    aasa1238,B,B,B,A,B,B,?,?,A,C
    abad1241,C,C,A,A,B,C,?,B,?,E
    abag1245,D,B,B,A,A,D,?,?,C,D
    abai1240,E,A,A,A,B,E,?,C,?,B
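To make the layout concrete, the following sketch reads a file in this format with Python's standard csv module. It illustrates the format only and is not how BEASTling itself parses data; the function read_wide is hypothetical, and the list of recognised column names is copied from the list above.

```python
import csv
import io

# Column names recognised as language identifiers (from the list above).
LANGUAGE_COLUMNS = ("iso", "iso_code", "glotto", "glottocode",
                    "language", "language_id", "lang", "lang_id")

EXAMPLE = """iso,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9
aiw,1,1,1,1,1,1,?,1,?,1
aas,2,2,2,1,2,2,?,?,1,3
kbt,3,3,1,1,2,3,?,2,?,5
abg,4,2,2,1,1,4,?,?,3,4
abf,5,1,1,1,2,5,?,3,?,2
"""


def read_wide(handle, language_column=None):
    """Return {language: {feature: value}} from a one-row-per-language CSV."""
    reader = csv.DictReader(handle)
    if language_column is None:
        # Look for the first header column with a recognised name.
        candidates = [c for c in reader.fieldnames if c in LANGUAGE_COLUMNS]
        if not candidates:
            raise ValueError("no language identifier column found")
        language_column = candidates[0]
    data = {}
    for row in reader:
        language = row.pop(language_column)
        if language in data:
            raise ValueError("duplicate language identifier: %s" % language)
        # Every remaining column is a feature; "?" marks missing data.
        data[language] = row
    return data


data = read_wide(io.StringIO(EXAMPLE))
print(data["kbt"]["f9"])
```

Passing language_column explicitly plays the same role as the language_column parameter described earlier: it names the identifier column when the header uses a non-standard name.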

1.6.2 CLDF format

BEASTling also supports the Cross-Linguistic Data Format standard. In this format, each line of the CSV file contains a single data point for a single language. The first line of the file must be a header, giving the column names for the rest of the file. The three column names must be Language_ID, Feature_ID or Parameter_ID, and Value (these column names are how BEASTling recognises a file as a CLDF file, so if you change them the file will be parsed as a BEASTling format file). As before, Language_IDs can be arbitrary strings, but must be ISO codes or Glottocodes if you want to use all features of BEASTling. Feature_IDs and Values can be arbitrary strings, and ? can be used to indicate missing data.

An example valid CLDF format data file is shown below. It specifies precisely the same data set as the first example BEASTling format data file above.

    Language_ID, Feature_ID, Value
    aiw, f0, 1
    aiw, f1, 1
    aiw, f2, 1
    aiw, f3, 1
    aiw, f4, 1
    aiw, f5, 1
    aiw, f6, ?
    aiw, f7, 1
    aiw, f8, ?
    aiw, f9, 1
    aas, f0, 2
    aas, f1, 2
    aas, f2, 2
    aas, f3, 1
    aas, f4, 2
    aas, f5, 2
    aas, f6, ?
    aas, f7, ?
    aas, f8, 1
    aas, f9, 3
    kbt, f0, 3
    kbt, f1, 3
    kbt, f2, 1
    kbt, f3, 1
    kbt, f4, 2
    kbt, f5, 3
    kbt, f6, ?
    kbt, f7, 2
    kbt, f8, ?
    kbt, f9, 5
    abg, f0, 4
    abg, f1, 2
    abg, f2, 2
    abg, f3, 1
    abg, f4, 1
    abg, f5, 4
    abg, f6, ?
    abg, f7, ?
    abg, f8, 3
    abg, f9, 4
    abf, f0, 5
    abf, f1, 1
    abf, f2, 1
    abf, f3, 1
    abf, f4, 2
    abf, f5, 5
    abf, f6, ?
    abf, f7, 3
    abf, f8, ?
    abf, f9, 2
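The relationship between the two formats is a simple pivot from one row per data point to one record per language. The sketch below shows this with Python's csv module on a shortened version of the example; the function read_long is hypothetical and this is not BEASTling's parser. It assumes, as in the example, that fields may be followed by a space after each comma.

```python
import csv
import io

# A shortened version of the CLDF example above.
CLDF_EXAMPLE = """Language_ID, Feature_ID, Value
aiw, f0, 1
aiw, f6, ?
aas, f0, 2
aas, f9, 3
"""


def read_long(handle):
    """Return {language: {feature: value}} from a one-row-per-datapoint CSV."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.DictReader(handle, skipinitialspace=True)
    # The feature column may be called Feature_ID or Parameter_ID.
    if "Feature_ID" in reader.fieldnames:
        feature_column = "Feature_ID"
    else:
        feature_column = "Parameter_ID"
    data = {}
    for row in reader:
        features = data.setdefault(row["Language_ID"], {})
        features[row[feature_column]] = row["Value"]
    return data


wide = read_long(io.StringIO(CLDF_EXAMPLE))
print(wide["aas"])
```

The result has the same nested shape as a file in BEASTling format would give, which is why the two example files can describe the same data set.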

1.7 Modelling details

It is important to understand the modelling assumptions made by BEASTling which underlie all analyses generated by it. Substitution models in particular are discussed separately (see Substitution models).

1.7.1 Tree prior

BEASTling uses a Yule pure-birth prior for phylogenetic trees. The birthrate is the same everywhere on the tree, and is estimated during the analysis. The prior over birthrates is a uniform prior from zero to infinity. This model is not particularly suitable for linguistic analyses; however, it is currently the best BEAST has to offer.


1.7.2 Branch lengths

If no calibration points have been provided in the configuration, then the trees logged from a BEASTling analysis will have units of “expected number of mutations for a feature with rate 1.0”. If calibration points have been provided, then branch lengths will be in the same units as the calibrations.

1.7.3 Clocks

BEASTling supports three different clock models. These are: strict clocks, where the mutation rate of any particular feature is the same at all points across the tree; relaxed clocks, where the mutation rate of every branch on the tree is sampled individually; and random local clocks, which can be thought of as an interpolation between these two extremes.

1.8 Clock models

The branch lengths of trees in BEAST need to be converted to some measure of evolutionary time in order to compute transition probabilities. For example, if you have provided calibration dates, then the branch lengths of your tree are in the same units as your calibration data (typically years or kiloyears), but they need to be in units of expected substitutions in order to assess how well the tree fits the data. This conversion is performed by a clock model. Clock models may be very simple, such as specifying a single, unchanging expected number of substitutions per unit of branch length (e.g. substitutions per year) which is valid all over the tree, or more complex, with each branch on the tree having a different conversion rate, corresponding to changes in the rate of evolution over time and/or space. When configuring a BEASTling analysis, each substitution model you configure in a model section must be associated with a clock model (via a clock section), and there are several clock models to choose from. The following clock models are currently supported:

1.8.1 Strict

(set type=strict in config file) A strict clock is the simplest clock model available in BEASTling. It is basically a single value which represents a conversion rate between branch lengths and evolutionary time. This same value is valid over all branches on the tree. Strict clocks are simple and result in fast-running analyses, but they represent an assumption about language change which most linguists do not believe is plausible for most situations, i.e. that the rate at which a particular feature changes is fixed at all points in time and all subfamilies in a tree.

1.8.2 Uncorrelated Relaxed Clock

(set type=relaxed in config file) Uncorrelated relaxed clocks allow each branch of a phylogenetic tree to have its own clock rate. This is in contrast to a strict clock where one rate is applied all over the tree. These are called “uncorrelated” because the rate at one branch does not depend upon the rate at the branch immediately above it. This means that the rate of evolutionary change can potentially change abruptly, i.e. going from fast to slow or slow to fast at a single point, rather than needing to “accelerate smoothly” over multiple branching events. The different rates are sampled from a probability distribution, whose parameters are also sampled by the MCMC chain. The supported distributions are Lognormal (add distribution=lognormal to the config file’s clock section), Exponential (add distribution=exponential to the clock section) and Gamma (distribution=gamma).


The relaxed clock implementation in BEAST works by assigning each branch one rate from a fixed number of discrete rates. The number of discrete rates can be set using the rates option in the config file. For example, if rates were set to 11, the provided distribution would be sampled at the 0.0, 0.1, 0.2, ..., and 1.0 quantiles, and each branch would be assigned one of these 11 rates. Lower numbers of rates result in better Markov chain mixing, but give a less accurate representation of the underlying distribution, and may skew estimates of the clock rate’s mean or standard deviation. If no rate count is explicitly set, the number of discrete rates will be set equal to the number of branches in the tree.
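The discretisation idea can be illustrated with a lognormal distribution. The sketch below is not BEAST's code: it samples at the midpoint quantiles (i + 0.5) / n rather than at evenly spaced points from 0.0 to 1.0, because the two endpoints of a lognormal are zero and infinity. That choice, the function name and the parameter values are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def discrete_lognormal_rates(n_rates, mu=0.0, sigma=0.5):
    """Return n_rates quantiles of a lognormal(mu, sigma) distribution.

    Assumption of this sketch: quantiles are taken at (i + 0.5) / n_rates,
    which avoids the degenerate 0.0 and 1.0 quantiles.
    """
    standard_normal = NormalDist()
    rates = []
    for i in range(n_rates):
        p = (i + 0.5) / n_rates
        # A lognormal quantile is exp() of the matching normal quantile.
        rates.append(math.exp(mu + sigma * standard_normal.inv_cdf(p)))
    return rates


rates = discrete_lognormal_rates(11)
for rate in rates:
    print(round(rate, 3))
```

With fewer categories, the gaps between neighbouring rates grow, which is the loss of accuracy mentioned above; with more categories, the chain has more possible assignments to explore for every branch.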

1.8.3 Random Local Clock

(set type=random in config file) Random local clocks permit an amount of variation in clock rate across a tree which is more than the strict clock (which has no variation) but less than the relaxed clock (which has a different rate for each branch). They work by permitting the clock rate to change a limited number of times at certain locations on the tree. The number of changes may be zero (in which case the resulting clock is a strict clock), or it may be equal to the number of branches (in which case the resulting clock is a relaxed clock), or it may be somewhere in between. The MCMC chain samples over both the number of changes and their locations on the tree. A Poisson prior is placed on the number of changes. The various rates are sampled from a Gamma distribution. The random local clock can be configured in uncorrelated mode (correlated=false, the default), where each rate is sampled independently from the Gamma distribution, and in correlated mode (correlated=true), where what are sampled from the Gamma distribution are multipliers, with each new rate being a scaling of the rate before the change point.
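The difference between the two modes is easiest to see along a single root-to-tip lineage. The following sketch takes the change points and the sampled values as given (in a real analysis both are sampled by the MCMC chain) and computes the resulting branch rates. The function name, the root_rate parameter and the restriction to one lineage are simplifications made for this illustration.

```python
def lineage_rates(changes, draws, correlated=False, root_rate=1.0):
    """Rates along one root-to-tip lineage under a random local clock.

    changes[i] is True if the rate changes on branch i; draws[i] is the
    value sampled for that branch (ignored where there is no change).
    """
    rates = []
    current = root_rate
    for changed, draw in zip(changes, draws):
        if changed:
            if correlated:
                # Correlated mode: the draw rescales the previous rate.
                current = current * draw
            else:
                # Uncorrelated mode: the draw is the new rate itself.
                current = draw
        rates.append(current)
    return rates


changes = [False, True, False, True]
draws = [0.0, 2.0, 0.0, 3.0]
print(lineage_rates(changes, draws))
print(lineage_rates(changes, draws, correlated=True))
```

With no change points at all, every branch keeps the root rate, which is the strict clock special case described above.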

1.9 Substitution models

BEAST models discrete data as evolving on a phylogenetic tree according to a Continuous Time Markov Chain. When configuring a BEASTling analysis, for each model you must choose a substitution model. Different substitution models are more or less appropriate for different kinds of data. The following substitution models are currently supported:

1.9.1 Mk

(set model=mk in config file) The Lewis Mk model is the simplest generic substitution model available in BEASTling. It is a generalisation of the classic JC69 model from genetics to a statespace of arbitrary size. Transitions are possible from any state to any other state, and every transition is equally probable. No parameters are estimated, increasing analysis speed. This model could be used with any dataset, but the assumptions are not a good match for cognate data.
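Because every transition is equally probable, the transition probabilities of the Mk model have a simple closed form. The sketch below evaluates it; it assumes the rate matrix is scaled so that one unit of branch length corresponds to one expected change, and the function name is hypothetical rather than anything from BEAST or BEASTling.

```python
import math


def mk_transition_probability(n_states, t, same_state):
    """Probability of ending in a given state after branch length t.

    same_state selects between staying in the starting state and moving
    to one particular other state. Assumes the rate matrix is scaled to
    one expected change per unit of t.
    """
    decay = math.exp(-n_states * t / (n_states - 1))
    if same_state:
        return 1.0 / n_states + (n_states - 1.0) / n_states * decay
    return 1.0 / n_states - decay / n_states


stay = mk_transition_probability(4, 0.3, same_state=True)
move = mk_transition_probability(4, 0.3, same_state=False)
# One way to stay plus three equally likely ways to move.
print(stay + 3 * move)
```

On very long branches both probabilities approach 1 / n_states, so the starting state is eventually forgotten.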

1.9.2 Binary Covarion

(set model=covarion in config file) The binary Covarion model is defined for binary datasets, i.e. sets where every datapoint is either a 0 or a 1. This model introduces a latent “fast” or “slow” state, which controls the rate of transitions between 0 and 1 (transitions in either direction are always equally probable). This model is typically used for cognate data.

When the binary Covarion model is used, if the specified datafile contains multistate data, BEASTling will automatically translate this into the appropriate number of binary features. This approach means that you can have a single data file which can be used to generate binary and multistate analyses, and also lets BEASTling share mutation rates across binary features corresponding to a single multistate feature. This is the recommended way to use the binary Covarion model.

However, if you have pre-binarised data you wish to use, BEASTling should recognise this (by seeing that all features in the dataset have only two possible values) and will avoid binarising it a second time. Note that when used this way, BEASTling will assign separate mutation rates to each binary feature, as it has no way to know which groups of binary features originally corresponded to a single multistate feature. If BEASTling decides to treat your data as pre-binarised, a notification will be emitted in --verbose mode. You can use the binarised (or binarized) option in your config’s [model] section to tell BEASTling explicitly whether or not your data has been binarised, and this will disable the automatic detection.
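The translation of a multistate feature into binary features can be pictured as follows. This is a sketch of the general idea only: it creates one binary feature per observed state and keeps missing data missing. The naming scheme for the new features and the function binarise are assumptions, and BEASTling's actual procedure may differ in its details.

```python
def binarise(feature_name, values):
    """Turn one multistate feature into one binary feature per state.

    values maps language -> state; "?" marks missing data and stays "?".
    """
    states = sorted({v for v in values.values() if v != "?"})
    binary = {}
    for state in states:
        # Hypothetical naming scheme: original name plus the state.
        name = "%s_%s" % (feature_name, state)
        binary[name] = {
            language: "?" if value == "?" else ("1" if value == state else "0")
            for language, value in values.items()
        }
    return binary


# Feature f7 from the first example data file in Data formats.
f7 = {"aiw": "1", "aas": "?", "kbt": "2", "abg": "?", "abf": "3"}
for name, column in binarise("f7", f7).items():
    print(name, column)
```

The three binary features produced from f7 belong together, which is what allows a mutation rate to be shared across them when the binarisation is done automatically.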

1.9.3 BSVS

(set model=bsvs in config file) The Bayesian Stochastic Variable Selection (BSVS) model is a rich model suitable for structural data. Compared to the Lewis Mk model, it permits non-equal transition probabilities between different states, and also tries to set a number of probabilities to zero, i.e. transitions from some states to others will be disallowed. This model is suitable for attempting to uncover preferential directions of change in the evolution of particular linguistic features. Note that this model is very parameter intensive and analyses will be much slower than Mk analyses for the same data.

A bsvs model accepts two additional parameters, symmetric and svsprior. They change the behaviour as follows.

A symmetric model (symmetric=True, which is the default value) assumes that transition rates between states are symmetric, i.e. for two states A and B, transitions from A to B occur at the same rate as transitions from B to A. An asymmetric model (symmetric=False) has double the number of parameters, because the rates A→B and B→A are estimated separately.

The svsprior property specifies the shape of the prior distribution which is placed over the number of non-zero rates. Possible choices are poisson and exponential, with poisson being the default. The size of the statespace for a particular feature determines a maximum possible number of non-zero rates (the entire matrix), and also a minimum possible number (to ensure that the Markov chain is ergodic). The non-zero rate prior is defined over this range, so both the Poisson and exponential priors have an offset, rather than beginning their support at zero.

The default Poisson prior is the more conservative choice. BEASTling will set the mean of the Poisson distribution equal to the midpoint between the minimum and maximum possible number of non-zero rates. In this way, the model has no strong preference for sparse matrices over dense matrices or vice versa (as the mean of the Poisson is usually approximately equal to the median), while still encouraging the setting of rates to zero if the data supports it.

The exponential prior, on the other hand, is biased toward sparse transition matrices. BEASTling will set the mean of the exponential distribution such that 99% of the probability density lies between the minimum and maximum possible number of non-zero rates. In this way, matrices with the majority of rates set to zero are much more probable a priori than matrices with the majority of rates being non-zero. This prior is the better choice when you want to fit a model that permits the minimum number of transitions required to explain the data.
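The parameter counts involved can be worked out with a little arithmetic. The sketch below counts the off-diagonal rates for a symmetric and an asymmetric model, and computes the midpoint used as the Poisson mean. The function names are hypothetical, and the minimum number of non-zero rates is passed in by the caller because the text above does not give a formula for it; the value 3 used in the example is only an illustrative choice.

```python
def max_nonzero_rates(n_states, symmetric=True):
    """Number of off-diagonal rates in the full transition matrix."""
    ordered_pairs = n_states * (n_states - 1)
    # A symmetric model estimates one rate per unordered pair of states.
    return ordered_pairs // 2 if symmetric else ordered_pairs


def poisson_prior_mean(min_rates, max_rates):
    """Midpoint between the minimum and maximum number of non-zero rates."""
    return (min_rates + max_rates) / 2.0


print(max_nonzero_rates(4, symmetric=True))
print(max_nonzero_rates(4, symmetric=False))
# Illustrative minimum of 3 non-zero rates for a 4-state symmetric model.
print(poisson_prior_mean(3, max_nonzero_rates(4, symmetric=True)))
```

For a four-state feature, the asymmetric model has 12 rates against 6 for the symmetric one, which is the doubling mentioned above.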

1.10 Scripting BEASTling

It is possible, though currently a little awkward, to use BEASTling as a Python library so that you can generate XML files from scripts, without creating a config file first. This is especially useful for generating large numbers of XML files where only one or two options differ across the files.


1.10.1 Example

As an illustrative example, suppose we have a directory my_data with several different CSV files in it, corresponding to different datasets, and we have a BEASTling configuration file my_config.conf which contains the details of a BEASTling analysis, and we want to generate one BEAST XML file for each data file. All analyses should use the same settings (e.g. substitution models, calibration points, etc.), but the data should be different for each analysis. We can generate these XML files easily, even for 1,000 different data files (suppose you are doing a simulation study and have generated 1,000 synthetic data sets), using the following script:

    import os
    from glob import glob

    # Import relevant parts of BEASTling
    from beastling.configuration import Configuration
    from beastling.beastxml import BeastXml

    # For several different data files...
    for data_filename in glob("my_data/*.csv"):
        # Build a Configuration object and point its first model at this file
        config = Configuration(configfile="my_config.conf")
        config.model_configs[0]["data"] = data_filename

        # Create a BeastXml object from the Configuration object
        beastxml = BeastXml(config)

        # Save the XML next to the data, swapping the .csv extension for .xml
        xml_filename = os.path.splitext(data_filename)[0] + ".xml"
        beastxml.write_file(xml_filename)

The essential process for creating a file from within a script is to first create a Configuration object, and then feed this to the constructor of a BeastXml object. Once instantiated, the BeastXml object’s write_file method can be used to save the generated XML to the filesystem.

1.10.2 Creating Configurations from scratch

In the above example, a Configuration object was created from a BEASTling config file, using the configfile argument to the Configuration constructor. We then overrode one aspect of that configuration before creating an XML file. It is also possible to create a Configuration object from scratch, without any corresponding configuration file:

    from beastling.configuration import Configuration
    config = Configuration()

Such a Configuration object will be created with all options set to their default values. The one essential step before feeding this object to a BeastXml is to populate the model_configs attribute, which by default is an empty list. model_configs should end up as a list of Python dictionaries. The keys and values of these dictionaries should mirror the structure of a [model] section in a BEASTling configuration file. At the bare minimum, you must set the name, model and data keys to appropriate values:

    model_config = {}
    model_config["name"] = "my_model"
    model_config["model"] = "mk"  # or "covarion", "bsvs", etc.
    model_config["data"] = "my_data.csv"
    config.model_configs.append(model_config)

If you want to include non-default clock models, you should similarly populate the clock_configs attribute, which by default is an empty list and should end up full of dictionaries which mirror the structure of a [clock] section.
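As a rough illustration of what the two lists might look like together, the sketch below builds them as plain Python data without importing BEASTling. The clock dictionary's keys are an assumption: they mirror the options of a [clock] section described earlier, with a name key added by analogy with model dictionaries. The helper build_configs and all the values are hypothetical.

```python
def build_configs(data_filename):
    """Return (model_configs, clock_configs) as plain lists of dicts."""
    model_configs = [{
        "name": "my_model",
        "model": "mk",
        "data": data_filename,
        # Explicitly assign the clock defined below to this model.
        "clock": "my_clock",
    }]
    clock_configs = [{
        # Assumed keys, mirroring a [clock my_clock] section.
        "name": "my_clock",
        "type": "relaxed",
        "distribution": "lognormal",
    }]
    return model_configs, clock_configs


model_configs, clock_configs = build_configs("my_data.csv")
print(model_configs[0]["clock"], clock_configs[0]["type"])

# With BEASTling installed, these lists would then be attached to a
# Configuration object, for example:
#     config = Configuration()
#     config.model_configs = model_configs
#     config.clock_configs = clock_configs
```

Keeping the dictionaries separate from the Configuration object makes it easy to generate many variants in a loop, for instance one per data file or per clock type.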


Other details of the configuration can be specified by overwriting the following instance attributes:
