Supplementary information, Section S1 Computational methods for calculating genealogically-sensitive averages and proportions

Contents

1 Introduction 1

2 A short orientation to R 1 2.1 R objects ...... 2 2.2 R packages ...... 2 2.3 Trees in R ...... 3

3 Genealogically-sensitive averages and proportions 6

4 Using and adapting trees from glottolog.com 8 4.1 Glottolog’s genealogical data ...... 8 4.2 How to combine trees ...... 11 4.3 How to remove and duplicate languages ...... 13 4.4 How to add branch lengths ...... 22 4.5 Exporting trees for use with other software ...... 25

5 Putting it together: A worked example 26 5.1 Preparing a tree ...... 26 5.2 Preparing the dataframe of typological data ...... 27 5.3 Calculating genealogically-sensitive proportions ...... 29

6 Using these methods in typological research 30

1 Introduction

This document provides a guide to calculating genealogically-sensitive proportions and averages. Because a key part of the analysis is the preparation of a phylogenetic tree, considerable space is given over to how such trees can be prepared. Much of the discussion here is an introduction to the functionality of an R package, PhyloWeights (Round 2021) which has been specifically written for this task. Since typologists may have little familiarity with R, we begin with a short introduction to it in Section 2. We then discuss the calculation of genealogically-sensitive proportions and averages in Section 3. Section 4 explains how typologists can prepare phylogenetic trees by adapting resources freely available from glottolog.com (Hammarström et al. 2020). Section 5 provides a worked example used in a typological investigation of sonority sequencing by Yin (2020).

2 A short orientation to R

The code presented here runs in R. Here we quickly introduce R objects (such as variables and dataframes) in Section 2.1, R packages in Section 2.2 and how R works with tree objects in Section 2.3. Readers already familiar with the basics of R may wish to skip directly to Section 2.3.

1 Below, chunks of R code appear against a grey background. The results that R produces from each chunk is shown below it, either as lines text preceded by ## or as a plot that R generates, or both. The code within the grey boxes can be copied and pasted into R. Within the code chunks, where a single R command runs longer than one printed line, we have indented the lines after the first; to run these in R, you will need to copy and paste the whole command.

2.1 R objects In R, values can be assigned to objects (such as variables) using the assignment operator, <-: x <- 7.5 y <-2 / 9 z <- sqrt(2) a <- "hello"

The content of a simple object can be displayed just by entering its name. The [1] at the start of the output is telling you this is item number 1 in the object: z

## [1] 1.414214 A common object in R is a one-dimensional vector, such as a sequence of numbers or characters strings. Here is a vector of numbers. The c() here is a function, while the numbers 3, 4, 5 and 6 are its arguments. The function takes these four individual arguments and returns a single vector. c(3,5,4,6)

## [1] 3 5 4 6 A dataframe is the object in R that most resembles an Excel spreadsheet. It has columns that are named and rows that may or may not be named. Here we see the creation of a dataframe using the function data.frame(). In this instance, two arguments of the function are a vector of strings, my_letters, and a vector of numbers my_numbers. These vectors need to be of equal length, as they become the columns of the resulting dataframe. Here we have been assigned that result to the object named my_dataframe: my_dataframe <- data.frame(my_letters = c("x", "y", "z"), my_numbers = c(1,2,3))

The contents of my_dataframe are displayed like this: my_dataframe

## my_letters my_numbers ## 1 x 1 ## 2 y 2 ## 3 z 3 In a dataframe, the contents of any one column will all be of the same class, e.g. all character strings, or all numbers, but as in my_dataframe, different columns can contain items of different classes.

2.2 R packages R provides a range of basic statistical functions, but it is most powerful when extended by the addition of packages which contain additional functions. Here we will use the packages ape “Analyses of Phylogenetics and Evolution” (Paradis & Schliep 2018) to work with trees, dplyr (Wickham et al. 2021) to manipulate dataframes, and PhyloWeights to prepare linguistic phylogenies and perform the analysis of genealogically-sensitive averages and proportions. Packages need to be installed, i.e., downloaded and unpacked, just once. Later, they are ‘loaded’ during each work session as needed. To install the packages we will use, run these commands. You will only ever need to do this once.

2 install.packages("dplyr") install.packages("ape") install.packages("devtools") library(devtools) install_github("erichround/PhyloWeights")

To use the installed packages, load them as follows: library(dplyr) library(ape) library(PhyloWeights)

2.3 Trees in R Here we discuss how trees are created, manipulated and plotted in R. Terminology we will use includes: tips at the ends of trees, which in a linguistic tree would usually be the languages or lects; the branches of a tree; the interior nodes or just nodes of a tree, where branches join together; and the root of the tree, its deepest node. R will represent trees as complex objects, in which the tips, nodes and branches all appear, along with labels for the tips and nodes. One of the simplest methods of constructing a tree in R begins with a description of the tree using the Newick standard. In its simplest form, a tree is represented in Newick format by a set of tip labels grouped by parentheses, separated by commas, and ending with a semicolon. For example, here is a string that represents a tree with four tips, A, B, C and D, which we assign to the object my_newick: my_newick <- "(((A,B),C),D);"

A Newick-formatted string can then be converted to a tree object by supplying it as the text argument of the function read.tree(), from the ape package: my_tree <- read.tree(text = my_newick)

We can plot the tree using the plot() function: plot(my_tree)

D

C

B

A Running in the opposite direction, if we have a tree object in R, we can convert it to a Newick-formatted string using write.tree(). This can be useful if trees have been obtained from elsewhere, and we want to access their Neckwick representation in order to alter it. Here we assign the Newick representation of my_tree to an object my_tree_newick, and then have R print it: my_tree_newick <- write.tree(my_tree) my_tree_newick

## [1] "(((A,B),C),D);"

3 By default, trees are plotted horizontally following the convention in biology. These settings can be altered using additional options in the plot() function, though we will not discuss them further here: plot(my_tree, direction = "downwards", srt = 90, label.offset = 0.2, adj = 0.5, x.lim = c(0,5), y.lim = c(0,10), no.margin =T)

A B C D

The tree object my_tree which we defined above did not include information about branch lengths. In Newick format, branch lengths are written with a preceding colon and appear directly after a language or the closing bracket for a subgroup: my_newick2 <- "(((A:4,B:4):1,C:5):3,D:8);"

Converting this to a tree object and plotting it: my_tree2 <- read.tree(text = my_newick2) plot(my_tree2, direction = "downwards", srt = 90, label.offset = 0.5, adj = 0.5, x.lim = c(0,5), y.lim = c(0,20), no.margin =T)

A B C D To create plots that more closely follow the conventions of linguistic diagrams, trees can be given a ‘root’ branch by including an additional set of outer brackets: my_newick3 <- "((((A:4,B:4):1,C:5):3,D:8):1);" my_tree3 <- read.tree(text = my_newick3) plot(my_tree3, direction = "downwards", srt = 90, label.offset = 0.5, adj = 0.5, x.lim = c(0,5),

4 y.lim = c(0,20), no.margin =T)

A B C D Tree can have labels not only for their tips but also for their internal nodes. In linguistics, an internal node of a tree may be interpreted taxonomically, as representing a subgroup and labeled accordingly, or genealogically, as a proto-language from which the subgroup descends and labeled accordingly. In Newick format, labels for internal nodes are placed directly after a closing parenthesis. For example, here we add labels that reflect a genealogical interpretation of the nodes: my_newick4 <- "((((A:4,B:4)proto-AB:1,C:5)proto-ABC:3,D:8)proto-ABCD:1);"

To make the internal nodes visible in the plot of a tree, we can plot the tree as usual and then use the function nodelabels(), stating in the text argument that the labels are to be taken from the tree we just plotted (here, as my_tree4$node.label — an expression which will be explained shortly below): my_tree4 <- read.tree(text = my_newick4) plot(my_tree4, direction = "downwards", srt = 90, label.offset = 0.5, adj = 0.5, x.lim = c(0,5), y.lim = c(0,20), no.margin =T) nodelabels(text = my_tree4$node.label)

proto−ABCD

proto−ABC proto−AB

A B C D Technically speaking, trees are represented by R as objects with a customised class called “phylo”. class(my_tree4)

## [1] "phylo" A phylo object stores information about the tree topology (i.e., its branching structure), branch lengths, and the labels of the tips and nodes. In R we often use the $ operator to access one object that is contained inside another. For instance, the object y contained within the larger object x would be referred to as x$y. We saw this usage just above in my_tree4$node.label. Here are some others:

5 my_tree3$edge.length

## [1] 1 3 1 4 4 5 8 my_tree3$tip.label

## [1] "A" "B" "C" "D" A related class is the “multiPhylo” class. These are used to store multiple “phylo” trees in a single, larger object. newick_a <- "((((A:4,B:4):1,C:5):3,D:8):1);" newick_b <- "(((A:2,B:2):1,(C:1,D:1,E:1):2):2);" tree_a <- read.tree(text = newick_a) tree_b <- read.tree(text = newick_b) my_multiPhylo <- c(tree_a, tree_b) class(my_multiPhylo)

## [1] "multiPhylo" For reasons we won’t go into here, the “phylo” trees inside a multiPhylo object are not accessed using the $ operator but using double square brackets. For example, here is how we would plot the second tree inside the multiPhylo object my_multiPhylo: plot(my_multiPhylo[[2]], direction = "downwards", srt = 90, label.offset = 0.5, adj = 0.5, x.lim = c(0,5), y.lim = c(0,15), no.margin =T)

A B C D E

3 Genealogically-sensitive averages and proportions

We now turn to the calculation of genealogically-sensitive proportions and averages. To run the code in this and subsequent sections, ensure that the packages described in Section 2.2 have been installed and loaded. Calculating averages and proportions will require two key components: 1. A “phylo” or “multiPhylo” object containing one or more trees The “phylo” or “multiPhylo” object can be manually defined, as described in Section 2.3, can be read from a file,1 or it can be constructed by using and adjusting materials freely available from glottolog.com, as described below in Section 4. 2. A dataframe which contains the typological data to be averaged and matches it to the tips of the trees

1See online documentation of the ape package for reading trees from various common file formats.

6 The dataframe can be manually defined or be read from a file. Perhaps the easiest method is to read from a file that you have created and saved in “CSV” format. These can be created in commercial spreadsheed software like Excel, and then read by R using the read.csv() function like this: my_dataframe <- read.csv("my_csv_file.csv")

The dataframe must contain one column named “language” (note that in R, names of columns and other objects are case sensitive) plus at least one column containing numerical data. The contents of the “language” column must be the same as the tip labels of the tree(s) in the “phylo” or “multiPhylo” object. The contents of the numerical columns will depend on whether a proportion or an average is desired. To calculate a proportion, fill a numerical column with 1 if the language possesses the property and 0 if it does not. To calculate an average, fill a numerical column with the values of the variable for each language. As an example, in order to apply these analyses to the four languages in Figure 2a,b,c of the main paper, here are the trees that would be needed, which are placed inside a single multiPhylo object: newick_Fig2a <- "(((A:1,B:1,C:1):1,D:2):0.3);" newick_Fig2b <- "(((A:0.2,B:0.2,C:0.2):1.8,D:2):0.3);" newick_Fig2c <- "(((A:1.8,B:1.8,C:1.8):0.2,D:2):0.3);" tree_Fig2a <- read.tree(text = newick_Fig2a) tree_Fig2b <- read.tree(text = newick_Fig2b) tree_Fig2c <- read.tree(text = newick_Fig2c) multiPhylo_Fig2 <- c(tree_Fig2a, tree_Fig2b, tree_Fig2c)

The dataframe required is shown below. In addition to the language column, it contains two numerical columns, is_SOV and n_consonants. As good housekeeping, we recommend using coloumn names of the form is_X or has_X for columns that contain data for proportions, and names of the form n_X for columns that contain counts to be averaged. data_Fig2 <- data.frame(language = c("A", "B", "C", "D"), is_SOV = c(1,1,1,0), n_consonants = c(18, 20, 22, 40))

Genealogically-sensitive averages and proportions are obtained using the PhyloWeights function phylo_average(), specifying its arguments phy and data as in the example below. Here we have assigned the output of this function to a new object results_Fig2. We recommend always assigning the output of phylo_average() to an object. We will see below how to extract from it the various parts of the results. results_Fig2 <- phylo_average(phy = multiPhylo_Fig2, data = data_Fig2)

The function phylo_average() may take up to several minutes to run if the tree is large, or many trees are provided. It will return error messages if the inputs provided to it are not what is required. The results object will contain several parts, which can be accessed using the $ operator. In $phy will be the tree(s) that were supplied and in $data will be the dataframe that was supplied, e.g.: results_Fig2$data

## language is_SOV n_consonants ## 1 A 1 18 ## 2 B 1 20 ## 3 C 1 22 ## 4 D 0 40 In $ACL_weights is a dataframe containing one column for each tree provided, in which appear the phylogenetic weights obtained using the ACL method. The dataframe also contains all of the non-numeric columns of the input data dataframe. In $BM_weights is a similar dataframe, with the phylogenetic weights obtained using the BM method:

7 results_Fig2$ACL_weights

## language tree1 tree2 tree3 ## 1 A 0.2 0.1724138 0.2380952 ## 2 B 0.2 0.1724138 0.2380952 ## 3 C 0.2 0.1724138 0.2380952 ## 4 D 0.4 0.4827586 0.2857143 results_Fig2$BM_weights

## language tree1 tree2 tree3 ## 1 A 0.2317460 0.1856200 0.2489451 ## 2 B 0.2317460 0.1856200 0.2489451 ## 3 C 0.2317460 0.1856200 0.2489451 ## 4 D 0.3047619 0.4431401 0.2531646 In $ACL_averages is a dataframe with one row per tree and one column for each numerical column in the data dataframe. These are filled with the genealogically-sensitive averages or proportions obtained using the ACL method. In $BM_averages appear the genealogically-sensitive averages or proportions obtained using the BM method: results_Fig2$ACL_averages

## tree is_SOV n_consonants ## 1 tree1 0.6000000 28.00000 ## 2 tree2 0.5172414 29.65517 ## 3 tree3 0.7142857 25.71429 results_Fig2$BM_averages

## tree is_SOV n_consonants ## 1 tree1 0.6952381 26.09524 ## 2 tree2 0.5568599 28.86280 ## 3 tree3 0.7468354 25.06329 It is possible to save any of these dataframes to a file using the write.csv() function, for example: write.csv(results_Fig2$ACL_averages, file = "my_ACL_averages.csv")

4 Using and adapting trees from glottolog.com

Glottolog.com (Hammarström et al. 2020) contains many useful resources for quantitative typology. In this section we introduce glottolog’s genealogical data in Section 4.1, discussing how to locate metadata about languages and families of interest, and how to view glottolog’s linguistic family trees. Since genealogically- sensitive averages and proportions require languages to be represented within a single tree, we then discuss how glottolog’s individual trees can be combined in Section 4.2. Since typological studies will often examine language varieties at a level of granualarity that differs from glottolog’s own, in Section 4.3 we discuss how to add and remove languages from trees. in Section 4.4 we discuss how to add branch lengths to trees, since branch lengths are necessary for the calculation of genealogically-sensitive averages and proportions. Section 4.5 discusses how to export trees for use with other software.

4.1 Glottolog’s genealogical data Glottolog provides metadata about the world’s language varieties, their division into language families and the hierarchical subgrouping of langauges inside those families. Naturally, there are many points of contention in linguistics about what the world’s stock of languages and dialects actually is, how it groups into families, and how the families themselves are subgrouped. Glottolog provides one set answers, and structures them

8 in a way which provides typologists with a basis for carrying out changes to suit their own hypotheses. In later sections we will see how this can be done. In the section we describe glottolog’s own global linguistic metadata. At time of writing, the current version of glottolog is v4.3. The PhyloWeights package contains a copy of the v4.3 metadata covering language names, language identification codes, family names, geographical groupings, and family trees. The original metadata files that contain this information are currently available at https://glottolog.org/meta/downloads, where a file named tree_glottolog_newick.txt contains glottolog’s trees, and languages_and_dialects_geo.csv, provides geographical metadata. Language metadata can be accessed using the PhyloWeights function glottolog_languages(). This function returns a dataframe of twenty-five thousand rows. To view it in full, we suggest saving it to a CSV file and opening it in spreadsheet software such as Excel: language_metadata <- glottolog_languages() write.csv(language_metadata, "language_metadata.csv")

Here are the first ten rows: language_metadata <- glottolog_languages() head(language_metadata,n= 10)

## glottocode isocode name name_in_tree position tree tree_name ## 1 3adt1234 3Ad-Tekles 3Ad-Tekles tip 186 Afro-Asiatic ## 2 aala1237 Aalawa Aalawa tip 205 Austronesian ## 3 aant1238 Aantantara Aantantara tip 145 NuclearTransNewGuinea ## 4 aari1238 Aari-Gayil node 85 SouthOmotic ## 5 aari1239 aiw Aari Aari tip 85 SouthOmotic ## 6 aari1240 aay Aariya Aariya NA ## 7 aasa1238 aas Aasax Aasax tip 186 Afro-Asiatic ## 8 aasd1234 Aasdring Aasdring tip 179 Indo-European ## 9 aata1238 Aatasaara Aatasaara tip 145 NuclearTransNewGuinea ## 10 abaa1238 Rngaba Rngaba tip 329 Sino-Tibetan Listed here are all of glottolog’s languages, dialects, subgroups and families. These entities are identified by a name, an ISO-639-3 code if available (format: three letters) and a glottolog-specific glottocode (format: four letters followed by four digits2). Also described is the entity’s relationship to a glottolog tree: the representation of its name in the tree (which may differ slightly from the name used elsewhere by glottolog3), its position (as tip or node), and the tree’s number and name. Briefer metadata about glottolog’s language families can be accessed using the PhyloWeights function glottolog_families(). This returns a dataframe of 418 rows, so to view it in full, we also suggest saving it to a CSV file and opening it in spreadsheet software. Here are the first ten rows: family_metadata <- glottolog_families() head(family_metadata,n= 10)

## tree tree_name n_tips n_nodes main_macroarea ## 1 1 Konda-Yahadian 2 1 Papunesia ## 2 2 Canichana 1 1 South America ## 3 3 Mongolic-Khitan 66 25 Eurasia ## 4 4 Caddoan 7 6 North America ## 5 5 Yuki-Wappo 4 2 North America ## 6 6 GreatAndamanese 10 9 Eurasia ## 7 7 Ta-Ne-Omotic 29 15 Africa 2There are two exceptional glottocodes with numbers in the initial four characters: b10b1234 and 3adt1234. 3The differences are systematic and are made in order to conform with the permissible Newick format of tree labels: spaces and apostrophes are removed, parentheses are replaced by braces, and commas are replaced by forward slashes.

9 ## 8 8 Bogia 2 1 Papunesia ## 9 9 LowerSepik-Ramu 36 23 Papunesia ## 10 10 Kalapuyan 3 1 North America Glottolog v4.3 divides the world’s languages into 418 families, including 138 isolates, and it provides a tree for each. Together, the 418 trees contain 8028 internal nodes and 16734 tips, many of which represent varieties that would typically be considered dialects. Geographically, glottolog assigns each language variety to one of six macroareas: Africa, Australia, Euarasia, Papunesia, South America and North America. The PhyloWeights metadata includes a column main_macroare. This is the one macroarea which contains more of the family’s language varieties than any other. We will see how this information can be useful in Section 4.2. Glottolog’s 418 family trees are represent in a multiPhylo object, glottolog_trees_v4.3. For example, here is glottolog’s representation of the Great Andamanese family, which is the sixth tree in the multiPhylo object. For readability, we plot this tree horizontally: tree6 <- glottolog_trees_v4.3[[6]] plot(tree6, x.lim = c(-0.3, 13)) Akarbale[akar1243][acl]−l− Akabea[akab1249][abj]−l− Akakora[akak1251][ack]−l− MixedGreatAndamanese[mixe1288][gac] Akacari[akac1240][aci]−l− Akabo[akab1248][akm]−l− Akakede[akak1252][akx]−l− Apucikwar[apuc1241][apq]−l− Akakol[akak1253][aky]−l− Okojuwoi[okoj1239][okj]−l−

Glottolog’s tip labels consist of a name followed by a glottocode in square brackets, an ISO code in square brackets (if one exists) and possibly the string “-l-”. Node labels (not shown above) have the same structure. The PhyloWeights function label_by_glottocode() will shorten labels to just the glottocode, for example: tree6_glottocodes <- label_by_glottocode(tree6) plot(tree6_glottocodes, direction = "downwards", label.offset = 0.05, x.lim = c(0,10), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6_glottocodes$node.label)

10 . Here PhyloWeight

akar1243 bind_as_rake() , here we select multiple label_by_glottocode() sout2683 akab1249 function akak1251

jeru1239 mixe1288 akaj1239 PhyloWeights

nort2678 akac1240 . glottolog_trees_v4.3[[6]] 11 (170, 212, , 388 207, )] 100 c (arnhem) grea1241 boca1235

akab1248 arnhem nort3276

akak1252 will issue a warning if there are tip or node labels in which it is unable

apuc1241 label_by_glottocode (arnhem_multiPhylo) has still changed labels to glottocodes where possible; it is just flagging the fact

okol1242 akak1253 within single square barackets).

midd1323 okoj1239 c(...) label_by_glottocode() bind_as_rake ## Warning in## label_by_glottocode(arnhem): unchanged: Labels 0 without tip(s); glottocodes 1 detected node(s): and left Plotting the resulting tree,root we in get a a rake-like newly structure, created without Arnhem any tree. additional Note subgrouping. how all five families are joined to the label_by_glottocode() that it was not ablearnhem_glottocodes to do <- so in all cases. arnhem_multiPhylo <- glottolog_trees_v4.3[ arnhem <- Because the newly created root node in this single tree has no label, the function glottolog language families (Gunwinyguan,and Mangarrayi-Maran, Gaagudju), Maningrida, whose and tree thefamily numbers isolates is are Kungarakany selected 170, using 212,families double 388, using square 207 brackets, and as 100. in (Regarding the R code: whereas a single a genealogical hypothesis, eventhat if making that such hypotheses hypotheses is, is tactily, unavoidable, that it all will familiesprovides most are tools beneficial equally for for (un)related. combining progress multiple Given in glottolog the trees field into to one. them explicit. to identify a glottocode. We will see an examplelengths of in this Section shortly 4.4. below. 4.2 How to combine trees will issue a warning, that it encountered one node without a glottocode. This is not an error message, and The trees in awe multiPhylo assign object the can resulting, be single combined tree using to the the object To begin with a small example, here we combine five glottolog families to represent the hypothesised To enable typologists to explore genealogical hypotheses and to make those hypotheses explicit, The function The branches in glottolog’s trees are all of equal length. We will discuss how to assign more realistic branch Arnhem group in northern Australia (Green 2003). First we create a multiPhylo object containing the five As discussed in the main paper, the comparison of langauges across language families unavoidably commits to plot(arnhem_glottocodes, direction = "downwards", label.offset = 0.05, x.lim = c(0,30), y.lim = c(0,10), no.margin =T) kung1259 gaga1251 mang1381 djee1236 kunb1251 djau1244 alaw1244 mara1385 wand1263 naka1260 ngan1295 anin1240 nung1290 ngal1292 uwin1234 wara1290 gura1252 ngal1293 remb1249 kund1258 gund1246 mura1269 guma1252 naia1238 wulw1234 gunn1248 gunn1247 gunn1249 kune1235 kune1236

It is possible to give a combined tree more structure, by using bind_as_rake() iteratively. For instance, suppose we wished to hypothesise that Gunwinyguan, Mangarrayi-Maran, Maningrida form their own subgroup. First, we create a single tree containing them, which we call tree_A: multiPhylo_A <- glottolog_trees_v4.3[c(170, 212, 388)] tree_A <- bind_as_rake(multiPhylo_A)

Then we create the final tree by combining tree_A with the two isolate family trees: arnhem_multiPhylo <- c(tree_A, glottolog_trees_v4.3[c(207, 100)]) arnhem2 <- bind_as_rake(arnhem_multiPhylo) arnhem2_glottocodes <- label_by_glottocode(arnhem2)

## Warning in label_by_glottocode(arnhem2): Labels without glottocodes detected and left ## unchanged: 0 tip(s); 2 node(s): , plot(arnhem2_glottocodes, direction = "downwards", label.offset = 0.05, x.lim = c(0,30), y.lim = c(0,10), no.margin =T)

12 . : whose is either list mentioned in gaga1251 macro_groups kung1259 djee1236 naka1260 main_macroarea : argument to a

gura1252 macro_groups = NULL gunn1249

gunn1247 main_macroarea gunn1248 wand1263 . By default, the function returns a mara1385 alaw1244 macro_groups mang1381 wara1290

uwin1234 glottolog_supertree() wulw1234 djau1244 13 kunb1251

ngal1292 glottolog_supertree()

naia1238 argument of (macro_groups = my_list) (macro_groups = my_list) guma1252 () (macro_groups = NULL)

mura1269 function:

kune1236 , within which any groupings containing more than one macroarea are kune1235 c() gund1246 kund1258 macro_groups nung1290 my_list anin1240 ("South, America" "North America" )) glottolog_supertree glottolog_supertree c ngan1295 glottolog_supertree glottolog_supertree remb1249 ("Africa", , "Australia" "Eurasia", , "Papunesia" ("Africa", ) "Eurasia" ngal1293 list list , PhyloWeights provides the function 4.3 How to remove and duplicate languages commonly, a typological study will cover a set of languages that differs from the set of tips in any single my_list <- my_list <- my_supertree <- my_supertree <- It is also possiblesingle to group. group macroareas Grouping together,items of for are macroareas example, the is to desired acheivedthe combine groups by tree, North of setting and and macroareas. the all Southcombine Each of America North group its into and will a families South then belowa America appear it. list, into as For a which one instance, single we’verepresented of to group, called as the keep the a highest-level all following vector, nodes of code using of glottolog’s would the macroareas be separate, used. but First to we define my_supertree <- For instance, to group all familiesmy_supertree directly <- into a 418-pronged rake structure, set supertree that divides firstthese into macroarea glottolog’s nodes six appear macroareas,Section all with 4.1 of an above. the internal This glottolog node tree families, for is grouped each, enourmous, by and so their directly we below do not plot it here. It is obtained like this: There are several reasons why typologists may wish to use a tree that departs from the glottolog trees. Most Taking a second example, to create a supertree containing only the families whose The highest-level, macroarea groupings can also be controlled through the function’s argument Typological studies often examine languages from very many families. To group all 418 families into a single Africa or Eurasia, and to place Africa and Eurasia under separate, highest-level nodes, we would use: We then use that list as the supertree glottolog tree, either through the exclusion of some of the lects that glottolog represents as tips, through the distinction of additional lects, or through the inclusion of varieties that glottolog represents not as tips in the tree but as internal nodes—this can occur when glottolog places one or more dialects at the tree’s tips and more a general, language node above them, however the calculation of genealogically-sensitive averages and proportions requires one’s typological variables to be related to the tips of trees, not to internal nodes. The PhyloWeights function keep_tips() assists with all of these cases by taking a tree, plus a set of labels describing the desired tips, and then (1) stripping the tree of all tips whose labels do not match those supplied; (2) converting internal nodes to tips if their labels have been supplied; and (3) duplicating any tips whose label is supplied more than once. Examples will be provided below. In the following examples, we will make use of glottolog’s representation of the Great Andamanese family, where we have converted the labels to glottocodes using label_by_glottocode(): tree6 <- glottolog_trees_v4.3[[6]] tree6_glottocodes <- label_by_glottocode(tree6) plot(tree6_glottocodes, direction = "downwards", label.offset = 0.05, x.lim = c(0,10), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6_glottocodes$node.label)

grea1241

midd1323 nort3276 sout2683

okoj1239 okol1242 akak1252 nort2678 akab1249 akar1243

akak1253 apuc1241 boca1235 jeru1239

akab1248 akac1240 akaj1239 akak1251 mixe1288

4.3.1 How to remove tips Firstly, we illustrate the removal of tips. Here we retain six of the original ten tips. We do this by setting the label argument to a vector containing the six desired tip labels.4 tree6a <- keep_tips(tree6_glottocodes, label = c("akar1243", "akak1251", "akac1240", "akak1252", "apuc1241", "okoj1239")) plot(tree6a, direction = "downwards", label.offset = 0.05, x.lim = c(0,7), y.lim = c(0,8),

4The order of the labels within the vector is not important here.

14 no.margin =T) nodelabels(text = tree6a$node.label) grea1241

midd1323 nort3276 sout2683

okoj1239 okol1242 akak1252 nort2678 akar1243

apuc1241 boca1235 jeru1239 akac1240 akak1251

4.3.2 How to remove internal nodes Sometimes, the removal tips will cause one or more of the remaining tips to sit below a node which dominates only it. This reflects that fact that keep_tips() preserves the original depth of any tips that remain in the tree (the reader may like to confirm this by comparing the two plots above). This outcome may or may not be desirable, according the researcher’s needs. If it is undesirable, then non-branching, internal nodes can be removed using the PhyloWeights function collapse_node(). For instance, here we remove two of the non-branching nodes in the tree above, by stating the unwanted labels within a vector supplied as the label argument of collapse_node(). In the resulting tree, these nodes have been removed, thus reducing the depth of the tips below them: tree6b <- collapse_node(tree6a, label = c("boca1235", "okol1242")) plot(tree6b, direction = "downwards", label.offset = 0.05, x.lim = c(0,7), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6b$node.label) grea1241

midd1323 nort3276 sout2683

okoj1239 apuc1241 akak1252 nort2678 akar1243

akac1240 jeru1239 akak1251

The function collapse_node() can also be used to alter a subgrouping hypothesis, and specifically, to remove a layer of subgrouping, converting a nested structure ((A,B),C) into a flat structure (A,B,C). For instance, here we remove the okol1242 node of the original glottolog tree, converting its two daughter languages into sisters of okoj1239:

15 tree6c <- collapse_node(tree6_glottocodes, label= "okol1242") plot(tree6c, direction = "downwards", label.offset = 0.05, x.lim = c(0,10), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6c$node.label) grea1241

midd1323 nort3276 sout2683

okoj1239 akak1253 apuc1241 akak1252 nort2678 akab1249 akar1243

boca1235 jeru1239

akab1248 akac1240 akaj1239 akak1251 mixe1288

4.3.3 How to duplicate an internal node as a tip Now we move to the example of converting internal nodes to tips. Suppose that we wished to remove the tips akac1240 and akab1248 from the Great Andamenese glottolog tree, and to retain the internal node above them, boca1235, as a tip. This outcome will not be acheived simply removing the tips akac1240 and akab1248, since by convention, internal nodes are removed automatically if no tips remain below them. We see this here, where the tips akac1240 and akab1248 are removed, and so too is the node boca1235. tree6d <- keep_tips(tree6_glottocodes, label = c("akar1234", "akab1249", "akak1251", "mixe1288", "akak1252", "apuc1241", "akak1253", "okoj1239")) plot(tree6d, direction = "downwards", label.offset = 0.05, x.lim = c(0,8), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6d$node.label)

16 node . Passing keep_tips()

akab1249 akab1249 while leaving its daughter tips sout2683 sout2683 , located below the original tip akak1251 akak1251 in the tree here: boca1235

jeru1239 mixe1288 jeru1239 nort2678 mixe1288 akaj1239 boca1235 nort2678 akaj1239

nort3276 boca1235 17 grea1241

akak1252 boca1235 nort3276 grea1241 akak1252

apuc1241 apuc1241 "mixe1288", "akak1252", "apuc1241", "akak1253", "okoj1239", "boca1235")) results in the node being cloned as a ("akar1234", "akab1249", "akak1251", c okol1242 akak1253 okol1242 node.label) as a tip, we can add it to the vector of labels given to akak1253 $ while also removing its original daughter tips. However, it is possible to clone argument of keep_tips(). Here we clone midd1323 label = midd1323 (tree6_glottocodes,

okoj1239 keep_tips() okoj1239 label boca1235 boca1235 (0,10), (0,8), c c keep_tips (text = tree6e direction = "downwards", label.offset = 0.05, x.lim = y.lim = no.margin =T) (tree6e, any node independently oflabel whether appears its in daughters the arein retained place, or meaning not. that the A newly node cloned will tip be appears cloned as whenever their its sister: the label of aof node the to same name, whichtree6e is also <- retained. We see this for plot nodelabels To retain the node Above, we cloned tree6f <- keep_tips(tree6_glottocodes, label = c("akar1234", "akab1249", "akak1251", "mixe1288", "akac1240", "akab1248", "akak1252", "apuc1241", "akak1253", "okoj1239", "boca1235")) plot(tree6f, direction = "downwards", label.offset = 0.05, x.lim = c(0,10), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6f$node.label) grea1241

midd1323 nort3276 sout2683

okoj1239 okol1242 akak1252 nort2678 akab1249

akak1253 apuc1241 boca1235 jeru1239

boca1235 akab1248 akac1240 akaj1239 akak1251 mixe1288

4.3.4 How to duplicate tips Next we illustrate the cloning of tips. Cloning tips may be useful when glottolog provides only one glottocode, and thus only one tree tip, corresponding to multiple lects in the typologist’s study. To clone a tip, simply include it more than once in the vector for the label argument of keep_tips(). For example, here we clone the tip okoj1239. When a tip is cloned, it is important to respect the convention of maintaining a distinct label for every tip in the tree, and so keep_tips() automatically adds a suffix to the labels of any cloned tips, as we see here for okoj1239: tree6g <- keep_tips(tree6_glottocodes, label = c("akar1234", "akab1249", "akak1251", "mixe1288", "akac1240", "akab1248", "akak1252", "apuc1241", "akak1253", "okoj1239", "okoj1239")) plot(tree6g, direction = "downwards", label.offset = 0.05, x.lim = c(0,11), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6g$node.label)

18 akab1249 akab1249 sout2683 sout2683 akak1251 akak1251 : the original one, and a new one jeru1239 mixe1288 jeru1239 mixe1288 akaj1239 akaj1239 baco1235 : nort2678 akac1240 nort2678 akac1240

boca1235 akab1248

boca1235 akab1248 nort3276 nort3276 grea1241 19 function notices this and automatically adds a suffix to grea1241

akak1252 akak1252 collapse_node() , as we did previously, and also clone the resulting tip, by listing apuc1241 apuc1241 keep_tips() ("akar1234", , "akab1249" "akak1251", c baco1235 okol1242 akak1253 okol1242 akak1253 node.label) $ (tree6g, "okoj1239") okoj1239−2

okoj1239−2 label = midd1323 (tree6_glottocodes, midd1323 (0,11), (0,8), okoj1239 okoj1239−1 okoj1239−1 c c keep_tips collapse_node (text = tree6h twice. As a side-effect, this creates two nodes named direction = "downwards", label.offset = 0.05, x.lim = y.lim = no.margin =T) (tree6h, the node labels to maintaintree6i their distinctness: <- Here we clone thebaco1235 interior node placed above the duplicated tips. The desired, this new node cantree6h be removed <- using plot nodelabels As seen above, the multiple copies of the cloned tip are placed under an internal node of the same name. If "mixe1288", "akac1240", "akab1248", "akak1252", "apuc1241", "akak1253", "okoj1239", "boca1235", "boca1235")) plot(tree6i, direction = "downwards", label.offset = 0.05, x.lim = c(0,12), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6i$node.label) grea1241

midd1323 nort3276 sout2683

okoj1239 okol1242 akak1252 nort2678 akab1249

akak1253 apuc1241 boca1235−1 jeru1239

boca1235−2 akab1248 akac1240 akaj1239 akak1251 boca1235−1 boca1235−2 mixe1288

Here we produce three new copies of the tip okoj1239 by listing it four times: tree6j <- keep_tips(tree6_glottocodes, label = c("akar1234", "akab1249", "akak1251", "mixe1288", "akac1240", "akab1248", "akak1252", "apuc1241", "akak1253", "okoj1239", "okoj1239", "okoj1239", "okoj1239")) plot(tree6j, direction = "downwards", label.offset = 0.05, x.lim = c(0,13), y.lim = c(0,8), no.margin =T) nodelabels(text = tree6j$node.label)

20 argument label , then read that file akab1249 sout2683 akak1251 column as the glottocode mixe1288 jeru1239 akaj1239

akac1240 glottocode) nort2678 $ glottocode akab1248 boca1235 nort3276 grea1241 akak1252 as is useful, and then exporting the resulting tree to (tree131) 21 provides some limited utility in this regard, allowing one to

apuc1241 node.label) $

akak1253 okol1242

okoj1239−4 collapse_node() and midd1323 collapse_node() okoj1239−3 label_by_glottocode (old_tree, label = my_dataframe ) ("my_file_with_a_glottocode_column.csv" thus provides a general-purpose tool for altering a glottolog tree (or a combined

okoj1239−2 okoj1239 read.csv keep_tips() keep_tips

okoj1239−1 (0,6), (0,8), , for instance, like this: c c keep_tips() (text = tree131_by_glottocode direction = , "downwards" srt = , 90 label.offset = -0.3, adj = , 0.5 x.lim = y.lim = no.margin =T) (tree131_by_glottocode, keep_tips() plot altered Newick representation. Tofamily: take a simple example, heretree131 is glottolog’s <- representation.3[[131]] glottolog_trees_v4 oftree131_by_glottocode the <- Pahoturi my_new_tree <- 4.3.5 Making other changes to the tree mentioned above, the function simplify a tree byHowever, removing in internal order nodes, to andbe examine thus various needed, removing genealogical the such hypotheses, layers assuggest more of shifting first subgrouping substantial tips using modifications which from they mayNewick one represent. sometimes format, subgroup altering to the another. Newick representation For manually, advanced and manipulations then like creating this, the we modified tree from the of my_dataframe <- nodelabels tree or supertree) tostudy. make its set of tips conformIn to the the code set chunksconvenient above, of to the lects create sets that a of a CSVinto tips typologist R, file were has in assign written used some it out a other to manually typological software, a in with dataframe R. a object In column and many named cases, use it that may dataframe’s be more The function A typologist may have reason to alter a glottolog tree in ways other than just removing or cloning tips. As paho1240

agob1244 idii1243

agob1245 ende1235 kawa1281 nucl1597 tame1238

Suppose we wished to shift the tip kawa1281 from one subgroup to the other. We can obtain the Newick representation of the tree using write.tree(): write.tree(tree131_by_glottocode)

## [1] "((agob1245:1,ende1235:1,kawa1281:1)agob1244:1,(nucl1597:1,tame1238:1)idii1243:1)paho1240:1;" To move kawa1281, we edit this to the following representation and assign it to the object new_newick: new_newick <- "((agob1245:1,ende1235:1)agob1244:1,(kawa1281:1,nucl1597:1,tame1238:1)idii1243:1)paho1240:1;"

We then create a new tree object using read.tree(), with the text argument equal to this modified Newick representation: tree131b <- read.tree(text = new_newick) plot(tree131b, direction = "downwards", srt = 90, label.offset = -0.3, adj = 0.5, x.lim = c(0,6), y.lim = c(0,8), no.margin =T) nodelabels(text = tree131b$node.label) paho1240

agob1244 idii1243

agob1245 ende1235 kawa1281 nucl1597 tame1238 In many cases, since the Newick representation will be long, it may be easiest to save the tree to a file, then alter the file and read it back into R.5 Trees can be saved to a Newick-formatted text file using write.tree() supplied with a file argument, and read from a file into an R object using read.tree() with a file argument. For instance: write.tree(my_tree, file = "my_tree_file.newick") my_tree <- read.tree(file = "my_tree_file.newick")

4.4 How to add branch lengths Branch lengths in a tree convey information, and all of the phylogenetic methods discussed in the main paper are sensitive to the information represented by the branch lengths. (To be specific, the methods discussed in the main paper are sensitive to the relative lengths of the branches, so multiplying all of the branch lengths in a tree by some constant amount would not affect the results.)

5By convention, these files are given the filename extension .tree or .newick.

22 Glottolog’s trees contain informative subgrouping structure, but the branch lengths are all equal. Even without knowing what the the true branch lengths are for a linguistic tree, we do know that a situation in which all are equal is highly unlikely. A good approximation to the most-likely6 distribution of branch lengths in a phylogenetic tree, under a variety of assumptions, is exponential (Venditti, Meade & Pagel 2010), i.e., very long branches are rare, and very short ones are frequent. This notion is implemented in the PhyloWeights package by the function set_branch_lengths_exp(), which sets the deepest branches to length 1/2, then next layer to length 1/4, then the next to 1/8 and so on. This will produce a more plausible set of branch lengths, even in the absence of firm knowledge of exact lengths, and on these grounds we advocate its use if additional information about branch lengths is not available. Here is an example of the result of applying exponential branch lengths to glottolog’s Great Andamanese tree: tree6 <- glottolog_trees_v4.3[[6]] tree6_glottocodes <- label_by_glottocode(tree6) tree6k <- set_branch_lengths_exp(tree6_glottocodes) plot(tree6k, direction = "downwards", label.offset = 0.05, x.lim = c(0,12), y.lim = c(0,2), no.margin =T) okoj1239 akak1252 akab1249 akar1243 akak1253 apuc1241 akab1248 akac1240 akak1251 mixe1288

Here is an example of the result of applying them to glottolog’s Eskimo-Aleut tree, of 30 tips: tree73 <- label_by_glottocode(glottolog_trees_v4.3[[73]]) tree73_by_glottocode <- label_by_glottocode(tree73) tree73a <- set_branch_lengths_exp(tree73_by_glottocode) plot(tree73a, direction = "downwards", label.offset = 0.05, x.lim = c(0,30), y.lim = c(0,2), no.margin =T)

6‘Most likely’ doesn’t mean that we expect to see trees with exactly these branch lengths. Compare this to flipping a coin two million times: although it is unlikely that the outcome will be exactly one million heads and one million tails, it remains true that one million heads and one millions tails is the most likely outcome. The branch lengths discussed here a ‘most likely’ is a similar sense.

23 ,

koni1262 . koni1262 chug1254 chug1254 nauk1242 nauk1242 cent2128 cent2128 unal1234 unal1234 , before ultrametiricising the nuni1254 nuni1254 set_branch_lengths_exp() hoop1234 hoop1234 kusk1241 kusk1241 sire1246 ultrametricise() sire1246 sigl1242 sigl1242 nets1241 nets1241 copp1244 copp1244 tunu1234 tunu1234 kala1399 kala1399

queb1248 queb1248 , can be used to adjust just the first layer of branches.

rigo1234 rigo1234 24 labr1244 labr1244

cari1277 cari1277 set_first_branch_lengths() sout3269 sout3269 (arnhem_a,1) tunu1235 tunu1235 (arnhem_glottocodes) pola1254 pola1254 iglu1234 iglu1234 coas1299 coas1299 tree. This is done using the function king1259 king1259 (arnhem_b) nort2944 (tree73a) nort2944 kotz1238 kotz1238 kobu1239 kobu1239 set_first_branch_lengths() medn1235 ultrametric medn1235 west2616 west2616 (0,30), (0,3), (0,30), (0,2), east2533 c c east2533 c c ultrametricise set_branch_lengths_exp set_first_branch_lengths ultrametricise direction = , "downwards" label.offset = 0.05, x.lim = y.lim = direction = "downwards", label.offset = 0.05, x.lim = y.lim = no.margin =T) (arnhem_c, (tree73b, arnhem_c <- plot by changing the firsttree: branch length to 1 using arnhem_a <- arnhem_b <- the implied closeness or distance between the first-order branches. For example, here we take the hypothesised plot tree73b <- This may be useful where multiple family trees have been joined together,which and sets there the is deepest a branch length desire to to 1/2. manipulate Then we double the distance of the deepest level of relationships what is known as an An additional function, Arnhem group from Section 4.2. First we assign exponential branch lengths with An additional option is to stretch the terminal branches so that all tips are equidistant from the root, creating no.margin =T) ngal1293 remb1249 ngan1295 anin1240 nung1290 kund1258 gund1246 kune1235 kune1236 mura1269 guma1252 naia1238 ngal1292 kunb1251 djau1244 wulw1234 uwin1234 wara1290 mang1381 alaw1244 mara1385 wand1263 gunn1248 gunn1247 gunn1249 gura1252 naka1260 djee1236 kung1259 gaga1251

4.5 Exporting trees for use with other software As mentioned above, trees can be saved to file in Newick format using write.tree(). Files like this can be opened by other software such as FigTree7, which can be used to generate tree plots interactively, which may be useful where preparing plots for publication and dissemination. Often it will be desirable to reproduce a tree with labels that are more reader-friendly than glottocodes. PhyloWeights provides the function label_by_name(), which will replace full glottolog labels, or labels consisting of just a glottocode, with glottolog’s corresponding language, dialect, subgroup or family name. Here we relabel the Arnhem tree by the languages’ names. As was the case with label_by_glottocode(), warnings are given by label_by_name() if a tree contains any nodes that cannot be relabeled in this way; these are not errors, just alerts. arnhem_by_name <- label_by_name(arnhem_c)

## Warning in label_by_name(arnhem_c): Labels without glottocodes detected and left ## unchanged: 0 tip(s); 1 node(s): plot(arnhem_by_name, direction = "downwards", label.offset = 0.05, x.lim = c(0,30), y.lim = c(0,3), no.margin =T)

7https://github.com/rambaut/figtree/releases

25 . The first

Gaagudju Kungarakany Djeebbana Nakara

Guragone coda_violation Gun−nartpa and

Gun−narta package as a dataframe with the Gun−narda Wandarang Marra Alawa Mangarrayi Warray PhyloWeights

Uwinymil onset_violation Wulwulam ,

Jawoyn 26 Kunbarlang Ngalkbun ManyallalukMayali glottocode

Kunwinjku , Kuninjku

KuneNarayek name KuneDulerayek Kundjeyhmi Kundedjnjenghmi Wubuy Anindilyakwa

Ngandi , and columns Rembarrnga Ngalakgan yin_2020_data (yin_2020_data,n= 10) used for South America and North America. Additionally, the only African language available in the sample macroareas. Since the language sample covered relatively few families in the Americas, a single group was ten rows are shown here: head #### 1## 2## 3## 4## 5## 6 Abkhaz name## glottocode 7 Adnyamathanha onset_violation abkh1244## Achagua coda_violation 8 Abui adny1235## 9 acha1250 abui1241## Hokkaidoainu Adang Standardalbanian 10 ainu1240 adan1251 alba1267 Adyghe5.1 adyg1241 Preparing a Alawa tree 1 alaw1244 0 Aleut 0 0 aleu1260 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0 1 the languages had word-initialas onset 1 clusters for or yes word-finalname and coda 0 clusters for containing no. sonority This reversals, coded dataset is provided with the 5 Putting itIn this together: section we A provide a worked real example worked example ofgenealogically-sensitive the proportions use of of languagesconsisted the in of methods which 496 described various languages violations above. in occurred. the CLICS2 The database language (Anderson sample et al. 2018) and the AusPhon-Lexicon database was Arabic, so Africa and Eurasia were grouped together: The tree for Yin’s study was constructed from a glottolog supertree. Yin’s supertree made use of glottolog’s were used to help produce a principled interpretation of the data. Yin’s raw data consisted of a table of languages’ names and glottocodes and indications of whether or not Yin (2020) examined violations of the sonority sequencing principle in 496 languages, and calculated the (Round 2017). The language sample was not balanced in the traditional sense, and phylogenetic methods yin_macro <- list(c("South America", "North America"), c("Africa", "Eurasia"), "Papunesia", "Australia") supertree <- glottolog_supertree(macro_groups = yin_macro) supertree_glottocodes <- label_by_glottocode(supertree)

## Warning in label_by_glottocode(supertree): Labels without glottocodes detected and ## left unchanged: 0 tip(s); 5 node(s): World, SouthAmerica-NorthAmerica, Africa-Eurasia, ## Papunesia, Australia From this supertree, the 496 languages in Yin’s dataset were kept and all others removed, and the internal node mada1298 was collapsed: supertree_a <- keep_tips(supertree_glottocodes, label = yin_2020_data$glottocode) supertree_b <- collapse_node(supertree_a, "mada1298")

Finally, branch lengths were assigned. Branches were first assigned exponential lengths. Then, in order to diminish the importance of the macro groups, the branches above them were shortened to a length of 1/40. The effect of this decision is that the implied distance between families in different macro groups is only marginally greater than between families within a single macro group. The resulting tree appears as in Figure 1.8 supertree_c <- set_branch_lengths_exp(supertree_b) yin_2020_tree <- set_first_branch_lengths(supertree_c,1 /40)

5.2 Preparing the dataframe of typological data To calculate phylogenetic weights and genealogically-sensitive proportions, in addition to the tree (or a set of trees) we require a dataframe with (a) one column language, whose contents match the tip labels in the trees, and (b) other columns containing numerical data to be averaged. The dataframe yin_2020_tree does not currently have a column languages, so we create one by copying from glottocode: yin_2020_data$language <- yin_2020_data$glottocode

The first ten rows of the dataframe yin_2020_data now look like this: head(yin_2020_data,n= 10)

## name glottocode onset_violation coda_violation language ## 1 Abkhaz abkh1244 1 1 abkh1244 ## 2 Abui abui1241 0 0 abui1241 ## 3 Achagua acha1250 0 1 acha1250 ## 4 Adang adan1251 0 1 adan1251 ## 5 Adnyamathanha adny1235 0 0 adny1235 ## 6 Adyghe adyg1241 1 1 adyg1241 ## 7 Hokkaidoainu ainu1240 0 0 ainu1240 ## 8 Alawa alaw1244 1 0 alaw1244 ## 9 Standardalbanian alba1267 1 1 alba1267 ## 10 Aleut aleu1260 1 1 aleu1260 The glottocodes in the language column of yin_2020_data contain some duplicates, corresponding to cases where the study’s language sample contained more than one lect corresponding to a single glottocode. Here we show rows 160 to 169 of yin_2020_data. Duplicate glottocodes can be seen in rows 164 and 165, for the lects Ikarranggal and Ogh Angkula.

8Figure 1 was generated using FigTree, https://github.com/rambaut/figtree/releases.

27 Uk u ri g u ma Ma i a - P i l a

Yarawata Yangulam

Uyaj i ta ya

Bilak ura Mi a n i

Waube Yaben Ma l a

Bargam Re ra u Ga v a k Og e a Wanambre Ma l a s e awai Kes asi Saus isauru Sins Jilim as As

Uy a

Ma u wa k e Us a n Pamos u ru a Da n Kobol a n i g Uri Mu s a r Mo e r e

Bepour Watiwa Ma wa k Arawum Kowaki m u g Mi Mu m Pal Lemio Amaimon Pulabu Siroi Sileibi Anjam Bongu ale l Ma Brem Sam

Sop Sumau Apali Mu n i t u a l g Ga n Gi ra wa Saep Pay namar Kein Yabong k a Mu s Gu ma l u Biyom Faita Sihan Tauya Anamgura Panim Atemble a d a s Mo e r Ne nd Ra p ti n g Isebe Amele Kare Anam Bau Ma t eNa p ik e Yoidik

Re mpi Uta rmb un g WamasGa ru s Wadaginam Adang Samosa Blagar Mu r u p i Mo s i mo No b o n o b Klon Wagi Sawila Wers ing Korak Baimak Waskia Saruga Ga l Tuyuca Ma wa n Bagupi Yuruti Ca ra p a n a Abui Wanano SilopiUtu Kamang Kui PiratapuyWaimaja o Bara Tatuyo

TucanoDe s a n o Isabi Siriano

Cu b e o Baras ana Ma cun a Alorpantar Kaera Ne d e b a n g Sec oy a Teiwa Siona Wes ternpantar Gi a c o n e Ore oj n Tanimuca Tariano Cu rri p a c o Ac hagua No rth e rn y u k a g h i r Southerny uk aghir Koreguaje Piapoc o Adyghe Mo d e r n h e b rAbk e w haz Ca b iy a ri Standardarabic Yuc una Khalk hamongolian Epenabas urudo Kalmyk ChItelmen u k c h i Re s ig a ro Epenas aija Emberac hami Ma l a y a l aTamil m Ev enk i Kannada Ma n ch u Burus has k i Emberatado Japanese Na n a i Bury at Ca ti o Ho k k a i d o a iBas n u que Korean Emberadarien Lezgian Ika Telugu Ni v k h Baniv a Ge o rg i a n Wayuu Di mi n a Tsez Da rgLak wa Ket Av ar Ch e c h e n Kogui No rth e rn u z b e k No rth a z e rb a i j a n i Ch imi la Turkish Wounaan Sak ha Bari Gu a h ib o TunebocentralPlay ero Kaz ak h Ch u vash Tatar Bas hk ir Cu ib a

Ch a p a la a c h i No rth e rn man s i Gu a ya b e ro Tsafiquipila No rth e rn k h a n ty Jitnu Gu a mb ia n o Mo k s h a Totoro Erz y a Hu n g a ri a n Witotominica Witotomurui Me a d o wmaHi l l mari r i Komiz y rian Awa Witotonipode Komipermyak Ud murt Kalaallis ut Ng a n a s a n Oca in a Ce n tra ls ib e ri a n y u p ik TundranenetsNo rth e rn s e l k u p Mi ra n a Forestenets Aleut Southerns ami Bora No rth e rn s a mi Mu in a n e Lulesami Kak ua Kildins ami Nu k a k Skoltsami Ca ri j o n a Inarisami Yuk pa Livonian Jupda Es tonian Kams a Paez Finnish Inga No rth k a re l i a n Saliba Puinav e Veps Ol o n e t s k a re li a n Kuugu_Ya' u Luobenzhuo Umpi l a Xiangy un Ay apathu Zhoucheng Yinty ingk a Qi li q ia o Bak anh Wik_Mungkan Yunlong Kugu_Nganhcara Lanping Ji a n ch u a n Angk amuthi He q i n g Yadhay k enu Ery uan Atampay a Fuzhou Anguthimri Thaynakwithi Xiamen Linngithigh Ch a o z h o u Ch a n g s h a Oyka n g a n d Me ix ia n Ol k o l Og h _ Un y j a n Na n c h a n g Ikarranggal Gu a n g z h o u Kok _NarOg h _ An g k u l a Yangjiang Mb a b a r a m Wenzhou Ri man g g u d i n h ma Suz hou Ng a wu n Yangzhou Kok o_Bera ShenyJi ang n a n He fe i Xi_An Gu ri n d j i Mo d e rn g reArmenian e k Ma ln g in Beijing Wanyjirra Ng a ri n y man Standardalbanian Bilinarra Ma n d a rni c h ni e s e Kunming Mu d b u r r a Ch e n g d u Irish Jaru Walmajarri Warlpiri Ng a rd i l y Breton Warlmanpa Welsh BengaliHi n d i Kartujarra Putijarra Emmi Wangkajunga Latin YulparijaWarnman Wes ternfars i Larrakia No rth e rn p a s h to Mu nhP- i rh at LimilnganLardil Tiwi Ng a a n y a tj a rra Os s e tai n Ma e ngel t Ng a l i a Wardaman Karajarri Pintupi Wagiman Kuk atja Ro ma n ia n Patjtjamalh Ma n g a l a Ny a n g u marta No rth e rn k u rd is h Italian French Kariy arra Ca ta la n EnglisDu h tc h Ng a rl u ma Pany jima PortugueseSpanis h Ge rma n Yinhawangk a Da n is h Kurrama Yindjibarndi Swedis h Ma r t u t h u n i r a Ny i y a p a rl i Ng a rl a Ny a mal Latvian IcelandicLithuanian Purduna Pay unguThalanyji Polis h No rwe g ia n b o k ma l Tharrkari Bardi Yingk arta Slov ak JiwarliWarriy angga Bunuba Kija Badimay a Cze ch Nh a n d a Ny nki a Ma l k a n a Go ayon di Mi r r n i n y Bulgarian Ng a d j u n mayWatjarri a iriwog n Mio i r Nu n g ai l Slovene Ma wn g Cro a tia n Uk ra ni ai n Iwaidja Ru s s ai n Na ka ra Amurdak Ga mi l a ra a y Yuwaalaraay Belarus ian Wiradjuri Yuwaliy aay Ng i y a mba a engerrdji j d r r e Me g n Wayilwan

angarrayi y a r r a g Ma n Gu mb a y n g g i r gar nNg i y n ri a Jawoyn Yay gir Mu r u wa r i Butc hulla yir l a rb i Dy Du u n g i d j a wu Gu i danj Gu g u _ Ba d h u n Waalubal Yorta_Yorta Gi d a b a l Wambaya mbl a ru a Dh Wemba_Wemba Erre Bidy ara Wotjobaluk Wangk umara Gu wa mu Burarra Pitta_Pitta Kungk ari Gu n g g a ri Gu rr-Go in Gu n y a Ma r g a n y arra r Ma r Ga n g u l u Alawa

Dj i n a n g Nh a n g u Yany uwa Biri Urn ng ai n g Wiri

Un miug

Warndarrang n o b a Da l Wubuy Di y a ri Wirangu Eas tern_Anmaty erre Yandruwandha Thirarri Ce n tra l _ Arre rn te Nh i rrp i Wes tern_Arrernte Ng a min i Yarluy andi

Ri th a rrn g u

Wulguru

Western_Wakaya

Yidiny Dj a b u g a y Adny amathanha Na rru n g g a Nu kun u

yaw i g y wa a Ny

Yalarnnga

Worrorra Dh a y' yi alyangapa p a g n a y l Ma Dh a n g u

Bularnu Warluwarra

Gu p a p u y n g u gandi d n Ng a Yawijibay a Kuninjk u Dj a p u u u_Yalanji Kuk

Kunwinjk u

un-jeihmi h i e -Dj Gu n rr i h d mi Yi _ u g u Gu

otenPa nyi inty Southern_Paak

galakgan a g k a l Ng a em r aRe g mb rrn a

Figure 1: Supertree of 496 languages used in Yin (2020).

28 yin_2020_data[160:169,]

## name glottocode onset_violation coda_violation language ## 160 Hindi hind1269 1 1 hind1269 ## 161 Hungarian hung1274 0 1 hung1274 ## 162 Jupda hupd1244 0 0 hupd1244 ## 163 Icelandic icel1247 1 1 icel1247 ## 164 Ikarranggal ikar1243 1 0 ikar1243 ## 165 Ogh Angkula ikar1243 0 1 ikar1243 ## 166 Inarisami inar1241 1 1 inar1241 ## 167 Inga inga1252 1 0 inga1252 ## 168 Irish iris1253 1 0 iris1253 ## 169 Isabi isab1240 1 0 isab1240 The function phylo_average() will tolerate the presence of duplicate labels in the language column, so long as the corresponding tips in the tree are sisters with equally-long branches immediately above them. These criteria will be satisfied if the duplicate tips have been generated using keep_tips() and branch lengths have been assigned using set_branch_lengths_exp() (or are left all at the same length), and consequently they are satisfied by Yin’s tree, and so the dataframe yin_2020_data is ready to be used with phylo_average() in conjunction with the tree object yin_2020_tree.

5.3 Calculating genealogically-sensitive proportions The results are calculated using phylo_average(), setting its phy argument to the tree we have constructed, yin_2020_tree, and its data argument to the dataframe we have prepared, yin_2020_data: yin_2020_results <- phylo_average(phy = yin_2020_tree, data = yin_2020_data)

## Warning in phylo_average(phy = yin_2020_tree, data = yin_2020_data): data contains non- ## numeric columns other than 'language', which have been ignored: name, glottocode Results are in the format described in Section 3. The first ten rows of phylogenetic weights according to the ACL and BM methods are: head(yin_2020_results$ACL_weights,n= 10)

## name glottocode language tree1 ## 1 Abkhaz abkh1244 abkh1244 8.094765e-03 ## 2 Abui abui1241 abui1241 9.898242e-04 ## 3 Achagua acha1250 acha1250 5.312938e-04 ## 4 Adang adan1251 adan1251 2.922504e-04 ## 5 Adnyamathanha adny1235 adny1235 9.574962e-05 ## 6 Adyghe adyg1241 adyg1241 8.094765e-03 ## 7 Hokkaidoainu ainu1240 ainu1240 1.241197e-02 ## 8 Alawa alaw1244 alaw1244 2.392739e-03 ## 9 Standardalbanian alba1267 alba1267 2.065340e-03 ## 10 Aleut aleu1260 aleu1260 7.588717e-03 head(yin_2020_results$BM_weights,n= 10)

## name glottocode language tree1 ## 1 Abkhaz abkh1244 abkh1244 0.0026310928 ## 2 Abui abui1241 abui1241 0.0008156741 ## 3 Achagua acha1250 acha1250 0.0005596008 ## 4 Adang adan1251 adan1251 0.0009625388 ## 5 Adnyamathanha adny1235 adny1235 0.0008300526 ## 6 Adyghe adyg1241 adyg1241 0.0027607143

29 ## 7 Hokkaidoainu ainu1240 ainu1240 0.0017704970 ## 8 Alawa alaw1244 alaw1244 0.0003004560 ## 9 Standardalbanian alba1267 alba1267 0.0012775881 ## 10 Aleut aleu1260 aleu1260 0.0069506848 The genealogically-sensitive proportions according to the ACL and BM methods are these: head(yin_2020_results$ACL_averages)

## tree onset_violation coda_violation ## 1 tree1 0.368859 0.4237406 head(yin_2020_results$BM_averages)

## tree onset_violation coda_violation ## 1 tree1 0.3428647 0.3005395 As a point of comparison, the raw proportions, which are equal to the means of the columns onset_violation and coda_violation, are these: mean(yin_2020_data$onset_violation)

## [1] 0.3649194 mean(yin_2020_data$coda_violation)

## [1] 0.3145161

6 Using these methods in typological research

We hope that typologists find the arguments in our main paper compelling at the concpetual level, and the methods in this supplementary document convenient at the practical level. One intended advantage PhyloWeights is that in future the code used to produce linguistic trees can be published as supplementary materials together with typological studies, enabling subsequent researchers to replicate the findings, and just as importantly, to modify the original assumptions by modifying the trees, and thereby to test further hypotheses inspired by the original study. If you find these methods useful, please cite the package and enable others to find it and use it too: CITATION TO BE FINALISED pending peer-review

References Anderson, Cormac et al. 2018. CLICS2: an improved database of cross-linguistic colexifications assembling lexical data with the help of cross-linguistic data formats. Linguistic Typology 22(2). 277–306. Green, Rebecca. 2003. Proto Maningrida with proto Arnhem: evidence from verbal inflectional suffixes. In Nicholas Evans (ed.), The non-Pama-Nyungan languages of northern Australia : comparative studies of the continent’s most linguistically complex region, 369–421. Pacific Linguistics. Hammarström, Harald et al. 2020. Glottolog 4.3. Jena. https://doi.org/10.5281/zenodo.4061162. https: //glottolog.org/. Paradis, E. & K. Schliep. 2018. Ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35. 526–528. Round, Erich R. 2017. The AusPhon-Lexicon project: 2 million normalized segments across 300 Australian languages. 47th Poznań Linguistic Meeting. Round, Erich R. 2021. PhyloWeights: Calculation of Genealogically-sensitive Proportions and Averages in Linguistics. R package version 0.1.0. https://github.com/erichround/PhyloWeights. Venditti, Chris, Andrew Meade & Mark Pagel. 2010. Phylogenies reveal new interpretation of speciation and the red queen. Nature 463(7279). 349–352.

30 Wickham, Hadley et al. 2021. dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https: //CRAN.R-project.org/package=dplyr. Yin, Ruihua. 2020. Violations of the sonority sequencing principle: How, and how often? 2020 Conference of the Australian Linguistic Society. https://als.asn.au/Conference/Program.

31