
SPLITS AND TREES – OR WAVES AND WEBS?

CASE STUDIES IN THE ANDES, ACCENTS OF ENGLISH, AND INDO-EUROPEAN

April McMahon, Paul Heggarty, Warren Maguire, Rob McMahon – Colin Renfrew, Paul Heggarty

ARE PHYLOGENETIC ANALYSIS PROGRAMMES REALLY SUITABLE FOR LANGUAGE DIVERGENCE? IF SO, WHICH?

Much has been made in recent years of applying to language data ‘tree-drawing’ phylogenetic analysis techniques originally developed for the biological sciences. But are such models necessarily appropriate for analysing language divergence in the first place? This question arises out of research that combines new techniques from both linguistics and phylogenetics, in an attempt to push our methods to the greater precision required at the level of dialects, and indeed accents (regional and/or sociolinguistic), of a single language. It transpires, though, that these techniques have lessons valid much more widely, for all levels of language divergence. In particular, the latest network-type phylogenetic analysis programmes, which seek to accommodate more complex cross-cutting signals in the relationships between languages, emerge as much more faithful to the reality of language divergence than ‘tree-only’ methods.

NEW LINGUISTIC TECHNIQUES FOR MEASURING LANGUAGE DIFFERENCE

To provide more reliable linguistic inputs to these phylogenetic analyses, we have developed new linguistic methods that produce more precise measures of difference between languages (particularly important for the finer-grained differences at the accent and dialect levels). One of these methods measures difference in lexical semantics, but deliberately makes a number of radical departures from traditional lexicostatistical methodology, so as to be more sensitive to the complexities of meaning-to-word relationships in real languages, which cannot always be analysed well in all-or-nothing binary terms. A second method extends quantification to phonetics, measuring the degree of difference between pronunciations of cognate forms in different accents or languages; to be precise, it measures net divergence since their common ancestor. For more details and a demonstration of all methods referred to here, ask Paul Heggarty, who is attending the Languages and Genes conference.

THE LATEST PHYLOGENETIC TECHNIQUES: WEBS NOT TREES

The quantification results from these methods were then used as linguistic input to one of the latest techniques developed on the phylogenetic side too, NeighborNet by Bryant and Moulton (2002). What distinguishes this method is that it is not limited to producing a graphical output only in the form of a branching phylogeny, but can instead draw a more complex ‘web’, if this is a more faithful representation of the relationships between the languages in the data set than a simple tree-only structure would be. As such, NeighborNet is certainly well suited to analysing language variation at the level of dialects, since they tend naturally to be in relatively complex webs of interrelationships with each other, within a dialect continuum. This combination of new methods has been applied to two case studies, each of which offers some striking results.
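One way to make the tree-versus-web distinction concrete is the four-point condition: a set of distances fits a tree exactly only if, for every four varieties, the two largest of the three pairwise sums are equal. The sketch below is not NeighborNet itself, only a minimal check of how far a distance matrix departs from that condition. The function names and both toy matrices are hypothetical and are not taken from the study.

```python
from itertools import combinations


def symmetric(pairs):
    """Expand {(x, y): d} into a nested dict usable as dist[x][y] and dist[y][x]."""
    dist = {}
    for (x, y), value in pairs.items():
        dist.setdefault(x, {})[y] = value
        dist.setdefault(y, {})[x] = value
    return dist


def four_point_violation(dist):
    """Largest gap between the two biggest pairwise sums over all quartets.

    For distances that fit a tree exactly, the gap is 0 for every quartet.
    """
    worst = 0
    for a, b, c, d in combinations(sorted(dist), 4):
        sums = sorted([
            dist[a][b] + dist[c][d],
            dist[a][c] + dist[b][d],
            dist[a][d] + dist[b][c],
        ])
        worst = max(worst, sums[2] - sums[1])
    return worst


# Hypothetical toy distances, not taken from the study.
# Tree-like: varieties A and B split off together, as do C and D.
tree_like = symmetric({
    ("A", "B"): 2, ("C", "D"): 2,
    ("A", "C"): 4, ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4,
})
# Web-like: four varieties in a ring, each closest to its two neighbours.
web_like = symmetric({
    ("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1, ("A", "D"): 1,
    ("A", "C"): 2, ("B", "D"): 2,
})

print(four_point_violation(tree_like))  # 0
print(four_point_violation(web_like))   # 2
```

A violation of zero means a tree can represent the distances without distortion. The larger the violation, the more a tree-only method has to discard conflicting signal that a network method can display.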

HOW FINE CAN YOU GO?: PHYLOGENETIC ANALYSIS OF ACCENT-LEVEL DIFFERENCES IN ENGLISH

Our NeighborNets of accents of English within the British Isles reproduce graphical reflexes of the principal isoglosses separating them into groups (e.g. rhoticity, Scottish Vowel Length Rule). They also make for a surprisingly neat correlation with geography, at least at first glance – see Figure 1. We intend to follow this up with an exploration of whether there is any correlation also with genetic signals for the regional populations. Significantly, the phenograms produced from the same data by the SplitsTree 4 phylogenetic analysis programme are much less stable than the webs: removing an individual variety at random can immediately restructure the tree (even in other branches distant from the variety removed), but has much less impact on the web.

SOLVING A RELATEDNESS RIDDLE IN THE ANDES

A second study looked into the two main surviving indigenous language families of the Central Andes, Quechua and Aymara. These represent a particularly difficult case for traditional methods in trying to establish whether the two are or are not genealogically related to each other. Thanks to a number of methodological innovations in our approach to measuring difference in lexical semantics, we were able to combine results for the two families without making an a priori judgement on the issue. The corresponding NeighborNets make for a new type of evidence that argues strongly against relatedness: the families are much closer to each other for the less stable and more easily borrowable subset of our 150 sample word meanings (the NeighborNet on the left), much more distant for the more stable ones (on the right).
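The logic of this comparison can be sketched as follows: split the meanings into a more stable and a less stable class, then compare the average cross-family distance within each class. This is only an illustration of the reasoning, not the measure of lexical semantic similarity used in the study. The meaning labels, class names, and similarity scores below are invented toy values.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only:
# (meaning label, stability class, similarity between a Quechua and an
# Aymara variety on a 0..1 scale, where 1 means identical).
scores = [
    ("m1", "more_stable", 0.0),
    ("m2", "more_stable", 0.25),
    ("m3", "more_stable", 0.0),
    ("m4", "more_stable", 0.25),
    ("m5", "less_stable", 0.5),
    ("m6", "less_stable", 0.75),
    ("m7", "less_stable", 0.5),
    ("m8", "less_stable", 0.75),
]


def mean_distance_by_class(rows):
    """Average distance (1 - similarity) for each stability class."""
    by_class = {}
    for _meaning, stability, similarity in rows:
        by_class.setdefault(stability, []).append(1.0 - similarity)
    return {stability: mean(values) for stability, values in by_class.items()}


result = mean_distance_by_class(scores)
print(result)  # {'more_stable': 0.875, 'less_stable': 0.375}
```

In this toy pattern the families are closer on the easily borrowed meanings than on the stable ones, which is the direction one would expect from contact rather than from common descent.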

REDEFINING TRADITIONAL CLASSIFICATIONS

The detailed dialect results for Quechua alone were no less significant. For decades the traditional classification has represented this family as the familiar neatly branching tree. Yet as soon as we applied a phylogenetic analysis that is not restricted to necessarily drawing a tree in the first place, for Quechua it duly drew no tree at all, but a web of dialect continuum relationships, as in the diagrams above. This is by no means an automatic artefact of the programme: with the Aymara family, for which the surviving data are compatible with a clear branch, NeighborNet does indeed duly draw a tree. Far from being a helpful idealisation, in the case of Quechua the traditional tree-model appears on the contrary to be an oversimplification that is dangerously misleading as to the true history of the family.

A METHODOLOGICAL WAKE-UP CALL: WEBS BETTER THAN TREES?

This lesson has repercussions far more widely than just for Quechua – a general methodological wake-up call for the field. Is it not high time that we stopped paying nothing but lip-service to the wave model, before merrily pressing on with applying tree-only analyses to language divergence data? Trees are tempting, certainly: they are neater, simpler to get our heads around, perhaps even more intellectually satisfying. Yet none of that makes them necessarily the most faithful way of representing the actual relationships between real languages. Dialect continua are just as natural a form of language divergence as branching trees, even if the extinction of intermediate dialects may often make it look otherwise on the surface. (Indeed, even in the case of Aymara, this appears to be the case: it probably was more of a continuum originally, of which only the poles have survived.)

WHAT OF THE LANGUAGE FAMILIES OF EUROPE: TREES OR WAVES?

What of all this for Indo-European? It transpires on closer inspection that here too, the search for a perfect phylogeny invariably comes up against difficulties. Ringe et al. (2002) came across them particularly in the relationships between Germanic and the other main families of Europe; these are also precisely the nodes that have some of the lowest confidence values in Gray & Atkinson’s (2003) tree too. Linguists have long identified two main processes of language divergence – splits leading to branching trees, and the wave model leading to dialect continua. Applications of NeighborNet suggest that we need to ensure that our phylogenetic methods accommodate the latter model in a much more balanced way alongside the former, if we want to apply those methods to language divergence. Rather than singling out certain language data as unhelpfully ‘recalcitrant’ to fitting into a perfect tree structure, perhaps the problem lies rather in an approach that assumes that such a structure is necessarily suitable in the first place for representing all or even most relationships between natural languages. Our latest research project will be looking into precisely this issue for the main language families of Europe.

How well supported are the nodes separating the four major European sub-families in Gray & Atkinson’s (2003) tree of Indo-European languages? If we zoom in and look at the values attached to those nodes, highlighted in the red rings we’ve added, we can see how low they are, i.e. how weakly supported those branches are: Slavic vs. Celtic/Romance/Germanic = 44; Celtic vs. Romance/Germanic = 67; Romance vs. Germanic = 46.

[Figure: detail of Gray & Atkinson’s (2003) tree, with branches labelled Celtic, Romance, Germanic and Balto-Slavic.]

WHOSE RESEARCH IS THIS? WHERE TO FIND OUT MORE

The work reported on here is that of a multidisciplinary cluster of researchers in the UK, involving the linguists April McMahon, Paul Heggarty and Warren Maguire, the geneticist Rob McMahon, and the archaeologist Colin Renfrew, in three separate research projects.

• Quantitative Methods in Language Classification. English Language and Linguistics, University of Sheffield, June 2001-May 2004. April McMahon, Paul Heggarty, Rob McMahon, Natalia Slaska.

• Sound Comparisons: Dialect and Language Comparison and Classification by Phonetic Similarity. Linguistics and English Language, University of Edinburgh, Oct. 2005-2007. April McMahon, Paul Heggarty, Warren Maguire. www.soundcomparisons.com (Listen online to our accent database recordings.)

• Languages and Origins in Europe. McDonald Institute for Archaeological Research, University of Cambridge, June 2006-Sept. 2009. Colin Renfrew, Paul Heggarty. www.languagesandpeoples.com

Colin Renfrew and Paul Heggarty are both attending the Languages and Genes conference in Santa Barbara.

NEIGHBORNET REPRESENTATION OF THE RELATIONSHIPS BETWEEN REGIONAL VARIETIES OF QUECHUA AND AYMARA

As calculated from quantifications of their similarity in lexical semantics in Heggarty (2005), for different sub-lists of meanings. The two numbers indicate the distances between the ‘roots’ of the two families.

[Figure: two NeighborNets, with panel titles MORE STABLE MEANINGS and LESS STABLE MEANINGS. Distances between the family roots as printed: 56.6 and 26.5. Quechua varieties labelled: Southern, Central, S~C, North Peru, Intermediate, Ecuador. Aymara varieties labelled: Southern, Central.]

NEIGHBORNET OF 18 TRADITIONAL REGIONAL ACCENTS OF ENGLISH FROM BRITAIN AND

based on phonetic difference ratings, and showing a fairly close match with geography

[Figure: NeighborNet of the accents. Varieties labelled in the surviving text: Shetland, Buckie, Glasgow, Berwick, Holy Island, Renfrew, Coldstream, Cornhill, Hawick, Morpeth, Antrim, Tyneside, Tyrone, Liverpool, Dublin, Sheffield, London. Dividing lines labelled: R~PR, PR~NR, R~NR.]

English accents can be classed into three main groups by their pronunciation of /r/. The boxes and arrows show the corresponding dividing lines between them. R = Rhotic; PR = Partially-Rhotic; NR = Non-Rhotic.