Representing the Germanic Languages Using Median- Joining Phylogenetic Networking
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Wits Institutional Repository on DSPACE Germanic and its Network: Representing the Germanic Languages Using Median- Joining Phylogenetic Networking Master’s Dissertation Andrew Peter Fidleir Hall Student No: 480138 University of the Witwatersrand Faculty of Humanities School of Literature, Language and Media (SLLM) 1 Contents: Key to Conventions in Formulae 4 Chapter 1 6 Aim of Study: 6 Background: 6 Rationale: 7 Chapter 2 8 Literature Review: 8 Chapter 3 15 On the Organisation and Classification of the Germanic Languages: 15 A Brief Explanation of Median-Joining: 20 Method: 23 Subgroup1: The Old Germanic Languages: 25 Subgroup 2: The Modern Germanic Languages: 32 Subgroup 3: Old and Modern Languages-Combined Simulations: 39 Subgroup 4: The Old and Modern Germanic Languages with Proto-Germanic: 41 Chapter 4 44 Results: 44 Subgroup 1: The Old Germanic Languages: 44 Subgroup 2: The Modern Germanic Languages: 53 Subgroup 3: Old and Modern Germanic Languages– Combined Simulations: 62 Subgroup 4: Old and Modern Germanic Languages with Proto-Germanic: 81 Discussion: 89 Conclusion: 101 Acknowledgments: 102 References: 103 Appendix A: 107 2 Appendix B: 161 3 Key to Conventions Used in Lexical Structure Formulae Consonants: Vowels: C = consonant V = vowel alv = alveolar back = back vowel app = approximant fro = front vowel cont = continuent hi = high vowel dist = distributed low = low vowel fric = fricative mid = mid-vowel gli = glide ro = round vowel intdent = interdental lab = labial lab-dent = labio-dental lat-app = lateral approximant nas = nasal obst = obstruent pal = palatal pal-alv = palato-alveolar plo = plosive postalv = postalveolar son = sonorant sib = sibilant stp = stop trill = trill vo = voice These symbols describe segmental features in the lexical items or parts of lexical items examined for cognacency where they were felt appropriate, such as where similar or related, but not identical, sounds occurred in the same positions across languages. Where the same 4 sounds occurred in the same positions of languages under examination, standard IPA symbols for specific sounds were used. Standard IPA sysmbols were also used in instances where individual words were transcribed. A plus sign, “+ “ before a feature indicates its presence. A minus, “−“ its absence. The slash “ / “ symbol is used in two different ways in this study: (1) the lexical structure formulae, where it appears in the subscript sections in which vowel or consonant features are provided, it indicates that either of two given features occur in the data for the segment in that part of the word or root (for example, [+alv/+pal] indicates the particular segment to which this is attached has either the feature [+alveolar] or [+palatal], depending on the language or variety looked at), and (2) in formulae which show segment conditioning or change it is used as it normally is to indicate “in the environment of” (for example, C[+vo] C[-vo]/_#, read as “a voiced consonant is devoiced at the end of a syllable boundary”). An italicised segment in a lexical structure formula represents an instance where varieties have segments which form a class of phonetically or perceptually related sounds but which have a significant number of different segmental features (for example, r indicates the presence of a rhotic consonant). When placed around a segment, normal parentheses, (), indicate that the parenthesised segment is widespread but not universal in a set of data items. 5 Germanic and its Network: Representing the Germanic Languages using Median- Joining Phylogenetic Networking. Chapter 1 Aim of this study This study has two main aims. The first is to examine the effects of the use of different lexical items on the generation of a network of the Germanic languages. The second is to determine the effects on network generation of different coding strategies. The model used is the median-joining program Network. The study is ultimately intended to provide information on how applicable this model might be to linguistic data, and to indicate how and under what conditions it may be helpful to linguistic analyses. Background The use of a number of bioinformatics tools from the fields of molecular biology and genetics as possible tools for language classification has attracted a degree of interest in recent times. It has, however, received mixed reviews from specialists, with some claiming that they may represent valid ways of classifying languages (Atkinson & Gray, 2005), and others claiming that the use of these methods is highly problematic (Atkinson & Gray, 2005; Heggarty, 2006). This dissertation will examine how the use of different coding strategies for lexical data, as well as different choices of lexical data, will influence the structure of networks generated for the Germanic languages. This study will be based on a study conducted by Forster, Röhl and Polzin (2006), and will use largely the same data, although some varieties of Germanic languages which they did not use, such as Afrikaans and Flemish, will be included. Missing data was handled in one of three ways to determine what the effect of each approach is (removing all missing items, coding missing items as deletions and assigning the most common code to the missing items); different words were also selected for use under different conditions to determine what effects this would have on network generation (for instance influencing where language nodes are placed). It is hoped that, by doing this, potential strong and weak points of the use of median-joining phylogenetic networking will be highlighted, and that some guidelines for assessing this method’s accuracy may be arrived at. The classification of languages can be viewed in many ways as analogous to the classification of biological organisms into groups of species (Atkinson & Gray, 2005). Classification methods and schemes used in diachronic and comparative linguistics are, at an abstract level, based on a number of concepts similar to those used in biology.In both cases comparisons are based on similarities between objects (languages or organisms). In both emphasis is placed on features deemed unlikely to have arisen in parallel, such as conserved 6 nucleic acid and protein structures in biology, and sound correspondences and shared morphology in linguistics. This has led to a certain amount of interest in using programmes originally designed to compare biological information as a tool for comparing linguistic information (Atkinson & Gray, 2005). One form of model used, which can construct relationships using both inherited and borrowed pieces of information, is termed phylogenetic networking. Phylogenetic networking models are not designed to generate trees or lines of borrowed information, but to look for points of data and connect them in ways which indicate whether or not those individual pieces of data are related to each other; that is, they construct a network of points showing which sequence inputs are related to each other, and what similarities they share (Forster, Toth & Bandelt, 1998). This dissertation will use the program Network (available at http://www.fluxus-engineering.com) to generate a phylogenetic network for Germanic vocabulary items. The network generated by the program will be compared to traditional classifications of the Germanic languages. By doing this, the utility of this method when applied to lexical data can be investigated, as it will allow one to determine how different ways of coding and handling the data may affect the results of the program. This may be useful in determining the reliability of the method. A result which would potentially indicate that the method is reliable would be if the program generated a network in which more divergent languages were found at points further outside the network, and which grouped similar varieties together. If the program were to generate a network which, for instance, placed Gothic alongside Modern English, such a result would be taken to indicate either an unreliable method or a problem with the coding of the data items. This would then be open to investigation. The issue of using these methods to date language divergence will not be investigated due to constraints on the size and scope of this paper. It is hoped that some indication of how reliable this method is for linguistic classification may be reached, and that indicators of future directions in research may be provided. Rationale The rationale behind this study is that, theoretically, linguistic and biological data may behave in similar ways. Following this it is thought that, at an abstract level, models designed to handle biological information should be able to handle linguistic information in such a way that they provide useful information regarding how language varieties relate to each other. Forster, Polzin and Röhl (2006) used words from the Swadesh 100-word list, many of which are thought to be less likely to be borrowed at high rates from other languages (this view has been challenged, however (Campbell, 2004)). However, this on its own does not necessarily mean that the words chosen from the individual languages will be historical cognates. One issue with Forster et al.’s (2006) dataset is that some words which are cognate with those in a number of the other languages have not been included as data items. For instance, in Modern English the word beam is not included in the dataset at all, despite the fact that it is a modern continuation of Old English bēam “tree”, and as such is a cognate of the various words for “tree” in the other West Germanic languages. This is despite the fact that its meaning has changed. By not including cognates which have undergone semantic changes, the placement of certain languages within the network may have been oversimplified. Additionally, in instances where data items are missing different ways of handling this situation may have a 7 sizeable impact on the results.