
Computational dialectology in Irish Gaelic

Brett Kessler∗
Department of Linguistics
Stanford University
Stanford CA 94305-2150 USA
[email protected]

∗ I thank Martin Kay, Paul Kiparsky, Tom Wasow, and a reviewer for helpful comments.

arXiv:cmp-lg/9503002v1 1 Mar 1995

Abstract

Dialect groupings can be discovered objectively and automatically by cluster analysis of phonetic transcriptions such as those found in a linguistic atlas. The first step in the analysis, the computation of linguistic distance between each pair of sites, can be computed as Levenshtein distance between phonetic strings. This correlates closely with the much more laborious technique of determining and counting isoglosses, and is more accurate than the more familiar metric of computing Hamming distance based on whether vocabulary entries match. In the actual clustering step, traditional agglomerative clustering works better than the top-down technique of partitioning around medoids. When agglomerative clustering of phonetic string comparison distances is applied to Gaelic, reasonable dialect boundaries are obtained, corresponding to national and (within Ireland) provincial boundaries.

1 Introduction

Defining dialects is one of the first tasks that linguists need to pursue when approaching a language. Knowing the dialect areas helps one allocate resources in language research and has implications for language learners, publishers, broadcasters, educators, and language planners. Unfortunately, dialect definition can be a time-consuming and ill-defined process. The traditional approach has been to plot isoglosses, delineating regions where the same word is used for the same concept, or perhaps the same pronunciation for the same phoneme. But isoglosses are frustrating. The first problem, as Gaston Paris noted (apud Durand, 1889:49), is that isoglosses rarely coincide. At best, isoglosses for different features approach each other, forming vague bundles; at worst, isoglosses may cut across each other, describing completely contradictory binary divisions of the dialect area. That is, language may vary geographically in many dimensions, but the requirements we usually impose require that a specific site be placed in a unique dialect. Traditional dialectological methodology gives little guidance as to how to perform such reduction to one dimension.

A second problem is that many isoglosses do not neatly bisect the language area. Often variants do not neatly line up on two sides of a line, but are intermixed haphazardly. More importantly, for some sites information may be lacking, or the question is simply not applicable. When comparing how various sites pronounce the first consonant of a particular word, it is meaningless to ask that question if the site does not use that word. So the isogloss is incomplete and cannot be meaningfully compared with isoglosses based on different sets of sites.

The third problem is that most languages have dialect continua, such that the speech of one community differs little from the speech of its neighbours. Even though the cumulative effects of such differences may be great when one considers the ends of the continua (such as southern Italian versus northern French), still it seems arbitrary to draw major dialect boundaries between two villages with very similar speech patterns. Such conundrums led Paris and others to conclude that the dialect boundary, and therefore the very notion of dialect, is an ill-defined concept.

More recently, the field of dialectometry, as introduced by Séguy (1971, 1973), has addressed these issues by developing several techniques for summarizing and presenting variation along multiple dimensions.
They replace isoglosses with a distance matrix, which compares each site directly with all other sites, ultimately yielding a single figure that measures the linguistic distances between each pair of sites. There is however no firm agreement on just how to compute the distance matrices. Séguy's earliest work (1971) was based on lexical correspondences: sites differed in the extent to which they used different words for the same concept. Séguy (1973), Philps (1987), and Durand (1989) use some combination of lexical, phonological, and morphological data. Babitch (1988) described the dialectal distances in Acadian villages by the degree to which their fishing terminology varied. Babitch and Lebrun (1989) did a similar analysis based on the varying pronunciation of /r/. Elsie (1986) grouped the Gaelic dialects on the basis of whether the vocabulary matched. Ebobisse (1989) grouped the Sawabantu languages of Cameroon by whether phonological correspondences in matching vocabulary items were complete, partial, or lacking. There seems to be a certain bias in favour of working with lexical correspondences, which is understandable, since deciding whether two sites use the same word for the same concept is perhaps one of the easiest linguistic judgements to make. The need to figure out such systems as the comparative phonology of various linguistic sites can be very time-consuming and fraught with arbitrary choices.

Not all dialectometrists agree on the wisdom of delineating dialect areas. Séguy (1973:18) insisted that the concept of dialect boundaries was meaningless, and his emphasis on the gradience of language similarity has been widely maintained. But those who do look for firm dialect affiliations (such as Babitch and Ebobisse) use bottom-up agglomerative techniques. The two linguistically closest sites are grouped into one dialect, and thenceforth treated as a unit. The process continues recursively until all sites are grouped into one superdialect embracing the entire language area under consideration. This yields a binary tree. But Kaufman and Rousseeuw (1990:44) suggest that when the emphasis in a clustering problem is on the top-level clusters (here, finding the two main dialects), then such bottom-up methods, which can potentially introduce error at each of several steps, are less reliable than top-down partitioning methods. Perhaps past researchers have used inferior bottom-up techniques simply because the necessary algorithms are computationally more tractable. Comparing all possible pairs of sites is a O(N^2) problem,[1] whereas considering all possible two-way partitions of the dialect area is O(2^N).

[1] The overall algorithm is O(N^3) since for each new group one must compute the distances between it and each of the other sites or groups.

The current state of dialectometry thus presents two main questions which constitute the methodological focus of this paper. The first deals with distance matrices. Is there a way to build accurate distance matrices that minimize editorial decisions without discarding relevant data? My research suggests that this may be done by using string distances computed directly on phonetic transcriptions, and that this is better than restricting the study to lexical comparisons. The second deals with clustering. Do bottom-up or top-down techniques work best? My conclusion is that the traditional bottom-up technique works better than a typical top-down method. These conclusions are based partly on an analysis of the mathematical properties of the clusters themselves, partly on how well they correlate with analyses based on more traditional isogloss techniques, and partly on how well they compare with previously-published descriptions of dialects in a specific language, Irish Gaelic.

At one time the Gaelic language group was spoken throughout Ireland, from where it spread to the Isle of Man and to much of Scotland. Currently fully native use of Gaelic is limited to a few discontiguous areas in the westernmost reaches of Ireland and Scotland. In the case of Ireland, everyone agrees that Gaelic is nowadays found in three main dialects: that of Ulster, that of Connacht, and that of Munster (Ó Siadhail, 1989). But several questions are raised that are less easily answered. Do the three provinces separate out so neatly for intrinsic linguistic reasons, or simply because their speakers have become so widely separated from each other geographically as speakers in intervening areas have adopted English? Does the language of Connacht naturally group with that of Ulster or with that of Munster? And looking beyond Ireland, many have commented that the language of Ulster in general is similar to that of Scotland. Are Irish, Manx, and Scottish Gaelic considered three separate languages for intrinsic linguistic reasons, or because they are spoken in different countries? To a large extent, dialectologists have found these questions difficult to answer because they accepted Paris's conundrum. For Ó Siadhail, the ultimate scientific justification in adopting the three-dialect account is the fact that the Gaeltacht (Irish-speaking territory) is so fragmented nowadays that it no longer forms a continuum. Ó Cuív (1951:4–49) felt that there can be no dialect boundaries because transitions are gradual. Elsie (1986:240) considers a dialect to be an area where all communities are linguistically more similar to each other than any community is to any site outside the dialect.
Such notions provide a very firm, absolute notion of dialecthood: a set of communities either constitutes a dialect area, or it does not. But as the dialectometrists have shown, other notions of clustering are equally scientific and may more accurately correspond to intuitive notions of what it means to be a dialect.

2 Data

[...] whenever two sites were in the same set and 1 whenever two sites were in contrasting sets, then taking the mean. This baseline approach corresponds formally to determining distance by the number of isoglosses that separate sites, which is in principle the traditional technique.

This baseline was compared to several other approaches. The etymon identity metric averaged the number of times the sites agreed in using words whose stem had the same ultimate derivation.
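
As a rough illustration of the isogloss baseline described above, the following Python sketch scores each pair of sites 0 when an isogloss puts them in the same set and 1 when it puts them in contrasting sets, then takes the mean over isoglosses. The site names, the variant labels, and the function names `isogloss_distance` and `distance_matrix` are hypothetical. Skipping isoglosses for which a site has no data is an assumption of this sketch, not a rule stated in the text.

```python
from itertools import combinations


def isogloss_distance(site_a, site_b, isoglosses):
    """Mean of per-isogloss 0/1 scores for one pair of sites.

    `isoglosses` is a list of dicts, each mapping a site name to the
    variant set that the isogloss assigns it to. A site absent from a
    dict has no data for that isogloss; such isoglosses are skipped for
    the pair (an assumption of this sketch).
    """
    scores = []
    for assignment in isoglosses:
        a = assignment.get(site_a)
        b = assignment.get(site_b)
        if a is None or b is None:
            continue  # no comparison possible on this isogloss
        scores.append(0 if a == b else 1)
    if not scores:
        return None  # the two sites share no isogloss with data
    return sum(scores) / len(scores)


def distance_matrix(sites, isoglosses):
    """Distances for every unordered pair of sites, keyed by (a, b)."""
    return {
        (a, b): isogloss_distance(a, b, isoglosses)
        for a, b in combinations(sites, 2)
    }


# Toy data: three hypothetical sites and three hypothetical isoglosses.
isoglosses = [
    {"A": "x", "B": "x", "C": "y"},
    {"A": "x", "B": "y", "C": "y"},
    {"A": "x", "C": "x"},  # site B has no data for this isogloss
]
sites = ["A", "B", "C"]
matrix = distance_matrix(sites, isoglosses)
```

With this toy input, sites A and B are compared on two isoglosses and differ on one of them, so their distance is 0.5. The dictionary returned by `distance_matrix` plays the role of the upper triangle of the distance matrix that a clustering step would consume.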