Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect

Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 985–990, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Yo Sato (Satoama Language Services, Kingston-upon-Thames, U.K.), Kevin Heffernan (Kwansei Gakuin University, Sanda, Hyogo, Japan)
[email protected], [email protected]

Abstract

We present a universal, character-based method for representing sentences such that the distance between any two sentences can be computed. With a small alphabet, the representation can function as a proxy for phonemes, and as one of its main uses we carry out dialect clustering: clustering a corpus in which dialects or sub-languages are mixed into sub-groups, to see whether these coincide with the conventional boundaries of dialects and sub-languages. Using data from multiple Japanese dialects and multiple Slavic languages, we report how well each group clusters, in a manner that partially responds to the question of what separates languages from dialects.

Keywords: clustering, dialects, similar languages, Japanese, Slavic languages, distance metric

1. Introduction

'A language is a dialect with an army and navy' is a well-known dictum attributed to the linguist Max Weinreich. It often happens that what are considered to be distinct languages appear more similar to each other than what are considered to be dialects. Our objective, however, despite what the sub-title might suggest, is not the pursuit of a clear-cut distinction. Instead, the idea that orients this work is that such a distinction should be dissolved into a gradient scale, i.e. that all groupings of languages or dialects are a matter of degree, relative to some metric.

In this exploratory work we experiment with a set of metrics and feature representations on shuffled corpora (classified corpora jumbled into one) for clustering, to see how close the resulting clusters come to the original grouping. We use two sets of classified corpora, one of 'dialects' (Japanese dialects) and one of similar 'languages' (East/South Slavic languages), each classified five ways, and report how well these dialects/languages cluster in relation to the conventional grouping, in a manner that partially answers the question of whether 'dialects' can be more different from each other than 'languages' are.

Clustering of languages/dialects is, in comparison to its supervised counterpart, their classification (or identification), a relatively less established field, presumably mainly because of the difficulty of achieving deployment-level performance. Given, however, the reality that digital language resources often come with languages and dialects mixed, work on their clustering has not just theoretical but practical value, as the technique can serve as a useful pre-processing step in an NLP workflow.

Performance aside, another major theme of this work is the data representation of a sentence. Clustering standardly requires a distance that can be computed between two given datapoints, which are typically represented by feature vectors; but what would be the best representation for the purpose of dialect clustering? If filling in the feature vectors requires too much human effort, that would defeat the object of unsupervised learning. Furthermore, a representation for cross-lingual clustering such as ours needs to be dialect- and language-independent, as we cannot presuppose the linguistic/dialectal identity. We therefore employ character-level feature vectors, taking the Roman alphabet as a proxy character set for phoneme-level representation.

We will show that, with this relatively simple setting, we can achieve a reasonable set of results. In terms of the dialect/language question, better results have in fact been achieved for the Japanese dialects than not only for the Balkan languages but also between the East and South Slavic languages. While this result by no means definitively shows that the first group is more internally similar than the second, given the exploratory nature of our experiments, it can be considered a first step towards a universal metric upon which an objective grouping of languages and dialects may be achieved.

2. Related Work

As stated, to the authors' knowledge, unsupervised clustering for text processing is not an area that has been extensively studied. However, the use of characters for text processing is not a new idea and has been explored in the context of dialect processing, where some unsupervised techniques have started to emerge. Furthermore, unsupervised clustering is an actively pursued topic in the speech recognition community, in a context somewhat similar to ours, as well as in biology, in a rather different context, namely the discovery of protein sequence patterns.

Dialect identification for text is clearly a closely related topic, studied actively for multi-dialectal language communities such as those of Arabic (Zaidan and Callison-Burch, 2013) and Chinese (Williams and Dagli, 2017). The varieties of English, which could also be considered dialectal variants, have likewise been the target of investigation (Lui and Cook, 2013). These studies have generally invoked some form of character-level processing, be it embeddings or N-grams. Scherrer (2014) provides a 'pipeline' that invokes several methods including character-level MT, in a partly unsupervised approach. However, these works rely on the presence of a 'standard' form of which the dialects are variants, making them characterisable as transfer or adaptation approaches, or as semi-supervised modelling. In this broad sense, they are similar in spirit to 'normalisation' studies that have nothing to do with dialect, as in Han et al. (2011), in which the authors deal with the 'noisiness' of social media content in an attempt to retrieve the 'standard' forms, or Saito et al. (2014), in which the authors try to group orthographically normalisable variants.

In contrast, our study starts from scratch: it neither assumes any 'standard' to which any particular group of sentences should be assimilated, nor uses any pre-trained model. It is in the domain of speech that such pure clustering has drawn more interest, since researchers there are interested in clustering individual articulatory realisations into groups, mainly accent groups. While accent identification can take the form of adaptation (the popular i-vector method, for example (Cardinal et al., 2015)), there have been attempts to cluster from scratch, using approaches such as Gaussian Mixture Models (Mannepalli et al., 2016) or the recently more popular auto-encoder (Kamper et al., 2015). Such models, however, require a more continuous feature space, and while it is conceivable to create a continuous feature space for texts, doing so would require some pre-processing to extract such features.

The rather unlikely domain from which we take the most inspiration is bioinformatics. For a relatively discrete feature space like text, a very similar challenge is faced in the sequence clustering used to discover types of proteins from discrete sequences of amino acids. Amongst the possible options, we employ the fairly recent Sequence Graph Transform (SGT) (Ranjan et al., 2016), as it claims to be less length-sensitive than popular alternatives such as UCLUST (Edgar, 2010).

Another area where a similar approach is taken is document clustering. For semantic, topic-based grouping, where unambiguous one-to-one labelling is often difficult, it is common to cluster documents with vectors instead of learning on a pre-labelled dataset (e.g. Sharma and Dhir (2009)). The difference, of course, is that the target there is a document, and the preferred units are words.

[...] trying and differentiating similar languages and dialects, where we use three methods of distance computation, all character-based: unigram, bigram and SGT, in order of simplicity.

All these methods require the same feature dimensionality for all datapoints. Thus it is incumbent on us to decide what character set, or alphabet as we will call it, is to be used. In their respective standard orthographies, Japanese and the East/South Slavic languages employ very different alphabets indeed, and within the latter there is a divide between Roman and Cyrillic characters. Given that there are one-to-one systems for mapping Japanese and Cyrillic characters into Roman ones, we have made the practical decision to use the latter. There then remains the question of diacritics for the East/South Slavic group. In the main, we use the de-accented ASCII counterparts for all the diacritics, though we will also show the results of keeping diacritics alongside. In addition, all characters have been lower-cased before the experimentation. We therefore mainly use a very restricted alphabet consisting of 28 characters, that is, the 26 Roman letters along with the comma (',') and the full stop ('.'). All other punctuation marks are normalised into one of these two.

Importantly, we did not use the space character. That is, all word boundaries have been deleted, making the sentence effectively a chain of printable characters. This is firstly to use characters as a proxy for the phonemes of normal speech, which are not 'separated' by pauses, and secondly to circumvent the difficulty of segmenting Japanese, which is not normally segmented and for which there are segmentation ambiguities.

For clustering, we use three popular methods: KMeans, divisive (top-down) hierarchical clustering (HC), and agglomerative (bottom-up) HC. There are some hyperparameters to tune, which we will discuss in the experiment section below. Furthermore, there is a sparsity issue for sentences: we cannot expect all the characters, or bigrams, to appear in every single sentence. Therefore, we [...]
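The alphabet restriction described above (romanisation aside) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is ours, and we fold all non-kept punctuation into the full stop, whereas the paper states only that it is normalised into one of the two kept marks.

```python
import string
import unicodedata

# The restricted 28-character alphabet: 26 Roman letters, comma and full stop.
ALPHABET = set(string.ascii_lowercase) | {",", "."}

def normalise(sentence: str) -> str:
    """Map a romanised sentence onto the restricted 28-character alphabet."""
    # Lower-case, then strip diacritics by decomposing characters (NFD)
    # and dropping the combining marks.
    decomposed = unicodedata.normalize("NFD", sentence.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    out = []
    for ch in stripped:
        if ch.isspace():
            continue                       # delete word boundaries entirely
        if ch in string.punctuation and ch not in ",.":
            ch = "."                       # fold other punctuation (our choice)
        if ch in ALPHABET:
            out.append(ch)
    return "".join(out)

print(normalise("Živjo, svet! Kako si?"))  # -> zivjo,svet.kakosi.
```

Note that deleting spaces, as in the paper, makes each sentence a single unsegmented character chain, so the same pipeline applies to romanised Japanese and de-accented Slavic text alike.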
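The unigram and bigram representations can be sketched as fixed-dimensional vectors over the 28-character alphabet; since this excerpt does not specify the weighting scheme, plain relative frequencies are our assumption, and the function names are ours.

```python
from collections import Counter
from itertools import product
import math
import string

ALPHABET = list(string.ascii_lowercase) + [",", "."]           # 28 symbols
BIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]    # 784 dimensions

def unigram_vector(sentence: str) -> list[float]:
    """Relative character frequencies over the fixed alphabet."""
    counts = Counter(sentence)
    total = len(sentence) or 1
    return [counts[c] / total for c in ALPHABET]

def bigram_vector(sentence: str) -> list[float]:
    """Relative frequencies of adjacent character pairs."""
    pairs = Counter(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total = max(len(sentence) - 1, 1)
    return [pairs[b] / total for b in BIGRAMS]

def euclidean(u: list[float], v: list[float]) -> float:
    """Distance between two sentence vectors of equal dimensionality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Anagrams coincide under unigrams but are separated by bigrams.
print(euclidean(unigram_vector("abc."), unigram_vector("cba.")))      # 0.0
print(euclidean(bigram_vector("abc."), bigram_vector("cba.")) > 0.0)  # True
```

Because every vector has the same dimensionality regardless of sentence length, any pair of sentences, from any of the mixed corpora, can be compared directly, which is the property the clustering step requires.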
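The clustering step itself can be reproduced with off-the-shelf tools; the sketch below uses scikit-learn, which provides KMeans and agglomerative (though not divisive) HC. The data here are artificial stand-ins for 28-dimensional character-frequency vectors, not the paper's corpora, and the hyperparameter choices are ours.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Two well-separated artificial "dialect" groups of 20 sentence vectors each.
X = np.vstack([rng.normal(0.0, 0.05, size=(20, 28)),
               rng.normal(0.5, 0.05, size=(20, 28))])
reference = np.array([0] * 20 + [1] * 20)   # the conventional grouping

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2,
                                       linkage="average").fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the reference grouping,
# so it quantifies "how well the clusters match the conventional boundaries".
print(adjusted_rand_score(reference, kmeans_labels))
print(adjusted_rand_score(reference, agglo_labels))
```

An external index such as the adjusted Rand score is label-permutation-invariant, which matters here because an unsupervised clusterer has no reason to assign the same cluster IDs as the reference classification.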
