Guide to Arawakan and Tupí-Guaraní Lexical Datasets

Lev Michael

1 Introduction

One of my major areas of scholarly activity for the last several years has been the development of two sets of lexical datasets for historical linguistic research: one set focused on Arawakan and the second on Tupí-Guaraní languages. This document is intended as a brief guide to these datasets, to facilitate their review. Note that in a number of key places in this pdf, sections of text are hyper-linked to the datasets to facilitate accessing them; these links are repeated in §4.

2 Arawakan Datasets

I began developing the Arawakan datasets in 2015 to carry out comparative work on the Arawakan family, with the short-term goal of developing an internal classification employing computational phylogenetic methods, which should make it possible to identify the Proto-Arawakan homeland and to develop a model for geographic dispersal of . In the longer term I plan to apply the comparative method to the major subgroups of the family and to the family as a whole. In developing the datasets I have sought to include every Arawakan language for which lexical data exists, from lexically well-documented modern languages to fragmentarily documented languages that became extinct in the colonial period. The languages included in the Arawakan dataset are:

1. Achagua
2. Anauyá
3. Añun
4. Apurinã
5. Araicu
6. Aruan
7. Asháninka
8. Ashéninka
9. Atorada
10. Bahuana
11. (Central dialect)
12. Bare
13. Baure
14. Cabiyarí
15. Canamari
16. Cariay
17. Chamicuro
18. Curripaco
19. Enawene-Nawe
20. Garífuna
21. Guinau
22. Ignaciano
23. Iñapari
24. Island Carib (= ‘Old Garífuna’)
25. Jumána
26. Kustenau
27. Kaishana
28. Kakinte
29. Karipurá
30. Kuniba
31. Kushitineri
32. Lokono
33. Maipure
34. Manao
35. Mandahuaca
36. Mapidian
37. Marawa
38. Marawan
39. Mariate
40. Matsigenka
41. Mawakwa
42. Mawayana
43. Morike
44. Nanti
45. Nomatsigenga
46. Old Achagua
47. Old Baure
48. Old
49. Old Mojeño
50. Old Yawalapití
51. Pacaguara
52. Pajonal Asháninka
53.
54. Parecis
55. Passé
56. Paunaka
57. Piapoco
58. Resígaro
59. Saraveka
60. Taíno (xxx dialects)
61. Tariana
62. Terena
63. Uirina
64.
65. Warekena do Xié
66. Warekena Velha
67.
68. Wayuu
69. Yabaana
70. Yanesha’
71. Yavitero
72. Yawalapití
73. Yine
74. Yukuna

There are two Arawakan datasets, which I describe in the following sections: the Comparative Arawakan Lexical Dataset, and the Arawakan Grapheme-Phoneme Correspondences.

2.1 Comparative Arawakan Lexical Dataset

The Comparative Arawakan Lexical Dataset (CALD) was collected using an 837-item concept list. This concept list was designed to include the concepts employed in the Comparative Tupí-Guaraní Lexical Dataset, which includes a large number of concepts specifically appropriate to lowland South America (e.g., ‘jaguar’), as well as to ensure the inclusion of all data from fragmentarily-documented extinct languages. A number of concepts that I believed to be particularly relevant for Arawakan languages (e.g., ‘domestic animal’ and Brycon cephalus) were also added. The coverage for the languages in the dataset varies considerably: for well-documented languages, it was possible to harvest data for almost every concept, while for others, the total quantity of data available is small. Thus, 1301 items were harvested for Apurinã (i.e. for some concepts, multiple items were harvested), while for Pacaguara,1 the entirety of the data available for the language consists of a few dozen words.

CALD is currently organized as a spreadsheet workbook, with each variety given its own sheet. The sheets are organized for importation into a database, with each column corresponding to the contents of a single database field. The content of each column is described in the Fields sheet at the far left of the workbook, but I here summarize the most interesting ones. The ORT, PHM, and PHT fields provide the original forms harvested from the sources, depending on whether the source representation is most accurately characterized as ‘orthographic’ (ORT), ‘phonemic’ (PHM), or ‘phonetic’ (PHT). The FUN column contains my phonemic representation of the forms (see below). The original translation for the item is found in the TRA column, while the English-glossed concept for which the item was harvested is found in the TUE column. Any notes from the source are found in the CSO column, while notes written by anyone involved in the compilation of the dataset are found in the CSA column (with their initials). The source information is given in the SRC and REF columns.
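The column layout described above can be pictured as one record per harvested item. The following is a minimal Python sketch, not the actual CALD schema: the field codes come from the description above, but the class name, the helper method, the assumption that exactly one of ORT, PHM, and PHT is filled per item, and all sample values are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CaldItem:
    """One harvested item; attribute names mirror the CALD field codes."""
    tue: str                    # English-glossed concept used for harvesting
    tra: str                    # original translation given in the source
    fun: str                    # compiler's phonemic representation
    src: str                    # source code (e.g. HG, TC, NA)
    ref: str                    # further source information (content assumed)
    ort: Optional[str] = None   # source form, if 'orthographic'
    phm: Optional[str] = None   # source form, if 'phonemic'
    pht: Optional[str] = None   # source form, if 'phonetic'
    cso: str = ""               # notes taken from the source
    csa: str = ""               # compilers' notes, with initials

    def source_form(self) -> Tuple[str, str]:
        """Return (field_code, form) for the single filled source-form field.

        Assumption: each item fills exactly one of ORT, PHM, PHT.
        """
        candidates = (("ORT", self.ort), ("PHM", self.phm), ("PHT", self.pht))
        filled = [(code, form) for code, form in candidates if form]
        if len(filled) != 1:
            raise ValueError("expected exactly one of ORT, PHM, PHT")
        return filled[0]


# Placeholder values only; these are not real data from any source.
item = CaldItem(tue="jaguar", tra="tigre", fun="tata",
                src="XX", ref="p. 0", ort="tata")
print(item.source_form())
```

A check like `source_form` would be one way to flag rows in which the source form was entered in more than one column, or in none, before importation into a database.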

Sources

Sources are generally published works on the languages in question (bibliographic references are found in the Sources tab at the far left of the workbook). In the earlier stages of developing this dataset, however, I also drew on three datasets developed by colleagues, which are important to mention. They are indicated as HG, TC, and NA in the SRC column.

HG The South American component of the Languages of Hunter-Gatherers and their Neighbors: Database developed by Patience Epps at the University of Texas at Austin.2

TC An unpublished dataset of lexical data for Arawakan languages of northwestern Amazonia developed by Thiago Chacon at the Universidade de Brasília. Note that in some cases the data in the TC dataset was harvested from the HG database (above).

NA The unpublished Northern Arawakan dataset developed by Tammy Stark for her 2018 Berkeley dissertation on comparative northern Arawakan morphosyntax.3

In recent years I have been checking and replacing the data harvested from the above datasets, since they often do not provide the original translations (corresponding to the CALD TRA field), but instead the concept used to harvest the items, or do not provide the page number in the source where the particular items can be found (indeed, in some of these other datasets it is not always clear what the source is). This has also given me the opportunity to correct errors found in those datasets.4

1 Not to be confused with the dialect of Chácobo, a Panoan language, that sometimes bears the same name.

2 Epps, Patience. South American Languages. In Languages of Hunter-Gatherers and their Neighbors: Database, Claire Bowern, Patience Epps, Jane Hill, and Patrick McConvell. https://huntergatherer.la.utexas.edu.

3 Stark, Tammy. 2018. Northern person marking and alignment: A comparative and diachronic analysis. UC Berkeley PhD dissertation.

4 I wish to be clear that I am not casting stones: development of large lexical datasets like this unavoidably requires relying on research assistants, whose quality of work inevitably varies. Checking and rechecking harvested data is an essential part of curating such datasets.

‘Old’ languages

Note that I have included both modern lexical data and, when available, lexical data drawn from early colonial sources. Of course, in the cases of some languages, older sources are the only ones available, but I have also deliberately harvested data collected during the early colonial period, as, for example, in the case of Lokono, where we have an anonymous vocabulary dating from 1765 to draw from. Lexical data from older sources like this is kept distinct from the corresponding modern data and distinguished as, in this case, ‘Old Lokono’. The purpose of harvesting both modern and early data is to provide calibration points for the planned phylogenetic analysis of the Arawakan cognate sets. With calibration points like these,5 it is possible to develop an absolute chronology for the phylogenetically-inferred tree.

2.2 Arawakan Grapheme-Phoneme Correspondence Dataset

The sources from which lexical data was harvested employ a range of representations for the words they contain. In order to facilitate cognate set construction, and to make feasible subsequent correspondence set construction, a mapping has been developed from the graphemes used in each source to a corresponding phonemic representation of the grapheme in question. These correspondences were subsequently applied to the harvested lexical data (in the ORT, PHM, or PHT columns), and the resulting forms placed in the FUN column for the language in CALD. When phonological descriptions were available for languages, they were employed to develop the mappings.

The Arawakan Grapheme-Phoneme Correspondence Dataset (AGPCD) is a dataset that indicates how the graphemes found in each source from which lexical data was harvested were converted to the phonological representation given in the FUN column of CALD. Each language has a separate sheet in the AGPCD spreadsheet workbook, and each lexical source and phonological description for a given language is given a column in the sheet for the given language, allowing one to see the correspondences between the various sources. The AGPCD serves to record the grapheme replacements carried out in creating the FUN forms, making it possible to evaluate and easily revise these replacements in the future. Note that while this is potentially relevant to any language, this will no doubt prove especially important for languages where the lexical data was collected by non-linguists, and for which no phonological description exists. This is especially true for extinct languages for which the only data comes from colonial-era sources.
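To illustrate what applying a grapheme-phoneme mapping to a source form involves, here is a minimal Python sketch. It is not the procedure actually used to produce the FUN forms: the function name, the longest-match-first strategy, and the sample mapping are all assumptions made for illustration, and the mapping is not drawn from the AGPCD.

```python
def apply_mapping(form, mapping):
    """Convert a source form to a phonemic string.

    At each position the longest matching grapheme is replaced first, so
    that a digraph such as 'ch' is not split into 'c' + 'h'. Characters
    with no entry in the mapping pass through unchanged.
    """
    graphemes = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(form):
        for grapheme in graphemes:
            if form.startswith(grapheme, i):
                out.append(mapping[grapheme])
                i += len(grapheme)
                break
        else:
            out.append(form[i])
            i += 1
    return "".join(out)


# Hypothetical mapping for a single Spanish-style source orthography.
mapping = {"ch": "tʃ", "sh": "ʃ", "qu": "k", "c": "k", "j": "h"}
print(apply_mapping("chaqui", mapping))
```

Keeping one such mapping per source, as the AGPCD does with one column per source, means that a revised analysis of a grapheme only requires changing the mapping and regenerating the affected forms.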

3 Tupí-Guaraní Datasets

The Tupí-Guaraní (TG) datasets are being developed with two collaborators, Natalia Chousou-Polydouri6 and Zachary O’Hagan,7 and in the early years of their development, Vivian Wauters.8 There are two TG datasets, the Comparative Tupí-Guaraní Lexical Dataset (CTGLD) and the Tupí-Guaraní Cognate Sets (TGCS). I began the development of these datasets in 2010, and they have led to two publications (Michael et al. 2015 and O’Hagan et al. submitted), as well as numerous conference presentations. Since 2015 the focus of our work has been refining the cognate sets and adding new lexical data. This has included: 1) searching in previously-harvested sources for cognates that have undergone semantic shift,9 and are thus not easily findable using the original concept list; and 2) incorporating newly available lexical data, e.g., Ramirez (2018), which significantly improves our lexical knowledge of Warazu (also known as ). The languages for which data has been harvested include 33 TG languages, and two non-TG Tupian languages that served as outgroups for TG in the phylogenetic analysis:

1. Aché
2. Anambé
3. Apiaká
4. Araweté
5. Avá Canoeiro
6. Emerillón
7. Chiriguano
8. Guajá
9.
10. Guarayu
11. Jorá
12. Ka’ápor
13. Kaiowá
14. Kamayurá
15. Kayabí
16. Kokama
17. Ñandeva
18. Omagua
19. Paraguayan Guaraní
20. Parakan
21. Parintintin
22. Siriono
23. Tapiete
24. Tapirapé
25. Tembé
26. Asuriní
27. Tupinambá
28. Turiwará
29. Warazu
30. Wayampí
31. Xetá
32. Xingu Asuriní
33. Yuki
34. Awetí (non-TG Tupian)
35. Mawé (non-TG Tupian)

5 Another important source of calibration points is the use of archaeological findings to date splits, e.g., as in the case of the Arawakan expansion into the Caribbean, which plausibly correlates with the split of the Garífuna-Taíno clade from its sisters within the Caribbean subgroup.

6 Currently a researcher at Dynamique du Langage, CNRS, Lyon.

7 A UC Berkeley graduate student.

8 Formerly a UC Berkeley graduate student.

3.1 Comparative Tupí-Guaraní Lexical Dataset

The Comparative Tupí-Guaraní Lexical Dataset (CTGLD) exists in two forms, the original data-harvesting spreadsheet, and a format for database importation similar to the one in which the Arawakan data is housed. We are currently in the process of moving all the data into the latter format. The CTGLD was initially developed using a 597-item concept list, which includes a large number of concepts specifically appropriate to lowland South America. Subsequently, additional data was collected from lexical sources to flesh out cognate sets by searching for cognates that have undergone semantic shift, as discussed above. The forms collected in this way are labeled with an @ following the concept with which they are (loosely) associated, e.g. ‘rope@’.
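The @ convention makes it straightforward to separate forms harvested under the original concept list from those found later by searching for semantically shifted cognates. The following Python sketch shows one way a concept label could be split; the function name and return shape are hypothetical and not part of the CTGLD itself.

```python
def parse_concept_label(label):
    """Split a concept label into (concept, added_by_shift_search).

    A trailing '@' marks a form that was collected while searching for
    cognates that have undergone semantic shift, so it is only loosely
    associated with the named concept.
    """
    label = label.strip()
    if label.endswith("@"):
        return label[:-1], True
    return label, False


print(parse_concept_label("rope@"))
print(parse_concept_label("rope"))
```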

9 The identification of these cognates has been carried out by predicting the relevant forms on the basis of sound correspondences and comparative lexical data in the cognate sets.

Note that in general, the data in the CTGLD is given in the representation used in the source from which it was harvested. Unlike in the Arawakan dataset, grapheme-phoneme substitutions have been carried out for only a small number of TG languages.10

3.2 Tupí-Guaraní Cognate Sets

The Tupí-Guaraní Cognate Sets (TGCS) are organized by semantic domain into four sheets in a spreadsheet workbook, due to the size of the dataset. Cognate set labels are given in column B (generally formed from the concept under which most elements of the cognate set were collected, plus a numeral, e.g. TREE3).11 Note that we form cognate sets both for roots, which include all forms that contain the root, and also for morphologically complex forms,12 especially compounds. In the latter case, we are treating the compounding of the two roots as a heritable feature. If the heritable feature that defines the cognate set is a particular root+morpheme combination, or a compound, this is indicated in column A (as COMPLEX or COMPOUND, respectively).

In the first line of any block of cognate set labels that are formed on the same concept (e.g. BELLY1, BELLY2, BELLY3, etc.), concept names that appear in capital letters (e.g. INTESTINE) in that line indicate that items harvested under the former concept (i.e. BELLY) have been moved to cognate sets whose labels are formed on the latter concept (i.e. INTESTINE). This is typically due to semantic shift.
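Because cognate set labels combine a concept name with a numeral, they can be grouped into blocks by concept mechanically. The Python sketch below is an illustration only: the function names are hypothetical, the regular expression assumes labels consist of capital letters followed directly by digits (as in TREE3), and the gap check assumes numbering within a block starts at 1.

```python
import re
from collections import defaultdict

# Assumed label shape: capital letters followed directly by a numeral.
LABEL_RE = re.compile(r"^([A-Z]+)(\d+)$")


def group_labels(labels):
    """Group cognate set labels into {concept: sorted list of numerals}."""
    groups = defaultdict(list)
    for label in labels:
        match = LABEL_RE.match(label.strip())
        if match is None:
            raise ValueError(f"unrecognized cognate set label: {label!r}")
        groups[match.group(1)].append(int(match.group(2)))
    return {concept: sorted(nums) for concept, nums in groups.items()}


def numbering_gaps(numbers):
    """Return the numerals missing between 1 and the highest one used."""
    present = set(numbers)
    return [n for n in range(1, max(present) + 1) if n not in present]


# Made-up label inventory for illustration.
groups = group_labels(["BELLY1", "BELLY2", "BELLY4", "TREE3"])
print(groups)
print(numbering_gaps(groups["BELLY"]))
```

Gaps reported by a check like this are expected rather than errors, since sets once thought to be distinct may have been merged.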

4 Links

I here repeat the links included above for easy access:

• Comparative Arawakan Lexical Database

• Arawakan Grapheme-Phoneme Correspondence Dataset

• Tupí-Guaraní Cognate Sets

• Tupí-Guaraní Comparative Lexical Dataset

– single spreadsheet

– database importation format

10 Specifically, for Omagua, Kokama, and Tupinambá, in the database importation format dataset, for an ongoing project to reconstruct the Proto-Omagua-Kokama-Tupinambá phonological inventory.

11 Cognate set labels may have gaps in their numeration due to the merging of sets previously thought to be distinct.

12 For example, in most TG languages, the concept STAR is expressed as a compound of the words MOON and FIRE.
