Analysis of Worldcat Language Data by Adam Schiff for LCDGT Working Group

MID15-/6 appendix2


Ed,

This is exactly the sort of thing we were trying to figure out. I think we could throw out the clearly incorrect results (e.g. "audio visual aids") and what would remain would be a pretty good indication of which language terms were the most common and should be used to create demographic group terms for speakers of those languages.

Just glancing through the spreadsheet, I identified the following terms, listed from most to least used in WorldCat records. I cumulated terms that were obvious typos or incorrectly formulated but were clearly meant to be a particular language, language group, or dialect. I have given counts only for languages with 10 or more uses.

English 69,629; Chinese 13,454; Spanish 7,523; Japanese 5,465; French 3,654; German 3,146; Russian 2,963; Korean 2,449; Vietnamese 1,749; Polish 1,078; Arabic 986; African 776; Italian 747; Thai 578; Portuguese 540; Turkish 532; Dutch 525; Greek 391; Persian 387 (see also Farsi); Hebrew 375; Indonesian 372; Czech 350; Swedish 326; Hungarian 256; Hindi 234; Serbo-Croatian 217; Afrikaans 207; Danish 204; Finnish 188; Armenian 187; Yiddish 187; Urdu 180; Khmer 162 (see also Cambodian); Romanian 137; Hmong 134; Cantonese 128; Lao 128; Tagalog 111; Panjabi 109; Ukrainian 105; Norwegian 101; Malay 89; Bengali 86; Serbian 67; Aboriginal Australian/Aboriginal/Australian aboriginal 65; Mandarin 65; Bulgarian 62; Indic 61; Croatian 61; Slovak 61; Samoan 60; Slovenian 59; Haitian Creole/Haitian French Creole 51; Navajo 49; Gujarati 44; Somali 44; Burmese 41; Asian 41 (see also Oriental); Welsh 40; Iloko 38; Mongolian 37; Tamil 35; Amharic 33; Latvian 33; Oriental 33 (see also Asian); Georgian 32; Uighur 31; Creole 30; Icelandic 30; Swahili 30; Ladino 28; Sinhalese 28; Farsi 26 (see also Persian); Maori 25; Telugu 21; Fijian 20; Malayalam 20; Azerbaijani 19; Cambodian 19 (see also Khmer); Macedonian 19; Tongan 19; Marathi 18; Yoruba 18; Estonian 17; Hausa 17; Philippine 17; Lithuanian 16; Polynesian 16; Tatar 15; Nepali 14; Tibetan 14; Indochinese 13; Irish 13; Niuean 13; Taiwanese 13; Eskimo 12; Zulu 12; Catalan 11; Quechua 11; Rarotongan 11; Uzbek 11; Belarusian 10; Bosnian 10; Shona 10

Used 1 to 9 times:

Kannada, Kazakh, Tigrinya, Wolof, Hawaiian, Cree, Inuit, Slavic, Xhosa, Konkani, Mayan, Sotho, Tetum, Scandinavian, Kashmiri, Kurdish, Minangkabau, Chamorro, Lingala, Yupik, Bantu, Tahitian, Bashkir, Otomi, Karaim, Yugoslav, Bru, Maltese, Basque, Igbo, Malagasy, Ethiopian, Tswana, Breton, Crow, Gilbertese, Gorontalo, Javanese, Oriya, Oromo, Tontemboan, Papiamentu, Sundanese, Chuvash, Laguna dialect, Northern Sotho, Aymara, British Sign Language, Buginese, Buriat, Cebuano, Drai, East Indian, Ganda, Kanuri, Kimbundu, Kyrgyz/Kirghiz, Mongo, Ossetic, Pushto, Romance, Sindhi, Tombulu, Yakut, Inupiaq, Afghan, Bassari, Manipuri, Sauraseni, Tibeto-Burman, Bena Bena, Biangai, Bisayan, Cheyenne, Esperanto, Ewe, Faroese, Judeo-Arabic, Kamano, Karakalpak, Karanga, Khakass, Khun, Kinyarwanda, Luba Lulua, Melanesian, Micronesian, Mota, Siksika, Syriac, Tajik, Tiv, Tokelauan, Toraja, Tunebo, West Kewa, Amuesha, Anufo, Attie, Baoule, Dravidian, Luba, Naga, Sino-Tibetan, Sissala, Sranan, Tsonga, Venda, Yanomamo, Adyghe, Assamese, Athabascan, Baluchi, Bambara, Cakchikel, Chol, Cushitic, Dakota, Divehi, Dogrib, Erzya dialect, Frisian, Fulah, Gahuku, Galician, Gambai dialect, Hiligaynon, Hmar, Hopi, Ilokano, Yabim, Judeo-Persian, Kalmyk, Kanarese, Kaw, Keyagana, Khanty, Khasi, Komi, Kongo, Krio, Kuanyama, Kwamera, Lamnso, Lokele, Madurese, Makasarese, Mam, Mandingo, Marshallese, Masai, Maya, Moksha dialect, Moore, Mosquito, Ndebele, Oirat, Ojibwa, Ovambo, Palauan, Papuan, Passamaquoddy, Penobscot, Pushtu, Raeto-Romance, Ronga, Rundi, Sami, Sangihe, Sanskrit, Sema, Shan, Songhai, Toba Batak, Tohono O'odham, Truk, Turkic, Turkmen, Udmurt, Umbundu, Wa, Yao, Yapese, Yi
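For anyone who wants to repeat this tally against the spreadsheet, the cumulation of variant forms and the 10-use cutoff could be sketched roughly as below. This is only an illustration in Python, not how the lists above were actually produced: the function names, the variant map, and the way a total is split across its variant spellings in the usage note are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical map from a variant or mistyped term to the form it is
# cumulated under. A real map would be built by hand from the spreadsheet.
VARIANTS = {
    "Aboriginal": "Aboriginal Australian",
    "Australian aboriginal": "Aboriginal Australian",
    "Haitian French Creole": "Haitian Creole",
    "Kirghiz": "Kyrgyz",
}


def cumulate(raw_counts, variants):
    """Add the count of each variant term to the count of its preferred form."""
    totals = Counter()
    for term, uses in raw_counts.items():
        totals[variants.get(term, term)] += uses
    return totals


def split_by_cutoff(totals, cutoff=10):
    """Return (terms at or above the cutoff with counts, terms below it).

    Both lists run from most used to least used; ties are alphabetical.
    """
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    frequent = [(term, uses) for term, uses in ranked if uses >= cutoff]
    rare = [term for term, uses in ranked if uses < cutoff]
    return frequent, rare
```

As a usage example with made-up per-variant counts, `cumulate({"Aboriginal Australian": 40, "Aboriginal": 15, "Australian aboriginal": 10, "Kyrgyz": 2, "Kirghiz": 1}, VARIANTS)` gives 65 for "Aboriginal Australian" and 3 for "Kyrgyz", and `split_by_cutoff` then places the first in the counted list and the second in the "used 1 to 9 times" list.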

When I sort on the column for use in LC records only, there's a very different order of languages (from most used to least):

English, Spanish, French, German, African, Chinese, Russian, Japanese, Polish, Swedish, Finnish, Italian, Dutch, Korean, Czech, Portuguese, Arabic, Hungarian, Turkish, Danish, Vietnamese, Hindi, Romanian, Afrikaans, Indic, Persian, Hebrew, Thai, Bulgarian, Urdu, Indonesian, Greek, Creole, Serbian, Lithuanian, Slovak, Albanian, Bengali, Khmer, Norwegian, Serbo-Croatian, Ukrainian, Estonian, Malay, Latvian, Malayalam, Mongolian, Scandinavian, Tibetan, Armenian, Azerbaijani, Belarusian, Catalan, Hmong, Indochinese, Marathi, Navajo, Panjabi, Philippine, Yiddish, Bashkir, Basque, Burmese, Hausa, Igbo, Inupiaq, Kannada, Kashmiri, Kurdish, Lao, Malagasy, Maori, Minangkabau, Oriental, Otomi, Papiamentu, Slovenian, Sundanese, Swahili, Tatar, Telugu, Uzbek, Zulu, Achinese, Afghan, Amuesha, Anufo, Asian, Attie, Australian, Baoule, Bassari, Bosnian, Chamorro, Chuvash, Dravidian, Ethiopian, Filipino, French Creole, Georgian, Gujarati, Karaim, Kazakh, Kyrgyz, Laguna dialect, Latin, Lingala, Luba, Macedonian, Manipuri, Mayan, Naga, Northern Sotho, Polynesian, Quechua, Sauraseni, Shona, Sino-Tibetan, Sissala, Sotho, Sranan, Tamil, Tibeto-Burman, Tigrinya, Tsonga, Tswana, Venda, Welsh, Wolof, Yanomamo, Yugoslav, Yupik

And of course there are many languages used in WorldCat records that were not used in LC records, including some with many records, e.g.:

Aboriginal Australian, Amharic, Cantonese, Croatian, Eskimo, Fijian, Haitian/Haitian Creole/Haitian French Creole, Icelandic, Irish, Ladino, Lao, Mandarin, Nepali, Niuean, Rarotongan, Samoan, Sinhalese, Somali, Tagalog, Taiwanese, Tongan, Uighur, Yoruba
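Finding the terms used in WorldCat but not in LC records is essentially a set difference between the two columns. A minimal sketch in Python follows; the function name is hypothetical, and matching terms without regard to capitalization is an assumption on my part, not something the spreadsheet requires.

```python
def only_in_worldcat(worldcat_terms, lc_terms):
    """Return, alphabetically, the WorldCat terms that never occur in LC records.

    Terms are compared case-insensitively (an assumption for this sketch).
    """
    lc_seen = {term.casefold() for term in lc_terms}
    return sorted(
        term for term in set(worldcat_terms) if term.casefold() not in lc_seen
    )
```

For example, `only_in_worldcat(["English", "Samoan", "Tagalog", "Navajo"], ["English", "Navajo", "Latin"])` returns `["Samoan", "Tagalog"]`. Variant spellings would need to be cumulated first, or a term could be reported as missing from LC records merely because it was entered differently there.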

It seems to me that LC could use this information to decide which demographic terms for language speakers to establish. They could either pick an arbitrary cutoff (say, for example, 10 uses in WorldCat, the threshold I used above), or they could decide to establish terms for the speakers of every language that has already been used in LC and/or OCLC records. Even a single use could indicate that a demographic group term for the speakers of that language is warranted, because if these resources were recataloged once LCGFT/LCDGT goes into effect, we would need a demographic group term for the speakers of that language.

Thanks for this preliminary work - I think it will be very useful to Janis and the subcommittee.

Adam