<<

Economic Research Working Paper No. 64

Expanding the World Gender-Name Dictionary: WGND 2.0

Gema Lax Martínez, Helena Saenz de Juano-i-Ribes, Deyun Yin, Bruno Le Feuvre, Intan Hamdan-Livramento, Kaori Saito, Julio Raffo

July 2021

Expanding the World Gender-Name Dictionary: WGND 2.0

Gema Lax Martínez*, Helena Saenz de Juano-i-Ribes†, Deyun Yin**, Bruno Le Feuvre‡, Intan Hamdan-Livramento**, Kaori Saito§ and Julio Raffo**

Abstract

This paper revisits the first World Gender Name Dictionary (WGND 1.0), allowing to disambiguate the gender in data naming physical persons (Lax Martínez et al., 2016). We discuss its advantages and limitations and propose an expansion based on updated data and additional sources. By including more than 26 million records linking given names and 195 different countries and territories, the resulting WGND 2.0 substantially increases the international coverage of its processor. As a result, it is particularly designed to be applied to intellectual property unit-record data naming inventors, designers, individual applicants and other creators disclosed in these data.

JEL Classification: J16, O31, O32, O34

Keywords: Gender gap; Gender innovation metrics; Gender-name dictionary.

Disclaimer

The views expressed in this paper are those of the authors, and do not necessarily reflect the views of the World Intellectual Property Organization (WIPO) or its member states.

Acknowledgements

The authors are thankful and indebted to the inputs, support and kind review of the following list of people: Ernesto Fernandez-Polcuch, Sara Callegari, Carsten Fink, Ernest Miguelez, Mosahid Khan. We are also thankful to all WIPO Gender focal points.

* University of Lausanne, Faculty of Business and Economics † University of Glasgow, Adam Smith Business School ‡ World Intellectual Property Organization (WIPO), Department of Economics and Data Analytics (DEDA), Statistics Division. § United Nations High Commissioner for Refugees (UNHCR) ** WIPO-DEDA, Innovation Economy Section 1

Introduction

Properly measuring women’s contribution to all fields of innovation and creativity is a crucial step to understand why women remain underrepresented in most of these areas. Yet, gender research studies continuously signal a prevalent deficiency of gender breakdown metrics in the field of economics of innovation and creativity (e.g. Frietsch et al., 2009; Mauleón & Bordons, 2009; Naldi et al., 2005).

In the field of innovation and creativity economics – or intellectual property (IP) more narrowly – there are several main approaches to obtain data with gender breakdown and none of these are flawless. The most direct approach is to collect primary data by requesting innovators and creators their gender, but this fails to capture gender for past innovators and creators. The main alternative is to disambiguate the gender of innovators and creators using their names, which allows to build long series for analysis.1

Following the latter approach, Lax-Martinez et al. (2016) compiled the first World Gender Name Dictionary (WGND 1.0). The main aim was to disambiguate the names of inventors and individual applicants of patent applications filed through the Patent Cooperation Treaty (PCT) System. As a result, the most expanded version of WGND 1.0 based on languages included 6.2 million names for 182 different countries and economies.

Since its creation, the WGND 1.0 has been widely used beyond its original analysis. WIPO’s PCT Yearly Review and World IP Indicators (WIPI) are publishing PCT indicators with gender breakdown based on this data. It has been used in several gender studies from national IP offices or international organizations both as direct source and as source of inspiration to generate gender indicators for their IP collections. Academic scholars have also been making use of the WGND 1.0 systematically.2 More recently, an open version has been shared via GitHub, which will most likely expand this base.

Given its wide use and the possibility to improve it, we believe it is timely to revisit the WGND 1.0, and propose future expansions and implement some of them to benefit the ever- growing WGND user communities. This paper is organized as follows: The first section revisits critically the WGND 1.0 and proposes some possible approaches for improvements. Section 2 updates the previous sources and elaborates additional ones to be included in a new version of WGND. Section 3 consolidates WGND 1.0 and the new sources to generate the WGND 2.0. Section 4 concludes by making final remarks and suggesting potential future steps for the future versions of WGND.

1 See Lax-Martinez et al. (2016) for a review of these methods. 2 On May 2021, there are more than 3,000 downloads of the dictionary from WIPO website and the Harvard Dataverse combined (Raffo & Lax-Martinez, 2018).

2

1. Starting point: the WGND 1.0 base

As the gender of inventors is not provided in the PCT application form, Lax-Martinez et al. (2016) proposed the compilation of a gender-name dictionary aiming at attaining a worldwide coverage. The resulting WGND 1.0 compiled the information from 13 different sources, which combined, covered 173 different countries and economies. These 13 public sources were complemented with an ad-hoc list of names, created by Chinese, Indian, Japanese, and Korean WIPO native-speaking colleagues after manually checking the results of a first round of gender attribution to PCT data.

These 14 sources totaled 319,785 pairs of given names and countries, split as 54% female given names, 38% to male ones and the remaining 8% related to names marked as unisex or ambiguous (see details in Figure 1). Once these sources were cleaned up from undefined names, duplications and initials, the final name-country dictionary had 290,020 observations.

Figure 1 : Source data for WGND 1.0

Source Observations Female (%) Male (%) Unknown (%) Social Security Administration (US) 91,320 61.3 33.8 4.9 Alberta government a 87,573 55.9 37.1 7.0 Michael (2007) 72,670 43.9 40.2 16.0 Office for National Statistics of United Kingdom (ONS) a 34,214 53.8 42.4 3.8 Tang et al. (2011) 21,512 53.9 46.1 0.0 US Census Bureau (2000) 5,164 76.4 17.2 6.4 Wikipedia a 2,358 49.8 50.0 0.3 WIPO (Assemblies list) 980 34.1 65.9 0.0 a 965 51.6 47.3 1.1 Instituto Nacional de Estadística (Spain) a 200 50.0 50.0 0.0 Institut National de la Statistique (France) a 183 50.8 48.6 0.6 Yu et al. (2014) 155 47.1 45.2 7.7 Denmark Statistics b 46 50.0 50.0 0.0 WIPO (Manual check) c 2,445 17.5 74.0 8.6 TOTAL 319,785 54.3 38.2 7.5 Notes: Some original observations were dropped due to text cleaning or duplications; (a) Accessed in December, 2015; (b) Accessed in May, 2016; (c) ad-hoc list.

The WGND 1.0 was provided in four different versions: WGND source (the one described in Figure 1), WGND country (290,020 unique name-country observations), WGND_nocountry (containing 177,042 unique names non-conflicting across countries) and WGND langcountry. The latter is an expansion of the name-country 290,020 pairs based on common language for the 12 most frequent languages: Arabic, Chinese, Dutch, English,

3

French, German, Italian, Japanese, Korean, Portuguese, Russian and Spanish.3 The resulting WGND langcountry dataset contains 6,247,039 unique name-country pairs, covering 182 countries and economies.

Figure 2 : WGND 1.0 breakdown by country and language

Country % cum % Language* % cum % United States 39.5 39.5 English 61.6 61.6 Canada 27.6 67.1 French 23.1 84.7 United Kingdom 11.8 78.9 German 2.2 86.9 Germany 1.4 80.3 Dutch 1.2 88.0 Netherlands 1.0 81.3 Italian 0.8 88.8 Italy 1.0 82.3 Hindi 0.7 89.4 India 0.9 83.1 Spanish 0.6 90.0 Switzerland 0.8 83.9 Persian 0.5 90.5 Iran 0.7 84.6 Romanian 0.5 91.0 Romania 0.7 85.3 Swedish 0.5 91.5 Other 14.6 99.9 Other 8.5 100.0 Notes: (*) = It refers to the total names found for countries officially speaking the language, not to the linguistic origin of the names. (Cum %) = cumulative percentage.

As shown in Figure 2, despite its large coverage, the WGND 1.0 has an overwhelming representation of a few countries and languages. Combined, Canada, the United Kingdom and the United States account for almost 80% of all names. Moreover, only 27 countries have more than one thousand names, while 115 have less than thirty names. The distribution of languages offers a similar picture. Names from countries whose official language is English or French account for roughly 85% of all names in WGND 1.0. Again, similar than for countries, only 22 languages have more than one thousand names, while 40 have less than thirty names.

In addition, WGND 1.0 made a couple of methodological choices which somehow decreased the diversity of names. First, in order to increased harmonization, it ignored information in given names such as Unicode characters – e.g. é, å, þ, ü, ø, ß, etc. – by presenting the Latin version of these. The only exception was the inclusion of Chinese and Korean ideograms. Second, it ignored the frequencies of name and gender pairs within countries, which was available for some sources but not all.

Taking everything into account, a step forward for the WGND is to increase the amount, diversity, and representativeness of sources of names. A second step forward is expanding the data structure of WGND to capture other valuable information such as frequencies and different character settings. The next section explores the possible sources to include in its second version.

3 Based on the CIA’s World Factbook (Accessed in December, 2015).

4

2. Expanding the WGND sources

This section explores a new set of more than 50 different sources, including the updates of the sources in the original WGND. These sources together account for five million country- name pairs, with some overlap.

We separated these sources in two categories and, within these, grouped them by the source’s country.4 The first and main category refers to sources from official entities, such as national statistical offices or population registrars. This category is preferred as, among other elements, they are more stable over time and provide more reliable figures on the frequencies of names. These name frequencies are typically based on census data or the registration of newborns, making them connected to a clear statistical reference. However, it is not easy to find equivalent data for this type of agencies across all countries. Some of these have very complete, open and accessible data. However, others may either lack frequency data or provide limited information on names (e.g. only top 100 or 200 names).

The second category refers to private sources, such as web sites providing baby names or typical names per language or country. This category has been used in several studies disambiguating gender.5 One main advantage of many of the private sources is that they have a wider international coverage, allowing filling the name coverage for several countries, including China, India and Japan. On the limitations side, these sources may disappear over time and lack the information on the frequencies of names.6 The occasional times they provide some name frequencies, it is often not clearly related to the population base. Yet, acknowledging the difficulties found by scholars to attribute the gender of Chinese and Indian names, we put considerable effort to overcome it in this version by incorporating any possible source available.7

Figure 3 lists the new sources by the above-mentioned categories. Totaling more than 4 million observations, Indian and Chinese names are the more prevalent sources. This is expected given their large populations. Nevertheless, even without these two groups, the remaining sources have almost three times more observations than in WGND 1.0. In addition, many of the official sources provide a much wider coverage of the population’s names if the frequencies from census or birth data are considered. The gender distribution of names is quite balanced across sources. The most notable exceptions are the voluminous Indian and Chinese private sources, which disproportionately contribute to the total average.

Only two groups of sources – i.e., India and International – report names labeled as unisex or unknown. These are of little use, as they do not allow distributing their names across genders in a reliable way. Nonetheless, as mentioned above, several sources report name and gender pairs with some information on their frequencies. Having included the name frequencies by gender allows the new WGND to treat unisex or ambiguous names in a way different from the previous version. The name frequencies enable the calculation of the expected distribution of names by gender for a given country. As shown in Figure 3, 20 groups of sources have reported some name splitting based on frequencies.

4 A list of the sources is given at the end of this document in the annex. 5 See, for instance, Cheng (2008). 6 Since we started collecting the data, many websites containing name lists have changed their URLs or do not seem active anymore. 7 See Shah & Singh (2014), Park & Yoon (2007), Nayan et al. (2008), Shah et al. (2016), Tripathi & Faruqui (2011), Yu et al. (2013), Asahara & Matsumoto (2003), Matsumoto et al. (2002), Qu & Grefenstette (2004).

5

Figure 3 – Data sources for WGND 2.0 Population's Female Male Unknown Split freq. Source groups Names frequency³ (%) (%) (%)4 (%)5 Armenia¹ 128 235,098 53.1 46.9 - 0.0 Australia¹ 68,277 2,042,029 61.5 38.5 - 11.0 Azerbaijan² 711 - 43.7 56.3 - 0.3 ¹ 55,266 84,274,136 53.8 46.2 - 11.6 Brazil² 443 - 53.1 47.0 - - Canada¹ 105,439 13,470,522 59.9 40.1 - 11.9 China² 1,227,297 - 36.8 63.2 - 44.0 Czech Republic¹ 70 - 51.4 48.6 - - Denmark¹ 257 31,177,329 51.8 48.3 - 0.0 France¹ 796 78,756,721 55.4 44.6 - 1.0 Hungary² 111 - 55.9 44.1 - - India² 2,894,532 - 39.0 48.0 13.0 0.2 Ireland¹ 5,271 3,161,758 58.6 41.4 - 8.0 Japan² 2,993 - 38.4 61.6 - 7.4 Montenegro¹ 94 - 52.1 47.9 - - New Zealand¹ 810 2,121,111 55.4 44.6 - 4.9 Norway¹ 2,028 42,986,217 53.2 46.8 - 0.7 Rep. North Macedonia¹ 20 228,883 50.0 50.0 - 0.0 Republic of Bulgaria¹ 40 1,997,217 50.0 50.0 - 0.0 Republic of Korea¹ 9,682 5,099,290 48.1 51.9 - 41.1 Russian Federation² 532 - 51.3 48.7 - - Serbia¹ 79 - 55.7 44.3 - - Slovenia¹ 35 35,935 51.4 48.6 - 0.0 Spain¹ 54,544 44,093,085 50.7 49.3 - 3.0 Sweden¹ 31,213 213,833,574 53.0 47.0 - 8.9 Switzerland¹ 65,708 8,903,476 55.0 45.0 - 9.0 The Philippines¹ 55 410,136 52.7 47.3 - 3.6 Turkey¹ 139 - 58.3 41.7 - - Ukraine² 210 - 47.6 52.4 - - United Kingdom¹ 75,232 27,096,437 59.9 40.1 - 9.6 United States¹ 101,261 329,760,765 62.5 37.5 - 19.6 United States² 29,239 - 53.8 46.2 - 38.0 International¹ 2,624 - 18.3 81.8 - 0.2 International² 281,836 - 44.2 49.7 6.1 1.9 Total 5,016,972 41.0 51.2 7.8 12.6 Notes: (1) Official on-line source. (2) Private on-line source. (3) frequency of names based on census or birth data. Frequencies of names may be based on several years of data. Not all official sources reported frequencies. (4) labeled as ‘unknown’ or ‘unisex’ by source. (5) names without 100% of cases attributed to one gender. (-) not available. See list of sources in annex.

These new sources vastly increase the international coverage of WGND 1.0. First, as mentioned above, they incorporate a large volume of names from China and India, which now are the top covered countries. Second, when considering beyond these two countries, the remaining top countries concentrate a smaller number of names than WGND 1.0 did (see Figure 2 and Figure 4). Third, and more importantly, 135 economies (up from 27) have

6

more than one thousand names with the new sources, while only 44 (down from 115) have less than thirty names.

The distribution of languages follows the same trend. When considering the languages spoken on each name’s country or region, the distribution of different languages has become considerably more diverse. Now, 71 languages (up from 22) have more than one thousand names and only 13 (down from 40) have less than thirty names. This is partially explained by the better international coverage of the sources, but also by the extension of the list of expanded languages. We increased the languages expanded from 12 to 96, which now cover 193 different countries and territories.

Figure 4: WGND 2.0 sources broken down by country

Cum. w/o China and Country Total (%) (%) India India 2,897,885 57.8% 57.8% Cum. (%) China 1,241,955 24.8% 82.5% (%) United States 135,055 2.7% 85.2% 15.4% 15.4% Canada 107,782 2.1% 87.4% 12.3% 27.7% United Kingdom 78,335 1.6% 88.9% 8.9% 36.6% Switzerland 70,157 1.4% 90.3% 8.0% 44.6% Australia 69,804 1.4% 91.7% 8.0% 52.6% Belgium 57,588 1.1% 92.9% 6.6% 59.1% Spain 56,344 1.1% 94.0% 6.4% 65.6% Sweden 32,245 0.6% 94.6% 3.7% 69.3% Rep. of Korea 13,082 0.3% 94.9% 1.5% 70.7% Ireland 7,367 0.1% 95.0% 0.8% 71.6% Other 249,222 5.0% 100.0% 28.4% 100.0%

All things considered, the updated and new sources provide a significant improvement from the previous version in terms of names quantity, spelling variety, relative frequencies, international coverage and language extension. We now turn in the next section to the elaboration of the specific files contained in the WGND 2.0.

3. Building the WGND 2.0

Similar to the previous version, the WGND 2.0 comes in different flavors. These are meant to assist researchers and analysts to find the best solution for their specific dataset.

In concrete terms, there are five separate files: (i) WGND 2.0 source, (ii) WGND 2.0 name- gender-code, (iii) WGND 2.0 name-gender, (iv) WGND 2.0 name-gender-langcode and (v) WGND 2.0 name-gender-code language expansion. A sixth file, WGND 2.0 code-langcode, complements these files.

The details for each of these six files are presented below:

- WGND 2.0 source contains 5,016,972 unique name-code-gender-src observations from the sources described in the previous section. The information from each specific group

7

of sources includes the population related frequencies (nobs) for each name-code- gender per source and the gender relative distribution (wgt) within these, whenever available.

- WGND 2.0 name-gender-code contains 4,970,296 unique name-code-gender observations. It refers primarily to the unique combinations of name-code pairs. However, the inclusion of gender frequencies results in 633,201 name-code duplicated pairs, due to more than one gender found for the pair name-code. The resulting dataset contains 4,148,968 unique names, covering 195 countries and territories (code).

- WGND 2.0 name-gender (No code) contains 3,491,141 unique name observations. This file is based on WGND 2.0 name-gender-code but it omits all conflicting names across sources, geography and gender. All names reported in the file have a weight (wgt) equal to one.

- WGND 2.0 name-gender-langcode contains 21,831,043 unique name-gender-langcode observations. It refers to the transformation of the name-code pairs in WGND 2.0 name- gender-code to name-language pairs. The transformation uses all the official languages (langcode) spoken on each country or territory (code) in the WGND 2.0 code-langcode file. Multiple language countries – such as Switzerland or Canada – contribute name- gender pairs to all their languages. Conflicting name-gender within each language are ignored. The resulting dataset contains 3,505,319 names covering 94 different languages. All names reported in the file have a weight (wgt) equal to one.

- WGND 2.0 name-gender-code (Language expansion) contains 26,043,223 unique name-code-gender observations. It refers to the expansion of the name-code pairs to all countries and territories speaking the same languages. The expansion is based in propagating the results in WGND 2.0 name-gender-langcode to all countries or territories speaking those languages, as in the WGND 2.0 code-langcode list. The resulting dataset contains 3,505,319 unique names, covering 191 countries and territories. Out of these economies, 170 have more than one thousand names and only 12 have less than 30 names. All names reported in the file have a weight (wgt) equal to one.

- WGND 2.0 code-langcode contains 261 unique code-langcode observations. It permits the conversion from 193 different countries or territories (code) to 96 different “macrolanguages” (langcode).

Variables contained in these files are as follows:

- name: Given name (lowercase string). This name may be single (“mary”) or composed (“jean-marc”). In order to maintain the rich and diversified information of both the updated and new sources, we decided to keep the original name spelling from each source, while adding the harmonization from WGND 1.0. First, each original name spelling is maintained in the new WGND “as is”.8 Second, a non-Unicode version of the names is added whenever possible – e.g. “josé” becomes “jose”. Third, any non- alphabet symbol is removed – e.g. “marie-claire” becomes “marie claire”. Fourth, we do not remove short names – i.e. shorter than three characters – as we did in the previous version.9

- code: Country or territory code, as reported by each original source, formatted using the ISO 3166-1 alpha-2 standard (two uppercase chars). In the few cases where the names

8 Only unnecessary leading, within and trailing blank spaces are removed. 9 In WGND 1.0, the only exception to this rule were the names in original Chinese or Korean characters.

8

were reported by language, a conversion to country or territory code was applied with the above-mentioned code-langcode list.

- langcode: Language code formatted with the ISO 639-1 2-digit standard (two lowercase chars). The langcodes are imputed using the country or territory codes from each original source and the WGND 2.0 code-langcode list.

- gender: Gender code referring to: “F” for feminine name, “M” for masculine name, and “?” for unisex name or unknown.

- wgt: Proportion of frequencies of each name-code-gender observation (ranging from 0 to 1). The sum of wgt for the same name and code should equal one within the same source (although some small discrepancies may arise due to rounding errors).

- nobs: Frequency of name and gender in a given country or territory (ranging from 1 to ∞+). These often refer census or population registration data, but not always. In some sources, the number represents more than one-year aggregations.

- src: Source grouping (string). Please refer to annex for more details.

These files can be found in WIPO website, Github and Harvard Dataverse.

4. Conclusions

This technical paper revisited critically the World Gender Name Dictionary (WGND) published in 2016 (Lax Martínez et al., 2016). As a result, we propose a new version of the WGND based on updated data source from the first edition and considerable additional sources. The newly compiled worldwide gender-name dictionary (WGND 2.0) includes 26,043,223 triads linking given names, geography and gender, which is a substantial improvement from the first version.

Among the main improvements, the WGND 2.0 expands the international coverage by providing name-gender links for 195 different countries and territories or in 94 different languages. Second, it enhances both the generality and representativeness of the previous dictionary by incorporating more diversified data sources. Third, it also increases the richness of the information by including both Unicode rich spelling and ASCII clean versions of underlying given names. Another methodological change in WGND 2.0 is that it compiles the proportions (weights) for unisex names based on population frequencies that can be used in a statistical way.

Yet, there is still room for improvement, beyond the scope of this version. In particular, more national official sources need to be gathered to further enhance the quality of the name-gender frequencies. Another potential direction of improvement could be to unfold the information in composite given names.

9

References

Asahara, M., & Matsumoto, Y. (2003). Japanese Named Entity Extraction with Redundant Morphological Analysis. Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 8–15. https://www.aclweb.org/anthology/N03-1002 Cheng, K. K. Y. (2008). Names in Multilingual-Multicultural Malaysia. Names, 56(1), 47–53. https://doi.org/10.1179/175622708X282965 Frietsch, R., Haller, I., Funken-Vrohlings, M., & Grupp, H. (2009). Gender-specific patterns in patenting and publishing. Research Policy, 38(4), 590–599. https://doi.org/10.1016/j.respol.2009.01.019 Lax Martínez, G., Raffo, J., & Saito, K. (2016). Identifying the gender of PCT inventors (No. 33; WIPO Economic Research Working Papers). World Intellectual Property Organization. https://www.wipo.int/publications/en/details.jsp?id=4125&plang=EN Matsumoto, Y., Kitauchi, A., Yamashita, T., Hirano, Y., Matsuda, H., Takaoka, K., & Asahara, M. (2002). Morphological Analysis System ChaSen version 2.2. 9 Manual. Mauleón, E., & Bordons, M. (2009). Male and female involvement in patenting activity in Spain. Scientometrics, 83(3), 605–621. https://doi.org/10.1007/s11192-009-0131-x Naldi, F., Luzi, D., Valente, A., & Parenti, I. V. (2005). Scientific and technological performance by gender. In Handbook of quantitative science and technology research (pp. 299–314). Springer. http://link.springer.com/chapter/10.1007/1-4020-2755-9_14 Nayan, A., Rao, B. R. K., Singh, P., Sanyal, S., & Sanyal, R. (2008). Named Entity Recognition for Indian Languages. Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages. IJCNLP 2008. https://www.aclweb.org/anthology/I08- 5014 Park, S.-B., & Yoon, H.-G. (2007). Determining the Gender of Korean Names for Pronoun Generation. International Journal of Computer Science and Engineering, 1(4). Qu, Y., & Grefenstette, G. (2004). Finding ideographic representations of Japanese names written in Latin script via language identification and corpus validation. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 183-es. https://doi.org/10.3115/1218955.1218979 Raffo, J., & Lax-Martinez, G. (2018). WGND 1.0 [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/YPRQH8 Shah, H., Bhandari, P., Mistry, K., Thakor, S., Patel, M., & Ahir, K. (2016). Study of named entity recognition for indian languages. Int. J. Inf, 6(1), 11–25. Shah, R., & Singh, D. (2014). Improvement of Soundex Algorithm for Indian Language Based On Phonetic Matching. Tripathi, A., & Faruqui, M. (2011). Gender prediction of Indian names. IEEE Technology Students’ Symposium, 137–141. https://doi.org/10.1109/TECHSYM.2011.5783842 Yu Jiang-de, Zhao Hong-dan, Zheng Bo-ju, & Yu Zheng-tao. (2013). A method of gender discrimination based on character feature of Chinese names. Journal of Shandong University (Engineering), 44(1), 13–18. https://doi.org/10.6040/j.issn.1672-3961.2.2013.274

10

Annex - List of name-gender data sources

• AM_gov (Armenia): National Statistical Service of the Republic of Armenia • AU_gov (Australia): Governments of New South Wales, Western Australia and South Australia. • AZ_web (Azerbaijan): Azerbaijan International (www.azeri.org), Cutebaby names (www.cute-baby-names.com) • BE_gov (Belgium): Statistics Belgium (Stat Bel) • BG_gov (Republic of Bulgaria): National Statistical Institute • BR_web (Brazil): iHeartBrazil (www.iheartbrazil.com) • CA_gov (Canada): Governments of Alberta, British Columbia and Ontario • CH_gov (Switzerland): Office Fédéral de la Statistique • CN_web (China): Multiple websites: Wikipedia, xh.5156edu.com, We Have Kids, Baby Names, Baby Names Direct, Behind the Name, Top 100 Baby Names Search, etc. • CZ_gov (Czech Republic): Czech Republic Statistical Office • DK_gov (Denmark): Denmark Stat (DST) • ES_gov (Spain): Spain’s National Institute of Statistics (INE) • FR_gov (France): INSEE • GB_gov (United Kingdom): Office for National Statistics of United Kingdom (ONS), National Records of Scotland, and Northern Ireland Statistics and Research Agency • HU_web (Hungary): University of Debrecen, Department of Hungarian Linguistics • IE_gov (Ireland): Central Statistics Office (CSO) • IN_web (India): Multiple websites: Wikipedia, Baby Names Direct, Indian Hindu Names.com, etc. • INT_gov (International): WIPO assemblies and seminar lists of names, WIPO colleagues, etc. • INT_web (International): Multiple websites: Wikipedia, Behind the name, Equivalent Given Names (Kankula and Phillips, 2011), EURONEWS, Michael J (2008), Vornamen Verzeichnis - Deutsch & International, etc. • JP_web (Japan): Multiple websites: World of Baby Names, Thought Co, etc. • KR_gov (Republic of Korea): Korean Name Statistics, Electronic Family Relationship System • ME_gov (Montenegro): Statistical Office of Montenegro • MK_gov (Republic of North Macedonia): National Statistical Office • NO_gov (Norway): • NZ_gov (New Zealand): Te Tari Taiwhenua / Department of Internal Affairs • PH_gov (The Philippines): National Statistics Office, Gender and Development Committee (GCOM) • RS_gov (Serbia): Statistical Office of the Republic of Serbia • RU_web (Russian Federation): Multiple websites: Wikipedia, Master Russian, etc. • SE_gov (Sweden): Statistics Sweden (SCB) • SI_gov (Slovenia): Statistical Office (RS) • TR_gov (Turkey): Turkish Statistical Institute • UA_web (Ukraine): Multiple websites: Wikipedia, Proud of Ukraine, etc. • US_gov (United States): Social Security Administration (SSA) • US_web (United States): Multiple websites: Wikipedia, Tang et al. (2011), etc.

11