Language and Cultural Barriers in Labor Markets and International Factor Mobility
Inaugural dissertation for the attainment of the degree of Doctor of Economics (Doktor der Wirtschaftswissenschaft) at the Faculty of Economics of Ruhr-Universität Bochum
Cumulative dissertation consisting of five contributions
Submitted by Diplom-Ökonom Sebastian Otten from Rhede, 2013
Dean: Prof. Dr. Helmut Karl
Referee: Prof. Dr. Thomas K. Bauer
Co-referee: Prof. Dr. Christoph M. Schmidt
Date of the oral examination: 11 February 2014
Contents
List of Tables
List of Figures

1 Introduction and Overview

2 The Costs of Babylon – Linguistic Distance in Applied Economics
2.1 Introduction
2.2 Measuring Linguistic Distance
2.2.1 Previous Literature
2.2.2 The Levenshtein Distance
2.3 Language Fluency of Immigrants
2.3.1 Data and Method
2.3.2 Results
2.4 International Trade
2.4.1 Data and Method
2.4.2 Results
2.5 Conclusion
2.A Appendix

3 Linguistic Barriers in the Destination Language Acquisition of Immigrants
3.1 Introduction
3.2 Measuring Linguistic Distance
3.3 Data
3.4 Method
3.5 Results
3.6 Conclusion
3.A Appendix
3.B Supplementary Appendix

4 Language and Cultural Barriers in International Trade and Investment
4.1 Introduction
4.2 Empirical Model
4.2.1 Theoretical Background: The Structural Gravity Model
4.2.2 Empirical Strategy
4.3 Data
4.3.1 Measuring Linguistic and Genetic Distance
4.3.2 Data on Trade Flows, Portfolio Investment, and Banking Claims
4.4 Empirical Findings
4.4.1 The Effect of Linguistic and Genetic Distance on Trade
4.4.2 The Effect of Linguistic and Genetic Distance on Investment
4.5 Conclusion

5 The Role of Language Skills in the German Labor Market
5.1 Introduction
5.2 Data
5.3 Empirical Strategy
5.4 Results
5.5 Conclusions

6 The Role of Source- and Host-Country Characteristics in Female Immigrant Labor Supply
6.1 Introduction
6.2 Background
6.3 Method, Data, and Descriptive Statistics
6.3.1 Method
6.3.2 The European Social Survey
6.3.3 Aggregated Data
6.4 Basic Results
6.4.1 Source- and Host-Country Fixed Effects
6.4.2 Source-Country FLFPR
6.4.3 Host-Country FLFPR
6.5 Sensitivity Analyses
6.5.1 Control for Partner Characteristics
6.5.2 Control for Parents’ Human Capital and Employment
6.5.3 Ratio of FLFPR to MLFPR
6.5.4 Source-Country Characteristics at Year of Migration
6.5.5 Bias-Reduced Linearization of Standard Errors
6.6 Conclusion
6.A Appendix

Bibliography

Acknowledgments

Curriculum Vitae
List of Tables
2.1 Closest and Furthest Language Pairs with Respect to the Levenshtein Distance
2.2 Descriptive Statistics of Dependent and Explanatory Variables – Immigration Sample
2.3 Immigrant’s Language Skills – Probit Results
2.4 Descriptive Sample Statistics and Variable Definitions – International Trade Sample
2.5 Effect of Language on Bilateral Trade – OLS Results
2.A1 40-Items Swadesh Word List
2.A2 Summary Statistics for the Language Variables – International Trade Sample
2.A3 Immigrant’s Language Skills – Probit Results, Including Native Speakers
2.A4 Effect of Language on Bilateral Trade – OLS Results, Subsample Language Barrier > 0
2.A5 Effect of Language on Bilateral Trade – OLS Results, Subsample Linguistic Features Index

3.1 Average Test Scores of US Language Students
3.2 40-Items Swadesh Word List with Computational Examples
3.3 Closest and Furthest Languages to English and German
3.4 Rank Correlations among Linguistic, Geographic, and Genetic Distance Measures
3.5 Distribution of Language Skills across Samples
3.6 Language Ability and Linguistic Distance – Aggregated Results
3.7 OLS Results of Linguistic Distance – ACS Sample
3.8 Ordered Logit Marginal Effects of Linguistic Distance – Model 1 ACS & SOEP Sample
3.9 OLS Results of Linguistic Distance – SOEP Sample
3.A1 Descriptive Statistics – ACS & SOEP Sample
3.A2 Variables Description – ACS & SOEP Sample
3.A3 Robustness Checks: OLS Results of Linguistic Distance – Model 2 ACS Sample
3.A4 Robustness Checks: OLS Results of Linguistic Distance – Model 2 SOEP Sample
3.B1 OLS Results – Model 1 & 2 ACS Sample
3.B2 Ordered Logit Results – Model 1 & 2 ACS Sample
3.B3 Ordered Logit Marginal Effects – Model 1 ACS Sample
3.B4 Ordered Logit Marginal Effects – Model 2 ACS Sample
3.B5 OLS Results – Model 1 & 2 SOEP Sample
3.B6 Ordered Logit Results – Model 1 & 2 SOEP Sample
3.B7 Ordered Logit Marginal Effects – Model 1 SOEP Sample
3.B8 Ordered Logit Marginal Effects – Model 2 SOEP Sample

4.1 Closest and Furthest Language Pairs with Respect to the Levenshtein Distance
4.2 Effect of Linguistic and Genetic Distance on Bilateral Exports
4.3 Effect of the Bilateral LDE Indicator on Bilateral Exports
4.4 Effect of the Linguistic Distance toward English on Bilateral Exports
4.5 Effect of Linguistic and Genetic Distance on Cross-Border Asset Stocks
4.6 Effect of Linguistic and Genetic Distance on Bilateral Banking Claims
4.7 Effect of the Bilateral LDE Indicator on Cross-Border Asset Stocks
4.8 Effect of the Bilateral LDE Indicator on Bilateral Banking Claims
4.9 Effect of the Linguistic Distance toward English on Cross-Border Asset Stocks
4.10 Effect of the Linguistic Distance toward English on Bilateral Banking Claims

5.1 Swadesh 40-Item List with Computational Examples
5.2 Distance from German – Closest and Furthest Languages
5.3 Descriptive Statistics by Oral German Ability
5.4 Results of Employment Regressions
5.5 Results of Wage Regressions – Oral Ability
5.6 Results of Wage Regressions – Written Ability

6.1 Descriptive Statistics – Individual Variables
6.2 Descriptive Statistics – Aggregated Variables
6.3 Model 1 – Source- and Host-Country Fixed Effects
6.4 Model 2 – Source-Country Characteristics
6.5 Model 3 – Host-Country Characteristics
6.6 Models 2 & 3 – Controlling for Partner Characteristics
6.7 Models 2 & 3 – Controlling for Parents Characteristics
6.8 Models 2 & 3 – Ratio of FLFPR to MLFPR
6.9 Model 2 – Source-Country Characteristics at Year of Migration
6.10 Models 2 & 3 – Bias-Reduced Linearization of Standard Errors
6.A1 Explanatory Power of Source- & Host-Country Fixed Effects
6.A2 Explanatory Power of Source- & Host-Country Characteristics
6.A3 List of Source Countries
6.A4 Macroeconomic Data – Sources and Descriptions
List of Figures
2.A1 Comparisons of Linguistic Distance Using the Test-Score-Based Measure and the Levenshtein Distance – 2000 U.S. Census
2.A2 Bivariate Kernel Density Estimation of Log Bilateral Trade and Levenshtein Distance – International Trade Sample

3.1 Language Relations in the TREE Approach
3.2 Predicted Language Assimilation Profiles for the ACS Sample
3.3 Predicted Language Assimilation Profiles for the SOEP Sample

5.1 German Ability, Employment Rate, and Hourly Wages by Years since Migration (5-Year Moving Average)

6.1 Female Labor Force Participation Rate (Age 15-64) – Year 2011
6.2 Effect of Source-Country FLFPR by Years since Migration
6.3 Effect of Host-Country FLFPR by Years since Migration

The limits of my language mean the limits of my world.
Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922), p. 149.
Chapter 1
Introduction and Overview
Extreme poverty is a problem observed in all parts of the world. More than 1.2 billion people live in extreme poverty, defined as living on less than US$ 1.25 per day (United Nations, 2013). In addition, more than 870 million people worldwide suffer from extreme hunger and are chronically undernourished (IFPRI, 2013). To address these problems, the United Nations has placed the fight against extreme poverty and hunger at the top of the agenda of its millennium goals (United Nations, 2000). Both theoretical and empirical economic research suggest that international trade and migration can contribute to a reduction of differences in income between countries, initiate a process of convergence in development, and in this way help to overcome poverty (Freeman, 2006; OECD, 2011; Winters et al., 2004).¹

Despite the positive impact of international migration, only around 3% of the world’s population lives outside their country of birth (IOM, 2013). The relatively low share of migrants might partly be a result of the strict immigration legislation in the typical immigration countries (Freeman, 2006). While migration flows are hampered by the immigration legislation of many countries, the markets for goods and capital underwent a comprehensive liberalization of trade policies along with a reduction of trade barriers as a consequence of the collapse of the Bretton Woods system in 1973 (Wacziarg and Welch, 2007). This led to a steady increase in the international flows of goods and capital. Nevertheless, not all countries have profited from the ongoing globalization of the international markets for goods and capital to the same extent. Even for countries with similar endowments and institutional frameworks, significant differences in trade volumes can be observed. This poses the question of what causes the differences in trade volumes. Theoretical and empirical economic studies provide clear answers and stress the importance of a country’s geographical and climatic features, the endowment with resources, the technological level, and the institutional framework. However, while some important drivers of international trade have been identified, the phenomenon is not completely understood. What remains unexplained is how linguistic and cultural differences between two countries affect their bilateral trade and capital flows.

A similar thought experiment can be applied to the relatively small number of international migrants. Again, economic models and empirical studies provide numerous explanations. However, no satisfactory answer as to how linguistic and cultural differences affect international migration flows is given.

Next to the question of what influences the selection of destination countries, the analysis of the factors that determine the immigrants’ integration process is of particular importance. Immigrants differ largely with respect to their success in integrating into their destination country’s society. Integration is mainly achieved by learning the destination country’s language and by adapting to its culture. If the social and economic integration fails, the positive impact of immigration is undermined. Failed integration leads to a lack of prospects and increases the immigrants’ risk of falling into poverty again, though this time not in their country of origin, but in an unfamiliar society. Furthermore, the low level of income reduces the ability of immigrants to send remittances to their families in the country of origin. Anecdotal evidence points to the particular importance of linguistic and cultural differences for international transactions and migration.² Nevertheless, these factors are mostly neglected in the standard economic models. This dissertation aims at partly filling this research gap.

¹ As John K. Galbraith states: “Migration is the oldest action against poverty. It selects those who most want help. It is good for the country to which they go; it helps break the equilibrium of poverty in the country from which they come. What is the perversity in the human soul that causes people to resist so obvious a good?” (Galbraith, 1979).
For this purpose, the following five chapters deal with empirical analyses of the influence of linguistic and cultural differences on different aspects of economic behavior.³ To analyze the influence of linguistic and cultural differences within an empirical framework, a suitable measure is required that permits such an analysis and that can be implemented into econometric models. Since the economic literature provides only a method for the measurement of cultural differences – the genetic distance – Chapter 2 introduces the Levenshtein distance as a new measure of linguistic distance. Subsequently, this method is tested in both a micro- and macroeconomic application and is compared to the most frequently used measures in both areas of research. Chapters 3 and 4 extend the empirical applications presented in Chapter 2. In Chapter 3, the linguistic distance measure is compared to three other measures of language differences which have been used in the economic literature. In doing so, the measures are analyzed within a microeconometric framework with respect to how well they explain the variation in the language proficiency of immigrants. In Chapter 4, the influence of linguistic and genetic distance on bilateral flows of goods and capital is analyzed in the context of a macroeconometric model. The second part of the dissertation is concerned with the labor market. Chapter 5 analyzes how differences in language skills affect the employment probabilities and wages of immigrants. Chapter 6 then focuses on identifying the role of source- and host-country culture in immigrant women’s labor supply.

² A prime example is the failed merger of the automobile companies Daimler and Chrysler, which is largely attributed to differences in their management culture (Michler, 2011).
³ The chapters of this thesis are available as independent articles and working papers. Chapters 2 and 3 are published and Chapters 4–6 are currently under revision. Electronic preprints are available from the author.

In the following, the contributions of this thesis to the economic literature are clarified and the main findings and implications of the succeeding chapters are summarized.

Chapter 2 introduces the normalized and divided Levenshtein distance as the method of choice to quantify differences between languages and discusses its advantages over previous measures of linguistic distance. The measure is then used in two applications to explain the costs of linguistic differences on the micro- and macro-level: (i) the analysis of the language acquisition of immigrants and (ii) the analysis of linguistic barriers in international trade flows. On the micro-level, we use multiple datasets to estimate the initial disadvantages in the language acquisition of immigrants resulting from differences in linguistic origin. Estimations using the U.S.
Census allow for a direct comparison of our approach to the approach by Chiswick and Miller (1999), who measure linguistic distance toward English using average test scores of language students. Both approaches lead to qualitatively comparable results. We further use the general applicability of our measure to broaden previous evidence to non-Anglophone countries. In doing so, we use data from the National Immigrant Survey of Spain and the German Socio-Economic Panel. The results reveal that immigrants who come from a more distant linguistic origin face significantly higher costs of language acquisition, which is reflected in their lower probability of reporting good language skills. On the macro-level, we apply the Levenshtein distance in the context of international trade, where language barriers have previously been addressed by controlling for common languages in bilateral trade flows. Using a comprehensive dataset of bilateral trade flows by Rose (2004), we estimate a standard gravity model using the Levenshtein distance as an additional explanatory variable and compare this approach to a previous approach based on shared linguistic features by Lohmann (2011). The results provide new and strong evidence indicating that language barriers affect trade above and beyond the simple effect of sharing a common language.

Chapter 3 extends the analysis of the language acquisition of immigrants in the preceding chapter by comparing the Levenshtein distance to three other approaches previously used in the economic literature to measure linguistic distance: (i) the WALS measure, which uses differences in language characteristics, (ii) the TREE measure, which is based on a priori knowledge of language families, and (iii) a measure based on average test scores of native U.S. foreign language students (SCORE). The information on language differences is applied to German and U.S.
micro data – the German Socio-Economic Panel and the American Community Survey – in order to provide a comprehensive analysis of the influence of the linguistic origin on the acquisition of destination-language proficiency. The results suggest that the linguistic barriers raised by language differences play a crucial role in the determination of the destination-country language proficiency of immigrants. Regardless of the method employed, we estimate large initial disadvantages due to linguistic distance for immigrants both in the U.S. and in Germany. In Germany, these initial differences in language skills decrease with a moderate convergence over time. In contrast, in the U.S., the initial disadvantages increase over time. We interpret this difference in assimilation patterns as a potential outcome of stronger enclave effects in the U.S. This crucial difference highlights the importance of extending the analysis beyond the case of Anglophone countries.

Chapter 4 extends the macro-economic analysis of linguistic barriers by evaluating the extent to which language and cultural barriers affect different types of international factor movements, i.e., international trade flows, cross-holdings of assets, and consolidated international banking claims. In addition to disentangling the impact of language and cultural dissimilarities, which might vary substantially over the aforementioned factor movements, I analyze the effect of English proficiency on international factor movements, an impact factor so far neglected in the literature. In the empirical analysis, I apply a gravity model, which was first proposed by Tinbergen (1962) and has since been applied in numerous empirical studies on factor mobility. The results show that linguistic and genetic distance have varying effects on the examined factor movements.
While controlling for a host of other possible determinants, I find strong evidence that a higher linguistic distance between two countries reduces cross-border trade and investment holdings between these countries. The results for genetic distance, however, are mixed. While cultural differences significantly reduce bilateral trade, they have no effect on international investments. When including the country’s linguistic distance toward English in the model, the estimates indicate a significant negative impact of a higher linguistic distance toward English on international factor mobility. These findings are in line with the theoretical expectations and provide supportive evidence that language differences contribute to higher informational frictions across countries, thereby reducing bilateral trade and cross-border investment flows.
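For orientation, the gravity models referred to in the summaries of Chapters 2 and 4 can be written in a textbook log-linear form. The equation below is only a hedged sketch: the symbols ($X_{ij}$ for bilateral flows from country $i$ to country $j$, $Y$ for economic size, $D_{ij}$ for geographic distance, $\mathit{ComLang}_{ij}$ for a common-language indicator, $\mathit{LD}_{ij}$ for linguistic distance) are assumptions chosen for illustration, and the specifications actually estimated in the later chapters include further controls and, in Chapter 4, a structural formulation.

```latex
\ln X_{ij} = \beta_0 + \beta_1 \ln Y_i + \beta_2 \ln Y_j + \beta_3 \ln D_{ij}
           + \beta_4 \, \mathit{ComLang}_{ij} + \beta_5 \, \mathit{LD}_{ij} + \varepsilon_{ij}
```

In this notation, the finding that language barriers matter beyond a shared language corresponds to $\beta_5 < 0$ even after controlling for $\mathit{ComLang}_{ij}$.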
The second part of the thesis, Chapters 5 and 6, analyzes the role of language and culture in the labor market. Chapter 5 investigates the effects of language skills on labor market outcomes of immigrants in Germany using data from the German Socio-Economic Panel. Knowledge about the extent to which language skills affect employment probabilities and wages may improve our understanding of the underlying mechanisms of a successful integration of immigrants and illustrate the need and scope for government intervention such as the provision of student loans or free language courses. To address the problem of unobserved heterogeneity⁴ and a potential measurement error, we employ an instrumental variable approach to identify the causal effect of language skills on labor market outcomes. The instrument is based on the relationship between immigrants’ duration of residence in their host country and their language skills, taking into account heterogeneity in the linguistic distance among immigrants from different countries of origin. We find that the effect of language skills on employment probabilities is insignificant, which is in line with the economic literature on residential segregation of immigrants. In particular, it seems likely that geographic clustering allows immigrants to find jobs even without knowledge of the host-country language. In contrast, we observe a significant positive effect of language skills on wages. However, this effect diminishes when we control for occupation, indicating that the returns to language skills are a result of the sorting of immigrants across occupations. We further demonstrate that simple OLS regressions systematically underestimate the positive effects of language skills on wages.

Chapter 6 investigates the labor force participation of female immigrants in Europe. A central aim of this chapter is to provide evidence on the role of culture in women’s labor market behavior.
Specifically, we are interested in whether immigrant women’s labor supply in their host country is affected by the female labor force participation rate in their source country, which serves as a proxy for the country’s preferences and beliefs regarding women’s roles. The effect of source-country culture on immigrants’ behavior, however, might weaken as immigrants assimilate to the culture of their host country and adapt to the labor supply behavior of natives. A second aim of this paper is to shed light on such assimilation patterns by investigating the role of host-country female labor force participation in immigrant women’s labor supply decisions. In the empirical analysis, we employ data from five rounds of the European Social Survey (ESS), covering immigrants in 26 European countries surveyed between 2002 and 2011. These data are augmented with an extensive set of aggregated source- and host-country variables as well as bilateral data describing the relationship between both countries, such as the geographic, linguistic, and genetic distance between the immigrants’ source and host country.
⁴ In particular, the identification of a causal effect of language skills on labor market outcomes is challenging because language skills and labor market outcomes are both determined by unobserved individual ability.
We find that the labor supply of both first- and second-generation immigrants is positively associated with the FLFPR in their (parents’) source country. This result supports previous evidence for immigrants in the U.S. and suggests that immigrant women’s labor supply is affected by preferences and beliefs regarding women’s roles in society in their source country. The effect of this cultural proxy on the labor supply of immigrant women is robust to controlling for spousal characteristics, parental characteristics, and a variety of source-country characteristics. Moreover, we find evidence for a strong positive correlation between the FLFPR in the immigrant’s host country and immigrant women’s decision to participate in the labor market. This result suggests that immigrant women adapt to the culture, institutions, and economic conditions in their host country and in this way assimilate to the work behavior of natives. Again, this result is robust to various sensitivity analyses.

Taken together, the results of this dissertation stress the importance of linguistic and cultural differences for many aspects of economic behavior. Both in theoretical and empirical economic models, the standard variables should be complemented by factors that capture language and cultural differences. The proper modelling and implementation of these factors makes it possible to gain new insights and can help to address unsolved puzzles. An example in this context is the irrational preference of investors to invest disproportionately high shares of their investment portfolio in their home country, which contradicts economic theory. The analyses in this dissertation have shed light on the fact that language and cultural differences impose burdens on the integration of immigrants. As a consequence, groups of immigrants that face higher linguistic and cultural hurdles are disadvantaged relative to other groups of immigrants.
Due to this disadvantage, they have a lower employment probability and lower wages, which in turn increase their risk of poverty. Here, it is up to political decision makers to consider this disadvantage and address it with specifically designed measures. Public debates on immigrants’ abilities to integrate into their host country, as conducted in Germany and other industrialized countries, are often polemical. The fact that these debates do not consider the factors mentioned above stresses the necessity of targeted political action.

Language and cultural differences, in particular in relation to the leading industrial nations, negatively affect bilateral trade and capital flows. They act as an additional export duty and thus lead to a disadvantage on the globalized markets for goods and capital. This disadvantage affects developing countries in particular. It is up to the World Trade Organization to counteract this disadvantage and ensure fair competition on the international goods and capital markets.
Chapter 2
The Costs of Babylon – Linguistic Distance in Applied Economics∗
2.1 Introduction
According to biblical accounts, the Babylonian Confusion once quite effectively stopped the construction of the tower of Babel and scattered the previously monolingual humanity across the world, speaking countless different languages. In economic research, linguistic diversity is believed to be a crucial determinant of real economic outcomes, due to its impact on communication and language skills (see, e.g., Chiswick and Miller, 1999), and as accumulated costs affecting international trade flows (see, e.g., Lohmann, 2011). The operationalization of differences between languages is not straightforward, and only a few approaches, all of them problematic, have been attempted so far. This study proposes to use a measure of linguistic distance developed by linguistic researchers. Linguistic distance is defined as the dissimilarity of languages, including, but not restricted to, vocabulary, grammar, pronunciation, script, and phonetic inventories. The Automatic Similarity Judgment Program (ASJP) by the German Max Planck Institute for Evolutionary Anthropology offers a descriptive measure of phonetic similarity: the normalized and divided Levenshtein distance. This distance is based on the automatic comparison of the pronunciation of words from different languages having the same meaning. We use this measure in two applications to explain the costs of linguistic differences on the micro- and macro-level. On the micro-level, we use multiple datasets, the 2000 U.S. Census, the National Immigrant Survey of Spain and the German Socio-Economic Panel, to estimate the initial disadvantages due to differences in linguistic origin in the language acquisition of immigrants. On the macro-level, trade-flow gravity models are estimated using the bilateral trade-flow data by Rose (2004) to analyze accumulated costs of linguistic barriers in international trade.

∗ Co-authored with Ingo E. Isphording (IZA). This chapter is published in the Review of International Economics, 21(2), 2013. A preliminary version of this chapter is available as Ruhr Economic Paper #337. The authors are grateful to Thomas K. Bauer, John P. Haisken-DeNew, Ira N. Gang, Julia Bredtmann, Jan Kleibrink, Maren Michaelsen, the participants of the RGS Doctoral Conference 2012, the RES 2012, the SOLE 2012, and the ESPE 2012 for helpful comments and suggestions. We are also very thankful to Andrew K. Rose for providing parts of the trade dataset and Johannes Lohmann for the data of his language barrier index.

Epstein and Gang (2010) point out that differences in culture, though crucially affecting economic outcomes, are typically treated as a black box in empirical investigations. One main channel of this effect of cultural distance on economic outcomes is differences arising from different linguistic backgrounds. Differences in language are arguably the most visible manifestation of such cultural differences. Previous studies relied on approaches that measure linguistic distance using average test scores of language students (Chiswick and Miller, 1999) or classifications by language families (Guiso et al., 2009). Test-score-based approaches assume the difficulty of learning a foreign language for students to be determined by the distance between the native and a foreign language.
Unfortunately, due to data limitations, test-score-based measures are only available for the distances toward the English language and are therefore strongly restricted in their use. Approaches using language family trees to derive measures of linguistic distance rely on strong assumptions of cardinality and have to deal with arbitrarily chosen parameters.

Against this background, we contribute to the existing literature in several respects. First, we introduce the normalized and divided Levenshtein distance as an easy and transparently computed, cardinal measure of linguistic distance. We use the general applicability of this measure to broaden the evidence on disadvantages in the language acquisition of immigrants to non-Anglophone countries. Second, we apply the measure in the context of international trade, where language barriers have previously been addressed by controlling for common languages in bilateral trade flows. The Levenshtein distance makes it possible to overcome this very narrow definition of linguistic barriers.

Our results confirm the existence of significant costs of language barriers on the micro- and macro-level. Immigrants coming from a more distant linguistic origin face significantly higher costs of language acquisition. A higher linguistic distance strongly decreases the probability of reporting good language skills. To illustrate the results for immigrants into the U.S., a Vietnamese immigrant coming from a very distant linguistic origin faces an initial disadvantage compared to a German immigrant from a close linguistic origin that is equivalent to 6 additional years of residence. In the case of accumulated costs on bilateral trade flows, our results indicate that not only a shared common language but also a related but not identical language accelerates trade by lowering transaction costs.
The paper is organized as follows. Section 2.2 provides a short overview of previous attempts to measure linguistic distance and introduces the Levenshtein distance, discussing its advantages and potential shortcomings. We present our results concerning the explanation of immigrants’ language skills in Section 2.3. The second application, the explanation of international trade flows, is discussed in Section 2.4. Section 2.5 summarizes the results and concludes.
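As a rough illustration of the measure named in this introduction, the following Python sketch computes a normalized and divided Levenshtein distance between two word lists. It assumes the commonly described two-step construction: each edit distance is first normalized by the length of the longer word, and the average over same-meaning word pairs is then divided by the average over different-meaning pairs. The function names and the two toy word lists are hypothetical; the transcriptions are made up for illustration and are not ASJP data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """Levenshtein distance normalized by the length of the longer word."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(words_a: dict, words_b: dict) -> float:
    """Normalized and divided distance between two {meaning: transcription} lists."""
    shared = sorted(set(words_a) & set(words_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared meanings")
    # Average normalized distance between words with the same meaning.
    same = [ldn(words_a[m], words_b[m]) for m in shared]
    # Average normalized distance between words with different meanings,
    # used as a baseline for chance similarity between the two languages.
    diff = [ldn(words_a[m], words_b[n]) for m in shared for n in shared if m != n]
    baseline = sum(diff) / len(diff)
    if baseline == 0:
        raise ValueError("baseline distance is zero; word lists are degenerate")
    return (sum(same) / len(same)) / baseline


# Hypothetical toy word lists (made-up transcriptions, for illustration only).
english_like = {"I": "ai", "you": "yu", "we": "wi", "one": "wan", "two": "tu"}
german_like = {"I": "ix", "you": "du", "we": "vir", "one": "ains", "two": "cvai"}
```

Calling `ldnd(english_like, german_like)` returns a positive number, while `ldnd(english_like, english_like)` returns zero. Dividing by the different-meaning baseline is intended to correct for similarity that arises by chance, for example from overlapping sound inventories.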
2.2 Measuring Linguistic Distance
2.2.1 Previous Literature
Linguistic distance is the dissimilarity of languages in a multitude of dimensions, such as vocabulary, grammar, pronunciation, script, and phonetic inventories. This multidimensionality of linguistic distance makes it difficult to find an appropriate empirical operationalization to be used in applied economic studies. A very straightforward approach is the evaluation of linguistic distances between languages by counting shared branches in language family trees (see, e.g., Guiso et al., 2009). This language-tree approach has to deal with strong cardinality assumptions and arbitrarily chosen parameters. Additionally, the approach offers only low variability between different language pairs and is difficult to implement for isolated languages such as Korean. A widely used approach to measure linguistic distance has been introduced by Chiswick and Miller (1999), who use data on the average test score of U.S. language students after a given time of instruction in a certain foreign language. They assume that the lower the average score, the higher the linguistic distance between English and the foreign language. Similar measures have been used to analyze the effect of language barriers on international trade (Hutchinson, 2005; Ku and Zussman, 2010). This test-score-based measurement of linguistic distance relies on strong assumptions. It has to be assumed that the difficulty U.S. citizens have in learning a particular foreign language is symmetric to the difficulty foreigners have in learning English. Further, it has to be assumed that the average test score is not influenced by other language-specific sources. Dörnyei and Schmidt (2001) give an overview of the potential role of intrinsic and extrinsic motivation in learning a second language.
Intrinsic motivation, the inherent pleasure of learning a language, and extrinsic motivation, the utility derived from being able to communicate in the foreign language, are likely to differ across languages, but are not distinguishable from the actual linguistic distance in the test-score-based approach.
2.2.2 The Levenshtein Distance
Drawing from linguistic research, it is possible to derive an operationalization of linguistic distance without the strong identification assumptions that underlie previous approaches. The so-called Automatic Similarity Judgment Program (ASJP) developed by the German Max Planck Institute for Evolutionary Anthropology aims at automatically evaluating the phonetic similarity between all of the world’s languages. The basic idea is to compare pairs of words having the same meaning in two different languages according to their pronunciation. The average similarity across a specific set of words is then taken as a measure for the linguistic distance between the languages (Bakker et al., 2009). This distance can be interpreted as an approximation of the number of cognates between languages. The linguistic term cognates denotes common ancestries of words. A higher number of cognates indicates closer common ancestries. Although its computation is restricted to differences in pronunciation, a lower Levenshtein distance therefore also indicates a higher probability of sharing other language characteristics such as grammar (see Serva, 2011). The acquisition of a second language is crucially affected by such differences in pronunciation and phonetic inventories, as they determine the difficulty in discriminating between different words and sounds. For a recent overview of the linguistic literature on language background and language acquisition see Llach (2010). The algorithm calculating the distance between words relies on a specific phonetic alphabet, the ASJPcode. The ASJPcode uses the characters within the standard ASCII1 alphabet to represent common sounds of human communication. The ASJPcode consists of 41 different symbols representing 7 vowels and 34 consonants. Words are then analyzed as to how many sounds have to be substituted, added, or removed to transform a word in one language into the same word in a different language (Holman et al., 2011).
The words used in this approach are taken from the so-called 40-item Swadesh list, a list of 40 words that are common to almost all of the world’s languages, including parts of the human body and expressions for common things in the environment. The Swadesh list was deductively derived by Swadesh (1952); its items are believed to be included in all of the world’s languages universally and independently of culture.2 The ASJP program judges each word pair across languages according to their similarity in pronunciation. For example, to transform the phonetic transcription of the English word you, transcribed as yu, into the transcription of the respective German word du, one simply has to substitute the first consonant. But to transform maunt3n, which is the transcription of mountain, into bErk, which is the transcription of the German Berg, one has to remove or substitute each of the 7 consonants and vowels.
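The word-level comparison described above can be sketched in a few lines of Python. This is only an illustrative implementation of the standard Levenshtein edit distance, not the ASJP program itself; the function names are hypothetical, and the sketch assumes that every sound is written as exactly one character, which ignores any multi-character conventions of the ASJPcode.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the symbol string a into the symbol string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# The two word pairs discussed in the text (English vs. German).
print(levenshtein("yu", "du"))         # 1
print(levenshtein("maunt3n", "bErk"))  # 7
```

Dividing by the length of the longer word yields 0.5 for the first pair and 1.0 for the second, which is the word-level normalization used in the formalization that follows.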
1 American Standard Code for Information Interchange, a keyboard-character-encoding scheme.
2 A list of the 40 words is available upon request.
The following formalization of the computation follows Petroni and Serva (2010). To normalize the distance according to the word length, the resulting number of changes is divided by the length of the longer word. Denoting this normalized distance between item $i$ of languages $\alpha$ and $\beta$ as $D(\alpha_i, \beta_i)$, the normalized linguistic distance (LDN) is computed as the average across all $i = 1, \dots, M$ distances between synonyms of the same item:

$$\mathrm{LDN}(\alpha, \beta) = \frac{1}{M} \sum_{i} D(\alpha_i, \beta_i). \tag{2.1}$$

To additionally account for potential similarities in phonetic inventories which might lead to a similarity by chance, a global distance between languages is defined as the average Levenshtein distance of words with different meanings:

$$\Gamma(\alpha, \beta) = \frac{1}{M(M-1)} \sum_{i \neq j} D(\alpha_i, \beta_j). \tag{2.2}$$

The final measure of linguistic distance is then the normalized and divided Levenshtein distance (LDND), which is defined as:

$$\mathrm{LDND}(\alpha, \beta) = \frac{\mathrm{LDN}(\alpha, \beta)}{\Gamma(\alpha, \beta)}. \tag{2.3}$$

The resulting measure expresses a percentage measure of dissimilarity between languages, although, by construction, it might take on values higher than 100% in cases in which languages do not even possess those similarities which are expected to exist by chance. Table 2.1 lists the closest and furthest languages toward English, German, and Spanish. The measurement via the normalized and divided Levenshtein distance is in line with intuitive judgments about language dissimilarities. Although there is clearly a strong positive correlation between the Levenshtein distance and the test-score-based approach by Chiswick and Miller (1999), the Levenshtein distance offers a higher variability in its measurement and we believe it to be more exact.3 Some languages are found to be distant according to the Levenshtein distance, but have a comparably low distance using the test-score-based measure, indicating that the test-score-based measure might also capture incentives to learn a foreign language instead of solely measuring linguistic distance.
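As a rough illustration of Eqs. (2.1) to (2.3), the sketch below computes LDN, the global distance Γ, and LDND for a toy word list. It is a minimal sketch under stated assumptions, not the ASJP implementation: the function names are hypothetical, each sound is assumed to be a single character, and only the two word pairs mentioned in the text are used (M = 2) instead of the full 40-item list, so the resulting numbers say nothing about the actual English-German distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def d(a: str, b: str) -> float:
    """Word-level distance normalized by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldn(alpha: list[str], beta: list[str]) -> float:
    """Eq. (2.1): mean distance between words with the same meaning."""
    return sum(d(a, b) for a, b in zip(alpha, beta)) / len(alpha)


def gamma(alpha: list[str], beta: list[str]) -> float:
    """Eq. (2.2): mean distance between words with different meanings."""
    m = len(alpha)
    total = sum(d(alpha[i], beta[j])
                for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))


def ldnd(alpha: list[str], beta: list[str]) -> float:
    """Eq. (2.3): LDN divided by the chance-level distance Γ."""
    return ldn(alpha, beta) / gamma(alpha, beta)


# Toy word lists: same index means same meaning (you, mountain).
english = ["yu", "maunt3n"]
german = ["du", "bErk"]
print(ldn(english, german), gamma(english, german), ldnd(english, german))
```

Multiplying the result by 100 gives the percentage scale on which the distances are reported. Because Γ is an average over unrelated word pairs, it is typically below 1, which is why LDND can exceed 100%.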
2.3 Language Fluency of Immigrants
Language skills of immigrants are known to be a crucial determinant of their economic success in the host country labor market. The economic literature concerning the determinants of language fluency of immigrants was initiated by the influential work of Chiswick (1991). Based on this seminal paper, Chiswick and Miller (1995) developed a theoretical human capital framework of host country language skill acquisition. In this framework, linguistic distance is a crucial determinant of language skills because it lowers the efficiency of learning a language and induces higher learning costs. This theoretical implication has subsequently been tested for various countries using the test-score-based measure by Chiswick and Miller (1999). Because this measure is available only for the English language, these applications have been restricted to studies concerning immigration to English-speaking countries such as the U.S. or Canada (Chiswick and Miller, 2005). This restriction does not hold for the Levenshtein distance as a measure of linguistic distance, which is not restricted to any home or host country and may therefore be applied to a broader range of countries. This feature allows us to provide evidence on the relationship between linguistic distance and language fluency from an international perspective. In doing so, we utilize data from three different sources. First, we use data from the 2000 U.S. Census to apply both the test-score-based measure by Chiswick and Miller (1999) and the Levenshtein distance to the same dataset. To compare the influence of linguistic distance across different countries, we additionally use data from the German Socio-Economic Panel (SOEP) and the National Immigrant Survey of Spain (NISS). The U.S., Germany, and Spain have very different migration histories that make an international comparison worthwhile.
3 A figure which shows the relationship between both measures is available upon request.
The United States has been an immigration country since its foundation, and currently legal permanent residence status is granted to about 1 million immigrants per year. In 2000, this immigration flow consisted mainly of immigrants from other North American countries (40%, including 21% from Mexico), followed by Asian (32%) and European immigrants (15%) (U.S. Department of Homeland Security, 2010). These inflows are also reflected in the stocks of the immigrant population. In the 2000 U.S. Census, 11% of the population of the U.S. were foreign-born. Germany has neither as long-running an immigration history as the U.S. nor an extensive colonial history like Spain. Mass immigration started only shortly after World War II with large waves of ethnic German expellees, followed by the so-called “guestworker” programs aimed at attracting mainly unskilled workers from Mediterranean countries such as Turkey, Yugoslavia, Italy, or Spain. These first two waves of immigration were followed by a strong phase of immigration through family reunification during the 1970s and 1980s. The third large wave of immigration consists of immigrants and Ethnic Germans from former Soviet states during the 1990s (Bauer et al., 2005). Compared to the U.S. and Spain, Germany has a very old immigrant population with long individual migration histories. In 2009, 10.6 million people (approx. 13% of the German population) had immigrated after 1949, 3.3 million of them as Ethnic Germans. The third immigration wave was accompanied by large numbers of refugees and asylum seekers from the former Yugoslavia. The major part of the immigrant population was born in EU member states (32%), followed by 28% from Turkey and 27% from former members of the Soviet Union. Although Spain has a long-running colonial history, it is a comparably young immigration country.
After large waves of emigration until the 1970s, net immigration began in the early nineties, and accelerated considerably during the last 20 years. Between 1997 and 2007, the number of migrants increased by around 700%, initially including mostly migrants from Africa and Western Europe. Nowadays, the majority of immigrants comes from Latin America and, since the EU enlargement, increasingly from Eastern Europe. Today, about 10% or 4.5 million of the population in Spain are foreign-born (see Fernández and Ortega, 2008).
2.3.1 Data and Method
Our data are restricted to male immigrants who entered the respective country after the age of 16, are younger than 65, and do not speak the host country language as their first language. The sample drawn from the 1%-PUMS (Public Use Microdata Series) 2000 U.S. Census file consists of 59,889 individuals. Similar data are extracted from the German Socio-Economic Panel, a long-run longitudinal representative study. Using cross-sectional data from 2001, the sample consists of 675 male immigrants.4 The National Immigrant Survey of Spain, conducted in 2007, also offers comprehensive cross-sectional information on the socio-economic characteristics and migration history of immigrants.5 The sample includes 2,513 male immigrants. All datasets include self-reported assessments of language fluency taking four or five possible values, which we have converted into a dichotomous measure taking a value of 1 if language skills are “Good” or “Very Good” and 0 otherwise. This variable serves as the dependent variable in Probit regressions. This dichotomization decreases the probability of misclassification, which would lead to biased estimates in the case of Probit models, as pointed out by Dustmann and van Soest (2001). Moreover, it avoids dealing with violated proportional odds assumptions in the case of Ordered Probit models, as discussed by Isphording and Otten (2011). Further, the recoding enhances the comparability of the estimations between the different datasets and with previous approaches, such as Chiswick and Miller (1999). Denoting this dichotomized indicator variable of host country language skills as our dependent variable $y_i$, the estimated probability of reporting good language skills can be specified as:

$$\Pr[y_i = 1 \mid LD_i, X_i] = \Phi(\beta_0 + \beta_1 LD_i + X_i \gamma), \tag{2.4}$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. $LD$ is the linguistic distance between the native language and the host country language; the parameter $\beta_1$ is our main parameter of interest, capturing the disadvantage by linguistic origin in the language acquisition process. The main variable of interest is the measure of linguistic distance introduced in Section 2.2.2. Both measures of linguistic distance, the test-score-based measure and the normalized and divided Levenshtein distance, are expressed as percentiles of their respective distribution. This allows for a direct comparison between effects. As additional control variables, all three datasets offer comparable information on the age at migration, years since migration, years of education, marital status, number of children, and an indicator variable denoting a former colonial relationship between home and host country. We additionally include the distance between capitals in kilometers to proxy migration costs. For the 2000 U.S. Census we include some additional regional information about living in a non-metropolitan area, living in the Southern states, and the share of the minority speaking the language of the individual. For Germany, we include a dummy for coming from a neighboring country. We control for refugee status (U.S. and Germany) and political reasons for migration (Spain), respectively. The U.S. data further include information about having been abroad 5 years ago, while the German data include information on having family abroad. This information serves as a proxy for return migration probability. Finally, each specification includes 17 world-region dummies to account for potential cultural differences correlated with linguistic distance. Sample means of the included variables are reported in Table 2.2.
4 For further information about the SOEP see Haisken-DeNew and Frick (2005). The SOEP data were extracted by using the Stata add-on PanelWhiz (Haisken-DeNew and Hahn, 2006).
5 For further information about the NISS see Reher and Requena (2009).
They show significant differences across the datasets, related to the different migration histories summarized above. Immigrants in Germany display the highest number of years since migration, as the sample consists largely of former guestworkers who immigrated during the 1960s and early 1970s. The German immigrant population also has the lowest mean education, but a higher share of married couples and a higher number of children, which might partly be due to the higher average age. The low average distance to the home country reflects the high share of guestworkers and immigrants from Eastern and Southern Europe. In contrast, immigrants to both Spain and the U.S. have a high average distance to the home country, as many immigrants come from overseas. Spain has the youngest immigrant population, reflecting its relatively short immigration history starting in the 1990s. In each dataset, a comparable share of around half of the sample reports “Good” or “Very Good” host country language skills.
2.3.2 Results
Table 2.3 lists the results of the Probit regressions across datasets, reported as marginal effects evaluated at the mean of the covariates. Columns (1) and (2) show the results for the U.S. data, using the test-score-based measure and the Levenshtein distance, respectively. Column (3) shows the results for the German SOEP data, and column (4) for the Spanish NISS data. The results confirm a significantly negative effect of linguistic distance on the probability of reporting good or very good language abilities in the host country language throughout all estimations. For the U.S., the effects for the test-score-based measure and the Levenshtein distance are qualitatively comparable. The effect is lower, however, when applying the Levenshtein distance. To illustrate the effect of linguistic distance, we can look at the additional amount of years of residence that make up for an initial disadvantage by linguistic origin. This amount of years of residence can be calculated by equating the marginal effect of years since migration with the marginal effect of a certain difference in linguistic origins and solving for the years since migration. In the U.S., the initial disadvantage of an immigrant with a distant linguistic origin, e.g., a Vietnamese who is in the 97th percentile of the distribution of linguistic distance, compared to an immigrant with close linguistic origin, e.g., a German who is in the 1st percentile, is worth around 6 years of additional residence. For a Turk (79th percentile), the largest immigrant group in Germany, the disadvantage compared to a linguistically closer Dutch migrant (3rd percentile) is worth 8 years of residence. Switching the measure of linguistic distance in the U.S. data from the test-score-based to the normalized and divided Levenshtein distance does not qualitatively affect the coefficients of the control variables. The coefficients are in line with previous studies and theoretical predictions. 
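The years-of-residence equivalents reported above can be written out as a short calculation. The following is only a sketch of the equating step described in the text; the symbols $ME_{LD}$ and $ME_{YSM}$ (the marginal effects of linguistic distance and of years since migration from Table 2.3) and $\Delta LD$ are notation introduced here for illustration, and the expression assumes that both marginal effects can be treated as constant over the range considered.

$$
ME_{YSM} \cdot \Delta \text{years} = \left| ME_{LD} \right| \cdot \Delta LD
\quad \Longrightarrow \quad
\Delta \text{years} = \frac{\left| ME_{LD} \right| \cdot \Delta LD}{ME_{YSM}}
$$

For the U.S. example, the difference in linguistic origin between a Vietnamese immigrant (97th percentile) and a German immigrant (1st percentile) corresponds to $\Delta LD = 97 - 1 = 96$ percentile points; inserting the two estimated marginal effects then gives the roughly 6 years of additional residence reported in the text.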
We see a positive impact of education, at around 5 percentage points for the U.S. and Germany and around 3 percentage points for Spain. The initial negative effect of age at migration decreases over time. Being married and having children is associated with higher language skills. The signs of these relationships are stable across all datasets, with the exception of lower language skills for immigrants with children in Spain. Being born in a former colony has a strong positive effect for immigrants in both the U.S. and Spain. In Germany, immigrants from a neighboring country report higher language skills. Refugees in the U.S. and in Germany report lower average language skills compared to immigrants without refugee status.
2.4 International Trade
Costs imposed by linguistic barriers can also be found on the macro-level. The trade-increasing effect of a common language is an undisputed fact in international economics. It is intuitive that trade between countries with a common language is cheaper than between countries with different languages. In their survey article, Anderson and van Wincoop (2004) report an estimate of the tax equivalent of “representative” trade costs for industrialized countries of about 170%. Of these, language-related barriers account for 7 percentage points, which is similar in magnitude to policy barriers and information costs. The question is whether and to what extent the dissimilarity between two languages matters if trading partners do not share a common language. Certainly, a range of dominating languages (English in the Western countries, Russian in Eastern Europe, French in Africa, and Spanish in Latin America) plays a major role in international trade. The role of English as a lingua franca in particular has been addressed by Ku and Zussman (2010). However, in the development of longer-term business partnerships, the crucial variable of interest is the linguistic knowledge of the trade partner’s home country language (Hagen et al., 2006), captured by the direct linguistic distance between the dominant languages of the trade partners. The method of choice in examining determinants of international bilateral trade is the gravity model first proposed by Tinbergen (1962). The basic theoretical gravity model assumes that the size of bilateral trade between any two countries depends on a function of each country’s economic size measured by (log of) GDP. Trade costs in their simplest form are approximated by the distance between the trading countries (Anderson and van Wincoop, 2004). Extensions add proxies for trade frictions, such as the effect of trade agreements (McCallum, 1995) and cultural proximity (Felbermayr and Toubal, 2010).
To incorporate language-related barriers into these gravity models, common empirical practice is to use an indicator variable that equals 1 if two countries share the same official language and 0 otherwise (see Anderson and van Wincoop, 2004). While most studies employ this approach, Mélitz (2008) goes beyond official languages and develops two different measures. The first measure depends on the probability that two randomly chosen individuals from either country share a common language spoken by at least 4% of both populations. The second measure is an indicator variable that equals 1 if two countries have the same official language or the same language is spoken by at least 20% of the populations of both countries. These measures share the shortcoming that they only look at whether countries share the same language, but do not account for heterogeneity in the degrees of similarity between languages. The degree of similarity, however, is likely to affect trade costs, e.g., by lowering the costs of learning the trade partner’s language or by lowering translation costs (Hagen et al., 2006). Our results of Section 2.3.2, showing that linguistic barriers crucially affect second language acquisition, lend further support to this hypothesis. Moreover, lower host country language skills diminish the ability of immigrants to promote trade and commerce between their host country and their country of origin (Hutchinson, 2005). The only two approaches we know of that take similarities and differences between a multitude of languages into account are the ones by Hutchinson (2005) and Lohmann (2011). By relying on the measure by Chiswick and Miller (1999), Hutchinson’s approach is restricted to distances toward English. Lohmann (2011) uses data from the World Atlas of Language Structures (WALS; see Dryer and Haspelmath, 2011) to construct an index of 139 potentially shared linguistic features between languages. Similar to our application, he applies this index to explain international trade flows using data from Rose (2004). This approach counts shared language features within language pairs and builds up a language features index normalized to the interval of [0; 1], where 0 means sharing all features.6
2.4.1 Data and Method
To ensure a high degree of comparability with the previous literature, we use a widely accepted empirical methodology and a standard dataset of bilateral trade flows. The dataset constructed by Rose (2004) has previously been used by Mélitz (2008), Ku and Zussman (2010), and Lohmann (2011).7 The sample covers bilateral trade between 178 countries over the years 1948 to 1999, leading to 234,597 country-pair-year observations.8 The variables of interest are Rose’s binary common language variable, two versions of linguistic distance between trading partners’ languages as measured by the Levenshtein distance, and finally the linguistic features index calculated by Lohmann (2011). The Levenshtein distance is computed for every country-pair in the dataset. In mono-lingual countries we assign the respective native language to the country. In multi-lingual countries, the most prevalent native language is assigned, which was identified using a multitude of sources, including the CIA’s World Factbook, encyclopedias, and Internet resources.9 To analyze the sensitivity of the results with respect to the measurement of linguistic distance, we calculate an alternative specification of the Levenshtein distance, replacing the most prevalent language with the prevailing lingua franca in a country. A lingua franca is defined as a language typically used to enable communication between individuals not sharing a mother tongue. These languages are often third languages, which are widely spoken in a particular regional area and are not necessarily an official language.10 Subsequently, we compare the effect of these two definitions of the Levenshtein distance with the approach by Lohmann (2011). Descriptive statistics of the variables used in the empirical analysis are shown in Table 2.4. The average Levenshtein distance decreases from 90.3 to 75.1 when we use lingua francas instead of the most prevalent native language to calculate the linguistic distance.
6 The measure by Lohmann (2011) is assigned at the country-level using the most widely spoken official language of each country. In the Spanish and U.S. micro-data, we can rely on a more detailed assignment using information on the mother tongue of each individual. This makes it infeasible to include this alternative measure in the micro-data regressions in Section 2.3.
7 The data and their sources are explained in detail in Rose (2004) and posted on his website. Following Tomz et al. (2007), we have defined our WTO membership variable broadly to include countries that are either formal members of the organization or have agreements that involve rights and obligations toward it.
8 The annual value of bilateral trade between a pair of countries is created by averaging the imports and exports. Country-pairs with zero bilateral trade flows are not included in the sample.
This indicates that lingua francas may have come into existence to decrease costs imposed by language barriers in the first place. Following Rose’s definition, 22.3% of the country-pairs share a common language. This quite high share relies on a very broad definition of official languages by Rose. For example, even country pairs such as the U.S. and Denmark or France and Egypt are coded to have the same language. Using the Levenshtein distance, only 4.7% of the country-pairs show a distance of zero, which is equivalent to sharing a common language, increasing to 18.4% for the Levenshtein distance measure based on lingua francas. The linguistic features index by Lohmann (2011) is zero for 9.4% of the country-pairs, meaning that both languages share all linguistic features considered. We use the gravity model to estimate the impact of language barriers on trade between pairs of countries. The model has a long record of success in explaining bilateral trade flows and has become the standard model for applied trade analysis. Following Rose (2004), we augment the basic gravity equation with a number of additional variables that affect trade in order to control for as many determinants of trade flows as possible. Our empirical strategy is to compare trade patterns for trading partners with different language barriers using variation across country-pairs. If a common language or a high similarity between languages has a positive effect on trade, we expect to observe significantly higher trade for these country-pairs than for others. We compare three different specifications. First, we adopt the original specification by Rose (2004) including an indicator variable for country-pairs sharing the same language. This basic approach is then augmented by the Levenshtein distance and the language features index by Lohmann (2011). The exact specification of the gravity model is:

$$
\begin{aligned}
\ln X_{ijt} = {} & \beta_0 + \beta_1 \ln (Y_i Y_j)_t + \beta_2 \ln Dist_{ij} + Z_{ijt} \kappa + \gamma_1 LB_{ij} \\
& + \sum_i \delta_i I_i + \sum_j \theta_j J_j + \sum_t \phi_t T_t + \varepsilon_{ijt},
\end{aligned}
\tag{2.5}
$$

where the dependent variable $X_{ijt}$ denotes the average value of real bilateral trade between country $i$ and country $j$ at time $t$, mainly influenced by the “mass” of both economies, indicated by the product of their GDP denoted by $Y$, and the distance in log miles. $Z$ is a vector of control variables, including population size, geographic characteristics such as sharing a land border, number of landlocked countries, number of island nations in the country-pair (0, 1, or 2), the area of the country (in square kilometers), and colonial relationships. Further, we control for member and nonmember participation in the GATT/WTO (one or both countries), same currency, regional trade agreements, and being a GSP beneficiary.11
The main coefficient of interest is $\gamma_1$. It measures the effect of the different language barrier variables ($LB$) on international trade. If the common language indicator is used, $\gamma_1$ should be positive; if instead one of the linguistic distance measures is used, $\gamma_1$ should be negative. A comprehensive set of country and year fixed effects is included in the specification to control for any factor affecting trade that is country specific (e.g., stock of migrants, foreign language knowledge) or time specific (e.g., common shocks and trends).12 The gravity model is estimated by ordinary least squares (OLS) with robust standard errors clustered on the country-pair level.
9 For example, we use English as the native language in the United Kingdom, because it is a mono-lingual country and English is the national language. In a multi-lingual country such as Canada we use English instead of French, because English is the most prevalent native language. A comprehensive index of assigned languages with further explanations is available upon request.
10 For example, we use Russian as lingua franca for most countries of the former Soviet Union.
2.4.2 Results
Table 2.5 summarizes the results of Eq. (2.5). For the sake of brevity, the estimated coefficients for the time- and country-fixed effects are omitted from all tables. In the first column, we reproduce the benchmark specification from Rose (2004) based on his measure of common language augmented with country fixed effects. Rose’ model confirms the hypothesis of a significant positive effect of common language on bilateral trade. Sharing a common language is found to raise trade by about (exp(0.274) − 1 ≈) 31.5%.
11 More details are given in Rose (2004), the source for all variables except the linguistic distance measures. In the assignment of GATT/WTO rights and obligations we follow Tomz et al. (2007) and impose the restriction that formal membership has the same effect as nonmember participation.
12 Recent empirical work on the determinants of bilateral trade increasingly relies on panel data techniques that account for country-pair instead of exporter and importer specific fixed effects. Country-pair fixed effects control for the impact of any time-invariant country-pair specific determinant such as bilateral distance or common language. However, this comes at the cost of not being able to estimate the effect of the language barrier variables, our variables of interest, on bilateral trade.
Still, this result might be biased by the very broad definition of having a common language. The question we want to answer is whether language barriers affect trade above and beyond the simple effect of sharing a common language. Therefore, our ensuing specifications examine how the results change when we employ the linguistic distance measures instead of the common language variable.

The second column shows our preferred model. We replace Rose's common language variable with our default Levenshtein distance measure. We find significantly lower trade when the Levenshtein distance between both countries in a dyad increases. The coefficient indicates that a country-pair trades about (exp(−0.006) − 1 ≈) 0.6% less if the Levenshtein distance increases by one unit. To illustrate the magnitude of this effect, we note that the 75th percentile of the Levenshtein distance in our sample is 99.93 (roughly the distance between English and Japanese) and the 25th percentile is 92.95 (roughly the distance between English and Russian). The estimate in column 2 implies that an increase from the 25th to the 75th percentile in the Levenshtein distance decreases bilateral trade by approximately 4.1%.13

In multi-lingual countries, the assignment of languages to countries is difficult. To show that our findings are not a result of a particular assignment of languages to countries, the estimation results with the Levenshtein distance measure based on lingua francas are presented in column (3). The key result that the Levenshtein distance has a statistically and economically significant negative effect on bilateral trade is robust. However, the effect decreases by 50%, possibly due to the lower variability of the alternative Levenshtein distance. Additionally, lingua francas are purposely chosen to lower transaction costs. Therefore, we should expect a smaller effect on trade when taking the lingua francas into account.
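The percentage effects quoted for columns (1) and (2) follow from the usual transformation of a coefficient in a log-linear model, exp(β·Δx) − 1. The short sketch below reproduces that arithmetic using the coefficients and percentiles reported in the text; the function name `pct_effect` is a hypothetical helper introduced only for illustration.

```python
import math


def pct_effect(coef: float, delta: float = 1.0) -> float:
    """Percentage change in trade implied by a log-linear coefficient
    when the regressor changes by `delta` units."""
    return (math.exp(coef * delta) - 1.0) * 100.0


# Column (1): sharing a common language (dummy switches from 0 to 1).
common_language = pct_effect(0.274)

# Column (2): one additional unit of Levenshtein distance.
one_unit = pct_effect(-0.006)

# Column (2): moving from the 25th to the 75th percentile of the distance.
p25, p75 = 92.95, 99.93
interquartile_move = pct_effect(-0.006, p75 - p25)

print(round(common_language, 1))     # 31.5
print(round(one_unit, 1))            # -0.6
print(round(interquartile_move, 1))  # -4.1
```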
Next, column (4) shows the results of Lohmann's linguistic features index as a measure of common language. Due to restricted data availability, the linguistic features index is only computable for a subsample of 227,145 observations.14 The coefficient reveals that a pair of countries trades about (exp(−0.618) − 1 ≈) 4.6% less if the linguistic features index increases by 0.1 units (corresponding to a 10% decrease in common linguistic features). To compare the influence of the language features index by Lohmann (2011) to the
13 To examine whether the effect of the Levenshtein distance on bilateral trade only builds on the grounds of sharing or not sharing a common language and not on the linguistic distance between different languages, we estimate models 2-4 with subsamples excluding country-pairs with no language barrier in the corresponding measure, i.e., a linguistic distance of zero. The results are available upon request. Regarding both versions of the Levenshtein distance measure, they become even larger in magnitude and are stable in significance, while the linguistic features index becomes distinctly smaller in magnitude and significance.
14 To check for sample selection we additionally estimated models 1-3 restricted to the same subsample. The results are available upon request. The estimates regarding the language variables remained stable in magnitude and significance.
Levenshtein distance, we compute the elasticity and the marginal effect multiplied by the interquartile range of the linguistic features index. Increasing the linguistic features index by 1% decreases bilateral trade by about 0.3% compared to a 0.6% decrease in case of the Levenshtein distance. Moving up the distribution of the linguistic features index from the lower to the upper quartile decreases trade between countries by (exp(−1.465) − 1 ≈) 7.7%. The results show a larger effect for the Levenshtein distance with regard to elasticities. Since the distribution of the Levenshtein distance is right-skewed, the value of the interquartile range is smaller compared to the interquartile range of the linguistic features index. As a result, the effect of the linguistic features index becomes larger than the effect of the Levenshtein distance. In summary, the empirical analysis provides evidence that according to both measures linguistic distance has a statistically and economically significant negative effect on bilateral trade flows. The estimated coefficients of the control variables confirm the traditional results of gravity trade equations. The indicators for whether one or both countries in the dyad participated in the GATT/WTO have significantly positive coefficients. The respective coefficients are comparable to those reported by Tomz et al. (2007). Countries that are farther apart trade less, while countries belonging to the same regional trade association, belonging to the same GSP, or sharing a currency trade more. Islands or landlocked countries trade less, while countries sharing a land border trade more. Economically larger and richer countries trade more, as do physically larger countries. A shared colonial history encourages trade as well. These estimation results are both statistically and economically significant and in line with estimates from previous literature. 
As compared to the first specification, the application of the Levenshtein distance measure does not considerably affect the magnitude or significance of the other independent variables. All variables show the expected results. However, the coefficient of common colonizer increases by about 10 percentage points, indicating that the effect of cultural ties is underestimated in the traditional gravity model. During the colonization period, colonizers created new institutions such as the legal and administrative system in their colonies. These institutions impose policies and law enforcement, thereby determining the formal and informal rules in commerce. Since international transactions between countries with different or poorly developed institutional settings involve high transaction costs, colonial ties between countries that had the same colonial history, and therefore established a similar institutional system, facilitate bilateral trade flows. Despite the fact that the colonizers' languages became the official languages in the colonies and represent one of the official languages in most former colonies even today, a large part of the population failed to achieve an acceptable degree of knowledge in these languages (see, e.g., Lewis, 2009). Hence, using information on common official languages in a country-pair to estimate trade flows, in particular between countries with a common colonizer, might underestimate the effect of colonial ties and overestimate the effect of the common colonizer's language on trade patterns. In summary, a common colonizer promotes trade between these countries by establishing a similar institutional setting; an effect that might be hidden when not controlling properly for linguistic heterogeneity.
2.5 Conclusion
This study is concerned with the operationalization of linguistic distance between languages and the estimation of the resulting costs of linguistic barriers on the micro- and macro-level. Linguistic barriers are strong obstacles in the realization of free worldwide factor movements. The operationalization of linguistic barriers in applied economic studies is not straightforward and makes it necessary to rely on interdisciplinary approaches drawing heavily from linguistic research.

Our measure for linguistic distance is based on the Automatic Similarity Judgment Program (ASJP) by the German Max Planck Institute for Evolutionary Anthropology. The linguistic distance is computed as a function of phonetic similarity of words (a Levenshtein distance) from different languages having the same meaning. It can be used as an approximation of the historical difference in languages and is therefore also correlated with differences in other dimensions of dissimilarity, such as grammar or vocabularies. Compared to the previous approach by Chiswick and Miller (1999), which measures linguistic distance by using average test-scores of second language students, the Levenshtein distance has some advantages. It is available for any pair of the world's languages (instead of being only applicable for the distance toward English). Additionally, it is not influenced by other extrinsic or intrinsic incentives for learning a foreign language, and should deliver an unbiased approximation of the dissimilarity between languages.

The measurement of linguistic distance is used in two applications, the language acquisition of immigrants and language barriers in bilateral trade flows. Following the widely accepted rational choice framework of language acquisition (see, e.g., Chiswick and Miller, 1995; Esser, 2006), linguistic distance affects second language skills by lowering the initial efficiency, thereby imposing higher costs of learning a foreign language.
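As an illustration of the measure summarized above, the following minimal sketch computes a normalized and divided Levenshtein distance between two short word lists. It assumes the commonly described ASJP-style construction (edit distance normalized by the length of the longer word, averaged over same-meaning pairs, then divided by the average over different-meaning pairs) and a scaling by 100. The toy word lists use ordinary spelling rather than phonetic transcriptions, and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer (non-empty) word."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(words_a, words_b) -> float:
    """Normalized and divided distance between two aligned word lists.

    Position i in both lists must refer to the same meaning.
    """
    n = len(words_a)
    same = sum(normalized(words_a[i], words_b[i]) for i in range(n)) / n
    cross = [normalized(words_a[i], words_b[j])
             for i in range(n) for j in range(n) if i != j]
    different = sum(cross) / len(cross)
    return 100.0 * same / different


# Toy lists for the meanings "you", "we", "one", "two", "fish" (plain spelling).
english = ["you", "we", "one", "two", "fish"]
german = ["du", "wir", "eins", "zwei", "fisch"]

print(ldnd(english, german))
```

Dividing by the average distance of word pairs with different meanings is intended to correct for chance resemblance between languages with similar sound inventories; it also explains why the resulting values can exceed 100.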
Following previous work that shows such a negative relationship for English-speaking countries, we broaden the evidence to other countries by applying the measure in estimations using U.S., German, and Spanish individual micro-data. The results confirm a strong and significantly negative effect of linguistic distance on immigrant language skills. The initial disadvantage due to distant linguistic origin is worth several years of additional residence. As such, the linguistic distance is able to explain a large part of language skill heterogeneity in immigrant populations. The considerable hurdles for language acquisition on the micro-level might explain the lower migration rates between linguistically distant countries, as analyzed by Adsera and Pytlikova (2012).

To additionally look at how these effects on the micro-level accumulate to costs of linguistic barriers on the macro-level, we apply the Levenshtein distance in the setting of international trade. Linguistic proximity is believed to enhance trade flows between countries by lowering costs imposed by language barriers, e.g., translation or information costs. Using a comprehensive dataset of bilateral trade flows by Rose (2004), we estimate a standard gravity model using the Levenshtein distance as an additional explanatory variable and compare this approach to a previous approach based on shared linguistic features by Lohmann (2011). The results provide new and strong evidence indicating that language barriers affect trade above and beyond the simple effect of sharing a common language. Moving up the distribution of the Levenshtein distance from the lower quartile (roughly the distance between English and Russian) to the upper quartile (roughly the distance between English and Japanese) decreases trade between countries by about 4.1%.

Taken together, this study suggests an important role of language differences in economic transactions. The results show the significant economic costs of linguistic heterogeneity on the individual and aggregated level. The Levenshtein distance offers a simple and comprehensive way to control for this heterogeneity in a large range of applications in empirical economics and thereby circumvents potential pitfalls by decreasing the degree of unobserved heterogeneity in the data.

Tables
Table 2.1: Closest and Furthest Language Pairs with Respect to the Levenshtein Distance

                      Closest                               Furthest
                      Language                  Distance    Language                  Distance
Distance to English   Afrikaans                 62.08       Vietnamese                104.06
                      Dutch                     63.22       Turkmen                   103.84
                      Norwegian                 64.12       Hakka (China)             103.10

Distance to German    Luxembourgish             42.12       Korean                    104.30
                      Dutch                     51.50       Palestinian Arabic        103.72
                      Westvlaams (Belgium)      57.86       Yoruba (Nigeria)          103.58

Distance to Spanish   Galician                  54.82       Wolof (Senegal)           103.02
                      Italian                   56.51       Igbo Onitsha (Nigeria)    102.84
                      Portuguese                64.21       Ewondo (Cameroon)         101.87

Notes: – The table shows the three closest and furthest languages toward English, German and Spanish according to the normalized and divided Levenshtein distance. – Only languages spoken within samples are listed. – Geographic origin of language in parentheses.
Table 2.2: Descriptive Statistics of Dependent and Explanatory Variables – Immigration Sample

                                        2000 U.S. Census    SOEP                NISS
                                        Mean      StdD      Mean      StdD      Mean      StdD
Good language skills                    0.58      0.49      0.52      0.50      0.58      0.49
Years of education                      11.32     4.28      10.50     2.21      10.82     3.38
Age at entry                            26.76     8.72      28.68     8.93      30.00     9.63
Years since migration                   12.72     9.91      18.50     11.11     8.33      6.96
Married                                 0.68      0.47      0.86      0.35      0.58      0.49
One child                               0.19      0.39      0.50      0.50      0.24      0.43
Two children                            0.19      0.39      0.21      0.41      0.22      0.42
Three or more children                  0.14      0.35      0.18      0.38      0.34      0.47
Distance to home country (in 100 km)    57.60     39.95     19.14     14.62     24.41     22.96
Naturalized                             0.34      0.47      0.35      0.48      0.07      0.25
Former colony                           0.11      0.32      0.10      0.30      0.03      0.18
Southern states                         0.29      0.45      –         –         –         –
Non-metropolitan area                   0.01      0.12      –         –         –         –
Minority language share                 0.33      0.25      –         –         –         –
Abroad five years ago                   0.23      0.42      –         –         –         –
Refugee                                 0.12      0.32      0.07      0.25      –         –
Neighboring country                     –         –         0.12      0.33      –         –
Family abroad                           –         –         0.30      0.46      –         –
Political reasons                       –         –         –         –         0.03      0.16

Notes: – Number of observations: 59,889 in the 2000 U.S. Census, 675 in the SOEP, and 2,513 in the NISS Sample. – The dependent variable “Good language skills” is defined dichotomously, 1 indicates higher language skills.
Table 2.3: Immigrant’s Language Skills – Probit Results

Dataset:                                2000 U.S. Census    2000 U.S. Census    SOEP                NISS
Linguistic distance measure:            Test-score          LDND                LDND                LDND
                                        ME (StdE)           ME (StdE)           ME (StdE)           ME (StdE)
Linguistic distance (Test-score-based)  −0.003∗∗∗ (0.000)   –                   –                   –
Levenshtein distance (ASJP)             –                   −0.001∗∗∗ (0.000)   −0.002∗ (0.001)     −0.002∗∗ (0.001)
Years of education                      0.048∗∗∗ (0.001)    0.048∗∗∗ (0.001)    0.054∗∗∗ (0.011)    0.029∗∗∗ (0.003)
Age at entry                            −0.018∗∗∗ (0.002)   −0.018∗∗∗ (0.002)   −0.039∗∗ (0.015)    0.006 (0.007)
Age at entry^2/100                      0.012∗∗∗ (0.002)    0.012∗∗∗ (0.002)    0.049∗ (0.022)      −0.019∗ (0.010)
Years since migration                   0.014∗∗∗ (0.001)    0.014∗∗∗ (0.001)    0.032∗∗ (0.011)     0.039∗∗∗ (0.005)
Years since migration^2/100             −0.021∗∗∗ (0.002)   −0.021∗∗∗ (0.002)   −0.058∗ (0.024)     −0.067∗∗∗ (0.015)
Married                                 0.020∗∗∗ (0.005)    0.020∗∗∗ (0.005)    −0.008 (0.066)      0.003 (0.026)
Children in the HH. (Ref. = 0)
  One child                             0.018∗∗ (0.006)     0.019∗∗ (0.006)     0.013 (0.083)       −0.055† (0.030)
  Two children                          0.015∗ (0.006)      0.015∗ (0.006)      −0.038 (0.083)      0.072† (0.038)
  Three or more children                −0.001 (0.007)      −0.000 (0.007)      −0.094 (0.087)      −0.128∗∗ (0.041)
Distance to home country (in 100 km)    −0.002† (0.001)     −0.002† (0.001)     0.006 (0.008)       0.001 (0.004)
Distance to home country^2/100          0.003∗∗∗ (0.001)    0.003∗∗∗ (0.001)    −0.002 (0.010)      0.001 (0.002)
Naturalized                             0.138∗∗∗ (0.006)    0.138∗∗∗ (0.006)    0.300∗∗∗ (0.065)    0.043 (0.049)
Former colony                           0.108∗∗∗ (0.011)    0.102∗∗∗ (0.011)    −0.222 (0.218)      0.227∗∗∗ (0.059)
Southern states                         0.044∗∗∗ (0.005)    0.045∗∗∗ (0.005)    –                   –
Non-metropolitan area                   0.051∗∗ (0.018)     0.049∗∗ (0.018)     –                   –
Minority language share                 −0.252∗∗∗ (0.019)   −0.280∗∗∗ (0.020)   –                   –
Abroad five years ago                   −0.093∗∗∗ (0.007)   −0.091∗∗∗ (0.007)   –                   –
Refugee                                 −0.233∗∗∗ (0.009)   −0.214∗∗∗ (0.009)   −0.085 (0.103)      –
Neighboring country                     –                   –                   0.310† (0.166)      –
Family abroad                           –                   –                   −0.069 (0.056)      –
Political reasons                       –                   –                   –                   0.045 (0.063)
Region fixed effects                    yes                 yes                 yes                 yes
Pseudo-R2                               0.263               0.261               0.160               0.138
Observations                            59,889              59,889              675                 2,513

Notes: – Significant at: ∗∗∗0.1% level; ∗∗1%
level; ∗5% level; †10% level. – Robust standard errors are reported in parentheses. – The dependent variable is defined dichotomously, 1 indicates higher language skills. – Probit results are reported as marginal effects evaluated at covariate means. – Region controls are not recorded.

Table 2.4: Descriptive Sample Statistics and Variable Definitions – International Trade Sample

Variable                        Definition
Log real trade                  average value of real bilateral trade between countries i and j at year t in US $
Common language                 binary variable which is unity if i and j share a common language and zero otherwise
Levenshtein distance            Levenshtein distance using the most prevalent native language of each country
Levenshtein distance LF         Levenshtein distance using the most prevailing lingua franca of each country
Linguistic features index       index which increases with decreasing similarity of a language-pair (values between 0 and 1)
Both in GATT/WTO                binary variable which is unity if both i and j are GATT/WTO participants at t
One in GATT/WTO                 binary variable which is unity if either i or j is a GATT/WTO participant at t
General system of preferences   binary variable which is unity if i was a GSP beneficiary of j or vice versa at t
Log distance                    great circle distance between country i and country j in miles
Log product real GDP            product of the real GDPs of both countries in year t
Log product real GDP p/c        product of the real GDPs per capita of both countries in year t
Regional FTA                    binary variable which is unity if i and j both belong to the same regional trade agreement
Currency union                  binary variable which is unity if i and j use the same currency at time t
Land border                     binary variable which is unity if i and j share a land border
Number landlocked               number of landlocked nations in the country-pair (0, 1, or 2)
Number islands                  number of island nations in the country-pair (0, 1, or 2)
Log product land area           product of the land areas of both countries (in square kilometers)
Common colonizer                binary variable which is unity if i and j were ever colonies after 1945 with the same colonizer
Currently colonized             binary variable which is unity if i is a colony of j at time t or vice versa
Ever colony                     binary variable which is unity if i ever colonized j or vice versa
Common country                  binary variable which is unity if i and j remained part of the same nation during the sample

[Mean and StdD columns: values illegible in the source.]

Notes: – Number of observations: 234,597 in 12,150 country-pair groups with 1 to 52 observations per group. The mean is 19.3 observations per group. – For the linguistic features index there are 227,145 observations in 11,348 country-pair groups.
Table 2.5: Effect of Language on Bilateral Trade – OLS Results

                                        ComLang             LDND I              LDND II             LingFeat
                                        Coef (StdE)         Coef (StdE)         Coef (StdE)         Coef (StdE)
Common language                         0.274∗∗∗ (0.044)    –                   –                   –
Levenshtein distance                    –                   −0.006∗∗∗ (0.001)   –                   –
Levenshtein distance LF                 –                   –                   −0.003∗∗∗ (0.000)   –
Linguistic features index               –                   –                   –                   −0.618∗∗∗ (0.098)
Both in GATT/WTO                        0.604∗∗∗ (0.061)    0.618∗∗∗ (0.061)    0.609∗∗∗ (0.061)    0.578∗∗∗ (0.062)
One in GATT/WTO                         0.277∗∗∗ (0.056)    0.288∗∗∗ (0.056)    0.283∗∗∗ (0.056)    0.247∗∗∗ (0.056)
General system of preferences           0.709∗∗∗ (0.032)    0.733∗∗∗ (0.031)    0.711∗∗∗ (0.032)    0.721∗∗∗ (0.032)
Log distance                            −1.313∗∗∗ (0.023)   −1.278∗∗∗ (0.024)   −1.308∗∗∗ (0.023)   −1.293∗∗∗ (0.024)
Log product real GDP                    0.167∗∗ (0.051)     0.165∗∗ (0.051)     0.164∗∗ (0.051)     0.159∗∗ (0.053)
Log product real GDP p/c                0.532∗∗∗ (0.049)    0.533∗∗∗ (0.049)    0.535∗∗∗ (0.049)    0.552∗∗∗ (0.050)
Regional FTA                            0.941∗∗∗ (0.126)    0.942∗∗∗ (0.126)    0.939∗∗∗ (0.126)    0.975∗∗∗ (0.129)
Currency union                          1.174∗∗∗ (0.122)    1.253∗∗∗ (0.125)    1.169∗∗∗ (0.123)    1.208∗∗∗ (0.124)
Land border                             0.280∗∗ (0.108)     0.283∗∗ (0.108)     0.284∗∗ (0.108)     0.292∗∗ (0.113)
Number landlocked                       −1.056∗∗∗ (0.207)   −1.032∗∗∗ (0.205)   −1.014∗∗∗ (0.205)   −0.971∗∗∗ (0.208)
Number islands                          −1.579∗∗∗ (0.188)   −1.575∗∗∗ (0.188)   −1.622∗∗∗ (0.188)   −1.545∗∗∗ (0.190)
Log product land area                   0.496∗∗∗ (0.041)    0.501∗∗∗ (0.041)    0.513∗∗∗ (0.041)    0.496∗∗∗ (0.041)
Common colonizer                        0.605∗∗∗ (0.064)    0.703∗∗∗ (0.062)    0.592∗∗∗ (0.065)    0.687∗∗∗ (0.065)
Currently colonized                     0.743∗∗ (0.263)     0.744∗∗ (0.252)     0.753∗∗ (0.264)     0.719∗∗ (0.262)
Ever colony                             1.274∗∗∗ (0.114)    1.272∗∗∗ (0.116)    1.261∗∗∗ (0.114)    1.339∗∗∗ (0.113)
Common country                          0.288 (0.583)       0.090 (0.658)       0.263 (0.579)       0.278 (0.617)
Year fixed effects                      yes                 yes                 yes                 yes
Country fixed effects                   yes                 yes                 yes                 yes
Adjusted R2                             0.703               0.703               0.703               0.705
RMSE                                    1.818               1.817               1.817               1.805
F Statistic                             274.34∗∗∗           272.11∗∗∗           274.96∗∗∗           265.50∗∗∗
Observations                            234,597             234,597             234,597             227,145

Notes: – Significant at: ∗∗∗0.1% level; ∗∗1% level; ∗5% level; †10% level. – Robust standard errors (clustered at the country-pair level) are reported in parentheses. – The dependent variable is defined as log of real bilateral trade in US$. – Intercept, year, and country controls are not recorded.

2.A Appendix
Table 2.A1: 40-Items Swadesh Word List

I         You       We        One       Two       Person    Fish      Dog
Louse     Tree      Leaf      Skin      Blood     Bone      Horn      Ear
Eye       Nose      Tooth     Tongue    Knee      Hand      Breast    Liver
Drink     See       Hear      Die       Come      Sun       Star      Water
Stone     Fire      Path      Mountain  Night     Full      New       Name

Source: Bakker et al. (2009).

[Figure 2.A1 not reproduced. Axes: Levenshtein Distance; Linguistic Distance (Test-Score-Based).]

Figure 2.A1: Comparisons of Linguistic Distance Using the Test-Score-Based Measure and the Levenshtein Distance – 2000 U.S. Census
Table 2.A2: Summary Statistics for the Language Variables – International Trade Sample

A. Simple Correlations among Language Distance Measures
                              Common language     Levenshtein distance    Levenshtein distance LF    Linguistic features index^a
Common language               1
Levenshtein distance          -0.3868             1
Levenshtein distance LF       -0.6689             0.4813                  1
Linguistic features index^a   -0.3533             0.5490                  0.4070                     1

B. Frequency of Country-pairs with and without the same Language
                              Common language     Levenshtein distance    Levenshtein distance LF    Linguistic features index^a
Same language                 52,205              11,017                  43,229                     21,389
Different language            182,392             223,580                 191,368                    205,756

Notes: – Number of observations: 234,597, except ^a 227,145.
[Figure 2.A2 not reproduced. Axes: Density; Levenshtein Distance; Log Bilateral Trade.]

Figure 2.A2: Bivariate Kernel Density Estimation of Log Bilateral Trade and Levenshtein Distance – International Trade Sample
Sensitivity Analysis
As mentioned above, we perform a number of sensitivity analyses which, in each case, find similar results to those reported above. First, we estimate our model of immigrants' language skills using an expanded sample which also includes individuals speaking the language of the host country as mother tongue. The results are reported in Table 2.A3. Second, we repeat our estimations using the four- or fivefold information on language fluency as dependent variable in Ordered Probit models. The marginal effects across the different categories indicate a comparable effect as in the dichotomous case. The signs of the marginal effects change at the same threshold we use for the dichotomization. The results are available upon request.

Third, Tables 2.A4 and 2.A5 report estimation results for two subsamples of the trade sample. The results are quite similar in magnitude and significance level to those for the whole sample. Table 2.A4 examines the sensitivity of the results with respect to the measurement of linguistic distance. To this end, we exclude dyad-observations with the same language from the sample, thereby including only country-pairs with a language barrier greater than zero. This tests whether the estimated language barrier effect is driven merely by country-pairs sharing or not sharing a common language, rather than by linguistic distance per se. Table 2.A5 analyzes the sensitivity of our results when we restrict our sample to the slightly smaller one of Lohmann's linguistic features index.

Finally, we add country-by-time interaction terms for both countries in a country-pair, $\sum_{it} \eta_{it} (I \times T)_{it}$ and $\sum_{jt} \psi_{jt} (J \times T)_{jt}$, to the models of Table 2.5. These interaction terms capture any exporter and importer specific time-variant effects such as each country's business cycle or its institutional characteristics. The findings for the key variables (available upon request) are quite similar in magnitude and significance level to those for the models with country and year specific fixed effects.
Table 2.A3: Immigrant’s Language Skills – Probit Results, Including Native Speakers

                                        ME (StdE)           ME (StdE)           ME (StdE)           ME (StdE)
Linguistic distance (Test-score-based)  −0.008∗∗∗ (0.000)   –                   –                   –
Levenshtein distance (ASJP)             –                   −0.007∗∗∗ (0.000)   −0.003∗ (0.001)     −0.004∗∗∗ (0.000)
Years of education                      0.040∗∗∗ (0.001)    0.040∗∗∗ (0.001)    0.054∗∗∗ (0.011)    0.015∗∗∗ (0.002)
Age at entry                            −0.014∗∗∗ (0.001)   −0.015∗∗∗ (0.001)   −0.038∗ (0.015)     0.002 (0.003)
Age at entry^2/100                      0.009∗∗∗ (0.002)    0.010∗∗∗ (0.002)    0.047∗ (0.021)      −0.008 (0.005)
Years since migration                   0.013∗∗∗ (0.001)    0.013∗∗∗ (0.001)    0.029∗∗ (0.011)     0.022∗∗∗ (0.003)
Years since migration^2/100             −0.020∗∗∗ (0.002)   −0.022∗∗∗ (0.002)   −0.053∗ (0.024)     −0.038∗∗∗ (0.008)
Married                                 0.015∗∗ (0.005)     0.018∗∗∗ (0.005)    0.007 (0.066)       0.003 (0.014)
Children in the HH. (Ref. = 0)
  One child                             0.012∗ (0.005)      0.016∗∗ (0.005)     0.008 (0.082)       −0.030† (0.017)
  Two children                          0.012∗ (0.005)      0.014∗ (0.005)      −0.039 (0.082)      0.035∗ (0.018)
  Three or more children                −0.005 (0.006)      0.001 (0.006)       −0.092 (0.086)      −0.065∗∗ (0.022)
Distance to home country (in 100 km)    −0.000 (0.001)      0.000 (0.001)       0.007 (0.008)       0.001 (0.002)
Distance to home country^2/100          0.002∗∗∗ (0.000)    0.002∗∗∗ (0.000)    −0.003 (0.010)      0.000 (0.001)
Naturalized                             0.108∗∗∗ (0.004)    0.106∗∗∗ (0.005)    0.268∗∗∗ (0.066)    0.017 (0.024)
Former colony                           0.123∗∗∗ (0.008)    0.095∗∗∗ (0.008)    −0.203 (0.202)      0.240∗∗∗ (0.030)
Southern states                         0.049∗∗∗ (0.004)    0.064∗∗∗ (0.004)    –                   –
Non-metropolitan area                   0.039∗∗ (0.015)     0.021 (0.015)       –                   –
Minority language share                 −0.340∗∗∗ (0.016)   −0.567∗∗∗ (0.017)   –                   –
Abroad five years ago                   −0.084∗∗∗ (0.006)   −0.078∗∗∗ (0.006)   –                   –
Refugee                                 −0.266∗∗∗ (0.009)   −0.214∗∗∗ (0.008)   −0.096 (0.102)      –
Neighboring country                     –                   –                   0.313∗ (0.159)      –
Family abroad                           –                   –                   −0.085 (0.056)      –
Political reasons                       –                   –                   –                   0.032 (0.029)
Region fixed effects                    yes                 yes                 yes                 yes
Pseudo-R2                               0.304               0.299               0.165               0.347
Observations                            70,201              70,201              689                 3,986

Notes: – Significant at: ∗∗∗0.1% level; ∗∗1% level; ∗5% level; †10% level. – Robust standard errors are reported in parentheses. – The dependent variable is defined dichotomously, 1 indicates higher language skills. – Probit results are reported as marginal effects evaluated at covariate means. – Region controls are not recorded.
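The table notes state that the probit results are reported as marginal effects evaluated at covariate means. A minimal sketch of that calculation for continuous regressors is given below, assuming the textbook formula φ(x̄′β)·β_k, where φ is the standard normal density. The coefficients and covariate means are hypothetical and are not the estimates behind the tables; for dummy regressors a discrete change in the predicted probability is often reported instead, which this sketch does not cover.

```python
from statistics import NormalDist


def marginal_effects_at_means(beta, x_mean):
    """Probit marginal effects phi(x_mean' beta) * beta_k for each regressor."""
    index = sum(b * x for b, x in zip(beta, x_mean))
    density = NormalDist().pdf(index)  # standard normal density at the index
    return [density * b for b in beta]


# Hypothetical probit coefficients: intercept, linguistic distance, years of education.
beta = [0.2, -0.005, 0.12]
# Hypothetical covariate means (the intercept "mean" is 1 by construction).
x_mean = [1.0, 90.0, 11.0]

effects = marginal_effects_at_means(beta, x_mean)
print(effects[1])  # marginal effect of one unit of linguistic distance
print(effects[2])  # marginal effect of one year of education
```

Because the density term is common to all regressors, marginal effects at means keep the sign and relative size of the underlying probit coefficients.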
Table 2.A4: Effect of Language on Bilateral Trade – OLS Results, Subsample Language Barrier > 0

                                        LDND I              LDND II             LingFeat
                                        Coef (StdE)         Coef (StdE)         Coef (StdE)
Levenshtein distance                    −0.008∗∗∗ (0.002)   –                   –
Levenshtein distance LF                 –                   −0.008∗∗∗ (0.002)   –
Linguistic features index               –                   –                   −0.266∗ (0.123)
Both in GATT/WTO                        0.506∗∗∗ (0.067)    0.435∗∗∗ (0.070)    0.454∗∗∗ (0.069)
One in GATT/WTO                         0.219∗∗∗ (0.061)    0.193∗∗ (0.064)     0.174∗∗ (0.063)
General system of preferences           0.754∗∗∗ (0.031)    0.612∗∗∗ (0.032)    0.682∗∗∗ (0.032)
Log distance                            −1.281∗∗∗ (0.025)   −1.211∗∗∗ (0.027)   −1.271∗∗∗ (0.027)
Log product real GDP                    0.060 (0.053)       −0.070 (0.058)      −0.005 (0.056)
Log product real GDP p/c                0.646∗∗∗ (0.051)    0.793∗∗∗ (0.056)    0.727∗∗∗ (0.054)
Regional FTA                            0.828∗∗∗ (0.147)    −0.252∗ (0.118)     0.190 (0.155)
Currency union                          1.312∗∗∗ (0.133)    1.138∗∗∗ (0.275)    1.200∗∗∗ (0.203)
Land border                             0.281∗ (0.120)      0.391∗∗ (0.130)     0.301∗ (0.133)
Number landlocked                       −1.232∗∗∗ (0.206)   −1.330∗∗∗ (0.207)   −1.307∗∗∗ (0.214)
Number islands                          −1.837∗∗∗ (0.192)   −2.492∗∗∗ (0.199)   −1.993∗∗∗ (0.203)
Log product land area                   0.551∗∗∗ (0.042)    0.668∗∗∗ (0.043)    0.581∗∗∗ (0.044)
Common colonizer                        0.697∗∗∗ (0.063)    0.887∗∗∗ (0.092)    0.662∗∗∗ (0.069)
Currently colonized                     0.322 (0.292)       1.149∗∗ (0.391)     0.442 (0.390)
Ever colony                             1.517∗∗∗ (0.131)    1.041∗∗∗ (0.193)    1.182∗∗∗ (0.152)
Common country                          1.203∗∗∗ (0.346)    –                   –
Year fixed effects                      yes                 yes                 yes
Country fixed effects                   yes                 yes                 yes
Adjusted R2                             0.702               0.706               0.708
RMSE                                    1.827               1.792               1.801
F Statistic                             860.28∗∗∗           267.50∗∗∗           275.73∗∗∗
Observations                            223,580             191,368             205,756

Notes: – Significant at: ∗∗∗0.1% level; ∗∗1% level; ∗5% level; †10% level. – Robust standard errors (clustered at the country-pair level) are reported in parentheses. – The dependent variable is defined as log of real bilateral trade in US$. – Intercept, year, and country controls are not recorded. – In columns (2) and (3) common country is omitted from the equations because of collinearity.
Table 2.A5: Effect of Language on Bilateral Trade – OLS Results, Subsample Linguistic Features Index

                                        ComLang             LDND I              LDND II
                                        Coef (StdE)         Coef (StdE)         Coef (StdE)
Common language                         0.292∗∗∗ (0.045)    –                   –
Levenshtein distance                    –                   −0.006∗∗∗ (0.001)   –
Levenshtein distance LF                 –                   –                   −0.004∗∗∗ (0.001)
Both in GATT/WTO                        0.585∗∗∗ (0.062)    0.600∗∗∗ (0.062)    0.592∗∗∗ (0.062)
One in GATT/WTO                         0.255∗∗∗ (0.057)    0.267∗∗∗ (0.056)    0.264∗∗∗ (0.057)
General system of preferences           0.707∗∗∗ (0.032)    0.732∗∗∗ (0.031)    0.708∗∗∗ (0.032)
Log distance                            −1.305∗∗∗ (0.023)   −1.268∗∗∗ (0.024)   −1.297∗∗∗ (0.023)
Log product real GDP                    0.166∗∗ (0.053)     0.163∗∗ (0.053)     0.164∗∗ (0.052)
Log product real GDP p/c                0.546∗∗∗ (0.050)    0.547∗∗∗ (0.050)    0.548∗∗∗ (0.050)
Regional FTA                            0.980∗∗∗ (0.129)    0.980∗∗∗ (0.128)    0.975∗∗∗ (0.128)
Currency union                          1.185∗∗∗ (0.123)    1.268∗∗∗ (0.126)    1.172∗∗∗ (0.124)
Land border                             0.285∗ (0.113)      0.290∗∗ (0.112)     0.293∗∗ (0.112)
Number landlocked                       −1.033∗∗∗ (0.209)   −1.009∗∗∗ (0.207)   −0.987∗∗∗ (0.207)
Number islands                          −1.602∗∗∗ (0.191)   −1.599∗∗∗ (0.190)   −1.655∗∗∗ (0.190)
Log product land area                   0.493∗∗∗ (0.042)    0.498∗∗∗ (0.041)    0.510∗∗∗ (0.041)
Common colonizer                        0.595∗∗∗ (0.067)    0.701∗∗∗ (0.065)    0.570∗∗∗ (0.068)
Currently colonized                     0.734∗∗ (0.268)     0.734∗∗ (0.256)     0.744∗∗ (0.269)
Ever colony                             1.255∗∗∗ (0.115)    1.251∗∗∗ (0.117)    1.222∗∗∗ (0.114)
Common country                          0.307 (0.596)       0.103 (0.678)       0.281 (0.592)
Year fixed effects                      yes                 yes                 yes
Country fixed effects                   yes                 yes                 yes
Adjusted R2                             0.705               0.706               0.706
RMSE                                    1.805               1.804               1.805
F Statistic                             268.30∗∗∗           265.73∗∗∗           269.23∗∗∗
Observations                            227,145             227,145             227,145

Notes: – Significant at: ∗∗∗0.1% level; ∗∗1% level; ∗5% level; †10% level. – Robust standard errors (clustered at the country-pair level) are reported in parentheses. – The dependent variable is defined as log of real bilateral trade in US$. – Intercept, year, and country controls are not recorded.
Chapter 3
Linguistic Barriers in the Destination Language Acquisition of Immigrants∗
3.1 Introduction
Already the biblical description of the fall of the Tower of Babel acknowledged the fact that differences and diversity between languages impose major obstacles for human communication. A range of empirical studies have shown that linguistic barriers constitute distinctive hurdles for international factor flows, e.g., in international trade (Isphording and Otten, 2013; Lohmann, 2011) or international migration flows (Adsera and Pytlikova, 2012; Belot and Ederveen, 2012). On the individual level, language skills have been analyzed as being a crucial determinant for the economic and social integration of immigrants in their destination country, starting with early work by Carliner (1981) and McManus et al. (1983) and more recently estimating strong wage effects for destination language proficiency (Bleakley and Chin, 2004; Chiswick and Miller, 1995; Dustmann and van Soest, 2002). These wage effects arise from the role of language as a medium of everyday and working life, constituting an important productive trait of individuals (Crystal, 2010). Furthermore, low proficiency may also act as a signal of foreignness, facilitating discrimination and differentiation (Esser, 2006). Apart from wages, language proficiency is related to further economic outcomes, such as employment status (Dustmann and Fabbri, 2003), occupational
∗Co-authored with Ingo E. Isphording (IZA). This chapter is published in the Journal of Economic Behavior & Organization, 105, 2014. A preliminary version of this chapter is available as Ruhr Economic Paper #274 and IZA Discussion Paper No. 8090. The authors are grateful to Thomas K. Bauer, John P. Haisken-DeNew, Julia Bredtmann, Carsten Crede, Michael Kind, Jan Kleibrink, Maren Michaelsen, William Neilson, the participants of the EEA 2011, the EALE 2011, and the International German Socio-Economic Panel User Conference 2012 for helpful comments and suggestions. CHAPTER 3. LINGUISTIC BARRIERS IN LANGUAGE ACQUISITION 36
choice (Chiswick and Miller, 2007), and locational choice (Bauer et al., 2005). Language skills are not randomly distributed: rather, they are the outcome of a systematic human capital investment decision influenced by costs and expected benefits (Chiswick and Miller, 1995). This study is concerned with the analysis of a specific cost factor of language acquisition related to the origin of an immigrant. The difficulty of learning a new language depends on the degree of dissimilarity between the mother tongue of immigrants and the language of the destination country. This linguistic distance, denoting differences between vocabularies, phonetic inventories, grammars, scripts, etc., is expected to crucially affect the efficiency of language learning and to raise the costs of human capital investment. In spite of the strong impact of immigrants’ destination-language skills on their integration process, the literature on the determinants of destination-language acquisition remains surprisingly scarce. The systematic analysis of the determinants of language proficiency started with the early work by Evans (1986) comparing immigrants in Germany, the US, and Australia. More recently, Chiswick and Miller (1999, 2002, 2005) provide a comprehensive analysis of the language acquisition of immigrants in the US. For Germany, Dustmann (1999) analyzes language proficiency as a jointly determined outcome along with migration duration. Dustmann and van Soest (2001) take into account potential misclassification in self-reported language proficiency, and Danzer and Yaman (2010) analyze German language proficiency as a function of enclave density. Still, the influence of characteristics related to the country of origin, such as the linguistic distance faced by immigrants, remains an under-researched area (Esser, 2006).
The major challenge in analyzing the effect of linguistic barriers on the language acquisition of immigrants is to operationalize the linguistic distance for use in large scale micro data studies. We propose drawing from comparative linguistics and using an innovative linguistically based operationalization of linguistic distance, the so-called normalized and divided Levenshtein distance calculated by the Automated Similarity Judgment Program (ASJP). The ASJP approach offers advantages in terms of transparent computation and general applicability. We compare its benefits to those of three other approaches previously used in the economic literature to measure linguistic distance: (i) the WALS measure, which uses differences in language characteristics, (ii) the TREE measure, which is based on a priori knowledge of language families, and (iii) a measure based on average test scores of native US foreign language students (SCORE). Combining this information on language differences with US and German micro data, we provide a comprehensive analysis of the influence of the linguistic origin on the acquisition of destination language proficiency. The US and Germany are excellent examples for analyzing the language acquisition of immigrants. Both countries have a long history as significant immigration hubs, receiving immigrants from a large variety of source countries. The present study contributes to the literature on the determinants of language proficiency in several ways. First, we provide a comprehensive overview of the different methods of deriving a measure of language differences applicable to the analysis of the role of languages in economic behavior. Second, we introduce the ASJP approach as an easily and transparently computed measure of linguistic dissimilarity between languages. 
Moreover, this new approach to measuring linguistic distance is applicable to any of the world’s languages, and offers specific advantages compared to other linguistic and non-linguistic approaches used in the previous literature. We apply the derived methods to explain the language acquisition of immigrants in the US using the American Community Survey (ACS) as a very recent data source. Finally, we contribute to the literature by taking advantage of the general applicability of the linguistically based methods and extending our analysis beyond the case of Anglophone countries using data from the German Socio-Economic Panel (SOEP). Our results suggest that the linguistic barriers raised by language differences play a crucial role in the determination of the destination-country language proficiency of immigrants. Regardless of the method employed, we estimate large initial disadvantages from linguistic distance for immigrants both in the US and in Germany. In Germany, these initial differences in language skills decrease with a moderate convergence over time. By contrast, in the US, the initial disadvantages increase over time: the gap between immigrants from different linguistic groups becomes larger with the time of residence. A potential explanation for the opposing results might be found in the higher prevalence of linguistic enclaves in the US, leading to different long-term incentives for investment in language skills in the US and Germany. The estimated differences by linguistic origin attest to the great influence of linguistic background on the economic integration of immigrants. This role should be accounted for in the design of integration policy measures. The results allow the identification of potential target groups for policy intervention. Typical measures aiming at increasing the average language proficiency of immigrants have relied on lump sum payments or fixed classroom hours for language classes. 
Public spending for language acquisition support might be more effective when a priori information about the expected difficulties is taken into account to specifically address target groups prone to insufficient levels. The remainder of the paper is organized as follows. In Section 3.2 we provide an overview of the measurement of linguistic differences employed in our analysis. Section 3.3 describes the data, Section 3.4 outlines our empirical model. The findings obtained from our empirical analysis are presented and discussed in Section 3.5, and Section 3.6 concludes.

3.2 Measuring Linguistic Distance
The massive increase in migration flows during the last decades has shaped previously homogeneous populations into linguistically and culturally diverse melting pots. Immigrants face very different costs of language acquisition, associated with their linguistic origin. The influence of the first language (L1) on the acquired language (L2) is a common research topic in linguistics: A larger linguistic distance between L1 and L2 is believed to hamper any potential language transfer (the application of knowledge in the mother tongue to second languages) and to make it more difficult to differentiate between different sounds and words. Linguistic studies typically analyze the effect of linguistic distance employing small samples or case studies. An overview and notable exception can be found in Van der Slik (2010). The effect of linguistic distance on language acquisition can also be interpreted within an economic framework. The acquisition of language skills is an investment in a type of human capital with a high degree of specificity. Analogously to the restricted portability of source-country education (Friedberg, 2000), language skills are restricted in their portability across borders. The value of language skills outside a certain country can be very low, and immigrants have to invest in destination language skills as a prerequisite for successful integration. The imperfect portability of source-country language proficiency is a cost factor in the acquisition of the destination language. The linguistic distance indicates this portability of source-country language skills to the destination country. The larger the linguistic distance, the lower is the applicability of source-country language knowledge in the acquisition of the destination language. This leads, ceteris paribus, to greater difficulties and higher costs in language acquisition (Chiswick and Miller, 1999). 
The difficulty in analyzing the relation between linguistic distance and language skills in a large scale micro data setting lies in the operationalization of the concept of linguistic distance. While specialized linguists have dedicated entire careers to studying the difference between two specific languages, our research question requires a simple, standardized, and continuous measure of differences between a large set of origin and destination languages. We propose to use a measure of linguistic distance relying on the phonetic dissimilarity between languages based on linguistic research by the so-called Automated Similarity Judgment Program (ASJP). This measure, the normalized and divided Levenshtein distance, offers a continuous measure of linguistic differences and is easily computed for any pair of the world’s languages. We compare this measure with two linguistic approaches and a test-score based method that have been applied in different settings in the economic literature.
The test score measure
The only studies we are aware of that address the effect of linguistic distance on language proficiency using large micro data sets are those by Chiswick and Miller (1999, 2001, 2005). Their measure is based on average exam scores of native English speakers from the US in standardized tertiary education language courses after a fixed number of class hours. Assuming symmetry in the difficulty of learning languages, the authors state that the difficulty English native speakers face in learning a foreign language resembles the difficulty speakers of this foreign language face in learning English. This symmetry assumption allows using these test scores as a summary statistic for the dissimilarity between languages. The necessary classroom assessments of test scores are provided by Hart-Gonzalez and Lindemann (1993); Chiswick and Miller (1999) report the respective averages by foreign language. For example, US students learning Norwegian reached an average score of 3.0 (the highest potential score). Using this score, the linguistic distance for a Norwegian native speaker learning English is defined as the inverse: $\mathrm{LD}_{\mathrm{SCORE}} = 1/\mathrm{Score} = 0.33$. Since Icelandic and Faroese are assumed to be close to Norwegian, the same distance is assigned to these languages. Unfortunately, this test-score based measure of linguistic distance is restricted to differences of a finite set of languages from English. An excerpt of the scores and resulting distances provided by Chiswick and Miller (1999) can be found in Table 3.1.

The approach, especially the underlying symmetry assumption, has been widely disputed in the linguistic literature (see, e.g., Van der Slik, 2010). A further disadvantage of such a test-score based approach is a potential bias from incentives and motivations to learn a foreign language that cannot be separated from the effect of differences between languages. These incentives can include different economic prospects from learning a language (differences in the applicability in the labor market), or the prestige from learning new, difficult or “hip” languages. These potential biases might lead to rather counter-intuitive assessments, such as the similarly low distance between Swahili and English or Dutch and English.
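The inversion behind the SCORE measure is simple arithmetic, and a minimal Python sketch makes the Norwegian example concrete. The function name `ld_score` is hypothetical and chosen only for illustration; the only input taken from the text is the average score of 3.0.

```python
def ld_score(avg_score: float) -> float:
    """Linguistic distance as the inverse of an average test score."""
    if avg_score <= 0:
        raise ValueError("average score must be positive")
    return 1.0 / avg_score


# Norwegian: average score of 3.0, as reported in the text.
print(round(ld_score(3.0), 2))  # 0.33
```

Because the measure is a plain reciprocal, languages with the same average score necessarily receive the same distance, which is one source of the low variation of this measure.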
Linguistic approaches: The TREE and the WALS measure
Comparative linguistics, a branch of linguistics that is concerned with the analysis of family ties and similarities within language families, provides alternatives to the test-score based method. To retrace the historical development of languages, language trees have been developed to arrange languages into different families. These language trees depict the “genealogical” relations between languages and allow tracing the development of languages back to likely extinct common ancestors. Most prominently, the Ethnologue Project (see Lewis, 2009) aims at evaluating the family relations between all known languages in the world. Using this information about the family relations between languages, it is possible to derive a measure of the linguistic distance between languages by counting the number of branches between the languages. While offering a convenient and continuous measure of linguistic distance, although with a comparatively low number of increments, the resulting measure is built on strong and arbitrary assumptions of cardinality along the language tree and makes it difficult to include isolated languages (such as Korean or Basque) in the analysis. Two recent studies apply this approach to measure the effect of linguistic distance in a macroeconomic framework. Desmet et al. (2009) use a measure based on steps through the branches of a language tree to assess the effect of linguistic diversity on redistribution. Adsera and Pytlikova (2012) use a language tree approach to analyze the role of linguistic barriers in migration flows. Using the Ethnologue information, they define a language proximity index that takes on the value of 0 for languages without any family relation, and 1 for being the same language. Between these extreme values, the language proximity indicator takes on values of 0.1, 0.25, 0.45 and 0.7 for sharing up to four levels of family relations. As both approaches by Desmet et al. (2009) and Adsera and Pytlikova (2012) rely on more or less arbitrarily chosen assumptions on cardinality and functional form, we employ the one by Adsera and Pytlikova (2012) due to its straightforward computation. Figure 3.1 illustrates a subset of a language tree to outline its computation. Since Portuguese and Spanish share the first four common branches (Indo-European, Italic, Romance, and Italo-Western), this is coded as a linguistic proximity of 0.7. English and German only share three branches: Indo-European, Germanic, and West. 
Therefore, the approach leads to a proximity indicator for this language pair of 0.45. The linguistic distance is again defined as the inverse of this proximity indicator: $\mathrm{LD}_{\mathrm{TREE}} = 1/\mathrm{Proximity}$.

Apart from Ethnologue, a second information source about languages is the World Atlas of Language Structures (WALS). The WALS offers an online database of the structural properties of languages, such as the phonological, grammatical and lexical features of more than 2,500 different languages. The 144 different characteristics include, for example, different cases, word order or syntax. Specific grammatical features from WALS have been used recently to analyze the relation between language structure and economic behavior, such as the encoding of present and future savings behavior (Chen, 2013) or gender systems and female political participation (Santacreu-Vasut et al., 2013). Panel C of Table 3.2 lists some examples of English and German WALS features. Both languages share a low consonant–vowel ratio, but while English possesses a vowel nasalization, German does not. Using the full information on all features offered by WALS, Lohmann (2011) derives an index of linguistic dissimilarity between 0 and 1 by counting and averaging shared characteristics between languages to explain international trade flows. While conveniently summarizing linguistic differences in a number of different dimensions, the approach relies on the more or less arbitrary assumption of the equal importance of each linguistic feature. More importantly, the WALS database suffers from highly unbalanced data, since not every WALS characteristic is assessed for every language. As a result, the distance between some languages relies on a very small subset of the commonly assessed WALS features, which potentially generates a large measurement error in the variable. To reduce this measurement error (with the trade-off of losing observations), we only include distances between languages that are based on at least 20 out of the 144 available characteristics.
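The two linguistic measures described in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used for the estimations: the function names, the dictionary representation of WALS features, and the treatment of unrelated language pairs (proximity 0) as infinitely distant are all hypothetical choices, since the text does not specify them. The proximity values and the threshold of 20 commonly assessed features are taken from the text.

```python
from typing import Mapping, Optional

# Proximity values reported for 0 to 4 shared levels of the language tree.
PROXIMITY_BY_SHARED_LEVELS = {0: 0.0, 1: 0.1, 2: 0.25, 3: 0.45, 4: 0.7}


def tree_distance(shared_levels: int, same_language: bool = False) -> float:
    """LD_TREE = 1 / proximity for a language pair."""
    proximity = 1.0 if same_language else PROXIMITY_BY_SHARED_LEVELS[shared_levels]
    if proximity == 0.0:
        # Assumption: unrelated pairs get an infinite distance; the text
        # does not say how a proximity of zero is treated.
        return float("inf")
    return 1.0 / proximity


def wals_distance(
    features_a: Mapping[str, str],
    features_b: Mapping[str, str],
    min_common: int = 20,
) -> Optional[float]:
    """Share of commonly assessed features on which two languages differ."""
    common = features_a.keys() & features_b.keys()
    if len(common) < min_common:
        return None  # too few commonly assessed features; pair is dropped
    differing = sum(features_a[f] != features_b[f] for f in common)
    return differing / len(common)  # every feature carries equal weight


print(round(tree_distance(4), 3))  # Portuguese-Spanish: 1 / 0.7 = 1.429
print(round(tree_distance(3), 3))  # English-German: 1 / 0.45 = 2.222
```

The sketch makes both weaknesses discussed above visible: `tree_distance` can only take a handful of distinct values, and `wals_distance` weights every feature equally and depends on how many features happen to be assessed for both languages.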
The Automated Similarity Judgment Program

The main focus of our analysis is the application of a new and innovative way of measuring linguistic distance, the so-called Automated Similarity Judgment Program developed by the German Max Planck Institute for Evolutionary Anthropology.¹ This project aims at developing an automatic procedure to evaluate the phonetic similarity between all of the world’s languages and offers a convenient way of deriving a continuous measure of linguistic differences that is purely descriptive in nature. As such, it might be used to derive language trees (which is its original purpose) but does not rely on any prior expert opinion on language families, as does the TREE approach. The basic idea behind the ASJP is the automatic comparison of the pronunciation of words across languages. A more similar pronunciation proxies the number of cognates, word pairs between languages with common ancestors, which then again indicates a closer relation between the languages. Petroni and Serva (2010) and Brown et al. (2008) demonstrate that the language relations predicted by the ASJP coincide closely with expert opinions on language relations taking into account any available language characteristics, despite the fact that it is only based on simple comparisons of word lists. To implement this “lexicostatistical” approach, the ASJP uses a core set of vocabulary for each language, describing common things and environments, called the Swadesh word list (Swadesh, 1952). The Swadesh list consists of words which are deductively chosen according to their availability in as many languages as possible, so that synonyms for these words exist in almost any potential language. Panel A of Table 3.2 lists the words used, which comprise parts of the human body, environmental descriptions, and basic words of human communication such as classifiers or personal pronouns. To focus on the pronunciation instead of the written word, these words are transcribed into a phonetic script, the ASJP code. 
The ASJP code uses all available characters on a standard QWERTY keyboard to represent sounds of human communication. For example, the English word mountain is transcribed in the ASJP code as maunt3n, while its German counterpart Berg is transcribed as bErk. The English word you is transcribed as yu; the respective German du is unchanged in the ASJP code, du. In the following, we go through the algorithm that leads to the continuous measure of language dissimilarity. In the first step, all word pairs from the transcribed 40-word list are compared with regard to their similarity in pronunciation. For each word pair, the minimum distance between the transcribed phonetic strings is measured as the Levenshtein distance, a measure of distance between string variables. The Levenshtein distance counts the minimum number of single-sound insertions, deletions, or substitutions necessary to transform the string of the pronunciation of a word in language A into the string of the pronunciation of the respective word in language B. For example, to transform the English yu into the very similar German du, only the first sound has to be changed. By contrast, for the very dissimilar words mountain (maunt3n) and Berg (bErk), all seven sounds of maunt3n have to be changed or removed. This first step results in a word-by-word absolute distance² $D(\alpha_i, \beta_i)$ between item $i$ of two languages $\alpha$ and $\beta$. Examples for the transcription and determination of the word-by-word minimum distance are listed in Panel B of Table 3.2.

¹ Further information can be found at http://www.eva.mpg.de.
² We draw on the notation of Petroni and Serva (2010).

Taking a simple average across all $M$ word pairs $\alpha_i$ and $\beta_i$, $i = 1, \ldots, M$, results in the normalized Levenshtein distance (LDN):

$$\mathrm{LDN}(\alpha, \beta) = \frac{1}{M} \sum_{i} D(\alpha_i, \beta_i). \tag{3.1}$$

This simple normalized Levenshtein distance might indicate a closeness between languages if languages shared the same set of commonly used sounds in communication. These potential similarities in phonetic inventories (the sum of speech sounds used in a particular language) between two compared languages do not conclusively hint at a genealogical relation between the languages, but might rather produce a similarity by chance. To filter out similarities by common phonetic inventories, a global average distance $\Gamma(\alpha, \beta)$ between all non-related items of the languages $\alpha$ and $\beta$ is defined by comparing each word of the first language with all non-related words from the second language. This distance takes into account the overall similarity in phonetic inventories irrespective of the meanings of the words:

$$\Gamma(\alpha, \beta) = \frac{1}{M(M-1)} \sum_{i \neq j} D(\alpha_i, \beta_j). \tag{3.2}$$

The final normalized and divided linguistic distance is then defined as the quotient between the normalized linguistic distance and the global distance between $\alpha$ and $\beta$:

$$\mathrm{LDND}(\alpha, \beta) = \frac{\mathrm{LDN}(\alpha, \beta)}{\Gamma(\alpha, \beta)}. \tag{3.3}$$

The resulting continuous measure can be broadly interpreted as a percentage measure of dissimilarity between languages, with lower numbers indicating a closer relation. In a few cases, the resulting numbers are greater than 100%, indicating a dissimilarity that exceeds a potentially incidental similarity between languages that would be expected due to similar phonetic inventories. The ASJP algorithm allows including or excluding loan words from different languages, e.g., the predominance of former Latin words in many of the European languages. While it makes sense to exclude these loan words in the analysis of the long-term development of languages, we include these loan words in our analysis, as they lead to certain similarities of languages that might ease the later language transfer in the acquisition process.³

The normalized and divided Levenshtein distance offers some advantages compared to previous measures of linguistic distance, which lead to more precise and efficient results in economic and social science applications. First, the measure is easily and transparently computed and is purely descriptive in nature; as such, it does not rely on any a priori expert information on language relations. Second, due to this purely descriptive nature, it is not likely to be biased by economic incentives. Third, it offers a high variation as it is not restricted to certain parameter values. Lastly, it is comprehensive (all relevant languages are covered by the ASJP database) and can be used for any destination-country language included in the ASJP database. Therefore, it not only allows the analysis of important immigration countries such as the US, the United Kingdom, Canada, Germany, and France, but also permits the analysis of immigrants from rather “exotic” countries with typically few observations that are otherwise excluded from datasets. 
The comprehensiveness of the database further allows analyses concerning South–South migration, including rather seldom analyzed languages. This is a major advantage compared to the test-score based approach of Chiswick and Miller (1999), which is restricted to distances from English.
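The three steps of the ASJP computation, Equations (3.1) to (3.3), can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the ASJP software: each character of a transcription is treated as one sound, the word-level distance D is taken to be the raw edit count as described in the text, and a toy list of two word pairs stands in for the 40-item list. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ldn(words_a, words_b):
    """Equation (3.1): average distance over word pairs with the same meaning."""
    m = len(words_a)
    return sum(levenshtein(x, y) for x, y in zip(words_a, words_b)) / m


def gamma(words_a, words_b):
    """Equation (3.2): average distance over word pairs with different meanings."""
    m = len(words_a)
    total = sum(levenshtein(words_a[i], words_b[j])
                for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))


def ldnd(words_a, words_b):
    """Equation (3.3): LDN divided by the global distance."""
    return ldn(words_a, words_b) / gamma(words_a, words_b)


# Toy word lists (meanings: "you", "mountain") in the transcription from the text.
english = ["yu", "maunt3n"]
german = ["du", "bErk"]

print(levenshtein("yu", "du"))         # 1
print(levenshtein("maunt3n", "bErk"))  # 7
print(ldnd(english, german))           # 0.8
```

With only two items the resulting value has no substantive meaning; the point is the structure of the calculation, in which same-meaning distances (numerator) are scaled by different-meaning distances (denominator) to net out similarity that arises merely from shared phonetic inventories.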
Identification issues
We rely in our estimations on four measures of linguistic differences between the destination- and source-country language that differ in their ranges of availability and in the restrictiveness of their necessary assumptions. The test-score based measure (SCORE) is compelling with its encompassing nature, but relies on a strong symmetry assumption and is potentially biased by differences in incentives to learn a specific language. Most importantly, it is restricted to distances from English. In our US sample, it is available for up to 56 different source-country languages, when expert opinions on close language relations are taken into account to maximize the scope of the measure (Chiswick and Miller, 1999). Compared to this test-score measure, linguistically based approaches offer a more general framework to assess the distance between languages. The tree approach (TREE) derives a measure of distance by counting the number of shared branches in language trees, relying on prior knowledge of language family relations. It is based on strong assumptions on functional form and cardinality. Due to the completeness of the language family classifications by Ethnologue, this approach is available for the distances of 85 languages toward English and 83 languages toward German. Using external databases on language characteristics and pronunciation, the WALS and the ASJP approach offer ways to assess the differences between languages in a more descriptive manner. Neither approach relies on a priori expert knowledge of language families. However, the WALS approach has to make assumptions on cardinality. The data restrictions of the WALS database reduce the number of available languages to 67 languages in the case of the US sample and 68 languages in the case of the German sample. The ASJP database does not suffer from these restrictions, offering sufficient information for almost any language in the samples. We can rely on information for 85 languages in the US sample, and 83 languages in the German sample, providing the same applicability as that of the TREE approach. 

³ The necessary software to compute the distance matrix is available at http://www.eva.mpg.de. The complete distance matrix used in our analysis is available upon request.
Because of its general applicability and descriptive nature, we argue that the ASJP approach, based on simple comparisons of pronunciations of word lists, offers the most appropriate way to measure linguistic distance and is superior for the application at hand. Although the ASJP approach includes much broader information on source-country languages, for the sake of comparability we restrict our estimations to immigrants from those source countries for which we have common information using all four approaches.

Table 3.3 summarizes the three closest and the three most distant languages from English and German according to the four different measures of linguistic distance. Consistent across the different measures, the closest languages consist of members of the Germanic language family. Some advantages and disadvantages of the measures employed are already apparent in this table. Due to the low number of increments within the measurement scale, both the TREE and the SCORE approach show only a small variation between the closest and furthest languages. Therefore, a range of languages shares the closest and the most distant position, respectively. In contrast, the ASJP and the WALS measure offer a high variation in the data. The comprehensiveness of the ASJP database allows including more remote languages, such as the Caribbean Creole languages, in the analysis, which are not covered by the other approaches.
Regardless of the approach employed, the identification of an effect of linguistic barriers on the language acquisition might be biased by a correlation between linguistic distance and unobservable further cultural differences in habits and behavior (Chen et al., 1995). These unobserved cultural differences might hamper the identification of language barriers in terms of an omitted variable bias. To address this identification issue, we additionally control for the geographic distance between the destination and the immigrant’s source country. Moreover, we use a measure of genetic differences between populations as a proxy for cultural differences. Spolaore and Wacziarg (2009) combine the frequencies of gene manifestations in populations sampled by Cavalli-Sforza et al. (1994) and the ethnicity composition of countries compiled by Alesina et al. (2003) to derive a measure of the average genetic distance between countries. The change in genes, the emergence of new alleles, happens randomly at an almost constant rate. This constant rate of change over time makes it a reasonable proxy for the time populations spent separated, making the genetic distance an “excellent summary statistic capturing divergence in the whole set of implicit beliefs, customs, habits, biases, conventions, etc. that are transmitted across generations—biologically and/or culturally—with high persistence.” (Spolaore and Wacziarg, 2009, p. 471). Including this measure of genetic distance as a proxy for cultural distance and assuming a reasonable correlation between the measured genetic differences and any unobservable cultural differences should allow the identification of the isolated effect of linguistic distance in the estimations.4 The linguistic, genetic, and geographic distance are, due to their parallel emergence over time, likely to be highly correlated. 
High pairwise correlations could lead to difficulties in the identification of single effects, but the pairwise rank correlations in Table 3.4 are far from perfect. The linguistic distance measures are highly correlated among each other, increasing our confidence in these measures. The correlation between linguistic and geographic and especially between linguistic and genetic distance is distinctively lower.5
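To make the notion of a rank correlation concrete, the following Python sketch computes a correlation on ranks rather than on raw values. It assumes that Spearman's coefficient is meant, which the text does not state explicitly; the function names and the example numbers are hypothetical and serve only as an illustration.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical distances for four language pairs (illustration only).
linguistic = [10.0, 20.0, 30.0, 40.0]
genetic = [1.0, 4.0, 9.0, 100.0]

print(spearman(linguistic, genetic))  # 1.0
```

In this made-up example the two series are perfectly monotonically related, so the rank correlation is 1 even though the relation is far from linear; this insensitivity to the shape of the marginal distributions is the reason for preferring rank correlations when the measures are not normally distributed.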
3.3 Data
To assess how the different approaches to measuring linguistic dissimilarities fare in explaining the differences in the language acquisition of immigrants, two sources of individual data are employed for the estimations. Large scale micro data from the American Community Survey (ACS) offers a comprehensive representative sample of the American immigrant population. Furthermore, using a dataset from an English speaking destination country allows us to compare the linguistically based approaches with the test-score measure by Chiswick and Miller (1999) due to its restriction to distances toward English. To take advantage of the comprehensive nature of the linguistically based measures of linguistic distance, we further use data from the German Socio-Economic Panel (SOEP) to analyze the influence of the linguistic origin on the language acquisition in a non English-speaking country. Besides this new application, the SOEP offers the benefit of a broader range of individual characteristics that are unobservable in census-like data such as the ACS. The ACS data is taken from the 2006–2010 Public Use File and used as a pooled cross section. The dataset includes a self-reported measure of language skills which indicates English proficiency on a four point scale ranging from “Not at all/Bad” to “Very Well,” which constitutes our dependent variable. 

⁴ The data on genetic differences was originally gathered by Cavalli-Sforza et al. (1994) for 42 subpopulations. Spolaore and Wacziarg (2009) extended this data to genetic differences between 180 countries by weighting it using data on the composition of ethnicities of countries compiled by Alesina et al. (2003). It is stressed again at this point that the measure of genetic distance focuses solely on genetic distance based on neutral change, not caused by evolutionary pressure, and therefore does not explain differences in language acquisition due to superior skills or ability.

⁵ Due to the lack of normally distributed measures, we report the rank correlations instead of the Pearson correlation coefficients.
To focus on the potential workforce, the sample is restricted to immigrants between 17 and 65 years of age. As we want to concentrate our analysis on immigrants who acquire a destination language as an additional language, we restrict the sample to immigrants arriving at an age of 17 or older, and who originate from a non-English speaking country. After excluding observations with missing information, the pooled sample consists of 514,874 observations. A disadvantage of using the ACS is that it only offers scarce background information. As explanatory variables in our model, we use information on the time of residence, the age at arrival, individual education, sex, and marital status.6 We also include indicators of the source countries’ geopolitical world region and the year of observation to control for region- and time-fixed effects.7 To bring the analysis beyond the case of English-speaking destination countries, we use the German SOEP as a long-run panel which is an excellent data source for immigration- and integration-specific research, due to its over-sampling of immigrants and a migration- specific background questionnaire.8 The sample used in this study covers the period
6We recode the information on highest degree to compute years of schooling using a modified version of the definition proposed by Jaeger (1997) adapted to the categories of the ACS. Specifically, we recode: No schooling completed = 0, Nursery school to grade 4 = 4, Grade 5 or grade 6 = 6, Grade 7 or grade 8 = 8, Grade 9 = 9, Grade 10 = 10, Grade 11 = 11, Grade 12, no diploma = 12, High school graduate = 12, Some college, but less than one year = 13, One or more years of college, no degree = 13, Associate’s degree = 14, Bachelor’s degree = 16, Master’s degree = 18, Professional school degree = 18, Doctoral degree = 18. 7The geopolitical regions are defined following the MAR project, see http://www.cidcm.umd.edu/mar. 8The SOEP is a panel survey conducted since 1984 covering more than 20,000 individuals per wave. For more information, see Haisken-DeNew and Frick (2005). The data used in our analysis was extracted using the Add-On package PanelWhiz for Stata. PanelWhiz (http://www.PanelWhiz.eu) was written by John P. Haisken-DeNew ([email protected]). See Haisken-DeNew and Hahn (2006) for details. The PanelWhiz generated DO file to retrieve the data used here is available upon request. CHAPTER 3. LINGUISTIC BARRIERS IN LANGUAGE ACQUISITION 47 between 1997 and 2010. Until 2007, questions concerning the language proficiency of immigrants were included in every second wave, and on an annual basis after 2007. Analogously to the ACS sample, we restrict the SOEP sample to immigrants between 17 and 65 years of age who were at least 17 years of age when migrating to Germany, and who were born in a non-German speaking country. Furthermore, we exclude Ethnic Germans and asylum seekers from the sample. After excluding observations with missing values, we end up with a sample of 5,803 person-year observations which we use in a pooled cross-section. Similarly to the ACS, the SOEP offers information on self-reported German (oral) proficiency. 
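The recoding of educational attainment into years of schooling described in footnote 6 can be sketched as a simple lookup. This is only an illustration: the dictionary keys are the category labels as written in the footnote, not official ACS variable codes, and the names `YEARS_OF_SCHOOLING` and `years_of_schooling` are hypothetical.

```python
# Years of schooling per attainment category, with values taken from footnote 6.
# Keys are illustrative text labels, not the numeric codes used in the ACS files.
YEARS_OF_SCHOOLING = {
    "No schooling completed": 0,
    "Nursery school to grade 4": 4,
    "Grade 5 or grade 6": 6,
    "Grade 7 or grade 8": 8,
    "Grade 9": 9,
    "Grade 10": 10,
    "Grade 11": 11,
    "Grade 12, no diploma": 12,
    "High school graduate": 12,
    "Some college, but less than one year": 13,
    "One or more years of college, no degree": 13,
    "Associate's degree": 14,
    "Bachelor's degree": 16,
    "Master's degree": 18,
    "Professional school degree": 18,
    "Doctoral degree": 18,
}


def years_of_schooling(attainment):
    """Return years of schooling, or None for a missing or unknown category."""
    return YEARS_OF_SCHOOLING.get(attainment)
```

Returning `None` for unknown categories keeps observations with missing education identifiable, so they can be dropped in the same way as other observations with missing information.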
The self-reported measure of language proficiency is fivefold, but because of the small number of individuals indicating the category “Not at all,” we recode this information to derive an analog fourfold ordinal measure ranging from “Not at all/Bad” to “Very Well.” The survey character of the SOEP offers a broader range of information about the individual characteristics shaping the language acquisition process. The factors influencing the language acquisition of immigrants can be divided into three groups: the exposure to the destination-country language, the efficiency of their learning ability, and the economic incentives of learning the new language (Chiswick and Miller, 1995). Our main variable of interest—the linguistic distance—affects the efficiency in acquiring the new language, decreasing the potential of any lexical transfer or portability of their proficiency in the source-country language. The efficiency of learning a new language is further controlled for by individual years of education, an indicator of good proficiency in the source-country language (as a proxy for literacy) and the age at entry, related to neurobiological research demonstrating a decreased efficiency for older arrivers (Newport, 2002). We model the effect of exposure to the destination-country language by including five variables in our estimation model. The simple ‘learning by doing’ effect is captured by a function of the years since migration. Moreover, we account for family composition characteristics captured by the number of children, marital status, and the German nationality of the spouse. The relation of these factors to the language acquisition process is ambiguous, because they lead to a social exclusion or inclusion of immigrants. Finally, an indicator for neighboring countries of Germany serves as a proxy for the probability of pre-migration exposure to the German language. 
The economic incentives for learning a new language are primarily influenced by the expected length of stay, shaping the time horizon of the expected benefits. An indicator variable for having family ties abroad captures potential return plans that might alter the economic incentives to invest in the destination language. Our estimation model also includes an indicator for the immigrant’s sex and controls for the source country’s geopolitical world region and the year of observation using region- and time-fixed effects.

We augment both individual datasets (the ACS and the SOEP) with a set of aggregated country characteristics. These characteristics capture aspects of the relation between the immigrant’s country of birth and the country of residence that might be correlated with the linguistic distance. First, we include the share of immigrants from the migrant’s source country among the destination country’s population to capture potential network and enclave effects. The data on bilateral migrant stocks are taken from United Nations (2012). Ethnic enclaves may reduce the incentives for immigrants to acquire destination-country-specific abilities such as proficiency in the official language. Although the share of immigrants from the same source country is only a crude proxy for the immigrant’s neighborhood, it might still provide some insights into the role of networks and enclaves in the acquisition of foreign language skills. Second, we control for the geographic distance, which serves as a proxy for the individual costs of migration.
The geographic distance is defined as the geodesic distance between the capitals of the source and the destination country, measured in units of 100 kilometers.9 Lastly, we include a measure of the genetic distance between the source and the destination country as discussed in Section 3.2, which serves as a proxy for cultural differences.10 As neither of our micro data sources (the ACS and the SOEP) offers information on the mother tongue of an immigrant, the linguistic distance is assigned according to the predominant language of the country of birth. In multilingual countries, the assigned language is the most prevalent native language (excluding lingua francas, i.e., commonly known foreign languages used for trade and communication across different mother tongues), which is identified using a multitude of sources, including factbooks, encyclopedias, and Internet resources.11 To allow easier comparison between the differently defined measures, we standardize each measure to have a mean of zero and a standard deviation of one.
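The standardization of the distance measures to mean zero and unit standard deviation can be illustrated with a short sketch. Whether the population or the sample standard deviation is used is not stated in the text; the population version is assumed here, and the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list of distances to mean 0 and (population) standard deviation 1."""
    center = mean(values)
    spread = pstdev(values)  # assumption: population SD; the text does not say which
    if spread == 0:
        raise ValueError("cannot standardize a constant measure")
    return [(v - center) / spread for v in values]
```

After this transformation, a coefficient on any of the measures refers to a change of one standard deviation in that measure, which is what makes the estimates comparable across the differently scaled approaches.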
9 The geographic distance data are compiled by researchers at the Centre d’Etudes Prospectives et d’Informations Internationales (CEPII) and are available at http://www.cepii.fr/anglaisgraph/bdd/distances.htm.
10 Descriptive statistics for the ACS and the SOEP sample are presented in Table 3.A1 in the Appendix. Table 3.A2 in the Appendix provides a description of the variables used in our estimations.
11 A comprehensive index of assigned languages with further explanations is available upon request.

3.4 Method

This data setup, the ACS and SOEP micro data combined with the measures of linguistic distance, allows us to estimate the language proficiency L as a function of the linguistic distance and the control variables, both on an aggregated and on the individual level. To get a first glimpse into the relationship between linguistic barriers and the language acquisition of migrant groups, we start with estimations on the aggregated level. In these estimations, we explain the average language proficiency by source country and year of observation. As dependent variable, we use predictions from a first stage explaining the individual language proficiency $L_{it}$ by a fully interacted set of source-country ($c_j^S$) and time indicators ($T_k$) and a set of individual characteristics $X_{it}$ (gender, marital status, years since migration, and age at entry):
$$L_{it} = \beta_0 + X_{it}'\beta + \sum_{j=1}^{J} \sum_{k=1}^{K} \gamma\, c_j^S T_k + \varepsilon_{it}. \tag{3.4}$$
From this first stage, we derive averages of the predicted language proficiency by source country and year of observation ($\widehat{L}_{jt}$). In the second step, we then explain these predicted values by the respective linguistic distance ($LD_j$) and a set of aggregated source-country and country-pair characteristics ($Z_{jt}$):
$$\widehat{L}_{jt} = \delta_0 + \delta_1 LD_j + Z_{jt}'\eta + \varepsilon_{jt}. \tag{3.5}$$
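The aggregation step behind Equation 3.5 can be sketched as follows. This is a deliberately reduced illustration, not the estimation used in the chapter: the first-stage predictions are taken as given, the controls in Z are omitted, the toy numbers are invented for demonstration only, and all function names are hypothetical.

```python
from collections import defaultdict


def cell_means(rows):
    """Average predicted proficiency by (source country, year).

    rows: iterable of (country, year, predicted_proficiency) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for country, year, predicted in rows:
        cell = sums[(country, year)]
        cell[0] += predicted
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def simple_ols(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy first-stage predictions (illustrative values, not data from the chapter).
rows = [("A", 2006, 2.0), ("A", 2006, 3.0),
        ("B", 2006, 1.0), ("B", 2006, 2.0),
        ("C", 2006, 1.0), ("C", 2006, 1.0)]
distance = {"A": -1.0, "B": 0.0, "C": 1.0}  # standardized LD_j, hypothetical

means = cell_means(rows)
keys = sorted(means)
delta0, delta1 = simple_ols([distance[c] for c, _ in keys],
                            [means[k] for k in keys])
```

In this toy setting `delta1` plays the role of the coefficient on linguistic distance in Equation 3.5; a negative value means that more distant source countries have lower average predicted proficiency.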
Although this specification on an aggregated country-of-origin level provides some first insights into the relation between linguistic barriers and language acquisition, it ignores further available information on individual migration experience and potential interactions between the linguistic barriers and individual characteristics. Therefore, in a second step we move to the individual level and model the destination language proficiency as:
$$L_{it} = \beta_0 + \beta_1 LD_i + \beta_2 YSM_{it} + X_{it}'\gamma + \varepsilon_{it}. \tag{3.6}$$
Here, LD depicts the linguistic distance between the source- and destination-country languages, YSM represents the years since migration, and X is a vector including the control variables. In the following, we refer to the model depicted by Equation 3.6 as Model 1.12
12 For the sake of brevity, we present here and in the following only the linear notation of our estimation models.

In Model 1, β1 represents an average effect of linguistic origin for all immigrants. However, it is likely that the linguistic distance not only imposes an initial barrier to language acquisition, but also affects the steepness of the language acquisition profile. Two different profiles are conceivable. On the one hand, recent immigrants with a distant linguistic background might have higher incentives to invest in language skills than linguistically close immigrants due to decreasing returns to invested effort. This would lead to a convergence over time. On the other hand, the hurdles imposed by language barriers can discourage investments and might lead to flatter acquisition profiles for distant immigrants. This would then lead to a divergence between immigrants from different linguistic origins, leaving linguistically distant immigrants worse off. To address this potential convergence or divergence, we allow the disadvantage associated with the linguistic distance to vary with the years since migration. We include an interaction of both variables, LD × YSM, in Equation 3.7. We will refer to this specification as Model 2:
$$L_{it} = \beta_0 + \beta_1 LD_i + \beta_2 YSM_{it} + \beta_3 LD_i \times YSM_{it} + X_{it}'\gamma + \varepsilon_{it}. \tag{3.7}$$
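A short worked step, using only the symbols of Equation 3.7 and its linear notation, may help to read the interaction term. Differentiating the conditional mean with respect to each of the two interacted variables gives:

$$\frac{\partial L_{it}}{\partial LD_i} = \beta_1 + \beta_3\, YSM_{it}, \qquad \frac{\partial L_{it}}{\partial YSM_{it}} = \beta_2 + \beta_3\, LD_i.$$

The distance penalty after $t$ years of residence is therefore $\beta_1 + \beta_3 t$. If $\beta_1 < 0$ and $\beta_3 > 0$, the linear form implies that the penalty shrinks and would reach zero at $t^{*} = -\beta_1 / \beta_3$; this is an implication of the functional form rather than a reported result. If $\beta_3 < 0$, the penalty grows in absolute value with every additional year.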
In Model 2 the main effect indicated by β1 shows the effect of linguistic distance on language ability at the time of immigration and β3 depicts the change in the steepness of the assimilation profile. A convergence in skill levels over time should be represented in a positive coefficient β3, indicating a catching up to immigrants with a lower linguistic distance. A negative β3 would imply a divergence. Linguistically more distant immigrants would then face flatter assimilation profiles than immigrants with a lower linguistic distance. We start our analysis by estimating our models using Ordinary Least Squares (OLS), separately for the four measures of linguistic distance in the US case and three measures in the German case. To interpret the OLS results using the ordinal language proficiency variable quantitatively, we have to impose strong cardinality assumptions. To take into account this ordinal character of the dependent variable and to derive quantitatively interpretable results, we repeat the estimations using Ordered Logit regressions and use graphical representations to interpret the interaction between linguistic distance and years since migration. Throughout all specifications in our analysis, we use (cluster)-robust standard errors to correct for possible heteroskedasticity in the data.
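How predicted probabilities from an Ordered Logit with an interaction term can be computed for such graphical representations is sketched below. The coefficients and cutpoints are purely hypothetical placeholders chosen for illustration, not estimates from this chapter, and control variables are left out; the function names are likewise assumptions.

```python
import math


def logistic(z):
    """Standard logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(index, cutpoints):
    """Category probabilities P(L = k), k = 0..len(cutpoints).

    index: linear predictor; cutpoints: strictly increasing thresholds.
    """
    cumulative = [logistic(c - index) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


# Hypothetical parameters (NOT estimates from the chapter).
B1, B2, B3 = -0.5, 0.05, -0.01   # LD, YSM, LD x YSM
CUTPOINTS = [-1.0, 0.5, 2.0]     # three thresholds for four categories


def prob_very_well(ld, ysm):
    """Probability of the highest category for a given distance and residence time."""
    index = B1 * ld + B2 * ysm + B3 * ld * ysm
    return ordered_logit_probs(index, CUTPOINTS)[-1]


# Gap between a close (LD = -1) and a distant (LD = +1) immigrant over time.
gap_at_arrival = prob_very_well(-1.0, 0) - prob_very_well(1.0, 0)
gap_after_20_years = prob_very_well(-1.0, 20) - prob_very_well(1.0, 20)
```

Evaluating `prob_very_well` on a grid of years since migration for a few values of the standardized distance yields the kind of profile that a plot of predicted probabilities displays. With the negative interaction assumed here, the gap in the highest category widens over time; a positive interaction would narrow it.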
3.5 Results
A first descriptive look at the relation between language proficiency and the different measures of linguistic distance is provided in Table 3.5. The distribution of language skills in the US and Germany is quite different. While in the US about 37% of all immigrants report a “Very Well” proficiency, in Germany only 15% report the highest category. The expected negative relation between linguistic distance and language proficiency does not appear in the unconditional means for the US sample reported in Table 3.5; ASJP, WALS, and SCORE even suggest a marginally positive relation. In the German sample, the relation between linguistic distance and language proficiency is distinctly negative on the descriptive level: across all three available measures, we observe a decrease in the linguistic distance as language skills increase, with the lowest average distance in the “Very Well” category. However, it remains to be seen how potentially correlated individual characteristics change this first descriptive picture.

The results of the estimations of Equation 3.5 on the aggregated source-country level are summarized in Table 3.6.13 We find a strong negative relationship between linguistic barriers and the average language proficiency. This result is robust and can be observed both in the US and the German data, but it differs markedly in magnitude across the different measures of linguistic distance. Assuming cardinality in our dependent variable, the coefficient of linguistic distance measured by the ASJP approach indicates that an increase of the linguistic distance by one standard deviation (roughly the difference in the distance to English between German and Romanian) decreases the average language proficiency by 0.17 points on the 0–3 scale in the US sample and by 0.19 points in the German sample. Using the WALS or TREE approach shows a decrease by only 0.11 points in the US sample, while the TREE approach indicates a decrease of 0.2 points in the German sample, comparable to the ASJP estimate. Concerning the control variables, migrant stocks are negatively related to the average language proficiency, hinting at potential negative influences of ethnolinguistic enclaves; see also Chiswick and Miller (2002), Dustmann and Fabbri (2003), and Cutler et al. (2008).
We further find positive relationships between average language proficiency and both geographic and genetic distance (the latter as a proxy for cultural differences), which we interpret as indirect evidence for selection on unobservable motivation and ability, while the positive coefficients of GDP per capita capture potential differences in pre-migration language exposure and education.

13 We generated all estimation output tables using the Stata routine estout by Ben Jann (see Jann, 2007).

The results provided in Table 3.7 bring the analysis to the individual level. Table 3.7 summarizes the results of the OLS estimations for the ACS sample, separately for the different measures of linguistic distance. As already seen in the aggregated results, the estimated effect of the linguistic distance remains very sensitive to the choice of measure. This highlights the importance of applying different available measures, rather than relying on only one approach, to get a comprehensive insight into the relation between linguistic barriers and language acquisition. The results of Model 1 are summarized in Panel A. Across all methods, the effect is highly significant and negative. Similar to the aggregated results, the ASJP approach indicates the strongest influence of the linguistic origin on the language acquisition: an increase by one standard deviation is related to a language proficiency that is lower by 0.24 points on the 0–3 scale, while estimations using the TREE, the WALS, and the SCORE approach indicate a decrease by 0.10 to 0.12 points.

The coefficients in Panel A represent an average effect of linguistic distance for immigrants sharing a common linguistic background. To analyze whether this disadvantage increases or decreases over the time of residence, Model 2 includes an interaction term between the years since migration and the linguistic distance. The respective results are included in Panel B. In this specification, the main effect of linguistic distance is to be interpreted as an initial disadvantage at the time of immigration. This initial disadvantage is smaller than the average difference by linguistic origin estimated with Model 1 in Panel A. This results from a negative interaction between linguistic distance and the years since migration. Although we observe an overall positive language assimilation over time, the language assimilation profile becomes flatter with increased linguistic distance. Linguistically distant immigrants not only experience a higher initial disadvantage in their language acquisition, but also seem to experience a slower acquisition of English as destination language. After immigration, the initial gap between the immigrants from close and from distant linguistic origins increases over time. This pattern is robust across all four measures, while again the effect is strongest for the ASJP approach.14 To drop the cardinality assumption and to take the ordinal character of the self-reported language proficiency into account, we estimate both models using Ordered Logit regressions instead of OLS.
Table 3.8 provides the marginal effects of the linguistic distance on the probability of reporting specific categories of language proficiency in Model 1.15 Increasing the linguistic distance quantified by the ASJP approach by one standard deviation decreases the probability of reporting “Very Well” language skills in English by about 20 percentage points. Due to the non-linear Ordered Logit model and the inclusion of an interaction term, the marginal effects of linguistic distance in Model 2 are best interpreted graphically. Figure 3.2 depicts predicted probabilities of reporting the highest category of language proficiency by different levels of linguistic distance over the time of residence. Linguistically close immigrants in the 1st percentile of the distance distribution face an initially steeper assimilation profile, while linguistically distant migrants are outpaced.

14 Regarding the influence of the control variables, Model 1 and Model 2 do not differ much, nor does the influence of the control variables vary with the measure of linguistic distance applied. For the sake of brevity, we do not further discuss the influence of the control variables. The respective coefficients can be found in Table ?? in the Appendix.
15 The underlying coefficients and the marginal effects of Model 1 and 2 of the Ordered Logit regressions are presented in Tables 3.B2–3.B4 in the Appendix.

While this pattern sheds some light on the effect of the heterogeneity in linguistic origin on the language acquisition of immigrants in the US, the large differences in immigration policy regimes and in selection patterns make it difficult to generalize the results to other countries. Previous analyses using the SCORE approach have been restricted to English-speaking destination countries (Chiswick and Miller, 1999). However, English, as a lingua franca in international trade, the Internet, and communication technology, might offer very different incentives for being learned compared to languages which lack this worldwide predominance. In contrast to this restriction, a major advantage of the linguistically based measures of language differences is their general applicability to any pair of languages. Taking advantage of this general applicability, we are able to extend the analysis beyond immigration to English-speaking countries. Specifically, we turn to Germany, one of the most important non-English-speaking destinations for international migration.

The German SOEP sample allows an analysis similar to that of the ACS data, but with a richer set of control variables including the number of children, literacy, family ties abroad, and having a native spouse.16 Table 3.9 lists the respective OLS results of Model 1 and Model 2.17 Again, we find a negative effect of the linguistic distance between the mother tongue and the destination language on the language acquisition process, which differs strongly by the employed approach. To derive a quantitative interpretation, we again turn to the results of an Ordered Logit model in Table 3.8.
The marginal effects of Model 1 show a negative effect of linguistic distance on reporting “Very Well” German proficiency of 1.9 to 4.4 percentage points, which is moderate compared to the US results.18 However, the results for Model 2 draw a very different picture for the German SOEP sample compared to the US results. The interaction term between the linguistic distance and the years since migration in Model 2 turns out to have a positive sign but is insignificant across all estimations in the German case (see Table 3.9, Panel B). This slight convergence is more pronounced in the Ordered Logit results, which are illustrated in Figure 3.3 in terms of predicted probabilities of reporting “Very Well” proficiency. Immigrants from a more distant linguistic origin therefore face a steeper assimilation profile than immigrants with a close linguistic background. Instead of observing a divergence by linguistic origin, we find a convergence in language skills. Over the time of residence, the gap between the linguistically close and distant immigrants closes; linguistically distant immigrants are able to catch up.

16 Following Dustmann (1999), literacy is assumed for individuals reporting being able to write in their mother tongue.
17 The coefficients omitted in Table 3.9 are included in Table 3.B5 in the Appendix.
18 The underlying coefficients and the marginal effects of Model 1 and 2 of the Ordered Logit regressions are presented in Tables 3.B6–3.B8 in the Appendix.

We might speculate about the driving factors of the difference between the divergence and convergence patterns in the US and in Germany. English and German are very closely related Germanic languages. This raises doubts that the differences are driven by purely linguistic reasons, such as one language posing particularly strong obstacles, e.g., very special grammatical features, that would lead to the observed divergence. A more economically based potential explanation lies in differences in unobservable characteristics arising from different selection patterns of the migrants in the US and Germany. A perceived higher difficulty of German compared to English could lead to a self-selection of immigrants with superior skills in the acquisition of foreign languages into Germany. If this selection pattern were stronger for linguistically more distant immigrants (where the initial returns to the language acquisition would be higher), the observed competing patterns of divergence in the US and convergence in Germany could arise. However, as we control in both samples for individual education, which is expected to be correlated with unobserved ability, we should at least partially capture such a selection process.

A second, in our opinion more plausible, explanation might be related to enclave effects in language acquisition. A range of studies have addressed the potentially discouraging effects of linguistic enclaves on investments in language skills (e.g., Chiswick and Miller, 2002; Cutler et al., 2008; Dustmann and Fabbri, 2003). Living in a linguistic enclave reduces the need for and potential advantages of learning the destination language, as immigrants can communicate in daily life in their mother tongue. Danzer and Yaman (2010) argue that the probability of moving into a neighborhood dominated by speakers of one’s own mother tongue is positively related to one’s own learning costs. The initial learning costs are strongly related to the linguistic distance between the mother tongue and the destination language, making it more likely for linguistically distant immigrants to move into segregated neighborhoods. Neighborhood segregation needs time to take place: due to its longer migration history, the ethnic segregation within cities is much more pronounced in the US than in Germany with its comparatively short migration history.
Therefore, the observed differences in assimilation patterns are potentially driven by the larger prevalence of linguistic enclaves in the US (e.g., the famous Chinatowns and Little Italys in US cities).

In order to test the robustness of our results, we use different subsamples of our datasets. In doing so, we split the sample: (i) by gender, (ii) between low-skilled and high-skilled immigrants, and (iii) by excluding the majority immigrant groups, i.e., Mexican immigrants in the US and Turkish immigrants in Germany, from our regressions. A summary of these sensitivity checks is provided in Tables 3.A3 and 3.A4 in the Appendix. The underlying pattern of initial disadvantage and divergence in the US is stable across all subsamples. Linguistic barriers seem to play a larger role in the case of low-skilled immigrants (having a high school degree or less) than for high-skilled immigrants. The results for Germany are less robust, likely due to the low number of observations in the SOEP data. The negative main effect of linguistic distance remains robust across all subsamples. The interaction term between linguistic distance and the years since migration becomes positively significant for high-skilled immigrants, who seem to drive the observed convergence in Germany. However, the convergence profile becomes insignificant when we split the sample by gender, by skill-level, and when Turks are excluded.

To summarize, our results highlight the importance of linguistic origin as a factor of typically unobservable heterogeneity in the integration process of immigrants. The initial disadvantages diminish only marginally over the time of residence, and linguistic barriers remain even after a long period of stay.
Given the large impact of language proficiency on labor market outcomes (Bleakley and Chin, 2004; Chiswick and Miller, 1995; Dustmann and van Soest, 2002), it is likely that these differences are transferred into labor market disadvantages. Disadvantages in the language acquisition process hinder the social integration of immigrants by reducing their ability to communicate with natives. In addition, imperfect language skills can act as a signal of foreignness, opening the way to discriminatory behavior of employers, and can decrease the productivity of individuals, leading to lower employment probabilities and wages.

Against the background of immigration policy design, our results hint at a way to identify target groups for supportive integration policy measures. Immigrants obviously differ strongly in their costs of language acquisition, depending on their linguistic background. This heterogeneity is seldom accounted for in the design of integration policies. Policies aiming at the support of immigrant language acquisition, as currently practiced in Germany with the “Integrationskurse” system (“integration classes”), often include a lump sum payment for public language classes. This lump sum payment, restricting class hours irrespective of the actual need for support, is likely to lead to an inefficient spending of public money. In a class system that does not distinguish language students by their actual need for support, linguistically close immigrants are provided too many class hours, while linguistically distant immigrants might be outpaced. A means-tested voucher taking into account the expected costs by linguistic origin might lead to a more efficient spending of public money than a lump sum policy measure.
3.6 Conclusion
International labor migration is a worldwide and steadily growing phenomenon. According to UN estimates, in 2010 roughly 215 million individuals lived in a country different from their country of birth (World Bank, 2011). At first glance, this is a massive number, but it still accounts for only around 3% of the world’s population, a surprisingly low share given the large differentials in economic conditions. While technological progress in transportation and communication has led to a significant decrease in the initial costs of migration, cultural and linguistic borders continue to play an important role for international migration flows (Adsera and Pytlikova, 2012; Belot and Ederveen, 2012). In this study, we provide an in-depth analysis and quantification of the linguistic barriers in destination language acquisition in Germany and the US.

For immigrants, proficiency in the destination-country language leads to substantial economic returns (Bleakley and Chin, 2004; Dustmann and van Soest, 2002). However, large fractions of the immigrant population possess only insufficient levels of proficiency in the destination language. While the investment in language capital has been thoroughly analyzed in human capital frameworks (Chiswick and Miller, 1995), our knowledge of the influence of typically unobservable heterogeneity in the linguistic origin of immigrants remains limited.

The linguistic distance between languages is a concept that is difficult to operationalize for use in empirical models. In this study, we demonstrate four different methods providing continuous measures of linguistic differences and compare their specific advantages and shortcomings. More specifically, we draw on linguistic research and propose using a measure of linguistic distance based on comparisons of pronunciation between word lists.
This method, referred to as the ASJP approach, offers a convenient way to derive a continuous measure of linguistic differences. Given its purely descriptive measurement and general applicability to any potential pair of languages, it provides an advantageous measure for the application at hand. We compare its performance with further linguistic approaches using information about language relations (TREE measure) and language characteristics (WALS measure) and a measure based on average test scores (SCORE) by Chiswick and Miller (1999). All four measures of language differences are applied to the analysis of the destination language acquisition of immigrants in the US using data of the American Community Survey (ACS). To take advantage of the general applicability of the linguistically based methods beyond the analysis of English-speaking destination countries, we extend the analysis to German microdata from the German Socio-Economic Panel (SOEP). In both scenarios, we use the different measures of linguistic distance to explain differences in self-reported measures of immigrant’s destination language proficiency. Our results indicate that the linguistic distance, the dissimilarity between the origin and destination languages, has a distinctively negative average effect on the language acquisition of immigrants. Immigrants with a distant linguistic origin face higher costs in the language acquisition than immigrants with a closer linguistic background. Furthermore, we analyze differences in the slope of the language assimilation curve that can be attributed to differences in the linguistic origin. We find different assimilation patterns for the US and Germany. In Germany, immigrants with a more distant source-country language display a steeper language assimilation profile. Initial disadvantages are reduced over time, leading to a convergence in average proficiency for immigrants from different linguistic origins. 
For the US, we estimate the opposite picture of diverging profiles. Gaps in the proficiency of linguistically close and distant immigrants tend to increase with time of residence. We interpret this difference in assimilation patterns as a potential outcome of stronger enclave effects in the US. This crucial difference highlights the importance of extending the analysis beyond the case of Anglophone countries. The initial disadvantages and differences in assimilation patterns attributable to linguistic distance account for a large fraction of the explained variation in destination language proficiency. This highlights the importance of linguistic differences for the analysis of the skill acquisition of immigrants, as an influencing factor that was previously part of the “black box” of culture in the economic literature (see Epstein and Gang, 2010). This additionally explained variation might play an important role in the design of integration policy measures. Lump-sum payments for language classes might turn out to be inefficient in the presence of high heterogeneity in the actual need for language acquisition support, compared with means-tested vouchers that take into account the expected costs of language acquisition.
Figures and Tables
Table 3.1: Average Test Scores of US Language Students

Average        Linguistic    Languages
Test Scores    Distance      (Examples)
1.00           1.00          Laotian, Korean, Japanese
1.25           0.80          Cantonese, Mien, Hakka
1.50           0.67          Syriac, Vietnamese, Arabic
1.75           0.57          Bengali, Nepali, Greek
2.00           0.50          Serbo-Croatian, Turkish, Finnish
2.25           0.44          Spanish, Danish, Yiddish
2.50           0.40          Italian, Portuguese, French
2.75           0.36          Dutch, Swahili, Bantu
3.00           0.33          Norwegian, Swedish, Afrikaans

Notes: Average test scores of American students learning foreign languages. Numbers provided by Hart-Gonzalez and Lindemann (1993), reproduced from Chiswick and Miller (1999), Appendix B.
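
The Linguistic Distance column in Table 3.1 appears to be the reciprocal of the average test score, rounded to two decimals (for example, 1/1.25 = 0.80). The following minimal Python sketch reproduces the column under that assumption. The function name score_distance is hypothetical and introduced here only for illustration; it is not taken from the source.

```python
# Average test scores as listed in Table 3.1
# (Hart-Gonzalez and Lindemann, 1993).
scores = [1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00]


def score_distance(score: float) -> float:
    """Hypothetical helper: distance as the reciprocal of the test score.

    Assumption: the table's distance column equals 1 / score. A lower
    score (a language that is harder for English speakers to learn)
    then maps to a larger distance from English.
    """
    if score <= 0:
        raise ValueError("score must be positive")
    return 1.0 / score


# Print each score next to the implied distance, rounded as in the table.
for s in scores:
    print(f"{s:.2f}  {score_distance(s):.2f}")
```

Under this assumption, the measure is bounded between 0.33 and 1.00 for the scores shown and is defined only relative to English, which is consistent with the chapter's point that the score-based measure cannot be carried over to non-Anglophone destination countries.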