Submitted by Alexander Buchberger

Submitted at Institute of Applied Statistics

Supervisor Dipl.-Ing. Mag. Dr. Helmut Waldl

Co-Supervisor Mag. Dr. Werner Lenzelbauer

National Migration Flows in Upper Austria

January 2020

Master Thesis to obtain the academic degree of Master of Science in the Master’s Program Statistics

JOHANNES KEPLER UNIVERSITY Altenbergerstraße 69 4040 Linz, Österreich www.jku.at DVR 0093696

I hereby declare under oath that the submitted Master’s degree thesis has been written solely by me without any third-party assistance, information other than provided sources or aids have not been used and those used have been fully documented. Sources for literal, paraphrased and cited quotes have been accurately credited. The submitted document here present is identical to the electronically submitted text document.

Abstract

Migration flows of the inhabitants of Upper Austria are the main subject of this master thesis. First, the reasons for migration are analysed. For this purpose, all movements out of each community are considered. A movement is defined as a person moving into a different community; a move within a community does not count. Furthermore, moving for example from Linz to Wels and back to Linz within one year does not count as a movement. Migrations to and from other (federal) states are not included in the analyses.

The first chapter of this thesis takes a closer look at the properties of all communities of Upper Austria. The available datasets are displayed on a map of Upper Austria to reveal details about possible migration flows. In the next step, statistical models are fitted. Standard models such as Poisson and geometric regression serve as a starting point. In addition, a so-called “Gravity Model of Migration” is used, which should fit urban-rural migration better. Because modelling absolute migration counts does not work well, the data are transformed to relative values by dividing them by the population of the destination/origin community. In this setting a mixture model, the Zero Inflated Beta Regression, can be used. It shows an effect of distance: the further away the destination, the less migration takes place. A positive effect of available jobs in the destination community can be seen as well.

A further aim of the work is to estimate the probability that a given inhabitant of Upper Austria migrates. Logistic regression was used in a first attempt. The covariables showed the expected effects, but the prediction for the following year was poor. Mixed models such as the “Random Intercept Model” or the “Random Slope Model” did not improve it.

In addition, a second algorithm, the popular machine learning algorithm Random Forest, is used to fit the data. With this algorithm the prediction for the following year improved considerably. Grouping the data by age was also tried in order to optimize the results, but in this case Random Forest gave worse results.

Kurzfassung (Summary)

This thesis deals with the migration flows of the population of Upper Austria. Primarily, it attempts to examine the reasons why people move. For this purpose, the moves out of each community are considered. Only moves across community borders are taken into account; that is, a move within a community is not included in the analysis. Furthermore, moves to and from surrounding (federal) states are not considered. Repeated moves within one year, for example from Linz to Wels and back to Linz, are also not considered and are declared as “no move”.

The first chapter of this thesis takes a closer look at the properties of the communities in Upper Austria. The available datasets are displayed on a map of Upper Austria in order to identify details about possible migration behaviour. In the next step, a model is fitted by statistical analysis. Starting with standard models such as Poisson and geometric regression, an attempt is made to fit the data. In addition, a so-called “Gravity Model of Migration” is used, which should model urban-rural moves better. Because modelling in the discrete case is not optimal, the data are rescaled by dividing by the population of the origin or destination community. Here a regression with a mixture distribution, the “Zero Inflated Beta Regression”, could be used. It showed that distance plays a major role in moving: the further away the destination, the fewer moves take place. A positive effect on relocation was found for vacant jobs in the destination community.

A further aim of the thesis is to determine the probability that a particular person moves. Initially, logistic regression is used to fit the data. In the model the variables show the expected effects, but the prediction for the next year is relatively poor. A mixed model such as the “Random Intercept Model” or the “Random Slope Model” did not improve the prediction. In addition, another algorithm, the machine learning algorithm Random Forest, is used to improve the classification. With this algorithm the classification improved considerably. An attempt is also made to improve this result further by grouping the data by age. Applying Random Forest again, however, led to a worse prediction.

Contents

1 Introduction
2 Description of the Data
  2.1 Datasets
  2.2 Preprocessing of the Data
  2.3 Descriptive Analysis
    2.3.1 General information
    2.3.2 Infrastructure of the communities
    2.3.3 Structure of the inhabitants of Upper Austria
3 Methods
  3.1 Generalized Linear Model
    3.1.1 Poisson Model
    3.1.2 Geometric Model
    3.1.3 Logistic Model
  3.2 Gravity Model of Migration
  3.3 Inflated Beta Regression
  3.4 Random Forest
4 Cross Section Analysis
  4.1 Descriptive Analysis
  4.2 Results for absolute number of migration
    4.2.1 Gravity Model of Migration
    4.2.2 Poisson/Geometric Regression
    4.2.3 Zero Inflated Regression
    4.2.4 Conclusions
  4.3 Results for relative numbers of migration
    4.3.1 Zero Inflated Beta Regression
  4.4 Migration balance
    4.4.1 Comparison to real data
5 Probability for migration
  5.1 Logistic Regression
  5.2 Random Forest
  5.3 Conclusions
6 Conclusions

Bibliography

A Appendix - Figures

B Appendix - Tables

C Appendix - Code
  C.1 Maps
  C.2 Cross Section Analysis
  C.3 Probability for migrating

List of Figures

2.1 Population of Upper Austria in 2014
2.2 Upper Austria split in urban (blue) and rural (orange) areas
2.3 Population changes from 2011 to 2014
2.4 Box plot of population changes from 2011 to 2014
2.5 Relative population changes from 2011 to 2014
2.6 Child care institution exists (1) or does not exist (0) in Upper Austria
2.7 Box plot of the Gross Domestic Product (million €) in Upper Austria
2.8 Numbers of different fields of jobs in Upper Austria
2.9 Price per m² building area in Upper Austria
2.10 Free m² building area in Upper Austria
2.11 Median age of the population of Upper Austria in 2014
2.12 Number of people in education in 2014
2.13 Highest education in relation to the population of Upper Austria 2014
2.14 Employed people in relation to the population of Upper Austria 2014
2.15 Foreign citizenship in relation to the population of Upper Austria 2014

3.1 Inflated Beta Regression
3.2 Example for a decision tree

4.1 Number of re-settlers in Upper Austria in 2014
4.2 Migration movements in relation to its population in Upper Austria in 2014
4.3 Model diagnostics for the Gravity Model of Migration without “non-movers”
4.4 Residuals vs fitted values of the Gravity Model of Migration
4.5 Residuals vs fitted values of the Poisson regression
4.6 True values vs predicted values for 2015 of the zero inflated beta regression
4.7 Prediction errors for migration to/from/balance of all communities for 2015
4.8 Prediction errors for migration to/from/balance of all communities in relation to its population (in %) for 2015

5.1 Classification error based on the boundary for moving
5.2 Absolute classification error based on the boundary for moving
5.3 Probability of a female 16 year old person for migration in 2015

A.1 People in education relative to the population in 2014
A.2 Gross Domestic Product (million €) in Upper Austria
A.3 Model diagnostics for the Gravity Model of Migration
A.4 Model diagnostics for Poisson regression
A.5 Residual plot for geometric regression
A.6 Model diagnostics for geometric regression
A.7 Model diagnostics for Poisson model summing up the years 2011 to 2014

List of Tables

2.1 Numbers of free jobs in Upper Austria

4.1 Top 6 most frequent migration combinations
4.2 Regression results of the zero inflated beta model
4.3 Top 5 of highest/lowest predicted numbers of people migrating to/from specific communities for 2015
4.4 Top 5 of highest/lowest predicted numbers of “migration balance” for 2015
4.5 Various properties of the prediction errors for migration to/from/balance of all communities for 2015
4.6 Various properties of the prediction errors for migration to/from/balance of all communities in relation to its population for 2015

5.1 Odds ratio for the logistic regression
5.2 Classification of the logistic regression for 2015
5.3 Classification of the logistic regression for 2015 with its best boundary according to the balanced error of misclassification
5.4 Classification of the logistic regression for 2015 with its lowest absolute error
5.5 Random effects of a random intercept logistic regression
5.6 Classification with unbalanced data in Random Forest
5.7 Classification with balanced data in Random Forest for migration in 2015
5.8 Random Forest Classification grouped by age for migration 2015 (personal characteristics + destination information)

B.1 Regression results of Gravity Model of Migration
B.2 Regression results of the Poisson model
B.3 Regression results of the geometric model
B.4 Top 5 highest/lowest errors for from/to migration and balance between predicted and true values of 2015
B.5 Top 5 highest/lowest errors in relation to its population for from/to migration and balance between predicted and true values of 2015
B.6 Regression results for the logistic model for migration on personal level
B.7 Random Forest Classification grouped by age for migration 2015 (personal characteristics + origin information)
B.8 Random Forest Classification grouped by age for migration 2015 (only personal characteristics)

Chapter 1 Introduction

Migration plays a big role in human life. People migrate to other places for better living conditions, jobs and many other reasons. In recent decades migration has changed the world: new centres of work and new residential areas have come into existence, and migration has not stopped, nor will it.

This thesis analyses the population of Upper Austria with respect to its migration behaviour. Before the statistical analysis, a descriptive analysis gives an overview of Upper Austria. Plausible covariables, such as the different fields of jobs or the prices of building areas, are analysed by displaying the data on a map of Upper Austria. The main part of the thesis analyses the movement between communities. Movements within one community cannot be analysed, as this information is not available. The aim is to find reasons for migration and, furthermore, to predict how many people will (im)migrate to certain communities. In this part of the work, the total number of people within each community is considered without differentiation by age or other characteristics. The final part of the thesis takes the personal characteristics into consideration. Here the probability that a person migrates is estimated.

Chapter 2 Description of the Data

This chapter briefly explains the datasets that are used and describes how the data are cleaned. Furthermore, a descriptive analysis of Upper Austria is given.

2.1 Datasets

Residents of Upper Austria

The main dataset contains information about the residents of Upper Austria. It is provided by the government of Upper Austria (Amt der OÖ. Landesregierung) and includes information about all inhabitants, such as sex, age, marital status, number of children, work, education and place of residence. The reporting years are 2009 to 2015.

Child Care

The dataset "Child care" contains information about all child care facilities, such as crèches, kindergartens and after-school care clubs, and how many groups each facility has. The information is only given in text form. The data are given for 2017 and are provided by the government of Upper Austria.

Building Area

The dataset "building area" gives the size of each community of Upper Austria and the free building area in m². It is provided by the government of Upper Austria and covers only the year 2017.

Price of Building Area

The dataset "Price Building Area" contains the minimum, maximum and mean price per m² of building area. It covers the years 2010, 2015, 2016 and 2017 and is provided by the government of Upper Austria.

Gross Domestic Product

Information about the Gross Domestic Product (GDP) is available on the homepage of the National Statistical Institute of Austria, Statistik Austria. The GDP is given at NUTS 3 level (Nomenclature des unités territoriales statistiques), that is, for small regions. In the case of Upper Austria the regions are Mühlviertel, Innviertel, Traunviertel, Linz-Wels and Steyr-Kirchdorf. The GDP dataset is given for the years 2014 and 2015.

Places of employment

The last dataset, "Places of employment", contains the number of free jobs available. This information is given on district level for the year 2015 and is provided by the labour market service of Upper Austria, the AMS (Arbeitsmarktservice).

2.2 Preprocessing of the Data

In this section the datasets introduced in Section 2.1 are cleaned and transformed into a form suited for analysis.

Residents of Upper Austria

As only movements between communities within Upper Austria are of interest in this thesis, all other movements were deleted. That is, people moving to or from a place outside Upper Austria are not taken into account. Furthermore, for the years 2009 and 2010 the information where people migrated from is not given. Therefore these years were removed too. In 2015 the total number of communities shrank from 444 to 442. Aigen and Schlägl merged and renamed their community Aigen-Schlägl, and Rohrbach and Berg became Rohrbach-Berg. To keep the data comparable across all years, these four communities were replaced by the two new ones in all other years (2011-2014).

Gross Domestic Product/ Places of employment

As mentioned in the previous section, the Gross Domestic Product is only available at NUTS 3 level. The data therefore have to be split if the analysis is done across communities. The GDP of each NUTS 3 region was split among its communities in proportion to their population. The same procedure was applied to the places of employment: these data are available on district level and were transformed to community level by weighting with the population.
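The population weighting described above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the preprocessing code of the thesis: the function name `allocate_by_population`, the column names and all numbers are hypothetical.

```r
# Hypothetical helper: split a regional total across the communities of that
# region in proportion to their population.
allocate_by_population <- function(region_total, population, region) {
  # Total population of the region each community belongs to
  region_population <- ave(population, region, FUN = sum)
  region_total * population / region_population
}

# Made-up example: five communities in two regions (not thesis data)
d <- data.frame(
  community   = c("A", "B", "C", "D", "E"),
  region      = c("R1", "R1", "R1", "R2", "R2"),
  population  = c(5000, 3000, 2000, 1500, 500),
  region_jobs = c(40, 40, 40, 10, 10)   # total reported for the whole region
)

d$jobs <- allocate_by_population(d$region_jobs, d$population, d$region)
print(d)
```

Because each community receives its population share of the regional value, the community values add up to the regional total again, and non-integer values can occur.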

2.3 Descriptive Analysis

2.3.1 General information

This section gives some general information about Upper Austria. For computational reasons the descriptive analysis is done for one year only, namely 2014.

The population is around 1.4 million people in 444 communities. The community with the largest population is Linz with 197 174 inhabitants, and the smallest is Rutzenham with only 278 inhabitants. The distribution of the population can be seen in Figure 2.1. There are only three major cities, Linz, Wels and Steyr, also known as the statutory cities of Upper Austria. Wels and Steyr have populations of nearly 60 000 and 38 000, respectively.

Figure 2.1: Population of Upper Austria in 2014

Figure 2.2 shows the split into urban and rural areas. Blue areas are urban and orange areas are rural. Purple areas are crossover areas, which contain some bigger cities, though not as big as those in the urban areas. The lighter the colour, the more rural the area.

Figure 2.2: Upper Austria split in urban (blue) and rural (orange) areas

In Figure 2.1 only three communities with a larger population can be found, whereas in Figure 2.2 six urban areas appear. The additional urban areas lie in the south, in the middle west and in the west. Since this thesis aims to predict migration flows, the change of the population in each community is shown in Figure 2.3. Red represents an increase of the population, blue a decrease. The only conspicuous features are the decrease of the population in the south and north of Upper Austria and the increase in Wels and the capital, Linz.

Figure 2.3: Population changes from 2011 to 2014

From 2011 to 2014 the median population change across communities was an increase of 17 inhabitants. 50% of the communities had a change between -11 and 53 people, which can be seen in Figure 2.4. Three communities are not included in the box plot: Linz, Wels and Leonding. These communities had increases of 7285, 1259 and 1224 people. The largest decrease in population was 124 people, followed by Attnang-Puchheim with 111 and a third community with 101.

Figure 2.4: Box plot of population changes from 2011 to 2014

Figure 2.5: Relative population changes from 2011 to 2014

Clearly, bigger communities have a higher absolute increase of the population. Figure 2.5 therefore shows the relative values. The highest increases occur mostly in the central part, while no increase or even a decrease occurs in the south-east and the north of Upper Austria. Linz, with the biggest absolute increase of 7285 people, shows a relative increase of just 3.6%. The biggest relative increase occurred in St. Nikola an der Donau with 11.33%, which is 96 people relative to its population of 847. St. Thomas with 10.42% (+57) and Eggendorf im Traunkreis with 9.87% (+85) follow. At the other end, one community lost 6.65% of its population (-37), followed by St. Georgen bei Obernberg am Inn with 6.32% (-35) and St. Pankraz with 5.29% (-18).
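As a small worked example, the relative change quoted for St. Nikola an der Donau can be reproduced in R. The helper name `relative_change` is hypothetical, and the sketch assumes that the percentage is taken relative to the population figure quoted in the text.

```r
# Hypothetical helper: population change as a percentage of the population
relative_change <- function(change, population) {
  100 * change / population
}

# St. Nikola an der Donau: +96 people, population of 847 (figures from the text)
st_nikola <- round(relative_change(96, 847), 2)
print(st_nikola)
```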

2.3.2 Infrastructure of the communities

In the next step of the descriptive analysis, the infrastructure of the communities is described. Following the order of the datasets in Section 2.1, child care is described first. The covariable was transformed into a binary variable, where 1 indicates that child care exists in the community and 0 that it does not. Child care means that there is at least a crèche, a kindergarten or after-school care. Figure 2.6 shows that in nearly every community at least one child care institution exists. There are only 27 communities without one.

Figure 2.6: Child care institution exists (1) or does not exist (0) in Upper Austria

As described in Section 2.2, the covariable Gross Domestic Product was converted to community level, as it was only given for larger regions. The values therefore depend on the population of each community, and the graphical representation is very similar to Figure 2.3. The figure itself can be found as Figure A.2 in the appendix. As expected, the highest GDP is in Linz with more than 9.673 billion euros, followed by Wels and Steyr with 2.9 billion euros and 1.6 billion euros, respectively. The suburbs of Linz still have a higher GDP than all other communities. This can also be seen in the box plot in Figure 2.7. 75% of all communities have a GDP lower than 112 million euros; the median is around 64 million euros. The community with the lowest GDP is Rutzenham with just 9.5 million euros.

Figure 2.7: Box plot of the Gross Domestic Product (million €) in Upper Austria

Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
0.54    3.91      6.95     14.68   12.84     1037.06

Table 2.1: Numbers of free jobs in Upper Austria

Figure 2.8: Numbers of different fields of jobs in Upper Austria

Figure 2.9: Price per m² building area in Upper Austria

Table 2.1 summarises the next covariable, the number of free jobs in each community. As the "free jobs" dataset is only given on district level and was weighted down to community level by population, decimal values are possible. The community with the largest number of free jobs is Linz with 1037, followed by Wels (270) and Leonding (175). 75% of the communities have fewer than 13 free jobs. Figure 2.8 shows the number of different fields of jobs in each community. Linz, Leonding and Wels have the highest numbers of different fields of jobs, between 500 and 600. The median is around 212 different fields, and the minimum is 78 in Mörschwang.

The last two covariables deal with the building area in each community. The first one is the free space and the second one is its price. In general the median price per m² is around 50.75 €. The most expensive area is located in the south at the lakes of Upper Austria, namely Gmunden with 447.5 €. It is followed by Linz with 372.5 € and Gramastetten with a median price of 305 €. This can be seen in Figure 2.9. All other communities have lower prices: 75% are below 80 € per m². The lowest price can be found in Enzenkirchen with 12.5 €.

Figure 2.10: Free m² building area in Upper Austria

Figure 2.10 shows no specific relation between the free building area and its price. Building areas are available all over Upper Austria. For example, Weyer has the largest free area of 92 796.33 m² at a price of 43.75 €/m², whereas Linz has 75 047.65 m² of free area but a price of 372.50 €/m². The median price in Upper Austria is 50.75 €.

Figure 2.11: Median age of the population of Upper Austria in 2014

Figure 2.12: Number of people in education in 2014

2.3.3 Structure of the inhabitants of Upper Austria

As already mentioned, Upper Austria is home to over 1.4 million people, of whom 49.24% are female and 50.76% male. The median age across the communities is 42 years, and the oldest person is 106 years old. Figure 2.11 shows the median age of each community. It ranges from 34 years, for example in Auberg, up to 52 years. Furthermore, the south of Upper Austria is inhabited by a higher number of elderly people. In the other regions no typical cluster can be seen.

Next, education is described. In 2014, 16.35% of the people were in education. As seen in Figure 2.12, most of these people are in Linz with around 30 000, followed by Wels (9422) and Steyr (5497). These cities are large anyway, so more people in education live there. Relative to the population, no notably high numbers of people in education can be seen in Figure A.1 in the appendix. Furthermore, these cities have education centres like FH OÖ Campus Wels/Steyr/Linz, and Linz has several university study programmes in art, music, science and technology, economics, social sciences and other fields.

For a further characterisation of the population of Upper Austria, the highest completed education is used. It is split into four groups. The first one includes all people who have not finished school yet (<8 years of education), the second those who completed 8 years of school education. The third group includes all those who finished school with the “Matura”, and the last group are graduates. These groups are shown in Figure 2.13.

(a) Less than 8 years of school education (b) Finished education after 8 years

(c) Completed education with “Matura” degree (d) Academic degree

Figure 2.13: Highest education in relation to the population of Upper Austria 2014

In Figure 2.13a the communities look relatively homogeneous, with the exception of the south of Upper Austria. This is quite different in the next figure. Figure 2.13b shows that in the north-west a higher number of people with only 8 years of school education live. The south, for example Gosau, shows the highest numbers of people, relative to the population, whose highest degree is the “Matura” (see Figure 2.13c). The highest education level is more common in areas where education centres are present, as can be seen in Figure 2.13d.

Another characteristic of the population is the number of people in employment. Around 54% of the whole population are working. Figure 2.14 shows the percentage of employed people. Orange indicates that a community has fewer employed people than people who are not employed; blue shows that it has more employed people. Overall there are only a few communities with less than 50% employment.

Figure 2.14: Employed people in relation to the population of Upper Austria 2014

The last characteristic of the inhabitants of Upper Austria is their nationality. It is divided into Austrian and other citizenship. Around 5% of the people in each community are foreign citizens. Überackern has the highest percentage of foreign citizens. Its location can be seen in Figure 2.15, in the west close to Germany. Around 25% are foreign, which are 170 inhabitants of 487. Another community follows with 23%, which corresponds to 1417 people with foreign citizenship. Freinberg is another community with more than 20% (20.7%) foreign citizens. The communities with the lowest percentages are Pötting (3.71%), a second community (3.77%) and Schönau im Mühlkreis (4.46%). In the central area of Upper Austria (Linz-Wels-Steyr) more foreign citizens live; a reason for this may be the presence of educational centres.

Figure 2.15: Foreign citizenship in relation to the population of Upper Austria 2014

Chapter 3 Methods

3.1 Generalized Linear Model

Generalized Linear Models (GLM) are the generalized version of the ordinary linear models. In the linear model the response term follows a normal distribution, which can be generalized to distributions from an exponential family. The expectation of the response is modelled as the response function h of the linear predictor η.

$$E(y_i) = \mu_i = h(\eta_i) \tag{3.1}$$

$$\eta_i = x_i^T \beta = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \tag{3.2}$$

Here $x_i = (1, x_{i1}, \dots, x_{ik})^T$ are the given covariables and $\beta = (\beta_0, \dots, \beta_k)^T$ are the corresponding coefficients, which are estimated. The link function is the inverse of the response function. In this thesis the log link is used for the Poisson model and the logit link for the logistic regression; both models are described below. The distributions of the response (normal, Poisson) can be written in the form of a one-parameter exponential family [Fahrmeir et al., 2009, p.221]:

$$f(y_i \mid \theta_i) = \exp\left( \frac{y_i \theta_i - b(\theta_i)}{\phi}\, \omega_i + c(y_i, \phi, \omega_i) \right) \tag{3.3}$$

$\theta_i$ is called the natural or canonical parameter, $\phi$ is the dispersion parameter, $\omega_i$ is the weight and $c$ the constant term. For the expectation $E(y_i)$ and the variance $Var(y_i)$ the following properties hold:

$$E(y_i) = \mu_i = b'(\theta_i) \quad \text{and} \quad Var(y_i) = \frac{\phi\, b''(\theta_i)}{\omega_i} \tag{3.4}$$

For the moments to exist, the first and second derivatives of $b(\theta_i)$ have to exist.

The coefficients $\beta$ are estimated by maximum likelihood. For the estimation the log of the density is taken:

$$l_i(\beta) = \log f(y_i \mid \beta) = \frac{y_i \theta_i - b(\theta_i)}{\phi}\, \omega_i + c(y_i, \phi, \omega_i) \tag{3.5}$$

$\theta_i$ depends on $\beta$ through $b'(\theta_i) = h(x_i^T \beta)$. The log-likelihood is the sum of the log densities over all observations:

$$l(\beta) = \sum_i l_i(\beta)$$

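Setting the first derivative of $l(\beta)$ to 0 yields the estimate of $\beta$. This derivative is called the score function $s(\beta)$:

$$s(\beta) = \frac{\partial l}{\partial \beta} \tag{3.6}$$

$$s(\beta) = \sum_i x_i \, \frac{\partial h(\eta_i)}{\partial \eta_i} \, \frac{(y_i - \mu_i)}{\sigma_i^2} \tag{3.7}$$

Both $E(y_i) = h(x_i^T \beta)$ and $Var(y_i) = \sigma_i^2$ depend on $\beta$. Whether the solution is a maximum is checked with the second derivative of $l(\beta)$, whose negative expectation is known as the Fisher information.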
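The score equation can be checked numerically with a short R sketch. The data are made up for illustration, and the sketch assumes a Poisson model with log link, for which $\partial h(\eta_i)/\partial \eta_i = \mu_i$ and $\sigma_i^2 = \mu_i$, so that the score in Equation 3.7 reduces to $X^T (y - \mu)$. At the maximum likelihood estimate it should be numerically zero.

```r
# Small illustrative data set (made up, not thesis data)
x <- c(0, 1, 2, 3, 4, 5, 6, 7)
y <- c(1, 0, 2, 3, 2, 5, 7, 9)

# Fit with a tight convergence tolerance so the score is very close to zero
fit <- glm(y ~ x, family = poisson(link = "log"),
           control = glm.control(epsilon = 1e-12, maxit = 100))

X  <- model.matrix(fit)   # design matrix including the intercept column
mu <- fitted(fit)         # mu_i = h(eta_i) = exp(eta_i)

# General form of Equation 3.7: sum_i x_i * dh/deta_i * (y_i - mu_i) / sigma_i^2
dh_deta <- mu             # derivative of exp(eta) with respect to eta
sigma2  <- mu             # Poisson variance equals the mean
score <- t(X) %*% (dh_deta * (y - mu) / sigma2)
print(score)
```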

3.1.1 Poisson Model

The Poisson model is a good option if the response is a count with small values. For the response variable a Poisson distribution is assumed:

$$f(y_i \mid \lambda) = \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!}. \tag{3.8}$$

As mentioned in the previous section, the Poisson distribution belongs to a one-parameter exponential family, and the parameters can be easily extracted:

$$\theta_i = \log(\lambda), \quad b(\theta_i) = \exp(\theta_i) = \lambda = b'(\theta_i) = b''(\theta_i), \quad \phi = 1 \quad \text{and} \quad c(y_i, \phi) = -\log(y_i!).$$
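For completeness, these parameters can be read off by rewriting the Poisson density in the form of Equation 3.3. The following short derivation assumes unit weights, $\omega_i = 1$, which is an assumption made here for illustration:

```latex
\begin{aligned}
f(y_i \mid \lambda)
  &= \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!}
   = \exp\bigl( y_i \log(\lambda) - \lambda - \log(y_i!) \bigr) \\
  &= \exp\left( \frac{y_i \theta_i - b(\theta_i)}{\phi}\,\omega_i + c(y_i, \phi) \right)
  \quad \text{with } \theta_i = \log(\lambda),\; b(\theta_i) = \exp(\theta_i),\;
  \phi = 1,\; \omega_i = 1,\; c(y_i, \phi) = -\log(y_i!), \\
E(y_i) &= b'(\theta_i) = \exp(\theta_i) = \lambda, \qquad
Var(y_i) = \frac{\phi\, b''(\theta_i)}{\omega_i} = \exp(\theta_i) = \lambda .
\end{aligned}
```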

The equality of expectation and variance of $y$ is a characteristic feature of the Poisson distribution. To fit the model the canonical response function is chosen:

$$h(\eta_i) = \exp(\eta_i) \tag{3.9}$$

which corresponds to the log link $\log(\mu_i) = \eta_i$.

In R the Poisson model can be fitted with the glm function of the R-package stats [Venables and Ripley, 2002]. In this function the argument “family” specifies the distribution with its link function. For a Poisson model the family argument is given as “poisson” and the link function as “log”. Altogether the code looks like the following:

glm(formula, family = poisson(link = "log"))
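A minimal runnable sketch of such a call is given below. The data frame `d` and its columns `group` and `count` are made up for illustration and are not thesis data. With a single binary covariate the fitted means equal the group means, which makes the coefficients easy to interpret.

```r
# Made-up counts for two groups (illustration only)
d <- data.frame(
  group = c(0, 0, 0, 1, 1, 1),
  count = c(1, 2, 3, 4, 6, 8)
)

fit <- glm(count ~ group, family = poisson(link = "log"), data = d)

# The group means are 2 and 6. With the log link, exp(intercept) is the mean
# of group 0 and exp(slope) is the ratio of the two group means.
rate <- exp(coef(fit))
print(rate)
```

On the log scale the coefficients act additively, so on the original scale they are multiplicative factors of the expected count.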

3.1.2 Geometric Model

The geometric model is a special case of the negative binomial model, which in turn extends the Poisson model (the Poisson model is obtained in the limit $\theta \to \infty$). The density of the negative binomial distribution can be written as a gamma mixture of Poisson distributions, as given in Equation 3.10 [Zeileis et al., 2008, p.5].

$$f(y; \mu, \theta) = \frac{\Gamma(y + \theta)}{\Gamma(\theta) \cdot y!} \cdot \frac{\mu^y \cdot \theta^\theta}{(\mu + \theta)^{y + \theta}} \tag{3.10}$$

$\mu$ is the expectation of the distribution, $\theta$ the shape parameter, and $\Gamma(\cdot)$ is the gamma function. For a fixed $\theta$, Equation 3.10 fits into the GLM framework of Equation 3.3. Setting $\theta$ to 1 yields the geometric density:

$$f(y; \mu) = \left( \frac{\mu}{1 + \mu} \right)^y \cdot \frac{1}{\mu + 1} \tag{3.11}$$

for $y = 0, 1, 2, \dots$

The geometric regression model is fitted in R with the glm function, as for the Poisson model, but in this case a second package (MASS [Venables and Ripley, 2002]) has to be loaded for the family “negative.binomial”. Altogether the code looks like the following:

glm(formula, family = negative.binomial(theta = 1), ...)
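That the negative binomial density with $\theta = 1$ is the geometric density can be checked numerically with base R. In this sketch the function name `geometric_density` and the value of `mu` are chosen for illustration only.

```r
# Geometric density for mean mu, i.e. the negative binomial density with theta = 1
geometric_density <- function(y, mu) {
  (mu / (1 + mu))^y * 1 / (mu + 1)
}

y  <- 0:10
mu <- 2.5   # arbitrary illustrative mean

f_geo  <- geometric_density(y, mu)
f_nb   <- dnbinom(y, size = 1, mu = mu)     # negative binomial with theta = 1
f_geom <- dgeom(y, prob = 1 / (1 + mu))     # geometric distribution in base R

print(cbind(y, f_geo, f_nb, f_geom))
```

All three columns agree, which is why the geometric model can be fitted with the negative binomial family and a fixed shape parameter of 1.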

3.1.3 Logistic Model

The logistic model is used if the response has a binary (0/1) outcome. The goal is to estimate the probability of one outcome, usually $P(y_i = 1)$. This probability is modelled as:

$$E(y_i) = \mu_i = h(\eta_i) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \tag{3.12}$$

The outcome yi is assumed to follow a Bernoulli distribution: ( µi if yi = 1, f(yi|xi) = (3.13) 1 − µi if yi = 0

Other models for 0/1 outcomes exist, but the logit model is easy to interpret via the odds ratio. The odds are given as:

\[ \frac{\mu_i}{1 - \mu_i} = \frac{P(y_i = 1 \mid x_i)}{P(y_i = 0 \mid x_i)} = \exp(\beta_0) \cdot \exp(x_{i1}\beta_1) \cdot \ldots \cdot \exp(x_{in}\beta_n) \tag{3.14} \]

The odds ratio for a one-unit increase of x_in, with all other covariables held fixed, is therefore given as:

\[ \frac{P(y_i = 1 \mid x_{in} + 1, \ldots)}{P(y_i = 0 \mid x_{in} + 1, \ldots)} \Big/ \frac{P(y_i = 1 \mid x_{in}, \ldots)}{P(y_i = 0 \mid x_{in}, \ldots)} = \exp(\beta_n). \tag{3.15} \]

The odds ratio states by which factor the odds of observing the outcome change compared to the baseline.

In this thesis two functions are used for modelling in R. The first one is the standard function

glm(formula, family = binomial(link = "logit"), ...)

of the stats package. The second function is

multinom(formula, ...)

of the nnet package [Venables and Ripley, 2002], which is used for faster computation. In addition to the standard logistic regression, random intercept models and random slope models are tested. For these cases the R package lme4 [Bates et al., 2015] is used with its function

glmer(formula, family = binomial, ...).
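The link between the coefficients and the odds ratio of Equation 3.15 can be illustrated with the first of these functions. The sketch below uses simulated data; the covariable age and the coefficients are assumptions made for illustration, not estimates from the thesis.

```r
# Simulated data only; the covariable and its effect are hypothetical.
set.seed(2)
n     <- 5000
age   <- runif(n, 15, 80)
moved <- rbinom(n, size = 1, prob = plogis(0.5 - 0.05 * age))

fit        <- glm(moved ~ age, family = binomial(link = "logit"))
odds_ratio <- exp(coef(fit))
odds_ratio["age"]        # close to exp(-0.05): each extra year multiplies the odds by about 0.95

# Equation 3.15 computed directly from fitted probabilities at age 40 and 41
p    <- predict(fit, newdata = data.frame(age = c(40, 41)), type = "response")
odds <- p / (1 - p)
all.equal(unname(odds[2] / odds[1]), unname(odds_ratio["age"]))
```

The last comparison shows that the ratio of the odds at two neighbouring covariable values equals exp() of the coefficient, independent of the chosen starting value.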

3.2 Gravity Model of Migration

The Gravity Model of Migration is based on Newton’s law of gravity in physics. The idea was first formally introduced by J. Q. Stewart in 1950, but it goes back to the 1880s, when it was already used by Ravenstein [Poot et al., 2016, p.1]. In short, the law says that the force between two objects depends on their masses and on the distance between them. The larger the masses, the stronger the force; the larger the distance, the weaker it is. This can be seen in the formula of Newton’s law:

\[ F = G \, \frac{m_1 \cdot m_2}{r^2}, \tag{3.16} \]

where F is the force, m_1 and m_2 are the masses of the objects, r^2 is the squared distance, and G is the gravitational constant. In the Gravity Model of Migration the two objects are cities or, in the case of this thesis, communities. The mass of an object is its population, and the distance is measured between the centres of the communities. By parametrising Newton’s law, the model of migration becomes:

\[ M_{ij} = G \, \frac{P_i^{\alpha} \cdot P_j^{\beta}}{D_{ij}^{\gamma}} \tag{3.17} \]

Equation 3.17 is the most common formulation of the gravity law of migration. M_ij is the number of people moving from community i to community j. It depends on the populations P_i and P_j and on the distance D_ij between them. α, β and γ are the parameters to be estimated, and G is, as in Newton’s law, a constant. To estimate Equation 3.17, both sides of the equation are log-transformed, which gives Equation 3.18:

\[ \log(M_{ij}) = \delta + \alpha \log(P_i) + \beta \log(P_j) + \gamma \log(D_{ij}) + \varepsilon_{ij} \tag{3.18} \]

The constant log(G) is rewritten as δ and an error term ε_ij with mean zero is added to the model. The sign of the distance exponent is absorbed into γ, so γ is expected to be negative: as in physics, distance has a negative influence on the force (in this context, on the chance of migration).

As migration flows can depend on other factors as well, further variables X_ij can easily be added to the model [Garcia et al., 2015, p.96]:

\[ \log(M_{ij}) = \delta + \alpha \log(P_i) + \beta \log(P_j) + \gamma \log(D_{ij}) + \theta X_{ij} + \varepsilon_{ij} \tag{3.19} \]
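The log-linear form makes the gravity model an ordinary linear regression. The following sketch fits it with lm() on simulated flows; all values, including the "true" parameters, are assumptions chosen for illustration and not results of the thesis.

```r
# Simulated data only; populations, distances and parameters are hypothetical.
set.seed(3)
n    <- 1000
P_i  <- runif(n, 500, 50000)      # population of origin
P_j  <- runif(n, 500, 50000)      # population of destination
D_ij <- runif(n, 2, 150)          # distance in km

log_M <- -8 + 0.8 * log(P_i) + 0.9 * log(P_j) - 1.2 * log(D_ij) + rnorm(n, sd = 0.3)

fit <- lm(log_M ~ log(P_i) + log(P_j) + log(D_ij))
round(coef(fit), 2)               # the distance coefficient should be negative
```

Since both sides are in logs, the coefficients can be read as elasticities: a 1% increase in distance changes the expected flow by roughly γ percent.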

3.3 Inflated Beta Regression

The inflated beta regression is a special kind of regression. It is a mixture of three components: two point masses as the inflated parts, one at 0 and one at 1, and a beta regression for the continuous part. Figure 3.1 gives an example of a zero-one inflated beta distribution. The density of a zero-one inflated beta regression can be written as follows [Rigby et al., 2019, p.301]:

\[ f_Y(y \mid \mu, \sigma, p_0, p_1) = \begin{cases} p_0 & \text{if } y = 0 \\ (1 - p_0 - p_1) \, f_W(y \mid \mu, \sigma) & \text{if } 0 < y < 1 \\ p_1 & \text{if } y = 1 \end{cases} \tag{3.20} \]

where

\[ p_0 = \frac{\nu}{1 + \nu + \tau}, \quad p_1 = \frac{\tau}{1 + \nu + \tau}, \quad f_W(y \mid \mu, \sigma) = \frac{1}{B(\alpha, \beta)} \, y^{\alpha - 1} (1 - y)^{\beta - 1}, \]

\[ \text{with } \alpha = \mu \, \frac{1 - \sigma^2}{\sigma^2}, \quad \beta = (1 - \mu) \, \frac{1 - \sigma^2}{\sigma^2}. \]

µ (the mean of the beta part) and σ are the parameters of the beta distribution, ν belongs to the zero-inflated part and τ to the one-inflated part. If only a zero-inflated or only a one-inflated model is needed, τ or ν respectively is set to 0. The expectation of the response is given as:

\[ E(y) = \frac{\mu + \tau}{1 + \nu + \tau} \tag{3.21} \]

Figure 3.1: Inflated Beta Regression

Each parameter of the model has its own linear predictor. The parameters µ and σ are modelled with a logit link and ν and τ with a log link:

\[ \eta_\mu = \log\left( \frac{\mu}{1 - \mu} \right) = X^T \beta_\mu \]

\[ \eta_\sigma = \log\left( \frac{\sigma}{1 - \sigma} \right) = X^T \beta_\sigma \]

\[ \eta_\nu = \log(\nu) = \log\left( \frac{p_0}{1 - p_0 - p_1} \right) = X^T \beta_\nu \]

\[ \eta_\tau = \log(\tau) = \log\left( \frac{p_1}{1 - p_0 - p_1} \right) = X^T \beta_\tau \]

Different covariables can be specified for each parameter, so X^T can differ between the four equations. The zero-one inflated beta regression is implemented in R in the package gamlss [Rigby and Stasinopoulos, 2005]. The coefficients are estimated with the function of the same name, gamlss(), which accepts a separate regression formula for each parameter:

gamlss(formula, sigma.formula, nu.formula, tau.formula, family = BEINF, ...)

The “family” argument specifies the distribution of the response. BEINF is the family for a zero-one inflated beta regression; BEINF0 and BEINF1 are the families for a zero-inflated and a one-inflated beta regression, respectively.
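The mixture density of Equation 3.20 and the expectation of Equation 3.21 can be verified numerically with base R alone. This is only a sketch: the parameter values are arbitrary assumptions, and the gamlss package itself is deliberately not used here.

```r
# Hypothetical parameter values, chosen only for illustration.
mu <- 0.3; sigma <- 0.4; nu <- 2; tau <- 0.5

p0    <- nu  / (1 + nu + tau)                  # point mass at 0
p1    <- tau / (1 + nu + tau)                  # point mass at 1
alpha <- mu       * (1 - sigma^2) / sigma^2
beta  <- (1 - mu) * (1 - sigma^2) / sigma^2

# Continuous part of Equation 3.20 on (0, 1)
f_cont <- function(y) (1 - p0 - p1) * dbeta(y, alpha, beta)

total_mass   <- p0 + p1 + integrate(f_cont, 0, 1)$value
mean_num     <- p1 + integrate(function(y) y * f_cont(y), 0, 1)$value
mean_formula <- (mu + tau) / (1 + nu + tau)    # Equation 3.21

c(total_mass = total_mass, mean_num = mean_num, mean_formula = mean_formula)
```

The total mass should be 1 and the numerically integrated mean should agree with Equation 3.21 up to the integration error.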

3.4 Random Forest

Random Forest is one of the most common machine learning algorithms today. It is a method for classification and regression; in this thesis the algorithm is used for classification. The first approaches to Random Forest were made by Tin Kam Ho in 1995 [Ho, 1995]. In summary, the idea of Random Forest is to randomly create a forest of many trees. Figure 3.2 gives an example of a decision tree, which is the building block of the algorithm. As one can see, there are several splits (age, isWork, isEdu, ...) in this tree. In this thesis these splits are computed with the Gini impurity criterion [Breiman and Cutler]. A decision tree is finished when the response is classified perfectly; otherwise further splits have to be made.

Figure 3.2: Example for a decision tree

In Random Forest many such decision trees are used. Each tree is grown on a randomly selected subsample (sampling with replacement) of the original data set. New observations are predicted by the majority vote over all fitted trees. This method is called bagging (bootstrap aggregation), and it decreases the model variance. In this thesis the covariables are additionally randomized at each split. In a plain decision tree all covariables are candidates and the split is selected by the Gini impurity criterion; here a subset of the covariables is drawn at random first, and the split is then selected among them by the Gini criterion. The number of variables in this subset is chosen by minimizing the out-of-bag (OOB) error. The OOB error is the prediction error on those observations of the training data set that were not used to grow a given tree.

The number of trees is related to the accuracy of the model: the more trees, the more accurate the result. Random Forest does not overfit with more trees [Breiman, 2001, p.4]. An advantage of decision trees is that they are fast to execute. Furthermore, the trees of a Random Forest can be grown in parallel, which saves time.

In R the Random Forest algorithm is implemented in the package randomForest [Liaw and Wiener, 2002]. The data are classified with its function of the same name,

randomForest(formula, mtry = tuneRF(), ...)

The function tuneRF() chooses the best number of variables randomly selected at each split by minimizing the out-of-bag error.
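The Gini impurity criterion used for the splits can be written down in a few lines of base R. The sketch below is not the internal code of the randomForest package; the helper functions and the toy data are hypothetical and only illustrate how a candidate split is scored.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity of a binary split defined by a logical condition
split_gini <- function(labels, condition) {
  n     <- length(labels)
  left  <- labels[condition]
  right <- labels[!condition]
  length(left) / n * gini(left) + length(right) / n * gini(right)
}

# Toy data (made up): 1 = mover, 0 = non-mover
moved <- c(1, 1, 1, 0, 0, 0, 0, 0)
age   <- c(22, 25, 31, 45, 52, 60, 38, 67)

gini(moved)                   # 0.46875, impurity before splitting
split_gini(moved, age < 35)   # 0, this split separates the classes perfectly
split_gini(moved, age < 50)   # in between: one child node is still mixed
```

A tree-growing algorithm would evaluate such scores for all candidate splits (in Random Forest: for a random subset of the covariables) and pick the split with the lowest weighted impurity.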

Chapter 4 Cross Section Analysis

In this part a cross-sectional analysis at community level is carried out. The aim is to find a model for the migration balance of each community with covariables describing economy, population, infrastructure, and macro-economics (see Equation 4.1):

migration balance ∼ economy + population + infrastructure + macro-economics (4.1)

Several models are used for the analysis. In the first step the data are analysed in absolute values. For this purpose the Gravity Model of Migration as well as Poisson, geometric and zero inflated Poisson models are used. In the second step the absolute values are divided by the total population of each community, so that relative migration values are modelled. As these values lie between 0 and 1, a zero inflated beta regression is used.

Since memory consumption and run-times are a serious limitation when computing the models, the analyses are done for one year only (2014). The predictions of the model are compared with the values of the following year (2015).

Figure 4.1: Number of re-settlers in Upper Austria in 2014

Count of re-settlers        0      1      2      3      4      5
Frequency              184328   5058   1792   1022    648    422
Frequency [%]           94.57   2.59   0.92   0.52   0.33   0.21

Table 4.1: Top 6 most frequent migration combinations

4.1 Descriptive Analysis

Before the covariables are estimated, the response (migration) is plotted in a histogram to see how the data are distributed. Figure 4.1 shows the numbers of people moving from one community to another. There are 442 × 441 origin-destination combinations, which means that a person migrating into another community has 441 possible destinations. Moves within one community are not counted, as this information is not available. People who move, for example, from Linz to Wels and back to Linz within one year are likewise treated as “non-movers”. Many movement combinations have low values; in particular, there are 184 328 combinations in which no movement occurred. Table 4.1 lists the most frequent counts. For absolute migration movements one can try to fit a Poisson regression, a geometric regression, and a zero inflated regression. The results of these regressions are described in the next sections (Sections 4.2 and 4.3). Furthermore, the Gravity Model of Migration can be tried on the discrete data.

Before going to the next section, the number of migrations is transformed into relative numbers by dividing the values by the population of the destination community. Figure 4.2 shows the number of migrations in relation to the population. As before, 184 328 combinations have zero migration, but all other values now lie between 0 and 1. For data on this interval a beta regression is appropriate. Because of the high number of “non-mover” combinations, a zero inflated beta regression is used for a better analysis.
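The number of combinations and the shares in Table 4.1 follow from simple arithmetic, which can be retraced in R. The sketch only uses the figures quoted above; the object names are chosen for illustration.

```r
n_comm  <- 442
n_pairs <- n_comm * (n_comm - 1)     # ordered origin-destination pairs: 194922

# Frequencies of the six most common counts, as quoted in Table 4.1
counts <- c(`0` = 184328, `1` = 5058, `2` = 1792, `3` = 1022, `4` = 648, `5` = 422)

share_pct <- 100 * counts / n_pairs  # percentages of all combinations
round(share_pct, 2)
```

The rounded shares reproduce the percentages of Table 4.1 up to rounding in the last digit, and they show that about 94.6% of all combinations have no movement at all.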

4.2 Results for absolute number of migration

As described in Section 4.1, two ways for analysis can be done. One way is to analyse with absolute values, the other one with values in relation to its population. First the analysis using the absolute values is shown.

4.2.1 Gravity Model of Migration

Figure 4.2: Migration movements in relation to its population in Upper Austria in 2014

The first method to estimate the parameters for migration to another community is the Gravity Model of Migration. This model is a natural choice because of the phenomenon of rural depopulation. As described in Section 3.2, the theory says that the bigger the community is, the more people migrate to it. Linz, Wels, and Steyr are the biggest communities in Upper Austria (see Figure 2.1), so one can expect that they attract more people. Starting with the simplest regression model, the base model of the Gravity Model of Migration is taken:

\[ \log(M_{ij}) = \beta_0 + \beta_1 \log(P_i) + \beta_2 \log(P_j) + \beta_3 \log(D_{ij}) + \varepsilon_{ij} \tag{4.2} \]

The problem is that both sides of the equation have to be log-transformed, so all values equal to zero have to be removed or set to a small positive value, e.g. 0.0001. Figure 4.3 shows the model diagnostics of the Gravity Model of Migration without the zero values. The model diagnostics after setting the zero values to 0.0001 are given in Figure A.3 in the Appendix. Comparing the two figures (Residuals vs Fitted), the zero values stand out clearly: they are always separated from the other values. Looking closer at the residual plot in Figure 4.4, one can see a separate line for each discrete migration value. This pattern suggests a systematic error. A reason may be that there are many low values.
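Why a discrete response produces one line per value in a residuals-versus-fitted plot can be seen from the identity residual = log(k) − fitted for every observation with count k: all such points lie on a line with slope −1. The sketch below demonstrates this on simulated counts; the data are an assumption made for illustration and are unrelated to the thesis data.

```r
# Simulated counts only; not the thesis data.
set.seed(4)
x <- runif(500, 0, 3)
y <- rpois(500, exp(0.2 + 0.5 * x))
d <- data.frame(x = x, y = y)[y > 0, ]   # zeros dropped, since log(0) is undefined

fit <- lm(log(y) ~ x, data = d)

# residual + fitted equals log(y), so each distinct count defines one line
line_id <- resid(fit) + fitted(fit)
all.equal(unname(line_id), log(d$y))
length(unique(round(line_id, 8)))        # number of lines = number of distinct counts

# plot(fitted(fit), resid(fit)) would show these parallel lines
```

The pattern is therefore a consequence of modelling a low-valued discrete response on a continuous scale, which is consistent with the explanation given above.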

Figure 4.3: Model diagnostics for the Gravity Model of Migration without “non-movers“

Figure 4.4: Residuals vs fitted values of the Gravity Model of Migration

Figure 4.5: Residuals vs fitted values of the Poisson regression

4.2.2 Poisson/Geometric Regression

Poisson regression is applied if the response variable is discrete and low values occur. Modelling starts with the simplest model. This would be the intercept-only model, but to allow a direct comparison with the Gravity Model of Migration the three covariables distance, population of the origin, and population of the destination are included. Figure 4.5 shows the same systematic error as before in the Gravity Model of Migration. The lines are less visible here, but still present. For this model, too, the zero values are removed for a better fit. The model diagnostics of the Poisson regression with all observations are given in Figure A.4 in the Appendix. As the same kind of lines appears in the residual plots of the geometric regression [Zeileis et al., 2008], the corresponding graphics are only given in the Appendix (see Figure A.5 and Figure A.6).

4.2.3 Zero Inflated Regression

The last method to fit a model with absolute values is the zero inflated regression. As the Poisson regression seemed to give the better fit, the zero inflated Poisson regression is used. As in the previous regression models, the covariables distance and population are included in the model.

Using the function “zeroinfl” of the R package “pscl” [Jackman, 2017], an error occurs, stating that the Hessian matrix of the model with the given covariables is computationally singular and cannot be inverted during model fitting:

Error in solve.default(as.matrix(fit$hessian)): system is computationally singular.

4.2.4 Conclusions

It can be demonstrated that several discrete regression methods, and even the Gravity Model of Migration, cannot estimate the absolute numbers of migrating people without systematic errors. A reason may be the huge number of zero values and other low values. To circumvent this problem, one might sum the years 2011 to 2014 and refit the models. The model diagnostics using all data from 2011 to 2014 are shown in Figure A.7 in the Appendix. Unfortunately the residual plot does not change. Since modelling with absolute values did not give a satisfying result, the next section deals with modelling relative migration flows.

4.3 Results for relative numbers of migration

To solve the problem of discreteness, the absolute numbers of migration are divided by the population. There are two possibilities to do so: dividing by the population of the destination community or by that of the origin community. Anticipating results from the next section, the division by the population of the destination is chosen, because the predictions of the model fit better. Figure 4.2 shows the distribution of the data, for which a zero inflated beta distribution fits best.

4.3.1 Zero Inflated Beta Regression

Table 4.2 summarizes the results of the model. As the numbers of migration are in relation to the population of the destination, the regression model is weighted by these numbers. Each parameter of the model is estimated separately. The shape parameter σ is modelled by the intercept term only. All other parameters are modelled by the distance (dist) between the communities, the population of the origin (Pop.o), diversity of jobs (diffWork), available jobs (freeWork), existence of childcare (care), the size of free building area (comSize), and its price (comPrice). Additionally a new covariable (mig13) is added to make the model more accurate: “mig13” is the amount of migration relative to the population of the destination in the year before. The covariables “diffWork” and “freeWork” are also set in relation to the population of the communities. The first models also included the gross domestic product, but this variable was removed later because it made the predictions for 2015 worse.

Dependent variable: migration
Zero Inflated Beta Regression

              (µ coefficients)   (ν coefficients)   (σ coefficients)

Intercept     -7.519e+00∗∗∗      -2.732e+00∗∗∗      -4.077e+00∗∗∗
              (1.165e-03)        (3.788e-03)        (5.555e-05)

dist          -1.834e-02∗∗∗      4.715e-02∗∗∗
              (3.168e-06)        (7.376e-06)

Pop.d         3.105e-06∗∗∗       -2.402e-05∗∗∗
              (2.708e-09)        (2.810e-08)

diffWork.d    -4.147e+00∗∗∗      1.419e+01∗∗∗
              (1.847e-03)        (4.549e-03)

diffWork.o    8.227e+00∗∗∗       2.797e+01∗∗∗
              (2.019e-03)        (6.248e-03)

freeWork.d    2.303e+01∗∗∗       1.369e+00∗∗∗
              (5.704e-02)        (1.266e-01)

freeWork.o    -1.140e+01∗∗∗      1.866e+01∗∗∗
              (7.375e-02)        (1.590e-01)

care.d1       -1.243e-01∗∗∗      2.449e-01∗∗∗
              (4.610e-04)        (8.135e-04)

care.o1       2.990e-01∗∗∗       1.201e+00∗∗∗
              (8.534e-04)        (3.339e-03)

comSize.d     4.964e-07∗∗∗       -6.503e-06∗∗∗
              (5.643e-09)        (1.323e-08)

comSize.o     1.728e-06∗∗∗       -3.208e-05∗∗∗
              (5.575e-09)        (1.102e-08)

comPrice.d    8.563e-04∗∗∗       -2.966e-03∗∗∗
              (1.288e-06)        (3.635e-06)

comPrice.o    -1.459e-03∗∗∗      -4.411e-03∗∗∗
              (1.105e-06)        (-4.411e-03)

mig13         1.472e+02∗∗∗       -1.813e+03∗∗∗
              (2.441e-02)        (6.884e-01)

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01; o...origin; d...destination

Table 4.2: Regression results of the zero inflated beta model

Each coefficient has to be interpreted on its own [Alberto Gonzatto Jr. et al., 2017, p.168]. Starting with the ν coefficients, the odds ratio can be used for easier interpretation. For the covariable “distance” the odds ratio leads to the following conclusion: the odds that there is no movement increase by the factor exp(0.04715) = 1.0483 if the distance increases by 1 km. For diversity of jobs and available jobs the interpretation is not trivial. As these covariables are set in relation to the population, their values are quite low. Adding 1 unit to the relative number of diversity of jobs or available jobs is therefore an enormous change, and the odds of no movement increase dramatically. To dampen this effect, one could multiply these covariables by 1000 beforehand, for example, so that the estimate becomes 1000 times smaller.

The covariables size and price of the building area and population of origin have negative ν coefficients and therefore do not increase the odds of “no movement” between communities. The odds of “no movement” between communities increase by the factor 3.3223 if childcare is available in the origin community, compared to communities without a childcare facility.

The µ coefficients cannot be interpreted like the ν coefficients, but from their signs one can see, for example, that distance has a negative effect on the number of persons migrating: the further away the destination, the lower the number of people migrating. The number of people living in the destination community (Pop.d) has a positive effect, as have available jobs in the destination. All other covariables can be interpreted in the same way.

When these results are transformed back to absolute values by multiplying with the number of inhabitants of each community and the predictions are compared with the real data of 2015, Figure 4.6 shows that the model fits quite well. Only a few migration combinations do not fit well. Movements from Leonding to Linz (-634), -Linz (-277), -Linz (-139), and Leonding-Traun (-92) are under-predicted; Linz-Leonding (+674), Wels-Thalham (+164), Linz-Ansfelden (+120), and Linz-Kirchschlag (+99) are over-predicted. Removing these combinations and refitting the model does not lead to changes in the model, but the prediction of the total number of migrations gets worse. The prediction for 2015 under-estimates the total number of migrations by 57 people; the total error is 45 612 persons.
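The odds-ratio arithmetic and the effect of rescaling a covariable can be retraced in R. The first lines use the ν coefficient of distance from Table 4.2; the second part is only a sketch on simulated data, with a plain logistic regression standing in for the zero-inflation part, and all names in it are hypothetical.

```r
or_1km  <- exp(0.04715)        # about 1.0483: odds multiplier for "no movement" per extra km
or_10km <- exp(10 * 0.04715)   # multipliers compound over 10 km

# Rescaling a covariable by 1000 divides its coefficient by 1000
# (simulated data; logistic regression used as a stand-in)
set.seed(5)
n       <- 4000
share   <- runif(n, 0, 0.01)   # a small relative quantity, e.g. jobs per inhabitant
no_move <- rbinom(n, 1, plogis(-1 + 150 * share))

b_raw    <- coef(glm(no_move ~ share, family = binomial))[2]
b_scaled <- coef(glm(no_move ~ I(1000 * share), family = binomial))[2]
all.equal(unname(b_raw / 1000), unname(b_scaled), tolerance = 1e-4)
```

The rescaling changes neither the fit nor the predictions; it only moves the coefficient to a scale on which "one unit more" is a realistic change.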

Figure 4.6: True values vs predicted values for 2015 of the zero inflated beta regression

4.4 Migration balance

In Section 4.3.1 the numbers of migrations are estimated for each combination of communities. To calculate the migration balance of a community, the number of people leaving it is subtracted from the number of people migrating to it. Table 4.3 shows the communities with the highest and lowest predicted in-migration in the first column and those with the highest and lowest out-migration in the second column. The top values belong to the communities with the bigger cities. The lowest values in “migration from” logically belong to the communities with the fewest inhabitants. For migration to these communities their weak infrastructure might play a role. However, this is again a matter of the population size of these communities. for example has just a population of 580. , St.Georgen, Rutzenham, and the other mentioned communities with low migration to/from have less than 1000 inhabitants. One has to keep in mind that these values include only re-settlers within Upper Austria, as people from outside Upper Austria are not included in the modelling. Finally, Table 4.4 shows the top values of the migration balance. Again Linz has the top value with an increase of its population of around 2000 people, followed by Wels and Leonding with around 1400. Ried im Innkreis, , and Eferding are losing around 100 people by migration.

migration to                           migration from

Linz                           8621    Linz                           6548
Wels                           2824    Wels                           1347
Leonding                       2204    Leonding                        738
Traun                          1174    Steyr                           620
Steyr                          1046    Traun                           605

Ottschlag                         6    Perwang am Grabensee              5
Haigermoos                        6    St.Georgen am Fillmannsbach       5
Palting                           3    Rutzenham                         3
St.Georgen am Fillmannsbach       3    Mayrhof                           3
St.Radegund                       2    Haigermoos                        2

Table 4.3: Top 5 of highest/lowest predicted numbers of people migrating to/from specific communities for 2015

migration balance

Linz                 2073
Wels                 1477
Leonding             1466
Traun                 569
Ansfelden             460

Schwandenstadt        -96
Kirchdorf an der      -96
Eferding             -100
Bad Hall             -107
Ried im Innkreis     -127

Table 4.4: Top 5 of highest/lowest predicted numbers of “migration balance“ for 2015
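The migration balance is the column sum minus the row sum of the origin-destination matrix. The sketch below first shows this on a made-up 3 × 3 matrix (communities A, B, C are hypothetical) and then repeats the subtraction with the predicted values quoted in Table 4.3.

```r
# Toy origin-destination matrix (made-up numbers): rows = origin, columns = destination
flows <- matrix(c( 0, 30, 10,
                  50,  0,  5,
                  20, 15,  0),
                nrow = 3, byrow = TRUE,
                dimnames = list(origin = c("A", "B", "C"),
                                destination = c("A", "B", "C")))

moving_to   <- colSums(flows)            # people arriving in each community
moving_from <- rowSums(flows)            # people leaving each community
balance     <- moving_to - moving_from   # A: 30, B: -10, C: -20

# Same arithmetic with predicted values from Table 4.3
to_2015   <- c(Linz = 8621, Wels = 2824, Leonding = 2204, Traun = 1174)
from_2015 <- c(Linz = 6548, Wels = 1347, Leonding = 738,  Traun = 605)
to_2015 - from_2015                      # 2073, 1477, 1466, 569 as in Table 4.4
```

Within a closed system the balances sum to zero, because every move is an arrival in one community and a departure from another.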

4.4.1 Comparison to real data

As the predicted values are for the year 2015, they can be compared with the real data. Table 4.5 and Figure 4.7 give the errors, in absolute numbers, between the predicted values and the real values of 2015. One can see that there are still some big outliers, but in general the errors are quite low. The interquartile range is around 29 for migration from a certain community, 25 for migration to it, and slightly higher for the migration balance at 35 people.

              to       from    balance

Min.     -206.57    -626.79    -159.60
1st Qu.   -23.66     -11.22     -31.87
Median    -10.37       0.89     -13.67
Mean       -0.13      -0.13       0.00
3rd Qu.     1.48      17.86       2.79
Max.     3256.71    1207.64    2049.06

Table 4.5: Various properties of the prediction errors for migration to/from/balance of all communities for 2015

The top outlier for “moving from” is Linz with an error of 1208 people (0.61% in relation to its population). Leonding follows with an error of -627 (2.33%). The minus sign indicates that the predicted value is under-estimated. For migration to a community the top outlier is again Linz with an over-prediction of 3257 (1.65%) people; Wels follows with 1030 (1.72%) and Leonding with 1030 (3.68%). Ried im Innkreis is under-predicted by 207 (1.81%) persons. In the migration balance Linz still shows an over-prediction of around 2000 (1.04%) “movers”. It is followed by Leonding (1612/6.01%) and Wels (1136/1.90%). is leading the under-prediction in the balance with 160 (2.66%) people. Further values can be found in Table B.4 in the Appendix.

Looking at relative errors instead, a different picture emerges. Table 4.6 shows that in all three cases the interquartile range is quite low. However, again there are big outliers. This time these are no longer the big communities: Holzhausen, for example, has a misprediction of 12%, which corresponds to 101 people moving to this community. Mayrhof follows with 6.65%, but in absolute numbers these are only 20 people. Leonding as a big community is still an outlier with 3.68% (985 people). For migration from a community the extreme percentages range from -6.5% to 4% (minus means that the prediction is below the true value), but the absolute errors in these cases are quite low. In the migration balance the top relative error occurs for Holzhausen with 19.21%, which amounts to 153 people in total. Further outliers of the relative values for migration to/from and balance can be found in Table B.5 in the Appendix. Figure 4.8 shows the relative errors for migration to/from and balance in a box plot.

Figure 4.7: Prediction errors for migration to/from/balance of all communities for 2015

           to_rel (%)   from_rel (%)   balance_rel (%)

Min.          -4.14        -6.47           -5.44
1st Qu.       -1.36        -0.59           -1.63
Median        -0.66         0.07           -0.72
Mean          -0.63         0.03           -0.66
3rd Qu.        0.08         0.93            0.16
Max.          12.74         4.08           19.21

Table 4.6: Various properties of the prediction errors for migration to/from/balance of all communities in relation to its population for 2015

Figure 4.8: Prediction errors for migration to/from/balance of all communities in relation to its population (in %) for 2015

Chapter 5 Probability for migration

In this chapter the analysis is done at the level of the individual inhabitants of Upper Austria. The aim is to predict, for each person, the probability of migrating into a different community. This is done with covariables describing personal characteristics and the communities of origin and destination (see Equation 5.1).

P (migration) ∼ personal characteristics + structure of destination/origin (5.1)

The data for the analysis are again those of the year 2014, and the year 2015 is predicted to check how well the model fits.

As the response variable is binary (0 for no migration and 1 for migration), logistic regression is used. Before starting the analysis, some useful information about the data set is summarized. In Upper Austria 44 952 people moved into a different community in 2014 (again, no data are available for migration within the same community). This equals 3.19% of the population of Upper Austria. Persons under the age of 14 are removed, as it is unlikely that they move on their own and no household information is given in the data. This leaves 38 409 movers.

In the next part of the analysis the following covariables are used. As in the previous chapter, the same covariables are used for the origin and the destination communities: diversity of jobs in absolute values as “diffWork”, available jobs as “freeWork”, existence of childcare (i.e. whether there is a crèche, kindergarten, or after-school care) as “care”, gross domestic product as “gdp”, the free building area as “comSize” and its price per square metre as “comPrice”. The suffixes “.o” and “.d” denote the covariables of the origin and destination communities, respectively. In addition to the structure of the communities, some personal characteristics are added, as seen in Equation 5.1: age, sex, possession of Austrian nationality (isNation), employment or unemployment (isWork), being in education (isEdu) and the person’s education level (highEdu).

              coef       sd    odds ratio

care.o1     0.5819   0.1497      1.7894
care.d1    -0.5068   0.1503      0.6024

IsWork1     0.1534   0.0143      1.1658
highEdu3    0.2381   0.0190      1.2688
age        -0.0490   0.0004      0.9522
isNation1  -0.3854   0.0158      0.6802
sex1        0.1254   0.0106      1.1336
isEdu1     -0.9100   0.0203      0.4025

o...origin; d...destination

Table 5.1: Odds ratio for the logistic regression

5.1 Logistic Regression

As mentioned before, logistic regression is used because the response is binary. The estimates of the best model according to the AIC (Akaike information criterion) are given in Table B.6 in the Appendix. Table 5.1 shows some of the coefficients. For better interpretation the odds ratio is calculated. The covariables not shown (diffWork, freeWork, comSize, comPrice, and gdp) have an odds ratio of 1, which means that they do not change the odds of movement. For example, if the person is Austrian, the odds of migration are multiplied by the factor 0.68; in other words, the odds of migration are lower than for a non-Austrian. Further information can be extracted from Table 5.1, such as the fact that the older the person, the lower the odds of migration. People in education tend to stay, whereas people who are already working or have an academic degree tend to migrate to another community.

In the next part, the year 2015 is predicted using the developed model. As the prediction model returns probabilities, they cannot be compared directly with the true (binary) response. Therefore a probability lower than 0.5 is taken to indicate a “non-mover”, and everyone else is a “mover”. With this split at 0.5, the classification of the prediction for 2015 can be seen in Table 5.2 (1: movers; 0: non-movers). The class of “non-movers” is predicted very well, but unfortunately the other class is not. The overall error is not that bad, but as the classes are highly imbalanced, the balanced accuracy is only around 56%.

As the prediction depends on the chosen boundary, one can vary it to find the best prediction. Figure 5.1 gives the misclassification errors of each group. The classification of the “moving” class (Class1) improves when the boundary is lowered towards 0, but at the same time the error in the other class increases strongly. Overall the balanced error stays nearly the same. The best classification according to the balanced error/accuracy occurs at the boundary of 0.05. Table 5.3 shows the result of this prediction.

                 predicted
observed           0          1
0            1169048          0
1              33709       4915

Table 5.2: Classification of the logistic regression for 2015

Figure 5.1: Classification error based on the boundary for moving

In this scenario the balanced accuracy is 61.07%, but unfortunately the absolute error in the class “non-movers” is more than 300 000 people. Such a result would make no sense for migration within one year in Upper Austria.

Another way to optimize the classification is to minimize the absolute error. At the boundary of 0.5, Table 5.2 already shows a small error of 33 709 people, who are predicted as non-movers although they moved. Figure 5.2 gives the absolute classification errors. As the class “non-movers” is classified perfectly in the first model prediction (Table 5.2), its error increases when the boundary is lowered. The class “movers” does not have such a big influence, as the absolute numbers of this group are low. Nevertheless, the minimal absolute error is found at the boundary of 0.38 (horizontal dashed line). Table 5.4 gives the classification. The balanced accuracy stays nearly the same; it increases by 0.15 percentage points.
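The search over boundaries described above can be sketched as follows. The data are simulated with an assumed share of roughly 4% movers; neither the covariable nor the resulting boundaries correspond to the thesis data.

```r
# Simulated, highly imbalanced data; all values are hypothetical.
set.seed(6)
n     <- 20000
x     <- rnorm(n)
moved <- rbinom(n, 1, plogis(-3.5 + x))
p_hat <- fitted(glm(moved ~ x, family = binomial))

boundaries <- seq(0.01, 0.50, by = 0.01)
errors <- t(sapply(boundaries, function(b) {
  pred <- as.integer(p_hat >= b)
  fn <- sum(pred == 0 & moved == 1)      # movers predicted as non-movers
  fp <- sum(pred == 1 & moved == 0)      # non-movers predicted as movers
  c(boundary       = b,
    absolute_error = fn + fp,
    balanced_error = (fn / sum(moved == 1) + fp / sum(moved == 0)) / 2)
}))

b_abs <- errors[which.min(errors[, "absolute_error"]), "boundary"]
b_bal <- errors[which.min(errors[, "balanced_error"]), "boundary"]
c(best_absolute = b_abs, best_balanced = b_bal)
```

With imbalanced classes the two criteria pull in opposite directions: the balanced error favours a low boundary near the share of movers, while the absolute error favours a boundary close to 0.5.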

                 predicted
observed           0          1
0             846538     322510
1              19414      19210

Table 5.3: Classification of the logistic regression for 2015 with its best boundary according to the balanced error of misclassification

Figure 5.2: Absolute classification error based on the boundary for moving

Another idea is to model the covariables grouped by the rural-urban types of the communities as seen in Figure 2.2. When a random intercept model is fitted with these groups, the fixed-effect estimates of the coefficients are the same as in the logistic regression (Table B.6). In addition, random effects are obtained, see Table 5.5. Besides the fact that the prediction is worse than that of the standard logistic regression, the computation of the random intercept model takes around 3 hours longer, and even then the model does not converge. Since the effects are the same as for the logistic regression, it is recommended to use the standard logistic regression.

               predicted
observed          0          1
0           1169036         12
1             33592       5032

Table 5.4: Classification of the logistic regression for 2015 with its lowest absolute error

Groups   Name          Variance   Std.Dev
typ.o    (Intercept)   0.3931     0.6270
typ.d    (Intercept)   0.5309     0.7286
o...origin; d...destination

Table 5.5: Random effects of a random intercept logistic regression

5.2 Random Forest

Since the logistic regression does not classify well, a Random Forest is used next. As the Random Forest only works well for balanced data, the data must be balanced [Torgo, 2010]. In this case the “non-movers“ are sampled down and the class “movers“ is multiplied (oversampled). Furthermore, one has to be aware that the origin and destination covariables are identical for “non-movers“, so one of the two sets has to be removed. An easy explanation: if one always takes the difference between these covariables, the difference is zero for those who stay within one community and non-zero for the others. The algorithm therefore achieves a perfect fit, since this difference separates the response perfectly. Although this seems good at first glance, it has some disadvantages. The variable importance is useless, since the classification obviously happens only on the difference between the origin and destination covariables. This makes any interpretation of why people migrate invalid, since the model merely states that they migrate because they migrate. Prediction is not possible either. Therefore there are three possible ways to analyse the data: one can look only at the characteristics of a person, or one can additionally include either the origin community covariables or the destination information.
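The balancing step described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy data, the helper `balance_classes` and the class size of 1000 are hypothetical and are not taken from the thesis code.

```r
set.seed(1)  # reproducible resampling

# Hypothetical toy data: roughly 3% "movers" (move = 1).
n   <- 10000
toy <- data.frame(move = rbinom(n, 1, 0.03),
                  age  = sample(15:90, n, replace = TRUE))

# Hypothetical helper: draw n_per_class rows from each class.
# A class larger than n_per_class is sampled without replacement
# (down-sampling); a smaller class is sampled with replacement
# (up-sampling).
balance_classes <- function(data, response, n_per_class) {
  idx <- lapply(split(seq_len(nrow(data)), data[[response]]), function(i) {
    i[sample.int(length(i), n_per_class, replace = length(i) < n_per_class)]
  })
  data[unlist(idx), , drop = FALSE]
}

balanced <- balance_classes(toy, "move", n_per_class = 1000)
table(toy$move)       # strongly imbalanced
table(balanced$move)  # equal class sizes
```

The balanced data frame would then be passed to the classifier, while the prediction is still evaluated on the unbalanced data of the following year.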

At first it is shown that unbalanced data are not predicted well. The results can be seen in Table 5.6. All observations are assigned to one class (“non-movers“), so nobody would migrate in the next year. As mentioned before, the data must be balanced. In each group around 200 000 observations are used for the further analyses. Table 5.7 compares the prediction results with the real data. Compared to the logistic regression, the balanced accuracy has improved considerably. Except for the model without community information, the prediction for the “movers“ is very good. However, the absolute error is still too high. Therefore another approach is to split the data into age groups. The first group covers the ages from 15 up to 17, with the idea that these persons have done their 9 year school

               predicted
observed          0          1
0           1169048          0
1             38624          0

Table 5.6: Classification with unbalanced data in Random Forest

                     origin                destination           0
                     predicted             predicted             predicted
observed             0          1          0          1          0          1
0                    1061293    107755     1058172    110876     1121079    47969
1                    3959       34665      3335       35289      33916      4708
Balanced Accuracy    90.27%                90.94%                54.04%

Table 5.7: Classification with balanced data in Random Forest for migration in 2015

education and are looking for other possibilities like high school, traineeships or even work. The second group includes those who have finished school and are moving to places of education like universities. Their age is from 18 up to 26. The third group is the working population, aged 27 up to 59, and the last group includes the rest (60 years and up). Regarding the last group, the idea is that people who moved into larger communities for work may move back into rural areas to enjoy life. Table 5.8 gives the prediction for migration in 2015 for each age group. As there are three possible models, the model with the best balanced accuracy is shown. The other tables can be found in the Appendix (Tables B.7 and B.8). According to the balanced accuracy, each group is predicted worse than with all data. In the smallest group (15-17 years) the number of false negatives is, relative to the group size, lower than in the classification of all values. The prediction of the true negatives is worse in all groups than in the model with all data. A further separation of the age groups 18-26 and 27-59 into smaller subgroups might bring

Age         15-17             18-26              27-59              60+
            predicted         predicted          predicted          predicted
observed    0        1        0         1        0         1        0         1
0           42346    2515     121559    21056    579840    63865    299881    37986
1           187      806      1174      10845    2481      19822    349       2960
Bal.Acc.    87.78%            87.73%             89.48%             89.11%
Note: Bal.Acc...Balanced Accuracy

Table 5.8: Random Forest Classification grouped by age for migration 2015 (personal characteristics + destination information)

Figure 5.3: Probability of a female 16 year old person for migration in 2015

a better classification.

As the prediction did not improve after splitting into age groups, the prediction model based on all observations with information about personal characteristics and destination communities is used. The aim of the work described in this chapter is to estimate the probability for migration under some specific conditions. For example, the probability that a female 16 year old person migrates to another community can be estimated. Figure 5.3 shows the values in a box plot. The mean probability for migration is around 14%. Around 75% of the female 16 year old persons have a chance of at most 18% for migration. The upper whisker bound is at 40%. 7715 people of around 16 000 are counted as outliers and have a chance of over 50% for migration into a different community.
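The quantities read off such a box plot (upper quartile, upper whisker bound, outliers) can be obtained numerically with `boxplot.stats()` from base R. The sketch below uses simulated probabilities as a stand-in; the vector `p_hat` is hypothetical and does not reproduce the values of Figure 5.3.

```r
set.seed(1)

# Hypothetical stand-in for the predicted migration probabilities
# of one subgroup (right-skewed values between 0 and 1).
p_hat <- rbeta(1000, 1, 6)

# boxplot.stats() returns the five box plot statistics in $stats
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
# and the points outside the whiskers in $out.
bs <- boxplot.stats(p_hat)

summary_vals <- c(mean          = mean(p_hat),
                  upper_hinge   = bs$stats[4],
                  upper_whisker = bs$stats[5],
                  n_outliers    = length(bs$out))
print(round(summary_vals, 3))
```

Applied to the predicted probabilities of the Random Forest, the same call would return the values that are described in the paragraph above.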

5.3 Conclusions

The estimation of the probability for migration into another community turned out to be difficult. For binary data, logistic regression is the typical first option. The significant covariables make sense regarding migration, but in the predicted classification the model failed quite seriously: the balanced accuracy was only 56%. To fix this low accuracy, the boundary for splitting the predicted probability (default: >50% = “movers“, ≤50% = “non-movers“) has to be lowered, but at the same time the absolute misclassification errors increase, while the balanced accuracy increases only slightly. The Random Forest produced the better classification, with a balanced accuracy of around 90% using all observations. Although only information on either the destination or the origin is in the model, the movers (true positives) are predicted very well. Unfortunately the number of false positives increased too. To optimize the outcome, the data were split into age groups, but this did not yield better results.

Chapter 6 Conclusions

The first part of the thesis investigates the reasons for migration and predicts how many persons will be moving into other communities. Although models such as the Gravity Model of Migration have already been established, they do not always fit the data well. An analysis with simple models on absolute values does not work well: the structure of the data (many small values and few very high values) is unsuitable for most models. Therefore the data were transformed from absolute to relative values by dividing them (the number of people migrating from one community to another) by the population of the origin community. In this setting the best-fitting model was a zero inflated beta regression, which fitted and predicted the data quite well. The only problem occurred when the number of migrations was transformed back to absolute values. Bigger communities like Linz, Wels, Leonding, Traun, and Steyr show a bigger error than smaller communities. When the data are aggregated further into migration into/off a certain community, these communities have considerably higher prediction errors than all other communities. The same holds for the migration balance, i.e. the difference between the “migration in“ and “migration off“ values. Apart from these errors, the overall number of migrations for 2015 is predicted well: the model under-estimates the value by only 57 people. The total absolute prediction error is 45 612 people.
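The transformation to relative values, the back-transformation and the two error sums can be illustrated on a very small example. The sketch below uses hypothetical flows between three communities and hypothetical predictions; none of the numbers come from the thesis data.

```r
# Hypothetical toy flows between three communities (long format).
flows <- data.frame(
  origin     = c("A", "A", "B", "B", "C", "C"),
  dest       = c("B", "C", "A", "C", "A", "B"),
  migration  = c(30, 0, 12, 3, 0, 5),
  pop_origin = c(2000, 2000, 500, 500, 100, 100)
)

# Relative migration: movers divided by the population of the origin.
flows$rel <- flows$migration / flows$pop_origin

# Hypothetical model predictions on the relative scale,
# transformed back to absolute numbers of persons.
flows$pred_rel <- c(0.0140, 0.0010, 0.0250, 0.0050, 0.0100, 0.0400)
flows$pred     <- flows$pred_rel * flows$pop_origin
flows$diff     <- flows$pred - flows$migration

total_error     <- sum(flows$diff)       # over- and under-estimates cancel
total_abs_error <- sum(abs(flows$diff))  # every deviation counts
c(total = total_error, absolute = total_abs_error)
```

In this toy case the signed errors cancel completely while the absolute errors add up to 7 persons. The same mechanism explains why a small net error and a much larger total absolute error can occur together.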

In the second part of the thesis the aim was to predict the probability for migration. In a first approach the analysis was started with a logistic regression. Despite the bad classification results, the expected reasons for moving into a different community could be seen in the model. As expected, a higher education level increases the probability of migration into a different community. Older people move with a lower probability, and male persons move more frequently than females. The classification with a random intercept model did not provide a better prediction. Therefore a different classification method was used: the Random Forest machine learning algorithm. The main problem with this algorithm is that the origin and destination values cannot be used together; only the information from either the destination community or the origin community could be included. Furthermore, imbalanced data are unsuitable for this algorithm, and only around 3% of the population of Upper Austria is moving. After balancing the data, the balanced accuracy of the prediction increased considerably in comparison to the logistic regression. The class “movers“ could be predicted very well; unfortunately the misclassification of “non-movers“ was higher than expected. Grouping by age did not improve the classification in Random Forest.

Bibliography

Alberto Gonzatto Jr., O., Guedes, T., Gonçalves-Zuliani, A., and Nunes, W. Zero-inflated beta regression model for leaf citrus canker incidence in orange genotypes grafted onto different rootstocks. Acta Scientiarum - Biological Sciences, 39:161–171, 2017. doi: 10.4025/actascibiolsci.v39i2.33063.

Bates, D., Mächler, M., Bolker, B., and Walker, S. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015. doi: 10.18637/jss.v067.i01.

Breiman, L. Random forests. Machine Learning, 45(1):5–32, 2001. doi: 10.1023/A:1010933404324.

Breiman, L. and Cutler, A. Random forests. URL https://www.stat.berkeley.edu/~breiman/RandomForests/. AccessDate: 02.03.2019.

Cribari-Neto, F. and Zeileis, A. Beta regression in R. Journal of Statistical Software, 34(2): 1–24, 2010. URL http://www.jstatsoft.org/v34/i02/.

Fahrmeir, L., Kneib, T., and Lang, S. Regression. Modelle, Methoden und Anwendungen. Springer-Verlag Berlin Heidelberg, 2009.

Garcia, A., Pindolia, D., Lopiano, K., and Tatem, A. Modeling internal migration flows in sub-Saharan Africa using census microdata. Migration Studies, 3(1):89–110, 2015.

Ho, T. K. Random decision forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, pages 278–282, 1995.

Jackman, S. pscl: Classes and Methods for R Developed in the Political Science Computational Laboratory. United States Studies Centre, University of Sydney, Sydney, New South Wales, Australia, 2017. URL https://github.com/atahk/pscl/. R package version 1.5.2.

Liaw, A. and Wiener, M. Classification and regression by randomForest. R News, 2(3):18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.

Poot, J., Alimi, O., Cameron, M. P., and Maré, D. C. The gravity model of migration: The successful comeback of an ageing superstar in regional science. IZA Discussion Paper, (10329), 2016.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. URL https://www.R-project.org/.

Rigby, R. A. and Stasinopoulos, D. M. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society. Series C (Applied Statistics), 54:507–554, 2005.

Rigby, R., Stasinopoulos, M., Heller, G., and Bastiani, F. D. Distributions for Modelling Location, Scale and Shape: Using GAMLSS in R. Chapman and Hall/CRC, 2019. (Draft version of 2017 on https://www.gamlss.org).

Torgo, L. Data Mining with R, learning with case studies. Chapman and Hall/CRC, 2010. URL http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR.

Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S. Springer, New York, fourth edition, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0.

Zeileis, A., Kleiber, C., and Jackman, S. Regression models for count data in R. Journal of Statistical Software, 27(8), 2008. URL http://www.jstatsoft.org/v27/i08/.

Appendix A Appendix - Figures

Figure A.1: People in education relative to the population in 2014

Figure A.2: Gross Domestic Product (Mio €) in Upper Austria

Figure A.3: Model diagnostics for the Gravity Model of Migration

Figure A.4: Model diagnostics for Poisson regression

Figure A.5: Residual plot for geometric regression

Figure A.6: Model diagnostics for geometric regression

Figure A.7: Model diagnostics for Poisson model summing up the years 2011 to 2014

Appendix B Appendix - Tables

                       Dependent variable: migration
                       Gravity Model of Migration
                       (all observations)                 (without zeros)
Intercept              −7.744∗∗∗                          −2.581∗∗∗
                       (0.052)                            (0.070)
dist                   −1.094∗∗∗                          −0.645∗∗∗
                       (0.006)                            (0.009)
Pop.o                  0.356∗∗∗                           0.317∗∗∗
                       (0.004)                            (0.006)
Pop.d                  0.396∗∗∗                           0.328∗∗∗
                       (0.004)                            (0.006)
Observations           194,922                            10,594
R2                     0.226                              0.396
Adjusted R2            0.226                              0.396
Residual Std. Error    1.541 (df = 194918)                0.725 (df = 10590)
F Statistic            18,966.000∗∗∗ (df = 3; 194918)     2,319.000∗∗∗ (df = 3; 10590)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
o ... origin; d ... destination

Table B.1: Regression results of Gravity Model of Migration
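To show how the coefficients of Table B.1 translate into a predicted flow, the following sketch evaluates the log-log gravity equation for one pair of communities. The coefficients are taken from the column "without zeros"; the helper `gravity_pred`, the distances and the population sizes are hypothetical, and the unit of `dist` is not stated in this section. Back-transforming with `exp()` ignores the retransformation bias of a log-linear model, so the result should be read as a rough typical value only.

```r
# Coefficients of the "without zeros" column of Table B.1
# (model: log(migration) ~ log(dist) + log(Pop.o) + log(Pop.d)).
b <- c(intercept = -2.581, dist = -0.645, pop_o = 0.317, pop_d = 0.328)

# Hypothetical helper: back-transformed prediction of the gravity model.
gravity_pred <- function(dist, pop_o, pop_d, coef = b) {
  unname(exp(coef["intercept"] +
             coef["dist"]  * log(dist) +
             coef["pop_o"] * log(pop_o) +
             coef["pop_d"] * log(pop_d)))
}

# Hypothetical pair of communities at two different distances.
near <- gravity_pred(dist = 10, pop_o = 5000, pop_d = 20000)
far  <- gravity_pred(dist = 20, pop_o = 5000, pop_d = 20000)

# Doubling the distance multiplies the prediction by 2^(-0.645).
c(near = near, far = far, ratio = far / near)
```

Because the model is linear in the logarithms, the distance coefficient acts as an elasticity: doubling the distance reduces the predicted flow to about 64% of its value, independent of the population sizes.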

                     Dependent variable: migration
                     Poisson Model
                     (all observations)    (without zeros)
Intercept            1.522∗∗∗              1.978∗∗∗
                     (0.009)               (0.008)
dist                 −0.098∗∗∗             −0.048∗∗∗
                     (0.0004)              (0.0004)
Pop.o                0.00002∗∗∗            0.00001∗∗∗
                     (0.00000)             (0.00000)
Pop.d                0.00002∗∗∗            0.00001∗∗∗
                     (0.00000)             (0.00000)
Observations         194,922               10,594
Log Likelihood       −78,159.300           −38,083.200
Akaike Inf. Crit.    156,326.600           76,174.400
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
o ... origin; d ... destination

Table B.2: Regression results of the Poisson model

                     Dependent variable: migration
                     Geometric Model
                     (all observations)    (without zeros)
Intercept            0.993∗∗∗              1.578∗∗∗
                     (0.157)               (0.021)
dist                 −0.090∗∗∗             −0.025∗∗∗
                     (0.005)               (0.001)
Pop.o                0.00004∗∗∗            0.00001∗∗∗
                     (0.00000)             (0.00000)
Pop.d                0.00004∗∗∗            0.00001∗∗∗
                     (0.00000)             (0.00000)
Observations         194,922               10,594
Log Likelihood       −52,152.500           −24,544.440
Akaike Inf. Crit.    104,313.000           49,096.890
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
o ... origin; d ... destination

Table B.3: Regression results of the geometric model

from                          to                            balance
ComNr   error   rel_error     ComNr   error   rel_error     ComNr   error   rel_error
40101    1208   (0.61%)       40101    3257   (1.65%)       40101    2049   (1.04%)
40907     108   (1.67%)       40301    1030   (1.72%)       41012    1612   (6.01%)
41022      96   (1.64%)       41012     985   (3.68%)       40301    1136   (1.90%)
41618      95   (2.15%)       41002     232   (1.46%)       41002     394   (2.49%)
41609      93   (1.87%)       41823     177   (3.24%)       41021     391   (1.64%)
40404    -173   (-1.09%)      41713    -125   (-2.50%)      41710    -103   (-2.81%)
40201    -200   (-0.52%)      41422    -154   (-3.15%)      41820    -120   (-2.41%)
41746    -263   (-2.21%)      41743    -163   (-2.72%)      40705    -124   (-0.95%)
41021    -341   (-1.43%)      41746    -180   (-1.51%)      41739    -130   (-2.39%)
41012    -627   (-2.33%)      41225    -207   (-1.81%)      41743    -160   (-2.66%)
Note: ComNr... Community Number; rel_error...errors in relation to its population

Table B.4: Top 5 highest/lowest errors for from/to migration and balance between predicted and true values of 2015

from                          to                            balance
ComNr   rel_error   error     ComNr   rel_error   error     ComNr   rel_error   error
41616    4.08       (22)      41809   12.74       (101)     41809   19.21       (153)
41015    3.91       (77)      41412    6.65       (20)      41412    8.98       (27)
41515    3.59       (44)      41613    4.17       (85)      41012    6.01       (1612)
41336    3.41       (22)      41012    3.68       (985)     40916    5.40       (18)
40820    3.24       (17)      41724    3.52       (21)      41613    4.38       (89)
41222   -4.54       (-30)     41219   -3.55       (-54)     41623   -4.29       (-39)
41806   -4.68       (-95)     41121   -3.83       (-32)     41608   -4.30       (-40)
41219   -5.02       (-76)     40439   -4.02       (-24)     40830   -4.43       (-40)
41425   -5.02       (-71)     41333   -4.06       (-21)     41001   -4.93       (-56)
41809   -6.47       (-52)     40818   -4.14       (-59)     41616   -5.44       (-29)
Note: ComNr... Community Number; rel_error...errors in relation to its population

Table B.5: Top 5 highest/lowest errors in relation to its population for from/to migration and balance between predicted and true values of 2015

                     Dependent variable: migration
Intercept            −1.059∗∗∗
                     (0.058)
diffWork.o           −0.002∗∗∗         diffWork.d    0.002∗∗∗
                     (0.0002)                        (0.0002)
freeWork.o           −0.010∗∗∗         freeWork.d    0.013∗∗∗
                     (0.001)                         (0.001)
care.o1              0.582∗∗∗          care.d1       −0.507∗∗∗
                     (0.150)                         (0.150)
comSize.o            −0.00001∗∗∗       comSize.d     −0.00001∗∗∗
                     (0.00000)                       (0.00000)
comPrice.o           −0.004∗∗∗         comPrice.d    0.006∗∗∗
                     (0.0003)                        (0.0003)
gdp.o                0.001∗∗∗          gdp.d         −0.001∗∗∗
                     (0.0001)                        (0.0001)
IsWork1              0.153∗∗∗
                     (0.014)
highEdu2             0.012
                     (0.014)
highEdu3             0.238∗∗∗
                     (0.019)
age                  −0.049∗∗∗
                     (0.0004)
isNation1            −0.385∗∗∗
                     (0.016)
sex1                 0.125∗∗∗
                     (0.011)
isEdu1               −0.910∗∗∗
                     (0.020)
Observations         1,199,557
Log Likelihood       −156,541.700
Akaike Inf. Crit.    313,123.500
Note: o...origin; d...destination
∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Table B.6: Regression results for the logistic model for migration on personal level
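The coefficients of Table B.6 are on the log-odds scale. A short sketch of how they can be turned into odds ratios is given below; it uses a few estimates copied from the table. The coding of the dummy levels (for example which sex is coded as 1) is not shown in this section and is therefore not interpreted here.

```r
# Selected estimates from Table B.6 (log-odds scale).
coefs <- c(age = -0.049, sex1 = 0.125, highEdu3 = 0.238,
           isNation1 = -0.385, isEdu1 = -0.910)

# Odds ratio per unit change of each covariable.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Effect of ten additional years of age on the odds of moving.
or_age_10 <- unname(exp(10 * coefs["age"]))
or_age_10
```

With all other covariables fixed, ten additional years of age multiply the odds of moving by exp(-0.49), which is roughly 0.61. This matches the statement in the text that older people move with a lower probability.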

Age         15-17             18-26              27-59              60+
            predicted         predicted          predicted          predicted
observed    0        1        0         1        0         1        0         1
0           40401    4460     102738    39877    571355    72350    287816    50051
1           159      834      916       11103    2365      19938    326       2983
Bal.Acc.    87.02%            82.21%             89.08%             87.67%
Note: Bal.Acc...Balanced Accuracy

Table B.7: Random Forest Classification grouped by age for migration 2015 (personal characteristics + origin information)

Age         15-17             18-26              27-59              60+
            predicted         predicted          predicted          predicted
observed    0        1        0         1        0         1        0         1
0           38290    6571     136415    6200     607300    36405    325247    12620
1           670      323      11418     601      19376     2927     3081      228
Bal.Acc.    58.94%            50.32%             53.73%             51.57%
Note: Bal.Acc...Balanced Accuracy

Table B.8: Random Forest Classification grouped by age for migration 2015 (only personal characteristics)

Appendix C Appendix - Code

All analyses of this thesis are done with R on Windows 7:

R version 3.5.2 (2018-12-20) – "Eggshell Igloo" Copyright (C) 2018 The R Foundation for Statistical Computing Platform: x86_64-w64-mingw32/x64 (64-bit)

Processor: Intel(R) Core(TM) i5 CPU 760@ 2.80GHz Memory: 16 GB

C.1 Maps

For creating the maps the packages “sp“ and “RColorBrewer“ are used. Furthermore, some adjustments to the map data have to be made:

library(RColorBrewer)
library(sp)
rds <- readRDS("AUT_adm3.rds")
isTrue <- rds$NAME_1 == "Oberösterreich"
isTrue[313] <- TRUE  # adding Grein
isTrue[325] <- TRUE  # adding St. Ulrich bei Steyr
ooe <- rds[isTrue, ]
ooe$NAME_1[1:2] <- "Oberösterreich"
ooe$NAME_2[1:2] <- c("Perg", "Steyr Land")
# NAME_3 is overwritten with the community numbers
ooe$NAME_3 <- c(41105,41514,40401,40445,40403:40438,40440:40442,40439,
  40443,40444,40446,40501,40503:40510,40826,40512,40627,40601:40605,
  40608,40606,40607,40609:40626,40701:40717,40719,40718,40720,
  40801:40807,41206,40808:40818,40820,40819,40821:40823,40825,41421,
  40827,40824,40828:40831,40833,40834,40901:40906,40908:40913,40915,
  40914,40917,40918,40916,40919:40923,41001:41005,41007,41006,41008,
  41809,41009:41012,41014:41019,41013,41020,41021,41824,41022,40101,
  41101,41102,41108,41103,41104,41106,41107,41109,41110,41113,41111,
  41112,41114:41119,41121:41124,41120,41125,41126,41201,40402,
  41203:41205,41207,41208:41213,41216,41217,41214,41215,41218:41225,
  41227:41230,41226,41231:41236,41301,41302,41343,41304,41305,40502,41306,
  41307,41311,41309,41310,41312:41321,41329,41322:41328,41344,
  41331:41338,41340:41342,41401,41402,41202,41403:41411,41413,41412,
  41414:41420,41422:41430,41501:41507,40907,41508:41513,41515:41518,41521,
  41522,40201,41601:41620,41622,41623,41621,41624:41627,41701:41722,
  41726,41728,41723:41725,41727,41732,41729:41731,41733:41735,41737,
  41736,41738:41752,41801:41808,41810:41817,40511,41818:41823,40832,
  40301)
# the comparison with 40502 only works once NAME_3 holds the community numbers
ooe[which(ooe$NAME_3 == 40502), "NAME_2"] <- "Eferding"

To make the creation of maps easier, a wrapper function was created:

createMap <- function(groups, name, col, colorRange = 30, legend = TRUE, ...) {
  # reorder the values so that they match the community order of the map
  ooe$col_no <- groups[order(match(name, ooe$NAME_3))]
  spplot(ooe, "col_no", col.regions = colorRampPalette(col)(colorRange),
         colorkey = legend, ...)
}

Used Datasets

# Absolute number of population
load("pop.RData")
pop14 <- pop14[-1]
pop11 <- pop11[-1]

# Data on personal level
load("pers14_15.RData")
dat14 <- dat14[complete.cases(dat14), ]

# Data on community level
source("Datenaufbau_Modell_Eins.R")

# Communities split into rural/urban areas
source("rural_model.R")

Population

pop14["40101"] <- 40000  # Linz (adjustment for better graphic)
pop14["40301"] <- 40000  # Wels
createMap(pop14, as.integer(names(pop14)), col = c("lightblue", "red"), legend = T)

Movements: 2011 - 2014

popdiff <- pop14 - pop11
popdiff["40101"] <- 1000  # Linz
popdiff["40301"] <- 1000  # Wels
popdiff["41012"] <- 1000  # Leonding
# head(sort(popdiff, F))
x = 165
rc2 = colorRampPalette(colors = c("white", "red"), space = "Lab")(x)
rc1 = colorRampPalette(colors = c("lightblue", "white"), space = "Lab")(188 - x)
rampcols = c(rc1, rc2)
rampcols[c(x - 1, x)] = rgb(t(col2rgb("white")), maxColorValue = 256)

createMap(popdiff, as.integer(names(popdiff)), col = rampcols, colorRange = 188, legend = T)

# Movements in relation to the population (reload data!)
rel_popdiff <- popdiff / pop14
x = 270
rc2 = colorRampPalette(colors = c("white", "red"), space = "Lab")(x)
rc1 = colorRampPalette(colors = c("lightblue", "white"), space = "Lab")(439 - x)
rampcols = c(rc1, rc2)
rampcols[c(x - 1, x)] = rgb(t(col2rgb("white")), maxColorValue = 256)
createMap(rel_popdiff, as.integer(names(rel_popdiff)), col = rampcols, colorRange = 188, legend = T)

isWork

work <- aggregate(dat14$IsWork, list(dat14$dest), table)
work$y <- work$x / rowSums(work$x)
{
  x = 315
  rc2 = colorRampPalette(colors = c("white", "blue"), space = "Lab")(x)
  rc1 = colorRampPalette(colors = c("orange", "white"), space = "Lab")(442 - x)
  rampcols = c(rc1, rc2)
  rampcols[c(x, x + 1)] = rgb(t(col2rgb("white")), maxColorValue = 256)
}
createMap(work$y[, 2], work$Group.1, col = rampcols, colorRange = 442, legend = T)

isEducation

isEdu <- aggregate(dat14$isEdu, list(dat14$dest), table)
isEdu$y <- isEdu$x / rowSums(isEdu$x)
# absolute numbers
createMap(isEdu$x[, 2], isEdu$Group.1, col = c("orange", "blue"), legend = T)
# relative to the population of the community
createMap(isEdu$y[, 2], isEdu$Group.1, col = c("orange", "blue"), legend = T)

Age

age <- aggregate(dat14$age, list(dat14$dest), median)
createMap(age$x, age$Group.1, col = c("orange", "blue"), legend = T)

Highest Education

edu <- aggregate(dat14$highEdu, list(dat14$dest), table)
edu$y <- edu$x / rowSums(edu$x)
createMap(edu$y[, 1], edu$Group.1, col = c("orange", "blue"), legend = T)
createMap(edu$y[, 2], edu$Group.1, col = c("orange", "blue"), legend = T)
createMap(edu$y[, 3], edu$Group.1, col = c("orange", "blue"), legend = T)
createMap(edu$y[, 4], edu$Group.1, col = c("orange", "blue"), legend = T)

Citizenship

nat <- aggregate(dat14$isNation, list(dat14$dest), table)
nat$y <- nat$x / rowSums(nat$x)
createMap(nat$y[, 1], nat$Group.1, col = c("orange", "blue"), legend = T)

Childcare

tmp <- allComb[allComb$nach == "40101", ]
createMap(tmp$kindereinrichtungen.x, tmp$von, col = c("white", "red"), colorRange = 2, legend = T)

Rural/Urban areas

breakpoints <- unique(rur$UR_TYP)
createMap(rur$UR_TYP, rur$GKZ, col = c("blue", "orange"), legend = F, ... = breakpoints)

Gross Domestic Product

tmp <- allComb[allComb$nach == "40101", ]
tmp$BRP.x[tmp$von == "40101"] <- 2000
tmp$BRP.x[tmp$von == "40301"] <- 2000
createMap(tmp$BRP.x, tmp$von, col = c("orange", "blue"), legend = T)

Diversity of Jobs

tmp <- allComb[allComb$nach == "40101", ]
createMap(tmp$Arbeit.x, tmp$von, col = c("orange", "blue"), legend = T)

Free Jobs

tmp <- allComb[allComb$nach == "40101", ]
relJobs <- tmp$freieArbeit.x / tmp$Pop.x
createMap(tmp$freieArbeit.x, tmp$von, col = c("orange", "blue"), legend = T)
createMap(relJobs, tmp$von, col = c("orange", "blue"), legend = T)

Free Building Area

tmp <- allComb[allComb$nach == "40101", ]
createMap(tmp$GEM_UMFANG.x, tmp$von, col = c("orange", "blue"), legend = T)

Price for Building Areas

tmp <- allComb[allComb$nach == "40101", ]
createMap(tmp$comprice.x, tmp$von, col = c("orange", "blue"), legend = T)

C.2 Cross Section Analysis

In this part the code of Section 4 is provided:

Used Datasets and Packages

source("Datenaufbau_Modell_Eins.R")
# response variable: migration
# dataset: allComb
# .x ... destination covariable
# .y ... origin covariable

# No migration (removing same communities)
allComb <- allComb[-which(allComb$von == allComb$nach), ]
allComb15 <- allComb15[-which(allComb15$von == allComb15$nach), ]

library(pscl)    # Zero Inflated Poisson Regression
library(MASS)    # geometric regression
library(plyr)    # Summary for more variables
library(tidyr)   # Transform vector to matrix
library(gamlss)  # Zero Inflated Beta Regression
library(dplyr)
library(lme4)    # Mixed Models

Histogram of the data

# Histogram of the data
hist(allComb$migration, breaks = 100, xlim = c(0, 500), ylim = c(0, 1000),
     xlab = "Numbers of resettlers", right = F, main = "")
allComb$relmig <- allComb$migration / allComb$Pop.y
hist(allComb$relmig, breaks = 1000, ylim = c(0, 1000),
     xlab = "Numbers of resettlers relative to their destination population",
     right = F, main = "")

Gravity Model of Migration

# Gravity Model of Migration
grav <- allComb[, c("dist", "migration", "Pop.x", "Pop.y")]
# (alternative to removing the zeros below)
# grav$migration[grav$migration == 0] <- 0.001
grav <- grav[-which(grav$migration == 0), ]
l_grav <- log(grav)
m_grav1 <- lm(migration ~ ., data = l_grav)
m_grav0 <- lm(migration ~ ., data = l_grav)

summary(m_grav0)
par(mfrow = c(2, 2))
plot(m_grav0)

Poisson Regression

# Poisson Regression
pois <- allComb[, c("dist", "migration", "Pop.x", "Pop.y")]
m_pois <- glm(migration ~ ., data = pois, family = poisson(link = "log"))
summary(m_pois)
par(mfrow = c(2, 2))
plot(m_pois)
pois0 <- pois[-which(pois$migration == 0), ]
m_pois0 <- glm(migration ~ ., data = pois0, family = poisson(link = "log"))
par(mfrow = c(1, 1))
plot(m_pois0, ylim = c(-5, 20), which = 1)
plot(predict(m_pois0), resid(m_pois0), ylim = c(-10, 25),
     ylab = "Residuals", xlab = "Fitted values", main = "Residuals vs Fitted")

Geometric Regression

# Geometric Regression
geo <- allComb[, c("dist", "Pop.x", "Pop.y", "migration")]
m_geo <- glm(migration ~ ., data = geo, family = negative.binomial(theta = 1))
par(mfrow = c(2, 2))
plot(m_geo)
plot(m_geo, which = 1)

Zero Inflated Poisson Regression

#Zero Inflated Poisson Regression

zinp <- allComb[, c("dist", "migration")]
zinp15 <- allComb15[, c("dist", "migration")]
# zinp <- allComb[, c(3:16, 19, 20)]
# destination/origin variables:
# Error in solve.default(as.matrix(fit$hessian)) :
# system is computationally singular

m_zinp <- zeroinfl(migration ~ . | 1, data = zinp, dist = "poisson")
plot(log(predict(m_zinp)), residuals(m_zinp, type = "response"),
     ylab = "Residuals", xlab = "Predicted values", main = "Residuals vs Fitted")

Model with 2011-2014 summed up

source("Model11_14.R")
mod <- zeroinfl(migration ~ ., data = allComb[, 3:4], dist = "poisson")
plot(log(predict(mod)), residuals(mod, type = "response"),
     ylab = "Residuals", xlab = "Predicted values", main = "Residuals vs Fitted")

mod2 <- glm(migration ~ ., data = allComb[, 3:6], family = poisson(link = "log"))
plot(mod2, which = 1)

Mixed Models

# random intercepts for the rural/urban type of destination (.x) and origin (.y)
m_me <- glmer(migration ~ dist + Pop.x + Pop.y + (1 | UR_TYP.x) + (1 | UR_TYP.y),
              data = allComb, family = poisson(link = "log"))

Zero Inflated Beta Regression

# Zero Inflated Beta
source("Datenaufbau_Modell_Eins.R")

# Adding numbers of migration of the previous year for better performance
load("migration11_14.RData")
mig <- cbind(mig11, mig12, mig13, mig14)
mig <- mig[, c(1, 2, 3, 6, 9, 12)]
names(mig) <- c("von", "nach", "mig11", "mig12", "mig13", "mig14")
allComb <- merge(allComb, mig, by.x = c("von", "nach"), by.y = c("von", "nach"))
allComb15 <- merge(allComb15, mig, by.x = c("von", "nach"), by.y = c("von", "nach"))
rm(mig, mig11, mig12, mig13, mig14)

# x ... destination covariable
# y ... origin covariable
# Scaling data
{
allComb$Arbeit.x <- allComb$Arbeit.x / allComb$Pop.x  # *1000
allComb$Arbeit.y <- allComb$Arbeit.y / allComb$Pop.y  # *1000
allComb$freieArbeit.x <- allComb$freieArbeit.x / allComb$Pop.x
allComb$freieArbeit.y <- allComb$freieArbeit.y / allComb$Pop.y
# allComb$Pop.x <- allComb$Pop.x / 1000
# allComb$Pop.y <- allComb$Pop.y / 1000
allComb$UR_TYP.x <- as.factor(allComb$UR_TYP.x)
allComb$UR_TYP.y <- as.factor(allComb$UR_TYP.y)

allComb15$Arbeit.x <- allComb15$Arbeit.x / allComb15$Pop.x  # *1000
allComb15$Arbeit.y <- allComb15$Arbeit.y / allComb15$Pop.y  # *1000
allComb15$freieArbeit.x <- allComb15$freieArbeit.x / allComb15$Pop.x
allComb15$freieArbeit.y <- allComb15$freieArbeit.y / allComb15$Pop.y
# allComb15$Pop.x <- allComb15$Pop.x / 1000
# allComb15$Pop.y <- allComb15$Pop.y / 1000
allComb15$UR_TYP.x <- as.factor(allComb15$UR_TYP.x)
allComb15$UR_TYP.y <- as.factor(allComb15$UR_TYP.y)

# No migration (removing same communities)
allComb <- allComb[-which(allComb$von == allComb$nach), ]
allComb15 <- allComb15[-which(allComb15$von == allComb15$nach), ]

# Numbers of migration in relation to the population
allComb$migration <- allComb$migration / allComb$Pop.y
allComb$mig13 <- allComb$mig13 / allComb$Pop.y
allComb15$migration <- allComb15$migration / allComb15$Pop.y
allComb15$mig14 <- allComb15$mig14 / allComb15$Pop.y

# mod <- gamlss(formula = migration ~ ., nu.formula = migration ~ .,
#               data = allComb[, c(4, 6:16, 19, 20, 23)],
#               family = BEINF0, weights = allComb$Pop.y,
#               control = gamlss.control(n.cyc = 100))
# fit <- stepGAIC(mod, direction = "both")
# fitnu <- stepGAIC(fit, direction = "both", what = "nu")
}
base_mod <- gamlss(formula = migration ~
                     dist + Pop.x + Arbeit.x + Arbeit.y + freieArbeit.x + freieArbeit.y +
                     kindereinrichtungen.x + kindereinrichtungen.y +
                     GEM_UMFANG.x + GEM_UMFANG.y + comprice.x + comprice.y + mig13,
                   nu.formula = migration ~
                     dist + Pop.x + Arbeit.x + Arbeit.y + freieArbeit.x + freieArbeit.y +
                     kindereinrichtungen.x + kindereinrichtungen.y +
                     GEM_UMFANG.x + GEM_UMFANG.y + comprice.x + comprice.y + mig13,
                   family = BEINF0, data = allComb, weights = allComb$Pop.y,
                   control = gamlss.control(n.cyc = 100))

sigma <- coef(base_mod, what = "sigma")
mu <- coef(base_mod, what = "mu")
nu <- coef(base_mod, what = "nu")

newData = allComb15[, c(3:20, 24)]
# mig14 takes the role of the previous year's migration
names(newData)[length(names(newData))] <- "mig13"

pre_mu <- predict(base_mod, newdata = newData[, -2], type = "response", what = "mu")
pre_nu <- predict(base_mod, newdata = newData[, -2], type = "response", what = "nu")
# expected relative migration of the zero inflated beta model
pre <- pre_mu / (1 + pre_nu)
# back-transformation to absolute numbers
real_pre <- pre * allComb15$Pop.y
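The expression `pre_mu/(1+pre_nu)` is the mean of a zero inflated beta variable, assuming the parametrisation documented for `BEINF0` in gamlss, where the probability of a zero is nu/(1+nu) and mu is the mean of the beta part. The sketch below checks this relation by simulation in base R. The parameter values are hypothetical, and the beta part is simulated with a precision parameter `phi`, which is a simplification and not the sigma parametrisation used by gamlss.

```r
set.seed(1)

mu  <- 0.02  # hypothetical mean of the beta part
nu  <- 4     # hypothetical odds of a zero: P(Y = 0) = nu / (1 + nu)
phi <- 50    # hypothetical precision: shape1 = mu * phi, shape2 = (1 - mu) * phi
n   <- 200000

# Zero inflated beta draws: a zero with probability nu / (1 + nu),
# otherwise a beta distributed value with mean mu.
is_zero <- rbinom(n, 1, nu / (1 + nu)) == 1
y <- ifelse(is_zero, 0, rbeta(n, mu * phi, (1 - mu) * phi))

# The simulated mean should be close to mu / (1 + nu).
c(simulated = mean(y), formula = mu / (1 + nu))
```

The share of non-zero observations is 1/(1+nu), so the overall mean is the beta mean scaled by this share.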

pre_data <- data.frame("origin" = allComb15$von, "dest" = allComb15$nach,
                       "pre" = real_pre,
                       "true" = allComb15$migration * allComb15$Pop.y,
                       "Diff" = real_pre - allComb15$migration * allComb15$Pop.y,
                       "pre_rel" = pre, "true_rel" = allComb15$migration)

# ten largest absolute prediction errors
outli <- pre_data[abs(pre_data$Diff) %in% head(sort(abs(pre_data$Diff), T), 10),
                  c("origin", "dest", "Diff")]
outli[order(outli$Diff), ]
sum(pre_data$Diff)
sum(abs(pre_data$Diff))

# Transformation to migration matrix -> Balance/Number of mig. in/out.
pre_data <- data.frame("origin" = allComb15$von, "dest" = allComb15$nach,
                       "pre" = real_pre,
                       "true" = allComb15$migration * allComb15$Pop.y,
                       "Diff" = real_pre - allComb15$migration * allComb15$Pop.y)

# long format (origin, dest, value) -> matrix with origins in rows
backToMatrix <- function(data) {
  mig <- spread(data, names(data)[2], names(data)[3])
  rownames(mig) <- mig[, 1]
  mig <- mig[, -1]
  return(mig)
}

pre_matrix <- backToMatrix(pre_data[, c(1, 2, 3)])
true_matrix <- backToMatrix(pre_data[, c(1, 2, 4)])

preIn <- colSums(pre_matrix, na.rm = T)
trueIn <- colSums(true_matrix, na.rm = T)

preOff <- rowSums(pre_matrix, na.rm = T)
trueOff <- rowSums(true_matrix, na.rm = T)

preSaldo <- preIn - preOff
trueSaldo <- trueIn - trueOff

# error:
pop <- allComb15[1:442, c("nach", "Pop.y")]
pop <- pop[match(names(preIn), pop$nach), ]

error <- data.frame("In" = preIn - trueIn,
                    "Off" = preOff - trueOff,
                    "balance" = preSaldo - trueSaldo)
error$rel_in <- error$In / pop$Pop.y
error$rel_off <- error$Off / pop$Pop.y
error$rel_balance <- error$balance / pop$Pop.y

data.frame("Community" = rownames(error[error$In %in% head(sort(error$In, T), 5) |
                                          error$In %in% head(sort(error$In, F), 5), ]),
           "ErrorIn" = round(error[error$In %in% head(sort(error$In, T), 5) |
                                     error$In %in% head(sort(error$In, F), 5), "In"], 0),
           "ErrorIn_rel" = round(error[error$In %in% head(sort(error$In, T), 5) |
                                         error$In %in% head(sort(error$In, F), 5), "rel_in"] * 100, 4)) %>%
  arrange(desc(ErrorIn))

data.frame("Community" = rownames(error[error$Off %in% head(sort(error$Off, T), 5) |
                                          error$Off %in% head(sort(error$Off, F), 5), ]),
           "ErrorOff" = round(error[error$Off %in% head(sort(error$Off, T), 5) |
                                      error$Off %in% head(sort(error$Off, F), 5), "Off"], 0),
           "ErrorOff_rel" = round(error[error$Off %in% head(sort(error$Off, T), 5) |
                                          error$Off %in% head(sort(error$Off, F), 5), "rel_off"], 4)) %>%
  arrange(desc(ErrorOff))

data.frame("Community" = rownames(error[error$balance %in% head(sort(error$balance, T), 5) |
                                          error$balance %in% head(sort(error$balance, F), 5), ]),
           "ErrorBal" = round(error[error$balance %in% head(sort(error$balance, T), 5) |
                                      error$balance %in% head(sort(error$balance, F), 5), "balance"], 0),
           # column holds the relative balance error (was labelled "ErrorOff_rel")
           "ErrorBal_rel" = round(error[error$balance %in% head(sort(error$balance, T), 5) |
                                          error$balance %in% head(sort(error$balance, F), 5), "rel_balance"], 4)) %>%
  arrange(desc(ErrorBal))

#---------------------------------------------------
ir <- data.frame("Community" = rownames(error[error$rel_in %in% head(sort(error$rel_in, T), 5) |
                                                error$rel_in %in% head(sort(error$rel_in, F), 5), ]),
                 "ErrorIn_rel" = round(error[error$rel_in %in% head(sort(error$rel_in, T), 5) |
                                               error$rel_in %in% head(sort(error$rel_in, F), 5), "rel_in"] * 100, 4),
                 "ErrorIn" = round(error[error$rel_in %in% head(sort(error$rel_in, T), 5) |
                                           error$rel_in %in% head(sort(error$rel_in, F), 5), "In"], 0)) %>%
  arrange(desc(ErrorIn_rel))

or <- data.frame("Community" = rownames(error[error$rel_off %in% head(sort(error$rel_off, T), 5) |
                                                error$rel_off %in% head(sort(error$rel_off, F), 5), ]),
                 "ErrorOff_rel" = round(error[error$rel_off %in% head(sort(error$rel_off, T), 5) |
                                                error$rel_off %in% head(sort(error$rel_off, F), 5), "rel_off"] * 100, 4),
                 "ErrorOff" = round(error[error$rel_off %in% head(sort(error$rel_off, T), 5) |
                                            error$rel_off %in% head(sort(error$rel_off, F), 5), "Off"], 0)) %>%
  arrange(desc(ErrorOff_rel))

br <- data.frame("Community" = rownames(error[error$rel_balance %in% head(sort(error$rel_balance, T), 5) |
                                                error$rel_balance %in% head(sort(error$rel_balance, F), 5), ]),
                 "ErrorBalance_rel" = round(error[error$rel_balance %in% head(sort(error$rel_balance, T), 5) |
                                                    error$rel_balance %in% head(sort(error$rel_balance, F), 5), "rel_balance"] * 100, 4),
                 "ErrorBalance" = round(error[error$rel_balance %in% head(sort(error$rel_balance, T), 5) |
                                                error$rel_balance %in% head(sort(error$rel_balance, F), 5), "balance"], 0)) %>%
  arrange(desc(ErrorBalance_rel))

ar <- cbind(or, ir, br)
ar
boxplot(error[, 1:3], names = c("To", "From", "Balance"))
boxplot(error[, 4:6] * 100, names = c("rel_To", "rel_From", "rel_Balance"))

summary(error[, 4:6] * 100)

C.3 Probability for migrating

Used Datasets and Packages

load("C:/Users/meins/Desktop/Masterarbeit/Neustart/pers14_15.RData")
dat15 <- dat15[complete.cases(dat15), ]
dat15$mig <- as.factor(dat15$mig)
dat15$isEdu <- as.factor(dat15$isEdu)
dat14 <- dat14[complete.cases(dat14), ]
dat14$mig <- as.factor(dat14$mig)

dat14 <- subset(dat14, age > 14)
dat15 <- subset(dat15, age > 14)

# Case 0: only personal information
dat14.0 <- dat14[, c(1, 2, 3, 14:19)]
dat15.0 <- dat15[, c(1, 2, 3, 14:19)]
# Case 1: personal + origin information
dat14.y <- dat14[, c(1:3, 5, 7, 9, 11, 13:19, 23, 25)]
dat15.y <- dat15[, c(1:3, 5, 7, 9, 11, 13:19, 23, 25)]
# Case 2: personal + destination information
dat14.x <- dat14[, c(1:3, 4, 6, 8, 10, 12, 14:19, 22, 24)]
dat15.x <- dat15[, c(1:3, 4, 6, 8, 10, 12, 14:19, 22, 24)]

dataKind14 <- subset(dat14.x, age > 14 & age < 18)
dataKind15 <- subset(dat15.x, age > 14 & age < 18)
dataJug14 <- subset(dat14.x, age > 17 & age < 27)
dataJug15 <- subset(dat15.x, age > 17 & age < 27)
dataArb14 <- subset(dat14.x, age > 26 & age < 60)
dataArb15 <- subset(dat15.x, age > 26 & age < 60)
dataPen14 <- subset(dat14.x, age > 59)
dataPen15 <- subset(dat15.x, age > 59)

library(nnet)          # Faster version for a logistic regression
library(DMwR)          # SMOTE for balancing data
library(randomForest)

Logistic Regression

# best model by AIC
# b <- glm(mig ~ ., data = dat14_15plus[, c(3:19, 22, 23)],
#          family = binomial(link = "logit"))
# step <- stepAIC(b, direction = "both")
best <- glm(mig ~ diffWork.x + diffWork.y + freeWork.x + freeWork.y +
              care.x + care.y + comSize.x + comSize.y + comPrice.x + comPrice.y +
              IsWork + highEdu + age + isNation + sex + isEdu + gdp.x + gdp.y,
            data = dat14, family = binomial(link = "logit"))

multi <- function(Train, Valid) {
  mod <- multinom(mig ~ diffWork.x + diffWork.y + freeWork.x + freeWork.y +
                    care.x + care.y + comSize.x + comSize.y + comPrice.x + comPrice.y +
                    IsWork + highEdu + age + isNation + sex + isEdu + gdp.x + gdp.y,
                  data = Train, model = T)
  # predicted probabilities, classified with a fixed 0.5 cut-off
  pre <- predict(mod, newdata = Valid, type = "prob")
  pre_mult <- ifelse(pre < 0.5, 0, 1)
  tab <- table(observed = Valid[, c(3)], predicted = pre_mult)
  return(list(mod, tab))
}

m_logit <- multi(dat14, dat15)
summary(m_logit[[1]])
m_logit[[2]]

pre15 <- predict(mult, newdata = Valid, type = "prob")

err <- function(x, data = Valid, pre = pre15) {
  if (x >= max(pre)) x <- max(pre)
  pre_mult <- ifelse(pre

  error <- err1 * (w) + err2 * (1 - w)
  abs_error <- tab[[1, 2]] + tab[[2, 1]]
  return(list(tab, c("err1" = err1, "err2" = err2, "error" = error), abs_error))
}
err(0.05, Valid, pre15)
z <- abs(seq(-0.5, to = 0, 0.005))
a <- (lapply(z, function(x) err(x, data = Valid, pre = pre15)[[2]]))
a1 <- lapply(a, "[[", 1)
a2 <- lapply(a, "[[", 2)
a3 <- lapply(a, "[[", 3)
error <- unlist(lapply(z, function(x) err(x, data = Valid, pre = pre14)[[3]]))
z[which.min(a3)]
plot(z, a1, type = "l", ylim = c(0, 1), xlim = c(0.5, -0.17), col = "red",
     ylab = "error", xlab = "boundary", lwd = 2)
lines(z, a2, col = "blue", lwd = 2)
lines(z, a3, col = "black", lty = 2, lwd = 2)
legend(x = -0.01, y = 1,
       legend = c("Class0 error", "Class1 error", "balanced error"),
       col = c("red", "blue", "black"), text.font = 0.2,
       lty = c(1, 1, 2), lwd = 2)

Random Forest

forestSmot <- function(Train, Valid, trees = 100, over, under) {
  set.seed(1705)
  # balance the classes with SMOTE before fitting
  smot <- SMOTE(mig ~ ., data = Train, perc.over = over, perc.under = under)
  # tune mtry and keep the value with the smallest OOB error
  mtry <- tuneRF(smot[, -1], smot$mig, ntreeTry = trees,
                 stepFactor = 1.5, improve = 0.01,
                 trace = TRUE, plot = FALSE)
  best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
  forest <- randomForest(mig ~ ., data = smot, ntree = trees,
                         importance = TRUE, mtry = best.m)
  pre <- predict(forest, Valid[, -1])
  tab <- table(observed = Valid[, 1], predicted = pre)
  return(list(best.m, forest, pre, tab, table(smot$mig)))
}

all.0 <- forestSmot(Train = dat14.0[, -c(1, 2)],
                    Valid = dat15.0[, -c(1, 2)],
                    trees = 100, over = 400, under = 150)
all.0[[4]]
all.x <- forestSmot(Train = dat14.x[, -c(1, 2)],
                    Valid = dat15.x[, -c(1, 2)],
                    trees = 100, over = 400, under = 150)
all.x[[4]]
all.y <- forestSmot(Train = dat14.y[, -c(1, 2)],
                    Valid = dat15.y[, -c(1, 2)],
                    trees = 100, over = 400, under = 150)
all.y[[4]]
child <- forestSmot(Train = dataKind14[, -c(1, 2, 8, 9)],
                    Valid = dataKind15[, -c(1, 2, 8, 9)],
                    trees = 100, over = 400, under = 150)
child[[4]]
jug <- forestSmot(Train = dataJug14[, -c(1, 2, 8, 9)],
                  Valid = dataJug15[, -c(1, 2, 8, 9)],
                  trees = 100, over = 400, under = 150)
jug[[4]]
labor <- forestSmot(Train = dataArb14[, -c(1, 2, 8, 9)],
                    Valid = dataArb15[, -c(1, 2, 8, 9)],
                    trees = 100, over = 400, under = 150)
labor[[4]]
pen <- forestSmot(Train = dataPen14[, -c(1, 2, 8, 9)],
                  Valid = dataPen15[, -c(1, 2, 8, 9)],
                  trees = 100, over = 400, under = 150)
pen[[4]]