INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 8, ISSUE 11, NOVEMBER 2019 ISSN 2277-8616

The Use Of Logit And Probit Regression Models In The Process Of Graduates’ Employment

Aleksey V. Burkov, Elena A. Murzina

Abstract: The paper analyzes several conventional methods of estimating binary response functions in order to model the probability that university graduates set up a new business. Logit and probit models with 3 factors and with 2 factors are reviewed. The authors used the database of the National Science Foundation as the source of data on American graduates.

Key words: logit regression, probit regression, function of binary response, 3-factors regression models, 2-factors regression models, econometric modeling, statistical modeling of the labor market.

Aleksey V. Burkov, Mari State University, Russia
Elena A. Murzina, Volga State University of Technology, Russia

1. INTRODUCTION
In econometrics, models of processes with a binary response have recently come to the foreground of research interest. The conventional tools for such processes are logit and probit models. They are relevant for assessing the current state of all sectors of the economy, particularly the labour market of qualified specialists with higher education, and they are considered here in relation to this sphere. Unfortunately, the data on qualified specialists with higher education in the Russian Federation that such an analysis requires are not sufficiently disclosed. Therefore, the author used the database on specialists with higher education provided by the National Science Foundation of the USA. This database holds 447 parameters for each university graduate. It is not possible to use all of these parameters to build a regression model using neural networks, so the data must be reduced. To select the factors that affect the probability of establishing a business by university graduates, we referred to the methods of correlation analysis, factor analysis and confirmatory factor analysis, including structural equations. With the help of factor analysis, the 3-factor and 2-factor models were obtained. The 3-factor model included the following factors: "Experience" (f1), "Attitude to Education and Science" (f2), "Business Characteristics" (f3). The 2-factor model is based on the factors: "Experience and environment" (f1), "Business characteristics" (f2). In the context of our research we assumed that the economic criterion of a specialist's success is setting up a business.

2. LITERATURE REVIEW
This article is a logical continuation of the following articles of the author:
1. Using the Comprehensive Confirmatory Factor Analysis Method of Structural Equation Modeling in the Process of Graduates Employment [2];
2. Analysis Method of Structural Equation Modeling [3].
The presented articles deal with the use of correlation and factor analysis to determine the factors influencing the foundation of a new business. To confirm the results obtained, the methods of structural equations were applied. As a result of the analysis, the best results were obtained using the 3 and 2 factor models. The 3-factor model included the following factors: "Experience" (f1), "Attitude to Education and Science" (f2), "Business Characteristics" (f3). The 2-factor model is based on the factors: "Experience and environment" (f1), "Business characteristics" (f2). The number of factors was determined with the scree method.
As a theoretical justification for logit and probit regression analysis, the works of the following authors were used: Christensen, R. [4], Finney, D. J. [5], Hosmer, D. W., and Lemeshow, S. [8], Cox, D. R., and Snell, E. J. [16], Greenland, S. [23], Hosmer, D.W.J. and Lemeshow, S. [26].

3. SCOPE, OBJECTIVES AND METHODS
Since logit and probit models are the conventional models for functions with a binary response, we referred to these models first. In order to estimate the parameters of the logit and probit models the following methods have been used:
- Quasi-Newton Estimation Method
- Simplex Estimation Method
- Simplex and Quasi-Newton Estimation Method
- Hooke-Jeeves pattern moves Estimation Method
- Hooke-Jeeves and Quasi-Newton Estimation Method
- Rosenbrock pattern search Estimation Method
- Rosenbrock and Quasi-Newton Estimation Method
In general, the results of the aforesaid methods turned out to be almost identical. Differences appear only after the fifth decimal place. Nevertheless, we chose the method with the smallest standard errors of the model coefficients. We first examined the logit and probit models using 3 factors, and then using 2 factors.
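The agreement between estimation methods can be illustrated on a small scale. The sketch below is not the authors' computation and does not reproduce the algorithms of their statistical package: it fits a one-predictor logit model to a made-up ten-point sample (the names `xs`, `ys`, `newton_fit` and `pattern_search` are hypothetical). Newton-Raphson stands in for the quasi-Newton method, and the pattern search is a simplified Hooke-Jeeves variant that uses exploratory coordinate moves only. Both minimise the same negative log-likelihood, so the estimates should agree to several decimal places.

```python
import math

# Made-up illustrative sample (not the NSF data): one predictor, binary response.
xs = [-2.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

def neg_log_likelihood(beta):
    """Negative log-likelihood of the logit model p = 1 / (1 + exp(-(b0 + b1*x)))."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta[0] + beta[1] * x
        # Numerically stable log(1 + exp(z)) minus the y*z term.
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total

def newton_fit(iterations=25):
    """Newton-Raphson on the log-likelihood (stand-in for a quasi-Newton method)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return [b0, b1]

def pattern_search(f, start, step=1.0, tol=1e-8):
    """Simplified Hooke-Jeeves: exploratory coordinate moves with step halving."""
    x = list(start)
    fx = f(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
                    break
        if not improved:
            step /= 2.0  # no coordinate move helped, so refine the step
    return x

beta_newton = newton_fit()
beta_search = pattern_search(neg_log_likelihood, [0.0, 0.0])
print("Newton-Raphson:", [round(b, 5) for b in beta_newton])
print("Pattern search:", [round(b, 5) for b in beta_search])
```

Because the logit log-likelihood has a single maximum when the data are not separable, the choice of optimiser mainly affects speed and the last few digits, which is in line with the observation that the methods differ only after the fifth decimal place.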

IJSTR©2019 www.ijstr.org

4. RESULTS AND DISCUSSION
With regard to 3 factor logit models, the most exact method of parameter estimation is the Quasi-Newton Estimation Method. The formula of regression dependence can be represented as follows:

$$p(f_1, f_2, f_3) = \frac{\exp(1.756156 + 0.5749033 f_1 - 0.4341022 f_2 + 0.9252501 f_3)}{1 + \exp(1.756156 + 0.5749033 f_1 - 0.4341022 f_2 + 0.9252501 f_3)} \tag{1}$$

The coefficient estimates, standard errors, t-statistics, p-levels of the coefficient estimates and some other statistical data are presented in Table 1. Table 1 shows that the model coefficients are significant, since all of them have a low p-level and a t-statistic of high absolute value. Based on the t-statistics, we assume that in the obtained model the factor «Characteristics of business» is the most important, since the absolute value of its t-statistic is the highest, and the least important factor is «Attitude to education and science», which has the lowest absolute value of the t-statistic. This ranking of the factors is similar to the results of the correlation analysis [2]. It is worth noting that an increase in «Experience» and «Characteristics of business» increases the probability of self-employment, while an increase in «Attitude to education and science» decreases it.
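The direction of each factor's effect can be checked by evaluating formula (1) directly. The sketch below uses the coefficients reported in Table 1; the function name `logit_probability` and the factor values 0 and 1 are illustrative assumptions, since the range of the factor scores is not given in the text.

```python
import math

# Coefficients of the 3 factor logit model, taken from formula (1) / Table 1.
B0, B1, B2, B3 = 1.756156, 0.5749033, -0.4341022, 0.9252501

def logit_probability(f1, f2, f3):
    """Self-employment probability according to formula (1)."""
    z = B0 + B1 * f1 + B2 * f2 + B3 * f3
    return 1.0 / (1.0 + math.exp(-z))  # same as exp(z) / (1 + exp(z))

baseline = logit_probability(0.0, 0.0, 0.0)
print(f"all factors at 0: {baseline:.4f}")
print(f"f1 raised to 1:   {logit_probability(1.0, 0.0, 0.0):.4f}")  # above baseline
print(f"f2 raised to 1:   {logit_probability(0.0, 1.0, 0.0):.4f}")  # below baseline
print(f"f3 raised to 1:   {logit_probability(0.0, 0.0, 1.0):.4f}")  # above baseline
```

Raising f1 or f3 moves the probability above the baseline and raising f2 moves it below, which matches the signs of the coefficients.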

Table 1

Coefficient estimations

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) | Factor 3 (f3) |
| --- | --- | --- | --- | --- |
| Estimate | 1.756156 | 0.5749033 | -0.4341022 | 0.9252501 |
| Standard Error | 0.1408642 | 0.1348108 | 0.1614367 | 0.1157518 |
| t (565) | 12.46701 | 4.264519 | -2.688994 | 7.993396 |
| p-level | 0 | 0.00002348146 | 0.007378391 | 7.43419700E-15 |
| -95%CL | 1.479474 | 0.3101117 | -0.7511915 | 0.6978936 |
| +95%CL | 2.032837 | 0.8396949 | -0.1170129 | 1.152606 |
| Wald's Chi-square | 155.4264 | 18.18612 | 7.230688 | 63.89439 |
| p-level | 0 | 0.00002009077 | 0.007170485 | 1.35654000E-15 |
| Odds ratio (unit ch) | 5.790135 | 1.776959 | 0.647846 | 2.522499 |
| -95%CL | 4.390636 | 1.363577 | 0.4718041 | 2.009516 |
| +95%CL | 7.635719 | 2.31566 | 0.8895738 | 3.166435 |
| Odds ratio (range) |  | 10.94642 | 0.2307998 | 174.2136 |
| -95%CL |  | 3.635796 | 0.07908819 | 49.0223 |
| +95%CL |  | 32.9568 | 0.6735334 | 619.1135 |
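Several rows of Table 1 follow from the estimates and standard errors alone, which offers a quick consistency check. The sketch below recomputes, for Factor 1, the t-statistic (estimate divided by standard error), Wald's chi-square (the square of that ratio) and the unit-change odds ratio with its confidence limits (the exponent of the estimate and of its limits). The variable names are illustrative.

```python
import math

# Factor 1 (f1) values copied from Table 1.
estimate = 0.5749033
std_error = 0.1348108
lower_cl, upper_cl = 0.3101117, 0.8396949  # -95%CL and +95%CL of the estimate

t_stat = estimate / std_error        # Table 1 reports 4.264519
wald_chi2 = t_stat ** 2              # Table 1 reports 18.18612
odds_ratio = math.exp(estimate)      # Table 1 reports 1.776959
odds_ratio_cl = (math.exp(lower_cl), math.exp(upper_cl))  # Table 1: 1.363577 and 2.31566

print(f"t = {t_stat:.6f}")
print(f"Wald chi-square = {wald_chi2:.5f}")
print(f"odds ratio = {odds_ratio:.6f}")
print(f"odds ratio 95% CL = {odds_ratio_cl[0]:.6f} .. {odds_ratio_cl[1]:.6f}")
```

An odds ratio above 1 (Factors 1 and 3) corresponds to a positive coefficient, and an odds ratio below 1 (Factor 2) to a negative one.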

Table 2

Correlation Matrix of Parameter Estimates. Variances of parameter estimates were computed after rescaling MS error to 1.

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) | Factor 3 (f3) |
| --- | --- | --- | --- | --- |
| Const. B0 | 1.000000 | 0.221754 | -0.372354 | 0.412939 |
| Factor 1 | 0.221754 | 1.000000 | -0.016390 | 0.073552 |
| Factor 2 | -0.372354 | -0.016390 | 1.000000 | -0.029271 |
| Factor 3 | 0.412939 | 0.073552 | -0.029271 | 1.000000 |

Table 3

Classification of Cases. Odds ratio: 7.7307. Perc. correct: 81.02 %

| Observed | Pred. 1.000000 | Pred. 0.000000 | Percent Correct |
| --- | --- | --- | --- |
| 1.000000 | 33 | 87 | 27.50000 |
| 0.000000 | 21 | 428 | 95.32294 |
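The summary figures of a classification table can be recomputed from its four cell counts. The sketch below does this for Table 3; the variable names are illustrative and the row and column labels follow the table as printed.

```python
# Cell counts copied from Table 3 (rows: observed value, columns: predicted value).
obs1_pred1, obs1_pred0 = 33, 87
obs0_pred1, obs0_pred0 = 21, 428

total = obs1_pred1 + obs1_pred0 + obs0_pred1 + obs0_pred0

# Overall share of cases on the diagonal (correct predictions).
percent_correct = 100.0 * (obs1_pred1 + obs0_pred0) / total

# Share of correct predictions within each observed row.
row1_correct = 100.0 * obs1_pred1 / (obs1_pred1 + obs1_pred0)
row0_correct = 100.0 * obs0_pred0 / (obs0_pred1 + obs0_pred0)

# Odds ratio of the 2x2 table: product of the correct cells
# divided by the product of the incorrect cells.
odds_ratio = (obs1_pred1 * obs0_pred0) / (obs1_pred0 * obs0_pred1)

print(f"percent correct: {percent_correct:.2f} %")
print(f"row 1.000000:    {row1_correct:.5f} %")
print(f"row 0.000000:    {row0_correct:.5f} %")
print(f"odds ratio:      {odds_ratio:.4f}")
```

The same four lines of arithmetic apply to the other classification tables in the paper once their cell counts are substituted.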

In order to check the accuracy of the model we refer to the correlation matrix of the parameters (Tab.2), the matrix with the number of correctly classified cases (Tab.3), the normal probability plot of residuals and the histogram of the frequency distribution of residuals (Fig.1). The parameter correlation in the model is insignificant. The highest correlations are observed between the constant and the other parameters, but their absolute values do not exceed 0.45 and can therefore be deemed insignificant.


Fig. 1 Histogram of frequency distribution of residuals

Considering the table of correct classifications, we can conclude that the model correctly describes 81.02 % of all cases; of the cases in which the self-employment result was negative, 95.32 % have been predicted correctly. The plots also confirm the adequacy of the constructed model. On the normal probability plot of residuals, the residuals lie close to the straight line of the normal distribution, and the histogram of the frequency distribution of residuals is similar to a normal distribution curve, given that the dependent variable is binary. The left and right parts of the histogram each resemble a normal distribution curve: the left part for 0, the right part for 1. We considered the distribution of the residuals separately for 0 and 1 because the observed value is either 0 or 1, while the modeled functions are continuous and take values from 0 to 1. The value of a modeled function is the probability of self-employment, which is a normally distributed value, and subtracting a normally distributed value from 0 or 1 gives two normally distributed values, one for 0 and another for 1. Hence, the histogram of residuals for functions that model probability from cases with a binary response should look like two normal distribution curves. The first curve, for 0, lies in the interval from -1 to 0, and the second, for 1, lies in the interval from 0 to 1. Similar reasoning applies to the normal probability plot of residuals. This is what we also see in Figures 5 and 6. Hence, we can consider that the model precisely describes the considered process.

Now we shall consider the 3 factor probit model. Unlike for the logit model, the optimum method for estimating the parameters of the probit model was the Hooke-Jeeves pattern moves Estimation Method. The formula of the probit regression dependence can be presented as follows:

$$p(f_1, f_2, f_3) = \mathrm{NP}(1.02477 + 0.332166 f_1 - 0.24980 f_2 + 0.530444 f_3) \tag{2}$$

where NP is the normal probability, i.e. the standard normal cumulative distribution function.

The parameter estimates of the model are provided in Table 4. The table shows that the standard errors of the probit model parameters are lower than those of the logit model. However, the coefficient values are also lower, so the accuracy of estimation is the same as in the logit model. The parameters of the model can be considered significant because they have high t-statistics and low p-levels. The most significant factor in the 3 factor probit model is the «Characteristic of business» factor, which has the greatest absolute value of the t-statistic, and the least significant factor is the «Attitude to education and science» factor, which has the smallest. This result completely coincides with the conclusions made for the logit model.
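Formula (2) and the residual argument above can be made concrete with a few lines of code. The sketch below evaluates the probit model with the coefficients from Table 4, using the standard normal cumulative distribution function from Python's standard library, and shows why the residuals for observed 0 fall between -1 and 0 while those for observed 1 fall between 0 and 1. The function name and the factor values (all zeros) are illustrative assumptions.

```python
from statistics import NormalDist

# Coefficients of the 3 factor probit model, taken from formula (2) / Table 4.
B0, B1, B2, B3 = 1.02477, 0.332166, -0.24980, 0.530444

def probit_probability(f1, f2, f3):
    """Self-employment probability: standard normal CDF of the linear predictor."""
    z = B0 + B1 * f1 + B2 * f2 + B3 * f3
    return NormalDist().cdf(z)

p = probit_probability(0.0, 0.0, 0.0)

# Raw residual = observed value minus modelled probability.
# Since 0 <= p <= 1, the two possible residuals have opposite signs.
residual_if_0 = 0 - p   # lies in the interval [-1, 0]
residual_if_1 = 1 - p   # lies in the interval [0, 1]

print(f"p = {p:.4f}")
print(f"residual when observed value is 0: {residual_if_0:.4f}")
print(f"residual when observed value is 1: {residual_if_1:.4f}")
```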

Table 4

Model: Probit regression. N of 0's: 120, 1's: 449. Dep. var: SELFEMPL. Loss: Max likelihood (MS-err. scaled to 1). Final loss: 239.51922021. Chi²(3) = 107.19, p = 0.0000

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) | Factor 3 (f3) |
| --- | --- | --- | --- | --- |
| Estimate | 1.02477 | 0.332166 | -0.24980 | 0.530444 |
| Std. Err. | 0.07434 | 0.076550 | 0.08910 | 0.064016 |
| t (565) | 13.78540 | 4.339171 | -2.80353 | 8.286108 |
| p-level | 0.00000 | 0.000017 | 0.00523 | 0.000000 |
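The chi-square value in the header of Table 4 appears to be a likelihood-ratio statistic that compares the fitted model with an intercept-only model. This reading is an assumption, as the text does not define the statistic, but it reproduces the reported figure from the class counts and the final loss. The variable names are illustrative.

```python
import math

# Values copied from the header of Table 4.
n_zeros, n_ones = 120, 449
final_loss = 239.51922021  # negative log-likelihood of the fitted 3 factor probit model

n = n_zeros + n_ones

# Negative log-likelihood of an intercept-only model, which assigns every
# case the sample share of its class (assumed baseline, see lead-in).
null_loss = -(n_zeros * math.log(n_zeros / n) + n_ones * math.log(n_ones / n))

# Likelihood-ratio statistic with 3 degrees of freedom (one per factor).
chi_square = 2.0 * (null_loss - final_loss)

print(f"null loss  = {null_loss:.4f}")
print(f"chi-square = {chi_square:.2f}")  # Table 4 reports 107.19
```

Substituting the final loss values of the two-factor models gives their reported chi-square values in the same way.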


Table 5

Correlation Matrix of Parameter Estimates. Variances of parameter estimates were computed after rescaling MS error to 1.

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) | Factor 3 (f3) |
| --- | --- | --- | --- | --- |
| Const. B0 | 1.000000 | 0.163381 | -0.346332 | 0.310703 |
| Factor 1 | 0.163381 | 1.000000 | -0.000984 | 0.037268 |
| Factor 2 | -0.346332 | -0.000984 | 1.000000 | -0.025154 |
| Factor 3 | 0.310703 | 0.037268 | -0.025154 | 1.000000 |

Table 6

Classification of Cases. Odds ratio: 7.8000. Perc. correct: 81.02 %

| Observed | Pred. 1.000000 | Pred. 0.000000 | Percent Correct |
| --- | --- | --- | --- |
| 1.000000 | 32 | 88 | 26.66667 |
| 0.000000 | 20 | 429 | 95.54565 |

The correlations between the model parameters are provided in Table 5. As in the case of the logit model, the correlations are low, which supports the accuracy of the probit model. The greatest correlations are observed between the factor coefficients and the constant. Judged by the correlation between parameters, the probit model is more accurate than the logit model, because the absolute values of the parameter correlations in the probit model do not exceed 0.35. However, this does not affect the number of correctly predicted values of self-employment probability (Tab.6). The share of correctly predicted values coincides with that of the logit model and equals 81.02 %, while the share of correctly predicted employed graduates (cases other than self-employment) reaches 95.55 %.

Fig. 2 Histogram of frequency distribution of residuals

The normal probability plot of residuals and the histogram of the frequency distribution of residuals (Fig.2) are similar to the normal distribution, which supports the accuracy of the model. These two plots are also very similar to the corresponding plots for the logit model. Comparing the 3 factor logit and probit models, we argue that they are almost identical. Although some parameters of the probit model indicate higher accuracy, the share of correctly predicted values of self-employment probability in both models equals 81.02 %.

Let us consider the 2 factor models. As with the 3 factor models, we begin with the logit model. To estimate the parameters of the two-dimensional logit model we used the Simplex method, because it gives the lowest standard errors of the parameter estimates. The formula of the two-dimensional logit model can be presented as follows:

$$p(f_1, f_2) = \frac{\exp(1.68802 + 0.60944 f_1 + 0.878472 f_2)}{1 + \exp(1.68802 + 0.60944 f_1 + 0.878472 f_2)} \tag{3}$$


The parameter estimates are presented in Table 7. The parameters indicate that the model is highly significant: the t-statistics are high and the p-levels are low. Both factors have an approximately equal influence on the probability of self-employment, because their t-statistics are virtually equal. This result coincides with the results of the correlation analysis [2], which confirms the adequacy of the model. The adequacy of the model is also supported by the correlation between the parameters (Tab.8). The correlation between the factor coefficients is low, and the greatest correlation is observed between the factors and the constant, as in the 3 factor model. The adequacy of the model can also be illustrated by the plots. The surface built using formula (3) lies close to the points, the normal probability plot of residuals is similar to the plot of the normal distribution, and the histogram of the frequency distribution of residuals resembles the histogram of the normal distribution for a model with a binary response (Fig.3). Considering all these facts, we can argue that the model is adequate.

Table 7

Model: Logistic regression (logit). N of 0's: 120, 1's: 449. Dep. var: SELFEMPL. Loss: Max likelihood (MS-err. scaled to 1). Final loss: 264.04056887. Chi²(2) = 58.150, p = .00000

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) |
| --- | --- | --- | --- |
| Estimate | 1.68802 | 0.6094402 | 0.8784723 |
| Standard Error | 0.1407411 | 0.1276423 | 0.1834788 |
| t (566) | 11.99379 | 4.774593 | 4.787868 |
| p-level | 1.08578700E-29 | 0.000002295979 | 0.000002154781 |
| -95%CL | 1.411581 | 0.3587297 | 0.5180898 |
| +95%CL | 1.964458 | 0.8601507 | 1.238855 |
| Wald's Chi-square | 143.851 | 22.79674 | 22.92368 |
| p-level | 0 | 0.000001808662 | 0.000001693154 |
| Odds ratio (unit ch) | 5.408758 | 1.839401 | 2.407219 |
| -95%CL | 4.102436 | 1.43151 | 1.678818 |
| +95%CL | 7.131048 | 2.363517 | 3.451658 |
| Odds ratio (range) |  | 12.83899 | 22.83482 |
| -95%CL |  | 4.492679 | 6.327753 |
| +95%CL |  | 36.69075 | 82.40348 |

Table 8

Correlation Matrix of Parameter Estimates. Variances of parameter estimates were computed after rescaling MS error to 1.

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) |
| --- | --- | --- | --- |
| Const. B0 | 1.000000 | 0.172062 | 0.608793 |
| Factor 1 | 0.172062 | 1.000000 | -0.028799 |
| Factor 2 | 0.608793 | -0.028799 | 1.000000 |

Table 9

Classification of Cases. Odds ratio: 3.9909. Perc. correct: 78.91 %

| Observed | Pred. 1.000000 | Pred. 0.000000 | Percent Correct |
| --- | --- | --- | --- |
| 1.000000 | 10 | 110 | 8.33333 |
| 0.000000 | 10 | 439 | 97.77283 |

The number of correctly predicted cases and the overall accuracy of the model are provided in Table 9. The two-dimensional logit model correctly predicts 78.91 % of all cases and 97.77 % of the cases other than self-employment. The share of correctly predicted cases other than self-employment is higher than in the 3 factor model, but the overall accuracy is slightly lower.


Fig. 3 Histogram of frequency distribution of residuals

Let us examine the two-dimensional probit model. The Simplex method is the best method for estimating the parameters of the two-dimensional model. The two-dimensional probit model has the following form:

$$p(f_1, f_2) = \mathrm{NP}(0.98449 + 0.359594 f_1 + 0.471729 f_2) \tag{4}$$

where NP is the normal probability, i.e. the standard normal cumulative distribution function. The presented model is significant, because the t-statistics of the model coefficients are high and the p-levels are low (Tab.10). Table 10 shows that, as in the logit model, both factors exert an approximately equal impact on the dependent variable. This result corresponds to the result of the correlation analysis [2].
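Although the coefficients of formulas (3) and (4) differ in scale, the two models give very similar probabilities. The sketch below evaluates both on a small grid of factor values. The grid of -1, 0 and 1 is an illustrative assumption rather than values taken from the data, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def logit_p(f1, f2):
    """Formula (3): two-factor logit model."""
    z = 1.68802 + 0.60944 * f1 + 0.878472 * f2
    return 1.0 / (1.0 + math.exp(-z))

def probit_p(f1, f2):
    """Formula (4): two-factor probit model (standard normal CDF)."""
    z = 0.98449 + 0.359594 * f1 + 0.471729 * f2
    return NormalDist().cdf(z)

grid = [-1.0, 0.0, 1.0]  # illustrative factor values, not taken from the data
max_gap = 0.0
for f1 in grid:
    for f2 in grid:
        pl, pp = logit_p(f1, f2), probit_p(f1, f2)
        max_gap = max(max_gap, abs(pl - pp))
        print(f"f1={f1:+.0f} f2={f2:+.0f}  logit={pl:.4f}  probit={pp:.4f}")

print(f"largest difference on the grid: {max_gap:.4f}")
```

On this grid the two formulas differ by less than two percentage points, which is consistent with the near-identical classification results reported for the logit and probit models.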

Table 10

Model: Probit regression. N of 0's: 120, 1's: 449. Dep. var: SELFEMPL. Loss: Max likelihood (MS-err. scaled to 1). Final loss: 264.20802045. Chi²(2) = 57.815, p = .00000

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) |
| --- | --- | --- | --- |
| Estimate | 0.98449 | 0.359594 | 0.471729 |
| Std. Err. | 0.07299 | 0.072900 | 0.093847 |
| t (566) | 13.48812 | 4.932722 | 5.026554 |
| p-level | 0.00000 | 0.000001 | 0.000001 |

Table 11

Correlation Matrix of Parameter Estimates. Variances of parameter estimates were computed after rescaling MS error to 1.

|  | Const. B0 | Factor 1 (f1) | Factor 2 (f2) |
| --- | --- | --- | --- |
| Const. B0 | 1.000000 | 0.133260 | 0.504319 |
| Factor 1 | 0.133260 | 1.000000 | -0.008267 |
| Factor 2 | 0.504319 | -0.008267 | 1.000000 |

Table 12

Classification of Cases. Odds ratio: 6.7121. Perc. correct: 79.61 %

| Observed | Pred. 1.000000 | Pred. 0.000000 | Percent Correct |
| --- | --- | --- | --- |
| 1.000000 | 10 | 110 | 8.33333 |
| 0.000000 | 6 | 443 | 98.66370 |

The obtained results support the adequacy of the model. Firstly, the parameters of the model have low correlation; a noticeable correlation is observed only between the «Characteristic of business» factor and the constant (Tab.11). Secondly, the normal probability plot of residuals is similar to the plot of the normal distribution. Thirdly, the histogram of the frequency distribution of residuals is similar to the histogram of the normal distribution (Fig.4). The normal probability plot of residuals should be considered separately for the values 0 and 1. The aforesaid supports the adequacy of the model. The accuracy of the model is also confirmed by the data in Table 12, which show that the share of correctly predicted cases other than self-employment equals 98.66 % and the total share of correctly predicted cases equals 79.61 %, indicating a high accuracy of the model.

Fig. 4 Histogram of frequency distribution of residuals

From all the aforesaid it is possible to conclude that the two-dimensional probit model is more exact than the two-dimensional logit model, because its share of correctly predicted cases is higher than that of the logit model.

As a result of the research carried out we obtained the following outcomes:
1. On the basis of the factors, we have built three-dimensional regression logit and probit models of self-employment probability, which correctly predict 81.02 % of the cases. The three-dimensional logit model is given by function (1), and the probit model by function (2). For the estimation of the logit model parameters we used the Quasi-Newton Estimation Method (Tab.1), and for the probit model the Hooke-Jeeves pattern moves Estimation Method (Tab.4).
2. On the basis of the obtained factors, applying the Simplex method of parameter estimation, we have constructed two-dimensional regression logit (Tab.7) and probit (Tab.10) models of self-employment probability. The two-dimensional logit model is given by function (3), and the probit model by function (4). The two-dimensional logit model correctly predicts 78.91 % of cases, and the probit model predicts 79.61 % of cases.

REFERENCES
[1] Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons, Inc.
[2] Burkov, A.V., Murzina, E.A. (2015). Using the Comprehensive Confirmatory Factor Analysis Method of Structural Equation Modeling in the Process of Graduates Employment. Mediterranean Journal of Social Sciences. 6(6 S2), 6628-667.
[3] Burkov, A.V., Murzina, E.A. (2016). Analysis Method of Structural Equation Modeling. Advances in Systems Science And Application. 16(4), 1-12.
[4] Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.
[5] Finney, D.J. (1971). Probit analysis. Cambridge: Cambridge University Press.
[6] Haberman, S.J. (1973). The analysis of residuals in cross-classified tables. Biometrics. 29, 205–220.
[7] Haberman, S.J. (1979). Analysis of qualitative data, Volume 2. New York: Academic Press.
[8] Hosmer, D.W., Lemeshow, S. (1989). Applied logistic regression. New York: John Wiley & Sons, Inc.
[9] Pierce, D.A., Schafer, D.W. (1986). Residuals in generalized linear models. Journal of the American Statistical Association. 81, 977–986.
[10] Anderson, T.W. (1958). Introduction to multivariate statistical analysis. New York: John Wiley & Sons, Inc.
[11] Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression diagnostics. New York: John Wiley & Sons, Inc.
[12] Blalock, H.M. (1972). New York: McGraw-Hill.
[13] Box, G.E.P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika. 36, 317–346.
[14] Cook, R.D. (1977). Detection of influential observations in linear regression. Technometrics. 19, 15-18.
[15] Cooley, W.W., Lohnes, P.R. (1971). Multivariate data analysis. New York: John Wiley & Sons, Inc.
[16] Cox, D.R., Snell, E.J. (1989). The analysis of binary data, 2nd ed. London:


[17] Cox, D.R. (1972). Regression Models and Life Tables (with discussion). Journal of Royal Statistical Society. 46, 1-30.
[18] Draper, N.R., Smith, H. (1969). Applied . New York: John Wiley & Sons, Inc.
[19] Gill, P.E., Murray, W.M., Wright, M.H. (1981). Practical Optimization. London: Academic Press.
[20] Gill, P.E., Murray, W.M., Saunders, M.A., Wright, M.H. (1984). Procedures for optimization problems with a mixture of bounds and general linear constraints. ACM Transactions on Mathematical Software. 10(3), 282–296.
[21] Goodman, L.A. (1979). Simple Models for the Analysis of Association in Cross-Classifications having Ordinal Categories. Journal of American Statistical Association. 74, 537-52.
[22] Goodman, L.A. (1981). Association Models and Canonical Correlation in the Analysis of Cross-Classifications having Ordered Categories. Journal of American Statistical Association. 76, 320-34.
[23] Greenland, S. (1994). Alternative Models for Ordinal Logistic Regression. Statistics in Medicine. 13, 65-77.
[24] Hays, W.L. (1973). Statistics for the social sciences. New York: Holt, Rinehart and Winston.
[25] Horst, P. (1963). Matrix algebra for social scientists. New York: Holt, Rinehart, and Winston.
[26] Hosmer, D.W.J., Lemeshow, S. (1981). Applied Logistic Regression Models. Biometrics. 34, 318-327.
[27] Kotz, S., Johnson, N.L., eds. (1988). Encyclopedia of statistical sciences. New York: John Wiley & Sons, Inc.
[28] Magidson, J. (1995). Introducing a New Graphical Model for the Analysis of an Ordinal Categorical Response – Part I. Journal of Targeting, Measurement and Analysis for Marketing. IV (2), 133-48.
[29] McCullagh, P. (1980). Regression Models for Ordinal Data. Journal of Royal Statistical Society. 42(2), 109-142.
[30] McCullagh, P., Nelder, J.A. (1989). Generalized linear models. 2nd ed. London: Chapman and Hall.
[31] Pierce, D.A., Schafer, D.W. (1986). Residuals in generalized linear models. Journal of the American Statistical Association. 81, 977–986.
[32] Pregibon, D. (1981). Logistic Regression Diagnostics. Annals of Statistics. 9, 705-24.
[33] Searle, S.R. (1971). Linear models. New York: John Wiley & Sons, Inc.
[34] Velleman, P.F., Welsch, R.E. (1981). Efficient computing of regression diagnostics. The American Statistician. 35, 234–242.
[35] Chistik, O.F., Nosov, V.V., Tsypin, A.P., Ivanov, O.B., Permjakova, T.V. (2016). Research indicators of railway transport activity on the basis of historical series. International Journal of Economic Perspectives. 10(3), 57–65.
[36] Williams, D.A. (1982). Extra-Binomial Variation in Logistic Linear Models. Applied Statistics. 31, 144-48.
