Stern’s MBA 1 students expect to make the big bucks after graduation!

Data File: salary.doc

The Stern School of Business is a very reputable school. Students generally attend it not only to broaden their knowledge of business but also to increase their salary. I am interested in finding out what leads MBA 1 students to believe that they are going to earn a certain salary after graduation. In other words, what factors affect students' expected salary? As I could not find enough meaningful data on the topic, I decided to conduct my own survey. I was actually very surprised to see how cooperative students were in filling out my survey when I gave them candy in return! Here are some descriptive statistics for all 40 MBA 1 students surveyed. Variables include expected salary after graduation (expected), age of the person (age), the number of years of work experience in the industry chosen for after graduation (num. of), whether the person plans to work in finance (plan fin), in consulting (plan con), in marketing (plan mar) or in another industry (plan oth), whether the person is unsure about the industry (unsure), whether the person plans to work for a Fortune 500 company (500 comp), the number of hours he/she plans to work per week (Hrs. of), and his/her past salary (past sal).

Descriptive Statistics

Variable     N     Mean   Median  Tr Mean    StDev  SE Mean
Expected    40    84250    80000    83333    17213     2722
age         40   26.825   26.000   26.694    2.490    0.394
num. of     40    1.725    0.000    1.500    2.449    0.387
plan fin    40   0.6750   1.0000   0.6944   0.4743   0.0750
plan con    40   0.2500   0.0000   0.2222   0.4385   0.0693
plan mar    40   0.0750   0.0000   0.0278   0.2667   0.0422
plan oth    40  0.00000  0.00000  0.00000  0.00000  0.00000
unsure      40  0.00000  0.00000  0.00000  0.00000  0.00000
500 comp    40   0.5000   0.5000   0.5000   0.5064   0.0801
Hrs. of     40    70.62    70.00    70.42    12.57     1.99
past sal    40    54250    48500    51667    20664     3267

Variable        Min      Max       Q1       Q3
Expected      60000   125000    70000    97500
age          23.000   33.000   25.000   29.000
num. of       0.000    8.000    0.000    3.000
plan fin     0.0000   1.0000   0.0000   1.0000
plan con     0.0000   1.0000   0.0000   0.7500
plan mar     0.0000   1.0000   0.0000   0.0000
plan oth    0.00000  0.00000  0.00000  0.00000
unsure      0.00000  0.00000  0.00000  0.00000
500 comp     0.0000   1.0000   0.0000   1.0000
Hrs. of       50.00    95.00    60.00    80.00
past sal      30000   125000    40000    60000

At first look, several things stand out in the data. None of the 40 students claims to be unsure about which industry to enter after graduation, and none plans to work in an industry other than finance, consulting or marketing. Moreover, the average expected salary of $84,250 is much higher than the actual salary of graduating MBA students in 1996, $70,000 (footnote). This discrepancy could mean that MBA 1 students are rather optimistic.

Salaries are often right tailed. Let’s check the distribution of both expected and past salaries.

[Figure: Histograms of expected salary and past salary (frequency by salary)]

These distributions of salaries are right tailed. Thus, it might be helpful to log salaries.
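As a quick illustration of what logging the salaries does, the sketch below applies a base-10 logarithm to a few salary values. The base is an assumption, but it agrees with the regression output further on, where an expected salary of 70000 is listed as 4.84510 and 120000 as 5.07918. The function name is made up for this example.

```python
import math

def log_salary(salary):
    """Base-10 log of a salary (base 10 is assumed from the reported values)."""
    return math.log10(salary)

# Equal ratios become equal distances on the log scale, which pulls in the right tail.
for salary in (60000, 70000, 120000, 125000):
    print(salary, round(log_salary(salary), 5))
```

On this scale the gap between 60000 and 120000 is the same as the gap between 30000 and 60000, which is why a long right tail is compressed.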

[Figure: Histograms of log expected salary and log past salary (frequency by log salary)]

Let’s check potential outliers in both expected and past salaries:

[Figure: Outlier check plots for expected salary and past salary]

There are apparently 3 outliers in the past salary observations. Now, let’s look at the distribution of the different industries students plan to work in.

[Figure: Expected salary by plan fin., plan cons. and plan mark. (0 = no, 1 = yes)]

Only 3 of the 40 students (a mean of 0.075) plan to work in marketing, so this variable is unlikely to carry much weight. Students who plan to work in finance are coded 1. It is interesting to see that students who plan to work in finance and those who do not have the same median expected salary ($80,000). The plan finance and plan consulting indicators also look negatively correlated, which is expected because a student who chooses one does not choose the other.

Regression Analysis

* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation

* plan other has all values = 0
* plan other has been removed from the equation

* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
Expected salary = - 44885 + 3352 age + 647 num. of yrs. of exp. - 2591 plan fin.
                  + 4025 plan cons. + 3171 500 comp? + 432 Hrs. of work
                  + 0.125 past salary

Predictor        Coef     StDev       T      P   VIF
Constant       -44885     18068   -2.48  0.018
age            3351.8     849.5    3.95  0.000   3.6
num. of         647.5     514.7    1.26  0.217   1.3
plan fin        -2591      5082   -0.51  0.614   4.6
plan con         4025      5344    0.75  0.457   4.4
500 comp         3171      2887    1.10  0.280   1.7
Hrs. of         431.6     122.2    3.53  0.001   1.9
past sal      0.12506   0.09684    1.29  0.206   3.2

S = 6999 R-Sq = 86.4% R-Sq(adj) = 83.5%

Analysis of Variance

Source        DF           SS          MS      F      P
Regression     7   9988126411  1426875202  29.13  0.000
Error         32   1567373589    48980425
Total         39  11555500000

Source      DF      Seq SS
age          1  8185071089
num. of      1    75586010
plan fin     1   531542752
plan con     1   158051549
500 comp     1    84132640
Hrs. of      1   872058157
past sal     1    81684215

Unusual Observations
Obs   age  Expected     Fit  StDev Fit  Residual  St Resid
 16  25.0     70000   86541       3156    -16541    -2.65R
 34  27.0    120000  101312       3493     18688     3.08R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 1.64
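The summary figures in this output (S, R-Sq, R-Sq(adj) and F) all follow from the Analysis of Variance table. The sketch below recomputes them from the reported sums of squares and degrees of freedom; the function name is illustrative and not part of Minitab.

```python
import math

def fit_summary(ss_regression, ss_error, df_regression, df_error):
    """Recompute S, R-Sq, R-Sq(adj) and F from an ANOVA table."""
    ss_total = ss_regression + ss_error
    df_total = df_regression + df_error
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return {
        "S": math.sqrt(ms_error),                                   # residual standard deviation
        "R-Sq": 100 * ss_regression / ss_total,                     # percent of variation explained
        "R-Sq(adj)": 100 * (1 - ms_error / (ss_total / df_total)),  # penalized for model size
        "F": ms_regression / ms_error,                              # overall significance test
    }

# Sums of squares and degrees of freedom from the first regression above.
summary = fit_summary(9988126411, 1567373589, 7, 32)
for name, value in summary.items():
    print(name, round(value, 2))
```

The same function can be applied to the ANOVA table of any later regression to check its reported S and R-Sq values.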

We can see that Minitab automatically drops 3 variables: students planning to work in marketing, students planning to work in other industries, and students who are unsure about which industry they will work in. I would also remove the plan consulting variable because it is highly negatively correlated with the plan finance variable. The overall regression is statistically significant. However, some variables have P-values over .05.
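To see why the marketing indicator is redundant and why the finance and consulting indicators are strongly negatively correlated, the sketch below rebuilds the three indicator columns from the reported means (0.675, 0.250 and 0.075 with N = 40, that is 27, 10 and 3 students). It assumes every student chose exactly one industry, which the survey text suggests but does not state outright; the ordering of the rows is arbitrary.

```python
import math

# Counts implied by the reported means with N = 40.
# Assumption: each student picked exactly one of the three industries.
n_fin, n_con, n_mar = 27, 10, 3

plan_fin = [1] * n_fin + [0] * (n_con + n_mar)
plan_con = [0] * n_fin + [1] * n_con + [0] * n_mar
plan_mar = [0] * (n_fin + n_con) + [1] * n_mar

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# plan mark. is an exact linear combination of the constant and the other two indicators.
redundant = all(m == 1 - f - c for f, c, m in zip(plan_fin, plan_con, plan_mar))
print(redundant)
print(round(pearson(plan_fin, plan_con), 3))
```

Under that assumption the correlation between the finance and consulting indicators is about -0.83, in line with the VIF values above 4 reported for both variables.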

[Figure: Residual plots (response is Expected): Residuals Versus the Fitted Values; Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals; Histogram of the Residuals]

The distribution of the residuals looks normal. However, we can notice a couple of outliers. Now, let’s try a regression with logged salaries for past and expected while keeping the same variables.

Regression Analysis

* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation

* plan other has all values = 0
* plan other has been removed from the equation

* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
log expected = 4.26 + 0.0178 age + 0.00289 num. of yrs. of exp. - 0.0137 plan fin.
               + 0.0190 plan cons. + 0.0125 500 comp? + 0.00202 Hrs. of work
               + 0.000001 past salary

Predictor         Coef       StDev       T      P   VIF
Constant       4.26054     0.08967   47.52  0.000
age           0.017759    0.004216    4.21  0.000   3.6
num. of       0.002887    0.002554    1.13  0.267   1.3
plan fin      -0.01366     0.02522   -0.54  0.592   4.6
plan con       0.01899     0.02652    0.72  0.479   4.4
500 comp       0.01251     0.01433    0.87  0.389   1.7
Hrs. of      0.0020215   0.0006063    3.33  0.002   1.9
past sal    0.00000057  0.00000048    1.18  0.246   3.2

S = 0.03473 R-Sq = 86.2% R-Sq(adj) = 83.2%

Analysis of Variance

Source        DF        SS        MS      F      P
Regression     7  0.241162  0.034452  28.56  0.000
Error         32  0.038600  0.001206
Total         39  0.279762

Source      DF    Seq SS
age          1  0.201723
num. of      1  0.001471
plan fin     1  0.012164
plan con     1  0.003731
500 comp     1  0.001379
Hrs. of      1  0.019007
past sal     1  0.001687

Unusual Observations
Obs   age  log expe      Fit  StDev Fit  Residual  St Resid
 16  25.0   4.84510  4.92668    0.01566  -0.08158    -2.63R
 34  27.0   5.07918  4.99767    0.01733   0.08151     2.71R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 2.01

Logging the salaries does not change the regression model significantly. Thus, I will keep the unlogged salaries. Let’s see how the regression looks without the plan consulting variable.
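For readers comparing the two models, a coefficient in the logged regression can be read as an approximate percent change in expected salary per unit of the predictor. The sketch below does that conversion for three coefficients from the logged output, assuming the logs are base 10 (consistent with 70000 appearing as 4.84510); the helper name is hypothetical.

```python
def percent_change_per_unit(log10_coef):
    """Percent change in the response for a one-unit increase in a predictor,
    given a coefficient from a regression on the base-10 log of the response."""
    return (10 ** log10_coef - 1) * 100

# Coefficients taken from the logged regression output above.
for name, coef in [("age", 0.017759), ("Hrs. of", 0.0020215), ("plan fin", -0.01366)]:
    print(name, round(percent_change_per_unit(coef), 2))
```

Read this way, one extra year of age corresponds to roughly 4% more expected salary, which is of the same order as the 3352 per year in the unlogged model at the mean expected salary of 84250.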

Regression Analysis

The regression equation is
Expected salary = - 46346 + 3477 age + 574 num. of yrs. of exp. - 5884 plan fin.
                  + 2553 500 comp? + 466 Hrs. of work + 0.113 past salary

Predictor        Coef     StDev       T      P   VIF
Constant       -46346     17846   -2.60  0.014
age            3476.8     827.6    4.20  0.000   3.4
num. of         573.8     501.9    1.14  0.261   1.2
plan fin        -5884      2573   -2.29  0.029   1.2
500 comp         2553      2750    0.93  0.360   1.6
Hrs. of         465.9     112.6    4.14  0.000   1.6
past sal      0.11303   0.09489    1.19  0.242   3.1

S = 6953 R-Sq = 86.2% R-Sq(adj) = 83.7%

Analysis of Variance

Source        DF           SS          MS      F      P
Regression     6   9960345863  1660057644  34.34  0.000
Error         33   1595154137    48338004
Total         39  11555500000

Source      DF      Seq SS
age          1  8185071089
num. of      1    75586010
plan fin     1   531542752
500 comp     1    22521046
Hrs. of      1  1077031091
past sal     1    68593876

Unusual Observations
Obs   age  Expected     Fit  StDev Fit  Residual  St Resid
 10  33.0    110000  111467       5049     -1467    -0.31 X
 16  25.0     70000   86410       3131    -16410    -2.64R
 34  27.0    120000  101124       3461     18876     3.13R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.67

[Figure: Residual plots (response is Expected): Histogram of the Residuals; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals. Observations 16 and 34 are labeled.]

Note that the outliers are still here. Now, let’s run a best subsets regression to find out which variables are best to choose for our model.

Best Subsets Regression

Response is Expected

Predictor columns, in order: age, num. of, plan fin, 500 comp, Hrs. of, past sal

Vars   R-Sq   R-Sq(adj)    C-p        S   Predictors
   1   70.8        70.1   33.7   9417.8   X
   1   57.8        56.7   64.8    11325   X
   2   81.2        80.2   10.9   7660.7   X X
   2   75.5        74.1   24.6   8752.2   X X
   3   84.7        83.4    4.6   7012.4   X X X
   3   83.2        81.8    8.1   7334.3   X X X
   4   85.3        83.6    5.2   6971.0   X X X X
   4   85.2        83.5    5.5   7000.3   X X X X
   5   85.8        83.8    5.9   6938.4   X X X X X
   5   85.6        83.5    6.3   6983.8   X X X X X
   6   86.2        83.7    7.0   6952.6   X X X X X X

My choice is between the two possibilities in bold. One reason is that they have a small S and a relatively high R-Sq. Another reason is that C-p should be approximately P+1 = 6. Thus, I picked the four-variable model that has a C-p of 5.5 and an S of 7000.3. Here is the new regression:
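The C-p column can be reproduced from numbers reported in the regression outputs of this report. The sketch below uses the usual definition of Mallows' C-p, SSE of the subset divided by the error mean square of the full model, minus n, plus twice the number of coefficients. It takes the six-predictor regression (error MS 48338004) as the full model, which is an assumption that matches the last row of the table; the function name is illustrative.

```python
def mallows_cp(sse_subset, mse_full, n, n_predictors):
    """Mallows' C-p for a subset model with n_predictors predictors plus a constant."""
    return sse_subset / mse_full - n + 2 * (n_predictors + 1)

MSE_FULL = 48338004  # error mean square of the six-predictor regression
N = 40               # number of students

# Error sums of squares as reported in the corresponding regression outputs.
print(round(mallows_cp(1595154137, MSE_FULL, N, 6), 1))  # six predictors   -> 7.0
print(round(mallows_cp(1715135581, MSE_FULL, N, 4), 1))  # four predictors  -> 5.5
print(round(mallows_cp(1770251214, MSE_FULL, N, 3), 1))  # three predictors -> 4.6
```

The three values agree with the C-p entries for the models that are fitted in full elsewhere in this report.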

Regression Analysis

The regression equation is
Expected salary = - 57732 + 4034 age - 6828 plan fin. + 0.0975 past salary
                  + 468 Hrs. of work

Predictor        Coef     StDev       T      P   VIF
Constant       -57732     16415   -3.52  0.001
age            4034.1     753.2    5.36  0.000   2.8
plan fin        -6828      2392   -2.85  0.007   1.0
past sal      0.09755   0.09198    1.06  0.296   2.9
Hrs. of         468.5     111.7    4.19  0.000   1.6

S = 7000 R-Sq = 85.2% R-Sq(adj) = 83.5%

Analysis of Variance

Source        DF           SS          MS      F      P
Regression     4   9840364419  2460091105  50.20  0.000
Error         35   1715135581    49003874
Total         39  11555500000

Source      DF      Seq SS
age          1  8185071089
plan fin     1   536172676
past sal     1   257832450
Hrs. of      1   861288204

Unusual Observations
Obs   age  Expected     Fit  StDev Fit  Residual  St Resid
  8  29.0    125000  111564       2946     13436     2.12R
 10  33.0    110000  115013       4497     -5013    -0.93 X
 16  25.0     70000   87329       2846    -17329    -2.71R
 34  27.0    120000  101545       3129     18455     2.95R
 39  30.0    100000  107987       4449     -7987    -1.48 X
 40  31.0    110000  114851       4466     -4851    -0.90 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.57

Past salary still has a P-value above .05, so I decided to take this variable out of the regression.

Regression Analysis

The regression equation is
Expected salary = - 69455 + 4583 age - 6843 plan fin. + 501 Hrs. of work

Predictor        Coef     StDev       T      P
Constant       -69455     12157   -5.71  0.000
age            4583.5     547.7    8.37  0.000
plan fin        -6843      2396   -2.86  0.007
Hrs. of         500.9     107.7    4.65  0.000

S = 7012 R-Sq = 84.7% R-Sq(adj) = 83.4%

Analysis of Variance

Source        DF           SS          MS      F      P
Regression     3   9785248786  3261749595  66.33  0.000
Error         36   1770251214    49173645
Total         39  11555500000

Source      DF      Seq SS
age          1  8185071089
plan fin     1   536172676
Hrs. of      1  1064005021

Durbin-Watson statistic = 1.64

The Durbin-Watson statistic indicates no autocorrelation, which confirms the residuals versus order plot.

Unusual Observations
Obs   age  Expected     Fit  StDev Fit  Residual  St Resid
  8  29.0    125000  111047       2910     13953     2.19R
 10  33.0    110000  116859       4154     -6859    -1.21 X
 16  25.0     70000   87704       2829    -17704    -2.76R
 34  27.0    120000  101880       3118     18120     2.88R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
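The Durbin-Watson statistic quoted in this output compares each residual with the one before it: values near 2 point to no first-order autocorrelation, values near 0 to positive and values near 4 to negative autocorrelation. The survey residuals themselves are not listed in this report, so the sketch below uses two short made-up residual series purely to show the calculation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, not the survey data.
alternating = [1, -1, 1, -1]  # sign flips every step: negative autocorrelation
drifting = [1, 1, -1, -1]     # neighbours tend to agree: positive autocorrelation

print(durbin_watson(alternating))  # 3.0
print(durbin_watson(drifting))     # 1.0
```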

[Figure: Residual plots (response is Expected): Residuals Versus the Order of the Data; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals; Histogram of the Residuals. Observations 16 and 34 are labeled.]

Now we have a statistically significant model in which every predictor has a P-value below .05. However, two outliers are still visible in the residual plots. We can try to get rid of these 2 outliers (observations 34 and 16).
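The St Resid column that flags observations 16 and 34 can be recovered from the other columns of the Unusual Observations table. The sketch below assumes the standard relationships: leverage equals (StDev Fit / S) squared, and the standardized residual equals the residual divided by S times the square root of one minus the leverage. The function name is illustrative.

```python
import math

def standardized_residual(residual, stdev_fit, s):
    """Standardized residual from the raw residual, the standard deviation
    of the fitted value and the regression standard deviation S."""
    leverage = (stdev_fit / s) ** 2
    return residual / (s * math.sqrt(1 - leverage))

S = 7012  # from the three-predictor regression on all 40 students

print(round(standardized_residual(-17704, 2829, S), 2))  # observation 16
print(round(standardized_residual(18120, 3118, S), 3))   # observation 34
```

Both results match the reported -2.76 and 2.88 up to the rounding of the printed inputs.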

Regression Analysis

The regression equation is
Expected salary = - 65739 + 4508 age - 6676 plan fin. + 475 Hrs. of work

Predictor        Coef     StDev       T      P   VIF
Constant       -65739     10088   -6.52  0.000
age            4508.2     474.3    9.51  0.000   1.6
plan fin        -6676      2071   -3.22  0.003   1.0
Hrs. of         474.7     100.6    4.72  0.000   1.6

S = 5742 R-Sq = 88.9% R-Sq(adj) = 87.9%

Analysis of Variance

Source        DF           SS          MS      F      P
Regression     3   8941383570  2980461190  90.41  0.000
Error         34   1120826956    32965499
Total         37  10062210526

Source      DF      Seq SS
age          1  7937259218
plan fin     1   270764351
Hrs. of      1   733360001

Unusual Observations
Obs   age  Expected     Fit  StDev Fit  Residual  St Resid
  8  29.0    125000  110091       2846     14909     2.99R
 10  33.0    110000  116257       3415     -6257    -1.36 X
 13  24.0     70000   80430       2798    -10430    -2.08R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.65

[Figure: Residual plots (response is Expected): Residuals Versus the Order of the Data (observation 8 is labeled); Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals]

Regression Analysis

The regression equation is
Expected salary = - 63050 + 4653 age - 4558 plan fin. + 351 Hrs. of work

Predictor        Coef     StDev       T      P   VIF
Constant       -63050      8827   -7.14  0.000
age            4652.8     415.5   11.20  0.000   1.6
plan fin        -4558      1908   -2.39  0.023   1.1
Hrs. of        351.10     94.81    3.70  0.001   1.7

S = 5004 R-Sq = 90.1% R-Sq(adj) = 89.2%

Analysis of Variance

Source        DF          SS          MS      F      P
Regression     3  7482915093  2494305031  99.63  0.000
Error         33   826165988    25035333
Total         36  8309081081

Source      DF      Seq SS
age          1  7066013767
plan fin     1    73551726
Hrs. of      1   343349601

Unusual Observations
Obs   age  Expected     Fit  StDev Fit  Residual  St Resid
  9  33.0    110000  115070       2996     -5070    -1.27 X
 34  27.0     70000   80840       1093    -10840    -2.22R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.00
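To make the fitted equation concrete, the sketch below plugs values into the 37-observation model using the coefficients from the Predictor table. The example student (age 27, planning finance, 65 hours a week) is hypothetical and chosen only for illustration.

```python
def predict_expected_salary(age, plan_fin, hours):
    """Fitted expected salary from the three-predictor model on 37 observations.
    plan_fin is 1 for a student planning to work in finance, 0 otherwise."""
    return -63050 + 4652.8 * age - 4558 * plan_fin + 351.10 * hours

# Hypothetical student: 27 years old, plans finance, expects to work 65 hours a week.
finance = predict_expected_salary(27, 1, 65)
other = predict_expected_salary(27, 0, 65)
print(round(finance))          # about 80839
print(round(other - finance))  # the finance coefficient with its sign flipped
```

Holding age and hours fixed, the model predicts about 4558 less for a student planning finance, and each extra weekly hour adds about 351.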

[Figure: Residuals Versus the Fitted Values (response is Expected)]

Heteroscedasticity is non-constant variance. The plot of residuals versus fitted values suggests that the variance is not constant. To explore this further, we would have to run Levene's test. Hopefully, logging the response will take care of this. Let’s do a regression with logged expected salaries.
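As a rough sketch of what such a test involves, the code below computes a Levene-type statistic in its median-based form (often called the Brown-Forsythe variant): residuals are split into groups, for example low and high fitted values, and the absolute deviations from each group median are compared with a one-way ANOVA ratio. The grouping and the residual values are made up for illustration, since the report does not list the residuals, and the function name is hypothetical.

```python
from statistics import median

def brown_forsythe_statistic(groups):
    """Levene-type W statistic using absolute deviations from each group median.
    Under equal variances W is compared with an F(k - 1, N - k) distribution."""
    z = [[abs(v - median(g)) for v in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    group_means = [sum(g) / len(g) for g in z]
    grand_mean = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(z, group_means))
    within = sum((v - m) ** 2 for g, m in zip(z, group_means) for v in g)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical residuals for low and high fitted values (not the survey data).
low_fit = [-1, 0, 1]
high_fit = [-4, 0, 4]
w = brown_forsythe_statistic([low_fit, high_fit])
print(round(w, 3))
```

A large W relative to the F critical value would support the visual impression of non-constant variance.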

[Figure: Residual plots (response is Expected): Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals; Histogram of the Residuals]

Let’s check the standardized residuals, leverage values and Cook’s distances.

   SRES5        HI5     COOK5
 0.59872   0.079655  0.007756
 0.27269   0.116828  0.002459
-0.01353   0.076361  0.000004
 0.97156   0.103741  0.027315
-1.26510   0.044574  0.018667
 1.15005   0.063633  0.022470
 1.27035   0.105834  0.047753
 1.99149   0.152316  0.178160
-1.26524   0.358517* 0.223670
-0.73194   0.143014  0.022351
-0.15638   0.058155  0.000378
-1.58437   0.284480  0.249509
-0.91533   0.063633  0.014234
 0.76627   0.063603  0.009971
 0.23355   0.128017  0.002002
-1.44970   0.231559  0.158323
-0.75144   0.054100  0.008074
-0.39198   0.079616  0.003323
 1.54343   0.100642  0.066644
-1.68307   0.092249  0.071967
 0.94590   0.060364  0.014370
-0.01353   0.076361  0.000004
 0.35320   0.063603  0.002118
-0.30542   0.095103  0.002451
 0.59163   0.084579  0.008085
 1.66542   0.197385  0.170528
 0.21357   0.105834  0.001350
-1.20202   0.170243  0.074111
 0.04569   0.064749  0.000036
 0.42040   0.043466  0.002008
 0.37278   0.145178  0.005900
 1.33865   0.097529  0.048414
-0.91533   0.063633  0.014234
-2.22013   0.047736  0.061771
-0.15638   0.058155  0.000378
-0.38165   0.091066  0.003648
 0.38048   0.134487  0.005624

Leverage values should be less than 2.5*(p+1)/n = 2.5*(4+1)/37 = .34. One leverage value, about .36 (marked *), is slightly above this cutoff. Cook’s distance should be less than 1, which is true for every observation (the largest is about .25).
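The COOK5 column is tied to the other two columns by a simple formula: Cook's distance equals the squared standardized residual times leverage / (1 - leverage), divided by the number of estimated coefficients. The sketch below assumes four coefficients (the constant plus age, plan fin. and Hrs. of work), which reproduces the reported values to within rounding; the function name is illustrative.

```python
def cooks_distance(std_residual, leverage, n_coefficients):
    """Cook's distance from a standardized residual and a leverage value."""
    return std_residual ** 2 * leverage / ((1 - leverage) * n_coefficients)

# (SRES5, HI5) pairs copied from the table above: the first row,
# the starred high-leverage row, and the row with the largest COOK5.
rows = [(0.59872, 0.079655), (-1.26524, 0.358517), (-1.58437, 0.284480)]

for sres, hi in rows:
    print(round(cooks_distance(sres, hi, 4), 5))
```

The three results land on the reported 0.007756, 0.223670 and 0.249509 up to the last printed digit, so even the highest-leverage observation stays well below 1.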

Regression Analysis

The regression equation is
logexp = 4.18 + 0.0231 age - 0.0256 plan fin. + 0.00186 Hrs. of work

Predictor         Coef      StDev       T      P   VIF
Constant       4.18204    0.04868   85.92  0.000
age           0.023056   0.002291   10.06  0.000   1.6
plan fin      -0.02560    0.01052   -2.43  0.021   1.1
Hrs. of      0.0018643  0.0005228    3.57  0.001   1.7

S = 0.02759 R-Sq = 88.3% R-Sq(adj) = 87.2%

Analysis of Variance

Source        DF        SS        MS      F      P
Regression     3  0.188991  0.062997  82.74  0.000
Error         33  0.025126  0.000761
Total         36  0.214116

Source      DF    Seq SS
age          1  0.176882
plan fin     1  0.002427
Hrs. of      1  0.009681

Unusual Observations
Obs   age   logexp      Fit  StDev Fit  Residual  St Resid
  8  25.0  4.90309  4.85166    0.01077   0.05143     2.02R
  9  33.0  5.04139  5.07340    0.01652  -0.03201    -1.45 X
 20  25.0  4.77815  4.83539    0.00838  -0.05724    -2.18R
 34  27.0  4.84510  4.90015    0.00603  -0.05505    -2.04R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.34

[Figure: Residuals Versus the Fitted Values (response is log exp)]

It appears that there is still non-constant variance. The only reasonable thing to do now is weighted least squares.
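Weighted least squares gives each observation a weight, usually the inverse of its estimated error variance, so that noisier observations pull less on the fit. The sketch below shows the idea for a single predictor only, a simplification of the three-predictor model used here, with made-up data and weights; none of the numbers come from the survey and the function name is hypothetical.

```python
def weighted_least_squares(x, y, w):
    """Intercept and slope minimizing the weighted sum of squared errors
    sum(w_i * (y_i - a - b * x_i) ** 2) for one predictor."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: the last point is trusted twice as much as the others.
x = [0, 1, 2]
y = [0, 1, 4]
weights = [1, 1, 2]

intercept, slope = weighted_least_squares(x, y, weights)
print(round(intercept, 4), round(slope, 4))
```

With equal weights the same function returns the ordinary least squares line, so the effect of the weighting can be seen by comparing the two fits.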

[Figure: Residual plots (response is logexp): Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals; Histogram of the Residuals]
