CONFERENCE PROCEEDINGS BY TRACK

2018 MIDWEST DSI ANNUAL CONFERENCE INDIANAPOLIS, INDIANA

CONFERENCE CHAIR: PEGGY LEE DANIELS, IUPUI

PROCEEDINGS COORDINATORS: HAOJIE CHEN, VALPARAISO UNIVERSITY; ABBY BRIDWELL, VALPARAISO UNIVERSITY


TABLE OF CONTENTS

PREPARED BY ABBY BRIDWELL AND HAOJIE CHEN

Tracks Included:

Business Analytics
Finance and Accounting
Innovative Education
Marketing
Operations and Supply Chain Management

BUSINESS ANALYTICS

Comparison and contrast of Statistics Software Packages including R and Python for teaching purposes Ceyhun Ozgur, Valparaiso University Sanjeev Jha, Valparaiso University Yiming Shen, Valparaiso University ……………………………………………………Page 1-23

Forecasting Intermittent Demand Patterns with Time Series and Machine Learning Methodologies Yuwen Hong, Purdue University Jingda Zhou, Purdue University Matthew A. Lanham, Purdue University ………………………………………………Page 24-48

A Proposed Data Analytics Workflow and Example Using the R Caret Package Simon Jones, Purdue University Zhenghao Ye, Purdue University Zhuoheng Xie, Purdue University Chris Root, Purdue University Theerakorn Prasutchai, Purdue University Michael Roggenburg, Purdue University Matthew A. Lanham, Purdue University …………………….……………………...Page 49-71

Recruitment Analytics: An Investigation of Program Awareness & Matriculation Liye Sun, Purdue University Matthew A. Lanham, Purdue University ……………………………………………….Page 72


FINANCE AND ACCOUNTING

A Value at Risk Argument for Dollar Cost Averaging Laurence E. Blose, Grand Valley State University Eric Hoogstra, Grand Valley State University ……………………………………..…Page 73-89

INNOVATIVE EDUCATION

Developing an Innovative Supply Chain Management Major and Minor Curriculum Sanjay Kumar, Valparaiso University Ceyhun Ozgur, Valparaiso University Coleen Wilder, Valparaiso University Sanjeev Jha, Valparaiso University ………………………………………………….Page 90-112

Disruptive Innovation & Sustainable Value: The Implications of Disruptive Innovation on the Outcome of RE Businesses in Developed Economies ……………………………………………Page 113-134

MARKETING

Carrier Choice Optimization with Tier Based Rebate for a National Retailer Surya Gundavarapu, Purdue University Matthew A. Lanham, Purdue University……………………………………………..Page 135-152

"I’ve Been Chain-ged" La Saundra Pendleton Janaina Siegler……………………………………………………………………………Page 153

Role of Political Identity in Friendship Networks Surya Gundavarapu, Purdue University Matthew A. Lanham, Purdue University……………………………………………..Page 154-169

XGBoost - A Competitive Approach for Online Price Prediction Joshua D. McKenney, Purdue University Yuqi Jiang, Purdue University Junyan Shao, Purdue University Matthew A. Lanham, Purdue University………………………………………….….Page 170-185


OPERATIONS AND SUPPLY CHAIN MANAGEMENT

A Comparative Study of Machine Learning Frameworks for Demand Forecasting Kalyan Mupparaju, Purdue University Anurag Soni, Purdue University Prasad Gujela, Purdue University Matthew A. Lanham, Purdue University……………………………………………Page 186-206

Does Advance Warning Help Mitigate the Impact of Supply Chain Disruptions? Sourish Sarkar, Pennsylvania State University—Erie Sanjay Kumar, Valparaiso University……………………………………………….Page 207-208

Effect of Forecast Accuracy on Inventory Optimization Model Surya Gundavarapu, Purdue University Prasad Gujela, Purdue University Shan Lin, Purdue University Matthew A. Lanham, Purdue University……………………………………………..Page 209-230

The Evolution of the Open Source ERP iDempiere Community Network: A Preliminary Analysis Zhengzhong Shi, University of Massachusetts at Dartmouth Hua Sun, Shandong University………………………………………………………Page 231-241

Future LPG Shipments Forecasting Based on Empty LPG Vessels Data Jou-Tzu Kao, Purdue University Rong Liao, Purdue University Hongxia Shi, Purdue University Joseph Tsai, Purdue University Shenyang Yang, Purdue University Matthew A. Lanham, Purdue University……………………………………………..Page 242-256

Online Small-Group Learning Pedagogies for the 21st Century Classrooms Sema Kalaian, Eastern Michigan University……………………………………………..Page 257

Opportunities for Enhancing Buyer-Supplier Relationship: Inspirations from the Natural World ……………………………………………………………………………………….Page 258-286


Optimal Clustering of Products for Regression-Type and Classification-Type Predictive Modeling for Assortment Planning Raghav Tamhankar, Purdue University Sanchit Khattar, Purdue University Xiangyi Che, Purdue University Siyu Zhu, Purdue University Matthew A. Lanham, Purdue University……………………………………………Page 286a-303

Reducing the Cost of International Trade Through the Use of Foreign Trade Zones Gary Smith, Penn State Erie…………………………………………………………Page 304-321

Risky Business: Predicting Cancellations in Imbalanced Multi-Classification Settings Anand Deshmukh, Purdue University Meena Kewlani, Purdue University Yash Ambegaokar, Purdue University Matthew A. Lanham, Purdue University……………………………………………..Page 322-352

Vehicle Routing, Scheduling and Decision Utility Environment Ceyhun Ozgur, Valparaiso University Claire Okkema, Valparaiso University Yiming Shen, Valparaiso University…………………………………………………Page 353-362


Comparison and Contrast of Statistics Software Packages Including R and Python for Teaching Purposes

Ceyhun Ozgur, Valparaiso University; Sanjeev Jha, Valparaiso University; Yiming Shen, Valparaiso University

Abstract

This paper shows the advantages and disadvantages of using Python and R to teach various types of students, based on the latest data. We also compare Python and R in solving business problems for actual companies, and we give examples of how to use both languages; for instance, we provide examples of teaching the correlation coefficient with both Python and R. We propose three teaching goals for Python and R: (1) helping students find a job; (2) helping students earn a higher salary; and (3) keeping the cost of learning the packages low, which both Python and R satisfy.

Keywords: Big data, Teaching R, Teaching Python, demonstrations, examples of teaching R and Python

Python

Python is an easy-to-learn, general-purpose programming language. It has a mature and growing ecosystem of open-source tools for mathematics, statistics, and data analysis, and it is becoming a language of choice for scientists and researchers. Python code can be written like a conventional program, executing an entire series of instructions at once, but it can also be executed line by line or block by block, which makes it well suited for working with data interactively. In traditional introductory courses, programming and problem solving were relegated to advanced courses, and the dominance of object-oriented programming led textbook authors to bring powerful, industrial-strength languages such as C++ and Java into the introductory software curriculum. As a result, instead of experiencing the rewards and excitement of solving problems with computers, beginning computer science students become overwhelmed by the combined tasks of mastering advanced concepts and the syntax of a programming language. This research uses the Python programming language as a way of making undergraduate mathematics courses more manageable for students and instructors alike.

• Python has simple, conventional syntax. Expressions are written with the conventional notation found in algebra, so one can spend less time dealing with the syntax of a programming language and more time solving problems.


• Python makes it easy for beginners to write simple programs, while also supporting object-oriented software development.
• Python is interactive. Users can enter expressions and statements to try out experimental code and receive immediate feedback.
• Python is free and in widespread use in industry. Students can download Python to run on a variety of devices.

Figure 1 shows the most commonly used software platforms among our respondents. Python and R are the most heavily used platforms; their usage shares are only 17 percentage points apart, with Python in the lead. Based on responses from more than 1,000 developers, 61% use Python as their primary programming language (Python Developers Survey 2016: Findings).

Figure 1: Share of Python, R, both, or other platform usage for analytics, data science, and machine learning [bar chart].
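As a minimal R sketch of the kind of correlation-coefficient classroom demonstration the abstract describes (the variable names and values here are hypothetical, for illustration only):

# Hypothetical classroom example: correlation between study hours and exam scores
hours  <- c(2, 4, 5, 7, 8, 10, 11, 13)
scores <- c(55, 60, 62, 70, 74, 80, 83, 90)

# Pearson correlation coefficient
cor(hours, scores)

# Correlation test with confidence interval and p-value
cor.test(hours, scores)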


Table 1: Reported users of each software tool (out of 7,955 survey respondents)

Tool                      Respondents   Users
Python                    7,955         6,073
R                         7,955         4,708
SQL                       7,955         4,261
Java                      7,955         1,453
Hadoop/Hive/Pig           7,955         1,378
SAS Base                  7,955         738
IBM SPSS Statistics       7,955         472
Amazon Machine Learning   7,955         425
Minitab                   7,955         150

Figure 2 shows the use of Python across types of software development. About 46% of developers use Python for data analysis or scientific data analysis rather than for traditional software or web development.

Figure 2: Use of Python in software development

Other: 14%; Web development: 22%; Software development: 18%; Data analysis: 19%; Scientific or data analysis: 27%. (Python Developers Survey 2016: Findings)


R

R is open-source software designed to run statistical analyses, which makes it highly appealing because it can keep up with the demands of a growing number of varied business structures. One of the core benefits of teaching R to students is that it is a highly standardized programming language (Economist 662f, 2014). This means that students need not worry about highly variable language structures or a deep understanding of several different languages; instead, they are free to focus on the analytics side of R and understand its applications in real-world settings [C. Ozgur, T. Colliau, G. Rogers, Z. Hughes, B. Myer-Tyson, 2017]. As students learn the basics of R and become accustomed to the software, they can see the bidirectional effect it has on the current job and data analytics market [Rickert, 2014]. The software grew in popularity in response to the increasing demand for big data analytics and the need for programs that can handle massive data files. As R grew by filling this need, it also changed how computational finance is done; the availability of a new and versatile tool tends to reframe at least a portion of the discussion around calculation and application [C. Ozgur, B. Myer-Tyson, J. Coto, Y. Shen, 2017].

One of the most common tasks in statistics is comparing groups. To determine whether extra vitamins in a cow's diet increase milk production, you assign the extra vitamins to a test group and normal vitamins to a control group, and then compare milk production in the two groups. R provides two standard tests for comparing two groups with numerical data: the t-test with the t.test() function and the Wilcoxon test with the wilcox.test() function. To use the t.test() function, you first have to check, among other things, whether both samples are normally distributed; for the Wilcoxon test, this is not necessary. (A short R sketch of this comparison appears after the questions below.)

Table 2: Python vs. R in Google citations

The following questions guide this comparison:
- How can we use R in data analysis in higher education?
- What are the advantages of Python over R?
- What are the advantages of R over Python?
- Which companies use R, and which companies use Python?
- How are Python and R used in teaching at the undergraduate level?
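As a concrete illustration of the group comparison just described, here is a minimal R sketch; the milk-production numbers are made up for demonstration:

# Hypothetical milk production (liters/day) for a vitamin-supplemented test group and a control group
test_group    <- c(28.1, 30.4, 29.5, 31.2, 30.0, 29.8, 32.1, 30.6)
control_group <- c(27.0, 28.2, 27.8, 29.1, 28.5, 27.6, 28.9, 28.0)

# Check normality of each sample before relying on the t-test
shapiro.test(test_group)
shapiro.test(control_group)

# Two-sample t-test (assumes approximate normality)
t.test(test_group, control_group)

# Wilcoxon rank-sum test (no normality assumption needed)
wilcox.test(test_group, control_group)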


Demonstration of teaching topics with R and Python

Table 3: Comparisons of R vs. Python books and papers

Topic                        Books-R    Books-Python   Ratio R/Python   Papers-R     Papers-Python   Ratio R/Python
programming for statistics   207        13             15.92            2,680,000    47,600          56.3
Data                         6,970      310            22.48            9,290,000    205,000         45.32
statistics                   4,235      42             100              5,470,000    104,000         52.59
model                        8,631      126            68.5             7,830,000    163,000         48.03
code                         6,084      105            57.94            4,650,000    158,000         29.43
(average)                    5,225.4    119.2          52.968           5,984,000    135,520         46.334

Search term          Google results-R   Google results-Python   Ratio R/Python
XX+programming       2,430,000          72,700,000              0.33
XX+data collection   4,200,000          1,520,000               2.76
XX+statistics        108,000,000        29,000,000              3.72
XX+model             154,000,000        8,270,000               18.62
XX+code              130,000,000        2,710,000               47.97
(average)            79,726,000         10,375,000              15


Blogs

On blogs, people write about the software that interests them, problem-solving methods, and their interpretation of events in the field. Blog posts contain a great deal of information about their topics, and maintaining a blog requires effort. Therefore, the number of bloggers writing about analytics software is a potential measure of popularity or market share. Unfortunately, counting the number of relevant blogs is difficult. Software such as Java, Python, the C language, and MATLAB has many bloggers commenting on general programming topics rather than just analytics, and separating the two is not easy; a blog's title may not give a clue about what its articles will cover. Another problem arises because some companies write a newsletter while others maintain a set of blogs, each written by several different people in the same company; the individual blogs may be combined into a single blog, which inflates the count. Statsoft and Minitab are examples of companies that both write blogs and offer software. What is really interesting is not when company employees write blogs, but when outside volunteers take the initiative. In some cases, blogs are maintained by blog consolidators, who combine blogs into a "metablog." All one has to do is find such lists and count the blogs; no attempt was made to separate blended blogs within such lists. The results are shown in Figure 3.

Figure 3: Number of blogs devoted to each software package on April 7, 2014, and the source of the data [bar chart: R 550, Python 60, SAS 40, Stata 11].

R's blogs number an impressive 550. For Python, only 60 blogs devoted to the SciPy subroutine library were found. SAS's 40 blogs is an impressive figure given that Stata had only 11 blogs.


While searching for lists of blogs related to software, individual blogs related to software were also found. Unfortunately, that list was not kept up to date and would be far too time-consuming to maintain. If you know of other lists of relevant blogs, please inform us and they will be added to the list.

Figure 4: Tool preference by education (http://www.burtchworks.com/2017/06/19/2017-sas-r-python-flash-survey-results/)

Figure 5: Tool preference by years of experience


(http://www.burtchworks.com/2017/06/19/2017-sas-r-python-flash-survey-results/)

Figure 6: The change in usage of the statistics software packages R, Python, SAS, and SPSS in the United States from 2004 to 2017 [Google Trends line chart] (https://trends.google.com/trends/explore?date=all&geo=US&q=python,R,SAS)

1. Helping students to find a job

(1) Trends of change for employers (for example, R steady, SAS decreasing, Python increasing); (2) trends for employees.

In Figure 7 we can see employers searching for qualified graduates for various positions. The three software languages (R, SAS, and Python) have different growth rates. R's growth rate stayed stable but increased in 2017. SAS also stayed stable over the past two years; during the 2016-2017 fiscal year there was an increase to 20%, but it decreased to 15% during the 2017-2018 fiscal year. Python has shown steady growth, from 5% in 2014 to approximately 13% in 2017.


Figure 7: Job postings mentioning statistical software packages from 2014 to 2017 (https://www.indeed.com/jobtrends/q-R-statistics-q-SAS-statistics-q-python-statistics).

In Figure 8 we see jobseeker interest in the three software languages (R, SAS, and Python). Interest has fluctuated considerably. For example, interest in SAS fluctuated upward from 2014 until 2016, when it dropped significantly; since the drop it has stayed roughly constant. R showed a similar fluctuating trend, and by 2017 the interest level in both packages was approximately the same. Finally, Python showed steady growth in jobseeker interest from 2014 to 2016, a large spike during the 2016-2017 fiscal year, and then a slow decrease.

Figure 8: Jobseeker interest in statistical software packages from 2014 to 2017.


(https://www.indeed.com/jobtrends/q-R-statistics-q-SAS-statistics-q-python-statistics.html)

Figure 9: Education of data scientists and data analysts based in the US (based on 3,790 responses). Most data scientists have a Master's degree (44.5%), followed by a Bachelor's degree (26.5%); a Doctoral degree is a close third (20.7%).

2. Helping students earn a higher salary

[Figure 9 bar chart; categories: Master's degree, Bachelor's degree, Doctoral degree, community college, Professional degree, High school, I prefer not to answer]


On average, data scientists make $110k a year.

Figure 10: Salary of data scientists and data analysts based in the US.

Companies such as United States Homeland Security, Zillow, Cdiscount, and Statoil have created challenges using various software languages; individuals who succeed in a challenge receive a cash reward (https://www.kaggle.com/competitions?sortBy=prize&group=active&page=1&pageSize=20).

Figure 11: Comparison of salaries in different disciplines against data scientists (data analysts)


[Figure 11 bar chart comparing the $110,000 data scientist salary with salaries in other disciplines, roughly $45,000 to $75,000.]

Figure 12: Spectrum and Jobs Ranking of software languages in 2017


The IEEE Spectrum ranking combines 12 metrics from 10 different sites. Some of the measures capture popularity on job sites and search engines, while others track how much new programming code has been added to GitHub over the last year. (https://spectrum.ieee.org/ns/IEEE_TPL_2017/comparison/2017/2016/1/1/1/1/1/50/1/50/1/50/1/30/1/30/1/30/1/20/1/20/1/5/1/5/1/20/1/100/true/1/25/1/25/1/50/1/25/1/25/1/50/1/25/1/25/1/100/1/100/1/25/1/40/)

3. Low cost for learning the packages

(1) Price. Table 4 lists various software packages with their cost: who can buy each package, its price, where to buy it, and what the system requires. The table does not list Python or R, because these two packages are open source and free of charge.

Table 4: The cost of popular software packages

(2) Easy to learn: R code vs. Python code.

Figure 13: Comparison of AT&T Inc. (T) and Verizon Communications Inc. (VZ) stocks from 2012-2017 using Python and R
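The code behind the stock comparison in Figure 13 is not reproduced here; one way such a comparison might be written in R is sketched below, assuming the quantmod package and Yahoo Finance as the data source (both assumptions; the authors do not state which tools they used):

library(quantmod)

# Download adjusted closing prices for AT&T (T) and Verizon (VZ), 2012-2017
att <- getSymbols("T",  src = "yahoo", from = "2012-01-01", to = "2017-12-31", auto.assign = FALSE)
vz  <- getSymbols("VZ", src = "yahoo", from = "2012-01-01", to = "2017-12-31", auto.assign = FALSE)

prices <- merge(Ad(att), Ad(vz))
colnames(prices) <- c("ATT", "VZ")

# Index both series to 100 at the start so they can be compared on one plot
indexed <- sweep(coredata(prices), 2, coredata(prices)[1, ], "/") * 100
plot(index(prices), indexed[, "ATT"], type = "l", col = "blue",
     xlab = "Date", ylab = "Adjusted close (indexed, start = 100)")
lines(index(prices), indexed[, "VZ"], col = "red")
legend("topleft", legend = colnames(prices), col = c("blue", "red"), lty = 1)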


Conclusion

Figure 13: Recommended first statistical software language to learn [bar chart: Python 63%, R 24%, with the remaining languages at 4% or less].

As the figure shows, the recommended first language is Python: 63.1% of the respondents agreed that Python is the best first language, while 24.0% chose R. Only 9 of the 13 statistical software languages listed were chosen by respondents, and 87% of the respondents use either Python or R as their primary language. As discussed above, we provided examples of teaching the correlation coefficient with both Python and R, and we proposed three teaching goals for Python and R: (1) helping students find a job; (2) helping students earn a higher salary; and (3) keeping the cost of learning the packages low. Python can be used on the internet whereas R cannot, and Python is a more useful software

language than R, because it can handle bigger data sets. Python is used for developing web sites and applications, while R is used for enterprise, desktop, and scientific applications.

REFERENCES

T. Colliau, G. Rogers, Z. Hughes & C. Ozgur, "MatLab vs. Python vs. R," 2016 Annual Meeting of the Midwest Decision Sciences Institute (MWDSI), April 15, 2016, pp. 59-70.

D. Esler, The Pig in the Python II (May 2004), https://en.oxforddictionaries.com/definition/pig_in_the_python

K. Goldfeld, Simstudy update: two new functions that generate correlated observations from non-normal distributions (July 4, 2017), https://www.r-bloggers.com/simstudy-update-two-new-functions-that-generate-correlated-observations-from-non-normal-distributions/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

J. Grus, Data science from scratch: first principles with Python (2015), https://www.intra.valpo.edu/librarydocs/DataSciencefromScratch

D. Masad, Introduction to Python for Analytics, https://www.statistics.com/python-for-analytics/

Ž. Ivezić, A. J. Connolly, J. T. VanderPlas, and A. Gray, Statistics, data mining, and machine learning in astronomy: a practical Python guide for the analysis of survey data (2014), http://galileo.valpo.edu/record=b1632807~S0

M. Kleckner, C. Ozgur & C. Wilder, "Choice of Software for Business Analytics Courses," 2014 Annual Meeting of the Midwest Decision Sciences Institute, April 2014, pp. 69-87.

P. Krill, Businesses stick with Java, Python, and C (Aug 2, 2016), https://search.proquest.com/docview/1808338164?pq-origsite=summon

K. Lambert, Fundamentals of Python: Data Structures (Oct. 2013), https://ebookcentral.proquest.com/lib/valpo-ebooks/reader.action?docID=3136674

C. Ozgur, B. Myer-Tyson, J. Coto, Y. Shen, "A comparative study of network modeling using a relational database (i.e. Oracle, mySQL, SQLServer, Hadoop) vs. Neo4j," 2017 Annual Meeting of the Midwest Decision Sciences Institute (MWDSI), April 2017, pp. 156-165.


C. Ozgur, T. Colliau, G. Rogers, Z. Hughes & B. Myer-Tyson, "MatLab vs. Python vs. R," Journal of Data Science, Vol. 15, No. 3, July 2017, pp. 355-372.

P. Dalgaard, Introductory statistics with R, 2nd ed., Statistics and Computing (2008), http://galileo.valpo.edu/record=b1550827~S0

PyCharm (January 2017), Python Developers Survey 2016: Findings, https://www.jetbrains.com/pycharm/python-developers-survey-2016/

Pharma Business Week, DataCamp Partners with Anaconda Powered by Continuum Analytics to Bring Open Data Science Education for Python to the Masses (Nov. 2016), http://www.businesswire.com/news/home/20161117005903/en/DataCamp-Partners-Anaconda-Powered-Continuum-Analytics-Bring

D. Robinson, Teach the tidyverse to beginners (July 5, 2017), https://www.r-bloggers.com/teach-the-tidyverse-to-beginners/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

D. Robinson, Modeling gene expression with broom: a case study in tidy analysis (November 25, 2015), http://varianceexplained.org/r/tidy-genomics-broom/

D. Smith, How perceptions of R have changed (July 5, 2017), https://www.r-bloggers.com/how-perceptions-of-r-have-changed/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

D. Smith, R Then and Now (Jul 5, 2017), https://www.slideshare.net/RevolutionAnalytics/r-then-and-now

D. Smith, More Companies using R (July 04, 2017), http://blog.revolutionanalytics.com/2017/07/more-companies-using-r.html

V. Tsakalos, Data wrangling: Transforming (July 5, 2017), https://www.r-bloggers.com/data-wrangling-transforming-13/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29


A. de Vries and J. Meys, R for Dummies (2015-06-18), John Wiley & Sons, Incorporated, https://ebookcentral.proquest.com/lib/valpo-ebooks/detail.action?docID=4040828

Robert A. Muenchen (6/19/2017), The Popularity of Data Science Software, http://r4stats.com/articles/popularity/

burtchworks.com (6/19/2017), 2017 SAS, R, or Python Flash Survey Results, http://www.burtchworks.com/2017/06/19/2017-sas-r-python-flash-survey-results/

Jules Kouatchou (12/15/2016), Basic Comparison of Python, Julia, R, Matlab and IDL, https://modelingguru.nasa.gov/docs/DOC-2625

John King, Roger Magoulas (2017), 2016 Data Science Salary Survey, http://www.analyticsearches.com/site/files/776/66977/259607/765537/2016-data-science-salary-survey.

Gregory Piatetsky (2017), Python overtakes R, becomes the leader in Data Science, Machine Learning platforms, https://www.kdnuggets.com/2017/08/python-overtakes-r-leader-analytics-data-science.html

John Fox, Allison Leanage (September 2016), R and the Journal of Statistical Software, Journal of Statistical Software, Volume 73, Issue 2.

About Research Analytics licensing and technical support at IU, https://rt.uits.iu.edu/visualization/analytics/software-licensing-support.php

Data source tools:
https://trends.google.com
https://www.indeed.com/jobtrends
https://spectrum.ieee.org
https://www.kaggle.com/surveys/2017


Forecasting Intermittent Demand Patterns with Time Series and Machine Learning Methodologies

Yuwen Hong, Jingda Zhou, Matthew A. Lanham
Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907
[email protected]; [email protected]; [email protected]

ABSTRACT

Intermittent demand refers to random and low-volume demand. It appears irregularly, with a large proportion of zero values between demand periods. The unpredictable nature of intermittent demand poses challenges to companies managing sophisticated inventory systems, incurring excessive inventory or stockout costs. In order to provide accurate predictions, previous studies have proposed the use of exponential smoothing, Croston's method, and its variants. However, due to their bias and limitations, none of the classical methods has demonstrated adequate accuracy across datasets. Moreover, very few studies have explored new techniques to keep up with ever-changing business needs. Therefore, this study aims to examine the predictive accuracy of various machine learning approaches, along with the widely used Croston's method, for intermittent time-series forecasting. Using multiple multi-period time-series, we would like to see if there is a method that tends to capture intermittent demand better than others. In collaboration with a supply chain consulting company, we investigated over 160 different intermittent time series to identify what works best.

Keywords: demand forecasting, intermittent demand, machine learning, model comparison

INTRODUCTION

Intermittent demand comes into existence when demand occurs sporadically (Snyder, Ord, & Beaumont, 2012). It is characterized by small amounts of random demand with a large proportion of zero values, which incurs costly forecasting errors in the form of unmet demand or obsolescent stock (Snyder et al., 2012). Because of its irregularity and unpredictable zero values, intermittent demand is typically associated with inaccurate forecasting. As a result, companies either risk losing sales and customers when items are out of stock, or are burdened with excessive inventory cost. According to surveys by Deloitte (2013 Corporate development survey report | Deloitte US | Corporate development advisory), the world's largest manufacturing companies are burdened with excessive inventory costs. Those having more than $1.5 trillion in revenue spent an average of 26% on their service operations. Therefore, small improvements in the prediction accuracy of intermittent demand often translate into significant savings (Syntetos, Zied Babai, & Gardner, 2015). Intermittent demand is not only costly, but also common in organizations dealing with service parts inventories and capital goods such as machinery. Those inventories are typically slow-moving and demonstrate great variety in their nonzero values (Cattani, Jacobs, & Schoenfelder, 2011; Hua, Zhang, Yang, & Tan, 2007). Previous research has tackled the specific concerns related to intermittent demand from various perspectives. Some studies pay attention to prediction distributions dependent on the period of time (e.g., Syntetos, Nikolopoulos, & Boylan, 2010) and concentrate on managing inventory over lead-time, while others examine measurement performance on either the entire prediction distribution or point predictions (e.g., Snyder et al., 2012). Earlier research implemented classic time-series models including exponential smoothing and moving averages. However, those models are designed for high demand arriving in regular intervals, and thus fail to address the specific challenges of intermittent demand problems. More recent work tried to solve this problem with models specifically designed for low-volume, sporadic demand, including Croston's

method, Bernoulli process models, and Poisson models. Despite the improvement in performance, those

methods do not provide sufficient inventory recommendations (Smart, n.d.). Some recent research, however, has turned to exploring new algorithms to improve predictive accuracy (e.g., Kourentzes, 2013). Despite these efforts, there is no universal method that can handle the ever-changing business need for accurate demand forecasting. This paper approaches the business concerns from an analytic perspective, leveraging analytical tools such as Python and R. Specifically, the current research aims to examine the predictive accuracy of various machine learning approaches, along with the widely used Croston's method, for intermittent time-series forecasting. Using multiple multi-period time-series, we would like to see if there is a method that tends to capture intermittent demand better than others. In collaboration with a supply chain consulting company, we investigated over 160 different intermittent time series to identify what works best. Specifically, the research addresses the following four questions:

1. How well do popular machine learning approaches perform at predicting intermittent demand?
2. How do these machine learning approaches compare to the popular Croston's method of time-series forecasting?
3. Can combining models via meta-modeling (what we call two-stage modeling) improve capturing the intermittent demand signal?
4. Can one overall model be developed that captures multiple different intermittent time-series, and how would it perform compared to the others?

The remainder of this paper starts with a review of the literature on the various criteria and methods applied to forecasting intermittent demand. Section 3 presents the proposed methodology and discusses the criteria formulation. Next, in Section 4, various models are formulated and tested. Section 5 outlines the performance of our models. Section 6 concludes the paper with a discussion of the implications of this study, future research directions, and concluding remarks.


LITERATURE REVIEW

Applications in Various Business Backgrounds

Forecasting intermittent demand, such as the demand for spare parts, is a typical problem faced across industries. Despite its importance in inventory management, the sporadic intervals, low order volumes, and large proportion of zero values make it especially difficult to forecast intermittent demand accurately (Hua et al., 2007). Consequently, businesses are burdened either with excessive inventory cost or with the risk of stockouts. This is not uncommon for high-price, slow-moving items, such as aircraft service parts, heavy machinery, hardware service parts, and electronic components. Companies that manufacture or distribute such items are often faced with irregular demand that can be zero 99% of the time. In short, intermittent demand poses challenges industry-wide, and techniques need to be improved to help companies address these concerns efficiently.

Evolvement of Methodology

Previous research has endeavored to forecast demand using various techniques and methods. Among them, a classic method focusing on small, discretely distributed demand is Croston's method. According to Croston (1972), for irregular demand of small size and a large proportion of zero values, the mean demand is easily over-estimated and its variance is underestimated. Therefore, he suggested an alternative approach, using exponential smoothing to separately estimate the expected time interval between demand periods and the quantity demanded in any period, assuming that time intervals and demand quantities are independent. Multiple models have been derived from Croston's method with reasonable modifications. For example, Syntetos & Boylan (2005, 2001) claimed that the original Croston's method was biased and developed the adjusted Croston method (a.k.a. the Syntetos-Boylan Approximation (SBA) and the Shale-Boylan-Johnston (SBJ) method). This method is shown to achieve higher accuracy than the original for demand with shorter intervals between orders (Snyder et al., 2012). However, most of these techniques are based on an exponential smoothing that considers predicting two components, (i) the time between demands and (ii) the order size, finally providing an

average demand over the forecast horizon. This potentially underestimates the variance faced in

intermittent demand problems (Croston, 1972). Some recent researchers, however, have turned to machine learning techniques. For example, Kourentzes (2013) reported that neural networks (NNs) achieve higher service levels than Croston's method and its variants. Furthermore, because NNs do not assume constant demand and can retain the interactions of demand and arrival rate between demand periods, they overcome the limitations of Croston's method (Kourentzes, 2013). Despite their usefulness, NN techniques are underdeveloped: larger amounts of data are required to train and validate NNs' applicability and predictive power, and further research is needed. Therefore, the current research explores the NN method as well as other machine learning methods, such as random forests and gradient boosting machines, that are gaining popularity for their robustness.

Measurement of accuracy

Previous studies have suggested many measures for assessing the accuracy of time-series predictions, including mean absolute percentage error (MAPE), root-mean-square error (RMSE), and other statistics such as the "percentage better" and "percentage best" summary statistics (e.g., Syntetos & Boylan, 2005). Nevertheless, even though the classical methods are well suited for minimizing RMSE, service level constraints are easily violated, because high demand uncertainty results in lost sales. Another widely used measure is MAPE, which is advantageous for its interpretability and scale-independence but is limited in handling large numbers of zero values (Kim & Kim, 2016). Other researchers propose that the mean absolute scaled error (MASE) is the most appropriate metric, because it is not only scale-independent but also handles series with infinite and undefined values, as in intermittent demand (Hyndman & Koehler, 2006). Mean absolute error (MAE) has also been widely used because it is easy to understand and compute; however, MAE is scale-dependent and is not appropriate for comparing different time series, which we do in our study. Taking the pros and cons of each metric into consideration, the current study uses MASE and MAE as the major measurements. MASE is leveraged to compare the overall performance of each type

of model across series, and MAE is used to optimize model parameters for each individual series.
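To ground the classical baseline and the headline accuracy metric discussed above, here is a minimal R sketch using the croston() implementation in the forecast package (an assumption; the paper does not state which implementation was used) together with a hand-rolled MASE function following Hyndman & Koehler (2006); the demand values are made up:

library(forecast)

# A small made-up intermittent demand series (mostly zeros with occasional spikes)
demand <- ts(c(0, 0, 5, 0, 0, 0, 12, 0, 0, 3, 0, 0, 0, 8, 0, 0, 0, 0, 6, 0, 0, 2, 0, 0))

# Croston's method: exponential smoothing of non-zero sizes and inter-demand intervals
fc <- croston(demand, h = 6)
fc$mean   # flat forecast of the average demand per period

# MASE: mean absolute error scaled by the in-sample MAE of the naive (lag-1) forecast
mase <- function(actual, forecast, train) {
  scale <- mean(abs(diff(train)))   # naive one-step-ahead error on the training data
  mean(abs(actual - forecast)) / scale
}

# Example: score the Croston forecast against 6 hypothetical hold-out values
holdout <- c(0, 0, 4, 0, 0, 0)
mase(holdout, as.numeric(fc$mean), as.numeric(demand))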


METHODOLOGY

Data Description

The dataset used was provided by an undisclosed industrial partner. It contains 160 time-series of intermittent demand for unknown items, with each time-series representing the demand of a distinct item. These time-series are observed at either daily or weekly frequency. There are three features in the original data: series number, time, and value.

Feature Engineering

Five features were created to capture the unique characteristics of the intermittent time-series problem, with the goal of helping the algorithmic approaches learn the patterns better. Specifically, the features aim to integrate two components into the learning process: time-series structure and intermittent demand. The following features were created:

• Time series: lag1, lag2, lag3
• Intermittent demand: non-zero interval, cumulative zeros

The three lags are the demand values lagged up to three periods. The "non-zero interval" is the time interval between the previous two non-zero demands. The "cumulative zeros" is the number of successive zero values until lag one; it shows the length of time during which no demand occurs. A data dictionary and an example data table with the newly generated features can be found in the Appendix of the paper.

Sequential Data Partitioning

Each series was trained at the individual level to capture the unique profile of each item. We used sequential data partitioning to split each series into training and testing sets, with 75% of total observations (starting at the 4th observation) in the training set and 25% in the testing set.

Data Preprocessing

All sets were normalized using min-max scaling (i.e., the "range" method in the R caret package) to ensure all numeric features were on the same scale, ranging from 0 to 1. Normalizing numeric inputs generally avoids the problem that, when some features dominate others in magnitude, the model tends to weight large-scale features more heavily and thus underweight the impact of small-scale

features regardless of their actual contribution. For the features used in this research, the nzInterval

and zeroCumulative were on relatively small scales, typically less than 5, while the lagged demands ranged up to 500. As mentioned in the feature engineering section of the paper, nzInterval and zeroCumulative were identified as key variables for capturing the intermittency of the demand profile, and thus normalization was extremely important to avoid a biased model. Training and testing sets were pre-processed separately, because forecasts are made on a rolling basis, but the "min" and "max" were carried through from the training set.

Model Selection

Neural networks (NN) are robust to noisy data and flexible in terms of model parameters and data assumptions. With multiple nodes trying various combinations of weights assigned to each connection, an NN can learn around uninformative observations, which indicates great potential for finding relationships within intermittent time-series data without extra information. Gradient Boosting Machines (GBM), as a forward-learning ensemble method, are robust to random features. By building regression trees on all the features in a fully distributed way, we expect GBM to capture some characteristics of the unstable intermittent demand. Random Forests (RF), similar to GBM, are based on decision trees; the difference is that GBM reduces prediction error by focusing on bias reduction (via boosting weak learners), while RF focuses on variance reduction (via bagging, or bootstrap aggregation). Meta-modeling (a.k.a. two-stage modeling in our paper) is suggested by some researchers to perform better than using single base learners in isolation; in particular, more information can be gathered via models with different focuses.

Model Comparison / Statistical Performance Measures

The statistical performance measures adopted here were Mean Absolute Error (MAE) and Mean Absolute Scaled Error (MASE). MAE is selected because it is easy to interpret and understand, and it treats errors equally. However, it cannot be used to compare across time-series because it is scale-dependent. Therefore, the research also utilized MASE to provide a more holistic perspective


by comparing accuracy across different time-series. Figures 1 and 2 below detail the study design just described.

Study Design / Workflow

Figure 1. Overall Flow

Figure 2. Model Training Details
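Pulling together the Feature Engineering, Sequential Data Partitioning, and Data Preprocessing steps described above, a hedged R sketch follows; the demand values and the helper name make_features are hypothetical, and caret's preProcess with method = "range" stands in for the min-max scaling the authors describe:

library(caret)

# Hypothetical single series of demand values
demand <- c(5, 0, 70, 101, 127, 0, 0, 0, 73, 0, 55, 2)

make_features <- function(y) {
  n <- length(y)
  lag_k <- function(k) c(rep(NA, k), y[seq_len(n - k)])
  lag1 <- lag_k(1); lag2 <- lag_k(2); lag3 <- lag_k(3)

  # cumulative count of successive zeros up to (and including) lag 1
  zeroCumulative <- numeric(n)
  for (t in 2:n) zeroCumulative[t] <- if (y[t - 1] == 0) zeroCumulative[t - 1] + 1 else 0

  # interval between the two most recent non-zero demands observed before time t
  nzInterval <- rep(NA_real_, n)
  nz <- which(y != 0)
  for (t in 2:n) {
    past <- nz[nz < t]
    if (length(past) >= 2) nzInterval[t] <- diff(tail(past, 2))
  }
  data.frame(demand = y, nzInterval, zeroCumulative, lag1, lag2, lag3)
}

feats <- make_features(demand)[-(1:3), ]   # drop rows without full lags
split <- floor(0.75 * nrow(feats))
train_df <- feats[1:split, ]
test_df  <- feats[-(1:split), ]

# Min-max scale using ranges learned on the training set only
pp <- preProcess(train_df, method = "range")
train_scaled <- predict(pp, train_df)
test_scaled  <- predict(pp, test_df)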


MODEL DEVELOPMENT

Common Setup

All models were trained using 3-fold cross-validation. Considering the measures used in this research, all regression-type models were optimized on MAE. Since each time-series in our dataset refers to a unique item, to capture the unique profile of each series as precisely as possible, each model except the aggregated model was built and trained across the 160 time-series; in other words, instead of training exactly one model, a set of 160 models was generated. Only the first record of each test set was used directly, and the predictive features for the rest of the test set were generated on a rolling basis from the recorded predictions. The formula below is the general one used in all single models, as well as the 1st-stage models of the meta-modeling approach; it also served as a base formula for the rest of the models:

demand ~ nzInterval + zeroCumulative + lag1 + lag2 + lag3

One thing to notice is that the response variable in this formula (i.e., demand) refers to the scaled demand, calculated as the actual demand divided by the maximum demand value of a specific training set. This transformation puts the response variable on the same scale as the independent variables, potentially improving the precision of the model. Normalizing the response variable is recommended if you have a similar data set but is not required; if it is adopted, the predictions should be reverted by multiplying by the maximum.

Single Stage Model

Three sets of single-stage models were trained using Neural Networks (NN), Quantile Random Forests (QRF), and Gradient Boosting Machines (GBM), respectively. The NN models had one hidden layer given the limited number of input features. The hidden layer size was tuned over 1, 3, 5, and 10 nodes according to rules mentioned in other studies for tuning a feed-forward neural network, such as "2n", "n", and "n/2", where n is the number of input variables.
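A minimal caret sketch of the common setup just described, under two assumptions: a training data frame train_scaled with the engineered features (as in the earlier feature-engineering sketch, but long enough for 3-fold cross-validation) and a caret version whose default regression summary reports MAE:

library(caret)
set.seed(42)

ctrl <- trainControl(method = "cv", number = 3)

# Single-layer feed-forward NN tuned over the hidden-node sizes mentioned in the text
fit_nn <- train(demand ~ nzInterval + zeroCumulative + lag1 + lag2 + lag3,
                data = train_scaled,
                method = "nnet",
                metric = "MAE", maximize = FALSE,
                tuneGrid = expand.grid(size = c(1, 3, 5, 10), decay = c(0, 0.01)),
                linout = TRUE, trace = FALSE,
                trControl = ctrl)

# Quantile random forest with the same formula and resampling setup
fit_qrf <- train(demand ~ nzInterval + zeroCumulative + lag1 + lag2 + lag3,
                 data = train_scaled,
                 method = "qrf",
                 metric = "MAE", maximize = FALSE,
                 trControl = ctrl)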


Aggregated Single Stage Model

Considering that it is time-consuming to train and manage multiple models for different items, we also tried building an aggregated model that fits all of the given time-series at once. The rationale behind such a model is that time-series with intermittent demand may share some patterns, especially if they come from the same company or product category. Also, machine learning methods tend to be data-hungry, while it is hard to collect a large amount of training data for the same item (series) without using lots of outdated data. Generally, this model followed the same setup as the NN model except that the training set is an aggregation of the 160 smaller ones. A modified model was trained by including "timeSeriesID" as a categorical feature, to capture the unique characteristics of each series as much as possible aside from their commonality.

Meta-Model

The 1st-stage classification models used Logit, NN, and RF, respectively, and were optimized on the area under the receiver operating characteristic (ROC) curve. The output of the 1st-stage model is the probability that a non-zero demand will occur. This prediction was then fed as an independent variable into the 2nd-stage meta-model forecast, where QRF and NN were used to predict the temporal demand. Due to the nature of our data set, the probabilities might contain information not only about whether there will be demand but also about the size of the demand; hence, the output of the 1st-stage models was kept as a probability instead of being converted to the binary predicted-class format often used with classification-type models. The performance of the 1st-stage models was not assessed separately, because it is hard to identify which statistical measure leads to the best impact of the 1st-stage model on the 2nd-stage model, and doing so is beyond the scope of this paper. Moreover, there is no guarantee that the model with the highest 1st-stage performance will yield the best results after combining with the 2nd-stage models. Our research cares only about the final prediction performance in terms of MAE and MASE, so all 1st-stage models

were kept and applied to the 2nd-stage modeling.
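The two-stage setup described above can be sketched in caret as follows; this is a hedged illustration in which train_scaled and the variable names carry over from the earlier sketches, and RF/QRF stand in for the full set of first- and second-stage learners the authors tried:

library(caret)
set.seed(42)

# Stage 1: classify whether demand in the period is non-zero, optimized on ROC
train_scaled$nonzero <- factor(ifelse(train_scaled$demand > 0, "yes", "no"), levels = c("no", "yes"))
ctrl_cls <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
stage1 <- train(nonzero ~ nzInterval + zeroCumulative + lag1 + lag2 + lag3,
                data = train_scaled, method = "rf", metric = "ROC", trControl = ctrl_cls)

# Keep the probability (not the hard class) as an extra regressor for stage 2
train_scaled$pNonzero <- predict(stage1, newdata = train_scaled, type = "prob")[, "yes"]

# Stage 2: regression on demand, with the stage-1 probability as an additional feature
stage2 <- train(demand ~ nzInterval + zeroCumulative + lag1 + lag2 + lag3 + pNonzero,
                data = train_scaled, method = "qrf",
                metric = "MAE", maximize = FALSE,
                trControl = trainControl(method = "cv", number = 3))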


RESULTS

Machine Learning vs. Croston's

Figure 3 summarizes the average MASE of the 3 single-stage models, the aggregated model, and the best-performing meta-model, along with the results of the traditional Croston's method.

Figure 3. Model performance on the test set

According to the results, all machine learning methods had lower MASE than Croston's method, and the QRF model performed the best, with a 0.06 decrease in average MASE. The predictions on the test datasets performed reasonably well without obvious overfitting issues. Moreover, a paired t-test showed that QRF generates significantly lower MASE than the traditional Croston's method, indicating that this model did achieve higher predictive accuracy. The MAE table below shows similar results. Again, MAE cannot be compared across series, but the focus of this research is the overall performance of each machine learning method, and all the models used the same 160 series for training and testing; hence, we deemed the average MAE a valid measurement for comparing the overall performance of each model.

Table 1. Average MAE (across the same 160 time-series)


On an individual basis, our results do show an overfitting issue for certain series. According to other studies, this issue usually exists in both machine learning models and the Croston model, indicating the problem may lie with the random nature of certain time-series.

Meta-Model vs. Single-Step Model

The most accurate meta-model, which used RF in the first stage and QRF in the second stage, did not outperform the one-step QRF as expected, and all the other meta-models performed worse than the corresponding one-step models. One possible explanation is that time-series forecasting requires rolling forecasts, and as a result the prediction error was amplified through each step.

Aggregated Model vs. Series-Level Model

The aggregated NN (the single model that takes all time-series data as input to create one overall model) showed results similar to the series-level NN with the time-series ID included as a feature. That is not to say the aggregated NN is as good as the series-level NN, because the series-level NN has greater potential and flexibility to be modified, in its parameters or input features, to better fit the demand of a certain item. However, companies carrying large numbers of SKUs with intermittent demand may want to adopt the aggregated approach to simplify their model training. This is an operational decision-support design decision that will need to be considered depending on the business. Moreover, this over-arching approach provides an alternative for items without enough data to train a series-level model individually, which is common in retail when decision-facilitators are building bottom-up or top-down forecasts for assortment planning decisions.

IMPLICATIONS AND LIMITATIONS

The current research explored three approaches to predicting intermittent demand. Two approaches trained an individual model for each time series; the aggregated approach trained a single model that can be applied to any series. Among the individual-level approaches, most single-stage models perform better than the meta-models. The most accurate meta-model, which used RF-QRF, achieves approximately the same MASE as the QRF single-stage model. Although the paired-sample t-test

indicated that the QRF single-stage model decreased the error significantly relative to Croston's method (p < .01), we noticed that the models tend to give stable predicted values without capturing all the fluctuations appearing in the actual data. This behavior resembles Croston's method, which yields an average demand that repeats for all predicted time periods. Statistically, the present research provides a way of better forecasting the demand level. Such predictions will help businesses determine service levels and save inventory cost. However, some businesses may be more interested in meeting unexpected demand than in lowering inventory cost; in that scenario, predictions that capture the spikes of the demand curve will be preferable. Therefore, future research may explore ways to improve the predictions by capturing the irregular fluctuations more precisely.

CONCLUSION

A small increase in predictive accuracy can help firms save a substantial amount of inventory cost while maintaining acceptable service levels. As the results of this study demonstrate, machine learning techniques, such as the Quantile Random Forest, can improve predictive accuracy for intermittent demand forecasts. We consider the limitation of our models to be fitting a small number of inputs into data-hungry models; future analysts can explore more input features related to intermittent demand prediction. It is possible that some models will perform poorly on statistical performance measures but perform well on business performance measures. In that case, the decision maker would need further information about the costs associated with a low service level in order to balance statistical and business measures.


REFERENCES

2013 Corporate development survey report | Deloitte US | Corporate development advisory. (n.d.). Retrieved December 15, 2017, from https://www2.deloitte.com/us/en/pages/advisory/articles/corporate-development-survey-2013.html

Cattani, K. D., Jacobs, F. R., & Schoenfelder, J. (2011). Common inventory modeling assumptions that fall short: Arborescent networks, Poisson demand, and single-echelon approximations. Journal of Operations Management, 29(5), 488-499. https://doi.org/10.1016/j.jom.2010.11.008

Croston, J. D. (1972). Forecasting and stock control for intermittent demands. Operational Research Quarterly, 23(3), 289-303.

Hua, Z. S., Zhang, B., Yang, J., & Tan, D. S. (2007). A new approach of forecasting intermittent demand for spare parts inventories in the process industries. Journal of the Operational Research Society, 58(1), 52-61. https://doi.org/10.1057/palgrave.jors.2602119

Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688. https://doi.org/10.1016/j.ijforecast.2006.03.001

Kim, S., & Kim, H. (2016). A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32(3), 669-679. https://doi.org/10.1016/j.ijforecast.2015.12.003

Kourentzes, N. (2013). Intermittent demand forecasts with neural networks. International Journal of Production Economics, 143(1), 198-206. https://doi.org/10.1016/j.ijpe.2013.01.009

Smart, C. (n.d.). Understanding Intermittent Demand Forecasting Solutions. Retrieved December 14, 2017, from http://demand-planning.com/2009/10/08/understanding-intermittent-demand-forecasting-solutions/

Snyder, R. D., Ord, J. K., & Beaumont, A. (2012). Forecasting the intermittent demand for slow-moving inventories: A modelling approach. International Journal of Forecasting, 28(2), 485-496. https://doi.org/10.1016/j.ijforecast.2011.03.009

Syntetos, A. A., & Boylan, J. E. (2001). On the bias of intermittent demand estimates. International Journal of Production Economics, 71(1-3), 457-466. https://doi.org/10.1016/S0925-5273(00)00143-2

Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand estimates. International Journal of Forecasting, 21(2), 303-314. https://doi.org/10.1016/j.ijforecast.2004.10.001



Syntetos, A. A., Nikolopoulos, K., & Boylan, J. E. (2010). Judging the judges through accuracy-implication metrics: The case of inventory forecasting. International Journal of Forecasting, 26(1), 134-143. https://doi.org/10.1016/j.ijforecast.2009.05.016

Syntetos, A. A., Zied Babai, M., & Gardner, E. S. (2015). Forecasting intermittent inventory demands: simple parametric methods vs. bootstrapping. Journal of Business Research, 68(8), 1746-1752. https://doi.org/10.1016/j.jbusres.2015.03.034


APPENDIX A

Example of New Features Generated

Columns: value, nzInterval, zeroCumulative, Lag1, Lag2, Lag3

70 127 0 70 101 1 0 127 70 0 1 0 101 127 70 0 1 1 0 101 127 0 1 2 0 0 101 73 1 3 0 0 0 0 4 0 73 0 0 55 4 1 0 73 0 0 2 0 55 0 73


APPENDIX B

Data Dictionary

Variable: timeSeriesID | Type: Categorical | Description: ID for each time series (S1-S160)
Variable: time | Type: Date | Description: The last day of a time period.
Variable: value (Dt) | Type: Numeric | Description: The volume of demand at a certain time. S1-S80 are daily demand; S81-S160 are weekly demand.
Variable: nzInterval | Type: Numeric | Description: The number of time periods between the previous two periods where (non-zero) demand occurs.
Variable: zeroCumulative | Type: Numeric | Description: The number of time periods since the last period where (non-zero) demand occurs.
Variable: Lag1, Lag2, Lag3 | Type: Numeric | Description: Demand of the previous 3 time periods: lag1 = D(t-1), lag2 = D(t-2), lag3 = D(t-3).


APPENDIX C

Paired Sample t-Test on Testing Set

Average MASE using Croston's method vs. average MASE using QRF: 0.01653973, 0.10308096
t = 2.7298
Number of observations in each group = 160
p-value = 0.007048
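A minimal R sketch of the paired comparison reported in this appendix, assuming two vectors of per-series MASE values; the vector names and the generated values are hypothetical placeholders, not the study's data:

# mase_croston and mase_qrf: MASE for each of the 160 series under each method (placeholder values)
set.seed(1)
mase_croston <- runif(160, 0.5, 1.5)
mase_qrf     <- mase_croston - rnorm(160, mean = 0.06, sd = 0.2)

# Paired t-test on the per-series differences
t.test(mase_croston, mase_qrf, paired = TRUE)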


A Proposed Data Analytics Workflow and Example Using the R Caret Package

Simon Jones, Zhenghao Ye, Zhuoheng Xie, Chris Root, Theerakorn Prasutchai, Michael Roggenburg, Matthew A. Lanham
Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907
[email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

Abstract

This paper provides a comprehensive explanation of the functions available in the R caret package and a proposed workflow for using them to perform predictive modeling. The motivation for this paper is that the R statistical software is one of the most popular languages used by analytics professionals today, and the caret package has grown in popularity for predictive modeling since its first release in 2007. Unfortunately, the resources that demonstrate the key functional components, expected runtime, and how an analyst might use them in a typical workflow for data mining and predictive analytics tasks are limited. We attempt to fill this void by providing the reader a valuable reference for getting started with caret, motivating the functions and their use with a real-world dataset. The sample dataset we analyzed was acquired from the 2017 WSDM-KKBox's Churn Prediction Challenge provided on Kaggle.com, a platform for predictive analytics and data mining competitions. We demonstrate ways to clean, impute, pre-process, train, and assess popular machine learning models using caret. However, our contribution focuses on the actual process of getting results, as opposed to the results (e.g., model accuracy) themselves.

Keywords: R, caret, predictive analytics


Introduction

As the world enters a new age of information, it is becoming increasingly important to train people to become literate in analyzing and evaluating data for business and social use. In a popular study, McKinsey & Company estimated that by 2018 there would be a shortage of 140,000 to 190,000 people with the deep analytical skills needed to take advantage of the data being generated (Manyika). In another survey conducted by IBM, the job market for data scientists is projected to increase 28% by 2020, equating to nearly 364,000 new positions (Columbus, 2017). Clearly, industry is at a major growth stage in using data to support decision-making, and doing so effectively requires that employees have the analytics skills to get the job done.

As predictive analytics becomes a more important part of an organization's decision-making process, more tools are developed to facilitate its execution. These tools range from unsophisticated to highly complicated tools designed for IT experts. Among them are commercial predictive analytics tools such as MATLAB and SAS, as well as open-source software such as R and Python, which have packages such as caret and scikit-learn. The Classification and Regression Training (caret) package for R, for example, is viewed as "one of the best packages available in R to prototype various models (Lanham, 2017)." Caret provides one of the most comprehensive wrappers for any set of R packages and can be used on its own to define an entire workflow, starting from data cleaning and preprocessing all the way through model training, prediction, and performance analysis. DataCamp.com describes it as a "go-to package in the R community for predictive modeling and supervised learning... [and interfaces] with all of R's most powerful Machine Learning facilities (DataRobot, 2016)."

While some packages and libraries contain more detailed versions of some of the features in caret, the purpose of using it is to allow a more streamlined process for modelers, as many packages have subtle differences that can lead to errors. The reason for this is that there are many different packages that build different models (e.g., artificial neural network, support vector machine, random forest, etc.), and there is no one defined way in which a package developer must design their input and output arguments. For example, one package might require specifying the model argument as modelA(Y ~ x1 + x2), while another package requires training the model by specifying the individual pieces as arguments, like so: modelB(y=Y, x=c(x1,x2)). Thus, caret wraps both of these packages into one consistent function (i.e., train()) that allows the modeler to have a cleaner modeling workflow and more efficient prototyping. For example, using either of these generic functions might look like train(Y ~ x1 + x2, method="modelA") or train(Y ~ x1 + x2, method="modelB").

The data science community, which consists of professionals ranging from data analysts to data scientists to researchers, has embraced the power of open-source tools such as R as part of their workflow, even when working at companies that have invested heavily in a commercial product (e.g., SAS). The reason is that as new methods are developed, they are often shared in near real-time. For example, the open-access Journal of Statistical Software (https://www.jstatsoft.org/index) is a popular avenue for researchers who develop new quantitative methods, visual tools, etc.
to publish examples of R package as soon as their new methodology has been peer-reviewed and published elsewhere. While tutorial papers exist for caret, as we demonstrate in the literature review, we believe there is still a lack of information in how one can use these functions together to develop a predictive modeling workflow.
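As a minimal, hedged sketch of this unified interface (the data frame dat below is invented purely for illustration and is not the KKBOX data), two very different underlying packages can be fit and queried through the same pair of calls; only the method argument changes.

    library(caret)
    set.seed(123)

    # invented example data: two numeric predictors and a two-class factor response
    dat   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    dat$Y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(200) > 0, "yes", "no"))

    # the same formula interface regardless of the underlying package
    fit_tree <- train(Y ~ x1 + x2, data = dat, method = "rpart")  # rpart package
    fit_glm  <- train(Y ~ x1 + x2, data = dat, method = "glm")    # stats::glm

    # and one consistent prediction call for either fitted model
    head(predict(fit_glm, newdata = dat))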

50

We structure this paper as follows. We summarize insights we found from several academic and professional papers on caret in the Literature Review. Next, we describe the data set used in our study. In the Methodology section we develop the proposed workflow model that the authors have designed; this portion contains a systematic process, as well as corresponding scripts in the R language, which are applied to the example dataset. In the Models section we describe the ten machine learning algorithms used in our study and how we easily developed these models using caret. In the Results section we show how to compare the results of these models and discuss the runtime we observed to train them. Lastly, in the Conclusions we provide some key take-away points to help the reader in their analytics journey using the R caret package.

Literature Review

While much has been said about the comprehensive nature of the caret package, for new data scientists the process of learning and applying these data mining methods can be daunting. For this reason, we conducted a literature review to understand what has been published and thus to frame our proposed methodological workflow. Since our study is primarily aimed at creating a general framework for a workflow, we reviewed papers published on the basic functionalities of caret as well as applied research performed using the package. Table 1 provides a summary of these studies. Most of these papers demonstrate applications of caret on sample datasets, so we have chosen not to delve too deeply into the motivations behind the studies. We were more interested in understanding the common workflows used in the studies than in the results of the studies themselves, and we use these studies as support for our proposed recommendation.

Table 1: Summary of the literature

2008. The Caret Package. Max Kuhn.
Provides descriptions of the functions available when the package was first released. It explicitly lists the constituent functions, function descriptions, usage, input values, arguments, and some sample code showing how to apply them. Unlike the other literature, it does not present a comprehensive application of the package using a sample data set.

2010. Variable Selection Using The Caret Package. Max Kuhn.
Touches on the topic of dimension reduction and walks the reader through different ways to go about it. The two primary sections of the paper cover feature selection using search algorithms (including recursive feature elimination) and feature selection using univariate filters. Each section goes over the functions in detail and the process by which these methods can be applied to most data sets. Unlike most of the papers published by Kuhn, this one does not give any context as to how variable selection fits into the larger data mining process.

51

2013. Predictive Modelling with R and the Caret Package. Max Kuhn.
Delves into the predictive modelling process as a whole and then focuses on using the caret package to build predictive models. The paper is divided into eight broad segments: an introduction to the predictive modelling process, data splitting, data preprocessing, overfitting and resampling, training and tuning model trees, training and tuning an SVM, comparing models, and parallel processing. The main difference with this piece of literature on caret is that less emphasis is given to the workings behind the individual processes than to how they all fit together into a larger workflow.

2013. Predicting Defects Using Change Genealogies. Kim Herzig, Sascha Just, Andreas Rau, Andreas Zeller.
Analyzes and estimates software quality, which requires analyzing a project's version histories. The authors combine all changes into change genealogies and try to predict software quality based on these genealogies. They show that change genealogies provide sufficiently good classifiers and that the genealogy models outperform models based on code complexity, which is the norm in this field. The authors specifically used the caret package in R to generate six models: k-nearest neighbors (knn), logistic regression (multinom), recursive partitioning (rpart), support vector machines (svmRadial), tree bagging (treebag), and random forests (randomForest).

2015. Building Predictive Models in R Using the Caret Package. Max Kuhn.
Runs through the entire predictive modeling process using data from computational chemistry. The author breaks the workflow down into its constituent steps, i.e., data preparation, building the model, tuning the model, prediction, and evaluation. The study is carried out almost entirely using functions available in the caret package, which makes it perhaps the most comprehensive piece of published literature on the functionality available within the package. To illustrate its breadth, the author builds and evaluates multiple models on the available data, emphasizing the theme of a unified interface for creating different models.

2015. A Short Introduction to the Caret Package. Max Kuhn.
A shorter introduction to the caret package, primarily covering the functions available in it. Much like the other studies, it walks through a sample workflow on a relatively rudimentary dataset. The majority of the emphasis is given to the train function and its applications while walking the reader through some sample code.

2015. A Comparison of Data Mining Techniques in Evaluating Retail Credit Scoring Using R Programming. Dilmurat Zakirov, Aleksey Bondarev, Nodar Momtselidze.
The comparison conducted by the authors using retail credit scoring data encompasses most popular data mining techniques. The paper explores k-nearest neighbors (KNN), support vector machines (SVM), gradient boosted models (GBM), naïve Bayes classification, classification and regression trees, and random forests as methods for predicting customer scores on an actual dataset. The paper concludes that the random forest model with down-sampling had higher accuracy than the other models presented. The paper does a thorough job of summarizing

52

the applications of each of these modeling methods, but it does not cover the data pre-processing stage; little is said about the normalization methods used during data preprocessing.

2016. Evaluation of four supervised learning methods for groundwater spring potential mapping in Khalkhal region (Iran) using GIS-based features. Seyed Amir Naghibi, Mostafa Moradi Dashtpagerdi.
A critical tool for water resource management in semi-arid and arid regions is the mapping of groundwater potential for planning and usage purposes. This paper evaluates four popular data mining models, k-nearest neighbors (KNN), linear discriminant analysis (LDA), multivariate adaptive regression splines (MARS), and quadratic discriminant analysis (QDA), to model groundwater potential maps (GPMs). The authors used caret to carry out their quadratic discriminant analysis and achieved an accuracy of roughly 61.2%. While this paper is not solely based on caret, it does represent an example of applied research carried out using it.

After reviewing the literature, we found that most articles come from the author of the package itself. We found "Building Predictive Models in R Using the Caret Package" (Kuhn, 2015) to be the closest to our study. Our study differs in that we demonstrate the functionality using a business example (churn) rather than a chemistry example. This matters because the target audience of this paper is researchers and practitioners doing business analytics, not physical science research.

Data

The business example we use in this study is from KKBOX, a Taiwanese music streaming service founded in 2004 and Asia's top provider in this space. Its primary regions of operation are Taiwan, Hong Kong, Malaysia, and Singapore, and it hosts one of the world's most extensive music libraries, with over 30 million tracks. Its business model is funded through advertisements and paid subscriptions. KKBOX aims to accurately predict churn among its paid customers, from whom it receives most of its revenue. Through Kaggle.com, KKBOX provided a large dataset of its customers' listening and subscription behavior and allowed the analytics community to compete to build the best churn prediction model.

There are five sets of data: train, sample_submission_zero, transaction, user_logs, and members. Data descriptions and features are shown below. Table 2 consists of customer user ids and whether those customers churned or not; these customers have a subscription expiration in February 2017. These were the main users that competition participants used to build a predictive model.

Table 2: Train table
msno (Factor): The unique user identification
is_churn (Integer): Identifies churners and non-churners

53

The scoring set of data consists of user ids and whether they have churned or not, as shown in Table 3. These are users whose subscriptions expire in March 2017. In this file, all is_churn values are zeros, and the modeler is required to submit probability predictions based on the modeling solution they generate. Only Kaggle knows the true is_churn flag, but when you submit your predictions to Kaggle you are scored and compared with other participants.

Table 3: Sample_submission_zero table
msno (Factor): The unique user identification
is_churn (Integer): Predicted churner or non-churner

Table 4 contains the transaction details of the users up until February 28, 2017. The data set consists of nine features, as shown below.

Table 4: Transaction table
msno (Factor): The unique user identification
payment_method_id (Integer): The payment method the user used for the subscription
payment_plan_days (Integer): The number of days of the subscription plan
plan_list_price (Integer): The listed subscription price
actual_amount_paid (Integer): The actual price paid
is_auto_renew (Integer): Identifies users with automatic subscription renewal
transaction_date (Integer): The date the transaction was made
membership_expire_date (Integer): The subscription expiration date
is_cancel (Integer): Identifies canceled transactions

The user logs describe users' behavior in using the KKBOX service and how they listened to music. For example, from Table 5 we obtain measures of the quantity of songs played and how much of the length of each song was listened to.

Table 5: User_logs table
msno (Factor): The unique user identification
date (Integer): Date when the usage was recorded
num_25 (Integer): Number of songs played for less than 25% of the song length
num_50 (Integer): Number of songs played for between 25% and 50% of the song length
num_75 (Integer): Number of songs played for between 50% and 75% of the song length
num_985 (Integer): Number of songs played for between 75% and 98.5% of the song length
num_100 (Integer): Number of songs played for over 98.5% of the song length
num_unq (Integer): Number of unique songs the user played
total_secs (Numeric): Total time songs were played, in seconds

54

General information about each user, such as age and gender, is displayed in Table 6. We found that some of the observations in this table are incomplete or would be considered outliers.

Table 6: Members table
msno (Factor): The unique user identification
city (Integer): The city where the user is located
bd (Integer): Age of the user
gender (Factor): Gender of the user
registered_via (Integer): The method the user used for registration
registration_init_time (Integer): Initial registration date
expiration_date (Integer): Subscription expiration date

Table 7 contains features we generated from the attributes provided in the previous tables. We used some of the factors described above to create dummy variables that capture the underlying information; these were then used in the analysis.

Table 7: Generated features table
sub_Churn (Factor): Created based on the subscription model described by Kaggle. According to Kaggle, some non-churners are indeed churners, and this variable takes that into consideration. This variable was not used in our modeling.
avgpayment (Numeric): The average payment to the service, from the Transaction table
latest_renew (Date): The latest renewal date, from the Transaction table
latest_cancel (Date): The latest cancellation date, from the Transaction table
pct_cancel (Numeric): On average, how often the user was listed as canceled in the Transaction table
Pricother (Factor): There were several different factor levels for price. Most prices are put into their respective levels, and the small group that is left is grouped into "pricother"
Pric149 (Factor): Dummy for price = 149
method (Factor): Payment method. Similar to price, most methods can be grouped to form their respective levels, and dummy variables are used.
meth36 (Factor): Dummy for method = 36
methother (Factor): There were several different factor levels for payment method. The majority of methods came from six or seven different levels; all the rest were grouped into "methother"
plan (Factor): Payment plan. The majority of plans came from two or three different levels, so dummy variables are included.

55

planother (Factor): There were several different factor levels for the payment plan. The majority of payment plans came from two or three different levels; all the rest were grouped into "planother"
plan7 (Factor): Dummy for plan = 7
Daysrange (Numeric): How many days the service lasted

Methodology

The caret package contains functions to perform the entire modeling process, from pre-processing to modeling to feature selection to the final predictive output. Figure 1 below details our proposed workflow. Here we focus on five different categories of functions: Data Visualization, Pre-processing, Data Splitting, Modeling, and Model Evaluation. Under each category are the caret functions that one will likely use when prototyping a predictive modeling solution.

Figure 1: Proposed caret workflow

Data Visualization

Data visualization is the process of understanding the data used throughout the predictive workflow. It can be used at any step of the process to verify results or uncover errors. Caret provides a variety of functions for gathering insights by analyzing the data visually. The various plot functions shown in Table 8 provide a very simple way to create complex graphical displays of your data without the need for large amounts of code. Many analysts agree that visualizing your data before analyzing it can expedite finding non-linear relationships or high correlations, and given its ease of use we recommend utilizing caret's visualization functionality at any step.
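As a brief, hedged illustration of this step (using R's built-in iris data rather than the KKBOX data, purely for convenience), the featurePlot function listed in Table 8 can produce several of these displays from a single call.

    library(caret)
    library(lattice)   # featurePlot builds on lattice graphics

    data(iris)

    # scatterplot matrix of the numeric predictors, colored by class
    featurePlot(x = iris[, 1:4], y = iris$Species, plot = "pairs",
                auto.key = list(columns = 3))

    # per-variable box plots by class
    featurePlot(x = iris[, 1:4], y = iris$Species, plot = "box",
                scales = list(y = list(relation = "free")))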

56

Table 8: caret data visualization functions
featurePlot: Outputs a range of visual plots, e.g., scatterplots and boxplots
dotPlot: Create a dot plot of variable importance values
lift: Lift plot
plotClassProbs: Plot predicted probabilities in classification models
plotObsVsPred: Plot observed versus predicted results in regression and classification models

Pre-processing

Pre-processing is the technique that prepares data for predictive applications. Raw data may be incomplete or too noisy to use because of various problems such as obsolete fields, missing values, and outliers. Missing values should be dealt with first, using a methodology consistent with the business problem. R can also load feature classes improperly, so it is essential to evaluate every variable in the dataset and convert it to the intended class if necessary. Caret is extremely useful for pre-processing raw data beyond just removing missing information. To start with, if a dataset contains categorical features with many levels, the dummyVars function provides a very quick way to break the factor levels apart into separate columns and avoid certain instances of correlation. We recommend using a combination of the findLinearCombos and findCorrelation functions to reduce dimensionality through linear combinations of features and to remove correlated variables that harm the effectiveness of certain models; this step can also confirm any findings from the visualization step. Finally, the preProcess function provides an extremely simple way to transform and re-scale your data. For example, z-score standardization can be done with method=c("center","scale"), while min-max normalization can be done with method=c("range"). Sometimes analysts also want to make their input features more Gaussian; this can be done in isolation, such as method=c("YeoJohnson"), or combined with scaling, such as method=c("range","YeoJohnson"). Figure 2 shows an excerpt of the code using dummyVars, nearZeroVar, findCorrelation, and preProcess in this analysis. Table 9 provides a summary of the pre-processing functions.

Figure 2: Pre-processing code applications
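Figure 2 itself is not reproduced in this text, so the following is only a hedged sketch of that style of pre-processing; the small made-up predictor data frame preds and the 0.90 correlation cutoff are assumptions standing in for the engineered KKBOX features and the settings actually used.

    library(caret)
    set.seed(1)

    # made-up predictors: one factor, two numeric columns, one constant column
    preds <- data.frame(method  = factor(sample(c("36", "other"), 500, replace = TRUE)),
                        payment = rnorm(500, mean = 149, sd = 20),
                        secs    = rexp(500, rate = 1/900),
                        junk    = rep(1, 500))

    # expand the factor into dummy columns
    dv     <- dummyVars(~ ., data = preds, fullRank = TRUE)
    predsX <- as.data.frame(predict(dv, newdata = preds))

    # drop near-zero-variance columns (here, the constant "junk" column)
    nzv <- nearZeroVar(predsX)
    if (length(nzv) > 0) predsX <- predsX[, -nzv]

    # drop highly correlated columns (0.90 is an arbitrary cutoff for this sketch)
    highCor <- findCorrelation(cor(predsX), cutoff = 0.90)
    if (length(highCor) > 0) predsX <- predsX[, -highCor]

    # z-score standardization; method = c("range") would min-max normalize instead
    pp     <- preProcess(predsX, method = c("center", "scale"))
    predsX <- predict(pp, newdata = predsX)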

57

Table 9: caret pre-processing functions
dummyVars: Create a full set of dummy variables (turns an n-level factor into separate Boolean indicator columns)
findCorrelation: Determine highly correlated variables
findLinearCombos: Determine linear combinations in a matrix
nearZeroVar: Identification of near-zero-variance predictors
preProcess: Pre-processing of predictors
classDist: Class centroids and covariance matrix

Feature Selection

Feature selection is the process of reducing the dimensionality of the data without significant loss in performance and predictive power, in an effort to create a simpler model. In general, most approaches for feature selection (reducing the number of predictors) can be placed into two main groups: wrapper methods and filter methods. For wrapper methods, caret has functions to perform recursive feature elimination, genetic algorithms, and simulated annealing, as shown in Table 10. For filter methods, caret has a general framework for using univariate filters (Kuhn, 2017). If feature reduction was not completed in the pre-processing step or this step, further reduction can be done in the modeling portion of the workflow: certain models run through the train function can have Principal Component Analysis added to them, which supplements the functions in the list and further reduces dimensionality.

Table 10: caret feature selection functions
filterVarImp: Calculation of filter-based variable importance
gafs: Genetic algorithm feature selection
gafsControl, safsControl: Control parameters for GA and SA feature selection
rfe, rfeIter: Backward feature selection (recursive feature elimination)
rfeControl: Controlling the feature selection algorithms
safs: Simulated annealing feature selection
sbf: Selection by filtering (SBF)
sbfControl: Control object for selection by filtering (SBF)
varImp: Calculation of variable importance for regression and classification models

Data Splitting

Data splitting is the process of partitioning the data into two or more separate subsets with the intention of using one set to train the model and the other set(s) to test or validate it. Data can be split in several different ways depending on the problem at hand. The createDataPartition function will generate an index for splitting your dataset into any number of stratified random subsets of any partition size between train and test sets. createFolds is similar and will index your data into k folds of randomly sampled subsets. Another useful splitting function is createTimeSlices, which will index your data into training and test sets when the data are in time-series format and a rolling train/test window is desired. After splitting the data, it is common

58

to find, especially in classification problems, that one class is much larger than the others. In such cases, the upSample, downSample, SMOTE, or ROSE functions provide easy ways to resample your data to obtain a more even distribution of each response and reduce bias when the classifier learns. Figure 3 shows an excerpt from the code used in this analysis for the data partitioning step. Table 11 summarizes these functions.

Figure 3: Data partitioning code

Table 11: caret data splitting functions
createDataPartition: Creates a series of test/training partitions
createFolds: Splits the data into k groups
createTimeSlices: Creates cross-validation splits for time series data
downSample, upSample: Down- and up-sampling of imbalanced data
groupKFold: Splits the data based on a grouping factor
maxDissim: Creates a sub-sample by maximizing the dissimilarity between new samples and the existing subset
SMOTE, ROSE: Mix of up-sampling and down-sampling
upSample: Samples with replacement until all class frequencies are equal
downSample: Randomly samples a dataset so that all class frequencies match the minority class

Modeling

Model training starts with specifying the training parameters for the model you want to run: the validation method, the type of problem (i.e., regression or classification), and the conditions for training. The second step specifies the model to train and its tuning parameters. The two important functions are trainControl, which specifies the validation technique (e.g., k-fold cross-validation, LOOCV, etc.) and the type of modeling problem (e.g., regression, classification), and train, which is one of the most extensive functions in R. At the time of writing, train has integrated or wrapped 238 modeling methods from various packages, and by default each approach evaluates three different combinations of tuning parameters. A user can directly specify the tuning parameter combinations they would like to try, or use the generic tuneLength= argument, which creates a reasonable set of possibilities based on the number specified. In addition, code-savvy users can create their own models and cleanly run them through the train function. The output of train identifies the "best" model among the set of possible tuning parameters and uses that as the final model. Among other things, Figure 4 shows the trainControl and train functions being applied for this paper's analysis. Table 12 provides a summary of the modeling and tuning functions in caret.
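Figures 3 and 4 are not reproduced in this text, so the following is only a hedged sketch of the splitting and training steps. The data frame dat, its factor response is_churn with levels churn and no_churn, the 70/30 split, and the choice of 10-fold cross-validation are illustrative assumptions, not the exact settings used in the paper.

    library(caret)
    set.seed(2017)

    # stratified split on the response: 70% of rows indexed for training
    idx       <- createDataPartition(dat$is_churn, p = 0.70, list = FALSE)
    train_set <- dat[idx, ]
    test_set  <- dat[-idx, ]

    # 10-fold cross-validation, retaining class probabilities so ROC can be used
    ctrl <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    # one of the ten methods in Table 15; swapping method = "nnet", "treebag",
    # "C5.0", etc. reuses exactly the same call
    fit_gbm <- train(is_churn ~ ., data = train_set, method = "gbm",
                     trControl = ctrl, metric = "ROC", tuneLength = 3,
                     verbose = FALSE)

    fit_gbm$bestTune    # tuning parameter combination of the selected model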

59

Figure 4: Model Training and Statistic Recording

Table 12: caret modeling and tuning functions
trainControl: Specifies the type of problem, conditions for training a model, and methods for validation
train: Specifies the desired model and corresponding tuning parameters

Model Performance & Evaluation

Model evaluation directly follows the model training step. In this step, the user benchmarks the model(s) against some predefined metric for the problem, or against each other. The confusionMatrix function is a powerful tool for classification problems and produces the distribution of type-I and type-II errors and of correct/incorrect classifications. The regression counterpart is postResample, which outputs the MSE and R2 for the model. The predict function returns class or probability predictions from the fitted model and can be applied to both classification and regression models. A variety of further statistics can be reported using the rest of the functions in the list. After evaluating the model, if it does not achieve the required performance, iterating on the modeling stage with different tuning parameters is recommended. Figure 4 above showed the performance evaluation with confusionMatrix on the 9th and 10th lines of the code excerpt. Table 13 summarizes the functions.

Table 13: caret performance and evaluation functions
calibration: Probability calibration plot
confusionMatrix: Create a confusion matrix

60

defaultSummary: Default function to compute performance metrics in train
postResample: Calculate the MSE and R2 given two numeric vectors of data
twoClassSummary: Computes sensitivity, specificity, and AUC
multiClassSummary: Computes some overall measures of performance (accuracy and Kappa)
extractPrediction: Extract predictions and class probabilities from train objects
extractProb: Extract class probabilities from train objects
negPredValue, posPredValue, sensitivity: Calculate sensitivity, specificity, and predictive values
recall: Calculate recall
precision: Calculate precision
F_meas: Calculate F values
resamples: Collation and visualization of resampling results
resampleSummary: Summary of resampled performance estimates
thresholder: Generate data to choose a probability threshold

Other

Table 14 lists other functions that do not fall neatly into any of the previous categories but are useful in the data analysis workflow.

Table 14: caret other functions
getSamplingInfo: Get sampling info from a train model
learning_curve_dat: Create data to plot a learning curve
modelLookup, checkInstall, getModelInfo: Tools for models available in train
SLC14_1, SLC14_2, LPH07_1, LPH07_2, twoClassSim: Simulation functions

Models

In our business example, the response variable (is_churn) is binary; therefore, this is a binary classification problem. In this study we used ten different models among the many available and compared their performance, including accuracy and runtime. Table 15 provides a summary of these methods. Some are common models, such as the neural network and naïve Bayes; others are models suggested in section 7.0.50 (Two Class Only) of The Caret Package (Kuhn, 2017), such as the AdaBoost classification tree and support vector machines with class weights.

Table 15: caret models used in this study
Model (caret function). Type. Tuning parameters.
Naïve Bayes (nb). Classification. Laplace correction (fL, numeric); distribution type (usekernel, logical); bandwidth adjustment (adjust, numeric).
Neural Network (nnet). Classification, Regression. Size (#hidden units); decay (weight decay).

61

Neural Network with Feature Extraction (pcaNNet). Classification, Regression. Size (#hidden units); decay (weight decay).
Oblique Random Forest (ORFlog). Classification. Number of randomly selected predictors (mtry, numeric).
Bagged CART (treebag). Classification, Regression. No tuning parameters for this model.
Stochastic Gradient Boosting (gbm). Classification, Regression. Number of boosting iterations (n.trees, numeric); max tree depth (interaction.depth, numeric); shrinkage (shrinkage, numeric); min. terminal node size (n.minobsinnode, numeric).
Bagged AdaBoost (AdaBag). Classification. Number of trees (mfinal, numeric); max tree depth (maxdepth, numeric).
Boosted Logistic Regression (LogitBoost). Classification. Number of boosting iterations (nIter, numeric).
C5.0 Tree (C5.0). Classification. Number of boosting iterations (trials, numeric); model type (model, character); winnow (winnow, logical).
Support Vector Machines with Class Weights (svmRadialWeights). Classification. Sigma (sigma, numeric); cost (C, numeric); weight (Weight, numeric).

1. Naïve Bayes
Naïve Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem. These classifiers are particularly well suited to datasets where the dimensionality of the inputs is high. Naïve Bayes is also fast computationally and simple to implement. On the other hand, Naïve Bayes relies on the assumption that all features are independent, which is where the "naïve" in its name comes from; if this assumption is violated, the classifier may perform rather badly.

2. Neural Network
The concept of a neural network is inspired by how animal brains work, with a vast network of interconnected neurons. It contains an input layer, an output layer, and a defined number of hidden layers and neurons. It is widely applied in artificial intelligence tasks such as speech and image recognition. A neural network is a powerful algorithm for modeling non-linear data, especially with a large

62

number of input features, such as sounds and pictures. However, it is nearly impossible to understand its classification boundaries intuitively, and for large data sets a neural network is also computationally expensive.

3. Neural Network with Feature Extraction
A neural network works well with feature extraction because of its powerful parallel computation. The model can lead to an exploratory network that performs dimension reduction or feature extraction on the training dataset using both linear and nonlinear neurons. "First, the formulation of a single and a multiple feature extraction are presented. Then a new projection index that favors directions possessing multimodality is presented (Intrator)."

63

4. Oblique Random Forest
The Oblique Random Forest shares basic ideas with the Random Forest algorithm, such as bagged trees. The main difference is the procedure by which optimal split directions are sought at each node: the Oblique Random Forest uses multivariate models for the binary splits in each node. There are two types of Oblique Random Forests: oRF-lda, which performs an unregularized split, and oRF-ridge, which optimizes a regularization parameter at every split (Menze).

5. Bagged CART
Bagging, also referred to as bootstrap aggregation, is an ensemble method from the "ipred" package that creates different models of the same type using different sub-samples of the same dataset. Bagged CART is very effective when the base method has high variance, because all of the models are combined to provide the final result (Brownlee, 2014).

6. Stochastic Gradient Boosting
Stochastic Gradient Boosting is a gradient boosting model with a minor modification: it incorporates randomness as an integral part of the procedure. Making "the value of f smaller [reduces] the amount of data available to train the base learner at each iteration. This will cause the variance associated with the individual base learner estimates to increase (Friedman, 1999)."

7. Bagged AdaBoost
AdaBoost is best used to boost the performance of decision trees on binary classification problems. AdaBoost is a very good approach because it corrects its own mistakes. The boosting for this model is achieved through ensembling methods. However, the biggest problem with this algorithm is that it is sensitive to outliers.

8. Boosted Logistic Regression
Logistic regression is a special case of the generalized linear model (GLM) in which the response variable is binary. It uses a logistic function to measure the relationship between the response variable and the predictors. Unlike other models in the GLM family, the outputs of a logistic regression are probabilities, which can be converted to binary results (0 or 1) based on a specified cutoff threshold. One advantage of logistic regression is that it is simple and efficient and does not require much memory to run. In addition, the output probability scores are easy to interpret, and it is possible to adjust the cutoff manually to obtain the best prediction results. However, logistic regression does not perform well on small data sets, as it will tend to overfit the training data.

9. C5.0 Tree
The C5.0 tree model extends the C4.5 classification algorithm developed by Ross Quinlan (1992). It is a decision tree model with some improvements over its predecessor. C5.0 is a sophisticated data mining tool for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions (Information on See5/C5.0, 2017). Like other decision trees, C5.0 is able to handle non-linear features and produce intuitive decision rules. However, it does not provide ranking scores, and it can be extremely vulnerable to overfitting.

64

10. Support Vector Machines with Class Weights
Like neural networks, Support Vector Machines (SVMs) are a very popular machine learning technique. SVMs were developed by Cortes & Vapnik (1995) for binary classification and can handle non-linear decision boundaries. SVMs can handle a large feature space, but they are not very efficient computationally (Sachan, 2015).

Results

Runtime

Runtime is one of the key factors to consider when choosing among a possible set of predictive solutions to support a business problem. It is also important when deciding on the technology or architecture to use. Some detractors of R claim that it does not perform as well as other languages such as Python or SAS. Thus, we provide an idea of the runtime for a large dataset such as the one this company uses to understand whether its customers will churn. There is no concrete relation between runtime and the accuracy of a predictive model, although more complex models typically require more time to produce predictions. An accurate model with a short runtime is ideal, but this is unlikely; depending on the business problem at hand, a quicker, less accurate model or a slower, more accurate model may be preferred. Table 16 shows that the total run time for these ten models ranges from roughly one minute to nearly twelve hours.

Table 16: Run time by algorithm
Algorithm: Total Run Time; Training Time; Predict/Score Time
nb: 18.94176 mins; 9.188566 mins; 9.753199 mins
nnet: 10.46171 mins; 10.44262 mins; 0.01909322 mins
pcaNNet: 7.621718 mins; 7.584865 mins; 0.03685278 mins
ORFlog: 10.00892 hours; 9.759153 hours; 0.2497653 hours
treebag: 6.574278 mins; 4.305433 mins; 2.268845 mins
gbm: 2.763423 mins; 2.741724 mins; 0.02169898 mins
AdaBag: 34.03858 mins; 33.18765 mins; 0.850931 mins
LogitBoost: 1.094711 mins; 49.7573 secs; 15.92534 secs
C5.0: 26.34021 mins; 25.73059 mins; 0.6096223 mins
svmRadialWeights: 11.82071 hours; 11.73735 hours; 0.08335575 hours

According to the runtimes recorded above, the Oblique Random Forest and the Support Vector Machine with class weights are the two most time-consuming of the ten predictive models we chose for this study; both took more than ten hours to run.

Accuracy, Sensitivity, and Specificity Performance

When we examine the results, we must first understand the difference between the training set and the test set and be sure that we did not overfit to the training set. During the data splitting stage of the workflow, we partitioned the data into a train set and a test set, with the training set containing the majority of the data. The training set is used to train the model; once the model is trained, the test set data are fed into the trained model. If the model is well trained, we expect the accuracy on the test set to be similar to the accuracy on the training set. What we do not want is to overfit to the training set, which

65

means the model we trained only gives good results on the training set: if we feed the test data into an overfitted model, the test-set results will not be similar to the training-set results. Figure 5 shows the accuracy of each model on the train and test sets. Accuracy gives the percentage of all observations for which the target variable was correctly identified.

Figure 5: Accuracy by model

Figure 6 provides the sensitivity (or true positive rate). Sensitivity is the percentage of actual churners who were correctly predicted as churners. In this business case, most of the models tend to perform well at predicting churners on the train set but do not generalize well, as the test-set performance is much lower. We believe this is due to the imbalance in the data set, as most customers are non-churners.

Figure 6: Sensitivity by model
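A hedged sketch of how these train- and test-set statistics can be produced with caret, continuing the hypothetical fit_gbm model, test_set data frame, and is_churn response assumed in the earlier sketch, might look as follows.

    # class predictions and churn probabilities on the held-out test set
    pred_class <- predict(fit_gbm, newdata = test_set)
    pred_prob  <- predict(fit_gbm, newdata = test_set, type = "prob")

    # accuracy, sensitivity, and specificity come straight from the confusion matrix;
    # positive = "churn" makes sensitivity the true-churner rate shown in Figure 6
    cm <- confusionMatrix(data = pred_class, reference = test_set$is_churn,
                          positive = "churn")
    cm$overall["Accuracy"]
    cm$byClass[c("Sensitivity", "Specificity")]

    # the same calls applied to the training data give the train-set columns of Table 17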

66

Figure 7 provides the specificity (or true negative rate). Specificity measures the proportion of actual non-churners who were correctly predicted as non-churners. We see similar performance on the train and test sets for all algorithms on the specificity measure. The logic is the same as for sensitivity: since most records are non-churners, the algorithms tend to predict them better than the churners.

Figure 7: Specificity by model

The Oblique Random Forest and Bagged CART are overfitting: their accuracy on the train dataset is much higher than on the test dataset, especially the Oblique Random Forest, whose test accuracy is below 35%, not to mention its roughly ten-hour runtime. The best model based on accuracy is the C5.0 tree, since its test-set accuracy is the highest of the ten models and is close to its train-set accuracy. With respect to sensitivity, the Oblique Random Forest reaches 100%, which means it correctly identified all of the people who churn in our training data. Its specificity, however, is low, which means it tended to predict churn far more often than it actually occurred. We get these results because the Oblique Random Forest tended to guess that a customer would churn more often than not, which leads to high sensitivity but low specificity.

Table 17: Train and test statistics by algorithm
Model: Train Accuracy, Train Sensitivity, Train Specificity, Test Accuracy, Test Sensitivity, Test Specificity
nb: 0.819057, 0.895409, 0.723617, 0.719743, 0.621522, 0.72156
nnet: 0.884458, 0.907025, 0.856248, 0.853869, 0.673895, 0.857197
pcaNNet: 0.880958, 0.911831, 0.842366, 0.843237, 0.722586, 0.845468
ORFlog: 0.70673, 1, 0.340142, 0.333839, 0.995908, 0.321595
treebag: 0.998637, 0.99818, 0.999207, 0.893044, 0.625205, 0.897998
gbm: 0.906069, 0.928848, 0.877595, 0.875332, 0.693535, 0.878695
AdaBag: 0.863256, 0.886962, 0.833623, 0.833735, 0.723813, 0.835767
LogitBoost: 0.876175, 0.86698, 0.887667, 0.880318, 0.51964, 0.886988
C5.0: 0.954131, 0.951099, 0.95792, 0.902837, 0.638298, 0.907729
svmRadialWeights: 0.906989, 0.924594, 0.884983, 0.878074, 0.690671, 0.88154

67

Lift Curve

Lift describes how much of the positive population is captured at a given percentile of the scored population: for a given percentile cutoff, it is the proportion of positives identified by the model relative to the proportion that would be identified by random selection.

Figure 8: Lift curves by algorithm

For our problem, the lift curves for Naïve Bayes, Bagged AdaBoost, and LogitBoost did not perform well, since those models need a higher percentile of the population to capture the same number of positives as the other models. The C5.0 classification tree was the best model based on lift.

Area Under Curve (AUC)

The Area Under Curve is the area under the receiver operating characteristic (ROC) curve, which plots sensitivity against 1 - specificity. The area under the curve increases as the model identifies true positives more accurately for a given false positive rate.

Table 18: AUC by algorithm
nb: 0.772650517
nnet: 0.866525556
pcaNNet: 0.86538195
ORFlog: 0.836493277
treebag: 0.887684069
gbm: 0.873647309
AdaBag: 0.779440607

68

LogitBoost: 0.820858647
C5.0: 0.899662705
svmRadialWeights: 0.775597068

In general, all ten models performed reasonably well based on AUC, as all values are greater than 0.77. We observed that the C5.0 tree and Bagged CART models were the two best models for this problem among the ten.

Observational Resampling of Statistics

From a statistical point of view, because the ROC, sensitivity, and specificity are calculated from observational data, the true values of these statistics are best summarized with confidence intervals. To generate the graphs below, we applied the created models to subsets of the test data set and calculated the AUC, sensitivity, and specificity 1,000 times. Figure 9 shows the robustness of these models.

Figure 9: Boxplot of results

For our data, because there were many observations in the training set, the values of each of these statistics are relatively robust, indicating that the models are well trained and that the conclusions made previously in this paper, notably the Oblique Random Forest's markedly low specificity, are not due to incidental values.

Conclusion

With the development of many kinds of predictive analytics tools and models, the selection of predictive models has become more important for companies in the business world. This study demonstrates how the caret package in R provides many sophisticated functions to build a complete predictive analytics workflow. Caret provides one of the most comprehensive wrappers for any set of R packages and can be used on its own to define an entire workflow, from data cleaning and preprocessing all the way through model training, prediction, and performance analysis. In our study, we summarized the available functions, showed how to use them in order, and presented results from a real business problem using different kinds of algorithms. We believe those new to R or caret will find

69

this paper useful as a go-to reference to get up to speed more quickly and to use some of these functions for their own predictive modeling tasks. We are currently working to extend this study on the same dataset using the popular scikit-learn package. Since R and Python are the most popular analytics languages used by professionals today, we hope to identify functionality that does and does not exist in each, as well as compare both model performance and runtimes. As stated in this paper, some detractors of R believe it is not well equipped to support developing and predicting on large datasets, whereas languages such as Python are.

References

Batuwita, R., & Palade, V. (2012). Class Imbalance Learning Methods for Support Vector Machines (pp. 1-2). University of Oxford. Retrieved 12 December 2017, from http://www.cs.ox.ac.uk/people/vasile.palade/papers/Class-Imbalance-SVM.pdf

Brownlee, J. (2017). K-Nearest Neighbors for Machine Learning. Machine Learning Mastery. Retrieved 11 December 2017, from https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 1-25.

ctufts/Cheat_Sheets. (2017). GitHub. Retrieved 12 December 2017, from https://github.com/ctufts/Cheat_Sheets/wiki/Classification-Model-Pros-and-Cons

Information on See5/C5.0. (2017). Rulequest.com. Retrieved 12 December 2017, from http://www.rulequest.com/see5-info.html

Lanham, M. (2017). Project Problem: A Comparison of R Caret and Python scikit-learn for Predictive Analytics.

Kuhn, M. (2007, October 4). Caret Package v2.27. Retrieved from https://www.rdocumentation.org/packages/Caret/versions/2.27

Kuhn, M. (2017). The Caret Package. GitHub. Retrieved 14 December 2017, from https://topepo.github.io/Caret/

Sachan, L. (2015). Logistic Regression vs Decision Trees vs SVM: Part II. Edvancer Eduventures. Retrieved 11 December 2017, from https://www.edvancer.in/logistic-regression-vs-decision-trees-vs-svm-part2/

Scikit-learn Developers. (2017, November 21). Scikit-learn User Guide Release 0.19.1. Retrieved from http://scikit-learn.org/stable/_downloads/scikit-learn-docs.pdf

70

Friedman, J. H. (1999, March 26). Stochastic Gradient Boosting. Retrieved February 22, 2018, from https://statweb.stanford.edu/~jhf/ftp/stobst.pdf

Intrator, N. (n.d.). A Neural Network for Feature Extraction. Retrieved February 23, 2018, from https://pdfs.semanticscholar.org/970a/2fa8e2a8a3139a87fa9379dbda0536654a77.pdf

Menze, B. H., et al. On Oblique Random Forests. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.207.7485&rep=rep1&type=pdf

Non-Linear Classification in R with Decision Trees. (2016, September 21). Retrieved February 23, 2018, from https://machinelearningmastery.com/non-linear-classification-in-r-with-decision-trees/

71

Recruitment Analytics: An Investigation of Program Awareness & Matriculation

Liye Sun & Matthew A. Lanham
Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907
[email protected]; [email protected]

POSTER ABSTRACT

In this study we provide a descriptive and predictive analysis of analytics program recruitment and awareness. The motivation for this study is that more and more universities are offering analytics programs or analytics concentrations. Understanding where students obtain information and how they view programs is important for attraction and recruitment. From a planning perspective, being able to estimate the matriculation rate could provide valuable decision support for program preparation. In this study, we first investigate awareness and marketing of Purdue's M.S. in Business Analytics and Information Management (BAIM) program and other analytics programs among different demographic groups and conduct a descriptive analysis to identify the key factors that attract students to apply and enroll. We then analyze the application and matriculation rates over the past two years to develop a predictive model of who is most likely to attend if provided an offer.

72

A Value at Risk Argument for Dollar Cost Averaging

Laurence E. Blose, Professor of Finance, Grand Valley State University, [email protected]
Eric Hoogstra, Clinical Affiliate Faculty of Finance, Grand Valley State University, [email protected]

Abstract

Keywords: Portfolio Theory, Dollar Cost Averaging, Value at Risk, Investments

The returns on Dollar Cost Average (DCA) portfolios are compared to a variety of lump sum portfolios. The study uses monthly CRSP (the Chicago-based Center for Research in Security Prices) total market returns over the period 1963 through 2017. The DCA implementation periods range from 3 months to 15 months. This paper finds that:
• For DCA implementation periods of 8 months or less, DCA investment performance is neither better nor worse than lump sum strategies.
• For periods of 11 months or more, DCA performance is dominated (mean-variance domination) by lump sum strategies that invest a portion of the funds in the risky investment and the remaining amount in T-Bills.
• Using Value at Risk (VaR), the study shows that investors with an aversion to downside risk will prefer DCA to the 100% lump sum strategy.

73

Introduction: Dollar Cost Averaging versus Lump Sum Investing

Financial planners and advisors often suggest that investors accumulate investments over time using a technique called dollar cost averaging (hereafter referred to as DCA). Typically, the DCA investment is made over an accumulation period of several months, with a portion invested at regular intervals. An alternative to DCA is lump sum investing, in which the entire amount is invested at once. A problem with lump sum investing is that if the investment is made at the beginning of a decline in the price of the investment, then the entire amount is invested at the higher price and the investor suffers a substantial loss. With dollar cost averaging, it is unlikely that the entire amount will be invested at the higher price. If the same dollar amount is invested each period and the investment declines in price over the accumulation period, then the investor will acquire more stock using DCA than with lump sum investing. On the other hand, if the stock price rises over the accumulation period, the DCA method will underperform the lump sum method.

Although often described and recommended in the financial press, in financial magazines, and by financial advisors, DCA investing is largely absent from academic textbooks on investing. Using theoretical arguments, the academic literature typically finds DCA to be inferior to lump sum investing in diversified portfolios; see, for example, Constantinides (1979) and Rozeff (1994). Table 1 presents some of the empirical studies that have compared DCA investing to lump sum investing. The studies present mixed results and, at best, weak support for DCA. Some of the research finds a clear preference for lump sum investing. Other research indicates some preference for DCA investing; however, the preference is often qualified and limited to certain types of investors, certain types of investments, and, in some cases, limited time periods. One issue that is not addressed in either the financial press or the academic literature is the length of the accumulation period. This paper examines accumulation periods ranging from 3 months to 15 months and finds that the DCA method performs well for shorter accumulation periods and is sub-optimal for longer accumulation periods.
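A small hypothetical illustration of the share accumulation effect described above (the prices and dollar amounts below are invented for the example and are not drawn from the data used in this paper):

    # hypothetical prices at three monthly purchase dates during a declining market
    prices <- c(10, 8, 6)

    # lump sum: the full $1,200 buys shares at the first price only
    shares_lump <- 1200 / prices[1]        # 120 shares

    # DCA: $400 is invested at each of the three prices
    shares_dca <- sum(400 / prices)        # 40 + 50 + 66.7 = 156.7 shares

    c(lump_sum = shares_lump, dca = shares_dca)
    # when the price falls over the accumulation period, DCA acquires more shares;
    # if the price had risen instead, the ordering would reverse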

74

Calculation of Returns for the DCA Portfolio

An initial investment of $1000 was invested using DCA accumulation periods of 3 through 15 months. Monthly returns for the CRSP Value-Weighted Index were drawn for the period August 1963 through December 2017. Funds not invested in the CRSP index were invested in T-Bills. The holding period return and effective annual return (ear) were calculated, and the average ear is reported in Table 2.

The ear for the DCA was calculated as follows. The initial investment of $1000 was divided between the equity investment (the CRSP index) and the T-Bill investment. The initial investment of $1000 was divided by n, the number of months in the accumulation period, and this amount was invested in the equity portfolio at the beginning of each month during the accumulation period. The portion not invested in the index earned the monthly average T-Bill rate. At the beginning of the final month, the final installment payment of 1000/n was added to the equity investment along with the interest earned from the T-Bills. The ear was then calculated for the investment. There were 654 months in the study, so for an accumulation period of n months there are (654 - n) overlapping accumulation periods. The average, standard deviation, and coefficient of variation of the ear over the (654 - n) accumulation periods were calculated and are reported in Table 2.

Calculation of Returns for Three Buy and Hold Portfolios

Returns on three buy and hold portfolios were calculated for comparison with the DCA returns. Each buy and hold portfolio had a portion invested in the equity index and a portion invested in T-Bills, as follows:
• The 100% portfolio had the entire initial investment of $1000 invested in the equity index.
• The low risk portfolio was constructed to have the smallest standard deviation among portfolios with the same return as the DCA portfolio. Table 2 presents the returns and the p-value of the F-test for the difference in standard deviation between the DCA portfolio and the low risk portfolio.

75

• The high return portfolio was constructed to have the highest possible return among portfolios with the same standard deviation as the DCA portfolio. Table 2 presents the returns for the high return portfolio and the p-value of the paired t-test for the difference between the ear for the DCA and the ear for the high return portfolio.

Comparison of DCA Returns to Returns on the 100% Portfolio

For every accumulation period (3 months through 15 months), the DCA portfolio has a lower standard deviation and a lower average ear than the 100% equity portfolio. This result is what should have been predicted: the equity index portfolio has by far the highest standard deviation, and the DCA has less invested in the equity index during the accumulation period, hence its lower standard deviation. Since the equity index portfolio has a higher level of systematic risk than the T-Bill portfolio, we would also expect it to have a higher average return than the T-Bill portfolio. For every accumulation period, the coefficient of variation of the 100% equity portfolio is greater than that of the DCA, indicating that the standard deviation per unit of return is greater for the 100% portfolio than for the DCA portfolio. This may indicate a preference for the DCA, especially among highly risk-averse investors. However, as seen in the next section, there are partial buy and hold portfolios that have lower coefficients of variation than the DCA.

Comparison of DCA Returns to Returns on Partial (Less than 100%) Buy and Hold Portfolios

There are two partial buy and hold portfolios. The partial buy and hold portfolios have a proportion invested in the equity index and the remainder invested in T-Bills. The proportion is established at the beginning of the first month and then held without change (or rebalancing) through the accumulation period. The percentage invested in the equity index is shown at the top of the columns in Table 2. The low risk portfolio has the lowest possible standard deviation of all portfolios that have an average ear equal to the average ear of the DCA portfolio. Table 2 shows that the low risk portfolios typically have approximately 49% to 51% invested in the equity index.1 This

1 The five-month accumulation period is an exception, with 56.5% invested in the equity index.

76

number is curiously stable across all tested accumulation periods, with the shorter periods closer to 51% and the longer periods closer to 49%. Table 2 presents the results of the F-test of the hypothesis that the standard deviation of the DCA returns equals the standard deviation of the low risk portfolio returns. The standard deviation of the low risk portfolio is less than the standard deviation of the DCA portfolio for every accumulation period examined; accordingly, the coefficient of variation of the low risk portfolio is lower than that of the DCA for every accumulation period. However, the F-test does not reject the null hypothesis of equal standard deviations for accumulation periods of less than nine months.2 For accumulation periods of nine months or greater, the null hypothesis is rejected; the longer the accumulation period, the lower the p-value and the more significant the rejection. At fifteen months the p-value is .00477, well below the .01 level of significance.

The F-test indicates that the standard deviations of the low risk portfolios are not significantly different from the standard deviation of the DCA portfolio when the accumulation period is less than nine months, so the DCA portfolios perform as well as the optimal low risk portfolios for those periods. For accumulation periods of nine months or greater, the low risk portfolios have the same returns as the DCA portfolios but significantly lower standard deviations. For these longer accumulation periods the low risk portfolios enjoy mean-variance dominance over the DCA portfolios and would be preferred by all risk-averse investors, no matter how risk averse.

The high return portfolios have the highest returns of all portfolios with the same standard deviations as the DCA portfolios. Table 2 shows that the high return portfolios for all accumulation periods have from 51% through 55% invested in the equity index. The high return portfolios' coefficients of variation are lower than those of the DCA portfolios, indicating that the high return portfolios provide greater return per unit of standard deviation than the DCA portfolios.

2 The F-test is a two-tailed test. The null hypothesis of equality is rejected at the .05 or lower level of significance.
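As a hedged sketch of how one such comparison could be reproduced, the code below assumes two numeric vectors, eq_ret (monthly CRSP index returns) and tbill (matching monthly T-Bill rates), neither of which is reproduced here; the 50% buy and hold weight is an illustrative value in the 49% to 55% range reported in Table 2, not the optimized weight used in the paper, and the treatment of the final month's T-Bill interest only approximates the procedure described above.

    # effective annual return from the ending value of a $1000 initial investment
    ear <- function(value, n) (value / 1000)^(12 / n) - 1

    # DCA ending value over an n-month window: 1000/n moved into equity each month,
    # the uninvested balance earning the T-Bill rate; in the final month the last
    # installment plus accrued T-Bill interest is moved into equity
    dca_value <- function(eq, tb, n) {
      cash <- 1000; equity <- 0; step <- 1000 / n
      for (m in seq_len(n)) {
        put    <- if (m < n) step else cash
        cash   <- (cash - put) * (1 + tb[m])
        equity <- (equity + put) * (1 + eq[m])
      }
      cash + equity
    }

    # partial buy and hold ending value with weight w in the equity index
    bh_value <- function(eq, tb, n, w)
      1000 * (w * prod(1 + eq[1:n]) + (1 - w) * prod(1 + tb[1:n]))

    n <- 12; w <- 0.50
    starts  <- 1:(length(eq_ret) - n)      # overlapping accumulation periods
    dca_ear <- sapply(starts, function(s)
      ear(dca_value(eq_ret[s:(s + n - 1)], tbill[s:(s + n - 1)], n), n))
    bh_ear  <- sapply(starts, function(s)
      ear(bh_value(eq_ret[s:(s + n - 1)], tbill[s:(s + n - 1)], n, w), n))

    var.test(dca_ear, bh_ear)               # two-tailed F-test for equal variances
    t.test(dca_ear, bh_ear, paired = TRUE)  # paired t-test for equal mean ear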

77

Table 2 shows the results of the paired t-test of the hypothesis that the average ear for the high return portfolio is equal to the average ear for the DCA portfolio. For accumulation periods of 11 months or less, the average ear for the high return portfolio is not significantly different from the average ear for the DCA portfolios. For accumulation periods of 12 months or more, the average ear for the high return portfolio is significantly greater than the average ear for the DCA portfolio. As with the low risk portfolio, the t-tests indicate that for the longer accumulation periods the high return portfolio exhibits mean-variance dominance over the DCA portfolios. Accordingly, all risk-averse investors prefer the high return portfolios when the accumulation period is twelve months or more.

In summary, using mean-variance dominance as the selection criterion, the DCA portfolios and the partial buy and hold portfolios provide similar returns and risks when accumulation periods are less than 9 to 11 months. However, for longer accumulation periods the partial buy and hold portfolios dominate the DCA portfolios. These results indicate that accumulation periods for DCA investing should not exceed 9 to 11 months.

Value at Risk

Table 3 presents the value at risk (VaR) for the DCA and the buy and hold portfolios. In Table 3, VaR(x%) = R indicates that there is a probability of x that the outcome will be R or less. The lower R is, the greater the possible loss; investors averse to downside loss will prefer investments with a higher R. Table 3 shows the VaR for probabilities of .05 and .10. For example, the first line in Table 3 gives the value at risk for x = .05 for the 3-month accumulation period. For the DCA portfolio the 5% VaR is R = -24.19%; in other words, there is a 5% chance that the outcome will be an effective annual rate of less than -24.19%. For the 100% portfolio the 5% VaR is R = -50.11%, indicating that the downside risk is substantially greater for the 100% portfolio than for the DCA.

The downside risk reflected in the VaR for the DCA is substantially less than that of the 100% buy and hold portfolio for all accumulation periods. Accordingly, the downside risk for the DCA portfolio is less than the downside risk for the 100% buy and hold portfolio. The partial buy and hold

78

portfolios' VaR is approximately the same as the VaR for the DCA portfolios and often indicates slightly less downside risk.

Summary and Conclusion

For all accumulation periods the DCA has a lower return and a lower standard deviation than the buy and hold portfolio with 100% invested in the equity investment. The coefficient of variation of the DCA portfolio is less than the coefficient of variation of the 100% buy and hold portfolio for all accumulation periods. The coefficient of variation indicates the risk (standard deviation) per unit of return, and the lower coefficient of variation indicates a preference for the DCA. However, investors with low risk aversion will still prefer the 100% buy and hold portfolio; these investors seek the higher returns and are less concerned about the risk of the investment.

The DCA returns were also compared to two partial buy and hold portfolios. The partial buy and hold portfolios have only a portion of the initial investment invested in the equity index, with the remainder invested in T-Bills. The two partial buy and hold portfolios are referred to as the low risk portfolio and the high return portfolio and are created as follows:

• The low risk portfolio has the lowest possible standard deviation of all possible partial portfolios with the same return as the DCA portfolio.

• The high return portfolio has the highest possible average ear of all possible partial portfolios with the same standard deviation as the DCA portfolio.

The results indicate that the DCA performs as well as the partial buy and hold portfolios for accumulation periods of less than 12 months. For longer accumulation periods, however, the partial buy and hold returns exhibit mean variance dominance over the DCA returns. Accordingly, all risk averse investors will prefer the partial buy and hold portfolios for accumulation periods of 12 months or greater.

The VaR calculation shows that the DCA has substantially less downside risk than the 100% buy and hold portfolio. However, the partial buy and hold portfolios show less downside risk than the DCA portfolio for all accumulation periods except 5 and 6 months.
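As a rough illustration of the partial buy and hold construction and the historical VaR measure discussed above, the following Python sketch (not from the paper) assumes hypothetical monthly return arrays r_equity and r_tbill and reads VaR(x%) as the empirical x-th percentile of ear outcomes over rolling accumulation windows.

import numpy as np

def partial_buy_and_hold(r_equity, r_tbill, w):
    # Monthly return of a portfolio with fraction w in the equity index and
    # (1 - w) in T-Bills (simple monthly-rebalanced approximation).
    return w * r_equity + (1 - w) * r_tbill

def ear_over_window(monthly_returns):
    # Effective annual rate implied by one accumulation window of monthly returns.
    months = len(monthly_returns)
    growth = np.prod(1.0 + monthly_returns)
    return growth ** (12.0 / months) - 1.0

def rolling_ears(monthly_returns, window):
    # ear outcome for every overlapping accumulation window in the sample.
    return np.array([ear_over_window(monthly_returns[i:i + window])
                     for i in range(len(monthly_returns) - window + 1)])

def historical_var(ears, prob):
    # VaR(prob) = R such that the empirical probability of an outcome <= R is prob.
    return np.percentile(ears, 100 * prob)

# Hypothetical monthly return series standing in for the equity index and T-Bills.
rng = np.random.default_rng(1)
r_equity = rng.normal(0.009, 0.045, size=653)
r_tbill = np.full(653, 0.003)

blend = partial_buy_and_hold(r_equity, r_tbill, w=0.51)
ears = rolling_ears(blend, window=12)
print("VaR(5%) =", round(historical_var(ears, 0.05), 4),
      " VaR(10%) =", round(historical_var(ears, 0.10), 4))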

79

References

Bacon, P. W., Williams, R. E., & Ainina, M. F. (1997). Does Dollar Cost Averaging Work for Bonds? Journal of Financial Planning, 10(3), 78-80.

Brennan, M. J., Li, F., & Torous, W. N. (2005). Dollar Cost Averaging. Review of Finance, 9(5), 509-535.

Constantinides, G. (1979). A Note on the Suboptimality of Dollar-Cost Averaging as an Investment Policy. Journal of Financial and Quantitative Analysis, 14(2), 443-450. doi:10.2307/2330513

Knight, J. R., & Mandell, L. (1992). Nobody Gains from Dollar Cost Averaging: Analytical, Numerical, and Empirical Results. Financial Services Review, 2(1), 51-61.

Leggio, K. B., & Lien, D. (2003). An Empirical Examination of the Effectiveness of Dollar-Cost Averaging Using Downside Risk Performance Measures. Journal of Economics and Finance, 27, 211. https://doi.org/10.1007/BF02827219

Rozeff, M. S. (1994). Lump-Sum Investing Versus Dollar-Averaging. Journal of Portfolio Management, Winter. Available at SSRN: https://ssrn.com/abstract=820004

80

Table 1: Research Comparing Dollar Cost Average (DCA) Returns to Lump Sum (LS) Returns1

Israelsen (1999)
  Study period: 10 years ending 9/30/1998
  Sample: 35 equity funds
  DCA period: Monthly over the entire period
  Findings: 19 of the 35 funds had higher returns using DCA. DCA performs better with lower standard deviation funds.

Rozeff (1994)
  Study period: 1926 through 1990
  Sample: S&P 500 and a small stock portfolio
  DCA period: Monthly over periods 2 through 12
  Findings: Fractional LS investments outperform DCA.

Leggio and Lien (2003)
  Study period: 1926 through 1999
  Sample: S&P 500, small stocks, corporate and government bonds
  DCA period: 12 monthly periods (non-overlapping)
  Findings: Sharpe ratio: DCA preferred for corporate and government bonds, LS preferred for equity. Sortino ratio: LS investing preferred for all except government bonds. Upside potential ratio: DCA preferred for all except small stocks.

Brennan, Li, and Torous (2005)
  Study period: 1926 through 2003
  Sample: A variety of equity portfolios and individual securities
  DCA period: Monthly for periods of 1 to 6 years
  Findings: Depends on the risk aversion of the investor. Market portfolio: for most investors DCA is superior to LS for more risk averse investors; however, DCA is inferior to fractional LS. Individual security: DCA dominates both LS and fractional LS except for investors with very low risk aversion.

Bacon, Williams, and Ainina (1997)
  Study period: 1926 through 1995
  Sample: Corporate and Treasury bonds
  DCA period: Monthly
  Findings: A simple buy and hold strategy (50% in bonds and 50% in T-Bills) beats the DCA.

Knight and Mandell (1993)
  Study period: 1963 through 1992

81

  Sample: S&P 500 Index
  DCA period: Monthly for periods of 2 to 10 months
  Findings: DCA is beaten by fractional buy and hold strategies.

1 DCA is dollar cost averaging; LS is a 100% lump sum investment; FLS is a fractional lump sum investment, with a portion of the initial investment in the risky portfolio and the remainder in cash or risk-free investments. The FLS ratio is held over the lifetime of the investment.

82

Table 2: Average Effective Annual Returns (ears) for Dollar Cost Average and Buy and Hold Portfolios

Column 2 contains results for the Dollar Cost Average portfolio. Column 3 contains results for immediate investment of 100% in the risky asset. Columns 4 and 5 contain results for portfolios split between the equity index and T-Bills; the column headings indicate the portion invested in the equity index. Column 4 has results for the Low Risk Portfolio, which has the lowest standard deviation of all portfolios with the same expected return as the DCA portfolio. Column 5 has results for the High Return Portfolio, which has the highest return of all portfolios with the same standard deviation as the DCA portfolio. The third row of each panel shows the coefficient of variation (CV). The p-value for the t-test of the difference of the means of the portfolio and the DCA portfolio (two-tailed test) and the p-value for the F-test of the difference of the standard deviations of the portfolio and the DCA portfolio (two-tailed test) are reported below each panel.

3 Month Accumulation Period
            DCA          100%         Low Risk Port 50.96%   High Ret Port 51.85%
Ex Ret      0.098614     0.179836     0.09861402             0.09981
St Dev      0.207016     0.413994     0.203402082            0.207016
CV          2.099252     2.302064     2.062608146            2.074102
P value for difference of means: 0.646747
P value for difference of standard deviations: 0.653336224

4 Month Accumulation Period
            DCA          100%         Low Risk Port 51.07%   High Ret Port 52.72%
Ex Ret      0.092189095  0.15459767   0.092189082            0.093975025
St Dev      0.174069311  0.33767789   0.16859695             0.174069285
CV          1.888176804  2.184236599  1.828816882            1.852293038
P value for difference of means: 0.500955
P value for difference of standard deviations: 0.415646401

83

Table 2 (continued)

5 Month Accumulation Period
            DCA          100%         Low Risk Port 56.50%   High Ret Port 53.17%
Ex Ret      0.088953     0.142474     0.088953               0.091169381
St Dev      0.154318     0.294847     0.163232               0.154318453
CV          1.734839     2.069482     1.835046               1.69265658
P value for difference of means: 0.380148337
P value for difference of standard deviations: 0.152499

6 Month Accumulation Period
            DCA          100%         Low Risk Port 50.58%   High Ret Port 53.98%
Ex Ret      0.086706     0.134498     0.086706               0.080505
St Dev      0.140483     0.265069     0.132781               0.140483
CV          1.620221     1.970807     1.531393               1.745032
P value for difference of means: 0.011331*
P value for difference of standard deviations: 0.151457

7 Month Accumulation Period
            DCA          100%         Low Risk Port 50.20%   High Ret Port 53.60%
Ex Ret      0.085313     0.12971      0.085312               0.088109
St Dev      0.130318     0.244268     0.122205               0.130318
CV          1.527539     1.883194     1.432439               1.479058
P value for difference of means: 0.219564
P value for difference of standard deviations: 0.102292

8 Month Accumulation Period
            DCA          100%         Low Risk Port 50.00%   High Ret Port 53.96%
Ex Ret      0.084302     0.125962     0.084302               0.087403
St Dev      0.121999     0.225994     0.1133                 0.121999
CV          1.447168     1.794137     1.34398                1.395829
P value for difference of means: 0.158916
P value for difference of standard deviations: 0.060105

84

Table 2 (continued)

9 Month Accumulation Period
            DCA          100%         Low Risk Port 49.77%   High Ret Port 54.28%
Ex Ret      0.08335      0.122722     0.08335                0.086728
St Dev      0.115156     0.211156     0.105938               0.115156
CV          1.381593     1.720604     1.270995               1.327789
P value for difference of means: 0.114207
P value for difference of standard deviations: 0.03426*

10 Month Accumulation Period
            DCA          100%         Low Risk Port 49.62%   High Ret Port 54.55%
Ex Ret      0.082681     0.120345     0.082681               0.086255
St Dev      0.10909      0.198418     0.099658               0.10909
CV          1.319414     1.648745     1.205336               1.264737
P value for difference of means: 0.086661
P value for difference of standard deviations: 0.02178*

11 Month Accumulation Period
            DCA          100%         Low Risk Port 49.47%   High Ret Port 54.90%
Ex Ret      0.082059     0.118226     0.082059               0.085867
St Dev      0.103816     0.187067     0.094094               0.103816
CV          1.265132     1.582289     1.146652               1.209029
P value for difference of means: 0.061006
P value for difference of standard deviations: 0.01269*

12 Month Accumulation Period
            DCA          100%         Low Risk Port 49.29%   High Ret Port 55.10%
Ex Ret      0.081556     0.116572     0.081556               0.085531
St Dev      0.099185     0.177566     0.089359               0.099185
CV          1.216162     1.523223     1.095683               1.159643
P value for difference of means: 0.04369
P value for difference of standard deviations: 0.008243**

85

Table 2 (continued)

13 Month Accumulation Period
            DCA          100%         Low Risk Port 49.12%   High Ret Port 55.22%
Ex Ret      0.08112      0.11516      0.08112                0.085203
St Dev      0.095112     0.169468     0.085318               0.095112
CV          1.172485     1.471582     1.051753               1.116293
P value for difference of means: 0.032389
P value for difference of standard deviations: 0.005967**

14 Month Accumulation Period
            DCA          100%         Low Risk Port 48.92%   High Ret Port 55.14%
Ex Ret      0.080742     0.113998     0.080742               0.084827
St Dev      0.091561     0.162996     0.082015               0.091561
CV          1.133992     1.429807     1.015763               1.079388
P value for difference of means: 0.026297
P value for difference of standard deviations: 0.005374**

15 Month Accumulation Period
            DCA          100%         Low Risk Port 48.76%   High Ret Port 55.11%
Ex Ret      0.080401     0.112936     0.080402               0.084502
St Dev      0.088407     0.157103     0.079063               0.088407
CV          1.099568     1.391082     0.983351               1.046213
P value for difference of means: 0.021578
P value for difference of standard deviations: 0.004774**

86

Table 3: Value at Risk

VaR by accumulation period for the DCA portfolio and the buy and hold portfolios (amount invested in risky assets: 100%, Low Risk Port 50.96%, High Ret Port 51.85%).

Accum Period          DCA       100%      Low Risk Port   High Ret Port
 3   VaR(5%)       -0.2419   -0.5011   -0.2360         -0.2407
     VaR(10%)      -0.1667   -0.3507   -0.1621         -0.1655
 4   VaR(5%)       -0.1941   -0.4008   -0.1851         -0.1923
     VaR(10%)      -0.1309   -0.2782   -0.1239         -0.1291
 5   VaR(5%)       -0.1649   -0.3425   -0.1795         -0.1627
     VaR(10%)      -0.1088   -0.2354   -0.1202         -0.1066
 6   VaR(5%)       -0.1444   -0.3015   -0.1317         -0.15057
     VaR(10%)      -0.0933   -0.2052   -0.0835         -0.09953
 7   VaR(5%)       -0.1290   -0.2721   -0.1157         -0.1263
     VaR(10%)      -0.0817   -0.1833   -0.0713         -0.0789
 8   VaR(5%)       -0.1164   -0.2458   -0.1021         -0.1133
     VaR(10%)      -0.0725   -0.1637   -0.0609         -0.0690
 9   VaR(5%)       -0.1061   -0.2246   -0.0909         -0.1027
     VaR(10%)      -0.0642   -0.1479   -0.0524         -0.0609
10   VaR(5%)       -0.0968   -0.2060   -0.0812         -0.0932
     VaR(10%)      -0.0571   -0.1339   -0.0450         -0.0536
11   VaR(5%)       -0.0887   -0.1895   -0.0727         -0.0849
     VaR(10%)      -0.0510   -0.1215   -0.0385         -0.0472
12   VaR(5%)       -0.0816   -0.1755   -0.0654         -0.0776
     VaR(10%)      -0.0456   -0.1110   -0.0330         -0.0416
13   VaR(5%)       -0.0753   -0.1636   -0.0592         -0.0712
     VaR(10%)      -0.0408   -0.1020   -0.0282         -0.0367
14   VaR(5%)       -0.0699   -0.1541   -0.0542         -0.0658
     VaR(10%)      -0.0366   -0.0949   -0.0244         -0.0325
15   VaR(5%)       -0.0650   -0.1455   -0.0497         -0.0609
     VaR(10%)      -0.0329   -0.0884   -0.0209         -0.0288

Notes:
1. VaR(x%) = R means that there is an x% chance that the return will be R or less.
2. Calculated using monthly CRSP Index returns over the period August 1963 through December 2017.

89

Developing an Innovative Supply Chain Management Major and Minor Curriculum

Sanjay Kumar, Valparaiso University
Ceyhun Ozgur, Valparaiso University
Coleen Wilder, Valparaiso University
Sanjeev Jha, Valparaiso University

Abstract

In the recent past, universities and colleges/schools of business have been challenged with maintaining and increasing student enrollment as well as providing curriculum that meets the needs of industry. Novel and up-to-date curriculum can help address this issue. In this paper we elaborate on the development of an innovative undergraduate major and minor in supply chain and logistics management. The major effectively utilizes existing resources of the college of business, open-source ERP (Enterprise Resource Planning) software, and data analytics software such as SAS and R. The course curriculum is benchmarked against peer institutions as well as the needs of the industry. A regional logistics council certified the curriculum.

Keywords: Supply chain management, Logistics management, Curriculum development, Operations management

90

Introduction

In recent years, U.S. universities have seen a significant overall decline in domestic enrollment, and international students have increasingly chosen to attend universities in other countries. Along with the sharp drop in students, certain majors have also experienced fluctuations in enrollment. At the same time, recent industry surveys have reported a sharp shortage of supply chain and logistics management skills, especially in countries such as China and India.

In this paper, we elaborate on the development of an innovative major and minor in Supply Chain and Logistics Management at Valparaiso University, a college fully accredited by AACSB. The paper provides the rationale behind introducing the major, which effectively utilizes existing resources available at the university. To identify competition between Valparaiso University and other schools, a market analysis was conducted. We also explore industry needs and align the curriculum with the needs of probable employers. Finally, we sought and received certification from a regional supply chain and logistics council to increase the program's visibility and credibility. Faculty who are implementing supply chain and logistics or operations management-related majors can utilize the ideas found in this paper, as it provides important guidelines for business schools seeking to maintain and increase student enrollment or develop new, impactful majors.

Rationale for the Program

Supply chain management (SCM) has grown in popularity in both industry and academia in the last decade. Although it is a relatively new discipline, SCM has strong market potential. As business operations become increasingly complex, the need for SCM talent is growing, and successful academic programs are needed to fill the market demand.

A supply chain manager is someone who impacts the overall success of a business. Supply chain managers are involved in every element of the organization, including purchasing, planning,

91

transportation, production, storage, and all the threads that connect the different elements of a business. Their work ensures that organizations can control their expenses, increase their sales, and effectively maximize profits. Additionally, these managers focus on collaboration and facilitation. The supply chain is the thread that holds everything together, which means supply chain managers are able to execute strategies across the organization. They support and diagnose the needs of any partner within the supply chain. As such, they have cross-functional roles, which include selecting and managing suppliers, thereby improving manufacturing processes; supporting transportation, distribution, and marketing campaigns in an effective manner; and communicating with customers, particularly through improved technological collaboration.

SCM professionals can command high salaries. The U.S. Bureau of Labor Statistics (2016) classifies supply chain managers as logisticians, who earned $74,170 per year with a bachelor's degree in 2016. Salaries are much higher for those with a master's degree, and particularly for those who hold an MBA. Projected growth in demand for logisticians is 22% between 2012 and 2022, which is much higher than average.

The global supply chain is estimated to be a $26 trillion per year industry, and the annual growth rate in supply chain jobs is expected to be approximately 30% (Riddell, 2016). Operations around the planet are becoming increasingly complex, and growing businesses are struggling to find sourcing partners that can carry their vision forward internationally. Companies now understand that their supply chain is vital to a successful business, and demand is therefore growing (IDG Connect, 2016). This means that companies across all industries, from technology to consultancy and from manufacturing to retail, are looking for people who understand the supply chain. However, there are not many people who have relevant supply chain knowledge in transportation, production, logistics, distribution, and finance. Some

92

estimates show that for every six jobs in a supply chain area, only one is occupied by qualified personnel. Owing to the shortage of schools offering supply chain degrees, this ratio could be skewed to one in nine. The shortage of SCM-related talent is felt all across the world. In China and India the field is growing quickly, as these economies directly impact the global supply chain. Emerging markets of East and Southeast Asia have also created a strong demand for SCM professionals, as the scale and size of their distribution networks require the best talent (Stryker, 2014). In 2007, the Chinese Academy of Sciences (CAS) highlighted the shortage domestic and foreign companies are facing in finding supply chain talent, and CAS encouraged universities to develop programs in this field. U.S. universities have the potential to train and educate a large number of students, including the many international students from China and India, to fill these positions in eastern countries.

The relationship of SCM to other business disciplines makes a program valuable to students and industry while also being feasible and sustainable, given the dynamics of higher education (Neureuther & O'Neill, 2011). There have been numerous programs at the MBA and executive MBA level; however, there have not been as many programs at the undergraduate level. Several institutions of higher education realized the importance of SCM and the demand for trained individuals after the 2000s. Thus, SCM became a common elective course in many business schools (Akalin, Huang, & Willems, 2016).

93

Regional Market Analysis

We conducted a market survey in the northwest Indiana region to identify the potential supply and demand of graduates with SCM-related expertise. The area is a strong manufacturing and transportation hub, and in the last decade industrial output for the overall state of Indiana has been increasing. Manufacturers in the steel, oil refinery, and auto parts industries account for about 30% of the state's output, employing approximately 17% of the total workforce (National Association of Manufacturers, 2015). Of all 50 states, Indiana has the highest concentration of these occupations. In addition, the world's largest steel producer, ArcelorMittal, is located in this region. The market analysis also revealed the following:

• No other college in the region offers a degree comparable to Valparaiso University's.

• There is a strong manufacturing base in the region that offers potential employment for graduating students.

• National trends in supply and demand also support developing the major.

SWOT Analysis (Strengths, Weaknesses, Opportunities and Threats)

Strengths

SCM is an innovative major because it includes open-source programs such as R and also includes a capstone course. Various supply chain courses integrate marketing, finance, and operations in the context of Enterprise Resource Planning (ERP) and SAS software.

94

Weaknesses

Typical undergraduate students tend to major in traditional business areas such as marketing, finance, accounting, and management.

Opportunities

The Google Trends data for SCM in the United States and China can be observed in Figure 1. Interest in both countries has been increasing since the recession in 2009, and Figure 1 shows steady growth for both the U.S. and China.

Figure 1: Supply Chain Management Major Popularity in the U.S. and China (Google Trends, 2004-2017; series: "Supply chain" in the United States and "Supply chain" in China)

Threats

This major is small compared to other majors, such as marketing and finance.

Developing the Program

Creating a SCM program that is valuable to both industry and students requires a framework that is cross-disciplinary. In the environment present at smaller universities, SCM programs can be sustainable and developed based on already-existing courses, such as logistics, information systems, and production and operations management. Valparaiso University is beginning to implement various courses related to the SCM major and minor, as shown in Tables 1 and 2 in the Appendix.

95


101

Identifying Industry and Curriculum Needs

In the spring of 2018, Valparaiso University began implementing a curriculum designed to give students exposure to all areas of SCM and logistics. The program begins with students taking courses in basic business functions, such as marketing, finance, and operations management. This allows students to gain knowledge of the basic business framework and functions of SCM. Students are then exposed to a variety of courses more specific to SCM. This gives students an industry and systems perspective of SCM as well as an understanding of its global aspects.

Course Definitions

IDS 306 Global Operations and Supply Chain Management
This course examines issues and methods for effectively managing global operations and supply chains. Topics include the role of operations in global strategy, processes, quality, capacity planning, facility layout and location, sourcing decisions, managing inventories for independent and dependent demand, and production. Prerequisite: IDS 205.

IDS 340 Statistics for Decision Making
A study of statistical concepts and methods to facilitate decision making. Content includes analysis of variance, simple and multiple regression, correlation, time-series analysis, and nonparametric methods. Prerequisite: one of IDS 205, STAT 140, STAT 240, PSY 201, CE 202, or completion of or concurrent enrollment in ECE 365. Not open to students who have completed STAT 340/540.

SCM 310 Global Logistics Management
The focus of this course is on strategic and tactical logistics decisions. The course provides an understanding of the concepts and techniques important for analyzing business logistics problems. A strategic and total systems approach is taken. Topics may include cross-docking, reverse logistics, multi-modal freight operations, high-tech automated warehousing and order

102

delivery, and current topics in the logistics industry. The importance of logistics and its relationship to other functional areas of responsibility will be emphasized. Prerequisite: IDS 306.

BUS 315 Analytical Modeling
A study of the fundamentals of prescriptive analytics with an emphasis on spreadsheet models. Students will learn to analyze decisions and apply sensitivity analysis to improve outcomes. Topics covered may include simulation, optimization, managing risk, and decision trees. Students will also communicate their results in written and oral formats appropriate for a general audience. Prerequisites: IDS 115 and MATH 124.

SCM 330 Enterprise Resource Planning Systems
Hands-on, "real world" usage of ERP software with a focus on supply chain management. Students will be trained to carry out supply chain management processes such as demand signal and planning, inventory control, capacity utilization, DRP, BOM, MRP, procurement, MPS, and manufacturing (work centers, routings), turning demand into marketable finished goods using ERP software. The course covers the sale and delivery of goods and introduces all accounting aspects, including invoicing and receiving payments. Prerequisites: IDS 306 and BUS 320.

SCM 402 Advanced Analytical Methods for SCM
This course provides an in-depth understanding of analytical tools to model supply chain issues. Topics may include aggregate planning and forecasting, inventory management, managing uncertainty, network design, and supply chain coordination.

103

Prerequisite: SCM 310.

SCM 405 Supply Chain Strategy - Capstone
A capstone course with emphasis on analysis and problem solving related to inventory and risk pooling, network planning, supply contracts, the value of information, procurement and outsourcing strategies, and product and supply chain design. Prerequisites: Senior standing, IDS 340, SCM 330, and SCM 402.

MKT 430: International Marketing
A study of managerial marketing policies and practices of organizations marketing their products and services in foreign countries. Specific stress will be placed on the relationship between marketing strategy, market structure, and environment. Prerequisite: MKT 304.

FIN 430: International Finance
An introduction to the functioning and management of the firm in international markets. The emphasis is on the multinational firm, but increasing globalization makes international finance of concern to virtually every business operation. Coverage includes the international financial environment and the measurement and management of risk exposure, particularly foreign exchange exposure, arising during international operations and trade. In addition, financing and investing decisions are considered in the international context. Prerequisite: FIN 304.

MGT 440: Cross-Cultural Management
This course focuses on the effect of national cultural value differences on the workplace. Leading empirical cross-cultural models are integrated and taught as analytical tools for understanding the effects of differing national cultural values on comparative management

104

issues. Particular emphasis is on the development of skills in cross-cultural conflict avoidance, cross-cultural conflict resolution, and in managing international, multicultural teams and virtual/global networks. May be used to fulfill the Cultural Diversity course component of the General Education Requirements. Registration priority is given to CoB students. Prerequisite: junior standing.

Explanation of the Coverage of Courses

BUS 315 Analytical Modeling
The use of analytics software such as SAS and open-source software such as R is integrated into each course. On the one hand, we help students utilize SAS and SPSS for large companies; on the other hand, we teach students to use open-source software, including R and Python, to analyze supply chain data for medium or small companies.

SCM 405 Supply Chain Strategy - Capstone
The capstone integrates typical operations management topics such as scheduling, inventory management, and project management techniques such as the Project Evaluation and Review Technique (PERT) and/or the Critical Path Method (CPM) (Krajewski, Ritzman, & Malhotra, 2009). Some typical operations management topics, such as facility layout and location and inventory management, are distributed among the major courses. In addition, these topics are covered with the help of software, such as the use of ERP as an integrating area by using SAP (Heizer & Render, 2010; Heizer, Render, & Munson, 2016).

105

The use of SAP in the capstone course in SCM is set in the context of ERP. In this course, we integrate SCM in terms of how it relates to marketing, finance, and business planning in general (Gattorna, 2017). We have utilized the college of business at Valparaiso University for the SCM major. Below, we summarize the emphases of the major courses:

• The use of SAP in the capstone course is set in the context of Enterprise Resource Planning (ERP) in relation to marketing, finance, and business planning.

• The use of analytics software such as SAS and open-source software such as R is integrated into each course.

• Typical operations management topics such as scheduling and inventory management are integrated with project management techniques such as the Project Evaluation and Review Technique (PERT) and/or the Critical Path Method (CPM).

Curriculum Effectiveness

Valparaiso University is beginning to attract more students from China and India. These countries have growing markets and need more qualified employees with professional and relevant SCM knowledge.

106

Figure 2. Employment Growth of SCM in China, from the National Bureau of Statistics of China: express package growth in China (in billions) of 36.7 (2011), 56.9 (2012), 91.9 (2013), 139.6 (2014), 206.7 (2015), and 312.8 (2016).

Figure 3. Chinese Packages among the World, 2016: China, 40%; remaining population, 60%.

Employers do not find the skill sets that are needed for the open jobs.

Measure of Effectiveness

Valparaiso University's SCM curriculum is in the early stages of development; however, the undergraduate program is already beginning to meet the specific needs of the SCM industry and allows current students to gain the skill set needed for future employment. Internships focused on SCM will be offered to current students, giving them real-world knowledge and experience. Internships help students prepare to manage supply chain and logistics-related material. They will also give Valparaiso University's program exposure as well as credibility, which may boost enrollment in the years to follow.

107

In addition, several administrators, professors, and department faculty have collaborated to minimize programmatic and curriculum problems. This collaboration has increased the effectiveness of all program activities.

Conclusion

Because of the strong demand for qualified SCM individuals, academic programs are needed to alleviate the shortage. There is an increase in the supply of positions, especially in China and India, but graduates are still few and far between. The development of a Supply Chain and Logistics Management major and minor at Valparaiso University, a college fully accredited by AACSB, offers a curriculum that meets the needs of the industry.

References

IDG Connect. (2016). A global survey of attitudes and future plans for the adoption of supply chain management solutions in the cloud. http://www.oracle.com/us/products/applications/idg-connect-report-infographic-3101243.pdf

Gattorna, J. (2017). Gower Handbook of Supply Chain Management. Routledge.

Krajewski, L. J., Ritzman, L. P., & Malhotra, M. K. (2009). Operations Management (9th ed.). Pearson.

Heizer, J., & Render, B. (2010). Principles of Operations Management (8th ed.). Prentice Hall.

Heizer, J., Render, B., & Munson, C. (2016). Operations Management: Sustainability and Supply Chain Management (12th ed.). Pearson.

Slack, N. (2015). Operations and Process Management (4th ed.). Pearson.

Stevenson, W. J. (2018). Operations Management (13th ed.). McGraw-Hill Education.

108

Redden, E. (2017, November 13). New international enrollments decline. Inside Higher Ed. https://www.insidehighered.com/news/2017/11/13/us-universities-report-declines-enrollments-new-international-students-study-abroad (accessed January 10, 2018).

Strauss, V. (2017, July 13). Why U.S. colleges and universities are worried about a drop in international student applications. The Washington Post.

Hildreth, B. (2017, July 5). U.S. colleges are facing a demographic and existential crisis. Huffington Post.

Riddell, D. (2016, October 14). Career outlook for logistics and supply chain management. http://www.rsilogistics.com/blog/career-outlook-for-logistics-supply-chain-management/

Stryker, J. G. (2014, June 4). Supply chain talents in emerging markets. Heidrick & Struggles. http://www.heidrick.com/Knowledge-Center/Article/Supply-Chain-Talents-in-the-Emerging-markets

Neureuther, B. D., & O'Neill, K. (2011). Sustainable supply chain management programs in the 21st century. American Journal of Business Education, 4(2), 11-18. https://eric.ed.gov/?q=supply+AND+chain+AND+management&id=EJ1056502

Akalin, G. I., Huang, Z., & Willems, J. R. (2016). Is supply chain management replacing operations management in the business core curriculum? Operations and Supply Chain Management: An International Journal, 9(2), 119-130.

National Association of Manufacturers. (2015, February). Indiana manufacturing facts [Graph illustration]. http://www.nam.org/Data-and-Reports/State-Manufacturing-Data/2014-State-Manufacturing-Data/Manufacturing-Facts--Indiana/

109

Talking Data. (2017, June 21). Logistics industry report [Graph illustration]. Beijing TengYun Technology. http://mi.talkingdata.com/report-detail.html?id=534

U.S. Bureau of Labor Statistics. (2016). Occupational outlook handbook [Data set]. https://www.bls.gov/ooh/business-and-financial/logisticians.htm

National Bureau of Statistics of China. http://www.stats.gov.cn/english/

State Post Bureau of the People's Republic of China. http://www.spb.gov.cn/

110

Appendix

Table 1: BS in Global Supply Chain Management (GSCM) Major

General Education Core (56 Credits)
Business Courses Core (38 Credits)
Major in GSCM Requirements (24 Credits):

Course                                                   Area
IDS 340 Statistics for Decision Making                   Information and Decision Sciences (IDS)
SCM 310 Global Logistics Management                      Supply Chain Management (SCM)
BUS 315 Analytical Modeling                              Business (BUS)
SCM 330 Enterprise Resource Planning Systems             Supply Chain Management (SCM)
SCM 402 Advanced Analytical Methods for SCM              Supply Chain Management (SCM)
SCM 405 Supply Chain Strategy - Capstone                 Supply Chain Management (SCM)

At least two of the following Global/International Focus courses:
MKT 430: International Marketing                         Marketing (MKT)
FIN 430: International Finance                           Finance (FIN)
MGT 440: Cross-Cultural Management                       Management (MGT)

Free electives: 7 Credits

111

Table 2: BS in Global Supply Chain Management (GSCM) Minor

Minor in GSCM Requirements (15 Credits):

Course                                                   Area
IDS 205 or equivalent                                    Information and Decision Sciences (IDS)
IDS 306 Global Operations and Supply Chain Management    Information and Decision Sciences (IDS)

At least two of the following:
SCM 310 Global Logistics Management                      Supply Chain Management (SCM)
BUS 315 Analytical Modeling                              Business (BUS)
SCM 330 Enterprise Resource Planning Systems             Supply Chain Management (SCM)
SCM 402 Advanced Analytical Methods for SCM              Supply Chain Management (SCM)
SCM 405 Supply Chain Strategy - Capstone                 Supply Chain Management (SCM)

At least one of the following:
MKT 430: International Marketing                         Marketing (MKT)
FIN 430: International Finance                           Finance (FIN)
MGT 440: Cross-Cultural Management                       Management (MGT)

112

Disruptive Innovation & Sustainable Value: The Implications of Disruptive Innovation on the Outcome of RE Businesses in Developed Economies

113

Abstract

Disruptive innovations, such as the business model innovations deployed by Airbnb and Uber on wireless communication platforms, have changed the value propositions and value creation of long-established and emerging businesses. In this paper we have investigated the implications of disruptive innovation, integrated vision, connectedness, and knowledge creation on sustainable value in the context of developed economies. The primary objective of this research is to quantify the impact of key exogenous variables, such as disruptive innovation, on the sustainable value creation of an entrepreneurial engagement in the renewable energy (RE) sector in developed economies. The outcome of this research has enhanced our knowledge base on how key antecedents, disruptive innovation and knowledge creation, assist in creating a verdant value proposition, delivery, and sustainable value in RE markets in developed economies.

Key words: disruptive innovation, sustainable value, knowledge creation, integrated vision, developed economies

114

INTRODUCTION

Technological and business model innovations play major roles in the functioning of developed economies. Technological innovations are heavily utilized to optimize the cost of power generation from renewables, which has made considerable advances toward cost parity for wind power generation in recent years. Disruptive business model and technology innovations have significant and substantial implications for the outcomes of renewable energy projects. Disruptive business model innovations re-configure business opportunity sensing, value proposition and co-creation, and delivery in a rapidly changing business and technology landscape (Hart & Christensen, 2002; Massa & Tucci, 2013). Multisided business models, in conjunction with multipurpose platforms and increasing Web-enabled capabilities, are hastening changes in techno-economic paradigms (Bughin, Chui, & Manyika, 2010) and market orientations, which have significant implications for the desired outcome of renewable energy businesses in developed economies (Viswanathan, Seth, Gau, & Chaturvedi, 2009). Furthermore, disruptive innovations seem to attenuate the impacts of environmental selection on the genesis and diffusion of innovations, which are enabled by the workings of embedded institutional and market structures (Luhmann, 1995; Nelson & Winter, 1977; Smits, Kuhlmann, & Shapira, 2010). In the context of developed economies, sustaining innovations play a more prevalent role than disruptive innovations. However, we predict that disruptive innovations, which include both technology and business model innovations, will play an

115

appreciable role in changing socio-technical paradigms, which has implications for the outcome of RE businesses (Boons et al., 2013; Hall et al., 2010). Based on these frameworks, we have developed the model and the associated hypotheses shown in Figure 1.

Figure 1. Hypothesized Model to Determine the Implications of Disruptive Innovation on the Outcome of RE Businesses in Developed Economies

HYPOTHESES

Disruptive Innovation (DINNOV)

Disruptive innovations, which include technological and business model innovations, have significant and substantial implications for the outcomes of renewable energy

116

businesses. Disruptive innovations re-configure business opportunity sensing, value proposition and creation, and delivery in a rapidly changing business and technology landscape (Christensen, Horn, & Johnson, 2008; Hart & Christensen, 2002; Massa & Tucci, 2013). In the 4th Industrial Revolution, where integrated cyber-physical-biological systems are hastened by additive manufacturing, AI, autonomous systems, advanced robotics, and decision support systems, multisided business models in conjunction with multipurpose platforms are changing fundamental buyer-seller relationships and hastening socio-technological paradigm shifts (Bughin, Chui, & Manyika, 2010; Elfring & Hulsink, 2003). Such transitions are altering market orientations, which has significant implications for the desired outcome of renewable energy businesses in developed economies (Viswanathan, Seth, Gau, & Chaturvedi, 2009). Based on these observations, we hypothesize that disruptive innovation will have a positive direct impact on market creation.

Hypothesis 1. Disruptive innovation will be positively related with market creation, controlling for firm size, project customer household income, reflective practices, and project location.

Integrated Vision

Integrated vision, which sets the overarching business strategy and the rationale for a firm to engage in entrepreneurial action, has a significant and substantial impact on the formulation of management and market strategies of the business of interest (Ireland et al., 2009; Elfring & Hulsink, 2003). A clear and well-articulated strategic vision will enable the firm to establish a capable management strategy that allows the RE business to differentiate itself from its competitors by creating aggregated value chains that optimize economic, social, and

117

ecological benefits. In the context of developed economies, the formulation and deployment of such an integrated strategic vision is imperative for a successful renewable energy business that creates sustainable value, due to the highly competitive market structures that are supported by established regulatory frameworks and significant sunk capital in conventional energy generation and distribution systems (Schoemaker, 1992). Based on this rationale, we expect that integrated vision will be positively related with eco-communal management, controlling for firm size, project customer household income, reflective practices, and project location, as stated in Hypothesis 2.

Hypothesis 2. Integrated vision will be positively related with eco-communal management, controlling for firm size, project customer household income, reflective practices, and project location.

Furthermore, integrated vision defines the perimeters of the markets that will be served through the firm's entrepreneurial actions by providing creative solutions to challenging energy and related problems and demands. Based on these assertions, we hypothesize that integrated vision will have a positive, significant, and substantial relationship with market creation, as stated in Hypothesis 3.

Hypothesis 3. Integrated vision will be positively related to market creation, controlling for firm size, project customer household income, and project location.

Knowledge Creation

Knowledge creation refers to the creation of knowledge about customer needs, the project, technologies, resources, and overarching economic, ecological, and social issues through socialization, externalization, internalization, and combination processes (Nonaka et al., 1994). In developed economies, knowledge creation plays a major and important role in the functioning of the economy and society at large.

118

Based on these reasons, we hypothesize that knowledge creation (Dewey, 1929; Hessels & Van Lente, 2008) will be positively related with eco-communal management, as presented in Hypothesis 4.

Hypothesis 4. Knowledge creation is positively related with eco-communal management, controlling for firm size, project customer household income, reflective practices, and project/business location.

Impact of Connectedness on Eco-communal Management

Connectedness, which concerns the relational dynamics of the key decision maker of the RE business with the stakeholders of the business or project (the business staff, end users, the business/project host community, the ecosphere, and relevant stakeholders such as investors and regulators), has a significant impact on the strategic management of the business. The relational spheres and embeddedness of the business establish essential attributes of the business culture that impinge on the long-term and day-to-day operations of the business (Dhanaraj, Lyles, Steensma, & Tihanyi, 2004). Based on these reasons, we predict that connectedness is positively related with eco-communal management, as stated in Hypothesis 5.

Hypothesis 5. Connectedness is positively related with eco-communal management, controlling for firm size, project customer household income, and project location.

Impact of Eco-communal Management on Sustainable Value 2 (SV2)

Eco-communal management, which we have defined as the key management strategy of renewable energy businesses (projects) that creates integrated economic, ecological, and social benefits for all the stakeholders, is the primary pathway of

119

translating the integrated vision, know-how, resources, and relational capabilities of the RE business into sustainable value (Schaltegger & Wagner, 2011). Sustainable value 2 (SV2) is a second-order factor with three first-order variables that contain 12 items. Based on this premise, we reason that eco-communal management has a positive, significant, and substantial relationship with sustainable value.

Hypothesis 6. Eco-communal management is positively related to sustainable value 2 (SV2), controlling for firm size, project customer household income, and project location.

Impact of Market Creation on Sustainable Value 2 (SV2)

Market creation, which is framed by the entrepreneurial drive of the RE business and is an essential part of product/service offerings, opportunity sensing, value proposition and delivery, and monetization, has a significant impact on the sustainable value created by the business (Schaltegger & Wagner, 2011; Laszlo, 2008). Based on these assertions, we hypothesize that market creation will be positively related to the integrated desired outcome of the RE business or project (SV2).

Hypothesis 7. Market creation is positively related to sustainable value 2 (SV2), controlling for firm size, project customer household income, and project location.

Research Methods

Measures

All variables used for the research employed 5-point Likert scales.

Disruptive Innovation (DINNOV)

Disruptive innovation (α = .88) was measured using five items adapted from the works of Govindarajan and Kopalle (2006).

120

Integrated Vision (IVN)

We measured integrated vision with two first-order factors and a total of 4 items (α = .82) adapted from the works of Carroll et al. (2005).

Knowledge Creation (KC)

Knowledge creation was measured with four first-order factors and a total of 13 items (α = 0.94) adapted from Schulze and Hoegl (2006, 2008).

Connectedness (Connect)

We used two first-order factors with a total of six items (α = .93) from van Bel et al. (2009).

Eco-communal Management (EcoMng)

Eco-communal management refers to a management strategy for RE businesses that creates a symbiotic and synergistic partnership among profit, environmental, and societal goals (Hart & Dowell, 2010). It was measured using five items (α = .84).

Transition Engagement (TranEng)

Transition engagement refers to transformational changes brought by the RE development that are contextualized to local needs and capabilities and that facilitate socio-technical paradigm shifts. It was measured with two items (α = .82) adapted from Bono and Anderson (2005) and Kayworth and Leidner (2002).

Market Creation (MKC)

Market creation was measured using five items (α = .78) adapted from the works of Jain and Kaur (2004).
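The scale reliabilities reported above are Cronbach's alpha values, α = k/(k - 1) * (1 - Σ item variances / variance of the summed scale). The Python sketch below is a minimal illustration using simulated 5-point Likert data; the array items is a hypothetical stand-in, not the study data.

import numpy as np

def cronbach_alpha(items):
    # items: (n_respondents x k_items) array of Likert scores for one construct.
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)

# Simulated 5-point Likert responses for a hypothetical five-item construct.
rng = np.random.default_rng(2)
latent = rng.normal(0.0, 1.0, size=(222, 1))
noise = rng.normal(0.0, 0.8, size=(222, 5))
items = np.clip(np.round(3.0 + latent + noise), 1, 5)

print("Cronbach's alpha =", round(cronbach_alpha(items), 3))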

121

Sustainable Value (SV2)

We used three first-order factors with a total of 12 items (α = .94) from Zhu et al. (2008) and meaningfulness items adapted from the works of May et al. (2004).

Controls

We utilized three controls in this research:

1. Firm size, expressed as the number of people the firm has employed. We used firm size as a control because of the thematically different entrepreneurial strategies that established and start-up renewable energy businesses tend to deploy (Hockerts & Wüstenhagen, 2010).

2. Project customer household income, which assesses the before-tax income (in US $) of RE project customers (Whitmarsh & O'Neill, 2010). We utilized household income as a control because of its multilevel relationships with energy consumption, GDP growth, income generation, and the environmental implications associated with increased energy consumption (Fang, 2011).

3. Project location, which is utilized to account for regional differences in the application of renewable energy businesses and projects. These differences include market conditions, entrepreneurial climate, regulatory policies, financing instruments, and environmental selection, which impact RE business outcomes (Reiche & Bechberger, 2004).

Instrument Development and Testing

To ensure the reliability, validity, and appropriateness of the survey instruments, we conducted a pilot test by administering the survey instrument to 40 key decision makers engaged in renewable energy businesses in developed economies.

122

Data and Samples

Data were gathered from 222 key decision makers of renewable energy businesses and projects, primarily in five developed economies (the US, Canada, Germany, Japan, and the UK), with a few responses from France and Norway. These key decision makers were identified through the personal networks and business relationships of the researchers. Forty-six percent (45.5% to be exact) of the respondents were senior executives (CEO, COO, CFO, and CTO), and 54% were senior managers/managers with titles such as project director and program manager.

Data Screening

We collected 222 complete survey responses without missing data, and the data were adequate for further analysis.

Analysis

We conducted exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) with and without a common latent factor, followed by structural equation modeling (SEM) as part of our overall analyses.

Results

Exploratory Factor Analysis (EFA)

We submitted the 47-item scales to EFA, and several statistics indicated that the data were adequate and appropriate for further analysis, with a Kaiser-Meyer-Olkin (KMO) statistic of 0.921 and a significant Bartlett's Test of Sphericity (χ2 = 21,708, df = 5253, p < .001), indicating sufficient inter-correlations; all communalities were above 0.40,

123

further confirming that each item shared some common variance with other items (Hair et al., 2010).

Measurement Model

The confirmatory factor analysis of the measurement model yielded good model fit statistics, with χ2 = 1802.69, df (degrees of freedom) = 977, CFI = .900, and RMSEA = .062 (Hair et al., 2010), as shown in Table 1, indicating the validity of the factor structure.

Table 1. CFA Model Fit Statistics (DE CFA SM_2)
χ2        1802.69
df        977
p         0.0000
CMIN/df   1.845
CFI       0.900
TLI       0.889
RMSEA     0.062
SRMR      0.09
PCLOSE    0.00

The CFA meets all validity and reliability requirements, indicating that the model fits the data well. Composite reliability for all the constructs was greater than 0.7, and the average variance extracted (AVE) met the requirement for convergent validity. The maximum shared variances (MSV) are less than the AVE, and the average shared variances (ASV) are less than the AVE, meeting discriminant validity requirements (Fornell & Larcker, 1981). Reliability requirements are met, with both Cronbach's alpha and composite reliability for each construct being greater than 0.7.
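The adequacy and validity quantities reported here (KMO, Bartlett's test of sphericity, composite reliability, and AVE) can be computed along the lines of the sketch below. This is only an illustration: it assumes the third-party factor_analyzer package for KMO and Bartlett's test, and applies the standard formulas CR = (Σλ)² / ((Σλ)² + Σ(1 - λ²)) and AVE = mean(λ²) to an illustrative vector of standardized loadings; none of the data or loadings shown are from the study.

import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

def composite_reliability(loadings):
    # CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances).
    loadings = np.asarray(loadings, dtype=float)
    error_variances = 1.0 - loadings ** 2
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + error_variances.sum())

def average_variance_extracted(loadings):
    # AVE = mean of the squared standardized loadings.
    loadings = np.asarray(loadings, dtype=float)
    return float(np.mean(loadings ** 2))

# Simulated survey responses (respondents x items); not the study data.
rng = np.random.default_rng(3)
survey = pd.DataFrame(rng.normal(size=(222, 47)) + rng.normal(size=(222, 1)))

chi_square, p_value = calculate_bartlett_sphericity(survey)
kmo_per_item, kmo_total = calculate_kmo(survey)
print("Bartlett chi2 =", round(chi_square, 1), " p =", p_value, " KMO =", round(kmo_total, 3))

# Illustrative standardized loadings for one construct.
loadings = [0.78, 0.81, 0.74, 0.80, 0.76]
print("CR =", round(composite_reliability(loadings), 3),
      " AVE =", round(average_variance_extracted(loadings), 3))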

124

The effects of common method bias were checked by comparing CFA results with and without a common latent factor (CLF), following Gaskin's (2012) recommendations. There was no difference between the CFA results with and without the CLF, indicating that common method bias does not have a significant and substantial effect on the data collected.

Structural Equation Modeling (SEM) Results

The hypothesized model shown in Figure 2 fits the data well, with model fit statistics χ2 = 19.08, df = 11, CMIN/df = 1.73, CFI = .998, and p > .05 (Hu & Bentler, 1999; Tabachnick & Fidell, 2007). A summary of key model fit statistics is shown in Table 2.

Table 2. SEM Key Model Fit Statistics

As shown in Figure 2, all antecedents have substantial and significant impacts on the mediators, and the mediators have significant and substantial impacts on the outcome.
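A path model of this form can be estimated with a general SEM package. The sketch below uses the semopy package in Python (an assumption about tooling, not the authors' implementation) on simulated composite scores whose structure merely mimics the hypothesized paths, and prints path estimates and fit statistics.

import numpy as np
import pandas as pd
import semopy

# Simulated composite scores; the actual study estimated full latent
# measurement models with control variables.
rng = np.random.default_rng(4)
n = 222
DINNOV = rng.normal(size=n)
IVN = rng.normal(size=n)
KC = rng.normal(size=n)
Connect = rng.normal(size=n)
MKC = 0.2 * DINNOV + 0.7 * IVN + rng.normal(scale=0.5, size=n)
EcoComMng = 0.4 * IVN + 0.9 * KC - 0.5 * Connect + rng.normal(scale=0.5, size=n)
SV2 = 0.75 * EcoComMng + 0.15 * MKC + rng.normal(scale=0.3, size=n)
data = pd.DataFrame({"DINNOV": DINNOV, "IVN": IVN, "KC": KC, "Connect": Connect,
                     "MKC": MKC, "EcoComMng": EcoComMng, "SV2": SV2})

# Lavaan-style description of the structural paths (measurement part omitted).
desc = """
MKC ~ DINNOV + IVN
EcoComMng ~ IVN + KC + Connect
SV2 ~ EcoComMng + MKC
"""

model = semopy.Model(desc)
model.fit(data)
print(model.inspect())           # path estimates, standard errors, p-values
print(semopy.calc_stats(model))  # chi-square, CFI, TLI, RMSEA, and related indices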

125

Figure 2. SEM Results: Disruptive Innovation on RE Outcome in Developed Economies

Hypotheses Test Results

As we predicted, Hypothesis 1 is supported (β = 0.24, p < 0.001), indicating that disruptive innovation has a significant and substantial positive impact on market creation. Hypotheses 2 and 3 are supported, with a positive and significant relationship between integrated vision and eco-communal management (β = 0.40, p < 0.001) and a stronger relationship between integrated vision and market creation (β = 0.70, p < 0.001). The test result for Hypothesis 4 further highlights the importance of knowledge creation for eco-communal management in RE businesses in developed economies, with the strongest relationship coefficient of 0.89, which is highly significant (β = 0.89, p < 0.001).

126

However, our prediction for the relationship between connectedness and eco-communal management (Hypothesis 5) is not supported. Contrary to our expectation, this relationship is negative, with a beta of -0.48 that is highly significant (β = -0.48, p < 0.001). Consistent with our predictions, Hypotheses 6 and 7 are supported, with β = 0.76, p < 0.001 and β = 0.15, p < 0.001, respectively, confirming that eco-communal management and market creation have significant and substantial positive impacts on the outcome of renewable energy businesses in developed economies as measured by sustainable value. The summary of hypotheses test results is presented in Table 3. Further verifying the robustness of the model we have developed, the predictor variables explain 78% of the variance in the outcome variable.

Table 3: Summary of Hypotheses Test Results
Hypothesis   Path                      β       p     Supported?
H1           DINNOV -> MKC             0.24    ***   Yes
H2           IVN -> EcoComMng          0.40    ***   Yes
H3           IVN -> MKC                0.70    ***   Yes
H4           KC -> EcoComMng           0.89    ***   Yes
H5           Connect -> EcoComMng     -0.48    ***   No
H6           EcoComMng -> SV2          0.76    ***   Yes
H7           MKC -> SV2                0.15    ***   Yes
*** p < .001, ** p < .01, * p < .05, ns ≥ .05

In developed economies, the primary path for creating sustainable value for RE businesses is underpinned by knowledge creation and runs through eco-communal management, which is the key management strategy of the business deployed to enhance

127

the outcome of the business, improve its competitive positioning, and meet market demands for integrated economic, ecological, and social benefits (Hart & Dowell, 2010; Schaltegger & Wagner, 2011). Knowledge creation, which includes technological and business model innovations, opportunity sensing, value creation, and socio-ecological knowledge, shapes the management strategy of the business, which in turn has significant and substantial implications for the desired business outcome (Hitt, Ireland, Sirmon, & Trahms, 2011).

Key Findings

The primary path for value creation for RE businesses in developed economies is undergirded by knowledge creation and translated into the desired outcome by eco-communal management. Here, knowledge creation includes entrepreneurial and tacit knowledge, opportunity sensing, value creation, and socio-ecological knowledge, which are adaptive and complex. Innovative strategic management based on emergent and systemic knowledge creation improves the competitive positioning of the business and creates symbiotic and synergistic relations among return on investment, environmental benefits, and social benefits (Armitage et al., 2008).

Disruptive innovation and integrated vision have significant and substantial implications for market creation, which in turn has a positive impact on sustainable value (SV2). Hence, it is imperative for RE businesses to engage in emergent technological and business model innovations accompanied by appropriate sustainable marketing orientations that are congruent with the DNA and business rationale of their entrepreneurial engagement in the dynamic and fast-changing socio-technological landscape (Kumar, Jones, Venkatesan, & Leone, 2011).

128

Limitations

This research has investigated the implications of disruptive innovation, integrated vision, connectedness, and knowledge creation on the entrepreneurial engagement and outcome of RE businesses in developed economies. As such, the model developed here needs to be verified for contextual applicability (generalizability and transferability) before it is utilized in other sectors.

Future Research

In light of the emergent and fast-paced technological and business model innovations that are part and parcel of the cyber-physical-biological landscape, the implications of disruptive innovation, knowledge creation, and market creation for creating sustainable value in the RE space and other businesses are of paramount importance for entrepreneurs and entrepreneurial established businesses. Hence, the research framework and the model developed here may be modified and applied in other emerging business opportunities to optimize the configurations of innovation inputs (technological and business model) and management approaches to create sustainable value for all the stakeholders.

129

REFERENCES Anderson, J. R. 1983. A spreading activation theory of memory. Journal of verbal learning and verbal behavior, 22(3): 261-295. Bono, J. E., & Anderson, M. H. 2005. The advice and influence networks of transformational leaders. Journal of Applied Psychology, 90(6): 1306. Christensen, C. M., Horn, M. B., & Johnson, C. W. 2008. Disrupting class: How disruptive innovation will change the way the world learns. McGraw-Hill New York, NY. Dewey, J. 1929. The quest for certainty: A study ofthe relation of knowledge and action. London: George Allen & Unwin. Dhanaraj, C., Lyles, M. A., Steensma, H. K., & Tihanyi, L. 2004. Managing tacit and explicit knowledge transfer in IJVs: the role of relational embeddedness and the impact on performance. Journal of International Business Studies, 35(5): 428- 442. Elfring, T., & Hulsink, W. 2003. Networks in entrepreneurship: The case of high- technology firms. Small Business Economics, 21(4): 409-422. Fornell, C., & Larcker, D. F. 1981. Structural equation models with unobservable variables and measurement error: Algebra and statistics. Journal of Marketing Research, 18(3): 382-388. Gaskin, J. 2012. Measurement model invariance. GaskinNation's StatWiki. Available from http://statwiki.kolobkreations.com. Govindarajan, V., & Kopalle, P. K. 2006. Disruptiveness of innovations: measurement and an assessment of reliability and validity. Strategic Management Journal, 27(2): 189-199. Hair, J., Black, W., Babin, B., & Anderson, R. 2010. Multivariate data analysis (7th ed.). New Jersey: Prentice Hall. Hall, J., & Vredenburg, H. 2012. The challenges of innovating for sustainable development. MIT Sloan Management Review, 45(1). Hall, J. K., Daneke, G. A., & Lenox, M. J. 2010. Sustainable development and entrepreneurship: Past contributions and future directions. Journal of Business Venturing, 25(5): 439-448. 18

130

Hart, S. L., & Christensen, C. M. 2002. The great leap. Sloan Management Review, 44(1): 51-56. Hart, S. L., & Dowell, G. 2010. A natural-resource-based view of the firm: Fifteen years after. Journal of Management, 37(5): 1464-1479 Hessels, L. K., & Van Lente, H. 2008. Re-thinking new knowledge production: A literature review and a research agenda. Research policy, 37(4): 740-760. Hitt, M. A., Ireland, R. D., Sirmon, D. G., & Trahms, C. A. 2011. Strategic entrepreneurship: creating value for individuals, organizations, and society. The Academy of Management Perspectives, 25(2): 57-75. Hockerts, K., & Wüstenhagen, R. 2010. Greening Goliaths versus emerging Davids— Theorizing about the role of incumbents and new entrants in sustainable entrepreneurship. Journal of Business Venturing, 25(5): 481-492. Hu, L. t., & Bentler, P. M. 1999. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1): 1-55. Ireland, R. D., Covin, J. G., & Kuratko, D. F. 2009. Conceptualizing corporate entrepreneurship strategy. Entrepreneurship Theory and Practice, 33(1): 19-46. Jain, S. K., & Kaur, G. 2004. Green marketing: An attitudinal and behavioural analysis of Indian consumers. Global Business Review, 5(2): 187-205. Kayworth, T. R., & Leidner, D. E. 2002. Leadership effectiveness in global virtual teams. Journal of Management Information Systems, 18(3): 7-40. Kumar, V., Jones, E., Venkatesan, R., & Leone, R. P. 2011. Is market orientation a source of sustainable competitive advantage or simply the cost of competing? Journal of Marketing, 75(1): 16-30. Laszlo, C. 2008. Sustainable value: How the world's leading companies are doing well by doing good. Press. Luhmann, N. 1995. Social systems. Stanford University Press. Massa, L., & Tucci, C. L. 2013. Business model innovation. In M. Dodgson, D. M. Gann, & N. Phillips (Eds.), The Oxford handbook of innovation management: 420- 441. Oxford University Press. May, D. R., Gilson, R. L., & Harter, L. M. 2004. The psychological conditions of meaningfulness, safety and availability and the engagement of the human spirit at work. Journal of Occupational and Organizational Psychology, 77(1): 11-37.

131

19

132

Nelson, R. R., & Winter, S. G. 1977. In search of a useful theory of innovation. In S. Klaczko- Ryndziun, R. Banerji, J. A. Feldman, M. A. Mansour, E. Billeter, C. Burckhardt, I. Ugi, K.-S. Fu, G. Fehl, & E. Brunn (Eds.), Innovation, economic change and technology policies: 215-245. Springer. Nonaka, I., Byosiere, P., Borucki, C. C., & Konno, N. 1994. Organizational knowledge creation theory: a first comprehensive test. International Business Review, 3(4): 337-351. Nonaka, I., & Toyama, R. 2003. The knowledge-creating theory revisited: knowledge creation as a synthesizing process. Knowledge management research & practice, 1(1): 2-10. Reiche, D., & Bechberger, M. 2004. Policy differences in the promotion of renewable energies in the EU member states. Energy Policy, 32(7): 843-849. Schaltegger, S., & Wagner, M. 2011. Sustainable entrepreneurship and sustainability innovation: categories and interactions. Business Strategy and the Environment, 20(4): 222-237. Schulze, A., & Hoegl, M. 2006. Knowledge creation in new product development projects. Journal of Management, 32(2): 210-236. Schulze, A., & Hoegl, M. 2008. Organizational knowledge creation and the generation of new product ideas: A behavioral approach. Research Policy, 37(10): 1742-1750. Smits, R. E., Kuhlmann, S., & Shapira, P. 2010. The theory and practice of innovation policy: An international research handbook. Edward Elgar. Tabachnick, B. G., & Fidell, L. S. 2007. Using multivariate statistics, 5th ed.: 402-407. Pearson. Viswanathan, M., Seth, A., Gau, R., & Chaturvedi, A. 2009. Ingraining product-relevant social good into business processes in subsistence marketplaces: The sustainable market orientation. Journal of Macromarketing, 29(4): 406-425. Wang, J.-J., Jing, Y.-Y., Zhang, C.-F., & Zhao, J.-H. 2009. Review on multi-criteria decision analysis aid in sustainable energy decision-making. Renewable and Sustainable Energy Reviews, 13(9): 2263-2278. Whitmarsh, L., & O'Neill, S. 2010. Green identity, green living? The role of pro- environmental self-identity in determining consistency across diverse pro- environmental behaviours. Journal of Environmental Psychology, 30(3): 305- 314. 20

133

Yang, B., Watkins, K. E., & Marsick, V. J. 2004. The construct of the learning organization: Dimensions, measurement, and validation. Human Resource Development Quarterly, 15(1): 31-55. 21

134

Carrier Choice Optimization with Tier Based Rebate for a National Retailer Surya Gundavarapu, Matthew A. Lanham Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907 [email protected]; [email protected] Abstract In this study, we develop a decision-support system to help a high-end national retailer optimize their shipping costs. Our first meeting with the retailer introduced us to the type of shipping agreements that usually exist between the retailer and their delivery service provider within this industry. This retailer has a tier-based rebate system with their usual delivery service provider which incentivizes them to accrue higher shipping. Besides, some of the retailer's transactions qualify for the rebate total while others don't. But, the raw dataset wasn't organized in a way where we could easily isolate the rebatable transactions. Hence, we had to do data preparation based on the business rules. Our goal here was to ensure the net total of all rebatable transactions in the retailer's dataset equaled the total provided by the delivery service provider. Once we identified the right records, we aggregated sales by week and built time series models to predict future sales. Beginning with Simple Moving Average (SMA), we built Exponential Weighted Moving Average, and Auto Regressive Moving Average (ARIMA) to forecast weekly sales going forward. Then, we developed an optimization model that would simulate various transportation scenarios and identifies the scenario that minimizes their overall annual transportation costs.

135

Introduction As of September 2017, retailers and wholesalers account for more than half of the global demand for package and courier services. According to a study by SAP, most businesses use the major, well- known parcel delivery services and call it done for shipping services to the end customer. But, SAP shows that doing so would cost them more than it should, and businesses should consider other options for at least some of their packages. But, picking a shipper is not an easy choice to make, since most package delivery services try to distinguish themselves along the dimensions of security, speed of delivery, special requests, price, customer experience, professional appearance. And the delivery service companies try to keep their customers loyal with volume-based contracts that make the retailers eligible for special discounts, services, privileges by accruing shipments with the same delivery service firm. One such arrangement is the tier-based rebate percentage which has the delivery service provider refund a percentage of net value of the transactions back to the retailer in appreciation of their loyalty to the delivery service provider. This percentage increases as the net volume of transactions increases giving the retailer a greater incentive to ship more packages using the same shipper to maximize their rebate. But, the usual delivery service provider might not always be the cheapest transportation vendor in the marketplace and with every shipment the retailer makes, he/she is making a trade-off choice between increasing the total sales with the regular carrier (that can be used in achieving a higher tiered discount) and choosing an alternative cheaper carrier. It is important for the retailer to find the right balance between the two to minimize their shipping costs (28% of shoppers abandon their carts/rethink their purchase if shipment costs are too high). We intend to build a model that optimizes

136 the amount of current and future deliveries that the retailer should direct towards each carrier to reduce their overall shipping costs. The rest of the paper follows this plan. We performed a literature review to understand what studies have been done in the past about shipper optimization and what models/algorithms work best/are currently being used in solving this problem. We give a brief introduction into the data we used in our study. Next, we discuss the models built and their respective results. An optimization model that finds the best choice of delivery service provider was built to identify the best shipment mixture between the regular carrier and the alternate carrier. Then, we discuss our conclusions and give insights into areas of future research.

137

Literature Review For the purpose of this paper, we first focused on research that would provide actual solutions for business that do not have a developed distribution network and, thus, rely on third party carriers to deliver products for the final customer. However, since this concern is as recent as the e- commerce boom, there is no specific published paper with a straightforward solution or framework to assess this issue. Companies may desire to optimize their shipment expenses, but the optimal approach is likely still being tested empirically in the market. Because of the unavailability of published work focusing on comparable problems, our research shifted to benchmarking similar solutions for a broader optimization problem in the transportation industry aligned with practical solutions that could be applied such as financial models and rebatable refund models. The models that were found vary in focus and complexity, ranging from a focus on transportation cost minimization to elaborate network problems requiring an optimal solution to several variables such as location and modality. Many scholarly works can be found that address the issue of rebate promotional activities, but these works largely focus from the side of the offeror of the rebates, rather than determining the maximum utility of such a program to the offeree. Additionally, many publications in this area (Ali, 1994) focus on such promotional activities for consumer products to an end-consumer. While some factors discussed in these papers have parallels in a supply chain setting (purchase acceleration could translate to intentionally increasing business with a specific carrier to redeem a higher rebate), many factors that are developed in these models do not apply with the exogenous environment of the business-to-business supply chain (rebuys, redemption rate, etc.). A common area of modeling within the supply chain industry is known simply as the “Transportation

138

Problem” (“TP”), in which a product must be moved from many factories to many warehouses at the lowest possible cost. Recently, a new mathematical framework for solving such problems was published which results in what the authors deem as a better initial basic feasible solution that other algorithms previously published (Ahmed, 2014). A variant of the Transportation Problem is the Fixed Cost Transportation Problem, which can be solved with uncertainty theory mathematical methods (Yuhong, 2012). However, both of these methods are highly mathematical and theoretical, and lack ease of applicability in a real-world setting. Real-world solutions have been developed using linear programming methods in Excel with premium solver packages, but less documented work is available

139 regarding these methods, and the applicability of such methods seems to be limited to specific corporate-consulting situations (LeBlank, 2014). Models of increasing complexity are also available. One such model for solving a intermodal, service and finance constrained transportation optimization problem focused on a table search metaheuristic model, comparing neighboring solutions until an ideal solution is achieved. Ishfaq and Sox provided the framework for such a model including different types of shipments, modes of transport, and economies of scale (Ishfaq, 2010). When it comes to focusing on predicting and optimizing cost of transportation alone, less recent published work is available. Hall and Galbreth recognize the fact that often in optimization problems, transportation costs are assumed to be linear, when in reality this is often not the case due to bulk or specialized discounts (or “rebates”). Hill and Galbreth model transportation costs as a piecewise function and deploy a heuristic model to determine an optimal solution for situations involving one factory shipping to multiple warehouses (Hill, 2008). While this work begins to address the complexity of transportation costs, it still does not accurately model or forecast costs based on a varying-rebate contract structure. Additional work can be found that analyzes the empirical relation between the price of goods sold, the price charged to consumers of shipping the good, and the quality of shipping service (Dinlersoz, 2004). This work also discusses the impact of economic searching costs on consumer willingness to absorb shipping costs. However, these correlations all assume the shipping costs are fixed or known by the company making such decisions. Such work cannot be effectively deployed by a company unless accurate costing forecasts can be created from which to base these decisions. A review of recent work in the area of freight transportation reveals an interesting “anomaly” in field, specifically that there seems to be a general lack of relationship between transportation

140 optimization models and transportation cost functions (Bravo, 2013). Additionally, this review determined that limitations exist in current transportation cost functions, such as “the role of time and in transportation cost analysis”. This review, in junction with the fact that fewer publications seem to exist focused on transportation-cost predictive modeling, provide a strong justification for our particular research problem and solutions.

141

Data The data provided by the retailer included two main tables and an auxiliary table. One of the main tables consists of the retailer’s record of transactions detailed to the individual transaction level while the other has transactions and summary from the delivery service company. The retailer’s record had all of their shipments while only a subset of these shipments was eligible for rebate total. Some of the columns in the dataset were Shipping Country, Shipping State, Tracking Number, Type of Shipment, Code for the Type. In the shipper’s record of transactions, we had data summarized into different tiers records included tracking number, shipment type, shipment group etc., The auxiliary table had different type of shipment types and their respective eligibility for rebate status. This table had to be joined with the retailer’s record of transactions to identify a shipment’s eligibility for a rebate. Some of the shipment type had to be imputed based on the appropriate business rule (For example, tracking number has information about the type of shipment). Once we matched the rebatable amount that the retailer’s transactions indicate to the rebatable amount that the delivery service provider reported, we were certain that we had the right subset of the data to build models on. we aggregated the net weekly charge invoiced by the delivery service to retailer. This net payable to delivery service was used for time series modeling. Figure 1 Actual Shipping Cost aggregated by Week Exploratory Data Analysis (EDA) During the Exploratory Data Analysis, we first plotted the payables graph to understand the peaks and troughs through the year. The retailer seems to experience high volume of shipments in the first

142

quarter followed by a slower 2nd and 3rd quarter before the sales increase again in the 4th quarter. Based on our discussion with the retailer, this could be a result of the typical cycle of Spring cleaning and holiday shopping that increases the number of new units ordered/shipped on either end of the calendar year. Then, we used seasonality decomposition function to determine how different elements of the demand are affecting sales. This helps us understand the determinants of sales and build the proper models to account for them. Figure 2 shows different components of the demand and it shows that most sales are due to the generic trend. We also see that there appears to be a weekly seasonality. This could be due to that most shipments go out towards the latter part of the week and hence the invoices generated by the delivery service provider increase then. However, we do notice high residuals in the early part of the year that decrease towards the 2nd and 3rd quarter before increasing for the holiday season. Since we only have data for 1 calendar year, we cannot quantify the amount of holiday seasonality and any models built with uneven residuals would perform worse once applied to following calendar year since the holiday seasonality/ spring cleaning would not be explained by the model. Methodology Considering the values of payables are in millions, we decided to take log of the values for the purpose of modelling. This didn’t alter the distribution of the payable values as shown in the visuals below. Figure 2 Decomposition of the Weekly Invoices

143

Figure 3 Log Shipping Costs with Moving Average Figure 4 Shipping Costs with Moving Average Our data analysis process was as follows: We used a planning window of 3 periods for the Simple Moving Average and a half-life of 3 periods for the Exponential Weighted Moving Average. These parameters along with the parameters for ARIMA are explained in detail in the methods. We used Dickey Fuller Test for evaluating SMA and

EWMA and used Residual Sum of Squares for the ARIMA model. Our optimization problem was designed as follows: Our decision variable was the PostCC which ought to be optimized to find the value that minimizes the overall shipping costs for the retailer. Models Simple Moving Average A simple moving average is an arithmetic moving average calculated by adding the values for a number of time periods and then dividing this total by the number of time periods. One of the advantages of this model is that it is customizable for different number of time periods easily and hence fits into any planning window. Further, it smoothens out volatility making it easier to view the trend in a series. Increasing the time period increases its the level of smoothing and shorter time frame attempts to fit the source data much closely. As with any model, an optimum planning window ought to be taken to avoid overfitting and/or high volatility. Moving Average are important analytical tools since they identify trends in current and potential change in an established trend. Comparing two moving averages, each covering different time windows, gives us a slightly more complex analytical tool to predict trend. A shorter term SMA that is higher than longer term SMA would imply an uptrend in future and vice versa. Exponential Weighted Moving Average

144

The weakness of a simple moving average is that all prior values being used in the window have the same weight. The most recent observation has no more influence on the variance than that of an observation few periods back. This would imply that our calculation of the future costs is diluted by distant (less relevant) data. This problem is fixed using the exponentially weighted moving average in which values are weighed by recency. Exponential Weighted Moving Average function takes a decay factor and weighs preceding observations based on an exponential function of the decay factor to forecast the future value. In our model, we built the EWMA with half-life of 3 periods which implies a given observation loses half of its influence in 3 periods following its occurrence. Other ways of mentioning decay factor include span, for how many observations after its occurrence will a given observation exert its influence, alpha, smoothing parameter and com, center of mass. Auto Regressive Integrated Moving Average Autoregressive Integrated Moving Average is a form of regression analysis that seeks to predict future values by examining differences between values in the series instead of using the actual data values. Lags in differed series are referred to as autoregressive and lags within the forecasted data are referred to as "moving average" ARIMA includes parameters p,d,q for the Auto regressive part, integrated and moving average parts of the dataset respectively and it can take into account trends, seasonality, cycles, errors and other non-stationary aspects of a dataset when making forecasts. Results Simple Moving Average A Dickey Fuller Test on this gave us the following statistics

145

Figure 5 Moving Average and Standard Deviation

Metric Value Test Statistic -7.188414 P - Value 2.540907 * 10-7 Table 1 Dickey Fuller Test of SMA model Exponential Weighted Moving Average A Dickey Fuller Test on this gave us the following statistics: Metric Value Test Statistic -1.745780 p - value 0.407693 Table 2 Dickey Fuller Test for EWMA model The magnitude of the Test Statistic and the p-value indicate that we fail to reject the null hypothesis acknowledge that the forecasted series could be non-stationary. Since EWMA predictions are non- stationary, it is wiser to move to an alternative model. Autoregressive Integrated Moving Average We built three autoregressive models with p,d,q values as follows and used the Residual Sum of Squares as the criterion of determination. Figure 6 Exponential Moving Average (Half Life = 3, Min_Periods = 1)

146

Figure 7 ARIMA (2,1,0) Figure 8 ARIMA (0,1,2) Figure 9 ARIMA (2,1,2) The residual sum of squares is shown on top of each graph and as the hybrid model of the first two with (p,d,q) parameters (2,1,2) had the best outcome with RSS = 3.4101. Now that we found a ARIMA model with good results, we attempt to scale it back to the units of the sales figures to see how well our model performed in predicting the sales figures.

147

Figure 10 Predictions scaled back As expected, our model was efficient at forecasting the shipping costs during the second and third quarters but failed to predict the shipping costs on either end of year where there was a lot of residual due to unexplained seasonality. However, the model seems to have integrated the general trend of the shipping costs in both first quarter and fourth quarter accurately even though the magnitude of this trend seems to have been underestimated. Optimization Since we realized that we cannot accurately predict the demand for the beginning and ending quarters accurately without factoring in the seasonality, we decided to place ourselves in the beginning of the calendar year 2016 and attempt to optimize for the delivery service provider choice for the calendar year of 2016.We built two optimization models: ➢ Constant Cheaper Alternative: There always exists a delivery service provider that is x% cheaper than the retailer’s current delivery service provider. ➢ Random variable Cheaper Alternative: There is a possibility of a cheaper alternative if searched. But, the rate by which it is cheaper changes for each week. Constant Cheaper Alternative A constant cheaper alternative model is one which expects the presence of an x% cheaper alternative always. The value of x was selected in series from 0 to 100% in steps of 10. However, the effective rebate rate that the retailer is enjoying with their current delivery service provider (say y) presented a hurdle to this model. Every time a decision on the delivery service provider had to be made, the

148 decision maker had two choices, “Pick their usual delivery service provider and get y cents for every dollar or pick the cheaper carrier and get x cents for every dollar spent”. This is rather easy decision to make based on the values of x and y i.e., ▪ Pick the usual delivery service provider if y > x or ▪ Pick the alternate carrier if x > y This caused the model to direct all the shipments one way based on the values of x and y i.e., ship using the regular carrier always if y>x and ship using the alternate carrier if x > y which isn’t an optimization in true sense of the word. Random Variable Cheaper Alternative Expecting an x% cheaper carrier to exist always is bit unrealistic and considering the peculiar challenge we faced in our model, we decided make the model more realistic by randomizing the %cheaper variable each week and the random number is between 0% to 15% i.e., in any given week, the retailer can expect the presence of an alternate delivery service provider that quote the same price as their current delivery service provider or quote a price that is upto 15% cheaper. This model adds a unique twist to the earlier problem in that in the earlier case, since a x% cheaper alternative is guaranteed, the retailer could simply pick the carrier by comparing x and y. But, given that finding a cheaper carrier in a given week wouldn’t always guarantee a similar deal the following week. Diverting all of a large shipment in a given week to the cheaper carrier could potentially bring the retailer down a tier with their usual delivery service provider and the retailer has to balance these priorities to minimize the overall cost. Optimization Results Based on the above model, we used Palisade @Risk software to find the optimal proportion of shipments that can be diverted to a cheaper carrier and ran this simulation for 30 minutes. @Risk

149 stochastically generates values for the PC parameter and finds the value of POSTCC that minimizes the Total Shipping Cost for the retailer. We see that a policy of diverting about 9.82% of the shipments on average always is the most optimal method of bringing down costs.

150

The distribution of shipping costs are as follows: By applying our model, the retailer can have higher autonomy in their delivery options while retaining rebate benefits from the delivery service provider. The retailer can opt to divert their shipments based on the above model and expect to pay anywhere between 46.610 million and 47.383 million in annual shipping costs. This would mean a savings of about $3 million dollars had they stuck with the same carrier. However, knowing the proportion of transactions to divert is a reasonable academic exercise but in the corporate world, a retailer would want to know which transactions to divert to save on the costs. When these results were communicated to the retailer, we were asked to try to identify the shipments that ought to be diverted along category/package weight/geographic information. This analysis opens the possibility of new avenues of cost saving for the retail industry. Conclusion We first used business rules to clean the data and impute the transactions to ensure we have the right data to begin with. Once we aggregated the shipments by week, we were able to build models that were able to forecast for the future in 2nd and 3rd quarters but failed to show similar success in the first and fourth quarters which were mixed with seasonality. Our request for more data was honored with the retailer giving us access to another calendar year of data that can be used to calculate the seasonality index for the first and fourth quarters and build models on those. Then, we built an optimization model that can identify how much of the shipment can be routed to alternate cheaper carriers. Further research can be along the dimensions of what shipments to route to the alternate carrier. Figure 11 Distribution of Shipping Costs

151

References Ali, A., Jolson, M. A., & Darmon, R. Y. (1994). A model for optimizing the refund value in rebate promotions. Journal of Business Research, 239-245. Emin, D. H. (2006). The shipping strategies of internet retailers: Evidence from internet book retailing. Quantitative Marketing and Economics, 407-438. Juan José Bravo Bastidas, C. J. (2013). Review: Freight transportation function in supply chain optimization models: A critical review of recent trends. Expert Systems with Applications, 6742- 6757. Larry J. LeBlanc, J. A. (2004). Nu-kote’s Spreadsheet Linear Programming Models for Optimizing Transportation. Interfaces. Mollah Mesbahuddin Ahmed, A. S. (2014). An Effective Modification to Solve Transportation Problems: A Cost Minimization Approach. Annals of pure and Applied Mathematics, 199- 206. Rafay, I. C. (2010). Intermodal Logistics: The Interplay of Financial, Operational and Service Issues. Transportation Research Part E Logistics and Transportation Review. W, Y. F. (2012). Optimizing replenishment polices using Genetic Algorithm for single- warehouse multi-retailer system. Expert Systems with Applications: An International Journal, 3081- 3086. Yuhong, S. K. (2012). Fixed Charge Transportation Problem and Its Uncertain Programming Model. Industrial Engineering and Management Systems.

152

"I’ve Been Chain-ged" La Saundra Pendleton Janaina Siegler, PhD As a mother, daughter, sister, church member, agency owner/manager/and Speech- Language Pathologist, I wear many hats. In January 2018, I added MBA Graduate Student to the list. It was in my Operations and Management Class that I discovered how closely related my life was to a “Supply Chain”. As I learned the importance of each section of my chain, I gained an understanding of how my relationships with my “suppliers” was critical in ensuring that my customers were getting the best product, (“La Saundra”) to meet their demands. The work I present to you is the direct result of the process that I went through, during my Operations and Management class, to visualize what my life looks like on “The Chain”, the Supply Chain that is! Role of Political Identity in Friendship Networks Surya Gundavarapu, Matthew A. Lanham Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907 [email protected]; [email protected] ABSTRACT How do political views evolve within networks? Do individuals pick their friends based on their own political ideology? What is the role of political moderates within friendship networks? This paper presents a foundation for answering these questions using network analysis and theory. While this study has been done on a limited graduate student body, the same process can be repeated on a larger audience to gather insights and evaluate the outcomes of the populace at large. Further it presents an equation to evaluate the strength of friendships based on how often individuals meet up to do different types of activities. And it lays a road map for future research in this area to understand the larger societal forces that are in play along political ideologies. Keywords: political identity, analysis

154

INTRODUCTION Questions about how people simultaneously construct and, in the process, are molded by their social milieu are endemic in social sciences. While most social network studies of individuals focus along the dimensions of race (Ajrouch, Antonucci, & Janevic, 2001), gender (Psylla, Sapiezynski, Mones, & Lehmann, 2017), religion (Lewis, McGregor, & Putnam, 2013), education (Grunspan, Wiggins, & Goodreau, 2014), and organizational structure (Schlauch, Obradovic, & Dengel), the idea of using network methods to analyze political ideology is trending. The objective of this paper is to examine the social network of individuals with pre-existing ties to understand if they exhibit ties along the dimension of their political ideology. Political polarization - the vast and growing gap between liberals and conservatives, Republicans and Democrats - is a defining feature of American politics today, and one the Pew Research Center has documented for several years. Last year, Pew reported that over 62% of US adults claimed to receive their news on social media, while this year the number increased to 67% (Shearer & Gottfried, 2017). With an ever-increasing populace on social media, the level in which our social networks define our perception of the world, and thereby our own political ideology, is only going to rise. Thus, motivating the need for more research into social network analysis as a way of understanding the dynamics of our society. Furthermore, the level of partisanship has led people on different ends of the spectrum to have starkly different impressions of the world around them, with individuals clustering themselves with people of similar ideology leading to a vicious cycle of division among the populace (Mitchell, Gottfried, Kiley, & Eva Matsa, 2017). Understanding and implementing steps to efficiently help individuals reach across parties and interact would help in the process of forming a more perfect union. This paper attempts to answer the following research questions within the scope of the research defined: • Do individuals exhibit homophily within their friendship networks? If so, is it a quantitative homophily (more friends within their own cluster than from other clusters) or a qualitative homophily (the strength of the relationship is stronger with people within their own political ideology)?

155

156

• In a social network by political ideology, what is the role of different actors? Are individuals with moderate political ideology likely to act as connectors for the rest of the network, or will taking a side give more access to the network? LITERATURE REVIEW Among the studies that explored the role of politics in social networks, David Lazer and Robert Huckfeldt stand out. In their 2004 paper (Huckfeldt, Mendez and Osborn 2004), Huckfeldt tried to track flow of information in a closed-network of 1108 individuals using self-defined communication networks and the respondent's own political beliefs. One of the interesting outcomes of the survey is that while a majority of respondents expressed a strong affinity for either of political parties, and claimed to be passionate about politics, most of them could not name more than four discussants with whom they regularly discuss politics with. This might be an insight into people's hesitation to discuss politics within their networks. In his paper published by Kennedy School of Government (D. Lazer, et al. 2008), D. Lazer explored the influence of different relationships in shaping and evolving ones' own political ideology. He discussed the various ways to quantify homophily across different demographics and the paper uses various statistical methods including logistic regression to weigh the influence of different individuals within the network. Another interesting part of this research paper is that the author(s)' question set. Instead of picking objective questions that can potentially bias the respondents because of the available options, authors of this article used open ended questions and took notes along the way for the key markers that they are looking for, thereby standardizing the responses since all responses were aggregated by the interviewer. This level of open approach helped the authors quantify even foreign experiences of the respondents. For example, a Latino American male describes his upbringing in Los Angeles and his introduction to education and welfare policies, while the interviewer jotted down relevant notes. Another seminal work in exploring the role of political ideology, Alan Zuckerman's book, Social Logic of Politics, stands out. Dr Zuckerman explores qualitatively how different relationships influence political ideologies of individuals. An interesting outcome in his book is that couples usually tend to exhibit strong homophily and exhibit a high level of influence on each other if

157 their political beliefs are not alike. His study of divorce rates in couples who identify themselves on different sides of the political spectrum is quite interesting. Then, he explores the evolution of

158 ideologies across generations and between friends. According to him, instances where children exhibit greater conservative attitudes than their parents can be predicted by certain micro behavior exhibited by them during childhood in the play pen. DATA The data used for this survey comes from primary research done by the authors. The scope of this study is the student body of Purdue Universities’ Graduate School among the following departments: • Business Analytics & Information Management • Industrial Engineering • Management Information Systems (PhD) • Global Supply Chain and Operations Management. A survey was administered for individuals who belong to any of these groups and the survey results were aggregated to understand the social behavior of individuals. The survey results are anonymized with the researcher being able to see only the ID numbers created from the master list of the survey respondents. This ensures anonymity and increases the participation rate for the survey. As witnessed by Dr. Huckfeldt, individuals are hesitant to share political opinions even if they are passionate about their beliefs. METHODOLOGY Once the data was gathered, we performed exploratory data visualization to understand the demographics of the respondents as shown in Figure 1 below. This helps understand the generalizability of the results. This picture organizes the process followed.

159

Figure 1: Methodology The data is formatted as edge pairs to ensure edges can be weighed according to the formulae discussed later in the paper. Later, the network is visualized with overlays of the attributes and examined to answer the questions above. Figure 2 shows us the distribution of gender in the survey data and it appears that the participation is equal across both genders. Figure 3 shows us the distribution of the ethnicity among the survey respondents showing a high frequency of Asian students shedding some doubt on the generalizability of the results. However, since the scope of the study is to understand the polarity among the student body, it made sense to use the distribution as is and move on to further analysis. Figure 2: Distribution of gender among survey respondents Figure 2: Distribution of the respondents by ethnicity

160

Figure 4 shows us the distribution of survey respondents along the political spectrum on a scale of 0 - 100 (0 being most liberal and 100 being most conservative). While the graph is skewed more toward the liberal side, it appears majority of people identify themselves as neutral rather than as liberal or conservative. Based on the above distribution, individuals were put into three different groups based along the political spectrum they ranked themselves and the cutoffs were as follows: Group Range of Values Liberal 0-35 Moderate 35-65 Conservative 65-100 Table1: Defined groups based on distribution Based on this ranking, the network had 25 self-reported liberals, 13 self-reported moderates, and 11 self-reported conservatives. The network for each respondent was converted into a matrix form and mapped using NodeXL software which gave us the network diagram shown in Figure 5. Figure 4: Distribution of political ideology among the respondents

161

Figure 5: Political network diagram The diagram calculates the number of inter group edges, thus allowing us to provide some evidence in answering the research question of homophily. The results of the number of edges between different groups are tabulated in Table 2. As observed, the number of edges between liberal and conservative nodes is the second highest making one reject the hypothesis that individuals exhibit homophily along the lines of number of friendships they cultivate across party lines. Group 1 Group 2 Number of edges Liberal Liberal 14 Liberal Moderate 23 Liberal Conservative 20 Moderate Moderate 3 Moderate Conservative 9 Conservative Conservative 2 Table 2: Inter group Edges

162

Table 3 shows some of the network metrics. A low reciprocated edges metric might indicate that not all friendships links are being reciprocated indicating different thresholds for individuals in considering someone a friend or indicate them as a mere acquaintance. Further, a low- density metric indicates that there are several potential ties that have not formed. A low number within a closely-knit group of Management students that actually meet twice a week for coffee social might indicate that people might have been keeping their existing friendship networks even after being provided with opportunities to socialize with the group. Metric Value Reciprocated Edges 15% Diameter 9 Average Density 3.13 Density 0.056 Table 3: Network level Metrics The results in Table 4 suggest that the group metrics and the density of the graph within each group is fairly uniform. This indicates that there is no difference in how connected individuals are with people within their own political ideology. However, the difference in the geodesic distance might indicate different intra group dynamics that might need to be explored in future research. Ideology Number of Vertices Average Geodesic Distance Graph Density Liberal 25 1.133 0.040 Moderate 13 0.769 0.038 Conservative 11 0.500 0.036 Table 4: Intra-Group Metrics RESULTS Once the graph metrics and network metrics are observed, it is evident that more coding is essential to answer the remaining research question. The survey also captured how often individuals spend time with others in their network and the types of activities they meet the other individual for. These activities and frequencies are tabulated and assigned the following weights as shown in Table 5.

163

Activity Score Take the same class 1 Voluntary Activities 2 Social Activities 3 Other 5 Table 5: Coding of Activities Since our network comprises of mostly students that belong to same department, the act of taking the same class is more of a consequence of the scope than an out of normal commitment on the part of either individual. Individuals choose who they socialize with, and hence shows the individual's desire to spend more time with the other party, which led to Social Activities given a higher weight than voluntary activities. The Other activity is a free form response field with individuals reporting their house mates as part of their network. Weighing based on frequency of activity is ordinal since high frequency indicates more interactions between the two parties. The coding of activity frequency is shown in Table 6. Frequency Score Rarely 0 Sometimes 1 Often 2 Frequently 5 Table 6: Coding of frequency of activity Once the relationships were appropriately coded, the strength of the relationship was calculated using the formula: Strength of the relationship = ΣActivity Engaged * Frequency of the Activity Figure 5 shows us the network with edge width set to the strength of the relationship.

164

Figure 6: Network visualized weighed by strength of relationship Figure 7 below shows the histogram of strength of the relationships and the average strength of relationships is found to be five. Figure 7: Distribution of strength of the relationship

165

Filtering our based on strength of the relationship, we found that most of the strong relationships exist between Conservatives and Moderates or between Liberals and Moderates as shown below in Figure 8. This suggests that there is indeed a possibility that individuals might not stray far from their own ideology when it comes to the strength of the relationships they form with other individuals. CONCLUSIONS This study provides some evidence that while individuals do not exhibit homophily along the lines of number of friendship ties they make with other individuals, there is indeed a possibility that individuals might not venture too far away from their own ideology when forming strong bonds. Further, a review of node metrics like betweenness shown in Table 7, degree centrality in Table 8 does not show any pattern in difference across groups, suggesting there might not be a meaningful difference in the roles of actors by their political ideologies. Figure 8: Strong links between nodes visualized

166

Table 7: Betweenness Centrality Table 8: Degree Centrality The limitation in our network study is that it comprises a narrow audience shedding doubt on the generalizability of the outcomes to the mass public. While the results are valid for the group studied, further research is needed to verify if the same pattern is observed among the populace at large before applying any policy initiatives for affecting bipartisan friendships. While this is a good pilot study in understanding behavior of social networks using network analysis in a college setting, further research could include in understanding the intra group dynamics that caused the difference in geodesic distances within groups. Further, a network evolution study could be performed to understand if people's ideologies change when they are

167 exposed to a populace where their political views are in the minority. Or will these individuals simply recede themselves in the situation. Further, similar studies can be conducted to understand the influence of individuals on the rest of their network to understand if individuals do have power to effect shift in median political ideologies within their network. REFERENCES Ajrouch, K. J., Antonucci, C. T., & Janevic, M. R. (2001). Social Networks Among Blacks and Whites: The Interaction Between Race and Age. Journal of Gerontology, 56B(S112 - S118). Grunspan, D. Z., Wiggins, B. L., & Goodreau, S. M. (2014). Understanding Classrooms through Social Network Analysis: A Primer for Social Network Analysis in Education Research. CBE Life Science Education. Huckfeldt, R., Mendez, J. M., & Osborn, T. (2004, January). Disagreement, Ambivalence, and Engagement: The Political Consequences of Heterogeneous Networks. Political Psychology, 25(1), 65-95. Huckfeldt, R., Plutzer, E., & Sprague, J. (1993, May). Alternative Contexts of Political Behavior: Churches, Neighborhoods, and Individuals. The Journal of Politics, 55(2), 365-381. Lazer, D. M., Rubineau, B., & Neblo, M. A. (2009). Picking People or Pushing Politics. Annual Meeting 2009 (p. 32). American Political Science Association. Lazer, D., Rubineau, B., Katz, N., Chetkovich, C., & Neblo, M. A. (2008). Networks and Political Attitudes: Structure, Influence, and Co-evolution. Harvard Kennedy School of Government. Lewis, V. A., McGregor, C. A., & Putnam, R. D. (2013, March). Religion, networks, and neighborliness: The impact of religious social networks on civic engagement. Social Science Research, 22(2), 331-346. Mitchell, A., Gottfried, J., Kiley, J., & Eva Matsa, K. (2017, October 21). Political Polarization & Media Habits. Retrieved December 7, 2017, from Pew Research Center: http://www.journalism.org/2014/10/21/political-polarization-media-habits/ Murtz, D. C. (2002, March). Cross cutting Social Networks: Testing Democratic Theory in Practice. The American Political Science Review, 96(1), 111-126.

168

Psylla, I., Sapiezynski, P., Mones, E., & Lehmann, S. (2017, June 15). The role of gender in social network organization. Schlauch, W., Obradovic, D., & Dengel, A. (n.d.). Organizational Social Network Analysis – Case Study in a Research Facility. Kaiserslautern: University of Kaiserslautern. Shearer, E., & Gottfried, J. (2017, September 7). News Use Across Social Media Platforms. Retrieved December 7, 2017, from Pew Research Center: http://www.journalism.org/2017/09/07/news-use-across-social-media-platforms-2017/ Zuckerman, A. (2005). Social Logic of Politics. Philadelphia, PA, USA: Temple University.

169

XGBoost - A Competitive Approach for Online Price Prediction Joshua D. McKenney, Yuqi Jiang, Junyan Shao, Matthew A. Lanham Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907 [email protected]; [email protected]; [email protected]; [email protected] ABSTRACT This study generates price prediction suggestions for a community-powered shopping application using product features, which is a recent topic of a Kaggle.com competition sponsored by Mercari, Inc. As eBay acquired Canadian data analysis firm Terapeak, the importance of using “big data” and machine learning to improve pricing decision-support in business has been rapidly increasing. By obtaining a solution for price prediction via product features for B2C and C2C online retailers, it will be easier for sellers to sell, and enlarge the selling-shopping community of such user-based marketplaces. It could also be a remarkable competitive advantage for companies or individual sellers having highly accurate pricing decision-support. The authors did some exploratory data analysis, we created text features with above/below average prices for the most important features in the dataset, used R and Kernels to perform text analysis to generate features from unstructured product features, then used XGboost and Multiple Linear Regression to dynamically predict product price. XGBoost was able to handle over 2,000 brands data in our case while Multiple Linear Regression was not able to. XGBoost achieved the best performance, with a 0.513 test set RMSLE. Keywords: Price Prediction, Product Features, Regression Analysis, Text Analysis, XGBoost

170

INTRODUCTION We are seeking solutions for accurately and actively giving price suggestions for B2C and C2C retailing. By using text analysis on different product features, we want to give a precise and dynamic pricing prediction on products which are selling online. This would be extremely important for the C2C platforms. The ability of precisely predicts the price of the product they are going to sell will save time for the sellers thus attract more people to join the community and enlarge the population and sales amount. Competition for apps today is on how many people are actively using the app. So we posit that the success of predicting pricing for online product listings will be a splendid chance for a business like Mercari. According to Wall Street Journal, eBay acquired Canadian data analysis firm Terapeak in December 2017 (Hanly, 2017). The data analysis company is good at predicting supply, demand, and pricing products. This is an important step in developing demand and price prediction of their online listing features and shows the importance of doing so. eBay is hoping this company could help them more in the data analysis field by providing them capabilities to know more about their sellers, customers, and products. There is also a change on how people are buying products today. Previously, there were not a lot of people who would consider buying used clothes. But "The Retail Apocalypse Is Fueled by No- Name Clothes" now (The Business of Fashion, 2017). Fashion of vintage is bringing the old and no-name clothes back, and people are more open on shopping nowadays. The increasing number of no-name clothes brings up the importance of predicting product prices, because they are all different. Previously, pricing could heavily rely on brand recognition and historical pricing, but the no-name clothes are totally different. The urgent need of building a model for dynamic price prediction has been raised with no doubt. Jointing statistical researches with business sense and bringing different kinds of models to fit the contemporary business problems, business analytics is the most popular and fully utilized field to address this problem. Our research is utilizing business analytics to find a best fitting model to address online price prediction issue and give the best solution for dynamic pricing to businesses and individual sellers. Big data analysis and text analysis are two most practical and important tools within the business analytics field, and they are the tools which direct us to find the solution. Though the importance of big data analytics and text analysis is gradually increasing, their power has still been undervalued. Only in the last five to ten years have firms began to invest into finding value from such information. From a decision-makers perspective, it might be difficult to image how one could perform machine learning and dynamic prediction using words as input variables.

171

However, big data analytics techniques, particularly those focused on generating insights from text are growing fast as is becoming common-place data analysis by non-technical team members. Social media has provided a massive opportunity for data scientists to keep learning on text using more sophisticated techniques. For example, Professor Chen Ying was able to develop a text mining technique to extract and automatically perform text analysis on public data (Phys Org, 2017). Though the amount of data is much less, the use of text analysis is extremely crucial during our prediction of pricing, and thus leads us to our primary research question in this study: how well can we predict the price for an online retailer using textual product features? We structure our paper by performing a review of the academic literature to frame our research questions, discuss the data we used in our study, outline our methodology, develop our models, and summarize our results. LITERATURE REVIEW Our main focus is on how to predict the price for an online retailer. Throughout the literature, we found that our target problem is similar to that posed in dynamic pricing strategy. Dynamic pricing is a term referring to find the optimal price for goods, especially online goods. The goal is to fit the price into the product’s features, such as supply and demand, but also brand and quality. The basic idea is to determine the best price by analyzing product characteristics, which is why we believe this research area is similar to what we need to address in our study. To expand this question, we further research on the possible features that might be used to estimate a price. In the paper "Dynamic Pricing on the Internet: Importance and Implications for Consumer Behavior," the authors indicate that when predicting the price for certain products, not only the physical values need to be considered, such as the appearance of the product, but also the information behind that product, such as the comments of the customers should be considered. In the paper "Dynamic Pricing of New Experience Goods," the authors suggest that whether the market is a mass market or a niche market is really important for the price determination and social efficiency is another factor, since word-of-mouth has a significant power. To analyze these critical product features, we researched and found that XGBoost is a popular machine learning methodology that has had success in this space and is frequently used in data analytics and statistics research. XGBoost is based on an end-to-end tree boosting system. It offers a faster and more accurate way to solve classification and regression-type problems. The key feature in XGBoost is that it weights the predictors and tries to keep the new decision tree away from the errors made by previous decision trees. Thus it strengthens its accuracy. This idea of re- weighting predictors where errors occur is the key idea behind all boosting algorithms. While

172

XGBoost is used in many fields, it has had particular success in price prediction. In the paper "Predicting Buyer Interest for New York Apartment Listings Using XGBoost," the researchers tried several different methods to obtain the best pricing model, including logistic regression, support vector machines (SVM), and XGBoost. In their study, XGBoost provided the best solution.

Table 1: Literature review methods-used summary

During our review of the research literature, we found that XGBoost is well adapted to dynamic pricing problems, which matches our purpose of predicting prices for online retailers. The paper "Pricing Recommendation by Applying Statistical Modeling Techniques" successfully uses XGBoost to predict price. In that paper, the objective function is given as

Obj(θ) = L(θ) + Ω(θ)

where L is the training loss function, which measures how well the model fits the training data, and Ω is the regularization term, which guards against over-fitting. The paper indicates that the most common training loss function is the mean squared error, which is the same objective one would use with traditional ordinary least squares regression:

L(θ) = Σ_i (y_i − ŷ_i)²

The additional regularization term in the objective function is analogous to other formulations such as ridge regression, which adds a penalty on the size of the estimated parameter coefficients (i.e., the β̂_j's) by taking the sum of the squared β̂_j's corresponding to the features. In the popular least absolute shrinkage and selection operator (LASSO) case, the penalty is the sum of the absolute values of the parameter coefficients. These regularization (shrinkage) penalty terms in the objective function help produce more robust models with better bias-variance tradeoffs than models without them.
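To make the two penalty types concrete, the ridge and LASSO variants of the objective described above can be written as follows, where λ is a generic tuning parameter controlling the penalty strength (λ is our notation for illustration, not a symbol used in the cited paper):

Obj_ridge(θ) = Σ_i (y_i − ŷ_i)² + λ Σ_j β̂_j²
Obj_LASSO(θ) = Σ_i (y_i − ŷ_i)² + λ Σ_j |β̂_j|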

173

In conclusion, XGBoost is a highly accurate technique for predictive modeling. Thus, in this paper, we focus on XGBoost as our main methodology to answer our research question.

DATA
We used the data set from the Kaggle competition "Mercari Price Suggestion Challenge." We created many dummy variables in order to perform text analysis on the product features. Table 2 shows a data dictionary of the features provided by Mercari; please refer to the METHODOLOGY section for the creation of the dummy variables.

Table 2: Data used in study
Variable | Type | Description
id | Numeric | Id of the listing
name | Categorical | The name of the product
item_condition_id | Numeric | The condition of the item provided by the seller
category_name | Categorical | Category of the product
brand_name | Categorical | The brand name for the product
price | Numeric | The price of the product in USD
shipping | Categorical | Indicator of whether shipping is paid by the seller or the buyer
item_description | Categorical | The full description of the product

Exploratory Data Analysis
Prices in the dataset range from $0 to $2,009, with a mean of $26.74 and a median of $17. Ninety-five percent of prices are at or below $75, and the mode price is $10. The price variable, as shown in Figure 1, is heavily right-skewed, which would influence our predictions. We therefore transformed the price variable using the logarithm of price so that our models could be trained on more balanced data. After taking the logarithm, the distribution is much closer to Gaussian, as shown in Figure 2.
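As a minimal illustration of the log transform described above (the prices below are made up for the example and are not rows from the Mercari data):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices
prices = pd.Series([10, 10, 17, 26, 45, 75, 300], name="price")

# log(1 + price) is one common way to take the logarithm while tolerating $0 listings
log_price = np.log1p(prices)

print(prices.skew())     # strong positive skew before the transform
print(log_price.skew())  # much closer to symmetric afterwards
```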

174

Figure 1: Price Allocation (Range $0 - $100)
Figure 2: Log Price Allocation

There are 19 categories that each account for more than 1% of the total products, as shown in Figure 3. Within those 19 categories, 10 fall under the Women's category. The Women's category accounts for 54% of all records, whereas the Men's category accounts for only 8%. Over 42% of the brand-name values are missing. Excluding the missing values, the most frequent brand names are PINK, Nike, Victoria's Secret and LuLaRoe, as shown in Figure 4.

Figure 3: Category Percentage (Greater than 1%)

175

Figure 4: Brand Percentage (Greater than 1%, Excluding Nulls)

Many of the top selling brands were found mainly in Women's clothing, as shown in Figure 5.

Figure 5: Top 15 Brands' Category Distribution

Brands that primarily sell Electronics, such as Apple and Nintendo, tend to have above-average prices of $73.30 and $34.70 respectively, compared to the overall average price of $26.74 and median price of $17, as shown in Figure 6.

176

Figure 6: Top Brands and their Average Prices

METHODOLOGY
We used multiple regression as our baseline model and variations of XGBoost as our more sophisticated alternatives. When training these models we used three-fold cross-validation for the multiple regression model, and we tried tree, linear, and DART boosters for the XGBoost models. We randomly partitioned the original data into an 80%-20% train-test split. Root Mean Squared Logarithmic Error (RMSLE) is our primary statistical performance measure; it is a lower-is-better indicator of predictive performance and is the metric specified in the Kaggle competition. We also used the more familiar adjusted R-squared as a secondary performance measure; adjusted R-squared indicates the percentage of variation in the response explained by the model. Figure 7 outlines the predictive modeling process we applied in our study.
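For reference, the RMSLE measure described above can be computed with a few lines of numpy; this is a generic sketch rather than the authors' code:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error; lower values indicate better predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy check: predictions close to the actual prices give a small RMSLE
print(rmsle([10, 17, 26], [12, 15, 30]))
```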

177

Figure 7: Predictive Model Flow Chart

After performing initial exploratory data analysis, we found that 43% of our records have missing values, all of them in the "brand_name" column. We also found that for most records with a missing "brand_name," the brand name appears within the "name" column. Knowing this, we checked whether one of the top 10 brand names appeared in the "name" column, and if so, the observation was assigned that brand name in the "brand_name" column. By doing so, we cleaned the data and reduced missing values to 0.4%, which significantly helped our model on the hold-out set.

Because this regression problem consists mainly of text, we needed to create many dummy variables. To generate them, we first divided the dataset into two parts based on price: one part contains products priced above the average price, and the other contains products priced below it. Text analysis was then performed to find the high-frequency one-, two-, and three-word combinations, commonly known as uni-grams, bi-grams, and tri-grams. We compared the ratios of the frequencies with which those words appeared in the above-average-price and below-average-price datasets, and finalized the text feature selection by keeping words/phrases with particularly high or low ratios, as shown in Figures 8 - 11.
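The two text-processing steps just described, imputing brand names from the listing title and counting n-gram frequencies, can be sketched with pandas and scikit-learn as below; the sample rows, brand list, and column handling are illustrative assumptions rather than the paper's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical Mercari-style records
df = pd.DataFrame({
    "name": ["PINK hoodie", "Apple iPhone 7 case", "running shoes size 9"],
    "brand_name": [None, "Apple", None],
    "item_description": ["new with tags", "brand new in box", "great condition"],
})

# Fill a missing brand_name when a top brand appears in the listing name
top_brands = ["PINK", "Nike", "Victoria's Secret", "LuLaRoe", "Apple"]  # illustrative list
def impute_brand(row):
    if pd.isna(row["brand_name"]):
        for brand in top_brands:
            if brand.lower() in row["name"].lower():
                return brand
    return row["brand_name"]
df["brand_name"] = df.apply(impute_brand, axis=1)

# Count uni-, bi-, and tri-gram frequencies in the item descriptions
vec = CountVectorizer(ngram_range=(1, 3))
counts = vec.fit_transform(df["item_description"])
freq = pd.Series(np.asarray(counts.sum(axis=0)).ravel(),
                 index=vec.get_feature_names_out())
print(freq.sort_values(ascending=False).head())
```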

178

Figure 8: Three-word Item (Tri-gram) Description Above Median Pricing
Figure 9: Three-word Item (Tri-gram) Description Below Median Pricing
Figure 10: Two-word Item (Bi-gram) Name Above Median Pricing

179

Figure 11: Two-word Item (Bi-gram) Name Below Median Pricing
Figure 12: Decision Tree (Part) of Our Model

Due to the huge amount of data in our dataset, Figure 12 shows only part of one decision tree, which gives a general idea of our model. Individual nodes are hard to see because the tree contains so many of them, but the figure provides an overview of how XGBoost is used in our online price prediction model.

180

MODELS
Multiple Linear Regression
Multiple linear regression models are a function of several explanatory variables. These parametric models are easy to interpret, as each variable's estimated parameter coefficient gives the partial effect of that feature on price. The primary drawback of regression is that it is often not as accurate or versatile as more complicated predictive algorithms: multiple regression fits a flat plane to the data in p-dimensional space rather than a curved surface that can capture the training data more closely. This type of model is often plagued by bias in the bias-variance tradeoff, as it is frequently too simple when attempting to explain more complicated relationships. The benefit of its simplicity is that one does not have to worry as much about overfitting the training data, which would reduce model generalizability, a key concern in the predictive modeling process. Below is the multiple linear regression model we obtained in our study.

Multiple Linear Regression Formula:
Price = 2.95 − 0.09(item_condition_id2) − 0.15(item_condition_id3) − 0.36(item_condition_id4) − 0.28(item_condition_id5) − 0.29(shipping) + 0.32(cat1Electronics) + ⋯

XGBoost
XGBoost stands for extreme gradient boosting. It is similar to gradient boosting machines but is more efficient and runs much faster than comparable models. Although the model requires all inputs to be numeric and supplied in a matrix format, gradient boosting machines are usually quite accurate when not overfit. The lack of interpretability of these models is a drawback that should be considered when developing models that support pricing decisions. The ability of the XGBoost model to handle all 4,809 distinct brand names, all 104 levels of the Category 2 variable, and all 669 levels of the Category 3 variable as numeric factors, compared to the multiple regression model, greatly helped reduce the RMSLE.

XGBoost Formula:
obj = Σ_{i=1}^{n} l(y_i, ŷ_i^(t)) + Σ_{i=1}^{t} Ω(f_i)   (DMLC, 2016)
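The study's models were fit in R; as a hedged illustration only, an equivalent XGBoost setup in Python (with a randomly generated feature matrix standing in for the dummy/n-gram features) could look like this:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Stand-in numeric features and a log-price-like target (not the Mercari data)
rng = np.random.default_rng(42)
X = rng.random((1000, 20))
y = np.log1p(rng.gamma(shape=2.0, scale=13.0, size=1000))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain, dtest = xgb.DMatrix(X_tr, label=y_tr), xgb.DMatrix(X_te, label=y_te)

params = {"objective": "reg:squarederror",
          "booster": "gbtree",   # the paper also tried the linear and DART boosters
          "eta": 0.1, "max_depth": 8}
model = xgb.train(params, dtrain, num_boost_round=300,
                  evals=[(dtest, "test")], verbose_eval=False)
pred_log_price = model.predict(dtest)
```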

181

RESULTS
RMSLE stands for root mean squared logarithmic error, a lower-is-better indicator of model prediction accuracy used in this Kaggle competition. We calculated the baseline RMSLE based on the average price in the original dataset; it was 0.7470036. Figure 13 shows that neither model overfit the training data, and that XGBoost outperformed linear regression on the test set (RMSLE of 0.5135 versus 0.6169).

Figure 13: RMSLE Performance by Model

The XGBoost_tree model performed the best among all models we tried: it has the lowest RMSLE and the highest R-squared, as shown in Figure 14. We believe the XGBoost_tree model is the best choice for pricing decision support.

Figure 14: R-Squared Performance by Model

182

Figure 15 shows the predictions of our best model, XGBoost. Within the price range below $200, the predictions are fairly accurate, but the model tends to over-predict rather than under-predict prices.

Figure 15: Predicted Price vs. Actual Price Using the Best Model (XGBoost)

Although we obtained a relatively satisfactory result from the XGBoost model, as previously stated its results are rather hard to interpret. It is therefore important to use the model's interpretive aids to help decision-makers understand the results. The model's results will be useful for any company that wants to find its price drivers and to predict the price those drivers imply.

CONCLUSIONS
By precisely and actively predicting the price of a given product based on its various features, it becomes easier for companies and individual sellers to know how buyers will value their products. Businesses can better understand their price drivers, and for individual sellers on a community-based shopping application like Mercari, selling becomes less time-consuming. We found that transforming text into input features is actually easy to do using popular analytics tools such as R and Kaggle Kernels, and that these features can be learned by sophisticated machine learning algorithms such as XGBoost. While the popular linear regression model did not perform well, XGBoost did. How well can we predict the price for an online retailer using textual product features? Following

183
our process, we were able to predict prices with an adjusted R-squared of 50%. That means that the textual features explain half of the variation in price. This provides additional evidence for the previous studies we read suggesting that these features can be important. We achieved a test-set RMSLE of 0.513, compared to the winning team's RMSLE of 0.378. We plan to extend this pricing project to provide better decision support for pricing decisions. For example, we know we have the ability to predict prices with a reasonable degree of accuracy, but how could these forecasts be strategically used to identify price-sensitive products? At the end of our research, we also tried LightGBM and obtained results similar to XGBoost, but we found that LightGBM was faster and more efficient at making predictions for dynamic pricing with many features. We think this is a direction for future research, since efficiency and time management are crucial when it comes to implementing models that support business decisions in a timely manner.

REFERENCES
Chachimouchacha. (2017). Beginner's Guide to Mercari in R - [0.50586]. Retrieved from https://www.kaggle.com/jeremiespagnolo/beginner-s-guide-to-mercari-in-r-0-50677
Bergemann, D., & Välimäki, J. (2006). Dynamic Pricing of New Experience Goods. Journal of Political Economy, 114(4), 713-743.
DMLC. (2016). Introduction to Boosted Trees. Retrieved from http://xgboost.readthedocs.io/en/latest/model.html
García-Calderón Chávez, S. A. (2017, July). Pricing recommendation by applying statistical modeling techniques. Retrieved from https://upcommons.upc.edu/handle/2117/109814
Hanly, K. (2017, Dec. 13). Internet giant eBay buys Canadian data analysis firm Terapeak. Retrieved from http://www.digitaljournal.com/business/internet-giant-ebay-buys-canadian-data-analysis-firm-terapeak/article/509906
Mercari. (2017). Mercari Price Suggestion Challenge - Can you automatically suggest product prices to online sellers? Retrieved from https://www.kaggle.com/c/mercari-price-suggestion-challenge

184

Kannan, P. K., & Kopalle, P. K. (2001). Dynamic Pricing on the Internet: Importance and Implications for Consumer Behavior. International Journal of Electronic Commerce, 5(3), 63-83. DOI: 10.1080/10864415.2001.11044211
Phys Org. (2017, Dec. 8). Unlocking the power of web text data. Retrieved from https://phys.org/news/2017-12-power-web-text.html
Roozbehani, M., Dahleh, M., & Mitter, S. (2012). Volatility of Power Grids Under Real-Time Pricing. IEEE Transactions on Power Systems, 27(4), 1926-1940.
The Business of Fashion. (2017, Dec. 11). The Retail Apocalypse Is Fueled by No-Name Clothes. Retrieved from https://www.businessoffashion.com/articles/news-analysis/the-retail-apocalypse-is-fuelled-by-no-name-clothes
Walters, T. (2017). A Very Extensive Mercari Exploratory Analysis. Retrieved from https://www.kaggle.com/captcalculator/a-very-extensive-mercari-exploratory-analysis
Li, Y., Yao, Y., Lian, Z., & Qiu, Z. (n.d.). Predicting Buyer Interest for New York Apartment Listings Using XGBoost. Retrieved from https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a092.pdf

185

A Comparative Study of Machine Learning Frameworks for Demand Forecasting
Kalyan Mupparaju, Anurag Soni, Prasad Gujela, Matthew A Lanham
Purdue University Krannert School of Management, 403 W. State St., Krannert Bldg. 466, West Lafayette, IN 47907
[email protected], [email protected], [email protected], [email protected]

ABSTRACT
We built various demand forecasting models to predict product demand for grocery items using Python's deep learning library. The purpose of these predictive models is to compare the performance of different open-source modeling techniques for predicting time-dependent demand at a store-SKU level. These demand forecasting models were developed using the Keras and scikit-learn packages, and we made comparisons along the following dimensions: 1) predictive performance, 2) runtime, 3) scalability, and 4) ease of use. The forecasting models explored in this study are Gradient Boosting, Factorization Machines, and three variations of Deep Neural Networks (DNN). We also explored the effectiveness of categorical embedding layers and sequence-to-sequence architectures in reducing the errors of the demand forecasts. Our best neural network model is currently placed in the top 1.5 percentile of all submissions in the Kaggle competition for forecasting retail demand.

186

INTRODUCTION
In today's world of extreme competition, cost reduction is of utmost importance for organizations, primarily in the retail and consumer product goods (CPG) industries. All the major players in these industries focus on cost-cutting and maintaining optimum inventory levels to gain a competitive edge. In addition to cost optimization, having just the right amount of inventory is becoming important for consumer satisfaction, especially in the perishable retail goods market. This is where demand forecasting helps these companies. Efficient and accurate demand forecasts enable organizations to anticipate demand and consequently allocate the optimal amount of resources to minimize stagnant inventory.

Gartner recently published a paper titled "Demand Forecasting Leads the List of Challenges Impacting Customer Service Across Industries" (Steutermann, 2016). As hinted by the title, the authors conclude that accurate demand forecasting across all customer-facing industries is important for business. In the Forbes article "Ten Ways Big Data Is Revolutionizing Supply Chain Management" (Columbus, 2015), demand forecasting is listed among the top four supply chain capabilities currently in use. Despite the wide acceptance and usage of forecasting techniques, they have largely been limited to macro-level forecasts. It is only recently that retail companies have started focusing on day-level forecasts. A Wall Street Journal article, "Retailers Rethink Inventory Strategies" (Ziobro, 2016), mentions how Home Depot is trying to minimize its inventory at stores. We see that there is an increasing need for demand forecasting techniques that can accurately predict the demand for each item, for each day, at every store. In some companies this need is being met with open-source data science tools, whereas a few other firms use in-house commercial platforms.

In this paper, we evaluate the predictive performance of models built with open-source data science tools like R and Python to predict demand for thousands of products at the store level using a Kaggle competition dataset. Our performance metric is not limited to model accuracy: we posit that the real value of a model to a business is a composite of (1) predictive accuracy, (2) runtime, (3) scalability, and (4) ease of use.

We structured this paper as follows. We performed a review of the literature to see what methodologies have been found to be successful for this problem. We then discuss the data set used in our study, the methodology/design we implemented, and the models we investigated. Lastly, we present our results, discuss our conclusions, and describe how we plan to extend this research.

LITERATURE REVIEW
We started our search for the optimal prediction model by looking at past research on demand forecasting using different machine learning algorithms. This exercise gave us an understanding of the different machine learning models that could be used for forecasting. We also looked at the measures frequently employed to compare their performance. Previous research on demand forecasting has traditionally used a methodology called

187

Autoregressive Integrated Moving Average (ARIMA). This methodology has been applied to studies of traffic flow (Williams B. M., 2003) and international travel demand (Lim, Time series forecasts of international

188
travel demand for Australia, 2002). Lim's paper analyzed stationary and non-stationary international tourism time-series data by formally testing for the presence of unit roots and seasonal unit roots prior to estimation, model selection, and forecasting. They used mean absolute percentage error (MAPE) and root mean squared error (RMSE) as measures of forecast accuracy. Comparing the RMSEs, the paper showed that lower post-sample forecast errors were obtained when time-series methods such as the Box-Jenkins ARIMA and seasonal ARIMA models were used. Ching-Wu Chu et al. (Ching-Wu Chu, A comparative study of linear and nonlinear models for aggregate retail sales forecasting, 2003) compared the performance of linear (traditional ARIMA) models to non-linear models (artificial neural networks) in forecasting aggregate retail sales. Here, neural networks with de-seasonalized data performed best overall, while ARIMA and neural networks modeled with the original data performed about the same. After reviewing this research, we decided to try both ARIMA and neural networks in our analysis.

(Guolin Ke, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017) found that, because data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite an accurate estimation of the information gain from a much smaller data size. We evaluated the performance of LGBM for predicting time-dependent sales data.

Neural network models can be applied to a variety of domains and are useful for high-dimensional data. However, there are several criticisms of neural networks. Their cost-benefit is limited in scenarios where the problem space does not have well-defined training examples to learn from. Further, the time required to train and tune the network increases as the number of nodes and connections increases. Finally, these networks can become "black boxes," as the explanation of how they arrived at a given result can be technically dense for a non-technical audience (Tu, 1996). Learning to store information over extended time intervals via recurrent backpropagation also takes a very long time, mostly due to insufficient, decaying error back flow. By truncating the gradient where this does no harm, Long Short-Term Memory (LSTM) can learn to bridge time lags in excess of 1,000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). While (Graves, 2013) used LSTM to predict the next item in a text sequence, we use similar time-dependent sales data to predict future sales. Sutskever et al. (Sutskever, 2014) showed the effectiveness of an encoder-decoder recurrent neural network structure for sequence-to-sequence prediction. A multilayered LSTM can be used to map the input sequence to a vector of fixed dimensionality, and another deep LSTM can then decode the target sequence from that vector. While Sutskever et al. used this approach for language translation, the method seems suitable for demand forecasting, where we take the sales for the previous days as the input sequence and predict the sequence of future sales.
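To make the encoder-decoder idea concrete, below is a minimal Keras sketch of a sequence-to-sequence network for sales series; the layer sizes, 50-day/16-day horizons, and random arrays are illustrative assumptions, not the architecture trained in this study:

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

n_past, n_future = 50, 16                           # past window -> forecast horizon

inputs = Input(shape=(n_past, 1))
state = LSTM(64)(inputs)                            # encoder: compress the sales history
x = RepeatVector(n_future)(state)                   # repeat the encoding per future step
x = LSTM(64, return_sequences=True)(x)              # decoder: unroll over the horizon
outputs = TimeDistributed(Dense(1))(x)              # one sales forecast per future day

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# Random tensors just to show the expected shapes
model.fit(np.random.rand(32, n_past, 1), np.random.rand(32, n_future, 1),
          epochs=1, verbose=0)
```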

189

Cheng Guo and Felix Berkhahn (Cheng Guo, 2016) introduced the concept of categorical embedding, which can be used to build neural networks with categorical predictors. Embedding reduces memory usage and speeds up neural networks compared with one-hot encoding. It also captures the intrinsic relations between the categories by mapping each category to a Euclidean space.
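As an illustration of this idea, here is a minimal Keras sketch of embedding layers for two categorical inputs; the cardinalities and embedding sizes are hypothetical, not values from the study:

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from tensorflow.keras.models import Model

n_stores, n_items = 54, 4100                 # illustrative category cardinalities

store_in = Input(shape=(1,), name="store_nbr")
item_in = Input(shape=(1,), name="item_nbr")

# Each category index is mapped to a small dense vector instead of a one-hot column
store_vec = Flatten()(Embedding(input_dim=n_stores, output_dim=8)(store_in))
item_vec = Flatten()(Embedding(input_dim=n_items, output_dim=32)(item_in))

x = Concatenate()([store_vec, item_vec])
x = Dense(64, activation="relu")(x)
output = Dense(1)(x)                         # unit-sales forecast

model = Model([store_in, item_in], output)
model.compile(optimizer="adam", loss="mse")
```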

190

This concept can be utilized when building neural network models to predict sales of a wide range of items that belong to different families. Liu Yue et al., in their 2007 paper (Liu Yue, 2007), showed the effectiveness of another machine learning model in demand forecasting: the Support Vector Machine (SVM). In their research, SVM is introduced into the retail industry for demand forecasting, and the experimental results show that its performance is superior to traditional statistical models and the traditional Radial Basis Function Neural Network (RBFNN). Liu Yue and colleagues also note that the prediction accuracy of SVM can be further improved by using ensemble-learning techniques. Factorization Machines can estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. (Rendle, 2011) shows that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly.

From our research of previous studies on demand forecasting, we have seen that a large variety of machine learning models, including ARIMA, exponential smoothing, neural networks, and support vector machines, have been used. Forecast accuracy is generally measured in RMSE or MAPE. Table 1 below is a summarization of the literature review.

Studies | Motivation for the Research | Result of the Research
Williams, B. M., & Hoel, L. A. (2003). Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process. Journal of Transportation Engineering, 129(6), 664-672; Lim, C., & McAleer, M. (2002). Time series forecasts of international travel demand for Australia. Tourism Management, 23(4), 389-396 | To study the traditional methods of demand forecasting | ARIMA models can give good accuracy; they may cause problems in initial model selection as they are based on heuristic selection of parameters; they can be time-consuming if many time series are to be analyzed; RMSE and MAPE are widely used to measure forecast performance
Ching-Wu Chu & Guoqiang Peter Zhang (2003). A comparative study of linear and nonlinear models for aggregate retail sales forecasting. International Journal of Production Economics, 86(3), 217-231 | To see if newer nonlinear machine learning models perform better than traditional methods | Neural networks with deseasonalized data perform the best overall
Zell, A. (1994). Simulation neuronaler netze (Vol. 1); Ghiassi, M., Skinner, J., & Zimbra, D. (2013). Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Systems with Applications, 40(16), 6266-6282 | To explore other methods beyond neural nets and ARIMA | SVM models have also been successful in giving good forecast results for retail demand
Guolin Ke, Qi Meng, Thomas Finley et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree; Jie Zhu, Ying Shan, JC Mao et al. (2017). Deep Embedding Forest: Forest-based Serving with Deep Embedding Features | To explore high-performance GBM methods for demand forecasting | LGBM models can significantly outperform XGBoost and SGB in terms of computational speed and memory consumption
Graves, A. (2013). Generating Sequences with Recurrent Neural Networks; Hochreiter, S., & Schmidhuber, J. (1997). LSTM can solve hard long time-lag problems; Le, Q. V., Ranzato, M. A., Monga, R., et al. (2012). Building high-level features using large scale unsupervised learning | To understand different models | Long Short-Term Memory (LSTM) is a very powerful technique for predicting sequences; a simple, straightforward and relatively unoptimized approach can outperform a mature SMT system
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks; Jean, S., Cho, K., et al. (2015). On using very large target vocabulary for neural machine translation; Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? | To study methods beyond LSTM for sales forecasting | Sequence-to-sequence models should perform well when predicting a sequence of time-dependent demand
Rendle, S. (2011). Factorization Machines; Toscher, A., Jahrer, M., & Bell, R. M. (2009). The BigChaos Solution to the Netflix Grand Prize; Piotte, M., & Chabbert, M. (2009). The Pragmatic Theory Solution to the Netflix Grand Prize | To study methods that can perform better than SVM for sparse datasets | Factorization machines predict demand for sparse datasets; factorization machines can be used to predict demand effectively with many categorical variables
Xiangnan He & Tat-Seng Chua (2017). Neural Factorization Machines for Sparse Predictive Analytics; Cheng Guo & Felix Berkhahn (2016). Entity Embeddings of Categorical Variables; Bengio, Y., & Bengio, S. (1999). Modeling high dimensional discrete data with multi-layer neural networks; Yang, B., Yih, W., He, X., Gao, J., & Deng, L. (2014). Embedding entities and relations for learning and inference in knowledge bases | To study the impact of embedding for categorical variables | Entity embedding not only reduces memory usage and speeds up neural networks compared to one-hot encoding; more importantly, it reveals the intrinsic properties of the categorical variables

Table 1: Literature review summary

DATA
The data used in this research is from the Kaggle competition that aims to forecast demand for millions of items at a store and day level for a South American grocery chain (https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data). The data is provided in several tables named train, test, stores (store-related data), items (merchandise data), transactions, oil (oil prices can be a good predictor of sales as Ecuador is an oil-dependent economy), and holidays events (holiday and major event data). Table 2 below summarizes the important columns of each table.

Variable | Type | Description
Id | Integer | Identifier defined at the date-store-item-promotion level
Unit_Sales | Numeric | Sales defined at the date-store-item-promotion level
Date | Date | Date of transaction for an item
Store_Nbr | Integer | Store identifier
Item_Nbr | Integer | Item identifier
Onpromotion | Boolean | Whether the item is on promotion
City | Text | City in which the store is located
State | Text | State in which the store is located
Type | Text | Internal store categorization
Cluster | Integer | Internal store clustering
Family | Text | The family of the item
Class | Text | Class of the item
Perishable | Boolean | Whether the item is perishable

Table 2: Data used in study

195

Figure 1 provides a data model of how the tables and features map to each other.

Figure 1: Sales Data Model

METHODOLOGY
Figure 2 provides a detailed outline of the data mining workflow used in this study.

Figure 2: Data Mining Workflow

196

We followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) process model to approach the given problem.

Business Understanding: We started by understanding the business objectives of the problem and its application. We understood that in addition to the usual trends in item-level shopping, there could be external factors (e.g., oil price, holidays) that affect demand. Further, we understood the rationale for the high weight given to perishable goods and its impact on the business. As we learned more, it became clear to us that a short-term (16-day) demand forecast was required at a granular level. We prepared a preliminary strategy by researching the various modeling techniques applicable to time series and the best practices and tooling required for dealing with Big Data (about 100 million training rows). Through the literature review, we were mindful that some model outputs (e.g., PCA, clustering) would serve as inputs to other models, and for this reason our strategy included trying various models (with moving-average features).

Data Understanding: We sourced flat data files and made cosmetic changes to support the EDA (Exploratory Data Analysis) exercise. Then, with a connection to Tableau, we collected basic facts about the data and studied the distributions of all the key variables.

Data Preparation: Keras requires the predictor variables to be represented as values in a dataset matrix, with the predicted values acting as one index and the time step as another dimension. This meant that we had to unstack time from rows to columns. This process is called windowing and is one of the most popular techniques for converting time series forecasting into a supervised learning problem. From a cutoff date in the dataset, we then calculated features such as running sales means over the trailing 1, 3, 7, 14, 30, 60, and 140 days; running promotion sums over 14, 60, and 140 days; and running sales means over 7 and 14 weeks. Further, we arranged the known sales amounts for the 16 testing days in the same matrix. For feature engineering, we added categorical embedding features of store/location characteristics and a sequential sales prediction (from the sequence-to-sequence method, serving as a metamodel).

Modeling: Various models such as Moving Average, Factorization Machine, LGBM, and Deep Neural Nets were trained and evaluated against each other. For training and testing, we utilized the matrix in two different ways. In the first, all 16 testing days are treated as one dependent variable and the model is trained on data before the cutoff; this is the single-model approach for the neural network (NN1). In the second, the X horizon is fixed but Yi varies, where i is one testing day. This led us to build 16 different neural networks, one for each day; this set is called the 16-model approach (NN2). Both the single-model and 16-model datasets were also used to train the other models. Finally, NN3 was developed as an improvement over NN2 by including categorical embedding features and the sequence-to-sequence (seq2seq) metamodel.

Evaluation: Forecasting models are evaluated based on the statistical measure Normalized Weighted Mean Squared Logarithmic Error (NWMSLE):

197

NWMSLE = [ Σ_{i=1}^{n} w_i (ln(ŷ_i + 1) − ln(y_i + 1))² ] / [ Σ_{i=1}^{n} w_i ]

198
where ŷ_i is the predicted sales at the i-th SKU-store level, y_i is the actual sales at the i-th SKU-store level, and w_i is the weight corresponding to the SKU. Perishable SKU-store items are given higher weights in the evaluation; all other items' weights are kept equal to 1. This is a standard measure of performance for forecasting problems in the texts and literature.

FORECASTING MODELS
Moving Average Method
This is one of the oldest and most widely used methods of demand forecasting. In this method, the average sales of the previous 3, 7, 14, 28, 56, 112, and 180 days are used as predictors of the next day's sales. The predictions are multiplied by a factor that accounts for the differences in sales across the days of the week. The method is simple and gives good accuracy over a short-term horizon. However, it is not likely to predict well over a longer span, as it does not generalize the trend but merely follows past behavior with auto-regressive components.

LGBM
The Light Gradient Boosting Model (LGBM) is a fast variant in the class of tree-based boosting algorithms. It is designed to run on large data sizes, where it provides the best runtime performance while achieving the same accuracy as other decision-tree boosting methods such as XGBoost and pGBRT. It grows trees leaf-wise rather than level-wise. Specifically, it utilizes two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS samples out data with smaller gradients to maximize the estimated information gain; this makes it possible to exclude a relatively large amount of data that does not contribute much to learning yet consumes a lot of processing. EFB is a dimensionality reduction technique that optimally bundles mutually exclusive features (i.e., features that rarely take non-zero values simultaneously). LGBM is one of the most popular machine learning algorithms in data competitions and speeds up comparable training processes by over 20 times.

Factorization Machine
Factorization Machine (FM) models are general predictors (like SVM) that work well on very sparse data. They aim to learn a kernel by representing higher-order terms as the product of latent factor vectors, essentially revealing the latent factors that capture the interactions between features. They are generic models that can mimic the strengths of factorization models like timeSVD++ [Koren 2009b] and FPMC [Rendle et al. 2010]. The FM model equation can be computed in linear time, leading to fast model computation, and FMs allow for parameter estimation even under sparse data: the factorization of parameters allows higher-order interaction effects to be estimated even when no observations of those interactions are available. As our dataset is sparse and large, we will not consider SVMs in our study but rather use Factorization Machines.

Deep Neural Networks
Neural network models have also been proposed as a means of predicting

199 unit sales. There are several advantages to using neural networks for predictive analytics. First, neural networks are highly tunable and can reliably process large volumes of data. Second, these networks are less sensitive to outliers or extreme values than linear models.

200

However, these techniques are not without limitations. Once trained, neural networks are difficult to retrain. These networks require special software packages to handle time series data. Finally, they have a reputation for being "black boxes" that are difficult for non-technical audiences to interpret. Learning from the ANN was enhanced with the following methods:

Categorical Embedding: This is an advanced method (versus one-hot or dummy encoding) for handling categorical data in a machine learning model. Embedding maps a categorical feature to a continuous n-dimensional weight space learned by a neural network whose loss function is defined on the target variable. This weight space exhibits the closeness between categorical values; for example, bigger cities would show similar weights, indicating a closeness in their shopping distributions. This is in direct contrast to one-hot encoding, where every category is simply given a weight of 1.

Sequence-to-Sequence Learning as a metamodel: Sequential data poses a unique problem in the form of non-fixed dimensionality, in other words, a length that is not known a priori. An LSTM (Long Short-Term Memory) architecture can solve the problem by reading the sequence one step at a time and mapping it into a large fixed-dimensional space (encoding), which can then be fed into another RNN (Recurrent Neural Network) model that learns and predicts the long-range dependencies (decoding). This model is therefore also referred to as an encoder-decoder architecture. Here, the sales for every store-item combination were treated as sequential data, and the outputs generated acted as features for our DNN.

RESULTS
Descriptive Analysis
We first perform descriptive analytics on the sample of data we have taken. We plot the macro trends in the total sales of all items across all stores. The sales have been transformed using log(1 + sales) to make sure that the variance of sales does not change greatly with time.

201

Figure 3: Exploratory Analysis of Sales Data vs. Time

We see from the plots above that there is a clear seasonal (weekly) trend in the sales of grocery items. As one would expect, sales are higher on the weekends and lower during the weekdays. Grocery sales also seem to increase during the month of December. As we have nearly 167,000 store-item combinations, and the sales for each of these store-items is essentially a unique time series, visually inspecting all the features is not useful in this case.

Forecasting Models
The objective of all the machine learning models that were built was to forecast sales for about 167,000 store-items for the period of August 16-31, 2017. First, a simple moving average model was built in Python. This model performed decently (currently in the top 50th percentile of the Kaggle competition) and provided a good baseline to improve upon. Next, we built a neural network model and a gradient boosting model that utilize the moving averages at different lags as predictors. As we had to predict sales over a horizon of 16 days, we built 16 different models, each utilizing the data up to time (t) and predicting the sales for time (t+1), (t+2), ..., up to (t+16). Both the 16-neural-network (NN2) and the 16-gradient-boosting approaches stood in the top 2nd percentile of all solutions in the Kaggle competition. We then tried a single neural network (NN1) trained with all 16 days of sales as the dependent variable, in order to predict the entire 16-day horizon with one model. The results from this approach were worse than the 16-model approach. We also built a factorization machine model, but the 16-model neural network approach was still the best model at this point. After that, we tried a sequence-to-sequence approach to forecast the sequence of 16-day sales using the sequence of the previous 50-day sales as input. We decided to use the output of this model as inputs to the 16-model neural network approach to improve overall accuracy. We also added

202
additional features to the 16-model approach using categorical embedding to create a final model (NN3). This model reduced the error slightly and is currently in the top 1.5th percentile. The runtime comparisons below also show that all the models ran in a reasonable amount of time on a GPU-supported machine.

Best Model Selection
In general, when retailers need forecasts of sales/demand, they are more interested in the forecasts for perishable goods than for non-perishable goods. Therefore, to obtain higher accuracy in the forecasts of perishable items, we up-weighted the errors for perishable items by 25%. We used a normalized weighted mean squared error metric to select the best among all the models we built. As we transformed the sales using a log transform, the actual metric we ended up using is the normalized weighted mean squared log error (NWMSLE).

Figure 4: Comparison of models built (Normalized Weighted Mean Squared Log Error on the train and validation sets for Moving Average, Factorization Machine, Gradient Boosting, NN1 with constant-range input, NN2 with sliding-range inputs, and NN3 with embedding layers and seq2seq metamodel)
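The selection metric can be computed with a short numpy sketch like the one below; it follows the NWMSLE expression given in the Evaluation section, with the paper's 25% up-weighting of perishables expressed as a weight of 1.25:

```python
import numpy as np

def nwmsle(y_true, y_pred, weights):
    """Normalized weighted mean squared logarithmic error (lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.asarray(weights, dtype=float)
    sq_log_err = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return np.sum(w * sq_log_err) / np.sum(w)

# Perishable items carry weight 1.25, all other items weight 1.0
print(nwmsle(y_true=[3, 0, 12], y_pred=[2, 1, 10], weights=[1.25, 1.0, 1.0]))
```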

203

Figure 5: Running Time Comparison

CONCLUSION
Demand forecasting at a granular level is a complicated analytical problem, with multiple time series all propagating in tandem and affected by external factors like oil prices and holidays. Inventory surplus and stock-outs are key ground-level issues in retail store management that affect bottom-line margins. A handy digital tool driven by this model could help a category manager decide on the daily inventory stocks for millions of items at various store locations. The model could also help estimate the demand for a new item without any historical shopping data. Though R and Python offer a basket of models to turn this complex problem into a manageable solution, infrastructure limitations constrained us to use a subset of the data for training, which could underrepresent the information available to the model. The model also assumes the absence of catastrophic events (like earthquakes), in the presence of which the variance would shoot up drastically. We could further improve our forecasts in the future by attempting to use other features, like item and store attributes, in a sequence-to-sequence type neural network. It would also be interesting to look further into macro inputs (other than oil price) indicative of the health of the economy, like GDP, inflation rate, and unemployment.

REFERENCES
A. Toscher, M. J. (2009). The BigChaos Solution to the Netflix Grand Prize.
B. Yang, W. Y. (2014). Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.
Bengio, Y. B. (1999). Modeling high dimensional discrete data with multi-layer neural networks. NIPS.
Cheng Guo, F. B. (2016). Entity Embeddings of Categorical Variables.

204

Ching-Wu Chu, & G. P. Zhang (2003). A comparative study of linear and nonlinear models for aggregate retail sales forecasting. International Journal of Production Economics, 86(3), 217-231.
Columbus, L. (2015). Ten Ways Big Data Is Revolutionizing Supply Chain Management. Forbes.
Ghiassi, M. S. (2013). Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Systems with Applications, 40(16), 6266-6282.
Graves, A. (2013). Generating Sequences with Recurrent Neural Networks. arXiv preprint arXiv:1308.0850.
Guolin Ke, Q. M. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
Jean, S. C. (2015). On using very large target vocabulary for neural machine translation. ACL.
Jie Zhu, Y. S. (2017). Deep Embedding Forest: Forest-based Serving with Deep Embedding Features.
Lim, C. &. (2002). Time series forecasts of international travel demand for Australia. Tourism Management, 23(4), 389-396.
Liu Yue, Y. Y. (2007). Demand Forecasting by Using Support Vector Machines. Third International Conference on Natural Computation.
M. Piotte, & M. Chabbert (2009). The Pragmatic Theory Solution to the Netflix Grand Prize.
Q.V. Le, M. R. (2012). Building high-level features using large scale unsupervised learning. ICML.
Rendle, S. (2011). Factorization Machines.
Schmidhuber, S. H. (1997). LSTM can solve hard long time-lag problems.
Steutermann, S. (2016). Demand Forecasting Leads the List of Challenges Impacting Customer Service Across Industries. Gartner.
Sutskever, O. V. (2014). Sequence to Sequence Learning with Neural Networks.
Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? NIPS.
Tu, J. V. (1996). Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. Journal of Clinical Epidemiology, 1225-1231.
Williams, B. M. (2003). Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. Journal of Transportation Engineering, 129(6), 664-672.
Xiangnan He, T.-S. C. (2017). Neural Factorization Machines for Sparse Predictive Analytics.

205

Zell, A. (1994). Simulation neuronaler netze (Vol. 1). Addison-Wesley Bonn.
Ziobro, P. (2016). Retailers Rethink Inventory Strategies. WSJ.

206

Does Advance Warning Help Mitigate the Impact of Supply Chain Disruptions?
Sourish Sarkar a* and Sanjay Kumar b
a Sam and Irene Black School of Business, Pennsylvania State University—Erie, 5101 Jordan Road, Erie, PA 16563, USA
b College of Business, Valparaiso University, Valparaiso, IN 46383, USA

With growing complexity in supply chains, both the frequency and the impact of supply chain disruptions appear to be increasing. Disruption-related risk management is therefore receiving wider attention in supply chain research. However, the literature focuses mostly on analytical modelling and survey-based empirical research, and very few studies examine the behavioral side of supply chain decision-making in the wake of a disruption. The purpose of this behavioral experiment is to investigate the effect of advance warning on ordering decisions in a serial supply chain that experiences a disruption at the upstream or downstream echelon.

Many supply chain managers believe in increasing the firm's resilience by improving its ability to quickly detect any disruption. This notion is also supported by recent literature, as proactive risk management is perceived as a better approach than reactive risk management. In this regard, forecasting (i.e., advance warning) of disruptions has been considered a preventive strategy. Through our experimental investigation, we attempt to answer two research questions: (i) Does advance warning of a disruption necessarily help? (ii) Does advance warning provide similar benefits for upstream and downstream disruptions?

We conduct the laboratory experiment with a setup similar to the traditional beer distribution game, but with the modifications needed to introduce the disruptions. A 2x2 experimental design (i.e., location of disruption: upstream or downstream, and availability of advance warning: yes or no) is used to investigate the research questions. A supply chain in our experiment experiences only one disruption, which lasts for five periods. During the disruption, the disrupted facility cannot process orders, fulfill demands, or receive shipments. For the advance-warning treatments, the disrupted facilities receive the warning five periods in advance. No information on the disruption is shared with other echelons in any of the four treatments.

We report the preliminary results of our experiment. Advance warning of a disruption causes over-reaction that results in higher order volumes than in the no-warning cases. This finding is supported by the literature on risk aversion. Particularly for downstream disruptions (i.e., at the retailer), the over-reaction worsens the overall supply chain performance. However, for upstream disruptions (i.e., at the manufacturer), the advance warning may help the manufacturer to improve

207
the supply chain performance. We also observe that the immediate downstream echelon (i.e., the distributor) is the primary beneficiary of the advance warning received by the manufacturer.

208

Effect of Forecast Accuracy on Inventory Optimization Model
Surya Gundavarapu, Prasad Gujela, Shan Lin, Matthew A. Lanham
Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907
[email protected]; [email protected]; [email protected]; [email protected]

Abstract
In this study, we examine the effect of forecast accuracy on the inventory costs incurred by a national retailer using a dynamic inventory optimization model. In the past, the retailer calculated weekly and monthly demand forecasts at a particular distribution center by dividing the annual demand by fixed numbers, which led to a consistently flat ordering pattern. This led the retailer to purchase items in bulk from its vendors and to incur unnecessary holding costs. The motivation for this study is that such purchasing behavior does not adequately prepare a supply chain for unexpected demand and thus might further deepen the retailer's inventory troubles. We introduced to this retailer an easy-to-deploy inventory model that uses the distribution of demand for each item, along with the target service level and other constraints such as purchase capacity, to minimize the overall cost for each item. We then show the impact on inventory costs of how accurate the demand forecast is.

Keywords: Dynamic Inventory Optimization, Forecast Accuracy, Wagner-Whitin Algorithm

209

Introduction
Companies use inventory as a buffer against supply and demand volatility. Maintaining an optimal inventory level is important for retailers, since too much inventory implies too much capital stuck in the supply chain, while too little inventory prevents customer fulfillment at a satisfactory level. Further, mishandling inventory affects not just the supply chain department but the overall KPIs of the firm. Nowadays, large firms use ERP systems to aid their inventory decisions; these ERP systems integrate critical information about the supply chain, such as customer orders, warehouse capabilities, and demand forecasts, to help managers make informed decisions (Fritsch, 2017). Further, firms today are investing heavily in machine learning, big data analytics, and the Internet of Things to bring more analytic power to operational decision support systems. For instance, a warehouse manager can now incorporate images as data inputs for intelligent stock management systems that can predict when the company should re-order (Marr, 2016).

By improving demand forecast accuracy, companies can reduce safety stock and free up more cash for other areas of the business. For business-to-consumer (B2C) supply chains, for example, a consumer goods company will forecast what stock keeping units a retailer will order. Demand management teams must also collaborate with sales forces to accurately estimate the probability that a deal will close and what products the deal will include (Banker, 2013). In this way, a company can reduce inventory per square foot and turnover days to maintain cash flow. Unfortunately, improving the demand forecasts is not enough; what really matters is how the company implements stocking and reorder points. In practice, other factors such as transportation costs should be incorporated into the process design. Recently, a Dutch adult beverage firm successfully reduced its freight cost by implementing dynamic order allocation, which integrates customer order information and available stock levels. With more information, the company can make

210
strategic decisions to optimize cost while maximizing customer fulfillment (Banker, 2015). Our research question in this study is: how does demand forecast accuracy translate into additional inventory costs when using a dynamic inventory optimization model for the replenishment of spare-type items? We use the retailer's own forecasts as a baseline. We then build a dynamic optimization model using the Wagner-Whitin algorithm to increase the retailer's responsiveness to market demand.
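For reference, the core Wagner-Whitin recursion can be sketched as a small dynamic program; the demand, ordering-cost, and holding-cost numbers below are illustrative, and the retailer's additional constraints (purchase capacity, service level, minimum/maximum inventory) discussed later are omitted:

```python
def wagner_whitin(demand, order_cost, hold_cost):
    """Minimize total ordering + holding cost over a finite horizon (no shortages)."""
    T = len(demand)
    best = [0.0] * (T + 1)      # best[t] = minimum cost of covering periods 0..t-1
    order_at = [0] * (T + 1)    # back-pointer: period whose order covers up to t-1
    for t in range(1, T + 1):
        best[t] = float("inf")
        for j in range(t):      # place an order in period j covering periods j..t-1
            holding = sum(hold_cost * (k - j) * demand[k] for k in range(j, t))
            cost = best[j] + order_cost + holding
            if cost < best[t]:
                best[t], order_at[t] = cost, j
    plan, t = [], T
    while t > 0:                # walk the back-pointers to recover the ordering plan
        j = order_at[t]
        plan.append((j, sum(demand[j:t])))   # (order period, order quantity)
        t = j
    return best[T], list(reversed(plan))

total_cost, plan = wagner_whitin(demand=[20, 50, 10, 50, 50, 10],
                                 order_cost=100, hold_cost=1)
print(total_cost, plan)
```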

211

Our paper is organized as follows. We begin by reviewing the academic literature related to inventory management. Then we discuss the type of data we have before diving into our methodology for obtaining better forecasts. We then use the Wagner-Whitin algorithm to calculate the optimal ordering point(s). Lastly, we discuss the savings for the retailer as a function of forecast accuracy before ending the discussion with further steps to improve this model.

Literature Review
We examined the academic literature and mainly investigated three things regarding the formulation of an optimized economic order quantity (EOQ) model.
1. Demand distribution: To choose the model form, such as a linear or non-linear model, understanding the demand distribution is important.
2. Cost factors for the model: In an inventory system, three factors are classically considered as cost factors: i) replenishment cost, ii) holding cost, and iii) shortage cost. Beyond those classical factors, we investigated which other external factors could be considered when formulating our cost model.
3. Modeling with constraints: From our data sets, we found several constraints, including minimum and maximum inventory levels and holding costs. We explored how to handle such constraints based on the literature.

Demand Distribution
In the supply chain field, the distributional assumption for each item is critical for building the EOQ model. Most companies use a normality assumption for each item because it is the simplest way to build a linear model. However, if demand is intermittent, the normal distribution is not plausible and exponential smoothing is used instead. In addition, as Bookbinder and Lordahl (1989) found, the bootstrap is superior to the normal approximation for estimating high percentiles of lead-time demand (LTD) distributions for independent data. Below are several methods to estimate the demand distribution. Classically, the normal distribution,

212

Poisson distribution, exponential smoothing, Croston's method and the bootstrap method are used to estimate the distribution of demand. Each method has pros and cons and a situation in which it is most appropriate. In addition, as Rehena Begum, Sudhir Kumar Sahu and Rakesh Ranjan Sahoo (2010) mention, each item's own distribution can be considered. In the supply chain field, since items

213
deteriorate, the Weibull distribution is widely used. Table 1 summarizes our findings on the demand-distribution aspects.

Author | Methods | Key Character | Advantages
(Croston 1972) | Normal Distribution | Mean and standard deviation | The simplest way to make a linear model with a normality assumption
(Ward 1978) | Poisson Distribution | Lambda | In a specific situation, it performs well
(Thomas R. Willemain, Charles N. Smart, Henry F. Schwar 2004) | Exponential Smoothing | Robust forecasting method | Flexible over most restrictions such as a normality assumption or the Central Limit Theorem, and performs well relative to the Poisson distribution
(Willemain et al., 1994; Johnston & Boylan, 1996) | Croston's Method | Accurate forecasts of the mean demand per period | Estimates the mean demand per period by applying exponential smoothing separately to the intervals between nonzero demands and their sizes
(Efron 1979) | Bootstrap Method | Sampling with replacement from the individual observations | Ignores autocorrelation in the demand sequence and produces as forecast values only the same numbers that have already appeared in the demand history

Table 1: Literature on Demand Distribution

Cost Factors
Classically, three factors are considered as cost factors: i) replenishment cost, ii) holding cost, and iii) shortage cost. Beyond those classical factors, we investigated which other external factors could be considered in formulating our cost model. We mainly investigated two things: i) how those

214 three cost factors can vary and ii) Whether other factors can be included in the modeling. For example, Mark Ferguson, Vaidy Jayaraman and Gilvan C. Souza (2007) introduced a nonlinear way to handle the holding cost. A cumulative function of holding costs sometimes appears to be nonlinear, so that a different approach is more appropriate. Hoon Jung and Cerry M. Klein (2005) presented the optimal inventory policies for economic order quantity model with decreasing cost functions. It shows that classical costs can be interpreted in different ways as well as other costs factors being included in the model, such as interest rate or the deterioration of items. Table 2 below explains diverse ways to interpret cost factors for the model to minimize the cost. Author Cost Factors Key factors Modeling explanation (Hoon Jung, Cerry M. Klein Decreasing cost functions with geometric Constant demand and a fixed purchasing cost

215 programming 2005) economy of (GP) scale techniques are used (Kun-Jen Chung, Leopoldo Eduardo Cárdenas-Barrón 2012) Fixed backorder Derivatives were used to find costs the optimal point Two type backorders cost are considered: linear backorder cost (backorder cost is applied to average backorders) and fixed cost (backorder cost is applied to maximum backorder level allowed) (J. Ray, KS. Chaudhuri 1997) Stock-dependent demand, shortage, inflation and time discounting Assumption of a constant purchasing cost becomes invalid in real situation Explains relationships between cost factors and external factors. For example, the holding (or carrying cost) consists of opportunity costs and costs in the form of taxes, insurance and costs of storage. (Mark Ferguson, Vaidy Jayaraman, Gilvan C. Souza 2000) Nonlinear Holding Cost cumulative holding cost is a nonlinear function of time Appropriate with more significant for higher daily demand rate, lower holding cost, shorter lifetime, and a markdown policy with steeper discounts (L.A. San-José, J. Sicilia, J. García-Laguna 2015) Partial backordering and non-linear unit holding cost backordering cost includes a fixed cost and a cost linearly dependent on the length of time for which backorder exists Fixed cost which represents the cost of accommodating the item in the warehouse and a variable cost given by a potential function of the length of time over which the item is held in stock Table 2. Diverse cost factors can be included in models Constraints In supply chain optimization, setting up constraints is critical. For example, our data set has multiple constraints, including minimum and maximum inventory level, minimum order quantity, and minimum service level to be achieved. Within those constraints, optimized economic order quantity should be calculated. However, within constraints, logic to find the optimal point are different for various models. For example, in linear modeling, the simplex method is widely used to find the optimal points. We examined which methods are being used to handle constraints and

216 explanations and advantages of them, and specifically we focused on the service level that we need to achieve through for our business partner. Table 3 below explains how to find the optimal points with various constraints and what those models imply. Author Constraints Key factors Modeling explanation (Sridar Bashyam, Michael C. Fu 1997) Random Lead Time and a Service Level Constraint Constraint simulation optimization This paper considers the constrained optimization problem, where orders are allowed to cross in time (Ilkyeong Moon, Sangjin Choi Service Level Stochastic inventory model Service is measured here as the fraction of demand satisfied directly from stock

217

1994) (James H. Bookbinder, Jin Yan Tan 1988) Lot-Sizing Time-varying Problem with demands Service-Level Constraints This paper describes deterministic version of problem, which is time-varying demands (Wen-Yang Lo, Chih-Hung Tsai, Rong-Kwei Li 2000) linear trend in demand Demand rate of a product is a function of time This study proposes a two-equation model to solve the classical no-shortage inventory replenishment policy for linear increasing and decreasing demand Table 3. Literature on Constraints Data The data investigated in this study came from a regional retailer in the United States. The data set consisted of 87,053 observations of Part IDs, which includes diverse variables regarding the inventory system from a specific vendor at a specific distribution center. Table 4 provides a data dictionary of the features we had available for this study. Variable Type Description Date Date Date where inventory was examined at Remington DC Lead Time _ Days Numeric When order was placed to when it was delivered to Remington distribution center Product Group Categorical Parent directory of products. Product Group is composed of two groups 1. ELECTRICAL 2. NA DC name Categorical Distribution Center Name DC number Categorical Unique Distribution Center Number Vendor number Categorical Unique Vendor Number Part ID Categorical Unique parts identification Part Description Text Detailed part description Inventory on hand Numeric Week ending on Friday inventory on hand Inventory on order Numeric Week ending on Friday inventory on order Vendor request minimum order quantity Numeric A constraint Current Purchase Price Numeric Current purchase price of each part

218

Units Shipped Year to Date Numeric Units shipped from a year ago to date

219

Units Shipped Quarter to Date Numeric Units shipped from a quarter ago to date Units Shipped Week to Date Numeric Units shipped from a week ago to date Demand Forecast Annual Numeric Forecasted annual demand Demand Forecast Quarterly Numeric Forecasted quarterly demand Demand Forecast Four Weekly Numeric Forecasted four weekly demand Demand Forecast Weekly Numeric Forecasted weekly demand Order Up To Level In Units Numeric A constraint Minimum order level Numeric A constraint Suggested Order Quantity Actual Numeric Suggested Order Quantity Order Point Independent In Units Numeric Order Point Independent In Units Item Class Categorical the groups that sell the most (As and Bs are faster moving) Table 4: Data used in study In addition to the features obtained in Table 4, we had actual outbound quantities data from the DC which was used for model validation. Methodology Exploratory Data Analysis Figure 1 shows the actual outbound demand versus the retailer’s currently demand forecast. This figure demonstrates the major cause of their high inventory costs they were facing.


Figure 1: Actual Demand vs Demand Forecast

We observed large differences between the retailer's forecast and the actual demand, which translated into high inventory management costs in the form of underage costs and holding costs. Further, we saw different patterns in demand, which led us to cluster the items into groups so that each could be handled differently. Our pipeline for the process is shown in Figure 2.

Figure 2: Data Processing Pipeline

Once the data was cleaned and the subset clusters were identified, we used the retailer's forecast as a baseline and built three forecast models of varying accuracy to feed into the optimization model.

Models

Periodic Review Model

Our baseline optimization model was a standard Economic Order Point (EOP) model, since the retailer had several constraints on the number of units they can order, which reduces their flexibility. Thus, optimizing when they order was reasonable. Figure 3 below shows the flow chart for an EOP model.


Figure 3: Economic Order Point

Following the above flowchart, EOP is a continuous review process: after every order fulfillment we check whether a pre-determined re-order point has been reached, and we trigger an order if it has.

Exponential Smoothing with Trend

Exponential smoothing refers to an averaging method that weighs the most recent data more strongly. This is useful if the data changes as a result of seasonality (or some other pattern) rather than a random walk. Mathematically, this can be written as

F_{t+1} = α·D_t + (1 − α)·F_t

where
F_{t+1} = forecast for the next time period,
D_t = actual demand at time t,
F_t = forecast at time t, and
α = weighting factor, referred to as the smoothing parameter.

This can be enhanced by adding a trend adjustment factor to incorporate the trend into the equation:

AF_{t+1} = F_{t+1} + T_{t+1}
T_{t+1} = β(F_{t+1} − F_t) + (1 − β)T_t

where
AF_{t+1} = adjusted forecast for time t+1,
T_t = last period's trend factor, and
β = smoothing parameter for the trend.
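To make the recursion concrete, the following is a minimal Python sketch of trend-adjusted exponential smoothing as written above. It is not the paper's implementation; the function and parameter names (trend_adjusted_forecast, alpha, beta) and the initialization choices are our own illustrative assumptions.

```python
def trend_adjusted_forecast(demand, alpha=0.3, beta=0.2):
    """Trend-adjusted exponential smoothing sketch.

    demand : list of actual demands D_1..D_n
    alpha  : smoothing parameter for the level
    beta   : smoothing parameter for the trend
    Returns the adjusted forecasts AF_2..AF_{n+1}.
    """
    F = demand[0]   # initialize the level forecast with the first observation (an assumption)
    T = 0.0         # initialize the trend factor at zero (an assumption)
    adjusted = []
    for D in demand:
        F_next = alpha * D + (1 - alpha) * F                # F_{t+1} = a*D_t + (1-a)*F_t
        T_next = beta * (F_next - F) + (1 - beta) * T       # T_{t+1} = b*(F_{t+1}-F_t) + (1-b)*T_t
        adjusted.append(F_next + T_next)                    # AF_{t+1} = F_{t+1} + T_{t+1}
        F, T = F_next, T_next
    return adjusted

# Example with a short weekly demand series
print(trend_adjusted_forecast([100, 120, 130, 90, 150, 160]))
```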


For this model, the Mean Absolute Deviation (MAD) is typically used as the metric of determination.

Dynamic Programming

We used the Wagner-Whitin algorithm to help decide the timing of the re-order point. The Wagner-Whitin algorithm divides the n-period optimization problem into a series of sub-problems; each sub-problem is solved and its solution is used in solving the next sub-problem. For example, if the decision maker wishes to plan reorder points over six periods, the model divides the 6-period problem into six single-period problems. In week 1, the decision maker has six options: order for all six weeks, order for the first five weeks, and so on, all the way down to ordering for just the first week. Choosing the option that minimizes the sum of holding cost and ordering cost is the best decision. The series of such recurrent decisions made for all six periods gives the overall minimum-cost path. Mathematically:

Step 1: Set t = 1 and z*_1 = 0.
Step 2: Set t = t + 1. If t > T + 1, stop. Otherwise go to Step 3.
Step 3: For all t' = 1, 2, ..., t − 1, compute the cost of ordering in period t' to cover demand through period t − 1:
c_{t',t} = A_{t'} + c_{t'}(D_{t'} + ... + D_{t−1}) + h_{t'}(D_{t'+1} + ... + D_{t−1}) + h_{t'+1}(D_{t'+2} + ... + D_{t−1}) + ... + h_{t−2}·D_{t−1}
Step 4: Compute z*_t = min_{t' = 1, 2, ..., t−1} { z*_{t'} + c_{t',t} }.
Step 5: Compute p*_t = argmin_{t' = 1, 2, ..., t−1} { z*_{t'} + c_{t',t} }, that is, choose the period t' that minimizes z*_{t'} + c_{t',t}.
Step 6: Go to Step 2.


The optimal cost is given by z*_{T+1}. The optimal set of periods in which ordering/production takes place can be obtained by backtracking from p*_{T+1}.
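The forward recursion above can be implemented in a few lines. Below is a minimal Python sketch of Wagner-Whitin under simplifying assumptions (a single fixed ordering cost A, a constant per-unit per-period holding cost h, and no unit purchase cost, which does not affect the ordering decision when all demand must be met); the function and variable names are illustrative, not those of our actual implementation.

```python
def wagner_whitin(demand, A, h):
    """Wagner-Whitin lot-sizing sketch.

    demand : list of period demands D_1..D_T
    A      : fixed ordering cost per order
    h      : holding cost per unit per period
    Returns (minimum total cost, sorted list of periods in which to order).
    """
    T = len(demand)
    z = [0.0] + [float("inf")] * T   # z[t] = minimum cost of covering periods 1..t
    p = [0] * (T + 1)                # p[t] = last ordering period in an optimal plan for 1..t
    for t in range(1, T + 1):
        for tp in range(1, t + 1):   # an order placed in period tp covers periods tp..t
            # units demanded in period j are carried for (j - tp) periods
            hold = sum(h * (j - tp) * demand[j - 1] for j in range(tp, t + 1))
            cost = z[tp - 1] + A + hold
            if cost < z[t]:
                z[t], p[t] = cost, tp
    # backtrack to recover the optimal ordering periods
    orders, t = [], T
    while t > 0:
        orders.append(p[t])
        t = p[t] - 1
    return z[T], sorted(orders)

# Example: six weekly demands, fixed ordering cost 100, holding cost 1 per unit per week
print(wagner_whitin([20, 50, 10, 50, 50, 10], A=100, h=1))
```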


Results

Based on the above model, we calculated the total cost of inventory management for different products and observed that we are able to save about 13.7% using the actual demand +/- a random value of 1 unit (which gives an accuracy rate of 85%). We can save up to 20% in inventory costs using the dynamic model developed. We also added a parameter to the model that penalizes the retailer for understocking. For example, previously only the opportunity cost was considered, but we incorporated the possibility of backordering, which would inflict 1.5 times the cost on the retailer, and this produced an additional saving of 20% from the optimization.

Conclusion

As discussed, forecasting and optimization go hand in hand when it comes to inventory optimization. Accurate forecasts lead to lower costs. Figure 4 shows the relationship we found when comparing the demand forecasting model's accuracy with the inventory decisions that would result from the model.

Figure 4: Model accuracy versus total costs ($)

In summary, we compared the results from our model for the three predicted demands to see which one performed better based on the accuracy of demand. We used the total cost incurred during the period under consideration, including the cost of stock-outs, as the metric to compare performance. We found that the client's predicted demand had the highest total cost associated with it due to poor accuracy. Our attempt to use the exponential smoothing technique to simulate predicted demand did not yield an accurate forecast. As the model with highly accurate predicted demand showed the least total cost, we suggest that the client improve the accuracy of their demand forecast, and thus realize better inventory performance.

References

Kun-Jen Chung, Leopoldo Eduardo Cárdenas-Barrón (2012). "The complete solution procedure for the EOQ and EPQ inventory models with linear and fixed backorder costs." Mathematical and Computer Modelling 55: 2151–2156.

Mark Ferguson, Vaidy Jayaraman, Gilvan C. Souza. "An Application of the EOQ Model with Nonlinear Holding Cost to Inventory Management of Perishables."

Elsayed, E. A., and C. Teresi (1983). "Analysis of Inventory Systems with Deteriorating Items." International Journal of Production Research, 21(4): 449-460.

Weiss, H. (1982). "Economic Order Quantity Models with Nonlinear Holding Costs." European Journal of Operational Research 9: 56–60.

Sridhar Bashyam, Michael C. Fu (1997). "Optimization of Inventory Systems with Random Lead Times and a Service Level Constraint." Management Science.

Ilkyeong Moon, Sangjin Choi (1994). "The Distribution Free Continuous Review Inventory System with a Service Level Constraint." Computers & Industrial Engineering, Vol. 27, Nos. 1-4, pp. 209-212.

James H. Bookbinder, Jin Yan Tan (1988). "Strategies for the Probabilistic Lot-Sizing Problem with Service-Level Constraints." Management Science, September.

L.A. San-José, J. Sicilia, J. García-Laguna (2015). "Analysis of an EOQ inventory model with partial backordering and non-linear unit holding cost." Omega 54: 147–157.

Wen-Yang Lo, Chih-Hung Tsai, Rong-Kwei Li (2002). "Exact solution of inventory replenishment policy for a linear trend in demand – two-equation model." International Journal of Production Economics 76: 111-120.


B. N. Mandal, S. Phaujdar (1989). "An Inventory Model for Deteriorating Items and Stock-dependent Consumption Rate." Journal of the Operational Research Society, Vol. 40, No. 5, pp. 483-488.

Rehena Begum, Sudhir Kumar Sahu, Rakesh Ranjan Sahoo (2010). "An EOQ model for deteriorating items with Weibull distribution deterioration, unit production cost with..."

Aris A. Syntetos, John E. Boylan (2008). "Demand forecasting adjustments for service-level achievement." IMA Journal of Management Mathematics 19: 175−192.

Chris Dubelaar, Garland Chow, Paul D. Larson (2001). "Relationships between inventory, sales and service in a retail chain store operation."

Thomas R. Willemain, Charles N. Smart, Henry F. Schwarz (2004). "A new approach to forecasting intermittent demand for service parts inventories." International Journal of Forecasting 20: 375–387.

G. Padmanabhan, Prem Vrat (1989). "EOQ models for perishable items under stock dependent selling rate."

Banker, Steve (2013). "Demand Forecasting: Going Beyond Historical Shipment Data." Retrieved from: https://www.forbes.com/sites/stevebanker/2013/09/16/demand-forecasting-going-beyond-historical-shipment-data/#70623c2c16fb

Rosenblum, Paula (2014). "Walmart's Out of Stock Problem: Only Half the Story?" Retrieved from: https://www.forbes.com/sites/paularosenblum/2014/04/15/walmarts-out-of-stock-problem-only-half-the-story/

Banker, Steve (2015). "Transportation and Inventory Optimization are Becoming More Tightly Integrated." Retrieved from: https://www.forbes.com/sites/stevebanker/2015/08/10/transportation-and-inventory-optimization-are-becoming-more-tightly-integrated/#4a311d0c608f


Marr, Bernard (2016). "How Big Data and Analytics Are Transforming Supply Chain Management." Retrieved from: https://www.forbes.com/sites/bernardmarr/2016/04/22/how-big-data-and-analytics-are-transforming-supply-chain-management/#5b63c62439ad

Fritsch, Daniel (2017). "Purchase Planning: Basics, Considerations, & Best Practices." Retrieved from: https://www.eazystock.com/blog/2017/03/09/purchase-planning-basics-considerations-best-practices/


The Evolution of the Open Source ERP iDempiere Community Network: A Preliminary Analysis

Zhengzhong SHI, Ph.D.
University of Massachusetts at Dartmouth, Charlton College of Business
North Dartmouth, MA 02747, USA
[email protected]

Hua SUN, Ph.D.
Department of Management Science and Engineering, School of Management, Shandong University
27 Shanda Nanlu, Jinan, P.R. China, 250100
[email protected]

Abstract

This research is a preliminary analysis of the evolution of the award-winning open source iDempiere ERP community network. Network-level features, such as the average degree, the degree centralization, the betweenness centralization, the network density, the number of members involved, and the number of ties formed, are preliminarily analyzed and compared for the community networks of 30 selected weeks. These time-based community networks are established using the community forum discussion threads, with the assumption that members participating in thread discussions form social ties. Overall, it is found that the community network is becoming sparser and more decentralized over the weeks. Future research should investigate the impact of the evolution of the community network on the exploitative and explorative innovations in the open source software project community (OSSPC).

Introduction

Open source software (OSS) development has been a valid and powerful model of software development for several decades. Many very successful OSS projects, such as the Apache Web Server, Apache Spark, and Apache Hadoop, have been developed. A recent research stream in the field of IS investigates the impact of the community social networks of OSS participants on OSS project development. For example, with data across many OSS projects, Singh, et al. (2011) introduced the social network perspective to the research of OSS project success. Their social network perspective focused on the structural aspect of the OSS community network and investigated the impact of the community's internal and external cohesion on project performance. Their definition of success for OSS projects is knowledge creation in terms of the number of CVS commits (representing the number of completed project modification requests). While they produced interesting findings, their research did not distinguish between exploitative and explorative activities in software development.


Temizkan and Kumar (2015), following Singh, et al. (2011), investigated the issue in more refined detail. Indeed, they separated the exploitation social network (i.e., the community network of those developers involved in patch development) from the exploration network (i.e., the community network of those developers involved in new feature requests), and their project success measures are similar to those in Singh, et al. (2011). They measured the community network using not only characteristics such as internal and external cohesion, but also network location and network decomposition. They found that the community networks of exploitation and exploration are indeed different and that network features have varied impacts on exploitation and exploration. More interestingly, their study incorporated the concept of ambidextrous developers, representing those who work on both exploitative and explorative tasks over time. While Temizkan and Kumar's (2015) study advanced this research stream on the impact of community networks on OSS performance by digging into exploitative and explorative community networks and incorporating the concept of ambidextrous developers, it is our belief that an evolutionary perspective is much needed to incorporate the time dimension into the community network structure analysis. This current study advances this stream of research by analyzing the evolution of the open source ERP iDempiere community network.

The Community Network of an OSS Project: Sample Measures

Members of an OSS project community (OSSPC) form social relationships through participating actively in forum discussions, developing and testing code collaboratively, and attending community conferences and other social events. Measures of a community network's structure include indices such as degree centrality and betweenness centrality at the individual member level and average network degree, density, degree network centralization, and betweenness network centralization at the network level (Freeman, 1978). The average degree and density may be used to represent the internal cohesion within the community, as they describe the amount of connection within the community. As there are more connections at one point in time and more repeated connections over time, internal cohesion is developed and nurtured. Internal cohesion facilitates social exchanges, nurtures trust development within an OSSPC, enables information and knowledge sharing, and streamlines collaborative efforts. On the other hand, a very high level of internal cohesion may inhibit participants from creative thinking for product innovations. Singh et al. (2011) indeed found that only a moderate level of internal cohesion is better for OSS project knowledge creation. In more refined detail, Temizkan and Kumar (2015) found that internal cohesion relates more positively to patch development (i.e., exploitation) than to feature requests (i.e., exploration). In general, the average degree and density, measuring internal cohesion, are important measures of the structure of an OSS project community social network.


Network location represents the centrality of OSSPC members at the individual level. An OSSPC member with a high level of centrality is positioned at a relatively central location in the community network, controlling information and knowledge flow and having a more complete picture of the structure of the network, which may facilitate task completion. Temizkan and Kumar (2015) found that the individual centrality index relates more to patch development (i.e., exploitation) than to feature requests (i.e., exploration). However, too much centrality may overload OSS project community members with redundant information and spread their energy too thin, which may produce inefficiencies.

At the network level, centralization indices are useful for analyzing macro network structures in terms of individual centrality variations. These measures may reveal the community network structure in an OSSPC. For example, Crowston and Howison (2005) compared the community network structures of large and small OSS projects, and Setia, Rajagopalan, Sambamurthy, and Calantone (2012) investigated how peripheral developers contribute to open-source software development. It is our belief that, with dynamic macro-level centralization measures of OSSPC community networks, further analysis of OSSPC innovations may be facilitated.

Research Method

Data Source

Both Singh, et al. (2011) and Temizkan and Kumar (2015) used data from the .com across many projects. Their studies count ties among developers by checking whether they worked on the same projects previously. The potential problem with this counting method for measuring community networks is that while two developers with social ties due to previously participating in the same OSS projects have a common background (such as using the same programming language and having a common understanding of the underlying embedded business processes), they may never have socialized with each other directly. Consequently, whether they truly establish a personal tie is in doubt. This possible lack of personal ties in the community networks used in previous studies may prevent important insights from being generated. To remedy this issue, this current research instead uses the community network data embedded in a particular OSSPC's forum. Specifically, the thread data from the iDempiere community forum[1] are used to establish the network among its members in order to analyze the evolution of the community network structure. The foundational assumption is that iDempiere community members involved in a thread discussion form an ad-hoc team and have opportunities to establish social ties through exchanging information and knowledge and collaborating on self-selected project tasks.

[1] "iDempiere Business Suite ERP/CRM/SCM done the community way. Focus is on the Community that includes Subject Matter Specialists, Implementers and End-Users. iDempiere is based on original / plus a new architecture to use state-of-the-art technologies like OSGi, Buckminster, zk6." (idempiere.org). Further, according to an experienced ERP consultant, "The ADempiere, iDempiere, and Compiere environments are amazingly similar. iDempiere came from ADempiere. ADempiere and Openbravo came from Compiere. Compiere came from Jorg Janke. Jorg came from Oracle. As a result, iDempiere and ADempiere have much in common with Oracle's ERP in terms of the financial feature set." (http://www.chuckboecking.com/blog/)

Data Collection

In October 2017, a Java program was developed to automatically download all the forum discussions (from 2011-05-19 15:13:22 to 2017-10-13 18:17:02) in the iDempiere community from its Google Groups site (https://groups.google.com/forum/?hl=en#!forum/idempiere). The data contain 17,098 messages in 4,040 threads across 294 different weeks over a duration of 335 weeks.

The Construction of the Thread-Based Social Network

The community network for the iDempiere project during a certain week can be initially constructed by connecting members based on whether they participate in the same thread discussions during that week. More specifically, for the tie between members A and B, the tie strength due to their participation in the discussions of Thread T in week W is calculated as

C_{AB(W)}(T) = (N_A / Sum) * (N_B / Sum) * (N_A + N_B) * 20,

where N_A and N_B are the numbers of messages posted by A and B in Thread T up to week W (with attenuated messages posted prior to week W by A and B incorporated), Sum is the corresponding total number of messages in the thread, and 20 is a coefficient that scales the tie strength to a range convenient for analysis. As a thread may extend over quite a few weeks, a pair of members may only participate in thread discussions for the first few weeks of the thread's existence. However, they should still have social ties, in an attenuated fashion, over the remaining weeks of the thread's existence. The impact on the strength of the A-B tie in week W of messages posted prior to week W by A and B in Thread T is attenuated using the formula N_i(V) / 10^{(W−V)*0.01}, where N_i(V) is the number of messages posted by member i in week V, member i can be member A or B, and W is greater than V. If the same pair of members participates in multiple thread discussions during the same week W, then the strengths of the social ties established across these different threads in week W are summed to give the tie strength for this pair of members in week W. Further, at the end of each thread, the social ties existing during the last week of the thread's existence should also influence the tie strength of the same pair of iDempiere community members over the next few weeks in an attenuated fashion. For example, suppose (A,B) has a tie strength of C_{AB(W)}(T) in week W due to Thread T. Assuming week W is the last week for Thread T, then C_{AB(W)}(T) will impact the (A,B) tie strength during week N after W (N > W). The formula used is C_{AB(N)}(T) = C_{AB(W)}(T) / 10^{(N−W)*0.1}.
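To illustrate the construction, here is a minimal Python sketch of the per-thread tie-strength calculations described above. The function names, the reading of the attenuation factors as powers of ten, and the interpretation of Sum as the thread's total message count are our own illustrative assumptions, not code or definitions from the study.

```python
def tie_strength(n_a, n_b, total_messages, scale=20):
    """Tie strength C_AB(W)(T) between members A and B for one thread in week W.

    n_a, n_b       : (attenuated) message counts of A and B in the thread up to week W
    total_messages : total message count used for normalization ("Sum" above, an assumption)
    """
    return (n_a / total_messages) * (n_b / total_messages) * (n_a + n_b) * scale

def attenuated_count(n_v, week_w, week_v):
    """Attenuate a count of messages posted in week V when evaluated in week W > V."""
    return n_v / 10 ** ((week_w - week_v) * 0.01)

def post_thread_decay(c_last, week_n, week_w):
    """Decay of a thread's final-week tie strength over later weeks N > W."""
    return c_last / 10 ** ((week_n - week_w) * 0.1)

# Example: A posted 3 messages and B posted 2 messages in a 10-message thread this week
c = tie_strength(3, 2, 10)
print(round(c, 3), round(post_thread_decay(c, week_n=5, week_w=3), 3))
```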


Obviously, if there is only one member in a thread, then we cannot capture any social relationships established through this thread, even though in reality there may be some relationships stimulated by this single-member thread and established through other channels (such as private emails and direct contacts).

Results

Weeks 1 and 2 had only 2 members participating in thread discussions in the iDempiere community forum. It was not until week 36 that more members started to participate in forum discussions, so all the following figures display data from week 36 onward. Data points are selected from the available 294 weeks of data for this preliminary analysis. Table 1 below shows the iDempiere community network measures for 30 selected weeks.

Table 1: iDempiere Community Network Measures (for 30 selected weeks)

Week Number | Network Centralization (Degree) | Network Centralization (Betweenness) | Density
36 | 0.300 | 0.040 | 0.400
44 | 0.767 | 0.878 | 0.286
54 | 0.418 | 0.357 | 0.197
65 | 0.500 | 0.300 | 0.267
75 | 0.353 | 0.326 | 0.135
85 | 0.498 | 0.466 | 0.182
95 | 0.423 | 0.447 | 0.074
105 | 0.301 | 0.426 | 0.123
115 | 0.354 | 0.342 | 0.116
125 | 0.399 | 0.229 | 0.096
135 | 0.445 | 0.399 | 0.071
145 | 0.401 | 0.487 | 0.064
155 | 0.389 | 0.369 | 0.106
165 | 0.311 | 0.389 | 0.068
175 | 0.220 | 0.197 | 0.073
185 | 0.211 | 0.225 | 0.059
205 | 0.468 | 0.375 | 0.075
215 | 0.451 | 0.346 | 0.085
235 | 0.305 | 0.370 | 0.054
245 | 0.359 | 0.436 | 0.072
255 | 0.479 | 0.583 | 0.051
265 | 0.379 | 0.490 | 0.045
275 | 0.236 | 0.283 | 0.053
285 | 0.246 | 0.345 | 0.047
295 | 0.151 | 0.381 | 0.055
305 | 0.279 | 0.267 | 0.052
315 | 0.283 | 0.328 | 0.049
325 | 0.299 | 0.303 | 0.043
330 | 0.312 | 0.438 | 0.066
335 | 0.198 | 0.367 | 0.075
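Network-level measures like those in Table 1 can be computed with standard social network analysis tooling. Below is a minimal Python sketch using the networkx library (our own illustrative choice; the study's graphs were drawn with Pajek) that computes density, average degree, and Freeman-style degree and betweenness centralization for a toy weekly graph. The normalization against a star graph is one common way to obtain the Freeman centralization denominator.

```python
import networkx as nx

def centralization(centrality: dict, n: int, measure) -> float:
    """Freeman-style network centralization: observed variation in individual
    centrality divided by the maximum possible variation (attained by a star graph)."""
    c_max = max(centrality.values())
    observed = sum(c_max - c for c in centrality.values())
    star = nx.star_graph(n - 1)                 # a star on n nodes maximizes centralization
    star_c = measure(star)
    star_max = max(star_c.values())
    theoretical = sum(star_max - c for c in star_c.values())
    return observed / theoretical if theoretical > 0 else 0.0

# Toy weekly network: edges between members who co-posted in a thread that week
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

n = G.number_of_nodes()
density = nx.density(G)
avg_degree = sum(dict(G.degree()).values()) / n
deg_centralization = centralization(nx.degree_centrality(G), n, nx.degree_centrality)
btw_centralization = centralization(nx.betweenness_centrality(G), n, nx.betweenness_centrality)

print(f"density={density:.3f}, avg degree={avg_degree:.2f}, "
      f"degree centralization={deg_centralization:.3f}, "
      f"betweenness centralization={btw_centralization:.3f}")
```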


Figure 1 shows the number of community members participating in thread discussions and the number of social ties they established over the weeks. As the trend lines demonstrate, both the number of members and the number of social ties established were increasing over time, and their patterns of growth match each other, even though the number of social ties was growing and varying more dramatically.

Figure 1: Number of Members (Y1) and Ties (Y2) Over Time (X in weeks). Trend lines from the figure: y = 0.2692x + 35.185 (R² = 0.2987) and y = 0.1702x + 14.917 (R² = 0.6235).

Figure 2 illustrates that the average degree of the social networks for the iDempiere community is slightly trending up. This trending-up of the average degree indicates the joint effects of two factors. The first is that as more members join the community, most new members have very few connections and indeed start to work at the periphery tier of the community network. The second, on the other hand, is that existing, more experienced members may have developed more social ties. The impacts of new members and of existing, more experienced members may trade off against each other. Indeed, the R square for the trend line is only 0.0112, implying that this trend may not be significant and that more analysis is needed in the future.

Figure 2: Average Degree (Y) Over Weeks (X). Trend line: y = 0.0012x + 3.2282 (R² = 0.0112).


Figure 3 represents the social network topology in week 335, generated by the social network analysis software Pajek (Nooy, Mrvar, and Batagelj, 2005). It is clear that there is one core member (Hiep Lq), two first-tier members (including RDC and Carlos), multiple third-tier members (including Rheine, Joseph, msw, and redhuan), and many fourth-tier and fifth-tier (i.e., periphery-tier) members.

Figure 3: Week 335 social network graph

Figure 4: Density, Degree Network Centralization, and Betweenness Network Centralization over weeks. Trend lines from the figure: y = -8E-05x + 0.3886 (R² = 0.0032), y = -0.0007x + 0.4888 (R² = 0.3029), and y = -0.0005x + 0.1935 (R² = 0.3889).


Figure 4 illustrates the network-level indices, including network density, degree network-level centralization, and betweenness network-level centralization. Overall, the betweenness network centralization is slightly trending down, but the trend may not be significant as the R square is only 0.0032. This means that the variation of betweenness centrality at the individual level stays stable over the weeks. This may be explained by looking at the dynamics of the tier structure of the community network. On the one hand, as more new members join the community and participate in thread discussions, these new members start as periphery members with very low betweenness centrality. These new members' low centrality is likely to increase the variation of betweenness centrality (i.e., the network betweenness centralization index). On the other hand, as members in the mid-tiers (2nd, 3rd, 4th, etc.) actively participate in forum discussions and establish more connections with existing and new members, their betweenness centrality may indeed increase, and the centrality difference between top-tier (core and 1st-tier) members and the mid-tier members may decrease. With the two forces acting against each other in a balanced manner, the overall network-level betweenness centralization index may stay stable over the weeks.

As to the degree network-level centralization, it is slightly trending down, with an R square of 0.3029, implying that with more new members joining the community, the variation in the number of connections per member goes down on average. On the one hand, new members mostly play periphery roles with few connections to start with, and they have low individual degree centrality. This factor is likely to increase the variation in individual centrality for the community network. On the other hand, similar to the reasoning for the betweenness centralization, as members in the mid-tiers (3rd, 4th, etc.) actively participate in forum discussions and establish more connections with existing and new members, their degree centrality is indeed likely to increase, and the centrality difference between top-tier (core and 1st-tier) members and the mid-tier members may decrease. With the two factors acting together, and the influence of the middle-tier members' individual centrality apparently being stronger, the overall network-level degree centralization index goes down. Consequently, the community network overall becomes more decentralized, and this observation implies that the middle tiers of the community network significantly influence the community network structure.

As to the community network density, it is trending down over the weeks. Clearly, the number of connections established over time in the iDempiere community cannot keep pace with the increasing number of members and the correspondingly larger number of possible ties under full connection. Overall, the community network is becoming sparser. The implication could be that experienced top- and middle-tier members may not keep up with requests and questions from new members at the periphery tiers, and the attractiveness of the community may be negatively impacted.

Future research should analyze data from all available weeks to further investigate the evolution of the iDempiere community network. Key individual members should be tracked constantly, and the degree to which they play the ambidextrous role in the community is an interesting issue to study. As these key members represent the major community resources, and the degree to which they play the ambidextrous role may manifest the outcome of the internal resource allocation mechanism, studying this issue may help uncover the key factors of OSS project success. More importantly, the evolution of the tier structure in the community network deserves more attention. Further, the issue of how this tier structure impacts the exploitative and explorative innovations in the iDempiere community is also very interesting to study. A better understanding of this issue may be very helpful for discovering what makes OSSPCs viable and successful.

Conclusion

This current research is a preliminary analysis of the evolution of the award-winning open source iDempiere ERP community network, with a focus on the analysis of network-level measures. The average degree, degree centralization, betweenness centralization, density, the number of members involved, and the number of ties formed in the iDempiere community for 30 selected weeks are preliminarily compared and analyzed. Overall, it is found that the community network is becoming sparser and slightly more decentralized over the weeks. The current study, with its time series data, extends the research stream of applying the social network perspective to study the phenomenon of OSS development. Future research should investigate the impacts of community networks on OSSPC exploitative and explorative innovations. Further, by tracking key members in the community and applying the concept of the ambidextrous developer from Temizkan and Kumar (2015), further research will broaden our understanding of OSSPC evolution and success.

References

1. Singh, P. V., Tan, Y., & Mookerjee, V. (2011). Network effects: The influence of structural capital on open source project success.
2. Temizkan, O., & Kumar, R. L. (2015). Exploitation and exploration networks in open source software development: An artifact-level analysis. Journal of Management Information Systems, 32(1), 116-150.
3. Freeman, L. C. (1978). Centrality in social networks: Conceptual clarification. Social Networks, 1(3), 215-239.
4. Nooy, W. D., Mrvar, A., & Batagelj, V. (2005). Exploratory Network Analysis with Pajek. Cambridge: Cambridge University Press.
5. Setia, P., Rajagopalan, B., Sambamurthy, V., & Calantone, R. (2012). How peripheral developers contribute to open-source software development. Information Systems Research, 23(1), 144-163.
6. Crowston, K., & Howison, J. (2005). The social structure of free and open source software development. First Monday, 10(2).


Future LPG Shipments Forecasting Based on Empty LPG Vessels Data

Jou-Tzu Kao, Rong Liao, Hongxia Shi, Joseph Tsai, Shenyang Yang, Matthew A. Lanham
Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907
[email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

ABSTRACT

This study assesses the feasibility of using information on empty liquefied petroleum gas (LPG) carrier vessels that are moving in the ocean to predict how much LPG will be shipped in the future. As the prices of multiple commodities fluctuate with the supply and demand of LPG, it is crucial to identify effective indicators of LPG commodity flow to foresee future trends in the market. To conduct this analysis, we acquired access to the shipping schedule data of empty and full LPG vessels and ran multiple types of regressions to understand the correlation between these two factors. Our analysis shows that we were able to build a valid and usable predictive model using ridge regression to predict the amount of LPG being shipped in the future.

Keywords: Predictive Analytics, Liquefied Petroleum Gas, Supply Forecast


INTRODUCTION

The main question that our study will answer is whether the flow of empty shipping movements can serve as an indicator of future LPG commodity shipments. The ability to track and predict commodity movements can have a phenomenal impact in the business world. In general, the prices of commodities are decided by both supply and demand. Having insight into the flow of goods can allow businesses to better predict price changes and take advantage on both ends. For example, if oil producers know that less oil will be ordered for shipment in the near future, they can communicate and control the amount of production before the products turn into excess inventory and they are forced into price wars with other companies. On the other hand, if oil traders know there will be less oil provided in the future, they can formulate strategies to acquire a sufficient amount in advance of the incremental changes in price. What we focus on in this study is the need to predict future LPG shipments, and the information required on vessel movements.

1. The Need for Predicting Future Shipments

Whether companies build, buy or sell commodities, they normally rely on multiple sources of information, with great effort, to try to make the correct decisions. They utilize pieces of available information such as sales history and news. However, people are always in search of better indicators that can more quickly and precisely predict the future. This is where the value of data science and our study comes in. We are exploring the potential of a new source of information: vessel shipping flows. When companies need to charter, service or trade in commodities, better and quicker information can help them earn or save millions of dollars. Utilization of shipping forecasts has been a key to success in multiple industries. Shipping industry professionals and commodity traders use the forecasts to optimize their investment, planning and market positioning decisions (IHS Markit 2016). Forecasting in shipping is especially important in the liquefied petroleum gas (LPG) market, because there is a lot of variation in the demand and price of LPG. Figure 1 provides an example of the variation in price and volume from 2014 through 2017.

Figure 1. LPG price fluctuations in the US from 2014 to 2017


According to an article published by Joeri van der Sman (2017), the market is currently in a downward state because transport vessels are supplying beyond demand. What is interesting is that seaborne LPG trade is projected to see impressive growth in demand in the near future. Demand forecasts are also very important in the trading and investment markets. One major differentiator between good and bad investors or commodity traders is how well they can find reliable information and turn it into actionable knowledge. By being able to forecast the future trend of a commodity such as LPG, traders can anticipate and take advantage of the demand and future price changes of their assets. Misleading forecasts can result in poor decision-support and lead to severe crises in all industries. China, which produces almost half of the world's steel, says it is pursuing painful reforms, but that the glut is at least partly the result of weaker global demand (Paul Page 2017). In fashion markets, inventory management also plays a huge role. After years of struggling to manage their inventory, vendors are using increasingly sophisticated tools to track apparel and accessories through the supply chain, aiming to avoid the typical post-Christmas fire sales (Stephanie Wong 2017).

2. The Sufficient Amount of Precise Information on Vessel Movements

It is important today for commodity traders to dig deeper into shipping flows as a sufficient amount of shipping data becomes available. Maritime transportation is known for having rich information in terms of volume. Most of the transportation information is utilized in supporting the global commodity supply chain. Maritime surveillance data are increasingly used to achieve a high-level picture of situational awareness. Cooperative self-reporting vessel location systems, including the Automatic Identification System (AIS) and the Long-Range Identification and Tracking (LRIT) system, provide a great amount of information about the vessels at port and at sea. This study utilizes the power of data analytics to take full advantage of the shipment data and explore the possibility of using this information to predict future LPG demand, and ultimately to predict the price of the commodity.

The remainder of this paper is organized as follows: The next section reviews the literature on the various criteria and methods used for vessel information detection, and discusses and evaluates several practices that people use to forecast future shipments and commodity prices. The data used in this analysis are then introduced, followed by the proposed methodology along with its assumptions and difficulties. Next, various models are formulated, tested and evaluated, and the performance of the models and the prediction results are summarized. The paper concludes with a discussion of the business insights drawn from this study, future research directions, and concluding remarks.

LITERATURE REVIEW

In the pursuit of predicting the future amount of LPG shipments, three logical questions need to be addressed: Are empty ships a good indicator of future demand? If so, how can the information be used to predict how much LPG will be shipped in the future? Lastly, how do the future shipments define the demand?

1. Vessel Shipment Detection

Studies on vessel surveillance, tracking, and prediction have been conducted from various perspectives based on the data collected. Some have researched ship tracking using visual/image data, for example, how to track ships via visualizations (Robert-Inacio, Raybaud et al. 2007), visual classification, visual ship counting (Chen, Chen et al. 2008), and simultaneous detection and tracking (Hu, Yang et al. 2011). Chen, Chen et al. (2008) found that when trying to count ships using image data, ships in regions where wave ripples were prevalent yielded inaccurate results. Moreover, wave ripples lead to inaccurate ship detection, leading to mistakes in ship classification. Fei, Qing et al. (2014) investigated video surveillance, specifically Closed-Circuit Television (CCTV) video sequences, to track a ship inland using a Kalman filter methodology. They found their approach was better than previously known approaches, which suffered from cluttered backgrounds and occlusion.

The information used as predictors in this study is the number of empty vessels and the tonnage capacity that each vessel can hold. When a shipment is at sea, the tonnage differs depending on the load of the shipment; based on this information, the empty vessels can be detected. One problem is that it cannot always be known whether a vessel is truly empty, and it is challenging to detect the volume of the load. Using the volume as an indicator would be more accurate for prediction than the number of shipments, because different vessels have different loading capacities. Future research is needed to address this problem.

2. Predictive Modelling

The studies found that are believed to be most related to this project entail trying to predict or understand future behavior or movement between successive AIS points (Hammond 2014). One of the well-known challenges in trying to predict future patterns of ships is the massive amount of information available. There are thousands of ships and millions of tracking points. This is overwhelming information for a human to process and synthesize in an effective manner. Riveiro (2011) points this out and addresses this problem by developing a visual analytics approach. Pallotta, Vespe et al. (2013) developed a methodology called Traffic Route Extraction and Anomaly Detection (TREAD) that uses an unsupervised and incremental learning approach on AIS data to detect path anomalies and project current and future trajectories. As noted in their paper, some have tried to subdivide the area(s) of interest into spatial grids, whose cells might be "characterized by motion properties of the crossing vessels" (Vespe, Sciotti et al. 2008). The potential problem with grid-based approaches for small-area surveillance is that they are hard, if not impossible, to scale. Another important consideration for path prediction is understanding what are referred to as Course Over Ground (COG) turning points. Essentially, these areas would be like intermediate nodes in a supply chain network, while entry and exit points are other nodes. Using an Ornstein-Uhlenbeck stochastic process, Pallotta, Horn et al. (2014) improve their vessel predictions by using historical AIS data to estimate parameters that are essential characteristics of recurrent routes. In other words, they exploit prior knowledge to predict the position of vessels with greater confidence.
Most of the research conducted to date tries to detect the future trajectories or spatial patterns of shipments.


Much less work has been performed to predict the future number of shipments by analyzing historical data. Spatial projections are important in maritime transportation control, but traders are more interested in how much of the commodity of interest will be transported in the future. Therefore, the focus of our study is to predict shipment patterns in terms of different time grids, i.e., to conduct time-series analysis on the shipment data. By grouping the shipments into different time grids, it is possible to find the best time frame for future predictions. The accuracy of the prediction can be improved by incorporating extra information into the time-series model, such as exponential smoothing and seasonality.

3. Demand Forecasting

Jan Tore Klovland (2002) points out that there is a close timing relationship between cycles in economic activity, shipping rates and commodity prices, but the relationship is very complex. To understand the theory behind it, people need to have profound economic knowledge and empirical experience. Instead of relying on theory, the method proposed in this paper is based entirely on facts available to the public. When a vessel receives a mission to transport commodities, it may start its voyage to a loading port. The vessel is then expected to carry commodities to the destination port. The number of empty shipments at sea at one point in time may indicate the supply of the commodities at a future time point, when the shipments arrive at the destination port. Further, the supply, or availability, of the commodity affects the price of the commodity in that specific market. Based on this series of assumptions, this study investigates the relationship between these drivers by interpreting the shipment data.

As this handful of studies demonstrates, recognizing shipment time patterns and predicting future vessel behavior are challenging problems. The key objective of this study is to manipulate the available vessel data, find relationships and patterns among the vessels and shipments, and develop a predictive model that can provide more accurate results and thoughtful insights into what will happen in the future. The key performance measure is how reliable the conclusions are in helping traders to uncover maritime market opportunities and make investment decisions.

DATA

For this study, we collaborated with an industry partner that mines this publicly available data. One table, called Voyage_lpg, provided information about the full ships, including vessel names and the date and position of departure and port arrival. The Voyage_lpg_ballast table has similar information, but it is about empty ships, as shown in Table 1. We created a more meaningful feature from this data called the average total empty vessels' capacity, which is obtained by aggregating the empty vessels' capacity from 27 days ago to 21 days ago relative to a specific date, as shown in Table 2. The justification for this time window came from trial and error and extensive exploratory data analysis.
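As an illustration of the feature construction just described, here is a minimal pandas sketch that aggregates empty-vessel capacity over the 27-to-21-days-prior window for each date. The column names (date_load, capacity) and the choice to sum capacity per day are illustrative assumptions; the study's actual table schema is only partially described above.

```python
import pandas as pd

def avg_total_empty_capacity(ballast: pd.DataFrame, dates: pd.Series) -> pd.Series:
    """For each target date, average the empty-vessel capacity loaded 21-27 days earlier.

    ballast : one row per empty (ballast) voyage, with 'date_load' and 'capacity' columns
    dates   : the target dates for which the feature is computed
    """
    daily = (ballast
             .groupby("date_load")["capacity"]
             .sum()
             .asfreq("D", fill_value=0))           # daily totals, zero-filled for missing days
    feature = []
    for d in pd.to_datetime(dates):
        window = daily.loc[d - pd.Timedelta(days=27): d - pd.Timedelta(days=21)]
        feature.append(window.mean())              # average over the 7-day window
    return pd.Series(feature, index=dates, name="average_total_empty_tonnage")

# Example with toy data
ballast = pd.DataFrame({
    "date_load": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05"]),
    "capacity": [35000, 42000, 38000],
})
print(avg_total_empty_capacity(ballast, pd.Series(pd.to_datetime(["2017-01-25"]))))
```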


Table 1: Variables from raw data

Table 2: Data used for prediction models

METHODOLOGY

Predictive modeling-type project

Our target is to determine whether there is any relationship between the demand for oil and information on empty ships three weeks prior. Figure 2 illustrates the overall methodology flow chart of our study. In what follows, we describe how we create features, train models, evaluate models, and generate decision-support.

Figure 2. Flow chart of the project


Identifying significant features

We used correlation analysis to identify the most relevant features and reduce redundancy. The most significant feature is the average_total_empty_tonnage, with a correlation coefficient of 0.88. As shown in Figure 3, there is a significant relationship between the total tonnage shipped today (y-axis) and the average_total_empty_tonnage (3 weeks prior). We then performed regression analysis to model the relationship between the two variables (i.e., the total demand tonnage and the average tonnage of empty ships 21 to 27 days before).

Figure 3. Scatter plot of the average total empty tonnage 21~28 days ago against the total tonnage shipped today. Pearson's correlation between the two sets of variables is 0.88, indicating a high correlation.

Our predictive modeling experiments followed a 5-fold cross-validation scheme, where the data is partitioned into 80% training and 20% test sets 5 times in total. In the end, the average performance over the splits was obtained. The justification for this design is to obtain a more reliable estimate of error than using a single validation (or holdout) set, as well as to obtain a more robust model. R-squared and the absolute error ratio were the statistical performance measures we used, which are common for regression-type problems. We discovered that there was a strong linear relationship between LPG demand and empty vessel capacity, so we incorporated the latter as a predictor in our regression model. We were not sure exactly why this relationship exists or whether it is even causal; we expect that domain experts would have much more insight in this area than we do. Our goal was simply to build a prediction model as accurate as possible using the available information. With the prediction model, we could check what it would bring to the company and choose the optimal model for prediction. Specifically, we compare the total cost with prediction against the total cost without prediction, so we can see whether this model can benefit a trader.

Regression models

In this project, we selected and compared the performance of three regression models: a linear regression model, a ridge regression model, and a support vector regression (SVR) model.
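Before describing each model in turn, the following is a minimal scikit-learn sketch (our own illustrative choice of library; the paper does not state its tooling) of the cross-validated comparison of the three regression models and the grid search over hyper-parameters described in this and the next subsections. The synthetic data and the specific parameter grids are placeholders, not the study's actual data or settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

# Placeholder data: X = average_total_empty_tonnage (3 weeks prior), y = total tonnage shipped
rng = np.random.default_rng(0)
X = rng.uniform(1e5, 5e5, size=(300, 1))
y = 0.9 * X[:, 0] + rng.normal(0, 2e4, size=300)

# 80/20 split; the 80% training portion is used for 5-fold cross-validation / grid search
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": (LinearRegression(), {}),
    "ridge": (KernelRidge(), {"alpha": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}),
    "svr": (SVR(), {"C": [1.0, 10.0], "kernel": ["linear", "rbf"]}),
}

for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, cv=5, scoring="r2")   # 5-fold grid search on training data
    search.fit(X_train, y_train)
    test_r2 = search.score(X_test, y_test)                   # held-out 20% test performance
    rmae = np.mean(np.abs(y_test - search.predict(X_test)) / y_test)  # relative MAE
    print(f"{name}: CV R2={search.best_score_:.3f}, test R2={test_r2:.3f}, rMAE={rmae:.2%}")
```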


The linear regression model was selected as a baseline for our study, since it is the simplest and most- widely used regression model. This model can be described with the following equation: where x i represents the input variables, y i represents the target variable, β represents the coefficients for each input variable, and finally ε i is the residual error. In this model, the training process consists of estimating the coefficient β given a set of input-output pairs (x i ,y i ) using ordinary least squares (OLS). There are no hyper-parameters to tune in the linear regression model. The advantage is that it is very simple to use and usually will not lead to overfitting (because it is too simple). The disadvantage is that it cannot model non-linear relationships between input and output and thus will suffer from higher bias. The ridge regression model was selected since it introduces a regularization term to the linear regression model. One potential problem with linear regression is that it does not provide any mechanism to prevent overfitting issues. Ridge regression’s regularization term penalizes overly complex models and helps obtain a better bias-variance tradeoff. With such capability, the model can prevent overfitting and can be better generalized to innovative/unseen data points. Additionally, we can apply the kernel trick to ridge regression so that non-linear relationships can be modelled. The tuned parameters in ridge regression include the regularization term, as well as the optimized kernel type (linear, poly, or Gaussian). The formula for ridge regression is shown below: λ: is the tuning parameter that determines the effect of the impact of the penalty term λ∑ p j=1 β j 2 . As λ increases, the impact of the penalty term grows; when λ is 0, the ridge regression parameter coefficients are the same as those generated using ordinary least squares in the linear regression model. The Support Vector Regression (SVR) is a more advanced regression model. It uses support vectors to keep a max distance between decision margin and the closest data points. Such mechanism would make the model more robust to noise, thus increasing its generalization capability. The advantage is that it can model complicated relationship in the data, while the disadvantage is that it is overly complicated, therefore the training is going to take longer, and it requires more data for the training to converge. The SVR model has been widely used in many domains, including supply chain prediction (Guanghui, Wang et al), flight control prediction (Shin, Jongho, et al), and tourism demand prediction (Chen, Kuan-Yu et al). The SVR model has been found to achieve the best performance in diverse domains, therefore, we

Grid-search for hyper-parameter tuning

Most regression models have hyper-parameters that can be tuned to achieve the best performance. For example, the SVR model offers several kernels to choose from (i.e., linear, polynomial, Gaussian, etc.). Selecting the most appropriate hyper-parameters allows a model to achieve its best performance, while a poor selection leads to poor performance. In our case, we implemented a k-fold grid-search technique to find the best hyper-parameters for each model. The entire dataset was separated into 80% training data and 20% testing data. The 80% training data was further split into five folds (16% of the data in each fold, denoted a1~a5). The model was trained on a1~a4 and validated on a5, and this process was repeated five times across a1~a5. The hyper-parameters generating the best average performance on the held-out validation folds were chosen as the optimal ones, and the model was then retrained with the best-found hyper-parameters. This trained model was then tested on the 20% testing data to obtain the final performance.

Optimization modeling-type project

With the demand prediction model, we can estimate the future demand for LPG. These predictions can help a company manage the tonnage of LPG it delivers, so that it neither oversupplies nor undersupplies the market and thereby avoids losses on supplying LPG.

RESULTS

To measure the performance of a regression model, we used the relative Mean Absolute Error (rMAE) on the testing data as the most important metric. The rMAE is defined as follows, where y is the actual value and pred_y is the predicted value (averaged across the test observations):

rMAE = |y - pred_y| / y

The smaller the rMAE, the better the regression model predicts future tonnages. We found that ridge regression performs best, with an average rMAE of 5.20%. Linear regression and SVR performed slightly worse, with average rMAE values of 5.83% and 5.79%, respectively. Ridge regression also reached the highest R-squared value on the testing dataset, and the p-value for the estimated parameter coefficient was less than 0.001, indicating strong evidence that the feature was associated with the response. This indicates that ridge regression best describes the variations and trends in our dataset and should therefore be used in the real system to provide LPG tonnage predictions for decision-making guidance. Table 3 displays the performance of each model on the training and testing data.

Model    R-squared (train)   R-squared (test)   Error ratio (test)
Linear   0.778               0.751              5.93%
         0.773               0.771              5.85%
         0.774               0.764              5.83%
         0.773               0.769              5.76%
         0.773               0.771              5.76%
  mean   0.7742              0.7652             5.83%
Ridge    0.824               0.798              5.16%
         0.819               0.818              5.13%
         0.824               0.800              5.29%
         0.822               0.807              5.20%
         0.822               0.809              5.20%
  mean   0.8222              0.8064             5.20%
SVR      0.777               0.748              5.91%
         0.771               0.770              5.80%
         0.773               0.763              5.78%
         0.771               0.768              5.71%
         0.771               0.769              5.73%
  mean   0.7726              0.7636             5.79%

Table 3: Model selection summary

Figure 4 shows the actual total tonnage values and the predicted tonnage values. As we can see, there is an average training R-squared (R²) score of 0.822 and an average rMAE of 5.20% on the test split. The regression model can accurately predict the total tonnage using the proposed methods.

Figure 4. Plot of the actual total tonnage shipped (blue lines) and the predicted tonnage shipped using the tonnage from three weeks earlier (red lines). As shown, the regression model can accurately model the variance in tonnage changes.
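The tuning and evaluation procedure described above can be sketched as follows. This is a minimal illustration assuming a Python/scikit-learn implementation (the paper does not state its tooling); the estimator, parameter grid, and placeholder data are assumptions, while the 80/20 split, the five folds, and the rMAE definition follow the text.

# Sketch of the 80/20 split, 5-fold grid-search, and rMAE evaluation described above.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.kernel_ridge import KernelRidge

def rmae(y_true, y_pred):
    """Relative mean absolute error: |y - pred_y| / y, averaged over observations."""
    return np.mean(np.abs(y_true - y_pred) / y_true)

rng = np.random.default_rng(0)
X = rng.random((200, 1))                         # placeholder feature: lagged empty-vessel capacity
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.random(200)  # placeholder positive target: tonnage shipped

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0],             # regularization strength (lambda)
              "kernel": ["linear", "polynomial", "rbf"]}   # candidate kernel types
search = GridSearchCV(KernelRidge(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)                     # 5-fold validation within the 80% training split

best_model = search.best_estimator_              # refit on the full training split
print("best hyper-parameters:", search.best_params_)
print("test R-squared:", best_model.score(X_test, y_test))
print("test rMAE:", rmae(y_test, best_model.predict(X_test)))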


The four-in-one plot for the regression is displayed in Figure 5 for regression diagnostic purposes. As shown, all the assumptions of a linear model have been satisfied, justifying the use of the regression model.

Figure 5. Four-in-one plot for the ridge regression model. The regression assumptions have been met: the errors are normally distributed, as shown in the histogram of residuals and the probability plot, and there does not appear to be a major issue of heteroskedasticity, so the constant-variance assumption is not violated.

FUTURE WORK

At the current stage, we use only the past three weeks' average empty-vessel capacity as the independent variable in our model. After further exploring data relationships and patterns, we may include additional variables in the model. For example, segregating vessels based on different features (e.g., annual total tonnage of LPG, route, vessel size) might improve the prediction of future LPG amounts.

Route and Schedule Per Vessel

By identifying the time sequence of each vessel (by merging on the date_load field from the empty-vessel records and the date_depart field from the full-vessel records), we are able to produce a full operating schedule for each vessel. For example, Table 4 shows the schedule of vessel no. 50 in 2014; a minimal sketch of this merge, under assumed column names, follows.
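The sketch below assumes pandas and hypothetical column names (vessel_id, date_load, date_depart, location_code); the paper does not give the exact schema, so these names are illustrative only.

# Hypothetical sketch of building a per-vessel operating schedule from empty-vessel and
# full-vessel records; the column names are assumptions, not the paper's actual schema.
import pandas as pd

empty = pd.DataFrame({"vessel_id": [50, 50],
                      "date_load": pd.to_datetime(["2014-01-05", "2014-02-10"]),
                      "location_code": ["A", "B"]})
full = pd.DataFrame({"vessel_id": [50, 50],
                     "date_depart": pd.to_datetime(["2014-01-07", "2014-02-12"]),
                     "location_code": ["A", "B"]})

# Stack loading and departure events into a single time-ordered schedule per vessel
events = pd.concat([
    empty.rename(columns={"date_load": "date"}).assign(event="load_empty"),
    full.rename(columns={"date_depart": "date"}).assign(event="depart_full"),
])
schedule = events.sort_values(["vessel_id", "date"]).reset_index(drop=True)
print(schedule[schedule["vessel_id"] == 50])   # operating schedule of vessel no. 50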


Table 4: Shipment schedule of vessel no. 50 in year 2014

Given the above schedule and the related location codes, we can develop strategies to optimize operations based on business needs.

CONCLUSIONS

The ability to predict future commodity flows of LPG is very beneficial to companies on both the supply side and the demand side. With better predictions, companies can optimize their operations to either save costs or earn more profit. The model we developed can help predict the amount of LPG being shipped three weeks after the detection of empty moving vessels. This has the potential to be useful for commodities traders, who seek to glean as much information on supply and demand as they can to improve their investment strategies. In the future we can incorporate other features, such as seasonality, region, and consumer behavior, to make the model more robust. To sum up, this exploratory step in utilizing shipping information has returned valid results; further refinements will continue to make the predictive model stronger and more flexible to industry requirements.

REFERENCES

IHS Markit (2016). "Maritime & Trade Fleet Capacity Forecast." https://www.ihs.com/products/fleet-capacity-forecast.html

Paul Page (2017). "Today's Top Supply Chain and Logistics News From WSJ." The Wall Street Journal. https://www.wsj.com/articles/todays-top-supply-chain-and-logistics-news-from-wsj-1512128205


Joeri van der Sman (2017). "Major Turnaround In LPG Shipping By 2018?" Seeking Alpha. https://seekingalpha.com/article/4079722-major-turnaround-lpg-shipping-2018

Stephanie Wong (2017). "Christmas Becomes Inventory War Game for US Fashion Brands." Bloomberg. https://www.businessoffashion.com/articles/news-analysis/christmas-becomes-inventory-war-game-for-us-fashion-brands

Akshay, M., et al. (2004). State-of-the-practice in freight data: A review of available freight data in the U.S. Center for Transportation Research, The University of Texas at Austin.

Yan, T. (2002). Business cycles, commodity prices and shipping freight rates: some evidence from the pre-WWI period. SNF project no. 1312, Globalization, economic growth and the new economy.

Chen, T., et al. (2008). Ship-flow analysis and counting system based on image processing. Proceedings of the Fourth International Conference on Intelligent Information Hiding and Multimedia Signal Processing.

Fei, T., et al. (2014). Robust inland waterway ship tracking via hybrid TLD and Kalman filter. Advanced Materials Research, Trans Tech Publications.

Hammond, T. (2014). Applications of probabilistic interpolation to ship tracking. Proceedings of the 2014 Joint Statistical Meetings, American Statistical Association.

Hu, W.-C., et al. (2011). "Robust real-time ship detection and tracking for visual surveillance of cage aquaculture." Journal of Visual Communication and Image Representation 22(6): 543-556.

Pallotta, G., et al. (2014). Context-enhanced vessel prediction based on Ornstein-Uhlenbeck processes using historical AIS traffic patterns: Real-world experimental results. Information Fusion (FUSION), 2014 17th International Conference on, IEEE.

Pallotta, G., et al. (2013). Traffic knowledge discovery from AIS data. Information Fusion (FUSION), 2013 16th International Conference on, IEEE.

Pallotta, G., et al. (2013). "Vessel pattern knowledge discovery from AIS data: A framework for anomaly detection and route prediction." Entropy 15(6): 2218-2245.

Riveiro, M. J. (2011). Visual analytics for maritime anomaly detection. Örebro universitet.

Robert-Inacio, F., et al. (2007). "Multispectral target detection and tracking for seaport video surveillance." Proceedings of the IVS Image and Vision Computing New Zealand: 169-174.

Vespe, M., et al. (2008). Maritime multi-sensor data association based on geographic and navigational knowledge. Radar Conference, 2008. RADAR'08. IEEE.

Guanghui, W. A. N. G. (2012). Demand forecasting of supply chain based on support vector regression method. Procedia Engineering, 29, 280-284.

Shin, J., Kim, H. J., Park, S., & Kim, Y. (2010). Model predictive flight control using adaptive support vector regression. Neurocomputing, 73(4), 1031-1037.

Chen, K. Y., & Wang, C. H. (2007). Support vector regression with genetic algorithms in forecasting tourism demand. Tourism Management, 28(1), 215-226.


Online Small-Group Learning Pedagogies for the 21st Century Classrooms

Sema Kalaian
Eastern Michigan University

For the last three decades, the emergence of the World Wide Web (WWW), technological advances in Information Technology (IT), handheld mobile devices (e.g., tablets, iPads, smart phones), and powerful computer technologies have been continuously redefining, reshaping, and advancing the concepts of online/distance learning and computer-supported learning (CSL) for delivering instructional course content. The Internet has provided a rich new technological medium for teaching and learning that has evolved over the last two decades. Meanwhile, it has produced research results that have brought us closer to understanding how to effectively use Internet-based methods for delivering instruction and learning. In sum, the technological revolution of the Internet and the World Wide Web continues to have a great impact on the teaching of various subject matters across all levels of schooling and across all disciplines. Also, the advances in instructional technology and computer-mediated communication (CMC) technologies during the last two decades have contributed significantly to improved student-instructor and student-student interaction capabilities and interactive course design in online instructional environments.

In addition, with the rapid growth of online learning (in terms of course and program offerings as well as online student enrollments) and the need for innovative online instruction, many educators and instructors have been experimenting with and implementing various small-group computer-supported collaborative learning (CSCL) methods in their web-based online courses. Innovative online learning and various alternative forms of online small-group learning pedagogies have been developed and implemented worldwide to replace or supplement traditional face-to-face classroom instruction. Online teaching and learning using small-group learning methods via CSCL systems such as problem-based learning (PBL), cooperative learning, collaborative learning (CL), and team-based learning (TBL) are examples of such innovative, reform-based, collaborative, student-driven pedagogies. These innovative 21st-century pedagogies make learning in online environments more stimulating, engaging, and motivating for students to deeply and meaningfully learn the course content and maximize their persistence in web-based online courses. Brief descriptions of some of the innovative small-group CSCL pedagogies used and implemented in 21st-century online (virtual) web-based courses will be presented, along with the advantages and disadvantages of these pedagogies.

Keywords: Online Learning, Online Education, Distance Learning, Distance Education, Cyber Learning, Team-based Learning, e-Learning, Small-group Learning, Problem-based Learning, Cooperative Learning, Collaborative Learning


Opportunities for Enhancing Buyer-Supplier Relationship: Inspirations from the Natural World

Introduction

The operations fraternity has witnessed a remarkable shift in operations strategy over the past three decades as firms have started focusing more on horizontal alignment of operations, reversing the trend of the 1980s when organizations focused more on vertical integration. Frohlich and Westbrook (2001) note that organizations now try to carefully link their internal processes to external suppliers and customers by means of strategic collaborations to form an effective and efficient supply chain. No organization, therefore, is an island. Upstream and downstream integration between suppliers and customers has proved to be a successful manufacturing strategy. By forming such alliances, organizations can build a network that provides them competitive advantage and creates high barriers to entry for the competition. Such collaboration enables the buying and supplying firms to share risks, combine their individual strengths, and work together to offer increased value to the consumer by reducing non-value-added activities (Whipple and Frankel, 2000). Extant literature indicates that strategic alliances underpin the development of the lean and total quality management philosophies that have helped organizations develop flexible manufacturing processes while cutting operational and inventory costs. The need for such collaborations continues to grow as organizations become increasingly specialized in their approaches. Narasimhan and Narayanan (2013) indicated that, as organizational process boundaries become more permeable, close collaborations among supply chain partners become more feasible. Their article provides instances of firms utilizing supplier leveraging, co-development, and joint venturing. Such activities enable open innovation, helping firms in the value chain to develop better customer offerings and competitive advantage. But the tale of strategic alliances does not always have a happy ending. Despite the multiple significant benefits of collaborations among supply chain partners, few such partnerships have lasted or succeeded in the way the collaborating firms had imagined.

Oftentimes, companies have embarked upon partnerships without fully understanding the concepts involved. Generally, such strategic alliances require radical changes in organizational structures and the involvement not only of top-level managers but also of middle- and lower-level managers. McIvor and McHugh (2000) found incongruencies among the different management levels leading to inconsistencies between strategic and tactical activities. In some cases, a simple breach of contract or an excessive focus on short-term goals by the focal companies has led to the dissolution of the strategic relationship. Effective management of relationships between two organizations requires more exploratory studies of the topic, which are scarce in the operations management literature. The primary objective of this paper is to review the literature on issues in strategic relationships and present a taxonomical analysis. We extend the literature by drawing systematic inspiration from natural science and examining how relational patterns in nature hold true in the context of supply chain management. In particular, the focus of this study is to investigate aspects of cooperation, intermediation, and the moderating effect of the environment that are evident in nature yet seem to have received scant attention in supply chain management research. The next section provides a literature review of the issues both buyers and suppliers face in developing and maintaining strategic relationships, categorizes the issues into taxonomies, and analyzes their severity. In the third section, we consider an example of a network found in the natural world to illustrate some of the governing principles that help create collaborative communities whose members communicate effectively among themselves and form symbiotic relationships with each other that ease individual process boundaries. In the later sections, we explore some analogies between the forest and organizational networks and provide recommendations for future research.

Issues in Buyer-Supplier Relationship – An Exploration

Our research is closely related to two streams of research. We study the extant literature to understand the issues that practitioners in organizations face when maintaining long-term relationships with other parties (suppliers or customers).

Though the literature is rich with studies of supply chain relationships and the value they bring to stakeholders, the literature discussing the issues confounding buyer-supplier relationships is relatively scant. To ensure that we focused on studies that emphasize the notion of issues in buyer-supplier relationships, we skimmed the articles searching for the causes of issues in these relationships.

Issues in Buyer-Supplier Relationship: A buyer-supplier partnership implies radical changes in the way people work in an organization, including teamwork, joint product design, and joint decision making. The literature suggests potential benefits of the buyer-supplier relationship including, but not limited to, better innovation opportunities, a faster design-to-market cycle, higher throughput in the production process, enhanced customer service, better demand prediction, a lower bullwhip effect, reduced operating costs for both parties, and increased sales. Allred et al. (2011) indicated the pivotal role of supply chain collaboration, as it mediates the conflict resulting from functional orientations and improves performance. Organizations can exploit inter-firm resources, leveraging the collaboration among them to develop competitive advantage. But sustaining a longer-term relationship and mutual trust is easier said than done. Extant literature finds that trust and satisfaction are some of the most important criteria for the sustainability of the buyer-supplier relationship. Organizations often fall prey to opportunistic short-term goals and behavior that jeopardize supplier trust and lead to dissolution of the relationship. Chen et al. (2013) compare the buyer-supplier relationship to the marital relationship between spouses and draw parallels to explain issues in the supply chain context. The article notes that in marriages wives often initiate divorce when they perceive that their investment in the relationship outweighs the accrued benefits and that the benefit distribution between the individuals is not proportional to their personal investments. Likewise, perceived fairness is especially important in a strategic relationship, as the timing and size of investments differ across stakeholders. When the disadvantaged party perceives that imbalances are carried over the long run, it grows dissatisfied with the relationship and tries to part from it. The usual manifestation is decreasing investment in the relationship, such as non-participation in joint decision sessions.

Such a decrease in the level of satisfaction and participation among suppliers impedes their ability to confide in their buyers and may lead to a dissolution of trust in the relationship. The literature presented in Table 1 explores the issues in the buyer-supplier relationship in detail. The instances of these dyadic issues vary widely, from a lack of trust and satisfaction to several economic reasons. To gain a better understanding of the issues, we propose a taxonomic description of the issues along with their severity. We assume that the number of references to an issue in the academic literature loosely measures its severity.

Dissatisfaction in Relationship: Dissatisfaction in the buyer-supplier relationship has been identified as the most critical issue leading to the dissolution of the relationship. Chen et al. (2013) indicated that when a party is dissatisfied with its relationship with another party, it develops non-committal tendencies. Such behavior affects the confidence, comfort levels, and trust needed to maintain a sustainable strategic buyer-supplier relationship. Sources of dissatisfaction among the collaborating parties can be primarily attributed to unfair treatment, performance imbalances, and asymmetric dependence between the parties in the dyad. Unfair treatment in a relationship occurs when either party receives benefits that are, over the long run, disproportionate to its individual investments. Such a perception of unfair treatment generates tension and dissatisfaction in the relationship. An example of unfair treatment is exhibited by a buyer when it ignores its long-term supplier and offers purchase orders to alternative suppliers to minimize procurement costs. In such instances, the suboptimal decision by the buyer undermines the investments the supplier has made toward the relationship. When the strategic supplier perceives such unfair treatment, it develops dissatisfaction toward the buyer and the relationship. Perception of inconsistencies in supplier performance may generate dissatisfaction in buyers. Sharland (2003) indicates that buyer trust and commitment depend on supplier performance and a higher quality of inputs. The article notes that if suppliers fail to maintain consistency in the quality of inputs, leading to higher output cycle times, this may negatively impact the buyer's perception of supplier performance. Such perception also depends on supplier responsibilities. Chen and Lee (2016) indicate that problems in supplier responsibility, generally manifested as material and process violations by unethical suppliers, affect the buyer's perception of supplier performance, which leads to dissatisfaction in the dyadic relationship.

In a collaborative relationship, the firms exploit each other's resources and opportunities and grow mutually dependent. When a party controls more resources than its counterpart, asymmetric dependence sets in, with the party controlling more resources, power, and influence being less dependent on the relationship. Emerson (1962) suggests that the effort a party expends to continue a strategic relationship is directly proportional to the level of its dependence on the other parties. Hence, as asymmetric dependence grows, the effort the powerful party puts into the relationship is low compared to the effort of the other party. Such differences in effort jeopardize the relationship's stability and generate strain in the relationship.

Poor Management of Interactions: A strategic partnership involves a radical change in the way an organization works and communicates with its environment. Thus, it involves changes in the social systems of both organizations. The scope of resistance is considerable: the more radical these changes are, the greater the resistance organizations face in building a successful collaborative partnership. An effective management system may mediate this gap and enable the organization to adopt the changes faster (Whipple and Frankel, 2000). McIvor and McHugh (2000) indicated that managerial shortcomings have been a critical cause of the failure of collaborative endeavors. The study indicated that the communication gap between the strategic (senior-level managers) and tactical (mid- and junior-level managers) levels acts as a source of incoherent relationship-building activities. Using a case study, it explained that although a strategic partnership with suppliers was decided upon by top-level managers, there was no tactical initiative in place to identify the necessary systemic changes to enable active partnership. Team members in the buyer firm were reluctant to include supplier firm participants in product design and joint decision meetings, and even when they did, the supplier firm's participation was generally marginalized, as no fundamental inputs from them were considered. The role of the purchasing managers was treated as no more than a clerical role, and the purchasing department was often understaffed and its staff underpaid. Such incongruent tactical-level decisions lead to unsuccessful collaboration with suppliers and loss of the invested effort.


Non-engagement in Opportunities: Buyer-supplier relationships are often formed to co-generate value and co-produce products (Wilkinson, 2008). Consider the example of joint product design activities, where suppliers provide valuable inputs to create better products for consumers. Thus, the products the parties co-produce and the ensuing value they co-generate, in the form of increased sales or higher market penetration, define the strength of their relationship and deliver satisfaction to both parties. Chen et al. (2013) note that while selecting suppliers, buyers consider criteria such as product development capabilities and after-sales support. Hence, if the strategic buyer-supplier relationship fails to engage in activities and co-generate value, it has a higher risk of being terminated. The article compares this aspect of the dyadic relationship with having children in a marriage, noting that marriages without children (the equivalent of co-generated value in a buyer-supplier relationship) are more likely to end in divorce.

Supply chain literature has often looked beyond the business domain into natural complex systems to draw inspiration and provide a better framework to explain observed variance (Choi et al., 2001). Biomimicry, the imitation of the models, systems, and elements of nature for solving complex human problems, is not uncommon in business. Primlani (2013) used the concepts of biomimicry to develop a framework for business innovators to follow in developing safer alternatives and more efficient processes. The article went further to note that "the conventional development of processes and schema that do not utilize the principles of biomimicry are instantly unsustainable, both in the ecological, and fiscal sense".

Observations from the Natural Network – The Case of the Common Mycorrhizal Network

We have found evidence in the forestry literature that the trees in a forest communicate using a Common Mycorrhizal Network (CMN), which enables them to communicate effectively with one another and efficiently manage the systematic distribution of carbon and other nutrients. In the next sections, we review the literature on the complex system associated with CMNs, study how such a network functions, and try to draw parallels to buyer-supplier relationship networks.

We conclude the article by identifying the scope for further research to develop better frameworks, inspired by the mechanism of CMNs, to guide practitioners in better managing inter-firm relationships.

Common Mycorrhizal Network (CMN): Using the theory of natural selection, Darwin (1969) suggested "the survival of the fittest," meaning that the organism that better adapts itself to the environment has the highest chance of survival. In a forest environment, the theory suggests a natural tendency of trees to compete with each other to find better sources of food. Following the theory, some trees should grow taller, cutting the smaller trees off from natural resources like sunlight, soil nutrients, and carbon dioxide. The other trees would then have a reduced chance of survival and would perish before living out their natural life span, which is clearly not the case. A recent stream of research has found that the plants in a community interact and even share resources through a common hyphal network. Such a network is termed a Common Mycorrhizal Network (CMN).

Mechanism of CMN: A mycorrhizal network is formed of symbiotic relationships between soil fungi and plants, whereby the fungi mine nutrients from the soil that can be readily absorbed by the plant in exchange for glucose from the plant. One or more such fungi can colonize two or more plants to improve their access to soil nutrients, thereby forming the mycelial links of CMNs (Selosse et al., 2006). The article suggests that CMNs originate not only from fungal genets colonizing neighboring roots during their growth, but also from hyphal fusions uniting previously separated mycelia. Such fusions, although often restricted to self or genetically close hyphae, can maintain CMN integrity. Simard et al. (1997) provided experimental evidence of a two-way interaction between compatible plants sharing the same CMN and suggest that plants in a forest environment form a guild based on their shared mycorrhizal associates and exhibit a tendency toward mutualism with the mycorrhizal fungi. The inter-tree interactions are moderated by the nutrients in the soil and the associated environment, which we discuss below.

Exploring the similarities with the buyer-supplier relationship network: When studying the mycorrhizal network, we recognize that most of the relational patterns of buyer-supplier relationships in the supply chain context are similar to the underpinning assumptions of CMNs.

To start with, both operate in a complex environment where individuals compete for limited resources and their chance of survival depends on how well they communicate with the environment and collaborate with others.

Cooperation: The basic tenet of CMNs is to promote cooperation, both among the plants in the network and between the plant and the soil fungi. A seedling in the forest often must compete with the overstory trees to secure nutrients for survival. Booth and Hoeksema (2010) suggest that attachment to a CMN mitigates the negative effect of competition between trees in the network. In some cases, the adult trees in a CMN may even help a seedling with the nutrients it requires to grow. Such an exchange structure, involving asymmetric distribution of nutrients within the network, has been confirmed by Simard et al. Booth and Hoeksema carried out an experiment comparing the growth of a seedling in a CMN with that of a seedling outside a CMN. The results showed that the seedling in the CMN had 56% higher survivorship than the other seedling, and the final ratio of root biomass to shoot biomass was 39% higher in the networked seedling. Thus, the network has a positive effect on the growth of the seedling, negating the competitive behavior of plants in the same network. Simard et al. provided instances in which growing seedlings provided the parent plant with nutrients when it required them.

Drawing from the theory of social evolution, West et al. (2006) defined cooperation as behavior that provides a benefit to the recipient and can be either beneficial (+ for actor, + for recipient) or costly (- for actor, + for recipient) to the actor. The article termed the former mutualistic behavior and the latter altruistic behavior. Trivers (1971), in a seminal article, discussed the term reciprocal altruism to explain cooperation between non-relatives. The idea is that individuals can take turns helping each other, for example by preferentially aiding others who have helped them in the past (West et al., 2006). There may be a delay between the two unidirectional transfers of resources between the sender and the receiver engaged in a reciprocally altruistic act.

From an organizational perspective, researchers have identified four salient types of mechanisms through which the relationship between the buyer and the supplier can be managed (Bozarth et al., 1998): information exchange between buyers and suppliers, multiple sourcing, formal contractual relationships, and informal partnering relationships.

Information exchange lays the foundation of the buyer-supplier relationship, wherein the parties share valuable information with each other. For example, the buyer may share consumer demand forecasts with its suppliers, helping them smooth their production schedules and protect against bullwhip effects. Buyers often pursue multiple sourcing strategies, wherein they procure raw materials from multiple sources; though buyers sometimes adopt this strategy to hedge against procurement uncertainties, its application is gradually declining. Another strategy that firms often implement at the commodity level is the use of detailed contracts in which the parties formally specify the nature of the relationship. But such formal contracts are often replaced by informal agreements once the parties develop strong working relationships. Bozarth and colleagues argue that a formal contract in the initial stage of a buyer-supplier relationship may, over time, lead to a cooperative 'partnering' agreement marked by joint problem solving and cost reduction. Borrowing from social exchange theory, we propose that the evolution of a contract-based buyer-supplier relationship into a more cooperative relationship is mediated by the extent of the reciprocal altruism the dyad has experienced.

Extant literature on reciprocally altruistic behavior in the supply chain domain is scant. In discussing the "pay what you want" pricing model, Machado and Sinha (2013) noted that such behavior, when used in conjunction with the visibility of voluntary payments, may provide a viable condition for the model. But such behavior is common in the practitioner's world. Liker and Choi (2004) indicated how Toyota helped its suppliers develop better production systems using its vast knowledge of the Toyota Production System, thereby making an unequaled contribution toward its suppliers. The suppliers returned the favor by being more committed to the relationship, being more responsive and flexible to the demands placed by Toyota, and contributing toward building a more cohesive supply chain.

The party that initiates the altruistic act should overcome the risk of non-reciprocity. Such overcoming of individual self-interest signals to the other party that the sender has a real and non-instrumental regard for the recipient, what Molm et al. (2007) referred to as "expressive value".

When the recipient is able to overcome the temptation of free-riding and gives back, such a lagged response generates "expressive value". Such values and the resulting benefits foster a stronger sense of identification between the parties in the dyadic relationship (Willer et al., 2012), as both parties realize that the other is motivated to act in the interest of the recipient rather than in its own self-interest. Willer and colleagues suggest that dyads with a stronger sense of identification experience higher group solidarity. The members of the dyad undergo depersonalization (Brewer and Gardner, 1996) when they become more invested in the interests of the group than in their own self-interests and are intrinsically motivated to realize the group's goals. Such a sense of depersonalization may further increase feelings of cohesion, shared fate, and cooperation (Lawler and Yoon, 1998).

Proposition 1: The dyadic relationship that experiences positive reciprocal altruism has greater levels of cooperation among the parties in the relationship.

Extant literature in operations management shows that a party in a relationship will reciprocate altruistic behavior from the other, and such reciprocation will have a positive effect on relationship performance (Settoon & Mossholder, 2002). Besides the rational performance gains, the parties in the relationship tend to value positive reciprocation in the relationship (Urda & Loch, 2013). Such activity reinforces emotional solidarity between the parties, who view each other in a different light than members of out-groups. Thus, positive reciprocal altruism helps develop the trust and confidence in a relationship that underpin effective mutualism and coordination between the two parties and boosts interdependence between them.

Proposition 2: The firms in a dyadic relationship that experiences higher levels of positive reciprocal altruism manifest higher levels of performance than their competitors.

Wu et al. (2011) argue that reciprocal altruism can evolve as a stable strategy under three conditions. First, the parties engaging in reciprocal acts need to be mutually dependent so that they benefit from the altruistic acts. Before beginning an altruistic act, the parties should have some dependence on each other, and a party should at least sense some benefits that may accrue when the other party reciprocates. Continuing with the previous example, Toyota showed altruism to its suppliers because it is dependent on its suppliers for parts and its suppliers are dependent on it for business.

It also realizes that if its suppliers reciprocate, such reciprocation will help Toyota enhance its performance. The essence of such basic mutual dependence can be captured in the contract document developed at the onset of the contract-based relationship. Second, the parties in the dyad should meet repeatedly. When the parties meet more often, they get to know each other better and understand each other's ability to contribute to the relationship. The frequency of the meetings may be codified in the contract if required. Thus, repeated meetings pave the ground for initiating altruistic acts. Third, the cooperating party should be able to detect, remember, and punish the cheating party. If a buyer that initiated an altruistic act does not receive any reciprocation, for example, it may choose to punish the supplier by awarding the contract to a competing firm or by showing increased favoritism to other suppliers. Thus, in the presence of a robust contract agreement, the parties will likely feel more comfortable engaging in altruistic acts, as the risk of non-reciprocation can be hedged against the terms of the contract between the two parties.

Proposition 3: The more rigorous and well-defined the terms of the contract in the contract-based relationship, the more comfortable the parties will feel engaging in altruistic activities and, likely, the better the reciprocation they will receive.

As the level of trust and cooperation between parties in a relationship increases, they tend to share more information with each other, which may contribute toward the competitive advantage of the firms. Efficient and effective communication between supply chain partners decreases performance- and product-related errors. Paulraj et al. (2007) argue that when parties share high levels of cooperation and collaboration, they are more likely to share important information relating to material procurement and product design issues, which helps them improve product quality and customer responsiveness and save costs through greater operational efficiencies. Such effective communication enables the firms in the supply chain to locate and implement innovations that provide them competitive advantage. Such advantage helps the firm enjoy superior operational and financial performance relative to its competitors.

Proposition 4: Firms having higher levels of cooperation with their suppliers manifest higher levels of performance than their competitors.


System Integrator: Simard et al. noted that experiments have shown that the transfer of carbon, nitrogen, and phosphorus between plants occurs through the interconnecting mycelia. Thus, we can infer that the mycelia assume the critical role of integrating agent in the inter-plant interactions. The mycelia "decide" which species, depending on their participation in the CMN, should receive a proportional measure of nutrients. Mycelia are thus known to control the communication and flow of materials between the interacting parties, catalyzing the inter-plant interactions.

As supply chains disintegrate, organizations strive hard to find new ways to manage the inbound flow from suppliers. Managing supplier relationships became even more challenging as manufacturing was outsourced to low-cost Asian countries. As the resource-based view (Prahalad and Hamel, 1990) suggests, efficient management of the interactions with suppliers may provide competitive advantage for the organization. Though large OEMs were able to overcome this challenge by relying on their resource buffers, SMEs were the worst hit. They increasingly began to feel the need for a neutral third party, a system integrator, who would help bridge the gap between the OEMs and the suppliers and thus provide significant value to the supply chain. They became open to having a third party assist them in managing inbound sourcing, selling excess production capacity, identifying customers they do not have access to, and finding new sources of capital (Bitran et al., 2007). To emphasize the role of integrators demonstrating catalytic behaviors in nurturing supply chain relationships, the authors examine the business model of the Li and Fung Group, a Hong Kong-based company that serves private-label apparel firms in Europe and North America. Li & Fung, with revenues of US$7 billion in 2005, operates what might be called a "smokeless" factory. It helps the OEMs select the best factory to perform each function, leveraging its knowledge and experience. The firm coordinates and controls each process in the supply chain - raw material sourcing, factory sourcing, manufacturing control, etc. Using its buying power, Li and Fung can shrink the delivery cycle for time-sensitive fashion products, enabling customers to make purchases closer to their target completion dates and thus providing them a substantial competitive advantage that contributes positively toward the performance of the OEMs.

Thus, these system integrators, with their expertise in managing the relationship between a buyer and a supplier, may play a critical role by positively influencing the experience and opinion of the parties in the relationship. Such positive experiences of the parties in a dyadic relationship may enhance the solidarity and trust they share with each other, which in turn increases the dyadic interdependence.

Proposition 5: The system integrators in supply chain relationships moderate the relationship between interdependence and firm performance such that the relationship is stronger when the level of involvement of the integrators is high and weaker when the level of involvement of the integrators is low.

Proposition 6: The system integrators in supply chain relationships moderate the relationship between levels of positive reciprocal altruism and supply chain interdependence such that the relationship is stronger when the level of involvement of the integrators is high and weaker when the level of involvement of the integrators is low.

Moderating Effect of the Environment and Emergent Behavior: Simard et al. noted that, depending on weather conditions, the CMN adapts itself by changing the distribution of nutrients among the interacting parties. In their experiment, the researchers found that when seedlings of Pseudotsuga menziesii were kept under different light treatments, the net transfer of 13C changed accordingly. When the seedling was placed in the shade, the net transfer of carbon from Betula papyrifera to P. menziesii increased considerably, whereas when it was placed in sunlight the net transfer of carbon atoms was minimal, owing to the fact that the seedling was able to perform its own photosynthesis. The article noted a role-reversal pattern: when B. papyrifera was placed in the shade, P. menziesii transferred its residual 13C to support that plant. Similar adaptation was observed along gradients of other nutrients.

In the supply chain context, Choi et al. (2001) discussed the supply chain network as a complex adaptive system (CAS) and proposed a model of a CAS that focuses on the interplay between a system, characterized by a network of firms that collectively supply products to a buying firm, and an environment, consisting of end-consumer markets that exert demand for the products and services produced by the network, connected economic systems, and larger institutional and cultural systems. The article proposes that self-organization capabilities allow the network to adapt better to environmental changes and thus show emergent behaviors like those we have studied in the CMN. When the supply chain network in which the buyer and suppliers take part has higher self-organizing capabilities, exemplified by the network's ability to adjust goals and infrastructure in response to environmental uncertainties (e.g., economic and demand uncertainties), the parties in the network enjoy competitive advantage, which positively impacts firm performance.

Proposition 7: Network self-organization capabilities moderate the relationship between interdependence and firm performance such that the relationship is stronger when the self-organization capability is high and weaker when the capability is low.

Proposition 8: Network self-organization capabilities moderate the relationship between reciprocal altruism and firm performance such that the relationship is stronger when the self-organization capability is high and weaker when the capability is low.

A conceptual framework depicting the relationships is provided in Figure 1.

Figure 1: Conceptual framework (constructs: reciprocal altruism, robustness of the contract-based relationship, level of cooperation, system integrators, self-organization capability, firm performance).

Conclusion


Though researchers have studied concepts from complex systems in the supply chain management domain, the literature studying the concept of reciprocal altruism and its impact on interdependence and firm performance is relatively scant. While reviewing the literature, we identified some opportunities worth considering:

1. Operationalization of the constructs and empirical testing: In this article we have discussed how the different supply chain constructs are related to each other. To maintain the parsimony of the study, we have not discussed the operationalization of the constructs. After the constructs are operationalized, the framework should be put to rigorous empirical testing. Future researchers may extend the literature by investigating how some of the new constructs can be operationalized and by exploring different ways to rigorously test the framework presented.


TABLE 1

1. Chen et al. (2013)
Objective: Development of stronger theory by comparing buyer-supplier relationship dissolution with divorce as a social phenomenon. The comparison has been done at three levels: ontology, analogy, and identity. During the analysis at the level of identity, the article presents three principles to define the reasons for buyer-supplier dissolution.
Study description: A formal process of metaphorical transfer to transform divorce into a theory-constitutive metaphor for strategic buyer-supplier relationship dissolution. Different aspects of divorce have been cited and explained to draw inferences about buyer-supplier relationship dissolution.
Findings: The article indicates certain issues driving the dissolution of the buyer-supplier relationship: i) breach of the contract terms by either party, leading to compensation to recoup the relationship-specific investment; ii) the parties may intentionally distance themselves from the other due to supply chain disruptions, e.g., product failure and the associated liability; iii) poor management of interaction between the two parties often results in stressful conditions leading to relationship dissolution; iv) if the costs incurred to maintain the strategic relationship outweigh the benefits from the maintenance, the relationship may dissolve; v) dissatisfaction from the perception of unfair treatment, e.g., imbalances of benefit distribution (a high level of satisfaction helps create a halo that buffers the relationship from negative impacts); vi) non-engagement in opportunities, like product development capabilities and provision of after-sales support, to produce a joint outcome may result in dissolution; vii) asymmetric dependence arising from one party controlling more resources than its counterpart.


2. Whipple and Frankel (2000)
Objective: Development of strategic alliance success factors for buyers and suppliers to follow in order to develop stronger and more efficient collaborations.
Study description: A two-sided survey method was administered. 41 buying firms and 63 supplier firms participated. After 97 responses were received from the buying firms, questionnaires were sent to the respective supplier firms; 92 responses were received, and these pairs formed the basis of the analysis.
Key variables/constructs: Several constructs were used in the survey, such as Trust, Senior Management Support, and Willingness to be Flexible; these were operationalized by variables like High Integrity, Very Honest, and High Moral Character for the construct Trust.
Findings: The article discussed some reasons for the failure of supply chain alliances despite their high potential benefits: i) adoption of a "seat of the pants" style in the approach to joint management; ii) inadequate understanding of managing and maintaining strategic relationships; iii) drastic changes in mind-set, culture, and behavior are overwhelming to the managers managing the changes; iv) inability to counterbalance the long-term benefits of the strategic alliance with the short-term focus on cost reduction; v) inadequate performance enhancement and unsuccessful goal achievement create the perception that the alliance is not worthwhile. Thus, the paper discusses that the "win-win" situation of the alliance has both a "soft" people-oriented focus as well as the need for a "hard" performance-oriented approach; the people-oriented approach encompasses the development of trust, securing senior management support, setting clear goals, etc.

3. Carter (1999)
Objective: To provide a set of reliable and valid scales to measure unethical behavior in buyer-supplier relationships, and to define the ethical issues in the context of the relationship between a US buyer and a non-US supplier.
Study description: A two-step survey was carried out: first, a series of focus group interviews was conducted to rectify any systematic errors in the questionnaire; second, a set of modified questionnaires was sent to US purchasing managers and their non-US suppliers.
Key variables/constructs: Constructs such as buyer activities (deceitful and subtle practices), supplier activities, buyer's perception of supplier performance, and satisfaction with the buyer-supplier relationship.
Findings: The article studies the effect of ethical issues on the buyer-supplier relationship (US buyer and non-US supplier). It indicates a decrease in the level of satisfaction among the members of a dyad if one cannot confidently rely on the other. The study shows significant evidence that the buyer's unethical behavior negatively influences supplier satisfaction, and that the buyer's perception of supplier performance may be negatively affected by the buyer's perception of unethical activities by the supplier.


4. Gurnani & Shi (2006)
Objective: To address the issue of building trust in first-time interactions by designing supply contracts that use either down-payments or nondelivery penalties, depending on the nature of the mistrust.
Study description: The study invokes a Nash bargaining game to determine the optimal contract price and quantity.
Key variables/constructs: Probability of delivering the order quantity as believed by the supplier; probability of delivery by the supplier as believed by the buyer; utility functions (benefit to the buyer and cost to the supplier).
Findings: The article indicates that the buyer-supplier relationship, where the buyer is placing a first-time purchase order, may be jeopardized by the inability of the supplier to supply the order in accordance with the agreement between the two parties. The driving factors are inadequate product information, equipment failures, country-specific political or trade-policy risks, and labor absenteeism.

5. Ro et al. (2016)
Objective: To understand the perceptual differences between the buyer and the supplier with regard to their reactions to the same potential supply chain disruptions.
Study description: A scenario-based experimental approach was used to test differences between the supplier's anticipation of the buyer's behaviors and the buyer's stated behaviors under the same supply disruption event.
Key variables/constructs: DV: relationship continuance and opportunism; independent or control variables: gender, race, work experience (for both the buyer and supplier side).
Findings: The article indicates that perceptual gaps in relational norms, unethical behavior, power asymmetry, and collaboration between buyer and supplier may undermine the foundation of trust and foster dissatisfaction and conflict, and thus result in the dissolution of their relationship. The article cites that uncertainties resulting from discrepancies and incoherence in dyadic exchanges have a negative effect on the focal firm's desire to maintain the dyad.

6. Johnston et al. (2004)


Objective: To study the relationship of the supplier's level of trust to three categories of inter-firm cooperative behaviors, and of these behaviors to the buyer's perception of the relationship's performance.
Study description: Partial least squares path analysis was used to test the conceptual model's overall pattern using cross-sectional data. A mailed questionnaire was sent to purchasing managers of buyer firms, who were asked to select the relationship they believed was their most cooperative; the account managers of the corresponding supplier firms were then surveyed.
Key variables/constructs: Buyer's assessment of the relationship's performance and buyer's satisfaction as outcomes; supplier's perception of the buyer's dependability and benevolence as trust; joint responsibility, shared planning, and flexibility in arrangements as inter-firm activities.
Findings: The study found that supplier involvement in collaborative activities is partially a function of the trust that the target suppliers have in the buyer firm. Such supplier involvement leads to more positive outcomes, which eventually drive buyer satisfaction, which is critical for the buyer-supplier relationship. Thus, if the supplier perceives the buyer as unreliable and/or unkind (the constructs for trust), the relationship may dissolve.


7. Thomas et al. (2011)
Objective: To understand how time-pressured supplier representatives impact the knowledge-sharing behaviors of their counterpart buying firms.
Study description: A between-subjects, scenario-based experimental methodology was utilized to test the hypotheses. Participants were experienced full-time managers enrolled in a part-time graduate business program (average age 34, with 9 years of full-time work experience).
Key variables/constructs: Accelerated processing, risk aversion (time-pressure coping mechanisms), and relationship magnitude as independent variables; information exchange, operational knowledge transfer, and shared interpretation (knowledge-sharing behaviors) as dependent variables.
Findings: The study finds that the boundary-spanning agent (supplier side), when faced with time pressure, develops a couple of coping mechanisms - risk aversion and accelerated processing - which give the buyer boundary-spanning agent (purchasing manager) a perception of suboptimal decision making by the supplier. Such a perception impedes knowledge sharing and other collaborative activities between supply chain partners. The study found that the risk-aversion mechanism shown by the supplier has a much stronger negative moderating effect on the knowledge-sharing behaviors of buyer firms in high-magnitude relationships.

8. Sharland (2003)
Objective: To empirically test the impact of cycle time on supplier selection and on the effectiveness of long-term relationships with suppliers, as reflected by the commitment and trust developed.
Study description: A survey-based study was conducted on members of the Institute of Supply Management from three different industries. With a response rate of 14.4%, the number of responses analyzed was 108.
Key variables/constructs: DV: trust and commitment; IV: cycle time, proximity, manufacturing quality, ease of qualifying, comparative price, and supplier performance (operationalized by items like cycle time reduction, input material quality, and quality consistency).
Findings: The respondents highly rated trust and commitment with their key suppliers. These attributes define the basis of the long-term relationship between the two parties. As the study found, trust and commitment depend on several factors, such as supplier performance and higher quality of inputs, and performance in turn depends on cycle time reduction. Thus, it can be inferred that if suppliers are unable to maintain consistency in manufacturing quality and reduce cycle time in manufacturing processes, supplier performance may be negatively impacted, which in turn may affect trust and commitment.


9. McIvor and McHugh (2000)
Objective: To highlight the change implications for organizations that are attempting to develop collaborative relationships with their suppliers.
Study description: The research focused on the strategic business unit of a global telecommunications equipment manufacturer and its key supplier. A single-case research design was chosen, and an in-depth case analysis was conducted. A variety of data collection methods were undertaken, such as direct observation of meetings involving organization personnel and supplier representatives, access to documentation, and interviews; these provided intricate details of important issues and behavior patterns.
Findings: The article indicates that many buyer-supplier relationships have not been able to deliver the intended benefits and have failed. The study uncovers the causes behind the failure: i) the partnership sourcing concept has been misleading, as organizations seem not to have fully understood the concept, and in many cases buyers have retained considerable economic power in comparison to suppliers; ii) there has been focus on superficial content issues but not much on managing the change of process issues - it was not ensured that complementary activities and behaviors were in place within each of the partnering organizations; iii) non-coherence between activities at the strategic and tactical levels - the study cites that although the organization in the case study strategically decided to pursue a collaborative relationship with suppliers, no tactical initiative was in place to identify the necessary systemic changes to enable active partnership, and no changes in culture were made to shift the mindset of mid- and junior-level managers in the buyer and supplier organizations; iv) the study showed the reluctance of team members in the buyer organization to involve suppliers in new product development discussions, and treatment of the purchasing manager's role as clerical rather than critical and strategic, which contributed to the organization's inability to manage the change needed to imbibe the buyer-supplier collaboration philosophy.


10. Chen & Lee (2016)
Purpose: To provide an analytical framework for understanding some of the observed industry practices for managing supplier responsibility risk.
Methodology: The study develops an analytical model using a Stackelberg game framework. It endogenizes the supplier's noncompliance probability by modeling the economic trade-off a supplier faces in complying with social and environmental standards, and it assumes random production cost. Variables include production cost, public discovery probability, monetary penalty cost, the supplier's individual cost of committing a violation, probability of non-compliance, and regulatory sanction cost.
Findings: Despite collaborative relationships, company executives are concerned about supplier responsibility problems (SRPs), which include violations by unethical suppliers - material violations and process violations. Such SRPs lower the buyer's confidence in the supplier, potentially leading to the dissolution of the relationship. To protect themselves against these violations, buyers incur additional costs in the form of product inspections, audits, and other responsibility-related programs; thus, the benefit of such relationships decreases.


References

Allred, C. R., Fawcett, S. E., Wallin, C., & Magnan, G. M. (2011). A Dynamic Collaboration Capability as a Source of Competitive Advantage. Decision Sciences, 42(1), 129-161. doi: 10.1111/j.1540-5915.2010.00304.x
Bitran, G. R., Gurumurthi, S., & Sam, S. L. (2007). The need for third-party coordination in supply chain governance. MIT Sloan Management Review, 48(3), 30.
Booth, M. G., & Hoeksema, J. D. (2010). Mycorrhizal networks counteract competitive effects of canopy trees on seedling survival. Ecology, 91(8), 2294-2302.
Carter, C. R. (2000). Ethical issues in international buyer–supplier relationships: a dyadic examination. Journal of Operations Management, 18(2), 191-208.
Chen, L., & Lee, H. L. (2016). Sourcing under supplier responsibility risk: The effects of certification, audit, and contingency payment. Management Science.
Chen, Y.-S., Rungtusanatham, M. J., Goldstein, S. M., & Koerner, A. F. (2013). Theorizing through metaphorical transfer in OM/SCM research: Divorce as a metaphor for strategic buyer–supplier relationship dissolution. Journal of Operations Management, 31(7), 579-586.
Choi, T. Y., Dooley, K. J., & Rungtusanatham, M. (2001). Supply networks and complex adaptive systems: control versus emergence. Journal of Operations Management, 19(3), 351-366. doi: https://doi.org/10.1016/S0272-6963(00)00068-1
Darwin, C. (1969). On the Origin of Species by Means of Natural Selection. Culture et Civilisation.
Emerson, R. M. (1962). Power-dependence relations. American Sociological Review, 31-41.
Frohlich, M. T., & Westbrook, R. (2001). Arcs of integration: an international study of supply chain strategies. Journal of Operations Management, 19(2), 185-200.
Gurnani, H., & Shi, M. (2006). A bargaining model for a first-time interaction under asymmetric beliefs of supply reliability. Management Science, 52(6), 865-880.


Johnston, D. A., McCutcheon, D. M., Stuart, F. I., & Kerwood, H. (2004). Effects of supplier trust on performance of cooperative supplier relationships. Journal of Operations Management, 22(1), 23-38.


McIvor, R., & McHugh, M. (2000). Partnership sourcing: an organization change management perspective. Journal of Supply Chain Management, 36(2), 12-20.
Narasimhan, R., & Narayanan, S. (2013). Perspectives on supply network–enabled innovations. Journal of Supply Chain Management, 49(4), 27-42.
Paulraj, A., Lado, A. A., & Chen, I. J. (2008). Inter-organizational communication as a relational competency: Antecedents and performance outcomes in collaborative buyer–supplier relationships. Journal of Operations Management, 26(1), 45-64.
Prahalad, C. K., & Hamel, G. (1990). The core competence of the corporation. Boston (MA), 1990, 235-256.
Primlani, R. V. (2013). Biomimicry: On the Frontiers of Design. Vilakshan: The XIMB Journal of Management, 10(2).
Ro, Y. K., Su, H. C., & Chen, Y. S. (2016). A tale of two perspectives on an impending supply disruption. Journal of Supply Chain Management, 52(1), 3-20.
Selosse, M.-A., Richard, F., He, X., & Simard, S. W. (2006). Mycorrhizal networks: des liaisons dangereuses? Trends in Ecology & Evolution, 21(11), 621-628.
Sharland, A., Eltantawy, R. A., & Giunipero, L. C. (2003). The impact of cycle time on supplier selection and subsequent performance outcomes. Journal of Supply Chain Management, 39(2), 4-12.
Simard, S. W., Perry, D. A., Jones, M. D., Myrold, D. D., Durall, D. M., & Molina, R. (1997). Net transfer of carbon between ectomycorrhizal tree species in the field. Nature, 388(6642), 579-582.
Thomas, R. W., Fugate, B. S., & Koukova, N. T. (2011). Coping with time pressure and knowledge sharing in buyer–supplier relationships. Journal of Supply Chain Management, 47(3), 22-42.
Whipple, J. M., & Frankel, R. (2000). Strategic alliance success factors. Journal of Supply Chain Management, 36(2), 21-28.
Wilkinson, I. (2010). Business Relating Business: Managing Organisational Relations and Networks. Edward Elgar Publishing.


Wu, Z., & Choi, T. Y. (2005). Supplier–supplier relationships in the buyer–supplier triad: Building theories from eight case studies. Journal of Operations Management, 24(1), 27-52. doi: https://doi.org/10.1016/j.jom.2005.02.001
Urda, J., & Loch, C. H. (2013). Social preferences and emotions as regulators of behavior in processes. Journal of Operations Management, 31(1-2), 6-23.
Settoon, R. P., & Mossholder, K. W. (2002). Relationship quality and relationship context as antecedents of person- and task-focused interpersonal citizenship behavior. Journal of Applied Psychology, 87(2), 255.


Optimal Clustering of Products for Regression-Type and Classification-Type Predictive Modeling for Assortment Planning
Raghav Tamhankar, Sanchit Khattar, Xiangyi Che, Siyu Zhu, Matthew A. Lanham
Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907
[email protected]; [email protected]; [email protected]; [email protected]; [email protected]

ABSTRACT
In collaboration with a national retailer, this study assessed the impact on sales prediction accuracy of clustering sparse demand products in various ways, while trying to identify scenarios in which framing the problem as a regression problem or a classification problem would lead to the best demand decision-support. This problem is motivated by the fact that modeling very sparse demand products is hard. Some retailers frame the prediction problem as a classification problem, where they obtain the propensity that a product will sell or not sell within a specified planning horizon, or they might model it in a regression setting that is plagued by many zeros in the response. In our study, we clustered products using the k-means, SOM, and HDBSCAN algorithms with lifecycle, failure rate, product usability, and market-type features. We found there was a consistent story behind the clusters generated, which was primarily distinguished by particular demand patterns. Next, we aggregated the clustering results into a single input feature, which led to improved prediction accuracy for the predictive models we examined. When forecasting sales, we investigated a variety of regression- and classification-type models and report a short list of those models that performed the best in each case. Lastly, we identify certain scenarios we observed when modeling the problem as a classification problem versus a regression problem, so that our partner can be more strategic in how they use these forecasts for their assortment decision.

Keywords: product clustering, classification vs. regression-type modeling, assortment planning, decision-support


INTRODUCTION
Inventory management is a fundamental part of a firm's operational strategy, as it aims to ensure the availability of products to customers. This aspect is of paramount importance because it directly impacts the customer experience, which has become one of the key focus areas for firms in our digital world. In addition, there are large costs associated with the movement and holding of inventory. Hence, the key is to understand how to operationally optimize the distribution of products. A firm sets its financial objectives, such as sales, inventory, and margin, at various levels, which impacts the organization's merchandising strategies and decisions. To meet these financial KPIs, historical data can be paired with effective analytics to identify and anticipate opportunities to improve category-level assortments and better product placement timing for merchants [1]. Assortment planning is the process of planning, organizing, and selecting a group of products that will be designated to particular areas during specific periods of time [2].

This study, conducted in collaboration with a national retailer that tends to stock large assortments of sparse demand products, seeks to answer a few questions. First, what is the best way to frame the modeling of such products? For the given industry, retailers will often consider attributes such as product type, store location, geographical demand, product life cycles, and failure rates as telling predictors of demand. However, modeling demand when it is extremely sparse is a very challenging problem, even more so than traditional intermittent demand. Should the retailer frame the prediction problem as a regression-type problem, which is the most common scenario, or can framing the problem as a classification problem provide more reliable forecast measures to support the assortment decision? In our study, we seek to provide some empirical evidence by performing traditional clustering of features commonly used regardless of how the problem is framed, then building predictive models using those clusters as an independent variable to predict the number of units sold. Units sold is often either zero or one in the pre-determined planning horizon, but there are cases where more than one unit is sold. Thus, our research questions in this study are:
• Can informative clusters be generated using popular unsupervised learning algorithms, and is there a business story about those clusters?
• How does clustering of products improve the predictive performance of models in a regression and classification setting?
• In which scenarios is the regression approach preferred to the classification approach when modeling very sparse demand products?
We structure our paper by first performing a literature review of predictive modeling to support assortment planning decisions, then describing the data used in our study, detailing our study methodology, discussing the models we investigated to predict demand, summarizing our results, and providing answers to our research questions in the conclusions.

LITERATURE REVIEW
We have critically reviewed prior academic literature and articles to provide guidance for our study. Our research areas cover both methodologies used in assortment planning and statistical learning approaches (clustering as well as supervised and unsupervised predictive analysis) that support this area. Assortment planning (AP) is defined as the process of optimizing the inventory level of selective types

and number of products to be carried at retail stores (Kapoor and Singh, 2014). The objective of AP is to strategically determine the optimal quantity to carry, by reducing the chance of stock-outs (Baker, 2005), and therefore drive bottom-line improvements. Traditionally, the assortment decisions were limited to the

consideration of substitution and demand elasticity (Aydın and Hausman, 2009) as well as shelf-space studies (Gajjar and Adil, 2011), as shown in Table 1 below.

Table 1: Types of Assortment Planning Models (space-elastic and/or substitution) and Variables Considered
Gajjar and Adil (2010): space available, upper and lower limits on space
Yücel et al. (2009): cost of substitution, cost of supplier selection, shelf-space limitations
Abbott and Palekar (2008): direct and cross space elasticity, inventory, profit, cost
Shah and Avittathur (2008): substitution, space, multi-item inventory level, cannibalization
Li (2007a): substitution, continuous and discrete in-store traffic, product profit, cost
Kök and Fisher (2007): utility, substitution rate, cost, price, facings
Hariga et al. (2007): direct and cross space elasticity, locational effect, inventory level
Gaur and Honhon (2006): static and dynamic substitution, space, location, inventory, profit
Rajaram (2001): selling and cost price, set-up cost, budget, salvage price, shelf space

From our review of the literature, most studies tend to focus on substitution behavior and on the computational complexity of the models used to generate assortments under unrealistic scenarios. Thus, in this paper we do not provide a clear connection (or extension) from these studies to the problem we investigate with our industry partner. However, we have found close connections to other research streams focused on using machine learning techniques for demand forecasting problems. Machine learning algorithms have been a hot topic in business and demand forecasting as of late because of their ability to discover trends and behaviors in the data that traditional approaches are simply not able to identify. We posit that these techniques can provide the best guidance to retailers to help improve the AP process. In this study, we combine clustering techniques (i.e., unsupervised learning) with predictive modeling techniques (i.e., supervised learning). Predictive analytics using machine learning has tremendous advantages over traditional methods because it often requires fewer assumptions. Instead of modeling a trend based only on historical sales data, predictive analytics using machine learning can incorporate all relevant features, such as product attributes, store capacities, transfer/logistics costs, lead-time variability, assortment rules, geo-demographic diversity, price elasticity of demand, etc., into the demand model to better understand the product-place-promotion aspects that are fundamental components of purchasing behavior (Cook, 2016). These features could lead to better demand forecasts, which could lead to better assortments that have the right depth and diversity balance, as well as a lower chance of product cannibalization and frequent order replenishment problems (Krupnik, 2016).

The goal of segmenting the observations into different clusters is to provide users with an in-depth understanding of the "internal mechanism" within the data by maximizing in-group similarity and minimizing between-group similarity. The measurement of similarity varies based on the specific context and can generally be categorized as "relationship-based" or "distance-based" clustering (Wu, 2006). Distance-based approaches include k-means, and examples of relationship-based methods are linear or logistic regression. Research has shown that ensemble approaches tend to perform better than single classifiers (Opitz & Macline, 1999; Quinlan, 2006), but the ensemble methods discussed there are usually bagging, boosting, and stacking. Unlike these traditional studies, Williams' research investigated an ensemble approach that combines clustering and predictive modeling, and showed that the ensemble approach increased the accuracy of predicting recurrence and non-recurrence for breast cancer patients (Williams, 2008).

Once the models are deployed, a category manager can very effectively build an assortment plan. While the decisions are made based on the predictive model and the constraints of the prescriptive model (e.g., shelf space, budget, etc.), there are other scenarios, such as product cannibalization, that must also be considered. The Multinomial Logit (MNL) choice model, for example, assumes certain properties due to which substitution effects among products (such as cannibalization) cannot be captured (Jena, Dominik, 2017). Thus, while our study does not clearly tie into this academic research, there remains a need to answer the fundamentally important question of how to model extremely sparse demand problems to better support the assortment planning decision.

We have referred to several research papers to understand how clustering has been applied to assortment planning. One of our findings is that a very recently published paper proposes a dynamic clustering policy as a prescriptive approach for assortment personalization. The authors introduce a policy that adaptively combines estimation (by estimating customer preferences through dynamic clustering) and optimization (by making dynamic personalized assortment decisions using a bandit policy) in an online setting. Hence, if customers can be specifically identified, clustering approaches are useful aids in assortment planning from the store level down to the personalization level (Bernstein, 2017).

DATA
The dataset was provided by a national retailer and contains information about sales of SKUs from a vast number of stores over a two-year time frame. Business parameters such as product lifecycles, failure rates, and the number of customers that an SKU supports in a specified area were provided. No additional information can be described about the dataset due to confidentiality concerns.
METHODOLOGY
Figure 1 below shows the overall approach that this study followed. To answer our research questions, we clustered the data three different ways, as well as built a predictive model without clusters. Next, the

clusters were fed as inputs, along with other features found significant at predicting demand, into the models for training. Some models were regression-type models, with target variable values of 0, 1, 2, etc.,

while the classification-type models had a target response of 0 or 1: one if the SKU sold one or more units during the previous time window and zero if it had not sold any units.

Figure 1: Methodology outline

Before the clustering and predictive models were trained, all irrelevant (insignificant to the objective) features were removed from the dataset. Missing values were imputed using the mean value; this turned out not to negatively impact the distribution of the variables, as imputation often can. Numerical features were normalized to maintain scale uniformity, which should allow the models to train without being biased by the range of the features. Data partitioning was performed using an 80-20% train-test split for all the predictive models.

CLUSTERING MODELS
Model 1: Base Model without Clusters
The base model was developed to provide a reference point for evaluating the performance of models with different clusters. It was used directly in the regression models to forecast sales, and in the classification case to predict purchase propensities.

Model 2: k-Means Clustering Model
K-means clustering was performed on the failure rate and percentage-lifecycle-remaining variables, which are common features to use in this application area. K-means is a popular unsupervised learning algorithm used for unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity, which is generally calculated using Euclidean distance. The appropriate number of clusters was obtained by analyzing an elbow plot after implementing the algorithm for different values of k. This plot provides a heuristic to gauge the average cohesion within clusters. Once the optimum value was identified, the cluster number was used as a categorical variable in the classification and regression models along with the other predictor variables. The variables used to cluster were omitted from the forecasting techniques to prevent bias towards those specific variables.
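The k-means and elbow-plot step described above can be sketched as follows. This is a minimal illustration in Python with scikit-learn, using synthetic stand-in data and assumed column names (failure_rate, pct_lifecycle_remaining); it is not the retailer's actual (confidential) schema or tuned settings.

```python
# Sketch: k-means clustering with an elbow plot, then appending the cluster
# label as a categorical feature (assumed column names, synthetic data).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
df = pd.DataFrame({                      # stand-in for the proprietary dataset
    "failure_rate": rng.random(500),
    "pct_lifecycle_remaining": rng.random(500),
})
features = ["failure_rate", "pct_lifecycle_remaining"]  # assumed, already normalized
X = df[features].to_numpy()

# Elbow plot: within-cluster sum of squares (inertia) for candidate values of k.
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow plot")
plt.show()

# After picking k from the elbow (the study settled on 6), attach the cluster
# label as a categorical predictor and drop the clustering variables themselves.
k_best = 6
df["cluster"] = KMeans(n_clusters=k_best, n_init=10, random_state=42).fit_predict(X).astype(str)
model_df = df.drop(columns=features)
```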


Model 3: HDBSCAN Clustering Model
The HDBSCAN clustering technique was implemented on the same set of variables to obtain clusters based on the density of data points. It extends the popular DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. It is a density-based clustering algorithm that, given a set of points in some space, groups together points that are closely packed (points with many nearby neighbors) and marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).

Model 4: Kohonen Network Clustering Model
A Kohonen Network is a type of unsupervised-learning artificial neural network used to produce a low-dimensional representation of the input space. Kohonen Networks, also known as Self-Organizing Maps (SOMs), help in understanding high-dimensional data by reducing the dimensions of the data to a map. The final map reveals which observations are grouped together. The algorithm works by having several units compete for the current object. Once the data have been entered into the system, the network of artificial neurons is trained by providing information about the inputs. The unit whose weight vector is closest to the current object becomes the winning or active unit. During the training stage, the unit weights are gradually adjusted in an attempt to preserve the neighborhood relationships that exist within the input data set: the weights of the winning unit and its neighbors are adjusted toward the input object.
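A minimal sketch of the two alternative clustering models just described. The hdbscan and minisom packages are assumptions for illustration (the paper does not state which implementations were used), and the data below is a synthetic stand-in for the scaled failure rate and lifecycle features.

```python
# Sketch: density-based (HDBSCAN) and SOM (Kohonen) clustering on two features.
import numpy as np
import hdbscan
from minisom import MiniSom

rng = np.random.default_rng(0)
X = rng.random((500, 2))                 # stand-in for the two scaled clustering features

# HDBSCAN: points in low-density regions receive the label -1 (noise).
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(X)

# Kohonen network (SOM): a small 2x2 map yields up to four clusters; each
# observation is assigned to its best-matching (winning) unit.
som = MiniSom(2, 2, input_len=2, sigma=0.5, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, num_iteration=1000)
som_labels = np.array([som.winner(x)[0] * 2 + som.winner(x)[1] for x in X])

print("HDBSCAN noise points:", int((hdb_labels == -1).sum()))
print("SOM cluster sizes:", np.bincount(som_labels))
```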
PREDICTIVE MODELS
In our study, we combined supervised and unsupervised learning models and evaluated which models best predict sales units and propensity to sell. We researched and investigated both machine learning clustering techniques and business segmentation approaches to apply to our specific business problem and data set.

Classification-Type Modeling: The goal of classification-type modeling is to predict the probability of a SKU being sold at a given store and eventually classify it as a seller or non-seller based on a specific probability cutoff threshold. Several classic machine learning classification algorithms, such as Logistic Regression, Classification Tree (CART), and Linear Discriminant Analysis (LDA) models, were developed. More sophisticated techniques such as Bagged Classification Trees, Boosted Logistic Regression, and a multilayer perceptron feedforward artificial neural network (MLP) were also explored. Bagged (bootstrap aggregated) trees tend to have lower variance than individual trees. Boosted algorithms tend to shrink both bias and variance: misclassified records are given greater weights, so the generated probabilities tend to fall closer to their respective class (e.g., 0 or 1) than without boosting. Neural networks provide the ability to fit the complex relationships that often exist in sparse demand patterns.

Logistic Regression: Logistic regression is the simplest and most commonly used classification approach for predicting a binary outcome (the dependent variable). Similar to the linear regression model, the purpose of logistic regression is to describe the relationship between the dependent variable and a number of independent variables. Unlike linear regression, maximum likelihood estimation is used to estimate the parameter coefficients, where the response variable (0 or 1) has been transformed into log-odds via a link function known as the logit.

CART: The Classification and Regression Trees (CART) algorithm, introduced by Leo Breiman, refers to the commonly known decision tree model used for classification or regression problems. The advantage of a decision tree model is that it is easy to interpret. A tree can be "learned" by splitting the total dataset into subsets


based on an attribute value test, and this process is repeated in a recursive manner. The number of splits determines the tree complexity. A popular approach to obtaining a tree that generalizes well on a test set is to grow a large tree that overfits the training data and then prune it back until a consistent test error is obtained. This is how we developed our tree model in this study.

Bagged Classification Tree: A bagged decision tree was investigated because there is strong evidence that bagging weak tree classifiers can reduce overall error by reducing the variance component of the error. The variance in the bias-variance tradeoff is the error observed when applying the algorithm to different data sets. An analyst hopes to achieve a robust model, one that generalizes well from dataset to dataset so that the error obtained from each dataset is similar. When the results of multiple decision trees are aggregated (often by averaging), the resulting model tends to overfit less and be more robust.

LDA: Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, used to find a linear combination of features that serves as a linear classifier to separate two or more classes of objects. It can also be viewed as an approach for dimensionality reduction before classification. LDA is closely related to principal components analysis (PCA), as both look for linear combinations of variables that best explain the data. However, LDA tries to differentiate the classes of data, while PCA looks for similarity to build feature combinations.

MLP: A multilayer perceptron feedforward artificial neural network (MLP) consists of at least three layers of nodes. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and nonlinear activations distinguish an MLP from a linear perceptron. It can distinguish data that is not linearly separable and seems appropriate for the sparse demand nature of this study.
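A hedged sketch of fitting several of the classifier families discussed above with scikit-learn and comparing them by AUC on a 20% hold-out, mirroring the study's design; the synthetic data, hyperparameters, and model list are illustrative assumptions, not the tuned models reported in the results.

```python
# Sketch: comparing several classification-type models by AUC on a test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for the seller / non-seller response.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(max_depth=6),
    "Bagged Trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
    "LDA": LinearDiscriminantAnalysis(),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]          # predicted propensity of selling
    print(f"{name}: AUC = {roc_auc_score(y_te, p):.3f}")
```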
Regression-Type Modeling: Regression-type models were explored to predict the exact quantity of products that will be sold at a store in an upcoming planning horizon. The models were built using only the significant variables that helped in forecasting sales. The clusters obtained from the different algorithms were also included in the models as categorical variables. Multiple Linear Regression and several machine learning algorithms such as Neural Networks and Decision Trees were explored to predict sales. Additionally, a zero-inflated Poisson regression model was developed for cases having a large number of zeros in the response variable.

Multiple Linear Regression: By fitting a linear equation to the observed data, multiple linear regression tries to model the relationship between the continuous predicted variable and two or more predictor variables. Both forward and backward selection approaches were used to identify a consistent set of drivers that explain sales.

Neural Networks: Artificial Neural Networks (ANNs) were originally a biologically inspired programming paradigm that enabled scientists to learn from brain activity. They have the ability to derive meaning from complicated data and can be used to extract patterns and trends that are difficult to identify with conventional algorithms. An ANN is based on a collection of connected nodes referred to as artificial neurons. Data serve as signal inputs, and the output of each neuron is calculated by a nonlinear function that aggregates all of its inputs.

Zero-Inflated Poisson Regression: Zero-inflated Poisson (ZIP) regression is a specialized count regression model that is particularly well suited to data having a skewed distribution of zeros in the dependent variable. ZIP models assume that some zeros occur via a Poisson process, but that other observations were not even eligible to have the event occur. Thus, there are two processes at work: one that determines whether the individual is even eligible for a non-zero response, and another that determines the count of that response for eligible individuals.
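The two-process structure of the ZIP model can be illustrated with statsmodels, which is an assumed implementation choice (the paper does not name its software). The sketch below fits a zero-inflated Poisson to synthetic, zero-heavy counts; variable names and the inflation specification are illustrative only.

```python
# Sketch: zero-inflated Poisson regression on synthetic zero-heavy counts.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(7)
n = 2000
x = rng.normal(size=(n, 2))
eligible = rng.random(n) > 0.6                        # ~60% structural zeros (never eligible to sell)
counts = rng.poisson(lam=np.exp(0.3 + 0.5 * x[:, 0])) * eligible

exog = sm.add_constant(x)                             # Poisson (count) part
exog_infl = sm.add_constant(x[:, :1])                 # logit (zero-inflation) part
zip_fit = ZeroInflatedPoisson(counts, exog, exog_infl=exog_infl, inflation="logit").fit(
    maxiter=200, disp=False
)
print(zip_fit.summary())
predicted_units = zip_fit.predict(exog, exog_infl=exog_infl)   # expected units sold
```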


RESULTS
When developing the clusters, we decided to use 6 clusters based on the elbow plot shown in Figure 2. In this plot we were looking for an "elbow" or kink suggesting that, past that number of k clusters, there are diminishing returns; the plot itself shows the mean squared error (MSE) for different numbers of k clusters.

Figure 2: k-Means elbow plot

K-means clusters are built based on the vicinity of data points to a centroid. In Figure 3 below, we found that the green-colored clusters have imminent demand and should be expected in stores frequently.

Figure 3: Graph showing the optimal number of clusters for k-means

Kohonen clusters, as shown in Figure 4, were built based on the magnitude of failure sales and percentage-lifecycle-remaining values. The clusters with no green or white color cross-section represent products with low values of both variables.


Figure 4: Vicinity-based clustering based on neural networks (Kohonen Networks)

Lastly, in Figure 5, cluster 3 represents products which have an intermediate value of lifecycle remaining and relatively high failure sales. HDBSCAN, on the other hand, classified a large chunk of data points as noise (black-colored dots), making it difficult to infer any business segmentation from the clusters.

Figure 5: Density-based clustering algorithm with 4 unique clusters (HDBSCAN)

We chose k-means as the final clustering approach for the published results, as it made the business segmentation of SKUs more interpretable. The Kohonen model produced almost the same clusters as k-means and hence was not substantially differentiable. The HDBSCAN model classified the major chunk of the data points as noise, leaving only a small subset with clusters. Thus, due to the shortcomings of the Kohonen and HDBSCAN models, we proceeded with our analysis using the k-means clusters.

Classification Model Results
Table 2 summarizes the statistical performance of each classification model we investigated. The Bagged Logistic Regression without clustering had the greatest area under the curve (AUC) at 0.9841. LDA performed the best (AUC = 0.9825) among the predictive models that used the k-means clusters.


Table 2: Baseline model vs. models with clusters comparison

Since most of the models performed similarly, we used a probability calibration plot as another means to discriminate among competing models. Since sparse-demand assortments built from propensity scores are essentially rankings of which SKUs will or will not be in the assortment, choosing the model that is best calibrated would be ideal. Figure 6 shows the probability calibration plots for the models we investigated. The x-axis shows binned predicted probabilities, and the y-axis shows the percentage of actual sellers in each bucket. In theory, lines in this plot that are closer to the 45-degree line are better calibrated than those that are not, and such a model would provide better-ranked propensities for the assortment planning decision. We selected the Bagged Logistic model as our optimal classification model.

Figure 6: Calibration plot for all logistic regression models
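A minimal sketch of how a calibration plot like Figure 6 can be produced, assuming scikit-learn's calibration_curve; the propensities and outcomes below are synthetic stand-ins for the held-out test labels and predicted probabilities of a candidate model.

```python
# Sketch: probability calibration plot (binned predicted probability vs. actual seller rate).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(3)
p_test = rng.random(5000)                            # stand-in predicted propensities
y_test = (rng.random(5000) < p_test).astype(int)     # stand-in outcomes (calibrated by construction)

frac_pos, mean_pred = calibration_curve(y_test, p_test, n_bins=10)
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability (binned)")
plt.ylabel("Fraction of actual sellers")
plt.legend()
plt.show()
```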


To evaluate the regression models, root mean squared error (RMSE) and adjusted R-squared were evaluated on the training and test datasets. Both MLR backward and forward selection produced exactly the same set of feature combinations and therefore produced the same output. Based on the coefficients of the corresponding variables, we observed that the quantity sold in previous years had a great impact on this year's sales numbers. We observed that certain features had interesting business implications but could not disclose those in our paper.

Table 3: MLR forward and backward selection results on train and test for the base model and the model with clusters

Table 4 shows the results for the regression-type models with and without clusters. We observed that the zero-inflated Poisson regression outperforms all other models with an adjusted R2 of 0.905. Additionally, its RMSE was the lowest among all models.

Table 4: Comparison of the statistical performance of the regression models with and without clusters

CONCLUSIONS
Assortment planning is one of the most important operational aspects that all retailers face. A key component of the assortment planning decision is the ability to estimate demand. Intermittent demand is always a challenge, but extremely sparse demand is the most challenging demand forecast to generate, second only to predicting demand for new products. In our study we investigated the impact that clustering has on improving predictions for sparse demand models, and also investigated how one might frame such a scenario given that the target variable may have an abundance of zeros. This study lays the groundwork for recommending to our business partner whether to build classification-type or regression-type predictions for its assortment planning decision-support.

Thus, can informative clusters be generated using popular unsupervised learning algorithms, and is there a business story about those clusters? We employed k-means, Self-Organizing Maps (Kohonen networks), and HDBSCAN to build clusters. We found that k-means performed the best among the three, and that there was indeed a business story in these clusters.

Secondly, how does clustering of products improve the predictive performance of models in a regression and classification setting? We found that four of the six classification models we tried had better performance with clusters than without. However, the Bagged Logistic Regression, which performed the best among all competing models, was achieved without clustering. In the regression setting, only two of the five models had a higher adjusted R-squared using clustering than not clustering. However, our best model for this set of SKUs had an adjusted R2 improvement of 4.6%. These results do not allow us to make a definitive claim that clustering or not clustering is the best way to go in all situations. The evidence we provide suggests that clustering should most likely be used on a case-by-case basis, where it performs better.


Lastly, in which scenarios is the regression approach preferred to the classification approach when modeling very sparse demand products? We obtained a highly accurate classification model, with an accuracy of 98.5% and an AUC of 98.4%, via the Bagged Logistic Regression. The probabilities also seemed reasonably calibrated and thus could be used for ranking purposes. In the regression setting we obtained an adjusted R-squared using Poisson regression that is also considered very high (90.5%). Since we obtained models that performed very well in both cases, we believe this study should be extended by developing assortments using both models via our partner's assortment recommendation engine. We believe only then can we tie the statistical performance measures to the business performance measures (e.g., sales, lost sales, inventory costs, lost opportunity costs, etc.).

REFERENCES
All-in-one Category Management and Planogram Software. https://www.dotactiv.com/assortment-planning
Getting More from Retail Inventory Management, Jan 2013. http://deloitte.wsj.com/cio/2013/01/22/getting-more-from-retail-inventory-management/
Abbott, H. & Palekar, U. S. (2008). Retail replenishment models with display-space elastic demand. European Journal of Operational Research, 186(2), 586-607.
Baker, Stacy. "Assortment planning for apparel retailers - Benefits of assortment planning." ABI/Inform Collection, June 2005.
Berkhin, Pavel. Survey of Clustering Data Mining Techniques. 2002, www.cc.gatech.edu/~isbell/reading/papers/berkhin02survey.pdf.
Bernstein, Fernando, Sajad Modaresi, and Denis Saure. "A Dynamic Clustering Approach to Data-Driven Assortment Personalization." SSRN, 9 June 2017, papers.ssrn.com/sol3/papers.cfm?abstract_id=2983207.
Cook, Chris. "Building the Optimal Retail Assortment Plan with Predictive Analytics." 19 Feb. 2016, www.linkedin.com/Pulse/Building-Optimal-Retail-Assortment-Plan-Predictive-Analytics-Cook
Gajjar, H. K. & Adil, G. K. (2010). A piecewise linearization for retail shelf space allocation problem and a local search heuristic. Annals of Operations Research, 179(1), 149-167.
Gajjar, Hasmukh K. & Adil, Gajendra K. (2011). Heuristics for retail shelf space allocation problem with linear profit function. International Journal of Retail & Distribution Management, Emerald Group Publishing, 15 Feb. 2011, www.deepdyve.com/lp/emerald-publishing/heuristics-for-retail-shelf-space-allocation-problem-with-linear-oGV9fT1lk3.
Hariga, M. A., Al-Ahmari, A. & Mohamed, A. R. A. (2007). A joint optimisation model for inventory replenishment, product assortment, shelf space and display area allocation decisions. European Journal of Operational Research, 181(1), 239-251.
Kapoor, Rohit and Singh, Alok. "A Literature Review on Demand Models in Retail Assortment Planning." Academia.edu, 2014,

www.academia.edu/5179110/A_Literature_Review_on_Demand_Models_in_Retail_Assortment_Planning.
Kok, A. G. & Fisher, M. L. (2007). Demand estimation and assortment optimization under substitution: Methodology and application. Operations Research, 55(6), 1001-1021.
Krupnik, Yan. "Say Goodbye to Inventory Challenges With Predictive Analytics." Apparel Magazine (AM), 4 Jan. 2016, www.apparelmag.com/say-goodbye-inventory-challenges-predictive-analytics.
Li, Z. (2007a). A single period assortment optimization model. Production and Operations Management, 16(3), 369-380.
Jena, Sanjay Dominik, et al. "Partially-Ranked Choice Models for Data-Driven Assortment Optimization." Data Science for Real Time Decision Making, Sept. 2017, http://cerc-datascience.polymtl.ca/wp-content/uploads/2017/09/Technical-Report_DS4DM-2017-011.pdf
Opitz, D., & Macline, R. (1999). Popular ensemble methods: An empirical study. Artificial Intelligence Research, 11, 169-198.
Quinlan, J. R. (2006). Bagging, Boosting, and C4.5. Retrieved January 29, 2008, from http://www.rulequest.com/Personal/q.aaai96.ps
Shah, J. & Avittathur, B. (2007). The retailer multi-item inventory problem with demand cannibalization and substitution. International Journal of Production Economics, 106(1), 104-114.
Williams, Philicity K. (2008). Clustering and Predictive Modeling: An Ensemble Approach. 9 Aug. 2008, https://etd.auburn.edu/bitstream/handle/10415/1546/Williams_Philicity_49.pdf
Wu, Senlin. (2006). "Data Mining Using Threshold-Variable Aided Relationship-Based Clustering." Doctor of Philosophy in Management Information Systems, University of Chicago.
Yücel, E., Karaesmen, F., Salman, F. S. & Türkay, M. (2009). Optimizing product assortment under customer-driven demand substitution. European Journal of Operational Research, 199(3), 759-768.


Reducing the Cost of International Trade Through the Use of Foreign Trade Zones

2018 Midwest DSI Conference – April 2018
Gary Smith, Instructor, Penn State Erie – The Behrend College, Erie, PA

Session Goals

• Provide a brief historical perspective on US Foreign Trade

• Discuss Foreign Trade Zones • What they are • How they reduce the cost of International Trade

• Review the Benefits of Foreign Trade Zones
  • Financial Savings
  • Administrative Savings

US Foreign Trade: Prior to the Great Depression
• From the time the United States gained independence from Great Britain until the mid-20th century, the US practiced protectionist trade policies
  • Political Independence = Economic Independence
• The Jeffersonian Trade Embargo (1807-1809) effectively eliminated international trade (with a significant negative impact on the US economy)
• Tariffs averaged 60% in 1830, declining to 20% by 1840
• 1913 Underwood-Simmons Tariff
  • Lowered tariffs (breaking a protectionist tradition)
  • Instituted an income tax
• 1922 Fordney-McCumber Tariff Act
  • Authorized the President to raise and lower tariffs by up to 50%
• The US government used tariffs not only to protect industry but also to generate revenue for the government

US Foreign Trade: Smoot-Hawley Tariff Act
• League of Nations World Economic Conference (1927)
  • "The time has come to put an end to tariffs, and move in the opposite direction."
• Herbert Hoover campaign promise (1928): to increase tariffs on agricultural goods and decrease rates for industrial goods
• Smoot-Hawley Act (1930)
  • Increased tariffs on agricultural products AND industrial goods
  • 1,028 economists signed a petition asking President Hoover to veto the bill
  • Henry Ford spent an evening at the White House, calling the Act "an economic stupidity"
  • Hoover opposed the bill and called it "vicious, extortionate and obnoxious"
  • Hoover yielded to political pressure and signed the bill
• Economists agreed that Smoot-Hawley either was a cause of or prolonged the Great Depression

The Foreign-Trade Zones Act of 1934
Designed to:
• Encourage and expedite U.S. participation in international trade
• Foster dealing in foreign goods imported not only for domestic consumption but also for export after combining with domestic goods
• Defer payment of duties only until goods are entered into the commerce of the U.S.

The Foreign Trade Zone is…

• A secure area located in or near a U.S. Port of Entry
• For duty purposes only, a Foreign Trade Zone is legally outside the U.S. Customs Territory
• In a Foreign Trade Zone, merchandise may be assembled, exhibited, cleaned, manipulated, manufactured, mixed, processed, relabeled, repackaged, repaired, salvaged, sampled, stored, tested, displayed, & destroyed

Foreign Trade Zones (FTZs) – An Overview

Establishment of Foreign-Trade Zones
• FTZ Designation is obtained through application by a Grantee to the U.S. Foreign-Trade Zones Board
• FTZ Activation is achieved through application to the local Port Director of U.S. Customs & Border Protection

Types of Foreign Trade Zones
1. General Purpose Zone
• Often an industrial park setting or port complex
• Designed as multi-purpose or for use by multiple companies
• May be comprised of multiple sites
• Serves as the sponsoring zone for the subzone

2. Subzone
• Normally single- or specific-purpose sites
• Often isolated manufacturing locations or dedicated distribution centers
• Operations cannot feasibly be moved to or accommodated in a general purpose zone

Foreign Trade Zones – Financial Savings
• Duty Deferral
• Duty Elimination on Exports & Scrap
• Duty Reduction (Inverted Tariff Relief) – for Manufacturing/Production Zones only
• Local Ad Valorem Tax Exemption on Inventory
• Administrative Savings due to elimination of drawback, fewer entries, reduced merchandise processing fees, and lower brokerage fees

Financial Savings – Duty Deferral

Financial Savings – Duty Elimination
• Goods may be exported from a zone free of duty and federal excise tax
• Goods may be destroyed in a zone without payment of duty and federal income tax

Financial Savings – Duty Elimination (Destruction)

Financial Savings – Duty Reduction (Inverted Tariff)
With approval from the Foreign-Trade Zones Board, when merchandise is admitted into the zone, the importer may elect a zone status that requires payment of:

• The duty rate applicable to either the materials as admitted, OR
• The duty rate applicable to the finished product as removed from the zone,
depending upon which is lower.

Inverted tariffs may be applied to ingredients and packaging materials.
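Illustrative example (hypothetical duty rates and dollar values, not figures from this presentation):
• An imported component carries a 5% duty rate; the finished product it goes into carries a 2.5% rate
• On $1,000,000 of components admitted to a production zone, electing the finished-product rate yields duty of $1,000,000 × 2.5% = $25,000 instead of $1,000,000 × 5% = $50,000
• Duty savings: $25,000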

Financial Savings – Duty Reduction (Inverted Tariff)

Financial Savings – Ad Valorem Tax Exemption

• In several states, tangible personal property (inventory) is exempt from state and local ad valorem taxes

• Specifics of the exemption vary from state to state; under the FTZ Act, such merchandise (or merchandise altered by any of the above processes) is exempt from state and local ad valorem taxation

Administrative Savings

• Weekly Entry (on inbound material)
  • Only one entry per site required per week
  • Reduces paperwork & recordkeeping
  • Expedites arrival of goods
  • Fewer entries = reduced costs (Merchandise Processing Fee, Broker's Fees)
• Weekly Export (on shipments)
  • Only one entry per site required per week
  • Reduces paperwork & recordkeeping
  • Expedites departure of goods
  • Fewer entries = reduced costs (Broker's Fees)

Foreign Trade Zones – Key Statistics

• Over $610 billion in merchandise (by value) flows through the FTZ program annually
• Approximately 63% of the value received is domestic or foreign duty-paid merchandise
• Approximately $76 billion is exported
• There is at least one FTZ in every state
• There are over 255 approved GPZs and hundreds of approved subzones
• There are 195 active FTZs (GPZs and subzones) with a total of 324 active manufacturing/production operations
• Warehouse/distribution operations received $224 billion in merchandise
• Manufacturing/production operations received $386 billion, which represents 63% of all zone activity
• Over 440,000 persons were employed at some 3,300 firms that used FTZs in 2016

Foreign Trade Zones – Key Statistics

Risky Business: Predicting Cancellations in Imbalanced Multi-Classification Settings
Anand Deshmukh1, Meena Kewlani, Yash Ambegaokar, Matthew A. Lanham
Purdue University, Department of Management, 403 W. State Street, West Lafayette, IN 47907
[email protected]; [email protected]; [email protected]; [email protected]

ABSTRACT
We identify a rare event of a customer reneging on a signed agreement, which is akin to problems such as fraud detection, diagnosis of rare diseases, etc., where there is a high cost of misclassification. Our approach can be used in all cases where the class to be predicted is highly under-represented in the data (i.e., the data are imbalanced) because it is rare by design; there is a clear benefit attached to this class' accurate classification and an even higher cost attached to its misclassification. Pre-emptive classification of churn, contract cancellations, identification of at-risk youths in a community, etc. are potential situations where our model development and evaluation approach can be used to better classify rare but important events. We use Random Forest and Gradient Boosting classifiers to predict customers as members of a highly underrepresented class and handle imbalanced data using techniques such as SMOTE, class weights, and a combination of both. Finally, we compare cost-based multi-class classification models by measuring the dollar value of potential lost revenue and costs that our client can save by using our model to identify at-risk projects and proactively engaging with such customers. While most research deals with binary classification problems when handling imbalanced datasets, our case is a multi-classification problem, which adds another layer of intricacy.

Keywords: Predicting cancellations, Class imbalance problem, Rare class mining, Data imbalance, SMOTE, Random Forest, Gradient Boosting

1 Corresponding author


INTRODUCTION
The ability to predict future sales from various leads is a challenging problem. The sales process usually has multiple stages, with competing interests among buyers and sellers. Identifying strong leads and allocating resources to potential customers is always an important problem for a sales team. If an associate obtains positive feedback or even a verbal commitment to a purchase, it adds complexity when the customer reneges on the commitment at a later stage. Many firms must bear the sunk costs associated with a customer's change of decision. Examples might include shipping costs, additional inventory costs, as well as wasted team member time. Moreover, the larger the size of a project, the greater the number of resources allocated to it (Duran, 2008). In this study, we partnered with a local business (hereon referred to as the "partner company") to understand the reasons for their customers cancelling a project after initially agreeing to it. These are resource-intensive and high-cost installation projects, and such unforeseen cancellations pose a significant risk to the partner company. The sales pitch for our partner company is an intensive process that their sales force spends a considerable amount of time and resources on; hence, such cancellations also waste the sales force's time and reduce morale. Today, firms are employing analytics to tackle these problems of uncertain demand and resource allocation. Firms that collect the right data at the right time have essentially invested in helping themselves improve processes and services in the future. For our problem, if a firm has several stored transactions, it has been shown that probability estimation techniques can be used to provide insights into an opportunity's potential ((Duran, 2008); (Lodato, M. W. & M. W. Lodato, 2006); (Söhnchen & Albers, 2010)). Studies in the classification modeling domain have focused on B2B sales forecasting and organizational learning using machine learning (Bohanec, Robnik-Šikonja, & Borštnar, 2017). That machine learning can outperform subjective association decision-making in the B2B space was shown by (Yan, et al., 2015). Aspects of our problem have been seen in the healthcare realm. For example, (Sahraoui & Elarref, 2014) pose the problem of patients scheduling elective surgery at a hospital. Here, patients committed to having surgery and the hospital has dedicated resources (e.g., allocated a room for surgery, a bed/room for the patient, scheduled a surgeon and/or anesthesiologist, and allocated time for

surgical services). However, if the patient does not show up, the hospital has effectively lost business. In

their study, rather than building models, they take a theory-of-constraints approach to help identify the underlying root causes of what led to a cancellation and to better plan for possible future instances. This study could be viewed as falling under the Customer Relationship Management (CRM) umbrella because we need to understand the information flow process in order to improve customer acquisition and retention (Chakravorti, 2006). The aim of this study is to identify at-risk projects so that our partner company's sales force can take pre-emptive measures to save the customer's business. We also identify projects that will be successfully completed as well as those that will be declined by the partner company, thus taking a step towards improving the sales pipeline. We organized this paper by initially reviewing past work on various topics related to customer cancellations and churn. Second, we discuss the data used in our study to help our partner company understand their customers better. Third, we outline the methodology we implemented and discuss the several models we investigated to predict the likelihood of a customer reneging on a signed project. Finally, we present our results, discuss our conclusions, and describe how we plan to extend this study.

LITERATURE REVIEW
A strategic goal for most businesses is to improve the productivity of their sales force. Identifying new sales opportunities and ensuring that sales professionals are deployed to serve the best potential revenue-generating accounts is critical to a company's revenue growth. An analytical challenge is to predict the likelihood of a customer buying a product or a service. If a large amount of stored transaction data is available, probability estimation techniques can be used to predict the outcome of an opportunity based on its sales funnel ((Duran, 2008); (Lodato, M. W. & M. W. Lodato, 2006); (Söhnchen & Albers, 2010)). Sales forecasting is a complex process for several reasons. There are multiple stages involved, and each stage has several participants (from the buyer's and seller's sides) who may not necessarily

have the same objectives and interests. Sales forecasts are a critical cog in managerial decision making, and incorrect forecasting can lead to wasted resources (Bohanec, Robnik-Šikonja, & Borštnar, 2017).


Customer cancellation is a classification type of problem, and machine learning techniques can be employed to improve the accuracy with which the company can predict whether a customer will cancel or not (Huang, Chang, & Ho, 2013). Additionally, stakeholders and decision makers in companies are not simply interested in the predictive performance of classification models; they also want to use them to support their decision making. Hence, the interpretability of the prediction models is important along with the accuracy of prediction (Bohanec, Robnik-Šikonja, & Borštnar, 2017). Before applying a model, a user must trust it, and this trust can be generated based on the transparency of the model. Hence, while sophisticated models such as random forests and support vector machines may demonstrate stronger predictive performance, they lack the interpretability of models such as decision trees and logistic regression (Caruana & Niculescu-Mizil, 2006). The study conducted by (Kotsiantis, 2007) describes various supervised machine learning classification techniques and compares them across several features. The important takeaway from this paper is that it is essential to understand under which conditions a technique would outperform the others for a given problem. A modified version of the comparative study of these techniques, as shown by the paper, is as follows:


Table 3.1: Comparing learning algorithms (ranked from * to **** (best model))

A peculiar limitation of predicting customer churn or cancellation is that the data are usually imbalanced. Typically, a very small percentage of customers fall into this category, and this small percentage of customers - the minority class - is very often the class of customers we are interested in predicting (Zhao, Li, Li, Liu, & Ren, 2005). Some other examples besides customer churn are fraud detection, diagnosis of rare diseases, and so on. However, according to (Chen, Liaw, & Breiman, 2004), most classification algorithms are built to minimize the overall error and not to focus on this minority class. Two approaches used to tackle imbalanced data are (1) down-sampling the majority class, over-sampling the minority class, or both, and (2) cost-sensitive learning, i.e., assigning a high cost to misclassification. A resampling technique developed by (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) is SMOTE (Synthetic Minority Over-Sampling Technique). Typically, the minority class is over-sampled with replacement, which means that its data points are replicated at random. In the SMOTE technique, the minority class is over-sampled by creating "synthetic" samples instead of creating samples via replication, thereby increasing the information along with the weight of the minority samples. To summarize the SMOTE technique, the k nearest minority-class neighbors of a minority point are identified, and synthetic examples are introduced on the line segments joining the point to any or all of these k nearest neighbors. Aside from techniques that could be employed to reduce the class imbalance, certain algorithms were found to perform well on such data. (Chen, Liaw, & Breiman, 2004) discovered that the following two approaches did a significantly better job at predicting the minority class than the existing algorithms: (1) Weighted Random Forests (based on cost-sensitive learning) and (2) Balanced Random Forests (based on down-sampling the majority classes). However, they could not discern a difference between the two approaches to identify a winner. Thereafter, (Xie, Li, Ngai, & Ying, 2009) proposed a new learning method called Improved Balanced Random Forests (IBRF) and used it to predict churn in the banking industry. Their study integrates the effectiveness of random forests in predicting customer churn behavior while incorporating

balanced and weighted random forests. Their approach alters the class distribution as well as penalizes misclassification of the minority class. They find that IBRF has better accuracy than traditional random forest algorithms. Additionally, they find that the top-decile lift of IBRF

is better than that of other classification methods like decision trees, artificial neural networks, and class-weighted core support vector machines (CWC-SVM). The performance of nine different Boolean classification evaluation metrics was compared by (Caruana & Niculescu-Mizil, 2006) across different settings and machine learning algorithms. Their paper finds that learning methods that perform well on one criterion may not perform well on another; hence, picking the correct evaluation metrics for your models is imperative. For data with class imbalance, (Tang, Zhang, Chawla, & Krasser, 2009) find that overall accuracy is not an appropriate model evaluation metric, as it cannot appropriately evaluate a model that is ineffective in detecting rare positive samples and assigns the model a high overall accuracy when it predicts all samples to be negative. Instead, they recommend the use of Precision and Recall. Table 3.2 below illustrates a confusion matrix for a binary classification problem: the columns show the predicted classes and the rows show the actual classes. True positives and true negatives imply that the predicted and actual class are the same. False positives and false negatives indicate the cases where the positive and negative cases were misclassified. The formulas and interpretations of Accuracy, Precision, and Recall (Larose & Larose, Data Mining and Predictive Analytics, 2015) are listed in Table 3.3.

Table 3.2: Confusion matrix
                    Predicted Positive        Predicted Negative
Actual Positive     True Positive (TP)        False Negative (FN)
Actual Negative     False Positive (FP)       True Negative (TN)

Table 3.3: Formulae and interpretation of accuracy, precision and recall scores
Overall Accuracy = (TP + TN) / (TP + TN + FP + FN): how often the classifier is correct.
Sensitivity / Recall = TP / (TP + FN): when an instance actually falls within a class, how often the model correctly classifies it as falling in that class.
Positive Predictive Value (PPV) / Precision = TP / (TP + FP): when the model predicts an instance to fall within a class, how often it actually falls within that class.

123

Table 3.3: Formulae and interpretation of accuracy, precision and recall scores The findings from the papers related to treatment of imbalanced classes and customer churn prediction are summarized in table 3.4, below: Studies Motivation for the research Algorithms Used Class Imbalance Treatment Results/Findings (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) Introducing SMOTE resampling technique 1. C4.5 Decision Tree 2. Ripper 3. Naïve Bayes classifier 1. SMOTE with under-sampling 2. Only under- sampling Combination of SMOTE and under-sampling performs better than only under- sampling (Chen, Liaw, & Breiman, 2004) Treating imbalanced data with Random Forest classifier 1. Random Forest 2. Ripper 1. Balanced Random Forests 2. Weighted Random Forests 3. SMOTE with under-sampling 4. SHRINK 5. One-sided sampling 6. Boosting 1. Balanced and weighted Random Forests perform significantly better than standard Random Forests. 2. No clear winner between balanced and weighted Random Forests (Burez & Van den Poel, 2009) Class imbalance in customer churn prediction 1. Logistic Regression 2. Random Forest 3. Gradient Boosting Classifier 1. Weighted Random Forest 2. Under-sampling (random and with CUBE algorithm) 1. Under-sampling, Boosting and Cost-sensitive learning improve the standard classifier's performance 2. You don't need to make the sample size the same for the classes 3. Best performing class distribution depends on the method and case 4. Weighted random forests perform significantly better than regular random forests (Xie, Li, Ngai, & Ying, 2009) Customer churn prediction using improved balanced random forests 1. Artificial Neural Network 2. Decision Tree 3. Support Vector Machine 4. Improved 1. Balanced Random Forests 2. Weighted Random Forests IBRF has a better accuracy and top-decile left

124

7

125

Balanced Random Forest (IBRF) (Longadge, Dongre, & Malik, 2013) Class 1. AdaBoost 1. Random Under- imbalance in 2. sampling data mining AdaBoost.NC 2. SMOTE 3. SVM 1. Boosting improves the performance of weak classifiers 2. Hybrid techniques (applying two or more techniques) improve performance (Prasasti & Ohwada, 2016) Machine Learning techniques for customer defection 1. Multiple Perceptron (MLP) Neural Network 2. J48 Decision Tree 3. Sequential Minimal Optimization (SMO) Support Vector Machine Random Forest 1. Performance of algorithms differed based on characteristics and type of data 2. J48 Decision Tree and SMO Support Vector Machine had more stable results across datasets Table 3.4: Summary of literature review on treatment of class imbalance and customer churn prediction We used the findings from the literature review to finalize the following aspects of our model: 1. Algorithms selected: a. Random Forest Classifier b. Gradient Boosting Classifier 2. Techniques to treat class imbalance: a. Resampling using SMOTE b. Cost-sensitive learning (assigning class weights) c. Combination of the two 3. Model Evaluation metrics: a. Precision b. Recall c. F1 score: Harmonic mean of Precision and Recall 8
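To make the SMOTE resampling idea from the literature review concrete, the following minimal sketch (our illustration, not the reference implementation from Chawla et al., 2002) generates synthetic minority-class samples by interpolating between a minority sample and one of its k nearest minority neighbors; the function name and toy data are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_minority, n_synthetic, k=5, random_state=0):
    """Create n_synthetic new minority samples on the line segments joining
    each sampled minority point to one of its k nearest minority neighbors."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbors because the nearest neighbor of a point is the point itself
    neighbors = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = neighbors.kneighbors(X_minority)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))       # pick a minority sample at random
        j = idx[i, rng.integers(1, k + 1)]      # pick one of its k minority neighbors
        gap = rng.random()                      # random position on the joining segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)

# Illustrative use on a toy 2-D minority class
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]])
X_new = smote_like_oversample(X_min, n_synthetic=10, k=3)
print(X_new.shape)   # (10, 2)
```

In practice one would rely on a maintained implementation, such as the SMOTE class in the imbalanced-learn library, rather than this sketch.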


While prior studies have examined the treatment of class imbalance, the response variable in their datasets was binary. We study the impact of class imbalance treatment in a multi-class classification setting.

DATA

A. PROPRIETARY DATA

The data used in this project came from our partner company. The data set consists of the attributes of all the projects undertaken by them over the past year and has approximately 300,000 observations. The database has various tables that capture information regarding the projects, customers, the product being fixed, leads, sources of leads, the partner company's employees, the representatives involved in each project, and so on. Features of a few tables are discussed below.

Project Information: The projects table lists all the features related to project timelines, price, current state, payment mode, financial status and owner signatures. The target variable for our study is the variable "current state", which is classified into four types:
1. Active: Current on-going projects (whose eventual project status we wish to predict)
2. Cancelled: Projects where the customer reneges on a signed contract
3. Closed: Projects that were approved and executed successfully
4. Declined: Projects where the partner company does not approve a customer's project proposal
For building our model we are concerned with projects that are either Cancelled, Closed or Declined.

Product Information: This table captures descriptive attributes of all the products in the partner company's database that have been installed or are currently marked as "Active" projects.

Customer Information: Customer data such as address, age, credit scores, and so on are captured here.

Leads and Lead Sources: This table lists all the past and potential customers of the client and the sources through which these customers were approached. The table gives useful insights into which

customer segments are targeted by the partner company and the marketing channels used to approach them.

Partner Company's Representatives: This table records the IDs, roles and starting years of the partner company's representatives who interact with the customers. We used the information from this table in combination with the projects table to discern whether certain representatives are more efficient and have a higher conversion ratio than others.

B. PUBLIC DATA

Apart from the data provided to us by the partner company, we also collected publicly available zip-code-level demographic data such as income level, unemployment rate, education level and population. The purpose of collecting this data was to create clusters of zip codes that represent similar kinds of people. The motivation was to explore the possibility of identifying behavioral patterns, such as the project cancellation rate, across different zip codes. We hoped to then uncover underlying characteristics of customers within these clusters which could explain reasons for the cancellations.

METHODOLOGY

Our study is divided into four distinct phases:
A. Data exploration and hypothesis development
B. Data cleaning and pre-processing
C. Model building
D. Model evaluation and comparison
Figure 5.1 (below) illustrates this process flow.


Figure 5.1: Methodology

A. DATA EXPLORATION AND HYPOTHESIS DEVELOPMENT

Exploratory Data Analysis
We first explored the data to understand the following:
• Interrelationships between tables
• Interrelationships between features
• Distribution and fill-rate of features
• Associations between predictors, as well as between the response variable ("current status") and the predictors

Hypothesis Generation
After understanding the data, we developed hypotheses to better understand the predictors and their associations with the response variable. They were as follows:
• H1: Do projects with a higher discounted price have lower cancellation rates?
• H2: Do projects in certain clusters of locations have more cancellations than projects in other locations?
• H3: Do representatives with higher conversion rates have lower project cancellation rates?

B. DATA CLEANING AND PRE-PROCESSING

During this stage we performed the following tasks:


• Data validation: ensuring the correctness and relevance of the data; identifying and treating outliers
• Treatment of missing values and nulls
• Eliminating features with high correlations or near-zero variance
• Variable transformations such as encoding and standardizing features

Feature Generation: During the feature engineering phase we created several features which could be fed directly into the model and checked for significance. The features created were linked to the hypotheses we generated, and also account for the neighborhood effect (where a customer's response is based on external influences that affect their decisions). Some of the features we created are:
• Offered Price Ratio: comparison of the offered price with the market price (a measure of the discount offered to the customer)
• Cluster of zip codes: clusters based on publicly available, zip-code-level data on income, population and unemployment rates
• Conversion Ratio: a metric to measure the performance of the partner company's representatives
• Referral Count: the number of referrals a customer received

C. MODEL BUILDING

Data Partition
We used the validation set approach, partitioning our data 70-30% into train and validation sets respectively.

Treatment of Class Imbalance
The imbalance between projects that are Cancelled, Closed or Declined is treated using the following techniques:
1. Resampling using the Synthetic Minority Over-Sampling Technique (SMOTE)
2. Assigning class weights (cost-sensitive learning)


3. A combination of the two

The treatment of class imbalance using the SMOTE resampling technique is performed on the train set only and not on the validation set, for the following reasons:
• In the SMOTE algorithm, which is used for over-sampling the minority classes, the k nearest minority-class neighbors of each minority sample are identified and synthetic minority observations are created on the line segments joining the sample to any or all of those k neighbors.
• If SMOTE were performed on the entire dataset, information from the validation set would bleed into the train set, thereby inflating the precision and recall of the model.

Algorithm Selection
We selected the Random Forest classifier and the Gradient Boosting classifier to train the models for this multi-class classification problem.

D. MODEL EVALUATION

Once the models are built, they are evaluated on a set of metrics so that one of them can be picked as the final model. Since this is a multi-class classification problem with a class imbalance, we use Precision and Recall as the evaluation metrics. Precision and Recall are defined and can be interpreted as follows (Larose & Larose, Data Mining and Predictive Analytics, 2015):

Precision = True Positive / (True Positive + False Positive): when the model predicts an instance to fall within a class, how often does it actually fall within that class. A Precision of 0.85 means that out of 100 projects the model classifies as falling within a particular project status, 85 are classified correctly.

Recall = True Positive / (True Positive + False Negative): when an instance actually falls within a class, how often does the model correctly classify it as falling within that class. A Recall of 0.85 means that out of 100 projects that actually fall within a given project status, the model correctly identifies 85 of them.

Table 5.1: Formulae and interpretation of precision and recall
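The model-building and evaluation steps described above can be sketched as follows. This is a minimal illustration assuming a scikit-learn / imbalanced-learn implementation; the synthetic data stands in for the proprietary project data, and none of the settings shown are the study's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Stand-in for the (proprietary) pre-processed project data: three imbalanced
# classes playing the role of Cancelled / Declined / Closed.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8,
                           n_classes=3, weights=[0.02, 0.22, 0.76],
                           random_state=42)

# 70-30 split into train and validation sets (stratified on the class label)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# SMOTE is applied to the training partition only, so information from the
# validation set cannot bleed into the resampled data.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

models = {
    # Cost-sensitive learning via class weights for the Random Forest
    "Random Forest (class weights)": RandomForestClassifier(
        n_estimators=500, class_weight="balanced", random_state=42),
    "Gradient Boosting (base)": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_res, y_res)
    preds = model.predict(X_valid)
    # Per-class precision, recall and F1 on the untouched validation set
    print(name)
    print(classification_report(y_valid, preds, digits=2))
```

The per-class output of classification_report corresponds directly to the Precision, Recall and F1 metrics defined in table 5.1.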


Cost Matrix
Additionally, to compare the models based on their business impact, we built a cost matrix to quantify the gain or loss of correctly classifying or misclassifying a project's status:

                        Predicted Cancelled    Predicted Closed    Predicted Declined
Actual Cancelled        $3,700                 -$500               $0
Actual Closed           -$500                  $0                  $0
Actual Declined         -$500                  -$40                $40

Table 5.2: Model evaluation cost matrix

These costs were based on the assumption that, out of all the projects our model identifies as eventually being "Cancelled", 10% could be saved by taking pre-emptive measures. While the formulae behind the cost matrix cannot be disclosed, some elements of the matrix are discussed below:
• Columns are the project statuses as predicted by the model.
• Rows are the actual project statuses.
• If our model classifies a project as "Cancelled" and it eventually does get cancelled, there is a gain since:
  o 10% of these projects could be saved, so the revenue from those projects would be realized.
  o The partner company can be cautious about deploying resources once the customer signs the contract, saving money and resources in the eventuality that the customer reneges at a later stage. The partner company's representatives can obtain reassurance from the customer that they indeed want to proceed before the company starts planning the installation.
  o We have accounted for the opportunity loss of projects that could not be saved even after the partner company's representatives connect with the customer.
• If our model classifies a project as "Cancelled" and it eventually gets "Closed" or "Declined", there is an opportunity loss of having sent a representative to the customer to save the business.


• If our model classifies a project as "Declined" and it eventually gets "Cancelled" or "Closed", there is no additional gain or loss, as we recommend that the company take no action before approving the project.

Cost Saving Per Project
For each model, a 3 x 3 confusion matrix CF is generated, where CF_ij is the number of projects with actual status i that are predicted to have status j. The cost matrix C is given by table 5.2. Hence, for each model we calculate the cost saving per project over the N projects as:

Cost Saving Per Project = (1 / N) * Σ_{i=1..3} Σ_{j=1..3} C_ij × CF_ij

Finally, we pick the model that has the best performing evaluation metrics across the three classes and helps our industry partner save the highest potential revenue and cost by correctly classifying the projects that would close, get declined by management, or get cancelled by customers.

E. DEPLOYMENT

The final step is deploying our model into production and predicting the status of active projects.
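As an illustration of this calculation, the sketch below applies the cost matrix of table 5.2 to a hypothetical validation-set confusion matrix; the confusion-matrix counts are made up for the example and are not the study's actual results.

```python
import numpy as np

# Cost matrix from table 5.2; rows = actual status, columns = predicted status,
# both ordered as (Cancelled, Closed, Declined).
cost = np.array([
    [3700, -500,   0],
    [-500,    0,   0],
    [-500,  -40,  40],
])

# Hypothetical 3 x 3 confusion matrix CF for one model on a validation set,
# with the same row/column ordering (these counts are illustrative only).
cf = np.array([
    [ 110,   70,    5],
    [ 300, 7200,  150],
    [ 120,  400, 1800],
])

n_projects = cf.sum()
cost_saving_per_project = (cost * cf).sum() / n_projects   # (1/N) * sum_ij C_ij * CF_ij
print(f"Cost saving per project: ${cost_saving_per_project:.2f}")
```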


MODELS

Our study requires us to classify a project as one of three eventual statuses:
1. Cancelled: Projects where the customer reneges on a signed contract
2. Closed: Projects that were approved and executed successfully
3. Declined: Projects where the partner company does not approve a customer's project proposal
This is a multi-class classification problem, and we use the following two machine learning techniques to solve it.

Random Forests: Random forest is a learning technique that consists of bagging un-pruned decision trees with a randomized selection of features at each split. Initially it draws n_tree bootstrap samples from the original data. For each bootstrap sample, it grows an un-pruned classification or regression tree.


Finally, the class which receives the most votes across all trees in the forest is used to classify the case (Breiman, 2001).

Gradient Boosting: This is an ensemble technique that starts with weak learners, usually decision trees, and combines them into a single stronger learner (Brownlee, 2016). Once the initial weak model makes a prediction, subsequent boosting steps fit the error residuals. These error residuals are minimized using a gradient descent approach. Hyperparameters specific to this algorithm can tune the individual trees or manage the boosting procedure according to requirements (Jain, 2016); these can be optimized using a grid search or a randomized search. Finally, the algorithm uses a weighted sum of the predictions to provide an overall prediction (Gorman, 2017).
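The grid search mentioned above could, for example, look like the following sketch, which continues from the X_res and y_res variables prepared in the earlier pipeline sketch; the parameter grid is an illustrative assumption, not the tuning grid used in the study.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],     # number of boosting stages
    "learning_rate": [0.05, 0.1],        # shrinkage applied to each stage
    "max_depth": [2, 3, 4],              # depth of the individual trees
}

# 5-fold cross-validated grid search, scored on macro-averaged F1 so that the
# minority class counts as much as the majority classes.
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="f1_macro", cv=5, n_jobs=-1)
search.fit(X_res, y_res)
print(search.best_params_, search.best_score_)
```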


RESULTS

FINDINGS FROM EXPLORATORY DATA ANALYSIS

Some interesting findings from the exploratory data analysis are discussed in this section.

Imbalanced Distribution of Project Statuses
As discussed earlier, we eliminate "Active" projects from our database of over 300,000 projects while building our models. The projects that we are interested in studying are either "Cancelled", "Closed" or "Declined". The pie chart in figure 7.1 illustrates the distribution of the projects across these three classes (project statuses).

Figure 7.1: Class distribution across projects (Closed 75.86%, Declined 22.31%, Cancelled 1.83%)

As the figure shows, there is a clear imbalance in the data across the three classes: cancelled projects form only 1.83% of all projects. Hence the decision to treat the class imbalance before feeding the data into the Random Forest and Gradient Boosting classifiers.

Relationship Between Price Discounts Offered and Project Statuses: the tougher the customer, the larger the discount
We developed a metric, "Offered Price Ratio", to measure the price offered to the customer relative to the market price. It was interesting to note that the customers who were offered the largest discounts (i.e., the lowest price ratio, 64% of the market price) most frequently cancelled their projects. A possible explanation is that the partner company's representatives offered the greatest discounts to unwilling customers during pitch meetings as an added incentive to sign the project's contract. This is visible from figure 7.2.

Figure 7.2: Relationship between price discounts offered and project statuses (Offered Price Ratio by status: Closed 82.10%, Declined 79.90%, Cancelled 64%; "Cancelled" projects get the greatest discounts)

Clusters of Zip Codes
We used the k-means algorithm to create clusters of zip codes so that we could test whether certain clusters of locations have more cancelled/declined/closed projects than other locations. The external income, population and unemployment rate data used to create the clusters were standardized and then fed into the k-means algorithm. Based on the elbow curve (figure 7.3) we created 5 clusters (figure 7.4). These clusters of zip codes were used as inputs in our models.
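A minimal sketch of this clustering step is shown below; the randomly generated DataFrame stands in for the public zip-code-level demographics, so the column values and the resulting clusters are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Stand-in for the public zip-code-level demographic data used in the study
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.normal(55_000, 12_000, 200),
    "population": rng.normal(20_000, 8_000, 200),
    "unemployment_rate": rng.normal(4.5, 1.5, 200),
})
features = StandardScaler().fit_transform(demo)   # standardize before k-means

# Elbow curve: within-cluster sum of squares (inertia) for k = 1..10
for k in range(1, 11):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features).inertia_
    print(k, round(inertia, 1))   # look for the "elbow"; the study settled on k = 5

# Final clustering with k = 5; the cluster label becomes a model input per zip code
demo["zip_cluster"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(features)
```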



Figure 7.3: Elbow curve method to select the optimal k

Figure 7.4: 5 clusters of zip codes

RESULTS OF MODELING

We ran the SMOTE algorithm on the train set. The three types of over-sampling we tested are:
1. Auto: over-sample all classes to match the majority class
2. Minority: over-sample only the minority class to match the majority class
3. Custom: over-sample using a user-specified class distribution (passed as a dictionary)
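These three settings correspond to the sampling_strategy argument of SMOTE in the imbalanced-learn library, as the short sketch below illustrates; it continues from the X_train and y_train of the earlier pipeline sketch, and the target counts in the custom dictionary are illustrative, not the counts used in the study.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y_train))

# 1. Auto: over-sample every class except the majority up to the majority size
X_auto, y_auto = SMOTE(sampling_strategy="auto", random_state=42).fit_resample(X_train, y_train)

# 2. Minority: over-sample only the minority class
X_min, y_min = SMOTE(sampling_strategy="minority", random_state=42).fit_resample(X_train, y_train)

# 3. Custom: a dictionary of {class label: desired sample count}; classes left out
#    of the dictionary (here the majority class, label 2) are not resampled.
custom_counts = {0: 2000, 1: 1500}             # illustrative target counts
X_cus, y_cus = SMOTE(sampling_strategy=custom_counts, random_state=42).fit_resample(X_train, y_train)

print("Auto:", Counter(y_auto))
print("Minority:", Counter(y_min))
print("Custom:", Counter(y_cus))
```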


The distribution of classes in the train set, before and after SMOTE, is illustrated by table 7.1 below.

Table 7.1: Class distribution before and after resampling using SMOTE

Variable Importance
Along with correctly classifying projects according to the three project statuses, it is also important to identify features which are good indicators of the final status of a project. Our findings are summarized below:
• "Approval Time" measures the time taken by the partner company to approve or reject a project proposal. During an initial iteration of the model, we found "Approval Time" to have the largest impact on the status of the project, explaining 70% of the variation in the data. This finding was in accordance with the hypothesis that the longer a project takes to approve, the greater the impact on its status. However, we needed to explore this variable further and, more importantly, we realized that approval time is not always a controllable factor. Hence, we dropped it from our final model.
• For our final dataset, we found that variables pertaining to price, such as "price" and "retail price", dominate the model. "Offered Price Ratio" (the measure of the discount provided to customers) was a good indicator of a project's final status, explaining up to 12% of the variation.
• We also found that attributes related to the partner company's representatives, such as experience and conversion ratio, are important identifiers in the models.

Model Evaluation using Precision and Recall Scores
The models are evaluated based on their Precision and Recall scores as well as the cost savings per project. The models and their Precision and Recall scores on the validation set can be seen in table 7.2 below.


Table 7.2: Precision and Recall scores of the models

Summary of findings from table 7.2:
• The Gradient Boosting classification models outperform the Random Forest classification models, irrespective of the class imbalance treatment.
• The base models perform better than the models where the class imbalance has been treated.
• SMOTE findings across both classifiers:
  o Over-sampling the minority class ("Cancellation") to match the class size of the majority class ("Closed") degrades the precision and recall scores of the minority class.
  o Precision and Recall scores for custom SMOTE over-sampling are better than for "Auto" or "Minority" SMOTE over-sampling.
• Precision scores for the projects that are "Cancelled" are high across the models. This implies that projects predicted to be cancelled are very likely to actually be cancelled. Hence, our final model can be used by the partner company to save the projects that are predicted to be cancelled.


• Recall scores for the projects that are "Cancelled" need to be improved further. A significant number of projects where the customer eventually reneged on the signed contract are predicted to be "Closed" by our model.

Model Evaluation using Cost Savings Per Project
The combinations of models we ran and the cost saving per project for each of them can be seen in table 7.3.

Table 7.3: Comparison of models based on cost savings per project

We observe from the table above that while SMOTE and class weights increased the cost savings for the Random Forest classifier (in isolation as well as in unison), the base model of the Gradient Boosting classifier outperformed all the models and performed best without any treatment of class imbalance.

Final Model Selection: Gradient Boosting Classifier (Base Model)
The Gradient Boosting classifier's base model (no class imbalance treatment) has the best precision and recall scores as well as the highest cost saving per project.

Classification Technique: Gradient Boosting    SMOTE: None    Class Weight: None    Cost Saving Per Project: $35.44

Class          Precision    Recall    F1
Cancelled      0.96         0.58      0.72
Closed         0.84         0.95      0.89
Declined       0.69         0.42      0.52

Table 7.4: Performance of the Gradient Boosting classifier (base model)


The summary of this model's performance can be seen in table 7.4 above:
• Since the Precision for 'Cancelled' is high (96%), a project classified as 'Cancelled' is highly likely to actually be cancelled.
• But since the Recall value is low (58%), only 58% of all cancelled projects are actually predicted as cancellations.
• 'Closed' has a precision of 84% and a recall of 95%, implying that the model predicts 'Closed' for most of the projects. This could also indicate an imbalance.

On average, our partner company processes over 300,000 projects annually. By deploying our best performing model, which has a cost saving of $35.44 per project, our partner company can save $10.63 million per annum.

Classification Technique    SMOTE    Class Weight    Annual Cost Saving
Gradient Boosting           None     None            $10.63 Million

CONCLUSION

In this study, we develop a model to predict whether a project undertaken by our industry partner will be completed successfully, get declined by the company, or be cancelled because the customer reneges on the contract. Along with predicting the status of projects, we were able to identify key features that determine the status of a project. We use a Random Forest classifier and a Gradient Boosting classifier for this multi-class classification problem. The imbalance in classes is treated using SMOTE, setting class weights, and a combination of the two. We find that over-sampling the minority class ("Cancellation") using SMOTE to match the class size of the majority class ("Closed") degrades the precision and recall scores of the minority class. We find more encouraging results when we define a custom class distribution. It would be interesting to optimize the custom class distribution so that models with class imbalance treatment beat the base classifiers. The models are evaluated by comparing the potential revenue and costs they save as well as the precision and recall scores of their predictions. The precision and recall scores of the highest cost-saving model are also the highest amongst the models developed.


By deploying our best performing model (the Gradient Boosting classifier without any treatment for class imbalance), our industry partner can save $10.63 million annually.

References

Bohanec, M., Robnik-Šikonja, M., & Borštnar, M. K. (2017). Organizational Learning Supported by Machine Learning Models Coupled with General Explanation Methods: A Case of B2B Sales Forecasting. Organizacija, 50(3).

Breiman, L. (2001). Random Forests. Machine Learning, 5-32.

Brownlee, J. (2016, September 9). A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning. Retrieved from Machinelearningmastery.com: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

Burez, J., & Van den Poel, D. (2009). Handling class imbalance in customer churn prediction. Expert Systems with Applications, 4626-4636.

Caruana, R., & Niculescu-Mizil, A. (2006). An Empirical Comparison of Supervised Learning Algorithms. 23rd International Conference on Machine Learning. Pittsburgh, PA.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 321-357.

Chen, C., Liaw, A., & Breiman, L. (2004). Using Random Forest to Learn Imbalanced Data. University of California, Berkeley.

Duran, R. (2008). Probabilistic Sales Forecasting for Small and Medium-Size Business Operations. Soft Computing Applications in Business, pp. 129-146.

Gorman, B. (2017, January 23). A Kaggle Master Explains Gradient Boosting. Retrieved from Blog.Kaggle.com: http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

Huang, H.-C., Chang, A. Y., & Ho, C.-C. (2013). Using Artificial Neural Networks to Establish a Customer-cancellation Prediction Model. PRZEGLĄD ELEKTROTECHNICZNY, pp. 178-180.

Jain, A. (2016, February 21). Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python. Retrieved from Analyticsvidhya.com: https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/


Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification Techniques. Informatica, 31, 249-268.

Larose, D. T., & Larose, C. D. (2015). Data Mining and Predictive Analytics. John Wiley & Sons, Inc.

Larose, D. T., & Larose, C. D. (2015). Neural Networks. In D. T. Larose & C. D. Larose, Data Mining and Predictive Analytics (pp. 339-358). John Wiley & Sons, Inc.

Lodato, M. W. (2006). Integrated sales process management: a methodology for improving sales effectiveness in the 21st century. AuthorHouse.

Longadge, R., Dongre, S. S., & Malik, D. (2013). Class Imbalance Problem in Data Mining: Review. International Journal of Computer Science and Network (IJCSN).

Prasasti, N., & Ohwada, H. (2016). Applicability of Machine-Learning Techniques in Predicting Customer Defection.

Sahraoui, A., & Elarref, M. (2014). Bed crisis and elective surgery late cancellations: An approach using the theory of constraints. Qatar Medical Journal, 1.

Söhnchen, F., & Albers, S. (2010). Pipeline management for the acquisition of industrial projects. Industrial Marketing Management, 39(8), 1356-1364.

Tang, Y., Zhang, Y.-Q., Chawla, N. V., & Krasser, S. (2009). SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 281-288.

Xie, Y., Li, X., Ngai, E., & Ying, W. (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 5445-5449.

Yan, J., Zhang, C., Zha, H., Gong, M., Sun, C., Huang, J., . . . Yang, X. (2015). On Machine Learning towards Predictive Sales Pipeline Analytics. Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 1945-1951).

Yeh, I.-C., & Lien, C.-h. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36, 2473-2480.

Zhao, Y., Li, B., Li, X., Liu, W., & Ren, S. (2005). Customer churn prediction using improved one-class support vector machine. Lecture Notes in Computer Science, pp. 300-306.


VALPARAISO UNIVERSITY

Vehicle Routing, Scheduling and Decision Utility Environment

Ceyhun Ozgur
Claire Okkema
Yiming Shen
Valparaiso University

2.22.2018


Abstract

Many papers have been written on vehicle scheduling. In this paper, we summarize previous research in this area. This paper deals with vehicle scheduling and routing. We assume a given system for distribution system replenishment and a given set of distribution centers, and ask how vehicles should be scheduled or routed to achieve the company's logistics objectives. The problem, as typically formulated, is to determine the order in which the customers will be visited by delivery or pick-up vehicles, otherwise called the route.

Introduction

The questions include the determination of the adequate number of vehicles, the frequency with which each customer should be visited, and the times to be associated with the actual stops along the route. Our approach to vehicle routing and scheduling is to first frame the discussion with the Travelling Salesman Problem (TSP). This provides an analytical framework. We then consider solution methodologies and further examine some actual operating problems. Some example applications of vehicle scheduling include: (1) a garbage collection routing system that involves the TSP, (2) a lawn-mowing system for parks and recreation using the TSP, (3) scheduling of like products on an automated assembly line with sequence-dependent setups, and (4) scheduling like items in a police station.

We assume a given system for DC replenishment and a given set of DCs, and ask: How are vehicles best scheduled to achieve logistics objectives? The problem, as typically formulated, is to determine the order in which customers will be visited by delivery/pickup vehicles, often called the route. Other questions include determination of the proper number of vehicles, the

frequency with which each customer should be visited, and the times to be associated with the stops along the route. Our approach to vehicle scheduling is to first present a discussion of the traveling salesman problem. This provides an analytical framework. We then consider solution methodologies, and thereafter examine some actual operating problems.

Traveling Salesman Problem (Scheduling methods for a capacity constrained work center and automated assembly line with sequence dependent setups)

The Traveling Salesman Problem is one of those problems that is easily stated but difficult to solve; mathematicians have studied it extensively, yet it remains hard to solve. The problem is stated as follows: given a set of cities or distribution centers (DCs) to be visited, what is the least-cost or least-distance way of visiting each city exactly once, starting from one city and returning to it? The starting and ending city could be a central facility or location.

Solution Methodologies

The traveling salesman problem can be formulated as a zero-one integer programming problem. Optimal solution approaches include branch and bound procedures, similar to those discussed for distribution center (DC) location problems, and dynamic programming. Producing optimal solutions becomes computationally costly as the number of nodes or cities increases; that is, as the number of distribution centers increases, the computational cost goes up geometrically.
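For concreteness, one standard zero-one integer programming formulation of the TSP (the Miller-Tucker-Zemlin form, shown here purely as an illustration rather than as the formulation used in any particular cited study) is:

```latex
\begin{align*}
\min\ & \sum_{i=1}^{n}\sum_{j \neq i} c_{ij}\, x_{ij} \\
\text{s.t.}\ & \sum_{j \neq i} x_{ij} = 1, \quad i = 1,\dots,n \\
& \sum_{i \neq j} x_{ij} = 1, \quad j = 1,\dots,n \\
& u_i - u_j + n\, x_{ij} \le n - 1, \quad 2 \le i \neq j \le n \\
& x_{ij} \in \{0,1\}, \quad 1 \le u_i \le n - 1 \ \ (i = 2,\dots,n)
\end{align*}
```

Here x_ij = 1 if the route travels directly from city i to city j, c_ij is the corresponding cost or distance, the first two constraint sets ensure every city is entered and left exactly once, and the u_i are auxiliary ordering variables whose constraints eliminate subtours.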


Heuristic procedures have been devised for this problem that produce reasonably good results in far less time than the optimal procedures. One widely used heuristic is based on a time-saved (savings) concept. The basic consideration is the time or distance that would be saved if two distribution centers were visited in a single tour as opposed to visiting each on a separate tour.
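A minimal sketch of this savings idea (in the spirit of the classic Clarke-Wright savings heuristic, shown as an illustration rather than as any author's implementation) computes, for each pair of distribution centers, the distance saved by serving them on one tour from the depot instead of two separate out-and-back trips; the coordinates below are made up for the example.

```python
import math
from itertools import combinations

# Illustrative coordinates: a depot plus a few distribution centers (DCs)
depot = (0.0, 0.0)
dcs = {"DC1": (4.0, 1.0), "DC2": (5.0, 2.0), "DC3": (-3.0, 4.0), "DC4": (1.0, -6.0)}

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Savings s(i, j) = d(depot, i) + d(depot, j) - d(i, j): the distance saved by
# visiting i and j on one tour instead of two separate round trips from the depot.
savings = {
    (i, j): dist(depot, dcs[i]) + dist(depot, dcs[j]) - dist(dcs[i], dcs[j])
    for i, j in combinations(dcs, 2)
}

# Pairs with the largest savings are candidates to be merged into a tour first.
for (i, j), s in sorted(savings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{i}-{j}: saving {s:.2f}")
```

A full savings heuristic would then merge routes in this order, subject to vehicle capacity and any other operating constraints.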


Applications of Decision Utility in Park Systems

Although many quantitative scheduling techniques are designed for production scheduling, other types of scheduling problems have been studied, and they face some of the same obstacles noted above for production scheduling. For example, consider the problem of scheduling jobs in a governmental agency where the amount of work to be done almost always exceeds the resources available. In this case, the scheduling problem is deciding how much of each job to do, and not do, given the resources on hand. For example, in the summer a parks maintenance district must trade off how many times jobs like tractor mowing (mowing large open areas), trim mowing (mowing small areas around trees, sidewalks, buildings, etc.), litter removal, and ball field dragging are done in each park (Ozgur, 2018). The main problem is to determine the correct balance between the jobs given the resources available. This is clearly a case where balance is necessary, because doing a lot of litter removal and ball field dragging while doing no mowing would not be acceptable to the taxpayers.

With decision utility, one has to develop a model that measures algorithmic efficiency; such a model can implement formulas that solve problems involving tasks rather than products (Brown, 1986). In addition, measuring performance, or considering what to do with information on what was actually accomplished and the resources available, is routinely done in manufacturing and in decision utility. The importance of measuring performance lies in comparing the number of products produced with the number of products that ideally would have been produced given the available resources.

Application of Decision Utility in Police Stations

Consider the problem of scheduling police officers in a police department where the amount of work to be done almost always exceeds the resources available in a given time period, such as the summer months. In this case, the scheduling problem is deciding how much of each type of job to do while still protecting the public and ensuring public safety, given the resources on hand for the entire summer. For each police scheduling period, the police chief used as data a list of the police officers, estimates of the time for a police officer or police car to complete each job in the city, and what additional personnel and equipment were available and needed by the police department for each police activity. A computer schedule was run every two weeks and gave the police chief or the police supervisor the amount of each job the city could accomplish in the next two weeks with the resources predicted to be available (Ozgur, 2018).

We have included many sources that can be used to identify applications of vehicle scheduling and decision utility.


References

Adler, J., & Mirchandani, P. (2017). The vehicle scheduling problem for fleets with alternative-fuel vehicles. Transportation Science, 51(2), 441-456. doi.org/10.1287/trsc.2015.0615

Androutsopoulos, K. N., & Zografos, K. G. An integrated modelling approach for the bicriterion vehicle routing and scheduling problem with environmental considerations. Transportation Research Part C: Emerging Technologies, 82, 180-209.

Bertossi, A. A., Carraresi, P., & Gallo, G. (1987). On some matching problems arising in vehicle scheduling models. Networks, 17, 271-281. doi.org/10.1002/net.3230170303

Bish, E. K., Leong, T.-Y., Li, C.-L., Ng, J. W. C., & Simchi-Levi, D. (2001). Analysis of a new vehicle scheduling and location problem. Naval Research Logistics, 48, 363-385. doi.org/10.1002/nav.1024

Bodin, L., & Golden, B. (1981). Classification in vehicle routing and scheduling. Networks, 11, 97-108. doi.org/10.1002/net.3230110204

Borndörfer, R., Grötschel, M., Klostermeier, F., & Küttner, C. (1999). Telebus Berlin: Vehicle scheduling in a dial-a-ride system, 391-422.

Brown, J. (1986). Decision utility. (Unpublished paper).

Bunte, S., & Kliewer, N. (2009). An overview on public scheduling models. Public Transport, 1(4), 299-317. doi.org/10.1007/s12469-010-0018-5


Carpaneto, G., Dell'Amico, M., Fischetti, M., & Toth, P. (1989). A branch and bound algorithm for the multiple depot vehicle scheduling problem. Networks, 19, 531-548. doi.org/10.1002/net.3230190505

Carraresi, P., & Gallo, G. (1984). Network models for vehicle and crew scheduling. European Journal of Operational Research, 16(2), 139-151.

Christofides, N. (1969). An algorithm for the vehicle-dispatching problem. Journal of the Operational Research Society, 20(3), 309-318. doi.org/10.1057/jors.1969.75

Dell'Amico, M., Fischetti, M., & Toth, P. (1993). Heuristic algorithms for the multiple depot vehicle scheduling problem. Management Science, 39(1), 115. Retrieved from http://ezproxy.valpo.edu/login?url=https://search.proquest.com/docview/213215937?accountid=14811

Desaulniers, G., Lavigne, J., & Soumis, F. (1998). Multi-depot vehicle scheduling problems with time windows and waiting costs. European Journal of Operational Research, 111(3), 479-494. doi.org/10.1016/S0377-2217(97)00363-9

Desrosiers, J., Dumas, Y., Solomon, M. M., & Soumis, F. (2005). Time constrained routing and scheduling. Handbooks in Operations Research and Management Science, 8, 35-139. doi.org/10.1016/S0927-0507(05)80106-9

Foster, B. A., & Ryan, D. M. (2017). An integer programming approach to the vehicle scheduling problem. Journal of the Operational Research Society, 27(2), 367-384.

Freling, R. (2001). Models and algorithms for single-depot vehicle scheduling. Transportation Science, 35(2), 265-280. doi.org/10.1287/trsc.35.2.165.10135


Frizzell, P. W., & Giffin, J. W. (1995). The split delivery vehicle scheduling problem with time windows and grid network distances. Computers and Operations Research, 22(6), 655-677. doi.org/10.1016/0305-0548(94)00040-F

Goel, A. (2009). Vehicle Scheduling and Routing with Drivers' Working Hours. Transportation Science, 43(1), 17-26. doi.org/10.1287/trsc.1070.0226

Haghani, A., & Mohamadreza, B. (2002). Heuristic approaches for solving large-scale bus transit vehicle scheduling problem with route time constraints. Transportation Research Part A: Policy and Practice, 36(4), 309-333. doi.org/10.1016/S0965-8564(01)00004-0

Hill, A., & Benton, W. (1992). Modelling Intra-City Time-Dependent Travel Speeds for Vehicle Scheduling Problems. Journal of the Operational Research Society, 43(4), 343-351. doi.org/10.2307/2583157

Keaveny, I. T., & Burbeck, S. (1981). Automating trip scheduling and optimal vehicle assignments. Amsterdam: Elsevier Science.

Knott, R. P. (1988). Vehicle scheduling for emergency relief management: A knowledge-based approach. Disasters, 12, 285-293. doi.org/10.1111/j.1467-7717.1988.tb00678.x

Komijan, A. R., & Delavari, D. (2017). Vehicle routing and scheduling problem for a multi-period, multi-perishable product system with time window: A case study. International Journal of Production Management and Engineering, 5(2), 45-53.


Malmborg, C. J. (1996). A genetic algorithm for service level based vehicle scheduling. European Journal of Operational Research, 93(1), 121-134. doi.org/10.1016/0377-2217(95)00185-9

Ozgur, C. (2018). Resources in parks and police management: Applying decision utility to solve problems with limited resources. International Journal of Information Systems in the Service Sector, 10(2), 69-78. doi.org/10.4018/IJISSS.2018040105

Ozgur, C. (1990). Scheduling methods for a capacity constrained work center and automated assembly line with sequence dependent setups. (Unpublished dissertation).

Park, Y. (2001). A hybrid genetic algorithm for the vehicle scheduling problem with due times and time deadlines. International Journal of Production Economics, 73(2), 175-188. doi.org/10.1016/S0925-5273(00)00174-2

Psaraftis, H. N., Wen, M., & Kontovas, C. A. (2016). Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1), 3-31. doi.org/10.1002/net.21628

Raff, S. (1983). Routing and scheduling of vehicles and crews: The state of the art. Computers and Operations Research, 10(2), 63-211. doi.org/10.1016/0305-0548(83)90030-8

Ribeiro, C. C., & Soumis, F. (1994). A column generation approach to the multiple-depot vehicle scheduling problem. Operations Research, 42(1), 41.

Shahparvari, S., & Abbasi, B. (2017). Robust stochastic vehicle routing and scheduling for bushfire emergency evacuation: An Australian case study. Transportation Research Part A: Policy and Practice, 104, 32-49.


Waters, C. (1989). Vehicle scheduling problems with uncertainty and omitted customers. Journal of the Operational Research Society, 40(12), 1099-1108. doi.org/10.2307/2582919

Xiao, Y., & Konak, A. (2017). A genetic algorithm with exact dynamic programming for the green vehicle routing & scheduling problem. Journal of Cleaner Production, 167(20), 1450-1463.

Yellow, P. C. (1970). A computational modification to the savings method of vehicle scheduling. Journal of the Operational Research Society. doi.org/10.1057/jors.1970.52
