BUS RIDERSHIP PREDICTION USING MACHINE LEARNING

INTEGRATED WITH GEOGRAPHIC INFORMATION SYSTEM

by

Pengyu Li

A thesis submitted to the Faculty of the University of in partial fulfillment of the requirements for the degree of Master of Civil Engineering

Spring 2020

© 2020 Pengyu Li All Rights Reserved


Approved: ______Sue McNeil, Ph.D. Professor in charge of thesis on behalf of the Advisory Committee

Approved: ______Sue McNeil, Ph.D. Chair of the Department of Civil and Environmental Engineering

Approved: ______Levi T. Thompson, Ph.D. Dean of the College of Engineering

Approved: ______Douglas J. Doren, Ph.D. Interim Vice Provost for Graduate and Professional Education and Dean of the Graduate College

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my advisor, Professor Sue McNeil, for the countless hours of mentorship and advice throughout my undergraduate and graduate studies. The continuous help and encouragement she gave constantly motivate me to pursue the best outcomes in my academic career and personal life. I would also like to thank her for her trust in me when deciding to extend our working relationship into my graduate studies. Without her guidance and persistent suggestions, this thesis would not have been possible.

I would like to thank Professor Earl “Rusty” Lee, who introduced me to the world of geographic information systems (GIS). I worked for Dr. Lee during my junior and senior years, and learned so much about accessibility in transportation planning and traffic modeling. Without the basic knowledge of census data and GIS that I gained outside of class, I would not have been able to conduct the data geoprocessing in this study, or even to come up with the idea for this thesis.

I would like to thank Professor Nii Attoh-Okine and his student Dr. Ahmed Lasisi. My friend and colleague Ahmed helped me learn the basics of machine learning, which allowed me to further explore and apply the machine learning algorithms to this study.

To Cathy and David at DART First State, thank you for providing me with the important data for this study, and thank you for walking me through the planning operations of the bus agency and answering my numerous questions along the way.


Lastly, thank you to my parents, who did not have an obligation to finance my expensive overseas study but still supported me as always. Thank you for your endless love and encouragement, and for always respecting my choices.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION
    1.1 Problem Statement
    1.2 Research Questions
    1.3 Motivation and Research Objective
    1.4 Proposed Methodology
        1.4.1 Data Collection and Pre-processing
        1.4.2 Hypothetical Bus Stop Creation
        1.4.3 Spatial Data Analysis
        1.4.4 Visualization and Recommendation
    1.5 Outline

2 LITERATURE REVIEW
    2.1 Introduction
    2.2 Need for New Travel Demand Models
    2.3 Bus Stop/ Bus Route Design and Optimization
    2.4 Bus Route/ Bus Stop Designs in GIS Applications
    2.5 Machine Learning Applications to Bus Transit
    2.6 Influential Variables
    2.7 Summary

3 DATA SET DESCRIPTION
    3.1 Defining the Study Area
    3.2 Brief Review of DART’s Bus Network
    3.3 Description of DART’s Ridership Data
    3.4 Bus Stop Visualization and Geoprocessing in ArcGIS
        3.4.1 Mapping the Existing Bus Stops
        3.4.2 Creating the Hypothetical Bus Stops
    3.5 Description of Demographic Data at the Census Block Group Level
    3.6 Land Use Data
    3.7 Employment Data
    3.8 Geoprocessing and Spatial Analysis
    3.9 Data Assumptions and Limitations

4 METHODOLOGY
    4.1 Introduction
    4.2 Machine Learning
        4.2.1 Supervised Learning
        4.2.2 Machine Learning Algorithms
    4.3 K-Fold Cross-validation
    4.4 Prediction Outputs
    4.5 Additional Data Processing

5 RESULTS AND DISCUSSIONS
    5.1 Introduction
    5.2 Model Outputs and Performance
        5.2.1 Summary Statistics for Alternative Algorithms
        5.2.2 The Best Performing Model - Lightgbm
    5.3 Mapping the Ridership
    5.4 Feature Importance

6 CONCLUDING REMARKS
    6.1 Conclusions
    6.2 Contributions to the Transportation Planning Field
    6.3 Future work

REFERENCES

LIST OF TABLES

Table 1. Top Ten Bus Stops by Daily Ridership
Table 2. ACS Dataset and Selected Attributes
Table 3. Land Use Types
Table 4. Statistical Description of Prediction Outputs vs. Historical Ridership (On)
Table 5. Statistical Description of Prediction Outputs vs. Historical Ridership (Off)
Table 6. Statistical Description of Prediction Outputs vs. Historical Ridership (Total)
Table 7. Top Ten Bus Stops by Predicted Total Daily Ridership

LIST OF FIGURES

Figure 1: DART Ridership vs. Expense
Figure 2: DART Bus Stops in Delaware
Figure 3: DART Bus Stops in Wilmington-Newark Area
Figure 4: Relative Frequency of “On” Ridership at Bus Stops (Passengers/Day)
Figure 5: Relative Frequency of “Off” Ridership at Bus Stops (Passengers/Day)
Figure 6: Relative Frequency of “Total” Ridership at Bus Stops (Passengers/Day)
Figure 7: Examples of Spatial Data Stored for Bus Stops
Figure 8: Delaware Roads Mapped in GIS
Figure 9: Hypothetical Bus Stops in New Castle County and Dover
Figure 10: Hypothetical Bus Stops in Wilmington-Newark Area
Figure 11: Poverty Rate vs. Ridership
Figure 12: Daily Bus Commuters vs. Ridership
Figure 13: Bus Stop Service Areas vs. Census Block Groups
Figure 14: Bus Stop Service Areas vs. Census Blocks
Figure 15: Research Framework
Figure 16: Predicted Ridership for Passengers Getting on the Buses
Figure 17: Historical Ridership for Passengers Getting on the Buses
Figure 18: Predicted and Historical Ridership for Passengers Getting on the Buses
Figure 19: Predicted Ridership for Passengers Getting off the Buses
Figure 20: Historical Ridership for Passengers Getting off the Buses
Figure 21: Predicted and Historical Ridership for Passengers Getting off the Buses
Figure 22: Predicted Ridership for Total Passengers
Figure 23: Historical Ridership for Total Passengers
Figure 24: Predicted and Historical Ridership for Total Passenger
Figure 25: Feature Importance for "On" Prediction
Figure 26: Feature Importance for "Off" Prediction
Figure 27: Feature Importance for "Total" Prediction

ABSTRACT

Public transit is vital for people who do not have personal vehicles to travel and commute. Unlike large cities where subways and trolley cars are available, the state of Delaware has only one form of public transit besides a commuter rail line: the bus. Thus, it is important to provide extensive service and to continually seek bus route and service changes that ensure the greatest transit service coverage for the changing population.

This research finds that annual bus ridership in Delaware has been decreasing for several years. Since the population of the state is growing gradually each year, the decrease in ridership suggests that the bus agency faces challenges in providing service that is sufficiently attractive to maintain the number of riders. Appropriate bus route changes might attract more bus riders, thereby helping ridership return to its historical high or even exceed it.

Aiming to solve this problem, this research proposes a framework that predicts ridership at any location (hypothetically proposed bus stop locations) using different machine learning techniques. The data include historical on, off, and total ridership data for DART’s 1,864 bus stops for 2018, demographic data from the American Community Survey (ACS), employment and land use data, and the road network. The framework integrates the spatial data with the point data for bus stops to predict ridership. The best performing machine learning model is determined for this study, which demonstrates the power of integrating geographic information systems with machine learning and of presenting the results visually. The prediction results indicate that there might be new locations that are more suitable for bus stops.

While the results are promising, analysis suggests that the aggregation level that American Community Survey (ACS) data provides is not ideal for bus ridership prediction. Meanwhile, more variables that influence ridership can be added to this framework if bus agencies can obtain more detailed data. These limitations are listed at the end to give directions for future work.


Chapter 1

INTRODUCTION

1.1 Problem Statement

The Pareto principle (also known as the 80/20 rule, the law of the vital few, or the principle of factor sparsity) states that roughly 80% of the effects come from 20% of the causes (Box & Meyer, 1986). Although the principle is primarily used in business, the same idea can be applied to the public transportation sector. Typically, 20% of the transit lines generate about 80% of the revenue or transit users (AlKheder et al., 2018). For DART, Delaware’s public transportation operator, 80% of the trips are generated by even less than 20% of the bus routes (D. Dooley, personal communication, October 2019). This imbalance presents a challenge for public transportation operators. This research explores this imbalance in the context of DART.
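As an illustration of how such an imbalance could be checked, the following minimal Python sketch computes the share of riders carried by the busiest 20% of routes. The function name, the route labels, and the ridership counts are hypothetical and are not DART data.

```python
def share_from_top_routes(riders_by_route, top_fraction=0.2):
    """Return the share of total riders carried by the busiest routes."""
    if not riders_by_route:
        raise ValueError("riders_by_route must not be empty")
    counts = sorted(riders_by_route.values(), reverse=True)
    # Keep at least one route so that small networks still give an answer.
    n_top = max(1, round(len(counts) * top_fraction))
    total = sum(counts)
    return sum(counts[:n_top]) / total if total else 0.0


# Hypothetical daily boardings for ten made-up routes (not DART data).
example_routes = {
    "R1": 4200, "R2": 3800, "R3": 500, "R4": 400, "R5": 300,
    "R6": 250, "R7": 200, "R8": 150, "R9": 120, "R10": 80,
}
top_share = share_from_top_routes(example_routes)
print(f"Top 20% of routes carry {top_share:.0%} of riders")
```

A share close to or above 0.8 would be consistent with the 80/20 pattern described above.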

DART, officially DART First State, is owned and managed by Delaware Transit Corporation, a division of the Delaware Department of Transportation (DelDOT). DART is the only public transit system in Delaware. DART serves passengers with fixed-route bus, paratransit, and commuter rail service. As a subsidiary of DelDOT, DART receives funds from multiple sources, including DelDOT and the Federal Transit Administration (FTA), for subsidizing bus and rail services and procuring new buses. The agency manages over 70 bus routes, including 8 seasonal routes that serve recreational purposes. It also sponsors a commuter rail line that runs from Newark, Delaware to Philadelphia, with a few stops in both Delaware and Pennsylvania. Moreover, in order to meet ADA compliance and provide services to special needs, disadvantaged populations, and senior residents, DART operates on-demand paratransit services to ensure access to local destinations and bus stops for passengers ("DART First State - About DART First State", 2019). A dilemma that DART shares with other transportation agencies is that it cannot become self-sufficient in funding transit operations. This is one of the biggest challenges that DART faces.

Delaware, with approximately 1 million residents in 1,982 square miles and a density of 491.3 residents per square mile, is considered relatively densely populated, ranking 6th in the United States (The State of Delaware, n.d.; US Census Bureau, 2019a). The study area, New Castle County, is the most densely populated county in the state. By improving bus service in this area using predicted daily ridership, the greatest benefit could be realized by the majority of bus riders in the state.

While it is important to emphasize the optimization of bus schedules and bus routes to increase the overall cost efficiency of public transit services, it is also crucial to focus on encouraging ridership as a way not only to increase revenue, but also to serve groups who rely on public transit, such as low and moderate income groups, more adequately.


Recent advances in machine learning offer opportunities to improve forecasts where both historical data and disparate sources are available. As machine learning has not been widely applied to public transportation ridership forecasting, a relevant research question can be formulated.

1.2 Research Questions

The main research question formulated to address the challenges involved in forecasting DART ridership is:

Is it possible to develop a framework that utilizes an optimal machine learning algorithm to predict future bus ridership at bus stop level by relating demographic, population, and bus network data for existing and proposed bus stops?

1.3 Motivation and Research Objective

Most of the existing research on bus services revolves around solving scheduling and rerouting problems, whereas the works that forecast transit demand are mostly based on historical data or demographics. The typical bus service coverage area is within ¼ mile of a bus stop, much smaller than that of rail stations, which is within ½ mile of a station and even more when considering park and ride options, which most DART service areas do not have (El-Geneidy et al., 2013; "DART First State - Park & ride / Park & pool," 2020). Such differences can lead to some level of disagreement between transit demand forecasts and actual ridership within areas that have different public transit options. Conventional travel demand forecasting only estimates the total trips between one or more origin-destination pairs. Trips are then assigned to different modes of transportation (Cambridge Systematics, Inc., 2012). It is difficult for transportation analysts to predict travel demand for every single route within a state border. Therefore, assessing a proposed bus stop located in an area that is not close to a route with travel demand forecasts can be a challenge for bus route planners. Furthermore, since travel demand forecasting cannot predict trips at arbitrary locations, several bus stops are likely to be predicted to have similar ridership if they are close enough to one another. Chapters 2 and 3 discuss the reasons why conventional transportation planning methods have failed to help increase or maintain public transportation ridership in general and DART’s ridership in particular. This study suggests an alternative way to analyze and estimate transit demand by comparing potential ridership at actual and hypothetical bus stops in existing and future scenarios. It aims to help public transit agencies identify locations that may be more suitable for bus stop relocation and bus rerouting by comparing the predicted ridership directly.

The state of Delaware has a growing population, but its annual bus ridership and paratransit usage are decreasing each year ("DelDOT Fact Books", 2018). It is important to ensure the convenience of public transit for residents, especially for those who rely on fixed-route buses for commuting, buying groceries, and other essential activities. While some passengers in recent years may have shifted to emerging shared mobility modes such as “Uber” and “Lyft”, the emergence of these new modes should not be accepted as an excuse for declining bus ridership, because of both public environmental and personal financial concerns. While buses remain the same in terms of scheduling and routing, low-occupancy vehicles that carry a single passenger or a small number of passengers for shared mobility purposes emit far more greenhouse gases per capita, along with other harmful particles that worsen air quality and, over time, also affect the climate. Having to spend more in exchange for shorter travel times compared with bus rides gradually erodes individuals’ discretionary income. Thus, a bus network that attracts and carries more bus riders can address both issues and provide residents with more convenient public transit options.

This research proposes a method for estimating bus travel demand by assessing hypothetically created bus stops within the study area using machine learning integrated with Geographic Information Systems (GIS). Taking DART’s bus service routes in New Castle County as an example, the research spatially joins data from various sources, such as the American Community Survey (ACS), DelDOT, and DART’s existing ridership records, to prepare for the machine learning process.

1.4 Proposed Methodology

The workflow of this research is summarized below:

1.4.1 Data Collection and Pre-processing

Before any analysis can be conducted, demographic, bus route, and road network data must be collected. The required data come from several sources and in several formats. Therefore, it is important to join data from different sources and dimensions into a common platform and to pre-process the data before analysis.

1.4.2 Hypothetical Bus Stop Creation

After the data from different sources are spatially joined for the defined study area, the distance between hypothetical bus stops within the study area can be determined and the stops can be mapped in GIS. Those bus stops are then further processed by adding the joined data.
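One simple way to place hypothetical stops at a regular distance along a road can be sketched as follows. This is a minimal illustration that assumes planar (projected) coordinates; the function name, the example road, and the 400-unit spacing are hypothetical, and this is not the ArcGIS procedure used in the thesis.

```python
import math


def points_along_polyline(vertices, spacing):
    """Return points placed every `spacing` units along a polyline.

    `vertices` is a list of (x, y) pairs in a projected (planar)
    coordinate system, so distances are plain Euclidean lengths.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    points = [vertices[0]]
    to_next = spacing  # distance left before the next point is due
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        travelled = 0.0  # distance already walked along this segment
        while seg_len - travelled >= to_next:
            travelled += to_next
            t = travelled / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            to_next = spacing
        # Carry the unused remainder of this segment into the next one.
        to_next -= seg_len - travelled
    return points


# Hypothetical road with one right-angle turn; coordinates are made up.
example_road = [(0, 0), (1000, 0), (1000, 500)]
stops = points_along_polyline(example_road, spacing=400)
print(stops)
```

The spacing is measured along the road rather than in a straight line, so stops stay evenly spaced through turns.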

1.4.3 Spatial Data Analysis

In this step, three sub-analyses are conducted: exploratory data analysis in GIS, classification and regression problems on joined variables, and forecasting daily ridership for hypothetical bus stops using different types of algorithms and finding the optimal approach for the study.

1.4.4 Visualization and Recommendation

At the end of the analysis, hypothetical and existing bus stops are mapped and color-coded to show the difference between predicted and historical ridership. The maps provide evidence for adopting proposed bus route changes, as new bus routes might attract more riders and provide wider service to public transportation users.

1.5 Outline

The following paragraphs outline the remainder of this thesis. Chapter 2 covers a literature review summarizing the existing research in the fields of public bus transit planning, transit ridership forecasting/estimation, Geographic Information System (GIS) applications in transit ridership forecasting, transit demand modeling, and machine learning algorithms for ridership prediction. Chapter 2 also highlights some of the challenges and gaps in the current literature that form the foundation of this study of bus route planning.

Chapter 3 provides a detailed description of all the data used including types and formats, sources, and roles in the analysis process. It summarizes the demographic variables collected from various sources and suggests potentially more accurate data and data sources for future study. In addition, it presents summary statistics of data across all the bus stops within the study area. Visualized variables of selected bus stops are created to better demonstrate and investigate the relationships between ridership and variables intuitively.

Chapter 4 further explains the proposed research framework that is followed throughout the study. It summarizes the machine learning algorithms that existing transit related research has applied. It lists a group of machine learning algorithms that are used in the study based on the characteristics of the dataset. Additional techniques that attempt to overcome the inaccuracy of data and avoid overfitting of the machine learning models are explained in this chapter as well.


Chapter 5 highlights the outcome of the machine learning model. It compares the accuracy of each algorithm and finds the best one for bus ridership prediction. It also provides a comprehensive list of criteria for determining the significance of hypothetical bus stops in terms of potential weekday daily ridership. It suggests a simplified approach that could be replicated by DART or by bus route planners at other transit agencies. By mapping both previous and newly changed bus routes with DART’s collected ridership data, the recommended bus stops with higher ridership predictions can be evaluated and confirmed.

Chapter 6 summarizes the study, including the methodology applied and major findings. It discusses the contribution of this study to the field of bus route planning. Finally, it enumerates several limitations that may affect the accuracy of the results and lists some opportunities for enhancing the framework through future work.

Chapter 2

LITERATURE REVIEW

2.1 Introduction

Transportation forecasting, or travel demand modeling, traditionally involves four steps: Trip Generation, Trip Distribution, Mode Choice, and Route Assignment (McNally, 2007). The process can be extremely data-intensive and time-consuming, depending on the scale of the study area and the availability of detailed demographics and many other socio-economic factors. Additionally, it may not be the ideal tool for estimating demand for buses because the demand for buses, or public transit as a whole, is determined as some percentage of the total travel demand, with or without conducting origin-destination (OD) surveys. Moreover, the accuracy of bus travel demand forecasts could be affected by a variety of factors. Historical data demonstrate that DART’s annual ridership has been decreasing steadily over the past several years, which is elaborated on later in this chapter.

This chapter discusses approaches other researchers have chosen to design bus stop locations and bus route layouts in both urban and suburban areas using different analytical methods. Then, it summarizes the demographic factors others have used for design and analysis. Finally, it compares the studies to learn from the results obtained with different methods and to identify limitations that might be eliminated by using machine learning.

2.2 Need for New Travel Demand Models

In 1973, Daniel Brand observed that while transportation forecasting for the United States had cost hundreds of millions of dollars over the preceding 20 years, only a tiny portion of the funds had been spent exclusively on the development of new travel demand models (Brand, 1973). Given that this observation was made decades ago, one could argue that such a weakness has already been overcome by advanced technologies and new methods that use sophisticated computers to process data in ways that were not possible before the 1970s. However, although existing travel demand models may be able to accurately forecast future traffic in a given region, there is still considerable room for improvement.

Unlike travel demand for specific highway networks or urban areas, demand for buses is much more challenging for public transit agencies, and bus agencies in particular, to forecast. Bus schedules and locations of bus stops can change based on many factors, which also affects the ridership at each bus stop. As a result, travel demand predicted for bus route planning may not always be close to real-world conditions. For example, DelDOT’s annual transportation report, Delaware Transportation Facts, tracks the change in DART’s ridership, which has fallen significantly over the past several years ("DelDOT Fact Books", 2018). DART’s bus ridership decreased from over 12 million in 2014 to nearly 9.25 million in 2018. Even worse, fixed-route ridership decreased more rapidly than paratransit and SEPTA train ridership ("DelDOT Fact Books", 2018). This has become a serious issue for DART because more than 80% of transit users are transported by fixed-route buses every year, while more than 50% of DART’s expenses go to operating paratransit and subsidizing SEPTA trains, as shown in Figure 1.

Figure 1: DART Ridership vs. Expense (Source: https://dartfirststate.com/rightfit/pdf/dart_brochure.pdf )

Figure 1 demonstrates that DART’s ridership is decreasing in a way that would affect the stability of its income. This could hinder the agency’s ability to extend and provide better service to communities in Delaware. A new ridership forecasting model that factors multiple elements into ridership prediction, and ultimately route planning, could better explain the relationship between demographics, points of interest, socio-economic status, and ridership. The model could be useful not only for planning future changes to fixed-route service locations (bus stops), but also for identifying key factors that influence bus ridership in order to make better urban planning decisions.

2.3 Bus Stop/ Bus Route Design and Optimization

One advantage that bus transit has over other means of transportation is its flexibility to adapt routes based on passengers’ needs at different locations and times of day (Ceder & Wilson, 1986). Studies of travel demand for buses have historically focused on a variety of geographic scales. Because data combined at large scales can be inaccurate, and because stop-level model outcomes are convenient to implement, this study focuses on the stop level for bus route planning while also aiming to learn from other research that focuses on larger areas rather than on specific bus stops.

Szeto and Wu (2011) proposed a route design method using genetic algorithms to reduce the total number of transfers for passengers and the total travel time for the bus network in a suburban residential community in Hong Kong. By doing so, they attempted to find the optimal solution for a bus route with operating schedules that would satisfy both criteria. However, the results of the study were not promising because the two objectives conflict. More specifically, the proposed method did not significantly improve the number of transfers. Thus, in real-world situations it might be helpful to balance the weights of the two objectives.

Another study, “Bus route design in small demand areas”, indicated that the bus route design problem, which contains an open travelling salesman problem (OTSP), itself an NP-hard problem, is also NP-hard (Černá, Černý & Přibyl, 2011). This study solved problems for bus routes that pass through areas with less demand. The problem was formulated on the simple assumption that two economic requirements need to be met: “minimal length of the route with a minimum number of stops and minimal frequency, i.e., the number of services during each time unit”. However, transit agencies always have more problems to solve than these two economic requirements (Černá, Černý & Přibyl, 2011).

Similarly, Wang and Qu (2014) worked on low population density areas and solved the bus route problem as a travelling salesman problem. This study also confirmed that the problem is NP-hard, which means that no efficient algorithm is known for finding an optimal solution. It also mentioned the current lack of research on bus interference with other transportation modes and transportation management tools, which made it difficult for researchers to find solutions using existing methods and data.
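Because exact solutions to such routing problems become impractical as the number of stops grows, heuristics are commonly used instead. The sketch below shows a greedy nearest-neighbor heuristic for an open route (one that does not return to its start). It is only an illustration and is not the method used in the studies cited above; the stop coordinates are made up, and the heuristic gives no guarantee of finding the shortest route.

```python
import math


def nearest_neighbor_route(stops, start=0):
    """Greedy open route: always travel to the closest unvisited stop."""
    unvisited = set(range(len(stops)))
    unvisited.remove(start)
    route = [start]
    while unvisited:
        last = stops[route[-1]]
        # Break distance ties by the lower stop index for determinism.
        nxt = min(unvisited, key=lambda i: (math.dist(last, stops[i]), i))
        unvisited.remove(nxt)
        route.append(nxt)
    return route


def route_length(stops, route):
    """Total length of an open route (no return to the start)."""
    return sum(math.dist(stops[a], stops[b]) for a, b in zip(route, route[1:]))


# Hypothetical stop coordinates on a planar grid (made-up values).
example_stops = [(0, 0), (5, 0), (1, 0), (3, 0), (1, 4)]
example_route = nearest_neighbor_route(example_stops)
print(example_route, route_length(example_stops, example_route))
```

The heuristic runs in time quadratic in the number of stops, which is why approaches of this kind remain usable where exhaustive search is not.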

In 1979, Newell discussed issues with designing bus routes that serve multiple origin-multiple destination pairs with minimum costs (Newell, 1979). He found that demand is also a factor that influences bus services. Based on his findings, a hypothetical solution that illustrates the sensitivity analysis of different types of geometry of bus routes was produced. The analysis suggested that square grids of straight-line bus routes are not the optimal geometry in any situation.


The majority of bus route design research ignores the existence of bus stops and some characteristics that might influence travel demand. Fitzpatrick et al. (1997) suggested that several factors should be considered before selecting a bus stop location and its design. Those factors could include, for example, the existence of sidewalks, openings through walls, and impervious, non-slip walkways. The idea that a bus stop’s characteristics could affect potential travel demand, or ridership in this case, is explored further in this thesis.

2.4 Bus Route/ Bus Stop Designs in GIS Applications

A Geographic Information System (GIS) could offer a way to take bus stops’ characteristics into account by analyzing the data collected by the US Census Bureau and Department of Transportation, such as demographics and road inventory information. It could easily summarize the data relevant to the bus stops and find their relationships with travel demand. Meanwhile, GIS applications have been used for a variety of other purposes in the bus route design field.

Gebeyehu and Takano (2008) used origin-destination (O-D) demand data to design bus routes using GIS. They tried to meet three objectives at the same time: maximizing trip generation coverage, minimizing route overlapping, and minimizing travel time. Although the paper did not discuss the shortcomings of this method if used by transit agencies, one could assume that, given its simplistic assumptions, the results may not be applicable to real-world scenarios.

Huang et al. (2010) proposed a GIS-based framework to optimize bus routes for large cities using a genetic algorithm (GA). The framework is essentially a bus stop-based optimization. The paper does not list the types of population data used, and other issues remain to be addressed in future work.

2.5 Machine Learning Applications to Bus Transit

Machine learning is an emerging approach in the bus planning field. Machine learning algorithms build a mathematical model from a given set of data, known as “training data”, in order to predict outcomes for data not seen during training (Bishop, 2016). Current bus planning research that uses machine learning focuses largely on bus scheduling problems and bus delays (Yamaguchi et al., 2018); however, potential applications in the bus route design field are widely overlooked. This thesis introduces an approach that uses machine learning to predict travel demand at the stop level based on historical ridership.
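The idea of fitting a model on training data and then predicting for unseen cases can be illustrated with a minimal sketch. Ordinary least squares on a single variable is used here purely for simplicity and stands in for the more capable algorithms evaluated later in this thesis; the variable names and all numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: population near a stop vs. daily riders.
train_population = [200, 400, 600, 800]
train_riders = [12, 22, 32, 42]
slope, intercept = fit_line(train_population, train_riders)

# Predict for stops that were not part of the training data.
new_population = [500, 1000]
predicted = [slope * p + intercept for p in new_population]
print(predicted)
```

In practice the model would use many variables per stop, and its accuracy would be judged on stops held out from training rather than on the training data itself.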

2.6 Influential Variables

Transportation planning researchers have found some links between the characteristics of the built environment and travel demand. Cervero and Kockelman (1997) introduced the 3Ds (density, diversity, and design) to the travel demand research field. While the influence of density on demand forecasting had been studied widely, the effects of diversity and design were ignored at that time. According to their study, both diversity, which captures the complexity of land development, and design, which describes the existing transportation infrastructure, can affect the number of trips generated. For example, a neighborhood that is mixed-use and bicycle- and pedestrian-friendly often has fewer vehicle trips than less compact and more isolated areas. Based on this finding, the dataset used in this study also includes some variables that describe the 3Ds of the neighborhoods in Delaware.

Dajani and Sullivan (1976) developed a public transit model that aimed to estimate work trips in North Carolina Standard Metropolitan Statistical Areas. Using the most up-to-date census data available at that time, they found that the most influential variable for the use of public transit was auto ownership, a factor that could in turn be related to many other factors such as percent black, median income, density, and level of transit service. This provides a reasonable direction for the data collection process of this thesis, namely to find more public transportation related demographics at the census block or census block group level.

Points of Interests (POIs) could be another important variable that influence the bus travel demand partly because it tells how attractive a certain bus stop could be as a destination, and potentially how many people could be served by bus within a certain radius of the POIs. Land use data can be useful for ridership prediction model because it also carries important information which cannot be fully explained by demographics. For example, a commercial or residential zone will certainly have more riders than an agricultural zone, while the population for some of those zones may be quite close.


2.7 Summary

This thesis aims to learn from the works cited above and expand on their scope in order to construct a bus ridership prediction framework that can be used as a supporting tool for planning decisions by transit agencies like DART. The initial purpose of this study is not only to find more accurate approaches to travel demand forecasting, but also to help explain the steady annual ridership decline on DART’s bus network. By selecting the bus stop level as the analytical focus, outcomes that are both accurate and easy to implement can be identified either visually or mathematically.

Choosing appropriate and comprehensive variables is critical to developing this model because accuracy depends on a wide range of data that together can explain the volume and patterns of ridership. Land use and POIs could partly fill the gaps left by incomplete public databases. Additionally, giving each zoning type a reasonable weight could also markedly increase the accuracy of the prediction.


Chapter 3

DATA SET DESCRIPTION

This chapter documents the data used for this analysis. The data represent the current options and relevant factors identified from the literature. The chapter begins by defining the study area. Each of the different types of data is then reviewed. The next sections review the DART bus network, the demographics, and other variables, and explain the techniques for combining all the information into two datasets. Lastly, assumptions and limitations are discussed to provide directions for future improvements.

3.1 Defining the Study Area

The state of Delaware is divided into three counties: New Castle, Kent, and Sussex. New Castle County and Dover are specifically chosen for three primary reasons: (1) New Castle County is the most populated county in the state of Delaware and has the most bus riders compared with the other two counties; (2) the majority of bus rides in Sussex County are for recreational purposes, especially during the summer and fall schedules because of beaches and shopping outlets, which could introduce more complexity to the study; and (3) there is a transit center in Dover, the state capital and the county seat of Kent County, which serves about two thousand riders each day. It is important to include part of Dover in the study area because it provides information on how much the existence of a transit center can positively affect daily ridership.

The study area is defined as New Castle County and part of Kent County, and it includes every bus stop located at or north of the Dover Transit Center.

While it might be more straightforward to include just one or all three counties, this specific study area is chosen for the reasons discussed above. It is thought to be the most suitable study area for predicting weekday ridership.

3.2 Brief Review of DART’s Bus Network

DART operates the public transportation services in the state of Delaware. As outlined in Chapter 1, DART operates over 70 bus routes, including 8 seasonal routes that serve recreational purposes. Of the 70 routes, 43 are primarily in New Castle County ("DART First State - About DART First State", 2019). This study focuses on these 43 bus routes plus several bus routes in Kent County. The selected routes are in areas more suitable for weekday ridership prediction, whereas some bus routes in the lower part of Delaware serve recreational purposes and may be operated on a seasonal basis.

Figure 2 and Figure 3 show the GIS maps of DART’s bus stop locations as of 2018.


Figure 2: DART Bus Stops in Delaware


Figure 3: DART Bus Stops in Wilmington-Newark Area


A total of 1,864 existing bus stops have been selected based on their spatial locations in New Castle County and Dover. The dataset, which contains the coordinates of each existing stop and the associated demographic and ridership data, will be used as training data for machine learning in order to predict the potential ridership (daily average) of hypothetical bus stops.

3.3 Description of DART’s Ridership Data

Ridership data is provided by DART’s fixed route planning division. The data package, which was collected in 2018, covers ridership on both weekdays and weekends at each bus stop. Summary tables, which rank the stops by daily average ridership, are also provided by DART. In this study, weekday daily average ridership is used because it is thought to have a stronger relationship with the socioeconomic and population data, which will then be summarized at each bus stop.

The ridership table includes coordinate information, so the stops can be projected onto GIS maps for data processing. It also has both “on” and “off” counts in order to capture the characteristics of different locations, as some bus stops also serve as transfer points, where the numbers of people getting on and off the bus can be particularly uneven. As a result, “on”, “off”, and “total” counts are all included in the study. Table 1 presents an example of the ridership data: the ten bus stops with the highest number of daily riders.


Table 1. Top Ten Bus Stops by Daily Ridership

| STOP NAME | ON | OFF | TOTAL | LAT | LONG | RANK |
|---|---|---|---|---|---|---|
| DOVER TRANSIT CENTER | 957 | 882 | 1,839 | 39.152 | -75.529 | 1 |
| CHRISTIANA MALL @ P&R | 811 | 897 | 1,709 | 39.680 | -75.655 | 2 |
| AMTRAK STATION @ ML | 770 | 632 | 1,402 | 39.737 | -75.551 | 3 |
| KING-FRE KING ST @ 9TH ST | 520 | 847 | 1,367 | 39.744 | -75.547 | 4 |
| 9TH ST @ MARKET ST | 502 | 498 | 1,001 | 39.744 | -75.548 | 5 |
| ORANGE ST @ 10TH ST | 596 | 327 | 923 | 39.745 | -75.549 | 6 |
| 10TH ST @ KING ST | 571 | 249 | 819 | 39.745 | -75.547 | 7 |
| KING ST @ 6TH ST | 499 | 299 | 798 | 39.741 | -75.549 | 8 |
| 11TH ST @ KING ST | 456 | 341 | 797 | 39.746 | -75.546 | 9 |
| WALNUT ST @ 5TH ST | 414 | 358 | 771 | 39.739 | -75.548 | 10 |

The majority of the bus stops do not have ridership as high as the top ten bus stops; ridership is much lower at most bus stop locations. Figure 4, Figure 5, and Figure 6 display the relative frequency of the ridership data for on, off, and total values, all of which show that the majority of the values are close to zero.

Figure 4: Relative Frequency of “On” Ridership at Bus Stops (Passengers/Day)


Figure 5: Relative Frequency of “Off” Ridership at Bus Stops (Passengers/Day)

Figure 6: Relative Frequency of “Total” Ridership at Bus Stops (Passengers/Day)

3.4 Bus Stop Visualization and Geoprocessing in ArcGIS

The first step of data processing for this study is to map the bus stops, as well as other data, on the GIS platform in order to spatially join the features based on their geographic relations. Two sets of bus stop data are produced for the machine learning process: existing bus stops and hypothetical bus stops.


3.4.1 Mapping the Existing Bus Stops

The existing bus stops, already contained in the ridership data, are mapped based on their coordinates. All on, off, and total ridership data are retained. This set of data will be used as the training dataset at the end of the data processing. Figure 7 presents an example of how ridership information for a bus stop is geographically stored.

Figure 7: Examples of Spatial Data Stored for Bus Stops

3.4.2 Creating the Hypothetical Bus Stops

In order to create bus stops that are located in feasible and accessible areas, a GIS shapefile of the Delaware road network is obtained from firstmap.delaware.gov. The road network is drawn as polylines, each of which has a number of points that capture and match the shape of the road in the GIS platform. Figure 8 shows how roads in Delaware are stored in the GIS map.

Figure 8: Delaware Roads Mapped in GIS

Based on the road network, a layer of point features is created by locating the midpoints of the polylines. The Generate Points Along Lines tool is used in this step. The newly created layer, which has 3,411 points corresponding to the hypothetical bus stops in New Castle County, is saved and stored with the unique coordinates of each point. At the end of the analysis, the hypothetical bus stops will be color-coded based on the prediction, making it easy for bus route planners and analysts to visually identify the locations that can potentially attract the most bus riders. Figure 9 shows an overview of the hypothetical bus stops created from the road inventory map.
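The midpoint step can be illustrated outside ArcGIS with a minimal sketch. It is not the tool used in this study: the function name `polyline_midpoint` and the sample road are hypothetical, and the coordinates are assumed to be planar (projected), so straight-line distances are meaningful.

```python
import math


def polyline_midpoint(vertices):
    """Return the point halfway along a polyline given as [(x, y), ...]."""
    if len(vertices) < 2:
        raise ValueError("a polyline needs at least two vertices")
    # Length of each segment between consecutive vertices.
    segments = list(zip(vertices, vertices[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    half = sum(lengths) / 2.0
    walked = 0.0
    for (a, b), length in zip(segments, lengths):
        if length > 0 and walked + length >= half:
            t = (half - walked) / length  # fraction along this segment
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        walked += length
    return vertices[-1]  # degenerate case: all vertices coincide


# Hypothetical L-shaped road in projected coordinates (assumption).
road = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0)]
print(polyline_midpoint(road))
```

The midpoint is measured along the road rather than between its two end vertices, so a curved road still yields a point that lies on the road itself.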

Figure 9: Hypothetical Bus Stops in New Castle County and Dover


Figure 10 presents a more detailed view of the hypothetical bus stops in the Wilmington-Newark area. This map demonstrates that the hypothetical bus stops are widely distributed, such that most locations that could potentially be served by bus are included.

Figure 10: Hypothetical Bus Stops in Wilmington-Newark Area


At this point, two independent layers, which can be exported as CSV files, have been created for geoprocessing in the next few steps.

3.5 Description of Demographic Data at the Census Block Group Level

In terms of the level of geography at which this transportation planning study is conducted, the smallest level available is the most suitable for this analysis because of the radius of bus service coverage. According to DART’s fixed route planner, a distance of a quarter of a mile is used to capture the service coverage of DART’s bus stops (D. Dooley, personal communication, October 2019). Ideally, a 0.25-mile buffer is applied to spatially join all the data contained in the small geographic units within the buffer. Thus, the census block, the smallest geographic unit defined by the United States Census Bureau (US Census Bureau, 2019b), is the best option for approximating the real-world scenario. However, the block group, which contains between 600 and 3,000 people, is the smallest geographic unit for which the ACS publishes data for public users (Gaquin & Dunn, 2012). As a result, the majority of the data is collected at the block group level. Compared with block-level data, this might affect the result of the analysis.
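As a rough illustration of the quarter-mile service buffer, the sketch below selects geographic units whose centroid falls within 0.25 miles of a stop. This is a simplification and an assumption: the study performs the join in ArcGIS, and a centroid test is only one possible join rule. The function names, the Earth-radius constant, and the centroid coordinates are hypothetical; the stop coordinates are those of the Dover Transit Center in Table 1.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def units_within_buffer(stop, centroids, radius_miles=0.25):
    """Return IDs of units whose centroid lies inside the stop's buffer."""
    return [
        unit_id
        for unit_id, (lat, lon) in centroids.items()
        if haversine_miles(stop[0], stop[1], lat, lon) <= radius_miles
    ]


stop = (39.152, -75.529)  # Dover Transit Center (Table 1)
centroids = {             # made-up centroids for illustration only
    "unit_A": (39.153, -75.529),  # about 0.07 miles north of the stop
    "unit_B": (39.170, -75.529),  # about 1.2 miles north of the stop
}
print(units_within_buffer(stop, centroids))
```

With block group data, a unit can be far larger than the buffer itself, which is the mismatch the paragraph above describes.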

Part of the data is obtained from American FactFinder, a government census data website that was decommissioned at the end of March 2020. The data source, the 2017 American Community Survey (ACS) 5-year estimates, is used because it contains the most detailed demographics needed for this study. A 5-year estimation range also gives the maximum reliability and accuracy that public data can achieve compared with 1-year estimates.

There are a total of 14 types of ACS data acquired from the U.S. Census Bureau, each describing a related characteristic of the communities within census block group boundaries. These data are selected based on the literature review in Chapter 2 and because they are believed to influence bus ridership during weekdays. Table 2 lists the data categories (attributes) that are used as variables in this study.

Based on the ACS data, some other characteristics are calculated to capture the importance of certain data types. For example, an SSI rate is added under ID B19056 to describe the share of households in each census block group that received Supplemental Security Income (SSI) in the past twelve months. It is calculated as the number of households with SSI income divided by the total number of households. Poverty rate and no-vehicle rate are calculated using a similar approach in addition to the ACS data.
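The derived rates are simple quotients, as the following sketch shows. The field names and counts are hypothetical, and returning 0.0 for a unit with no households is an assumption made here to avoid division by zero; the thesis does not state how such units are handled.

```python
def rate(count, total):
    """Share of `total` represented by `count`; 0.0 when the total is zero."""
    return count / total if total else 0.0


# Hypothetical block group record; field names are illustrative only.
block_group = {
    "households": 400,
    "ssi_households": 30,
    "no_vehicle_households": 52,
}

ssi_rate = rate(block_group["ssi_households"], block_group["households"])
no_vehicle_rate = rate(block_group["no_vehicle_households"],
                       block_group["households"])
print(round(ssi_rate, 3), round(no_vehicle_rate, 3))
```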

Bus ridership is hard to predict, especially when limited employment, demographic, and socioeconomic data are available, because a single characteristic or a few characteristics cannot fully explain the expected ridership (Driscoll et al., 2018). Because of the level of geography on which this study focuses, data summarized at some points to represent the population within the bus service area (0.25 miles) may not reflect reality. Thus, most of the characteristics, when analyzed individually, show no direct relationship with the existing ridership. To illustrate these issues, the relationship between ridership and some selected variables (poverty rate and commuters) is explored.

Table 2. ACS Dataset and Selected Attributes

| ID | Title | Data Details (Parts of the data used in the study) |
|---|---|---|
| B01001 | SEX BY AGE | Total population |
| B01002 | MEDIAN AGE BY SEX | Median age |
| B08016 | PLACE FOR WORK FOR WORKERS 16 YEARS | Total workers 16 years and over |
| B08301 | MEANS OF TRANSPORTATION TO WORK | Total commuters, and sum of each type (Car, Car Drove Alone, Carpooled, Bus, Rail, Walked, WFH) |
| B08303 | TRAVEL TIME TO WORK | <30 mins, ≥30 mins |
| B11001 | HOUSEHOLD TYPE | Total households, family households, families with male/female only, living alone |
| B19001 | HOUSEHOLD INCOME IN THE PAST 12 MONTHS | Household income |
| B19013 | MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS | Median household income |
| B19055 | SOCIAL SECURITY INCOME IN THE PAST 12 MONTHS FOR HOUSEHOLDS | Social security income |
| B19056 | SUPPLEMENTAL SECURITY INCOME (SSI) IN THE PAST 12 MONTHS FOR HOUSEHOLDS | SSI income, SSI rate |
| B19301 | PER CAPITA INCOME IN THE PAST 12 MONTHS | Per capita income |
| B25001 | HOUSING UNITS | Total housing units |
| B25003 | TENURE | Renter/Owner occupied |


Figure 11 presents a scatterplot of poverty rate versus ridership at each bus stop. A positive trend can be observed, indicating that ridership may be positively correlated with the poverty rate of the population within walking distance (0.25 miles) of bus stops. Poverty rate is calculated as “income lower than 60k” divided by “number of households” at each bus stop after the ACS data is spatially joined. This will be further investigated during the machine learning process by comparing its feature importance with that of other demographic variables.

[Scatterplot: Ridership (0 to 2,000) on the horizontal axis, Poverty Rate (0.00 to 1.00) on the vertical axis]

Figure 11: Poverty Rate vs. Ridership

Figure 12 presents a plot of the number of commuters (bus riders going to or from work) versus ridership at each stop. Theoretically, a clear positive trend should be observed because the number of people commuting by bus at a particular location implies at least an equal amount of weekday ridership, if not more. For example, a place that has 160 daily bus commuters should generally have more than 160 daily riders, which is not the case in this figure. The plot shows that data accuracy can be affected by the level of geography at which the data is processed. Although the limitation on this particular variable, or on other variables, should not necessarily lower the effectiveness of machine learning, a more up-to-date and detailed dataset (block level) would likely improve the accuracy of the prediction (Hand et al., 2001; Al-Jarrah et al., 2015).

[Scatterplot: Ridership (0 to 2,000) on the horizontal axis, Bus Commuters (0 to 180) on the vertical axis]

Figure 12: Daily Bus Commuters vs. Ridership


3.6 Land Use Data

To study the impact of bus stop location with respect to different land use types, a shapefile of Delaware land use was downloaded from FirstMap (FirstMap@De, 2020). Due to privacy reasons, the publicly available land use data was from 2007. It categorized the land parcels into dozens of types. After the data is imported into the GIS map, those types are classified into three categories coded 0, 1, and 2, which stand for areas that are uninhabited, somewhat inhabited but unpaved, and paved with human activities, respectively. Table 3 summarizes the land use types and the code representing each.

Table 3. Land Use Types

| Characteristic | Category | Land use types |
|---|---|---|
| Uninhabited | 0 | Clear-cut, Communication-antennas, Deciduous Forest, Evergreen Forest, Extraction, Herbaceous Rangeland, Idle Fields, Inland Natural Sandy Areas, Junk/Salvage Yards, Mixed Forests, Mixed Rangeland, Tidal Shoreline |
| Somewhat inhabited but unpaved | 1 | Bays and Coves, Beaches and River Banks, Confined Feeding Operations/Feedlots, Cropland, Farmsteads and Farm Related Buildings, Man-made Reservoirs and Impoundments, Marinas/Port Facilities/Docks, Natural Lakes and Ponds, Railroads |
| Paved with human activities | 2 | Airports, Commercial, Highways/Roads/Access Roads/Freeways, Industrial, Institutional/Governmental, Mixed Urban or Built-up Land, Mobile Home Parks/Courts, Multi Family Dwellings, Parking Lots, Recreational, Retail Sales/Wholesale/Professional Services, Single Family Dwellings, Transportation/Communication, Utilities |
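The recoding in Table 3 amounts to a lookup from land use type to category code. A minimal sketch follows, listing only a few of the types for brevity; the dictionary and function names are hypothetical, and returning `None` for an unlisted type is an assumption.

```python
# Category codes follow Table 3; only a few land use types are listed here.
LAND_USE_CATEGORY = {
    "Deciduous Forest": 0,
    "Idle Fields": 0,
    "Cropland": 1,
    "Railroads": 1,
    "Commercial": 2,
    "Single Family Dwellings": 2,
}


def classify_land_use(land_use_type):
    """Return 0, 1, or 2 for a listed land use type, or None if unlisted."""
    return LAND_USE_CATEGORY.get(land_use_type)


parcels = ["Commercial", "Cropland", "Deciduous Forest"]
print([classify_land_use(p) for p in parcels])
```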


3.7 Employment Data

Additional data describing employment opportunities are acquired from the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) webpage (US Census Bureau, 2020). The LEHD Origin-Destination Employment Statistics (LODES) provide job and worker information at census block geographic detail, enumerated by 2010 census block. Newer information can be expected once the 2020 decennial census data becomes publicly available.

3.8 Geoprocessing and Spatial Analysis

Once all the data is collected from the various sources, it is geocoded into different layers in the GIS map. Census block and census block group shapefiles are downloaded as the base maps for spatial joins. After all the data (ACS, employment, land use, selected points of interest) are joined to census blocks and block groups, another spatial join is performed to combine all the information within walking distance of bus stops into two bus stop layers (existing bus stops as the training dataset and hypothetical bus stops as the testing dataset).

3.9 Data Assumptions and Limitations

This section reviews the assumptions made in developing the dataset and the limitations identified. These include the statistical error, data availability, and timeliness.


Margins of error contained in the ACS data are ignored, although there are different levels of uncertainty. As discovered during data preparation, some features can have margins of error of more than 50 percent of the value used in the study. The 2020 census data, which will be available after 2021, will be a better alternative for this approach because changes in demographics over time can affect the result of the machine learning prediction. A more up-to-date dataset can be expected to improve the performance of the analysis.

In addition, there is a limitation on the level of detail of public data. Census block level data is preferred, and it is probably the only reasonable geographic level on which bus ridership studies should focus. Relying on block group data is particularly a problem when a bus stop is located in an area with lower population density, because the block group there can be very large, since a block group generally contains at least 600 people. This is also part of the reason that some data are converted to ratios, because ratios might better represent the characteristics of bus stops. This problem could potentially be solved if a public bus agency, like DART, were to implement this ridership prediction method, because more detailed data could be provided by the Census Bureau or the Department of Transportation’s planning division.

Figure 13 and Figure 14 compare the visual difference between using block level and block group level data. It is clear that buffers, representing bus service areas, are likely to cover larger areas when block group level data is used, which means the information collected at some locations is inaccurate and can reduce the overall data quality.


Figure 13: Bus Stop Service Areas vs. Census Block Groups


Figure 14: Bus Stop Service Areas vs. Census Blocks

Furthermore, the ACS data and decennial census data do not provide the most up-to-date information. Bus commuters make up a large proportion of weekday bus riders, so it is important to know where those commuters live and where job opportunities shift to. It is reported that an average American changes jobs about 12 times in a lifetime, which means a person could change jobs, and likely relocate for the job, every three or four years (US Bureau of Labor Statistics, 2018). As a result, data from years ago may not be ideal for a study that requires relatively accurate and up-to-date data.


Chapter 4

METHODOLOGY

4.1 Introduction

This methodology chapter introduces the research framework that this study employs. The framework aims to provide an alternative method for public transit agencies to evaluate any proposed change to the existing bus system. It provides analytical evidence of whether a certain change, e.g., closing or relocating a bus stop, should be implemented by comparing the existing ridership with the predicted ridership at a location or at surrounding hypothetical bus stops. Given comprehensive and accurate input, the objective is for the ridership changes forecast by the framework to be similar to those observed in practice after the service change.

The first two steps of this study are to collect and process data. These steps were described in Chapter 3, and two datasets were developed. These two sets of data will be used as training and testing data in the machine learning process, the third step. After the predictions are produced, the geocoded hypothetical bus stops will be mapped and color-coded based on the predicted ridership at each stop, thereby providing a way for bus route planners and analysts to assess the significance of existing and proposed bus stops, the fourth and final step. The framework that describes the flow of this research approach is shown in Figure 15.

Figure 15: Research Framework

4.2 Machine Learning

Before developing the structure for the machine learning process, it is necessary to understand the dataset and pick an appropriate approach and algorithms that will perform better than randomly selected ones. This section explains which algorithms are chosen for this study and why.

4.2.1 Supervised Learning

There are three major types of machine learning algorithms: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Supervised learning is a type of machine learning in which the dataset contains a target variable to be predicted, in classification or regression problems, from the given set of other variables (Ray, 2017). Every variable in the dataset is specifically labeled to have a unique meaning. In contrast, unsupervised learning looks to cluster data that are not previously labeled, and it does not have a target variable to predict. Finally, reinforcement learning, similar to unsupervised learning, is a type of machine learning that does not require a labeled dataset. It aims to find the suitable or optimal action to receive the maximum reward (Berry et al., 2019; Sutton & Barto, 2018). The primary goal of this study is to produce values that represent potential ridership for the hypothetical bus stops. Therefore, supervised learning is selected among the three main types.

4.2.2 Machine Learning Algorithms

There are many goals and requirements on which a machine learning project can choose to focus. The decisions also vary with the nature of the data. Below are some factors that are considered when choosing the appropriate algorithms to apply:


• The objective of the study.

• The quantity, quality, and the complexity (number of variables or features) of the dataset.

• Is this study (or a project that employs this approach) time-sensitive?

• How accurate does the output need to be?

To begin with, the objective of this study is to predict potential ridership at hypothetical stops. The ridership values to be predicted are numerical: the daily average numbers of passengers getting on and off the buses. In this case, regression models should be considered first because they are mostly used to predict outcomes in numeric values. This narrows the candidate algorithms to supervised learning methods for regression problems.

In addition, both the training dataset and the testing dataset have fewer than 10 thousand rows of inputs, which makes this a medium-sized problem, especially given that more than 30 features are included. Moreover, the study aims to assess existing and proposed bus stops based on ridership, so it is not time-sensitive, because a proposed change in bus services, especially bus stop locations, can take months. Lastly, the output does not need to be as accurate and precise as critical measures that can ultimately affect passenger safety. The reason not to chase maximum accuracy is that doing so often causes over-fitting, which can make the predictions for most of the dataset not “accurate” (Ravanshad, 2018). For example, the ridership for existing bus stops varies from about 1,800 per day to 0 per day, while the majority of stops have zero or single-digit ridership. If accuracy is maximized, there will most likely be no bus stop that is predicted to have more than a thousand riders per day.

Based on the recommendations of some machine learning researchers and experts, Gradient Boosting is expected to work best for the ridership dataset (Li, 2017; Ravanshad, 2018). Gradient Boosting is a technique for producing a prediction model in the form of an ensemble of shallow decision trees, in which each tree learns from and improves on the previous one (Boehmke & Greenwell, 2019). It is a type of ensemble algorithm that combines many weaker models, which are trained sequentially rather than independently. In addition, Support Vector Machine - Regression (SVR) is used for comparison with the performance of the Gradient Boosting algorithms. A support vector machine is a type of supervised learning model that can be used for regression and classification analysis (James et al., 2013). Transportation researchers who studied public transit ridership on trains and metros have employed both Gradient Boosting and SVR and found them reliable in similar ways (Ding et al., 2016; Wang et al., 2018).
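To make the idea of each tree improving on the previous one concrete, the following is a minimal sketch of gradient boosting for squared error, using one-split trees (stumps) and only the Python standard library. It is not the implementation used in this study, which relies on established packages; the function names, the toy data, and the settings `n_trees=50` and `learning_rate=0.1` are all assumptions chosen for illustration.

```python
def fit_stump(X, residuals):
    """Fit a one-split regression tree (a stump) to the current residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = sum((r - left_mean) ** 2 for r in left)
            sse += sum((r - right_mean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    return None if best is None else best[1:]


def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def fit_boosting(X, y, n_trees=50, learning_rate=0.1):
    """Squared-error boosting: each stump fits the error left by the others."""
    base = sum(y) / len(y)  # initial prediction: the mean label
    current = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - ci for yi, ci in zip(y, current)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no valid split exists
            break
        stumps.append(stump)
        current = [c + learning_rate * predict_stump(stump, row)
                   for c, row in zip(current, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Hypothetical toy data: one feature per stop, skewed ridership labels.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0.0, 1.0, 2.0, 3.0, 40.0, 45.0]
model = fit_boosting(X, y)
print([round(predict(model, row), 1) for row in X])
```

The learning rate shrinks each stump's contribution, so many small corrections accumulate; this sequential dependence is what distinguishes boosting from ensembles whose members are trained independently.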

Other potential algorithms are LightGBM and XGBoost. LightGBM is chosen for its advantages in accuracy, training speed, and efficiency, and for its capability of dealing with large sets of data (Ke et al., 2017). Because it runs fairly fast, it allows bus route planners to quickly evaluate any potential change in bus routes and stops. It would also be useful to run the code at events such as meetings or public hearings to provide insights and evidence for the proposed service changes.


Similarly, XGBoost is chosen because it is an optimized gradient boosting library that has high efficiency, flexibility, and accuracy (Chen & Guestrin, 2016). It implements multiple machine learning algorithms under the general Gradient Boosting framework rather than being a single algorithm itself. Created by Tianqi Chen, XGBoost has been an award-winning package in many algorithm competitions such as Kaggle, CodeCup, and Halite (Chen, 2016).

The machine learning algorithms that are applied in this study are listed below:

• Gradient Boosting Decision Trees (GBDT)

• LightGBM

• XGBoost

• SVR

The first three are all gradient boosting techniques. These four machine learning algorithms are applied in parallel. Cross-validation is also used to examine the accuracy and effectiveness of each model using Root Mean Square Error (RMSE). The best performing algorithm is then selected based on factors such as RMSE and the prediction output, with the aim of minimizing RMSE while avoiding over-fitting.
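For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch is given below; the function name and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


observed = [0, 2, 5, 40]  # hypothetical daily ridership values
forecast = [1, 2, 3, 44]  # hypothetical model predictions
print(rmse(observed, forecast))
```

Because the errors are squared before averaging, a few stops with very large ridership can dominate the value, which is one reason RMSE is read alongside the other outputs rather than on its own.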

4.3 K-Fold Cross-validation

In the training process, it is very likely that some level of over-fitting can occur, especially when the ridership at a few locations is extremely large compared to others. Cross-validation is a resampling procedure that is commonly applied in machine learning: the training dataset is divided into several folds, often 5 or 10, and each fold in turn is treated as the testing data, which helps to ensure accuracy and guard against over-fitting (Boehmke & Greenwell, 2019). 5-fold cross-validation is used in this study because the training dataset is relatively small. A previous study suggested that the K value can be determined by the equation below (Jung, 2017):

$$K \approx \log(n), \quad \text{and} \quad \frac{n}{K} > 3d \tag{4.1}$$

where n is the sample size and d is the number of features.

There are 1864 existing bus stops and around 30 features included. Using the base-10 logarithm, the K value is calculated as 3.27. With K equal to 5, n/K is approximately 373, which exceeds 3d = 90. Since all requirements can be satisfied when K is 5, it should be sufficient to split the training data into 5 folds. In each round, one fold is used for prediction while the remaining four folds are used for training. The performance of cross-validation for each algorithm is evaluated based on the Root Mean Square Error.
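The arithmetic above and the fold rotation can be checked with a short sketch. The helper names, the fixed random seed, and the choice to shuffle the rows before splitting are assumptions; the thesis does not state how rows are assigned to folds.

```python
import math
import random


def satisfies_rule(n, d, k):
    """Check the second condition of Equation 4.1: n / K > 3d."""
    return n / k > 3 * d


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


n, d, k = 1864, 30, 5           # values quoted in the text
print(round(math.log10(n), 2))  # base-10 logarithm of the sample size
print(satisfies_rule(n, d, k))

folds = k_fold_indices(n, k)
for i, test_fold in enumerate(folds):
    # Each fold is held out once; the other four folds form the training set.
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(i, len(train), len(test_fold))
```

Every row appears in exactly one held-out fold, so each existing bus stop contributes once to the validation error.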

4.4 Prediction Outputs

After executing the machine learning models, several types of output will be collected. These outputs are the keys to determine which machine learning algorithm is adopted for the final analysis in step 4 and recommended for future studies. A brief description of the outputs is listed as follows:

• Tables that contain ridership predictions at each hypothetical bus stop

• Root Mean Square Error

• Feature importance

• Minimum, maximum values, and 75th percentile


• Mean and standard deviation

RMSE, as well as other measures of data variance, is recorded and evaluated to determine which methods perform better than others. In particular, maximum values and 75th percentiles should be checked for all 12 tables (there are 4 machine learning models and 3 different types of outputs: on, off, and total). There are three main reasons why they are significant factors. First, as discussed in the previous chapter, the ridership data contains several rows with extremely large ridership because those stops are at places that cannot be distinguished simply by demographics, land use, and employment data. For example, a transit center may have characteristics similar to those of a bus stop anywhere else in downtown Wilmington; however, it will have more ridership because it does not serve only as an origin or destination. In order to make sure that no extreme maximum value (either too high or too low) is produced, maximum values and 75th percentiles should be compared.

Second, low maximum values can indicate over-fitting, which should be avoided. Chapter 3 already discussed the ridership distribution and suggested that too many zeros in ridership can lead to over-fitting. If an output table has a very small maximum ridership with low variance, that algorithm may not be the optimal option. Third, gradient boosting can over-fit a training dataset very easily because of the greedy nature of its optimization (Brownlee, 2019). Since gradient boosting is the major algorithm in this study, such issues should be taken into account.
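The spread measures listed above can be computed for each prediction table with the Python standard library, as in the sketch below. The function name and the sample predictions are hypothetical, and the "inclusive" quantile method is an assumption; other percentile conventions give slightly different values.

```python
import statistics


def summarize(predictions):
    """Return the spread measures listed above for one table of predictions."""
    quartiles = statistics.quantiles(predictions, n=4, method="inclusive")
    return {
        "min": min(predictions),
        "p75": quartiles[2],  # third cut point is the 75th percentile
        "max": max(predictions),
        "mean": statistics.mean(predictions),
        "std": statistics.stdev(predictions),
    }


predicted_total = [0, 1, 1, 2, 3, 5, 8, 40, 120]  # hypothetical predictions
print(summarize(predicted_total))
```

Comparing the maximum with the 75th percentile in this way shows at a glance whether a model still produces a small number of high-ridership stops or has compressed all predictions toward zero.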

Then, the tables containing ridership predictions can be imported into the GIS map using the coordinates included in the dataset. Each point representing a hypothetical bus stop will then be symbolized with different colors as well as sizes. In the same map, another layer that shows the existing bus stops using the same symbology settings (but a different style) will be produced for better visualization.

Finally, feature importance based on the best performing machine learning model will be plotted and analyzed. It is important to know whether there are dominating features that affect ridership, or whether most of the features included in this study are similarly significant. Feature importance is useful for bus route planners in different ways. Some examples are listed below:

• The most important features indicate where bus route planners should concentrate in the future when collecting and requesting more accurate data.

• It provides insight into the relations between features and ridership, so that some bus stop changes can be evaluated without employing this prediction method, simply by comparing the most influential features.

• The least important features may be excluded in future studies to save time.

• Additional data (features) can be added in future studies if they become available.

4.5 Additional Data Processing

This chapter expressed a concern related to the nature of the ridership data, which has the potential to lead to over-fitting and inaccuracy because of some large values that cannot be explained by correlations with the features. As a result, additional analysis will be performed to find the algorithms whose predictions best shed light on bus route planning decisions.


Applying a log transformation is another attempt in this study to minimize the effect of the large variability in the ridership data. Log transformation is a major type of data transformation (McDonald, 2009). As the name indicates, it consists of calculating the log of each label, which is the ridership in this case. After the data is read in, the log transformation is applied before the training process. After predictions are made, the values are transformed back to the original scale. Such techniques are often used in research when dealing with skewed data, in which variables span several orders of magnitude (Feng et al., 2019). The ridership data in this study is considered skewed because the vast majority of the values are small while very few are extremely large.
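The transform and back-transform steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the text does not state the log base or how zero ridership is handled, so the use of log(1 + y) (which keeps zero counts defined) is an assumption, and the sample values are hypothetical.

```python
import math


def log_transform(labels):
    """Map each ridership count y to log(1 + y).

    Assumption: log1p is used so that stops with zero ridership stay defined.
    """
    return [math.log1p(y) for y in labels]


def inverse_log_transform(values):
    """Map values on the log scale back to ridership with exp(v) - 1."""
    return [math.expm1(v) for v in values]


# Hypothetical daily ridership labels: mostly small, with one very large value.
ridership = [0, 2, 5, 13, 957]

transformed = log_transform(ridership)            # values a model would be trained on
restored = inverse_log_transform(transformed)     # back on the original scale
```

In practice the back-transform would be applied to the model's predictions rather than to the transformed labels; the round trip here only shows that the two steps are inverses of each other.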

Chapter 5

RESULTS AND DISCUSSIONS

5.1 Introduction

Chapter 5 presents the machine learning results and discusses the overall contribution this study makes. Based on the methods introduced in Chapter 4, the decision is made as to which machine learning model should be chosen for this and future studies. Additionally, maps that display the predicted ridership at hypothetical bus stops and the historical ridership at existing bus stops are included to show the results for selected areas. To provide additional insights, the role that features (such as socioeconomic variables) play in the models is explored and the relative importance of the features is discussed.

Based on the objective of this study, one machine learning algorithm is deemed optimal. Using that algorithm, maps are built to visualize the predicted ridership. These ridership predictions are believed to reflect the best combination of accuracy, complexity, and the other factors discussed in the previous chapters. However, other methods are also briefly discussed in case they prove more useful when DART or other bus transit agencies employ this approach. It is important to note that the results will vary if more accurate data is used. Ideally, the training dataset should only include data collected in the same year as the ridership data, and the prediction dataset should only include the most up-to-date data.

5.2 Model Outputs and Performance

5.2.1 Summary Statistics for Alternative Algorithms

Summary statistics describe the prediction outputs for each of the four algorithms (Gradient Boosting Decision Trees (GBDT), LightGBM, XGBoost, and SVR) to determine the best performing machine learning model in this study. For each dataset (“On”, “Off” and “Total” ridership at each bus stop) and each algorithm, the minimum and maximum values, the 75th percentile, the mean, the standard deviation, and the root mean square error are reported. The summary statistics are also reported for the historical data.
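The measures listed above can be computed as in the sketch below. It is illustrative only: the sample values are hypothetical rather than taken from the DART dataset, and the choice of the sample standard deviation and of the "inclusive" quantile method are assumptions, since the text does not state which conventions were used.

```python
import math
import statistics


def describe(values):
    """Return the descriptive measures reported for each output table."""
    return {
        "min": min(values),
        "max": max(values),
        # Third quartile; the "inclusive" method is an assumed convention.
        "p75": statistics.quantiles(values, n=4, method="inclusive")[2],
        "mean": statistics.mean(values),
        # Sample standard deviation (n - 1 denominator); also an assumption.
        "sd": statistics.stdev(values),
    }


def rmse(actual, predicted):
    """Root mean square error between paired observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values for illustration only.
observed = [0, 3, 4, 10, 25]
predicted = [1, 2, 6, 9, 20]

summary = describe(predicted)
error = rmse(observed, predicted)
```

RMSE needs paired observed and predicted values, which is why the tables report it for the four models but not for the historical column.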

Table 4 summarizes the measures of the “On” prediction outputs, compared with each other and with the existing bus ridership.

Table 4. Statistical Description of Prediction Outputs vs. Historical Ridership (On)

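| Measure | GBDT | Lightgbm | XGBoost | SVR | Historical |
|---|---|---|---|---|---|
| Minimum | 0 | 1 | 0 | 0 | 0 |
| Maximum | 24 | 153 | 4 | 36 | 957 |
| 75th percentile | 4 | 19 | 2 | 3 | 9 |
| Mean | 3 | 17 | 1 | 4 | 13 |
| SD | 3.88 | 21.14 | 0.65 | 4.20 | 48 |
| RMSE | 1.49 | 1.35 | 1.03 | 1.53 | - |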


Table 5 summarizes the measures of the “Off” prediction outputs, compared with each other and with the existing bus ridership.

Table 5. Statistical Description of Prediction Outputs vs. Historical Ridership (Off)

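| Measure | GBDT | Lightgbm | XGBoost | SVR | Historical |
|---|---|---|---|---|---|
| Minimum | 0 | 1 | 0 | 0 | 0 |
| Maximum | 31 | 245 | 5 | 39 | 882 |
| 75th percentile | 5 | 25 | 2 | 4 | 11 |
| Mean | 4 | 22 | 2 | 4 | 13 |
| SD | 4.48 | 28.93 | 0.79 | 4.59 | 46.00 |
| RMSE | 1.45 | 1.34 | 1.02 | 1.53 | - |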

Table 6 summarizes the measures of the “Total” prediction outputs, compared with each other and with the existing bus ridership.

Table 6. Statistical Description of Prediction Outputs vs. Historical Ridership (Total)

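| Measure | GBDT | Lightgbm | XGBoost | SVR | Historical |
|---|---|---|---|---|---|
| Minimum | 0 | 2 | 0 | 0 | 0 |
| Maximum | 53 | 686 | 9 | 72 | 1839 |
| 75th percentile | 10 | 89 | 4 | 9 | 21 |
| Mean | 8 | 77 | 3 | 9 | 26 |
| SD | 8.53 | 102.55 | 2.11 | 8.60 | 91.00 |
| RMSE | 1.54 | 1.37 | 0.94 | 1.62 | - |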


After a careful comparison of all three tables, Lightgbm is judged to perform consistently better than the other three models. The mean, standard deviation, and range of predicted ridership from the Lightgbm model are considerably larger than those of the other models, and closer to the historical values of “On”, “Off” and “Total” ridership. This makes sense because other hypothetical bus stop locations can be expected to attract as many riders as existing bus stops. The reasons why Lightgbm is considered the best performing algorithm are discussed in more detail in Section 5.2.2.

5.2.2 The Best Performing Model - Lightgbm

There are many reasons why Lightgbm is the best performing algorithm used in this study. First and most obviously, it yielded maximum values that are closer to the historical ridership data. For example, the maximum daily ridership that Lightgbm predicts is the only one that matches the magnitude of historical ridership. Maximum ridership is particularly difficult to predict as high ridership volumes are focused on one transportation hub, , Wilmington. This is where people transfer to another bus or go to daily activities. This situation is unlikely to be replicated at other locations in the state of Delaware. Thus, slightly lower maximum ridership values are expected.

In addition, the mean, 75th percentile, and standard deviation values for the outputs of Lightgbm are closer to the measures for historical ridership. These values are similar especially for “On” and “Off” values. Although the predictions have higher mean and 75th percentile values than the historical ridership, there is a plausible explanation. Out of 1864 rows of ridership data, approximately 300 rows are filled with zeros. It is not clear whether ridership data collection was missing for those bus stops, but the zeros certainly lowered the historical mean and 75th percentile measures. The values may match more closely with updated ridership data in the coming years.

Last, and most importantly, RMSE indicates directly which model performs better (Ding et al., 2016). Usually the model with the lower RMSE has the higher accuracy. Although Lightgbm has only the second smallest RMSE, its RMSE is relatively low given the size of the dataset. Also, XGBoost might have over-fitted the dataset by ignoring the small but important part of the dataset with large ridership, because its maximum predicted ridership never exceeds 100 in any of the three predictions (the maxima in Tables 4, 5, and 6 are 4, 5, and 9 for On, Off, and Total). In sum, Lightgbm is determined to be the best performing algorithm in this study.

Table 7 shows the total predicted ridership for the ten bus stops with the largest ridership including and excluding the training data. The table lists latitude and longitude in order to show the spatial correlation of the hypothetical bus stops. A change of 0.1 degree in latitude corresponds to about 6.9 miles or 11.1 km, and the ten stops differ by less than 0.1 degree in both latitude and longitude, which means the top ten bus stops with the highest predicted ridership are close to each other. The GIS maps visualizing the ridership in the next section also show that these bus stops are located close together. While only total ridership is presented, both On and Off should be similar in terms of the distribution of high ridership predictions.


Table 7. Top Ten Bus Stops by Predicted Total Daily Ridership

| Lat | Long | Total Ridership (daily) |
|---|---|---|
| 39.73048 | -75.6238 | 686 |
| 39.72825 | -75.622 | 686 |
| 39.73636 | -75.6219 | 686 |
| 39.74371 | -75.5594 | 619 |
| 39.73932 | -75.552 | 603 |
| 39.74229 | -75.5559 | 550 |
| 39.7413 | -75.5496 | 539 |
| 39.74267 | -75.5486 | 536 |
| 39.74615 | -75.5467 | 517 |
| 39.73771 | -75.5517 | 497 |
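The spatial closeness of the stops in Table 7 can be checked with a great-circle distance calculation. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; both the formula choice and the radius are assumptions made for illustration, not the method used in the study's GIS workflow.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Length of 0.1 degree of latitude near Wilmington (same longitude at both ends).
tenth_degree_lat_km = haversine_km(39.7, -75.6, 39.8, -75.6)

# Distance between the first and last stops listed in Table 7.
first_to_last_km = haversine_km(39.73048, -75.6238, 39.73771, -75.5517)
```

Under these assumptions, 0.1 degree of latitude comes out at roughly 11 km, and the first and last stops in Table 7 are on the order of 6 km apart, which is consistent with the stops being clustered in one part of the county.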

5.3 Mapping the Ridership

As introduced in Chapter 4, the last step in this research framework is to map the predicted ridership to give visual assistance to bus route planners and researchers. The predicted total ridership is projected onto the historical ridership map in order to compare the two. Although the numbers of passengers getting on and off the bus might each be approximated as half of the total ridership, the two fields differ in almost every row in DART’s ridership data; thus, it is also important to include maps of the predicted “On” and “Off” numbers to capture the characteristics of locations that have more passengers getting on than getting off, or the reverse.

The maps can be used by bus route planners to quickly evaluate the potential of a proposed bus stop relocation or other bus service change by simply checking whether the proposed bus stop or the surrounding hypothetical bus stops have higher predicted ridership than the existing bus stops. Figures 16, 17, and 18 present the symbolized maps for the predicted and historical “On” ridership separately and then together. Similar maps are generated for “Off” ridership (Figures 19, 20 and 21) and “Total” ridership (Figures 22, 23 and 24). In all figures, circles indicate ridership at hypothetical bus stops and squares indicate historical ridership at actual bus stops. The circles and squares are color coded, with green indicating low ridership, red indicating high ridership, and yellow or light green and orange indicating intermediate ranges. For “On” and “Off” ridership data the thresholds are: green, under 10 riders per day; yellow or light green, 10 to 50 riders per day; orange, 50 to 100 riders per day; red, over 100 riders per day. For “Total” ridership data, the thresholds are doubled.
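The color thresholds above can be expressed as a small classification function. This is a sketch for illustration, not the GIS symbology settings themselves: the function name is hypothetical, and the handling of the boundary values is an assumption, because the stated ranges share the endpoint 50 (here 50 is treated as orange and 100 as orange, with red starting above 100).

```python
def ridership_class(riders_per_day, total=False):
    """Return the map color class for a daily ridership value.

    Thresholds are doubled for "Total" ridership, as described in the text.
    Assumption: a value of exactly 50 falls in the orange class.
    """
    scale = 2 if total else 1
    if riders_per_day < 10 * scale:
        return "green"
    if riders_per_day < 50 * scale:
        return "yellow/light green"
    if riders_per_day <= 100 * scale:
        return "orange"
    return "red"


# Example: the highest predicted total ridership in Table 7.
top_stop_class = ridership_class(686, total=True)
```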

In all maps, high predicted ridership (in passengers per day) is shown to cluster in downtown Wilmington, as expected, but also in selected suburban locations that warrant further exploration. The historical ridership data clearly shows the existing routes and the predicted ridership data shows areas that are currently not served but have potential riders.

Although comparative visualized maps can be interpreted in various ways based on the needs of the analysts, one way of looking at these maps is to identify the areas that have high ridership potential but are currently not served by any bus stop. In the southwest quadrant of Figure 21, for example, several hypothetical bus stops have relatively higher potential ridership than the existing bus stops in the Wilmington Manor area. This indicates that the neighborhood in the Wilmington Manor area might be underserved, and that a larger group of passengers could benefit from a bus route change that goes through the neighborhood.

Figure 16: Predicted Ridership for Passengers Getting on the Buses

Figure 17: Historical Ridership for Passengers Getting on the Buses

Figure 18: Predicted and Historical Ridership for Passengers Getting on the Buses

Figure 19: Predicted Ridership for Passengers Getting off the Buses

Figure 20: Historical Ridership for Passengers Getting off the Buses

Figure 21: Predicted and Historical Ridership for Passengers Getting off the Buses

Figure 22: Predicted Ridership for Total Passengers

Figure 23: Historical Ridership for Total Passengers

Figure 24: Predicted and Historical Ridership for Total Passengers

5.4 Feature Importance

An important insight that can be obtained from this study is the feature importance. Feature importance captures the significance of each demographic, socio-economic, and other variable. Feature importance is computed from the number of times that a feature is used in the split points of the decision trees in the machine learning model (Scikit-learn.org, n.d.). The count for each feature is recorded at the end of the machine learning process. In this study, the bar charts presenting the feature importance for all three predictions (On, Off, and Total) are based on each feature’s count as a share of the total count.
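The conversion from split counts to the shares plotted in the bar charts can be sketched as follows. The function name and the counts are hypothetical and chosen only to illustrate the normalization; they are not values from the trained model.

```python
def importance_shares(split_counts):
    """Convert per-feature split counts into shares of the total count.

    Returns a dict ordered from the most to the least important feature.
    """
    total = sum(split_counts.values())
    if total == 0:
        raise ValueError("no splits recorded")
    shares = {name: count / total for name, count in split_counts.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# Hypothetical split counts for four of the features used in the study.
split_counts = {"Mall": 2, "Median Age": 45, "JOBS": 60, "BELOW_POV": 55}

shares = importance_shares(split_counts)
```

Because the shares sum to one, they can be compared across the On, Off, and Total models even if the models use different numbers of splits.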

Lessons for future studies include excluding some less important variables and improving the accuracy and level of detail of important variables. Figures 25, 26 and 27 rank the feature importance derived from the machine learning model.

[Bar chart: Feature Importance (On). Horizontal axis: importance, 0 to 0.06. Features as ordered on the vertical axis: Mall, Landuse, Rail, Restaurants, Walked, SSI Rate, Bus, WFH, Malehouseholdernowife, Carpooled, Total Households, Femalehouseholdernohusband, Renter Rate, Familyhouseholds, Car Drove Alone, Car, income<60k, Total Commute, Commute<30mins, Commute>30mins, Per capita, Poverty Rate, HOUSING_UN, Median Age, Total Pop, livingalone, LOWINCJOBS, No Veh Rate, JOBS, BELOW_POV.]

Figure 25: Feature Importance for "On" Prediction

[Bar chart: Feature Importance (Off). Horizontal axis: importance, 0 to 0.07. Features as ordered on the vertical axis: Mall, Landuse, Restaurants, Rail, Walked, SSI Rate, Bus, WFH, Malehouseholdernowife, Renter Rate, Femalehouseholdernohusband, Car Drove Alone, Carpooled, Familyhouseholds, Car, Total Households, Total Commute, income<60k, Commute>30mins, Commute<30mins, Poverty Rate, HOUSING_UN, Total Pop, livingalone, Median Age, Per capita, LOWINCJOBS, No Veh Rate, BELOW_POV, JOBS.]

Figure 26: Feature Importance for "Off" Prediction

[Bar chart: Feature Importance (Total). Horizontal axis: importance, 0 to 0.06. Features as ordered on the vertical axis: Mall, Landuse, Restaurants, Rail, Walked, SSI Rate, Bus, WFH, Car, Total Households, income<60k, Malehouseholdernowife, Total Commute, Familyhouseholds, Carpooled, Femalehouseholdernohusband, Renter Rate, Car Drove Alone, livingalone, Total Pop, Poverty Rate, HOUSING_UN, Median Age, Commute>30mins, Commute<30mins, Per capita, LOWINCJOBS, No Veh Rate, BELOW_POV, JOBS.]

Figure 27: Feature Importance for "Total" Prediction


According to the bar charts, financial and employment related variables generally have higher impacts on the ridership prediction. Features such as jobs, below poverty rate, no vehicle rate, and low-income jobs consistently act as the most important factors in the model, as shown in all three bar charts. Interestingly, while factors like median age and housing units were included in the study to make the profile more comprehensive, they are used in Lightgbm’s decision trees almost as frequently as the financial and employment related variables.

In contrast, the presence of a mall or restaurants, land use data, and access to rail are some of the least important variables. There are always many bus riders at malls, so the fact that the mall variable ranks as the least important might not make intuitive sense; however, only a few shopping malls and shopping centers are recorded in the dataset, so the incomplete information did not reflect the real relationship between shopping activities and bus rides well enough. While it is true that the bus stops at Christiana Mall have a large number of riders each day, more people go on weekends, when ridership is generally low, and many other places with high ridership do not have malls nearby. Based on the observed feature importance, more detailed data at the block level, supported by land use data that identifies grocery stores and other points of interest, could be very useful for improving the comprehensiveness of the dataset and the overall accuracy of the machine learning models.

Chapter 6

CONCLUDING REMARKS

6.1 Conclusions

The results, which present the feature importance, the visualized ridership maps, and the best performing model, show that machine learning algorithms that integrate large quantities of data from diverse sources can serve as an additional technical tool for DART’s bus route planners to assess the feasibility of proposed bus route changes.

This novel ridership prediction tool also provides visualizations of ridership data as maps and provides insights into the importance of various features. The tool can help bus route planners make decisions about proposed bus route changes in a way that is much easier and less time-consuming than the conventional planning approaches, which involve origin-to-destination (OD) surveys as well as sophisticated traffic forecasting software. Declining ridership over the past several years has raised concerns that conventional bus route planning methods are not working well for DART. The framework developed in this study can initially complement the existing tools used by transit agencies.

The tools can also be used to identify areas where better data are needed. However, one important limitation of this work is that the results cannot be validated, only evaluated. Validation would require tracking ridership changes in response to service changes. The service changes do not necessarily have to involve the locations with the highest predicted ridership, so the evaluation could be undertaken during any service change.

6.2 Contributions to the Transportation Planning Field

Recent advances in machine learning have been applied to transit operations, particularly to predict bus arrival time (Bai et al., 2015; Yin et al., 2017). Similarly, Google is also using machine learning to predict bus delays (Fabrikant, 2019). However, there is limited work related to bus ridership forecasting. This study shows how machine learning and GIS can be used to predict bus ridership to help decision makers choose the optimal solutions. The framework provides technical support for public transit agencies to successfully implement proposed bus route changes that meet the needs of stakeholders, including users, neighborhoods and communities, businesses, and planners. In addition, a potential increase in ridership can help transit agencies achieve a more financially sustainable future and improve residents’ satisfaction with the public transportation system. In conclusion, this study contributes to the field of transportation planning research because it enables bus route planners to make more data-driven and accurate decisions that encourage more ridership and maximize access to the limited bus services for a larger population.


6.3 Future Work

Future work falls into four categories: 1) data quality; 2) alternative machine learning techniques; 3) specific data types such as transit centers and shelters at bus stops; and 4) validation. Each of these categories is briefly discussed.

First of all, detailed and up-to-date data is preferred for this kind of study; however, the actual dataset used here still requires considerable improvement in terms of accuracy and comprehensiveness. The performance of the machine learning model is expected to improve significantly if more accurate data is obtained for analysis. Transit agencies that find this framework useful can therefore try to collect more accurate and detailed data from public and private sectors in the future. It also has to be recognized that data formats, quality, and availability change over time. While access to the ACS data is changing, the 2020 Census is likely to provide more up-to-date and different data.

In addition, the accuracy of the ridership predictions may be further increased if more types of machine learning algorithms are compared. Four techniques are employed in this study, and three of them did not perform as anticipated because of over-fitting and limited data quality. Some of the machine learning algorithms applied in this study, or other algorithms not used this time, may perform better than Lightgbm in the future with a different ridership dataset.

The analysis demonstrated the importance of the data features used, but important data is also missing. For example, one can argue that the transit centers at Wilmington and Dover, because their characteristics differ from those of other locations, might cause some inaccuracy in the prediction. This is especially true since the variables used in this study cannot capture the characteristic of a bus stop being a transit center. Most of the variables used in this study attempt to describe the demographic and socio-economic factors around the bus stops. If bus route planners can investigate the ridership data and exclude the records that they judge to be poorly explained by the variables in the dataset, it can be helpful to run the framework again without the ridership from special locations like transit centers.

Another example of a specific type of data is data on the availability of shelters at bus stops. DART provides a map that categorizes bus stops into a “with shelter” group and a “without shelter” group. While this characteristic was not included in this study, future studies that choose to include it might find some difference in the prediction results, because bus stops with shelters are likely to be more attractive to passengers, especially during extreme weather. This information can be stored in a binary format in the dataset. Similarly, bus stops that allow park and ride may also attract more riders. Both characteristics can be considered for future studies. Shelter data is one example of a larger set of data that can enhance the analysis, namely service characteristics. For example, ridership varies with service frequency and reliability.
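Storing shelter availability in a binary format could look like the sketch below. The field names (`stop_id`, `has_shelter`) and the stop identifiers are hypothetical; the actual structure of DART's shelter map and of the study's dataset is not described here.

```python
def add_shelter_flag(stops, sheltered_ids):
    """Return copies of the stop records with a 0/1 shelter indicator added."""
    return [
        {**stop, "has_shelter": 1 if stop["stop_id"] in sheltered_ids else 0}
        for stop in stops
    ]


# Hypothetical stop records and shelter list for illustration only.
stops = [
    {"stop_id": "A101", "lat": 39.7305, "lon": -75.6238},
    {"stop_id": "B202", "lat": 39.7437, "lon": -75.5594},
]
sheltered_ids = {"B202"}

stops_with_flag = add_shelter_flag(stops, sheltered_ids)
```

A park-and-ride indicator could be added in the same way as a second 0/1 column.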

Finally, while the predictions might be evaluated by experienced bus route planners who are familiar with the transit-served communities, they currently cannot be validated because validation would require ridership tracking in the years after a proposed change is implemented. One way to solve the problem is to gather historical census and socio-economic data, together with ridership data from the years before some bus routes were rerouted. If enough accurate data can be obtained, one can re-run the machine learning model and use the ridership data recorded after the service change to validate the predicted ridership. Alternatively, the framework can be used to predict the ridership at recently relocated bus stops. The next year’s ridership tracking can then validate the prediction made a year before, thereby assessing the reliability of this framework.


REFERENCES

Al-Jarrah, O. Y., Yoo, P. D., Muhaidat, S., Karagiannidis, G. K., & Taha, K. (2015). Efficient machine learning for big data: A review. Big Data Research, 2(3), 87-93. https://doi.org/10.1016/j.bdr.2015.04.001

AlKheder, S., AlRukaibi, F., & Zaqzouq, A. (2018). Optimal bus frequency for Kuwait public transportation company: A cost view. Sustainable Cities and Society, 41, 312-319. https://doi.org/10.1016/j.scs.2018.05.042

Bai, C., Peng, Z., Lu, Q., & Sun, J. (2015). Dynamic bus travel time prediction models on road with multiple bus routes. Computational Intelligence and Neuroscience, 2015, 1-9. https://doi.org/10.1155/2015/432389

Berry, M. W., Mohamed, A., & Yap, B. W. (2019). Supervised and unsupervised learning for data science. Springer Nature.

Bishop, C. M. (2016). Pattern recognition and machine learning. Springer.

Boehmke, B., & Greenwell, B. M. (2019). Hands-on machine learning with R. CRC Press.

Box, G. E., & Meyer, R. D. (1986). An analysis for Unreplicated fractional factorials. Technometrics, 28(1), 11-18. https://doi.org/10.1080/00401706.1986.10488093

Brand, D. (1973). Travel demand forecasting: some foundations and a review. Highway Research Board Special Report, (143).

Brownlee, J. (2019, August 21). A gentle introduction to the gradient boosting algorithm for machine learning. Machine Learning Mastery. https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

Cambridge Systematics, Inc., Vanasse Hangen Brustlin, Inc, Gallop Corporation, Bhat, C. R., Shapiro Transportation Consulting, LLC, & Martin/Alexiou/Bryson, PLLC. (2012). Travel demand forecasting: Parameters and techniques (716). Transportation Research Board. https://transportation.ky.gov/Planning/Documents/Travel%20Demand%20Forecasting.pdf


Ceder, A., & Wilson, N. (1986). Bus network design. Transportation Research Part B: Methodological, 20(4), 331-344. doi: 10.1016/0191-2615(86)90047-0

Černá, A., Černý, J., & Přibyl, V. (2011). Bus route design in small demand areas. TRANSPORT, 26(3), 248-254. doi: 10.3846/16484142.2011.622135

Cervero, R., & Kockelman, K. (1997). Travel demand and the 3Ds: Density, diversity, and design. Transportation Research Part D: Transport and Environment, 2(3), 199-219. https://doi.org/10.1016/s1361-9209(97)00009-6

Chen, T. (2016, March 10). Story and lessons behind the evolution of XGBoost. homes.cs.washington.edu. https://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html

Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. https://doi.org/10.1145/2939672.2939785

Dajani, J. S., and D. A. Sullivan. (1976). A causal model for estimating public transit ridership using census data. High Speed Ground Transportation Journal 10(1): 47-57.

DART First State - About DART First State. (2019). Retrieved 18 December 2019, from https://dartfirststate.com/home/about/index.shtml

DART First State - Park & ride / Park & pool. (2020, February 25). DART- Delaware. https://www.dartfirststate.com/information/getting_there/parkride/index.shtml#PR

DelDOT Fact Books. (2018). Retrieved 6 January 2020, from https://deldot.gov/Publications/reports/fact_book/index.shtml

Ding, C., Wang, D., Ma, X., & Li, H. (2016). Predicting short-term subway ridership and prioritizing its influential factors using gradient boosting decision trees. Sustainability, 8(11), 1100. https://doi.org/10.3390/su8111100

Driscoll, R. A., Lehmann, K. R., Polzin, S., & Godfrey, J. (2018). The effect of demographic changes on transit ridership trends. Transportation Research Record: Journal of the Transportation Research Board, 2672(8), 870-878. https://doi.org/10.1177/0361198118777605


El-Geneidy, A., Grimsrud, M., Wasfi, R., Tétreault, P., & Surprenant-Legault, J. (2013). New evidence on walking distances to transit stops: Identifying redundancies and gaps using variable service areas. Transportation, 41(1), 193-210. https://doi.org/10.1007/s11116-013-9508-z

Fabrikant, A. (2019, June 27). Predicting bus delays with machine learning. Google AI Blog. https://ai.googleblog.com/2019/06/predicting-bus-delays-with-machine.html

Feng, C., Wang, H., Lu, N., Chen, T., EH, H., UL, Y., & Tu, X. (2019). Log-transformation and its implications for data analysis. General Psychiatry, 32(5), e100146corr1. https://doi.org/10.1136/gpsych-2019-100146corr1

Fitzpatrick, K., Hall, K., Perkinson, D., & Nowlin, L. (1997). Location and design of bus stops. Institute of Transportation Engineers. ITE Journal, 67(5), 36. Retrieved from https://search.proquest.com/docview/224879359?accountid=10457

Gaquin, D. A., & Dunn, G. W. (2012). The who, what, and where of America: Understanding the American community survey. Bernan Press.

Gebeyehu, M., & Takano, S. (2008). Demand Responsive Route Design: GIS Application to Link Downtowns with Expansion Areas. Journal of Public Transportation, 11(1), 43-62. doi: 10.5038/2375-0901.11.1.3

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. MIT Press.

Huang, Z., Liu, X., Huang, C., & Shen, J. (2010). A GIS-based framework for bus network optimization using genetic algorithm. Annals Of GIS, 16(3), 185-194. doi: 10.1080/19475683.2010.513152

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. Springer Science & Business Media.

Jung, Y. (2017). Multiple predicting K-fold cross-validation for model selection. Journal of Nonparametric Statistics, 30(1), 197-215. https://doi.org/10.1080/10485252.2017.1404598

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30 (NIPS 2017). Neural Information Processing Systems Foundation. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree


Li, H. (2017, April 12). Which machine learning algorithm should I use? The SAS Data Science Blog. https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/

McDonald, J. H. (2009). Handbook of biological statistics (3rd ed.). Sparky House Publishing.

McNally, M. (2007), "The Four-Step Model", Hensher, D. and Button, K. (Ed.) Handbook of Transport Modelling (Vol. 1), Emerald Group Publishing Limited, pp. 35-53.

Newell, G. (1979). Some Issues Relating to the Optimal Design of Bus Routes. Transportation Science, 13(1), 20-35. doi: 10.1287/trsc.13.1.20

Ravanshad, A. (2018, April 27). How to choose machine learning algorithms? Medium. https://medium.com/@aravanshad/how-to-choose-machine-learning-algorithms-9a92a448e0df

Ray, S. (2017, September 9). Commonly used machine learning algorithms (with Python and R codes). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/?#

Roosmalen, J. V. (2019). Forecasting bus ridership with trip planner usage data: a machine learning application [Master's thesis]. https://essay.utwente.nl/77590/1/Roosmalen_MA_BMS.pdf

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. A Bradford Book.

Szeto, W., & Wu, Y. (2011). A simultaneous bus route design and frequency setting problem for Tin Shui Wai, Hong Kong. European Journal of Operational Research, 209(2), 141-155. doi: 10.1016/j.ejor.2010.08.020

The State of Delaware. (n.d.). State of Delaware. https://delaware.gov. https://delaware.gov/topics/facts/geo.shtml

US Bureau of Labor Statistics. (2018, March 29). National Longitudinal Surveys. Retrieved January 16, 2020, from https://www.bls.gov/nls/questions-and-answers.htm

US Census Bureau. (2019a). Census Bureau QuickFacts. https://www.census.gov/quickfacts/DE


US Census Bureau. (2019b). Glossary. The United States Census Bureau. https://www.census.gov/programs-surveys/geography/about/glossary.html#par_textimage_5

US Census Bureau. (2020). Longitudinal Employer-Household Dynamics. https://lehd.ces.census.gov/

Wang, S., & Qu, X. (2014). Rural bus route design problem: Model development and case studies. KSCE Journal Of Civil Engineering, 19(6), 1892-1896. doi: 10.1007/s12205-013-0579-3

Wang, X., Zhang, N., Zhang, Y., & Shi, Z. (2018). Forecasting of short-term metro ridership with support vector machine online model. Journal of Advanced Transportation, 2018, 1-13. https://doi.org/10.1155/2018/3189238

Yamaguchi, T., A.S., M., & Mine, T. (2018). Prediction of bus delay over intervals on various kinds of routes using bus probe data. 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT). https://doi.org/10.1109/bdcat.2018.00020

Yin, T., Zhong, G., Zhang, J., He, S., & Ran, B. (2017). A prediction model of bus arrival time at stops with multi-routes. Transportation Research Procedia, 25, 4623-4636. https://doi.org/10.1016/j.trpro.2017.05.381
