arXiv:2103.11824v1 [cs.LG] 18 Mar 2021 otto ytm,teeegn nrsrcue fItre o Internet of infrastructures emerging the systems, portation oPtbt ees i aahsbe sdfrvrostranspor various for used been has data Big senso levels. static Petabyte through to data big of collection the for opportunities rwa a oual enkona h ifrainexplosion” “information the as known been popularly has what or 1 i aacnb rcdbc oams ihyyasao hnpeop when ago, years eighty almost to back traced be can data Big Jiang* Weiwei Dat of Survey A Prediction: and Tools Estimation Traffic for Data Big TYPE ARTICLE xxx/xxxx DOI: production at Added Received: rcsigsraigdata. distribu streaming S4 processing Spark, data. big storing for used databases in-memory “Application-cont named publication a in appeared first data” “big jwwthu@.com Email: author. *Corresponding Correspondence 2 1 rcsigadaayistosicueHdo,Sak ln,Co no Flink, Spark, is als Hadoop, it are include tools if tools relevant analytics use and data, no processing big have of would development the it With but it. value from intrinsic has especially Data messy, value. be can is world one the in Data data. dat of These accuracy video. and and audio text, as such data semi-structured or o size whose sets data as defined be can data big speaking, s Broadly and Internet, the smartphones, memo computers, main of of development capacities the taxing large, dat quite big generally of are definition sets ever first The 1997. October in Ellsworth David n uuae n h edfratn ndt ta nraigpac increasing an at data on hund acting maybe for need or the terabytes and of cumulated tens and to accumulated co be value can unknown with size data Data of amount huge for stands sum are volume data The big of Characteristics latency. low with process and aiona USA California, Angeles, Los Angeles, of California-Los University Statistics, of Department China Beijing, University, Tsinghua Engineering, Electronic of Department soeo h yia plcto cnro,bgdt a enwidel been has data big scenarios, application typical the of one As INTRODUCTION 1 iynLuo Jiayun | eie:Adda production at Added Revised: 2 i aa rffi siain rffi rdcin pndata open prediction, traffic estimation, traffic data, big predicti KEYWORDS: and studies. estimation future traffic for for given are data directions big esti future traffic of to for use off-the-shelf used the the tools and promote categorized data are big types and data data Different t open trend, of this survey with date Combined efficiency. operation overall and estimated the tran well the be including can states areas traffic many sources, in data various widely used been has data Big Summary noy(nento hns qimn,dt eppln pa fas at up piling keep data equipment, Things) of (Internet ensory ainmdswt utpeproe,eg,srtgcartacma traffic air strategic e.g., purposes, multiple with modes tation hns(o) ye hsclSses(P)adsatcte hav cities smart and (CPS) Systems Physical Cyber (IoT), Things f dv,Kfa n aot aopHF,Hae ogD,Hive, MongoDB, HBase, HDFS, Hadoop Mahout. and Kafka, rdova, s uvilnecmrs oiedvcs h aasz a grow has size data The devices. mobile cameras, surveillance rs, ye fe eur diinlpercsig h eaiyrep veracity The preprocessing. additional require often types a osatybigdvlpdadudt ofll h rwn n growing the fulfill to update and developed being constantly o aie nofieV,wihrpeetvlm,vlct,vrey v variety, velocity, volume, represent which Vs, five into marized y oa ik n vnrmt ik ecl hstepolmo b of problem the this call We disk. remote even and disk, local ry, e temcmuigpafr n pceSrasaeapplicat are Streams Apache and platform computing stream ted ntecs fbgdt hndt iesosaddt ore increa sources data and dimensions data when data big of case the in atr rtue n14,acrigt h xodEgihDictiona English Oxford the to according 1941, in used first term (a olddmn aigfroto-oevsaiain rte yM by written visualization” out-of-core for paging demand rolled eecutrtefis tepst uniytegot aei th in rate growth the quantify to attempts first the encounter le yei eodteaiiyo rdtoa eainldtbssto databases relational traditional of ability the beyond is type r s“iulzto rvdsa neetn hleg o computer for challenge interesting an provides “Visualization is a eso eaye.Tevlct en h atrt twihdat which at rate fast the means velocity The petabytes. of reds igfo oiedvcs oilmda h nento hns(IoT Things of Internet the media, social devices, mobile from ming .Tevreyrfr odffrn aatpsaalbe including available, types data different to refers variety The e. sdi h rnpraindmi.Coeycnetdt intell to connected Closely domain. transportation the in used y cetd de tproduction at Added Accepted: rnfre noaual omto ti nbet xrc infor extract to unable is it or format usable a into transformed t ute rdce o improving for predicted further l r nrdcd ofurther To introduced. are ols i td rsnsa up-to- an presents study his prainidsr.Using industry. sportation ntss hlegsand challenges tasks, on ainadprediction. and mation e n atrspeed. faster and ter aeet(i tal. et (Xie nagement eet h quality the resents rmTrillionbyte from n rct,advalue. and eracity, gdt. ihthe With data.” ig osta support that ions rvddgreat provided e es ao data Major eeds. oueo data of volume e atr,manage capture, n Q r the are SQL and calCxand Cox ichael and a ytm:data systems: unstructured y.Teterm The ry). sreceived is a gn trans- igent e h last The se. n more. and ) mation 2 Jiang and Luo

2019), travel pattern modeling (L. Gong, Liu, Wu, & Liu 2016), road congestion pattern prediction (He, Zheng, Chen, & Guan 2017), etc. are used for government policies, e.g., decision-making support tools for smart cities, transparent governance and critical operations. Cutting-edge technologies integrated with big data are also used for urban planning and smart cities, for example, big data, in-memory computing, deep learn- ing, and GPUs are used in rapid transit systems (Aqib et al. 2019). For a broader discussion of the applications of big data in the transportation domain, interested readers may refer to the relevant surveys (Moharm, Zidane, El-Mahdy, & El-Tantawy 2018; Neilson, Daniel, & Tjandra 2019; Torre-Bastida et al. 2018; Zhu, Yu, Wang, Ning, & Tang 2018). Among different applications, traffic estimation and prediction are two most significant tasks, which would be the focus of this study. While various data related to the traffic situations can be obtained, the traffic information, e.g., volume, speed, and travel time, is not available without some processing of the raw data in some cases. The process is referred as the traffic estimation problem, whose target is to extract precise traffic information from any raw data. While estimating the traffic information can only give us the historical states, one of the most important application of big data is to predict the future situation based on the historical input and adopt appropriate measures accordingly, e.g., traffic control. Various methods have been proposed for the traffic estimation and prediction tasks, including the statistical models, machine learning models, and deep learning models, in which the deep learning models are becoming dominant because they show the best performance (W. Jiang & Luo 2021; W. Jiang & Zhang 2018a). The success of deep learning models is partially attributed to big data because these models rely on a large training dataset. Several open datasets also boost the development and fair comparison among new models. To summarize, significant progresses have been achieved in previous studies for traffic estimation and prediction with the appearance of big data. However, there is a lack of an up-to-date summary and collection of open datasets and tools. Some of the relevant studies are based on private data, whose results are impossible to replicate. In this survey, we focus on open datasets, especially the big-volume and multi-modal ones. To further boost the relevant studies, we also release a processed GPS trajectory dataset that are collected from more than 20,000 taxi drivers in Beijing in three months, namely, November 2012, November 2014 and November 2015, which has been used in our previous studies (W. Jiang, Lian, Shen, & Zhang 2017; W. Jiang & Zhang 2018b). The dataset is publicly available 1. Our contributions in this paper are summarized as follows:

• We give a summary of different data types used for traffic estimation and prediction tasks;

• We summarize the latest collection of relevant open datasets;

• We contribute a new GPS trajectory dataset for the research community;

• We summarize the collection of relevant big data tools;

• We point out the challenges and future directions of utilizing big data for traffic estimation and prediction tasks.

The following parts of this paper are organized as follows. The data used for traffic estimation and prediction tasks are summarized in Section 2. The big data tools are collected and introduced in Section 3. The relevant challenges and future directions are pointed out in Section 4. The conclusion is drawn in Section 5.

2 DATA-DRIVEN TRAFFIC ESTIMATION AND PREDICTION

In this section, we summarize the different types of data that can be used for traffic estimation and prediction. We also contribute an up-to- date collection of available open datasets for each data type as well as a new GPS trajectory data for further studies. There are different ways of categorizing traffic big data. For example, the traffic big data are divided into supervised and unsupervised types previously, in which supervised data are direct sources of traffic information, e.g., loop detectors and GPS traces, while unsupervised data are indirect sources which can be used to infer traffic information, e.g., call detail records and cell-phone position data. In this study, we further divide traffic big data into more types based on data sources. Open data policies in different countries vary a lot. Take the U.S. government as the example, transportation related open data can be found from National Transit Database (NTD) 2, Federal Highway Administration (FHWA) 3, Bureau of Economic Analysis (BEA) 4, and American Community

1The data would be available here: https://github.com/jwwthu/DL4Traffic 2https://www.transit.dot.gov/ntd/ntd-data 3For example, Urban Congestion Reports https://ops.fhwa.dot.gov/perf_measurement/ucr/ 4https://www.bea.gov/data/ Jiang and Luo 3

Survey (ACS) 5. The Next Generation Simulation (NGSIM) 6 is the most widely used open-source vehicle trajectory dataset for traffic flow stud- ies (L. Li, Jiang, He, Chen, & Zhou 2020). In this study, we mainly focus on the open datasets in the academia. More data collections can be found online, which may be maintained by individuals and research institutes, e.g., mobility datasets 7, open traffic collection 8, and Beijing City Lab 9.

2.1 Trip surveys

Trip surveys are detailed questionnaires on mobility habits, which are usually collected by local authorities or researchers. Sometime it is only way to accurately measure and understand people’s changing daily travel patterns, when different travel modes are considered and location privacy is not violated. For example, in the "My Daily Travel Survey" 10 by the Chicago Metropolitan Agency for Planning between August 2018 and April 2019, the households in northeastern Illinois are asked to tell the trips they made for work, school, shopping, errands, and socializing with family and friends. Another example is the "California Household Travel Survey (CHTS)" 11 by the National Renewable Energy Laboratory (NREL) between 2010 and 2012. As the largest such regional or statewise survey ever conducted in the United States, detailed travel behavior information was obtained from more than 42,500 households via multiple data-collection methods, including computer-assisted telephone interviewing, online and mail surveys, wearable (7,574 participants) and in-vehicle (2,910 vehicles) global positioning system (GPS) devices, and on-board diagnostic sensors that gathered data directly from a vehicle’s engine. Details of personal travel behavior were gathered within the region of residence, inter-regionally within the state, and in adjoining states and Mexico. The survey sampling plan was designed to ensure an accurate representation of the entire population of the state. Among the participated households, 42,436 completed the travel diary survey portion, 3,871 completed the wearable GPS portion, and 1,866 completed the vehicle GPS portion of the study. Trip details including purpose, mode, travelers, tolls, departure time, arrival time, and distance are collected for both private vehicles and public transit trips. More trip surveys are available on the Internet 12. The pros of trip surveys include high-resolution and detailed information, which contain the trip information directly and eliminates the need for traffic estimation. The cons of trip surveys include small sample size, limited spatial and temporal scale, self-reporting errors, and high cost to collect, which limits the availability of such data as well as the real-time applications, e.g., they are less used for traffic prediction.

2.2 Census Data and Official Statistics

Census data and official statistics may also contain traffic information and other information useful for traffic problems, such as location of residence and workplace. These data are usually collected by governments periodically. For example, in the Commuting Flows 13 by the US Census Bureau, the primary workplace location from the respondents are collected. In the Migration Data 14 by the Internal Revenue Service, both inflows and outflows of migration patterns are available for Filing Years 1991 through 2018 with the IRS. More official statistics can be found in Table 1. The pros of census data and official statistics are their large sample size and large coverage, which is usually the entire country. The cons are also obvious. The flows are aggregated at the municipality or county level, which is coarse. There is only limited traffic information. And it are also expensive and time consuming to collect this type of data as the trip surveys. For now, census data and official statistics are not widely used for traffic estimation and prediction tasks with only a few exceptions (Katranji et al. 2020).

2.3 Road Sensor Data

Static sensors can be deployed to collect traffic data, e.g., inductive loop detectors, magnetic and pneumatic tube sensors, radar sensors, infrared sensors, and acoustics sensors. Loop detectors are the most mature technology among road sensors and contribute many famous open datasets for traffic estimation and traffic prediction. A complete list of open road sensor data used in previous studies is shown in Table 2. The pros of road sensor data are their abilities of collecting large volume data continuously and automatically, usually in minutes or seconds. The cons of these data are their fixed coverage in the spatial range as well as the missing data problem caused by sensor failures. Also, the traffic speed obtained from traffic sensors could be a point measurement and may not be suitable to calculate the average travel time across the road link.

5https://www.census.gov/programs-surveys/acs/data.html 6https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm 7https://privamov.github.io/accio/docs/datasets.html 8https://github.com/graphhopper/open-traffic-collection/ 9https://www.beijingcitylab.com/data-released-1/ 10https://www.cmap.illinois.gov/data/transportation/travel-survey#My_Daily_Travel_Survey 11https://www.nrel.gov/transportation/secure-transportation-data/tsdc-california-travel-survey.html 12https://www.nrel.gov/transportation/secure-transportation-data/tsdc-cleansed-data.html 13https://www.census.gov/topics/employment/commuting/guidance/flows.html 14https://www.irs.gov/statistics/soi-tax-stats-migration-data 4 Jiang and Luo

TABLE 1 The list of open official statistics data.

Name Type Spatial Range Temporal Download Link Range GB Road Traffic Counts Aggregated Great Britain 2000-2019 https://data.gov.uk/dataset/208c0e7b-353f-4e2d-8b7a- traffic counts 1a7118467acc/gb-road-traffic-counts Highways England net- Traffic flow, Great Britain 2006-2020 https://data.gov.uk/dataset/9562c512-4a0b-45ee-b6ad- work journey time and travel time afc0f99b841f/highways-england-network-journey-time- traffic flow data and-traffic-flow-data DataMall Multiple types Singapore Frequently https://datamall.lta.gov.sg/content/datamall/en/dynamic- updated data.html Minnesota Department of Traffic flow, Twin Cities, Since 2011 http://data.dot.state.mn.us/datatools/ Transportation Traffic Data traffic speed Minnesota, USA Chicago Traffic Tracker Traffic speed Chicago, USA 2011-2018 https://data.cityofchicago.org/Transportation/Chicago- Traffic-Tracker-Historical-Congestion-Esti/77hq- huss/data US- Traffic USA Feb. 2016 https://smoosavi.org/datasets/us_accidents Accidents (Moosavi et al. accident to Dec. 2019) 2020

TABLE 2 The list of open road sensor data.

Name Type Spatial Range Temporal Range Download Link Performance Measurement Sys- Traffic speed, California, USA 2001-2019 http://pems.dot.ca.gov/ tem (PeMS) Data traffic flow METR-LA (Y. Li et al. 2018) Traffic speed, Los Angeles, March 1st to https://github.com/liyaguang/DCRNN traffic flow USA June 30th, 2012 Seattle-Loop-Data (Cui et al. Traffic Speed Seattle, USA January 1st to https://github.com/zhiyongc/Seattle-Loop- 2018) 31st, 2015 Data Traffic Speed Traffic speed Guangzhou, August 1, 2016 https://zenodo.org/record/1205229 Guangzhou (X. Chen et al. 2018) China to September 30, 2016 Traffic Flow Madrid Traffic flow Madrid, Spain Since 2013 http://datos.madrid.es Traffic Flow Whitemud Drive Traffic flow Whitemud Drive, August 6, 2015 http://www.openits.cn/openData1/700.jhtml Canada to August 28, 2015 UTD19 (Loder et al. 2019) Traffic 40 cities globally Various time https://utd19.ethz.ch/index.html capacity periods PORTAL Traffic speed Portland- Regularly https://portal.its.pdx.edu/home Vancouver updated Metropolitan region, USA

2.4 Call Detail Records

Call detail records (CDRs), cellular signaling data and cell-phone position data are all generated from the cellular network, which contain similar infor- mation of human mobility patterns. These data are produced by a phone carrier to store details of calls passing through a device and usually contain various attributes of the call, e.g., timestamp, source, destination, tower, duration. Then the position and trajectory (sequence of locations) can be inferred from CDRs. Compared with GPS trajectory data with a higher spatial resolution, CDRs provide a coarse positioning approach. And data preprocessing is necessary, not only for removing noise, but also for identifying the key locations. As one of the pioneering studies, cellular phone Jiang and Luo 5 tracking data are used for traffic volume estimation and further evaluated with loop detector data in Caceres, Romero, Benitez, and del Castillo (2012). Afterwards, more and more studies are using call detail records and cell-phone position data for traffic status inference, even in recent years. One-month cellular signaling data (SD) are used to extract road-level human mobility with a multi-information fusion framework, which takes the SD uncertainty issue into consideration (Song et al. 2020). Based on a regression model, a cell probe (CP)-based method is proposed to estimate the vehicle speed with the normal location update (NLU) procedure and the consecutive handoff (HO) event as inputs, and the proposed method achieves a 97.63% accuracy (C.-H. Chen 2020). Long-term evolution (LTE) access data are used as input for a deep-learning-based road traffic pre- diction system, to predict the real-time speed of traffic (Ji & Hong 2019). Characteristics of human mobility patterns are revealed by high-frequency cell-phone position data in C. Zhao, Zeng, and Yeung (2021). More usages can be found in recent surveys (Ghahramani, Zhou, & Wang 2020). The public available CDR or similar data are shown in Table 3. The pros of using CDR or similar data are their ubiquitous availability (with mobile phones), large volume, and multiple dimensions (social, mobile, time, demographics). However, the cons include the poor positioning ability, the noise, and the preprocessing requirement. Because the position can only be derived from the cellular towers, it is partially detected only when calls are made and known at tower level only. What’s worse, the ping-pong effect between different towers can create noise in the collected data.

TABLE 3 The list of open CDR data.

Name Spatial Range Temporal Range Download Link Telecommunications-SMS, call, Milan, Italy November 1, 2013 to https://dataverse.harvard.edu/dataset.xhtml? internet-MI (Italia 2015) Januray 1, 2014 persistentId=doi:10.7910/DVN/EGZHFV OpenCellID Multiple cities globally Regularly updated https://www.opencellid.org

2.5 GPS Trajectory Data

GPS trajectory data are often collected from the floating vehicles, which are equipped with on-board positioning systems and communication devices. Also, the smartphones can also be used for GPS trajectory data collection, e.g., in a crowd-sourced approach. While GPS trajectory data are easy to collect, the preporcessing steps are necessary, e.g., location detection to group points into one meaningful location and trajectory segmentation to split a trajectory in sub-trajectories. Map matching is also used to map the GPS coordinates into the road network. Because of the advantages such as fine granularity, GPS-based “Track and Trace” data are formally defined and highlighted for transportation modelling and policy-making in a recent survey (Harrison, Grant-Muller, & Hodgson 2020). Taxi trajectory data can also be used for trip purpose inference and travel pattern discovery (L. Gong et al. 2016). More relevant studies are using GPS trajectory data for traffic estimation and prediction, e.g., travel time is estimated with the GPS trajectory data in Zou, Yang, and Zhu (2020) by combing the gradient boosting decision tree (GBDT) model and the deep neural network (DNN) model, and OD pairs are predicted with big GPS data in Y. Wang, Xu, Peng, Xuan, and Zhang (2020). Due to the privacy concerns, most of the open GPS trajectory data are collected from taxis, with only a few exceptions, as shown in Table 4. Also in many cases, the raw GPS trajectory data are not shared, and only the estimated traffic states derived from the GPS trajectory are publicly available. The pros of using GPS trajectory data include the high spatial and temporal resolution as well as the tracking ability of the full traces. Taxis can also been seen as the mobile probes, which reflect the whole road traffic situations. The cons are similar to CDRs, which include the preprocessing requirement and the noise in the data. What’s more, the GPS trajectory data are often collected in a high frequency, e.g., in seconds, so the huge volume of such data requires the storage and computation abilities, which can only be fulfilled by big data tools. Besides the existing ones, we also contribute a new GPS trajectory dataset for the research community in this study. The GPS trajectory data are collected in Beijing during three time periods, namely, November 2012, November 2014, and November 2015. Each GPS data sample contains the following attributes: the anonymous taxi identity, timestamp, latitude, longitude, azimuth, spot speed, and operation status (occupied, vacant or stopped). The data are sampled with interval of one minute roughly. The data summary is shown in Table 5. This dataset is publicly available 15.

2.6 Location-Based Service Data

With the development of the mobile Internet, location-based service (LBS) data arise with the GPS functionality on smartphones. Various location- base big data are collected from the location-based social network, e.g, checkins, geotagged tweet and micro-blogs, and the Maps and Navigation Apps, e.g., Map and Baidu Map. These data are often collected in a crowd-sourced approach and traffic information extraction step is

15The data would be available here: https://github.com/jwwthu/DL4Traffic 6 Jiang and Luo

TABLE 4 The list of open GPS trajectory and relevant data.

Name Type Spatial Range Temporal Range Download Link SF Taxis or Cabspot- Taxi GPS tra- San Francisco, May 17,2008to http://crawdad.org/epfl/mobility/20090224/index.html ting (Piorkowski et al. jectory USA June 10, 2008 2009) Rome Taxis (Bracciale et al. Taxi GPS tra- Rome, Italy Feburary 1, http://crawdad.org/roma/taxi/20140717/index.html 2014) jectory 2014 to March 3, 2014 Porto Taxis Taxi GPS tra- Porto, July 1, 2013 to https://www.kaggle.com/c/pkdd-15-predict-taxi- jectory Portugal June 30, 2014 service-trajectory-i/data Geolife (Y. Zheng et al. Taxi GPS tra- Beijing, China Apr. 2007 to https://www.microsoft.com/en- 2010) jectory Aug. 2012 us/download/details.aspx?id=52367 Mobile Data Challenge Taxi GPS tra- Lake Geneva 2009 to 2011 https://www.idiap.ch/dataset/mdc (MDC) (Kiukkonen et al. jectory region 2010) TaxiCD Taxi GPS tra- Chengdu, August 3rd to https://js.dclab.run/v2/cmptDetail.html?id=175 jectory China 30th, 2014 Grab-Posisi (Huang et al. Grab Drive Southest Asia April 8, 2019 to Upon request 2019) GPS trajectory April 21, 2019 PrivateCarTrajectoryData Private car Shenzhen, January 2016 https://github.com/HunanUniversityZhuXiao/ GPS trajectory China PrivateCarTrajectoryData TaxiBJ (J. Zhang et al. 2017) Taxi traffic Beijing, China Multiple time https://github.com/Mouradost/ASTIR or flow periods https://www.jianguoyun.com/p/DesHv2UQs- HRBxi5gtYB BikeNYC (Mourad et al. Bike traffic NYC, USA April 1, 2014 to https://github.com/Mouradost/ASTIR or 2019) flow Sepetmber 30, https://www.jianguoyun.com/p/DesHv2UQs- 2014 HRBxi5gtYB Traffic Speed Taxi traffic Chengdu, June 1, 2015 to https://doi.org/10.6084/m9.figshare.7140209.v4 Chengdu (F. Guo et al. speed China July 15, 2015 2019) SHSpeed (Shanghai Traf- Taxi traffic Shanghai, April 1st to https://github.com/xxArbiter/grnn fic Speed) (X. Wang et al. speed China 30th, 2015 2018) TaxiSZ (L. Zhao et al. 2019) Taxi traffic Shenzhen, January 1st to https://github.com/lehaifeng/T-GCN speed China 31st, 2015

TABLE 5 Data summary for the GPS trajectory dataset contributed in this study.

Time Periods November 2012 November 2014 November 2015 Taxi Drivers 8,879 17,749 20,067 Days 30 30 30

necessary. To effectively collect, integrate and process crowd-sourced data including mobile applications, webs, as well as external data sources, a prototype system is developed and validated in Mai-Tan, Pham-Nguyen, Long, and Minh (2020), to infer traffic conditions. Traffic speed data of 29 cities across the world over a 40-day period are gathered from Google Map API and further used for the analysis of traffic congestion pat- terns (Nair, Gilles, Chand, Saxena, & Dixit 2019). Social media texts may also contain traffic information, e.g., the geotagged tweet and micro-blogs. However, the challenge is to extract traffic relevant information from natural languages. Deep learning models have been proposed for conduct the information extraction, e.g., a LSTM-CNN is proposed to extract traffic relevant microblogs in Y. Chen, Lv, Wang, Li, and Wang (2018), which out- performs the baselines including the support vector machine model based on a bag of n-gram features. Social media data are further proven effective Jiang and Luo 7 for traffic accident detection and reporting (Ali et al. 2021; Wan, Lucic, Ghazzai, & Massoud 2020), traffic jam management (Y. Wang, He, & Hu 2020). Combing social media data from multiple social networks can further improve the traffic event detection accuracy, e.g., the case of combin- ing Arabic and English data streams from Twitter and Instagram used in the system SNSJam (Alkouz & Al Aghbari 2020). However, location-based big data may be low-quality with noises. Taking Bluetooth speed data as the ground truth, the quality of data in Sevierville, TN, USA is eval- uated in Hoseinzadeh, Liu, Han, Brakewood, and Mohammadnazar (2020), in which a kNN method manages to achieve a prediction accuracy of 84.5% and 82.9% for traffic speed estimation based on Waze data as the input. The list of open location-based service data is shown in Table 6. The pros of using LBS data include their wide availability and semantic informa- tion, e.g., restaurant, mall, etc. However, the cons are also obvious. The human mobility can only be partially detected when check-ins or geotagged texts are made. The data are thus very sparse and can be more sparse than CDRs. There is also the self-selection bias, which would cause inaccurate traffic information.

TABLE 6 The list of open location-based service data.

Name Type Spatial Range Temporal Range Download Link Q-Traffic or Baidu- Traffic speed from Beijing, China April 1, 2017 to https://github.com/JingqingZ/BaiduTraffic Traffic (Liao et al. navigation Apps May 31, 2017 2018) Alibaba Cloud Tianchi Travel time from Guiyang, China Apr. 2017 https://tianchi.aliyun.com/competition/ Dataset navigation Apps entrance/231598/information Brightkite (Cho et al. 2011) Check-ins N/A Apr. 2008 to Oct. https://snap.stanford.edu/data/loc- 2010 brightkite.html Gowalla (Cho et al. 2011) Check-ins N/A Feb. 2009 to Oct. https://snap.stanford.edu/data/loc- 2010 gowalla.html Foursquare (Yang et al. 2013) Check-ins NYC, USA October 24, 2011 https://sites.google.com/site/yangdingqi/ to February 20, home/foursquare-dataset 2012 Yelp Check-ins Multiple cities Regularly updated https://www.yelp.com/dataset globally MapBJ (Cheng et al. 2018) Traffic congestion Beijing, China Mar. 2016 to June https://github.com/cxysteven/MapBJ from navigation 2016 Apps Tecent API Traffic flow index China Since 2015 https://heat.qq.com/ Uber Movement Travel time and Multiple cities Since 2017 https://movement.uber.com/ speed globally

2.7 Public Transport Transaction Data

Transaction data can be collected in various public transport systems, especially those with automatic fare systems, and further used for traffic esti- mation and prediction. For example, smart card data are often used for public transit origin-destination (OD) estimation (Hussain, Bhaskar, & Chung 2021). Based on smart card and bus trajectory data, a two-stage transportation analysis approach is proposed to reconstruct the individual pas- senger trips and cluster these trips to identify the transit corridors in T. Zhang et al. (2020). Based on 10 million taxi trip records in New York and Shenzhen and using spatio-temporal clustering, Bayesian probability and Monte Carlo simulation, a two-layer framework, which consists of an activity inference model and a pairing journey model, is proposed to extract and predict travel patterns (S. Gong et al. 2020). Using gravity model and Bayesian rules, the purpose of dockless shared-bike users is inferred from a shared bike dataset in Shenzhen, China, and the introduction of activity type proportion and service capacity of point of interest (POIs) is proven effective for the inference (S. Li et al. 2021). Based on daily OD amount by transportation and by purpose and several surveys on time use and activities, an open people mass movement dataset named Open PFLOW is built in Kashiyama, Pang, and Sekimoto (2017), which achieves a comparable accuracy with other approaches, e.g., commercial mobility data and traffic census. 8 Jiang and Luo

There are also many public transport transaction data that are available for the research community as shown in Table 7. The pros of using public transport transaction data are their wide availability and close connection with traffic state. The cons are the storage and computation requirements, which can be high if the data cover a wide spatial and temporal ranges.

TABLE 7 The list of open public transport transaction data.

Name Type Spatial Range Temporal Range Download Link SHMetro (L. Liu et al. Metro Shanghai, China July 1, 2016 to https://github.com/ivechan/PVCGN 2020) September 30, 2016 HZMetro (L. Liu et al. Metro Hangzhou, China January 2019 https://github.com/ivechan/PVCGN 2020) Hangzhou Metro Metro Hangzhou, China January 1, 2019 https://tianchi.aliyun.com/competition/entrance/231708/ to January 25, information 2019 Bike Bay Shared bike Bay Area, USA September 1, https://github.com/TwinkleBill/babs_open_data_year_3 Area (Yi et al. 2019) 2015 to August 31, 2016 BikeNYC Shared bike NYC, USA July 1, 2013 to https://www.citibikenyc.com/system-data December 12, 2016 BikeDC Shared bike Washington D.C., 2011-2016 https://www.capitalbikeshare.com/system-data USA BikeChicago Shared bike Chicago, USA 2015-2020 https://www.divvybikes.com/system-data Bike Chattanooga Shared bike Chattanooga, July 23, 2012 to https://data.chattlibrary.org/ Trip Data Tennessee, USA April 9, 2020 Mobike Bei- Shared bike Beijing, China May 10,2017to https://github.com/SharingBikeNNU/Riding- jing (Cao et al. May 24, 2017 Modes_Tucker 2020) Ride Austin Ride sharing Austin, USA June 2, 2016 to https://data.world/ride-austin April 13, 2017 UberNYC Ride sharing NYC, USA from April to https://github.com/fivethirtyeight/uber-tlc-foil-response September 2014 Didi GAIA Open Ride sharing Chengdu, Xi’an, Various time https://outreach.didichuxing.com/research/opendata/ Data and Haikou, periods China TaxiNYC Taxi NYC, USA Since 2009 http://www.nyc.gov/html/tlc/html/about/ trip_record_data.shtml

2.8 Surveillance and Airborne Digital Cameras

Closed Circuit Television (CCTV) cameras are widely used for monitor traffic patterns and help police forces in large cities. By collecting these surveillance videos, traffic information can be estimated and further used for prediction. For example, traffic state with CCTV systems is analyzed in Bui, Yi, and Cho (2020). However, the challenge is the heavy workload of processing digital multimedia data. Transportation video data manage- ment is considered in Hao and Qin (2020) and a high-performance computing architecture is developed with the distributed file and distributed computing systems. Edge computing is also applied for traffic flow detection in real-time videos. A specific edge device Jetson TX2 platform is used in C. Chen, Liu, Wan, Qiao, and Pei (2021), with the YOLOv3 (You Only Look Once) model for vehicle detection and the optimized DeepSORT (Deep Simple Online and Realtime Tracking) algorithm for vehicle tracking. The edge device manages to achieve an average processing speed of Jiang and Luo 9

37.9 frame per second, with an average accuracy of 92.0%. Besides the static traffic surveillance cameras, remote sensing tools, e.g., airborne digital cameras and Unmanned Aerial Vehicle (UAV) cameras, can also be used for traffic monitoring. For example, the average traffic speed and density is estimated with an airborne optical 3K camera system in Leitloff, Rosenbaum, Kurz, Meynberg, and Reinartz (2014) and the traffic flow parameter is calculated from UAV aerial videos in Brkić, Miler, Ševrović, and Medak (2020). Some open video data are summarized in Table 8. The pros of video data are their continuous observation, the wide coverage for road traffic, and the multiple usages for traffic estimation, prediction, accident analysis and even vehicle tracking. The cons of video data are their high storage and computation requirements. The video stream also has many engineering challenges, e.g., compression artefacts, blurring, and hardware faults.

TABLE 8 The list of open video data.

Name Spatial Range Temporal Range Download Link MIT Traffic One camera in 90 minutes long http://mmlab.ie.cuhk.edu.hk/datasets/mit_traffic/index.html Dataset (X. Wang et al. 2008) MIT Video Surveillance One camera in 982 video https://github.com/alnfedorov/traffic-analysis Data (Fedorov et al. 2019) Chelyabinsk, frames Russia

2.9 License-plate Recognition Data

License-plate recognition (LPR) data are emerging data source for vehicle re-identification as well as traffic estimation and prediction. Large-scale LPR data collection systems are being developed all over the world. However, due to the data privacy concerns, there are few open datasets and no so many relevant studies. In Zhan, Li, and Ukkusuri (2020), a six days’ LPR dataset from a small road network in Langfang, China, is used to estimate and predict link-based traffic state and the feasibility is validated by using a more comprehensive link-level field experiment dataset. A similar type of data is the electronic registration identification (ERI), which is also used in traffic flow prediction (L. Zheng, Yang, Chen, Sun, & Liu 2020). The pros of LPR or ERI data are their precise tracking abilities of individual vehicles. However, this feature also causes the data privacy concern. Another disadvantage is that there are no publicly available LPR or ERI datasets.

2.10 Toll Ticket Data

Electronic toll collection (ETC) systems are widely applied in toll roads, HOV lanes, toll bridges, etc. For example, more than 90 percent of vehicles on expressways in China has been equipped with the ETC system by the end of year 2019. Toll ticket data (TTD) collected from the ETC systems provide a high-coverage and low-cost approach for expressway traffic estimation. Based on the TTD from the Shandong Expressway Toll System in China, a simulation-based dynamic traffic assignment algorithm is proposed to obtain traffic flow in E. Yao, Wang, Yang, Pan, and Song (2021) and achieves a high examination accuracy (around 5% MAPE). The only public source of toll ticket data used for traffic flow prediction as the authors know is the Amap data for KDD CUP 2017 16. The pros of tool ticket data include their high coverage and low cost. However, the cons are obvious. Toll ticket data can only be collected in the places where ETC systems are applicable.

2.11 External Data

Besides the above data that are used to extract and predict traffic states, some external data are also used for these problems as supplementary data, e.g., weather data and calendar information. These data do not have a large volume and are easily available. There are many public sources for weather data, e.g., Weather Underground API 17, MesoWest 18. For calendar information, it is also widely accessible from the Internet, e.g., OfficeHolidays 19. To name some relevant studies that use these external data, 48 weather forecasting factors are analyzed and used for traffic

16https://tianchi.aliyun.com/dataset/dataDetail?dataId=60 17https://www.wunderground.com/weather/api/ 18http://mesowest.utah.edu/ 19https://www.officeholidays.com/ 10 Jiang and Luo

flow prediction based on a regression model in Lee, Hong, Lee, and Jang (2015) and rainfall data are also used for similar purposes (Dunne & Ghosh 2013; Jia, Wu, & Xu 2017).

3 BIG DATA TOOLS

There are already some preliminary trials of applying the latest big data tools for traffic big data, as traditional big data tools are not optimized for big data in the transportation domain. Some preliminary efforts have been proposed for expanding the off-the-shelf tools for a better sup- port of spatiotemporal big data, e.g., Hadoop is expanded with the capacities of spatiotemporal indexing (Z. Li, Huang, Carbone, & Hu 2017) and trajectory analytics (Bakli, Sakr, & Soliman 2019). Based on PostgreSQL and PostGIS, an open-source mobility database named MobilityDB 20 is proposed for moving object geospatial trajectories, e.g., GPS traces (Zimányi, Sakr, & Lesuisse 2020). For analyzing and visualizing spatiotempo- ral big data, new data processing algorithms and methods should be implemented as new or extensions of existing commercial or open-source platforms, for ease of use and a better integration, e.g., the latest SuperMap GIS platform provides full support to the Spark computing frame- work (S. Wang, Zhong, & Wang 2019). Offline and online trajectory analysis are often separated in existing systems. To overcome this shortcoming, a Spark-based hybrid and efficient framework named Dragoon (Fang, Chen, Gao, Pan, & Jensen 2021) is proposed to both offline and online trajec- tory data analytics. Dragoon manages to decrease up to doubled storage overhead compared with the state-the-art big trajectory data management system Ultraman (Ding, Chen, Gao, Jensen, & Bao 2018), while maintaining the similar offline trajectory query performance. Dragoon also manages to achieves at least 40% improvement of scalability over Flink and Spark Streaming and achieves an average doubled performance improvement for online trajectory data analytics. Based on cellular data (2G/3G/4G), base station data, user information and road network data, a real-time urban mobility monitoring and traffic management system is proposed in Z. Yao et al. (2020), leveraging the big data tools including Hive, Spark, and Hbase. The proposed system manages to propose a total 600 TB cellphone data collected over 3 million people daily for 3 years, in a field case study in Guiyang, China. In this section, we collect and summarize the off-the-shelf big data storage and computing tools that are already used or can be further leveraged in traffic estimation and prediction tasks.

3.1

Apcahe Hadoop 21, first released in April 2006, is a sub-project of Lucene (a collection of industrial-strength search tools), under the umbrella of the Apache Software Foundation. Hadoop is written in Java and is a great big data tool as it can process structured and unstructured data from different sources. It parallelizes data processing utilizing computer cluster, accelerating large computations and hiding I/O latency through increased concurrency. It can also leverage its distributed file system to cheaply and reliably replicate chunks of data to nodes (computers) in the cluster, making data available locally on the machine that is processing it. Additionally, Hadoop provides to the application programmer the abstraction of map and reduce operations.

3.2 Apache Pig

Apache Pig 22, first released in September 2008, was initially developed by Yahoo’s researchers for executing MapReduce jobs on large datasets on a high-level of abstraction. It provides Pig Latin, a high-level scripting language, for writing data analysis code. Users write a script using the Pig Latin Language in order to process data in HDFS. The Pig Engine, a component of Apache Pig, would convert all the scripts into a map and reduce task for processing. However, this is a totally internal process and thus the programmers would not see the procedures. The result of Pig would be stored in HDFS after finishing. As compared to MapReduce, Apache pig reduces the time of development using the multi-query approach. The same job could be complete in much less amount of code using Pig Latin as compared to using Java. A Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby, or Groovy and then call directly from the language.

20https://github.com/MobilityDB/MobilityDB 21https://hadoop.apache.org/ 22https://pig.apache.org/ Jiang and Luo 11

3.3 Apache Mahout

Apache Mahout 23, first released in Apr 2009, is an open-source project to provide service for the application of distributed and scalable machine learning algorithms focus on linear algebra and statistics. It is written in Java and Scala. Mahout operates in addition to Hadoop, which allows users to apply machine learning algorithms through distributed computing on Hadoop. Mahout’s core algorithms include recommendation mining, clustering, classification, and frequent item-set mining. Developers are still working on Mahout but a number of algorithms have been implemented.

3.4 Apache Spark

Apache Spark 24 started as a research project at the UC Berkeley AMPLab in 2009 and was open-sourced in early 2010. It is a cluster computing technology designed for fast computation with its own cluster management. As an extension of the Hadoop MapReduce model, it allows more types of computation including batch applications, iterative algorithms, interactive queries, and stream processing. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application by reducing the number of read/write operations to disk. Spark provides built-in APIs in Java, Scala, or Python that enable writing applications in different programming languages.

3.5 Apache Kafka

Apache Kafka 25, initially released in January 2011, is an open-source distributed publish-subscribe messaging system created by LinkedIn and written in Scala and Java. It aims to provide a unified, high-throughput, and low-latency platform for handling and messaging high-volume real- time data. Kafka utilizes a binary TCP-based protocol that is optimized for efficiency and group in order to reduce the overhead of the network roundtrip. Kafka is also fault-tolerant as the messages are persisted on the disk and replicated within the cluster and it is built on top of ZooKeeper synchronization service.

3.6 Apache Flink

Apache Flink 26, initially released in May 2011, is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The main part of Apache Flink is a distributed streaming dataflow engine that is written in Java and Scala. It executes tasks in a parallel and pipelined way and thus can perform bulk/batch processing programs, stream processing programs and iterative algorithms. Besides providing a high-throughput, low latency streaming engine, Flink also supports event-time processing and state management. Applica- tions developed using Flink are fault-tolerant when machines fail as Flink relies on external fault-tolerance services for maintaining configuration information and distributed synchronization.

3.7 Apache Storm

Apache Storm 27, initially released in September 2011, is a distributed stream processing computation framework originally created by Nathan Marz and the team at BackType. Storm was later acquired and open-sourced by Twitter and became a standard for distributed real-time processing system in a short amount of time. Storm is written predominantly in the Clojure programming language. It uses custom created "spouts" and "bolts" to define the data manipulations process to allow batch, distributed processing of streaming data. “Spouts” would take in data and distribute to “bolts” and the functions and codes users write would be processed in bolts. The whole storm application composed of “spouts” and “bolts” could be represented by a directed acyclic graph (DAG) and form a “topology”. Data flow through Edges on the graph are named as streams. Together, the topology acts as a data transformation pipeline. The general topology structure is similar to a MapReduce job but data are processed in real-time instead of in individual batches. Additionally, Storm topologies run indefinitely until being shut down or encounter failure, while a MapReduce job DAG must eventually end.

23https://mahout.apache.org 24https://spark.apache.org/ 25https://kafka.apache.org/ 26https://flink.apache.org/ 27http://storm.apache.org/ 12 Jiang and Luo

3.8 Apache HDFS

Hadoop Distributed File System (HDFS) 28, first released on September 4, 2007, is a distributed file system designed to run on low-cost commodity hardware. HDFS can store and process large application datasets through storing data files across multiple machines and through parallel processing. It is highly fault-tolerant as it stores data redundantly across data nodes. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. Originally built as infrastructure for the Apache Nutch web search engine project, HDFS is now an Apache Hadoop subproject.

3.9 Apache HBase

Apache HBase 29, initially released in March 2008, is an open-source, column-oriented, non-relational database that evolves from Google’s Bigtable. It is written in Java and sits on top of the HDFS or Alluxio, providing Hadoop with Bigtable functionalities. It is capable of storing a large amount of sparse data in a fault-tolerant way. Besides, HBase provides compression, in-memory operation, and Bloom filters features on a per-column basis. Through the Java API and many other APIs, users can input from HBase into MapReduce jobs run in Hadoop, and MapReduce jobs can also output in HBase tables format. Because of the lineage with Hadoop and HDFS, HBase has been adopted worldwide. HBase cannot replace a classic SQL database, but SQL layer and JDBC driver are provided by Apache Phoenix project to enable using HBase with analytics and business intelligence applications.

3.10 MongoDB

MongoDB 30, first released in February 2009, is an open-source document-oriented database. Classified as a NoSQL database program, MongoDB does not use regular tables and rows to store data but instead uses JSON-like documents. These documents support embedded fields and related data can be stored within them. MongoDB has optional schemas, thus users are not required to specify the number or type of columns before inserting data. MongoDB is developed by MongoDB Inc. and licensed under the Server Side Public License (SSPL).

3.11 Apache Hive

Apache Hive 31, first released in October 2010, is a data warehouse infrastructure initially developed by Facebook for querying and analyzing data. It is built on top of Apache Hadoop and provides users with an SQL-like interface for manipulating data stored in databases and file systems that integrate with Hadoop. Hive solves the problem of having to use Java API for executing SQL applications and queries over distributed data by providing its own SQL type language for querying called HiveQL or HQL. Since most data warehousing applications work with SQL-based querying languages, Hive facilitates the use of SQL-based applications to Hadoop.

4 CHALLENGES AND FUTURE DIRECTIONS

While there are many successfully application cases of big data for traffic estimation and prediction tasks, challenges still exist. Data richness varies a lot for different transportation modes and the problems of data sparsity, high missing data ratios, data noise or lack of data still exist. Data quality, privacy and policies are not fully considered in previous studies. For example, crowd-sourced data have the problems of low data quality, noise removal difficulty and privacy concerns. Some efforts have been made for these challenges, e.g., sparse Bayesian learning is used for traffic state estimation with under-sampled data (Babu, Sure, & Bhuma 2021). For existing data for traffic estimation and prediction tasks, they are heterogeneous in the spatial and temporal ranges. The cross-scale data fusion by integrating various sources is still challenging in the transportation domain. Another challenge is the lack of “real” big data in the transportation domain, especially the open ones. While we did review many open data sources in this study, some of them have a data volume that is hard to be defined as “big". The time range of some available datasets are not long enough for training effective deep learning models. This challenge is partially caused by the high-cost and time-consuming process of collecting some types of data. The other reason is the concern of location privacy leakage, which prohibits the collection of fine-grained data. For existing big datasets, e.g., GPS trajectories, existing big data tools discussed in Section 3 are not fully exploited yet. With the popularity of using graph data

28https://hadoop.apache.org/hdfs/ 29https://hbase.apache.org/ 30https://docs.mongodb.com/manual/introduction/ 31https://hive.apache.org/ Jiang and Luo 13 for traffic prediction, e.g., transportation network in graph neural networks, the existing graph processing tools cannot fully meet the computing requirements and thus new tools are needed. To deal with these challenges, some future directions are pointed out in this section.

4.1 Heterogeneous Data Fusion

A single data source may not be enough for traffic tasks, when different transportation modes are entangled together. Data from multiple sources can be combined to make a better estimation and prediction for traffic situations. Heterogeneous data fusion is driven by this observa- tion, which often uses deep learning for urban big data fusion (J. Liu et al. 2020). There are already some relevant studies. For example, cellular data and loop detectors are integrated for freeway traffic speed estimation in J. Zhang, He, Wang, and Zhan (2015). Sparse stationary traffic sensor data are combined with the GPS trace data to estimate the traffic flow in the entire road network in Gkountouna, Pfoser, and Züfle (2020). License plate recognition data and cellular signaling data are combined for traffic pattern and population distribution correlation analysis in H. Chen, Cai, and Xiong (2021). License plate recognition data and GPS trajectory data are combined for traffic flow estimation in P. Wang, Lai, Huang, Tan, and Lin (2020). Geomagnetic detector data, floating car data and license plate recognition data are combined for aver- age link travel time extraction (Y. Guo & Yang 2020). Bus transit schedule data from General Transit Feed Specification (GTFS), real-time bus location data, and cell-phone data from the geographical mapping software are combined to predict bus delays and a mean absolute percentage error (MAPE) of about 6% is achieved (Shoman, Aboah, & Adu-Gyamfi 2021). The success of these attempts demonstrate both the necessity and the prospect of heterogeneous data fusion techniques.

4.2 Hybrid Computing and Learning Modes

With various data collection sources, different computing modes have been used to collect and process different types of data, e.g., cloud comput- ing, mobile computing, edge computing, fog computing, etc. Different computing modes have different computation and communication capacities and how to integrate and utilize these computing modes with existing big data infrastructures still remains challenging. Some studies are exploring in this direction. For example, based on Internet of Things (IoT) devices and fog computing, a low-cost vehicular traffic monitoring system is devel- oped in Vergis, Komianos, Tsoumanis, Tsipis, and Oikonomou (2020), which collects the vehicle GPS traces and uses the fog devices to process the data collected and extract the traffic information. While traffic estimation and prediction problems are usually formulated as a supervised problem from the perspective of machine learning. However, other learning modes can be leveraged for solving the challenges in these problems, e.g., transfer learning (Pan & Yang 2009) and gen- erative adversarial learning (Goodfellow et al. 2014). Transfer learning is a potential solution for data sparsity problem in different locations. A POI embedding mechanism is proposed in R. Jiang et al. (2021) to fuse human mobility data and city POI data. Furthermore, CNN and LSTM are com- bined to capture both the spatiotemporal and geographical information and the mobility knowledge is transferred from one city to another, which is proven effective for improving the prediction performance for the target city with only limited data available. Existing traffic estimation and pre- diction methods are developed with the assumption that the traffic infrastructures remain the same. This may not hold in some cases with new urban planning. A novel off-deployment traffic estimation problem is proposed and defined in Y. Zhang, Li, Zhou, Kong, and Luo (2020) and a traf- fic generative adversarial network approach named TrafficGAN is further proposed to solve this problem, which is able to describe how the traffic patterns change with the travel demand and underlying road network structures.

4.3 Distributed Solutions and Platforms

As seen in Section 3, big data tools are often operated in a distributed manner, or designed with the real-time parallel processing capability for streaming data, which is missing in the current store then process paradigm for traffic estimation and prediction studies. Some efforts for building and applying distributed solutions and platforms have been made in recent years, but mainly in the industry, e.g., Didi Brain 32 for travel demand prediction and JD Urban Spatio-Temporal Data Engine (JUST) 33 for traffic trajectory and order management. Besides further adopting the existing tools, other new technogies may also be applied for traffic estimation, traffic prediction, or other relevant problems. For example, blockchain is used as a promising solution for traffic big data exchange trust and privacy protection (Hassija, Gupta, Garg, & Chamola 2020) and federated learning is used for privacy-preserving traffic flow prediction (Y. Liu, James, Kang, Niyato, & Zhang 2020).

32https://www.didiglobal.com/science/brain 33http://just.urban-computing.com/ 14 Jiang and Luo

5 CONCLUSION

With the development of big data, more and more tools for collecting, processing, storing and utilizing data become available nowadays. This study focuses on the tasks of traffic estimation and prediction and present an up-to-date collection of available datasets and tools, as a reference for those who seek for public resources. While there are multiple data sources, data richness varies a lot and the data volumes are not big enough with limited spatial and temporal ranges. In this case, the off-the-shelf big data tools have not been widely used in previous studies. To change this situation, challenges and future directions are pointed out, with the aim to promote the application of big data in the transportation domain.

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

References

Ali, F., Ali, A., Imran, M., Naqvi, R. A., Siddiqi, M. H., & Kwak, K.-S. (2021). Traffic accident detection and condition analysis based on social networking data. Accident Analysis & Prevention, 151, 105973. Alkouz, B., & Al Aghbari, Z. (2020). Snsjam: Road traffic analysis and prediction by fusing data from multiple social networks. Information Processing & Management, 57(1), 102139. Aqib, M., Mehmood, R., Alzahrani, A., Katib, I., Albeshri, A., & Altowaijri, S. M. (2019). Rapid transit systems: smarter urban planning using big data, in-memory computing, deep learning, and gpus. Sustainability, 11(10), 2736. Babu, C. N., Sure, P., & Bhuma, C. M. (2021). Sparse bayesian learning assisted approaches for road network traffic state estimation. IEEE Transactions on Intelligent Transportation Systems, 22(3), 1733-1741. Bakli, M., Sakr, M., & Soliman, T. H. A. (2019). Hadooptrajectory: a hadoop spatiotemporal data processing extension. Journal of Geographical Systems, 21(2), 211–235. Bracciale, L., Bonola, M., Loreti, P., Bianchi, G., Amici, R., & Rabuffi, A. (2014). Crawdad dataset roma/taxi. Accessed: Jul. Brkić, I., Miler, M., Ševrović, M., & Medak, D. (2020). An analytical framework for accurate traffic flow parameter calculation from uav aerial videos. Remote Sensing, 12(22), 3844. Bui, K.-H. N., Yi, H., & Cho, J. (2020). A multi-class multi-movement vehicle counting framework for traffic analysis in complex areas using cctv systems. Energies, 13(8), 2036. Caceres, N., Romero, L. M., Benitez, F. G., & del Castillo, J. M. (2012). Traffic flow estimation models using cellular phone data. IEEE Transactions on Intelligent Transportation Systems, 13(3), 1430–1441. Cao, M., Huang, M., Ma, S., Lü, G., & Chen, M. (2020). Analysis of the spatiotemporal riding modes of dockless shared bicycles based on tensor decomposition. International Journal of Geographical Information Science, 34(11), 2225–2242. Chen, C., Liu, B., Wan, S., Qiao, P., & Pei, Q. (2021). An edge traffic flow detection scheme based on deep learning in an intelligent transportation system. IEEE Trans. Intell. Transp. Syst, 22(3), 1840-1852. Chen, C.-H. (2020). A cell probe-based method for vehicle speed estimation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 103(1), 265–267. Chen, H., Cai, M., & Xiong, C. (2021). Research on human travel correlation for urban transport planning based on multisource data. Sensors, 21(1), 195. Chen, X., Chen, Y., & He, Z. (2018, March). Urban traffic speed dataset of guangzhou, china. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.1205229 doi: 10.5281/zenodo.1205229 Chen, Y., Lv, Y., Wang, X., Li, L., & Wang, F.-Y. (2018). Detecting traffic information from social media texts with deep learning approaches. IEEE Transactions on Intelligent Transportation Systems, 20(8), 3049–3058. Cheng, X., Zhang, R., Zhou, J., & Xu, W. (2018). Deeptransport: Learning spatial-temporal dependency for traffic condition forecasting. In 2018 international joint conference on neural networks (ijcnn) (pp. 1–8). Jiang and Luo 15

Cho, E., Myers, S. A., & Leskovec, J. (2011). Friendship and mobility: user movement in location-based social networks. In Proceedings of the 17th acm sigkdd international conference on knowledge discovery and data mining (pp. 1082–1090). Cui, Z., Ke, R., & Wang, Y. (2018). Deep bidirectional and unidirectional lstm recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143. Ding, X., Chen, L., Gao, Y., Jensen, C. S., & Bao, H. (2018). Ultraman: a unified platform for big trajectory data management and analytics. Proceedings of the VLDB Endowment, 11(7), 787–799. Dunne, S., & Ghosh, B. (2013). Weather adaptive traffic prediction using neurowavelet models. IEEE Transactions on Intelligent Transportation Systems, 14(1), 370–379. Fang, Z., Chen, L., Gao, Y., Pan, L., & Jensen, C. S. (2021). Dragoon: a hybrid and efficient big trajectory management system for offline and online analytics. The VLDB Journal, 1–24. Fedorov, A., Nikolskaia, K., Ivanov, S., Shepelev, V., & Minbaleev, A. (2019). Traffic flow estimation with data from a video surveillance camera. Journal of Big Data, 6(1), 1–15. Ghahramani, M., Zhou, M., & Wang, G. (2020). Urban sensing based on mobile phone data: approaches, applications, and challenges. IEEE/CAA Journal of Automatica Sinica, 7(3), 627–637. Gkountouna, O., Pfoser, D., & Züfle, A. (2020). Traffic flow estimation using probe vehicle data. In 2020 ieee 7th international conference on data science and advanced analytics (dsaa) (pp. 579–588). Gong, L., Liu, X., Wu, L., & Liu, Y. (2016). Inferring trip purposes and uncovering travel patterns from taxi trajectory data. Cartography and Geographic Information Science, 43(2), 103–114. Gong, S., Cartlidge, J., Bai, R., Yue, Y., Li, Q., & Qiu, G. (2020). Extracting activity patterns from taxi trajectory data: a two-layer framework using spatio-temporal clustering, bayesian probability and monte carlo simulation. International Journal of Geographical Information Science, 34(6), 1210–1234. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., . . . Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661. Guo, F., Zhang, D., Dong, Y., & Guo, Z. (2019). Urban link travel speed dataset from a megacity road network. Scientific data, 6(1), 1–8. Guo, Y., & Yang, L. (2020). Reliable estimation of urban link travel time using multi-sensor data fusion. Information, 11(5), 267. Hao, Q., & Qin, L. (2020). The design of intelligent transportation video processing system in big data environment. IEEE Access, 8, 13769–13780. Harrison, G., Grant-Muller, S. M., & Hodgson, F. C. (2020). New and emerging data forms in transportation planning and policy: Opportunities and challenges for “track and trace” data. Transportation Research Part C: Emerging Technologies, 117, 102672. Hassija, V., Gupta, V., Garg, S., & Chamola, V. (2020). Traffic jam probability estimation based on blockchain and deep neural networks. IEEE Transactions on Intelligent Transportation Systems. He, Z., Zheng, L., Chen, P., & Guan, W. (2017). Mapping to cells: a simple method to extract traffic dynamics from probe vehicle data. Computer-Aided Civil and Infrastructure Engineering, 32(3), 252–267. Hoseinzadeh, N., Liu, Y., Han, L. D., Brakewood, C., & Mohammadnazar, A. (2020). Quality of location-based crowdsourced speed data on surface streets: A case study of waze and bluetooth speed data in sevierville, tn. Computers, Environment and Urban Systems, 83, 101518. Huang, X., Yin, Y., Lim, S., Wang, G., Hu, B., Varadarajan, J., . . . Zimmermann, R. (2019). Grab-posisi: An extensive real-life gps trajectory dataset in southeast asia. In Proceedings of the 3rd acm sigspatial international workshop on prediction of human mobility (pp. 1–10). Hussain, E., Bhaskar, A., & Chung, E. (2021). Transit od matrix estimation using smartcard data: Recent developments and future research challenges. Transportation Research Part C: Emerging Technologies, 125, 103044. Italia, T. (2015). Telecommunications-sms, call, internet-mi. Ji, B., & Hong, E. J. (2019). Deep-learning-based real-time road traffic prediction using long-term evolution access data. Sensors, 19(23), 5327. Jia, Y., Wu, J., & Xu, M. (2017). Traffic flow prediction with rainfall impact using a deep learning method. Journal of advanced transportation, 2017. Jiang, R., Song, X., Fan, Z., Xia, T., Wang, Z., Chen, Q., . . . Shibasaki, R. (2021). Transfer urban human mobility via poi embedding over multiple cities. ACM Transactions on Data Science, 2(1), 1–26. Jiang, W., Lian, J., Shen, M., & Zhang, L. (2017). A multi-period analysis of taxi drivers’ behaviors based on gps trajectories. In 2017 ieee 20th international conference on intelligent transportation systems (itsc) (pp. 1–6). Jiang, W., & Luo, J. (2021). Graph neural network for traffic forecasting: A survey. arXiv preprint arXiv:2101.11174. Jiang, W., & Zhang, L. (2018a). Geospatial data to images: A deep-learning framework for traffic forecasting. Tsinghua Science and Technology, 24(1), 52–64. Jiang, W., & Zhang, L. (2018b). The impact of the transportation network companies on the taxi industry: Evidence from beijing’s gps taxi trajectory data. Ieee Access, 6, 12438–12450. Kashiyama, T., Pang, Y., & Sekimoto, Y. (2017). Open pflow: Creation and evaluation of an open dataset for typical people mass movement in urban 16 Jiang and Luo

areas. Transportation research part C: emerging technologies, 85, 249–267. Katranji, M., Kraiem, S., Moalic, L., Sanmarty, G., Khodabandelou, G., Caminada, A., & Selem, F. H. (2020). Deep multi-task learning for individuals origin–destination matrices estimation from census data. Data Mining and Knowledge Discovery, 34(1), 201–230. Kiukkonen, N., Blom, J., Dousse, O., Gatica-Perez, D., & Laurila, J. (2010). Towards rich mobile phone datasets: Lausanne data collection campaign. Proc. ICPS, Berlin, 68. Lee, J., Hong, B., Lee, K., & Jang, Y.-J. (2015). A prediction model of traffic congestion using weather data. In 2015 ieee international conference on data science and data intensive systems (pp. 81–88). Leitloff, J., Rosenbaum, D., Kurz, F., Meynberg, O., & Reinartz, P. (2014). An operational system for estimating road traffic information from aerial images. Remote Sensing, 6(11), 11315–11341. Li, L., Jiang, R., He, Z., Chen, X. M., & Zhou, X. (2020). Trajectory data-based traffic flow studies: A revisit. Transportation Research Part C: Emerging Technologies, 114, 225–240. Li, S., Zhuang, C., Tan, Z., Gao, F., Lai, Z., & Wu, Z. (2021). Inferring the trip purposes and uncovering spatio-temporal activity patterns from dockless shared bike dataset in shenzhen, china. Journal of Transport Geography, 91, 102974. Li, Y., Yu, R., Shahabi, C., & Liu, Y. (2018). Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International conference on learning representations. Li, Z., Huang, Q., Carbone, G. J., & Hu, F. (2017). A high performance query analytical framework for supporting data-intensive climate studies. Computers, Environment and Urban Systems, 62, 210–221. Liao, B., Zhang, J., Wu, C., McIlwraith, D., Chen, T., Yang, S., . . . Wu, F. (2018). Deep sequence learning with auxiliary information for traffic prediction. In Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining (pp. 537–546). Liu, J., Li, T., Xie, P., Du, S., Teng, F., & Yang, X. (2020). Urban big data fusion based on deep learning: An overview. Information Fusion, 53, 123–133. Liu, L., Chen, J., Wu, H., Zhen, J., Li, G., & Lin, L. (2020). Physical-virtual collaboration modeling for intra-and inter-station metro ridership prediction. IEEE Transactions on Intelligent Transportation Systems. Liu, Y., James, J., Kang, J., Niyato, D., & Zhang, S. (2020). Privacy-preserving traffic flow prediction: A federated learning approach. IEEE Internet of Things Journal, 7(8), 7751–7763. Loder, A., Ambühl, L., Menendez, M., & Axhausen, K. W. (2019). Understanding traffic capacity of urban networks. Scientific reports, 9(1), 1–10. Mai-Tan, H., Pham-Nguyen, H.-N., Long, N. X., & Minh, Q. T. (2020). Mining urban traffic condition from crowd-sourced data. SN Computer Science, 1(4), 1–16. Moharm, K. I., Zidane, E. F., El-Mahdy, M. M., & El-Tantawy, S. (2018). Big data in its: Concept, case studies, opportunities, and challenges. IEEE Transactions on Intelligent Transportation Systems, 20(8), 3189–3194. Moosavi, S., Samavatian, M. H., Parthasarathy, S., & Ramnath, R. (2019). A countrywide traffic accident dataset. arXiv preprint arXiv:1906.05409. Mourad, L., Qi, H., Shen, Y., & Yin, B. (2019). Astir: Spatio-temporal data mining for crowd flow prediction. IEEE Access, 7, 175159–175165. Nair, D. J., Gilles, F., Chand, S., Saxena, N., & Dixit, V. (2019). Characterizing multicity urban traffic conditions using crowdsourced data. PLoS one, 14(3), e0212845. Neilson, A., Daniel, B., & Tjandra, S. (2019). Systematic review of the literature on big data in the transportation domain: Concepts and applications. Big Data Research, 17, 35–44. Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10), 1345–1359. Piorkowski, M., Sarafijanovic-Djukic, N., & Grossglauser, M. (2009). Crawdad data set epfl/mobility (v. 2009-02-24). Shoman, M., Aboah, A., & Adu-Gyamfi, Y. (2021). Deep learning framework for predicting bus delays on multiple routes using heterogenous datasets. Journal of Big Data Analytics in Transportation, 1–16. Song, Y., Liu, Y., Qiu, W., Qin, Z., Tan, C., Yang, C., & Zhang, D. (2020). Miff: Human mobility extractions with cellular signaling data under spatio-temporal uncertainty. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(4), 1–19. Torre-Bastida, A. I., Del Ser, J., Laña, I., Ilardia, M., Bilbao, M. N., & Campos-Cordobés, S. (2018). Big data for transportation and mobility: recent advances, trends and challenges. IET Intelligent Transport Systems, 12(8), 742–755. Vergis, S., Komianos, V., Tsoumanis, G., Tsipis, A., & Oikonomou, K. (2020). A low-cost vehicular traffic monitoring system using fog computing. Smart Cities, 3(1), 138–156. Wan, X., Lucic, M. C., Ghazzai, H., & Massoud, Y. (2020). Empowering real-time traffic reporting systems with nlp-processed social media data. IEEE Open Journal of Intelligent Transportation Systems, 1, 159–175. Wang, P., Lai, J., Huang, Z., Tan, Q., & Lin, T. (2020). Estimating traffic flow in large road networks based on multi-source traffic data. IEEE Transactions on Intelligent Transportation Systems. Wang, S., Zhong, Y., & Wang, E. (2019). An integrated gis platform architecture for spatiotemporal big data. Future Generation Computer Systems, 94, 160–172. Jiang and Luo 17

Wang, X., Chen, C., Min, Y., He, J., Yang, B., & Zhang, Y. (2018). Efficient metropolitan traffic prediction based on graph recurrent neural network. arXiv preprint arXiv:1811.00740. Wang, X., Ma, X., & Grimson, W. E. L. (2008). Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. IEEE Transactions on pattern analysis and machine intelligence, 31(3), 539–555. Wang, Y., He, Z., & Hu, J. (2020). Traffic information mining from social media based on the mc-lstm-conv model. IEEE Transactions on Intelligent Transportation Systems. Wang, Y., Xu, D., Peng, P., Xuan, Q., & Zhang, G. (2020). An urban commuters’ od hybrid prediction method based on big gps data. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(9), 093128. Xie, J., Reddy Kothapally, A., Wan, Y., He, C., Taylor, C., Wanke, C., & Steiner, M. (2019). Similarity search of spatiotemporal scenario data for strategic air traffic management. Journal of Aerospace Information Systems, 16(5), 187–202. Yang, D., Zhang, D., Yu, Z., & Wang, Z. (2013). A sentiment-enhanced personalized location recommendation system. In Proceedings of the 24th acm conference on hypertext and social media (pp. 119–128). Yao, E., Wang, X., Yang, Y., Pan, L., & Song, Y. (2021). Traffic flow estimation based on toll ticket data considering multitype vehicle impact. Journal of Transportation Engineering, Part A: Systems, 147(2), 04020158. Yao, Z., Zhong, Y., Liao, Q., Wu, J., Liu, H., & Yang, F. (2020). Understanding human activity and urban mobility patterns from massive cellphone data: Platform design and applications. IEEE Intelligent Transportation Systems Magazine. Yi, P., Huang, F., & Peng, J. (2019). A rebalancing strategy for the imbalance problem in bike-sharing systems. Energies, 12(13), 2578. Zhan, X., Li, R., & Ukkusuri, S. V. (2020). Link-based traffic state estimation and prediction for arterial networks using license-plate recognition data. Transportation Research Part C: Emerging Technologies, 117, 102660. Zhang, J., He, S., Wang, W., & Zhan, F. (2015). Accuracy analysis of freeway traffic speed estimation based on the integration of cellular probe system and loop detectors. Journal of Intelligent Transportation Systems, 19(4), 411–426. Zhang, J., Zheng, Y., & Qi, D. (2017). Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the thirty-first aaai conference on artificial intelligence (pp. 1655–1661). Zhang, T., Li, Y., Yang, H., Cui, C., Li, J., & Qiao, Q. (2020). Identifying primary public transit corridors using multi-source big transit data. International Journal of Geographical Information Science, 34(6), 1137–1161. Zhang, Y., Li, Y., Zhou, X., Kong, X., & Luo, J. (2020). Off-deployment traffic estimation—a traffic generative adversarial networks approach. IEEE Transactions on Big Data. Zhao, C., Zeng, A., & Yeung, C. H. (2021). Characteristics of human mobility patterns revealed by high-frequency cell-phone position data. EPJ Data Science, 10(1), 5. Zhao, L., Song, Y., Zhang, C., Liu, Y., Wang, P., Lin, T., . . . Li, H. (2019). T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems. Zheng, L., Yang, J., Chen, L., Sun, D., & Liu, W. (2020). Dynamic spatial-temporal feature optimization with eri big data for short-term traffic flow prediction. Neurocomputing, 412, 339–350. Zheng, Y., Xie, X., & Ma, W.-Y. (2010). Geolife: A collaborative social networking service among user, location and trajectory. IEEE Data Eng. Bull., 33(2), 32–39. Zhu, L., Yu, F. R., Wang, Y., Ning, B., & Tang, T. (2018). Big data analytics in intelligent transportation systems: A survey. IEEE Transactions on Intelligent Transportation Systems, 20(1), 383–398. Zimányi, E., Sakr, M., & Lesuisse, A. (2020). Mobilitydb: A mobility database based on postgresql and postgis. ACM Transactions on Database Systems (TODS), 45(4), 1–42. Zou, Z., Yang, H., & Zhu, A.-X. (2020). Estimation of travel time based on ensemble method with multi-modality perspective urban big data. IEEE Access, 8, 24819–24828.

AUTHOR BIOGRAPHY

Weiwei Jiang received the B.Sc. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2013 and 2018, respectively. He is currently a Postdoctoral Researcher with the Department of Electronic Engineering, Tsinghua University. His current research interests include the intersection between big data and machine learning techniques and signal processing applications. 18 Jiang and Luo

Jiayun Luo was born in Shenzhen on Sep 27th, 1999. She received a B.S. degree in statistics from the University of California, Los Angeles in the U.S.A in 2020. She is currently working for QH data as a Data Analyst. She published "Bitcoin price prediction in the time of COVID-19" on The International Conference on Management Science Informatization and Economic Innovation and Development (MSIEID2020) in 2020. Her research interest is the application of Deep Learning algorithms and Artificial Intelligent on improving quality of life of Disabled People.

How to cite this article: W. Jiang and J. Luo, Big Data for Traffic Estimation and Prediction: A Survey of Data and Tools, Expert Systems, 2017;00:1–6.