HPC & AI Innovation Exchange A series that examines key trends, technology infrastructures and solutions within high-performance computing.

002 — The ADAS/AD Architecture
An in-depth technical analysis of infrastructure solutions for the manufacturing and automotive industries developing autonomous driving and advanced driver assistance systems.

This white paper provides an in-depth study of the technology solutions for Autonomous Driving (AD) and ADAS (Advanced Driver Assistance Systems). It consists of a business-oriented management synopsis and a deep technical analysis.

Contents

Revisions
Acknowledgements
Table of contents
Executive summary

Part 1 — Business-oriented management synopsis
1 An overview of the IT infrastructure needed for the development of autonomous driving
  1.1 Autonomous driving
    1.1.1 Complex managed IT infrastructure needed
  1.2 Overview of the IT infrastructure
    1.2.1 On the remote site
    1.2.2 The data size — the need for a Data Lake
    1.2.3 Data analysis — the need for a Data Platform
    1.2.4 Development of scenarios — the need for a Compute Platform
    1.2.5 Software & Services
  1.3 System design considerations
    1.3.1 Data Lake
    1.3.2 Computing platform infrastructure
    1.3.3 Test benches
    1.3.4 Workbench
    1.3.5 In support of the manufacturers' development process

Part 2 — Technical analysis
2 Technical Description — Remote site
  2.1 Data logging
  2.2 Data ingest and tagging
3 Technical Description — Data Centre
  3.1 Data Lake
  3.2 Data Platform
    3.2.1 Dynamic Data analytics
  3.3 Compute Platform
    3.3.1 Scenario testing and development — simulation
    3.3.2 HPC system (Compute Platform)
    3.3.3 Storage and memory for HPC
    3.3.4 Networking for HPC
    3.3.5 Parallel file systems for HPC
    3.3.6 HPC Cluster management
    3.3.7 AI support
  3.4 Networking in the data centre
  3.5 Test and Workbenches
    3.5.1 HiL Test bench
    3.5.2 SiL Test bench
    3.5.3 ADAS/AD Workbench
  3.6 User access — VDI
4 Technical Description — Software & Services
  4.1 Software
    4.1.1 Software — Container technology
    4.1.2 Create a complete software stack
  4.2 Services
    4.2.1 Hardware Deployment Services
    4.2.2 Software Deployment Services
    4.2.3 Support Services
    4.2.4 Labelling as a service
    4.2.5 Managed Services
    4.2.6 Financial services
5 Choosing the right platform — Infrastructure design
  5.1 Infrastructure design process
  5.2 Practical design examples
  5.3 General server design considerations
  5.4 Storage Design — Data Lake
    5.4.1 File storage and archive
    5.4.2 File system and file protocol choices
  5.5 Compute Server design
  5.6 Network design
6 Vocabulary & Terminology
  6.1 Vocabulary
  6.2 Terminology
A Technical support and resources
  A.1 Related resources

Executive summary

This white paper provides an in-depth study of the solutions for Autonomous Driving (AD) and ADAS (Advanced Driver Assistance Systems). Our vision provides thorough technical analysis of projects we have been working on and of the expected requirements for future projects in this space.

There are two distinct parts: a business-oriented management synopsis (Part 1) and a comprehensive deep technical analysis (Part 2).

The business-oriented synopsis in Part 1 — chapter 1 — describes a global vision on the development of autonomous driving. Dell Technologies understands the whole AD/ADAS development chain and can provide supporting solutions both for car manufacturers which own the complete chain and for organisations which focus on one or a number of aspects in the chain.

The ensuing technical analysis in Part 2 — chapters 2-4 — suggests and debates various options available to approach specific use cases and workloads. Two main options are presented: an enterprise-ready approach for demanding users, as well as an Open Source approach. The objective is to provide sufficient technical detail to empower the reader to make informed decisions. In Autonomous Driving, customers can apply a range of Dell Technologies solutions, from the collection of data in test vehicles to simulations run in data centres. A number of solutions will focus on only one aspect — for instance, sensor design. Each solution will be presented in a modular manner. Dell Technologies often partners with specialized companies in this space. In these cases, the partnership ensures components or services from both companies work seamlessly together.

Audience

The expected audience includes IT and business decision makers in the Automotive space, as well as users or suppliers who require an overview of the evolving ICT infrastructure landscape in the Autonomous Driving space.

The expected audience of the comprehensive technical study within this white paper would include persons involved in setting up or using part of an infrastructure in support of Autonomous Driving developments.

We assume some knowledge about the development of autonomous driving. There are many good overview articles on the Internet, and many organisations have their internal documents.

A word about words

In autonomous driving, many disciplines meet, each with its own vocabulary. Sometimes the same word has different meanings; sometimes there are different words with the same meaning. Often words describe overlapping concepts. In this white paper, we choose the vocabulary we use. See the Appendix for a vocabulary list and a terminology list.

Part 1 — Business-oriented management synopsis

A global vision on the development of autonomous driving.

The structure of this part, which consists of one chapter, is shown in the figure below.

[Figure: the simplified infrastructure model used throughout this white paper. A Remote Site (Data logging / Data ingest, Section 1.2.1) feeds a Data Centre containing the Data Lake (Section 1.2.2), the Data Platform (Section 1.2.3), the Compute Platform (Section 1.2.4) and User Access; Software and Services (Section 1.2.5) span both. Autonomous driving itself is covered in Section 1.1, and system design considerations in Section 1.3.]

1 An overview of the IT infrastructure needed for the development of autonomous driving

The infrastructure needed for the development of autonomous driving can quickly become very complex. The infrastructure requirements may also evolve over time. The infrastructure may be integrated within existing data centres, be available as infrastructure-as-a-service, or take many other forms. This chapter provides a high-level overview of the requirements that come from autonomous driving, and of how to organise an infrastructure to support development.

The chapter is intended for anyone interested in the technology needed. It can help with getting enough insight to talk to persons in the autonomous driving technology ecosystem. If you are working in a technology company, it can also help you understand what opportunities may be there. If you are working at a car manufacturer or car component developer, it helps to understand how the complete IT infrastructure could look, and where your organisation fits in.

We start this chapter with a rudimentary overview of what autonomous driving is. Then we describe the importance of data in the development of autonomous driving. A complex managed IT infrastructure is needed to handle the data and to meet the other requirements for autonomous driving. Next, we introduce the overall infrastructure model that we use throughout this white paper. This model is based on the concept of the Remote Site, where (self-driving) cars are tested, and the concept of the Data Centre, where the main components are the Data Lake, Data Platform and Compute Platform. The Compute Platform section is more detailed because it describes the needs for HPC and AI technology that is not commonly used in standard Data Centres. The last section gives some insight into how the components in the infrastructure work together.

Please note that Chapters 2 & 3 will provide a more detailed technical description using the same global infrastructure picture used in Chapter 1. To read chapter 1, a basic knowledge of IT infrastructure and the working of data centres is advisable.

1.1 Autonomous driving

[Figure: the six SAE levels of driving automation. Level 0: warnings and momentary assistance. Level 1: steering or brake/acceleration support to the driver. Level 2: steering and brake/acceleration support to the driver (driver in control). Level 3: drive under limited conditions, e.g. traffic jam driving. Level 4: drive under limited conditions, e.g. a driverless taxi, possibly without a steering wheel (car in control). Level 5: the car drives under all conditions.]
Levels of autonomous driving as defined by the Society of Automotive Engineers (SAE).

Autonomous driving holds many promises for the future. The promise of safer driving is at the top of this list. Autonomous cars will be able to stay alert all day and all night. Self-driving cars can be trained to be prepared for very unlikely situations. Autonomous driving will provide more comfort: as the driver will not need to do any driving, they can use the time in the car for other, more interesting and/or productive activities. Autonomous driving will lead to more efficient use of the road system, as cars will be able to coordinate with each other and with systems that manage traffic.

However, autonomous driving still has a long way to go. There are several major steps that still have to be taken, each associated with a different level of autonomy of the car. At the lowest level, the driver does everything. At the highest level, the car takes care of everything. Each level requires extensive development, many test drives and many scenarios to be developed. Development for each level is supported by a complex IT infrastructure. The highest level requires exabytes of data to be collected, processed and analysed. This level requires a big scaled-out computer infrastructure with high demands on throughput computing, many parallel streams of data to be analysed, simulated and correlated, and even fast supercomputers — HPC systems — to do high-end simulations of possible hardware and car behaviour.

Car manufacturers and suppliers need to have an IT infrastructure capable of supporting the development of all levels of autonomous driving at the time they are working on the development of that level. So, whether the focus is on Advanced Driver Assistance Systems (ADAS) up to Level 3 or full Autonomous Driving (AD) up to Level 5, the infrastructure architecture should be able to support it.

When we talk about "ADAS" in this white paper, we mean Level 2 or 3 technologies. When we talk about AD, we can either mean the complete AD development flow over time towards Level 5, or specific Level 4 and 5 technologies. So, when we talk about autonomous driving in general, that could include ADAS. When we talk about autonomous driving cars, we talk about technology Levels 4-5.

The transition to autonomous driving is accelerating. According to market research by M14 Intelligence conducted in 2018, there were about 8 million cars in the world with some level of automation in 2017. By 2030, it is expected that there will be more cars sold with ADAS features than without: out of the about 30 million new cars, some 24 million will be equipped with Level 2 ADAS capability, 6 million with Level 3 features, and about 5 million will be highly autonomous.

The number of companies involved in developing autonomous driving technologies is large. The landscape is changing quickly over time as companies are acquired by larger companies, and startups and alliances are formed. Regulations around autonomous driving are still being developed. In Europe, for instance, strict rules are already in place for new cars: they should be able to automatically call emergency services in case of accidents. There may also be regulations in place that, for instance, define how long test data needs to be archived or where that data must be stored. This has to be taken into account when developing autonomous driving products and services.

The most important characteristic of autonomous driving development is the data: not only the sheer data volume, but also the uncertainty about the data growth. And there are specific data-handling requirements. So, let us first look at the data.

[Figure: indicative development timelines, real-world sensor data required (kilometres) and typical storage capacity per level. Level 2 (in progress): ~200,000 km, 4-10 Petabyte. Level 3 (in progress, 2020): ~1,000,000 km, 50-100 Petabyte. Level 4 (development starts today): ~20+ million km. Level 5 (development starts in 2-3 years, expected on the market around 2030): ~250+ million km, 2+ Exabyte.]

Data collection size and storage size requirements will grow exponentially with each new level of autonomous driving. Please note these figures are not actual forecasts but give a global idea of the data growth.
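The "about ten times more resources per level" rule of thumb mentioned later in this chapter can be used for a quick back-of-the-envelope check of these indicative storage figures. The sketch below is purely illustrative; the base value and growth factor are assumptions taken from this white paper's own indicative numbers, not a forecast.

```python
# Back-of-the-envelope storage projection per autonomy level.
# Assumptions (from this paper's indicative figures, not a forecast):
# roughly 10x more resources per level, starting from ~10 PB at Level 2.
BASE_LEVEL = 2
BASE_STORAGE_PB = 10          # upper end of the 4-10 Petabyte Level 2 range
GROWTH_PER_LEVEL = 10         # "every level requires about ten times more"

for level in range(2, 6):
    storage_pb = BASE_STORAGE_PB * GROWTH_PER_LEVEL ** (level - BASE_LEVEL)
    print(f"Level {level}: ~{storage_pb:,} PB")
# Level 2: ~10 PB; Level 3: ~100 PB (matching the 50-100 PB estimate);
# Level 5: ~10,000 PB, the same exabyte order as the 2+ EB estimate.
```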

Data is key in ADAS/AD

The current development focus in autonomous driving is on cars with limited self-driving possibilities: a car that can drive autonomously, but only under the supervision of the human driver. Completely autonomous driving does not yet exist. In the coming years, development will expand into higher levels of autonomous driving until we reach fully autonomous cars. The current focus is on development. The car data used in this development process comes from test drives, not from cars that have been bought in regular shops and drive on the main road. The development is focused on generating and assessing scenarios. A scenario describes an event, the likelihood that it will happen, and what a car is supposed to do when such an event happens. These scenarios must be available for all foreseeable events. And even for an unforeseeable, or really very unlikely, event — a meteorite scratches the car during a snow blizzard — some reasonable action must be taken by the autonomous car, like trying to avoid collision with the meteorite. There are many scenarios possible, and one needs to analyse hundreds of thousands of these scenarios and put the results in the autonomous driving software of a car before one has a reliable car that one can trust.

To develop these scenarios, data is collected and validated during test drives. Car manufacturers use three distinct test environments. The first environment is a closed test track. The second is a public road. In the third, cars, which could themselves be simulated, drive on simulated roads.

The closed test track, which can sometimes be as big as a village, can be used to collect data about a lot of different, controlled events and to test developed hardware and software. The test drives on the public road are used to collect as much real-life data as possible, to see how a car behaves and reacts on public roads in all weather and traffic conditions.

The data in a test car is collected and stored in a special unit called the "Data logger". The data loggers have a large but limited storage capacity, limited in comparison to the amount of data the sensors in the car produce. To be analysed, this data must get out of the Data logger, out of the car, and into a data centre where it can be processed. Especially for cars on a test circuit, it is important to get the data out quickly and analyse it, to see whether the tests performed gave the expected results; otherwise, the definition of new or adapted tests for the next test drives is jeopardised. For the cars on public roads, this is less important. These cars drive longer test routes, mainly aimed at capturing the interaction with the environment.

There are two major ways to get the data out. The first is to use cassettes in the Data logger on which the data is stored. These cassettes can be taken out or swapped and shipped to a data centre. Another way is to have a small data centre facility, mobile or fixed, near the test circuit: drive the test car to that spot and use a wired network connection to extract the data from the Data logger in the car, store it in the local data centre and transfer it from there to the big data centre. Alternatively, a Mobile Upload Server (MUS) is used: the complete server (with data) is taken out of the car and replaced with an empty server.

The amount of data collected during a test drive makes it in general impossible to use wireless or mobile communication such as LTE Channel Bonding Technologies. However, it is possible to use private local high-speed wireless communication.

When data is collected, it has to be annotated — or tagged — with metadata that supports analysis and use of the data. For example, if a traffic sign is spotted on the public road, it is important to know that the software indeed identified the correct traffic sign, and in which country and at what time it was sighted. Meanings of traffic signs can differ from country to country, and the car must have all this information to take the right action.
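To make this concrete, a single annotation could be represented as a small structured record, as in the sketch below. The field names are invented for illustration and are not a schema from any of the products discussed in this paper.

```python
from dataclasses import dataclass

# Hypothetical shape of one annotation attached to logged sensor data.
# Field names are invented for illustration, not an actual product schema.
@dataclass
class Tag:
    object_type: str              # e.g. "traffic_sign"
    label: str                    # e.g. "speed_limit_50"
    country: str                  # sign meanings differ per country
    timestamp_utc: str            # when the object was sighted
    gps: tuple[float, float]      # (latitude, longitude)
    source_sensor: str            # which camera or sensor saw it
    verified_by_human: bool = False   # set once a person confirms the label

tag = Tag("traffic_sign", "speed_limit_50", "DE",
          "2019-07-01T14:03:22Z", (48.1372, 11.5756), "front_camera_1")
print(tag.label, tag.country)     # speed_limit_50 DE
```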

In a test car, data is collected by a large range of sensors, including, for example, several cameras, radar sensors, LiDAR (similar to radar) and sensors for other car dynamics data. The amount of data collected can be huge. A typical day on a manufacturer's test circuit can easily generate a hundred terabytes of data. Collecting and storing that data locally in the car brings the first choice: either you can store data from different sensors in different files, or you can store data from different sensors synchronised in one file. Both approaches have advantages and disadvantages, and they lead to different ways of processing the data later.

Storing data from each sensor separately means you need to timestamp and tag the streams in such a way that you can correlate data from different sensors later. Storing all the data in one file means it is correlated right from the start. Storing data separately gives the possibility later to easily look at data from, say, only two sensors and work on those. Or a manufacturer can share only data from a subset of sensors with a third-party company that develops one component of the car. With the "everything-in-one-file" approach, you need to extract data from sensors from the main file if you only want to study a few sensors. However, getting the data out of the car in a synchronised way is somewhat easier with the one-file approach, but less flexible. To make the right choice, you need to look at the whole process from data collection to scenario development and validation, and at what the in-car platform is capable of delivering in terms of capturing, storing and processing.
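A minimal sketch of the timestamp correlation that the separate-files approach relies on: records from two independently logged streams are matched by nearest timestamp. The stream format and tolerance below are assumptions for illustration; a real data logger would use a shared master clock for this.

```python
import bisect

# Minimal sketch: align records from two independently logged sensor
# streams by nearest timestamp. Stream format and tolerance are
# illustrative; real loggers rely on a shared "master" clock.

def align(stream_a, stream_b, tolerance_s=0.05):
    """stream_a/stream_b: lists of (timestamp_s, payload), sorted by time."""
    times_b = [t for t, _ in stream_b]
    pairs = []
    for t_a, payload_a in stream_a:
        i = bisect.bisect_left(times_b, t_a)
        # candidate neighbours in stream_b: just before and just after t_a
        candidates = [j for j in (i - 1, i) if 0 <= j < len(stream_b)]
        if not candidates:
            continue
        j = min(candidates, key=lambda j: abs(times_b[j] - t_a))
        if abs(times_b[j] - t_a) <= tolerance_s:
            pairs.append((payload_a, stream_b[j][1]))
    return pairs

camera = [(0.00, "frame0"), (0.04, "frame1"), (0.08, "frame2")]
radar  = [(0.01, "scan0"), (0.05, "scan1")]
print(align(camera, radar))
# [('frame0', 'scan0'), ('frame1', 'scan1'), ('frame2', 'scan1')]
```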

The next table provides a summary of sensor types in a car.

Table 1. Different types of sensors in a car can help with identifying different objects. Using combined data of several sensors together can greatly enhance usability; this is called sensor fusion. (In the source, each criterion below is rated per sensor on a Bad / OK / Good colour scale; those ratings are not reproduced here.)

| Usability for | Camera | Radar | LiDAR | Sensor fusion (Camera + Radar + LiDAR) | Sonar |
| How it works | Records video images | Uses echo of radar signals | Uses reflecting laser signals | Combined data from several sensors | Uses echo of (not audible) sound |

The criteria rated for each sensor are: object detection, object classification, distance estimation, object edge precision, lane tracking, range of visibility, functioning in bad weather, and functioning in bad light.

In order to manage, analyse and process all this data in a reliable way, a well-managed infrastructure is needed.

1.1.1 Complex managed IT infrastructure needed

To develop cars ready for each new level of autonomous driving, more data is needed from test cars, and more compute power is needed to analyse it and to do simulations. Every level requires about ten times more resources than the previous level. The development process for new cars is done by a car manufacturer. Because of the high costs of setting up and maintaining an autonomous driving IT platform — which is in the billion-euro/dollar range — only large car manufacturers or companies with enough funding can afford such a development. The core of the autonomous driving IT platform can be delivered by Dell Technologies. This ADAS/AD development platform will be used by many supplying companies. Many specialised companies could provide services, consulting, hardware or software for smaller or bigger parts of the development platform. There are very few companies in the world that are able to construct and manage such large and complex infrastructures. Dell Technologies is one of them and collaborates with many specialised companies for specific components, such as NVIDIA and Altran. NVIDIA is providing specialised processors (GPUs) for autonomous driving, mostly geared towards Artificial Intelligence (AI); Altran is providing consulting and services for the development of autonomous driving.

Although every car manufacturer needs a complete platform of their own, it does not mean they need to implement and maintain it. Actually, most prefer to have it available as managed services. So, the system integrator would build the platform according to the car manufacturer's specifications, integrating hardware, software and services of many specialised companies, and maintain it for the car manufacturer. Each car manufacturer typically has several brands to support. Their development platform needs to accommodate that, so they do not need a separate platform for each brand. Some parts of the platform may even be shared with other car manufacturers, provided data is kept separate.

The development platform needs to be designed with flexibility and scalability built in. The platform architecture should be robust enough to span the whole development cycle for one level, and scalable enough to span several development levels. The design should also be flexible enough to incorporate new hardware and software as they become available. For instance, in two years' time, there may be a completely new sensor on the market or a specialised processor that speeds up some calculations for a certain feature.

So, the actual IT infrastructure needed to support ADAS or AD development is large and complex. To describe the IT infrastructure components and their role, we first start with an overall simplified picture of the architectural design.

1.2 Overview of the IT infrastructure

[Figure: the simplified view of the overall infrastructure. A Remote Site (Data logging / Data ingest) feeds the Data Centre, which contains the Data Lake, the Data Platform, the Compute Platform and User Access; Software and Services span both.]

A complete infrastructure — hardware, software, and services — for developing autonomous driving can easily consist of thousands of servers and several software stacks. It can be made available as a managed service or managed in the manufacturer's own data centre. Each car manufacturer or car parts supplier has its own requirements, own approach, own development road map. So, there is no one-size-fits-all infrastructure approach. The infrastructure has to be designed to be performant, efficient and cost-effective, but it should also be flexible. During the years of development that fully autonomous driving will take, new insights will emerge, new tools will be developed and new regulations will come into place. The infrastructure must be designed to be flexible enough to accommodate all these future developments for both hardware and software.

At a very high level, we see the following components:

• Remote site (sometimes called test site, edge, or campus)
  - Remote site Data logging: data from the test cars is locally collected on the remote site.
  - Remote site Data ingest: data is made ready for the data centre and optionally filtered and tagged for further processing.
• Transport to the data centre
• Data centre
  - Data Lake: data is put into the data lake and made available for processing.
  - Data Platform: data is analysed on a data platform.
  - Compute Platform: simulations and AI run on a compute platform.
• Software. Actually, there is not just one software layer, but different types of software at each level and for each component. So, the software block in our simplified picture just indicates software is important.
• Services. In our simplified view, the block indicates there are important services. They can be generic or specific.

We use this high-level view to structure our document.

[Figure: the simplified infrastructure view, annotated with the sections of this chapter: Section 1.1 (autonomous driving), Section 1.2.1 (Data logging / Data ingest on the Remote Site), Section 1.2.2 (Data Lake), Section 1.2.3 (Data Platform), Section 1.2.4 (Compute Platform), Section 1.2.5 (Software and Services), Section 1.3 (system design considerations), plus User Access.]

The next sections describe each major part of the overall IT infrastructure. Please note this structure is mirrored in the second part of this document, but then with detailed technical descriptions. The sections correspond to the simplified view. The last section provides important points specific to autonomous driving support.

The exact infrastructure design may vary greatly. Proposing the right architecture in a specific situation is one of the strengths of Dell Technologies. Dell Technologies has experience in designing complex, powerful, flexible and cost-effective architectures for large infrastructures. Alone, or together with partners, Dell Technologies can also implement these.

The Remote Site part of the architecture may also look very different from one practical case to the other. The naming may differ as well: one also uses "Edge site" or "Campus", or the remote site may be split into several physical locations. For the high-level architectural view, we put this all under "Remote Site".

We use the term "Data Lake" in this document as a name for the Central Data Storage part of the overall system. It is the hardware and software (system software and file systems) infrastructure that holds the data.

The main data that is used is generated by test cars: test cars in special test environments, or test cars driving on public roads. We will look at how the data is collected in the cars, and what types of sensors they use. Our journey starts when the data is collected and logged in what we call the "Remote Site".

1.2.1 On the remote site

The main role of the "Remote Site" is the collection of data produced by test cars, to do some pre-processing, and to get the data ready to be ingested into the Data Centre, where permanent storage and processing take place. There can be many types of — and different names for — Remote Site, but we concentrate on what for us are the core services:

• Data logging
• Data ingestion
• Data tagging

Data logging is the process of collecting and storing data from the test cars. The easy way is to store the raw data that comes from each sensor separately in different files. In a more sophisticated way, a "master" timestamp could be added to allow for correlating data from different sensors during the analysis. Without a master timestamp, it is generally impossible to correlate data from different sensors in an unambiguous way. For example, video cameras produce continuous streams of data, while other sensors, like the ones that record whether indicators or screen wipers are in use, produce more incidental data.

Sometimes it makes sense to already combine streams of data from different sensors, looking in the same direction for instance, into one stream. This sensor fusion could result in sharp 3D videos.

The data needs to get out of the car before it can be analysed in the data centre. Some common ways to store data are rugged disks that can be taken out easily and transported to the data centre, or, less commonly, small servers that can be completely replaced. Small servers can provide more reliable storage for data logging through RAID techniques and can be more secure during transportation.

Data ingestion
Data ingestion is the next step: getting the data into the data centre system. This process is called "Data ingest". It is not only about getting the raw data into the system. The data is stored in a generally-available file system. Also, the appropriate metadata directly connected to the data has to be loaded, and a labelling process is started in which the data is further tagged and annotated with initial metadata. It is also checked whether the data was correctly recorded. If a sensor fails part of the time, this is flagged in the metadata. Other parts of the data may still be used. If a camera fails or produces incorrect data, it is good to know that as early as possible.
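As an illustration of the recording check described above, the sketch below scans one sensor's timestamps for gaps and emits metadata flags instead of rejecting the whole recording. The threshold and record shape are hypothetical.

```python
# Sketch of an ingest-time integrity check: find gaps in a sensor's
# timestamps and flag them in the metadata instead of rejecting the
# whole recording. Threshold and record shape are hypothetical.

def flag_sensor_gaps(timestamps_s, expected_period_s, slack=3.0):
    """Return metadata entries for every gap larger than `slack` periods."""
    flags = []
    for prev, cur in zip(timestamps_s, timestamps_s[1:]):
        if cur - prev > slack * expected_period_s:
            flags.append({
                "issue": "sensor_dropout",
                "from_s": prev,
                "to_s": cur,
            })
    return flags

# A 25 Hz camera that went silent for about two seconds:
ts = [i * 0.04 for i in range(100)] + [6.0 + i * 0.04 for i in range(50)]
print(flag_sensor_gaps(ts, expected_period_s=0.04))
# -> one flag, covering roughly 3.96 s to 6.0 s
```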

Data tagging
Data tagging is the process of adding further metadata to the original data. For some metadata, this is already done at the remote site. Metadata has to be added that describes what the data is about, where the data was collected, and other relevant information needed for using and analysing the data. The tagging process also adds information about which data comes from which sensor or camera, what the exact GPS data is, etc. This classification of data is also called data tagging or data labelling. Most of this metadata is generated automatically. However, in some cases, when the automatic system is in doubt, humans are asked to look at the data. They look, for instance, at whether something that looks vaguely like a traffic sign indeed is a traffic sign, and if so, which traffic sign. At this stage, it is also decided who is allowed to use the data. For instance, some data may be available to specific groups of users, other data may not. Reasons for making data available only in a limited way may be legal: some data may not be made available outside the country in which it was generated. Many companies may have access to the system, but only to part of the data. Not only are they not allowed to read the data, but they also cannot see it is there. The data itself is stored in a file system. The metadata is stored separately in a data catalogue.

When the data is collected remotely, tagged and ingested into the Data Centre, it ends up in the Data Lake. Why do we need a Data Lake? We describe that in the next section.

1.2.2 The data size — the need for a Data Lake

The amount of data collected through the sensors in the test cars is very large. It can be in the range of tens of petabytes, or even up to a hundred petabytes. All this data needs to be stored for analysis and processing. Analysis and processing will generate additional data, but this is in general not more than 10%-15% of the original data collected. The "Data Lake" is the Central Data Storage part of the overall system. It is the hardware and software (system software and file systems) infrastructure that holds persistent data. A modern Data Lake is much more than just storage itself. It also provides support for different workloads and different storage protocols.

In the ideal case, there is a single Data Lake with a simple structure for the user, where all data is accessible from anywhere, and in the same form. Dell Technologies offers one, based on Isilon technology. However, in some cases, it is not possible to use this seamless solution everywhere.

The Data Lake consists of several layers, or tiers. With Isilon, "auto-tiering" across different storage pools is possible, ranging from high-performance pools to pools that can be used for archiving.

Because of the amount of data, it is tempting to store as much as possible on "cold storage" media such as tape, to reduce costs. In practice this does not work well, as most of the data may often be needed for analysis or processing. Using flash for all the data is too expensive, so the biggest part of the data storage consists of disks.

The Data Lake needs to get data from the Remote Site and make data available to the Data Platform and the Compute Platform. These have specific needs for the way data is handled. For instance, data scientists on the Data Platform prefer to use HDFS file systems, while users in the compute field are used to NFS besides CIFS/SMB.

Where is the data that a developer may need? Does he or she have the rights to use it? Or is he or she working in a country where data can't be transferred? Is the data available from a data centre near the developer, or does it have to be transported from the other side of the world? To answer these types of questions, one needs to look at data locality.

During test drives, enormous amounts of data are collected. When handling this data, its locality is important to consider. First, the data from a test drive is local to the country or state, or sometimes even to the town. Traffic signs have different meanings, and sometimes the text below a sign differs from town to town; that also has (legal) meaning that changes the interpretation of what one sees after the sign. For example, the city of Antwerp introduced a complex system of colour-coded parking rules and signs that can change from street to street and are not used anywhere else outside the town. So, keeping data locality connected to the data is important for developing the right scenario, as reacting to an event may differ depending on where the test car is driving.

Second, some countries do not allow this type of test data to leave the country or region. That has to be taken into account when one wants to develop an ADAS system that will work in that specific country.

The third aspect of locality is more technical. If your developers need to develop a scenario with specific camera data that is only available in your data centre on the other side of the globe, transferring that data for a single run may be costly and time-consuming. So, handling data locality requires careful technical infrastructure planning.

Another approach would be to distribute the annotations and learned models from each region in the world, without the need to share the actual data. This approach is already being used in the medical context, where a system abstracts the data, and a query can be sent to a meta-system, which will run the query in each local centre and return the results of the query without returning the actual data. The same can apply to automotive data.

Storing all the data in one place, in one data centre, in one Data Lake is the easiest way to manage data. However, because of the data locality requirements and technical barriers, that may not always be possible.

To help with the transfer of data between the main components of the platform, a "Data Mover" may be installed. This moves or copies data from the Data Lake into a cache that may be local to an HPC cluster or Hadoop cluster. A Data Mover is useful where the data access would otherwise be too slow to keep the compute clusters busy. One needs to closely examine the performance trade-off before starting to move data. You should only have data in more places if there is a clear need; otherwise, it is better to leave the data in the central Data Lake.

Just storing the data in a Data Lake is only the beginning. The data needs to be analysed and interpreted.

1.2.3 Data analysis — the need for a Data Platform

The data collected from the cars needs to be labelled, annotated, and analysed. A separate optimised platform for handling the data is normally centred around Hadoop or similar software. A suite of data analytics tools is installed. Separate software to manage metadata is needed as well. This platform is called the "Data Platform", although "data analysis & data management platform" best describes its function.

The Data Platform provides access to all of the data in the Data Lake, and to data that is local to the Data Platform itself, through an extensive search facility. The search facility can not only locate data sets across the system that contain the data one is looking for, but it can also identify the instance that is closest to the data analyst in terms of networking speed, and show the associated costs of getting that data set.
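A toy version of the selection problem this search facility solves: among the copies of a data set that a user may legally access, pick the one that is cheapest to reach. The catalogue entries and cost figures below are invented for illustration.

```python
# Toy version of the locality-aware selection a search facility performs:
# filter out copies the user may not access, then pick the cheapest one.
# Catalogue entries and cost figures are made up for illustration.

replicas = [
    {"site": "eu-west",  "region": "EU", "transfer_cost_eur_per_tb": 0,
     "export_allowed": True},
    {"site": "us-east",  "region": "US", "transfer_cost_eur_per_tb": 40,
     "export_allowed": True},
    {"site": "cn-north", "region": "CN", "transfer_cost_eur_per_tb": 55,
     "export_allowed": False},   # data may not leave the region
]

def best_replica(replicas, user_region):
    allowed = [r for r in replicas
               if r["region"] == user_region or r["export_allowed"]]
    if not allowed:
        raise LookupError("no replica accessible from this region")
    return min(allowed, key=lambda r: r["transfer_cost_eur_per_tb"])

print(best_replica(replicas, "EU")["site"])  # eu-west
```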

The search facility knows all about restrictions on the data, such as regulatory restrictions or restrictions because of intellectual property rights. It can also help to bring the computing to the data instead of the data to the computing, by selecting the best compute/data option.

Analysing data on the Data Platform also involves computation. However, sometimes more computation is needed for simulations, and developing scenarios sometimes needs a specialised compute platform.

1.2.4 Development of scenarios — the need for a Compute Platform

The data that is generated is used to further develop and test scenarios. A scenario is a description of what may happen when a car drives on a road — for example, how other vehicles or pedestrians behave when a car is confronted with traffic lights. To support the development and testing of scenarios, several tools can be used. Simulation uses a model to predict car behaviour in specific situations.

The development process for autonomous driving is complex and has many steps and components. HiL, SiL and MiL are terms one often sees in autonomous driving discussions. Hardware-in-the-loop, Software-in-the-loop and Model-in-the-loop are all techniques for the simulation of a car's behaviour in an environment, like on a road, with other traffic, sun conditions, and many other parameters. One specific part of the car, such as a sensor, a camera, or an ECU, is the main topic; it is what is "in-the-loop". "In-the-loop" means one is looking at what happens when a parameter changes; the result is a changed environment, and that is fed back into the simulation, which then starts again (the loop) from that new situation.

In HiL, it is the actual hardware component itself (e.g. a camera) that is tested. It is not tested in the real car but in a virtual reality (VR) environment. The virtual environment can be constructed from real sensor data from real cars in a real test environment, or it can be generated by a computer. Real data used this way is called "ground truth": measurement data for calibrating the system.

Instead of the real camera, one can also use a software representation of the camera. One then talks about "Software-in-the-Loop". Often, but not always, the software that is tested will later be transferred to the real hardware component.

A special kind of SiL is Model-in-the-loop (MiL); this is used early in the development cycle, when one is exploring what a car component could look like.

Table 2. How HiL, SiL and MiL fit in the development process.

|                                                | Component (such as sensor) | Environment | Complete car system |
| On the test site                               | Real hardware | Real environment | Real car; used to collect initial data, and to collect data testing new hardware & software |
| Hardware-in-the-Loop (HiL), in the data centre | Real hardware tested; from sensors to complete cars | Represented by software in the computer (simulation) | Car & car behaviour represented by software in the computer (simulation) |
| Software-in-the-Loop (SiL), in the data centre | Hardware to be tested represented by software in the computer (simulation) | Represented by software in the computer (simulation) | Car & car behaviour represented by software in the computer (simulation) |
| Model-in-the-Loop (MiL), in the data centre    | Everything represented in the computer: computer-generated environments, cars and models | | |
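The feedback loop that gives these techniques their name can be sketched in a few lines: the environment produces input for the component under test, and the component's output changes the environment for the next step. Below is a minimal MiL-style sketch in which a toy distance-keeping model stands in for the real component.

```python
# Minimal "X-in-the-loop" cycle: the environment feeds the component
# under test, and the component's output is fed back into the
# environment for the next step. The model here is a toy stand-in.

def component_under_test(gap_m):
    """Toy distance-keeping logic: brake when the gap gets small."""
    return -2.0 if gap_m < 20.0 else 1.0        # acceleration in m/s^2

def simulate(steps=10, dt=0.5):
    gap, ego_speed, lead_speed = 40.0, 25.0, 20.0   # metres, m/s
    for _ in range(steps):
        accel = component_under_test(gap)           # component reacts
        ego_speed = max(0.0, ego_speed + accel * dt)
        gap += (lead_speed - ego_speed) * dt        # environment update (the loop)
        print(f"gap={gap:6.2f} m  speed={ego_speed:5.2f} m/s")

simulate()
```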

When the simulations grow complex, they can use both HPC and AI techniques. These require large computational resources, data resources and specialised software.

All simulations can generate new data that must be stored too. For specific tasks, like HiL or SiL testing, separate test benches or workbenches are developed. Sometimes the data created by a simulation is so large that it is cheaper to generate it again, by re-running the simulation each time it is needed.

There are two techniques used for scenario development. The first is predictive simulation. In simulation, all or part of the car is simulated in a virtual environment. Simulation can be used to predict behaviour, and to test the change of behaviour under different conditions.

The second is Deep Learning. In Deep Learning, the computer is "trained" to behave like a real self-driving car. More accurately, the neural network inside the computer is trained. By feeding it input data to see whether the neural network shows the correct behaviour — and adapting it when it does not — one can create a neural network that behaves like a car. One could show it hundreds of different traffic signs and train the neural network until it recognises new traffic signs with a sufficiently high degree of accuracy.

In practice, both predictive simulation and Deep Learning are used together in the simulation environment, and they need a lot of data and computing power.

For sensors like cameras or LiDAR, a technique called "ray tracing" can be used as a predictive simulation technique. It delivers accurate results. As an example, we could imagine a car driving on a road, with the sun shining low in front and another car coming from the opposite direction, with a crash seemingly unavoidable. These light conditions can be tested with some new camera sensors in your test car. You probably don't want to test that in real life on a public road. As a first simulation step, you can create a model of your new camera sensor, create a VR model of your road, the sun and the car, and determine what the new sensor would "see". The technique used to do this is called ray tracing: follow every light ray from source to sensor. This is a very compute-intensive task. The modern physical approaches require huge amounts of compute power from advanced GPUs and CPUs in multicore HPC systems.

Data validation is needed in the whole chain. Is the data correctly recorded in the car? This needs to be validated before the data is used in a simulation or Deep Learning program. The Deep Learning algorithms need to be generated and validated, and simulation results need to be validated. In some cases, Deep Learning is also used for data validation, by looking for data anomalies and data defects. In this case, however, care has to be taken not to just throw away some data because it does not seem to fit with the rest of the data.

1.2.5 Software & Services

There are two broad categories of software that will be run on the infrastructure: the applications and related models, and the system software. The applications are developed and maintained by the software developers of the manufacturers and related companies that are using the infrastructure. The system software is all the software available on the infrastructure. This includes anything from the operating systems to application software that is used by the application developers. The software in support of the remote site is fairly standard data-handling software. The metadata-extracting software for autonomous driving is specific, as it needs to capture cars in their environment. Several companies provide these types of tools.

The system software in the data centre includes all the traditional data centre software, including virtualisation and container management software. For simulation support (AI and HPC), special resource managers are installed, such as Bright Cluster Manager, which can manage large workloads on HPC clusters. For direct application development support, a complete suite of container-based tools, including those from Pivotal, is available.

Most of the hardware and software infrastructure is available as managed services. There are also services available to help design and maintain the infrastructure.

1.3 System design considerations

Although at the highest level it is clear how the architecture of a development platform for ADAS or AD looks, the actual implementations can vary a lot. In this section, we provide some important points you must not forget to think about, relating to the components in the system.

Whether one needs a completely new infrastructure, or whether one can use parts of an existing infrastructure, depends on the specific situation. In general, it is advisable to look carefully at the balance of the infrastructure. You may have servers that are occasionally not in use, and it may be tempting to use these for the automotive services. But if the network in place cannot provide the necessary bandwidth, these servers aren't of much use.

Basically, there are two main types of infrastructure: storage and computing. Both can consist of servers in your own data centre, where administrators need to configure the hardware, but they can also be available "as-a-service" (IaaS). The system-design goal is storage organised as a data lake that acts as a single entity, plus a compute service. In practice, compromises will be needed due to costs and capacity.

For computing, servers with computing nodes are available. Each computing node has one or more CPUs with many cores, along with accelerators such as NVIDIA GPUs and FPGAs. Broadly speaking, there are two types of computing nodes: nodes that are more usable for processing sensor data, and nodes that are better at supporting simulations. Computing nodes are not directly addressed by users but are mainly made available through container technology and resource managers, as a service (IaaS).

For the developers, the infrastructure is not made available directly, but as part of a Platform-as-a-Service (PaaS). This is a platform that contains all the software a user needs for working with data and applications, and it has access to the IaaS.

The two broad processing platforms are the Data Platform and the Compute Platform. The Data Platform is used to manage and process the sensor data and provides data analytics tools. The Compute Platform is used for simulation, such as scenario development. AI techniques like Deep Learning and Machine Learning also use the Compute Platform. So, the Data Platform is a computing platform more focused on data. It is sometimes (more appropriately) called a Data Analytics Platform. In this paper, we'll stick to the more commonly used term "Data Platform".

A specialised platform is used as a test bench or workbench with physical hardware components, like Hardware-in-the-Loop platforms.

Both the Data Platform and the Compute Platform need to interact with data from the Data Lake and need a processing infrastructure. The processing infrastructure for the Compute Platform is made up of scale-out clusters and High-Performance Computing (HPC) clusters. These platforms often have a local data cache, so they do not interact directly with the data in the Data Lake.
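Returning to the Deep Learning technique described earlier in this section: at its core, training is the loop sketched below, shown here with the PyTorch library. The tiny architecture and the random stand-in batch are assumptions for illustration, not the toolchain of an actual Compute Platform.

```python
# Minimal sketch of the Deep Learning idea described above: show the
# network labelled traffic-sign images and adjust it when it is wrong.
# Architecture, sizes and data are placeholders, not a production model.
import torch
import torch.nn as nn

NUM_SIGN_CLASSES = 43                    # e.g. the GTSRB benchmark has 43

model = nn.Sequential(                   # deliberately tiny classifier
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(8), nn.Flatten(),
    nn.Linear(16 * 8 * 8, NUM_SIGN_CLASSES),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Stand-in batch: 32 random 32x32 RGB "images" with random labels.
images = torch.randn(32, 3, 32, 32)
labels = torch.randint(0, NUM_SIGN_CLASSES, (32,))

for step in range(100):                  # training loop: predict, compare, adapt
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()                      # compute how to adapt the network
    optimizer.step()                     # adapt it
```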

1.3.1 Data Lake

The Data Lake (the data layer) consists of several file and storage systems that each fulfil a separate purpose and may need to meet specific requirements. They are integrated into one logical view, so a user in general only sees a single file system and does not need to worry about the actual storage hardware. Data ingest systems need storage that has a high bandwidth to take in data fast. HPC systems used in simulation need storage that is even faster in its communication with the processors. In that case, separate parallel file systems may be needed to keep the processors busy.

The Data Lake storage types can range from flash disks to tapes. Because of the costs involved in storage, one often tries to find a balance between speed and costs: flash is faster but more costly. Mass storage is the hardware itself; basic access services turn that into the Data Lake.

The main data storage system stores all the data. It can support multiple file protocols — sometimes with a storage-specific format. Video data streams, for instance, utilise different storage than labelling data or the data streams between controls and the ECU. Each subsystem may have its own file system that acts as a kind of "cache" of the complete file system. In general, all files are kept online on disks. For data that is not often used, an archiving system may be in place. To move the data from the central file storage to the caches and back, a "Data Mover" is in place that not only moves the data but also keeps track of coherency. Moving data also makes sense when a lot of computing needs to be done using specific data: the time and effort to move the data more locally to the processors in the compute platform is more than compensated by faster processing speed. Most of the necessary data movement is done automatically by the smart pool function of the data lake. In some cases, when lots of data is needed in a parallel file system on an HPC cluster, automatic movement may not always be sufficient.

Large amounts of data collected by sensors in test cars must be available for analysis. To use this data, it must be easy to find, and its context must be clear. The context involves specific information about the car, time and test run, etc. All this information about the collected data is called metadata. The metadata catalogue — or metadata database — contains all the data about the measured data. It is a high-performance, searchable index that keeps track of the metadata, location, and current state of the data.

1.3.2 Computing platform infrastructure

The compute part may be a little more differentiated. Compute could be pure High-Performance Computing (HPC) or a different form of compute, with GPUs for real-time, or simulated real-time, ray tracing.

NVIDIA technology provides the HPC necessary to further develop the usage of redundant sensors, diverse algorithms, and additional diagnostics that support the safer operation of ADAS-enabled cars and the development of greater autonomy. NVIDIA technology supports the use of many types of redundant sensors for sensor fusion, and multiple diverse AI deep neural networks and algorithms for vision by cars, localisation, and planning of the path of the car. These are run on a combination of integrated NVIDIA GPUs, CPUs, deep learning accelerators and programmable vision accelerators.

The infrastructure design process should keep in mind two principles:
• keep it simple,
• design future proof.

Especially for computing, keeping to these principles is useful, as there are many options, and new processors and accelerators come on the market at a fast pace. The rack design, networking, power and cooling infrastructure should best be designed to last for several years. Inside the racks, compute appliances or servers can be placed that can be updated more often. Appliances are combinations of hardware and software designed to run specific applications. If this happens to be one of your applications, that may be a good choice. However, keep in mind that the basic software, like AI base software, also evolves fast.

More general compute servers consist of many nodes, each with CPUs, GPUs and memory. The choice here very much depends on the type of applications you run; careful inspection of the characteristics is needed to see if workloads running different applications can be combined. Networking speed and network latency have to be taken into account.

1.3.3 Test benches

Testing and validating designs for new car components is an important goal. Either the hardware component itself is tested, or a software representation is tested in a simulated environment. In the first case, this is called Hardware-in-the-Loop; in the second case, Software-in-the-Loop. For each, a special hardware and software environment must be set up. The Hardware-in-the-Loop environment obviously needs connections for the hardware to be tested, such as a computing environment running the simulation, connected to the hardware with a monitoring device. Sometimes the simulation and monitoring devices can be combined into a single, fast workstation. The SiL environment mainly accesses compute servers and the Data Lake. As such, the SiL test bench consists mainly of an access station — a device or server to run the simulated hardware — and connections to the other parts of the compute and data infrastructure.

1.3.4 Workbench

The ADAS developers are developing new scenarios based on the collected data. They need a flexible workbench, consisting of a workstation, working data in a local cache, and workbench software. Workbenches are mostly implemented as virtualised workstations.

1.3.5 In support of the manufacturers' development process

The goal of the complex IT infrastructure is the development of self-driving cars, with safety being one of the key requirements. Manufacturers use standard development & validation techniques to assure this safety and to trace back the cause of a problem in their newly developed systems.

Last year, the second release of the ISO standard 26262 "Road vehicles — Functional safety" was published. This standard addresses the electronic and electrical systems in cars and describes how these systems should be designed, used and disposed of in such a way that they are functionally safe. The development of these systems for use in a car should, according to the standard, follow a V-model development and validation approach. This has the advantage that there is always a clear picture of whether the developed system functions as required.

[Figure: the classical V-model used in software development and advocated by ISO 26262. The left leg descends from Requirements through Architecture and Design to Code; the right leg ascends through Unit test and Integration test to System test, with each test stage validating the corresponding left-leg stage.]

So, the infrastructure should support the vehicle system development process according to the ISO standard as much as possible. The V-model advocated by ISO does not support the development of systems with Machine Learning, as it is focused more on traditional code development. The complete process is called software development; sometimes the left leg of the V is called (code) development and the right leg validation.

Part 2 — Technical analysis

Detailed technical descriptions to empower the reader to make informed decisions about IT infrastructures in support of Autonomous Driving.

The technical analysis is divided into four chapters, respectively chapters 2, 3, 4 and 5. Chapter 2 describes the remote site infrastructure. The next chapter, chapter 3, describes the Data Centre infrastructure. Chapter 4 gives an overview of the Software and Services. Chapter 5 gives some guidelines on how to design and architect an IT infrastructure to support Autonomous Driving development.

[Figure: the simplified infrastructure view, annotated with the chapters and sections of this part: Chapter 2 (Remote Site: Data logging / Data ingest), Section 3.1 (Data Lake), Section 3.2 (Data Platform), Section 3.3 (Compute Platform), Sections 3.4-3.6 (networking, test benches and workbenches, user access), Section 4.1 (Software), Section 4.2 (Services), and Chapter 5 (infrastructure design).]

2 Technical Description — Remote site

[Figure: the simplified infrastructure view, with the Remote Site (Data logging / Data ingest) highlighted. The Remote Site feeds the Data Centre, which contains the Data Lake, Data Platform, Compute Platform and User Access; Software and Services span both.]

Understanding what happens on the Remote Site is important, because that is where the data used in the other parts of the infrastructure is generated.

The main data collected comes from test cars and the many kinds of sensors in the car. We start this chapter with a short description of the data logging process. The last section of this chapter describes how the data gets tagged with metadata and then ingested into the servers of the Data Centre. The Data Centre itself is described in the next chapter.

As a prerequisite, reading the previous chapter "An overview of the IT infrastructure needed for the development of autonomous driving" is sufficient.

2.1 Data logging

Goal:
To record and log data from all the sensor systems in a car. The data must be in a format and on a medium that can easily be transported and processed.

How it works:
In a test car, data has to be logged in a way that allows it to be extracted easily. Special rugged cassettes are one option. Dell Technologies does not yet provide logging facilities for in-car usage; however, it can handle and organise getting data out of a data logger.

Ways of data logging include raw data streams captured after sensor fusion, which combines streams from several different sensors. They can be captured in pcap files or in multiple time-indexed data streams with varying velocity and density. Several vendors provide tools, including NVIDIA with Drive Xavier, ZF and Continental.

Some processing can be done in the car, especially when testing. For instance, to check that vision programs function correctly, vision and Deep Learning inference processing is needed. NVIDIA developed the specialised Xavier SoC chip that can run this.

Technical options — data logging

1. Local copying of data to transport media (rugged disk). Prepare for data ingest.
In a typical scenario the data is transferred on a test site to a rugged disk and shipped by courier to a data centre. See the Data ingest section.

2.2 Data ingest and tagging

Goal:
Get the data from the test cars into the file system. Make the data available for use by developers in a controlled way.

How it works:
There are three distinct steps. First, the data needs to be shipped from the test car at the test site to a data centre where the data can be read. There the data is uploaded into a file system: the actual data ingest. Then the data is tagged, and metadata is generated and stored in a data catalogue that is available through a global search facility.

The data collected in the car by the in-car platform, which includes the ECU, is stored by a data logger, possibly in the form of removable SSD disks. Sometimes the information is copied to tapes. Disks or tapes can then be transported by courier to the data centre.

In the data centre, there is a separate ingestion room where the disks are physically checked and put into a reader. The ingestion room is close to the data servers. Role-based or group-based access schemes assure that data is protected and secure. A typical ingestion room can handle 2 Petabytes per day. Personnel and a copy station administrator must be present.

Before the data is uploaded to the central data servers, a data integrity check is performed. Metadata that is available in the files is automatically extracted and stored in the data catalogue.

In some cases, the remote site has enough connectivity to the central data centre to stream data directly from the remote site to the data centre, without processing at the remote site. Alternatively, cars are serviced outside the remote site in a so-called 'digital garage' with sufficient connectivity. From the garage, the data can be uploaded to a satellite storage system, where initial data tagging can also be done, and then transferred to the data centre.

Technical options — Data ingest
Dell Technologies is supporting the following possibilities:

1. Ingest station (Dell Technologies Isilon SyncIQ)
Capable of reading and ingesting data, and transferring it to the data server. File protocols supported include HDFS, NFS, and SMB.

2. Mobile Upload Server (MUS)
Instead of using tapes or disks to transport data from the logging, it is also possible to use mobile servers that include RAID disks. Here, the complete server is transported. The Mobile Upload Server (MUS), based on the Dell PowerEdge T640, is one example. The advantage of this solution is that it is reliable, secure and fast. The typical network input speed from the logging server, and output to the Data Lake, is 40 GbE for a MUS, or 16.37 TiB/hour. In an optimised configuration one can reach up to 100 GbE. A more rugged variant server is the XR2.
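The integrity check mentioned above can be made concrete. Each ingest pipeline has its own tooling, but the core step — recomputing checksums in the ingestion room and comparing them against a manifest written by the logger in the car — can be sketched in a few lines of Python. The manifest format and mount paths here are assumptions for illustration, not part of any specific Dell Technologies product:

```python
import csv
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Recompute the SHA-256 checksum of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_disk(disk_root: Path, manifest_csv: Path) -> list[str]:
    """Compare files on a rugged disk against the checksums logged in the car.

    The manifest is assumed to be a CSV of (relative_path, sha256) rows
    written by the data logger; only a fully verified disk is released
    for upload to the central data servers.
    """
    failures = []
    with manifest_csv.open(newline="") as f:
        for rel_path, expected in csv.reader(f):
            if sha256sum(disk_root / rel_path) != expected:
                failures.append(rel_path)
    return failures

if __name__ == "__main__":
    bad = verify_disk(Path("/mnt/rugged_disk"), Path("/mnt/rugged_disk/manifest.csv"))
    if bad:
        print("integrity check failed for:", bad)
    else:
        print("disk verified, ready for upload")
```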

Table 3. Possible ways of getting data from a data logger in a test car into the Data Lake in the Data Centre.

| Properties \ Options | Tape-based logger | High-speed Data Storage | MUS (Dell XR2) |
|---|---|---|---|
| Storage type in car | Removable | Removable SSD | Removable |
| Data copied | Copy Station on Remote Site (from disk to tape) | Copy Station in Data Centre | Copy Station on Remote Site |
| Transportation medium, remote site to central data centre | LTO Tape | SSD | XR2 Rugged Server, or optional 25 Gbit/s Ethernet when the situation allows |
| Transportation from remote site to data centre | Courier (car, plane) | Courier (car, plane) | Courier (car, plane), or through the network |

The original table also rates the cost, complexity and performance of each option on a Low–Medium–High scale.

On top of the automatically extracted metadata, additional metadata is added that can describe access rights and scene events at specific timestamps. This happens after the data has been ingested. Machine learning can be used to add even more metadata. When the machine learning has problems identifying the right metadata, a person steps in to add the correct metadata — known as supervised machine learning. The machine learning programs are improved by persons confirming whether the choice proposed by the machine was correct or not.

Most cameras generate 2D images; LiDAR sensors can also capture 3D information. Automated 3D software can take 2D images from multiple cameras — and possibly LiDAR information — to create 3D images. Data from cameras and LiDAR can be combined. Machine learning software can then segment the 3D picture into objects, like the car itself, the ground, pedestrians, etc. The labelled objects are, for instance, used to decide how much space there is between the car and potential obstacles.

There are several annotation platforms that combine the automatic and manual identification of objects and their positions. Playment (https://playment.io/) automatically identifies 3D point clouds (for LiDAR and Radar data) and provides video annotation, landmark recognition and annotation, and 2D/3D bounding boxes. C.LABEL from CMORE (https://www.cmore-automotive.com/en/products/software-tools/clabel/) specialises in annotating automotive data; it provides bounding box annotation for 2D images and 3D point clouds using rectangles and cuboids. Supervisely (https://supervise.ly) focuses on annotating all kinds of objects quickly based on AI. It is not restricted to autonomous driving and can recognise and annotate many other objects, such as food and people. The annotation tool allows for easy approval or change of suggestions made by the AI. Teams of annotators can work together on labelling and check the work done by others.

Technical options — Data tagging
Dell Technologies is supporting the following possibilities:

1. VueForge (Altran)
An ADAS testing platform.

2. Labelling as a service

References
VueForge [visited July 2019] https://www.altran.com/us/en/integrated_solution/vueforge/

The global search and metadata management service is the core service used by developers, data scientists and taggers to store and access metadata — and, through the metadata, the actual sensor and simulation data itself. Although we describe it here as part of the Remote Site, the search facility is also available as part of the Data Platform in the Data Centre.

Technical options — Meta-data management & Global search
Dell Technologies is enabling the following capabilities:

1. Meta-data management and search solution
This module offered by Dell Technologies consists of two separate components:
• User front-end
• Data management service
The user front-end allows interactive data analysis. The data management service contains the global metadata catalogue and a global search facility. The actual sensor data from test cars and simulations may be spread around multiple data centre locations.
In the front-end, users are only presented with the data to which they have access rights. They do not see metadata about sensor data that they are not authorised to use. The metadata presented in the front-end may be further filtered by the type of analytics work station the user is accessing: an ADAS development work station may present a different subset of metadata than a HiL or SiL test bench work station.

2. Global search
The search facility, based on Elastic Search, provides not only lists of where the actual sensor data is located, but also information about access rights, access costs, time to access the data, possible mirrors and other metadata that might be helpful in deciding whether it is the right sensor data to use.

3. Open source based on Elastic Search
As an alternative access method to the metadata catalogue, one could implement one's own dedicated Elastic Search server with a customised configuration.
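As a sketch of how extracted metadata ends up searchable, the snippet below indexes one drive recording into an Elastic Search catalogue and queries it back. The index name, field names and host are illustrative assumptions — not the schema of the Dell Technologies module described above:

```python
from elasticsearch import Elasticsearch

# Hypothetical catalogue endpoint; in the solution described above this sits
# behind the data management service rather than being exposed directly.
es = Elasticsearch("http://metadata-catalogue.example.com:9200")

# Automatically extracted plus manually added metadata for one recording.
doc = {
    "drive_id": "2019-07-04_munich_car17_run03",
    "sensors": ["front_camera", "lidar_roof", "radar_front"],
    "scene_events": [{"t": 132.7, "label": "pedestrian_crossing"}],
    "weather": "rain",
    "storage_location": "datalake/raw/2019/07/04/car17/run03",
    "access_group": "adas-dev-eu",
}
es.index(index="drive-metadata", id=doc["drive_id"], document=doc)

# Global search: all rainy recordings containing a pedestrian crossing.
hits = es.search(
    index="drive-metadata",
    query={"bool": {"must": [
        {"match": {"weather": "rain"}},
        {"match": {"scene_events.label": "pedestrian_crossing"}},
    ]}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["drive_id"], hit["_source"]["storage_location"])
```

Note how the catalogue stores only metadata plus a pointer to where the sensor data lives; the front-end would additionally filter results by the user's access rights.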

3 Technical Description — Data Centre

The previous chapter describes the remote site infrastructure. This chapter, chapter 3, describes the Data Centre infrastructure, and chapter 4 gives an overview of the Software and Services. Chapter 5 gives some guidelines on how to design and architect an IT infrastructure to support Autonomous Driving development.

This chapter starts with a description of the Data Lake: the infrastructure that holds all of the data. It is fed mainly with new data from the remote site, as described in the previous chapter. The Data Platform and the Compute Platform, described next in this chapter, use data from the Data Lake and also feed results back to the Data Lake.

The Data Platform supports the data analysis tools needed in the development process. The Compute Platform section describes in more detail the support for simulation and scenario building. It explains the use of High-Performance Computing (HPC) for this purpose. HPC is little known in the traditional data centre world; that is why we spend a little more time on it here.

If, after reading this chapter, you want to dive deeper into the design considerations for data centre and software infrastructure, you can read chapter 5, "Choosing the right platform — Infrastructure design".

As a prerequisite, reading chapter 1, "An overview of the IT infrastructure needed for the development of autonomous driving", is sufficient.

3.1 Data Lake

[Figure: the recurring infrastructure overview — data logging / data ingest on the Remote Site feeding the Data Centre, where the Data Lake (highlighted) serves the Data Platform, the Compute Platform and user access, surrounded by the software and services layers.]

The main option for implementing the Data Lake is Isilon. Isilon is a proven enterprise-grade system. A major advantage of Isilon is that it is multi-protocol — with active support for HDFS and NFS/SMB, and S3 soon to be added. This is very useful in a complex environment where developers use different tools that may use different file protocols. Isilon is easy to use and there is already a broad installed base, also amongst ADAS customers. Installation as a basic NAS server is very easy. Isilon has native support for the Hadoop file system HDFS, which is useful for analytics users. Isilon HDFS storage is very efficient compared to server-based HDFS.

Having one unified Data Lake with multiple economical underlying storage "tiers" is the best choice, especially given the uncertainty about future workload needs in Autonomous Driving. For some special workloads, a parallel file system such as Lustre, combined with a Data Management System (DMS) with a data movement engine or other data locality options, may be needed. However, this should be avoided where possible.

A hybrid, transparent combination of fast low-latency file systems and a backend file system like OneFS (Isilon) provides a gap-bridging solution that can cater for extreme performance needs as well as the required versatility and multi-protocol approach. For general usage, a data lake implemented with OneFS is sufficient. Other specific use cases may require data to be handled faster, or even a burst buffer for extreme performance. In the latter case, the Compute Platform must be able to cope with the performance and may have its own (parallel) file system.

Another option to consider is an object store. In such a system, data is stored, managed and accessed as objects, rather than through a hierarchical file system. Quobyte, Scality, and Ceph are examples of object stores used in the development of Autonomous Driving. However, we do not see much adoption in ADAS, due to the type of applications.

Technical options — Data Lake
Dell Technologies is supporting the following possibilities:

1. Isilon
Highly configurable network attached storage platform.

Advantages:
1. The structure of Isilon prevents the need for migration.
2. It is relatively easy to manage, provided it is carefully planned and designed.
3. Protection of the data can be managed on different levels for different data types — from selected data sets up to locking down data sets with WORM (write once read many) functionality — with full audit capabilities on the data.
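The multi-protocol point can be made concrete: the same Data Lake directory can be reached as an ordinary NFS/SMB mount by desktop tooling and as HDFS by analytics jobs, without copying data. A minimal sketch using pyarrow, assuming a mounted path and an HDFS hostname (both placeholders; the HDFS client additionally requires the Hadoop native libraries to be installed):

```python
import pyarrow.fs as pafs

# Developers see the Data Lake as an ordinary NFS/SMB mount...
local_view = pafs.LocalFileSystem()
info = local_view.get_file_info("/mnt/datalake/raw/2019/07/04/car17/run03")

# ...while Hadoop/Spark jobs address the very same files over HDFS.
# Hostname and port are placeholders for the storage system's HDFS endpoint.
hdfs_view = pafs.HadoopFileSystem("datalake.example.com", port=8020)
with hdfs_view.open_input_file("/raw/2019/07/04/car17/run03/front_camera.pcap") as f:
    header = f.read(64)  # e.g. inspect the pcap file header

print(info.type, len(header))
```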

3.2 Data Platform

[Figure: infrastructure overview with the Data Platform highlighted.]

Goal:
Provide a platform for data analysts and other data users, like data labellers.

How it works:
The Data Platform provides the interface for the data scientists to manage, manipulate and analyse the data. The goal is to acquire insight into the data. Sometimes the data analysed is still in the Data Lake; sometimes it is more efficient to copy part of the data to a local storage cache. Typically, analysis is based on Hadoop and related analytics tools. Instead of using a command line, an integrated analytics workbench is used to provide a graphical interface. One example is the Cloudera Data Science Workbench. Another example is the Ready for AI solution.

With the Cloudera Data Science Workbench, one can easily select the data sets one wants to work with. The data manipulations are defined in a Jupyter notebook that also provides a recorded data flow. The Data Science Workbench also includes the Slurm scheduler for resource management; LDAP for user management; and Python 2, Python 3 and R support. It also includes templates for different ecosystems, and support for NGC (NVIDIA GPU Cloud) and Singularity containers.

Technical options
For implementing the Data Platform, Dell Technologies is supporting the following possibilities:

1. Cloudera Data Science Workbench
This includes the complete suite of Apache data science tools in a curated and supported fashion: Hadoop, Spark, Flume, Hive, HBase, Impala, Kafka, Oozie and Sentry. The Cloudera Data Science Workbench gives access to all these tools and also includes the Cloudera Navigator and Cloudera Search.

2. Apache Open Source data analytics tools
If one needs only a small subset of analytics tools, it may be useful to use these directly from Open Source.

References
Cloudera Data Science Workbench [visited July 2019] https://www.cloudera.com/products/data-science-and-engineering/data-science-workbench.html

Apache Spark [visited July 2019] https://spark.apache.org
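Under the workbench GUI, such an analysis typically compiles down to Hadoop/Spark jobs. A hedged PySpark sketch of one such step — counting scene events per weather condition over catalogue exports — where the path and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scene-event-stats").getOrCreate()

# Metadata exported from the catalogue into the Data Lake as Parquet files.
events = spark.read.parquet("hdfs:///catalogue/scene_events/")

# Count events per weather condition for difficult driving conditions only.
stats = (
    events
    .filter(F.col("weather").isin("rain", "snow") | (F.col("daylight") == "night"))
    .groupBy("weather", "label")
    .count()
    .orderBy(F.desc("count"))
)
stats.show(20)
spark.stop()
```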

3.2.1 Dynamic Data analytics

Goal:
During the data analysis process, data scientists may need to include more information from more sensors; or, if a sensor fails, they may need to remove or replace part of the data. A complex example is when, during the analysis, one finds that the data from a specific sensor is not correct. It would be helpful if one could correct that on the fly and continue with the analysis. This requires a dynamic data system that supports the analytics tools and can automatically scale to handle streams of data.

How it works:
Project Nautilus is working on an implementation of Pravega combined with Apache Flink. Nautilus is a unified platform to store and analyse IoT and streaming data. Pravega provides a new way of storing data. Instead of storing files or objects, Pravega handles streams of data. A data stream has a beginning, and new data is always appended at the end. It is also processed in that order, and a stream is not changed. This type of data handling is perfect for analysing and monitoring raw sensor data. As yet, there is no specific product that supports this type of streaming data analysis. There is, however, an open source tool that may be helpful; the supported version brings additional functionality.

Technical options — Dynamic Data analytics
1. Project Nautilus with Pravega
Open source streaming data analysis tool.

References
Pravega — Storage Reimagined for a Streaming World [visited July 2019] http://pravega.io/
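To make the append-only model tangible without depending on the Pravega API itself, the toy sketch below mimics only the semantics described above — events are only ever appended, and readers process them strictly in order from a chosen offset:

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Stream:
    """Toy append-only stream: a stand-in for the semantics Pravega offers.

    Real Pravega adds durability, stream segments, auto-scaling and
    transactions; none of that is modelled here.
    """
    _events: list[bytes] = field(default_factory=list)

    def append(self, event: bytes) -> None:
        self._events.append(event)  # new data only ever goes at the end

    def read_from(self, offset: int = 0) -> Iterator[tuple[int, bytes]]:
        # Readers always see events in append order; history never changes.
        yield from enumerate(self._events[offset:], start=offset)

sensor = Stream()
for sample in (b"speed=52.1", b"speed=52.4", b"brake=0.8"):
    sensor.append(sample)

# A monitoring job resumes from the last offset it acknowledged.
for offset, event in sensor.read_from(1):
    print(offset, event.decode())
```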

3.3 Compute Platform

[Figure: infrastructure overview with the Compute Platform highlighted.]

Why is HPC needed? Single-core performance is no longer increasing over time. For most business applications the solution is to simply use more CPUs, each with more cores. Virtualisation techniques and container technology then help to manage large systems that have hundreds or even thousands of cores. This is not sufficient to support simulation and AI. For simulation, the individual CPU performance is a limiting factor. This is partially addressed by adding specialised processors, like GPUs or FPGAs, that are suitable for certain types of applications. GPUs are useful for AI, while FPGAs work well for initial data processing. They can speed up an application tenfold or even a hundredfold.

In many cases, even greater performance is needed and more processors have to be used in parallel — and here one can easily be talking about hundreds of thousands of processors.

Sometimes, tasks can be run independently from each other. For example, when one wants to transform thousands of images in the same way, one can do that on many processors in parallel — each working on one picture. This type of parallel processing is called "pleasantly" parallel, as it is easy to implement. Most advanced simulation and AI applications are of a different nature: they are massively parallel. They can be separated into tasks that run on many processors simultaneously, but these tasks need to communicate a lot with each other, and this communication needs to be synchronised: a task on one processor may need to wait for a task on another processor to finish before continuing.

When the simulation or AI application performance could become a bottleneck, it is worth investigating whether it should be changed into a massively parallel application, with an HPC infrastructure set up to accommodate it.
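The pleasantly parallel case needs no special HPC middleware at all; the sketch below distributes a per-image transform over local cores using only the Python standard library (the transform itself is a placeholder):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def transform(image_path: Path) -> Path:
    """Placeholder per-image job: e.g. resize, undistort or normalise one frame.

    Each task is fully independent, so no synchronisation between workers
    is needed — the defining property of a pleasantly parallel workload.
    """
    out = image_path.with_suffix(".processed")
    out.write_bytes(image_path.read_bytes())  # stand-in for the real transform
    return out

if __name__ == "__main__":
    images = sorted(Path("/mnt/datalake/frames").glob("*.png"))
    with ProcessPoolExecutor() as pool:  # one worker per core by default
        for result in pool.map(transform, images, chunksize=32):
            print("done:", result)
```

On a cluster, the same pattern becomes thousands of independent jobs handed to the cluster management system, each node reading its own files from the Data Lake. The massively parallel case is different and is illustrated later in this section.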

3.3.1 Scenario testing and development — simulation

[Figure: a typical architecture of an ADAS Simulation System — SyncIQ feeds from the remote site into a multi-petabyte scale-out Data Lake; streaming parallel reads serve SiL/MiL simulation on HPC servers and HiL farms; test cases, results management, test-results analysis, data labelling, data ingest and multi-petabyte archiving on disk and tape surround the flow. This is just one of many possible views.]

Goal:
The ultimate goal of autonomous driving is that a car makes the right decision in any possible event. To make this possible, all kinds of scenarios have to be tested, or "simulated". Today, machine learning is often used to model the car's behaviour in its environment. This model is trained by also feeding it with real sensor data. This simulation process is often called "Simulation" or "Synthetic Simulation". Since it does not depend on real-time data input streams, pure simulation can be much faster than simulations that also use real-time data.

How it works:
In a simulated environment — a synthetic simulation — objects, their behaviour and their imagery are simulated. How objects react to events that occur is an important part of the simulation. The imagery typically simulates what a car's sensors would see.

Simulation of scenarios can be done in several ways. One can simulate scenarios completely in silico, without using any external data. In that case, one needs much computing power, but not so much network speed going into the compute nodes. If one combines simulation with streams of real data, such as data from one camera, one has an augmented reality (AR) simulation — the camera images are real, but everything in the car is a simulation — and one also needs graphic computing power from a GPU.

Complex simulations can require High-Performance Computing (HPC). Typically, 50,000-100,000 cores are needed; for standard processing, around 10,000 cores are usually sufficient.

Simulations are needed to generate scenarios for test cases that would rarely happen, or would be very dangerous, in the real world. This could cover specific working conditions or real-world environments that don't yet exist. Simulations are also useful for testing variations of the same basic scenario, such as different light or weather conditions. The scenarios are created from a VR model and selected scenario parameters. One only needs to store the model and the parameters to regenerate the scenario. This saves a lot of data storage compared to storing the complete scenario as images and movies. Simulated scenarios are sometimes also called "synthetic scenarios".

For each event that can possibly happen, one may need to simulate 100,000 scenarios or more. For a combination of different events, this can easily grow into the hundreds of thousands. One cannot perform fewer tests, in case something important is missed that could lead to dangerous situations. So, one needs to carefully design the HPC cluster configuration, with the right number of nodes and interconnect, and the right choice of CPUs, GPUs and memory.

The hardware requirement for a simulated scenario may be huge, especially when sensor simulation is involved. Sensor simulation needs to be physically based and validated by reference experiments. The simulation also needs to be scalable to fit modern HPC machines. HPC server clusters for these simulations may easily consist of thousands of cores with one or more GPUs. When "ray tracing" is used for the sensor, the high computing requirements call for hardware like the NVIDIA RTX 6000 GPU. Internal sensors, like the ones that measure wheel rotation and torque, can be calculated on the CPU.

In some cases, for well-defined scenarios, smaller dedicated clusters within a closed environment are needed to do the AI training.

In a simulation, Deep Learning is mainly used for developing new scenarios. This requires hardware similar to that needed for HPC simulation. However, there are some special hardware techniques, like VNNI — a special hardware instruction that speeds up neural network calculations — or AVX support, which can be used for floating-point calculations on wider registers than the traditional 64 bit.

The actual simulation software used is specific to each use case. For example, the Immense platform (https://immense.ai/) helps with creating city-wide simulations. AIMotive (https://aimotive.com/) allows one to automatically create millions of synthetic-simulation scenarios, ideal for developing autonomous driving solutions. The company's aiSim can do real-time simulation of 34 sensor feeds on 4 GPUs, and includes a range of actors, such as vehicles, pedestrians and animals, on 100 km of modelled roads.

Technical options
For compute systems, Dell Technologies is supporting the following possibilities:

1. Simulation service
Choices are to have one big cluster that can handle all kinds of simulation codes, or a few separate and specialised ones. Specialised clusters are only useful if you can keep them busy most of the time.

2. Scenario simulation

References
Dell EMC solutions for your Artificial Intelligence journey [visited July 2019] http://www.dellemc.com/ai

Highly-Scalable and Generalised Sensor Structures for Efficient Physically-Based Simulation of Multi-Modal Sensor Networks, Jörn Thieling; Jürgen Roßmann, DOI: 10.1109/ICSensT.2018.8603563 [January 2019]

3.3.2 HPC system (Compute Platform)

HPC systems are needed for simulations and as a basis for scenario development, for instance through AI. HPC systems, called HPC clusters, consist of special hardware and software. The hardware consists of many compute "nodes" with many-core CPUs, several accelerators like GPUs or FPGAs, and attached storage. They are all connected by a fast network. Sometimes Gigabit Ethernet is used; often a lower-latency network, like InfiniBand, is needed.

The HPC software supports fast processing on an HPC cluster. If the workload on the cluster consists of pleasantly parallel jobs, the input and output data can be served directly from the Data Lake. For massively parallel jobs, the basis is a parallel file system. This is needed when large files need to be processed: one file is put on several storage devices that can be used by different compute nodes. The goal is to keep the compute nodes busy with processing, not with waiting for data.

To manage all the resources in a cluster in an efficient way, a cluster management system is used. Users submit work (called "jobs") to the cluster management system. Different sensors have different hardware requirements.

Table 4. When do you need HPC?

Case 1 — Pleasantly parallel computing (sometimes called HTC)
• Technical characteristics: many (thousands or more) of the same tasks; small enough to fit in the memory of one compute node; can be executed in a reasonable time on one compute node.
• Example: processing many relatively small images, for instance for Machine Learning training.
• What to do: distribute independent jobs over many compute nodes; each compute node gets its own data set.
• Hardware architecture: cluster with many compute nodes, served from the Data Lake.
• Software architecture: jobs handled by, for instance, a cluster management system; create pleasantly parallel programs.

Case 2 — Massively parallel computing
• Technical characteristics: one large task needing not much data; small enough to fit in the memory of one compute node, but it cannot be executed in a reasonable time on one compute node.
• Example: some large Machine Learning training.
• What to do: create one parallel job that uses many processors; copy the same data set to local node storage.
• Hardware architecture: cluster with many compute nodes; multiple copies of the same data created by a Data Mover; a local data cache created for each compute node (this can be a virtual cache if the Data Lake can automatically provide that); standard network connections.
• Software architecture: create a single MPP program.

Case 3 — Massively parallel computing (sometimes called supercomputing)
• Technical characteristics: one large task needing a large data set; too large to fit in the memory of one compute node, and it cannot be executed in a reasonable time on one compute node.
• Example: simulation of a complete car component model in a generated environment.
• What to do: create one parallel job that uses many processors; create parallel versions of each data set.
• Hardware architecture: cluster with many compute nodes; a distributed copy of the data created by a Data Mover in a parallel file system; a low-latency network connection between cluster nodes.
• Software architecture: create a single MPP program; implement access to a parallel file system.

An Artificial Intelligence system — the DSS8440

The DSS8440 is a system optimised to support AI applications, especially machine learning. It is a combination of Dell Technologies and NVIDIA technology: Dell Technologies servers with NVIDIA GPUs. Customers can depend on the reliability and quality associated with these market leaders to implement their machine learning strategy. One can choose to incorporate 4, 8 or 10 GPUs, so it is easy to scale up after an initial test phase. The DSS8440 fits in a 4U chassis. It is designed to provide proven reliability and performance for machine learning and other AI-related workloads.

The major components in the DSS8440 are glued together by a PCIe fabric. This is an open industry-standard architecture that allows easy addition of innovative technologies as they emerge in the rapidly developing field of machine learning. A DSS8440 can be easily expanded with other PCIe-compliant components; up to 17 PCIe devices can be included.

The NVIDIA GPU included is the Tesla V100. With 640 Tensor Cores, Tesla V100 GPUs deliver 112 Teraflop/s of Deep Learning performance across the switched PCIe fabric of the DSS8440. The Tesla V100 GPU delivers 47 times higher inference performance than a CPU server.

The DSS8440 includes local storage, which speeds up the training phase of an AI application. Ten drives can be included, of which 8 can be high-performance, direct memory connected NVMe SSDs. This local storage results in lower latency for the application's data input and output.

Part of the AI development can already be done in the NVIDIA GPU Cloud (NGC). NGC is a program that offers a registry of pre-validated, pre-optimised containers for many machine learning frameworks, such as TensorFlow, PyTorch, and MXNet. Along with the performance-tuned NVIDIA AI stack, these pre-integrated containers include the NVIDIA® CUDA® Toolkit, NVIDIA Deep Learning libraries, and the top AI software. This helps data scientists and researchers rapidly build, train, and deploy AI models. Once ready, the containers can be moved to the DSS8440 and deployed there.

Because of the use of PCIe, the DSS8440 is also very power efficient. Because of the large capacity of the system, the DSS8440 can also easily be shared by different departments or development groups; software supports this by providing a different hosting environment for each group. With ten GPUs available, compute-intensive tasks like AI or simulation can easily be distributed over different parts of the system.
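The massively parallel rows of Table 4 imply synchronised communication between tasks; the canonical programming model for this is MPI. A minimal mpi4py sketch — an illustration of the model, not of any specific simulation code:

```python
# Run with, for example: mpirun -n 64 python reduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank works on its own slice of the problem...
local = np.full(1_000_000, rank, dtype=np.float64)
local_sum = local.sum()

# ...but must synchronise with all other ranks before the global
# result exists anywhere — this is where the low-latency
# interconnect between cluster nodes earns its keep.
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, global sum = {total:.0f}")
```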

Technical description
For HPC servers, Dell Technologies is supporting the following possibilities:

1. Simulation Servers
• R640/R740
• R840
• R940XA

2. Deep Learning Servers
• R740
• C4140
• DSS8440

C4140
The PowerEdge C4140 Server is specifically designed to support AI workloads. A 1U server, it has 2 CPUs and can support up to 4 GPUs. The GPUs draw the most power; hence they are placed in the common cold-air inlet region of the server rack. This prevents the GPUs from automatically slowing down to prevent overheating. The thermal design of the system is such that, when placed in the right ambient conditions, the GPUs will always perform at maximum speed.

Which server to choose depends very much on the software — especially on whether the application work can be split over many nodes in a pleasantly parallel way, or whether communication is needed between the nodes.

Ready for AI solution — Platform

A distributed architecture to support Artificial Intelligence and Deep Learning. This is a combination of hardware technologies that use NVIDIA GPUs to provide the necessary computing power. The Ready for AI platform consists of 3 main components:
• Dell Technologies PowerEdge servers: an R740 head node to manage the cluster, and many C4140 worker nodes, each accelerated with up to four NVIDIA V100 GPUs, with NVLink providing 500 Tflop/s per U of throughput.
• Two types of networking: Ethernet to manage the cluster (Dell EMC S3048-ON Ethernet Switch) and InfiniBand for data traffic (Mellanox SB7800 InfiniBand Switch).
• Two types of Dell Isilon storage: for inference (Isilon F800 all-flash scale-out NAS) and for training (Isilon H500 hybrid scale-out NAS). The F800 enables millions of concurrent connections between worker nodes and storage nodes.

The system can scale from a few dozen Terabytes to 33 Petabytes. The throughput can be up to 540 GB/s.

References
High-Performance Computing Solutions — Make innovation real with high performance computing solutions [visited July 2019] https://www.dellemc.com/hpc

High-Performance Computing — Technical communications [visited July 2019] http://www.hpcatdell.com

NVIDIA Drive Constellation — Virtual Reality Autonomous Vehicle Simulator [visited July 2019] https://www.nvidia.com/en-us/self-driving-cars/drive-constellation/

NVIDIA DRIVE Software [visited July 2019] https://www.nvidia.com/en-us/self-driving-cars/drive-platform/software/

DSS8440 spec sheet [April 2019] https://www.dellemc.com/en-gh/collaterals/unauth/data-sheets/products/servers/dss-8440-server-spec-sheet.pdf

3.3.3 Storage and memory for HPC

Goal:
Storage and memory are closely connected in HPC systems. The idea is to get the data into the HPC processors, and the results out of the processors and memory, as fast as possible, to keep the processors busy.

How it works:
Two techniques used are NVMe, which directly connects storage and memory, and special PCIe fabrics. Third-generation PCIe is a technology that makes it possible to support workloads with changing accelerator and storage bandwidth needs.

Technical description
For storage and memory for HPC, Dell Technologies is supporting the following possibilities:

1. NVMe support in servers

2. HPC clusters with PCIe Fabric
It started with the flexible first generation of the C4100 1RU 4x GPU front-layout series server. It is now possible in the C4140, FX2, VRTX, and in the DSS8440.

References
Dell EMC PowerMax NVMe Data Storage [visited July 2019] https://www.dellemc.com/en-gb/storage/powermax.htm

3.3.4 Networking for HPC

Goal:
Connect the nodes in an HPC cluster to allow for sufficiently fast data transfer.

How it works:
In an HPC cluster, the compute nodes are connected by a very fast network.

Technical options — networking for HPC
For HPC networking, Dell Technologies is supporting the following possibilities:

1. InfiniBand

Connecting technologies — HPC system
1. Fast Ethernet
2. InfiniBand
3. RDMA
4. RoCE (RoCEv2)

References
https://www.dell.com/support/article/us/en/04/sln311501/high-performance-computing?lang=en#Compute-and-Interconnects

3.3.5 Parallel file systems for HPC

Goal:
Parallel file systems allow you to distribute a file over several disks. They are typically used in cases where one needs to get a lot of data into the compute nodes of a cluster — and a lot of data out as well. This happens where reading from a single disk yields insufficient data flow into the compute nodes to keep all processors busy. Parallel file systems are mostly seen in HPC clusters.

How it works:
The contents of a file are distributed over several or many disks. The way this is done differs from one parallel file system to another. Some parallel file systems have options like high availability. To get the data in and out of the parallel file system from other places in the infrastructure, a data mover needs to convert from a general file protocol to the parallel file system and vice versa.

In traditional HPC systems, parallel file systems are typically connected to very large clusters. In ADAS/AD systems, they are mostly used for small clusters of 32-64 nodes. Parallel file systems need a sufficiently fast network, so one needs to look at that too. There are several hardware technologies in a compute node that can help speed up performance.

Compared to data stored in the Data Lake — where it lives forever — the data in the parallel file systems used for HPC is short-lived. The initial data is loaded into the parallel file system from the Data Lake; the simulation or AI application is run, using that data and writing the results back into the parallel file system. The resulting data is written to the Data Lake, and the parallel file system can be filled with new input data for the next simulation or AI application.

There are many parallel file systems to choose from. We focus on Lustre and BeeGFS. Each has different characteristics and usability depending on the use case. The following table provides some characteristics of the two; we added Isilon as a baseline.

Table 5. Isilon is almost always the best choice for storing and managing data. In some specialised HPC system parts, Lustre or BeeGFS can be used.

| | Isilon | Lustre | BeeGFS |
|---|---|---|---|
| Looking for | An enterprise-grade general file system with 24/7 highest-quality support and reliability across all file types and sizes. | An enterprise-grade HPC file system with 24/7 support and reliability across all file types and sizes. | A short-lived parallel file system built for the greatest IOPS and throughput. |
| Technical | Common NFS file system. Scalable to 50 PB capacity. Up to 200 GB/s distributed throughput. Up to 15M distributed R/W IOPS. Easy to install and maintain. In-built tiering to object storage. | Scalable to Exabyte capacity. Up to 1 PB/s file throughput. Up to 22 GB/s R/W per OSS pair. Up to 300K IOPS with scale-out metadata servers. Data accelerator up to +3M IOPS. | Scalable to Exabyte capacity. Scalable to 1 PB/s throughput. Up to 25 GB/s read per node. Up to 1B IOPS read and write. |
| Use it for | Mission-critical and pleasantly parallel HPC environments, or for dominant small-file IO patterns. With multi-protocol built in. | High-throughput and high-capacity file system: "scratch space" as well as 24/7 simulation environments. | High-throughput and high-capacity file system: "scratch space" in short-lived environments. |
| Choice | Preferred choice. | Only use when HPC requirements ask for it. | Only to be used as an integral part of an application. |

Lustre has good performance for a single stream and per node. It has specific APIs to support parallel access to files. It supports 10-200 GbE (Gigabit Ethernet) and InfiniBand; with an InfiniBand front- and backend it also has low latency. Hence Lustre is very popular for HPC systems: 60 of the 500 fastest supercomputers in the world — listed on http://top500.org — run Lustre. Lustre also has some disadvantages. It has open source support, which may not offer the required service level: it is not plug-and-play and requires specialised knowledge to implement and maintain. Lustre is not multi-protocol either, so it requires proxies to implement multi-protocol file access.

Implementation of the Lustre file system on Dell servers

The Dell Technologies Ready Solution for HPC Lustre Storage is designed for academic and industry users who need to deploy an easy-to-use, high-throughput, scale-out parallel file storage system. The Ready Solution uses the Lustre community edition maintained by Whamcloud and is a scale-out storage appliance that can provide high performance and high availability. It delivers a combination of performance, reliability, density, ease of use, and cost-effectiveness. It uses Dell Technologies PowerEdge servers and the high-density Dell PowerVault ME storage line. The Ready Solution comes with full hardware and software support from Dell Technologies and Whamcloud.

References
Dell EMC Ready Solutions for HPC Storage [April 2019] https://www.dellemc.com/resources/en-us/asset/white-papers/products/ready-solutions/Ready_Bundles_for_HPC_Storage-Solution_Overview.PDF

Scalability of Dell EMC Ready Solution for HPC Lustre Storage on PowerVault ME4 [visited August 2019] https://www.dell.com/support/article/us/en/04/sln316413/scalability-of-dell-emc-ready-solution-for-hpc-lustre-storage-on-powervault-me4?lang=en

BeeGFS — The Parallel Cluster File System [visited July 2019] https://www.beegfs.io

ThinkParQ: The Parallel Cluster File System — The Company Behind BeeGFS [visited July 2019] https://thinkparq.com/

Technical options — Parallel file systems
For HPC parallel file systems, Dell Technologies is supporting the following possibilities:

1. Lustre

2. BeeGFS

Not explicitly advertised, but supported when asked for:

1. IBM GPFS

References
Dell EMC Ready Solution for HPC Lustre Storage — Using PowerVault ME4 [February 2019] https://www.dellemc.com/resources/en-us/asset/white-papers/solutions/h17632_ready_hpc_lustre_wp.pdf

3.3.6 HPC Cluster management

Goal:
Efficient management of cluster computing resources for simulations.

How it works:
An HPC cluster consists of many resources: many nodes, with or without GPUs, with more or less memory, and with fast interconnects, like Ethernet or — most of the time — low-latency InfiniBand. HPC simulation jobs have different characteristics. Special middleware, called resource managers or cluster managers, distributes the jobs as efficiently as possible over the cluster nodes. Resource managers need to understand the job requirements and be able to recognise which hardware is available in a specific cluster node. Some do not recognise GPUs, for instance, which makes them unusable for almost any HPC cluster.

A resource manager like Bright Cluster Manager can manage not only HPC resources, but also Hadoop clusters used for big data. Therefore, it can be worth looking at how other parts of the infrastructure are managed, to find synergies.

Sometimes, the job submission is embedded in the workflow system. Generating jobs and submitting them to the resource manager enables a user to concentrate on creating AI applications using TensorFlow or Caffe.

Technical options — cluster management
For HPC cluster management, Dell Technologies is supporting the following possibilities:

1. Bright Cluster Manager

2. openHPC + xCAT

Not explicitly advertised, but supported when asked for:

1. IBM Spectrum LSF

2. Univa Grid Engine

3. Slurm

Connecting technologies

1. ServiceNow

2. Ready Solution for AI

References
Bright Computing website [visited July 2019] https://www.brightcomputing.com/
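What "submitting a job" looks like in practice, assuming the Slurm option listed above: a small batch script declares the resources, and the resource manager decides where and when it runs. The partition-independent sketch below generates and submits such a script from Python; the GPU count, node count and the application command are placeholders:

```python
import subprocess
import tempfile

# A hedged example of a Slurm batch script for one simulation job;
# resource sizes and the application command are placeholders.
job_script = """#!/bin/bash
#SBATCH --job-name=scenario-sim
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00
srun ./run_scenario_simulation --scenario-set rain_night_017
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(job_script)
    path = f.name

# The resource manager queues the job and places it on matching nodes.
subprocess.run(["sbatch", path], check=True)
```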

3.3.7 AI support

Goal:
Support scenario development with AI.

How it works:
Machine learning is used in scenario development. Scenario developers often use standard languages; optimised implementations of those languages are made available. Implementations of AI can be highly parallel. In that case, an HPC cluster with Bright Cluster Manager with an AI plugin can be the best option.

Technical options — AI
For HPC AI applications, Dell Technologies is supporting the following possibilities:

1. Caffe2

2. TensorFlow

3. PyTorch

For managing the AI workload on an HPC cluster:

1. Bright Cluster Manager with AI plugin

References
Dell EMC Ready Solutions for AI — Deep Learning with NVIDIA [February 2019] https://www.emc.com/collateral/technical-documentation/h17354-dl-architecture-guide.pdf

Deep Learning with Dell EMC Isilon [July 2018] https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage/h17361_wp_deep_learning_and_dell_emc_isilon.pdf

Bright Computing website [visited July 2019] https://www.brightcomputing.com/
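As a sketch of the "highly parallel AI" case, the TensorFlow option above can spread training over all GPUs in a node with a distribution strategy. The model and data below are dummies standing in for a real scenario-development network and labelled sensor data:

```python
import tensorflow as tf

# Synchronous data parallelism over every GPU visible in the node.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy stand-in for labelled sensor data pulled from the Data Lake.
x = tf.random.normal((1024, 64))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=256, epochs=2)
```

Scaling beyond one node — multiple workers, each with several GPUs — is where the cluster manager and the fast interconnect described earlier come into play.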

3.4 Networking in the data centre

[Figure: infrastructure overview with the data centre network highlighted.]

Dell Technologies is a strong advocate of Open Networking for networking in the data centre. Open Networking is based on a set of standards and best practices developed by the Open Networking Foundation. Dell Technologies was one of the early contributors to and supporters of Open Networking. Open Networking provides a more flexible and future-proof way of setting up networking in the data centre. It is a form of Software Defined Networking (SDN).

Depending on the requirements, networking configurations can range from a networking stack fully based on Dell Technologies' hardware and software in a fully converged setting, to a disaggregated, almost fully open source stack.

Technical description
For data centre networking, Dell Technologies is supporting the following possibilities:

1. Dell Technologies networking stack.
Full Dell EMC service and support for hardware and software.

2. Open networking stack.
Dell Technologies service and support, with warm hand-over to a software partner.

3. Disaggregated open source stack.
Dell Technologies service and support for the base OS and hardware.

4. Disaggregated open source stack.
Dell Technologies service and support for hardware elements.

3.5 Test and Workbenches

Test and workbenches are combinations of hardware and software for specific tasks in ADAS/AD development. The Hardware-in-Loop (HiL) and Software-in-Loop (SiL) test benches support simulation. The ADAS/AD Workbench is the general toolbox.

3.5.1 HiL Test bench

[Figure: HiL test bench — recorded data is replayed from an HPC file system over Ethernet to the special hardware under test (a HiL server/workstation connected, via CAN/FlexRay, to the same hardware as in the car, such as a car camera ECU) and to a system simulator; test scenarios are generated, and test results analysed, from a management station and VDI desktops.]

Goal:
The behaviour of new car components, or of car components under development, must be thoroughly tested.

How it works:
A Hardware-in-Loop (HiL) test bench is used to connect a component to a simulated environment. This hardware component can send and receive sensor data as it would when connected to a real car. When one does not need to test hardware, one may not need the HiL test bed.

Technical options — HiL Testbed

1. Available as a module

References
Solving the storage conundrum in ADAS development and validation [July 2018] https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage/dell-emc-adas-solution-powered-by-isilon-wp.pdf

3.5.2 SiL Test bench

[Figure: SiL test bench — recorded data is replayed from an HPC file system into HPC servers that simulate the component; test results are analysed from a management station with VDI desktops.]

Goal:
A Software-in-Loop (SiL) test bed is very similar to a Hardware-in-Loop test bench: a car component is connected to a simulated car environment. But in this case, the car component does not exist as real hardware — it is a simulation in itself.

How it works:
The SiL test bench can be used as an interface to the simulation cluster, with emphasis on one component. In reality, however, it is mostly used to test the software that will be integrated into the hardware. SiL is used to virtually verify, refine and validate hardware. For simulations, an HPC environment is often needed, so SiL and HPC services often go together.

When the simulation of the hardware component is complete, software has to be generated that can be uploaded into a hardware component. Today's hardware always contains software, such as firmware, that is more or less wired in. The hardware component often has different processors than the simulation system. Converting the simulated software into software ready to run on the component is another function of the SiL test bed.

Technical options — SiL Testbed

1. SiL testbed available as a module.

2. NVIDIA DRIVE Constellation.
Photorealistic simulation is a safe, scalable solution for testing and validating a self-driving platform before it hits the road. NVIDIA DRIVE Constellation integrates powerful GPUs and DRIVE AGX Pegasus. Visualisation software running on GPUs simulates cameras, radar, and LiDAR as inputs to NVIDIA's DRIVE AGX Pegasus, which processes the data as if it were actually driving on the road. This scalable system is capable of generating billions of miles of diverse autonomous vehicle testing scenarios to validate hardware- and software-in-the-loop prior to deployment.

References
NVIDIA DRIVE Constellation [visited August 2019] https://developer.nvidia.com/drive/drive-constellation

3.5.3 ADAS/AD Workbench

Goal:
ADAS/AD developers need access to the metadata catalogue — and to the actual data, simulations and scenarios — and need to be able to use the data, initiate simulations, and evaluate results and scenarios. They use an ADAS workbench: an ADAS work station. Sometimes the ADAS developers work close to the test site, sometimes closer to the main data centre. In any case, they need fast access through their work station.

How it works:
The screen the ADAS/AD developer uses may be connected directly to the actual ADAS/AD work station. But if it is located near to or on the test site, it may be connected to the actual work station through a virtual desktop infrastructure (VDI). The actual workbench software may be distributed over the work station and a server back-end.

The ADAS/AD developer has access to the Global Metadata Database and can use the global search facility.

Technical description
For workbenches, Dell Technologies is supporting the following possibilities:

1. ADAS/AD WorkBench
Software and work station hardware.

References
Solving the storage conundrum in ADAS/AD development and validation [July 2018] https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage/dell-emc-adas-solution-powered-by-isilon-wp.pdf

3.6 User access — VDI

[Figure: infrastructure overview with user access highlighted.]

Developers, analysts and other users are connected to the development infrastructure through VDI — Virtual Desktop Infrastructure. In VDI, the core of the desktop actually resides on the server. The image of the desktop is sent over the network to the user's device, which can be a thin client or a PC. The advantage of VDI is that it is much easier to manage user access, applications and the users' own storage.

VDI provides secure access to desktops, applications and workstations, such as simulation workstations. VDI depends on fast network connections.

The VDI software infrastructure is based on VMware Horizon 7 VDI technology. This allows for contextually aware policy management — for instance, disabling USB access when the desktop is accessed over an insecure network. Groups of virtual desktops can be easily created and managed. There are several places where users need access: for labelling, for the test bench, for system management and monitoring, and for accessing servers at the remote site when present. Using VDI provides standardised access to all of these.

Technical description
For VDI, Dell Technologies is supporting the following possibilities:

1. Dell Technologies Wyse VDI server

References
Dell Technologies — Virtual Desktop Infrastructure [visited July 2019] https://www.dellemc.com/en-gb/solutions/vdi/index-it.htm

4 Technical Description — Software & Services

This chapter provides an overview of the software and services that can help when working with a development environment. The Software section describes the global software architecture and technologies; these mainly serve provider application support and management functions. Dell Technologies provides several services to assist autonomous driving development, ranging from managed services to financial services. These are described in the second part of this chapter.

As a prerequisite, reading chapter 1, "An overview of the IT infrastructure needed for the development of autonomous driving", is sufficient — although, to really benefit, it is best to also read chapters 2 & 3.

4.1 Software

[Figure: infrastructure overview — the software layer spans the Remote Site and the Data Centre: data logging / data ingest, Data Lake, Data Platform, Compute Platform and user access.]

Table 6.

| | HPC Cluster Management | Container technology |
|---|---|---|
| Enterprise solution | Bright Cluster Manager + Data Science | PKS + Kubernetes |
| Open source | OpenHPC | Docker, Kubernetes, Singularity |

Cluster management has already been described in the HPC section.

4.1.1 Software — Container technology

Goal:
To manage and deploy application software packed inside containers. Containers allow for packaging only the software and libraries needed. Different software may require different approaches.

How it works:
Containers are used to pack application software with its libraries into a small "container" that can run anywhere on the infrastructure. Containers are like Virtual Machines, but smaller, as they don't have a complete operating system packed inside. Containers are self-contained and can run everywhere, as they do not need any other software apart from a container runtime system.

There are several broad container technologies. One is Docker, one of the oldest container technologies; another one is Kubernetes. These cover the containers themselves and simple ways of running them. One needs advanced container management software to manage containers at scale. One example for Kubernetes is Pivotal Cloud Services, or PKS.

Technical options — container technology
For container technology, Dell Technologies is supporting the following possibilities:

1. PKS (Kubernetes)

2. Docker (Docker Swarm)

Connecting technologies

1. Google's new High Availability option
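A small illustration of the Docker option, using the Docker SDK for Python to run one containerised processing step against a Data Lake mount. The image name, command and paths are illustrative assumptions, not part of any product described here:

```python
import docker

client = docker.from_env()

# Run one containerised processing step; because the container packs its
# own libraries and runtime, the same image runs anywhere on the cluster.
logs = client.containers.run(
    image="registry.example.com/adas/auto-labeller:1.4",  # placeholder image
    command=["--input", "/data/raw/run03", "--output", "/data/labels/run03"],
    volumes={"/mnt/datalake": {"bind": "/data", "mode": "rw"}},
    remove=True,  # the container itself carries no state worth keeping
)
print(logs.decode())
```

At scale, an orchestrator such as Kubernetes (for instance via PKS) takes over the scheduling that this single call does by hand.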

4.1.2 Create a complete software stack

Goal:
Managing and configuring all kinds of software modules and libraries takes a lot of effort. Having a set of software that works together can save time and minimise errors.

How it works:
Several companies offer software modules and libraries that span a considerable part of the complete ADAS/AD development workflow. The exact composition may very much depend on the actual use case.

As an example, we describe NVIDIA DRIVE. NVIDIA's software stack, NVIDIA DRIVE, focuses on supporting Deep Learning and visualisation all along the development process. It is mainly used for synthetic driving.

NVIDIA DRIVE software for developing autonomous driving

NVIDIA DRIVE software enables developers to build production applications for automated and autonomous cars. It contains software modules, libraries, frameworks, and source packages. Developers and researchers can use these to optimise, validate, and deploy their work. NVIDIA DRIVE software includes the following separate, individual features:

DRIVE OS: The underlying real-time operating system software includes a safety application framework, and offers support for Adaptive AUTOSAR.

DRIVE AV: The autonomous vehicle (AV) driving system software integrates a wide range of neural networks for detection of all types of environments, and of objects within those environments, along with car localisation and path-planning algorithms.

DRIVE IX: This deep learning-based software stack enables manufacturers to develop intelligent experiences inside the car, from driver-monitoring systems using in-cabin cameras to voice- and gesture-activated AI assistants.

DRIVE AR: The perception visualisation software stack is used to build cockpit experiences for dashboard and back-seat screens. It takes information from the car's sensors and transforms it into comprehensive and accurate visuals that are easily understood, helping to build trust in autonomous car technology.

DRIVE Hyperion: This complete autonomous car development and testing platform includes a DRIVE AGX Pegasus system, along with sensors for autonomous driving (seven cameras, eight radars, and optional LiDARs), sensors for driver monitoring, sensors for localisation, and other accessories.

DRIVE Mapping: The mapping solution integrates a scalable sensor suite, software development kits, and co-integrated high-definition maps available through partnerships with leading mapping companies. NVIDIA's end-to-end mapping technologies help collect environment data, create HD maps, and keep them updated.

The NVIDIA DRIVE AGX architecture enables car manufacturers to build and deploy self-driving cars and trucks that are functionally safe and can be demonstrated compliant with international safety standards, including ISO 26262 and ISO/PAS 21448, NHTSA recommendations, and global New Car Assessment Program (NCAP) requirements.

NVIDIA offers 320 Tflop/s of Deep Learning compute on DRIVE AGX Pegasus.

References
NVIDIA Drive — Scalable AI Platform for Autonomous Driving [visited July 2019] https://www.nvidia.com/en-us/self-driving-cars/drive-platform/

52 HPC & AI Innovation Exchange | 002 The ADAS/AD Architecture HPC & AI Innovation Exchange | 002 The ADAS/AD Architecture 53 4 2 . Services 4 .2 3. Support Services Using a complex infrastructure in an optimal way requires specialised knowledge and continuous Services attention . Dell and its partners can help with specific service support .

Innovation Labs help to explore and test new and emerging technologies . Centres of Excellence Software bundle expertise relevant for the new application technology .

The Dell HPC Community brings together HPC users and experts . Remote Site Data Centre

Technical options — support services Data For support, Dell Technologies is supporting the following possibilities: Platform Data logging / 1. Ready Solutions for AI Solution Stack Data Lake User Data ingest Access Compute This is a combination of AI framework support, library support and services that allows to quickly set up an Platform AI solution . It supports the major AI frameworks and libraries, including BigDL, TensorFlow, Caffe, Neon, cuDNN, cuBLA and support for the Cloudera Data Science Workbench . The services include consulting, support and financial services .

4 .2 1. Hardware Deployment Services The Ready for AI hardware platform is described in the HPC section . The hardware infrastructure for an ADAS/AD development platform will need continuous updating and expansion . Deployment services can help with this task . There are special ready solutions for Deep learning with NVDIA and with Intel .

Technical options — hardware deployment service References Dell Technologies Deployment Services . 1. Explore Ready Solutions for AI [visited July 2019] https://www dellemc. com/en-gb/solutions/data-analytics/machine-learning/ready-solutions-for-. ai ht. - Reference m#scroll=off Deployment Services [visited July 2019] https://www dell. com/en-uk/work/learn/deployment-servers-storage-networking. 2. Support services from Partner Altran [visited July 2019] https://www altran. com/be/en/industries/automotive.

4 .2 .2 Software Deployment Services 3. innovation Labs [visited July 2019] The software infrastructure for an ADAS/AD development platform will need maintenance, updating http://www dellemc. com/innovationlab. and upgrading . Software deployment services can help with this task . 4. Centres of Excellence [visited July 2019] Technical options — software deployment service http://www dellemc. com/coe.

Reference 5. Dell HPC Community [visited July 2019] Dell Technologies Deployment Services [visited July 2019] https://www dellemc. com/en-gb/. http://www dellhpc. or. services/deployment-services/index htm#scroll=off. 4 .2 4. Labelling as a service Creating metadata by labelling or annotating the sensor data collected from cars takes a lot of effort; for example, detecting whether an object is a pedestrian or two, adding data from temperature sensors and many other labels . Much of it can be automated . The automated labelling can be offered as a service . The service consists of algorithms and a collection of servers that are tuned to execute it .

4.2.5 Managed Services

Running ADAS/AD applications and managing and analysing the vast amounts of data involved is already an immense task, and it is core to the customer's activities. Managing the platform, including the hardware and software, is not; it also requires specialised IT knowledge.

Rather than acquiring this knowledge and maintaining an ADAS/AD platform themselves, enterprise customers choose managed services, in which Dell Technologies, together with partners, sets up the platform and related services. This resembles a cloud computing approach, but with dedicated infrastructure.

For the Data Lake and Data Platform, managed services are already quite common with enterprise customers for applications other than ADAS/AD. For the Computing Platform, managed services (HPC-as-a-Service) are new. Using and managing HPC systems requires specialised knowledge: it typically takes at least one or two years to set up a knowledgeable team of HPC experts. Managed services that include HPC expertise in provisioning, scheduling and monitoring HPC clusters allow customers to start right away and concentrate on their applications.

Technical options — managed services
For managed services, Dell Technologies supports the following possibilities:
1. Infrastructure Managed Services (IMS)

Reference
https://www.dellemc.com/en-gb/services/managed-services/index.htm

4.2.6 Financial services

Lease of computer equipment and software.

Options — financial services
For financial services, Dell Technologies supports the following possibilities:
1. Dell Financial Services (DFS)

Reference
https://www.dellemc.com/en-us/flexibleconsumption/index.htm

5 Choosing the right platform — Infrastructure design

More than ever, open source tools are used to create the software infrastructure these days. The trend, derived from the Unix principle, is to build infrastructures from many independent but collaborating tools. Each tool can be updated and expanded more easily than in a more monolithic software infrastructure. Developers were trained using this approach.

[Figure: The high-level infrastructure view: data logging and data ingest at the remote site; the Data Lake, Data Platform, Compute Platform and user access in the data centre; with software and services spanning both.]

This high-level view can be filled out in practice with many different solutions. The choice very much depends on the specific requirements.

In Chapters 2, 3 and 4, we described the concepts of the remote site and data centre infrastructure and of the software infrastructure.

There is no "one size fits all" hardware and software platform. The global components are the same, but the actual hardware choices (How many servers? Which data storage size? Which software stack?) may differ widely from case to case. It depends on the exact data flow, the type of simulations one wants to perform, and many other specific customer demands. There are, however, general considerations that one can take into account to come to a balanced design.

As a prerequisite, reading all the previous chapters is advised.

We start this chapter with an overview of the infrastructure design process, followed by some practical examples in which realistic designs are filled in with specific Dell Technologies and partner tools. We continue with some general server design considerations, followed by remarks specific to the Data Lake, the Compute Platform and the network design.

5.1 Infrastructure design process

One starts by looking at the initial ADAS/AD application needs. This includes the amount of measured data to be processed, processing time limits, and the number of active developers. These needs are then translated into service requirements: for instance, we need data logistics, compute services and operations. In the last step, this is translated into technical implementation and configuration details. Here we decide on things like the number of cores for HPC and the structure of the Data Lake components.

Table 7. Architecture design flow example. The figures and data are fictional.

Initial application needs:
- Ingestion services: handle input data from 100 cars, of which 50 drive simultaneously; ingest data within 2 days; handle remote sites.
- Preprocessing pipeline: store metadata; ground truth generation; process 1 million km of virtual data in a SiL environment; train AI models.
- Sensor reprocessing & HiL simulation: process a minute of real-time data in 50 minutes of wall clock time; process 20,000 HiL test cases in 2 months.
- Support services: provide support to 500 engineers; technical support; development support for 100 developers; support in 3 different regions.
- Financial & management services: support services; explore financial options.
- Network infrastructure & physical data centres: network infrastructure; one central data centre; two remote data centres.

Translated into service requirements:
[Figure: The service requirements layer, spanning Data Science Portals (Deep Learning, Data Analytics, HiL, SiL, simulations), middleware (Bright Cluster Manager, Data Science Manager, HPC and MiL jobs, Hadoop, Spark, containers, cloud bursting), data management (data logistics, data ingestion, reprocessing, mass storage, data exchange), processing/compute (simulation, applications, compute), software development (development platform), service management (service desk, on-site support, service KPI management), contract-related services and different consumption models (contract management, financial, reporting), network and physical data centre (security integration, operations), operating systems (RHEL, SUSE, Ubuntu, Windows), virtualisation/provisioning (hypervisor, bare metal), and hardware: servers (CPU, GPU, FPGA, ASIC), network (Ethernet, InfiniBand) and storage (Big Data NAS, HPC storage, NVMe, NVMeF).]

Translated to technical implementation configuration:
- Remote site: 1.5 Petabyte/day ingestion service; 50 copy stations; 5 ingestion servers.
- Data Lake: 50 Petabytes total storage, 25 of which are dedicated to the data platform and part of the compute platform.
- Data platform: 2,500 cores; 500 Tbyte memory.
- Compute platform: 5 HiL workbenches; HPC cluster with 50,000 cores and 25 GPUs.
- Support services: Dell support services for service desk and application support; Dell Financial Services for financing part of the infrastructure.
- Network infrastructure & physical data centres: 10 Gbit/s Ethernet backbone; 1 central data centre; 2 remote data centres.

5.2 Practical design examples

Best practices include benchmarking, profiling & optimizing, and documenting. For each part of the system, there are standard benchmarks available that one can use, or one can use one's own set of benchmarks representing a typical workload.

Typical standard benchmarks performed by Dell Technologies for AI are, for instance, MXNet inference for ResNet-50 on different configurations, including bare metal or Singularity containers, with 32-bit or 16-bit floating point data.

For storage benchmarking, one has to use the right file format. For instance, the widely used JPEG files may be appropriate for benchmarking storage for general applications, but files using the TensorFlow storage format (TFRecord) might be more appropriate for benchmarking storage for AI.

Profiling & optimizing is used to determine, for a particular hardware configuration, the best runtime configuration, data formats, software options and environments.

Documenting possible platform designs goes far beyond just documents like the one you are reading now. It also includes reading blog posts, white papers, technical reports, and research papers about best practices.
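As a concrete sketch of such a storage benchmark, the snippet below writes a TFRecord file of random payloads to the storage under test and then measures sequential read throughput with tf.data. The mount point, record size and record count are assumptions; a real benchmark would also drop page caches between runs and sweep the read parallelism.

```python
import time

import numpy as np
import tensorflow as tf

PATH = "/mnt/datalake/bench.tfrecord"   # hypothetical mount of the storage under test
RECORD_BYTES = 1 << 20                  # 1 MiB per record
NUM_RECORDS = 1024                      # ~1 GiB total

def write_records() -> None:
    """Write fixed-size random records to the storage under test."""
    payload = np.random.bytes(RECORD_BYTES)
    with tf.io.TFRecordWriter(PATH) as writer:
        for _ in range(NUM_RECORDS):
            writer.write(payload)

def read_throughput_mb_s() -> float:
    """Sequentially read the file back through tf.data and time it."""
    dataset = tf.data.TFRecordDataset(PATH)
    start, total = time.time(), 0
    for record in dataset:
        total += len(record.numpy())
    return total / (time.time() - start) / 1e6

write_records()
print(f"read throughput: {read_throughput_mb_s():.0f} MB/s")
```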

[Figure: A global view of some architectural choices, seen from the Data Science Portals, which are the main access point for users, including developers. The data centre hosts the Data Lake (with low-latency HDFS and auto-tiering to cold storage), a Data Platform (Hadoop, Spark, HBase, Data Science Workbench, data mining & analytics), a Compute Platform (HPC platform for simulation, SiL and MiL, with Deep Learning libraries such as TensorFlow, PyTorch and Caffe2, orchestrated with Kubernetes, Docker and Bright on Linux), shared services, security and networking (Pivotal PAS/PKS/PFS, VMware, Bosh), HiL test cells, a SiL farm, and a centrally managed VDI platform for remote users; the public cloud is out of scope. Data arrives via copy and upload stations, transported by courier, car or airplane.]

A common solution is to first store all the incoming data in the Data Lake and then start analysis and simulation by moving only the necessary data to the Data Platform or the Compute Platform.

[Figure: Another global architectural view. This one focuses more on usage. Notice that in this case everything, even the actual data ingestion, is done in the central data centre where possible, with remote users connecting over NFS, SMB and NFS/HDFS at low to moderate latency.]

[Figure: An architectural design example with products filled in: data logging appliances (OEM), mobile data workstations, upload and copy stations at the remote site; gateways into an InfiniBand fabric at the remote data centre; PowerEdge R740 and R640 servers for data ingest, data analytics, simulation, business analytics, and perception and sensor testing & development; C4140 servers with 4x NVLink GPUs for Deep Learning; Isilon scale-out NAS for the Data Lake; ME4 arrays, NVMe storage mesh and HPC storage on the HPC fabric; HiL test systems on T640 servers; Ethernet for management.]

An architectural design example with products filled in. The number of R740 servers and Deep Learning modules can vary from a few to tens or even hundreds of servers.

Another approach is to let most of the data flow through the Data Platform first and fill the Data Lake from there. The advantage is that data analysis can start right away. This way, feedback can be provided to the remote site faster, and more metadata could be added at an earlier stage.

5.3 General server design considerations

Goal:
Designing server racks in data centres for these applications poses several challenges. The idea is to use not only the rack space itself as efficiently as possible, but also the available power and other infrastructure (like cooling) in a rack.

How it works:
Server racks in data centres are limited by space, power and cooling. When designing hardware systems that contain compute nodes and storage nodes as well as networking, the right balance has to be found. Maintenance costs should also be taken into account.

Most data centres have an average power envelope of 6 kilowatts per rack; beyond that, costs are much higher. If one uses an HPC cluster for the simulation with four GPUs per node, one easily reaches 6 kilowatts or more per rack. One could therefore add storage, which consumes less power (but weighs more) than compute nodes, to the same rack to lower the overall power consumption.

One needs to think about the design: when talking about 400 Petabytes of data, the difference in cost between hard disk types is also considerable.

Servers produce heat, and HPC servers with GPUs produce a lot of heat. Cooling them in an efficient and inexpensive way is key to good data centre design. Which method to choose depends on many factors, including the ambient climate, server density, and building situation. Some techniques that help to improve the cooling situation are fresh air cooling and liquid cooling. Fresh-air-cooled hardware can operate at higher temperatures and higher humidity than regular hardware. This makes it possible to use air from outside the building (fresh air) to cool the equipment, which is less expensive than installing air chillers.

Another option is to install liquid cooling, in which the servers are cooled with water at room temperature. This is especially useful for high-density HPC servers. Standardised cooling options are now in place in newer data centres, so it is easy to install and connect equipment that needs water cooling.

Managing complex, scaled systems requires an easy way to see what is available and what its status is. The DMTF organisation has defined a standard, called Redfish, to describe systems and components in a data centre. Adhering to this standard means you can use your favourite management software and seamlessly integrate information from the monitoring and management software that the hardware vendors provide.
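As a sketch of how the Redfish standard is used in practice, the snippet below walks the standard /redfish/v1 service root and lists the model, power state and health of each system. The BMC address and credentials are placeholders; the resource paths themselves are defined by the Redfish specification.

```python
import requests

BMC = "https://bmc.example.com"     # placeholder BMC/iDRAC address
AUTH = ("monitor", "secret")        # placeholder credentials

def get(path: str) -> dict:
    """GET a Redfish resource; resource paths come from the service itself."""
    return requests.get(BMC + path, auth=AUTH, verify=False).json()

# /redfish/v1/ is the standard Redfish service root.
root = get("/redfish/v1/")
systems = get(root["Systems"]["@odata.id"])

for member in systems["Members"]:
    system = get(member["@odata.id"])
    print(system.get("Model"),
          system.get("PowerState"),
          system.get("Status", {}).get("Health"))
```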
Technical options — System design
1. Fresh Air Hardware Cooling Solution
2. Dell PowerEdge and CoolIT Systems Liquid Cooling
3. Redfish implementation
4. Isilon for Deep Learning

References
Dell EMC Data Center Power and Cooling Solutions [visited August 2019] https://www.dellemc.com/en-gb/servers/power-and-cooling.htm
Redfish Developer Hub [visited August 2019] https://redfish.dmtf.org/
Redfish on the Dell website [visited August 2019] https://www.dell.com/support/article/uk/en/ukbsdt1/sln310624/redfish
Deep Learning with Dell EMC Isilon — Technical Whitepaper [March 2019] https://www.dellemc.com/en-us/collaterals/unauth/white-papers/products/storage/h17361_wp_deep_learning_and_dell_emc_isilon.pdf

5.4 Storage Design — Data Lake

As explained, the preferred storage design is to have one logical and physical storage system with a single file protocol to the filesystem on the storage. For practical reasons (in some cases the access speed of a single system is not sufficient) or for cost reasons, that may not always be possible, but deviations should be treated as design exceptions as much as possible.

5.4.1 File storage and archive

Goal:
Explain some technical considerations for storage and archiving.

How it works:
The bulk of the storage capacity is used to store actual sensor data: in general, over 90% of the storage will be used for that. Data is stored locally at primary data centres in a region. Sometimes the data centres provide compute and analysis facilities too.

Standard scale-out NAS architectures can be used to implement this storage. NAS architectures provide efficient storage for large data sets and can easily scale. A NAS system is multi-protocol and contains many features. For some parts of the system, a more specialised file system may be used.

The data catalogue makes the actual way in which files are stored less important for the users. It provides information about each file and a global namespace: one can uniquely identify and access data without knowing its precise location. When a user needs data in another format than the one in which it is stored, the "Data Mover" converts the data from the central storage to the cache, in the right format, for use by the user's application.
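A minimal sketch of the global-namespace idea, assuming a hypothetical catalogue entry format and a stubbed Data Mover call; it only illustrates the principle of resolving a logical ID to a physical location:

```python
from dataclasses import dataclass

@dataclass
class CatalogueEntry:
    logical_id: str    # global, location-independent name
    physical_uri: str  # e.g. an NFS path or HDFS URI
    fmt: str           # storage format, e.g. "rosbag" or "tfrecord"
    tags: dict         # metadata collected at ingest time

# Hypothetical catalogue content.
catalogue = {
    "drive/2019-07-02/car17/run3": CatalogueEntry(
        "drive/2019-07-02/car17/run3",
        "hdfs://datalake/raw/2019/07/02/car17_run3.bag",
        "rosbag",
        {"weather": "rain", "sensors": ["camera", "lidar"]},
    ),
}

def data_mover_convert(uri: str, wanted_fmt: str) -> str:
    # Stub: a real system would trigger a conversion job and return
    # the URI of the converted copy in the cache tier.
    return uri + "." + wanted_fmt

def resolve(logical_id: str, wanted_fmt: str) -> str:
    """Return a physical URI, converting via the Data Mover if needed."""
    entry = catalogue[logical_id]
    if entry.fmt == wanted_fmt:
        return entry.physical_uri
    return data_mover_convert(entry.physical_uri, wanted_fmt)

print(resolve("drive/2019-07-02/car17/run3", "tfrecord"))
```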

At several places in the system, such as the test benches and the HPC system, data is locally cached for better performance. The Data Mover synchronises all caches with the central storage. Archiving can also use proven NAS technologies: an enterprise-grade solution spanning multiple sites.

Technical description
For storage and archiving, Dell Technologies supports the following possibilities:
1. File storage based on proven scale-out NAS architectures
2. Data Mover

References
Solving the storage conundrum in ADAS development and validation [May 2018] https://www.emc.com/collateral/white-papers/dell-emc-adas-solution-powered-by-isilon.pdf
Top Five Reasons To Choose Dell EMC Isilon For AD/ADAS [June 2018] https://www.emc.com/collateral/solution-overview/h17239-top-five-reasons-for-dell-emc-isilon-in-adas.pdf

5.4.2 File system and file protocol choices

Goal:
Technical considerations for file systems.

How it works:
There are many different types of file systems and various protocols to access them, and each type of application can have a preferred file system type. For instance, during the ingest step one often wants HDFS, because one wants to correlate data using Hadoop or Map/Reduce. On the other hand, developers and data scientists often prefer NFS, CIFS or SMB as the file protocol for accessing data in a unified way. A system handling all or most of these requirements is a must; it is presented as "the" Data Lake, used for multiple workloads and user access, with sharing possibilities in a secure environment.

There are use cases where high performance or extremely low latency is needed to fully utilise the compute resources, such as the repeated use of data for enrichment, or where peak throughput for simulation leads to the use of a parallel file system like BeeGFS or Lustre. Most of these file systems operate best in an isolated cluster that is only accessible through proxy services; the backends of these clustered file systems are most likely InfiniBand or RDMA-enabled networks. The Data Mover can automate the use of these parallel file systems, copying or moving the data from or to the central data lake or archive for reprocessing purposes. High-Performance Computing systems may require these parallel file systems; see the HPC section.

Technical description
For file systems, Dell Technologies supports the following possibilities:
1. Isilon, for the Data Lake
2. Lustre, for HPC-specific storage

References
Solving the storage conundrum in ADAS development and validation [May 2018] https://www.emc.com/collateral/white-papers/dell-emc-adas-solution-powered-by-isilon.pdf
High Performance Computing — Storage [visited August 2019] https://www.dell.com/support/article/us/en/04/sln311501/high-performance-computing?lang=en#Storage
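A minimal sketch of the staging pattern that the Data Mover automates, as described above: copy a recording run from the central Data Lake (NAS) to a parallel scratch file system before an HPC job and clean up afterwards. The mount points are assumptions.

```python
import shutil
from pathlib import Path

# Hypothetical mount points: the central Data Lake (scale-out NAS) and a
# Lustre scratch file system local to the HPC cluster.
DATALAKE = Path("/datalake/raw")
SCRATCH = Path("/lustre/scratch/reprocess")

def stage_in(run_id: str) -> Path:
    """Copy one recording run from the Data Lake to HPC scratch before a job."""
    src, dst = DATALAKE / run_id, SCRATCH / run_id
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.iterdir():
        target = dst / f.name
        if not target.exists():            # skip files already staged
            shutil.copy2(f, target)
    return dst

def stage_out(run_id: str) -> None:
    """Remove the scratch copy once results are written back to the Data Lake."""
    shutil.rmtree(SCRATCH / run_id, ignore_errors=True)
```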

5.5 Compute Server design

Goal:
Technical considerations for servers. There are many parts of the system where compute servers are needed, each with different requirements. What is the best solution: one big cluster that serves all applications, or many different ones? There is no single, generally good answer; it depends on the specific use. Services from the Dell HPC and AI Innovation Lab bring the best possible mapping of applications and specific requirements to a hardware selection. These days, when most server components are commodity parts, the configuration possibilities are enormous; Dell can help configure and test, with a proof of concept, the best solution for operating your software, not limited to application tuning and infrastructure setup. Translating workloads into specific configurations can be accelerated with Dell's experience in this area. The use of hardware offloading and acceleration, with NVIDIA GPUs supporting the native TensorFlow open source programming platform, FPGAs for specific workloads, IPUs, and even CPUs from Intel and AMD for machine learning inference, is key in AD development and operations. Dell's investigations of these topics and its knowledge can help drive maximum efficiency.

How it works:
The main processes that need compute servers are the ingest processes, the data analytics, the simulation and Deep Learning processes, and the support for the workbenches and the test benches. All need CPU power and memory; however, the size of the memory needed per node may differ. Data analytics, Deep Learning and simulation all benefit from GPUs, whether that is one per node or two. In general, the other processes also benefit from GPUs, as there is a lot of image processing going on.

Technical options — Server design
For deep learning:
1. Deep Learning with NVIDIA
2. Deep Learning with Intel

References
Deep Learning with NVIDIA [visited July 2019] https://dellsolutionsvr.dell.com/deep-learning-with-nvidia/
Deep Learning with Intel [visited July 2019] https://dellsolutionsvr.dell.com/deep-learning-with-intel/

5.6 Network design

Goal:
Design the cluster in such a way that the network is not a performance bottleneck. A spine/leaf setup is preferred for maximum flexibility and ease of scalability.

[Figure: Leaf-Spine architecture. Each Spine switch is connected to all the Leaf switches. Equipment is connected to the Leaf switches, not to the Spine switches.]

The amounts of data processed can be large: when a complete test run with dozens of cars must be processed, one can easily reach an average of 12 Petabytes of data per day. Sometimes processing is needed in a short time frame, for example when a single run's data has to be processed within 4 hours so that the results can be analysed and used by the next test run. This can lead to high network speed requirements, running to hundreds of Gbyte/s. In other parts of the system, the networking requirements are generally lower.

How it works:
Core to the network is single-stream performance: how many nodes can simultaneously read a single file? How much data can one read with a single node from a single storage controller? These are some of the questions to answer. In these systems, it is generally the aggregated performance that is important. Data enters through the ingest system, typically during longer periods with some peaks. One could, for instance, need speeds in the order of 300 GB/s for a four-day period, combined with 5-minute batch jobs from simulation and analysis work.
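The figures above translate directly into sustained bandwidth requirements. A quick back-of-the-envelope check (the 50 TB run size is a made-up example, not from the text):

```python
# Back-of-the-envelope network sizing from the figures above.
PB, TB = 1e15, 1e12

# Sustained ingest: 12 PB per day, averaged over 24 hours.
ingest_gb_s = 12 * PB / (24 * 3600) / 1e9
print(f"average ingest rate: {ingest_gb_s:.0f} GB/s")          # ~139 GB/s

# Turning one run around in 4 hours (50 TB is a hypothetical run size).
run_gb_s = 50 * TB / (4 * 3600) / 1e9
print(f"required read rate for one run: {run_gb_s:.1f} GB/s")  # ~3.5 GB/s
```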

Sometimes, this can be solved by simply adding sufficient bandwidth. In other cases, building a software-defined cluster can be the solution.
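In a leaf-spine fabric, "adding sufficient bandwidth" usually means tuning the leaf oversubscription ratio: server-facing bandwidth versus uplink bandwidth to the spine. A small sketch with hypothetical port counts:

```python
def oversubscription(server_ports: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Leaf oversubscription: server-facing vs. spine-facing bandwidth."""
    return (server_ports * server_gbps) / (uplinks * uplink_gbps)

# Hypothetical leaf switch: 48 x 25 GbE server ports, 6 x 100 GbE uplinks.
print(oversubscription(48, 25, 6, 100))   # 2.0 -> a 2:1 oversubscribed fabric
```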

Also, a software-defined network provides possibilities at the switch level that were not available in the past. Booting a Network Operating System (NOS) of choice offers flexibility in provisioning and manageability, and therefore cost savings in operations. For instance, running Linux on a switch merges complete clusters of servers and the network into a single pane of glass; combined with the use of standard tools like Chef, Puppet and Ansible throughout the complete infrastructure, this offers the lowest TCO with maximum flexibility.

The fast development of networking technology also has to be taken into account. Today, one has PCIe Gen3 with speeds in the order of 10-40 GB/s; six years from now, there may be products capable of handling 400 GB/s.

Technical options — Network design
1. Dell supports many network operating systems, for example Dell OS9 and OS10 (based on open source), Cumulus Linux, Big Switch Big Monitoring Fabric (BMF), Big Switch Big Cloud Fabric (BCF), IP Infusion OcNOS, Pluribus NetVisor and Pica8, and is not limited to these; others, such as Microsoft SONiC, are supported as well. Dell includes support and licenses with the Dell PowerSwitch.

References
Leaf-Spine Deployment and Best Practices Guide [July 2017] https://www.dell.com/support/article/us/en/04/sln313954/leaf-spine-deployment-and-best-practices-guide?lang=en

6 Vocabulary & Terminology

6.1 Vocabulary

Remote Site
How we use it: a physical place where test cars are driven and generate data.
Alternatives (with partially non-overlapping meaning): Edge; Campus.
Remarks: in some cases, the remote site may also have a small data centre, or may be split amongst several physical sites.
Competing uses that may confuse: a remote data centre is also sometimes called a remote site.

Data Centre
How we use it: a computing centre or a collection of computing centres where the data from the remote site is processed and where the actual application development and scenario testing take place.
Alternatives: On-premise; Computing centre.
Remarks: we use this more as a logical entity; what is important is that it is seen as one IT infrastructure.

Simulation
How we use it: a virtual computer model.
Alternatives: Synthetic simulation; Deep Learning.
Remarks: simulations may be mixed with real data in augmented reality, and simulations may interact with each other.

Development
How we use it: software and hardware development of ADAS/AD components and processes.
Alternatives: Software development; Scenario development; Application development.
Remarks: the confusion here mainly comes from the fact that nearly every activity can be called "development". Sometimes software development is narrowed to end with coding as an activity; sometimes it includes the validation phase.

Data Lake
How we use it: we use the term "Data Lake" in this document as a name for the central data storage part of the overall system. It is the hardware and software (system software and file systems) infrastructure that holds the data.
Remarks: the term is getting a bit inflated, as everything with a few disks is called a "Data Lake" today.

Component
How we use it: an independent part of the IT infrastructure.
Competing uses that may confuse: in ISO 26262, a component is a part of a system in a car.

HPC
How we use it: high-performance computing.
Alternatives: HTC; supercomputing; parallel computing.
Remarks: the confusion starts with the definition of what is "high". It can range from anything that does not fit a personal computer to a million-core supercomputer. In this document we talk about HPC if we have a cluster of many compute nodes, probably with a parallel filesystem and an InfiniBand network.

6.2 Terminology

AD: Autonomous Driving.
ADAS: Advanced Driver Assistance System.
AI: Artificial Intelligence.
API: Application Programming Interface.
AVX: Advanced Vector Extensions. Reference: https://www.intel.co.uk/content/www/uk/en/architecture-and-technology/avx-512-overview.html
BeeGFS: Parallel cluster file system. Reference: https://www.beegfs.io
Caffe 2: Deep Learning tool; now part of PyTorch.
CAN: Controller Area Network. Reference: https://can-cia.org
Chef: Support for software-driven organisations. Reference: http://chef.io
Container: A software package that contains everything needed to run an application, including libraries and other dependencies. Reference: https://pivotal.io/containers
Data Lake: The central data storage part of a system; the hardware and software (system software and file systems) infrastructure that holds the data. Reference: https://www.dellemc.com/en-gb/big-data/data-lake/index.htm
Data logger: Measurement instrument to record sensor data in relation to time and place.
DC: Data Centre. Reference: https://www.dellemc.com/en-gb/glossary/modern-data-center.htm
Deep Learning: Part of Machine Learning, based on artificial neural networks.
Deployment: Planning and installing hardware or software.
DMS: Data Management System.
DMTF: Formerly known as the Distributed Management Task Force. Reference: http://dmtf.org
ECU: Electronic Control Unit (in-car).
Elastic Search: Advanced search facility; open source and enterprise versions. Reference: https://www.elastic.co/
Enterprise Software: Software used by large organisations that expect quality and fast, in-depth support.
Event: Something happening at a specific moment in time. Autonomous cars have to be prepared for events.
Flink (Apache Flink): Stateful computations over data streams. Reference: https://flink.apache.org/
FPGA: Field Programmable Gate Array; used as a hardware accelerator for specific applications.
GPFS: Global Parallel File System; now called Spectrum Scale.
GPU: Graphical Processing Unit.
Grid Engine (Univa): Resource manager, used in HPC; there is also an open source version. Reference: http://www.univa.com/products/grid-engine.php
Ground truth: Measured data that is used to validate a calculated or simulated model.
Hadoop: Apache Hadoop is an open-source framework that allows for parallel processing of large data sets and collective mining of disparate data sources. Reference: https://www.dellemc.com/en-gb/glossary/hadoop.htm
HiL: Hardware-in-Loop; hardware analysed by a software simulation.
Horton (Hortonworks): Data management platform. Reference: https://hortonworks.com/products/
HPC: High-Performance Computing, usually using many nodes consisting of many processors in a cluster.
HTC: High-Throughput Computing; a special type of HPC for pleasantly parallel applications.
In-car: Used for equipment and sensors in a (test) car.
Inference: Artificial Intelligence by logical reasoning.
InfiniBand: Fast network interconnect, used in HPC clusters. Reference: https://www.infinibandta.org/
Ingest (system, process): Getting data out of the car into the development and analysis infrastructure.
IPU: Intelligent Processing Unit; an accelerator for AI. Reference: https://www.graphcore.ai/
Isilon: Enterprise-grade, scale-out network-attached storage. Reference: https://www.dellemc.com/nl-nl/storage/isilon/index.htm
ISO 26262: ISO standard 26262, "Road vehicles — Functional safety". Reference: https://www.iso.org/standard/68383.html
Kubernetes: Container orchestration. Reference: https://kubernetes.io/
Levels of Autonomous Driving: Defines the levels of autonomy of a car and the level of involvement of the driver. Reference: Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles, https://www.sae.org/standards/content/j3016_201806/
LiDAR: Light Detection and Ranging; a sensor technique similar to radar but with light instead of radio signals.
Massively parallel: A computer task that uses many (thousands or more) processors at the same time.
Meta-data: Information about data, for example about the time and date of collection, the owner, and so on.
Metro car: A car that is used on a test track for autonomous driving.
MiL: Model-in-the-Loop (a special form of SiL).
ML: Machine Learning; a specific type of AI.
NAS: Network Attached Storage. Reference: https://www.dellemc.com/en-gb/glossary/network-attached-storage.htm
NOS: Network Operating System.
NVMe: Non-Volatile Memory Express. Reference: https://www.dellemc.com/en-gb/glossary/nvme.htm
Open Source software: Software for which the source is available for inspection so one can check its workings. Open source software is not the same as free software; open source software may be licensed for a fee.
OpenHPC: A collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage HPC Linux clusters. Reference: https://openhpc.community/
Parallel: Executing two or more tasks simultaneously.
pcap: Format for logging sensor data.
PCIe Fabric: PCIe, or Peripheral Component Interconnect Express, is a serial bus standard for connecting a computer to one or several peripheral devices. Reference: https://www.dellemc.com/en-gb/glossary/pcie-flash-storage.htm
PCIe Gen3: Latest version of PCIe Fabric.
Pivotal Cloud Services (PCS), Pivotal Kubernetes Services (PKS): Enterprise container and cloud services. Reference: https://pivotal.io/platform/pivotal-container-service
Pleasantly parallel: Parallel tasks that can be executed independently of each other.
Pravega: Storage abstraction that supports streaming data. Reference: http://pravega.io/
Proving Ground: Test facility for autonomous cars.
Provisioning: Providing infrastructure.
Puppet: Configuration management tool. Reference: https://chef.io/puppet/
PyTorch: End-to-end Deep Learning platform. Reference: https://pytorch.org/
Radar: Sensor to measure distances; also used for autonomous cars.
RDMA: Remote Direct Memory Access.
Redfish: A standard to describe systems and components in a data centre. Reference: http://redfish.dmtf.org
Reprocessing: See Simulation.
Re-simulation: Simulation of an autonomous driving scenario.
RESnet: AI implementation; also used as a general AI benchmark.
Road car: A test car that is driven on a public road.
RoCE: RDMA over Converged Ethernet.
Runtime (running): During the execution of a program.
Scenario: A possible chronological series of events (that could happen when a car is driving).
ServiceNow: Digital workflow support. Reference: https://www.servicenow.com/now-platform.html
SiL: Software-in-Loop; simulation in which the hardware is also analysed in software.
SIM-E: See HiL.
SIM-S: See SiL.
SLURM: Resource manager. Reference: https://slurm.schedmd.com/overview.html
Spark (Apache Spark): Unified analytics engine for large-scale data processing. Reference: https://spark.apache.org/
Spectrum Scale (IBM Spectrum Scale): Global parallel file system. Reference: https://www.ibm.com/uk-en/marketplace/scale-out-file-and-object-storage
SSD: Solid State Drive.
Tagging: Adding metadata (tags) to data.
TCO: Total Cost of Ownership.
TensorFlow: End-to-end open source platform for machine learning. Reference: https://www.tensorflow.org/
Validation: Ensuring hardware, software and models meet the requirements.
VDI: Virtual Desktop Infrastructure.
V-model: A classical software design methodology; the design steps resemble a V shape.
VNNI: Vector Neural Network Instructions; extensions to Intel processors to speed up AI programs.
VueForge (Altran VueForge): Support and services for IoT and Big Data. Reference: https://www.altran.com/us/en/integrated_solution/vueforge/
Whamcloud: Lustre services and support. Reference: https://whamcloud.com/
Workload: Work to be executed on a computer system.
xCAT: An open-source tool for automating deployment, scaling, and management of bare metal servers and virtual machines. Reference: https://xcat.org/
A. Technical support and resources

Dell.com/support is focused on meeting customer needs with proven services and support.

Storage technical documents and videos, available at www.dell.com/storageresources, provide expertise to ensure customer success on Dell EMC storage platforms.

The information in this publication is provided "as is." Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, and Dell EMC are trademarks of Dell Inc. and its subsidiaries in the United States and other countries. Other trademarks may be trademarks of their respective owners. Dell Corporation Limited. Registered in England. Reg. No. 02081369. Dell House, The Boulevard, Cain Road, Bracknell, Berkshire, RG12 1LF, UK.

NVIDIA and the NVIDIA logo are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries.