Data Storage Architectures for Machine Learning and Artificial Intelligence: On-Premises and Public Cloud
REPORT
ENRICO SIGNORETTI
TOPICS: ARTIFICIAL INTELLIGENCE, DATA STORAGE, MACHINE LEARNING

TABLE OF CONTENTS
1 Summary
2 Market Framework
3 Maturity of Categories
4 Considerations for Selecting ML/AI Storage
5 Vendors to Watch
6 Near-Term Outlook
7 Key Takeaways
8 About Enrico Signoretti
9 About GigaOm
10 Copyright

1. Summary

There is growing interest in machine learning (ML) and artificial intelligence (AI) among enterprise organizations. The market is quickly moving from infrastructures designed for research and development to turn-key solutions that respond quickly to new business requests. ML/AI are strategic technologies across all industries, improving business processes while enhancing the competitiveness of the entire organization.

ML/AI software tools are improving and becoming more user-friendly, making it easier to build new applications or reuse existing models for more use cases. As the ML/AI market matures, high-performance computing (HPC) vendors are now joined by traditional storage manufacturers that usually focus on enterprise workloads. Even though the requirements are similar to those of big data analytics workloads, the specific nature of ML/AI algorithms and GPU-based computing demands more attention to throughput and $/GB, primarily because of the sheer amount of data involved in most projects.

Depending on several factors, including the organization's strategy, size, security needs, compliance, cost control, and flexibility, the infrastructure could be entirely on-premises, in the public cloud, or a combination of both (hybrid), as shown in figure 1.
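The throughput side of this demand is easy to quantify with back-of-the-envelope arithmetic. The sketch below (the cluster sizes and the ~6 GB/s per-GPU ingest rate are illustrative assumptions, not measurements of any specific product) estimates the aggregate read bandwidth a storage system must sustain to keep every GPU busy:

```python
# Back-of-the-envelope estimate of the aggregate read throughput a
# storage system must sustain so that no GPU in a training cluster
# sits idle waiting for data. All figures are illustrative assumptions.

def required_throughput_gbs(nodes: int, gpus_per_node: int,
                            per_gpu_gbs: float = 6.0) -> float:
    """Aggregate storage throughput (GB/s) needed to keep every GPU fed."""
    return nodes * gpus_per_node * per_gpu_gbs

# A hypothetical 4-node cluster with 8 GPUs per node, each able to
# ingest roughly 6 GB/s:
demand = required_throughput_gbs(nodes=4, gpus_per_node=8)
print(f"Required aggregate throughput: {demand:.0f} GB/s")  # 192 GB/s
```

Even a small cluster quickly reaches an aggregate demand that a single storage node cannot serve, which is why parallel, scale-out designs dominate this space.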
The most flexible solutions are designed to run in all of these scenarios, giving organizations ample freedom of choice. In general, long-term, large-capacity projects run by skilled teams are more likely to be developed on-premises. The public cloud is usually chosen by smaller teams for its flexibility, and for less demanding projects.

ML/AI workloads require infrastructure efficiency to yield rapid results. With the exception of the initial data collection, many parts of the workflow are repeated over time, so managing latency and throughput is crucial for the entire process. The system must handle metadata quickly while maximizing throughput to ensure that system GPUs are always fed at their utmost capacity. A single modern GPU is a very expensive component, able to crunch data at 6 GB/s and more, and each compute node can have multiple GPUs installed. Additionally, CPU-storage proximity is important, which is why NVMe-based flash devices are usually selected for their parallelization and performance characteristics.

What is more, the data sets require a huge amount of storage capacity to train the neural network. For this reason, scale-out object stores are usually preferred because of their scalability characteristics, rich metadata, and competitive cost.

Figure 1: Possible combination of storage and computing resources in ML/AI projects

In this report, we discuss the most recent storage architecture designs and innovative solutions deployed on-premises, in the cloud, and in hybrid configurations, aimed at supporting ML/AI workloads for enterprise organizations of all sizes.

Key findings:

• Enterprise organizations are aware of the strategic value of ML/AI for their business and are increasing investments in this area.
• End users are looking for turn-key solutions that are easy to implement and that deliver a quick ROI (return on investment).
• Many of the solutions available are based on a two-tier architecture, with a flash-based, parallel, scale-out file system for active data processing and an object store for capacity and long-term data retention. There are also some innovative solutions that take a different approach, integrating the two tiers in a single system.

2. Market Framework

There are several processes involved in an ML workflow, and most of them need to be repeated several times. They all demonstrate different workload characteristics and need to encompass large data sets to return effective results. Very briefly, an ML workflow (figure 2) can be summarized as follows:

• Data collection. Data is collected from one or multiple sources into a single repository. Capacity, scalability, and throughput are the most important metrics, particularly with active data sources and non-historical data.
• Data preparation. The data set is analyzed and stripped of anomalies, out-of-range data, inconsistencies, noise, errors, and any other exceptions that could compromise the outcome. Data is tagged and indexed, making it searchable and reusable. This section of the workflow is characterized by a large number of read and write operations with plenty of metadata updates. It is also interesting to note that some storage systems are now able to complete this operation during the ingestion phase thanks to the implementation of serverless computing mechanisms, accelerating the process while simplifying the infrastructure.
• Analysis and model selection. This part of the process is all about finding the right algorithms and data models for the training phase. The data scientist analyzes the data to find the training model that suits it best and repeats this part as many times as necessary on a small subset of the data.
The storage system has to be fast to get results quickly, allowing the data scientist to compare several models before proceeding. With recent advancements in this field, new solutions are surfacing and AutoML products are becoming more common, helping end users with this task (for reference, check Andrew Brust's GigaOm Market Landscape Report, AI Within Reach: AutoML Platforms for the Enterprise).
• Neural network training. This is the most compute- and storage-intensive part of the entire workflow. It is where the training data set is passed to the selected algorithms to train the model.
• Production and evaluation. This last part is where the data scientist actually sees the results of the ML model. Storage is not accessed anymore, but the data has to be preserved in case it is necessary to reassess and improve the model.

Figure 2: Typical ML/AI workflow model

A storage infrastructure designed for ML/AI workloads has to provide performance, capacity, and cost-effectiveness. With this in mind, we can divide the market into two categories:

1. Two-Tier Infrastructure. The first tier is a fast scale-out file system for all active data and front-end operations. It is backed by a second-tier object store for capacity. This flexible solution allows end users to build an infrastructure that can be installed on-premises, in the cloud, or in a hybrid fashion (any combination of the two). This flexibility translates into cost savings but sacrifices some efficiency because of the back-end data movements necessary to make the right data available where and when it is needed.

2. Single-System Architecture. A single system is much more efficient and provides top-notch performance by hiding data movements internally or making them unnecessary. This simplifies infrastructure and operations, contributing to a reduction in TCO (total cost of ownership).
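The efficiency trade-off in the two-tier category comes down to moving data between the tiers. A minimal sketch, assuming a toy recency-based policy (the class, tier names, and threshold are hypothetical illustrations, not any vendor's API), of the kind of promotion/demotion logic such a system has to automate:

```python
import time

# Toy model of two-tier placement: recently used datasets live on the
# flash file-system tier; cold ones are demoted to the object store.
# Class names, tier labels, and the threshold are hypothetical.

FLASH_TIER = "flash"
OBJECT_TIER = "object"

class Dataset:
    def __init__(self, name: str, last_access: float):
        self.name = name
        self.last_access = last_access
        self.tier = FLASH_TIER

def rebalance(datasets, now: float, cold_after_s: float = 7 * 86400):
    """Demote datasets untouched for `cold_after_s` seconds; keep the rest hot."""
    for ds in datasets:
        ds.tier = OBJECT_TIER if now - ds.last_access > cold_after_s else FLASH_TIER
    return datasets

now = time.time()
sets = [Dataset("training-v3", now - 3600),            # touched an hour ago
        Dataset("raw-ingest-2019", now - 30 * 86400)]  # idle for a month
rebalance(sets, now)
print({ds.name: ds.tier for ds in sets})
# {'training-v3': 'flash', 'raw-ingest-2019': 'object'}
```

A single-system architecture, by contrast, performs this kind of rebalancing internally or avoids it altogether, which is where its efficiency advantage comes from.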
On the flip side, single systems are more difficult and expensive to implement in the cloud, especially at scale, limiting them to on-premises or small cloud installations.

Major cloud providers offer managed services based on similar architecture designs, allowing the end user to simplify resource provisioning and management. Unfortunately, this also creates vendor lock-in for large-scale projects and becomes expensive.

3. Maturity of Categories

Enterprises adopt an architecture depending on the following factors:

• Number of current and future projects
• Size of the projects
• Size of the organization
• Performance needs
• Cost

If the organization's strategy is to make a major investment in ML and AI, then it is highly likely that it will adopt an on-premises solution, alongside the other resources needed to run it properly. This is necessary to maintain control over the entire process while containing costs.

Other evaluation criteria include security and maintaining the validity and the source of the original data. In fact, if the data sets include sensitive information and already belong to the company, moving or replicating them to external repositories adds complexity to data management, monitoring, and auditing processes. Data governance and stewardship are particularly important in highly regulated industries and for organizations that handle personally identifiable information (PII). For example, making copies of PII without the right tools to mask or remove sensitive information could compromise General Data Protection Regulation (GDPR) compliance.

One of the advantages of deploying a cloud-only or hybrid infrastructure is leveraging the tools made available by cloud providers. In fact, most cloud providers have optimized VM instances with GPUs or other types of coprocessors designed for AI workloads, but they also developed specific