Unified Inference and Training at the Edge

By Linley Gwennap, Principal Analyst, The Linley Group

October 2020

www.linleygroup.com


As more edge devices add AI capabilities, some applications are becoming increasingly complex. Wearables and other IoT devices often have multiple sensors, requiring a different neural network for each sensor, or they may use a single complex network to combine all the inputs, a technique called sensor fusion. Others implement on-device training to customize the application. The GPX-10 can handle these advanced AI applications while keeping power to a minimum. Ambient Scientific sponsored this white paper, but the opinions and analysis are those of the author.

Introduction

As more IoT products implement AI inferencing on the device, some applications are becoming increasingly complex. Instead of just a single sensor, they may have several, such as a wearable device that has accelerometer, pulse-rate, and temperature sensors. Each sensor may require a different neural network, or a single complex network can combine all the input data, a technique called sensor fusion. A microcontroller CPU (e.g., Cortex-M) or DSP can handle a single sensor, but these simple cores are inefficient for more complex IoT applications.

Training introduces additional complications for edge devices. Today’s neural networks are trained in the cloud using generic data from many users. This approach takes advantage of the massive compute power in cloud data centers, but it creates a one-size-fits-all AI model. The trained model can be distributed to many devices, but even if each device performs the inferencing, the model performs the same way for all users.

To provide truly personal services, an edge device must adapt the neural network to the user. For example, a voice-recognition model could be trained to recognize a specific voice. This approach doesn’t require training the model from scratch, a computationally intensive task; instead, an existing model can be retrained on the device to more accurately respond to a particular user or use case. This type of on-device training shouldn’t be confused with federated learning, a different concept in which each user’s data is compressed and sent to the cloud for retraining the master model.

The benefits of on-device retraining go beyond personalization. The retraining occurs without sending the user’s data to the cloud, improving privacy. On-device training reduces the burden on cloud data centers, which would otherwise need many new servers to keep up with billions of edge devices. Severing the cloud connection also reduces cellular or other network costs that might accrue to either the end user or the cloud-service provider (CSP). Finally, on-device retraining can constantly adapt and improve without waiting for the CSP to update the model.

On-device training has many applications. In a consumer setting, a smart speaker or other voice-operated device could recognize commands only from a designated user, preventing visitors from ordering products or accessing personal information, thus improving security. A wearable device could learn the user’s normal health parameters (e.g., pulse rate, blood oxygen) and quickly recognize any unusual results that could indicate a medical problem. A voice-operated kiosk in an airport or shopping mall could be trained to recognize and filter out the background noise in its specific location to better isolate the user’s voice. In a factory, an audio device (sometimes in conjunction with a vibration sensor) could monitor the ambient conditions to detect anomalies that might indicate that certain machinery is failing. These consumer and industrial use cases all require some device-specific training or retraining.

Multisensor Inferencing

Wearable devices pose the biggest challenge for neural-network inferencing. These devices continue to add sensors, particularly for health monitoring. The new Apple Watch 6 includes blood-oxygen, ECG (electrocardiogram), and pulse-rate sensors as well as an altimeter, six-axis accelerometer, and microphone. Even low-cost wearables typically have several sensors.

Wearables can also connect to external sensors using Bluetooth and other wireless protocols. For example, a cyclist can connect their smartwatch to sensors on their bike that detect pedal rate, power, and speed. The data from these “dumb” sensors must be analyzed either in the cloud or in the wearable, but only the latter provides real-time information during a ride. This implementation can be combined with retraining to deliver personalized coaching to the rider.

Some wearable sensors are always on, while others are enabled only during certain activities. When enabled, these sensors may have different sampling rates and different resolutions (e.g., 4 bits, 8 bits, or 12 bits per sample). Inferencing even a single sensor on a Cortex-M CPU is far less efficient than using a purpose-built AI accelerator. As the workload becomes more complex, this inefficiency starts to sap battery life. Waking the CPU to process each incoming data sample further strains the battery, which in a wearable is already small. Always-on inferencing on a separate AI accelerator allows the CPU to remain in sleep mode as much as possible, conserving battery life.
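To make the resolution differences concrete, the short Python sketch below normalizes raw samples of different bit depths to a common scale before they feed a network; the sensor values are hypothetical, not taken from any particular device.

```python
import numpy as np

def normalize_samples(raw: np.ndarray, bits: int) -> np.ndarray:
    """Scale unsigned raw ADC samples of the given bit depth to [0, 1]."""
    return raw.astype(np.float32) / (2**bits - 1)

# Hypothetical readings from sensors with different resolutions.
accel_4bit = np.array([3, 7, 12, 15])       # 4-bit samples, range 0-15
pulse_8bit = np.array([72, 88, 130, 255])   # 8-bit samples, range 0-255
temp_12bit = np.array([1024, 2048, 4095])   # 12-bit samples, range 0-4095

for name, raw, bits in [("accel", accel_4bit, 4),
                        ("pulse", pulse_8bit, 8),
                        ("temp", temp_12bit, 12)]:
    print(name, normalize_samples(raw, bits))
```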

Sensor fusion provides more advanced capabilities to the end user. This approach feeds the data from multiple sensors into a single neural network. In the cycling example, a standard device can analyze data from each of the sensors individually and display the speed and power, for example. But a sensor-fusion model can combine all these factors with pulse rate and blood oxygen to create a recommended pedal rate that optimizes the rider’s workout. This advanced capability requires a complex neural network that can inference in real time without draining the wearable’s battery during a long ride. Only a carefully optimized hardware design can meet these objectives.
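The Python sketch below illustrates the core idea of sensor fusion: a single toy network sees the concatenated inputs from all sensors at once and produces one recommendation. The feature values are invented and the weights are random, for illustration only; a real fusion model would be trained on labeled ride data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sensor inputs, already preprocessed and normalized.
pedal_rate   = np.array([0.62])   # bike cadence sensor
power        = np.array([0.48])   # power meter
speed        = np.array([0.55])
pulse_rate   = np.array([0.71])
blood_oxygen = np.array([0.97])

# Sensor fusion: one feature vector containing every modality.
x = np.concatenate([pedal_rate, power, speed, pulse_rate, blood_oxygen])

# Toy two-layer network with random (untrained) weights.
W1, b1 = rng.normal(size=(8, x.size)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

hidden = np.maximum(0.0, W1 @ x + b1)       # ReLU hidden layer
recommended_pedal_rate = W2 @ hidden + b2   # fused output (arbitrary units)
print(recommended_pedal_rate)
```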

Retraining Challenges

To achieve high accuracy across many possible input cases, neural networks are trained using a large set (corpus) of labeled data, often containing millions of items. Using the traditional stochastic gradient descent (SGD) approach, the model must evaluate each item in the training set, then its weights are adjusted slightly. This process is repeated dozens of times until the model converges on a solution. Simply storing an entire corpus is well beyond the capability of most edge devices. Furthermore, the required computations can consume a high-end server for days or weeks and would take far longer on a low-cost edge processor.
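For readers unfamiliar with the procedure, the minimal Python sketch below shows the SGD loop just described on a toy regression task; the data, model, and hyperparameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy labeled corpus: 1,000 examples of a 4-feature regression task.
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.5, -1.2, 0.3, 2.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(4)                      # model weights, initialized to zero
lr, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):          # each epoch is one pass over the corpus
    order = rng.permutation(len(X))  # "stochastic": shuffle, take mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w - y[idx]
        grad = X[idx].T @ err / len(idx)   # gradient of mean-squared error
        w -= lr * grad                     # adjust the weights slightly

print(w)  # converges toward [0.5, -1.2, 0.3, 2.0]
```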

Fortunately, edge devices can adopt a simpler process. By starting with a pretrained model, only a few training passes are required to adapt it to new data. Furthermore, the training-data set can be fairly small, since the model has already been trained for a broad set of cases and only needs updating for the specific user. If the model is frequently retrained, each retraining session can add only a few new cases, limiting the computation required.
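A minimal sketch of this retraining approach appears below: the pretrained backbone is frozen, and only a small output head is updated using a handful of new samples and a few passes. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "pretrained" model: a fixed feature extractor plus a small head.
W_backbone = rng.normal(size=(16, 8))   # frozen during retraining
w_head = rng.normal(size=16) * 0.1      # only these weights are updated

def features(x):
    return np.maximum(0.0, W_backbone @ x)   # frozen ReLU features

# A handful of new, user-specific samples -- far smaller than the full corpus.
X_new = rng.normal(size=(20, 8))
y_new = rng.normal(size=20)

lr = 0.01
for _ in range(5):                      # only a few passes are needed
    for x, target in zip(X_new, y_new):
        h = features(x)
        err = w_head @ h - target
        w_head -= lr * err * h          # update the head only

print(w_head[:4])                       # adapted head weights
```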

Even so, the computation can be challenging for an edge device. For example, an SoC based on a 1.0GHz Cortex-A55 CPU can compute 32 billion AI operations per second (32 GOPS); a data-center GPU chip can compute thousands of times faster. A common microcontroller using a 100MHz Cortex-M4 CPU produces only 0.1 GOPS. Thus, for retraining a simple network, the SoC might take a few minutes of continuous computing, but the microcontroller could take several hours or more.
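The arithmetic behind these estimates is easy to check; the workload figure in the calculation below is an assumption chosen to be consistent with the timings above, not a number from Ambient or Arm.

```python
# Hypothetical retraining workload (total operations); ~5e12 is an assumed
# figure consistent with "a few minutes" on the SoC, "several hours" on the MCU.
workload_ops = 5e12

soc_ops_per_s = 32e9    # Cortex-A55-class SoC: 32 GOPS
mcu_ops_per_s = 0.1e9   # Cortex-M4-class MCU: 0.1 GOPS

print(f"SoC: {workload_ops / soc_ops_per_s / 60:.1f} minutes")   # ~2.6 minutes
print(f"MCU: {workload_ops / mcu_ops_per_s / 3600:.1f} hours")   # ~13.9 hours
```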

For many edge devices, a Cortex-A55 or similar SoC is impractical. Such a processor typically costs $10 or more, whereas a microcontroller costs only a few dollars, making it better suited for cost-sensitive devices. Power is also a consideration. An SoC can easily require 1W or more, far too much for a battery-operated or wireless device. For these devices, a critical factor is GOPS/mW, that is, the ability to compute AI operations while using the least amount of energy. While CPUs are good for general-purpose computation, maximizing AI efficiency requires a more specialized design.

Ambient Solution

Ambient Scientific has developed a custom architecture optimized for low-power AI operations called DigAn. The name refers to the use of standard digital CMOS to implement an analog math engine. Analog computation is orders of magnitude more efficient than the purely digital approach in CPUs and GPUs, but it is more difficult to design. Combining proprietary design techniques with standard CMOS, Ambient is the first company to bring an analog AI engine into volume production.

As implemented in its new GPX-10 chip, a single DigAn core can generate 51 GOPS, outperforming the aforementioned Cortex-A55 while using less than 0.1% of its power. The chip contains 10 cores that together can deliver 512 GOPS while using just 120mW. That works out to 4.3 TOPS/W, which is far better than any CPU or GPU and ahead of most other AI-centric products as well. The design achieves this efficiency in a low-cost 40nm process; the company estimates it will achieve more than 20 TOPS/W in state-of-the-art 7nm technology.
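The efficiency figure follows directly from the published throughput and power numbers, as the short calculation below shows.

```python
gops = 512        # GPX-10 peak throughput, from the article
power_mw = 120    # power at that throughput, from the article

gops_per_mw = gops / power_mw      # ~4.27 GOPS/mW
tops_per_w = gops_per_mw           # 1 GOPS/mW is equivalent to 1 TOPS/W
print(f"{tops_per_w:.2f} TOPS/W")  # ~4.27, i.e., the article's ~4.3 TOPS/W
```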

The GPX-10 can also operate in always-on mode, using five cores at a low speed, while consuming just 80 microwatts (0.08mW). In this mode, the chip can listen for a wake word and then quickly power up to analyze the rest of the phrase. It can also monitor other sensors to detect anomalous data that requires additional analysis.
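The control flow of such a two-stage pipeline might look like the Python sketch below; the detector, full model, and threshold are illustrative stand-ins, not Ambient's actual software interface.

```python
from collections import deque

# Illustrative two-stage pipeline: a tiny always-on detector gates the
# larger model, which runs only when the wake word appears to be present.
WAKE_THRESHOLD = 0.8   # hypothetical score threshold

def tiny_detector(frame):
    # Stand-in for a small always-on network: returns a wake-word score.
    return sum(frame) / (len(frame) or 1)

def full_model(frames):
    # Stand-in for the larger network that analyzes the rest of the phrase.
    return f"command recognized from {len(frames)} buffered frames"

buffer = deque(maxlen=50)  # rolling audio context
for frame in ([0.1] * 10, [0.9] * 10, [0.2] * 10):  # fake audio frames
    buffer.append(frame)
    if tiny_detector(frame) >= WAKE_THRESHOLD:      # cheap check, always on
        print(full_model(list(buffer)))             # expensive path, rare
```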


In a wearable, each sensor input can be directed to a separate core; in this configuration, only one core needs to run at any given time, conserving power. Each core can run a different neural network, possibly using a different resolution that is optimized for that sensor. This capability minimizes the energy that the AI cores consume while keeping the main CPU in sleep mode. In a sensor-fusion design, up to 10 cores can work together to quickly inference a complex neural network, generating real-time results while using little power.

The GPX-10 comprises more than just the AI accelerator, providing an entire SoC. It features a Cortex-M4 CPU to run a real-time operating system (RTOS) and application code, along with 512KB of integrated flash and 320KB of SRAM. The chip can connect to digital sensors and other peripherals using standard interfaces, and it includes an 8-channel analog-to-digital converter (ADC) to support analog sensors. Essentially, Ambient has combined a complete microcontroller with a highly efficient AI accelerator on a single chip, simplifying the system design.

Figure 1. Ambient GPX-10 AI processor. This chip combines an Arm CPU, standard peripherals, and a custom AI accelerator to form a complete SoC for a low-power IoT product.

The company has optimized the entire SoC for low-power operation using custom circuit designs. In addition to the analog AI engine, the large on-chip memory uses a custom 3D SRAM design that reduces power by 80%, according to the company. On-chip communication consumes 90% less power using low-voltage signaling. The custom ADC requires only 5 microwatts while obtaining 20,000 samples per second, ideal for always-on sensor fusion. Ambient even redesigned the clock circuit to eliminate 99% of the power from a traditional oscillator. These optimizations were necessary to enable the chip to operate on just 80 microwatts of power.

Conclusion

As IoT and edge devices adopt neural networks to enhance their functionality, their workloads are increasingly complex. Many devices employ multiple sensors, requiring multiple neural networks to analyze this data or perhaps a single complex sensor-fusion network. Sensor fusion enables more valuable analysis and recommendations for the end user.

Adding AI training to these devices enables a new level of customization and capability. Voice-enabled devices can improve their accuracy by learning the user’s speech patterns and most common commands. They can also adapt to background noise in their location, either suppressing or monitoring that noise as needed. Voice ID can improve security. By running the training on the device, the user’s data isn’t exposed to the cloud-service provider, enhancing privacy.

The challenge for designers is to deliver these new capabilities within the cost and power constraints of a typical edge device. Performing AI training on a Cortex CPU, even for more limited retraining scenarios, can tie up the device for hours and quickly drain the battery. Even multisensor inference can tax a Cortex-M CPU, particularly in wearable devices that have tiny batteries. Most AI accelerators add far too much cost and power for an edge device.

To address this challenge, Ambient’s GPX-10 processor provides a breakthrough in power efficiency, delivering 512 billion operations per second at just 120mW. This performance supports both inference and retraining of sophisticated neural networks for speech recognition. To reduce system cost, the GPX-10 integrates a complete Cortex-M4 microcontroller along with its unique AI engine. The company optimized the chip to operate at just 80 microwatts when monitoring for wake words, making it well suited to battery-powered devices. For designers seeking to boost the AI performance of their edge devices, Ambient’s innovative processor meets the challenge.

Linley Gwennap is principal analyst at The Linley Group and editor-in-chief of Microprocessor Report. The Linley Group offers the most comprehensive analysis of microprocessors and SoC design. We analyze not only the business strategy but also the internal technology. Our in-depth articles cover topics including embedded processors, mobile processors, server processors, AI accelerators, IoT processors, processor-IP cores, and Ethernet chips. For more information, see our website at www.linleygroup.com.

©2020 The Linley Group, Inc.