Unified Inference and Training at the Edge

By Linley Gwennap, Principal Analyst, The Linley Group

October 2020

www.linleygroup.com


As more edge devices add AI capabilities, some applications are becoming increasingly complex. Wearables and other IoT devices often have multiple sensors, requiring a different neural network for each sensor, or they may use a single complex network to combine all the inputs, a technique called sensor fusion. Others implement on-device training to customize the application. The GPX-10 can handle these advanced AI applications while keeping power to a minimum. Ambient Scientific sponsored this white paper, but the opinions and analysis are those of the author.

Introduction

As more IoT products implement AI inferencing on the device, some applications are becoming increasingly complex. Instead of just a single sensor, they may have several, such as a wearable device that has accelerometer, pulse-rate, and temperature sensors. Each sensor may require a different neural network, or a single complex network can combine all the input data, a technique called sensor fusion. A microcontroller CPU (e.g., Cortex-M) or DSP can handle a single sensor, but these simple cores are inefficient for more complex IoT applications.

Training introduces additional complications for edge devices. Today’s neural networks are trained in the cloud using generic data from many users. This approach takes advantage of the massive compute power in cloud data centers, but it creates a one-size-fits-all AI model. The trained model can be distributed to many devices, but even if each device performs the inferencing, the model performs the same way for all users.

To provide truly personal services, an edge device must adapt the neural network to the user. For example, a voice-recognition model could be trained to recognize a specific voice. This approach doesn’t require training the model from scratch, a computationally intensive task; instead, an existing model can be retrained on the device to more accurately respond to a particular user or use case. This type of on-device training shouldn’t be confused with federated learning, a different concept in which each user’s data is compressed and sent to the cloud for retraining the master model.

The benefits of on-device retraining go beyond personalization. The retraining occurs without sending the user’s data to the cloud, improving privacy. On-device training reduces the burden on cloud data centers, which would otherwise need many new servers to keep up with billions of edge devices. Severing the cloud connection also reduces cellular or other network costs that might accrue to either the end user or the cloud-service provider (CSP). Finally, on-device retraining can constantly adapt and improve without waiting for the CSP to update the model.

On-device training has many applications. In a consumer setting, a smart speaker or other voice-operated device could recognize commands only from a designated user, preventing visitors from ordering products or accessing personal information, thus improving security. A wearable device could learn the user’s normal health parameters (e.g., pulse rate, blood oxygen) and quickly recognize any unusual results that could indicate a medical problem. A voice-operated kiosk in an airport or shopping mall could be trained to recognize and filter out the background noise in its specific location to better isolate the user’s voice. In a factory, an audio device (sometimes in conjunction with a vibration sensor) could monitor the ambient conditions to detect anomalies that might indicate that certain machinery is failing. These consumer and industrial use cases all require some device-specific training or retraining.

Multisensor Inferencing

Wearable devices pose the biggest challenge for neural-network inferencing. These devices continue to add sensors, particularly for health monitoring. The new Apple Watch 6 includes blood-oxygen, ECG (electrocardiogram), and pulse-rate sensors as well as an altimeter, six-axis accelerometer, and microphone. Even low-cost wearables typically have several sensors.

Wearables can also connect to external sensors using Bluetooth and other wireless protocols. For example, a cyclist can connect their smartwatch to sensors on their bike that detect pedal rate, power, and speed. The data from these “dumb” sensors must be analyzed either in the cloud or in the wearable, but only the latter provides real-time information during a ride. This implementation can be combined with retraining to deliver personalized coaching to the rider.

Some wearable sensors are always on, while others are enabled only during certain activities. When enabled, these sensors may have different sampling rates and different resolutions (e.g., 4 bits, 8 bits, or 12 bits per sample). Inferencing even a single sensor on a Cortex-M CPU is far less efficient than using a purpose-built AI accelerator. As the workload becomes more complex, this inefficiency starts to sap battery life. Waking the CPU to process each incoming data sample further strains the battery, which in a wearable is already small. Always-on inferencing on a separate AI accelerator allows the CPU to remain in sleep mode as much as possible, conserving battery life.
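To make the resolution differences concrete, the short Python sketch below normalizes raw samples of different bit depths to a common scale before they feed a network; the sensor values are hypothetical, not taken from any particular device.

```python
import numpy as np

def normalize_samples(raw: np.ndarray, bits: int) -> np.ndarray:
    """Scale unsigned raw ADC samples of the given bit depth to [0, 1]."""
    return raw.astype(np.float32) / (2**bits - 1)

# Hypothetical readings from sensors with different resolutions.
accel_4bit = np.array([3, 7, 12, 15])       # 4-bit samples, range 0-15
pulse_8bit = np.array([72, 88, 130, 255])   # 8-bit samples, range 0-255
temp_12bit = np.array([1024, 2048, 4095])   # 12-bit samples, range 0-4095

for name, raw, bits in [("accel", accel_4bit, 4),
                        ("pulse", pulse_8bit, 8),
                        ("temp", temp_12bit, 12)]:
    print(name, normalize_samples(raw, bits))
```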

Sensor fusion provides more advanced capabilities to the end user. This approach feeds the data from multiple sensors into a single neural network. In the cycling example, a standard device can analyze data from each of the sensors individually and display the speed and power, for example. But a sensor-fusion model can combine all these factors with pulse rate and blood oxygen to create a recommended pedal rate that optimizes the rider’s workout. This advanced capability requires a complex neural network that can inference in real time without draining the wearable’s battery during a long ride. Only a carefully optimized hardware design can meet these objectives.
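The Python sketch below illustrates the core idea of sensor fusion: a single toy network sees the concatenated inputs from all sensors at once and produces one recommendation. The feature values are invented and the weights are random, for illustration only; a real fusion model would be trained on labeled ride data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sensor inputs, already preprocessed and normalized.
pedal_rate   = np.array([0.62])   # bike cadence sensor
power        = np.array([0.48])   # power meter
speed        = np.array([0.55])
pulse_rate   = np.array([0.71])
blood_oxygen = np.array([0.97])

# Sensor fusion: one feature vector containing every modality.
x = np.concatenate([pedal_rate, power, speed, pulse_rate, blood_oxygen])

# Toy two-layer network with random (untrained) weights.
W1, b1 = rng.normal(size=(8, x.size)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

hidden = np.maximum(0.0, W1 @ x + b1)       # ReLU hidden layer
recommended_pedal_rate = W2 @ hidden + b2   # fused output (arbitrary units)
print(recommended_pedal_rate)
```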

Retraining Challenges

To achieve high accuracy across many possible input cases, neural networks are trained using a large set (corpus) of labeled data, often containing millions of items. Using the traditional stochastic gradient descent (SGD) approach, the model must evaluate each item in the training set, then its weights are adjusted slightly. This process is repeated dozens of times until the model converges on a solution. Simply storing an entire corpus is well beyond the capability of most edge devices. Furthermore, the required computations can consume a high-end server for days or weeks and would take far longer on a low-cost edge processor.
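For readers unfamiliar with the procedure, the minimal Python sketch below shows the SGD loop just described on a toy regression task; the data, model, and hyperparameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy labeled corpus: 1,000 examples of a 4-feature regression task.
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.5, -1.2, 0.3, 2.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(4)                      # model weights, initialized to zero
lr, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):          # each epoch is one pass over the corpus
    order = rng.permutation(len(X))  # "stochastic": shuffle, take mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w - y[idx]
        grad = X[idx].T @ err / len(idx)   # gradient of mean-squared error
        w -= lr * grad                     # adjust the weights slightly

print(w)  # converges toward [0.5, -1.2, 0.3, 2.0]
```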

Fortunately, edge devices can adopt a simpler process. By starting with a pretrained model, only a few training passes are required to adapt it to new data. Furthermore, the training-data set can be fairly small, since the model has already been trained for a broad set of cases and only needs updating for the specific user. If the model is frequently retrained, each retraining session can add only a few new cases, limiting the computation required.
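A minimal sketch of this retraining approach appears below: the pretrained backbone is frozen, and only a small output head is updated using a handful of new samples and a few passes. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "pretrained" model: a fixed feature extractor plus a small head.
W_backbone = rng.normal(size=(16, 8))   # frozen during retraining
w_head = rng.normal(size=16) * 0.1      # only these weights are updated

def features(x):
    return np.maximum(0.0, W_backbone @ x)   # frozen ReLU features

# A handful of new, user-specific samples -- far smaller than the full corpus.
X_new = rng.normal(size=(20, 8))
y_new = rng.normal(size=20)

lr = 0.01
for _ in range(5):                      # only a few passes are needed
    for x, target in zip(X_new, y_new):
        h = features(x)
        err = w_head @ h - target
        w_head -= lr * err * h          # update the head only

print(w_head[:4])                       # adapted head weights
```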

Even so, the computation can be challenging for an edge device. For example, an SoC based on a 1.0GHz Cortex-A55 CPU can compute 32 billion AI operations per second (32 GOPS); a data-center GPU chip can compute thousands of times faster. A common microcontroller using a 100MHz Cortex-M4 CPU produces only 0.1 GOPS. Thus, for retraining a simple network, the SoC might take a few minutes of continuous computing, but the microcontroller could take several hours or more.
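The arithmetic behind these estimates is easy to check; the workload figure in the calculation below is an assumption chosen to be consistent with the timings above, not a number from Ambient or Arm.

```python
# Hypothetical retraining workload (total operations); ~5e12 is an assumed
# figure consistent with "a few minutes" on the SoC, "several hours" on the MCU.
workload_ops = 5e12

soc_ops_per_s = 32e9    # Cortex-A55-class SoC: 32 GOPS
mcu_ops_per_s = 0.1e9   # Cortex-M4-class MCU: 0.1 GOPS

print(f"SoC: {workload_ops / soc_ops_per_s / 60:.1f} minutes")   # ~2.6 minutes
print(f"MCU: {workload_ops / mcu_ops_per_s / 3600:.1f} hours")   # ~13.9 hours
```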

For many edge devices, a Cortex-A55 or similar SoC is impractical. Such a processor typically costs $10 or more, whereas a microcontroller costs only a few dollars, making it better suited for cost-sensitive devices. Power is also a consideration. An SoC can easily require 1W or more, far too much for a battery-operated or wireless device. For these devices, a critical factor is GOPS/mW, that is, the ability to compute AI operations while using the least amount of energy. While CPUs are good for general-purpose computation, maximizing AI efficiency requires a more specialized design.

Ambient Solution

Ambient Scientific has developed a custom architecture optimized for low-power AI operations called DigAn. The name refers to the use of standard digital CMOS to implement an analog math engine. Analog computation is orders of magnitude more efficient than the purely digital approach in CPUs and GPUs, but it is more difficult to design. Combining proprietary design techniques with standard CMOS, Ambient is the first company to bring an analog AI engine into volume production.

As implemented in its new GPX-10 chip, a single DigAn core can generate 51 GOPS, outperforming the aforementioned Cortex-A55 while using less than 0.1% of its power. The chip contains 10 cores that together can deliver 512 GOPS while using just 120mW. That works out to 4.3 TOPS/W, which is far better than any CPU or GPU and ahead of most other AI-centric products as well. The design achieves this efficiency in a low-cost 40nm process; the company estimates it will achieve more than 20 TOPS/W in state-of-the-art 7nm technology.
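The efficiency figure follows directly from the published throughput and power numbers, as the short calculation below shows.

```python
gops = 512        # GPX-10 peak throughput, from the article
power_mw = 120    # power at that throughput, from the article

gops_per_mw = gops / power_mw      # ~4.27 GOPS/mW
tops_per_w = gops_per_mw           # 1 GOPS/mW is equivalent to 1 TOPS/W
print(f"{tops_per_w:.2f} TOPS/W")  # ~4.27, i.e., the article's ~4.3 TOPS/W
```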

The GPX-10 can also operate in always-on mode, using five cores at a low speed, while consuming just 80 microwatts (0.08mW). In this mode, the chip can listen for a wake word and then quickly power up to analyze the rest of the phrase. It can also monitor other sensors to detect anomalous data that requires additional analysis.
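The control flow of such a two-stage pipeline might look like the Python sketch below; the detector, full model, and threshold are illustrative stand-ins, not Ambient's actual software interface.

```python
from collections import deque

# Illustrative two-stage pipeline: a tiny always-on detector gates the
# larger model, which runs only when the wake word appears to be present.
WAKE_THRESHOLD = 0.8   # hypothetical score threshold

def tiny_detector(frame):
    # Stand-in for a small always-on network: returns a wake-word score.
    return sum(frame) / (len(frame) or 1)

def full_model(frames):
    # Stand-in for the larger network that analyzes the rest of the phrase.
    return f"command recognized from {len(frames)} buffered frames"

buffer = deque(maxlen=50)  # rolling audio context
for frame in ([0.1] * 10, [0.9] * 10, [0.2] * 10):  # fake audio frames
    buffer.append(frame)
    if tiny_detector(frame) >= WAKE_THRESHOLD:      # cheap check, always on
        print(full_model(list(buffer)))             # expensive path, rare
```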


In a wearable, each sensor input can be directed to a separate core; in this configuration, only one core needs to run at any given time, conserving power. Each core can run a different neural network, possibly using a different resolution that is optimized for that sensor. This capability minimizes the energy that the AI cores consume while keeping the main CPU in sleep mode. In a sensor-fusion design, up to 10 cores can work together to quickly inference a complex neural network, generating real-time results while using little power.

The GPX-10 comprises more than just the AI accelerator, providing an entire SoC. It features a Cortex-M4 CPU to run a real-time operating system (RTOS) and application code, along with 512KB of integrated flash and 320KB of SRAM. The chip can connect to digital sensors and other peripherals using standard interfaces, and it includes an 8-channel analog-to-digital converter (ADC) to support analog sensors. Essentially, Ambient has combined a complete microcontroller with a highly efficient AI accelerator on a single chip, simplifying the system design.

Figure 1. Ambient GPX-10 AI processor. This chip combines an Arm CPU, standard peripherals, and a custom AI accelerator to form a complete SoC for a low-power IoT product.

The company has optimized the entire SoC for low-power operation using custom circuit designs. In addition to the analog AI engine, the large on-chip memory uses a custom 3D SRAM design that reduces power by 80%, according to the company. On-chip communication consumes 90% less power using low-voltage signaling. The custom ADC requires only 5 microwatts while obtaining 20,000 samples per second, ideal for always-on sensor fusion. Ambient even redesigned the clock circuit to eliminate 99% of the power from a traditional oscillator. These optimizations were necessary to enable the chip to operate on just 80 microwatts of power.

Conclusion

As IoT and edge devices adopt neural networks to enhance their functionality, their workloads are increasingly complex. Many devices employ multiple sensors, requiring multiple neural networks to analyze this data or perhaps a single complex sensor-fusion network. Sensor fusion enables more valuable analysis and recommendations for the end user.

Adding AI training to these devices enables a new level of customization and capability. Voice-enabled devices can improve their accuracy by learning the user’s speech patterns and most common commands. They can also adapt to background noise in their location, either suppressing or monitoring that noise as needed. Voice ID can improve security. By running the training on the device, the user’s data isn’t exposed to the cloud-service provider, enhancing privacy.

The challenge for designers is to deliver these new capabilities within the cost and power constraints of a typical edge device. Performing AI training on a Cortex CPU, even for more limited retraining scenarios, can tie up the device for hours and quickly drain the battery. Even multisensor inference can tax a Cortex-M CPU, particularly in wearable devices that have tiny batteries. Most AI accelerators add far too much cost and power for an edge device.

To address this challenge, Ambient’s GPX-10 processor provides a breakthrough in power efficiency, delivering 512 billion operations per second at just 120mW. This performance supports both inference and retraining of sophisticated neural networks for speech recognition. To reduce system cost, the GPX-10 integrates a complete Cortex-M4 microcontroller along with its unique AI engine. The company optimized the chip to operate at just 80 microwatts when monitoring for wake words, making it well suited to battery-powered devices. For designers seeking to boost the AI performance of their edge devices, Ambient’s innovative processor meets the challenge.

Linley Gwennap is principal analyst at The Linley Group and editor-in-chief of Microprocessor Report. The Linley Group offers the most comprehensive analysis of microprocessors and SoC design. We analyze not only the business strategy but also the internal technology. Our in-depth articles cover topics including embedded processors, mobile processors, server processors, AI accelerators, IoT processors, processor-IP cores, and Ethernet chips. For more information, see our website at www.linleygroup.com.

©2020 The Linley Group, Inc.