JiZhi: A Fast and Cost-Effective Model-As-A-Service System for Web-Scale Online Inference at Baidu

Hao Liu†, Qian Gao†, Jiang Li, Xiaochao Liao, Hao Xiong, Guangxing Chen, Wenlin Wang, Guobao Yang, Zhiwei Zha, Daxiang Dong, Dejing Dou, Haoyi Xiong∗
Baidu Inc., Beijing, China
{liuhao30, gaoqian05, lijiang01, liaoxiaochao, xionghao02, chenguangxing, wangwenlin, yangguobao, zhazhiwei, dongdaxiang, doudejing, xionghaoyi}@baidu.com

† Equal contribution. ∗ Corresponding author. "JiZhi" means making smart decisions extremely fast in Chinese. An open-sourced version of JiZhi is available at https://github.com/PaddlePaddle/Serving.

ABSTRACT
In modern internet industries, deep learning based recommender systems have become an indispensable building block for a wide spectrum of applications, such as search engines, news feeds, and short video clips. However, it remains challenging to serve well-trained deep models for online real-time inference, with respect to time-varying web-scale traffic from billions of users, in a cost-effective manner. In this work, we present JiZhi, a Model-as-a-Service system that handles hundreds of millions of online inference requests per second to huge deep models with more than trillions of sparse parameters, for over twenty real-time recommendation services at Baidu, Inc. In JiZhi, the inference workflow of every recommendation request is transformed into a Staged Event-Driven Pipeline (SEDP), where each node in the pipeline is a staged computation- or I/O-intensive task processor. As real-time inference requests arrive, each modularized processor runs fully asynchronously and is managed separately. Besides, JiZhi introduces a heterogeneous and hierarchical storage to further accelerate the online inference process by reducing unnecessary computations and the data access latency induced by ultra-sparse model parameters. Moreover, an intelligent resource manager has been deployed to maximize the throughput of JiZhi over the shared infrastructure by searching for the optimal resource allocation plan from historical logs and fine-tuning the load shedding policies based on intermediate system feedback. Extensive experiments have been conducted to demonstrate the advantages of JiZhi from the perspectives of end-to-end service latency, system-wide throughput, and resource consumption. Since its launch in July 2019, JiZhi has helped Baidu save more than ten million US dollars in hardware and utility costs per year while handling 200% more traffic without sacrificing inference efficiency.

KEYWORDS
Online inference; recommendation system; intelligent resource management; MLOps

ACM Reference Format:
Hao Liu, Qian Gao, Jiang Li, Xiaochao Liao, Hao Xiong, Guangxing Chen, Wenlin Wang, Guobao Yang, Zhiwei Zha, Daxiang Dong, Dejing Dou, Haoyi Xiong. 2021. JiZhi: A Fast and Cost-Effective Model-As-A-Service System for Web-Scale Online Inference at Baidu. In Proceedings of the 27th International ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3447548.3467146

1 INTRODUCTION
Deep neural networks (DNNs) [19] have recently become the standard workhorses for predicting users' online behaviors, such as click-through rate [12, 36], browsing interests [33], and buying intentions [13], due to their superior representation modeling and reasoning over a wide spectrum of data modalities [24]. As early as the early 2010s, leading internet companies such as Baidu and Google successfully integrated DNNs into their core products, ranging from sponsored search [9, 26] to voice assistants [23]. Figure 1 depicts several real-world products at Baidu that leverage DNNs for online recommendations.
While DNNs have demonstrated their effectiveness in various internet application domains, the cost of using DNNs for web-scale real-time online inference has become the major burden that keeps most companies from adopting these techniques [11, 17]. On the one hand, the time consumption (e.g., latency) of the online service is critical for user experience [5] and can influence the long-term retention rate [4]. On the other hand, the resource consumption (e.g., hardware and energy usage) of supporting DNNs requires significant serving infrastructure investment (e.g., high-performance clusters) with higher power consumption, and sometimes drives the system design, implementation, and operation over budget [29].

To guarantee the utility of online DNN serving, efforts have been made from both algorithmic and system perspectives. From the algorithmic perspective, model compression [8] can significantly reduce the computational overhead by trading off model size against recommendation effectiveness. From the system perspective, dedicated acceleration hardware such as Graphics Processing Units (GPUs) [28] and software extensions such as TensorRT [32] have been devised to reduce DNN computation latency. However, serving DNNs for online recommendations is more than executing the feed-forward neural network. The aforementioned techniques are mainly optimized for single DNN training and inference, but overlook the complicated data and computation dependencies of online inference under time-varying internet traffic.



Figure 1: Screenshots of several DNN-empowered products in Baidu based on JiZhi: (a) Baidu news feed; (b) Haokan video; (c) Baidu tieba.

Figure 2: Characteristics of serving online recommendations. In the past three years, Baidu expanded its inference infrastructure budget 200-fold to handle the explosive and diversified recommendation traffic. (a) Traffic load and budget explosion. (b) Online inference time breakdown.

Indeed, serving real-time DNN inference requests for web-scale recommendations is non-trivial for any major internet player with hundreds of millions of Daily Active Users (DAUs), because of the following three major technical challenges. (1) Huge and sparse DNN models. To improve recommendation performance, industrial DNNs are usually trained with massive data samples, where each data sample is typically extremely high-dimensional and sparse (i.e., hundreds of billions of features with only hundreds of non-zero values). For example, the CTR models in Baidu sponsored search and Ads retrieval are over 10 TB large [35] and are still growing quickly. As a result, these huge DNN models with trillions of sparse parameters require an efficient way to store, retrieve, transfer, and compute. (2) Time-varying web-scale traffic. The serving system of such deep learning based recommenders responds to hundreds of millions of concurrent inference requests per second. More critically, the online inference service may experience a 100-fold traffic increase in peak hours and fast traffic load growth, as depicted in Figure 2(a).
Since it is unrealistic to increase the budget of every single recommendation service without limit, it is challenging to build an efficient and cost-effective online inference system for such explosive and temporally imbalanced traffic loads. (3) Diverse recommendation scenarios. Figure 2(b) shows the diversified data-processing and inference time distribution of five real-world deep learning based recommendation services at Baidu. Depending on the application scenario, the recommendation strategy and inference pipeline may vary. For instance, news recommendation usually handles both text and image related features and may involve multiple DNNs (up to 20 DNNs in a single recommendation service at Baidu), while the search engine pays more attention to user preference modeling [33]. It is challenging to build a unified serving architecture that can support various recommendation scenarios with different online inference workflows.

In this work, we present JiZhi, the Model-as-a-Service system for web-scale online inference at Baidu. Specifically, JiZhi first transforms the workflow of deep learning based recommendation inference into the Staged Event-Driven Pipeline (SEDP), where each node refers to a staged computation task and each edge between two nodes refers to a possible transition from one task to another. By modularizing the complex and application-specific inference pipeline into individual staged processors, SEDP supports flexible customization and performance tuning, and can be well adapted to diverse application scenarios. Moreover, a tailor-designed Heterogeneous and Hierarchical Storage (HHS) module is introduced for huge and ultra-sparse DNN model management. By eliminating redundant computations and parameter access costs, HHS significantly improves the online inference speed and throughput in a cost-effective way. Besides, to improve computational resource utilization, we propose the Intelligent Resource Manager (IRM) module for automatic system performance tuning. By formulating auto-tuning tasks as constrained optimization problems, IRM searches for the optimal resource allocation plan for each stage processor offline and adaptively controls the online load shedding policy based on real-time system feedback, without sacrificing recommendation effectiveness.

To the best of our knowledge, this work is the first to study the system design, implementation, and integration for web-scale DNN inference from an industrial practitioner's perspective, addressing the goal of minimizing overall resource consumption under latency constraints. Our major contributions are summarized as follows. (1) We propose an asynchronous design of online inference for deep learning based recommendation, which exposes the opportunity for fine-grained resource management and flexible workflow customization. (2) We introduce a cost-effective data placement strategy for huge and ultra-sparse DNN models, which significantly improves the online inference system throughput with slight resource overhead. (3) We propose a novel intelligent resource management module to optimize system resource utilization in both online and offline fashion. (4) We conduct extensive experiments to demonstrate the advantages of JiZhi by using real-world web-scale traffic from various products in Baidu. Moreover, we share our hands-on experiences of implementing and deploying the proposed system, which we envision will be helpful to the community. We have released an open-source version of JiZhi on PaddlePaddle (https://github.com/PaddlePaddle/Serving).

2 BACKGROUND AND RELATED WORK
This section overviews the common routine of industrial deep learning based online recommendation and online inference speedup techniques, including the legacy online serving design at Baidu.

Deep learning based online recommendation. Given a set of items (e.g., contents and goods), the recommendation system aims to return a personalized probability score for each item to achieve one or more targets, such as maximizing the CTR [36], improving the dwell time [18], and increasing the overall profitability of the product [33]. In the past years, DNNs have been widely adopted to serve as the core model in online recommendation [16, 21, 22]. Based on the application scenario, different DNN architectures have been adopted for recommendation, for instance, neural collaborative filtering for user-item interaction modeling [14], recurrent neural networks for sequential user behavior modeling [15, 30], and deep reinforcement learning for long-term recommendation optimization [37].
In Baidu, DNN based recommendation services have been successfully integrated into over twenty online internet products (e.g., news feed, short video clips, search engine, and online advertising), serving over one hundred billion online recommendation requests made by hundreds of millions of users per day.

Serving deep learning based online inference. To guarantee the user experience, the online inference process is usually constrained to finish within an ultra-low latency, i.e., dozens of milliseconds. However, the majority of research efforts on DNN speedup focus on model training [7, 34], while limited efforts have been made for online inference. On the one hand, various model compression techniques [8] have been proposed to reduce the model size by sacrificing recommendation effectiveness. On the other hand, dedicated processors (e.g., GPU [28], TPU [17], Kunlun [27]) and software extensions (e.g., AVX [10], SSE [1], TensorRT [32], and TensorFlow-Serving [25]) have been developed to accelerate online inference computation. However, the above approaches are mainly optimized for general DNN acceleration, but overlook the unique challenges of diversified data dependency and model sparsity in online recommendation.

The previous system design for online inference at Baidu. Since 2013, Baidu has developed dedicated hardware and software platforms to accelerate online data processing and DNN inference [27, 35]. When real-time recommendation requests arrive, the system first retrieves relevant candidate items (usually millions) and then applies multi-phase inference to user-item pairs [9] to reduce the candidate set from millions to thousands, finally returning a dozen results to the user. In each phase, following the data-parallel paradigm, the inference tasks (user-item pairs) are packaged into multiple batches, and different batches are then processed in parallel. For each inference task, the online inference system needs to fetch the corresponding raw data from remote distributed storage, transform the input into proper features, and execute the feed-forward inference computation to output the relevance score. For different recommendation applications, the online inference service needs to be developed from scratch and tuned based on expert experience. More critically, the system suffers from pipeline stalls because of the computation and access latency of long-tail candidates, and quickly meets the resource bottleneck under the explosive recommendation traffic demand.

Figure 3: Layered architecture of JiZhi.
3 SYSTEM OVERVIEW
Figure 3 shows the architecture of JiZhi, which consists of three major components: the Staged Event-Driven Pipeline (SEDP), the Heterogeneous and Hierarchical Storage, and the Intelligent Resource Manager. Given an online inference task, the staged event-driven pipeline first compiles the inference workflow into a directed acyclic graph and then executes each independent node in the graph in a fully asynchronous way. By modularizing the inference task into a set of sequential or parallel stages, SEDP can significantly improve the resource utilization of the infrastructure, enable workflow customization, and expose opportunities for more intelligent dynamic resource allocation. The Heterogeneous and Hierarchical Storage (HHS) is a key-value based data placement system including a distributed sparse parameter cube and a heterogeneous and hierarchical cache. By eliminating unnecessary computation and various I/O costs, the storage system further reduces the online inference latency and improves the system throughput in a cost-effective way. Based on SEDP, the Intelligent Resource Manager (IRM) achieves quasi-optimal resource utilization by monitoring and automatically tuning the computational resource allocation of the serving pipeline both offline and online. In particular, the offline resource management component searches for the optimal resource allocation plan for each stage in the SEDP and HHS to maximize the system throughput. The online resource management module adaptively fine-tunes the load shedding policies based on intermediate system feedback to guarantee the serving latency when the traffic is overloaded, without sacrificing recommendation effectiveness.

4 STAGED EVENT-DRIVEN PIPELINE
The primary design objective of JiZhi is to provide a customizable and tunable online inference service for diverse online recommendation scenarios, which is achieved by the asynchronous Staged Event-Driven Pipeline. We begin with the construction of the stage processor.

Definition 1. Stage processor. Given an online inference workflow, a stage processor p is defined as a tuple p = ⟨op, c⟩, where op is the operator representing a unit primitive that encapsulates the functionality of a particular execution stage, and c is the channel that queues events from upstream processors via a shared data structure.
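To make Definition 1 concrete, the minimal Python sketch below models a stage processor as an operator paired with a channel queue. This is an illustrative stand-in, not the actual JiZhi C++ interface; the class name, fields, and toy operators are assumptions made only for exposition.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Any, Callable

@dataclass
class StageProcessor:
    """Definition 1 as data: a stage processor p = <op, c>."""
    name: str
    op: Callable[[Any], Any]                        # unit primitive for one execution stage
    channel: Queue = field(default_factory=Queue)   # buffers events arriving from upstream

# Two illustrative stages of a user-item inference workflow.
feature_stage = StageProcessor(
    "item_feature_extractor",
    op=lambda event: {**event, "features": [float(len(event["item"]))]})
scoring_stage = StageProcessor(
    "dnn_forward",
    op=lambda event: {**event, "score": sum(event["features"]) * 0.1})

# Push one event through the two stages by hand.
feature_stage.channel.put({"user": "u1", "item": "item-42"})
event = feature_stage.op(feature_stage.channel.get())
print(scoring_stage.op(event))   # the user-item pair with its relevance score attached
```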



Figure 4: An illustrative online user-item inference workflow based on JiZhi. (a) Staged event-driven pipeline. (b) Heterogeneous and hierarchical storage.

Note that an execution stage can be any step in the online inference workflow, such as feature extraction, model access, and DNN feed-forward propagation. In this way, an online inference workflow is modularized into a set of stage processors P = {p1, p2, ...}. A stage processor can be understood as a micro-service or a thread, i.e., a reusable and self-contained component that may appear one or multiple times in a workflow depending on the online inference logic. In each stage processor, the operator processes incoming events concurrently and passes finished events to the channel queues attached to subsequent stage processors.

Definition 2. Staged Event-Driven Pipeline (SEDP). A staged event-driven pipeline is defined as a directed acyclic graph G = (P, E), where pi ∈ P is a stage processor and eij ∈ E is an edge connecting two stage processors pi and pj. Note that in a SEDP, all edges pointing to processors of the same stage share the same channel, to provide complex event processing capabilities such as aggregation, join, and ordering.

For a predefined SEDP, JiZhi can automatically analyze the dependencies between stage processors, construct shared channels, and compile a fully asynchronous execution plan. In particular, each arriving online inference request (i.e., user-item pair) is regarded as an event and buffered in the queue of the next stage processor. Based on the graph dependencies, the stage processors in a SEDP can be executed in sequence or in parallel.

Figure 4(a) depicts an example online workflow of the user-item inference task using SEDP. The user and item information is first collected and dispatched to a user-specific processor and item-specific processors. After internal processing stages such as feature filtering, ranking, and wrangling, all raw features of the request are recombined to retrieve the corresponding feature parameters (i.e., pre-computed embeddings and sparse vectors). Finally, the retrieved sparse parameters are fed to the DNN to derive the item relevance score for personalized recommendation.

By asynchronously executing stage processors in an event-driven way, SEDP improves the overall throughput of the online inference system by reducing potential pipeline stalls caused by long-tail items. Note that the capacity (e.g., parallelism, memory) of each stage processor can be tuned separately to achieve a higher resource utilization rate on the shared infrastructure. Moreover, the modularized design exposes a customizable online inference workflow interface to engineers and data scientists to support fast service workflow and model iteration.

Multi-tenant extension. SEDP also offers the flexibility to simultaneously access multiple DNNs in a single pipeline, e.g., multi-phase inference where each phase corresponds to a DNN, and multi-objective inference that accesses multiple DNNs in parallel. The multi-tenant extension not only enables more complex recommendation strategies in a unified service, but also benefits recommendation system iteration via natural support for online A/B testing. Specifically, accessing multiple DNNs usually involves a large amount of repetitive data processing and sparse parameter accessing operations, which can be significantly reduced in a single service. Moreover, in industrial product iteration, data scientists and researchers often conduct a large number of A/B tests daily (usually tens of test groups concurrently). A common approach is to conduct A/B testing between the current recommendation system and a new test group to determine the satisfaction of the new model or strategy. In the previous online inference system, the system administrator needed to deploy a new online service for each new test group, which requires manually splitting the traffic load and assigning computational resources to different variants. Such an approach is cumbersome and error-prone. In JiZhi, A/B testing is well supported by hosting multiple different test groups in a multi-tenant way. By devising a stage processor to dispatch traffic loads to different test groups, each variant is an independent branch in the SEDP and shares the same computing infrastructure.
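The sketch below illustrates, under stated assumptions, how a SEDP-like graph of stage processors can be compiled into a fully asynchronous execution plan using Python's asyncio. The Stage class, the toy operators, and the shutdown convention are hypothetical and only mirror the behavior described above; they are not the production C++ implementation.

```python
import asyncio

class Stage:
    """One node of a staged event-driven pipeline (illustrative only)."""
    def __init__(self, name, op):
        self.name = name
        self.op = op
        self.inbox = asyncio.Queue()   # shared channel fed by upstream stages
        self.downstream = []

    async def run(self):
        while True:
            event = await self.inbox.get()
            if event is None:                      # shutdown marker
                for nxt in self.downstream:
                    await nxt.inbox.put(None)
                return
            result = self.op(event)
            for nxt in self.downstream:            # event-driven hand-off
                await nxt.inbox.put(result)

def connect(upstream, downstream):
    upstream.downstream.append(downstream)

async def main():
    # A miniature user-item workflow: read -> extract features -> score -> collect.
    results = []
    read  = Stage("request_reader",    lambda e: {**e, "parsed": True})
    feats = Stage("feature_extractor", lambda e: {**e, "features": [len(e["item"])]})
    score = Stage("dnn_forward",       lambda e: {**e, "score": sum(e["features"]) * 0.1})
    sink  = Stage("collector",         lambda e: results.append(e) or e)
    connect(read, feats); connect(feats, score); connect(score, sink)

    tasks = [asyncio.create_task(s.run()) for s in (read, feats, score, sink)]
    for item in ("item-1", "item-4242"):           # two inference events (user-item pairs)
        await read.inbox.put({"user": "u1", "item": item})
    await read.inbox.put(None)                     # drain the pipeline and stop
    await asyncio.gather(*tasks)
    print(results)

asyncio.run(main())
```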
5 HETEROGENEOUS AND HIERARCHICAL STORAGE
We further present the cost-effective data placement design, including the distributed sparse parameter cube and the heterogeneous and hierarchical cache. Both components are tailor-designed for deep learning based recommendation in a data-driven approach and are tunable to handle different recommendation scenarios.

5.1 The Distributed Sparse Parameter Cube
In online recommendation, the DNN usually consists of a huge and sparse sub-network (up to terabytes large) and a moderate-sized dense network (usually hundreds of megabytes) [35]. We first introduce the distributed sparse parameter cube (a.k.a. the cube) to manage the extremely high-dimensional parameters of the sparse sub-network. Specifically, the cube is a read-only distributed key-value store, where the key is a compact feature signature (a universally unique identifier of a feature), and the value contains the model weights and user-feedback statistics of the corresponding sparse feature. To guarantee distributed consistency, the feature signature is derived via a universal hash function [6]. Online inference tasks can look up the extremely sparse non-zero values in constant time by accessing the cube. To hide the latency induced by random access to the hash table, all keys are stored purely in memory, and values are grouped into gigabyte-sized blocks that can be placed either in memory or on disk (i.e., SSD). The placement of keys and values is a trade-off between online inference latency and hardware consumption. In this way, the cube, operating on heterogeneous hardware, can support ten-thousand-level concurrent parameter lookups with less than ten milliseconds of latency within a data center. The cube is also fault-tolerant by replicating data blocks in the distributed cluster.
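As a rough illustration of the cube's layout, the following self-contained Python sketch hashes raw features into compact signatures, keeps the key index in memory, and packs values into fixed-size blocks that may sit in memory or on disk. The block size, hash choice, and file layout are simplifications assumed for exposition; the real cube uses gigabyte-sized blocks, universal hashing, and replication across a cluster.

```python
import hashlib
import os
import struct
import tempfile

VALUE_BYTES = 16          # toy fixed-width value, e.g., four float32 weights
BLOCK_VALUES = 4          # values per block (gigabyte-sized blocks in the real system)

def feature_signature(raw_feature: str) -> int:
    """Compact 64-bit signature of a raw feature string (stand-in for universal hashing)."""
    return int.from_bytes(hashlib.md5(raw_feature.encode()).digest()[:8], "big")

class MiniCube:
    """Read-only key-value store: keys in memory, values packed into blocks."""
    def __init__(self, block_dir):
        self.index = {}                       # signature -> (block_id, offset)
        self.block_dir = block_dir
        self.blocks_in_memory = {}            # hot blocks optionally pinned in RAM

    def build(self, items):
        """items: iterable of (raw_feature, value_bytes)."""
        block_id, buf = 0, bytearray()
        for i, (feat, value) in enumerate(items):
            self.index[feature_signature(feat)] = (block_id, (i % BLOCK_VALUES) * VALUE_BYTES)
            buf += value.ljust(VALUE_BYTES, b"\0")
            if (i + 1) % BLOCK_VALUES == 0:
                self._flush(block_id, bytes(buf)); block_id, buf = block_id + 1, bytearray()
        if buf:
            self._flush(block_id, bytes(buf))

    def _flush(self, block_id, data):
        # Blocks can be pinned in memory or spilled to SSD-backed files.
        with open(os.path.join(self.block_dir, f"block_{block_id}"), "wb") as f:
            f.write(data)

    def lookup(self, raw_feature):
        sig = feature_signature(raw_feature)
        if sig not in self.index:             # absent key: the sparse weight is zero
            return None
        block_id, offset = self.index[sig]
        block = self.blocks_in_memory.get(block_id)
        if block is None:                     # cold block: read from disk
            with open(os.path.join(self.block_dir, f"block_{block_id}"), "rb") as f:
                block = f.read()
        return block[offset:offset + VALUE_BYTES]

with tempfile.TemporaryDirectory() as d:
    cube = MiniCube(d)
    cube.build((f"slot_3:token_{i}", struct.pack("<4f", i, 0.1, 0.2, 0.3)) for i in range(10))
    print(struct.unpack("<4f", cube.lookup("slot_3:token_7")))
```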
5.2 Heterogeneous and Hierarchical Cache
Although the distributed sparse parameter cube successfully reduces the feature access latency to several milliseconds, the online inference process is still communication and computation intensive. We further introduce the heterogeneous and hierarchical cache to reduce both the I/O and computation costs, including the cube cache for distributed sparse parameter cube access and the query cache for the inference pipeline computation. Figure 4(b) shows the two types of caches and their interaction with the SEDP.

Figure 5: Data access patterns in online inference service. (a) Access distribution of sparse parameters. (b) Recurrence distribution and variance of user-item pairs.

Cube cache. The design of the cube cache is motivated by the observation that feature access to the cube is heavy-tailed: only a small portion of hot features are frequently accessed and updated. Figure 5(a) plots the distribution of 100 million consecutive cube accesses of a deployed online recommendation service in Baidu, where over 80% of lookups are focused on 1% of the values in the cube. Therefore, it is reasonable to cache such hot features locally to reduce expensive network I/O costs. Specifically, the cube cache is a two-layer local storage that persists a small portion of frequently accessed key-value pairs from the cube. We adopt the Least Frequently Used (LFU) [20] policy to replace key-value pairs in the cache. The cube cache has two levels, i.e., the memory level and the disk level. The disk-level cache hides the expensive network I/O cost of accessing the remote cube, and stores about 1% of the most frequently accessed key-value pairs on local SSDs. On top of the disk cache, a memory cache is further introduced to avoid disk I/O for the top 0.1% hottest key-value pairs. Thanks to the cube cache, the SEDP can avoid up to 90% of remote accesses, which reduces latency by 10% on average.

Query cache. We further introduce the query cache to eliminate unnecessary computations in online inference. Recall that the general target of the online recommendation service is to derive the user-item relevance score. The key design insight of the query cache is that the user-item score is stable over a short time period (e.g., several minutes) under certain conditions. Figure 5(b) depicts the timeliness of the calculated user-item scores. As can be seen, by bounding the time window to two minutes, over 60% of scores are invariant. Therefore, for certain recently processed user-item pairs, the computation can be safely eliminated by reusing the previously calculated score, with little influence on overall recommendation performance. The query cache is designed as purely in-memory storage, which persists a set of recently calculated user-item scores that fulfill certain conditions (e.g., high-relevance items). To guarantee recommendation effectiveness, a cached score expires after a tunable time window or upon any user feedback (e.g., click, dislike), which indicates an intermediate change of user preference. Unlike the cube cache, the query cache employs the Least Recently Used (LRU) policy. The query cache helps JiZhi reduce over 20% of computations, which significantly improves the cost-effectiveness of the online inference service.
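Below is a minimal sketch of the query-cache behavior described above: in-memory scores, a tunable expiry window, LRU eviction, and invalidation on user feedback. The class and parameter names are assumptions for illustration, not JiZhi's actual data structures.

```python
import time
from collections import OrderedDict

class QueryCache:
    """In-memory cache of recently computed user-item scores.

    Entries expire after a tunable time window or on user feedback for the
    item, and are evicted in least-recently-used (LRU) order when full.
    """
    def __init__(self, capacity=100_000, ttl_seconds=120):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.entries = OrderedDict()          # (user, item) -> (score, stored_at)

    def get(self, user, item):
        key = (user, item)
        hit = self.entries.get(key)
        if hit is None:
            return None
        score, stored_at = hit
        if time.time() - stored_at > self.ttl:    # stale: recompute instead of reuse
            del self.entries[key]
            return None
        self.entries.move_to_end(key)             # refresh LRU position
        return score

    def put(self, user, item, score):
        self.entries[(user, item)] = (score, time.time())
        self.entries.move_to_end((user, item))
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict the least recently used entry

    def invalidate_on_feedback(self, user, item):
        # A click/dislike signals a preference change; drop the cached score.
        self.entries.pop((user, item), None)

cache = QueryCache(ttl_seconds=120)
cache.put("u1", "item-42", 0.87)
print(cache.get("u1", "item-42"))   # 0.87 within the expiry window
cache.invalidate_on_feedback("u1", "item-42")
print(cache.get("u1", "item-42"))   # None after user feedback
```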
6 INTELLIGENT RESOURCE MANAGER
In this section, we describe the Intelligent Resource Manager (IRM) in detail, including both the offline and online auto-tuning components. By automatically searching and learning the resource allocation and load shedding policies for SEDP and HHS, IRM shields engineers and data scientists from the complexity of performance tuning, and achieves even better performance than expert priors.

6.1 Offline Auto-Tuning for Quasi-Optimal Resource Allocation
Beyond tuning every single stage processor and cache independently, JiZhi optimizes the parameters of the system on a shared computing infrastructure in a global manner. Specifically, the goal of offline optimization is to minimize the resource consumption under the latency constraint. We first categorize the tunable system parameters into two classes: (1) system-level parameters that influence the global system, and (2) stage-level parameters that work on dedicated stage processors. Please refer to Appendix A for more detailed offline optimization parameters.

Specifically, based on historical logs of every stage processor under varying parameters, JiZhi collectively searches for the globally optimal resource allocation plan. The optimization objective is

\[
\mathcal{O}_{\text{offline}} = \operatorname*{arg\,min}_{(\Theta,\theta_1,\theta_2,\ldots,\theta_N)\in\Gamma} \sum_{j=1}^{N} \mathcal{F}^{R}_{j}(\Theta;\theta_j),
\quad \text{s.t.}\ \mathcal{F}^{L}_{j}(\Theta,\theta_j) \le \mathcal{F}^{L}_{j}(\bar{\Theta},\bar{\theta}_j),\ \forall\, 1 \le j \le N,
\tag{1}
\]

where N refers to the number of stage processors in the system, Γ denotes the set of all possible parameters, Θ refers to the system-level parameters, θ_j for 1 ≤ j ≤ N refers to the stage-level parameters of the j-th stage processor, and Θ̄ and θ̄_j are the default settings of the parameters. The model F^R_j(Θ; θ_j) predicts the resource consumption of the j-th stage processor based on the parameters Θ and θ_j, while the model F^L_j(Θ; θ_j) predicts the end-to-end latency of the j-th stage processor based on the parameters Θ and θ_j. This optimization problem is challenging because (1) the feasible parameters should bound the end-to-end latency of every stage processor (i.e., N constraints), (2) F^R_j(Θ; θ_j) and F^L_j(Θ; θ_j) are built using ensembles of practical regression models, which are not naturally differentiable over the parameter space, and (3) the predictions of F^R_j(Θ; θ_j) and F^L_j(Θ; θ_j) are noisy and probably biased compared to the ground truth. Thus, in addition to common gradient-based approaches, global optimization is required for parameter search. JiZhi devises the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) with constraints [2] for optimal plan search. The objective and constraints are all encoded as black-box functions, and CMA-ES with constraints handles such functions naturally for global optimization in a sampling manner. Note that, in addition to using the minimizers obtained by CMA-ES, JiZhi restores the solution paths of CMA-ES (i.e., the parameters that have ever been reached), picks the constraint-satisfying results with minimal objectives from the solution paths, and evaluates these results in the realistic system deployment (i.e., using 5% of the overall online requests) to find the globally optimal parameters.
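The snippet below sketches the flavor of the offline search under Eq. (1): a simple (1+1) evolution-strategy loop that rejects samples violating the latency constraint and keeps feasible plans with lower predicted resource cost. The resource and latency models here are synthetic stand-ins for the learned ensembles F^R_j and F^L_j, and the loop is a toy substitute for the constrained CMA-ES actually used.

```python
import random

random.seed(0)

N_STAGES = 3                      # number of stage processors
BASELINE = [1.0] * N_STAGES       # default parameter setting (theta_bar)

def resource_model(theta):
    # Stand-in for the learned F^R_j models: total predicted resource usage.
    return sum((x - 0.3) ** 2 + 1.0 for x in theta)

def latency_models(theta):
    # Stand-in for the learned F^L_j models: per-stage predicted latency.
    return [2.0 + 0.5 * x + 0.1 * sum(theta) for x in theta]

LATENCY_BUDGET = latency_models(BASELINE)   # constraint: no worse than the defaults

def feasible(theta):
    return all(l <= b for l, b in zip(latency_models(theta), LATENCY_BUDGET))

def search(iterations=2000, sigma=0.3):
    """(1+1)-ES style search: mutate, reject infeasible samples, keep improvements."""
    best = BASELINE[:]
    best_cost = resource_model(best)
    for _ in range(iterations):
        cand = [max(0.0, x + random.gauss(0.0, sigma)) for x in best]
        if not feasible(cand):              # latency constraint violated: resample
            continue
        cost = resource_model(cand)
        if cost < best_cost:                # keep strictly better feasible plans
            best, best_cost = cand, cost
    return best, best_cost

plan, cost = search()
print("allocation plan:", [round(x, 3) for x in plan], "predicted resource:", round(cost, 3))
```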
6.2 Online Resource Management for Adaptive Load Shedding
Different from the offline auto-tuning, the online resource management aims to handle the time-varying traffic load by swiftly deriving a fine-grained computation policy for each online inference task based on real-time system feedback. The online resource management is more tightly coupled with the recommendation service logic and usually requires task-specific optimization. In this paper, we showcase a commonly used online load shedding module integrated in various online inference services in Baidu, to demonstrate the cost-effectiveness of online resource management.

As aforementioned, most industrial recommendation services adopt a funnel-shaped pipeline [9] that first retrieves a subset of potential candidates based on lightweight features (a.k.a. the recall phase) and then scores each candidate more accurately based on more computationally expensive features (a.k.a. the re-ranking phase). In most cases, only a dozen items will eventually be recommended to users, but over three orders of magnitude more candidates are passed to the computationally expensive re-ranking stage processors. The key insight of online load shedding is to adaptively prune low-quality candidates when the traffic load exceeds the system capacity, under a minimum recommendation effectiveness degradation constraint. Formally, given a set of selected candidate items D^inf = {d1, d2, ..., dn} from the recall phase, the objective of the online load shedding component is

\[
\mathcal{O}_{\text{online}} = \operatorname*{arg\,max}_{q} \frac{1}{K} \sum_{i=1}^{K} \mathcal{C}\big(p_i(q_i \circ D^{inf}_{i})\big),
\quad \text{s.t.}\ |\mathcal{L}^{*} - \hat{\mathcal{L}}| \le \epsilon,
\tag{2}
\]

where K is the total number of recommendation requests, C is the computational cost function, p_i is the corresponding stage processor, q_i is the pruning function that conducts request-wise pruning on D^inf_i, L* and L̂ are respectively the accuracy before and after online load shedding, and ε is a small value bounding the recommendation effectiveness degradation.

Figure 6: Pruning DNN based online load shedding.

To find the optimal pruning policy in an ultra-efficient way, we transform the constrained optimization problem into a regression problem and employ a lightweight DNN (a.k.a. the pruning DNN) to decide the number of candidates to cut off. Specifically, we first sort D^inf_i by the estimated score from the recall phase. Then we apply the well-trained pruning DNN to decide the cutoff line for each D^inf_i, where the candidates behind the cutoff line are directly dropped without being passed to the subsequent re-ranking stage processors. Figure 6 depicts the workflow of the online load shedding process. Operating on features extracted from the upstream stage processors and context features such as the intermediate system latency, the pruning DNN is ultra-lightweight and can prune low-quality candidates in dozens of microseconds. Please refer to Appendix A for the detailed feature list. Different from general load shedding on data streams [3, 31], the online load shedding in JiZhi adaptively decides the request-wise pruning ratio by taking the global recommendation effectiveness into consideration.
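To make the load-shedding step concrete, here is a hedged Python sketch: candidates are sorted by their recall-phase score, a feature vector loosely following Table 7 is built, and a stand-in for the pruning DNN maps it to a request-wise cutoff ratio. The heuristic inside predict_cutoff_ratio is invented for illustration and does not reflect the trained model.

```python
from statistics import mean, pvariance

def pruning_features(candidates, quota, prev_cutoff_ratio):
    """Build the lightweight feature vector for the cutoff decision
    (loosely mirrors Table 7: quota, previous cutoff ratio, score statistics)."""
    scores = [c["recall_score"] for c in candidates]
    return {
        "quota": quota,
        "prev_cutoff_ratio": prev_cutoff_ratio,
        "escore_avg": mean(scores),
        "escore_var": pvariance(scores),
        "escore_max": max(scores),
        "escore_min": min(scores),
    }

def predict_cutoff_ratio(features):
    # Stand-in for the pruning DNN: shed more aggressively when the resource
    # quota is tight and the candidate scores are tightly clustered.
    pressure = max(0.0, 1.0 - features["quota"])
    spread = features["escore_max"] - features["escore_min"] + 1e-6
    return min(0.5, 0.1 * pressure / spread)

def shed(candidates, quota, prev_cutoff_ratio):
    """Sort by recall-phase score and drop the tail before re-ranking."""
    ranked = sorted(candidates, key=lambda c: c["recall_score"], reverse=True)
    ratio = predict_cutoff_ratio(pruning_features(ranked, quota, prev_cutoff_ratio))
    keep = max(1, int(len(ranked) * (1.0 - ratio)))
    return ranked[:keep], ratio

candidates = [{"item": f"item-{i}", "recall_score": 1.0 / (1 + i)} for i in range(1000)]
kept, ratio = shed(candidates, quota=0.4, prev_cutoff_ratio=0.03)
print(f"cutoff ratio {ratio:.2%}, {len(kept)} of {len(candidates)} candidates re-ranked")
```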
7 DEPLOYMENT
JiZhi has been deployed in the production environment to support a variety of real-world products in Baidu. In this section, we discuss the deployment details. Please refer to Appendix B for implementation details.

Online deployment. JiZhi answers remote service accesses via BRPC (https://github.com/brpc/brpc), a scalable Remote Procedure Call framework used throughout Baidu. The online inference service is duplicated in data centers distributed across China to reduce potential network latency. Note that in JiZhi, response latency and resource consumption form a design trade-off: allocating more resources can improve the system throughput and reduce latency, and developers can reduce hardware consumption by relaxing the latency within a reasonable range.

Model management and hot-loading. In JiZhi, the sparse part of the DNN is managed by the distributed sparse parameter cube and the dense part is stored directly in memory. In real-world recommendation applications, the model is updated frequently (e.g., every hour), and it is undesirable to update the model by frequently restarting the service. To guarantee the availability of the online service, JiZhi supports model hot loading without service interruption. Specifically, we build a monitoring module to continuously track the model update state at the remote address (i.e., the model training cluster). Once a new model is ready (identified by its model generation timestamp), the new model is pulled to the local machine and loaded to serve newly arriving recommendation requests via double buffering.
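The following sketch, under assumed interfaces, shows the double-buffering idea behind hot loading: the serving path reads one model slot while a new generation is loaded into the other, and the active pointer is swapped once loading finishes.

```python
import threading

class DoubleBufferedModel:
    """Serve from an active model slot while a new generation loads into the
    standby slot, then swap atomically (illustrative stand-in only)."""
    def __init__(self, loader):
        self.loader = loader                 # callable: generation -> model object
        self.slots = [None, None]            # two buffers
        self.active = 0                      # index of the slot used for serving
        self.generation = None
        self.lock = threading.Lock()

    def serve(self, request):
        model = self.slots[self.active]      # read the active slot on the hot path
        return model(request) if model else None

    def hot_load(self, new_generation):
        if new_generation == self.generation:
            return                           # nothing newer in the training cluster
        standby = 1 - self.active
        self.slots[standby] = self.loader(new_generation)   # load off the serving path
        with self.lock:                      # swap the pointer, no service interruption
            self.active, self.generation = standby, new_generation

def fake_loader(generation):
    return lambda req: f"score({req}) by model@{generation}"

serving = DoubleBufferedModel(fake_loader)
serving.hot_load("2021-06-03-10:00")
print(serving.serve("user=u1,item=42"))
serving.hot_load("2021-06-03-11:00")        # hourly update picked up by the monitor
print(serving.serve("user=u1,item=42"))
```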

8 EXPERIMENTAL EVALUATION
8.1 Experimental Setup
In this section, we conduct extensive experiments to evaluate the effectiveness of the proposed system. We report evaluations on four different JiZhi-empowered real-world online inference services deployed in Baidu. All results are collected from the live online system. Specifically, Service A serves a multi-modal recommendation model on Baidu news feed, Service B and Service C are video recommendation services used in different product scenarios, and Service D is a multi-objective recommendation service fulfilling multiple targets. Table 1 summarizes the model statistics of these recommendation services. We use the previous online inference system (Legacy), as described in Section 2, as the compared baseline.

In the evaluation, we focus on system performance metrics. We use latency, throughput, and the number of required instances (i.e., the computational resource) to evaluate the efficiency and cost-effectiveness of the system. Specifically, the latency is computed as the average time per task (user-item pair) in milliseconds (ms), and the throughput is defined as the number of tasks processed per second. An instance refers to a virtual machine on the cloud, which is commonly equipped with four CPU cores and 20 GB memory, and shares an 8 TB disk in a physical server (4 TB SSD and 4 TB SATA).

Table 1: Statistics of online inference services.
Service | Model size | # of feature groups | Traffic load
Service A | 430 GB | 379 | 4.58 × 10^8/s
Service B | 500 GB | 430 | 4.21 × 10^8/s
Service C | 285 GB | 270 | 3.67 × 10^7/s
Service D | 210 GB | 106 | 7.15 × 10^7/s

Table 2: Overall performance.
Service | System | Latency | Throughput | # of instances
Service A | Legacy | 30 ms | 1.53 × 10^6 | 11,450
Service A | JiZhi | 23 ms | 4.42 × 10^6 | 3,970
Service B | Legacy | 29 ms | 1.63 × 10^6 | 12,750
Service B | JiZhi | 24 ms | 4.36 × 10^6 | 4,773
Service C | Legacy | 41 ms | 2.8 × 10^6 | 2,067
Service C | JiZhi | 40 ms | 5.21 × 10^6 | 1,110
Service D | Legacy | 22 ms | 3.53 × 10^6 | 4,280
Service D | JiZhi | 18 ms | 8.24 × 10^6 | 1,833

8.2 Overall Production Experience
Table 2 reports the overall performance of JiZhi compared with the legacy online inference system on the four services with respect to the three metrics. As can be seen, JiZhi outperforms Legacy on all services in terms of all metrics. Specifically, JiZhi achieves (23.33%, 17.24%, 2.5%, 18.18%) latency improvement and (188.37%, 178.13%, 86.16%, 133.41%) throughput improvement on the four services, respectively. Benefiting from such improvement, JiZhi reduces (65.33%, 62.56%, 46.3%, 57.17%) of the instances in the online serving infrastructure without sacrificing system efficiency, which helps Baidu save over ten million US dollars of computational resource budget (hardware, energy consumption, labor cost, etc.) per year. We foresee this economic gain will grow with the future traffic load growth of each recommendation service. Looking further into the performance of each service, we notice that the gain on Service C is relatively lower than on the others, mainly because the corresponding product is resource constrained (i.e., it has less budget than the others) and the system is tuned to minimize resource consumption.

8.3 Effect of SEDP
We then look into the inference time distribution to study the effect of SEDP. Due to the page limit, we only report results on Service A; the results on other services are similar.

Figure 7: Effect of SEDP on latency. (a) Latency distribution. (b) Latency vs. traffic. (c) Latency breakdown by hour. (d) Latency breakdown by traffic.

We first study the overall latency distribution. Figure 7(a) and Figure 7(b) report the overall online inference latency distribution and the latency distribution with respect to the varying traffic load, respectively. As can be seen, the overall latency follows a bimodal distribution. This makes sense, as the first peak corresponds to requests that hit the two caches, and the second peak corresponds to the majority of requests in the peak hour. Moreover, we observe that the latency increases sub-linearly with respect to the traffic load growth, which verifies the robustness of the system under time-varying traffic.

Figure 7(c) and Figure 7(d) break down the latency into four major stages, i.e., user-related processing, item-related processing, cube-related processing, and DNN forward computation. Overall, we observe that the ratio of latency in each stage is relatively stable at different hours and traffic loads. Moreover, the overall latency slightly increases in the evening peak hour. Besides, with the traffic load growth, we observe that the average latency first decreases and then increases. This is related to the cache hit ratio, which first increases as the ratio of frequent requests increases and then decreases when more long-tailed items flush the two caches.

8.4 Effect of the Storage Module
We report the hit ratio of the cube cache and the query cache at different hours of Service A over a whole week to demonstrate the effect of the heterogeneous and hierarchical storage, as shown in Figure 8. Overall, the hit ratios of the two caches vary by hour and show strong daily periodicity, which is highly correlated with the traffic load variation. In particular, the cube cache achieves an 84.21% hit ratio on average with a daily variation of less than 3.61%, while the query cache yields a 19.26% hit ratio on average but with a higher variation range. In summary, the cube cache achieves a higher hit ratio and every hit reduces part of the data access latency, while the query cache achieves a relatively lower hit ratio but every hit eliminates the computation of an entire inference task. The two caches robustly reduce the inference latency and computational overhead at different times of the week.

Figure 8: Hit ratio of the cube cache and query cache.

8.5 Effect of the Intelligent Resource Manager
We further justify the performance gain of the offline resource optimization and the online load shedding.

Table 3: Resource gain of offline auto-tuning.
Service | Reduced # of instances | Gain
A | 1,636 | 14.29%
B | 1,736 | 13.62%
C | 184 | 8.91%
D | 704 | 16.45%

Table 4: Offline auto-tuning results in Service A.
Category | Parameter | noOpt | Opt
Stage | User batch | 30 | 34
Stage | Item extractor batch | 4 | 12
Stage | Item processor batch | 6 | 17
Stage | Cube batch size | 10 | 6
Stage | DNN batch size | 15 | 25
System | Cube cache ratio | 1% | 1.2%
System | Query cache window | 120s | 143s
System | # of arenas | 500 | 549
System | Max active extent | 6 | 25
System | Huge page | Default | Always

Offline resource optimization. Table 3 reports the quantitative resource gain of each service with the offline resource optimization. As can be seen, the offline resource optimization module successfully reduces 8.91% to 16.45% of instances without sacrificing system latency. We further conduct a qualitative study by comparing the key optimization parameters of Service A before (noOpt) and after (Opt) applying offline resource optimization, as shown in Table 4. As can be seen, the resource manager assigns a larger batch size to most stage processors except the cube accessor. This makes sense since over 80% of computations in this stage are eliminated by the cube cache, as discussed in Section 8.4. Moreover, the size of the cube cache and the expiry window of the query cache are increased, and the offline manager derives a more aggressive memory allocation strategy. Such settings further reduce the data access latency and improve CPU utilization, which therefore improves the overall resource utilization rate.

Online load shedding. To illustrate the effectiveness of online load shedding, Figure 9 depicts the cutoff ratio of the online load shedding and the corresponding traffic loads over a week. We observe highly synchronized fluctuations of the cutoff ratio and the traffic load. Specifically, in the evening traffic peak hour, we observe the peak of the cutoff ratio simultaneously. Besides, we observe a relatively higher cutoff ratio at midnight, perhaps because of a lower click ratio in this period, indicating that more items can be safely dropped without influencing recommendation effectiveness.

Figure 9: Cutoff ratio in Service A.

Overall, both the offline and online resource management components successfully tune the online inference system to achieve better performance automatically.

8.6 Multi-Tenant Serving
Finally, we use Service E, a recommendation service including multiple DNNs, to demonstrate the effectiveness of multi-tenant support in JiZhi. Specifically, Service E includes three different deep learning models, namely CTR, FR, and CMT, where each model corresponds to a different recommendation target. The models are 1,743 GB in total, involve 968 different feature groups, and the service handles 9.19 × 10^7 requests per second. Table 5 reports the latency, throughput, and number of instances when deploying each model as an individual service, compared with the multi-tenant based online inference on JiZhi. Overall, we observe similar system latency when simultaneously executing the three different models, indicating a linear speedup ratio of parallelization in JiZhi. Moreover, JiZhi achieves 82.68% throughput improvement compared with the bottleneck model (i.e., CTR). Most importantly, by integrating the three individual models into a unified online inference service, JiZhi saves 73.69% of the instances in the cluster without losing efficiency. JiZhi is even more cost-effective in the multi-tenant scenario than in the single-model services (i.e., Service A to Service D). This is because the unified service reduces the repetitive data access costs of multiple individual services (e.g., over 80% of the feature groups are shared by the three models) and largely reuses intermediate results via the heterogeneous and hierarchical storage.

Table 5: Effectiveness of multi-tenant serving.
Model/Service | Latency | Throughput | # of instances
CTR | 50 ms | 2.48 × 10^6 | 3,617
FR | 48 ms | 3.42 × 10^6 | 2,560
CMT | 42 ms | 3.4 × 10^6 | 1,350
JiZhi | 52 ms | 4.53 × 10^6 | 1,980
9 CONCLUSION AND FUTURE WORK
In this work, we presented JiZhi, a fast and cost-effective online inference system at Baidu for web-scale recommendation. We first proposed SEDP, a directed-acyclic-graph style asynchronous pipeline for online inference serving. By decoupling the inference process into modularized stage processors, SEDP reduces pipeline stalls and enables flexible workflow customization and auto-tuning for a wide spectrum of recommendation scenarios. Then we introduced the heterogeneous and hierarchical storage for cost-effective sparse DNN model management, to simultaneously accelerate the online inference speed and reduce the overall computational overhead. After that, an intelligent resource management module was proposed to optimize the computational resource utilization both offline and online, under inference latency constraints. JiZhi has been deployed in Baidu, serving over twenty real-world internet products, and has helped Baidu save over ten million US dollars of computational resource budget per year while handling hundreds of millions of online inference requests per second without sacrificing the inference latency. We share our design and practical experience of building a web-scale online inference system and hope to provide useful insights to communities in both academia and industry.

REFERENCES
[1] Hossein Amiri and Asadollah Shahbahrami. 2020. SIMD programming using Intel vector extensions. J. Parallel and Distrib. Comput. 135 (2020), 83–100.
[2] Dirk V Arnold and Nikolaus Hansen. 2012. A (1+1)-CMA-ES for constrained optimisation. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation. 297–304.
[3] Brian Babcock, Mayur Datar, and Rajeev Motwani. 2007. Load shedding in data stream systems. In Data Streams. 127–147.
[4] Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 2019. 150 successful machine learning models: 6 lessons learned at Booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1743–1751.
[5] Jake Brutlag. 2009. Speed Matters for Google Web Search. https://services.google.com/fh/files/blogs/google_delayexp.pdf
[6] J Lawrence Carter and Mark N Wegman. 1979. Universal classes of hash functions. Journal of Computer and System Sciences 18, 2 (1979), 143–154.
[7] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 609–622.
[8] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
[9] Miao Fan, Jiacheng Guo, Shuai Zhu, Shuo Miao, Mingming Sun, and Ping Li. 2019. MOBIUS: Towards the next generation of query-ad matching in Baidu's sponsored search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2509–2517.
[10] Nadeem Firasta, Mark Buxton, Paula Jinbo, Kaveh Nasri, and Shihjong Kuo. 2008. Intel AVX: New frontiers in performance improvements and energy efficiency. Intel White Paper 19, 20 (2008).
[11] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, et al. 2018. A configurable cloud-scale DNN processor for real-time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 1–14.
[12] Zhoutong Fu, Huiji Gao, Weiwei Guo, Sandeep Kumar Jha, Jun Jia, Xiaowei Liu, Bo Long, Jun Shi, Sida Wang, and Mingzhou Zhou. 2020. Deep learning for search and recommender systems in practice. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3515–3516.
[13] Long Guo, Lifeng Hua, Rongfei Jia, Binqiang Zhao, Xiaobo Wang, and Bin Cui. 2019. Buying or browsing?: Predicting real-time purchasing intent using attention-based deep network with multiple behavior. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1984–1992.
[14] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
[15] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[16] Houdong Hu, Yan Wang, Linjun Yang, Pavel Komlev, Li Huang, Xi Chen, Jiapei Huang, Ye Wu, Meenaz Merchant, and Arun Sacheti. 2018. Web-scale responsive visual search at Bing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 359–367.
[17] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 1–12.
[18] Hemank Lamba and Neil Shah. 2019. Modeling dwell time engagement on visual multimedia. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1104–1113.
[19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[20] Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. 1999. On the existence of a spectrum of policies that subsumes the least recently used (LRU) and least frequently used (LFU) policies. In Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. 134–143.
[21] Hao Liu, Jindong Han, Yanjie Fu, Jingbo Zhou, Xinjiang Lu, and Hui Xiong. 2021. Multi-modal transportation recommendation with unified route representation learning. Proceedings of the VLDB Endowment 14, 3 (2021), 342–350.
[22] Hao Liu, Yongxin Tong, Jindong Han, Panpan Zhang, Xinjiang Lu, and Hui Xiong. 2020. Incorporating multi-source urban data for personalized and context-aware multi-modal transportation recommendation. IEEE Transactions on Knowledge and Data Engineering (2020).
[23] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. 2011. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing 20, 1 (2011), 14–22.
[24] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning. In International Conference on Machine Learning.
[25] Christopher Olston, Fangwei Li, Jeremiah Harmsen, Jordan Soyke, Kiril Gorovoy, Li Lao, Noah Fiedel, Sukriti Ramesh, and Vinu Rajashekhar. 2017. TensorFlow-Serving: Flexible, high-performance ML serving. In Workshop on ML Systems at NIPS 2017.
[26] Jian Ouyang, Shiding Lin, Wei Qi, Yong Wang, Bo Yu, and Song Jiang. 2014. SDA: Software-defined accelerator for large-scale DNN systems. In 2014 IEEE Hot Chips 26 Symposium. 1–23.
[27] Jian Ouyang, Mijung Noh, Yong Wang, Wei Qi, Yin Ma, Canghai Gu, SoonGon Kim, Ki-il Hong, Wang-Keun Bae, Zhibiao Zhao, et al. 2020. Baidu Kunlun: An AI processor for diversified workloads. In 2020 IEEE Hot Chips 32 Symposium. 1–18.
[28] John D Owens, Mike Houston, David Luebke, Simon Green, John E Stone, and James C Phillips. 2008. GPU computing. Proc. IEEE 96, 5 (2008), 879–899.
[29] Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Malevich, Satish Nadathur, et al. 2018. Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886 (2018).
[30] Chuan Qin, Hengshu Zhu, Tong Xu, Chen Zhu, Chao Ma, Enhong Chen, and Hui Xiong. 2020. An enhanced neural network approach to person-job fit in talent recruitment. ACM Transactions on Information Systems (TOIS) 38, 2 (2020), 1–33.
[31] Nesime Tatbul, Uğur Çetintemel, Stan Zdonik, Mitch Cherniack, and Michael Stonebraker. 2003. Load shedding in a data stream manager. In Proceedings of the 29th International Conference on Very Large Data Bases. 309–320.
[32] Han Vanholder. 2016. Efficient inference with TensorRT.
[33] Xiao Yang, Daren Sun, Ruiwei Zhu, Tao Deng, Zhi Guo, Zongyao Ding, Shouke Qin, and Yanfeng Zhu. 2019. AiAds: Automated and intelligent advertising system for sponsored search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1881–1890.
[34] Shanshan Zhang, Ce Zhang, Zhao You, Rong Zheng, and Bo Xu. 2013. Asynchronous stochastic gradient descent for DNN training. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 6660–6663.
[35] Weijie Zhao, Jingyuan Zhang, Deping Xie, Yulei Qian, Ronglai Jia, and Ping Li. 2019. AIBox: CTR prediction model training on a single node. In Proceedings of the 28th ACM International Conference on Information & Knowledge Management. 319–328.
[36] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.
[37] Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, and Dawei Yin. 2019. Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2810–2818.

A DETAILED PARAMETER DESCRIPTION
Table 6 lists some important parameters and their feasible ranges in offline tuning. More memory-related parameters can be found in the jemalloc documentation (http://jemalloc.net/). Table 7 lists the lightweight features used for intelligent online load shedding. All features are either derived from real-time system feedback or from upstream processing results.

Table 6: Representative search parameters in offline performance tuning.
Category | Parameter | Type | Range
Stage Level | User batch | Integer | [10, 45]
Stage Level | Item extractor batch | Integer | [2, 45]
Stage Level | Item processor batch | Integer | [2, 45]
Stage Level | Cube batch size | Integer | [1, 20]
Stage Level | DNN batch size | Integer | [10, 45]
System Level | Cube cache ratio | Continuous | [0.1, 5]%
System Level | Query cache window | Continuous | [60, 600] s
System Level | # of arenas (jemalloc) | Integer | [350, 700]
System Level | Max active extent (jemalloc) | Integer | [5, 40]
System Level | Huge page (jemalloc) | Boolean | {Default, Always}

Table 7: Features for online load shedding.

Feature name | Description
quota | Available resource
cutoff ratio | Previous cutoff ratio
qid | Current item queue ID
escore_avg | Average of estimated scores
escore_variance | Variance of estimated scores
escore_max | Max of estimated scores
escore_min | Min of estimated scores

B SYSTEM IMPLEMENTATION
The core framework of JiZhi is implemented in C++. Developers can construct customized stage processors by implementing complex data pre-processing and post-processing logic on top of several simple abstractions. Moreover, JiZhi also provides a Python API extension including a set of commonly used stage processors, such as data_reader, feature_processor, cube_accessor, and model_inference. Engineers, data scientists, and researchers can easily and efficiently construct and customize online inference services for a variety of online recommendation tasks. Figure 10 showcases an example code block based on the Python API on PaddlePaddle, which defines an online inference service in a few lines of code.
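Since the code of Figure 10 is not reproduced here, the sketch below only imitates the style of composing the named processors into a service. The Pipeline class and function signatures are hypothetical, and the real Python interface shipped with the open-source PaddlePaddle Serving release may differ.

```python
# Hypothetical pipeline-construction API in the spirit of Figure 10; it is not
# the actual PaddlePaddle Serving interface.
class Pipeline:
    def __init__(self, name):
        self.name, self.stages = name, []

    def add(self, stage_name, fn):
        self.stages.append((stage_name, fn))
        return self                              # allow fluent chaining

    def run(self, request):
        event = request
        for stage_name, fn in self.stages:       # stages execute in declared order here;
            event = fn(event)                    # the real SEDP runs them asynchronously
        return event

def data_reader(req):        return {"user": req["user"], "items": req["items"]}
def feature_processor(ev):   return {**ev, "features": {i: [len(i)] for i in ev["items"]}}
def cube_accessor(ev):       return {**ev, "params": {i: [0.01] * len(f) for i, f in ev["features"].items()}}
def model_inference(ev):     return {i: sum(a * b for a, b in zip(ev["features"][i], ev["params"][i]))
                                     for i in ev["items"]}

service = (Pipeline("news_feed_ranking")
           .add("data_reader", data_reader)
           .add("feature_processor", feature_processor)
           .add("cube_accessor", cube_accessor)
           .add("model_inference", model_inference))

print(service.run({"user": "u1", "items": ["item-1", "item-42"]}))
```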

Figure 10: Constructing a customized online inference service using the Python API in a few lines of code.