Architectural Patterns for the Design of Federated Learning Systems
Sin Kit Lo (a,b), Qinghua Lu (a,b), Liming Zhu (a,b), Hye-Young Paik (b), Xiwei Xu (a,b) and Chen Wang (a)

(a) Data61, CSIRO, Australia
(b) University of New South Wales, Australia

Keywords: Federated Learning, Pattern, Software Architecture, Machine Learning, Artificial Intelligence

ABSTRACT

Federated learning has received fast-growing interest from academia and industry as a way to tackle the challenges of data hungriness and privacy in machine learning. A federated learning system can be viewed as a large-scale distributed system with different components and stakeholders, as numerous client devices participate in federated learning. Designing a federated learning system requires software system design thinking in addition to machine learning knowledge. Although much effort has been put into federated learning from the machine learning technique aspects, the software architecture design concerns in building federated learning systems have been largely ignored. Therefore, in this paper, we present a collection of architectural patterns to deal with the design challenges of federated learning systems. Architectural patterns present reusable solutions to a commonly occurring problem within a given context during software architecture design. The presented patterns are based on the results of a systematic literature review and include three client management patterns, four model management patterns, three model training patterns, and four model aggregation patterns. The patterns are associated with particular state transitions in a federated learning model lifecycle, serving as guidance for the effective use of the patterns in the design of federated learning systems.

1. Introduction

The ever-growing use of big data systems, industrial-scale IoT platforms, and smart devices contributes to the exponential growth in data dimensions [30]. This exponential growth of data has accelerated the adoption of machine learning in many areas, such as natural language processing and computer vision. However, many machine learning systems suffer from insufficient training data, mainly due to increasing concerns over data privacy, e.g., restrictions on data sharing with external systems for machine learning purposes. For instance, the General Data Protection Regulation (GDPR) [1] stipulates a range of data protection measures, and data privacy is now one of the most important ethical principles expected of machine learning systems [22]. Furthermore, raw data usually cannot be used directly for model training: it needs to be studied and pre-processed first, and data sharing restrictions increase the difficulty of obtaining training data. Moreover, concept drift [27] also occurs when new data is constantly collected, replacing the outdated data. This makes a model trained on previous data degrade at a much faster rate. Hence, a new technique that can swiftly produce models that adapt to concept drift as different data is discovered in clients is essential.

To effectively address the lack of training data, concept drift, and data-sharing restrictions while still enabling effective data inference by data-hungry machine learning models, Google introduced the concept of federated learning [33] in 2016. Fig. 1 illustrates the overview of federated learning. There are three stakeholders in a federated learning system: (1) learning coordinator (i.e., system owner), (2) contributor client - data contributor & local model trainer, and (3) user client - model user. Note that a contributor client can also be a user client. There are two types of system nodes (i.e., hardware components): (1) central server and (2) client device.

[Figure 1: Federated Learning Overview. A central server performs model aggregation over the local models trained on client devices; the system owner supplies the task script, and the resulting global model is returned to the clients.]

A Motivation Example

Imagine we use federated learning to train the next-word prediction model of a mobile phone keyboard application. The learning coordinator is the provider of the keyboard application, while the contributor clients are the mobile phone users. The user clients are the new or existing mobile phone users of the keyboard application. The differences in ownership, geolocation, and usage pattern cause the local data to possess non-IID (not independent and identically distributed) characteristics, which is a design challenge of federated learning systems.
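To make the non-IID challenge tangible, the sketch below (our own illustration, not code from the paper; the Dirichlet-based partitioning scheme and all names are assumptions) simulates the kind of label skew that differences in ownership and usage produce, by splitting one labelled dataset across several clients:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with label skew.

    Smaller alpha => stronger skew, i.e. more strongly non-IID clients.
    Illustrative helper only; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Per-client share of this class, drawn from a Dirichlet prior.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, splits)):
            client_indices[client_id].extend(part.tolist())
    return client_indices

# Example: 10,000 samples over 10 classes spread across 5 clients.
labels = np.random.default_rng(1).integers(0, 10, size=10_000)
for i, part in enumerate(dirichlet_partition(labels, num_clients=5, alpha=0.1)):
    print(f"client {i}: {np.bincount(labels[part], minlength=10)}")
```

With a small alpha, each simulated client sees only a few classes in bulk, mirroring how one phone user's typing data over-represents their own vocabulary.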
The federated learning process starts when a training task is created by the learning coordinator. For instance, the keyboard application provider produces and embeds an initial global model into the keyboard applications. The initial global model (including the task scripts & training hyperparameters) is broadcast to the participating client devices. After receiving the initial model, model training is performed locally across the client devices, without any centralised collection of raw client data. Here, the smartphones that have the keyboard application installed receive the model to be trained. The client devices typically run on different operating systems and have diverse communication and computation resources, which is defined as the system heterogeneity challenge.

Each training round takes one step of gradient descent on the current model using each client's local data. In this case, the smartphones optimise the model using the keyboard typing data. After each round, the model update is submitted by each participating client device to the central server. The central server collects all the model updates and performs model aggregation to form a new version of the global model. The new global model is then redistributed to the client devices for the next training round. This entire process iterates until the global model converges, as the sketch below illustrates.
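The following is a minimal simulation of this round-based protocol (the linear model, least-squares loss, and names such as local_update and aggregate are our assumptions, not the paper's): each client takes one gradient-descent step on its private shard, and the server forms the new global model as a data-size-weighted average of the client updates, in the spirit of FedAvg.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1):
    """One local gradient-descent step on a least-squares loss (illustrative)."""
    grad = 2 * X.T @ (X @ global_w - y) / len(y)
    return global_w - lr * grad

def aggregate(client_weights, client_sizes):
    """FedAvg-style aggregation: average weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
# Each client holds a private data shard; raw data never leaves the client.
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 3))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

global_w = np.zeros(3)
for round_id in range(100):  # communication rounds
    updates = [local_update(global_w, X, y) for X, y in clients]  # local training
    global_w = aggregate(updates, [len(y) for _, y in clients])   # model aggregation

print(global_w)  # approaches true_w as the global model converges
```

Note that only model parameters cross the network while the raw data stays on the clients, which is exactly why communication efficiency, discussed next, becomes the bottleneck.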
As a result, communication and computation efficiency are crucial, as many local computation and communication rounds are required for the global model to converge. Moreover, due to the limited resources available on each device, the device owners might be reluctant to participate in the federated learning process; client motivatability therefore becomes a design challenge. Furthermore, the central server communicates with multiple devices for the model exchange, which makes it vulnerable to a single point of failure. The trustworthiness of the central server and the possibility of adversarial nodes participating in the training process also create system reliability and security challenges. After the completion of training, the learning coordinator stores the converged model and deploys it to the user clients. For instance, the keyboard application provider stores the converged model and embeds it in the latest version of the application for existing or new application users.

The design challenges of federated learning systems are summarised as follows.

• Global models might have low accuracy and lack generality when client devices generate non-IID data. Centralising and randomising the data is the approach adopted by conventional machine learning to deal with data heterogeneity, but the inherent privacy-preserving nature of federated learning renders such techniques inappropriate.

• To generate high-quality global models that are adaptive to concept drift, multiple rounds of communication are required to exchange local model updates, which could incur high communication costs.

• Client devices have limited resources to perform the multiple rounds of model training and communication required by the system, which may affect the model quality.

• As numerous client devices participate in federated learning, it is challenging to coordinate the learning process and ensure model provenance, system reliability, and security.

How to select appropriate designs to fulfill different software quality requirements and design constraints is non-trivial for such a complex distributed system. Although much effort has been put into federated learning from the machine learning techniques side, there is still a gap in the architectural design considerations of federated learning systems. Systematic guidance on the architecture design of federated learning systems is required to better leverage the existing solutions and promote federated learning to enterprise-level adoption.

Research Contribution

In this paper, we present a collection of patterns for the design of federated learning systems. In software engineering, an architectural pattern is a reusable solution to a problem that occurs commonly within a given context in software design [6]. Our pattern collection includes three client management patterns, four model management patterns, three model training patterns, and four model aggregation patterns. We define the lifecycle of a model in a federated learning system and associate each identified pattern to a particular state transition in that lifecycle.
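To make the lifecycle framing concrete, the stages this section has already walked through (task creation, broadcast, local training, aggregation, convergence, deployment) could be encoded as an explicit state machine. The state names and transition table below are our own reading of the narrative above, not the paper's formal lifecycle:

```python
from enum import Enum, auto

class ModelState(Enum):
    """Stages a federated model passes through, per the narrative above."""
    TASK_CREATED = auto()    # learning coordinator creates the training task
    BROADCAST = auto()       # global model sent to participating client devices
    LOCAL_TRAINING = auto()  # clients train on their private local data
    AGGREGATED = auto()      # server merges client updates into a new global model
    CONVERGED = auto()       # global model meets the stopping criterion
    DEPLOYED = auto()        # coordinator ships the model to user clients

# Legal transitions; aggregation loops back to broadcast until convergence.
TRANSITIONS = {
    ModelState.TASK_CREATED: {ModelState.BROADCAST},
    ModelState.BROADCAST: {ModelState.LOCAL_TRAINING},
    ModelState.LOCAL_TRAINING: {ModelState.AGGREGATED},
    ModelState.AGGREGATED: {ModelState.BROADCAST, ModelState.CONVERGED},
    ModelState.CONVERGED: {ModelState.DEPLOYED},
    ModelState.DEPLOYED: set(),
}

def advance(current: ModelState, nxt: ModelState) -> ModelState:
    """Move the model to the next lifecycle state, rejecting illegal jumps."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {nxt.name}")
    return nxt
```

Associating each architectural pattern with one of these transitions (e.g., a model aggregation pattern with the LOCAL_TRAINING to AGGREGATED step) is the organising idea that the rest of the paper develops.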