The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Practical Federated Gradient Boosting Decision Trees

Qinbin Li,1 Zeyi Wen,2 Bingsheng He1
1National University of Singapore  2The University of Western Australia
{qinbin, hebs}@comp.nus.edu.sg, [email protected]

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Gradient Boosting Decision Trees (GBDTs) have become very successful in recent years, with many awards in machine learning and data mining competitions. There have been several recent studies on how to train GBDTs in the federated learning setting. In this paper, we focus on horizontal federated learning, where data samples with the same features are distributed among multiple parties. However, existing studies are not efficient or effective enough for practical use. They suffer either from inefficiency due to the use of costly data transformations such as secret sharing and homomorphic encryption, or from low model accuracy due to differential privacy designs. In this paper, we study a practical federated environment with relaxed privacy constraints. In this environment, a dishonest party might obtain some information about the other parties' data, but it is still impossible for the dishonest party to derive the actual raw data of other parties. Specifically, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. We prove that our framework is secure without exposing the original records to other parties, while the computation overhead in the training process is kept low. Our experimental studies show that, compared with normal training with the local data of each party, our approach can significantly improve the predictive accuracy, and achieve comparable accuracy to the original GBDT with the data from all parties.

1 Introduction

Federated learning (FL) (McMahan et al. 2016; Mirhoseini, Sadeghi, and Koushanfar 2016; Shi et al. 2017; Yang et al. 2019; Mohri, Sivek, and Suresh 2019; Li et al. 2019a) has become a hot research area in machine learning. Federated learning addresses the privacy and security issues of model training across multiple parties. In reality, data are dispersed over different areas. For example, people tend to go to nearby hospitals, and the patient records in different hospitals are isolated. Ideally, hospitals may benefit more if they can collaborate with each other to train a model on the joint data. However, due to increasing concerns and more regulations/policies on data privacy, organizations are not willing to share their own raw data records. Also, according to a recent survey (Yang et al. 2019), federated learning can be broadly categorized into horizontal federated learning, vertical federated learning and federated transfer learning. Much research effort has been devoted to developing new learning algorithms in the setting of vertical or horizontal federated learning (Smith et al. 2017; Takabi, Hesamifard, and Ghasemi 2016; Liu, Chen, and Yang 2018; Yurochkin et al. 2019). We refer readers to recent surveys for details (Yang et al. 2019; Li et al. 2019a).

On the other hand, Gradient Boosting Decision Trees (GBDTs) have become very successful in recent years by winning many awards in machine learning and data mining competitions (Chen and Guestrin 2016), as well as through their effectiveness in many applications (Richardson, Dominowska, and Ragno 2007; Kim et al. 2009; Burges 2010; Li et al. 2019b). There have been several recent studies on how to train GBDTs in the federated learning setting (Cheng et al. 2019; Liu et al. 2019; Zhao et al. 2018). For example, SecureBoost (Cheng et al. 2019) developed vertical federated learning with GBDTs. In contrast, this study focuses on horizontal federated learning for GBDTs, where data samples with the same features are distributed among multiple parties.

There have been several studies of GBDT training in the setting of horizontal learning (Liu et al. 2019; Zhao et al. 2018). However, those approaches are not effective or efficient enough for practical use.

Model accuracy: The learned model may not have good predictive accuracy. A recent study adopted differential privacy to aggregate distributed regression trees (Zhao et al. 2018). This approach boosts each tree only with the local data, which does not utilize the information of data from other parties. As we will show in the experiments, the model accuracy is much lower than that of our proposed approach.

Efficiency: The approach of Liu et al. (2019) has a prohibitively time-consuming learning process since it adopts complex cryptographic methods to encrypt the data from multiple parties. Due to a lot of extra cryptographic calculations, the approach brings prohibitively high overhead in the training process. Moreover, since GBDTs have to traverse the feature values to find the best split value, there is a huge number of comparison operations even in the building of a single node.

Considering the previous approaches' limitations on efficiency and model accuracy, this study utilizes a more practical privacy model as a tradeoff between privacy and efficiency/model accuracy (Du, Han, and Chen 2004; Liu, Chen, and Yang 2018). In this environment, a dishonest party might obtain some information about the other parties' data, but it is still impossible for the dishonest party to derive the actual raw data of other parties. Compared to differential privacy or secret sharing, this privacy model is weaker, but it enables new opportunities for designing much more efficient and effective GBDTs.

Specifically, we propose a novel and practical federated learning framework for GBDTs (named SimFL). The basic idea is that, instead of encrypting the feature values, we make use of the similarity between the data of different parties in the training while protecting the raw data. First, we propose the use of locality-sensitive hashing (LSH) in the context of federated learning. We adopt LSH to collect similarity information without exposing the raw data. Second, we design a new approach called Weighted Gradient Boosting (WGB), which can build the decision trees by exploiting the similarity information with bounded errors. Our analysis shows that SimFL satisfies the privacy model (Du, Han, and Chen 2004; Liu, Chen, and Yang 2018). The experimental results show that SimFL achieves good accuracy, while the training is fast enough for practical use.

2 Preliminaries

Locality-Sensitive Hashing (LSH) LSH was first introduced by Gionis et al. (1999) for approximate nearest neighbor search. The main idea of LSH is to select a hashing function such that (1) the hash values of two neighboring points are equal with a high probability and (2) the hash values of two non-neighboring points are not equal with a high probability. A good property of LSH is that there are infinitely many inputs for the same hash value. Thus, LSH has been used to protect user privacy in applications such as keyword searching (Wang et al. 2014) and recommendation systems (Qi et al. 2017).

The previous study (Datar et al. 2004) proposed the p-stable LSH family, which has been widely used. The hash function $F_{a,b}$ is formulated as $F_{a,b}(v) = \lfloor \frac{a \cdot v + b}{r} \rfloor$, where $a$ is a $d$-dimensional vector with entries chosen independently from a p-stable distribution (Zolotarev 1986); $b$ is a real number chosen uniformly from the range $[0, r]$; and $r$ is a positive real number which represents the window size.

Gradient Boosting Decision Trees (GBDTs) The GBDT is an ensemble model which trains a sequence of decision trees. Formally, given a loss function $l$ and a dataset with $n$ instances and $d$ features $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$), GBDT minimizes the following objective function (Chen and Guestrin 2016).

$$\tilde{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) \quad (1)$$

where $\Omega(f) = \gamma T_l + \frac{1}{2}\lambda \|w\|^2$ is a regularization term to penalize the complexity of the model. Here $\gamma$ and $\lambda$ are hyper-parameters, $T_l$ is the number of leaves and $w$ is the leaf weight. Each $f_k$ corresponds to a decision tree. Training the model in an additive manner, GBDT minimizes the following objective function at the $t$-th iteration.

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) \quad (2)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are first and second order gradient statistics on the loss function. The decision tree is built from the root until reaching restrictions such as the maximum depth. Assume $I_L$ and $I_R$ are the instance sets of the left and right nodes after a split. Letting $I = I_L \cup I_R$, the gain of the split is given by

$$L_{split} = \frac{1}{2}\left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \quad (3)$$

GBDT traverses all the feature values to find the split that maximizes the gain.
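For concreteness, the following minimal sketch evaluates the gain in Eq. (3) for a single candidate split from pre-computed gradient sums; the function and variable names are ours for illustration and are not part of the paper's implementation.

```python
# Minimal sketch of the split gain in Eq. (3); names are illustrative, not from the paper.
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting instance set I into I_L and I_R.

    g_left/h_left: sums of first/second order gradients over I_L;
    g_right/h_right: the same sums over I_R.
    """
    def score(g, h):
        return g * g / (h + lam)

    g_all, h_all = g_left + g_right, h_left + h_right
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - score(g_all, h_all)) - gamma

# Example: one candidate split evaluated from aggregated statistics.
print(split_gain(g_left=12.0, h_left=30.0, g_right=-8.0, h_right=25.0, lam=1.0, gamma=0.1))
```

A GBDT implementation evaluates this quantity for every candidate feature value and keeps the split with the largest gain.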
3 Problem Statement

This paper focuses on the application scenarios of horizontal federated learning. Multiple parties have their own data, which share the same set of features. Due to data privacy requirements, they are not willing to share their private data with other parties. However, all parties want to exploit collaboration and benefit from a more accurate model that can be built from the joint data of all parties. Thus, the necessary incentive for this collaboration is that federated learning should generate a much better learned model than the one generated from the local data of each party alone. In other words, (much) better model accuracy is a pre-condition for such collaboration in horizontal federated learning. We can find such scenarios in various applications such as banking and healthcare (Yang et al. 2019).

Specifically, we assume that there are $M$ parties, and each party is denoted by $P_i$ ($i \in [1, M]$). We use $I_m = \{(x_i^m, y_i^m)\}$ ($|I_m| = N_m$, $x_i^m \in \mathbb{R}^d$, $y_i^m \in \mathbb{R}$) to denote the instance set of $P_m$. For simplicity, the instances have global IDs that are unique identifiers among parties (i.e., given two different instances $x_i^m$ and $x_j^n$, we have $i \neq j$).

Privacy model. The previous study (Du, Han, and Chen 2004) proposed a 2-party security model, which has also been adopted in later studies (e.g., (Liu, Chen, and Yang 2018)). We extend the model to multiple parties and obtain the following privacy definition.

Definition 1 (Privacy Model). Suppose all parties are honest-but-curious. For a protocol $C$ performing $(O_1, O_2, ..., O_M) = C(I_1, I_2, ..., I_M)$, where $O_1, O_2, ..., O_M$ are parties $P_1, P_2, ..., P_M$'s outputs and $I_1, I_2, ..., I_M$ are their inputs, $C$ is secure against $P_1$ if there exists an infinite number of tuples $(I_2', ..., I_M', O_2', ..., O_M')$ such that $(O_1, O_2', ..., O_M') = C(I_1, I_2', ..., I_M')$.

Compared with the security definition in secure multi-party computation (Yao 1982), the privacy model in Definition 1 is weaker in the privacy level, as also discussed in the previous study (Du, Han, and Chen 2004).

It does not handle potential risks such as inference attacks. For example, if all the possible inputs for a certain output are close to each other, enough information about the input data may be disclosed (even though the exact input is unknown). However, how likely such attacks are to happen and their impact in real applications are still to be determined. Thus, like the previous studies (Du, Han, and Chen 2004; Liu, Chen, and Yang 2018), we view this model as a heuristic model for privacy protection. More importantly, with such a heuristic model, it is possible to develop practical federated learning models that are much more efficient and have much higher accuracy than previous approaches.

Problem definition. The objective is to build an efficient and effective GBDT model under the privacy model in Definition 1 over the instance set $I = \bigcup_{i=1}^{M} I_i$ ($|I| = N$).

4 The SimFL Framework

In this section, we introduce our framework, Similarity-based Federated Learning (SimFL), which enables the training of GBDTs in a horizontally federated setting.

An Overview of SimFL There are two main stages in SimFL: preprocessing and training. In practice, preprocessing can be done once and reused for many runs of training. Only when the training data are updated does preprocessing have to be performed again. Figure 1 shows the structure of these two stages. In the preprocessing stage, each party first computes the hash values using randomly generated LSH functions. Then, by collecting the hash values from LSH, multiple global hash tables are built and broadcast to all the parties, which can be modelled as an AllReduce communication operation (Patarasuk and Yuan 2009). Finally, each party can use the global hash tables for tree building without accessing the other parties' raw data. In the training stage, all parties together train a number of trees one by one using the similarity information. Once a tree is built in a party, it is sent to all the other parties for the update of gradients. We obtain all the decision trees as the final learned model.

Figure 1: An overview of the SimFL framework ((a) the preprocessing stage: hashing, AllReduce of hash tables, computing similarity; (b) the training stage: boosting trees that are shared among the parties).

4.1 The Preprocessing Stage

For each instance, the aim of the preprocessing stage is to get the IDs of similar instances in the other parties. Specifically, for each party $P_m$, we want to get a matrix $S^m \in \mathbb{R}^{N_m \times M}$, where $S^m_{ij}$ is the ID of the instance in party $P_j$ that is similar to $x_i^m$. To obtain the similarity of any two instances in the joint data without exposing the raw data to the other parties, we adopt the widely used p-stable LSH function (Datar et al. 2004). According to LSH, if two instances are similar, they have a higher probability of being hashed to the same value. Thus, by applying multiple LSH functions, the bigger the number of identical hash values of two instances, the more likely they are to be similar (Gionis et al. 1999).

Algorithm 1 shows the process of the preprocessing stage. Given $L$ randomly generated p-stable hash functions, each party first computes the hash values of its instances. Then we build $L$ global hash tables using an AllReduce operation, which has been well studied in the previous work (Patarasuk and Yuan 2009). Here the inputs to AllReduce are the instance IDs and their hash values from all parties. The reduction operation is to union the instance IDs with the same hash value. We adopt the bandwidth-optimal and contention-free design from that study (Patarasuk and Yuan 2009). After broadcasting the hash tables, each party can compute the similarity information. Specifically, in party $P_m$, given an instance $x_i^m$, the similar instance in another party $P_j$ is the one with the highest count of identical hash values. If there are multiple instances with the same highest count, we randomly choose one as the similar instance. In this way, the data distribution of the other parties can be learned by adopting our weighted gradient boosting strategy, which we will show next.
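The sketch below illustrates, under simplifying assumptions, the two preprocessing steps just described: computing p-stable hash values $F_{a,b}(v) = \lfloor (a \cdot v + b)/r \rfloor$ locally, and then, once all hash values are available (i.e., after the AllReduce), picking for each instance the instance of another party with the highest count of identical hash values. It is a single-process illustration with invented names; in SimFL, the hashing runs on each party and only instance IDs and hash values are exchanged.

```python
import numpy as np

def make_lsh_functions(d, L, r=4.0, seed=0):
    """Generate L p-stable (Gaussian, p=2) hash functions for d-dimensional data."""
    rng = np.random.RandomState(seed)
    a = rng.normal(size=(L, d))       # entries drawn from a 2-stable (normal) distribution
    b = rng.uniform(0.0, r, size=L)   # offsets drawn uniformly from [0, r]
    return a, b, r

def hash_values(X, a, b, r):
    """F_{a,b}(v) = floor((a . v + b) / r) for every instance and every hash function."""
    return np.floor((X @ a.T + b) / r).astype(int)   # shape: (n_instances, L)

def most_similar(h_query, h_other):
    """ID of the instance in the other party with the most identical hash values."""
    matches = (h_other == h_query).sum(axis=1)
    return int(np.argmax(matches))

# Toy run with two parties holding 2-dimensional data.
a, b, r = make_lsh_functions(d=2, L=10)
X_m = np.array([[0.1, 0.2], [5.0, 5.0]])
X_j = np.array([[4.9, 5.1], [0.0, 0.3]])
H_m, H_j = hash_values(X_m, a, b, r), hash_values(X_j, a, b, r)
print([most_similar(H_m[i], H_j) for i in range(len(X_m))])   # likely [1, 0]
```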
4.2 The Training Stage

In the training stage, each party trains a number of trees sequentially. When party $P_m$ is training a tree, to protect the individual records of the other parties, only the local instances $I_m$ are used to learn the tree. The learned trees are shared among the parties during the training process. To exploit the similarity information between the instances from different parties, we propose a new approach to build the decision tree, which is called Weighted Gradient Boosting (WGB). The basic idea is that an instance is important and representative if it is similar to many other instances. Since the gradients can represent the importance of an instance, as shown in previous studies (Chen and Guestrin 2016; Ke et al. 2017), we propose to use weighted gradients in the training. Next, we describe WGB in detail.

The Weighted Gradient Boosting Approach For an instance $x_q^m \in I_m$, let $W^n_{mq} = \{k \mid S^n_{km} = q\}$, which contains the IDs of the instances in $I_n$ that are similar to $x_q^m$. Let $g_q^m$ and $h_q^m$ denote the first order and second order gradients of the loss function at $x_q^m$, respectively.

Algorithm 1: The preprocessing stage

Input: LSH functions {F_k}_{k=1,2,...,L}, instance set I
Output: The similarity matrices S
1  for each party P_m do                          /* conducted on party P_m */
2      for each instance x_i^m in I_m do
3          Compute the hash values {F_k(x_i^m)}_{k=1,2,...,L};
4  Build and broadcast the global hash tables by AllReduce;
5  for m <- 1 to M do                             /* conducted on party P_m */
6      for each instance x_i^m in I_m do
7          for j = 1 to M do                      /* collect similar instance IDs of P_j */
8              if j != m then
9                  Find the instance ID t of P_j with the highest count of identical hash values;
10                 S^m_{ij} <- t;
11             else                               /* in the same party, the most similar instance is itself */
12                 S^m_{ij} <- i;

Algorithm 2: The process of learning a tree

Input: Instance set I
Output: A new decision tree
1  for i = 1 to M, i != m do                      /* conducted on party P_i */
2      G^i_{m*} <- 0, H^i_{m*} <- 0;
3      Update the gradients of the instances in I_i;
4      for each instance x_q^i in I_i do
5          Get the similar instance ID s = S^i_{qm};
6          G^i_{ms} <- G^i_{ms} + g_q^i, H^i_{ms} <- H^i_{ms} + h_q^i;
7      Send G^i_{m*}, H^i_{m*} to P_m;
                                                  /* conducted on party P_m */
8  Update the gradients of the instances in I_m;
9  for each instance x_q^m in I_m do
10     G_{mq} <- 0, H_{mq} <- 0;
11     for i <- 1 to M do
12         if i == m then
13             G_{mq} <- G_{mq} + g_q^m;
14             H_{mq} <- H_{mq} + h_q^m;
15         else
16             G_{mq} <- G_{mq} + G^i_{mq};
17             H_{mq} <- H_{mq} + H^i_{mq};
18 Build a tree with the instances I_m and the weighted gradients G_{m*}, H_{m*};
19 Send the tree to the other parties;

When $P_m$ is building a new tree at the $t$-th iteration, WGB minimizes the following objective function.

$$\tilde{L}_w^{(t)} = \sum_{x_q^m \in I_m} \left[G_{mq} f_t(x_q^m) + \frac{1}{2} H_{mq} f_t^2(x_q^m)\right] + \Omega(f_t) \quad (4)$$

where $G_{mq} = \sum_n \sum_{i \in W^n_{mq}} g_i^n$ and $H_{mq} = \sum_n \sum_{i \in W^n_{mq}} h_i^n$.

Compared with the objective in Eq. (2), Eq. (4) only uses the instances of $I_m$. Instead of using the gradients $g_q^m, h_q^m$ of the instance $x_q^m$, we use $G_{mq}, H_{mq}$, which are the sums of the gradients of the instances that are similar to $x_q^m$ (including $x_q^m$ itself). As shown in Algorithm 2, each party $P_i$ aggregates the gradients of its instances according to their similar instance IDs in $I_m$ and sends them to $P_m$, so the weighted gradients can be easily computed by summing the aggregated gradients. Then, we can build a tree based on the weighted gradients.
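To make the aggregation in Eq. (4) and Algorithm 2 concrete, the snippet below sums the gradients of every remote instance onto the ID of its similar local instance and then adds the local gradients; the array layout and helper names are our own simplification (in the actual protocol, each $G^i_{m*}, H^i_{m*}$ is computed on party $P_i$ and sent to $P_m$).

```python
import numpy as np

def aggregate_gradients(S_col_m, g, h, num_local):
    """Sum one remote party's gradients onto the local instance IDs (cf. Algorithm 2, lines 4-6).

    S_col_m[k] is the ID of the local instance (in I_m) similar to the k-th remote instance;
    g, h are the remote party's first/second order gradients.
    """
    G, H = np.zeros(num_local), np.zeros(num_local)
    np.add.at(G, S_col_m, g)
    np.add.at(H, S_col_m, h)
    return G, H

def weighted_gradients(local_g, local_h, remote_parts):
    """Weighted gradients G_mq, H_mq of Eq. (4): local gradients plus all remote aggregates."""
    G, H = local_g.copy(), local_h.copy()
    for S_col_m, g, h in remote_parts:
        Gi, Hi = aggregate_gradients(S_col_m, g, h, len(local_g))
        G += Gi
        H += Hi
    return G, H

# Toy example: party m holds 3 instances; one remote party holds 4 instances
# whose similar local IDs are [0, 2, 2, 1].
local_g, local_h = np.array([0.5, -0.2, 0.1]), np.array([1.0, 1.0, 1.0])
remote = [(np.array([0, 2, 2, 1]), np.array([0.3, 0.1, -0.4, 0.2]), np.ones(4))]
print(weighted_gradients(local_g, local_h, remote))
```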

5 Theoretical Analysis

5.1 Privacy Level Analysis

Theorem 1. The SimFL protocol satisfies the privacy model definition if $L < d$.

Intuitively, when the number of hash functions is smaller than the number of features, infinitely many inputs produce the same hash values, so the broadcast hash tables cannot determine any party's raw records.

5.2 The Error of Weighted Gradient Boosting

Here we analyze the approximation error as well as the generalization error of WGB.

Theorem 2. For simplicity, we assume that the feature values of each dimension are i.i.d. uniform random variables and that the split value is randomly chosen from the feature values. Suppose $P_m$ is learning a new tree. We denote the approximation error of WGB as $\varepsilon^t = |\tilde{L}_w^{(t)} - \tilde{L}^{(t)}|$. Let $d_t = \max_{r,j} \|x^m_{S^j_{rm}} - x^j_r\|_1$, $d_m = \max_{i,j} \|x_i - x_j\|_1$, $g = \max_i |g_i|$, $h = \max_i |h_i|$, and $f_t = \max_i |f_t(x_i)|$. Then, with probability at least $1 - \delta$, we have

$$\varepsilon^t \le \left[\left(1 - \left(1 - \frac{d_t}{d_m}\right)^D\right)(N - N_m) + \sqrt{\frac{(N - N_m)\ln\frac{1}{\delta}}{2}}\right] \cdot \left(2 g f_t + \frac{1}{2} h f_t^2\right) \quad (5)$$

where $D$ is the depth of the tree, $N$ is the number of instances in $I$, and $N_m$ is the number of instances in $I_m$.

The detailed proof of Theorem 2 is available in Appendix B of the full version (Li, Wen, and He 2019). According to Theorem 2, the upper bound of the approximation error is $O(N - N_m)$ with respect to the number of instances. The approximation error of WGB may increase as the number of instances of the other parties increases. However, let us consider the generalization error of WGB, which can be formulated as $\varepsilon_{gen}^{WGB}(t) = |\tilde{L}_w^{(t)} - \tilde{L}_*^{(t)}| \le |\tilde{L}_w^{(t)} - \tilde{L}^{(t)}| + |\tilde{L}^{(t)} - \tilde{L}_*^{(t)}| \triangleq \varepsilon^t + \varepsilon_{gen}(t)$. The generalization error of vanilla GBDTs (i.e., $\varepsilon_{gen}(t)$) tends to decrease as the number of training instances $N$ increases (Shalev-Shwartz and Ben-David 2014). Thus, WGB may still have a low generalization error as $N$ increases.

5.3 Computation and Communication Efficiency

Here we analyze the computation and communication overhead of SimFL. Suppose we have $T$ trees, $M$ parties, $N$ total training instances, and $L$ hash functions.

Computation Overhead (1) In the preprocessing stage, we first need to compute the hash values, which costs $O(Nd)$. Then, we need to union the instance IDs with the same hash value and compute the similarity information. The union operation can be done in $O(NL)$ since we have to traverse $NL$ hash values in total. When computing the similarity information for an instance $x_i^m$, a straightforward method is to union the instance IDs of the $L$ buckets with hash values $\{F_k(x_i^m)\}_{k=1,2,...,L}$ and conduct a linear search to find the instance ID with the maximum frequency. On average, each hash bucket has a constant number of instances, so for each instance the linear search can be done in $O(L)$. Thus, the total computation overhead in the preprocessing stage is $O(NL + Nd)$ on average. (2) In the training stage, the process of building a tree is the same as for vanilla GBDTs except for computing the weighted gradients. The calculation of the weighted gradients is done by simple sum operations, which is $O(N)$. Thus, the computation overhead is $O(NT)$ in the training.

Communication Overhead Suppose each real number uses 4 bytes of storage. (1) In the preprocessing stage, according to the previous study (Patarasuk and Yuan 2009), since we have $NL$ hash values and the corresponding instance IDs to share, the total communication cost of the AllReduce operation is $8MNL$ bytes. (2) In the training stage, each party has to send the aggregated gradients. The size of the gradients is no more than $8N$ bytes (including $g$ and $h$). After building a tree, $P_m$ has to send the tree to the other parties. Suppose the depth of each tree is $D$. Our implementation uses 8 bytes to store the split value and the feature ID of each node; therefore, the size of a tree is $8(2^D - 1)$ bytes. Since each tree has to be sent to $(M - 1)$ parties, the communication cost for one tree is $8(2^D - 1)(M - 1)$ bytes. The total communication overhead for a tree is thus $8[N + (2^D - 1)(M - 1)]$ bytes, and since there are $T$ trees, the communication overhead in the training stage is $8T[N + (2^D - 1)(M - 1)]$ bytes.
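As a quick sanity check on this counting, the snippet below plugs illustrative values (our own, not taken from the paper's experiments) into the per-tree cost $8[N + (2^D - 1)(M - 1)]$ bytes and the total training cost $8T[N + (2^D - 1)(M - 1)]$ bytes.

```python
# Illustrative numbers only; N, M, D, T here are not taken from the paper's experiments.
N, M, D, T = 1_000_000, 10, 8, 500   # instances, parties, tree depth, number of trees

per_tree_bytes = 8 * (N + (2**D - 1) * (M - 1))   # gradients plus broadcasting one tree
training_bytes = T * per_tree_bytes               # over all T trees

print(f"per tree: {per_tree_bytes / 1e6:.1f} MB")  # ~8.0 MB for these values
print(f"training: {training_bytes / 1e9:.1f} GB")
```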
6 Experiments

We present the effectiveness and efficiency of SimFL. To understand the model accuracy of SimFL, we compare SimFL with two approaches: 1) SOLO: each party only trains vanilla GBDTs with its local data. This comparison shows the incentive for using SimFL. 2) ALL-IN: a party trains vanilla GBDTs with the joint data from all parties without any concern for privacy. This comparison demonstrates the potential accuracy loss of achieving the privacy model. We also compare SimFL with the distributed boosting framework proposed by Zhao et al. (2018) (referred to as TFL, Tree-based Federated Learning). Here we only adopt their framework and do not add any noise in the training (their system also adds noise to the training for differential privacy).

We conducted the experiments on a machine running Linux with two Xeon E5-2640 v4 10-core CPUs, 256GB of main memory and a Tesla P100 GPU with 12GB of memory. To take advantage of the GPU, we use ThunderGBM (Wen et al. 2019) in our study. We use six public datasets from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html), as listed in Table 1. We use 75% of each dataset for training and the remainder for testing. The maximum depth of the trees is set to 8. For the LSH functions, we choose $r = 4.0$ and $L = \min\{40, d - 1\}$, where $d$ is the dimension of the dataset. The total number of trees is set to 500 in all approaches. Due to the randomness of the LSH functions, the results of SimFL may differ across runs. We run SimFL 10 times in each experiment and report the average, minimal and maximum errors.

In reality, the distribution of the data in different parties may vary. For example, due to the ozone hole, the countries in the Southern Hemisphere may have more skin cancer patients than those in the Northern Hemisphere. Thus, like a previous work (Yurochkin et al. 2019), we consider two ways to divide the training dataset to simulate different data distributions in the federated setting: unbalanced partition and balanced partition. In the unbalanced partition, we divide a dataset with two classes (i.e., 0 and 1) into subsets where each subset has a relatively large proportion of the instances in one class. Given a ratio $\theta \in (0, 1)$, we randomly sample a $\theta$ proportion of the instances with class 0 and a $(1 - \theta)$ proportion of the instances with class 1 to form one subset, and the remaining instances form the other subset. Then, we can split these two subsets into more parts equally and randomly to simulate more parties. In the balanced partition, we simply split the dataset equally and randomly assign the instances to different parties.
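The sketch below shows one way to produce the two-party unbalanced split just described; it is our own illustration of the stated procedure with invented function names, not the authors' script.

```python
import numpy as np

def unbalanced_partition(y, theta, seed=0):
    """Return index arrays of the two subsets for a binary-labelled dataset."""
    rng = np.random.RandomState(seed)
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    rng.shuffle(idx0)
    rng.shuffle(idx1)
    cut0 = round(theta * len(idx0))          # theta proportion of class 0
    cut1 = round((1 - theta) * len(idx1))    # (1 - theta) proportion of class 1
    part_a = np.concatenate([idx0[:cut0], idx1[:cut1]])   # first subset
    part_b = np.concatenate([idx0[cut0:], idx1[cut1:]])   # the remaining instances
    return part_a, part_b

# Toy example: 5 instances of each class, theta = 0.8.
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
a_idx, b_idx = unbalanced_partition(y, theta=0.8)
print(len(a_idx), len(b_idx))   # 5 5, with part_a dominated by class 0
```

Each of the two subsets can then be split further, equally and at random, to simulate more than two parties.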

Table 1: Datasets used in the experiments.

dataset    cardinality  dimension  data size
a9a        32,561       123        16MB
cod-rna    59,535       9          2.1MB
real-sim   72,309       20,958     6.1GB
ijcnn1     49,990       22         4.4MB
SUSY       1,000,000    18         72MB
HIGGS      1,000,000    28         112MB

Table 2: The test errors of different approaches (θ = 80%).

                   SimFL
dataset    avg     min     max     TFL     SOLO_A  SOLO_B  ALL-IN
a9a        17.0%   16.7%   17.2%   23.1%   19.1%   22.0%   15.1%
cod-rna    6.5%    6.3%    6.7%    7.5%    9.6%    8.2%    6.13%
real-sim   8.3%    8.2%    8.4%    10.1%   11.8%   14.4%   6.5%
ijcnn1     4.5%    4.5%    4.6%    4.7%    5.9%    4.9%    3.7%
SUSY       23.3%   23.1%   23.4%   32.6%   25.4%   32.8%   21.38%
HIGGS      30.9%   30.8%   31.0%   36.1%   37.1%   35.6%   29.4%

6.1 Test Errors

We first divide the datasets into two parts using the unbalanced partition with the ratio θ = 80%, and assign them to two parties A and B. The test errors are shown in Table 2. We have the following observations. First, the test errors of SimFL are always lower than those of SOLO on data parts A and B (denoted as SOLO_A and SOLO_B respectively). The accuracy can be improved by about 4% on average by performing SimFL. Second, the test error of SimFL is close to ALL-IN. Third, compared with TFL, SimFL has much lower test errors. The test errors of TFL are always bigger than those of SOLO, which discourages the adoption of TFL in practice. In other words, TFL's approach of aggregating decision trees from multiple parties cannot improve the prediction over the local data of individual parties.

To figure out the impact of the ratio θ, we conduct experiments with different ratios ranging from 60% to 90%. The results are shown in Figure 2. Compared with TFL and SOLO, SimFL works quite well, especially when θ is high. The similarity information can effectively help to discover the distribution of the joint data among multiple parties. The variation of SimFL is low across multiple runs.

Figure 2: The test error with different ratio θ (panels (a) a9a, (b) cod-rna, (c) real-sim, (d) ijcnn1, (e) SUSY, (f) HIGGS).

Table 3: The test errors of different approaches (balanced partition).

                   SimFL
dataset    avg     min     max     TFL     SOLO_A  SOLO_B  ALL-IN
a9a        15.1%   14.9%   15.2%   15.6%   15.5%   15.9%   15.1%
cod-rna    6.0%    5.9%    6.1%    6.3%    6.4%    6.7%    6.1%
real-sim   7.1%    6.9%    7.2%    7.4%    8.4%    7.9%    6.5%
ijcnn1     3.6%    3.5%    3.7%    3.8%    3.8%    4.2%    3.7%
SUSY       19.6%   19.4%   19.9%   20.2%   20.1%   20.1%   20.0%
HIGGS      29.3%   29.2%   29.5%   31.4%   30.2%   30.5%   29.3%

Table 3 shows the errors with the setting of the balanced partition. It seems that for the six datasets, the local data in each party are good enough to train the model: the test errors of SOLO and ALL-IN are very close to each other. However, SimFL still performs better than SOLO and TFL, and sometimes even has a lower test error than ALL-IN. Even in a balanced partition, SimFL has a good performance.

Figure 3 and Figure 4 show the test errors with different numbers of parties. The ratio θ is set to 80% for the unbalanced partition. As the number of parties increases, according to Section 5.2, the upper bound of the generalization error of SimFL may increase since N is fixed and (N − N_m) increases, which agrees with the experimental results. Still, SimFL has a lower test error than the minimal error of SOLO most of the time. While the test error of TFL changes dramatically as the number of parties increases, SimFL is much more stable in both the balanced and unbalanced data partitions.

Figure 3: The impact of the number of parties (θ = 80%) (panels (a) a9a, (b) cod-rna, (c) real-sim, (d) ijcnn1, (e) SUSY, (f) HIGGS).

Figure 4: The impact of the number of parties (balanced partition) (panels (a) a9a, (b) cod-rna, (c) real-sim, (d) ijcnn1, (e) SUSY, (f) HIGGS).

6.2 Efficiency

To show the training time efficiency of SimFL, we present the time and communication costs in the training stage and the preprocessing stage (denoted as prep) in Table 4. The number of parties is set to 10 and we adopt a balanced partition here. First, we can observe that the training time of SimFL is very close to that of SOLO; the computation overhead is less than 10% of the total training time. Second, since SimFL only needs the local instances to build a tree while ALL-IN needs all the instances, the training time of SimFL is smaller than that of ALL-IN. Moreover, the preprocessing overhead is acceptable since it may only need to be performed once and can be reused for future training. One common scenario is that a user may try many runs of training with different hyper-parameter settings, and the preprocessing cost can be amortized among those runs. Last, the communication overhead per tree is very low, costing no more than 10MB. In the encryption-based methods (Liu et al. 2019), since additional keys need to be transferred, the communicated data size per tree can be much larger than ours.

Table 4: The training time (sec), preprocessing time (sec), communication cost (MB) per party in the preprocessing, and communication cost (MB) per tree in the training.

           ALL-IN     SOLO       SimFL
           training   training   time                communication
dataset    time       time       training   prep     training   prep
a9a        21.6       15.2       17.2       135      0.28       10.4
cod-rna    24.7       16.3       17.9       189      0.49       4.3
real-sim   69.5       32.4       34.5       380      0.6        23.1
ijcnn1     29.3       15.7       17.4       228      0.42       8.4
SUSY       204.1      34.6       43.1       963      8.0        136
HIGGS      226.6      39.8       44.8       996      8.0        216

7 Conclusions

The success of federated learning highly relies on the training time efficiency and the accuracy of the learned model. However, we find that existing horizontal federated learning systems for GBDTs suffer from low efficiency and/or low model accuracy. Based on an established relaxed privacy model, we propose a practical federated learning framework, SimFL, for GBDTs that exploits similarity. We take advantage of efficient locality-sensitive hashing to collect the similarity information without exposing the individual records, in contrast to the costly secret sharing/encryption operations in previous studies. By designing a weighted gradient boosting method, we can utilize the similarity information to build decision trees with bounded errors. We prove that SimFL satisfies the privacy model. The experiments show that SimFL significantly improves the predictive accuracy compared with training on the data of individual parties alone, and is close to the model trained on the joint data from all parties.

Acknowledgements

This work is supported by a MoE AcRF Tier 1 grant (T1 251RES1824) and a MOE Tier 2 grant (MOE2017-T2-1-122) in Singapore. The authors thank Qiang Yang and Yang Liu for their insightful suggestions.

References

Burges, C. J. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11(23-581):81.
Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In KDD, 785–794. ACM.
Cheng, K.; Fan, T.; Jin, Y.; Liu, Y.; Chen, T.; and Yang, Q. 2019. SecureBoost: A lossless federated learning framework. arXiv preprint arXiv:1901.08755.
Datar, M.; Immorlica, N.; Indyk, P.; and Mirrokni, V. S. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, 253–262. ACM.
Du, W.; Han, Y. S.; and Chen, S. 2004. Privacy-preserving multivariate statistical analysis: Linear regression and classification. In SDM, 222–233. SIAM.
Gionis, A.; Indyk, P.; Motwani, R.; et al. 1999. Similarity search in high dimensions via hashing. In VLDB, volume 99, 518–529.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T.-Y. 2017. LightGBM: A highly efficient gradient boosting decision tree. In NeurIPS.
Kim, S.-M.; Pantel, P.; Duan, L.; and Gaffney, S. 2009. Improving web page classification by label-propagation over click graphs. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, 1077–1086. ACM.
Ladyzhenskaia, O. A.; Solonnikov, V. A.; and Ural'ceva, N. N. 1968. Linear and Quasi-linear Equations of Parabolic Type, volume 23. American Mathematical Soc.
Li, Q.; Wen, Z.; Wu, Z.; Hu, S.; Wang, N.; and He, B. 2019a. Federated learning systems: Vision, hype and reality for data privacy and protection. arXiv preprint arXiv:1907.09693.
Li, Q.; Wu, Z.; Wen, Z.; and He, B. 2019b. Privacy-preserving gradient boosting decision trees. AAAI-20, arXiv preprint arXiv:1911.04209.
Li, Q.; Wen, Z.; and He, B. 2019. Practical federated gradient boosting decision trees. arXiv preprint arXiv:1911.04206.
Liu, Y.; Ma, Z.; Liu, X.; Ma, S.; Nepal, S.; and Deng, R. 2019. Boosting privately: Privacy-preserving federated extreme boosting for mobile crowdsensing. arXiv preprint arXiv:1907.10218.
Liu, Y.; Chen, T.; and Yang, Q. 2018. Secure federated transfer learning. arXiv preprint arXiv:1812.03337.
McMahan, H. B.; Moore, E.; Ramage, D.; and y Arcas, B. A. 2016. Federated learning of deep networks using model averaging. CoRR abs/1602.05629.
Mirhoseini, A.; Sadeghi, A.-R.; and Koushanfar, F. 2016. CryptoML: Secure outsourcing of big data machine learning applications. In 2016 IEEE International Symposium on Hardware Oriented Security and Trust (HOST).
Mohri, M.; Sivek, G.; and Suresh, A. T. 2019. Agnostic federated learning. In Chaudhuri, K., and Salakhutdinov, R., eds., ICML, volume 97 of Proceedings of Machine Learning Research, 4615–4625. PMLR.
Patarasuk, P., and Yuan, X. 2009. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69(2):117–124.
Qi, L.; Zhang, X.; Dou, W.; and Ni, Q. 2017. A distributed locality-sensitive hashing-based approach for cloud service recommendation from multi-source data. IEEE Journal on Selected Areas in Communications 35(11):2616–2624.
Richardson, M.; Dominowska, E.; and Ragno, R. 2007. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, 521–530. ACM.
Shalev-Shwartz, S., and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Shi, E.; Chan, T.-H. H.; Rieffel, E.; and Song, D. 2017. Distributed private data analysis: Lower bounds and practical constructions. ACM Transactions on Algorithms (TALG) 13(4):50.
Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems, 4424–4434.
Takabi, H.; Hesamifard, E.; and Ghasemi, M. 2016. Privacy preserving multi-party machine learning with homomorphic encryption. In 29th Annual Conference on Neural Information Processing Systems.
Wang, B.; Yu, S.; Lou, W.; and Hou, Y. T. 2014. Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud. In IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, 2112–2120. IEEE.
Wen, Z.; Shi, J.; Li, Q.; He, B.; and Chen, J. 2019. ThunderGBM: Fast GBDTs and random forests on GPUs. https://github.com/Xtra-Computing/thundergbm.
Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 10(2):12:1–12:19.
Yao, A. C.-C. 1982. Protocols for secure computations. In FOCS, volume 82, 160–164.
Yurochkin, M.; Agarwal, M.; Ghosh, S.; Greenewald, K.; Hoang, N.; and Khazaeni, Y. 2019. Bayesian nonparametric federated learning of neural networks. In Chaudhuri, K., and Salakhutdinov, R., eds., ICML, volume 97 of Proceedings of Machine Learning Research, 7252–7261. PMLR.
Zhao, L.; Ni, L.; Hu, S.; Chen, Y.; Zhou, P.; Xiao, F.; and Wu, L. 2018. InPrivate Digging: Enabling tree-based distributed data mining with differential privacy. In INFOCOM, 2087–2095. IEEE.
Zolotarev, V. M. 1986. One-dimensional Stable Distributions, volume 65. American Mathematical Soc.
