Privacy Leakage of Real-World Vertical Federated Learning
arXiv:2011.09290v2 [cs.CR] 8 Apr 2021

Haiqin Weng∗, Juntao Zhang∗, Feng Xue∗, Tao Wei∗, Shouling Ji†, Zhiyuan Zong∗
∗Ant Group {haiqin.wenghaiqin, tilin.zjt, henry.xuef, lenx.wei, [email protected]
†Zhejiang University [email protected]
*Work in progress.

Abstract—Federated learning enables mutually distrusting participants to collaboratively learn a distributed machine learning model without revealing anything but the model's output. Generic federated learning has been studied extensively, and several learning protocols, as well as open-source frameworks, have been developed. Yet, their over-pursuit of computing efficiency and fast implementation might diminish the security and privacy guarantees of participants' training data, about which little is known thus far.

In this paper, we consider an honest-but-curious adversary who participates in training a distributed ML model, does not deviate from the defined learning protocol, but attempts to infer private training data from the legitimately received information. In this setting, we design and implement two practical attacks, the reverse sum attack and the reverse multiplication attack, neither of which affects the accuracy of the learned model. By empirically studying the privacy leakage of two learning protocols, we show that our attacks are (i) effective: the adversary successfully steals the private training data, even when the intermediate outputs are encrypted to protect data privacy; (ii) evasive: the adversary's malicious behavior neither deviates from the protocol specification nor deteriorates the accuracy of the target model; and (iii) easy: the adversary needs little prior knowledge about the data distribution of the target participant. We also show experimentally that the leaked information is as effective as the raw training data, by training an alternative classifier on the leaked information. We further discuss potential countermeasures and their challenges, which we hope may lead to several promising research directions.

Index Terms—Federated machine learning; Privacy leakage.

I. INTRODUCTION

Federated learning provides a promising way for different participants to train models on their private data without revealing any information beyond the outcome [30], [50], [22], [3]. The main goal of federated learning is to protect the privacy of the participants' training data while effectively and efficiently learning a joint machine learning model from that data. Compelling applications include using data from multiple hospitals to train predictive models for patient survival [27] and using data from users' smartphones to train next-word prediction models [24].

To support such practical applications, recent years have witnessed a great number of open-source frameworks for training distributed models in federated learning. For example, many industry leaders have developed and open-sourced frameworks for secure and private machine learning, including PySyft [36], FATE [16], TF Federated [17], TF Encrypted [13] and CrypTen [10]. These are all designed to enable fast development of privacy-preserving models by non-experts who want an efficient protocol for training distributed machine learning models. With the advance of open-source frameworks, federated learning has become a mainstream choice when different data holders want to train a distributed model on their private data collaboratively.

Vertical Federated Learning. Based on the characteristics of data distribution, general federated learning can be categorized into horizontal federated learning [2], [3], where the distributed datasets share the same feature space but different samples, and vertical federated learning [50], [22], where the distributed datasets share the same samples but differ in feature space. Vertical federated learning can also be viewed as privacy-preserving machine learning, which is tightly related to multi-party computation. Compared with the horizontal setting, vertical federated learning is more complicated, as a specific learning protocol must be designed for each machine learning algorithm. In our work, we mainly focus on vertical federated learning since it has a wider scope of applications and is more suitable for joint model training between large enterprises. For example, consider two companies in the same city, one a bank and the other an e-commerce company. They share a large percentage of users but differ in user attributes. In this case, vertical federated learning is the right means for these two companies to collaboratively train a distributed model such as a risk-management model [50].
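To make the two settings concrete, the sketch below partitions a toy dataset both ways; the arrays, split points, and feature names are illustrative assumptions rather than details from the paper.

```python
import numpy as np

# Toy dataset: 6 users (rows) with 4 features (columns) and one label each.
X = np.arange(24, dtype=float).reshape(6, 4)
y = np.array([0, 1, 0, 1, 1, 0])

# Horizontal federated learning: each party holds different samples drawn
# from the same feature space (and the labels for its own samples).
bank_h, shop_h = X[:3], X[3:]
bank_h_y, shop_h_y = y[:3], y[3:]

# Vertical federated learning: the parties hold the SAME samples (aligned
# on a shared user ID) but different feature columns; typically only the
# "active" party (here, the bank) also holds the labels.
bank_v = X[:, :2]   # e.g., account balance, credit history
shop_v = X[:, 2:]   # e.g., purchase frequency, basket size
bank_labels = y     # e.g., default / no default

print(bank_h.shape, shop_h.shape)  # (3, 4) (3, 4)
print(bank_v.shape, shop_v.shape)  # (6, 2) (6, 2)
```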
Nevertheless, many learning protocols of vertical federated learning sacrifice data security and user privacy for computing efficiency. We observe that this sacrifice makes practical joint model training vulnerable to an adversary, resulting in the leakage of private data. Take the privacy-preserving version of logistic regression [22], [50] as an example: it employs a third-party coordinator to coordinate the model training procedure and to accelerate training. In this learning protocol, the third-party coordinator holds an RSA private key and sends the public key to the computing participants, who use it to encrypt their intermediate outputs. The coordinator then combines the encrypted outputs, decrypts the combination with its private key, and sends the decrypted result back to the computing participants. Obviously, such a training procedure leaks far too much private information to the coordinator. Even worse, in practice we can hardly expect an honest third-party coordinator that never deviates from the protocol specification and never gathers private information from the limited outputs. On the contrary, there will probably exist a malicious adversary who mildly controls part of the distributed training process to infer participants' private training data from the limited intermediate outputs.
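The sketch below illustrates this coordinator-centered exchange and why the coordinator ends up seeing plaintext aggregates. It is a minimal sketch, not the protocol from [22], [50]: it assumes the python-paillier package (`phe`) as a stand-in for an additively homomorphic scheme, and the intermediate values are made-up numbers.

```python
# pip install phe  -- python-paillier, used here only as a stand-in for the
# homomorphic encryption of intermediate outputs in a coordinator-based protocol.
from phe import paillier

# The coordinator generates the key pair and distributes only the public key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Each computing participant encrypts its local intermediate output
# (e.g., one component of a partial gradient) under the coordinator's public key.
partial_a = 0.42    # participant A's intermediate value (illustrative)
partial_b = -0.17   # participant B's intermediate value (illustrative)
enc_a = public_key.encrypt(partial_a)
enc_b = public_key.encrypt(partial_b)

# Additive homomorphism: ciphertexts can be combined without decryption,
# so the participants never see each other's plaintext values.
enc_sum = enc_a + enc_b

# The coordinator decrypts the combined output and returns the plaintext.
# At this step the coordinator learns the aggregated intermediate result,
# which is exactly the leakage surface discussed above.
decrypted_sum = private_key.decrypt(enc_sum)
print(round(decrypted_sum, 2))  # 0.25
```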
Our Work. In this paper, we provide the first systematic study of the potential privacy risks of practical learning protocols in vertical federated learning. Specifically, our work seeks to answer a crucial research question: How large are the privacy risks of practical learning protocols for the computing participants whose data is used as part of the training set? In other words, how much does the learning protocol leak about the participants' training data?

Although previous works have shown the possibility of inferring private data in the setting of privacy-preserving machine learning [28], [32], [23], [48], they either do not focus on the vertical setting of federated learning or only perform membership attacks against machine learning models. Recent years have witnessed many new technical breakthroughs in the learning protocols and open-source frameworks of vertical federated learning, e.g., SecureBoost [8], FATE [16] and PySyft [36], which differ in their training processes and architectures. As stated before, these practical learning protocols carry potential risks of privacy leakage due to their sacrifices for computing efficiency. Compared with the existing works [28], [32], [23], [48] in the horizontal setting, measuring the privacy leakage of practical learning protocols in the vertical setting poses two major challenges: the encrypted intermediate outputs and the heterogeneous participants' training data. On the one hand, the encrypted intermediate outputs make it impossible for the participants to directly infer private data through gradient […]

[…] target participant's training data by inserting magic numbers into the exchanged data when the two participants jointly train an XGBoost model. Though not completely equivalent to the raw data, these stolen partial orders can be further used to train an alternative model that is as effective as one trained on the raw data.

Extensive experimental evaluations against two popular learning protocols implemented in the real-world framework FATE show that the adversary can precisely infer various levels of the target participant's private data. Specifically, to show the effectiveness of our reverse multiplication attack, we evaluate the privacy of participants' training data when they collaboratively train a logistic regression model on vertical partitions of the credit dataset. Our evaluation results show that the learning protocol of logistic regression, which is claimed to be privacy-preserving with respect to the training data, does not protect the data privacy of the honest participant: the adversary attains 100% accuracy in stealing the training data. This shows that even well-designed learning protocols leak a significant amount of information about participants' training data and are vulnerable to the reverse multiplication attack.

Our experimental evaluation also shows that a malicious participant can perform an accurate reverse sum attack against other participants. When collaboratively training a single boosting tree, the active participant can successfully steal the partial orders of ∼3,200 samples with 100% accuracy from the target participant. We experimentally demonstrate that the leaked information can be used to build an alternative classifier that is as accurate as the one trained on the original data, showing the potential commercial value of the leaked information.

In summary, we make the following contributions:
• We discover the potential privacy risks