Dealer: An End-to-End Model Marketplace with Differential Privacy

Jinfei Liu* (Zhejiang University, [email protected])    Jian Lou (Emory University, [email protected])    Junxu Liu† (Renmin University of China, [email protected])

Li Xiong (Emory University, [email protected])    Jian Pei (Simon Fraser University, [email protected])    Jimeng Sun‡ (UIUC, [email protected])

*Work partially done while at Emory University and Georgia Tech.
†Work partially done while at Emory University.
‡Work partially done while at Georgia Tech.

ABSTRACT

Data-driven machine learning has become ubiquitous. A marketplace for machine learning models connects data owners and model buyers, and can dramatically facilitate data-driven machine learning applications. In this paper, we take a formal data marketplace perspective and propose the first enD-to-end model marketplace with differential privacy (Dealer) towards answering the following questions: How to formulate data owners' compensation functions and model buyers' price functions? How can the broker determine prices for a set of models to maximize the revenue with an arbitrage-free guarantee, and train a set of models with maximum Shapley coverage given a manufacturing budget to remain competitive? For the former, we propose a compensation function for each data owner based on Shapley value and privacy sensitivity, and a price function for each model buyer based on Shapley coverage sensitivity and noise sensitivity. Both privacy sensitivity and noise sensitivity are measured by the level of differential privacy. For the latter, we formulate two optimization problems for model pricing and model training, and propose efficient dynamic programming algorithms. Experiment results on the real chess dataset and synthetic datasets justify the design of Dealer and verify the efficiency and effectiveness of the proposed algorithms.

PVLDB Reference Format:
Jinfei Liu, Jian Lou, Junxu Liu, Li Xiong, Jian Pei, and Jimeng Sun. Dealer: An End-to-End Model Marketplace with Differential Privacy. PVLDB, 14(6): 957-969, 2021. doi:10.14778/3447689.3447700

1 INTRODUCTION

Machine learning has witnessed great success across various types of tasks and is being applied in an ever-growing number of industries and businesses. High-usability machine learning models depend on a large amount of high-quality training data, which makes it evident that data is valuable. Recent studies and practices have approached the commoditization of data in various ways. A data marketplace sells data either in direct or indirect (derived) forms. These data marketplaces can be generally categorized based on what they sell and their corresponding pricing mechanisms: 1) (raw) data-based pricing, 2) query-based pricing, and 3) model-based pricing.

Data marketplaces with data-based pricing sell datasets and allow buyers to access the data directly, e.g., Dawex [1], Twitter [3], Bloomberg [4], Iota [5], and SafeGraph [6]. Data owners have limited or no control over their data usages, which makes it challenging for the market to incentivize more data owners to contribute or results in a market lacking transparency. Also, it can be overpriced for buyers to purchase the whole dataset when they are only interested in particular information extracted from the dataset.

Data marketplaces with query-based pricing [26, 27], e.g., Google BigQuery [2], partially alleviate these shortcomings by charging buyers and compensating data owners on a per-query basis. The marketplace makes decisions about data usage restrictions (e.g., returning queries with privacy protection [30]), compensation allocation, and query-based pricing. However, most queries considered by these marketplaces are too simplistic to support sophisticated data analytics and decision making.

Data marketplaces with model-based pricing [8, 14, 23] have been recently proposed. [14] focuses on pricing a set of model instances depending on their model quality to maximize the revenue, while [23] considers how to allocate compensation in a fair way among data owners when their data are utilized for the model of k-nearest neighbors (k-NN). Notably, they focus on either the data owner or the model buyer end of the marketplace, but not both. Most recently, [8] approaches the problem from a relatively more complete perspective and proposes strategies for the broker to set model usage charges for model buyers and to distribute compensation to data owners. However, [8] oversimplifies the roles of the two end entities, i.e., data owners and model buyers. For example, data owners still have no means to control the way that their data is used, while model buyers do not have a choice over the quality of the model that best suits their demands and budgets.

Gaps and Challenges. Though efforts have been made to ensure the broker follows important market design principles [8, 14, 23], how the model marketplace should respond to the needs of both data owners and model buyers is still understudied. It is therefore tempting to ask: how can we build an end-to-end marketplace dedicated to machine learning models, which can simultaneously satisfy the needs of all three entities, i.e., data owners, broker, and model buyers?

We summarize the gaps and challenges from the perspective of each entity as follows.

• Data owners. Under the existing data marketplace solutions [8, 23], data owners receive compensation for their data usages as allocated by the broker. They have no means to set privacy preferences when supplying their data to the broker. The challenge to be addressed is: How to formulate data owners' compensation functions based on not only the fair sharing of revenues allocated by the broker but also the data owners' requisites on privacy preservation?

• Model buyers. As with the common practice of selling digital commodities in several versions, existing work [14] provides a set of models for sale with different levels of quality. However, its oversimplified noise-injection-based version control quantifies model quality via the magnitude of the noise, which does not directly align with model buyers' demands on model utility. The challenge is: How to formulate model buyers' price functions based on not only resistance to model noise but also the relative utility of the model captured by the value of the data used to build the model?

• Broker. Both data owners' compensation functions and model buyers' price functions should be taken into consideration by the broker when making market decisions. The challenge is: How can the broker determine prices for a set of models to maximize the revenue with an arbitrage-free guarantee, and train a set of models with maximum utility given a manufacturing budget to remain competitive?

Contributions. In this paper, we bridge the above gaps and address the identified challenges by proposing an enD-to-end model marketplace with differential privacy (Dealer). An illustration is provided in Figure 1, which includes three entities (i.e., data owners, broker, and model buyers) and their abbreviated interactions. From the data owners' perspective, Dealer proposes privacy sensitivity of data owners to model their requisites on privacy preservation and uses Shapley value to model fair sharing of revenues. Each data owner has a unique compensation function based on both privacy sensitivity and Shapley value [35]. From the model buyers' perspective, Dealer proposes Shapley coverage sensitivity and noise sensitivity of model buyers to model their demands on Shapley coverage and resistance to model noise. Each model buyer has a unique price function based on both Shapley coverage sensitivity and noise sensitivity. Both privacy sensitivity of data owners and noise sensitivity of model buyers are uniformly measured by the level of differential privacy (DP) [16, 17]. From the broker's perspective, Dealer depicts the full marketplace dynamics through two important functions: 1) model pricing, with the aim of maximizing revenue while following the market design principle of arbitrage-freeness, which prevents the model buyers from taking advantage of the marketplace by combining lower-tier models sold for lower prices into a higher-tier model to escape the designated price for that tier; and 2) model training, with the aim of maximizing Shapley coverage given a manufacturing budget for each model version to remain competitive.

[Figure 1: An end-to-end data marketplace with model-based pricing. Please see Algorithm 8 for a complete pipeline. (Figure omitted; it depicts the interactions among data owners, the broker, model buyers, and surveyed model buyers, labeled ① Data & Compensation Function, ② Model Parameter Setting (Coverage Rate & DP parameter), ③ Model Parameter, ④ Target Model, Purchasing Budget & Price Function, ⑤ Model Pricing & Model Training, ⑦ Target Model & Payment, and ⑨ Compensation.)]

We design efficient algorithms for both model pricing and model training. Our goal in this paper is not to provide a complete solution for the myriad problems involved in the model marketplace with differential privacy, but rather to introduce a framework with which to investigate some of the many questions and potentially open up a new research direction. We briefly summarize our contributions as follows.

• We present an end-to-end model marketplace with differential privacy, Dealer, which is the first systematic study that includes all market participants. Dealer formalizes the abilities and restrictions of the three entities, formulates compensation functions for data owners and price functions for model buyers, and proposes the market decisions taken by the broker by formulating two optimization problems: the revenue maximization problem with arbitrage-free constraint for model pricing, and the Shapley coverage maximization problem for model training.

• For the revenue maximization problem of model pricing, we show its complexity, discretize the search space, and propose an efficient dynamic programming algorithm. For the Shapley coverage maximization problem of model training, we show its complexity and propose efficient dynamic programming, greedy, and guess-based greedy algorithms with approximation guarantees.

• Experiments on the real chess dataset and synthetic datasets are conducted, which verify the efficiency and effectiveness of the proposed algorithms for model pricing and model training in Dealer.

Organization. The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 provides the preliminaries, including the concept of Shapley value and its computation, versioning, the arbitrage-free definition, and differential privacy related definitions and properties. We present the end-to-end model marketplace with differential privacy in Section 4. Efficient algorithms for the optimization problems in Dealer are presented in Section 5. We report the experimental results and findings in Section 6. Finally, Section 7 draws a conclusion and discusses future work.

2 RELATED WORK

In this section, we discuss related work on compensation allocation and pricing mechanisms for data markets. For a more detailed survey, please see [33].

2.1 Compensation Allocation

In data marketplaces, a common way to allocate compensation is based on the importance of the data contributed by each data owner. A straightforward method to evaluate the importance of a data point to a model is leave-one-out (LOO), which compares the difference between the predictor's performance when trained on the entire dataset with and without the data point [11].

However, LOO does not satisfy the properties (balance, symmetry, zero element, and additivity) that we expect for data valuation [35]. For example, given a point p in a dataset, if there is an exact copy p′ in the dataset, removing p from the dataset does not change the predictor since p′ is still there. Therefore, LOO will assign zero value to p regardless of how important p is, which violates the symmetry property.

Shapley value is a concept in cooperative game theory, named in honor of Lloyd Shapley. In contrast to LOO, Shapley value satisfies several desirable properties including balance, symmetry, zero element, and additivity [35]. Combined with its flexibility to support different utility functions, Shapley value has been extensively employed in the data pricing field [8, 9, 19, 23] to evaluate data importance. One major challenge of applying Shapley value is its prohibitively high computational complexity. A number of approximation methods [9, 10, 18, 19] have been developed to overcome the computational hardness of the exact Shapley value. The most representative one is the Monte Carlo method [10, 18], which is based on random sampling of permutations.

In this paper, we allow data owners to set personalized compensation functions, and the corresponding compensation depends not only on the Shapley value but also on privacy sensitivity.

2.2 Pricing Mechanisms

Envy-free query-based pricing. Ghosh et al. [20] initiated the study of markets for private data using differential privacy. They modeled the first framework in which data buyers would like to buy sensitive information to estimate a population statistic. They defined a property named envy-freeness for the first time. Envy-freeness ensures that no individual would prefer to switch their payment and privacy cost with another. Guruswami et al. [21] studied the optimization problem of revenue maximization with an envy-free guarantee. They investigated two cases of inputs, unit-demand consumers and single-minded consumers, showed the optimization problem is APX-hard for both cases, and gave an efficient logarithmic approximation algorithm. Li et al. [29-31] presented the first theoretical framework for assigning value to noisy query answers as a function of their accuracy, and for dividing the revenue among data owners who deserve compensation for their privacy loss. They defined an enhanced edition of envy-freeness, named arbitrage-freeness. Arbitrage-free pricing ensures that a data buyer cannot purchase the desired information at a lower price by combining two low-price queries.

Arbitrage-free query-based pricing. Lin et al. [32] proposed necessary conditions for avoiding arbitrage and provided new arbitrage-free pricing functions. They also presented negative results related to the tension between flexible pricing and arbitrage-freeness, and illustrated how this tension often results in unreasonable prices. In addition to arbitrage-freeness, Koutris et al. [28] proposed another desirable property for the pricing function, discount-freeness, which requires that the prices offer no additional discounts beyond the ones specified by the broker; in fact, discount-freeness is the discrete version of arbitrage-freeness. Furthermore, they presented a polynomial time algorithm for pricing generalized chain queries. Recently, Chawla et al. [13] investigated three types of succinct pricing functions and studied the corresponding revenue maximization problems.

Arbitrage-free model-based pricing. Due to the increasing pervasiveness of machine learning based analytics, there is an emerging interest in studying the cost of acquiring data for machine learning. Chen et al. [14] proposed the first and only existing model-based pricing framework, which directly prices machine learning model instances with different noise levels rather than pricing the data. They formulated an optimization problem to find the arbitrage-free prices that maximize the revenue of the broker and proved that the optimization problem is coNP-hard. However, 1) their work only focuses on the interactions between the broker and model buyers because the market research functionality of data owners can be completely replaced by the broker; 2) they assume there is only one (averaged) survey price point for each model, which is oversimplified; and 3) they do not have an explicit privacy analysis and guarantee for their proposed Gaussian mechanism that adds noise to the models.

3 PRELIMINARIES

In this section, we introduce the preliminaries for our later development and summarize the frequently used notations in Table 1 for the convenience of readers.

Table 1: The summary of notations.

  Notation           Definition
  D_i                the i-th data owner
  B_j                the j-th model buyer
  M_k                the k-th model
  U                  utility function
  SV                 Shapley value
  z_i                the i-th data set
  ε                  DP parameter
  ρ_i                the privacy sensitivity of D_i
  σ_j                the utility sensitivity of B_j
  γ_j                the noise sensitivity of B_j
  CR                 coverage rate function
  MB                 manufacturing budget
  (ε_k, sp^k[j])     survey price point
  (ε_k, p^k[j])      complete price point

3.1 Fairness and Shapley Value

Consider n data owners D_1, ..., D_n such that data owner D_i owns data set z_i (1 ≤ i ≤ n). We assume a utility function U(S) (S ⊆ {z_1, ..., z_n}) that evaluates the utility of a coalition S, which consists of data sets from multiple data owners, for a task such as training a machine learning model. Shapley [35] lays out the fundamental requirements of fairness in a market, including balance, symmetry, zero element, and additivity. Specifically, the Shapley value is a measure that can be used to evaluate data importance for compensation allocation while satisfying all the requirements. The Shapley value measures the marginal utility improvement contributed by z_i from data owner D_i, averaged over all possible coalitions of the data owners:

SV_i = \frac{1}{n} \sum_{\mathcal{S} \subseteq \{z_1,\ldots,z_n\} \setminus \{z_i\}} \frac{\mathcal{U}(\mathcal{S} \cup \{z_i\}) - \mathcal{U}(\mathcal{S})}{\binom{n-1}{|\mathcal{S}|}}    (1)

Computing the exact Shapley value has to enumerate all the subsets and is thus prohibitively expensive.
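To make Eq. (1) concrete, the following minimal Python sketch enumerates all coalitions for a toy utility function; the function names and the additive toy utility are illustrative assumptions, not from the paper. Its exponential cost is exactly what motivates the Monte Carlo estimator described next.

```python
import math
from itertools import combinations

def exact_shapley(datasets, utility):
    """Exact Shapley value of each data set per Eq. (1).

    datasets: list of data set handles z_1..z_n.
    utility:  function mapping a tuple of data sets to a real-valued utility U(S).
    Requires O(2^n) utility evaluations, hence only feasible for tiny n.
    """
    n = len(datasets)
    sv = [0.0] * n
    for i, z_i in enumerate(datasets):
        others = datasets[:i] + datasets[i + 1:]
        for size in range(n):
            norm = 1.0 / (n * math.comb(n - 1, size))
            for S in combinations(others, size):
                sv[i] += norm * (utility(S + (z_i,)) - utility(S))
    return sv

if __name__ == "__main__":
    data = ["z1", "z2", "z3"]
    value = {"z1": 1.0, "z2": 2.0, "z3": 3.0}   # hypothetical per-set utilities
    # For an additive utility, each Shapley value equals the set's own utility.
    print(exact_shapley(data, lambda S: sum(value[z] for z in S)))
```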

We adopt a commonly used Monte Carlo simulation method [10, 18] to compute an approximate Shapley value. We first sample random permutations of the data sets from different data owners, then scan each permutation from the first data set to the last one and calculate the marginal contribution of every new data set. By examining a good number of permutations, the final estimate of the Shapley value is simply the average of all the calculated marginal contributions. This Monte Carlo simulation gives an unbiased estimate of the Shapley value given a sufficient number of permutations.

In practical applications, we can conduct the Monte Carlo simulation iteratively until the average has empirically converged, as shown in Algorithm 1, where τ is the number of permutations. The larger the value of τ, the more accurate the computed Shapley value tends to be.

Algorithm 1: Monte Carlo Shapley value computation.
  input: data sets z_1, ..., z_n and τ > 0
  output: Shapley value SV_i for each data set z_i (1 ≤ i ≤ n)
  1  initialize SV_i = 0 (1 ≤ i ≤ n);
  2  for k = 1 to τ do
  3      let π^k be a random permutation of {1, ..., n};
  4      for i = 1 to n do
  5          SV(z_{π^k(i)}) = U({z_{π^k(1)}, ..., z_{π^k(i)}}) − U({z_{π^k(1)}, ..., z_{π^k(i−1)}});
  6          SV_{π^k(i)} += SV(z_{π^k(i)});
  7  return SV_1/τ, ..., SV_n/τ

Based on the definition and the fair compensation property of the Shapley value, we can easily generalize the Shapley value of one data set from one data owner to the Shapley value of a coalition of data sets from multiple owners.

Corollary 3.1. Let S_D ⊆ {D_1, ..., D_n} be a subset of data owners. Then

SV(\mathcal{S}_D) = \sum_{i: D_i \in \mathcal{S}_D} SV_i

is a Shapley value that can be used for compensation allocation to the subset of data owners and satisfies all requirements of fairness.

3.2 Versioning

In a perfect world where the broker has infinite resources, the broker can sell a personalized model to each model buyer at an individualized price to maximize the revenue. However, such personalized pricing is rarely possible in practical applications. A practical and frequently adopted alternative is to offer a small number of different versions of the product, designed to attract different types of buyers. Under this strategy, which is called versioning [34], buyers segment themselves. The version chosen by a buyer reveals the value that the buyer places on the data set and the price the buyer is willing to pay. Following this industry practice, in our model marketplace the broker trains l different model versions with different model privacy parameters, which are measured by differential privacy.

3.3 Arbitrage-free Pricing

When multiple versions are available in a market, it is possible to derive a more expensive version from one or multiple versions with a lower total cost. In such a scenario, arbitrage sneaks in. Arbitrage complicates interactions between the broker and buyers: buyers have to carefully choose versions to achieve the lowest price, while the broker may not attain the revenue intended by the released prices. Therefore, arbitrage-free pricing functions are highly desirable. Arbitrage-freeness prevents model buyers from taking advantage of the marketplace by combining lower-tier versions sold for cheaper prices into a higher-tier version to escape the designated price for that tier. A pricing function is arbitrage-free if it satisfies the following two properties [14, 31].

Property 1. (Monotonicity) Given a function f: (R^+)^n → R^+, f is monotone if and only if for any two vectors x, y ∈ (R^+)^n, x ≤ y implies f(x) ≤ f(y).

Property 2. (Subadditivity) Given a function f: (R^+)^n → R^+, f is subadditive if and only if for any two vectors x, y ∈ (R^+)^n, f(x + y) ≤ f(x) + f(y).

3.4 Differential Privacy

Differential privacy [16, 17] is a formal mathematical framework rigorously providing privacy protection. To the best of our knowledge, none of the existing model marketplaces adopt or consider differential privacy.

Definition 3.2. (Differential Privacy) A randomized algorithm A is (ε, δ)-differentially private if, for any pair of datasets S and S′ that differ in one data sample and for all possible output sets OUT of A, the following holds:

\mathbb{P}[\mathcal{A}(S) \in OUT] \leq e^{\epsilon}\, \mathbb{P}[\mathcal{A}(S') \in OUT] + \delta,    (2)

where the probability is taken over the randomness of A.

In short, this formulation implies that the output of a differentially private algorithm is indistinguishable between two datasets that differ in one sample. ε captures the degree of indistinguishability, which intuitively can be seen as an upper bound on the privacy loss for each data sample. The smaller ε is, the more indistinguishable the outputs are, and the less the privacy loss is. The other parameter δ captures the probability that the privacy loss exceeds this upper bound. Given a specified ε, the closer δ is to 0, the smaller the probability of a privacy breach, and the better the algorithm. In practice, for a meaningful DP guarantee, the parameters are chosen as 0 < ε ≤ 10 and δ ≪ 1/n, where n is the number of data owners [22].

In this paper, the broker will train a set of models with differential privacy guarantees. According to the robustness of differential privacy to post-processing, any post-processing of the models by the model buyers will not incur additional privacy loss. For simplicity, we use a uniform small value δ and different ε for different versions of the models.

Lemma 3.3. (Simple Composition [17]) Consider J randomized algorithms, each of which is (ε_j, δ_j)-differentially private for j ∈ {1, ..., J}. Sequentially applying these J algorithms satisfies (Σ_{j=1}^{J} ε_j, Σ_{j=1}^{J} δ_j)-differential privacy.

Lemma 3.3 is essential for DP mechanism design and analysis, as it enables algorithm designers to compose elementary DP operations into more sophisticated ones. Importantly, we will show later that it plays a crucial role in model market design as well. Lemma 3.3 suggests that differential privacy is an appropriate mechanism for versioning models with arbitrage-free pricing. The challenge is how to ensure that the pricing of the versioned models is arbitrage-free.
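The two properties above can be spot-checked numerically for a candidate single-argument pricing function over the DP parameter. The sketch below is only an illustration on a finite grid (not a proof), and the two price families it tests are hypothetical examples, not the pricing functions used by Dealer.

```python
import itertools
import numpy as np

def is_arbitrage_free(price, eps_grid, tol=1e-9):
    """Check Property 1 (monotonicity) and Property 2 (subadditivity)
    for price(eps) on a finite grid of DP parameters."""
    monotone = all(price(a) <= price(b) + tol
                   for a, b in itertools.combinations(sorted(eps_grid), 2))
    subadditive = all(price(a + b) <= price(a) + price(b) + tol
                      for a, b in itertools.product(eps_grid, repeat=2))
    return monotone and subadditive

if __name__ == "__main__":
    grid = np.linspace(0.1, 10, 50)
    # A concave family p(eps) = c * eps**a with 0 < a <= 1 passes both checks.
    print(is_arbitrage_free(lambda e: 5.0 * e ** 0.5, grid))   # True
    # A super-linear family violates subadditivity.
    print(is_arbitrage_free(lambda e: 5.0 * e ** 2.0, grid))   # False
```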

To train a model satisfying differential privacy, a popular method is objective perturbation [12], which perturbs the objective function of the model with quantified noise. For the perturbed objective, we adopt gradient descent-based iterative optimization algorithms to obtain an approximate solution, which is common in practical machine learning model training. However, as pointed out in [22], conventional objective perturbation establishes the DP guarantee for the exact optimum of the perturbed objective function, while the DP guarantee is unknown for an approximate solution. In order to allow the broker to utilize practical iterative training algorithms and still achieve differential privacy, in this paper we follow the enhanced objective perturbation called approximate minima perturbation, which allows solving the perturbed objective up to α approximation. It uses a two-phase noise injection strategy that perturbs both the objective and the approximate output. The idea is summarized in Algorithm 2: Line 2 perturbs the objective with calibrated noise N_1, Line 3 optimizes the perturbed objective, and Line 4 applies an output perturbation with noise N_2. Finally, w_DP is obtained by a projection onto the constraint set Ω.

Algorithm 2: Objective perturbation for differentially private model training.
  input: Z_train and (ε, δ)
  output: w_DP
  1  Sample N_1 ~ N(0_d, σ_1² I_d), where σ_1² = 20 L² log(1/δ) / ε² and L is the Lipschitz constant [22];
  2  Objective perturbation: L_OP(w) = L(w; Z_train) + (λ/2)‖w‖² + (1/n)⟨N_1, w⟩;
  3  Optimize L_OP(w) to obtain an α-approximate solution ŵ;
  4  Sample N_2 ~ N(0_d, σ_2² I_d), where σ_2² = 40 α log(1/δ) / (λ ε²);
  5  return w_DP = proj_Ω(ŵ + N_2);

As shown by the next lemma, Algorithm 2 trains an (ε, δ)-DP model on the training dataset Z_train and outputs the model parameter w_DP.

Lemma 3.4. (Theorem 1 in [22]) Algorithm 2 is (ε, δ)-differentially private.

In Dealer, the broker will train a series of differentially private models by invoking Algorithm 2.

4 DEALER: STRUCTURE AND PARTIES

In this section, we propose the model marketplace Dealer. We describe the structure of Dealer, including the parties and their operations in the marketplace. We first give an overview of Dealer and then specify the roles and mechanisms in detail.

4.1 Overview

The objective of Dealer is to bridge the supplies from data owners and the demands from model buyers. We assume one broker who collects data from multiple data owners, designs and builds models, and sells the models to multiple model buyers. Data owners, broker, and model buyers have different concerns and requirements.

Data owners have two concerns. First, they are concerned about privacy protection: the less privacy protection in the models built, the more compensation data owners ask for. Moreover, data owners expect a fair sharing of revenues from the models sold to model buyers. In Dealer, we propose privacy sensitivity of data owners to model their requisites on privacy preservation and use Shapley values to model fair sharing of revenues.

Model buyers are concerned about the utility or accuracy of the model, which is captured by the coverage of the Shapley value of the data used to build the model and the noise added to achieve privacy preservation. The more coverage of the data and the less noise added, the higher the value of a model to buyers. In Dealer, we propose Shapley coverage sensitivity and noise sensitivity of model buyers to model their demands on Shapley coverage and resistance to noise, respectively. To capture the economic constraints, we assume that each model buyer has a budget. The broker can accordingly calculate the Shapley coverage of a given model, which is captured by the Shapley value of the selected data used to train the model. We assume that model buyers always buy the cheapest model satisfying their demands.

The broker defines and builds models using data collected from data owners and sells the models to model buyers. Without loss of generality, we assume that the broker is neutral and does not charge any cost for model building, i.e., the total revenue from model sales is fully and fairly distributed to data owners. To practice Dealer, the broker can easily factor in a percentage of the total revenue for model building and other costs, which does not affect our mechanisms. The objective of the broker is to maximize the total revenue.

To keep our discussion simple, we assume that all parties in Dealer, including data owners, broker, and model buyers, are honest; that is, no cheating activities exist in the marketplace. We also assume the broker is trusted, i.e., it has access to the raw data of the data owners. Future work may consider settings where the broker is untrusted by incorporating techniques such as local differential privacy [36] and federated learning [24].

4.2 Data Owners

We assume n data owners D_1, ..., D_n. For the sake of clarity, we overload the symbol D_i (1 ≤ i ≤ n) to also denote the data owned by D_i. Each data owner D_i (1 ≤ i ≤ n) is willing to share her data with the broker for compensation. The requested compensation depends on the data owner's privacy disclosed through the models sold by the broker. Moreover, D_i expects a fair sharing of the revenues with other data owners based on the utility of the data contributed.

In this paper, we use ε-differential privacy to quantify possible user privacy disclosure. Technically, we use a monotonic compensation function c_i for data owner D_i to model the data owner's requested compensation for a model that uses data D_i and satisfies ε-differential privacy:

c_i(\epsilon) = b_i \cdot s_i(\epsilon),    (3)

where ε ≥ 0 is the differential privacy parameter, and b_i is the base price, which should be proportional to the Shapley value of D_i with regard to all data sets {D_1, ..., D_n} contributed by the peers to building models. The function s_i(ε) reflects the user's valuation of privacy in the price. The larger the value of ε, the less protection provided by ε-differential privacy, and thus the higher the compensation requested. In general, any monotonic function with respect to ε can be used. Notice that the factor e^ε in the definition of ε-differential privacy controls the tolerance of privacy disclosure: the larger the value of e^ε, the less privacy protection. In this paper, we use the following function for s_i(ε) in our discussion:

s_i(\epsilon) = (e^{\epsilon})^{\rho_i} = e^{\rho_i \cdot \epsilon}.    (4)
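To illustrate Eqs. (3)-(4), a minimal sketch of a data owner's compensation function follows (the privacy-sensitivity parameter ρ_i is discussed next). The base prices and sensitivity values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def compensation(epsilon, base_price, rho):
    """Data owner compensation c_i(eps) = b_i * e^(rho_i * eps), Eqs. (3)-(4).

    base_price (b_i) would be proportional to the owner's Shapley value;
    rho is the owner's privacy sensitivity (rho < 1: sub-linear in e^eps,
    rho > 1: super-linear, i.e., a more privacy-conservative owner).
    """
    return base_price * np.exp(rho * epsilon)

if __name__ == "__main__":
    eps_versions = [0.1, 1.0, 5.0]          # hypothetical DP parameters of model versions
    owners = [("conservative", 2.0, 1.5),   # (label, b_i, rho_i), illustrative only
              ("neutral",      2.0, 1.0),
              ("tolerant",     2.0, 0.3)]
    for label, b, rho in owners:
        print(label, [round(compensation(e, b, rho), 2) for e in eps_versions])
```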

The parameter ρ_i ≥ 0 tunes the marginal increase of the user's price elasticity of privacy and is called the user's privacy sensitivity. When 0 ≤ ρ_i < 1, s_i(ε) grows sub-linearly with respect to e^ε; when ρ_i > 1, the growth is super-linear; and when ρ_i = 1, the growth is the same as e^ε. A very privacy-conservative data owner can choose a super-linear function, which yields a higher compensation if her data is used, but she may not get selected for model building because the broker has a limited manufacturing budget.

It is worth noting that in our current formulation, b_i is based on the Shapley value of D_i, which is computed based on the model without any perturbation (corresponding to ε = ∞). We can show that the four axioms of Shapley value still hold for c_i(ε) for general ε values. Alternatively, one could compute a Shapley value for each model trained with a different ε-DP guarantee. However, this would significantly increase the computation cost. In addition, it could not be guaranteed that the more privacy a data owner sacrifices, the more compensation she gets, because the broker cannot control the relationship between privacy risk and compensation.

4.3 Model Buyers

We assume m model buyers B_1, ..., B_m, carrying budgets V_1, ..., V_m, respectively, where V_j > 0 (1 ≤ j ≤ m). It is intuitive to price the models according to model accuracy. However, that formulation is not amenable to the arbitrage-free constraints that we aim to achieve. In this paper, we measure the relative utility of the model using the coverage of the Shapley values of the data used for building the model and the noise added to the model for privacy protection. This allows us to theoretically achieve the arbitrage-free constraints while providing a practical and transferable utility measure based on the relative marginal contribution of the data owners, as compared to absolute accuracy. We note that more data (higher Shapley values) and less noise do not guarantee, but generally do lead to, higher accuracy, as we will show later in our experiments.

First, buyers' offer prices are sensitive to the coverage of the value of the data used. The ideal model takes all data sets from all data owners D_1, ..., D_n. A model buyer may set an expectation requirement on the percentage of the Shapley values of the data that should be used to build the model, and the price that the buyer is willing to pay may change accordingly. We emphasize that when using different subsets of data to build a model, not only the quantity of the data but also the values of the data, i.e., their contributions to the model, matter for the final accuracy of the model. Hence we define the coverage as the ratio of Shapley values of the subset, measuring relative utility instead of the ratio of the subset size. Formally, for a model M that is built using k data sets D_{i_1}, ..., D_{i_k} from their data owners, the coverage rate of the model M is

CR(M) = CR(D_{i_1}, \ldots, D_{i_k}) = \frac{SV(\{D_{i_1}, \ldots, D_{i_k}\})}{SV(\{D_1, \ldots, D_n\})}.    (5)

Technically, a model buyer B_j (1 ≤ j ≤ m) can set a coverage expectation θ_j (0 < θ_j ≤ 1). The price B_j is willing to pay for a model M with respect to the coverage rate CR(M) can be

Price_{coverage}(B_j) = V_j \cdot \frac{1}{1 + e^{-\delta_j (CR(M) - \theta_j)}},    (6)

where the parameter δ_j > 0 is B_j's coverage sensitivity, which controls how quickly the buyer loses interest in the model if the model does not meet the Shapley coverage expectation θ_j. The larger the value of δ_j, the sharper the change of the price at the expectation point θ_j.

Second, buyers' offer prices are also sensitive to the noise added to the model. The ideal model does not add any noise. The effect of the noise added for privacy protection can be measured quantitatively by the parameter ε of the ε-differential privacy mechanism; in ε-differential privacy, no noise is added when ε = ∞. Technically, a model buyer B_j (1 ≤ j ≤ m) can set a noise expectation η_j (η_j > 0) and a parameter γ_j (γ_j > 0), called noise sensitivity, to express the buyer's price sensitivity with respect to noise: the larger the value of γ_j, the more sensitive the buyer is with respect to noise.

Putting the above two aspects together, for a buyer B_j (1 ≤ j ≤ m), let V_j be the buyer's total budget to purchase a model, θ_j the expectation on Shapley coverage, δ_j the Shapley coverage sensitivity, η_j the expectation on model noise in differential privacy, and γ_j the noise sensitivity. Then, for a model M that is built using data sets D_{i_1}, ..., D_{i_k} from the data owners and satisfies ε-differential privacy, we use the following price function for model buyer B_j on model M:

P(B_j, M) = V_j \cdot \frac{1}{1 + e^{-\delta_j (CR(M) - \theta_j)}} \cdot \frac{1}{1 + e^{-\gamma_j (\epsilon - \eta_j)}}.    (7)

Clearly, P(B_j, M) reflects model buyer B_j's assessment of the value of model M and is thus also called the model value of M for B_j.

4.4 Broker

The broker collects data from data owners, and builds and sells models to model buyers. To accommodate different data owners' requirements on privacy preservation and various model buyers' demands on Shapley coverage and noise sensitivity, multiple versions of a model may be developed. In practice, the number of versions made available to model buyers is often kept small. We also call a version of a model simply a model.

Assume that the broker builds l versions of a model, M_1, ..., M_l. A model M_k (1 ≤ k ≤ l) is specified as a tuple M_k = (μ_k, ε_k, p_k), where μ_k = CR(M_k) is the coverage rate of the model, ε_k is the level of differential privacy guarantee, and p_k is the price set by the broker. To keep the marketplace concise, we list the models in ascending order of μ and ε, that is, μ_{k_1} ≤ μ_{k_2} and ε_{k_1} ≤ ε_{k_2} if k_1 ≤ k_2. Let r(M_k, D_i) be the compensation allocated to data owner D_i who contributes her data to model M_k. The compensation is decided by the broker.
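To make the buyer-side valuation of Section 4.3 concrete, a minimal sketch of the coverage rate and price function in Eqs. (5)-(7) follows; the Shapley values and buyer parameters are hypothetical.

```python
import numpy as np

def coverage_rate(sv_subset, sv_all):
    """Eq. (5): ratio of the Shapley value of the selected data to the total."""
    return sum(sv_subset) / sum(sv_all)

def buyer_price(V, theta, delta, eta, gamma, cr, eps):
    """Eq. (7): buyer budget V discounted by two logistic factors,
    one for Shapley coverage (expectation theta, sensitivity delta)
    and one for the DP parameter (expectation eta, sensitivity gamma)."""
    cov_factor = 1.0 / (1.0 + np.exp(-delta * (cr - theta)))
    noise_factor = 1.0 / (1.0 + np.exp(-gamma * (eps - eta)))
    return V * cov_factor * noise_factor

if __name__ == "__main__":
    sv_all = [0.5, 1.0, 1.5, 2.0]            # hypothetical Shapley values of D_1..D_4
    cr = coverage_rate(sv_all[2:], sv_all)    # model built from D_3, D_4 only
    # Hypothetical buyer: budget 100, wants >= 60% coverage and eps >= 1.
    print(round(buyer_price(V=100, theta=0.6, delta=20, eta=1.0, gamma=2.0,
                            cr=cr, eps=2.0), 2))
```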

In the model marketplace Dealer, we make the following assumptions about the participants' behavior.

• (Participation) A data owner D_i contributes her data to a model M_k as long as the broker can allocate a payment no less than the data owner asks for, i.e., r(M_k, D_i) ≥ c_i(ε_k). A data owner participates in and collects compensation from every model that meets this requirement.

• (Fairness) For any model M_k and data owners D_{i_1} and D_{i_2} who contribute their data to the model,

\frac{r(M_k, D_{i_1})}{c_{i_1}(\epsilon_k)} = \frac{r(M_k, D_{i_2})}{c_{i_2}(\epsilon_k)},

that is, the compensation among all contributing data owners within each model is fair.

• (Cost minimization purchase) A model buyer B_j buys one and only one model as long as there exists a model M_k such that

\theta_j \leq \mu_k \ \text{and} \ \eta_j \leq \epsilon_k.    (8)

If there is no model M_k satisfying Eq. 8, then B_j purchases nothing; if there is only one model M_k satisfying Eq. 8, then B_j purchases M_k with budget P(B_j, M_k); if there are multiple models M_{k_1}, ..., M_{k_{l'}} satisfying Eq. 8, then B_j purchases the model with arg min_{k ∈ {k_1,...,k_{l'}}} ε_k, i.e., the model having the lowest ε, which also has the lowest price for B_j, with budget P(B_j, M_{arg min_k ε_k}).

• (Neutral broker) For each model M_k that uses data sets D_{i_1}, ..., D_{i_{n'}} and is purchased by model buyers B_{j_1}, ..., B_{j_{m'}},

\sum_{i \in \{i_1, \ldots, i_{n'}\}} r(M_k, D_i) = m' \cdot p_k,

i.e., the revenue produced by each model is fully distributed to all data contributors. In this paper, we assume a neutral broker to keep our discussion simple. In practice, a broker can easily factor in a percentage of the total revenue for model building and other costs, which does not affect our mechanisms and algorithms.

The broker is responsible for defining the versions of the model. Ideally, the arbitrage-free guarantee would be applied to both μ_k and ε_k. However, the arbitrage-free guarantee is only applied to one variable in the existing works [14, 29-31], and it is not clear how, or whether, it can be applied to two variables. Because ε_k has a much greater impact than μ_k on model accuracy (Figure 6), in this paper the broker ensures the arbitrage-free guarantee with respect to ε_k. Now we are ready to state the optimization problem in Dealer. Given a set of data owners, a set of model buyers, and the number of models l, the broker needs to define l models M_1, ..., M_l such that the total revenue Σ_{k=1}^{l} p_k · purchase(p_k) is maximized, where purchase(p_k) is the number of model buyers purchasing M_k. In the rest of the paper, we refer to this problem as the revenue maximization problem.

5 DEALER: MODEL PRICING AND MODEL TRAINING

In this section, we study the revenue maximization problem for model pricing in Section 5.1. Given the maximized revenue from model buyers, the broker needs to maximize the Shapley coverage for each model version to remain competitive and check whether the achieved Shapley coverage satisfies the predefined Shapley coverage. We study the Shapley coverage maximization problem for model training in Section 5.2. We also summarize the complete Dealer dynamics in Section 5.3.

5.1 Revenue Maximization Problem for Model Pricing

Prior to releasing models and setting their sale prices, the broker needs to conduct a market survey to collect each model buyer's price function P(B_j, M_k) in order to price the models to maximize the revenue; this can be done by the broker herself or by third-party companies. Let the survey size be m′, i.e., m′ potential model buyers are recruited to provide their price functions P(B_j, M_k). Given the set of models the broker plans to train, with coverage rate μ_k and DP parameter ε_k, for the j-th survey participant B_j the broker calculates (tm_j, v_j) according to Eq. 8 if there is at least one model satisfying B_j's requirements on both Shapley coverage and model noise, where tm_j indicates the target model of B_j and v_j is the budget of the buyer for purchasing tm_j. We call each (tm_j, v_j) a survey price point in the following.

With the m′ tuples computed from the surveyed buyers, the broker prices each model with the aim of maximizing revenue while following the market design principle of arbitrage-free pricing. The revenue maximization (RM) problem is formulated as follows:

\arg\max_{\langle p(\epsilon_1), \ldots, p(\epsilon_l) \rangle} \sum_{k=1}^{l} \sum_{j=1}^{m'} p(\epsilon_k) \cdot \mathbb{I}(tm_j == M_k) \cdot \mathbb{I}(p(\epsilon_k) \leq v_j),    (9)
s.t. \ p(\epsilon_{k_1} + \epsilon_{k_2}) \leq p(\epsilon_{k_1}) + p(\epsilon_{k_2}), \ \ \epsilon_{k_1}, \epsilon_{k_2} > 0,    (10)
0 < p(\epsilon_{k_1}) \leq p(\epsilon_{k_2}), \ \ 0 < \epsilon_{k_1} \leq \epsilon_{k_2},    (11)

where I(tm_j == M_k) indicates whether B_j's target model is M_k, and I(p(ε_k) ≤ v_j) indicates whether the price for model M_k (p_k, or p(ε_k) from the DP parameter perspective) with DP parameter ε_k is less than or equal to the budget of B_j for purchasing M_k. The optimal solution of this revenue maximization problem has the arbitrage-free guarantee since Eq. 10 ensures subadditivity (Property 2) and Eq. 11 ensures monotonicity (Property 1). We note that our problem formulation does not assume that all model buyers can afford their target models or that all models will have buyers. It is interesting future work to consider such additional constraints.

We refer to the maximum revenue for RM as MAX(RM) and use (ε_k, sp^k[j]) to denote the j-th lowest survey price point for model M_k with DP parameter ε_k. For example, assume ε_1 = 1, ε_2 = 2, and ε_3 = 3 in Figure 2. We have six survey price points (participants), shown as black disks: (1, sp^1[1] = 1), (1, sp^1[2] = 4), (2, sp^2[1] = 3), (2, sp^2[2] = 7), (3, sp^3[1] = 5), and (3, sp^3[2] = 8), where (2, sp^2[1] = 3) means one model buyer would like to purchase model M_2 with DP parameter ε_2 using budget 3. These survey price points make up a survey price space.

A special case of the RM problem has been studied earlier [14], given one (averaged) survey price point for each model. Given one survey price point (ε_k, sp^k[1]) for each model M_k, k = 1, ..., l, determining whether there exists a pricing function p(ε_k) that 1) is positive, monotone, and subadditive, and 2) ensures p(ε_k) = sp^k[1] for all k = 1, ..., l, has been shown to be coNP-hard. It is easy to see that this coNP-hard problem is a special case of the RM problem, which has multiple survey price points for each model.

In order to overcome the hardness of the original optimization problem RM, we seek to solve the problem approximately by relaxing the subadditivity constraint. We relax the constraint p(ε_{k_1} + ε_{k_2}) ≤ p(ε_{k_1}) + p(ε_{k_2}) in Eq. 10 to

p(\epsilon_{k_1})/\epsilon_{k_1} \geq p(\epsilon_{k_2})/\epsilon_{k_2}, \ \ \epsilon_{k_1} \leq \epsilon_{k_2},    (12)

which still satisfies the requirement of arbitrage-free pricing because the new constraint is sufficient but not necessary for the subadditivity constraint. We refer to this relaxed problem as the relaxed revenue maximization (RRM) problem. Intuitively, we want to make sure that the unit price for large purchases is smaller than or equal to the unit price for small purchases.
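A minimal sketch of evaluating a candidate price vector under the RM objective (Eq. 9) and the relaxed arbitrage-free constraints (Eqs. 11-12), using the six survey price points of the running example; the helper names are illustrative, and the candidate prices happen to be the optimum derived later in Example 5.5.

```python
def revenue(prices, survey):
    """Eq. (9): each surveyed buyer (target model index k, budget v)
    pays the posted price prices[k] iff it does not exceed the budget."""
    return sum(prices[k] for (k, v) in survey if prices[k] <= v)

def satisfies_relaxed_constraints(prices, eps):
    """Eq. (11) monotonicity and Eq. (12) non-increasing unit price."""
    ok_monotone = all(prices[k] <= prices[k + 1] for k in range(len(eps) - 1))
    ok_unit = all(prices[k] / eps[k] >= prices[k + 1] / eps[k + 1]
                  for k in range(len(eps) - 1))
    return all(p > 0 for p in prices) and ok_monotone and ok_unit

if __name__ == "__main__":
    eps = [1, 2, 3]
    # Survey price points of Figure 2 as (target model index, budget).
    survey = [(0, 1), (0, 4), (1, 3), (1, 7), (2, 5), (2, 8)]
    candidate = [4, 7, 8]
    print(satisfies_relaxed_constraints(candidate, eps))   # True
    print(revenue(candidate, survey))                      # 19
```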

In the following, we show that the maximum revenue for RRM, MAX(RRM), has a lower bound with respect to the maximum revenue for RM, MAX(RM).

Theorem 5.1. The maximum revenue for RM has the following relationship with the maximum revenue for RRM: MAX(RRM) ≥ MAX(RM)/2.

Proof. Due to the limited space, please see our technical report [7] for the detailed proof; the same applies to the following theorems. □

Dynamic Programming Algorithm. We present an efficient dynamic programming algorithm to solve the relaxed revenue maximization problem.

[Figure 2: Revenue maximization example. (Figure omitted; it plots price against model risk ε for ε_1, ε_2, ε_3, showing the six survey price points together with the derived SC and MC points of the complete price space.)]

At first glance, it seems that, for each model, every possible value in the price range could be an optimal price, which makes the problem arguably intractable. In the following, we show how to construct a complete price space in a discrete space and prove that the complete price space is sufficient for obtaining the maximum revenue.

Constructing the Complete Price Space. It is easy to see that the survey price points are potential solutions. Each survey price point (ε_k, sp^k[j]) determines a unit price sp^k[j]/ε_k and a price sp^k[j]. The general idea is that if we choose (ε_k, sp^k[j]) as the optimal price point for model M_k, it affects the prices for models M_{k'}, k' = 1, ..., k−1, due to the monotonicity constraint and the unit prices for models M_{k'}, k' = k+1, ..., l, due to the subadditivity constraint. If we set the optimal price of model M_k to sp^k[j], the unit price of the models after M_k cannot be larger than sp^k[j]/ε_k. Therefore, for each survey price point (ε_k, sp^k[j]), we draw a line l_{(ε_k, sp^k[j])} through the survey price point (ε_k, sp^k[j]) and the origin. For each model M_{k'} with DP parameter ε_{k'}, we draw a vertical line l_{ε_{k'}} through (ε_{k'}, 0). By intersecting line l_{(ε_k, sp^k[j])} and line l_{ε_{k'}}, we obtain l − k new price points (ε_{k'}, (sp^k[j]/ε_k) · ε_{k'}) for k' = k+1, ..., l. We note that we do not need to generate such price points as candidates for k' = 1, ..., k−1 because the unit price of model M_k only constrains the unit prices of models M_{k'}, k' = k+1, ..., l (due to the subadditivity constraint). Furthermore, for each model, its price is also constrained by the survey price points of its right neighbors (due to the monotonicity constraint). Therefore, we need to add the survey price points of model M_k to models M_{k'}, k' = 1, ..., k−1. The detailed algorithm for constructing the complete price space is shown in Algorithm 3. In Lines 11-15, we use f(ε_k, p^k[j]) to distinguish the survey price points from the other points in the complete price space. For ease of presentation, we call a price point in the complete price space from Line 1 an SU (survey) point, a price point from Line 8 an SC (subadditivity constraint) point, and a price point from Line 10 an MC (monotonicity constraint) point.

Algorithm 3: Constructing the complete price space for the relaxed revenue maximization problem.
  input: models with noise parameters ε_k and the j-th lowest survey price point (ε_k, sp^k[j]) of each model M_k
  output: complete price space
  1   add all the survey price points (ε_k, sp^k[j]) to the complete price space;
  2   for each survey price point (ε_k, sp^k[j]) do
  3       draw a line l_{(ε_k, sp^k[j])} through this point and the origin;
  4   for each model with noise parameter ε_k do
  5       draw a vertical line l_{ε_k};
  6   for each line l_{ε_{k'}} do
  7       for each line l_{(ε_k, sp^k[j])} do
  8           add the point (ε_{k'}, (sp^k[j]/ε_k) · ε_{k'}) obtained by intersecting line l_{ε_{k'}} and line l_{(ε_k, sp^k[j])} to the complete price space, for k' = k+1, ..., l;
  9   for each survey price point (ε_k, sp^k[j]) do
  10      add the price point (ε_{k'}, sp^k[j]) to the complete price space for k' = 1, ..., k−1;
  11  for each price point (ε_k, p^k[j]) in the complete price space do
  12      if (ε_k, p^k[j]) is a survey price point then
  13          f(ε_k, p^k[j]) = 1;
  14      else
  15          f(ε_k, p^k[j]) = 0;

Example 5.2. We show a running example of Algorithm 3. In Figure 2, we add the survey price points (1, sp^1[1] = 1), (1, sp^1[2] = 4), (2, sp^2[1] = 3), (2, sp^2[2] = 7), (3, sp^3[1] = 5), and (3, sp^3[2] = 8) to the complete price space in Line 1. In Line 2, for the survey price point (1, 1), we draw a line l_{(1,1)} through this point and the origin in Line 3. In Line 4, for model M_1 with privacy parameter ε_1 = 1, we draw a vertical line l_{ε_1}. In Lines 6-8, for l_{(1,1)} and l_{ε_2}, we add the intersection point (ε_2, (sp^1[1]/ε_1) · ε_2) = (2, 2) to the complete price space. In total, we have six such new price points, shown as boxes in Figure 2. In Line 9, for the survey price point (3, sp^3[1]) = (3, 5), we add the price points (2, 5) and (1, 5) to the complete price space in Line 10. Similarly, we have six such new price points in total, shown as circles. Therefore, the complete price space has six price points for each of models M_1 and M_3. We have five price points for model M_2 because the intersection point (ε_2, (sp^1[2]/ε_1) · ε_2) = (2, 8) is the same as the MC point (ε_2, sp^3[2]) = (2, 8). In Lines 12-15, we have f(2, 3) = 1 and f(2, 5) = 0.

Theorem 5.3. The complete price space constructed by Algorithm 3 is sufficient for finding the optimal solution of the relaxed revenue maximization problem.
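A minimal Python sketch of the complete-price-space construction of Algorithm 3 for the running example; the function and variable names are illustrative.

```python
def complete_price_space(eps, survey_points):
    """Candidate prices per model: SU points, plus SC points projected to the
    right along unit-price rays, plus MC points copied to the left (Algorithm 3).
    survey_points: list of (model_index, budget) pairs."""
    l = len(eps)
    candidates = [set() for _ in range(l)]
    for k, sp in survey_points:
        candidates[k].add(sp)                                  # SU points (Line 1)
        for k2 in range(k + 1, l):                             # SC points (Lines 2-8)
            candidates[k2].add(sp * eps[k2] / eps[k])
        for k2 in range(k):                                    # MC points (Lines 9-10)
            candidates[k2].add(sp)
    return [sorted(c) for c in candidates]

if __name__ == "__main__":
    eps = [1, 2, 3]
    survey = [(0, 1), (0, 4), (1, 3), (1, 7), (2, 5), (2, 8)]
    for k, cand in enumerate(complete_price_space(eps, survey)):
        print(f"M{k + 1}: {cand}")
    # M1: [1, 3, 4, 5, 7, 8]
    # M2: [2.0, 3, 5, 7, 8.0]   (the SC point 8 coincides with the MC point 8)
    # M3: [3.0, 4.5, 5, 8, 10.5, 12.0]
```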

A recursive solution. We define the revenue of an optimal solution recursively in terms of the optimal solutions to subproblems. We pick as our subproblems the problems of determining the maximum revenue MAX[k, j], where MAX[k, j] denotes the revenue of the optimal solution considering the first k models and taking the j-th lowest price point in the complete price space of model M_k. For the full problem, the maximum revenue is max{MAX[l, j]} over all price points (ε_l, p^l[j]) in the complete price space of model M_l. For the price points in the complete price space of model M_1, we can directly compute MAX[1, j] for all price points because there is no initial constraint. For the price points in the complete price space of the other models, we need to consider both the monotonicity constraint and the subadditivity constraint. We have the recursive equation

MAX[k, j] = \max\{MAX[k-1, j']\} + MR[k, j],    (13)

where the maximum is over the price points j' of model M_{k−1} with p^{k−1}[j'] ≤ p^k[j] and p^{k−1}[j']/ε_{k−1} ≥ p^k[j]/ε_k, and MR[k, j] denotes the revenue from model M_k if we price model M_k at p^k[j].

Computing the maximum revenue. We can now write a dynamic programming algorithm, Algorithm 4, based on the recurrence in Eq. 13, where |p^k| is the number of complete price points of model M_k. We compute MR[k, j] for all the price points of the complete price space in Lines 1-6, and MAX[k, j] for the price points of the first model and of the other models in Line 8 and Line 11, respectively.

Algorithm 4: Dynamic programming algorithm for finding an optimal solution of the relaxed revenue maximization problem.
  input: models with DP parameters ε_k and their corresponding price points in the complete price space
  output: MAX(RRM)
  1   for each model M_k do
  2       sort the price points in the complete price space in increasing order;
  3       use (ε_k, p^k[j]) to denote the j-th lowest price point;
  4       MR[k, |p^k|] = p^k[|p^k|] · f(ε_k, p^k[|p^k|]);
  5       for j = |p^k| − 1 to 1 do
  6           MR[k, j] = p^k[j] · Σ_{k'=j}^{|p^k|} f(ε_k, p^k[k']);
  7   for j = 1 to |p^1| do
  8       MAX[1, j] = MR[1, j];
  9   for each model M_k, k = 2, ..., l do
  10      for each price point (ε_k, p^k[j]) do
  11          MAX[k, j] = max{MAX[k−1, j']} + MR[k, j], where p^{k−1}[j'] ≤ p^k[j] and p^{k−1}[j']/ε_{k−1} ≥ p^k[j]/ε_k;
  12          record p(L.MAX[k, j]) = p^{k−1}[j'], the price point of model M_{k−1} that attains the maximum in Line 11;
  13  MAX(RRM) = max{MAX[l, j]} for j = 1, ..., |p^l|;

Theorem 5.4. Algorithm 4 finishes in O(N²l²) time.

Example 5.5. For model M_1 of Figure 2, MAX[1, j] for j = 1, ..., 6 is shown in Table 2. For computing MAX[2, 2], there is only one price point, (1, 3), satisfying both the monotonicity constraint and the subadditivity constraint within model M_1. Therefore, MAX[2, 2] = MAX[1, 2] + MR[2, 2] = 3 + 6 = 9. Similarly, we can fill the entire table shown in Table 2.

Table 2: Example for finding optimal pricing.

  Model \ j-th lowest price point   1    2    3    4    5    6
  M_1                               2    3    4    0    0    0
  M_2                               6    9    9    11   4    null
  M_3                               15   18   19   19   11   4

Finding the optimal pricing. Although Algorithm 4 determines the maximum revenue of RRM, it does not directly give the optimal price p(ε_k) for each model. However, for each price point (ε_k, p^k[j]) in the complete price space, we record in Line 12 of Algorithm 4 the price point p(L.MAX[k, j]) of model M_{k−1} that attains the maximum revenue among the price points satisfying both the monotonicity constraint and the subadditivity constraint with (ε_k, p^k[j]). Therefore, we can recursively backtrack from the optimal price point of model M_k to the optimal price point of model M_{k−1}. We need O(nl) time to find the maximum value in MAX[l, j] and O(l) time to backtrack. In total, we can construct an optimal solution in O(Nl) time. We note that such a solution may be one of several solutions that achieve the maximum value.

We show a running example in Table 2. We first obtain MAX[3, 4], which attains the maximum value among MAX[3, j] for j = 1, ..., 6. Therefore, we set p(ε_3) = 8. We backtrack to MAX[2, 4] in model M_2 and set p(ε_2) = 7, and then backtrack to MAX[1, 3] in model M_1 and set p(ε_1) = 4. Finally, an optimal price setting is ⟨p(ε_1), p(ε_2), p(ε_3)⟩ = ⟨4, 7, 8⟩.
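A minimal Python sketch of the dynamic program in Algorithm 4 with backtracking, reproducing the Table 2 values for the running example; the candidate prices are hard-coded here (they are the complete price space built by Algorithm 3), and all names are illustrative.

```python
def rrm_dp(eps, candidates, survey_points):
    """Dynamic program of Eq. (13) over the complete price space.
    candidates[k]: sorted candidate prices of model M_k;
    survey_points: list of (model_index, budget) pairs."""
    budgets = [[v for (k, v) in survey_points if k == m] for m in range(len(eps))]
    # MR[k][j]: revenue of model k when priced at candidates[k][j]
    MR = [[p * sum(1 for v in budgets[k] if v >= p) for p in cand]
          for k, cand in enumerate(candidates)]
    MAX, back = [list(MR[0])], [[None] * len(candidates[0])]
    for k in range(1, len(eps)):
        row, brow = [], []
        for j, p in enumerate(candidates[k]):
            feasible = [(MAX[k - 1][j2], j2) for j2, p2 in enumerate(candidates[k - 1])
                        if p2 <= p and p2 / eps[k - 1] >= p / eps[k]]
            best, arg = max(feasible) if feasible else (float("-inf"), None)
            row.append(best + MR[k][j])
            brow.append(arg)
        MAX.append(row)
        back.append(brow)
    # Backtrack one optimal price vector.
    j = max(range(len(candidates[-1])), key=lambda j: MAX[-1][j])
    prices = []
    for k in range(len(eps) - 1, -1, -1):
        prices.append(candidates[k][j])
        j = back[k][j]
    return MAX, prices[::-1]

if __name__ == "__main__":
    eps = [1, 2, 3]
    survey = [(0, 1), (0, 4), (1, 3), (1, 7), (2, 5), (2, 8)]
    candidates = [[1, 3, 4, 5, 7, 8], [2, 3, 5, 7, 8], [3, 4.5, 5, 8, 10.5, 12]]
    MAX, prices = rrm_dp(eps, candidates, survey)
    print(MAX)     # rows match Table 2: [2,3,4,0,0,0], [6,9,9,11,4], [15,18,19,19,11,4]
    print(prices)  # an optimal setting with revenue 19 (here [4, 5, 5];
                   # Example 5.5's <4, 7, 8> is another optimum)
```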

965 Algorithm 5: Pseudo-polynomial time algorithm for Algorithm 7: Enumerative guess-based greedy algorithm SCM. for SCM. input :푐푖 , MB, and SV푖 for 푖 = 1, ..., 푛. input :푐푖 , MB, and SV푖 for 푖 = 1, ..., 푛. output: S. output: S. 1 for j=0:a:MB do 1 for i=1 to h do ′ 2 SV[0, 푗 ] = 0; 2 choose 푖 data owner(s) to compose a subset S ; 3 for i =1 to n do Íℎ 푛 3 we have 푖=1 푖 such subsets; 4 for j=0:a:MB do ℎ 푛 4 for j=1 to Í  do 5 if 푐푖 > 푗 × 푎 then 푖=1 푖 ′ SV[ ] = SV[ − ] 5 compute the manufacturing budget of the data owners in S ; 6 푖, 푗 푖 1, 푗 ; ′ 6 delete those S if their manufacturing budget is larger than MB; 7 else S′ S′ S′ 8 SV[푖, 푗 ] = 푚푎푥 {SV[푖 − 1, 푗 ], SV[푖 − 1, 푗 × 푎 − 푐푖 ] + SV푖 }; 7 we have 푟 remaining subsets 1, 2, ..., 푟 ; S′ 8 for each subset 푗 , 푗 = 1, ..., 푟 do ′ MB 9 let 퐷푎 be the data owner with the least Shapley value in S , remove all data 9 backtrack from SV[푛, ⌈ ⌉] to SV[1, 0] to find the selected 퐷푖 ; 푗 푎 S − S′ SV owners in 푗 푗 if their Shapley value is larger than 푎 and get a new S′′ subset 푗 ; |S′ | S′′ M B − Í 푗 Algorithm 6: Greedy algorithm for SCM. 10 run Algorithm 6 in 푗 with remaining manufacturing budget 푖=1 푐푖 ; MB SV = S′ S′′ S′ S′′ input :푐푖 , , and 푖 for 푖 1, ..., 푛. 11 return the data owners in 푗 and 푗 , where 푗 and 푗 have the highest Shapley S output: . value among 푗 = 1, ..., 푟; 1 for i=1 to n do 2 compute SV푖 /푐푖 ; 3 sort SV푖 /푐푖 for 푖 = 1, ..., 푛 in decreasing order and denote as SV1/푐1 ≥ SV2/푐2 ≥ ... ≥ SV푛/푐푛; Enumerative guess-based greedy algorithm. Although Algo- 4 B=0; − ) ≤ MB 5 i=1; rithm 6 can achieve (1 휁 푀퐴푋, the requirement of 푐푖 휁 is a 6 while 퐵 ≤ M B do little strict. We present another algorithm with the same worst case 7 add 푐푖 to B; bound but without the above requirement. The idea is to guess the 8 i=i+1; ℎ most profitable data owners in an optimal solution and compute 9 return the corresponding 퐷푖 of those 푐푖 in B; the rest greedily as in Algorithm 6. The detailed algorithm is shown 1 in Algorithm 7. Let 훼 ∈ (0, 1) be a fixed constant and ℎ = ⌈ 훼 ⌉. We that the algorithm has the polynomial time complexity in terms first enumerate all the subsets with data owner size ≤ ℎ in Lines of MB rather than the traditional number of data owners 푛. We 1-3. We delete those subsets with higher manufacturing budget MB than MB in Lines 4-6. In Lines 7-10, for each remaining subset, we divide MB into ⌈ 푎 ⌉ parts, where 푎 is the greatest common divisor in 푐푖 for all 푖 = 1, ..., 푛. We define SV[푖, 푗] as the maximum call Algorithm 6 to maximize the Shapley value with the remaining SCM that can be attained with manufacturing budget ≤ 푗 × 푎 budget after taking the ≤ ℎ data owners. by only using the first 푖 data owners. The idea is to simplify the ⌈ 1 ⌉ Theorem 5.11. Algorithm 7 runs in 푂 (푛 훼 ) time with (1 − complicated problem of computing SV[푛, ⌈ MB ⌉] by breaking it 푎 훼)MAX worst case bound. down into simpler subproblems in a recursive manner. The detailed algorithm is shown in Algorithm 5. In Line 5, if the manufacturing 5.3 Complete Dealer Dynamics budget is not enough, we do not need to consider the 푖푡ℎ data owner. We summarize the complete Dealer dynamics in Algorithm 8, which Otherwise, we can take 퐷 if we can get more Shapley value by 푖 integrates all the proposed algorithms in the previous sections. We replacing some data owners from 퐷1, ..., 퐷푖−1 in Line 8. assume that the broker can set appropriate parameters 푙, 휖푘 , and 휇푘 Greedy algorithm. The time cost of Algorithm 5 is extremely based on her market experiences. 
5.3 Complete Dealer Dynamics
We summarize the complete Dealer dynamics in Algorithm 8, which integrates all of the proposed algorithms from the previous sections. We assume that the broker can set appropriate parameters l, ε_k, and μ_k based on her market experience. If the maximized revenue cannot cover the stated Shapley coverage rates, the broker needs to reset the parameters for the l models.

6 EXPERIMENTS
In this section, we present experimental studies validating that: 1) our proposed algorithms for pricing models generate more revenue for the data owners and the broker, and significantly outperform the baseline algorithms in terms of efficiency; and 2) our proposed algorithms for subset selection in model training are efficient and effective.

6.1 Experiment Setup
We ran experiments on a machine with an Intel Core i7-8700K CPU, two NVIDIA GeForce GTX 1080 Ti GPUs, and 64GB of memory, running Ubuntu. We employed Support Vector Machine (SVM) as our machine learning model and used both synthetic datasets and a real chess dataset [15] in our experiments. The chess dataset includes 3196 data tuples, each having 36 attributes.
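For concreteness, a minimal sketch of a (non-private) SVM setup on the UCI chess (King-Rook vs. King-Pawn) dataset is given below; the local file path, the label encoding, and the train/test split are assumptions made only for illustration, and the differentially private training step of Algorithm 2 is intentionally omitted.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed local copy of the UCI kr-vs-kp data: 36 categorical attributes plus a class label.
df = pd.read_csv("kr-vs-kp.data", header=None)

X = pd.get_dummies(df.iloc[:, :-1])          # one-hot encode the categorical attributes
y = (df.iloc[:, -1] == "won").astype(int)    # assumed binary label: White wins or not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="linear", C=1.0)            # plain SVM; DP noise (Algorithm 2) is not applied here
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```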

[Figure 3: Model pricing effectiveness (independent distribution). Panels: (a) survey data, (b) model price, (c) affordability ratio, and (d) revenue; (a) and (b) are plotted against the privacy parameter, (c) and (d) against the algorithm (DPP+, DPP, Linear, Low, Median, Base).]

Algorithm 8: The complete Dealer dynamics.
1  %% Data collection;
2  collect dataset {D_1, ..., D_n} along with compensation function c_i(ε) for i = 1, ..., n based on the Shapley value computed in Algorithm 1;
3  %% Model parameter setting;
4  decide a set of l models to train with coverage rate μ_k and DP parameter ε_k for k = 1, ..., l;
5  %% Model pricing;
6  perform a market survey among m′ sampled model buyers (survey participants) based on each model buyer's pricing function P(B_j, M_k); collect market survey results of model demand tm_j and valuation v_j for j = 1, ..., m′;
7  call Algorithms 3 and 4 to compute the optimal price p(ε_k) of model M_k for k = 1, ..., l;
8  %% Model training and release;
9  for k = 1 to l do
10     data selection: call Algorithm 5, 6, or 7 to select training subset S^k with manufacturing budget p(ε_k) · purchase(p(ε_k)) to maximize SV(S^k);
11     if for all k, SV(S^k)/SV({D_1, ..., D_n}) ≥ μ_k then
12         model training: train the model with subset S^k by Algorithm 2;
13         model release: release model M_k, its coverage ratio μ_k, DP parameter ε_k, and model price p_k;
14 %% Model transaction;
15 model buyers send their target models and payment to the broker, and the broker sends the corresponding models to the model buyers;
16 %% Compensation allocation;
17 for k = 1 to l do
18     allocate the corresponding compensation to D_i based on its compensation function c_i(ε_k) if D_i is used to train model M_k;
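For readers who prefer code to pseudocode, a heavily simplified Python skeleton of these dynamics is sketched below; every helper it calls (compute_shapley, survey_buyers, optimal_prices, expected_purchases, select_subset, train_dp_model, flag_for_reparameterization) is a hypothetical placeholder standing in for the corresponding Algorithms 1-7, not an API defined in the paper.

```python
def dealer_dynamics(owners, epsilons, coverages, broker):
    """Simplified end-to-end broker loop mirroring Algorithm 8 (hypothetical helpers throughout)."""
    # Data collection: Shapley values and compensation functions c_i(eps) per owner (Algorithm 1).
    shapley = broker.compute_shapley(owners)

    # Model pricing: market survey plus dynamic-programming price optimization (Algorithms 3 and 4).
    survey = broker.survey_buyers(epsilons)
    prices = broker.optimal_prices(survey, epsilons)

    released = []
    for k, (eps, mu) in enumerate(zip(epsilons, coverages)):
        # Manufacturing budget: price times the expected number of purchases at that price.
        budget = prices[k] * broker.expected_purchases(prices[k], survey)

        # Data selection: Algorithm 5, 6, or 7 maximizing Shapley coverage under the budget.
        subset = broker.select_subset(owners, shapley, budget, eps)
        coverage = sum(shapley[i] for i in subset) / sum(shapley.values())

        if coverage >= mu:
            # Model training with DP (Algorithm 2), then release; compensation is allocated later.
            model = broker.train_dp_model(subset, eps)
            released.append((model, mu, eps, prices[k]))
        else:
            # Revenue cannot cover the stated coverage rate; the broker must reset l, eps_k, mu_k.
            broker.flag_for_reparameterization(k)
    return released
```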

[Figure 4: Pricing efficiency: time (s) versus the number of survey points for DPP+, DPP, Base, and Base+.]
[Figure 5: Training efficiency: time (s) versus the number of data owners for PPDP, Greedy, and GuessGreedy.]

For model pricing, we compare our proposed algorithm with several baseline algorithms as follows.
• DPP: applying the dynamic programming algorithm in Algorithm 4 to the survey price space.
• DPP+: applying the dynamic programming algorithm in Algorithm 4 to the complete price space.
• Linear: taking the lowest survey price from model M_1 and the highest survey price from model M_l, and using linear interpolation for the remaining models M_2, ..., M_{l−1} based on the two end prices.
• Low: taking the lowest price in each model.
• Median: taking the median price in each model.
• Base: applying the exhaustion-based approach to the survey price space.
• Base+: applying the exhaustion-based approach to the complete price space.

For model training, we compare the three subset selection algorithms we proposed with two baseline algorithms as follows.
• PPDP: the pseudo-polynomial dynamic programming algorithm in Algorithm 5.
• Greedy: the greedy algorithm in Algorithm 6.
• GuessGreedy: the guess-based greedy algorithm in Algorithm 7.
• ALL: selecting the entire chess training dataset.
• RAND: randomly selecting a subset of the entire set of tuples given the manufacturing budget.

6.2 Model Pricing
Efficiency. We experimentally study the efficiency of the proposed algorithms for pricing models. Because the Linear, Low, and Median algorithms only need to scan the survey price points once, their time cost is very low; for ease of presentation, we omit the experimental results for these three algorithms. Figure 4 shows the time cost of DPP, DPP+, Base, and Base+ for a varying number of survey price points, i.e., the number of surveyed model buyers. Both DPP and DPP+ increase linearly with the number of survey price points, which verifies the efficiency of the proposed dynamic programming algorithm. The time cost of both Base and Base+ is prohibitively high due to the enormous number of price combinations across the different models.

Effectiveness. We experimentally study the revenue gain of the proposed algorithms on datasets with different distributions. We omit Base+ due to its prohibitively high cost. We generate two datasets with 100 survey price points according to simulated price functions and Eq. 8 (i.e., from 100 model buyers). The number of survey price points on each model follows a uniform distribution and a Gaussian (mean = 3, standard deviation = 1.2) distribution, respectively.
For the first model of both datasets, we generate the survey price points following a uniform distribution over the range [1000, 5000]. The remaining four models increase both the lower bound and the upper bound of the range by 1000 each. Figures 3(a)-(d) show the survey price points, the final model prices, the affordability ratio (the fraction of model buyers who can afford to buy a model), and the revenue on the uniformly distributed dataset, respectively. Figure 3(a) shows the survey price points of the five models with differential privacy parameters ε = 0.001, 0.01, 0.1, 1, and 10, respectively.
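A minimal sketch of the survey-price generation just described might look as follows; the even split of buyers across models and the use of NumPy's uniform sampler are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilons = [0.001, 0.01, 0.1, 1, 10]   # DP parameters of the five model versions
buyers_per_model = 20                  # 100 survey points split evenly (illustrative)

survey_points = {}
for k, eps in enumerate(epsilons):
    low, high = 1000 + 1000 * k, 5000 + 1000 * k   # range shifts by 1000 per model
    survey_points[eps] = rng.uniform(low, high, size=buyers_per_model)
```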

[Figure 6: Model training effectiveness. Panels: (a) subset size, (b) subset Shapley value, and (c) accuracy, each versus the privacy parameter for PPDP, Greedy, GuessGreedy, ALL, and RAND; (d) accuracy versus Shapley coverage for ε = 0.001, 0.01, 0.1, 1, and 10.]

Figure 3(b) shows that DPP+ and DPP have similar optimal price settings. All models in DPP have different prices. In the price setting of DPP+, the second and third models have the same price, which maximizes the revenue compared to DPP and verifies the effectiveness of the complete price space construction. Figure 3(c) shows that DPP+ has the highest affordability ratio except for Low. For the most critical metric, revenue, DPP+ outperforms the other algorithms by at least 3%, which verifies the gain of the complete price space construction. Furthermore, DPP has the same performance as Base, which attains the optimal revenue. That is, although we relaxed the constraint p(ε_k) + p(ε_{k′}) ≥ p(ε_k + ε_{k′}) in Eq. 10 to p(ε_k)/ε_k ≥ p(ε_{k′})/ε_{k′}, the relaxed problem has the same optimal revenue as the original. All algorithms perform similarly on the Gaussian distribution dataset as on the uniform distribution dataset [7].

6.3 Model Training
Efficiency. Figure 5 shows the time cost of Greedy, PPDP, and GuessGreedy for a varying number of data owners on synthetic datasets. PPDP has the intermediate time cost among the three algorithms. Greedy significantly outperforms both PPDP and GuessGreedy due to its simplicity. GuessGreedy has the highest time cost because it needs to enumerate C(n, h) subsets, where n is the total number of data owners and h is the size of the sampled subsets during enumeration. In our experiments, the time cost of GuessGreedy is prohibitively high even when we set h = 2. We skip some results of PPDP and GuessGreedy in the figure due to their high cost.

Effectiveness. We experimentally study the effectiveness of the proposed subset selection algorithms PPDP, Greedy, and GuessGreedy. We simulate five models by adding DP noise with parameters ε = 0.001, 0.01, 0.1, 1, and 10 in the training process following Algorithm 2, and each model has a manufacturing budget MB_k = (ε_k/ε_1) · Σ_{i=1}^{n} c_i(ε_k) for k = 1, ..., 5, respectively.
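As a small illustration of this budget schedule, here is a sketch under the assumption that each compensation function c_i(ε) is available as a Python callable; the toy linear compensation functions are purely illustrative.

```python
epsilons = [0.001, 0.01, 0.1, 1, 10]

def manufacturing_budget(k, comp_funcs, epsilons):
    """MB_k = (eps_k / eps_1) * sum_i c_i(eps_k), as used to simulate the five models."""
    eps_k, eps_1 = epsilons[k], epsilons[0]
    return (eps_k / eps_1) * sum(c(eps_k) for c in comp_funcs)

# Example with toy linear compensation functions c_i(eps) = base_i * eps (illustrative only).
comp_funcs = [lambda eps, b=b: b * eps for b in (1.0, 2.0, 3.0)]
budgets = [manufacturing_budget(k, comp_funcs, epsilons) for k in range(len(epsilons))]
```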
Figure 6(a) shows the number of tuples selected by PPDP, Greedy, GuessGreedy, ALL, and RAND, respectively. We employed 3000 tuples as the training dataset. Given the manufacturing budget, our proposed algorithms select more tuples than RAND. Figure 6(b) shows the sum of the Shapley values of the tuples in the subsets selected by the different algorithms. We used the absolute value of the Shapley value in order to select data points that affect the model performance both positively and negatively. Because ALL uses the entire set of tuples, it has the highest sum of Shapley values. All of our proposed algorithms outperform RAND. Furthermore, PPDP outperforms Greedy and GuessGreedy, which verifies the effectiveness of the proposed dynamic programming algorithm. Figure 6(c) shows the accuracy of the models trained on the subsets selected by PPDP, Greedy, GuessGreedy, ALL, and RAND, respectively. Given the manufacturing budget, our proposed subset selection algorithms choose only around 800 tuples. However, the accuracy of the proposed algorithms is even higher than that of ALL, which verifies the effectiveness of the Shapley value in selecting important data tuples for training a high-utility model. Comparing the proposed algorithms, PPDP has the highest accuracy, with the correspondingly highest sum of Shapley values in Figure 6(b). That is, given the limited manufacturing budget, we can choose the subset with the higher sum of Shapley values to obtain higher accuracy. Also, the accuracy is predominantly determined by the DP parameter ε. Figure 6(d) shows the relationship between Shapley coverage and accuracy. We can see that the accuracy generally increases with increasing Shapley coverage.

7 CONCLUSION AND FUTURE WORK
In this paper, we proposed Dealer, the first end-to-end model marketplace with differential privacy. Data owners specify their desired compensation functions based on privacy sensitivity and Shapley value. If the compensation proposed by data owner D_i is too high, there is a risk that the broker will not choose D_i to train the model. Model buyers provide the price functions they are willing to pay based on Shapley coverage sensitivity and noise sensitivity. If the price provided by model buyer B_j is too low, there is a risk that B_j cannot purchase the model from the broker. Based on the compensation functions from the data owners and the price functions from the model buyers, the broker builds l versions of models with different Shapley coverage rates and DP parameters, and presents an arbitrage-free model pricing mechanism to price the model versions so as to maximize the revenue (model pricing). Given the maximized revenue, the broker maximizes the Shapley coverage of each model version to remain competitive (model training). We proposed efficient algorithms for both optimization problems, model pricing and model training. Experimental results show that the proposed algorithms are efficient and effective.

ACKNOWLEDGMENTS
This work was supported in part by the NSF grants (CNS-1952192, CNS-2027783, SCH-2014438, PPoSS 2028839, and IIS-1838042), AFOSR under DDDAS program FA9550-12-1-0240, the NSFC grants (61941121 and 91646203), the NSERC Discovery Grant program, and the NIH grants (R01 1R01NS107291-01 and R56HL138415).

REFERENCES
[1] Dawex. https://www.dawex.com/en/.
[2] Google BigQuery. https://cloud.google.com/bigquery/.
[3] https://support.gnip.com/apis/.
[4] https://www.bloomberg.com/professional/product/market-data/.
[5] Iota. https://data.iota.org/.
[6] SafeGraph. https://www.safegraph.com/.
[7] Technical report. https://github.com/emory-aims/publications/blob/master/projectdealer.pdf/.
[8] A. Agarwal, M. Dahleh, and T. Sarkar. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 701–726. ACM, 2019.
[9] M. Ancona, C. Öztireli, and M. H. Gross. Explaining deep neural networks with a polynomial time algorithm for shapley value approximation. In ICML, pages 272–281, 2019.
[10] J. Castro, D. Gómez, and J. Tejada. Polynomial calculation of the shapley value based on sampling. Computers & OR, 36(5):1726–1730, 2009.
[11] G. C. Cawley and N. L. C. Talbot. Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers. Pattern Recognition, 36(11):2585–2592, 2003.
[12] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.
[13] S. Chawla, S. Deep, P. Koutris, and Y. Teng. Revenue maximization for query pricing. PVLDB, 13(1):1–14, 2019.
[14] L. Chen, P. Koutris, and A. Kumar. Towards model-based pricing for machine learning in a data marketplace. In SIGMOD, pages 1535–1552, 2019.
[15] D. Dua and C. Graff. UCI machine learning repository, 2017.
[16] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[17] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[18] S. S. Fatima, M. J. Wooldridge, and N. R. Jennings. A linear approximation method for the shapley value. Artif. Intell., 172(14):1673–1699, 2008.
[19] A. Ghorbani and J. Zou. Data shapley: Equitable valuation of data for machine learning. In ICML, pages 2242–2251, 2019.
[20] A. Ghosh and A. Roth. Selling privacy at auction. In Proceedings of the 12th ACM Conference on Electronic Commerce (EC-2011), San Jose, CA, USA, June 5-9, 2011, pages 199–208, 2011.
[21] V. Guruswami, J. D. Hartline, A. R. Karlin, D. Kempe, C. Kenyon, and F. McSherry. On profit-maximizing envy-free pricing. In SODA, pages 1164–1173, 2005.
[22] R. Iyengar, J. P. Near, D. Song, O. Thakkar, A. Thakurta, and L. Wang. Towards practical differentially private convex optimization. In IEEE S&P, 2019.
[23] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. M. Gurel, B. Li, C. Zhang, C. Spanos, and D. Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11):1610–1623, 2019.
[24] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D'Oliveira, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konecný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao. Advances and open problems in federated learning. CoRR, abs/1912.04977, 2019.
[25] R. E. Korf. A complete anytime algorithm for number partitioning. Artif. Intell., 106(2):181–203, 1998.
[26] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu. Query-based data pricing. In PODS, pages 167–178. ACM, 2012.
[27] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu. Toward practical query pricing with querymarket. In SIGMOD, pages 613–624. ACM, 2013.
[28] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu. Query-based data pricing. J. ACM, 62(5):43:1–43:44, 2015.
[29] C. Li, D. Y. Li, G. Miklau, and D. Suciu. A theory of pricing private data. In ICDT, pages 33–44, 2013.
[30] C. Li, D. Y. Li, G. Miklau, and D. Suciu. A theory of pricing private data. ACM Trans. Database Syst., 39(4):34:1–34:28, 2014.
[31] C. Li, D. Y. Li, G. Miklau, and D. Suciu. A theory of pricing private data. Commun. ACM, 60(12):79–86, 2017.
[32] B. Lin and D. Kifer. On arbitrage-free pricing for general data queries. PVLDB, 7(9):757–768, 2014.
[33] J. Pei. A survey on data pricing: from economics to data science. IEEE Trans. Knowl. Data Eng., 2021.
[34] C. Shapiro and H. Varian. Versioning: The smart way to sell information. Harvard Business Review, 76(6):107–115, 1998.
[35] L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
[36] T. Wang, B. Ding, J. Zhou, C. Hong, Z. Huang, N. Li, and S. Jha. Answering multi-dimensional analytical queries under local differential privacy. In SIGMOD, pages 159–176. ACM, 2019.
