
Fairness in Reinforcement Learning *

Shahin Jabbari Matthew Joseph Michael Kearns Jamie Morgenstern Aaron Roth 1

Abstract

We initiate the study of fairness in reinforcement learning, where the actions of a learning algorithm may affect its environment and future rewards. Our fairness constraint requires that an algorithm never prefers one action over another if the long-term (discounted) reward of choosing the latter action is higher. Our first result is negative: despite the fact that fairness is consistent with the optimal policy, any learning algorithm satisfying fairness must take time exponential in the number of states to achieve non-trivial approximation to the optimal policy. We then provide a provably fair polynomial time algorithm under an approximate notion of fairness, thus establishing an exponential gap between exact and approximate fairness.

* The full technical version of this paper is available at https://arxiv.org/pdf/1611.03071.pdf.
1 University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

1. Introduction

The growing use of machine learning for automated decision-making has raised concerns about the potential for unfairness in learning algorithms and models. In settings as diverse as policing [22], hiring [19], lending [4], and criminal sentencing [2], mounting empirical evidence suggests these concerns are not merely hypothetical [1; 25].

We initiate the study of fairness in reinforcement learning, where an algorithm's choices may influence the state of the world and future rewards. In contrast, previous work on fair machine learning has focused on myopic settings where such influence is absent, e.g. in i.i.d. or no-regret models [5; 6; 9; 10]. The resulting fairness definitions therefore do not generalize well to a reinforcement learning setting, as they do not reason about the effects of short-term actions on long-term rewards. This is relevant for the settings where historical context can have a distinct influence on the future. For concreteness, we consider the specific example of hiring (though other settings such as college admission or lending decisions can be embedded into this framework). Consider a firm aiming to hire employees for a number of positions. The firm might consider a variety of hiring practices, ranging from targeting and hiring applicants from well-understood parts of the applicant pool (which might be a reasonable policy for short-term productivity of its workforce), to exploring a broader class of applicants whose backgrounds might differ from the current set of employees at the company (which might incur short-term productivity and learning costs but eventually lead to a richer and stronger overall applicant pool).

We focus on the standard model of reinforcement learning, in which an algorithm seeks to maximize its discounted sum of rewards in a Markovian decision process (MDP). Throughout, the reader should interpret the actions available to a learning algorithm as corresponding to choices or policies affecting individuals (e.g. which applicants to target and hire). The reward for each action should be viewed as the short-term payoff of making the corresponding decision (e.g. the short-term influence on the firm's productivity after hiring any particular candidate). The actions taken by the algorithm affect the underlying state of the system (e.g. the company's demographics as well as the available applicant pool) and therefore in turn will affect the actions and rewards available to the algorithm in the future.

Informally, our definition of fairness requires that (with high probability) in state s, an algorithm never chooses an available action a with probability higher than another action a′ unless Q*(s, a) > Q*(s, a′), i.e. the long-term reward of a is greater than that of a′. This definition, adapted from Joseph et al. (2016), is weakly meritocratic: facing some set of actions, an algorithm must pick a distribution over actions with (weakly) heavier weight on the better actions (in terms of their discounted long-term reward). Correspondingly, a hiring process satisfying our fairness definition cannot probabilistically target one population over another if hiring from either population will have similar long-term benefit to the firm's productivity.

Unfortunately, our first result shows an exponential separation in expected performance between the best unfair algorithm and any algorithm satisfying fairness. This motivates our study of a natural relaxation of (exact) fairness, for which we provide a polynomial time learning algorithm, thus establishing an exponential separation between exact and approximately fair learning in MDPs.

Our Results.  Throughout, we use (exact) fairness to refer to the adaptation of Joseph et al. (2016)'s definition defining an action's quality as its potential long-term discounted reward. We also consider two natural relaxations. The first, approximate-choice fairness, requires that an algorithm never chooses a worse action with probability substantially higher than better actions. The second, approximate-action fairness, requires that an algorithm never favors an action of substantially lower quality than that of a better action.

The contributions of this paper can be divided into two parts. First, in Section 3, we give a lower bound on the time required for a learning algorithm to achieve near-optimality subject to (exact) fairness or approximate-choice fairness.

Theorem (Informal statement of Theorems 3, 4, and 5). For constant ε, to achieve ε-optimality, (i) any fair or approximate-choice fair algorithm takes a number of rounds exponential in the number of MDP states and (ii) any approximate-action fair algorithm takes a number of rounds exponential in 1/(1 − γ), for discount factor γ.

Second, we present an approximate-action fair algorithm (Fair-E3) in Section 4 and prove a polynomial upper bound on the time it requires to achieve near-optimality.

Theorem (Informal statement of Theorem 6). For constant ε and any MDP satisfying standard assumptions, Fair-E3 is an approximate-action fair algorithm achieving ε-optimality in a number of rounds that is (necessarily) exponential in 1/(1 − γ) and polynomial in other parameters.

The exponential dependence of Fair-E3 on 1/(1 − γ) is tight: it matches our lower bound on the time complexity of any approximate-action fair algorithm. Furthermore, our results establish rigorous trade-offs between fairness and performance facing reinforcement learning algorithms.

1.1. Related Work

The most relevant parts of the large body of literature on reinforcement learning focus on constructing learning algorithms with provable performance guarantees. E3 [13] was the first learning algorithm with a polynomial learning rate, and subsequent work improved this rate (see Szita and Szepesvári (2010) and references within). The study of robust MDPs [16; 18; 20] examines MDPs with high parameter uncertainty but generally uses "optimistic" learning strategies that ignore (and often conflict with) fairness and so do not directly apply to this work.

Our work also belongs to a growing literature studying the problem of fairness in machine learning. Early work in data mining [8; 11; 12; 17; 21; 29] considered the question from a primarily empirical standpoint, often using statistical parity as a fairness goal. Dwork et al. (2012) explicated several drawbacks of statistical parity and instead proposed one of the first broad definitions of algorithmic fairness, formalizing the idea that "similar individuals should be treated similarly". Recent papers have proven several impossibility results for satisfying different fairness requirements simultaneously [7; 15]. More recently, Hardt et al. (2016) proposed new notions of fairness and showed how to achieve these notions via post-processing of a black-box classifier. Woodworth et al. (2017) and Zafar et al. (2017) further studied these notions theoretically and empirically.

1.2. Strengths and Limitations of Our Models

In recognition of the duration and consequence of choices made by a learning algorithm during its learning process – e.g. job applicants not hired – our work departs from previous work and aims to guarantee the fairness of the learning process itself. To this end, we adapt the fairness definition of Joseph et al. (2016), who studied fairness in the bandit framework and defined fairness with respect to one-step rewards. To capture the desired interaction and evolution of the reinforcement learning setting, we modify this myopic definition and define fairness with respect to long-term rewards: a fair learning algorithm may only choose action a over action a′ if a has true long-term reward at least as high as a′. Our contributions thus depart from previous work in reinforcement learning by incorporating a fairness requirement (ruling out existing algorithms which commonly make heavy use of "optimistic" strategies that violate fairness) and depart from previous work in fair learning by requiring "online" fairness in a previously unconsidered reinforcement learning context.

First note that our definition is weakly meritocratic: an algorithm satisfying our fairness definition can never probabilistically favor a worse option but is not required to favor a better option. This confers both strengths and limitations. Our fairness notion still permits a type of "conditional discrimination" in which a fair algorithm favors group A over group B by selecting choices from A when they are superior and randomizing between A and B when choices from B are superior. In this sense, our fairness requirement is relatively minimal, encoding a necessary variant of fairness rather than a sufficient one. This makes our lower bounds and impossibility results (Section 3) relatively stronger and upper bounds (Section 4) relatively weaker.

Next, our fairness requirement holds (with high probability) across all decisions that a fair algorithm makes. We view this strong constraint as worthy of serious consideration, since "forgiving" unfairness during the learning may badly mistreat the training population, especially if the learning process is lengthy or even continual.

The horizon time H_ε := log(ε(1 − γ)) / log(γ) of an
Additionally, MDP captures the number of steps an approximately opti- it is unclear how to relax this requirement, even for a small mal policy must optimize over. The expected discounted fraction of the algorithm’s decisions, without enabling dis- reward of any policy after H✏ steps approaches the ex- crimination against a correspondingly small population. pected asymptotic discounted reward (Kearns and Singh (2002), Lemma 2). A learning algorithm is a non- Instead, aiming to preserve the “minimal” spirit of our def- L stationary policy that at each round takes the entire history inition, we consider a relaxation that only prevents an al- and outputs a distribution over actions. We now define a gorithm from favoring a significantly worse option over performance for learning algorithms. a better option (Section 2.1). Hence, approximate-action Definition 1 (✏-optimality). Let ✏>0 and (0, 1/2). fairness should be viewed as a weaker constraint: rather 2 achieves ✏-optimality in steps if for any T than safeguarding against every violation of “fairness”, it L T T instead restricts how egregious these violations can be. We 1 T 2✏ discuss further relaxations of our definition in Section 5. Es µ⇤ [VM⇤ (s)] E VM⇤ (st) , (1) ⇠ T  1 " t=1 # X with probability at least 1 , for s the state reaches at 2. Preliminaries t L time t, where the expectation is taken over the transitions In this paper we study reinforcement learning in Markov and the randomization of , for any MDP M. Decision Processes (MDPs). An MDP is a tuple M = L ( , ,P , ,T,) where is a set of n states, We thus ask that a learning algorithm, after sufficiently SM AM M M SM is a set of k actions, T is a horizon of a (possibly infi- many steps, visits states whose values are arbitrarily close AM nite) number of rounds of activity in M, and is a discount to the values of the states visited by the optimal policy. factor. PM : M M M and RM : M [0, 1] de- Note that this is stronger than the “hand-raising” notion S ⇥A !S S ! 4 note the transition probability distribution and reward dis- in Kearns and Singh (2002), which only asked that the tribution, respectively. We use R¯M to denote the mean of learning algorithm stop in a state from which discounted 2 RM . A policy ⇡ is a mapping from a history h (the se- return is near-optimal, permitting termination in a state quence of triples (state, action, reward) observed so far) from which the optimal discounted return is poor. In Def- to a distribution over actions. The discounted state and inition 1, if there are states with poor optimal discounted state-action value functions are denoted by V ⇡ and Q⇡, and reward that the optimal policy eventually leaves for better V ⇡(s, T ) represents expected discounted reward of follow- states, so must our algorithms. We also note the following ⇡ ing ⇡ from s for T steps. The highest values functions are connection between the average VM values of states vis- achieved by the optimal policy ⇡⇤ and are denoted by V ⇤ ited under the stationary distribution of ⇡ (and in particular ⇡ and Q⇤ [24]. We use µ to denote the stationary distribu- an optimal policy) and the average undiscounted rewards tion of ⇡. Throughout we make the following assumption. achieved under the stationary distribution of that policy. ¯ Assumption 1 (Unichain Assumption). The stationary dis- Lemma 2 (Singh (2016)). Let RM be the vector of mean ⇡ tribution of any policy in M is independent of its start state. 
rewards in states of M and VM the vector of discounted ⇡ ¯ ⇡ ⇡ rewards in states under ⇡. Then µ RM =(1 )µ VM . ⇡ · · We denote the ✏-mixing time of ⇡ by T✏ . Lemma 1 relates the ✏-mixing time of any policy ⇡ to the number of rounds We design an algorithm which quickly achieves ✏- ⇡ until the VM values of the visited states by ⇡ are close to optimality and we bound the number of steps before this ⇡ T the expected VM values (under the stationary distribution happens by a polynomial in the parameters of M. µ⇡). We defer all the omitted proofs to the Appendix. Lemma 1. Fix ✏>0. For any state s, following ⇡ for 2.1. Notions of Fairness T T ⇡ steps from s satisfies ✏ We now turn to formal notions of fairness. Translated to T our setting, Joseph et al. (2016) define action a’s quality as ⇡ 1 ⇡ ✏ Es µ⇡ [VM (s)] E VM (st) , the expected immediate reward for choosing a from state ⇠ T  1 " t=1 # s and then require that an algorithm not probabilistically X a a a where st is the state visited at time t when following ⇡ from favor over 0 if has lower expected immediate reward. s and the expectation in the second term is over the transi- However, this naive translation does not adequately cap- 3 tion function and the randomization of ⇡. ture the structural differences between bandit and MDP set- 2 Note that R¯M 1 and Var(RM ) 1 for all states. The called the ✏-reward mixing time which is always linearly bounded bounded reward assumption can be relaxed (see e.g. [13]). Also by the ✏-mixing time but can be much smaller in certain cases assuming rewards in [0, 1] can be made w.l.o.g. up to scaling. (see Kearns and Singh (2002) for a discussion). 3Lemma 1 can be stated for a weaker notion of mixing time 4We suspect unfair E3 also satisfies this stronger notion. Fairness in Reinforcement Learning
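Lemma 2 relates the average undiscounted reward under a policy's stationary distribution to its discounted values via µ^π · R̄_M = (1 − γ) µ^π · V^π_M. The minimal sketch below numerically checks this identity on a small randomly generated policy-induced chain; it is only an illustration of the statement as written above (the policy is fixed, so only the induced transition matrix appears), and the helper names are ours.

    import numpy as np

    def stationary_distribution(P):
        """Left eigenvector of the row-stochastic matrix P with eigenvalue 1."""
        evals, evecs = np.linalg.eig(P.T)
        mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
        mu = np.abs(mu)
        return mu / mu.sum()

    def discounted_values(P, r_bar, gamma):
        """Solve V = r_bar + gamma * P V for the policy-induced chain."""
        n = len(r_bar)
        return np.linalg.solve(np.eye(n) - gamma * P, r_bar)

    rng = np.random.default_rng(0)
    n, gamma = 5, 0.9
    P = rng.random((n, n))
    P /= P.sum(axis=1, keepdims=True)   # P[s, s'] under a fixed policy
    r_bar = rng.random(n)               # mean rewards in [0, 1]

    mu = stationary_distribution(P)
    V = discounted_values(P, r_bar, gamma)

    # Lemma 2: mu . r_bar equals (1 - gamma) * mu . V
    print(mu @ r_bar, (1 - gamma) * (mu @ V))

The two printed numbers agree up to numerical error, since µ P = µ gives µ · V = µ · R̄ + γ µ · V.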

tings since present rewards may depend on past choices in with probability of at least 1 over histories ht 1. MDPs. In particular, defining fairness in terms of imme- If is ↵-action fair for any input ↵>0, we call L L diate rewards would prohibit any policy sacrificing short- approximate-action fair. term rewards in favor of long-term rewards. This is un- desirable, since it is the long-term rewards that matter in Approximate-choice fairness prevents equally good ac- reinforcement learning, and optimizing for long-term re- tions from being chosen at very different rates, while wards often necessitates short-term sacrifices. Moreover, approximate-action fairness prevents substantially worse the long-term impact of a decision should be considered actions from being chosen over better ones. In hiring, an when arguing about its relative fairness. We will therefore approximately-action fair firm can only (probabilistically) define fairness using the state-action value function QM⇤ . target one population over another if the targeted popu- Definition 2 (Fairness). is fair if for all input >0, all lation is not substantially worse. While this is a weaker L M, all rounds t, all states s and all actions a, a0 guarantee, it at least forces an approximately-action fair algorithm to learn different populations to statistical con- QM⇤ (s, a) QM⇤ (s, a0) (s, a, ht 1) (s, a0,ht 1) fidence. This is a step forward from current practices, in )L L 5 which companies have much higher degrees of uncertainty with probability at least 1 over histories ht 1. about the quality (and impact) of hiring individuals from Fairness requires that an algorithm never probabilistically under-represented populations. For this reason and the favors an action with lower long-term reward over an action computational benefits mentioned above, our upper bounds with higher long-term reward. In hiring, this means that will primarily focus on approximate-action fairness. an algorithm cannot target one applicant population over We now state several useful observations regarding fair- another unless the targeted population has a higher quality. ness. We defer all the formal statements and their proofs In Section 3, we show that fairness can be extremely re- to the Appendix. We note that there always exists a (pos- strictive. Intuitively, must play uniformly at random until sibly randomized) optimal policy which is fair (Observa- L tion 1); moreover, any optimal policy (deterministic or ran- it has high confidence about the QM⇤ values, in some cases taking exponential time to achieve near-optimality. This domized) is approximate-action fair (Observation 2), as is motivates relaxing Definition 2. We first relax the proba- the uniformly random policy (Observation 3). bilistic requirement and require only that an algorithm not Finally, we consider a restriction of the actions in an MDP substantially favor a worse action over a better one. M to nearly-optimal actions (as measured by QM⇤ values). Definition 3 (Approximate-choice Fairness). is ↵-choice L Definition 5 (Restricted MDP). The ↵-restricted MDP of fair if for all inputs >0 and ↵>0: for all M, all rounds M, denoted by M ↵, is identical to M except that in each t, all states s and actions a, a : 0 state s, the set of available actions are restricted to a : { QM⇤ (s, a) QM⇤ (s, a0) (s, a, ht 1) (s, a0,ht 1) ↵, QM⇤ (s, a) maxa0 M QM⇤ (s, a0) ↵ a M . )L L 2A | 2A } with probability of at least 1 over histories ht 1. 
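To make Definitions 2-4 concrete, the sketch below checks a single state's action distribution against known Q* values: exact fairness, α-choice fairness, and α-action fairness. It is an illustrative reading of the definitions only; the helper names are ours, and the quantifier over histories and the 1 − δ probability requirement are omitted.

    from itertools import permutations

    def is_fair(q, p, tol=1e-12):
        """Definition 2: q[a] >= q[b] must imply p[a] >= p[b]."""
        return all(p[a] >= p[b] - tol
                   for a, b in permutations(range(len(q)), 2) if q[a] >= q[b])

    def is_choice_fair(q, p, alpha):
        """Definition 3: q[a] >= q[b] must imply p[a] >= p[b] - alpha."""
        return all(p[a] >= p[b] - alpha
                   for a, b in permutations(range(len(q)), 2) if q[a] >= q[b])

    def is_action_fair(q, p, alpha):
        """Definition 4: q[a] > q[b] + alpha must imply p[a] >= p[b]."""
        return all(p[a] >= p[b]
                   for a, b in permutations(range(len(q)), 2) if q[a] > q[b] + alpha)

    # Two actions whose Q* values differ by 0.05:
    q = [0.70, 0.65]
    p = [0.45, 0.55]                   # slightly favors the worse action
    print(is_fair(q, p))               # False: the worse action gets more weight
    print(is_choice_fair(q, p, 0.2))   # True: the probability gap is below alpha
    print(is_action_fair(q, p, 0.1))   # True: the quality gap is below alpha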
↵ ↵ M has the following two properties: (i) any policy in M If is ↵-choice fair for any input ↵>0, we call L L is ↵-action fair in M (Observation 4) and (ii) the optimal approximate-choice fair. policy in M ↵ is also optimal in M (Observation 5). Ob- servations 4 and 5 aid our design of an approximate-action A slight modification of the lower bound for (exact) fair- fair algorithm: we construct M ↵ from estimates of the Q ness shows that algorithms satisfying approximate-choice M⇤ values (see Section 4.3 for more details). fairness can also require exponential time to achieve near- optimality. We therefore propose an alternative relaxation, where we relax the quality requirement. As described in 3. Lower Bounds Section 1.1, the resulting notion of approximate-action fair- We now demonstrate a stark separation between the per- ness is in some sense the most fitting relaxation of fair- formance of learning algorithms with and without fairness. ness, and is a particularly attractive one because it allows First, we show that neither fair nor approximate-choice fair us to give algorithms circumventing the exponential hard- algorithms achieve near-optimality unless the number of ness proved for fairness and approximate-choice fairness. time steps is at least ⌦(kn), exponential in the size of the Definition 4 (Approximate-action Fairness). is ↵-action T L state space. We then show that any approximate-action fair fair if for all inputs >0 and ↵>0, for all M, all rounds algorithm requires a number of time steps that is at least t, all states s and actions a, a0: 1 T 1 ⌦(k ) to achieve near-optimality. We start by proving a QM⇤ (s, a) >QM⇤ (s, a0)+↵ (s, a, ht 1) (s, a0ht 1) lower bound for fair algorithms. )L L 5 1 1 1 (s, a, h) denotes the probability chooses a from s given history h. If < , > and ✏< , no fair algorithm L L Theorem 3. 4 2 8 Fairness in Reinforcement Learning can be ✏-optimal in = O(kn) steps.6 Hence, we focus on approximate-action fairness. Before T moving to positive results, we mention that the time com- Standard reinforcement learning algorithms (absent a fair- plexity of approximate-action fair algorithms will still suf- ness constraint) learn an ✏-optimal policy in a number of 1 fer from an exponential dependence on 1 . steps polynomial in n and 1 ; Theorem 3 therefore shows ✏ 1 1 a steep cost of imposing fairness. We outline the idea for Theorem 5. For <4 , ↵< 8 , >max(0.9,c), c c 1 2 1 1 e proof of Theorem 3. For intuition, first consider the special ( 2 , 1) and ✏< 16 , no ↵-action fair algorithm is ✏- case when the number of actions k =2. We introduce the 1 optimal for = O((k 1 )c) steps. MDPs witnessing the claim in Theorem 3 for this case. T
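Definition 5 restricts each state to actions whose Q* values are within α of the best available action in that state; Fair-E3 applies this restriction to estimated rather than true Q* values. A minimal sketch of the restriction, with our own function name and input format, follows.

    def restrict_actions(Q, alpha):
        """Q[s][a] holds (an estimate of) Q*(s, a).
        Returns, per state, the action set of the alpha-restricted MDP M^alpha:
        every action within alpha of the best action in that state."""
        restricted = {}
        for s, q_s in Q.items():
            best = max(q_s.values())
            restricted[s] = {a for a, q in q_s.items() if q >= best - alpha}
        return restricted

    Q_hat = {"s1": {"L": 0.50, "R": 0.48}, "s2": {"L": 0.20, "R": 0.90}}
    print(restrict_actions(Q_hat, alpha=0.05))  # {'s1': {'L', 'R'}, 's2': {'R'}}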

Definition 6 (Lower Bound Example). For M = L, R , The MDP in Figure 1 also witnesses the claim of Theo- A { } log(1/(2↵)) let M(x)=( M , M , M , M ,T,,x) be an MDP with rem 5 when n = . The discount factor S A P R d 1 e is generally taken as a constant, so in most interesting for all i [n], PM (si,L,s1)=PM (si,R,sj)=1 • 2 cases 1 n: this lower bound is substantially less where j =min i +1,n and is 0 otherwise. 1 ⌧ { } stringent than the lower bounds proven for fairness and for i [n 1], RM (si)=0.5, and RM (sn)=x. • 2 approximate-choice fairness. Hence, from now on, we fo- cus on designing algorithms satisfying approximate-action fairness with learning rates polynomial in every parameter 1 1 but 1 , and with tight dependence on 1 . 4. A Fair and Efficient Learning Algorithm We now present an approximate-action fair algorithm, Fair-E3 with the performance guarantees stated below. Figure 1. MDP(x): Circles represent states (labels denote the state name and deterministic reward). Arrows represent actions. Theorem 6. Given ✏>0, ↵>0, 0, 1 and 2 2 [0, 1) as inputs, Fair-E3 is an ↵-action fair algorithm 2 Figure 1 illustrates the MDP from Definition 6. All the which achieves ✏-optimality after transitions and rewards in M are deterministic, but the re- 1 1 5 1 +5 ward at state sn can be either 1 or , and so no algorithm n T ⇤k 2 = O˜ ✏ (2) (fair or otherwise) can determine whether the Q values 4 4 2 12 M⇤ T min ↵ ,✏ ✏ (1 ) ! of all the states are the same or not until it reaches sn and { } observes its reward. Until then, fairness requires that the steps where O˜ hides poly-logarithmic terms. algorithm play all the actions uniformly at random (if the 1 reward at s is , any fair algorithm must play uniformly 3 n 2 The running time of Fair-E (which we have not attempted at random forever). Thus, any fair algorithm will take ex- to optimize) is polynomial in all the parameters of the MDP ponential time in the number of states to reach sn. This can 1 except 1 ; Theorem 5 implies that this exponential de- be easily modified for k>2: from each state si, k 1 of 1 pendence on 1 is necessary. the actions from state si (deterministically) return to state s1 and only one action (deterministically) reaches any other Several more recent algorithms (e.g. R-MAX [3]) have n 3 3 state smin i+1,n . It will take k steps before any fair algo- improved upon the performance of E . We adapted E pri- { } rithm reaches sn and can stop playing uniformly at random marily for its simplicity. While the machinery required (which is necessary for near-optimality). The same exam- to properly balance fairness and performance is somewhat ple, with a slightly modified analysis, also provides a lower involved, the basic ideas of our adaptation are intuitive. bound of ⌦((k/(1 + k↵))n) time steps for approximate- We further note that subsequent algorithms improving on choice fair algorithms as stated in Theorem 4. E3 tend to heavily leverage the principle of “optimism in 1 1 1 1 face of uncertainty”: such behavior often violates fairness, Theorem 4. If <4 ,↵< 4 ,> 2 and ✏< 8 , no ↵- choice fair algorithm is ✏-optimal for = O(( k )n) which generally requires uniformity in the face of uncer- T 1+k↵ steps. tainty. Thus, adapting these algorithms to satisfy fairness is more difficult. This in particular suggests E3 as an apt Fairness and approximate-choice fairness are both ex- starting point for designing a fair planning algorithm. tremely costly, ruling out polynomial time learning rates. 
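The lower-bound MDP of Definition 6 is simple enough to simulate directly. The sketch below builds the chain as we read it (one designated action advances toward s_n, all other actions reset to s_1, reward 0.5 everywhere except reward x at s_n) and counts how long uniformly random play, which exact fairness forces until the reward at s_n is observed, takes to reach s_n. The names are ours and this is only an illustration of the hitting-time intuition, not the formal argument.

    import random

    def chain_mdp_step(n, k, x):
        """Definition 6, as we read it: from s_i, action 0 moves to s_{min(i+1, n)}
        and the remaining k - 1 actions reset to s_1; the reward is 0.5 in
        s_1, ..., s_{n-1} and x in s_n.  States are the integers 1..n."""
        def step(state, action):
            nxt = min(state + 1, n) if action == 0 else 1
            return nxt, (x if nxt == n else 0.5)
        return step

    def steps_until_s_n(n, k, seed=0):
        """Steps a uniformly random policy needs to reach s_n for the first time."""
        rng = random.Random(seed)
        step = chain_mdp_step(n, k, x=1.0)
        state, t = 1, 0
        while state != n:
            state, _ = step(state, rng.randrange(k))
            t += 1
        return t

    # The hitting time grows exponentially in n, in line with the Omega(k^n)-flavor bound.
    for n in (3, 5, 7, 9):
        print(n, steps_until_s_n(n, k=2))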
The remainder of this section will explain Fair-E3, be- 6We have not optimized the constants upper-bounding param- ginning with a high-level description in Section 4.1.We 3 eters in the statement of Theorems 3, 4 and 5. The values pre- then define the “known” states Fair-E uses to plan in Sec- sented here are only chosen for convenience. tion 4.2, explain this planning process in Section 4.3, and Fairness in Reinforcement Learning bring this all together to prove Fair-E3’s fairness and per- n 4 1 m = O H 8 log . formance guarantees in Section 4.4. 2 min ✏,↵ ✏ ✓ { }◆ ✓ ◆! 4.1. Informal Description of Fair-E3 A state s becomes known after taking 3 Fair-E relies on the notion of “known” states. A state s is mQ := k max m1,m2 (3) · { } defined to be known after all actions have been chosen from s enough times to confidently estimate relevant reward dis- length-H✏ random trajectories from s. tributions, transition , and Q⇡ values for each M It remains to show that motivating conditions (i) and (ii) action. At each time t, Fair-E3 then uses known states to reason about the MDP as follows: indeed hold for our formal definition of a known state. In- formally, m1 random trajectories suffice to ensure that we If in an unknown state, take a uniformly random trajec- have accurate estimates of all QM⇤ (s, a) values, and m2 • tory of length H✏ . random trajectories suffice to ensure accurate estimates of If in a known state, compute (i) an exploration policy the transition probabilities and rewards. • which escapes to an unknown state quickly and p, the To formalize condition (i), we rely on Theorem 7, connect- probability that this policy reaches an unknown state ing the number of random trajectories taken from s to the within 2T steps, and (ii) an exploitation policy which ✏⇤ accuracy of the empirical V ⇡ estimates. is near-optimal in the known states of M. M – If p is large enough, follow the exploration policy; oth- Theorem 7 (Theorem 5.5, Kearns et al. (2000)). For any erwise, follow the exploitation policy. state s and ↵>0, after
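Two bookkeeping quantities drive the phase structure sketched above: the horizon H_ε from Section 2 and the per-state trajectory threshold m_Q of Definition 7. The sketch below shows only this bookkeeping; the exact polynomial forms of m_1, m_2, and m_Q come from Definition 7 and Theorem 7, so here m_Q is simply an input, and the class name is ours.

    import math
    from collections import Counter

    def horizon(eps, gamma):
        """H_eps = ceil(log(eps * (1 - gamma)) / log(gamma)): the number of steps
        after which the remaining discounted reward is below eps (Section 2)."""
        return math.ceil(math.log(eps * (1 - gamma)) / math.log(gamma))

    class KnownStateTracker:
        """Marks a state 'known' once m_Q length-H_eps random trajectories
        have been taken from it (Definition 7 gives the exact value of m_Q)."""
        def __init__(self, m_q):
            self.m_q = m_q
            self.trajectories = Counter()

        def record_trajectory(self, start_state):
            self.trajectories[start_state] += 1

        def is_known(self, state):
            return self.trajectories[state] >= self.m_q

    tracker = KnownStateTracker(m_q=100)
    print(horizon(eps=0.1, gamma=0.9))   # 44
    print(tracker.is_known("s1"))        # False until enough trajectories are recorded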

3 2 Fair-E thus relies on known states to balance exploration 1 ⇧ H✏ +3 and exploitation in a reliable way. While Fair-E3 and E3 m = O k log | | (1 ) ↵ ! share this general idea, fairness forces Fair-E3 to more ✓ ◆ ✓ ◆ delicately balance exploration and exploitation. For ex- random trajectories of length H✏ from s, with probability ample, while both algorithms explore until states become ˆ ⇡ of at least 1 , we can compute estimates VM such that “known”, the definition of a known state must be much ⇡ ˆ⇡ VM (s) VM (s) ↵, simultaneously for all ⇡ ⇧. stronger in Fair-E3 than in E3 because Fair-E3 addition- | | 2 ⇡ ally requires accurate estimates of actions’ QM values in Theorem 7 enables us to translate between the number of order to make decisions without violating fairness. For this trajectories taken from a state and the uncertainty about its 3 ⇡ reason, Fair-E replaces the deterministic exploratory ac- VM values for all policies (including ⇡⇤ and hence VM⇤ ). tions of E3 with random trajectories of actions from un- Since ⇧ = kn, we substitute log ( ⇧ )=n log (k).To | | | | known states. These random trajectories are then used to estimate QM⇤ (s, a) values using the VM⇤ (s) values we in- Q⇡ estimate the necessary M values. crease the number of necessary length-H✏ random trajec- 3 k In a similar vein, Fair-E requires particular care in com- tories by a factor of . puting exploration and exploitation policies, and must re- For condition (ii), we adapt the analysis of E3 [13], which strict the set of such policies to fair exploration and fair states that if each action in a state s is taken m2 times, exploitation policies. Correctly formulating this restriction then the transition probabilities and reward in state s can process to balance fairness and performance relies heavily be estimated accurately (see Section 4.4). on the observations about the relationship between fairness and performance provided in Section 2.1. 4.3. Planning in Fair-E3 3 4.2. Known States in Fair-E3 We now formalize the planning steps in Fair-E from known states. For the remainder of our exposition, we We now formally define the notion of known states for make Assumption 2 for convenience (and show how to re- Fair-E3. We say a state s becomes known when one can move this assumption in the Appendix). compute good estimates of (i) RM (s) and PM (s, a) for all Assumption 2. T✏⇤ is known. a, and (ii) QM⇤ (s, a) for all a. 3 Fair-E constructs two ancillary MDPs for planning: M Definition 7 (Known State). Let is the exploitation MDP, in which the unknown states of M are condensed into a single absorbing state s0 with no re- 2 ward. In the known states , transitions are kept intact and H +3 1 k the rewards are deterministically set to their mean value. m1 = O k ✏ n log and (1 ) ↵ M thus incentivizes exploitation by giving reward only ✓ ◆ ✓ ◆! Fairness in Reinforcement Learning
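Theorem 7 is what lets Fair-E3 turn length-H_ε random trajectories into accurate value estimates. The sketch below shows the basic Monte Carlo estimate it rests on for a single fixed policy on a toy two-state MDP: roll the policy out for H_ε steps, sum discounted rewards, and average over m trajectories. The MDP, policy, and names are ours, and this is not the reusable-trajectories estimator of Kearns et al. (2000), which gives simultaneous guarantees over all policies.

    import random

    # Toy MDP: transitions[s][a] = list of (probability, next_state); rewards[s] = mean reward.
    transitions = {
        "s1": {"L": [(1.0, "s1")], "R": [(0.9, "s2"), (0.1, "s1")]},
        "s2": {"L": [(1.0, "s1")], "R": [(1.0, "s2")]},
    }
    rewards = {"s1": 0.1, "s2": 0.9}

    def sample_next(state, action, rng):
        p, cum = rng.random(), 0.0
        for prob, nxt in transitions[state][action]:
            cum += prob
            if p <= cum:
                return nxt
        return transitions[state][action][-1][1]

    def estimate_value(policy, start, gamma, h_eps, m, seed=0):
        """Average discounted return of `policy` over m length-h_eps rollouts."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(m):
            state, ret = start, 0.0
            for t in range(h_eps):
                ret += (gamma ** t) * rewards[state]
                state = sample_next(state, policy(state), rng)
            total += ret
        return total / m

    def always_right(_state):
        return "R"

    print(estimate_value(always_right, "s1", gamma=0.9, h_eps=50, m=2000))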

probability that a walk of 2T steps from s following ⇡ will terminate in s0 exceeds /T .
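The two ancillary MDPs of Section 4.3 are simple rewrites of the estimated model: the exploitation MDP keeps transitions inside the known set, sends every transition that leaves it to an absorbing state s0 with zero reward, and uses mean rewards in known states, while the exploration MDP shares those transitions but gives reward 0 in known states and reward 1 at s0. A minimal dictionary-based sketch of both constructions, in our own representation and under that reading of the construction, follows.

    def build_ancillary_mdps(transitions, mean_rewards, known, absorbing="s0"):
        """transitions[s][a] = list of (prob, next_state); mean_rewards[s] = R_bar(s).
        Returns the exploitation MDP M_K and the exploration MDP M_[n]\\K,
        both restricted to the known set."""
        def redirect(entries):
            # Transitions leaving the known set are lumped into the absorbing state.
            return [(p, nxt if nxt in known else absorbing) for p, nxt in entries]

        kept = {s: {a: redirect(entries) for a, entries in acts.items()}
                for s, acts in transitions.items() if s in known}
        kept[absorbing] = {a: [(1.0, absorbing)]
                           for acts in transitions.values() for a in acts}

        exploit_rewards = {s: mean_rewards[s] for s in known}
        exploit_rewards[absorbing] = 0.0      # no reward for escaping
        explore_rewards = {s: 0.0 for s in known}
        explore_rewards[absorbing] = 1.0      # reward only for escaping

        return (kept, exploit_rewards), (kept, explore_rewards)

    transitions = {"s1": {"L": [(1.0, "s1")], "R": [(1.0, "s2")]},
                   "s2": {"L": [(1.0, "s1")], "R": [(1.0, "s3")]},
                   "s3": {"L": [(1.0, "s1")], "R": [(1.0, "s3")]}}
    mean_rewards = {"s1": 0.5, "s2": 0.5, "s3": 1.0}
    (m_known, r_exploit), (_, r_explore) = build_ancillary_mdps(
        transitions, mean_rewards, known={"s1", "s2"})
    print(m_known["s2"]["R"], r_exploit, r_explore)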

We can use this fact to reason about exploration as follows. First, since Observation 2 tells us that the optimal policy in Figure 2. Left: An MDP M with two actions (L and R) and de- M is approximate-action fair, if the optimal policy stays in terministic transition functions and rewards. Green denotes the the set of M’s known states M, then following the optimal set of known states . Middle: M. Right: M[n] . ↵ \ policy in M is both optimal and approximate-action fair. However, if instead the optimal policy in M quickly es- ↵ for staying within known states. In contrast, M[n] is the capes to an unknown state in M, the optimal policy in M \ exploration MDP, identical to M except for the rewards. may not be able to compete with the optimal policy in M. The rewards in the known states are set to 0 and the re- Ignoring fairness, one natural way of an escape ward in s0 is set to 1. M[n] then incentivizes exploration policy to “keep up” with the optimal policy is to compute \ by giving reward only for escaping to unknown states. See the optimal policy in M[n] . Unfortunately, following this \ the middle (right) panel of Figure 2 for an illustration of escape policy might violate approximate-action fairness – M (M[n] ), and Appendix for formal definitions. high-quality actions might be ignored in lieu of low-quality \ exploratory actions that quickly reach the unknown states 3 uses these constructed MDPs to plan according to Fair-E of M. Instead, we compute an escape policy in M ↵ the following natural idea: when in a known state, 3 [n] Fair-E and show that if no near-optimal exploitation policy exists\ constructs Mˆ and Mˆ based on the estimated transi- [n] in M , then the optimal policy in M ↵ (which is fair by tion and rewards observed\ so far (see the Appendix for for- [n] construction) quickly escapes to the unknown\ states of M. mal definitions), and then uses these to compute additional ˆ ↵ ˆ ↵ 3 restricted MDPs M and M[n] for approximate-action Next, in order for Fair-E to check whether the optimal 3 \ ↵ fairness. Fair-E then uses these restricted MDPs to policy in M[n] quickly reaches the absorbing state of M \ choose between exploration and exploitation. with significant probability, Fair-E3 simulates the execu- tion of the optimal policy of M ↵ for 2T steps from the More formally, if the optimal policy in Mˆ ↵ escapes to [n] ✏⇤ [n] known state s in M ↵ several times,\ counting the ratio of the absorbing state of M with high enough\ probability the runs ending in s , and applying a Chernoff bound; this within 2T steps, then 3 explores by following that 0 ✏⇤ Fair-E is where Assumption 2 is used. policy. Otherwise, Fair-E3 exploits by following the opti- ˆ ↵ mal policy in M for T✏⇤ steps. While following either of Having discussed exploration, it remains to show that these policies, whenever Fair-E3 encounters an unknown the exploitation policy described in Lemma 8 satisfies ✏- state, it stops following the policy and proceeds by taking optimality as defined in Definition 1. By setting T T✏⇤ a length-H✏ random trajectory. in Lemma 8 and applying Lemmas 1 and 10, we can prove Corollary 9 regarding this exploitation policy. 4.4. Analysis of Fair-E3 Corollary 9. For any state s and T T✏⇤ if there 2 ↵ In this section we formally analyze Fair-E3 and prove The- exists an exploitation policy ⇡ in M then ↵ orem 6. We begin by proving that M is useful in the fol- ↵ T lowing sense: M has at least one of an exploitation policy 1 ⇡ t ✏ V ⇡ (s),T V ⇤ (s) . 
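To choose between exploring and exploiting, Fair-E3 checks whether the exploration policy reaches the absorbing state within 2T*_ε steps with large enough probability, estimating this by repeated simulation and a Chernoff bound. The sketch below is a minimal Monte Carlo version of that check on the dictionary MDP format used earlier; the threshold and the number of runs are set by the analysis, so here they are just inputs, and the names are ours.

    import random

    def escape_probability(transitions, policy, start, absorbing, horizon, runs, seed=0):
        """Fraction of simulated `horizon`-step walks from `start` that reach the
        absorbing state; a Chernoff bound makes this accurate for large `runs`."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(runs):
            state = start
            for _ in range(horizon):
                entries = transitions[state][policy(state)]
                p, cum = rng.random(), 0.0
                for prob, nxt in entries:
                    cum += prob
                    if p <= cum:
                        state = nxt
                        break
                if state == absorbing:
                    hits += 1
                    break
        return hits / runs

    transitions = {"s1": {"R": [(0.2, "s0"), (0.8, "s1")]},
                   "s0": {"R": [(1.0, "s0")]}}
    p_hat = escape_probability(transitions, policy=lambda s: "R", start="s1",
                               absorbing="s0", horizon=10, runs=5000)
    print(p_hat)   # close to 1 - 0.8**10, i.e. about 0.89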
achieving high reward or an exploration policy that quickly E M Es µ⇤ M T ⇠  1 reaches an unknown state in M. t=1 X Lemma 8 (Exploit or Explore Lemma). For any state s 2 3 , (0, 1) and any T>0 at least one of the statements Finally, we have so far elided the fact that Fair-E only has 2 ˆ ↵ ˆ ↵ below holds: access to the empirically estimated MDPs M and M[n] (see the Appendix for formal definitions). We remedy this\ there exists an exploitation policy ⇡ in M ↵ such that ˆ ↵ • issue by showing that the behavior of any policy ⇡ in M ˆ ↵ ↵ T T (and M[n] ) is similar to the behavior of ⇡ in M (and \ ⇡¯ t ⇡ t M ↵ ). To do so, we prove a stronger claim: the behavior max E VM ⇡¯ (s),T E VM ⇡ (s),T T [n] ⇡¯ ⇧  \ ˆ ˆ 2 t=1 t=1 of any ⇡ in M (and M[n] ) is similar to the behavior of X X \ ⇡ in M (and M[n] ). where the random variables ⇡t(s) and ⇡¯t(s) denote the \ states reached from s after following ⇡ and ⇡¯ for t steps, Lemma 10. Let be the set of known states and Mˆ the respectively. approximation to M . Then for any state s , any action 2 there exists an exploration policy ⇡ in M ↵ such that the a and any policy ⇡, with probability at least 1 : • Fairness in Reinforcement Learning

⇡ ⇡ ⇡ 1. VM (s) min ↵/2,✏ V ˆ (s) VM (s)+ exploration, the number of 2T✏⇤ steps exploitation phases { } M  min ↵/2,✏ , needed is at least ⇡ { } ⇡ 2. QM (s, a) min ↵/2,✏ Q ˆ (s, a) { } M  T1(1 ) Q⇡ (s, a)+min ↵/2,✏ . T2 = O . M ✏ { } ✓ ◆ We now have the necessary results to prove Theorem 6. Therefore, after = T + T steps we have T 1 2 Proof of Theorem 6. We divide the analysis into separate parts: the performance guarantee of Fair-E3 and its 1 T 2✏ Es µ⇤ VM⇤ (s) E VM⇤ (st) , approximate-action fairness. We defer the analysis of the ⇠  1 T t=1 probability of failure of Fair-E3 to the Appendix. X as claimed in Equation 2. The running time of Fair-E3 is We start with the performance guarantee and show that n 3 nT 2 O( T ): the additional factor comes from offline when Fair-E3 follows the exploitation policy the average ✏ ✏ ˆ ↵ ˆ ↵ computation of the optimal policies in M and M . VM⇤ values of the visited states is close to Es µ⇤ VM⇤ (s). [n] ⇠ \ However, when following an exploration policy or taking We wrap up by proving Fair-E3 satisfies random trajectories, visited states’ V values can be small. M⇤ approximate-action fairness in every round. The ac- To bound the performance of 3, we bound the num- Fair-E tions taken during random trajectories are fair (and hence ber of these exploratory steps by the MDP parameters so approximate-action fair) by Observation 3. Moreover, they only have a small effect on overall performance. 3 ˆ ↵ ˆ ↵ Fair-E computes policies in M and M[n] . By 3 \ Note that in each T✏⇤-step exploitation phase of Fair-E , the Lemma 10 with probability at least 1 any Q⇤ or V ⇤ ˆ ↵ ˆ ↵ expectation of the average VM⇤ values of the visited states value estimated in M or M[n] is within ↵/2 of its M ↵ \ M ↵ is at least Es µ⇤ VM⇤ (s) ✏/(1 ) ✏/2 by Lemmas 1, 8 corresponding true value in or [n] . As a result, ⇠ \ and Observation 5. By a Chernoff bound, the probability ˆ ↵ ˆ ↵ M and M[n] (i) contain all the optimal policies and \ that the actual average VM⇤ values of the visited states is (ii) only contain actions with Q⇤ values within ↵ of the ↵ less than Es µ⇤ VM⇤ (s) ✏/(1 ) 3✏/4 is less than /4 optimal actions. It follows that any policy followed in Mˆ ⇠ 1 ↵ log( ) and Mˆ is ↵-action fair, so both the exploration and if there are at least 2 exploitation phases. [n] ✏ \ exploitation policies followed by Fair-E3 satisfy ↵-action We now bound the total number of exploratory steps of fairness, and Fair-E3 is therefore ↵-action fair. Fair-E3 by T n T = O nm H + nm ✏⇤ log , 5. Discussion and Future Work 1 Q ✏ Q ✏ ✓ ⇣ ⌘◆ Our work leaves open several interesting questions. For ex- where mQ is defined in Equation 3 of Definition 7. The ample, we give an algorithm that has an undesirable expo- two components of this term bound the number of rounds nential dependence on 1/(1 ), but we show that this de- 3 in which Fair-E plays non-exploitatively: the first bounds pendence is unavoidable for any approximate-action fair al- the number of steps taken when Fair-E3 follows random gorithm. Without fairness, near-optimality in learning can trajectories, and the second bounds how many steps are be achieved in time that is polynomial in all of the param- taken following explicit exploration policies. The former eters of the underlying MDP. So, we can ask: does there bound follows from the facts that each random trajectory exist a meaningful fairness notion that enables reinforce- has length H✏ ; that in each state, mQ trajectories are suf- ment learning in time polynomial in all parameters? 
ficient for the state to become known; and that random tra- jectories are taken only before all n states are known. The Moreover, our fairness definitions remain open to further modulation. It remains unclear whether one can strengthen latter bound follows from the fact that Fair-E3 follows our fairness guarantee to bind across time rather than sim- an exploration policy for 2T ⇤ steps; and an exploration ✏ ply across actions available at the moment without large policy needs to be followed only O( T✏⇤ log( n )) times be- ✏ performance tradeoffs. Similarly, it is not obvious whether fore reaching an unknown state (since any exploration pol- one can gain performance by relaxing the every-step na- icy will end up in an unknown state with probability of at ture of our fairness guarantee in a way that still forbids dis- least ✏ according to Lemma 8, and applying a Chernoff T✏⇤ crimination. These and other considerations suggest many bound); that an unknown state becomes known after it is questions for further study; we therefore position our work visited mQ times; and that exploration policies are only as a first cut for incorporating fairness into a reinforcement followed before all states are known. learning setting. Finally, to make up for the potentially poor performance in Fairness in Reinforcement Learning

References

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. Propublica, 2016.

Anna Barry-Jester, Ben Casselman, and Dana Goldstein. The new science of sentencing. The Marshall Project, August 8 2015. URL https://www.themarshallproject.org/2015/08/04/the-new-science-of-sentencing/. Retrieved 4/28/2016.

Ronen Brafman and Moshe Tennenholtz. R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 28 2016. URL https://www.technologyreview.com/s/600996/artificial-intolerance/. Retrieved 4/28/2016.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science, pages 214–226, 2012.

Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.

Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. CoRR, abs/1609.07236, 2016.

Sara Hajian and Josep Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445–1459, 2013.

Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3315–3323, 2016.

Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 325–333, 2016.

Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In Proceedings of the 12th IEEE International Conference on Data Mining, pages 924–929, 2012.

Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50, 2012.

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Michael Kearns, Yishay Mansour, and Andrew Ng. Approximate planning in large POMDPs via reusable trajectories. In Proceedings of the 13th Annual Conference on Neural Information Processing Systems, pages 1001–1007, 2000.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science, 2017.

Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 701–709, 2013.

Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 502–510, 2011.

Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Clair Miller. Can an algorithm hire better than a human? The New York Times, June 25 2015. URL http://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html/. Retrieved 4/28/2016.

Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.

Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568. ACM, 2008.

Cynthia Rudin. Predictive policing using machine learning to detect patterns of crime. Wired Magazine, August 2013. URL http://www.wired.com/insights/2013/08/predictive-policing-using-machine-learning-to-detect-patterns-of-crime/. Retrieved 4/28/2016.

Satinder Singh. Personal Communication, June 2016.

Richard Sutton and Andrew Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

Latanya Sweeney. Discrimination in online ad delivery. Communications of the ACM, 56(5):44–54, 2013.

István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038, 2010.

Blake Woodworth, Suriya Gunasekar, Mesrob Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory, 2017.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment and disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, 2017.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, pages 325–333, 2013.