Fairness in Reinforcement Learning

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth¹
Abstract

We initiate the study of fairness in reinforcement learning, where the actions of a learning algorithm may affect its environment and future rewards. Our fairness constraint requires that an algorithm never prefers one action over another if the long-term (discounted) reward of choosing the latter action is higher. Our first result is negative: despite the fact that fairness is consistent with the optimal policy, any learning algorithm satisfying fairness must take time exponential in the number of states to achieve non-trivial approximation to the optimal policy. We then provide a provably fair polynomial time algorithm under an approximate notion of fairness, thus establishing an exponential gap between exact and approximate fairness.

1. Introduction

The growing use of machine learning for automated decision-making has raised concerns about the potential for unfairness in learning algorithms and models. In settings as diverse as policing [22], hiring [19], lending [4], and criminal sentencing [2], mounting empirical evidence suggests these concerns are not merely hypothetical [1; 25].

We initiate the study of fairness in reinforcement learning, where an algorithm's choices may influence the state of the world and future rewards. In contrast, previous work on fair machine learning has focused on myopic settings where such influence is absent, e.g. in i.i.d. or no-regret models [5; 6; 9; 10]. The resulting fairness definitions therefore do not generalize well to a reinforcement learning setting, as they do not reason about the effects of short-term actions on long-term rewards. This is relevant for settings where historical context can have a distinct influence on the future. For concreteness, we consider the specific example of hiring (though other settings such as college admission or lending decisions can be embedded into this framework). Consider a firm aiming to hire employees for a number of positions. The firm might consider a variety of hiring practices, ranging from targeting and hiring applicants from well-understood parts of the applicant pool (which might be a reasonable policy for short-term productivity of its workforce), to exploring a broader class of applicants whose backgrounds might differ from the current set of employees at the company (which might incur short-term productivity and learning costs but eventually lead to a richer and stronger overall applicant pool).

We focus on the standard model of reinforcement learning, in which an algorithm seeks to maximize its discounted sum of rewards in a Markovian decision process (MDP). Throughout, the reader should interpret the actions available to a learning algorithm as corresponding to choices or policies affecting individuals (e.g. which applicants to target and hire). The reward for each action should be viewed as the short-term payoff of making the corresponding decision (e.g. the short-term influence on the firm's productivity after hiring any particular candidate). The actions taken by the algorithm affect the underlying state of the system (e.g. the company's demographics as well as the available applicant pool) and therefore in turn will affect the actions and rewards available to the algorithm in the future.

Informally, our definition of fairness requires that (with high probability) in state s, an algorithm never chooses an available action a with probability higher than another action a' unless Q*(s, a) > Q*(s, a'), i.e. the long-term reward of a is greater than that of a'. This definition, adapted from Joseph et al. (2016), is weakly meritocratic: facing some set of actions, an algorithm must pick a distribution over actions with (weakly) heavier weight on the better actions (in terms of their discounted long-term reward). Correspondingly, a hiring process satisfying our fairness definition cannot probabilistically target one applicant population over another unless the targeted population has a higher quality.

¹ University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth.
In MDPs, present rewards may depend on past choices. In particular, defining fairness in terms of immediate rewards would prohibit any policy sacrificing short-term rewards in favor of long-term rewards. This is undesirable, since it is the long-term rewards that matter in reinforcement learning, and optimizing for long-term rewards often necessitates short-term sacrifices. Moreover, the long-term impact of a decision should be considered when arguing about its relative fairness. We will therefore define fairness using the state-action value function Q*_M.

Definition 2 (Fairness). L is fair if for all inputs δ > 0, all M, all rounds t, all states s and all actions a, a':

  Q*_M(s, a) > Q*_M(s, a')  ⇒  L(s, a, h_{t-1}) ≥ L(s, a', h_{t-1})

with probability at least 1 − δ over histories h_{t-1}. (Here L(s, a, h) denotes the probability that L chooses a from s given history h.)

Fairness requires that an algorithm never probabilistically favors an action with lower long-term reward over an action with higher long-term reward. In hiring, this means that an algorithm cannot target one applicant population over another unless the targeted population has a higher quality.

In Section 3, we show that fairness can be extremely restrictive. Intuitively, L must play uniformly at random until it has high confidence about the Q*_M values, in some cases taking exponential time to achieve near-optimality. This motivates relaxing Definition 2. We first relax the probabilistic requirement and require only that an algorithm not substantially favor a worse action over a better one.

Definition 3 (Approximate-choice Fairness). L is α-choice fair if for all inputs δ > 0 and α > 0, for all M, all rounds t, all states s and actions a, a':

  Q*_M(s, a) > Q*_M(s, a')  ⇒  L(s, a, h_{t-1}) ≥ L(s, a', h_{t-1}) − α

with probability at least 1 − δ over histories h_{t-1}. If L is α-choice fair for any input α > 0, we call L approximate-choice fair.

A slight modification of the lower bound for (exact) fairness shows that algorithms satisfying approximate-choice fairness can also require exponential time to achieve near-optimality. We therefore propose an alternative relaxation, where we relax the quality requirement. As described in Section 1.1, the resulting notion of approximate-action fairness is in some sense the most fitting relaxation of fairness, and is a particularly attractive one because it allows us to give algorithms circumventing the exponential hardness proved for fairness and approximate-choice fairness.

Definition 4 (Approximate-action Fairness). L is α-action fair if for all inputs δ > 0 and α > 0, for all M, all rounds t, all states s and actions a, a':

  Q*_M(s, a) > Q*_M(s, a') + α  ⇒  L(s, a, h_{t-1}) ≥ L(s, a', h_{t-1})

with probability at least 1 − δ over histories h_{t-1}. If L is α-action fair for any input α > 0, we call L approximate-action fair.

Approximate-choice fairness prevents equally good actions from being chosen at very different rates, while approximate-action fairness prevents substantially worse actions from being chosen over better ones. In hiring, an approximately-action fair firm can only (probabilistically) target one population over another if the targeted population is not substantially worse. While this is a weaker guarantee, it at least forces an approximately-action fair algorithm to learn different populations to statistical confidence. This is a step forward from current practices, in which companies have much higher degrees of uncertainty about the quality (and impact) of hiring individuals from under-represented populations. For this reason and the computational benefits mentioned above, our upper bounds will primarily focus on approximate-action fairness.

We now state several useful observations regarding fairness; we defer the formal statements and their proofs to the Appendix. There always exists a (possibly randomized) optimal policy which is fair (Observation 1); moreover, any optimal policy (deterministic or randomized) is approximate-action fair (Observation 2), as is the uniformly random policy (Observation 3).

Finally, we consider a restriction of the actions in an MDP M to nearly-optimal actions (as measured by Q*_M values).

Definition 5 (Restricted MDP). The α-restricted MDP of M, denoted by M^α, is identical to M except that in each state s, the set of available actions is restricted to

  { a ∈ A_M : Q*_M(s, a) ≥ max_{a' ∈ A_M} Q*_M(s, a') − α }.

M^α has the following two properties: (i) any policy in M^α is α-action fair in M (Observation 4) and (ii) the optimal policy in M^α is also optimal in M (Observation 5). Observations 4 and 5 aid our design of an approximate-action fair algorithm: we construct M^α from estimates of the Q*_M values (see Section 4.3 for more details).

3. Lower Bounds

We now demonstrate a stark separation between the performance of learning algorithms with and without fairness. First, we show that neither fair nor approximate-choice fair algorithms achieve near-optimality unless the number of time steps T is at least Ω(k^n), exponential in the size of the state space. We then show that any approximate-action fair algorithm requires a number of time steps that is at least Ω(k^{1/(1−γ)}) to achieve near-optimality. We start by proving a lower bound for fair algorithms.

Theorem 3. If δ < 1/4, γ > 1/2 and ε < 1/8, no fair algorithm can be ε-optimal in T = O(k^n) steps. (We have not optimized the constants upper-bounding parameters in the statements of Theorems 3, 4 and 5; the values presented are chosen for convenience.)
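The definitions in this section all compare actions through their Q*_M values. As a concrete illustration (a minimal sketch of ours, not code from the paper), the following computes Q* by value iteration for a small finite MDP with per-state rewards, and checks the weakly meritocratic condition of Definition 2 for a proposed action distribution at a single state; the array encoding and the tolerance parameter are our assumptions:

```python
import numpy as np

def q_star(P, R, gamma, iters=500):
    """Q* by value iteration for a finite MDP.

    P: transition tensor of shape (n_states, n_actions, n_states),
    R: per-state rewards, gamma: discount factor in [0, 1)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)                 # V*(s) = max_a Q*(s, a)
        Q = R[:, None] + gamma * (P @ V)  # Q*(s,a) = R(s) + gamma * E[V*(s')]
    return Q

def weakly_meritocratic(pi_s, Q_s, tol=1e-9):
    """Definition 2 at one state: pi may put more weight on a than on a'
    only if Q*(s, a) >= Q*(s, a')."""
    k = len(pi_s)
    return all(Q_s[a] >= Q_s[b] - tol
               for a in range(k) for b in range(k)
               if pi_s[a] > pi_s[b] + tol)
```

On a two-state chain where one action leads toward the rewarding state, the uniform distribution passes the check vacuously (matching Observation 3), while any distribution putting more weight on the lower-Q* action fails it.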
Standard reinforcement learning algorithms (absent a fairness constraint) learn an ε-optimal policy in a number of steps polynomial in n and 1/(1−γ); Theorem 3 therefore shows a steep cost of imposing fairness. We outline the idea of the proof of Theorem 3. For intuition, first consider the special case when the number of actions k = 2. We introduce the MDPs witnessing the claim in Theorem 3 for this case.

Definition 6 (Lower Bound Example). For A_M = {L, R}, let M(x) = (S_M, A_M, P_M, R_M, T, γ) be an MDP with

  • for all i ∈ [n], P_M(s_i, L, s_1) = P_M(s_i, R, s_j) = 1 where j = min{i + 1, n}, and all other transition probabilities 0;
  • for i ∈ [n − 1], R_M(s_i) = 0.5, and R_M(s_n) = x.

Figure 1. MDP(x): circles represent states (labels denote the state name and deterministic reward); arrows represent actions.

Figure 1 illustrates the MDP from Definition 6. All the transitions and rewards in M are deterministic, but the reward at state s_n can be either 1 or 1/2, and so no algorithm (fair or otherwise) can determine whether the Q*_M values of all the states are the same or not until it reaches s_n and observes its reward. Until then, fairness requires that the algorithm play all the actions uniformly at random (if the reward at s_n is 1/2, any fair algorithm must play uniformly at random forever). Thus, any fair algorithm will take exponential time in the number of states to reach s_n. This example can be easily modified for k > 2: from each state s_i, k − 1 of the actions (deterministically) return to state s_1 and only one action (deterministically) reaches state s_{min{i+1,n}}. It will take k^n steps before any fair algorithm reaches s_n and can stop playing uniformly at random (which is necessary for near-optimality). The same example, with a slightly modified analysis, also provides a lower bound of Ω((k/(1 + kα))^n) time steps for approximate-choice fair algorithms, as stated in Theorem 4.

Theorem 4. If δ < 1/4, α < 1/4, γ > 1/2 and ε < 1/8, no α-choice fair algorithm is ε-optimal for T = O((k/(1 + kα))^n) steps.

Fairness and approximate-choice fairness are both extremely costly, ruling out polynomial time learning rates. Hence, we focus on approximate-action fairness. Before moving to positive results, we mention that the time complexity of approximate-action fair algorithms will still suffer from an exponential dependence on 1/(1−γ).

Theorem 5. For δ < 1/4, α < 1/8, γ > max(0.9, c) with c ∈ (1/2, 1) and ε < 1/16, no α-action fair algorithm is ε-optimal for T = O((k^{1/(1−γ)})^c) steps.

The MDP in Figure 1 also witnesses the claim of Theorem 5 when n = ⌈log(1/(2α))/(1−γ)⌉. The discount factor γ is generally taken as a constant, so in most interesting cases 1/(1−γ) ≪ n: this lower bound is substantially less stringent than the lower bounds proven for fairness and approximate-choice fairness. Hence, from now on, we focus on designing algorithms satisfying approximate-action fairness with learning rates polynomial in every parameter but 1/(1−γ), and with tight dependence on 1/(1−γ).

4. A Fair and Efficient Learning Algorithm

We now present an approximate-action fair algorithm, Fair-E³, with the performance guarantee stated below.

Theorem 6. Given ε > 0, α > 0, δ ∈ (0, 1) and γ ∈ [0, 1) as inputs, Fair-E³ is an α-action fair algorithm which achieves ε-optimality after

  T = Õ( n⁵ T*_ε k^{1/(1−γ)+5} / ( min{α, ε}⁴ · ε⁴ · δ² · (1−γ)¹² ) )    (2)

steps, where Õ hides poly-logarithmic terms.

The running time of Fair-E³ (which we have not attempted to optimize) is polynomial in all the parameters of the MDP except 1/(1−γ); Theorem 5 implies that this exponential dependence on 1/(1−γ) is necessary.

Several more recent algorithms (e.g. R-MAX [3]) have improved upon the performance of E³. We adapted E³ primarily for its simplicity. While the machinery required to properly balance fairness and performance is somewhat involved, the basic ideas of our adaptation are intuitive. We further note that subsequent algorithms improving on E³ tend to heavily leverage the principle of "optimism in the face of uncertainty": such behavior often violates fairness, which generally requires uniformity in the face of uncertainty. Thus, adapting these algorithms to satisfy fairness is more difficult. This in particular suggests E³ as an apt starting point for designing a fair planning algorithm.

The remainder of this section explains Fair-E³, beginning with a high-level description in Section 4.1. We then define the "known" states Fair-E³ uses to plan in Section 4.2, explain this planning process in Section 4.3, and bring this all together to prove Fair-E³'s fairness and performance guarantees in Section 4.4.
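Before turning to the algorithm, the exponential hitting time behind Theorems 3 and 4 can be checked numerically. The sketch below (our illustration, not from the paper) solves for the expected number of steps a uniformly random policy needs to reach s_n in the k-action variant of the Definition 6 chain, where from each s_i one action advances to s_{min{i+1,n}} and the other k − 1 actions reset to s_1:

```python
import numpy as np

def expected_hitting_time(n, k):
    """Expected steps for uniform-random play to first reach s_n, where each
    step advances s_i -> s_{i+1} with probability 1/k and otherwise resets
    to s_1. Solves the linear system E_i = 1 + E_{i+1}/k + (k-1)/k * E_1,
    with E_n = 0, for the unknowns E_1, ..., E_{n-1}."""
    A = np.zeros((n - 1, n - 1))
    b = np.ones(n - 1)
    for i in range(n - 1):          # row i encodes the equation for E_{i+1}
        A[i, i] += 1.0
        A[i, 0] -= (k - 1) / k      # reset mass back to s_1
        if i + 1 < n - 1:
            A[i, i + 1] -= 1.0 / k  # advance to the next state in the chain
    return np.linalg.solve(A, b)[0]
```

The result matches the closed form k(k^{n−1} − 1)/(k − 1), e.g. 6 for n = 3, k = 2, and grows as k^n, in line with the Ω(k^n) lower bounds.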
4.1. Informal Description of Fair-E³

Fair-E³ relies on the notion of "known" states. A state s is defined to be known after all actions have been chosen from s enough times to confidently estimate relevant reward distributions, transition probabilities, and Q^π_M values for each action. At each time t, Fair-E³ then uses known states to reason about the MDP as follows:

  • If in an unknown state, take a uniformly random trajectory of length H_ε.
  • If in a known state, compute (i) an exploration policy which escapes to an unknown state quickly, together with p, the probability that this policy reaches an unknown state within 2T*_ε steps, and (ii) an exploitation policy which is near-optimal in the known states of M.
    – If p is large enough, follow the exploration policy; otherwise, follow the exploitation policy.

Fair-E³ thus relies on known states to balance exploration and exploitation in a reliable way. While Fair-E³ and E³ share this general idea, fairness forces Fair-E³ to more delicately balance exploration and exploitation. For example, while both algorithms explore until states become "known", the definition of a known state must be much stronger in Fair-E³ than in E³ because Fair-E³ additionally requires accurate estimates of actions' Q^π_M values in order to make decisions without violating fairness. For this reason, Fair-E³ replaces the deterministic exploratory actions of E³ with random trajectories of actions from unknown states. These random trajectories are then used to estimate the necessary Q^π_M values.

In a similar vein, Fair-E³ requires particular care in computing exploration and exploitation policies, and must restrict the set of such policies to fair exploration and fair exploitation policies. Correctly formulating this restriction process to balance fairness and performance relies heavily on the observations about the relationship between fairness and performance provided in Section 2.1.

4.2. Known States in Fair-E³

We now formally define the notion of known states for Fair-E³. We say a state s becomes known when one can compute good estimates of (i) R_M(s) and P_M(s, a) for all a, and (ii) Q*_M(s, a) for all a.

Definition 7 (Known State). Let

  m₁ = O( k ( (H_ε + 3) / ((1−γ)α) )² n log(k/δ) )  and
  m₂ = O( ( H_ε⁴ / min{ε, α}⁸ ) log(n/δ) ).

A state s becomes known after taking

  m_Q := k · max{m₁, m₂}    (3)

length-H_ε random trajectories from s.

It remains to show that motivating conditions (i) and (ii) indeed hold for our formal definition of a known state. Informally, m₁ random trajectories suffice to ensure that we have accurate estimates of all Q*_M(s, a) values, and m₂ random trajectories suffice to ensure accurate estimates of the transition probabilities and rewards.

To formalize condition (i), we rely on Theorem 7, connecting the number of random trajectories taken from s to the accuracy of the empirical V^π_M estimates.

Theorem 7 (Theorem 5.5, Kearns et al. (2000)). For any state s and α > 0, after

  m = O( ( (H_ε + 3) / ((1−γ)α) )² log(|Π|/δ) )

random trajectories of length H_ε from s, with probability at least 1 − δ we can compute estimates V̂^π_M such that |V^π_M(s) − V̂^π_M(s)| ≤ α, simultaneously for all π ∈ Π.

Theorem 7 enables us to translate between the number of trajectories taken from a state and the uncertainty about its V^π_M values for all policies (including π* and hence V*_M). Since |Π| = kⁿ, we substitute log(|Π|) = n log(k). To estimate Q*_M(s, a) values using the V*_M(s) values, we increase the number of necessary length-H_ε random trajectories by a factor of k.

For condition (ii), we adapt the analysis of E³ [13], which states that if each action in a state s is taken m₂ times, then the transition probabilities and reward in state s can be estimated accurately (see Section 4.4).

4.3. Planning in Fair-E³

We now formalize the planning steps in Fair-E³ from known states. For the remainder of our exposition, we make Assumption 2 for convenience (and show how to remove this assumption in the Appendix).

Assumption 2. T*_ε is known.

Fair-E³ constructs two ancillary MDPs for planning. Let Γ denote the current set of known states. M_Γ is the exploitation MDP, in which the unknown states of M are condensed into a single absorbing state s₀ with no reward. In the known states, transitions are kept intact and the rewards are deterministically set to their mean value. M_Γ thus incentivizes exploitation by giving reward only for staying within known states.
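This condensation step can be sketched concretely. Below is our illustration (not the paper's implementation), with the MDP encoded as numpy arrays and the absorbing state s₀ appended as the last index, both our own encoding choices:

```python
import numpy as np

def exploitation_mdp(P, R, known):
    """Condense all unknown states of (P, R) into one absorbing, zero-reward
    state s0 (index n). Transitions out of known states are kept intact,
    with any probability mass onto unknown states redirected to s0."""
    n, k, _ = P.shape
    P2 = np.zeros((n + 1, k, n + 1))
    R2 = np.zeros(n + 1)
    for s in range(n):
        if s in known:
            R2[s] = R[s]                # known states keep their (mean) reward
            for a in range(k):
                for s_next in range(n):
                    target = s_next if s_next in known else n
                    P2[s, a, target] += P[s, a, s_next]
        else:
            P2[s, :, n] = 1.0           # unknown states are identified with s0
    P2[n, :, n] = 1.0                   # s0 is absorbing and gives no reward
    return P2, R2
```

The exploration MDP described next shares these dynamics and only changes the rewards (0 at known states, 1 at s₀).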
In contrast, M_{[n]\Γ} is the exploration MDP, identical to M_Γ except for the rewards: the rewards in the known states are set to 0 and the reward in s₀ is set to 1. M_{[n]\Γ} thus incentivizes exploration by giving reward only for escaping to unknown states. See the middle (right) panel of Figure 2 for an illustration of M_Γ (M_{[n]\Γ}), and the Appendix for formal definitions.

Figure 2. Left: an MDP M with two actions (L and R) and deterministic transition functions and rewards; green denotes the set of known states Γ. Middle: M_Γ. Right: M_{[n]\Γ}.

Fair-E³ uses these constructed MDPs to plan according to the following natural idea: when in a known state, Fair-E³ constructs M̂_Γ and M̂_{[n]\Γ} based on the estimated transitions and rewards observed so far (see the Appendix for formal definitions), and then uses these to compute additional restricted MDPs M̂^α_Γ and M̂^α_{[n]\Γ} for approximate-action fairness. Fair-E³ then uses these restricted MDPs to choose between exploration and exploitation.

More formally, if the optimal policy in M̂^α_{[n]\Γ} escapes to the absorbing state of M_Γ with high enough probability within 2T*_ε steps, then Fair-E³ explores by following that policy. Otherwise, Fair-E³ exploits by following the optimal policy in M̂^α_Γ for T*_ε steps. While following either of these policies, whenever Fair-E³ encounters an unknown state, it stops following the policy and proceeds by taking a length-H_ε random trajectory.

4.4. Analysis of Fair-E³

In this section we formally analyze Fair-E³ and prove Theorem 6. We begin by proving that M^α is useful in the following sense: M^α admits at least one of an exploitation policy achieving high reward or an exploration policy that quickly reaches an unknown state in M.

Lemma 8 (Exploit or Explore Lemma). For any state s ∈ Γ, β ∈ (0, 1) and any T > 0, at least one of the statements below holds:

  • there exists an exploitation policy π in M^α_Γ such that

      max_{π̄ ∈ Π} E[ Σ_{t=1}^T V_M(π̄^t(s), T) ] − E[ Σ_{t=1}^T V_M(π^t(s), T) ] ≤ βT,

    where the random variables π^t(s) and π̄^t(s) denote the states reached from s after following π and π̄ for t steps, respectively;
  • there exists an exploration policy π in M^α_{[n]\Γ} such that the probability that a walk of 2T*_ε steps from s following π will terminate in s₀ exceeds β/T*_ε.

We can use this fact to reason about exploration as follows. First, since Observation 2 tells us that the optimal policy in M is approximate-action fair, if the optimal policy stays in the set Γ of M's known states, then following the optimal policy in M_Γ is both optimal and approximate-action fair. However, if instead the optimal policy in M quickly escapes to an unknown state, the optimal policy in M_Γ may not be able to compete with the optimal policy in M. Ignoring fairness, one natural way of computing an escape policy to "keep up" with the optimal policy is to compute the optimal policy in M_{[n]\Γ}. Unfortunately, following this escape policy might violate approximate-action fairness: high-quality actions might be ignored in lieu of low-quality exploratory actions that quickly reach the unknown states of M. Instead, we compute an escape policy in M^α_{[n]\Γ} and show that if no near-optimal exploitation policy exists in M_Γ, then the optimal policy in M^α_{[n]\Γ} (which is fair by construction) quickly escapes to the unknown states of M.

Next, in order for Fair-E³ to check whether the optimal policy in M^α_{[n]\Γ} quickly reaches the absorbing state of M_Γ with significant probability, Fair-E³ simulates the execution of the optimal policy of M^α_{[n]\Γ} for 2T*_ε steps from the known state s in M^α_{[n]\Γ} several times, counting the ratio of the runs ending in s₀ and applying a Chernoff bound; this is where Assumption 2 is used.

Having discussed exploration, it remains to show that the exploitation policy described in Lemma 8 satisfies ε-optimality as defined in Definition 1. By setting T ≥ T*_ε in Lemma 8 and applying Lemmas 1 and 10, we can prove Corollary 9 regarding this exploitation policy.

Corollary 9. For any state s ∈ Γ and T ≥ T*_ε, if there exists an exploitation policy π in M^α_Γ, then

  E[ (1/T) Σ_{t=1}^T V*_M(π^t(s), T) ] ≥ E_{s∼μ*}[ V*_M(s) ] − ε/(1−γ).

Finally, we have so far elided the fact that Fair-E³ only has access to the empirically estimated MDPs M̂_Γ and M̂_{[n]\Γ} (see the Appendix for formal definitions). We remedy this issue by showing that the behavior of any policy π in M̂^α_Γ (and M̂^α_{[n]\Γ}) is similar to the behavior of π in M^α_Γ (and M^α_{[n]\Γ}). To do so, we prove a stronger claim: the behavior of any π in M̂_Γ (and M̂_{[n]\Γ}) is similar to the behavior of π in M_Γ (and M_{[n]\Γ}).

Lemma 10. Let Γ be the set of known states and M̂_Γ the approximation to M_Γ. Then for any state s ∈ Γ, any action a and any policy π, with probability at least 1 − δ:

  1. V^π_{M_Γ}(s) − min{α/2, ε} ≤ V^π_{M̂_Γ}(s) ≤ V^π_{M_Γ}(s) + min{α/2, ε},
  2. Q^π_{M_Γ}(s, a) − min{α/2, ε} ≤ Q^π_{M̂_Γ}(s, a) ≤ Q^π_{M_Γ}(s, a) + min{α/2, ε}.

We now have the necessary results to prove Theorem 6.

Proof of Theorem 6. We divide the analysis into separate parts: the performance guarantee of Fair-E³ and its approximate-action fairness. We defer the analysis of the probability of failure of Fair-E³ to the Appendix.

We start with the performance guarantee and show that when Fair-E³ follows the exploitation policy, the average V*_M values of the visited states are close to E_{s∼μ*}[V*_M(s)]. However, when following an exploration policy or taking random trajectories, the visited states' V*_M values can be small. To bound the performance of Fair-E³, we bound the number of these exploratory steps by the MDP parameters, so they only have a small effect on overall performance.

Note that in each T*_ε-step exploitation phase of Fair-E³, the expectation of the average V*_M values of the visited states is at least E_{s∼μ*}[V*_M(s)] − ε/(1−γ) − ε/2 by Lemmas 1 and 8 and Observation 5. By a Chernoff bound, the probability that the actual average V*_M values of the visited states are less than E_{s∼μ*}[V*_M(s)] − ε/(1−γ) − 3ε/4 is less than δ/4 if there are at least log(1/δ)/ε² exploitation phases.

We now bound the total number of exploratory steps of Fair-E³ by

  T₁ = O( n m_Q H_ε + n m_Q (T*_ε/ε) log(n/δ) ),

where m_Q is defined in Equation 3 of Definition 7. The two components of this term bound the number of rounds in which Fair-E³ plays non-exploitatively: the first bounds the number of steps taken when Fair-E³ follows random trajectories, and the second bounds how many steps are taken following explicit exploration policies. The former bound follows from the facts that each random trajectory has length H_ε; that in each state, m_Q trajectories are sufficient for the state to become known; and that random trajectories are taken only before all n states are known. The latter bound follows from the facts that Fair-E³ follows an exploration policy for 2T*_ε steps; that an exploration policy needs to be followed only O((T*_ε/ε) log(n/δ)) times before reaching an unknown state (since any exploration policy ends up in an unknown state with probability at least ε/T*_ε by Lemma 8, and applying a Chernoff bound); that an unknown state becomes known after it is visited m_Q times; and that exploration policies are only followed before all states are known.

Finally, to make up for the potentially poor performance incurred during exploration, the number of 2T*_ε-step exploitation phases needed is at least

  T₂ = O( T₁(1−γ)/ε ).

Therefore, after T = T₁ + T₂ steps we have

  E[ (1/T) Σ_{t=1}^T V*_M(s_t) ] ≥ E_{s∼μ*}[ V*_M(s) ] − 2ε,

as claimed in Equation 2. The running time of Fair-E³ is O(n³T²/ε), where the additional factor over T comes from the offline computation of the optimal policies in M̂^α_Γ and M̂^α_{[n]\Γ}.

We wrap up by proving that Fair-E³ satisfies approximate-action fairness in every round. The actions taken during random trajectories are fair (and hence approximate-action fair) by Observation 3. Moreover, Fair-E³ computes policies in M̂^α_Γ and M̂^α_{[n]\Γ}. By Lemma 10, with probability at least 1 − δ, any Q* or V* value estimated in M̂_Γ or M̂_{[n]\Γ} is within α/2 of its corresponding true value in M_Γ or M_{[n]\Γ}. As a result, M̂^α_Γ and M̂^α_{[n]\Γ} (i) contain all the optimal policies and (ii) only contain actions with Q* values within α of the optimal actions. It follows that any policy followed in M̂^α_Γ and M̂^α_{[n]\Γ} is α-action fair, so both the exploration and exploitation policies followed by Fair-E³ satisfy α-action fairness, and Fair-E³ is therefore α-action fair.

5. Discussion and Future Work

Our work leaves open several interesting questions. For example, we give an algorithm that has an undesirable exponential dependence on 1/(1−γ), but we show that this dependence is unavoidable for any approximate-action fair algorithm. Without fairness, near-optimality in learning can be achieved in time polynomial in all of the parameters of the underlying MDP. So, we can ask: does there exist a meaningful fairness notion that enables reinforcement learning in time polynomial in all parameters?

Moreover, our fairness definitions remain open to further modulation. It remains unclear whether one can strengthen our fairness guarantee to bind across time, rather than simply across the actions available at the moment, without large performance tradeoffs. Similarly, it is not obvious whether one can gain performance by relaxing the every-step nature of our fairness guarantee in a way that still forbids discrimination. These and other considerations suggest many questions for further study; we therefore position our work as a first cut for incorporating fairness into a reinforcement learning setting.
References

[1] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, 2016.

[2] Anna Barry-Jester, Ben Casselman, and Dana Goldstein. The new science of sentencing. The Marshall Project, August 8, 2015. URL https://www.themarshallproject.org/2015/08/04/the-new-science-of-sentencing/. Retrieved 4/28/2016.

[3] Ronen Brafman and Moshe Tennenholtz. R-MAX — a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

[4] Nanette Byrnes. Artificial intolerance. MIT Technology Review, March 28, 2016. URL https://www.technologyreview.com/s/600996/artificial-intolerance/. Retrieved 4/28/2016.

[5] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.

[6] Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.

[7] Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. CoRR, abs/1609.07236, 2016.

[8] Sara Hajian and Josep Domingo-Ferrer. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7):1445–1459, 2013.

[9] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3315–3323, 2016.

[10] Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in learning: classic and contextual bandits. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 325–333, 2016.

[11] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In Proceedings of the 12th IEEE International Conference on Data Mining, pages 924–929, 2012.

[12] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50, 2012.

[13] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[14] Michael Kearns, Yishay Mansour, and Andrew Ng. Approximate planning in large POMDPs via reusable trajectories. In Proceedings of the 13th Annual Conference on Neural Information Processing Systems, pages 1001–1007, 2000.

[15] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science, 2017.

[16] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 701–709, 2013.

[17] Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 502–510, 2011.

[18] Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[19] Clair Miller. Can an algorithm hire better than a human? The New York Times, June 25, 2015. URL http://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html/. Retrieved 4/28/2016.

[20] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.

[21] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568, 2008.

[22] Cynthia Rudin. Predictive policing: using machine learning to detect patterns of crime. Wired Magazine, August 2013. URL http://www.wired.com/insights/2013/08/predictive-policing-using-machine-learning-to-detect-patterns-of-crime/. Retrieved 4/28/2016.

[23] Satinder Singh. Personal communication, June 2016.

[24] Richard Sutton and Andrew Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[25] Latanya Sweeney. Discrimination in online ad delivery. Communications of the ACM, 56(5):44–54, 2013.

[26] István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038, 2010.

[27] Blake Woodworth, Suriya Gunasekar, Mesrob Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Proceedings of the 30th Conference on Learning Theory, 2017.

[28] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment and disparate impact: learning classification without disparate mistreatment. In Proceedings of the 26th International World Wide Web Conference, 2017.

[29] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, pages 325–333, 2013.