Censored Exploration and the Dark Pool Problem

Kuzman Ganchev, Michael Kearns, Yuriy Nevmyvaka, Jennifer Wortman Vaughan
Computer and Information Science, University of Pennsylvania

Abstract

We introduce and analyze a natural algorithm for multi-venue exploration from censored data, which is motivated by the Dark Pool Problem of modern quantitative finance. We prove that our algorithm converges in polynomial time to a near-optimal allocation policy; prior results for similar problems in stochastic inventory control guaranteed only asymptotic convergence and examined variants in which each venue could be treated independently. Our analysis bears a strong resemblance to that of efficient exploration/exploitation schemes in the reinforcement learning literature. We describe an extensive experimental evaluation of our algorithm on the Dark Pool Problem using real trading data.

1 Introduction

We analyze a framework and algorithm for the problem of multi-venue exploration from censored data. Consider a setting in which, at each time period, we have some volume of $V$ units (possibly varying with time) of an abstract good. Our goal is to "sell" or "consume" as many of these units as possible at each step, and there are $K$ abstract "venues" in which this selling or consumption may occur. We can divide our $V$ units in any way we like across the venues in service of this goal. Our interest in this paper is in how to efficiently learn a near-optimal allocation policy over time, under stochastic assumptions on the venues.

This setting belongs to a broad class of problems known in the operations research literature as perishable inventory problems (see Related Work below). In the Dark Pool Problem (discussed extensively in Section 5), at each time step a trader must buy or sell up to $V$ shares of a given stock on behalf of a client,¹ and does so by distributing or allocating them over multiple distinct exchanges (venues) known as dark pools. Dark pools are a recent type of exchange in which relatively little information is provided about the current outstanding orders (Wikipedia, 2009; Bogoslaw, 2007). The trader would like to execute as many of the $V$ shares as possible. If $v_i$ shares are allocated to dark pool $i$, and all of them are executed, the trader learns only that the liquidity available at exchange $i$ was at least $v_i$, not the actual larger number that could have executed there; this important aspect of our framework is known as censoring in the statistics literature.

In this work we make the natural and common assumption that the maximum amount of consumption available in venue $i$ at each time step (e.g., the total liquidity available in the example above) is drawn according to a fixed but unknown distribution $P_i$. Formally speaking, this means that when $v_i$ units are submitted to venue $i$, a value $s_i$ is drawn randomly from $P_i$ and the observed (and possibly censored) amount of consumption is $\min\{s_i, v_i\}$.

A learning algorithm receives a sequence of volumes $V^1, V^2, \ldots$ and must decide how to distribute the $V^t$ units across the venues at each time step $t$. Our goal is to efficiently (in time polynomial in the "complexity" of the $P_i$ and other parameters) learn a near-optimal allocation policy. There is a distinct between-venue exploration component to this problem, since the "right" number of shares to submit to venue $i$ may depend on both $V^t$ and the distributions for the other venues, and the only mechanism by which we can discover the distributions is by submitting allocations. If we routinely submit too-small volumes to a venue, we receive censored observations and underutilize the venue; if we submit too-large volumes, we receive uncensored (or direct) observations but have excess inventory.

¹In our setting it is important that we view $V$ as given exogenously by the client and not under the trader's control, which distinguishes our setting somewhat from prior works; see Related Work.
Our main theoretical contribution is a provably polynomial-time algorithm for learning a near-optimal policy for any unknown venue distributions $P_i$. This algorithm takes a particularly natural and appealing form, in which allocation and distribution reestimation are repeatedly alternated. More precisely, at each time step we maintain distributional estimates $\hat{P}_i$; pretending that these estimates are in fact exactly correct, we allocate the current volume $V$ accordingly. These allocations generate observed consumptions in each venue, which in turn are used to update or reestimate the $\hat{P}_i$. We show that when the $\hat{P}_i$ are "optimistic tail modifications" of the classical Kaplan-Meier maximum likelihood estimator for censored data, this estimate-allocate loop has provably efficient between-venue exploration behavior that yields the desired result. Venues with smaller available volumes (relative to the overall volume $V^t$ and the other venues) are gradually given smaller allocations in the estimate-allocate loop, whereas venues with repeated censored observations are gradually given larger allocations, eventually settling on a near-optimal overall allocation distribution.

Our main theoretical contribution is thus the development and analysis of a multiple-venue, polynomial-time, near-optimal allocation learning algorithm, while our main experimental contribution is the application of this algorithm to the Dark Pool Problem.

2 Preliminaries

We consider the following problem. At each time step $t$, a learner is presented with a quantity or volume $V^t \in \{1, \ldots, V\}$ of units, where $V^t$ is sampled from an unknown distribution $Q$. The learner must decide on an allocation $v^t$ of these shares to a set of $K$ known venues, with $v_i^t \in \{0, \ldots, V^t\}$ for each $i \in \{1, \ldots, K\}$, and $\sum_{i=1}^{K} v_i^t = V^t$. The learner is then told the number of units $r_i^t$ consumed at each venue $i$. Here $r_i^t = \min\{s_i^t, v_i^t\}$, where $s_i^t$ is the maximum consumption level of venue $i$ at time $t$, which is sampled independently from a fixed but unknown distribution $P_i$. If $r_i^t = v_i^t$, we say that the algorithm receives a censored observation, because it is possible to infer only that $r_i^t \le s_i^t$. If $r_i^t < v_i^t$, the algorithm receives a direct (uncensored) observation, since it can infer that $s_i^t = r_i^t$ exactly.
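To make the observation model concrete, here is a minimal Python sketch of a single venue (an illustration of ours, not code from the paper; the class name Venue and its interface are assumptions for exposition):

import random

class Venue:
    """A venue whose maximum consumption s is drawn i.i.d. from a fixed
    but (to the learner) unknown distribution P_i over {0, 1, ..., S_max}."""
    def __init__(self, probs):
        self.probs = probs  # probs[s] = P_i(s); entries must sum to 1

    def submit(self, v):
        """Submit v units: draw s ~ P_i and consume r = min(s, v).
        The observation is censored exactly when r == v (we learn only
        that s >= v); otherwise it is direct and reveals s = r."""
        s = random.choices(range(len(self.probs)), weights=self.probs)[0]
        r = min(s, v)
        return r, (r == v)

For example, Venue([0.5, 0.3, 0.2]).submit(1) returns either (0, False), a direct observation that the demand was zero, or (1, True), a censored observation revealing only that the demand was at least one.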

Proof: Using the fact that tail probabilities must satisfy $\hat{T}_i(s) \ge \hat{T}_i(s')$ for all $s \le s'$, it is easy to verify that by greedily adding units to the venues in decreasing order of $\hat{T}_i(s)$, Algorithm 1 returns
\[
\operatorname*{argmax}_{v} \sum_{i=1}^{K} \sum_{s=1}^{v_i} \hat{T}_i(s) \quad \text{s.t.} \quad \sum_{i=1}^{K} v_i = V.
\]
It remains to show that the expression above is equivalent to the expected number of units consumed. We do this algebraically. For an arbitrary venue $i$,
\begin{align*}
E_{s \sim \hat{P}_i}[\min(s, v_i)] &= \sum_{s=1}^{\infty} \hat{P}_i(s) \min(s, v_i) = \sum_{s=1}^{v_i - 1} s \hat{P}_i(s) + v_i \hat{T}_i(v_i) \\
&= \sum_{s=1}^{v_i - 2} s \hat{P}_i(s) + (v_i - 1) \hat{T}_i(v_i - 1) + \hat{T}_i(v_i) \\
&= \hat{T}_i(1) + \cdots + \hat{T}_i(v_i - 1) + \hat{T}_i(v_i) = \sum_{s=1}^{v_i} \hat{T}_i(s).
\end{align*}
The last two lines follow from the observation that for any $s$, $\hat{P}_i(s-1) + \hat{T}_i(s) = \hat{T}_i(s-1)$, and so $(s-1)\hat{P}_i(s-1) + s\hat{T}_i(s) = (s-1)\hat{T}_i(s-1) + \hat{T}_i(s)$. Thus $\sum_{i=1}^{K} \sum_{s=1}^{v_i} \hat{T}_i(s) = \sum_{i=1}^{K} E_{s \sim \hat{P}_i}[\min(s, v_i)]$, which is the expected number of units consumed.

Algorithm 2: Main algorithm.
  Input: Volume sequence $V^1, V^2, V^3, \ldots$
  Arbitrarily initialize $\hat{T}_i^1$ for each $i$;
  for $t \leftarrow 1, 2, 3, \ldots$ do
    % Allocation Step:
    $v^t \leftarrow$ Greedy$(V^t, \hat{T}_1^t, \ldots, \hat{T}_K^t)$;
    for $i \in \{1, \ldots, K\}$ do
      Submit $v_i^t$ units to venue $i$;
      Let $r_i^t$ be the number of shares sold;
    end
    % Reestimation Step:
    For each $i$, $\hat{T}_i^{t+1} \leftarrow$ OptimisticKM$(\{(v_i^\tau, r_i^\tau)\}_{\tau=1}^{t})$;
  end

The only undetermined part of Algorithm 2 is the subroutine OptimisticKM, which specifies how we estimate $\hat{T}_i^t$ from the observed data. The most natural choice would be the maximum likelihood estimator on the data. This estimator is well known in the statistics literature as the Kaplan-Meier estimator. In the following section, we describe Kaplan-Meier and derive a new convergence result that suits our particular needs. This result in turn lets us define an "optimistic tail modification" of Kaplan-Meier that becomes our choice for OptimisticKM.
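Before proceeding, the Greedy subroutine used in the allocation step of Algorithm 2 is easy to illustrate in code. The following Python sketch is ours; it assumes, as in the earlier sketch, that tail estimates are stored as lists with T_hat[i][0] = 1:

def greedy_allocation(V, T_hat):
    """Hand out the V units one at a time, each to the venue whose next
    unit has the largest estimated tail probability T_hat[i][v_i + 1].
    Because each T_hat[i] is non-increasing in s, this maximizes
    sum_i sum_{s=1}^{v_i} T_hat[i][s] subject to sum_i v_i = V."""
    K = len(T_hat)
    alloc = [0] * K
    for _ in range(V):
        def value(i):  # estimated value of the next unit at venue i
            s = alloc[i] + 1
            return T_hat[i][s] if s < len(T_hat[i]) else 0.0
        alloc[max(range(K), key=value)] += 1
    return alloc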
4 Censored Exploration Algorithm

In this section we present our main theoretical result, which is a polynomial-time, near-optimal algorithm for multi-venue exploration from censored data. We first provide an overview of the algorithm and its analysis before diving into the technical details. As mentioned in the Introduction, the analysis bears a strong resemblance to the "known and unknown state" exploration-exploitation arguments common in the $E^3$ and RMAX algorithms for reinforcement learning (Kearns and Singh, 2002; Brafman and Tennenholtz, 2003).

The analysis of Algorithm 2, which is developed in detail over the next few sections, proceeds as follows:

Step 1: We first review the Kaplan-Meier maximum likelihood estimator for censored data and provide a new finite sample convergence bound for this estimator. This bound allows us to define a "cut-off point" for each venue $i$ such that the Kaplan-Meier estimate of the tail probability $T_i(s)$ for every value of $s$ up to the cut-off point is guaranteed to be close to the true tail probability. We then define a slightly modified version of the Kaplan-Meier estimates in which the tail probability of the next unit above the cut-off is modified in an optimistic manner. We show that in conjunction with the greedy allocation algorithm, this minor modification leads to increased exploration, since the next unit beyond the cut-off point always looks at least as good as the cut-off point itself.

Step 2: We next prove our main Exploitation Lemma (Lemma 4). This lemma shows that at any time step, if the number of units allocated to each venue by the greedy algorithm is strictly below the cut-off point for that venue — which can be thought of as being in a known state in the parlance of $E^3$ and RMAX — then the allocation is provably $\epsilon$-optimal.

Step 3: We then prove our main Exploration Lemma (Lemma 5), which shows that on any time step at which the allocation made by the greedy algorithm is not $\epsilon$-optimal, it is possible to lower bound the probability that the algorithm explores. Thus, as with $E^3$ and RMAX, any time we are not in a known state and thus cannot ensure optimal allocation, we are instead assured of exploring.

Step 4: Finally, we show that on any sufficiently long sequence of time steps (where "sufficiently long" is polynomial in $K$, $V$, $1/\epsilon$, and $\ln(1/\delta)$, where $\delta$ is a standard confidence parameter), it must be the case that either the algorithm achieves an $\epsilon$-optimal solution on at least a $1 - \epsilon$ fraction of the sequence, or the algorithm has explored sufficiently often to learn accurate estimates of the tail distributions out to $V$ units on every venue. In either case, we can show that with probability $1 - \delta$, at the end of the sequence, the current algorithm achieves an $\epsilon$-optimal solution with probability at least $1 - \epsilon$.

4.1 Convergence of Kaplan-Meier Estimators

We begin by describing the standard Kaplan-Meier maximum likelihood estimator for censored data (Kaplan and Meier, 1958; Peterson, 1983), restricting our attention to a single venue $i$. Let $z_{i,s}$ be the true probability that the demand in this venue is exactly $s$ units given that the demand is at least $s$ units. Using the fact that $1 - z_{i,s}$ is the conditional probability of a demand of at least $s+1$ given a demand of at least $s$, it is easy to verify that for any $s > 0$, $T_i(s) = \prod_{s'=0}^{s-1} (1 - z_{i,s'})$. At a high level, we can think of Kaplan-Meier as first computing an estimate of $z_{i,s}$ for each $s$ and then using these estimates to compute an estimate of $T_i(s)$.

Let $I$ be an indicator function taking on the value 1 if its input is true and 0 otherwise. Let $D_{i,s}^t = \sum_{\tau=1}^{t} I[r_i^\tau = s, v_i^\tau > s]$ be the number of direct observations of $s$ units up to time $t$, and let $N_{i,s}^t = \sum_{\tau=1}^{t} I[r_i^\tau \ge s, v_i^\tau > s]$ be the number of (direct or censored) observations of at least $s$ units on time steps at which more than $s$ units were requested. The quantity $N_{i,s}^t$ is then the number of times there was an opportunity for a direct observation of $s$ units, whether or not one occurred.

We can then naturally define $\hat{z}_{i,s}^t = D_{i,s}^t / N_{i,s}^t$, with $\hat{z}_{i,s}^t = 0$ if $N_{i,s}^t = 0$. This quantity is simply the empirical probability of a direct observation of $s$ units given that a direct observation of $s$ units was possible. The Kaplan-Meier estimator of the tail probability for any $s > 0$ after $t$ time steps can then be expressed as
\[
\hat{T}_i^t(s) = \prod_{s'=0}^{s-1} \left(1 - \hat{z}_{i,s'}^t\right), \tag{1}
\]
with $\hat{T}_i^t(0) = T_i(0) = 1$ for all $t$.

Previous work has established convergence rates for the Kaplan-Meier estimator to the true underlying distribution in the case that the submission sequence $v_i^1, \ldots, v_i^t$ is i.i.d. (see, for example, Foldes and Rejto (1981)), and asymptotic convergence for non-i.i.d. settings (Huh et al., 2009). We are not in the i.i.d. case, since the submitted volumes at one venue are a function of the entire history of allocations and executions across all venues. In the following theorem we give a new finite sample convergence bound applicable to our setting.

Theorem 2. Let $\hat{T}_i^t$ be the Kaplan-Meier estimate of $T_i$ as given in Equation 1. For any $\delta > 0$, with probability at least $1 - \delta$, for every $s \in \{1, \ldots, V\}$,
\[
\left| T_i(s) - \hat{T}_i^t(s) \right| \le s \sqrt{2 \ln(2V/\delta) / N_{i,s-1}^t}.
\]

The proof depends on the following lemma, whose proof is omitted.

Lemma 1. For each $i \in \{1, \ldots, \ell\}$, let $x_i$ and $y_i$ be real numbers in $[0,1]$, with $|x_i - y_i| \le \epsilon_i$ for some $\epsilon_i > 0$. Then $\left| \prod_{i=1}^{\ell} x_i - \prod_{i=1}^{\ell} y_i \right| \le \sum_{i=1}^{\ell} \epsilon_i$.

Proof of Theorem 2: We will show that $\hat{z}_{i,s}^t$ converges to $z_{i,s}$, and that this implies that the Kaplan-Meier tail probability estimator converges to $T_i(s)$. Consider a fixed value of $s$. Let $t_n$ be the index $\tau$ of the $n$th time step at which $r_i^\tau \ge s$ and $v_i^\tau > s$. By definition, there are $N_{i,s}^t$ such time steps in total. For each $n \in \{1, \ldots, N_{i,s}^t\}$, let
\[
X_n = \sum_{\ell=1}^{n} \left( z_{i,s} - I[r_i^{t_\ell} = s] \right).
\]
It is easy to see that the sequence $X_1, \ldots, X_{N_{i,s}^t}$ forms a martingale; for each $n$, we have $|X_n - X_{n+1}| \le 1$ and $E[X_{n+1} \mid X_n] = X_n$. By Azuma's inequality (see, for example, Alon and Spencer (2000)), for any $\gamma$,
\[
\Pr\left( \left| X_{N_{i,s}^t} \right| \ge \gamma \right) \le 2 e^{-\gamma^2 / (2 N_{i,s}^t)}.
\]
Noting that $X_{N_{i,s}^t} = N_{i,s}^t (z_{i,s} - \hat{z}_{i,s}^t)$ and setting $\gamma = \epsilon_s N_{i,s}^t$ gives us that $\Pr(|z_{i,s} - \hat{z}_{i,s}^t| \ge \epsilon_s) \le 2 e^{-\epsilon_s^2 N_{i,s}^t / 2}$. Setting this equal to $\delta / V$ and applying a union bound gives us that with probability $1 - \delta$, for every $s \in \{0, \ldots, V-1\}$, $|z_{i,s} - \hat{z}_{i,s}^t| \le \sqrt{2 \ln(2V/\delta) / N_{i,s}^t}$.

Assume this holds for all $s$. Then it follows from Lemma 1 that for any particular $s \in \{1, \ldots, V\}$,
\[
\left| T_i(s) - \hat{T}_i^t(s) \right| \le \sum_{s'=0}^{s-1} \sqrt{\frac{2 \ln(2V/\delta)}{N_{i,s'}^t}} \le s \sqrt{\frac{2 \ln(2V/\delta)}{N_{i,s-1}^t}}.
\]

4.2 Modifying Kaplan-Meier

In Algorithm 3 we describe the minor modification of Kaplan-Meier necessary for our analysis. As described above, for each venue $i$ we compute a cut-off $c_i^t$ based on the convergence bound of Section 4.1. For all $s \le c_i^t$, $\hat{T}_i^t(s)$ is precisely the Kaplan-Meier estimate as in Equation 1. However, to promote exploration, we optimistically set the value of $\hat{T}_i^t(c_i^t + 1)$ so that the unit just beyond the cut-off looks at least as good as the cut-off itself. Without this modification, it could be the case that the greedy allocation persistently stops exactly at a venue's cut-off (we never gain information about the value of reallocating a unit from another venue, since we do not have an accurate estimate of the tail probability for this unit), and yet no exploration is taking place. By optimistically modifying the tail probability $\hat{T}_i^t(c_i^t + 1)$ for each venue, we ensure that no venue remains unexplored simply because the algorithm unluckily observes a low demand a small number of times.

Algorithm 3: Subroutine OptimisticKM for computing modified Kaplan-Meier estimators. For all $s$, let $D_{i,s}^t$ and $N_{i,s}^t$ be defined as above, and assume that $\epsilon > 0$ and $\delta > 0$ are fixed parameters.
  Input: Observed data $\{(v_i^\tau, r_i^\tau)\}_{\tau=1}^{t}$ for venue $i$
  Output: Modified Kaplan-Meier estimators for venue $i$
  % Calculate the cut-off:
  $c_i^t \leftarrow \max\{s : s = 0 \text{ or } N_{i,s-1}^t \ge 128 (sV/\epsilon)^2 \ln(2V/\delta)\}$;
  % Compute Kaplan-Meier tail probabilities:
  $\hat{T}_i^t(0) = 1$;
  for $s = 1$ to $V$ do
    $\hat{T}_i^t(s) \leftarrow \prod_{s'=0}^{s-1} \left(1 - D_{i,s'}^t / N_{i,s'}^t\right)$;
  end
  % Make the optimistic modification:
  if $c_i^t < V$ then $\hat{T}_i^t(c_i^t + 1) \leftarrow \hat{T}_i^t(c_i^t)$;
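The subroutine is easy to express in code. The following Python sketch is our rendering of Algorithm 3 under the conventions of the earlier sketches (the cut-off constant is the one quoted in the algorithm; treating 0/0 as 0 mirrors the convention for the empirical estimates):

import math

def optimistic_km(history, V, eps, delta):
    """OptimisticKM sketch for one venue.  `history` is the list of pairs
    (v_tau, r_tau) observed so far.  Returns T_hat[0..V], where T_hat[s]
    estimates Pr[demand >= s] and the unit just above the cut-off is made
    to look as good as the cut-off itself."""
    # D[s]: direct observations of exactly s; N[s]: opportunities for one
    D = [sum(1 for v, r in history if r == s and v > s) for s in range(V + 1)]
    N = [sum(1 for v, r in history if r >= s and v > s) for s in range(V + 1)]

    # Cut-off: largest c with c == 0 or N[c-1] >= 128 (cV/eps)^2 ln(2V/delta).
    # N[s] is non-increasing while the threshold grows, so scan upward.
    c = 0
    while c < V and N[c] >= 128 * ((c + 1) * V / eps) ** 2 * math.log(2 * V / delta):
        c += 1

    # Kaplan-Meier tail probabilities (Equation 1), with 0/0 treated as 0
    T_hat = [1.0] * (V + 1)
    for s in range(1, V + 1):
        z = D[s - 1] / N[s - 1] if N[s - 1] > 0 else 0.0
        T_hat[s] = T_hat[s - 1] * (1.0 - z)

    # Optimistic modification just above the cut-off
    if c < V:
        T_hat[c + 1] = T_hat[c]
    return T_hat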
We now formalize the idea of $c_i^t$ as a "cut-off point" up to which the Kaplan-Meier estimates are accurate. In the results that follow, we think of $\epsilon > 0$ and $\delta > 0$ as fixed parameters of the algorithm.²

Lemma 2. With probability at least $1 - \delta$, for all $s \le c_i^t$, $|T_i(s) - \hat{T}_i^t(s)| \le \epsilon/(8V)$.

Proof: Notice that $N_{i,s}^t$ is monotonic in $s$, with $N_{i,s}^t \ge N_{i,s'}^t$ whenever $s \le s'$. Thus by definition of $c_i^t$, for every $s \le c_i^t$ we have $N_{i,s-1}^t \ge 128(sV/\epsilon)^2 \ln(2V/\delta)$, and plugging this into the bound of Theorem 2 gives $|T_i(s) - \hat{T}_i^t(s)| \le s\sqrt{2\ln(2V/\delta)/N_{i,s-1}^t} \le \epsilon/(8V)$.

Lemma 3. Assume that at time $t$ the high probability event in Lemma 2 holds. For any venue $i$ with $\hat{T}_i^t(c_i^t) \le \epsilon/(4V)$ and any $s > c_i^t$, $|T_i(s) - \hat{T}_i^t(s)| \le \epsilon/(2V)$.

Proof: Suppose this is the case. Since tail probability estimates are non-increasing, for $s > c_i^t$ it must be the case that $\hat{T}_i^t(s) \le \hat{T}_i^t(c_i^t) \le \epsilon/(4V)$. If the high probability event in Lemma 2 holds, then $T_i(s) \le T_i(c_i^t) \le \hat{T}_i^t(c_i^t) + \epsilon/(8V) \le \epsilon/(2V)$. Since both $T_i(s)$ and $\hat{T}_i^t(s)$ are constrained to lie between 0 and $\epsilon/(2V)$, it must be the case that $|T_i(s) - \hat{T}_i^t(s)| \le \epsilon/(2V)$.

²In particular, $\epsilon$ corresponds to the value specified in Theorem 3, and $\delta$ corresponds roughly to that $\delta$ divided by the polynomial upper bound on the number of time steps.

4.3 Exploitation and Exploration Lemmas

With these two lemmas in place, we are ready to state our main Exploitation Lemma (Step 2), which formalizes the idea that once a sufficient amount of exploration has occurred, the allocation output by the greedy algorithm will be $\epsilon$-optimal. The proof of this lemma is where the requirement that $\hat{T}_i^t(c_i^t + 1)$ be set optimistically becomes important. In particular, because of the optimistic setting of $\hat{T}_i^t(c_i^t + 1)$, we know that if the greedy policy allocates exactly $c_i^t$ units to a venue $i$, it could not gain too much by reallocating additional units from another venue to venue $i$ instead. In this sense, we create a "buffer" above each cut-off, guaranteeing that it is not necessary to continue exploring as long as one of the two conditions in the lemma statement is met for each venue.

Lemma 4 (Exploitation Lemma). Assume that at time $t$, the high probability event in Lemma 2 holds. If for each venue $i$, either (1) $v_i^t \le c_i^t$, or (2) $\hat{T}_i^t(c_i^t) \le \epsilon/(4V)$, then the difference between the expected number of units consumed under allocation $v^t$ and the expected number of units consumed under the optimal allocation is at most $\epsilon$.

Proof: Let $a_1, \ldots, a_K$ be any optimal allocation of the $V^t$ units. Since both $a$ and $v^t$ are allocations of $V^t$ units, it must be the case that $\sum_{i : a_i > v_i^t} (a_i - v_i^t) = \sum_{i : v_i^t > a_i} (v_i^t - a_i)$. We can thus define an arbitrary one-to-one mapping between the units that were allocated to different venues by the algorithm and the optimal allocation. Consider any such pair in this mapping. Let $i$ be the venue to which the unit was allocated by the algorithm, and let $n$ be the index of the unit in that venue. Similarly, let $j$ be the venue to which the unit was allocated by the optimal allocation, and let $m$ be the index of the unit in that venue.

If Condition (1) in the lemma statement holds for venue $i$, then by Lemma 2, since $n \le v_i^t \le c_i^t$, $\hat{T}_i^t(n) \le T_i(n) + \epsilon/(8V) \le T_i(n) + \epsilon/(2V)$. On the other hand, if Condition (2) holds, then by Lemma 3, it is still the case that $\hat{T}_i^t(n) \le T_i(n) + \epsilon/(2V)$.

Now, if $v_j^t < c_j^t$, then by Lemma 2, $T_j(v_j^t + 1) \le \hat{T}_j^t(v_j^t + 1) + \epsilon/(2V)$. If $v_j^t = c_j^t$, then the optimistic setting of $\hat{T}_j^t(c_j^t + 1)$ guarantees the same inequality, since the unit just above the cut-off is valued at least as highly as the cut-off itself by the algorithm. Finally, if $v_j^t > c_j^t$, then it must be the case that Condition (2) in the lemma statement holds for venue $j$, and by Lemma 3 it is still the case that $T_j(v_j^t + 1) \le \hat{T}_j^t(v_j^t + 1) + \epsilon/(2V)$.

Thus in all three cases, since $m > v_j^t$, we have that $T_j(m) \le T_j(v_j^t + 1) \le \hat{T}_j^t(v_j^t + 1) + \epsilon/(2V)$. Since the greedy algorithm chose to send unit $n$ to venue $i$ instead of sending an additional unit to venue $j$, it must be the case that $\hat{T}_i^t(n) \ge \hat{T}_j^t(v_j^t + 1)$, and thus $T_i(n) \ge T_j(m) - \epsilon/V$.

Since there are at most $V$ pairs in the matching, and each contributes at most $\epsilon/V$ to the difference in expected units consumed between the optimal allocation and the algorithm's, the difference is at most $\epsilon$.

Note that this bound is tight in the sense that it is possible to construct examples where the difference in expected units consumed is as large as $\epsilon$.

Finally, Lemma 5 presents the main Exploration Lemma (Step 3), which states that on any time step at which the allocation is not $\epsilon$-optimal, the probability of a "useful" observation is lower bounded by $\epsilon/(8V)$.

Lemma 5 (Exploration Lemma). Assume that at time $t$, the high probability event in Lemma 2 holds. If the allocation is not $\epsilon$-optimal, then for some venue $i$, with probability at least $\epsilon/(8V)$, $N_{i,c_i^t}^{t+1} = N_{i,c_i^t}^{t} + 1$.

Proof: Suppose the allocation is not $\epsilon$-optimal at time $t$. By Lemma 4, it must be the case that there exists some venue $i$ for which $v_i^t > c_i^t$ and $\hat{T}_i^t(c_i^t) > \epsilon/(4V)$. Since $v_i^t > c_i^t$, it will be the case that $N_{i,c_i^t}^{t+1} = N_{i,c_i^t}^{t} + 1$ as long as $r_i^t \ge c_i^t$. Since $\hat{T}_i^t(c_i^t) > \epsilon/(4V)$, Lemma 2 implies that $T_i(c_i^t) \ge \epsilon/(8V)$, so $r_i^t \ge c_i^t$ with probability at least $\epsilon/(8V)$.
4.4 Putting It All Together

With the exploitation and exploration lemmas in place, we are finally ready to state our main theorem. The full proof is omitted due to lack of space, but we sketch the main ideas below.

Theorem 3 (Main Theorem). For any $\epsilon > 0$ and $\delta > 0$, with probability $1 - \delta$ (over the randomness of draws from $Q$ and $\{P_i\}$), after running for a time polynomial in $K$, $V$, $1/\epsilon$, and $\ln(1/\delta)$, Algorithm 2 makes an $\epsilon$-optimal allocation on each subsequent time step with probability at least $1 - \epsilon$.

Consider any sufficiently long sequence of $R$ time steps. If the algorithm chose $\epsilon$-optimal allocations on at least a fraction $(1 - \epsilon)$ of the $R$ time steps, then we can argue that the algorithm will continue to be $\epsilon$-optimal on at least a fraction $(1 - \epsilon)$ of future time steps; this is because the algorithm gets better on average over time as more observations are made.

On the other hand, if the algorithm chose sub-optimal allocations on at least an $\epsilon$ fraction of the $R$ time steps, then by Lemma 5, the algorithm must have incremented $N_{i,c_i^t}^{t}$ for some venue $i$ and current cut-off $c_i^t$ approximately $\epsilon^2 R/(8V)$ times. By definition of the $c_i^t$, it can never be the case that $N_{i,c_i^t}^{t}$ was incremented too many times for any fixed values of $i$ and $c_i^t$ (where "too many" is a polynomial in $V$, $1/\epsilon$, and $\ln(1/\delta)$); otherwise the cut-off would have increased. Since there are only $K$ venues and $V$ possible cut-off values to consider in each venue, the total number of increments can be no more than $KV$ times this polynomial, another polynomial in $V$, $1/\epsilon$, $\ln(1/\delta)$, and now $K$. If $R$ is sufficiently large (but still polynomial in all of the desired quantities) and approximately $\epsilon^2 R/(8V)$ increments were made, we can argue that every venue must have been fully explored, in which case, again, future allocations will be $\epsilon$-optimal.
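For concreteness, the sketches above assemble into the estimate-allocate loop of Algorithm 2. The distributions, horizon, and parameter values in this toy run are arbitrary choices of ours, not settings from the paper:

# Toy run of Algorithm 2 on simulated venues, reusing the Venue,
# greedy_allocation, and optimistic_km sketches defined earlier.
V, K = 10, 3
venues = [Venue([0.5] + [0.05] * 10) for _ in range(K)]   # arbitrary P_i on {0..10}
history = [[] for _ in range(K)]
T_hat = [[1.0] * (V + 1) for _ in range(K)]               # arbitrary initialization

for t in range(500):
    alloc = greedy_allocation(V, T_hat)                    # allocation step
    for i in range(K):
        r, _ = venues[i].submit(alloc[i])
        history[i].append((alloc[i], r))
    T_hat = [optimistic_km(history[i], V, eps=0.5, delta=0.1)
             for i in range(K)]                            # reestimation step

Note that the conservative constant in the cut-off rule means the cut-offs grow slowly in a short toy run like this; the optimistic modification is what keeps every venue receiving allocations in the meantime.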
We conclude our theory contributions with a few brief remarks. Our optimistic tail modifications of the Kaplan-Meier estimators are relatively mild.³ In many circumstances we actually expect that the estimate-allocate loop with unmodified Kaplan-Meier would work well, and we investigate a parametric version of this learning algorithm in the experiments described below.

It also seems quite likely that variants of our algorithm and analysis could be developed for alternative objective functions, such as minimizing the number of steps of repeated reallocation required to execute all shares (for which it is possible to create examples where myopic greedy allocation at each step is suboptimal), or maximizing the probability of executing all $V$ shares on a single step.

³The same results can be proved for more invasive modifications, such as pushing all mass above the cut-offs to the maximum possible volume, which would result in even more aggressive exploration.

5 Application: Dark Pool Problem

We now describe the application that provided the original inspiration to develop our framework and algorithm. As mentioned in the Introduction, dark pools are a particular and relatively recent type of exchange for listed stocks. While the precise details are beyond the scope of this paper, the main challenge in executing large-volume trades in traditional ("light") exchanges is that it is difficult to "conceal" such trades, and their revelation generally results in adverse impact on price (e.g., the presence of a large-volume buyer causes the price to rise against that buyer). If the volume is sufficiently large, this difficulty of concealment remains even if one attempts to break the trade up slowly over time. Dark pools arose exactly to address the problems faced by large traders, and tend to emphasize trade and order concealment over price optimization (Wikipedia, 2009; Bogoslaw, 2007).

In a typical dark pool, buyers and sellers submit orders that simply specify the total volume of shares they wish to buy or sell, with the price of the transaction determined exogenously by "the market".⁴ Upon submitting an order to (say) buy $v$ shares, a trader is put in a queue of buyers awaiting transaction, and there is a similar queue of sell orders. Matching between buyers and sellers occurs in sequential arrival of orders, similar to a light exchange. Unlike a light exchange, however, no information is provided to traders about how many parties or shares might be available in the pool at any given moment. Thus in a given time period, a submission of $v$ shares results only in a (possibly censored) report of how many shares up to $v$ were executed.

⁴Typically the midpoint between the bid and ask in the light exchanges. This is a slight oversimplification but accurate for our purposes.

While presenting their own trading challenges, dark pools have become tremendously popular exchanges, responsible for executing 10-20% of the overall US equity volume. In fact, they have been so successful that there are now approximately 40+ dark pools for the U.S. equity market alone, leading traders and brokerages to face exactly the censored multi-venue exploration problem we have been studying: How should one optimally distribute a large trade over the many independent dark pools? We now describe the application of our algorithm to this problem, basing it on actual data from four active dark pools.

5.1 Summary of Dark Pool Data

Our data set is from the internal dark pool order flow for a major U.S. broker-dealer. Each (possibly censored) observation in this data is exactly of the form discussed throughout the paper — a triple consisting of the dark pool name, the number of shares sent to that pool, and the number of shares subsequently executed within a time interval. It is important to highlight some limitations of the data. First, note that the data set conflates the policy the brokerage used for allocation across the dark pools with the liquidity available in the pools themselves. For our data set, the policy in force was very similar to the bandit-style approach we discuss below. Second, the "parent" orders determining the overall volumes to be allocated across the pools were determined by the brokerage's trading needs, and are similarly out of our control.

The data set contains submissions and executions for four active dark pools: BIDS Trading, Automated Trading Desk, D.E. Shaw, and NYFIX, each for a dozen relatively actively-traded stocks,⁵ thus yielding 48 distinct stock-pool data sets.

⁵Tickers represented are AIG, ALO, CMI, CVX, FRE, HAL, JPM, MER, MIR, NOV, XOM, and NRG.
The average daily In a typical dark pool, buyers and sellers submit or- trading volume of these stocks across all exchanges ders that simply specify the total volume of shares they (light and dark) ranges from 1 to 60 million shares, wish to buy or sell, with the price of the transaction de- with a median volume of 15 million shares. Energy, termined exogenously by “the market”. 4 Upon sub- Financials, Consumer, Industrials, and Utilities indus- tries are represented. Our data set spans 30 trading 3Thesameresultscanbeprovedformoreinvasivemod- ifications, such as pushing all mass above the cut-offs to the light exchanges. This is a slight oversimplification but the maximum possible volume, which would result in even accurate for our purposes. more aggressive exploration. 5Tickers represented are AIG, ALO, CMI, CVX, FRE, 4Typically the midpoint between the bid and ask in HAL,JPM,MER,MIR,NOV,XOM,andNRG. days; for every stock-pool pair we have on average Model Train Loss Test Loss Wins 1,200 orders (from 600 to 2,000), which corresponds Nonparametric 0.454 0.872 3 to 1.3 million shares (from 0.5 million to 3 million). ZB + Uniform 0.499 0.508 12 Individual order sizes range from 100 to 50,000 shares, ZB + Power Law 0.467 0.484 28 with 1,000 shares being the median. 16% of orders are ZB + Poisson 0.576 0.661 0 filled at least partially (meaning that fully 84% result ZB + Exponential 0.883 0.953 5 in no shares executed), 9% of the total submitted vol- Table 1: Average per-sample log-loss (negative log like- ume was executed, and 11% of all observations were lihood) for the different venue distribution models. The censored. “Wins” column counts the number of stock-venue pairs where a given model beats the other four on the test data. 5.2 Parametric Models for Dark Pools

While the theory and algorithm we have developed for censored exploration permit a general (nonparametric) form for the venue distributions $P_i$, in any application it is reasonable to ask whether the data permits a simple parametric form for these distributions, with the attendant computational and sample complexity benefits. For our dark pool data the answer to this question is affirmative.

We experimented with a variety of common parametric forms for the distributions. For each such form, the basic methodology was the same. For each of the $4 \times 12 = 48$ venue-stock pairs, the data for that pair was split evenly into a training set and a test set. The training data was used to select the maximum likelihood model from the parametric class. Note that we can no longer directly apply the nonparametric Kaplan-Meier estimator — within each model class, we must directly maximize the likelihood on the censored training data. This is a relatively straightforward and efficient computation for each of the model classes we investigated. The test set was of course used to measure the generalization performance of each maximum likelihood model.

Our investigation revealed that the best models maintained a separate parameter for the probability of 0 shares being available (that is, $P_i(0)$ is explicitly estimated) — a "Zero Bin" or ZB parameter. This is due to the fact that the vast majority of submissions (84%) to dark pools result in no shares being executed. We then examined various parametric forms for the nonzero portions of the venue distributions,⁶ including uniform (which of course requires no additional parameters), and Poisson, exponential, and power law forms (each of which requires a single additional parameter).

The generalization results strongly favor the power law form, in which the probability of $s$ shares being available is proportional to $1/s^\beta$ for real $\beta$ — a so-called heavy-tailed distribution when $\beta > 0$. Nonparametric models trained with Kaplan-Meier are best on the training data but overfit badly due to their complexity relative to the sparse data, while the other parametric forms cannot accommodate the heavy tails of the data. The comparative results are summarized in Table 1. Based on this comparison, for our dark pool study we investigate a variant of our main algorithm, in which the estimate-allocate loop has an estimation step using maximum likelihood estimation within the ZB + Power Law model, and allocations are done greedily on these same models.

In terms of the estimated ZB + Power Law parameters themselves, we note that for all 48 stock-pool pairs the Zero Bin parameter accounted for most of the distribution (between a fraction 0.67 and 0.96), which is not surprising considering the aforementioned preponderance of entirely unfilled orders in the data. The vast majority of the 48 exponents fell between $\beta = 0.25$ and $\beta = 1.3$ — rather long tails indeed — but it is noteworthy that for one of the four dark pools, 7 of the 12 estimated exponents were actually negative, yielding a model that predicts higher probabilities for larger volumes. This is likely an artifact of our size- and time-limited data set, but is not entirely unrealistic and results in some interesting behavior in the simulations below.

⁶These forms were applied up to the largest volume submitted in the data sets, then normalized.
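As an illustration of the computation involved, the following Python sketch (ours; the grid search and parameter ranges are arbitrary stand-ins for whatever numerical optimizer one prefers) evaluates and maximizes the censored log-likelihood of the Zero Bin + Power Law model:

import math

def zb_power_law_loglik(z, beta, data, s_max):
    """Censored log-likelihood of the Zero Bin + Power Law model:
    P(0) = z and P(s) = (1 - z) * s**(-beta) / norm for 1 <= s <= s_max
    (the form is normalized up to s_max, as in footnote 6).  A direct
    observation (r < v) contributes log P(r); a censored one (r == v > 0)
    contributes log Pr[demand >= v]."""
    norm = sum(s ** -beta for s in range(1, s_max + 1))
    def p(s):
        return z if s == 0 else (1 - z) * s ** -beta / norm
    def tail(v):  # Pr[demand >= v], assuming 1 <= v <= s_max
        return (1 - z) * sum(s ** -beta for s in range(v, s_max + 1)) / norm
    ll = 0.0
    for v, r in data:
        if v == 0:
            continue  # a zero-share submission carries no information
        ll += math.log(tail(v)) if r == v else math.log(p(r))
    return ll

def fit_zb_power_law(data, s_max):
    """Crude maximum-likelihood fit by grid search over (z, beta); negative
    exponents are allowed, as observed for one of the four pools."""
    grid = [(z / 20, b / 10) for z in range(1, 20) for b in range(-10, 31)]
    return max(grid, key=lambda zb: zb_power_law_loglik(zb[0], zb[1], data, s_max))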
5.3 Data-Based Simulation Results

As in any control problem, the dark pool data in our possession is unfortunately insufficient to evaluate and compare different allocation algorithms. This is because of the aforementioned fact that the volumes submitted to each venue were fixed by the specific policy that generated the data, and we cannot explore alternative choices — if our algorithm chooses to submit 1000 shares to some venue, but in the data only 500 shares were submitted, we simply cannot infer the outcome of our desired submission.

We thus instead use the raw data to derive a simulator with which we can evaluate different approaches. In light of the modeling results of Section 5.2, the simulator for stock $S$ was constructed as follows. For each dark pool $i$, we used all of the data for $i$ and stock $S$ to estimate the maximum likelihood Zero Bin + Power Law distribution. (Note that there is no need for a training-test split here, as we have already separately validated the choice of distributional model.) This results in a set of four venue distribution models $P_i$ that form the simulator for stock $S$. This simulator accepts allocation vectors $(v_1, v_2, v_3, v_4)$ indicating how many shares some algorithm wishes to submit to each venue, draws a "true liquidity" value $s_i$ from $P_i$ for each $i$, and returns the vector $(r_1, r_2, r_3, r_4)$, where $r_i = \min(v_i, s_i)$ is the possibly censored number of shares filled in venue $i$.

Across all 12 stocks, we compared the performance of four different allocation algorithms:

• The ideal allocation, which is given the true parameters of the ZB + Power Law distributions used by the simulator, and allocates shares optimally (greedily) with respect to these distributions.

• The uniform allocation, which always divides any order equally among all four venues.

• Our learning algorithm, which implements the repeated allocate-reestimate loop, using the maximum likelihood ZB + Power Law model for the reestimation step.

• A simple bandit-style algorithm, which begins with equal weights assigned to all four venues. Allocation to a venue that results in any nonzero number of shares being executed causes that venue's weight to be multiplied by a factor $\alpha > 1$.⁷ Allocations are always proportional to the current weight vector (see the sketch after this list).

Some remarks on these algorithms are in order. First, note that the first two allocation methods (ideal and uniform) are non-adaptive and are meant to serve as baselines — one of them the best performance we could hope for (ideal), and the other the most naive allocation possible (uniform). Second, note that our algorithm has a distinct advantage in the sense that it is using the correct parametric form, the same being used by the simulator itself. Thus our evaluation of this algorithm is certainly optimistic compared to what should be expected "in practice". Finally, note that the bandit algorithm is the crudest type of weight-based allocation scheme of the type that abounds in the no-regret literature (Cesa-Bianchi and Lugosi, 2006); we are effectively forcing our problem into a 0/1 loss setting corresponding to "no shares" and "some shares" being executed. Certainly more sophisticated bandit-style approaches can and should be examined.

⁷For the experiments we optimized $\alpha$ over all stock-pool pairs and thus used $\alpha = 1.05$.
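The bandit baseline in the last bullet is simple enough to state precisely. Here is our minimal rendering in Python (how fractional shares are rounded is an arbitrary detail of ours):

def bandit_allocate(V, weights):
    """Allocate V shares in proportion to the current weights, giving any
    rounding remainder to the heaviest venue."""
    total = sum(weights)
    alloc = [int(V * w / total) for w in weights]
    alloc[max(range(len(weights)), key=lambda i: weights[i])] += V - sum(alloc)
    return alloc

def bandit_update(weights, executed, alpha=1.05):
    """Multiply a venue's weight by alpha > 1 whenever it executed any
    nonzero number of shares (alpha = 1.05, as in footnote 7)."""
    return [w * alpha if r > 0 else w for w, r in zip(weights, executed)]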
Each algorithm was run in simulation for some number of episodes. Each episode consisted of the allocation of a fixed number $V$ of shares — thus the same number of shares is repeatedly allocated by the algorithm, though of course this allocation will change over time for the two adaptive algorithms as they learn. Each episode of simulation results in some fraction of the $V$ shares being executed. Two values of $V$ were investigated — a smaller and therefore easier value $V = 1000$, and the larger and therefore more difficult $V = 8000$.

We begin by showing full learning curves over 2000 episodes with $V = 8000$ for a couple of representative stocks in Figure 1.

Figure 1: Sample learning curves (fraction executed versus episodes). For the stock AIG (left panel), the bandits algorithm (labeled blue curve) beats uniform allocation (dashed horizontal line) but appears to asymptote short of ideal allocation (solid horizontal line). For the stock NRG (right panel), the bandits algorithm actually deteriorates with more episodes, underperforming both uniform and ideal allocation. For both these stocks (and for the other 10 in our data set), our algorithm (labeled red curve) performs nearly optimally.

Here the average performance of the two non-adaptive allocation schemes (ideal and uniform) is represented as horizontal lines, while learning curves are given for the adaptive schemes. Due to the high variance of the heavy-tailed venue distributions used by the simulator, a single trial of 2000 episodes is extremely noisy, so we both average over 400 trials for each algorithm and smooth the resulting averaged learning curve with a standard exponentially decaying temporal moving average.
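The smoothing referred to above is a standard construction; a minimal Python version (the decay constant here is an arbitrary choice of ours) is:

def ema_smooth(curve, decay=0.95):
    """Exponentially decaying temporal moving average of a learning curve."""
    out, m = [], curve[0]
    for x in curve:
        m = decay * m + (1 - decay) * x
        out.append(m)
    return out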
The average com- learning learning pletion rate across all stocks for the large (small) or- 0.1 4 der sequences is 10.0% (13.1%) for uniform and 13.6% 0.05 2 0.05 0.1 0.15 0.2 0.25 0.3 2 4 6 8 10 12 (19.4%) for optimal allocations. Our algorithm per- uniform uniform forms almost as well as optimal — 13.5% (18.7%) — 12 and much better than bandits at 11.9% (17.2%). 0.3 10 The right column of Figure 2 is quite similar, except 0.25 8 now we measure performance not by the fraction of V 0.2 0.15 6 shares filled in one step, but the natural alternative learning learning of order half-life — the number of steps of repeated 0.1 4 resubmission of any remaining shares to get the total 0.05 2 V/ 0.05 0.1 0.15 0.2 0.25 0.3 2 4 6 8 10 12 number executed above 2. Despite the fact that our bandit bandit algorithm is not designed to optimize this criterion and Figure 2: Comparison of our learning algorithm to the that our theory does not directly apply to it, we see three baselines (ideal, uniform, and bandits). In each plot, the same broad story on this metric as well — our al- the performance of the learning algorithm is plotted on the gorithm competes with ideal, dominates uniform allo- y axis, and the performance of one of the baselines on the x cation and beats the bandit approach on large orders. axis. Left column: Evaluated by the fraction of submitted The average order half-life for large (small) orders is shares executed in a single time step. Here higher values are better, and points above the diagonal are wins for our 7.2 (5.3) for uniform allocation and 5.9 (4.4) for the algorithm. Right: Evaluated by order half-life. Here lower greedy algorithm on the true distributions. Our algo- values are better, and points below the diagonal are wins rithm requires on average 6.0 (4.9) steps, while bandits for our algorithm. Each point corresponds to a single stock uses 7.0 (4.4) to trade the large (small) orders. and order size; small orders (red plus signs) are 1000 shares, large orders (blue squares) are 8000 shares.

Acknowledgments

We are grateful to Curtis Pfeiffer and Andrew Westhead for valuable conversations, and to Bobby Kleinberg for introducing us to the literature on the newsvendor problem.

References

N. Alon and J. Spencer. The Probabilistic Method, 2nd Edition. Wiley, New York, 2000.

D. Bogoslaw. Big traders dive into dark pools. Business Week article, available at http://www.businessweek.com/investor/content/oct2007/pi2007102_394204.htm, 2007.

R. Brafman and M. Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 3:213-231, 2003.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

A. Foldes and L. Rejto. Strong uniform consistency for nonparametric survival curve estimators from randomly censored data. The Annals of Statistics, 9(1):122-129, 1981.

W. T. Huh, R. Levi, P. Rusmevichientong, and J. Orlin. Adaptive data-driven inventory control policies based on Kaplan-Meier estimator. Preprint available at http://legacy.orie.cornell.edu/~paatrus/psfiles/km-myopic.pdf, 2009.

E. L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. JASA, 53:457-481, 1958.

M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. MLJ, 49:209-232, 2002.

A. V. Peterson. Kaplan-Meier estimator. In Encyclopedia of Statistical Sciences. Wiley, 1983.

Wikipedia. Dark pools of liquidity. http://en.wikipedia.org/wiki/Dark_pools_of_liquidity, 2009.