Foundations of Reinforcement Learning with Applications in Finance
Ashwin Rao, Tikhon Jelvis

1 Function Approximations as Affine Spaces

Vector Space

A vector space is defined as a commutative group V under an addition operation (written as +), together with multiplication of elements of V with elements of a field K (known as scalars), expressed as a binary in-fix operation ∗ : K × V → V, with the following properties:

• a ∗ (b ∗ v) = (a ∗ b) ∗ v, for all a, b ∈ K, for all v ∈ V.
• 1 ∗ v = v for all v ∈ V, where 1 denotes the multiplicative identity of K.
• a ∗ (v1 + v2) = a ∗ v1 + a ∗ v2 for all a ∈ K, for all v1, v2 ∈ V.
• (a + b) ∗ v = a ∗ v + b ∗ v for all a, b ∈ K, for all v ∈ V.

Function Space

The set F of all functions from an arbitrary generic domain X to a vector space co-domain V (over scalars field K) constitutes a vector space (known as a function space) over the scalars field K, with the addition operation (+) defined as:

(f + g)(x) = f(x) + g(x) for all f, g ∈ F, for all x ∈ X

and the scalar multiplication operation (∗) defined as:

(a ∗ f)(x) = a ∗ f(x) for all f ∈ F, for all a ∈ K, for all x ∈ X

Hence, addition and scalar multiplication for a function space are defined point-wise.

Linear Map of Vector Spaces

A linear map of vector spaces is a function h : V → W, where V is a vector space over a scalars field K and W is a vector space over the same scalars field K, having the following two properties:

• h(v1 + v2) = h(v1) + h(v2) for all v1, v2 ∈ V (i.e., application of h commutes with the addition operation).
• h(a ∗ v) = a ∗ h(v) for all v ∈ V, for all a ∈ K (i.e., application of h commutes with the scalar multiplication operation).

Then the set of all linear maps with domain V and co-domain W constitutes a function space (restricted to just this subspace of all linear maps, rather than the space of all V → W functions). This function space (restricted to the subspace of all V → W linear maps) is denoted as the vector space L(V, W).
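The point-wise definitions above can be sketched directly in code. The following is a minimal illustration (not from the book's codebase; all names are illustrative) of the function-space operations (f + g)(x) = f(x) + g(x) and (a ∗ f)(x) = a ∗ f(x):

```python
# Point-wise addition and scalar multiplication in a function space,
# for functions from a domain X (here: float) to the vector space R.
from typing import Callable

RealFunc = Callable[[float], float]

def add(f: RealFunc, g: RealFunc) -> RealFunc:
    """Point-wise addition: (f + g)(x) = f(x) + g(x)."""
    return lambda x: f(x) + g(x)

def scale(a: float, f: RealFunc) -> RealFunc:
    """Point-wise scalar multiplication: (a * f)(x) = a * f(x)."""
    return lambda x: a * f(x)

f = lambda x: x * x      # f(x) = x^2
g = lambda x: 3.0 * x    # g(x) = 3x
h = add(f, scale(2.0, g))  # h(x) = x^2 + 6x
print(h(1.0))  # 7.0
```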
The specialization of the function space of linear maps to the space L(V, K) (i.e., specializing the vector space W to the scalars field K) is known as the dual vector space and is denoted as V∗.

Affine Space

An affine space is defined as a set A associated with a vector space V and a binary in-fix operation ⊕ : A × V → A, with the following properties:

• For all a ∈ A, a ⊕ 0 = a, where 0 is the zero vector in V (this is known as the right identity property).
• For all v1, v2 ∈ V, for all a ∈ A, (a ⊕ v1) ⊕ v2 = a ⊕ (v1 + v2) (this is known as the associativity property).
• For each a ∈ A, the mapping fa : V → A defined as fa(v) = a ⊕ v for all v ∈ V is a bijection (i.e., a one-to-one and onto mapping).

The elements of an affine space are called points, and the elements of the vector space associated with an affine space are called translations. The idea behind affine spaces is that, unlike a vector space, an affine space doesn't have a notion of a zero element, and one cannot add two points in the affine space. Instead, one adds a translation (from the associated vector space) to a point (from the affine space) to yield another point (in the affine space). The term translation is used to signify that we "translate" (i.e., shift) a point to another point in the affine space, with the shift being effected by a translation in the associated vector space. This means there is a notion of "subtracting" one point of the affine space from another point of the affine space (denoted with the operation ⊖), yielding a translation in the associated vector space. A simple way to visualize an affine space is by considering the simple example of the affine space of all 3-D points on the plane defined by the equation z = 1, i.e., the set of all points (x, y, 1) for all x ∈ R, y ∈ R.
The associated vector space is the set of all 3-D points on the plane defined by the equation z = 0, i.e., the set of all points (x, y, 0) for all x ∈ R, y ∈ R (with the usual addition and scalar multiplication operations). We see that any point (x, y, 1) in the affine space is translated to the point (x + x', y + y', 1) by the translation (x', y', 0) in the vector space. Note that the translation (0, 0, 0) (zero vector) results in the point (x, y, 1) remaining unchanged. Note that translations (x', y', 0) and (x'', y'', 0) applied one after the other amount to the same as the single translation (x' + x'', y' + y'', 0). Finally, note that for any fixed point (x, y, 1), we have a bijective mapping from the vector space z = 0 to the affine space z = 1 that maps any translation (x', y', 0) to the point (x + x', y + y', 1).

Linear Map of Affine Spaces

A linear map of affine spaces is a function h : A → B, where A is an affine space associated with a vector space V and B is an affine space associated with the same vector space V, having the following property:

h(a ⊕ v) = h(a) ⊕ v for all a ∈ A, for all v ∈ V

Function Approximations

We represent function approximations by parameterized functions f : X × D[R] → R, where X is the input domain and D[R] is the parameters domain. The notation D[Y] refers to a generic container data type D over a component generic data type Y. The data type D is specified as a generic container type because we consider generic function approximations here. A specific family of function approximations will customize D to a specific container data type (e.g., linear function approximations will customize D to a Sequence data type; a feed-forward deep neural network will customize D to a Sequence of 2-dimensional arrays). We are interested in viewing function approximations as points in an appropriate affine space. To explain this, we start by viewing parameters as points in an affine space.
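As a concrete illustration of the signature f : X × D[R] → R, here is a hypothetical sketch (not from rl/function_approx.py) in which a linear function approximation customizes the container type D to a plain Sequence of real-valued parameters:

```python
# A parameterized function approximation f(x, p), where the parameters p
# live in D[R] with D customized to a Sequence, as for a linear function
# approximation. The feature map [1, x, x^2] is an illustrative assumption.
from typing import Sequence

def linear_approx(x: float, p: Sequence[float]) -> float:
    """f(x, p) = p[0] * 1 + p[1] * x + p[2] * x^2."""
    features = [1.0, x, x * x]
    return sum(w * phi for w, phi in zip(p, features))

print(linear_approx(2.0, [1.0, 0.5, 0.25]))  # 1 + 1 + 1 = 3.0
```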
D[R] as an Affine Space P

When performing Stochastic Gradient Descent or Batch Gradient Descent, the parameters p ∈ D[R] of a function approximation f : X × D[R] → R are updated using an appropriate linear combination of gradients of f with respect to p (at specific values of x ∈ X). Hence, the parameters domain D[R] can be treated as an affine space (call it P) whose associated vector space (over the scalars field R) is the set of gradients of f with respect to the parameters p ∈ D[R] (denoted as ∇p f(x, p)), evaluated at specific values of x ∈ X, with the addition operation defined as element-wise real-numbered addition and the scalar multiplication operation defined as element-wise multiplication with real-numbered scalars. We refer to this affine space P as the Parameters Space, and we refer to its associated vector space (of gradients) as the Gradient Space G. Since each point in P and each translation in G is an element of D[R], the ⊕ operation is element-wise real-numbered addition. We define the gradient function G : X → (P → G) as:

G(x)(p) = ∇p f(x, p) for all x ∈ X, for all p ∈ P.

Representational Space R

We consider a function I : P → (X → R) defined as I(p) = g : X → R for all p ∈ P, such that g(x) = f(x, p) for all x ∈ X. The range of this function I forms an affine space R whose associated vector space is the Gradient Space G, with the ⊕ operation defined as:

I(p) ⊕ v = I(p ⊕ v) for all p ∈ P, v ∈ G

We refer to this affine space R as the Representational Space (to signify the fact that the ⊕ operation for R simply "delegates" to the ⊕ operation for P, so that the parameters p ∈ P basically serve as the internal representation of the function approximation I(p) : X → R). This "delegation" from R to P implies that I is a linear map from the Parameters Space P to the Representational Space R. Notice that the __add__ method of the Gradient class in rl/function_approx.py is overloaded. One of the __add__ methods corresponds to vector addition of two gradients in the Gradient Space G.
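The Parameters Space P, the Gradient Space G, and the delegation of ⊕ from R to P can be sketched as follows. This is a simplified, assumed stand-in (not the actual Gradient class from rl/function_approx.py), specialized to a linear approximation where the gradient ∇p f(x, p) is just the feature vector:

```python
# For a linear approximation f(x, p) = p . phi(x), the gradient function
# G(x)(p) = grad_p f(x, p) equals phi(x). Points p in P and translations
# in G both live in D[R] (here: a numpy array), so ⊕ is element-wise
# addition. The feature map phi is an illustrative assumption.
import numpy as np

def phi(x: float) -> np.ndarray:
    return np.array([1.0, x, x * x])

def f(x: float, p: np.ndarray) -> float:
    return float(p @ phi(x))

def G(x: float, p: np.ndarray) -> np.ndarray:
    # gradient of f with respect to the parameters p, evaluated at x;
    # independent of p for a linear approximation
    return phi(x)

def I(p: np.ndarray):
    # I(p) is the function approximation x -> f(x, p); the ⊕ operation on
    # the Representational Space delegates to ⊕ on the Parameters Space:
    # I(p) ⊕ v = I(p ⊕ v), i.e., I(p + v) element-wise.
    return lambda x: f(x, p)

p = np.array([1.0, 0.0, 0.0])
v = 0.5 * G(2.0, p)        # a translation in the Gradient Space G
g_new = I(p + v)           # I(p) ⊕ v, computed via delegation to P
```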
The other __add__ method corresponds to the ⊕ operation of adding a gradient (treated as a translation in the vector space of gradients) to a function approximation (treated as a point in the affine space of function approximations).

Stochastic Gradient Descent

Stochastic Gradient Descent is a function SGD : X × R → (P → P) representing a mapping from (predictor, response) data to a "parameters-update" function (in order to improve the function approximation), defined as:

SGD(x, y)(p) = p ⊕ (α ∗ ((y − f(x, p)) ∗ G(x)(p))) for all x ∈ X, y ∈ R, p ∈ P,

where α ∈ R+ represents the learning rate (step size of SGD). For a fixed data pair (x, y) ∈ X × R, with the prediction error function e : P → R defined as e(p) = y − f(x, p), the (SGD-based) parameters-change function U : P → G is defined as:

U(p) = SGD(x, y)(p) ⊖ p = α ∗ (e(p) ∗ G(x)(p)) for all p ∈ P.
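The SGD update above can be sketched in code. This is a hedged illustration, specialized (as an assumption, not the book's general implementation) to a linear approximation f(x, p) = p · phi(x) with assumed features [1, x]:

```python
# SGD(x, y)(p) = p ⊕ (alpha * (y - f(x, p)) * G(x)(p)),
# where the parameters-change U(p) = alpha * e(p) * G(x)(p) is a
# translation in the Gradient Space G, added element-wise to p.
import numpy as np

ALPHA = 0.1  # learning rate alpha (step size of SGD)

def phi(x: float) -> np.ndarray:
    return np.array([1.0, x])  # illustrative feature map [1, x]

def f(x: float, p: np.ndarray) -> float:
    return float(p @ phi(x))

def sgd_update(x: float, y: float, p: np.ndarray) -> np.ndarray:
    e = y - f(x, p)            # prediction error e(p)
    u = ALPHA * e * phi(x)     # parameters change U(p) = alpha * e(p) * G(x)(p)
    return p + u               # p ⊕ U(p): element-wise addition

# repeated updates on the single data pair (x, y) = (1.0, 3.0)
# drive the prediction f(1.0, p) toward 3.0
p = np.zeros(2)
for _ in range(200):
    p = sgd_update(1.0, 3.0, p)
```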