Features and Basis Functions
Ryan P. Adams
COS 324 – Elements of Machine Learning
Princeton University
At this point, you might be reasonably wondering how far we can really get with linear regression. The world is a complicated place, and we can’t expect linear models to capture the wide variety of functions that we might need to approximate in supervised learning. Moreover, linear regression models seem to depend on the inputs being vectors in $\mathbb{R}^D$, and many data don’t really map easily to real-valued vectors. For example, how could we predict the properties of a molecule with linear regression? A molecule is some complicated graph-like object with atoms and bonds – definitely not something living in $\mathbb{R}^D$. The answer is to introduce features or basis functions that turn various kinds of inputs into numerical vectors.
Making Linear Regression Nonlinear
Let’s start off by just thinking about how to make regression nonlinear in the least squares case. As before, let’s start off with the simple univariate case where we have some data $\{x_n, y_n\}_{n=1}^N$ with $x_n \in \mathbb{R}$ and $y_n \in \mathbb{R}$. We modeled these data with a line in slope-intercept form: $y = mx + b$ with unknown $m$ and $b$. Then later we observed that we could add an extra constant-valued feature to our data and then not have to treat $b$ as a separate kind of parameter; we put both $m$ and $b$ into a single vector parameter $w$. Let’s slightly change the way that we think about appending that constant feature. Rather than it being a trick that we apply, let’s formalize it as a vector-valued function $\Phi_{\text{lin}} : \mathbb{R} \to \mathbb{R}^2$ where
$$\phi_0(x) = 1 \qquad \phi_1(x) = x \,. \tag{1}$$

When we write out the parameterized straight lines we get
$$y = w^T \Phi_{\text{lin}}(x) = \phi_0(x)\, w_0 + \phi_1(x)\, w_1 = w_0 + w_1 x \,. \tag{2}$$

There’s nothing new here except we’ve replaced the idea of “append a 1 to the features” with “transform the data using a function $\Phi$”. Note that what this lets us do is write $y$ as a degree-1 polynomial. What if we made it a degree-2 polynomial instead? The transformation would now have 3 terms:
$$\phi_0(x) = 1 \qquad \phi_1(x) = x \qquad \phi_2(x) = x^2 \tag{3}$$
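To make the feature-map view concrete, here is a minimal NumPy sketch of the quadratic map in Equation 3. The function name `phi_quad` is mine, not part of the notes:

```python
import numpy as np

def phi_quad(x):
    """Quadratic feature map of Equation 3: x -> [phi_0(x), phi_1(x), phi_2(x)] = [1, x, x^2]."""
    return np.array([1.0, x, x**2])

phi_quad(3.0)  # array([1., 3., 9.])
```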
[Figure 1: Three cubic polynomials roughly fitting the synthetic Income data. Axes: Years of Education (horizontal) vs. Income in thousands (vertical); the legend shows three weight vectors near $w = [350, -72, 5, -0.10]$.]

and we’d have
$$y = w^T \Phi_{\text{quad}}(x) = \phi_0(x)\, w_0 + \phi_1(x)\, w_1 + \phi_2(x)\, w_2 = w_0 + w_1 x + w_2 x^2 \,. \tag{4}$$

We could of course go farther and make it a cubic polynomial with $\Phi_{\text{cub}}(x) = [1, x, x^2, x^3]^T$:

$$y = w^T \Phi_{\text{cub}}(x) = \phi_0(x)\, w_0 + \phi_1(x)\, w_1 + \phi_2(x)\, w_2 + \phi_3(x)\, w_3 = w_0 + w_1 x + w_2 x^2 + w_3 x^3 \,. \tag{5}$$

Some cubic polynomials are shown with the Income data in Figure 1. Recall that when we added the constant term to our features, nothing really changed about how we did the math. Interestingly, nothing changes here either when we add a quadratic term; we are just now estimating a larger vector $w$. Approaching this more generally, if we have some input space $\mathcal{X}$ then our transformation is a function $\Phi : \mathcal{X} \to \mathbb{R}^J$ that takes those inputs and turns them into vectors in $\mathbb{R}^J$. When we do regression, rather than writing $X$ for our design matrix, we write $\Phi$ for the design matrix after transforming the data, i.e.,
$$\Phi = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{J-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{J-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{J-1}(x_N) \end{bmatrix} \,. \tag{6}$$

(I’m using zero indexing here to stay consistent with the polynomial setup from before where $\phi_0(x) = 1$, but that is not a crucial property of this setup.) Everything else is the same as in the simple linear regression case, just with $\Phi$ instead of $X$.
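As a concrete sketch of Equation 6, the following NumPy helper builds $\Phi$ for the polynomial basis $\phi_j(x) = x^j$. The name `design_matrix` is my own, not from the notes:

```python
import numpy as np

def design_matrix(x, J):
    """Build the N x J polynomial design matrix of Equation 6.

    Row n is [phi_0(x_n), ..., phi_{J-1}(x_n)] with phi_j(x) = x^j,
    so column 0 is all ones (the constant feature)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**j for j in range(J)], axis=1)

Phi = design_matrix([1.0, 2.0, 3.0], J=3)
# Phi = [[1. 1. 1.]
#        [1. 2. 4.]
#        [1. 3. 9.]]
```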
[Figure 2: Example of using basis functions in classification. (a) A simple data set in one dimension with two classes; these data are not separable with a single threshold. (b) Adding a transformation that includes an $x^2$ term makes the data linearly separable (green line).]
The objective function for least squares regression is now

$$\hat{w} = \arg\min_{w} \frac{1}{N} (\Phi w - y)^T (\Phi w - y) \,. \tag{7}$$

Assuming $N \geq J$, the solution is going to be

$$\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y \,. \tag{8}$$

Since we are fixing $\Phi$, we can treat this as a linear problem in $w$, but the resulting solution is a nonlinear function of $x$. We refer to these $\phi_j$ as basis functions or derived features.

We can gain some further intuition about this by thinking about classification. We haven’t talked much about it yet, but a key concept in classification is the idea of the decision boundary, where we transition from making predictions of one category (e.g., dogs) into a different category (e.g., cats). We want our decision boundaries to be simple – ideally linear – but of course not all data can be categorized with simple decision boundaries. Figure 2a shows a very simple data set with two classes. In this figure there is no way to put down a single threshold that perfectly separates the two classes. However, if we augment the features with $x^2$, suddenly a simple linear threshold can classify the data, as shown by the green line in Figure 2b.
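Here is a minimal end-to-end sketch of Equations 7 and 8 in NumPy; the synthetic data are my own invention for illustration:

```python
import numpy as np

# Synthetic data (illustrative only): y is roughly quadratic in x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.3 * rng.standard_normal(x.shape)

# Design matrix for the quadratic basis (Equation 6 with J = 3).
Phi = np.stack([x**j for j in range(3)], axis=1)

# Equation 8: w = (Phi^T Phi)^{-1} Phi^T y.  Solving the normal equations
# (or, better, calling a least-squares routine) avoids forming the inverse.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# Equivalent and more numerically robust:
# w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w  # fitted values: linear in w, nonlinear in x
```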
Examples of Basis Functions
Polynomials In the previous section, we introduced quadratic and cubic polynomials in one dimension as an example of basis functions in linear regression. Figure 3a shows this basis out to $x^4$ and Figure 3b shows five random weightings of these basis functions, to give an idea of what kinds of functions it can produce.

[Figure 3: (a) The polynomial basis functions $\phi_0(x) = 1$, $\phi_1(x) = x$, $\phi_2(x) = x^2$, $\phi_3(x) = x^3$, $\phi_4(x) = x^4$. (b) Five random weightings of these basis functions.]

We can generalize this to multidimensional $\mathcal{X}$ in different ways. One