RECALL: LINEAR CLASSIFICATION

[Figure: a separating hyperplane with normal vector v; the classifier assigns a point x according to which side of the plane it falls on.]

f(x) = sgn(⟨v, x⟩ − c)

Peter Orbanz · Applied Data Mining · 274

LINEAR CLASSIFIER IN R² AS TWO-LAYER NN

[Network diagram: input units x1, x2 and a constant unit c, with weights v1, v2 and −1, feed into an output unit I{• > 0} that emits f(x).]

f(x) = I{v1·x1 + v2·x2 + (−1)·c > 0} = I{⟨v, x⟩ > c}

Equivalent to linear classifier
The linear classifier on the previous slide and f differ only in whether they encode the "blue" class as −1 or as 0:

sgn(⟨v, x⟩ − c) = 2·f(x) − 1
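The two encodings can be sketched in NumPy (a minimal illustration; the function names are ours):

```python
import numpy as np

def linear_classifier_sgn(x, v, c):
    """Linear classifier with classes encoded as -1 / +1."""
    return np.sign(v @ x - c)

def linear_classifier_nn(x, v, c):
    """The same classifier written as a two-layer network: the inputs
    and the constant c feed a single output unit I{ . > 0 }."""
    return 1 if (v @ x + (-1) * c) > 0 else 0

# The encodings agree via sgn(<v,x> - c) = 2 f(x) - 1:
v, c = np.array([1.0, 2.0]), 0.5
x = np.array([1.0, -0.5])
assert linear_classifier_sgn(x, v, c) == 2 * linear_classifier_nn(x, v, c) - 1
```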

REMARKS

[Network diagram as on the previous slide: inputs x1, x2 and constant c with weights v1, v2, −1; the output unit computes y = I{vᵀx > c}.]

• This neural network represents a linear two-class classifier (on R²).
• We can more generally define a classifier on R^d by adding input units, one per dimension.

• It does not specify the training method.

• To train the classifier, we need a cost function and an optimization method.

TYPICAL COMPONENT FUNCTIONS

Linear units

φ(x) = x

This function simply "passes on" its incoming signal. These are used for example to represent inputs (data values).

Constant functions

φ(x) = c

These can be used e.g. in combination with an indicator function to define a threshold, as in the linear classifier above.

TYPICAL COMPONENT FUNCTIONS

Indicator function

φ(x) = I{x > 0}

Example: Final unit is indicator

[Network diagram: inputs x1, x2 and constant c with weights v1, v2, −1 feed into an indicator unit I{• > 0}, which emits f(x).]

TYPICAL COMPONENT FUNCTIONS

Sigmoids

φ(x) = 1 / (1 + e^(−x))

[Plot: the sigmoid increases from 0 to 1, passing through 1/2 at x = 0.]

Example: Final unit is sigmoid

[Network diagram: inputs x1, x2 and constant c with weights v1, v2, −1 feed into a sigmoid unit σ(•), which emits f(x).]

TYPICAL COMPONENT FUNCTIONS

Rectified linear units

φ(x) = max{0, x}

These are currently perhaps the most commonly used unit in the “inner” layers of a neural network (those layers that are not the input or output layer).
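The component functions listed in this section can be collected in one place (a NumPy sketch; nothing here is specific to any library's API):

```python
import numpy as np

# The component functions phi from the preceding slides:
linear    = lambda x: x                      # passes the signal through
constant  = lambda c: (lambda x: c)          # e.g. a threshold/bias unit
indicator = lambda x: np.where(x > 0, 1, 0)  # I{x > 0}
sigmoid   = lambda x: 1.0 / (1.0 + np.exp(-x))
relu      = lambda x: np.maximum(0.0, x)     # max{0, x}

x = np.array([-2.0, 0.0, 3.0])
print(indicator(x))  # [0 0 1]
print(relu(x))       # [0. 0. 3.]
```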

HIDDEN LAYERS AND NONLINEAR FUNCTIONS

Hidden units

• Any nodes (or “units”) in the network that are neither input nor output nodes are called hidden.

• Every network has an input layer and an output layer.

• If there are any additional layers (which hence consist of hidden units), they are called hidden layers.

Linear and nonlinear networks

• If a network has no hidden units, then

f_i(x) = φ_i(⟨w_i, x⟩)

That means: f is a linear function, except perhaps for the final application of φ.

• For example: In a classification problem, a two-layer network can only represent linear decision boundaries.

• Networks with at least one hidden layer can represent nonlinear decision surfaces.

TWO VS THREE LAYERS

While we can be confident that a complete set of functions, such as all polynomials, can represent any function it is nevertheless a fact that a single functional form also suffices, so long as each component has appropriate variable parameters. In the absence of information suggesting otherwise, we generally use a single functional form for the transfer functions.

While these latter constructions show that any desired function can be implemented by a three-layer network, they are not particularly practical because for most problems we know ahead of time neither the number of hidden units required, nor the proper weight values. Even if there were a constructive proof, it would be of little use in pattern recognition since we do not know the desired function anyway — it is related to the training patterns in a very complicated way. All in all, then, these results on the expressive power of networks give us confidence we are on the right track, but shed little practical light on the problems of designing and training neural networks — their main benefit for pattern recognition (Fig. 6.3).

[Figure: decision regions R1, R2 in the (x1, x2)-plane, for a two-layer and for a three-layer network.]

Figure 6.3: Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex, nor simply connected.

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001

6.3 Backpropagation algorithm

We have just seen that any function from input to output can be implemented as a three-layer neural network. We now turn to the crucial problem of setting the weights based on training patterns and desired output.

THE XOR PROBLEM

[Figure: left, the solution regions we would like to represent in the feature space (x1, x2), with regions R1 (z = 1) and R2 (z = −1); right, the neural network representation, a 2-2-1 threshold network with hidden biases 0.5 and −1.5, output weights 0.7 and −0.4, and output bias −1.]

Figure 6.1: The two-bit parity or exclusive-OR problem can be solved by a three-layer network. At the bottom is the two-dimensional feature space x1–x2, and the four patterns to be classified. The three-layer network is shown in the middle. The input units are linear and merely distribute their (feature) values through multiplicative weights to the hidden units. The hidden and output units here are linear threshold units, each of which forms the linear sum of its inputs times their associated weight, and emits a +1 if this sum is greater than or equal to 0, and −1 otherwise, as shown by the graphs. Positive ("excitatory") weights are denoted by solid lines, negative ("inhibitory") weights by dashed lines; the weight magnitude is indicated by the relative thickness, and is labeled. The single output unit sums the weighted signals from the hidden units (and bias) and emits a +1 if that sum is greater than or equal to 0 and a −1 otherwise. Within each unit we show a graph of its input-output or transfer function — f(net) vs. net. This function is linear for the input units, a constant for the bias, and a step or sign function elsewhere. We say that this network has a 2-2-1 fully connected topology, describing the number of units (other than the bias) in successive layers.

Neural network representation
• Two ridges at different locations are subtracted from each other.
• That generates a region bounded on both sides.
• A linear classifier cannot represent this decision region.
• Note this requires at least one hidden layer.

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001

input feature xi. Each hidden unit emits a nonlinear function Ξ of its total input; the output unit merely emits the sum of the contributions of the hidden units.
Unfortunately, the relationship of Kolmogorov's theorem to practical neural networks is a bit tenuous, for several reasons. In particular, the functions Ξj and ψij are not the simple weighted sums passed through nonlinearities favored in neural networks. In fact those functions can be extremely complex; they are not smooth, and indeed for subtle mathematical reasons they cannot be smooth. As we shall soon see, smoothness is important for learning. Most importantly, Kolmogorov's Theorem tells us very little about how to find the nonlinear functions based on data — the central problem in network-based pattern recognition.
A more intuitive proof of the universal expressive power of three-layer nets is inspired by Fourier's Theorem that any continuous function g(x) can be approximated arbitrarily closely by a (possibly infinite) sum of harmonic functions (Problem 2). One can imagine networks whose hidden units implement such harmonic functions. Proper hidden-to-output weights related to the coefficients in a Fourier synthesis would then enable the full network to implement the desired function. Informally speaking, we need not build up harmonic functions for Fourier-like synthesis of a desired function. Instead a sufficiently large number of "bumps" at different input locations, of different amplitude and sign, can be put together to give our desired function. Such localized bumps might be implemented in a number of ways, for instance by sigmoidal transfer functions grouped appropriately (Fig. 6.2). The Fourier analogy and bump constructions are conceptual tools; they do not explain the way networks in fact function. In short, this is not how neural networks "work" — we never find that through training (Sect. 6.3) simple networks build a Fourier-like representation, or learn to group sigmoids to get component bumps.
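The "bump" construction mentioned above can be sketched numerically: two sigmoid hidden units paired in opposition produce a localized bump (the parameter values below are our own choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bump(x, left=-1.0, right=1.0, steepness=20.0):
    """Two sigmoidal hidden units 'paired in opposition', as in Fig. 6.2:
    their difference is close to 1 inside (left, right) and close to 0
    far outside that interval."""
    return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))

x = np.linspace(-4, 4, 9)
print(np.round(bump(x), 3))  # near 1 inside (-1, 1), near 0 far outside
```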

[Figure: a 2-4-1 network; four sigmoidal hidden units y1, ..., y4 combine into the output z1.]

Figure 6.2: A 2-4-1 network (with bias) along with the response functions at different units; each hidden and output unit has sigmoidal transfer function f(·). In the case shown, the hidden unit outputs are paired in opposition thereby producing a "bump" at the output unit. Given a sufficiently large number of hidden units, any continuous function from input to output can be approximated arbitrarily well by such a network.

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001

NUMBER OF LAYERS

We have observed

• We have seen that two-layer classification networks always represent linear class boundaries.

• With three layers, the boundaries can be non-linear.

Obvious question

• What happens if we use more than three layers? Do four layers again increase expressive power?

WIDTH VS DEPTH

A neural network represents a (typically) complicated function f by simple functions φ_i^(k).

What functions can be represented? A well-known result in approximation theory says: Every continuous function f : [0, 1]^d → R can be represented in the form

    f(x) = Σ_{j=1}^{2d+1} ξ_j ( Σ_{i=1}^{d} τ_ij(x_i) )

where ξ_j and τ_ij are functions R → R. A similar result shows one can approximate f to arbitrary precision using specifically sigmoids, as

    f(x) ≈ Σ_{j=1}^{M} w_j^(2) σ ( Σ_{i=1}^{d} w_ij^(1) x_i + c_i )

for some finite M and constants c_i. Note the representations above can both be written as neural networks with three layers (i.e. with one hidden layer).

Not examinable.

WIDTH VS DEPTH

Depth rather than width

• The representations above can achieve arbitrary precision with a single hidden layer (roughly: a three-layer neural network can represent any continuous function).

• In the first representation, ξ_j and τ_ij are "simpler" than f because they map R → R.
• In the second representation, the functions are more specific (sigmoids), and we typically need more of them (M is large).

• That means: The price of precision is many hidden units, i.e. the network grows wide.

• The last years have shown: We can obtain very good results by limiting layer width, and instead increasing depth (= number of layers).

• There is no coherent theory yet to properly explain this behavior.

Limiting width

• Limiting layer width means we limit the degrees of freedom of each function f^(k).

• That is a notion of parsimony.

• Again: There seem to be a lot of interesting questions to study here, but so far, we have no real answers.

Not examinable.

TRAINING NEURAL NETWORKS

Task

• We decide on a neural network "architecture": We fix the network diagram, including all functions φ at the units. Only the weights w on the edges can be changed by the training algorithm. Suppose the architecture we choose has d1 input units and d2 output units.

• We collect all weights into a vector w. The entire network then represents a function f_w(x) that maps R^d1 → R^d2.
• To "train" the network now means that, given training data, we have to determine a suitable parameter vector w, i.e. we fit the network to data by fitting the weights.

More specifically: Classification
Suppose the network is meant to represent a two-class classifier.
• That means the output dimension is d2 = 1, so f_w is a function R^d1 → R.
• We are given data x1, x2, ... with labels y1, y2, ....

• We split this data into training, validation and test data, according to the requirements of the problem we are trying to solve.

• We then fit the network to the training data.

TRAINING NEURAL NETWORKS

[Diagram: a training point x̃ passes through the network and produces the output f_w(x̃).]

• We run each training data point x̃_i through the network f_w and compare f_w(x̃_i) to ỹ_i to measure the error.

• Recall how gradient descent works: We make “small” changes to w, and choose the one which decreases the error most. That is one step of the gradient scheme.

• For each such changed value w′, we again run each training data point x̃_i through the network f_{w′}, and measure the error by comparing f_{w′}(x̃_i) to ỹ_i.
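The "small changes" view of one gradient step can be made concrete with finite differences (illustration only; actual network training computes the gradient by backpropagation, as described later):

```python
import numpy as np

def numeric_gradient_step(J, w, eps=1e-6, lr=0.1):
    """One step of the scheme described above: perturb each coordinate
    of w slightly, measure how the error J changes, and move against
    the direction of steepest increase."""
    grad = np.zeros_like(w)
    base = J(w)
    for i in range(len(w)):
        w_plus = w.copy()
        w_plus[i] += eps
        grad[i] = (J(w_plus) - base) / eps   # finite-difference slope
    return w - lr * grad

# Toy error surface J(w) = ||w||^2, minimized at w = 0
J = lambda w: np.sum(w ** 2)
w = np.array([1.0, -2.0])
for _ in range(100):
    w = numeric_gradient_step(J, w)
print(w)  # approaches the minimizer [0, 0]
```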


TRAINING NEURAL NETWORKS

Error measure

• We have to specify how we compare the network’s output fw(x) to the correct answer y.

• To do so, we specify a function D with two arguments that serves as an error measure.

• The choice of D depends on the problem.

Typical error measures

• Classification problem: D(ŷ, y) := −y log ŷ (with the convention 0 log 0 = 0)

• Regression problem: D(ŷ, y) := ‖y − ŷ‖²

Training as an optimization problem

• Given: Training data (x1, y1),..., (xn, yn) with labels yi.

• We specify an error measure D, and define the total error on the training set as

    J(w) := Σ_{i=1}^{n} D(f_w(x̃_i), ỹ_i)
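The error measures and the total training error J(w) can be sketched directly; the one-parameter "network" f_w(x) = w·x below is a hypothetical stand-in used only for illustration:

```python
import numpy as np

def D_classification(y_hat, y):
    """Error measure -y log(y_hat), with the convention 0 log 0 = 0."""
    return 0.0 if y == 0 else -y * np.log(y_hat)

def D_regression(y_hat, y):
    """Squared error ||y - y_hat||^2."""
    return np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

def J(w, f, data, D):
    """Total error of f_w on the training set: sum of D over all points."""
    return sum(D(f(w, x), y) for x, y in data)

# Toy 'network' f_w(x) = w * x with squared error:
data = [(1.0, 2.0), (2.0, 4.0)]
f = lambda w, x: w * x
print(J(2.0, f, data, D_regression))  # w = 2 fits perfectly: total error 0.0
```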

BACKPROPAGATION

Training problem In summary, neural network training attempts to solve the optimization problem

w* = arg min_w J(w)

using gradient descent. For feed-forward networks, the gradient descent algorithm takes a specific form that is called backpropagation.

Backpropagation is gradient descent applied to J(w) in a feed-forward network.

In practice: Stochastic gradient descent

• The vector w can be very high-dimensional. In high dimensions, computing a gradient is computationally expensive, because we have to make “small changes” to w in many different directions and compare them to each other.

• Each time the gradient algorithm computes J(w′) for a changed value w′, we have to apply the network to every data point, since J(w′) = Σ_{i=1}^{n} D(f_{w′}(x̃_i), ỹ_i).
• To save computation, the gradient algorithm typically computes D(f_{w′}(x̃_i), ỹ_i) only for some small subset of the training data. This subset is called a mini-batch, and the resulting algorithm is called stochastic gradient descent.
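A minimal sketch of stochastic gradient descent with mini-batches; the toy model and all parameter values are our own:

```python
import numpy as np

def sgd(grad_D, w, data, lr=0.01, batch_size=2, epochs=50, seed=0):
    """Stochastic gradient descent: each step estimates the gradient of
    J(w) from a random mini-batch instead of the full training set.
    grad_D(w, x, y) returns the gradient of D(f_w(x), y) w.r.t. w."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            grad = sum(grad_D(w, x, y) for x, y in batch) / len(batch)
            w = w - lr * grad
    return w

# Toy example: f_w(x) = w*x with squared error D = (y - w*x)^2,
# whose gradient with respect to w is -2x(y - w*x).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
grad_D = lambda w, x, y: -2 * x * (y - w * x)
w = sgd(grad_D, 0.0, data)
print(w)  # close to the true slope 3
```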

BACKPROPAGATION

Neural network training optimization problem

min_w J(w)

The application of gradient descent to this problem is called backpropagation.

Backpropagation is gradient descent applied to J(w) in a feed-forward network.

Deriving backpropagation

• We have to evaluate the derivative ∇_w J(w).
• Since J is additive over training points, J(w) = Σ_n J_n(w), it suffices to derive ∇_w J_n(w).

The next few slides were written for a different class, and you are not expected to know their content. I show them only to illustrate the interesting way in which gradient descent interleaves with the feed-forward architecture.

BACKPROPAGATION

Deriving backpropagation

• We have to evaluate the derivative ∇_w J(w).
• Since J is additive over training points, J(w) = Σ_n J_n(w), it suffices to derive ∇_w J_n(w).

Not examinable.

CHAIN RULE

Recall from calculus: Chain rule
Consider a composition of functions f ∘ g(x) = f(g(x)).

    d(f ∘ g)/dx = (df/dg) · (dg/dx)

If the derivatives of f and g are f′ and g′, that means: (d(f ∘ g)/dx)(x) = f′(g(x)) g′(x)

Application to feed-forward network
Let w^(k) denote the weights in layer k. The function represented by the network is

    f_w(x) = f^(K) ∘ ··· ∘ f^(1)(x) = f^(K)_{w^(K)} ∘ ··· ∘ f^(1)_{w^(1)}(x)

To solve the optimization problem, we have to compute derivatives of the form

    (d/dw) D(f_w(x_n), y_n) = (dD(•, y_n)/df_w) · (df_w/dw)

Not examinable.

DECOMPOSING THE DERIVATIVES

• The chain rule means we compute the derivatives layer by layer.

• Suppose we are only interested in the weights of layer k, and keep all other weights fixed. The function f represented by the network is then

    f_{w^(k)}(x) = f^(K) ∘ ··· ∘ f^(k+1) ∘ f^(k)_{w^(k)} ∘ f^(k−1) ∘ ··· ∘ f^(1)(x)

• The first k − 1 layers enter only as the function value of x, so we define

    z^(k) := f^(k−1) ∘ ··· ∘ f^(1)(x)

and get

    f_{w^(k)}(x) = f^(K) ∘ ··· ∘ f^(k+1) ∘ f^(k)_{w^(k)}(z^(k))

• If we differentiate with respect to w^(k), the chain rule gives

    (d/dw^(k)) f_{w^(k)}(x) = (df^(K)/df^(K−1)) ··· (df^(k+1)/df^(k)) · (df^(k)/dw^(k))

Not examinable.

WITHIN A SINGLE LAYER

• Each f^(k) is a vector-valued function f^(k) : R^{d_k} → R^{d_{k+1}}.
• It is parametrized by the weights w^(k) of the kth layer and takes an input vector z ∈ R^{d_k}.
• We write f^(k)(z, w^(k)).

Layer-wise derivative
Since f^(k) and f^(k−1) are vector-valued, we get a Jacobian matrix

    df^(k+1)/df^(k) = [ ∂f^(k+1)_1/∂f^(k)_1        ···   ∂f^(k+1)_1/∂f^(k)_{d_k}
                        ⋮                                  ⋮
                        ∂f^(k+1)_{d_{k+1}}/∂f^(k)_1 ···   ∂f^(k+1)_{d_{k+1}}/∂f^(k)_{d_k} ]  =:  Δ^(k)(z, w^(k+1))

• Δ^(k) is a matrix of size d_{k+1} × d_k.
• The derivatives in the matrix quantify how f^(k+1) reacts to changes in the argument of f^(k) if the weights w^(k+1) and w^(k) of both functions are fixed.

Not examinable.

BACKPROPAGATION ALGORITHM

Let w^(1), ..., w^(K) be the current settings of the layer weights. These have either been computed in the previous iteration, or (in the first iteration) are initialized at random.

Step 1: Forward pass
We start with an input vector x and compute

    z^(k) := f^(k) ∘ ··· ∘ f^(1)(x)

for all layers k.

Step 2: Backward pass

• Start with the last layer. Update the weights w^(K) by performing a gradient step on

    D( f^(K)(z^(K), w^(K)), y )

regarded as a function of w^(K) (so z^(K) and y are fixed). Denote the updated weights w̃^(K).
• Move backwards one layer at a time. At layer k, we have already computed updates w̃^(K), ..., w̃^(k+1). Update w^(k) by a gradient step, where the derivative is computed as

    Δ^(K−1)(z^(K−1), w̃^(K)) · ··· · Δ^(k)(z^(k), w̃^(k+1)) · (df^(k)/dw^(k))(z, w^(k))

• On reaching level 1, go back to Step 1 and recompute the z^(k) using the updated weights.
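The two passes can be sketched for a tiny two-layer sigmoid network with D = ½‖ŷ − y‖² (a toy example of our own; real implementations add bias units, batching, and more layers):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(W1, W2, x, y, lr=0.5):
    """One iteration of the algorithm above for the network
    f(x) = sigmoid(W2 @ sigmoid(W1 @ x)) with D = 0.5 * ||y_hat - y||^2."""
    # Step 1: forward pass, storing the layer values z^(k)
    z1 = sigmoid(W1 @ x)                      # z^(1)
    z2 = sigmoid(W2 @ z1)                     # network output
    # Step 2: backward pass, starting at the last layer
    delta2 = (z2 - y) * z2 * (1 - z2)         # dD/d(layer-2 pre-activation)
    W2_new = W2 - lr * np.outer(delta2, z1)   # gradient step on w^(2)
    # Layer 1 uses the already-updated weights w~^(2), as on the slide.
    delta1 = (W2_new.T @ delta2) * z1 * (1 - z1)
    W1_new = W1 - lr * np.outer(delta1, x)
    return W1_new, W2_new

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, y = np.array([0.5, -1.0]), np.array([1.0])
err = lambda W1, W2: 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ x)) - y) ** 2)
before = err(W1, W2)
for _ in range(100):
    W1, W2 = backprop_step(W1, W2, x, y)
print(before, "->", err(W1, W2))  # the training error decreases
```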

Not examinable.

SUMMARY: BACKPROPAGATION

• Backpropagation is a gradient descent method for the optimization problem

    min_w J(w) = Σ_{i=1}^{N} D(f_w(x_i), y_i)

D must be chosen such that it is additive over data points.
• It alternates between forward passes that update the layer-wise function values z^(k) given the current weights, and backward passes that update the weights using the current z^(k).
• The layered architecture means we can (1) compute each z^(k) from z^(k−1) and (2) use the weight updates computed in layers K, ..., k+1 to update weights in layer k.

Not examinable.