Tensors
o Contravariant and covariant
o Einstein notation
o Interpretation

What is a Tensor?

! It is the generalization of a:
– matrix, to data with > 2 indices, or
– vector, to data with > 1 index

[Figure: a tensor mapping input data to output data]

! Why do we need it?
– Came out of differential geometry.
– Was used by Einstein to formulate the theory of general relativity.
– It's needed for many problems, including Deep NNs.

Contravariant and Covariant Vectors

y = a x

! Contravariant vectors:
– "Column vectors" that represent data
– Vectors that describe the position of something

– Components: x^i for 0 ≤ i < N, with x ∈ ℜ^N (contravariant index written as a superscript)

! Covariant vector: in y = a x, the vector a is 1 × N (and x is N × 1)
– "Row vectors" that operate on data

– Components: a_i for 0 ≤ i < N (covariant index written as a subscript)

[Figure: picture of the multiplication y = a x]

! Einstein notation
y = Σ_{i=0}^{N−1} a_i x^i = a_i x^i
– Leave out the sum to make the notation cleaner
– Always sum over any two indices that appear twice
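As an aside (not from the slides), numpy's einsum follows exactly this convention: any index letter that appears twice in the subscript string is summed. A minimal sketch, assuming numpy is available:

import numpy as np

N = 4
a = np.random.randn(N)   # covariant ("row") vector a_i
x = np.random.randn(N)   # contravariant ("column") vector x^i

# Einstein notation: y = a_i x^i  (the repeated index i is summed)
y = np.einsum('i,i->', a, x)

assert np.allclose(y, np.sum(a * x))   # same as the explicit sum over i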

Vector-Matrix Products as Tensors

y = A x (covariant rows, contravariant columns)

! 1D contravariant vectors:
– Input: x^i for 0 ≤ i < N_c
– Output: y^j for 0 ≤ j < N_r

! 2D tensor (i.e., matrix): A is N_r × N_c
– A^j_i for 0 ≤ i < N_c and 0 ≤ j < N_r
– There is a gradient (a covariant row) for each output component y^j

[Figure: picture of the multiplication y = A x]

! Einstein notation
y^j = Σ_{i=0}^{N_c−1} A^j_i x^i = A^j_i x^i
– Leave out the sum to make the notation cleaner
– Always sum over any two indices that appear twice
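The same convention covers the vector-matrix case. A minimal numpy sketch (again my own illustration, with made-up sizes):

import numpy as np

Nr, Nc = 3, 5
A = np.random.randn(Nr, Nc)   # 2D tensor A^j_i (Nr x Nc)
x = np.random.randn(Nc)       # contravariant input x^i

# y^j = A^j_i x^i : the repeated index i is summed, j is free
y = np.einsum('ji,i->j', A, x)

assert np.allclose(y, A @ x)  # identical to the ordinary matrix-vector product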

Tensor Products

! Einstein notation: sum over pairs of indices
y^{i,j} = A^{i,j}_{k,l} x^{k,l}

! 2-D contravariant tensors:
– x^{k,l} for 0 ≤ k, l < N
– y^{i,j} for 0 ≤ i, j < N

! 4-D tensor:

– A^{i,j}_{k,l} for 0 ≤ i, j < N and 0 ≤ k, l < N
– A is known as a tensor
• 2D covariant input
• 2D contravariant output
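A small numpy illustration of this double contraction (my own sketch; the sizes are made up):

import numpy as np

N = 3
A = np.random.randn(N, N, N, N)   # 4-D tensor A^{i,j}_{k,l}
X = np.random.randn(N, N)         # 2-D contravariant input x^{k,l}

# y^{i,j} = A^{i,j}_{k,l} x^{k,l} : both k and l appear twice and are summed
Y = np.einsum('ijkl,kl->ij', A, X)

# Equivalent explicit double contraction over the repeated indices
Y_check = np.tensordot(A, X, axes=([2, 3], [0, 1]))
assert np.allclose(Y, Y_check)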

Picture of a Tensor

! For example, if we have
y^i = A^i_{j,k} x^{j,k}
– y is a contravariant 1D vector, x is a 2D tensor, and A is a 3-D tensor
– The tensor A has
• Input: a 2D image indexed by j, k
• Output: a 1D vector indexed by i

[Figure: picture of the 3-D tensor A acting on a 2D image]

– General idea: A ∈ ℜ^{N × M × ⋯}
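To make the interpretation concrete, here is a short numpy sketch (my own example, with assumed sizes) showing that contracting a 3-D tensor against a 2D image is just a matrix-vector product on the flattened image:

import numpy as np

Ni, Nj, Nk = 4, 5, 6
A = np.random.randn(Ni, Nj, Nk)   # 3-D tensor A^i_{j,k}
X = np.random.randn(Nj, Nk)       # 2D "image" x^{j,k}

# y^i = A^i_{j,k} x^{j,k}
y = np.einsum('ijk,jk->i', A, X)

# Same result as reshaping A to a matrix and flattening the image
y_flat = A.reshape(Ni, Nj * Nk) @ X.reshape(Nj * Nk)
assert np.allclose(y, y_flat)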

Some Useful Definitions

! Delta functions:
δ^i_j = 1 if i = j, 0 if i ≠ j
– So then we have that x^i = δ^i_j x^j

! Gradient w.r.t. a vector:
∇_x (Ax) = A
– In tensor notation, we have that [∇_x (Ax)]^i_j = A^i_j

! Gradient w.r.t. a matrix:
∇_A (Ax) = ?
– First, it must have 1 output dimension and 2 input dimensions. So we have that [∇_A (Ax)]^i_{j,k} = δ^i_j x^k
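A quick numerical sanity check of the last identity (my own sketch, not from the slides): since Ax is linear in A, perturbing A by dA should change Ax by exactly the gradient tensor contracted with dA.

import numpy as np

M, N = 3, 4
A = np.random.randn(M, N)
x = np.random.randn(N)

# Gradient tensor G^i_{j,k} = delta^i_j x^k
G = np.einsum('ij,k->ijk', np.eye(M), x)

# A perturbation dA changes the output by G contracted with dA
dA = np.random.randn(M, N)
lhs = (A + dA) @ x - A @ x
rhs = np.einsum('ijk,jk->i', G, dA)
assert np.allclose(lhs, rhs)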

GD for Single Layer NN
o Structure of the Gradient
o Gradients for NN parameters
o Updates for NN parameters

Gradient Direction for Single Layer NN

! Single layer NN:
F_θ(x) = A σ(Bx + b)
– Here x ∈ ℜ^{N_i}, B is N_h × N_i, b ∈ ℜ^{N_h}, and A is N_o × N_h

[Figure: block diagram of the network, x → Bx + b → σ(·) → A(·) → F_θ(x)]

– We will need the gradient w.r.t. the parameters θ = (A, B, b):

∇_θ F_θ(x) = [∇_A F_θ(x), ∇_B F_θ(x), ∇_b F_θ(x)]

– Later, we will also need the input gradient:

∇_x F_θ(x)
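For concreteness, a minimal numpy sketch of this single layer network (my own illustration; the layer sizes and the choice σ = tanh are assumptions, not from the slides):

import numpy as np

def sigma(z):
    # Pointwise activation; tanh is used here only as an example
    return np.tanh(z)

def F(x, A, B, b):
    # Single layer NN: F_theta(x) = A sigma(Bx + b)
    return A @ sigma(B @ x + b)

# Assumed example dimensions: input N_i, hidden N_h, output N_o
N_i, N_h, N_o = 5, 7, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((N_o, N_h))
B = rng.standard_normal((N_h, N_i))
b = rng.standard_normal(N_h)
x = rng.standard_normal(N_i)

y = F(x, A, B, b)    # output vector of length N_o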

Gradient Structure for Single Layer NN

[Figure: gradient structure of the network, combining ∇_A, ∇_B, and ∇_b along the chain Bx + b → σ(·) → A(·)]

! Single layer NN:
– Parameters are θ = (A, B, b)
– We will need the parameter gradients:

∇_θ F_θ(x) = [∇_A F_θ(x), ∇_B F_θ(x), ∇_b F_θ(x)]

– And the input gradient:

∇_x F_θ(x)

Gradient w.r.t. x

[Figure: F_θ(x) = A σ(Bx + b) block diagram]

! For this case,

∇_x F_θ(x) = ∇_x [A σ(Bx + b)] = A ∇σ(Bx + b) B

– Using Einstein notation, with z = Bx + b and ∇σ^k = σ′(z^k),
[∇_x F]^i_j = A^i_k ∇σ^k B^k_j
– This is a matrix or 2-D tensor
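A numerical check of this Jacobian formula (my own sketch, again assuming σ = tanh and made-up sizes):

import numpy as np

rng = np.random.default_rng(1)
N_i, N_h, N_o = 5, 7, 3
A = rng.standard_normal((N_o, N_h))
B = rng.standard_normal((N_h, N_i))
b = rng.standard_normal(N_h)
x = rng.standard_normal(N_i)

F = lambda x: A @ np.tanh(B @ x + b)    # sigma = tanh (assumed)
z = B @ x + b
ds = 1.0 - np.tanh(z) ** 2              # sigma'(z^k), the diagonal of grad sigma

# [grad_x F]^i_j = A^i_k * ds^k * B^k_j  (sum over the repeated index k)
J = np.einsum('ik,k,kj->ij', A, ds, B)

# Compare with a finite-difference Jacobian, one input direction at a time
eps = 1e-6
J_fd = np.stack([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(N_i)], axis=1)
assert np.allclose(J, J_fd, atol=1e-5)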

Gradient w.r.t. B

[Figure: F_θ(x) = A σ(Bx + b) block diagram]

! For this case,

∇_B F_θ(x) = A ∇σ(Bx + b) ∇_B (Bx)

– Using Einstein notation, since [∇_B (Bx)]^k_{l,m} = δ^k_l x^m (this is a tensor!), then
[∇_B F]^i_{l,m} = A^i_k ∇σ^k [∇_B (Bx)]^k_{l,m} = A^i_k ∇σ^k δ^k_l x^m
– So that
[∇_B F]^i_{l,m} = A^i_l ∇σ^l x^m

Gradient w.r.t. b

[Figure: F_θ(x) = A σ(Bx + b) block diagram]

! For this case,

∇_b F_θ(x) = A ∇σ(Bx + b) ∇_b (Bx + b) = A ∇σ(Bx + b) I = A ∇σ

– Using Einstein notation,
[∇_b F]^i_j = A^i_j ∇σ^j

Gradient w.r.t. A

[Figure: F_θ(x) = A σ(Bx + b) block diagram]

! For this case,

∇_A F_θ(x) = ∇_A (A z), with z = σ(Bx + b). This is a tensor!

– Using Einstein notation, since
[∇_A (Az)]^i_{j,k} = δ^i_j z^k
– We have that
[∇_A F]^i_{j,k} = δ^i_j σ(Bx + b)^k
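All three parameter gradients can be checked numerically. The sketch below (my own, reusing the assumed tanh network from earlier) builds each gradient tensor with einsum and compares it against finite differences of F:

import numpy as np

rng = np.random.default_rng(2)
N_i, N_h, N_o = 4, 6, 3
A = rng.standard_normal((N_o, N_h))
B = rng.standard_normal((N_h, N_i))
b = rng.standard_normal(N_h)
x = rng.standard_normal(N_i)

def F(A, B, b):
    # Single layer NN with sigma = tanh (assumed), evaluated at the fixed input x
    return A @ np.tanh(B @ x + b)

z = B @ x + b
s = np.tanh(z)
ds = 1.0 - s ** 2                              # sigma'(z)

gA = np.einsum('ij,k->ijk', np.eye(N_o), s)    # [grad_A F]^i_{j,k} = delta^i_j s^k
gB = np.einsum('il,l,m->ilm', A, ds, x)        # [grad_B F]^i_{l,m} = A^i_l ds^l x^m
gb = np.einsum('ij,j->ij', A, ds)              # [grad_b F]^i_j   = A^i_j ds^j

eps = 1e-6
for j, k in np.ndindex(N_o, N_h):              # check grad_A entry by entry
    dA = np.zeros_like(A); dA[j, k] = eps
    fd = (F(A + dA, B, b) - F(A - dA, B, b)) / (2 * eps)
    assert np.allclose(gA[:, j, k], fd, atol=1e-5)
for l, m in np.ndindex(N_h, N_i):              # check grad_B
    dB = np.zeros_like(B); dB[l, m] = eps
    fd = (F(A, B + dB, b) - F(A, B - dB, b)) / (2 * eps)
    assert np.allclose(gB[:, l, m], fd, atol=1e-5)
for j in range(N_h):                           # check grad_b
    db = np.zeros_like(b); db[j] = eps
    fd = (F(A, B, b + db) - F(A, B, b - db)) / (2 * eps)
    assert np.allclose(gb[:, j], fd, atol=1e-5)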

Update Direction for B

! Gradient step: θ ← θ + α Δθ, where ΔB is given by
ΔB_{l,m} = −[∇_B f(θ)]_{l,m} = − Σ_i [∇_F f]_i A^i_l ∇σ^l x^m = −[∇_F f]_i A^i_l ∇σ^l x^m
– Here ∇_F f is the gradient of the loss f w.r.t. the network output F_θ(x); it is covariant, i.e., a 1 × N_o row vector
– For efficiency, computation goes this way: the 1 × N_o row vector is multiplied into the N_o × N_h matrix A and the N_h × N_h diagonal ∇σ from left to right, so only small intermediate vectors are formed

Update Direction for b

! Gradient step: θ ← θ + α Δθ, where Δb is given by
Δb_j = −[∇_b f(θ)]_j = −[∇_F f]_i A^i_j ∇σ^j
– For efficiency, computation goes this way: left to right, as a 1 × N_o row vector times an N_o × N_h matrix

Update Direction for A

! Gradient step: θ ← θ + α Δθ, where ΔA is given by
ΔA_{j,k} = −[∇_A f(θ)]_{j,k} = −[∇_F f]_i δ^i_j σ(Bx + b)^k = −[∇_F f]_j σ(Bx + b)^k
– For efficiency, computation goes this way: an outer product of the 1 × N_o row vector with σ(Bx + b); the full 3-D gradient tensor is never formed
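A sketch of why the evaluation order matters (my own illustration; a squared-error loss is assumed so that ∇_F f = (F_θ(x) − y) as a row vector): the updates are computed by propagating a small row vector through the layer, and the full 3-D gradient tensors never need to be formed.

import numpy as np

rng = np.random.default_rng(3)
N_i, N_h, N_o = 4, 6, 3
A = rng.standard_normal((N_o, N_h))
B = rng.standard_normal((N_h, N_i))
b = rng.standard_normal(N_h)
x = rng.standard_normal(N_i)
y = rng.standard_normal(N_o)             # target (assumed squared-error loss)

z = B @ x + b
s = np.tanh(z)                           # sigma = tanh (assumed)
ds = 1.0 - s ** 2
F = A @ s

g = F - y                                # row vector grad_F f, shape (N_o,) ~ 1 x N_o

# Left to right: only small vectors and outer products are ever formed
dA = -np.outer(g, s)                     # Delta A_{j,k} = -g_j s^k
gh = (g @ A) * ds                        # 1 x N_h row vector, propagated through the layer
dB = -np.outer(gh, x)                    # Delta B_{l,m} = -g_i A^i_l dsigma^l x^m
db = -gh                                 # Delta b_j   = -g_i A^i_j dsigma^j

# Same results via the full gradient tensors (much more memory for big layers)
gA = np.einsum('ij,k->ijk', np.eye(N_o), s)
gB = np.einsum('il,l,m->ilm', A, ds, x)
gb = np.einsum('ij,j->ij', A, ds)
assert np.allclose(dA, -np.einsum('i,ijk->jk', g, gA))
assert np.allclose(dB, -np.einsum('i,ilm->lm', g, gB))
assert np.allclose(db, -np.einsum('i,ij->j', g, gb))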

Local and Global Minima
o Open and Closed Sets
o Convex Sets and Functions
o Properties of Convex Functions
o Local Minimum, Saddle Points, and Global Minima
o Optimization Theorems

Open and Closed Sets

! Define:
– A ⊂ ℜ^N
– The open ball of radius ε about x is B(x, ε) = { y ∈ ℜ^N : ‖y − x‖ < ε }.
! A set A is open if
– At every point, there is an open ball contained in A.
– ∀x ∈ A, ∃ε > 0 s.t. B(x, ε) ⊂ A.
! A set A is closed if its complement ℜ^N − A is open.
! A set A is compact if it is closed and bounded.
! Facts:
– ℜ^N is both open and closed, but it is not compact.
– If A is compact, then every sequence in A has a limit point in A.

Convexity: Sets

! A set A is convex if ∀λ ∈ [0,1], ∀x, y ∈ A, then λx + (1 − λ)y ∈ A

[Figure: convex set (line connecting points is always in the set) vs. nonconvex set (line connecting points is sometimes outside the set)]

! Properties:
– The intersection of convex sets is convex

Convexity: Functions

! Let f: A → ℜ where A is a convex set. Then we say that f is a convex function if
∀λ ∈ [0,1], ∀x, y ∈ A, f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y)

[Figure: convex function (chord is always above the function) vs. nonconvex function (chord is sometimes below the function)]

! Properties:
– The sum of convex functions is convex
– The maximum of a set of convex functions is convex
– If f(x) is convex, then f(Ax) is also convex.
– f(x) is concave if −f(x) is convex.
– f(x) = a^T x + b is both convex and concave

Properties of Convex Functions

! Valuable properties of convex functions
• The sum of convex functions is convex.
– Let f(x) = Σ_k f_k(x). If the f_k(x) are convex, then f(x) is convex.
• The maximum of a set of convex functions is convex.
– Let f(x) = max_k f_k(x). If the f_k(x) are convex, then f(x) is convex.
• The Hessian of a convex function is non-negative definite.
– Let f(x) have two continuous derivatives, and let H(x) = ∇²f(x) be the Hessian of f at x. Then f(x) is convex if and only if H(x) is non-negative definite for all x.
• A convex function of a linear transform is convex.
– If f(x) is convex, then f(Ax) is also convex.
• An affine transform is both convex and concave.
– f(x) = a^T x + b is both convex and concave.
• A function is concave if its negative is convex.
– f(x) is concave if −f(x) is convex.
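A small illustration of the first two properties (my own sketch, not from the slides): a maximum of affine functions is convex, and the defining inequality can be spot-checked numerically.

import numpy as np

rng = np.random.default_rng(4)
K, N = 5, 3
a = rng.standard_normal((K, N))
c = rng.standard_normal(K)

def f(x):
    # f(x) = max_k (a_k . x + c_k): a maximum of affine (convex) functions
    return np.max(a @ x + c)

# Spot-check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) on random points
for _ in range(1000):
    x, y = rng.standard_normal(N), rng.standard_normal(N)
    lam = rng.uniform()
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12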

Local Minimum

! Let f: A → ℜ where A ⊂ ℜ^N.

– We say that x* ∈ A is a local minimum of f if ∃ε > 0 s.t. ∀x ∈ A ∩ B(x*, ε), f(x) ≥ f(x*).

[Figure: graph of f with a local minimum at x*]

• Necessary condition for local minima:

– Let f be continuously differentiable and let x* be a local minimum; then ∇f(x*) = 0.

[Figure: x* a local minimum, ∇f(x*) = 0]

Saddle Point

! Let f: A → ℜ, where A ⊂ ℜ^N, be a continuously differentiable function.

– We say that x* ∈ A is a saddle point of f if ∇f(x*) = 0 and x* is not a local minimum.

[Figure: saddle point in 1D and saddle point in 2D*; graph of f(x) = x_1^2 − x_2^2]

– Saddle points can cause more problems than you might think.

*Shared from Wikipedia
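A tiny numerical illustration of the 2D example above (my own sketch): at the origin the gradient of f(x) = x_1^2 − x_2^2 vanishes, yet the origin is not a local minimum.

import numpy as np

def f(x):
    return x[0] ** 2 - x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0], -2 * x[1]])

x0 = np.zeros(2)
print(grad_f(x0))                      # gradient is zero: a stationary point
print(f(x0), f([0.1, 0]), f([0, 0.1]))
# f increases in one direction and decreases in another,
# so x0 is a saddle point, not a local minimum.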

Global Minimum

! Let f: A → ℜ where A ⊂ ℜ^N.

– We say that x* ∈ A is a global minimum of f if ∀x ∈ A, f(x) ≥ f(x*).

[Figure: a nonconvex function with GD steps descending into a local minimum rather than the global minimum at x*]

! Comments:
– In general, finding the global minimum is difficult.
– Gradient descent optimization typically becomes trapped in local minima, as illustrated in the sketch below.
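A minimal gradient descent sketch of this point (my own example; the function and step size are made up): depending on the starting point, plain GD settles into different local minima of a nonconvex function.

import numpy as np

# Nonconvex example: f(x) = x^4 - 3x^2 + x has two local minima
f = lambda x: x ** 4 - 3 * x ** 2 + x
df = lambda x: 4 * x ** 3 - 6 * x + 1

def gd(x0, step=0.01, iters=2000):
    x = x0
    for _ in range(iters):
        x = x - step * df(x)
    return x

x_left, x_right = gd(-2.0), gd(+2.0)
print(x_left, f(x_left))    # reaches the global minimum (near x = -1.3)
print(x_right, f(x_right))  # trapped in the shallower local minimum (near x = +1.1)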

Optimization Theorems

! Let f: A → ℜ where A ⊂ ℜ^N.
– If f is continuous and A is compact, then f takes on a global minimum in A.
– If f is convex on A, then any local minimum is a global minimum.
– If f is continuously differentiable and convex on A, then ∇f(x) = 0 implies that x ∈ A is a global minimum of f.

! Important facts:
– The global minimum may not be unique.
– If A is closed but not bounded, then f may not take on a global minimum.
– Generally speaking, gradient descent algorithms converge to the global minimum of continuously differentiable convex functions.
– Most interesting functions in ML are not convex!

[Figure: a convex function with GD steps descending to the global minimum]