Image Sequence (or Motion) Analysis

also referred to as:

Dynamic Scene Analysis

Input: A sequence of images (or frames)

Output: Motion parameters (3-D) of the object(s) in motion

and/or

3-D Structure information of the moving objects

Assumption: The time interval ∆T between two consecutive frames is small.

Typical steps in Dynamic Scene Analysis

• Image filtering and enhancement

• Tracking

• Feature detection

• Establish feature correspondence

• Motion parameter estimation (local and global)

• Structure estimation (optional)

• Predict occurrence in next frame

TRACKING in a Dynamic Scene

[Figure: frame sequence F1, F2, F3, ..., FN]

Since ∆T is small, the amount of motion/displacement of objects between two successive frames is also small.

At 25 frames per sec (fps): ∆T = 40 msec.

At 50 frames per sec (fps): ∆T = 20 msec.

Image Motion

Image changes by difference equation:

$$f_d(x_1, x_2, t_i, t_j) = f(x_1, x_2, t_i) - f(x_1, x_2, t_j) = f(t_i) - f(t_j) = f_i - f_j$$

Accumulated difference image:

$$f_T(X, t_n) = f_d(X, t_{n-1}, t_n) - f_T(X, t_{n-1}), \quad n \ge 3,$$

where $f_T(X, t_2) = f_d(X, t_2, t_1)$.
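A minimal sketch of the two difference images, assuming grayscale frames stored as nested lists of numbers. The recursion is coded exactly as written on the slide, sign included; the function names and the threshold-based mask are illustrative additions, not part of the original material.

```python
def difference_image(f_i, f_j):
    """Pixel-wise difference f_d = f_i - f_j of two equally sized frames."""
    return [[a - b for a, b in zip(row_i, row_j)]
            for row_i, row_j in zip(f_i, f_j)]


def accumulated_difference(frames):
    """Recursion as stated on the slide (frames[0] is t_1):
    f_T(t_2) = f_d(t_2, t_1)
    f_T(t_n) = f_d(t_{n-1}, t_n) - f_T(t_{n-1}),  n >= 3
    """
    f_t = difference_image(frames[1], frames[0])
    for n in range(2, len(frames)):
        f_d = difference_image(frames[n - 1], frames[n])
        f_t = difference_image(f_d, f_t)
    return f_t


def moving_mask(f_i, f_j, threshold):
    """Illustrative helper: mark pixels whose absolute change exceeds a threshold."""
    return [[abs(d) > threshold for d in row]
            for row in difference_image(f_i, f_j)]


# A single bright pixel moving one position to the right per frame.
frame_1 = [[9, 0, 0, 0]]
frame_2 = [[0, 9, 0, 0]]
frame_3 = [[0, 0, 9, 0]]

fd_21 = difference_image(frame_2, frame_1)
ft_3 = accumulated_difference([frame_1, frame_2, frame_3])
mask_12 = moving_mask(frame_1, frame_2, threshold=5)
```

In this toy sequence the mask marks both the pixel the object left and the pixel it entered, which is the usual behaviour of plain frame differencing.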

Moving Edge (or feature) detector:

$$F_{mov\_feat}(X, t_1, t_2) = \frac{\partial f}{\partial X} \cdot f_d(X, t_1, t_2)$$

Recent methods include background and foreground modeling.

[Figure: frames F1 and F2, their difference F1 - F2, and abs(F1 - F2)]

[Figure: input video frame, segmented video frame, and the moving object extracted using an alpha matte]

Two categories of Visual Tracking Algorithms: Target Representation and Localization; Filtering and Data Association.

A. Some common Target Representation and Localization algorithms:

Blob tracking: Segmentation of object interior (for example, block-based correlation or optical flow)

Kernel-based tracking (Mean-shift tracking): An iterative localization procedure based on the maximization of a similarity measure (Bhattacharyya coefficient).

Contour tracking: Detection of object boundary (e.g. active contours or Condensation algorithm)

Visual feature matching: Registration
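The similarity measure named for kernel-based tracking can be sketched as follows. This is only the Bhattacharyya coefficient between two normalized histograms, not the full mean-shift iteration; the histogram values and candidate names are made up for illustration.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]


def bhattacharyya_coefficient(p, q):
    """Sum over bins of sqrt(p_i * q_i); equals 1 for identical distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Hypothetical colour histograms: a target model and two candidate regions.
target = normalize([4, 3, 2, 1])
candidate_a = normalize([4, 3, 2, 1])
candidate_b = normalize([1, 2, 3, 4])

score_a = bhattacharyya_coefficient(target, candidate_a)
score_b = bhattacharyya_coefficient(target, candidate_b)
best = "a" if score_a > score_b else "b"
```

A tracker built on this measure would move the candidate window in the direction that increases the coefficient; here the two candidates are simply compared.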

B. Some common Filtering and Data Association algorithms:

Kalman filter: An optimal recursive Bayesian filter for linear functions and Gaussian noise.

Particle filter: Useful for sampling the underlying state-space distribution of non-linear and non-Gaussian processes.

Also see: Match moving; Motion capture; SwisTrack

BLOB Detection – Approaches used:

• Corner detectors (Harris, Shi & Tomasi, SUSAN, Level Curve Curvature, FAST etc.)

• LoG, DoG, DoH (Determinant of Hessian), Ridge detectors, Scale-space, Pyramids

• Hessian affine, SIFT (Scale-invariant feature transform)

• SURF (Speeded Up Robust Features)

• GLOH (Gradient Location and Orientation Histogram)

• LESH (Local Energy based Shape Histogram).

Complexities and issues in tracking:

Need to overcome difficulties that arise from noise, occlusion, clutter, moving cameras, multiple moving objects, and changes in the foreground objects or in the background environment.

DoH: the scale-normalized determinant of the Hessian, also referred to as the Monge–Ampère operator,

$$\det H_{norm} L = t^2 \left( L_{xx} L_{yy} - L_{xy}^2 \right),$$

where $HL$ denotes the Hessian matrix of $L$. By detecting scale-space maxima/minima of this operator, one obtains another straightforward differential blob detector with automatic scale selection, which also responds to saddles.
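Automatic scale selection can be illustrated on an idealized blob. The sketch below assumes the image is a unit-mass 2-D Gaussian of variance `t0`, so that its scale-space representation at scale `t` is again a Gaussian of variance `t0 + t`; under that assumption the scale-normalized operator t^2 (Lxx Lyy - Lxy^2) at the blob centre has a closed form, and its maximum over `t` falls at `t = t0`. The blob model and the scale grid are assumptions made for this example.

```python
import math


def doh_at_blob_centre(t, t0):
    """Scale-normalized determinant of the Hessian at the centre of a Gaussian blob.

    Assumption: L(.; t) is a unit-mass 2-D Gaussian with variance s = t0 + t,
    so at the centre Lxx = Lyy = -1 / (2 * pi * s^2) and Lxy = 0.
    """
    s = t0 + t
    l_xx = -1.0 / (2.0 * math.pi * s * s)
    l_yy = l_xx
    l_xy = 0.0
    return t * t * (l_xx * l_yy - l_xy * l_xy)


blob_variance = 4.0
scales = [0.5 * k for k in range(1, 21)]  # 0.5, 1.0, ..., 10.0
best_scale = max(scales, key=lambda t: doh_at_blob_centre(t, blob_variance))
```

The selected scale matches the size of the blob, which is what lets the detector report a characteristic scale together with each detected position.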

Hessian Affine:

SURF is based on a set of 2-D Haar wavelets; it implements DoH.

SIFT: Four major steps:

1. Scale-space extrema detection (detect extrema at various scales)
2. Keypoint localization
3. Orientation assignment
4. Keypoint descriptor

Only three steps (1, 3 & 4) are shown below:

Histograms contain 8 bins each, and each descriptor contains a 4 x 4 array of histograms around the keypoint. This leads to a SIFT feature vector with 4 x 4 x 8 = 128 elements.

[Figure: example with a 2 x 2 array of 8-bin histograms (2 x 2 x 8)]

GLOH (Gradient Location and Orientation Histogram)

Gradient location-orientation histogram (GLOH) is an extension of the SIFT descriptor designed to increase its robustness and distinctiveness. The SIFT descriptor is computed for a log-polar location grid with 3 bins in the radial direction (the radius set to 6, 11 and 15) and 8 in the angular direction, which results in 17 location bins. Note that the central bin is not divided in angular directions. The gradient orientations are quantized in 16 bins. This gives a 272-bin histogram. The size of this descriptor is reduced with PCA. The covariance matrix for PCA is estimated on 47 000 image patches collected from various images (see section III-C.1). The 128 largest eigenvectors are used for description.
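The bin counts quoted for SIFT and GLOH can be checked with a few lines of arithmetic. The variable names are illustrative only.

```python
# SIFT: 4 x 4 array of histograms, 8 orientation bins each.
sift_length = 4 * 4 * 8

# GLOH: 3 radial bins, 8 angular bins, central radial bin left undivided.
radial_bins = 3
angular_bins = 8
orientation_bins = 16

location_bins = 1 + (radial_bins - 1) * angular_bins  # 1 centre + 2 rings of 8
gloh_raw_length = location_bins * orientation_bins    # before PCA
gloh_final_length = 128                               # after PCA, as stated above
```

Both descriptors therefore end at 128 elements, although GLOH starts from a histogram more than twice that size.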

Gradient location and orientation histogram (GLOH) is a new descriptor which extends SIFT by changing the location grid and using PCA to reduce the size.

Motion Equations:

Models derived from mechanics: Euler's or Newton's equations.

$$P' = R \cdot P + T,$$

where $P(x, y, z)$ is the point at $t_1$, $P'(x', y', z')$ is the same point at $t_2$, $R = [m_{ij}]_{3 \times 3}$ and $T = [\delta x \;\; \delta y \;\; \delta z]^T$, with:

$$m_{11} = n_1^2 + (1 - n_1^2)\cos\theta, \quad m_{12} = n_1 n_2 (1 - \cos\theta) - n_3 \sin\theta, \quad m_{13} = n_1 n_3 (1 - \cos\theta) + n_2 \sin\theta$$

$$m_{21} = n_1 n_2 (1 - \cos\theta) + n_3 \sin\theta, \quad m_{22} = n_2^2 + (1 - n_2^2)\cos\theta, \quad m_{23} = n_2 n_3 (1 - \cos\theta) - n_1 \sin\theta$$

$$m_{31} = n_1 n_3 (1 - \cos\theta) - n_2 \sin\theta, \quad m_{32} = n_2 n_3 (1 - \cos\theta) + n_1 \sin\theta, \quad m_{33} = n_3^2 + (1 - n_3^2)\cos\theta$$

where $n_1^2 + n_2^2 + n_3^2 = 1$.

Observation in the image plane, for the image points (X, Y) and (X', Y'):

$$U = X' - X; \quad V = Y' - Y.$$

$$X = Fx/z; \quad Y = Fy/z. \qquad X' = Fx'/z'; \quad Y' = Fy'/z'.$$

Mathematically (for any two successive frames), the problem is:

Input: Given (X, Y), (X’, Y’);

Output: Estimate n1, n2, n3, θ, δx, δy, δz

First look at the problem of estimating motion parameters using 3D knowledge only:

Given only three (3) non-linear equations, you have to obtain seven (7) parameters.
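The forward model P' = R.P + T and the image-plane observations (U, V) can be sketched directly from the m_ij expressions. The function names, the sample point, and the choice F = 1 are assumptions made for this illustration.

```python
import math


def rotation_matrix(n, theta):
    """R = [m_ij] for a unit axis n = (n1, n2, n3) and rotation angle theta."""
    n1, n2, n3 = n
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 - c
    return [
        [n1 * n1 + (1 - n1 * n1) * c, n1 * n2 * k - n3 * s, n1 * n3 * k + n2 * s],
        [n1 * n2 * k + n3 * s, n2 * n2 + (1 - n2 * n2) * c, n2 * n3 * k - n1 * s],
        [n1 * n3 * k - n2 * s, n2 * n3 * k + n1 * s, n3 * n3 + (1 - n3 * n3) * c],
    ]


def move_point(p, n, theta, t):
    """Apply P' = R.P + T to a 3-D point."""
    r = rotation_matrix(n, theta)
    return [sum(r[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def project(p, focal=1.0):
    """Perspective projection X = F x / z, Y = F y / z."""
    x, y, z = p
    return (focal * x / z, focal * y / z)


# Rotate by 90 degrees about the z axis, then translate along z.
p = [1.0, 0.0, 5.0]
p_new = move_point(p, n=(0.0, 0.0, 1.0), theta=math.pi / 2, t=(0.0, 0.0, 1.0))

(x_img, y_img), (x_img_new, y_img_new) = project(p), project(p_new)
u_obs = x_img_new - x_img  # U = X' - X
v_obs = y_img_new - y_img  # V = Y' - Y
```

The estimation problem stated above runs this model backwards: only the image coordinates are observed, and the axis, angle and translation are the unknowns.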

A few more constraints, and maybe assumptions too, are needed. Since ∆T is small, θ must also be small (in radians). Thus R simplifies (reduces) to:

$$R\big|_{\theta \to 0} = \begin{bmatrix} 1 & -n_3\theta & n_2\theta \\ n_3\theta & 1 & -n_1\theta \\ -n_2\theta & n_1\theta & 1 \end{bmatrix} = \begin{bmatrix} 1 & -\phi_3 & \phi_2 \\ \phi_3 & 1 & -\phi_1 \\ -\phi_2 & \phi_1 & 1 \end{bmatrix},$$

where $\phi_1^2 + \phi_2^2 + \phi_3^2 = \theta^2$.

Evaluate R.

Now (problem linearized), given three (3) linear equations, you have to obtain six (6) parameters. Solution?

Take two point correspondences:

$$P_1' = R \cdot P_1 + T; \quad P_2' = R \cdot P_2 + T$$

Subtracting one from the other eliminates the translation component:

$$(P_1' - P_2') = R \cdot (P_1 - P_2)$$

$$\begin{bmatrix} \Delta x_{12}' \\ \Delta y_{12}' \\ \Delta z_{12}' \end{bmatrix} = \begin{bmatrix} 1 & -\phi_3 & \phi_2 \\ \phi_3 & 1 & -\phi_1 \\ -\phi_2 & \phi_1 & 1 \end{bmatrix} \begin{bmatrix} \Delta x_{12} \\ \Delta y_{12} \\ \Delta z_{12} \end{bmatrix},$$

where $\Delta x_{12} = x_1 - x_2$, $\Delta x_{12}' = x_1' - x_2'$, and so on for y and z. Solve for:

$$\Phi = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix}$$

Re-arrange to form:

$$[\nabla_{12}] \, \Phi = \Delta^2_{12},$$

$$\nabla_{12} = \begin{bmatrix} 0 & \Delta z_{12} & -\Delta y_{12} \\ -\Delta z_{12} & 0 & \Delta x_{12} \\ \Delta y_{12} & -\Delta x_{12} & 0 \end{bmatrix} \quad \text{and} \quad \Delta^2_{12} = \begin{bmatrix} \Delta x_{12}' - \Delta x_{12} \\ \Delta y_{12}' - \Delta y_{12} \\ \Delta z_{12}' - \Delta z_{12} \end{bmatrix}$$
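The rearranged system can be checked numerically. The sketch below moves two points with the linearized rotation, builds the matrix ∇12 and the right-hand side from the coordinate differences, and confirms that ∇12 Φ reproduces the change in the difference vector. It also evaluates the determinant of ∇12, which shows why a single pair of points cannot be solved for Φ on its own. The sample values of Φ, T and the points are arbitrary assumptions.

```python
def linearized_rotation(phi):
    """Small-angle rotation matrix built from phi = (phi1, phi2, phi3)."""
    p1, p2, p3 = phi
    return [[1.0, -p3, p2], [p3, 1.0, -p1], [-p2, p1, 1.0]]


def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def move(p, phi, t):
    """P' = R.P + T with the linearized R."""
    rp = mat_vec(linearized_rotation(phi), p)
    return [rp[i] + t[i] for i in range(3)]


def nabla(d):
    """Skew-symmetric matrix built from the differences d = (dx, dy, dz)."""
    dx, dy, dz = d
    return [[0.0, dz, -dy], [-dz, 0.0, dx], [dy, -dx, 0.0]]


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


phi = [0.01, -0.02, 0.015]
t = [0.3, -0.1, 0.2]
p1, p2 = [1.0, 2.0, 5.0], [-1.0, 0.5, 4.0]
q1, q2 = move(p1, phi, t), move(p2, phi, t)

d = [a - b for a, b in zip(p1, p2)]      # (dx12, dy12, dz12)
d_new = [a - b for a, b in zip(q1, q2)]  # primed differences

lhs = mat_vec(nabla(d), phi)
rhs = [a - b for a, b in zip(d_new, d)]
residual = max(abs(a - b) for a, b in zip(lhs, rhs))
determinant = det3(nabla(d))
```

The translation `t` drops out of the differences, as the subtraction step intends, and the determinant is zero for any choice of the two points.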

$\nabla_{12}$ is a skew-symmetric matrix. Interpret why this is so. As a consequence, $\det(\nabla_{12}) = 0$. So what to do? Contact a mathematician?

Take two (one pair) more point correspondences:

$$[\nabla_{34}] \, \Phi = \Delta^2_{34},$$

$$\nabla_{34} = \begin{bmatrix} -\Delta z_{34} & 0 & \Delta x_{34} \\ \Delta y_{34} & -\Delta x_{34} & 0 \\ 0 & \Delta z_{34} & -\Delta y_{34} \end{bmatrix} \quad \text{and} \quad \Delta^2_{34} = \begin{bmatrix} \Delta y_{34}' - \Delta y_{34} \\ \Delta z_{34}' - \Delta z_{34} \\ \Delta x_{34}' - \Delta x_{34} \end{bmatrix}$$

Using two pairs (4 points) of correspondences, with

$$\nabla_{12} = \begin{bmatrix} 0 & \Delta z_{12} & -\Delta y_{12} \\ -\Delta z_{12} & 0 & \Delta x_{12} \\ \Delta y_{12} & -\Delta x_{12} & 0 \end{bmatrix} \quad \text{and} \quad \Delta^2_{12} = \begin{bmatrix} \Delta x_{12}' - \Delta x_{12} \\ \Delta y_{12}' - \Delta y_{12} \\ \Delta z_{12}' - \Delta z_{12} \end{bmatrix},$$

and adding:

$$[\nabla_{12} + \nabla_{34}] \, \Phi = [\Delta^2_{12} + \Delta^2_{34}] \;\Rightarrow\; [\nabla_{1234}] \, \Phi = [\Delta^2_{1234}],$$

$$\nabla_{1234} = \begin{bmatrix} -\Delta z_{34} & \Delta z_{12} & \Delta x_{34} - \Delta y_{12} \\ \Delta y_{34} - \Delta z_{12} & -\Delta x_{34} & \Delta x_{12} \\ \Delta y_{12} & \Delta z_{34} - \Delta x_{12} & -\Delta y_{34} \end{bmatrix}; \quad \Delta^2_{1234} = \begin{bmatrix} \Delta y_{34}' - \Delta y_{34} + \Delta x_{12}' - \Delta x_{12} \\ \Delta z_{34}' - \Delta z_{34} + \Delta y_{12}' - \Delta y_{12} \\ \Delta x_{34}' - \Delta x_{34} + \Delta z_{12}' - \Delta z_{12} \end{bmatrix}$$

The condition for existence of a unique solution is based on a geometrical relationship of the coordinates of the four points in space, at time t1.

This solution is often used as an initial guess for the final estimate of the motion parameters.

Find the geometric condition under which $\det(\nabla_{1234}) = 0$.

OPTICAL FLOW

A point in 3D space:

$$X_0 = [k x_o \;\; k y_o \;\; k z_o \;\; k]^T; \quad k \ne 0, \text{ an arbitrary constant.}$$

Image point:

$$X_i = P X_0, \quad \text{where } X_i = [w x_i \;\; w y_i \;\; w]^T$$

Assuming normalized focal length, f = 1:

$$\begin{bmatrix} x_i \\ y_i \end{bmatrix} = \begin{bmatrix} x_o / z_o \\ y_o / z_o \end{bmatrix} \quad ....(1)$$

Assuming linear motion model (no acceleration), between successive frames:

$$\begin{bmatrix} x_o(t) \\ y_o(t) \\ z_o(t) \end{bmatrix} = \begin{bmatrix} x_o + ut \\ y_o + vt \\ z_o + wt \end{bmatrix} \quad ....(2)$$

Combining equations (1) and (2):

$$\begin{bmatrix} x_i(t) \\ y_i(t) \end{bmatrix} = \begin{bmatrix} \dfrac{x_o + ut}{z_o + wt} \\[2mm] \dfrac{y_o + vt}{z_o + wt} \end{bmatrix} \quad ....(3)$$

Assume in equation (3) that w (= dz/dt) < 0. In that case, the object (or points on the object) will appear to come closer to you. Now visualize: where do these points appear to come from?

$$\lim_{t \to -\infty} \begin{bmatrix} x_i(t) \\ y_i(t) \end{bmatrix} = \begin{bmatrix} u/w \\ v/w \end{bmatrix} = e \quad ....(4)$$

This point e is a point in the image plane, known as the FOE (Focus of Expansion). The motion of the object points appears to emanate from this fixed point in the image plane.

Approaches to calculate FOE are based on the exploitation of the fact that for constant velocity object motion all image plane flow vectors intersect at the FOE.

Plot the vectors and extrapolate them to obtain the FOE.
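Extrapolating flow vectors to a common point can be sketched with two synthetic object points that share one constant velocity. Each point is projected at two time instants, the two resulting image-plane lines are intersected, and the intersection is compared with (u/w, v/w). The sample coordinates, the velocity, and the helper names are assumptions; with real, noisy flow one would intersect many lines in a least-squares sense instead.

```python
def project(p):
    """Perspective projection with normalized focal length f = 1."""
    x, y, z = p
    return (x / z, y / z)


def position(p0, velocity, t):
    """Constant-velocity motion: p(t) = p0 + velocity * t."""
    return tuple(c + v * t for c, v in zip(p0, velocity))


def intersect_lines(a0, a1, b0, b1):
    """Intersection of the 2-D line through a0, a1 with the line through b0, b1."""
    dax, day = a1[0] - a0[0], a1[1] - a0[1]
    dbx, dby = b1[0] - b0[0], b1[1] - b0[1]
    denom = dax * dby - day * dbx
    if abs(denom) < 1e-12:
        raise ValueError("flow vectors are parallel; no unique intersection")
    s = ((b0[0] - a0[0]) * dby - (b0[1] - a0[1]) * dbx) / denom
    return (a0[0] + s * dax, a0[1] + s * day)


velocity = (1.0, 0.5, -2.0)  # (u, v, w) with w < 0: approaching the camera
point_a = (2.0, 1.0, 20.0)
point_b = (-3.0, 4.0, 30.0)

flow_a = [project(position(point_a, velocity, t)) for t in (0.0, 1.0)]
flow_b = [project(position(point_b, velocity, t)) for t in (0.0, 1.0)]

foe_estimate = intersect_lines(flow_a[0], flow_a[1], flow_b[0], flow_b[1])
foe_expected = (velocity[0] / velocity[2], velocity[1] / velocity[2])
```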

[Figure: flow vectors in the image plane extrapolated back to the FOE]

FOE may not exist for all types of motion, say pure ROTATION, as shown below.

Multiple FOE’s may exist for multiple object motion and occlusion.

Depth from Motion

Time varying distance, D(t), in the image plane, is the distance of an image point from the FOE:

$$D(t) = \| X_i - e \| = \sqrt{\left[ x_i(t) - \frac{u}{w} \right]^2 + \left[ y_i(t) - \frac{v}{w} \right]^2}, \qquad \lim_{t \to -\infty} D(t) = 0$$

Rate of change of distance D(t) is:

$$V(t) = \frac{d[D(t)]}{dt}$$

Derive this to obtain (home assignment):

$$V(t) = \frac{d[D(t)]}{dt} = -\frac{w \, D(t)}{z_o(t)}$$

This helps to define the TIME-TO-ADJACENCY equation:

$$T_A = \frac{D(t)}{V(t)} = -\frac{z(t)}{w}$$

Assuming z is +ve and w is –ve. D(t) is different for different pixels. This equation holds for each corresponding object and image point pair.

Consider two object points, z1(t) and z2(t). Then:

$$\frac{D_1(t)}{V_1(t)} = -\frac{z_1(t)}{w}; \quad \frac{D_2(t)}{V_2(t)} = -\frac{z_2(t)}{w}; \quad \Rightarrow \quad z_2(t) = z_1(t) \left[ \frac{D_2(t)}{D_1(t)} \right] \left[ \frac{V_1(t)}{V_2(t)} \right]$$

Di(t) and Vi(t) values for any object point can be obtained from the image plane, once e (FOE) is obtained.

Hence it is possible to determine the relative 3-D depths, $Z_2(t) / Z_1(t)$, of any two object points solely from the image plane motion data.
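A small numerical check of the time-to-adjacency and relative-depth relations, under stated assumptions: two synthetic points share one constant velocity, the FOE is taken as known (here computed from that velocity for brevity, whereas in practice it would be estimated from the flow), and V(t) is approximated by a central finite difference. All sample values and names are illustrative.

```python
import math

VELOCITY = (1.0, 0.5, -2.0)  # (u, v, w), with w < 0
FOE = (VELOCITY[0] / VELOCITY[2], VELOCITY[1] / VELOCITY[2])


def image_point(p0, t):
    """Project the point p0 + VELOCITY * t with focal length f = 1."""
    x, y, z = (c + v * t for c, v in zip(p0, VELOCITY))
    return (x / z, y / z)


def distance_from_foe(p0, t):
    """D(t): image-plane distance between the projected point and the FOE."""
    xi, yi = image_point(p0, t)
    return math.hypot(xi - FOE[0], yi - FOE[1])


def rate_of_change(p0, t, h=1e-4):
    """V(t) = dD/dt, approximated by a central difference."""
    return (distance_from_foe(p0, t + h) - distance_from_foe(p0, t - h)) / (2 * h)


p1 = (2.0, 1.0, 20.0)   # depth z1 = 20 at t = 0
p2 = (-3.0, 4.0, 30.0)  # depth z2 = 30 at t = 0

d1, v1 = distance_from_foe(p1, 0.0), rate_of_change(p1, 0.0)
d2, v2 = distance_from_foe(p2, 0.0), rate_of_change(p2, 0.0)

time_to_adjacency = d1 / v1          # compare with -z1 / w = 10
depth_ratio = (d2 / d1) * (v1 / v2)  # compare with z2 / z1 = 1.5
```

Only image-plane quantities enter `depth_ratio`, so the absolute depths stay unknown while their ratio is recovered.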

This is the key idea of the Structure from Motion (SFM) problem: you are able to extract the shape (structure) information of the object in motion, up to a certain scale factor, from a single perspective view only.

Another important idea of optical flow is based on Horn's (1980) equations. A global energy function is sought to be minimized; this function is given as:
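In the notation commonly used for the Horn and Schunck method, the energy is written as follows. The symbols are assumptions of this note rather than the slide's own: $I_x, I_y, I_t$ are the image derivatives, $(u, v)$ is the flow field, and $\alpha$ weights the smoothness term.

```latex
E(u, v) = \iint \left[ \left( I_x u + I_y v + I_t \right)^2
        + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] dx \, dy
```

The first term penalizes violations of brightness constancy; the second favours a smoothly varying flow field.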

KLT tracker:

or,

• Lucas, B.D. and Kanade, T., "An iterative image registration technique with an application to stereo vision." Proceedings of Imaging Understanding Workshop, pp 121-130, 1981.

• Horn, B.K.P. and Schunck, B.G., "Determining optical flow." Artificial Intelligence, vol 17, pp 185-203, 1981.

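Concepts from above are left for self-study.

Motion Analysis using rigid body assumption

Rigid Body Assumption:

$$\| x_i - x_j \|^2 = c_{ij}, \quad \forall t, \; \forall (i, j), \quad \text{where } c_{ij} \text{ are constants.}$$

Motion Equation:

$$X(t_2) = M \cdot X(t_1), \quad \text{where } M = \begin{bmatrix} m_{11} & m_{12} & m_{13} & l_1 \\ m_{21} & m_{22} & m_{23} & l_2 \\ m_{31} & m_{32} & m_{33} & l_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad m_{ij} = f(n_1, n_2, n_3, \theta),$$

$$X(t_i) = [x(t_i) \;\; y(t_i) \;\; z(t_i) \;\; 1]^T$$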

Form matrix A(ti), using four points as:

$$A(t_i) = [X_1(t_i) \;\; X_2(t_i) \;\; X_3(t_i) \;\; X_4(t_i)]$$

Thus obtain the matrix M, using:

$$M = A(t_j) \cdot A(t_i)^{-1}$$

Points must be selected in such a fashion that A(ti) is a non-singular matrix.
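Recovering M from four tracked points can be sketched with a small Gauss-Jordan inverse. The motion used to generate the data (a 90 degree rotation about z plus a translation), the four sample points, and all function names are assumptions for illustration; the points are chosen so that A(t_i) is invertible.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting; raises if the matrix is singular."""
    n = len(matrix)
    aug = [list(row) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("A(t_i) is singular; choose different points")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def stack_columns(points):
    """Place homogeneous points [x, y, z, 1] as the columns of a 4 x 4 matrix."""
    return [[p[r] for p in points] for r in range(4)]


# Ground-truth motion used only to synthesize the second frame.
M_TRUE = [[0.0, -1.0, 0.0, 1.0],
          [1.0, 0.0, 0.0, 2.0],
          [0.0, 0.0, 1.0, 3.0],
          [0.0, 0.0, 0.0, 1.0]]

points_ti = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0],
             [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]

a_ti = stack_columns(points_ti)
a_tj = matmul(M_TRUE, a_ti)             # the same four points after the motion
m_est = matmul(a_tj, invert(a_ti))      # M = A(t_j) . A(t_i)^-1

max_error = max(abs(m_est[i][j] - M_TRUE[i][j])
                for i in range(4) for j in range(4))
```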

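Can you guess when A(ti) will be singular?

After the $m_{ij}$'s are obtained:

$$\cos\theta = \frac{m_{11} + m_{22} + m_{33} - 1}{2}; \quad \sin\theta = \frac{m_{32} - m_{23}}{2 n_1}.$$

$$n_1^2 = \frac{m_{11} - \cos\theta}{1 - \cos\theta}; \quad n_2 = \frac{m_{21} + m_{12}}{2 n_1 (1 - \cos\theta)}; \quad n_3 = \frac{m_{31} + m_{13}}{2 n_1 (1 - \cos\theta)}.$$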
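The recovery of the axis and angle from the m_ij can be sketched as a round trip: build R from a known axis and angle, then apply the trace and off-diagonal relations. The sketch assumes θ ≠ 0 and n1 ≠ 0, and resolves the sign ambiguity between (n, θ) and (-n, -θ) by taking the positive root for n1; these choices and the function names are assumptions of the example.

```python
import math


def rotation_matrix(n, theta):
    """R = [m_ij] for a unit axis n = (n1, n2, n3) and rotation angle theta."""
    n1, n2, n3 = n
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 - c
    return [
        [n1 * n1 + (1 - n1 * n1) * c, n1 * n2 * k - n3 * s, n1 * n3 * k + n2 * s],
        [n1 * n2 * k + n3 * s, n2 * n2 + (1 - n2 * n2) * c, n2 * n3 * k - n1 * s],
        [n1 * n3 * k - n2 * s, n2 * n3 * k + n1 * s, n3 * n3 + (1 - n3 * n3) * c],
    ]


def axis_angle_from_matrix(m):
    """Invert the m_ij relations; assumes theta != 0 and n1 > 0."""
    cos_t = (m[0][0] + m[1][1] + m[2][2] - 1.0) / 2.0
    n1 = math.sqrt((m[0][0] - cos_t) / (1.0 - cos_t))  # positive root chosen
    sin_t = (m[2][1] - m[1][2]) / (2.0 * n1)
    n2 = (m[1][0] + m[0][1]) / (2.0 * n1 * (1.0 - cos_t))
    n3 = (m[2][0] + m[0][2]) / (2.0 * n1 * (1.0 - cos_t))
    return (n1, n2, n3), math.atan2(sin_t, cos_t)


axis_true = (2.0 / 3.0, 1.0 / 3.0, 2.0 / 3.0)  # unit length
theta_true = 0.3

axis_est, theta_est = axis_angle_from_matrix(rotation_matrix(axis_true, theta_true))
axis_error = max(abs(a - b) for a, b in zip(axis_est, axis_true))
```

The division by (1 - cos θ) becomes ill-conditioned as θ approaches zero, which is one reason these closed-form expressions are sensitive to noise.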

This is fine in an ideal case. In noisy situations (or even with numerical errors):

$$A(t_i) = M \cdot A(t_j) + N_{ij}$$

Need to formulate an optimization problem, minimizing a cost function so that the following equations are satisfied in the least-squares sense:

$$X_k(t_i) = M \cdot X_k(t_j), \quad k = 1, 2, ..., K$$

For example, minimize:

$$\sum_{k=1}^{K} \left[ X_k(t_i) - M \cdot X_k(t_j) \right]^2$$

along with the rigidity constraint.

Use the linearized solution as your initial estimate. Heard about SA or GA?

Work out the following (2nd method):

$$X_k(t_i) = R \cdot X_k(t_j), \quad k = 1, 2, ..., K,$$

where

$$R = \begin{bmatrix} R_p & T \\ 0 & 1 \end{bmatrix}, \quad R_p = R_\alpha R_\beta R_\lambda$$

Again, we have 12 unknown elements in R, which are functions of six unknown parameters.

First obtain the 12 unknown elements and then get the six parameters.

Motion Analysis using image plane coordinates of the features of the moving object

$$P_k(t_i) = R \cdot P_k(t_j), \quad k = 1, 2, ..., K, \quad R = \begin{bmatrix} R_p & T \\ 0 & 1 \end{bmatrix}, \quad P_k = [x_k \;\; y_k \;\; z_k \;\; 1]^T$$

Thus:

$$x_{k2} = r_{11} x_{k1} + r_{12} y_{k1} + r_{13} z_{k1} + t_x;$$

$$y_{k2} = r_{21} x_{k1} + r_{22} y_{k1} + r_{23} z_{k1} + t_y;$$

$$z_{k2} = r_{31} x_{k1} + r_{32} y_{k1} + r_{33} z_{k1} + t_z;$$

Projective Equations:

$$X_k(t) = \frac{x_k}{z_k}; \quad Y_k(t) = \frac{y_k}{z_k}.$$

Substituting:

$$x_{k2} = (r_{11} X_{k1} + r_{12} Y_{k1} + r_{13}) z_{k1} + t_x;$$

$$y_{k2} = (r_{21} X_{k1} + r_{22} Y_{k1} + r_{23}) z_{k1} + t_y;$$

$$z_{k2} = (r_{31} X_{k1} + r_{32} Y_{k1} + r_{33}) z_{k1} + t_z;$$

$$X_{k2} = \frac{x_{k2}}{z_{k2}} = \frac{(r_{11} X_{k1} + r_{12} Y_{k1} + r_{13}) z_{k1} + t_x}{(r_{31} X_{k1} + r_{32} Y_{k1} + r_{33}) z_{k1} + t_z};$$

$$Y_{k2} = \frac{y_{k2}}{z_{k2}} = \frac{(r_{21} X_{k1} + r_{22} Y_{k1} + r_{23}) z_{k1} + t_y}{(r_{31} X_{k1} + r_{32} Y_{k1} + r_{33}) z_{k1} + t_z};$$

Solve for $z_{k1}$ from the above two equations to obtain:

$$[X_{k2} \;\; Y_{k2} \;\; 1] \; E \; \begin{bmatrix} X_{k1} \\ Y_{k1} \\ 1 \end{bmatrix} = 0;$$

After manipulation obtain:

$$c^T e = -1,$$

where:

$$c = [X_{k2} X_{k1} \;\; X_{k2} Y_{k1} \;\; X_{k2} \;\; Y_{k2} X_{k1} \;\; Y_{k2} Y_{k1} \;\; Y_{k2} \;\; X_{k1} \;\; Y_{k1}]^T$$

e is a vector (matrix) of 8 essential parameters, called the essential matrix. Solve to get e first.
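The relation c^T e = -1 can be checked on synthetic data. The sketch assumes the standard construction E = [T]x R (the slide does not spell E out), takes e as the first eight row-major entries of E divided by the ninth, and uses an arbitrary rotation about the y axis and an arbitrary translation chosen so that the ninth entry is non-zero. All names and sample values are illustrative.

```python
import math


def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def cross_matrix(t):
    """Skew-symmetric matrix [T]x such that [T]x p = T x p."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def constraint_row(img1, img2):
    """The vector c built from one correspondence (X1, Y1) <-> (X2, Y2)."""
    (x1, y1), (x2, y2) = img1, img2
    return [x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2, x1, y1]


angle = 0.1
rotation = [[math.cos(angle), 0.0, math.sin(angle)],
            [0.0, 1.0, 0.0],
            [-math.sin(angle), 0.0, math.cos(angle)]]
translation = [0.3, 0.2, 0.1]

# Assumed form of the essential matrix, normalized by its last entry.
essential = matmul3(cross_matrix(translation), rotation)
flat = [v for row in essential for v in row]
e = [v / flat[8] for v in flat[:8]]

point = [1.0, -0.5, 6.0]
moved = [a + b for a, b in zip(mat_vec(rotation, point), translation)]
img1 = (point[0] / point[2], point[1] / point[2])
img2 = (moved[0] / moved[2], moved[1] / moved[2])

c = constraint_row(img1, img2)
value = sum(ci * ei for ci, ei in zip(c, e))  # expected to be close to -1
```

In the estimation setting the direction is reversed: eight such rows, one per correspondence, form a linear system whose solution is e.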

Hence we require 8 point correspondences in this case to obtain (i) the essential parameters first and then (ii) the motion parameters.

DYNAMIC STEREO

[Figure: left/right stereo frame pairs over time: (F1L, F1R), (F2L, F2R), (F3L, F3R), ..., (FNL, FNR)]

Environment similar to the Human Vision System (minus brain-power)

Motion equations for dynamic stereo:

Let $A_l$ and $A_r$ be the perspective transformation matrices for the pair of stereo cameras, and let R be the composite motion matrix with 12 unknown elements.

$$I_i^l(t_2) = A_l R X_i(t_1); \quad I_i^r(t_2) = A_r R X_i(t_1); \quad i = 1, 2, ..., N$$

Each point provides: eight (8) equations

Thus N points provide: 8N equations (linear).

Number of unknowns: 12 + 3N

To provide a solution: N > 3

Solve to obtain the optimal solution.

That’s all for now –

Let’s MOVE ON