Image Sequence (or Motion) Analysis

also referred as:

Dynamic Scene Analysis Input: A Sequence of Images (or frames)

Output: Motion parameters (3-D) of the objects/s in motion

and/or

3-D Structure information of the moving objects

Assumption: The time interval ΔT between two consecutive or successive pair of frames is small. Typical steps in Dynamic Scene Analysis

• Image filtering and enhancement

• Tracking

• Feature detection

• Establish feature correspondence

• Motion parameter estimation (local and global)

• Structure estimation (optional)

• Predict occurrence in next frame TRACKING in a Dynamic Scene F1

Since ΔT is small,

the amount of motion/displacement of objects, F2 between two successive pair of frames is also small.

F3

At 25 frames per sec (fps): ΔT = 40 msec.

FN At 50 frames per sec (fps): ΔT = 20 msec. Image Motion Image changes by difference equation: fd (x1, x2 ,ti ,t j ) = − f (x1, x2 ,ti ) f (x1, x2 ,t j ) = − = − f (ti ) f (ti ) fi f j

Accumulated difference image: = − ≥ fT (X ,tn ) fd (X ,tn−1,tn ) fT (X ,tn−1);n 3, = where, fT (X ,t2 ) fd (X ,t2 ,t1)

Moving Edge (or feature) detector: ∂f F (X ,t ,t ) = . f (X ,t ,t ) mov _ feat 1 2 ∂X d 1 2

Recent methods include background and foreground modeling. F 1 F2

F1 -F2 abs(F1 -F2)

Input Video Frame

Segmented Video frame Extracted moving object using Alpha matte

Two categories of Visual Tracking Algorithms: Target Representation and Localization; Filtering and Data Association.

A. Some common Target Representation and Localization algorithms: Blob tracking: Segmentation of object interior (for example , block- based correlation or optical flow-KLT) Kernel-based tracking (Mean-shift tracking): An iterative localization procedure based on the maximization of a similarity measure (Bhattacharyya coefficient). Contour tracking: Detection of object boundary (e.g. active contours or Condensation algorithm) Visual feature matching: Registration; RANSAC, DLT

B. Some common Filtering and Data Association algorithms: Kalman filter: An optimal recursive Bayesian filter for linear functions and Gaussian noise (FTLE). Particle filter: Useful for sampling the underlying state-space distribution of non-linear and non-Gaussian processes. Also see: Match moving; Motion capture; Swistrack, BOVW, Steak flow, tracklets, spatio-temporal gradients, LCS, LTDS, MRF, LDA, RFT, LCSS, MDA, DFM, Dynamic textures, BOAW, HFST, SRC based MHOF, LBPTOPS, HOP; also Manifold and Deep learning.

BLOB Detection – Approaches used:

• Corner detectors (Harris, Shi & Tomashi, Susan, Level Curve Curvature, Fast etc.)

• Ridge detection, Scale-space Pyramids

• LOG, DOG, DOH (Det. Of Hessian)

• Hessian affine, SIFT (Scale-invariant feature transform)

• SURF (Speeded Up Robust Features)

• GLOH (Gradient Location and Orientation Histogram)

• LESH (Local Energy based Shape Histogram). Complexities and issues in tracking:

Need to overcome difficulties that arise from noise, occlusion, clutter, moving cameras, multiple moving objects and changes in the foreground objects or in the background environment. DOH - the scale-normalized determinant of the Hessian, also referred to as the Monge–Ampère operator,

where, HL denotes the Hessian matrix of L and then detecting scale-space maxima/minima of this operator one obtains another straightforward differential blob detector with automatic scale selection which also responds to saddles.

Hessian Affine:

SURF is based on a set of 2-D HAAR wavelets; implements DOH SIFT: Four major steps: Detect extremas 1. Scale-space extrema detection at various scales: 2. Keypoint localization 3. Orientation assignment 4. Keypoint descriptor Only three steps (1, 3 & 4) are shown below:

Histograms contain 8 bins each, and each descriptor contains an array of 4 histograms around the keypoint. This leads to a SIFT feature vector with (4 x 4 x 8 = 128 elements). e.g. 2*2*8: GLOH (Gradient Location and Orientation Histogram) Gradient location-orientation histogram (GLOH) is an extension of the SIFT descriptor designed to increase its robustness and distinctiveness. The SIFT descriptor is computed for a log-polar location grid with 3 bins in radial direction (the radius set to 6, 11 and 15) and 8 in angular direction, which results 17 location bins. Note that the central bin is not divided in angular directions. The gradient orientations are quantized in 16 bins. This gives a 272 bin histogram. The size of this descriptor is reduced with PCA. The covariance matrix for PCA is estimated on 47000 image patches collected from various images. The 128 largest eigenvectors are used for description.

Gradient location and orientation histogram (GLOH) is a new descriptor which extends SIFT by changing the location grid and using PCA to reduce the size. Motion Equations: o x Models derived from mechanics. Euler’s or Newton’s equations: θ P’ = R.P + T, where: (X, Y) θ (X’, Y’) m = n2 + (1− n2 )cos y 11 θ1 1θ m = n n (1− cos ) − n sin 12 1 2 θ 3 m = n n (1− cos ) + n sin 13 1 3 2 z P(x, y, z) θ P’(x’, y’, z’) at t1 = at t R [mij ]3*3 2 m = n n (1− cos )θ+ n sin 21 1 2 θ 3 T = []∂x ∂y ∂z T = 2 + − 2 m22 n2 (1θ n2 )cos = − − 2 + 2 + 2 = m23 n2n3 (1 cos ) nθ1 sin where: n1 n2 n3 1

θ Observation in the image plane: m = n n (1− cos ) − n sin U = X’ – X; V = Y’ – Y. 31 1 3 θ 2 θ = − + X = Fx / z;Y = Fy / z. m32 n2n3 (1 cos ) n1 sin θ = = = 2 + − 2 θ X ' Fx'/ z';Y Fy'/ z'. m33 n3 (1 n3 )cos Mathematically (for any two successive frames), the problem is:

Input: Given (X, Y), (X’, Y’); θ δ δ δ Output: Estimate n1, n2, n3, , x, y, z First look at the problem of estimating motion parameters using 3D knowledge only: Given only three (3) non-linear equations, you have to obtain seven (7) parameters.

Need a few more constraints and may be assumptions too. Since ΔT is small, θ must also be small enough (in radians). Thus R simplifiesθ (reduces) to: θ − −  1 n3 θ n2   1 3 2  R =  nθ 1 − n  =  1 −  θ →0  3 θ θ 1   3 φ 1  φ− φ − φ  n2φ n1 1   2 1 1  φ where, 2 + 2 + 2 = θ 2 φ φ 1 2 3 Now (problemφ linearized), given three (3) linear equations, Evaluate R . you have to obtain six (6) parameters. - Solution ? Take two point correspondences: ' = + ' = + P1 R.P1 T; P2 R.P2 T; Subtracting one from the other, gives: ' − ' = − (eliminates the translation component) (P1 P2 ) R.(P1 P2 ) ' φ Δx   1 − Δx  φ 12 φ 3 φ 2 12    '     1 Δy = 1 − Δy , Solve  12   φ3 φ 1  12  Φ =φ ; Δ '  − φ Δ for:  2  z12  2 1 1  z12    φ  Δ = −  3  where, x12 x1 x2 , Δ − Δ Δ ' = ' − '  0 z12 y12  x12 x1 x2; ∇ = − Δz 0 Δx  and so on .... for y and z. 12  12 12   Δy − Δx 0  φ  12 12  Re-arrange to form:  1  Δ ' − Δ  x12 x12  []∇ φ  = Δ2   12 2 12 Δ2 = Δ ' − Δ   and 12  y12 y12  φ   Δ ' − Δ   3   z12 z12  φ  1  []∇ φ  = Δ2 12 2 12 Δ − Δ Δ ' − Δ     0 z12 y12  x12 x12   φ ∇ = − Δz 0 Δx  and Δ2 = Δy' − Δy  3  12  12 12  12  12 12   Δ − Δ  Δ ' − Δ   y12 x12 0   z12 z12  ∇ 12 is a skew - symmetric matrix. ∇ = 0 Interprete, why is it so ? 12 So what to do ? φ Contact a Mathematician?  1  Take two (one pair) more point correspondences: []∇ φ  = Δ2 34  2  34 φ  3  − Δ Δ Δ ' − Δ  z34 0 x34   y34 y34    ∇ =  Δy − Δx 0  and Δ2 = Δz ' − Δz 34  34 34  34  34 34   Δ − Δ  Δ ' − Δ   0 z34 y34   x34 x34  φ φ Using two pairs (4 points)  1   1  of correspondences: []∇ φ  = Δ2 []∇ φ  = Δ2 12  2  12 34  2  34 Adding: φ φ  φ   1   3   3  []∇ + ∇ φ  = [Δ2 + Δ2 ]  ∇ Φ = [ [][]Δ2 ] 12 34  2  12 34 1234 1234 φ  − Δ Δ  3   z34 0 x34  0 Δz − Δy    12 12  ∇ = Δy − Δx 0 34  34 34  ∇ = − Δz 0 Δx  12  12 12   0 Δz − Δy  Δ − Δ  34 34   y12 x12 0  Δy' − Δy  Δ ' − Δ  34 34 x12 x12 Δ2 = Δ ' − Δ    and 34 z34 z34 and Δ2 = Δy' − Δy   12  12 12  Δx' − Δx  Δ ' − Δ   34 34   z12 z12  −Δ Δ Δ −Δ Δ ' −Δ +Δ ' −Δ  z34 z12 x34 y12  y34 y34 x12 x12   ∇ = Δy −Δz −Δx Δx  ; Δ2 = Δz' −Δz +Δy' −Δy 1234  34 12 34 12  1234  34 34 12 12   Δ −Δ −Δ −Δ  Δ ' −Δ +Δ ' −Δ   y12 x12 z34 y34   x34 x34 z12 z12  φ  1  []∇ + ∇ φ  = [Δ2 + Δ2 ]  ∇ Φ = [ [][]Δ2 ] 12 34  2  12 34 1234 1234 φ  3  Condition for existence of a unique solution is based on a geometrical relationship of the coordinates of four points in space, at time t : 1 −Δ Δ Δ −Δ  z34 z12 x34 y12 ∇ = Δy −Δz −Δx Δx  1234  34 12 34 12  Δ −Δ −Δ −Δ  y12 x12 z34 y34  Δ ' −Δ +Δ ' −Δ  y34 y34 x12 x12 Δ2 = Δ ' −Δ +Δ ' −Δ  and 1234  z34 z34 y12 y12  Δ ' −Δ +Δ ' −Δ   x34 x34 z12 z12 This solution is often used as an initial guess for the final estimate of the motion parameters. Find geometric condition, when: ∇ = 1234 0 OPTICAL FLOW A point in 3D space: = []≠ X 0 kxo kyo kzo k ;k 0, an arbitary constant. Image point: = []where, = X i wxi wyi w X i PX 0 xo  x  i =  zo  Assuming normalized focal length, f = 1:    ....(1) y yo  i     zo  Assuming linear motion model (no acceleration), between successive frames: + xo (t)  xo ut  Combining equations (1) and (2):   =  +  yo (t) yo vt ....(2) +     (xo ut)  x (t) + z (t) z + wt i =  (zo wt)  o   o     + ....(3) y (t) (yo vt)  i   +   (zo wt) + (xo ut)  xi (t)  (z + wt) Assume in equation (3), = o ....(3) that w (=dz/dt) < 0.    +  y (t) (yo vt)  i   +   (zo wt) In that case, the object (or points on the object) will appear to come closer to you. Now visualize, where does these points may come out from? u  This point “e” is a point in xi (t) w the image plane, known as Lt =   = e.....(4) the: t→−∞  v yi (t)    w FOE (Focus of Expansion). The motion of the object points appears to emanate from a fixed point in the image plane. This point is known as FOE.

Approaches to calculate FOE are based on the exploitation of the fact that for constant velocity object motion all image plane flow vectors intersect at the FOE.

Plot the vectors and extrapolate them to obtain FOE. FOE

Image Plane FOE may not exist for all types of motion – say pure ROTATION, as shown below.

Multiple FOE’s may exist for multiple object motion and occlusion.

Depth from Motion

Time varying distance, D(t), in the image plane , is the distance of an image point from the FOE: D(t) = X − e = [x (t) − u ]2 +[y (t) − v ]2 i i w i w Lt D(t) = 0 t→−∞ d[D(t)] Rate of change of distance D(t) is: V (t) = dt Derive this to obtain = d[D(t)] = − w.D(t) (home assignment): V (t) dt zo (t)

This helps to define, TIME-TO-ADJACENCY equation: D(t) z(t) T = = − A V (t) w Assuming z is +ve and w is D(t) z(t) –ve. D(t) is different for different T = = − pixels. A This equation holds for each V (t) w corresponding object and image point pair.

Consider two object points, z1(t) and z2(t). Then:    D1(t) = − z1(t) D2 (t) = − z2 (t) = D2 (t) V1(t) ; ; z2 (t) z1(t)   V1(t) w V2 (t) w  D1(t) V2 (t)

Di(t) and Vi(t) values for any object point can be obtained from the image plane, once e (FOE) is obtained.

Hence it is possible to determine the Z2 (t) relative 3-D depths : of any two object points, Z (t) solely from the image plane motion data. 1

This is the key idea of Structure from motion (SFM) problem, and you are able to extract the shape (structure) information of the object in motion up to a certain scale factor, from a single perspective view only. Another important idea of optical flow is based on the Horn’s (Horn-Schunk, 1980) equations. A global energy function is sought to be minimized, whose functional form is given as:

KLT tracker:

or,

• Lucas B D and Kanade T, 1981, An iterative image registration technique with an application to stereo vision. Proceedings of Imaging understanding workshop, pp 121—130.

• Horn, B.K.P. and Schunck, B.G., "Determining optical flow." Artificial Intelligence, vol 17, pp 185-203, 1981. Concepts from above are left for self-study …….. KLT- difference between and https://www.ces.clemson.edu/~stb/klt/ Motion Analysis using rigid body assumption Rigid Body Assumption: − 2 = ∀ ∀ xi x j cij , t, (i, j), where cij are constants.

Motion Equation: m11 m12 m13 l1  m m m l  = =  21 22 23 2  = θ X (t2 ) M.X (t1), where, M ,mij f (n1,n2 ,n3 , ). m31 m32 m33 l3     0 0 0 1 = []T X (ti ) x(ti ) y(ti ) z(ti ) 1

Form matrix A(ti), using four points as: = [] A(ti ) X1(ti ) X 2 (ti ) X 3 (ti ) X 4 (ti ) −1 Thus obtain the matrix M, using: = M A(t j ).A(ti )

Points must be selected in such a fashion that A(ti) is a non-singular matrix.

Can you guess, when A(ti) will be singular ? θ θ θ After m ’s are ij + + − − obtained: = m11 m22 m33 1 θ = m32 m23 cos( ) θ ; sin( ) . 2 2n1 m −cos( ) m + m m + m n = 11 ;n = 21 12 ;n = 31 13 . 1 − 2 − 3 − θ 1 cos( ) 2n1(1 cos( )) 2n1(1 cos( ))

This is fine in an ideal case. In noisy situations (or even with numerical errors): = + A(ti ) M.A(t j ) Nij Need to formulate an optimization function to minimize a cost function, to satisfy equations in the least square sense: = = X k (ti ) M.X k (t j ), k 1,2,..., K For example, minimize: K − 2 along with Rigidity constraint. [X k (ti ) M.X k (t j )] k =1 Heard about Use linearized solution as your ? initial estimate. SA or GA Work out the following (2nd method): = = X k (ti ) R.X k (t j ), k 1,2,..., K where, R T  α = p β R   R = R R Rλ  0 1 p

Again, we have 12 unknown elements in R, which are functions of six unknown parameters.

First obtain the 12 unknown elements and then get the six parameters. Motion Analysis

using

Image plane coordinates of the features of the moving object. P (t ) = R.P (t ), k =1,2,..., K R T k i k j =  p  R   P = []x y z 1  0 1 k k k k = + + + xk 2 r11xk1 r12 yk1 r13 zk1 tx ; = + + + Thus: yk 2 r21xk1 r22 yk1 r23 zk1 t y ; = + + + zk 2 r31xk1 r32 yk1 r33 zk1 tz ; Projective Equations: x = + + + X (t) = k ; xk 2 (r11 X k1 r12Yk1 r13 )zk1 tx ; k z k y = (r X + r Y + r )z + t ; y k 2 21 k1 22 k1 23 k1 y Y (t) = k . k z = + + + k zk 2 (r31 X k1 r32Yk1 r33 )zk1 tz ; x (r X + r Y + r )z + t X = k 2 = 11 k1 12 k1 13 k1 x ; k 2 + + + zk 2 (r31 X k1 r32Yk1 r33 )zk1 tz y (r X + r Y + r )z + t Y = k 2 = 21 k1 22 k1 23 k1 y ; k 2 + + + zk 2 (r31 X k1 r32Yk1 r33 )zk1 tz x (r X + r Y + r )z + t X = k 2 = 11 k1 12 k1 13 k1 x ; k 2 + + + zk 2 (r31 X k1 r32Yk1 r33 )zk1 tz y (r X + r Y + r )z + t Y = k 2 = 21 k1 22 k1 23 k1 y ; k 2 + + + zk 2 (r31 X k1 r32Yk1 r33 )zk1 tz X k1  Solve for Z , from the   k1 [][]X Y 1 E Y = 0; above two equations to obtain: k 2 k 2  k1   1  After manipulation obtain:   cT e = −1 where: T = [] c X k 2 X k1 X k 2Yk1 X k 2 Yk 2 X k1 Yk 2Yk1 Yk 2 X k1Yk1 X k 2Yk 2

e is a vector (matrix) of 8 essential parameters, called the essential matrix. Solve to get e first.

Hence we require 8 point correspondences in this case to obtain the (i) essential parameters first and then (ii) the motion parameters. DYNAMIC STEREO

F1L F1R

F2L F2R

F3L F3R

Environment similar to the Human Vision System (minus brain-power)

FNL FNR Motion equations for dynamic stereo:

Let Al and Ar be the perspective transformation matrices for the pair of stereo cameras. R be the composite motion matrix with 12 unknown elements. I l (t ) = A RX (t ); i 2 l i 1 i =1,2,..., N r = Ii (t2 ) Ar RX i (t1);

Each point provides: eight (8) equations

Thus N points provide: 8N equations (linear).

Number of unknowns: 12 + 3N

To provide a solution: N > 3 Solve for obtaining the optimal solution Recent Advances: • Super-resolution from Video • MPEG standardization and representation • CVBR – spatio-temporal analysis • Video Organization - VOD • Active Vision and Surveillance • Multi-camera correspondence and depth • Global motion compensation • Pose and gesture • Adaptive Background models – shading and shadows • Multiple flows (optical flow) • 3-D Kalman filter, Particle filtering • Motion patterns • Deformable motion analysis • Sensor and view planning • Kinematics and Rigidity • Shape from motion That’s all for now –

Let’s MOVE ON