Particle Filter-Based Fingertip Tracking with Circular Hough Transform Features

Martin Do, Tamim Asfour, and Rüdiger Dillmann
Karlsruhe Institute of Technology (KIT), Adenauerring 2, 76131 Karlsruhe, Germany
{martin.do, asfour, dillmann}@kit.edu

Abstract

In this work, we present a fingertip tracking framework which allows observation of finger movements in task space. By applying a multi-scale edge extraction technique, an edge map is generated in which low-contrast edges are preserved while noise is suppressed. Based on circular image features determined from this map using the Hough transform, the fingertips are accurately tracked by combining a particle filter with a subsequent mean-shift procedure. To increase the robustness of the proposed method, dynamical motion models are trained for the prediction of the finger displacements. Experiments were conducted on various image sequences from which statements on the performance of the framework can be derived.

1 Introduction

Towards an intuitive and natural interface to machines, markerless human observation has become a major research focus during the past years. Regarding coarse-granular human tracking incorporating the torso, arms, head, and legs, considerable progress has been achieved, whereas human hand tracking still remains an unresolved issue, although the hand is considered to be one of the most crucial body parts for interaction with other humans and the environment.

Regarding markerless tracking and detection of human body parts, most systems have limited sensor capabilities, in the common case a stereo camera setup. Full hand tracking approaches in joint angle space have been proposed in [1], [2], [3]. However, due to the highly complex structure of the hand, whose motion involves 27 DoF, tracking can only be achieved at a low frame rate or using multiple views from different perspectives.

Using stereo vision, a reasonable solution lies in reducing the dimensionality of the problem by shifting from joint angle space into task space. In [4], a finger tracking approach based on active contours is presented for air-writing. The target to be tracked consists of a contour which is laid around the pointing finger. Since no reliable statement can be made on the actual fingertip position, one has to assume that the finger pose does not change. Hence, based on curvature properties, in [5] fingertips are detected within a contour which is extracted from skin blob tracking. A more elaborate approach is presented in [6], where particles are propagated from the center of the hand to positions close to the contour. Intersections of the contour with segments at the particles and examination of the transitions between non-skin and skin areas indicate whether a particle represents a fingertip. However, this method is specifically designed to detect the tips of stretched fingers. An approach based on multi-scale color features [7] introduces a hierarchical representation of the hand consisting of blobs of different sizes, with each blob representing a part of the hand. The blob features are matched with a number of hierarchical 2D models, each incorporating a specific finger pose. Tracking is therefore accomplished under the assumption that the local finger poses with respect to the hand remain fixed. In order to implement a continuous fingertip tracking method, we would like to rely on prominent features which can be extracted at any time in an image sequence. In [8], circular features, localized by performing a circular Hough transform, are proposed for detecting a guitarist's fingertips. For the same application, [9] defined semi-circular templates which are used to find the fingers' positions.

In our work, we adopt the concept of circular features to tackle the more complex problem of tracking the fingertips of a freely moving hand, where overlaps of fingers and palm occur frequently, leading to difficulties in the robust extraction of these features. For tracking, we combine particle filtering with a mean-shift algorithm. In addition, a dynamical motion model for prediction was trained to enhance the robustness of the proposed framework.

The paper is organized as follows. Section 2 describes the feature extraction, consisting of an edge detection step and the Hough transform. Details on the tracking procedure performed on the resulting map are given in Section 3. Subsequently, first experiments with the framework are explained in Section 4. In Section 5, the work is summarized and notes on future work are given.

2 Feature Extraction

In order to generate the edge image, a skin color segmentation is performed to extract the hand and finger regions. Morphological operators are applied to the segmented image to eliminate noise and to produce a uniform region. To detect the edges in this preprocessed image, image gradients are calculated at various scales.
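The paper does not state the color space, thresholds, or structuring elements used for this preprocessing step; a minimal OpenCV sketch of such a segmentation stage, with illustrative YCrCb skin bounds that are our assumption rather than values from the paper, could look as follows:

```python
import cv2
import numpy as np

def segment_hand(bgr):
    """Skin color segmentation plus morphological cleanup of the input.

    The YCrCb skin bounds below are a common heuristic and purely
    illustrative; the paper does not specify them.
    """
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array((0, 133, 77), dtype=np.uint8)
    upper = np.array((255, 173, 127), dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Opening removes speckle noise; closing fills small holes so the
    # hand appears as one uniform region.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return cv2.bitwise_and(bgr, bgr, mask=mask)
```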
2.1 Multi-Scale Edge Extraction

Considering the problem of fingertip tracking, due to the small intensity variances between different parts of the hand, e.g. the fingernail and the skin, or the finger regions and the palm, it is desirable to detect edges whose contrast can vary over a broad range. Depending on the parameters, applying standard algorithms such as the Canny edge detector at a wider scale leads to an edge image in which numerous false edges occur. To preserve low-contrast edges in certain areas while reducing noise close to high-contrast edges, we implemented, based on the work of [10], a filter approach consisting of steerable Gaussian derivative filters at multiple scales. The basis filter for x is defined as follows:

G_k^x(x, y; \sigma_k) = -\frac{x}{2\pi\sigma_k^4} \, e^{-\frac{x^2 + y^2}{2\sigma_k^2}}.   (1)

G_k^y(x, y; \sigma_k) is defined analogously. To determine the scale at which a gradient can be reliably estimated, the magnitudes of the filter responses r_k^x(x, y; \sigma_k) and r_k^y(x, y; \sigma_k), obtained by convolution of the image I with the filters from Eq. 1, are checked against a noise threshold. While the magnitude is calculated according to

r_k^m(x, y; \sigma_k) = \sqrt{r_k^x(x, y; \sigma_k)^2 + r_k^y(x, y; \sigma_k)^2},   (2)

the threshold is set by the following function:

c_k = \frac{\sqrt{-2 \ln\left(1 - (1 - \alpha)^{1/R}\right)}}{2\sigma_k^2 \sqrt{2\pi}} \, s_l,   (3)

with s_l representing the standard deviation and \alpha the significance level for an image with R pixels, which defines an upper bound on the allowed misclassification of image pixels. To take local intensity and contrast conditions into account, we focus on local signal noise in a specific region rather than on global sensor noise. Therefore, Eq. 3 depends on the local standard deviation s_l calculated within a 2\sigma_k^{max} \times 2\sigma_k^{max} neighborhood, where \sigma_k^{max} denotes the largest scale being examined. Hence, we calculate each gradient at the minimum reliable scale \sigma_k^{min} at which the likelihood of error due to local signal noise falls below a standard tolerance. This guarantees that a more accurate gradient map is estimated which is less sensitive to signal noise and to errors caused by interference from nearby structures. The edge image obtained from the map is depicted in Fig. 1.

Figure 1. Left: Original input image. Center/Left: Color segmented image. Center/Right: Edge image using the Canny detector. Right: Edge image using the method proposed in Section 2.1.
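To make Eqs. 1–3 concrete, the sketch below computes Gaussian derivative responses at each scale and keeps, per pixel, the gradient from the smallest scale whose magnitude (Eq. 2) exceeds the critical value (Eq. 3). Approximating the local standard deviation s_l with box filters and sweeping the scales in ascending order are simplifying assumptions of this sketch, not details from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def min_reliable_scale_gradients(image, scales=(4, 2, 1, 0.5), alpha=0.5):
    """Per pixel, keep the gradient from the smallest scale whose
    response magnitude (Eq. 2) passes the noise threshold c_k (Eq. 3)."""
    img = image.astype(np.float64)
    R = img.size
    # Local standard deviation s_l in a window of the largest scale
    # (box-filter approximation of the neighborhood statistic).
    w = int(2 * max(scales))
    local_mean = uniform_filter(img, size=w)
    local_sq = uniform_filter(img ** 2, size=w)
    s_l = np.sqrt(np.maximum(local_sq - local_mean ** 2, 0.0))

    gx_out = np.zeros_like(img)
    gy_out = np.zeros_like(img)
    settled = np.zeros(img.shape, dtype=bool)
    for sigma in sorted(scales):  # smallest (most local) scale first
        gx = gaussian_filter(img, sigma, order=(0, 1))  # d/dx response
        gy = gaussian_filter(img, sigma, order=(1, 0))  # d/dy response
        mag = np.hypot(gx, gy)                           # Eq. 2
        # Critical value c_k (Eq. 3): per-pixel significance derived
        # from the image-wide level alpha over R pixels.
        c_k = (np.sqrt(-2.0 * np.log(1.0 - (1.0 - alpha) ** (1.0 / R)))
               / (2.0 * sigma ** 2 * np.sqrt(2.0 * np.pi))) * s_l
        reliable = (mag > c_k) & ~settled
        gx_out[reliable], gy_out[reliable] = gx[reliable], gy[reliable]
        settled |= reliable
    return gx_out, gy_out
```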
2.2 Hough Transformation for Circle Detection

The circle features representing the N fingertips are detected by applying a Hough transform with radius r. For each edge point (x, y) with known direction in the form of a rotation angle \theta, a vote is assigned to possible circle feature positions (u, v) in the two-dimensional Hough space I_H according to

I_H(u, v) = I_H(u, v) + 1   (4)

with u = x \pm r\cos(\theta) and v = y \pm r\sin(\theta). Unfortunately, the curves around the fingertips do not always form perfect circular arcs. To cope with noisy and slightly deformed curves, the voting is performed for a set of radii R = \{-m \cdot 1.1 \cdot r, \ldots, m \cdot 1.1 \cdot r\} with m \in \mathbb{N}, whereby a range of pixels along the edge tangent is considered during the voting process. In order to increase the robustness of the tracking algorithm, a density distribution is formed in Hough space by convolving I_H with a Gaussian kernel G(u, v; r/2).

Since the hand motion occurs in 3D Euclidean space, fixing r is only valid if movement of the fingertip along the camera's z-axis is excluded during tracking. Adapting r in each frame allows fingers to be tracked in all directions. Based on the density distribution generated in frame t, a radius estimate \hat{r}_t is determined by applying an Expectation-Maximization algorithm. Further details are given in Section 3.3.
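A sketch of the voting scheme of Eq. 4 follows: each edge pixel casts votes at distance r along both senses of its gradient direction (the ± in Eq. 4), the voting is repeated over a band of radii, and the accumulator is smoothed with G(u, v; r/2). The exact construction of the radius set R is ambiguous in the extracted source, so a symmetric band r(1 ± iΔ) is assumed here, and the tangent-range voting is omitted for brevity:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def circular_hough(edge_mask, gx, gy, r, m=3, step=0.1):
    """Accumulate circle-center votes (Eq. 4) from edge pixels with
    known gradient direction, over a band of radii around r, then
    smooth the Hough space with a Gaussian of sigma = r / 2."""
    H = np.zeros(edge_mask.shape, dtype=np.float64)
    ys, xs = np.nonzero(edge_mask)
    theta = np.arctan2(gy[ys, xs], gx[ys, xs])
    # Assumed radius band; the paper's radius set R is not recoverable
    # verbatim from the extraction.
    radii = [r * (1.0 + i * step) for i in range(-m, m + 1)]
    for radius in radii:
        for sign in (+1.0, -1.0):          # the +/- of Eq. 4
            u = np.rint(xs + sign * radius * np.cos(theta)).astype(int)
            v = np.rint(ys + sign * radius * np.sin(theta)).astype(int)
            ok = (u >= 0) & (u < H.shape[1]) & (v >= 0) & (v < H.shape[0])
            np.add.at(H, (v[ok], u[ok]), 1.0)   # I_H(u, v) += 1
    # Density distribution: convolve I_H with G(u, v; r/2).
    return gaussian_filter(H, sigma=r / 2.0)
```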
3 Tracking Fingertips

3.1 Prediction

Providing a prediction of the movement of the objects to be tracked increases the robustness of a statistical tracking framework. We train dynamical motion models in the form of a second-order auto-regressive (AR) process as proposed in [4], which is described as follows:

q_t - \bar{q} = A_1(q_{t-1} - \bar{q}) + A_2(q_{t-2} - \bar{q}) + b_0 \omega_k   (5)

where q_t \in \mathbb{R}^D denotes the current configuration, \bar{q} the mean configuration, and \omega_k \in [0, 1]. To learn the AR parameters A_1, A_2 \in \mathbb{R}^{D \times D} and b_0 \in \mathbb{R}^D, training data is provided in the form of a configuration sequence Q = \{q'_0, \ldots, q'_M\}, where the sequence is generated by manually labeling the fingertips in each frame of a recorded image sequence.

Two AR models are trained to provide predictions for the local fingertip movement with respect to a static hand pose as well as for the movement of the hand itself. Based on the assumption that the motion of each finger is influenced by the motion of the neighboring fingers, the first model is trained with data whose instances q'_i \in Q with D = N consist of the lengths of the vectors v_t^{j,j+1} = q_t^{j+1 \bmod N} - q_t^j between fingertips j and j + 1 \bmod N:

q'_i(j) = \|v_t^{j,j+1}\|, \quad j = 1, \ldots, N.   (6)

For finger j, this leads to the following displacement vector:

\bar{v}_t^j = \frac{1}{2} \sum_{i=j-1}^{j} \left( A_1^N(i, i+1) \, v_{t-1}^{i,i+1} + A_2^N(i, i+1) \, v_{t-2}^{i,i+1} \right) + b_0^j \omega_k.   (7)

The second model, which considers the global movement of the mean position p^m of all fingertips, is trained with a data set formed of q'_i = p^m with D = 2, resulting in an overall displacement:

\hat{v}_t^j = A_1^2(p_t^m - p_{t-1}^m) + A_2^2(p_{t-1}^m - p_{t-2}^m) + b_0^m \omega_k + \bar{v}_t^j.   (8)

Due to the coupled fingertip movements, the models behave well, yielding reasonable predictions of the finger displacements which support the state estimation in the ensuing tracking procedure.
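Eq. 5 can be fitted and evaluated in a few lines. A minimal sketch, assuming an ordinary least-squares fit on the centered sequence for A_1 and A_2 (the paper follows [4] without detailing the estimator), the residual spread for b_0, and a uniform \omega_k \in [0, 1] as stated:

```python
import numpy as np

def fit_ar2(Q):
    """Least-squares fit of the second-order AR model of Eq. 5 to a
    labeled configuration sequence Q of shape (M, D).  The estimator
    is an assumption; the paper does not spell it out."""
    Q = np.asarray(Q, dtype=np.float64)
    q_bar = Q.mean(axis=0)
    X = Q - q_bar
    # Regress x_t on the stacked regressors [x_{t-1}, x_{t-2}].
    Z = np.hstack([X[1:-1], X[:-2]])     # (M-2, 2D)
    Y = X[2:]                            # (M-2, D)
    coeffs, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    D = Q.shape[1]
    A1, A2 = coeffs[:D].T, coeffs[D:].T
    b0 = (Y - Z @ coeffs).std(axis=0)    # residual scale drives the noise term
    return q_bar, A1, A2, b0

def predict_ar2(q_bar, A1, A2, b0, q_prev, q_prev2, rng=None):
    """One-step prediction q_t from Eq. 5 with omega_k ~ U[0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    omega = rng.uniform(0.0, 1.0)
    return q_bar + A1 @ (q_prev - q_bar) + A2 @ (q_prev2 - q_bar) + b0 * omega
```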
3.2 Particle Filter Tracking

For the proposed fingertip tracking framework, a state hypothesis s of a particle (s, w) consists of the N fingertip positions of the hand, with each position denoted by coordinates (x, y) within the image. Particle filtering is an iterative algorithm where, first, at time t, M samples are drawn from the set of previous particles X_{t-1} = \{(s_{t-1}^i, w_{t-1}^i)\} proportionally to their likelihoods w_{t-1}^i. Subsequently, from each drawn sample a new state hypothesis s_t^i is generated. Adding a Gaussian random variable \omega and the displacement vector \hat{v}_t from Eq. 8, s_t^i can be written as:

s_t^i = s_{t-1}^i + \hat{v}_t + \omega.   (9)

To determine the particle set X_t, the likelihood w_t^i is computed for each s_t^i. In order to compute the weights for the new set of particles, one has to approximate the likelihood function p(z_t | s_t), with z_t representing the current observation. Our approximation of p(z_t | s_t) is based on two cues: a contour cue and a distance cue. The contour cue is derived by exploiting the external energy functional E_{img} of a contour C_t^i obtained by connecting the single points in s_t^i according to the finger order. E_{img} is determined in terms of an edge image Z_t^E, which is constructed by drawing lines between the set of maximum bins Z_t^V found in I_H. As a result, the likelihood function can be written as:

p_c(z_t | s_t) \propto w_c(s_t) = \exp\left( -\frac{E_{img}(Z_t, C_t)}{\sigma_c^2} \right).   (10)

The distance cue is calculated from the Euclidean distance between s_t^i and Z_t^V, which consists of the sum of the minimal distances between s_t^i(j) and Z_t^V. Based on this cue, the likelihood function can be defined as in the following equation:

p_d(z_t | s_t) \propto w_d(s_t) = \exp\left( -\frac{\sum_{j=1}^N \min \|s_t(j) - Z_t^V\|}{N \sigma_d^2} \right).   (11)

The final likelihood function is constructed from Eq. 10 and Eq. 11; hence, we define the computation of the weights as follows:

w_t = \frac{\sqrt{w_c(s_t^i) \, w_d(s_t^i)}}{\sum_{k=1}^{M} \sqrt{w_c(s_t^k) \, w_d(s_t^k)}}.   (12)

One obtains the current state estimate of the fingertip configuration by evaluating the following sum:

s_t = \sum_{i=1}^M w_t^i s_t^i.   (13)

Further details concerning the particle filter algorithm can be found in [11].
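One iteration of the tracker (resampling, propagation via Eq. 9, weighting via Eq. 12, estimate via Eq. 13) can be condensed as follows. The cue functions w_c (Eq. 10) and w_d (Eq. 11) are passed in as callables, since their energies depend on the current edge image and Hough maxima; this factoring is an assumption of the sketch, not the paper's interface:

```python
import numpy as np

def particle_filter_step(particles, weights, v_hat, sigma_noise,
                         contour_weight, distance_weight, rng=None):
    """One particle filter iteration.  Each particle is an (N, 2)
    array of fingertip positions; contour_weight and distance_weight
    evaluate w_c (Eq. 10) and w_d (Eq. 11) for one hypothesis."""
    rng = np.random.default_rng() if rng is None else rng
    M = len(particles)
    # Draw M samples proportionally to the previous likelihoods.
    idx = rng.choice(M, size=M, p=weights)
    # Propagate: Eq. 9, previous state plus predicted displacement
    # plus Gaussian diffusion noise.
    new_particles = [particles[i] + v_hat +
                     rng.normal(0.0, sigma_noise, size=particles[i].shape)
                     for i in idx]
    # Weight: Eq. 12, normalized geometric mean of the two cues.
    w = np.array([np.sqrt(contour_weight(s) * distance_weight(s))
                  for s in new_particles])
    w /= w.sum()
    # State estimate: Eq. 13, the likelihood-weighted mean hypothesis.
    estimate = sum(wi * si for wi, si in zip(w, new_particles))
    return new_particles, w, estimate
```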
3.3 Mean-Shift

To obtain more accurate position estimates, a mean-shift algorithm is applied to move each estimated fingertip position p_j = s_t(j) towards the peak of the local density distribution. We adopted the EM-like mean-shift algorithm proposed in [12], which additionally provides the possibility to estimate the covariance of the local density distribution. The covariance estimation allows us to adapt the radius r corresponding to the current circular image features. Hence, taking into account movement in the depth direction of the camera, one has to incorporate an adaptation of the radius r_t when tracking circular features in Hough space. Under the assumption that the distribution can be modeled as a Gaussian, we want to find the parameters \bar{p}_j and \bar{V}_j, representing the center and covariance matrix of the distribution, that maximize the following function for M independent samples:

f(\bar{p}_j, \bar{V}_j) = \sum_{i=1}^M G(p_i; \bar{p}_j, \bar{V}_j) \, I_H(p_i),   (14)

which can be solved iteratively by first calculating \lambda_i according to:

\lambda_i = \frac{G(p_i; \bar{p}_j, \bar{V}_j) \, I_H(p_i)}{\sum_{i=1}^M G(p_i; \bar{p}_j, \bar{V}_j) \, I_H(p_i)}.   (15)

A new estimate for the center can be derived from the following equation:

\hat{p}_j = \sum_{i=1}^M \lambda_i p_i,   (16)

whereas a covariance matrix estimate is obtained by evaluating the following term:

\hat{V}_j = c \sum_{i=1}^M \lambda_i (p_i - \bar{p}_j)(p_i - \bar{p}_j)^T   (17)

with c being a constant. Once convergence is achieved, the radius is determined from the covariance matrix.
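The iteration of Eqs. 15–17 can be sketched as follows: the samples p_i are the pixels in a local window of the smoothed Hough map, the λ weights are computed, and the center and covariance are re-estimated until the shift becomes small. The window size, the constant c, and the way the radius is read off the converged covariance are illustrative assumptions of ours, not choices documented in [12]:

```python
import numpy as np

def em_mean_shift(hough, p0, r, iters=10, c=1.2):
    """Refine a fingertip estimate p0 = (x, y) on the smoothed Hough
    map via the EM-like mean-shift of Eqs. 15-17."""
    p = np.asarray(p0, dtype=np.float64)
    V = (r ** 2) * np.eye(2)              # initial covariance guess
    h, w = hough.shape
    for _ in range(iters):
        k = int(3 * r)                    # assumed local sample window
        ys = np.arange(max(0, int(p[1]) - k), min(h, int(p[1]) + k + 1))
        xs = np.arange(max(0, int(p[0]) - k), min(w, int(p[0]) + k + 1))
        XX, YY = np.meshgrid(xs, ys)
        pts = np.stack([XX.ravel(), YY.ravel()], axis=1).astype(np.float64)
        d = pts - p
        # Gaussian kernel values G(p_i; p, V) up to a constant factor.
        g = np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, np.linalg.inv(V), d))
        lam = g * hough[YY.ravel(), XX.ravel()]
        lam /= lam.sum() + 1e-12                      # Eq. 15
        p_new = lam @ pts                             # Eq. 16
        d = pts - p_new
        V = c * np.einsum('i,ij,ik->jk', lam, d, d)   # Eq. 17
        converged = np.linalg.norm(p_new - p) < 0.5
        p = p_new
        if converged:
            break
    # On convergence, re-read the radius from the covariance, e.g. as
    # the mean standard deviation along its principal axes (assumed).
    r_new = float(np.sqrt(np.clip(np.linalg.eigvalsh(V), 0, None)).mean())
    return p, V, r_new
```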
4 Results

The proposed fingertip tracking framework was applied to several image sequences captured with a static stereo camera setup at a resolution of 640 × 480 pixels. For edge extraction, the method presented in Section 2.1 is applied with \sigma_k = 4, 2, 1, 0.5 and \alpha = 0.5. Currently, initialization of the tracking is done manually by defining a region I_H^n in which finger n is to be found. Using the Hough transform with the different radii constructed with m = 3, the maximum bin in I_H^n is labeled as finger n according to the finger order n \in \{Thumb = 0, Index = 1, Middle = 2, Ring = 3, Pinkie = 4\}.

Figure 2. Left: Original input image. Center: Visualization of the Hough space. Right: Generated contour for the particle filter tracking. The particle with the highest likelihood (black dotted line) and the particle with the lowest (white dotted line) are depicted.

Taking into account the predicted displacements of the fingers, the particle filter tracking algorithm shows good performance with a minimum of 600 particles. Around 3 mean-shift iterations are needed to achieve convergence. The number of iterations of the subsequent mean-shift algorithm depends on the number of particles: fewer particles result in more mean-shift iterations and vice versa. Exploiting the combination of both, a reasonable accuracy of approximately 7 pixels mean deviation is achieved for translational movements. For rotation, opening, and closing movements, the error increases up to 20 pixels. These measurements are depicted in Fig. 4. The tracking procedure fails if a finger is lost, which is the case if the movements are too fast. In case of failure, currently, a very rudimentary re-initialization is performed, consisting of a search for maximum bins in the vicinity of the last known estimate and an arrangement of the fingers according to their position in polar space, assuming that the fingers are arranged clockwise or counter-clockwise, depending on the hand that is observed. Since this assumption is not valid for several finger poses, this might lead to mislabeling.

Since the algorithm operates on monocular images, a tracking instance is created for each view, and the 3D finger positions are calculated by exploiting epipolar geometry. The presented framework is capable of online tracking of fingertip motion at a frame rate of 15 Hz on a 2.40 GHz dual-core CPU. Sample images from the tracking process are depicted in Fig. 3.

Figure 3. Images of the tracking results. Upper row: Simultaneous closing of the fingers. Lower row: Sequential flexing of the fingers. Labeling of fingertips: Thumb (green), index (light blue), middle (dark blue), ring (pink), and pinkie (red).

Figure 4. Error plot for a sequence of four hand and finger movements: Translation and rotation of the hand, closing and opening movements of the fingers.

5 Conclusion

In this work, we presented a fingertip tracking framework which allows observation of fine-granular human actions, such as grasping, in an efficient manner. Using the Hough transform and a combination of particle filter and mean-shift tracking, circular features representing the fingertips could be localized and tracked. Currently, the proposed framework is applied to capturing human grasping movements for online imitation learning using the on-board stereo camera pair of a robot.

However, in the experiments we conducted, we observed that the error of the fingertip localization increases when the hand performs movements which go beyond translation. This can be attributed to the use of a single dynamical motion model for the prediction. In the near future, the fingertip prediction module will be implemented in the form of multiple intertwined motion models to provide better predictions. Concerning the motion model of the hand, we realized that it needs to be extended by an angular component to cover the hand rotation. To enable full online observation of the human upper body, the fingertip tracking will be integrated into an upper body tracker and its implementation will be improved to raise its efficiency.

6 Acknowledgments

The work described in this paper was conducted within the EU Cognitive Systems project GRASP (FP7-215821) funded by the European Commission.

References

[1] J. Rehg and T. Kanade, "Digiteyes: Vision-based hand tracking for human-computer interaction," in Proc. Workshop on Motion of Non-Rigid and Articulated Bodies, November 1994, pp. 16–22.
[2] B. Stenger, P. R. S. Mendonca, and R. Cipolla, "Model-based 3D tracking of an articulated hand," in Proc. Int. Conf. Computer Vision and Pattern Recognition, December 2001, pp. 310–315.
[3] I. Oikonomidis, N. Kyriazis, and A. A. Argyros, "Markerless and efficient 26-DOF hand pose recovery," in Proc. 10th Asian Conf. Computer Vision, November 2010.
[4] A. Blake and M. Isard, Active Contours: The Application of Techniques from Graphics, Vision, Control Theory and Statistics to Visual Tracking of Shapes in Motion, 2000.
[5] A. A. Argyros and M. Lourakis, "Vision-based interpretation of hand gestures for remote control of a computer mouse," in Proc. HCI06 Workshop, May 2006, pp. 40–51.
[6] K. J. Hsiao, T. W. Chen, and S. Y. Chien, "Fast fingertip positioning by combining particle filtering with particle random diffusion," in Proc. Int. Conf. Multimedia and Expo, June 2008, pp. 977–980.
[7] L. Bretzner, I. Laptev, and T. Lindeberg, "Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering," in Proc. Int. Conf. Automatic Face and Gesture Recognition, May 2002, pp. 423–428.
[8] A. M. Burns and M. M. Wanderley, "Visual methods for the retrieval of guitarist fingering," in Proc. Int. Conf. New Interfaces for Musical Expression, Paris, France, June 2006, pp. 196–199.
[9] C. Kerdvibulvech and H. Saito, "Markerless guitarist fingertip detection using a Bayesian classifier and a template matching for supporting guitarists," in Proc. 10th Virtual Reality Int. Conf., April 2008.
[10] J. Elder and S. Zucker, "Scale space localization, blur, and contour-based image coding," in Proc. Int. Conf. Computer Vision and Pattern Recognition, June 1996.
[11] P. Azad, Visual Perception for Manipulation and Imitation in Humanoid Robots, 2009.
[12] Z. Zivkovic and B. Kroese, "An EM-like algorithm for color-histogram-based object tracking," in Proc. Int. Conf. Computer Vision and Pattern Recognition, Washington D.C., USA, June 2004, pp. 798–803.