
BACHELOR THESIS

Interactive Projection Mapping

Emil Hedemalm 2014

Bachelor of Science in Engineering Technology Computer Game Programming

Luleå University of Technology
Institutionen för system- och rymdteknik (SRT)

Abstract

The world of interaction is ever expanding, and novel approaches for dealing with image-based gesture and input recognition are in demand. To get a glimpse into the various problems and solutions present in the field, a configurable image filter pipeline framework based on OpenCV is presented. Aiming at real-time applications, the framework is written entirely in C++. Both OpenCV-based functions and custom filters are presented in order to solve various problems. The subjects of image analysis and human-computer interaction are explored, including a few specific use-cases. Pipeline configurations are presented for each use-case, including details such as processing time, the order in which filters are applied, the settings used and the problems that were encountered. The field of augmented reality is investigated, and a few exemplary end-user applications are developed and presented for this purpose. Lastly, a discussion of the results is held, which addresses the problems encountered with possible solutions and suggests further work where necessary.

Sammanfattning

Världen av interaktion ökar ständigt och nya sätt att hantera bildbaserad analys för handgester och datorinteraktion efterfrågas. För att försöka få en glimt av de olika problemen och lösningarna som finns så presenteras ett ramverk baserat på OpenCV. Då målet är realtidsapplikationer är ramverket skrivet i C++. Både OpenCV-baserade och egna filter presenteras för att lösa diverse problem. Ämnena bildanalys och människa-datorinteraktion undersöks, och några specifika användarfall presenteras. Lösningar för samtliga användarfall presenteras, och detaljer såsom processeringstid, ordningen i vilken filtren appliceras, inställningar som använts och problem som stötts på presenteras likaså. Fältet förstärkt verklighet undersöks och några exemplariska användarapplikationer utvecklas och presenteras för ändamålet. Till sist hålls en diskussion över resultatet, som försöker besvara problem med möjliga lösningar och föreslår vidare arbete där det är nödvändigt.

Acknowledgements

This thesis was conducted at Bosch Sensortec GmbH in Reutlingen, Germany, over a period of nearly 10 weeks. Thanks to the employees there, as well as to Dr. Johannes Hirche, for the help in setting up this project, which would never have happened without them.
A big thanks to Alexander Ehlert, Felix Schmidt and Christoph Delfs for their enthusiasm and feedback throughout the project. Another big thanks to Luleå University of Technology's section at Campus Skellefteå, including my classmates and lab assistants. Lastly, a big thanks to the Sammes Stiftelse foundation for its support over the years towards the development of the game engine framework.

Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Challenges
  1.4 Related Work
  1.5 Social, Ethical, and Environmental Considerations

2 Theory
  2.1 Color spaces
  2.2 Image processing
  2.3 Augmented Reality

3 Software setup and development
  3.1 Prerequisites
  3.2 Configurable filter pipeline framework
  3.3 Image Filters
    3.3.1 Utility and Conversion filters
    3.3.2 Feature detection
    3.3.3 Morphology filters
    3.3.4 Channel filters
    3.3.5 Background removal
  3.4 Data filters
    3.4.1 Contour filters
    3.4.2 Hand Detector
    3.4.3 Finger State filter
    3.4.4 Box detection filters
    3.4.5 Approximate polygons
  3.5 Render Filters
    3.5.1 Image gallery
    3.5.2 Movie Projector
    3.5.3 Music player

4 Test results
  4.1 Hardware setup
  4.2 General testing information
  4.3 Hand detection, hue filtering
  4.4 Hand detection, background subtraction
  4.5 Hand interaction applications
  4.6 Box detection
  4.7 Polygon detection, Movie projector

5 Discussion
  5.1 Conclusion
  5.2 Reflection
  5.3 Future work

6 References

A Appendix
  A.1 Source code and binary
  A.2 Videos

1 Introduction

1.1 Background

Interactivity has become one of the most important aspects of the digital era. With the advent of touch displays, we have moved on to whole new paradigms for user interaction, such as "swipes" and other gestures utilizing one or more fingers.
An area of research that has yet to come into widespread use is image-based visual interaction (in contrast to haptic interaction). The possibilities of visual analysis surely extend to all fields of application, but there are still several issues which have prevented it from becoming popular so far, in large part due to the complexity at hand. Real-time usage also requires significant processing power, which has not been viable in the past.
Before any analysis can be done, a means of input must be established. There exist several methods for extracting the necessary image data: regular RGB cameras, depth cameras, and stereoscopic recording are a few examples.
As for the analysis itself, there are various methods to go about it. First of all, the target must be identified. To do this, usually some kind of blob or feature detection is used. Exactly which kind of detection method is used will depend on what the target is. As such, applications usually have to make an assumption about the expected target before deciding which algorithms to run.
After the target has been identified, further analysis may follow. In the case of hand gesture recognition, for example, the identified hand may be probed for its posture, as well as the positions of its fingers. Using data gained over time, complex gesture recognition could be applied in order to expand the possible interaction alternatives.
Before all of the relevant analysis mentioned above can be performed, other processing may have to be done first. Examples of extra calculations include taking into consideration movement of the camera or changes in lighting conditions, as well as filtering out visual artifacts due to noise. Anything which will have an impact on the resulting analysis has to be taken into account somehow, either by finding a method to filter out the irrelevant data or by setting conditions for the usage of the application.
The processes used to process image or signal data are usually called "filters" or "processing filters", largely because most of them work by filtering out irrelevant data. As such, this is how they are referred to throughout most of the report.

1.2 Purpose

The purpose of this project is to investigate the possibilities of image-based feature detection combined with real-time interaction and projected content. More specifically, the OpenCV [24] framework is tested for this endeavor, and a few experimental solutions are presented to solve a few feature detection problems. OpenCV, or Open Source Computer Vision, is a BSD-licensed multi-platform library designed for computational efficiency with a focus on real-time applications.
In order to test various solutions for image analysis problems, a testing application is developed. A custom framework for encapsulating OpenCV-based and custom image processing and analysis functions is developed and embedded into the final application. In order to test the possibilities of various processing filter configurations, the application must feature the ability to add, remove and adjust processing filters and the order in which they are applied to the input data. The application is also developed to iteratively test the pipeline on both still images and real-time input, such as the feed from a web camera.
After experimenting with and studying readily available OpenCV functions, a few use-cases are presented. In order to meet the ends of each use-case, a processing filter pipeline solution is presented. Settings, output and processing time statistics are presented for each pipeline in order to give an overview of how well they perform, both quality- and performance-wise. The problems of actually rendering relevant content to the user, and the issues this may present when analysing potentially altered input, are discussed only briefly.
As noted in the survey by Pavlovic et al. [1], input is generally gained either from two or more cameras, a sensor that measures depth, or some similar configuration which provides more data to work with than just regular (3-channel RGB) color data. This thesis, however, will focus on standard 3-channel RGB input, and will also consider the challenges of using only single-channel (greyscale) input. Therein lies the challenge of which processing filters should be used, and in what combination, before any feature extraction or gesture analysis can be performed.
Gesture detection and analysis are investigated to some extent (as much as is needed in order to attain some decent interaction possibilities), and some example visualizations are presented.

1.3 Challenges

The challenges in feature detection lie primarily in the processing filters required before the final analysis can be made. Exactly which approach is best will likely depend on both the hardware setup and what kind of feature analysis is to be supported. Hand gesture detection, for example, can make use of both background removal and hue filtering, while detecting swipes could use entirely different methods. Image analysis, or signal analysis in general, is all about filtering out unwanted data before a final result can be calculated properly. With too much noise or too many false positives, the results may be useless. When considering end-user usage, the number of false positives must be close to zero for a good user experience.
For this project, two specific analysis tasks are investigated: extraction of the contours and gestures of a hand, and identifying a suitable rectangular surface to project content onto. The former may be used for recognizing and reacting to visual gesture input, sometimes referred to as natural user interfaces [26], while the latter is intended as an example of augmented reality, rendering content on top of real-world objects.
Besides the challenge of filtering out irrelevant parts of the input, the demands of real-time usage also have to be taken into consideration. Due to the ever-increasing usage of mobile platforms, this study is meant as a precursor before applying the techniques to mobile devices. This is why the results section focuses on the time consumption of both the individual filters and the proposed pipeline configurations in their entirety.

Challenges discovered during the tests will be presented in their relevant sections throughout this report.

1.4 Related Work

The amount of research in the fields related to image analysis and gesture recognition modelling is quite vast, even more so when considering the general algorithms used to filter image data. Here, a few examples of related work are presented; some research related to specific implementations and OpenCV functions used will be presented within the main text.
To start, the work by Pavlovic et al. [1] was studied. It provides a good basis for future studies within hand gesture detection as it presents models for all stages of the work, including input gathering, gesture modelling, analysis, and recognition, as well as some systems and applications. It makes clear that gestures are time-based and dynamic in nature, making all stages of modelling, analysis and recognition quite complex. It also presents the main approaches for modelling and estimating the current state of the hand, including 2D appearance-based models using templates as well as 3D volumetric and skeletal models. It also provided the initial insight that localization of gestures often uses either color or motion footprints.
Another, more recent, survey is the one by Mitra and Acharya [2], which gives some additional insight into the field, as it studies both hand and facial gesture recognition techniques. Within it they present techniques such as active contour models (Snakes) [3], the Hough transform [4] and motion detection known as optical flow [5].
Filtering using hue or color values for skin has been employed and studied in depth by various teams before. Some examples include the surveys by Kakumanu et al. [9] and Vezhnevets et al. [10], both of which study how to take into consideration such things as illumination conditions, which color spaces work best and how to avoid false positives. An example of recent research is provided by Kawulok et al. [11], in which several techniques are tested and compared to their new method for adaptive skin detection using spatial analysis.
One example of a hand gesture recognition approach is presented by Ren et al. [6], in which a depth camera does the initial segmentation of the hand, while a black wrist-band or belt is required in order to properly cull away the arm from the hand. A technique called Finger-Earth Mover's Distance (EMD) is then used to analyse the relative angle and distance from the contour edges to the center of the contour in order to identify the fingers. The approach was then tested and used in an application using the Kinect [7], which proved successful.
For background removal, or motion detection, the paper by KaewTraKulPong and Bowden [12] may be of particular interest as it proposes an approach which takes into consideration both variances in color and the shadows created by moving objects. A more recent example is the paper by Barnich and Van Droogenbroeck [13], which presents a technique for adaptive background detection accompanied by spatial propagation in order to get the technique to work with moving cameras.
Although not tested in this thesis, it should be noted that there are some sophisticated methods for matching 2D and 3D models with image data in order to identify or estimate hand gestures. One example of such a method is presented in the paper by Orrite-Urunuela et al. [8], where blob detection, silhouette matching and a skeleton model are used in order to track and estimate the state of the human body as a whole.

1.5 Social, Ethical, and Environmental Considerations

The intent of the work conducted in this thesis is to further improve human-computer interaction by enabling users to interact using visual gestures. Enabling this could potentially yield new workflows and user-interaction paradigms, or help improve existing ones. If explored further, it could reduce the need for current haptic interaction methods such as keyboards and mice. Assuming that the cameras used for this purpose are protected well enough, this could have a great impact on the consumer electronics industry, which currently produces vast amounts of garbage as old products are thrown away.
Current visual sensors used for detecting movement could also be complemented with hue detection as described herein, in order to attain user interaction or improved detection in otherwise mundane areas. For example, a detector could be used to detect the reddish hues of people and keep the lights on whenever anyone is present in the room. Another example could be alarm systems, where the cameras could react to any user-specified hand-gesture combination, for example lifting both arms and spreading the fingers up in the air (which would be relatively simple to detect with the described methods).
As for integrity, using more cameras and visual sensors is, or may be, an ethically sensitive issue. Any deployed system would either have to be a closed-circuit system, such that the visual data does not leave the system, or the data would have to be protected in some manner, encrypted or otherwise. One solution could be to provide a framework which only sends signals whenever a gesture is detected, keeping the image data hidden from external peers.

2 Theory

2.1 Color spaces

Color spaces define how visible colors are modeled in computer systems. The most common representation for computers is the RGB (Red, Green, Blue) color space. This means that for every pixel of data, there exists one component of red, one of green and one of blue, usually 8 bits each.
The size and type of the components are relevant when considering conversions. The standard RGB format usually uses 8-bit components (24 bits in total), meaning their values span from 0 to 255, while some implementations, for example OpenGL [27], may favor the use of floating point values, usually between 0.0 and 1.0.
Examples of other color spaces include CMYK, variations of RGB, variations of HSV, TV-specific formats and scientific formats such as CIELAB. The one of interest in this project is HSV (Hue, Saturation, Value). The Hue component of the HSV color space (which is also available in other color spaces) is of particular interest as it is circular in nature. This means that once it reaches its maximum it is interpreted in the same way as if it had its minimum value (see figure 1).

Figure 1: The Hue spectrum as specified in the HSV/HSL encodings of RGB. [22].

2.2 Image processing

Image processing is a variant of signal processing that operates on two-dimensional signals. As such, it can be explained using a simple data flow consisting of an input signal, a filter (or processor) that somehow changes the data, and an output signal which carries the modified data (see figure 2 for examples).
A processing filter, in this case, is synonymous with any process that is applied to the input in order to generate some output, in most cases in order to isolate some relevant data. A common example is noise removal, which is usually accomplished by subtraction, division, or applying some matrix to each pixel and its neighbours.
There are, however, also several use-cases where new types of data may be wanted. This means that the filters may generate some new customized data based on the input, instead of strictly isolating or modifying the image data within. This could for example be used in preparation for transfer over some specific medium, or in preparation for some other kind of high-level analysis, such as feature and gesture recognition.
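As a concrete illustration of this data flow, a minimal sketch using OpenCV is given below, chaining a greyscale filter and a threshold filter as in figure 2. The file names and the threshold value are illustrative only and not part of the actual implementation.

// Minimal sketch of the input -> filter -> output flow in figure 2.
// File names and the threshold value are illustrative assumptions.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>

int main()
{
    cv::Mat input = cv::imread("hand.png");                   // input signal (3-channel BGR image)
    if (input.empty())
        return -1;

    cv::Mat grey, binary;
    cv::cvtColor(input, grey, cv::COLOR_BGR2GRAY);             // greyscale filter: 3 channels -> 1 channel
    cv::threshold(grey, binary, 128, 255, cv::THRESH_BINARY);  // threshold filter: greyscale -> binary

    cv::imwrite("output.png", binary);                         // output signal
    return 0;
}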

[Figure 2 diagram: Input signal → Signal filter → Output signal; example filters shown: Greyscale filter, Threshold filter]

Figure 2: Image/Signal processing template and examples.

2.3 Augmented Reality

Augmented reality, related to mediated or modified reality, refers to "a live direct or indirect view of a physical, real-world environment whose elements are augmented" [23] with computer-generated sensory input such as graphics, sound, or otherwise, in order to ease or enhance the completion of some task. One recent example is the ocular aid known as the EyeTap [14], which was initially developed in order to ease tasks such as welding by improving the visual information provided to the user, processing the otherwise dangerously bright light before presenting it. Other recent examples include projecting navigation paths over roads for vehicles, rendering game content onto identified surfaces, or identifying the language and meaning of arbitrary text.

3 Software setup and development

3.1 Prerequisites

In order to work efficiently, both developing and testing filters, several parts were needed, including a user interface, a rendering engine and a framework for image analysis and processing. A graphical overview of the resulting setup is presented in figure 3.
The game engine developed during my previous studies and free time [25], known initially as the Aeonic Engine, was chosen as the overall framework. It provides a dynamically recompilable user interface system and has inherent functionality for handling graphics, sound, physics and mathematics to some extent.
As the basis for handling image processing, OpenCV [24], or Open Source Computer Vision, was chosen. OpenCV is a BSD-licensed multi-platform framework, or library, focused mainly on real-time image processing. OpenCV has a vast amount of functionality available; some examples include the image processing categories of filtering, transformations, histograms, structural analysis and feature detection.
On top of that, a custom framework was designed for handling both OpenCV-based and custom image processing operations within the game engine. The developed framework revolves around a central pipeline which handles the sequence in which selected image-processing filters are applied. Within the pipeline, image-processing filters can be added and removed, temporarily disabled, and have their settings adjusted between each iteration. Due to the versatility of OpenCV, many of the image-processing filters developed make use of OpenCV functions. All developed framework and filter processing classes have the upper-case 'CV' prefix in the implementation, to easily differentiate them from existing parts of the game engine.
The C++ programming language was used throughout the entirety of this project, together with the Microsoft Visual Studio IDE. An early version of the program has also been tested to compile and run with GCC on a virtual machine running Linux.

[Figure 3 diagram: Game Engine (Input manager, Math library, Graphics manager, UI system), Configurable filter pipeline framework (CVPipeline, CVFilter, CVFilterSetting etc.), OpenCV framework, Operating system]

Figure 3: Software setup, including some of the components available in the game engine.

3.2 Configurable filter pipeline framework

The filter pipeline was developed in order to test various image processing techniques in succession. It assumes an input image (which can be a static texture, a sequence of images or a live feed from e.g. a web camera), the user-defined filter chain to process it, and finally an output image (optionally with additional outputs). Selecting the input itself is not really a part of the pipeline, but is a prerequisite for it to have any image data to work with.
The image processing filters are divided into three main categories: Image filters, Data filters and Render filters. Image filters are those whose input and output are images, either in the same format or possibly adjusted (e.g. the Greyscale filter converts 3-channel images to single-channel images). Data filters are those whose output generally consists of custom non-image data types. Because of this, Data filters usually apply some kind of debug-rendering in order to visualize their results. Render filters render extra content to the user, either via further visualizations provided by the game engine or by creating output files.
The pipeline and filter system supports adding and removing filters, adjusting the settings of the filters, temporarily disabling and enabling filters, as well as saving and loading entire pipeline configurations. The class structure also supports an error messaging system: when a filter fails to process its input, it sets its error string and returns the error result code (-1), which in turn sets the entire pipeline's error string to the same value, easing debugging and iteration.

Figure 5 displays the structure of the CVPipeline. A typical pipeline processing example starts with defining the initial input. After the initial input has been decided, the pipeline's Process function is called. Before any filter is processed, the initial input is first copied into the input image container. The first filter in the list is then called to work by calling its Process function, passing a pointer to the pipeline as an argument so that it can access any necessary input data and write output data. If the first filter is an image processing filter, it will write some data to the output image container and notify the pipeline by returning a value which signifies this. In order to prepare for the next filter's execution, the output container's data is then copied to the input container. Processing continues similarly for all N filters. Filters that generate data other than image data save it in various other variables within the pipeline class and return values corresponding to their written output data type. Once the last filter has done its work, the pipeline calls the last successfully processed filter's Paint function. This is done in order to render relevant debug graphics for visualizing the results, which is required for the more abstract non-image data return types.
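A simplified sketch of this processing loop is given below. The class and method names (CVPipeline, CVFilter, Process, Paint) follow the text, while the member names, the return-type constants and the overall layout are assumptions made purely for illustration.

// Simplified sketch of the CVPipeline processing loop described above.
// Member names and return-type constants are illustrative assumptions.
#include <opencv2/core/core.hpp>
#include <vector>
#include <string>

enum CVReturnType { CV_ERROR = -1, CV_IMAGE, CV_DATA };  // assumed return codes

class CVPipeline;  // forward declaration

class CVFilter
{
public:
    virtual ~CVFilter() {}
    virtual int Process(CVPipeline * pipe) = 0;   // reads pipe->input, writes pipe->output or other data
    virtual void Paint(CVPipeline * pipe) {}      // debug-renders this filter's results
    std::string errorString;
};

class CVPipeline
{
public:
    cv::Mat initialInput, input, output;
    std::vector<CVFilter*> filters;
    std::string errorString;

    int Process()
    {
        initialInput.copyTo(input);               // copy the initial input into the input container
        CVFilter * lastProcessed = 0;
        for (size_t i = 0; i < filters.size(); ++i)
        {
            int result = filters[i]->Process(this);
            if (result == CV_ERROR)               // propagate the failing filter's error string
            {
                errorString = filters[i]->errorString;
                return CV_ERROR;
            }
            if (result == CV_IMAGE)               // image filters: output becomes the next input
                output.copyTo(input);
            lastProcessed = filters[i];
        }
        if (lastProcessed)
            lastProcessed->Paint(this);           // visualize the final (possibly non-image) result
        return 0;
    }
};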

[Figure 4 diagram: CVPipeline → CVFilter, CVFilterSetting; CVFilter subclasses: CVImageFilter, CVDataFilter, CVRenderFilter]

Figure 4: A general class hierarchy diagram of the developed filter pipeline framework. The CVPipeline may contain an arbitrary amount of CVFilters (parent class). Each CVFilter may contain an arbitrary amount of CVFilterSettings. The middle parent classes CVImageFilter, CVDataFilter and CVRenderFilter define some characteristics common to all filters derived from them.

[Figure 5 diagram: CVPipeline containing Initial Input, Input, Output, Additional output data, and the CVFilter list (CVFilter 1 ... CVFilter N)]

Figure 5: Structure of the CVPipeline. It contains image storages for input and output, a pointer to the initial input which is to be processed, the list of CVFilters which are currently added to the pipeline and additional output data types. Exactly which output data types are to be used varies with each processing filter.

3.3 Image Filters

Image filters are those filters whose input and output are images. Most of the image filters make use of OpenCV functions and thus often bear similar names. The implemented and tested image filters are as follows: Scale Up/Down, Extract Channels, Hue/Value/Saturation Filter, Canny Edge Detection, Harris Corner Detection, Gaussian Blur, Erode/Dilate, Binary Threshold, Background Removal, as well as conversion and utility filters. Filters requiring further explanation are presented in the following sub-sections. Most image filters are described both in the API [28] and documentation tutorials [29] of the image processing (imgproc) module of OpenCV.

3.3.1 Utility and Conversion filters

The utility and conversion filters include Abs, Greyscale, and Convert to Single Byte. The Abs filter calculates absolute values of each pixel's channel data and was used to make signed values positive in order to make them renderable. The Greyscale filter converts an RGB image to a single-channel image and was used in several places, since many filters require single-channel input for their calculations. The Convert to Single Byte filter was used to ensure that the data is provided in single-byte (8-bit) format for each pixel's channel, converting the entire image if necessary.
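For reference, rough OpenCV equivalents of these utility operations might look as follows; cv::convertScaleAbs conveniently combines the absolute-value and 8-bit conversion steps. The function name and parameters below are illustrative.

// Rough OpenCV equivalents of the utility filters described above.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

void utilityFilters(const cv::Mat & bgrInput, const cv::Mat & signedData)
{
    cv::Mat grey, singleByte;
    cv::cvtColor(bgrInput, grey, cv::COLOR_BGR2GRAY);  // Greyscale: 3-channel -> single-channel
    cv::convertScaleAbs(signedData, singleByte);       // Abs + Convert to Single Byte in one call
}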

3.3.2 Feature detection

There exist some general algorithms for detecting what could be relevant in an image (or 2D signal). These include, for example, the tested Harris Corner [19] and Canny Edge detection [20] algorithms. Most of these algorithms require a single-channel (greyscale) input image and output detected areas of relevance, usually edges or corners, in the form of a single-channel or binary output image. All tested feature detection methods are readily available in OpenCV, and example outputs from them can be seen in figure 6.
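A minimal sketch of the two tested feature detection calls is shown below; the parameter values are illustrative and not the settings used in the tests.

// Sketch of the two tested feature detection calls; parameter values are illustrative.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

void detectFeatures(const cv::Mat & grey)            // expects a single-channel 8-bit image
{
    cv::Mat edges;
    cv::Canny(grey, edges, 50, 150);                  // Canny edge detection with low/high thresholds

    cv::Mat corners, cornersVisible;
    cv::cornerHarris(grey, corners, 2, 3, 0.04);      // block size, aperture size, Harris k
    cv::convertScaleAbs(corners, cornersVisible);     // absolute values + 8-bit, for visualization as in figure 6
}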

[Figure 6 panels: reference image; Harris corner detection; Canny edge detection]

Figure 6: General edge/corner detection algorithms. The Harris Corner Detection has undergone thresholding using absolute values in order to visualize the results. The Canny Edge detection can be displayed straight away, but could make use of some further manipulation before the results are usable.

3.3.3 Morphology filters

Both the Erode and Dilate filters are morphological filters (from mathematical morphology). Both work by applying a matrix of a given size and shape (called the structuring element) over each pixel. The usual function of erode is to slim down edge and corner areas, and by extension it can thus also be used for noise removal. Dilate is used to expand and, by extension, smooth out areas. See figure 7 for example effects of their application.
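A sketch of how these filters might be combined into an opening operation is given below; the rectangular element shape and the sizes are illustrative.

// Morphology sketch: opening (erode followed by dilate) with a rectangular structuring element.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

void openImage(const cv::Mat & binary, cv::Mat & result, int size)
{
    cv::Mat element = cv::getStructuringElement(cv::MORPH_RECT,
                                                cv::Size(2 * size + 1, 2 * size + 1));
    cv::Mat eroded;
    cv::erode(binary, eroded, element);   // slim down areas, removing small noise
    cv::dilate(eroded, result, element);  // grow the remaining areas back out
}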

3.3.4 Channel filters

The Extract Channels, Hue, Value and Saturation filters all belong to this category, which deals with individual channels of an image. The Extract Channels filter was used to calculate the separate Hue, Value and Saturation channels from the provided RGB channels in the input image, storing them in the pipeline for future usage. The Hue, Value and Saturation filters then provide filtering capabilities for their respective channels.

[Figure 7 panels: thresholded image; dilate, size 5; erode, size 5]

Figure 7: Morphology examples, both based on the thresholded image on the left. A combination of both dilate and erode is usually used (called opening or closing depending on which is executed first).

In the presented work, the Hue filter was added in order to isolate the red hues of a hand. Since hues are calculated and treated as a circular array, the Hue filter takes this into consideration when filtering. An example of the hue values (between 0 and 255) is displayed in figure 8. Trying to filter for a specific hue close to 0 or 255 (red) will thus match values at both the top and the bottom of the spectrum (see figure 1).
The Value filter was added in order to remove unnecessary parts (e.g. dark crevices and bright direct light sources). All pixels within a given value range are painted black, leaving the rest as they are. The Saturation filter was added for a similar purpose: to cull away grey and whitish surfaces, which also have their hue value near 0 (the default value).
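A sketch of the circular hue check might look as follows, assuming the hue channel has been scaled to the 0-255 range used in the text (for 8-bit images, OpenCV's COLOR_BGR2HSV_FULL conversion provides this range). The function name and parameters are illustrative.

// Sketch of a circular hue filter on a 0-255 hue channel: pixels whose hue lies
// outside targetHue +/- hueRange (with wrap-around) are painted black.
#include <opencv2/core/core.hpp>
#include <cstdlib>

void filterHue(cv::Mat & hueChannel, int targetHue, int hueRange)
{
    for (int y = 0; y < hueChannel.rows; ++y)
    {
        for (int x = 0; x < hueChannel.cols; ++x)
        {
            int hue = hueChannel.at<unsigned char>(y, x);
            int diff = std::abs(hue - targetHue);
            if (diff > 128)                  // wrap around the circular spectrum
                diff = 256 - diff;
            if (diff > hueRange)
                hueChannel.at<unsigned char>(y, x) = 0;
        }
    }
}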

3.3.5 Background removal

Also known as motion detection, these types of filters try to assess which parts of the image belong to the static background and then filter them away, leaving only the moving parts. At its simplest this works in two steps: identifying the background and removing the background.
The background removal filter developed for this project works on the entire frame, using subtraction between the identified background and the current input and storing the absolute values of the results in the output image. Each pixel's channels are subtracted individually, meaning that it works for greyscale, RGB or any other input, but will produce slightly different results.
A simple automatic background capture algorithm was added. When active, whenever the average pixel difference between the background and the current frame falls below a certain threshold, the current frame is captured as the new background. Manually setting the background is also possible, which was used throughout most of the project. Figure 9 displays an example of using background removal on both RGB and greyscale images.
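A sketch of the subtraction and the automatic background capture described above might look as follows; the threshold handling and function signature are assumptions made for illustration.

// Sketch of the background removal filter: per-channel absolute difference,
// plus the simple automatic background capture described above.
#include <opencv2/core/core.hpp>

void removeBackground(const cv::Mat & frame, cv::Mat & background,
                      cv::Mat & output, double captureThreshold, bool autoCapture)
{
    if (background.empty())
        frame.copyTo(background);                  // first frame becomes the initial background

    cv::absdiff(frame, background, output);        // per-channel |frame - background|

    if (autoCapture)
    {
        cv::Scalar avgDiff = cv::mean(output);     // average difference per channel
        double total = avgDiff[0] + avgDiff[1] + avgDiff[2] + avgDiff[3];
        if (total < captureThreshold)              // scene nearly static: capture it as the new background
            frame.copyTo(background);
    }
}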

[Figure 8 panels: hand reference; hue channel]

Figure 8: Hue channel rendered in greyscale (0-255) after HSV channel extraction. The channel shows the reddish hue of the hand being drawn both as the lightest and darkest parts of the image. Do note the bright specular spots that are also interpreted as having the same (red) hue as the hand, prompting the need for the value filter in order to cull them away separately.

3.4 Data filters

Similar to the image filters, most of the data filters make use of their equivalents available in OpenCV. Much of the debug-rendering is also performed similarly to how it is presented in the various OpenCV tutorials. The implemented and tested data filters are as follows: Find Contours, Calc Convex Hull, Calc Convexity Defects, Hand Detector, Hough Circles/Lines, Shi-Tomasi Corners / Good Features To Track, Approximate Polygons, Finger State filter, as well as a few other filters for trying to detect boxes using lines.
Several of the data filters require binary input images. A binary image in this case refers to a single-channel image (often interpreted as greyscale), in which all non-zero pixels (usually having the value 255 for clarity) are treated as true while 0-valued pixels are treated as false. The Threshold and Greyscale image filters are often used to ensure this property.
In the following sections a few of the data filters are explained in detail. Those which were not tested thoroughly, or did not end up in any of the final pipeline configurations, have been omitted. Most functions used by the data filters are described both in the OpenCV API [28] and documentation tutorials [29] of the imgproc module of OpenCV.

3.4.1 Contour filters

OpenCV offers a multitude of functions for working with contours. All contour and convex hull related filters used here are entirely based on the functions provided by OpenCV.

[Figure 9 panels: background; visible hand; background removal; greyscale equivalent]

Figure 9: Background removal. The top two images show the identified background and the current frame on which background removal is computed. Results of the subtraction are presented in the bottom row. The left one used RGB input while the right one used greyscale equivalents before calculating the subtraction.

The Find Contours filter works by identifying interconnected areas in the provided binary image, returning a list of all identified contours. In contrast to image data, every contour is made up of the sequence of coordinates in the image which the contour spans. The Calc Convex Hull filter calculates the minimum convex hull for a target contour, and the Calc Convexity Defects filter calculates the local maximum extreme points of the difference, or distance, between the contour and its convex hull counterpart. See figure 10 for example results.
In order to reduce computation time as well as debug-rendering time, an additional filtering step is done in the Find Contours filter which culls away, or removes, those contours whose contour area is below a minimum threshold value. Without such a threshold the debugging and testing process would not be as swift. However, in order to perform this culling procedure the contour areas must also be calculated. Similarly, convexity defects with depths below a specified threshold value are also discarded (see figure 11).

The OpenCV functions are based on the following papers and algorithms: findContours on Suzuki S. [16] and Teh C.H. & Chin R.T. [17], convexHull on Sklansky J. [18], and contourArea on Green's theorem [31]. The OpenCV source code is available in their GitHub repository [30].
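A condensed sketch of this contour stage is given below. The structure of the loop and the exact order of the culling steps are assumptions, but the OpenCV calls are the ones named above.

// Sketch of the contour stage: find contours in a binary image, cull small ones,
// then compute the convex hull and convexity defects of each remaining contour.
// Note that cv::findContours may modify the input image.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

void contourStage(cv::Mat binary, double minimumArea, int minimumDefectDepth)
{
    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    for (size_t i = 0; i < contours.size(); ++i)
    {
        if (cv::contourArea(contours[i]) < minimumArea)    // cull small contours
            continue;

        std::vector<int> hullIndices;
        cv::convexHull(contours[i], hullIndices, false);   // hull as indices into the contour

        std::vector<cv::Vec4i> defects;
        cv::convexityDefects(contours[i], hullIndices, defects);

        for (size_t d = 0; d < defects.size(); ++d)
        {
            int depth = defects[d][3];                     // fixed-point depth (pixel depth * 256);
                                                           // the thesis thresholds (e.g. 20000) appear to use this raw value
            if (depth < minimumDefectDepth)                // cull shallow defects
                continue;
            // defects[d][0], [1] and [2] index the defect's start, end and farthest points
        }
    }
}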

[Figure 10 panels: binary preprocessed image; contours calculated; convex hull calculated; convexity defects calculated]

Figure 10: The above four pictures show the process of calculating the convexity defects by first calculating the contours of a binary image, discarding contours with areas below a certain threshold and then calculating the convex hull of the remaining contour. The Calc Convexity Defects filter's minimum distance threshold has been tweaked so that only the convexity defects between the fingers will be detected.

[Figure 11 panels: minimum defect depth 20000; 5000; 2000; 200]

Figure 11: Calc Convexity Defects output with varying degrees of filtering. The filtering is done based on the depth value of each convexity defect, culling those whose depth is below the specified threshold value. Depth threshold values are listed above each output.

3.4.2 Hand Detector

Using the information gained from the contour and convex hull functions, it is now possible to approximate the position and size of the hand, including the positions and visibility of any fingers. For every convexity defect, two fingers are considered to be placed, one on each side of it. Using the output of the bottom right picture of figure 10, we would thus gain one thumb, one pinky and two fingers for each of the middle three fingers. After all proposed fingers are created, a simple check is done between all preliminary fingers which merges them if they are close enough to each other (if the distance between them is below a minimum finger distance threshold value). Algorithm 1 describes the processing of the Hand Detector filter in pseudo-code form.

Algorithm 1 The Hand Detector's algorithm in its entirety.

Clear any hands stored in the pipeline since the last iteration.
for all contours in the pipeline do
    Calculate the contourArea of the contour.
    if the contourArea is below the minimumArea threshold then
        Skip this contour and go to the next iteration of this outermost loop.
    end if
    Create a new hand.
    Calculate the center and bounding rectangle of the contour and store them (as well as the contourArea) in the hand for future use.
    if there are any detected convexityDefects associated with the contour then
        Set numberOfCreatedFingers to 0.
        for all convexityDefects do
            Create one finger at the start of the defect and add it to the hand.
            Create one finger at the end of the defect and add it to the hand.
            Increment numberOfCreatedFingers by 2.
        end for
        for all j such that 0 ≤ j < numberOfCreatedFingers do
            for all k such that j < k < numberOfCreatedFingers do
                Calculate the distance between the jth and kth fingers.
                if the distance is below the minimumFingerDistance threshold then
                    Remove the kth finger.
                    Decrement numberOfCreatedFingers by 1.
                    Ensure that the next finger (same k index due to the removal) is compared next iteration.
                end if
            end for
        end for
    end if
    Add the finished new hand to the list of hands in the pipeline.
end for

Additionally, a Hand Persistence filter was created which tries to ensure that the detected fingers are relevant. Without this filter the arm, if visible, could potentially be detected as a sixth, false finger (see the bottom part of figure 14 for an example).
The Hand Persistence filter starts by calculating the relative directions of all fingers compared to the center of the contour. Once all relative directions have been calculated, the average direction of all fingers is calculated. After this, a comparison is made between the average direction and each finger's relative direction. If the finger's relative direction is not aligned with the average (using the dot product, asserting that the value is positive), the finger is regarded as a false positive and is discarded.
The OpenCV functions for calculating the center of the contour (the moments of the contour) are based on Green's theorem [31]. Sample output of the Hand Detector and Hand Persistence filters can be seen in figure 12.
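A sketch of the Hand Persistence check might look as follows; the container types and the function name are assumptions made for illustration.

// Sketch of the Hand Persistence check: discard fingers whose direction from the
// contour center is not aligned (positive dot product) with the average finger direction.
#include <opencv2/core/core.hpp>
#include <vector>
#include <cmath>
#include <algorithm>

void cullMisalignedFingers(std::vector<cv::Point2f> & fingers, const cv::Point2f & handCenter)
{
    std::vector<cv::Point2f> directions;
    cv::Point2f average(0.f, 0.f);
    for (size_t i = 0; i < fingers.size(); ++i)
    {
        cv::Point2f dir = fingers[i] - handCenter;
        float length = std::sqrt(dir.x * dir.x + dir.y * dir.y);
        if (length > 0.f)
            dir *= 1.f / length;                   // normalize the relative direction
        directions.push_back(dir);
        average += dir;
    }
    average *= 1.f / std::max<size_t>(fingers.size(), 1);

    for (size_t i = 0; i < fingers.size(); /* incremented below */)
    {
        float dot = directions[i].x * average.x + directions[i].y * average.y;
        if (dot <= 0.f)                            // not aligned with the average: regard as false positive
        {
            fingers.erase(fingers.begin() + i);
            directions.erase(directions.begin() + i);
        }
        else
            ++i;
    }
}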

Figure 12: Results of the Hand Detector filter based on convexity defects from the output of the Calc Convexity Defects filter (see figure 10). The background has been replaced by the original input image to emphasize the possibilities of augmented reality, even with no actual content being rendered yet.

3.4.3 Finger State filter

Based on the output of the Hand Detector and Hand Persistence filters, the Finger State filter takes note of and calculates the current and previous states which the hand has held, in terms of the number of visible fingers.
The Finger State filter was developed in order to detect finger gestures in a more robust manner. As such, it contains thresholds for things such as minimum state duration, which helps cull away false positives due to noise or involuntary movement. It makes use of a structure called the FingerState, which has the following variables: the number of visible fingers, start time, stop time, duration and a variable currently called processed, which records whether this finger state has been reacted to by the following filters. A list of known past finger states is then kept in order to track past detected finger states, so that more advanced gesture analysis can be performed later on.
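A minimal sketch of the FingerState structure might look as follows; the member types are assumptions, as the text does not specify them.

// Sketch of the FingerState structure described above; member types are assumptions.
#include <vector>

struct FingerState
{
    int fingers;          // number of visible fingers in this state
    long long startTime;  // when the state began (e.g. in milliseconds)
    long long stopTime;   // last time the state was still observed
    long long duration;   // stopTime - startTime
    bool processed;       // whether a later filter has already reacted to this state
};

// The filter keeps a bounded history of past states for later gesture analysis.
typedef std::vector<FingerState> FingerStateList;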

Each frame, the Finger State filter looks at the primary visible hand (it assumes there is but one relevant hand) and its number of fingers (fingersNow). If the current number of fingers is the same as that of the last added finger state, that state's duration and stop time get updated using the current time (now).
If the past finger state was not adjusted, but the finger count is the same as in the previous iteration (fingersLastFrame), a new finger state may be warranted. The current finger count's duration is calculated and compared to the minimum duration value (minimumDuration). If the current finger count's duration exceeds the minimum duration, a new finger state is created. The new finger state gets its number of fingers set, and its start time is set to the previously recorded start time (lastFingerStartTime). The new finger state is then added to the list of known finger states (fingerStates), and if needed the oldest finger state is removed from the same array. If the finger count differs between this and the previous iteration, the time of the change is recorded (lastFingerStartTime).
After the changes above (if any) are made to the list of known finger states, a synchronization procedure is run which copies the value of the processed variable of each finger state in the pipeline to its counterpart within the Finger State filter. After the synchronization is done, the list in the pipeline is cleared and replaced with the new, updated list from the filter. Lastly, the fingersLastFrame variable is set to fingersNow in preparation for the next iteration.
Algorithms 2 and 3 describe the Finger State filter's processing in pseudo-code form. Figure 13 displays example input and output of the Finger State filter in terms of the detected number of fingers.

Figure 13: Example statistical output data from the Finger State filter. Above: fingers detected from the Hand Detector combined with the Hand Persistence filter. Below: fingers as specified in the latest detected finger state output of the Finger State filter. The finger state’s duration is color-coded, going from yellow to green as the duration increases. Each pixel corresponds to one frame.

Algorithm 2 The Finger State filter algorithm (part 1 of 2): Updating and adding new finger states.

Variables on creation of the filter: the empty list of fingerStates, fingersLastFrame = 0, lastFingerStartTime.
Settings set by the user: minimumDuration, maxStatesStored.

Set hand to be the first found hand in the pipeline.
Save the current time to now.
Set fingersNow to be the amount of fingers of the current hand.
Set lastFingerState to 0.
if fingerStates contains any valid states then
    Set lastFingerState to the last added state in fingerStates.
end if
if lastFingerState is a valid finger state and lastFingerState's fingers = fingersNow then
    Update lastFingerState's duration and stop-time using now and its start-time.
else if fingersNow = fingersLastFrame then
    Calculate the duration using now and lastFingerStartTime.
    if duration > minimumDuration then
        Create a new finger state, newState.
        Set the start time of newState to lastFingerStartTime.
        Set the fingers of newState to fingersNow.
        Set the duration of newState to duration.
        Add newState to the list of fingerStates.
        if the size of the fingerStates list exceeds maxStatesStored then
            Remove the oldest added state (index 0 if appending) from the fingerStates list.
        end if
    end if
else
    Set lastFingerStartTime to now.
end if

Algorithm 3 The Finger State filter algorithm (part 2 of 2): Synchronization of processed flags and copying the updated list to the pipeline.

for all fingerStates do
    for all fingerStates in the pipeline from the last iteration do
        if a fingerState is identified to be present in both the last iteration and the current iteration then
            Copy over the processed flag from the pipeline version to the local version, since we will overwrite the pipeline list again later.
        end if
    end for
end for
Clear the list of fingerStates present in the pipeline.
for all states, fingerState, in fingerStates do
    if the duration of fingerState exceeds minimumDuration then
        Add a copy of fingerState to the list of fingerStates present in the pipeline.
    end if
end for

3.4.4 Box detection filters

Based on the output of the Hough Lines filter (see figure 16, picture 3), an approach to detect boxes was developed. The main idea was to first isolate near-horizontal lines by filtering all lines by their relative angles, merging them, and then trying to detect pairs of lines or line groups which together span quads or rectangles. Figure 16 shows the workflow which was tested when developing these filters.
The Filter Lines by Angle filter calculates the relative angle (in degrees) for all lines, using the cyclic region between 0 and 180. Values near 0 and 180 correspond to horizontal lines, while values near 90 correspond to vertical lines. As the lines do not have explicit start or end points, this angle filtering fulfils its purpose as intended.
The Merge Lines filter uses a brute-force comparison of all detected lines in order to decide whether they should be merged. First, an angle check is done, where the normalized directions of the lines are compared. If the comparison (a dot product) is below a certain threshold value (currently hard-coded to 0.95), the lines are not considered parallel enough. After that, a simple overlap check is done to ensure that the lines overlap by at least a few pixels along the X-axis. Lastly, a check is done comparing the distance in height (Y) between the lines' starting points against a maximum distance value set by the user. A sketch of this pairwise test is given below.
The Find Quads filter looks through the list of lines available in the pipeline, looking for pairs of lines. Each time it finds a pair, it calculates a minimum bounding quad for it and saves it in the pipeline's list of detected quads.
Additionally, a Quad Aspect Ratio Constraint filter and a Max Box Persistence filter were developed and tested in order to try to improve the box detection approach by filtering away quads with unwanted aspect ratios and smoothing out the results over time.
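The sketch of the Merge Lines pairwise test referred to above might look as follows; the Line type and the function name are assumptions made for illustration.

// Sketch of the Merge Lines pairwise test: lines are candidates for merging only if
// they are nearly parallel, overlap along X, and are close in Y.
#include <opencv2/core/core.hpp>
#include <algorithm>
#include <cmath>

struct Line { cv::Point2f start, stop; };   // assumed line representation (two endpoints)

bool shouldMerge(const Line & a, const Line & b, float maxYDistance)
{
    cv::Point2f dirA = a.stop - a.start, dirB = b.stop - b.start;
    float lenA = std::sqrt(dirA.x * dirA.x + dirA.y * dirA.y);
    float lenB = std::sqrt(dirB.x * dirB.x + dirB.y * dirB.y);
    if (lenA <= 0.f || lenB <= 0.f)
        return false;
    float dot = (dirA.x * dirB.x + dirA.y * dirB.y) / (lenA * lenB);
    if (std::abs(dot) < 0.95f)                       // not parallel enough (hard-coded as in the text)
        return false;

    float aMinX = std::min(a.start.x, a.stop.x), aMaxX = std::max(a.start.x, a.stop.x);
    float bMinX = std::min(b.start.x, b.stop.x), bMaxX = std::max(b.start.x, b.stop.x);
    if (aMaxX < bMinX || bMaxX < aMinX)              // no overlap along the X-axis
        return false;

    return std::abs(a.start.y - b.start.y) < maxYDistance;  // close enough vertically
}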

3.4.5 Approximate polygons

The Approximate Polygons filter was developed in order to detect or approximate arbitrary quads in the scene. The filter makes use of the OpenCV function approxPolyDP, which is based on the Ramer-Douglas-Peucker algorithm [21] and works on any series of points. After the approximation is done, a culling procedure is executed where only polygons with a set number of vertices are retained, in order to increase performance and ease debug-visualization.
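A sketch of this approximation and culling step might look as follows; the epsilon choice relative to the contour perimeter is an assumption, as the text does not specify it.

// Sketch of polygon approximation with vertex-count culling, using cv::approxPolyDP.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

std::vector<std::vector<cv::Point> > approximatePolygons(
    const std::vector<std::vector<cv::Point> > & contours, int wantedVertices)
{
    std::vector<std::vector<cv::Point> > polygons;
    for (size_t i = 0; i < contours.size(); ++i)
    {
        std::vector<cv::Point> approx;
        double epsilon = 0.02 * cv::arcLength(contours[i], true);  // tolerance relative to the perimeter (illustrative)
        cv::approxPolyDP(contours[i], approx, epsilon, true);
        if ((int)approx.size() == wantedVertices)                  // keep only e.g. 4-sided polygons
            polygons.push_back(approx);
    }
    return polygons;
}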

3.5 Render Filters

The Render filters were developed primarily in order to save image or video sequences of the pipeline steps, or to present demonstrative applications based upon the data provided by previous data filters. As such, a few end-user applications were developed and tested for their general viability.

3.5.1 Image gallery

The image gallery filter was developed during the presented work to add actual content onto the analysed surfaces and react to input. It works by specifying a directory from which it is to display pictures, and then changes the rendered picture whenever certain finger-based input is detected.
The implementation discussed here has three notable states depending on the number of detected fingers: 0, 4 and 5. When no finger is detected, the gallery will auto-play, switching pictures at a set interval. When the finger count is 5, it is prepared to switch pictures, but will only do so when the finger count of the next frame is 4. This can be compared to a "clicking" action similar to computer mouse usage, which usually assumes a rapid movement including flexion and extension, but in this case any finger may be used in the gesture. Since the interaction is only based on the number of visible fingers, other gestures not usually associated with human-computer interaction are possible. These include movements such as two or more fingers adducting and abducting, or turning the hand as a whole to conceal or reveal fingers. Another "click" action works similarly, but when decreasing the finger count from 4 to 3, and instead of going to the next picture it goes to the previous one. The image gallery is always projected onto the detected/calculated center of the hand contour. See figure 14 for demonstration screenshots.
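A sketch of the finger-count reactions described above might look as follows; the function signature and variable names are assumptions made for illustration.

// Sketch of the image gallery's finger-count reactions: auto-play at 0 fingers,
// "click" forward on a 5 -> 4 transition, backwards on a 4 -> 3 transition.
void reactToFingers(int fingersNow, int fingersLastFrame,
                    bool & autoPlay, int & pictureIndex)
{
    autoPlay = (fingersNow == 0);                  // no fingers detected: keep cycling pictures
    if (fingersLastFrame == 5 && fingersNow == 4)
        ++pictureIndex;                            // "click": go to the next picture
    else if (fingersLastFrame == 4 && fingersNow == 3)
        --pictureIndex;                            // reverse "click": go to the previous picture
}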

3.5.2 Movie Projector

The Movie Projector filter is based upon the output of the Approximate Polygons filter, but can be extended to support other data filter outputs. The implementation discussed here checks for the best polygon detected, using the number of sides and the contour area as criteria, favoring quadratic surfaces with the highest possible contour area. After locating a suitable surface, it then synchronizes movie content to be projected onto that surface.
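A sketch of how the best polygon might be selected under these criteria is given below; the function name and signature are assumptions.

// Sketch of selecting the best detected polygon: favor 4-sided polygons with the largest area.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

int bestPolygonIndex(const std::vector<std::vector<cv::Point> > & polygons)
{
    int bestIndex = -1;
    double bestArea = 0.0;
    for (size_t i = 0; i < polygons.size(); ++i)
    {
        if (polygons[i].size() != 4)              // favor quadratic (4-sided) surfaces
            continue;
        double area = cv::contourArea(polygons[i]);
        if (area > bestArea)                      // pick the largest suitable surface
        {
            bestArea = area;
            bestIndex = (int)i;
        }
    }
    return bestIndex;                             // -1 if no suitable surface was found
}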

3.5.3 Music player

The music player filter was developed to show further interaction possibilities based on finger and hand input. In contrast to the Image Gallery filter, it makes use of input which has been calculated by the Finger State filter and uses another interaction paradigm altogether.
The detected hand's position is used to control volume. The volume is tied directly to the relative height (Y-position) of the hand, so lifting or lowering the hand while it is detected adjusts the volume accordingly. In order to adjust playback, the last added finger state is examined. Instead of using a specific "click" action as in the Image Gallery, the music player reacts when a certain state is held statically for a specified time period. The three actions implemented are play/pause, go to next track and go to previous track, bound to finger counts of 5, 3 and 4 respectively. All three actions are set to react when their state has been held for a duration of 2 or more seconds, as 2 seconds proved to be convenient for usability.
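A sketch of the duration-based reactions described above might look as follows; the enum and function names are assumptions made for illustration.

// Sketch of the music player's reaction to the latest finger state: an action triggers
// once a state has been held for 2 seconds and has not been processed yet.
enum PlayerAction { NO_ACTION, PLAY_PAUSE, NEXT_TRACK, PREVIOUS_TRACK };

PlayerAction reactToFingerState(int fingers, long long durationMs, bool & processed)
{
    if (processed || durationMs < 2000)   // 2 seconds proved convenient in the tests
        return NO_ACTION;
    processed = true;                     // react only once per held state
    switch (fingers)
    {
        case 5:  return PLAY_PAUSE;
        case 3:  return NEXT_TRACK;
        case 4:  return PREVIOUS_TRACK;
        default: return NO_ACTION;
    }
}

// Volume follows the hand's relative height: 0 at the bottom of the frame, 1 at the top.
float volumeFromHandY(float handY, float frameHeight)
{
    return 1.0f - handY / frameHeight;
}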

[Figure 14 panels, top row: 5 fingers, ready state; 4 fingers, "click"; 0 fingers, autoplay. Bottom row: 6-finger problem; hand persistence fix]

Figure 14: Screenshots of the image gallery texture being rendered and interacted with. Above: Three identified finger states are implemented. Below: Visualization of extra finger problem and solution when using the Hand persistence filter.

4 Test results

The main filter pipeline configurations that were tested were two variations of hand detection and two variants of quad or box detection. Three demonstrative end-user applications were tested in total.
The results section contains experimental test data, including notes about the various problems encountered while trying to achieve good-quality output. The tests focus mainly on performance due to the requirements of real-time usage. For one, it is crucial to usability and the user experience as a whole. Secondly, these tests are meant as a precursor and comparison basis for future tests on mobile systems and in embedded environments.
Two good marks are the 30 and 60 frames per second goals. Any frame rate at or above roughly 30 appears smooth to the human eye, and 60 frames per second is the de facto standard in modern games featuring rapid gameplay and interaction. In order to meet these requirements, the processing time of an entire application as a whole must stay below 33.33 or 16.67 milliseconds per frame, respectively.

4.1 Hardware setup

The hardware setup was simple and consisted of a laptop and a web camera. Only indoor environments were tested. The laptop technical specifications were as follows:

Statistic        User Device 1
Brand            Lenovo ThinkPad
OS               Windows 7 Professional
Processor        Intel i5, 2.6 GHz
RAM              4 GB
Architecture     64-bit
HDD              300 GB
Notes            Harddrive encryption

Webcam technical specifications were as follows:

Statistic                        Input Device 1
Brand                            Trust Spotlight webcam pro
Hardware resolution              1280 x 1024 (1.3 megapixels)
OpenCV webcam input resolution   640 x 480
Frames per second                30 / 15 (Release / Debug)

The frames per second are based on the number of frames retrieved by OpenCV's VideoCapture functionality when running the configurable filter pipeline application.

4.2 General testing information

In the following sections results are presented, including the input/output of the various stages of processing, the filter settings used and the processing time durations. Results are presented for running the application in both debug and release mode. The measurements were taken over an interval of at least a few seconds and at most a few minutes. The rounded average over 30 calculations is presented.
One thing to note is that only the processing time for the filter pipeline has been studied. Time consumption for rendering output is application dependent and as such was skipped. During all the tests the user device was running several other applications at the same time in order to simulate end-user performance (web browser, file browser, document writer, spreadsheet editor, etc.), with a physical memory usage of around 75% and an average CPU usage below 3% before launching the developed test application.
As none of the filters provide adaptive settings, all settings had to be tweaked at runtime in order to function properly whenever the environment or lighting changed beyond a certain extent. Due to this, single still images were used as test subjects for all conducted tests detailed in the following sections. All tests were conducted using 640x480 resolution still frames provided by the web camera.

4.3 Hand detection, hue filtering

The first tested approach for hand detection was done via hue filtering. The most important pipeline parts here were the Hue filter and the Value filter (for culling bright specular spots), which worked in several different scenarios. Table 1 shows the steps involved from initial input to final output. Additionally, the Saturation filter could be used to cull away grey areas showing slightly red hues.
The biggest drawback of this method is that any reddish hue will be recognized as a false positive. This was more obvious in a lively environment than in the minimalist testing space used throughout most of the project. Examples of false positives include red plastics and labels on packaging, towels and cloth, and even orange-colored objects when the filter was not configured properly.
The most relevant feature of the hue filtering is that it is less sensitive to changes in the background, and by extension could possibly work well even with a highly mobile camera. Depending on where and how the user is positioned relative to the input, further requirements might be necessary for a good user experience. For example, wearing a T-shirt that exposes parts of the arm may yield false positive finger detections, which could be avoided by limiting the area which is analysed.
A demonstration video of the hue filtering can be seen here: https://www.youtube.com/watch?v=n8hLWkn1Qdg.

Relevant filter                     Settings                               Processing time (release)   Processing time (debug)
Extract channels                    -                                      4 386 µs                    14 526 µs
Hue filter                          Target hue = 10, Hue range = 25        977 µs                      39 500 µs
Value filter                        Target value = 255, Scope/range = 20   692 µs                      1 496 µs
Erode                               Type = 0, Size = 5                     635 µs                      3 399 µs
Dilate                              Type = 0, Size = 7                     387 µs                      4 688 µs
Find contours                       Minimum area = 5000                    705 µs                      4 379 µs
Calculate convex hull               -                                      178 µs                      303 µs
Calculate convexity defects         Minimum defect depth = 20000           85 µs                       89 µs
Hand detector (convexity defects)   Minimum finger distance = 50           53 µs                       1 314 µs
Hand persistence                    -                                      1 µs                        11 µs
Finger State filter                 -                                      4 µs                        15 µs
Total                                                                      8 102 µs                    69 669 µs

Table 1: Hand detector using hue filtering: filter order, settings and processing times (see section 4.2) in both release and debug builds. Total pipeline time consumption, including data transfers from output to input, was 8 440 µs in release mode and 70 314 µs in debug.

4.4 Hand detection, background subtraction

The second tested approach for hand detection was based on background removal. The aim was to identify a hand, similar to what was presented in section 4.3, but using a single-channel greyscale input. As can be seen in table 2, this approach also achieves an approximate detection of the hand.
The biggest drawback of this method lies in the impact of the choice of background. Using a flat background yields more consistent results, while a structured one will yield artifacts in the regions where the background shifts in value. Dilation can be used to some extent to counteract this problem. An additional issue can occur if the value or brightness of the background is close to that of the hand. The subtraction will then yield values near 0, which in turn causes the area to yield false negatives, which have to be worked around using some additional processing.
A second, important drawback of this technique is that light sources will affect the outcome considerably. If any visible shadows are created or otherwise changed, these will yield false positives, as the subtraction in the used implementation only measures the magnitude of the difference, not whether the difference is positive or negative.
Due to the nature of background subtraction, this technique is also vulnerable to changes in the background. A choice must be made as to whether the detected background should be updated automatically or manually. Manual background capture was used throughout the tests, but it assumes a stationary camera. Automatic background capture could possibly be used, but poses additional problems. Consider a moving hand that stops in front of the camera in order to have content displayed onto it. Due to the limited movement of the hand, and thus little change in pixel values, the background removal filter captures the current input as the new background. When this happens the hand detection has to take this into consideration. It could either assume that the hand is in the same place as it was before, or that it has disappeared. Both of these possibilities cause difficulties, as the first one requires other algorithms for detecting the hand and the second one interrupts user interaction. Figure 15 illustrates some of the problems of using automatic motion detection.
A demonstration video of the background subtraction can be seen here: https://www.youtube.com/watch?v=wOvQn1GkUlw.

4.5 Hand interaction applications

Using the output of the hand- and finger-detection filters, two example render filters and end-user applications were developed: an Image Gallery and a Music Player. The image gallery and music player filters are described in sections 3.5.1 and 3.5.3. Both of these applications can be built on top of any working hand and finger detection algorithm.
The Image Gallery only uses the finger output straight from the Hand Detector and Hand Persistence filters, while the Music Player is based on the output of the Finger State filter (see section 3.4.3). In addition, they make use of two different paradigms for interacting with the applications.

Relevant filter                     Settings                          Processing time (release)   Processing time (debug)
Greyscale                                                             823 µs                      1 515 µs
Background removal                                                    435 µs                      2 025 µs
Threshold                           Threshold = 5                     39 µs                       138 µs
Erode                               Size = 1                          157 µs                      1 100 µs
Dilate                              Size = 3                          169 µs                      2 149 µs
Find contours                       Minimum area = 1000               732 µs                      5 660 µs
Calc Convex Hull                                                      206 µs                      399 µs
Calc Convexity defects              Minimum defect depth = 22000      32 µs                       103 µs
Hand detector (defects)             Minimum finger distance = 50      76 µs                       1 492 µs
Hand persistence                                                      4 µs                        9 µs
Finger State filter                                                   4 µs                        20 µs
Total                                                                 2 864 µs                    14 611 µs

Table 2: Hand detector using background removal: filter order, settings and processing times (see section 4.2) in both release and debug builds. Total pipeline time consumption, including data transfers from output to input, was 3 351 µs in release mode and 15 095 µs in debug.

[Figure 15 panels, left to right: Regular background subtraction | New background captured | Movement]

Figure 15: Issues with automatic background detection. The left image shows regular background subtraction of a manually captured background. Note the lower right corner, which had a black cloth in the background, causing the arm to appear as it would in a regular greyscale image. When the hand is present in the background, it is captured yet again. The middle image illustrates the contours of the hand being the most visible part, due to involuntary movement of the hand while held in a stationary position. The third image illustrates the problem of detecting the current hand position while it is in movement after it has once been captured as part of the background. The end result makes it look like two hands are visible at the same time.

The usability of both these applications was tested empirically by a small number of colleagues. The interaction approaches seemed viable, but it became clear that some kind of smoothing filter, similar to the Finger State filter, is required for a good user experience. The interaction with the Music Player is much more stable, since the culling of short-lived finger states removes noise in the hand detection. After comparing the user experiences it became obvious that the Finger State filter should be applied before any end-user application interprets hand input.
The interaction in the Music Player where the user controls the volume using the general Y-position, or height, of the hand proved very enticing, and could possibly be combined with other interactions or adjusted to modulate other variables. Demonstration videos for the image gallery and music player tests can be found at the following locations: https://www.youtube.com/watch?v=Ewro1-PURYc and https://www.youtube.com/watch?v=El2Y7qmtmSk.
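As an illustration of the volume interaction, the sketch below maps the detected hand's Y-position to a playback volume with simple frame-based smoothing. The HandState structure, the normalization and the smoothing constant are illustrative assumptions, not the actual Music player filter.

```cpp
// Sketch of mapping the detected hand's Y-position to a playback volume,
// as in the Music player interaction described above. The HandState struct,
// smoothing constant and volume range are illustrative assumptions.
#include <algorithm>

struct HandState
{
    bool present = false;
    float y = 0.f;          // hand centre, normalized to [0, 1], 0 = top of image
};

class VolumeByHandHeight
{
public:
    // Returns a volume in [0, 1]; smoothing suppresses frame-to-frame jitter.
    float Update(const HandState & hand)
    {
        if (hand.present)
        {
            float target = 1.f - std::clamp(hand.y, 0.f, 1.f);  // higher hand = louder
            smoothedVolume += (target - smoothedVolume) / smoothingFrames;
        }
        return smoothedVolume;
    }
private:
    float smoothedVolume = 0.5f;
    float smoothingFrames = 10.f;
};
```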

4.6 Box detection

The other main goal, besides detection of and interaction with hands, was to find apt surfaces to project content onto, as this is central in augmented reality applications. Since users are used to handling frames, a natural approach is to look for rectangular surfaces, or frames, onto which content can be projected.
The first attempt at detecting these frames was the Box detection pipeline, which was based on Harris Corner Detection, Hough Lines Detection and then custom filters which try to find or approximate suitable quads or rectangles in the scene (see section 3.4.4).

The presented implementation does not work as intended, as it yields too varied results for each frame. The pre-processing steps, up to and including the gathering of all near-horizontal lines, work as intended, but the merging procedure done in the Merge Lines filter would require more work before the pipeline as a whole becomes usable. Due to this instability, no performance tests were done for this pipeline configuration.
Figure 16 shows the various stages of box detection in a frame where it is working as intended. A demonstration video of the pipeline configuration can be seen here: https://www.youtube.com/watch?v=XzyrLVPypfM.

[Figure 16 stage labels: Greyscale & Gaussian Blur; Harris Corner Detection, Abs & Convert to single byte & Threshold; Hough Lines Detection; Filter Lines by Angle; Merge Lines; Find Quads]

Figure 16: Sample output stages for the Box detection pipeline when working as intended. A rectangular mobile phone with rounded corners was used for testing purposes.
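For reference, a bare-bones sketch of the line-extraction half of such a pipeline is given below. It swaps the Harris-corner preprocessing for a plain Canny edge map, omits the Merge Lines and Find Quads steps entirely, and all thresholds are illustrative assumptions rather than the settings used in the project.

```cpp
// Bare-bones sketch of the edge/line extraction part of a box detection
// pipeline: greyscale, blur, edge detection, Hough line detection and
// filtering lines by angle. Harris corners, Merge Lines and Find Quads are
// omitted; all thresholds are illustrative assumptions.
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

std::vector<cv::Vec4i> NearHorizontalLines(const cv::Mat & bgrFrame)
{
    cv::Mat grey, edges;
    cv::cvtColor(bgrFrame, grey, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(grey, grey, cv::Size(5, 5), 1.5);
    cv::Canny(grey, edges, 50, 150);

    std::vector<cv::Vec4i> lines, kept;
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180, 50, 30, 10);

    // Filter Lines by Angle: keep lines within roughly 15 degrees of horizontal.
    for (const cv::Vec4i & l : lines)
    {
        double angle = std::atan2(double(l[3] - l[1]), double(l[2] - l[0])) * 180.0 / CV_PI;
        if (std::abs(angle) < 15.0 || std::abs(std::abs(angle) - 180.0) < 15.0)
            kept.push_back(l);
    }
    return kept;
}
```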

4.7 Polygon detection, Movie projector

After some further experimentation, another method for frame detection was tested, this time using polygon approximation. On top of the output of the polygon detection, a video projection filter was developed and tested.
Without any smoothing algorithm the end results were shaky. The polygon detection would often yield different results for each frame, making the video move position accordingly. Both noise in the input and involuntary movement contributed to this problem.
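The detection itself follows the filter order listed in Table 3. A condensed sketch of that chain, using the table's settings where they are given and illustrative assumptions elsewhere, could look as follows:

```cpp
// Sketch of polygon-approximation frame detection (cf. Table 3): Greyscale,
// Canny, Dilate, Find contours, Approximate polygons, then keep 4-cornered
// convex polygons as candidate projection surfaces. The quad-acceptance
// criterion and anything not listed in Table 3 are illustrative assumptions.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<std::vector<cv::Point>> FindQuads(const cv::Mat & bgrFrame)
{
    cv::Mat grey, edges;
    cv::cvtColor(bgrFrame, grey, cv::COLOR_BGR2GRAY);
    cv::Canny(grey, edges, 29, 29 * 3, 3);               // edge threshold = 29, ratio = 3, kernel = 3
    cv::dilate(edges, edges, cv::Mat(), cv::Point(-1, -1), 1);

    std::vector<std::vector<cv::Point>> contours, quads;
    cv::findContours(edges, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    for (const auto & contour : contours)
    {
        if (cv::contourArea(contour) < 5000)              // minimum area = 5000
            continue;
        std::vector<cv::Point> poly;
        cv::approxPolyDP(contour, poly, 20, true);        // epsilon = 20, closed polygon
        if (poly.size() == 4 && cv::isContourConvex(poly))
            quads.push_back(poly);
    }
    return quads;
}
```

Any returned quad can then be used as the target surface for the Movie projector render filter.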

Adding a simple smoothing algorithm (see equation 1) helped smooth out both the size and the position of the video. The number of frames to smooth over, m, affects both how fast the video is positioned correctly and how smooth it will remain. A smoothing frame count of 10 seemed to be a viable amount, but a higher number could be used with fixed-position cameras.

f(xi) xi f(xi) − + when i> 0 f(xi+1)=  m m (1) 0 when i ≤ 0  Results of the filters, settings and processing times for the Polygon Detection and Movie Projector pipeline can be seen in table 3. The first frame’s time-data was skipped when measuring due to it including a one-time procedure required by the Movie Projector filter in order to set up and start playback. An early demonstra- tional video of the movie projector pipeline before the smoothing and video aspect-ratio scaling was added can be found here: https://www.youtube.com/watch?v= PLJBoYtTrxM.

Relevant filter                     Settings                                     Processing time (release)   Processing time (debug)
Greyscale                                                                        741 µs                      1 687 µs
Canny Edge detection                Edge threshold = 29, Ratio = 3,              3 396 µs                    7 423 µs
                                    Kernel size = 3
Dilate                              Size = 3                                     220 µs                      2 499 µs
Find contours                       Minimum area = 5000                          929 µs                      9 686 µs
Approximate polygons                Epsilon = 20                                 378 µs                      911 µs
Movie projector, polygon                                                         13 µs                       1 798 µs
Total                                                                            5 689 µs                    24 004 µs

Table 3: Movie projector using polygon detection: filter order, settings and processing times (see section 4.2) in both release and debug builds. Total pipeline time consumption, including data transfers from output to input, was 6 063 µs in release mode and 24 510 µs in debug. A rectangular mobile phone with rounded corners and the open movie project Big Buck Bunny were used for testing purposes [32].

5 Discussion

5.1 Conclusion

An introduction has been given to the subjects of image analysis, gesture recognition and augmented reality. Several processing filters have been described, and the results of experimental use-case tests have been presented.
The results show that OpenCV and the developed framework, integrated into the game engine, form a stable tool set for working in this field. The results of the various use-case tests give an overview of a few solutions, and both output and processing times are presented for each filter. The hue-filtering and background-removal processes have proven suitable for attaining augmented reality using a simple RGB web camera (each with its respective weaknesses), and it is shown that the time required by the entire process is viable for use in real-time scenarios.

5.2 Reflection

Aiming at end-user demonstration applications helped significantly to emphasize the requirements of correctness and smoothing, both on input and output. Non-smooth output will make the user lose focus and/or tire of the application, while an unstable input mechanism will sometimes produce false input or none at all.
When working with end-users in mind, one must also take into consideration the calculation time that will be required to run the application and render results to the user. Considering a gaming application running at 60 frames per second (16.67 milliseconds per frame), the presented techniques might work out decently, since they had calculation times of around 8, 3 and 6 milliseconds respectively for the conducted tests. However, this is when running a C++ build in release mode on a machine with significant processing power and no limitations on battery energy consumption. Targeting a mobile platform could incur significant restrictions on the processing speed.
One solution to such restrictions could be to scale down the input image (see the sketch below), which decreases calculation times significantly. However, as the resolution of the processed input is decreased, so is the resolution of the detected gestures and of the output.
Before one can aim at highly interactive low-latency applications, both input and output latencies must be low enough. The web camera used throughout this project produced only 15 or 30 frames per second (depending on the build). The latency of the projector would also have to be taken into account if one is to be used in a final application. Based on common empirical knowledge, 30 frames per second is good enough for most non-game end-user applications, and would allow for a maximum processing time of less than 33.33 milliseconds per frame (rendering and application processing time included).
Besides the approach of using Canny Edge Detection for polygon approximation, a custom approach for box/quad detection was tested, using Harris Corner Detection, Hough Lines Detection and several custom filters. The polygon approximation seems to work better in the presented implementation, but both could possibly be improved further.
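As a sketch of the input-downscaling idea mentioned above (the scale factor of 0.5 and the interpolation mode are arbitrary illustrative choices, not settings used in the project):

```cpp
// Sketch of downscaling the camera input before it enters the pipeline.
// The 0.5 scale factor is an arbitrary illustrative choice.
#include <opencv2/opencv.hpp>

cv::Mat DownscaleInput(const cv::Mat & frame, double scale = 0.5)
{
    cv::Mat small;
    cv::resize(frame, small, cv::Size(), scale, scale, cv::INTER_AREA);
    return small;   // detected coordinates must be scaled back up by 1/scale
}
```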

Due to the nature of end-user applications, the time consumption of the render filters was not studied thoroughly. The reasoning behind this is that it depends on the choice of content to present, for example decoding and rendering a high-resolution movie versus streaming only audio. However, if targeting multiple platforms (e.g. mobile) it could be advisable to test playback processing times too, in order to gauge overall performance.
It should be made clear that the filter configurations presented here are by no means the only possible ones, but are meant to give an overview of the possibilities for future work in the field.

5.3 Future work

As mentioned in the reflection, smoothing algorithms and better input analysis functions are required. The concept of adaptive filters and smoothing/estimation techniques such as Kalman filtering [15] is highly relevant for this and should be one of the first features to test before aiming toward any end-user application (a minimal sketch is given at the end of this section).
For skin detection there already exists research on other approaches that might be viable. One example is the method presented by Kawulok et al. [11], which combines spatial analysis and a local skin color model based on detected facial regions for adaptive skin modelling.
For background removal and motion detection, a possible next approach could be to add the features of the technique described by Barnich and Van Droogenbroeck [13], which offers motion propagation. This would be relevant in order to add movement capabilities to the camera in the background subtraction pipeline. The approach presented by KaewTraKulPong and Bowden [12] would also be of interest, as it offers a way to deal with shadows.
One highly relevant field of research is coping with input that has been modified by the projection itself. After initial analysis of the detected image, the application may decide to project content onto the surface, e.g. via a projector. If that happens, the surface will carry augmented (added) data which may or may not interfere with the application the next time it processes its input. There is a high risk of affecting the target input/output zone, which could destabilize the application. Therefore, measures must be found that isolate the parts of the image that are real objects from those which have been modified by projected content. One possible approach would be the usage of invisible radiation, such as infrared light, for the analysis. Any wavelength outside the visible spectrum could probably be viable for this approach.
As for processing-time optimization, there are several possibilities available. Examples include GPU-accelerated computing, multi-core threading, architecture/pipeline optimization, and adjusting the input image resolution. Adjusting the input resolution will have a guaranteed effect on processing time, but will also affect both the resolution of the user input and the render output.
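As a starting point for the Kalman-filter smoothing mentioned at the beginning of this section, OpenCV's cv::KalmanFilter could track a detected 2D finger position with a constant-velocity model. The sketch below is generic and untuned; the noise covariances are placeholder values and nothing here reflects an implementation made in the project.

```cpp
// Generic sketch of smoothing a detected 2D finger position with OpenCV's
// cv::KalmanFilter using a constant-velocity model. Noise covariances are
// untuned placeholder values.
#include <opencv2/opencv.hpp>

cv::KalmanFilter MakePointTracker()
{
    cv::KalmanFilter kf(4, 2);                // state: x, y, vx, vy; measurement: x, y
    kf.transitionMatrix = (cv::Mat_<float>(4, 4) <<
        1, 0, 1, 0,
        0, 1, 0, 1,
        0, 0, 1, 0,
        0, 0, 0, 1);
    cv::setIdentity(kf.measurementMatrix);
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-3));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1));
    return kf;
}

cv::Point2f TrackFinger(cv::KalmanFilter & kf, cv::Point2f measured)
{
    kf.predict();
    cv::Mat z = (cv::Mat_<float>(2, 1) << measured.x, measured.y);
    cv::Mat corrected = kf.correct(z);
    return cv::Point2f(corrected.at<float>(0), corrected.at<float>(1));
}
```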

6 References

[1] Pavlovic V., Sharma R. & Huang T., Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, July 1997

[2] Mitra S. & Acharya T., Gesture Recognition: A Survey, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 37, No. 3, May 2007

[3] Kass M., Witkin A. & Terzopoulos D., Snakes: Active contour models, International Journal of Computer Vision 1.4, 1988: pp 321-331

[4] Ballard D.H., Generalizing the Hough transform to detect arbitrary shapes, Pattern Recognition 13.2, 1981: pp 111-122

[5] Horn B.K. & Schunck B.G., Determining optical flow, In 1982 Technical Symposium East, International Society for Optics and Photonics, 1981: pp 319-331

[6] Ren Z., Yuan J. & Zhang Z., Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera, In Proceedings of the 19th ACM International Conference on Multimedia, 2011: pp 1093-1096

[7] Ren Z., Meng J., Yuan J. & Zhang Z., Robust hand gesture recognition with Kinect sensor, In Proceedings of the 19th ACM International Conference on Multimedia, 2011: pp 759-760

[8] Orrite-Urunuela C., del Rincon J.M., Herrero-Jaraba J.E. & Rogez G., 2D silhouette and 3D skeletal models for human detection and tracking, Pattern Recognition, ICPR 2004, Proceedings of the 17th International Conference on, Vol. 4, IEEE, 2004: pp 244-247

[9] Kakumanu P., Makrogiannis S. & Bourbakis N., A survey of skin-color modeling and detection methods, Pattern Recognition 40.3, 2007: pp 1106-1122

[10] Vezhnevets V., Sazonov V. & Andreeva A., A survey on pixel-based skin color detection techniques, Proc. Graphicon, Vol. 3, 2003

[11] Kawulok M., Kawulok J., Nalepa J. & Papiez M., Skin detection using spatial analysis with adaptive seed, Proceedings of IEEE International Conference on Image Processing (ICIP 2013), 2013: pp 3720-3724

[12] KaewTraKulPong P. & Bowden R., An improved adaptive background mixture model for real-time tracking with shadow detection, In Video-Based Surveillance Systems, Springer US, 2002: pp 135-144

[13] Barnich O. & Van Droogenbroeck M., ViBe: A universal background subtraction algorithm for video sequences, Image Processing, IEEE Transactions on 20.6, 2011: pp 1709-1724

[14] Mann S., et al., HDRchitecture: Real-time Stereoscopic HDR Imaging for Extreme Dynamic Range, ACM SIGGRAPH 2012 Emerging Technologies, ACM, 2012

[15] Kalman R.E., A New Approach to Linear Filtering and Prediction Problems, Journal of Basic Engineering 82.1, 1960: pp 35-45

[16] Suzuki S., Topological structural analysis of digitized binary images by border following, Computer Vision, Graphics and Image Processing 30.1, 1985

[17] Teh C.H. & Chin R.T., On the detection of dominant points on digital curves, Pattern Analysis and Machine Intelligence, IEEE Transactions on 11.8, 1989

[18] Sklansky J., Finding the Convex Hull of a Simple Polygon, Pattern Recognition Letters 1.2, 1982: pp 79-83

[19] Harris C. & Stephens M., A combined corner and edge detector, Alvey Vision Conference, Vol. 15, 1988

[20] Canny J., A computational approach to edge detection, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 6, 1986: pp 679-698

[21] Ramer-Douglas-Peucker algorithm [Internet], Wikipedia, http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm, 2014-05-20, 12:06

[22] Hue [Internet], Wikipedia, http://en.wikipedia.org/wiki/Hue, 2014-05-20, 13:06

[23] Augmented Reality [Internet], Wikipedia, http://en.wikipedia.org/wiki/Augmented_reality, 2014-05-20, 13:07

[24] OpenCV (Open Source Computer Vision) [Internet], itseez, http://opencv.org/, 2014-05-20, 13:08

[25] Game-engine [Internet], Hedemalm E. (erenik), https://github.com/erenik/engine, 2014-05-20, 13:08

[26] Natural user interface [Internet], Wikipedia, http://en.wikipedia.org/wiki/Natural_user_interface, 2014-05-20, 13:08

[27] OpenGL (Open Graphics Library) [Internet], Silicon Graphics International, http://www.opengl.org/, 2014-05-20, 13:09

[28] OpenCV Image processing module API Reference [Internet], opencv dev team, http://docs.opencv.org/modules/imgproc/doc/imgproc.html, 2014-05-20, 13:10

[29] OpenCV Tutorials, Image processing module [Internet], opencv dev team, http://docs.opencv.org/doc/tutorials/imgproc/table_of_content_imgproc/table_of_content_imgproc.html, 2014-05-20, 13:10

[30] OpenCV Github repository [Internet], https://github.com/Itseez/opencv, 2014-05-20, 13:10

[31] Green's Theorem [Internet], Wikipedia, http://en.wikipedia.org/wiki/Green_theorem, 2014-05-20, 13:10

[32] Big Buck Bunny [Internet], Blender Foundation, 2008, http://www.bigbuckbunny.org/, 2014-05-20, 13:11

A Appendix

A.1 Source code and binary

Source code for the game engine framework as well as the configurable pipeline framework will be publicly available on my Github: https://github.com/erenik/engine
The OpenCV-related code is mostly located in the CV source subdirectory: https://github.com/erenik/engine/tree/master/src/CV, while the project-specific code that binds the user interface and hot-keys for the user and tester experience is located in the Projects/IPM subdirectory: https://github.com/erenik/engine/tree/master/src/Projects/IPM
Necessary files required in order to run the application can be provided on request.

A.2 Videos

Full list of all published videos related to the project:

• Hand detector, hue-filtering: https://www.youtube.com/watch?v=n8hLWkn1Qdg

• Hand detector, background removal: https://www.youtube.com/watch?v=wOvQn1GkUlw

• Image gallery, hand-controlled: https://www.youtube.com/watch?v=Ewro1-PURYc

• Movie projector, quad detector: https://www.youtube.com/watch?v=PLJBoYtTrxM

• Music player, hand-controlled: https://www.youtube.com/watch?v=El2Y7qmtmSk

• Box detection: https://www.youtube.com/watch?v=XzyrLVPypfM
