Raspberry Pi: a Smart Video Monitoring Platform

David Emanuel Ribeiro Gaspar

Thesis to obtain the Master of Science Degree in Embedded Systems and Computer Engineering

Supervisor: Prof. Nuno Filipe Valentim Roma

Examination Committee: Chairperson: Prof. Miguel Nuno Dias Alves Pupo Correia Supervisor: Prof. Nuno Filipe Valentim Roma Member of the Committee: Prof. Renato Jorge Caleira Nunes

November 2014

Acknowledgments

My parents. My family. My friends. To Dr. Nuno Roma for all the patience, help and support.


Abstract

Recent computing trends have led to the release of very low-cost (yet highly capable) single-board computing platforms. The usage of such devices - in this case, the Raspberry Pi - connected to an inexpensive Universal Serial Bus (USB) webcam allows the implementation of a very low-cost video processing station. The computational capabilities offered by the Raspberry Pi, combined with the good image quality that is made available by most off-the-shelf webcams, allow the implementation of smart monitoring platforms for a broad range of applications. By combining such a hardware platform with a software architecture consisting of interchangeable modules, to interface the webcam and to process the gathered data, it is possible to implement a vast set of systems for autonomous video analysis. Under this scenario, the proposed system is able to perform movement detection and heat-map calculation over a period of time, by analysing a video feed provided by a physically fixed camera. Several possible alternatives for the implementation of each module are presented and discussed, together with a presentation and analysis of the system performance and its real-world applicability.

Keywords

Smart Video Monitoring, Video Surveillance, Motion Detection, Single-Board Computer, Low-Cost Embedded Platform


Resumo

As tendências actuais no mundo da computação têm levado à disponibilização de plataformas de computação de placa única que, apesar do seu reduzido custo, apresentam boas capacidades de computação. O uso destes dispositivos - neste caso concreto, a Raspberry Pi - ligado a uma câmara USB permite a implementação de uma estação de tratamento de vídeo de baixo custo. A Raspberry Pi, combinada com a boa qualidade de imagem da câmara, permite a implementação de plataformas inteligentes de monitorização para o mais variado leque de usos. Combinando esta plataforma hardware com uma arquitectura software assente em módulos inter-conectáveis para ligar à web-cam e tratar os dados obtidos, é possível implementar um vasto leque de sistemas autónomos de tratamento de vídeo. O sistema proposto é capaz de realizar detecção simples de movimentos e cálculo de mapas térmicos através da análise do vídeo proveniente de uma câmara fixa. Várias alternativas para a implementação de cada módulo são apresentadas, juntamente com uma análise à performance do sistema e à sua viabilidade no mundo real.

Palavras Chave

Monitorização Inteligente de Vídeo, Vídeo-Vigilância, Detecção de Movimento, Computador de Placa Única, Plataforma Embebida de Baixo Custo


Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Requisites
  1.4 Document Structure

2 Existing Solutions
  2.1 Security Video-Surveillance
  2.2 Sports Data Gathering
  2.3 Wilderness Cameras
  2.4 Human Monitoring Systems
  2.5 Community Approaches
  2.6 Discussion

3 Related Technology
  3.1 Hardware Platforms & Peripherals
    3.1.1 Processing Platform
    3.1.2 Camera
  3.2 Detection Algorithm
  3.3 Software Libraries
    3.3.1 MATLAB Computer Vision System Toolbox
    3.3.2 OpenCV
    3.3.3 Motion
  3.4 Discussion
    3.4.1 Hardware
    3.4.2 Detection algorithm
    3.4.3 Software Libraries

4 Proposed Architecture
  4.1 Hardware Layer
    4.1.1 Processing Board & Resources
    4.1.2 Peripherals
  4.2 Software Layer
    4.2.1 Capture Module
      4.2.1.A Capture from USB Web-cam
      4.2.1.B Capture from Local File
    4.2.2 Processing Module
      4.2.2.A Video Store
      4.2.2.B Simple Movement Detection
      4.2.2.C Activity Mapping
      4.2.2.D Activity Mapping with Dynamic Reference Updating
    4.2.3 Communication Protocol
      4.2.3.A Initialization
      4.2.3.B Main Cycle
      4.2.3.C Finishing

5 Implementation
  5.1 General Structure
  5.2 Capture Module Implementation
    5.2.1 Capture from USB Device
    5.2.2 Capture from Sequence File
  5.3 Processing Module Implementation
    5.3.1 Video Storage
    5.3.2 Motion Detection
    5.3.3 Activity Mapping
    5.3.4 Activity Mapping with Dynamic Reference Updating
  5.4 Communication Protocol
    5.4.1 Execution Parameters Structure
    5.4.2 Initial Negotiation
    5.4.3 Main Cycle
    5.4.4 Finishing Execution

6 Experimental Results
  6.1 Performance
    6.1.1 Movement Detection
    6.1.2 Activity Mapping
    6.1.3 Activity Mapping with Dynamic Reference Updating
  6.2 Real-World Behaviour
    6.2.1 Movement Detection
    6.2.2 Activity Mapping
      6.2.2.A Sequence 1
      6.2.2.B Sequence 2
    6.2.3 Activity Mapping with Dynamic Reference Updating
      6.2.3.A Sequence 1
      6.2.3.B Sequence 2

7 Future Work

8 Conclusions

A Appendix A

B Appendix B

C Appendix C

D Appendix D

E Appendix E


List of Figures

1.1 Raspberry Pi board.
1.2 Example of heat activity map.

2.1 Swann DVR8-3425 - an example of a video surveillance system with a set of cameras and a central unit.
2.2 Logitech Alert example configuration.
2.3 Heat map generated by a player in a football match, showing the areas of the pitch he spent the most time in.
2.4 An example of a motion-triggered wilderness camera.
2.5 Human tracking by RetailNext. Note the humans enclosed in purple boxes and, in green, the path they took.

3.1 Eee PC 4G, the first netbook released by ASUS.
3.2 Intel Galileo, an example of an x86 single-board computer.
3.3 BeagleBone Black Revision C board.
3.4 The Raspberry Pi camera module.
3.5 Logitech C270 web-cam.
3.6 Frame Differencing execution flowchart.
3.7 A difference frame with two visible cars. Extracted from http://www.mathworks.com/discovery/object-detection.html.
3.8 A photograph taken with a high ISO setting.

4.1 Hardware schematic of the system.
4.2 Software schematic of the system.
4.3 USB Capture Module execution flowchart.
4.4 Capture From File Module execution flowchart.
4.5 Video Store Processing Module execution flowchart.
4.6 Simple Movement Detection Processing Module execution flowchart.
4.7 Colour scheme showing warm and cold colours.
4.8 Activity Mapping Processing Module execution flowchart.
4.9 Activity Mapping with Dynamic Reference Updating Processing Module execution flowchart.
4.10 Capture to Processing Module Communication flowchart.

5.1 Example of an original image on the left and its downscaled version on the right.
5.2 Mapping of pixels from original to downscaled frame.
5.3 Example of a histogram. Taken from cambridgeincolour.com.
5.4 Example of an original image, on the left, and its version with histogram equalization applied, on the right.
5.5 HSV colours according to H and V values. S is set to 1.0 in this figure.
5.6 Activity Mapping with Dynamic Reference Updating execution scheme.

6.1 Screenshots of the first video sequence used to test the Activity Mapping modules, with the highlight of a car.
6.2 Screenshots of the second video sequence used to test the Activity Mapping modules.
6.3 Images taken by the module when an intrusion was detected.
6.4 Heat map of execution over the street sequence.

7.1 A picture of a crowd and its respective density thermal map (top right) and head detections without crowd density weighing (bottom left) and with (bottom right). [1]

1 Introduction

Contents
  1.1 Motivation
  1.2 Objectives
  1.3 Requisites
  1.4 Document Structure


This chapter introduces the system. It starts with a description of the current context in low-power computing and its effect on the motivation for the system's planning and development, before listing the system's objectives and requisites.

1.1 Motivation

In recent years, the computing world has seen a curious development. Some ten to fifteen years ago, in domestic computing, the GHz race was on - Intel and AMD raced to the top of the specification wars, trying to be the first to sell a 1 GHz processor to the domestic user. A few years later, as heat dissipation started to become a real issue and as domestic CPUs started to reach frequencies of 3 GHz and above[2], parallel computing started to take over. Processors with two, four[3] and more processing cores became a priority as it became increasingly obvious that bigger benefits could be obtained not by doing things sequentially faster, but rather by doing several at a time. Multi-core processors, even with lower frequencies than their previous-generation counterparts, could top them in performance - provided programmers made good use of multi-threaded coding.

In the second half of the first decade of the 2000s, another curious shift in paradigm took place. With the launch of Apple's first iPhone in 2007[4], and the myriad of Android phones that then followed, mobile processors saw a gigantic boost in market demand, performance and power efficiency, with ARM processors taking the spotlight from full-fledged x86 units. Soon, mobile phones had CPUs with several cores and clock speeds of several GHz[5]. While these high-performance mobile processors are still sold at relatively high prices, the fast development in this computing area left a trail of low-cost alternatives that are still capable of running complex tasks. This led to a huge increase in the popularity of single-board computers. Instead of the modular architecture of household PCs, these are non-expandable set-ups. Because of that, they are very small in size, have very low power requirements and, depending on their specifications, can also be very low-priced. With the embedding of these new processors and the rise in popularity of operating systems compiled for ARM architectures, single-board computers have become more accessible than ever and their audience has dramatically increased.

The Raspberry Pi is arguably one of the most popular among those systems[6]. At a price of 25 dollars for the base model, it can run a full Linux distribution specifically compiled for it. It has been very successful, too - the Raspberry Pi Foundation announced in November 2013, in its official blog, that it had sold its two millionth Raspberry Pi. The fact that it is so widespread significantly increases the available support: solutions for problems are easy to find and so are pre-compiled libraries. There is also a wide variety of Linux distributions specifically compiled for and supported by the Raspberry Pi[7–10]. Even though its specifications tailor it as particularly good for media-center use (with hardware H.264 decoding and an HDMI-out port[6]), enthusiasts have used it as personal servers, robot controllers, and normal personal computers.

Single-board computers are, in essence, small, cheap, full-featured computers that provide very good flexibility while being very affordable.

Figure 1.1: Raspberry Pi board.

It is not only computing power and accessibility that have increased: peripherals like web-cams have also seen enormous progress in recent times. These are small digital cameras that are often used to capture and stream still images and video on computers. They are usually connected through USB and are currently capable of capturing high-definition (up to 1080p)[11] video and multi-megapixel snapshots. The combination of these developments opens unprecedented windows of opportunity. By combining a single-board computer, an HD webcam and the advanced software solutions available today, it is theoretically possible to develop a small, portable, cheap and flexible video capture and processing station. Applications like motion detection and activity-level mapping can be implemented at lower cost and with greater flexibility than ever, given that single-board computers share an architecture with common computers. The concept can be taken even further. By separating the part of the software that interacts directly with the hardware from the one that processes the frames captured from the video feed, this system could take any processing module - hereafter referred to as a plug-in - and perform any function the plug-in developer wishes. Webcam device drivers nowadays give the developer low-level access to the pixel values from the video feed, and the computer allows full access to its networking port and local storage, so the resulting flexibility is considerable. Heavy computational tasks could even be offloaded to a central system, with several similar systems working concurrently, thus effectively configuring a sensor grid - a topology where several sensors are deployed through a wide area to allow for real-time data collection. The envisaged plug-ins can be divided into two groups, according to the treatment they give to data and their use-case scenario:


Event triggering These plug-ins analyse the acquired frames looking for a specific type of event, upon which some sort of trigger is run. Autonomous video surveillance systems fall within this category, typically looking for movement when only a fixed background should be seen - people in off-bounds areas, animals outside their designated terrain, or a store being watched during off-business hours. Trigger methods include warning an external system such as a mobile application, local logging, sending a message such as an email or SMS, or storing/transmitting the frame in which the intrusion was detected. These plug-ins are usually time-constrained, which means that the event should be triggered as fast as possible.

Over-time collection of data These plug-ins specialize in the collection and treatment of data over long periods. They usually keep a number of local data structures that they continually update for the final construction of an output. The building of activity heat maps is a typical result of the execution of these applications. A video recording application also fits in this category.

Heat maps are an indication of the zones in the scene where the sought event happened most often, with red tones indicating the areas where the event took place the most times. Applications of these plug-ins include the overview of stores, to learn the areas where customers found the products most interesting, and of road intersections, to discover the roads with the most traffic or pedestrian movement.
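To make the separation between the capture side and the plug-in side more concrete, the listing below sketches, in C++, what a minimal processing-module interface could look like. The names (Frame, FrameProcessor, onFrame, MotionLogger) and the structure are illustrative assumptions only; they are not the interface actually defined later in this document.

// Hypothetical sketch of a processing-module (plug-in) interface.
// Names and structure are assumptions for illustration only; the
// actual interface used by the system is defined in later chapters.
#include <cstddef>
#include <cstdint>
#include <vector>

// A single captured frame: width x height luminance (Y) values.
struct Frame {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> pixels;  // row-major, 1 byte per pixel
};

// Every plug-in implements this interface; the capture side only
// needs to know about it, never about what the plug-in does.
class FrameProcessor {
public:
    virtual ~FrameProcessor() = default;
    virtual bool init(int width, int height) = 0;  // called once before capture starts
    virtual void onFrame(const Frame& frame) = 0;  // called for every captured frame
    virtual void finish() = 0;                     // called when capture stops
};

// Example skeleton of an event-triggering plug-in.
class MotionLogger : public FrameProcessor {
public:
    bool init(int, int) override { return true; }
    void onFrame(const Frame& frame) override {
        (void)frame;  // analyse the frame and log an event if motion is found
    }
    void finish() override {}
};

With such an arrangement, the capture code only ever holds a pointer to a FrameProcessor, so swapping a motion detector for a heat-map builder does not require any change to the hardware-facing code.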

1.2 Objectives

The main goal of the presented work is to develop a cheap, flexible, open-source video-surveillance system that is easy to deploy, use and relocate. This system should consist of a single-board computer connected to a commercially available USB web-cam, running a custom program developed in C or C++, and should process data in real-time, with a low frame-rate requirement. New plug-ins, or processing modules, should be easy to develop and integrate - video stabilization, time-lapse videos, video filters, or whatever else a programmer needs or thinks of. The interface between the platform and the plug-in should be easy to use in newly developed plug-ins. One of the main challenges of the proposed system resides in the still limited computational capabilities of the processing platform. Though much more powerful than similar systems were a few years ago, some computer vision algorithms are very demanding and convenient measures may have to be taken to compensate for this limitation - throttling of the frame-rate and conversion from colour to grayscale images are two examples. Three plug-ins will be initially developed for the system:

Event Triggering by Movement Detection - This is an event triggering plug-in and the simplest of the three. When a static scene is being watched and movement is detected, an event should be triggered, with the local logging of the timestamp and a snapshot.


This plug-in should also work with outdoor scenes and should filter out false event detections caused by lighting changes throughout the day.

Activity Mapping - An Activity Mapping plug-in watches over a scene for a period of time and, at the end of the analysis, produces a heat map representative of the areas of the scene where the most movement occurred.

Figure 1.2: Example of heat activity map.

Two plug-ins will be developed under this premise - one that periodically produces a heat map indicative of the whole activity level since the last map was produced, and another that produces a map at every frame, representative of the activity in the feed over the latest few seconds or minutes. In this second plug-in, the same movement can therefore be represented in several consecutive heat maps; one possible data structure for such rolling behaviour is sketched below.
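The following minimal C++ sketch illustrates one way such a rolling, per-pixel activity window could be kept: a ring buffer holds the activity mask of the last N frames, so at every new frame the oldest contribution leaves the window and the newest enters it. The class and member names are illustrative assumptions and do not correspond to the implementation described in later chapters.

// Hypothetical sketch of a rolling per-pixel activity counter: at every
// frame, the oldest contribution falls out of the window and the newest
// one is added, so a heat map can be produced for "the last N frames".
// Names and structure are illustrative assumptions only.
#include <cstdint>
#include <vector>

class RollingActivityMap {
public:
    RollingActivityMap(int width, int height, int windowFrames)
        : w_(width), h_(height),
          history_(windowFrames, std::vector<std::uint8_t>(width * height, 0)),
          counts_(width * height, 0) {}

    // 'active' holds 1 for pixels where movement was detected in this frame.
    void addFrame(const std::vector<std::uint8_t>& active) {
        std::vector<std::uint8_t>& oldest = history_[next_];
        for (int i = 0; i < w_ * h_; ++i) {
            counts_[i] += active[i];   // newest frame enters the window
            counts_[i] -= oldest[i];   // oldest frame leaves the window
        }
        oldest = active;
        next_ = (next_ + 1) % static_cast<int>(history_.size());
    }

    // Per-pixel activity over the current window; higher means "warmer".
    const std::vector<int>& counts() const { return counts_; }

private:
    int w_, h_;
    int next_ = 0;
    std::vector<std::vector<std::uint8_t>> history_;
    std::vector<int> counts_;
};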

The envisaged system aims to offer the following characteristics:

Low-Cost - The total cost of the system should be as low as possible, without sacrificing good operational results.

Flexibility - The solution should be as flexible as possible, in part due to being open-source and published under the GNU GPL software license. Anyone should be able to use, study, share, copy and modify the code at will, and to easily add new use-case scenarios through the development of new plug-ins that process and decide what to do with each acquired frame. The platform where the plug-in is later integrated is also published as free and open-source software, so it can be scrutinized and improved. The software should have as few ties to the hardware as possible, so that anyone who needs a more expensive but more computationally capable platform and wishes to deploy the system on it is free to do so.

Automaticity - The system, both the platform and plug-ins, should fulfill its premise with as little human intervention as possible.


Size - Single-board computers are often very small - the Raspberry Pi is about the size of a credit card. Some available protective cases also include mounting points, which make these devices easily storable and relocatable.

Versatility - The system should be able to work with good results under various conditions within the same session, such as different times of day or different weather conditions.

1.3 Requisites

The following requisites should be fulfilled to ensure that the system works as intended:

• The system should run on a Raspberry Pi board and, as such, inherits its basic working requirements - a nearby power outlet or battery and an SD memory card containing an appropriate Linux distribution[6]. In order to power both the board and the web-cam, a powered USB hub is also necessary.

• The web-cam must be a USB Video Device Class (UVC) device and its driver compatible with the Video4Linux2 (V4L2) API. This also means that, when porting the program to other architectures, the target system should support this API. Other video interface APIs might also be supported by altering the platform code. A minimal sketch of how V4L2 compatibility can be checked is given after this list.

• The camera should be in a fixed position and the background of the image should be as static as possible. The more static the background, the better the results will be. Also, the area under surveillance should be relatively well lit.

• For triggering events that need network connectivity, it should be provided, either wired, through the board's Ethernet port, or wireless, through an external WiFi or UMTS adapter. None of the implemented plug-ins presented in this document makes use of network connectivity, but it must be provided for further flexibility in plug-in development.
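As an illustration of the UVC/V4L2 requisite above, the sketch below shows one way a program could check that a camera is usable through the standard V4L2 ioctl interface. The device path /dev/video0 is an assumption (it is the node the kernel typically creates for the first registered camera); the rest uses only the documented V4L2 capability query.

/* Minimal sketch: check that a camera is usable through V4L2.
 * The device path /dev/video0 is an assumption; any UVC camera
 * registered by the kernel will appear as such a device node. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void) {
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/video0");
        return 1;
    }

    struct v4l2_capability cap;
    memset(&cap, 0, sizeof(cap));
    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) < 0) {
        perror("VIDIOC_QUERYCAP");
        close(fd);
        return 1;
    }

    printf("driver: %s, card: %s\n",
           (const char*)cap.driver, (const char*)cap.card);
    if ((cap.capabilities & V4L2_CAP_VIDEO_CAPTURE) &&
        (cap.capabilities & V4L2_CAP_STREAMING))
        printf("device supports video capture and streaming\n");
    else
        printf("device is not suitable for this system\n");

    close(fd);
    return 0;
}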

1.4 Document Structure

The document is organized in eight chapters. The first is the Introduction and focuses on the project's motivation, its objectives and its requisites. The second chapter reviews already existing solutions with objectives similar to those of the proposed system - even if only partially so - finishing with a discussion on how the system's objectives overlap with those of such systems. The third chapter is the Related Technology, focusing on the hardware platform used to implement the system as well as its peripherals. The same chapter also analyses the Frame Differencing algorithm and the software libraries that implement it. The fourth chapter analyses the system's proposed architecture, both in its hardware and software aspects. The fifth chapter focuses on the options taken in the implementation of the developed modules, reflecting on their correlation with the system's objectives.

The sixth chapter lists a series of tests done on the modules, covering both their application-level behaviour and their performance. The seventh lists possibilities for future work and the eighth, and last, draws conclusions from the previous chapters on whether the system's objectives were fulfilled.


2 Existing Solutions

Contents
  2.1 Security Video-Surveillance
  2.2 Sports Data Gathering
  2.3 Wilderness Cameras
  2.4 Human Monitoring Systems
  2.5 Community Approaches
  2.6 Discussion


There are already several systems - both commercial and hobbyist-developed - that include some of the functionalities the system herein described will implement. These have a wide range of applications, from security to sports and outdoor leisure. In this chapter, we analyse some of these systems, their pros and cons, and the feasibility of their use cases on the system described here.

2.1 Security Video-Surveillance

Most video surveillance systems found in the market, sold both in retail and by specialized security companies, are typically composed of three components: a set of cameras, a central unit with Digital Video Recorder (DVR) capabilities and some sort of software interface for viewing and managing footage. The cameras are usually fixed (some might include a motor so that their orientation can be remotely controlled) and connected to a wall power outlet; each camera watches a determined physical area. The central unit includes a high-capacity hard disk. This unit stores the video taken from each camera on its hard disk and keeps it for a set period of time before old recordings are deleted and replaced with new ones. The central unit is connected to every camera by cables and usually allows the connection of a TV or another type of screen for live video watching. The Swann 3425 Series is an example of this type of system[12].

Figure 2.1: Swann DVR8-3425 - an example of a video surveillance system with a set of cameras and a central unit.

These systems have a number of limitations, the first being lack of practicality. If it becomes necessary to install a new camera or relocate an existing one, the area to be covered needs to have a power source of some sort for the camera to work. Furthermore, cables also need to be relocated or, if the available ones are not long enough, new ones need to be ordered and installed. The second limitation is image quality: since many of these systems are very low cost, imaging is often sacrificed in order to cut the price, and cheap camera sensors and lenses mean that image quality and definition may be poor in low-light areas, which can, for example, make the identification of a burglar impossible and thus make the system ineffective in its purpose.

The last drawback is lack of flexibility. Since these systems are purely focused on security applications (rarely performing more than simple video capture and motion detection) and their software platforms are most of the time proprietary, the hardware - though technically capable - is limited to such tasks and cannot be adapted to others, like over-time data collection.

Logitech Alert[13] improves on the aforementioned type of systems, by using higher-quality cameras with better image quality and resolution (1280x720 HD) and by including a MicroSD slot in each camera so that it has its own local storage, by means of a memory card, and no central DVR is needed. Cameras with weather protection are available for outdoor use and some models even include infra-red night vision. The software solution includes both a mobile application that triggers a notification should movement be detected and a web interface through which a live feed can be viewed. The cameras use only one connection for data and power transfer, so only a wall plug is necessary. This system then attempts to be more practical than the conventional video surveillance hardware while offering a more complex software package capable of more functionalities.

Figure 2.2: Logitech Alert example configuration.

This alternative also has its shortcomings. Though more extensive in functionality than a conventional video surveillance system, it is still a security-driven appliance with security-focused uses. The fact that it uses only proprietary software limits what can be done with it to what Logitech develops. Cost is also a concern, as memory cards have a much higher cost per gigabyte than hard disks and each camera needs its own. Their capacity is also typically much smaller, which means video is stored for less time before it has to be replaced with new footage. Lastly, despite money savings being touted by marketing materials as an advantage, the Logitech Alert is not cheap: a base system for indoor security, without any additional cameras and with basic software, is, at the time of this writing, priced at $322.98.


2.2 Sports Data Gathering

Over the last few years, a new field in sports analytics has been gradually catching the interest of coaches, players and fans as a means to analyse games at a deeper level. By applying image recognition methods to a game, it becomes possible to gather a whole set of data on player performance over a match.

Figure 2.3: Heat map generated by a player in a football match, showing the areas of the pitch he spent the most time in.

Pointing one or more cameras at a pitch and then making use of image recognition algorithms allows investigators to collect a large amount of data on the game being played. The first example, as shown above, is the generation of a heat map of a player's performance. If the displayed map represents a midfield player whose team attacks from left to right, it can be concluded that he is playing too close to his own goal and not providing enough support to his attacking team mates. Individual heat maps can be generated for each player, and a general one for the whole team. Another example is the calculation of the distance a given player covered during a match. Such a calculation can potentially be made by pointing a camera at the pitch, identifying the player in question and then, knowing the dimensions of the pitch, calculating the covered distance using trigonometry. This can allow coaches to notice that a player is running too much and might become fatigued sooner than his team mates. Such examples may need to employ moderately complex algorithms. In order to build a player's heat map, the system needs to be able to identify the area of the video pertaining to that player and automatically follow it around the pitch. A potentially more complicated issue is player identification: how to know which area of the video feed corresponds to each player? Facial recognition algorithms need to be applied or, alternatively, each player has to be identified beforehand, in which case the manual work of a human is necessary - possibly at the beginning of each match half, since the players leave the pitch (and consequently the video feed) during each break.


Other data include the average speed of a player and the number and direction of shots at goal a team took. Some commercial applications of these systems are already available and in place: Opta[14] is one example and the biggest firm in sports data collection. However, these systems rely on proprietary architectures and implementation details, about which it is thus very difficult to obtain information or documentation. The description made here is, then, only a speculation on how such a system may be implemented.

2.3 Wilderness Cameras

A different application of a camera connected to an embedded device is the so-called wilderness camera[15]. These are based on a weather-proof case enclosing a camera, a motion sensor, a memory card for local storage and an embedded system. When installed outdoors, the motion sensor continuously looks for movement and, when it is detected, the camera takes a snapshot or a few seconds of video. The purpose of this behaviour is to capture footage of wildlife that would be near impossible to obtain with a human present. Some models improve functionality with HD camera modules, GPS tagging of files, LEDs for lighting or infra-red night vision[16]. They are powered by a battery.

Figure 2.4: An example of a motion-triggered wilderness camera.

These systems are architecturally very simple and limited in their use, but they are also very well tailored for it. The presence of a dedicated motion sensor is one such indication: instead of having the camera permanently turned on and continuously analysing its footage, the low-power sensor allows for the same effect to be achieved with a big increase in battery life. However, these systems are also closed platforms - there is no way to alter them in order to perform additional functions, like triggering a remote notification when movement is detected.


2.4 Human Monitoring Systems

Some systems are available whose goal is to collect data about the usage of public spaces, like stores or supermarkets. A set of cameras is installed around the space to properly cover the areas open to the public and then runs continuously while the space is being used. A set of techniques applied over the video feed can help gather data about the captured footage: heat maps can be calculated over a period of time to assess the areas with the most, the least and no movement at all. Human recognition algorithms can recognize where a person is in the frame and treat that data accordingly, including the already mentioned calculation of heat maps. The use of these techniques allows a wide range of data to be collected about the video feed. The heat map of a particular area shows the zones of that area that saw the most movement - making it easy to identify the zones of the space that people visited or stopped in most often, either to look at exposed products or at marketing material. A camera looking at the checkout lanes can show whether people prefer self-checkout machines or conventional ones with a human operator. Highway operators can discern whether people prefer to pay tolls at vending machines or with human operators. Crossing this data with date and time shows how these affect all these factors (allowing, for example, store operators to schedule employee shifts accordingly), and integrating the system with the payment system would allow yet more conclusions to be drawn on average transaction value over time. Commercial solutions of this type already exist: RetailNext [17] and Prism SkyLabs [18] are two such examples.

Figure 2.5: Human tracking by RetailNext. Note the humans enclosed in purple boxes and, in green, the path they took.

While not a lot of information is available on these systems, apart from the offered functionality (since installations are custom-made for the space in question), it is expected that their basic architecture is in some points similar to that of the video surveillance systems: a series of cameras connected to a central device.

However, in contrast to the video surveillance case, where the central station's focus was long-term storage, in this case, given the high number of calculations made, the central station will have a higher focus on processing performance, in order to be able to handle data from a potentially large number of cameras. Cons of these systems include, once again, their closed nature: not being open-source applications, functionality is limited to what the seller determines. And even though no concrete pricing data was found, cost is a concern given the high degree of customization each installation requires.

2.5 Community Approaches

Since the Raspberry Pi, when combined with Linux and the V4L2 driver, also allows for such flexibility and good value for money, users and hobbyists have built custom video applications with the processing board, sometimes with USB cameras, other times with the official Raspberry Pi camera module. Enthusiasts have developed programs to take timelapse videos[19], security systems[20], outdoor motion-triggered cameras[21], and more. While good for showing just how much potential and flexibility the Raspberry Pi has, most of these systems have very limited functionality and are difficult to configure (or might be so tied to their development platform that they include no configuration at all). Another problem arises from these being projects developed and maintained by hobbyists: they are very dependent on their creator's motivation and many have halted development altogether.

2.6 Discussion

The analysis of the existing similar and commercially available solutions brings up the clear lack of a video treatment system that manages to be, at the same time, cheap, easily configurable, expandable, portable and open. Most commercial solutions offer good functionality and convenience but fall short on expandability, openness and cost. Open-source approaches, like that of the Raspberry Pi community, represent a very good start on the use of the platform's flexibility, but fall short on configurability, support and complexity of the offered functionality - few implement truly advanced algorithms. Accordingly, the software structure that was envisioned for the described project, combined with the hardware's low cost and flexibility, means that many of the described systems can be implemented as software modules, some with trivial hardware additions or changes: a high-capacity hard disk can be connected to one of the Raspberry Pi's USB ports, allowing for storage of large quantities of video. A battery can be used instead of a wall plug, removing the need for a nearby power outlet, even if only for a period of time. A waterproof case can allow for outdoor usage of the system. In that regard, it becomes clear that most use cases described in the aforementioned systems can be implemented as modules for the described platform.

Video surveillance usually encompasses three main features: live video feed viewing, long-term video storage and motion detection with warning triggering. The first can be implemented through the usage of the Simple DirectMedia Layer (SDL) on Linux and analogous libraries on other operating systems, while long-term video storage can be implemented in a similar way, where instead of showing the images in a window they are kept in a file and possibly compressed later. Motion detection with the triggering of a warning is actually one of the modules developed for this work.

Despite sharing some of the listed drawbacks and even introducing one of its own, the system can perform the function of a video surveillance system with a number of advantages. A processing module can be developed that takes the frames from one or more cameras and stores them in the local storage - a large-capacity external disk can be connected to the Raspberry Pi for saving such large files. The easy configuration of the video feed, although to an extent dependent on the capabilities of the camera and driver, means there is flexibility for the system to be adapted to different functions - in one instance, the camera can be located near the cash register of a shop, where a high-resolution colour image can help with the identification of people. In another, a camera can be located in the top corner of a laboratory and the video feed saved to later analyse the movement patterns of an animal species over long periods of time - in this case image detail is not as important and a lower-resolution, monochrome format can be preferred, allowing the connected storage to hold longer video sequences. Furthermore, external encoders such as x264 can be used to compress the saved sequences. A large number of cameras can be used, since the Raspberry Pi includes two USB ports that can be connected to hubs, even though the USB specification recommends cables no longer than 5 metres, with at most 5 hubs chained serially. A possible alternative is to develop a communication system and send the frames through the Local Area Network (LAN).

Accordingly, the system described in this document offers a few advantages over the already described commercial solutions, the first being value for money - a system can be cheaply built with very good image quality, given the low cost and very good quality of web-cams on the market today. Flexibility is also relevant, given the easy parametrization of the video feed. Installation is highly facilitated as most web-cams are powered through their USB cable and no wall plug is necessary.

Some disadvantages are also introduced. As an example, in order to cover a wide area with many cameras, either a series of USB hubs has to be used, which drives up cost and hurts installation practicality, or a network communication system has to be deployed, which increases development time and cost (since a Raspberry Pi has to be purchased for each area). In conclusion, this system can satisfactorily perform the functions of a video surveillance system, but only over small areas.

Sports data gathering systems can also be implemented using this system.

Provided that the hardware is conveniently accommodated, in the case of outdoor venues, a Raspberry Pi with a camera can be installed at a sports pitch and gather data on the match taking place. The generation of heat maps has already been addressed in this chapter, and the calculation of the distance covered by a player is possible, given the appropriate mathematical calculations. Some causes for concern are the computational capabilities of the Raspberry Pi and whether it is able to gather data about 22 players at the same time, in the case of a football match. Real-time facial recognition is also difficult to achieve on this platform, so prior manual identification of the players can potentially help deal with the performance concerns. As for wilderness cameras, their use case can also be implemented - albeit with a few limitations - with the system described here. By connecting either a USB camera or a camera module to the Raspberry Pi, the video feed can be continuously analysed by a motion-triggered processing module, saving a snapshot or video while movement is detected. More advanced applications can include network connectivity for remote notification triggering or other uses. This is, however, a case where the commercial system's more specialized design and hardware give it a clear advantage: unless the Raspberry Pi has a nearby power source, any battery that is used will quickly drain, due to the lack of a dedicated motion sensor and the camera having to be always on.


3 Related Technology

Contents
  3.1 Hardware Platforms & Peripherals
  3.2 Detection Algorithm
  3.3 Software Libraries
  3.4 Discussion


In this chapter, the available options for the hardware component of the system are listed and analysed, and the choice of the Raspberry Pi is scrutinized. Software-wise, the functioning of the Frame Differencing algorithm - at the core of all three implemented plug-ins - is explained and possible implementations are analysed. This chapter is, thus, divided into four sections: Hardware Platforms & Peripherals, Detection Algorithm, Software Libraries and Discussion.

3.1 Hardware Platforms & Peripherals

In this section, some possible options for the hardware platform on which the software will run are analysed, as well as the peripherals to attach to it. This description is presented together with a brief analysis of each option's conformance with the system's objectives and requisites.

3.1.1 Processing Platform

The main objective of this project is to build a video processing platform that can be used to analyse video streams of up to HD resolutions while being as cheap, small and portable as possible. Since the project is open-source, feature flexibility is also one of its greatest ambitions. The current fast progress in computing power means that machines found in the market today have very good performance for a very low price, and some trends, like the release of low-power processors, make it possible to build and sell physically small computing devices. In order to build, test and run a complete software package, a hardware platform must be chosen, and that is the first choice to be made in the project's hardware context. The two most obvious options are conventional and single-board computers. Conventional computers are often sold in one of two form-factors: desktops and laptops. Desktops are composed of a central tower including the computer's internal components and a series of peripherals attached to it - keyboard, mouse, sound speakers and monitor. They work by being permanently connected to a power plug and are not made to be conveniently used in several places. Laptops, on the other hand, are much smaller than desktops and already include a keyboard, screen, touchpad and speakers in their chassis, so they are made to be mobile while keeping the same set of functionalities as a desktop computer and adding new ones related to their mobile nature, like wireless communications (Wi-Fi, Bluetooth). They also include a battery, so they can be used for a period of time without being connected to an external source of power. A specific type of laptop is the netbook - a visually very similar computer that focuses on the core functionalities of a computer. By being designed with a low-power, low-voltage processor, less RAM and less storage capacity, these computers manage to have even smaller sizes while keeping acceptable levels of performance for the average user. It is increasingly difficult to find these models, though: since the emergence of tablets like the iPad, consumers have increasingly abandoned netbooks and most models are being discontinued.


Figure 3.1: Eee PC 4G, the first netbook released by ASUS.

The other option is single-board computers. These are small electronics boards that share a common architecture with a conventional computer - they include a processor, RAM, graphics unit and I/O. Since USB and HDMI ports are common in these devices, usual peripherals like a keyboard, mouse and screen can be connected and the board can be used exactly like a common computer. They are usually run by a system-on-chip containing an ARM processor (similar to those found in smartphones), even though some efforts to release x86 single-board computers are under way.

Figure 3.2: Intel Galileo, an example of an x86 single-board computer.

There are several models of single-board computers in the market, offered by several different manufacturers. The most popular of this type of devices is the already mentioned Raspberry Pi. An alternative is the Intel Galileo[22], an effort by the North-American firm to compete with the Raspberry Pi. For 70 dollars, the Galileo offers a 400 MHz 32-bit x86 Intel processor, 256 MB of RAM, a 10/100 Mbps Ethernet connection, and two USB 2.0 ports - one client and one host.

A third option is the BeagleBone Black [23]. This model offers a 1 GHz ARM Cortex-A8 processor and 512 MB of RAM for about €50.


Figure 3.3: BeagleBone Black Revision C board.

3.1.2 Camera

The camera is probably the second most important part of the system since, except for the cases where the video feed is already locally stored, it is the source of all data being treated. A good camera also has its own set of requisites: it should provide imagery with good quality, it should be small, easy to mount and cheap, the lens should cover an area as wide as possible and it should offer a wide range of image resolutions, with at least VGA (640x480) and HD (1280x720 and up). The better the image quality provided by the camera, the better data the system has to work with, which can improve detection rates and lower false positives. Nowadays, it is easy to find web-cams with very high resolutions (up to 1080p) from reputable brands and with good image quality. Configurability of the image quality is important since, as has already been noted, different applications can have different resolution requirements. For the implementation in question here, 640 by 480 pixels is considered sufficient (a VGA frame has only about 15% of the pixels of a Full HD one), given the relatively low computational power of the Raspberry Pi. For most applications in the context of this project, a monochrome frame is also a viable alternative to a colour one - it holds only half of the data of a 4:2:2 colour frame, and colour information is not strictly necessary when detecting simple movements or building heat maps. The requirement for a low-cost device also means the camera will be simple, offering no specialized or high-profile features. Some cameras on the market already integrate motors so they can be remotely controlled, infra-red sensors or LEDs for better night vision. Most times, when these features are present, they increase the size of the device as well as its cost and energy consumption. The uses described in this document have no need of such specialized features, so a simpler device will be considered. One point against the Raspberry Pi is that its USB ports do not provide enough power to drive some devices, like an HD web-cam, which makes the use of a USB hub inevitable. With some USB hubs being bigger than the Raspberry Pi itself, this fact somewhat reduces the system's ease of installation and portability.

The most immediate choice was the official Raspberry Pi camera module[24] - a camera accessory that connects to the Raspberry Pi's Mobile Industry Processor Interface (MIPI) camera port.

Figure 3.4: The Raspberry Pi camera module.

The module is capable of both HD (720p and 1080p) resolutions for video and 5 megapixel photos, and is basically a small camera sensor connected to an electronics board. There is also a version without an infra-red filter. Points in favour of this module start with its connectivity: since it connects to a dedicated interface on the Raspberry Pi, no USB ports are taken, meaning more available interface options, possibly lower power consumption and no need for a USB hub. Cons include the fact that relocating the camera now requires relocating the whole system rather than just the camera. The chosen alternative, however, due to its wider availability, was the Logitech C270[25] - a USB camera capable of resolutions of up to 720p and of sound capture. It is Linux- and V4L2-compatible.

Figure 3.5: Logitech C270 web-cam.


The camera is small and, due to its mounting mechanism, easy to set up in most places. Available for about €30, it is cheap and has very good image quality, and its V4L2 compliance means other software can directly read the pixel data. Image data is provided in YCbCr 4:2:2 and VGA resolution is available. Due to limitations with the Linux V4L2 driver, however, no HD resolutions are available. A USB connection means the relocation of the camera is easier than with the camera module.

Another feature offered by this camera is that it directly provides pixel data in the YCbCr colour space. YCbCr is a family of colour spaces used to encode a pixel's luminance and chrominance values, instead of the primary colour components in red, green and blue planes. YCbCr is very often used in video codecs, H.264 being the most well-known example[26].

YCbCr stores three values for each pixel: Y, Cb and Cr. Y encodes the pixel's luminance or luma (light intensity), Cb is the pixel's blue value minus luma and Cr is red minus luma. The combination of the Cb and Cr values encodes the pixel's colour - if these two values are not present or are set to zero, the result is a monochrome image. This is a very important feature for the purpose of this project - as has been stated, monochrome pictures hold much less data than colour images and for many purposes are just as useful. The use of YCbCr means that the transition from colour to monochrome is trivial. Another advantage is the possibility of exploiting the fact that the human eye is much more sensitive to variations in light intensity than in colour value by sharing a single set of colour values among a number of pixels. By providing only one pair of Cb and Cr values for several pixels, an image very similar to the original can be achieved in a stream that takes much less bandwidth to broadcast, or a file that takes less space in its storage medium.

Two colour subsampling examples are YCbCr 4:2:2 and 4:2:0 - the first has a pair of Cb and Cr for each two pixels and the second, a pair for each four pixels. Changing from 4:4:4 (colour information on all pixels) to 4:2:2 saves about a third of the data.

Having to process less data can prove crucial for the performance of the system, particularly one as limited as the Raspberry Pi. Usually, one byte (8 bits) is used to store each of a pixel's components, both for YCbCr and RGB data. With VGA-sized frames (640 by 480 pixels) and at a typical video rate of 25 frames per second, an uncompressed video with full colour information amounts to about 22 MB of data to process or store per second. 720p and 1080p videos use about 66 and 148 MB/s, respectively. With such high data rates, a reduction of a third can prove fundamental in deciding whether the system is viable or not.
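As a concrete illustration of how cheap the colour-to-monochrome transition is with this colour space, the C++ sketch below extracts the luma plane from a packed YCbCr 4:2:2 buffer. It assumes the common YUYV byte order (Y0 Cb Y1 Cr for every pair of horizontally adjacent pixels), which is the layout this kind of web-cam typically delivers; the function name is an illustrative assumption.

// Sketch: obtain a monochrome (luma-only) frame from a packed
// YCbCr 4:2:2 buffer in the common YUYV order: Y0 Cb Y1 Cr for
// every pair of horizontally adjacent pixels (4 bytes per 2 pixels).
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> lumaFromYuyv(const std::uint8_t* yuyv,
                                       int width, int height) {
    std::vector<std::uint8_t> luma(static_cast<std::size_t>(width) * height);
    // Every even byte of the packed stream is a Y sample, so converting
    // to monochrome is just a strided copy - no arithmetic is needed.
    for (std::size_t i = 0; i < luma.size(); ++i)
        luma[i] = yuyv[2 * i];
    return luma;
}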

Although the terms YCbCr and YUV are usually thought of as having the same meaning, YUV originally referred to analogue television transmissions. More recently, it has come to be associated with YCbCr-encoded files.


3.2 Detection Algorithm

Together with the development of the software architecture, communication protocols and methodologies for the development of new features and modules, three example processing modules will be developed - one that looks for intrusions in the video feed and runs a custom-defined event trigger, one that watches a scene for a period of time and builds a thermal map of the areas in the feed with the most changes, and a third one, similar to the second, but which dynamically adapts its detection to a possibly changing background scene.

Each one of these modules needs its own processing algorithm - the definition of internal structures and how to update them according to the frames coming from the video feed, and how to calculate the final result. Each one also has its own set of requisites for the end result to conform to the module’s specifications.

The simple detection module's algorithm needs to be as fast as possible, in order to guarantee that even quick movements are properly detected, so a high frame-rate must be achieved. However, the system must be usable in a number of different scenarios while keeping good detection rates with few false positives. Activity mapping (which refers to the building of heat maps indicative of movement quantity), both in its simple and advanced versions, is perhaps the least demanding in terms of requisites, as negligible movements (e.g. the leaves of a tree moving in the wind) are easily identifiable by analysing the heat map itself, which will be drawn over a snapshot of the video feed. However, if there is a large quantity of such irrelevant movements, it might become important to tackle this phenomenon.

Despite their different purposes, a core algorithm is shared by the three modules: the Frame Differencing algorithm, also known as Background Subtraction[27]. In fact, all three implemented modules involve, at the most basic level, a comparison of the current state of the video feed with a chronologically older state and an assessment of the level of difference between the two points in time. Since a video feed is a series of photographs - more commonly known as frames - and each frame is simply a matrix of integer values, the simplest and most effective way to assess this difference is to subtract the two matrices, which is the core of the Frame Differencing algorithm. It is based on the calculation of difference frames - structures with the same resolution as the video feed that result from the subtraction of one frame from the other. From these difference frames, one can infer the level of change in the video feed between the old and the new frame.


Figure 3.6: Frame Differencing execution flowchart.

The algorithm works by saving one frame in memory and comparing new frames to it. Since a video frame is basically a matrix of values (usually with each pixel represented by an integer in the range [0; 255]), to compare the two frames, each pixel R(x, y) in the saved frame - often known as the reference frame - is compared to the corresponding pixel C(x, y) of the current frame by calculating the absolute value of the difference between the pair, D(x, y) = |C(x, y) - R(x, y)|[28]. These difference values can then be organized in the same order as the pixels they originated from, forming a difference frame. This frame contains only non-negative values (also in the [0; 255] range), indicates how much change (in value) there was in each pixel, and can be used to infer whether there was an intrusion, how big it was and in what zones of the video frame it happened. If some relevant change is detected in the pixels, it means there was an intrusion in the video feed. Below is an example of a difference frame.

Figure 3.7: A difference frame with two visible cars. Extracted from http://www.mathworks.com/discovery/object-detection.html.
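A minimal C++ sketch of the per-pixel computation just described could look as follows; the function name is an illustrative assumption, and both frames are assumed to be stored as row-major 8-bit luma buffers of the same size.

// Sketch: compute a difference frame from a reference frame R and the
// current frame C, both stored as row-major 8-bit luma buffers.
// Each output value is |C(x, y) - R(x, y)|, in the range [0; 255].
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

std::vector<std::uint8_t> differenceFrame(const std::vector<std::uint8_t>& reference,
                                          const std::vector<std::uint8_t>& current) {
    std::vector<std::uint8_t> diff(reference.size());
    for (std::size_t i = 0; i < reference.size(); ++i) {
        int d = static_cast<int>(current[i]) - static_cast<int>(reference[i]);
        diff[i] = static_cast<std::uint8_t>(std::abs(d));
    }
    return diff;
}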

Its simplicity means that, in its simplest forms, the Frame Differencing algorithm will provide inadequate results in all but the most trivial applications. Taking, for example, a processing module whose intended purpose is to detect intrusions in a room with networking equipment, the constantly blinking lights will show up as positive values in the difference frames and false positives will be triggered. If the camera is overseeing an outdoor scene and the reference frame is taken at 7 am, by 12 pm the lighting and shadows will be completely different and the warning event will be constantly triggered.

When used in dark places, the camera will increase the sensor sensitivity (also known as ISO) to make for a clearer image. However, while dark areas in the picture do indeed become more visible, noise also becomes present throughout the image, as can be seen below:

Figure 3.8: A photograph taken with a high ISO setting.

However, the algorithm's simplicity is also a strength, meaning several aspects of its operation can be adapted and parameterized. The first solution to the set of identified problems has to do with the intelligence behind the analysis made of the difference frame. This part of the algorithm can be thought of as a black box - its input is the difference frame and its output is a boolean corresponding to whether there was an intrusion or not. In order to infer this answer, a series of parameters can be set. The first and most intuitive is a minimum threshold on the number of pixels with a difference, required to trigger a warning. By setting this parameter, problems like false positives caused by small changes are minimized, as these cases are usually confined to a small fraction of the frame. If this setting is a percentage of the frame's pixels, instead of an absolute value, resolution-independence can even be achieved. The second parameter is the minimum change (in value) a pixel has to have in order to be counted towards the previous parameter. By imposing a minimum limit of, for example, 15% change, problems like the variation of most pixels in images with high sensor sensitivity can be solved, since most of these pixels vary in value, but not by a large margin, as can be seen in the above figure (with high ISO and noise levels). The usage of this parameter by itself is not effective, as a single pixel with a big change is enough to trigger a false positive. Using this parameter in combination with the previous one is safer - resulting in a criterion like 'a variation of at least 20% of the luminance value in at least 25% of the frame's pixels'.

27 3. Related Technology at least 20% of the luminance value in at least 25% of the frame’s pixels’. If the first parameter is met (which presumes the second has too), the frame is set as a de- tection frame, meaning a frame including an intrusion. A last parameter can be set: the minimum number of consecutive detection frames required for a warning to be triggered. By setting a min- imum period of time where detection frames have to be consecutively detected, short but swift changes that are not supposed to trigger warnings (like a lightning in the street) can be ignored. But changes in the way difference frames are handled do not solve all of the algorithm’s chal- lenges - in the case where the camera is pointed outdoors and triggers are issued every frame because the time of day has changed, the difference frames still show big values in most pixels, for lots of consecutive frames, since the just captured frame is very different to the reference one. The solution is to adopt a sliding reference frame mechanism. This means that the reference frame is not static and it can be periodically refreshed during execution. Through this mechanism, new frames are compared to one that is chronologically closer to it and such problems can be solved. The parameter here is how often the reference frame should be refreshed, or it should not be at all. An important point regarding these parameters (or thresholds) is that there is no one size fits all set of values that is appropriate to all applications - a minimum of 5 seconds with detection for a warning to be triggered can be enough when observing a laboratory or an animal habitat, but not scenarios where intrusions can be very fast, like sports-related applications. Some manual intervention and fine-tuning is necessary. Another way in which Frame Difference’s simplicity is a plus is that it makes it fast. The whole algorithm relies on very simple mathematical operations (subtraction and absolute value) and memory pointer manipulation. Also, most of the computational effort of the algorithm is the calcu- lation of the difference frame - a difference between two matrices. This is a series of subtractions where the values are not dependant on each other, so it can be safely divided into several working threads. By off-loading this effort to the Graphics Processing Unit (GPU) -an architecture charac- terized by its strong multi-threading capabilities - with a framework like nVidia’s CUDA, very high speed-ups can be potentially obtained. Hence, the Frame Differencing algorithm is very versatile, being usable in both simple detec- tion and over-time data collection processing modules.

3.3 Software Libraries

Apart from Frame Differencing, the presented algorithms would be very time- and effort-consuming to implement from scratch, so it becomes more efficient to analyse and use an existing software library. Among those, two are clearly the most popular: the MATLAB Computer Vision System Toolbox and OpenCV.


3.3.1 MATLAB Computer Vision System Toolbox

MATLAB is a programming environment developed and maintained by MathWorks, specifically tailored for scientific calculations and numerical computing [29]. It allows programmers and scientists to write programs to perform advanced mathematical operations, graphic plotting, signal processing, image processing, computer vision, finance operations, among many others. These functionalities are provided through a range of included packages and optional toolboxes.

Among those is the Computer Vision System Toolbox[30], which provides tools for computer vision and video processing systems. This toolbox implements feature extraction, motion detection, object detection, object tracking, stereo vision, video processing, and video analysis algorithms. The ForegroundDetector System object provides an implementation of the Frame Differencing algorithm, providing a way to compare a video frame to a background model and determine whether a pixel is part of the background or the foreground.

Programmers are also abstracted from system-level details and memory management procedures: MATLAB takes care of such system-specific tasks. MATLAB is also available for Linux.

3.3.2 OpenCV

OpenCV[31] is a set of libraries, written in C++, that implement computer vision and machine learning algorithms. It includes, among many others, face detection and recognition, image stitching, motion analysis, object tracking and object detection.

OpenCV started as an Intel research project. It is now maintained by OpenCV.org, a non-profit organization. Due to its extremely wide application domain, OpenCV is deployed around the world and used by several organizations, making it the de facto software library for object detection. There are also several extensions of its functionality (that have OpenCV as a requirement), like BazAR and LibPaBOD.

OpenCV implements more than two thousand algorithms and has C, C++, and Python interfaces [31]. It is open-source and released under the BSD license, meaning that it can be compiled for new platforms and that the details of its implementations can be analysed. Because of this, OpenCV has already been compiled for use on the Raspberry Pi. Since it is still a big software package, if the compilation procedure is performed on the Raspberry Pi itself, the compilation time can run to several hours. This is not a problem, however, as cross-compiling is also an option, with several tutorials available throughout the Internet. OpenCV can also be configured, at compile time, to use Video4Linux as the interface with the web-cam.

OpenCV offers an implementation of the Frame Differencing algorithm through its BackgroundSubtractor class.


3.3.3 Motion

Motion[32] is an open-source Unix library, written in C, that connects to a camera and performs motion detection on the video feed. It has a set of parameters that can be configured to tune the motion detection algorithm. A number of features covering what to do when motion is detected are also offered, including snapshot and video saving, the insertion of rows into MySQL and PostgreSQL databases, and others. The fact that it is open-source allows developers to inspect its workings and add new features.

3.4 Discussion

In this section, a reflection is made on the choices of hardware and software.

3.4.1 Hardware

The choice of the processing platform on which to develop and run the software came down to three categories: desktop computers, laptops and single-board computers.

In their favour, desktops have superior performance and upgrading options. Desktop CPUs usually run at higher voltages than their equivalent laptop models and components are usually cheaper for the same performance. Desktop motherboards have wide expansion options, making it easy to increase the amount of system memory, system storage or graphics capabilities, by installing a more powerful graphics card or even several cards in parallel. The superior performance of desktop computers is a big point in their favour, particularly when applying complex algorithms over large amounts of data like High-Definition (HD) video, but some of the system's main objectives have to do with small size and installation and relocation practicality. Desktop computers are not designed with such concerns in mind, which severely limits their application in the case of this particular system. Furthermore, none of the initially developed modules justifies the extra cost and size that come with the use of a desktop instead of a smaller, cheaper machine.

Laptops appear to be a viable option. Despite offering slightly less value for money in performance terms, their much smaller form factor makes them much more flexible in terms of where they can be installed. Performance is lower than that of desktops, but much better than that of single-board computers in most cases, and the presence of a battery allows the system to run independently of a nearby power source, even if only for a few hours. The bundling of a series of peripherals - keyboard, screen, wireless connectivity and mouse input through a trackpad - allows for easy interaction through local command input with no need for a separate station.

Despite their good offering, laptops have a few points where the system's objectives are not fully met and single-board computers offer better value. Most models have active cooling, i.e., a moving part whose purpose is to lower the system's temperature. This will usually be a fan that is continuously running to blow air through a series of vents and expel the heat from the laptop's components. A fan also brings in dust and humidity from outside, which can be very harmful if the laptop is run outdoors, and several planned applications of this system involve outdoor installation. Performance can also be an issue. Commonly, the smaller the laptop, the lower the processor's power, which can be a particular problem in netbooks. These devices have such small form factors that heat dissipation becomes very limited; in order to decrease heat production, performance is severely sacrificed.

A laptop's weaknesses are most often where a single-board computer shines. A device of this type is usually extremely small - even more so than netbooks. A Raspberry Pi, for example, is about the same width and height as a credit card - no netbook is as small and, as such, as easy to install. This system is not envisioned as something that will be moved often, so a laptop's battery ends up being a small advantage and, in any case, accessory batteries are available if such a need arises. The lack of included peripherals is easily overcome: the presence of standard USB ports allows the connection of keyboards and mice, and a network connection allows remote interaction through Secure Shell (SSH), File Transfer Protocol (FTP) and similar protocols. The biggest advantage single-board computers have over all other options is cost. The most advanced model of the Raspberry Pi costs 35 dollars, while a new notebook will rarely be found for less than 200.

The next choice is which specific model to work on. The chosen option was the Raspberry Pi, given its low cost, good connectivity options and high popularity. The last point is a particularly important one - the Pi's wide acceptance means there is a bigger community of users and more support, including pre-compiled libraries. Accessories will also be easier to find.

Cost At the time of this writing, three versions of the Raspberry Pi are available: the Model A, the Model B and the Model B+. While the Model A has a price of $25, both the Model B and B+ are slightly more expensive, costing $35. Any of these options represents a very low price for what is essentially a fully capable - albeit low-power - computer, thus offering an unbeatable benefit-to-cost ratio. Conventional entry-level computers - with a keyboard, screen and battery - can cost ten times as much, even those pre-loaded with free distributions of Linux.

Computing Capabilities The Raspberry Pi’s technical specifications vary according to the model, but a few points are common to all three[6].

All Raspberry Pis are equipped with a Broadcom BCM2835 system-on-chip, including a 700MHz single-core ARMv6 ARM1176JZF-S CPU and a dual-core multimedia co-processor[33]. This is the frequency the processor is configured to run at from the factory, but it can be increased to up to 1GHz without voiding the device's warranty. All models also offer HDMI (for video and audio), 3.5mm jack (for audio) and composite video connections. They are powered by a Micro-USB connection and include a GPIO connector. The GPU is always a Broadcom VideoCore IV, running at 250MHz.

The Raspberry Pi runs purpose-built Linux distributions, of which there are several. The most popular is the Debian-based Raspbian. The operating system is loaded to a memory card (SD for the A and B models, MicroSD for the B+) and run from there.

The first point where there is a distinction between the models is the capacity of the included RAM: the Model A includes 256MB and the Models B and B+ double that capacity to 512MB, with the RAM being shared with the GPU in all cases. All models feature USB connectivity - the Model A offers one port, the Model B offers two and the Model B+ offers four. Network connectivity is only included in the B and B+ versions, by way of a 10/100 Mbps network port. The only way to obtain network connectivity with the Model A is to connect an external USB wireless adapter, which can be very power-consuming and takes up the only on-board USB port.

After an analysis of the several available models and the requisites of the project, the Model B was chosen, as the added cost was deemed worth the offered improvements. The added RAM can be very useful in modules where several frames have to be kept in memory. If we take an example where 25 Full HD (1080p) monochrome frames have to be kept in memory, those frames take up about 50MB of RAM, or about 20% of the Model A's capacity. Taking into account that the operating system can also take a big part of the available system memory, it becomes obvious how the extra RAM can prove decisive for the system's performance. Another point was the inclusion of the Ethernet port. Network connectivity is a very important functionality in cases where, for example, a simple detection module wants to trigger remote notifications. Other applications, like remote live video feed watching or the remote sending of heat-maps, also require network connectivity. The option of an external Wi-Fi adapter is still present, but with the already mentioned problems.

Despite its current availability, the Model B+ was not considered as the project's development platform, given its announcement and retail availability halfway through the project's development. Had it been available at the time the project was being planned, it would probably have been chosen, given that there is no price increase over the Model B and that the addition of two USB ports could have rendered the USB hub unnecessary in some configurations. It is, however, unclear whether the new USB ports provide enough power to drive the adopted web-cam.

Size and Portability All models of the Raspberry Pi adopt a very small form factor, being the same width and height as a credit card and about 2 cm thick due to the electronics and ports. Being an electronics board, it is not appropriate to install it where dust or weather conditions might damage it. Protective cases are available, some with outdoor protection and others with mounting points for wall installation, improving the device's portability and ease of installation.

Popularity The Raspberry Pi is, by far, the most popular single-board computer on the market today, with several million units having been sold. This means there is a very large number of software developers already developing for this platform. A part of those are open-source enthusiasts, porting existing libraries to it, developing new software, writing tutorials and solving problems. These factors increase the available tools and make it easier to find solutions to problems online, decreasing development time. The market of Raspberry Pi owners is also big, which encourages companies to develop and sell accessories, adding further functionality and convenience to the device.

3.4.2 Detection algorithm

The frame differencing algorithm, despite its simplicity, is the best suited for most event-triggering applications. The rapid subtraction between frames allows for fast checking of movement and, provided it is well tuned, can return very good results. There are, however, some event-triggering applications where this algorithm will not provide good results: those with situations involving lots of movement that should not trigger an event. Consider, as an example, a scenario where the system is analysing a train track, looking for people. The passing of trains will trigger the frame differencing algorithm, but this is a false positive. Applications where the event depends not on movement, but on the detection of a specific shape (a human one, in this case) need either the Viola-Jones or the Histogram of Oriented Gradients approach. This is a scenario that falls outside the scope of this report: the developed event-triggering plug-in will only look for simple movement. The same goes for the human monitoring scenario - the simple comparison between frames is not enough to build the thermal map, which means one of those two algorithms, combined with a tracking mechanism between frames, would have to be used.

3.4.3 Software Libraries

When choosing an object detection software library, one has to weigh the offered functionality and convenience on one hand against the scarce computational resources available on the other.

Convenience is MATLAB's greatest strength: with such a vast and powerful set of included functionalities and being easy to code for, a programmer can take only a fraction of the time and of the code size it would take to perform the same operations in C or C++, thus decreasing the development and debugging time and allowing for expanded functionality. However, this environment also has its cons. The most relevant is the cost - a single MATLAB license costs nearly $2000, which would increase the cost of the described system more than tenfold. Second, it is proprietary, closed-source software, which eliminates the ability to analyse and possibly optimize the implementations for the intended applications. The third aspect concerns its computational requirements: MATLAB is a very large, several-gigabyte installation that vastly surpasses the system memory and storage available on the Raspberry Pi. Even compiling MATLAB programs and running them on the Raspberry Pi with the MATLAB Compiler Runtime is not an option, as a several-hundred-megabyte executable is generated. Hence, despite its conveniences, its cons vastly outweigh its pros, particularly in a project where cost and computational efficiency are so important. This makes its use completely unviable.

Despite the convenience of already providing an implementation of the Frame Differencing algorithm and the fact that it is free and open-source, OpenCV is a very big software package, which makes it very demanding in terms of computing capabilities if its algorithms are to be run with acceptable performance. Acceptable frame rates are a goal of this system and, with a very limited platform like the Raspberry Pi, OpenCV's overhead makes it difficult to justify opting for it.

Though also free and open-source, very feature-complete and similar in purpose to the simple detection module developed for this project, the Motion library is made available as a single package, with no module/plug-in architecture. This makes it cumbersome to add features like reading from a file instead of a camera, or the calculation of heat-maps, while still selecting from the various available features.

Given that none of the analysed libraries and implementations of the Frame Differencing algorithm fully lines up with the system's goals and limitations, the decision was made to implement a custom version of the algorithm.

4 Proposed Architecture

Contents 4.1 Hardware Layer ...... 36 4.2 Software Layer ...... 37


4.1 Hardware Layer

The hardware layer of the system is structured in several parts but, when referring to the implemented configuration described here, there are two distinct sets of devices:

1. the processing board and its resources

2. its connected peripherals

The processing board is the central workstation that provides resources like file storage and system memory; other devices may be connected to it.

Figure 4.1: Hardware schematic of the system.

4.1.1 Processing Board & Resources

The processing board is the central computing part of the system. It runs all the implemented software and algorithms and it is usually a computer or embedded device with a similar architecture. Such an architecture gives the software it runs access to the board's computational resources, the first of which is its processing capability: the processing board will usually have a Central Processing Unit (CPU) to perform the usual mathematical computations behind a computing device. The second resource is system memory, more commonly known as Random-Access Memory (RAM). System memory is where the board's operating system and programs are loaded and each program has access to a fraction of that memory. This lets programs keep a set of variables and structures that can be used to perform algorithms over the received frames and output the correct result for their use case. A network interface is also offered by way of an Ethernet port, to allow communication with remote systems, and USB connectivity is also present for the connection of other peripherals.


4.1.2 Peripherals

In the specific case of the Raspberry Pi, four types of peripheral connections are provided. The first are multimedia interfaces: HDMI, composite video and 3.5mm audio jack interfaces are available to allow the export of video and audio to appropriate devices. HDMI and composite video are commonly used to connect a screen and interact with the system, while audio output can be used by modules that wish to trigger a sound alarm. USB connectivity is currently the de facto interface for peripheral connections, allowing the inclusion of cameras (essential in this system), USB hubs and wireless network connections. Other types of peripherals can be added but fall outside the scope of this project. GPIO and MIPI interfaces are also present, but will not be used in this case.

4.2 Software Layer

Figure 4.2: Software schematic of the system.

The software layer of the system is composed of two main modules: the Capture Module and the Processing Module. There is also a third element in the structure that offers a set of useful functions and to which both modules have access. The capture and processing modules each represent a specific set of behaviours and functions in the system that any implemented instance must provide. Both are essential parts in the correct functioning of the system. Since their cooperation and correct coordination is vital to the system, a Communication Protocol between the two has been defined.

4.2.1 Capture Module

The capture module works between the processing module and the source of the video feed, acting as an abstraction layer between the two. It connects to the source of the video feed - be it a camera, a file or a network stream - obtains the data and exposes a set of functionalities to the processing module, so that it can obtain the images organized in frames without having to adapt to the details of the source itself. Furthermore, the capture module can also implement a series of pre-processing algorithms applied to the video feed before the frames are passed to the processing module. Examples of pre-processing that have been included in this module and that are supported in the communication protocol are frame downscaling, histogram equalization and colour to monochrome conversion.

Frame downscaling refers to the change in resolution of an image so that a second image, originated from the first one, has the same visual contents as the first but a lower resolution (with the same aspect ratio). It is used to reduce the quantity of data that has to be stored, transmitted or processed when some loss in image information and detail is not detrimental to the results of the algorithm. Histogram equalization refers to the recalculation of an image's pixel values so that they are more evenly distributed across the image's histogram. This is used to increase the image's contrast, which helps detect small changes. Colour to monochrome conversion refers to the discarding of a colour image's chroma information, thus obtaining a black-and-white version of the same image. Monochrome pictures normally take up less space than their colour counterparts, so this conversion is also used to save bandwidth, storage and computational effort in cases where colour is not important for the system's intended use.

Two examples of capture modules were implemented for this project: one that fetches the data from a USB web-cam and one that obtains it from a local raw YCbCr file. The only differences that any processing module should detect when working with these two implementations are a higher frames-per-second rate on the file-reading version and possibly less flexibility on the same module, given that a video file usually has only one native resolution, unlike a camera, which can support many different ones.

4.2.1.A Capture from USB Web-cam

This module's purpose is to capture real-world footage and organize it into frames to be read and analysed by processing modules. This implementation of the capture module specification obtains the real-time video feed from a camera connected to the processing board through USB. To perform the connection and communication with the device, it uses the Video4Linux2 (V4L2) API, a video capture interface that is closely integrated with the Linux kernel.


Figure 4.3: USB Capture Module execution flowchart.

The module is responsible for two main tasks: querying of device capabilities and obtaining image data. The first is based on fetching the complete list of configurations the device offers for the obtained video feed. Such configurations may encompass video resolutions, formats and colour spaces. Knowing this list (and possibly its own conversion capabilities), the module is capable of reading the processing module's request for video configuration (a process further explained in the Communication Protocol section) and returning the configuration that most closely matches that request. After the parameters have been agreed upon, this module queries the device for image data and returns it to the processing module, organized in frames. V4L2 offers mechanisms to support both of these tasks.

4.2.1.B Capture from Local File

Throughout the development phase, it became apparent that it would be useful to sometimes save video sequences for later use, instead of always performing real-time processing. This would allow the same sequence to be processed by different processing modules, the same processing module in different stages of development or even the same module with different parameters. This version of the capture module was developed to allow such development practices.

Figure 4.4: Capture From File Module execution flowchart.


The name of the file containing the sequence is provided by the processing module. The module opens the file (which, in the presented implementation, should contain data in YUV 4:2:0 format) and presents the contents, organized into frames, to the processing module. This makes the usage of pre-captured sequences transparent to it and adds useful debugging and testing features to the development of this project.

4.2.2 Processing Module

The processing module is the central part of the software component of the system. The system can have a nearly unlimited range of uses, a few of which have been enumerated in this document - the processing module is where this use case is implemented. The processing module communicates with the capture module and, through it, obtains the video feed’s frames - matrices of integers, the same resolution as the video feed, that represent the frame’s pixels. By reading these frames, updating internal structures, performing algorithms over them and using the processing board’s resources, this is the module that produces the whole system’s end result. Being a common program, the processing module has access to all of the board’s resources and peripherals. These include system memory, local storage and network communications. Four examples of processing modules have been implemented:

4.2.2.A Video Store

The video store module was developed with the same motivation as the capture from file module. While that module reads raw YUV files from the hard disk, this module creates them. It receives the frames from the capture module (usually the one whose data source is a USB camera) and saves them to a file.

Figure 4.5: Video Store Processing Module execution flowchart.

The implementation of this processing module is the simplest of the four that have been developed. Upon arrival of a frame, it goes through an optional processing stage. The resulting frame is then output to the file and the system is ready for the next frame. In the version that was implemented for this project, no processing is applied to the frame beyond that already performed by the capture module. This module is very simple and does not use the system's network capabilities (even though such an implementation is possible, for online streaming of the video feed). The only used resources are the system memory, for temporarily holding the frame before it is output to the file, and the local storage, to save the resulting video file. Alternatively and less frequently, this module can be used in conjunction with the capture from file module to apply post-processing mechanisms that might have been left out in the first run - for example, to create a second video file that has been downscaled.

4.2.2.B Simple Movement Detection

The simple movement detection processing module's purpose is to watch over a video feed and detect foreign movement in it, like a person entering the frame. If such an intrusion is detected, a pre-defined event is run, like the playing of a beep or the triggering of a remote notification.

Figure 4.6: Simple Movement Detection Processing Module execution flowchart.

The module uses the Frame Differencing algorithm, which detects intrusions by comparing the frames in the video feed to an older one. If the difference surpasses a given threshold, it is inferred that there was an intrusion and the pre-defined event is run. The first step in the module's execution upon the arrival of a new frame is the update of the internal reference frame - the frame newer ones are compared to. The update mechanism is run on two occasions: at the start of execution, to save the initial reference frame, and when a sliding reference frame mechanism is used and the time threshold for reference frame renewal has been surpassed. In either case, this frame is not compared to any others. Subsequent frames that do not require a reference frame renewal are compared to the existing reference frame and a difference frame is calculated. The usage of the Frame Differencing algorithm has previously been suggested for movement detection[27, 28, 34].


This difference frame is fed to the Movement Detection stage, where the intelligence behind intrusion detection - discussed in the Frame Differencing section - is encapsulated. If an intrusion is detected, a file can be saved (snapshots, video) or a remote notification may be sent. This module is very versatile and flexible, particularly in the range of options for the triggered event, which means that it can use nearly the full range of resources offered by the processing board - local storage and network communication are two options. System memory is also used for keeping the reference and difference frames.

4.2.2.C Activity Mapping

The Activity Mapping processing module is one example of an over-time data collection application, meaning it collects data for a period of time and, at the end of that period, produces an output. In this case, the output is a heat map - an image representative of the areas of the video feed that, over the period of data collection, saw the most movement. This is indicated by a scale of colours ranging from cool to warm tones, indicating, among the zones with measurable movement, those that saw the least and the most of it, respectively.

Figure 4.7: Colour scheme showing warm and cold colours.

The output consists of the heat map overlaid on a screenshot from the video feed, for easier identification of the zones in the feed that the thermal map is indicating. An option is available to break the execution every X seconds, so that a series of heat maps is generated instead of a single one. This algorithm works with an adaptation of the Frame Differencing algorithm.


Figure 4.8: Activity Mapping Processing Module execution flowchart.

This module works, to a certain point, similarly to the Simple Movement Detection module. A reference frame is kept in memory and new frames are compared to it but, instead of determining whether there was any movement at all in the video feed, a new structure is kept and updated: the accumulation map. This is a matrix with the same resolution as the video feed's frames, but capable of storing in each of its cells a much larger number than the pixels' maximum value of 255. The purpose of this structure is to keep track of how much change in value the pixel corresponding to each cell saw throughout execution. At the end of execution, the values within this structure are translated to colour values, thus originating the heat map.

Each cycle in the execution of this module starts with the evaluation of whether or not a new reference frame needs to be copied into memory and the accumulation map needs to be reset. As before, these operations are performed on the first program execution and, when a periodic reset parameter of X seconds has been set, every time that interval expires. Either way, a new frame is fetched from the capture module and the accumulation map is reset to contain only zeroes. This is known as the program reset stage. The nuance to keep in mind in this stage is that, if its execution was motivated by the expiry of the time-limit parameter, the heat map calculation and output stage is executed. Once the program has been reset, it halts until the next frame arrives.

When the next frame arrives, it is compared to the reference frame and the calculated difference frame's (non-negative) values are added to the accumulation map. When execution ends (or, as described before, periodically in some cases), the heat map calculation and output stage is executed. Here, there can be an accumulation map pre-processing stage, where its values can be organized into buckets and then re-calculated so that they are more evenly distributed through their histogram. The presence and execution of these sub-stages are both optional. The last stage is the calculation of the heat map through the mapping of the values in the accumulation map into colours, its overlay onto the reference frame and its output to a file. This module makes more use of the system memory than the Simple Movement Detection module as, apart from the reference frame, the accumulation map has to be kept in memory. Local storage is also used to save the resulting heat map.

4.2.2.D Activity Mapping with Dynamic Reference Updating

The Activity Mapping with Dynamic Reference Updating module is similar to the Activity Mapping module while, at the same time, taking a different approach. The first Activity Mapping module compared the most recent frame to the same older one, with refreshing being limited to an option to periodically replace the reference frame. This new module is more dynamic in its updating mechanism: a parameter X is set and each frame is compared to the X frames that came before it.

Figure 4.9: Activity Mapping with Dynamic Reference Updating Processing Module execution flowchart.

While the first Activity Mapping module had an option to update its internal state every X frames, in this new version that state is updated every cycle. The parameter X defines the number of sequentially older frames that the newest one will be compared to.

The end result of this module is the generation of the heat map of the movement registered by the system during the last X frames.


4.2.3 Communication Protocol

The communication protocol between the capture and the processing modules determines the steps taken by the two modules to reach an agreement on the configuration of the image data, to obtain the data itself and to perform the closing procedures when execution ends. The protocol thus has three stages: initialization, main cycle and finishing. The processing module decides when to initiate each stage of the protocol. This communication protocol applies to all implementations of the capture and processing modules.

Figure 4.10: Capture to Processing Module Communication flowchart.

4.2.3.A Initialization

The initialization phase is where the two modules negotiate the characteristics of the video frames that will be obtained during execution. There is a series of parameters that can be configured: image resolution, image downscaling factor, image format, which post-processing to apply, etc. Accordingly, the initialization step has two sub-steps. The first is a request from the processing module with the image parameters that would best suit its needs. Upon receiving this request, and knowing the capabilities of the capture device, the capture module evaluates whether the request can be fulfilled and, if not, determines the technically possible parameters that most closely follow the processing module's wishes. The new parameters are sent back to the processing module, which reads them and takes them as the rules to follow when receiving and treating data. The capture module is seen as the master in this phase of the protocol: having direct contact with the video feed source device, only it knows its specifications and limitations. As such, the processing module must read every single parameter of the capture module's response, base its own work on them and not assume that any part of its request has been granted. Very important parameters that are also returned at the end of this stage are the pointers to the memory buffers where the frames' data will be saved. Those pointers are determined (and the respective memory allocated) by the capture module.

4.2.3.B Main Cycle

The Main Cycle of the Communication Protocol refers to the stage where the defining parameters of the returned frames have all been agreed upon and the capture module is ready to start querying the data source for frames. In this stage, the processing module takes the controlling role: the capture module simply stands idle, waiting for the processing module to use its interfaces for frame retrieval. The processing module takes the initiative of using those interfaces at its own rhythm, by calling a function; when the function returns, the frame's data is ready at the previously agreed memory location. Several interfaces can be offered, for both native and downscaled frames.

4.2.3.C Finishing

When the processing module determines that the execution is over, the capture module is informed of that fact and closes all connections with the source of data - be they physical devices, file descriptors or network connections - so that good programming practices are followed.

5 Implementation

Contents 5.1 General Structure ...... 48 5.2 Capture Module Implementation ...... 52 5.3 Processing Module Implementation ...... 52 5.4 Communication Protocol ...... 56


This chapter analyses the options considered when implementing the software components of the system. The implementation of the project had two main points of focus: the Raspberry Pi's very limited computing capabilities and the project's objective of being usable in the real world. For the system's usefulness to meet the desired standards, each module's performance had to be compatible with its use case. It is desirable that as many frames as possible are analysed per second and that each frame's resolution is as large as possible, without surpassing the platform's capabilities. In order for these performance levels to be achieved, the efficiency of the software and algorithms was a constant point of focus during development. This chapter starts by analysing the software's general structure and the connection between its components. The algorithms and implementation options for each capture and processing module are then analysed. At the end, the communication protocol implementation is also detailed.

5.1 General Structure

The first (and possibly the most important) choice concerned the programming language to use in the system's implementation. With literally hundreds of languages to choose from, it was important to choose one that helped meet the project's goals. As such, a fast, efficient, widely available language was necessary - good memory management is crucial to ensure that the efficiency and speed standards are met. These goals narrow the array of choices to two languages - C and C++ - both known for the speed of their executables, stability and good memory management. The most popular Raspberry Pi Linux distribution - Raspbian - includes compilers for both (gcc and g++). C was the chosen language, given the author's greater experience with it and the fact that it is easier to implement a C compiler than a C++ one (making it marginally more widely available and making it more likely that the system is usable on a lesser-known platform).

Three files implement the project's basic structure: capturemodule.c, processingmodule.c and utils.c, corresponding, respectively, to the Capture Module, the Processing Module and a Utilities module. The Capture and Processing modules are interfaced through a communication protocol and both must be present at all times for the system to work. Both these files are swappable - any file that correctly implements the communication protocol specification can potentially be used, and herein lies the swappable-modules aspect of the system, which gives it greater flexibility regarding the video feed source and the frame processing use case.

capturemodule.c Acts as an abstraction layer between the Processing Module and the source of the video feed, implementing the functions that allow it to obtain this video feed, organize its frames and perform the necessary pre-processing and conversion, according to the parameters agreed with the Processing Module.


The pre-processing stage refers to a series of algorithms that can be applied to the obtained frame prior to its return to the Processing Module. The developed version includes two algorithms, with the Processing Module deciding, during the initialization stage of the communication protocol, which ones are to be effectively applied.

Frame Downsampling The frame downsampling algorithm is responsible for transforming a frame into another version of itself, with the same visual contents but a smaller resolution (and the same aspect ratio).

Figure 5.1: Example of an original image on the left and its downscaled version on the right.

The main governing parameter in this algorithm is the downsample factor - a positive integer that defines by how much both the image's width and height are to be divided. As such, the downscaled resolution is given by:

\[ downscaledResolution = \frac{nativeWidth}{downscaleFactor} \times \frac{nativeHeight}{downscaleFactor} = \frac{nativeResolution}{downscaleFactor^{2}} \]

This algorithm works by dividing the original frame into downscaledResolution squares, each downscaleFactor by downscaleFactor pixels in size.

Figure 5.2: Mapping of pixels from original to downscaled frame.


The point is that each square in the original frame corresponds to a pixel in the equivalent position in the downscaled frame, as shown in the figure above - this pixel will contain the average pixel value of the original frame's square. Accordingly, to perform this calculation, all the pixel values in the original image's square are summed up and then divided by $downscaleFactor^{2}$. This value is output to the structure containing the downscaled frame's values.

Alternatively, a getFullResolutionFrame() method is also provided, so that the Processing Module can obtain a frame at native resolution and bypass the Frame Downsampling algorithm.
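As an illustration of the averaging just described, the following is a minimal C sketch of the box-average downsampling, assuming a monochrome frame whose width and height are multiples of the downscale factor; the identifiers are illustrative and do not claim to match the actual implementation.

/* Sketch of the frame downsampling step: each factor x factor square of the
 * source frame is averaged into one pixel of the destination. Assumes a
 * monochrome frame whose width and height are multiples of the factor. */
void downsample_frame(const unsigned char *src, int srcWidth, int srcHeight,
                      unsigned char *dst, int factor)
{
    int dstWidth = srcWidth / factor;
    int dstHeight = srcHeight / factor;

    for (int y = 0; y < dstHeight; y++) {
        for (int x = 0; x < dstWidth; x++) {
            unsigned int sum = 0;
            for (int dy = 0; dy < factor; dy++)
                for (int dx = 0; dx < factor; dx++)
                    sum += src[(y * factor + dy) * srcWidth + (x * factor + dx)];
            /* The average of the square becomes the downscaled pixel. */
            dst[y * dstWidth + x] = (unsigned char)(sum / (factor * factor));
        }
    }
}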

Histogram Equalization This algorithm's goal is to increase the image's contrast, potentially making intrusion detection easier. The method to achieve this goal is based on a recalculation of the image's pixel values so that they are better distributed throughout the image's histogram.

A histogram is a graphical representation of the distribution of the image's pixels throughout their range of values.

Figure 5.3: Example of a histogram. Taken from cambridgeincolour.com.

In a monochrome picture with one byte per pixel, the x-axis ranges from 0 to 255 (the possible pixel values) and the height of each bar indicates how many pixels have that value. The image that originated the above histogram had a majority of grey pixels and very few near white or black. The result of the histogram equalization algorithm is an image with a much flatter histogram and, as a consequence, greater contrast. The first step in this process is the calculation of the cumulative distribution function, given as

\[ cdf_x(i) = \sum_{j=0}^{i} p_x(j) \]

with $p_x(j)$ being the image's histogram for pixel value $j$, normalized to [0; 1]. The function mapping a pixel's current value to its new one is given by

\[ cdf_y(y') = cdf_y(T(k)) = cdf_x(k) \times 255 \]

The end result is, as mentioned, an image with a much wider contrast.


Figure 5.4: Example of an original image, on the left, and its version with histogram equalization applied, on the right.
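The following C sketch follows the cumulative-distribution mapping described above for an 8-bit monochrome frame; it is an illustrative reconstruction under those assumptions, not the code used in the system.

#include <stddef.h>

/* Sketch of histogram equalization on an 8-bit monochrome frame. */
void equalize_histogram(unsigned char *frame, size_t pixel_count)
{
    unsigned long histogram[256] = {0};
    unsigned long cdf[256];
    unsigned char map[256];

    for (size_t i = 0; i < pixel_count; i++)
        histogram[frame[i]]++;

    /* Cumulative distribution function of the pixel values. */
    cdf[0] = histogram[0];
    for (int v = 1; v < 256; v++)
        cdf[v] = cdf[v - 1] + histogram[v];

    /* Map each value so that the resulting histogram is (nearly) flat. */
    for (int v = 0; v < 256; v++)
        map[v] = (unsigned char)((cdf[v] * 255) / pixel_count);

    for (size_t i = 0; i < pixel_count; i++)
        frame[i] = map[frame[i]];
}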

To interface with the remaining modules, the Capture Module exposes two functions, in order to obtain the current frame from the video feed:

void getFrame();

This function is called by the Processing Module to obtain a frame from the video feed after the application of the filtering and processing functions, as well as downscaling with previously agreed parameters (according to the later-analysed Communication protocol).

void getFullResolutionFrame();

This function is similar to getFrame(); however, the downscaling factor is ignored and the frame is returned at its native resolution. Neither of these functions has a return value; all frame content is written to the same memory location, previously known to the Processing Module.

procmodule.c Corresponds to the Processing Module, i.e. the code that implements the system's use case. It invokes the Capture Module functions to obtain frames and processes them. A number of specific processing module examples were implemented; those are further analysed later in this chapter.

utils.c Throughout the development, a number of common code patterns were identified and eventually encapsulated into their own functions and kept in this file, which is available for use by both modules. The frame downsampling algorithm is implemented here, as well as a function to convert from the HSV to the YCbCr colour space. A number of other useful functions are also available.


5.2 Capture Module Implementation

5.2.1 Capture from USB Device

The most important and most frequently used capture module is the implementation that connects to a USB camera and fetches frames from a real-world scene. To connect to the Logitech webcam, the Video4Linux2 API is used, parameterized with the execution parameters struct's devName string. The API supports several methods for reading the data off the device (read and write methods, asynchronous I/O methods and streaming I/O), but the one most frequently used is the streaming I/O method. The module uses these functionalities to determine the camera's supported data reading methods, image resolutions and image formats and, with this data, calculates the appropriate response to the processing module's request for image resolution and format.
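A minimal sketch of the capability query and format negotiation through V4L2 is shown below, assuming a device node such as /dev/video0; buffer allocation and the streaming I/O loop are omitted, and the function name and error handling are illustrative rather than taken from the implemented module.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

/* Open the device, confirm it can capture video, and negotiate a format. */
int open_and_configure(const char *devName, int width, int height)
{
    int fd = open(devName, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    struct v4l2_capability cap;
    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) < 0 ||
        !(cap.capabilities & V4L2_CAP_VIDEO_CAPTURE)) {
        fprintf(stderr, "%s is not a capture device\n", devName);
        close(fd);
        return -1;
    }

    /* Request the resolution asked for by the processing module; the driver
     * adjusts the fields to the closest configuration it supports. */
    struct v4l2_format fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = width;
    fmt.fmt.pix.height = height;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
    fmt.fmt.pix.field = V4L2_FIELD_NONE;
    if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0) {
        perror("VIDIOC_S_FMT");
        close(fd);
        return -1;
    }

    /* fmt.fmt.pix now holds the configuration actually granted by the driver. */
    printf("granted %ux%u\n", fmt.fmt.pix.width, fmt.fmt.pix.height);
    return fd;
}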

5.2.2 Capture from Sequence File

At first, only one capture module was planned to be developed (to capture video from the USB camera and directly feed it to whatever processing module was present at the time of execution). However, over time it became apparent that it was not always practical to perform the video processing in real time - certain test sequences were difficult to obtain and it would be useful to save them to a file and test them with either several different processing modules, different versions of the same module or even the exact same module running with different parameters. Another possibility provided by this module is the generation of synthetic sequences representative of very specific test cases.

This is a simple module that, using the standard stdio.h functions like fopen() and fgetc(), reads a sequence file off the local file system and implements the capture module's usual API layer over it, organizing the data in frames. The set of offered functions and functionalities is exactly the same as in the other capture module, making the transition nearly transparent from the processing module's point of view - it should only be noticeable because the frame retrieval rate may be much higher and sequenceName is used to locate the file instead of devName. However, only uncompressed YUV files are supported - plans for the future include the implementation of a video decompression algorithm in a way that is also fully transparent to the processing module.
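For reference, the following sketch reads one planar YUV 4:2:0 frame from a raw sequence file (using fread() rather than byte-wise fgetc() for brevity); the function and buffer names are illustrative.

#include <stdio.h>

/* Read one planar YUV 4:2:0 frame: the Y plane holds width*height bytes and
 * each chroma plane holds a quarter of that. Returns -1 on end of file. */
int read_yuv420_frame(FILE *f, unsigned char *y, unsigned char *cb,
                      unsigned char *cr, int width, int height)
{
    size_t luma = (size_t)width * height;
    size_t chroma = luma / 4;

    if (fread(y, 1, luma, f) != luma) return -1;
    if (fread(cb, 1, chroma, f) != chroma) return -1;
    if (fread(cr, 1, chroma, f) != chroma) return -1;
    return 0;
}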

5.3 Processing Module Implementation

5.3.1 Video Storage

The video storage processing module had the same motivation as the Capture From File module. Through the use of the fwrite() function, this module fetches frames from the capture module (usually the one that connects to a USB camera) and saves them to a file in the local file system. The name of the file where the video will be saved is parameterizable. Plans for the future for this module include the usage of a video compression algorithm.
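A minimal sketch of this module's main loop is shown below, assuming that getFrame() fills the previously negotiated frame buffer; the fixed frame count and identifiers are illustrative (the real module runs until it is stopped).

#include <stdio.h>

extern void getFrame(void);   /* provided by the capture module */

/* Append raw frames, unmodified, to the output file. */
void store_loop(FILE *out, unsigned char *frameBuffer, size_t frameSize,
                int framesToSave)
{
    for (int i = 0; i < framesToSave; i++) {
        getFrame();                               /* fills frameBuffer */
        fwrite(frameBuffer, 1, frameSize, out);   /* append raw frame  */
    }
    fflush(out);
}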

5.3.2 Motion Detection

The Motion Detection processing module's purpose is to detect movement of any kind in the video and take a snapshot. Movement is detected through the comparison between the latest frame and an older, reference one.

The first step in execution is the allocation of enough memory to hold a downscaled frame - the reference frame (to which other frames will be compared) - followed by the fetching of a frame and the copying of its contents into the allocated memory. An optional renewReferenceFrameSeconds parameter sets a number of seconds after which a new frame must be fetched and set as the new reference frame. Known as the reference frame sliding mechanism and turned on by the useReferenceFrameRenewal flag, this method allows the module to adapt to slowly changing conditions while minimizing false positives - the incorrect issuing of events.

The main cycle is based on the successive fetching of frames and their comparison to the reference frame. If a series of parameters is met, an intrusion is detected and a snapshot is taken and saved. Since a frame is a matrix of integers - each integer corresponding to a pixel - the comparison between frames is performed by comparing pixels in the same position in the two frames with the standard stdlib.h abs() function, which calculates the absolute value of the difference between the two values. If this difference equals or surpasses the pixelValueDifferenceThreshold parameter, the pixel is categorized as a difference pixel, i.e., a pixel that saw a substantial enough variation in value to be categorized as a pixel with a possible intrusion. The comparison continues throughout the frame and the number of difference pixels is counted. A second parameter, called framePixelsDifferenceThreshold, sets the minimum percentage of the frame's pixels that must be categorized as difference pixels for the whole frame to be recognized as a difference frame, i.e., a frame with an intrusion. A third parameter - secondsWithIntrusion - sets the minimum number of seconds filled with consecutive difference frames for a definite intrusion to be recognized and an event to be triggered - in the case of the implemented system, the saving of a snapshot. A last parameter, called secondsBetweenSnapshots, limits the taking of snapshots to one every X seconds, so as not to save an excessive number of images.
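The per-frame decision described above can be sketched as follows, assuming monochrome frames stored as flat arrays; the consecutive-seconds and snapshot-interval checks are left out, and the identifiers mirror the parameters described above without claiming to match the actual code.

#include <stdlib.h>

/* Count pixels whose value changed by at least pixelValueDifferenceThreshold
 * and flag the frame when that count reaches framePixelsDifferenceThreshold
 * per cent of the frame. */
int is_difference_frame(const unsigned char *reference,
                        const unsigned char *current,
                        size_t pixel_count,
                        int pixelValueDifferenceThreshold,
                        double framePixelsDifferenceThreshold /* 0..100 */)
{
    size_t differencePixels = 0;

    for (size_t i = 0; i < pixel_count; i++) {
        if (abs((int)current[i] - (int)reference[i]) >=
            pixelValueDifferenceThreshold)
            differencePixels++;
    }
    return (100.0 * differencePixels / pixel_count) >=
           framePixelsDifferenceThreshold;
}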

5.3.3 Activity Mapping

The Activity Mapping module watches over a video feed and, at the end of execution, builds a heat map indicating the areas of the feed with the most movement. It is similar to the Movement Detection module in that both compare the system's current frame to an older one through the Frame Differencing algorithm - but the treatment given to the difference frame is different.


The most important structure in the Activity Mapping module is the accumulation map - a matrix with the same resolution as a downscaled frame that keeps track of how much change each pixel saw throughout execution. Each of these values will later be mapped to a colour: the bigger the value, the warmer the colour. Since one of the goals of the system is its ability to run for long periods of time without human intervention, the accumulation map is a matrix of unsigned long integers.

The algorithm starts by obtaining two frames: one in native resolution, to be later used in the finished thermal image, and one in downscaled resolution, to be used as the reference frame. Frames are then obtained in the usual fashion and compared to the reference frame, pixel by pixel, by calculating the absolute value of the difference between the two pixels and adding it to the corresponding position in the accumulation map. This module also implements a pixelValueDifferenceThreshold, so that only pixel value differences above this level are logged. In this case, the value added to the accumulation map is the difference minus the threshold.

The cycle continues until execution is manually interrupted or a renewReferenceFrameSeconds parameter is surpassed. In the latter case, the current accumulation map is transformed into a heat map, output to a file, and execution restarts until X seconds have passed again. The most important task of the thermal map generation procedure is to take the accumulation map and map the values it contains to a range of colours, with a direct correlation between value and colour warmth. This was implemented using the Hue-Saturation-Value colour space, which characterizes colours by a set of three parameters: the Hue (H), which defines the colour's wavelength, the Saturation (S) and the Value (V), the colour's lightness. The Hue of a colour is defined as a value between 0° and 360°.

Figure 5.5: HSV colours according to H and V values. S is set to 1.0 in this figure.

By analysing the range of colours, it is easy to see that the cooler colours are present at H = 180° and the warmer ones at H = 0°. Both S and V are decimal values between 0.0 and 1.0; for this project, both are set to 1.0. To map the accumulation values to Hue values, the smallest and biggest values in the accumulation map need to be determined; the structure is thus iterated through to find them.

The next step is to build the final image, which will be the result of the juxtaposition of the thermal imaging over the native-resolution snapshot that was taken in the first step of the execution. This image contains only Y values and is iterated through, with its values becoming the final thermal map's Y values, multiplied by an accumMapWeight float parameter before being output to the final file. This allows the original snapshot to be darker in the final image, giving greater emphasis to the thermal colours. The last step is the inclusion of the thermal colours in the final image. It is important to recall that, if the frame downscale factor is greater than 1, the accumulation map's resolution is smaller than that of the original, native-resolution snapshot. Hence, if this is the case, an upscaling algorithm must be performed in order to match the colours derived from the accumulation map to the full-resolution snapshot. This is based on a mapping where the pixels of the downscaled-size image D are mapped to the upscaled image U through the equation

\[ U(x, y) = D\left(\left\lfloor \frac{x}{downscaleFactor} \right\rfloor,\ \left\lfloor \frac{y}{downscaleFactor} \right\rfloor\right) \]

After this algorithm is run, the calculated colours are output to the resulting file and the execution is complete. On a 32-bit system (as is the case with the Raspberry Pi), an unsigned long integer is 4 bytes long. Considering a typical case where a native frame is 640 by 480 pixels and the downscale factor is 4, a downscaled frame has a resolution of 160 by 120 pixels, so this structure takes up 75KB of memory, which can be demanding on very-low-capacity systems. However, it is vital that the system is able to run unattended for long periods of time. If we assume that a pixel's value is at most 255, that the minimum difference threshold is 30 and that the system is running at a pace of 10 frames per second, the soonest a position in the accumulation map can overflow is after more than 22 days - a period long enough that such an overflow is very unlikely to occur.
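As an illustration of the colour mapping and of the floor-based upscaling equation above, the following sketch computes the Hue for one full-resolution pixel from the accumulation map; the conversion from HSV to YCbCr is assumed to be handled elsewhere (for instance by the helper mentioned for utils.c), and all identifiers are illustrative.

/* Map an accumulation value to a Hue between 180 degrees (coolest) and
 * 0 degrees (warmest), indexing the downscaled accumulation map with the
 * floor-divided coordinates of the full-resolution snapshot. */
double hue_for_pixel(const unsigned long *accumMap,
                     int mapWidth,            /* downscaled width            */
                     unsigned long minVal, unsigned long maxVal,
                     int x, int y,            /* full-resolution coordinates */
                     int downscaleFactor)
{
    unsigned long v = accumMap[(y / downscaleFactor) * mapWidth +
                               (x / downscaleFactor)];
    if (maxVal == minVal)
        return 180.0;                         /* no movement: stay cool */
    /* Larger accumulated change -> smaller Hue -> warmer colour. */
    return 180.0 * (double)(maxVal - v) / (double)(maxVal - minVal);
}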

5.3.4 Activity Mapping with Dynamic Reference Updating

The Activity Mapping with Dynamic Reference Updating processing module is an evolution of the regular Activity Mapping module: the heat map is recalculated after each difference frame and represents the movement registered during the last queueLength frames. There is no native resolution frame in this algorithm. At the start of execution, two circular lists of queueLength elements are built: the first contains frames and the second contains difference frames. Each list element contains two pointers: one to the next element in the list, and another to a data buffer the same size as a downscaled frame. At the same time, an accumulation map is also kept. In order to avoid strange results at the start of execution (due to the calculation of difference frames between new frames and empty ones), when the first frame arrives, the whole frame list is filled with it.

Every time a frame is obtained, the frame list is updated by discarding the oldest frame in it and saving the new one in the first position. Then, a new difference frame is calculated between the newest (just obtained) frame and the oldest one. This difference frame is saved in the difference frame list, whose oldest entry is discarded. Lastly, the newest difference frame is added to the accumulation map and the oldest is subtracted from it. The current accumulation map is mapped to a heat map and output to a file, at which point the cycle restarts.
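A simplified sketch of this sliding-window update is given below. The circular lists are represented as flat arrays indexed by a head position, and all structure, field and function names are illustrative assumptions rather than the actual implementation:

#include <stdlib.h>

/* Illustrative sliding-window state: queueLength queued frames and
 * difference frames, plus the running accumulation map. */
typedef struct {
    unsigned char **frames;     /* queueLength downscaled frames         */
    unsigned char **diffs;      /* queueLength difference frames         */
    unsigned long  *accumMap;   /* sum of all currently queued diffs     */
    int queueLength;
    int frameSize;              /* pixels per downscaled frame           */
    int head;                   /* index of the oldest queued element    */
} SlidingWindow;

/* One possible reading of the update order described above. */
void window_push(SlidingWindow *w, const unsigned char *newFrame)
{
    unsigned char *oldestFrame = w->frames[w->head];
    unsigned char *oldestDiff  = w->diffs[w->head];

    for (int i = 0; i < w->frameSize; i++) {
        /* New difference: newest frame against the oldest queued frame. */
        unsigned char d = (unsigned char)abs((int)newFrame[i] - (int)oldestFrame[i]);

        w->accumMap[i] += d;               /* add the newest difference   */
        w->accumMap[i] -= oldestDiff[i];   /* subtract the oldest one     */

        oldestDiff[i]  = d;                /* reuse the oldest diff slot  */
        oldestFrame[i] = newFrame[i];      /* reuse the oldest frame slot */
    }
    w->head = (w->head + 1) % w->queueLength;
    /* The accumulation map can now be mapped to a heat map and output. */
}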

Figure 5.6: Activity Mapping with Dynamic Reference Updating execution scheme.

The result is a file containing a series of heat maps, each representing the movement registered during the last queueLength frames.

5.4 Communication Protocol

There are several parameters that need to be agreed upon before the capture and processing modules can properly start their operation. The first have to do with the characteristics of the image, such as resolution, colour mode (colour or grayscale) and optional pre-processing algorithms to apply to it. Instead of passing arguments to functions, all communication between the capture and processing modules is done by writing to an instance of the exec_params structure (defined in the capturemodule.h header file). This process is inspired by the way calls are made to the V4L2 API. The instance is created by the processing module and a pointer to it is passed through a call to the capture module's init() function. Appendix B shows the communication flow between the Capture and Processing Modules.


5.4.1 Execution Parameters Structure

The Execution Parameters structure is a struct containing the complete set of parameters that define the system's intervening modules and their configuration values. Appendix A shows the complete list of fields in this structure.

The first four parameters are strings identifying the implementations of the capture and processing modules; for each module, its unique identification and version are kept.

The next six fields of the structure characterize the resolution of the frames being passed from the capture to the processing module. Both native and downscaled resolutions are kept, because the capture module exposes functions to obtain both native- and downscaled-sized frames and it may be useful to have these parameters available for reading. For each mode, both width and height are saved, as well as the total size in pixels; this will usually be equal to the width multiplied by the height and is present for optimization purposes, avoiding unnecessary calculations.

The next three fields concern the downscaling and pre-processing algorithms applied to the frame by the capture module. frameDownsampleFactor defines the factor for the aforementioned frame downscaling algorithm (it should be set to 1 if no downscaling is to be performed), equalizeHistogram is a flag that should be set to a value different from 0 if the histogram equalization algorithm is to be applied to obtained frames, and frameColorMode defines whether frames should be in colour or grayscale (most of the algorithms described here are only applicable to monochrome images).

These are followed by one of the most important points of the system: the three pointers indicating the memory buffers where the frame contents will be written - one for Y values, one for Cb and one for Cr - the last two may contain no data, depending on whether colour data is being saved or not. The last fields indicate the path to the video feed - one in case the source is a camera (device descriptor) and one in case it is a file (file path).
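As an illustration, a simplified version of this structure might look as follows; the field names follow those mentioned in the text, but the exact types, ordering and any remaining fields are assumptions - the authoritative definition is the one in capturemodule.h, listed in Appendix A:

/* Simplified, illustrative view of the execution parameters structure. */
typedef struct {
    /* Module identification */
    char *captureModuleID,    *captureModuleVersion;
    char *processingModuleID, *processingModuleVersion;

    /* Frame geometry, native and downscaled */
    unsigned int nativeWidth,     nativeHeight,     nativeSize;
    unsigned int downscaledWidth, downscaledHeight, downscaledSize;

    /* Downscaling and pre-processing */
    unsigned int frameDownsampleFactor;   /* 1 = no downscaling            */
    int equalizeHistogram;                /* != 0 enables equalization     */
    int frameColorMode;                   /* colour or grayscale           */

    /* Frame buffers, allocated by the capture module */
    unsigned char *yPointer;              /* luma values                   */
    unsigned char *cbValues, *crValues;   /* chroma values (may be unused) */

    /* Video source */
    char *device;                         /* e.g. a V4L2 device descriptor */
    char *filename;                       /* path to a sequence file       */
} exec_params;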

5.4.2 Initial Negotiation

The definition of the execution parameters is a negotiation. Since the Capture Module acts as an abstraction layer over the capture device, the processing module is usually not aware of its limitations, and its request may not be fulfillable. Hence, the capture module works as a "master" of sorts in this procedure, where the processing module merely requests a set of parameters and can never assume that any of them will be effectively fulfilled and applied.

In the first step, the Processing Module - which takes the initiative of starting communication - is responsible for allocating memory for the execution parameters structure and all its members, except the frame memory buffers.

The Capture Module is responsible for allocating that memory, as its size depends on the size of the frames, and it is the module that decides what the resolution will be.

The Processing Module then writes its own identification and version strings in the processingModuleID and processingModuleVersion fields, together with the device and filename it intends the frames to be read from - at least one of those must be present; which one depends on the Capture Module present at the time. Lastly, four parameters define the intended frame specifications: nativeWidth sets the intended native frame width, nativeHeight the intended native frame height, frameColorMode whether the frames should be in colour or grayscale, and frameDownsampleFactor the frame downsampling factor.

The Processing Module has now completed its request; the next step is to pass the pointer to this structure to the init() function, which is implemented by all capture modules. In this routine, the capture module reads the received request and evaluates the capture source's capabilities, trying to fulfill the request as closely as possible.

The capture module starts by writing its own identification (captureModuleID and captureModuleVersion) and then reads the values the processing module requested for the native width and height, re-writing them if necessary, as well as the native size. The last step in the definition of the frame parameters is deciding what behaviour to adopt regarding the downscaled width, height and size. If it was not deemed necessary to change the native parameters, calculating the downscaled counterparts should take only a few simple mathematical operations, but if changes were made, it is up to the capture module's implementation to decide what to do. Either way, the processing module is responsible for accepting the capture module's determination.

The last step in the initiation procedure is allocating the memory buffers that will hold the frame data. These must be big enough to hold a frame in native resolution, otherwise getFullResolutionFrame() will not have enough space to save its result. The processing module now has a complete picture of the parameters to adopt.
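The request phase on the processing module side might look roughly like the following sketch; the concrete values, the strdup-based string handling and the assumption that init() takes only the structure pointer are made for illustration:

#include <stdlib.h>
#include <string.h>
#include "capturemodule.h"   /* exec_params, init() */

/* Illustrative: a processing module fills its request and hands it to
 * the capture module, which rewrites any field it cannot honour and
 * allocates the frame buffers. */
static exec_params *negotiate(void)
{
    exec_params *p = calloc(1, sizeof(*p));

    p->processingModuleID      = strdup("activity-mapping");
    p->processingModuleVersion = strdup("1.0");
    p->device                  = strdup("/dev/video0");

    p->nativeWidth           = 640;   /* intended, not guaranteed */
    p->nativeHeight          = 480;
    p->frameColorMode        = 0;     /* grayscale                */
    p->frameDownsampleFactor = 4;

    init(p);   /* on return, p holds the parameters actually adopted */
    return p;
}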

5.4.3 Main Cycle

Now that an agreement has been reached regarding the configuration parameters, the system is ready to start fetching and analysing frames. In this stage, known as the Main Cycle, the Processing Module is responsible for setting the rhythm at which to obtain frames. To get a frame, the Processing Module calls the Capture Module's getFrame() or getFullResolutionFrame() functions; upon their return, the frame's data will be in the memory buffer pointed to by yPointer. If it is a colour frame, the colour data will be in cbValues and crValues. The fact that the Processing Module controls the frame retrieval rate even allows it to implement dynamic frame retrieval rate methods. The frame retrieval functions return when the frame is ready for reading; if the pacing timer fires before they return, a frame drop has just happened.
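A minimal sketch of such a cycle is shown below. The argument-less getFrame() and finish() calls, the pacing strategy and process_frame() are assumptions made for illustration; the real modules may pace and process frames differently:

#include <unistd.h>
#include "capturemodule.h"   /* exec_params, getFrame(), finish() */

void process_frame(const unsigned char *y, unsigned int pixels);  /* module-specific */

/* Illustrative main cycle: the processing module sets the rhythm,
 * fetches a downscaled frame, processes it and waits for the next slot. */
void main_cycle(exec_params *p, int framesPerSecond, volatile int *stop)
{
    useconds_t period = 1000000 / framesPerSecond;

    while (!*stop) {
        getFrame();                     /* returns when p->yPointer holds data */
        process_frame(p->yPointer, p->downscaledSize);
        usleep(period);                 /* naive pacing; a real module would
                                           subtract the processing time        */
    }
    finish();                           /* let the capture module close the device */
}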


5.4.4 Finishing Execution

It is important for the Processing Module to be able to tell the Capture Module that execution has ended - this allows the latter to close connections to devices or files. To do so, the processing module calls finish(), the last function offered by the capture module.


6 Experimental Results

Contents 6.1 Performance ...... 62 6.2 Real-World Behaviour ...... 65


In order to assess the quality of the system's results, each of the three main processing modules - Movement Detection, Activity Mapping and Activity Mapping with Dynamic Reference Updating - was tested to evaluate both its raw speed and its real-world performance, i.e., the result of its processing of a video sequence. All the tests were performed by running the program on the Raspberry Pi.

6.1 Performance

To test speed performance, a specific metric was used: the dropped frame percentage, or the proportion of frames the module was not quick enough to process in time. By forcing the module to run at an increasing frames-per-second rate and varying other module-specific parameters, the drop rate was logged. All tests were performed with a video feed obtained in real time from the USB camera with no movement - frame drops are not logged during event execution or heat map exporting, so this was the best way to ensure all modules were run under the same circumstances.
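Stated as a formula (a direct restatement of the definition above, not an equation taken from the original text):

$\text{dropped frame percentage} = \dfrac{\text{frames dropped}}{\text{frames requested}} \times 100\%$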

6.1.1 Movement Detection

The Movement Detection module's speed performance was tested by varying three of its parameters: the Downscale Factor was set to 1, 2 and 4 (making the program analyse frames at, respectively, 640×480, 320×240 and 160×120), the Histogram Equalization algorithm was tested in both the on and off settings, and the frame analysis rate was set to every value in the [2;20] range. The complete set of frame rate tests for the Movement Detection module can be found in Appendix C. In this section, the impact of both the downscale factor and the histogram equalization parameters is assessed.


The first conclusion to take from the graph is that, regardless of the applied downscale factor, at about 8 frames per second at least 10% of the acquired frames will not be processed; under these conditions, 7 frames per second is therefore the fastest rate at which the system shows acceptable performance - and even then, in the case where no downscaling is performed, about 15% of frames are lost. As for the effect of the downscale factor, up to about 14 frames per second the downscaling algorithm appears to have a positive effect on system performance, with lower drop rates for higher downscale factors - despite the frame downscaling algorithm being somewhat computationally expensive. This seems to indicate that the bottleneck in frame acquisition speed is not the processor, but the memory access times. Next, the effect of the Histogram Equalization algorithm was assessed.

This time, the results are much more uniform, with histogram equalization having a clearly negative effect on performance: with the algorithm applied, at 5 frames per second drop rates already reach nearly 50%, while similar rates with the algorithm turned off are not reached until frames are being acquired twice as fast. This makes sense - the algorithm provides no performance benefit, merely adding more work to each frame - and leads to the conclusion that it should be turned on only when the resulting loss in speed does not compromise detection.

6.1.2 Activity Mapping

The Activity Mapping module was tested under similar circumstances to the Movement Detection module, with one exception: given that the Histogram Equalization algorithm will not usually be useful to this module (since it causes frequent changes in tone throughout the whole image when an object appears in the frame), this algorithm was turned off for the whole duration of this module's tests.


The complete performance test results for the Activity Mapping module can be found in Appendix D.

Results in this test are slightly harder to interpret, with the maximum acceptable frame acquisition rate being 6 frames per second, which makes sense as this algorithm is more complex than the one used by the Movement Detection module. The change in frame downscale factor does not seem to produce consistent results across the frame acquisition rates.

6.1.3 Activity Mapping with Dynamic Reference Updating

The parameters used to test the Activity Mapping with Dynamic Reference Updating module were the same as for the previous module. The complete test results can be found in Appendix E.

Despite this module being slightly more complex than the standard Activity Mapping module, with many more memory copying operations - something that seems to be a bottleneck for the Raspberry Pi - the measured performance is similar, with the maximum acceptable rate at 5 frames per second.


6.2 Real-World Behaviour

The Real-World Behaviour tests are meant to analyse not the speed at which a module can run, but the result of its processing of a video sequence. The Movement Detection module was tested by pointing the camera at a hallway, waiting a random number of seconds, having a person cross the scene and then checking whether screenshots had been correctly taken. The Activity Mapping modules were, on the other hand, tested with two distinct sequences. One shows a parking lot with vehicles arriving, leaving and performing manoeuvres, and an occasional person passing by. The second shows a roundabout with heavy traffic and a lot of pedestrian movement. Both sequences are about 10 minutes long and were captured using the described system, in monochrome, with no histogram equalization or downscaling, at a rate of 3 frames per second.

Figure 6.1: Screenshots of the first video sequence used to test the Activity Mapping modules, with the highlight of a car.

Figure 6.2: Screenshots of the second video sequence used to test the Activity Mapping modules.

6.2.1 Movement Detection

The Movement Detection module that was implemented is supposed to take a screenshot every second while an intrusion is detected. The camera was pointed at a hallway and a minute passed with no movement. Then, a person crossed the scene, execution was stopped and the screenshots that had been taken were analysed; they are presented below.

Figure 6.3: Images taken by the module when an intrusion was detected.

Each screenshot's file name is the date and time it was taken, so that the time frame in which the intrusion took place can be easily pinpointed. Importantly, the execution parameters had to be calibrated for detection to work as expected. For the above sequence, the consecutive-frames-with-detection parameter was set to zero, the pixel detection threshold to 10 and the minimum pixel percentage to 10%.

6.2.2 Activity Mapping

6.2.2.A Sequence 1

The Activity Mapping module was run over the whole parking lot sequence (with no reference frame renewal) to calculate the thermal map of the whole video. The result is seen below.

Figure 6.4: Heat map of execution over the street sequence.

It is clear that most pixels show a low level of movement (indicated by blue tones). This is largely due to a car in the top left corner that left the scene during capture: its absence was continuously identified as a difference and ended up marked with hot tones, stretching the colour scale and pushing the remaining pixels towards cooler colours.


6.2.2.B Sequence 2

The same module was executed on the second sequence, and set to calculate a thermal map every 10 seconds. An excerpt of the results is visible below.

The results show a large variance both in the amount of movement and in the zones showing the most movement. This leads to the conclusion that, over intervals of 10 seconds, the traffic - of both vehicles and people - is not constant. In this test, the module again shows some sensitivity to subjects that were present at the start of the execution but later left the scene. Such events make the whole thermal map take a bluer tone than anticipated.

6.2.3 Activity Mapping with Dynamic Reference Updating

The Activity Mapping with Dynamic Reference Updating module was also tested with both sequences.

6.2.3.A Sequence 1

First, this module was run over the same sequence and set to show the heat map of the last 5 seconds of execution. The screenshot below shows the result of execution while the car that left the heat mark in the previous test made a manoeuvre and left the scene.

The route of a person who crossed the scene in the top right corner is also visible. The results show that the module is working as intended and are arguably more interesting than those of the standard Activity Mapping module.

6.2.3.B Sequence 2

Lastly, this module was run over the roundabout sequence and set to represent the last 5 seconds of execution. An excerpt of the results is shown below. The represented frames are not sequential - some have been skipped so the rate of movement is more obvious to the eye.

Several lines of movement are clearly identifiable. First, a series of silhouettes representing a motorcycle that crossed the frame are visible in the lower sections of the screenshots. Since this particular vehicle crossed the scene at a relatively high speed, its trail is not a continuous line.

A series of trails left by cars is also visible, with one entering the roundabout from the right, starting on the sixth frame. The last few frames show that, during the represented seconds, the roundabout's inner lane is much more used than the outer one, with some parts of it even showing warm tones. The routes of people going up and down the stairs are visible, as well as an elevator in the top left corner of each frame and, to a smaller degree, the movement of plants due to wind.



7 Future Work


Since the system, in its current state, deals exclusively with uncompressed video data, the usage of a video compression algorithm to reduce the space required for video storage is a very clear area for future work. One possibility is the H.264 [26] standard, as it is both highly optimized and widely used. This option does not change the amount of data the system has to work with in real time, as per-pixel comparisons must still be performed. The most promising aspect of future work for the system is exploring the feasibility of more complex algorithms and modules on the Raspberry Pi. One example is the recognition of human shapes in the video, which would improve the usefulness of the system in scenarios such as the categorization of the usage of public spaces. By recognizing human shapes, more intelligent heat maps can be built in these situations through the use of crowd tracking methods.

Figure 7.1: A picture of a crowd and its respective density thermal map (top right) and head detections without crowd density weighing (bottom left) and with (bottom right). [1]

One route that could be explored is the Histogram of Oriented Gradients algorithm, the de facto method for human shape recognition in images. This can, however, be a very difficult route to take: it is a very complex, computationally expensive procedure and, with the Raspberry Pi showing its limitations on much lighter algorithms, it is unlikely that an acceptable performance level would be achieved.

8 Conclusions


There are several perspectives from which to analyse the implementation of such a system. The first, and possibly the most important, is whether the developed system can be deployed, used and be useful in the real world. Given the presented results, the conclusion is yes. One of the goals of the project was to keep the cost low, and that has clearly been achieved: the total cost of the used equipment did not surpass €100, a mere fraction of the price of commercially available systems. Another goal was also achieved - the system is very small and easily installable in tight spaces. By the end of development, with the usage of a USB Wi-Fi adapter, the system needed no intervention to connect to the local network and start listening for connections through SSH, FTP and other similar protocols - with no connected peripherals other than the USB hub, wireless adapter and camera.

The second aspect that is important to analyse is whether the demonstrated performance is acceptable for real-world use. In this aspect, the Raspberry Pi clearly shows its limitations, rarely surpassing a stable 6 or 7 frame acquisitions per second without serious frame rate drops happening. This is by no means a bad rate, however, being perfectly acceptable for a very wide range of applications - and software optimizations might improve this aspect further.

The last aspect, and possibly one of the biggest successes, is the interchangeable module architecture. By the end of development, with the stable communication protocol in place and several implementations of each module, modules could be swapped effortlessly, with the results suffering almost no negative effects from these operations. The most noticeable limitation was the need to specify two paths (one for the desired camera and one for the sequence file), meaning the Processing Module retained some awareness of the source of the data. The flexibility shown, however, was extremely positive.

Bibliography

[1] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert, “Density-aware person detection and tracking in crowds,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011.

[2] Intel Corporation, “Intel Pentium 4 Processor 571 Datasheet,” URL: http://goo.gl/4AUf9X, accessed: 2014-11-25.

[3] ——, “Intel Core i7-4770R Processor Datasheet,” URL: http://ark.intel.com/products/76642, accessed: 2014-11-25.

[4] Apple Inc., “iPhone Delivers Up to Eight Hours of Talk Time,” URL: http://www.apple.com/pr/library/2007/06/18iPhone-Delivers-Up-to-Eight-Hours-of-Talk-Time.html, accessed: 2014-11-25.

[5] Qualcomm Incorporated, “Qualcomm Snapdragon 801 Product Brief,” 2014.

[6] E. Upton and G. Halfacree, Raspberry Pi User Guide.

[7] “Raspbian Linux Distribution Webpage,” URL: http://www.raspbian.org/, accessed: 2014-11-25.

[8] “Pidora Linux Distribution Webpage,” URL: http://pidora.ca/, accessed: 2014-11-25.

[9] “OpenELEC Linux Distribution Webpage,” URL: http://openelec.tv/, accessed: 2014-11-25.

[10] “Raspbmc Linux Distribution Webpage,” URL: http://www.raspbmc.com/, accessed: 2014-11-25.

[11] Logitech International, Logitech HD Pro Webcam C920 Setup Guide, 2014.

[12] Swann Communications, Swann 3425 Series DVR Manual, 2014.

[13] Logitech International, Logitech Alert Video Security System - Getting to Know Manual, 2014.

[14] Opta Sports, “Opta Overview,” URL: http://www.optasports.com/media/94007/final-opta-overview-en.pdf, accessed: 2014-11-25.


[15] Bushnell Corporation, Bushnell TrophyCam HD Instruction Manual.

[16] ——, “Bushnell Trophy Cam HD Max Product Overview - Technical Specs,” URL: http://www.bushnell.com/all-products/trail-cameras/trophy-cam/trophy-cam-hd-max, accessed: 2014-11-25.

[17] BVI Networks, ShopperGauge in-store behavior monitoring system - Case study: Gauging the Impact of Display and Brand Messaging on the Cereal Category, 2010.

[18] Prism SkyLabs, User Manual, 2013.

[19] Andrew Back, “Time-lapse Photography with the Raspberry Pi Camera,” URL: http://designspark.com/blog/time-lapse-photography-with-the-raspberry-pi-camera, accessed: 2014-11-25.

[20] Raspberry Pi Official Blog and Christoph Buenger, “Turn your Pi into a low-cost HD surveillance cam,” URL: http://www.raspberrypi.org/archives/5071, accessed: 2014-11-25.

[21] SimpleDev, “Tutorial For Motion Detecting Raspberry Pi Security Camera That Notifies You Through An Email Alert With A Snapshot Attached,” URL: http://simpledev.webs.com/apps/blog/, accessed: 2014-11-25.

[22] Intel Corporation, Intel Galileo Datasheet, URL: http://www.intel.com/newsroom/kits/quark/galileo/pdfs/Intel Galileo Datasheet.pdf, accessed: 2014-11-25.

[23] G. Coley and R. P. J. Day, BeagleBone Black System Reference Manual.

[24] Raspberry Pi Foundation, “Raspberry Pi Camera Module Product Page,” URL: http://www.raspberrypi.org/products/camera-module/, accessed: 2014-11-25.

[25] Logitech International, Getting Started with Logitech HD Webcam C270.

[26] International Telecommunication Union, “Advanced Video Coding for generic audiovisual services,” 2004.

[27] D. A. Migliore, M. Matteucci, and M. Naccari, “A revaluation of frame difference in fast and robust motion detection,” 2006.

[28] E. Martínez-Martín and Á. P. del Pobil, “Robust motion detection in real-life scenarios,” 2012.

[29] The MathWorks Inc., Getting Started with MATLAB, R2013b ed., 2013.

[30] ——, MATLAB Computer Vision System Toolbox - User’s Guide, R2013b ed., 2013.

[31] OpenCV Dev Team, The OpenCV Reference Manual, Release 2.4.8.0 ed., 2013.


[32] Kenneth Lavrsen, Motion Guide, 3.2.12 ed., 2012.

[33] Broadcom Corporation, Broadcom BCM2835 Datasheet.

[34] M. Piccardi, “Background subtraction techniques: a review,” IEEE International Conference on Systems, Man and Cybernetics, 2004.


A Appendix A


B Appendix B


C Appendix C


D Appendix D


E Appendix E

