Towards Hardware Accelerated Rectification of High Speed Stereo Image Streams
Mälardalen University
School of Innovation, Design and Engineering
Västerås, Sweden

Thesis for the Degree of Master of Science with Specialization in Embedded Systems - 30.0 credits

TOWARDS HARDWARE ACCELERATED RECTIFICATION OF HIGH SPEED STEREO IMAGE STREAMS

Sudhangathan Bankarusamy
[email protected]

Examiner: Mikael Ekström, Mälardalen University, Västerås, Sweden
Supervisor: Carl Ahlberg, Mälardalen University, Västerås, Sweden

November 19, 2017

Abstract

The process of combining two views of a scene in order to obtain depth information is called stereo vision; when this is done using a computer it is called computer stereo vision. Stereo vision is used in robotic applications where the depth of an object plays a role. Two cameras mounted on a rig form a stereo camera system. Such a system captures two views and enables robotic applications to use the depth information to complete tasks. Anomalies are bound to occur in such a stereo rig when the two cameras are not parallel to each other, since mounting the cameras accurately on a rig faces physical alignment limitations. Images taken from such a rig carry inaccurate depth information and have to be rectified; rectification is therefore a prerequisite to computer stereo vision. The stereo rig used in this thesis is the GIMME2 stereo camera system. The system has two 10-megapixel cameras together with an on-board FPGA, RAM, a processor running the Linux operating system, multiple Ethernet ports and an SD card slot, amongst other features. Stereo rectification on memory-constrained hardware is a challenging task, as the process itself requires both images to be stored in memory. The FPGA on the GIMME2 system must be used in order to achieve the best possible speed. Programming a system that has no display and is built for a specific purpose is called embedded programming.
The purpose of this system is distance estimation, and working with such a system falls within the Embedded Systems programme. This thesis presents a method that takes rectification a step further for this particular system. The functionality of the algorithm is demonstrated in MATLAB and in VHDL, and is compared to available tools and systems.

Table of Contents

1 Introduction
  1.1 Camera parameters
2 Problem Formulation
3 Related Work
  3.1 Advantages and Disadvantages
4 Proposed Method
  4.1 Project Work-flow
5 Stereo Camera Calibration
6 MATLAB Implementation
  6.1 Rectification
  6.2 Analytics
  6.3 Conclusion
7 VHDL Implementation
  7.1 Conclusion
8 Results
9 Conclusion
10 Future Work
References
Appendices

1 Introduction

In stereo vision systems, extracting depth information accurately is a common and challenging task. Stereo vision has many applications in the autonomous domain, including navigation, object detection, surveillance, medical applications and virtual reality. An important preprocessing step to stereo matching is rectification. Rectification consists of aligning the image points in both the left and right images to a common global plane. The geometry of stereo vision is called epipolar geometry. Figure 1 presents the most important terminology of epipolar geometry. The lines formed by the intersection of the two cameras' image projection planes with the epipolar planes are called the epipolar lines. Figure 2 shows the image rectification terminology.
In stereo rectification the images are transformed so that the epipolar lines are aligned with the horizontal scan lines of the image. Rectified images are easier for stereo vision applications to process [1, 2] than unrectified images, since the applications then only have to search laterally to find the matching points and estimate the disparity, from which the distance to the object can be calculated. The distance formula based on the disparity is:

\[ Z = \frac{f \cdot B}{d} \tag{1} \]

where Z is the distance from the baseline along the camera axis in meters, f is the focal length in meters, B is the distance between the two cameras in meters, and d is the disparity in meters.

In stereo camera systems, sources of error arise from the fixed-parameter design of the camera geometry. Parameters such as the rotation of the cameras, the centers of the lenses and the focal lengths give rise to errors. Good stereo vision becomes increasingly difficult if multiple sources of error have to be handled by software applications. In modern systems that include FPGAs (Field Programmable Gate Arrays), it is a natural solution to implement the most processing-intensive and crucial tasks in the FPGA itself, if possible, before passing the rectified images to the applications. FPGAs are re-programmable silicon chips built from pre-fabricated logic blocks and programmable routing resources. Custom hardware functionality can be implemented without processor instructions and without rigging up hard-wired circuits: an FPGA is configured by writing a hardware description, which is compiled to a bitstream describing how the components inside the FPGA should be wired. FPGAs are fully re-configurable and take on a new form of circuitry based on the purpose. FPGAs are parallel in nature, unlike processors, meaning that tasks do not have to compete for resources.
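As a minimal sketch, equation (1) can be evaluated directly. The function and the numeric values below are illustrative only and are not taken from the GIMME2 system:

```python
def depth_from_disparity(f_m: float, baseline_m: float, disparity_m: float) -> float:
    """Distance Z along the camera axis, per equation (1): Z = f*B/d.

    f_m: focal length in meters; baseline_m: camera separation in meters;
    disparity_m: disparity measured on the image plane in meters.
    """
    if disparity_m <= 0.0:
        raise ValueError("disparity must be positive")
    return f_m * baseline_m / disparity_m

# Illustrative values: 8 mm focal length, 10 cm baseline, 0.1 mm disparity
print(depth_from_disparity(0.008, 0.10, 0.0001))  # 8.0 meters
```

Note that as the disparity shrinks the estimated distance grows without bound, which is why distant objects are hard to range accurately with a short baseline.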
Each task is assigned a dedicated part of the chip and can therefore work without interference from other blocks. In view of these features, FPGA adoption is on the rise across industries. In this thesis the FPGA is important in order to maintain very low latency between the flow of input and output pixels.

The undistortion process is a crucial step in stereo image rectification, especially when fish-eye lenses are used or the distortion is high. Figure 4 gives an idea of distorted and undistorted images. Distortion parameters such as skew, rotation, and radial and tangential effects are considered in the undistortion process. The GIMME2 [3] board, which is used in this thesis, has low-distortion lenses, which means there is less movement of pixels between the input and output images, reducing the complexity of handling large amounts of data.

Working on an FPGA-based system, and programming a system made for a specific purpose without a display, fits the definition of an embedded system. The GIMME2 board has no on-board display and its purpose is stereo vision and distance estimation; the GIMME2 board is therefore an embedded system. Programming such a system requires that memory, speed, timing and the utilization of logic resources are part of the design, but speed and timing are not considered here as they are beyond the scope of this thesis. A system design with these characteristics fits into the Embedded Systems programme.

Figure 1: Epipolar Geometry [4]

1.1 Camera parameters

When dealing with epipolar geometry, the geometry of stereo vision, the camera parameters, which mathematically define the camera's behaviour and its position with respect to the outside world, become very important. The variables that define the camera's behaviour are called the intrinsic parameters. The variables that define the position of the camera and the direction of its view are called the extrinsic parameters.
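The thesis does not fix the exact distortion model at this point; as an illustration, the radial-tangential (Brown-Conrady) model is a common choice for the effects mentioned above. The sketch below applies that model to an ideal point; undistortion inverts this mapping, which in practice is often precomputed as a remap table, a form well suited to a streaming FPGA implementation. All names and coefficients here are illustrative assumptions:

```python
def distort_point(xn, yn, k1, k2, p1, p2):
    """Apply radial-tangential (Brown-Conrady) distortion to a point.

    (xn, yn) are normalized image coordinates (pixel coordinates with the
    intrinsic matrix K removed); k1, k2 are radial coefficients and
    p1, p2 are tangential coefficients.
    """
    r2 = xn * xn + yn * yn
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = xn * radial + 2.0 * p1 * xn * yn + p2 * (r2 + 2.0 * xn * xn)
    yd = yn * radial + p1 * (r2 + 2.0 * yn * yn) + 2.0 * p2 * xn * yn
    return xd, yd

# With all coefficients zero the point is unchanged (a low-distortion lens
# such as those on GIMME2 has coefficients close to zero)
print(distort_point(0.1, 0.2, 0.0, 0.0, 0.0, 0.0))  # (0.1, 0.2)
```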
These parameters are explained below.

Intrinsic Parameters:

\[ K = \begin{bmatrix} a_x & s & u_0 \\ 0 & a_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2} \]

The intrinsic matrix K (2) has five parameters which describe how the camera captures images. The variables a_x and a_y represent the focal length in terms of pixels. Variable s is the skew factor, which relates the length and breadth of the pixels on the camera's capture plane. Variables u_0 and v_0 indicate the position, in pixels, of the principal point on the captured image, which is ideally the center of the image.

Extrinsic Parameters:

The position of the camera with respect to the world is described by R, the rotation matrix, and T, the translation vector. Together, the two indicate the direction and location of one camera. Matrix (3) represents the transformation matrix that converts world coordinates to camera coordinates; equation (4) shows the corresponding mathematical expression.

\[ [R \mid T] = \begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} & t_1 \\ r_{2,1} & r_{2,2} & r_{2,3} & t_2 \\ r_{3,1} & r_{3,2} & r_{3,3} & t_3 \end{bmatrix} \tag{3} \]

\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \mid T] \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix} \tag{4} \]

Figure 2: Projections of rectified images [1]

where U, V, W represent the 3D world coordinates and u, v represent the 2D camera coordinates. For image rectification and depth estimation the world coordinates serve no purpose: unit projection can be assumed, and the camera can be assumed to sit at the origin. Therefore, with T = [0] and W = 1, equation (4) becomes equation (5):

\[ \begin{bmatrix} u_u \\ v_u \\ 1 \end{bmatrix} = K \, R \begin{bmatrix} U_c \\ V_c \\ 1 \end{bmatrix} \tag{5} \]

The variables U_c and V_c are now the 2D coordinates of the picture taken by the camera, and u_u and v_u are the undistorted image coordinates. The intrinsic and extrinsic parameters together are called the calibration parameters. Obtaining the calibration parameters using a checkerboard, and the rectification itself, are explained in section 4. A matrix that transforms an image is called a homography matrix, H.
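A minimal numerical sketch of equation (5) in plain Python follows; the intrinsic values are illustrative assumptions, not the GIMME2 calibration:

```python
def mat_vec(M, x):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * x[j] for j in range(3)) for i in range(3)]

# Illustrative intrinsic matrix K: focal length 1000 px, zero skew,
# principal point at (640, 480); rotation R is the identity.
K = [[1000.0, 0.0, 640.0],
     [0.0, 1000.0, 480.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

Xc = [0.1, 0.2, 1.0]          # [Uc, Vc, 1]
u, v, w = mat_vec(K, mat_vec(R, Xc))
print(u / w, v / w)           # 740.0 680.0
```

Dividing by the third (homogeneous) component is what turns the linear map into a projection onto the image plane.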
Matrix H is also known as the calibration matrix. In the above case, H = K R, and equation (5) becomes:

\[ \begin{bmatrix} u_u \\ v_u \\ 1 \end{bmatrix} = H \begin{bmatrix} U_c \\ V_c \\ 1 \end{bmatrix} \tag{6} \]

From epipolar geometry, the matrix F is called the fundamental matrix.
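As a sketch, applying a homography H as in equation (6) is a 3x3 matrix multiply followed by division by the homogeneous coordinate. The particular H below (a pure translation) is an illustrative assumption:

```python
def apply_homography(H, u, v):
    """Map point (u, v) through a 3x3 homography H with perspective divide."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w

# A pure-translation homography shifts the point by (5, -3)
H = [[1.0, 0.0, 5.0],
     [0.0, 1.0, -3.0],
     [0.0, 0.0, 1.0]]
print(apply_homography(H, 10.0, 20.0))  # (15.0, 17.0)
```

In a hardware rectification pipeline this per-pixel mapping is typically evaluated in the inverse direction: for each output pixel, the homography gives the input coordinate to sample.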