DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

A

PROJECT REPORT

ON

“BLIND LEAP - REALTIME OBJECT RECOGNITION WITH RESULTS CONVERTED TO AUDIO FOR BLIND PEOPLE”

Submitted in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

BY

M K SUBRAMANI 1NH15CS066

Under the guidance of

Ms. JAYA R Sr. Assistant Professor, Dept. of CSE, NHCE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

It is hereby certified that the project work entitled “BLIND LEAP – REALTIME OBJECT RECOGNITION WITH RESULTS CONVERTED TO AUDIO FOR BLIND PEOPLE” is a bonafide work carried out by M K SUBRAMANI (1NH15CS066) in partial fulfilment for the award of Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING of the New Horizon College of Engineering during the year 2018-2019. It is certified that all corrections/suggestions indicated for Internal Assessment have been incorporated in the Report deposited in the departmental library. The project report has been approved as it satisfies the academic requirements in respect of project work prescribed for the said Degree.

…………………………..
Signature of Guide
(Ms. Jaya R)

…………………………
Signature of HOD
(Dr. B. Rajalakshmi)

……………………………….
Signature of Principal
(Dr. Manjunatha)

External Viva

Name of Examiner Signature with date

1. ………………………………………….. ………………………………….

2. …………………………………………… …………………………………..

ABSTRACT

This project aims to transform the visual world into an audio world, with the potential to inform blind people of objects as well as their spatial locations. Objects detected in the scene are represented by their names and converted to speech. Their spatial locations are encoded into 2-channel audio with the help of 3D binaural sound simulation.

The system is composed of several modules. Video is captured with a portable camera device on the client side and is streamed to the server for real-time image recognition with existing object detection models (YOLO). The 3D location of each object is estimated from the location and size of the bounding box produced by the detection algorithm. Then, a 3D sound generation application based on a game engine renders binaural sound with the locations encoded. The sound is transmitted to the user through wireless earphones. Sound is played at an interval of a few seconds, or whenever the recognized object differs from the previous one, whichever comes first.

With the help of the device, the user will be able to successfully identify objects that are 3-5 meters away. Possible issues with the current prototype are detection failure when objects are too close or too far away, and information overload when the system tries to notify the user of too many objects.


ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of any task would be impossible without the mention of the people who made it possible, whose constant guidance and encouragement crowned our efforts with success.

I have great pleasure in expressing my deep sense of gratitude to Dr. Mohan Manghnani, Chairman of New Horizon Educational Institutions, for providing the necessary infrastructure and creating a good environment.

I take this opportunity to express my profound gratitude to Dr. Manjunatha, Principal NHCE, for his constant support and encouragement.

I am grateful to Dr. Prashanth C.S.R, Dean Academics, for his unfailing encouragement and suggestions, given to me in the course of my project work.

I would also like to thank Dr. B. Rajalakshmi, Professor and Head, Department of Computer Science and Engineering, for her constant support.

I express my gratitude to Ms. Jaya R, Senior Assistant Professor, my project guide, for constantly monitoring the development of the project and setting up precise deadlines. Her valuable suggestions were the motivating factors in completing the work.

Finally, a note of thanks to the teaching and non-teaching staff of Dept of Computer Science and Engineering, for their cooperation extended to me, and my friends, who helped me directly or indirectly in the course of the project work.

M K SUBRAMANI (1NH15CS066)


CONTENTS

ABSTRACT I
ACKNOWLEDGEMENT II
LIST OF FIGURES V

1. INTRODUCTION 1
   1.1. VISUAL IMPAIRMENT 1
   1.2. OBJECTIVES OF THE PROPOSED PROJECT WORK 2
   1.3. PROJECT DEFINITION 3
   1.4. PROJECT FEATURES 4

2. LITERATURE SURVEY 5
   2.1. OBJECT RECOGNITION 7
      2.1.1 WHAT IS OBJECT RECOGNITION 8
      2.1.2 HOW OBJECT RECOGNITION WORKS 8
   2.2. EXISTING SYSTEM 11
      2.2.1 COMPUTER VISION TECHNOLOGIES 11
      2.2.2 SENSORY SUBSTITUTION TECHNOLOGIES 12
      2.2.3 ELECTRONIC TRAVEL AIDS 12
   2.3. OBJECT DETECTION ALGORITHMS 13
      2.3.1 R-CNN 13
      2.3.2 FAST R-CNN 14
      2.3.3 FASTER R-CNN 15
      2.3.4 YOLO (YOU ONLY LOOK ONCE) 16
   2.4. PROPOSED SYSTEM 17
      2.4.1 OBJECT DETECTION ALGORITHM 17
      2.4.2 DIRECTION ESTIMATION 18
      2.4.3 DATA STREAMING 18
      2.4.4 RESULT FILTERING 19
      2.4.5 3D SOUND GENERATION 20


3. REQUIREMENT ANALYSIS 21
   3.1. METHODOLOGY FOLLOWED 21
      3.1.1 CONVOLUTIONAL NEURAL NETWORKS 21
      3.1.2 YOLO (YOU ONLY LOOK ONCE) 22
   3.2. FUNCTIONAL REQUIREMENTS 24
      3.2.1 RASPBERRY PI 25
      3.2.2 UNITY 3D 26
   3.3. NON-FUNCTIONAL REQUIREMENTS 27
      3.3.1 ACCESSIBILITY 27
      3.3.2 MAINTAINABILITY 27
      3.3.3 PORTABILITY 28
   3.4. HARDWARE REQUIREMENTS 28
   3.5. SOFTWARE REQUIREMENTS 29

4. DESIGN 30
   4.1. DESIGN GOALS 30
   4.2. SYSTEM ARCHITECTURE 30
   4.3. DATA FLOW DIAGRAM 31

5. IMPLEMENTATION 32
   5.1. CONNECTIVITY 32
   5.2. OBJECT DETECTION PSEUDO CODE 33

6. RESULT 35
   6.1. DETECTION DATA SET CLASSES 35
   6.2. OUTPUT 35

7. CONCLUSION 42

REFERENCES 43

ANNEXURE 46

LIST OF FIGURES

Fig no. Fig Name Page no.

2.1 Object recognition used to identify objects 8

2.2 Deep learning technique for object recognition 9

2.3 Machine learning technique for object recognition 10

2.4 Illustration of R-CNN 13

2.5 Illustration of Fast R-CNN 14

2.6 Illustration of Faster R-CNN 15

2.7 Illustration of YOLO 16

2.8 Data flow pipeline of the system 19

3.1 First layer of a convolutional neural network with pooling 22

3.2 Bounding boxes, input and output for YOLO 23

4.1 System Architecture of the proposed System 30

4.2 Data flow of the proposed System 31

6.1 Raspberry pi top view 36

6.2 Raspberry pi side view 36

6.3 Raspberry pi camera 37

6.4 Remote desktop connection login panel 37

6.5 Raspberry pi terminal (client) 38

6.6 Windows Terminal (Server) 38

6.7 Object detected from a live stream video (Bottle) 39

6.8 Object detected from a live stream video (Phone) 39

6.9 Object detected from a live stream video (Banana) 40

6.10 Object detected from a live stream video (Scissor) 40

6.11 Depiction of the whole system 41


CHAPTER 1

INTRODUCTION

1.1 VISUAL IMPAIRMENT

Visual impairment, also called vision loss, is a problem related to vision that cannot be fixed with normal glasses. Blindness is a state in which a person has completely lost his or her vision. Both people with mild to severe visual impairment and blind people find it tough to carry out their day-to-day activities. They somehow learn to live with this, but only in familiar environments; when the environment is completely unfamiliar, things get much tougher for them.

Millions of people in this world live with a reduced ability to understand their environment due to visual impairment. Although they can develop alternative approaches to deal with daily routines, they also suffer from certain navigation difficulties as well as social awkwardness. For example, it is very difficult for them to find a particular room in an unfamiliar environment, and blind and visually impaired people find it difficult to know whether a person is talking to them or to someone else during a conversation.

According to a WHO report, there are around 36 million blind people and 217 million people with mild to severe visual impairment. The causes of visual impairment reported in this WHO report include cataracts, river blindness and trachoma infections, glaucoma, diabetic retinopathy, uncorrected refractive errors and some cases of childhood blindness. Many people with significant visual impairment benefit from vision rehabilitation, changes in their environment, and assistive tools.

With the advancements in the field of technology we are able to find solutions to nearly everything. In the last few years computer vision technology, especially deep neural networks, has developed swiftly. Access technologies such as screen readers, screen magnifiers and refreshable braille displays allow blind people to use mainstream computer applications and cell phones. The availability of tools is increasing, accompanied by joint efforts to ensure the accessibility of information technology to all potential users, including the blind. Later versions of Microsoft Windows include an Accessibility Wizard and Magnifier for those with partial vision, and Microsoft Narrator, a simple screen reader. Linux distributions (as live CDs) for the blind include Vinux and Adriane, the latter developed in part by Adriane Knopper, who has a visual impairment. macOS and iOS also come with a built-in screen reader called VoiceOver, while Google TalkBack is built into most Android devices.

A few of the other existing tools that help these people are “Blindsight”, “TapTapSee”, etc. Blindsight offers a mobile app, Text Detective, featuring optical character recognition (OCR) technology to detect and read text from pictures captured using the camera [2]. TapTapSee is a mobile app that uses computer vision and crowdsourcing to describe a picture captured by a blind user in about 10 seconds. Facebook is also developing image captioning technology to help blind users engage in conversations with other users through pictures; here the picture is translated into spoken words.

However, these products do not focus on restoring a general visual sense for blind people, nor do they use spatial sound techniques to further strengthen the user experience. Some work exists in the general scope of sensory substitution, including work by Neil Harbisson, Daniel Kish and the developers of the vOICe technology. The colorblind artist Neil Harbisson developed a device to transform colour information into sound frequencies. Daniel Kish, who is totally blind, developed accurate echolocation ability using “mouth clicks” for navigation tasks including biking and hiking independently [19]. An extreme attempt at converting visual sense to sound is the vOICe technology: the vOICe system scans each camera snapshot from left to right, while associating height with pitch and brightness with loudness [13].


1.2 OBJECTIVES OF THE PROPOSED PROJECT WORK

With advancements in technology we are able to find solutions to many problems; we have evolved so much that recently, through excellent algorithmic work, a team was even able to capture an image of a black hole. Yet, to date, blind people still face many problems in carrying out their activities and are dependent on others for many things. This motivated us to design a prototype. It is a small approach towards making blind people independent. This project is a vision replacement system designed to help blind people navigate autonomously. It is an attempt in which I will not be donating eyes but instead providing a kind of eyes to these special people around us. The main objective of the project is to let the visually impaired person know of the objects around him or her. This makes it easier for the person to cope with his or her work and makes them aware of what is in their way. The system helps them detect the basic and common things that they come across in their day-to-day life. The use of object detection techniques can open up new possibilities in assisting both indoor and outdoor navigation for blind and visually impaired people.

1.3 PROJECT DEFINITION

In this project, I have built a real-time object detection and position estimation system, with the goal of informing the user about surrounding objects and their spatial positions using binaural sound. Considering the requirement of real-time object detection, in this project I have used the ‘You Only Look Once’ (YOLO) model. YOLO can efficiently provide relatively good object detection at extremely fast speed. The environment is captured by a portable Raspberry Pi with a Raspberry Pi NoIR camera and is transferred through a video link directly to the YOLO model running on a local server machine with a high-performance GPU. The server detects objects, sends the information directly to the Unity sound generator and plays the binaural sound. The system uses a plug-in for the Unity 3D game engine to simulate the 3D sound, which gives the user a sense of the position of the object.


1.4 PROJECT FEATURES

The system is composed of several modules. Video is captured with a portable Raspberry Pi and Raspberry Pi camera device on the client side and is streamed to the server. The server contains the object detection model for real-time image recognition using existing YOLO models. The 3D location of each object is estimated from the location and size of the bounding box produced by the detection algorithm. Then, a 3D sound generation application based on the Unity game engine renders binaural sound with the locations encoded. The sound is transmitted to the user through wireless or wired earphones. Sound is played at an interval of a few seconds, or whenever the recognized object differs from the previous one, whichever comes first.

With the help of the device, the user is able to successfully identify objects that are 3-5 meters away. Possible issues with the current prototype are detection failure when objects are too close or too far away, and information overload when the system tries to notify the user of too many objects.


CHAPTER 2

LITERATURE SURVEY

A detailed study of this problem has been carried out by several researchers. Some of their works are listed below:

In [1], the authors provide a portable and real-time solution. They present a platform that utilizes a portable camera, a fast HD video link and a powerful server to generate 3D sounds. By using the YOLO algorithm and an advanced wireless transmitter, the solution performs accurate real-time object detection on a live stream at 30 frames per second and 1080p resolution. A prototype for sensory substitution (vision to hearing) is established in the project. Through the project, they demonstrate the possibility of using computer vision techniques as a type of assistive technology.

In [2], the proposed visual substitution system is based on the identification of objects around the blind person. The authors propose a system that recognizes and locates 2D objects in video. The system finds characteristics of objects that are invariant to viewpoint changes, which provides the recognition and reduces the complexity of detection. The method is based on key-point extraction and matching in video. A comparison between the query frame and database objects is made to detect objects in each frame, and for each detected object an audio file containing information about it is activated. Hence object detection and identification are addressed simultaneously. In this work, SIFT key-point extraction and feature matching were used for object identification.

In [3], a set of researchers built an app with buzzer and vibration modes for guiding the blind using sensors. Objects are detected using a key-matching algorithm, i.e., the object is matched against images in a database.

In [4], an algorithm for detecting objects in a sequence of color images taken from a moving camera is presented. The first step of the algorithm is the estimation of motion in the image plane. Instead of calculating optical flow or tracking single points, edges or regions over a sequence of images, the motion of clusters, built by grouping pixels in a color/position feature space, is determined. The second step is a motion-based segmentation, where adjacent clusters with similar trajectories are combined to build object hypotheses. The researchers used Kalman filters to predict dynamic changes in cluster positions. The main application was vision-based driving assistance.

In order to address the real-time and accuracy factors in video image tracking, paper [9] presents a real-time tracking algorithm for moving vehicles based on feature points. The authors provide a simple and practical feature-point matching algorithm using an Adaptive Kalman Filter (AKF). The results of simulation experiments show that the algorithm performs well in fast tracking of a maneuvering target: it not only describes the target accurately but also decreases the matching computation time.

Another set of researchers, in paper [5], developed a real-time vision system that analyzes color videos taken from a forward-looking video camera in a car driving on a highway. The system uses a combination of color, edge and motion information to recognize and track the road boundaries, lane markings and other vehicles on the road. Cars are recognized by matching templates that are cropped from the input data online and by detecting highway scene features and evaluating how they relate to each other. Cars are also detected by temporal differencing and by tracking motion parameters that are typical of cars. The system recognizes and tracks road boundaries and lane markings using a recursive least-squares filter. Experimental results demonstrate robust, real-time car detection and tracking over thousands of image frames, including video taken under difficult visibility conditions.

In paper [6], Ricardo Chincha and YingLi Tian, in their paper entitled “Finding Objects for Blind People Based on SURF Features”, propose an object recognition method to help blind people find missing items using Speeded-Up Robust Features (SURF). The proposed recognition process begins by matching individual features of the user-queried object against a database of features of different personal items which are saved in advance. In their experiments, the total number of objects detected was 84 out of 100, which shows that the work needs better performance; hence, to enhance the object recognition, SIFT could be used instead of SURF.

In [7], Chucai Yi, Member, IEEE, YingLi Tian, Senior Member, IEEE, and Aries Arditi, in their paper “Portable Camera-Based Assistive Text and Product Label Reading From Hand-Held Objects for Blind Persons”, propose a camera-based assistive text reading framework to help blind persons read text labels and product packaging from hand-held objects in their daily lives. They propose an efficient and effective motion-based method to define a region of interest in the video, and the performance of the proposed text localization algorithm is quantitatively evaluated. They then employ the Microsoft Speech Software Development Kit to produce the audio output of the text information.

In [8], researchers propose a system called “Vocal Vision”, whose working concept is based on image-to-sound conversion. A webcam captures the image in front of the blind user, an object detection algorithm processes and enhances the image data, and the image is compared with images in a database. If the comparison succeeds, a message about the object name (or, for scenes, the scene name) is delivered to the blind user through an Android device and headphones. The processing blocks in this system include blurring, gray scaling, edge detection, thresholding, boundary detection, cropping, RGB-to-HSV conversion, histogram computation and normalization. After this, the objects or scenes are compared with the database images, and if the comparison is successful the blind person is informed about that object or scene.

2.1 OBJECT RECOGNITION

A literature review is an important step in software development. Before a tool is developed, it is necessary to determine the time factor, the economy and the organization's strength. Once these things are satisfied, the next step is to determine which operating system and language can be used to develop the tool. As soon as programmers start to build the tool, they need a lot of external support, which can be obtained from senior programmers, from books or from websites.


Before the system is built, the above considerations are taken into account in the development of the proposed system.

2.1.1 WHAT IS OBJECT RECOGNITION?

Object recognition is a method of identifying objects in videos or images. Object recognition is an important output of deep learning and machine learning algorithms. When people look at a photo or watch a video, they can easily spot people, objects, scenes and visual details. The goal is to teach a computer to do what comes naturally to humans: to gain an understanding of what an image contains.

Figure 2.1: Object recognition used to identify objects into different categories

It is also useful in a variety of applications such as the identification of diseases in bioimaging, industrial inspection and robot vision. Object recognition is an important technology behind driverless cars, enabling them to recognize a stop sign or distinguish a pedestrian from a lamppost.

2.1.2 HOW OBJECT RECOGNITION WORKS

There are a variety of approaches to object recognition. Recently, techniques in machine learning and deep learning have become popular approaches to object recognition problems. Both techniques learn to identify objects in images, but they differ in their execution [11].

Object Recognition Using Deep Learning

Deep learning techniques have become a popular method for doing object recognition. Deep learning models such as convolutional neural networks, or CNNs, are used to automatically learn an object’s inherent features in order to identify that object. For example, a CNN can learn to identify differences between cats and dogs by analyzing thousands of training images and learning the features that make cats and dogs different.

Figure 2.2: Deep learning technique for object recognition.

There are two approaches to doing object recognition by using deep learning:

Train a model from scratch: To train a deep network from scratch, you collect a very large labeled data set and design a network architecture that will learn the features and build the model. The results can be impressive, but this approach requires a great deal of training data, and you need to set up the layers and weights of the CNN.

Using a pretrained deep learning model: Most deep learning applications use the transfer learning approach, a process that involves fine-tuning a pretrained model. You start with an existing network, such as AlexNet or GoogLeNet, and feed in new data containing previously unknown classes. This method is less time-consuming and can deliver a faster outcome because the model has already been trained on thousands or millions of images.

Deep learning provides a high level of accuracy, but requires a large amount of data to make accurate predictions.

Object Recognition Using Machine Learning

Machine learning techniques are also popular for object recognition and offer different approaches than deep learning.

Figure 2.3: Machine learning technique for object recognition.

Common examples of machine learning techniques are:

• HOG feature extraction with an SVM machine learning model

• SURF and MSER features in bag-of-words models

• The Viola-Jones algorithm, which can be used to recognize a variety of objects, including faces and upper bodies

To perform object recognition with a standard machine learning approach, you start with a collection of images (or video) and select the relevant features in each image. For example, a feature extraction algorithm can extract edge or corner features that can be used to distinguish between the classes in your data.


These features are added to a machine learning model, which will separate these features into their respective categories and then use this information when analyzing and classifying new objects. You can use a variety of machine learning algorithms and function extraction methods, which offer many combinations to create an accurate object recognition model.

The use of machine learning for object recognition offers the flexibility to choose the best combination of features and classifiers. It can achieve accurate results with minimal data.
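As a rough illustration of the HOG + SVM combination listed above, the following sketch uses scikit-image for feature extraction and scikit-learn for the classifier. It is only a minimal outline under stated assumptions, not part of this project's implementation; the training images and labels are assumed to be supplied by the caller.

import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def extract_hog(image, size=(64, 64)):
    """Resize a grayscale image and return its HOG descriptor."""
    return hog(resize(image, size), orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_classifier(train_images, train_labels):
    """train_images: grayscale arrays; train_labels: their class labels (placeholders)."""
    features = np.array([extract_hog(img) for img in train_images])
    clf = LinearSVC()
    clf.fit(features, train_labels)
    return clf

def predict(clf, image):
    """Classify a single image with the trained HOG + SVM model."""
    return clf.predict([extract_hog(image)])[0]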

2.2 EXISTING SYSTEM

2.2.1 Computer Vision Technologies

There exist multiple tools to use computer vision technologies to assist blind people. Some of these known technologies are:

• TapTapSee: The mobile app TapTapSee uses computer vision and crowdsourcing to describe a picture captured by blind users in about 10 seconds.

• Text Detective: Blindsight offers a mobile app, Text Detective, featuring optical character recognition (OCR) technology to detect and read text from pictures captured with the camera.

• Facebook: Facebook is developing image captioning technology to help blind users engage in conversations with other users about pictures.

• DuLight project: Baidu recently released a demo video of a DuLight project. No further details of the product are available at the moment. However, the product video suggests concepts of describing scenes and recognizing people, money bills, merchandises, and crosswalk signal.


However, these products do not focus on enabling a general visual sense for blind people and do not use spatial sound techniques to further enhance the user experience.

2.2.2 Sensory Substitution Technologies

Some works exist in the general scope of sensory substitution. Some of these known technologies are:

• Mouth Clicks: Daniel Kish, who was totally blind, developed accurate echolocation ability using “mouth clicks” for navigation tasks including biking and hiking independently.

• Colorblind artist Neil Harbisson developed a device to transform color information into sound frequencies. An extreme attempt of converting visual sense to sound is introduced by the vOICe technology. The vOICe system scans each camera snapshot from left to right, while associating height with pitch and brightness with loudness. [16]

However, all these attempts at sensory substitution are reported to involve a very difficult learning process.

2.2.3 Electronic Travel Aids (ETA)

This technology offers a blind navigation system using RFID tags to set up a location-alert infrastructure in buildings, so that a blind person can use an RFID-equipped ETA (such as a cell phone) to determine their location, together with software which uses this localization to generate vocal directions to reach a destination. Electronic travel aids (ETAs) are electronic devices designed to improve the autonomous navigation of blind people. ETA designs differ in size, the type of sensor used in the system, the method of transmitting information and also the method of usage. An image of the scene in front of the blind user is captured using a video camera and converted into a sound pattern, with the intensity of each pixel of the image converted into loudness.


However, ETA systems use expensive ultrasonic sensors along with a few other electronic components.

2.3 OBJECT DETECTION ALGORITHMS

2.3.1 R-CNN

R-CNN stands for region-based convolutional neural network. It works on an object proposal algorithm called selective search. This selective search algorithm overcomes a limitation of plain CNN approaches, which are computationally expensive and slow because they consider a very large number of candidate bounding boxes. With R-CNN this problem is reduced: only about 2000 region proposals are used per image, chosen based on texture, intensity, color, etc.

Following are the three important steps in R-CNN:

i) Run selective search to generate probable object regions.
ii) Feed the output of step i to a CNN, followed by an SVM, to predict the classes.
iii) Optimize the patches by training a bounding box regression separately.

Figure 2.4: Illustration of R-CNN


Limitations of R-CNN:

i) It consumes a lot of time, as 2000 region proposals have to be classified per image.
ii) Selective search is a fixed algorithm, so no learning happens at that stage; this can produce bad candidate regions.
iii) It takes approximately 50 seconds per test image.

2.3.2 FAST R-CNN

Here the whole image is fed to the CNN instead of feeding each region proposal to the CNN. The input image fed to the CNN generates convolutional feature maps, and region proposals are identified from the convolutional feature map. Each proposal is pooled into an RoI feature vector, and a softmax layer is used to predict the class as well as the offset values of the bounding box.

It is faster than R-CNN because the convolution operation is done only once per image and a feature map is generated from it. It takes approximately 3 seconds per test image [22].

Figure 2.5: Illustration of Fast R-CNN


2.3.3 FASTER R-CNN

The common point between the above-mentioned algorithms is that both of them use the selective search algorithm for region proposals; the best part of Faster R-CNN is that it eliminates the use of the selective search algorithm. Similar to Fast R-CNN, it feeds the whole image into the CNN, but instead of running selective search on the feature map it uses a separate network to predict the region proposals [23]. It takes approximately 0.2 seconds per test image. Because of this speed it can also be used for real-time object detection.

Figure. 2.6: Illustration of Faster R-CNN


2.3.4 YOLO (YOU ONLY LOOK ONCE)

YOLO is an algorithm based on regression: the classes and the bounding boxes for the whole image are predicted in one run of the algorithm. Our main task is to classify the object and predict its bounding box. The prediction vector is y = (Pc, bx, by, bh, bw, c), where (bx, by) is the center of the bounding box, bh is the height of the bounding box, bw is its width, Pc is the probability that there is an object in the bounding box, and c is the class of the object. As said earlier, we divide the image into grid cells; in this example the image is divided into a 19x19 grid. After processing we end up with only 5 bounding boxes, as most grid cells are empty; this is the reason we predict Pc. In the next step, boxes with low object probability are removed and, among heavily overlapping boxes, only the most confident one is kept, in a process called non-max suppression.

Figure 2.7: Illustration of YOLO


I have used this algorithm for my project. Its limitation is that it finds it difficult to locate smaller objects in the picture.
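The following is a minimal NumPy sketch of the non-max suppression step described above, kept separate from the project code. The box format (x1, y1, x2, y2), the 20% score cut-off and the 0.5 IoU threshold are illustrative assumptions.

import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Return the indices of the boxes to keep."""
    keep = []
    order = np.argsort(scores)[::-1]              # highest confidence first
    order = order[scores[order] > score_thresh]   # drop low-probability boxes
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # discard boxes that overlap the kept box too much
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep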

2.4 PROPOSED SYSTEM

In the proposed system, the disadvantages of the existing systems are taken care of. The system is built around five main modules.

2.4.1 OBJECT DETECTION ALGORITHM

For successful detection of surrounding objects, several existing detection systems classify objects by evaluating a classifier at various locations in an image. The Deformable Parts Model (DPM) uses root filters that slide detection windows over the entire image [20].

R-CNN uses region proposal methods to generate possible bounding boxes in an image and then applies a ConvNet to classify each box. The results are then post-processed to output refined boxes. Its slow test time, complex training pipeline and large storage requirements do not fit this application [21].

Fast R-CNN max-pools the proposed regions, shares the ConvNet computation across all proposals of an image, and outputs the features of all regions at once [22].

Based on Fast R-CNN, Faster R-CNN inserts a region proposal network after the last layer of ConvNet. [23]

Both methods speed up computation and improve accuracy, but their pipelines are still relatively complex and hard to optimize.

Considering the requirement of real-time object detection, in this project I have used the ‘You Only Look Once’ (YOLO) model. YOLO can efficiently provide relatively good object detection at extremely fast speed [19].


2.4.2 DIRECTION ESTIMATION

After detecting the types of objects in a video frame, the next step is to obtain the direction and distance of each detected object from the user.

The approximate detection distance is 5-6 m. Humans are good at inferring direction from binaural sound, as well as relative distance, namely that object A is closer than object B or that an object is moving closer between frames. However, absolute distance is difficult to deduce from binaural sound. This means my image processing algorithm needs to provide accurate directional information and relative distance, but not exact depth.

The direction is estimated by dividing the region captured by the camera into a coordinate system with (0,0) at the center. The spatial location of the object is conveyed through 3D sound. For example, if the object is to the left of center, this is conveyed through the 3D sound played through the earphones, and the intensity of the sound indicates how close or far away the object is. The approximate detection distance is 5-6 m, but finding the exact distance is a problem; my system finds the direction correctly and only an approximate distance.

Thus, I resort to estimating the direction and relative distance. I have used a Raspberry Pi NoIR camera along with a Raspberry Pi 3B+. Given the field of view of the camera and the bounding box of the object, the direction is estimated from the central pixel location of the bounding box.
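The sketch below illustrates this kind of direction estimate: the bounding-box centre is mapped to horizontal and vertical angles using the camera's field of view, and the box area serves as a crude proxy for relative distance. The frame size and the field-of-view values are assumptions for illustration, not measured values from the prototype.

FRAME_W, FRAME_H = 1280, 720   # assumed resolution of the streamed frames
H_FOV, V_FOV = 62.2, 48.8      # assumed camera field of view in degrees

def estimate_direction(x1, y1, x2, y2):
    """Map a bounding box (pixel corners) to (azimuth, elevation) in degrees,
    with 0 at the image centre; negative azimuth means left of centre."""
    cx = (x1 + x2) / 2.0
    cy = (y1 + y2) / 2.0
    azimuth = (cx / FRAME_W - 0.5) * H_FOV
    elevation = (0.5 - cy / FRAME_H) * V_FOV
    return azimuth, elevation

def relative_distance(x1, y1, x2, y2):
    """Crude proxy for relative distance: the larger the box, the closer."""
    return 1.0 - ((x2 - x1) * (y2 - y1)) / float(FRAME_W * FRAME_H)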

2.4.3 DATA STREAMING

The data flow is shown in Figure 2.8. In this platform, a Raspberry Pi NoIR camera captures the environment and transfers it through a video link directly to the YOLO model running on a local server machine with a high-performance GPU. The server detects objects, sends the information directly to the Unity sound generator and plays the binaural sound.


Figure 2.8: Data flow pipeline of the system.

In particular, a portable camera captures the environment at 30 frames per second and 720p-1080p resolution. The video is live-streamed through the video link to the computer server. The object detection engine YOLO then predicts objects in the stream. The YOLO algorithm can process a single image frame at a speed of 4-60 frames per second, depending on the image size sent to the engine. The outputs are sent to the Unity sound generator and the generated sounds are played through wireless earbuds. In the implementation, the platform is capable of processing the captured live stream at a minimum speed of 30 frames per second at 720p-1080p resolution.

2.4.4 RESULT FILTERING

YOLO outputs the top classes and their probabilities for each frame. A probability above 20% is taken as a confident detection result. To present the results to the user in a reasonable manner, my algorithm also has to decide whether to speak out a detected object, and at what time. Obviously, it is undesirable to keep speaking out the same object to the user even if the detection result is correct. It is also undesirable if two object names are spoken overlapping or so closely together that the user cannot distinguish them.


To solve the first problem, I assume a cool-down time of five seconds for each class. For example, if a person is detected in the first frame and is spoken out, the program will not speak out “person” again until five seconds have passed. This is only a sub-optimal solution, since it does not deal with multiple objects of the same class. Ideally, if there are two persons in the frame, the user should be informed about both of them, but he does not need to be informed about the same person continuously. One possible improvement, which remains future work, is to track each object using overlapping bounding boxes between frames. To solve the second problem, a delay of half a second is enforced between any two spoken classes.
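A small sketch of these filtering rules is given below: a 20% confidence cut-off, a five-second cool-down per class and a half-second gap between any two announcements. The speak() hand-off to the sound generator is a hypothetical placeholder.

import time

CONF_THRESH = 0.20      # probabilities above 20% count as confident detections
CLASS_COOLDOWN = 5.0    # seconds before the same class is announced again
SPEECH_GAP = 0.5        # minimum gap between any two announcements

last_spoken = {}        # class name -> time it was last announced
last_announcement = 0.0

def filter_and_announce(detections, speak):
    """detections: iterable of (label, confidence, position) tuples;
    speak(label, position) is the hand-off to the sound generator."""
    global last_announcement
    now = time.time()
    for label, confidence, position in detections:
        if confidence < CONF_THRESH:
            continue                                    # not confident enough
        if now - last_spoken.get(label, 0.0) < CLASS_COOLDOWN:
            continue                                    # class announced recently
        if now - last_announcement < SPEECH_GAP:
            continue                                    # keep spoken labels apart
        speak(label, position)
        last_spoken[label] = now
        last_announcement = now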

2.4.5 3D SOUND GENERATION

A plug-in for the Unity 3D game engine is used to simulate the 3D sound. A Unity-based program, “3D Sound Generator”, is developed that uses either a file watcher or a TCP socket to receive information about the correct sound clips to be played as well as their spatial coordinates.
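As an illustration of the TCP option, the sketch below shows how the server side could push one detection and its coordinates to the Unity application. The line-based label;x;y;z message format and the port number are assumptions for illustration, not the exact protocol used by the 3D Sound Generator.

import socket

UNITY_HOST, UNITY_PORT = '127.0.0.1', 5065   # assumed address of the Unity app

def send_detection(label, x, y, z):
    """Send one detected object and its spatial coordinates to the Unity
    3D Sound Generator as a single text line."""
    message = '{};{:.2f};{:.2f};{:.2f}\n'.format(label, x, y, z)
    with socket.create_connection((UNITY_HOST, UNITY_PORT), timeout=1.0) as sock:
        sock.sendall(message.encode('utf-8'))

# Example: a bottle slightly to the right of the listener, about three metres away
# send_detection('bottle', 0.8, 0.0, 3.0)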


CHAPTER 3

REQUIREMENT ANALYSIS

3.1 METHODOLOGY FOLLOWED

3.1.1 CONVOLUTIONAL NEURAL NETWORKS

CNNs have wide applications in image and video recognition, recommender systems and natural language processing. Here the example I will take is related to Computer Vision. However, the basic concept remains the same and can be applied to any other use case!

A convolutional neural network consists of one or more convolutional layers (often with a subsampling step), followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights, followed by a form of pooling that results in translation-invariant features. Another advantage of CNNs is that they are easier to train and have far fewer parameters than fully connected networks with the same number of hidden units. Below, the architecture of a CNN is discussed, along with the backpropagation algorithm used to calculate the gradient with respect to the parameters of the model for gradient-based optimization.

A CNN consists of a number of convolutional and subsampling layers, optionally followed by fully connected layers. The input of a convolutional layer is an m x m x r image, where m is the height and width of the image and r is the number of channels; e.g., an RGB image has r = 3. The convolutional layer has k filters (or kernels) of size n x n x q, where n is smaller than the size of the image and q can either equal the number of channels r or be smaller, and may vary for each kernel. The size of the filters gives rise to the locally connected structure: each filter is convolved with the image to produce k feature maps of size (m - n + 1) x (m - n + 1). Each feature map is then subsampled, typically with mean or max pooling over p x p contiguous regions, where p ranges from 2 for small images (e.g., MNIST) to usually no more than 5 for larger inputs. Either before or after the subsampling layer, an additive bias and a sigmoidal nonlinearity are applied to each feature map. The figure below illustrates a complete layer in a CNN consisting of convolutional and subsampling sublayers; units of the same colour have tied weights.

Figure 3.1: First layer of a convolutional neural network with pooling.

After the convolutional layers there may be any number of fully connected layers. The densely connected layers are identical to the layers in a standard multilayer neural network.
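The following small helper works through the sizes discussed above for one convolution-plus-pooling stage, assuming a square m x m input, k filters of size n x n with stride 1 and non-overlapping p x p pooling.

def cnn_layer_sizes(m, k, n, p):
    """Return (number of feature maps, size after convolution, size after pooling)."""
    conv_size = m - n + 1          # each filter yields an (m - n + 1) x (m - n + 1) map
    pooled_size = conv_size // p   # after p x p subsampling
    return k, conv_size, pooled_size

# Example: a 28 x 28 image, 8 filters of size 5 x 5 and 2 x 2 pooling gives
# 8 feature maps of size 24 x 24, pooled down to 12 x 12.
print(cnn_layer_sizes(28, 8, 5, 2))   # -> (8, 24, 12)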

3.1.2 YOLO (You Only Look Once)

You only look once (YOLO) is a state-of-the-art, real-time object detection system. On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev. YOLOv3 is extremely fast and accurate; for this project, I will be using this algorithm and then further optimizing it as per my requirements.

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High scoring regions of the image are considered detections.

YOLO takes a totally different approach. It applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

The model has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions are informed by the global context in the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN which require thousands of evaluations for a single image. This makes it extremely fast, more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.

Figure 3.2: Bounding boxes, input and output for YOLO


YOLO in easy steps:

1. Divide the image into multiple grid cells. For illustration, a 4x4 grid is drawn in the figure above, but the actual implementation of YOLO uses a different grid size (7x7 when training YOLO on the PASCAL VOC dataset).

2. Label the training data as shown in the figure above. If C is the number of unique object classes in our data and S*S is the number of grid cells into which the image is split, then the output vector will be of length S*S*(C+5). For example, in the case above the target vector is of length 4*4*(3+5), as the image is divided into a 4x4 grid and we are training for 3 unique classes: Car, Light and Pedestrian (see the short check after this list).

3. Build one deep convolutional neural network whose loss function is the error between the output activations and the label vector. The model predicts the output of all the grid cells in just one forward pass of the input image through the ConvNet.

4. Keep in mind that the label for an object being present in a grid cell (Pc) is determined by the presence of the object’s centroid in that cell. This is important so that one object is not counted multiple times in different grid cells.
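The short check below reproduces the output-vector size from step 2, assuming one predicted box of five numbers (Pc, bx, by, bh, bw) per grid cell plus C class scores.

def yolo_output_length(S, C):
    """Length of the YOLO label/output vector for an S x S grid and C classes."""
    return S * S * (5 + C)

print(yolo_output_length(4, 3))    # 4 x 4 grid, 3 classes -> 4 * 4 * (3 + 5) = 128
print(yolo_output_length(7, 20))   # 7 x 7 grid, 20 PASCAL VOC classes -> 1225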

3.2 FUNCTIONAL REQUIREMENTS

In software engineering, a functional requirement defines a function of a software system or its component. A function is described as a set of inputs, the behaviour, and outputs. Functional requirements may be calculations, technical details, data manipulation and processing and other specific functionality that define what a system is supposed to accomplish. Behavioural requirements describing all the cases where the system uses the functional requirements are captured in use cases.

Here, the system has to perform the following tasks:

• Capture images in the form of video of at least 4 fps from the user side.

• Live stream the video through the Raspberry Pi to the Server


• The server computes the result from the video frames, detects the objects and tags each object.

• These tagged objects are converted to speech.

• The speech is rendered in audible form as 3D sound.

• The 3D sound is transmitted back to the user.

3.2.1 RASPBERRY PI

If the Raspberry Pi had to be described in one sentence, it would definitely be “a low-cost, credit-card-sized computer”. It is a device which can be plugged into a TV or monitor and used with a standard mouse and keyboard, and it is capable of performing general computation tasks. The Raspberry Pi Foundation is a UK-based charity that works to put the power of computing and digital making into the hands of people all over the world. They do this so that more people are able to harness the power of computing and digital technologies for work, to solve problems that matter to them, and to express themselves creatively. With the Raspberry Pi, one can make use of programming languages like Python and Scratch. It is easy to use and suitable for people of all ages, and it can do much of what a computer can do, from browsing the internet and making spreadsheets to playing games and playing high-definition video.

It is a device capable of interacting with the outside world and has been used in a wide range of digital maker projects, from music machines and parent detectors to weather stations and tweeting birdhouses with infrared cameras.

In this project, the Raspberry Pi acts as the client. Together with the Raspberry Pi NoIR camera, it is used to live-stream the video to the server end. The connection between the server and the client is wireless, and the only condition is that both the server and the client must be on the same network.


3.2.2 Unity 3D

Unity is a cross-platform real-time engine developed by Unity Technologies, first announced and released in June 2005 at Apple Inc.'s Worldwide Developers Conference as an OS X-exclusive game engine. As of 2018, the engine has been extended to support 27 platforms. The engine can be used to create both three- dimensional and two-dimensional games as well as simulations for its many platforms. Several major versions of Unity have been released since its launch, with the latest stable version being Unity 2019.1.0.

Unity gives users the ability to create games and interactive experiences in both 2D and 3D, and the engine offers a primary scripting API in C#, for both the Unity editor in the form of plugins, and games themselves, as well as drag and drop functionality. Prior to C# being the primary programming language used for the engine, it previously supported Boo, which was removed in the Unity 5 release, and a version of JavaScript called UnityScript, which was deprecated in August 2017 after the release of Unity 2017.1 in favor of C#.

The engine has support for the following graphics APIs: Direct3D on Windows and Xbox One; OpenGL on Linux, macOS, Windows; OpenGL ES on Android and iOS; WebGL on the web; and proprietary APIs on the video game consoles. Additionally, Unity supports the low-level APIs Metal on iOS and macOS and Vulkan on Android, Linux, and Windows, as well as Direct3D 12 on Windows and Xbox One.

Within 2D games, Unity allows importation of sprites and an advanced 2D world renderer. For 3D games and simulations, Unity allows specification of texture compression, mipmaps, and resolution settings for each platform that the game engine supports, and provides support for bump mapping, reflection mapping, parallax mapping, screen space ambient occlusion (SSAO), dynamic shadows using shadow maps, render-to-texture and full-screen post-processing effects.


Since about 2016 Unity also offers cloud-based services to developers, these are presently: Unity Ads, Unity Analytics, Unity Certification, Unity Cloud Build, Unity Everyplay, Unity IAP ("In app purchase" - for the Apple and Google app stores), Unity Multiplayer, Unity Performance Reporting, Unity Collaborate and Unity Hub.

Unity supports the creation of custom vertex, fragment (or pixel), tessellation and compute shaders. The shaders can be written using Cg, or Microsoft's HLSL.

3.3 NON-FUNCTIONAL REQUIREMENTS

In systems engineering and requirements engineering, a non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. This should be contrasted with functional requirements that define specific behavior or functions. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture.

3.3.1 ACCESSIBILITY

Accessibility is a general term used to describe the degree to which a product, device, service, or environment is usable by as many people as possible. In this project it has to be ensured that the system is functional throughout, that data transfer takes place only when the user requests it, and that the system is simple and easy to use.

3.3.2 MAINTAINABILITY

In software engineering, maintainability is the ease with which a software product can be modified in order to:

• Correct defects

• Meet new requirements


New functionality can be added to the project based on user requirements simply by adding the appropriate files to the existing project, using the Python and C# programming languages.

No human resources are required to maintain the components or to collect raw data from each of the components. Since the program is simple, it is easy to find and correct defects and to make changes in the project.

3.3.3 PORTABILITY

Portability is one of the key concepts of high-level programming. Portability is the ability of a software code base to be reused, rather than rewritten, when moving software from one environment to another.

The project can be executed under different operating conditions provided it meets its minimum configuration. Only system files and dependent assemblies would have to be configured in such a case.

3.4 HARDWARE REQUIREMENTS

• Camera: 2-60 fps, 720p-1080p resolution
• Raspberry Pi
• System (server): any processor above 500 MHz, 8 GB RAM, 2 GB graphics card, 1 GB hard disk
• Earphones


3.5 SOFTWARE REQUIREMENTS

• Operating system : Windows 7 and above (64bit), Raspbian

• IDE : PyCharm, Unity

• Language : Python, C#


CHAPTER 4

DESIGN

4.1 DESIGN GOALS

To enable a better understanding of the surroundings for the visually impaired and blind people and help them become less dependent on others.

4.2 SYSTEM ARCHITECTURE

Figure 4.1: System Architecture of the proposed System

The architecture is shown in Figure 4.1. In this platform, the environment is captured by a portable camera and transferred through the Raspberry Pi directly to the YOLO model running on a local server machine with a high-performance GPU. The server detects objects, sends the information directly to the Unity sound generator and converts it into binaural sound. The sound is then transmitted back to the Raspberry Pi, which plays it back through the headphones.

4.3 DATA FLOW DIAGRAM

Figure 4.2: Data flow of the proposed System


CHAPTER 5

IMPLEMENTATION

5.1 CONNECTIVITY

# SERVER SCRIPT
import io
import socket
import struct
from PIL import Image

server_socket = socket.socket()
server_socket.bind(('0.0.0.0', 777))   # port must match the one the client connects to
server_socket.listen(0)

# Accept a single connection and wrap it in a file-like object
connection = server_socket.accept()[0].makefile('rb')
try:
    while True:
        # Read the length of the next JPEG frame as a little-endian 32-bit integer
        image_len = struct.unpack('<L', connection.read(struct.calcsize('<L')))[0]
        if not image_len:
            break
        # Read the frame data into an in-memory stream and open it as an image
        image_stream = io.BytesIO()
        image_stream.write(connection.read(image_len))
        image_stream.seek(0)
        image = Image.open(image_stream)
finally:
    connection.close()
    server_socket.close()

# CLIENT SCRIPT
import io
import socket
import struct
import time
import picamera

client_socket = socket.socket()
client_socket.connect(('my_server', 8000))   # server address and port
connection = client_socket.makefile('wb')
try:
    with picamera.PiCamera() as camera:
        camera.resolution = (640, 480)
        camera.start_preview()
        time.sleep(2)                         # give the camera time to warm up
        start = time.time()
        stream = io.BytesIO()
        for foo in camera.capture_continuous(stream, 'jpeg'):
            # Send the frame length first, then the JPEG data itself
            connection.write(struct.pack('<L', stream.tell()))
            connection.flush()
            stream.seek(0)
            connection.write(stream.read())
            if time.time() - start > 30:      # stop streaming after 30 seconds
                break
            stream.seek(0)
            stream.truncate()
        # A zero length tells the server that the stream has ended
        connection.write(struct.pack('<L', 0))
finally:
    connection.close()
    client_socket.close()

5.2 OBJECT DETECTION PSEUDO CODE

# Runs on the server once the connection from 5.1 is established. The YOLO
# helpers (prep_image, write_results, write, load_classes), the loaded model
# and the numpy/cv2/torch imports are assumed to be in scope.
while True:
    # Read the length of the next frame from the socket
    image_len = struct.unpack('<L', connection.read(struct.calcsize('<L')))[0]
    if image_len:
        # Read the JPEG data for this frame into an in-memory stream
        image_stream = io.BytesIO()
        image_stream.write(connection.read(image_len))
        image_stream.seek(0)
        image = Image.open(image_stream)

        # Decode the JPEG bytes into an OpenCV frame
        data = np.frombuffer(image_stream.getvalue(), dtype=np.uint8)
        frame = cv2.imdecode(data, 1)

        # Prepare the frame for the YOLO network
        img, orig_im, dim = prep_image(frame, inp_dim)
        im_dim = torch.FloatTensor(dim).repeat(1, 2)
        if CUDA:
            im_dim = im_dim.cuda()
            img = img.cuda()

        # Run the detector and filter the raw output with non-max suppression
        output = model(Variable(img), CUDA)
        output = write_results(output, confidence, num_classes, nms=True, nms_conf=nms_thesh)

        if type(output) == int:
            # No detections in this frame: just display it and continue
            frames += 1
            cv2.imshow("frame", orig_im)
            key = cv2.waitKey(1)
            if key & 0xFF == ord('q'):
                break
            continue

        # Rescale the bounding boxes from network coordinates to frame coordinates
        output[:, 1:5] = torch.clamp(output[:, 1:5], 0.0, float(inp_dim)) / inp_dim
        output[:, [1, 3]] *= frame.shape[1]
        output[:, [2, 4]] *= frame.shape[0]

        # Draw the labelled boxes on the frame and display it
        classes = load_classes('Working/data/coco.names')
        list(map(lambda x: write(x, orig_im), output))
        cv2.imshow("frame", orig_im)
        key = cv2.waitKey(1)
        if key & 0xFF == ord('q'):
            break
        frames += 1
    else:
        break


CHAPTER 6

RESULT

6.1 DETECTION DATA SET CLASSES

There are a total of 80 classes that can be detected by the system. The following objects can be identified by my system:

[Person, bicycle, car, motorbike, aeroplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, sofa, potted plant, bed, dining table, toilet, TV monitor, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair drier, toothbrush].

The network has been trained on all of these classes; the trained weights are stored in the file yolov3.weights, and the name of each class is stored in the coco.names file.
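A minimal sketch of how the class names could be loaded from coco.names is shown below, assuming the file lists one class name per line; the path is the one used by the detection loop in Chapter 5.

def load_classes(path='Working/data/coco.names'):
    """Read coco.names and return the list of class names, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# classes = load_classes()
# len(classes) -> 80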

6.2 OUTPUT

Since the output is in the form of audio, the only output that can be documented and used for verification is the name of the detected object displayed on the monitor in real time. Each detected object is labeled at the top of its bounding box in the video frame. This display module is coded separately, just for verification of the generated output, and is not an actual part of the project, as it slows down the performance of the system. Following are the images and screenshots of the system and the output.


Figure 6.1: Raspberry pi top view

Figure 6.2: Raspberry pi side view


Figure 6.3: Raspberry pi camera

Figure 6.4: Remote desktop connection login panel


Figure 6.5: Raspberry pi terminal (client)

Figure 6.6: Windows Terminal (Server)


Figure 6.7: Object detected from a live stream video (Bottle)

Figure 6.8: Object detected from a live stream video (Cell phone)


Figure 6.9: Object detected from a live stream video (Banana)

Figure 6.10: Object detected from a live stream video (Scissor)


Figure 6.11: Depiction of the whole system


CHAPTER 7

CONCLUSION

Object detection code has been implemented and integrated with the Unity code for the 3D sound engine to provide the best possible help to a blind person. The system requires a lot of data to be transferred over a network; if the data transfer is smooth, the system provides the best results. Many of the objects that we come across in day-to-day life are identified and conveyed to the person.

I have built the object detection model as described in this report. By using the YOLO algorithm and the transmission of 3D audio, the solution is able to perform accurate real-time object detection. A portable and real-time solution is provided in this project, and a prototype for sensory substitution (vision to hearing) is established. Through this project, I have demonstrated the possibility of using computer vision techniques as a type of assistive technology.

I was able to achieve results in line with my expectations. A lot more can be done in the future to make this an excellent detection system, but for now the system is simple and easy to use.


REFERENCES

[1] Rui (Forest) Jiang, Qian Lin, and Shuhui Qu. Let Blind People See: Real-Time Visual Recognition with Results Converted to 3D Audio. Stanford University.

[2] Object detection and identification for a blind in video scene – Hanen Jabnoun, Hamid Amiri.

[3] Nigidita Pradhan, Rabindranath Bera, and Debasish Bhaskar. Use of 4G Waveform towards RADAR. Dept. of ECE, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sikkim, India.

[4] Motion-Based Object Detection and Tracking in Color Image Sequences Bernd Heiseley Image Understanding Group (FT3/AB) DaimlerChrysler Research Center Ulm, D-89013, Germany.

[5] Margrit Betke, Esin Haritaoglu, and Larry S. Davis. Real-time multiple vehicle detection and tracking from a moving vehicle. Boston College, Computer Science Department, Chestnut Hill, MA; University of Maryland, College Park, MD.

[6] Finding Objects for Blind People Based on SURF Features Ricardo Chincha and YingLi Tian Department of Electrical Engineering the City College of New York New York, NY, 1003.

[7] Portable Camera-based Assistive Text and Product Label Reading from Hand-held Objects for Blind Persons Chucai Yi, Student Member, IEEE, YingLi Tian, Senior Member, IEEE, Aries Arditi.


[8] Manisha Rajendra Nalawade, Vrushali Wagh, and Shraddha A. Kamble. An Approach for Object and Scene Detection for Blind Peoples Using Vocal Vision. International Journal of Engineering Research and Applications, ISSN 2248-9622, Vol. 4, Issue 12 (Part 5), December 2014, pp. 01-03.

[9] Research on Moving Vehicles Tracking Algorithm Based on Feature Points and AKF Jinhua Wang; Yu Li; Chongyu Ren; Jie Cao.

[10] A Step-by-Step Introduction to the Basic Object Detection Algorithms (Part 1) PULKIT SHARMA, OCTOBER 11, 2018

[11] MathWorks. Object Recognition. https://www.mathworks.com/solutions/deep-learning/object-recognition.html; A. Quattoni and A. Torralba. Recognizing Indoor Scenes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[12] Saurabh Gupta, Ross Girshick, Pablo Arbelaez and Jitendra Malik, Learning Rich Features from RGBD Images for Object Detection and Segmentation (ECCV), 2014.

[13] Tadas Naltrusaitis, Peter Robison, and Louis-Phileppe Morency, 3D Constrained Local Model for Rigid and Non-Rigid Facial Tracking (CVPR), 2012.

[14] Andrej Karpathy and Fei-Fei Li, Deep Visual-Semantic Alignments for Generating Image Descriptions (CVPR), 2015.

[15] David Brown, Tom Macpherson, and Jamie Ward, seeing with sound? exploring different characteristics of a visual-to-auditory sensory substitution device. Perception, 40(9):1120–1135, 2011.

[16] Liam Betsworth, Nitendra Rajput, Saurabh Srivastava, and Matt Jones. Audvert: Using spatial audio to gain a sense of place. In Human-Computer Interaction– INTERACT 2013, pages 455–462. Springer, 2013.

[17] Jizhong Xiao, Kevin Ramdath, Manor Iosilevish, Dharmdeo Sigh, and Anastasis Tsakas. A low cost outdoor assistive navigation system for blind people. In Industrial Electronics and Applications (ICIEA), 2013 8th IEEE Conference on, pages 828–833. IEEE, 2013.

[18] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.

[19] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.

[20] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

[21] Ross Girshick. Fast r-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-CNN: Towards real- time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.


ANNEXURE

Raspberry Pi 3 Model B+

The Raspberry Pi 3 Model B+ is the latest product in the Raspberry Pi 3 range, boasting a 64-bit quad core processor running at 1.4GHz, dual-band 2.4GHz and 5GHz wireless LAN, Bluetooth 4.2/BLE, faster Ethernet, and PoE capability via a separate PoE HAT.

The dual-band wireless LAN comes with modular compliance certification, allowing the board to be designed into end products with significantly reduced wireless LAN compliance testing, improving both cost and time to market.

The Raspberry Pi 3 Model B+ maintains the same mechanical footprint as both the Raspberry Pi 2 Model B and the Raspberry Pi 3 Model B.


Specification

Processor: Broadcom BCM2837B0, Cortex-A53 64-bit SoC @ 1.4GHz

Memory: 1GB LPDDR2 SDRAM

Connectivity: • 2.4GHz and 5GHz IEEE 802.11.b/g/n/ac wireless LAN, Bluetooth 4.2, BLE • Gigabit Ethernet over USB 2.0 (maximum throughput 300Mbps) • 4 × USB 2.0 ports

Access: Extended 40-pin GPIO header

Video & sound: 1 × full size HDMI; MIPI DSI display port; MIPI CSI camera port; 4 pole stereo output and composite video port

Multimedia: H.264, MPEG-4 decode (1080p30); H.264 encode (1080p30); OpenGL ES 1.1, 2.0 graphics

SD card support: Micro SD format for loading operating system and data storage

Input power: 5V/2.5A DC via micro USB connector; 5V DC via GPIO header; Power over Ethernet (PoE) enabled (requires separate PoE HAT)

Environment: Operating temperature, 0–50°C

Production lifetime: The Raspberry Pi 3 Model B+ will remain in production until at least January 2023.


To Get Started You’ll Need

• A micro SD card with NOOBS or Raspbian OS

• A high-quality 2.5A micro USB power supply, such as the official Raspberry Pi Universal Power Supply

Software installation

Beginners should start with the NOOBS (New Out Of Box Software) operating system installation manager, which gives the user a choice of operating system from the standard distributions.

SD cards with NOOBS pre-installed should be available from any of our global distributors and resellers. Alternatively, you can download NOOBS.

Raspbian is the recommended operating system for normal use on a Raspberry Pi. Find help with installing Raspbian on your Pi in our online Getting started guide.

You can browse basic examples to help you get started with some of the software available in Raspbian, find more detail about the Raspbian operating system, or read information on fundamental Linux usage and commands for navigating the Raspberry Pi and managing its file system and users.
