
COMPUTER VISION AND BUILDING ENVELOPES

A thesis submitted to Kent State University in partial fulfillment of the requirements for the degree of Master of Science in Architecture and Environmental Design

by

Nina K. Anani-Manyo

April 2021

© Copyright

All rights reserved except for previously published materials

Thesis written by

Nina K. Anani-Manyo

B.S. Architecture, Kent State University, 2019

M. Architecture, Kent State University, 2021

M. S., Kent State University, 2021

Approved by

Rui Liu, Ph.D. ______, Advisor

Reid Coffman, Ph.D. ______, Program Coordinator

Ivan Bernal ______, Director, Architecture and Urban Design

Mark Mistur, AIA, ______, Dean, College of Architecture and Environmental Design

ABSTRACT

Computer vision, a field that falls under artificial intelligence (AI), is increasingly establishing grounds in many disciplines as the demand for automated means to solve real-world problems grows. AI is progressively simplifying and speeding up day-to-day tasks. The application of computer vision within the field of architecture has the potential to increase efficiency as well. The building envelope is an important component of a building and requires regular assessment and inspection. The application of computer vision techniques reveals itself as an innovative way of carrying out a task that is typically performed by humans.

Hence, this research discusses the exploration of computer vision as a tool to classify building materials, evaluate details, and potentially identify distresses of building envelopes. This is done using a collection of existing digital images that help train the computer to produce efficient and reliable results. Deep learning techniques, namely convolutional neural network algorithms and Google’s Teachable Machine, are utilized to classify two sets of base data. The results show that the models are capable of classifying the datasets given to them. These approaches gradually introduce new methods and techniques that can and will revolutionize the Architecture, Engineering, and Construction industry.

Keywords: Computer vision, architecture, building envelope, deep learning, convolutional neural network (CNN), image classification.

TABLE OF CONTENTS

LIST OF FIGURES ...... vii

ACKNOWLEDGEMENTS ...... x

CHAPTER 1 INTRODUCTION AND BACKGROUND ...... 1

1.1 Background ...... 1

1.2 Artificial Intelligence ...... 5

1.3 Computer Vision ...... 7

1.4 Problem Statement ...... 11

1.5 Research Objective and Questions ...... 12

1.6 Thesis Outline ...... 13

CHAPTER 2 LITERATURE REVIEW ...... 14

2.1 History of Computer Vision ...... 14

2.2 Images ...... 17

2.3 Deep Learning ...... 20

2.3.1 Introduction ...... 20

2.3.2 Activation Functions ...... 22

2.3.3 Deep Learning Algorithms ...... 26

2.3.4 Tools and Open-Source Software ...... 29


2.4 Design Application ...... 30

2.4.1 Computer Vision and Art ...... 30

2.4.2 Computer Vision and Building and Infrastructure Performance ...... 35

2.4.3 Computer Vision and Urban Design ...... 39

2.5 Summary ...... 41

CHAPTER 3: CNN AND TEACHABLE MACHINE ...... 43

3.1 Introduction ...... 43

3.2 CNN ...... 43

3.3 Teachable Machine ...... 45

3.3.1 ImageNet ...... 47

3.4 Summary ...... 48

CHAPTER 4: DATA, RESULTS AND DISCUSSIONS ...... 49

4.1 Data ...... 49

4.1.1 Data Collection ...... 50

4.1.2 Data Processing ...... 53

4.2 CNN ...... 54

4.2.1 Classifications for Roof Materials ...... 54

4.2.2 Classifications of Damaged or Not Damaged Roof ...... 60

4.3 Teachable Machine ...... 64


4.3.1 Classifications for Roof Materials ...... 64

4.3.2 Classifications of Damaged or Not Damaged Roof ...... 66

4.4 Results, Comparisons, and Limitations ...... 67

4.5 Summary ...... 71

CHAPTER 5: CONCLUSION ...... 73

5.1 Conclusions ...... 73

5.2 Overall Limitations ...... 73

5.3 Recommendations ...... 74

REFERENCES ...... 76

APPENDICES ...... 82

Appendix A - Python Code for Downloading Images ...... 83

Appendix B- CNN Code for Roof Material Image Classification ...... 85


LIST OF FIGURES

Figure 1-1: Turing Test …………………………………………………………………...... 2

Figure 1-2: Timeline of Computer Vision-Artificial Intelligence ………...... 4

Figure 1-3: Different Brick Wall Properties ………………….…………………...... 9

Figure 2-1: Visual Turing Test ………………………………………………………...... 16

Figure 2-2: Simple “Display Image” Code…………………………………………...... 18

Figure 2-3: Images…………………………………………………………………………...... 19

Figure 2-4: An Artificial Neuron………………………………………………………...... 21

Figure 2-5: Activation Function ……………………………………………………...... 22

Figure 2-6: Equation, Range and Derivative………………………………………...... 24

Figure 2-7: Nonlinear Activation Functions………………………………………...... 25

Figure 2-8: CNN………………………………………………………………………………...... 27

Figure 2-9: RNN- Feed Cyclically vs Feed-Forward………………………...... 28

Figure 2-10: GAN ……….………………………………………………………………………...... 29

Figure 2-11: Videoplace………………………………………………………………………...... 31

Figure 2-12: (Un)stable Equilibrium……………………………………………………...... 32

Figure 2-13: Innovations in AI………………………………………………………………...... 33

Figure 2-14: Digital Art ………………………………………………………………...... 34


Figure 2-15: Full Convolutional Neural Network...... 37

Figure 2-16: Crack Study……………………………………………………………………...... 38

Figure 2-17: Maps……………………………………………………………………………...... 40

Figure 2-18: Structure from Motion Algorithms...... 41

Figure 3-1: ……………………...... 44

Figure 3-2: Creating a New Image Classification Project with Google’s Teachable Machine………………………………………………………………………………...... 47

Figure 4-1: The Process of Image Classification……………………………………...... 50

Figure 4-2: Image Download Process……………………………………………………...... 51

Figure 4-3: Dataset II Images………………………………………………………………...... 52

Figure 4-4: ……………………………………………………………………………...... 53

Figure 4-5: Classifying Roof Materials...... 54

Figure 4-6: Importing the Required Library…………………………………………...... 55

Figure 4-7: Collecting and Defining the Data…………………………………………...... 56

Figure 4-8: Convolutional Layer……………………………………………………………...... 57

Figure 4-9: Compiling the Model……………………………………………………………...... 58

Figure 4-10: Training………………………………………………………………………………...... 58

Figure 4-11: Accuracies and Losses of the Training and Validation Data...... 58

Figure 4-12: Material Classification- Accurate Results…………………………………………………………….. 59


Figure 4-13: Damage Identification Using “Draw Contours” …………………...... 61

Figure 4-14: Classifying Roof Damages……………………………………………………...... 62

Figure 4-15: RMSprop………………………………………………………………………………...... 63

Figure 4-16: Damage Classification- Results……………………………………………...... 63

Figure 4-17: Material Classification with Google’s Teachable Machine - Results…………………….. 65

Figure 4-18: Damage Classification with Google’s Teachable Machine - Results……………………… 66

Figure 4-19: CNN Model- Accuracy………………………………………………………………….……………………… 68

Figure 4-20: Teachable Machine Model Accuracy…………………………………………….…………………….. 69


ACKNOWLEDGEMENTS

First and foremost, thanks to God almighty for His provision and blessings in every area of my research work.

I would like to show my appreciation to my thesis advisor, Dr. Rui Liu, for his advice and guidance throughout my research. His constant motivation and encouragement to strive for the best pushed me far beyond my known capabilities. His dedication, knowledge, and directions greatly helped me in accomplishing this task, and for that, I am very grateful.

I would also like to express my deepest gratitude to my thesis committee: Dr. Elwin C. Robison and Prof. Bill Lucak from the College of Architecture and Environmental Design at Kent State University, Dr. Ruoming Jin from the Department of Computer Science at Kent State University, and Dr. Mirian Velay-Lizancos from the Lyles School of Civil Engineering at Purdue University. I truly appreciate their advice, feedback, and suggestions, which helped improve my research. I also wish to thank Ms. Connie Simms for her time spent reviewing my thesis and for her helpful comments.

Lastly, I would like to thank my parents as well as my sister for their constant support, encouragement, love, and prayers which kept me going strong and made this accomplishment a possibility. I am grateful to everyone who contributed to my educational accomplishment.

Thank you all for your support.


CHAPTER 1 INTRODUCTION AND BACKGROUND

1.1 Background

Computer vision is increasingly becoming a vital tool in many fields, such as engineering, medicine, and construction, and it is gradually changing the way we analyze buildings.

The applications of computer vision have evolved through the implementation of algorithms to solve certain distinct challenges across various areas. The 1960s marked the early stage of computer vision. Before that time, the idea of an artificial entity possessing humanlike responsiveness was found in fiction such as Mary Shelley’s Frankenstein (Guston et al., 2017), but artificial intelligence (AI) was not defined until 1956 (Mijwil, 2015). At a conference at Dartmouth College, New Hampshire, the term AI was coined by scientist John McCarthy, alongside Marvin Minsky, Allen Newell, and Herbert A. Simon. John McCarthy was part of the universities that had already been pioneering AI, such as Stanford University.

Additionally, some of the most significant applications of AI were created at Stanford University. These universities that were developing AI aimed to use computer vision to automate what the human vision system does, which is to see and tell what is seen.

In 1950, the question of whether machines could think was posed by Alan Turing. Turing created the “Turing test”, which tests whether a machine can think or not. In this test, an interrogator, shown in Figure 1-1 as the judge, asks a question, and then the machine and the human each produce a response. The job of the interrogator is to tell which response belongs to whom. If the interrogator is unable to distinguish between the responses, then the machine has succeeded at imitating the human accurately. There have been many papers attempting to prove and disprove that machines can think, and all these claims and curiosities encourage further research on what AI can do to contribute to the development of various disciplines.

Figure 1-1: Turing Test (Image by Dr. Alan Mathison Turing https://auxiliuscg.com/2020/08/27/who-is-alan-m-turing-what-is-the-turing-test/)

The evolution of computer vision, a type of machine intelligence, has been contributing to great technological improvements which continue to provide an improved lifestyle for various individuals. For many years, computer scientists in many parts of the world have been attempting to discover ways to make machines draw out meaning from the given visual data.

The exploration of computer vision has been ongoing research, partially because of the efforts of computer scientists to mimic one of the most intricate organs of the human body, the brain. The brain exhibits many interesting characteristics such as creativity, memory, emotion, and intelligence. This goal of imitating the brain keeps producing innovations as well as new challenges in computer vision. These attempts have always triggered a deeper dive into the capabilities of the computer to process information. Over the years, the approaches and processes of automating have driven the programming of algorithms, which have done much to improve and discover the machine’s abilities. This shows that with the help of these developing algorithms, the machine has the potential of doing more than is currently expected.

Architecture is one field that has truly benefited from technology and its advancements. In recent years, architecture has been moving closer towards the use of the computer as a tool for design and problem solving, and this draws attention to how AI and architecture could work together efficiently to create new approaches and reshape architecture. AI has moved through certain periods in time that shaped how it was explored in different fields of study. Figure 1-2, shown below, portrays the stages and progressions which architecture has undergone to reach AI and how these stages have shaped architecture. After modularity, computational design, and parametricism, architecture has entered the age of AI, signifying the creation and improvement of architectural design technology over time. It starts with modularity, which has been known to have limitations and produced few design possibilities. Then computational design was introduced; the application of computational strategies allowed more iterations. Parametricism gave more control over shapes, which increased the achievability of constructions.


All these approaches have individually benefited and contributed to architecture. AI, which presents itself as a means of providing all the benefits of these previous approaches combined, is currently being introduced to architectural design. Over the years, the machine’s abilities have undergone many experiments, tests, and training runs that have advanced AI. The goal of this study is to view and analyze the machine’s ability through an analysis of building envelope components, using the lens of an image classification model.

Figure 1-2: Timeline of Computer Vision-Artificial Intelligence


1.2 Artificial Intelligence

As a branch of computer science, AI deals with the invention and study of intelligent agents, such as machines, that take specific measures to increase their probability of success. John McCarthy (2007) defined AI as “the science and engineering of making intelligent machines.” What is expected of an intelligent machine is the ability to be trained, handle complex and indefinite issues, produce desired results, make decisions, and carry out tasks accurately and efficiently. With the constant expansion of AI, machines keep getting better at handling tasks that are usually tackled by humans. Ultimately, the scientific goal of AI is to develop computer applications and methods that exhibit intelligent behaviors.

AI research creates technologies and methods to advance the goal of computer functions associated with human intelligence. The key characteristics of an artificially intelligent system include the ability to reason and draw conclusions about a situation, the ability to process large amounts of data (unstructured or structured), the ability to learn based on past experiences or patterns, and the ability to analyze and provide solutions to complex problems. The branches of AI include natural language processing, machine learning, intelligent automation, robotics, expert systems, and computer vision.

Natural language processing deals with the computer’s understanding of what is stated, asked, or expected as results. Machine learning attempts to get the computer to learn and communicate information as a human would; machine learning is a subset of artificial intelligence, and deep learning is a subset of machine learning. Intelligent automation focuses on the machine’s capabilities of understanding information related to human interaction and providing a suitable response; the combination of vision, sound, machine learning, comprehension, and natural language processing helps to achieve intelligent automation. Robotics is the design and programming of physical or virtual systems that perform tasks usually done by humans. Expert systems contain all the possible human knowledge related to an issue. Computer vision is a branch dealing with the processing of visual data such as images and videos to recognize, classify, manipulate, and detect the elements that exist in the image or video. All these branches of AI relate to each other, and that is very evident in their real-world applications. Currently, AI powers many tasks such as digital assistants, search engines, and online shopping recommendations. It has endless possibilities.

AI is fast rising and influencing a wide variety of industries, communities, and individuals. Industries such as automobiles, manufacturing, and medicine, to name a few, are recognizing the benefits that AI presents. Within each field, AI performs different functions. For instance, within the medical field, radiologists use AI to determine the exact volume and form of tumors. It can also be used to identify signs of certain diseases quickly and accurately in images of X-rays, MRIs, and CT scans. Doctors, mathematicians, scientists, and more are all analyzing and innovating machine learning research. Looking at artificial intelligence within architecture and how AI can be used to reshape architecture as it has other disciplines, the possibilities are vast. Machines can save architects the time of doing repetitive and tedious tasks, which allows them to invest more time in critical thinking and design. AI can thrive in architecture when it comes to the use of large amounts of data as a source of information that will help create a variety of new designs. Overall, AI can be a very powerful tool in the research, design, planning, and construction stages of architecture.


1.3 Computer Vision

Computer vision, a recent field in computer science, deals with how a computer can gain an in-depth understanding of the information that exists within images and videos; it aims not only to make the computer see but to automate tasks in the same way the human visual system does. From data collection, computer vision uses certain procedures that bring together machine learning and image processing to extract the required interpretation. Applications such as security systems at public places like airports, whereby the computer identifies and opens a passenger’s travel information just by taking an image of the passenger, are real-life applications of computer vision among many. Due to the rapid growth and spread of computer vision, there is an increasing demand for automated methods and technologies to solve real-world issues across many fields, including the field of architecture.

The father of computer vision is Lawrence Roberts. Roberts published an essay in 1963 called “Machine Perception of Three-Dimensional Solids” (Roberts, 1963). He talked about producing 3D data about solid objects from a 2D perspective view of blocks. From there, researchers understood that it would be better and important to address photographs from the real world. With that knowledge, researchers thought to focus on computer vision tasks such as edge detection, an image processing technique used for finding the edges of objects within an image, and image segmentation, the splitting of a digital image into many parts to alter how the image is represented and create something easier to analyze. These two tasks fall under what is referred to as “low-level computer vision tasks.”

Computer vision is a field that draws from areas such as machine learning, deep learning, pattern recognition, and artificial intelligence. Computer vision applies deep learning techniques, which will be discussed in the next chapter, to analyze images. Advancements in deep learning have enabled the manipulation of images, such as the removal of background noise in an image and the highlighting of the object of focus. Deep learning has also enabled the identification of the shape of an object in an image, as well as its pattern and texture. When a computer is being trained to ‘see’, there is more to it than just images and videos. It covers digital visual data, which can be ordinary images or graphical representations of locations, but this study focuses on digital images. What the computer is being trained to do is what the human eye does naturally without much thought.

Figures 1-3 (a) and (b) display brick walls. That observation is made by the human eye, and by just looking at these two images, more conclusions are drawn about the patterns of color, the shadows cast, what is in the foreground and background, the outlines, and more. This type of information, which comes easily and naturally to humans through sight, is not as easy for computer vision. Hence, the machine’s approach to ‘seeing’ various aspects of an image involves several tasks. Image detection and classification extract information from an image by training the machine to extract or detect specific information using labeled examples. Image segmentation divides the image into segments to simplify its analysis. Object detection locates specific objects in an image relative to the background. Image recognition involves a training process in which the computer learns to recognize patterns in images using a dataset it has already seen. And image manipulation edits the original image, for example by eliminating blur or detecting edges.


(a)

(b)

Figure 1-3: Different Brick Wall Properties (a) and (b) Brick Patterns (https://thearchitectsdiary.com/brick-facade-house-design-work-group/)

Previous and current applications of computer vision include, but are not limited to, robotic assembly, industrial inspections, and autonomous navigation. Computer vision has been used to create three-dimensional maps of building interiors, especially of historic buildings. New buildings have their building details, square footage, and blueprints computerized, since computers now provide the ability to do so, but historic buildings could not partake in that advantage. Computer vision algorithms automatically recognize columns, walls, decks, and more.


1.4 Problem Statement

Computer vision centers on replicating the complex nature of the human visual system. It is known to be one of the technologies that enable interaction between the digital world and the physical world. Computer vision has taken the lead in certain functions, such as detecting and labeling objects efficiently, accurately, and quickly. Advanced research on the role of computer vision in architectural design, engineering, and construction can be a way forward toward producing innovative ideas for the architecture, engineering, and construction industry.

Buildings have a lifespan, and as time passes, certain factors can speed up a building’s deterioration. Buildings experience many problems that shorten their lifespan, such as climate change and extreme conditions ranging from wind and hail to consistently dry, wet, hot, or cold weather. Even though buildings are designed to withstand these conditions, they still bear the effects of these environmental factors. The effects can appear as cracks, corrosion, and material deterioration that develop around the envelope of a building; these start small and eventually become more severe issues if no regular maintenance is done. Such problems are typically noticed when a building is being inspected. Buildings are usually inspected around every five years (CTL Engineering, 1927). The human inspector typically looks for surface defects as well as the root cause of those defects, using metering devices, instruments, and test equipment to perform the inspection, and taking photographs during the process. The process of manually monitoring a building, even with the help of such inspection equipment, requires the full intervention of humans. However, computer vision can add reliability, accuracy, simplicity, and speed to these inspections.


Requiring minimal human intervention, computer vision can speed up the inspection and create an easier process for the building inspector, contributing to the productivity of building inspection. The computer can be programmed to evaluate the many details of a building envelope when the right algorithm is applied. Although the human visual system possesses great intelligence, the human eye is susceptible to tiredness and, in some cases, prone to losing its ability to function accurately. Computer vision presents the potential to be reliable, producing results as expected with fewer errors.

1.5 Research Objective and Questions

The objective of this research is to use computer vision as a tool to evaluate the details and distresses of building envelopes, specifically the roof, using a collection of existing digital images as input to be processed and classified by the computer. With numerous applications of computer vision, it is evident that the computer can be trained, so this investigation determines the extent to which computer vision can be used within the field of building inspection. Can the computer be trained to tell what material the roof is made of? Can the computer tell if a tile on the roof is missing? Furthermore, can the computer tell why that tile is missing, or what potentially caused its loss, just as a building inspector can, or even better and quicker?


1.6 Thesis Outline

This paper is divided into five chapters. Chapter 1, Introduction, covers background information about computer vision and how artificial intelligence relates to it, explaining what artificial intelligence is as well as computer vision. The last few sections of Chapter 1 discuss the problem statement of this thesis, the research objective, and the research questions. Chapter 2, Literature Review, takes a deeper dive into the deep learning techniques employed in computer vision and highlights examples of computer vision applications in art, infrastructure, and urban design, showing how computer vision has helped bring about innovation in various practices. Chapter 3, CNN and Teachable Machine, discusses the frameworks of the CNN and Teachable Machine classification models and how both work. Chapter 4, Data, Results and Discussions, describes the process of collecting and processing the data used in this research. It presents the application of the CNN model and Google’s Teachable Machine model as means of classifying the roof image dataset, first by material and then by damage (either damaged or not damaged). The results of both models are compared, and the findings are discussed. Chapter 5, Conclusion, gives an overall summary and the limitations of this study and highlights potential areas of this research for further exploration.


CHAPTER 2 LITERATURE REVIEW

2.1 History of Computer Vision

Computer vision (CV) has had major milestones that contributed to the growth of the field. In 1959, the first digital image scanner was invented; subsequently, a group of scientists, Herbert Simon, Marvin Minsky, John McCarthy, and Arthur Samuel, came up with the first AI programs, which were used to teach computers how to solve simple mathematical problems (Sodhi, Awasthi, & Sharma, 2019). The results these scientists achieved made them very enthusiastic about the computer’s potential.

In the 1960s, Seymour Papert first attempted to solve a problem with computer vision: “the summer vision project” (Papert, 1996). The summer vision project was “an attempt to use our summer workers effectively in the construction of a significant part of a visual system. The task was chosen partly because it can be segmented into sub-problems which will allow individuals to work independently and yet participate in the construction of a system complex enough to be a real landmark in the development of ‘pattern recognition’” (Papert, 1996). The project failed due to the complex nature of computer vision, which stems from the goal of aligning with the human visual system. Computer vision is known to be a difficult field of research, and although some research problems have been addressed, none has been satisfactorily solved. Human vision is augmented by the human brain, a complex organ in and of itself, so computer vision research keeps delving deeper into ways of creating that alignment. Further ventures into problem-solving with computer vision came after Papert’s attempt. This time, researchers used models such as the Perceptron, a type of neural network influenced by the brain, for problem-solving.


The Perceptron, created by Frank Rosenblatt and other researchers, is an algorithm used for the supervised learning of binary classifiers, whereby the binary classifier determines whether an input, expressed as a vector, belongs to a specific class or not; but the Perceptron had its limitations. Further down the line, computer vision research moved to image processing. Image processing deals with filter applications, finding the edges in an image, and denoising an image. Another recorded milestone was in 1966, when Marvin Minsky, a computer scientist, assigned a graduate student the problem of connecting a camera to a computer and having the computer describe what it sees. Although nothing big came from that project at the time, it became a milestone because it triggered further research into how the human eyes see, how the brain processes that information and makes decisions with it, and how the machine can replicate that.

From the 1990s to the 2000s, computer vision evolved further, with key representation algorithms and feature extraction used to identify objects in images. Algorithms programmed to solve individual problems became better at those tasks the more they were repeated, and algorithms are now being programmed to train themselves, triggering self-improvement over time. After Alan Turing’s proposal of the Turing test, a proposition that tested whether a machine could think (Shan et al., 2013), Donald Geman, along with other researchers, introduced the “visual Turing test” for computer vision systems (Geman et al., 2015). The visual Turing test is “an operator-assisted device that produces a stochastic sequence of binary questions from a given test image”. This test is only about vision and does not require natural language processing as Alan Turing’s test does. From the query, an algorithm flexibly selects a sequence of questions or images to present to the method under evaluation (MuE) in such a way that the answers cannot be foretold based on previous answers. It is the job of the human operator to provide the right answers to the questions asked (Figure 2-1). The questions that the query produces follow a storyline, similar to what humans do when they see an image. The visual Turing test’s goal is to give a new approach to researching how computers can begin to understand images the way humans do.

Figure 2-1: Visual Turing Test (Fountoukidou, 2020)

Computer vision has enabled many practices, such as searching for specific images of specific people; it has enabled the tagging, description, and identification of objects, labeling them and/or applying a filter. It can recognize words even when they are in an image and can recognize and translate a handwritten image into text. Computer vision has been used to give self-driving cars a clear picture of their surroundings. Online photo libraries use computer vision to find objects and classify images by the kind of object an image contains. Before computer vision, image analysis such as X-rays had to be done manually, but the development of computer vision has enabled computers to automate the image analysis process, improving it continuously (Mery, 2014). Computer vision has opened a wide array of uses, from solving simple problems to the potential of solving very complex issues.

2.2 Images

Images play a very important role in the problem-solving research aspect of computer vision. Just as the human visual system can see, understand, and explain an image, researchers are striving for similar responses from the computer, and an image helps jumpstart that learning process. The computer’s reading of an image varies from that of a human, so it is important to understand how the computer looks at an image in order to train it. Images are at the heart of computer vision and can be acquired from cameras, video recorders, and more. An image used in computer vision as an input is reinterpreted and represented as a matrix with a defined number of rows and columns. The rows of the matrix represent the height of the image, and the columns represent the width. An image consists of multiple pixels, and each pixel is one component within the matrix. Each pixel has a value, and that value is a reinterpretation of a channel that indicates a visual detail of that image, such as color or intensity. Each pixel value ranges between 0 and 255 in the Red Green Blue (RGB) depiction. Put together, these pixel values produce colors and give overall clarity to an image.

In computer vision algorithms, all the factors mentioned above are used to train the computer. The steps of a computer vision structure are, first, the input, which is acquired from phones, cameras, and more; then the processing of that input data, which consists of the specific tasks the computer is expected to do with the data received (e.g., find the edges, detect an object in the image, or simply display the input, either in its matrix form or just as it is); and finally the output, which displays the results. With each vision task given to the computer, the computer performs the task with the help of an algorithm created specifically to train it. For instance, a simple task such as displaying a given image requires the computer to run some lines of code to display the image.

Figure 2-2 displays some lines of Python code intended to make the computer read, show, and write the inputted image.

Figure 2-2: Simple “Display Image” Code

To complete the task, the computer must import the necessary libraries, one of which is NumPy. NumPy is a highly optimized library for multi-dimensional arrays and matrices, designed to perform numerical operations. The other library imported is cv2, which contains many functions designed to solve computer vision problems; many more libraries exist to perform specific functions. After the libraries are imported, the function cv2.imread() helps the computer read the image located at the file path placed in the parentheses and store it in a variable named “Sunset_image”. cv2.imshow() displays the image in a window, as shown in Figure 2-3(a). Adding an extra line to the code, such as “sunset_gray_image = cv2.cvtColor(Sunset_image, cv2.COLOR_BGR2GRAY)”, produces the output shown in Figure 2-3(b). The function cv2.cvtColor() with the flag cv2.COLOR_BGR2GRAY converts the image to grayscale.

(a)

(b)

Figure 2-3: Images: (a) Input Image and (b) Output Image

The image input can also be displayed in matrix form by adding “print(Sunset_image)”, which displays the numpy.ndarray, an N-dimensional array type that describes a container of items of the same size and type. The output displayed is the combination of pixel values that form the overall image, each ranging between 0 and 255.
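The steps just described can be collected into a short, runnable script. The following is a minimal sketch of the code reproduced in Figure 2-2, assuming an image file named “Sunset.jpg” sits in the working directory:

import cv2
import numpy as np  # imported alongside cv2 for array operations

# Read the image into a NumPy array (the file name is a placeholder assumption).
Sunset_image = cv2.imread("Sunset.jpg")

# Print the matrix form: a numpy.ndarray of pixel values between 0 and 255.
print(Sunset_image)
print(Sunset_image.shape)  # (rows, columns, channels) = (height, width, 3)

# Convert from BGR color to grayscale, as described above.
sunset_gray_image = cv2.cvtColor(Sunset_image, cv2.COLOR_BGR2GRAY)

# Display the input (Figure 2-3(a)) and the grayscale output (Figure 2-3(b)).
cv2.imshow("Input", Sunset_image)
cv2.imshow("Grayscale", sunset_gray_image)
cv2.waitKey(0)  # keep the windows open until a key is pressed
cv2.destroyAllWindows()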


Many existing algorithms are used to train the computer to complete certain tasks. There is code to change the color of a pixel, rotate the entire image, scale the image, match a template, and much more. The data collection portion of this research required many images from the internet, and an open-source Python script was successfully utilized to download large numbers of images. The downside of the script was that it did not filter the images and select only those relevant to the search; rather, it went to the web browser and downloaded every image with a filename related to the phrase typed in. That means the user has to look through hundreds of images to eliminate unnecessary ones. On the plus side, the time that would have been required to download images from the internet one at a time was cut short with the help of the code. So, there certainly are positive and negative aspects to some computer vision algorithms. They are not completely satisfactory but have come a long way, and some are more accurate than before because of continued research in the field of computer vision.
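For illustration only (the script actually used in this research is reproduced in Appendix A), the core of any bulk-download routine reduces to a loop over image URLs; the URL list below is a hypothetical placeholder:

import requests

# Hypothetical placeholder URLs; a real script would gather these from a web search.
image_urls = [
    "https://example.com/roof_001.jpg",
    "https://example.com/roof_002.jpg",
]

for i, url in enumerate(image_urls):
    response = requests.get(url, timeout=10)
    if response.ok:
        # Save each image with a sequential file name.
        with open(f"download_{i:04d}.jpg", "wb") as f:
            f.write(response.content)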

2.3 Deep Learning

2.3.1 Introduction

Deep learning techniques used in computer vision include convolutional neural networks (CNN), recurrent neural networks (RNN), and generative adversarial networks (GAN). These techniques can effectively solve computer vision problems. Deep learning is a collection of techniques from artificial neural networks, a field under machine learning. They are called artificial neural networks because they imitate how biological neurons are structured in the human brain.


Artificial neural networks (ANN) have a basic structure consisting of artificial neurons grouped into layers, as shown below. The layers typically consist of an input layer; one or more hidden layers, whereby the output of one layer is fed as input to the next; and an output layer. In the same way neurons communicate through complex connections in the human brain, ANNs perform similarly using complex algorithms (Figure 2-4). In an artificial neuron, each neuron receives inputs (x1, x2, …, xm), each associated with a weight (ω1, ω2, …, ωm). The neuron sums all the signals it receives, multiplying each signal by its associated weight. The neurons are activated by the application of linear or nonlinear functions called activation functions: the weighted sum is passed through an activation function to produce the final output (y).

Figure 2-4: An Artificial Neuron (Image by Jayesh Ahire - Medium.com)
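In code, the single neuron just described reduces to a weighted sum passed through an activation function. This minimal NumPy sketch uses arbitrary example values for the inputs, weights, and bias:

import numpy as np

x = np.array([0.5, 0.3, 0.9])   # inputs (x1, x2, x3); example values
w = np.array([0.4, -0.7, 0.2])  # associated weights (w1, w2, w3)
b = 0.1                         # bias

z = np.dot(w, x) + b            # sum of each signal multiplied by its weight
y = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation produces the final output
print(y)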

Achieving high accuracy with a deep learning algorithm depends on how well the ANN is trained. The commonly used algorithm for training ANNs is backpropagation (Skorpil & Stastny, 2006): the error calculated at the output layer is propagated backward, and the weights of each layer are updated (recalculated) accordingly.
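Schematically, each weight update in backpropagation is a small step against the gradient of the error; the learning rate and gradient value below are illustrative only:

learning_rate = 0.01
weight = 0.4
gradient = 0.25  # d(error)/d(weight), obtained by propagating the error backward

# Recalculate the weight by stepping against the gradient.
weight = weight - learning_rate * gradient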


2.3.2 Activation Functions

In deep learning, neural network activation functions play an important role in the training process. Activation functions are numerical equations that control the output, efficiency, and accuracy of the model. They help normalize the output to a range between -1 and 1 or between 0 and 1. Activation functions are included in an artificial neural network to enable the system to learn composite data and decide what moves on to the next neuron. When an input, which is numeric data, is fed into a neuron in the input layer, the numeric input is multiplied by the weight of each neuron, and that produces the output of the neuron, which is fed into the next layer as input. The activation function takes the output from the previous neuron and changes it into a form that can be fed into the next layer. Figure 2-5 displays a simple step of how an activation function works in a neural network, which typically makes use of a non-linear activation function.

Figure 2-5: Activation Function - (missinglink.ai)

Non-linear activation functions help the neural network learn and reproduce complex data. The non-linear activation function enables “stacking”, which creates a deep neural network by piling up multiple layers of neurons. These multiple layers are hidden and typically play an essential role in the machine’s ability to learn complex data and produce results of high accuracy. The non-linear activation function is one of three kinds; the others are the linear activation function and the binary step function. Considering a neuron mathematically, where the output is a weighted sum of the inputs plus a bias, the output value the neuron produces must be higher than a specific threshold for the neuron to be activated. If the output value is lower than the given threshold, the neuron is not activated. This is the binary step function, which operates on a threshold value. The binary step function is useful for creating a binary classifier but does not work well in cases where multiple classes exist, because in that situation multiple neurons could be activated and there would be uncertainty about which of the neurons to select. The linear activation function is a function that works in a straight line: the activations produced from the multiplication of the inputs by the weights are proportional to the input. The outputs of a linear activation function are not binary but multiple; more than one neuron is activated, but in this case the softmax, which interprets the values as probabilities, is taken and a decision is made based on that (Gao & Pavel, 2017). The linear function takes the form A = cx, where the derivative c is a constant with no relationship to the input x, which means that when an error occurs and backpropagation is used to make changes, those changes will not depend on the input. The linear nature of this function also diminishes “stacking”, because no matter how many layers are stacked, they are equivalent to one layer. So, although the linear function works better than the binary step function, the non-linear function works best for neural networks (Figure 2-6).


(a)

(b)

Figure 2-6: Equation, Range and Derivative: (a) Binary Step and (b) Linear Functions (Image by Sebai Dorsaf - medium.com)

There are several commonly used non-linear activation functions in neural networks, including

Sigmoid (logistics), Tanh (hyperbolic tangent), ReLU (rectified Linear Unit), Softmax, and others

(Figure 2-7) (Nwankpa et al., 2018). Sigmoid, which is also referred to as a logistic function, is mostly used in feed-forward neural networks which means that every node in a layer is linked to the nodes of the previous layer. Connections move forward and do not form a cycle. The is used as a prediction tool, which produces a smooth gradient in output values and plays an important role in logistic regression which is a means of predicting the outcome of binary classification problems. The sigmoid function takes the weighted sum of the input features as an input and produces the probability value as an outcome. Its prediction formula produces an output of different values of probabilities that are always between 0 and 1

(Figure 2-7(a)). Tanh is an activation function that ranges between -1 and 1 and produces a zero-centered output. It creates an easier optimization of the loss function. For a multi-layer neural network, Tanh is known to provide better training performance (Figure 2-7(b)). ReLU,

24

which is the most used function, is a function that produces an output of zero if the input is less than zero, thereby eliminating the negative part of the function. But if the input is positive, it is outputted directly (Figure 2-7(c)). The Softmax activation function is used to determine probability distribution and for classification of data that involves multiple variables (Figure 2-

7(d)).

(a)

(b)

(c)

(d)

Figure 2-7: Nonlinear Activation Functions: (a) Sigmoid, (b) Tanh, (c) ReLU, and (d) Softmax (Images by Sebai Dorsaf - medium.com)
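Written directly from the definitions above, the activation functions discussed in this section are only a few lines each; this NumPy sketch is illustrative and not tied to any particular framework:

import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1, 0)    # activates only above the threshold (here 0)

def linear(x, c=1.0):
    return c * x                     # A = cx; the derivative c is a constant

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # output always between 0 and 1

def tanh(x):
    return np.tanh(x)                # zero-centered output between -1 and 1

def relu(x):
    return np.maximum(0, x)          # zero for negative inputs, identity otherwise

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()               # a probability distribution over classes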

2.3.3 Deep Learning Algorithms

2.3.3.1 Convolutional Neural Network (CNN)

CNNs process data across multiple layers. A CNN enables the computer to perform image classification by searching for features like curves and edges in the input image, and then expounding on more abstract concepts through the convolutional layer, which is the key building block of a CNN. It takes an image and passes it through a sequence of convolutional, nonlinear, pooling, and fully connected layers to produce an output. The layers are set in three dimensions, i.e., height, width, and depth. The first layer of a CNN is the convolutional layer. As the filter slides around the input image, the original pixel values in the image are multiplied by the values in the filter and then summed up to produce a single number. That single number represents the section the filter was on, and this is repeated in all regions of the input image. The output of the first layer is an activation map, which shows the relevant regions of the image. This activation map then becomes the input for the second convolutional layer. Each layer of input describes the locations of regions in the original image where features such as curves and edges appear. When the output of the first convolutional layer passes through the second convolutional layer, the results are higher-level features such as hands, feet, squares, etc. As activation maps pass through more convolutional layers, the results produced are more complex; in the end, the filters can activate on finer details in the image, such as specific colors and writing. The pooling layer reduces the parameter count to control the training time and makes sure vital information about the image is kept while reducing dimensionality. The fully connected layer is the final feature map used for classification; this layer operates on one-dimensional data. Convolutional neural networks are used for object detection, image recognition, and computer vision tracking tasks. They address issues such as backgrounds that make it difficult to identify the object of focus, and pixelated or noisy images when zooming in (Figure 2-8).

Figure 2-8: CNN (Image by Disha Shree Gupta https://www.analyticsvidhya.com/blog/2017/06/architecture-of-convolutional-neural-networks-simplified-demystified/)
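As a concrete (if simplified) example of the layer sequence just described, the following Keras sketch stacks convolutional, pooling, and fully connected layers; the input size (224 x 224 RGB) and the ten output classes are placeholder assumptions, not the configuration used later in this thesis:

from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolution + nonlinearity: filters slide over the image producing activation maps.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),                   # pooling reduces dimensionality
    layers.Conv2D(64, (3, 3), activation="relu"),  # deeper layer: higher-level features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # to one-dimensional data
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # class probabilities
])
model.summary()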

2.3.3.2 Recurrent Neural Network (RNN)

RNNs are algorithms known for their internal memory. Because RNNs can remember important information related to the input, they are deemed more accurate when it comes to predicting what comes next and work effectively when handling sequential data (audio, video, text). RNNs are fed cyclically, so inputs received previously are looped back into the network in addition to current inputs. RNNs are often used in combination with long short-term memory (LSTM) units, which extend the memory of RNNs, since RNNs are known to have a short-term memory. Figure 2-9 is a side-by-side comparison of an RNN and a feed-forward neural network.

Figure 2-9: RNN- Feed Cyclically vs Feed-Forward (https://missinglink.ai/guides/neural-network-concepts/recurrent-neural-network-glossary-uses-types-basic-structure/)
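For contrast with the feed-forward networks above, a minimal Keras recurrent model is sketched below; the sequence length (50) and feature count (8) are placeholder assumptions:

from tensorflow.keras import layers, models

model = models.Sequential([
    # LSTM units extend the short-term memory of a plain RNN.
    layers.LSTM(32, input_shape=(50, 8)),
    layers.Dense(1),  # e.g., predict the next value in the sequence
])
model.summary()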

2.3.3.3 Generative Adversarial Networks (GAN)

GANs are generative models. The technique learns to create new data with the same characteristics as the training data (Wang et al., 2020). For example, GANs can create images that resemble photographs of human faces, yet the created images do not belong to any real human. GANs achieve highly realistic images by pairing a generator with a discriminator, the two parts of a GAN, both of which are neural networks. The generator learns to create the intended output, while the discriminator learns to differentiate the generator’s fake output from the real data (Figure 2-10). The output of the generator is directly connected to the input of the discriminator. The generator keeps creating data that is very similar to the real data, attempting to deceive the discriminator. If the training of the generator goes well, the discriminator is unable to tell that the generator output is fake, hence decreasing its accuracy. The discriminator loss corrects the discriminator, via backpropagation, when it classifies a fake as real or a real as fake.

Figure 2-10: GAN (Image by Brijesh Modasara https://medium.com/@kraken2404/introduction-to-generative-adversarial-networks-gans-89095151cd3a)
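One adversarial training step can be sketched as follows. This is a highly simplified illustration: generator, discriminator, and combined (the generator stacked with a frozen discriminator) are assumed to be pre-built, compiled Keras models, and real_images is a batch of training data:

import numpy as np

def gan_training_step(generator, discriminator, combined, real_images, latent_dim=100):
    batch = len(real_images)
    noise = np.random.normal(size=(batch, latent_dim))
    fake_images = generator.predict(noise)  # generator output imitating real data

    # The discriminator learns to label real data 1 and generator output 0.
    discriminator.train_on_batch(real_images, np.ones(batch))
    discriminator.train_on_batch(fake_images, np.zeros(batch))

    # The generator is trained to make the discriminator call its output real.
    combined.train_on_batch(noise, np.ones(batch))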

2.3.4 Tools and Open-Source Software

A wide variety of open-source software used to teach and train computers to perform tasks ranging from simple to complex is readily available to the public. With adequate learning, the machine can produce solutions to daily problems with high accuracy. These tools are very useful in developing research in computer vision and other machine-related tasks and are consistently being improved for further precision in problem-solving. OpenCV, which has a Python interface, is one of the most popular open-source libraries for image processing, the process of applying mathematical functions to an image to get a desired output (OpenCV, 2000). TensorFlow, PyTorch, and Keras, among others, are examples of open-source deep learning tools that aid one in beginning the study of deep learning (TensorFlow, 2015; PyTorch, 2016; Keras: The Python deep learning API, 2015). In deep learning, algorithms are applied in layers that use the output of the previous layer as the input for the next, and these tools are built to support this compound learning process. Each piece of software is preferred based on how the user intends to use it or how it is designed to serve the user’s needs. Open-source software was created out of the desire to move toward automation, and consequently it continues to drive advancements in machine learning and deep learning.

2.4 Design Application

2.4.1 Computer Vision and Art

Computer vision techniques have long been connected to art, from the study of visual art to algorithmically generated artwork to interactive art. Numerous artists have incorporated computer vision techniques into their practice to find innovative ways of experimenting with what they have drawn or designed and to evaluate the computer’s ability to understand; they also use it to discover new ways of describing the outline of a human. Myron Krueger was the first to integrate computer vision into interactive art, in a work he called Videoplace (Krueger, 1985). In Videoplace, Krueger believed that the whole human body needed to play a part in interactions with computers, so he had a participant stand in front of a backlit wall facing the projection screen, in order to digitize and analyze the participant’s silhouette and movement (Figure 2-11 (a) and (b)). Krueger produced many interactions with such means.


(a)

(b)

Figure 2-11: Videoplace (a) and (b) Interactions (Image by Myron Krueger http://www.inventinginteractive.com/2010/03/22/myron-krueger/)

In 2019, a Computer Vision Art Gallery was held in Seoul, South Korea, at the International Conference on Computer Vision, as part of a workshop on computer vision for fashion, art, and design. The winner, Terence Broad, showed a piece called (un)stable equilibrium 1:1, which he explained as part of a series of works teaching “generative adversarial networks without any data.” The piece shows two generator networks, each a neural network that takes a random variable as input and returns an image once trained. The two generator networks try to generate images that can pass as the output of the other (see Figure 2-12). The two networks also compete to produce differences in the colors they form, creating an abstract artwork. The video exhibited showed the results of the generator networks at work, as the images slowly and subtly enlarge and change shape through various color transitions.

Figure 2-12: (Un)stable Equilibrium (Broad 2019)

Current developments of AI in the production of new and different designs in the field of art, as well as architecture, make use of GANs for innovation. With the help of existing images of faces, paintings, plans, buildings, and more, AI is used to generate new ideas and/or designs that evoke emotion in the viewer. Figure 2-13 (a) shows two paintings generated by AI: several portrait images are given as input data to the GAN, which produces something new. Figure 2-13 (b) shows two images generated by AI: the computer is given a dataset of portraits, and with such inputs and the use of GANs, it creates new art.


(a)

(b)

Figure 2-13: Innovations in AI (a) AI Generated Paintings (Image by Mario Klingemann - https://www.theverge.com/2019/3/5/18251267/ai-art-gans-mario-klingemann-auction-sothebys-technology) (b) AI Generated Architecture (Image by Michael Hasey - http://www.michaelhasey.com/gan-exterior)


Another use of GANs is in the work of Refik Anadol, a media artist who forms relationships between architecture, machine intelligence, and media art. His work deals with turning pools of data into art and presenting them as live audio/visual material to make the invisible visible. He utilizes machine learning algorithms to expand his raw materials: with different kinds of algorithms, he acquires vast amounts of data (e.g., images) from the internet, filters down to the data he needs, and creates his digital art. His work “Machine Hallucination” (Refik Anadol, 2020) is a video projection that covers a full 360-degree space and uses machine learning algorithms on a dataset of over 3 million images of New York. In this work, shown in Figure 2-14 (a) and (b), he challenges the typical idea of space: with machine learning algorithms, he groups these images and morphs them to create a visible transformation of the images taken by people, all within one space.

(a)


(b)

Figure 2-14: Digital Art (a) and (b) Machine Hallucination (Images by Refik Anadol - refikanadol.com)

2.4.2 Computer Vision and Building and Infrastructure Performance

In a review article called “Advances in Computer Vision-Based Civil Infrastructure Inspection and Monitoring”, computer vision techniques were used to identify structural components (Spencer et al., 2019). Using segmentation, edge recognition, and classification, through an analysis of many images taken of a structure, the computer can recognize which component is a column and which is a shear wall; it can also tell a beam apart from the ceiling, and so on. The paper gave an overview of advances in computer vision techniques as they pertain to problems of civil infrastructure. One of the problems explained in the article concerned the traditional ways of addressing civil infrastructure issues, namely visual inspections performed by inspectors trained on decision-making criteria. Although monitoring a structure can provide an understanding of its state, such as strains and displacements, such inspections were found to be time-consuming, labor-intensive, dangerous, and expensive. Computer vision techniques have been acknowledged as playing a key role in civil engineering and will aid in improving inspections. These techniques have been fueled by the use of ANNs, which are modeled on the human brain, having many linked nodes that transfer information through their connections, and convolutional neural networks (CNN), which analyze visual imagery through a complex input-output relation of data. CNNs have achieved high success in complex visual problems.

In the inspection process, what is usually envisioned by researchers is automation: using unmanned aerial vehicles for data acquisition, taking photos and videos used for inspections, and using computer vision techniques for data processing. Damage detection and structural component recognition are also reviewed by Spencer et al. (2019). CNNs have been used for damage detection (crack detection) on asphalt pavements, steel decks, and concrete surfaces and have produced high-accuracy results in all cases. There have been numerous applications of CNNs, such as the classification framework presented by Kim et al. (2018) for identifying cracks amongst crack-like patterns. The “Novel Fusion” CNN presented could identify minor cracks at multiple scales, with high accuracy, against the compound backgrounds present, among others. CNNs are enhanced and modified to enable the computer to detect objects and structural components accurately. With an input image, the computer produces a result that differentiates between structural components. Figure 2-15 shows the application of multiple non-linear filters, in which the output is not a linear function of its input, to extract different information.


Figure 2-15: Full Convolutional Neural Network (Spencer et al., 2019).

In vision-based damage detection, the focus is on concrete cracks, using methods based on heuristic filters, such as edge detectors; researchers used these filters to identify cracks of different sizes. Figure 2-16 shows some examples of the results of a crack thickness study done by researchers using different methods.


Figure 2-16: Crack Study (Spencer et al., 2019)

The Canny method (Reddy, 2016), which is widely applied in computer vision systems and is used in the study above, is a multi-stage algorithm included in OpenCV that detects a wide range of edges in an image while dramatically reducing the amount of data to be processed. The algorithm can be broken down into five steps: (1) application of a Gaussian filter to remove noise; (2) locating the intense gradients in the image; (3) application of non-maximum suppression to get rid of false responses to edge detection; (4) application of double thresholds to determine potential edges; (5) tracking edges by hysteresis, i.e., finalizing the detection by suppressing all weak edges that are not connected to strong edges. The result of the Canny method is a cleaner output in comparison to the results of the other algorithms studied.
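
In OpenCV, these five steps are bundled behind a single call. The following is a minimal sketch, assuming a local photograph named roof.jpg; the threshold values are illustrative and are not those used in the studies above.

import cv2

image = cv2.imread("roof.jpg", cv2.IMREAD_GRAYSCALE)  # load as grayscale
blurred = cv2.GaussianBlur(image, (5, 5), 0)          # step 1: noise removal
# Steps 2-5 (gradients, non-maximum suppression, double thresholds,
# and hysteresis tracking) are performed internally by cv2.Canny.
edges = cv2.Canny(blurred, threshold1=100, threshold2=200)
cv2.imwrite("roof_edges.jpg", edges)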

2.4.3 Computer Vision and Urban Design

In a recent final review of a thesis studio, “Architecture and Artificial Intelligence” (at Taubman College of Architecture and Urban Planning, University of Michigan), one of the projects presented was titled Xiong’an: Imagined City (Wang, Yao, & Yang, 2020). With the help of AI and neural networks as a developing tool, students depicted the imagined city of Xiong’an (located in China) at three stages: the urban scale, the block scale, and the architecture scale. Their goal was to create a decentralized city, designing pavilions that are dispersed throughout the city. Their exploration of AI and neural networks involved mapping imagined contexts using satellite images of two cities, Tokyo and Venice, to create a new satellite map of the imagined Xiong’an city at the urban scale. At the block scale, they applied suitable public spaces of Tokyo and Venice onto specific urban areas they selected in the newly produced satellite map, producing an arrangement of “imagined public space.” They used 2D-to-2D style transfer technology, which transferred the spatial qualities of plazas located in Venice onto the selected urban areas of the “imagined Xiong’an.” Further exploration of 3D modeling of the “imagined plazas” was developed by tracing and mapping the transferred images, with the help of Grasshopper. At the architecture scale, aiming to create a decentralized public space, they produced small-scaled designs instead of one large centralized structure and positioned them in the dispersed imagined plazas, hence creating a more organic urban environment (Figure 2-17 (a) and (b)).


Figure 2-17: Maps (a) Xiong’an City (b) Imagined Xiong’an (Tokyo and Venice) (Wang, Yao, & Yang, 2020)

Computer vision techniques have been and are still being applied to many professional and daily practices. They have been used to merge different views, merge different exposures, blend two photographs, and turn a collection of photographs into a 3-dimensional model. Structure-from-motion algorithms can regenerate a sparse 3D point model of a large, complex scene from hundreds of partly overlapping photographs (Figure 2-18 (a) and (b)) (Snavely, Seitz, & Szeliski, 2006).

Figure 2-18: Structure from Motion Algorithms (a) Photographs (b) Results: dataset rendered in a non-photorealistic style (Snavely, Seitz, & Szeliski, 2006).

2.5 Summary

The constant development of deep learning techniques continues to uncover new applications of computer vision in various fields. The rapid advancement of computer vision algorithms and applications will enable a greater understanding of the use of computer vision in more tasks while revealing how much the computer can learn and/or be trained.

The benefits of computer vision are many. Unlike the human eye, computers and cameras do not get tired, and because of that the computer presents itself as more consistent and reliable. Although the human eye can interpret the visual world at a level the computer cannot, with the use of algorithms the computer can be trained to provide faster and simpler processes, with accurate results. Once a system is created, it can be used across many fields for inspecting, counting, monitoring, quality checking, tracking, and many more. In many ways, computer vision has already begun to automate daily tasks, as is evident in autonomous vehicles (self-driving cars such as Tesla) that can detect objects, lane markings, traffic lights, and road signs.

Computer vision has also already influenced sectors such as agriculture, manufacturing, healthcare, and sports (for tracking the ball, among other tasks). Computer vision, although still young, already has numerous real-world applications. With time it will be able to free humans from performing tasks that can be automated, so that humans can focus on high-level tasks.


CHAPTER 3: CNN AND TEACHABLE MACHINE

3.1 Introduction

The gradual shift towards automation in various disciplines and daily practices begins with the development of ways to teach or train the computer, and algorithms continue to serve as a useful method for that purpose. Numerous existing computer vision algorithms are tailored to different problems, and they range from simple to complex depending on the type of task they are used for. The learning process, in which the computer is given a task to complete, deals with the study and building of algorithms that can learn from given data as well as form predictions on that data. As a result, such algorithms are mainly driven by data to make certain decisions.

The innovations in algorithms and software continue to present task-specific solutions, which enable the machine to master those tasks and produce accurate results. The software and algorithms are aiming not only for accuracy but also for accessibility and ease of use.

This chapter discusses the components of the CNN model and what they are mostly used for in computer vision. It also discusses Google’s Teachable Machine and the models and data it operates on.

3.2 CNN

CNNs are deep learning algorithms that are popularly used for computer vision and are inspired by the human brain. The tasks of CNNs include recognizing objects within an image, image recognition, image classification, and more. This study applies CNNs for image classification purposes.

The layers of a CNN model, as discussed in Chapter 2, consist of an input layer, multiple hidden layers, and an output layer, typically divided into the convolutional layer, the pooling layer, and the fully connected layer. They are all linked together for CNNs to process data and classify images. The convolutional layer has a fundamental role in how the CNN model works. When information is received by the convolutional layer, the layer convolves all the kernels (filters) across the dimensions of the data and produces an activation map, which contributes to the output. CNNs are supervised learning methods, which can require large amounts of labeled data (Sakib et al., 2018). One of the ways the CNN model learns is by being shown numerous images so that it can make predictions on images it has not seen. In supervised learning, which is a machine learning task, training data is labeled and used to train the model to classify images (Figure 3-1) (Liu and Wu, 2012).

Figure 3-1: Supervised Learning (Image by Anukrati Mehta https://www.digitalvidya.com/blog/supervised- learning/)

To be able to recognize images the way humans do, CNNs process digital color images with red, green, and blue channel encoding. A typical image is therefore seen by the CNN model as three depth layers of color (channels), in addition to the width and height (which are based on the number of pixels). With these layers, CNNs extract significant features of images and manage to convert them into smaller dimensions without losing the main characteristics of the images (Zhao et al., 2016).

The primary use of CNNs is to classify images because of their high accuracy (Jaswal, Vishvanathan, and Kp, 2014). Image classification deals with the prediction of specific classes, and through training the model learns to recognize an image as belonging to its respective class. Image classification tasks include, but are not limited to, binary classification (tasks with two classes) and multi-class classification (tasks with more than two classes). When a CNN model is tasked with distinguishing wolves and cats from humans (two classes), the legs are a major feature the machine may use for classifying. But when the machine is tasked with classifying wolves, dogs, and humans (three classes), the legs are no longer a decisive feature for the computer to analyze, because now the computer has to differentiate between a dog and a wolf. There must therefore be another feature the computer extracts, and the quality of the data becomes important in helping the computer make accurate predictions. CNN models extract features from an image, and the key features are learned rather than pre-trained. Actions such as resizing the images before training could affect the accuracy of the algorithm if the resizing causes distortions and/or blurriness; one way to avoid such distortions is sketched below.
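
Below is a minimal sketch, assuming OpenCV, of one way to resize an image without distorting it: scale the longer side to the target size and pad the remainder. This illustrates the general idea only, not the preprocessing used in this study.

import cv2

def resize_with_padding(img, size=200):
    """Resize to size x size while preserving the aspect ratio."""
    h, w = img.shape[:2]
    scale = size / max(h, w)                 # shrink factor for the longer side
    resized = cv2.resize(img, (int(w * scale), int(h * scale)))
    top = (size - resized.shape[0]) // 2     # split the leftover space evenly
    left = (size - resized.shape[1]) // 2
    return cv2.copyMakeBorder(resized,
                              top, size - resized.shape[0] - top,
                              left, size - resized.shape[1] - left,
                              cv2.BORDER_CONSTANT, value=0)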

3.3 Teachable Machine

Teachable Machine (“Teachable Machine,” 2017) is a web-based application created by Google to help create machine learning classification models and to support research on image, sound, and video recognition (Carney, 2020). With Teachable Machine, image, video, and sound recognition are made easy and fast for anyone through a process of gathering and loading in the data, training the model, and then testing it. This web-based tool requires no special technical expertise in coding and is simple to use.

In addition to the constructed CNN model used in this research, Google’s Teachable Machine was used to classify the two datasets, roof materials and roof damage. Using the same data size as the CNN model, Teachable Machine could easily and quickly classify the data given to it.

The Teachable Machine model experiments with an approach that does not require the user to write machine learning code, because it is a tool based on the deep learning technique called transfer learning. Transfer learning is a method of using transferable knowledge obtained from another model and then refining it with the data available for the task at hand, which in this case is the roof images used. Most of the neural network, the fully trained model, is kept, while a small part of the model is trained on the data inputted at the last stage. In the Teachable Machine model, classification is easier because what occurs is the replacement of the last classifier layer. Hence the classifier layer is the only layer trained with a new dataset (the dataset created for this research). The classifier layer is the layer displayed when beginning a new project on the Teachable Machine website (Figure 3-2); a sketch of this last-layer replacement follows.


Figure 3-2: Creating a New Image Classification Project with Google’s Teachable Machine

The base model of Teachable Machine is MobileNet, a pre-trained model. MobileNet is trained on a large dataset called ImageNet and is taught to recognize 1,000 object categories in the learning process (Howard et al., 2017). Because its base model is trained on such a large dataset, Teachable Machine mostly relies on this pre-trained model. As a result of this transfer learning approach, the user does not need to acquire a large dataset to train the model.

3.3.1 ImageNet

ImageNet is a very large image database that contains labeled images generated by humans and is intended for the development of computer vision algorithms (“ImageNet,” 2007). ImageNet was developed as a means of encouraging and promoting the research and development of computer vision methods by providing such a resource. ImageNet has a little over 14 million images and a little over 21 thousand classes, with categories such as vehicles, dogs, balloons, fruits, and more (“ImageNet,” 2007). ImageNet was created based on the WordNet hierarchy; WordNet is a database of words connected by their semantic relationships. The advancement of computer vision research creates a need for a much larger database for upcoming algorithms, and ImageNet presents itself as a way of handling current and future advancements of computer vision algorithms (Deng et al., 2009).

3.4 Summary

Research and development of CNNs and Teachable Machine have progressed. CNNs can learn the key features of an image for each class by themselves, and the Teachable Machine tool allows users to include more classes as needed. CNNs are popular because of their architecture: the features generated by one convolution layer are used as input to the next layer and are convolved with filters to produce more features. CNNs are flexible and can learn new and useful features from images. Unlike the CNN model, Google presents Teachable Machine as machine learning without coding, whereby users can train their models without writing code. The next chapter discusses the use of both models to classify the data gathered for this study.


CHAPTER 4: DATA, RESULTS AND DISCUSSIONS

4.1 Data

In Chapters 2 and 3, CNNs were found to be the most used deep learning neural networks for analyzing image data, and Teachable Machine a fast, easy, and accessible tool for image classification. Therefore, both models were utilized for this study. Data is a fundamental component of training with both models: CNNs learn from data, and Teachable Machine is pre-trained with data, with its last classifier layer trained on data as well. This chapter covers the processes involved in the classification of roof images in this study. Figure 4-1 below gives a broad scope of the approach of this research, which will be further discussed later in this chapter. It covers the data collection, data analysis, and the classification process using both Google’s Teachable Machine and the constructed CNN model.

To determine whether the computer can successfully classify the given set of roof images by their material, and identify the damaged roof images from the ones in good condition, there needs to be an adequate understanding of how the algorithms operate to perform these tasks. After the training process was complete, both models were tested to verify their accuracies. This chapter also discusses the results, a comparison of both models, and some of their inaccuracies and shortcomings.


Figure 4-1: The Process of Image Classification

4.1.1 Data Collection

The primary data needed for this research is a collection of roof images. A relatively large number of images was necessary to achieve high accuracy in the training process. Upon reaching out to the industry for the data needed for this study, industry partners were hesitant to share their data. Hence, an existing open-source image downloader code that uses Python and Selenium was the sole source for gathering the needed data. The code was used to download multiple roof images from the internet. Selenium is a tool used to control web browsers and aids in the automation of daily tasks (Devi, Bhatia, & Sharma, 2017). Once the code is run, a new window pops up, asking what to download (Figure 4-2). Once the request is typed in (e.g., metal roof) and entered, a Firefox browser is opened, and the downloading process begins.

Figure 4-2: Image Download Process
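
A minimal sketch of such a Selenium-based downloader is shown below, assuming Firefox with its driver installed; the search URL, element selection, and thumbnail limit are illustrative assumptions and differ from the open-source code actually used in this study (see Appendix A).

import os
import urllib.request
from selenium import webdriver
from selenium.webdriver.common.by import By

query = input("What do you want to download? ")  # e.g., "metal roof"
driver = webdriver.Firefox()                     # opens a Firefox browser
driver.get("https://www.google.com/search?tbm=isch&q="
           + query.replace(" ", "+"))

os.makedirs(query, exist_ok=True)
# Count through the thumbnails found on the results page, then download them.
for i, img in enumerate(driver.find_elements(By.TAG_NAME, "img")[:100]):
    src = img.get_attribute("src")
    if src and src.startswith("http"):           # skip inline base64 thumbnails
        urllib.request.urlretrieve(src, os.path.join(query, f"{i}.jpg"))
driver.quit()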

This image downloader code provided a quicker way of downloading many roof images as opposed to downloading one image at a time. The code performs an image search, first counts through the images found, and then proceeds to download them. The data gathered were made into two databases, i.e., Dataset I and Dataset II. Dataset I was used to determine whether the computer can be trained to tell what material it views in a roof image. For this first group of data, a collection of four types of roof images was downloaded: asphalt shingle, clay tile, concrete tile, and metal. The downloader code was run four times, keying in a different material each time to obtain the needed images. These images were used to create a final base dataset for the first group of images, with around 190 images per class, all placed in a training file. The number of images originally downloaded for each material ranged from 300 to 500 and included some images unrelated to the search; a run-through of the downloaded images had to be completed to eliminate the irrelevant photos. The code downloads all the images that appear in the image results for the search requested and still requires a person to skim through many images. The second group of data addresses the question of whether the computer can tell when tiles are missing from a roof. Within Dataset II, one group shows pictures of damaged roofs with missing tiles, and the other displays images of roofs in good condition (Figure 4-3 (a) and (b)).

Figure 4-3: Dataset II Images (a) Damaged Roof with Missing Tiles (b) Roof in Good Condition


As shown in Figure 4-4, the two groups consist of training, validation, and testing data. The training and validation data contain the categorized roof images. The testing folder contains roof images that are not categorized by file name. The reason is that, when the classification model is being tested, there must be no clues as to what material is displayed based on the name given to the file. Instead, the machine is supposed to tell what material is displayed solely based on its training and the image it is viewing.

Figure 4-4: Databases

4.1.2 Data Processing

Based on the tasks of this research, the general steps of constructing this image classification model involve importing the necessary libraries, loading in the data, training, and predicting the image categories. In this model, the data used to construct the final classification model comprises the training dataset, the validation dataset, and the testing dataset. The training dataset contains the labeled data and is used to improve the accuracy of the image classification model; it is the data initially used to fit the weights of the model. The validation data is a part of the training dataset and provides a fair assessment of the model fitted on the training dataset; it is used to verify the model’s performance. Lastly, the testing dataset was used to produce an impartial assessment of the final model fitted on the training dataset. The testing dataset contains a collection of both images that have been used in training and images that have never been used in training, as a way of performing cross-validation on the model.

4.2 CNN

4.2.1 Classifications for Roof Materials

A CNN classifier was developed to run in Jupyter Notebook to classify roof materials (Figure 4-5).

Figure 4-5: Classifying Roof Materials


In Jupyter Notebook, the first step taken in the image classification model for this research was to import the required libraries, including TensorFlow, Matplotlib, Cv2, Os, and Numpy, for purposes of constructing and training models (Figure 4-6). Additional imports include the image data generator from the Keras library of TensorFlow, which generates a label for an image based on its file name, and stochastic gradient descent, an iterative method of optimization (Song, Chaudhuri, & Sarwate, 2013). Stochastic gradient descent is a variation of gradient descent that is used for large datasets because its small-batch updates make learning faster.

Figure 4-6: Importing the Required Library

Loading data requires the conversion of the training data to a dataset that can fit (in terms of image size and class) the neural network. To generate the labels from the directory created, “flow_from_directory(file path)” was used. The training dataset and validation dataset consist of images with varying dimensions. All the images were resized to 200 x 200 to fit the neural network and to increase the chances of accuracy. The 200 x 200 size is sufficient to help produce steady improvements in the neural network; larger images take more time because they require the machine to learn from more pixels, and machine learning models train faster with small images. With the lines of code shown in Figure 4-7, the neural network will then train on the inputted images in batches of 10, because, given the data size, a large batch size could produce poor results. The batch size can be increased gradually through training, but a smaller batch size trains faster. The “class_mode” also had to be identified; given that the data collected comprises four categories, the “class_mode” equals categorical. The class_mode signifies the number of categories of images that exist within the dataset.

Figure 4-7: Collecting and Defining the Data
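
A hedged reconstruction of this loading step is sketched below (the original code appears in Figure 4-7); the folder paths are assumptions, and flow_from_directory expects one subfolder per class (e.g., train/asphalt_shingle, train/clay_tile, train/concrete_tile, train/metal).

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1.0 / 255)  # normalize pixel values
train_generator = train_datagen.flow_from_directory(
    "train/",                  # assumed path to the training folder
    target_size=(200, 200),    # resize every image to 200 x 200
    batch_size=10,             # small batches, as described above
    class_mode="categorical")  # four roof-material classes
validation_generator = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "validation/",             # assumed path to the validation folder
    target_size=(200, 200), batch_size=10, class_mode="categorical")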

Using a CNN model with max-pooling, feature map dimensions are diminished, and as a result the number of parameters to learn is reduced. There is also a decrease in the amount of computation due to the feature summarization performed in the pooling layer. Figure 4-8 shows the layers of the CNN used. In the first convolution layer, 16 filters of size 3 x 3 were defined with the ReLU activation function, which increases the non-linearity of the images. Images are naturally non-linear, and passing them through a convolution layer imposes linearity; therefore ReLU is needed to restore an image’s non-linearity. Next is the max-pooling layer of size 2 x 2, which selects the maximum pixel value within each window of the image.


Figure 4-8: Convolutional Layer

A second and third convolution layer are defined with increasing numbers of filters to enable the extraction of more complex features within the images. After defining all the convolution and max-pooling layers, the flatten layer is added to convert the data into a 1-dimensional array to be fed into the next layer. The last two layers defined are the dense layers. The dense layer, also called the fully connected layer, takes the input and learned features from previous layers to produce an output. The type of image classification in this research is multi-class classification, which means there exist multiple, mutually exclusive output classes. Therefore, the Softmax activation function was adopted, which compresses a vector into values between 0 and 1 that sum to one, forming a probability distribution over the classes.
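
The layer stack just described can be sketched as follows (the original code appears in Figure 4-8); the filter counts of the second and third convolution layers and the width of the first dense layer are assumptions.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(200, 200, 3)),
    MaxPooling2D((2, 2)),                    # halve the feature-map dimensions
    Conv2D(32, (3, 3), activation="relu"),   # assumed 32 filters
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),   # assumed 64 filters
    MaxPooling2D((2, 2)),
    Flatten(),                               # 1-D array for the dense layers
    Dense(128, activation="relu"),           # assumed width
    Dense(4, activation="softmax"),          # four roof-material classes
])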

The next stage is to compile the model. To do so, the loss function, optimizer, and metrics had to be defined. As a result of the multi-class classification used in this research, the loss function defined is ‘categorical_crossentropy.’ Categorical cross-entropy is a loss that, when used, trains the CNN to produce a probability distribution over the classes for each image. Stochastic gradient descent is the optimizer used; it is provided by the Keras library and possesses a learning rate that needs to be defined (Figure 4-9).


Figure 4-9: Compiling the Model

The model was fit for 100 epochs (iterations) with 60 steps per epoch, values decided upon by trial and error (Figure 4-10). Once the model was run, there was a gradual increase in accuracy and a gradual decrease in loss, and this was evident for both training and validation data. Figure 4-11 shows the accuracies and losses of the last three iterations.

Figure 4-10: Training

Figure 4-11: Accuracies and Losses of the Training and Validation Data
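
Continuing the sketches above, the compile and training steps (Figures 4-9 and 4-10) might look as follows; the learning rate is an assumption.

from tensorflow.keras.optimizers import SGD

model.compile(loss="categorical_crossentropy",
              optimizer=SGD(learning_rate=0.01),  # assumed learning rate
              metrics=["accuracy"])
history = model.fit(train_generator,
                    steps_per_epoch=60,           # chosen by trial and error
                    epochs=100,
                    validation_data=validation_generator)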

The last stage was the testing stage. The prediction labels generated with the use of ‘class_indices’ list asphalt shingle as the first, clay tile as the second, concrete tile as the third, and metal as the fourth. The path for the testing dataset created was defined, and the images were loaded in. ‘Model.predict()’ was used to predict the images in the testing file, and the outputs generated (Figure 4-12 (a) and (b)) display the images with prediction labels, showing some of the accurate predictions the model made.

Figure 4-12: Material Classification - Accurate Results (a) Clay Tile (b) Metal
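
A hedged sketch of this prediction step, continuing the model above; the file name is an assumption, and the label order follows the class_indices mapping just described.

import numpy as np
from tensorflow.keras.preprocessing import image

labels = ["asphalt shingle", "clay tile", "concrete tile", "metal"]
img = image.load_img("test/roof_01.jpg", target_size=(200, 200))  # assumed path
x = np.expand_dims(image.img_to_array(img) / 255.0, axis=0)       # batch of one
prediction = model.predict(x)
print(labels[int(np.argmax(prediction))])  # highest-probability class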


Many factors played a role in the model’s accuracy, which will be discussed in the next chapter. Some predictions were very accurate; however, there were uncertainties, and some predictions were completely wrong.

4.2.2 Classifications of Damaged or Not Damaged Roof

Beginning the process of identifying damage in a roof from the given images, the first approach taken was to locate the darker region of the image, which is the area that is missing a tile or has a crack. Once that dark region is located, indicated by a border of dark pixels, that border could be highlighted to show the area of distress within the image. Utilizing the draw contours function (cv2.drawContours) as a way of outlining the darkest region, the difficulty faced was that multiple dark pixels within the image were highlighted along with the main damage: dark tones that naturally existed within the brick material, and those present because of weathering, had boundary lines drawn around them (Figure 4-13 (a) and (b)). Contours are useful in computer vision algorithms. Within an image, they help separate one object from another and help find objects (Gollapudi, 2019). Contours deal with the outline of an object within an image and can also be drawn with the Canny edge detection discussed in Chapter 2. The boundary points of the shape are detected, and their (x, y) coordinates are stored.


Figure 4-13: Damage Identification Using “Draw Contours” (a) Input (b) Output
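
A minimal sketch of this contour approach, assuming a photo named roof_damage.jpg; the threshold value is illustrative, and the code reproduces the over-detection problem described above when dark tones occur naturally in the material.

import cv2

img = cv2.imread("roof_damage.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Dark regions (e.g., missing tiles) fall below the threshold; THRESH_BINARY_INV
# marks them white so that findContours can trace their borders.
_, mask = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY_INV)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cv2.drawContours(img, contours, -1, (0, 0, 255), 2)  # outline candidates in red
cv2.imwrite("roof_damage_contours.jpg", img)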

The identification of damaged roofs was performed using the same constructed CNN model (Figure 4-14). The classification approach was a better method: it processes the collected roof images at once and groups them into two categories, “Damaged” and “Not Damaged.” Training the CNN model for these two classes, binary classification is the best fit for identifying a damaged roof. In machine learning, binary classification problems are among the most commonly addressed. This form of classification, whereby images are categorized into only two categories, 0 or 1 (in this case, damaged and not damaged), makes binary classification the simplest kind of problem to solve through the productive use of a CNN; hence it can be solved to a reasonably high level of accuracy.


Figure 4-14: Classifying Roof Damages

To fit the CNN model used in the material classifications to the task of damaged/not damaged classification, the algorithm needed to be modified. In the last dense layer of the CNN, the activation function needs to be changed to Sigmoid, which is mostly used for binary classification problems (Nwankpa et al., 2018). When compiling the model, the loss was defined as binary cross-entropy, and the optimizer was defined as RMSprop (root mean square propagation), which uses a learning rate that changes over time (Figure 4-15) (Ruder, 2017); a sketch of these modifications follows the figure.


Figure 4-15: RMSprop (Image by Vitaly Bushaev https://towardsdatascience.com/understanding-rmsprop-faster- neural-network-learning-62e116fcf29a)
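
A hedged sketch of the binary variant described above; the convolutional base is abbreviated, and the learning rate is an assumption.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import RMSprop

binary_model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(200, 200, 3)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(1, activation="sigmoid"),  # one output: damaged (1) or not (0)
])
binary_model.compile(loss="binary_crossentropy",
                     optimizer=RMSprop(learning_rate=0.001),  # assumed rate
                     metrics=["accuracy"])
# The data generators would correspondingly use class_mode="binary".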

RMSprop is a gradient-based optimization technique that aims to normalize the gradient. After running the epochs, the predictions made by this binary classification model were more accurate than those of the multi-class classification (Figure 4-16 (a) and (b)). Both classifications had a very similar number of images within each class. This shows that the algorithm works better with a binary classification task; to reach high accuracy in the multi-class classification task, the algorithm might need to be further improved.

Figure 4-16: Damage Classification - Results (a) Damage (b) No Damage

4.3 Teachable Machine

4.3.1 Classifications for Roof Materials

The process of training on the gathered roof material images in Teachable Machine was quicker than that of the CNN model. Because of the pre-trained model Teachable Machine uses, the process for this task was first to load in the dataset already created and used for training the CNN model. The images were loaded into four classes (asphalt shingle, clay tile, concrete tile, and metal), and each class was renamed accordingly; more classes could be added as needed. The Teachable Machine model is developed in a way that allows numerous categories to be trained at once, because it is based on a robust pre-trained model. Once all the data is loaded, the next step is to train the model; the model was trained in less than a minute. The last step is testing: the model allows the use of files or the webcam in the testing stage and provides the percentage of accuracy. Figure 4-17 (a) and (b) shows the results of the Teachable Machine model classifying the roof materials in the four classes (asphalt shingle, clay tile, concrete tile, metal).

Figure 4-17: Material Classification with Google’s Teachable Machine - Results (a) Clay Tile (b) Metal


4.3.2 Classifications of Damaged or Not Damaged Roof

Training with the damage and no-damage roof images involved the same process mentioned in Section 4.3.1: (1) load data, (2) train, (3) test. A total of 155 damaged roof images and 263 images of roofs in good condition were used. The training time was also under a minute. Figure 4-18 (a) and (b) shows examples of “damage” and “no damage” roof images accurately predicted by the Teachable Machine model.

Figure 4-18: Damage Classification with Google’s Teachable Machine - Results (a) Damage (b) No Damage

4.4 Results, Comparisons, and Limitations

Creating a CNN model requires more time than creating a Teachable Machine model. However, CNN models can easily be adapted to similar classification tasks.

With only the last layer to train, Teachable Machine has a strong, fully trained model it depends on to produce accurate results. Because of this pre-trained model, the machine is familiar with the task and has a high chance of producing better accuracy. Also, the Teachable Machine model is trained on a large base dataset (ImageNet), and models trained on ImageNet have been shown to possess some robustness (Hendrycks et al., 2019).

The constructed CNN model may not be as simple and fast as the Teachable Machine model. It requires specific lines of code and activation functions to aid with the classification process. However, it is flexible enough to allow the necessary adjustments to the lines of code to suit the task and/or to improve the model. Therefore, the results produced by the CNN model have the potential for improvement based on modifications made to the algorithm. Teachable Machine’s pre-trained model can also be changed and/or improved.

Using a dataset that is moderate in size for both classifications, the CNN model successfully made accurate predictions for most of the images that were inputted (Figure 4-19 (a) and (b)). Materials such as metal, concrete, and clay were the most reliably identified. Comparing the results of Teachable Machine with those of the CNN model, Teachable Machine’s results have higher accuracy (Figure 4-20 (a) and (b)).

Figure 4-19: CNN Model - Accuracy (a) Classifying Roof Material (b) Classifying Roof Damages

Figure 4-20: Teachable Machine Model - Accuracy (a) Classifying Roof Material (b) Classifying Roof Damages

The results produced by the constructed CNN model show 0% accuracy in the testing of all eight asphalt shingle images, with all classified as concrete tile. The visual characteristics of asphalt shingle roofs relate strongly to those of concrete tile roofs. Asphalt used on a roof has a finer texture than asphalt used on roads, and when the machine is tasked with a classification problem, certain features within the image help the computer perform the classification. Asphalt shingle and concrete tile roofs are the two materials that can be said to have similar visual properties, as opposed to asphalt shingle and metal or asphalt shingle and clay tile. Therefore, the computer’s inability to classify asphalt could be caused by the asphalt shingle roof images used. The performance of a CNN model depends on the quality and the size of the data provided. Also, having two classes is a simpler task for the CNN model to tackle; more classes add higher complexity to the model. With two materials (asphalt shingle and concrete tile) having very similar visual properties, a larger asphalt shingle roof dataset could help improve the results, because the machine needs more data to learn from. Additionally, up-close images of the asphalt shingle roofs, as well as of the concrete tile roofs, could help the machine find features that differentiate one from the other, such as the grittier texture of the asphalt shingle roof in comparison to the concrete tile; an up-close picture can also show the slightly varying shades of both materials. Analyzing the damage and non-damage images used, there was a clear distinction between the two sets of images, aiding the accurate results of both models.

In both models, the material classification task provided 100% accurate results when classifying concrete tile and metal images, testing with 11 images each. The Teachable Machine model correctly classified all the clay tile images, but the CNN model predicted only 55% of the clay tile images accurately (testing with eight images). Both models were tested with eight asphalt shingle images and eight clay tile images, and those two materials had lower accuracies in the CNN model. More testing images could potentially have increased their accuracies. The robustness of the Teachable Machine model provides another effective way of classifying. Regarding damage identification, the CNN model’s results were 100% accurate.

4.5 Summary

The results of the CNN and Teachable Machine models analyzed in this study show that computer vision methods can be efficient approaches to classification tasks. In the binary classification task (“damage” and “not damaged”), both models produced accurate results (100% and 100% accuracy, respectively). However, for the multi-class classification task, the Teachable Machine model presented a higher accuracy than the constructed CNN model (88% vs. 64%). Although both models are robust, the larger training dataset behind the Teachable Machine model (MobileNet trained on ImageNet) makes it more robust than the constructed CNN model. Teachable Machine is also more user-friendly. The CNN model is flexible and could be improved with a larger dataset and quality images to increase its accuracy.


CHAPTER 5: CONCLUSION

5.1 Conclusions

Computer vision has come a long way in its classification abilities and its advancement towards AI. CNNs are fundamental in machine learning, are widely used today in many fields, and their algorithms are constantly under research and improvement. CNNs are well suited to computer vision tasks like classification because of their automated feature extraction (Shaheen et al., 2019). Therefore, the application of computer vision as a problem-solving tool for building envelopes is a step closer to creating innovative methods that can contribute not only to the field of architecture, engineering, and construction but also to AI. This study shows that computer vision can be used as a tool to evaluate the attributes of building envelopes through a classification method. Classification tasks can be performed with code built from scratch and/or with models built on a pre-trained model that uses transfer learning, like Teachable Machine. Both roof materials and roof damage can be classified given quality digital images and a robust classification model. Based on the tests that were run, the results produced by both the constructed CNN model and Teachable Machine, for both roof material classification and damage classification, have proven to be a potentially effective approach to the inspection process of building envelopes.

5.2 Overall Limitations

The importance of obtaining a quality dataset and training the CNN model on it is most evident in the output once the model is trained. When the dataset has noise, distortions, and/or blur, it limits the machine’s potential to produce accurate results (Dodge & Karam, 2016).


Image classification requires a significant amount of data for training; this approach to training the computer helps in the creation of robust machine learning algorithms (Mastorakis, 2018). But does a limited amount of data equate to poor performance of the algorithm? Another limitation of the CNN model is that, unlike the Teachable Machine model, not everyone can use it. In this study, the difficulty faced was first with the image download process. Due to the broad download approach the code uses, the amount of data shrank after eliminating the images that were poor in quality. These images also came in a range of sizes. Resizing the images caused major distortions, and those images had to be manipulated to correct the distortions, which took time when dealing with many images. The best approach would be to gather these categories of roof images by drone. That would provide a set size and orientation for all the images and a larger dataset, all of which increase the quality of the data.

5.3 Recommendations

This research focused on the classification of roof materials and roof distresses; further explorations could be made with computer vision to determine the potential causes of the distress seen in the roof images, such as hail and/or wind. A larger dataset with more classes of building materials, and specifically drone images, could be used to train the CNN model for material classification to improve the data quality for the CNN model.

Through video data, computer vision can be used to count and label tiles, concrete, glass, and metal panels on a building’s envelope as needed for maintenance. Furthermore, a study on how Teachable Machine could be used to identify cast-in-place concrete cracks, as well as the components of a building envelope, using its video capability can be explored. In that sense, when the camera points at concrete, the computer will immediately know it is concrete and label it as such. As the camera attached to the drone pans around, the machine identifies what it sees, be it glass, concrete, metal, etc. In architecture, computer vision can be used to recognize numerous façade patterns and, with that, to produce innovative and intricate façade designs.

Unmanned aerial vehicles have the potential to equip architecture, engineering, and construction with new approaches to problem-solving. Through images and video, computer vision can be explored to analyze how well the computer can detect underlying issues by assessing poor work done on a building envelope during the construction process. It can also be investigated as a way to maintain skyscrapers by helping keep track of all their structural components. Computer vision has the capability of assisting humans to perform better at their jobs and in their day-to-day practices and will eventually enable more people to be critical in their thinking and actions.

With the use of large amounts of data, machine intelligence can make decisions, solve problems, produce new ideas, and more. This ability can play a very important role in architecture and the design process. The early stages of research in the design process can be covered by AI, whose ability to gather large amounts of data on current and past projects will be beneficial for the design and construction of buildings. The provision of ample information such as building codes and zoning data, with the use of AI, can also create a shorter planning process for architects while generating multiple design options. The benefits of the machine are countless. As time passes, AI continues to be incorporated into daily practices, and will potentially be widely applied in architectural and engineering design, as well as in the automation of construction.


REFERENCES

Agarap, Abien. (March 2018). Deep Learning Using Rectified Linear Units (ReLU). De La Salle University.

Anadol, Refik. (2020). Machine Hallucination. https://refikanadol.com/works/machine-hallucination/

Airola, R., & Hager, K. (2017). Image Classification, Deep Learning, and Convolutional Neural Networks. Karlstad University. 48-53.

Building Envelope. CTL Engineering. https://ctleng.com/building-envelope

Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110(3), 346-359.

Broad, Terence. (2019). (un)stable equilibrium. https://computervisionart.com/pieces2019/unstable-equilibrium/

Carney, Michelle, et al. (April 2020). Teachable Machine: Approachable Web-Based Tool for Exploring Machine Learning Classification. Google, Inc., Mountain View, US.

Chaillou, Stanislas. (2019). AI + Architecture, Towards a New Approach. Harvard GSD.

Chakraverty, Snehashish, et al. (2019). Perceptron Learning Rule. Concepts of Soft Computing, pp. 183-188.

Devi, Jyoti, Bhatia, Kirti, & Sharma, Rohini. (2017). A Study on Functioning of Selenium Automation Testing Structure. International Journal of Advanced Research in Computer Science and Software Engineering, 7, 855-862. 10.23956/ijarcsse/V7I5/0204.

Deng, Jia, et al. (June 2009). ImageNet: A Large-Scale Hierarchical Image Database. ResearchGate.

Dodge, Samuel, & Karam, Lina. (2016). Understanding How Image Quality Affects Deep Neural Networks. arXiv:1604.04004v2.

Explorations in Artificial Intelligence and Machine Learning. CRC Press, Taylor & Francis Group.

Fountoukidou, Tatiana, & Sznitman, Raphael. (2019). Concept-Centric Visual Turing Tests for Method Validation.

Gollapudi, Sunila. (2019). Learn Computer Vision Using OpenCV: With Deep Learning CNNs and RNNs. Apress.

Gao, Bolin, & Pavel, Lacra. (2017). On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning.

Geman, Donald, Geman, Stuart, Hallonquist, Neil, & Younes, Laurent. (2015). Visual Turing Test for Computer Vision Systems. Proceedings of the National Academy of Sciences, 112(12), 3618-3623.

Hendrycks, Dan, Lee, Kimin, & Mazeika, Mantas. (October 2019). Using Pre-Training Can Improve Model Robustness and Uncertainty. arXiv:1901.09960v5.

Howard, Andrew G., et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861v1.

ImageNet. (2007). http://www.image-net.org/

Jaswal, Deepika, Vishvanathan, Sowmya, & Kp, Soman. (2014). Image Classification Using Convolutional Neural Networks. International Journal of Scientific and Engineering Research, 5, 1661-1668. 10.14299/ijser.2014.06.002.

Janocha, Katarzyna, & Czarnecki, Wojciech. (2017). On Loss Functions for Deep Neural Networks in Classification. arXiv, abs/1702.05659.

Kim, H., Ahn, E., Shin, M., & Sim, S.H. (2018). Crack and Non-Crack Classification from Concrete Surface Images Using Machine Learning. Structural Health Monitoring. Epub.

Koch, Christian, et al. (2015). A Review on Computer Vision-Based Defect Detection and Condition Assessment of Concrete and Asphalt Civil Infrastructure. Advanced Engineering Informatics, 29(2), 196-210.

Kardava, Irakli, et al. (2018). Training Process Automation for Computer Vision. World Congress on Engineering, vol. 1.

Keras: The Python Deep Learning API. (2015). https://keras.io/

Kacorri, Hernisa. (2017). Teachable Machine for Accessibility. ACM SIGACCESS Accessibility and Computing, 10-18. 10.1145/3167902.3167904.

Krueger, M.W., & Wilson, S. (1985). VIDEOPLACE: A Report from the ARTIFICIAL REALITY Laboratory. Leonardo, 18(3), 145-151. https://www.muse.jhu.edu/article/601392.

Lowe, David. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision.

Liu, Qiong, & Wu, Ying. (2012). Supervised Learning. 10.1007/978-1-4419-1428-6_451.

Leutenegger, Stefan, Chli, Margarita, & Siegwart, Roland. (2013). BRISK: Binary Robust Invariant Scalable Keypoints. Autonomous Systems Lab, ETH Zurich.

Mery, D. (2014). Computer Vision Technology for X-ray Testing. Insight - Non-Destructive Testing and Condition Monitoring, 56(3), 147-155. https://doi.org/10.1784/insi.2014.56.3.147

Matplotlib: Python Plotting — Matplotlib 3.3.3 Documentation. (2003). https://matplotlib.org/

Mastorakis, Georgios. (2018). Human-Like Machine Learning: Limitations and Suggestions. ResearchGate.

Mijwil, Maad. (2015). History of Artificial Intelligence. 3. 1-8.

NumPy. (1995). https://numpy.org/

Nwankpa, Chigozie, et al. (2018). Activation Functions: Comparison of Trends in Practice and Research for Deep Learning.

OpenCV. (2000). https://opencv.org/

Paal, S.G., Jeon, J.S., Brilakis, I., & DesRoches, R. (2015). Automated Damage Index Estimation of Reinforced Concrete Columns for Post-Earthquake Evaluations. Journal of Structural Engineering, 141(9), 04014228.

Papert, Seymour. (1966). The Summer Vision Project. Massachusetts Institute of Technology, Project MAC.

PyTorch. (2016). https://pytorch.org/

Ruder, Sebastian. (2017). An Overview of Gradient Descent Optimization Algorithms.

Roberts, Lawrence. (1963). Machine Perception of Three-Dimensional Solids.

Reddy, R., Nagaraju, Chiluka, & Reddy, I. Raja Sekhar. (2016). Canny Scale Edge Detection. IJETT. 10.14445/22315381/IJETT-ICGTETM-N3/ICGTETM-P121.

Skorpil, Vladislav, & Stastny, Jiri. (2006). Neural Networks and Back Propagation Algorithm.

Shelley, M. W., Guston, D. H., Finn, E., & Robert, J. S. (2017). Frankenstein: Annotated for Scientists, Engineers, and Creators of All Kinds. Cambridge, MA: The MIT Press.

Sakib, Shadman, Ahmed, Nazib, Kabir, Ahmed Jawad, & Ahmed, Hridon. (2018). An Overview of Convolutional Neural Network: Its Architecture and Applications. 10.20944/preprints201811.0546.v1.

Singh, Gyanendra, Vedrtnam, Ajitanshu, & Sagar, Dheeraj. (2013). An Overview of Artificial Intelligence. SBIT Journal of Sciences and Technology, 2(1).

Spencer, Billie F., et al. (2019). Advances in Computer Vision-Based Civil Infrastructure Inspection and Monitoring. Engineering, 5(2).

Shan, Qi, et al. (2013). The Visual Turing Test for Scene Reconstruction. 2013 International Conference on 3D Vision.

Sodhi, Pinky, Awasthi, Naman, & Sharma, Vishal. (2019). Introduction to Machine Learning and Its Basic Application in Python.

Smith, Chris, et al. (2006). The History of Artificial Intelligence. University of Washington.

Snavely, N., Seitz, S. M., & Szeliski, R. (2006). Photo Tourism: Exploring Photo Collections in 3D. ACM Transactions on Graphics (Proc. SIGGRAPH 2006), 25(3), 835-846.

Song, Shuang, Chaudhuri, Kamalika, & Sarwate, Anand. (December 2013). Stochastic Gradient Descent with Differentially Private Updates. University of California, San Diego.

Teachable Machine. (2017). https://teachablemachine.withgoogle.com/

TensorFlow. (November 2015). https://www.tensorflow.org/

Wang, Nishang, Yao, Jiawei, & Yang, Zhaoxuan. (2020). Xiong’an: Imagined City. https://imaginedcity.wall-atlas.com/

Wang, Zhengwei, et al. (2020). Generative Adversarial Networks in Computer Vision: A Survey and Taxonomy. arXiv:1906.01529v6 [cs.LG].

Yoshimura, Yuji, et al. (2019). Deep Learning Architect: Classification for Architectural Design through the Eye of Artificial Intelligence. Lecture Notes in Geoinformation and Cartography, Computational Urban Planning and Management for Smart Cities, pp. 249-265.

Yu, Lei, Yu, Zhixin, & Gong, Yan. (2015). An Improved ORB Algorithm of Extracting and Matching Features. International Journal of Signal Processing, Image Processing and Pattern Recognition, 8(5).

Zhang, Hang. (2019). 3D Model Generation on Architectural Plan and Section Training through Machine Learning. Technologies, 7(4).

Zhao, Juanping, Guo, Weiwei, Cui, Shiyong, Zhang, Zenghui, & Yu, Wenxian. (2016). Convolutional Neural Network for SAR Image Classification at Patch Level. 945-948. 10.1109/IGARSS.2016.7729239.

Websites

How to Download Google Images Using Python and Selenium? https://pycheat.com/download_img.php

Guide to the Sequential Model - Keras 2.0.2 Documentation. https://faroit.com/keras-docs/2.0.2/getting-started/sequential-model-guide/

TensorFlow - tf.keras.preprocessing.image.ImageDataGenerator. https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator


APPENDICES


Appendix A - Python Code for Downloading Images



Appendix B - CNN Code for Roof Material Image Classification

