06.21.18 WorldQuant Perspectives

Understanding Images: Computer Vision in Flux

Computer vision has long been a key driver of artificial intelligence. But despite decades of work and proliferating applications, machines still have a lot to learn from the human brain.

WorldQuant, LLC | 1700 East Putnam Ave., Third Floor, Old Greenwich, CT 06870 | www.weareworldquant.com

Teaching machines to see hasn't been easy. Though machine learning methods have recently brought a significant improvement in computer vision, the problem of developing machines that can understand images has not been fully solved. Machines remain inferior to humans when it comes to many visual interpretation skills. While in some ways machines have significant advantages — such as the recognition of hidden patterns and the manipulation of large amounts of data — recognizing human faces in particular remains a difficult task.1

Most of the information humans possess comes from their visual senses: their ability to see, process visual information, store and retrieve it from their memories, and understand what they are seeing. Not surprisingly, developing the means for machines to see like humans was one of the earliest goals of artificial intelligence (AI). In fact, the very development of AI, particularly machine learning methods such as neural networks and deep learning, has occurred with computer vision in mind. Yet for all of the high-profile successes of recent machine learning programs, which have mastered complex board games like Go and chess, a robot equipped with computer vision still struggles to compete with a three-year-old human at recognizing and understanding what's in front of its nose.

It's not really clear where the next improvements in computer-vision development will come from. Today the biggest breakthroughs stem from the development of multilayer neural networks. But these advances, while exciting, have their own limitations: Neural-network approaches remain empirical and nontransparent, which makes further improvement difficult. Deep neural networks provide some understanding of images but do not help us master what's been called the understanding of understanding.

Nonetheless, computer-vision applications are proliferating. Computer vision is not only being used to identify and analyze information from images or 3-D objects but also to understand content and context. Today you can find computer-vision algorithms in facial-recognition systems for passport control and security, object recognition for optical and electronic microscopes, motion tracking, human emotion recognition, medical diagnoses, and autonomous vehicles and drones.

All of these applications depend on the development of the internet of things, which connects sensors in devices that can send and receive data across the internet. Computer vision is a metafield that makes use of various technologies to provide a major component of intelligent machines.

Machine learning plays an important role in current computer-vision technologies. A fundamental hurdle of computer vision has been identifying objects in an image.2,3 Machine learning algorithms are quite good at categorization and classification of objects, and they can recognize more than single objects. Consider mobile apps like Google Goggles or Samsung's Bixby, which provide information and feedback on whatever a smartphone camera is "viewing." A multilayered system drives these apps. First, they distinguish among the different objects that appear in the image — a process called segmentation. Second, they separate the objects and examine each of them. Third is the core of the recognition process: a database of images and three-dimensional models, which are mathematical representations of 3-D surfaces. Partial 3-D models gleaned from the image are compared with models in the database. Then database matching and statistical analysis may be able to tell us the theme of the image and produce a text description: Is it a busy road or a natural landscape?

Effective computer vision requires making a clear distinction between background and foreground. This can be done using clustering methods, which are basic functions of machine learning algorithms. A higher level involves analyzing an image by its parts, treating it as a whole and creating a description of it. The starting point might be a vector that contains the objects and the relations among them, such as position, orientation, time (if it's a video) and other features that provide hints about context. This process resembles human behavior: First, our brain identifies the objects in context; then it theorizes about what's happening, using experience from similar situations.

Machine learning offers effective methods for image classification, image recognition and search, facial detection and alignment, and motion monitoring, while also providing tools for synthesizing new vision algorithms.
Basics of Computer Vision

Image processing uses a family of algorithms dedicated to so-called image (picture or video) transformation. The simplest computer-vision algorithms make photo zooming and rotation applications possible, with more-sophisticated transformations, such as edge sharpening, made possible by techniques like optimal filtration, frequency and auto-correlation analysis, and other signal-processing algorithms. (A typical image-processing task is choosing an Instagram filter for a photo.)
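The snippet below is a minimal sketch of one such transformation: sharpening a grayscale image by convolving it with a 3x3 kernel. It assumes NumPy and SciPy are available; the kernel values are illustrative, not canonical.

    import numpy as np
    from scipy.ndimage import convolve

    def sharpen(gray: np.ndarray) -> np.ndarray:
        """Sharpen a 2-D matrix of pixel intensities in [0, 255]."""
        # Center-weighted kernel: boosts each pixel against its neighbors
        kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]])
        out = convolve(gray.astype(float), kernel, mode="nearest")
        return np.clip(out, 0, 255).astype(np.uint8)

    # Toy 4x4 "image": a bright square on a dark background
    img = np.array([[10,  10,  10, 10],
                    [10, 200, 200, 10],
                    [10, 200, 200, 10],
                    [10,  10,  10, 10]], dtype=np.uint8)
    print(sharpen(img))  # edges of the square become more pronounced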


From a mathematical perspective, most of these algorithms deal with an image represented as a matrix of pixel intensity values. If an image is colored, we have three matrices that correspond to red, green and blue intensities (there may be other color encodings). For three-dimensional images, we deal with 3-D matrices (cubes) and their minimal elements, voxels. Each 3-D model is a mathematical representation of a 3-D surface, built using images of objects from different perspectives. A characteristic of human vision is the ability to construct a 3-D view from 2-D perspectives.
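A minimal sketch of this representation, assuming NumPy: an RGB image is a stack of three intensity matrices, and a weighted sum of the three yields a single grayscale matrix (the weights below are the common Rec. 601 luma coefficients).

    import numpy as np

    # A 2x2 RGB image: red, green, blue and white pixels
    rgb = np.array([[[255,   0,   0], [  0, 255,   0]],
                    [[  0,   0, 255], [255, 255, 255]]], dtype=np.uint8)

    # The three intensity matrices the text describes
    red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Grayscale: a weighted combination of the three channels
    gray = (0.299 * red + 0.587 * green + 0.114 * blue).astype(np.uint8)
    print(gray)  # a single matrix of pixel intensities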
Computer vision is a collection of methods for understanding images. If we think of a computer-vision algorithm as a black box, it takes an image as an input and makes decisions that produce an output, which may aid in facial recognition, microchip-defect detection or the diagnostic reading of X-ray images. Most of these tasks can be considered classification problems, such as "Whose face is in the photo?" or "What is the defect in this microchip?" We can think about the classification tasks of image comprehension in the following way:

1. Is a target object present in the image? For example, is it a photo of a cat?

2. Which target object is present in the image? Is the animal in the photo a cat or a dog? In this case, we need an algorithm to distinguish among different images.

These may seem like rudimentary tasks, but they are challenging to machine learning algorithms. The first task can be the more difficult one; we may have lots of images of cats, but we don't exactly know how to define "not a cat." This so-called noncompact class requires special approaches to describe or distinguish its members. The second task is easier for machine learning training because both sets of key images — dog and cat — represent well-defined classes, and there will be lots of cat and dog photos for training. These are known as compact classes. Drawing a distinction between them is relatively straightforward.

A Brief History of Computer Vision

Work on computer vision began in the 1960s with the earliest attempts at robotic development.4 One of the first practical applications was to teach a machine to read numbers and letters (this was particularly applicable to post office needs). Other early applications involved X-ray image processing and three-dimensional reconstruction of images from computer tomography and magnetic resonance imaging.

In 1975, Japanese electrical engineering professor Kunihiko Fukushima introduced his cognitron, a mathematical model that included layers of connected artificial neurons of two types: excitatory cells and inhibitory cells, just like in a human brain. This model is known as a self-organizing multilayered artificial neural network (ANN). The next significant step was to connect computer vision and machine learning. In 1980, Fukushima introduced a model of human vision known as the neocognitron; it was based on an ANN, with a special architecture designed to recognize objects that had shifted in an image.5 His system had several layers of artificial neurons that helped to recognize handwritten characters.

The neocognitron algorithm groups neurons into two types of clusters: S-cells and C-cells. S-cells are responsible for recognizing image patterns like lines, corners, combinations of corners and full characters. C-cells are responsible for locating these patterns and thus enable their identification no matter how they've shifted within the image. In other words, the neocognitron will not be confused if the character appears not in a corner but in the center of the image. This model still has great influence on modern computer-vision research, with many achievements in image recognition based on its multilayer neural networks.

An important breakthrough occurred in 1988, when Yann LeCun of AT&T Bell Laboratories suggested applying a learning algorithm known as backpropagation of error to the neocognitron's architecture. Backpropagation is a step-by-step, layer-by-layer correction of neuron weights in a neural network, based on the network's current output error. LeCun's backpropagation-trained neural network is known as the convolutional neural network (CNN) and remains one of the most popular tools for advanced automated image recognition. This is what we mean by training a neural network; typically, when we talk about deep learning, we're referring to this and similar approaches. The word "deep" refers to the depth of the layer hierarchy.
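To make the architecture concrete, here is a minimal LeNet-style CNN in PyTorch, a modern framework chosen purely for illustration (it is not what LeCun used in 1988). The convolutional layers play roughly the role of S-cells, the pooling layers that of C-cells, and the last lines perform one backpropagation step.

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        """A small CNN for 28x28 grayscale character images."""
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5),   # detect local patterns (S-cell role)
                nn.ReLU(),
                nn.MaxPool2d(2),                  # tolerate small shifts (C-cell role)
                nn.Conv2d(6, 16, kernel_size=5),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(16 * 4 * 4, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x).flatten(1))

    model = TinyCNN()
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    images = torch.randn(8, 1, 28, 28)            # a toy batch of "characters"
    labels = torch.randint(0, 10, (8,))
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()                               # backpropagation of error
    opt.step()                                    # layer-by-layer weight correction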


CNN is powerful, but it has a number of disadvantages. Primarily, there's a lack of theory. Unlike theory-based machine learning methods, such as AdaBoost or the support vector machine (SVM), with neural networks we usually have no idea how a CNN will react if we change the underlying architecture — that is, the number and size of neuron layers and the connections among neurons. There is no rule saying what architecture works optimally for a specific task. Moreover, there is still no evidence that multilayer (deep) networks are really more efficient than, say, two- or three-layer networks, and this makes discussions about deep learning somewhat speculative.

Another issue with neural networks is a lack of transparency, or interpretability. A neural network can identify two pictures of the same person, but we can't see what properties or features of the image led the network to make that choice. So we deal with a black box that applies huge computations to every pixel in an image, and we have no idea what happens in the neural layers. Of course, that resembles the situation with the human brain's opaque internal processes, which we should have expected with AI. But the lack of interpretability makes it difficult to correct or improve output.

Computer Vision in Practice

In practice, automated image understanding usually consists of two steps. The first is to split an image into segments that represent some objects, such as parts of the human body in a selfie or buildings in a photo of a town. There is no unique recipe for image segmentation. The solution may depend on many factors, including image quality, texture density and colors. Image segmentation uses tools such as gray scaling (gradations from white to black), image filtration, histogram analysis (a probability distribution of numerical values) and edge detection.

The second step is to detect desirable objects among the segments. In this sense, a human face in a photo is, roughly, a round-shaped homogeneous segment of skin that contains two segments for eyes, one for a nose and one for a mouth in a proper position beneath the nose. Features of segments might consist of size, form, texture, color and so on. If there is enough information, it also makes sense to use such features as pixel-intensity gradients; the difference of intensity among the left, middle and right parts of an object can be an input to aid in its classification. Object detection and classification take image recognition to the point where machine learning methods can be applied.
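The sketch below computes a few such segment features, assuming NumPy; the feature set and names are hypothetical, chosen only to mirror the ones mentioned above (size, mean intensity and a left-to-right intensity difference).

    import numpy as np

    def segment_features(image: np.ndarray, mask: np.ndarray) -> dict:
        """Toy features for one segment of a grayscale image (both HxW arrays)."""
        xs = np.nonzero(mask)[1]                       # column indices of segment pixels
        thirds = np.array_split(np.arange(xs.min(), xs.max() + 1), 3)
        cols = [image[:, t][mask[:, t]].mean() for t in thirds]  # left/middle/right
        return {
            "size": int(mask.sum()),                   # area in pixels
            "mean_intensity": float(image[mask].mean()),
            "lr_gradient": float(cols[2] - cols[0]),   # right minus left intensity
        }

    img = np.tile(np.linspace(50, 200, 12), (8, 1))    # an intensity ramp
    seg = np.zeros_like(img, dtype=bool)
    seg[2:6, 3:9] = True                               # a rectangular segment
    print(segment_features(img, seg))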

After the images have been segmented, the choice of the learning method depends on how much data is available. Generally, based on the amount of data you have, there are different strategies you can pursue.

1. Lack of training data. Sometimes examples of images for training are missing or in short supply. This may happen if we are trying to detect unknown objects or anomalies — this is common in astronomy, for example. For this purpose, we can use machine learning methods without a training set. These are usually referred to as unsupervised machine learning methods.

2. Cluster analysis. Generally, this technique segregates images into homogeneous clusters by defining some distance measure between objects, and then describes each cluster with some rules, relying on the common properties of images within the same cluster. To be efficient, unsupervised learning methods require some a priori information — features we can rely on to decide whether images are close enough to one another to appear in the same cluster. (A small clustering sketch follows this list.)

3. Class definition. In this approach, we derive a rule that says whether a certain object belongs to a specific class. Again we have to rely on some a priori knowledge. There are plenty of formal methods for deriving rules. First is simple logic, with a set of rules that all sound like "if-then," otherwise known as Boolean logic. Another, more flexible approach is fuzzy logic, introduced by the University of California, Berkeley's Lotfi Zadeh in the 1960s. Unlike Boolean logic, in which each statement is either true (1) or false (0), fuzzy logic works with the gray scale between 0 and 1. Fuzzy logic allows us to use statements we are not completely sure of, such as something that is, say, 63 percent true.
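Before turning to the third, probabilistic approach, here is the clustering sketch promised in item 2: a crude color-based grouping of pixels into k clusters, assuming scikit-learn and NumPy. Choosing k by hand stands in for the a priori information mentioned above.

    import numpy as np
    from sklearn.cluster import KMeans

    def segment_by_color(image: np.ndarray, k: int = 3) -> np.ndarray:
        """Assign each pixel of an HxWx3 RGB image to one of k color clusters."""
        h, w, _ = image.shape
        pixels = image.reshape(-1, 3).astype(float)        # one row per pixel
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(pixels)
        return labels.reshape(h, w)                        # a segment map

    # Toy image: a dark "background" with a bright "foreground" patch
    img = np.full((20, 20, 3), 30, dtype=np.uint8)
    img[5:15, 5:15] = (200, 180, 40)
    print(np.unique(segment_by_color(img, k=2)))           # two segments: [0 1]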


A third approach to deriving class rules is probabilistic. Here the probability of an object belonging to some class may be derived from a priori (or nonstatistical) information. The derivation is based on Bayesian formulas that make connections between a priori and a posteriori probabilities.
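A minimal sketch of that derivation, with made-up numbers for a two-class problem: Bayes' formula combines an a priori class probability with the likelihood of an observed feature to yield the a posteriori probability of each class.

    # Hypothetical priors for a two-class problem: "cat" vs. "dog"
    prior = {"cat": 0.5, "dog": 0.5}
    # Likelihood of observing the feature "pointed ears" given each class
    likelihood = {"cat": 0.9, "dog": 0.2}

    # Bayes: posterior(c) = prior(c) * likelihood(c) / evidence
    evidence = sum(prior[c] * likelihood[c] for c in prior)
    posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
    print(posterior)  # {'cat': 0.818..., 'dog': 0.181...}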

Average Amount of Data

If you have an average amount of data (approximately 100 to 2,000 images per class in the training set), it means you have enough for learning and for time-efficient processing. Here the researcher's main responsibility is to establish limits on overfitting — that is, seeing patterns in data that are more apparent than real; methods like SVM and AdaBoost will help you do this. SVM is one of the most successful machine learning methods because of its strong theoretical base and very efficient overfitting controls. AdaBoost is simpler, with lower computational complexity. It is more useful when we have an exact classification rule for images. Of course, success will depend on good feature extraction: a process for reducing data and eliminating redundancies.

Big Data

"Big data" is a very popular phrase these days and reflects a whole new direction in machine learning development. There is a science to extracting and storing very large amounts of data. For computer vision, it may require huge datasets of images — thousands, even millions, from databases of human faces, website images and YouTube videos. Time and resource consumption becomes more important, and methods that allow parallel computation may be necessary; these include CNN, AdaBoost and Random Forest, a learning program that constructs and manipulates decision trees. The features chosen are usually not sophisticated: low-level indicators like pixel intensity or the histogram of oriented gradients (HOG), another image-recognition technique, which "describes" the image with histograms. They do not require complex algorithms to extract, and they often don't even require image segmentation.

The choice of computation algorithms is a function of data scale (Figure 1). If you have quite limited amounts of sample data, image segmentation and complex feature extraction are usually pragmatic approaches. However, if you have large amounts of data, simple feature extraction can best be accomplished using CNN, AdaBoost or Random Forest techniques.

Figure 1. Method choice as a function of data size. Very limited data: image segmentation and complex feature extraction, with fuzzy logic and unsupervised learning guided by a priori information. Average data size: image segmentation and feature extraction, with SVM and AdaBoost. Big data: simple feature extraction, with CNN, AdaBoost and Random Forest relying on parallel computation.

The Future of Computer Vision

As we suggested at the start of this article, machines remain inferior to humans in many aspects of vision, particularly the understanding or interpretation of images, though they do have some very significant advantages. We can anticipate the development of computer vision unfolding in three ways. One could occur by increasing the computational power for existing algorithm architectures, with a step-by-step improvement in quality. A second could involve deeper research into nontransparent algorithms to make them more predictable and controllable. This would require theoretical work we currently lack. A third is the introduction of completely new concepts of image understanding — although it's hard to say where they will come from.

And there's always the scientific wild card. The field of brain studies could give us fresh insights into how the brain functions, leading to new machine learning techniques. We suspect that the human brain, with its remarkable ability to absorb and understand visual information, still has a lot of secrets to reveal.


Endnotes

1. Paul Viola and Michael J. Jones. "Robust Real-Time Face Detection." International Journal of Computer Vision 57, no. 2 (2004): 137-154.

2. Rafael C. Gonzalez, Richard E. Woods and Steven L. Eddins. Digital Image Processing Using MATLAB (Upper Saddle River, NJ: Prentice Hall, 2004).

3. Linda G. Shapiro and George C. Stockman. Computer Vision (Upper Saddle River, NJ: Prentice Hall, 2001).

4. Berthold K.P. Horn. Robot Vision (Cambridge, MA: MIT Press, 1986).

5. Kunihiko Fukushima. "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position." Biological Cybernetics 36, no. 4 (1980): 193-202.

6. Jürgen Beyerer, Fernando Puente León and Christian Frese. Machine Vision: Automated Visual Inspection: Theory, Practice and Applications (Berlin: Springer, 2016).

Thought Leadership articles are prepared by and are the property of WorldQuant, LLC, and are being made available for informational and educational purposes only. This article is not intended to relate to any specific investment strategy or product, nor does this article constitute investment advice or convey an offer to sell, or the solicitation of an offer to buy, any securities or other financial products. In addition, the information contained in any article is not intended to provide, and should not be relied upon for, investment, accounting, legal or tax advice. WorldQuant makes no warranties or representations, express or implied, regarding the accuracy or adequacy of any information, and you accept all risks in relying on such information. The views expressed herein are solely those of WorldQuant as of the date of this article and are subject to change without notice. No assurances can be given that any aims, assumptions, expectations and/or goals described in this article will be realized or that the activities described in the article did or will continue at all or in the same manner as they were conducted during the period covered by this article. WorldQuant does not undertake to advise you of any changes in the views expressed herein. WorldQuant and its affiliates are involved in a wide range of securities trading and investment activities, and may have a significant financial interest in one or more securities or financial products discussed in the articles.

Copyright © 2018 WorldQuant, LLC