
Improving Perception from Electronic Visual Prostheses

Justin Robert Boyle BEng (Mech) Hons1

Image and Video Research Laboratory School of Systems Queensland University of Technology

Submitted as a requirement for the degree of Doctor of Philosophy

February 2005

Keywords: image processing, visual prostheses, bionic eye, artificial human vision, visual perception, subjective testing, visual information

Abstract

This thesis explores methods for enhancing vision-like sensations which might be similar to those experienced by blind users of electronic visual prostheses. Visual prostheses, otherwise referred to as artificial vision systems or bionic eyes, may operate at ultra low image quality and information levels, as opposed to more common electronic displays such as televisions, for which our expectations of image quality are much higher. The scope of the research is limited to enhancement by image processing: that is, by manipulating the content of images presented to the user. The work was undertaken to improve the effectiveness of visual prostheses in representing the visible world.

Presently development is limited to animal models in Australia and prototype human trials overseas. Consequently this thesis deals with simulated vision experiments using normally sighted viewers. The experiments involve an original application of existing image processing techniques to the field of low quality vision anticipated from visual prostheses.

Resulting from this work are, firstly, recommendations for effective image processing methods for enhancing viewer perception when using visual prosthesis prototypes. Although image quality is low, recognition of some objects can still be achieved, and it is useful for a viewer to be presented with several variations of the image representing different processing methods. Scene understanding can be improved by incorporating Region-of-Interest techniques that identify salient areas within images and allow a user to zoom into that area of the image. There is also some benefit in tailoring the image processing depending on the type of scene.

Secondly the research involved the construction of a metric for basic information required for the interpretation of a visual scene at low image quality. The amount of information content within an image was quantified using inherent attributes of the image and shown to be positively correlated with the ability of the image to be recognised at low quality.

Table of Contents

Abstract
List of Figures
List of Tables
Statement of Original Authorship
Acknowledgements
Publications
Chapter 1 Introduction
  1.1 Overview
  1.2 Aim
  1.3 Scope
  1.4 Thesis Structure
  1.5 Contributions
Chapter 2 Image Quality and Visual Perception
  2.1 Introduction
  2.2 Visual Perception Physiology
  2.3 A Visual Hierarchy Model
    2.3.1 Early Vision Effects
    2.3.2 Cognitive Effects
  2.4 Region-of-Interest
  2.5 Visual Information
  2.6 Chapter Summary
Chapter 3 Visual Prosthesis Application
  3.1 Overview
  3.2 General Introduction to the Application
  3.3 Current Visual Prosthesis Research
    3.3.1 Retinal Systems
    3.3.2 Optic Nerve Systems
    3.3.3 Visual Cortex Systems
  3.4 Image Processing specifically related to Bionic Eye Projects
    3.4.1 Vision Chip Developments
    3.4.2 CCD-based Systems
    3.4.3 Receptive Field Modeling
    3.4.4 Multiple Resolution Work
  3.5 Applicable to Visual Prostheses
    3.5.1 Digital Imaging and Human Vision
    3.5.2 Image Characteristics and Visual Understanding
  3.6 Thesis Research Questions and Approach
    3.6.1 Image Processing Requirements
    3.6.2 Testing Method
  3.7 Chapter Summary
Chapter 4 Recognition Performance
  4.1 Overview
  4.2 Subjective Tests to Determine Useful Processing Methods
    4.2.1 Methodology
    4.2.2 Images Chosen
    4.2.3 Results
    4.2.4 Test Conclusions
  4.3 Subjective Tests to Determine Influence of Image Type
    4.3.1 Methodology
    4.3.2 Images Chosen
    4.3.3 Results
    4.3.4 Test Conclusions
  4.4 Chapter Conclusions
Chapter 5 Quantifying Information Content
  5.1 Introduction
  5.2 Perceived Information Content in Images
    5.2.1 Images Used
    5.2.2 Multidimensional Visual Information Model
    5.2.3 Test Method
    5.2.4 Test Participants and Instructions
    5.2.5 Test Results
    5.2.6 Strong Visual Information Rankings
    5.2.7 Test Conclusions
  5.3 Information Content Model Fitting
    5.3.1 Possible Image Attributes for a Visual Information Metric
    5.3.2 Metric Development for a Specific Image Quality Class
    5.3.3 Information Content Metric for all Image Quality Classes
  5.4 Correlations Between Recognition Rate and Perceived Information Content
  5.5 Chapter Summary
Chapter 6 Scene Specific Imaging
  6.1 Overview
  6.2 Characteristics of Simple Scenes
    6.2.1 Office
    6.2.2 Home
    6.2.3 Street
    6.2.4 Outdoors
    6.2.5 Head and Shoulders
    6.2.6 Café/Restaurant
    6.2.7 Public Toilets
  6.3 Image Processing targeted to Scene Type
  6.4 Subjective Tests for Scene Weighted Processing
  6.5 Chapter Summary
Chapter 7 A comparison of ROI methods for low quality images
  7.1 Overview
  7.2 ROI Processing applied to Entire Image
    7.2.1 Image Preparation
    7.2.2 Processing Methods Compared
    7.2.3 Images Used
    7.2.4 Experiment
    7.2.5 Results
  7.3 Digital Zoom
    7.3.1 Automatic Zoom Methods
    7.3.2 Results of Automatic Zoom Experiment
  7.4 Chapter Summary
Chapter 8 Discussion, Conclusion and Future Work
  8.1 Discussion and Conclusion
  8.2 Future Work
    8.2.1 Motion
    8.2.2 Colour
    8.2.3 Device Interfacing
    8.2.4 Supplementary/Symbolic Information
    8.2.5 Range Indication
    8.2.6 Simulating Techniques
    8.2.7 Other Testing Techniques
  8.3 Final Word
References
Appendix A Section 4.2 Experiment
  A.1 Example Test Stimulus
  A.2 Booklet Design
  A.3 Borderline Recognition Assessment for Section 4.2 Experiment
Appendix B Section 4.3 Experiment
  B.1 Example Test Stimulus
  B.2 Borderline Recognition Assessment for Section 4.3 Experiment
Appendix C Chapter 5 Experiments
  C.1 Example Test Stimulus – 7 Images Presented all at same time
  C.2 Example Test Stimulus – 3 Images Presented all at same time
  C.3 Example Test Stimulus – Paired Comparison Experiments
  C.4 Booklet Design
Appendix D Chapter 6 Experiment
  D.1 Example Test Stimulus
  D.2 Booklet Design
Appendix E Chapter 7 Experiments
  E.1 Training Image Database
  E.2 Example Test Stimuli for Section 7.2 Experiment
  E.3 Example Test Stimuli for Section 7.3 Experiment
  E.4 Booklet Order for Chapter 7 Experiments

List of Figures

Figure 1.1: Mean Square Error figures between reference image and low quality versions
Figure 3.1: Basis of Visual Prostheses
Figure 3.2: Pixelised vision; top – greyscale, bottom – binary images
Figure 3.3: Circular pixelised vision
Figure 3.4: Alternate stimulation strategies
Figure 3.5: Simulating the effect of modulating phosphene brightness
Figure 3.6: Importance Mapping concept
Figure 3.7: Safety Post enhancement with advanced image processing techniques
Figure 3.8: Enhancing the information content of a low quality image of stairs
Figure 4.1: Image set used in the psychophysical testing
Figure 4.2: Image Processing techniques used in the psychophysical testing
Figure 4.3: Recognition rate for objects in the image set
Figure 4.4: Effect of spatial resolution and grey-scale on object recognition
Figure 4.5: Images with 3 grey levels (white, grey, black) were not significantly more recognisable than black & white images
Figure 4.6: Comparing resolution and grey scale
Figure 4.7: Significantly higher recognition is achieved with increased spatial resolution (Right) over increased greyscale resolution (Left)
Figure 4.8: Object recognition rate for various processing methods (n=110)
Figure 4.9: Edge images were not well recognised
Figure 4.10: Subjective preferences between image and its inverse – some subjects preferred white on black, others black on white
Figure 4.11: Test Objective – Obtain a Recognition-Quality curve
Figure 4.12: The nine image quality classes used in the tests
Figure 4.13: Test image set
Figure 4.14: Recognition-Quality Envelope of recognition for all images in test set
Figure 4.15: Variation in recognition among image types
Figure 4.17: Recognition rates for each object type
Figure 5.1: Two images with different amounts of visual information content
Figure 5.2: The nine image quality classes used in the tests
Figure 5.3: Multidimensional Visual Information Model
Figure 5.4: Images containing high information content for high quality images
Figure 5.5: Images containing high information content for low quality images
Figure 5.6: Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
Figure 5.7: Calculating Fractal Dimension for Binary Images
Figure 5.8: Calculating Fractal Dimension for Greyscale Images
Figure 5.9: Determining image similarity and symmetry – pixel matching
Figure 5.10: Determining image similarity and symmetry – pixel difference and average value
Figure 5.11: Correlation between 15 image attributes and perceived information content
Figure 5.12: Metric performance for Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
Figure 5.13: Example relationship between recognition and information content (25x25 Binary Paired Comparison data)
Figure 6.1: Visual stimuli used to gauge perception of low quality images
Figure 7.1: Image preparation
Figure 7.2: Shaded areas do not necessarily correlate with scene objects in images that have had grey levels equalised (right most image)
Figure 7.3: Processing methods used in tests (refer text for details)
Figure 7.4: Example Distribution – Size Map distribution for Beach training images
Figure 7.5: Images used in comparison tests
Figure 7.6: When presenting the entire image, results indicate a clear preference for no Importance Processing (n=96)
Figure 7.7: Digital zoom concept – the most salient area is identified in an image and resized to the maximum display resolution
Figure 7.8: Trim method to select zoom window
Figure 7.9: Scope Box method to select zoom window
Figure 7.10: Saliency Map developed by University of Southern California
Figure 7.11: Zoom window selected from central 25% of image
Figure 7.12: Zoom window selected from central-bottom 25% of image
Figure 7.13: Image Preparation for Digital Zoom Tests
Figure 7.14: Example stimulus showing detail of zoom window border
Figure 7.15: Preferences for methods to automatically zoom into an image (n=96)
Figure 8.1: Halftone representation

List of Tables

Table 3.1: Thesis experiments
Table 4.1: Analysis of Variance for various processing methods
Table 4.2: Correct image identification (n=25)
Table 5.1: Perceived information content for comparing 7 different object types
Table 5.2: Pattern analysis for information content rankings
Table 5.3: Dominant visual information viewer preferences
Table 5.4: Correlation coefficients for variables considered for metric for 256x256 greyscale images
Table 5.5: Candidate models for metric for 256x256 greyscale images
Table 5.6: Model Predictions of 256x256 Greyscale Image Set using model: f(Edges, Entropy)
Table 5.7: Candidate models for a metric for 10x10 binary images
Table 5.8: Summary of metric performance
Table 5.9: Predictive performance of metric proposed for all image qualities
Table 5.10: The number of correct metric predictions of images with the highest information content
Table 5.11: Correlation coefficients between recognition rate and perceived information content
Table 6.1: Image Processing descriptors of different scene types
Table 6.2: Attentional feature weights for each scene type
Table 6.3: Preferred ranking for image representation

Statement of Original Authorship

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed:

Date:

Acknowledgements

I would like to thank Anthony Maeder, first for his enthusiasm and willingness to conduct the research at QUT and also for his supervision and guidance, textbooks, time, motivational prods, positive feedback, attention to detail and financial support. I could not wish for a better supervisor. The comments and advice from my associate supervisor Wageeh Boles are also appreciated.

Massive thanks to QUT for a Postgraduate Research Award and the Faculty of Built Environment and Engineering for a top-up scholarship. Financial assistance from the SAIVT program director Sridha Sridharan and the QUT Grants-in-Aid office is appreciated regarding conference attendance. Thanks to the faculty administration staff especially Scott Allberry, for assistance with preparing questionnaires.

I am grateful to Wilfried Osberger and Laurent Itti and Dirk Walther of iLab for permission to implement variations of their importance map and saliency codes in my research. A big thank you to all the volunteers who participated in the subjective testing, including students at Brisbane State High School and their coordinating teachers, as well as family members and colleagues in the Image and Speech lab who were roped in. The efforts of Jason Pelecanos towards soccer matches, BBQs and other inclusive activities to invigorate the research lab are appreciated.

I thank Melissa, Jeremy and Ruth: it has been a big chunk of our lives with some major ups and downs. I really appreciate the child care & washing (!), mental support and lifts into uni dropping Dad off over those speed-bumps. Thanks to my parents for their silent and not-so-silent encouragement with research and publications.

This work is dedicated to Terry – may you some day see your wife, children and rainforest paradise.

Publications

The research has resulted in the following fully refereed publications (or abstract refereed only where indicated by an asterisk).

Boyle J, Maeder A, Boles W, Digital Imaging Challenges for Artificial Human Vision, South African Computer Journal (26), pp.222-227, 2000

Boyle J, Maeder A, Boles W, Image Processing and Artificial Human Vision Systems, WoSPA2000 - 3rd Australasian Workshop on Signal Processing Applications, Brisbane, 2000

* Boyle J, Maeder A, Boles W, “Challenges in Digital Imaging for Artificial Human Vision”, in Human Vision and Electronic Imaging IV, Rogowitz T, Pappas T, Editors, Proceedings of SPIE Vol 4299, pp.533-543, 2001

Boyle J, Maeder A, Boles W, Static Image Simulation of Electronic Visual Prostheses, ANZIIS 2001 – Proceedings of the 7th Australian and New Zealand Intelligent Information Systems, Perth, pp.85-88, 2001, {1st prize student paper competition}

Boyle J, Maeder A, Boles W, Image Enhancement for Electronic Visual Prostheses, Australian Physical & Engineering in Medicine Journal 25(2), pp.81-86, 2002

* Boyle J, Maeder A, Boles W, Visual Perception with Electronic Visual Prostheses, Physical Sciences and Engineering in Medicine Queensland Branch Local Symposium, Brisbane, June 2002 {1st prize student paper competition}

Boyle J, Maeder A, Boles W, Visual Perception of Low Quality Images, Proceedings of the 9th International Conference on Neural Information Processing, Singapore, 2002

Boyle J, Maeder A, Boles W, Inherent Visual Information for Low Quality Image Presentation, WDIC2003 - Proceedings of the 2003 APRS Workshop on Digital Image Computing, Theme: Medical Applications of , Brisbane, pp.51-56, 2003

Boyle J, Maeder A, Boles W, Can Environmental Knowledge Improve Perception with Electronic Visual Prostheses?, WC2003 – Proceedings of the World Congress on Medical Physics and Biomedical Engineering, Sydney, 2003

Boyle J, Maeder A, Boles W, Scene Specific Imaging for Bionic Vision Implants, ISPA2003 – Proceedings of the 3rd International Symposium in Image and Signal Processing and Analysis, Rome, pp.423-427, 2003

Chapter 1 Introduction

1.1 Overview

Blindness affects millions of people worldwide and over 100,000 Australians. This research supports quality-of-life improvements for them by exploring appropriate image processing techniques for electronic visual prosthesis systems: so-called “bionic eyes”. These systems consist of a vision chip or camera that records the visual world and transmits this information via electric pulses to implanted electrodes in contact with the retina, optic nerve or visual cortex. These three sites offer opportunities for presenting a low resolution synthetic image to the human visual system.

Existing mobility aids for the visually impaired include canes, guide dogs and sonar glasses. However it is anticipated that the richness of sensory substitution would be much greater with a visual prosthesis. Although unlikely to recreate the full experience of vision, visual prostheses may provide enough visual cues for blind people to perform every-day tasks such as navigation, recognition, and reading.

The terminology “low quality” used in this thesis refers to images which can only contain relatively little visual information content. Knowledge of human perception of low quality (eg. low resolution) images, such as those expected from visual prostheses, is very limited. While researchers have worked extensively in characterising high quality image perception, most of this work is not relevant or useful for low quality images. Yet it is in this low resolution regime that the most immediate gains in artificial vision can be made. Ways to identify the sparse information that is important for viewer understanding of scenes when presented in low quality images are thus needed.


1.2 Aim

The overall aim of this research is to develop simple image processing techniques that improve perception for users of electronic visual prostheses. This can be broken down into several component elements of investigation:
• Determine useful image processing methods for artificial human vision systems
• Allow wider implementation of recently developed Region-of-Interest image processing routines beyond previous applications; these routines identify important and salient areas within an image
• Facilitate further understanding of the human visual system, specifically perception performance from low quality visual information
• Provide a basis from which more complex and beneficial (eg. real time) image processing units can be developed such that a prosthesis may provide maximum benefits to the blind

1.3 Scope

The research described in this thesis is bounded by the following limitations and assumptions:

1. Psychophysical sensations of what might be seen with a prosthesis are simulated by presenting visual stimuli to normally sighted viewers. Prosthesis development in Australia is currently limited to animal models and data from implanted patients is not yet available. It is anticipated that some of the experiments comparing image processing techniques described in this thesis could be repeated with implanted patients when they become available.

2. Perception experiments undertaken in the project were based on static/still images. Improved perception is anticipated if the techniques are applied to image sequences/video as a user would be offered a richer representation of a scene, in addition to moving about to see how scene elements (background/foreground) interact.


3. The images presented to subjects in simulation experiments were ordered pixel arrays in a square pattern (equal image height and width). The reported evoked visual field of implanted patients is not a regularly ordered array and varies from patient to patient. Due to this wide variability it was decided to present a symmetric image representation to gain some understanding of low quality image perception (a sketch of how such stimuli can be generated follows this list). It is anticipated that implant users would undergo tuning and training through post operative exercises, similar to auditory implant programs, to use viable electrodes efficiently.

4. There exist other techniques related to implant electrode stimulation which may produce different psychophysical sensations. One such technique could be using different electrode current flow and return paths to create wide variations in perceived visual sensations. This thesis does not consider such techniques; it is instead based on manipulating conventional pixel-based images digitally (digital image processing) to improve visual perception.
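As flagged in item 3 above, a pixelised stimulus of the kind used in the simulation experiments can be produced by downsampling an image to a coarse square grid and quantising it to a few grey levels. The sketch below is illustrative only, assuming Python with NumPy and Pillow; the 25x25 grid and binary quantisation mirror image qualities discussed later in the thesis, and this is not the stimulus-preparation code actually used in the experiments.

```python
import numpy as np
from PIL import Image

def pixelise(path, grid=25, levels=2):
    """Reduce an image to a grid x grid array with a small number of grey
    levels, roughly approximating a square phosphene layout."""
    img = Image.open(path).convert("L")                  # greyscale
    small = img.resize((grid, grid), Image.BILINEAR)     # spatial downsampling
    arr = np.asarray(small, dtype=float) / 255.0
    quantised = np.round(arr * (levels - 1)) / (levels - 1)   # few grey levels
    blocks = Image.fromarray((quantised * 255).astype(np.uint8))
    return blocks.resize(img.size, Image.NEAREST)        # blocky upscaling for display

# Example (hypothetical file name):
# pixelise("scene.png", grid=25, levels=2).save("scene_25x25_binary.png")
```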

1.4 Thesis Structure

The thesis is structured as two main sections: Background (Chapters 1 – 3), and New Work (Chapters 4 – 7), followed by a Conclusion (Chapter 8).

A review of the literature is first presented to establish the fundamental theory relevant to image understanding and visual prostheses. Chapter 2 describes research in the area of image quality and visual perception. Chapter 3 describes current artificial vision research including image processing activities specifically related to bionic eye projects. At the end of Chapter 3, several requirements for prosthetic image processing systems are identified. These requirements drive the research questions around which the remaining thesis chapters are based.

Chapter 4 explores low quality recognition performance and establishes the applicability of a computationally cheap Region-of-Interest (ROI) processing technique to low quality images. A model for characterising low quality images on the basis of how much visual information they contain is developed in Chapter 5.


An approach to tailor image processing depending on the type of scene is explored in Chapter 6. Finally Chapter 7 compares several ROI methods against each other to identify which may be most helpful when moving through a scene. The applicability of all image processing techniques explored in the thesis is tested using normally sighted human perception (subjective testing) experiments.

Chapter 8 provides a discussion and conclusion of the work and provides some commentary on how the research can be extended.

1.5 Contributions

Resulting from this work are two significant original contributions to knowledge, which are summarised below. These contributions are explored through a number of related research questions, which are set out in detail in Section 3.6 of Chapter 3.

First, investigations based on early vision aspects of digital images are used to provide recommendations for effective image processing methods for enhancing viewer perception when using visual prosthesis prototypes.

The hypothesis that low level processing can improve scene understanding was verified with several subjective experiments. Although bounded by ultra low image quality, prostheses can still support some recognition performance. By including a range of image processing routines or modes of operation, users can gain as many visual cues from a scene as possible.

When considering prototype resolution, enhanced perception is achieved with increased spatial resolution of implant electrodes over increased greyscale resolution. Thus a fundamental aspect of implant design is maximising the spatial resolution.

There are also considerations to be made with respect to context. The most easily recognised scene type for users of vision prostheses is a face. The spatial pattern of two eyes and an underlying mouth is easily recognised even at the lowest image qualities tested in the thesis experiments. Thus enhanced perception is likely when viewing simple face scenes (eg. TV newsreader) compared to other scene types.

Incorporating a digital zoom function in prostheses designs could lead to enhanced perception, and Region-of-Interest processing techniques (which automatically identify salient areas within images) should be used to obtain the zoomed image. This is an original application of these techniques beyond traditional applications such as image compression.

The second contribution is the construction of a metric for basic information content required for interpretation of visual scenes at low image quality.

The ability of a low quality image to be recognised was found to be positively correlated with the amount of perceived information content in that image. Thus an image with high information content can be expected to be more easily recognised at low quality than an image containing low information content.

Experiments reported here have identified that a simple face with no surrounding clutter is most visually informative amongst those image types tested. Also, higher visual information content results from:

(a) more objects in the scene;

(b) closer objects;

(c) strong edges, arising from high intensity contrast.

Finally visual information can be quantified using the number of edges in the image. Thus, image perception can be enhanced by maximising the number of edges within the image.
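As an illustration of this idea, a crude proxy for an edge-based information measure can be computed by counting strong local intensity differences. This sketch is illustrative only and is not the metric developed in Chapter 5, which draws on several image attributes; the 0.2 threshold is an arbitrary assumption.

```python
import numpy as np

def edge_count(grey, threshold=0.2):
    """Count strong edge pixels as a rough proxy for visual information
    content. `grey` is a 2-D uint8 greyscale array."""
    g = grey.astype(float) / 255.0
    dx = np.abs(np.diff(g, axis=1))   # horizontal neighbour differences
    dy = np.abs(np.diff(g, axis=0))   # vertical neighbour differences
    return int((dx > threshold).sum() + (dy > threshold).sum())
```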

The above contributions, while somewhat general in nature, are seen as forming the essential elements of any visual prosthesis system capable of providing only low quality images to the observer.

Chapter 2 Image Quality and Visual Perception

2.1 Introduction

This chapter will provide some background on the overlap of visual perception and imaging, as a framework for improving perception through electronic visual prostheses. As a visual response is the goal of visual prostheses it is of relevance to explore research in visual perception within a framework of image quality. Image quality is a term used to denote the amount of information retained in an image that has been degraded in some way from its ideal form.

The study of the interaction between human visual perception and electronic imaging is one of the key growth areas in imaging [81]. Concerning the quality of images, research to date has focused on the high quality still images and video associated with modern multi-media environments. Perceptual models for image quality using characteristics of the human visual system have been developed [1,105], and with them, perceptually based image compression techniques (eg. [69]). However, this work is based on high quality images and there is no similar quality characterisation for the emerging field of visual prosthetics.

The rest of this chapter is presented in four sections. Section 2.2 provides an overview of perception physiology: what happens in the eye during visual perception. In Section 2.3, a hierarchical model for perception and imaging is presented. Research is described first in the areas of low level, or early vision. The discussion includes previous work in the field of image quality and how quality has been traditionally assessed. Following from this, research in higher levels or cognitive vision is described. Section 2.4 presents an area of work that incorporates both low and high levels of vision, known as Region-of-Interest processing. Finally Section 2.5 provides an overview of some research approaches concerning visual understanding and complexity.

2.2 Visual Perception Physiology

An important component in understanding how to design visual prostheses is the physiology of visual perception [59]. The retina is the innermost layer of the back of the eye, and is organised into layers that contain photoreceptors, interneurons and blood vessels. The embryonic development of the retina results in an inside-out design, so that the photoreceptors are nearest the back of the eye and light must pass through the retinal interneurons and blood vessels to reach the photoreceptors. The two types of photoreceptors, cones and rods, contain visual photopigment. The first step in photoreception is photopigment bleaching, in which light activates visual pigment molecules. Bleaching initiates a sequence of events leading to a change in the cell membrane potential. Ganglion cells are the retinal cells whose axons form the optic nerve, so their output is the final product of the information processing that occurs in the retina. Ganglion cell axons enter the optic nerve in an orderly fashion, so that adjacent axons in the nerve correspond to adjacent receptive fields on the retinal surface. The pathway ascends to the lateral geniculate nucleus (LGN) of the thalamus and then projects to the primary visual cortex in the occipital lobes of the brain.

Studies of anatomy, physiology, and human perception (eg. stroke victims) conclude that the human visual system is subdivided into several separate parts whose functions are quite distinct [48]. There are two subdivisions in the visual pathway (LGN + visual cortex): parvocellular and magnocellular. Both have inputs from the same rods and cones, but have differences in the way the photoreceptor inputs are combined. Their receptive fields (regions of the retina over which impulse activity can be influenced) are circularly symmetric and show centre-surround arrangement (also in retinal bipolar cells). These cells are configured to convert information from photoreceptors into information about spatial discontinuities in light patterns. Some cells are excited (impulse rate speeded up) by illumination of a small retinal region and inhibited (impulse rate slowed down) by illumination of large surrounding region, while others are the reverse of this.


The magno and parvo divisions differ in four ways, implying that they contribute to different aspects of vision:
1. Colour: 90% of parvo layer cells are wavelength sensitive (they combine cone inputs in effect to subtract them); the magno system is colour blind, or wavelength insensitive (it sums the inputs of the 3 cone types so the response to illumination is on or off at all wavelengths). For example, two different colours such as red and green at the same relative brightness are indistinguishable to the magno system.
2. Acuity: magno cells have larger receptive field centres than parvo (by a factor of 2 or 3), ie. lower spatial resolution. For both magno and parvo cells, the receptive field size increases with distance from the fovea. This is consistent with differences in acuity between foveal and peripheral vision.
3. Speed: magno cells respond faster and more transiently than parvo (which suggests a role in detecting movement). Many cells at higher levels in this pathway are selective for orientation and for direction of movement.
4. Contrast: magno cells are more sensitive to low contrast stimuli than parvo.

At higher stages the continuations of these pathways are selective for different aspects of vision (form, colour, movement, stereopsis). The extended M/P pathways are described further in the visual cortex (blob and interblob systems) and their properties include:
• orientation selective
• selective for direction of movement
• end-stopped (respond to short but not long line/edge stimuli)
• colour information and colour contrast information in separate systems (eg. colour contrast used to determine borders, but not information about the colours forming the borders)
• brightness selective
• respond poorly to stimulation of either eye alone but vigorously when both eyes are stimulated together (stereoscopic depth).

Hubel and Livingstone [48] argue that end stopping (like the centre-surround system) is an efficient way of encoding information about shape. They also propose that different kinds of visual tasks differ in their colour, temporal, acuity and contrast characteristics. Other findings by the authors are as follows:
• People follow brightness alterations much faster than pure colour alterations (the magno system is colour blind and faster than the parvo).
• Movement perception reflects magno characteristics: colour blindness, quickness, high contrast sensitivity and low acuity (which has been demonstrated with perceptual experiments).
• Motion perception, stereoscopic depth perception, the ability to use relative motion as a depth cue, shading as a depth cue, and perspective as a depth cue are all lost at equiluminance.
• The retinal image is two dimensional. In order to capture three dimensions, the human visual system uses many kinds of cues besides stereopsis and relative motion: perspective, gradients of texture, shading, occlusion and relative position within the image.
• The magno system is more primitive than the parvo. Parvo is only well developed in primates, who can scrutinise in much more detail the shape, colour and surface properties of objects, ie. visual identification and association.
• Results are presented that suggest that only luminance contrast, and not colour differences, is used to link the parts comprising a scene together.

It is difficult to predict whether the aspects of vision described above (eg. brightness selectivity, movement perception) will be similar for artificially induced vision. As the parvocellular and magnocellular pathways extend through to the visual cortex, any similar induced perceptions would probably be independent of the stimulation site of the prosthesis.

2.3 A Visual Hierarchy Model

In this section a hierarchical model of imaging and visual perception is used to describe research activities in this area. A good review of the research concerning the interplay between visual perception and electronic imaging is given by Rogowitz et al [81]. Their review draws on observations made during human interaction research (see for example [15,21,69,70,75]). Arising from these observations is a visual hierarchy, ordered from higher to lower levels of vision:
• aesthetic & emotional aspects
• cognitive effects: memory, semantic categorisation and visual representation
• perceptual effects: colour constancy, suprathreshold pattern & texture analysis
• visual phenomena mediated by the threshold sensitivity of low-level spatial, temporal and colour mechanisms.

2.3.1 Early Vision Effects

At the bottom of the hierarchy are low-level, or “early vision” effects. Many early vision models have been proposed to characterise image quality on the basis of these low level effects. It is argued in this thesis that these early vision models cannot be extended to the poor quality images anticipated from visual prostheses (eg. 10x10 or 25x25 resolution, black and white images). The early vision models have arisen to address distortions, or artifacts, that have been caused by electronic imaging processes, such as compression and halftoning. Such distortions include blurring, granular noise, jerkiness and blocking. The body of knowledge on image quality is based on reducing the visibility of these distortions. Many vision-based algorithms have consequently been developed in the fields of still image and video compression, image enhancement, restoration and reconstruction, image halftoning and rendering, image and video quantisation and display.

Traditional quality models are based on measuring the differences between an original “perfect” quality image and an altered image having undergone an imaging process. The low spatial resolution binary images anticipated from visual prostheses are so dramatically different from the perfect reference image that the quality models developed to date in the literature do not apply.

To illustrate that these quality models are not useful, consider the popular quality metric of Mean Square Error (MSE) defined as:


$$\mathrm{MSE} = \frac{1}{MN}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}\left(x_{ij} - \hat{x}_{ij}\right)^{2} \qquad \text{(Equation 1)}$$

where $M$ and $N$ are the number of horizontal and vertical pixels, $x_{ij}$ is the value of the original pixel at position $(i,j)$, and $\hat{x}_{ij}$ is the value of the distorted pixel at position $(i,j)$.

Figure 1.1 shows the MSE from a reference perfect image for several distorted versions of the reference. As can be seen, the MSE metric is best suited where the differences are small, and the MSE for the low quality 10x10 binary image is approaching that of a grey stripe pattern which did not originate from the Reference Image.

[Figure 1.1: Mean Square Error figures between reference image and low quality versions. The distorted versions compared against the reference include a black spot, 25x25 binary, 25x25, 10x10 and a grey stripe pattern; the MSE values range from 338 to 10673.]
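For reference, the MSE of Equation 1 is straightforward to compute; the sketch below (assuming NumPy and two equal-sized greyscale arrays) also makes the point that a 10x10 or 25x25 rendition must first be brought back to the reference dimensions before a pixel-by-pixel difference is meaningful.

```python
import numpy as np

def mse(reference, distorted):
    """Mean Square Error (Equation 1) between two equal-sized greyscale images."""
    ref = np.asarray(reference, dtype=float)
    dst = np.asarray(distorted, dtype=float)
    if ref.shape != dst.shape:
        raise ValueError("images must have the same dimensions")
    return float(np.mean((ref - dst) ** 2))
```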

Janssen [43] has proposed an alternate description of image quality which differs from the traditional “perceived distance to the original”. He states that quality is how well an observer is able to employ an image as a source of information about the outside world, and proposes the following metric:

$$\text{Quality} = \lambda_1 \cdot N + \lambda_2 \cdot C + \lambda_3$$

where $N$ = naturalness, $C$ = brightness contrast, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are scalar constants determined from experiments.

Much of the image quality literature is focussed on reducing the visibility of image artefacts (distortions) and so some of the approaches used for this purpose are included here. Rogowitz et al [81] summarise methods for characterising these image artifacts:
1. physical measurement: measuring key image parameters; comparing images on a pixel-by-pixel basis using a metric such as mean squared error
2. using human observers to judge perceived quality
3. developing metrics based on experiments measuring human visual characteristics to estimate human judgements
These approaches are recognised in other sources [24,53,67] and are described below in further detail.

2.3.1.1 Physical Measurement Metrics

Commonly used physical measurement metrics include Signal-to-Noise Ratio (SNR), Peak Signal-to-Noise Ratio (PSNR), Mean Absolute Error (MAE), Mean Squared Error (MSE), local MSE, and distortion contrast. These metrics are easy to use in that information on viewing conditions is not needed and they are computationally simple. However, the methods are considered poor (even for high quality images) in that they do not work well on images with different content (eg. edges/textured regions) [24] and they treat all impairments as equally important [67].

2.3.1.2 Human Observers as Judges of Perceived Quality

Human observers could be used to directly obtain feedback on image quality, and indeed the ‘gold standard’ in determining image quality is the human observer. By nature, this feedback incorporates human visual system considerations, but it is expensive to obtain and the results may not generalise [81]. However this method can be of use in evaluating low quality images and is therefore of relevance to this thesis. Traditionally this feedback, obtained either from trained experts or via psychological experiments, has been used to compare a distorted image produced by an imaging process against its original. If the artifacts are visible (ie. suprathreshold), quality can be assessed via:
• a rating scale 1-5, eg. bad, poor, fair, good, excellent; however this will only characterise large quality differences and may produce inconsistent results when evaluating images with different types of artifacts
• paired comparison experiments, where a 7 point scale ranges from –3 to 3 for much worse, worse, slightly worse, same, slightly better, better, much better
• two stimulus forced choice scales which ask which image has the higher quality; comparisons can be made using images with different types of artifacts
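As a simple illustration of how paired-comparison or forced-choice judgements can be summarised, the sketch below ranks stimuli by the number of times each was preferred. This is only a win-count tally under assumed data structures, not the analysis procedure used for the experiments reported later in this thesis.

```python
from collections import defaultdict

def rank_by_wins(judgements):
    """judgements: list of (preferred, other) pairs from a two-stimulus
    forced-choice or paired-comparison test. Returns items ranked by win count."""
    wins = defaultdict(int)
    for preferred, other in judgements:
        wins[preferred] += 1
        wins[other] += 0          # ensure every item appears in the tally
    return sorted(wins, key=wins.get, reverse=True)

# Example: three viewers comparing processing methods A, B and C
# rank_by_wins([("A", "B"), ("A", "C"), ("B", "C")])  ->  ["A", "B", "C"]
```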

If the artifacts are just on the visual threshold, just noticeable difference (JND) testing is often used to assess quality. JND tests are not biased on differences in types of artifacts, and are often used to predict the visually lossless point between a compressed image and the original. Display time and learning effects affect the JND point, especially if an observer is given hints about where to look. One can also maintain a visible difference map and have a user specify image quality for different regions in an image [24].

Quality has also been assessed via the receiver operating characteristic (ROC) method, which measures the performance of an observer in making decisions using an altered (eg. compressed) image. The rate of true positive decisions is compared with the rate of false positive decisions, and the decisions can be affected by fatigue and time of day. Therefore the reliability of ROC studies is dependent on a large number of tests conducted under a range of human circumstances [53]. Typical studies are done with a number of controlled observers.
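A single operating point on an ROC curve can be derived from a set of observer decisions as in the sketch below; the tuple layout of the decision records is an assumption made purely for illustration.

```python
def roc_point(decisions):
    """decisions: list of (observer_said_present, target_actually_present)
    booleans. Returns one (false positive rate, true positive rate) point.
    Assumes the decision set contains at least one positive and one negative case."""
    tp = sum(1 for said, truth in decisions if said and truth)
    fp = sum(1 for said, truth in decisions if said and not truth)
    positives = sum(1 for _, truth in decisions if truth)
    negatives = len(decisions) - positives
    return fp / negatives, tp / positives
```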

A detailed discussion of human observer research methods in electronic imaging is contained in Snyder and Trejo [90]. This covers psychophysical (acuity, discrimination), physiological methods (electroretinogram, positron emission tomography, cerebral blood flow) and behavioural methods (search time, legibility, response time, which includes subtasks of visual perception, decision making and motor response). Psychophysical tests are of most relevance in this thesis as the research hypothesis is validated by such methods. Nemine [62] defines psychophysical techniques as a method used to measure characteristics of the environment in terms of the psychological value. Travis [98] observes that psychophysics is non-invasive, and involves investigation of a system by the study of the psychological response to physical stimuli. The physics refers to measurement of the stimuli, while the psychology refers to the measurement of sensation.


Recommendation 500 of the International Telecommunication Union is often cited in the field of image quality evaluation [41]. This document covers methods and viewing conditions for assessing perceived quality in a standardised way. More recently, the ITU has developed Recommendation P.910 to standardize methods for multimedia quality assessment. The premise behind subjective assessment is the use of human observers to rate video sequences (usually short clips), and thus it may be impractical to use these methods for the in-service continuous evaluation of image quality. A Video Quality Experts Group (VQEG) has been established to provide other objective methods for video image quality evaluation [101].

Obtaining subjects’ perception of the low quality images anticipated from visual prostheses has similarities with the classical inkblot tests used in psychology, known as the Rorschach [27]. Clear guidelines for psychological assessment have been established [30], covering seating, verbal instructions, recording and enquiry on responses. Error can be introduced by censorship by the subject, scoring errors, poor handling of subtleties of interpretation, incorrect incorporation of age or education, and examiner bias (illusory correlation). Alterations in wording, rapport and encouragement can significantly alter responses. A large number of variables are likely to produce spurious random significance. Side-by-side seating is recommended if possible.

Tests used in this thesis aim to determine the intelligibility of low quality images, via a viewer’s ability to recognise an object. Specifically related to these tests are the interpretation and questioning used in ink-blot testing. For example, if shown a pattern, typical questions would be:
Q. What is this object?
Q. What about it made it look like [ ______]?
Interpretation of the answers has been assisted by reference codes and categories. A location attribute is used to categorise the area of the image used to draw the viewer’s conclusions:
• whole response (W) – the entire image was used
• unusual detail (Dd) – location responses made by <5% of subjects
• common detail (D) – a frequently identified area.


Determinants are noted for any style or characteristic of the image eg. shape, texture. Are there any arbitrary contours created where none exist? Finally the content of the answer is allocated a category. These include whole human (real person) eg. Lena, the image processing test image; whole human (myth/fictional) eg. ghost, angels; human detail eg. person without head; whole animal; anatomy eg. lungs, stomach; art eg. statues, jewellery; botany eg. plants; clothing eg. hat; clouds; food; household; landscape; science.

Other cognitive issues in image quality measurement are given by De Ridder [19]. He states that methods for assessing perceived image quality can produce biased estimates of a viewer’s quality impression. Sources of bias include the instructions given to subjects, and the choice of rating scale, such as:
• single scaling: 1 (lowest sharpness) to 10 (highest sharpness)
• double stimulus scaling: as above, but with a reference image introduced
• a comparison scale ranging from –10 (1st image is much sharper than 2nd) to 10 (2nd image much sharper than 1st).

2.3.1.3 HVS-based Metrics

The next category for consideration is metrics that are based on experiments measuring human visual characteristics. To overcome the shortcomings of the simple physical measurement metrics such as Peak Signal-to-Noise Ratio or Mean Absolute Error, a large number of often complex perceptual quality metrics have been proposed. A good summary of perceptual metrics as applied to image quality research is contained in Ahumada [1]. Perceptual metrics can be classified as follows:
• Metrics with simple characteristics; these include contrast sensitivity function and luminance adaptation (refer below);
• Metrics incorporating preferences for suprathreshold artifacts; these attempt to model the objectionability of imaging artifacts;
• Threshold perceptual models; these metrics predict the visibility of distortions near the visual threshold. Visible difference maps can be created which specify the probability of seeing a difference between two images at each pixel location. Other models give the visibility of errors within each frequency band – overall image quality is then determined by identifying the frequency band with the most visible artifacts.

There are several visual factors used in perceptual image quality metrics [24]:

1. Contrast sensitivity functions
The contrast threshold function (CTF), or its inverse, the contrast sensitivity function (CSF), is the most widely used attribute for metrics. The CTF indicates when frequency components just become visible, and specifies the internal noise level across spatial frequencies; ie. it identifies the amount of quantisation that can be applied near the visibility threshold for the same perceptual error. The CSF is a measure of the spatial resolution of the human visual system. When used for image compression applications, the CSF is often assumed to be a low pass function to ensure quantisation artifacts become less visible for increased viewing distance.
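As an example of the kind of function involved, one widely cited analytic fit to the luminance CSF is the Mannos and Sakrison formula, sketched below. It is given here for illustration only and is not necessarily the CSF model used by the metrics cited in this section.

```python
import numpy as np

def csf_mannos_sakrison(f):
    """Approximate contrast sensitivity at spatial frequency f (cycles/degree),
    using the Mannos-Sakrison fit; sensitivity peaks at roughly 8 cyc/deg and
    falls off at both lower and higher frequencies."""
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)
```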

2. Luminance adaptation
Luminance adaptation is the second most commonly used attribute. The Weber-Fechner law describes how sensitivity to luminance differences in a stimulus is proportional to the mean luminance of the stimulus (the contrast threshold remains constant for increasing luminance levels). As the background luminance increases, the sensitivity to errors decreases proportionately. Again for compression applications, as local luminance increases, an increased level of quantisation can be tolerated. Luminance adaptation can be implemented in the spatial or frequency domain.
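The Weber-Fechner behaviour can be expressed as a just-noticeable luminance difference that grows in proportion to the adapting luminance; the 2% Weber fraction used below is only a ballpark figure for photopic viewing, assumed here for illustration.

```python
def weber_jnd(background_luminance, weber_fraction=0.02):
    """Just-noticeable luminance difference under Weber-Fechner adaptation:
    proportional to the background (adapting) luminance."""
    return weber_fraction * background_luminance

# e.g. doubling the background luminance doubles the error that can be tolerated:
# weber_jnd(100.0) -> 2.0, weber_jnd(200.0) -> 4.0
```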

3. Linear transforms
Psychophysical studies have shown that the human visual system has visual channels selective to spatial frequency (with approximately an octave bandwidth) and to orientation (with sensitivity of 15 deg to 60 deg). Several desirable properties for linear transforms are used to model the human visual system: frequency and orientation selectivity, linear/quadrature phase, minimum overlap between adjacent channels, shift invariance, scale variancy.


4. Masking: contrast, noise and mutual masking
Contrast/pattern masking is where a signal is masked (its visibility reduced) by the presence of another signal or noise. For compression applications, it is desirable to have the image content (eg. where the image is textured) mask the quantisation noise. For masking to occur, the image and noise signals need to be in the same spatial location, have the same spatial frequency and have the same orientation. Incorrect prediction of contrast masking is the major reason why perceptual metrics fail. A major problem is that metrics treat contrast masking as a complex multiple frequency decomposition (computationally complex), and perhaps single channel metrics could be used instead. Osberger [67] states that masking may be one of the most influential components of a vision quality model for complex natural images (more than the choice of single versus multiple channels).

5. Summation of errors
Summation of errors is done across frequency bands and across space to reduce the dimensionality arising from many channels (an excess of information) into a single map, and perhaps even a single number.

If the artifacts are above the threshold of visibility (for example at high compression ratios), the objectionability of the artifacts depends on the personal preferences of an observer. This could be incorporated into a model, but most metrics ignore the issue. There is no consensus regarding distortion levels at which observer preferences play a role [24].

Metrics should be able to characterise spatial variations in quality across an image. Therefore some perceptual metrics provide a two dimensional quality map, which assigns a level of perceived distortion to each location [106]. There is also a desire expressed by these metric designers to collapse the map into a single number that reflects overall quality, so a user can specify a single quality number for the entire image. The predictive ability of a single number is very dependent on the psychophysical methods used to validate the metric.


2.3.2 Cognitive Effects

At the higher end of the hierarchical model are higher level cognitive effects. In this thesis, it is desired to determine how little information can make up a scene while still being of use to a viewer. In order to determine the lowest visual ‘primitives’, or building blocks, that make up an image, an analogy can be made to the process of understanding words when alphabetical characters are removed from a word. Different thresholds of understanding may be found depending on the approach (a minimal sketch of such a probe follows this list):
• Replace letters randomly, and ask the subject if the word still makes sense
• Build words from letters (randomly), and ask the subject if the result makes sense.
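A sketch of the first probe – randomly blanking out letters and asking whether the word is still understood – is given below; the example word and the number of removed letters are arbitrary choices for illustration.

```python
import random

def degrade_word(word, n_removed=2, filler="_"):
    """Randomly blank out n_removed letters, mirroring the 'remove characters
    and ask if it still makes sense' probe described above."""
    positions = set(random.sample(range(len(word)), min(n_removed, len(word))))
    return "".join(filler if i in positions else c for i, c in enumerate(word))

# degrade_word("prosthesis", 3)  ->  e.g. "pro_th_s_s"
```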

Several viewpoints on top down perceptual effects are collected in Cantoni [10]. Yarbus points out that visual exploration of a picture by a human observer follows different paths according to the particular task that has been assigned. Savina states that what is important in scene understanding is to collect in the shortest time possible, information that allows one to perform the task assigned, leaving aside the rest. To answer some question about a scene, one needs to analyse a few restricted areas for a long time, meaning perhaps that the extraction of certain kinds of information is difficult compared to the extraction of others. Gibson maintains that the world around us acts as a huge external repository of information necessary to act, and we directly extract from time to time elements that are needed.

The human response to visual stimuli is covered well by Hendee and Wells [34], who compare bottom up and top down visual processing. In bottom up models, a scene is built up from individual features, and finer details are obtained by slower scanning of the scene, requiring much time. In top down models, an overall impression of the entire scene is obtained and features are filled in later.

Studies of visual scan paths indicate that real world knowledge (physical laws), past experiences and expectancies affect eye fixation. Areas with high information content (contours, non-homogeneous areas) are fixated by an observer. A perceptual cycle/search plan is proposed:

[Diagram: the perceptual cycle – the schema directs visual exploration, exploration samples the available stimulus information, and that information in turn modifies the schema.]

It is not possible to separate the mechanisms of detection/recognition/interpretation of visual images. Instead there is a single process of constant interplay between perception and cognition. Vision can be regarded as having a preattentive phase and an attentive phase, and terms such as ‘useful field of view’, ‘visual lobe’ and ‘functional visual field’ are used to describe the gathering of information from an area extending beyond the fovea. Preattentive stimuli are immediately detected by the parallel preattentive system. Attentive stimuli require a serial search by the disk of focal attention.

Preattentive processing is further described by Callaghan [9]. This processing involves parallel and independent registration of features in the visual field. Features are registered on separate ‘maps’ and linked to a master location map. Further attentional analysis is required for identifying an object (the information in the master map is linked together). Callaghan suggests that perhaps there are no preattentive/attentive stages but an attention continuum during perceptual processing. Texture segregation and popout are easily produced from visual primitives eg. line orientation, hue, brightness, form, line terminators, line length, curvature and closure. The author describes experiments conducted to support the proposal that within-region segregation is an important factor in texture segregation. Observers were presented with arrays of elements with varying hue and form (eg. circles & squares) and asked to identify boundaries between elements ie. natural scenes were not used in the experiments.

Other perceptual studies are reported in [71]. The visual perception program at SRI includes studies of reduced field of view, limited spatial resolution, system-produced distortions, system delay and system update rates. Most relevant to the application addressed by this thesis were limited spatial resolution studies, where a stereoscopic display was used to present images on 2 colour monitors that the viewer saw as a fused stereoscopic image. Each monitor had a resolution of 1280x1024 pixels, and when viewed at 57cm, produced a pixel subtense of 1.5 min arc, equivalent to a visual acuity of 20/30. Coarser spatial resolution was achieved by programming the display to use selected small numbers of pixels instead of single pixels to produce an image point.

Information presented at a rate higher than 30 images/second is not integrated as a discrete sequence of images due to limitations of brain neural networks [34]. The extraction of basic attributes of light fields (ie. characteristic features of an image) is the central issue in biological vision. There is no mathematical theory of image enhancement available because we are unable to objectively describe the perception-cognition relationships. The simplest way to approach this is to focus on image contrast and detail. In practice signals are not band-limited and sampling is a finite duration operation. More samples are needed than theoretically predicted.

Semantic categorisation research is also of relevance to cognitive perception. Image semantic researchers [60,80] have found that colour seems to play a significant role in the perceptual organisation of images. Colour was found to be important in natural scenes, but not with people or manmade objects/environments (where spatial organisation, spatial frequency and shape features are more influential). The presence of strong colours (bright red, lime green, pink, pure white) can indicate man-made objects, especially when combined with spatial and regional features. Segmentation into regions of uniform colour or texture gave opposite results for man-made scenes (straight lines & boundaries, geometric shapes, sharp edges) and natural scenes (rigid boundaries, random edge distributions). Semantic categories appear to correlate with image descriptors (eg. indoor scenes = brownish, low light levels, many straight edges), and so attempts have been made at a metric for semantic categorisation. For example, the feature combination used to capture the semantics of the ‘cityscapes’ category was: skin = no, face = no, silhouette = no, nature = no, number of regions = large, region size = small/medium, central object = no, energy = high, number of edges = large, details = yes, colour = brown/grey.


Studies of eye movement paths, or scanpath theory, have also contributed to the understanding of visual perception [92]. The researchers present eye movement studies as an approach to describe how we see in our mind’s eye (top-down). When subjects were asked to first look at a picture and then later asked to visualise that same picture from memory, the scanpaths were very similar. This provides evidence for the top down scanpath theory of vision, since there is no external world available to satisfy the bottom up concept that the external world enters the brain and controls visual perception. The researchers recognise there is a problem in matching the bottom up signals coming from the wide peripheral visual field (low resolution and high sensitivity for moving objects) with those from multiple high resolution glimpses by the centrally located fovea. These regions of interest (ROIs) are sequentially visited by a string of fixations and saccades (rapid eye movement jumps) and are simultaneously matched by top down representations of the hypothesised image. The authors remark that when the retinal field is mapped onto the visual cortex, there is considerable magnification of the signals coming from the fovea (ROIs), and a reduction of signals coming from the low resolution periphery (only colour and textural segmentation of large areas). These foveal and peripheral representations indicate the kind of bottom up information entering the visual brain.

Stark and Privitera also conjecture a meeting point for top down/bottom up processing [92]. They propose that where top down inputs to levels I, II and III of the visual cortex meet bottom up visual signal information going to levels IV and V, this is the site of matching between top down subfeature representations and incoming bottom up sensory signal flows. After matching, they propose that the scanpath continues to the next ROI; in this way, the top down model moves, fixates and foveates the eye, bringing forward successive subfeatures for checking.

A final point of interest concerning this research group is their study of scanpath eye movements with dynamic scenes. Smooth pursuit eye movements play a large role in scanpaths while subjects are observing dynamic scenes and have an interesting characteristic: they maintain the fovea over a moving object for as long as this is possible and for as long as the moving object is one that the top down spatial cognitive model continues to address.


Other researchers [76] recording the eye positions of human subjects viewing natural scenes found that subjects looked at image regions that had high spatial contrast, and in these regions, the intensities of nearby image points (pixels) were less correlated with each other than in images selected at random.

As Region-of-Interest concepts feature highly in the above discussion of cognitive perception, they are expanded in the next section, where several research activities in this area are described.

2.4 Region-of-Interest

This section provides commentary on an area of work incorporating both low and higher levels of vision. Deficiencies of modelling vision using only early vision phenomena have been identified previously [67]: model parameters need to be chosen to reflect human response to complex natural scenes rather than simple artificial stimuli, and higher level, cognitive factors need to be employed.

A common goal of the models described in this section is that they identify Regions-of-Interest (ROIs) within an image in an attempt to predict where the human eye will fixate in the image. When compared against subjective tests using eye-tracking machines or similar attention-recording devices, these region-of-interest algorithms provide a high degree of correlation with human observer behaviour. These ROI techniques have found application in advertising [40], military surveillance [64] and visually lossless compression (where uninteresting areas of the image are compressed more than others so compression artefacts are placed in these areas only) [69,75].

There are several factors influencing attention: motion, contrast, size, eccentricity and location, shape, foreground/background, edges and texture, prior instructions and context of viewing, people, gestalt properties (closure, orientation, proximity, similarity, symmetry), clutter and complexity, unusual or unpredictable stimuli (eg. high information content), and interaction between basic features [67].


Schill et al [85] have recorded eye movements when subjects viewed natural scenes. They analysed the spatial statistics of fixated regions with higher order statistics (bispectra), and found a clear bias for subjects to fixate on regions with frequency components of multiple orientations (eg. regions with curved edges or occlusion patterns). Using this information as candidate features of informative regions, the authors have developed a system that attempts to automatically select informative regions in saccadic scene analysis. The system integrates a simplified bottom-up mechanism with a task-oriented top-down mechanism. The cognitive (top-down) stage infers knowledge based on the Dempster-Shafer theory for uncertain reasoning. The bottom up/early visual system computation is a neural network processing stage, where features are extracted by linear orientation selective filters. They conjecture that top down and bottom up processing rely on a common principle: “information gain”. The system they have developed computes the most informative region, which should be selected by the next eye movement. In visual prosthesis applications, a human user would shift the fixation point, and thus there is no need to model top-down, voluntarily controlled attention shifts.

Osberger [67] has defined a quality metric using the notion of importance maps. This metric has improved performance over traditional quality metrics, which assume the whole of a scene is viewed foveally (representing image fidelity). An importance-map weighted metric is more representative: traditional metrics overemphasise visual distortion in textured areas and do not account for the strong masking in these areas. The metric has been extended to a perceptually based compression [69] and a new model for automatic detection of regions of interest in complex video sequences [70]. Features of the new model include motion (the model can distinguish camera motion, eg. pan, tilt, zoom, from true object motion, and has adaptive motion thresholds for different video scenes), colour, intensity contrast, size, shape, location, background, and skin tone (a narrow range of Hue Saturation Value). Feature maps representing individual factors were correlated with eye movements from 24 viewers to quantify weights for each factor. 75% of viewers’ fixations occurred within the 30% of the total scene area estimated by the model as being the most important.


A technique that avoids the segmentation of the above model is the context-free region-of-interest algorithm presented by Nguyen et al [64]. The technique is intended to be useful for images with interpretable content that varies with resolution and field of view. There are three stages to the algorithm:
1. Quadtree feature maps are generated for 4 visual factors (contrast, relative brightness, variance, edge density). Each level of quadtree decomposition narrows the field of view over which the feature is examined. If the feature persists in a region, the region is divided further.
2. Each region is assigned an importance value [0 1] based on region detail (higher importance if the region keeps splitting into narrower fields of view).
3. An overall normalised [0 1] importance map is generated from the weighted combination of the importance-weighted quadtree feature maps.
From the overall importance map, an integer number of bits can be assigned to each pixel proportional to the importance map pixel values, rather than applying a uniform number of bits per pixel. Context dependent criteria could be integrated to improve the technique.
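The following is a minimal sketch, not the published implementation, of the quadtree idea above: a region whose local feature (here simply intensity variance) remains strong keeps subdividing, and the depth a region reaches becomes its importance. The image is assumed to be a greyscale NumPy array scaled to [0, 1]; the variance threshold and maximum depth are illustrative choices only.

```python
import numpy as np

def quadtree_importance(img, var_thresh=0.01, max_depth=5):
    imp = np.zeros(img.shape, dtype=float)

    def split(r0, r1, c0, c1, depth):
        imp[r0:r1, c0:c1] = depth                 # depth reached so far = importance
        if depth >= max_depth or (r1 - r0) < 2 or (c1 - c0) < 2:
            return
        if img[r0:r1, c0:c1].var() > var_thresh:  # feature persists: divide again
            rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
            split(r0, rm, c0, cm, depth + 1)
            split(r0, rm, cm, c1, depth + 1)
            split(rm, r1, c0, cm, depth + 1)
            split(rm, r1, cm, c1, depth + 1)

    split(0, img.shape[0], 0, img.shape[1], 0)
    return imp / imp.max() if imp.max() > 0 else imp   # normalise to [0, 1]

# Bit allocation proportional to importance (cf. stage 3 above), here 1 to 8 bits:
# bits = np.rint(1 + 7 * quadtree_importance(img)).astype(int)
```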

If colour is taken into account as well as intensity, more information can be obtained about the image contents. Similar to the above models, colour importance has been defined at each pixel location in an image and used to weight the results of image analysis tasks [54]. It is difficult to extract colour information due to poorly defined data dependencies between colour bands. Spectral differences become important in regions where the difference in luminance is negligible. Shadows and highlights can cause sharp changes in luminance, producing undesirably strong edges. Colour correction (eg. joining panoramic photos) and colour quantization are also situations where colour has a significant influence. An importance measure (normalised to [0, 1]) was constructed by combining global and local factors on a case-by-case basis. Global factors included the probability of a colour in the image, the probability of a colour group in the image, the probability of a colour in a colour group, and variability (if low, discrimination is limited and may need to be enhanced). Local factors were similar to the global factors but acted in a regular m x n neighbourhood or an irregular segmented region. It is not possible to modulate colour in the visual prosthesis application, so information about colour as described here is not highly assistive to this thesis.


A progressive technique for human face archiving and retrieval uses a similar importance measure [7]. The compression technique described has 1.5 times the compression rate of the JPEG standard and is more visually acceptable, as compressed images do not suffer from blockiness and the visually important information (edges) is reconstructed first.

Another region of interest algorithm has been proposed to detect the main subject in photographs [50]. Advantages of the approach are that the model includes semantic (human skin and face, sky, cloud, grass, tree) as well as geometric features (no motion or depth features, as the application is only for photographic images). An image is segmented into regions and the following features, ‘believed to influence visual attention’, are computed for each region:
• Low-level: colour, brightness and texture (each with a self saliency, considered by itself, and a relative saliency, in competition with other regions)
• Geometric: centrality, borderness, surroundedness, size, shape, symmetry
• Semantic: skin, face, sky, and grass
All features are plotted on an “effectiveness-complexity” graph in relation to main subject detection. For example, face is a strong indicator of the main subject but is less effective than the location features (centrality, borderness) because of the low likelihood of face regions among all regions.

Itti and Koch have proposed many models for visual attention: multi-scale feature maps to detect local discontinuities in intensity, colour, orientation and optical flow, and biologically plausible models such as a centre-surround mechanism modelled on visual receptive fields (cortex transform) [40]. Receptive field properties can be modelled using difference of Gaussian filters (nonoriented features) or Gabor filters (for oriented features). Feature maps (normalised to [0 1]) are produced for intensity, colour, and orientation (0, 45, 90, 135 deg), with 6 feature maps for each at different spatial resolutions. The feature maps are then combined into a master or saliency map using one of four methods. In the final saliency map, the most salient location is suppressed or inhibited, so the system can focus on the next most salient location. A circular focus of attention (rather than the actual object) is identified (the radius was 80 or 64 pixels depending on the image set). The average number of false detections (mean ± standard deviation) was reported for each method used on a database of traffic signs (ie. still images). The authors state that other algorithms are much simpler, but that their method is independent of the nature of the targets (ie. context free).
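A highly simplified, hedged sketch of this style of saliency computation is given below: intensity contrast is approximated by differences of Gaussians at a few scales, the maps are normalised and summed, and the most salient location is then suppressed (inhibition of return) so attention can move to the next location. Colour and orientation channels are omitted, and the sigmas and suppression radius are illustrative assumptions rather than the published parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(img, scales=((1, 4), (2, 8), (4, 16))):
    """Centre-surround intensity contrast summed over several scales."""
    sal = np.zeros(img.shape, dtype=float)
    for centre_sigma, surround_sigma in scales:
        cs = np.abs(gaussian_filter(img, centre_sigma) -
                    gaussian_filter(img, surround_sigma))
        if cs.max() > 0:
            cs /= cs.max()                      # normalise each map to [0, 1]
        sal += cs
    return sal / len(scales)

def scan_salient_locations(img, n_fixations=3, suppress_radius=40):
    """Return the n most salient locations, suppressing each one after selection."""
    sal = saliency_map(np.asarray(img, dtype=float))
    rows, cols = np.indices(img.shape)
    fixations = []
    for _ in range(n_fixations):
        r, c = np.unravel_index(np.argmax(sal), sal.shape)
        fixations.append((int(r), int(c)))
        # Inhibition of return: zero a disc around the attended location.
        sal[(rows - r) ** 2 + (cols - c) ** 2 <= suppress_radius ** 2] = 0
    return fixations
```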

The above saliency model has been extended to include object recognition [58]. The new model combines a fast visual attention front-end, which rapidly selects the few most conspicuous image locations, with a slower object recognition back-end, which identifies objects at the selected locations. The object recognition back-end is trained on target features which are simple stimuli only (eg. circle vs rectangle), with the hope of extending this in future to natural colour images (eg. pedestrians). The relevance of such a system to visual prostheses might be to give a spoken interpretation of the scene when walking down the street (eg. tree, post, sign).

2.5 Visual Information

This section provides background to the visual information contained in images. Cooper et al, in their paper Causal scene understanding [16], asked some intriguing questions pertinent to visual prostheses: “What is visual understanding? What does it mean to look at a scene and understand what it is about?” Since understanding is, in large part, the preparation we make for acting, the question can be reformulated in this way: “What knowledge about a scene would a visually impaired person need in order to take intelligent action in that scene?” A comment is made that every picture tells a story and visual understanding consists of figuring out what that story is. In Cooper et al’s paper and others like it [78], scene understanding is described from the point of view of a robot – to predict what is going to happen. Many intelligent agent systems have been developed, for example robots that can pick up mugs with handles (vertical lifting force plus rotational torque to counteract mug rotation) and vision based robot corridor-cleaners. In the visual prosthesis application, there is a functioning human brain to interpret visual signals and understand the significance of elements in a scene and the relationships between those elements, unlike robot/knowledge-based applications, where an agent does the interpretation. Therefore, visual prosthesis systems should perhaps mainly focus on bottom-up processing, trying to get the most out of a scene, while using knowledge of top-down cognitive interpretation (eg. magno/parvo channels).

Experiments are described in Chapter 6 of this thesis that quantify the amount of visual information inherent in an image. Previous research in this area includes that concerning visual complexity. Riglis [77] has reported on the following measures for estimating visual complexity:
• The number of words used when describing a picture;
• Estimators from systems: straight lines and smooth curves = low complexity, right angles = medium complexity, intersections of lines at acute angles = high complexity;
• Geometrical characteristics derived from Gestalt psychologists: symmetry of stimuli, symmetry of curves present in the image, similarity of objects in the image, saliency of curves present in the image, smoothness of curves present in the image;
• Other geometrical characteristics: area of figure, value of angles, number of revealed elements, diversity of angles and sides, symmetry;
• In a study involving the perceived beauty of forests, high image complexity was found to be related to a high number of colours, high number of edges, high fractal dimension, high standard deviation, high entropy, and larger file sizes (image encoding including Huffman encoding and run-length encoding);
• Klinger-Salingaros complexity: temperature (internal contrast), harmony (symmetry), life (product of temperature and harmony), complexity = temperature * (constant – harmony).
Except for the perceived beauty of forests study, none of the above measures were tested with real images. Riglis undertook experiments in which subjects ranked images from low to high complexity, and determined relationships with fractal dimension, fractal image format compression, GIF compression, JPEG compression, TIFF compression, pixel mean, pixel median, pixel standard deviation, and his understanding/implementation of K-S harmony, life and temperature. He found a positive correlation in one out of three experiments for fractal compression, K-S life and pixel standard deviation (the other two experiments showed no correlation; one was conducted on-line with no control images for comparison).


Meletiou has estimated the complexity in scenes using Osberger/Maeder importance maps [57]. After segmentation and importance classification (based on contrast, size, shape, location, background), the extracted regions were grouped into 10 categories according to importance level. Complexity was then defined as the sum, over categories i = 1 to 10, of i multiplied by the number of regions in category i; ie. the more complex the image is, the more important it will be. Observers were asked to describe an image presented, and then rank (from 1-5) the difficulty of verbally describing the images presented. Reaction time, number of words and ranking were compared with the complexity metric (reaction time and number of words are probably better suited to exploring the threshold of sensitivity of verbalisation: some subjects could talk for hours about a simple image). All measures were statistically significant. Meletiou conjectures that perhaps a high fractal dimension relates to a simple image, and a low fractal dimension to a complex image.
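Read literally, the measure above could be computed as in this small sketch, assuming regions_per_category maps each importance category (1 = least important, 10 = most important) to the number of segmented regions it contains; the example counts are hypothetical.

```python
def scene_complexity(regions_per_category):
    """Complexity = sum over categories i of (i * number of regions in category i)."""
    return sum(i * n for i, n in regions_per_category.items())

# Hypothetical example: 5 regions in category 1, 2 in category 4, 1 in category 10
# scene_complexity({1: 5, 4: 2, 10: 1})  ->  5 + 8 + 10 == 23
```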

Other researchers have used the term ‘visual complexity’ to describe the running time of algorithms [5]. Visual complexity has been proposed as the sum of the number of edges in the scene, the screen resolution and the number of visible edge crossings (wire mesh rendering application).

Stange undertook several subjective experiments using simple geometric models [91]. Visual complexity was modelled with 4 parameters:
• each individual object’s colour
• each individual object’s size
• each individual object’s shape
• the number of individual objects in the image
The only parameter that had a statistically significant correlation was the number of individual objects in the image.

2.6 Chapter Summary

This chapter has described research in image quality and visual understanding. Physiological aspects were provided to gain a brief overview of how visual perception is achieved in the human visual system. The large body of knowledge concerning image quality was mentioned to be based on addressing imaging distortions.

A hierarchical vision structure was then described, starting at early vision effects and incorporating increasing levels of cognitive or higher level factors. Region-of- Interest techniques were described which combine advantages of early vision and cognitive vision models.

Finally, some previous studies in the area of visual information were described, an area which has high relevance to an application where visual information needs to be maximised.

Chapter 3 Visual Prosthesis Application

3.1 Overview

This chapter contains five further sections. The first introduces the application of artificial vision. Section 3.3 provides an overview of current visual prosthesis projects. The image processing aspects of these projects are discussed separately in Section 3.4. Section 3.5 frames the field of digital image processing for this application. The poor quality of images anticipated from artificial vision systems is described, along with several processing techniques that are compatible with the anticipated evoked visual sensations of visual prostheses. Finally, Section 3.6 presents several image processing requirements for visual prostheses, which drive the research questions to be addressed in this thesis.

3.2 General Introduction to the Application

Biomedical in-vivo applications of computing, especially where computing systems are superimposed on or integrated with human systems, offer enormous challenges to researchers to develop new or better solutions which can improve our quality of life. The development of intelligent, reprogrammable devices for insertion into the body (such as pacemakers) is an example. Bold new projects have emerged, such as the MIT “wearable computer” or “thinking cap”, where computer systems interface very closely with the user’s body [73].

Several international research teams are currently developing artificial human vision ("bionic eye") systems that have the potential to restore some visual faculties to blind persons. While the approaches by the various teams differ, a common element is that they all require a system that converts a visual scene into electronic pulses that stimulate nerve cells in the visual pathway (eg. via implanted electrodes), resulting in a crude induced “image” being formed in the visual regions of the brain. The utility of the induced image depends on how much visual information is presented, which in turn is determined by image quality and image processing considerations.


Little human trialing of visual prostheses has yet been conducted from which to draw conclusions on image quality. The perceived quality of an image is dependent on the number of electrodes in the implant, with higher numbers of electrodes giving higher spatial resolution of images. At present, size and manufacturing constraints place limitations on the numbers of electrodes in a given implant. An open question from an image processing point of view is how to optimise the amount of useful visual information obtainable from the relatively few electrodes in the implants.

The next section gives an overview of research in the area of artificial human vision. It describes the categories or general areas of research and summarises the approaches of the various research teams. Details of their respective designs are not covered in depth, and the interested reader is referred to the project websites or publications listed in the text. This background on vision research provides a framework for the image processing methods suggested later, and gives the reader an appreciation of the challenging nature of the application.

3.3 Current Visual Prosthesis Research

Good reviews of the history and present state of the art in visual prosthesis systems are presented by Warren and Normann [104], Margalit et al [55], Suaning et al [93] and Lysaght et al [51]. The basis of all visual prostheses is an image sensing device (video camera or vision chip and lens) that records the visual world and transmits this information in real-time to the upper level visual processes (refer Figure 3.1). An image acquired by the camera is processed or manipulated to be in a form matching the implant device. The processed image is then sent as electronic pulses to implanted electrodes within a blind patient.

[Block diagram: Image sensing device → Processing Unit → Implanted electrode array]
Figure 3.1: Basis of Visual Prostheses


When undergoing electrical stimulation, patients have reported the perception of spots of lights in their visual field, referred to as phosphenes. Although unlikely to recreate perfect vision, artificial vision systems may evoke enough phosphene perception to perform every-day tasks such as navigation, recognition, and reading.

Visual prosthesis research can be categorised depending on the intended stimulation site of the implant:
• Retinal
• Optic nerve
• Visual Cortex
Moving down this list corresponds to increasing proximity to the brain and an increased number of potential beneficiaries.

The visual cortex holds the potential to assist the largest number of blind persons, as prostheses designed to stimulate the retina or optic nerve require the rest of the visual pathway to the brain to be intact. However, the surgical risk to a patient with an otherwise healthy brain may be higher for visual cortex prostheses.

A brief overview of current artificial vision projects is presented below, along with project websites and sample reference papers where available.

3.3.1 Retinal Systems

3.3.1.1 University of Southern California (Mark Humayun, Gislin Dagnelie, Eugene de Juan) [37]

Ophthalmologists at the University of Southern California (Doheny Retina Institute) have implanted permanent retinal prostheses into several patients, as part of an FDA-approved feasibility trial. A wafer-thin silicon microchip, embedded with photosensor cells and electrodes, is powered by an external laser beam. Photosensor cells receive and convert light images from the pupil to electrical impulses. These impulses can then drive action potentials in the remaining ganglion cells of patients with retinal disease. http://www.usc.edu/hsc/doheny/ (accessed 21/1/05)


3.3.1.2 MIT-Harvard (Joseph Rizzo, John Wyatt), USA [79]

This is a joint collaboration between the Massachusetts Eye and Ear Infirmary and the Massachusetts Institute of Technology. Their prosthesis consists of a power source, a small, fixed-direction laser with 820 nm wavelength, and a data source, a tiny charge-coupled-device (CCD) camera whose output amplitude modulates the laser beam. A signal-processing microchip in the data source converts the visual information to an electronic code that is carried on the laser beam. Both the power and data sources are mounted on a pair of spectacles. http://www.bostonretinalimplant.org/ (accessed 1/6/04)

3.3.1.3 Tübingen University (Eberhart Zrenner), Germany [87] and Bonn University (Rolf Eckmiller), Germany [26]

These two research groups are funded by the German Federal Ministry of Education and Science. In the SUB-RET (Tübingen) approach researchers are working on a device consisting of microphotodiodes which are to be placed underneath the retina to stimulate postsynaptic retinal cells directly by converting light to electric energy. In the EPI-RET (Bonn) approach scientists develop a microcontact array which is mounted onto the retinal surface to stimulate retinal ganglion cells. http://www.uak.medizin.uni-tuebingen.de/depii/groups/subret/ http://www.nero.uni-bonn.de/ri/retina-en.html (accessed 1/6/04)

3.3.1.4 Nagoya University (Tohru Yagi), Japan [109]

The research at Nagoya University could be termed biohybrid, in that there is a combination of biological and man-made elements in the construction of the implant. The research aims to develop devices in which cultured neural cells and a photoelectric device are combined. This technique is similar to other techniques in that electrical components are being placed directly into contact with the retina. However the use of nerve cells as a part of the implant makes for a potentially more reliable system. http://www.nidek.com/artificial_vision.html (accessed 1/6/04)


3.3.1.5 Optobionics (Alan & Vincent Chow), USA [72]

This device is powered solely by incident light and does not require the use of external wires or batteries. An artificial silicon retina is implanted under the retina (subretinal space) and is designed to mimic the photoreceptor layer. The research effort is mentioned here for completeness, although there is no image sensing device (as shown in Figure 3.1) and hence no opportunity for perception enhancement by image processing. http://www.optobionics.com (accessed 1/6/04)

3.3.2 Optic Nerve Systems

3.3.2.1 Catholique Université de Louvain (Claude Veraart), Belgium [100]

The techniques are based on optic nerve stimulation using a self-sizing spiral cuff electrode. In preliminary testing to date, with the help of blind human volunteers, researchers have been able to produce phosphenes throughout the visual field. Stimulation at this location would be suited to patients who have non-functioning rods and cones in the retina but a healthy optic nerve. http://www.md.ucl.ac.be/gren/Projets/vision.html (accessed 1/6/04)

3.3.2.2 University of New South Wales/University of Newcastle (Nigel Lovell, Gregg Suaning), Australia [94]

The visual prosthesis system consists of a camera, StrongARM microprocessor system and an implantable electrode array connected by a radio frequency link. This avoids the need to pass wires through the skin. The current design consists of a 10 x 10 array of electrodes, giving the potential for 100 stimulation sites. Recent work has tended to redirect this project from optic nerve electrode cuffs towards retinal stimulation. http://rambler.newcastle.edu.au/~greggs/ (accessed 1/6/04)


3.3.3 Visual Cortex Systems

3.3.3.1 Dobelle Institute (William Dobelle) Portugal [20]

The research team has successfully implanted a 64-electrode array on the visual cortex of a patient using wires passing through the skin. The patient is claimed to be able to read two inch tall letters at a distance of five feet, representing a visual acuity of about 20/400. Although the electrode array produces tunnel vision, the patient is also claimed to be able to navigate in unfamiliar environments. http://www.dobelle.com/index.html (accessed 1/6/04)

3.3.3.2 National Institutes of Health - Washington D.C. (Edward Schmidt), USA [86]

NIH researchers have implanted microelectrode arrays into the visual cortex and recorded stimulation parameters and characteristics of artificially created phosphenes. The Neural Prosthesis Program within the Division of Stroke, Trauma and Neurodegenerative Disorders addresses many types of neural stimulation, not just related to nerves in the visual system. http://www.ninds.nih.gov/npp/ (accessed 1/6/04)

3.3.3.3 University of Utah (Richard Normann) USA [66]

This research is based at the John Moran Laboratories in Applied Vision and Neural Sciences at the University of Utah. The design of the cortical prosthesis employs penetrating microelectrodes rather than surface electrodes. The developers of penetrating cortical electrode arrays claim that the closer spacing of electrodes, compared to surface cortical arrays, results in increased spatial resolution and lower currents needed to induce visible perception, and that the arrays are therefore less likely to induce seizures from overstimulation. http://www.bioen.utah.edu/cni/projects/blindness.htm (accessed 1/6/04)


3.3.3.4 University of New South Wales (John Morley, Minas Coroneo) Australia

An animal model has been developed where one side of the brain is electrically stimulated and responses measured in the other side of the brain. Funding sources for the research include the National Health and Medical Research Council and the Brain Foundation. http://medicalsciences.med.unsw.edu.au/medsciences.nsf/website/researchactivities.labor atories.vision_cognition (accessed 1/6/04)

3.4 Image Processing specifically related to Bionic Eye Projects

In this section the hardware and some image processing considerations are described for research specifically relating to artificial vision. Research is described in 4 areas:
1. vision chip developments;
2. CCD-based systems;
3. receptive field modelling;
4. multiple resolution work.

3.4.1 Vision Chip Developments

Researchers at the University of Newcastle and University of New South Wales in Australia (refer Section 3.3.2.2) use an OmniVision CMOS image chip to acquire visual scenes for their portable prosthesis prototype [95]. They have proposed a regular hexagonal mosaic of electrodes in the implantable array rather than a rectangular layout, which allows better separation between electrodes [31]. The expectation is that this will increase visual acuity and minimize aliasing in the evoked artificial image. The researchers also conjecture that, from an information theory standpoint, modulating the size and modulating the intensity of a phosphene are psychophysically equivalent.

Japanese researchers at Nagoya University (refer Section 3.3.1.4) have developed a vision chip/artificial retina comprising parallel arrays of simple analogue circuits together with parallel array sensors [110]. The authors review previous developments in vision chips, and mention that these chips have not experienced wide application as the outputs of these chips are not sufficiently accurate under natural illumination due to low sensitivity of photosensors. They have overcome this problem with a light-adaptive one-dimensional 100 pixel line sensor. Spatial filtering properties of the vision chip have been tested by mounting a camera lens to focus an image on the photosensor array. The spatial distribution of the output voltages of the chip showed a Laplacian-Gaussian-like receptive field.

The team above have recognised the importance of depth information in visual information processing and have consequently incorporated depth perception in the vision chip [111]. Again, the chip has 100 analogue sensors connected laterally by resistors, giving a one-dimensional (line) 100 pixel sensor, which allows parallel processing in real time. The output of the circuit is a serial signal representing depth. Depth is computed from the disparity between two vision chips (fitted with lenses) which are 120mm apart and turned inwards at 6 degrees. Zero crossings (edges) are detected from the left and right vision chips and used in determining the disparity.

Further work on vision chips is presented by Kyuma et al [46]. An impediment to real time processing has been the separation of image sensing (camera) and image processing (computer) functions. Consequently, system performance is limited by slow camera frame rates and the low transmission rate between the camera and computer. ‘Artificial retina’ chips developed by the authors are described that can simultaneously sense and process images, ie. more akin to the parallel real time processing of the human visual system. These artificial retinas consist of a two dimensional variable sensitivity photodetection cell array, with sensitivity similar to commercially available CCDs. The chips are 12mmx12mm (256x256 resolution) or 6.5mmx6.5mm (32x32 resolution), have a dynamic range of 40dB (input light power) and variable frame rate (1msec – 1000msec). A variety of on-chip image processing can be achieved by changing a control voltage pattern on the chip. These processing functions include image sensing, edge extraction, image smoothing, random access (only a partial image projected onto the chip), pattern matching, and image compression/recognition. The application of these vision chips to prostheses is not stated by the authors beyond ‘man-machine interfaces for multimedia systems’; instead general industrial applications are cited, eg. automotive and avionics. The authors have used the vision chips to control computer game characters by hand gesture recognition.

3.4.2 CCD-based Systems

The processing hardware for a retinal prosthesis project at the University of Southern California (refer Section 3.3.1.1) is an FPGA/EPLD (Field Programmable Gate Array / Electrically Programmable Logic Device) [18]. The device allows easier implementation of the highly parallel algorithms/hardware needed for concurrent processing than a single processor. Three SRAM memories serve as frame buffers to support the storage of images delivered from the camera. Two SRAMs support dual-buffered video, where a current image in transit from the camera can be stored, while a prior image can be simultaneously processed from a second memory. Once the camera has completed delivery of a transit image, the roles of the memories are reversed, so that the new image is processed, while a fresh image is stored in the alternate RAM. A third frame buffer is available for intermediate computations that may occur in algorithms such as spatial filtering. An 8-bit pipeline A/D converter supports cameras which provide only analogue video. The whole board can be worn in a shirt pocket or clipped to a belt.

The Humayun team make some interesting comments regarding the image processing required for prosthetic devices [17], namely to:
• match the crude resolution of the implant array;
• accommodate the limited dynamic range of the array;
• simplify image aspects, such as brightness and colour gradients, which cannot be faithfully represented by the array.
They conjecture that the sacrifice in resolution (spatially and in contrast) would be acceptable in view of the wider operating range (field size and light/dark adaptation) that would be achieved. Further, they comment that the wearer of a prosthesis would achieve a significant degree of learning, compensating in higher visual processing for the detail lost at the input. A comparison is made between the processing required for a retinal versus a cortical prosthesis. At the level of the visual cortex, the neural information stream has already undergone several transformations (dynamic range compression, edge and colour recoding, and translation of analogue information into spike trains) and thus the processor for a cortical prosthesis may have to be more powerful and more trainable than that for a retinal prosthesis.

Researchers involved with optic nerve stimulation (refer Section 3.3.2.1) describe a resolution reduction algorithm based on image segmentation by growth of zones (requiring less computational power than other segmentation algorithms) and its implementation in a low-power VLSI device [28]. The authors propose an algorithm based on the extraction of the main features of the image, with the transmitted information being only the position and form of the relevant entities in a scene. However, their current implementation appears only to be based on intensity. They propose to give a blind person the ability to control the segmentation level (adjustable threshold values), producing areas of uniform illuminance matching corresponding objects or surfaces. Due to the nature of their segmentation algorithm, they report undesirable fast transitions (eg. merging of 2 zones) when segmenting successive images.
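As an illustration only, and not the authors' VLSI implementation, segmentation by growth of zones with an adjustable threshold can be sketched as follows: a pixel joins the zone of a neighbouring seed when its intensity lies within the threshold of the running zone mean. The image is assumed to be a greyscale NumPy array in [0, 1], and the default threshold is an arbitrary choice.

```python
import numpy as np
from collections import deque

def grow_zones(img, threshold=0.1):
    """Label connected zones of near-uniform intensity (4-connectivity)."""
    labels = np.zeros(img.shape, dtype=int)
    next_label = 0
    for seed in zip(*np.nonzero(labels == 0)):     # every pixel is a candidate seed
        if labels[seed]:
            continue                               # already absorbed by an earlier zone
        next_label += 1
        labels[seed] = next_label
        zone_sum, zone_n = float(img[seed]), 1
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= rr < img.shape[0] and 0 <= cc < img.shape[1]
                        and labels[rr, cc] == 0
                        and abs(img[rr, cc] - zone_sum / zone_n) <= threshold):
                    labels[rr, cc] = next_label
                    zone_sum += float(img[rr, cc])
                    zone_n += 1
                    queue.append((rr, cc))
    return labels   # each zone could then be rendered at its mean illuminance
```

Raising or lowering the threshold merges or splits zones, which corresponds to the user-adjustable segmentation level the authors propose.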

Harvey and Sawan [32] describe their efforts in two areas: a cortical implant (silicon die mounted on the back of an electrode array) and an external system (scene acquisition, processing, RF communication). The completed prototype allows the testing of various stimulation algorithms and strategies. CCD array (336x244 pixels) output is sent via an analogue to digital converter to a commercially available processor board. The extent of image processing appears to be resolution reduction to 25x25 pixels and colour histogram equalisation.

Werblin and Jacobs [107] propose a cellular nonlinear network (CNN) as a retinal camera, using photodetectors or a conventional CCD as input. Various image processing operations can be performed across the CNN array by changing the values of a set of amplifiers. The authors have used the CNN system to predict patterns of activity at the retinal output. Beneficial features for a retinal camera incorporating a CNN array are proposed:
• Battery powered chip array
• Onboard stored program of image processing algorithms that can be invoked remotely or on the basis of the characteristics of the visual scene (twilight or bright sun, for reading or navigation, high or low resolution)


• Variety of outputs available: motion detection, contrast enhancement
• Background normalised, giving high contrast near the ambient background level

3.4.3 Receptive Field Modeling

German researchers developing retinal prostheses (refer Section 3.3.1.3) have proposed a system that approximates the receptive field properties of primate retinal ganglion cells [26]. While still preliminary in nature, the research is based on a set of individually tuneable spatiotemporal receptive field filters acting on input from a photosensor array. Each receptive field filter is individually tuneable to a wide range of physiologically plausible spatial and temporal frequencies. Details of the receptive field function proposed are contained in [6]. Input data is fed into two distinct filter pathways, one for the centre computation and one for the surround. Each pathway performs a spatial scalar product of the pixel data with a two dimensional Gaussian, whose width determines the spatial extent of the receptive field. The resulting signals are then processed by a temporal low pass filter. The surround pathway signal can be optionally delayed, and then the signals from both pathways converge at a mixer component. Finally a gain factor enables range adaptation and switching between on-off and off-on (centre-surround) behaviour. The resulting signal is then used to stimulate nerve cells. The authors also describe a concept for training the system using visual perception feedback from human subjects: the subject suggests functional changes to the system via a neural net module, based on the difference between the actually perceived visual pattern and the expected perception. This feedback is anticipated as an essential step in the future for tuning a prosthesis to the needs of an individual.
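The centre/surround pathway structure described above can be sketched, purely for illustration, as a pair of Gaussian-filtered pathways followed by a simple first-order temporal low-pass stage and a gain/sign term that switches between on-centre and off-centre behaviour. The filter widths and time constant are illustrative assumptions, not the published tuning, and greyscale frames are assumed to be floating-point arrays.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

class ReceptiveFieldFilter:
    def __init__(self, sigma_centre=1.0, sigma_surround=3.0,
                 tau=0.1, gain=1.0, on_centre=True):
        self.sigma_c, self.sigma_s = sigma_centre, sigma_surround
        self.alpha = 1.0 / (1.0 + tau)        # crude discrete low-pass coefficient
        self.gain = gain if on_centre else -gain
        self._state_c = None
        self._state_s = None

    def step(self, frame):
        """Process one greyscale frame; returns the pattern driving stimulation."""
        frame = np.asarray(frame, dtype=float)
        centre = gaussian_filter(frame, self.sigma_c)     # narrow centre pathway
        surround = gaussian_filter(frame, self.sigma_s)   # wide surround pathway
        if self._state_c is None:
            self._state_c, self._state_s = centre.copy(), surround.copy()
        # Temporal low-pass on each pathway (a surround delay could be added here).
        self._state_c += self.alpha * (centre - self._state_c)
        self._state_s += self.alpha * (surround - self._state_s)
        return self.gain * (self._state_c - self._state_s)   # mixer stage
```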

The hardware for the receptive field processing above is described in [88]. Image acquisition consists of a CMOS chip with a high dynamic range with respect to illumination intensity (140dB). This full dynamic range can be used within a single image frame without any distortions like blooming, smearing or time lag. Two designs have been developed – a 128x128 pixel sensor arranged on a hexagonal grid structure and a rectangular 400x300 pixel sensor. Signal processing is carried out on-chip, so an additional frame buffer is not required unlike conventional CCD devices. The spatial filter used for on- and off-centre receptive field functions is inserted between the sensor chip and the signal processor. The developers propose to house the sensor chip in a package with integrated focusing optics, mounted on a spectacle frame, along with the telemetry unit required for wireless transmission of stimulus data and power for electrode stimulation.

Similar hardware design has very recently been completed in Switzerland [112]. The Swiss researchers have manufactured a thinned CMOS chip which is intended to be placed in the sub-retinal space and remotely powered by an external coil. The output from the system mimics the ganglion response to light: bipolar voltage pulses with light-modulated frequency. The chip has not yet been tested physiologically.

The significance and importance of visual receptive fields in visual processing is supported by Hungenahally [38,39], who has attempted to emulate visual receptive fields and their implementation for image processing in an artificial retina. He has proposed a family of differentio-aggregation functions for information extraction from two dimensional spatial images. He demonstrates the mathematical functions in eradicating sensory noise from medical images and extracting dimensionally selective information.

3.4.4 Multiple Resolution Work

Amerijckx et al [2] describe a remapping algorithm using two CCD cameras and its implementation on a VLSI chip. One tele-lens camera produces high resolution in the central area of the image, while a second wide-angle camera captures peripheral image areas. This system processes these two images in real time to obtain a resulting image with high resolution at the centre, similar to the central part of the retina.

Belgian researchers (refer Section 3.3.2.1) have extended their work on prosthetics to sensory substitution devices converting vision to sound [11]. The image processing involved in their models is based on their identification of the main features of the primary visual system: lateral inhibition and graded resolution. Lateral inhibition is implemented by an edge detection filter and graded resolution is modelled using a multi-resolution artificial retina based on the filtered image. An example of this graded resolution is given – in a grid of 8x8 large pixels, the 16 central pixels are each divided into four pixels to build a medium resolution grid of 8x8 pixels. In this second grid, the 16 central pixels are again divided into four pixels to build a high resolution grid of 8x8 pixels etc.
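A hedged sketch of this kind of graded (foveal) resolution is given below: the whole image is rendered with large blocks, the central region is re-rendered with blocks half the size, and an inner central region with blocks half that size again. The block sizes, the number of levels and the choice of "central half of the current region" are illustrative assumptions rather than the published design.

```python
import numpy as np

def block_average(img, block):
    """Render an image with square blocks of the given size (block mean)."""
    h = (img.shape[0] // block) * block
    w = (img.shape[1] // block) * block
    means = img[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return np.kron(means, np.ones((block, block)))   # expand back to pixel size

def graded_resolution(img, base_block=16, levels=3):
    out = block_average(img, base_block)
    r0, r1, c0, c1 = 0, out.shape[0], 0, out.shape[1]
    block = base_block
    for _ in range(1, levels):
        block //= 2
        # Shrink the window to the central half of the current region.
        r0, r1 = r0 + (r1 - r0) // 4, r1 - (r1 - r0) // 4
        c0, c1 = c0 + (c1 - c0) // 4, c1 - (c1 - c0) // 4
        sub = block_average(img[r0:r1, c0:c1], block)
        out[r0:r0 + sub.shape[0], c0:c0 + sub.shape[1]] = sub
    return out
```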

This foveal vision representation has also been implemented in a head mounted display unit coupled with an eye tracking system [42]. The authors claim that conventional HMDs suffer from a narrow field of view and low resolution and consequently cannot be used for applications such as tele-microsurgery. Their HMD displays high resolution at a subject’s view point (obtained by an eye tracker) and low resolution at the periphery, therefore displaying images at a higher perceived resolution in a wider view angle.

The multiple resolution approaches described above may have application where bandwidth is limited. However, it is most likely that the pixel density in a device would be fixed at the maximum possible under manufacturing and size constraints. Improved scene understanding is expected when the entire electrode layout is used rather than applying low resolution image sections to some parts of the implant.

3.5 Digital Imaging Applicable to Visual Prostheses

This section discusses digital imaging applied to visual prostheses and reviews useful image processing methods that could enhance visual information presented to visually impaired users.

3.5.1 Digital Imaging and Human Vision

There are many parallels between digital imaging environments and the human visual system. Visual sensations are preprocessed from over 100 million rods and cones (the photoreceptors in the retina), to approximately 1.5 million optic nerve fibres, with conduction time from sub-retina to the lateral geniculate nucleus in the order of 1-5ms [102]. The capability of the human visual system for resolving fine detail and edges and ignoring uniform regions has been shown to be biologically hard-wired into our retinas. Connected directly with the rods and cones of the retina are two layers of processing neurons that perform an operation very similar to the Laplacian operator that highlights the points, lines and edges in an image and suppresses uniform and smoothly varying regions [83]. Furthermore, the processes that occur in the visual cortex when a person examines a visual scene make use of feature extraction and object recognition processes, mimicked by computer vision techniques [47]. This complexity suggests that in artificial vision systems, image processing and manipulation would have a more significant role than simply a camera and display package.

The image processing aspects of the artificial vision systems under development are largely based on manipulating a pixelised representation, where a scene is represented as an organised phosphene array. For example, the mandrill image represented in various pixelised versions of different spatial resolutions is shown below in Figure 3.2.

[Figure panels, left to right: 64 x 64, 32 x 32, 16 x 16 and 8 x 8 pixels]

Figure 3.2: Pixelised vision; top – greyscale, bottom – binary images

Each picture element, or pixel, would ideally correspond to a stimulating electrode in the implant. The top row shows greyscale images, which are unlikely in prosthesis prototypes. More likely are binary images (bottom row), where a pixel is either ON or OFF. The viewing distance also affects image interpretation: the coarser resolution versions above (16 x 16, 8 x 8) are more comprehensible from greater viewing distances.
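This pixelised representation can be produced with a simple block-averaging sketch such as the one below, which reduces a greyscale image (assumed to be a 2-D NumPy array in [0, 1]) to an n x n grid and optionally thresholds it to a binary ON/OFF pattern; each grid value would then drive one electrode. The grid size and threshold are illustrative.

```python
import numpy as np

def pixelise(img, n=25, binary=False, threshold=0.5):
    """Average the image over an n x n grid; optionally binarise the result."""
    h, w = img.shape
    rows = np.linspace(0, h, n + 1, dtype=int)
    cols = np.linspace(0, w, n + 1, dtype=int)
    grid = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            grid[i, j] = img[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].mean()
    return (grid >= threshold).astype(float) if binary else grid

# e.g. a 16 x 16 binary phosphene pattern:
# pattern = pixelise(img, n=16, binary=True)
```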


The pixels shown in Figure 3.2 are also shown as squares touching adjacent pixels along each border. Patients undergoing electrical stimulation report visual sensations as a ‘spot of light’. Therefore it may be more representative to model pixels as circular with a gap between adjacent pixels as shown in Figure 3.3 below.

[Figure panels, left to right: 25x25 greyscale (square pixels); 25x25 circular greyscale; 25x25 circular binary]

Figure 3.3: Circular pixelised vision
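To simulate the circular appearance of phosphenes with gaps between them, a grid such as that produced by the pixelise sketch above can be rendered as discs on a dark canvas. The cell size and disc radius below are illustrative assumptions.

```python
import numpy as np

def render_phosphenes(grid, cell=12, radius=4):
    """Draw each grid value as a filled disc on a black canvas."""
    rows, cols = grid.shape
    canvas = np.zeros((rows * cell, cols * cell))
    yy, xx = np.mgrid[0:cell, 0:cell]
    disc = (yy - cell / 2) ** 2 + (xx - cell / 2) ** 2 <= radius ** 2
    for i in range(rows):
        for j in range(cols):
            canvas[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell][disc] = grid[i, j]
    return canvas

# display = render_phosphenes(pixelise(img, n=25, binary=True))
```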

Given that only a limited number of stimulating electrodes is physically possible, it is evident that some type of information content enhancing processing is required, as described later.

An immediate problem is selecting a useful number and pattern of phosphenes for a prosthesis. Until clinical testing progresses with human subjects, the degree of success of an implant in creating phosphenes in the visual field of the subject is unknown. For example, if a patient is fitted with a 100-electrode implant, will they be able to see more or less than 100 phosphenes in their visual field? To date, there are no physiological stimulation models in the literature that predict the number of phosphenes that will be produced from a given number of electrodes. For example, the Dobelle research team [20] report that each implant electrode produces one or more phosphenes in the visual field, while Schmidt et al [86] report that 34 phosphenes were produced from 38 electrodes.

The issue is compounded by the different physical stimulation strategies followed. In addition to varying the stimulation parameters, such as current amplitude or pulse width duration, there is the potential to make use of different current flow and return paths. For example, a single stimulating electrode and single return electrode may give rise to a small high intensity dot in the visual field. Using the same single stimulating electrode but with two or several return electrodes may give a different charge density profile on the tissue which may give rise to a broader low intensity patch of light (see Figure 3.4).


Figure 3.4: Alternate stimulation strategies (a) single electrode pair; (b) single stimulating electrode and multiple return electrodes

Frequency encoding might be another possibility where an image is transformed to the frequency domain. Here, image information is represented as signals having various amplitude, frequency and phase characteristics, which could be delivered to different locations within an electrode array. This has similarities with auditory implants, where an audio signal is split into frequency bands and delivered to different locations within the cochlea for improved perception [4,65]. While important to the final design of prostheses, the issues of prediction of phosphene numbers and patterns described above require physiological testing and modelling and thus would be impacted by the other physiological properties of implants. These aspects, such as biocompatibility and encapsulation of the internal electronics, and evaluation of acceptable stimulation waveforms that prevent tissue damage, are outside the scope of this thesis.

3.5.2 Image Characteristics and Visual Understanding

There is a wide range of features, or characteristics, of digital images that give us visual understanding to varying degrees [83]. However many characteristics are not compatible with the modelling of the anticipated evoked visual sensations of visual prostheses. For example, it may not be physically possible to control the colour of a phosphene, and make it RED, then BLUE on the following stimulation, followed by GREEN on the next, etc. Colour processing, along with other future possibilities, is discussed further in Chapter 8.


Image characteristics that are compatible with simulating artificial vision and have variations that can be tested are the following:
• Spatial resolution
• Brightness
• Contrast
• Edges
• Distance information
• ‘Importance’ mapping (using the notion of combining several of the above factors to add value to the information content of an image).

Results of subjective tests using these image characteristics are presented later in Chapter 4, and the next sections provide background to these features.

3.5.2.1 Spatial Resolution

Maximising the number of electrodes in an implant to give high spatial resolution would certainly enhance the information content of images. Some preliminary researchers in visual prosthesis systems conjectured the following three approaches in the 1970s [99]:
1. Small matrix size – coded information. Due to the small matrix size (10x10 or less), information must be categorised and encoded to maximise information delivery. This system cannot effectively provide a direct two dimensional representation of space, but must extract the significant environmental features and present them in coded format. This approach places a heavy demand on the learning capacity of the user.
2. Intermediate matrix – preprocessed input. With a matrix size of between 20x20 (400 points) and 32x32 (1024 points), an effective two dimensional display can be achieved. Simulation experiments carried out on sighted viewers [8] suggested that a phosphene matrix containing 600 points (24x24) would be sufficient to permit a reading speed of 120 words per minute, where 10 letters were presented at a time to subjects. The combination of a suitable field range for detection of peripheral hazards with adequate central resolution for useful object identification presents a severe challenge at this matrix size.
3. Maximum density matrix – direct spatial display. A 4000 point (64x64) display can provide a fairly good image of a face.


Other simulation work to determine how many electrodes would be needed to provide useful vision has been done by Cha et al at the University of Utah [12,13,14]. Normally sighted human subjects wore a video camera attached to a head-mounted visor which simulated pixelised vision. The tests covered visual acuity, reading speed and mobility performance. Images were pixelised and projected on to a small monochromatic monitor. To create the illusion of phosphenes, perforated masks that represented different pixel densities and field sizes were placed between the eye and the monitor. The conclusions drawn from these studies were:
• The most important factor in visual acuity was pixel density (spacing). However the most important factor in reading speed was pixel number, not spacing.
• When using low density masks, acuity was increased with voluntary head movements.
• A 25 x 25 array provided a visual acuity of 20/30, which allowed a reading speed of 100 words/min and good obstacle avoidance.

More recent (2003) studies simulating vision performance at different spatial resolutions have been carried out at Johns Hopkins University and the University of Southern California. One study involved presenting pixelised face representations of 10x10 to 32x32 spatial resolution [96]. The researchers found that parameters such as contrast, grid size, dot size, dot gap, drop out rate and greyscale resolution had a significant effect on facial recognition speed and accuracy. In a separate study [33], 4x4, 6x10 and 16x16 electrode arrays were simulated in a number of performance tasks including four choice orientation discrimination of a Sloan letter E, object recognition and discrimination, a cutting task, a pouring task, symbol recognition and two reading tasks. Subjects performed best using the 16x16 array, which corresponded to a visual acuity of 20/420, although simple objects and symbols could still be recognised sporadically at the lowest resolution array.

Thus, given that an implant could deliver sufficiently high phosphene numbers to the visual field, the ability to read and navigate around obstacles is achievable. It should be noted that the performance stated above is based on the assumption that each electrode produces a corresponding phosphene in an ordered array in the visual field.


While more electrodes should ideally equate to improved perception, the upper limit on electrode numbers may be determined by the small space available to implant the array, together with the minimum electrode spacing required for adequate phosphene resolution. This is a manufacturing constraint and is outside the scope of this thesis. Spatial resolution is mentioned here as a technique to explore low quality image perception.

3.5.2.2 Brightness Modulation

Multiple brightness levels in images may be highly informative. Figure 3.5 shows how brightness modulation might be simulated for the mandrill image. The top images show circular pixelated versions using eight and three greylevels compared with the original at 256 greylevels. The bottom images show variations of a halftoning technique [82] with different pixel radii and dot orientations. The goal of halftoning is to preserve the visual impression of grey tones even though each pixel is either black or white. Increasing the number of greylevels is considered to be psychophysically equivalent to increasing the dot/phosphene size [49].

[Image panels – original 128x128; circular pixelated 25x25 versions with 8 and 3 grey levels; halftoned versions with 4 and 6 pixel radii at 45 and 0 degree dot orientations.]

Figure 3.5: Simulating the effect of modulating phosphene brightness
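As an aside, the greyscale reduction shown in Figure 3.5 is straightforward to simulate in software. The following sketch is a minimal illustration only, assuming an 8-bit greyscale image held in a NumPy array; it is not the exact procedure used to generate the figure.

import numpy as np

def simulate_phosphene_greyscale(image, grid=25, levels=3):
    """Downsample an 8-bit greyscale image to a grid x grid phosphene
    array and quantise it to the given number of grey levels.
    Assumes the image is larger than the grid and levels >= 2."""
    h, w = image.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    cells = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            # average the source pixels falling within each phosphene cell
            cells[i, j] = image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    # quantise the cell averages to 'levels' equally spaced grey values
    q = np.floor(cells / 256.0 * levels).clip(0, levels - 1)
    return (q * (255.0 / (levels - 1))).astype(np.uint8)

For example, simulate_phosphene_greyscale(img, grid=25, levels=3) yields the kind of 25x25, 3-greylevel rendering shown in the top row of the figure.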


Physiologically, the brightness of induced phosphenes has been found to vary with stimulus amplitude, frequency and pulse duration [86] and, in other studies, to be logarithmically related to stimulus current amplitude [35]. This suggests that in principle, several (2–4) bits of greyscale/size variation should be achievable. For early prototypes, however, only a 1-bit grey scale might be possible, producing only binary (black and white) images.

3.5.2.3 Contrast

Contrast affects the detection of many kinds of image features (eg. regions, edges, textures) and is a fundamental characteristic of early human vision [3]. In some visual environments, such as reading black text on a white background, it may be better to deliver negative or inverse images. In any case, the ability to enhance contrast may prove useful for highlighting image content that would otherwise be much harder to see.
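A minimal sketch of the two operations mentioned here, contrast inversion and a simple linear contrast stretch, is given below. It assumes an 8-bit greyscale NumPy array and is illustrative only; it is not a prescription for how a prosthesis would implement contrast control.

import numpy as np

def invert(image):
    """Reverse-contrast (negative) version of an 8-bit greyscale image."""
    return 255 - image

def stretch_contrast(image):
    """Linearly stretch the intensity range of an 8-bit image to span 0-255."""
    lo, hi = int(image.min()), int(image.max())
    if hi == lo:                      # flat image: nothing to stretch
        return image.copy()
    scaled = (image.astype(float) - lo) * 255.0 / (hi - lo)
    return scaled.astype(np.uint8)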

3.5.2.4 Edge Detection

In scene recognition and interpretation, edges play a fundamental role [56]. Edges assist in the formation of a primal sketch to derive shape information from images. Also there are biological mechanisms for detecting oriented zero-crossing segments (edges) in retinal ganglion cells. An essential function of an artificial vision system would be to highlight the edges of objects. The Dobelle research team [20] expect improved results for patients with the implementation of Sobel filters for edge detection. The prominence of uniformly shaded areas could be decreased, while edges that might otherwise be hardly noticeable could be highlighted.
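As an illustration of this kind of edge filtering, the sketch below computes a Sobel gradient magnitude with an optional threshold to produce a binary edge map. It assumes SciPy's ndimage module and an 8-bit greyscale input, and it is not the Dobelle group's implementation.

import numpy as np
from scipy import ndimage

def sobel_edges(image, threshold=None):
    """Sobel gradient magnitude of a greyscale image, scaled to 0-255;
    if a threshold is supplied a binary edge map is returned instead."""
    img = image.astype(float)
    gx = ndimage.sobel(img, axis=1)   # horizontal gradient
    gy = ndimage.sobel(img, axis=0)   # vertical gradient
    mag = np.hypot(gx, gy)
    mag = mag * 255.0 / (mag.max() + 1e-9)
    if threshold is None:
        return mag.astype(np.uint8)
    return ((mag >= threshold).astype(np.uint8)) * 255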

3.5.2.5 Distance Indication

An artificial vision system that conveys the distance to an object would be particularly useful. While sonar distance aids have been common for many years, the auditory signal emitted by these devices can interfere with important surrounding environmental noises. The ability to convey distance visually rather than audibly would therefore be desirable. Distance information can be obtained by computing depth from the disparity between two cameras, or by using ultrasonic or laser rangefinders.


Distances could then be mapped to intensities, where the nearest object is shown with the highest intensity. If the device display only supports a 1-bit grey scale, only the nearest object need be displayed. This distance 'mode of operation' could be quite useful in combination with a standard image of luminance intensities.
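A minimal sketch of this distance-to-intensity mapping is given below. It assumes that a depth map in metres is already available (for example from stereo disparity or a rangefinder) and that only a few grey levels can be displayed; the function name and parameters are illustrative only.

import numpy as np

def distance_to_intensity(depth, levels=2, max_range=5.0):
    """Map a depth image (in metres) to display intensities so that the
    nearest surfaces are brightest. With levels=2 only surfaces within
    half of max_range are lit, approximating a 1-bit display."""
    d = np.clip(depth, 0.0, max_range)
    closeness = 1.0 - d / max_range          # 1 = nearest, 0 = at or beyond range
    q = np.floor(closeness * levels).clip(0, levels - 1)
    return (q * (255.0 / (levels - 1))).astype(np.uint8)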

3.5.2.6 Importance Extraction

A feature of an efficient artificial vision system would be importance extraction: presenting only the most important objects in a scene and disregarding the uninteresting or homogeneous elements. Section 2.4 covered several region-of-interest algorithms which aim to predict where the human eye fixates on an image. An extension of these algorithms is the concept of assigning an importance score or weighting to each area in an image to generate an "importance map" [52,68]. This importance ranking has previously been applied in visually lossless compression, where improved compression ratios have been achieved with high perceived image quality.

This importance ranking could be used in artificial vision systems to identify the most important object in a scene and present only this object, discarding the remainder. The definition of importance may comprise some combination of motion, location, contrast, size and shape. The components that comprise "importance" may also be adjusted for different viewer situations, eg. home, entertainment and mobility.

Importance ranking could also be used to optimise the bandwidth for data transfer in artificial vision systems. If the bandwidth is limited, variable resolution could be applied to the image on the basis of importance. Homogeneous or uninteresting scene elements would be displayed at low resolution, while important areas, such as edges and moving objects, would be displayed at high resolution. Thus a decreased bit-rate could be used to present an image of high perceived quality.


Figure 3.6 depicts the importance map process as an example of extracting important areas from within an image. An image is first segmented into regions of similar properties; a split-and-merge segmentation algorithm based on grey level variance is used. Feature maps/images are then constructed from the segmented image corresponding to five features known to influence attention:
1. closeness – the closer an object, the more important;
2. intensity contrast – regions of high intensity contrast from surrounding regions are more important;
3. shape – elongated regions are more important than round regions;
4. size – the larger a region, the more important;
5. centralness – regions in the centre of the viewing area are more important.

Each region in the feature map is assigned an importance score, normalised from 0 (not important) to 1 (very important). That is, lighter areas in the feature maps should grab a viewer's attention more than darker areas. An overall Importance Map is then created by combining the five feature maps. A normalised sum of squares is performed as indicated below:

R.I. = [ω1•(M1)² + ω2•(M2)² + ω3•(M3)² + ω4•(M4)² + ω5•(M5)²] / R.I.max        (Equation 2)

where:
R.I. = Region Importance
ωi = weight applied to each feature map
M1 = Closeness Map
M2 = Contrast Map
M3 = Shape Map
M4 = Size Map
M5 = Central Map
R.I.max = maximum Region Importance
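A minimal sketch of the combination step in Equation 2 is given below. It assumes that the five per-region feature maps have already been computed and normalised to the range 0–1 and that the weights are supplied by the caller; it is illustrative only and not the exact implementation used in the thesis experiments.

import numpy as np

def combine_feature_maps(maps, weights):
    """Combine per-region feature maps into a Region Importance map using
    the normalised sum of squares of Equation 2. 'maps' is a sequence of
    five arrays (closeness, contrast, shape, size, centralness), each
    holding scores already normalised to the range 0-1."""
    maps = [np.asarray(m, dtype=float) for m in maps]
    ri = sum(w * m ** 2 for w, m in zip(weights, maps))
    ri_max = ri.max()
    return ri / ri_max if ri_max > 0 else ri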


[Figure panels – original and segmented images; Closeness, Contrast, Shape, Size and Central feature maps combined with weights ω1–ω5 into the Importance Map.]

Figure 3.6: Importance Mapping concept


Figure 3.7 below shows a post and chain which would pose a hazard to a blind person. A 16 x 16 resolution copy of the image is shown adjacent to the original. It should be noted that the 16 x 16 image is shown in full grey-scale which is unlikely to be possible in vision prostheses. Although one can discern the dark blob of a post, the shadow of the post provides a confusing visual cue, and the safety chain attached to the top of the post is not evident.

[Image panels – original; 16x16 image (full grey scale); 16x16 Importance Map; 16x16 Distance Map.]

Figure 3.7: Safety Post enhancement with advanced image processing techniques.

The bottom left image in Figure 3.7 shows the enhancement provided by mapping ‘importance’ to intensity. The image is shown with 4 grey levels, as might be achieved in vision prostheses. Regions assessed as important (high contrast, large in size, long and slender, central to the image etc.) are represented with the highest intensity. It is noted that the safety chain is now evident but the shadow of the post is also present.


The bottom right image in Figure 3.7 shows the distance mapping discussed in Section 3.5.2.5. The regions closest to the viewer are presented with the highest intensity. It is noted that the chain is evident but the post shadow is not discernible.

Another example of the same processing applied to a different outdoor scene is shown below in Figure 3.8. It is believed that a beneficial image processing system would provide several of these 'modes of operation' so that as many visual cues as possible can be gained from the low quality image. Experiments described in Chapter 4 will test this conjecture.

[Image panels – original; 16x16 image (full grey scale); 16x16 Distance Map; 16x16 Importance Map.]

Figure 3.8: Enhancing the information content of a low quality image of stairs.

3.6 Thesis Research Questions and Approach

In consideration of the visual prostheses literature presented in this chapter and within the image quality framework described in Chapter 2, several issues can be identified to drive research questions. These research questions are described in the next section followed by an outline of how these questions will be addressed in the remaining thesis chapters.


3.6.1 Image Processing Requirements

Within the scope of this thesis, there are several image processing requirements for visual prostheses. Visual prostheses need to:
1. facilitate some recognition performance while bounded by an ultra low image quality regime;
2. allow a user to gain as many visual cues from a scene as possible;
3. use simple low level processing to improve scene understanding;
4. convey maximum scene information; and
5. deal with different scene types.

These requirements drive several research questions around which the remaining thesis chapters are based:

Q1: Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved?
Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?
Q3: Can a model be constructed for basic information required for the interpretation of a visual scene at low image quality?
Q4: Should the processing techniques be adjusted depending on the scene type?

In the low image quality domain where spatial resolution is limited, an effective approach to improving scene understanding is Region-of-Interest (ROI) modelling, which extracts salient or important areas within an image. It is reasonable to expect ROI processing to be effective because these methods trim away information that may not be relevant to scene understanding. Other fields, such as image compression, military applications and advertising, use ROI processing to extract features and regions where the human eye might fixate in an image. Therefore we expect that such techniques could be incorporated into visual prostheses to trim away the large amount of data in an input image. Thus the limited number of display pixels (implant electrodes) would be used most efficiently by presenting to blind users only the important elements of a scene.


It is expected that ROI processing will provide an improved outcome over the standard (or Base Case) type of processing in prostheses, which consists of subsampling to match the spatial resolution of the electrode array and binarisation. The Importance Map ROI technique discussed in Section 3.5.2.6 is selected for the thesis experiments because it is computationally cheap and variations can be constructed around a standard model to alter the appearance and hence the interpretability of the final processed image.

It is also expected that a model can be constructed for basic information required for the interpretation of a visual scene at low image quality. Image quality is characterised thoroughly in the literature for high quality images but not for low quality images.

3.6.2 Testing Method

The research questions will be tested by experiments with normally sighted viewers. As explained at the commencement of the thesis in Section 1.3 - Scope, prosthesis development in Australia is currently limited to animal models. Thus use of normally sighted viewers is considered the only option for simulation studies at this time.

Several simulation experiments are presented as follows:

Research Question | Thesis section
Q1: Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved? | Ch4 – Recognition experiments; Section 4.2: processing techniques compatible with visual prostheses; Section 4.3: recognition & influence of image type
Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing? | Section 4.2: computationally cheap region-based (Importance Map) method; Ch7: several ROI methods compared
Q3: Can a metric be constructed for basic information required for the interpretation of a visual scene at low image quality? | Ch5 – Quantifying Information Content; Section 5.3: information content model; Section 5.4: recognition & information content
Q4: Should the processing techniques be adjusted depending on the scene type? | Ch6: scene specific imaging

Table 3.1 – Thesis experiments

Q1 is tested through recognition experiments described in Chapter 4. In Section 4.2, several image processing techniques which may lead to improved perception of low quality images are assessed by normally sighted viewers. This testing is intended to build an understanding of low quality image perception and uses several processing techniques described in Section 3.5.2 as being compatible with visual prostheses. Section 4.3 describes a separate experiment assessing perception of low quality images and the influence of image type.

Q2 concerns Region-of-Interest (ROI) processing applied to low quality images. ROI processing was presented in Section 2.4 as a powerful perception modeling tool using a combination of early vision and cognitive effects. Section 4.2 experiments establish the applicability of a computationally cheap region-based (Importance Map) technique to low quality images. Several variations of this method are compared with a pixel-based (Saliency Map) technique in Chapter 7.

The construction of a metric in response to Q3 is described in Chapter 5. This Chapter expands previous work in the area of visual complexity (Section 2.5) and links visual complexity with information content. A robust metric to predict perceived information content is developed from one series of subjective data and tested against additional data. Also correlations are made between subjective information content and object recognition for low quality images.

Q4 concerns the influence of different environments for the visual prosthesis user. The concept of tailoring image processing to the scene type is tested in Chapter 6.


There is a lack of fundamental theory relating to the specifics of image understanding, and consequently the above research questions represent opportunities to refine this knowledge by subjective testing. A variety of viewer behaviour is expected due to individual preferences influenced by past experiences and expectancies, and hence the instructions given to viewers influence these variations. In the experiments described in the following chapters, viewers were advised that: "The images appear as just a range of blocks – you may not be able to see anything in the images at all. However this quality level is similar to what a blind person might see with a bionic eye."

A final comment relating to the subjective testing is that there were variations in the experiment sample sizes due to the availability of volunteers, ranging from n = 20 to n = 247. The strength of the findings is influenced by the sample size (results from the smaller samples are more suggestive than conclusive). In addition, some experiments tested several factors simultaneously, so the sample assessing one factor was much smaller than the total number of participants involved in the experiment. For example, 225 participants assessing 9 image quality classes represents a sample size of only 25 per class.

3.7 Chapter Summary

This chapter has described the research activities underway internationally in the field of electronic visual prostheses. Research efforts were described for approaches aimed at stimulating the retina, optic nerve and visual cortex. Image processing aspects for many of these projects were also described.

The chapter also identified processing techniques that are compatible with the anticipated evoked visual sensations of visual prostheses. These included spatial resolution, brightness modulation, contrast, edges, distance information and importance mapping.

Finally several image processing requirements for visual prostheses were identified which drive research questions to be addressed by thesis experiments. An outline was given of the subjective testing proposed for the remaining thesis chapters.


Chapter 4 Recognition Performance

4.1 Overview

This chapter describes two preliminary experiments aimed at exploring low quality image perception. It aims to answer the research questions:

Q1: Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved? It is anticipated that some recognition is possible and that different types of images result in varied recognition.

Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing? There is reason to believe that ROI processing will trim away unnecessary information resulting in improved perception.

Section 3.5.2 discussed some image characteristics that are compatible with the anticipated evoked visual sensations of visual prostheses. Adjustment of these characteristics may result in enhanced information delivery to blind users of visual prostheses. However, the extent of their success is difficult to quantify without experimentation. Therein lies the framework for the first experiment described in Section 4.2.

The second experiment described in Section 4.3 aims to quantify recognition performance for low quality images by constructing an envelope of recognition. The effect of the type of image is also determined from this experiment.


4.2 Subjective Tests to Determine Useful Processing Methods

As acknowledged above, there is a need to quantify the perceptual effect of adjusting various image characteristics for improving scene recognition and understanding. This section describes psychophysical testing of possible operating modes for an artificial vision system, to identify the most informative image adjustments that could be made for improved understanding of picture content. The experiments are aimed at assessing the performance of the proposed processing techniques. This assessment is achieved by presenting degraded images to normally sighted viewers and asking them to identify the scene and make use of the information presented. The images presented have varying levels of resolution, greyscale, edge detection, importance extraction and distance mapping.

4.2.1 Methodology

The subjective testing was undertaken by way of a booklet questionnaire survey issued to participants. The booklet contained 20 pages of test patterns. Each page or test pattern contained between 4 and 9 images, which were different versions of the same object. An example of a booklet page is shown in Appendix Section A.1. The different versions represented the various processing methods (edge detection, distance mapping, inverse image, importance mapping) which were to be compared against each other. The subjects were asked to write a description of the object and rank the top three images that they believed showed the object most clearly.

Participants were drawn on a voluntary basis from senior high school students. School students were chosen as subjects because of the large numbers available and to reduce the likelihood of familiarity with image processing issues (eg. holding a low spatial resolution image at a distance to discern objects). The subjects had been given no prior background information as to the nature of the images except that the pictures 'may be similar to what a blind person might see with a bionic eye and are scenes that people are likely to see when walking about'. Viewing conditions for the experiment were not controlled.


4.2.2 Images Chosen

While there are several good image databases available to the computer vision community, it was desired to produce unique images for the project which would not have been seen by others. This reduces inconsistencies in the results that could arise from subjects having a priori knowledge of the images.

The image set was composed of chairs, doorways, posts, steps, and faces which were considered to form mobility hazards within a visually-impaired person’s environment. Two different types of each hazard were included (see Figure 4.1).

[Image thumbnails – chair 1, chair 2, doorway 1, doorway 2, face 1, post 1, post 2, steps 1, steps 2, face 2.]

Figure 4.1: Image set used in the psychophysical testing

Variations of image characteristics that are applicable to visual prostheses were applied to the images. The spatial resolution and greyscale of images were representative of current prototypes – 10x10, 16x16, and 25x25 pixel images with either 2 (black and white) or 3 grey levels (black, grey, white). An inverse (reverse contrast) was included along with an edge image, and Distance and Importance Maps.

A phosphene mask was applied to each image to create the illusion of phosphenes (ie. pixels were circular and did not touch each other along their borders). An example of the types of images presented is shown below in Figure 4.2.


[Image panels – original; spatial resolution variations 25x25, 16x16, 10x10; inverse; 3 grey levels; edges; Distance Mapping; 'Importance' Mapping.]

Figure 4.2: Image Processing techniques used in the psychophysical testing
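The circular, non-touching phosphene appearance described above can be approximated in software. The sketch below is a simplified illustration (not the exact rendering used to prepare the booklets); it assumes a small NumPy array of grey levels, such as the 25x25 output of the pixelisation step, and draws each cell as a circular dot on a black background.

import numpy as np

def render_phosphenes(cells, dot_pixels=12, fill=0.8):
    """Render a small grey-level array (e.g. 25x25) as circular phosphenes:
    each cell becomes a dot of diameter fill * dot_pixels centred within a
    dot_pixels x dot_pixels block, and the gaps between dots stay black."""
    rows, cols = cells.shape
    out = np.zeros((rows * dot_pixels, cols * dot_pixels), dtype=np.uint8)
    radius = 0.5 * fill * dot_pixels
    yy, xx = np.mgrid[0:dot_pixels, 0:dot_pixels]
    centre = (dot_pixels - 1) / 2.0
    mask = (yy - centre) ** 2 + (xx - centre) ** 2 <= radius ** 2
    for i in range(rows):
        for j in range(cols):
            block = out[i * dot_pixels:(i + 1) * dot_pixels,
                        j * dot_pixels:(j + 1) * dot_pixels]
            block[mask] = cells[i, j]
    return out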

Only two pages with the same object appeared in one booklet to minimise learning effects. In establishing the order of the two images in the booklet, the object version with the lower resolution was always presented first. The booklet page order was chosen to ensure substantial differences in appearance between successive booklet sheets (refer Appendix Section A.2).

4.2.3 Results

The questionnaire survey was completed by 174 high school students in their penultimate year of study. In analysing the results of the questionnaire surveys, it could not be assumed that a blank entry counted the same as ‘Don’t know’. Therefore the sample size was reduced to exclude blank entries. The experiments were designed to answer a number of questions which appear in italics in this section.

Q. What were the most recognisable objects in the survey? Object recognition was assessed by analysing the respondent's guess of the object. In the analysis, a range of descriptions was accepted when assessing the recognition of objects, as the context or environment of the user would contribute perception cues. Users of artificial vision systems would (presumably) be aware of their surroundings and would consequently be able to place objects in their context. In addition, more powerful interpretation and increased understanding of a scene would be gained from a rapid succession of images (ie. a movie versus a single image), and from moving about to see how various objects interact.

Many of the responses in the survey indicated that the person was able to recognise the object as having certain properties, but when it came to naming the object, the description was wrong. These descriptions were deemed to be correct (contextual recognition) given that a person interpreting the image would have knowledge of the context of the object. An example of this is 'Post 1', which some respondents named as a 'cactus in the desert': its height and slender form identified it as an object to be avoided. If the same image were viewed on a city street, the description would be likely to be more representative of the actual object. Other examples are the face images, where a respondent was able to recognise a face and head but associated the wrong gender with the face. Appendix Section A.3 documents the recognition assessment for all images. Note the listing comprises only borderline responses and is not a complete list.

Combining the results from all test patterns (multiple resolutions and grey scale, and different processing methods), the recognition rate was as follows:

[Chart data (n=168) – recognition rates: face 1: 98%; face 2: 98%; chair 1: 92%; post 1: 69%; chair 2: 33%; door 2: 31%; post 2: 20%; door 1: 19%; steps 2: 16%; steps 1: 4%.]

Figure 4.3: Recognition rate for objects in the image set


Figure 4.3 includes 95% confidence intervals around mean recognition rates across all test patterns. The most recognisable objects were the two faces, with 98% recognition. Chair 1 was also highly recognised. Face recognition has been identified as one of the foremost visual learning steps in the human baby [29]. In studies where the eye positions of babies were monitored when presented with visual stimuli, the babies spent longer looking at true face pictures than at other stimulus patterns, including patterns where the same face components (mouth, eyes etc) were present but rearranged spatially. This finding may give evidence of an immediate visual response to biologically important objects. The result also agrees with studies looking at the specialised processing required for face recognition [97]. There is neurophysiological evidence for the existence of neurons in the temporal lobe of monkeys, sheep and humans which respond selectively to faces. Particular neurons are sensitive to the direction of gaze and have a maximum response when the face is viewed straight on. Faces are encoded in terms of differences from a prototypic 'average' face or caricature, where differences are assessed relative to a norm.

Interestingly, Chair 2 was difficult to recognise, and its round features contributed to many animal impressions in the responses. As mentioned above and in other sources [34], past experiences and expectancies influence visual perception. Had subjects been told they were in a room containing office equipment, one would not expect responses such as "animals". Thus perception performance in this assessment can be considered a worst case: no context or hints were provided and static (still) images were viewed. The ability of the brain to interpret low-information images even in this worst case is apparent.

Analysis of Variance (ANOVA) on the data shown in Figure 4.3 reveals significant differences in recognition rate for the ten images tested. The test was based on 12 observations per image (refer to the booklet layout shown in Appendix Section A.2) and resulted in F(9,110) = 200 > Fcritical = 2.0 at α = 0.05, with P = 5.8E-64 (highly significant). Thus there are significant differences between the mean recognition rates when averaged across all test patterns (multiple resolutions and grey scale, and different processing methods).
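For reference, a one-way ANOVA of this form can be reproduced with standard statistical software. The sketch below uses SciPy with hypothetical recognition values for three of the objects; it illustrates only the form of the test and does not use the thesis data.

from scipy import stats

# Hypothetical recognition rates for three of the ten objects (illustrative
# values only, not the thesis data); each list holds one value per test pattern.
face1  = [0.99, 0.97, 0.98, 1.00, 0.96, 0.99]
chair2 = [0.35, 0.30, 0.33, 0.28, 0.36, 0.31]
steps1 = [0.05, 0.03, 0.04, 0.06, 0.02, 0.04]

f_stat, p_value = stats.f_oneway(face1, chair2, steps1)
print(f"F = {f_stat:.1f}, P = {p_value:.2e}")   # reject H0 if P < 0.05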


Another fundamental aspect to determine from the testing was the effect of spatial resolution on object recognition. The result found is represented graphically below in Figure 4.4. The plot shows 95% confidence intervals around mean recognition rates for the ten images used in the test.

[Chart data (n=140) – B&W images: 48% (10x10), 44% (16x16), 49% (25x25); 3 grey level images: 49% (10x10), 50% (16x16), 53% (25x25).]

Figure 4.4: Effect of spatial resolution and grey-scale on object recognition.

Q. How does greyscale affect recognition? Although the differences in Figure 4.4 are fairly small, it can be seen that at a particular resolution, images with 3 grey levels (white, mid-grey and black) are more recognisable than black and white images (Figure 4.5).

Statistical testing shows however that the differences are not significant. A two sample t-test was performed using 30 observations (3 spatial resolutions across 10 different images) at α = 0.05. The test assessed the hypotheses:

H0: There is no recognition difference when adding greyscale;

H1: The addition of greyscale achieves significantly different recognition results; ie. a two-tailed t-test. A t-statistic of -0.4 was obtained, well below the critical t value of 2.0 for 58 degrees of freedom. The corresponding significance was P = 0.71, and since this is greater than 0.05, H0 was not rejected: adding greyscale does not produce a significant difference in recognition.
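For reference, a two-sample t-test of this form can be reproduced as follows; the sketch uses SciPy with hypothetical per-image recognition rates and is illustrative only.

from scipy import stats

# Hypothetical mean recognition rates per image for the two greyscale
# conditions (illustrative values only, not the thesis data).
bw_images  = [0.48, 0.44, 0.49, 0.52, 0.40, 0.46, 0.50, 0.45, 0.47, 0.51]
three_grey = [0.49, 0.50, 0.53, 0.47, 0.51, 0.48, 0.52, 0.46, 0.50, 0.54]

t_stat, p_two_tailed = stats.ttest_ind(bw_images, three_grey)
print(f"t = {t_stat:.2f}, two-tailed P = {p_two_tailed:.2f}")  # retain H0 if P > 0.05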


So while small improvements with adding greyscale were noted in the results, for constant spatial resolution, images with 3 grey levels (white, mid-grey and black) were not significantly more recognisable than black and white images (Figure 4.5).


Figure 4.5: Images with 3 grey levels (white, grey, black) were not significantly more recognisable than black & white images.

Q. How does spatial resolution affect recognition? Again referring to Figure 4.4, images with 3 grey levels were more easily recognised as spatial resolution increased. However, for black and white images this was not always so. Thus resolution is still important for object recognition – not absolute resolution, but resolution relative to the size of the object one is trying to show.

Analysis of Variance (ANOVA) of the data shown in Fig 4.4 reveals that the differences when increasing spatial resolution are not significant. This analysis compared the hypotheses:

H0: µ 10x10 = µ 16x16 = µ 25x25

H1: At least two of the means are not equal, at α = 0.05. The test was performed for 20 observations (10 images with 3 greylevels and 10 B&W images) and also for averaged results for B&W images and 3-grey images (2 observations). For 20 observations, F(2,57) = 0.07 < Fcritical (3.1) with P = 0.93, while results for 2 observations were F(2,3) = 1.2 < Fcritical (9.6) with P = 0.40. Thus, as both P values were above 0.05, H0 was not rejected: mean results did not differ significantly as spatial resolution increased.


Q. Given the choice between increased spatial resolution and increased intensity resolution (grey scale), which would give higher recognition? One aspect of the testing was designed to analyse the effect of resolution versus grey scale. A subject was simultaneously presented with images at a low resolution with 3 grey levels (ie. white, grey, black) and higher resolution black and white images. The test analysed the following comparisons:
• 10x10 3-grey compared with 16x16 B&W
• 10x10 3-grey compared with 25x25 B&W
• 16x16 3-grey compared with 25x25 B&W

The findings from this testing are shown in Figures 4.6 and 4.7.

[Chart (n=110) – three bar groups (16x16 3-grey vs 25x25 B&W; 10x10 3-grey vs 25x25 B&W; 10x10 3-grey vs 16x16 B&W), each showing % recognition (0–60%) by processing method (normal, inverse, distance, importance) for the B&W and 3-grey versions.]

Figure 4.6: Comparing resolution and grey scale.

Figure 4.6 comprises data for only those subjects who could correctly identify the object. Three groups of bars are shown corresponding to the three comparisons listed above. Each group shows the distribution of the processing method chosen by subjects as showing the object most clearly (ie. Rank 1 on the bottom of the test sheet shown in Appendix Section A.1). The bars in each of the three groupings add to 100%. The plot also shows individual 95% confidence intervals for each of the processing methods obtained across the ten images used in the test. The four bars at the top of each grouping refer to the higher resolution black and white images, while the bottom four represent lower resolution images with 3 grey levels. It is clear that higher recognition is achieved with the higher resolution black and white images than with the lower resolution images with 3 grey levels. This indicates that higher recognition is achieved with increased spatial resolution rather than increased greyscale resolution.

Statistical testing of this data shows these differences in recognition are significant. Two sample t-tests were performed for each of the three groupings using 10 observations (10 different images) at α = 0.05. The test assessed the hypotheses:

H0: There is no recognition difference between low resolution images with 3 greylevels compared with higher resolution black and white images;

H1: Higher resolution black and white images are more easily recognised; ie. a one-tailed t-test. In all three cases (10x10 3grey compared with 16x16 B&W, 10x10 3grey compared with 25x25 B&W, 16x16 3grey compared with 25x25 B&W), P values were less than 0.05, indicating recognition rates for the lower resolution 3-grey images are significantly lower than the higher resolution black and white images.



Figure 4.7: Significantly higher recognition is achieved with increased spatial resolution (Right) over increased greyscale resolution (Left)


Q. What are the presentation modes, or processing methods, that show objects most clearly? On the questionnaire test sheets, the subjects were asked to rank the top 3 images in the order that they thought showed the object most clearly (refer Appendix Section A.1). While all 3 rankings show a trend of the most useful processing methods, it was of prime interest to determine what subjects nominated as their first choice, which they may have felt most strongly about. The first choice nominations are graphically represented in Figure 4.8 below. The plot shows individual 95% confidence intervals for each of the processing methods obtained across the ten images used in the test.

[Chart data – black & white images (10x10 / 16x16 / 25x25): normal 9% / 19% / 24%; inverse 6% / 21% / 16%; distance 30% / 30% / 27%; importance 52% / 28% / 20%; edges 3% / 2% / 2%. 3 grey level (black, grey & white) images (10x10 / 16x16 / 25x25): normal 17% / 14% / 33%; inverse 15% / 8% / 7%; distance 23% / 25% / 12%; importance 35% / 42% / 29%.]

Figure 4.8: Object recognition rate for various processing methods (n=110)


Analysis of Variance (ANOVA) of the data shown in Figure 4.8, testing the hypothesis that the means are equal at α = 0.05 for 10 observations, reveals the following significance levels:

Greyscale resolution | Degrees of freedom | P (10x10) | P (16x16) | P (25x25)
Black and white images | (4,45) | 1.0E-5 | 0.11 | 0.28
3-grey level images | (3,36) | 0.22 | 0.02 | 0.05

Table 4.1: Analysis of Variance for various processing methods

Table 4.1 indicates that there are significant differences between the means for 10x10 B&W images and for 16x16 images with 3 grey levels (P < 0.05). From Figure 4.8, it can be seen that distance and importance processing were more commonly nominated as showing the object clearly in these cases; these processing methods were nominated significantly more often than the 'Normal' presentation mode. However, there were no significant differences between means for 25x25 images. This also suggests that several presentation modes should be used in artificial vision systems rather than a single mode of operation.

Q. How does edge enhancement affect recognition? The upper plot of Figure 4.8 also shows that edge-processed images were not well recognised (Figure 4.9). At the low resolutions used in the tests, edges comprised too large a percentage of the total image pixels. For example, in a 10x10 image, a vertical edge would comprise an entire column representing a tenth of the image.

Figure 4.9: Edge images were not well recognised


Statistical testing confirms that results for edges are significantly lower than the average of other methods. Two sample t-tests were performed for each of the three spatial resolution groups (10x10, 16x16, 25x25) using 10 observations (10 different images) at α = 0.05. The test assessed the hypotheses:

H0: There is no recognition difference between edge-processed images and the average recognition of the other processing methods (average of {normal, inverse, distance, importance});

H1: Edge-processed images are significantly less easily recognised; ie. a one-tailed t-test. In all three cases, P values were less than 0.05, indicating recognition rates for the edge-processed images were significantly lower than the average of the other processing methods.

Q. What effect does image content (type of scene) have on recognition? The test image set reflected diversity in scene content, and it was found that image content is important to recognition ability. It would therefore be beneficial to have adaptive processing for different scenes. For recognising chairs and doorways, distance and importance processing were best, while for human faces, normal (or inverse) processing was most beneficial. Interestingly, there appeared to be a subjective preference between an image and its inverse, which differed between individuals.

4.2.4 Test Conclusions

This section describes subjective experiments undertaken to determine useful image processing methods for visual prosthetic applications and to provide a framework for prototype development. A condensed summary of the results is as follows:
• at a particular resolution, images with 3 grey levels (white, grey and black) were not significantly more recognisable than black and white images;
• higher recognition was achieved with increased resolution rather than increased grey scale;
• the most recognisable objects were images of human faces, with 98% recognition;


• the test image set reflected diversity in scene content, and it was found that image content is important to recognition ability – it would therefore be beneficial to have device-switchable processing for different scenes;
• at lower spatial resolutions, one or two processing methods were quite useful (importance and distance processing);
• edge-processed images were not well recognised; at the low resolutions used in the tests, edges comprised too large a percentage of the total image pixels;
• there appeared to be a subjective difference between an image and its inverse (Figure 4.10);

Figure 4.10: Subjective preferences between image and its inverse – some subjects preferred white on black, others black on white.

• for recognising chairs and doorways, distance and importance processing were best; for human faces, normal and inverse were best; and
• resolution is still important for object recognition – not absolute resolution, but resolution relative to the size of the object one is trying to show.


4.3 Subjective Tests to Determine Influence of Image Type

In this section further results are presented on subjective tests simulating what might be seen by users of low quality vision systems. A group of 225 normally sighted subjects viewed a set of low quality (low spatial resolution and low grey-scale resolution) static images. The aim for this testing was to quantify intelligibility/recognition for low quality images and determine the effect of the type of image. Results from this testing form part of an image quality model to assess the usefulness of low quality images.

4.3.1 Methodology

Part of the research involves assessing visual perception at this low end of the image quality spectrum. Chapter 2 described numerous models assessing the human visual system and image quality. However, these models apply to the high end of the image quality spectrum (see Watson [105] for a good compilation). There is a need to fill this void to assess image quality for emerging implant designs. The work extends upon the previous subjective tests on normally sighted viewers described in the preceding section which determined the impact of several image processing techniques on object recognition.

The simulation tests were undertaken to provide insight into human perception of low quality images and were aimed at simulating artificially-induced low quality vision.

The objective was to obtain a Recognition-Quality envelope (see Fig 4.11), where a subject was able to use the information presented to draw an intelligible conclusion about the image. This section introduces the concept of recognition-quality curves which show recognition performance plotted against image quality.


[Conceptual sketch – Recognition plotted against Image Quality, posing two questions: is there a variation in the degree of recognition possible for different images of the same quality, and is there a threshold of minimum quality required for intelligible viewing?]

Figure 4.11: Test Objective – obtain a Recognition-Quality curve

It was anticipated that as image quality increased, there would be an increase in the ability of an object to be recognised. However, for a given image quality, recognition performance was expected to vary among viewers, thus producing an 'envelope' of recognition as opposed to a straight line response. This may also indicate that the ability of an object to be recognised may not improve within a range of image qualities.

Participation was on a voluntary basis and comprised 271 senior high school students and 11 mature age respondents. Invalid data resulted in the rejection of 57 questionnaires (21%). Thus the final sample size was 225, representing sample sizes of 25 for each of the 9 image quality classes.

Participants had no prior knowledge of the images. Booklet instructions stated that a range of high quality and low quality images could be expected, and although the low quality images might just appear as a range of blocks, they may be similar to what a blind person might see with a bionic eye.

4.3.2 Images Chosen

There were 9 Image Quality classes tested (see Fig 4.12). Original images were 256x256 pixels representing a range of scene types. A decreasing image quality scale was presented using spatial resolutions typical of visual prosthesis designs (25x25, 16x16, 10x10) and reducing the grey levels from full greyscale to binary. It was also of interest to expose the structure of an image by presenting image edges.

[Figure layout – spatial resolutions of 256x256, 25x25, 16x16 and 10x10, each at full greyscale and binary, plus a 256x256 edge (image structure) class, arranged in decreasing image quality.]

Figure 4.12: The nine image quality classes used in the tests

Reduced quality image sets were prepared for the images shown in Fig 4.13.

[Image thumbnails – tree, flower, balloon, lighthouse, face, buildings, capsicum, gorilla, duck.]

Figure 4.13: Test image set


The subject was presented with 9 different images on the one page (tree, flower, balloon, lighthouse, face, buildings, capsicum, gorilla, rubber duck) corresponding to an image quality class described above. An example of the test stimuli is shown in Appendix Section B.1.

4.3.3 Results

Responses indicated by subjects were collated to determine recognition rates. Most subject responses were easy to classify as "Yes, this person has correctly recognised the object" or otherwise. However, where a subject's response was borderline, Appendix Section B.2 was constructed to maintain consistent judgements on whether images were correctly recognised. Responses were accepted if they had a similar context to the answer. Note the table includes only borderline responses and is not a complete listing.

Table 4.2 shows the proportion of respondents who could correctly identify all (9/9) images presented to them and two-thirds (6/9) of the images shown to them. 6/9 was chosen to reflect recognition performance clearly over 50%.

Quality class | Respondents correctly identifying all 9 images (out of 25) | % correct for 9/9 | Respondents correctly identifying two thirds (6/9) of the image set (out of 25) | % correct for 6/9
10 x 10 Binary | 0 | 0% | 0 | 0%
10 x 10 Greyscale | 0 | 0% | 4 | 16%
16 x 16 Binary | 0 | 0% | 1 | 4%
16 x 16 Greyscale | 1 | 4% | 7 | 28%
25 x 25 Binary | 0 | 0% | 4 | 16%
25 x 25 Greyscale | 2 | 8% | 20 | 80%
256 x 256 Binary | 4 | 16% | 25 | 100%
256 x 256 Edge | 24 | 96% | 25 | 100%
256 x 256 Greyscale | 25 | 100% | 25 | 100%

Table 4.2: Correct image identification (n=25)


None of the respondents viewing the low quality binary images (10x10, 16x16, 25x25) were able to correctly identify all 9 of the presented images. Surprisingly, even at high resolution, only 16% of respondents viewing the binary versions of the originals correctly identified all images. Even when considering identification of two thirds of the image set, recognition performance was still low for respondents viewing the low quality binary images. In fact, the same number of people could identify two thirds (6/9) of the 25x25 binary image set as could do so for the 10x10 greyscale images. 80% of viewers of the 25x25 greyscale image set could correctly identify two thirds of the image set. This value of useful spatial resolution agrees with previous simulation work by others ([99], refer Section 3.5.2.1), which found that effective two dimensional displays can be achieved with matrix sizes of between 20x20 and 32x32.

When considering an average across all image types, it was possible to construct an envelope of recognition for the test set as shown below in Fig 4.14, which has a similar shape to the envelope proposed in Fig 4.11. The plot shows 95% confidence intervals around mean recognition rates for the nine image quality classes used in the test. Maximum and minimum curves have been added to indicate the range of values obtained. Although the maximum and minimum values are shown joined with a line to form an envelope, they do not imply that the x-axis is always ordered in the image quality order as shown. (The order shown below is in increasing order of recognition for mean recognition rates across all image types).

[Chart (n=225) – mean recognition rate versus image quality class, with 95% confidence intervals around mean recognition rates and maximum/minimum curves; classes ordered by increasing mean recognition: 10 Bin, 16 Bin, 10 F/G, 25 Bin, 16 F/G, 25 F/G, 256 Bin, 256 Edges, 256 F/G.]

Figure 4.14: Recognition-Quality envelope of recognition for all images in the test set


Analysis of Variance (ANOVA) was performed on the data shown in Fig 4.14 to compare the hypotheses:

H0: µ 10 Bin = µ 16 Bin = …. = µ 256 F/G

H1: At least two of the means are not equal, at α = 0.05. The test resulted in an F-value of 111, which exceeded the critical F-value (1.98) for the number of degrees of freedom in the data (8, 216), and was highly significant at P = 2.64E-72. Thus H0 was rejected and it was concluded that recognition rates were significantly different for the image quality classes used in the test.

Recognition-Quality curves for specific object types are shown in Fig 4.16. It can be seen that the x-axes have different ordering. One conclusion from these results is that recognition performance varies widely depending on many factors, one of which is the type of image. Had these curves been plotted with the same x-axis ordering (say on increasing recognition rate of values averaged across image types), the recognition plot of Fig 4.15 below would be obtained. The data points are shown joined to demonstrate the jaggedness of the curves, highlighting that recognition performance varies with type of image.

[Chart (n=25 per class) – recognition curves for each image (lighthouse, buildings, tree, gorilla, capsicum, face, flower, balloon, rubber duck) and their mean, plotted against the common quality ordering 10 Bin, 16 Bin, 10 F/G, 25 Bin, 16 F/G, 25 F/G, 256 Bin, 256 Edges, 256 F/G.]

Figure 4.15: Variation in recognition among image types

In general, one might expect recognition to improve as spatial resolution and the number of greylevels increase. The results here confirm the findings of Section 4.2.3 and of others [36] that recognition rate/perceived quality depends on image type and that there is an interplay between greylevel and spatial resolution.


Figure 4.16: Recognition-Image Quality curves for each test image; Average recognition rate (averaged across viewers and image quality classes) is shown above each chart.

[Chart panels – Gorilla (n=25), avge 44%; Capsicum (n=25), avge 50%; Buildings (n=25), avge 50%; Balloon (n=25), avge 51%; Tree (n=25), avge 53%; Rubber Duck (n=25), avge 57%; Lighthouse (n=25), avge 66%; Flower (n=25), avge 72%; Face (n=25), avge 85%.]


For 5 of the 9 images (face, flower, lighthouse, duck, capsicum), as more information was presented in the way of either greyscale or spatial resolution, the recognition rate increased.

For example, recognition performance for:
• the flower image set: 10 Bin < 16 Bin < 10 F/G < 16 F/G < 25 Bin < 25 F/G etc
• the face image set: 10 Bin < 16 Bin < 10 F/G < 25 Bin < 16 F/G < 25 F/G etc
where "10 Bin < 16 Bin" indicates that 16x16 binary images were more easily recognised than 10x10 binary images.

On the other hand, the gorilla image set resulted in the following quality class ordering: 10 Bin < 10 F/G < 25 Bin < 16 Bin < 16 F/G < 25 F/G etc.

This appears to be an illogical order, owing to a lower recognition rate for 25x25 binary images than for 16x16 binary images. However, the actual recognition rates for the gorilla image set are very low, and it can be conjectured that images with low overall recognition rates give spurious results due to guessing. In contrast, the 4 images that were most highly recognised across all image quality classes (face = 85%, flower = 72%, lighthouse = 66%, duck = 57%) all had a logical quality class ordering as recognition rate increased.

Mean recognition rates for each object type are shown in Fig 4.17. The plot shows 95% confidence intervals around mean values (averaged across all image quality classes) together with maximum and minimum values to show the range of recognition obtained.


[Chart – All Image Quality Classes (n=225): mean recognition rate per object type with 95% confidence intervals and maximum/minimum range.]

Figure 4.17: Recognition rates for each object type

Similar to the results found in Section 4.2.3, the face image had the highest mean recognition rate when averaged across all image quality classes. Again this agrees with the literature where face recognition has been recognised as neurologically programmed [97] and one of the foremost visual learning steps in the human baby [29]. For low quality presentations, images that were highly recognised were the face and flower (greyscale images) and face, lighthouse and flower (binary images). The gorilla, duck and capsicum images were not recognised well at low quality.

Analysis of Variance (ANOVA) was performed on the data shown in Fig 4.17 to compare the hypotheses:

H0: µ Face = µ Flower = …. = µ Gorilla

H1: At least two of the means are not equal, at α = 0.05. The test reports an F-value = (found variation of the group averages)/(expected variation of the group averages), and if the H0 hypothesis is correct, the F-value is about 1. Interestingly, the test resulted in an F-value of 1.12, which was less than the critical F-value (2.07) for the number of degrees of freedom in the data (8, 72). This F-value has a significance of P = 0.36, and thus H0 could not be rejected.

Thus when combining recognition rates for all quality classes, there was no significant difference based on image type.


4.3.4 Test Conclusions

This section described an experiment to further the understanding of low quality image recognition and to determine the effect of image type on object recognition. Recognition was found to vary depending on the type of image, but differences were not significant when averaged across the different image qualities assessed in the test. The face image had the highest mean recognition rate across all image qualities.

The number of respondents correctly identifying two thirds of the 25x25 binary image set equalled the number of respondents correctly identifying two thirds of the 10x10 greyscale image set.

80% of respondents could correctly identify more than half of the 25x25 greyscale images indicating reasonable vision is achieved at this level. It must be remembered that these test images were static and improved perception could be expected with presentation of image sequences – ie. a movie versus a single image. Also a visual prosthesis user would be able to move about to see how various objects interact.

There is an interplay between greyscale resolution and spatial resolution – for some objects, higher recognition is achieved with increased greyscale over spatial resolution, while the reverse applies for other objects.

For those objects which were highly recognised (above 55% averaged across all image quality classes: face, flower, lighthouse, duck) it was possible to obtain a recognition curve that increased as image quality increased. However for the remaining images which were not well recognised (much guessing), recognition rates both increased and decreased as image quality increased.

This work is extended in the next chapter by:
1. correlating several image statistics, such as fractal dimension, symmetry, number of edges, and number of segments, with the images in these tests to determine if recognition can be automatically predicted; and
2. constructing a visual information model for low quality images comprising several dimensions, in addition to the actual object as considered in this chapter.


4.4 Chapter Conclusions

At the commencement of this chapter, two research questions were stated which can now be answered following the experiments described in this chapter:

Q1: Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved?

A1: Results indicated that greyscale images were easier to identify: 80% of respondents could correctly identify more than half of a 25x25 greyscale image set, while only 16% of respondents correctly identified more than half of a 25x25 binary image set. Spatial resolution was more important for recognition performance than greyscale resolution.

Recognition was found to vary depending on the type of image, with face images being the most easily recognised. It would be beneficial to have device-switchable processing for different scenes. Further exploration of the idea of adjusting image processing depending on the scene type is presented in Chapter 6.

Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?

A2: Results indicated that there may be some benefit in pursuing ROI methods in further detail, especially for very low (10 x 10) resolution images. For recognising chairs and doorways, ROI and distance processing were best, while for human faces, standard/Base Case and inverse processing were best. Access to a range of processing routines was therefore advisable. Further assessment of the applicability of ROI techniques to low quality image perception is presented in Chapter 7.


Chapter 5 Quantifying Information Content

5.1 Introduction

One of the desired functions of visual prostheses is to convey maximum scene information through the limited number of electrodes in implants. How can one tell whether the conveyed image contains maximum scene information, or indeed, how does one quantify the amount of visual information in an image? Can a metric be developed that can rank images (like the two shown in Figure 5.1) for the amount of visual information they contain?

Figure 5.1: Two images with different amounts of visual information content

This chapter attempts to answer these questions and describes in detail the construction of a metric for visual information. It aims to answer the research question proposed at the end of Chapter 3:

Q3: Can a metric be constructed for basic information required for the interpretation of a visual scene at low image quality?

This knowledge would result in a new way to characterise low quality images on the basis of providing maximum information. Images could be characterised on the ratio of the perceived information they convey (the human user's concept) to their representative information (a raw measure of intrinsic information, typically measured in bits).

Assuming that information content in images can be quantified, how can this knowledge be used in the visual prostheses application? This chapter proposes that image content can be manipulated in a way so that the resulting image to be conveyed to implant electrodes contains maximum information. One means to do this is using the Importance Map method described previously in Section 3.5.2.6. This method involves the combination of several feature maps/images representing attentional features to form an overall importance map. Using the knowledge of what constitutes visual information, weights for each feature map (intensity, edges, colour contrast etc) could be adjusted iteratively to maximise the amount of visual information in the resulting importance map.
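The following minimal sketch illustrates the weighted-combination step only; the feature-map arrays, weights and function name are illustrative assumptions, not the Importance Map implementation referred to above.

import numpy as np

def combine_feature_maps(feature_maps, weights):
    """Weighted sum of normalised feature maps giving an overall importance map."""
    total = np.zeros_like(feature_maps[0], dtype=float)
    for fmap, w in zip(feature_maps, weights):
        span = fmap.max() - fmap.min()
        norm = (fmap - fmap.min()) / span if span > 0 else np.zeros_like(fmap, dtype=float)
        total += w * norm
    return total / sum(weights)

# Hypothetical intensity, edge and colour-contrast feature maps for a 25x25 image.
intensity, edges, colour = (np.random.rand(25, 25) for _ in range(3))
# The weights would be adjusted iteratively, re-measuring the information (edge)
# content of the combined map at each step, to maximise it.
importance = combine_feature_maps([intensity, edges, colour], weights=[0.3, 0.5, 0.2])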

This Chapter comprises three further sections:

Section 5.2 - Identify perceived visual information content: construct 8 subjective data sets.

Section 5.3 - Propose a metric using 1 of the 8 subjective data sets (a metric specific to a particular image quality, and one metric for all image qualities); validate the metric against the other 7 subjective data sets.

Section 5.4 - Correlate visual information content with perception.

Section 5.2 describes a subjective experiment for perceived information content in images. Subjective rankings are presented for eight visual 'dimensions'. Patterns among rankings and viewer preferences are noted to gain insight into subjective visual information. The development of a robust metric is detailed in Section 5.3 and the predictive performance of the metric examined. Finally, Section 5.4 determines whether high perceived information content in images actually corresponds to high recognition rates: is it true that an image with high information content can be recognised more easily at low quality than an image with low information content?


5.2 Perceived Information Content in Images

An experiment was conducted to rank the amount of inherent visual information in images. In the experiments images were compared with each other to obtain a ranking from most to least visually informative. In addition to using the results to propose a metric to quantify visual information, an additional benefit is determining how perceived information content changes as image quality decreases.

5.2.1 Images Used

Similar to the experiments discussed in Section 4.3 and shown below in Figure 5.2, there were 9 image quality classes tested. Original images were 256x256 pixels representing a range of scene types.

Figure 5.2: The nine image quality classes used in the tests, in order of decreasing image quality: (1) 256x256 full greyscale, (2) 256x256 binary, (3) 256x256 edge (image structure), (4) 25x25 greyscale, (5) 25x25 binary, (6) 16x16 greyscale, (7) 16x16 binary, (8) 10x10 greyscale, (9) 10x10 binary

A decreasing image quality scale was presented using spatial resolutions typical of visual prosthesis designs (25x25, 16x16, 10x10) and reducing the grey levels from full 256 levels of greyscale to binary. It was also of interest to expose the structure of an image by presenting image edges. Reduced quality image sets across the nine classes were prepared for each of the images shown in Figure 5.3.

5.2.2 Multidimensional Visual Information Model:

Eight aspects/dimensions were explored to determine what impact, if any, they had on perceived information content. The first issue (Actual Objects) was assessed by comparing 7 images against each other while the other dimensions were assessed with only 3 images each.

1. Actual Objects

7 images representing a range of different scene types: tree, flower, balloon, lighthouse, face, buildings, capsicum. There was no implicit ranking concerning visual information.

2. Number of Objects

3 images of increasing object number for similar scene type. The first image contained one balloon, the second image three or four balloons and finally an image containing many balloons. This set had an expected ranking of visual information in proportion to the number of objects.

3. Angle of Object

3 images of a fruit bowl at 90° (top-down), 45° (angled) and 0° (side-on). Perceived visual information may vary due to occlusion and distortion of objects.

Figure 5.3: Multidimensional Visual Information Model (continued over)


4. Distance to Object

3 images of a couple on a bicycle with decreasing distance to the couple’s faces. The first image is a whole of body image, the second shows a half-body view and the final image consists of head and shoulders only. This set had an expected ranking of visual information with distance ie. higher visual information where the whole of the scene can be viewed.

5. Connection between Image Objects

3 images of different couples with decreasing connection between the couple. One image shows the cheeks touching, the next shoulders touching, while the final image shows space between the couple. It was expected that images showing space between the couple may indicate more of what was happening in the scene, and thus information content would be greater with increased separation.

6. Image Detail

3 images of the same face with different edge detail. The first image shows the face alone, the second includes a phone, while the third shows part of an additional face. Information content was expected to be greater for images containing higher detail ie. the additional face and phone images would be more visually informative.

Figure 5.3: Multidimensional Visual Information Model (continued over)


7. Contrast between Objects & Surround

3 images of capsicums with varying contrast. Green, red and yellow capsicums gave varying contrast against a light background when viewed as greyscale images. It was expected that images with higher contrast would be ranked as containing higher information content.

8. Variety of Object Types

3 images comparing different object types. The first image contained an orange and sunglasses, the second depicted an orange and a mug, while the third showed scissors and a mug. There was no expected ranking of information content.

Figure 5.3: Multidimensional Visual Information Model

5.2.3 Test Method

Two questionnaire-based methods were used:

1) Images presented all on one page

An example test stimulus is shown in Appendix Section C.1 (this shows the presentation of 7 images for assessing the "Actual Objects" test set) and Section C.2 (showing the presentation of 3 images for assessing the "Distance to Object" test set).

For assessing the first issue (Actual Objects), the 7 images were presented on the same page, and subjects were asked to rank from 1 to 7. When considering such a large number of comparisons, this method gives strong responses for the extremes (most and least visually informative) and weaker responses for the mid-lying images. Thus a paired comparison (binary decision) test was performed on the 7 image set.

2) Paired comparison (binary decision) questionnaire test

An example test stimulus is shown in Appendix Section C.3.

Considerable effort was required in the design of the questionnaires to ensure variety (avoid boredom), reduce the chance of learning effects from multiple viewings of the same object, and to keep the questionnaires short (avoid fatigue). To achieve this, 9 booklet versions were produced (Books A, B, C etc) with the format shown in Appendix Section C.4. Conditions for viewing the experiment (ambient illumination etc) were not controlled.

5.2.4 Test Participants and Instructions

Participation was on a voluntary basis and comprised 271 Year 11 students and 11 mature age respondents. Invalid data resulted in the rejection of 57 questionnaires (21%). Thus the final sample size was 225, representing sample sizes of 25 for each of the 9 image quality classes.

Participants had no prior knowledge of the images. Booklet instructions stated that a range of high quality and low quality images could be expected, and although the low quality images might just appear as a range of blocks, they may be similar to what a blind person might see with a bionic eye.

In assessing visual information using human viewers, it was anticipated that there would be a varied understanding and interpretation of the concept of visual information. In addition to the above comment that viewers were advised of the bionic eye application, the following example question was provided to all viewers: WHICH IMAGE APPEARS TO CONTAIN MORE INFORMATION? In other words, which image could you answer the most questions about? (eg. What is the scene? How many objects?) If you had to rely on only one of the images to perform a task which would it be? Beyond these comments viewers made their own interpretation of visual information.


5.2.5 Test Results

Eight factors were analysed. The first factor was determining the effect of the actual object shown in the image on perceived information content. The seven different objects were compared against each other. Two ranking schemes were used: 1) images were presented all at the same time 2) paired comparison tests. Both methods gave similar results and the rankings for each method are shown below. The table shows images ranked from highest perceived information content (1) to lowest information content (7) for the nine image quality classes and a ranking combining all image quality classes.

Visual Information Ranking - Images Presented all at same time

Quality class        1          2          3           4           5           6           7
All Quality Classes  Face       Flower     Tree        Buildings   Lighthouse  Capsicum    Balloon
256 F/G              Face       Buildings  Tree        Lighthouse  Flower      Balloon     Capsicum
256 Bin              Buildings  Face       Tree        Flower      Lighthouse  Capsicum    Balloon
256 Edges            Buildings  Face       Tree        Flower      Lighthouse  Balloon     Capsicum
25 F/G               Face       Flower     Capsicum    Tree        Balloon     Lighthouse  Buildings
25 Bin               Face       Flower     Tree        Buildings   Lighthouse  Capsicum    Balloon
16 F/G               Face       Flower     Capsicum    Balloon     Tree        Lighthouse  Buildings
16 Bin               Face       Flower     Tree        Buildings   Capsicum    Lighthouse  Balloon
10 F/G               Face       Flower     Capsicum    Tree        Balloon     Buildings   Lighthouse
10 Bin               Tree       Flower     Face        Buildings   Lighthouse  Capsicum    Balloon

Visual Information Ranking - Paired Comparison Presentations

Quality class        1          2          3           4           5           6           7
All Quality Classes  Face       Flower     Tree        Buildings   Capsicum    Lighthouse  Balloon
256 F/G              Buildings  Face       Lighthouse  Tree        Flower      Balloon     Capsicum
256 Bin              Buildings  Face       Tree        Flower      Lighthouse  Capsicum    Balloon
256 Edges            Buildings  Face       Tree        Flower      Lighthouse  Balloon     Capsicum
25 F/G               Face       Flower     Capsicum    Balloon     Tree        Lighthouse  Buildings
25 Bin               Face       Flower     Tree        Buildings   Capsicum    Lighthouse  Balloon
16 F/G               Face       Flower     Capsicum    Tree        Balloon     Lighthouse  Buildings
16 Bin               Face       Flower     Tree        Buildings   Lighthouse  Capsicum    Balloon
10 F/G               Face       Flower     Tree        Capsicum    Balloon     Lighthouse  Buildings
10 Bin               Tree       Face       Flower      Buildings   Capsicum    Lighthouse  Balloon

Table 5.1: Perceived information content for comparing 7 different object types


When considering the ranking for all quality classes (n=225), both methods gave the following near-identical ranking order: Face > Flower > Tree > Buildings > Lighthouse/Capsicum > Balloon. That is, the face image has higher subjective information content than the flower, and so on.

The high quality Top 3 (256x256, 256x256_Binary, 256x256_Edge) were Face, Buildings and Tree:

Figure 5.4: Images containing high information content for high quality images

The low quality Top 3 were Face, Flower and Tree (Binary), Face, Flower and Capsicum (Greyscale):

Figure 5.5: Images containing high information content for low quality images (binary and greyscale sets)

The effect of the other factors/dimensions on visual information is presented in Table 5.2 over. Patterns that emerged in the visual information ranking are noted along with the number of image quality classes (out of 9) with that ranking. Strong viewer preferences are defined as the pattern chosen by 70% or more of the sample. Although it may appear arbitrary, the 70% level was chosen from careful inspection of the data and the fact that in a normal distribution, 68% of cases fall within 1 standard deviation of the mean.


Dimension: Number of Objects in Scene
1st dominant pattern: very strong viewer preferences (chosen by >70% of sample): Yes, 5/9 classes (16Bin 88%, 25Bin 92%, Edges 84%, 256Bin 96%, 256F/G 96%); image quality classes ranked like this: 6/9 (10Bin 64%, 16Bin 88%, 25Bin 92%, Edges 84%, 256Bin 96%, 256F/G 96%); original (256F/G) images ranked like this: Yes.
2nd pattern: very strong viewer preferences: No; image quality classes ranked like this: 2/9 (16F/G 32%, 25F/G 36%); original images ranked like this: No.

Comments: A strong pattern was clear in the results which confirmed expectations: the more objects in the scene, the higher the visual information. 6/9 image quality classes were ranked in this way, which was two thirds of the quality classes. Five of the nine image quality classes had very strong viewer preferences for this ordering. For two low quality classes, the image of the single balloon was favoured highest, but preferences were not strong.

Table 5.2: Pattern analysis for information content rankings (continues over)


Dimension: Angle of Object
1st dominant pattern: very strong viewer preferences: No; image quality classes ranked like this: 5/9 (10F/G 36%, 16Bin 32%, 16F/G 40%, Edges 36%, 256F/G 52%); original (256F/G) images ranked like this: Yes.
2nd pattern: very strong viewer preferences: No; image quality classes ranked like this: 3/9 (10Bin 60%, 16Bin 32%, 25Bin 44%); original images ranked like this: No.

Comments: The dominant pattern indicates highest information is in a top down view (90 degrees) of the fruit bowl, where almost the entire bowl circumference is visible. The contents of the bowl can be most easily seen in this top-down view. This pattern was ranked more visually informative for high quality and greyscale images. When limited to binary representation, the side on view (0 degrees) was ranked higher, perhaps due to a sharper profile of the bananas against the background.

Dimension: Distance to Object
1st dominant pattern: very strong viewer preferences: No; image quality classes ranked like this: 5/9 (10F/G 32%, 16Bin 28%, 16F/G 32%, Edges 48%, 256F/G 68%); original (256F/G) images ranked like this: Yes.
2nd pattern: very strong viewer preferences: No; image quality classes ranked like this: 4/9 (10Bin 40%, 16Bin 28%, 25F/G 64%, 256Bin 36%); original images ranked like this: No.

Comments: The dominant pattern, including the Original image set, is ranked in increasing distance to the viewer ie. more visual information where you can see more of the image and background. However the second pattern, which includes rankings for low quality binary images, indicates the closest view of faces contains more information.

Table 5.2: Pattern analysis for information content rankings (continues over)


Dimension: Connection between Image Objects
1st dominant pattern: very strong viewer preferences: No; image quality classes ranked like this: 5/9 (10F/G 36%, 16Bin 32%, 16F/G 36%, Edges 68%, 256Bin 36%); original (256F/G) images ranked like this: No.
2nd pattern: very strong viewer preferences: Yes, 1/9 (25F/G 80%); image quality classes ranked like this: 4/9 (10Bin 36%, 25Bin 32%, 25F/G 80%, 256F/G 48%); original images ranked like this: Yes.

Comments: The first dominant pattern indicates viewers rated the image with the most separation the most informative, perhaps because more of the occupation of the couple (card game) was visible, or maybe the background picture and table edge contributed highly. However the 2nd pattern included a strong (80%) preference in decreasing connection between the couple for 25x25 greyscale images.

Dimension: Image Detail
1st dominant pattern: very strong viewer preferences: Yes, 1/9 (16F/G 80%); image quality classes ranked like this: 4/9 (10Bin 52%, 10F/G 64%, 16F/G 80%, 25F/G 60%); original (256F/G) images ranked like this: No.
2nd pattern: very strong viewer preferences: No; image quality classes ranked like this: 3/9 (16Bin 64%, 25Bin 40%, 256Bin 36%); original images ranked like this: No.
3rd pattern: very strong viewer preferences: No; image quality classes ranked like this: 2/9 (Edges 48%, 256F/G 64%); original images ranked like this: Yes.

Comments: A simple face with no surrounding clutter was most visually informative for the low quality images (1st & 2nd patterns). The influence of the mobile phone in our society may be reflected in the ranking of the high quality images (3rd pattern), where the face with the phone appeared to contain more information.

Table 5.2: Pattern analysis for information content rankings (continues over)


Dimension: Contrast between Objects & Surround
1st dominant pattern: very strong viewer preferences: Yes, 1/9 (Edges 72%); image quality classes ranked like this: 5/9 (10F/G 44%, 16F/G 64%, 25F/G 56%, Edges 72%, 256F/G 56%); original (256F/G) images ranked like this: Yes.
2nd pattern: very strong viewer preferences: No; image quality classes ranked like this: 2/9 (10Bin 40%, 256Bin 64%); original images ranked like this: No.
3rd pattern: very strong viewer preferences: No; image quality classes ranked like this: 2/9 (16Bin 44%, 25Bin 28%); original images ranked like this: No.

Comments: Strong edges correspond with high perceived information content. The dominant pattern was for high quality and the low quality greyscale images. When greyscale is available, the stalk and capsicum form/contours may cause this ranking.

Dimension: Variety of Object Types
1st dominant pattern: very strong viewer preferences: No; image quality classes ranked like this: 4/9 (10Bin 40%, 10F/G 60%, 16Bin 48%, 16F/G 32%); original (256F/G) images ranked like this: No.
2nd pattern: very strong viewer preferences: No; image quality classes ranked like this: 3/9 (25Bin 32%, 25F/G 40%, 256F/G 44%); original images ranked like this: Yes.
3rd pattern: very strong viewer preferences: No; image quality classes ranked like this: 2/9 (Edges 52%, 256Bin 52%); original images ranked like this: No.

Comments: The ordering of the dominant pattern is the same as presented to viewers on the questionnaire sheets. It is interesting that the lowest quality image classes make up this dominant pattern. Perhaps viewers of the low quality images were not able to make an intelligible distinction between images and ranked the images in order of appearance.

Table 5.2: Pattern analysis for information content rankings


5.2.6 Strong Visual Information Rankings

A total of 63 visual information rankings were obtained (7 additional factors/dimensions x 9 image quality classes). Dominant patterns (ie. the most frequently specified ordering in terms of perceived information content) were identified for each case. The strength of the dominant patterns (ie. the frequency with which that pattern was specified by observers) ranged from 96% (24 of 25 respondents ranking images in that order) to 28% (only 7 of 25 respondents). The number of cases in each ten-percentile class was as follows:

Strength and number of cases for dominant viewer patterns (63 in total)
90-100%: 3   (STRONG)
80-89%: 4
70-79%: 1
60-69%: 12
50-59%: 6
40-49%: 16
30-39%: 19
20-29%: 2   (WEAK)
10-19%: 0
0-9%: 0

Table 5.3: Dominant visual information viewer preferences

It was of interest to further examine strong dominant viewer patterns in the data. Eight of the 63 rankings had 70% or above consensus among viewers. Five of these related to the number of objects in the scene.

Strong viewer preferences are shown over in Figure 5.6.


Number of Objects in Scene (5 image quality classes): 16x16_Binary (88%), 25x25_Binary (92%), 256x256_Edges (84%), 256x256_Binary (96%), 256x256 (96%)

Closeness between image objects (1 image quality class): 25x25 greyscale set (80%)

Image Detail (1 image quality class): 16x16 greyscale set (80%)

Contrast between Objects & Surround (1 image quality class): 256x256_Edge set (72%)

Figure 5.6: Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content


5.2.7 Test Conclusions

Four conclusions can be drawn from Figure 5.6 and the perceived information content experiment:

1. the more objects in the scene, the higher the visual information

2. the closer the objects in the scene, the higher the visual information

3. a simple face with no surrounding clutter was most visually informative at low resolution levels

4. strong edges, arising from high intensity contrast, correspond with high perceived visual information content

These viewer preferences now need to be checked against predictions from a visual information metric which is undertaken in the next section.

5.3 Information Content Model Fitting

Described above are experiments to assess perceived information content across eight visual dimensions. Subjective rankings from one of these eight dimensions (Actual Objects) are now used to construct a metric to quantify visual information in images. The metric is then validated against the subjective results of the other 7 dimensions.

5.3.1 Possible Image Attributes for a Visual Information Metric

After consideration of the literature on visual information content (refer Section 2.5), 15 image attributes were considered for the visual information metric:

1. file size

2. standard deviation

3. maximum standard deviation in 4 image quadrants

4. variance

5. maximum variance in 4 image quadrants

6. entropy


7. number of edges

8. number of segments

9. fractal dimension

10.-12. image internal similarity measures (three measures)

13.-15. image symmetry measures (three measures)

Descriptions are provided below for these attributes.

File Size
Size on disk (bytes) from Windows/DOS.

Standard deviation
Standard deviation of the image pixels:

s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}, where \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i and n = number of elements    (Equation 3)

Maximum standard deviation in 4 image quadrants
The image is divided into four quadrants with standard deviations s1, s2, s3, s4; the attribute is max{s1, s2, s3, s4}.

Variance
Variance of the image pixels = s²; squaring this term emphasises larger deviations.

Maximum variance in 4 image quadrants
max{s1², s2², s3², s4²} over the four image quadrants.

Entropy
While this term is also used in the field of thermodynamics, its use here refers to the image processing context of describing the probability of each possible grey level occurring in an image (for greyscale images the levels are 0 to 255; for binary images, 0 and 255 only):

h = -\int p(x)\,\ln p(x)\,dx, evaluated over the discrete grey levels as h = \sum_{g=0}^{255} -\frac{(\mathrm{pixels})_g}{256\times256}\log_2\!\left(\frac{(\mathrm{pixels})_g}{256\times256}\right)    (Equation 4)

where (pixels)_g = number of pixels at that grey level.
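A short sketch of this calculation for an 8-bit greyscale image is given below; the image array is a placeholder, and grey levels absent from the image simply contribute nothing to the sum.

import numpy as np

def image_entropy(img):
    """Grey-level entropy in bits, as in Equation 4."""
    _, counts = np.unique(img, return_counts=True)
    p = counts / img.size                    # probability of each grey level present
    return float(-np.sum(p * np.log2(p)))

img = np.random.randint(0, 256, (256, 256), dtype=np.uint8)   # placeholder greyscale image
print(image_entropy(img))                    # close to 8 bits for a uniform random image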


Number of edges

Sobel edge detection (Matlab version 6.5 Release 13) – horizontal and vertical edges.
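The thesis used Matlab's Sobel detector; the sketch below is a rough open-source equivalent in which the gradient-magnitude threshold is an assumed, illustrative value rather than the setting used in the experiments.

import numpy as np
from scipy import ndimage

def edge_count(img, threshold=0.5):
    """Approximate 'number of edges' as the count of pixels whose normalised Sobel
    gradient magnitude exceeds a threshold (the threshold value is illustrative)."""
    img = img.astype(float)
    gx = ndimage.sobel(img, axis=1)          # horizontal gradients
    gy = ndimage.sobel(img, axis=0)          # vertical gradients
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak == 0:
        return 0
    return int(np.sum(magnitude / peak > threshold))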

Number of segments

The image is segmented using quadtree decomposition; this segments an image on the basis that a block is split into 4 smaller blocks if the maximum value in the block minus the minimum value in the block is greater than a threshold (200/255 was used). Block splitting continues until max value - min value is not greater than the threshold. Blocks are then merged with neighbours if similar in value.
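A minimal recursive sketch of the splitting criterion is shown below; the merging of similar neighbouring blocks is omitted, and the default threshold simply mirrors the 200/255 value quoted above.

import numpy as np

def quadtree_blocks(img, threshold=200, min_size=1):
    """Recursively split a block into 4 while (max - min) > threshold; returns the
    number of leaf blocks (merging of similar neighbours is omitted here)."""
    h, w = img.shape
    if (int(img.max()) - int(img.min())) <= threshold or h <= min_size or w <= min_size:
        return 1
    h2, w2 = h // 2, w // 2
    return (quadtree_blocks(img[:h2, :w2], threshold, min_size) +
            quadtree_blocks(img[:h2, w2:], threshold, min_size) +
            quadtree_blocks(img[h2:, :w2], threshold, min_size) +
            quadtree_blocks(img[h2:, w2:], threshold, min_size))

img = np.random.randint(0, 256, (256, 256))      # placeholder greyscale image
print(quadtree_blocks(img))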

Fractal Dimension The Box Counting Method [45] was used for binary images. The image is covered with a grid of square cells with cell size r. Fractal dimension is determined from functions of cell size as shown in Figure 5.7.

The image is covered with cells of side length r, and N(r) is the number of cells containing a portion of the image. This is performed over a range of box sizes: 128x128, 64x64, 32x32, ..., 1x1. The fractal dimension is the slope of the plot of log(N(r)) against log(1/r).

Figure 5.7: Calculating Fractal Dimension for Binary Images
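A compact box-counting sketch is given below, assuming a 256x256 binary image so that the box sizes run from 128x128 down to 1x1; the example image is a placeholder.

import numpy as np

def box_counting_dimension(binary_img):
    """Fractal dimension as the slope of log(N(r)) against log(1/r), where N(r) is
    the number of cells of side r containing a portion of the image."""
    sizes = [128, 64, 32, 16, 8, 4, 2, 1]
    log_inv_r, log_n = [], []
    for r in sizes:
        n = 0
        for i in range(0, binary_img.shape[0], r):
            for j in range(0, binary_img.shape[1], r):
                if binary_img[i:i + r, j:j + r].any():
                    n += 1
        if n > 0:
            log_inv_r.append(np.log(1.0 / r))
            log_n.append(np.log(n))
    slope, _ = np.polyfit(log_inv_r, log_n, 1)
    return slope

img = np.zeros((256, 256), dtype=bool)
img[64:192, 64:192] = True                   # placeholder binary image: a filled square
print(box_counting_dimension(img))           # close to 2 for a filled region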


Fractal dimension for greyscale images (refer Figure 5.8) was determined from an analysis of a pixel’s environment at different square size r [45].

For each square (window) size r (eg. r = 5, 7, 9), the minimum and maximum grey values within the square are determined and assigned to the central pixel, giving 2D min and max functions for that r. The difference in volume between the max and min functions is then determined for the entire image, V(r):

V(r) = \int_{1}^{256}\int_{1}^{256}\int_{z=f_1(x,y)}^{z=f_2(x,y)} dz\,dy\,dx    (Equation 5)

where, for every (x,y) in the region, z extends from the lower (min) surface f_1(x,y) to the upper (max) surface f_2(x,y), with the boundaries in the x- and y-planes covering the image. The fractal dimension is then 3 - (slope/2), where the slope is taken from the plot of ln(V) against ln(r).

Figure 5.8: Calculating Fractal Dimension for Greyscale Images
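A rough sketch of this min/max-surface approach is given below using square neighbourhood filters; the particular window sizes and the use of scipy's filters are assumptions, not the implementation of [45].

import numpy as np
from scipy import ndimage

def greyscale_fractal_dimension(img, window_sizes=(5, 7, 9, 11, 13)):
    """Build max and min surfaces over r x r neighbourhoods for each window size r,
    take the 'volume' between them, and return 3 - slope/2 from ln(V) vs ln(r)."""
    img = img.astype(float)
    log_r, log_v = [], []
    for r in window_sizes:
        upper = ndimage.maximum_filter(img, size=r)    # max grey value around each pixel
        lower = ndimage.minimum_filter(img, size=r)    # min grey value around each pixel
        volume = float(np.sum(upper - lower)) + 1e-12  # avoid log(0) for flat images
        log_r.append(np.log(r))
        log_v.append(np.log(volume))
    slope, _ = np.polyfit(log_r, log_v, 1)
    return 3.0 - slope / 2.0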


Image similarity and symmetry Three measures were used for image internal similarity (exact match across x and y axes) and image symmetry (mirror match across x and y axes):

1. Exact pixel match (Fig 5.9) - no sub-block analysis (same result operating on big or small block)

• Exact pixel match across y-axis (same)

• Exact pixel match across y-axis (mirror)

• Exact pixel match across x-axis (same)

• Exact pixel match across x-axis (mirror)

Figure 5.9: Determining image similarity and symmetry – pixel matching
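A minimal sketch of the exact-pixel-match idea is shown below, reported as the fraction of matching pixels; the function names and the choice to report a fraction are illustrative assumptions.

import numpy as np

def mirror_symmetry(img, axis=1):
    """Fraction of pixels exactly matching their mirror image across the given axis
    (axis=1 mirrors left-right, axis=0 mirrors top-bottom)."""
    return float(np.mean(img == np.flip(img, axis=axis)))

def half_similarity(img):
    """Exact pixel match between the left and right halves, without mirroring."""
    half = img.shape[1] // 2
    return float(np.mean(img[:, :half] == img[:, img.shape[1] - half:]))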

2. Shaded pixel difference between blocks - 5 level subblock analysis (objects might be in a different position within a block)

3. Average pixel value - 5 level sub-block analysis


For the sub-block analysis used in measures (2) and (3) above, five levels were used as depicted below.

LEVEL:       1     2     3     4      5
CONFIG:      2x2   4x4   8x8   16x16  32x32
No. PIXELS:  128   64    32    16     8

Each sub-block level is weighted by weight = 128 / block-pixels, so the finer sub-block levels are weighted more.

Figure 5.10: Determining image similarity and symmetry – pixel difference and average value

There were two approaches to developing the metric to then compare predictions against subjective data:

1. develop a metric for each image quality class (25x25 greyscale, 256x256 binary etc) to be applied only to images of that quality

2. develop a metric that is stable across all image quality classes (not just the ones used in these tests).

5.3.2 Metric Development for a Specific Image Quality Class

Stepwise regression was used to search for the optimum subset of variables. The procedure was based on sequentially introducing variables into a regression model one at a time and testing the significance of all variables at each stage.

15 image attributes were considered for the visual information model. The addition of any single variable from the above list will increase the regression sum of squares, or SSR (the amount of variation in y-values explained by the model), and reduce the error sum of squares (the variation about the regression line). The use of unimportant variables reduces the effectiveness of the model by increasing the variance of the estimated response.

The stepwise regression procedure, taken from [103] was as follows:

STEP 1

Simple linear regression was performed with each variable. The variable giving the largest regression sum of squares, or largest value of R², with significance (tested using the F-statistic) was chosen as the initial variable, x1 say.

STEP 2

Each variable was inserted along with x1. The variable giving the largest significant increase in R², in the presence of x1, over the R² found in step 1 was then selected as x2.

This process was continued until the most recent variable inserted failed to induce a significant increase in the explained regression. Such an increase was determined using the F-test.

It was quite possible that a variable entering the regression equation at an early stage might have been rendered unimportant or redundant because of relationships that exist between it and other variables entering at later stages. Therefore at each stage in which a new variable was entered in the regression equation through a significant increase in R² as determined by the F-test, all the variables already in the model were subjected to F-tests in light of this new variable, and were deleted if they did not display a significant F-value. The procedure was continued until a stage was reached in which no additional variables could be inserted or deleted.
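A condensed forward-selection sketch of this procedure is given below; for brevity it omits the deletion step for variables that later become redundant, so it is an approximation of the stepwise procedure rather than the exact implementation used, and the usage data are placeholders.

import numpy as np
from scipy import stats

def forward_stepwise(X, y, alpha=0.05):
    """Forward selection: at each step add the variable giving the largest increase in
    the regression sum of squares, provided the increase is significant by a partial
    F-test. (Deletion of variables that later become redundant is omitted here.)"""
    n = len(y)

    def sse(cols):
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        coeffs = np.linalg.lstsq(A, y, rcond=None)[0]
        residuals = y - A @ coeffs
        return float(residuals @ residuals)

    selected, current_sse = [], sse([])
    while True:
        best = None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            new_sse = sse(selected + [j])
            df_error = n - len(selected) - 2              # n - (no. of predictors) - 1
            if df_error <= 0 or new_sse == 0:
                continue
            f = (current_sse - new_sse) / (new_sse / df_error)
            p_value = 1.0 - stats.f.cdf(f, 1, df_error)
            if p_value < alpha and (best is None or new_sse < best[1]):
                best = (j, new_sse)
        if best is None:
            return selected
        selected.append(best[0])
        current_sse = best[1]

# Hypothetical usage: 15 candidate attributes measured on 7 images, subjective score y.
rng = np.random.default_rng(0)
X, y = rng.random((7, 15)), rng.random(7)
print(forward_stepwise(X, y, alpha=0.1))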

Model development and comparisons of metric predictions with subjective data are illustrated below for 2 image quality classes: 256x256 greyscale and 10x10 binary, which represent the two extremes of the image quality classes tested.


5.3.2.1 Example 1: Construction of a model for the 256x256 greyscale quality image set

The table below shows sample (Pearson product-moment) correlation coefficients (= Multiple R) for each variable along with the level where the model is significant as tested by the F statistic.

Variable        Ref   y1 - Images presented at one time        Ref   y2 - Paired comparison tests
                      R Correlation   Level where significant        R Correlation   Level where significant
                      Coefficient     for F(1,7-1-1) test            Coefficient     for F(1,7-1-1) test
File size        1    0.74            0.06                      A    0.63            0.13
SD               2    0.74            0.06                      B    0.71            0.08
Quad max SD      3    0.64            0.12                      C    0.58            0.18
Variance         4    0.74            0.06                      D    0.71            0.08
Quad max var     5    0.63            0.14                      E    0.56            0.20
Entropy          6    0.56            0.20                      F    0.43            0.33
Edges            7    0.90            0.01                      G    0.86            0.02
Segments         8    0.28            0.55                      H    0.20            0.68
Sim_pixels       9    0.08            0.87                      I    0.06            0.90
Sim_shaded      10    0.35            0.45                      J    0.23            0.63
Sim_mean        11    0.70            0.08                      K    0.65            0.12
Sym_pixels      12    0.10            0.84                      L    0.00            1.00
Sym_shaded      13    0.43            0.34                      M    0.28            0.54
Sym_mean        14    0.70            0.08                      N    0.68            0.10
Fractal dim     15    0.76            0.05                      O    0.70            0.08

Table 5.4: Correlation coefficients for variables considered for metric for 256x256 greyscale images

Initially a model will be developed for y1 data, where images were presented to subjects at one time. This will be compared to y2 – Paired Comparison data. In the discussion that follows, the reference numbers (column 2) or letters (column 5) for the image attributes tabulated above are contained within angle brackets.

We start with Edges <7> – this variable has the highest regression sum of squares (SSR), correlation coefficient R and R². The model is significant at the 0.005 level (highly significant).


We now test all variables with Edges already in the model. We need to find the largest increase in SSR, in the presence of Edges, over the SSR found for Edges alone. ie. We need to find the variable xj, for which R(βj|β7) = R(β7,βj) - R(β7) is largest, where R(β) denotes regression sum of squares for a model with variable β.

The combination of Edges and Entropy <7,6> is significant at the 0.006 level and has the highest increase in SSR above the model with Edges alone. This SSR increase is significant at the α=0.084 level (F(1,7-2-1) test). In order for this increase to be significant at the α=0.05 level we would need the sample size to be 12, not 7. Now when subjecting edges in the presence of entropy to a significance test ie. R(β7|β6), P = 0.005, which is highly significant, so Edges can be retained. Thus looking at the α=0.1 level of confidence Entropy can be included along with Edges.

We now require checking all other variables with Edges and Entropy already in the model ie. R(βj|β7,β6) = R(β7,β6,βj) - R(β7,β6). The combination of Edges, Entropy and Sym_mean <7,6,14> gives a model significance of 0.017 and the largest increase in SSR. However this increased regression is only significant at the α=0.255 level of confidence (F(1,7-3-1) test). Other variable models are significant at the following levels of confidence: α=0.012 <14>, α=0.062 <7>, and α=0.008<6>. Thus if considering variables up to the α=0.26 level then this third variable can be included.

For four variables, we require R(βj|β7,β6,β14) = R(β7,β6,β14,βj) - R(β7,β6,β14), and need to check increases in SSR with a F(1,7-4-1) test. The largest increase in SSR is with variable <9> - Sim_pixels. However this significance is very low: P = 0.466! Other variable models are significant at α=0.017 <9>, α=0.034 <14>, α=0.098 <7>, and α=0.025 <6> levels of confidence. Thus if considering variables up to the α=0.5 level then this fourth variable can be included (overall model significance = 0.067).

A check will also be made using the variable with the second highest R2 value as the first term in the model. We start with Fractal Dimension <15>, which has a correlation coefficient of 0.76 and an F statistic that is significant at the α=0.05 level.


2 variable model: Fractal Dimension + Standard Deviation <15,2> gives the highest increase in SSR, which is significant at the α=0.06 level.
3 variable model: Fractal Dimension + Standard Deviation + Sim_mean <15,2,11> is significant at the α=0.195 level. Also Fractal Dimension + Standard Deviation + Variance <15,2,4> is easier to compute and is significant at the α=0.215 level.
4 variable model: <15,2,4,14> is significant at the α=0.23 level, while <15,2,11,9> is significant at the α=0.184 level of confidence.
5 variable model: <15,2,11,9,10> is significant only at the α=0.419 level of confidence.
These models with low significance (α>0.05) are of limited use in developing a metric to predict information content.

For completeness a check is made on using the variable with the third highest R2 value as the first term in the model. We start with File size <1>, which has a correlation coefficient of 0.74 and an F statistic that is significant at the α=0.055 level. 2 variable model: File size + Edges <1,7> is significant at the α=0.173 level, which is clearly inferior to models proposed above. Thus there is no need for further analysis down this path.

Several candidate models were proposed above with varying numbers of model terms. A simple model is a consideration that cannot be ignored, but it is not desired to underfit the model. The Cp statistic can be used to consider the compromise between the excessive bias incurred when one underfits the model (chooses too few model terms) and the excessive prediction variance when one overfits (has redundancies in the model). The Cp statistic is a function of the total number of parameters (p) in the candidate model and the error mean square:

C_p = p + \frac{(s^2 - \sigma^2)(n - p)}{\sigma^2}    (Equation 6)

where σ² is the error mean square for the most complete model, s² is the error mean square for the candidate model, p is the number of model parameters and n is the number of observations. Cp > p indicates a model that is biased due to being underfitted, while Cp ≈ p indicates a reasonable model.
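Equation 6 can be applied directly; the small helper below reproduces the Cp ≈ 3.6 entry of Table 5.5 for model <7,6>, assuming that the most complete model in that block (<7,6,14,9>, error mean square 164.32) supplies σ² and that n = 7 images.

def mallows_cp(s2_candidate, sigma2_full, n, p):
    """Cp statistic (Equation 6): p plus the scaled excess error of the candidate model."""
    return p + (s2_candidate - sigma2_full) * (n - p) / sigma2_full

# Model <7,6> from Table 5.5: p = 3 parameters, error mean square 190.23.
print(mallows_cp(190.23, 164.32, n=7, p=3))   # about 3.6, matching the table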


Candidate models are listed in the table below.

Variables        No. of       R²     Error mean   Model          Significance of        Cp
                 parameters          square       significance   increased regression
7                2            0.82   352.33       0.005          -                      7.7
7,6              3            0.92   190.23       0.006          0.084                  3.6
7,6,14           4            0.95   153.22       0.017          0.255                  3.8
7,6,14,9         5            0.97   164.32       0.067          0.466                  5.0

15               2            0.58   803.41       0.046          -                      20.5
15,2             3            0.84   374.44       0.024          0.060                  7.8
15,2,4           4            0.91   274.58       0.041          0.215                  5.8
15,2,4,14        5            0.96   170.58       0.070          0.235                  5.0

15               2            0.58   803.41       0.046          -                      38.2
15,2             3            0.84   374.44       0.024          0.060                  14.4
15,2,11          4            0.92   259.74       0.038          0.195                  9.0
15,2,11,9        5            0.97   130.38       0.053          0.184                  5.7
15,2,11,9,10     6            0.99   97.52        0.170          0.419                  6.0

Table 5.5: Candidate models for metric for 256x256 greyscale images

The best model is <7,6> - Edges and Entropy, which has a high R², high overall model significance, a significant increase in regression over <7> and a Cp statistic that indicates a reasonable model (not underfitted or overfitted). The simple model <7> - Edges has high significance but the Cp value indicates it is underfitted.

Thus a modelling function f is proposed such that:

Information Content = f(Edges, Entropy)
Actual Equation: Information Content = 295.6 - (0.047*Edges) - (23.43*Entropy)

Prediction performance will still be checked with the simpler underfitted model:

Information Content = f(Edges)
Actual Equation: Information Content = 195.5 - (0.052*Edges)
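For convenience, the two fitted equations above can be wrapped as simple functions; this is a direct transcription of the quoted coefficients, nothing more.

def info_content_edges_entropy(edges, entropy):
    """Fitted metric for 256x256 greyscale images: f(Edges, Entropy)."""
    return 295.6 - 0.047 * edges - 23.43 * entropy

def info_content_edges(edges):
    """Simpler (underfitted) metric: f(Edges)."""
    return 195.5 - 0.052 * edges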


We also wished to compare the two types of experiment data: images presented at one time (y1) against paired comparison data (y2). So for a model based on y2 (paired comparison experiments), we start with Edges, which has the highest SSR and a significant F statistic. The 2 variable model of Edges + SD gives the highest increase in regression. While the overall 2 variable model is significant at P=0.047, the increase in regression over the 1 variable model is not highly significant (P=0.407). The 2 variable model Edges + Entropy, which is equivalent to the <7,6> model developed for y1 data, has an overall significance of P=0.048 but the regression increase is not significant over the 1 variable model (P=0.422).

Model Performance of 256x256 greyscale metric

The model was built using the responses from subjects viewing the Actual Objects test set, containing 7 images of that particular image quality class. Subjective responses were also collated in the experiment for other image sets for that quality class. Example test stimuli for this data collection are shown in Appendix Section C.2. The predictive performance of the model is now checked against rankings of the additional test image sets. The additional test sets comprise only 3 images each.

Model predictions are presented over in Table 5.6.

Using the model Information Content = f(Edges, Entropy), the dominant ranking as selected by the 25 test respondents was only predicted for 1/7 test cases: the number of objects in scene. This was a very strong preference, with 96% of respondents selecting this order. Using the simpler model Information Content = f(Edges), the dominant ranking was predicted for 2/7 test cases: {the number of objects in scene} and {contrast between objects & surround}.

An additional example will be presented outlining the construction of a metric specifically for the 10x10 Binary image set to determine if there is similar poor predictive performance.


Dominant pattern arising from tests (Visual Information Ranking Order), with the number of test responses in the dominant pattern, whether the model gave a correct prediction, the number of test responses matching the model, and other quality classes with the model-predicted dominant pattern:

Number of Objects in Scene: 24/25 in dominant pattern; correct prediction using model: Y; responses same as model: 24/25; other quality classes with model-predicted dominant pattern: 6/9 (10B, 16B, 25B, 256E, 256B, 256F/G)

Angle of Object: 13/25 in dominant pattern; correct prediction: N; responses same as model: 3/25; other quality classes: 3/9 (25B, 16B, 10B)

Distance to Object: 17/25 in dominant pattern; correct prediction: N; responses same as model: 1/25; other quality classes: 4/9 (256B, 25F, 16B, 10B)

Connection between image objects: 12/25 in dominant pattern; correct prediction: N; responses same as model: 0/25; other quality classes: 0/9

Image Detail: 16/25 in dominant pattern; correct prediction: N, but same image ranked #1; responses same as model: 1/25; other quality classes: 1/9 (256B)

Contrast between objects and surround: 14/25 in dominant pattern; correct prediction: N; responses same as model: 8/25; other quality classes: 2/9 (16B, 25B)

Variety of Object types: 11/25 in dominant pattern; correct prediction: N; responses same as model: 3/25; other quality classes: 0/9

Table 5.6: Model Predictions of 256x256 Greyscale Image Set using model: f(Edges, Entropy)


5.3.2.2 Example 2: Construction of a model for the 10x10 Binary image set

Candidate models are listed in the table below.

Variables        No. of       R²     Error mean   Model          Significance of        Cp
                 parameters          square       significance   increased regression
8                2            0.88   337.29       0.002          -                      46.41
8,12             3            0.96   127.10       0.001          0.038                  13.90
8,12,10          4            0.99   70.66        0.003          0.133                  7.21
8,12,10,15       5            1.00   31.91        0.009          0.164                  4.87
8,12,10,15,11    6            1.00   34.13        0.082          0.522                  6.00

Table 5.7: Candidate models for a metric for 10x10 binary images

Model <8,12> - Segments & Sym_pixels has a significant increase in regression over <8> but high Cp statistic. Model <8,12,10,15> - Segments, Sym_pixels, Sim_shaded, & Fractal dimension has a better Cp value but the increased regression over simpler model is not as significant as Model <8,12>.

Using the model Information Content = f(Segments, Sym_pixels), the dominant ranking as selected by the 25 test respondents was predicted for 4/7 test cases. The more detailed model Information Content = f(Segments, Sym_pixels, Sim_shaded, Fractal dim) predicted 3/7 cases.

Using the models developed for the 25x25 greyscale test set on the 10x10 Binary image sets yielded 4/7 correct predictions for Information Content = f(Edges, Entropy), and 3/7 correct predictions for Information Content = f(Edges).

That is, similar predictive performance was obtained from a model developed specifically from data for that image quality class (10x10 binary) and from a model developed from data for much higher quality images. Thus there is motivation to pursue the development of one metric applicable across all image quality classes.


5.3.3 Information Content Metric for all Image Quality Classes

It was desired to develop a metric that was stable across all image quality classes, as the predictive power of metrics specifically tailored to a particular image quality class appeared arbitrary, as described above. Again, subjective rankings from one of the eight visual dimensions (Actual Objects) were used to construct this global metric. The metric was then validated against the subjective results of the other 7 dimensions.

Correlations between the 15 image attributes discussed above and perceived information content rankings are shown over in Figure 5.11. The vertical axis represents average correlations for the two different ranking schemes used: Images presented all at once and Paired Comparison tests. The stepwise regression model developed in the previous section systematically grouped attributes based on increased significance of regression. Now it is desirable to choose one attribute applicable for all image quality classes. From the plots of Fig 5.11, it is evident that the “Edge” attribute features in the uppermost one or two correlation curves for both binary and greyscale images at all spatial resolutions tested in the experiment. Thus edges are proposed as a dominant indicator of information content across both low and high image quality classes.

This determination supports Marr’s emphasis of zero crossing (edge) detection in producing images of the external world [56]. The role of edges in scene recognition and interpretation was discussed in Section 3.5.2.4.

A metric based on the number of edges in an image is now validated by comparing metric predictions with perceived information rankings for the remaining 7 data sets. 63 dominant viewer rankings were compared – 7 visual dimensions x 9 image quality classes.


Figure 5.11: Correlation between 15 image attributes and perceived information content. (Two plots, for Greyscale Images and Binary Images, show the average correlation (0 to 1) of each attribute - Filesize, Standard Dev, quad max SD, Variance, quad var, Entropy, Edges, Segments, Sim_pixels, Sim_shaded, Sim_mean, Sym_pixels, Sym_shaded, Sym_mean and Fractal dim - against spatial resolution: 10x10, 16x16, 25x25, 256x256.)


The performance of the edge metric in predicting subjective dominant viewer patterns is shown over in Table 5.9 and in summary form below in Table 5.8.

Strength and number of cases for       Frequency of image with highest      Frequency of exact ranking
dominant viewer patterns (63 in total) info content being predicted         being predicted by metric
                                       by metric
90-100%: 3  (STRONG)                   100%                                 100%
80-89%: 4                              75%                                  75%
70-79%: 1                              100%                                 100%
60-69%: 12                             67%                                  25%
50-59%: 6                              83%                                  50%
40-49%: 16                             38%                                  19%
30-39%: 19                             32%                                  21%
20-29%: 2  (WEAK)                      100%                                 100%
10-19%: 0                              -                                    -
0-9%: 0                                -                                    -

Table 5.8: Summary of metric performance

Out of the 63 test cases examined, three cases had 90% or above consensus from subjects viewing the sample set. For each of these cases, the metric successfully predicted not only which of the 3 images had the highest information content (2nd column above) but also the ranking order chosen by subjects (3rd column above). Metric performance at weaker subject consensus levels are also shown.

There were several cases where the metric prediction in the 2nd column above was low. However this was for cases where there was low consensus amongst the sample regarding the preferred ranking order; ie. if human subjects could not agree unanimously on a preferred ranking order, it was difficult to expect a metric to do so. What is important is whether the metric could predict those cases where there was strong agreement among the sample.


Predictive Metric Performance

A metric has been proposed from subject responses to one of eight visual dimensions (Actual Objects). Here it is validated against the other seven dimensions - Number, Angle, Distance, Connectivity, Detail, Contrast, Variety.

The table shows whether the metric predicted the dominant ranking as chosen by test subjects.

For each of the nine image quality classes and each of the seven dimensions, the table lists the ranking of the three images chosen by the highest number of respondents, the number of respondents (out of 25) choosing it and the corresponding percentage, and whether the metric predicted it.

Highlighting denotes strong viewer preferences, with 70% or more of respondents choosing that pattern (18 out of sample size = 25).

***Predicted - refers to the exact ranking of all 3 images being predicted by the metric. #1 Predicted - refers to the image with the highest info content being predicted by the metric, ie. #1 in the rank order only.

Table 5.9: Predictive performance of metric proposed for all image qualities


It was of particular interest to examine metric performance in view of the strong dominant viewer patterns in the data. Eight of the 63 rankings had 70% or above consensus among viewers. These have been mentioned above in Section 5.2.6, but are reproduced below in Figure 5.12 with the inclusion of metric performance.

Number of Objects in Scene (5 image quality classes): 16x16_Binary (88%), 25x25_Binary (92%), 256x256_Edges (84%), 256x256_Binary (96%), 256x256 (96%). All 5 cases predicted by metric?: Yes

Closeness between image objects (1 image quality class): 25x25 greyscale set (80%). Single case predicted by metric?: Yes

Image Detail (1 image quality class): 16x16 greyscale set (80%). Single case predicted by metric?: No (Metric prediction: phone > 2 faces > single face)

Contrast between Objects & Surround (1 image quality class): 256x256_Edge set (72%). Single case predicted by metric?: Yes

Figure 5.12: Metric performance for Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content


The visual information metric predicted 7 of the 8 strong viewer preferences (70% or above consensus level). Viewers of the 16x16 greyscale Image Detail set ranked a simple face as containing most visual information, while the metric ranked the image of the phone and two faces ahead of the single face. The familiarity and strong recognition of the human face at low levels of image quality may cause viewers to select it over others containing unrecognisable blobs.

The metric was found to work best with binary images, which are expected from at least early prototype designs. (Limited greyscale may be possible by modulating stimulus amplitude, frequency and pulse duration as discussed in Section 3.5.2.2). The number of ranking cases where the metric was able to predict the image with the highest information content is shown in Table 5.10 below. There are a total of seven ranking cases for each image quality class, corresponding to each visual dimension explored.

10x10 Binary set: 4/7          10x10 Greyscale set: 1/7
16x16 Binary set: 6/7          16x16 Greyscale set: 1/7
25x25 Binary set: 4/7          25x25 Greyscale set: 3/7
256x256 Binary set: 6/7        256x256 Greyscale set: 3/7
256x256 Edge set: 6/7

Table 5.10: The number of correct metric predictions of images with the highest information content

This may be another reason why the metric prediction for the 16x16 greyscale Image Detail set did not agree with the ranking chosen by 80% of viewers. Table 5.10 shows that for 16x16 greyscale images, the metric was successful in predicting the image with the highest information content in only 1 out of 7 cases. However for 16x16 binary images, the metric prediction was correct for 6 out of 7 cases. It should be remembered that the strength of the dominant patterns on which metric performance is assessed ranges from 96% to 28%. At high levels of viewer consensus, the metric is accurate in predicting images with the highest information content, and is thus considered acceptable for this application.

Therefore, it can be stated that visual information content in images can be quantified, and a mechanism for achieving this with a reasonable level of performance has been proposed here. However, does maximising information content in low quality images result in enhanced perception of that image? In order to answer this question, it is necessary to analyse the relationship between low quality image recognition and information content in images ie. is the measure for information content an adequate pointer to how well an image might be recognised? This relationship is explored in the next section.

5.4 Correlations Between Recognition Rate And Perceived Information Content

It was desired to determine if there was any relationship between recognition rates and the amount of visual information as perceived by viewers.

Previous experiments described in Section 4.3 assessed perception performance for these same images (ie. the ability of these images to be correctly recognised). The subjects were first asked to describe the objects (eg. Appendix Section B.1) and then to assess the images for the amount of visual information they contained (refer test stimulus C.1). The questionnaire booklet design shown in Appendix Section C.4 includes a section referred to as "PART 3 Check if correlated with recogn" which relates to this section on correlating information content and recognition. As with other aspects of this experiment, care was taken in the booklet design to reduce learning effects, fatigue and boredom.

Relationships between correct object recognition and subjective information content scores were obtained for each image quality class. For example, Figure 5.13 shows the relationship for 25x25 binary Paired Comparison experiments. The horizontal axis shows a subjective score for visual information developed from the numeric ranking scheme used (higher numeric score = higher perceived information content).


Figure 5.13: Example relationship between recognition and information content (25x25 Binary Paired Comparison data). The recognition rate (proportion of subjects correctly identifying the object, 0 to 1) is plotted against the subjective score for visual information content (0 to 150).

The significance of these relationships was then assessed. Linear regression models for each quality class were developed for two series of data:

1. where images were presented at one time

2. paired comparison data

The significance of the models and correlation coefficients appears in Table 5.11 below.

Image Quality Class     Images presented at one time          Paired Comparison
                        R Correlation   Significance          R Correlation   Significance
                        Coeff.          F(1,7-1-1) test       Coeff.          F(1,7-1-1) test
10x10 Binary            0.76            0.05                  0.76            0.05
10x10 Greyscale         0.54            0.21                  0.36            0.43
16x16 Binary            0.69            0.09                  0.71            0.07
16x16 Greyscale         0.70            0.08                  0.61            0.15
25x25 Binary            0.90            0.01                  0.85            0.01
25x25 Greyscale         0.75            0.05                  0.81            0.03
256x256 Edge            0.70            0.08                  0.66            0.10
256x256 Bin             0.69            0.09                  0.73            0.06

Table 5.11: Correlation coefficients between recognition rate and perceived information content
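Each R value and its F(1,5) significance in Table 5.11 can be reproduced from the seven (information score, recognition rate) pairs of a quality class; the sketch below uses hypothetical values, and relies on the fact that for simple linear regression the Pearson correlation p-value equals the F(1, n-2) test significance.

from scipy import stats

# Hypothetical values for the 7 images of one quality class:
recognition_rate = [0.90, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
info_score = [120, 100, 80, 70, 60, 40, 30]

r, p_value = stats.pearsonr(info_score, recognition_rate)
print(f"R = {r:.2f}, significance = {p_value:.2f}")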


There was some evidence for correlation between ranked information content and recognition rates with significance levels ranging from P=0.05 to P=0.1 for all but the 10x10 greyscale image set. Thus if the information content of an image is maximised (by maximising the number of edges) enhanced perception is expected.

5.5 Chapter Summary

In the field of low quality vision, there is a need for delivering maximum scene information to a limited number of display electrodes/pixels. In this chapter a method is proposed to enhance recognition using importance maps weighted to maximise the “information content” in the resulting importance map.

An experiment was described to quantify the term information content. 15 image attributes were correlated with subjective rankings of visual information. Initially metrics were developed tailored to a specific image quality (eg. 10x10 binary metric). However their arbitrary predictive performance led to the construction of a metric that was stable across a wide range of image quality classes. The number of edges in an image was found to be a dominant indicator of perceived information content. An edge metric was tested on additional subjective data and found to be appropriate in assessing information content. Finally it was shown that subjective information content was significantly related to object recognition.

Thus it was possible to construct a model for basic information required for the interpretation of a visual scene at low image quality.

This finding can now be applied to generating importance maps containing higher information content. Chapter 8 compares such a method with others to determine preferred presentation options.


Chapter 6 Scene Specific Imaging

6.1 Overview

As discussed in preceding chapters of this thesis, in order to make best use of the limited number of electrodes in visual prostheses, it is proposed to first process images to extract more information from the scene. One of the conclusions of the preliminary experiments discussed in Chapter 4 was that it might be beneficial to have device switchable processing for different scenes. This chapter then aims to answer the research question:

Q4: Should the processing techniques be adjusted depending on the scene type?

Characteristics of several scene types are outlined in this chapter and then categorised in image processing terms. A simple experiment involving 20 normally sighted viewers tests if there is some benefit in scene-dependent processing to deliver enhanced perception of low quality images.

6.2 Characteristics of Simple Scenes

This section lists characteristics of simple scenes that a patient fitted with a visual implant may experience.

6.2.1 Office

Many office environments have fluorescent lighting. Work spaces might be defined by partitions that are up to two metres high or floor-to-ceiling walls, or a combination of walls and partitions. A person’s working range would be of the order

of one metre, but a visual range of five metres would be useful. Objects in the environment may be located on a horizontal desk surface (phone, computer, documents). Objects could be distinguished by intensity and colour contrast on the desk. The rest of the office would probably not have much colour except for office plants, pictures, and people. The user is mostly stationary in this environment.

6.2.2 Home

Although the viewer is familiar with the home environment, potential hazards abound, including room and cupboard doors left open and obstacles on the floor. The kitchen is usually comprised of a (reflective) sink, benches and cupboard. Bedrooms would contain a bed, cupboard and dresser. Chairs and possibly a television are found in a lounge room, while a dining room contains table and chairs. Bathrooms may contain a bathtub, vanity unit and toilet.

6.2.3 Street

The viewer is likely to be moving in a streetscape environment, either as a vehicle passenger or walking. Possible obstacles include posts, street signs, curbs, people, and construction works (including holes and fencing). Many edges are contained in this man-made (constructed) environment, including footpaths and shopfronts. There is limited colour variation, with the predominant building and ground colours being greys, and pedestrians and parked cars represented by coloured regions. There is a combination of natural lighting (sunlight) and shaded portions. The viewer may require a working range of five metres (when moving) and a visual range of up to fifty metres.


6.2.4 Outdoors

Outdoor environments contain natural scenes, such as trees, plants, grasses, bushes, seats, beach, ocean. There are limited edges and natural lighting (sunlight). For park areas there is limited colour variation (mostly greens) and alternating intensity from shade and sun patches. For beach scenes there is high intensity glare, with many reflections from water and white sand. Beach environments also usually have finite colours, with blue ocean and white or yellow sand. Smaller coloured regions in the outdoor environment may correspond to people or signs. A viewer may require a working range up to ten metres and a visual range as much as 100 metres.

6.2.5 Head and Shoulders

A special case of scene type exists for situations where the viewer is engaged in conversation or communication with others. In these close contact situations the viewer requires an image of the head and shoulders alone. The visual range need not be more than two metres. Both the scene and viewer are mostly stationary. Faces and other skin areas are detected within a range of pink hues, while the hair may be darker (eg. browns and black).


6.2.6 Café/Restaurant

This scene type often has indoor lighting conditions (fluorescent/incandescent). There are usually tables approximately one metre high separated by small spaces (navigation gaps). Tables could be circular or rectangular. Chairs are positioned around the tables with chair backs typically higher than the surface of the table. Cutlery and plates may lie on the table, with glasses, cups or jugs projecting up to 200mm above the table surface. Cutlery and glass on the table are highly reflective. A payment area may comprise a desk at waist-chest height with a horizontal top for signing cheques etc. A viewer would need to be able to locate toilets and the café exit. The exit is usually signed and may be a two metre high large rectangle of different intensity, perhaps with a door.

6.2.7 Public Toilets

Gender differentiation is required for the viewer. The entrance to a public toilet is often through doors, with a handle at waist height on the left or right of the door. A 90 degree or 180 degree turn is then made to the left or right, along a floor to ceiling wall which is often tiled. Cubicles are built out from the wall and may be timber or rendered concrete with doors containing a lock at waist height. There may be a urinal, consisting of either separate units or a continuous unit along a wall sometimes with a step up. The toilets should contain a wash basin with soap or a liquid soap dispenser located above the basin. There may be a hand dryer/towel dispenser with a waste bin located below.


6.3 Image Processing targeted to Scene Type

It is proposed to present maximum information to implant electrodes targeted to a user’s environment. This can be achieved by applying varying processing routines to the input images.

In this section the scene types mentioned in the previous section are described in image processing terms to identify which image processing routines to apply. Table 6.1 shows these scene type descriptions.

SCENE             Motion in scene  Viewer motion  Dominant colours            Colour variation  Edge number  Edge types        No. regions  Working range  Visual range
Office            Low              Low            White, pastel               Low-med           High         Straight          Med          1m             5m
Home              Low              Med            All                         Low-med           Med          Straight, curved  Med          2m             5m
Street            High             High           Grey                        Med               High         Straight          High         5m             50m
Outdoor           Low              Med            Green, white, yellow, blue  Low               Low          Curved            Low          10m            100m
Head & Shoulders  Low              Low            Pink hues                   Low               Med          Curved            Low          1m             2m
Café              Med              Low            Silver                      Med               High         Straight, curved  High         1m             2m
Toilets           Low              Med            White, silver               Low-med           High         Straight          Med          2m             4m

Table 6.1: Image Processing descriptors of different scene types

In order to translate these scene descriptions into processing algorithms the scene first needs to be categorised. There have been some advancements in the area of automatic scene categorization. For example Chernyak and Stark [15] have developed a model for sequential knowledge acquisition using Bayes’ theorem (probability based). Segment features, such as average colour, aspect ratio and position are obtained from training sets of images covering scene categories (eg. “office”, “construction”, “children playing”). The algorithm then attempts to guess

the scene category of a test image. Although useful for robot applications designed to minimize human intervention, automatic categorization would not be essential for visual prosthesis applications. Human users would presumably be aware of their environment, and would be able to manually select the scene type suited to their surrounds.

Once the scene type is known, it is proposed to apply context/scene dependent importance weighting to the image. Chapter 5 described a means to manipulate image content using Importance Weighting to produce higher information content in the image conveyed to prosthesis electrodes. This chapter proposes a similar method of image manipulation, again using the Importance Map method (refer Section 3.5.2.6). However, here it is proposed that the weights for the feature maps are selected according to the scene type; proposed feature weights for several scene types are shown below in Table 6.2. Rather than all features having the same effect on the resulting importance map, their contribution is varied depending on the scene type. The percentage weights shown in Table 6.2 indicate the weight applied to each feature map to produce the resulting importance map; a 50% level indicates a neutral leaning/bias. (A minimal code sketch of this weighted combination is given after Table 6.2.)

                  ATTENTIONAL FEATURE
SCENE             Closeness             Intensity Contrast        Shape                   Size                    Viewing Area
                  Foreground (100%)     Lots of contrast (100%)   Long & skinny (100%)    Large regions (100%)    Central view (100%)
                  vs Background (0%)    vs little contrast (0%)   vs broad & round (0%)   vs small regions (0%)   vs periphery (0%)
Office            90%                   70%                       90%                     25%                     100%
Home              70%                   90%                       50%                     50%                     95%
Street            25%                   20%                       100%                    50%                     50%
Outdoors          25%                   80%                       10%                     100%                    50%
Head & Shoulders  100%                  80%                       50%                     25%                     100%
Café              100%                  20%                       50%                     50%                     80%
Toilets           90%                   30%                       10%                     50%                     80%

Table 6.2: Attentional feature weights for each scene type
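For illustration, the following Python sketch (not part of the thesis software) combines five pre-computed feature maps, each assumed to be normalised to the range 0-1, using scene weights in the spirit of Table 6.2. The weighted-average combination rule and the example weights chosen are assumptions for this sketch.

import numpy as np

SCENE_WEIGHTS = {                 # closeness, contrast, shape, size, central-view
    "office":  (0.90, 0.70, 0.90, 0.25, 1.00),
    "street":  (0.25, 0.20, 1.00, 0.50, 0.50),
    "outdoor": (0.25, 0.80, 0.10, 1.00, 0.50),
}

def importance_map(feature_maps, scene):
    """Weighted average of the five feature maps, producing one importance map."""
    weights = np.array(SCENE_WEIGHTS[scene])
    stack = np.stack(feature_maps)                       # shape (5, H, W)
    combined = np.tensordot(weights, stack, axes=1) / weights.sum()
    return combined                                      # stays in the range 0-1

# Example with random stand-in feature maps.
maps = [np.random.rand(25, 25) for _ in range(5)]
im = importance_map(maps, "outdoor")
print(im.shape, im.min(), im.max())

Normalising by the sum of the weights keeps the combined map in the same 0-1 range as the individual feature maps.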


6.4 Subjective Tests for Scene Weighted Processing

It was desired to test the proposal that context or scene weighted processing can improve perception of low quality images. The images shown in Figure 6.1, taken from the test stimulus in Appendix Section D.1, were shown to 20 normally sighted volunteers. Two images were shown, one representing an "outdoor" scene, and the other an "office" scene. Low quality versions of the images were presented alongside, representing quality levels typical of current prosthesis prototypes (25x25 spatial resolution, binary images). Booklet design is shown in Appendix Section D.2. Four low quality versions of the original were shown:

1. subsampled to 25x25 and binarised – this represents the standard or base case level of image processing present in most implant designs (no importance processing);
2. subsampled, importance processing with features weighted equally, then binarised;
3. subsampled, importance processing with features weighted according to the correct scene type (eg. for the lighthouse image, weights selected for "outdoor" scenes in Table 6.2), then binarised;
4. subsampled, importance processing with weights applied corresponding to a different scene type (eg. applying "office" weights to the lighthouse image), then binarised.


Figure 6.1: Visual stimuli used to gauge perception of low quality images. A = Original 256x256 image; B = subsampled to 25x25 binary; C = importance mapping with all feature weights equal; D = importance mapping with weights selected for "outdoor" scene type; E = importance mapping with "office" weights


Participants were asked to rank the images for how best (ie. most informatively) they represented the original scene. The images nominated as best representing the originals were as per Table 6.3.

Lighthouse image   "Outdoor" weights   "Office" weights   No processing   Equal weights
                   10/20 (50%)         7/20 (35%)         3/20 (15%)      0/20 (0%)

Chair image        No processing       "Office" weights   Equal weights   "Outdoor" weights
                   12/20 (60%)         6/20 (30%)         2/20 (10%)      0/20 (0%)

Table 6.3: Preferred ranking for image representation

For the lighthouse image, half of the sample size (10/20) chose the “outdoor” weighted image as best representing the original scene. This is in line with the expectation that improved perception may be obtained with processing images with respect to scene type. However for the chair image, most respondents found the image with no importance processing was better at representing the original image. Feedback from participants suggested that this image was closest in grey level values to the original, and if the importance-processed images had been inverted (ie. a black chair on a white background) they would have chosen that image. For a subsequent thesis experiment described in the next chapter, the importance-processed images were inverted to be in a similar form to the original.

It should be noted that the image inversion recommendation arises from experiments with sighted viewers with sophisticated expectations. This may not be the same for visually impaired persons with a simplified understanding of the world. Other criteria relating to electrode stimulation may dominate once these systems are more common. This might include always stimulating the smaller number of electrodes irrespective of dominant greylevel information to avoid long term tissue damage, or sharing of electrodes to obtain longer life from the electrode arrays.


6.5 Chapter Summary

By first processing images presented to implant electrodes, it may be possible to provide a better presentation than subsampling alone. In this chapter, some of the characteristics of simple scenes have been listed and described in image processing terms. A simple experiment was conducted to determine the effect of processing images depending on the scene type. Expectations were confirmed for a test image containing an outdoor scene, but not for an office scene. Image inversion to best match an original high quality scene is recommended.

This chapter aimed to address the following research question: Q4: Should the processing techniques be adjusted depending on the scene type?

Experiment results show that improved perception may be obtained with processing images with respect to scene type. Scene dependent importance mapping is a powerful tool to use in the automatic optimisation of low quality images for human viewers. The next chapter compares this method with others to determine preferred presentation options.


Chapter 7 A comparison of ROI methods for low quality images

7.1 Overview

So far the thesis has reviewed image processing strategies for presenting the most useful information to implanted users, given the large information loss imposed by the small number of electrodes.

The previous two chapters described methods of manipulating image content based on a region of interest image processing technique known as Importance Maps. Chapter 5 concerned adjusting feature weights so that the resulting Importance Map contains maximum scene information. Chapter 6 proposed that feature weights are set for the particular scene type. Experiments described here in Chapter 7 aim to assess these proposals along with other methods to determine which method best helps users move through a scene.

The work in this chapter aims to extend upon the findings of Chapter 4 concerning the research question:

Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?

It is anticipated that ROI processing will trim away unnecessary information resulting in improved perception.

The region-of-interest/importance framework of the processing used in this chapter was described in Section 3.5.2.6. The following section describes the comparison experiment, showing the types and range of images used and the instructions given to participants. Results follow which indicate a clear trend. Two experiments were conducted:

1. Presentation of the entire region-of-interest processed image;
2. Presentation of only the salient area found from ROI processing – a digital zoom.


7.2 ROI Processing applied to Entire Image

7.2.1 Image Preparation

Images used in the tests were prepared as shown in Fig 7.1.

[Figure 7.1 is a flowchart: test image (256x256) → processing (refer 1)–6) in the text below) → resized to 25x25 (nearest neighbour) → either histogram equalisation followed by thresholding at greylevel 128, or thresholding at greylevel 128 alone → possible inversion to appear like the original.]

Figure 7.1: Image preparation


Six image processing methods were tested: four variations of Importance Mapping, edge detection and a non-processed ‘base-case’. In all methods the final image was Nearest-Neighbour resized to 25x25 spatial resolution which is representative of electrode numbers in prosthesis prototypes. One test case had greylevels equalised before thresholding at the 128 grey level, while a second test set had only thresholding at the 128 level with no histogram equalisation. Histogram equalisation spreads the grey levels out across the full greyscale range and it is intuitive to apply this equalisation to use the full dynamic range obtainable from the image. Other methods of histogram transformations (eg stretch, uniform) may be considered, but as the eventual image is reduced to very few shades, the differences are unlikely to be influential. Histogram equalisation depends on illumination and object shades of grey and can introduce spurious shadings in the thresholded image which do not actually represent image objects (see Fig 7.2). Hence it was desired to find if there were any differences in preferred processing algorithm using histogram equalisation.

(Panels: original image; 25x25 thresholded versions)

Figure 7.2: Shaded areas do not necessarily correlate with scene objects in images that have had grey levels equalised (right most image).

Finally, the thresholded images were compared with the original 256x256 test image and inverted if necessary to most closely match the original grey level. This inversion was the result of recommendations made after previous experiments described at the close of Chapter 6.
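The preparation steps just described can be sketched in Python as follows, using Pillow and NumPy. The file name and the rule used to decide whether inversion is needed are assumptions for illustration, not the thesis implementation.

import numpy as np
from PIL import Image, ImageOps

def prepare(path, equalise=True, size=25, threshold=128):
    img = Image.open(path).convert("L")                  # greyscale original
    small = img.resize((size, size), Image.NEAREST)      # nearest-neighbour subsample
    if equalise:
        small = ImageOps.equalize(small)                 # spread grey levels across the range
    binary = np.array(small) >= threshold                # threshold at grey level 128

    # Invert if the binary image's dominant level disagrees with the original's.
    orig_mostly_bright = np.array(img).mean() >= threshold
    if (binary.mean() >= 0.5) != orig_mostly_bright:
        binary = ~binary
    return binary.astype(np.uint8) * 255

# Example: low_quality = prepare("office_scene.png", equalise=False)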


7.2.2 Processing Methods Compared

Six methods of processing an original high quality image were compared:

(Panels: Original; 1) IM eq; 2) IM sc; 3) IM tr; 4) IM opt; 5) Edge; 6) No IP)

Figure 7.3: Processing methods used in tests (refer text for details)

1. Importance Mapping with all features weighted equally: ωcontrast = ωsize = ωshape etc.

2. Importance Mapping with weights selected depending on the image scene type. Experiments described in Chapter 6 have shown that improved perception may be obtained by processing images with respect to scene type in this way.

3. Weights selected in accordance with a training set of images from that scene type. A training database was developed consisting of 15 images of each scene category used in the tests (refer Appendix E.1). Feature maps for each image were determined along with the percentage of pixels in the top 25% of each feature map (ie. between 0.75 and 1.00 in the normalised images). This gave a measure of the strength of that feature for that image. For some images, there would be no pixels in the range 0.75 – 1.00 for that feature, while for others 100% of the image pixels might lie in this 0.75 – 1.00 importance range. Feature distributions were made for each scene category. Pixels in this upper 0.75 – 1.00 range would be determined for a test image and its position within the distribution determined. Weights were selected according to the position of the test image within the feature distribution (for example refer Fig 7.4).

[Figure 7.4 plots the percentage of images in the training database (y-axis, 0–100%) against feature weight (x-axis, 0–1), with the worked example of the caption falling at approximately 12%.]

Figure 7.4: Example Distribution – Size Map distribution for Beach training images; If the size map for a test image had 50% of pixels in the range 0.75 – 1.00, this would be greater than only one other image in the training image set. The weight applied to that feature map when combining feature maps is the interpolated value between the nearest images, ie. ~12%.

4. Weights iteratively adjusted in order to give the highest number of edges in the resulting Importance Map. Experiments described in Chapter 5 found that a subject's ability to recognise objects in low quality images is correlated with the number of edges in that image. A medium-scale Quasi-Newton line search optimisation routine was used to adjust the five weights to maximise the number of edges in the Importance Map (a sketch of this optimisation is given after this list).

5. Considering that the number of edges was found to correlate with correct object recognition, it was desired to present an edge map alone. Images were prepared using the Canny edge detection method operating on 25x25 spatial resolution images.

6. Finally an image was presented with no importance processing applied, as a base comparison case.
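The following Python sketch illustrates the weight optimisation of method 4. It is not the thesis implementation: scipy's BFGS quasi-Newton minimiser stands in for the original medium-scale line-search routine, and a smooth sum of gradient magnitudes stands in for the raw edge count so that numerical gradients remain informative.

import numpy as np
from scipy.optimize import minimize

def edge_content(image):
    # Smooth surrogate for "number of edges": total gradient magnitude.
    gy, gx = np.gradient(image)
    return np.sqrt(gx ** 2 + gy ** 2).sum()

def combined_map(weights, feature_maps):
    w = np.clip(weights, 0.0, 1.0)
    return np.tensordot(w, np.stack(feature_maps), axes=1) / (w.sum() + 1e-9)

def optimise_weights(feature_maps):
    # Minimise the negative edge content, i.e. maximise edge content.
    objective = lambda w: -edge_content(combined_map(w, feature_maps))
    result = minimize(objective, x0=np.full(5, 0.5), method="BFGS")
    return np.clip(result.x, 0.0, 1.0)

# Example with random stand-in feature maps.
maps = [np.random.rand(25, 25) for _ in range(5)]
print(optimise_weights(maps))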


7.2.3 Images Used

Chapter 6 described several scene categories that a blind person might encounter. Six of these were chosen for this experiment, with four images for each category (Figure 7.5). Images were selected on the basis of representing functional mobility problems. Dowling [21] has reviewed previous efforts in enhancing mobility for visually impaired persons, including the following mobility problems:
• Lighting conditions and glare
• Changes in terrain and depth (stairs, curbs)
• Unwanted contacts (bumps)
• Street crossings
• Visual clutter

(Panels: (i) beach; (ii) street; (iii) office; (iv) home; (v) cafe; (vi) head & shoulders)

Figure 7.5: Images used in comparison tests


7.2.4 Experiment

A group of 242 volunteers participated in the experiment. From this, 50 samples were discarded (21%) due to either incomplete responses or subjects who normally wore glasses/contact lenses but who were not wearing them at that time. This left 192 normally sighted or corrected-to-normal viewers. Half the sample (n=96) viewed the images that were greylevel equalised, while the other half viewed the non-equalised images. Subjects were presented with test stimuli as shown in Appendix Section E.2. An original high quality (256x256 greyscale) image was shown along with the six different versions of the image. The design of the booklets is shown in Appendix Section E.4 – Part A. Viewing conditions for the experiment were not controlled.

7.2.5 Results

Figure 7.6 (below and over the page) shows the breakdown of viewer preferences. There was a clear preference in both the equalised and non-equalised viewer groups for no importance processing (base-case). This was the most chosen method for all six scene types, and especially for faces, where 85% of subjects chose that processing method. Error bars representing 95% confidence intervals are shown in the plot below, and were obtained from average preferences across the six scene types.

[Figure 7.6, upper chart: percentage of subjects choosing each processing method (0–80%), for equalised and not-equalised image sets, across the methods IM eq, IM sc, IM tr, IM opt, Edge and No IP.]

Figure 7.6: Viewer preferences when presenting entire image; n=96 (continues over)


[Figure 7.6, lower chart: number of samples ranked (out of 4608 in total) for each processing method, broken down by scene type (beach, office, face, house, street, café).]

Figure 7.6: When presenting the entire image, results indicate a clear preference for no Importance Processing (n=96)

To ensure the ‘clear’ preference for no importance processing (‘No IP’) was statistically significant, Analysis of Variance (ANOVA) was performed on the data shown in Fig 7.6 to compare the hypotheses:

H0: µ IM eq = µ IM Sc = …. = µ No IP

H1: At least two of the means are not equal, at α = 0.05. Viewer preferences across each scene type formed the basis of observations for the ANOVA: 6 observations (beach, office, face, house, street, café) comparing 6 different processing methods. The test resulted in F-values of {19 & 57} for {equalised & non-equalised data} which exceeded the critical F-value (2.53) for the number of degrees of freedom in the data (5, 30), and was highly significant at

{P=1.5E-8 & 1.8E-14}. Thus H0 was rejected and it was concluded that at least two of the means were not equal. The ANOVA was then repeated but this time excluding data for ‘No IP’. This time F-values were {1.6 & 1.3} for {equalised & non-equalised data} which were less than the critical F-value (2.76) for the number of degrees of freedom in the data (4, 25), and {P=0.20 & 0.30}. This indicates that when ‘No IP’ data is excluded, the means of the other processing methods are not significantly different.
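For illustration, a one-way ANOVA of this form can be run as in the Python sketch below; the preference values are invented stand-ins (six scene-type observations per method), not the experimental data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
methods = ["IM eq", "IM sc", "IM tr", "IM opt", "Edge", "No IP"]

# Six observations (one per scene type) for each processing method.
preferences = {m: rng.uniform(0.05, 0.20, size=6) for m in methods}
preferences["No IP"] = rng.uniform(0.55, 0.75, size=6)   # one clearly higher mean

f_stat, p_value = stats.f_oneway(*preferences.values())
print(f"F(5,30) = {f_stat:.1f}, P = {p_value:.2g}")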

Another issue of interest was histogram equalisation. A two sample t-test was performed using 36 observations (6 processing methods and average results of each of 6 scene types) at α = 0.05. The test assessed the hypotheses:


H0: There is no difference due to histogram equalisation (the differences shown in the upper plot of Fig 7.6 were due to sampling errors or chance only);

H1: Histogram equalisation achieves significantly different results; ie. a one-tailed t-test (a directional test showing results as higher or lower was not of interest).

A t-statistic of 1.76E-15 was obtained, which was much less than the critical t value of 1.67 for 62 degrees of freedom. The significance of this value for a one-tail test was P=0.5, and since this is greater than 0.05, H0 was not rejected: histogram equalisation does not result in significant differences.

Fig 7.6 also shows processing methods divided into scene type. The plot indicates which scene types may be better suited for a particular processing method. For example, one of the processing methods compared in the experiment was edge detection. Section 3.5.2.4 described how a research team developing cortical implant devices expected improved perception results with the implementation of edge detection processing. The data shown in Fig 7.6 indicate that low quality edge maps were best recognised for house and street scenes. ANOVA testing using 8 observations (4 of each image type across 2 equalisation/non-equalised data sets) of 6 image types shows significantly higher results for house and street scenes (P=0.0005).

This experiment had several conclusions:
(i) The Base Case (‘No IP’) data was significantly better than the other processing methods.
(ii) There was no significant difference between the remaining five processing methods. This indicates there is no real advantage in tuning feature weights for the importance map method for low quality images. This is a worthwhile conclusion, allowing the computational overhead required for this processing to be used elsewhere in prosthesis systems.
(iii) There was no significant difference whether or not histogram equalisation was applied before thresholding images.
(iv) Low quality edge maps were best recognised for house and street scenes.


7.3 Digital Zoom

The results discussed in the previous section indicate that the base case was best for presenting images – ie. presenting subsampled and binarised images only, without any region-of-interest processing. However, rather than presenting an entire ROI-processed image, improved perceptual results might be obtained by using ROI processing to identify salient areas within an image and presenting those areas alone (in a subsampled and binarised form). In effect the approach is to find interesting areas within the image and perform a ‘digital zoom’ – enlarging those salient areas to the resolution limit set by the implant electrode array (refer Figure 7.7). It is anticipated that digital zoom would be a common and easily-implemented prosthesis function, and it would be useful to make this zoom method automatic for a blind user.

Figure 7.7: Digital zoom concept – the most salient area is identified in an image and resized to the maximum display resolution


7.3.1 Automatic Zoom Methods

An additional test was conducted comparing seven methods of zooming into an image. For the purposes of this exercise, the original image was 256x256 spatial resolution. 1. IM_trim (Fig 7.8); A trimmed version of an Importance Map to only include elements above a threshold. Pixels were trimmed around each border: top, right, bottom, left, top, right etc. until a pixel value above the threshold was found. As only square images were presented, the final image was made into a square of dimension equal to the maximum dimension of the trimmed box. The smaller dimension was expanded until image dimensions were equal, and the expansion direction was on the basis that pixels of more important regions were added.

(Panels: original; Importance Map with trimmed box; square trimmed box)

Figure 7.8: Trim method to select zoom window

2. IM_scope (Fig 7.9); A 128x128 box size containing the highest greylevel values in a 256x256 Importance Map ie. one quarter of the image area. The 128x128 box was moved pixel by pixel across the image until it contained the highest sum of pixel values.

Figure 7.9: Scope Box method to select zoom window


3. The trim method described in 1) above applied to a Saliency Map generated by code obtained from iLab at the University of Southern California [40,84]. This Region-of-Interest research was discussed in Section 2.4. A Saliency Map is created from combining three feature maps corresponding to colour, intensity and orientation at six spatial scales. Unlike the Importance Map concept which segments images first into regions, the saliency feature maps are created from Difference-of-Gaussian (Mexican-hat) operators applied directly to pixel data (Figure 7.10). Default values for the code implementation were used.

Figure 7.10: Saliency Map developed by University of Southern California Top: Difference of Gaussians filter applied to 3 feature maps; Bottom: Saliency Map output showing Regions-of-Interest

4. The 128x128 box scope method described in 2) above applied to a Saliency Map.

5. A 128x128 box containing the horizontal and vertical centre of the image (Figure 7.11). This method has no dependence on image content and relies on spatial position within the image only. It assumes that the centremost part of an image may be the area worth zooming into. (A code sketch of the trim, scope and centre-crop window selection methods is given after this list.)


Figure 7.11: Zoom window selected from central 25% of image


6. Similarly to 5) in that there is no dependence on image content, this method crops a 128x128 box aligned at the bottom centre of the image. This area may be significant for a viewer especially when mobile, as it contains the foreground immediately in front of the camera (Fig 7.12).


Figure 7.12: Zoom window selected from central-bottom 25% of image

7. For reference, an option of “No Zoom” was also included, where the whole 256x256 image was represented.
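The Python sketch below (referred to in method 5 above) illustrates three of the window-selection ideas using hypothetical helper functions operating on a saliency or importance map normalised to the range 0–1. The threshold value and the omission of the square-padding step for the trimmed box are simplifications of the methods described above.

import numpy as np

def trim_window(sal_map, threshold=0.5):
    """Bounding box of all map values above the threshold (trim method)."""
    rows, cols = np.where(sal_map > threshold)
    return rows.min(), rows.max(), cols.min(), cols.max()

def scope_window(sal_map, box=128):
    """Top-left corner of the box x box window with the largest summed value (scope method)."""
    # A summed-area table makes every box sum an O(1) lookup.
    sat = np.pad(sal_map, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = sal_map.shape
    sums = (sat[box:h + 1, box:w + 1] - sat[:h - box + 1, box:w + 1]
            - sat[box:h + 1, :w - box + 1] + sat[:h - box + 1, :w - box + 1])
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return r, c

def centre_window(shape, box=128):
    """Content-independent crop centred on the image (method 5)."""
    return (shape[0] - box) // 2, (shape[1] - box) // 2

# Example on a random stand-in saliency map.
sal = np.random.rand(256, 256)
print(trim_window(sal, 0.95), scope_window(sal), centre_window(sal.shape))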

For all the above methods, the stimulus presented to viewers was the cropped zoomed version from the original resized to 25x25 spatial resolution (Fig 7.13). One test case had greylevels equalised before thresholding at the 128 grey level, while a second test set had only thresholding at the 128 level with no histogram equalisation.

[Figure 7.13 is a flowchart: original 256x256 image → zoom window selected with 1 of 6 methods → zoom window cropped → subsampled to 25x25 → histogram equalisation (one test set only) → thresholded at greylevel 128.]

Figure 7.13: Image Preparation for Digital Zoom Tests


The same subjects who viewed the earlier described experiment (refer Section 7.2) also viewed the variations on zoom method. Half the sample (n=96) viewed the images that were greylevel equalised, while the other half viewed the non-equalised images. An example of the test stimuli presented to subjects is shown in Appendix Section E.3 and the design of the test booklet is shown in Section E.4 – Part B. Viewing conditions for the experiment were not controlled. Subjects were shown a zoom window overlaid on the original image, in addition to a 25x25 black and white version of the zoom window. When overlaid on the original image, the zoom window was shown as a white square bordered on the inside and outside by a black square to maximise visibility on all background greylevels (refer Figure 7.14).

Figure 7.14: Example stimulus showing detail of zoom window border

7.3.2 Results of Automatic Zoom Experiment

Viewer preferences are shown over in Figure 7.15. Error bars representing 95% confidence intervals are shown on the upper plot, and were obtained from average preferences for the six scene types. ANOVA testing on the seven processing methods resulted in strongly significant differences between the means (P=7.09E-8 and 2.33E-6 for non-equalised and equalised datasets respectively). The trim method applied to Saliency Maps (“Sal trim”) had the highest preference for automatically zooming into a part of the image. This method was best overall and for four of the

six scene types. For beach scenes the trim method applied to importance maps (“IM trim”) was best, while for café scenes, which contained high clutter, “No Zoom” was best.

[Figure 7.15 comprises two bar charts: the upper chart plots the percentage of subjects choosing each zoom method (0–35%) for the equalised and not-equalised image sets; the lower chart plots the number of samples ranked (out of 4608 in total) broken down by scene type (beach, office, face, house, street, café). The zoom methods compared on the x-axis are IM trim, IM scope, sal trim, sal scope, centre, bottom and none.]

Figure 7.15: Preferences for methods to automatically zoom into an image (n=96)

Again, results were independent of whether histogram equalisation was applied to the images. A two sample t-test was performed using 42 observations (7 processing methods and average results of each of 6 scene types) at α = 0.05. The test assessed the hypotheses:

H0: There is no difference due to histogram equalisation (the differences shown in Fig 7.15 were due to sampling errors or chance only);

H1: Histogram equalisation achieves significantly different results; ie. a one-tailed t-test (a directional test showing results as higher or lower was not of interest).

A t-statistic of 1.85E-15 was obtained, which was much less than the critical t value of 1.66 for 80 degrees of freedom. The significance of this value for a one-tail test was P=0.5, and since this is greater than 0.05, H0 was not rejected: histogram equalisation does not result in significant differences.

The trim methods (“Sal trim” and “IM trim”) were approximately twice as good as the scope methods (“Sal scope” and “IM scope”). This may be due to the scope box method having a fixed box size (equal to one quarter of the image area) while the box size for the trim method varied depending on the image, potentially returning a more useful zoomed image.

Thus if a digital zoom function were to be employed in a prosthesis design to highlight areas which may help a visually impaired user, favourable results are most likely to be achieved with the Saliency Map method. The trim method on importance maps (“IM trim”) is also slightly better than zoom windows based on a geometric part of the image which do not consider image content.

7.4 Chapter Summary

This chapter described a comparison of region-of-interest processing methods for low quality image presentation. The experiments showed that it is better to use Importance Map/Region-of-Interest processing to select a region within the image and present that alone, rather than presenting the actual Importance/Salience representation for the entire image.

So in response to the research question: Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing? It can be seen that ROI processing does improve scene understanding when used in a zoom application, but not if applied to the entire image.


Chapter 8 Discussion, Conclusion and Future Work

8.1 Discussion and Conclusion

Electronic visual prostheses, or bionic eyes, are likely to provide some coarse visual sensations to blind patients who have these systems implanted. The quality of artificially induced vision is anticipated to be very poor initially. Research described in this thesis has explored image processing techniques that improve perception for users of visual prostheses.

The work has focussed on improving perception via image processing techniques. Images are just data and image processing is simply manipulating that data. There are potentially other techniques that may result in improved perception that are outside the scope of this research. These include using different electrode paths to create variations of charge density patterns, and delivering preconditioning stimulus to selectively excite deeper axons away from those in contact with electrodes.

Useful image processing methods were determined by way of subjective experiments with normally sighted viewers. These experiments provide a basis from which more complex and beneficial vision prostheses may be derived. For example, the tests involved presentation of static (still) images to viewers and a logical extension to the work is to conduct similar experiments on image sequences (video). Thus a body of knowledge has been established from which real-time processing units can be developed such that a prosthesis may provide maximum benefits to the blind.

The work has also facilitated further understanding of the human visual system, specifically perception from low quality visual information. The field of image quality, including how quality is measured and characterised, is vast. However work to date focuses on high quality images associated with modern multimedia environments. This research contributes to the image quality literature by characterising low quality images. The amount of visual information carried by

images has been quantified, and in this way a new means of characterising low quality images can be stated on the basis of presenting maximum information.

Finally the research has involved an original application of Region-of-Interest processing routines beyond traditional applications such as image compression. Region-of-Interest processing was presented in Section 2.4 as a powerful perception modeling tool using a combination of early vision and cognitive effects. Prosthesis researchers had not previously recognised the benefit of applying techniques to automatically identify important areas in images. Thesis experiments validated the use of computationally cheap Region-of-Interest techniques in visual prosthesis designs.

Detailed research findings are now discussed in the order of their presentation in the thesis.

Some preliminary experiments are discussed in Chapter 4. In an experiment to determine useful image processing methods, it was found that some types of images are better recognised at low quality using Importance and Distance Map methods, while for others, ‘Base Case/Normal’ processing is better. It is recommended to have switchable modes of operation to allow user selection of the processing routine. Results also indicated that it is better for a device to have increased spatial resolution rather than increased greyscale resolution, and that faces are the easiest type of image to understand. These results were reinforced in a separate experiment determining the effect of image type on perception.

In Chapter 5, a stable measure was developed to quantify the amount of visual information in an image. This was found to be the number of edges in an image. It was also found that high information content in images correlates with higher perception.

Chapter 6 described an experiment that showed there is some benefit in processing images according to their scene type (office, home etc). It is recommended to invert the processed binary images before presentation if necessary to most closely match the original images.


In Chapter 7, a comparison of Region-of-Interest methods was presented. ‘Base Case/Normal’ processing was best when presenting the entire image, but ROI processing has some benefit over Base Case processing with a zoom type function. If a Region-Of-Interest technique was to be implemented in a zoom function, a technique known as the Saliency Map (pixel based) gives the most favourable result. This method was better than region-based Region-of-Interest methods, ‘Base Case’ and methods using the geometric location within an image (where image content is not taken into account). The experiments found that there was no benefit in tuning feature weights in the Importance Map method when such low quality images are used. Finally there was no difference in the results if images are histogram-equalised prior to binarisation or not.

In exploring image processing techniques that improve perception from visual prostheses, four research questions were addressed:

Q1: Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved?
Findings: Basic recognition can be achieved for low quality images, although this is dependent on spatial and greyscale resolutions and image type. Face environments are most easily recognised.

Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?
Findings: Improved scene understanding can be expected if Region-of-Interest processing is used to zoom into interesting areas within the image.

Q3: Can a model be constructed for basic information required for the interpretation of a visual scene at low image quality?
Findings: Maximising the number of edges in an image results in higher perceived information content and higher recognition performance.

Q4: Should the processing techniques be adjusted depending on the scene type?
Findings: Improved perception may be obtained with processing images with respect to scene type.


8.2 Future Work

The research presented in this thesis can be extended in several areas.

8.2.1 Motion

The algorithms described in this thesis were employed on static images, and a further extension of this work is to image sequences/video. It is anticipated that enhanced perception would be achieved over that experienced with static images, as image sequences convey higher scene information and subjects would be able to move about to see how various scene elements (background/foreground) interact.

It is likely that an optimum frame presentation rate exists to maximise visual understanding. Nauseating disorientation effects may arise if head movements are not matched with visual information. Time delay effects in helmet/head mounted displays resulting in limitations from image lag are covered by Dudfield [22] and Nelson [61].

8.2.2 Colour

As identified by Suaning et al [93], colour would further enhance images delivered to subjects. While patients undergoing cortical stimulation have reported seeing white, yellow, red, and blue phosphenes [86], the successful modulation of colour has not yet been reported in the literature. From an image processing viewpoint, colour filtering could be relatively easily applied to present monochromatic images of only a selected colour – for example, green filtering to locate green apples in a bin full of green and red apples. A paper [108] discussing alternatives to colour for transmitting information in images (eg. monochromatic coding (single colour), size, flashing stimulus) identifies a disadvantage of colour in that it cannot depict relative importance.
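A colour filter of this kind could be sketched as follows in Python using Pillow and NumPy; the hue band used for “green” and the file names are illustrative assumptions.

import numpy as np
from PIL import Image

def colour_filter(path, hue_low=55, hue_high=115, min_saturation=60):
    hsv = np.array(Image.open(path).convert("HSV"))
    hue, sat = hsv[..., 0].astype(int), hsv[..., 1]
    # Pillow stores hue on a 0-255 scale; roughly 55-115 covers the greens.
    mask = (hue >= hue_low) & (hue <= hue_high) & (sat >= min_saturation)
    return Image.fromarray((mask * 255).astype(np.uint8))

# Example: colour_filter("apple_bin.png").save("green_only.png")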


8.2.3 Device interfacing

The ability to connect an artificial vision system to a television or computer would be desirable. This concept has been incorporated in the design of the Dobelle implant which allows for a television/computer/internet interface and remote video screen/VCR monitor [20]. Other routine image processing functions could include the possibility to record a video clip/snapshot of an interesting landmark, route or environment, and transmit the clip/snapshot to another viewer or an external device.

8.2.4 Supplementary/Symbolic Information

Given that phosphenes can be produced in the visual field, increased information transfer might be achieved if use were made of these phosphenes not just for representing one part of a scene, but for “coding” of associated information. For example, a particular phosphene in the top left hand corner of the visual field could represent the close proximity to the left of a tall and narrow object (such as a lamp-post), which may prevent an injury. Similarly, the middle right-most phosphene might represent fast movement from the right. In addition to using a selected phosphene for representing information, other modes of transfer might cover phosphene brightness and blink rate. It is feasible that supplementary sensory information (eg. distance) might be conveyed in such a manner to efficiently use the small number of phosphenes.

The ability to convey alphanumeric character patterns directly, rather than capturing the text via camera, may be beneficial in improving reading speeds and visual acuity. A 5 x 7 phosphene array can be used to create a full set of symbols, similar to dot matrix displays, where the 35 dots can overlap or appear as discrete elements without affecting legibility [74,89]. While a 5 x 7 matrix is generally quite adequate for groups of characters presented in context, it can result in some confusion when single characters are used (eg. 2 called Z, B called 8). This may lead to the use of 7 x 9 matrix fonts, but matrices larger than 7 x 9 do not result in meaningful improvements [89].
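As a simple illustration of the above, the Python sketch below renders a single character on a 5 x 7 array; the bitmap is hand-drawn for this example rather than taken from any particular font table.

import numpy as np

LETTER_A = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
])

def show_phosphenes(bitmap):
    """Print the 5x7 pattern, 'o' for a lit phosphene and '.' for unlit."""
    for row in bitmap:
        print("".join("o" if cell else "." for cell in row))

show_phosphenes(LETTER_A)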


The use of supplementary stimulus could also be applied for colour analysis. Although artificially-created phosphenes might not represent the true colour of the camera image, it should be possible to overlay some colour identification information. For example, a colour identification function would be advantageous to selectively give an indication of the colour detected in the centre of the camera view.

8.2.5 Range Indication

The distance to an object would be particularly useful information to convey through an artificial vision system. The use of sonar distance aids has been common for many years (eg. [23]). However these devices emit an auditory signal to convey distance information which can interfere with important surrounding environmental noises. It is thus desirable to incorporate distance information visually through an artificial vision system. Distance information can be obtained using similar ultrasonic rangefinders or computing depth from disparity from two cameras. Distances could then be mapped to intensities, where the nearest object is shown with the highest intensity. If the device display only supports a 1-bit grey scale, only the nearest object need be displayed. Newman and Jain [63] state that the greatest advantage of range data is in explicitly representing surface information. Binary/silhouette data provides negligible information about surface and intensity data can contain significant variations in reflected light.
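A possible mapping from range to intensity is sketched below in Python, assuming a depth map in metres is already available (e.g. from stereo disparity or an ultrasonic rangefinder); the maximum range and the 1-bit “nearest object” band are illustrative assumptions.

import numpy as np

def range_to_intensity(depth_m, max_range=10.0, one_bit=False):
    depth = np.clip(depth_m, 0.0, max_range)
    intensity = 1.0 - depth / max_range                       # near = bright, far = dark
    if one_bit:
        nearest = depth_m.min()
        return (depth_m <= nearest + 0.25).astype(np.uint8)   # show only the nearest ~25 cm band
    return (intensity * 255).astype(np.uint8)

# Example with a random stand-in depth map (values in metres).
depth = np.random.uniform(0.5, 10.0, size=(25, 25))
print(range_to_intensity(depth).max(), range_to_intensity(depth, one_bit=True).sum())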

In combination with a standard image of luminance intensities, this distance 'mode of operation' could be quite useful. The literature contains several works from a computer graphic modelling viewpoint which claim that through the combination of range and intensity, a fuller description of the model can be achieved which takes advantage of the strengths and weaknesses of each method in isolation [44,113].


8.2.6 Simulating Techniques

Many of the experiments undertaken in this thesis involved the presentation of binary images to viewers that did not include greyscale effects. The use of halftoning techniques to create the illusion of greyscale was mentioned in Sections 3.5.2.2 and 3.4.1 along with the theory that modulating the size and intensity of a phosphene are equivalent psychophysically. Future experiments in simulating artificial vision could explore the use of this technique (refer Figure 8.1).

(Panels: Original; Halftone – max radius = 2; Halftone – max radius = 4; Halftone – max radius = 10)

Figure 8.1: Halftone representation
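The Python sketch below gives one simple way such a halftone could be generated, with dot radius modulated by local brightness; the grid size and maximum radius are illustrative, and the routine assumes a square input image scaled to the range 0-1.

import numpy as np

def halftone(image, cells=25, max_radius=4):
    """image: 2-D array scaled 0-1; returns a binary halftone rendering."""
    cell_px = 2 * max_radius + 1
    out = np.zeros((cells * cell_px, cells * cell_px), dtype=np.uint8)
    yy, xx = np.mgrid[:cell_px, :cell_px] - max_radius
    dist = np.sqrt(xx ** 2 + yy ** 2)                  # distance from the cell centre

    step = image.shape[0] // cells
    for i in range(cells):
        for j in range(cells):
            # Dot radius grows with the mean brightness of the underlying block.
            level = image[i * step:(i + 1) * step, j * step:(j + 1) * step].mean()
            dot = (dist < level * max_radius).astype(np.uint8) * 255
            out[i * cell_px:(i + 1) * cell_px, j * cell_px:(j + 1) * cell_px] = dot
    return out

# Example: ht = halftone(np.random.rand(250, 250), cells=25, max_radius=4)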

8.2.7 Other Testing Techniques

Also the recognition experiments conducted in this thesis were open-ended in that hints or clues were not provided. Other ways to assess recognition performance might include asking task dependent questions about the scene such as “circle the doorway” or “identify the obstacle location”. The assessment of perception in this way requires careful design and consistent determination of correct responses across subjects. Objective measures, such as tracking tasks, maze navigation, reading speed, visual acuity scores based on Landolt-C’s and Sloan E’s could also be used similar to previous attempts to quantify visual processing methods eg. [12,13,14,33].


8.3 Final Word

My motivation for undertaking this work was growing up with a blind parent and witnessing frequent injuries from collisions with obstacles: dishwasher doors being left open, cupboard doors ajar, corners of brick walls. There is always some wish that one day some visual perception may return to a level sufficient to avoid such collisions, and allow the amazing world so familiar to us sighted people to be viewed.

In the initial design of this research project, a scope of work was set that was achievable. It would not be realistic to expect to achieve sight restoration as a result of a PhD. However, with the facilities and testing methods available, some ideas could be explored in the area of image processing. It is hoped that this work may contribute in a small way to the numerous international efforts of developing a safe and useful electronic visual prosthesis for blind people.


References

1. Ahumada A, “Computational image quality metrics: A review”, SID Digest of Technical Papers, 24, pp.305-308, 1993

2. Amerijckx C, Legat J, Trullemans C, Design and implementation of a remapping algorithm for visual prosthesis, Proceedings Vision Interface '99, Canadian Image Processing & Pattern Recognition Society, Toronto, Canada, pp.380-385, 1999

3. Barten P, Contrast sensitivity of the human eye and its effects on image quality, SPIE Press, Washington, 1999

4. Baskent D, Shannon R, Frequency-place compression and expansion in cochlear implant listeners, Journal of the Acoustical Society of America, 116(5), pp.3130-3140, 2004

5. Beatty J, Booth K, Matthies L, Revisiting Watkins’ algorithm (Computer Graphics), 7th Canadian Man-Computer Communications Conference, pp.359-370, 1981

6. Becker M, Braun M, Eckmiller R, “Retina implant adjustment with reinforcement learning”, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98, IEEE, New York, USA, Vol.2, pp.1181-4; 1998

7. Bell D, Maeder A, Progressive technique for human face archiving and retrieval, Journal of Electronic Imaging, Vol. 5(2), pp191-197, 1996

8. Brindley G, The number of information channels needed for efficient reading, J Physiol. 177, pp.46, 1964

9. Callaghan T, Interference and dominance in texture segregation: Hue, geometric form, and line orientation, Perception & Psychophysics, Vol. 46 (4), pp.299-311, 1989

10. Cantoni V (ed), “Human and machine vision – Analogies and divergencies”, Proceedings of the 3rd International Workshop on Perception, Plenum Press, 1994

11. Capelle C, Faik C, Trullemans C, Veraart C, “Real time experimental visual prosthesis using sensory substitution of vision by audition”, Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Engineering Advances: New Opportunities for Biomedical Engineers, IEEE, New York, USA; Vol.1, pp.255-256, 1994


12. Cha K, Horch K, Normann R, Mobility performance with a pixelised vision system, Vision Research 32(7), pp.1367-1372, 1992

13. Cha K, Horch K, Normann R, Simulation of a phosphene-based visual field: Visual acuity in a pixelized vision system, Annals of Biomedical Engineering 20(4), pp. 439-449, 1992

14. Cha K, Horch K, Normann R, Reading speed with a pixelized vision system, Journal of the Optical Society of America A-Optics & Image Science, 9(5), pp. 673-677, 1992

15. Chernyak D.A. and Stark L.W., “Top-down guided eye movements: Peripheral model,” in Human Vision and Electronic Imaging VI, Rogowitz T, Pappas T, Editors, SPIE Proc. Vol. 4299, pp.349-360, 2001

16. Cooper P, Birnbaum L, Brand M, Causal scene understanding, Computer Vision and Image Understanding, Vol.62 (2), pp.215-231, 1995

17. Dagnelie G, Humayun M, Greenberg R, de-Juan E, “The physiological connection: stimulating the human and amphibian retina”, Proceedings: International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2321-2326, 1997

18. DeMarco S, Clements M, Vichienchom K, Liu W, Humayun M, Weiland J, An epi-retinal visual prosthesis implementation, Proceedings of the First Joint BMES/EMBS Conference 1999: IEEE Engineering in Medicine and Biology 21st Annual Conference and the 1999 Annual Fall Meeting of the Biomedical Engineering Society, IEEE, Piscataway, USA, Vol.1, pp475, 1999

19. De Ridder H, Cognitive issues in image quality measurement, Journal of Electronic Imaging, January, Vol.10(1), pp.47-55, 2001

20. Dobelle W, Artificial vision for the blind by connecting a television camera to the visual cortex, ASAIO (American Society of Artificial Internal Organs) Journal 2000; 46(1), pp. 3-9, 2000

21. Dowling J, Maeder A, Boles W, Mobility enhancement and assessment for a Visual Prosthesis, in Human Vision and Electronic Imaging IX, Rogowitz T, Pappas T, Editors, Proceedings of SPIE Vol 5369, pp. 780-791, 2004

22. Dudfield H, Hardiman T and Selcon S, “Human factors issues in the design of Helmet-Mounted Displays,” in Helmet- and Head-Mounted Displays and Symbology Design Requirements II, Lewandowski R, Stephens W, Haworth L (eds.), Proc. SPIE 2465, pp. 132-141, 1995

23. Easton R, Inherent problems of attempts to apply sonar and vibrotactile sensory aid technology to the perceptual needs of the blind, Optometry and Vision Science, 69(1), pp. 3-14, 1992


24. Eckert M, Bradley A, “Perceptual quality metrics applied to still image compression”, Signal Processing, European Association for Signal Processing, Vol. 70, pp177-200, 1998

25. Eckmiller R, Becker M, Hunermann R, “Towards a learning retina implant with epiretinal contacts”, IEEE International Conference on Systems, Man, and Cybernetics Vol. 4, pp. p.396-399, 1999

26. Eckmiller R, Becker M, Hunermann R, “Dialog concepts for learning retina encoders”, Proceedings: International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2315-2320, 1997

27. Exner J, The Rorschach: A comprehensive system: Vol 2 Current research and advance interpretation, Wiley, New York, 1978

28. Gilmont T, Verians X, Legat J, Veraart C, Resolution reduction by growth of zones for visual prosthesis, Proceedings: International Conference on Image Processing, IEEE, New York, USA, pp.299-302 vol.1, 1996

29. Gregory R, Eye and Brain – the psychology of seeing, World University Library, London, 1966

30. Groth-Marnat G, Handbook of psychological assessment, 3rd ed, John Wiley & Sons, New York, 1997

31. Hallum L, Taubman D, Suaning G, Morley J, Lovell N, A filtering approach to artificial vision: A phosphene visual tracking task, Proceedings of the World Congress on Medical Physics and Biomedical Engineering (WC2003), 2003

32. Harvey J, Sawan M, Image acquisition and reduction dedicated to a visual implant, Proceedings of the 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. `Bridging Disciplines for Biomedicine', IEEE, New York, USA, pp.403-4 vol.1, 1997

33. Hayes J, Yin V, Piyathaisere D, Weiland J, Humayun M, Dagnelie G, Visually guided performance of simple tasks using simulated prosthetic vision, Artificial Organs Vol. 27 (11), pp.1016-1028, 2003

34. Hendee W, Wells P (eds), The perception of visual information - 2nd edition, Springer-Verlag, New York, 1997

35. Henderson D, Evans J, Dobelle W, The relationship between stimulus parameters and phosphene threshold/brightness, during stimulation of human visual cortex, Transactions - American Society for Artificial Internal Organs, 25, pp. 367-71, 1979

36. Huang T, PCM picture transmission, IEEE Spectrum, Vol.2 (12), pp.57-63, 1965


37. Humayun M, Weiland J, Fujii G, Greenberg R, Williamson R, Little J, Mech B, Cimmarusti V, Van Boemel G, Dagnelie G, de Juan E, Visual perception in a blind subject with a chronic microelectronic retinal prosthesis, Vision Research, Vol 43 (24), pp. 2573-2581, 2003

38. Hungenahally S, “Differentio-aggregation functions for perceptual sub-band coding of images: Emulation of visual receptive fields”, IEEE International Conference on Systems, Man and Cybernetics. 'Humans, Information and Technology', IEEE, USA, Vol.3, pp.2420-2425, 1994

39. Hungenahally S, “Mathematical basis for the design of an artificial retina: visual prosthesis for the retinally blind”, IEEE International Conference on Systems, Man and Cybernetics. ‘Intelligent Systems for the 21st Century’, IEEE, New York, USA, Vol.3, pp.2396-2401, 1995

40. Itti L, Koch C, Feature combination strategies for saliency-based visual attention systems, Journal of Electronic Imaging, 10(1), pp.161-169, 2001

41. ITU/R Recommendation BT.500-7, 10/1995, http://www.itu.ch/ , access date: 4/6/04

42. Iwamoto K, Tanie K, “Development of an eye movement tracking type head mounted display: capturing and displaying real environment images with high reality”, Proceedings: 1997 IEEE International Conference on Robotics and Automation, IEEE, New York, USA, Vol.4, pp.3385-3390, 1997

43. Janssen R, Computational Image Quality, SPIE PRESS Monograph Vol. PM101, SPIE - The International Society for Optical Engineering, 2001

44. Kay G, Caelli T, Inverting an illumination model from range and intensity maps, CVGIP: Image Understanding, Vol 59 (2) March, pp. 183-201, 1994

45. Kraft R, Kauer J, Estimating the fractal dimension from digitized images, Munich University of Technology – Weihenstephan, Dept of Agricultural & Horticultural Sciences Mathematics, Statistics & Data Processing Institute, Freising / Germany, http://www.wzw.tum.de/ane/algorithms/algorithms.html, Access date: 20/4/04, 1995

46. Kyuma K, Miyake Y, Kage H, Artificial retina chips, IEEE International Conference on Neural Networks, Vol.4, pp.2304-2308, 1997

47. Levine M, Vision in man and machine, McGraw-Hill, New York, 1985

48. Livingstone M, Hubel D, Segregation of form, color, movement, and depth: Anatomy, physiology, and perception, Science, Vol. 240, pp.740-749, 1988

49. Loce R, Roetling P, Lin Y, Digital halftoning for printing and display of electronic images, in Electronic Imaging Technology, Dougherty E (Ed), SPIE - The International Society for Optical Engineering, pp225-288, 1999


50. Luo J, Etz S, Singhal A, Gray R, Performance-scalable computational approach to main subject detection in photographs, Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (Eds), Proceedings of SPIE Vol.4299, pp. 494-505, 2001

51. Lysaght M, Vogelstein J, Lockhart N, Cheswick Thide C, Nallari M, Caulkins C, Artificial vision, 1999 (compiled), Brown University, Providence, Rhode Island, Available: http://biomed.brown.edu/Courses/BI108/BI108_1999_Groups/Vision_Team/Vision.htm, Access date: 20 April 2004

52. Maeder A, Human understanding limits in visualization, International Journal of Pattern Recognition & Artificial Intelligence, 11(2), pp.229-237, 1997

53. Maeder A, Eckert M, “Medical image compression: Quality and performance issues”, Proceedings: New Approaches in Medical Image Analysis, SPIE – The International Society for Optical Engineering, Washington, Vol.3747, 1999

54. Maeder A, Pham B, A colour importance measure for colour image analysis, IS&T and SID’s Color Imaging Conference: Transforms & Transportability of Colour, Phoenix, Nov 1993, 232-237, 1993

55. Margalit E, Maia M, Weiland J, et al, Retinal prosthesis for the blind, Survey of Ophthalmology, Vol.47 (4), pp.335-356, 2002

56. Marr D, Vision, W.H. Freeman & Company, New York, 1982

57. Meletiou A, Measurement of complexity in visual images, MSc in Human Computer Interaction Project, Department of Computing & Electrical Engineering, Heriot-Watt University, 1999

58. Miau F, Itti L, A neural model combining attentional orienting to object recognition: Preliminary explorations on the interplay between where and what, In: Proc. IEEE Engineering in Medicine and Biology Society (EMBS), Istanbul, Turkey, Oct 2001

59. Moffett D, Moffet S, Schauf C, Human physiology – foundations and frontiers, 2nd edition, Mosby-Year Book Inc, St Louis, pp.268-283, 1993

60. Mojsilovic A, Rogowitz B, A psychophysical approach to modeling image semantics, Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (Eds), Proceedings of SPIE Vol.4299, pp. 470-477, 2001

61. Nelson W, Hettinger L, Haas M, Russell C, Warm J, Dember W, Stoffregen T, Compensation for the effects of time delay in a helmet-mounted display: perceptual adaptation versus algorithmic prediction, in Helmet- and Head-Mounted Displays and Symbology Design Requirements II, Lewandowski R, Stephens W, Haworth L (eds.), Proc. SPIE 2465, pp. 132-141, 1995


62. Nemine K, Calibration and evaluation of virtual environment displays, Proceedings: Symposium on Research Frontiers in Virtual Reality, IEEE Computer Society Press, San Jose, USA, Oct 25-26 1993

63. Newman T, Jain A, A survey of automated visual inspection, Computer Vision and Image Understanding, Vol. 61(2), March pp.231-262, 1995

64. Nguyen A, Chandran V, Sridharan S, Prandolini R, Importance assignment to regions in surveillance imagery to aid visual examination and interpretation of compressed images, Proceedings of International Symposium on Intelligent Multimedia, Video & Speech Processing (ISIMP), Hong Kong, pp. 385-388, 2001

65. Nie K, Stickney G, Zeng F, Encoding frequency modulation to improve cochlear implant performance in noise, IEEE Transactions on Biomedical Engineering, 52(1), pp.64-73, 2005

66. Normann R, Maynard E, Rousche P, Warren D, A neural interface for a cortical vision prosthesis, Vision Research 39(15), pp. 2577-2587, 1999

67. Osberger W, Perceptual vision models for picture quality assessment and compression applications, PhD thesis, Space Centre for Satellite Navigation, School of Electrical and Electronic Engineering, QUT, 1999

68. Osberger W, Maeder A, Automatic identification of perceptually important regions in an image using a model of the human vision system, 14th International Conference on Pattern Recognition, Brisbane, Australia, pp. 701-704, 1998

69. Osberger W, Maeder A, Bergmann N, A perceptually based quantisation technique for MPEG encoding, Proceedings of the SPIE – Human Vision & Electronic Imaging III, Rogowitz T, Pappas T, Editors, Proceedings of SPIE Vol. 3299, San Jose, USA, Jan 1998

70. Osberger W, Rohaly A, Automatic detection of regions of interest in complex video sequences, in Human Vision and Electronic Imaging VI, Rogowitz T, Pappas T, Editors, Proceedings of SPIE Vol. 4299, pp.361-372, 2001

71. Pausch R, Shackelford M, Proffitt D, A user study comparing head-mounted and stationary displays, Proceedings: Symposium on Research Frontiers in Virtual Reality, IEEE Computer Society Press, San Jose, USA, Oct 25-26 1993

72. Peachey N, Chow A, Subretinal implantation of semiconductor-based photodiodes: progress and challenges, Journal of Rehabilitation Research & Development, 36(4), pp. 371-376, 1999

73. Pentland A, Wearable computers, IEEE Microcomputers 19(6), pp.9-11, 1999

74. Perez R, Electronic display devices, TAB Professional and Reference Books, Pennsylvania USA, pp.196-202, 1988


75. Privitera C, Stark L, Focused JPEG encoding based upon automatic pre-identified regions-of-interest, in Human Vision and Electronic Imaging IV, Rogowitz T, Pappas T, Editors, Proceedings of SPIE Vol. 3644, pp. 552-558, 1999

76. Reinagel P and Zador AM, Natural scene statistics at the centre of gaze, Network: Computation in Neural Systems. vol.10, no.4, pp341-350, 1999

77. Riglis E, Modeling visual complexity in images, First Year PhD Report, Image Laboratory, School of Mathematical and Computer Sciences, Heriot-Watt University, 1998

78. Rivlin E, Rosenfeld A, Navigational functionalities, Computer Vision and Image Understanding, Vol.62 (2), pp.232-244, 1995

79. Rizzo J, Wyatt J, Loewenstein J, Kelly S, Shire D, Methods and perceptual thresholds for short-term electrical stimulation of human retina with microelectrode arrays, Investigative Ophthalmology & Visual Science, 44(12), pp 5355-5361, 2003

80. Rogowitz B, Frese T, Smith J, Bouman C, Kalin E, Perceptual image similarity experiments, SPIE Conference on Human Vision and Electronic Imaging III, San Jose, California, January 1998, SPIE Vol.3299, pp.576-590, 1998

81. Rogowitz B, Pappas T, Allebach J, Human vision and electronic imaging, Journal of Electronic Imaging, Vol.10(1), pp.10-19, 2001

82. Rosenberg D, Color Halftone Version 7.0, An Adobe Photoshop filter module which simulates an enlarged print color halftone effect, Adobe Systems, 2002

83. Russ J, The image processing handbook 3rd Edition, CRC Press, Florida USA, pp. 242-247, 1999

84. Saliency Map source code sourced from iLab, University of Southern California: http://ilab.usc.edu/toolkit/ Access date: 12 August 2003

85. Schill K, Umkehrer E, Beinlich S, Krieger G, Zetzsche C, Scene analysis with saccadic eye movements: Top-down and bottom-up modeling, Journal of Electronic Imaging, Vol.10(1), pp.152-160, 2001

86. Schmidt E, Bak M, Hambrecht F, Kufta C, O'Rourke D, Vallabhanath P, Feasibility of a visual prosthesis for the blind based on intracortical microstimulation of the visual cortex, Brain 119, pp. 507-522, 1996

87. Schubert M, Stelzle M, Graf M, Stert A, Nisch W, Graf H, Hammerle H, Gabel V, Hofflinger B, Zrenner E, Subretinal implants for the recovery of vision, IEEE International Conference on Systems, Man, and Cybernetics Vol. 4, pp. 376-381, 1999


88. Schwarz M, Hauschild R, Hosticka B, Huppertz J, Kneip T, Kolnsberg S, Mokwa W, Trieu H, Single chip CMOS image sensors for a retina implant system, IEEE, Vol.6, pp.645-648, 1998

89. Sherr S, Electronic displays, John Wiley & Sons, New York, pp.29-37, 1979

90. Snyder H, Trejo L, Research Methods, in Colour in Electronic Displays, Widdel H, Post D (eds), Plenum Press, 1992

91. Stange K, A 4-parameter model of visual complexity in abstract images and a computer program for the empirical investigation of complexity, pleasingness and interestingness of images based on the model, XVI Congress Of The International Association Of Empirical Aesthetics, New York, 2000

92. Stark L, Privitera C, Yang H, Azzariti M, Fai Ho Y, Blackman T, Chernyak D, Representation of human vision in the brain: How does human perception recognise images?, Journal of Electronic Imaging, Vol.10(1), pp.123-151, 2001

93. Suaning G, Lovell N, Schindhelm K, Coroneo M, The bionic eye (electronic visual prosthesis): A review, Australian and New Zealand Journal of Ophthalmology 26, 195-202, 1998

94. Suaning G, Lovell N, Kwok C, Fabrication of platinum spherical electrodes in an intra-ocular prosthesis using high-energy electrical discharge, Sensors and Actuators A: Physical, vol. 108, pp. 155-161, 2003

95. Suaning G, Lovell N, CMOS neurostimulation system with 100 electrodes and radio frequency telemetry, Inaugural Conference of the IEEE EMBS (Vic), Melbourne, pp.37-40, Feb 1999

96. Thompson R, Barnett G, Humayun M, Dagnelie G, Facial recognition using simulated prosthetic pixelized vision, Investigative Ophthalmology & Visual Science, 44(11), pp.5035-5042, 2003

97. Thorpe S, Image processing by the human visual system, Eurographics ’90 EG.90TN4 – Tutorial Note, Eurographics Technical Report Series, 1990

98. Travis D, Effective colour displays – Theory and Practice, Academic Press, 1991

99. Vaughan H, Schimmel H, Feasibility of electrocortical visual prosthesis, in Visual prosthesis – The interdisciplinary dialogue, Sterling T, Bering E, Pollack S, Vaughan H Editors, Proceedings of the second conference on visual prosthesis, Academic Press, New York, pp.65-79, 1971

100. Veraart C, Wanet-Defalque M, Gérard B, Vanlierde A, Delbeke J, Pattern recognition with the optic nerve visual prosthesis, Artificial Organs Vol.27 (11), pp.996-1004, 2003


101. VQEG – Video Quality Experts Group, Institute for Telecommunication Sciences, U.S. Department of Commerce, Colorado http://www.its.bldrdoc.gov/vqeg/, accessed 4/6/04.

102. Wandell B, Foundations of vision, Sinauer Associates Inc, Massachusetts, USA, pp. 124-126, 1995

103. Walpole R, Myers R, Probability and statistics for engineers and scientists – 5th edition, Macmillan Publishing Company, New York, 1993

104. Warren D, Normann R, Visual neuroprostheses, in Handbook of Neuroprosthetic Methods, Finn W, LoPresti P (eds), The Biomedical Engineering Series, CRC Press, 2003

105. Watson A (ed), Digital images and human vision, MIT Press, 1993

106. Watson A et al, The DCTune algorithm, Vision Science and Technology Group, NASA Ames Research Center, http://vision.arc.nasa.gov/dctune/, Access date: 29 May 2004

107. Werblin F, Jacobs A, The cellular neural network as a retinal camera for visual prosthesis, Proceedings: International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2327-2332, 1997

108. Widdel H, Post D, (eds), Colour in Electronic Displays, Plenum Press, 1992

109. Yagi T, Ito Y, Kanda H, Tanaka S, Watanabe M, Uchikawa Y, Hybrid retinal implant: fusion of engineering and neuroscience, IEEE International Conference on Systems, Man, and Cybernetics Vol. 4, pp. 382-385, 1999

110. Yagi T, Kameda S, Hayashida Y, Li L, An artificial retina with adaptive mechanisms and its application to retinal prostheses, Vol.4, IEEE, pp.418-423, 1999

111. Yamakawa T, Shimonomura K, Udono T, Yagi T, Depth perception circuit employing serial output signals from two vision chips, Vol. 4, IEEE, pp.390-395, 1999

112. Ziegler D, Linderholm P, Mazza M, Ferazzutti S, Bertrand D, Ionescu A, Renaud, An active microphotodiode array of oscillating pixels for retinal stimulation, Sensors and Actuators A: Physical, 110(1-3):11-17, 2004

113. Zhang G, Wallace A, Physical modelling and combination of range and intensity edge data, CVGIP: Image Understanding, Vol. 58 (2) September, pp.191-220, 1993


Appendix A Section 4.2 Experiment

A.1 Example Test Stimulus

All images are different versions of the same object.

[Five versions of the test image, labelled A to E]

1. DESCRIBE THE OBJECT: ______

2. RANK THE TOP 3 IMAGES THAT YOU THINK SHOW THE OBJECT MOST CLEARLY: 1)______2)______3)______


A.2 Booklet Design

Test Attributes
(N = normal, I = inverse, D = distance, Im = importance, E = edges)

Ref | Spatial Res & Grey Levels   | Images shown per page | Image Characteristics shown per page
1   | 10x10 B&W                   | 5 | 5 x B&W (N, I, D, Im, E)
2   | 10x10 3 grey                | 4 | 4 x 3grey (N, I, D, Im)
3   | 10x10 B&W vs 3grey          | 9 | 9 = (5 x B&W) + (4 x 3grey)
4   | 16x16 B&W                   | 5 | 5 x B&W (N, I, D, Im, E)
5   | 16x16 3 grey                | 4 | 4 x 3grey (N, I, D, Im)
6   | 16x16 B&W vs 3grey          | 9 | 9 = (5 x B&W) + (4 x 3grey)
7   | 25x25 B&W                   | 5 | 5 x B&W (N, I, D, Im, E)
8   | 25x25 3 grey                | 4 | 4 x 3grey (N, I, D, Im)
9   | 25x25 B&W vs 3grey          | 9 | 9 = (5 x B&W) + (4 x 3grey)
10  | 10x10 3 grey vs 16x16 B&W   | 8 | 8 = (4 x B&W) + (4 x 3grey)
11  | 10x10 3 grey vs 25x25 B&W   | 8 | 8 = (4 x B&W) + (4 x 3grey)
12  | 16x16 3 grey vs 25x25 B&W   | 8 | 8 = (4 x B&W) + (4 x 3grey)

Image presentation criteria:
1. Images with lower spatial resolution should be presented before higher resolution versions of the same image, to reduce learning effects.
2. To obtain as large a difference as possible on each page, pair the following attribute reference numbers from the table above: 1 – 9, 2 – 7, 3 – 8, 4 – 10, 5 – 12, 6 – 11.

Six booklets A – F measuring different combinations of image attributes:

Book | Chair1 | Chair2 | Post1 | Post2 | Steps1 | Steps2 | Face1 | Face2 | Door1 | Door2
A    | 1-9  | 2-7  | 3-8  | 4-10 | 5-12 | 6-11 | 1-9  | 2-7  | 3-8  | 4-10
B    | 2-7  | 3-8  | 4-10 | 5-12 | 6-11 | 1-9  | 2-7  | 3-8  | 4-10 | 5-12
C    | 3-8  | 4-10 | 5-12 | 6-11 | 1-9  | 2-7  | 3-8  | 4-10 | 5-12 | 6-11
D    | 4-10 | 5-12 | 6-11 | 1-9  | 2-7  | 3-8  | 4-10 | 5-12 | 6-11 | 1-9
E    | 5-12 | 6-11 | 1-9  | 2-7  | 3-8  | 4-10 | 5-12 | 6-11 | 1-9  | 2-7
F    | 6-11 | 1-9  | 2-7  | 3-8  | 4-10 | 5-12 | 6-11 | 1-9  | 2-7  | 3-8

Booklet page order, chosen to ensure variety and reduce learning effects (20 pages per booklet); a short sketch showing how the booklet rotation could be generated follows below:

Pages 1–10:  Chair1, Post1, Steps1, Face1, Door1, Steps2, Post2, Face2, Chair2, Door2
Pages 11–20: Face1, Post1, Door1, Steps1, Chair1, Post2, Chair2, Door2, Face2, Steps2
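The assignments in the six-booklet table above follow a simple cyclic rotation of the six attribute pairings across booklets A – F. The Python sketch below is illustrative only; it is not part of the original experiment materials, but it reproduces the same rotation.

```python
# Illustrative reconstruction (not part of the original materials) of the
# cyclic rotation that assigns attribute-reference pairings to the ten test
# images in each of the six booklets A-F.

ATTRIBUTE_PAIRS = ["1-9", "2-7", "3-8", "4-10", "5-12", "6-11"]
IMAGES = ["Chair1", "Chair2", "Post1", "Post2", "Steps1",
          "Steps2", "Face1", "Face2", "Door1", "Door2"]
BOOKLETS = "ABCDEF"

def booklet_assignments():
    """Return {booklet: [(image, attribute pairing), ...]} by cyclic rotation."""
    return {
        booklet: [
            (image, ATTRIBUTE_PAIRS[(offset + i) % len(ATTRIBUTE_PAIRS)])
            for i, image in enumerate(IMAGES)
        ]
        for offset, booklet in enumerate(BOOKLETS)
    }

if __name__ == "__main__":
    for booklet, pages in booklet_assignments().items():
        print(booklet, pages)
```

Running the sketch regenerates the booklet table row by row (booklet A starts at pairing 1 – 9, booklet B at 2 – 7, and so on).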


A.3 Borderline Recognition Assessment for Section 4.2 Experiment

Contextually Accepted Rejected Accepted Toilet Pidgeon Park bench Chair on skateboard Baby pram Chair with armrests Vulture chairlift Giraffe/dog Flamingo Rooster Emu

Contextually Accepted Rejected Accepted Lounge chair a rocking [?] facing to the left Posture chair that’s sitting up straight High chair Rocking horse 3d chair on a rock Electric chair Baby’s pram Rocking chair Child’s pram Reclining chair Baby stroller Chair with wheels Kids move chair stool


Contextually Accepted Rejected Accepted Pillar holding up roof Tower Door of a house with welcome mat Gravestone Foot Corner of wall Wall from side or column Torch Door in the right corner (building support) Closeups of walls in a maze Buildings/skyscraper Vase/box which is a very different colour to the walls of the room Jug in the corner of the room Lighthouse A square Wall Vase / bottle Window Block Grass bush Rectangle A pile on Object on a table (eg. mug, salt Large rectangular window shaker) pole Candle holder Corridors Toilet paper Can of food Cup on table Elevation shaft Fire hydrant Block of chocolate Tree trunk Statue on stand O/head view of street

Contextually Accepted Rejected Accepted Block of dry ice Beaker Window / Window with tree outside Computer screen Container Piece of cloth (rectangular) hanging up on something Gameboy screen Swimming pool Wall / Wall with window & curtain on right Piece of paper Building Box (rectangular shaped object); A box (not clear enough to describe); Block or box Blackboard Bucket A block that someone can sit on Box, possibly a floor pan Cup / mug Square Top view of a table/ desk Block of chocolate An enclosed space t.v. screen The sky Floor mat Fish bowl Towel hanging Football field Hole / Hole in the ground / A square ditch


Contextually Accepted Rejected Accepted “If gender noted was wrong” eg. George Washington man with afro The back of a head Early American president Gorilla’s face/monkey Mozart Footballer with helmet Beethoven Shrub / face Bob Marley Person with lots of facial hair Ben Harper William Shakespeare Elizabeth Taylor Jimmy Hendrix Mon Lisa painting Artist

Contextually Accepted Rejected Accepted “If gender noted was wrong” eg. Head of a bird Child molester from ch.7 news ladies face An old man’s head in profile Flower Dero person facing left Back of a persons head Mr. Ed the talking horse Human face, child Side view of a person facing left Old person with glass with string attached (crying) Crying person Side view of a dog’s head Guy with long hair Face with long hair Barbie


Contextually Accepted Rejected Accepted Water pump with hose on right Fish hook Long thing with protrudence Cactus ‘L’ End of a pier Big flower Axe Powerpole with powerline extending to right Streetlight/lamp Half an anchor Telephone pole Flower stem Sign pointing up Umbrella Diving board/platform Tree with branches/leaves Powerline tree Crane Joining of posts (T) Basketball hoop Winder on old clothesline Arrow pointing down Stick figure pointing gun to A tree with monkeys in it right Corn / maize Ladder Saxophone Spear with hook underneath Street sign Office building with stairs Traffic lights Skyscraper Flag facing right Waving finger Hockey stick Submarine Pencil? Windsurfer Flagpole Traffic light Tower? Road markings Lighthouse? Hand basin Sword pipes

Contextually Accepted Rejected Accepted Tree beside a hole/ditch in the Tall building A line ground to the left Worm/slug/caterpillar City with buildings Flagpole Driveway Birds eye view of coin box Something sticking up Road at night (with reflective Caterpillar crawling A stick median strip) Perspective view of a road White thing sticking up Tree / tree out of the ground / a A stage with curtains pulled lone tree back on either side and pole in middle Mountain with stick Street lamp / light Traffic lights Tower


Person standing out in the open Straight white line behind black surface Pole with cave Tall archway/corridor Pole in foreground, mountain in background Mountains with a pole at top

Contextually Accepted Rejected Accepted Multi-storey building with Building stepped storeys

Contextually Accepted Rejected Accepted Stick Swimming pool Stairs leading up to a tree A very tall skinny tree in a Tree raised garden bed Flagpole/yellow & red beach Clothesline poles Life savers flag Powerline pole lightpole Flag on a golf course Street footpath Golf hole A stick in the ground Weed/seedling coming from ground Something pole-like with a house/shelter in the distance on the right Pole in ground


Appendix B Section 4.3 Experiment

B.1 Example Test Stimulus

CAN YOU TELL WHAT IS SHOWN IN EACH IMAGE?
1) Write a word under each image to describe the main object or content of the scene. If you can’t tell what is shown, write “Can’t tell”.
2) Put a circle around the images that you are confident about.


B.2 Borderline Recognition Assessment for Section 4.3 Experiment

Image Accepted Rejected Lighthouse tower, well, powerhouse, buildings, horizon, house on cliff, post, high-rise building, chimney, house, mineshaft, oil rig, sky, pole, jetty with post, watch tower, small building, landscape Buildings houses, factory, households stairs, steps, trees, mountain, hill, forest, house with smoking chimney

Tree plant, canyon, gully

Gorilla woman's back, dog (common for 256x256_Binary R2D2, people kissing, 2 images), men, person sitting, person, teddy, bear toy, people, duck animal, someone eating, godzilla, dinosaur, lady, person leaning over, hair & head, man bending over Capsicum apple, fruit, pumpkin, jack-o-lantern ball, rose, balloon, love heart, fist

Face monster, side view of a head 2 people looking at jumping dog, tiger, bird, duck

Flower splat, flowers, centre of fruit eye, shooting target, food plate, donut, letter 'O', 'Q', box, clock, sign, square, wheel, ball, door handle, apple Balloon lightbulb, plane, sun, cloud, bird, aeroplane, moon dot, tennis ball, heart, block, window, rectangle, wall with window, star, footprint, square, light, firefly Duck man, smiley face, person, dog
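A listing like the one above lends itself to a simple automated pre-check of subject responses before manual borderline assessment. The sketch below is hypothetical: the word lists are illustrative placeholders, not the acceptance criteria actually used, and anything not matched is simply flagged for the manual assessment tabulated in this appendix.

```python
# Hypothetical pre-check of free-text recognition responses against
# per-image lists of accepted and rejected descriptions (placeholder lists;
# the real criteria are the manual assessments in Appendix B.2).

RESPONSE_LISTS = {
    "Lighthouse": {"accepted": {"lighthouse", "tower"}, "rejected": {"stairs", "tree"}},
    "Capsicum":   {"accepted": {"capsicum", "pepper"},  "rejected": {"ball", "balloon"}},
}

def classify_response(image, response):
    """Return 'accepted', 'rejected', or 'borderline' (manual check needed)."""
    text = response.strip().lower()
    lists = RESPONSE_LISTS[image]
    if text in lists["accepted"]:
        return "accepted"
    if text in lists["rejected"]:
        return "rejected"
    return "borderline"

print(classify_response("Lighthouse", "Tower"))    # accepted
print(classify_response("Capsicum", "Ball"))       # rejected
print(classify_response("Capsicum", "Pumpkin"))    # borderline -> manual assessment
```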


Appendix C Chapter 5 Experiments

C.1 Example Test Stimulus – 7 Images Presented at the Same Time

Rank the images for how much visual information they contain:

1 = contains most visual information
7 = contains least visual information


C.2 Example Test Stimulus – 3 Images Presented at the Same Time

Rank the images for how much visual information they contain:

1 = contains most visual information
3 = contains least visual information


C.3 Example Test Stimulus – Paired Comparison Experiments

WHICH IMAGE APPEARS TO CONTAIN MORE INFORMATION?

[ ] DEFINITELY LEFT IMAGE    [ ] SLIGHTLY LEFT IMAGE    [ ] SAME    [ ] SLIGHTLY RIGHT IMAGE    [ ] DEFINITELY RIGHT IMAGE

The subjects were also given the following comments relating to this question:

There are 5 boxes under the image pair to be compared:
Box 1 = left image has much more information than right image
Box 2 = left image has slightly more information than right image
Box 3 = images have same amount of visual information
Box 4 = right image has slightly more information than left image
Box 5 = right image has much more information than left image
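One plausible way to turn these five-box responses into a per-image score is to code the boxes from -2 to +2 and accumulate totals across all pairs. The sketch below uses that coding as an assumption for illustration; it is not the analysis reported in Chapter 5.

```python
# Hypothetical tallying sketch (the -2..+2 coding is an assumption, not the
# thesis's stated analysis): convert each ticked box into a signed score and
# accumulate per-image totals across all paired comparisons.

BOX_SCORES = {
    1: -2,   # left image has much more information
    2: -1,   # left image has slightly more information
    3:  0,   # same amount of visual information
    4: +1,   # right image has slightly more information
    5: +2,   # right image has much more information
}

def tally(comparisons):
    """comparisons: iterable of (left_image, right_image, box_ticked)."""
    totals = {}
    for left, right, box in comparisons:
        score = BOX_SCORES[box]
        totals[right] = totals.get(right, 0) + score
        totals[left] = totals.get(left, 0) - score
    return totals

example = [("Balloon", "Caps", 4), ("Caps", "Flower", 2), ("Balloon", "Flower", 3)]
print(tally(example))   # {'Caps': 2, 'Balloon': -1, 'Flower': -1}
```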


C.4 Booklet Design

Booklet design chosen to minimize learning effects: 9 booklets A – I, with 3 parts in each booklet.
PART 1 – Develop Metric.
PART 2 – Validate Metric, using 2 methods: 1) paired comparison, where 3 images were presented to subjects at one time; 2) presenting 7 images all at once.
PART 3 – 7 additional visual dimensions (OBJECT NUMBER, ANGLE, DISTANCE, CONN., DETAIL, CONTR, VARIETY); check if correlated with recognition.

Booklet assignments (columns: BOOK, Image Set, the three paired-comparison images, then the remaining columns of the original table covering the all-at-once presentation and the seven visual dimensions; " " " = as in the row above):
A | a | 10B 16F OB | 25F 10B 25F O 16F OB 25B OE 25B
B | b | "  "  "    | 25B 10F OE 10B 25B O 25F OB 25F
C | c | "  "  "    | OE 16B OB 10F 25F 10B OE O 16B
D | a | 16B 25F OE | 10F 16F O 16B OE 10F OB 10B O
E | b | "  "  "    | 10B 25B 10B 16F OB 16B O 10F OB
F | c | "  "  "    | O 25F 10F 25B O 16F 10B 16B 10F
G | a | 25B 10F O  | 16B OE 16B 25F 10B 25B 10F 16F 16F
H | b | "  "  "    | 16F OB 16F OE 10F 25F 16B 25B OE
I | c | "  "  "    | OB O 25B OB 16B OE 16F 25F 10B

Key: 10B = 10x10 binary; 10F = 10x10 full grey; 16B = 16x16 binary; 16F = 16x16 full grey; 25F = 25x25 full grey; OB = original (256x256) binary; OE = original (256x256) edge; O = original (256x256).

Binary comparison of 7 different objects = 21 comparisons:
Image Set (a): Balloon – Caps, Balloon – Flower, Caps – Tree, People – Tree, People – Building, Building – Lighthouse, Flower – Lighthouse
Image Set (b): Balloon – People, Balloon – Lighthouse, Caps – Building, Caps – Lighthouse, People – Flower, Building – Tree, Flower – Tree
Image Set (c): Balloon – Building, Balloon – Tree, Caps – Flower, Caps – People, People – Lighthouse, Building – Flower, Lighthouse – Tree
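For reference when reading the booklet table, the version codes can be captured in a small lookup. The sketch below is illustrative; the entry for 25B is an assumption, since that code appears in the table but not in the key.

```python
# Lookup for the image-version codes used in the C.4 booklet table.
# Each entry is (spatial resolution, rendering). "25B" is not defined in the
# key above; 25x25 binary is assumed here by analogy with 10B and 16B.

IMAGE_CODES = {
    "10B": ("10x10", "binary"),
    "10F": ("10x10", "full grey"),
    "16B": ("16x16", "binary"),
    "16F": ("16x16", "full grey"),
    "25B": ("25x25", "binary"),      # assumption (not in the key)
    "25F": ("25x25", "full grey"),
    "OB":  ("256x256", "binary"),
    "OE":  ("256x256", "edge"),
    "O":   ("256x256", "original"),
}

def describe(code):
    resolution, rendering = IMAGE_CODES[code]
    return f"{resolution} {rendering}"

print(describe("16F"))   # 16x16 full grey
```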


Appendix D Chapter 6 Experiment

D.1 Example Test Stimulus

RANK THE BOXES 1 → 4 ACCORDING TO HOW THEY BEST (I.E. MOST INFORMATIVELY) REPRESENT THE ORIGINAL SCENE SHOWN ON THE LEFT.
1 = IMAGE THAT REPRESENTS THE ORIGINAL IMAGE THE BEST; 4 = THE WORST


D.2 Booklet Design

Two images were chosen – lighthouse (outdoor image) and chair (office image). Four low-quality versions of each were shown beside the original image. The order of presentation differed between the two images.

Lighthouse images shown in Section D.1, from left to right, are:

ORIGINAL; “OUTDOOR” WEIGHTS; BASE CASE (no importance processing); EQUAL WEIGHTS; “OFFICE” WEIGHTS

Chair images shown in Section D.1, from left to right, are:

ORIGINAL; BASE CASE (no importance processing); EQUAL WEIGHTS; “OFFICE” WEIGHTS; “OUTDOOR” WEIGHTS
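The 1-to-4 rankings collected with the D.1 stimulus could be summarised per processing method with a mean-rank tally across subjects. The sketch below illustrates one such summary; the mean rank is an assumed statistic for illustration, not a reproduction of the Chapter 6 analysis.

```python
# Hypothetical aggregation sketch: average the 1-4 rankings given by each
# subject for the four processed versions of one image (mean rank is an
# assumed summary statistic, not necessarily the analysis used in Chapter 6).

from statistics import mean

VERSIONS = ["base case", "equal weights", "office weights", "outdoor weights"]

def mean_ranks(rankings):
    """rankings: list of dicts, one per subject, mapping version -> rank (1 = best)."""
    return {v: mean(r[v] for r in rankings) for v in VERSIONS}

subject_rankings = [
    {"base case": 4, "equal weights": 2, "office weights": 3, "outdoor weights": 1},
    {"base case": 3, "equal weights": 1, "office weights": 4, "outdoor weights": 2},
]
print(mean_ranks(subject_rankings))
# {'base case': 3.5, 'equal weights': 1.5, 'office weights': 3.5, 'outdoor weights': 1.5}
```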


Appendix E Chapter 7 Experiments

E.1 Training Image Database

Streetscape


Café/Restaurant


Heads/Shoulders


Beach


Office


Home


E.2 Example Test Stimuli for Section 7.2 Experiment

If you were trying to move through this scene, which version would you find most helpful?

[Six processed versions of the scene, labelled a to f]


E.3 Example Test Stimuli for Section 7.3 Experiment

Consider the same image again. Imagine you could zoom in to one part of the image. Which zoomed version(s) shown on the bottom row would you find most helpful if you were moving through the scene? The part of the original image from which the zoomed image has been taken is shown above for interest.

Zoom window shown on original.

25x25 Black and White version of zoom window:

[Seven panels labelled a to g, including a (NO ZOOM) option]


E.4 Booklet Order for Chapter 7 Experiments

Random placement of images within booklets A – F. Numbers 1 – 6 refer to the processing method used (see the processing-method key below).

IMAGE TYPE | IMAGE No. | A B C D E F
café 1     | C1 | 1 2 3 4 5 6
street 1   | S1 | 3 5 4 6 1 2
house 1    | H1 | 2 1 6 5 3 4
face 1     | F1 | 5 4 2 3 6 1
office 1   | O1 | 4 6 1 2 3 5
beach 1    | B1 | 6 4 5 1 2 3

The image list continues with café, street, house, face, office and beach for image sets 2, 3 and 4; the placement table above is repeated for all 4 image sets.

Processing methods:
1 – PART A: IM eq; PART B: sal_trim
2 – PART A: IM opt; PART B: sal_scope
3 – PART A: IM sc; PART B: IM_trim
4 – PART A: IM tr; PART B: IM_scope
5 – PART A: No importance processing; PART B: centre
6 – PART A: edge; PART B: bottom

PART A – ROI applied to entire image. PART B – ROI used for automatic zoom.
Booklets had 48 pages – 24 for PART A and 24 for PART B. PART A and PART B were shown in consecutive page order for each image.
