IPO
Annual Progress Report 34
1999

Contents

Page

4 General

8 Work Programme 1999

13 Organization IPO

17 User-Centred Design

19 D.G. Bouwhuis and J.B. Long Developments

24 J.B. Long Cognitive ergonomics lessons: Possible challenges for USI?

37 W.A. IJsselsteijn, H. de Ridder and J. Freeman Measuring presence: An overview of assessment methodologies

49 Information Access and Presentation

51 F.L. van Nes Developments

54 F.J.J. Blommaert and T.J.W.M. Janssen The satisfied user

63 G. Klein Teeselink, H. Siepe and J.R. de Pijper Log files for testing usability

73 Multimodal Interaction

75 A.J.M. Houtsma Developments

78 D.J. Hermes and S.L.J.D.E. van de Par Scientific approaches to sound design

86 M.M.J. Houben, D.J. Hermes and A. Kohlrausch Auditory perception of the size and velocity of rolling balls

94 S.L.J.D.E. van de Par and A. Kohlrausch Integration of auditory-visual patterns

103 Spoken Language Interfaces

105 R.N.J. Veldhuis Developments

111 M.M. Bekker and L.L.M. Vogten Usability of voice-controlled product interfaces

125 C.H.B.A. van Dinther, P.S. Rao, R.N.J. Veldhuis and A. Kohlrausch A measure for predicting audibility discrimination thresholds

133 E.A.M. Klabbers, R.N.J. Veldhuis and K.D. Koppen Improving diphone synthesis by adding context-sensitive diphones

142 J. Hirschberg, D. Litman and M.G.J. Swerts On-line spotting of ASR errors by means of prosody

150 Publications

174 Papers accepted for publication

176 Colophon

For information on the use of material from this report, see the last page.

General

G.W.M. Rauterberg

IPO, Center for User-System Interaction

IPO is a research institute of Eindhoven University of Technology. As several external market studies have confirmed, IPO is the largest and best-known academic research centre in The Netherlands specialized in the domain of User-System Interaction. IPO has a long research tradition. Originally IPO was founded in 1957 as a research institute of Philips Research Laboratories and Eindhoven University of Technology. The main focus of research was the domain of perception and cognition. Early in 1997 this focus changed and so did the name of the institute: Center for User-System Interaction. The organizational structure changed too, and IPO became an autonomous university research institute. In 1999 the directorship of IPO changed. Since 1st May 1999 Prof. dr G.W.M. Rauterberg has been the new director of IPO with an integral mandate. IPO would like to thank the former scientific director Prof. dr T.M.A. Bemelmans and the managing director Ing. J.J.C.M. Ruis for all their valuable contributions to IPO in the past. Since early January 2000 IPO has been completely independent of Philips. IPO wishes to thank Philips for all the substantial support given for more than 40 years. It is IPO's mission to be an international centre for multidisciplinary scientific research and design in the field of User-System Interaction. In close cooperation with industry and the government, IPO will work on the design and realization of user-friendly, innovative interaction styles and user-centred design methods.

Quality assessment IPO

The Association of Universities in the Netherlands (VSNU) is in charge of regular quality assessment of education and research at Dutch universities. Every five years an institute is assessed by this association. The most recent assessment of IPO research was conducted in 1996. In its final report the visiting committee concluded that IPO is a laboratory of very good, and in some areas even excellent, scientific quality by international standards. The institute comprises a large community of interdisciplinary researchers of high quality, with specialists in engineering, computer vision, human language sciences, sensory perception, cognition, industrial design, and ergonomics.

Work Programme

IPO's research is organized into four problem-based research themes:
1. User-Centred Design
2. Information Access and Presentation
3. Multimodal Interaction
4. Spoken Language Interfaces

Each research theme is described in greater detail in the following sections of this Annual Progress Report.

J.F. Schouten Institute for User-System Interaction Research

The graduate school 'J.F. Schouten Institute for User-System Interaction Research' was officially recognized by the Royal Dutch Academy of Arts and Sciences. This very important milestone was a result of the outstanding performance in the past years in the area of PhD projects. The J.F. Schouten Institute is attracting international interest from graduate research fellows. In 1999 two graduates received their PhD degree.

N. Belaïd, 'Perceptual linearization of soft-copy displays for achromatic images'. Supervisors: Prof. dr J.A.J. Roufs, Prof. dr ir P.F.F. Wijn and Dr ir J.B.O.S. Martens. Eindhoven University of Technology, May 17.

T.J.W.M. Janssen, 'Computational image quality'. Supervisors: Prof. dr A. Kohlrausch, Prof. R.J. Watt (University of Stirling, UK) and Dr ir F.J.J. Blommaert. Eindhoven University of Technology, November 2.

The postgraduate school for User-System Interaction Design

On September 1st, 1998, Eindhoven University of Technology started an international two-year design programme for 'User-System Interaction'. The programme consists of a first-year course curriculum, followed by a second year that is almost completely reserved for design thesis work. The latter is based on practical assignments in industry. All courses are held in English and are organized in a strict sequence of two to three weeks per course. Participants in this programme have to apply for admission. After being selected, they are appointed as research assistants at the university. The postgraduate students are paid for participating in the design programme. The programme itself leads to an internationally recognized diploma: Master of Technological Design. The next course will start in September 2000 with 20 student positions, after which the next cohort will be admitted in March 2001. Details about this extensive and innovative design programme can be found on the web (http://www.tue.nl/ipo/usi/index.html).

Organizational issues and programme initiatives

In 1998 the Executive Board of the University nominated the working domain of User-System Interaction as a strategic research area for the university. In that context a process was started to involve the Faculties of Mathematics and Computing Science, Electrical Engineering, and Technology Management in 'User-System Interaction'. The short-term goal is to formulate a strategic interfaculty research programme in which the faculties mentioned and IPO will be involved.

Since 1977 the funding programme SOBU (Samenwerkings Orgaan Brabantse Universiteiten) has supported cooperation projects between Tilburg University (Katholieke Universiteit Brabant, KUB) and Eindhoven University of Technology (TUE). In 1999 IPO ran five projects under this programme: 'The prosodic realization of text structures', 'Machine learning algorithms for linguistic aspects of speech synthesis', 'Context controlled language generation in multimodal human-computer interaction', 'Context controlled dialogue management for spoken language interfaces', and 'The user perspective on electronic commerce'.

In 1998 a large applied Innovation Oriented research Programme (IOP) was started in The Netherlands on 'Man-Machine Interaction' (IOP-MMI), sponsored by the Dutch Ministry of Economic Affairs. This 20 million guilder programme will enable several research institutes to do research in selected areas, in close cooperation with industry. The duration of this special IOP-MMI programme will be six years. In 1999 IPO had two projects approved: 'Multimodal access to transactions and information services', and 'Interaction and dialogue structure in user interfaces'.

In 1999 a third important funding programme, 'ToKeN 2000', was launched in the context of the National Action Programme (NAP Elektronische Snelweg), sponsored by the Dutch Ministry of Economic Affairs, the Dutch Ministry of Education, Culture and Sciences, and the Netherlands Organization for Scientific Research (NWO). This programme will bring together researchers in the fields of computing sciences and cognitive sciences. IPO is contributing with a project called 'Flexible interfaces for dialogues in a multimedia world'.

The fourth large and interesting programme is the Information Society Technologies (IST) part of the 5th Framework Programme of the European Commission. IPO also intends to make substantial contributions to this programme. In 1999 IPO contributed to work package 2, 'Evaluation of new entertainment media and display technologies', of the ACTS project TAPESTRIES.

In 1997 the European Commission launched a network of i3 projects in the framework of the Esprit Long Term Research programme. i3 stands for Intelligent Information Interfaces. It aims to stimulate research projects that explore and prototype radically new human-centred systems and interfaces for the interaction with information. IPO is participating in the i3 project COMRIS (Co-habited mixed-reality information spaces).

Another project funded by the European Commission in which IPO participated was VODIS (Voice-operated driver information system). This project, funded within the fourth framework ESPRIT programme, was completed successfully in 1999.

Conferences and workshops

On 2nd February 1999 the annual meeting of the Dutch Audio Engineering Society took place at IPO. The topic was 'Intermodal aspects of vision and sound'. About 90 experts in this field participated.

From 5th to 9th April 1999 IPO was involved in the organization of the 2nd International Workshop on 'Presence' at the University of Essex, Colchester (UK). This workshop was part of the ACTS TAPESTRIES project activities.

On 23rd April 1999 one of the two speakers at the 'Dies Natalis' ceremony of Eindhoven University of Technology was Prof. dr G.W.M. Rauterberg. The title of his presentation was 'User-system interaction: A challenge for the present and the future'.

On 25th June 1999 a meeting of the Dutch Society of Phonetic Sciences took place at IPO. Several researchers intensively discussed upcoming issues in this field.

In association with Eindhoven University of Technology (TUE) and the European Speech Communication Association (ESCA), IPO organized a three-day Workshop on Dialogue and Prosody from 1 to 3 September 1999. This workshop took place near Eindhoven. The workshop brought together researchers representing theoretical, empirical, computational, and experimental approaches to exchange research findings and ideas about the interplay between dialogue and prosody.

On 2-3 December 1999 the first 'Symposium on User-System Interaction' (SUSI) was held at Carlton De Brug, Mierlo (The Netherlands); it was completely organized and financed by IPO. The symposium attracted about 50 participants, primarily from different faculties of Eindhoven University of Technology and from IPO.

Work Programme 1999

General

The research activities of IPO in 1999 were structured in the following programmes and projects:

1 User-Centred Design

• Log files for testing usability.
• Sensible interfaces for medical systems.
• Support for user-centred design.
• Usability metrics database.
• Natural user interfaces.

2 Information Access and Presentation

• Information access.
• Trust and control.
• E-commerce.
• The role of a mental model in search.
• Agents and objects: On the dynamics of interaction.

3 Multimodal Interaction

• Tele-applications in the home environment.
• Interaction and evaluation metrics in user interfaces.
• Human-to-human interaction on internet.
• Modality selection in human-system interaction.
• Non-speech audio for user interfaces.
• Auditory-visual integration.
• Tactile information in user-system interaction.
• Visual pattern recognition in natural user interfaces.
• Modelling and manipulating architectural design tools in virtual environments.
• User interfaces for architectural design tools in virtual reality.

4 Spoken Language Interfaces

• Multi-appliance multi-user technology for voice-controlled applications in the home.
• Multimodal access to information systems.
• Development of text-to-speech conversion.
• Interactive portable speech-oriented interfaces.
• Co-habited mixed-reality information spaces.
• Dialogue management and output generation in spoken language information systems.
• SLI-development research.
• Speech interface for PC telephone.
• Properties of a sinusoidal representation of music and speech.

5 PhD thesis projects

• Contrast sensitivity of the human eye and its effects on image quality.
• Chinese-Dutch business negotiations: Insights from discourse.
• A tale and two adaptations: Coping processes of older persons in the domain of independent living.
• Fidelity of soft-copy displays for achromatic images.
• Adaptive multimodal interaction.
• Fundamental aspects of image quality.
• Dialogue management and output generation in spoken language information systems: Speech synthesis.
• Dialogue management and output generation in spoken language information systems: Language generation.
• Cognitive modelling of technology-related behaviour patterns of older consumers.
• Quality metric for MPEG-2 coded video.
• Perceived realism in 3D imaging.
• Machine learning algorithms for linguistic aspects of speech synthesis.
• The prosodic realization of text structure.
• Aspirations of older people in relation to their use of private communication technologies.
• Generation and perception of non-speech sound stimuli for use in multimodal user-system interaction.
• Choice and value elicitation of complex technological systems.
• Target acquisition with visual and tactile feedback.
• A new model for the perception of binaural stimuli.
• Goals and implementation plans in routine user behaviour.
• Is usability compositional?
• Improvement of the flexibility of concatenative speech synthesis.
• Salesgirl agents as user support for navigating E-commerce databases.
• The influence of 'exposure time' on intelligibility, processing time and acceptability of synthetic speech.
• Natural mapping for control elements.
• Automating methods and tools for user-centred design.
• Modelling and manipulating architectural design tools in virtual environments.
• A software system for informational support of R&D.
• A design tool for user interface design tasks.
• Computer vision based object manipulation in an augmented reality environment.

Cooperation with external organizations:
• AT&T Research Laboratories, Florham Park, NJ, USA
• ATR/HIP (Human Information Processing) Research Laboratories, Kyoto, Japan
• ATR/MIC (Media Integration & Communications) Research Laboratories, Kyoto, Japan
• ATR/SLT (Spoken Language Translation) Research Laboratories, Kyoto, Japan
• Baan Institute, Putten, The Netherlands
• Bildröhren Fabrik, Aachen, Germany
• CELEX project for Lexical Data Bases, Nijmegen, The Netherlands
  Other participants:
  - Instituut voor Nederlandse Lexicografie, Leyden, The Netherlands
  - Max Planck Institut für Psycholinguistik, Nijmegen, The Netherlands
• Centrum voor Gesproken Lectuur, Grave, The Netherlands
• COST 258 (Chair: Prof. E. Keller, University of Lausanne, Lausanne, Switzerland)
• CWI, Centre for Mathematics and Computer Science, Amsterdam, The Netherlands
• Delft University of Technology, Faculty of Electrical Engineering, Delft, The Netherlands
• Delft University of Technology, Faculty of Industrial Design Engineering, Delft, The Netherlands
• Delft University of Technology, Faculty of Information Technology and Systems, Delft, The Netherlands
• Design Academy, European Design Centre, Eindhoven, The Netherlands
• Eindhoven University of Technology, Department of History of Technology, Eindhoven, The Netherlands
• Essex University, Essex, Great Britain
• Fluke Industrial Design, Almelo, The Netherlands
• Foundation for Speech Technology, Utrecht, The Netherlands
  Other participants:
  - Amsterdam University, Institute of Phonetic Sciences, Amsterdam, The Netherlands
  - Leyden University, Department of General Linguistics, Phonetics Laboratory, Leyden, The Netherlands
  - Nijmegen University, Department of Language and Speech, Nijmegen, The Netherlands
  - The Utrecht Institute of Linguistics OTS, Utrecht, The Netherlands
• Frontier Design, Ede, The Netherlands
• Groningen University, Biomedical Technology Centre, Groningen, The Netherlands
• Independent Television Commission (ITC), Hampshire, Great Britain
• Indian Institute of Technology, Madras, India
• Kodak Inc., Rochester, NY, USA
• KPN Research, Leidschendam, The Netherlands
• KTH, Stockholm, Sweden
• Language and Speech Technology (priority programme of NWO, the Netherlands Organization for Scientific Research)
  Other participants:
  - Amsterdam University, Alfa Informatica, Amsterdam, The Netherlands
  - Groningen University, Alfa Informatica, Groningen, The Netherlands

  - KPN Research, Leidschendam, The Netherlands
  - Nijmegen University, Department of Language and Speech, Nijmegen, The Netherlands
  - Nijmegen Institute for Cognition and Information (NICI), Nijmegen, The Netherlands
  - Philips Dialogue Systems, Aachen, Germany
  - Public Transport Travel Information (OVR), Utrecht, The Netherlands
• La Trobe University, Department of Computer Science and Computer Engineering, Melbourne, Australia
• Leyden University, Department of , Leyden, The Netherlands
• LIMSI-CNRS, Orsay, France
• Maastricht University, IKAT, Department of Computer Science, Maastricht, The Netherlands
• Maastricht University, Medical Information Research Group, Maastricht, The Netherlands
• Moscow State University, Department of Psychophysiology, Moscow, Russia
• Mountside Software Engineering, Eindhoven, The Netherlands
• Nijmegen Institute for Cognition and Information (NICI), Nijmegen, The Netherlands
• Nijmegen University, Department of English, Nijmegen, The Netherlands
• Nijmegen University, Department of Language and Speech, Nijmegen, The Netherlands
• Nijmegen University, Expertise Centrum Nederlands, Nijmegen, The Netherlands
• Philips Components, Eindhoven, The Netherlands
• Philips Corporate Standardization Department, Eindhoven, The Netherlands
• Philips Design, Human Behaviour Research-Knowledge Area, Eindhoven, The Netherlands
• Philips Display Components, Eindhoven, The Netherlands
• Philips Hearing Instruments, Eindhoven, The Netherlands
• Philips Research Laboratories, Eindhoven, The Netherlands
• Philips Research Laboratories, Aachen, Germany
• Philips Sound & Vision, ASA Laboratories, Eindhoven, The Netherlands
• Polderland, Nijmegen, The Netherlands
• Project COMRIS (ESPRIT programme of the European Union)
  Other participants:
  - GMD - National German Research Center for Information Technology, Bonn, Germany
  - IIIA-CSIC, Institut d'Investigació en Intel·ligència Artificial, Consejo Superior de Investigaciones Científicas, Barcelona, Spain
  - Riverland Next Generation, Brussels, Belgium
  - The University of Reading, Reading, Great Britain
  - Universität Dortmund, Dortmund, Germany
  - Vrije Universiteit Brussel, Brussels, Belgium
• Project Dialogue Management and Knowledge Acquisition (DenK) (sponsored by the Cooperation Unit of Brabant Universities (SOBU), Tilburg, The Netherlands)
  Other participants:
  - Eindhoven University of Technology, Department of Mathematics and Computer Science, Eindhoven, The Netherlands
  - Tilburg University, Department of Linguistics and Computer Science, Tilburg, The Netherlands
• Project TAPESTRIES (ACTS programme of the European Union)
  Other participants:
  - AEA Technology plc, Culham, Great Britain
  - CCETT, Cesson-Sévigné Cedex, France
  - Centro Studi e Laboratori Telecommunicazione (CSELT), Torino, Italy
  - Essex University, Essex, Great Britain
  - European Broadcasting Union (EBU), Geneva, Switzerland
  - Independent Television Commission (ITC), Hampshire, Great Britain
  - Radiotelevisione Italiana (RAI), Torino, Italy

• Project VODIS (European Community, Language Engineering)
  Other participants:
  - ELAN Informatique, Toulouse, France
  - INESC, Lisbon, Portugal
  - Lernout & Hauspie Speech Products, Ypres, Belgium
  - Multicom Research, Pittsburgh, PA, USA
  - PSA Peugeot Citroën, Vélizy-Villacoublay, France
  - Renault Research Innovation, Rueil-Malmaison, France
  - Robert Bosch GmbH, Gerlingen, Stuttgart, Germany
  - Robert Bosch/Blaupunkt GmbH, Hildesheim, Germany
  - RWTH Aachen, Aachen, Germany
  - S&P MEDIA GmbH, Bielefeld, Germany
  - Universität Karlsruhe, Karlsruhe, Germany
  - Volkswagen AG, Wolfsburg, Germany
• Rijksmuseum, Amsterdam, The Netherlands
• School of Arts, Faculty of Art, Media & Technology, Interaction Design, Utrecht, The Netherlands
• Software ITC, Cluj-Napoca, Romania
• Swiss Federal Institute of Technology (ETH), Institute for Construction and Design, Zürich, Switzerland
• Swiss Federal Institute of Technology (ETH), Institute for Hygiene and Applied Physiology, Zürich, Switzerland
• Technion, Department of Science Education, Haifa, Israel
• F.J. Tieman BV, Rockanje, The Netherlands
• Tilburg University, Department of Linguistics, Tilburg, The Netherlands
• Twente University, Department of Computer Sciences, Enschede, The Netherlands
• Universitat Autònoma de Barcelona, Department of Psychology, Barcelona, Spain
• University College London, Department of Psychology, London, Great Britain
• University of Connecticut, Department of Surgery and Neuroscience, Farmington, CT, USA
• University of Kansas, Department of Psychology, Lawrence, KS, USA
• University of London, Goldsmiths College, Department of Psychology, London, Great Britain
• University of Oldenburg, Physics Department, Oldenburg, Germany
• University of Skövde, Department of Languages, Skövde, Sweden
• Utrecht University, Buys Ballot Laboratory, Utrecht, The Netherlands
• Utrecht University, Faculty of Social Sciences, Department of Psychonomics, Utrecht, The Netherlands
• Visueel Advies Centrum, Eindhoven, The Netherlands

Organization IPO

Board of Management
Prof. dr W.M.J. van Gelder (chairman) (until 27th May) - Faculty of Technology Management, Eindhoven University of Technology
Prof. dr J.C.M. Baeten - Faculty of Mathematics and Computing Science, Eindhoven University of Technology
Prof. dr ir W.M.G. van Bokhoven - Faculty of Electrical Engineering, Eindhoven University of Technology
Drs E.P.C. van Utteren - Corporate Research, Philips, Eindhoven

Advisory Board
Prof. dr W.J.M. Levelt (chairman) - Max Planck Institut für Psycholinguistik, Nijmegen
Prof. dr P.M.E. de Bra - Faculty of Mathematics and Computing Science, Eindhoven University of Technology
Prof. dr G.A.M. Kempen - Faculty of Social Sciences, Leyden University
Prof. dr M. Moortgat - The Utrecht Institute of Linguistics OTS
Prof. drs J. Moraal - Faculty of Technology Management, Eindhoven University of Technology
Prof. dr ir L.C.W. Pols - Institute of Phonetic Sciences, University of Amsterdam
Dr L.R.B. Schomaker - Nijmegen Institute for Cognition and Information, Nijmegen University
Prof. dr B. Wielinga - Faculty of Psychology, Department of Social Science Informatics, Amsterdam University

Scientific Director
Prof. dr T.M.A. Bemelmans* (until 30th April)

Managing Director
Ing. J.J.C.M. Ruis* (until 30th April)

Director
Prof. dr G.W.M. Rauterberg (from 1st May)

Management Team
User-Centred Design: Prof. dr G.W.M. Rauterberg (until 21st June); Prof. dr D.G. Bouwhuis and Prof. dr J.B. Long (from 22nd June)

Information Access and Presentation: Dr ir F.J.J. Blommaert (until 14th June); Prof. dr ir F.L. van Nes (from 15th June)

Multimodal Interaction: Dr ir R.J. Beun (until 30th June); Prof. dr A.J.M. Houtsma (from 1st July)

Spoken Language Interfaces: Dr ir R.N.J. Veldhuis

Personnel Management
Ms P.J. Evers, BC.

Financial Management
L.F. Luwijs

Research Staff
Ir R.M.C. Ahn
Ms Dr M. Beers
Ms Dr ir M.M. Bekker
Dr ir R.J. Beun*
Dr ir F.J.J. Blommaert
Prof. dr H. Bouma*
Dr ir R.J. Collaris
Prof. dr R. Collier
Ms Dr A.F.M. Doeleman
Dr T.D. Freudenthal
Dr H.P. de Greef
Drs G. de Haan
Dr D.J. Hermes
Ms Drs O.M. van Herwijnen
Ms Dr ir M.D. Janse
Dr ir T.J.W.M. Janssen
Ms Dr R.J.E. Joordens*
Prof. dr A. Kohlrausch
Ms Drs K.D. Koppen
Dr E.J. Krahmer
Prof. ir S.P.J. Landsbergen
Dr P. Markopoulos
Dr ir J.B.O.S. Martens
Ir K.M. van Mensvoort
Dr ir S.L.J.D.E. van de Par
Dr X.H.G. Pouteau*
Dr H.H.A.M. Prüst
Dr J.R. de Pijper
Ms Dr P.S. Rao*
Ms Dr S.M.M. te Riele
Drs J.A. Rypkema*

Research Staff (continued)
Drs A.H.M. Siepe
Dr M.G.J. Swerts
Dr J.M.B. Terken
Dr ir G.E. Veldhuijzen van Zanten*
Ms Dr I.M.L.C. Vogels
Dr ir L.L.M. Vogten
Ms Dr M. Vroubel
Ms Dr ir M.F. Weegels*
Dr J.W. Wesselink
Dr ing. F.C. van Westrenen
Dr W. Wong
Dr S.N. Yendrikhovskij

Visiting Scientists
Ms Dr B. Colombo - Università Cattolica del Sacro Cuore, Milan, Italy
Prof. dr Y. Hou - Xi'an Jiaotong University, Xi'an, P.R. China
Ms M. D'Imperio, M.Sc. - Ohio State University, Columbus, Ohio, USA
Prof. dr J.F. Juola - University of Kansas, Lawrence, Kansas, USA
Prof. dr J.B. Long - University College London, London, Great Britain
Ms Drd. ing. Z. Valsan - University Politehnica of Bucharest, Romania

Associate Staff
Ing. M.C. Boschman
Ir D. Kersten
Ir J.A. Pellegrino van Stuyvenberg

Doctoral Candidates
D. Abrazhevich, M.Sc.
D. Aliakseyeu, M.Sc.
Ir D.J. Breebaart
Ir W.P. Brinkman
Drs C.H.B.A. van Dinther
Ms Drs M.d.l.M. Docampo Rama
F.N. Egger, M.Sc.
Ir A.R.H. Fischer
Ir M.M.J. Houben
Ir T.J.W.M. Janssen*
Ms Drs H. Keuning-van Oirschot
Ms Drs E.A.M. Klabbers
Ms Drs G. Klein Teeselink
A. Labanouski, M.Sc.
A. Malchanau, M.Sc.
Ms Drs L.M.J. Meesters
Ms Drs J.N. den Ouden
Drs S.C. Pauws
S. Subramanian, M.E.
Ms Drs M. Theune
Ir M.C. Willemsen*
Drs W.A. IJsselsteijn
Ms Drs M. van Zundert

Designers Programme USI
Drs M.A.C. Alders
A. Asayonak, M.Sc.
T. Bahnyuk, B.Sc.
Drs H.B.J. Bartelink
Dipl.des. C. Bartneck
O. Bidiuk, M.E.
Drs R.T. de Boer
Drs A. de Booij
Ms M. Bukowska, M.Sc.
H.J. Cho, M.Sc.
Ms Drs C.T.M. Doucet
Drs J.J. Fahrenfort
Drs C. Friedrich
Ms I.C.J. Halters, Lic.
Ms Drs K. van der Hiele
Ms Drs E.A.W.H. van den Hoven
J. Hu, M.E.
Ms I. Joglik, M.Sc.
Ms Drs J. Kort
Drs R. Kramer
Ms N. Kravtsova, M.Sc.
A. Kuliashou, M.Sc.
Ms T.A. Lachina, M.Sc.
Drs H.L. Lamberts
S. Lashin, B.Sc.
S. Martchenko, B.Sc.
Ing. P.J. Marijnissen
B. Menchtchikov, M.Sc.
Ms Ir A.C. Morskate
Ms Ir E.M. Perik
Drs L.M.C. Pernot
W. Qi, M.Sc.
Ir J.F.J. van Rooij
S. Ryjov, M.Sc.
Ms Drs D.C.G. Schaap
Ms M. Tsikalkina, M.Sc.
H. Wang, M.E.
K. Yondenko, M.Sc.

Services

Administrative Staff
Ms U. van Amelsfort
Ms C.G.J.M. Driessen-Wouters

Library
P.G.F.F. Barten

Systems Management
F.M. Aerts
R.G.H. Curfs*
Ing. C.P.J.M. Kuijpers

Technical Support
A. Aarts
J.H. Bolkestein*
N.L.J. van de Ven*

* Left during 1999

USER-CENTRED DESIGN

Developments

D.G. Bouwhuis and J.B. Long e-mail: [email protected]

Introduction

User-Centred Design constructs and validates design knowledge that can be used to develop interactive software and hardware design solutions for humans. The nature of the interaction between users and applications, which constitutes the design problem, is characterized by three main features of performance: effectiveness, efficiency, and satisfaction. As interactive applications are increasingly present in the daily life environment and the task environment of the user, such systems should be designed from the outset with use in mind, rather than with an emphasis on technical functionality. This approach is called User-Centred Design (UCD).

UCD normally involves a number of different key actors throughout the development of the application: the users, to obtain feedback about the design and use of the system, and the designers, to provide prototypes for users to try out and to redesign the prototypes in the light of user performance, feedback, and comments. The benefits of this approach include increased productivity, enhanced quality of work, reductions in support, training costs and workload, and improvements in health and safety. Over the years, it has been shown that large software applications are much more likely to live up to their expectations, and meet user requirements, if users are involved in the design from the earliest stages. In many cases, UCD therefore is a powerful determiner of the success of an application, with a favourable cost/benefit ratio.

Since interactive applications continue to penetrate ever more layers of society, requirements must be extended even further. People dealing with public software applications and information systems not only need the legal right to use them, but also easy and full access to them, which means the usability requirements are even more demanding. On the other hand, interactive applications may have such an appropriate computational basis that they can deal with knowledge at a high level of abstraction, and so provide powerful new options for users, especially professional ones. This provision, however, can only be made effectively when object representations are stored with features identified by human users, which again stresses the need for user involvement in system development.

In UCD two approaches are distinguished: (1) the product view and (2) the process view. In the first, the emphasis is on the features of the product under development. The organization of the development team, its resources, and its strategies are addressed by the process view. While these two approaches are clearly different in their focus, in actual practice both approaches go together to an increasing extent. Products are no longer just single boxes with a line cord, a display, and control elements; they are gradually growing into environments that provide complex functionality, in which several users, rather than one, participate in the execution of a task. Controlling such an environment is decidedly less predictable than controlling a single product with known and limited functionality.

One example that reflects both a product and a process view is the series of studies concerned with image manipulation that originally made use of the Build-It system. The latter's versatility led to further development, resulting in the Visual Interaction Platform (VIP), which is easier to set up and is transportable. It has, among other applications, been used for multi-user trials with interactive picture manipulation in collaborative settings. Subjects could present, select, and manipulate pictures on a table top that was simultaneously visible to users in different locations, during which they could see each other on a vertically mounted projection screen and engage in video conferencing. Such situations promote a powerful sense of interaction between people. One reason for this effect is that the domain under discussion and manipulation is directly at hand for all dialogue partners concerned. Another reason is that the size of the projected image of the partner's face is much more life-like than when displayed in a screen window. Analogous to the sense of Presence in stereo TV, this effect has been dubbed Co-presence, and it is the topic of investigation in the area of interactive smart home technology.

The home is an environment that will become much more interactive with the advent of distributed intelligence and wearable computers, creating ambient intelligence, which is not so much a product as an environment that is in continuous interaction with the inhabitants of the house. These issues are being studied in a project called the LivingLab, which is intended to provide a laboratory facility for semi-permanent habitation. One of the aims of this project is to investigate gerontechnological applications in realistic living situations. From this application, direct extensions can be made to applications in Public Health Engineering (cooperation is planned with the Faculty of Architecture of Eindhoven University of Technology) and to applications for People with Special Needs (PSN). A final application is directed at a target group that has up to now been almost completely bypassed by Information and Communication Technology (ICT): the group of infants. We expect that, after suitable adaptation, the VIP platform can be modified to become an immersive environment for infants and small children. In it, they will be able to manipulate virtual images of toys, which have their own types of behaviour and corresponding sounds.

Projects

Support of user-centred design

The historical challenge of the User-Centred Design approach is to involve users of a future product as early as possible in the design process. At such a point in time, the functionality of the product has generally not been decided upon, and ordinarily the user is quite unfamiliar with the new possibilities that the product will provide. Yet another challenge is how the product functionality can be integrated into the daily activities of the user, or target group, whereas it usually is assumed in product design that it is the product that will be designed and implemented, and not the task environment or working conditions. Few, however, would contest that products like the telephone or the TV have had an important role in shaping the lifestyle and living conditions of many people. In the absence of a clear idea how, and in what context, future products will be used, special techniques have to be developed to help users imagine for what applications a product could be used and what its interface style would be like. Efforts to develop such techniques have been made in a number of ongoing projects, including future telecom services in the home, and video conferencing for medical information and for art collections (the latter being part of the ToKeN2000 project). One technique involves encouraging people to think aloud about activities they carry out in the home or working situation, or to have them write down the activities they perform during a day. Another way is to make props that help people to imagine concrete devices that could perform some function. People then imagine how they could carry out activities of a specified nature, on the basis of which applications could be developed. This approach is called reflective design.

Informational support of the software design process

We share a growing interest in problem-solving processes typical of a Software Design and Development Process (SDDP). When a designer solves a sub-problem (which belongs to the SDDP) and knows little or even nothing about a possible solution, the problem of an ill-defined information need arises. The proposed approach to support the software designer with an ill-defined information need is to combine (1) task-based information retrieval techniques and (2) information-exploration interface principles. This project includes an investigation of (1) the search behaviour of software designers and (2) the software design task. The search behaviour will be analysed in observational studies. The software design task will be studied by analysis and formalization of software design and development standards. The central problem here is to find a way to transform a design problem description (which can easily be acquired from a designer) into an effective query. One possible solution is to apply neural networks. An alternative approach is to simulate the problem-solving process to produce the queries sequentially. A final part of the design solution is to find a suitable way to present the retrieved information. Information-exploration interface theory is likely to be useful for this purpose.
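The transformation of a design problem description into a query can be made concrete with a small sketch. The function below is a deliberately simple alternative to the neural-network approach mentioned in the text: it extracts the most frequent content words as ranked query terms. All names, the example description, and the stop-word list are illustrative assumptions, not part of the project.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not from the project).
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "for", "and", "is", "that",
              "with", "on", "be", "we", "our", "this", "it", "as", "are",
              "must", "need"}

def problem_to_query(description: str, max_terms: int = 5) -> list[str]:
    """Extract the most frequent content words as ranked query terms."""
    words = re.findall(r"[a-z]+", description.lower())
    content = [w for w in words if w not in STOP_WORDS and len(w) > 2]
    return [term for term, _ in Counter(content).most_common(max_terms)]

query = problem_to_query(
    "We need to design a login dialogue; the login dialogue must validate "
    "passwords and report validation errors to the user."
)
print(query)
```

A real system would weight terms against a document collection (e.g. TF-IDF) rather than by raw frequency, but the sketch shows the basic problem-description-to-query step.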

Analysis of user satisfaction at the implementation level

User satisfaction is considered to be one of the main determinants of interface quality. Satisfaction is an 'emotion' that indicates the attainment of a desired goal within acceptable boundaries of time and effort. Both in emotion theory and in interaction theory, (at least) three levels of interaction are distinguished. These levels can be expressed as a semantic, a strategy-defining, and an implementation level. In emotion, the levels are referred to as tertiary (mood), secondary, and primary (directly related to action) emotions. At the implementation level, primary emotions are mainly involved. To investigate the satisfaction of users, the interaction dialogue is divided into four processes. One process is involved with the system, and three processes are internal to the user: perception, cognitive computation, and action, respectively. In order to make predictions about the satisfaction of users at the implementation level, manipulations of each of these four processes will be studied in relation to their influence on overall user satisfaction. The major questions in this project are to determine whether user satisfaction at the implementation level can be established on the basis of subjective values of effectiveness and efficiency, and whether the latter are influenced by the demands made on the effort to complete a task. Another goal of the project is to quantify a model in which the effects of the processes of the low-level interaction cycle are compared to each other with regard to their total effect on user satisfaction.
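The kind of quantitative model the project aims at could, in its simplest form, express overall satisfaction as a weighted combination of per-process contributions. The sketch below is only an illustration of that idea: the four process names follow the text, but the weights, the 0-1 scales, and the linear form are assumptions, not results of the project.

```python
# The four processes of the interaction dialogue, as named in the text:
# one system process and three user-internal processes.
PROCESSES = ("system", "perception", "cognition", "action")

def satisfaction(scores: dict[str, float],
                 weights: dict[str, float]) -> float:
    """Weighted mean of per-process satisfaction scores (each in [0, 1])."""
    total_weight = sum(weights[p] for p in PROCESSES)
    return sum(weights[p] * scores[p] for p in PROCESSES) / total_weight

# Hypothetical scores and weights, purely for illustration.
s = satisfaction(
    scores={"system": 0.9, "perception": 0.8, "cognition": 0.6, "action": 0.7},
    weights={"system": 1.0, "perception": 1.0, "cognition": 2.0, "action": 1.0},
)
print(round(s, 3))
```

Comparing the fitted weights of such a model is one way to compare the processes "with regard to their total effect on user satisfaction", as the text puts it.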

User-related factors in electronic payment systems in e-commerce

The aim of this project is to develop a framework for a consumer-business payment system in an e-commerce environment that can be optimized with respect to user-centred factors. In this optimization process, several groups of properties of payment systems will be taken into account: (1) user-related properties, such as usability and customer base, (2) transaction and database properties, such as reliability, traceability, and security, and (3) system properties, such as scalability, interoperability, and flexibility. The first step in this research will be to select properties that are of importance in an e-commerce payment system. This selection will be carried out using the literature, existing payment systems, and expert knowledge elicitation. In the second step, the relationship between the selected properties and user-related properties will be established. The framework will be constructed taking into consideration only the most important, dominant properties. This framework will then be the basis for the evaluation of existing payment systems. This evaluation will provide a base for the important properties that in turn can be used to refine the framework. The knowledge about the weaknesses of existing systems will be the foundation for a redesign plan. The framework can then be used as the basis for the implementation of a new payment system that can be evaluated by users in an experimental setting.
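The evaluation step of such a framework might be sketched as a weighted scoring of candidate payment systems on the selected properties. Everything below is an illustrative placeholder: the property names are taken from the three groups in the text, but the weights, ratings, and candidate systems are invented for the example.

```python
# Hypothetical property weights (higher = more important in the framework).
WEIGHTS = {"usability": 3, "security": 3, "traceability": 1,
           "scalability": 2, "interoperability": 1}

def score(system_props: dict[str, float]) -> float:
    """Weighted sum of property ratings (each property rated 0-5)."""
    return sum(WEIGHTS[p] * system_props.get(p, 0) for p in WEIGHTS)

# Invented candidate systems, for illustration only.
candidates = {
    "system_a": {"usability": 4, "security": 3, "traceability": 2,
                 "scalability": 3, "interoperability": 4},
    "system_b": {"usability": 2, "security": 5, "traceability": 4,
                 "scalability": 2, "interoperability": 3},
}
ranking = sorted(candidates, key=lambda s: score(candidates[s]), reverse=True)
print(ranking)
```

Low-scoring properties of existing systems would then point to the weaknesses that, per the text, feed the redesign plan.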

Perceived realism in 3D imaging

This PhD project was part of the European project ACTS TAPESTRIES, which was, amongst other things, concerned with the evaluation of new entertainment media and display technologies, such as virtual reality, stereoscopic TV, and telepresence applications. The aim of the ongoing PhD project is to identify and test factors that are likely to have an impact on various measures of observers' subjective appreciation, in particular presence (i.e., the sense of 'being there' in a mediated environment), quality, naturalness, and eye strain. To this end, several measurement methodologies have been developed and applied. The project intends to arrive at recommendations for the design of high-presence applications, as well as to develop and test methodologies for measuring the subjective impact of these advanced media technologies.

Aspirations of older people in relation to their use of communication technologies

This project aims to investigate systematically the relationship between the aspirations of older people in their daily lives and their use of communication technologies (for example telephone, e-mail, the internet, or a videophone), and how this use interacts with the older individual's personal and environmental conditions. The project concerns the following:
• The adjustment of older people to their personal and situational circumstances.
• The present aspirations and plans of older people under current conditions.
• The expansion of the aspirations and plans of older people through experiencing new opportunities in communication.

Age and generation effects in user-system interaction

It has been widely observed that elderly people show a reluctance to use interactive devices. This reluctance is reported to be caused by the difficulty in controlling the devices successfully. It is too easy to describe this effect as just one of 'age' (which explains very little, but still may be considered inevitable and, therefore, something to be accepted). A different way of explaining the difficulty of use is by postulating a dependence on personal experience that corresponds less and less with advancing technology. The latest experiments in the framework of this project centred on the ease of using interfaces of varying complexity, in which generation and age variables were carefully controlled. On the basis of earlier research, two generations were distinguished: the 'electromechanical' generation (born before 1960) and the 'software' generation (born after 1960). The software generation could be further divided into the 'display' generation (born between 1960 and 1970) and the 'menu' generation (born after 1970). The two main dependent variables were errors and execution times. Results showed that the software generation made only a few errors, in contrast to the electromechanical generation. However, there was no relation with age within either group, which strengthens the idea that experience, rather than the aging process per se, is the main determinant. Conversely, execution times showed only an age effect and hardly any generation effect. Consequently, there is a dissociation between age and generation that points to the factors responsible for the efficiency of use of interactive equipment. Other factors that have been investigated relate to the actual use of equipment by the participants. Prior knowledge, as measured by the period of time participants had used a remote control, a device with 'software' control, or a mobile phone, had an important influence on the ease with which participants could control the interfaces. Interestingly, the level of interface complexity did not interact with age or generation, which again points to the importance of personal experience.
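The dissociation described above (errors tracking generation, execution times tracking age) can be illustrated with a small grouping sketch. The participant records below are entirely hypothetical, shaped only to mirror the reported pattern; none of the numbers come from the study.

```python
from statistics import mean

def group_means(participants, key):
    """Mean (errors, execution time) per level of `key`, e.g. 'generation'."""
    groups = {}
    for p in participants:
        groups.setdefault(p[key], []).append(p)
    return {g: (mean(x["errors"] for x in ps), mean(x["time_s"] for x in ps))
            for g, ps in groups.items()}

# Hypothetical records, for illustration only: the software generation makes
# fewer errors; execution times grow with age within both groups.
participants = [
    {"generation": "electromechanical", "age": 55, "errors": 6, "time_s": 48},
    {"generation": "electromechanical", "age": 65, "errors": 7, "time_s": 55},
    {"generation": "software", "age": 25, "errors": 1, "time_s": 30},
    {"generation": "software", "age": 35, "errors": 1, "time_s": 36},
]
print(group_means(participants, "generation"))
```

Grouping the same records by age band instead of generation would show the complementary pattern for execution times, which is the dissociation the experiments exploited.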

Cognitive ergonomics lessons: Possible challenges for USI?

J.B. Long e-mail: [email protected]

Abstract

IPO, Center for User-System Interaction is in the process of extending its traditional research in human perception and cognition to include user-system interaction (USI). Cognitive Ergonomics (CE) attempted a similar extension 20 years ago. Since USI and CE have much in common, it may be instructive for IPO research programmes to consider the lessons learned by CE, and those remaining to be learned, as possible challenges for USI. To facilitate such consideration, this paper reviews the past, present, and future of CE as a discipline. The past, and in particular its shortcomings, is characterized in terms of an earlier description offered by Long (1987). The present is characterized in terms of the lessons assumed to have been learned by CE since that publication. The future is characterized in terms of lessons which still remain to be learned. Possible challenges for USI are identified for all CE lessons. Last, the paper raises some general issues concerning CE and USI.

Introduction

Cognitive Ergonomics (CE) emerged in the late 1970s on the back of the 'personal' computer and the 'naive' user. The contrast was with the 'mainframe' computer and the 'professional' user. The emergence was associated with developments in Cognitive Psychology, Linguistics, and related disciplines (Card, Moran & Newell, 1983). 'Cognitive' Ergonomics was in contrast to 'Physical' or 'Traditional' Ergonomics (Long, 1987). According to Long, "The advent of the computer, together with changes in Psychology, have given rise to a new form of Ergonomics termed 'Cognitive Ergonomics'". This paper considers the past, present, and future of CE, respectively, in terms of Long's original paper, some lessons learned since its publication, and some lessons which still remain to be learned. Both types of lesson may be considered as possible challenges for user-system interaction (USI). "The most general definition of Cognitive Ergonomics is thus the application of ... to work ... to achieve the optimization (between people and their work) ... with respect to well-being and productivity" (Long, 1987). It is further argued that "Cognitive Ergonomics, then, is a configuration relating work to science. In this context, its aim can be defined as the increase in compatibility between the user's representations and those of the machine". Long further claims that CE shows "how the system designer can be provided with information which will help improve cognitive compatibility" and supports "the practical aim of helping designers to produce better systems, and the use of theoretical structures and empirical methods to achieve this end".

In this paper, the above general characterization of CE, especially as concerns its shortcomings, is considered to be the past, with respect to which present lessons are assumed to have been learned (criticizing one's own characterization is less invidious, for the purpose of identifying lessons required, than criticizing the characterization of others, however superior the latter may be in other respects). Lessons still remaining, if learned, would constitute one possible future for CE. Finally, a number of general issues, concerning CE and USI, raised by this paper, are identified. Since IPO is in the process of extending its traditional approach, to human perception and cognition, to include user-system interaction, with much the same aim as CE's original attempt, it may be instructive to consider lessons learned and remaining as possible challenges for USI. This paper is intended to facilitate such consideration. To that end, possible challenges for USI are identified for all CE lessons. This paper is structured, following the discipline framework proposed by Long and Dowell (1989), as the "acquisition and validation by research of knowledge supporting practices, which solve the general problem, having a particular scope". When applied to CE and USI for present purposes, the framework produces a structure with the following sections: Particular Scope; General Problem; Practices; Knowledge; and Discipline. Each section successively addresses CE past, in terms of Long (1987), CE present, in terms of lessons learned, and CE future, in terms of lessons remaining. Both lessons learned and lessons remaining constitute possible USI challenges (UC). To make the challenges more concrete, they are illustrated in terms of e-commerce, a domain of current interest to IPO.

Particular scope of Cognitive Ergonomics

References are made to three different, but related, particular scopes (that is, concerns) for CE in Long (1987). First, with reference to the discipline, CE is about: people; their work; well-being; and productivity ... "the emphasis being on the person doing the work and the manner in which it is carried out, rather than on the technology or on the environment" (per se, that is). Second, with reference to humans interacting with computers, as a new technology-driven phenomenon, the CE scope is: "agent(s); instrument; functions; entities; and location". Last, with reference to difficulties experienced by users of computers, CE is about: domain of application; task; and computer interface. The overlap between the particular scopes is obvious, for example, technology, instrument, and computer interface. However, the scopes do not appear equally complete; for example, only the first scope makes reference to well-being and productivity. Two main lessons have been learned.

LL1: CE scope is singular.

That is, CE has a single scope and not a number of different scopes. Multiple particular scopes, or descriptions thereof, as proposed by Long, would require specific justification, since each scope would require a separate mapping to CE knowledge and practices, as well as to one another, so militating against coherence and parsimony. Without a single scope, CE's efforts would be fragmented and so less effective.

UC1: To establish a single USI scope, as required.

The challenge here for USI is that whatever its particular scope or concerns, for example, users, systems, interaction, etc., there should be only a single scope (in the absence of a rationale for multiple scopes). That is, the IPO research programmes should have the same USI scope.
For example, if IAP has the particular scope of users interacting with systems, as concerns information access and presentation, then there would exist no single scope with MMI, if the latter were limited to only multimodal perception of systems.

In the case of e-commerce, for example, the auditory-haptic (MMI) perception of goods for sale, by means of virtual reality, would not relate to selection behaviours between goods (IAP). Establishing such a relationship is a challenge for any integrated USI knowledge of e-commerce as a whole.

LL2: CE particular scope is: users; technology; work; and performance.

CE has always included user(s) and technology in its particular scope. However, except in process control, work has not always been considered separately from the user behaviours required to carry it out, nor performance as more general than user or human performance. There is wide consensus within CE currently for such a particular scope (Diaper, 1989; Dowell & Long, 1989; Barnard, 1991; Green & Hoc, 1991; Hollnagel, 1991; Newman, 1994; Storrs, 1994; Carroll, 1995; Long, 1996; Dowell & Long, 1998; Flach, 1998; Vicente, 1998; Woods, 1998).

UC2: To establish a particular USI scope, as required.

The particular USI scope at the highest level of description obviously includes: users; systems; and their interaction. The USI challenge here is whether additional concepts are required and, if so, which ones. Should the concept of 'shopping experience', so dear to e-commerce providers, for example, as pleasure or entertainment, as well as the conduct of transactions, be included in USI? If shopping experience is included, should it be in one or all research programmes? If in all, how could UCD integrate the concept for its purposes? In addition, the challenge is to demonstrate that the particular scope is: complete; coherent; and fit-for-purpose, whatever that may be, for example, to understand, design, develop, etc., e-commerce services. Two main lessons remain to be learned (LR1 and LR2 below).

LR1: CE particular scope needs to relate user(s) and technology together, as an interactive worksystem.
Performance of work, involving user(s) and technology, cannot ultimately be a function of either alone, but only of both together. Technology here is essentially a tool. All too often, however, performance is not expressed jointly; for example, 'human performance' in terms of errors fails to express overall performance. Performance, then, must be considered a function of the interactive worksystem. User(s) and technology must be considered together as a worksystem.

UC3: To partition the USI particular scope, as required.

The challenge is to establish the appropriate relations between the concepts comprising the USI particular scope. In the event, no partitioning may be required. However, the absence of partitioning needs to be established and justified. In e-commerce, for example, is it sufficient to consider the user's buying behaviours and those of the electronic shopping cart representation as simply 'interacting' (as an IAP concern, for example), rather than as achieving jointly the user's shopping goals? In the latter case, the challenge would be how to distribute the joint achievement over the IPO research programmes (IAP, MMI and SLI) and how it could be integrated by UCD for its purposes.

LR2: CE particular scope needs to distinguish performance as task quality and as interactive worksystem costs.

Task quality expresses how well the task is performed by the worksystem. Worksystem costs express the workload incurred (by the user(s) and technology) in performing the task that well. The distinction between task quality and worksystem costs is one required by CE design practices and supported by CE design knowledge (see later).

UC4: To establish performance as part of the particular USI scope, as required.

There are many expressions of performance: effectiveness; efficiency; usability; safety; productivity; comfort; functionality; acceptability, etc. The challenge for USI here is to establish a concept of performance suitable for its own purposes (however these purposes may be specified). In e-commerce, for example, does the effectiveness of shopping include the pleasure or entertainment of the shopping experience? If so, is pleasure or entertainment the purview of a particular research programme, for example MMI, or more general? In either case, the challenge is how UCD can conceptualize, operationalize, and metricate shopping performance. In order to facilitate consideration of the relationship between these remaining CE lessons and their possible challenges for USI, the concepts implicated in the former are illustrated schematically in Figure 1.

Figure 1: Particular scope of Cognitive Ergonomics: Worksystem (including user(s) and technology, each incurring costs), which, directed by goals, performs work at some level of (worksystem) costs for some level of (task) quality.
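The distinction between task quality and worksystem costs can be made concrete with a small sketch. The dataclass below is one illustrative reading of the particular scope: performance as the pair of task quality and (user plus technology) costs, rather than a single number. The field names are assumptions, not an IPO definition.

```python
from dataclasses import dataclass

@dataclass
class Performance:
    task_quality: float   # how well the worksystem performs the work (0-1)
    user_costs: float     # workload incurred by the user(s)
    tech_costs: float     # resource costs incurred by the technology

    def worksystem_costs(self) -> float:
        """Total costs incurred by the interactive worksystem."""
        return self.user_costs + self.tech_costs

# Illustrative values only.
p = Performance(task_quality=0.8, user_costs=0.3, tech_costs=0.1)
print(p.worksystem_costs())
```

Keeping quality and costs as separate components, instead of collapsing them, is precisely what the lesson LR2 above asks of the particular scope.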

General discipline problem of Cognitive Ergonomics

The general problem of CE is the provision of advice to designers "to produce better systems ... more usable systems" (Long, 1987). In addition: "the system designer can be provided with information ... to inform designers ..." etc. The shortcoming here is the restriction of the general problem to advising designers, rather than more generally to the direct support of design itself. Two main lessons have been learned.

LL3: CE general discipline problem is design.

Design here is to be understood widely and to include the practices of design, evaluation, and iteration, etc. (see later). The problem is not simply the provision of advice, but the need for both CE design knowledge and design practices to support design directly.

UC5: To establish a general USI discipline problem, as required.

If perception and cognition research is about understanding these two types of human, mental phenomena, then their extension to USI implicates the understanding of USI as expressed by its particular scope. For example, in e-commerce USI might seek to understand preference and selection buying behaviours of shoppers, when using the technology, without any direct or indirect concern for the design of e-commerce services. In the latter case, visual perception (as an IAP topic) needs only to be extended to include systems (rather than the task environment more generally). The challenge is how the extension would inform UCD.

LL4: CE design has the particular scope of: user(s); technology; work; and performance.

In other words, the general CE problem of design requires the specification of: user(s); technology; work; and performance. It is not just a matter of advising designers about, for example, the user interface. There is a general consensus for design and its relation to the particular scope of CE (Benyon, 1998; Dowell & Long, 1998; Green, 1998; Hockey & Western, 1998; Hollnagel, 1998; Reason, 1998; Salvendy, 1998).

UC6: To establish the relation of the particular scope of USI to design, as required.

If there is no relation between USI, for example, as understanding or design, then there is no challenge for the USI technology programmes IAP, MMI, and SLI. However, if USI establishes direct or indirect relations with design, then it becomes a challenge to establish the relations between the particular scope of USI and the particular scope of design, especially as it concerns UCD. For example, in e-commerce, given some relation between USI understanding of shoppers' preferences and buying behaviours and design, the challenge is to operationalize the relationship in e-commerce and to validate it with respect to the design process. The challenge is both to UCD and the technology programmes. Two main lessons remain to be learned.

LR3: CE general problem needs to express design, as design problems and design solutions.

It is not sufficient to express design in terms of the CE particular scope of: user(s); technology; work; and performance. For the sake of completeness, the CE general problem needs to embody both the start-point of design, the design problem, and the end-point of design, the design solution. How otherwise would design start and stop by intent?

UC7: To express the USI general problem as problems and solutions, as required.

If the USI general problem is understanding, the challenge is to establish what constitutes a problem for understanding, and a solution that counts as providing that understanding.
It is unclear how UCD might respond to such a challenge, but it is instructive to enquire. Likewise, with USI as design or design-related. For example, in e-commerce, what counts as a USI research problem and solution? Depending on the general problem, a particular problem might be the shopping interaction itself, or the effectiveness thereof, for example, how the goods are perceived (MMI) or why different shopping cart representations are associated with different levels of user workload (IAP). In both cases, the challenge would be some integration with UCD practices.

LR4: CE general problem needs to express design problems and design solutions, in terms of the interactive worksystem and performance.

It is not just a question of expressing design in terms of the particular scope, that is: user(s); technology; work; and performance. The CE general problem needs, in addition, to express the relationship between the (desired) task quality of the interactive worksystem and its associated (acceptable) costs. It is this relationship which constitutes both the target and the achievement of design.

UC8: To express the USI general problem as problems and solutions, in terms of a partitioned particular scope, as required.

The challenge here is to bring together the USI general problem, whether to understand, to design, or to develop, etc., expressed as problems and solutions of the appropriate kind, with the partitioning (if required) of the particular scope. For example, in e-commerce research, the understanding of how users of the technology effect transactions requires the problem of joint goal achievement to be addressed within some partitioning of the user and the technology. If such understanding is provided by IAP, MMI, or SLI, the

challenge is to integrate the understanding with UCD practices. The concepts implicated in these remaining CE lessons are illustrated schematically in Figure 2 to facilitate their consideration as possible USI challenges.

worksystem (Pa ≠ Pd) -- transform --> worksystem (Pa = Pd)

Figure 2: General problem of Cognitive Ergonomics: Design a worksystem, whose actual performance (Pa) equals some desired performance (Pd), where P is a function of user(s)' and technology costs and task quality (see Figure 1).
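The general problem in Figure 2 can be read as an iteration: transform the worksystem until its actual performance (Pa) reaches the desired performance (Pd). The sketch below is a deliberately toy rendering of that loop; the half-the-gap "redesign" step and the tolerance are placeholders for real design work and real acceptance criteria, not part of the framework.

```python
def design(pa: float, pd: float, tol: float = 0.01) -> tuple[float, int]:
    """Iterate design until actual performance is within tol of desired."""
    iterations = 0
    while pd - pa > tol:
        pa += 0.5 * (pd - pa)   # placeholder: each design cycle closes half the gap
        iterations += 1
    return pa, iterations

pa, n = design(pa=0.2, pd=0.9)
print(n, round(pa, 3))
```

The point of the sketch is only the stopping condition: design starts from Pa ≠ Pd (the design problem) and stops, by intent, when Pa = Pd within tolerance (the design solution), which is what LR3 and LR4 above require the general problem to express.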

Practices of Cognitive Ergonomics

According to Long (1987), Cognitive Science (CS) (and more generally the Human and Behavioural Sciences) acquires the knowledge that CE applies. Cognitive Science has two primary practices: analysis of human cognitive phenomena, to produce an 'acquisition representation', and generalization of that representation, to produce the general knowledge itself, as required to describe and explain human cognitive phenomena. CE also has two primary practices: particularization of the CS general knowledge, already acquired, to produce an 'applications representation', such as guidelines for the design of user-system technology, etc., and synthesis, as required to optimize "the relationship between people and their work with respect to well-being and productivity". The practices of CS and CE are thus different, the latter applying the knowledge acquired by the former. Two main lessons have been learned.

LL5: CE design practices include design, evaluation, and iteration.

The practice of particularization transforms CS knowledge into guidelines, checklists, etc.; that is, it does not itself contribute directly to design, only indirectly. The practice of synthesis, in contrast, contributes to design directly. The practice is indeed the design, or synthesis, of the new user-system interactions. However, synthesis here identifies no CE practices additional to the notion of design itself. CE practices, then, need to be specified more completely.

UC9: To establish USI design practices, as required.

If the USI general problem is understanding user-system interaction phenomena, then the establishment of USI design practices is not a challenge for IAP, MMI, or SLI. However, it is a challenge in this case how they would relate to UCD (if at all). If the USI general problem involves design, directly or indirectly, then there is a challenge to establish detailed USI design practices, at least by UCD.
For example, if an e-commerce website plans to offer new services, either in terms of the range of goods or of the accessing technology (which might be IAP, MMI, or SLI related), the USI challenge would be to establish the design practices (UCD) which its knowledge is intended to support.

LL6: CE design practices address: user(s); technology; work; and performance.

CE design practices need to support the particular scope of CE. There is general consensus for these design practices and their particular scope (see earlier references).

UC10: To establish the particular scope of USI design practices, as required.

This challenge is to relate USI design practices, presumably the purview of UCD, to the particular USI scope (see earlier). If testing or evaluation is a UCD practice, then all the components of the particular scope need to be included. For example, in e-commerce, if a new website is being evaluated, then the users, system, interactions, effectiveness, etc., must all form part of the evaluation, depending on the particular scope. As suggested earlier, evaluation, in general, might be expected to be the concern of UCD. However, the use of perceptual and cognitive models in the evaluation might be more the concern of IAP, MMI, and SLI, and so joint evaluation would be the challenge. Two main lessons remain to be learned.

LR5: CE practices need to diagnose design problems and to prescribe design solutions.

It is not sufficient to distinguish overall practice as 'design' or 'synthesis', or indeed the differing CE practices of design, evaluation, and iteration. CE practices need to relate specifically to design problems and design solutions (see earlier). Thus, a CE practice is needed to construct design problems, that is, to diagnose, and a CE practice is needed to construct design solutions, that is, to prescribe.

UC11: To establish USI practices, which formulate problems and derive solutions, as required.

In other words, the USI challenge here is to develop practices commensurate with the problems and solutions associated with the general problem (see earlier). For example, in e-commerce, explaining the interaction of the user with the electronic shopping cart (better to understand), or diagnosing the ineffectiveness of the same cart (better to design), requires the relevant problems to be identified and solved. Those practices may involve all or only some combination of the research programmes - IAP, MMI, SLI, and UCD. If the practices of UCD, however, differ from those of the technology programmes, the challenge is their integration.
LR6: CE practices diagnose performance as the design problem and prescribe an interactive worksystem as the design solution.

The lesson proposes a specific relationship between the CE particular scope, the general problem, and detailed CE practices.

UC12: To establish USI practices, which formulate problems and derive solutions in terms of a (partitioned) particular scope, as required.

The challenge requires an explicit relationship to be made between the partitioned USI particular scope, the general problem, and the detailed USI practices. In other words, the detailed USI practices need to relate to the partitioned USI specific scope (see earlier). The USI practices of explaining or diagnosing, etc., need to be related to the partitioned components of the USI particular scope. For example, in e-commerce, explaining or diagnosing practices would need to accommodate any joint achievement by the system and the user of the latter's goals (see earlier). The challenge here would be differing relations, either between technology programmes or between UCD and the technology programmes.

The concepts implicated in these remaining CE lessons appear schematically in Figure 3 to facilitate their consideration as possible USI challenges.


Figure 3: Practices of Cognitive Ergonomics: Diagnose design problem and prescribe design solution.

Knowledge of Cognitive Ergonomics

According to Long (1987), there are two basic types of knowledge: science and non-science (that is, "experiential knowledge, both craft and personal"). Scientific knowledge is particularized into guidelines and checklists to advise designers, such that they produce more usable systems. These forms of advice are considered to be a CE 'applications representation' of the scientific knowledge, that is, one supporting application of the knowledge. Two main lessons have been learned.

LL7: CE knowledge needs to support practices of design, evaluation, and iteration.

It is unclear why Long's 'application representation' is not a third type of knowledge, along with science and non-science, since it is self-evidently neither of these types. If it is a type of knowledge, then it is not simply 'advice to designers'. Last, design knowledge needs to be related directly to completely specified design practices, such that these practices are supported (and not only generally advised).

UC13: To establish USI knowledge supporting detailed USI practices, as required.

Different practices require different types of knowledge. For example, models and theories support the understanding of perception and cognition. In contrast, design principles and guidelines support the specification of user-system interactions. The challenge here is to acquire IAP, MMI, and SLI knowledge compatible with their own, but also with UCD practices, whatever these practices may be (see earlier). In e-commerce, for example, if the UCD practice is testing or evaluation of users' trust in a shopping service, then an IAP or other model of trust would be required to support the testing or evaluation process.

LL8: CE knowledge needs to reference its particular scope of: users; technology; work; and performance.
In other words, CE design knowledge needs to reference what is designed, but not simply as a description of the interface, of the user's behaviours or of some other individual aspect of the particular scope. For example, if the latter includes some notion of performance, then knowledge in the form of design principles, guidelines, checklists, etc., needs to reference such performance (for consensus, see earlier references).

UC14: To establish USI knowledge, which references its particular scope, as required.

USI theories, whatever their purpose, need to reference the concerns of USI (see earlier). For example, if an IAP or other model of trust in e-commerce explains users' trust behaviours, the model needs to reference all constituents of the USI particular scope - users, systems, interactions, etc. The challenge is the same for applications of MMI and SLI models. Two main lessons remain to be learned.

LR7: CE knowledge, as general design principles, needs to support design practices of diagnosing design problems and prescribing design solutions.

It is not sufficient for CE knowledge to support design indirectly, nor to support directly only general practices. The knowledge must address directly the specific design practices (that is, of diagnosis and prescription). Such CE knowledge is termed 'general design principles'.

UC15: To establish USI knowledge, which addresses USI problems and solutions, as required.

The challenge is for USI knowledge, as theory, guidelines, etc., to be commensurate with the practices required to formulate and solve the problems. For example, if IAP, MMI, and SLI knowledge is integrated into a theory of trust, what problems of understanding trust does it solve? The challenge would be to explain the risk-taking behaviours of shoppers, or how trust builds up over time, or how trust varies as a function of interaction modality.

LR8: CE knowledge, as general design principles, needs to support diagnosis of performance, as the design problem, and prescription of the interactive worksystem, as the design solution.

Again, this remaining lesson emphasizes the specific relationship required between CE knowledge and CE practices, as well as between these constituents and the CE general problem and particular scope.

UC16: To establish USI knowledge, which supports the formulation of problems and the derivation of solutions, in terms of a (partitioned) particular scope, as required.

The challenge here is to bring together USI knowledge, practices, general problem, and (partitioned) particular scope. For example, in e-commerce, any IAP, MMI, SLI joint or individual theories of trust, between the users and the system, would need to support related practices (both theirs and UCD's) in their address of a (partitioned) particular scope of the joint achievement of the user's shopping goals (see earlier).

The concepts implicated in these remaining CE lessons appear in Figure 4 to aid their consideration as possible USI challenges.


Figure 4: Knowledge of Cognitive Ergonomics: General design principle expressing the relationship between diagnosis of a class of design problems (see Figure 2) and prescription of a class of design solutions (see Figure 3).

Discipline of Cognitive Ergonomics

"The most general definition of Cognitive Ergonomics is thus the application of Cognitive Psychology to work" (Long, 1987). In addition, "Cognitive Ergonomics is a configuration relating work to science". Last, "Its aim is the increase of compatibility between the user's representations and those of the machine". CE particularizes scientific knowledge into application representations, which are then synthesized "to optimize the relationship between people and their work ... with respect to well-being and productivity". CE, then, is conceived as an applied science discipline. Two main lessons have been learned.

LL9: CE is a design discipline.

As such, it is essentially an engineering, rather than a science, discipline. Engineering, here, may be informal (craft) or formal and may be related to science.

UC17: To establish USI as a discipline, as required.

USI may, of course, be a discipline, a sub-discipline or a hybrid discipline. It may develop into a scientific discipline, adding system interaction to perception and cognition, or an engineering (design) discipline, supporting the development of services and technology. The challenge here is to specify USI's discipline aspects. For example, the USI discipline could simply seek to understand the phenomena of e-commerce or to develop e-commerce systems (or even some combination). In either case, the challenge is to relate IAP, MMI, and SLI to UCD within a discipline.

LL10: CE, as a design discipline, needs design knowledge to support design practices of design, evaluation and iteration, solving the general problem, having the particular scope of user(s); technology; work; and performance.

In other words, CE, as a discipline, needs to instantiate as appropriate all the requirements of a discipline (see Long & Dowell, 1989), just like other disciplines (for consensus, see earlier references).
UC18: To establish USI, as a discipline, having knowledge to support detailed practices, solving the USI general design problem, having a particular (partitioned) scope, as required.

The challenge here is for USI to instantiate all the discipline constituents. In e-commerce, knowledge of shopping (IAP, MMI, and SLI) supports practices (of UCD), such as testing or evaluation, as they relate to transaction problems and solutions, involving users, systems, interactions, etc.

LR9: CE needs to validate its design knowledge as design principles, and design practices by solving design problems.

In other words, having the constituent elements of a discipline is not enough. The elements need to be validated to ensure the status of the knowledge as formal design principles with respect to the practices, etc. Further, the (scientific) validation of any associated scientific knowledge does not by itself constitute validation of the CE knowledge, practices, design principles, etc. Validation needs to be formal (rather than informal) to make CE a formal engineering design discipline (rather than a craft one).

UC19: To establish validation of USI knowledge supporting USI practices, addressing USI problems and solutions, as required.

The challenge here is to demonstrate that USI knowledge and practices are fit for their declared purposes. In other words, the (generic) status of the knowledge needs to be demonstrated in practice. If USI knowledge (IAP, MMI, or SLI) is to explain phenomena,

it needs to predict new phenomena, for example, of shopping trust. If USI knowledge (all programmes) is to develop systems, the challenge, for example, is to prescribe advances in technology, etc.

LR10: CE, as an engineering design discipline, needs design knowledge to support design practices of diagnosing design problems and prescribing design solutions, so solving the general problem having the particular scope of an interactive worksystem (user(s) and technology) with desired performance, expressed as task quality and worksystem costs.

This remaining lesson emphasizes the intimate relations, that is, the coherence of the CE discipline characterization. Such coherence is notably missing in Long (1987) and, to a lesser extent, in the lessons learned.

UC20: To establish USI, as a discipline with knowledge, supporting detailed practices of formulating problems and deriving solutions, solving the general USI problem having a particular (partitioned) scope, as required.

In other words, USI can develop in many different ways. Whichever the way, the discipline constituents need to be: complete; coherent; and fit-for-purpose. The challenge is for the particular scope of e-commerce to be commensurate with USI e-commerce knowledge (IAP, MMI, and SLI) and with the practices of all programmes, including UCD.

The concepts implicated in these remaining CE lessons are shown schematically in Figure 5 to support their consideration as possible USI challenges.


Figure 5: Discipline of Cognitive Ergonomics (CE): Showing its informal relations (----) with the discipline of Cognitive Science (CS).

Additional issues

During the course of preparing this paper, a number of general issues were identified. They are briefly referenced here for the sake of completeness. The main concern is with any impact on CE lessons as possible USI challenges.

Cognitive Ergonomics and Ergonomics

No systematic attempt has been made here to relate CE and Ergonomics as concerns the past, lessons learned, and lessons remaining. It is possible that these lessons apply to both the discipline and the sub-discipline. Such application is not certain, but it would be expected that the CE and Ergonomics lessons, learned and remaining, would nevertheless overlap considerably. Hence, many of the implications for possible USI challenges would remain unchanged if Ergonomics, rather than CE, had been considered the 'past'.

Cognitive Ergonomics and HCI

No attempt has been made here to distinguish the disciplines of CE and HCI (as opposed to the phenomenon of humans interacting with computers). Although the professional pragmatics of the two disciplines may be different, in practice, and for many, the two terms are interchangeable. For example, it would be difficult, if not impossible, to distinguish the CE, as opposed to the HCI, contributions to consensus concerning the discipline constituents (see references). As concerns possible USI challenges, detailing CE rather than HCI lessons would not be expected to change the implications radically. Most of the CE lessons could also have been inferred from the development of HCI.

Cognitive Ergonomics and Cognitive Engineering

No attempt has been made here to distinguish CE and Cognitive Engineering. Similar comments are in order as for CE and HCI. That is to say, the implications for possible USI challenges are not likely to be very different.

Cognitive Ergonomics and Engineering

CE never operates in a vacuum. It is usually associated with some form of engineering, for example, mechanical, electronic, software, etc., depending on the domain of application. Lessons learned and remaining are almost certainly not independent of the associated types of engineering. If USI emerges as needing to relate to an engineering discipline, similar implications are likely to hold for possible USI challenges; otherwise not.

Conclusion

This paper has attempted to consider the past, present, and future of CE, respectively, in terms of Long's 1987 paper, some lessons learned since its publication, and some lessons which still remain to be learned. The lessons indicate a considerable extension of CE as a discipline and in terms of its constituents: particular scope; general problem; practices; and knowledge. Each CE lesson, learned or remaining, has been identified with a possible USI challenge. In this exciting time, when IPO is similarly attempting to extend its traditional research, in perception and cognition, to include user-system interaction, it is hoped that the CE lessons identified as challenges will be helpful and instructive for the support of such USI development.

Acknowledgements

This paper could not have been written without the extensive discussions I have had with my IPO colleagues, during my professorial visits, and in particular Matthias Rauterberg. I thank them for the lessons I have learned. Lessons remaining are still on the agenda.

References

Barnard, P. (1991). Bridging between basic theories and the artefacts of human-computer interaction. In: J.M. Carroll (Ed.): Designing Interaction, 103-127. Cambridge: Cambridge University Press.
Benyon, D. (1998). Cognitive engineering as the development of information spaces. Ergonomics, 41, 153-155.
Card, S.K., Moran, T.P. & Newell, A. (1983). The Psychology of Human-Computer Interaction. Hillsdale, NJ: Erlbaum.
Carroll, J.M. (1995). Scenario-Based Design. New York: John Wiley & Sons.
Diaper, D. (1989). The discipline of HCI. Interacting with Computers, 10, 3-65.
Dowell, J. & Long, J. (1989). Towards a conception for an engineering discipline of human factors. Ergonomics, 32, 1513-1535.
Dowell, J. & Long, J. (1998). Target paper: Conception of the cognitive engineering design problem. Ergonomics, 41, 126-139.
Flach, J.M. (1998). Cognitive systems engineering: Putting things in context. Ergonomics, 41, 163-167.
Green, T.R.G. (1998). A conception of a conception. Ergonomics, 41, 143-146.
Green, T.R.G. & Hoc, J.-M. (1991). What is cognitive ergonomics? Le Travail Humain, 54, 291-304.
Hockey, G.R.J. & Western, S.J. (1998). Advancing human factors involvement in engineering design: A bridge not far enough? Ergonomics, 41, 147-149.
Hollnagel, E. (1991). Cognitive ergonomics and the reliability of cognition. Le Travail Humain, 54, 305-321.
Hollnagel, E. (1998). Comment on 'Conception of the cognitive engineering design problem' by John Dowell and John Long. Ergonomics, 41, 160-162.
Long, J. (1987). Cognitive ergonomics and human-computer interaction. In: P.B. Warr (Ed.): Psychology at Work (3rd rev. ed.), 73-95. Harmondsworth, UK: Penguin.
Long, J. (1996). Specifying relations between research and the design of human-computer interactions. International Journal of Human-Computer Studies, 44, 875-920.
Long, J. & Dowell, J. (1989). Conceptions for the discipline of HCI: Craft, applied science, and engineering. In: A. Sutcliffe and L. Macaulay (Eds.): People and Computers V: Proceedings of the 5th Conference of the British Computer Society, Human-Computer Interaction Specialist Group, BCS HCI SG, Nottingham, UK, September 5-8, 1989, 9-32.
Newman, W. (1994). A preliminary analysis of the products of HCI research, using pro forma abstracts. In: B. Adelson, S. Dumais and J. Olson (Eds.): Proceedings of the Conference on Human Factors in Computing Systems, CHI-94, Boston, MA, USA, April 24-28, 1994, 278-284.
Reason, J. (1998). Broadening the cognitive engineering horizons: More engineering, less cognition and no philosophy of science, please. Ergonomics, 41, 150-152.
Salvendy, G. (1998). A response to John Dowell and John Long, 'Conception of the cognitive engineering design problem'. Ergonomics, 41, 140-142.
Storrs, G. (1994). A conceptualization of multi-party interaction. Interacting with Computers, 6, 173-189.
Vicente, K. (1998). An evolutionary perspective on the growth of cognitive engineering: The Risø genotype. Ergonomics, 41, 156-159.
Woods, D.D. (1998). Designs are hypotheses about how artefacts shape cognition and collaboration. Ergonomics, 41, 168-173.

Measuring presence: An overview of assessment methodologies

W.A. IJsselsteijn, H. de Ridder¹ and J. Freeman²
e-mail: [email protected]

Abstract

There is a clear need for a reliable, robust and valid methodology for measuring presence, defined as the sense of 'being there' in a mediated environment. In this paper we describe a number of different approaches taken to measuring presence, divided into two general categories: subjective measures and objective corroborative measures. Since presence is a subjective experience, the most direct way to assess it is through users' subjective reports. This approach has serious limitations, however, and should be used judiciously. Objective measures, such as postural, physiological, or social responses to media, can be used to corroborate subjective measures, thereby overcoming some of their limitations. At present, the most promising direction for presence measurement is to develop and use an aggregate measure of presence that comprises both subjective and objective components, tailored to the specific medium under study.

Introduction

"When the curtain swept up to reveal the now-legendary wide-screen roller coaster ride, I realized that the film's creators were no longer content to have me look at the roller coaster but were trying to put me physically on the ride. The audience no longer surrounded the work of art; the work of art surrounded the audience - just as reality surrounds us. The spectator was invited to plunge into another world. We no longer needed the device of identifying with a character on the other side of the 'window'. We could step through it and be a part of the action!" - Morton Heilig commenting on his experience with Cinerama in New York, 1952.

The concept of presence

As Heilig's quote illustrates, there has long been a tendency to reproduce reality with increasing levels of fidelity, especially in the arts and cinema. The psychological effect that Heilig described is now more commonly known as presence, i.e., the sense of 'being there' in a mediated environment. This user experience is of particular interest to us today, since the current pace of technological development in networks, computing power and displays, as well as improvements in human-computer interfaces, is making it increasingly possible to create services that are capable of eliciting a sense of presence in the user. The concept of presence has potential relevance for the design and evaluation of a broad range of interactive and non-interactive media, and applications in areas such as training and education, telecommunications, medicine, and entertainment. Since the early 1990s, the subjective sensation of being there in a remote or mediated environment has been studied in relation to various media, most notably virtual environments (VEs).

1. Delft University of Technology, Department of Industrial Design, The Netherlands.
2. Goldsmiths College, University of London, Great Britain.

Two types of presence: Physical and social

In an attempt to unify a set of six different conceptualizations of presence found in the literature, Lombard and Ditton (1997) elegantly define presence as the "perceptual illusion of non-mediation", i.e., the extent to which a person fails to perceive or acknowledge the existence of a medium during a technologically mediated experience. The conceptualizations Lombard and Ditton identified can be grouped into two broad categories: physical and social (Freeman, 1999, p. 19-20). The physical category refers to the sense of being physically located somewhere, whereas the social category refers to the feeling of being together (and communicating) with someone. The question may be raised whether it is fruitful to group these two distinct categories under one unifying definition, since a number of aspects of communication that are central to social presence are unnecessary to establish a sense of physical presence. Indeed, a medium can provide a high degree of physical presence without having the capacity to transmit reciprocal communicative signals at all. Conversely, one can experience a certain amount of social presence, or the 'nearness' of communicative partners, using applications that supply only a minimal physical representation, as is the case, for example, with the telephone or internet chatrooms. This indicates that social and physical presence can be meaningfully distinguished. This is not to say, however, that the two are unrelated. There are likely to be a number of common determinants, such as the immediacy of the interaction, which are relevant to both social and physical presence. In fact, applications such as videoconferencing or shared virtual environments are based on providing a mix of both the physical and social components³.
The extent to which shared space adds to the social component is an empirical question, but it seems likely that as technology increasingly conveys non-verbal communicative cues, such as gaze direction or posture, social presence will increase. There is evidence suggesting that this indeed may be the case (Short, Williams & Christie, 1976; Mühlbach, Böcker & Prussog, 1995).

Determinants of presence

Systematic research into the causes and effects of presence has only started recently, but a large number of factors that may potentially influence the sense of presence have already been suggested in the literature. Although the terminology used tends to vary across authors, there appears to be broad agreement on the major concepts. Based on analyses by Sheridan (1992), Steuer (1995), Zeltzer (1992), Held and Durlach (1992), Ellis (1996), and Barfield, Zeltzer, Sheridan and Slater (1995), the factors thought to underlie presence include the following:

(i) The extent and fidelity of sensory information - this is the amount of useful and salient sensory information presented in a consistent manner to the appropriate senses of the user. This includes Steuer's (1995) notion of 'vividness', i.e., the ability of a technology to produce a sensorially rich mediated environment. Note that this category can apply to both interactive and non-interactive media. Examples from this category are monocular and binocular cues to spatial layout, resolution, field of view, or spatialized audio.

(ii) The match between sensors and display - this refers to the sensory-motor contingencies, i.e., the mapping between the user's actions and the perceptible spatio-temporal effects of those actions. For example, using head tracking, a turn of the user's head should result in a corresponding real-time update of the visual and auditory display.

(iii) Content factors - this is a very broad category including the objects, actors, and events represented by the medium. Our ability to interact with the content and to modify it, as identified by Sheridan (1992), is also likely to be important for presence. Other content factors include the user's representation or virtual body in the VE (Slater & Usoh, 1993), and the autonomy of the environment, i.e., the extent to which objects and actors (e.g., agents) exhibit a range of autonomous behaviours (Zeltzer, 1992). Social elements, such as the acknowledgement of the user through the reactions of other actors, virtual or real, are important for establishing a sense of social presence. The nature of the potential task or activity and the meaningfulness of the content have been suggested to play a role as well.

(iv) User characteristics are likely to play a significant role, but have received little attention thus far. Such characteristics include the user's perceptual, cognitive, and motor abilities (e.g., stereoscopic acuity, susceptibility to motion sickness, concentration), prior experience with and expectations towards mediated experiences, and a willingness to suspend disbelief. Allocating sufficient attentional resources to the mediated environment has also been proposed as an important component of presence (Barfield & Weghorst, 1993; Witmer & Singer, 1998; Draper, Kaber & Usher, 1998).

3. Although the terminology used by various authors tends to vary, the term 'co-presence' has been suggested to refer to the mix of social and physical presence, i.e., a sense of 'being there together'.

Factors (i) and (ii) may be regarded as media form variables that are aimed at making the medium as transparent as possible, thereby creating the illusion of non-mediation (Lombard & Ditton, 1997). To create and sustain this illusion, distractions and negative cues to presence should be avoided. An awkward interface stresses the mediated nature of the experience and may diminish the sense of presence. Examples of such negative cues include bad stereoscopic alignment (causing eye strain), coding distortions in the image (e.g., visible blockiness or noise), the weight of a head-mounted display, process interruptions (e.g., 'new mail has arrived', malfunctions, or error notices), noticeable tracking lags, etc. Furthermore, distractions that draw the user's attention away from the mediated environment to the real world (e.g., a telephone ringing) are likely to diminish the user's sense of presence as well (as is predicted by Draper, Kaber and Usher's (1998) attentional resource model of presence). Generally speaking, a seamless continuity between the real and the mediated environment is likely to add to a more convincing illusion of non-mediation. An example of such continuity is presented in the KidsRoom (Bobick, Intille, Davis, Baird, Pinhanez, Campbell, Ivanov, Schütte & Wilson, 1999), a room where real-world items such as posters and windows are combined with large back-projection walls to create an interactive narrative playspace for children.

Measuring presence

Scientific research into presence is in its infancy. At present, there is no generally accepted theory of presence. Technology has only recently developed to a level that enables and motivates the systematic exploration of presence measures. A single, accepted paradigm for the assessment of presence is still lacking and, perhaps as a consequence, a variety of presence measures have been proposed.

A reliable, robust, and useful measure of presence will provide display, interface, and content developers with a tool to evaluate media within a user-centred design approach, by allowing them to identify and test those factors that may produce an optimal level of presence for the user. As Ellis (1996) has argued, a valid and stable presence measure allows for equivalence classes to be established, i.e., maintaining the same level of measured presence whilst trading off contributing factors against each other. In addition, a good measure of presence helps human factors specialists to investigate the relationship between presence and task performance, and may aid our general understanding of the experience of presence in the real world.

The different approaches taken to measuring presence can be divided into two general categories: subjective measures and objective corroborative measures.

Subjective measures of presence

Post-test rating scales

Since presence is primarily a subjective sensation, it has been argued that 'subjective report is the essential basic measurement' (Sheridan, 1992). Indeed, the majority of studies measure presence through post-test questionnaires and rating scales, which have the advantage that they do not disrupt the media experience and are easy to administer. For instance, Slater, Usoh and Steed (1994) asked participants, after exposing them to a VE, to answer three questions on Likert scales (1-7) that served as an indication of presence: (a) their sense of 'being there' in the computer-generated world, (b) the extent to which there were times when the experience of the computer-generated world became the dominant reality for the participants, thereby forgetting about the 'real world' outside, and (c) whether they remembered the computer-generated world as 'something they had seen' or as 'somewhere they had visited'. According to Slater (1999), this 'experiencing-as-a-place' is central to understanding presence in VEs: people are there, they respond to what is there, and they remember it as a place.

Witmer and Singer (1998) developed a presence questionnaire (PQ), with questions on four major categories of theoretical presence determinants: control factors, sensory factors, distraction factors, and realism factors. A cluster analysis on the combined data of four experiments revealed a measure of presence along three subscales: (a) involvement/control, (b) naturalness, and (c) interface quality. Items on the first subscale referred to the perceived control over and responsiveness of the VE, as well as the participant's involvement in it. Naturalness items referred to the extent to which the interaction and the participant's movement within the VE felt natural, as well as the extent to which the VE was consistent with reality.
Interface quality items addressed whether control and display devices interfered with or distracted people from task performance. As Slater (1999)

notes, a problem with Witmer and Singer's PQ is that the questions intrinsically confound individual characteristics and properties of the VE, with no way of separating them out. Although rating scales are generally effective, they need to be used with great care, as results obtained through subjective report are potentially unstable (Ellis, 1996; Freeman, Avons, Pearson & IJsselsteijn, 1999). Several research groups are currently using factor analysis to construct reliable and valid presence questionnaires that are applicable to both interactive and non-interactive media.
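To make the arithmetic of such post-test ratings concrete, the sketch below aggregates three 1-7 Likert items into a single score per participant. Both scoring rules shown (a simple mean, and a count of high responses) and all data are hypothetical illustrations, not the exact procedures of the studies cited above.

```python
# Illustrative (hypothetical) aggregation of post-test presence ratings.
# Each participant answers three questions on a 1-7 Likert scale, e.g.:
# (a) sense of 'being there', (b) VE as dominant reality,
# (c) remembered as a place visited.

def presence_score(ratings):
    """Mean of the three Likert items (1-7); a simple illustrative rule."""
    if len(ratings) != 3 or not all(1 <= r <= 7 for r in ratings):
        raise ValueError("expected three ratings in the range 1-7")
    return sum(ratings) / 3.0

def high_response_count(ratings, threshold=6):
    """Alternative rule: count of items rated at or above `threshold`."""
    return sum(1 for r in ratings if r >= threshold)

# Hypothetical data for four participants.
participants = {
    "p1": (6, 7, 6),
    "p2": (3, 2, 4),
    "p3": (5, 6, 7),
    "p4": (1, 1, 2),
}

for name, ratings in participants.items():
    print(name, round(presence_score(ratings), 2), high_response_count(ratings))
```

Either rule yields one number per participant that can then be compared across experimental conditions; which rule is appropriate depends on the questionnaire and analysis at hand.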

Continuous presence assessment
One important drawback of discrete post-test subjective ratings is that they do not provide any measure of temporal variations in presence. Such variations are likely to occur through changes in the stimulus (i.e., both in form and content) or the participant (e.g., prolonged exposure may have an impact on presence through perceptual-motor adaptation, boredom, fatigue, etc.) and would be lost in an overall post-test presence rating. In addition, discrete subjective measures are potentially subject to inaccurate recall and anchoring effects. To overcome these limitations, the method of continuous presence assessment has been applied (IJsselsteijn, De Ridder, Hamberg, Bouwhuis & Freeman, 1998; IJsselsteijn & De Ridder, 1998; Freeman et al., 1999). This method, originally developed to measure TV picture quality (De Ridder & Hamberg, 1997), requires subjects to make an on-line judgment of presence using a hand-held slider. A computer subsequently samples the position of the slider at a constant rate. When applied to non-interactive stereoscopic media, it was found that presence ratings were subject to considerable temporal variation depending on the extent of sensory information present in the stimulus material (IJsselsteijn et al., 1998). A potential criticism that may be raised against continuous presence assessment is that subjects are in fact required to divide their attention between the display and the on-line rating they are asked to provide, thereby possibly disrupting the presence experience. However, observers are unlikely to report on a belief of being in the displayed scene, since they are usually well aware of actually being in a media laboratory. Rather, they report on a sensation of being there that approximates what they would experience if they were really there.
This does not necessarily conflict with providing a continuous rating, especially given the fact that the measurement device requires very little attention or effort to operate. An additional criticism, which is relevant to both continuous and discrete subjective post-test ratings, is the need to consciously reflect on the amount of presence experienced in order to give a reliable rating.4 This criticism underlines the need for objective measures to corroborate subjective measures, which is discussed later in this paper. At present, continuous presence assessment is mainly applicable to non-interactive media, since using a measurement device, such as a hand-held slider, may interfere with operating an interaction device. An alternative option for measuring temporal variations in such a case would be to use continuous verbal reporting.

4. Taking only one measurement with a post-test technique using independent groups may stop participants reflecting on their presence during the media experience. However, such a between-subjects design typically requires more subjects than a within-subjects design to obtain a similar level of sensitivity.
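The continuous assessment procedure described above lends itself to a simple computational sketch. The snippet below is purely illustrative (the sampling rate, slider range, and trace values are hypothetical, not taken from the cited studies); it shows how a sampled slider trace preserves the temporal variation that a single post-test rating would average away.

```python
# Illustrative sketch of continuous presence assessment (hypothetical data).
# A slider position in [0, 100] is sampled at a constant rate; the resulting
# trace retains temporal variation lost in a single post-test rating.

def summarize_trace(samples):
    """Reduce a sampled presence trace to simple summary statistics."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return {"mean": mean, "sd": variance ** 0.5,
            "min": min(samples), "max": max(samples)}

# Hypothetical 1 Hz trace: presence rises when richer sensory information
# (e.g., stereoscopic depth) appears mid-sequence and falls again afterwards.
trace = [40, 42, 45, 70, 75, 72, 68, 50, 46, 44]

stats = summarize_trace(trace)
print(stats)  # a post-test rating would collapse this trace to one number
```

The spread between the minimum and maximum of the trace is exactly the kind of information a discrete post-test rating cannot capture.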

Psychophysical methods
Various psychophysical methods for measuring presence have been proposed to date (e.g., Barfield et al., 1995; Sheridan, 1996; Stanney & Salvendy, 1998), but very little empirical research is currently available that employs this well-established approach. Snow and Williges (1998) used free-modulus magnitude estimation to investigate the effects of a wide variety of VE system parameters on presence. With this method, originally proposed by Stevens (1971), participants are presented with a series of stimuli and asked to assign a number to each stimulus corresponding to the strength of their subjective sensation. In free-modulus magnitude estimation, the observer is asked to assign any value to the first stimulus and subsequently assign values accordingly. Cross-modality matching can be regarded as a variation on magnitude estimation that has special applicability to constructs that do not lend themselves easily to verbal scaling. This method requires a participant to express a judgment of a subjective sensation in one modality by adjusting a parameter in a different modality, e.g., 'Make this sound as intense as the strength of presence you experienced in this VE' (Welch, 1997). A final psychophysical method we would like to mention here is the method of paired comparisons, which, in the context of VEs, has sometimes been referred to as the 'virtual reality Turing test'. Here the user is asked to distinguish between a virtual (or remote) and a real scene. Schloerb (1995) suggested that the probability that a human operator perceives that he or she is physically present in a given remote environment can serve as a subjective measure of presence. Since subjects are unlikely to confuse the real environment with the virtual one, Schloerb proposed a number of perceptual limitations (or filters) to be imposed on the subject during the test.
For example, both the real and virtual environments have to be viewed with a limited visual field, with reduced contrast or colour, without sound, etc. Following Schloerb's (1995) proposal, Sheridan (1996) suggested taking the amount of degradation of the real scene necessary to make it indistinguishable from the virtual one as a measure of presence. A potential criticism that can be raised against this methodology is that it becomes a test of participants' abilities to discriminate between two degraded images rather than a measure of presence.
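Because each observer in free-modulus magnitude estimation chooses an arbitrary modulus, raw numbers are only meaningful as ratios. A common analysis step, sketched below with invented ratings (this is a standard normalization, not necessarily the exact procedure of Snow and Williges, 1998), rescales each observer's estimates by that observer's geometric mean so that estimates become comparable across observers.

```python
import math

# Free-modulus magnitude estimation: each observer picks an arbitrary
# modulus, so only the ratios between an observer's numbers carry meaning.
# Dividing by the observer's geometric mean puts everyone on a common scale.
# The ratings below are invented for illustration.

def normalize(estimates):
    """Divide each estimate by the observer's geometric mean."""
    gmean = math.exp(sum(math.log(e) for e in estimates) / len(estimates))
    return [e / gmean for e in estimates]

observer_a = [10, 20, 40]   # chose a modulus around 10
observer_b = [1, 2, 4]      # same perceived ratios, modulus around 1

print(normalize(observer_a))
print(normalize(observer_b))  # identical to observer_a after normalization
```

After normalization, both observers yield the same scale values because their judged ratios were the same, regardless of the modulus each happened to pick.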

Objective corroborative measures
The search for objective measures to corroborate subjective presence measures is motivated by a number of reasons. First of all, subjective measures require users to have a fair understanding of what is meant by the term 'presence'. Possibly as a consequence of the fact that, at present, most naive users are relatively unfamiliar with the concept of presence, subjective measures have been shown to be potentially unstable, with inconsistencies across different raters and rating situations (Ellis, 1996). In addition, as Freeman et al. (1999) have shown, subjective ratings of presence can be biased by previous experience, e.g., rating a different attribute in a previous session. Secondly, objective measures relate to user responses that are, in general, produced automatically and without much conscious deliberation. This diminishes the likelihood that the subject is responding to the demand characteristics of the experiment. In addition, it circumvents the conflict between sensation ('I am in a VE') versus knowledge ('I am in a psychology lab wearing an HMD'), which seems intrinsic to subjective measures of presence.

In an attempt to develop objective corroborative measures of presence, the behavioural realism approach was proposed, which is based on the premise that as a display better approximates the environment it represents, an observer's responses to stimuli within the display will tend to approximate those which he/she would exhibit in response to the environment itself (Freeman, Avons, Meddis, Pearson & IJsselsteijn, in press). Based on this principle, a variety of objective corroborative measures can be formulated. Several authors have suggested the use of reflexive responses (e.g., ducking in response to a looming virtual object) and socially conditioned responses (e.g., smiling) as objective approaches to measuring presence (Held & Durlach, 1992; Sheridan, 1992; Slater & Wilbur, 1997). Generally speaking, the objective measure that is being used as a presence measure should be tailored to the media experience the user is intended to have. Some authors have suggested task performance as an objective measure of presence (e.g., Schloerb, 1995; Kalawsky, Bee & Nee, 1999). Although such measures may be very useful for the assessment of training efficiency in VEs, the relationship between presence and performance measures remains unclear to date. Although there seems to be some tentative evidence suggesting a positive correlation between presence and task performance (e.g., Witmer & Singer, 1998), Ellis (1996) reported that removing redundant information from air-traffic control displays elevated performance, whereas such a manipulation would also tend to decrease presence. However, one needs to be aware of the possibility that reducing realism (by removing redundant information) may in fact enable participants to better select a useful viewpoint. This implies that the overall level of presence need not necessarily diminish, in line with Ellis' notion of iso-presence equivalence classes.
There is, however, general agreement in the field (e.g., Slater & Wilbur, 1997) with Ellis' assertion that there need not necessarily be a causal relationship between presence and performance. In addition, task performance measurements usually require the completion of a task within a VE. In this respect, their application to non-interactive media is problematic.

Postural responses
Within the behavioural realism paradigm, Freeman et al. (in press) investigated observers' postural adjustments as a potential objective corroborative measure of presence. Both Prothero (1998) and Ohmi (1998) proposed that measures of vection (i.e., the feeling of self-movement) and presence should be related, basing their arguments on the premise that it is likely that an observer will feel present in an environment that causes him/her to experience vection. In the Freeman et al. (in press) experiment, observers were instructed to stand as still as possible in front of a display that showed a moving video of a rally car circling a track, using monoscopic and stereoscopic presentation in counterbalanced sessions. The observers' automatic postural adjustments were measured via a magnetic tracking device. The results demonstrated a positive effect of stereoscopic presentation on the magnitude of postural responses elicited. Post-test subjective ratings of presence, vection, and involvement were also higher for the stereoscopically presented stimuli. However, the postural responses cannot be taken as a direct substitute for subjective presence ratings as the two measures were not correlated across subjects. They can, however, corroborate group subjective ratings of presence.

Physiological measures
A number of physiological indicators, such as heart rate and skin conductance response (SCR), have been suggested as objective corroborative measures of presence, but at present there is little empirical research available that is specifically aimed at relating physiological measures to presence. Galvanic skin response is known to be sensitive to arousal, whereas heart rate is sensitive to hedonic valence (pleasant stimuli cause heart rate acceleration, whereas unpleasant ones cause a slowing of the heart rate). Both arousal and hedonic valence are viewed as two primary components of emotion (Greenwald, Cook & Lang, 1989). Within the behavioural realism paradigm, these measures would likely be most useful when applied to media experiences that could potentially elicit an emotional response. Wiederhold, Davis and Wiederhold (1998) reported a study in which they investigated the effects of varying levels of immersion in a flight simulation, comparing a computer screen to a head-mounted display (HMD). They studied participants' subjective presence ratings as well as measures of heart rate, respiration rate, peripheral skin temperature, and SCR. They found that SCR was the single measure that reflected a change in arousal across all participants, with no significant effect of immersion however. When asked in which condition they felt more present, subjects chose the HMD condition. Although this study yielded some interesting insights into the psychophysiological measurement of presence, it lacks generalizability due to the limited number of subjects. Sallnäs (1999) investigated the effect of haptic feedback on co-presence (i.e., the perception of 'being there together'), using both questionnaires and galvanic skin response as measures of presence. Fourteen pairs of participants (each comprising one male and one female) performed four collaborative tasks. A between-group design was used. Half of the pairs were able to feel the forces exerted by the other person through force-feedback, whereas the other half received no force-feedback.
All pairs had audio channels with which to communicate. The results showed that haptic feedback significantly improved task performance and significantly enhanced presence, but not co-presence. The galvanic skin response measure did not show a significant effect in this study (Sallnäs, personal communication). One potential explanation of these results may be that the audio feedback that was presented across conditions already supported a sufficient level of co-presence, thereby saturating any effect of haptic feedback on co-presence. More extensive studies are needed to investigate whether SCR, heart rate or other potential physiological correlates of presence provide a reliable corroborative measure. Such studies would need to include a reliable, sensitive subjective measure of presence.

Dual task measures
Consistent with the view that the allocation of attentional resources plays an important role in presence (Barfield & Weghorst, 1993; Witmer & Singer, 1998; Draper, Kaber & Usher, 1998), secondary reaction time measures could potentially be used as a corroborative measure to subjective report. The fundamental assumption is that as more effort is dedicated to the primary task, fewer resources remain for the secondary task. Typical results in this line of research show that when attending to the primary task people make more errors in responding to the secondary task, or take longer to respond to a cue (often a tone or a light) (Norman & Bobrow, 1975). It may thus be hypothesized that as presence increases, more attention will be allocated to the mediated environment, which would mean an increase in secondary reaction times and errors. This hypothesis has yet to be empirically investigated.

A slightly different approach would be to look at the extent to which information from a secondary source (real or mediated) is processed, either through a shadowing task during the medium experience or a memory task afterwards. As more attentional resources are allocated to the secondary source, less attention can be directed towards the primary source, thereby, in theory, diminishing presence. The extent to which signals from the real world are integrated into the experience of the virtual world could also potentially serve as a corroborative presence measure, although such signals would presumably need to have some ecological validity in the virtual world; e.g., a telephone ringing in an underwater diving simulation would not be very convincing as part of the mediated experience. This also illustrates the point made earlier about avoiding distractions and providing a seamless continuity between the real and the virtual.
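The dual-task logic sketched above could be operationalized along the following lines. The sketch below uses invented reaction times purely for illustration; it simply compares mean secondary-task reaction times between a low-presence and a high-presence condition, which is the comparison the (as yet untested) hypothesis calls for.

```python
# Dual-task sketch: if presence draws attentional resources towards the
# mediated environment, reaction times (RTs, in ms) to a secondary cue
# should lengthen. All RT values below are invented for illustration.

def mean_rt(rts):
    """Mean reaction time over a list of trials."""
    return sum(rts) / len(rts)

rt_low_presence = [310, 295, 320, 305, 300]    # e.g., monoscopic display
rt_high_presence = [360, 345, 370, 350, 355]   # e.g., stereoscopic display

slowdown = mean_rt(rt_high_presence) - mean_rt(rt_low_presence)
print(f"Secondary-task slowdown: {slowdown:.0f} ms")
```

Under the hypothesis, a reliably positive slowdown across subjects would corroborate a higher level of presence in the second condition; error rates on the secondary task could be compared in the same way.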

Social responses
To obtain an objective corroborative measure of social presence and co-presence, socially conditioned responses are likely to yield relevant results. In social encounters we interact with other people through a rich repertoire of social behaviours, both verbal and nonverbal, most of which occur automatically and without conscious reflection. Such behaviours are also likely to occur in mediated interactions of a social nature. Observable social behaviours with potential relevance to social presence include facial expressions (e.g., smiling), gestures (e.g., reaching out to shake someone's hand), body and head movements (e.g., nodding), eye contact, vocal cues (e.g., tone of voice), turn-taking behaviour in dialogues (e.g., frequency of interruptions), the use of space (e.g., moving towards someone), and verbal expressions directly acknowledging the communicative partner (e.g., 'How did you do that?' or 'I see what you mean'). In an attempt to develop a virtual reality therapy for fear of public speaking, Slater5 recently reported on a study investigating participants' responses when giving a five-minute talk in front of a virtual audience. The virtual audience's behaviour varied from enthusiastic, interested responses, like nodding and applauding, to disinterested, even rude, responses, such as frowning or walking out of the virtual lecture theatre. The results showed that a participant's self-rating of speaking performance was correlated with the reaction of the virtual audience, in that a positive audience response led to a positive self-evaluation.
Although this study was aimed at investigating the effects that positive audience feedback may have in overcoming fear of public speaking, useful social (e.g., gestures or facial expressions) and psychophysiological (e.g., SCR or heart rate) measures of presence may potentially be derived from such a scenario, provided that independent manipulations ensure different levels of presence and that sensitive and reliable subjective measures of presence are used.

Towards an aggregate measure of presence
Given the relatively recent birth of the presence research field, significant progress has been made in developing measures of presence. It seems reasonable to assume that rather

5. 'Co-presence as an amplifier of emotion'. Invited talk by M. Slater at the 2nd International Workshop on Presence, University of Essex, Colchester, UK, April 6-7, 1999.

than one overall presence measure, an aggregate measure of presence that comprises both subjective and objective components will be developed, to avoid the limitations of either measure by itself. It will thus consist of different but complementary types of measures, and the extent to which each measure is used may vary with the characteristics of the application. Subjective measures are expected to become more stable as more reliable questionnaires are developed and more insight is gained into the structure of presence. In conjunction with subjective measures, potential objective corroborative measures need to be further developed and investigated, so that appropriate and sensitive objective measures can be selected and tailored to the specific medium under study.

Acknowledgements Support from the UK Independent Television Commission (ITC) and from the EC, through the ACTS TAPESTRIES project, is gratefully acknowledged. The authors wish to thank Steven van de Par for his valuable comments on an earlier version of this paper.

References
Barfield, W. & Weghorst, S. (1993). The sense of presence within virtual environments: A conceptual framework. In: G. Salvendy and M. Smith (Eds.): Human Computer Interaction: Applications and Case Studies, 699-704. Amsterdam: Elsevier.
Barfield, W., Zeltzer, D., Sheridan, T.B. & Slater, M. (1995). Presence and performance within virtual environments. In: W. Barfield and T.A. Furness III (Eds.): Virtual Environments and Advanced Interface Design. Oxford: Oxford University Press.
Bobick, A.F., Intille, S.S., Davis, J.W., Baird, F., Pinhanez, C.S., Campbell, L.W., Ivanov, Y.A., Schütte, A. & Wilson, A. (1999). The KidsRoom: A perceptually-based interactive and immersive story environment. Presence: Teleoperators and Virtual Environments, 8, 369-393.
Draper, J.V., Kaber, D.B. & Usher, J.M. (1998). Telepresence. Human Factors, 40, 354-375.
Ellis, S.R. (1996). Presence of mind: A reaction to Thomas Sheridan's 'Further musings on the psychophysics of presence'. Presence: Teleoperators and Virtual Environments, 5, 247-259.
Freeman, J. (1999). Subjective and Objective Approaches to the Assessment of Presence. PhD thesis, University of Essex, Colchester, UK.
Freeman, J., Avons, S.E., Pearson, D.E. & IJsselsteijn, W.A. (1999). Effects of sensory information and prior experience on direct subjective ratings of presence. Presence: Teleoperators and Virtual Environments, 8, 1-13.
Freeman, J., Avons, S.E., Meddis, R., Pearson, D.E. & IJsselsteijn, W.A. (in press). Using behavioural realism to estimate presence: A study of the utility of postural responses to motion stimuli. Presence: Teleoperators and Virtual Environments.
Greenwald, M.K., Cook, E.W. & Lang, P.J. (1989). Affective judgment and psychophysiological response: Dimensional covariation in the evaluation of pictorial stimuli. Journal of Psychophysiology, 3, 51-64.
Held, R.M. & Durlach, N.I. (1992). Telepresence. Presence: Teleoperators and Virtual Environments, 1, 109-112.
IJsselsteijn, W.A. & Ridder, H. de (1998). Measuring temporal variations in presence. Paper presented at the 'Presence in Shared Virtual Environments' Workshop, BT Laboratories, Ipswich, UK, June 10-11, 1998. Available on-line: http://www.ipo.tue.nl/homepages/wijssels/btpaper.html.
IJsselsteijn, W.A., Ridder, H. de, Hamberg, R., Bouwhuis, D.G. & Freeman, J. (1998). Perceived depth and the feeling of presence in 3DTV. Displays, 18, 207-214.
Kalawsky, R.S., Bee, S.T. & Nee, S.P. (1999). Human factors evaluation techniques to aid understanding of virtual interfaces. BT Technology Journal, 17, 128-141.
Lombard, M. & Ditton, T. (1997). At the heart of it all: The concept of presence. Journal of Computer-Mediated Communication, 3(2). Only available on-line: http://www.ascusc.org/jcmc/vol3/issue2/lombard.html.
Mühlbach, L., Böcker, M. & Prussog, A. (1995). Telepresence in videocommunications: A study on stereoscopy and individual eye contact. Human Factors, 37, 290-305.
Norman, D.A. & Bobrow, D.G. (1975). On data-limited and resource-limited processes. Cognitive Psychology, 7, 44-64.
Ohmi, M. (1998). Sensation of self-motion induced by real-world stimuli. Selection and Integration of Visual Information: Proceedings of the International Workshop on Advances in Research on Visual Cognition, Tsukuba, Japan, December 8-11, 1997, 175-181.
Prothero, J.D. (1998). The Role of Rest Frames in Vection, Presence and Motion Sickness. PhD thesis, University of Washington, USA. Available on-line: http://www.hitl.washington.edu/publications/r-98-11/.
Ridder, H. de & Hamberg, R. (1997). Continuous assessment of image quality. SMPTE Journal, 106, 123-128.
Sallnäs, E.-L. (1999). Presence in multimodal interfaces. Paper presented at the 2nd International Workshop on Presence, University of Essex, Colchester, UK, April 6-7, 1999. Available on-line: http://www.nada.kth.se/~evalotta/Presence/IWVP.html.
Schloerb, D.W. (1995). A quantitative measure of telepresence. Presence: Teleoperators and Virtual Environments, 4, 64-80.
Sheridan, T.B. (1992). Musings on telepresence and virtual presence. Presence: Teleoperators and Virtual Environments, 1, 120-125.
Sheridan, T.B. (1996). Further musings on the psychophysics of presence. Presence: Teleoperators and Virtual Environments, 5, 241-246.
Short, J., Williams, E. & Christie, B. (1976). The Social Psychology of Telecommunications. London: Wiley.
Slater, M. (1999). Measuring presence: A response to the Witmer and Singer presence questionnaire. Presence: Teleoperators and Virtual Environments, 8, 560-565.
Slater, M. & Usoh, M. (1993). Representations systems, perceptual position, and presence in immersive virtual environments. Presence: Teleoperators and Virtual Environments, 2, 221-233.
Slater, M. & Wilbur, S. (1997). A framework for immersive virtual environments (FIVE): Speculations on the role of presence in virtual environments. Presence: Teleoperators and Virtual Environments, 6, 603-616.
Slater, M., Usoh, M. & Steed, A. (1994). Depth of presence in virtual environments. Presence: Teleoperators and Virtual Environments, 3, 130-144.
Snow, M.P. & Williges, R.C. (1998). Empirical models based on free-modulus magnitude estimation of perceived presence in virtual environments. Human Factors, 40, 386-402.
Stanney, K. & Salvendy, G. (1998). Aftereffects and sense of presence in virtual environments: Formulation of a research and development agenda. International Journal of Human-Computer Interaction, 10, 135-187.
Steuer, J. (1995). Defining virtual reality: Dimensions determining telepresence. In: F. Biocca and M.R. Levy (Eds.): Communication in the Age of Virtual Reality, 33-56. Hillsdale, NJ: Lawrence Erlbaum Associates.
Stevens, S.S. (1971). Issues in psychophysical measurement. Psychological Review, 78, 426-450.
Welch, R.B. (1997). The presence of aftereffects. In: G. Salvendy, M. Smith and R. Koubek (Eds.): Design of Computing Systems: Cognitive Considerations, 273-276. Amsterdam: Elsevier.
Wiederhold, B.K., Davis, R. & Wiederhold, M.D. (1998). The effects of immersiveness on physiology. In: G. Riva, B.K. Wiederhold and E. Molinari (Eds.): Virtual Environments in Clinical Psychology and Neuroscience, 52-60. Amsterdam: IOS Press.
Witmer, B.G. & Singer, M.J. (1998). Measuring presence in virtual environments: A presence questionnaire. Presence: Teleoperators and Virtual Environments, 7, 225-240.
Zeltzer, D. (1992). Autonomy, interaction and presence. Presence: Teleoperators and Virtual Environments, 1, 127-132.

INFORMATION ACCESS AND PRESENTATION

Developments

F.L. van Nes e-mail: [email protected]

Introduction: The IAP research domain
These days technology provides us with immense amounts of information, organized in a large number of databases that generally are interconnected. The usefulness of all that information depends on its accessibility, which in turn largely depends on the way the user interface with these databases presents the auxiliary information that is necessary for the search and navigation processes, ultimately leading to the desired information from the database. The usefulness of the latter information may, again, depend on its presentation via the user interface. For example, the legibility of text depends on its font. The above situation describes the choice of research domain of the Information Access and Presentation programme at IPO: all facets of the usability of information retrieval from one or many interconnected databases. However, the issue of accessing available information and having it presented through a user interface is not only important on a large scale with databases, but also on a small scale, i.e., every time the question arises how to extract and view a piece of information from some available, but amorphous, data. Artificial agents may be employed to answer this question, and in this function are therefore a legitimate subject of research in IPO's information access and presentation programme. In the following we will describe ongoing research. But we make special mention of one finished IAP research project that was concluded last year with a dissertation that obtained the 'cum laude' qualification, the first thesis to receive this honour since this option was again made available at the beginning of 1999: 'Computational Image Quality', by Dr T.J.W.M. Janssen.

Current implementation of IAP's research programme
In the past year a number of topics received special attention, and will continue to do so in the coming year. One general issue regarding information access is database structure; two projects, summarized below, are related to this issue. As was explained in the introduction, the user interface has a central role in the access as well as the presentation of information. Four projects, also summarized below, deal with aspects of the user interface. The first looks at interface quality in general and aims at its quantification. The second can be described as dealing with specific interface properties: preferences in rendering images (especially their colour) and understanding the basis of these preferences. The third is focussed on dynamic interaction processes in the interface through agents. The fourth project is also concerned with dynamic interaction processes in the user interface, by logging all user actions that are directed at obtaining access to auditory information, and analysing the activity files thus obtained. Finally, two projects, also summarized below, have to do with two important properties of applications such as databases: learnability and the capability of generating trust in

the user. Trust is a fundamental aspect of using systems, for example in electronic commerce.

Specific research projects

Database structures for optimal search
The basic question of this project is how to match the structure of a database at 'view level' to the user's cognitive structure of these data, to minimize user effort in exploiting the database system. At present, most systems are only useful to users with knowledge of the system's methods of organization. However, for novice users the way the database is structured is generally not clear. In order to improve this, insight has to be gained into the cognitive structure of data. Secondly, the cognitive structure should be determined for a given user goal, after these goals have been analysed in relation to data retrieval.

Support for the software design process
Whereas the former project is primarily aimed at 'general users', this one focusses on a well-defined user class: software designers. When writing and developing software, people may use large existing software libraries, if they know what to look for. At this point the problem of ill-defined information need becomes clear: how can even a well-structured software database be searched if it is not clear what is being looked for? The solution to this problem may lie in the combination of a task-based information retrieval technique with information-exploration interface principles. The design of such information-exploration interfaces is a new research direction in human-computer interaction.

What makes a good user interface?
The aim of this project can be summarized in three points: (1) to develop a computational concept for interface quality, (2) to derive a metric from this concept for the quantification of interface quality, and (3) to build a demonstrator or prototype to demonstrate the applicability of the concept. To pursue this aim, the process of user-system interaction will be primarily regarded as a goal-directed information-processing task. In this process the aim of the user is to successfully achieve the goal of the interaction. The amount of success is primarily determined by two factors: (1) effectivity, i.e., the remaining distance to the desired goal, and (2) efficiency, i.e., the amount of effort spent on achieving the goal. Within this concept the quality of a user interface can be interpreted straightforwardly as the degree to which the interface supports successful interaction. A full article on this project can be found in this section of the IPO Annual Progress Report.
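The two success factors named above can be illustrated with a toy metric. The scoring functions and their combination below are our own illustrative assumptions, not the project's actual metric; the sketch merely shows how effectivity (remaining distance to the goal) and efficiency (effort spent relative to the minimum needed) might be combined into a single quality score.

```python
# Toy quantification of interaction success (illustrative assumptions only):
#   effectivity: 1 when the goal is reached, decreasing with remaining distance
#   efficiency:  1 when no effort is wasted, decreasing with excess effort

def interface_quality(remaining_distance, effort_spent, minimal_effort):
    """Return a score in (0, 1]; higher means better interaction support."""
    effectivity = 1.0 / (1.0 + remaining_distance)        # 1 at the goal
    efficiency = minimal_effort / max(effort_spent, minimal_effort)
    return effectivity * efficiency

# Goal fully reached with no wasted effort -> quality 1.0
print(interface_quality(remaining_distance=0, effort_spent=5, minimal_effort=5))
# Goal reached, but with double the necessary effort -> quality 0.5
print(interface_quality(remaining_distance=0, effort_spent=10, minimal_effort=5))
```

A multiplicative combination was chosen here so that failure on either factor (goal not reached, or effort wasted) lowers the overall score; other combination rules are equally conceivable.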

User preferences for rendering images
Users, or rather viewers of images on visual displays, prefer certain colour appearances, even of 'natural' images. These preferences are related to the cognitive and emotional effects that images have on viewers. It is of both theoretical and practical interest to develop a computational model to achieve the preferred colour appearance of images. Such a model requires input about colour categorization by viewers. Therefore, a computational model of colour categorization has been developed. A paper describing the model has been submitted for publication.

Dynamics of interaction
Computational modelling is also employed for the operational description of an artificial agent that can be evoked in certain user interfaces. It is involved with dynamic interaction, albeit in a restricted domain. By reasoning and communicating with other agents with specific knowledge of a domain, this knowledge can be extended. The artificial agent may be applied as a cooperative assistant for a user who is trying to operate a particular class of applications. A prototype of such an assistant was already constructed in the DenK project, which embodied a cooperation of IPO and the Faculty of Mathematics and Computing Science, both of Eindhoven University of Technology, and Tilburg University. In the past year the current project resulted in a publication entitled 'Communication modelling and context dependent interpretation: An integrated approach'.

LOFTUS: Logfiles for testing usability
A full article on the LOFTUS project can be found in this section of the IPO Annual Progress Report. The project aims to gather insight into the usefulness of the analysis of logfiles, in particular as a method for evaluating the usability of individual interface components and their respective influences on the usability of the overall user interface.

Learning the operation of a device
An ACT-R model was developed that simulates the learning process of users performing tasks with a simulated device. The model takes the amount of foreknowledge that a subject has as an input variable, and models the learning process as a function of this foreknowledge. Future work will refine the developed model by increasing the complexity of the simulated device and manipulating the degree of correctness of the foreknowledge.

MUST: Maximizing usability and trustworthiness in electronic commerce
Maximizing usability as well as trustworthiness in e-commerce system design is increasingly viewed as a conditio sine qua non for on-line vendors to attract and retain new customers. This research project therefore investigates the psychological nature of trust on the one hand and, on the other, explores the relationship between a system's experienced trustworthiness and its usability. The main objective is to develop a method for usability and trustworthiness evaluation and design, which bears a specific relationship to the domain of business-to-consumer e-commerce.

The satisfied user

F.J.J. Blommaert and T.J.W.M. Janssen
e-mail: [email protected]

Abstract
We interpret human-system interaction as a sequence of information-processing operations. According to computational cognition (Newell, 1990; Simon & Kaplan, 1989) and computational vision (Marr, 1982), information processing should be studied at three different levels of abstraction, that is, at the knowledge level, at the level of algorithm, and at the level of implementation. Human-system interaction is analysed at these levels with an emphasis on user satisfaction and effective and efficient goal achievement. It is concluded that user satisfaction has more or less independent components at these levels, namely, satisfaction evoked by adequate subgoaling, adequate problem-solving strategy, and adequate implementation.

Introduction
User satisfaction is one of the main determinants of interface quality. Hence, designing better interfaces often means that user satisfaction has to be increased. Currently, the relation between user satisfaction and interface properties is opaque. The aim of this article is to provide a conceptual framework for this relation. There is a definition of user satisfaction (ISO 9241), which runs as follows: 'Satisfaction is the comfort and acceptability of the work system to its users and other people affected by its use'. According to the ISO definition, together with Effectiveness and Efficiency it determines the Usability of a work system. In this paper we will not interpret user satisfaction according to the strict ISO definition, but instead use the common interpretation of user satisfaction within human communications: a positive emotion evoked in some way by using a particular interface or work system. It is broadly acknowledged that user interfaces should support user satisfaction. To implement this rule in the design process, a great deal of empirical knowledge on user satisfaction has been gathered and condensed into design rules for user interfaces (see e.g., Norman, 1990; Nielsen, 1993). Furthermore, a considerable number of experimental methods have been developed to evaluate user satisfaction for specific interfaces in experimental situations (see e.g., Smith & Mosier, 1986; Hollemans, 1998). What is lacking, in our view, is an understanding of the phenomenon of satisfaction in terms of its function in human cognition and human behaviour in general, and of how it is related to goal achievement and interface properties. The approach we adopt is computational, i.e., (1) we regard the user-system interaction process as information processing and (2) we apply an analysis-through-synthesis technique, meaning that we propose a synthetic user and determine for which interface properties this user is satisfied with the interaction process.
In the next step, the theory should of course be tested against real users, improved accordingly, and so on.

User satisfaction
Why is a user more satisfied with one interface rather than another? Since satisfaction is an emotion, we need to consult emotion theory. After the publication of an influential paper (Simon, 1967), there has been growing consensus among cognitive researchers that emotions manage our multiple goals, work as an interrupt system if necessary, and structure the cognitive system into distinct modes of organization (Oatley & Jenkins, 1996). Satisfaction can then be interpreted as the emotion or mood of achieving subgoals, of being engaged in what one is doing. An opposite of satisfaction is dissatisfaction or frustration, an emotion or mood of losing a goal and knowing that it cannot be reinstated. 'Another' opposite of satisfaction is anger, which occurs if a goal is obstructed but looks as if it could be reinstated. In this case, anger makes us try harder. Why there is a need for management of goals will be explained in the next section.

The management of goals
In order to achieve goals, humans use their abilities to perceive the world, solve problems, manipulate objects, and communicate with other humans. The human tragedy, however, is that there are limitations to these capabilities: there are perceptual limitations in the amount and nature of information that can be captured in a certain period of time, there are speed and memory limitations in the human cognitive processor, and there are limitations in the human action processor, in speech as well as in motoric capabilities. Since human behaviour to a good approximation is a serial process, this means that achieving a certain goal takes up time and resources that, at the same time, cannot be spent on the achievement of other goals. Thus, there is value in prioritizing goals, and in achieving goals at a minimum of costs in time and resources. Obviously, important goals should have a high priority, and efficient behaviour enables the unexpected achievement of additional goals. Management to achieve at least the important goals, and preferably more, is thought to be the functional task of emotions. Positive emotions (satisfaction and enjoyment) signal the approach of goals, and negative emotions (frustration and anger) signal the obstruction of goal achievement. At this point, some important observations can already be made, namely, that user satisfaction is related to the efficient achievement of user goals, and that good interfaces enable and support the efficient achievement of user goals. To specify this in more detail, we have to take a closer look at the human-system interaction process.

Human-system interaction
A diagram of the architecture of the interaction process is depicted in Figure 1. It shows how information flows from user to system and back to the user in a single cycle of the interaction process, and how this information is interpreted by the human cognitive processor, which passes on new information to the action system, and so on. A single cycle of the interaction process is the basic unit from which the whole process is built up. Human behaviour is split up into either two (perception-action) or three (perception-cognition-action) subprocesses, according to the demands posed by a certain task. As can be seen in Figure 1, human behaviour can be regarded as information processing, a viewpoint shared with computational cognition (Newell, 1990; Simon & Kaplan, 1989) and computational vision (Marr, 1982; Watt, 1989).

Figure 1: The architecture of the interaction process plus an indication of the information flow.

According to Marr (1982), an information-processing system can only be fully understood if it is understood at at least three different levels of abstraction. We adopt this viewpoint and analyse human-system interaction at the following levels:
(1) The knowledge level, or the level of the computational theory. At this level, the user has goals and tasks. He has knowledge about the world and the logic to find a solution to achieve these goals, often by using means-ends analysis. This basically leads to a hierarchy of realistic computational subgoals that lead to the achievement of the desired goal.
(2) The level of the algorithm. Here there are precise representations of input and desired output, and a mapping strategy or algorithm to compute the output from the input according to the computational theory specified under (1).
(3) The implementation level. At this level, there is a physical specification of the implementation of the algorithm, considering behavioural and technological constraints.

The knowledge level
The behavioural system runs without a deus ex machina or homunculus, but nonetheless there must be a principle that captures the survival value of good 'programs' or 'behaviours'; good programs have the property that they help their owner. Being helpful is interpreted here as being effective, i.e., they should lead to goal achievement, and efficient, that is, they should lead to goal achievement while consuming only a limited amount of resources. The assumption is comparable with the 'fitness' assumption used in 'natural computation' (Ballard, 1997).

The general problem that the human behavioural system has to solve at the knowledge level is how to act given a multitude of goals with different priorities and a limited amount of resources. It is a typical planning problem, hindered by the fact that there is only limited knowledge about the world and about problem-solving procedures. Goal priorities, together with the principles of effectiveness and efficiency, determine how to act next. If effectiveness and efficiency allow, the action taken next involves achieving the goal with the highest priority. Such action, however, can be interrupted at any time by external or internal (e.g., physical as well as mental) events. If the goal with the highest priority is achieved, the next action will be devoted to the achievement of the next highest goal, and so on. It is the typical opportunistic behaviour of an uncertain management system. In achieving a certain goal, there are basically two uncertainties: there is (1) uncertainty about the exact identity of the goal target, and (2) uncertainty about the exact path to use to reach the goal target. Often there is a collection of reasonable alternatives for the target, and also a collection of alternative paths that lead to the target. However, not all alternative targets and paths are equally suitable. How suitable an alternative target is depends on what the target will be used for: its suitability depends on the supergoal, one level up in the goal hierarchy. As an example, the suitability of a piece of wood depends on whether you want to use it for making a cupboard or for lighting a fire. The fact that the identity of a target often has a certain identity margin is typically used to try to reduce the amount of effort put in to find it. This is known as satisficing (Simon, 1967), which lends itself to optimization, as is shown in Figure 2a.

[Figure 2: two panels plotting the total cost (1-p)*E + p*D against time; see caption.]

Figure 2: Illustration of 'satisficing' as a problem solution with minimal costs for the user. Figure 2a (left) shows how a minimal-cost solution is constructed from linearly increasing total costs in resources spent and (non-linearly) decreasing costs for target distance to the ideal. Figure 2b (right) shows how the minimal-cost solution can be manipulated to lower costs by speeding up the target approach (e.g., by learning or user support).

Typically there are two penalty terms that make up the user-cost function for goal achievement. The first penalty term is the total amount of resources spent on finding a certain target. The second penalty term is the distance to the ideal target, expressed in subjective costs. Finding the optimal solution to satisficing then is a standard regularization problem (see e.g., Poggio, Torre & Koch, 1985) with a cost function that should be minimized

according to the following:

min {(1-p)*C(resources) + p*C(distance)},

where p is a regularization parameter that expresses the importance of approaching the ideal target. The minimum cost gives the subjectively most satisfactory solution. An interesting point for interface design is how the minimum can be manipulated towards lower values. There are two obvious ways: (1) to spend fewer resources per unit time while keeping the distance function the same, and (2) to spend the same amount of resources per unit time while speeding up the goal approach. An example of making the goal approach more efficient is given in Figure 2b. It can be achieved through learning or the application of user support (e.g., with agents). Subgoaling, splitting up the process of goal achievement into manageable pieces, seems to be a natural way for humans to solve problems. An obvious advantage over planning is that solving all subproblems but the first one results in more informedness about the problem, since all information gathered can be used to solve the next problem step. A typical way to apply this strategy is the opportunistic goal approach: choose a subgoal that is closer to the end-goal than the current state, achieve this subgoal, and repeat this procedure until the end-goal is reached. From the point of view of user satisfaction, the first basic question is under what conditions humans are satisfied with a certain prioritization of goals, formation of a goal stack, and management of the action process that leads to goal achievement. The answer is that there is high satisfaction if the action process is experienced as more successful than foreseen, either in the number of goals reached or in the amount of effort spent. There will be satisfaction if the goal achievement is close to what we have foreseen. There will be dissatisfaction, or even frustration, in the case that the results are lower, or considerably lower, than expected.
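The satisficing minimum can be illustrated with a small numerical sketch. Only the weighted-sum structure min{(1-p)*C(resources) + p*C(distance)} comes from the text; the concrete cost shapes (linear resource cost, exponentially decaying distance cost) and all parameter values are illustrative assumptions.

```python
import math

# Illustrative cost shapes (assumptions): resources grow linearly with the
# time spent, while the distance to the ideal target decays exponentially.
def total_cost(t, p, rate=1.0, decay=0.5):
    c_resources = rate * t                 # C(resources)
    c_distance = math.exp(-decay * t)      # C(distance)
    return (1 - p) * c_resources + p * c_distance

def satisficing_optimum(p, steps=1000, t_max=20.0):
    # Brute-force grid search for the time at which the total cost is minimal.
    best_t, best_c = 0.0, total_cost(0.0, p)
    for i in range(1, steps + 1):
        t = t_max * i / steps
        c = total_cost(t, p)
        if c < best_c:
            best_t, best_c = t, c
    return best_t, best_c
```

With these shapes, a larger regularization parameter p (more importance attached to approaching the ideal target) shifts the optimum towards spending more time on the target approach, mirroring Figure 2a.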
From the point of view of satisfaction, the second basic question is what the properties of interfaces should be in order to, at least, satisfy users. The general answer is that they should support users in the prioritization of goals, the formation of goal stacks (if possible), and the management of the action process, so that the action process becomes as successful as possible. For accurate management, informedness about goal approach is indispensable. In other words, we need a system to monitor the distance between current state and goal state.

The level of algorithm
At this level there should be a precise specification of input, output, and a precise computational description of the way the output can be derived from the input. The input is the specification of the goal, while the output is its realization. The trajectory from input to output is a problem-solving method, which can generally be formalized as a search for an adequate solution to a problem in an adequate problem space (Newell, 1990). Again, the basic question is when a human will be satisfied with the goal achievement. The answer is that there will be satisfaction if there is an experience of goal approach that is faster than or equally fast as that expected by the human. The second basic question, about how a user interface should be designed to evoke or support user satisfaction,

is that it should support goal approach as well as possible. This raises the question of the most efficient search strategies, and of the conditions under which these strategies are enabled and supported. Usually a search can be interpreted as a process intended to reduce uncertainty in either the location or the identity of a goal target, or both. The search process is started by defining a suitable search space that contains the target. The uncertainty in the location of a single target is then related to N, the number of items in the search space. This uncertainty can basically be reduced in two ways: (1) by decreasing the number of items N in the search space, and (2) by increasing the number of targets from 1 to M. The uncertainty in target location is then related to N/M. However, the uncertainty in target identity is increased by a comparable factor. The feasibility of using this satisficing strategy depends on the supergoal, one level up, as explained in the previous section. The general strategy for reducing the size of a search space is to eliminate non-targets or subspaces of non-targets, or to select subspaces that contain the target. The simplest strategy, which always works (except in cases where the search space is of infinite size), is to generate and test every single item in the search space until the target is found. The uncertainty in the location of the target then reduces from N to N-1 per step, which is very slow. The speed can be enormously increased if the elimination of multiple items is allowed. If one eliminates L items at a time, the uncertainty reduces from N to N-L per step.
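A toy generate-and-test routine makes the speed-up from block elimination concrete. The `block` parameter stands in for eliminating L items per test; the linear-scan setup and all names are illustrative, not a procedure from the article.

```python
def steps_to_locate(n_items, target_index, block=1):
    # Generate-and-test over a search space of n_items, eliminating `block`
    # items per test.  Returns the number of tests until the block containing
    # the target is reached (identity uncertainty *within* the block remains,
    # as noted in the text).
    steps, position = 0, 0
    while position < n_items:
        steps += 1
        if position <= target_index < position + block:
            return steps
        position += block          # uncertainty drops from N to N - block
    raise ValueError("target_index outside the search space")
```

For N = 1000 and a worst-case target, single-item testing needs 1000 steps, while eliminating 50 items at a time needs only 20, at the price of increased identity uncertainty within the selected block.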

Figure 3: Hierarchical data structure of N items with K hierarchical levels and L categories per node.

The natural way to represent items so as to allow elimination or selection of multiple items is in a hierarchical structure. Super-categories, categories, or subcategories can be deleted or selected in one go. If there is insufficient information about the structure for ordering choices, so-called weak or blind search methods are used. Well-known examples are depth-first, breadth-first, and random (non-deterministic) search. Search efficiency may improve spectacularly if there is a way to order the choices, so that the most promising ones are explored earliest. In such situations one applies heuristically informed methods (Pearl, 1985). Well-known examples are hill-climbing, beam search, and best-first. A well-known way to enable heuristics is to apply natural categories (Murphy & Lassaline, 1997): the hierarchical structure of the data is the same as the mental structure of the data that humans have learned and use in communicating with each other. Consumer products, such as food in supermarkets, are structured in a more or less natural way. Given a search space with N items, the number of hierarchical levels K and the number of categories per node L can still be chosen. From the efficiency point of view it is interesting to know which combinations of K and L minimize the number of generate-and-tests needed to reach a single target. If, on average, every item is approximately equally likely to be the target, the average number of generate-and-tests, independent of the exact target, equals:

K*L + N/L^K.

This function is depicted as the surface plot in Figure 4. The minimum of this function is the optimal solution for specifying the hierarchical structure for single-item search.
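The minimum of this function can be located by brute force. A sketch, with a capacity constraint L^K >= N added as my assumption (so that the hierarchy can hold all N items; the grid ranges are likewise arbitrary choices):

```python
def avg_tests(n, k, l):
    # Average number of generate-and-tests: K*L + N / L**K
    return k * l + n / l ** k

def best_hierarchy(n=1000, k_max=12, l_max=40):
    # Exhaustively search a small (K, L) grid for the cheapest hierarchy.
    best = None
    for k in range(1, k_max + 1):
        for l in range(2, l_max + 1):
            if l ** k < n:
                continue           # assumed constraint: hierarchy holds N items
            cost = avg_tests(n, k, l)
            if best is None or cost < best[0]:
                best = (cost, k, l)
    return best
```

For N = 1000 this finds a minimum of about 21 generate-and-tests (e.g., K = 5 levels with L = 4 categories per node), versus 1000 for a flat exhaustive scan.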


Figure 4: Surface plot of the average number of generate-and-tests necessary to find a single target in a search space that is hierarchically structured, as a function of the number of hierarchical levels and the number of categories per node. The bold line is a boundary condition that excludes K,L combinations for which L^K > N. The plot is for N=1000.

In conclusion, satisfaction at the level of algorithm is supported by enabling and supporting effective and efficient strategies for approaching goals. Optimal strategies evoke maximal user satisfaction.

The level of implementation
At this level, choices must be made about how the information exchange between user and system should be implemented; these choices are necessary for applying the strategy and achieving the goal. There must be a language, with its lexicon, syntax, and semantics, for the information transfer from system to user and vice versa. Sensory modalities that are suited to the type of information and language must be chosen, plus the hardware that represents system and interface. The space of possible implementations is typically constrained, both by limitations of the human behavioural system and by limitations of available technology. Nevertheless, there is no unique solution to the implementation problem, which poses the question of the more or most desirable properties of the implementation. According to our philosophy, we choose user satisfaction as the adequacy yardstick to discriminate between different possible implementations, which means that the implementation is judged with respect to the effectiveness and efficiency of goal achievement. More precisely, given that we have chosen an optimal strategy at the level of algorithm, which implementation of this strategy is most effective and efficient? This question basically addresses the effectiveness and efficiency of the information flow that goes around in the interaction process. The first point that needs to be raised is that efficiency is judged at a subjective level, which means that it is derived from subjective attributes, such as the subjective experience of costs and benefits and the subjective experience of time. Although there is probably a high positive correlation between these subjective attributes and their objective (physical) counterparts, the relations are probably not simply linear. The second point is that it is not so trivial what the costs are.
Evidently it is not the costs that are invested by the system in the interaction process, but the costs that are invested by the user, such as time, perceptive attention, interpretation costs of messages from the interface, costs incurred by the cognitive processor to decide upon an answer, the action costs of giving the answer, and the waiting costs related to the inertia of the system. User satisfaction at the level of implementation then is probably related to the minimization of these user costs. This results in something like 'average user costs' in a single cycle of the interaction process.

Conclusions
For humans, a natural way of dealing with goal achievement is by subgoaling. Hence, the accompanying way to deal with processes that should lead to goal achievement is via subprocessing. The hierarchical structure of the human-system interaction process can be linked well to these hierarchies of goals and processes. At the knowledge level, the end-goal of the interaction process itself should optimally support goal achievement of the super-process one level up. At the level of the algorithm, the strategy, or specific sequence of interaction cycles, should lead to goal achievement at the knowledge level. At the level of implementation, accomplishing a single interaction cycle should lead to goal achievement at the level of algorithm. Finally, at the sub-implementation level, sub-subprocesses (perception, cognition, action, interface, and system) should support goal achievement of a single interaction cycle. Our basic assumption is that user satisfaction is highly correlated with the subjective experience of goal achievement. Given the fact that goal achievement can be interpreted

in a hierarchy of supergoals, goals, and subgoals, in human-system interaction user satisfaction also has components at different levels of goal achievement. Optimal interface support for the achievement of goals then also consists of support at different levels of abstraction.

References
Ballard, D.H. (1997). An Introduction to Natural Computation. Cambridge, MA; London: MIT Press.
Hollemans, G. (1998). Assessing satisfaction with consumer electronics. In: M. Sikorski and M. Rauterberg (Eds.): Proceedings of the International Workshop on Transferring Usability Engineering to Industry, Gdansk, Poland, June 11-14, 1998, 59-61. Gdansk: Technical University of Gdansk.
Marr, D. (1982). Vision. San Francisco: Freeman.
Murphy, G.L. & Lassaline, M.E. (1997). Hierarchical structure in concepts and the basic level of categorization. In: K. Lamberts and D. Shanks (Eds.): Knowledge, Concepts, and Categories, 93-132. Hove, UK: Taylor and Francis.
Newell, A. (1990). Unified Theories of Cognition. Cambridge, MA: Harvard University Press.
Nielsen, J. (1993). Usability Engineering. Boston: Academic Press Professional.
Norman, D.A. (1990). The Design of Everyday Things. New York: Bantam Books.
Oatley, K. & Jenkins, J.M. (1996). Understanding Emotions. Cambridge, UK: Blackwell.
Pearl, J. (1985). Heuristics: Intelligent Search Strategies for Computer Problem Solving. Amsterdam: Addison-Wesley.
Poggio, T., Torre, V. & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314-319.
Simon, H.A. (1967). Motivational and emotional controls of cognition. Psychological Review, 74(1), 29-39.
Simon, H.A. & Kaplan, C.A. (1989). Foundations of cognitive science. In: M.I. Posner (Ed.): Foundations of Cognitive Science, 1-47. Cambridge, MA; London: MIT Press.
Smith, S.L. & Mosier, J.N. (1986). Guidelines for Designing User Interface Software. Report ESD-TR-86-278. Bedford, MA: The MITRE Corporation.
Watt, R. (1989). Visual Processing: Computational, Psychophysical and Cognitive Research. Hove, UK: Erlbaum.

Log files for testing usability

G. Klein Teeselink, H. Siepe and J.R. de Pijper
e-mail: [email protected]

Abstract
The aim of this study is to gain insight into the usefulness of log file analysis as a method to evaluate the usability of individual interface components and their influence on the usability of the overall user interface. We selected a music player as application, with four different interfaces and two different databases of music tracks. The music player only allowed the user to (a) select and (b) play a music track. An experiment was carried out in which all the user actions with the application were recorded in a log file. These log files were analysed. The preliminary results indicate advantages and disadvantages of the method of log file analysis for testing usability.

Introduction
A user interface (UI) can be seen as composed of interaction components. These individual components (which themselves can be considered as user interfaces) are constructed out of still smaller components (Haakma, 1998). In this way a hierarchical layered structure is obtained. Figure 1 shows an example of a hierarchical layered structure of a user interface of a music player containing two main components: 'select track' and 'play track', the first of which is further subdivided into two subcomponents: 'find track' and 'make track active'.


Figure 1: Hierarchical layered structure of a user interface of a music player containing two main components: 'select track' and 'play track', the first of which is further subdivided into two subcomponents: 'find track' and 'make track active'.

When drawing conclusions about the usability of a user interface, usually only the interface at the highest level is taken into consideration. It is difficult to evaluate the usability of the individual components and their influence on the overall user interface.

Knowledge about the influence of individual components on the usability of the overall user interface would enable designers to decide which components a user interface should be built of in order to best support users in performing their tasks in a certain application domain. In this project, two aspects of usability are considered: efficiency (operationally defined as the number of user actions needed to complete a task) and effectivity (defined as the time needed to complete a task). One technique that can be used to evaluate the influence of individual interface components and possible other factors on usability is a method called 'log file analysis'. In a log file all the input (user actions), output (system feedback), and internal states of an application (e.g., in the case of a music player, the currently selected track) are recorded in chronological order. Analysis of this file may support evaluation of the usability of a certain user interface and the contribution of individual subcomponents. The aim of the study is to gain insight into the usefulness of log file analysis as a method to evaluate the usability of an interface in terms of efficiency and effectivity, and to assess the influence of individual interaction components and possible other factors of interest on the usability of the overall user interface. In this article we describe an experiment that we carried out to that purpose, discuss the results, and draw conclusions.

Experiment

Application
As a sample application to collect data from, we designed a simple music player, implemented as an MS Windows 98 application on a PC. The music player allows a user to select and play a piece of music from a list of either 15 or 100 tracks, randomly ordered and then numbered. Four different interfaces were implemented, which force the user to interact with the player in different ways (see Figure 2). The interfaces differ in two aspects: (1) scrolling mode, i.e., the way in which the user can scroll through the list of available tracks, and (2) the size of the listbox used to display the list of tracks. Interfaces 1 and 2 let the user search through the list of tracks using 'Next' and 'Previous' buttons and differ only in the visible size of the listbox; Interfaces 3 and 4 offer a scrollbar to peruse the list and also differ only in the size of the listbox. Table 1 illustrates the scheme.

Table 1: Relation between interfaces, scrolling mode, and listbox size.

Interface   Scrolling mode   Listbox size
1           Buttons          Small (3)
2           Buttons          Large (10)
3           Scrollbar        Small (3)
4           Scrollbar        Large (10)

Figure 2: Screen dumps of the four interfaces used in the experiment.

The user interfaces can also be interpreted in terms of the hierarchical layered structure shown in Figure 1. Interfaces 3 and 4, which both use scrollbars to move through the list of tracks, follow exactly the structure shown in Figure 1. Interfaces 1 and 2, which both use buttons to scroll, have only two components: 'select track' and 'play track'. Here, 'select track' cannot be further subdivided, since pressing the 'Next' or 'Previous' button both scrolls the list and changes the selected track, so that 'find track' and 'make track active' are accomplished with one and the same action.

Tasks
Ten different tasks were designed, which subjects were asked to complete for each of the four interfaces. In each case, the subjects were asked to find a certain track in the list (the 'target track') and then play it. The tasks differed in:
• Target distance: the distance between the initially selected track and the target track: 1, 7, or 42 tracks.
• Database size: the size of the database of music tracks used: 15 or 100 tracks.
• Find mode: in half the cases, the subjects were given the rank number of the target track and, since the tracks in the listbox were numbered, this made finding the target track relatively easy; in the other half they were only told the title of the target track. These find modes are called 'By number' and 'By title', respectively.
This makes for a total of 40 assignments, where each assignment represents a unique combination of one of the four interfaces, two database sizes, three target distances, and two find modes. The order of presentation was randomized for each subject.

Subjects
The tasks were presented to 30 subjects, 19 male and 11 female. All subjects were students or employees of Eindhoven University of Technology and were experienced computer users.

Procedure
The experiment ran on a PC. Each subject did the test individually in an interactive session that took about 30 minutes.
Preceding the test, the subjects performed six training as­ signments, to familiarize them with the different interfaces and the experimental setting. During the test, log files were automatically created, containing one entry for each event generated by the subject. An event was generated whenever the subject clicked the left mouse button anywhere in the interface, kept the 'Next' or 'Previous' button down (Interfaces 1 and 2), or dragged the scrollbar (Interfaces 3 and 4). Each entry in the log file contains information about the nature of the event, the time elapsed since the previous event, and information about the internal state of the music player, such as the currently selected track, the first visible track in the listbox, and the currently playing track, if any.
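The report does not give the on-disk format of these log files; purely for illustration, the sketch below assumes a hypothetical tab-separated layout carrying the fields just listed (event type, elapsed time, selected track, first visible track, playing track):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogEntry:
    event: str                 # e.g. 'click', 'button_down', 'drag' (hypothetical names)
    elapsed_ms: int            # time elapsed since the previous event
    selected: int              # currently selected track
    first_visible: int         # first track visible in the listbox
    playing: Optional[int]     # currently playing track, or None

def parse_line(line: str) -> LogEntry:
    """Parse one tab-separated log line (hypothetical format)."""
    event, elapsed, sel, vis, play = line.rstrip("\n").split("\t")
    return LogEntry(event, int(elapsed), int(sel), int(vis),
                    int(play) if play != "-" else None)
```

A parser of this kind turns each raw line into a structured record, which is the natural starting point for the recoding and analyses discussed below.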

Using log file data

Interpreting the log files

As explained before, we concentrate on two aspects of usability: efficiency and effectivity. Measures for both of these can be readily derived from the log files, in terms of the number of user actions and the time needed to complete a task, respectively.

A total of 35560 events were recorded in the log files. Not all events, however, correspond to a user action. For instance, when a subject is dragging the scrollbar, the listbox is scrolled continuously and a stream of events is generated by the system. The same happens in Interfaces 1 and 2 when a subject keeps the 'Next' or 'Previous' button depressed. We have called such events 'compound' events. Thus, decisions have to be made on how to translate events into user actions. Obviously, these decisions must be made with care, since they have a serious impact on the interpretation of the results. In this case, we have decided to translate a compound event, such as keeping a button depressed, into two user actions: one corresponding to the moment when the button is depressed and one when it is released.

Another decision that had to be made has to do with the beginning and end of a task. When a new task was presented to a subject, it sometimes took a while before the subject began doing things. Similarly, every task ends with the subject pressing the 'Play' button and then the 'Klaar' ('Finished') button. The time between these two button presses is not relevant to the task; sometimes, subjects simply wanted to listen to the music for a bit. We have therefore defined a task as beginning with the first event generated by the subject and ending with the last press of the 'Play' button.

Log files as a data source

In this article, we do not present an exhaustive analysis of the experimental results; the experimental set-up with its interfaces and tasks is merely a means to explore the possibilities and limitations of log file analysis. Instead, we will present just a few analyses to demonstrate how log file data can be used. The design of the current experiment involves three independent variables: database size (2 levels), find mode (2 levels) and target distance (3 levels). Each of these variables is expected to affect the usability of the user interface, but the degree of this influence may well be different for the four interfaces used in the experiment. Analysis of the log files collected in the experiment enables us to answer such questions. The following two examples serve to illustrate this.
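The recoding decisions described above (a press-and-hold compound event counts as exactly two user actions; a task runs from the first subject-generated event to the last press of the 'Play' button) can be sketched as follows; the event names are hypothetical:

```python
def to_user_actions(events):
    """Collapse compound events into user actions.

    events: list of event-type strings in chronological order.
    A run of system-generated 'button_held' repeats between a
    'button_down' and a 'button_up' is discarded, so a press-and-hold
    yields exactly two user actions: the press and the release.
    """
    return [ev for ev in events if ev != "button_held"]

def trim_task(events):
    """A task begins at the first subject event and ends at the last
    'play' press; the trailing 'finished' press and any listening
    time after the last 'play' are dropped."""
    last_play = max(i for i, ev in enumerate(events) if ev == "play")
    return events[: last_play + 1]
```

Making these decisions explicit in code also documents them, so that the same recoding can be applied automatically to a new set of interfaces or subjects.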

Differences between interfaces

In a first analysis, we can test whether there is an overall difference in the usability of the four interfaces. Using the log file data, we obtain graphs as shown in Figure 3. The left-hand graph shows the effectivity of the interfaces, with mean elapsed time per task in seconds along the vertical axis and the four interfaces on the horizontal axis. The right-hand graph does the same for efficiency, with number of user actions along the vertical axis. The error bars indicate the 95% confidence interval around the mean. It seems immediately apparent that Interface 4 (scrollbars and large listbox) scores best, both in efficiency and effectivity, while Interface 1 (buttons and small listbox) scores worst on both counts. In terms of effectivity, there seems to be a clear ascending order from Interface 1 through Interface 4. For efficiency, Interfaces 1 to 3 appear to be about the same, with Interface 4 an unmistakable winner. To determine whether the observed differences are in fact statistically significant, and to get an estimate of the size of the effects, we use the analysis-of-variance (ANOVA) technique (univariate repeated-measures design). The results are summarized in Table 2 and Table 3.

Figure 3: Effectivity (left) and efficiency (right) of the four interfaces. The error bars indicate the 95% confidence interval (CI) around the mean.

Table 2 gives the results of the ANOVAs with 'Elapsed Time' and 'Nr. User Actions' as dependent variables, respectively. For both variables the results are highly significant, indicating that the interfaces differ significantly both in efficiency and effectivity. The column labelled η² represents an estimate of the size of the effect. The higher this value, the more variance is explained, and thus the stronger the effect, the idea being that even if an effect is significant, it is interesting only if it is large enough.

Table 2: ANOVA results for differences in effectivity (elapsed time) and efficiency (number of user actions) between the four interfaces.

Variable             Source     df1  df2   F      p       η²
Elapsed time         Interface  3    84.3  12.73  <.0005  .31
Nr. of User Actions  Interface  3    84.3  34.66  <.0005  .55
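For readers who want to reproduce this kind of analysis, the following is a minimal numpy sketch of a one-way repeated-measures ANOVA with partial η², computed on synthetic data; the original analysis used a univariate repeated-measures design with sphericity-corrected (fractional) degrees of freedom, which this sketch does not attempt to reproduce.

```python
import numpy as np

def rm_anova(data):
    """One-way repeated-measures ANOVA.

    data: (n_subjects, k_conditions) array of one score per
    subject per condition. Returns (F, df1, df2, partial eta sq.).
    """
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()   # effect of condition
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()   # subject differences
    ss_total = ((data - grand) ** 2).sum()
    ss_err = ss_total - ss_cond - ss_subj                    # condition x subject residual
    df1, df2 = k - 1, (k - 1) * (n - 1)
    F = (ss_cond / df1) / (ss_err / df2)
    eta_p2 = ss_cond / (ss_cond + ss_err)                    # partial eta squared
    return F, df1, df2, eta_p2

# tiny synthetic example: 3 subjects, 2 conditions
F, df1, df2, eta_p2 = rm_anova(np.array([[2., 4.], [3., 6.], [4., 5.]]))
```

In practice one would feed in the per-subject mean elapsed time (or action count) per interface, as derived from the log files.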

Table 3: Results of post-hoc analyses for differences in effectivity (elapsed time) and efficiency (number of user actions) between the four interfaces.

[The body of Table 3 is a pairwise-comparison matrix (rows: Interfaces 2-4; columns: Interfaces 1-3, once for elapsed time and once for number of user actions); the asterisks marking significant pairs did not survive reproduction. The significant pairs are spelled out in the text below the table.]

Table 3 gives pairwise differences between the four interfaces, according to a post-hoc analysis (Tukey method). This makes it possible to determine which pairwise differences between interfaces account for the overall significance. An asterisk '*' indicates a significant difference between two interfaces at the .05 level. For 'Nr. User Actions' (efficiency) the picture is very clear: the overall significant result is attributable solely to the difference between Interface 4 on the one hand and the other three on the other. For 'Elapsed Time' (effectivity), Interface 4 does not differ significantly from Interface 3.
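The report used the Tukey method; a full repeated-measures Tukey procedure is beyond a short sketch, so the example below answers the same question (which pairs of conditions differ significantly) with Bonferroni-corrected paired t-tests on synthetic data.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def pairwise_paired_t(data, labels, alpha=0.05):
    """All pairwise paired t-tests with a Bonferroni correction.

    data: (n_subjects, k_conditions) array; labels: condition names.
    Returns a dict {(label_i, label_j): significant?}.
    """
    k = data.shape[1]
    n_pairs = k * (k - 1) // 2
    out = {}
    for i, j in combinations(range(k), 2):
        t, p = stats.ttest_rel(data[:, i], data[:, j])
        out[(labels[i], labels[j])] = bool(p < alpha / n_pairs)
    return out

# synthetic data: I1 and I2 differ only by tiny alternating noise,
# I3 is shifted upward by 5 units for every subject
base = np.linspace(8.0, 12.0, 15)
eps = np.array([0.1 if i % 2 == 0 else -0.1 for i in range(15)])
data = np.column_stack([base, base + eps, base + 5 - eps])
res = pairwise_paired_t(data, ["I1", "I2", "I3"])
```

Bonferroni is more conservative than Tukey, but it isolates the same kind of pairwise structure: here only the pairs involving the shifted condition come out significant.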

Find mode and interaction with the interfaces

As another example, we will look at the effect of the independent variable 'find mode' on usability, and at the variable's possible interaction with the four user interfaces (Figure 4).

Figure 4: Effectivity (left) and efficiency (right) of the four interfaces in interaction with the variable 'find mode'. The error bars indicate the 95% confidence interval (CI) around the mean.

The left-hand graph again shows effectivity (mean elapsed time per task) and the right-hand graph shows efficiency (mean number of user actions). The left part of each graph shows the means for the tasks where the find mode was 'By number', the right part for the tasks where the find mode was 'By title'. The results of the corresponding ANOVAs are summarized in Table 4.

Table 4: ANOVA results for differences in effectivity (elapsed time) and efficiency (number of user actions) between the four interfaces, the two find modes and their interaction.

Variable             Source                  df1  df2   F       p       η²
Elapsed time         Interface               3    84.3  12.73   <.0005  .31
                     Find mode               1    28.0  218.57  <.0005  .84
                     Interface by Find mode  3    84.3  6.30    .001    .18
Nr. of User Actions  Interface               3    84.3  34.66   <.0005  .55
                     Find mode               1    28.0  173.75  <.0005  .86
                     Interface by Find mode  3    84.4  17.55   <.0005  .38

Clearly, the factor 'Find mode' has a profound effect on usability: when subjects know the rank number of the target track, they perform the task in a much shorter time than when they only know the track title (3.77 vs. 18.80 sec.), and they need to perform fewer actions (6.73 vs. 32.77). These effects are highly significant and also account for a large proportion of the variance, as evidenced by the high η² values. The interaction between interface and find mode is also significant for both dependent variables, but for effectivity (elapsed time) it is not very large (η² = .18). The stronger effect for efficiency (η² = .38) is apparently due mainly to Interface 4, where subjects need far fewer actions than in the other interfaces.

Interaction components

As explained in the introduction, we can think of Interfaces 1 and 2 as consisting of two interaction components: 'select track' and 'play track'. For Interfaces 3 and 4, the 'select track' component is further subdivided into a 'find track' and a 'make track active' component. It may be of interest to user-interface designers to know whether users tend to divide their time differently over these components depending on certain aspects of the user interface. For instance, we can use our log file data to determine whether there is such a difference between Interfaces 3 and 4, which differ only in the size of the listbox. In this case, a chi-square test is appropriate. Table 5 shows, among other things, that subjects spent 79% of their time on 'Find track' actions with Interface 3, as opposed to 67% with Interface 4. From this table, a Pearson chi-square value of 65.7 was calculated, which, with 3 degrees of freedom, is highly significant (p < .0005).

Table 5: Time and proportion of total time spent on events associated with the four interface components, for Interfaces 3 and 4.

                        Find     Activate  Play    Error   Total
Interface 3  Time (sec) 1810.1   224.3     263.8   6.2     2304
             Proportion .79      .10       .11     .00
Interface 4  Time (sec) 1151.1   244.7     301.5   12.2    1710
             Proportion .67      .14       .18     .01
Total        Time (sec) 2961     469       566     18      4014
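The chi-square computation on Table 5 can be checked with scipy: feeding it the time totals above as a 2 × 4 contingency table yields 3 degrees of freedom and a Pearson chi-square value close to the reported 65.7 (no continuity correction is applied for tables larger than 2 × 2).

```python
from scipy.stats import chi2_contingency

# time (sec) per interaction component, taken from Table 5
observed = [
    [1810.1, 224.3, 263.8, 6.2],    # Interface 3: Find, Activate, Play, Error
    [1151.1, 244.7, 301.5, 12.2],   # Interface 4
]
chi2, p, dof, expected = chi2_contingency(observed)
```

The degrees of freedom follow from the table shape, (rows - 1) × (columns - 1) = 1 × 3 = 3, matching the value reported in the text.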

Log files as a heuristic device

In the previous section, we used the data in the log files, after some interpretation and recoding, more or less directly as input to statistical analyses. We were thus able to state with confidence, e.g., that the interfaces with a scrollbar are more effective than the interfaces with buttons, all other things being equal. Such a hypothesis might well have been formulated before the experiment, and the whole process of testing the hypothesis could be automated, to be applied to a new set of interfaces or group of subjects.

Merely observing a fact does not, however, explain it. Take the observation that when subjects have to locate a target track knowing only the title, they need much more time than when they know the list number. After observing this fact, one might develop the notion that one of the reasons may be that, when only the title is known, subjects will often start searching the list in the wrong direction; or that they will tend to overshoot the target more often, especially if the size of the listbox is small. Such notions can be easily checked by direct inspection of the log files and by generating appropriate tables and graphs.
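Notions of this kind can be turned into simple checks over the logged track positions. The sketch below assumes a chronological list of selected-track indices per task, which is recoverable from the state information in the log entries (the representation is hypothetical):

```python
def search_diagnostics(positions, target):
    """Heuristic checks on a trace of selected-track positions.

    positions: chronological list of selected-track indices,
    starting at the initially selected track.
    Returns (started_wrong_direction, overshot_target).
    """
    start = positions[0]
    wrong_start = False
    for pos in positions[1:]:
        if pos != start:
            # first move away from the start: was it away from the target?
            wrong_start = (pos - start) * (target - start) < 0
            break
    # overshoot: the selection crossed past the target at some point
    overshoot = any((p - target) * (start - target) < 0 for p in positions)
    return wrong_start, overshoot
```

Applying such a function per task and cross-tabulating the results against find mode and listbox size is exactly the kind of exploratory follow-up the log files make cheap.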

By the same token, one may look for different strategies followed by the subjects, or follow up on unexpected observations and outliers. This use of log files as a rich source for exploration and explanation is at least as important as their use as a data source for statistical analysis.

Conclusion

Log files are very useful as a tool in evaluating usability aspects of user interfaces.

• Effectivity and efficiency of interfaces can be measured in terms of time and number of actions needed, respectively, both of which can be easily derived from log files.

• In more complex interfaces than the ones used in this study, log file analysis can also help in assessing the roles played by individual interface components.

• Finally, log files are extremely useful as a heuristic device: an instrument to explore user strategies and to find explanations for observed significant effects.

All the same, the value of log file analysis is restricted to those aspects of usability that can be related to countable or measurable objective data, such as time spent on a task. Other aspects, such as how 'fun' an interface is to work with, or how much satisfaction a user derives from working with a particular interface, must be explored with different techniques, e.g., questionnaires and interviews.

Acknowledgement

The project was funded by the User-System Interaction Technology group of Philips Research Laboratories, Eindhoven, The Netherlands.

References

Haakma, R. (1998). Layered Feedback in User-System Interaction. PhD thesis, Eindhoven University of Technology, The Netherlands.

MULTIMODAL INTERACTION

Developments

A.J.M. Houtsma e-mail: [email protected]

Introduction

Multimodal interaction research deals with the manner in which users interact with systems through all their sensory and motor channels. A pilot flying an airplane uses his eyes to watch the instruments, read a map or look around for other traffic, and his ears to monitor engine sound or listen to radio communication, while feeling the air's reactive force on the plane through the stiffness of the controls. While taking in all this information through various sensory channels, he is also acting on the system by stick and rudder control through his hands and feet, while speaking into a microphone for communication with air traffic control. In our interaction with complex systems it is quite common that several input and output modalities are involved simultaneously. When designing such systems it is therefore important to know, or to find out, which combination of sensory and motor modalities allows for optimal use of the system.

The research programme currently comprises three components, each with a specific focus. One is concerned with issues in computer vision and display, another with sound and hearing in combination with visual display, and a third with issues in tactile and haptic perception in combination with motion control and force feedback. Special attempts are being made to focus on cumulative effects of the use of several sensory modalities, and to include all three aspects (perception, cognition, and action) of the user-system interaction process.

Computer vision and display

Pattern recognition is a key technology for the next generation of user interfaces and therefore largely determines how feasible alternative interaction modalities, such as spoken language and handwriting, will be. Some projects have involved extending and applying the visual pattern recognition capabilities of the current Visual Interaction Platform (VIP). The functionality of the VIP can be divided into an interaction space and a communication (or visualization) space. The interaction space is realized by an image that is projected onto a table, where one or more users interact with the system by moving bricks on the table. Brick movements are observed by a camera and analysed. The communication space consists of a second image, projected onto a vertical screen, which all participants can observe and discuss.

The originally purchased library for pattern recognition was used to build a medical application. This application for easy retrieval, selection, and display of images from a patient database is meant to be used by doctors discussing patients at a weekly meeting. A full-size fixed prototype, as well as a smaller portable prototype of the system, have been realized. The latter is important if we want to conduct field studies with the VIP platform.

The original library for brick recognition was a commercial product with limited functionality. A new library for pattern recognition has been developed. It is faster and can be extended with new functionality.

In one project, we are developing a 2D drawing tool. This tool requires the recognition of alternative interaction elements (in addition to the bricks); the recognition library is being extended accordingly. The tool is being applied to an early architectural design. The project is part of a joint venture with the Faculties of Computing Science (computer graphics) and Architecture (design systems), in which alternative systems for use in architectural design are being studied.

In addition to developing applications for the current VIP platform, we are also trying to create new functionality. In one project, we are testing the usefulness of existing computer vision methodologies for estimating 3D form and position (e.g., shape from texture, stereo, etc.). The goal is to find out whether algorithms can be made sufficiently fast and robust to allow real-time 3D interaction with the user. The first application to be explored is the visualization of a medical 3D data set.

In another project, we are combining commercial products for handwriting recognition with in-house software for sketch recognition. In this case, the system input is no longer a camera image, but a signal generated from a stylus moving across an electronic sketch pad. The sketch recognition software is being developed in cooperation with the Faculty of Electrical Engineering. The intended application is the input of electronic circuit diagrams into an electronic simulation program (such as SPICE) by means of hand-drawing. To date we have mainly explored the effect of pre-processing (orientation, speed, etc.) of the pen (stylus) input on the recognition rate of handwriting recognizers.

Sound, hearing, and visual display

A software platform has been developed to enable flexible design of auditory-visual experiments with accurate stimulus timing. Experiments were conducted investigating the sensitivity and tolerance for auditory-visual asynchrony. Other experiments showed that subjects are able to compare temporal visual and auditory patterns with a high degree of accuracy, and can perceptually bind auditory and visual stimuli on the basis of matching such temporal patterns.

A new model for the perception of binaural stimuli was developed and tested with sound stimuli that had varying spectral and temporal parameters. The current model was found to accurately simulate many psychophysical detection data published during the past 40 years. Future research will therefore be focussed on the improvement of signal processing applications involving spatial sound reproduction, rather than on further improvement of the model itself. The current model will be used extensively for this purpose.

Application-oriented work continued, in cooperation with Philips Research and Sound & Vision, on efficient digital audio coding. A new aspect of the coding work is the inclusion of a sinusoidal coding structure. In cooperation with the Indian Institute of Technology (IIT), a loudness-based model was implemented and applied to find the most relevant sinusoidal components for representing a specific section of an audio signal.

In a project called 'Non-speech audio for user interfaces', research has focussed on the generation and perception of non-speech sound stimuli for use in multimodal user interfaces, and on auditory material perception. The first project dealt with interactive sounds to be coupled with interaction devices such as trackballs and mice. For synthesis purposes, experiments were carried out to find the acoustic cues listeners use in determining the size and velocity of rolling objects. In the auditory material perception study, the

parameter space of simple impact sounds was explored to determine when sounds are perceived as being produced by metal, wooden, glass, rubber, or plastic objects. The synthesis software for designing impact sounds was successfully used for the synthesis of interface sounds for large automotive vehicles.

Tactile and haptic perception, motion control, and force feedback

In one project, an inventory was made of all manual control and sensing devices that are currently available as human-computer interfaces.

In another project, the usefulness of force feedback in target acquisition with a mouse, trackball or joystick was investigated. A potentially helpful force field, guiding cursor movement to an intended target, can be turned on as soon as the system knows which target is the intended one. Traces and velocity profiles produced by subjects during target acquisition tasks with various control devices were collected. Statistics of these traces are a necessary input for any algorithm that has to estimate which target is being moved to.

In a third project, the relative contributions of auditory, visual, and tactile feedback to the accurate manual reproduction of rhythmic patterns were investigated. With the aid of a virtual system, in which all three modes of feedback could be independently turned on or off, it was found that tactile feedback appears to make the largest contribution to reproduction accuracy, closely followed by auditory feedback. Visual feedback seems to play a less significant role.

Scientific approaches to sound design

D.J. Hermes and S.L.J.D.E. van de Par e-mail: [email protected]

Abstract

Sound design can be studied from various perspectives. This contribution gives an overview of four different scientific approaches. The first is the physical approach, in which sound production is physically modelled using mathematical equations or computer models of the sound-generating event. The second approach is concerned with perceptual concepts such as pitch and timbre. The psychophysicist is typically interested in perception thresholds, frequency and temporal resolution, just-noticeable differences, magnitude scaling, etc. The third approach, referred to as auditory scene analysis, studies the perceptual process in which some sound components are attributed to the same sound-generating event, while others are perceptually attributed to different events. The last approach we will deal with is called the ecological approach. According to this approach, the listener does not so much listen to the pitch, the timbre, or the rhythm of the sound, but rather uses the physical cues in the sound to mentally reconstruct an image of the environment and of what happens in that environment. The implications of these different approaches for the design of soundscapes, non-speech audio, and of user-interface sounds in particular will be discussed.

Introduction

Sound design can be studied from many different perspectives. One of the oldest approaches consists of generating or recording sounds such as sinewaves, noises, and other signals. The results can be modified, edited, and combined. A well-known example is FM synthesis (Chowning, 1973), a technique by which harmonic tone complexes can be generated and combined to form musical tones. Such techniques have been widely used, e.g., for the synthesis of musical sounds.

In this paper, the focus will not so much be on the design of musical sounds, but on user-interface sounds. Four approaches, largely with origins in different scientific disciplines, will briefly be discussed. These approaches are complementary and not mutually exclusive. The choice made depends on the purpose of the study one has in mind. The first approach, covering the field of what is called acoustics, is the physical approach, in which the mechanics of sound-producing objects is studied on the basis of mathematical equations or computer simulations of the sound-generating object. The second approach, psychoacoustics, is concerned with perceptual phenomena such as masking, pitch perception, and magnitude scaling. The third and fourth approaches are known as auditory scene analysis (ASA) and the ecological approach to sound perception, respectively. Both deal with the problem of how the human auditory system structures the components of the perceived sounds to construct an image of the events occurring in the listener's environment. In the discussion we will argue that the design of appropriate user-interface sounds requires knowledge stemming from all four approaches.
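As an illustration of the FM technique mentioned above (Chowning, 1973), a single modulator varies the instantaneous phase of a carrier; with a harmonic carrier-to-modulator frequency ratio the result is a harmonic tone complex. The frequencies and modulation index below are arbitrary example values:

```python
import numpy as np

def fm_tone(fc, fm, index, dur, sr=44100):
    """Basic FM synthesis: sin(2*pi*fc*t + index*sin(2*pi*fm*t)).

    fc: carrier frequency (Hz), fm: modulator frequency (Hz),
    index: modulation index (controls sideband strength / brightness),
    dur: duration (s), sr: sample rate (Hz).
    """
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * fc * t + index * np.sin(2 * np.pi * fm * t))

# fc/fm = 2 gives sidebands at 440 +/- k*220 Hz, i.e. a harmonic
# series on 220 Hz (example values, not taken from the report)
tone = fm_tone(fc=440.0, fm=220.0, index=2.0, dur=0.5)
```

Raising the modulation index spreads energy into more sidebands, which is what makes a single FM operator such an economical source of rich timbres.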

Acoustics

This approach to the study of sound is probably the best established. On the basis of mathematical equations or computer models, the acoustic behaviour of sound-producing objects is studied in terms of, e.g., acoustic impedance, material properties such as elasticity and stiffness, or resonance and eigenfrequencies. The correctness of these models is based on physical measures that determine the correspondence of the model output with the real sound signal. Very often this correspondence is derived from the frequency spectrum of the signals. In general, one can say that sound produced by simple impacts between rigid, solid objects is well understood. Such sounds usually can be synthesized in such a way that they can hardly, if at all, be distinguished from their natural counterparts, either by measuring them physically or by listening. The situation becomes more complicated when the impact between the objects is complex, e.g., scraping or rolling, when the sound-generating objects are flexible, or when fluids or flowing air become part of the sound-generation process. The dynamics of such systems, e.g., a woodwind instrument, a string instrument, or the human voice, are so complicated that even modern computing facilities cannot accurately calculate the sounds they produce. It is at this stage that another complicating factor comes into play. A small difference when measured physically might be easily audible, while in another instance a much larger physical difference might simply be inaudible. Apparently, the normal distance measures used in physics cannot be used directly to predict the subjective distance perceived between two sounds, bringing us to the next, psychophysical, approach.
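The well-understood case of a simple impact between rigid objects can be approximated, as a rough sketch, by a sum of exponentially decaying sinusoids at the object's modal frequencies; the mode frequencies, decay rates, and amplitudes below are illustrative values, not measurements of any real object:

```python
import numpy as np

def impact_sound(modes, dur=0.5, sr=44100):
    """Modal impact synthesis: a sum of damped sinusoids.

    modes: list of (freq_hz, decay_per_s, amplitude) triples,
    one per resonant mode of the struck object.
    Returns a normalized waveform of dur seconds at sample rate sr.
    """
    t = np.arange(int(dur * sr)) / sr
    y = np.zeros_like(t)
    for freq, decay, amp in modes:
        y += amp * np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)
    return y / max(np.max(np.abs(y)), 1e-12)  # normalize to [-1, 1]

# an inharmonic set of modes, loosely evoking a struck metal object
y = impact_sound([(523.0, 8.0, 1.0), (1310.0, 12.0, 0.6), (2210.0, 18.0, 0.4)])
```

Within this scheme, material impressions are steered mainly by the decay rates and the (in)harmonicity of the mode frequencies, which is why such simple models can already sound convincingly metallic, wooden, or glassy.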

Psychoacoustics

Psychoacoustics is the science concerned with the study of perceptual phenomena such as masking, magnitude scaling, frequency discrimination, pitch perception, intensity discrimination, loudness perception, etc. (see, e.g., Moore, 1997). The object of study is the relation between the spectral and temporal properties of a sound and the perceptual attributes: pitch, loudness, duration, and timbre. The methodology is often borrowed from engineering and is derived from signal-detection theory, information theory, etc. The typical psychoacoustician measures just-noticeable differences, absolute thresholds, or magnitude judgments. The stimuli consist of relatively simple sounds, such as tone pips and noise bursts, which have well-defined spectral and temporal properties like frequency, phase, intensity, and duration. For masking studies and studies of pitch perception, a small number of such stimuli may be combined. The typical subject is intensively trained in long practice sessions; for certain experiments, subjects are provided with feedback on the correctness of their responses. With this methodology, the psychoacoustician tries to study the physiological constraints of the hearing system that determine the response of the subjects. The effects of higher-order processes like attention, motivation, musical experience, education, etc., are minimized. As a consequence of the long training sessions, only a small number of subjects are involved in a typical psychoacoustic experiment.

In contrast to what is often believed, there is no simple relation between the frequency content of a signal and its perceived pitch. Neither is there a simple relation between the physical intensity of a signal and its perceived loudness. Psychoacoustics has contributed much to our understanding of these perceptual attributes. For relatively simple and well-recorded sounds, we now possess computational models by means of which the perceived pitch (for a review see Hess, 1983; Hermes, 1992) and loudness (Moore, Glasberg & Baer, 1997) of a sound can be calculated quite accurately. Similarly, there are accurate models of masking where, for two signals, the extent to which one signal is masked by the other can be predicted (e.g., Dau, Püschel & Kohlrausch, 1996a; 1996b).

By contrast, there are no quantitative models that can relate the physical duration of a sound signal to its perceived duration. For the perceived timbre of a sound, the situation is even more complicated, since there is no direct relation between the physical properties of a sound signal, such as intensity, frequency and phase, and what kind of sound we perceive. The acoustical dynamics of objects with flexible components, fluid, or flowing air are associated with sound signals that have noisy components and complicated onsets. The human listener is extremely sensitive to such details. They play a substantial role in the perceived timbre of a sound and in the human processing of speech. Physical concepts like amplitude, frequency, and phase are not well defined for these components.

Auditory scene analysis

Speech and musical sounds are among the most complex and dynamic sounds we usually hear. Despite their complexity and rapidly changing characteristics, the human listener is normally capable of interpreting these signals without much difficulty. Even if the language is not known, speech can be recognized as such. This shows that the utterances spoken by one speaker and the melodies played by one musical instrument must have some relatively stable properties that bind the rapidly changing sequences of phonemes or tones together into a single utterance or melody. It is well known that pitch and rhythm play a very important role in this respect, both in speech and in music. A lot of research has been done on pitch and rhythm perception, which enables us to estimate the pitch contour and the rhythmic pattern of well-recorded single utterances or melodies quite reliably.

When more than one utterance or more than one melody is present simultaneously, understanding how the human listener derives the pitch contours and rhythmic patterns is one of the great challenges of modern-day hearing research. The analysis of such complicated sounds requires the development of computer algorithms that can make sense of a signal produced by more than one speaker, or of music produced by various instruments playing together. It is likely that these algorithms will only perform satisfactorily when they are capable of determining whether a sound consists of speech or of music, a task that is performed almost automatically and without any significant delay by the human listener.

So we must conclude that there are acoustic properties that bind successive segments of sound together into a single auditory stream. Within such a stream, the strong variability makes it possible for large amounts of information of different kinds to be conveyed.
An auditory environment or soundscape usually consists of a complex of streams, from which the listener can extract information of many different kinds by directing his or her attention to the stream that is most interesting. The field of hearing research that is primarily concerned with studying this perceptual process of binding and streaming is called auditory scene analysis (ASA; Bregman, 1990). The roots of this field of research are in experimental psychology. In summary, there is a perceptual process that binds various sound components into one or more auditory streams on the basis of relatively stable properties. On the other hand, the complex and dynamic structure of each stream supplies the listener with various kinds of meaningful information. Unravelling this complex perceptual process is one of the big challenges faced by current hearing research.

Ecological approach

The ecological approach to sound perception (e.g., Gaver, 1993a; 1993b) has another point of departure. In this approach, it is presumed that in everyday listening we try to reconstruct the meaningful events that generate the sounds. This means that we do not so much listen to the pitch, the loudness, the rhythm, or the timbre of a sound, but rather to the properties of the sound-producing object as far as they have meaning to us. This implies that, at a basic level, we listen to whether sounds are produced by solids, fluids, or air, by large or small objects, what kind of material the sound-producing objects are made of, whether the objects are hollow or solid, smooth or rough, etc. At a higher level, we use this information to mentally reconstruct whether we are hearing a telephone, the slamming of a car door, someone walking up or down stairs, a nightingale, a trumpet, etc. Finally, a sound not only provides information about the physical event that has generated it; the space in which it is generated also contributes to the sound we hear by creating reflections and reverberation. We use this information to hear the distance to the sound source (Bronkhorst & Houtgast, 1999), whether the sound is produced in a small or in a large room, and whether the walls are hard and reflecting, or covered with curtains, books or wooden panels. Hence, the information in a sound is used to reconstruct a mental image not only of the event generating the sound, but also of the environment in which the sound is generated. While investigators often use the ecological approach to try to find out which physical cues the listener uses, it appears difficult to draw any general conclusions. For instance, it is often said that larger objects produce lower frequencies, so low-frequency sounds are mostly associated with large objects. On the other hand, the thicker a large, flat surface, the higher the frequencies produced.
So a larger dimension, in this case thickness, is coupled not with a lower frequency but with a higher one. This illustrates that an arbitrary sound often cannot unambiguously be associated with one well-specified physical event. Actually, most sounds we hear can be interpreted in multiple ways. Under normal circumstances, the context in which the sound is heard will remove this ambiguity to a large extent. This does not mean, however, that any arbitrary sound can be associated with any physical event: the set of physical events that can be associated with a sound is limited. On the other hand, it cannot be denied that there are sounds that listeners almost invariably classify as produced by large objects, while other sounds are classified as being produced by small objects. If the perceptual properties of a sound are such that the mental image it evokes does not fit the physical context in which it is produced, it is most likely to be perceived as a separate sound that is not associated with the physical event that produced it.

Synthetic soundscapes

So far, we have paid little attention to the human ability to localize sound in space. Sound localization has been part of psychoacoustics and, to a lesser extent, of ASA research (Blauert, 1997). It provides the knowledge needed to design sound fields in such a way that, within a restricted area, the sounds can be perceived to come from almost any direction. This can be implemented through a multi-speaker system that generates the appropriate wavefront around the listener (e.g., Berkhout, De Vries & Vogel, 1993), or through headphones that present the sound to the eardrum exactly as it was recorded at the eardrum of the listener (e.g., Blauert, 1997). The disadvantage of the multi-speaker system is that the location in the room where the spatial effects are best perceived usually is quite small. The disadvantage of the headphone system is that good performance requires the so-called head-related transfer functions of the listener to be known, and head-related transfer functions are difficult to measure and differ from individual to individual. However, we have a proper understanding of these problems and their potential solutions. Although it may be costly, the implementation of spatial hearing largely is a solved problem, technically speaking. Thus, we know the factors that determine where the listener perceives a sound source. We additionally have computational algorithms to predict the perceived pitch and loudness of single, well-recorded sounds. What kind of physical event the listener associates with a sound, however, largely is an unsolved problem. This leads to one of the most challenging aspects of modern non-speech audio technology: finding adequate paradigms for designing synthetic soundscapes.
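As a rough illustration of the headphone approach, binaural rendering amounts to convolving a source signal with a left-ear and a right-ear head-related impulse response (HRIR). The HRIRs below are crude stand-ins (a pure interaural delay and attenuation), not measured transfer functions, and all names are illustrative:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Render a mono signal for headphone playback by convolving it with
    an HRIR pair for one source direction."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)

# Toy HRIRs: the right-ear signal is delayed and attenuated relative to
# the left, crudely mimicking a source on the listener's left side.
fs = 44100
hrir_l = np.zeros(64); hrir_l[0] = 1.0
hrir_r = np.zeros(64); hrir_r[30] = 0.5   # ~0.7 ms interaural delay

burst = np.random.default_rng(0).standard_normal(fs // 100)
stereo = render_binaural(burst, hrir_l, hrir_r)
```

Measured HRIRs would replace the two toy filters; the rendering step itself is unchanged, which is why spatialization is considered a technically solved problem.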
For the visual modality, new computer technology is widely available: graphical software packages and animation software provide images that can be presented to the participant by means of head-mounted displays, high-resolution 3D video displays, or even immersive projection displays, such as the CAVE. Trackballs and joysticks with force feedback, and data-glove systems, can provide input to the haptic senses. We are now able to design virtual objects that look like, for example, a teapot with a smooth surface and a certain size, form, and colour. With modern force-feedback technology we can make the user feel as if he or she is carrying such an object in his or her hand. But can we synthesize the sound that is produced when the participant puts this teapot on a table, or when the teapot is dropped on the floor and breaks? With physical modelling alone, we can come a long way towards producing the sound of the teapot being put on the table, since a teapot and a table consist of inflexible solid material. But it is still very difficult to synthesize other common sounds, such as the sound of breaking, or of a marble rolling on a floor. If we need sounds to be coupled to certain events, e.g., in a computer animation, we will normally resort to natural or studio recordings of such events. No software packages are available to design these sounds synthetically and interactively. The simple reason is that we do not have a fundamental understanding of the signal properties that determine, for instance, when a sound is perceived as coming from a metal or a wooden object, from a large or a small object, when the object is perceived as hollow or solid, or when it is perceived as scraping or as rolling over a surface.
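A common starting point for synthesizing such impact sounds is modal synthesis: the struck object is approximated by a small set of resonant modes, each contributing an exponentially decaying sinusoid. The mode frequencies, decay times, and amplitudes below are invented for illustration; in the perceptual literature, short decays tend to be heard as wood and long ringing decays as metal:

```python
import numpy as np

def impact_sound(freqs_hz, decays_s, amps, fs=44100, dur=0.5):
    """Modal synthesis sketch: an impact is modelled as a sum of
    exponentially decaying sinusoids, one per resonant mode."""
    t = np.arange(int(fs * dur)) / fs
    sig = np.zeros_like(t)
    for f, tau, a in zip(freqs_hz, decays_s, amps):
        sig += a * np.exp(-t / tau) * np.sin(2 * np.pi * f * t)
    return sig / np.max(np.abs(sig))   # peak-normalize

# Hypothetical modes: identical frequencies, but the fast decays suggest
# a wooden object and the slow decays a metallic, ringing one.
wooden = impact_sound([300, 820, 1900], [0.03, 0.02, 0.01], [1.0, 0.5, 0.3])
metallic = impact_sound([300, 820, 1900], [0.6, 0.5, 0.4], [1.0, 0.5, 0.3])
```

The open problem named in the text is precisely the inverse mapping: which modal parameters make a listener report "hollow", "metal", or "rolling" is not yet understood well enough to drive such a synthesizer automatically.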

Non-speech audio and user interfaces

Problems like the ones just mentioned are not the only reason why non-speech audio has yet to become an integral part of the user interface of modern computers. There are two additional reasons. The first has to do with the necessary binding of the synthetic sounds with the visual and the haptic aspects of the event the sounds are meant to represent. Integration of sound in a user interface not only requires the 'image' it evokes to fit the virtual context in which it is generated; the perceptual moment of occurrence of the sound must also be accurately timed with respect to the visual and the haptic events. It appears that when we move an object over a surface, or when we see a sound-generating event, we are very sensitive to the timing of the sound with respect to what we see or feel. When the sound comes too early or too late, it is not perceived as part of the event it is meant to be coupled with (Van de Par & Kohlrausch, 1999). At best, it will be lost in the background noise to which no significance is attributed. A second factor preventing the integration of sound in user interfaces relates to the fact that most synthesized sounds become very annoying after repeated hearing. Most user-interface sounds are designed as warning sounds, which essentially have the purpose of attracting attention. When the perceived properties of a sound have little to do with what it indicates, when the sound is the same every time it is presented, or when it is designed in such a way that it must attract the listener's attention, the user will inevitably become bored or annoyed. The greater the perceived urgency of the sound, and the more often it is produced, the sooner this moment occurs.
This problem can only be solved by designing user-interface sounds in such a way that their meaning is immediately clear to the listener: they must have natural variability and, ideally, the user should not realize that they are there, just as one does not consciously hear one's own footsteps. The only exceptions are attention sounds and warning calls, whose function is to attract our attention; but these should be used sparingly, lest they become irritating or fail to draw the listener's attention when it is really necessary. All other sounds in artificial soundscapes, and in a user interface in particular, must be as inconspicuous as possible. The best result we can obtain is to design them in such a way that we do not realize that they are there.

Discussion

In this paper we have described four different approaches to the study of sound design. We have argued that the effective design of user-interface sounds requires all four approaches. We can illustrate this with some simple examples. The first example is the doorbell. If this warning sound is not loud enough, we may not hear it when the radio is on or when we are vacuum cleaning. Of course, we can make it much louder, but experience tells us that warning sounds become very annoying if they are much more than 20 dB above the environmental sound level. The effective design of warning sounds requires that their level and frequency content be adapted to the level and frequency content of the environmental sounds. The algorithms that can do these calculations (Laroche, Tran Quoc, Hétu & McDuff, 1991) are based on detailed knowledge of human loudness perception.
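The full Detectsound-style computation involves band-by-band masked thresholds and loudness models; the level rule alone can, however, be caricatured in a few lines. The function name and the 15 dB detectability margin are assumptions made for illustration; only the roughly 20 dB annoyance cap comes from the text above:

```python
def warning_level_db(masked_threshold_db, ambient_db,
                     margin_db=15.0, cap_db=20.0):
    """Place a warning sound comfortably above its masked threshold, but
    never more than ~20 dB above the ambient level, beyond which warning
    sounds are experienced as very annoying."""
    return min(masked_threshold_db + margin_db, ambient_db + cap_db)

# Quiet room: the detectability margin determines the level.
quiet = warning_level_db(masked_threshold_db=35.0, ambient_db=40.0)
# Loud workshop with a high masked threshold: the annoyance cap limits it.
loud = warning_level_db(masked_threshold_db=92.0, ambient_db=85.0)
```

A real implementation would apply such a rule per frequency band, using a loudness model to derive the masked thresholds from the measured ambient spectrum.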

More complex systems, such as the medical monitoring systems in intensive-care departments, contain various subsystems that can produce a variety of warning sounds. If the properties of these sounds are not designed to work with each other, the sounds can mask each other, be difficult to localize, or convey inadequate degrees of perceived urgency. Different sounds may merge into one auditory stream, so that some of the signals may be ignored. Designing systems with a large number of warning sounds therefore requires knowledge of ASA. In the design of user interfaces, it is argued that the sounds used must fit the metaphorical context for which they are used. For instance, the rolling of a trackball is best coupled with a rolling sound, while scrolling the scroll bar should sound like scraping. Other simple virtual impacts, such as hitting and bouncing, e.g., the cursor hitting the boundaries of open windows, must be coupled with simple ticks or bouncing sounds. Also, the kind of material perceived must fit the virtual material used in the metaphor (Hermes, 1998). Hence, the designer of user-interface sounds must have knowledge of ecological sound perception. If not, the designer can only try to model the sounds in accordance with acoustics, but the resulting algorithms require very intensive calculations; the required computation time soon becomes prohibitive for real-time interactive programmes. These examples illustrate that adequate and effective design of auditory interfaces requires knowledge of acoustics and of a wide range of different fields of hearing research. If one of these fields is systematically ignored, the resulting interfaces or soundscapes may show deficits of a more or less serious nature. Some of these shortcomings may perhaps not be immediately disturbing or annoying.
Long-lasting or repeated working within these virtual auditory environments, however, is bound to result in annoyance, fatigue, or an increased number of errors. This may explain why most computer programmes would hardly be different if they had been designed for deaf people. If the sounds are not well designed, they had better be omitted altogether. In conclusion, the design of soundscapes and auditory interfaces requires expert knowledge from a wide range of sciences.

Acknowledgements We would like to thank Willy Wong and Adrian Houtsma for reading and commenting on this manuscript.

References

Berkhout, A.J., Vries, D. de & Vogel, P. (1993). Acoustic control by wave field synthesis. Journal of the Acoustical Society of America, 93(5), 2764-2778.
Blauert, J. (1997). Spatial Hearing: The Psychoacoustics of Human Sound Localization. Cambridge, MA: MIT Press.
Bregman, A.S. (1990). Auditory Scene Analysis. Cambridge, MA: MIT Press.
Bronkhorst, A.W. & Houtgast, T. (1999). Auditory distance perception in rooms. Nature, 397(6719), 517-520.
Chowning, J.M. (1973). The synthesis of complex audio spectra by means of frequency modulation. Journal of the Audio Engineering Society, 21(7), 526-534.
Dau, T., Püschel, D. & Kohlrausch, A. (1996a). A quantitative model of the "effective" signal processing in the auditory system. I. Model structure. Journal of the Acoustical Society of America, 99(6), 3615-3622.

Dau, T., Püschel, D. & Kohlrausch, A. (1996b). A quantitative model of the "effective" signal processing in the auditory system. II. Simulations and measurements. Journal of the Acoustical Society of America, 99(6), 3623-3631.
Edworthy, J. & Adams, A. (1996). Warning Design: A Research Prospective. London: Taylor & Francis.
Gaver, W.W. (1993a). What in the world do we hear? An ecological approach to auditory source perception. Ecological Psychology, 5, 1-29.
Gaver, W.W. (1993b). How do we hear in the world? Explorations of ecological acoustics. Ecological Psychology, 5, 285-313.
Hermes, D.J. (1992). Pitch analysis. In: M. Cooke, S. Beet and M. Crawford (Eds.): Visual Representations of Speech Signals, 3-25. Chichester: John Wiley & Sons.
Hermes, D.J. (1998). Auditory material perception. IPO Annual Progress Report, 33, 95-102.
Hess, W. (1983). Pitch Determination of Speech Signals: Algorithms and Devices. Berlin: Springer-Verlag.
Laroche, C., Tran Quoc, H., Hétu, R. & McDuff, S. (1991). 'Detectsound': A computerised model for predicting the detectability of warning signals in noisy workplaces. Applied Acoustics, 32(3), 193-214.
Moore, B.C.J. (1997). An Introduction to the Psychology of Hearing (4th ed.). London: Academic Press.
Moore, B.C.J., Glasberg, B.R. & Baer, T. (1997). A model for the prediction of threshold, loudness and partial loudness. Journal of the Audio Engineering Society, 45(4), 224-240.
Par, S.L.J.D.E. van de & Kohlrausch, A. (1999). Integration of auditory-visual patterns. IPO Annual Progress Report, 34, 94-102.

Auditory perception of the size and velocity of rolling balls

M.M.J. Houben, D.J. Hermes and A. Kohlrausch¹
e-mail: [email protected]

Abstract

In everyday life a multitude of sounds provides us with information about sound sources, their locations, and their surroundings. In interacting with systems in their environment, users may make use of this wealth of information. In order to create suitable auditory interfaces we need a better understanding of how people perceive everyday sounds. We have chosen to study the sounds of rolling balls, because a rolling ball can be used as a metaphor for cursor movement resulting from moving a mouse or a trackball. In three experiments the perception of recorded sounds of rolling balls was investigated. In the first experiment subjects were presented with pairs of sounds of rolling balls of different sizes and equal velocity, and had to decide which of the two sounds was created by the larger ball. In the second experiment the velocity of the rolling balls was varied and subjects had to choose the faster of two balls equal in size. The results showed that subjects are able to identify the size of rolling balls and that most subjects can clearly discriminate between rolling balls with different velocities. However, some subjects had difficulties in labelling the velocity correctly, probably owing to the lack of feedback about the correctness of their responses. In the third experiment both size and velocity were varied. The results showed a clear interaction between the auditory perception of the size and the velocity of rolling balls.

Introduction

In this study we attempt to integrate two approaches to the study of non-speech sound. Some application-oriented work is based on the assumption that in everyday listening people listen to the properties of the sources that create sound rather than to the musical properties of the sounds (Gaver, 1993). Rather than the dimensions of the sound itself, such as frequency, sound level, and duration, dimensions of the sound's source, such as the type of interaction, material, surface condition, velocity, and geometry of the objects involved, are perceived in auditory object identification. On the other hand, some studies have investigated auditory object identification in a more systematic way using psychophysical procedures. Warren and Verbrugge (1984) studied bouncing and breaking glass bottles, Repp (1987) investigated hand-clapping, Gaver (1988) studied people's abilities to perceive the material and lengths of struck wooden and metal bars, Freed (1990) studied perceptual ratings of mallet hardness for percussive sound events, Li, Logan and Pastore (1991) investigated the ability of subjects to perceive the gender of a human walker, and Lutfi and Oh (1997) studied the auditory discrimination of material changes in a struck clamped bar. In our experiment, the perception of recorded sounds of rolling wooden balls is investigated. We are interested in this type of sound because the rolling ball can be used as

1. Also affiliated with Philips Research Laboratories, Eindhoven, The Netherlands.

a metaphor for movement. Using the sound of a rolling ball to represent movement is an even more obvious choice with a mouse or trackball, because the amount of cursor movement obtained by moving the mouse or trackball is derived from the movement of a little rolling ball inside the device. By adding sound to a trackball with force feedback, three modalities are combined: the tactile, the visual, and the auditory modality. Implementation of auditory information that is consistent with visual and haptic feedback has the potential to increase the ease of use and the direct involvement of users. Three experiments are reported. First, subjects were asked to discriminate differences in the size of rolling balls on the basis of recorded sounds (Franssen, 1998, chapters 3 and 4). Second, the auditory perception of the velocity of rolling balls was examined. Third, the interaction between size and velocity was tested. If subjects are able to correctly associate the recorded sounds with the size or velocity of the rolling ball, the results can be used to search for perceptually relevant acoustic structures produced by the rolling ball, i.e., to find a relation between psychoacoustical and physical dimensions. At a later stage we intend to synthesize parameterized sounds of rolling balls in which the information to be displayed can be mapped to physical dimensions of the rolling ball, such as its velocity and its size.

Experiment I: Perception of the size of rolling balls

Method

Sound recordings of wooden balls rolling over a wooden plate, which was positioned on a table, were used. To reduce the sound produced by the table, a layer of felt was placed between the plate and the supporting table. By releasing the ball from various heights using a gutter, a range of initial velocities was obtained. The recordings were made in the centre of a sound-absorbing room. A microphone (B&K 4003) was positioned halfway along the rolling path, i.e., 50 cm away from the gutter in the rolling direction. The distance from the centre of the rolling path was 42 cm vertically and 40 cm horizontally. The sounds were recorded on DAT tape. Due to the sudden change in slope from gutter to plate, a single tick is heard at the beginning of the sound signal. A metal strip at the end of the plate induces a second tick. The mean velocity was obtained from the inverse of the time between the two ticks in the sound signal. Sound recordings of wooden balls with mean velocities of 0.75 m/s and diameters of 22, 25, 35, 45, 55, 68, and 83 mm were used. The stimuli were presented pairwise; pairs consisted of balls adjacent in size. The stimuli were cut out of the middle of the recorded sounds, thus leaving out the onsets and offsets. The duration of the stimuli was 800 ms, with 700 ms of silence between them. The stimuli were faded in and out over 10 ms by means of a Hanning window. The levels of the stimuli were equalized to RMS values corresponding to 80 dB SPL. Eight subjects participated in the experiment. They were seated in a soundproof booth and listened to the stimuli over Beyer DT 990 headphones. They were not familiar with the sounds. After each pair of sounds, the subjects had to decide which of the two sound examples in the pair was created by the larger ball (2I2AFC procedure). They did not receive any feedback about the correctness of their responses.
It was not mentioned in the instructions that the velocity was the same for all rolling balls. The 12 stimulus pairs were presented four times in random order and were preceded by 10 test pairs covering the range of stimuli.
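The stimulus preparation described above (10-ms Hanning on/off ramps and RMS equalization) can be sketched as follows. The digital target RMS is an arbitrary stand-in: the mapping to 80 dB SPL depends on the calibration of the playback chain, which is not specified here:

```python
import numpy as np

def prepare_stimulus(x, fs=44100, ramp_ms=10.0, target_rms=0.05):
    """Fade a cut-out segment in and out with 10-ms raised-cosine
    (Hanning) ramps and equalize its RMS level."""
    n = int(fs * ramp_ms / 1000)
    win = np.hanning(2 * n)      # first half rises from 0, second falls to 0
    y = x.astype(float).copy()
    y[:n] *= win[:n]             # fade in
    y[-n:] *= win[n:]            # fade out
    return y * (target_rms / np.sqrt(np.mean(y ** 2)))

# E.g., an 800-ms segment, as used for the stimuli in Experiment I.
segment = np.random.default_rng(1).standard_normal(int(0.8 * 44100))
stimulus = prepare_stimulus(segment)
```

Equalizing RMS after fading, as done here, makes every stimulus reach exactly the target level regardless of segment content, which is what removes the overall-intensity cue.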

Results

The two permutations of the same stimulus combination, e.g., 35-45 and 45-35, were considered together, resulting in 8 repetitions for every combination. Figure 1 represents the medians and quartiles of the individual percentages of correct responses (Pc) for the various stimulus pairs. The upper and lower boundaries of the 95% confidence interval for guessing correspond with Pc values of 61% and 39%, respectively. This means that a listener's responses deviated significantly from guessing if the Pc was higher than 61% or lower than 39%. The latter case implies that the listener was able to discriminate the size of a rolling ball, but consistently mistook the smaller ball for the larger one or, in other words, was not able to label the sounds correctly.
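Such guessing bounds follow from the binomial distribution of correct responses under chance performance (p = 0.5). A generic sketch of the exact two-sided computation is given below; the paper's precise trial counts behind the 61%/39% bounds are not restated here, so the numbers produced are illustrative rather than a reproduction:

```python
from math import comb

def guessing_bounds(n_trials, alpha=0.05):
    """Exact two-sided binomial bounds for chance performance (p = 0.5):
    the proportions correct beyond which responses deviate significantly
    from guessing, in either direction, over n_trials trials."""
    def p_at_least(k):
        # P(X >= k) under Binomial(n_trials, 0.5)
        return sum(comb(n_trials, i) for i in range(k, n_trials + 1)) / 2 ** n_trials
    k = next(k for k in range(n_trials + 1) if p_at_least(k) <= alpha / 2)
    upper = k / n_trials
    return 1.0 - upper, upper
```

The bounds tighten toward 50% as the number of trials grows, which is why pooling the two presentation orders of each pair (doubling the repetitions) sharpens the test.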



Figure 1: Results of Experiment I. Percentage correct responses Pc is represented as a function of the stimulus pair. The horizontal dot-dashed line is the upper boundary of the 95% confidence interval for guessing. Medians are shown by dots and quartiles by horizontal bars. The number pairs on the x-axis denote the two diameters of the rolling ball in the stimulus pair.

Discussion

Figure 1 shows that subjects are generally able to discriminate between the sounds of rolling balls of different sizes. The percentages of correct responses reveal that the task was neither too easy (which would result in saturated responses of 100% correct) nor too difficult (which would result in about 50% correct). For the stimulus pair comprising the diameters 22 mm and 25 mm, Pc was not significantly above chance. All subjects performed near chance for this stimulus pair, which implies that labelling was not the problem. The relative difference in size between the two balls in this particular pair (14%) is the smallest of all stimulus pairs and probably too small to be perceived by the subjects.

Experiment II: Perception of the velocity of rolling balls

Method

Sound recordings of wooden balls with a diameter of 45 mm, recorded in the same way as described in Experiment I, were used. Seven representative samples were chosen, with mean velocities of 0.36, 0.50, 0.63, 0.71, 0.79, 0.87, and 0.93 m/s. The intensity cue - balls rolling quickly produce a louder sound than balls rolling slowly - was eliminated by adjusting the stimuli to an equal overall RMS value corresponding to 80 dB SPL. All eight subjects of the first experiment participated in a pilot experiment. The same experimental set-up and procedure were used as in Experiment I, except that pairs consisted of the stimuli nearest or second nearest in velocity to the other stimulus in the pair, resulting in 22 pairs altogether. The stimulus pairs were presented four times in random order and were preceded by 7 test pairs covering the range of stimuli. No prior information was given about the fact that the size of the balls was the same for all examples. The variation in performance across subjects was very large: the mean Pc values of individual subjects covered a range of 50% to 80%. Closer inspection of the data indicated that several subjects obviously had difficulties in labelling the sounds correctly - the sounds could be discriminated, but instead of the faster ball, the slower one was chosen, resulting in a Pc value clearly below chance. In addition, some subjects indicated that they found it difficult to distinguish between the size and the velocity of the ball, not realizing that the size was kept constant. For this reason, the main experiment was conducted in a slightly different way. Six naive subjects who had not participated in the earlier experiments were presented with all possible stimulus pairs composed of two different stimuli, i.e., the entire stimulus domain minus the main diagonal, resulting in 42 permutations. Within a session, these 42 stimulus pairs were presented four times in random order. Each subject participated in four sessions. The two permutations of the same combination were treated together, resulting in 32 repetitions for 21 combinations. The stimuli as well as the experimental set-up remained the same as in the pilot experiment.
The difference between this experiment and the pilot experiment was that, before giving a response, the subjects were able to repeat the stimulus pairs as often as they wanted. No feedback was given about the correctness of the responses. It was now explicitly mentioned that the size of the balls was equal for all sound samples.

Results

The results per subject are shown in Figure 2. The squares denote stimulus pairs (both permutations of the same combination considered together). The percentage of correct responses Pc over the 32 presentations of the stimulus pair in question is printed within these blocks. The mean Pc value per subject is printed in the top-left corner of each panel. The upper boundary of the 95% confidence interval for guessing corresponds with a Pc value of 65%.

Discussion

With one exception (see Figure 2, subject 5), the subjects could clearly discriminate between rolling balls with different velocities. Performance was better than in the pilot experiment. A clear trend can be observed: errors primarily occur for stimulus pairs close to the diagonal, implying a small difference between the velocities in a pair. One subject (subject 3) had great difficulty labelling the sounds correctly. He reversed the labelling, resulting in percentages of correct responses close to 0%; his inverted responses are similar to the responses of subjects 1, 2, and 6. These results were confirmed by six other subjects who participated in this experiment for only one session, i.e., 8 repetitions of the 21 stimulus combinations. The average Pc of three of these subjects was between 75% and 95% and thus significantly above chance.

The average Pc of the other three subjects was about 50%. Two of them produced very low Pc values for specific stimulus pairs, compensated by high Pc values for other pairs, indicating that for those two subjects labelling was the problem rather than discrimination. Inspection of the change in the results over the four sessions indicates that the performance of subjects 1 to 4 and 6, who could clearly discriminate between balls rolling at different velocities, remained the same. In other words, they could discriminate velocity during the entire length of the experiment. However, the performance of subject 5 declined enormously during the four sessions, from a mean Pc value of 80% to 30%, indicating an increasing difficulty in labelling the stimulus pairs correctly. The average Pc value of 51% should therefore not be interpreted as reflecting chance performance. So obviously, subjects are able to discriminate between rolling balls with different velocities, but some subjects have difficulty in labelling the sounds correctly. This difficulty may be caused by the lack of feedback during the experiment.

[Data matrix of Figure 2 not reproduced here; mean Pc per subject: subject 1, 93%; subject 2, 95%; subject 3, 11%; subject 4, 76%; subject 5, 51%; subject 6, 85%.]

Figure 2: Results of Experiment II. Percentage of correct responses is depicted per subject and per stimulus combination. In the top-left corner of the panels the mean percentage of correct responses is given.

Experiment III: Interaction between size and velocity

Method

Sound recordings of wooden balls rolling over a wooden surface were used. The sounds were recorded in the same way as described in Experiment I. Both the size and the velocity of the balls were varied. Eight sounds were used: v1s1, v1s2, v1s3, v2s1, v2s3, v3s1, v3s2, and v3s3. Here, v1, v2 and v3 denote velocities of approximately 0.40, 0.60 and 0.90 m/s, respectively, and s1, s2 and s3 denote sizes of 25 mm, 45 mm and 83 mm in diameter, respectively.

This experiment was carried out by twelve subjects who had not participated in the preceding experiments. The stimuli were presented pairwise. The stimulus pairs were presented 8 times in random order (both permutations of the same combination treated together) and were preceded by 4 test pairs covering the range of stimuli. The subjects were seated in a sound-absorbing room. A loudspeaker was used to reproduce the sounds recorded on DAT. The stimuli could not be repeated and the subjects did not receive any feedback on the correctness of their responses. The experiment was divided into two parts, which differed only in the task subjects had to perform and the stimulus pairs presented. One task was to decide which of the two sounds in a pair was produced by the larger ball; the other task was to decide which of the two sounds in a pair was produced by the faster rolling ball. The order of the two tasks was counterbalanced. The stimuli used in the velocity-judgment task were v1s1, v1s3, v2s1, v2s3, v3s1, and v3s3. Pairs were constructed by combining two stimuli with different velocities, resulting in 12 combinations. The stimuli used in the size-judgment task were v1s1, v1s2, v1s3, v3s1, v3s2, and v3s3. Pairs consisted of stimuli with different sizes, also resulting in 12 combinations.

Results

Figure 3 visualizes the results of judging the size of the rolling balls, averaged across subjects. The percentage of correct responses is depicted per stimulus pair, comprising the stimulus in the title of the panel and the corresponding stimulus positioned at the abscissa. Grey circles denote stimulus pairs with no difference in velocity (only the size differs). Black crosses denote stimulus pairs for which the two stimuli differ in size as well as in velocity. From left to right within a panel, the difference in size between the two stimuli in a pair increases. The upper boundary of the 95% confidence interval for guessing corresponds with a Pc value of 58%.

Figure 3: Results of Experiment III, size judgment task. Percentage of correct responses averaged across subjects is depicted per stimulus pair, comprising the stimulus in the title of the panel and the corresponding stimulus positioned at the abscissa.

In Figure 4, the results of judging the velocity are depicted. Grey circles denote stimulus pairs equal in size but with different velocity, and black crosses denote stimulus pairs for which the two stimuli differ in both size and velocity. From left to right within a panel, the difference in velocity between the two stimuli in a pair increases. Some pairs are shown twice in different panels to facilitate comparison between pairs.

Figure 4: Results of Experiment III, velocity judgment task. See Figure 3 for a description.

Discussion

Figure 3 shows that all Pc values are above the upper boundary of the 95% confidence interval for guessing (Pc of 58%) for the size judgment task. All Pc values lie around 90%, except for the pair comprising stimuli v1s2 and v3s3, with a Pc value of 68% (see bottom right panel). The larger ball in this pair also has a greater velocity. When the difference in size is larger (pair comprising v1s1 and v3s3) or when the difference in velocity is eliminated (pair comprising v3s2 and v3s3), subjects are much better at identifying the larger ball. So subjects are generally able to discriminate the size of rolling balls, even when the velocity differs, though this may decrease performance.

As can be seen by comparing Figures 3 and 4, judgment of the velocity of rolling balls is not as easy as judgment of their size for the presented sounds. Increasing the difference in velocity (from left to right within a panel) usually increases Pc slightly. Increasing the difference in size from no difference (grey circles) to a difference (black crosses) improves performance in the top left and bottom right panels (for which the faster ball is also the larger ball) and diminishes performance in the top right and bottom left panels (for which the faster ball is also the smaller ball). Evidently an increase in velocity accompanied by a decrease in size improves the identifiability of velocity, while subjects have more difficulty identifying the velocity when both velocity and size are increased. The reason for this interaction may be a conflicting change in spectral cues when both size and velocity of a rolling ball are increased. Roughly speaking, the brightness of the sound of a rolling ball decreases with its size and increases with its velocity. So if the velocity increases while the size decreases, the change in timbre caused by the decrease in size reinforces the change in timbre caused by the increase in velocity. If, however, both velocity and size increase, the changes in timbre may counteract and diminish the identifiability of size or velocity.
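The 58% guessing boundary used here follows from the binomial statistics of guessing. A minimal sketch, assuming responses were pooled over 12 subjects × 8 presentations = 96 trials per pair (an assumption; the report does not state the pooling explicitly), with a one-sided normal approximation:

```python
import math

def guessing_upper_bound(n_trials, p_guess=0.5, z=1.645):
    """One-sided 95% upper bound on the proportion correct under pure
    guessing, using the normal approximation to the binomial
    (z = 1.645 for a one-sided 95% bound)."""
    return p_guess + z * math.sqrt(p_guess * (1 - p_guess) / n_trials)

# Assumed pooling: 12 subjects x 8 presentations = 96 trials per stimulus pair.
print(round(guessing_upper_bound(96) * 100))  # about 58 (%)
```

With this pooling, the bound evaluates to about 0.584, matching the 58% quoted in the text.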

General discussion

Experiment I showed that subjects are able to discriminate the sounds of rolling balls of different sizes. Experiment II showed the ability of subjects to discriminate between the sounds of rolling balls with different velocities. However, some subjects had difficulties labelling the velocity correctly. Experiment III indicated that when both the size and the velocity of a rolling ball are varied, subjects are generally still able to discriminate the size and velocity, but an interaction effect is clearly present in the results.

Due to the labelling difficulty exhibited by some subjects and the interaction between size and velocity, one should be careful when using sounds of rolling balls with different sizes and velocities as a stand-alone signal rather than integrating them with other types of feedback. One might expect that in a multimodal application, where multiple cues are available, these problems are easily solved.

To investigate what kind of information within the sounds subjects use to judge the size and velocity, future experiments are planned in which the recorded sounds will be manipulated by combining different aspects of different sounds into a new one. It is possible that subjects use spectral cues, such as the ratio between high and low frequencies, to judge size, and temporal cues, such as the quasi-periodicity induced by the ball's imperfect sphericity, to judge velocity. Therefore, the sounds should be manipulated by combining the temporal characteristics of one sound with the spectral characteristics of another. These modified sounds will serve as a basis for perception experiments, which might help to unravel the perceptual cues for size and velocity. In addition, they will serve as a basis for the synthesis of rolling sounds, for which certain parameters can be adjusted in such a way that the listener perceives the sound as that of a ball of a well-defined size, rolling with a well-defined velocity.


Integration of auditory-visual patterns

S.L.J.D.E. van de Par and A. Kohlrausch¹
e-mail: [email protected]

Abstract

An intrinsic aspect of human processing of multimodal streams of information is the grouping of stimuli across modalities. Without much effort we are able to link acoustic speech to the visible movements in the face of a human speaker, for example, even if we are not familiar with the speaker's voice at all. Similarly, in daily life we easily link non-speech auditory stimuli to the objects that generate them. In applications such as graphical user interfaces with interactive audio or in virtual reality, it is essential to understand how interactive audio can be designed in such a way that users are able to link it to objects within the visual display. Without this link the audio is bound to stand apart from the visual information, which can greatly reduce the added value of audio in the context of interfaces and virtual reality applications. In this study experiments were done to investigate the users' ability to link or group synthetic auditory and visual streams. For this purpose, ways were explored to achieve auditory-visual grouping. Furthermore, sensitivity to auditory-visual synchrony of these stimuli was investigated, as was the ability of users to select objects on the basis of auditory-visual grouping. Results indicate that human information processing incorporates robust mechanisms for grouping auditory-visual stimuli. These mechanisms, however, seem to be rather vulnerable to auditory-visual delays.

Introduction

In daily life we usually have little difficulty deciding whether perceptual events that occur within different sensory modalities do or do not result from a single physical event. This so-called crossmodal binding or grouping is an important property of our perceptual system that supports the integration of multi-sensory information. Here is an example of crossmodal binding: in a crowd of people we can easily link a voice to the corresponding face of the speaker, even if we are not familiar with this person and even when various other voices are audible at the same time. Similarly, we can link the sounds of footsteps to a specific person within view in that crowd.

Several factors can be thought of that contribute to crossmodal binding. Generally speaking, congruence across modalities seems to be of prime importance. One can think of congruence in time, in place, and in stimulus content. The temporal synchrony with which events occur in different sensory modalities is a necessary consequence of the way the physics of, e.g., auditory and visual stimulus generation works, and may be an important source of information contributing to crossmodal binding. A human speaker normally needs to move his lips to produce speech utterances. The lips move in close synchrony with certain acoustical events in the speech signal. Similarly, the sound of a footstep only occurs in synchrony with the moment that the foot makes contact with the floor, a distinctly visible event. Examples where synchrony is required for crossmodal interactions are reported by McGrath and Summerfield (1985) for the McGurk effect, and

1. Also affiliated with Philips Research Laboratories, Eindhoven, The Netherlands.

by Sekuler, Sekuler and Lau (1997) for the interpretation of visual trajectories. These effects may indicate that crossmodal synchrony is required for integration or binding of multi-sensory information.

Congruence in perceived localization of, e.g., the visual and auditory stimulus is likely to be an additional factor contributing to crossmodal binding. This so-called co-location is a natural consequence of the physical way in which stimuli reach our sensory organs, and of the ability of the human perceptual system to localize stimuli in several sensory modalities.

A third factor is the congruence between the content of perceived stimuli in two or more sensory modalities. In this context, congruence results from a high likelihood for certain combinations of stimuli to occur in several sensory modalities. The observer's experience is essential in this case. For example, we have knowledge about the sound that is typically generated by a car, and we can therefore expect a strong bias to link such a sound to the car that is within view, even if there is no precise co-location and there are no clear auditory-visual events that support temporal synchrony.

In this paper we will focus on temporal congruence, or more specifically auditory-visual temporal synchrony. We shall investigate several aspects of AV synchrony perception in connection with crossmodal perceptual binding. First, some detection experiments are described that measure sensitivities to AV asynchrony. In a next set of binding experiments, the ability of subjects to bind auditory and visual stimuli is explored by investigating how well subjects can compare temporal patterns that are presented visually and auditorily. In a third set of selection experiments, we investigate whether subjects can make links between auditory and visual stimuli when there are several competing stimuli, and see how quickly such a task can be executed.

Auditory-visual asynchrony detection

In this experiment, thresholds are reported for the detection of auditory-visual asynchrony. Thresholds were measured with stimuli where the audio was either delayed or advanced with respect to the video.

Stimuli and methods

The stimulus consisted of a white disk against a dark background on a computer monitor, which was visible for 2000 ms. At the moment the disk appeared, it accelerated downward linearly until it hit a bar, after which it returned, decelerating linearly, until it came to rest at the end of the 2000-ms interval. The impact with the bar occurred at a random moment within the 2000-ms interval, though always at least 500 ms after the start of the stimulus interval and no later than 500 ms before the end of the stimulus interval. Thus the rates of acceleration and deceleration generally were not identical.

In the 2-Interval 2-Alternative Forced-Choice task, both a test (target) and a reference (control) interval were presented in random order. In both intervals, the visual signal was accompanied by an associated sound. For the reference interval, the sound consisted of a brief tonal signal with a frequency of 500 Hz gated on at the precise moment of impact. The tone had an instantaneous onset and an exponential decay with a time constant of 30 ms. For the test interval, the moment of onset of the auditory stimulus was either delayed or advanced with respect to the moment of impact of the visual stimulus. At the

moment of onset, the auditory stimulus had a sound pressure level of 70 dB and was presented over Beyer Dynamics DT 990 headphones.

A 2-Interval 2-Alternative Forced-Choice procedure with adaptive AV-delay adjustment was used to determine the smallest asynchrony that was detectable. Feedback upon the correctness of the subject's response was given after each trial. The amount of AV delay was adjusted according to a two-down, one-up rule (Levitt, 1971). The initial step fraction for adjusting the AV delay was 2. Thus after two correct answers the AV delay was divided by 2, and after one incorrect answer the AV delay was multiplied by 2. After each reversal of the AV-delay track, the step fraction was reduced by taking its square root until a step fraction of 1.091 was reached. The run was then continued for another 8 reversals. The median of the AV asynchronies at these last 8 reversals was used as the threshold value. At least four threshold values were obtained for each subject and for both audio delays and advances; the mean was taken over these values. Four normal-hearing subjects with normal or corrected-to-normal vision, ranging in age from 23 to 53 years, participated in the experiments.
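The adaptive track described above can be sketched as follows. This is a minimal Python model; the starting delay and the simulated observer are our assumptions, not values from the report:

```python
import math
import statistics

def staircase(respond, start_delay=200.0):
    """Two-down one-up track with multiplicative steps (Levitt, 1971),
    following the procedure described in the text: the step fraction starts
    at 2, is square-rooted after each reversal until it reaches 1.091, and
    the run then continues for another 8 reversals; the threshold is the
    median delay at those last 8 reversals. `respond(delay)` returns True
    when the subject answers correctly. The starting delay is assumed."""
    delay, fraction = start_delay, 2.0
    n_correct, last_dir = 0, 0
    final = []                 # delays at reversals after the fraction settles
    while len(final) < 8:
        if respond(delay):
            n_correct += 1
            if n_correct < 2:
                continue       # two-down: only move after two correct answers
            n_correct, direction = 0, -1
        else:
            n_correct, direction = 0, +1
        if last_dir and direction != last_dir:   # a reversal of the track
            if fraction <= 1.091:
                final.append(delay)
            else:
                fraction = max(math.sqrt(fraction), 1.091)
        last_dir = direction
        delay = delay / fraction if direction < 0 else delay * fraction
    return statistics.median(final)

# deterministic toy observer that detects any delay above 47 ms
threshold = staircase(lambda d: d > 47.0)
```

Note how the successive square roots give step fractions 2, 1.414, 1.189 and then 1.091, matching the value quoted in the text; with the toy observer the track converges to close to 47 ms.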

Results

The results of the auditory-visual detection threshold experiment are shown in Table 1. It is clear that considerably different thresholds are found for conditions where the audio is delayed with respect to the video, compared to conditions where the audio is advanced with respect to the video. Such differences were reported earlier, by Dixon and Spitz (1980) among others. The fact that audio advances are detected better than audio delays may be linked to the difference between the speed of sound and the speed of light. This difference causes variable amounts of audio delay depending on the distance between source and observer. Possibly, the human perceptual system is adapted to tolerate large audio delays, which may in some way be reflected in lower sensitivities to audio delays as compared to video delays, as suggested by Dixon and Spitz (1980), for example.

Table 1: Threshold values for detecting audio advanced and audio delayed with respect to the video.

Subject    Video delay (ms)    Audio delay (ms)
AK         25                  100
11         27                  93
ML         27                  101
SP         38                  47
Mean       29                  85

Threshold values in our experiment are considerably lower than those of Dixon and Spitz (1980). Additional experiments, in which the measurement procedure was changed to be more similar to that of Dixon and Spitz (e.g., by omitting feedback), yielded results closer to theirs. This suggests that the experimental procedure has a strong influence on the experimental results. The low threshold values found in this experiment are more in line with data typically found in temporal order judgment tasks, such as reported by Hirsh and Sherrick (1961).

The different thresholds found for audio delays and audio advances cannot be explained by a simple Thurstonian detection model that uses the difference in timing between auditory and visual events as a decision variable. One has to assume that auditory and visual events are not represented independently on the internal time continuum.
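This argument can be made explicit with a minimal Thurstonian sketch, in which the only free parameter is an illustrative internal-noise value:

```python
import math

def p_correct(delta_ms, sigma_ms=40.0):
    """Proportion correct for an AV offset of delta_ms in a 2I-2AFC task
    under a simple Thurstonian model: the decision variable is the AV timing
    difference plus zero-mean Gaussian internal noise (sigma_ms is purely
    illustrative, not an estimate from the data)."""
    d_prime = abs(delta_ms) / sigma_ms
    # 2AFC: P = Phi(d'/sqrt(2)); with Phi(x) = 0.5*(1 + erf(x/sqrt(2)))
    # the two sqrt(2) factors collapse to erf(d'/2)
    return 0.5 * (1.0 + math.erf(d_prime / 2.0))

# The model is symmetric in the sign of the offset: a 50-ms audio advance is
# predicted to be exactly as detectable as a 50-ms audio delay, contrary to
# the 29 ms vs 85 ms thresholds reported above.
print(p_correct(+50.0) == p_correct(-50.0))  # True
```

Because the model depends only on the magnitude of the offset, no choice of noise level can reproduce the observed asymmetry, which is the point made in the text.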

Auditory-visual binding: Detection of AV jitter

Whereas the perception of AV synchrony in a single auditory-visual event was studied in the previous section, this section explores the ability of human observers to compare auditory and visual temporal patterns. To this purpose, the disk moved according to some predetermined pattern or trajectory. In synchrony with the presentation of this trajectory, a tonal signal was co-modulated with either the momentary position, velocity, or acceleration of the disk. The ability to compare auditory and visual patterns was operationalized by creating two types of AV patterns: one where the auditory and visual patterns matched exactly, and another where the auditory and visual patterns were discordant by a certain amount due to the introduction of temporal AV jitter. The minimum discordance, or jitter, that was needed to discriminate the two types of AV patterns was determined in the experiments described next.

Visual position, velocity, and acceleration: Stimuli and methods

The stimulus interval consisted of a white disk on a computer monitor, which was visible for 2000 ms and moved upward and downward from its starting position according to a certain trajectory. At the same time, a tonal signal was presented whose amplitude was varied in a way that matched the visual trajectory, or that was discordant with the visual trajectory by some amount.

The visual trajectory for each stimulus interval was generated in the following way. During the interval ranging from 200 ms after the start of the interval to 200 ms before the end of the interval, five events were generated, each at a random moment within the interval. For each of these five events, the visible disk moved up and down according to a Gaussian pulse with a standard deviation of 100 ms. The five Gaussian pulses were simply added; thus two pulses that were very close to one another resulted in a larger movement than each separate pulse could have caused. Other experiments, mentioned further on, showed that for this type of experiment a total of five events led to the best performance of the subjects. For the matched interval, the same five events were used to generate pulses that were added to constitute an auditory envelope. This envelope was then used to modulate a 500-Hz tonal signal. This was done in three different ways:
• The tonal envelope was proportional to the height or vertical position of the visual disk as measured with respect to its starting position.
• The tonal envelope was proportional to the velocity of the visual disk.
• The tonal envelope was proportional to the acceleration of the visual disk.

For the discordant interval, essentially the same method was used, except that the pulses that were used to generate the visual stimulus were jittered with respect to the pulses that were used for the auditory stimulus. The AV jitter was varied according to an adaptive procedure that was identical to that used in the previous detection experiment. For each individual pulse, the video was randomly selected to be either advanced or delayed

with respect to the audio. The envelope of the tonal stimulus reached a maximum SPL of 70 dB, 66 dB, and 70 dB for the position, velocity, and acceleration conditions, respectively.

Essentially the same adaptive procedure was used as in the detection experiment to determine the smallest amount of jitter that was needed to discriminate the matched from the discordant interval. Feedback about the correctness of the subject's response was given after each trial. Subjects had to indicate which of the two intervals contained the matched stimulus. Further details of the adaptive procedure and threshold measurement are the same as in the detection experiments described above. Four normal-hearing subjects with normal or corrected-to-normal vision, ranging in age from 23 to 36 years, participated in the experiments.
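The construction of the matched and discordant intervals can be sketched as follows. This Python fragment keeps only the event-and-envelope logic (sampling into an actual waveform is omitted, and the jitter value is only an example):

```python
import math
import random

SIGMA = 0.100   # pulse standard deviation: 100 ms
T = 2.0         # interval duration: 2000 ms

def make_events(rng, n=5):
    """Five event times, each uniform between 200 ms after the start and
    200 ms before the end of the interval."""
    return [rng.uniform(0.2, T - 0.2) for _ in range(n)]

def envelope(events, t):
    """Sum of Gaussian pulses centred on the events; the same function serves
    as the disk's vertical displacement and, in the 'position' condition, as
    the amplitude envelope of the 500-Hz tone."""
    return sum(math.exp(-0.5 * ((t - mu) / SIGMA) ** 2) for mu in events)

def jittered(events, jitter, rng):
    """Discordant interval: each visual pulse is randomly advanced or delayed
    by `jitter` seconds relative to its auditory counterpart."""
    return [mu + rng.choice([-jitter, +jitter]) for mu in events]

rng = random.Random(0)
audio_events = make_events(rng)
visual_events = jittered(audio_events, 0.080, rng)   # e.g. 80 ms of AV jitter
```

In the matched interval the auditory envelope and the visual trajectory use the same event list; in the discordant interval only the visual events are shifted, which is the discordance subjects had to detect.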

Results and discussion

The results in terms of threshold jitter of the AV-binding experiment are shown in Figure 1. From the threshold values that were found, it is clear that subjects are able to perform the task of indicating in which interval the auditory and visual stimuli match. It is also clear that none of the three alternative conditions made it difficult for subjects to determine the matched interval. There seem to be individual differences, however, in performance across the three conditions.

Figure 1: AV jitter thresholds for four subjects for conditions where the envelope of the tone is proportional to visual position, velocity, and acceleration.

It is worth noting at this stage that for some time we have been presenting these stimuli to people who had no knowledge about, or experience with, the stimuli and were simply faced with the three conditions of this experiment. All of them were able to do the task immediately, without any prior instructions about the nature of the stimulus. This suggests that a robust mechanism operates when making these relatively accurate crossmodal comparisons between auditory and visual patterns.

In additional measurements, the frequency of the auditory stimulus was varied instead of its level. The change in frequency depended on the position of the visual stimulus. In these measurements, subjects again had little difficulty performing the task, and similar threshold values were found.

Detection of AV jitter with an overall AV delay

The purpose of these experiments with an overall AV delay was to find out whether subjects can compare similarities in AV patterns even when there is an overall AV delay. For this purpose, jitter was applied to the discordant interval in the same way as in the previous experiment. The only difference was that for each stimulus presentation an additional fixed amount of overall delay or advance was applied to the video in both the matched and the discordant intervals. Thus, in the discordant interval there were differences between the auditory and visual patterns due to jitter and, in addition, the patterns were offset in overall timing with respect to each other. In the matched interval the AV patterns were identical, except for the same overall AV delay that was also applied to the discordant interval. Further properties of the stimuli and the measurement procedure were identical to those used in the previous experiment. For this experiment, only conditions were used in which the level of the auditory stimulus depended on the position of the visual stimulus. The pulse width was set to 25 ms.

Results and discussion

Results are shown in Figure 2. As can be seen, low values are found for the threshold jitter that is needed to discriminate matched from discordant AV patterns, that is, for overall audio delays ranging from -100 to 100 ms. The values are lower than for comparable conditions in Figure 1. This is probably due to the pulse width of 25 ms used in this experiment, compared to the 100 ms used in the previous experiments. For audio delays beyond 100 ms, and especially for audio delays smaller than -100 ms, the ability to compare AV patterns seems to be reduced considerably.

Figure 2: AV jitter thresholds for four subjects as a function of the overall AV delay. Positive delays indicate audio delays, negative delays indicate audio advances.

An important conclusion can be drawn from these experiments. In order to compare AV patterns, it does not seem necessary to compare the patterns in the auditory and visual domain from moment to moment, as one needs to assume for the detection of asynchronies in single AV events.

Therefore, the question arises in what way the jitter is detected. Since the presence of AV jitter introduces differences between the rhythmic patterns that are available visually and auditorily, one could speculate that subjects compare the similarity of the rhythm present in the auditory domain and in the visual domain. Apparently, this comparison can only be achieved successfully for small overall delays.

Another speculation may be that the overall delay is compensated for internally to line up the AV events. After this internal compensation, a moment-to-moment comparison of AV events would allow for the detection of AV jitter.

To gain a better understanding of how AV jitter is detected, additional experiments without overall delay were done in which the number of AV events or pulses was varied. These experiments show that jitter thresholds are smaller when there are 3 to 10 pulses compared to the situation where there is only one pulse. Apparently, the presence of more than one pulse helps in the detection process. The availability of more than one AV pulse of course implies the presence of temporal intervals between consecutive auditory and consecutive visual events, constituting a rhythmic pattern. Such patterns are not available when there is only one AV pulse. Thus it seems that AV jitter is more easily detected than AV asynchrony, and that this may be attributed to a detection strategy based on comparing rhythmic patterns between the auditory and visual domain. This finding may have important implications for AV applications such as internet television, where variable auditory and visual delays may occur.
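The internal-compensation speculation can be formalized as a two-stage model: first estimate the overall delay that best aligns the auditory and visual event sequences, then measure the residual mismatch. A hypothetical sketch, with illustrative event times and search parameters that are ours, not the report's:

```python
def residual_jitter(audio_events, visual_events, max_shift=0.3, step=0.005):
    """Two-stage model of AV jitter detection: search (by brute force, within
    +/- max_shift s) for the overall compensating delay that best aligns the
    two event sequences, then return the mean absolute residual mismatch and
    the compensating delay. Assumes equal numbers of paired events."""
    best = None
    shift = -max_shift
    while shift <= max_shift:
        resid = sum(abs(a - (v + shift))
                    for a, v in zip(audio_events, visual_events)) / len(audio_events)
        if best is None or resid < best[0]:
            best = (resid, shift)
        shift += step
    return best  # (residual jitter in s, compensating delay in s)

audio = [0.30, 0.75, 1.10, 1.42, 1.70]
video = [t + 0.100 for t in audio]      # overall 100-ms video delay, no jitter
resid, shift = residual_jitter(audio, video)
print(round(resid, 3), round(shift, 3))
```

With a pure overall delay and no jitter, the model finds a compensating shift of about -0.1 s and a near-zero residual, mimicking the matched interval; added jitter would raise the residual regardless of the compensating shift.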

Auditory-visual selection

The results that were obtained in the AV-binding experiments showed that subjects are able to compare temporal auditory and visual patterns with a high degree of accuracy. Therefore it is interesting to see how well subjects perform in an experiment where they have to select one visual object from a number of visual objects on the basis of a simultaneous auditory pattern that matches the movement of one of the visual objects. In addition, a second experiment was performed in which a modulated tone had to be selected from a series of concurrent but differently modulated tones on the basis of the movement of one visual object.

In the first of these two selection experiments, a variable number of visual objects was shown, of which one had a temporal pattern of movement that corresponded with the auditory stimulus. The subject's task was to select this visual object as quickly and with as few errors as possible. In the second experiment, the roles of audio and video were reversed, and subjects had to make a selection from a series of modulated tones while looking at one visual object.

Results show that subjects can perform the task reasonably quickly. For the condition with two visual objects an average response time of 1.0 s was measured; with six visual objects an average response time of 2.6 s was measured. Similarly, in the second selection experiment average response times of 1.2 and 3.2 s were measured for two and six tones, respectively.

Results show a roughly linear dependence on the number of items from which subjects had to make a selection. This is in line with a 'serial search model' such as that described in Luce (1986) for memory search. An implementation of such a model in the context of our experiments would assume that subjects use a strategy in which they have to observe each object or tone for a short time in sequence until they find an object or tone that matches the pattern presented in the other sensory modality. This suggests that AV binding is achieved through a process that requires attention. Reported introspection of subjects performing this selection task confirms this suggestion.
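Given the two measured set sizes per condition, the implied serial-search parameters follow from a straight line through the data points. The sketch below uses the response times quoted above; the interpolation itself is ours:

```python
def rt_model(n_items, rt_two, rt_six):
    """Serial-search prediction RT(n) = intercept + slope * n, with slope and
    intercept taken from the two measured set sizes (n = 2 and n = 6)."""
    slope = (rt_six - rt_two) / 4.0          # seconds per additional item
    intercept = rt_two - 2.0 * slope
    return intercept + slope * n_items

# visual-object selection: 1.0 s at n = 2, 2.6 s at n = 6  ->  0.4 s per item
# tone selection:          1.2 s at n = 2, 3.2 s at n = 6  ->  0.5 s per item
print(round(rt_model(4, 1.0, 2.6), 3))   # 1.8
print(round(rt_model(4, 1.2, 3.2), 3))   # 2.2
```

The slopes of roughly 0.4 and 0.5 s per item are the per-item inspection times a strict serial search would attribute to the two conditions.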

Conclusions This study investigated the perceptual binding between auditory and visual stimuli. The results show that with simple means it is possible to make perceptual links between audi­ tory and visual stimuli. The results can be summarized as follows: • The asynchrony detection experiments, with feedback, showed a high sensitivity (29 ms) to perceiving AV asynchronies where the audio leads the video and less sensitivity (85 ms) when audio lags behind the video. Such low values are in line with data found in crossmodal temporal order judgment experiments (e.g., Hirsh & Sherrick, 1961). They are much lower, however, than thresholds found in syn­ chrony judgment experiments, without feedback, conducted by ourselves and by Dixon and Spitz (1980) for example. • In the binding experiments where temporal auditory and visual patterns had to be compared, subjects performed well independent of the precise way in which the auditory and visual pattern covaried. The most important factor that seems to be necessary to achieve binding is that the temporal A V patterns as such are iden­ tical and that they covary in a synchronous way. The synchrony constraint can be relaxed to a certain extent; the limit for successful binding seems to be an overall A V delay of about 100 ms. • The selection experiments showed that such stimuli as used in the binding exper­ iments can be used to make perceptual links between auditory streams and visual objects when there are competing streams or objects. For limited numbers of ob­ jects (2 to 6) this linking can be achieved in a matter of seconds and roughly lin­ early depends on the number of objects. This may indicate that attention is re­ quired for the process of A V binding.

The ability to make perceptual links between auditory and visual information is of relevance in virtual environments, film and computer animations, and may also be impor­ tant in user-interface design where both the visual and auditory modality are used to con­ vey information to the user.

Acknowledgements The authors thank Adrian Houtsma and Willy Wong for their comments on this manu­ script.

References

Dixon, N.F. & Spitz, L. (1980). The detection of audio-visual desynchrony. Perception, 9, 719-721.
Hirsh, I.J. & Sherrick Jr., C.E. (1961). Perceived order in different sense modalities. Journal of Experimental Psychology, 62, 423-432.

Levitt, H. (1971). Transformed up-down methods in psychoacoustics. Journal of the Acoustical Society of America, 49, 467-477.
Luce, R.D. (1986). Response times: Their role in inferring elementary mental organization. Oxford: Oxford University Press.
McGrath, M. & Summerfield, Q. (1985). Intermodal timing relations and audio-visual speech recognition by normal-hearing adults. Journal of the Acoustical Society of America, 77, 678-685.
Sekuler, R., Sekuler, A.B. & Lau, R. (1997). Sound alters visual motion perception. Nature, 385, 308.

SPOKEN LANGUAGE INTERFACES

Developments

R.N.J. Veldhuis
e-mail: [email protected]

Introduction

The goals of the Spoken Language Interfaces programme (SLI) are (1) to demonstrate the applicability of speech as a modality for user-system interaction and (2) to do internationally competitive research contributing to the improvement of speech-based interaction. The research on the improvement of speech-based interaction focusses on dialogue management and the generation of spoken messages, but a certain level of expertise on speech input is also maintained.

The SLI products are demonstration models or prototypes of speech interfaces for real-life applications, algorithms or devices that give a proof of concept for newly developed speech technology, and reports of SLI research in the form of journal papers, conference papers and PhD theses.

In the last two years or so, consensus has grown in the speech research community that the quality and intelligibility of present-day diphone-based speech synthesis systems are insufficient for a large number of applications. A new and promising approach, called unit-selection-based speech synthesis, has emerged. In this data-driven approach, units, which can be substantially larger than diphones, are selected from a speech database and concatenated with no or minimal prosodic modification. This approach has already proven to provide good results for limited domains. Therefore, in 1999 SLI started to shift the focus of its speech-generation research to unit selection.

Another insight is that in many applications speech is only one of the modalities of the interaction. Although telephone-based interaction is mono-modal at present, this may soon change. Mobile phones, in fact, are already equipped with a small visual display, which offers possibilities for presenting information, and web phones are on their way to the market. This development is having an impact on speech technology and makes it necessary for speech interfaces to be studied in multimodal contexts.
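Unit selection as described here is usually implemented as a dynamic-programming search over candidate units. The following generic sketch (not IPO's actual system; all names and costs are illustrative) shows the idea of minimizing summed target and concatenation costs:

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style unit selection: choose one database unit per target
    segment so that the sum of target costs (how well a unit fits its
    segment) and join costs (how well consecutive units concatenate)
    is minimal. `candidates(segment)` lists the database units available
    for a segment."""
    # best maps each candidate unit to (total cost, chosen sequence so far)
    best = {u: (target_cost(targets[0], u), [u])
            for u in candidates(targets[0])}
    for seg in targets[1:]:
        new_best = {}
        for u in candidates(seg):
            cost, path = min(
                (prev_cost + join_cost(prev_u, u) + target_cost(seg, u),
                 prev_path)
                for prev_u, (prev_cost, prev_path) in best.items()
            )
            new_best[u] = (cost, path + [u])
        best = new_best
    return min(best.values())   # (lowest total cost, selected unit sequence)
```

With a zero join cost for units that were adjacent in the database, such a search naturally prefers long contiguous stretches of recorded speech, which is why the approach needs little or no prosodic modification.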
SLI's research was focussed on usability aspects, system requirements, the acquisition of capabilities, the development of technology, the realization of a complete interface or parts of it, and on the characteristics of speech or language. Below, some highlights are presented per project.

Projects

Multi-appliance multi-user technology for voice-controlled applications in the home (MAMUt)

The MAMUt project aims to study and identify user requirements of voice-controlled systems in the home environment, consisting of Multiple Appliances and operated by Multiple Users together. In close cooperation with Philips Research, last year's work concentrated on voice control of a simple audio set, operated by single users having free choice

between voice input and conventional remote control. Two state-of-the-art automatic speech recognizers were compared with a simulated 'ideal' recognizer in a Wizard-of-Oz set-up, and the influence upon the user's satisfaction was studied for various feedback conditions. The main conclusions are that users feel positive about voice-controlled interfaces with feedback and that the performance of the speech recognition has a major influence upon the appreciation of voice input. To resolve speech-recognition errors, users appeared to choose different strategies, depending on the length of a speech-recognition error sequence. Users gave up on speech input after trying, at most, five times to get the intended result.

This year, multi-appliance and multi-user aspects were studied. In a first explorative experiment, the appliances consisted of five different audio/video devices, controlled by a single user. A Wizard-of-Oz set-up simulated perfect recognition of voice commands, and a butler concept was introduced to facilitate operation and to disambiguate incomplete commands. Results show that the majority of users greatly appreciated the butler because of its user-friendliness and ease of use. A human-like butler was preferred to a robot-like butler, and subjects who chose a human-like butler addressed their butler on average about seven times more often than those who preferred a robot-like butler. A second exploratory study has been started on voice control involving two users and one single appliance.

Multimodal access to transaction and information systems (MATIS)

The project is a joint activity of Nijmegen University, Tilburg University, and Delft University of Technology in the framework of the IOP-MMI funding programme. The industrial partners are KPN Research and Philips Aachen. The project is exploring the combination of speech, text, and graphics as interaction modalities in two types of information and transaction services:
1. A service with a limited visual presentation, e.g., an existing telephone information service combined with a low-cost terminal such as a GSM or web phone. This service allows only limited keyboard and display possibilities. This study aims to find out how visually presented text may help to make the dialogue more efficient, and to assess changes in the verbal behaviour of users in the context of a multimodal interface. Here, the emphasis is on putting together an actual demonstrator in the early stages of the project by means of state-of-the-art technology, and on starting to experiment with real customers as soon as possible. Among other things, these experiments will shed light on the special requirements for automatic speech recognition in a multimodal environment. Engineering these requirements in the domain of speech-recognition technology will be part of the project.
2. A service with a more extensive visual presentation, e.g., an interactive installation manual based on a TV with a set-top box and a telephone. This work aims to explore in more depth the issues that arise when users talk about objects in a visual domain displayed on the screen. Here, the aim is to develop a demonstrator that supports communication and reasoning about the visual domain by the end of the project, and to deal with interaction issues and further extensions in follow-up projects.

The difference in emphasis between the services enables the exchange and integration of acquired knowledge at the end of the development trajectories without creating dependencies that would obstruct the continuation of the sub-projects. For both demonstrators, IPO's responsibilities are the development of the dialogue managers and the user tests.

Interactive portable speech-oriented interfaces (IPSOI)

In the VODIS project, which ended this year, a command-and-control system was developed to control car stereo and car telephone equipment through the use of speech synthesis and speech recognition. In many ways, the VODIS software is specific to the particular hardware used in that project. In the IPSOI project, the results of the VODIS project were used to develop a more generic dialogue-based command-and-control system, using both speech input and speech output. This system can be used as a baseline for the development and evaluation of spoken-dialogue prototypes.

The IPSOI system can be seen as a speech-driven user interface to control one or more components. A component can be a software application or any piece of hardware that can be controlled through software (e.g., stereo equipment or a telephone). The heart of the system is the dialogue manager. On the one hand, the dialogue manager maintains a dialogue with the user to find out what the user wants and to give feedback about the status of the system. On the other hand, it communicates with the components that are hooked up to the system, to implement the user's wishes and to gather information on the components' status.

All modules in IPSOI, including the dialogue manager, the speech synthesis and recognition components, and the software interfaces to connected hardware, run as separate applications. The dialogue manager uses the Windows messaging system to find out which components are available and to coordinate everything. This highly modular structure makes it relatively easy to use different speech synthesis and recognition systems and to add components to be controlled.
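The modular structure described above can be illustrated with a toy message-passing sketch. This is an assumption-laden illustration: the real IPSOI modules are separate applications coordinated via the Windows messaging system, whereas this sketch uses a single in-process registry, and all names are invented.

```python
# Toy stand-in for the message channel between a dialogue manager and
# controllable components (hypothetical names, not the IPSOI API).

class MessageBus:
    """Registry that routes commands to whichever components are available."""
    def __init__(self):
        self.components = {}

    def register(self, name, handler):
        self.components[name] = handler

    def send(self, name, command):
        return self.components[name](command)

class DialogueManager:
    def __init__(self, bus):
        self.bus = bus

    def handle_user_utterance(self, utterance):
        # Very crude parsing: "<component> <command>", e.g. "stereo play".
        component, _, command = utterance.partition(" ")
        if component not in self.bus.components:
            return f"Sorry, I cannot control '{component}'."
        status = self.bus.send(component, command)  # act on the user's wish
        return f"{component}: {status}"             # feedback to the user

bus = MessageBus()
bus.register("stereo", lambda cmd: f"executing '{cmd}'")
dm = DialogueManager(bus)
print(dm.handle_user_utterance("stereo play"))     # stereo: executing 'play'
print(dm.handle_user_utterance("telephone dial"))  # Sorry, I cannot control 'telephone'.
```

The point of the design is the same as in the text: because the dialogue manager only knows components through the message channel, synthesis, recognition, and controlled devices can be swapped without touching the manager.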

Co-habited mixed-reality information spaces (COMRIS)

The COMRIS vision of co-habited mixed-reality information spaces emphasizes the co-habitation of software and human agents in a pair of closely coupled spaces, a virtual and a real one. The COMRIS project uses the concept of a conference as the thematic space and concrete context of work. At a physical-reality conference, people gather to show their results, see interesting things, find interesting people, meet officials in person, or engage in any kind of discussion. In the mixed-reality conference, real and virtual conference activities occur in parallel. Each participant wears a personal assistant, an electronic badge and earphone device, wirelessly hooked into an intranet. This personal assistant, the COMRIS parrot, realizes a bi-directional link between the real and virtual spaces. It observes what is going on around its host (whereabouts, activities, and other people around), and it informs its host about potentially useful encounters, ongoing demonstrations that may be worth attending, etc. This information is gathered by the software agents that participate in the virtual conference on behalf of a real person. Each agent has the purpose of representing, defending, and furthering a particular interest or objective of the real participant, including those interests that the participant is not explicitly attending to. Within the COMRIS project, IPO is responsible for the speech output of the COMRIS parrot and for usability aspects.

In the speech-output work package, the first version of a speech-output module was produced, including a diphone database for the synthesis of an English-speaking female voice. The speech-output module runs on the wearable device and takes its input, consisting of prosodically annotated text, from the Natural Language Generation module, which runs on a server.

In the usability work package, two usability studies were conducted. The COMRIS vision includes a radical information-push model, in which shortcomings of earlier push approaches are cured by leaning heavily on the context of the user. In the first usability study, the acceptance of radical push was explored in a realistic simulation. It was found that users greatly prefer pull to push in a variety of contexts of use. More concretely, users want to be able to decide themselves whether and when the agent delivers a message from virtual space, rather than leaving it to the agent to decide. In a second usability experiment, different interface designs enabling the user to control the parrot functionality were evaluated.

Dialogue management and output generation in spoken-dialogue information systems (NWO-TST)

The Dutch NWO programme for Language and Speech Technology aims to bring language and speech technology in The Netherlands up to the state of the art. In order to focus the work in the programme, an inquiry system for public transportation (Openbaar Vervoer Informatie Systeem, or OVIS) was chosen as the domain of application. IPO is responsible for the dialogue module, which controls the flow of information in the dialogue, for the language and speech generation modules, and for human-factors aspects.

NWO-TST: Dialogue management

The dialogue manager for OVIS-3 has been implemented. The module incorporates the ideas concerning the zooming strategy reported in last year's Annual Progress Report.

NWO-TST: Language generation

Work has been done on the development of the language generation module for OVIS-3, particularly on the implementation of the sentence templates. Because OVIS-3 can perform several dialogue acts, involving all information in the domain, a very large number of templates is required. Currently, over a hundred templates have been implemented to express the most essential dialogue acts. The current set of templates does not cover all combinations of dialogue acts that can be produced by the dialogue manager, so work on the language generation module will continue.

A new version of the well-known Dale and Reiter algorithm for the generation of referring expressions has been developed, which allows for the generation of reduced descriptions, taking into account both previous instances of the current referent and mentions of other entities, using a notion of salience. This year, the referring-expression algorithm was improved and extended to generate different types of descriptions, such as relational descriptions, bridging descriptions, and personal pronouns. The intermediate results of this work were presented at the ESSLLI Workshop on the Generation of Nominals. In addition, the algorithm was successfully implemented and evaluated.
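The incremental algorithm of Dale and Reiter, which the work above extends, can be sketched as follows. The attributes and domain objects here are invented for illustration, and the salience-based reductions of the IPO version are not shown: the core idea is to add attributes of the referent, in a fixed preference order, until all distractors are ruled out.

```python
# Sketch in the spirit of the Dale and Reiter incremental algorithm.

def make_referring_expression(referent, distractors, preferred_attributes):
    """Select attributes of `referent` until all distractors are ruled out."""
    description = {}
    remaining = list(distractors)
    for attr in preferred_attributes:
        value = referent[attr]
        ruled_out = [d for d in remaining if d.get(attr) != value]
        if ruled_out:                      # attribute has discriminatory power
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:
            break
    # None signals that no distinguishing description exists.
    return description if not remaining else None

target = {"type": "train", "departure": "10:15", "platform": "3"}
others = [{"type": "train", "departure": "10:45", "platform": "3"},
          {"type": "bus", "departure": "10:15", "platform": "1"}]
print(make_referring_expression(target, others, ["type", "departure", "platform"]))
# {'type': 'train', 'departure': '10:15'}  ->  "the train leaving at 10:15"
```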

NWO-TST: Speech synthesis

The duration module, which was developed at Lucent Technologies - Bell Labs Innovations, was integrated in the IPO speech-synthesis system. Currently, a listening experiment is running via the internet to compare the output of the new duration module with that of the old one. Additionally, statistics were computed concerning the fit of the new predictions with natural durations and the fit of the old predictions with the natural durations. The root-mean-squared error of the predictions of the new module is about 5 ms smaller than that of the old module, meaning that its fit to the natural data is somewhat better.

Additional work has been done on solving the problem of audible spectral discontinuities at diphone boundaries. In a previous experiment, the occurrence of these discontinuities was investigated. Correlating the results of a listening experiment with several spectral distance measures yielded one measure, the Kullback-Leibler measure, that could reasonably predict the occurrence of audible discontinuities. This distance measure was used to cluster diphone contexts that are spectrally similar. Then, additional context-sensitive vowel-consonant (VC) diphones were recorded in a representative left-consonant context close to the centroid of a corresponding consonant-vowel (CV) cluster.

A new listening experiment was performed with exactly the same task as in the previous experiment, to see if the addition of context-sensitive diphones reduced the number of audible discontinuities. This proved to be the case. The improvement was largest for the /u/, smaller for the /i/, and negligible for the /a/.
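A symmetrised Kullback-Leibler distance between two normalised power spectra is one plausible form of the measure mentioned above; the exact formulation used in the experiments may differ. A small self-contained sketch:

```python
# Hedged sketch: symmetric Kullback-Leibler distance between two spectral
# envelopes (power spectra normalised to sum to one) as a way to quantify
# spectral mismatch at a diphone boundary.
import math

def kl_distance(p, q, eps=1e-12):
    """Symmetric KL distance between two non-negative spectra p and q."""
    p = [max(x, eps) for x in p]     # avoid log(0)
    q = [max(x, eps) for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]          # normalise to probability-like spectra
    q = [x / sq for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return kl_pq + kl_qp

identical = kl_distance([1.0, 2.0, 1.0], [1.0, 2.0, 1.0])   # 0.0: no mismatch
shifted   = kl_distance([1.0, 2.0, 1.0], [2.0, 1.0, 1.0])   # > 0: audible candidate
```

Diphone contexts whose pairwise distances are small could then be clustered as "spectrally similar", which is the step the recording of the extra context-sensitive diphones relied on.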

NWO-TST: Usability

Because the working version of the OVIS-3 system was not available in time, recorded dialogues of another transport information system, VIOS, were analysed. In brief, this analysis (with real users) shows a clear similarity with the results of the lab experiments, where selected users went through a number of scenarios with a simulation. Again, recognition errors are the main source of dialogue failure, but not the only source. Interestingly, recognition errors also occur quite often in successful dialogues, from which we conclude that recognition errors do not necessarily lead to dialogue failure. Due to time constraints affecting the coding of the dialogues, there is no simple way to identify and categorize differences in the causes and treatment of errors between successful and failed dialogues, so it is unclear which strategies lead to successful treatment of problems due to recognition errors. There seem to be many different situations and many factors involved. In order to pursue this point, a much more detailed analysis would be required.

Prosodic characteristics of dialogue

A number of studies on error handling in various types of spoken interactions have been executed. (1) In collaboration with AT&T Research Laboratories (Florham Park, NJ, USA), research was conducted on the use of prosody to detect user utterances that are misrecognized by the speech-recognition component of a spoken dialogue system. This research has shown that prosody, in addition to acoustic confidence scores and information from the recognized string, for example, can be exploited for on-line error spotting. (2) Within the NWO-TST project, analyses were carried out on post-error utterances in human-machine interactions to see which cues people use after they have detected a communication problem. Based on the analysis of 120 human-machine dialogues, it was

found that users more often employ marked feature values, such as the presence of redundant information or a non-standard word order, when the preceding system utterance contained a problem, whereas unmarked feature values are used in response to unproblematic system utterances. Using precision and recall matrices, it could be shown that various cues are accurate problem spotters, which was confirmed in experiments using a memory-based learning technique. In addition, it was found that utterances following an error more often have marked prosody (e.g., a higher pitch range) than utterances that do not follow an error. (3) In collaboration with ATR-MIC (Kyoto, Japan), perception studies were conducted to evaluate the use of prosody to mark different forms of feedback in so-called repetitive utterances; it was found that prosody can accurately signal whether or not transmitted information has been successfully integrated by the communication partner.

Another line of research was concerned with the prosodic marking of information status (whether information is given, new, or contrastive) in Dutch, Italian, and Japanese. Production experiments using a dialogue-game set-up showed that contrastive and new information is prosodically marked in all three languages, although these languages are very distinct in terms of their intonation structure.
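The precision and recall of a single binary cue as a problem spotter are computed in the standard way; a toy illustration with invented data (1 = cue present / utterance followed an error):

```python
# Toy illustration of evaluating one cue (e.g. "marked word order") as a
# spotter of problematic system utterances. The data below are invented.

def precision_recall(predictions, truths):
    tp = sum(1 for p, t in zip(predictions, truths) if p and t)
    fp = sum(1 for p, t in zip(predictions, truths) if p and not t)
    fn = sum(1 for p, t in zip(predictions, truths) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0   # how often the cue is right
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many errors it catches
    return precision, recall

cue   = [1, 1, 0, 0, 1, 0, 0, 1]   # cue present in the user utterance
error = [1, 1, 0, 0, 0, 1, 0, 1]   # preceding system utterance was problematic
print(precision_recall(cue, error))   # (0.75, 0.75)
```

A cue with both values high is an "accurate problem spotter" in the sense used above.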

CoolCall - Voice control of a computer-integrated telephone

CoolCall is a commercial telephone application running on a PC, which is designed to replace the current desktop telephone set. Among other things, it contains an integrated address book allowing contact management, touch dialling, and several essential call-manipulation features. The CoolCall project at IPO is investigating the feasibility and usability of a speech-input interface to control the dialling and call-manipulation features. For instance, by saying a name, CoolCall will automatically establish a telephone connection with the specified contact. One of the main problems is the correct recognition of an uttered contact name. Therefore, rule-based, self-learning, and dictionary look-up methods are evaluated to find a reliable phonetic transcription of proper names that can be used in the speech-recognition process. User requirements and the design of the speech interface are formulated by using the MUSE-SE design methodology, in cooperation with University College London. The speech interface consists of existing automatic speech recognition components. The project has already resulted in a user test of the CoolCall system without voice control, a questionnaire study investigating user needs and desires regarding a computer-integrated telephone, and the software design and implementation of a first version of the voice-control module.
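One of the strategies named above, dictionary look-up with a rule-based fallback, might look like the following sketch. The lexicon entry, the toy letter-to-sound rules, and the SAMPA-like phoneme symbols are all invented for illustration; real name transcription uses far richer rules or self-learning methods.

```python
# Hypothetical sketch: phonetic transcription of proper names by dictionary
# look-up, falling back to (grossly simplified) letter-to-sound rules.

LEXICON = {"jansen": "j A n s @ n"}   # known proper names (invented entry)
RULES = {"j": "j", "a": "A", "n": "n", "s": "s", "e": "@", "p": "p", "t": "t"}

def transcribe(name):
    name = name.lower()
    if name in LEXICON:                # dictionary look-up first
        return LEXICON[name]
    # rule-based fallback: one symbol per letter, unknown letters marked '?'
    return " ".join(RULES.get(ch, "?") for ch in name)

print(transcribe("Jansen"))   # from the lexicon: j A n s @ n
print(transcribe("Patse"))    # from the rules:   p A t s @
```

The resulting transcriptions would be loaded into the recognizer's vocabulary, so that saying a contact's name can trigger dialling.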

Properties of a sinusoidal representation of music and speech

The aim of the project was to investigate the possibility of using a hearing model to obtain an auditory representation of music and speech signals that can be used to predict audibility discrimination thresholds and that can quantify subjective differences. A distance measure based on a model of partial loudness was selected. It was implemented and adapted to the problem of estimating discrimination thresholds related to modifications of the spectral envelope of steady complex tones representing synthetic vowels. Data obtained from subjective listening tests, using a representative set of stimuli in a 3IFC adaptive procedure, show that the model makes reasonably good predictions of the discrimination threshold. Systematic deviations from the predicted thresholds can be related to individual differences in auditory filter selectivity.
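Adaptive forced-choice procedures of the kind mentioned above track a threshold by moving the stimulus level up and down with the listener's answers. As an illustration only, here is a simple 2-down 1-up staircase with a threshold estimate from the reversal points; the actual 3IFC procedure, its step rule, and its sizes may well differ.

```python
# Illustrative 2-down 1-up adaptive track (assumed rule and step size).

def run_staircase(respond, start=8.0, step=2.0, reversals_wanted=6):
    """respond(level) -> True if the listener answers correctly at `level`."""
    level, direction, correct_in_row = start, 0, 0
    reversals = []
    while len(reversals) < reversals_wanted:
        if respond(level):
            correct_in_row += 1
            if correct_in_row == 2:        # 2-down: make the task harder
                correct_in_row = 0
                if direction == +1:        # turning point: going up -> down
                    reversals.append(level)
                direction = -1
                level -= step
        else:
            correct_in_row = 0             # 1-up: make the task easier
            if direction == -1:            # turning point: going down -> up
                reversals.append(level)
            direction = +1
            level += step
    return sum(reversals) / len(reversals)  # threshold estimate

# Deterministic toy listener: always correct above level 4, wrong otherwise.
threshold = run_staircase(lambda lvl: lvl > 4)
```

With the deterministic toy listener, the track oscillates between levels 4 and 6, so the reversal average lands midway between them.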

Usability of voice-controlled product interfaces

M.M. Bekker and L.L.M. Vogten e-mail: [email protected]

Abstract

Voice-controlled operation of an audio set was evaluated in a quasi-home environment with 76 subjects, who had free choice between voice input and conventional remote control. The set consisted of a triple CD changer, a double audio-cassette player, and a tuner. The evaluation was divided into four studies, examining different usability issues. Studies 1 and 2 focussed on the comparison of a perfect speech recognizer (Wizard-of-Oz simulation) with two state-of-the-art automatic speech recognizers (ASR). In studies 3 and 4, different types of feedback were compared: spoken feedback, non-speech audio feedback, and no feedback. The main conclusion is that users are positive about voice-controlled interfaces with feedback. However, it is clear that ASR performance has a major influence on users' appreciation of voice control. Allowing users to adapt some of the interface design options depending on the situation of use contributes greatly to the success of such interfaces. Furthermore, fine-tuning many of the design decisions of a voice-controlled interface, such as the type of feedback, the feedback sound, and the vocabulary design, is especially important as long as state-of-the-art ASR technology is still far from perfect.

Introduction

Speech-recognition technology offers a promising input medium for consumer products, and is expected to enhance usability and become a distinctive product feature. Only a few appliances with speech input have been launched in the consumer-product market. Some of these have been successful, such as mobile phones with voice dialling, whereas others have failed, such as the VoiceCommander, a remote control with automatic speech recognition (ASR).

Speech interfaces are said to be especially successful in situations where a speaker's hands are busy, mobility is required, a speaker's eyes are occupied, or where harsh or cramped conditions preclude the use of a keyboard (Shneiderman, 1998). Furthermore, a speech interface provides a more natural interface (Kamm, 1994), taking advantage of users' experience with everyday communication skills. Thus, even though it is generally known in what situations speech interfaces are likely to be successful, it is unclear what detailed design decisions will ensure that the final speech interface is a success.

A related issue for voice control in general is how to provide feedback about the commands that have been recognized and are executed by the appliances, especially since voice control lacks the direct feedback inherent in, for example, cursor movement. Therefore it is extra important to offer appropriate feedback (Baber, 1997). Again, there are advantages and disadvantages to the different solutions to this problem. Both audio and visual feedback have their own useful characteristics (Gaver, 1997; Bernsen, 1996). Visual information exists in space and is more stable. However, we have to move our heads and bodies to see what is around us. In contrast, sound information

exists in time, but across space. It is transient and therefore the information will be lost, but it is omni-directional. Because sound is omni-directional, audio feedback may be a useful option in a home environment, where people often move around. Keeping in mind that audio output may be irritating when applied incorrectly, care must be taken that it is implemented in a user-friendly manner (Hindus, Arons, Stifelman, Gaver, Mynatt & Back, 1995).

As state-of-the-art ASR performance is still far from perfect, two studies were conducted to determine how ASR performance influences user performance and satisfaction. Two different state-of-the-art recognizers were selected to examine these questions. One user study by Yankelovich, Levow and Marx (1995), concerning users interacting with a speech-only interface to an electronic mail, a calendar, a weather, and a stock-quotes application, found that error rates were only loosely correlated with user satisfaction: some people with the highest error rates were most positive about the application. However, this result is based on only 11 subjects, and no statistical evidence was presented to support this conclusion. Another study, by Wolf and Zadrozny (1998), found that users interacting with a speech system for common banking transactions tended to overestimate ASR accuracy. They also found that the ability to accomplish one's goals had a larger influence on user satisfaction than ASR performance. Their results are based on correlating user satisfaction and ASR accuracy. In the studies presented in this paper, a comparison is made between the satisfaction of users interacting with both an ideal speech recognizer and a state-of-the-art speech recognizer, thus allowing a within-subject comparison.

While feedback plays a central role in most interfaces, it is even more crucial for voice interfaces, because of the lack of feedback inherent in voice control.
Two different types of auditory feedback can be distinguished: auditory icons (everyday sounds mapped to computer events by analogy with everyday sound-producing events; Gaver, 1997) and earcons (abstract, synthetic tones used in structured combinations to create sound messages; Brewster, Wright & Edwards, 1993). While studies have been conducted to examine the difference between auditory icons and earcons (e.g., Sikora, Roberts & Murray, 1995), this paper presents research to explore usability issues of voice-controlled interfaces and compares auditory feedback to spoken feedback. Studies 3 and 4 were conducted to determine whether and how feedback influences user performance and satisfaction: study 3 compared the provision of spoken feedback with no feedback about recognizer performance, while study 4 compared spoken with non-speech audio feedback.

Method

Since the overall set-up of the four studies was very similar, the description of the method used is combined for all four studies. Table 1 gives an overview of the set-up of the four studies. Studies 1 and 2 compared a 'perfect' recognizer (in a Wizard-of-Oz set-up) with two different state-of-the-art speech recognizers. In studies 3 and 4, different types of feedback were compared, using ASR only. For studies 1 to 4, the number of tasks per minute and the number of commands per task were determined as user-performance measures. Users' satisfaction with the speech interface was determined by using questionnaires containing questions on user preference (about ease of use, fun, number of errors, etc.), to be answered on a 5-point scale.
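The two performance measures can be computed directly from a session log. A toy sketch with an invented log format (the real log files are described under Data collection below in less detail than assumed here):

```python
# Toy computation of the two performance measures: tasks per minute and
# commands per task. The session-log format is invented for illustration.

def performance_measures(session):
    """session: dict with 'duration_s' and a list of per-task command counts."""
    n_tasks = len(session["commands_per_task"])
    tasks_per_minute = n_tasks / (session["duration_s"] / 60.0)
    commands_per_task = sum(session["commands_per_task"]) / n_tasks
    return tasks_per_minute, commands_per_task

log = {"duration_s": 1200, "commands_per_task": [3, 2, 4, 3, 2, 4, 3, 3, 2, 4]}
tpm, cpt = performance_measures(log)   # 0.5 tasks/min, 3.0 commands/task
```

More commands per task typically indicates repeated attempts, e.g. after recognition errors, which is why the measure is informative about ASR performance.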

Table 1: Overview of study set-up for studies 1 and 2, examining the influence of ASR performance on the user's performance and satisfaction, and studies 3 and 4, examining the influence of various types of feedback.

Study | Two sessions, with counterbalanced conditions | ASR type, version | Vocabulary | Subjects | # of partial tasks
1 | Wizard / ASR | ASR_1: Dragon | 59 words, 'large' | 20 females + 20 males | 80
2 | ASR / Wizard | ASR_2: Philips, version 617 | 31 words, 'small' | 12 males | 80
3 | No feedback / Spoken feedback | ASR_3: Philips, version 8 | 31 words, 'small' | 12 males | 80
4 | Spoken feedback / Non-speech audio feedback | ASR_4: Philips, version 9b | 57 words, 'large' | 12 males | 80

ASR technology

Two different speech recognizers were used. ASR_1 consisted of Dragon Dictate in "Command & Control mode", with Dragon Xtools 2.0 (1997) as interface. The other ASR, used in studies 2, 3, and 4, was an experimental Philips ASR engine, called ASR_2, ASR_3, and ASR_4, respectively. Their most important characteristics are listed in Table 2.

Table 2: Characteristics of the speech recognizers used in studies 1 to 4 (ASR_1 to ASR_4, respectively).

ASR_1 | ASR_2, ASR_3, and ASR_4
speaker-dependent | semi-speaker-independent (a)
individual application words trained at least once | no individual application training
 | Acoustic Echo Cancellation (AEC)
close-by microphone, on remote control (0.5 m) | distant microphone, on audio set (3 m)
phoneme-based recognition | word-based recognition
isolated-word recognition | continuous speech recognition
59-word application vocabulary | 31- and 57-word application vocabulary

a. The speaker-independent English vocabulary was based on 40 male speakers from Germany, Belgium, and The Netherlands. Ideally, about 100 male and 100 female speakers should be used for better robustness to different speakers; therefore, it is called semi-speaker-independent. Furthermore, training was done without AEC. Training with AEC improves the recognition time.

Audio set controlled by 'voice commands' and 'remote control'

The audio set selected for the four studies contained a 3-CD changer, a tuner, and a double tape deck. To control the set, the subjects had a free choice between voice commands (VC) and a conventional remote control (RC). No buttons were available on the set for the subjects. The output signal of the remote control was received by an infrared (IR) server, connected to the computer, from where the commands were sent to the audio set. In study 1 the voice commands of the users were picked up by a small microphone (Highscreen 463444) fixed onto the left side of the remote control, and in studies 2 to 4 a microphone (Sennheiser MD421) was positioned on top of the audio set.

A computer program provided the control signals for the IR server and audio set in two different ways: manually or automatically. Manual control was intended for the Wizard-of-Oz settings in studies 1 and 2, to simulate 100% correct identification of the voice commands of the subject. The Wizard interpreted the subjects' speech and selected the appropriate button on the screen, evoking the corresponding set of commands to be sent to the IR server, where they were translated into IR signals and then transmitted to the audio set. In the automatic control mode, the Wizard was replaced by one of the two ASR engines. All recognized commands and all commands sent from the IR server to the audio set were logged automatically.

Set-up

The studies took place in a room furnished as a typical living room. Subjects were seated in an easy chair, the remote control was located on the right elbow-rest, and a clipboard with task sheets and a pencil lay on their lap. A dictionary, which could be used in the experimental task, was located on a low table immediately in front of the subjects. During the sessions, subjects were provided with a sheet explaining the function of the RC buttons and showing the available voice commands. During the experiments only the main functionality of the RC was used; the other buttons were hidden by means of black tape.

The audio set was situated at a distance of about 2 m in front of the subjects and about 1.5 m above the floor. The loudspeakers were located at a distance of about 1 m and on either side of the set. The IR server was positioned on top of the set, receiving the IR signals sent to the audio set by the remote control. Two cameras registered the experiments. The computer monitor, a video mixer, and the S-VHS video recorders were situated in the control room, next to the living room. The IR server and the microphone were connected to the computer positioned behind the audio set.

Subjects and their background

A commercial agency recruited the subjects for the four studies. In study 1, 20 male and 20 female subjects participated, aged 19 to 56. Due to technical problems, only 18 females participated in the ASR session of study 1. For study 2, 12 male subjects (aged 20 to 53) participated; for study 3 there were 12 male subjects (aged 21 to 59), and for study 4 there were 12 male subjects (aged 20 to 48). Subjects had no association with Philips or IPO, and their education level ranged from low (primary school) to high (university). All subjects had at least a basic education in English.

Tasks

Subjects received a printed list of 30 to 32 tasks (see Table 1 for an overview of study set-up characteristics), including requests to adjust the audio set to a specified setting, to switch between CD tracks, to select a radio station, to fast-forward and play the tape, etc. On average a task contained 2.5 to 3 partial tasks, each corresponding with one transition of the audio set. The subjects then listened to the music for a short time at a fixed volume setting, meanwhile performing a parallel task by looking up the first word of a randomly selected page in a dictionary and filling in that word on the task/response sheet. Subjects were free to operate and control the set by voice commands or by remote control.

Design and procedure

In all four studies a within-subject design with two conditions was used, in which the two experimental conditions were counterbalanced: in studies 1 and 2 regarding Wizard vs. ASR control, and in studies 3 and 4 regarding spoken vs. alternative feedback mode (none or non-speech audio). For each subject the experiment consisted of the following steps:
1. A general introduction, consisting of an oral introduction about the aim of the experiment.
2. An interview about the subject's background, living situation, and audio equipment at home, during which the experimenter filled in a questionnaire.
3. The subject filled in a questionnaire on his/her opinion about general statements concerning audio and video products.
4. A technical introduction to the audio set and details of the remote control, and a training session for the ASR.
5. The first of two sessions, preceded by a short written instruction, in which the subject worked with the set under one of the two counterbalanced experimental conditions.
6. The subject filled in a questionnaire, giving his/her opinion about using the set.
7. The second of the two sessions, also preceded by a short written instruction, in which the alternative experimental condition was applied.
8. The subject filled in a questionnaire regarding his/her opinion about using the set.
9. Finally, the subject filled in a questionnaire in which the two sessions were compared, followed by open questions on using voice control and/or feedback issues.
It took between 1.5 and 2 hours per subject to complete the whole experimental procedure.

Vocabulary
The commands to activate the set consisted of the numbers 1 to 19, names for the sound source, track and channel number, station names, volume settings, tape manipulation, and sound control. The main differences between the 'large' vocabularies applied in studies 1 and 4 versus the 'small' ones of studies 2 and 3 (see Table 1) were a reduction in the possible channel and track numbers (from 19 to 9), and a reduction in radio station names. In study 1 the recognizer was speaker dependent, for male and female speakers, and had to be trained independently by each subject. The recognizer used in studies 2 to 4 was speaker independent and was used for male speakers only. The written instruction before the Wizard sessions explicitly encouraged the subjects to use any spoken input they liked. Preceding that session subjects were also informed that, as opposed to the ASR session, a more powerful recognizer was applied, which was able to recognize the relevant key words from any arbitrary phrase.

Data collection
In order to obtain a complete registration of the subjects' voice input for each task and in both sessions, the videotapes were manually transcribed. All spoken commands were inserted and added to the remote control commands in the log files. For the ASR sessions, it was also checked whether or not each command was recognized and executed by the audio set. The log files were annotated with different codes for the various execution results, to fully register the hardware and software performance.

Feedback
Interacting with the set in study 3, the subjects received spoken feedback in one session and no feedback about speech recognition in the other session. In study 4 the subjects received spoken feedback in one session and non-speech audio feedback in the other session. Because of financial and technological constraints of the final realistic design implementation, only simple feedback messages were provided. Spoken feedback was provided for all recognized commands, by a male voice repeating the recognized commands. Audio feedback distinguished between two different non-speech sounds. One was a complex tone with increasing pitch, indicating that a voice command had been captured and recognized. The other was a sharp buzz, indicating that a command was picked up that had no meaning for the ASR in the current context of use. These non-speech audio sounds were based on musical listening ideas for feedback (Gaver, 1997).

Results
Performance of the recognizers (studies 1-4)
Figure 1 gives an overview of ASR performance for the four studies. The ASR with the larger vocabulary, used in study 1, scored lowest: 61% correct for 18 females and 64% correct for 20 males, with 32% and 30% deletions, respectively. The highest performance was found in study 2 for the smaller vocabulary, where 12 males scored 86% correct and 9% deletions.


Figure 1: Mean percentage of recognitions for the two ASR technologies, used by five groups of different subjects in the four studies performed: ASR_1 with 18 females and with 20 males, and ASR_2 to ASR_4 each with 12 males. The numbers below each group show the total number of voice commands used by each group.

In the four studies a total of 592 out-of-vocabulary (OOV) commands were used (6% of the 10547 voice commands given), of which 326 commands were deleted and 266 substituted. The following errors occurred:
• About one third of the OOV commands consisted of isolated numbers (206 times).
• Another third of the OOV commands was beyond the range of allowed commands, for example radio_4 to radio_8 (59 times), or consisted of synonyms, for example CD (20 times) instead of disk.
• The remaining OOV commands consisted of non-speech sounds (36 times), Dutch words (12 times), or various scattered utterances.

Perfect recognition (Wizard) vs. realistic ASR (studies 1-2)
Execution tempo
Figure 2 shows the number of tasks the subjects performed per minute for the Wizard and ASR sessions in studies 1 and 2. An ANOVA with repeated measures was used with order of condition as between-subject independent variable, the two conditions as within-subject independent variable, and tasks per minute as dependent variable. In study 1, a significant effect of type of ASR (F(1,18) = 26.74, p < 0.001) was found on tasks per minute. In addition, a significant condition by order effect (F(1,18) = 20.68, p < 0.001) was found. In study 2, no significant effect of type of ASR was found (F(1,10) = 3.02, p > 0.05).


Figure 2: Mean number of tasks per minute, executed by the 20 males in study 1 (left panel) and 12 males in study 2 (right panel) for the two Wizard and ASR conditions. The cross-bars indicate the standard deviation from the mean.

Verbosity
The number of commands per task, spoken or via remote control, averaged across the subjects, is shown in Figure 3 for studies 1 and 2.


Figure 3: Mean number of voice + remote control commands per task used by the 20 males in study 1 (left panel) and 12 males in study 2 (right panel) for the two Wizard and ASR conditions. Left of each pair: first session, right: second session. The cross-bars indicate the standard deviation from the mean.

An ANOVA with repeated measures was used with order of condition as between-subject independent variable, the two conditions as within-subject independent variable, and number of commands per task as dependent variable. In study 1, a significant effect (F(1,18) = 63.58, p < 0.001) was found of condition on commands per task. In study 2, no significant effect (F(1,10) = 3.56, p > 0.05) was found, even though overall the ASR also required more commands per task than the Wizard.

Preference for voice vs. remote control
The subjects' preference for voice or remote control is expressed as the ratio between the number of voice control (VC) minus remote control (RC) commands and the total number of commands used by the subject. A preference of +1 means that a subject used only voice commands and -1 means only remote control. Figure 4 shows the preferences in studies 1 and 2 for the individual subjects, according to gender and age.
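The preference measure above can be computed directly from the per-subject command counts; a minimal sketch in Python (the function name and the example counts are illustrative, not taken from the study):

```python
def preference(vc: int, rc: int) -> float:
    """Preference index (VC - RC) / (VC + RC) in [-1, +1].

    vc: number of voice control (VC) commands issued by a subject.
    rc: number of remote control (RC) commands issued by a subject.
    +1 means the subject used only voice commands, -1 only the remote control.
    """
    total = vc + rc
    if total == 0:
        raise ValueError("subject issued no commands")
    return (vc - rc) / total

# A subject issuing 30 voice and 10 remote commands scores (30 - 10) / 40 = 0.5
print(preference(30, 10))  # → 0.5
```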


Figure 4: Relative use of voice control (VC) and remote control (RC) commands of the individual subjects in study 1 (left panel) and study 2 (right panel) as a function of gender and ranked according to age, for the two Wizard and ASR conditions. Grey (solid) bars, upper end: Wizard score. White (open) bars, upper end: ASR score. Two females participated in the Wizard session of study 1 only.

In study 1, there is on average a clear preference for voice control in the Wizard session: the average preference for voice operation of males (0.73) is higher than the average of females (0.41). In the ASR session the preference of all but three subjects changed clearly: 11 females and 3 males score a negative preference value, in favour of the remote control. On average the preference drops to 0.33 for males and -0.08 for females. For study 2, the same trend was found; the average preference drops from 0.74 for the Wizard to 0.40 for the ASR and all but one subject changed in favour of the remote control.

Opinions about voice control
After completing the final session, the subjects were asked to compare the two sessions; the responses are shown in Figure 5. In both studies 1 and 2, overall the subjects judged the Wizard to be less error prone (302), easier (303) and more fun to use (304) and, for study 1 only, also more efficient (301). After finishing the final session, the subjects were asked to write down in what situations they would like to use voice control, and in what situations they definitely would not like to use voice control.

Figure 5: Mean responses of the subjects on a 5-point scale of agreement about statements on the use of the audio set in studies 1 and 2, comparing the two Wizard and ASR conditions for different orders of the two sessions (study 1, order Wizard_1 - ASR_1, 10 subj.; study 1, order ASR_1 - Wizard_1, 10 subj.; study 2, order Wizard_2 - ASR_2, 6 subj.; study 2, order ASR_2 - Wizard_2, 6 subj.). The statements were: 301, control in first session more efficient than in second; 302, more set errors in second session than in first; 303, set easier to use in second session than in first; 304, control in first session more fun than in second.

The following "no, definitely not" examples were given by 23 subjects (18 in study 1, and 5 in study 2):
• Impolite to use voice commands while having conversations (e.g., visit of friends) (10 subjects).
• At parties with a lot of background noise and the risk of losing control of the set (7 subjects).
• In the work place where it causes displeasure (2 subjects).
• Voice commands will disturb the music (1 subject).
• Children will start shouting and playing around with the set (1 subject).
• While driving a car (unsafe) (1 subject).
• On the couch (1 subject).

The following "yes please" examples were given by 48 subjects (36 in study 1 and 12 in study 2):
• Can continue with my work (hands busy or dirty) (18 subjects).
• Don't have to look for the remote control (12 subjects).
• Can control the set from a distant room (9 subjects).
• Safe control of an audio set in a car environment (4 subjects).
• Easier to use (3 subjects).
• Specific apparatus, such as central heating system, computer, and security systems (3 subjects).
• When relaxing or in bed (2 subjects).

Feedback conditions (studies 3-4)
Execution tempo
The number of tasks that the subjects performed per minute, averaged across the subjects, is shown in Figure 6 for the various feedback conditions applied in studies 3 and 4. An ANOVA with repeated measures was used with order of condition as between-subject independent variable, the two feedback conditions as within-subject independent variable, and tasks per minute as dependent variable. In studies 3 and 4, no significant effects of feedback condition on tasks per minute were found (F(1,9) = 0.01, p > 0.9; F(1,10) = 0.61, p > 0.4).


Figure 6: Mean tasks per minute executed by the 12 males in study 3 (left panel) and 12 males in study 4 (right panel) for the various feedback conditions: no feedback, spoken feedback and non-speech audio feedback, respectively. The cross-bars indicate the standard deviation from the mean.

Verbosity
Figure 7 shows the number of voice commands plus remote control commands per task, averaged for the two sessions per condition, for studies 3 and 4. An ANOVA with repeated measures was used with order of condition as between-subject independent variable, the two feedback conditions as within-subject independent variable, and commands per task as dependent variable. In study 3, a significant effect of the feedback condition on verbosity (F(1,8) = 16.73, p < 0.003) was found. Subjects using the audio set without spoken feedback used more commands per task and more remote control commands than did subjects receiving spoken feedback. In study 4, no significant effect of feedback condition on verbosity (F(1,10) = 0.61, p > 0.45) was found.


Figure 7: Mean number of voice + remote control commands per task used by the 12 males in study 3 (left) and 12 males in study 4 (right) for the various feedback conditions: no feedback, spoken feedback, and non-speech audio feedback. The cross-bars indicate the standard deviation from the mean.

Opinions about feedback
After completing the final session the subjects were asked to compare the two sessions with spoken and non-speech audio feedback, again on a 5-point scale for agreement. Figure 8 shows the results. The four subjects who received spoken feedback in the final session judged spoken feedback as easier to use and more fun than non-speech feedback. For the other group of eight subjects with the opposite order, non-speech was easier (312) or no such preference was found (313, 314). This indicates that subjects tended to have a preference for the feedback they had received last. Their opinion was more extreme in the case of having used spoken feedback last. After finishing the second session in studies 3 and 4, the subjects were asked to write down in what situations they would like to receive feedback and in what situations they definitely would not. Table 3 gives an overview of the most frequently mentioned examples for these questions.

The statements compared in Figure 8 were: 311, control in first session more efficient than in second; 312, working with set in second session more difficult than in first; 313, set easier to use in second session than in first; 314, working with set in second session more fun than in first. Groups: study 3, order Spoken fb - No fb (6 subj.); study 3, order No fb - Spoken fb (6 subj.); study 4, order Audio fb - Spoken fb (4 subj.); study 4, order Spoken fb - Audio fb (8 subj.).

Figure 8: Mean responses of the subjects of studies 3 and 4, on a 5-point scale of agree­ ment about statements on the use of the ASR-controlled audio set, comparing the dif­ ferent feedback conditions for the different orders of the two sessions.

Table 3: Most frequently mentioned situations in which feedback is appreciated or not.

Situations in which subjects would like to receive feedback (number of subjects):
• In the car (safer) (6).
• Can continue with my work (hands busy or dirty) (4).
• For specific apparatus like TV, computer, lighting, telephone (3).
• When studying (3).
• Easier to use the apparatus (2).

Situations in which subjects would definitely not like to receive feedback (number of subjects):
• With groups of people, e.g., at a party, everybody will give commands (6).
• Sound can be irritating for other people, at night or for visitors, or when you get too much feedback (4).
• Always, it is irritating (2).
• For specific apparatus: car, kitchen apparatus, all mechanical apparatus (1).

Finally, the subjects were asked some questions in a semi-structured interview about their experiences with feedback. Twenty-one out of 24 subjects were interested in having the audio set with voice control at home. Eleven out of 24 subjects preferred to have spoken feedback, because it confirms that something has been recognized, it supports learning to use the system, and because it is useful for switching between sound sources. Seven subjects did not like spoken feedback under any circumstances. Eight out of 12 subjects preferred non-speech audio feedback, because it is less irritating, faster, more direct, and less intrusive. Nine subjects would like to get spoken feedback for a subset of the commands (for switching between sources and not for the more obvious commands such as the volume level). Five subjects wanted spoken feedback for all commands. Seven out of 24 subjects liked to receive feedback depending on the situation of use (not at parties, with visitors or with simple use). Seven subjects always wanted feedback and one subject wanted no feedback at all. Nine subjects gave no opinion. Eighteen out of 24 subjects wanted to be able to switch the feedback off; three subjects did not want to switch it off or be able to choose between the non-speech audio and the spoken feedback. Two subjects gave no answer to this question. The sound quality of the feedback was judged to be good by 19 out of 24 subjects; the other five subjects judged the quality to be low or too monotonous. The volume level of the feedback was judged to be good by 13 subjects, while 11 subjects judged it to be too low, and suggested that it should be linked to the volume level of the set.

Discussion and conclusions
State-of-the-art ASR performance
Concerning the performance of the two state-of-the-art speech recognizers, the studies showed that they are still far from perfect in a quasi living room environment (performance ranging from 61-86%). While the studies were run in a laboratory setting, it is possible that such speech recognizers would perform even more poorly in a real living room with multiple people, although some of the deterioration might be balanced by training the speech recognizer in similar circumstances. The following recommendations can be made based on the analysis of the types of errors that occurred:
• The findings about the commands indicate that subjects need to be trained properly in what commands are within the vocabulary. On the other hand, they show that vocabulary design can probably prevent a major part of the out-of-vocabulary commands that occurred in the current studies, for example by building synonyms for commands into the vocabulary.
• A suggestion for future vocabulary design is to be extra careful choosing commands that have overlapping parts. This is supported by guidelines in the literature for good vocabulary design. Kamm and Helander (1997, p. 1054) state that polysyllabic words are generally easier to recognize correctly than monosyllabic words. Furthermore, polysyllabic word pairs with overlapping parts are often confused by recognition systems.

It is interesting to see that speech recognition performance under simulated real-life conditions of use is not ideal yet. However, some improvements can be made by implementing some of the recommendations above. Although no formal tests were done to determine the effects of vocabulary design and size, the current studies indicate the importance of good vocabulary design and of weighing the trade-off between increasing the vocabulary size and a possible decrease in recognizer accuracy.

User performance issues
The preference for voice control drops when the performance of the ASR is poorer: overall, subjects had a larger preference for voice commands in the Wizard sessions than in the speech recognizer sessions, as study 1 shows. Subjects also interacted faster with a Wizard than with one of the speech recognizers, though the difference was not significant for study 2. This may have been caused by the smaller difference in ASR performance between the speech recognizer and the Wizard in the second study. Overall, however, the results indicate that ASR performance influences user performance. No significant difference in performance was found between the sessions with spoken feedback and the sessions without feedback, or between the sessions with spoken and audio feedback. Thus, the type of feedback does not influence subjects' performance.

Users' opinions about voice control and feedback issues
Overall, subjects are very positive about their experience with voice control, and they would like to use it at home. Their opinion about voice control is influenced by ASR performance, and the Wizards are experienced as being faster, less error prone, more fun, and easier to use. No differences in opinion about interacting with the audio set were found between sessions with different types of feedback. Thus, the type of feedback does not seem to influence overall user satisfaction. Subjects gave examples of situations in which they would and would not like to use voice control. They most frequently stated that they would like to use it because it would allow them to continue doing what they were doing, because they would not have to look for the remote control, and because they would be able to control the audio set from a distance. The most frequently mentioned situations in which they would not like to use voice control are when other people are present, since using voice control is then felt to be impolite, and situations with a lot of background noise (such as a party). A slight preference for spoken feedback versus no feedback was found. Some people would like to use spoken feedback, but not necessarily for all commands. Some users were very explicit about preferring not to receive spoken feedback. No conclusive evidence was found for a preference for spoken versus audio feedback. Overall, users preferred what they had used last. Some subjects preferred spoken feedback because it was more informative; other subjects preferred receiving audio feedback because they felt it was faster and less intrusive. Some examples were given of situations in which subjects would prefer not to receive feedback at all, e.g., in situations where they had visitors, when it might bother other people present in the room, or when they had interacted with the set for a longer period of time.
Users preferred to be able to link the volume level of the feedback to the volume level of the audio set. They also stated that they would like to be able to switch the feedback on and off. Furthermore, they mentioned that they would like to select the commands for which they would receive spoken and audio feedback. For example, some users said that they would not need feedback for obvious commands (such as volume control). Overall, the present studies provide guidance on preferred functionality related to feedback, but not on the choice of feedback. In summary, the main conclusion is that users are very positive about voice-controlled interfaces with feedback. However, it is clear that ASR performance has a major influence on users' appreciation of voice control. Allowing users to adapt some of the interface design options depending on the situation of use will greatly contribute to the success of such interfaces. Furthermore, fine tuning many of the design decisions of a voice-controlled interface, such as type of feedback, feedback sound and vocabulary design, will be especially important as long as state-of-the-art ASR technology is still far from perfect.

Acknowledgements
The authors wish to thank Paul Kaufholz, Gerard Jorna and Jeroen de Waard for their contributions to the experiments and data analysis, and Steffen Pauws for comments on an earlier version of this paper.

References
Baber, C. (1997). Beyond the Desktop: Designing and Using Interaction Devices. San Diego, CA: Academic Press.
Bernsen, N.O. (1996). Towards a tool for predicting speech functionality. Free Speech Journal, 1. Available on-line: http://cslu.cse.ogi.edu/fsj/.
Brewster, S.A., Wright, P.C. & Edwards, A.D.N. (1993). An evaluation of earcons for use in auditory human-computer interfaces. In: S. Ashlund, K. Mullet, A. Henderson, E. Hollnagel and T. White (Eds.): Proceedings of the InterCHI '93 Conference on Human Factors in Computing Systems, Amsterdam, The Netherlands, April 24-29, 1993, 222-227.
Dragon Xtools 2.0 (1997). Guide and Reference. Newton, MA: Dragon Systems Inc.
Gaver, W.W. (1997). Auditory interfaces. In: M. Helander, T.K. Landauer and P. Prabhu (Eds.): Handbook of Human-Computer Interaction, 1003-1041. Amsterdam: North-Holland.
Hindus, D., Arons, B., Stifelman, L., Gaver, B., Mynatt, E. & Back, M. (1995). Designing auditory interactions for PDAs. Proceedings of the ACM Symposium on User Interface Software and Technology 1995, Pittsburgh, PA, USA, November 14-17, 1995, 143-146.
Kamm, C. (1994). User interfaces for voice applications. In: D.B. Roe and J.G. Wilpon (Eds.): Voice Communication Between Humans and Machines, 422-441. Washington, DC: National Academy Press.
Kamm, C. & Helander, M. (1997). Design issues for interfaces using voice input. In: M. Helander, T.K. Landauer and P. Prabhu (Eds.): Handbook of Human-Computer Interaction, 1043-1059. Amsterdam: North-Holland.
Shneiderman, B. (1998). Designing the User Interface: Strategies for Effective Human-Computer Interaction. Amsterdam: Addison-Wesley.
Sikora, C., Roberts, L. & Murray, L.T. (1995). Musical vs. real world feedback signals. In: I.R. Katz, R. Mack and L. Marks (Eds.): Proceedings of the ACM CHI '95 Conference on Human Factors in Computing Systems, 220-221. New York: ACM Press.
Wolf, C.G. & Zadrozny, W. (1998). Evolution of the conversation machine: A case study of bringing advanced technology to the marketplace. In: C.-M. Karat and A. Lund (Eds.): Proceedings of the ACM CHI '98 Conference on Human Factors in Computing Systems, Los Angeles, CA, USA, April 18-23, 1998, 488-495.
Yankelovich, N., Levow, G. & Marx, M. (1995). Designing speech acts: Issues in speech user interfaces. In: I.R. Katz, R. Mack and L. Marks (Eds.): Proceedings of the ACM CHI '95 Conference on Human Factors in Computing Systems, Denver, CO, USA, May 7-11, 1995, 369-376.

A measure for predicting audibility discrimination thresholds

C.H.B.A. van Dinther, P.S. Rao1, R.N.J. Veldhuis and A. Kohlrausch2 e-mail: [email protected]

Abstract
Both in speech synthesis and sound coding it is often beneficial to have a measure that predicts whether, and to what extent, two sounds are different. This paper addresses the problem of estimating the perceptual effects of small modifications to the spectral envelope of a harmonic sound. A recently proposed auditory model is investigated that transforms the physical spectrum into a pattern of specific loudness, as a function of critical band-rate. A distance measure based on the concept of partial loudness is presented, which treats detectability in terms of a partial loudness threshold. The implementation of the model is described, as well as its adaptation to the problem of estimating discrimination thresholds related to modifications of the spectral envelope of synthetic vowels. Data obtained from subjective listening tests using a representative set of stimuli in a 3IFC adaptive procedure show that the model makes reasonably good predictions of the discrimination threshold. Systematic deviations from the predicted thresholds can be related to individual differences in auditory filter selectivity.

Introduction
Predicting whether two sounds are perceived as different and expressing supra-threshold quality differences is interesting for sound compression and speech synthesis. A distance measure for estimating audibility thresholds and supra-threshold quality differences is important in both these areas of research. This paper deals with the first problem: finding a distance measure that predicts audibility discrimination thresholds of modifications to the spectral envelope of steady vowel-like sounds. The loudness model described in Moore, Glasberg and Baer (1997) was investigated for this purpose. Previous work on the prediction of perceptual discriminability was, for instance, reported by Plomp (1976), who investigated differences in the sound spectrum of complex tones by means of differences between excitation patterns using Minkowski metrics. In addition, Gagne and Zurek (1988) investigated resonance-frequency discrimination of single vowel formants. Kewley-Port (1991) reported on detection thresholds for isolated vowels and examined several detection hypotheses of vowel spectra, based on their excitation patterns. Sommers and Kewley-Port (1996) studied modelling formant frequency discrimination of female vowels and evaluated an excitation-pattern model for this purpose. In a study of auditory models of formant frequency discrimination for isolated vowels, Kewley-Port and Zheng (1998) used Euclidean distances to express differences between specific loudness patterns.

1. Also affiliated with the Indian Institute of Technology, Bombay, India. 2. Also affiliated with Philips Research Laboratories, Eindhoven, The Netherlands.

In this study we will also consider a model based on excitation patterns, but these will be used to calculate the partial loudness of a modified sound with respect to a reference sound. The partial loudness of a signal is the loudness of the signal when presented in a background sound. It will be shown that this measure can be used to predict discrimination thresholds in a variety of more or less realistic situations. This paper indicates in which range partial loudness thresholds can be expected. A limitation of the model is that the possible influence of phase differences is ignored, because the loudness of a sound is calculated from its amplitude spectrum. In this paper, a brief overview is given of the loudness model of Moore et al. (1997). A modified version of this model is presented for calculating the partial loudness of modifications of the envelope of steady harmonic complexes, which is used to predict audibility discrimination thresholds for arbitrary modifications of the spectral envelope. Next, measured discrimination thresholds are presented to validate the model, using a 3IFC adaptive procedure with a representative set of stimuli. The data obtained in this procedure show that the model makes reasonably good predictions of the discrimination threshold, except for the discrimination threshold of spectral modifications applied to the valleys of the spectral envelope of the stimuli, where large intersubject differences are observed. In the discussion it is shown that the deviations between the subjects in this case can be related to subjects' differences in the upward spread of masking.

The partial loudness model and its implementation
The implementation described in this section is based on a recently proposed model for computing loudness (Moore et al., 1997). Figure 1 shows a block diagram of the sequence of stages of this loudness model.

Stimulus → fixed filter modelling transfer from free field to eardrum → fixed filter modelling transfer through middle ear → transform spectrum to excitation pattern → transform excitation to specific loudness pattern → calculate area under specific loudness pattern → loudness

Figure 1: Block diagram of sequence of stages in the loudness model by Moore, Glasberg and Baer (1997).

The first two blocks describe transfer functions from the free field to the eardrum and through the middle ear. For sounds presented over earphones, the fixed filter modelling the transfer from the free field to the eardrum is replaced by one with a flat frequency response. In the third stage the excitation pattern of a given sound is calculated from the effective spectrum reaching the cochlea. According to Moore (1986), excitation patterns can be thought of as the distribution of 'excitation' evoked by a particular sound in the inner ear along a frequency axis. In terms of a filter analogy, the excitation pattern represents the output level of successive auditory filters as a function of their centre frequencies. The auditory filter represents frequency selectivity at a particular centre frequency, where the filter shape varies with the input level. The excitation pattern is generally presented as a function of the ERB rate rather than as a function of frequency. ERB stands for Equivalent Rectangular Bandwidth. The ERB rate is a value on the so-called ERB scale, which is more closely related to how the sound is represented in the auditory system. The auditory filters are uniformly spaced on this scale. The ERB rate e is related to the frequency F in kHz (Moore et al., 1997) as follows:

e = 21.4 · log10(4.37F + 1).
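The frequency-to-ERB-rate conversion can be written directly in code; a minimal sketch (the function name is ours, not from the paper):

```python
import math

def erb_rate(f_khz: float) -> float:
    """ERB rate e for a frequency F in kHz:
    e = 21.4 * log10(4.37*F + 1)  (Moore, Glasberg & Baer, 1997)."""
    return 21.4 * math.log10(4.37 * f_khz + 1.0)

# 1 kHz maps to roughly 15.6 on the ERB scale
print(round(erb_rate(1.0), 2))  # → 15.62
```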

The fourth stage of the model is the transformation from excitation pattern to specific loudness N', which is the loudness per ERB. The overall loudness of a given sound, in sone, is assumed to be the area under the specific loudness pattern. The relationship between the specific loudness and the excitation in power units is based on the assumption that, at medium to high levels, the specific loudness produced by a given amount of excitation is proportional to the compressed internal effect evoked by that excitation (Stevens, 1957; Zwicker & Scharf, 1965). The calculation of loudness in this way presupposes that the sound is presented in quiet. To judge the loudness of a modified sound with respect to a reference sound, we consider the loudness of a signal presented in a background sound. The background sound generally reduces the loudness of the signal. This effect is called partial masking. The loudness of a signal in a background sound is therefore called partial loudness. It is assumed that the partial loudness in sone of a signal in a background sound can be calculated by integrating the partial specific loudness of the signal as a function of the ERB rate. The calculation of partial specific loudness requires the calculation of three excitation patterns (Moore et al., 1997). First, an excitation pattern is calculated for the total sound, that is, the signal plus the background sound. The parameters of the auditory filters used to calculate this excitation pattern are stored. Then, using these parameters, two further excitation patterns are calculated: one for the background sound and one for the signal. From these excitation patterns the partial specific loudness of the signal can be calculated according to the formulas given by Moore et al. (1997).
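Taking loudness as the area under the specific-loudness pattern can be illustrated with a toy numerical integration (the Gaussian-shaped pattern below is purely illustrative and not a model output):

```python
import numpy as np

# ERB-rate axis and a hypothetical specific-loudness pattern N'(e) in sone/ERB
erb = np.linspace(3.0, 38.0, 351)
n_prime = np.exp(-0.5 * ((erb - 15.0) / 4.0) ** 2)

# Loudness in sone = area under N'(e), here via the trapezoidal rule
loudness = float(np.sum((n_prime[1:] + n_prime[:-1]) / 2.0 * np.diff(erb)))
print(round(loudness, 2))  # close to 4 * sqrt(2*pi) ≈ 10.03
```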

Figure 2: Taking minima and maxima of the two excitation patterns of the reference sound and the modified sound. (Both panels plot excitation level in dB SPL against ERB rate, 0-40.)

We apply this approach to calculate the partial loudness of a modification of the spectral envelope of a reference sound in the following way. The reference sound is intended to take the role of the background sound and the modified sound that of the signal plus background. The partial loudness of the modification to the reference sound is then computed in the same way as the partial loudness of the signal in the procedure that is presented above. There is a problem, however, in that Moore's model only takes positive changes of the spectral envelope of a reference sound into account, and consequently of the excitation pattern of this sound. In our case we want to consider arbitrary changes such as illustrated in the top panel of Figure 2. We approached this problem as follows. Let E1 be the excitation pattern of the reference sound and E2 that of the modified sound. The excitation pattern of the background sound is then defined as min(E1, E2), that of the total sound as max(E1, E2), and that of the signal as |E1 - E2|. Negative changes are treated in the same way as positive ones. Hence, only the absolute value of the difference is of interest. The resulting excitation patterns are illustrated in the lower panel of Figure 2. By applying these patterns to the model, we can calculate the partial loudness of the modification with respect to the reference sound. The partial loudness measured in this way is then adopted as a measure for the distance between the sound with excitation pattern E1 and the sound with excitation pattern E2. This measure is symmetric, i.e., when the reference sound and the modified sound are exchanged it results in the same distance between the sounds. We did not validate whether subjects also show similar thresholds when the reference sound and the modified sound are exchanged. Moore et al. (1997) used partial loudness values of 0.003 and 0.008 sone to estimate discriminability thresholds.
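The min/max/absolute-difference construction described above can be sketched as follows. The excitation patterns themselves come from Moore's model, which is not reproduced here; the function name is ours.

```python
def masking_patterns(e1, e2):
    """Given excitation patterns E1 (reference sound) and E2 (modified sound),
    sampled on a common ERB-rate axis, return the three patterns used for the
    partial-loudness calculation:
      background = min(E1, E2), total = max(E1, E2), signal = |E1 - E2|.
    The construction is symmetric in E1 and E2."""
    background = [min(a, b) for a, b in zip(e1, e2)]
    total = [max(a, b) for a, b in zip(e1, e2)]
    signal = [abs(a - b) for a, b in zip(e1, e2)]
    return background, total, signal
```

Because only min, max, and the absolute difference appear, exchanging E1 and E2 leaves all three patterns unchanged, which is what makes the resulting distance measure symmetric.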

The experiment
The aim of this experiment is to validate whether partial loudness, computed according to the model described in the previous section, can be used to estimate audibility discrimination thresholds for arbitrary modifications of the spectral envelope of steady harmonic complexes. A requirement for a distance measure that is useful for a broad class of speech and musical sounds is that it should be capable of predicting audibility thresholds for a large variety of spectral modifications. Therefore we chose a representative set of conditions for the experiment, distributed over two vowel-like spectra, /a/ and /i/. The conditions are chosen in such a way that they cover relevant amplitude changes to the spectrum. In particular, we considered spectral amplitude changes at the formant peaks and in the valleys of the spectra. Combinations of these changes were used too, both in opposite and in equal directions. The amplitudes of the harmonics were modified by multiplication with a factor close to 1. When more than one harmonic was modified, each harmonic was multiplied by the same factor. To produce global positive and negative changes of the spectrum, we used a low-pass and a high-pass filter that changed the tilt of the spectrum. The locations of the changes and of their combinations are shown in Figure 3. Table 1 gives an overview of all modifications.

It is generally accepted that changes of amplitudes in the spectrum of harmonic sounds are more detectable than phase changes. It has been shown that for complex tones with a fundamental frequency beyond 150 Hz, the maximal effect of phase on timbre is smaller than the effect of changing the slope of the amplitude pattern by 2 dB/oct (Plomp & Steeneken, 1968). Therefore, a fundamental frequency of 220 Hz for the vowel-like spectra was used. We applied random phase to the stimuli to obtain an equal distribution of energy within each pitch period. For a fundamental frequency of 110 Hz, however, phase effects could arise, which could lead to temporal cues. Adding a condition with a regular phase tested the influences of these effects. In each of the conditions, the spectral amplitudes were modified in small steps corresponding to the calculated partial loudness. For each condition we measured the values of the partial loudness at which the subjects were able to discriminate between the reference sound and the modified sound.

Figure 3: Spectra of the vowels /a/ and /i/ and indications where the harmonics are modified. The numbers refer to the kind of modification applied to the spectrum given in Table 1. (Both panels plot dB SPL against harmonic number; F0 = 220 Hz.)

Table 1: Descriptions of the conditions.

Condition  F0 [Hz]  Phase    Vowel  Modifications
1          220      random   a      harmonic 6, positive
2          220      random   a      harmonics 10-11, positive
3          220      random   a      harmonics 12-15, positive
4          220      random   a      harmonics 12-15, positive and harmonic 6, negative
5          220      random   a      modification of spectral tilt, low-pass filter
6          220      random   a      modification of spectral tilt, high-pass filter
7          220      random   i      harmonics 1-2, negative
8          220      random   i      harmonics 4-8, positive
9          220      random   i      harmonic 12, positive
10         220      random   i      harmonics 4-8, positive and harmonics 1-2, negative
11         110      random   a      modification of spectral tilt, low-pass filter
12         110      regular  a      modification of spectral tilt, low-pass filter

Four subjects (JG, JB, PR, and RD) participated in the experiments. All were young adults with normal hearing. The stimuli were presented binaurally via headphones at a sensation level of approximately 55 dB. A 3IFC adaptive procedure was used to obtain thresholds. In this procedure, each trial consisted of three stimuli: two stimuli represented the reference sound and one the modified sound. The duration of the stimulus interval and the pause before one trial was 300 ms and that of the interstimulus interval 400 ms. The assignment of the odd stimulus to one of the three intervals was randomized. The subject's task was to indicate the odd interval. Immediately after each response, feedback was given indicating whether the response was correct or incorrect. After two correct responses the amount of spectral modification was reduced by one step. After one incorrect response it was increased by one step. The spectral modifications of a stimulus were divided into 20 steps, reaching from about 1 sone to 0.001 sone of partial loudness for the modification. A run began with a modification of the spectrum that produced an easily discriminable change. A test run was completed after 12 up-down reversals. A single-run estimate of the partial loudness was obtained by taking the median of the steps for the last eight reversals of the run. In this way, the procedure converges on the point at which differences are detected with an accuracy of 71%. For each experimental condition, the median of five single-run estimates was taken as the final estimate of the partial loudness for each subject. In Figure 4 the medians are given for each subject, where the data of JG, JB, PR, and RD are indicated with a circle, a triangle, a cross and a star, respectively. The interquartile ranges are indicated by means of bars.
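The 2-down/1-up tracking rule described above (which converges on the 70.7%-correct point, reported as 71% here) can be sketched as follows. This is an illustrative simulation of the rule only; the step grid, the listener model, and all names are ours, not the original experimental code.

```python
import statistics

def run_staircase(respond, n_steps=20, start=0, n_reversals=12):
    """Simulate a 2-down/1-up adaptive run over a grid of modification steps
    (step 0 = largest modification, step n_steps-1 = smallest).
    `respond(step)` must return True for a correct trial.
    Returns the median step index over the last eight reversals."""
    step = start
    correct_in_row = 0
    direction = +1            # +1 = moving towards smaller modifications
    reversals = []
    while len(reversals) < n_reversals:
        if respond(step):
            correct_in_row += 1
            if correct_in_row == 2:       # two correct: make the task harder
                correct_in_row = 0
                if direction == -1:       # track turned around: a reversal
                    reversals.append(step)
                direction = +1
                step = min(step + 1, n_steps - 1)
        else:                             # one incorrect: make the task easier
            correct_in_row = 0
            if direction == +1:
                reversals.append(step)
            direction = -1
            step = max(step - 1, 0)
    return statistics.median(reversals[-8:])
```

With a deterministic listener who is always correct below some step and always wrong above it, the track oscillates around that step and the median of the last eight reversals lands on it.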

Figure 4: Partial loudness per condition number. Data are medians across five sessions taken from four subjects. Bars show the interquartile ranges of the subjects' partial loudness. Connection lines are shown for clarity. (Partial loudness in sone, roughly 0.005-0.055, is plotted against condition numbers 1-12; symbols distinguish subjects JG, JB, PR, and RD.)

Discussion
The partial loudness model of Moore et al. (1997) is based on the average of results from a large number of experiments involving listeners with normal hearing. The parameters of individual subjects, however, may vary from those of Moore's model. Typical examples of parameters that are expected to vary are the transfer functions of the filters in the first two blocks of the model in Figure 1 and the bandwidths of the auditory filters. Hence, the predictions of the model should not be expected to be accurate for individual listeners. For most of the conditions the thresholds are between 0.005 and 0.02 sone, which is actually close to the range of 0.003 to 0.008 sone mentioned by Moore et al. One subject shows a high threshold for condition 3 and three subjects show high thresholds for condition 8. Conditions 3 and 8 are modifications in the valleys of the spectral envelope. A possible explanation is that the higher thresholds for these conditions are due to differences in the slopes of the subjects' auditory filters with respect to those used in Moore's model. There is support for this assumption in that making the slopes of the auditory filters of the model shallower decreases the values of the predicted partial loudness for conditions 3 and 8, whereas the values for the other conditions remain largely the same. To investigate this possible explanation, an additional experiment was carried out to measure the upward spread of masking of the subjects, which is related to the slope of the auditory filters. We used a masker frequency of 440 Hz and a target frequency of 1000 Hz to create a situation similar to condition 8. The masker levels were 70, 60, and 55 dB SPL. Table 2 shows the medians of the masked thresholds of four sessions.

Table 2: Masked thresholds of a 1000 Hz signal at masker levels 70, 60, and 55 dB. Data are medians across four sessions taken from the four subjects.

Masker level [dB SPL]   JG      JB      PR      RD
70                      24.50   20.75   32.00   12.25
60                       9.25    6.25   21.50    4.25
55                       5.75    2.50   17.75    2.00

The data in Table 2 support the assumption that the subjects' differences for condition 8 are due to differences in auditory filter slopes. Subject PR shows high masked thresholds accompanied by a high score for condition 8, whereas subject RD shows low masked thresholds accompanied by a low score for condition 8. The masked thresholds of the subjects JB and JG lie between those of PR and RD, which is also the case for the threshold measured in condition 8.

Comparison of the empirical data shown in Figure 4 with the calculated partial loudness indicates that the distance measure developed in this study provides reasonably good predictions, especially if we consider that deviations can be introduced by differences in the subjects' auditory filters. For most conditions the partial loudness at threshold is somewhat higher than the range (0.003 to 0.008 sone) mentioned by Moore et al. (1997). The differences across conditions and subjects are small, given that a variation of 1 dB of the masked threshold corresponds to a factor 2-3 in partial loudness value.

Acknowledgements
We would like to thank Steven van de Par and Jeroen Breebaart for their help with setting up the experiments.

References
Gagne, J.P. & Zurek, P.M. (1988). Resonance-frequency discrimination. Journal of the Acoustical Society of America, 83, 2293-2299.
Kewley-Port, D. (1991). Detection thresholds for isolated vowels. Journal of the Acoustical Society of America, 89, 820-829.
Kewley-Port, D. & Zheng, Y. (1998). Auditory models of formant frequency discrimination for isolated vowels. Journal of the Acoustical Society of America, 103, 1654-1666.
Moore, B.C.J. (1986). Frequency Selectivity in Hearing. London: Academic Press.
Moore, B.C.J., Glasberg, B.R. & Baer, T. (1997). A model for the prediction of thresholds, loudness, and partial loudness. Journal of the Audio Engineering Society, 45, 224-240.
Plomp, R. (1976). Aspects of Tone Sensation. London: Academic Press.
Plomp, R. & Steeneken, H.J.M. (1968). Effect of phase on the timbre of complex tones. Journal of the Acoustical Society of America, 46, 409-421.
Sommers, M.S. & Kewley-Port, D. (1996). Modeling formant frequency discrimination of female vowels. Journal of the Acoustical Society of America, 99, 3770-3781.
Stevens, S.S. (1957). On the psychophysical law. Psychological Review, 64, 153-181.
Zwicker, E. & Scharf, B. (1965). A model of loudness summation. Psychological Review, 72, 3-26.

Improving diphone synthesis by adding context-sensitive diphones

E.A.M. Klabbers, R.N.J. Veldhuis and K.D. Koppen e-mail: [email protected]

Abstract
One well-known problem with concatenative synthesis is the occurrence of audible discontinuities at concatenation points. Formant jumps across concatenation points suggest the problem is due to spectral differences. In a previous experiment (Klabbers & Veldhuis, 1998), the results of a listening experiment were correlated with several spectral distance measures to find one that best predicts the audible discontinuities. The Kullback-Leibler distance proved to be the best measure. In this paper we demonstrate its use for clustering diphones with similar contexts. For each of the clusters, a limited number of context-sensitive diphones is added to the database to reduce the number of audible discontinuities. A new listening experiment was performed, which showed that a significant improvement can be obtained.

Introduction
One well-known problem with concatenative synthesis is the occurrence of audible discontinuities at concatenation points, which is most prominent in vowels and semi-vowels and is caused by contextual influences. IPO's speech-synthesis system uses diphones as concatenative units; they were produced by a professional female speaker and embedded in nonsense words. Consonant-vowel (CV) and vowel-consonant (VC) diphones are excised from nonsense words of the form C@CVC@, where @ stands for schwa.

Figure 1: Spectrogram for the vowel /u/ in the synthesized Dutch word doek. A spectral discontinuity can clearly be detected at the concatenation point of /du/ and /uk/ at 65 ms.

Figure 1 shows the spectrogram for the vowel /u/ in the synthesized Dutch word doek (/du/ + /uk/), a word which was clearly perceived as discontinuous. It reveals a considerable jump in F2 of about 700 Hz at the diphone boundary (located at 65 ms). This, together with other informal observations, suggests that the problem is of a spectral nature. Other discontinuities come from mismatches in F0 and in phase (Dutoit, 1997). In IPO's synthesis system, F0 mismatches are avoided by monotonizing the diphones before storing them in the database. Phase mismatches are avoided by using a method called phase synthesis for resynthesis of the nonsense words (Gigi & Vogten, 1997). In this approach, the signal is decomposed into an amplitude and phase value for each harmonic.

Several solutions have been proposed to solve the problem of spectral mismatch. One approach is to reduce the number of audible discontinuities by using larger units such as triphones. However, this does not solve the problem, as discontinuities continue to occur, albeit less frequently. Moreover, the inventory size increases drastically. The number of diphones required for English lies around 1200 units, whereas the number of triphones is at least 10,000 (Huang et al., 1997).

Another approach is to minimize spectral mismatch by varying the location of the cutting point in the nonsense words depending on the context (Conkie & Isard, 1997). This calls for a spectral distance measure that correctly represents the amount of spectral mismatch. Moreover, it is based on the underlying assumption that the short-term spectral envelopes are not constant over time, resulting for instance in non-flat formant trajectories. Figure 1, along with many other observations in our database, shows that formant trajectories can be fairly flat throughout a vowel when embedded in symmetrical nonsense words. The F2 of the /u/ in /du/ stays around 1600 Hz while the F2 of the /u/ in /uk/ stays around 900 Hz.

A third approach is to reduce spectral mismatch by means of waveform interpolation, spectral-envelope interpolation, or formant trajectory smoothing. It requires specific signal representations to allow these types of operations. The disadvantage of formants as a representation is that they are very difficult to estimate reliably. Waveform and spectral envelope interpolation have the disadvantage that smooth transitions are often achieved at the expense of naturalness (Dutoit, 1997).

A fourth approach is to reduce the number of discontinuities by including context-sensitive or specialized units in the database (Olive, Van Santen, Mobius & Shih, 1998). This implies that one knows which contexts can be clustered so as to keep the inventory size within bounds. Our investigation was aimed at gaining insight into this approach.

Preliminary investigation
In a previous experiment (Klabbers & Veldhuis, 1998), the occurrence of audible discontinuities in our diphone database was investigated in detail. To reduce the data set to manageable proportions, the study was restricted to five vowels, /a/, /A/, /i/, /I/ and /u/, which cover the extremes of the vowel space.

First, a listening experiment was conducted to determine the extent to which audible discontinuities occur in our database. The stimuli consisted of concatenated left ClV and right VCr diphones, excised from the symmetrical nonsense words Cl@ClVCl@ and Cr@CrVCr@. Thus, the diphones /du/ and /uk/ that form the stimulus /duk/ were recorded in the nonsense words d@dud@ and k@kuk@. In the stimuli, the consonant portions were

cut off to prevent them from influencing the perception of the transition in the middle of the vowel. Fading was used to smooth the transition from silence to vowel and vice versa. The duration of the vowels was fixed at 130 ms and the signal power of the second diphone was scaled to match that of the first.

Five subjects with a background in phonetics or psychoacoustics participated in the experiment, which had a within-subjects design, meaning that each subject listened to all stimuli (2284 in total). The stimuli were presented in random order. For each stimulus, the subjects had to judge the transition at the diphone boundary in the middle of the vowel as being either smooth or discontinuous. To reduce the variability between subjects, a majority score was calculated, i.e., a stimulus was marked as discontinuous when four out of five listeners perceived it as such.

Summing the scores obtained in the experiment for each of the vowels, we obtain the percentage of perceived discontinuities as presented in Table 1. It becomes apparent that the number of discontinuities is particularly high for /u/ and particularly low for /a/. This is in line with findings by Van den Heuvel, Cranen and Rietveld (1996), who also found the greatest amount of coarticulation in /u/ when investigating speaker variability in the coarticulation of /a/, /i/ and /u/. They found the smallest amount of coarticulation in /i/.

Table 1: Percentages of perceived discontinuities, based on majority scores.

Vowel   Perceived discontinuities
/a/     17.1%
/i/     43.1%
/A/     52.1%
/I/     55.5%
/u/     73.9%

The second step in our investigation was to correlate the results of the listening experiment with several spectral distance measures to obtain an objective measure for a discontinuity. These distance measures, taken from various fields of research, were used to determine spectral distances across diphone boundaries between spectral envelopes. Spectral envelopes are calculated from LPC coefficients computed over a 40-ms Hanning window. A spectral distance measure is said to predict an audible discontinuity when its outcome exceeds a predefined threshold β. To correlate the distance measures with the scores of the listeners, two probability density functions p(D|0) and p(D|1) are estimated from the data, representing the probability of a spectral distance (D) given that the transition was marked as continuous (0) or discontinuous (1). For a certain threshold β, the probability of a false alarm, i.e., the case that a transition is wrongly classified as discontinuous, is PF(β) and the probability of a hit, i.e., correct detection of a discontinuity, is PD(β), which are given by the following:

PF(β) = ∫_β^∞ p(D|0) dD   and   PD(β) = ∫_β^∞ p(D|1) dD.

A miss, the case that a discontinuity goes undetected, is defined as 1 - PD and a correct rejection, the case that a transition is rightly classified as being continuous, is 1 - PF.

A plot of pairs (PD(β), PF(β)) for all values of β constitutes a Receiver Operating Characteristic (ROC; Luce & Krumhansl, 1988) (see Figure 2). In this experiment, it is assumed that the subjects are correct in their judgments. The question then is how well a spectral distance measure can predict the listeners' judgments. As shown in the right panel of Figure 2, ROCs are always upwardly concave. The straight line represents the chance level. The further the curve extends to the left, the better the measure serves as a predictor. Looking at the left panel in Figure 2, this means that the two probability density functions p(D|0) and p(D|1) are moving away from each other, thus increasing the hit rate and decreasing the false alarm rate.

Figure 2: Principle of Receiver Operating Characteristics. The left panel displays the probability density functions of the spectral distance given that the vowel is marked as either continuous or discontinuous. The right panel displays the plot of the probability of a false alarm (PF) versus the probability of a hit (PD), constituting an ROC.
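Tracing the ROC can also be done numerically from the two sets of observed distances, using empirical fractions in place of the integrals over the densities. The following is an illustrative sketch (all names are ours, not the original analysis code):

```python
def roc_points(d_continuous, d_discontinuous, thresholds):
    """For each threshold beta, return the pair (P_F, P_D):
    P_F = fraction of 'continuous' transitions with distance >= beta (false alarms),
    P_D = fraction of 'discontinuous' transitions with distance >= beta (hits)."""
    points = []
    for beta in thresholds:
        p_f = sum(d >= beta for d in d_continuous) / len(d_continuous)
        p_d = sum(d >= beta for d in d_discontinuous) / len(d_discontinuous)
        points.append((p_f, p_d))
    return points
```

Sweeping beta from 0 upwards moves the point from (1, 1) down to (0, 0); a measure that separates the two distributions well keeps P_D high while P_F drops quickly.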

We did not need to decide on an appropriate threshold value in this study, since the purpose of the analysis was solely to determine the best performing distance measure. In our study, the Kullback-Leibler distance measure (KL distance) appeared to have the best fit with the decisions of the listeners in comparison to other measures, such as the Euclidean distance between Mel-based cepstral coefficients and the Euclidean distance between formants. The KL distance is calculated from the spectral envelopes P(ω), computed around the concatenation point at the end of the ClV diphones, and Q(ω), computed around the concatenation point at the onset of the VCr diphones. A symmetrical version of the KL distance is used, which is given by

DKL(P, Q) = ∫ (P(ω) - Q(ω)) log(P(ω)/Q(ω)) dω.

The main reason for using a symmetrical distance measure is that DKL(P(ω), Q(ω)) equals DKL(Q(ω), P(ω)), which is convenient because all vowels are recorded in nonsense words with the same left and right consonant context. The KL distance has the important property that it emphasizes differences in spectral regions with high energy more than differences in spectral regions with low energy. Thus, spectral peaks are emphasized more than valleys between the peaks, and low frequencies are emphasized more than high frequencies, due to the 6 dB/octave declination in spectral energy that results from the combination of high-frequency components in the signal (-12 dB/octave) and radiation at the mouth (+6 dB/octave).
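A discrete approximation of the symmetric KL distance, for spectral envelopes sampled on a common frequency grid, might look as follows. This is an illustrative sketch; the names and the simple Riemann-sum integration are ours.

```python
import math

def symmetric_kl(p, q, domega=1.0):
    """Symmetric Kullback-Leibler distance between two sampled spectral
    envelopes P(w) and Q(w) (positive linear power values on the same grid):
        D_KL(P, Q) = integral (P - Q) * log(P / Q) dw,
    approximated by a Riemann sum with bin width `domega`."""
    return sum((pi - qi) * math.log(pi / qi)
               for pi, qi in zip(p, q)) * domega
```

Each term (P - Q) log(P/Q) is non-negative, so the distance is zero only for identical envelopes, and swapping p and q leaves the result unchanged, consistent with the symmetry noted above.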

Clustering procedure
To reduce the number of audible discontinuities, we propose extending the diphone database with context-sensitive diphones. We aim to limit the number of additional diphones by grouping spectrally similar contexts into clusters. The KL distances measured between left and right diphones determine which contexts end up in the same cluster. Within a cluster, the average KL distance is small, whereas between clusters it is large. To avoid audible discontinuities between contexts from non-corresponding clusters, additional right (VCr) diphones are recorded with a representative from the non-corresponding left cluster. The representative is the left (ClV) diphone with the lowest average KL distance to the other diphones in the cluster. The KL distance for the new diphones is expected to be small. The procedure is illustrated in Figure 3 for the vowel /u/. In our investigation, the maximum number of clusters was restricted to three (equal for left and right diphones). For example, one can see that concatenating /bu/ and /uk/ to make boek is expected to be unproblematic, as both consonants come from the same cluster with a small KL distance. However, for doek and hoek an audible discontinuity is likely to occur, as they come from non-corresponding clusters. Figure 3 shows that for the /uk/ diphone, which was originally recorded in the symmetrical nonsense word k@kuk@, two additional diphones are recorded in the asymmetrical nonsense words l@luk@ and f@fuk@. Then doek can be created by concatenating the original left diphone /du/ with the new diphone /uk/ taken from l@luk@.

Figure 3: An example of the clustering procedure for /u/, showing the original database (left diphones and right diphones recorded symmetrically) and the additional database (right diphones recorded with a cluster representative). The context-sensitive diphone clusters are shaded. The representative of each cluster is printed in bold.

For /u/, the clustering procedure assigns all alveolars to the same cluster. For /a/, the division into clusters is {b, w, f, v, z}, {S}, {p, t, k, d, g, J, m, n, N, l, L, r, s, z, x, G}. Here, the /S/ provides an extreme context. For /i/, the division into clusters is {p, t, k, g, n, N, f, s, x, G}, {b, d, J, m, L, r, w, v, z, Z}, {l, S}.
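The selection of a cluster representative, described above as the member with the lowest average KL distance to the other members, can be sketched as follows. The distance function is supplied by the caller (e.g. the symmetric KL distance between the corresponding left diphones), and all names are ours.

```python
def cluster_representative(cluster, distance):
    """Pick the member of `cluster` with the lowest average distance
    to the other members; `distance(a, b)` is a symmetric measure
    such as the KL distance between the corresponding diphones."""
    def avg_dist(member):
        others = [m for m in cluster if m != member]
        return sum(distance(member, o) for o in others) / len(others)
    return min(cluster, key=avg_dist)
```

The representative is then the context in which the additional asymmetrical nonsense words for that cluster are recorded.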

Experiment
New recordings were made with which a new listening experiment was performed, to measure the improvement of this approach over the old one. To make comparison possible, the new recordings incorporated both the old symmetrical and the new asymmetrical nonsense words. Again, stimuli were created consisting of concatenated left ClV and right VCr diphones, except that now there were two diphone combinations for each vowel: one with a right diphone from the symmetrical nonsense word Cr@CrVCr@ (method without clustering) and one with a right diphone from the asymmetrical nonsense word Crep@CrepVCr@, where Crep is the cluster representative (method with clustering). To reduce the total number of stimuli, we now focussed on just three vowels: /a/, /i/, and /u/. The total number of stimuli in the experiment was 2254, of which 1449 stemmed from the method without clustering and 805 came from the method with clustering.

The listening experiment was repeated with six subjects with a background in phonetics. They had not participated in the previous experiment. The task and experimental set-up were identical to those of the first listening experiment.

Results
Table 2 lists the percentage of perceived discontinuities for the new database with and without clustering. Again, these are based on majority scores (4 out of 5). Since we had six subjects in this experiment, one randomly chosen subject was left out to keep the results comparable to the old situation. It can be seen that clustering reduces the number of audible discontinuities.

Table 2: Percentage of perceived discontinuities based on majority scores for the new database with and without clustering.

Vowel   Without clustering   With clustering
/a/     7.7%                 6.0%
/i/     19.3%                17.0%
/u/     53.4%                26.3%

The significance of this reduction is demonstrated by a paired-samples t-test, which was performed on the KL distance and on the summed subjects' scores (see Table 3). Looking at the results for the KL distance, we can see that the distance decreases significantly for context-sensitive diphones for both /i/ and /u/, but not significantly for /a/. In the subjects' judgments, however, the number of audible discontinuities decreases significantly for all three vowels.

Table 3: Paired-samples t-test for the new database with and without clustering, for the KL distance and the summed subjects' scores.

Vowel   KL distance                  Summed subjects' scores
/a/     t(482) = 0.524, p = 0.601    t(482) = 2.772, p = 0.006
/i/     t(482) = 6.326, p < 0.001    t(482) = 2.618, p = 0.009
/u/     t(482) = 8.121, p < 0.001    t(482) = 12.47, p < 0.001
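A paired-samples t statistic of the kind reported in Table 3 can be computed from matched pairs of scores with the standard textbook formula. This is a generic sketch (names ours), not the original analysis script:

```python
import math

def paired_t(x, y):
    """Paired-samples t-test statistic and degrees of freedom:
    t = mean(d) / (sd(d) / sqrt(n)), with d_i = x_i - y_i and df = n - 1."""
    n = len(x)
    d = [a - b for a, b in zip(x, y)]
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)   # sample variance
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1
```

Here x and y would hold the per-stimulus KL distances (or summed scores) without and with clustering, paired per stimulus; the resulting t is then compared against the t distribution with n - 1 degrees of freedom.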

The effects found for the KL distance are nicely illustrated in Figure 4, which displays the distribution of KL distances in rank order. For /a/, it can indeed be seen that clustering does not have much of an effect on the KL distances, probably because the average distance was low to begin with. For instance, when we take a threshold β of 1.5, we can see that 400 out of 500 stimuli lie below this threshold for /a/. For /i/ and /u/ the improvement is clear. Taking the same β, we can see that the number of stimuli below the threshold increases from approximately 250 to 320 for /i/ and from 280 to 350 for /u/.

Figure 4: Distributions of KL distances for /a/, /i/, and /u/. The solid line represents the new database without clustering and the dashed line represents the new database with clustering.

Again considering the specific example of /duk/, we now see that adding an additional /uk/ diphone that has been recorded in the appropriate context makes a huge visible difference (see Figure 5). Instead of an abrupt jump in F2 of about 700 Hz as observed in Figure 1, F2 gradually descends from around 1300 Hz to 1100 Hz.

Discussion
The findings in this investigation lead to a number of interesting observations. First, the differences in results between the three vowels /a/, /i/, and /u/ show that /a/ is least affected by coarticulation, whereas /i/ is more and /u/ is most affected by it. The alveolars account for most of the coarticulation in /u/. There, the slow movement of the tip of the tongue causes the frontal mouth cavity to be small throughout the vowel, which leads to a relatively high F2 value. For /a/ and /i/ this does not make much difference, since the locus of alveolars is much nearer to their customary F2 value than for /u/.

Second, the finding that audible discontinuities still occur for /a/ and that clustering does not reduce the number of audible discontinuities leads to the conclusion that besides coarticulation there is always random variation in the pronunciation of the stimuli. This was also observed by Olive et al. (1998, p. 197), who found F2 variations in excess of 50 Hz for repetitions of the exact same phrase as uttered by a highly professional speaker. Van Santen (1997) reports even larger F1 and F2 variations (up to 250 Hz) in the repeated pronunciation of /I/ in six and million by a professional speaker. Therefore, one should always record several instances of a diphone and choose the one that is optimal for the database.

Third, Figure 5 shows that when the diphones are recorded in asymmetrical nonsense words, the formant trajectories are no longer stable, but gradually change from start to finish. In that case, it may make sense to optimize the concatenation point of the diphone as proposed by Conkie and Isard (1997).


Figure 5: Visible improvement in the /u/ of the synthesized word doek. Instead of an abrupt formant jump of 700 Hz at the concatenation point at 65 ms, a gradual descent of F2 from 1300 Hz to approx. 1100 Hz can be observed.

Conclusion The research presented in this paper confirms that adding context-sensitive diphones re­ duces concatenation artefacts at diphone boundaries. The number of audible discontinui­ ties at diphone boundaries is reduced, because the vowel portions at the diphone bounda­ ries that are concatenated are spectrally more alike. A more important conclusion ob­ tained in this research is that instead of recording all possible diphone combinations, and thus drastically increasing the size of the diphone database, the number of additional di­ phones to be recorded can be limited by applying a clustering algorithm in which the KL distance is used to group contexts that are close to each other spectrally. The KL distance appears to be a measure that can predict an audible discontinuity to a reasonable extent. The results of this study can be extended to other phonemes that are likely to be af­ fected by coarticulation, such as short vowels, diphthongs, and semi-vowels like Ill, Iwl, and IjI. But more importantly, the results can be useful in a new synthesis method that has become popular over the past few years, namely non-uniform unit selection (Sagisaka, 1988; Black & Campbell, 1995). The principle behind this approach is that concatenative

units are no longer fixed and stored separately, but are selected on the spot from a large speech corpus. In order for this approach to work well, it is crucial to have an objective distance measure that corresponds well with human perception, which can be used to select the units from the corpus that have the smallest perceptual distance to the ideal units. The KL distance could be a suitable distance measure for this purpose.
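As a concrete illustration of such a spectral distance, a symmetrized Kullback-Leibler distance can be computed between two power spectra by normalizing each spectrum and treating it as a probability distribution over frequency bins. The exact formulation used in the work described above may differ; this is a common discrete variant, given here as a sketch.

```python
from math import log

def kl(p, q):
    """Discrete Kullback-Leibler divergence D(p || q).  p and q are
    normalized spectra (non-negative, summing to 1); q is assumed
    strictly positive wherever p is."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(spec_a, spec_b):
    """Symmetrized KL distance between two power spectra.  The spectra
    are normalized to sum to 1 before comparison, so the measure
    depends only on spectral shape, not overall energy."""
    sa, sb = sum(spec_a), sum(spec_b)
    p = [x / sa for x in spec_a]
    q = [x / sb for x in spec_b]
    return kl(p, q) + kl(q, p)
```

The distance is zero for identical spectral shapes and grows as the shapes diverge, which is the property needed both for clustering diphone contexts and for ranking candidate units.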

Acknowledgements

This project is part of the Priority Programme Language and Speech Technology (TST), sponsored by NWO (the Netherlands Organization for Scientific Research).

References

Black, A. & Campbell, W.N. (1995). Optimising selection of units from speech databases for concatenative synthesis. In: J.M. Pardo, E. Enriquez, J. Ortega, J. Ferreiros, M. Macias and F.J. Valverde (Eds.): Proceedings of the 4th European Conference on Speech Communication and Technology, EUROSPEECH-95, Volume 1, Madrid, Spain, September 18-21, 1995, 581-584.
Conkie, A. & Isard, S. (1997). Optimal coupling of diphones. In: J. van Santen, R. Sproat, J. Olive and J. Hirschberg (Eds.): Progress in Speech Synthesis, 293-304. New York: Springer-Verlag.
Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis. Dordrecht: Kluwer Academic Publishers.
Gigi, E.F. & Vogten, L.L.M. (1997). A mixed-excitation vocoder based on exact analysis of harmonic components. IPO Annual Progress Report, 32, 105-110.
Heuvel, H. van den, Cranen, B. & Rietveld, T. (1996). Speaker variability in the coarticulation of /a, i, u/. Speech Communication, 18, 113-130.
Huang, X., Acero, A., Hon, H., Ju, Y., Liu, J. & Meredith, S. (1997). Recent improvements on Microsoft's trainable text-to-speech system - Whistler. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-97, Volume 1, Munich, Germany, April 21-24, 1997, 959-962.
Klabbers, E. & Veldhuis, R. (1998). On the reduction of concatenation artefacts in diphone synthesis. In: R.H. Mannell and J. Robert-Ribes (Eds.): Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP-98, Volume 5, Sydney, Australia, November 30 - December 4, 1998, 1983-1986.
Luce, R. & Krumhansl, C. (1988). Measurement, scaling and psychophysics. In: S. Stevens (Ed.): Handbook of Experimental Psychology (2nd ed.), 3-73. New York: Wiley.
Olive, J., Santen, J. van, Möbius, B. & Shih, C. (1998). Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Boston: Kluwer Academic Publishers.
Sagisaka, Y. (1988). Speech synthesis using an optimal selection of non-uniform synthesis units. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-88, New York, USA, April 11-14, 1988, 679-682.
Santen, J. van (1997). Prosodic modeling in text-to-speech synthesis. In: G. Kokkinakis, N. Fakotakis and E. Dermatas (Eds.): Proceedings of the 5th European Conference on Speech Communication and Technology, EUROSPEECH-97, Volume 1, Rhodes, Greece, September 22-25, 1997, 19-28.

On-line spotting of ASR errors by means of prosody

J. Hirschberg¹, D. Litman¹ and M.G.J. Swerts²
e-mail: [email protected]

Abstract

Given the state of the art of current speech technology, errors are unavoidable in present spoken dialogue systems. Therefore, one of the main concerns in dialogue design is how to decide whether or not the system has understood the user correctly. We identify methods to distinguish between correctly and incorrectly recognized utterances, scored by hand for semantic concept accuracy, using acoustic/prosodic characteristics. The analysis was performed on data collected during independent experiments done with an interactive voice response system that provides travel information over the phone.

Introduction

Given the state of the art of current speech technology, spoken dialogue systems are very error-prone, largely because of user utterances that are misrecognized (Oviatt, Bernard & Levow, 1998; Krahmer, Swerts, Theune & Weegels, 1999). As a result, "[...] the lion's share of dialogue management intelligence in the present generation of [spoken dialogue] systems is mainly needed to cope with recognition errors" (Den Os, Boves, Lamel & Baggia, 1999, p. 1528). Accordingly, a major question in current dialogue design is how such errors can be detected, which is a necessary precondition for solving communication problems between the system and the user. One attempted solution for the detection of errors is to use acoustic confidence scores, which Automatic Speech Recognition (ASR) systems use as reliability measures to decide whether or not to 'believe' the recognized string. However, there is no simple one-to-one mapping between low confidence scores and recognition errors, nor between high confidence scores and correct recognitions (see e.g., Bouwman, Sturm & Boves, 1999). In addition, communication problems may be due to factors other than recognition errors, for example because the system makes wrong default assumptions. Finally, to make things even more complex, in some cases misrecognitions may occur without causing communication errors, i.e., when the recognized string is conceptually identical to the intended meaning of the user's utterance (e.g., "a.m." versus "in the morning") (Taylor, King, Isard & Wright, 1998). In conclusion, a spoken dialogue system cannot fully rely on confidence scores to decide whether or not it can believe a recognized utterance. The current research is an attempt to find out whether an ASR system can use other sources of information - in addition to the acoustic confidence scores - to decide whether or not it can believe the recognized string. More specifically, the question is whether prosody may signal recognition errors. From past research, we know that prosody may indeed be a useful cue in that respect. Recognition performance is known to vary depending upon

1. Affiliated with AT&T Research Laboratories, Florham Park, NJ, USA.
2. Also affiliated with Antwerp University (UIA) and with the Flemish Fund for Scientific Research (FWO-Flanders).

the relative formality or casualness of speaking style (Weintraub, Taussig, Hunicke-Smith & Snodgrass, 1996), though there have been few attempts to identify this variation precisely. One exception is a study of the effect of speaking style on recognition performance in the Switchboard Corpus (Ostendorf et al., 1996). Lexical and prosodic features hypothesized as possible correlates of acoustic errors were calculated for correctly and incorrectly modelled regions. Utterances were divided into regions where the word hypothesis was correct, compared to regions where there was an acoustic modelling error. Numerous features were calculated for these regions, such as phone identity, lexical stress placement, normalized f0 and energy, word frequency, part-of-speech, speaking rate, word length, and recognized silence before and after the word. The best automatically trained procedure, using decision trees, distinguished correct from incorrect matches with an error of 24.4% (baseline: 35.8%), with the most important feature being speaking rate. Other research has identified particular types of speech variability, both across and within speakers, which may also be linked to prosodic variation, and which have been shown to have a negative impact on recognition performance. In particular, several studies (Wade, Shriberg & Price, 1992; Oviatt, Levow, MacEachern & Kuhn, 1996; Swerts & Ostendorf, 1997; Levow, 1998; Bell & Gustafson, 1999) report that hyperarticulate speech, characterized by careful enunciation, slowed speaking rate, and an increase in pitch and loudness, often occurs when users in human-machine interactions try to correct system errors. Others have shown that such speech also decreases recognition performance (Soltau & Waibel, 1998).
Finally, prosody has been shown to be effective in ranking recognition hypotheses, as a post-processing filter that degrades wrong hypotheses and selects the correct ones, basically following an analysis-by-synthesis method to check the suitability of a recognition candidate through its prosodic structure (Veilleux, 1994; Hirose, 1997). The current paper presents results of prosodic analyses which show that misrecognized turns differ significantly from correctly recognized ones in terms of a number of prosodic features. In this research, we consider 'semantic' errors rather than plain word errors. We also describe machine-learning experiments that combine prosodic features with additional sources of automatically obtainable information, such as the recognized string and the acoustic confidence score, to predict with a high degree of accuracy whether speaker turns will be correctly recognized or not.

Corpus

Our corpus consisted of a set of dialogues between humans and TOOT, a spoken dialogue system for accessing train schedules from the web via telephone, which was collected to study variations in dialogue strategy (Litman & Pan, 1999). TOOT is implemented using a platform developed at AT&T, combining ASR, text-to-speech, a phone interface, and modules for specifying a finite-state machine dialogue manager and application functions. The speech recognizer is a speaker-independent hidden Markov model system with context-dependent phone models for telephone speech and constrained grammars defining the vocabulary that is permitted in any dialogue state. Subjects performed four tasks with one of several versions of TOOT, which differed in terms of locus of initiative (system, user, or mixed), confirmation strategy (explicit, implicit, or none), and whether these conditions could be changed by the user during the task. Subjects were 39 students, 20 native speakers and 19 non-native speakers; 16 subjects were female and 23 male. Dialogues were recorded and system and user behaviour was automatically logged. The concept accuracy (CA) of each non-rejected turn was manually labelled by one of the experimenters, by listening to the dialogue recordings while examining the system log. If the ASR output correctly captured all the task-related information in the turn (e.g., time, departure and arrival cities), the turn was given a CA score of 1 (a semantically correct recognition). Otherwise, the CA score reflected the percentage of correctly recognized task information in the turn. For the study described below, we examined 2067 user turns from 152 dialogues in this corpus, which included 170 rejections, based on acoustic confidence scores, and 491 turns with CA < 1.
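The CA labelling described above can be made concrete with a small sketch. The slot names and the dictionary representation below are hypothetical illustration; in the study itself the scores were assigned by hand from the recordings and system logs.

```python
def concept_accuracy(intended, recognized):
    """Fraction of the task-related concepts in a user turn that the
    ASR output captured correctly.  `intended` and `recognized` map
    task slots (e.g. 'depart_city', 'time') to values.  A turn whose
    concepts are all captured gets CA = 1 (a semantically correct
    recognition); a turn with no task concepts is trivially correct."""
    if not intended:
        return 1.0
    correct = sum(1 for slot, value in intended.items()
                  if recognized.get(slot) == value)
    return correct / len(intended)
```

Note that CA = 1 can hold even when the word string is wrong, as long as every task concept survives, which is exactly the sense in which "a.m." and "in the morning" count as the same recognition.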

Results

Characterizing misrecognitions

Our goal was to determine whether misrecognized turns differed from recognized turns in terms of their prosodic features, and, if so, to develop methods to distinguish these automatically. To this end, we examined the f0 and rms average values and the f0 and rms maximum values for each turn (calculated using Entropic Research Laboratory's pitch tracker, get_f0); the total duration of the user turn; the length of the pause preceding the turn; the amount of silence within the turn (percentage of zero frames); and the tempo (calculated as syllables from the recognized string per second). For each feature, we calculated mean values for recognized (concept accuracy score of 1) turns and for misrecognized (concept accuracy less than 1) turns on a per-speaker basis. We then performed paired t-tests on these means (Table 1), to ensure that our results applied across speakers.

Table 1: Comparison of misrecognized vs. recognized turns by prosodic feature across speakers.

Feature       T-stat   Mean Misrecognized - Recognized   P value
F0 max          5.40    28.87 Hz                          0
RMS max         3.00   185.74                             .005
RMS avg         2.07   -32.43                             .05
Duration       10.42     2.34 sec                         0
Prior Pause     5.22      .38 sec                         0
Tempo            .45      .13 sps                         .65
F0 avg          1.36     1.63 Hz                          .18
% Silence        .05     -.02%                            .29
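The paired comparison summarized in Table 1 can be sketched as follows: for each speaker, compute the mean feature value over misrecognized turns and over recognized turns, then run a paired t-test across speakers on those per-speaker means. The function below computes only the paired t-statistic from first principles (the p-values in Table 1 additionally require the t distribution with n-1 degrees of freedom); the sample data in the test are invented for illustration.

```python
def paired_t_statistic(misrecognized_means, recognized_means):
    """Paired t-statistic across speakers: t = mean(d) / (sd(d)/sqrt(n)),
    where d[i] is the difference, for speaker i, between the mean
    feature value of that speaker's misrecognized turns and the mean
    of their recognized turns."""
    d = [m - r for m, r in zip(misrecognized_means, recognized_means)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance
    return mean_d / (var_d / n) ** 0.5
```

Pairing by speaker is what licenses the "across speakers" claim: each speaker serves as their own control, so between-speaker differences in, say, habitual pitch range cancel out.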

Our results indicate that misrecognized turns do differ in important ways from correctly recognized ones. We found statistically significant differences (at a 95% confidence level) across speakers for the first five features listed in the table. Misrecognized turns exhibited significantly higher f0 and rms maxima, lower mean rms, longer durations, and were preceded by longer pauses than were correctly recognized speaker turns. Speaking rate, f0 average, and amount of silence within the turn showed no such differences. We ran similar tests on normalized versions of all features, normalizing by first turn in the task, by preceding turn, or by scaling values per speaker, with similar results in each case. When we include rejected turns with misrecognized turns, and simply compare these non-recognitions with correct recognitions, we also obtain statistically significant differences in the same direction for f0 and rms maximum, duration, and preceding pause, as above. Thus, our findings in fact distinguish all 'problem' turns, including those the ASR engine itself (rightly or wrongly) hypothesizes are poorly recognized. The features we found to be significant indicators of a failed recognition (f0 excursion, loudness, long prior pause, and longer duration) are all features previously associated with hyperarticulated speech. Since prior research has suggested that speakers may respond to failed recognition attempts with hyperarticulation, which itself may lead to more recognition failures, had we in fact identified a means of characterizing and identifying hyperarticulated speech prosodically? Since we had independently labelled all speaker turns for evidence of hyperarticulation (two of the authors labelled each turn as 0 'not hyperarticulated', 1 'some hyperarticulation in the turn', or 2 'hyperarticulated', following Wade, Shriberg and Price (1992)), we were able to test this possibility. We excluded any turn either labeller had labelled as '1' or '2', and again performed paired t-tests on mean values of misrecognized vs. recognized turns for each speaker. Results again showed that f0 and rms maximum, total turn duration, and prior pause are significantly different for misrecognized vs. recognized turns.
Thus, while hyperarticulated turns are indeed recognized significantly more poorly than non-hyperarticulated turns, and misrecognized turns are significantly more likely to have a higher hyperarticulation score, and while hyperarticulated turns do have a significantly higher f0 maximum, rms maximum, and overall duration than non-hyperarticulated turns, our findings for the prosodic characteristics of misrecognized turns hold even when perceptibly hyperarticulated turns are excluded from the corpus. While the prosodic features we found to characterize semantic misrecognitions hold across speakers, we also checked whether these features could help explain the 'goat' phenomenon in ASR experience - the fact that some voices are recognized much more poorly than others (Doddington et al., 1998). Could this phenomenon be accounted for in terms of a propensity of some speakers to typically employ greater f0 and rms excursion, to produce turns of a longer duration, and to produce prolonged pauses before their turns? To explore this possibility, we divided our speakers into those whose ratio of unrecognized to recognized turns was .5 or greater (goats) and those with a ratio of less than .5 (sheep). Of 18 goats overall, 7 were women, 9 were non-native speakers, and 14 of the total fell into at least one of those categories. We restricted our analysis to speakers participating in experimental conditions in which the system provided implicit confirmation and permitted mixed speaker/system initiative, since this condition provided roughly equal numbers of goats (8) and sheep (5). Three of the 8 goats were non-native speakers, and 2 of the sheep, so the proportion of non-native to native speakers was roughly the same in each category. Of the goats, 3 were male and 5 female, while only one of the 5 sheep was female. We calculated means for each speaker over all turns for the features f0 and rms maximum, turn duration, and prior pause, which we found above to characterize unrecognized vs. recognized turns, and performed t-tests on the means. The results do indeed suggest

that goats may be distinguished from sheep prosodically, although not in the same way that unrecognized turns can be distinguished. Our 'goats' are distinguished from better recognized speakers in terms of significant differences in overall f0 maximum and tempo - the latter a factor that does not figure in our prior analyses; goats exhibit larger f0 excursions and speak more slowly than sheep. However, when prosodic features are normalized by first turn in task or by mean for speaker over task, f0 maximum ceases to be a significant predictor of 'goat' status. So, apparently, speakers with higher ranges prove difficult for the recognizer, as folk wisdom has found. But the simple hypothesis that goats more often produce utterances with the prosodic characteristics found to distinguish misrecognized utterances is not supported.

Table 2: Estimated error for predicting misrecognitions.

Features Used          Error (SE)
All                    10.79% (.83)
Prosody, ASR Conf      12.71% (.71)
ASR Conf, ASR string   13.39% (.85)
ASR string             14.41% (.78)
Prosody                16.15% (.87)
ASR Conf               18.30% (.81)
Baseline               25.88%

Machine-learning experiments

Given the prosodic differences between misrecognized and correctly recognized utterances in our corpus, is it possible to accurately predict when a particular utterance will be semantically misrecognized or not? This section describes experiments using the machine learning program RIPPER (Cohen, 1996) to automatically induce such a prediction model. Like many learning programs, RIPPER takes as input the classes to be learned, the names and possible values (symbolic, continuous, or text) of a set of features, and training data specifying the class and feature values for each training example. RIPPER outputs a classification model for predicting the class of future examples. The model is learned using greedy search guided by an information gain metric, and is expressed as an ordered set of if-then rules. When multiple rules are applicable, RIPPER uses the first rule. Our 2 predicted classes correspond to concept accuracy of 1 (T) or not 1 (F). As in the previous section, the features used in our learning experiments include the raw acoustic features. However, our learning experiments also include other potential predictors of misrecognition. Four features represent the experimental conditions described above (system (adaptive or non-adaptive), dialogue strategy (mixed initiative with implicit system confirmation, system initiative with explicit confirmation, or user initiative with no system confirmation), subject, and task number), while 2 features represent the ASR output (confidence score and recognized string). All of these features could in principle be made automatically available to a spoken dialogue system. We also include one hand-labelled

prosodic feature, the hyperarticulation score, described in the previous section. Table 2 shows the relative performance of some feature sets we tried³. The best performing feature set uses all of the features discussed above, and gives an error of 10.79% +/- .83%. Furthermore, even when we exclude the features that were not automatically logged in the original experiment (even if they could have been automated, in principle), or that represent the experimental conditions, our results are statistically equivalent. In contrast, a baseline classifier predicting the majority class of T has an error of 25.88%. A sample classification model (error of 11.21% +/- .72%) learned from only the prosodic and other automatically available features is shown below.

if (asr ≤ -3.33589) ∧ (dur ≥ 1.42496) ∧ (tempo ≤ 2.54345) then F
if (asr ≤ -3.80883) ∧ (dur ≥ 1.13314) ∧ (f0avg ≤ 201.595) then F
if (asr ≤ -2.75271) ∧ (tempo ≤ 1.51501) ∧ (zeros ≤ 0.595294) then F
if (asr ≤ -2.74865) ∧ (string contains 'help') then F
if (asr ≤ -5.04968) ∧ (dur ≥ 0.925401) ∧ (zeros ≤ 0.453237) then F
if (asr ≤ -2.71435) ∧ (dur ≥ 0.982058) ∧ (f0max ≤ 155.887) then F
if (asr ≤ -3.61652) ∧ (string contains '8') then F
if (dur ≥ 1.39754) ∧ (tempo ≤ 2.25751) ∧ (zeros ≤ 0.626667) then F
if (tempo ≥ 7.25392) then F
else T
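An ordered rule list of this kind is applied by firing the first rule whose condition holds and falling back to the default class T when no rule matches. The sketch below encodes only three of the rules shown (the thresholds are copied from the model; the dictionary-based turn representation is an illustrative assumption, not RIPPER's own data format).

```python
# Each rule is (condition, class).  RIPPER-style models are ordered:
# the first matching rule decides, so rule order matters.
RULES = [
    (lambda t: t["asr"] <= -3.33589 and t["dur"] >= 1.42496
               and t["tempo"] <= 2.54345, "F"),
    (lambda t: t["asr"] <= -2.74865 and "help" in t["string"], "F"),
    (lambda t: t["tempo"] >= 7.25392, "F"),
]

def classify_turn(turn, rules=RULES, default="T"):
    """Predict 'T' (semantically correct recognition) or 'F'
    (misrecognition) for one turn's feature dictionary by firing
    the first applicable rule."""
    for condition, label in rules:
        if condition(turn):
            return label
    return default
```

Note how low confidence (asr) alone never fires a rule: it is always combined with a prosodic condition or a recognized-string condition, which is the combination effect discussed below.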

Note that 5 prosodic features appear in these rules. Although f0 maximum and duration were shown to be significant features in our earlier t-tests, these rules also include 3 additional (non-significant) prosodic features (tempo, f0 average, and % silence (zeros)). If we consider the predictive power of each of our prosodic features in isolation, we find that duration is the best performing feature (error of 17.01% +/- .99%), followed by tempo (20.70% +/- .99%). All of the other prosodic features perform no better in isolation than the baseline. In addition to the prosodic features, this rule set also contains two ASR features - confidence score (asr) and recognized string (string). While the confidence score was intended to predict word accuracy, it also appears useful in predicting concept accuracy. More surprising perhaps is the appearance of certain particular recognized strings as cues to recognition accuracy. Note the use of the recognized string 'help' in the fourth rule and '8' in the seventh. Using the ASR string alone is in fact the best single feature in predicting our data (14.41%), although this error rate is still worse than our best combination of features, and is probably less likely to generalize. Finally, despite the fact that system versions and dialogue strategies differ in our corpus, none of these features appear in the rules. From these results it is still not clear what the contribution of our prosodic variables is. Do they simply correlate with information already available to recognition systems? In particular, are they adding any predictive power not already available from the ASR confidence scores? To investigate this possibility, we examined the contribution of the confidence score alone in predicting misrecognitions vs. correct recognitions. Using this metric alone, the estimated error is 18.3%, significantly worse than the error estimated

3. The error rates and standard errors (SE) result from 25-fold cross-validation. Two error rates are statistically significantly different when the intervals given by the errors plus or minus twice the standard error do not overlap.

when our prosodic features are included (12.71%), and statistically equivalent to using prosody alone (16.15%). But using the confidence score plus the recognized string gives an estimated error of 13.39%, which, at a 95% confidence level, is not significantly different from our best performing feature set. With respect to the prosodic features, it should be noted that when we compare the performance of prosodic features normalized in different ways (discussed above), we find that the unnormalized feature set significantly outperforms (by 6-10%) the prosodic features normalized by first task utterance or by previous utterance. This difference may indicate that there are limits to the ranges in features such as f0 and rms maximum, duration, and preceding pause within which recognition performance is optimal.
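The significance criterion from footnote 3 (two cross-validation error rates differ significantly when their intervals of error plus or minus twice the SE do not overlap) can be written out directly; the values in the test below are taken from Table 2.

```python
def significantly_different(err_a, se_a, err_b, se_b):
    """Footnote-3 criterion for 25-fold cross-validation results:
    two error rates (in %) are statistically significantly different
    when the intervals err +/- 2*SE do not overlap."""
    lo_a, hi_a = err_a - 2 * se_a, err_a + 2 * se_a
    lo_b, hi_b = err_b - 2 * se_b, err_b + 2 * se_b
    return hi_a < lo_b or hi_b < lo_a
```

With the Table 2 figures, the all-features set (10.79, SE .83) differs significantly from ASR confidence alone (18.30, SE .81), but not from confidence score plus recognized string (13.39, SE .85), matching the comparisons reported above.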

Discussion

Lewis and Norman (1986) point out that, as soon as human users are involved, it is not possible to design user interfaces that fully eliminate errors. Users may make mistakes or provide the system with input that it cannot interpret. Since such failures are unavoidable, it is generally considered best practice to design user interfaces in such a way that they include methods to discover errors and to recover from them. This implies that a run-time prediction of error is crucial to allow the system to immediately adapt its interaction strategy accordingly (Johnson, 1999). The current study has presented research whose goal it was to immediately detect recognition errors made by the ASR component of a spoken dialogue system. Results from a statistical comparison of recognized vs. misrecognized utterances indicate that there are significant prosodic differences between misrecognized and correctly recognized utterances, both within and across speakers. Our machine learning experiments show that prosodic features can be used to predict semantic misrecognitions with a high degree of accuracy, improving over the use of simple acoustic-based confidence scores by about one third. Next steps include testing whether our results hold across recognizers, continuing our comparison of goats vs. sheep, and extending our current study characterizing misrecognitions to include analysis of turns in which users become aware of these misrecognitions and turns in which they correct them. Our ultimate goal is to provide prosody-based mechanisms for identifying and reacting to ASR failures in spoken dialogue systems.

Acknowledgements

The research reported upon in this paper was partly carried out while Marc Swerts was a visiting researcher at AT&T Research Laboratories in spring 1999. A slightly different version of this paper was presented at the 1999 International Workshop on Automatic Speech Recognition and Understanding (ASRU), December 12-15, 1999, Keystone, CO, USA.

References

Bell, L. & Gustafson, J. (1999). Repetition and its phonetic realizations: Investigating a Swedish database of spontaneous computer-directed speech. Proceedings of the 14th International Congress of Phonetic Sciences, ICPhS-99, Volume 2, San Francisco, CA, USA, August 1-7, 1999, 284-299.
Bouwman, A.G.G., Sturm, J. & Boves, L. (1999). Incorporating confidence measures in the Dutch train timetable information system developed in the ARISE project. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Volume 1, Phoenix, AZ, USA, March 15-19, 1999, 493-496.
Cohen, W. (1996). Learning trees and rules with set-valued features. Proceedings of the 13th National Conference on Artificial Intelligence, AAAI-96, Portland, OR, USA, August 4-8, 1996, 709-716.
Doddington, G., Liggett, W., Martin, A., Przybocki, M. & Reynolds, D. (1998). Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP-98, Sydney, Australia, November 30 - December 4, 1998, 1351-1354.
Hirose, K. (1997). Disambiguating recognition results by prosodic features. In: Y. Sagisaka, N. Campbell and N. Higuchi (Eds.): Computing Prosody: Computational Models for Processing Spontaneous Speech, 327-342. New York: Springer.
Johnson, C. (1999). Why human error modeling has failed to help system development. Interacting with Computers, 11(5), 517-524.
Krahmer, E., Swerts, M., Theune, M. & Weegels, M. (1999). Error spotting in human-machine interaction. Proceedings of Eurospeech-99, Budapest, Hungary, September 5-7, 1999, 1423-1426.
Levow, G. (1998). Characterizing and recognizing spoken corrections in human-computer dialogue. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, and the 17th International Conference on Computational Linguistics, COLING/ACL-98, Montreal, Quebec, Canada, August 10-14, 1998, 736-742.
Lewis, C. & Norman, D.A. (1986). Designing for error. In: D.A. Norman and S.W. Draper (Eds.): User-Centered System Design, 411-432. Hillsdale, London: Lawrence Erlbaum.
Litman, D. & Pan, S. (1999). Empirically evaluating an adaptable spoken dialogue system. Proceedings of the 7th International Conference on User Modeling, Banff, Alberta, Canada, June 20-24, 1999, 55-64.
Os, E. den, Boves, L., Lamel, L. & Baggia, P. (1999). Overview of the ARISE project. Proceedings of Eurospeech-99, Budapest, Hungary, September 5-7, 1999, 1527-1530.
Ostendorf, M., Byrne, B., Bacchiani, M., Finke, M., Gunawardana, A., Ross, K., Roweis, S., Shriberg, E., Talkin, D., Waibel, A., Wheatley, B. & Zeppenfeld, T. (1996). Modeling Systematic Variations in Pronunciation via a Language-Dependent Hidden Speaking Mode. Final Report of the 1996 CLSP/JHU Workshop on Innovative Techniques for Large Vocabulary Continuous Speech Recognition. (Unpublished)
Oviatt, S., Levow, G.A., MacEachern, M. & Kuhn, K. (1996). Modeling hyperarticulate speech during human-computer error resolution. Proceedings of the 4th International Conference on Spoken Language Processing, ICSLP-96, Philadelphia, PA, USA, October 3-6, 1996, 801-804.
Oviatt, S., Bernard, J. & Levow, G.A. (1998). Linguistic adaptations during spoken and multimodal error resolution. Language and Speech, Special issue on Prosody and Conversation, 41(3-4), 419-422.
Soltau, H. & Waibel, A. (1998). On the influence of hyperarticulated speech on recognition performance. Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP-98, Sydney, Australia, November 30 - December 4, 1998, 229-232.
Swerts, M. & Ostendorf, M. (1997). Prosodic and lexical indications of discourse structure in human-machine interactions. Speech Communication, 22, 25-41.
Taylor, P., King, S., Isard, S. & Wright, H. (1998). Intonation and dialogue context as constraints for speech recognition. Language and Speech, Special issue on Prosody and Conversation, 41(3-4), 493-512.
Veilleux, N. (1994). Computational Models of the Prosody/Syntax Mapping for Spoken Language Systems. PhD thesis, Boston University.
Wade, E., Shriberg, E. & Price, P. (1992). User behaviors affecting speech recognition. Proceedings of the 2nd International Conference on Spoken Language Processing, ICSLP-92, Banff, Alberta, Canada, October 12-16, 1992, 995-998.
Weintraub, M., Taussig, K., Hunicke-Smith, K. & Snodgrass, A. (1996). Effect of speaking style on LVCSR performance. Proceedings of the 4th International Conference on Spoken Language Processing, ICSLP-96, Philadelphia, PA, USA, October 3-6, 1996, Addendum, 16-19.

Publications

P.1525 S.N. Yendrikhovskij, F.J.J. Blommaert and H. de Ridder
Color reproduction and the naturalness constraint
In: Color Research and Application, 1999, 24(1), 52-67
This article aims to determine the process underlying the subjective impression of the fidelity of reproduced object colors. To this end, we present the concept of the naturalness constraint and a framework for the specification of naturalness judgments. We consider several research questions that are essential for this framework and discuss plausible answers supported by experiments. In general, naturalness assessment of reproduced object colors can be (1) defined as similarity to prototypical object colors and (2) characterized by a probability density function (e.g., Gaussian). Experiments show that (3) there is a considerable amount of consistency in naturalness judgments of locally and globally processed images (although observers are slightly more tolerant of global image processing), and (4) naturalness judgments vary for different object categories; e.g., subjects are more consistent in naturalness judgments of skin, grass, and sky reproductions than of shirt reproductions. We suggest that (5) the naturalness of a whole picture is determined by the naturalness of the most critical object in that picture. Finally, we introduce a naturalness index predicting the perceived naturalness of the color reproduction of real-life scenes.

P.1526 C. Dillen, B. de Ruyter and P. Cosyns
Zelfmoordpogingen in Vlaanderen: Een multicentrumstudie (Suicide attempts in Flanders: A multi-centre study)
In: Tijdschrift voor Geneeskunde, 1999, 55(2), 88-95
Inspired by a World Health Organization (WHO) study on parasuicide, the working group Suicide Research in Flanders started a registration project in five centres in Flanders. All suicide attempts registered in emergency departments were included in the study. This registration started in January 1992 and lasted until 1995. A total of 2643 suicide attempts (59% by females) were registered during this period. For the majority, the age of the attempters varies between 21 and 35 years. Intoxication via medicine is most common (88%), the most frequently used product being benzodiazepine. Almost half of the sample had made at least one previous attempt before this study. As many as 33% had contact with their general practitioner (GP) about two to four weeks before the attempt. Most attempts took place in the home environment and were discovered within the hour. While 40% of the sample was referred by their family, only 18% was referred by the GP. Almost half of the sample received a stomach rinse (although fully conscious during intake), while 21% remained under simple observation. Finally, 47% of the sample was committed to a psychiatric department. This latter result differed significantly across the different hospitals.

P.1527 Armin Kohlrausch Eigener ganzheitlicher Ansatz (An original holistic approach) In: Hörakustik, 1999, 34(1), 79-80 Book review of: Akustische Kommunikation. Grundlagen mit Hörbeispielen (Acoustic communication: Fundamentals with listening examples) by Ernst Terhardt.

P.1528 Joyce Vliegen and Andrew J. Oxenham Sequential stream segregation in the absence of spectral cues In: Journal of the Acoustical Society of America, 1999, 105(1), 339-346 This paper investigates the cues used by the auditory system in the perceptual organization of sequential sounds. In particular, the ability to organize sounds in the absence of spectral cues is studied. In the first experiment, listeners were presented with a tone sequence ABA ABA ..., where the fundamental frequency (f0) of tone A was fixed at 100 Hz and the f0 difference between tones A and B varied across trials between 1 and 11 semitones. Three spectral conditions were tested: pure tones, harmonic complexes filtered with a bandpass region between 500 and 2000 Hz, and harmonic complexes filtered with a bandpass region chosen so that only harmonics above the tenth would be passed by the filter, thus severely limiting spectral information. Listeners generally reported that they could segregate tones A and B into two separate perceptual streams when the f0 interval exceeded about four semitones. This was true for all conditions. The second experiment showed that most listeners were better able to recognize a short atonal melody interleaved with random distracting tones when the distracting tones were in an f0 region 11 semitones higher than the melody than when the distracting tones were in the same f0 region. The results were similar for both pure tones and complex tones comprising only high, unresolved harmonics. The results from both experiments show that spectral separation is not a necessary condition for perceptual stream segregation. This suggests that models of stream segregation that are based solely on spectral properties may require some revision.
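As a reading aid (ours, not part of the abstract), the semitone intervals used above map onto frequency by the standard equal-tempered relation f = f0 · 2^(n/12); the sketch below illustrates the roughly four-semitone segregation boundary for the 100-Hz tone A.

```python
def semitones_above(f0_hz, n):
    """Frequency (Hz) of a tone n equal-tempered semitones above f0_hz."""
    return f0_hz * 2.0 ** (n / 12.0)

# Tone B four semitones above the 100-Hz tone A -- roughly where listeners
# in the first experiment began to hear two separate streams.
print(round(semitones_above(100.0, 4), 1))   # about 126.0 Hz
```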

P.1529 Reinier W.L. Kortekaas Psychoacoustical evaluation of PSOLA. II. Double-formant stimuli and the role of vocal perturbation In: Journal of the Acoustical Society of America, 1999, 105(1), 522-535 This article presents the results of listening experiments and psychoacoustical modelling aimed at evaluating the pitch-synchronous-overlap-and-add (PSOLA) technique. This technique can be used for simultaneous modification of the pitch and duration of natural speech, using simple and efficient time-domain operations on the speech waveform. The first set of experiments tested the ability of subjects to discriminate double-formant stimuli, modified in fundamental frequency using PSOLA, from unmodified stimuli. Of the potential auditory discrimination cues induced by PSOLA, cues from the first formant were found to generally dominate discrimination performance. In the second set of experiments, the influence of vocal perturbation, i.e., jitter and shimmer, on the discriminability of PSOLA-modified single-formant stimuli was determined. The data show that discriminability deteriorates at most modestly in the presence of jitter and shimmer. With the exception of a few conditions, the trends in these data could be replicated by using either a modulation-discrimination or an intensity-discrimination model, depending on the formant frequency. As a baseline experiment, detection thresholds for jitter and shimmer were measured. Thresholds for jitter could be replicated by using either the modulation-discrimination or the intensity-discrimination model, depending on the (mean) fundamental frequency of the stimuli. The thresholds for shimmer could be accurately predicted for stimuli with a 250-Hz fundamental, but less accurately in the case of a 100-Hz fundamental.

P.1530 Jonathan Freeman, S.E. Avons, Don E. Pearson and Wijnand A. IJsselsteijn Effects of sensory information and prior experience on direct subjective ratings of presence In: Presence, 1999, 8(1), 1-13 We report three experiments using a new form of direct subjective presence evaluation that was developed from the method of continuous assessment used to assess television picture quality. Observers were required to provide a continuous rating of their sense of presence using a hand-held slider. The first experiment investigated the effects of manipulating stereoscopic and motion-parallax cues within video sequences presented on a 20-in. stereoscopic CRT display. The results showed that the presentation of both stereoscopic and motion-parallax cues was associated with higher presence ratings. One possible interpretation of Experiment 1 is that CRT displays that contain the spatial cues of stereoscopic disparity and motion parallax are more interesting or engaging. To test this, observers in Experiment 2 rated the same stimuli first for interest and then for presence. The results showed that variations in interest did not predict the presence ratings obtained in Experiment 1. However, the subsequent ratings of presence differed significantly from those obtained in Experiment 1, suggesting that prior experience with interest ratings affected subsequent judgments of presence. To test this, Experiment 3 investigated the effects of prior experience on presence ratings. Three groups of observers rated a training sequence for interest, presence, and 3-Dness before rating the same stimuli as used in Experiments 1 and 2 for presence. The results demonstrated that prior ratings sensitize observers to different features of a display, resulting in different presence ratings. The implications of these results for presence evaluation are discussed, and a combination of more refined subjective measures and a battery of objective measures is recommended.

P.1531 Adrianus J.M. Houtsma Acoustics education in the Low Countries Abstract in: Acustica, 1999, 85, Supplement 1, S10 This presentation is an overview of acoustics education programs found at universities and professional colleges in The Netherlands and Flanders. As has happened in many other academic areas, the traditional organization of educational programs along disciplinary lines has to a large extent given way to an organization based on application areas. In university programs on acoustics, one finds clusters of specialized subjects that seem to be typical for schools or departments, such as audiology, audiometry, and speech therapy in medical schools; phonetics, music, and speech in humanities departments; and architectural acoustics, noise control, imaging, and fundamentals in applied sciences or engineering departments. Several professional colleges offer courses that are focused on measurement techniques, insulation properties of materials, and noise control in buildings and structures. A fundamental and broad treatment of acoustics, followed by a treatment of specialized applications in the areas of noise control engineering, architectural acoustics, or electroacoustics as elective options, is found in the Advanced Acoustics Course, organized by the Dutch and Belgian acoustical societies and taught annually in Antwerp.

P.1532 Mark Houben, Luuk Franssen and Dik Hermes Auditory perception of rolling balls Abstract in: Acustica, 1999, 85, Supplement 1, S54 Two experiments investigating the perception of recorded sounds of rolling wooden balls are reported. The first experiment studied whether subjects can identify differences in the size of rolling wooden balls; in the second experiment, the velocity of the rolling balls was varied. Real recordings of wooden balls rolling over a wooden plate were presented pairwise, with a 700-ms pause in between. The stimuli had a duration of 800 ms and were presented at equal SPL. Subjects had to decide which of the two sound examples was created by the larger (first experiment) or faster (second experiment) ball (2I2AFC procedure). The results of the first experiment show that subjects are able to identify differences in the size of rolling balls, except for the stimulus pair with the smallest relative difference in diameter of 14%. The second experiment reveals that most subjects can clearly discriminate between rolling balls with different velocities if the relative difference exceeds about 30%. However, some subjects had difficulty labelling the sounds correctly, resulting in percentages of correct responses close to 0%. It is to be expected that such labelling errors will disappear when subjects receive feedback about the correctness of their responses.

P.1533 Adrianus J.M. Houtsma On the tones that make music Abstract in: Acustica, 1999, 85, Supplement 1, S311 Advances in musical acoustics, music psychology, architectural acoustics, and electroacoustics during the past few decades have all contributed to our present understanding of the complex processes involved in making, listening to, and enjoying music. Psychoacoustics has made contributions by exploring basic relationships between tones, tone combinations, and tone sequences on the one hand, and sensations of loudness, pitch, timbre, consonance, dissonance, and rhythm on the other. Statistical analyses of musical intervals used in certain compositions, for instance, show a clear relationship with auditory critical bandwidth. Musical instruments typically used for playing lead voices or complex harmonies are found to have lower partials with an exact or nearly exact harmonic relationship, so that they can convey clear and unambiguous pitch sensations. Even spatial effects have been used by composers from the Renaissance to the present day. Such effects do not always lead to the intended results, however, because of the often unpredictable acoustic behaviour of listening spaces. Reinier Plomp has made significant contributions to our present understanding of these problems. The title of this presentation is, in fact, the subtitle of an auditory demonstration CD recently issued by him in The Netherlands.

P.1534 Leslie R. Bernstein, Steven van de Par and Constantine Trahiotis The normalized correlation: Accounting for NoSπ thresholds with Gaussian and 'low-noise' masking noise Abstract in: Acustica, 1999, 85, Supplement 1, S416-S417 Recently, both Eddins and Barber [J. Acoust. Soc. Am. 103, 2578-2589 (1998)] and Hall et al. [J. Acoust. Soc. Am. 103, 2573-2577 (1998)] demonstrated that 'low-noise' noise produced more masking of antiphasic tones than did Gaussian noise. Eddins and Barber could not account for this finding by considering changes in the normalized interaural correlation produced by adding antiphasic signals to the diotic maskers. They calculated the normalized correlation subsequent to half-wave, square-law rectification and low-pass filtering. That approach was used successfully by Bernstein and Trahiotis [J. Acoust. Soc. Am. 100, 3774-3784 (1996)] to account for binaural detection performance as a function of frequency. In this presentation, it will be shown that one can account quantitatively for thresholds obtained with 'low-noise' noise and with Gaussian noise by incorporating a stage of compression prior to rectification and low-pass filtering. The compressive function employed produces effects similar to those measured along the basilar membrane. Adding compression allows one to account quantitatively for the low-noise noise and Gaussian-noise thresholds while maintaining a highly successful quantitative account of the Bernstein and Trahiotis data. Thus the normalized correlation remains a very useful metric that accounts for a wide variety of binaural detection data.

P.1535 Steven van de Par, Constantine Trahiotis and Leslie R. Bernstein Should normalization be included in cross-correlation-based models of binaural processing? Abstract in: Acustica, 1999, 85, Supplement 1, S417 It is generally accepted that cross-correlation-based models of binaural information processing provide excellent qualitative and quantitative accounts of data obtained in a wide variety of experimental contexts. These contexts include binaural detection, lateralization, localization, and the perception of pitch mediated by strictly binaural cues. The purpose of this research was to investigate the application of 'normalization' as part of the computation of indices of interaural correlation. Such indices have often been utilized to account for binaural detection data, but the normalization, to our knowledge, has neither been explicitly studied nor evaluated on its own merits. Normalization ensures that the value of correlation obtained with identical inputs will be

+1 (or, for polarity-opposite inputs, -1). Nevertheless, normalization, as a stage of processing, is not overtly included in modern cross-correlation-function-based models. In this presentation, logical arguments and new data regarding the need to include normalization within cross-correlation-based models of binaural detection/discrimination are provided. The new data were obtained while roving the overall levels of the stimuli in a binaural detection task. The data indicate that, to account for binaural detection, an extremely rapid and accurate stage of normalization must be included in cross-correlation-based models.
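As a reading aid (the formula is the standard one, not quoted from the abstract), the normalization discussed above divides the raw cross-product of the two ear signals by the product of their root energies, which pins identical inputs to +1 and polarity-opposite inputs to -1:

```python
import math

def normalized_correlation(left, right):
    """Normalized interaural correlation at zero lag:
    sum(l*r) / sqrt(sum(l^2) * sum(r^2))."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den

x = [0.3, -1.2, 0.7, 2.1, -0.4]
print(round(normalized_correlation(x, x), 6))                # identical inputs: 1.0
print(round(normalized_correlation(x, [-v for v in x]), 6))  # polarity-opposite: -1.0
```

Because of this normalization, the index is insensitive to the overall level of the inputs, which is why the level-roving experiment described above bears directly on whether a normalization stage belongs in the models.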

P.1536 Armin Kohlrausch and Jeroen Breebaart Measuring the role of masker-correlation uncertainty in binaural masking experiments Abstract in: Acustica, 1999, 85, Supplement 1, S465 Masked thresholds of a 500-Hz sinusoid were measured in an NoSπ condition for both running- and frozen-noise maskers using a 3IFC procedure. The nominal masker correlation varied between 0.64 and 1, and the bandwidth of the masker was either 10, 100, or 1000 Hz. The running-noise thresholds were expected to be higher than the frozen-noise thresholds because of the interaural correlation uncertainty of the masker intervals in running-noise conditions. Since this correlation uncertainty decreases with increasing masker bandwidth, differences between running- and frozen-noise conditions should decrease with increasing bandwidth for interaural correlations smaller than +1. For an interaural correlation close to +1, no difference between frozen-noise and running-noise thresholds is expected for any value of the masker bandwidth. These expectations are supported by the experimental data. For the 10-Hz running-noise condition, the thresholds can be accounted for in terms of interaural correlation uncertainty. For the frozen-noise conditions and both the 100- and 1000-Hz running-noise conditions, it is likely that the limits of accuracy of the internal representation determine the detection thresholds.

P.1537 Sergej Yendrikhovskij Beyond colour appearance: A new image quality space In: Proceedings of the International Colour Management Forum, Derby, UK, March 24-25, 1999, 151-156 Most modern image quality (IQ) metrics are based on the properties of the visual system only and do not take into account the cognitive aspects involved in IQ judgments. Such metrics can be very useful for defining the subjective tolerance to reproduction errors, especially at the threshold level. Basically, they can answer the question 'what do people accept to see in images'. However, they are unable to respond to the more general, supra-threshold questions 'what do people expect to see' and 'what do people prefer to see' in observed pictures. To answer these questions, one has to consider the cognitive aspects of image appraisal and develop a general model of image quality. This paper elaborates on some of these issues for the colour domain.

P.1538 A.J.M. Houtsma Acoustics education in the Low Countries Abstract in: Journal of the Acoustical Society of America, 1999, 105(2, Pt. 2), 936 See abstract P.1531.

P.1539 A.J.M. Houtsma On the tones that make music

Abstract in: Journal of the Acoustical Society of America, 1999, 105(2, Pt. 2), 1237 See abstract P.1533.

P.1540 N. Belaïd Perceptual linearization of soft-copy displays for achromatic images PhD thesis, Eindhoven University of Technology, May 17, 1999 The fidelity of soft-copy achromatic displays was investigated before and after applying perceptual linearization. This procedure, first proposed by Pizer [Computer Graphics and Image Processing 17, 262-268 (1981)], was developed to guarantee the fidelity of a display in such a way that equal steps in the input values to the display system evoke equal steps in the visual perception of the human observer. Pizer tried to realize this under the assumption that steps are perceptually equal if they contain an equal number of just-noticeable differences (JNDs). The idea of perceptual linearization was adopted by other research groups and standardization bodies, such as the American College of Radiology and the National Electrical Manufacturers Association (ACR/NEMA). In this thesis we present an iterative linearization procedure based on magnitude estimation of brightness differences within simple images. These images consisted of grey square patches embedded in a uniform surround and were used to generate a series of equal brightness intervals. Measurements were performed for three different surround luminances. It was found that perceptual linearization is strongly influenced by the surround luminance value. The resulting linearizing look-up tables (LUTs) were compared to the ACR/NEMA display standard and to the CIE lightness standard. The effect of background illumination on perceptual linearization was also investigated, and the results suggest that this effect is minimal. Our experimental data were compared with the outcome of two mathematical models for brightness responses which take the surround effect into account. Good agreement was found with Whittle's brightness model [Vision Research 32, 1493-1507 (1992)] and with a modified model of Kingdom and Moulden [Vision Research 31, 851-858 (1991)].
This result suggests that these mathematical models may be applied instead of going through lengthy experimental procedures. In order to evaluate the potential value of perceptual linearization in tasks with complex scenes, achromatic natural and medical images were judged on both perceptually linearized and non-linearized displays. Both performance- and appreciation-oriented tasks were used. Performance-oriented tasks consisted of judging digital grey values from specified locations in the displayed images, using either brightness matching with a reference scale or estimation without reference. Performance-oriented tasks were shown not to be affected by perceptual linearization. For the appreciation-oriented task, subjects (both medical experts and non-experts) showed a stronger preference for the surround-independent LUTs for most of the images considered. Based on the experiments performed in this thesis, we conclude that global perceptual linearization as proposed by Pizer improves neither the performance- nor the appreciation-oriented quality of complex achromatic displays. Applying local perceptual linearization to complex images would, however, be a relevant topic for future study.

P.1541 S.N. Yendrikhovskij, F.J.J. Blommaert and H. de Ridder Optimizing color reproduction of natural images In: Final Program and Proceedings of IS&T/SID, the 6th Color Imaging Conference: Color Science, Systems, and Applications, Scottsdale, AZ, USA, November 17-20, 1998, 140-145 The paper elaborates on understanding, measuring, and optimizing the perceived color quality of natural images. We introduce a model for optimal color reproduction of natural scenes, based on the assumption that the color quality of natural images is constrained by the perceived naturalness and colorfulness of these images. To verify the model, a few experiments were carried out in which subjects estimated the perceived 'naturalness', 'colorfulness' and 'quality' of images of

natural scenes. The judgments were related to statistical parameters of the color-point distribution across the images in the CIELUV color space. It was found that the perceptually optimal color reproduction can be derived from these statistics within the framework of our model. We specify naturalness, colorfulness, and quality indices describing the observers' judgments. Finally, an algorithm for optimizing the perceived quality of color reproduction of natural scenes is discussed.

P.1542 Matthias Rauterberg New directions in user-system interaction: Augmented reality, ubiquitous and mobile computing In: Proceedings of the Symposium on Human Interfacing: Man & Machine: The Cooperation of the Future, Eindhoven, The Netherlands, May 20, 1999, 105-113 The embodiment of computers in the workplace has had a tremendous impact on the field of user-system interaction. Mice and graphic displays are everywhere; desktop workstations define the frontier between the computer world and the real world. We spend a lot of time and energy transferring information between these two worlds. This could be reduced by better integrating the virtual world of the computer with the real world of the user. The most promising approach to this integration is Augmented Reality (AR). The expected success of this approach lies in its ability to build on fundamental human skills, namely, interacting with real-world subjects and objects. We present a new interaction style, called the Natural User Interface (NUI). Several prototypical applications already implement NUIs and demonstrate their technical feasibility. We show that NUIs have many advantages over traditional interaction styles and (immersive) Virtual Reality (VR). Aspects of nonverbal communication between humans and the capturing of task-related activities (motor movements, voice, and sound) are especially emphasized. Technology is already mature enough that many research prototypes can show the feasibility of particular components of NUIs. Calling for multi-disciplinary research, we identify several domains we consider key technologies for attacking the remaining technical problems, which are mainly pattern recognition and projection. By providing people with support for their real-world tasks, NUIs open a wide new design space for the next generation of USI technology.

P.1543 Emiel Krahmer and Kees van Deemter On the interpretations of anaphoric noun phrases: Towards a full understanding of partial matches In: Journal of Semantics, 1998, 15, 355-392 Starting from the assumption that NPs of all kinds can be anaphoric on antecedents in the linguistic context, we work towards a general theory of context-dependent NP meaning. Two complicating factors are that the relation between anaphors and antecedents is by no means unrestricted and that there is often a partial match between anaphor and antecedent. We argue that the presuppositions-as-anaphors approach of Van der Sandt provides a natural starting point for our enterprise. Unfortunately, this theory has a number of deficiencies for our purposes, in particular where the treatment of partial matches is concerned. We propose a number of modifications of Van der Sandt's formal theory and apply the modified algorithm first to definite NPs and later to NPs of all kinds. The resulting modified version of the presuppositions-as-anaphors theory is argued to be more general, formally more precise, and empirically more adequate than its predecessor.

P.1544 A. Kohlrausch and S. van de Par Auditory-visual interaction: From fundamental research in cognitive psychology to (possible) applications In: B.E. Rogowitz and T.N. Pappas (Eds.): Proceedings of SPIE: Human Vision and Electronic Imaging IV, Volume 3644, San Jose, CA, USA, January 25-28, 1999, 34-44

In our natural environment, we simultaneously receive information through various sensory modalities. The properties of these stimuli are coupled by physical laws, so that, e.g., auditory and visual stimuli caused by the same event have a fixed temporal relation when reaching the observer. In speech, for example, visible lip movements and audible utterances occur in close synchrony, which contributes to the improvement of speech intelligibility under adverse acoustic conditions. Research into multi-sensory perception is currently being performed in a great variety of experimental contexts. This paper attempts to give an overview of the typical research areas dealing with audio-visual interaction and integration, bridging the range from cognitive psychology to applied research for multimedia applications. Issues of interest are the sensitivity to asynchrony between audio and video signals, the interaction between audio-visual stimuli with discrepant spatial and temporal-rate information, crossmodal effects in attention, audio-visual interactions in speech perception, and the combined perceived quality of audio-visual stimuli.

P.1545 Lydia Meesters and Jean-Bernard Martens Blockiness in JPEG-coded images In: B.E. Rogowitz and T.N. Pappas (Eds.): Proceedings of SPIE: Human Vision and Electronic Imaging IV, Volume 3644, San Jose, CA, USA, January 25-28, 1999, 245-257 In two experiments, dissimilarity data and numerical scaling data were obtained to determine the underlying attributes of image quality in baseline sequential JPEG-coded images. Although several distortions were perceived, i.e., blockiness, ringing, and blur, the subjective data for all attributes were highly correlated, so that image quality could approximately be described by one independent attribute. We therefore proceeded by developing an instrumental measure for one of these distortions, i.e., blockiness. In this paper a single-ended blockiness measure is proposed, i.e., one that uses only the coded image. Our approach is therefore fundamentally different from most (two-sided) image quality models that use both the original and the degraded image. The measure is based on detecting the low-amplitude edges that result from blocking and estimating their amplitudes. Because of the approximate one-dimensionality of the underlying psychological space, the proposed blockiness measure also predicts the image quality of sequential baseline-coded JPEG images.

P.1546 Jean-Bernard Martens and Lydia Meesters The role of image dissimilarity in image quality models In: B.E. Rogowitz and T.N. Pappas (Eds.): Proceedings of SPIE: Human Vision and Electronic Imaging IV, Volume 3644, San Jose, CA, USA, January 25-28, 1999, 258-269 Although the concept of image dissimilarity is very familiar in the context of instrumental measures for image quality, it is fairly uncommon to use it as an experimental paradigm. Most instrumental measures relate image quality to some distance, such as the root-mean-squared error (RMSE), between the original and the processed image, so that image dissimilarity arises naturally in this context. Dissimilarity can, however, also be judged consistently by subjects. In this paper we compare the performance of a number of representative instrumental models for image dissimilarity (such as the Sarnoff model and RMSE) with respect to their ability to predict both image dissimilarity and image quality as perceived by human subjects. Two sets of experimental data, one for images degraded by noise and blur, and one for JPEG-coded images, are used in the comparison. In none of the examined cases could a clear advantage of complicated distance metrics (such as the Sarnoff model) be demonstrated over simple measures such as RMSE.
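For reference (our illustration, not taken from the paper), the simple RMSE distance that the abstract contrasts with the Sarnoff model can be written out directly, treating two images as flat sequences of pixel values:

```python
import math

def rmse(original, processed):
    """Root-mean-squared error between two equal-sized images,
    given as flat sequences of pixel values."""
    if len(original) != len(processed):
        raise ValueError("images must have the same number of pixels")
    n = len(original)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(original, processed)) / n)

# Hypothetical 2x2 grey-value patches: distance grows with the degradation.
orig = [120, 130, 125, 118]
proc = [122, 128, 125, 114]
print(round(rmse(orig, proc), 3))  # 2.449
```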

P.1547 Wijnand IJsselsteijn, Huib de Ridder and Joyce Vliegen The effect of stereoscopic filming parameters and exposure duration on quality and naturalness judgements

In: B.E. Rogowitz and T.N. Pappas (Eds.): Proceedings of SPIE: Human Vision and Electronic Imaging IV, Volume 3644, San Jose, CA, USA, January 25-28, 1999, 300-311 Previously we reported a study into the effect of stereoscopic filming parameters on perceived quality, naturalness, and eye strain. In a pilot experiment using a 25-second exposure duration, a marked shift occurred between naturalness and quality ratings as a function of camera separation. This shift was less clearly present in the main experiment, in which we used an exposure duration of 5 seconds. This suggests a potential effect of exposure duration on observers' appreciation of stereoscopic images. To investigate this further, we performed an experiment using exposure durations of both 5 and 10 seconds. For these durations, twelve observers rated naturalness of depth and quality of depth for stereoscopic still images varying in camera separation, convergence distance, and focal length. The results showed no significant main effect of exposure duration. A small yet significant shift between naturalness and quality was found for both duration conditions. This result replicated earlier findings, indicating that this is a reliable, albeit content-dependent, effect. A second experiment was performed with exposure durations ranging from 1 to 15 seconds. The results of this experiment showed a small yet significant effect of exposure duration. Whereas longer exposure durations do not have a negative impact on the appreciative scores of optimally reproduced stereoscopic images, observers do give lower judgments to monoscopic images and to stereoscopic images with unnatural disparity values as exposure duration increases.

P.1548 Sergej Yendrikhovskij Image quality: Between science and fiction In: Final Program and Proceedings of IS&T's PICS Conference, the 52nd Annual Conference of the Society for Imaging Science and Technology, Savannah, GA, USA, April 25-28, 1999, 173-178 At present, most studies on Image Quality employ subjective assessment with only one goal: to avoid it in the future. This attitude is most likely caused by the methodological difficulties involved in psychophysical scaling and by skepticism towards the prospect of modelling the cognitive stages of human judgments. This paper (1) reviews the main methodological problems, (2) discusses possible remedies, (3) considers a model which specifies the major constraints imposed on Image Quality, and (4) provides a meaningful classification of different image categories.

P.1549 Aad Houtsma Toonhoogte onderzoek in Nederland (Pitch research in The Netherlands) In: 25 jaar AES in Nederland, Lustrumboek van de Nederlandse Sectie van de Audio Engineering Society 1974-1999, 1999, 177-186 Pitch research has always been considered a very important part of psychoacoustics as practiced in The Netherlands, so much so that it is sometimes said that, to a Dutch psychoacoustician, a pitch theory is equivalent to a hearing theory. This specific research interest is mostly due to important contributions by a few scientists, just before and after WW2, who were able to convey their thoughts and ideas to their students. Research concentrated on the perception of the so-called missing fundamental, the fact that the pitch of harmonic complex tones is always determined by the fundamental frequency, no matter whether this frequency is physically present or not. The initial explanation of this phenomenon, expressed in the residue theory of J.F. Schouten, was based on limited frequency resolution in the cochlea. This theory was ultimately proven inadequate by systematically gathered empirical evidence and replaced by central, neural signal-processing models. These models allowed the computation of pitch by means of real-time computer algorithms, which was of essential value for experimental research on speech prosody and for the development of real-time voice-pitch displays used in speech therapy. They also provide insight into the perceptual functioning of certain types of musical instruments.

P.1550 H. Kanis, M.F. Weegels and W.S. Green Scientific research in a design context In: M.A. Hanson, E.J. Lovesey and S.A. Robertson (Eds.): Contemporary Ergonomics 1999. London: Taylor & Francis, 1999, 374-378 Design-relevant research is often accorded a low scientific status. Whilst user trials are indispensable for anticipating future user activities with a newly designed product, they are criticized in several respects: lack of theory, no experimental control, no statistical significance, and questionable validity. These comments are typical of a quantification-oriented stance. This paper outlines alternative perspectives, which revalue a primarily qualitative, in-depth approach, with ample room for the user's perspective and the observation of actual product use.

P.1551 Geert de Haan The usability of interacting with the virtual and the real in COMRIS In: A. Nijholt, O. Donk and B. van Dijk (Eds.): Interactions in Virtual Worlds, Proceedings of the 15th Twente Workshop on Language Technology, TWLT-15, Enschede, The Netherlands, May 19-21, 1999, 69-79 COMRIS (Co-habited Mixed-Reality Information Spaces) is a research project that aims to create a wearable assistant that provides context-sensitive information about interesting persons and events to conference and workshop visitors. Virtual and augmented reality applications generally require users to switch from one reality to another or to immerse themselves in a heavyweight VR environment. In COMRIS the two worlds are intimately linked and extended into each other without the need to switch or immerse. In the physical world, the most tangible part of the COMRIS system is a small wearable and personal device, 'the parrot', that speaks in the user's ear and provides a few buttons to fine-tune messaging defaults to the circumstances and to respond to messages. The parrot also houses an active badge that keeps track of the whereabouts of its 'wearer' relative to beacons distributed around the conference premises. Conference visitors also have a number of information kiosks at their disposal to enter and adjust their interest profiles and to consult the conference schedule and their personal agenda. In the virtual world, agents represent and attempt to match the possible (events and persons) and the actual interests (from the interest profile) of the users. A process of 'competition for attention' selects the interests on the basis of the user's context and guards the user against information overload. Interests are subsequently translated into natural language and presented to the user as spoken messages.
After presenting an overview of COMRIS, this paper focusses on usability evaluation, which deserves special attention because COMRIS users remain engaged in their regular working and social relations while receiving assistance and distraction from the virtual world. The paper describes the general approach and the plans concerning usability evaluation in the COMRIS project and discusses the preliminary results from the first experiment.

P.1552 Robbert-Jan Beun and Anita H.M. Cremers Object reference in a shared domain of conversation In: Pragmatics & Cognition, 1998, 6(1/2), 121-152 In this paper we report on an investigation into the principles underlying the choice of a particular referential expression to refer to an object located in a domain to which both participants in the dialogue have visual as well as physical access. Our approach is based on the assumption that participants try to use as little effort as possible when referring to objects. This assumption is operationalized in two factors, namely the focus of attention and a particular choice of features to be included in a referential expression. We claim that both factors help in reducing the effort needed to, on the one hand, refer to an object and, on the other hand, to identify it. As a result of the focus of attention the number of potential target objects (i.e., the objects the speaker intends to refer to) is reduced. The choice of a specific type of feature determines the number of objects that have to be identified in order to be able to understand the referential expression. An empirical study was conducted in which pairs of participants cooperatively carried out a simple block-building task, and the results provided empirical evidence that supported the aforementioned claims. Especially the focus of attention turned out to play an important role in reducing the total effort. Additionally, focus acted as a strong coherence-establishing device in the studied domain.

P.1553 Geert de Haan Usability issues in COMRIS, an intelligent personal information system In: EACE Quarterly, 1999, 3(2), 5-10 COMRIS is a research and development project that aims to create a wearable assistant for conference and workshop visitors. A personal interest profile and an active badge system enable agents in a virtual information space to provide context-sensitive information about interesting persons and events to conference and workshop visitors in their own physical information space. COMRIS encounters special usability problems, not found in either traditional information systems or virtual reality applications. This paper describes the COMRIS project, discusses the proposed usability studies, and presents the conclusions from the first experiment.

P.1554 Geert de Haan, Mathilde Bekker and Paul de Greef Spotlight: IPO, Center for Research on User-System Interaction In: EACE Quarterly, 1999, 3(2), 22-23 An organization overview in the regular column 'Spotlight' of EACE Quarterly, the magazine of the European Association for Cognitive Ergonomics. The paper gives a brief overview of the history and the background of IPO and describes the type of research that is done. For each research group, the aims and purposes of the investigations are briefly described and both research topics and project names are mentioned. Finally, the paper invites scientists and post-graduate students who are interested in taking part in IPO's research and education activities to seek further information at the website and get in contact.

P.1555 Adrianus J.M. Houtsma Hawkins and Stevens revisited at low frequencies In: Proceedings of the 16th International Congress on Acoustics and 135th Meeting of the Acoustical Society of America, ICA-98, Volume II, Seattle, WA, USA, June 20-26, 1998, 893-894 The classical data on masking of pure tones by white noise [J.E. Hawkins, Jr. and S.S. Stevens, J. Acoust. Soc. Am. 22, 6-13 (1950)] yield a non-monotonic critical ratio function that increases at low and at high frequencies. Modern bandwidth estimates of auditory filters, however, yield equivalent rectangular bandwidth (ERB) functions that are monotonic with frequency. This apparent discrepancy has sometimes been interpreted as evidence for frequency-dependent detection efficiency in the auditory system. Another possible reason for the discrepancy, however, could be the fact that the classical data were measured with headphones that needed substantial correction at low frequencies to compensate for acoustic leakage. To test this, the Hawkins and Stevens experiment was repeated with Etymotic ER-2 insert phones over a frequency range from 30 to 8000 Hz. An adaptive forced-choice as well as a Bekesy tracking procedure was used. The resulting critical ratio function shows monotonic dependence on frequency, down to 90 Hz, similar to the ERB function. This appears to make critical ratio data in principle consistent with ERB data measured with notched-noise techniques, and diminishes the need for frequency-dependent detection efficiency in the auditory system.
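For readers unfamiliar with the two bandwidth measures compared in this abstract, the following textbook definitions may help; they are standard formulations (the ERB fit is the one commonly attributed to Glasberg and Moore, 1990) and are not quoted from the paper itself:

```latex
% Critical ratio: power P_s of the tone at masked threshold, relative to
% the spectral density N_0 of the masking noise at signal frequency f
\mathrm{CR}(f) = 10 \log_{10}\!\left(\frac{P_s(f)}{N_0}\right)\ \text{dB}

% Equivalent rectangular bandwidth (Glasberg & Moore, 1990),
% with f in Hz; note that this function increases monotonically with f
\mathrm{ERB}(f) = 24.7\left(4.37\,\frac{f}{1000} + 1\right)\ \text{Hz}
```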

P.1556 Leslie R. Bernstein, Steven van de Par and Constantine Trahiotis The normalized interaural correlation: Accounting for NoSπ thresholds obtained with Gaussian and 'low-noise' masking noise In: Journal of the Acoustical Society of America, 1999, 106(2), 870-876 Recently, Eddins and Barber [J. Acoust. Soc. Am. 103, 2578-2589 (1998)] and Hall et al. [J. Acoust. Soc. Am. 103, 2573-2577 (1998)] independently reported that greater masking of interaurally phase-reversed (Sπ) tones was produced by diotic low-noise noise than by diotic Gaussian noise. Based on quantitative analyses, Eddins and Barber suggested that their results could not be accounted for by assuming that listeners' judgments were based on constant-criterion changes in the normalized interaural correlation produced by adding the Sπ signal to the diotic masker. In particular, they showed that a model like the one previously employed by Bernstein and Trahiotis [J. Acoust. Soc. Am. 100, 3774-3784 (1996)] predicted an ordering of thresholds between the conditions of interest that was opposite to that observed. Bernstein and Trahiotis computed the normalized interaural correlation subsequent to half-wave, square-law rectification and low-pass filtering, the parameters of which were chosen to mimic peripheral auditory processing. In this report, it is demonstrated that augmenting the model by adding a physiologically valid stage of 'envelope compression' prior to rectification and low-pass filtering provides a remedy. The new model not only accounts for the data obtained by Eddins and Barber (and the similar data obtained by Hall et al.), but also does not diminish the highly successful account of the comprehensive set of data that gave rise to the original form of the model. Therefore, models based on the computation of the normalized interaural correlation appear to remain valid because they can account, both quantitatively and qualitatively, for a wide variety of binaural detection and discrimination data.
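As background, the normalized interaural correlation of left- and right-ear signals can be written as follows; this is the standard definition (the symbols are chosen here for illustration and are not quoted from the paper):

```latex
% Normalized interaural cross correlation at zero lag for ear signals
% x_L(t) and x_R(t); rho = +1 for a diotic (No) stimulus and decreases
% when an Spi signal is added to the diotic masker, so a constant-criterion
% change in rho can serve as the detection variable
\rho = \frac{\int x_L(t)\,x_R(t)\,dt}
            {\sqrt{\int x_L^2(t)\,dt \int x_R^2(t)\,dt}}
```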

P.1557 Jeroen Breebaart, Steven van de Par and Armin Kohlrausch The contribution of static and dynamically varying ITDs and IIDs to binaural detection In: Journal of the Acoustical Society of America, 1999, 106(2), 979-992 This paper investigates the relative contribution of various interaural cues to binaural unmasking in conditions with an interaurally in-phase masker and an out-of-phase signal (NoSπ). By using a modified version of multiplied noise as the masker and a sinusoid as the signal, conditions with only interaural intensity differences (IIDs), only interaural time differences (ITDs), or combinations of the two were realized. In addition, the experimental procedure allowed the presentation of specific combinations of static and dynamically varying interaural differences. In these conditions with multiplied noise as masker, the interaural differences have a bimodal distribution with a minimum at zero IID or ITD. Additionally, by using the sinusoid as masker and the multiplied noise as signal, a unimodal distribution of the interaural differences was realized. Through this variation in the shape of the distributions, the close correspondence between the change in the interaural cross correlation and the size of the interaural differences is no longer found, in contrast to the situation for a Gaussian-noise masker [Domnitz and Colburn, J. Acoust. Soc. Am. 59, 598-601 (1976)]. When analysing the mean thresholds across subjects, the experimental results could not be predicted from parameters of the distributions of the interaural differences (the mean, the standard deviation, or the root-mean-square value). A better description of the subjects' performance was given by the change in the interaural correlation, but this measure failed in conditions which produced a static interaural intensity difference. The data could best be described by using the energy of the difference signal as the decision variable, an approach similar to that of the equalization and cancellation model.

P.1558 Jeroen Breebaart and Armin Kohlrausch A new binaural detection model based on contralateral inhibition In: T. Dau, V. Hohmann and B. Kollmeier (Eds.): Psychophysics, Physiology and Models of Hearing. Singapore: World Scientific Publishing, 1999, 195-206 Binaural models attempt to explain binaural phenomena in terms of neural mechanisms that extract binaural information from acoustic stimuli. In this paper a model setup is presented that can be used to simulate binaural detection tasks. In contrast to the most often used cross correlation between the right and left channel, this model is based on contralateral inhibition. The presented model is applied to a wide range of binaural detection experiments. It shows a good fit for changes in masker bandwidth or masker correlation, static and dynamic cues, and level and frequency dependencies.

P.1559 Emiel Krahmer and Mariët Theune Efficient generation of descriptions in context In: R. Kibble and K. van Deemter (Eds.): Proceedings of the Workshop 'The Generation of Nominal Expressions', 11th European Summer School in Logic, Language and Information, ESSLLI-99, Utrecht, The Netherlands, August 9-14, 1999, no page numbers In their interesting 1995 paper, Dale and Reiter present various algorithms they developed alone or in tandem to determine the content of a distinguishing description. That is: a definite description which is an accurate characterization of the entity being referred to, but not of any other object in the current 'context set'. They argue that their Incremental Algorithm is the best one from a computational point of view (it has a low complexity and is fast) as well as from a psychological point of view (humans appear to do it in a similar way). Even though Dale and Reiter (1995) primarily aimed at investigating the computational implications of obeying the Gricean maxims, the Incremental Algorithm has become more or less accepted as the state of the art for generating descriptions. However, due to their original motivation, various other aspects of the generation of definites remained somewhat underdeveloped. In this paper we flesh out a number of these aspects, without losing sight of the attractive properties of the original algorithm (speed, complexity, psychological plausibility).
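The Incremental Algorithm mentioned in this abstract can be sketched in a few lines. The following is a minimal illustration: the domain, attribute names and preference order are hypothetical, and the special treatment Dale and Reiter give to the head noun's 'type' attribute is omitted for brevity.

```python
# Minimal sketch of Dale & Reiter's (1995) Incremental Algorithm:
# walk a fixed preference order of attributes, keeping an attribute only
# if it rules out at least one remaining distractor; stop as soon as the
# description distinguishes the target from all distractors.

def make_description(target, distractors, preferred_attrs, facts):
    """Return a distinguishing set of attribute-value pairs for `target`,
    or None if the distractors cannot all be ruled out."""
    description = {}
    remaining = set(distractors)
    for attr in preferred_attrs:                  # fixed preference order
        value = facts[(target, attr)]
        ruled_out = {d for d in remaining if facts[(d, attr)] != value}
        if ruled_out:                             # attribute discriminates
            description[attr] = value
            remaining -= ruled_out
        if not remaining:                         # description now distinguishing
            return description
    return None

# Hypothetical context set: the target d1 and two distractors.
facts = {
    ('d1', 'type'): 'dog', ('d1', 'colour'): 'black', ('d1', 'size'): 'small',
    ('d2', 'type'): 'dog', ('d2', 'colour'): 'white', ('d2', 'size'): 'large',
    ('d3', 'type'): 'cat', ('d3', 'colour'): 'white', ('d3', 'size'): 'small',
}
print(make_description('d1', ['d2', 'd3'], ['type', 'colour', 'size'], facts))
# -> {'type': 'dog', 'colour': 'black'}  (i.e., "the black dog")
```

The incremental character is what keeps the complexity low: once an attribute is added it is never retracted, so no backtracking over attribute combinations is needed.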

P.1560 Emiel Krahmer and Marc Swerts Contrastive accent in dialogue In: B. Geurts, M. Krifka and R. van der Sandt (Eds.): Proceedings of the Workshop 'Focus and Presupposition in Multi-Speaker Discourse', 11th European Summer School in Logic, Language and Information, ESSLLI-99, Utrecht, The Netherlands, August 9-14, 1999, 13-17 What is the meaning of contrastive accents in dialogue? To answer this question, a number of hurdles have to be taken. To begin with, there is still no consensus in the prosodic literature whether a separately identifiable contrastive intonation exists in the first place. So, before we can even begin to answer the initial question, we have to try and answer a different question: do dialogue participants distinguish contrastive accents from other kinds of accents? While there are various answers to the first question, this paper is to the best of our knowledge the first where both questions are addressed in combination.

P.1561 Leo Noordman, Ingrid Dassen, Marc Swerts and Jacques Terken Prosodic markers in text structure In: K. van Hoek, A.A. Kibrik and L. Noordman (Eds.): Discourse Studies in Cognitive Linguistics. Amsterdam: John Benjamins Publishing, 1999, 133-148 In this paper we examine the question whether spoken texts contain audible markers of text structure. We present experimental results from a study of spoken Dutch narrative, analysed according to two different text structure theories: Story Grammar (SG) and Rhetorical Structure Theory (RST). The results clearly show that text structure is reflected in two different prosodic features, pause and pitch. The higher the level in the hierarchy, both in terms of SG and RST, the longer the pause duration and the higher the pitch range.

P.1562 Marc Swerts, Cinzia Avesani and Emiel Krahmer Reaccentuation or deaccentuation: A comparative study of Italian and Dutch In: J.J. Ohala, Y. Hasegawa, M. Ohala, D. Granville and A.C. Bailey (Eds.): Proceedings of the 14th International Congress of Phonetic Sciences, ICPhS-99, San Francisco, CA, USA, August 1-7, 1999, 1541-1544 This paper reports on a comparative analysis of accentuation strategies within Italian and Dutch NPs. Accent patterns were obtained in a (semi-)spontaneous way via a simple dialogue game with 8 Dutch speakers and 4 Italian ones. In this way target descriptions of all speakers were obtained in the following four contexts: all new, single contrast in the adjective, single contrast in the noun, and double contrast. It was found that the two languages both signal information status prosodically, but in a rather different way. In Dutch, accent distribution is the main discriminative factor: new and contrastive information are accented, while given information is not. Newness and contrastive accents were not intonationally different, yet a post-hoc test revealed that listeners could distinguish a contrastive intonation from a newness one, because contrastive accents generally were the sole accent in the phrase and always had the shape of a nuclear accent, even in non-default positions. In Italian, distribution is not a significant factor, since within the elicited NPs both adjective and noun are always accented, irrespective of the status of the information. However, there is a gradient difference in that 'given accents' are perceived as less prominent than the other two, while there is no overall perceptual difference between contrastive and newness accents.

P.1563 Kees van Deemter, Emiel Krahmer and Mariët Theune Plan-based vs. template-based NLG: A false opposition? In: T. Becker and S. Busemann (Eds.): Proceedings of the Workshop 'May I speak freely?', 23rd German Annual Conference for Artificial Intelligence, KI-99, Bonn, Germany, September 14-15, 1999, 1-6 This paper uses the algorithm employed in a number of recent template-based NLG systems to challenge the widespread assumption that template-based methods are inherently less well-founded than plan-based methods.

P.1564 Emiel Krahmer, Marc Swerts, Mariët Theune and Mieke Weegels Prosodic correlates of disconfirmations In: Proceedings of the ESCA International Workshop on Dialogue and Prosody, Veldhoven, The Netherlands, September 1-3, 1999, 169-174 In human-human communication, dialogue participants are continuously sending and receiving signals on the status of the information being exchanged. These signals may either be positive ('go on') or negative ('go back'), where it is usually found that the latter are comparatively marked to make sure that the dialogue partner is made aware of a communication problem. This paper focuses on the users' signalling of information status in human-machine interactions, and in particular looks at the role prosody may play in this respect. Using a corpus of interactions with two Dutch spoken dialogue systems, prosodic correlates of users' disconfirmations were investigated. In this corpus, disconfirmations may serve as a positive signal in one context and as a negative signal in another. Our findings show that the difference in signalling function is reflected in the distribution of the various types of disconfirmations as well as in different prosodic variables (pause, duration, intonation contour and pitch range). The implications of these results for human-machine modelling are discussed.

P.1565 Atsushi Shimojima, Yasuhiro Katagiri, Hanae Koiso and Marc Swerts An experimental study on the informational and grounding functions of prosodic features of Japanese echoic responses In: Proceedings of the ESCA International Workshop on Dialogue and Prosody, Veldhoven, The Netherlands, September 1-3, 1999, 187-192 Echoic responses, which reuse portions of the texts uttered in the preceding turns, abound in dialogues, although semantically they contribute little new information. Earlier, we conducted a corpus-based analysis of echoic responses occurring in real-life dialogues, and examined their informational and dialogue-coordinating functions in connection with their temporal/prosodic features. The present study attempts to complement this observational approach with an experimental approach, where particular prosodic/temporal features of echoic responses can be studied in a more controlled and focused manner. In combination, the two lines of analysis provide evidence that (1) echoic responses with different timings, intonations, pitches, and speeds signal different degrees to which the speakers have integrated the repeated information into their prior knowledge, and (2) the dialogue-coordinating functions of echoic responses vary with the speaker's integration rates signalled by these prosodic/temporal cues.

P.1566 Mariët Theune Parallelism, coherence, and contrastive accent In: G. Olaszy, G. Németh and K. Erdőhegyi (Eds.): Proceedings of the 6th European Conference on Speech Communication and Technology, Eurospeech-99, Volume 1, Budapest, Hungary, September 5-9, 1999, 555-558 This paper discusses an experiment concerning the assignment of contrastive accents, i.e., accents used to indicate the presence of contrastive information. The experiment tested which of two existing approaches to the determination of contrastive information gives the most natural results. According to one approach the presence of 'alternative items' is the only condition for the assignment of contrastive accent; according to the other, the presence of parallelism between sentences is also a condition. The experimental results indicate that the presence of alternative items combined with parallelism always triggers a preference for contrastive accent, whereas in the absence of parallelism, accent assignment seems to depend on the degree of coherence of the utterance.

P.1567 Gert Veldhuijzen van Zanten User modelling in adaptive dialogue management In: G. Olaszy, G. Németh and K. Erdőhegyi (Eds.): Proceedings of the 6th European Conference on Speech Communication and Technology, Eurospeech-99, Volume 3, Budapest, Hungary, September 5-9, 1999, 1183-1186 This paper describes an adaptive approach to dialogue management in spoken dialogue systems. The system maintains a user model, in which assumptions about the user's expectations of the system are recorded. Whenever recognition errors occur, the dialogue manager drops assumptions from the user model, and adapts its behaviour accordingly. The system uses a hierarchical slot structure that allows generation and interpretation of utterances at various levels of generality. This is essential for exploiting mixed initiative to optimize effectiveness. The hierarchical slot structure is also a source of variation. This is important because it is ineffective to repeat prompts in cases of speech recognition errors, for in such cases users are likely to repeat their responses as well, and thus the same speech recognition problems recur.
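The hierarchical slot structure described above can be illustrated with a toy sketch. The slot, level names and prompt wording below are invented for illustration and are not taken from the paper; the point is only that a slot whose value lives at several levels of generality lets the system vary its prompt after a recognition error instead of repeating it verbatim.

```python
# Toy sketch (invented example): a departure-time slot whose value can be
# asked for at several levels of generality, ordered from general to
# specific. After a recognition error the dialogue manager can re-ask at a
# different level, which varies the prompt wording.

class HierarchicalSlot:
    def __init__(self, name, levels):
        self.name = name
        self.levels = levels          # ordered from general to specific
        self.values = {}              # level -> recognized value

    def next_question(self):
        """Prompt for the most general level that is still unknown."""
        for level in self.levels:
            if level not in self.values:
                return "At what %s would you like to travel?" % level
        return None                   # slot completely filled

slot = HierarchicalSlot("departure", ["day", "part of the day", "hour"])
slot.values["day"] = "Monday"         # 'Monday' was recognized; rest unknown
print(slot.next_question())
# -> "At what part of the day would you like to travel?"
```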

P.1568 Emiel Krahmer, Marc Swerts, Mariët Theune and Mieke Weegels Problem spotting in human-machine interaction In: G. Olaszy, G. Németh and K. Erdőhegyi (Eds.): Proceedings of the 6th European Conference on Speech Communication and Technology, Eurospeech-99, Volume 3, Budapest, Hungary, September 5-9, 1999, 1423-1426 In human-human communication, dialogue participants are continuously sending and receiving signals on the status of the information being exchanged. We claim that if spoken dialogue systems were able to detect such cues and change their strategy accordingly, the interaction between user and system would improve. Therefore, the goals of the present study are as follows: (i) to find out which positive and negative cues people actually use in human-machine interaction in response to explicit and implicit verification questions and (ii) to see which (combinations of) cues have the best predictive potential for spotting the presence or absence of problems. Using precision and recall matrices it was found that various combinations of cues are accurate problem spotters. This kind of information may turn out to be highly relevant for spoken dialogue systems, e.g., by providing quantitative criteria for changing the dialogue strategy or speech recognition engine.
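The precision and recall measures used in this evaluation are the standard ones; the bookkeeping can be sketched as follows. The cue predictions in the example are made-up illustrations, not data from the paper.

```python
# Precision/recall bookkeeping for a cue-based problem spotter:
# precision = fraction of flagged turns that really were problems,
# recall    = fraction of actual problems that were flagged.

def precision_recall(predicted, actual):
    """predicted/actual: parallel lists of booleans, True = 'problem'."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Four user turns: a cue combination flags turns 1, 2 and 4 as problems,
# while problems actually occurred in turns 1 and 4.
predicted = [True, True, False, True]
actual    = [True, False, False, True]
print(precision_recall(predicted, actual))  # -> (0.6666666666666666, 1.0)
```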

P.1569 Morten Fjeld, Martin Bichsel, Fred Voorhorst, Kristina Lauche, Helmut Krueger and Matthias Rauterberg Designing graspable groupware for co-located planning and configuration tasks In: EACE Quarterly, 1999, 3(2), 16-21 This paper shows some of the vital steps in the design process of a graspable groupware system. Activity theory is the theoretical foundation for our research. Our design philosophy is based on the tradition of Augmented Reality (AR), which enriches natural communication with virtual features. Another important part of our design philosophy is the use of coinciding action and perception spaces. We developed groupware for layout planning and configuration tasks called the BUILD-IT system. This system enables users, grouped around a table, to cooperate in the design and manipulation of a virtual setting, thus supporting co-located, instead of distributed, interaction. The multi-user nature of BUILD-IT overcomes a serious drawback often seen with CSCW tools, namely that they are based on single-user applications. We believe that co-location is an indispensable factor for the early stage of a complex planning process. Input and output, however, can be prepared and further developed off-line, using any conventional CAD system.

P.1570 Margaret Cox, Lars Oestreicher, Clark Quinn, Matthias Rauterberg and Markus Stoltze HCI - Theory or practice in education In: S. Brewster, A. Cawsey and G. Cockton (Eds.): Proceedings of the IFIP TC.13 International Conference on Human-Computer Interaction, INTERACT-99, Volume II, Edinburgh, UK, August 30 - September 3, 1999, 134 The workshop aims at identifying how education programs in Human-Computer Interaction (HCI) can benefit from modern teaching techniques and how collaboration between industry and academia can be improved for a better understanding of the mutual educational needs.

P.1571 Sergej N. Yendrikhovskij, Frans J.J. Blommaert and Huib de Ridder Towards perceptually optimal colour reproduction of natural scenes In: L.W. MacDonald and M.R. Luo (Eds.): Colour Imaging: Vision and Technology. Chichester: John Wiley & Sons, 1999, 363-382 Current image technology does everything except one final and very important step: image quality appraisal. Up to now it is human beings that have had this privilege, but they may well be glad to share it with a technical device. Imagine a photo camera with the same honest opinion about the quality of its own pictures as an observer has, a TV set with self-optimizing quality, or a printer with always perfect images. The basic idea in the development of such 'magic' equipment is simple: image quality is estimated in accordance with observers' judgments, and, if necessary, optimized before presenting the image to the observer. In this paper we sequentially elaborate on (a) comprehending basic principles of colour appraising, (b) understanding the visuocognitive process that underlies subjective evaluation of image quality, (c) measuring perceived attributes, such as naturalness, colourfulness, and quality from objective properties of images, and (d) optimizing colour reproduction of natural scenes.

1572 Matthias Rauterberg User-system interaction: Trends and future directions In: K.-P. Gärtner (Ed.): Ergonomische Gestaltungswerkzeuge in der Fahrzeug- und Prozeßführung: 41. Fachausschußsitzung Anthropotechnik der Deutschen Gesellschaft für Luft- und Raumfahrt e.V., Stuttgart, Germany, October 19-20, 1999, 1-11. Bonn: Deutsche Gesellschaft für Luft- und Raumfahrt The embodiment of computers in the work place has had a tremendous impact on the field of user-system interaction. Mice and graphic displays are everywhere; the desktop workstations define the frontier between the computer world and the real world. We spend a lot of time and energy to transfer information between those two worlds. This could be reduced by better integrating the virtual world of the computer with the real world of the user. The most promising approach to this integration is Augmented Reality (AR). The expected success of this approach lies in its ability to build on fundamental human skills: namely, to interact with real-world subjects and objects! We present a new interaction style, called the Natural User Interface (NUI). Several prototypical applications already implement NUIs and demonstrate their technical feasibility.

1573 Adrianus J.M. Houtsma Psychoacoustics education in the Benelux In: Acustica, 1999, 85(5), 626-628 Psychoacoustics is taught both at universities and professional schools in The Netherlands and in Belgium, but always as part of some broader education or training program. The most prominent of these programs and their relation to psychoacoustics are medical studies (audiology), architecture (noise effects, concert halls), and language studies (speech perception). Several other interesting combinations occur. A special case worth mentioning is an Advanced Acoustics course offered annually by the Dutch (NAG) and Belgian (ABAV) acoustical societies in which psychoacoustics is an integral and essential part of a concentrated training in electroacoustics, architectural acoustics, or noise control engineering.

1574 Armin Kohlrausch Research and teaching in psychoacoustics at German universities In: Acustica, 1999, 85(5), 629-631 This paper gives an overview of recent developments and the present situation at German universities with respect to research and teaching in psychoacoustics. It also mentions aspects of job perspectives for students with psychoacoustic expertise. This paper is based on information gathered by a working group within the technical committee Psychoacoustics of the German Acoustical Society (DEGA).

P.1575 S.N. Yendrikhovskij, F.J.J. Blommaert and H. de Ridder Representation of memory prototypes for an object color In: Color Research and Application, 1999, 24(6), 393-410 This article presents a general framework for modeling memory colors and provides experimental evidence supporting this model for one particular object, i.e., a banana. We propose to characterize memory colors from experimentally determined similarity judgments of apparent object colors with the prototypical color of that object category. The aim of the first experiment was to analyse the memory representation of banana color in the CIELUV color space. To this end, we prepared images imitating different colors of a banana and asked subjects to scale the similarity in color of a banana shown on a CRT display and a typical ripe banana as it is remembered from their past experience. The relationship between the similarity judgments and the chromaticity coordinates representing the manipulated banana samples can be well described by a bivariate normal distribution. Another experiment was carried out to gain more insight into the perception process leading to the appearance of the banana stimuli. Additionally, a sample of banana colors from a fruit market was measured and compared with the similarity judgments.
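The bivariate normal description mentioned in the abstract can be written in the following standard form; the symbols are chosen here for illustration and are not quoted from the article:

```latex
% Similarity S of a colour sample with chromaticity coordinates
% x = (u', v') to the memory prototype, modelled as a bivariate normal
% with prototype location mu (the memory colour) and covariance Sigma
% (the tolerance of the memory representation)
S(\mathbf{x}) = S_{\max}\,
  \exp\!\Bigl(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}}
  \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Bigr)
```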

P.1576 Steven van de Par and Armin Kohlrausch Dependence of binaural masking level differences on center frequency, masker bandwidth, and interaural parameters In: Journal of the Acoustical Society of America, 1999, 106(4, Pt. 1), 1940-1947 Thresholds for sinusoidal signals masked by noise of various bandwidths were obtained for three binaural configurations: NoSo (both masker and signal interaurally in phase), NoSπ (masker interaurally in phase and signal interaurally phase-reversed), and NπSo (masker interaurally phase-reversed and signal interaurally in phase). Signal frequencies of 125, 250, 500, 1000, 2000, and 4000 Hz were combined with masker bandwidths of 5, 10, 25, 50, 100, 250, 500, 1000, 2000, 4000, and 8000 Hz, with the restriction that masker bandwidths never exceeded twice the signal frequency. The overall noise power was kept constant at 70 dB SPL for all bandwidths. Results, expressed as signal-to-total-noise power ratios, show that NoSo thresholds generally decrease with increasing bandwidth, even for subcritical bandwidths. Only at frequencies of 2 and 4 kHz do thresholds appear to remain constant for bandwidths around the critical bandwidth. NoSπ thresholds are generally less dependent on bandwidth up to two or three times the (monaural) critical bandwidth. Beyond this bandwidth, thresholds decrease with a similar slope as for the NoSo condition. NπSo conditions show about the same bandwidth dependence as NoSπ, but thresholds in the former condition are generally higher. This threshold difference is largest at low frequencies and disappears above 2 kHz. An explanation for the wider operational binaural critical bandwidth is given which assumes that binaural disparities are combined across frequency in an optimally weighted way.

P.1577 S.J.L. Mozziconacci and D.J. Hermes Role of intonation patterns in conveying emotion in speech In: J.J. Ohala, Y. Hasegawa, M. Ohala, D. Granville and A.C. Bailey (Eds.): Proceedings of the 14th International Congress of Phonetic Sciences, ICPhS-99, San Francisco, CA, USA, August 1-7, 1999, 2001-2004 In a production and perception study, the relation was studied between the emotion or attitude expressed in an utterance and the intonation pattern realized on that utterance. In the production study, the pitch curves of emotional utterances were labelled in terms of the IPO intonation grammar. One intonation pattern, the '1&A', was produced in all emotions studied. Some other patterns were specifically used in expressing some of these emotions. In the perception study, in which the perceptual relevance of these findings was checked, the communicative role of the intonation patterns found in the database was tested. This listening test provided converging evidence on the contribution of specific intonation patterns to the perception of some of the emotions and attitudes studied. Some intonation patterns, such as final '3C' and '12', which were specifically produced in certain emotions, also introduced a perceptual bias towards those emotions. In that sense, the results from the perception study supported those from the production study.

P.1578 Matthias Rauterberg Different effects of auditory feedback in man-machine interfaces In: N.A. Stanton and J. Edworthy (Eds.): Human Factors in Auditory Warnings. Aldershot, UK: Ashgate Publishing, 1999, 225-242 The hearing of sounds (such as alarms) is based on the perception of events and not on the perception of sounds as such. For this reason, sounds are often described by the events they are based on. Sound is a familiar and natural medium for conveying information that we use in our daily lives, especially in the working environment. The following examples help to illustrate the important kinds of information that sound can communicate: (1) Information about normal, physical events - most real-world events (for example, putting a pencil on the desk) are associated with a (soft) sound pattern. This type of auditory feedback parallels the visual feedback of the actual status of all interacting objects (for example, pencil and desk); (2) Information about (hidden) events in space - all audible signals out of our visual field (for example, footsteps warn us of the approach of another person); (3) Information about abnormal structures - a malfunctioning engine sounds different from a healthy one, as do alarms in supervisory control environments; (4) Information about invisible structures - hidden structures can be transformed into audible signals (for example, tapping on a wall is useful in finding where to hang a heavy picture); (5) Information about dynamic change - all specific semantics of real-world dynamics (for example, as we fill a glass we can hear when the liquid has reached the top). Two experiments were carried out to estimate the effect of sound feedback. In the first, we investigated the effects of auditory feedback in a situation where sound is given in addition to visual feedback. In the second experiment we investigated the effects of auditory feedback on hidden events produced by a continuous process in the background.

P.1579 Torsten Dau, Jesko Verhey and Armin Kohlrausch Intrinsic envelope fluctuations and modulation-detection thresholds for narrow-band noise carriers In: Journal of the Acoustical Society of America, 1999, 106(5), 2752-2760 A model is presented which calculates the intrinsic envelope power of a bandpass noise carrier within the passband of a hypothetical modulation filter tuned to a specific modulation frequency. Model predictions are compared to experimentally obtained amplitude modulation (AM) detection thresholds. In experiment 1, thresholds for modulation rates of 5, 25, and 100 Hz imposed on a bandpass Gaussian noise carrier with a fixed upper cutoff frequency of 6 kHz and a bandwidth in the range from 1 to 6000 Hz were obtained. In experiment 2, three noises with different spectra of the intrinsic fluctuations served as the carrier: Gaussian noise, multiplied noise, and low-noise noise. In each case, the carrier was spectrally centred at 5 kHz and had a bandwidth of 50 Hz. The AM detection thresholds were obtained for modulation frequencies of 10, 20, 30, 50, 70, and 100 Hz. The intrinsic envelope power of the carrier at the output of the modulation filter tuned to the signal modulation frequency appears to provide a good estimate of the AM detection threshold. The results are compared with predictions on the basis of the more complex auditory processing model by Dau et al. [J. Acoust. Soc. Am. 99, 3615-3622 (1997)].

P.1580 Dik Hermes and Steven van de Par Perception, soundscapes and science In: Soundscapes be)for(e 2000. Amsterdam: Centre for Electronic Music, 1999, 90-93 Sound can be studied from many different perspectives. Best established, probably, is the physical approach, in which the mechanics of sound-producing objects is studied on the basis of mathematical equations or computer models of the sound-generating object. In general, one can say that sound produced by simple impacts between rigid, solid objects is well understood. The situation becomes more complicated when the impact between the objects is complex, e.g., scraping or rolling, when the sound-generating objects are flexible, and when fluids or flowing air become part of the sound-generation process. The dynamics of such systems, e.g., a wind instrument, a string instrument or the human voice, are so complicated that even modern computer facilities cannot accurately calculate the sounds they produce. But in the design of soundscapes this is only one of the problems. There is no simple relation between physical quantities of a sound signal, such as intensity, frequency and phase, and what kind of sound we perceive. The dynamics of physical systems containing flexible components, fluids or flowing air mostly produce sound signals with complicated onsets and noisy components. Physical concepts such as amplitude, frequency and phase are not well defined for these components. On the other hand, the human listener is extremely sensitive to details at the onset of a sound and to the contribution of the noisy components. They play a substantial role in the perceived timbre of a sound. The consequences of this finding, and the alternative approaches used in studying and designing natural and artificial soundscapes, are discussed.

P.1581 T.J.W.M. Janssen Computational image quality PhD thesis, Eindhoven University of Technology, November 2, 1999 In this thesis an attempt is made to answer the fundamental question "What is image quality?". The approach we pursue to answer this question is based on the following four-point philosophy: (1) images are carriers of visual information; (2) visuo-cognitive processing is information processing; (3) visuo-cognitive processing is an essential stage in human interaction with the environment; and (4) image quality is the adequacy of the image as input to the vision stage of the interaction process. The structure of this thesis is as follows. In chapter 2 we formulate an answer to the question "What is image quality?" based on the philosophy outlined above. In this chapter we give a description of image quality in terms of two components: usefulness, that is, the precision of the internal representation of the image, and naturalness, that is, the degree of match between the internal representation of the image and representations stored in memory. The results of two series of experiments are used to demonstrate the validity of this concept. In chapter 3 we focus on the internal quantification of outside-world attributes. Taking a rather technical view of visual processing, we regard vision primarily as a process in which attributes of items in the outside world are measured and internally quantified with the aim to discriminate and/or identify these items. We show that the scale function of metrics optimized with respect to these tasks should be (partially) flexible. Furthermore, we show that such metrics exhibit properties resembling visual phenomena such as adaptation, crispening, and constancy. In chapter 4 we implement the image quality concept of chapter 2 using the partially flexible metrics of chapter 3. In this chapter a measure for usefulness is developed, based on the overall discriminability of the items in the image. Furthermore, a measure for the naturalness of the grass, skin, and sky areas of the image is developed, based on memory standards for grass, skin, and sky colour. These memory standards are themselves constructed from the grass, skin, and sky areas of a large set of images. To conclude, chapter 5 returns to the concept of image quality introduced in chapter 2. Following a strict, top-down analysis, the entire trajectory is completed from the semantics of image quality down to the development of algorithms for the prediction of usefulness, naturalness, and image quality. The result is a complete, thorough, and explicit description of image quality according to the above four-point philosophy.

P.1582 J.B. Martens Patroonherkenning en beeldbewerking (Pattern recognition and image processing) In: ICT-zakboekje. Arnhem: Koninklijke PBNA, 1999, 437-457 This introductory text discusses the main terminology and applications of pattern recognition (computer vision) and image processing. Two important aspects, i.e., local feature extraction and statistical classification, are covered in some more detail.

P.1583 Rene Ahn and Tijn Borghuis Communication modelling and context-dependent interpretation: An integrated approach In: T. Altenkirch, W. Naraschewski and B. Reus (Eds.): Types for Proofs and Programs. International Workshop, TYPES-98, Kloster Irsee, Germany, March 27-31, 1998. Lecture Notes in Computer Science 1657. Berlin: Springer-Verlag, 1999, 19-32 In this paper we present a simple model of communication. We assume that communication takes place between two agents. Each agent has a private and subjective knowledge state. The knowledge of both agents is partial, finite, and represented in a computational way. We investigate how ideas can be transferred from one agent to the other, in spite of the subjective nature of the knowledge of both participants. Posing the problem in this way, it can be seen that mechanisms for context-dependent interpretation are a prerequisite for successful communication.

P.1584 H. Kanis, M.F. Weegels and L.P.A. Steenbekkers The uninformativeness of quantitative research for usability focused design of consumer products In: Proceedings of the Human Factors and Ergonomics Society 43rd Annual Meeting, Houston, TX, USA, September 27 - October 1, 1999, 481-485 Scientific journals in the area of Ergonomics/Human Factors are strongly oriented towards quantitative research. At the same time, authors are urged to discuss the practical applicability of their research, for instance in terms of design recommendations. On the basis of empirical findings in several studies, the present paper demonstrates that usability focused design of consumer products is barely supported by research into human characteristics. Neither do the commonly found overall performance measures obtained in summative evaluations appear to be of any help in design. Instead, to inform design, in-depth research into actual user activities is indispensable, preferably including observations in a naturalistic environment enriched by clarification of product use from a user's perspective.

P.1585 Emiel Krahmer and Paul Piwek Presupposition projection as proof construction In: H. Bunt and R. Muskens (Eds.): Computing Meaning, Volume 1. Dordrecht: Kluwer Academic Publishers, 1999, 281-300 Most theories of presupposition fail to give a formal account of the interaction between world knowledge and presupposition projection. In this article an algorithm is sketched that is based on the idea that presuppositions are in many ways comparable to anaphors. It extends this idea by employing a deductive system, Constructive Type Theory (CTT), to acquire a formal handle on the way world knowledge influences presupposition projection. In CTT, proofs for expressions are explicitly represented as objects. These objects can be seen as a generalization of discourse markers. They are useful in dealing with presuppositional phenomena that require world knowledge, such as Clark's bridging examples and Beaver's conditional presuppositions.

P.1586 Roelof Hamberg and Huib de Ridder Time-varying image quality: Modeling the relation between instantaneous and overall quality In: SMPTE Journal, 1999, 802-811 Experiments are described that evaluated a model linking instantaneously perceived quality to overall quality judgments of long video sequences. Subjects evaluated a 3-min MPEG-2 coded video sequence by means of a continuous assessment procedure. Additionally, they rated the overall quality of segments of 10, 30, 60, and 180 sec of the same video material. The model describing the relation between the instantaneous and overall quality ratings contains two main ingredients, viz., an exponentially decaying weighting function, simulating the experimentally established recall advantage for the most recently presented material (called the recency effect), and a nonlinear averaging procedure stressing the relative importance of strong impairments. The fit of the model to the experimental data resulted in a decay time constant of 26 sec and a power of 3 for the nonlinear weighting. These findings suggest that subjects rely predominantly on the worst events of a sequence when determining their overall quality judgment.
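The two model ingredients named in the abstract, an exponentially decaying recency weight and a nonlinear (power) average that stresses strong impairments, can be sketched as follows. The decay constant of 26 sec and the power of 3 are the fitted values quoted above; the rating-scale maximum q_max, the impairment mapping (q_max - q), and the discrete-time weighted-average form are illustrative assumptions, not the published implementation.

```python
import math

def overall_quality(instant_quality, dt=1.0, tau=26.0, p=3.0):
    """Combine a sequence of instantaneous quality ratings into an
    overall quality judgment for the whole sequence.

    Two ingredients, following the abstract:
      * an exponential recency weight exp(-(T - t) / tau), giving the
        most recently seen material the largest influence;
      * a power-law (Minkowski-style) average with exponent p over the
        instantaneous impairments, which emphasizes the worst moments.
    """
    q_max = 100.0                            # assumed top of the rating scale
    T = (len(instant_quality) - 1) * dt      # time of the last sample
    num = den = 0.0
    for i, q in enumerate(instant_quality):
        w = math.exp(-(T - i * dt) / tau)    # recency weighting
        num += w * (q_max - q) ** p          # impairments raised to power p
        den += w
    impairment = (num / den) ** (1.0 / p)    # nonlinear (power) average
    return q_max - impairment
```

With this form, a quality drop near the end of a sequence lowers the predicted overall judgment more than the same drop near the beginning, which is the recency effect the experiments established.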

P.1587 Michael Meehan, Jonathan Freeman and Wijnand IJsselsteijn Overview of the Second International Workshop on Presence: April 6-7, 1999, Colchester, England In: CyberPsychology & Behavior, 1999, 2(4), 363-368 On the 6th and 7th of April 1999, the Second International Workshop on Presence was held at the University of Essex. The event was organized by Jonathan Freeman (currently at Goldsmiths College, University of London, and at the University of Essex during the workshop) and Wijnand IJsselsteijn (IPO, Center for User-System Interaction) as part of the European Commission ACTS TAPESTRIES project. The TAPESTRIES project was managed by the UK Independent Television Commission (ITC) and the workshop was sponsored by SONY, BT, and the ITC. The workshop program is followed by a brief introduction to presence and then by short summaries of each of the papers.

P.1588 J.B. Martens Image display and visual perception In: Edurad Syllabus Nr. 33. Utrecht: Nederlandse Vereniging voor Radiologie, 1998, 70-74 Two nonlinear mappings are involved in image display. First, the digital image in computer memory is transformed to luminance levels on the display. Second, the human visual system captures and processes images on the display and can trigger a human response, such as a judgment about the brightness in a designated area. It would be very beneficial if display systems were perceptually linearized, i.e., if the brightness response were linearly related to the input image. This paper discusses the relationship between brightness models and display characteristics, and argues why a single display standard, although very useful, cannot be expected to guarantee display linearization in all viewing circumstances.

P.1589 Dik J. Hermes Non-speech audio in mens/systeem-interactie (Non-speech audio in human-system interaction) In: Management & Informatie, 1999, 7(5), 26-32 Until now the use of non-speech audio in user interfaces has been almost completely restricted to alarm and warning signals. This contribution first gives an overview of a number of problems that often occur with the use of such signals. A number of guidelines are presented for their use in simple and in complex systems. Finally, an indication is given of the role non-speech audio can play in modern, multifunctional and interactive systems, which problems can occur with its implementation, and how one can solve these problems.

P.1590 J.M.B. Terken and R.N.J. Veldhuis Taal- en spraaktechnologie (Language and speech technology) In: Management & Informatie, 1999, 7(5), 36-46 Starting from the observation that recent developments in computer technology and language and speech technology are beginning to result in commercial applications of language and speech technology, this paper summarizes technical issues, opportunities and bottlenecks, and usability issues. Examples of existing applications (commercial or experimental) include dictation systems, systems for text-to-speech conversion, voice command systems, and conversational interfaces for database inquiry. In addition, language technology is applied in information retrieval systems, and speech technology is applied for security (identification). The main technological bottlenecks lie in the areas of speech recognition and interpretation. In both cases, the main drawback is the massive amount of data needed to develop flexible systems: in the case of speech recognition, flexibility is needed to handle pronunciation variability within and between speakers, large vocabularies, and spontaneous speech phenomena. In the case of interpretation, modelling and storing large amounts of world knowledge poses a real problem. Many of these problems can be circumvented by constraining the domain one way or another, either by restricting the number of speakers, the domain of discourse, or the vocabulary. Now that speech interfaces are gaining ground in user-system interaction, researchers are gaining insight into usability issues. One of the benefits is that speech-based interaction is natural and relieves the user from the need to focus his eyes on the device and to stay near it. One of the costs is the continued lack of robustness of the technology and the lack of communicative competence in the system, which requires more communicative competence of the user, which can only be gained through interaction with the system.

P.1591 Wolfgang Kopp and Matthias Rauterberg Ergebnisse einer Untersuchung zur benutzergerechten Gestaltung von Messwarten (Results of a study about the design of user-friendly control rooms) In: Chemie Ingenieur Technik, 1999, 71(9), 1004 The increasing demands being placed on chemical plants regarding higher availability, security, and productivity are leading to more and more complex automation structures. Hence, the user interface is becoming increasingly important. Sustainable improvements in this area can primarily be achieved by taking into account technical possibilities, the knowledge and skills of plant operators, as well as business organization aspects.

P.1592 David House, Dik Hermes and Frederic Beaugendre Perception of tonal rises and falls for accentuation and phrasing in Swedish In: R.H. Mannell and J. Robert-Ribes (Eds.): Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP-98, Volume 6, Sydney, Australia, November 30 - December 4, 1998, 2799-2802 In previous experiments with Dutch, French, and Swedish listeners, it was shown that the location in the syllable of the onset of a rising or falling pitch movement is critical for the perception of accentuation. As the onset of the pitch movement is shifted through the syllable, there is a point at which the percept of accentuation shifts from one syllable to the next. This point is termed the accentuation boundary. It has also been proposed that in certain positions the percept of accentuation conflicts with the percept of phrasing. An experiment with Swedish listeners was carried out using the same stimuli as in the accentuation study, but now the task was to determine the phrasing of the syllables. The results indicate that perceptual phrase boundaries can be determined in the same way as accentuation boundaries. Differences in the locations of the boundaries can be interpreted in terms of the strengths of tonal cues for accentuation and phrasing.

P.1593 Edward M. Burns and Adrianus J.M. Houtsma The influence of musical training on the perception of sequentially presented mistuned harmonics In: Journal of the Acoustical Society of America, 1999, 106(6), 3564-3570 The question of whether musical scales have developed from a processing advantage for frequency ratios based on small integers, i.e., ratios derived from relationships among harmonically related tones, is widely debated in musicology and music perception. In the extreme position, this processing advantage for these so-called 'natural intervals' is assumed to be inherent, and to apply to sequentially presented tones. If this is the case, evidence for this processing advantage should show up in psychoacoustic experiments using listeners from the general population. This paper reports on replications and extensions of two studies from the literature. One [Lee and Green, J. Acoust. Soc. Am. 96, 716-725 (1994)] suggests that listeners from the general population can in fact determine whether sequentially presented tones are harmonically related. The other study [Houtgast, J. Acoust. Soc. Am. 60, 405-409 (1976)] is interpreted in different terms, but could be confounded by such an ability. The results of the replications and extensions, using listeners of known relative-pitch proficiency, are consistent with the idea that only trained musicians can reliably determine whether sequentially presented tones are harmonically related.

P.1594 S. Dobler, P. Meyer, H.W. Ruhl, R. Collier, L. Vogten and K. Belhoula A server for area code information based on speech recognition and synthesis by concept In: G. Görz (Ed.): KONVENS 92: 1. Konferenz 'Verarbeitung natürlicher Sprache', Nürnberg, 7.-9. Oktober 1992. Berlin, Heidelberg: Springer-Verlag, 1992, 353-357 An automatic information server to be operated via the telephone net is presented, which takes as input an old German area code for mail addresses and responds with the new area code. The server employs both speaker-independent recognition of connected words and a synthesis-by-concept scheme that speaks town names by rule. Recognition is based on an HMM recognizer, trained speaker-independently. Its feature extraction is insensitive both to the different frequency responses of telephone and transmission line and to slowly varying background noise. Speech synthesis uses the PSOLA technique and concatenation of diphones. In order to generate highly natural-sounding speech, most utterances are stored as natural speech, with only town names being synthesized. To prevent the impression of two different voices, utterances and diphones were spoken by the same speaker.

P.1595 Julia Hirschberg, Diane Litman and Marc Swerts Prosodic cues to recognition errors In: Proceedings of the 1999 International Workshop on Automatic Speech Recognition and Understanding, ASRU-99, Keystone, CO, USA, December 12-15, 1999, CD-ROM We identify methods of distinguishing between correctly and incorrectly recognized utterances (scored by hand for semantic concept accuracy) for a speech recognition system, using acoustic/prosodic characteristics. The analysis was performed on data collected during independent experiments done with an interactive voice response system that provides travel information over the phone.

Papers accepted for publication

MS.1242 T.D. Freudenthal Age differences in the performance of information retrieval tasks To appear in: Behaviour & Information Technology

MS.1281 M.M. Bekker and J.B. Long User involvement in the design of human-computer interactions: Some similarities and differences between design approaches To appear in: People and Computers: Proceedings of the Conference of the British Computer Society, HCI 2000, Human-Computer Interaction Specialist Group 15, Sunderland, UK, September 5-8, 2000

MS.1285 T.J.W.M. Janssen and F.J.J. Blommaert Predicting the usefulness and naturalness of colour reproductions To appear in: Journal of Imaging Science and Technology

MS.1293 W.A. IJsselsteijn, H. de Ridder and J. Vliegen Subjective evaluation of stereoscopic images: Effects of camera parameters and display duration To appear in: IEEE Transactions on Circuits and Systems for Video Technology

MS.1295 J.B. Martens and M.C. Boschman The psychophysical measurement of image quality To appear in: Vision Models

MS.1296 J.B. Martens and L.M.J. Meesters Single-ended instrumental measurement of image quality To appear in: Vision Models

MS.1298 J. Freeman, S.E. Avons, R. Meddis, D.E. Pearson and W.A. IJsselsteijn Using behavioural realism to estimate presence: A study of the utility of postural responses to motion stimuli To appear in: Presence: Teleoperators and Virtual Environments

MS.1300 A. Kohlrausch and T. Dau Auditory processing of amplitude modulation To appear in: Proceedings of the 18th Danavox Symposium, Kolding, Denmark, September 7-10, 1999

MS.1302 S.C. Pauws, D.G. Bouwhuis and J.H. Eggen Programming and enjoying music with your eyes closed To appear in: Proceedings of the CHI 2000 Conference on Human Factors in Computing Systems, The Hague, The Netherlands, April 1-6, 2000

MS.1304 P. Markopoulos Interactors: Formal architectural models of user interface software To appear in: Encyclopedia of Microcomputers

MS.1305 T.J.W.M. Janssen and F.J.J. Blommaert An information-processing approach to image quality To appear in: Proceedings of the SPIE Conference: Human Vision and Electronic Imaging V, San Jose, CA, USA, January 24-27, 2000

MS.1308 S.N. Yendrikhovskij Computing color categories To appear in: Proceedings of the SPIE Conference: Human Vision and Electronic Imaging V, San Jose, CA, USA, January 24-27, 2000

MS.1309 D.G. Bouwhuis Parts of life: Configuring equipment to individual life style To appear in: Ergonomics

MS.1310 W.A. IJsselsteijn, H. de Ridder, J. Freeman and S.E. Avons Presence: Concept, determinants and measurement To appear in: Proceedings of SPIE: Human Vision and Electronic Imaging V, San Jose, CA, USA, January 24-27, 2000

MS.1311 W.A. IJsselsteijn, H. de Ridder and J. Vliegen Effects of stereoscopic filming parameters and display duration on the subjective assessment of eye strain To appear in: Proceedings of SPIE: Stereoscopic Displays and Applications XI, San Jose, CA, USA, January 24-27, 2000

MS.1313 L.M.J. Meesters and J.B. Martens A single-ended blockiness measure for JPEG-coded images To appear in: Signal Processing

MS.1314 L.M.J. Meesters and J.B. Martens Influence of processing method, bit-rate and scene content on perceived and predicted image quality To appear in: Proceedings of SPIE: Human Vision and Electronic Imaging V, San Jose, CA, USA, January 24-27, 2000

MS.1315 S. Subramanian and A. Makur Predictive absolute moment block truncation coding for image compression To appear in: Proceedings of SPIE: Visual Communication and Image Processing 2000, Perth, Australia, June 20-25, 2000

MS.1317 W.A. IJsselsteijn, J. Freeman, H. de Ridder, S.E. Avons and D. Pearson Towards an objective corroborative measure of presence: Postural responses to moving video To appear in: Proceedings of the 3rd International Workshop on Presence, Delft, The Netherlands, March 27-28, 2000

Reprints and preprints of IPO Publications Single copies of material from this issue of the IPO Annual Progress Report may be made for personal, non-commercial use. Permission to make multiple copies must be obtained from IPO, Center for User-System Interaction. Illustrations may be used only with explicit mentioning of the source.

Colophon The following persons contributed to the production and distribution of this issue of the IPO Annual Progress Report:

Editors D.G. Bouwhuis F.L. van Nes A.J.M. Houtsma R.N.J. Veldhuis

Design J.A. Pellegrino van Stuyvenberg

Coordination, typing, and distribution U. van Amelsfort

Language correction S.R.G. Ralston Centre for Language and Technology Eindhoven University of Technology

Printing University Press Facilities Eindhoven University of Technology

ISSN 0921-2566
